6 · Mistral AI News · 1mo ago

Mistral AI Engineering Deep Dive: Debugging a Memory Leak in vLLM

Mistral AI's engineering team investigated a memory leak in vLLM that appeared exclusively during disaggregated prefill/decode serving with Mistral Medium 3.1 and graph compilation enabled, causing ~400 MB/min RSS growth. The leak was not visible in heap profilers (Memray, Guppy3, Heaptrack), pointing to off-heap memory allocation tied to NIXL/UCX-based KV cache transfer over InfiniBand. The post is the first in a new Engineering Deep Dive series and documents a methodical descent from Python-level tools to kernel-level tracing to isolate the root cause.
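The contrast between growing RSS and a flat heap profile can be illustrated with a minimal sketch. This is not the tooling from the post: it assumes Linux (`/proc/self/status`), uses the standard-library `tracemalloc` as a stand-in for the heap profilers named above, and the helper names (`Sample`, `read_rss_bytes`, `untracked_growth`) are hypothetical. Because `tracemalloc` only sees allocations made through Python's allocator, the gap it reports covers all native memory, not just the transfer buffers discussed in the post.

```python
import tracemalloc
from dataclasses import dataclass


@dataclass
class Sample:
    rss: int      # resident set size of the process, in bytes
    py_heap: int  # bytes currently tracked by tracemalloc


def read_rss_bytes(status_path: str = "/proc/self/status") -> int:
    """Read the current RSS from a Linux /proc status file."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # The kernel reports this field in kB.
                return int(line.split()[1]) * 1024
    raise RuntimeError(f"VmRSS not found in {status_path}")


def take_sample() -> Sample:
    """Snapshot RSS and Python-heap usage (call tracemalloc.start() first)."""
    current, _peak = tracemalloc.get_traced_memory()
    return Sample(rss=read_rss_bytes(), py_heap=current)


def untracked_growth(first: Sample, last: Sample) -> int:
    """RSS growth, in bytes, that Python-heap growth does not explain."""
    return (last.rss - first.rss) - (last.py_heap - first.py_heap)
```

Calling `tracemalloc.start()`, taking one sample before and one after a serving interval, and passing both to `untracked_growth` gives a rough signal: a large positive value alongside a flat `py_heap` suggests the growth lives outside what a Python-level profiler can see.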

Related events (8)

6 · arXiv · cs.CL · 23d ago

MemTrace: Framework for Tracing and Attributing Errors in LLM Memory Systems

MemTrace introduces a framework that converts LLM memory pipelines into executable memory evolution graphs to enable fine-grained error tracing and root-cause attribution. The authors construct MemTraceBench, a benchmark covering Long-Context, RAG, Mem0, and EverMemOS memory systems, to systematically characterize memory failure modes such as information loss and retrieval misalignment. An automatic attribution method iteratively traces operation subgraphs to pinpoint failures, and the resulting signals are used to guide prompt optimization in a closed-loop system that improves end-task performance by up to 7.62%.

7 · Mistral AI News · 19d ago

Mistral Small 3: 24B Latency-Optimized Open-Weight Model Released Under Apache 2.0

Mistral AI has released Mistral Small 3, a 24B-parameter instruction-tuned model optimized for low latency, achieving over 81% on MMLU at 150 tokens/s on a single GPU. The model is competitive with Llama 3.3 70B and Qwen 32B while being more than 3x faster on equivalent hardware, and is released under Apache 2.0 for both pretrained and instruction-tuned checkpoints. It is explicitly not trained with RL or synthetic data, positioning it as a base model for community fine-tuning and reasoning capability development. Deployment targets include local inference on consumer hardware (RTX 4090, MacBook 32GB RAM), agentic function calling, and domain-specific fine-tuning.

8 · Mistral AI News · 1mo ago

Mistral Launches Medium 3.5 (128B Open Weights), Remote Cloud Coding Agents in Vibe, and Work Mode in Le Chat

Mistral AI has released Mistral Medium 3.5, a 128B dense open-weights model with a 256k context window, configurable reasoning effort, and a vision encoder trained from scratch, scoring 77.6% on SWE-Bench Verified. Alongside the model, Mistral is launching remote cloud-based coding agents in its Vibe CLI and Le Chat interface, enabling async parallel coding sessions that run independently and notify users on completion. A new Work mode in Le Chat provides a multi-step agentic interface for cross-tool workflows including email, calendar, research, and issue tracking. Mistral Medium 3.5 replaces Devstral 2 as the default model in both Le Chat and the Vibe CLI, and is available for self-hosting on as few as four GPUs under a modified MIT license.

8 · Mistral AI News · 1mo ago

Mistral Small 4: Unified Multimodal, Reasoning, and Coding MoE Model Released Under Apache 2.0

Mistral AI has released Mistral Small 4, a 119B-parameter Mixture-of-Experts model (6B active per token) that unifies capabilities previously split across Magistral (reasoning), Pixtral (multimodal), and Devstral (coding agents) into a single open-weights model. The model features a 256k context window, configurable reasoning effort via a `reasoning_effort` parameter, native text and image input support, and is released under Apache 2.0. Mistral claims 40% latency reduction and 3x throughput improvement over Mistral Small 3, with benchmark results showing competitive performance against GPT-OSS 120B and Qwen models while producing significantly shorter outputs. The release includes day-0 availability as an NVIDIA NIM and support across vLLM, llama.cpp, SGLang, and Transformers.

6 · Mistral AI News · 19d ago

Mistral AI Publishes First Comprehensive Lifecycle Analysis of LLM Environmental Footprint

Mistral AI has released what it claims is the first comprehensive lifecycle analysis (LCA) of an AI model, conducted in collaboration with Carbone 4 and the French agency ADEME, covering greenhouse gas emissions, water use, and resource depletion. Over 18 months of training and usage, Mistral Large 2 accounted for 20.4 ktCO₂e of emissions, 281,000 m³ of water consumption, and 660 kg Sb eq of resource depletion; a single 400-token Le Chat response costs 1.14 gCO₂e and 45 mL of water. The study proposes three standardized reporting indicators for the industry and advocates mandatory disclosure of the environmental impacts of training and inference. Mistral argues that environmental footprint scales roughly linearly with model size, which makes right-sizing the model for each use case important.

7 · Mistral AI News · 19d ago

Mistral Medium 3: Frontier-Class Performance at 8x Lower Cost

Mistral AI has released Mistral Medium 3, a new enterprise-focused language model priced at $0.4 per million input tokens and $2 per million output tokens. Mistral claims the model reaches 90% or more of Claude Sonnet 3.7's benchmark performance while undercutting cost leaders like DeepSeek v3 and outperforming open models including Llama 4 Maverick. It supports hybrid, on-premises, and in-VPC deployment on as few as four GPUs, and is available immediately on Mistral La Plateforme and Amazon SageMaker, with additional cloud platforms coming soon. The announcement also teases an upcoming large open-weights model release.
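To make the quoted pricing concrete, here is a small arithmetic sketch. The two rates come from the summary above; the function name and the example token counts are illustrative assumptions, and the calculation ignores any discounts, caching, or batch pricing that may apply in practice.

```python
# Rates quoted in the announcement summary, in USD per million tokens.
INPUT_USD_PER_MTOK = 0.4
OUTPUT_USD_PER_MTOK = 2.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the list-price cost of one request in USD."""
    total = (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    )
    return total / 1_000_000


# Hypothetical workload: a 3,000-token prompt with a 500-token reply.
example_cost = request_cost_usd(3_000, 500)
```

Because output tokens are priced at five times the input rate, the reply length tends to dominate the cost of short-prompt workloads.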

5 · Mistral AI News · 19d ago

Mistral Launches Le Chat Memory (Beta) with Transparency and User-Control Design Principles

Mistral AI has released a beta memory system for its Le Chat assistant, featuring automatic storage with smart, visible recall and source citations. The system is built around three principles—transparency, agency, and sovereignty—allowing users to view, edit, delete, export, and import memories. Under the hood, Mistral uses a graph-based architecture to improve context-awareness over time. A companion feature called Memory Insights surfaces trends and summaries derived from a user's stored data.

7 · Mistral AI News · 19d ago

Mistral AI Releases Devstral: Apache 2.0 Agentic Coding Model with SWE-Bench SOTA

Mistral AI, in collaboration with All Hands AI, has released Devstral, an agentic LLM specialized for software engineering tasks, under the Apache 2.0 license. The model achieves 46.8% on SWE-Bench Verified, surpassing the prior open-source state of the art by over 6 percentage points and outperforming larger models like DeepSeek-V3-0324 (671B) and Qwen3 235B-A22B under the same OpenHands scaffold. Devstral is small enough to run on a single RTX 4090 or a Mac with 32GB RAM, and is available via Mistral's API at $0.1/M input tokens, as well as on HuggingFace, Ollama, and other platforms. Mistral indicates that a larger agentic coding model is in development.