Prefill/Decode Disaggregation
prefill-decode-disaggregation-9725723e · 2 events · first seen 1mo ago
Aliases: Prefill/Decode Disaggregation, prefill-decode disaggregation
Recent events (2)
Mistral AI Engineering Deep Dive: Debugging a Memory Leak in vLLM
Mistral AI's engineering team investigated a memory leak in vLLM that appeared only when serving Mistral Medium 3.1 with disaggregated prefill/decode and graph compilation enabled. The process's resident set size (RSS) grew by roughly 400 MB/min. The leak did not show up in heap profilers (Memray, Guppy3, Heaptrack), which pointed to off-heap memory allocation tied to NIXL/UCX-based KV cache transfer over InfiniBand. The post is the first in a new Engineering Deep Dive series and documents a methodical descent from Python-level tools to kernel-level tracing to isolate the root cause.
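The reasoning in this summary (RSS grows quickly while heap profilers report nothing unusual, so the growth is probably off-heap) can be sketched as a small check. This is a minimal illustration, not the method used in the post: the `Sample` type, the `growth_rate` and `looks_off_heap` helpers, and both thresholds are hypothetical, and the sample values are invented, with only the ~400 MB/min rate taken from the summary.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: float   # minutes since the start of observation
    rss_mb: float   # process resident set size, in MB
    heap_mb: float  # memory a heap profiler can account for, in MB


def growth_rate(samples, field):
    """Average growth in MB/min between the first and last sample."""
    first, last = samples[0], samples[-1]
    elapsed = last.minute - first.minute
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    return (getattr(last, field) - getattr(first, field)) / elapsed


def looks_off_heap(samples, min_rss_rate=50.0, max_heap_share=0.1):
    """Heuristic: RSS grows fast while the tracked heap explains little of it.

    Both thresholds are arbitrary assumptions chosen for illustration.
    """
    rss_rate = growth_rate(samples, "rss_mb")
    heap_rate = growth_rate(samples, "heap_mb")
    return rss_rate >= min_rss_rate and heap_rate <= max_heap_share * rss_rate


# Invented readings: RSS climbs 2000 MB in 5 minutes, tracked heap barely moves.
leaky = [Sample(0, 10_000, 2_000), Sample(5, 12_000, 2_010)]
# Invented contrast case: all of the RSS growth is visible to the heap profiler.
heap_bound = [Sample(0, 1_000, 500), Sample(10, 2_000, 1_500)]

print(growth_rate(leaky, "rss_mb"), looks_off_heap(leaky))
print(growth_rate(heap_bound, "rss_mb"), looks_off_heap(heap_bound))
```

A check like this only narrows the search. Attributing off-heap growth to a specific native library still requires lower-level tracing of the kind the post describes.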
Prefill and Decode for Concurrent Requests - Optimizing LLM Performance
This Hugging Face blog post from TNG Technology Consulting examines how prefill and decode phases interact under concurrent request loads in LLM serving systems. It analyzes performance bottlenecks that arise when multiple requests share GPU resources, covering throughput-latency tradeoffs and optimization strategies. The piece targets practitioners deploying LLMs at scale who need to understand scheduling and batching behavior.