Entity · product

vLLM

product · active · vllm-25792bfb · 21 events · first seen May 18, 2026

Aliases: vLLM

More like this (12)

LLM Eval · LLM · whichllm · LLM (CLI tool) · LLM CLI · LLM evaluation · SpeechLLM · StreamingLLM · Arabic LLMs · LLM-as-a-Judge · LLM inference · SVD-LLM

Recent events (21)

4 · GitHub Trending · Jul 22, 2026

NVIDIA Model-Optimizer: unified library for quantization, pruning, distillation, and inference optimization

NVIDIA's Model-Optimizer is an open-source Python library consolidating state-of-the-art model compression techniques including quantization, distillation, pruning, neural architecture search, and speculative decoding. It targets downstream deployment on TensorRT-LLM, TensorRT, and vLLM to improve inference speed. The repository has accumulated 3,286 stars with modest recent activity (+8 today).

Training Infrastructure Inference Economics speculative decoding NVIDIA Model-Optimizer NVIDIA +2 more

7 · The Batch · Jul 8, 2026

GPT-5.6 wider API release imminent after government delay; roundup covers Microsoft MAI shift, Claude Cowork mobile, Nvidia Audex, OpenAI mini voice

OpenAI's GPT-5.6 models are set for broader API release following a Department of Commerce-approved safety review that delayed launch for weeks; GPT-5.6 Sol Ultra scores 91.9% on TerminalBench 2.1 versus Claude Mythos 5 at 88%, with pricing roughly half of Anthropic's comparable tier. Microsoft is actively replacing OpenAI and Anthropic models in Excel, Outlook, and Teams with its internally built MAI models to reduce third-party dependency as its OpenAI discount partnership nears expiration. Anthropic expanded Claude Cowork to web and mobile for Max plan subscribers, with usage data from 1.2 million sessions showing over 90% of use is non-developer work. Nvidia released Audex, a 30B MoE audio-text model that avoids the typical 'text tax' of multimodal models, shipping under a noncommercial license.

Frontier Model Releases Inference Economics Claude Mythos Center for AI Standards and Innovation Microsoft +19 more

6 · Hugging Face Blog · Jul 8, 2026

Hugging Face introduces native-speed vLLM transformers modeling backend

Hugging Face announced a native-speed backend integration between vLLM and the Transformers library, enabling vLLM to use Transformers model implementations directly at native inference speed. This removes the need to maintain separate model code in vLLM, broadening model coverage and simplifying the ecosystem. The integration is significant for practitioners deploying open-weights models at scale, as it reduces friction between the two dominant open-source inference stacks.

Inference Economics Agent and Tool Ecosystem Transformers Hugging Face vLLM

4 · GitHub Trending · Jul 5, 2026

nano-vllm: Lightweight Python reimplementation of vLLM gains traction on GitHub

nano-vllm is a minimal Python reimplementation of the vLLM inference engine, accumulating over 14,000 GitHub stars. The project appears aimed at educational or lightweight deployment use cases where the full vLLM stack is too heavy. Its trending status signals community interest in understanding or simplifying LLM inference serving.

Inference Economics Agent and Tool Ecosystem nano-vllm vLLM

4 · Hugging Face Blog · Jun 25, 2026

Hugging Face integrates vLLM server deployment into HF Jobs with single command

Hugging Face published a guide showing how to launch a vLLM inference server using HF Jobs in a single command. The integration simplifies self-hosted LLM inference deployment on Hugging Face infrastructure. This lowers the operational barrier for practitioners who want managed, scalable vLLM serving without custom orchestration.

Inference Economics Agent and Tool Ecosystem HF Jobs Hugging Face vLLM

8 · The Batch · Jun 12, 2026

Anthropic launches Claude Mythos 5 and Claude Fable 5; Andrew Ng introduces OpenCoworker desktop agent

Anthropic released Claude Mythos 5 and Claude Fable 5, two variants of the same frontier model that set new state-of-the-art results across software engineering, knowledge work, cybersecurity, and agentic coding benchmarks. Claude Fable 5 is the general-availability version with safety classifiers that restrict responses on security, biology, chemistry, and cutting-edge AI topics, priced at $10/$50 per million input/output tokens; Mythos 5 is restricted to selected partners via Project Glasswing. Separately, Andrew Ng and collaborators released OpenCoworker, a free open-source desktop agent harness built on top of aisuite, designed to give users privacy-preserving agentic workflows with their own API keys or local models. The newsletter also contextualizes the broader shift toward LLM-driven agent harnesses as frontier models have become capable enough to reliably drive next-action decisions.

Frontier Model Releases AI Safety Research Ollama DeepLearning.AI Claude Mythos +13 more

6 · The Batch · Jun 5, 2026

The Batch Issue 356: Qwen3.7-Max release, White House AI executive order, fine-tuning breaks copyright alignment

The Batch issue 356 covers several distinct AI developments: Alibaba's release of Qwen3.7-Max, a closed-weights flagship LLM targeting agentic coding and scientific tasks with a novel RL training approach that decouples task, harness, and verifier; a new White House executive order on frontier AI models focused on cybersecurity, including voluntary model-sharing with government; and a finding that fine-tuning breaks copyright alignment in LLMs. Andrew Ng's editorial commentary frames the executive order as a reasonable compromise, noting Anthropic's Mythos vulnerability-detection model as a key driver of the cybersecurity concerns behind the regulation.

Frontier Model Releases AI Safety Research Qwen3.7-Plus-Preview DeepLearning.AI Artificial Analysis Intelligence Index +9 more

4 · GitHub Trending · Jun 5, 2026

vllm-omni: framework for efficient inference with omni-modality models

The vllm-project has published vllm-omni, a Python framework extending vLLM's inference capabilities to omni-modality models. The repository has accumulated ~4,956 GitHub stars. It represents an expansion of the vLLM ecosystem into multimodal inference serving.

Inference Economics Multimodal Progress vllm-project vllm-omni vLLM

7 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Mistral Small v24.09, Free API Tier, and Pixtral 12B Vision on le Chat with Broad Price Cuts

Mistral AI announced a multi-part release on September 17, 2024: a free tier for la Plateforme API, significant price reductions across its model family (up to 80% for Mistral Small and Codestral), an updated Mistral Small v24.09 (22B parameters, improved alignment and reasoning), and the availability of Pixtral 12B vision capabilities on le Chat. Pixtral 12B, released under Apache 2.0, supports images of any size without text performance degradation and is now accessible for free on le Chat. The pricing updates also apply to cloud partner deployments on Azure AI Studio, Amazon Bedrock, and Google Vertex AI.

Frontier Model Releases Open Weights Progress Mistral AI Amazon Bedrock Apache 2.0 +14 more

9 · Mistral AI News · Jun 1, 2026

Mixtral 8x7B: Mistral AI Releases Sparse Mixture-of-Experts Open-Weight Model

Mistral AI has released Mixtral 8x7B, a sparse mixture-of-experts (SMoE) model with 46.7B total parameters but only 12.9B active parameters per token, enabling inference speed and cost equivalent to a 12.9B model. Licensed under Apache 2.0, Mixtral outperforms Llama 2 70B on most benchmarks and matches or exceeds GPT-3.5, with support for 32k context, five European languages, and strong code generation. An instruction-tuned variant (Mixtral 8x7B Instruct) achieves 8.3 on MT-Bench, claimed best among open-source models at release. The model is deployed behind Mistral's mistral-small API endpoint and supported via vLLM with Megablocks CUDA kernels.

Frontier Model Releases Evaluation and Benchmarking Mistral AI Llama 2 70B Mistral Small 4 +15 more
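The gap between total and active parameters in the Mixtral summary above comes from sparse routing: each token only runs through a few experts. A minimal sketch of that idea follows, assuming top-2 routing over 8 experts; the scalar "experts", the router logits, and the function names are hypothetical toys, not Mistral's or vLLM's code.

```python
import math

def route_top_k(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[e] for e in top)  # subtract max for stability
    exps = {e: math.exp(router_logits[e] - peak) for e in top}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}

def moe_forward(x: float, experts, router_logits: list[float], k: int = 2) -> float:
    """Weighted sum over the selected experts only; the rest are never called."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[e](x) for e, w in weights.items())

# Hypothetical toy setup: expert s simply scales its input by s.
experts = [lambda x, s=s: s * x for s in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]

weights = route_top_k(router_logits)
output = moe_forward(2.0, experts, router_logits)

# Share of parameters touched per token, using the figures quoted in the summary.
active_fraction = 12.9 / 46.7
print(weights, output, round(active_fraction, 3))
```

With these toy logits, experts 1 and 4 tie for the top score and split the weight evenly, so only two of the eight expert functions run. The same reasoning explains why per-token compute tracks the 12.9B active figure (roughly 28% of the total) rather than 46.7B.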

7 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Ministral 3B and 8B Edge Models

Mistral AI has introduced two new small language models, Ministral 3B and Ministral 8B, targeting on-device and edge computing use cases. Both models support up to 128k context length and claim state-of-the-art performance in the sub-10B parameter category, outperforming comparable models from Google and Meta on internal benchmarks. Ministral 8B features an interleaved sliding-window attention mechanism for memory-efficient inference and is priced at $0.1/M tokens via API, while Ministral 3B is priced at $0.04/M tokens. Weights for Ministral 8B Instruct are available for research use, with commercial licensing available on request.

Long Context Evolution Frontier Model Releases Mistral AI Gemma 2 9B Ministral 8B +12 more

8 · Mistral AI News · Jun 1, 2026

Mistral 7B: Open-Weights 7B Model Outperforming Llama 2 13B

Mistral AI released Mistral 7B, a 7.3B parameter language model under the Apache 2.0 license that outperforms Llama 2 13B across all evaluated benchmarks and approaches Llama 1 34B on many tasks. The model employs Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at reduced cost, achieving roughly 2x speed improvement at 16k sequence length. A fine-tuned chat variant, Mistral 7B Instruct, outperforms all 7B chat models on MT-Bench and is competitive with 13B-class chat models. The release includes deployment support for AWS, GCP, Azure, Hugging Face, and local use via vLLM.

Long Context Evolution Frontier Model Releases Mistral AI MT-Bench Mistral 7B Instruct v0.2 +13 more
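Sliding Window Attention, mentioned in the Mistral 7B summary above, can be illustrated with the attention mask alone. The sketch below builds a causal mask limited to the most recent `window` positions and counts how many query-key pairs remain; the function names and the tiny sizes are illustrative assumptions, not Mistral's implementation.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    A position sees itself and at most (window - 1) earlier positions.
    """
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs that are actually scored."""
    return sum(sum(row) for row in mask)

full = sliding_window_mask(8, 8)  # window >= seq_len reduces to plain causal attention
swa = sliding_window_mask(8, 3)   # each token sees at most 3 positions

print(attended_pairs(full), attended_pairs(swa))
```

Full causal attention scores a number of pairs that grows quadratically with sequence length, while the windowed mask grows linearly once the sequence exceeds the window, which is where the cost reduction on long sequences comes from.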

4 · GitHub Trending · May 24, 2026

earendil-works/pi: AI Agent Toolkit with Coding Agent CLI, Unified LLM API, and Multi-UI Libraries

The earendil-works/pi repository is an open-source TypeScript toolkit providing a coding agent CLI, unified LLM API abstraction, TUI and web UI libraries, a Slack bot integration, and vLLM pod support. It has accumulated 53,875 GitHub stars with 444 new stars today, indicating significant community traction. The project spans multiple components of the agent-tool ecosystem including inference backends and developer-facing interfaces.

Inference Economics Agent and Tool Ecosystem Slack earendil-works/pi vLLM

6 · arXiv cs.AI · May 21, 2026

PALS: Power-Aware LLM Serving Runtime for MoE and Dense Models

PALS is a power-aware inference runtime integrated into vLLM that treats GPU power caps as a first-class scheduling parameter alongside batch size and parallelism settings. Using lightweight offline power-performance models and a feedback-driven controller, it jointly optimizes energy efficiency and throughput targets without model retraining or API changes. Across multi-GPU deployments with both dense and MoE models, PALS achieves up to 26.3% energy efficiency improvement and reduces QoS violations by 4-7x under power constraints, enabling energy-proportional and grid-interactive AI serving.

Training Infrastructure Inference Economics PALS Mixture of Experts GPU power capping +2 more

3 · GitHub Trending · May 20, 2026

vLLM: High-Throughput LLM Inference and Serving Engine Trending on GitHub

vLLM is an open-source Python library providing high-throughput and memory-efficient inference and serving for large language models. The project has accumulated over 80,500 GitHub stars with 98 new stars today, indicating continued strong community interest. It is a widely adopted inference backend in the AI/ML ecosystem, supporting PagedAttention and various optimization techniques for LLM deployment.

Inference Economics Agent and Tool Ecosystem vllm-project vLLM
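The PagedAttention idea named in the summary above stores each sequence's KV cache in fixed-size blocks that need not be contiguous, tracked through a per-sequence block table. The following is a toy sketch of that bookkeeping only, with a hypothetical `PagedKVCache` class; it does not reproduce vLLM's actual classes or kernels.

```python
class PagedKVCache:
    """Toy block-table allocator: logical KV blocks map to physical blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")        # sequence "a" grows to 3 tokens
cache.append_token("b")            # sequence "b" takes the next free block
slot = cache.append_token("a")     # 4th token of "a" fits in its second block
tables_before_free = {k: list(v) for k, v in cache.block_tables.items()}
cache.free("a")
print(tables_before_free, slot, cache.free_blocks)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and blocks freed by a finished request are immediately reusable by others.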

6 · Hugging Face Blog · May 19, 2026

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Hugging Face's Text Generation Inference (TGI) now supports multiple inference backends, including NVIDIA TensorRT-LLM and vLLM, in addition to its native backend. This allows users to select the most appropriate backend for their hardware and workload without leaving the TGI ecosystem. The announcement positions TGI as a unified serving layer that abstracts over competing inference runtimes, potentially simplifying enterprise deployment workflows.

Inference Economics Enterprise Deployment Patterns Text Generation Inference NVIDIA TensorRT-LLM +3 more

5 · Hugging Face Blog · May 19, 2026

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

Hugging Face's TRL library now supports co-locating vLLM inference alongside training on the same GPUs, eliminating the idle GPU problem that arises when separate inference and training processes alternate. This approach allows reinforcement learning from human feedback (RLHF) and online RL training pipelines to use GPUs continuously rather than leaving them idle during generation or gradient update phases. The integration targets efficiency gains in online RL training workflows such as GRPO and PPO, where generation and training steps previously required dedicated, alternating GPU allocations.

Training Infrastructure Inference Economics GRPO PPO Hugging Face +4 more

8 · Mistral AI News · May 18, 2026

Mistral Releases Mistral 3 Family: Mistral Large 3 (675B MoE) and Ministral 3 Series (3B–14B), All Apache 2.0

Mistral AI has announced Mistral 3, a family of open-weight models including Mistral Large 3 (41B active / 675B total sparse MoE) and three dense Ministral 3 edge models (3B, 8B, 14B), all released under Apache 2.0. Mistral Large 3 debuts at #2 on LMArena's OSS non-reasoning leaderboard, supports image understanding, and was trained on 3,000 NVIDIA H200 GPUs; a reasoning variant is forthcoming. The Ministral 3 series includes base, instruct, and reasoning variants with multimodal and multilingual capabilities, with the 14B reasoning model achieving 85% on AIME '25. The release involves deep co-optimization with NVIDIA (Blackwell/Hopper kernels, NVFP4 format), vLLM, and Red Hat, and is available across major cloud and inference platforms.

Training Infrastructure Frontier Model Releases Mistral AI Amazon Bedrock Red Hat +16 more

8 · Mistral AI News · May 18, 2026

Mistral Small 4: Unified Multimodal, Reasoning, and Coding MoE Model Released Under Apache 2.0

Mistral AI has released Mistral Small 4, a 119B-parameter Mixture-of-Experts model (6B active per token) that unifies capabilities previously split across Magistral (reasoning), Pixtral (multimodal), and Devstral (coding agents) into a single open-weights model. The model features a 256k context window, configurable reasoning effort via a `reasoning_effort` parameter, native text and image input support, and is released under Apache 2.0. Mistral claims 40% latency reduction and 3x throughput improvement over Mistral Small 3, with benchmark results showing competitive performance against GPT-OSS 120B and Qwen models while producing significantly shorter outputs. The release includes day-0 availability as an NVIDIA NIM and support across vLLM, llama.cpp, SGLang, and Transformers.

Long Context Evolution Frontier Model Releases Mistral AI Mistral Small 4 Pixtral +14 more

6 · Mistral AI News · May 18, 2026

Mistral AI Engineering Deep Dive: Debugging a Memory Leak in vLLM

Mistral AI's engineering team investigated a memory leak in vLLM that appeared exclusively during disaggregated prefill/decode serving with Mistral Medium 3.1 and graph compilation enabled, causing ~400 MB/min RSS growth. The leak was not visible in heap profilers (Memray, Guppy3, Heaptrack), pointing to off-heap memory allocation tied to NIXL/UCX-based KV cache transfer over InfiniBand. The post is the first in a new Engineering Deep Dive series and documents a methodical descent from Python-level tools to kernel-level tracing to isolate the root cause.

Training Infrastructure Inference Economics Mistral AI Prefill/Decode Disaggregation Mistral-medium +7 more
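A growth rate like the ~400 MB/min RSS figure in the summary above is typically estimated by sampling a process's resident set size over time and fitting a line. The sketch below shows one way to do that with a least-squares slope; the function name and the sample values are synthetic illustrations, not measurements or tooling from the Mistral post.

```python
def rss_growth_mb_per_min(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (elapsed_seconds, rss_mb) samples, in MB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var * 60.0  # MB per second -> MB per minute

# Synthetic samples taken every 30 seconds (hypothetical values).
samples = [(0.0, 10_000.0), (30.0, 10_200.0), (60.0, 10_400.0), (90.0, 10_600.0)]
rate = rss_growth_mb_per_min(samples)
print(f"{rate:.1f} MB/min")
```

A steady positive slope in RSS while heap profilers report flat usage is the kind of mismatch that points toward off-heap allocations, as the summary describes.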

4 · Hugging Face Blog · May 18, 2026

vLLM V0 to V1: Correctness Before Corrections in RL

A ServiceNow AI blog post on Hugging Face discusses lessons learned migrating reinforcement learning training pipelines from vLLM V0 to V1. The piece focuses on correctness issues encountered during the transition and how they were diagnosed and resolved before applying RL corrections. This is relevant to practitioners using vLLM as an inference backend for RL-based LLM training workflows.

Inference Economics Agent and Tool Ecosystem ServiceNow AI Reinforcement Learning from Human Feedback vLLM +1 more