
TensorRT-LLM
tensorrt-llm-555d11d0·6 events·first seen 1mo agoAliases: TensorRT-LLM
Co-occurring entities
More like this (12)
Recent events (6)
Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference
Hugging Face's Text Generation Inference (TGI) now supports multiple inference backends, including NVIDIA TensorRT-LLM and vLLM, in addition to its native backend. This allows users to select the most appropriate backend for their hardware and workload without leaving the TGI ecosystem. The announcement positions TGI as a unified serving layer that abstracts over competing inference runtimes, potentially simplifying enterprise deployment workflows.
Optimum-NVIDIA: One-Line LLM Inference Acceleration via TensorRT-LLM
Hugging Face's Optimum-NVIDIA integration wraps NVIDIA's TensorRT-LLM backend to enable high-performance LLM inference with minimal code changes. The library targets developers who want near-peak GPU throughput without manually configuring TensorRT-LLM pipelines. It positions as a bridge between the Hugging Face ecosystem and NVIDIA's optimized inference stack.
Accelerate a World of LLMs on Hugging Face with NVIDIA NIM
NVIDIA NIM microservices are being integrated with Hugging Face to enable optimized inference deployment for a broad range of LLMs hosted on the Hub. The partnership allows developers to deploy Hugging Face models via NIM's containerized inference stack, leveraging NVIDIA's TensorRT-LLM and other optimizations. This expands the ecosystem of models accessible through NIM beyond NVIDIA's own catalog to the wider Hugging Face model repository.
Mistral AI Launches La Plateforme: First API Endpoints in Early Access
Mistral AI opened beta access to its first developer platform, La Plateforme, offering three generative text endpoints (mistral-tiny, mistral-small, mistral-medium) and an embedding endpoint. Mistral-tiny serves Mistral 7B Instruct v0.2, mistral-small serves Mixtral 8x7B, and mistral-medium serves an unreleased prototype model scoring 8.6 on MT-Bench. The platform also introduces Mistral-embed with a 1024-dimension embedding model achieving 55.26 on MTEB. The API follows OpenAI-compatible chat interface specifications and is ramping toward general availability.
Mistral Releases Mistral 3 Family: Mistral Large 3 (675B MoE) and Ministral 3 Series (3B–14B), All Apache 2.0
Mistral AI has announced Mistral 3, a family of open-weight models including Mistral Large 3 (41B active / 675B total sparse MoE) and three dense Ministral 3 edge models (3B, 8B, 14B), all released under Apache 2.0. Mistral Large 3 debuts at #2 on LMArena's OSS non-reasoning leaderboard, supports image understanding, and was trained on 3,000 NVIDIA H200 GPUs; a reasoning variant is forthcoming. The Ministral 3 series includes base, instruct, and reasoning variants with multimodal and multilingual capabilities, with the 14B reasoning model achieving 85% on AIME '25. The release involves deep co-optimization with NVIDIA (Blackwell/Hopper kernels, NVFP4 format), vLLM, and Red Hat, and is available across major cloud and inference platforms.
Codestral Mamba: Mistral AI Releases Apache 2.0 Mamba-Architecture Code Model
Mistral AI has released Codestral Mamba, a 7.3B-parameter code-focused language model built on the Mamba state-space architecture rather than the Transformer architecture. The model offers linear-time inference and theoretically infinite sequence length, tested up to 256k tokens in-context retrieval. Developed with Mamba co-creators Albert Gu and Tri Dao, it is released under Apache 2.0 and available via HuggingFace, mistral-inference SDK, TensorRT-LLM, and Mistral's la Plateforme API. Mistral positions it as a local code assistant that performs on par with state-of-the-art transformer-based code models.