Multimodal Progress

active · multimodal-progress · 522 events · last 2d ago

Vision-language models, audio/speech models, video understanding and generation, and the unification (or not) of modalities in single architectures.

Recent events (50)

5arXiv · cs.LG·1mo ago·source ↗

RefDecoder: Reference-Conditioned Video VAE Decoder for Enhanced Visual Generation

RefDecoder addresses an architectural asymmetry in latent diffusion models where denoising networks are heavily conditioned but decoders remain unconditional, causing detail loss and inconsistency. The approach injects high-fidelity reference image signals into the VAE decoding process via reference attention, with a lightweight image encoder mapping reference frames into high-dimensional tokens co-processed at each decoder up-sampling stage. Evaluated on Inter4K, WebVid, and Large Motion benchmarks, RefDecoder achieves up to +2.1dB PSNR over unconditional baselines and improves VBench I2V scores across subject consistency, background consistency, and overall quality. The module is plug-and-play, compatible with existing video generation systems including Wan 2.1 and VideoVAE+ without additional fine-tuning.
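
To put the reported gain in context: PSNR is defined as 10·log10(MAX²/MSE), so a fixed gain in dB corresponds to a fixed ratio of mean squared errors. The sketch below works through that arithmetic. The helper names are illustrative and do not come from the paper.

```python
import math

def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db: float) -> float:
    """Ratio new_mse / old_mse implied by a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)

# The +2.1 dB figure is the "up to" number quoted in the summary.
ratio = mse_ratio_for_gain(2.1)
print(f"+2.1 dB PSNR corresponds to roughly {1 - ratio:.0%} lower MSE")
```

Under this definition, the best-case +2.1 dB gain amounts to a reconstruction error about 38% lower than the unconditional baseline.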

6arXiv · cs.CL·1mo ago·source ↗

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.
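
For readers unfamiliar with GRPO, the sketch below shows the standard group-relative advantage that the method builds on: each sampled response is scored against the mean and standard deviation of its own group. The summary does not describe the latent-anchoring term that LA-GRPO adds, so it is not shown. The reward values are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (standard GRPO)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient. This is one general reason sparse signals are hard for group-relative methods.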

5arXiv · cs.AI·1mo ago·source ↗

EntityBench: Benchmark for Entity-Consistent Long-Range Multi-Shot Video Generation

EntityBench is a new benchmark comprising 140 episodes (2,491 shots) derived from real narrative media, designed to evaluate entity consistency—characters, objects, and locations—across long multi-shot video generation sequences. It introduces tiered difficulty up to 50 shots and recurrence gaps of up to 48 shots, paired with a three-pillar evaluation suite covering intra-shot quality, prompt alignment, and cross-shot consistency. The authors also propose EntityMem, a memory-augmented baseline that stores verified per-entity visual references in a persistent memory bank, achieving the highest character fidelity (Cohen's d = +2.33) among evaluated methods. Results show that cross-shot entity consistency degrades sharply with recurrence distance in existing approaches.
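
Cohen's d, the effect size quoted above, is the difference between two group means divided by their pooled standard deviation. The sketch below computes it in its common pooled-variance form. The scores are made-up placeholders rather than data from the paper, and the paper may use a different variant.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Hypothetical per-episode fidelity scores, for illustration only.
with_memory = [0.82, 0.85, 0.80, 0.88, 0.84]
baseline = [0.74, 0.78, 0.72, 0.79, 0.76]
d = cohens_d(with_memory, baseline)
print(f"Cohen's d = {d:+.2f}")
```

By the usual rule of thumb, values above 0.8 count as a large effect, so d = +2.33 means the two score distributions barely overlap.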

5arXiv · cs.AI·1mo ago·source ↗

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT is a new neural architecture that implicitly models continuous 3D geometry from unposed multi-view images without requiring explicit pointmap regression. It learns a continuous neural scene representation in a canonical coordinate system, supporting SDF-based surface queries and color prediction via lightweight decoders. The model is trained with multi-dataset joint optimization using 2D supervision and 3D geometric regularization, achieving strong generalization across mesh reconstruction, novel view synthesis, depth/normal estimation, and camera pose estimation tasks.

6Qwen Research·1mo ago·source ↗

Qwen-Image-Edit: Image Editing Model with Text Rendering and Dual Visual Control

Alibaba's Qwen team has released Qwen-Image-Edit, a 20B-parameter image editing model built on the Qwen-Image foundation. The model extends Qwen-Image's text rendering capabilities to editing tasks, enabling precise in-image text modification. Its dual-path architecture feeds the input image simultaneously into Qwen2.5-VL for semantic control and into a VAE encoder for appearance control, so it supports both semantic and appearance-level edits.

7Qwen Research·1mo ago·source ↗

Qwen-Image: 20B MMDiT Image Foundation Model with Native Text Rendering

Alibaba's Qwen team has released Qwen-Image, a 20B parameter MMDiT (Multimodal Diffusion Transformer) image generation foundation model. The model claims significant advances in complex text rendering capabilities, including multi-line layouts, paragraph-level semantics, and fine-grained typographic details across alphabetic and other language scripts. It also features precise image editing capabilities and is accessible via Qwen Chat and open-weight repositories on HuggingFace and ModelScope.

6Hugging Face Blog·1mo ago·source ↗

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

NVIDIA has released Nemotron 3 Nano Omni, a multimodal model targeting long-context understanding across documents, audio, and video modalities. The model is positioned for agentic use cases requiring cross-modal reasoning. It is published via the Hugging Face blog as part of NVIDIA's Nemotron model family. No detailed technical specifications or benchmark results are provided in the available body text.

5Hugging Face Blog·1mo ago·source ↗

H Company's Holo2 235B-A22B Model Leads in UI Localization

H Company has released Holo2, a 235B-parameter mixture-of-experts model with 22B active parameters, announced via the Hugging Face blog. The model is positioned as a leader in UI localization tasks, suggesting a focus on agent-oriented or multimodal UI understanding capabilities. The post appears to be a product/model introduction from H Company, a relatively new AI lab.

4Hugging Face Blog·1mo ago·source ↗

Training Design for Text-to-Image Models: Lessons from Ablations

Photoroom shares practical lessons from ablation studies on training design choices for text-to-image diffusion models. The post covers decisions around data curation, model architecture, and training hyperparameters derived from systematic experimentation. This is part two of a series documenting Photoroom's internal research into building production-grade image generation systems.

5Hugging Face Blog·1mo ago·source ↗

Holo1: New family of GUI automation VLMs powering GUI agent Surfer-H

H Company has released Holo1, a new family of vision-language models specifically designed for GUI automation tasks. These models power Surfer-H, a GUI agent capable of interacting with graphical interfaces. The release represents a specialized VLM family targeting the agent-tool ecosystem for desktop/web automation. Details on architecture, training data, and benchmarks are expected in the accompanying blog post.

5Hugging Face Blog·1mo ago·source ↗

SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data

Hugging Face introduces SmolVLA, a compact Vision-Language-Action model designed for robotics control, trained on community-contributed data from the LeRobot ecosystem. The model targets efficient deployment on resource-constrained hardware while maintaining competitive manipulation performance. This release represents a continuation of Hugging Face's strategy to democratize robotics AI through open community data pipelines.

5Qwen Research·1mo ago·source ↗

Qwen-MT Turbo: Alibaba Releases Specialized Translation Model Supporting 92 Languages

Alibaba's Qwen team has released qwen-mt-turbo, a specialized machine translation model built on Qwen3 and trained on trillions of multilingual and translation tokens. The model supports 92 languages and dialects covering over 95% of the global population. It incorporates reinforcement learning techniques to improve translation accuracy and linguistic fluency, and is available via the Qwen API.

5Qwen Research·1mo ago·source ↗

Qwen-TTS Updated with Chinese Dialect Support and Bilingual Voices

Alibaba's Qwen team has released an update to Qwen-TTS (qwen-tts-2025-05-22), a text-to-speech model trained on millions of hours of speech data. The model claims human-level naturalness and expressiveness, with automatic prosody and emotional inflection adjustment. A notable new capability is support for three Chinese dialects—Pekingese, Shanghainese, and Sichuanese—delivered through seven named Chinese-English bilingual voices accessible via the Qwen API.

5Berkeley AI Research (BAIR) Blog·1mo ago·source ↗

PEVA: Whole-Body Conditioned Egocentric Video Prediction for Embodied World Models

Researchers from BAIR introduce PEVA (Predicting Ego-centric Video from human Actions), a model that generates first-person video frames conditioned on 48-dimensional whole-body kinematic pose trajectories. The model uses an autoregressive conditional diffusion transformer trained on the Nymeria dataset, which pairs real-world egocentric video with body pose capture. PEVA can generate atomic action videos, simulate counterfactuals, and support long video generation, representing a step toward world models grounded in physically embodied human agents.

7Qwen Research·1mo ago·source ↗

Qwen VLo: Unified Multimodal Understanding and Generation Model

Alibaba's Qwen team has announced Qwen VLo, a new model that unifies multimodal understanding and image generation in a single architecture. Building on the Qwen2.5-VL lineage, the model is positioned to both comprehend and generate high-quality visual content. This represents a step toward unified perception-and-creation models, a direction several frontier labs are pursuing simultaneously.

6Berkeley AI Research (BAIR) Blog·1mo ago·source ↗

PLAID: Repurposing Protein Folding Models for Multimodal Protein Generation with Latent Diffusion

PLAID is a generative model that simultaneously produces protein 1D sequences and 3D all-atom structures by learning a diffusion model over the latent space of ESMFold, a protein folding model. It requires only sequence data for training—leveraging databases 2-4 orders of magnitude larger than structure databases—and decodes structure at inference via frozen folding model weights. The approach supports compositional prompting by function and organism, addressing practical drug-design constraints like humanization and solubility. A companion compression model, CHEAP, addresses the high-dimensionality of transformer latent spaces to make the diffusion training tractable.

7Qwen Research·1mo ago·source ↗

QVQ-Max: Alibaba Qwen Releases Visual Reasoning Model with Multimodal Chain-of-Thought

Alibaba's Qwen team has officially released QVQ-Max, a visual reasoning model succeeding the December 2024 QVQ-72B-Preview. The model is designed to analyze and reason over images and videos, covering domains including mathematics, programming, and creative tasks. It goes a step beyond the exploratory preview and is positioned as a production-grade multimodal reasoning system.

5Hugging Face Blog·1mo ago·source ↗

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Hugging Face published a blog post detailing how to train and finetune multimodal embedding and reranker models using the Sentence Transformers library. The post covers techniques for building models that can jointly embed text and images for retrieval and reranking tasks. This represents an extension of the Sentence Transformers ecosystem into multimodal territory, enabling practitioners to build cross-modal search and ranking systems.

7Qwen Research·1mo ago·source ↗

Qwen2.5-Omni: Alibaba Releases End-to-End Multimodal Model with Real-Time Streaming

Alibaba's Qwen team has released Qwen2.5-Omni, a 7B-parameter end-to-end multimodal model capable of processing text, images, audio, and video simultaneously. The model streams its responses in real time as both text and synthesized natural speech. It is openly available on Hugging Face, ModelScope, DashScope, and GitHub, accompanied by a technical paper.

7Qwen Research·1mo ago·source ↗

Qwen2.5-VL-32B: Reinforcement-Learning-Optimized Vision-Language Model Released

Alibaba's Qwen team has released Qwen2.5-VL-32B-Instruct, a 32-billion-parameter vision-language model built on the Qwen2.5-VL series and further optimized with reinforcement learning. The model is open-sourced under the Apache 2.0 license and available on Hugging Face and ModelScope. It follows the January 2025 launch of the broader Qwen2.5-VL series, positioning the 32B scale as a balance between capability and deployment practicality.

5Hugging Face Blog·1mo ago·source ↗

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs

Hugging Face published a blog post introducing Waypoint-1.5, a model or system for generating higher-fidelity interactive world simulations designed to run on consumer-grade GPUs. The post appears to describe advances in interactive world modeling or simulation quality relative to a prior Waypoint-1 release. Because this tier-2 source has no body text available, specific technical details about architecture, benchmarks, or training methodology cannot be assessed.

5Hugging Face Blog·1mo ago·source ↗

Multimodal Embedding & Reranker Models with Sentence Transformers

Hugging Face's Sentence Transformers library has added support for multimodal embedding and reranking models, enabling joint text-image (and potentially other modality) representations within a unified framework. The update extends the library's existing text-focused embedding capabilities to handle cross-modal retrieval and reranking tasks. This lowers the barrier for practitioners building multimodal search and RAG pipelines using open-weights models.
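
The retrieval half of such a pipeline reduces to ranking items from one modality by their similarity to a query from another, once both live in a shared embedding space. The sketch below illustrates that step in plain Python. It does not call the Sentence Transformers API, and the three-dimensional vectors are hand-written stand-ins for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, corpus, top_k=2):
    """Rank corpus items (id -> embedding) by cosine similarity to the query."""
    ranked = sorted(corpus.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:top_k]]

# Toy vectors standing in for image embeddings from a multimodal encoder.
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
text_query = [0.8, 0.2, 0.1]  # stand-in for the embedding of "a photo of a cat"
candidates = retrieve(text_query, image_index)
print(candidates)
```

A reranker would then rescore only these shortlisted candidates with a slower model that reads the query and each candidate jointly. That second stage is omitted here.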

7Hugging Face Blog·1mo ago·source ↗

Welcome Gemma 4: Frontier Multimodal Intelligence on Device

Google has released Gemma 4, a new open-weights multimodal model family announced via the Hugging Face blog. The release positions Gemma 4 as capable of frontier-level multimodal intelligence while being deployable on-device. As a tier-2 source commentary, the post likely covers model capabilities, availability on Hugging Face Hub, and integration details.

5Hugging Face Blog·1mo ago·source ↗

Falcon Perception: TII Announces Multimodal Perception Capabilities for Falcon

TII (Technology Innovation Institute) has published a blog post on Hugging Face introducing Falcon Perception, a multimodal extension of the Falcon model family. The post appears to detail perception capabilities added to the Falcon series, likely covering vision-language or other sensory modalities. As the body content is empty, specific technical details about architecture, benchmarks, or release scope are unavailable from this source.

5Hugging Face Blog·1mo ago·source ↗

Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

IBM released Granite 4.0 3B Vision, a compact multimodal model targeting enterprise document understanding tasks. The model is hosted on Hugging Face and positioned for deployment in resource-constrained enterprise environments. As a 3B-parameter vision-language model, it competes in the small-but-capable segment increasingly favored for on-premise and edge deployments.

8Qwen Research·1mo ago·source ↗

Qwen2.5-VL: Alibaba's New Flagship Vision-Language Model Released in 3B/7B/72B Sizes

Alibaba's Qwen team has released Qwen2.5-VL, their new flagship vision-language model, representing a significant upgrade over Qwen2-VL. The release includes both base and instruct variants in three sizes (3B, 7B, 72B), all open-weight and available on Hugging Face and ModelScope. The 72B instruct model is also accessible via Qwen Chat. The headline capability is enhanced visual understanding, and the release is positioned as a major step forward in multimodal performance.

7Qwen Research·1mo ago·source ↗

QVQ-72B-Preview: Qwen Visual Reasoning Model Release

Alibaba's Qwen team has released QVQ-72B-Preview, a 72-billion parameter multimodal model designed to integrate visual understanding with advanced reasoning capabilities. The model is positioned as an extension of Qwen's language reasoning work into the visual domain. It is available on GitHub, Hugging Face, ModelScope, and Kaggle with a live demo.

5Hugging Face Blog·1mo ago·source ↗

Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine-Tuning, and On-Device Optimizations

NXP and Hugging Face describe a pipeline for deploying Vision-Language-Action (VLA) models on embedded/edge hardware, covering dataset recording, fine-tuning, and on-device optimization techniques. The post targets robotics applications where inference must run on resource-constrained microcontrollers or SoCs rather than cloud GPUs. Key topics include quantization, model compression, and integration with the LeRobot ecosystem. This represents a practical engineering bridge between frontier VLA research and real-world embedded robotics deployment.
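
As background on the quantization topic, the sketch below shows symmetric per-tensor int8 quantization, one of the most common schemes for shrinking weights for edge inference. It is a generic illustration with made-up weights. The summary does not say which scheme the post actually uses.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: returns (int values, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 values back to approximate floating-point weights."""
    return [v * scale for v in quantized]

weights = [0.52, -1.27, 0.03, 0.9]  # hypothetical layer weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step.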

5Hugging Face Blog·1mo ago·source ↗

Introducing Modular Diffusers - Composable Building Blocks for Diffusion Pipelines

Hugging Face introduces Modular Diffusers, a new framework design that breaks diffusion pipelines into composable, interchangeable building blocks. The approach aims to make it easier to mix and match components such as encoders, denoisers, and decoders across different diffusion model architectures. This represents a significant refactor of the Diffusers library's pipeline abstraction, targeting researchers and developers who need flexible pipeline construction without rewriting boilerplate code.

4Hugging Face Blog·1mo ago·source ↗

PRX Part 3 — Training a Text-to-Image Model in 24 Hours

Photoroom shares the third installment of their PRX series on Hugging Face, detailing how they trained a text-to-image model within a 24-hour window. The post covers the practical engineering and training infrastructure decisions that enabled rapid model development. This is part of an ongoing series documenting Photoroom's internal model development process.

7Qwen Research·1mo ago·source ↗

Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding

Alibaba's Qwen team has released Qwen2-VL, the latest iteration of their vision-language model series built on the Qwen2 foundation. The model claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos exceeding 20 minutes in length for question answering, dialog, and content creation tasks.

6Qwen Research·1mo ago·source ↗

Qwen2-Audio: Multimodal Audio-Language Model Release

Alibaba's Qwen team has released Qwen2-Audio, the successor to Qwen-Audio, capable of accepting both audio and text inputs and generating text outputs. The model is positioned as a step toward AGI by extending large language model capabilities to audio modalities. It is released with an accompanying paper, a GitHub repository, and model weights on Hugging Face and ModelScope.

8Qwen Research·1mo ago·source ↗

Qwen2 Model Family Released: Five Sizes, 128K Context, Multilingual

Alibaba's Qwen team has released Qwen2, an evolution from Qwen1.5, comprising pretrained and instruction-tuned models in five sizes ranging from 0.5B to 72B parameters, including a 57B mixture-of-experts variant (57B-A14B). The release highlights training on 27 additional languages beyond English and Chinese, significantly improved coding and mathematics performance, and extended context support up to 128K tokens for the 7B and 72B instruct variants. Benchmark results are claimed to be state-of-the-art across a large number of evaluations.

6Qwen Research·1mo ago·source ↗

Introducing Qwen-VL-Plus and Qwen-VL-Max: Upgraded Multimodal Models from Alibaba

Alibaba's Qwen team has launched two enhanced versions of their multimodal model, Qwen-VL-Plus and Qwen-VL-Max, building on the open-sourced Qwen-VL released in September 2023. Key improvements include substantially boosted image reasoning capabilities, enhanced detail recognition and text extraction from images, and support for high-definition images exceeding one million pixels across various aspect ratios. The upgrades represent a significant step forward in the Qwen-VL series' generalization and visual understanding capabilities.

4Import AI·1mo ago·source ↗

Import AI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text

Import AI issue 449 covers several AI/ML developments including LLMs being used to train other LLMs, a 72B parameter distributed training run, and analysis of why computer vision remains harder than generative text. The newsletter also touches on potential political implications of AI progress. As a tier-2 commentary source, this aggregates and contextualizes multiple technical developments across the AI landscape.

5Hugging Face Blog·1mo ago·source ↗

TimeScope: How Long Can Your Video Large Multimodal Model Go?

Hugging Face introduces TimeScope, a benchmark designed to evaluate video large multimodal models (LMMs) across varying video lengths and temporal reasoning demands. The benchmark targets a known gap in existing evaluations: most video benchmarks use short clips, leaving long-video understanding largely untested. TimeScope aims to systematically probe how model performance degrades or holds as video duration increases.

4Hugging Face Blog·1mo ago·source ↗

nanoVLM: Minimal Pure-PyTorch Repository for Training Vision-Language Models

Hugging Face published nanoVLM, a minimal open-source repository designed to make training vision-language models (VLMs) as simple as possible using pure PyTorch. The project aims to lower the barrier to entry for VLM research and experimentation by providing a clean, readable codebase without heavy abstractions. It follows in the tradition of educational ML repositories like nanoGPT, targeting researchers and practitioners who want to understand or customize VLM training from scratch.

5Hugging Face Blog·1mo ago·source ↗

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

4Hugging Face Blog·1mo ago·source ↗

Finetuning olmOCR to be a faithful OCR-Engine

TNG Technology Consulting describes a fine-tuning approach applied to olmOCR, a vision-language model designed for document OCR tasks, to improve its faithfulness and reduce hallucinations. The post covers dataset construction, training methodology, and evaluation results showing improved accuracy on document extraction benchmarks. This represents a practical community contribution to the open-weights document-understanding ecosystem.

4Hugging Face Blog·1mo ago·source ↗

Visual Salamandra: Pushing the Boundaries of Multimodal Understanding

BSC-LT (Barcelona Supercomputing Center Language Technologies) has released Visual Salamandra, a 7B multimodal model announced via the Hugging Face blog. The post describes a vision-language model building on the Salamandra language model family. Because this tier-2 source has an empty body, specific capability details and benchmark results are not available from this item alone.

4Qwen Research·1mo ago·source ↗

OFA: Towards Building a One-For-All Unified Multimodal Pretrained Model

Alibaba's Qwen team introduces OFA (One-For-All), a unified multimodal pretrained model designed to handle both understanding and generation tasks across multiple modalities within a single framework. The model is pretrained using instruction-based multitask pretraining to endow it with diverse capabilities. This work was published in late 2022 as part of the broader wave of generalist multimodal models. It represents an early effort toward a single model architecture capable of spanning vision, language, and cross-modal tasks.

4Qwen Research·1mo ago·source ↗

OFASys: Multitask Multimodal Learning Framework from Alibaba/Qwen

Alibaba's Qwen team released OFASys, an open-source framework designed to simplify multimodal multitask learning, building on their earlier OFA unified pretrained model. The system aims to reduce engineering friction in setting up multi-task, multi-modal training pipelines, including data batching and training stability. It is positioned as infrastructure for building generalist AI models with minimal code overhead.

4Qwen Research·1mo ago·source ↗

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

Alibaba's Qwen team released Chinese CLIP, a language-specific vision-language contrastive pretraining model targeting Chinese multimodal representation learning. The project addresses a gap in open-source Chinese CLIP models, particularly for cross-modal retrieval tasks. It follows the CLIP framework but is adapted for Chinese language and cultural context.

8Anthropic News·1mo ago·source ↗

Anthropic Releases Claude Opus 4.7 with Enhanced Coding, Vision, and Cyber Safeguards

Anthropic has released Claude Opus 4.7, a general-availability model positioned as a meaningful improvement over Opus 4.6 in advanced software engineering, long-horizon agentic tasks, and vision capabilities including higher image resolution. The model is notably the first to receive new cybersecurity safeguards developed in response to Project Glasswing, with automatic detection and blocking of prohibited cyber uses and a new Cyber Verification Program for legitimate security professionals. Opus 4.7 is available across Claude products, the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at the same pricing as Opus 4.6 ($5/$25 per million input/output tokens). The release is explicitly positioned below Claude Mythos Preview in overall capability, serving as a testbed for safety mechanisms before broader deployment of Mythos-class models.
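
The quoted per-million-token prices translate into per-request cost as in the sketch below. The token counts are a hypothetical example, and the calculation ignores any discounts or surcharges not mentioned in the summary.

```python
INPUT_USD_PER_MTOK = 5.0    # input price per million tokens, from the summary
OUTPUT_USD_PER_MTOK = 25.0  # output price per million tokens, from the summary

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """List-price cost in USD of one request."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000

# Hypothetical request: 20,000 input tokens and 2,000 output tokens.
cost = request_cost_usd(20_000, 2_000)
print(f"${cost:.2f}")
```

Output tokens cost five times as much as input tokens. In this example, 2,000 output tokens therefore account for a third of the $0.15 total even though they are under a tenth of the tokens.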

6Anthropic News·1mo ago·source ↗

Anthropic Launches Claude for Creative Work with Eight New MCP Connectors

Anthropic has released a suite of connectors enabling Claude to integrate directly with major creative software platforms including Adobe Creative Cloud, Blender, Autodesk Fusion, Ableton, Affinity by Canva, SketchUp, Resolume, and Splice. The connectors are built on the Model Context Protocol (MCP), making them accessible to other LLMs as well. Anthropic also announced Claude Design, a new product from Anthropic Labs for exploring software UI concepts with export to Canva, and partnerships with RISD, Ringling College, and Goldsmiths to support creative computing curricula. A one-time donation was made to the Blender project to support its Python API development.

7Anthropic News·1mo ago·source ↗

Anthropic Launches Claude Design: AI-Powered Visual Design and Prototyping Tool

Anthropic has launched Claude Design, a new product under its Anthropic Labs umbrella that enables collaborative visual design work including prototypes, slides, wireframes, and marketing collateral. Powered by Claude Opus 4.7, the tool supports brand system ingestion, inline editing, multi-user collaboration, and direct handoff to Claude Code for implementation. It is available in research preview for Claude Pro, Max, Team, and Enterprise subscribers, with integrations including Canva and PPTX export. The product targets both professional designers seeking faster exploration and non-designers needing to produce visual work.

7Mistral AI News·1mo ago·source ↗

Mistral Releases Voxtral TTS: 4B-Parameter Multilingual Text-to-Speech Model

Mistral AI has launched Voxtral TTS, its first text-to-speech model, built on a 4B-parameter transformer-based autoregressive flow-matching architecture derived from Ministral 3B. The model supports 9 languages with zero-shot voice adaptation from as little as 3 seconds of reference audio, achieving 70ms latency for typical inputs and a real-time factor of ~9.7x. Human evaluations claim superior naturalness compared to ElevenLabs Flash v2.5 and parity with ElevenLabs v3. The model is available via Mistral Studio and API, targeting enterprise voice agent workflows.
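
The "~9.7x" figure is easiest to read as a speed-up over playback. This is an assumption: the term "real-time factor" is also commonly defined the other way around, as processing time divided by audio duration. Under the speed-up reading, the arithmetic is:

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 9.7) -> float:
    """Wall-clock time to synthesize audio, assuming `speedup` is audio duration / compute time."""
    return audio_seconds / speedup

# One minute of speech under the assumed interpretation of the 9.7x figure.
t = synthesis_seconds(60.0)
print(f"{t:.2f} s of compute for 60 s of audio")
```

On that reading, a minute of speech takes a little over six seconds to generate. The separate 70ms latency figure is not modeled here.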

8Mistral AI News·1mo ago·source ↗

Mistral Small 4: Unified Multimodal, Reasoning, and Coding MoE Model Released Under Apache 2.0

Mistral AI has released Mistral Small 4, a 119B-parameter Mixture-of-Experts model (6B active per token) that unifies capabilities previously split across Magistral (reasoning), Pixtral (multimodal), and Devstral (coding agents) into a single open-weights model. The model features a 256k context window, configurable reasoning effort via a `reasoning_effort` parameter, native text and image input support, and is released under Apache 2.0. Mistral claims 40% latency reduction and 3x throughput improvement over Mistral Small 3, with benchmark results showing competitive performance against GPT-OSS 120B and Qwen models while producing significantly shorter outputs. The release includes day-0 availability as an NVIDIA NIM and support across vLLM, llama.cpp, SGLang, and Transformers.
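
The gap between total and active parameters is what makes a Mixture-of-Experts model cheap to run per token. The sketch below computes the active fraction for Mistral Small 4 and, for comparison, for two other MoE models mentioned in this feed. All parameter counts are taken from the summaries as stated.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token in a Mixture-of-Experts model."""
    return active_b / total_b

# (total, active) parameter counts in billions, as quoted in the summaries.
models = {
    "Mistral Small 4": (119, 6),
    "Holo2 235B-A22B": (235, 22),
    "Qwen2 57B-A14B": (57, 14),
}

for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} of parameters active per token")
```

Per-token compute scales roughly with the active count, while memory still has to hold all of the weights. A 119B model with 6B active parameters therefore runs more like a small model but must be stored like a large one.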

7Mistral AI News·1mo ago·source ↗

Mistral Releases Voxtral Transcribe 2: State-of-the-Art Speech-to-Text with Sub-200ms Realtime Model

Mistral AI has released Voxtral Transcribe 2, a family of two speech-to-text models: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. Voxtral Realtime features a novel streaming architecture with configurable latency down to sub-200ms, a 4B parameter footprint suitable for edge deployment, and is released as open weights under Apache 2.0. Voxtral Mini Transcribe V2 claims state-of-the-art word error rate on FLEURS at $0.003/min, outperforming GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI, and Deepgram Nova on accuracy benchmarks. Both models support 13 languages with speaker diarization, word-level timestamps, and context biasing.
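
Word error rate, the metric behind the FLEURS claim, is the word-level edit distance between a reference transcript and the model's output, divided by the reference length. The sketch below is a minimal version of the standard definition. The example sentences are invented, and real evaluations normally apply text normalization first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)  # requires a non-empty reference

# One substitution (sat -> sit) and one deletion (the) against six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.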

7Mistral AI News·1mo ago·source ↗

Mistral AI joins NVIDIA Nemotron Coalition as founding member, co-developing open frontier models

Mistral AI has announced a strategic partnership with NVIDIA as a founding member of the newly formed NVIDIA Nemotron Coalition, a multi-lab initiative to advance open-source frontier foundation models. The collaboration will combine Mistral's model architectures, multimodal capabilities, and fine-tuning expertise with NVIDIA's DGX Cloud compute and synthetic data pipelines. The coalition's first deliverable is a base model trained on DGX Cloud that will underpin the upcoming NVIDIA Nemotron 4 model family, to be open-sourced. Coinciding with the announcement, Mistral is also releasing Mistral Small 4.