6 · Latent Space (swyx) · 1mo ago

Thinking Machines' TML-Interaction-Small 276B-A12B Advances SOTA Realtime Voice and VAD

Thinking Machines has released TML-Interaction-Small, a 276B-A12B mixture-of-experts model targeting native interaction capabilities, including realtime voice. The model is reported to advance the state of the art in realtime voice interaction and to supersede standard voice activity detection (VAD) approaches. The item is a brief AINews digest entry from Latent Space with minimal technical detail beyond the headline claims.
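The "276B-A12B" label appears to follow a common mixture-of-experts naming convention in which the first figure is the total parameter count and the "A" figure is the number of parameters active per token. The digest does not spell this out, so the sketch below treats that reading as an assumption; the function name `active_fraction` is purely illustrative.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used per token, given counts in billions.

    Assumption: "276B-A12B" means 276B total and 12B active parameters.
    """
    if total_params_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_params_b / total_params_b


# Under that reading, roughly 4.3% of the weights are active for each token.
print(f"{active_fraction(276, 12):.1%}")
```

Under this reading, per-token compute is closer to that of a 12B dense model, while memory requirements still reflect the full 276B parameters.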

Frontier Model Releases · Agent and Tool Ecosystem · Multimodal Progress · Thinking Machines · TML-Interaction-Small · Voice Activity Detection (VAD)

Related guides (3)

Frontier Model Releases (Topic guide)

Frontier Model Releases: The Race From Language to Action


Multimodal Progress (Topic guide)

Multimodal Progress: How AI Learned to See, Hear, and Act


Agent and Tool Ecosystem (Topic guide)

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Related events (8)

7 · The Batch · 28d ago

Thinking Machines Lab Reveals TML-Interaction-Small: Real-Time Multimodal Interaction Model

Thinking Machines Lab (founded by Mira Murati) has announced TML-Interaction-Small, a 276B-parameter mixture-of-experts multimodal model that processes audio, video, and text concurrently in 200ms "micro-turns" rather than waiting for conversational turns to complete. The architecture uses encoder-free early fusion and pairs a fast foreground interaction model with an asynchronous background reasoning model that shares its context. On interactivity benchmarks (FD-bench V1/V1.5), it outperforms GPT-Realtime-2 and Gemini-3.1-flash-live-preview, though it trails GPT-Realtime-2 on intelligence benchmarks. A closed research preview is expected in the coming months, with a wider release later in 2026.
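To make the micro-turn idea concrete, the sketch below slices a stream of audio samples into fixed 200ms windows, the granularity quoted in the summary. The 16 kHz sample rate and the function name `micro_turns` are illustrative assumptions, not details from the announcement, and the sketch says nothing about how the model itself consumes each window.

```python
from typing import Iterator, Sequence

MICRO_TURN_MS = 200      # window length quoted in the summary
SAMPLE_RATE_HZ = 16_000  # assumption: a common speech sample rate, not from the source


def micro_turns(samples: Sequence[float],
                sample_rate_hz: int = SAMPLE_RATE_HZ,
                window_ms: int = MICRO_TURN_MS) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length windows; the last one may be shorter."""
    window = sample_rate_hz * window_ms // 1000  # samples per micro-turn
    for start in range(0, len(samples), window):
        yield samples[start:start + window]


# One second of silence at 16 kHz splits into five windows of 3200 samples.
one_second = [0.0] * SAMPLE_RATE_HZ
chunks = list(micro_turns(one_second))
print(len(chunks), len(chunks[0]))
```

A turn-based system would instead buffer samples until an end-of-speech signal arrives; fixed windows let a model react while the speaker is still talking.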

Frontier Model Releases · Inference Economics · encoder-free early fusion · Thinking Machines · GPT-Realtime-2 · +16 more

8 · Mistral AI News · 19d ago

Mistral AI Releases Voxtral: Open-Weight Speech Understanding Models in 24B and 3B Sizes

Mistral AI has released Voxtral, a family of two open-weight speech understanding models (Voxtral Small at 24B and Voxtral Mini at 3B) under the Apache 2.0 license. Built on the Mistral Small 3.1 language model backbone, the models support long-form audio of up to 30-40 minutes, native multilingual transcription, built-in Q&A and summarization, and function calling directly from voice. Mistral's reported benchmarks show Voxtral outperforming Whisper large-v3 across all tasks and competitive with GPT-4o mini and Gemini 2.5 Flash on audio understanding. API pricing starts at $0.001/minute. The models are available on Hugging Face and through Mistral's API, along with a transcription-optimized variant (Voxtral Mini Transcribe).

Frontier Model Releases · Open Weights Progress · Mistral AI · FLEURS · Mistral Small 4 · +14 more

6 · arXiv · cs.AI · 16d ago

Audio Interaction Model: Unified Streaming LALM with Always-On Perceive-Decide-Respond Loop

Researchers introduce the Audio Interaction Model framework and a concrete implementation called Audio-Interaction, a unified streaming Large Audio Language Model that handles both offline tasks and real-time audio interaction through a continuous perceive-decide-respond loop. The system is built on SoundFlow, a framework covering data construction, training, and asynchronous low-latency inference. The authors also release StreamAudio-2M, a 2.6M-item streaming corpus spanning 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Evaluated across 8 benchmarks, the model preserves competitive offline performance while enabling real-time ASR, streaming instruction following, and proactive response capabilities not available in prior offline LALMs.

Frontier Model Releases · Multimodal Progress · Proactive-Sound-Bench · Audio Interaction Model · StreamAudio-2M · +1 more

6 · The Batch · 1mo ago

Data Points: Thinking Machines Interaction Model, ERNIE 5.1, Co-Mathematician, RL Conductor, and More

This edition of The Batch covers five notable AI developments: Thinking Machines' research preview of an 'interaction model' with a 200ms micro-turn multimodal architecture; Baidu's ERNIE 5.1, a compressed derivative of ERNIE 5.0 using only 6% of typical pre-training compute; Google DeepMind's Co-Mathematician collaborative workbench reaching 48% on FrontierMath Tier 4; a 7B RL Conductor model that orchestrates multi-agent workflows via reinforcement learning; and Google's Magic Pointer cursor system powered by Gemini. Secondary items include GitHub Copilot pricing restructuring ahead of usage-based billing.

Training Infrastructure · Frontier Model Releases · Thinking Machines · SGLang · GitHub · +21 more

6 · The Batch · 1mo ago

OpenAI Updates Audio Models That Reason, Transcribe, and Translate

OpenAI introduced three new audio models in its Realtime API: GPT-Realtime-2 (speech-to-speech with five configurable reasoning effort levels), GPT-Realtime-Translate (70+ input languages), and GPT-Realtime-Whisper (transcription). GPT-Realtime-2 operates as an end-to-end audio model including reasoning, with latency ranging from 1.12 seconds at minimal effort to 2.33 seconds at high effort. Benchmark results are mixed: it leads Scale AI's Audio MultiChallenge and Artificial Analysis Conversational Dynamics but trails Step-Audio R1.1 Realtime and Grok Voice Think Fast 1.0 on speech reasoning and agentic tasks. The configurable reasoning-latency tradeoff is positioned as a key differentiator for voice agent applications.

Frontier Model Releases · Evaluation and Benchmarking · Scale AI · Audio MultiChallenge · GPT-Realtime-2 · Google · +14 more

6 · Google DeepMind Blog · 1mo ago

Improved Gemini Audio Models for Powerful Voice Experiences

DeepMind has announced improved Gemini audio models targeting enhanced voice experience capabilities. The announcement comes from the official DeepMind blog, indicating a formal product or capability update to the Gemini model family's audio processing and generation features. Specific technical details were not available in the body text, but the framing suggests advances in speech understanding, synthesis, or real-time voice interaction. This is part of Google DeepMind's ongoing development of multimodal Gemini capabilities.

Frontier Model Releases · Multimodal Progress · Gemini Audio · Google DeepMind · Gemini

7 · Mistral AI News · 1mo ago

Mistral Releases Voxtral TTS: 4B-Parameter Multilingual Text-to-Speech Model

Mistral AI has launched Voxtral TTS, its first text-to-speech model, built on a 4B-parameter transformer-based autoregressive flow-matching architecture derived from Ministral 3B. The model supports 9 languages with zero-shot voice adaptation from as little as 3 seconds of reference audio, achieving 70ms latency for typical inputs and a real-time factor of ~9.7x. Human evaluations claim superior naturalness compared to ElevenLabs Flash v2.5 and parity with ElevenLabs v3. The model is available via Mistral Studio and API, targeting enterprise voice agent workflows.
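The "~9.7x" real-time factor can be turned into a rough generation-time estimate. The sketch below assumes the figure means audio is produced about 9.7 times faster than it plays back (audio duration divided by processing time); some literature defines real-time factor the other way round, so treat this reading as an assumption. The function name `synthesis_seconds` is illustrative.

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 9.7) -> float:
    """Estimated compute time to synthesize `audio_seconds` of speech.

    Assumption: `speedup` is audio duration divided by processing time,
    which is how the "~9.7x" figure is read here.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Under that reading, 60 s of speech takes a little over 6 s to generate.
print(round(synthesis_seconds(60.0), 2))
```

This throughput estimate is separate from the quoted 70ms latency, which presumably describes how quickly output begins rather than how long a full utterance takes.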

Inference Economics · Enterprise Deployment Patterns · ElevenLabs Flash v2.5 · Mistral AI · ElevenLabs v3 · +5 more

5 · Hugging Face Blog · 1mo ago

SmolVLM - Small Yet Mighty Vision Language Model

Hugging Face introduces SmolVLM, a compact vision-language model designed to deliver strong multimodal performance at small parameter counts. The model targets edge and resource-constrained deployment scenarios while maintaining competitive capabilities relative to its size. The announcement highlights efficiency improvements in both training and inference for small-scale VLMs.

Open Weights Progress · Inference Economics · SmolVLM · Hugging Face · +1 more