UniAudio-Token: Semantic Speech Tokenizer with General Audio Perception for Audio-LLMs
UniAudio-Token is a framework from Tencent that extends semantic speech tokenizers, which are commonly used as interfaces for Audio-LLMs, to support general audio perception without sacrificing speech quality. It introduces two mechanisms: Semantic-Acoustic Primitives (SAP), a structured supervision scheme that decomposes audio into linguistic, vocal, and auditory-scene components, and Semantic-Acoustic Equilibrium (SAE), a content-aware gating mechanism that restores fine-grained acoustic details from shallow layers. In the reported evaluations, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks when integrated with downstream LLMs. Code, training and inference scripts, and model checkpoints are publicly released.
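The summary does not say how the SAE gate is computed, so the following is only a rough sketch of what a content-aware gate could look like, not the released implementation. It assumes a per-dimension sigmoid gate computed from both the deep (semantic) and shallow (acoustic) features, with the gated shallow features added back to the deep ones. The function name `gated_fusion` and the weight parameters are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(deep, shallow, w_deep, w_shallow, bias):
    """Hypothetical content-aware gate (an assumption, not the paper's formula).

    For each feature dimension, the gate g = sigmoid(w_deep*d + w_shallow*s + b)
    depends on the content of both inputs, and the output is d + g * s.
    """
    fused = []
    for d, s, wd, ws, b in zip(deep, shallow, w_deep, w_shallow, bias):
        g = sigmoid(wd * d + ws * s + b)  # gate value in (0, 1)
        fused.append(d + g * s)           # re-inject gated shallow detail
    return fused

# Toy vectors invented for illustration only.
deep = [0.2, -1.0, 0.5]
shallow = [1.0, 0.3, -0.4]
zeros = [0.0, 0.0, 0.0]

# With all-zero weights the gate is exactly 0.5 in every dimension.
fused = gated_fusion(deep, shallow, zeros, zeros, zeros)
```

In a trained model the weights would be learned, so the gate could open for frames where shallow acoustic detail matters and stay near zero elsewhere.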
Related events (8)
Acoustic cue alignment tokens improve speech emotion recognition in audio language models
Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.
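Unweighted average recall (UAR), the metric named above, is the mean of the per-class recalls, so a rare emotion class counts as much as a frequent one. The short sketch below shows the calculation on toy labels that are invented for illustration and are not taken from FAU-Aibo or IEMOCAP.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally regardless of size."""
    total = defaultdict(int)    # number of reference items per class
    correct = defaultdict(int)  # number of correctly predicted items per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy, imbalanced labels (hypothetical data).
y_true = ["neutral", "neutral", "neutral", "neutral", "angry", "angry"]
y_pred = ["neutral", "neutral", "neutral", "neutral", "angry", "neutral"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the recall is 1.0 for the majority class and 0.5 for the minority class, so UAR is 0.75 while plain accuracy is higher, which is why UAR is preferred on imbalanced emotion corpora.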
Audio Interaction Model: Unified Streaming LALM with Always-On Perceive-Decide-Respond Loop
Researchers introduce the Audio Interaction Model framework and a concrete implementation called Audio-Interaction, a unified streaming Large Audio Language Model that handles both offline tasks and real-time audio interaction through a continuous perceive-decide-respond loop. The system is built on SoundFlow, a framework covering data construction, training, and asynchronous low-latency inference. The authors also release StreamAudio-2M, a 2.6M-item streaming corpus spanning 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Evaluated across 8 benchmarks, the model preserves competitive offline performance while enabling real-time ASR, streaming instruction following, and proactive response capabilities not available in prior offline LALMs.
AuRA: Distilling audio understanding into LLMs via LoRA adaptation
AuRA is a new method for integrating speech understanding into LLMs by distilling audio encoding capability directly into LoRA-adapted model weights, bypassing cascaded ASR-LLM pipelines. A lightweight audio embedding layer feeds speech to both an ASR encoder (teacher) and a LoRA-adapted LLM (student), with layer-wise distillation aligning hidden states. The authors report that the approach outperforms cascaded systems, bridge-based adaptation baselines, and large-scale multimodal models on multiple speech-language benchmarks while enabling parallel end-to-end inference without large-scale multimodal training.
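The layer-wise distillation described above can be pictured as a loss that pulls selected student hidden states toward the corresponding teacher hidden states. The summary does not state the loss function or how layers are paired, so this sketch assumes mean squared error and a hand-written `layer_map`; both are illustrative assumptions, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def layerwise_distillation_loss(teacher_states, student_states, layer_map):
    """Average MSE over (teacher_layer, student_layer) pairs.

    The pairing in layer_map is a hypothetical choice made for this sketch.
    """
    losses = [mse(teacher_states[t], student_states[s]) for t, s in layer_map]
    return sum(losses) / len(losses)

# Toy hidden states: one vector per layer (invented for illustration).
teacher = [[1.0, 0.0], [0.5, 0.5]]               # 2 teacher layers
student = [[1.0, 0.0], [0.0, 0.0], [0.5, 1.5]]   # 3 student layers
layer_map = [(0, 0), (1, 2)]                     # teacher layer -> student layer

loss = layerwise_distillation_loss(teacher, student, layer_map)
```

In this toy case the first pair matches exactly and the second differs in one coordinate, giving a loss of 0.25. In a real setup only the LoRA parameters of the student would receive gradients from such a loss.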
USAD 2.0: Universal audio encoder scales to 1B parameters via representation distillation
USAD 2.0 is a new universal audio encoder that integrates knowledge from both self-supervised and supervised foundation models through domain-aware distillation, extending coverage to speech, music, and general audio domains. The model scales to one billion parameters via depth scaling and adds a second-stage supervised distillation step for downstream alignment with audio LLMs. Experiments report strong or state-of-the-art results across probing and LLM-based evaluations, addressing limitations of prior multi-domain encoders like USAD and SPEAR.
Meta Introduces SAM Audio: Unified Multimodal Model for Audio Separation with PE-AV, Benchmark, and Judge Model
Meta has released SAM Audio, a unified multimodal audio separation model that accepts text, visual, and temporal span prompts to isolate sounds from complex audio mixtures. The system is powered by Perception Encoder Audiovisual (PE-AV), an extension of Meta's open-source Perception Encoder released earlier in 2025, and uses a flow-matching diffusion transformer architecture. Alongside the model, Meta is releasing SAM Audio-Bench (the first in-the-wild audio separation benchmark) and SAM Audio Judge (an automatic evaluation model for audio separation). All components are available today via the Segment Anything Playground.
Apple's AToken: A Unified Multimodal Tokenizer and Encoder for Images, Videos, and 3D Objects
Apple researchers introduced AToken, a transformer model with a single 4D tokenizer and encoder-decoder architecture that handles images, videos, and 3D objects in a shared token space. The model is trained to both reconstruct and classify all three media types, using a pretrained SigLIP2 vision encoder extended to four dimensions with 4D Rotary Position Embedding. AToken approaches or matches specialized models on image classification (82.2% ImageNet), image reconstruction (0.21 rFID), and 3D reconstruction (28.28 PSNR), while remaining competitive on video tasks. The work addresses a longstanding tension between generation-focused and classification-focused encoders by forcing embeddings to retain both fine visual detail and semantic content.
AlignAtt4LLM adapts simultaneous speech translation policy to decoder-only LLMs for IWSLT 2026
Researchers present AlignAtt4LLM, a simultaneous speech translation system for IWSLT 2026 covering English to German, Italian, and Chinese. The system cascades Qwen3-ASR for incremental transcription with Gemma-4 E4B-it for translation, applying a novel AlignAtt policy adapted for decoder-only LLMs that lack encoder-decoder cross-attention. Key contributions include explicit source span prompting, offline alignment head selection, and query/key capture to recover a usable attention-based read/write policy. The system outperforms IWSLT 2026 baselines for European language pairs in both low- and high-latency regimes.
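The read/write decision behind an AlignAtt-style policy can be sketched as follows: a candidate target token is emitted only if the source position it attends to most is not among the last few positions received so far. Otherwise the system stops and waits for more input. This is a minimal illustration of the general AlignAtt idea, not the AlignAtt4LLM implementation. It assumes the attention rows have already been extracted from the selected alignment heads, and the names `alignatt_emit` and `guard` are hypothetical.

```python
def alignatt_emit(attn_rows, num_source, guard):
    """Return how many candidate tokens may be emitted (written).

    attn_rows[i] holds the attention of candidate token i over the
    num_source source positions received so far. Emission stops at the
    first token whose most-attended position lies in the last `guard`
    positions, since that part of the source may still be incomplete.
    """
    emitted = 0
    for row in attn_rows:
        top = max(range(num_source), key=lambda j: row[j])
        if top >= num_source - guard:
            break  # READ: wait for more source before committing
        emitted += 1  # WRITE: token is anchored on stable source context
    return emitted

# Toy attention matrix: 4 candidate tokens over 4 source positions
# (values invented for illustration).
attn = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.20, 0.70],  # attends mostly to the newest position
    [0.60, 0.20, 0.10, 0.10],
]

n = alignatt_emit(attn, num_source=4, guard=1)
```

With `guard=1` the first two tokens are written and the third triggers a read. A larger `guard` makes the policy more conservative, trading latency for stability.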
Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill
A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

