6 · arXiv cs.CL (Computation and Language) · 19d ago

UniAudio-Token: Semantic Speech Tokenizer with General Audio Perception for Audio-LLMs

UniAudio-Token is a framework from Tencent that extends semantic speech tokenizers—commonly used as interfaces for Audio-LLMs—to support general audio perception without sacrificing speech quality. It introduces two mechanisms: Semantic-Acoustic Primitives (SAP) for structured supervision decomposing audio into linguistic, vocal, and auditory-scene components, and Semantic-Acoustic Equilibrium (SAE), a content-aware gating mechanism that restores fine-grained acoustic details from shallow layers. Evaluations show it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks when integrated with downstream LLMs. Code, training/inference scripts, and model checkpoints are publicly released.
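The summary describes SAE only as a content-aware gate that restores acoustic detail from shallow layers. As a rough illustration of that idea, and not the released implementation, the sketch below blends one frame's deep and shallow features with a scalar sigmoid gate. The function name `gated_fusion`, the choice to compute the gate from the deep features, and the residual form `deep + gate * shallow` are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    deep: Sequence[float],
    shallow: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> Tuple[List[float], float]:
    """Blend deep (semantic) and shallow (acoustic) features for one frame.

    Hypothetical formulation: the gate is a scalar computed from the deep
    features, so the amount of shallow detail added back depends on the
    content of the frame.
    """
    if not (len(deep) == len(shallow) == len(gate_weights)):
        raise ValueError("deep, shallow, and gate_weights must have equal length")
    score = sum(w * d for w, d in zip(gate_weights, deep)) + gate_bias
    gate = sigmoid(score)
    # Residual form: keep the deep features and add a gated share of the shallow ones.
    fused = [d + gate * s for d, s in zip(deep, shallow)]
    return fused, gate


# With zero gate weights and zero bias the gate sits at 0.5,
# so half of each shallow feature is added to the deep feature.
fused, gate = gated_fusion([1.0, 2.0], [4.0, -2.0], [0.0, 0.0])
```

A strongly negative `gate_bias` drives the gate toward zero and leaves the deep features nearly unchanged, which is one way a tokenizer could avoid injecting acoustic detail into frames where it is not needed.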

Agent and Tool Ecosystem Multimodal Progress Audio-LLM UniAudio-Token Tencent Semantic-Acoustic Equilibrium Semantic-Acoustic Primitives

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Related events (8)

4 · arXiv · cs.CL · 12d ago

Acoustic cue alignment tokens improve speech emotion recognition in audio language models

Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.
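Unweighted average recall (UAR) is the mean of per-class recall, so every emotion class counts equally regardless of how many samples it has. The sketch below shows the standard definition. The labels in the usage example are made up for illustration and are not drawn from FAU-Aibo or IEMOCAP.

```python
from typing import Hashable, Sequence


def unweighted_average_recall(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have equal length")
    classes = sorted(set(y_true), key=str)
    if not classes:
        raise ValueError("y_true must not be empty")
    recalls = []
    for label in classes:
        # Recall for this class: correct predictions / samples of the class.
        support = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


# Illustrative imbalanced case: a model that always predicts the majority class.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
uar = unweighted_average_recall(y_true, y_pred)
```

In this toy case plain accuracy would be 0.8 while UAR is 0.5, which is why the metric is common for emotion corpora with skewed class distributions.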

Evaluation and Benchmarking Multimodal Progress FAU-Aibo Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition IEMOCAP +1 more

6 · arXiv · cs.AI · 16d ago

Audio Interaction Model: Unified Streaming LALM with Always-On Perceive-Decide-Respond Loop

Researchers introduce the Audio Interaction Model framework and a concrete implementation called Audio-Interaction, a unified streaming Large Audio Language Model that handles both offline tasks and real-time audio interaction through a continuous perceive-decide-respond loop. The system is built on SoundFlow, a framework covering data construction, training, and asynchronous low-latency inference. The authors also release StreamAudio-2M, a 2.6M-item streaming corpus spanning 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Evaluated across 8 benchmarks, the model preserves competitive offline performance while enabling real-time ASR, streaming instruction following, and proactive response capabilities not available in prior offline LALMs.

Frontier Model Releases Multimodal Progress Proactive-Sound-Bench Audio Interaction Model StreamAudio-2M +1 more

5 · arXiv · cs.CL · 10d ago

AuRA: Distilling audio understanding into LLMs via LoRA adaptation

AuRA is a new method for integrating speech understanding into LLMs by distilling audio encoding capability directly into LoRA-adapted model weights, bypassing cascaded ASR-LLM pipelines. A lightweight audio embedding layer feeds speech to both an ASR encoder (teacher) and a LoRA-adapted LLM (student), with layer-wise distillation aligning hidden states. The approach claims to outperform cascaded systems, bridge-based adaptation baselines, and large-scale multimodal models on multiple speech-language benchmarks while enabling parallel end-to-end inference without large-scale multimodal training.
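The summary says layer-wise distillation aligns the student's hidden states with the teacher's but does not name the loss. One common way to do this is a per-layer mean squared error averaged over layers. The sketch below assumes that choice, assumes teacher and student states already share a dimension, and uses the hypothetical name `layerwise_distillation_loss`. It should not be read as AuRA's actual objective.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_distillation_loss(
    teacher_states: Sequence[Sequence[float]],
    student_states: Sequence[Sequence[float]],
) -> float:
    """Average per-layer MSE between teacher and student hidden states.

    Assumption: layers are paired one-to-one and weighted uniformly.
    A real system might project between dimensions or weight layers differently.
    """
    if len(teacher_states) != len(student_states) or not teacher_states:
        raise ValueError("need the same non-zero number of layers on both sides")
    per_layer = [mse(t, s) for t, s in zip(teacher_states, student_states)]
    return sum(per_layer) / len(per_layer)


# Two layers: the first already matches, the second is off by 1.0 per dimension.
teacher = [[1.0, 2.0], [0.0, 0.0]]
student = [[1.0, 2.0], [1.0, 1.0]]
loss = layerwise_distillation_loss(teacher, student)
```

Averaging over layers keeps the loss scale independent of how many layers are paired, which makes it easier to combine with other training terms.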

Multimodal Progress LoRA AuRA

5 · arXiv · cs.CL · 15d ago

USAD 2.0: Universal audio encoder scales to 1B parameters via representation distillation

USAD 2.0 is a new universal audio encoder that integrates knowledge from both self-supervised and supervised foundation models through domain-aware distillation, extending coverage to speech, music, and general audio domains. The model scales to one billion parameters via depth scaling and adds a second-stage supervised distillation step for downstream alignment with audio LLMs. Experiments report strong or state-of-the-art results across probing and LLM-based evaluations, addressing limitations of prior multi-domain encoders like USAD and SPEAR.

Frontier Model Releases Multimodal Progress USAD SPEAR USAD 2.0

7 · Meta AI Blog · 1mo ago

Meta Introduces SAM Audio: Unified Multimodal Model for Audio Separation with PE-AV, Benchmark, and Judge Model

Meta has released SAM Audio, a unified multimodal audio separation model that accepts text, visual, and temporal span prompts to isolate sounds from complex audio mixtures. The system is powered by Perception Encoder Audiovisual (PE-AV), an extension of Meta's open-source Perception Encoder released earlier in 2025, and uses a flow-matching diffusion transformer architecture. Alongside the model, Meta is releasing SAM Audio-Bench (the first in-the-wild audio separation benchmark) and SAM Audio Judge (an automatic evaluation model for audio separation). All components are available today via the Segment Anything Playground.

Evaluation and Benchmarking Agent and Tool Ecosystem SAM Audio Judge Segment Anything Model 2 SAM Audio +7 more

6 · The Batch · 18d ago

Apple's AToken: A Unified Multimodal Tokenizer and Encoder for Images, Videos, and 3D Objects

Apple researchers introduced AToken, a transformer model with a single 4D tokenizer and encoder-decoder architecture that handles images, videos, and 3D objects in a shared token space. The model is trained to both reconstruct and classify all three media types, using a pretrained SigLIP2 vision encoder extended to four dimensions with 4D Rotary Position Embedding. AToken approaches or matches specialized models on image classification (82.2% ImageNet), image generation (0.21 rFID), and 3D reconstruction (28.28 PSNR), while remaining competitive on video tasks. The work addresses a longstanding tension between generation-focused and classification-focused encoders by forcing embeddings to retain both fine visual detail and semantic content.

Frontier Model Releases Multimodal Progress FLUX.1-dev Rotary Position Embedding (RoPE) Jiasen Lu +8 more

4 · arXiv · cs.CL · 17d ago

AlignAtt4LLM adapts simultaneous speech translation policy to decoder-only LLMs for IWSLT 2026

Researchers present AlignAtt4LLM, a simultaneous speech translation system for IWSLT 2026 covering English to German, Italian, and Chinese. The system cascades Qwen3-ASR for incremental transcription with Gemma-4 E4B-it for translation, applying a novel AlignAtt policy adapted for decoder-only LLMs that lack encoder-decoder cross-attention. Key contributions include explicit source span prompting, offline alignment head selection, and query/key capture to recover a usable attention-based read/write policy. The system outperforms IWSLT 2026 baselines for European language pairs in both low- and high-latency regimes.

Evaluation and Benchmarking Multimodal Progress Gemma-4 E4B-it IWSLT 2026 AlignAtt +2 more

4 · arXiv · cs.CL · 8d ago

Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill

A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

Evaluation and Benchmarking Multimodal Progress Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data SpeechMatrix CVSS-C +1 more