4 · arXiv cs.AI (Artificial Intelligence) · 5d ago

AudioDER: Deduplication-enhanced reasoning dataset for post-training large audio-language models

Researchers introduce AudioDER, a ~191k-sample post-training dataset for Large Audio-Language Models (LALMs) built via an acoustic similarity-based deduplication pipeline to reduce redundancy and improve corpus diversity. Each sample pairs an audio clip with a multiple-choice question, answer candidates, a caption, and a chain-of-thought rationale generated by Qwen3-30B. Post-training Qwen2-Audio-7B-Instruct on AudioDER yields consistent gains on audio reasoning benchmarks including MMAU-mini, MMSU, and MMAR. The work addresses a data quality gap in audio-language training rather than proposing a new model architecture.
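The summary does not spell out how the deduplication pipeline works, so the following is only a minimal sketch of one common way to do similarity-based deduplication, not the authors' method. It assumes each clip has already been mapped to a fixed-length acoustic embedding; the greedy strategy, the cosine metric, the `deduplicate` function name, and the 0.95 threshold are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(embeddings: Sequence[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Greedy near-duplicate removal (hypothetical helper).

    A clip is kept only if its similarity to every previously kept clip
    is below `threshold`. Returns the indices of kept clips in input order.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (made up): clips 0 and 1 are near-duplicates, clip 2 is distinct.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
kept_indices = deduplicate(toy_embeddings, threshold=0.95)
print(kept_indices)  # [0, 2]
```

The greedy pass compares each clip against every clip already kept, which is quadratic in the worst case; at the scale of a ~191k-sample corpus, an approximate nearest-neighbor index or clustering step would likely be needed instead.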

Evaluation and Benchmarking Multimodal Progress AudioDER Qwen2-Audio-7B-Instruct Qwen3-30B MMSU MMAU-mini MMAR

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 10d ago

AuRA: Distilling audio understanding into LLMs via LoRA adaptation

AuRA is a new method for integrating speech understanding into LLMs by distilling audio encoding capability directly into LoRA-adapted model weights, bypassing cascaded ASR-LLM pipelines. A lightweight audio embedding layer feeds speech to both an ASR encoder (teacher) and a LoRA-adapted LLM (student), with layer-wise distillation aligning hidden states. The authors report that the approach outperforms cascaded systems, bridge-based adaptation baselines, and large-scale multimodal models on multiple speech-language benchmarks while enabling parallel end-to-end inference without large-scale multimodal training.

Multimodal Progress LoRA AuRA

5 · Hugging Face Blog · 1mo ago

Evaluating Audio Reasoning with Big Bench Audio

Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.

Evaluation and Benchmarking Multimodal Progress Big Bench Audio Hugging Face Big Bench

6 · arXiv · cs.CL · 9d ago

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking Alignment and RLHF OpenMedReason OpenMedReason-Bench +1 more

4 · arXiv · cs.CL · 8d ago

Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill

A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

Evaluation and Benchmarking Multimodal Progress Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data SpeechMatrix CVSS-C +1 more

6 · Qwen Research · 1mo ago

Qwen2-Audio: Multimodal Audio-Language Model Release

Alibaba's Qwen team releases Qwen2-Audio, the successor to Qwen-Audio, which accepts both audio and text inputs and generates text outputs. The model is positioned as a step toward AGI by extending large language model capabilities to audio modalities. It is released with an accompanying paper, a GitHub repository, and model weights on Hugging Face and ModelScope.

Frontier Model Releases Open Weights Progress Alibaba Qwen Hugging Face +3 more

6 · arXiv · cs.AI · 16d ago

Audio Interaction Model: Unified Streaming LALM with Always-On Perceive-Decide-Respond Loop

Researchers introduce the Audio Interaction Model framework and a concrete implementation called Audio-Interaction, a unified streaming Large Audio Language Model that handles both offline tasks and real-time audio interaction through a continuous perceive-decide-respond loop. The system is built on SoundFlow, a framework covering data construction, training, and asynchronous low-latency inference. The authors also release StreamAudio-2M, a 2.6M-item streaming corpus spanning 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Evaluated across 8 benchmarks, the model preserves competitive offline performance while enabling real-time ASR, streaming instruction following, and proactive response capabilities not available in prior offline LALMs.

Frontier Model Releases Multimodal Progress Proactive-Sound-Bench Audio Interaction Model StreamAudio-2M +1 more

5 · arXiv · cs.CL · 15d ago

USAD 2.0: Universal audio encoder scales to 1B parameters via representation distillation

USAD 2.0 is a new universal audio encoder that integrates knowledge from both self-supervised and supervised foundation models through domain-aware distillation, extending coverage to speech, music, and general audio domains. The model scales to one billion parameters via depth scaling and adds a second-stage supervised distillation step for downstream alignment with audio LLMs. Experiments report strong or state-of-the-art results across probing and LLM-based evaluations, addressing limitations of prior multi-domain encoders like USAD and SPEAR.

Frontier Model Releases Multimodal Progress USAD SPEAR USAD 2.0

4 · arXiv · cs.CL · 5d ago

MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift

Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.

Evaluation and Benchmarking MoDiCoL