MMAE: First comprehensive benchmark for instruction-based audio editing across 7 modalities
Researchers introduce MMAE, a 2,000-sample benchmark for evaluating general-purpose instruction-based audio editing systems, covering 7 audio modalities (sound, speech, music, and mixtures) and 6 levels of task complexity. The benchmark uses a rubric-based evaluation framework decomposing tasks into 17,741 verifiable criteria to assess instruction following and context consistency. Evaluation of leading models reveals severe limitations: Exact Match Rate falls below 5% overall and hits 0% on complex mixed-modality tasks, exposing fundamental gaps in current audio editing systems.
Related guides (2)
Related events (8)
IndicContextEval: Benchmark for context utilisation in Audio LLMs across 8 Indic languages
Researchers introduce IndicContextEval, a 56-hour multilingual speech benchmark covering 555 speakers across 8 Indian languages and 23 professional domains, designed to test whether Audio LLMs genuinely use textual context (domain descriptions, entity lists) or rely on parametric knowledge. The benchmark employs a 7-level prompting framework that progressively introduces contextual signals including adversarial prompts with incorrect entities. Evaluation of five models reveals substantial variation in context utilisation behaviour, exposing a gap in existing ASR benchmarks that test only fixed prompting conditions.
Meta Introduces SAM Audio: Unified Multimodal Model for Audio Separation with PE-AV, Benchmark, and Judge Model
Meta has released SAM Audio, a unified multimodal audio separation model that accepts text, visual, and temporal span prompts to isolate sounds from complex audio mixtures. The system is powered by Perception Encoder Audiovisual (PE-AV), an extension of Meta's open-source Perception Encoder released earlier in 2025, and uses a flow-matching diffusion transformer architecture. Alongside the model, Meta is releasing SAM Audio-Bench (the first in-the-wild audio separation benchmark) and SAM Audio Judge (an automatic evaluation model for audio separation). All components are available today via the Segment Anything Playground.
Evaluating Audio Reasoning with Big Bench Audio
Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.
M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions
Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. Existing benchmarks are critiqued for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.
DirectAudioEdit: Training-free, inversion-free text-guided audio editing via diffusion prediction contrast
Researchers introduce DirectAudioEdit, the first training-free and inversion-free method for text-guided audio editing using diffusion denoising dynamics. The approach constructs a source-to-target editing path without requiring DDPM inversion, reducing macro-averaged FAD and KL divergence by ~16% compared to inversion-based baselines while achieving up to 64.5% speedup. Experiments span music and event-level benchmarks across two backbone architectures.
Acoustic cue alignment tokens improve speech emotion recognition in audio language models
Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.
AudioDER: Deduplication-enhanced reasoning dataset for post-training large audio-language models
Researchers introduce AudioDER, a ~191k-sample post-training dataset for Large Audio-Language Models (LALMs) built via an acoustic similarity-based deduplication pipeline to reduce redundancy and improve corpus diversity. Each sample pairs an audio clip with a multiple-choice question, answer candidates, a caption, and a chain-of-thought rationale generated by Qwen3-30B. Post-training Qwen2-Audio-7B-Instruct on AudioDER yields consistent gains on audio reasoning benchmarks including MMAU-mini, MMSU, and MMAR. The work addresses a data quality gap in audio-language training rather than proposing a new model architecture.
MATCHA: Contrastive Semantic Alignment Metric for LLM Evaluation
MATCHA is a new automatic evaluation metric for LLMs that addresses a fundamental flaw in existing metrics: both token-overlap (ROUGE) and embedding-based (BERTScore) metrics routinely assign near-identical scores to semantically contradictory texts. The metric uses a dual-view approach that rewards proximity to a gold reference while penalizing adversarially generated counterfactual contradictions. Evaluated across eight benchmarks spanning QA, summarization, NLI, and semantic similarity tasks, MATCHA outperforms 23 embedding models and achieves 18.38% and 20.82% improvements over ROUGE-L and BERTScore respectively on TruthfulQA. Code and metric are publicly released.

