5 · arXiv cs.CL (Computation and Language) · 12d ago

MMAE: First comprehensive benchmark for instruction-based audio editing across 7 modalities

Researchers introduce MMAE, a 2,000-sample benchmark for evaluating general-purpose instruction-based audio editing systems, covering 7 audio modalities (sound, speech, music, and mixtures) and 6 levels of task complexity. The benchmark uses a rubric-based evaluation framework decomposing tasks into 17,741 verifiable criteria to assess instruction following and context consistency. Evaluation of leading models reveals severe limitations: Exact Match Rate falls below 5% overall and hits 0% on complex mixed-modality tasks, exposing fundamental gaps in current audio editing systems.
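A rough illustration of how rubric-style scoring can produce such low numbers, assuming that Exact Match Rate counts a sample as correct only when every one of its criteria is satisfied. The summary does not give the paper's exact definition, so that reading, the `SampleResult` type, and the toy data are all hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Hypothetical container: one edited sample and its per-criterion verdicts."""
    sample_id: str
    criteria_passed: List[bool]


def criterion_pass_rate(results: List[SampleResult]) -> float:
    """Fraction of individual criteria satisfied, pooled over all samples."""
    total = sum(len(r.criteria_passed) for r in results)
    passed = sum(sum(r.criteria_passed) for r in results)
    return passed / total if total else 0.0


def exact_match_rate(results: List[SampleResult]) -> float:
    """Fraction of samples whose criteria are ALL satisfied (assumed definition)."""
    if not results:
        return 0.0
    # A sample with no criteria is not counted as a match.
    matches = sum(
        bool(r.criteria_passed) and all(r.criteria_passed) for r in results
    )
    return matches / len(results)


# Toy data, not taken from the benchmark.
results = [
    SampleResult("a", [True, True, True]),
    SampleResult("b", [True, False, True, True]),
    SampleResult("c", [False, False]),
    SampleResult("d", [True, True, True]),
]

print(criterion_pass_rate(results))  # 9 of 12 criteria pass
print(exact_match_rate(results))     # 2 of 4 samples pass every criterion
```

Under this reading, a single failed criterion zeroes out the whole sample, so a system can satisfy most criteria and still score very low on exact match.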

Evaluation and Benchmarking Multimodal Progress MMAE Gemini Omni Nano Banana 2

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · arXiv · cs.CL · 2d ago

IndicContextEval: Benchmark for context utilisation in Audio LLMs across 8 Indic languages

Researchers introduce IndicContextEval, a 56-hour multilingual speech benchmark covering 555 speakers across 8 Indian languages and 23 professional domains, designed to test whether Audio LLMs genuinely use textual context (domain descriptions, entity lists) or rely on parametric knowledge. The benchmark employs a 7-level prompting framework that progressively introduces contextual signals including adversarial prompts with incorrect entities. Evaluation of five models reveals substantial variation in context utilisation behaviour, exposing a gap in existing ASR benchmarks that test only fixed prompting conditions.

Evaluation and Benchmarking Multimodal Progress IndicContextEval

7 · Meta AI Blog · 1mo ago

Meta Introduces SAM Audio: Unified Multimodal Model for Audio Separation with PE-AV, Benchmark, and Judge Model

Meta has released SAM Audio, a unified multimodal audio separation model that accepts text, visual, and temporal span prompts to isolate sounds from complex audio mixtures. The system is powered by Perception Encoder Audiovisual (PE-AV), an extension of Meta's open-source Perception Encoder released earlier in 2025, and uses a flow-matching diffusion transformer architecture. Alongside the model, Meta is releasing SAM Audio-Bench (the first in-the-wild audio separation benchmark) and SAM Audio Judge (an automatic evaluation model for audio separation). All components are available today via the Segment Anything Playground.

Evaluation and Benchmarking Agent and Tool Ecosystem SAM Audio Judge Segment Anything Model 2 SAM Audio +7 more

5 · Hugging Face Blog · 1mo ago

Evaluating Audio Reasoning with Big Bench Audio

Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.

Evaluation and Benchmarking Multimodal Progress Big Bench Audio Hugging Face Big Bench

5 · arXiv · cs.CL · 12d ago

M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions

Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. The authors critique existing benchmarks for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.

Evaluation and Benchmarking Agent and Tool Ecosystem M³Exam M³Proctor +1 more

4 · arXiv · cs.CL · 12d ago

DirectAudioEdit: Training-free, inversion-free text-guided audio editing via diffusion prediction contrast

Researchers introduce DirectAudioEdit, the first training-free and inversion-free method for text-guided audio editing using diffusion denoising dynamics. The approach constructs a source-to-target editing path without requiring DDPM inversion, reducing macro-averaged FAD and KL divergence by ~16% compared to inversion-based baselines while achieving a speedup of up to 64.5%. Experiments span music and event-level benchmarks across two backbone architectures.

Multimodal Progress DirectAudioEdit DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

4 · arXiv · cs.CL · 12d ago

Acoustic cue alignment tokens improve speech emotion recognition in audio language models

Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.
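For context on the metric named above: unweighted average recall (UAR) is the mean of per-class recall, so each emotion class counts equally regardless of how many samples it has. A minimal sketch with made-up labels (the function name and data are illustrative, not from the paper):

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_average_recall(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Imbalanced toy example: 8 "neutral" clips, 2 "angry" clips.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
print(accuracy, uar)
```

Here plain accuracy is 0.9 while UAR is 0.75, because the missed minority-class clip weighs as much as the whole majority class. That property is why UAR is commonly reported on imbalanced emotion datasets.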

Evaluation and Benchmarking Multimodal Progress FAU-Aibo Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition IEMOCAP +1 more

4 · arXiv · cs.AI · 5d ago

AudioDER: Deduplication-enhanced reasoning dataset for post-training large audio-language models

Researchers introduce AudioDER, a ~191k-sample post-training dataset for Large Audio-Language Models (LALMs) built via an acoustic similarity-based deduplication pipeline to reduce redundancy and improve corpus diversity. Each sample pairs an audio clip with a multiple-choice question, answer candidates, a caption, and a chain-of-thought rationale generated by Qwen3-30B. Post-training Qwen2-Audio-7B-Instruct on AudioDER yields consistent gains on audio reasoning benchmarks including MMAU-mini, MMSU, and MMAR. The work addresses a data quality gap in audio-language training rather than proposing a new model architecture.

Evaluation and Benchmarking Multimodal Progress AudioDER Qwen2-Audio-7B-Instruct Qwen3-30B +3 more

6 · arXiv · cs.CL · 24d ago

MATCHA: Contrastive Semantic Alignment Metric for LLM Evaluation

MATCHA is a new automatic evaluation metric for LLMs that addresses a fundamental flaw in existing metrics: both token-overlap (ROUGE) and embedding-based (BERTScore) metrics routinely assign near-identical scores to semantically contradictory texts. The metric uses a dual-view approach that rewards proximity to a gold reference while penalizing adversarially generated counterfactual contradictions. Evaluated across eight benchmarks spanning QA, summarization, NLI, and semantic similarity tasks, MATCHA outperforms 23 embedding models and achieves 18.38% and 20.82% improvements over ROUGE-L and BERTScore respectively on TruthfulQA. Code and metric are publicly released.

Evaluation and Benchmarking AI Safety Research TruthfulQA ROUGE-L Siran Li +3 more