benchmark

M³Exam

benchmark · active · provisional · m-exam-6a47f7ef · 1 event · first seen 9d ago

Aliases: M³Exam

Co-occurring entities

M³Proctor

More like this (12)

M³Proctor · MVBench · MLE Bench Lite · MMLU · MATH benchmark · MMLU-Pro · MMVU · MemBench · HM3D · Query Monitor · Exa · METR

Recent events (1)

5 · arXiv · cs.CL · 9d ago

M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions

Researchers introduce M³Exam, a query-centric benchmark for multimodal conversational memory, designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and inference of implicit information. The authors critique existing benchmarks for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · M³Exam · M³Proctor · +1 more