
Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models




arXiv · cs.LG · 8h ago

Facet-Probe audit finds all 18 audited MLLMs are order-sensitive, with per-facet flip rates of 24–50%

Researchers introduce Facet-Probe, a five-facet audit framework that tests order sensitivity across 18 frontier and open-weight multimodal LLMs. None of the models is order-invariant, and per-facet flip rates span 24–50%. A Bayesian item-response model separates ordering noise from bias, and a Gemini temperature-0 control confirms that the flips exceed decoder stochasticity. Even the best model flips on 13.4% of trials, and prompt-level mitigations are modality-conditional and do not transfer from text to visual reasoning. The authors propose cross-ordering flip rate as a standard reporting axis for MLLM evaluations.
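
To make the proposed reporting axis concrete, here is a minimal sketch of one plausible way to compute a cross-ordering flip rate. The summary does not give the paper's exact definition, so this assumes a "flip" means that the same item receives different answers under at least two orderings of the same evidence. The function name `cross_ordering_flip_rate`, the record layout, and the sample data are all hypothetical and are not taken from Facet-Probe.

```python
from collections import defaultdict


def cross_ordering_flip_rate(records):
    """Fraction of items whose answer changes across evidence orderings.

    Assumption (not from the paper): `records` is an iterable of
    (item_id, ordering_id, answer) tuples, and an item counts as
    flipped if its answers are not identical across all orderings.
    """
    answers_by_item = defaultdict(set)
    for item_id, _ordering_id, answer in records:
        answers_by_item[item_id].add(answer)
    if not answers_by_item:
        raise ValueError("no records to score")
    # An item with more than one distinct answer has flipped.
    flipped = sum(1 for answers in answers_by_item.values() if len(answers) > 1)
    return flipped / len(answers_by_item)


# Made-up illustrative data: four items, each asked under two orderings.
records = [
    ("q1", "AB", "yes"), ("q1", "BA", "yes"),
    ("q2", "AB", "yes"), ("q2", "BA", "no"),
    ("q3", "AB", "cat"), ("q3", "BA", "cat"),
    ("q4", "AB", "3"), ("q4", "BA", "4"),
]
print(cross_ordering_flip_rate(records))  # 0.5
```

Under this assumed definition, a model that answers identically regardless of ordering scores 0.0. A per-trial or pairwise definition would give different numbers, so the definition should be stated alongside any reported rate.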