paper
Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models
paper · active · provisional
same-evidence-different-answer-auditing-order-sensitivity-in-multimodal-large-language-models-daaadc56 · 1 event · first seen 8h ago
Aliases: Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models
More like this (12)
Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models
The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models
Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models
Multimodal Large Language Models
Evaluation Awareness Is Not One Capability: Evidence from Open Language Models
Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness
Automated reproducibility assessments in the social and behavioral sciences using large language models
multimodal classification models
Civil Court Simulation with Large Language Models
Latent World Recovery for Multimodal Learning with Missing Modalities
Recent events (1)
Facet-Probe audit finds all 18 frontier MLLMs exhibit significant order sensitivity, with flip rates of 24–50%
Researchers introduce Facet-Probe, a five-facet audit framework testing order sensitivity across 18 frontier and open-weight multimodal LLMs, finding none are order-invariant with per-facet flip rates spanning 24–50%. A Bayesian item-response model separates ordering noise from bias, and a Gemini temperature-0 control confirms the flips exceed decoder stochasticity. Even the best model flips on 13.4% of trials, and prompt-level mitigations are modality-conditional and do not transfer from text to visual reasoning. The authors propose cross-ordering flip rate as a standard reporting axis for MLLM evaluations.
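To make the proposed reporting axis concrete, here is a minimal sketch of how a cross-ordering flip rate could be computed. The summary above does not give the paper's exact definition, so this assumes an item "flips" when the same evidence, presented in two or more orderings, yields more than one distinct answer. The function name `cross_ordering_flip_rate` and the `(item_id, ordering_id, answer)` record layout are hypothetical and are not taken from Facet-Probe.

```python
from collections import defaultdict


def cross_ordering_flip_rate(records):
    """Fraction of items whose answer changes across evidence orderings.

    ASSUMPTION: a "flip" means that one item, shown under two or more
    orderings of the same evidence, produced more than one distinct answer.
    This is an illustrative definition, not necessarily the paper's.

    records: iterable of (item_id, ordering_id, answer) tuples.
    """
    answers = defaultdict(set)    # item_id -> distinct normalized answers
    orderings = defaultdict(set)  # item_id -> orderings actually observed

    for item_id, ordering_id, answer in records:
        answers[item_id].add(answer.strip().lower())
        orderings[item_id].add(ordering_id)

    # Only items seen under at least two orderings can flip at all.
    eligible = [i for i in answers if len(orderings[i]) >= 2]
    if not eligible:
        raise ValueError("need at least one item with two or more orderings")

    flipped = sum(1 for i in eligible if len(answers[i]) > 1)
    return flipped / len(eligible)


# Toy data: q1 is stable, q2 flips, q3 has a single ordering and is skipped.
toy_records = [
    ("q1", "order_a", "Yes"),
    ("q1", "order_b", "yes"),
    ("q2", "order_a", "Cat"),
    ("q2", "order_b", "Dog"),
    ("q3", "order_a", "Blue"),
]
toy_rate = cross_ordering_flip_rate(toy_records)  # 1 of 2 eligible items flips
```

A raw rate like this does not separate ordering effects from sampling noise. The summary notes that the authors address this with a Bayesian item-response model and a temperature-0 control, neither of which is reproduced here.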