
Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

paper · active · provisional · 1 event · first seen 8h ago


Co-occurring entities

Gemini 3.1 Pro · Google · Facet-Probe

More like this (12)

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models
The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models
Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models
Multimodal Large Language Models
Evaluation Awareness Is Not One Capability: Evidence from Open Language Models
Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness
Automated reproducibility assessments in the social and behavioral sciences using large language models
multimodal classification models
Civil Court Simulation with Large Language Models
Latent World Recovery for Multimodal Learning with Missing Modalities

Recent events (1)

6 · arXiv · cs.LG · 8h ago

Facet-Probe audit finds all 18 audited MLLMs exhibit significant order sensitivity, with per-facet flip rates of 24–50%

Researchers introduce Facet-Probe, a five-facet audit framework that tests order sensitivity in 18 frontier and open-weight multimodal LLMs. None of the models is order-invariant, and per-facet flip rates span 24–50%. A Bayesian item-response model separates ordering noise from bias, and a temperature-0 control run on Gemini confirms that the flips exceed decoder stochasticity. Even the best model still flips on 13.4% of trials. Prompt-level mitigations are modality-conditional: they do not transfer from text to visual reasoning. The authors propose the cross-ordering flip rate as a standard reporting axis for MLLM evaluations.
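The summary does not define the cross-ordering flip rate, so the following is only a minimal sketch under an assumed definition: an item counts as "flipped" if the model's answers to it are not identical across all tested orderings of the same evidence. The names `orderings`, `flip_rate`, and `toy_responses` are hypothetical and are not taken from the paper or from Facet-Probe.

```python
from collections import defaultdict
from itertools import permutations


def orderings(evidence):
    """Yield every ordering of the evidence pieces for one item (assumed setup)."""
    return permutations(evidence)


def flip_rate(responses):
    """Fraction of items whose answer changes across orderings.

    `responses` is an iterable of (item_id, ordering_id, answer) tuples.
    Assumed definition: an item flips if it received more than one
    distinct answer across the orderings it was tested under.
    """
    answers_by_item = defaultdict(set)
    for item_id, _ordering_id, answer in responses:
        answers_by_item[item_id].add(answer)
    if not answers_by_item:
        raise ValueError("no responses given")
    flipped = sum(1 for answers in answers_by_item.values() if len(answers) > 1)
    return flipped / len(answers_by_item)


# Toy data (invented for illustration): four items, two orderings each.
# Only item "q2" changes its answer when the evidence order changes.
toy_responses = [
    ("q1", "image-first", "yes"), ("q1", "text-first", "yes"),
    ("q2", "image-first", "yes"), ("q2", "text-first", "no"),
    ("q3", "image-first", "B"), ("q3", "text-first", "B"),
    ("q4", "image-first", "7"), ("q4", "text-first", "7"),
]

print(flip_rate(toy_responses))  # 0.25
```

Under the same assumption, a decoder-noise baseline in the spirit of the temperature-0 control could be estimated by reusing `flip_rate` with a repeat index in place of `ordering_id`, that is, by rerunning one fixed ordering several times and checking how often the answer changes anyway.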

Evaluation and Benchmarking · AI Safety Research · Gemini 3.1 Pro · Google · Facet-Probe