benchmark
Facet-Probe
benchmarkactiveprovisional
facet-probe-f7eddd54·1 events·first seen 9h agoAliases: Facet-Probe
Co-occurring entities
More like this (12)
Recent events (1)
Facet-Probe audit finds all 18 frontier MLLMs exhibit significant order sensitivity, with flip rates of 24–50%
Researchers introduce Facet-Probe, a five-facet audit framework testing order sensitivity across 18 frontier and open-weight multimodal LLMs, finding none are order-invariant with per-facet flip rates spanning 24–50%. A Bayesian item-response model separates ordering noise from bias, and a Gemini temperature-0 control confirms the flips exceed decoder stochasticity. Even the best model flips on 13.4% of trials, and prompt-level mitigations are modality-conditional and do not transfer from text to visual reasoning. The authors propose cross-ordering flip rate as a standard reporting axis for MLLM evaluations.