Facet-Probe

benchmark · active · provisional · facet-probe-f7eddd54 · 1 event · first seen 9h ago

Aliases: Facet-Probe

Recent events (1)

6 · arXiv · cs.LG · 9h ago

Facet-Probe audit finds all 18 tested frontier and open-weight MLLMs are order-sensitive, with per-facet flip rates of 24–50%

Researchers introduce Facet-Probe, a five-facet audit framework that tests order sensitivity across 18 frontier and open-weight multimodal LLMs (MLLMs). None of the models is order-invariant: per-facet flip rates span 24–50%, and even the best model flips on 13.4% of trials. A Bayesian item-response model separates ordering noise from bias, and a Gemini temperature-0 control confirms that the flips exceed decoder stochasticity. Prompt-level mitigations are modality-conditional and do not transfer from text to visual reasoning. The authors propose cross-ordering flip rate as a standard reporting axis for MLLM evaluations.
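
To make the proposed reporting axis concrete, here is a minimal Python sketch of one plausible way to compute a cross-ordering flip rate. The summary does not give the authors' exact definition, so this is an assumption: an item counts as "flipped" if the model's answer differs between any two orderings of that item. The function name `cross_ordering_flip_rate`, the trial tuple layout, and the example data are all hypothetical.

```python
from collections import defaultdict


def cross_ordering_flip_rate(trials):
    """Fraction of items whose answer changes across orderings.

    `trials` is an iterable of (item_id, ordering_id, answer) tuples.
    Assumption: `answer` is canonicalized to the option's content (for
    example the label text), not its position letter, so that a reordered
    prompt with the same chosen option compares equal.
    """
    answers = defaultdict(set)    # item_id -> distinct answers seen
    orderings = defaultdict(set)  # item_id -> distinct orderings seen
    for item_id, ordering_id, answer in trials:
        answers[item_id].add(answer)
        orderings[item_id].add(ordering_id)

    # Only items observed under at least two orderings can flip.
    eligible = [i for i in answers if len(orderings[i]) >= 2]
    if not eligible:
        raise ValueError("need at least one item with two or more orderings")

    flipped = sum(1 for i in eligible if len(answers[i]) > 1)
    return flipped / len(eligible)


# Hypothetical toy data: four items, each asked under two orderings.
example_trials = [
    ("q1", "original", "cat"), ("q1", "reversed", "cat"),
    ("q2", "original", "dog"), ("q2", "reversed", "bird"),
    ("q3", "original", "red"), ("q3", "reversed", "red"),
    ("q4", "original", "7"),   ("q4", "reversed", "9"),
]

if __name__ == "__main__":
    print(cross_ordering_flip_rate(example_trials))
```

In the toy data, two of the four items (q2 and q4) change answer when the ordering changes, so the sketch reports a flip rate of 0.5. A per-facet rate like those in the summary could be obtained by calling the function separately on each facet's trials.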