benchmark

Facet-Probe

benchmark · active · provisional · facet-probe-f7eddd54 · 1 event · first seen 9h ago

Aliases: Facet-Probe

Co-occurring entities

Gemini 3.1 Pro · Google · Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

More like this (12)

Reverse Probing Unified Latent Probe Probe Trajectories Text-Only Probe Chain-Text Probe paired-scenario forced-choice probe MemProbe DeepSeek-Prover-V2-7B DeepSeek-V4-Pro Preview probing classifiers ProActEval Future Probe Controlled Generation

Recent events (1)

6 · arXiv · cs.LG · 9h ago

Facet-Probe audit finds all 18 audited MLLMs exhibit significant order sensitivity, with per-facet flip rates of 24–50%

Researchers introduce Facet-Probe, a five-facet audit framework that tests order sensitivity across 18 frontier and open-weight multimodal LLMs. None is order-invariant: per-facet flip rates span 24–50%, and even the best model flips on 13.4% of trials. A Bayesian item-response model separates ordering noise from bias, and a Gemini temperature-0 control confirms that the flips exceed decoder stochasticity. Prompt-level mitigations are modality-conditional and do not transfer from text to visual reasoning. The authors propose cross-ordering flip rate as a standard reporting axis for MLLM evaluations.
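The summary does not state how the paper defines a flip, so the following is only a minimal sketch of how a per-facet cross-ordering flip rate could be computed. It assumes a flip means that a model's answer to the same item differs between at least two orderings of the same evidence. `Trial`, `flip_rates`, the facet labels, and the demo data are hypothetical names chosen for illustration, not taken from Facet-Probe.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Trial(NamedTuple):
    item_id: str   # hypothetical identifier for one question
    facet: str     # hypothetical facet label
    ordering: str  # label for the evidence ordering, e.g. "AB" or "BA"
    answer: str    # the model's final answer under that ordering


def flip_rates(trials: Iterable[Trial]) -> dict[str, float]:
    """Per-facet share of items whose answer differs across orderings.

    Assumption: an item counts as flipped if it produced more than one
    distinct answer across the orderings it was tested under.
    """
    answers = defaultdict(set)    # (facet, item_id) -> distinct answers
    orderings = defaultdict(set)  # (facet, item_id) -> orderings seen
    for t in trials:
        key = (t.facet, t.item_id)
        answers[key].add(t.answer)
        orderings[key].add(t.ordering)

    flips = defaultdict(int)
    totals = defaultdict(int)
    for (facet, _item), distinct in answers.items():
        # Items seen under a single ordering cannot flip, so skip them.
        if len(orderings[(facet, _item)]) < 2:
            continue
        totals[facet] += 1
        flips[facet] += int(len(distinct) > 1)

    return {facet: flips[facet] / totals[facet] for facet in totals}


# Made-up demo data, not results from the paper.
demo = [
    Trial("q1", "spatial", "AB", "cat"),
    Trial("q1", "spatial", "BA", "dog"),   # answer changes: a flip
    Trial("q2", "spatial", "AB", "left"),
    Trial("q2", "spatial", "BA", "left"),  # stable
    Trial("q3", "count", "AB", "3"),
    Trial("q3", "count", "BA", "3"),       # stable
    Trial("q4", "count", "AB", "7"),       # only one ordering: skipped
]
print(flip_rates(demo))
```

Running the same computation on repeated temperature-0 samples under a fixed ordering would give a baseline for decoder stochasticity, which is the kind of comparison the Gemini control in the summary describes.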

Evaluation and Benchmarking · AI Safety Research · Gemini 3.1 Pro · Google · Facet-Probe