Facet-Probe audit finds all 18 frontier MLLMs exhibit significant order sensitivity, with flip rates of 24–50%
Researchers introduce Facet-Probe, a five-facet audit framework that tests order sensitivity across 18 frontier and open-weight multimodal LLMs. None of the models is order-invariant, and per-facet flip rates span 24–50%. A Bayesian item-response model separates ordering noise from bias, and a Gemini temperature-0 control confirms that the flips exceed decoder stochasticity. Even the best model flips on 13.4% of trials, and prompt-level mitigations are modality-conditional: they do not transfer from text to visual reasoning. The authors propose cross-ordering flip rate as a standard reporting axis for MLLM evaluations.
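To make the proposed reporting axis concrete, here is a minimal Python sketch of a cross-ordering flip rate. The summary does not give the paper's exact definition, so this assumes the simplest reading: an item counts as a flip when the model's answer (mapped back to a canonical label) differs between at least two orderings of the same input. The function name `flip_rate` and the input layout are hypothetical.

```python
from typing import Sequence


def flip_rate(answers_by_ordering: Sequence[Sequence[str]]) -> float:
    """Fraction of items whose answer changes across input orderings.

    answers_by_ordering[k][i] is the model's answer to item i under
    ordering k, already mapped back to a canonical label so that answers
    are comparable across orderings (an assumption of this sketch).
    """
    if not answers_by_ordering or not answers_by_ordering[0]:
        raise ValueError("need at least one ordering and one item")

    n_items = len(answers_by_ordering[0])
    if any(len(run) != n_items for run in answers_by_ordering):
        raise ValueError("every ordering must cover the same items")

    flips = 0
    for i in range(n_items):
        # An item flips if the orderings do not all agree on its answer.
        distinct_answers = {run[i] for run in answers_by_ordering}
        if len(distinct_answers) > 1:
            flips += 1
    return flips / n_items


# Two orderings of the same four items; items 1 and 3 change answer.
original = ["A", "B", "C", "D"]
reordered = ["A", "C", "C", "A"]
print(flip_rate([original, reordered]))
```

Under this definition an order-invariant model scores 0.0. Separating genuine order sensitivity from sampling noise would additionally require repeated runs at fixed ordering, which is the role the summary attributes to the temperature-0 control.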
Related events (8)
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling
This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.
Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance
A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.
Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation
This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: on the GSM8K, MATH, and HotpotQA benchmarks, meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency about 19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps; two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and pre-registered replication.
Auditing Asset-Specific Preferences in Financial LLMs: Bitcoin Representations and Portfolio Allocation
Researchers develop a three-level audit protocol to test whether LLMs carry built-in biases toward specific financial assets, applying it to Bitcoin across eight frontier models. Using sparse autoencoder features in Gemma 3, they identify a dominant Bitcoin-selective internal feature whose amplification raises Bitcoin's portfolio share by 5.2 percentage points and suppression lowers it by 4.6 pp, even when 'Bitcoin' never appears in the prompt. The work introduces the concept of 'bounded behavioral leverage'—causal influence over outputs via identifiable internal representations—and frames the framework as a foundation for 'know-your-agent' (KYA) standards for autonomous financial agents.
Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods
A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.
PAC-Bayes analysis establishes formal expressivity and alignment floors for prompt-conditioned LLMs
A new arXiv preprint models user-LLM interaction as a bilevel cheap-talk game and derives PAC-Bayes bounds showing two irreducible limitations: an 'expressivity floor' where language's finite channel capacity makes distinct tasks indistinguishable, and an 'objective-misalignment floor' where alignment constraints prevent reaching user-ideal outputs. The authors prove that prompt-conditioned LLMs cannot be universal problem solvers, as correct behavior on certain task families is provably unattainable even with infinite data, optimal training, or model scaling. The work suggests multimodal inputs and external memory as potential mitigations by increasing task-relevant information bandwidth.
Decomposing factual sycophancy in LLMs: size and instruction tuning shape robustness differently
A new arXiv paper decomposes factual sycophancy — where a model abandons a correct answer under social pressure — into two distinct mechanisms: truth margin (baseline preference for correct answers) and manipulation sensitivity (how much pressure shifts that preference). Evaluating 56 open-weight models from 0.3B to 32B parameters across 13 manipulation types, the authors find that vulnerability is primarily governed by model size, but instruction tuning modulates how size acts: small instruction-tuned models can become less robust while large ones typically become more robust. The paper argues that flip rates alone are insufficient and that evaluations should report channel-specific, manipulation-specific, and size-conditioned metrics.
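One way to picture the two-part decomposition is sketched below in Python. The summary does not state how the paper measures either quantity, so everything here is an assumption for illustration: truth margin is taken as the log-probability gap between the correct and incorrect answer without pressure, manipulation sensitivity as the drop in that gap under pressure, and a flip as the gap changing from positive to non-positive. The names `SycophancyDecomposition` and `decompose` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SycophancyDecomposition:
    truth_margin: float              # baseline preference for the correct answer
    manipulation_sensitivity: float  # how far pressure shifts that preference
    flipped: bool                    # correct at baseline, incorrect under pressure


def decompose(
    logp_correct_base: float,
    logp_wrong_base: float,
    logp_correct_pressured: float,
    logp_wrong_pressured: float,
) -> SycophancyDecomposition:
    """Split one item's behavior into margin and sensitivity (illustrative only)."""
    # Assumed definition: margin is the log-probability gap without pressure.
    truth_margin = logp_correct_base - logp_wrong_base
    pressured_margin = logp_correct_pressured - logp_wrong_pressured
    # Assumed definition: sensitivity is the drop in margin caused by pressure.
    sensitivity = truth_margin - pressured_margin
    # A flip needs a correct baseline and a pressured margin at or below zero,
    # i.e. sensitivity at least as large as the truth margin.
    flipped = truth_margin > 0 and pressured_margin <= 0
    return SycophancyDecomposition(truth_margin, sensitivity, flipped)


result = decompose(-0.5, -2.0, -1.5, -1.0)
print(result)
```

In this reading, two models with the same flip rate can differ in why they flip: one may start from a thin margin, the other may be moved a long way by pressure. That is consistent with the summary's point that flip rates alone are insufficient.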
Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
This paper investigates multi-objective prompt optimization for LLM-as-judge systems, testing five decomposition modes of textual gradient optimizers across varying levels of cross-task information sharing. In 6 of 10 configurations, optimization fails to improve over the initial prompt, with gradient specificity dropping 59% when multiple criteria are processed jointly. The authors identify two separable failure modes: gradient dilution at optimization time and instruction interference at inference time. These findings constrain the design space for customizing LLM judges via textual feedback across multiple evaluation criteria simultaneously.


