5 · arXiv cs.CL (Computation and Language) · 47h ago

Study finds no detectable self-preference bias when LLMs revise their own instruction-following drafts

A new arXiv preprint tests whether LLMs resist valid corrections to their own writing. It uses IFEval's deterministic verifier to establish ground-truth correctness, which avoids model-as-judge subjectivity. Across four mid-tier model families and 85 author-versus-fresh comparisons, no statistically significant self-preference bias was detected (gap -5.1 pp, 95% CI [-12.9, +2.7], an interval that includes zero). Qualitatively, when authoring models do reject verified-good fixes, 97% of their stated reasons are substantive flaw-catching rather than preference. The result challenges the assumption that the self-preference documented in judging tasks extends to self-revision.
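To make the reported interval concrete, the sketch below back-calculates an approximate standard error and two-sided p-value from the summary's numbers. It assumes the interval is a symmetric normal-approximation (Wald) interval; the summary does not say how the preprint computed it, so treat the output as a rough reading. The function name `summarize_interval` is hypothetical.

```python
from statistics import NormalDist


def summarize_interval(gap_pp, ci_low, ci_high, level=0.95):
    """Back out SE, z and p from a reported gap and CI.

    Assumption (not stated in the source): the CI is a symmetric
    normal-approximation interval, i.e. gap +/- z_crit * SE.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    half_width = (ci_high - ci_low) / 2
    se = half_width / z_crit
    z = gap_pp / se
    p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))
    return {
        "midpoint": (ci_low + ci_high) / 2,        # should match the gap if symmetric
        "se": se,
        "z": z,
        "p_two_sided": p_two_sided,
        "includes_zero": ci_low <= 0 <= ci_high,
    }


# Numbers taken from the summary above.
result = summarize_interval(gap_pp=-5.1, ci_low=-12.9, ci_high=2.7)
print(result)
```

Under that assumption the interval is centred on the reported gap, the implied standard error is about 4 pp, and the two-sided p-value is about 0.2, which matches the summary's reading that the gap is not statistically distinguishable from zero.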

Evaluation and Benchmarking · Alignment and RLHF · Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship · IFEval

Related guides (2)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 15d ago

Counterfactual context revision framework for auditing LLM-based stance simulation in online discussions

Researchers introduce a counterfactual context revision framework to audit how LLMs simulate individual users' stances in online discussions. By applying controlled text-only and multimodal (meme-based) revisions to conversational contexts, they measure how readily simulated stances shift in response to semantically independent changes. Results show effective and robust stance transitions across both revision types and polarization-preference mechanisms, raising concerns about whether LLM simulations reflect genuine user-specific beliefs or are highly context-sensitive artifacts. The work contributes an evaluation framework and highlights risks of using LLMs to model online opinion dynamics.

Evaluation and Benchmarking · AI Safety Research · Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

6 · arXiv · cs.AI · 10d ago

Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks

A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.

Frontier Model Releases · Evaluation and Benchmarking · Flaws in the LLM Automation Narrative

6 · arXiv · cs.AI · 5d ago

Self-improving VLMs can silently regress when verifier quality is task-mismatched

A new arXiv paper demonstrates that verifier-driven self-DPO, a common recipe for self-improving visual-language models, can silently degrade student model performance when the verifier's task-rubric accuracy is insufficient for the target task. Experiments on Qwen-3-VL-2B and Qwen-2.5-VL-3B across MathVista, MMMU, and BLINK show regressions of 3.4–10.9 percentage points below frozen baselines, with the counterintuitive finding that more accurate-but-still-wrong verifiers cause larger regressions than near-random ones. The authors provide a mechanistic explanation via a variance theorem for progress-gated replay and offer operational guidance: measure target-task rubric accuracy before running any verifier-driven loop and rank verifiers by task-specific quality rather than parameter count.

Evaluation and Benchmarking · Alignment and RLHF · MathVista · When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks · BLINK

5 · arXiv · cs.AI · 12d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

4 · arXiv · cs.CL · 15d ago

EDIT framework trains more rubric-faithful LLM graders via internal-state diagnostics

Researchers introduce Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for improving LLM-based rubric grading. The first phase (EDIT-SFT) identifies problematic reasoning steps using posterior belief signals and input-grounding scores, then revises only those steps with rubric checklists; the second phase (EDIT-RL) uses belief-guided reward shaping to penalize harmful belief drifts during RL. Experiments on two real-world multi-subject grading benchmarks show consistent improvements over SFT and RL baselines on both in-domain and out-of-domain splits.

Evaluation and Benchmarking · Alignment and RLHF · Evidence-Diagnosed Intervention Training · EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading

6 · arXiv · cs.AI · 8d ago

LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts

An arXiv preprint reports that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, given how resource-intensive traditional reproducibility efforts are.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Automated reproducibility assessments in the social and behavioral sciences using large language models

7 · arXiv · cs.CL · 47h ago

LLM psychological profiles are largely measurement artifacts, not model properties

A new arXiv preprint administers a battery of personality and risk-preference instruments to 56 instruction-tuned LLMs alongside large human reference samples, finding that 81-90% of between-model variation is explained by directional response bias rather than the traits the instruments target. The authors introduce the concept of 'response orthogonality' to explain why some instruments appear more reliable than others, and show that apparent psychological profiles can be manufactured through item selection. The findings challenge the validity of using human-designed psychometric tools to characterize LLMs, with direct implications for safety assessment and the use of LLMs as proxies for human participants in research.

Evaluation and Benchmarking · AI Safety Research · Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

5 · arXiv · cs.CL · 9d ago

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking · Alignment and RLHF · On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study