Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

Co-occurring entities

IFEval

More like this (12)

Chain-of-Thought Self-Consistency
Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
Rubric-Conditioned Self-Distillation
source-level self-rewriting
Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models
Self-Trained Verification (STV)
Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness
When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Recent events (1)

arXiv · cs.CL · 47h ago

Study finds no detectable self-preference bias when LLMs revise their own instruction-following drafts

A new arXiv preprint tests whether LLMs resist valid corrections to their own writing. It uses IFEval's deterministic verifier to establish ground-truth correctness, which avoids the subjectivity of model-as-judge setups. Across four mid-tier model families and 85 author-versus-fresh comparisons, it detects no statistically significant self-preference bias (gap -5.1 pp, 95% CI [-12.9, +2.7]; the interval includes zero). Qualitatively, when authors do reject verified-good fixes, 97% of their stated reasons involve substantive flaw-catching rather than preference. The result challenges the assumption that the self-preference documented in judging tasks extends to self-revision.
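To make the reported statistic concrete, here is a minimal sketch of how a gap in percentage points and its 95% interval could be computed and read. It is an illustration under stated assumptions, not the preprint's method: the summary does not say how the interval was built, so this uses a simple Wald interval for a difference of two independent proportions. The function names, the sign convention (author rate minus fresh rate), and the counts in the usage lines are hypothetical.

```python
import math


def gap_with_ci(author_accepts, author_n, fresh_accepts, fresh_n, z=1.96):
    """Return (gap, lower, upper) in percentage points.

    Assumption: gap = author acceptance rate minus fresh-reviser acceptance
    rate, with a Wald (normal-approximation) interval. The preprint may use
    a different estimator, e.g. a paired or bootstrap interval.
    """
    p_author = author_accepts / author_n
    p_fresh = fresh_accepts / fresh_n
    gap = p_author - p_fresh
    # Standard error of a difference between two independent proportions.
    se = math.sqrt(
        p_author * (1 - p_author) / author_n
        + p_fresh * (1 - p_fresh) / fresh_n
    )
    return 100 * gap, 100 * (gap - z * se), 100 * (gap + z * se)


def excludes_zero(lower, upper):
    """True when the interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    gap, lower, upper = gap_with_ci(60, 85, 64, 85)
    print(f"gap {gap:+.1f} pp, 95% CI [{lower:+.1f}, {upper:+.1f}]")
    # The interval reported in the summary spans zero, so this prints False.
    print(excludes_zero(-12.9, 2.7))
```

Because the reported interval [-12.9, +2.7] contains zero, the data are compatible with no gap at all, which is why the summary describes self-preference as weak or absent rather than ruled out.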

Evaluation and Benchmarking · Alignment and RLHF · IFEval