
Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA


Co-occurring entities

MuSiQue, LoRA, HotpotQA, DROP, SQuAD

More like this (12)

LLM-as-a-Judge
LMs as Task-Specific Knowledge Bases: An Interpretability Analysis
LLM-judge scoring
Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research
Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions
LLM-judged explanation score
Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?
Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Recent events (1)

arXiv · cs.CL · 16h ago

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.
