6 · arXiv cs.CL (Computation and Language) · 1mo ago

AMARIS: Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS introduces a persistent evaluation memory to improve rubric-based reward shaping when fine-tuning LLMs with reinforcement learning. Prior adaptive rubric methods discard evaluation diagnostics after each step; AMARIS instead accumulates step-level summaries and retrieves relevant history through two modes, static retrieval (the most recent steps) and dynamic retrieval (semantic similarity), to inform rubric updates. The system runs asynchronously alongside the RL training loop with approximately 5% time overhead. Experiments across closed-ended and open-ended domains show consistent improvements over baselines, and ablations confirm that combining both retrieval modes yields the strongest results.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Alignment and RLHF · semantic retrieval · Reinforcement Learning from Human Feedback · AMARIS · rubric-based reward shaping

Related guides (3)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 19d ago

PARL: Preference-Aware Rubric Learning for Personalized LLM Evaluation

This paper introduces PARL (Preference-Aware Rubric Learning), a framework that reframes personalized LLM evaluation as a learning problem rather than static judgment. PARL induces preference-aware evaluation rubrics from raw user interaction histories and uses a discriminative reinforcement learning objective to contrast user-authored responses against model outputs, capturing user-specific decision boundaries. Experiments on personalized text generation tasks show PARL produces high-fidelity rubrics that generalize across users and tasks, outperforming existing LLM-as-a-judge and automatic metric approaches.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Preference-Aware Rubric Learning · LLM-as-a-Judge · PARL

5 · arXiv · cs.AI · 2d ago

Rubric-Conditioned Self-Distillation: structured feedback for reasoning model post-training

A new arXiv preprint proposes Rubric-Conditioned Self-Distillation (RCSD), a post-training framework that replaces scalar reward signals and noisy chain-of-thought annotations with structured rubrics for fine-grained credit assignment. The method conditions a teacher model on criterion-level rubrics to provide token-level guidance on the student's own sampled trajectories, avoiding reliance on a single reference rationale. Evaluated on science reasoning benchmarks, RCSD outperforms GRPO by 1.0 points and OPSD by 0.9 points on average.

Evaluation and Benchmarking · Alignment and RLHF · OPSD · GRPO · Rubric-Conditioned Self-Distillation

6 · arXiv · cs.AI · 1mo ago

POW3R: Policy-Aware Rubric Rewards for More Efficient RLVR Training

This paper identifies a failure mode in rubric-based reinforcement learning with verifiable rewards (RLVR): static aggregation of criterion weights conflates human-assigned importance with current optimization utility, causing many criteria to be either already saturated or unreachable. The authors introduce POW3R, a framework that dynamically reweights criterion-level rewards during training using rollout-level contrast to emphasize criteria that currently differentiate policy outputs. Across three base policies and two datasets (multimodal and text-only), POW3R wins 24 of 30 comparisons on rubric reward and strict completion metrics, and reaches equivalent performance in 2.5–4× fewer training steps than vanilla GRPO with rubric rewards.
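
One way to picture contrast-based reweighting is the sketch below. It is an assumption for illustration, not POW3R's actual formula: the function name `contrast_weights` is hypothetical, and the choice to multiply each human-assigned weight by the standard deviation of that criterion's scores across rollouts is one simple stand-in for "rollout-level contrast". Criteria that every rollout passes (saturated) or every rollout fails (currently unreachable) have zero spread and so receive zero weight.

```python
from statistics import pstdev


def contrast_weights(scores, base_weights):
    """Reweight criteria by how much they separate rollouts of one prompt.

    scores: list of rollouts, each a list of per-criterion scores in [0, 1].
    base_weights: human-assigned importance per criterion.
    """
    n_criteria = len(base_weights)
    # Spread of each criterion's score across the rollouts (assumed contrast measure).
    spreads = [pstdev([rollout[c] for rollout in scores]) for c in range(n_criteria)]
    raw = [w * s for w, s in zip(base_weights, spreads)]
    total = sum(raw)
    if total == 0:
        # No criterion differentiates the rollouts: fall back to the human weights.
        base_total = sum(base_weights)
        return [w / base_total for w in base_weights]
    return [r / total for r in raw]


# Criterion 0 is saturated, criterion 1 is never met, criterion 2 varies.
rollout_scores = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0]]
weights = contrast_weights(rollout_scores, [0.5, 0.3, 0.2])
```

In this example all of the weight moves to the third criterion, even though it carries the lowest human-assigned importance, because it is the only one that currently distinguishes the rollouts.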

Evaluation and Benchmarking · Alignment and RLHF · rubric-based rewards · GRPO · POW3R

6 · arXiv · cs.CL · 17d ago

QUBRIC: Co-designing queries and rubrics for RL beyond verifiable rewards

QUBRIC is a framework that jointly optimizes queries and rubrics for reinforcement learning in settings where rewards are not strictly verifiable. The approach uses teacher-derived key points to rewrite open-ended queries into evaluable scenarios, applies contrastive rubric generation to capture teacher-policy gaps, and filters for learnability before GRPO training. Trained only on instruction-following data, QUBRIC achieves a +5.5 point gain on ArenaHard over an SFT baseline and transfers to legal, moral, and narrative reasoning benchmarks (+6.3 points average), suggesting rubric-based RL can complement RLVR in non-verifiable domains.

Evaluation and Benchmarking · Alignment and RLHF · QUBRIC · GRPO · ArenaHard

6 · arXiv · cs.CL · 4d ago

DeepRubric: Evidence-tree rubric supervision cuts RL training cost for deep research agents by 13x

DeepRubric is a data construction framework that improves reinforcement learning efficiency for deep research agents by reversing the typical rubric-generation process: rather than inferring evaluation criteria from a query, it first builds an evidence tree of verifiable sub-questions, then synthesizes aligned query-rubric pairs. The authors construct 9K training examples and train DeepRubric-8B using rubric-based GRPO, achieving performance comparable to prior open-source state-of-the-art deep research models on three benchmarks while using roughly 13x fewer RL GPU-hours. The work addresses a key bottleneck in RL-based training of long-form research agents: unreliable reward signals from incomplete rubrics.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeepRubric · GRPO

6 · arXiv · cs.AI · 8d ago

RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy

Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using the retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, suggesting that reasoning-aware retrieval is orthogonal to improvements in reward design and training curriculum.

Evaluation and Benchmarking · Alignment and RLHF · RA-RFT · Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning · GRPO

4 · arXiv · cs.CL · 15d ago

EDIT framework trains more rubric-faithful LLM graders via internal-state diagnostics

Researchers introduce Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for improving LLM-based rubric grading. The first phase (EDIT-SFT) identifies problematic reasoning steps using posterior belief signals and input-grounding scores, then revises only those steps with rubric checklists; the second phase (EDIT-RL) uses belief-guided reward shaping to penalize harmful belief drifts during RL. Experiments on two real-world multi-subject grading benchmarks show consistent improvements over SFT and RL baselines on both in-domain and out-of-domain splits.

Evaluation and Benchmarking · Alignment and RLHF · Evidence-Diagnosed Intervention Training · EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading

6 · arXiv · cs.CL · 19d ago

LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards

LongTraceRL is a new RL training framework for improving long-context reasoning in LLMs, addressing limitations of existing RLVR methods. It constructs challenging training data using multi-hop questions from knowledge graph random walks and tiered distractors derived from search agent trajectories (high-confusability: read but uncited; low-confusability: seen but unopened). A rubric reward provides entity-level process supervision along reasoning chains, applied only to correct responses to prevent reward hacking. Experiments across three LLMs (4B–30B parameters) on five long-context benchmarks show consistent improvements over strong baselines.
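
The gating idea, rubric credit only for correct responses, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's reward: the function name `gated_reward`, the additive combination, and the `rubric_weight` value are all hypothetical.

```python
def gated_reward(is_correct, entities_hit, entities_total, rubric_weight=0.5):
    """Outcome reward plus an entity-level rubric bonus, gated on correctness."""
    outcome = 1.0 if is_correct else 0.0
    # Incorrect answers earn no rubric credit, so mentioning rubric entities
    # without reaching the right answer cannot raise the reward.
    if not is_correct or entities_total == 0:
        return outcome
    return outcome + rubric_weight * entities_hit / entities_total


wrong_but_thorough = gated_reward(False, 3, 3)
right_half_chain = gated_reward(True, 1, 2)
```

Under these assumed values, an incorrect response that covers every rubric entity scores 0.0, while a correct response that covers half of them scores 1.25, so the rubric only differentiates among responses that are already correct.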

Long Context Evolution · Evaluation and Benchmarking · tiered distractors · Knowledge Graph Random Walk · Long-context Reasoning Benchmarks