LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards
LongTraceRL is a new RL training framework for improving long-context reasoning in LLMs, addressing limitations of existing methods based on reinforcement learning with verifiable rewards (RLVR). It constructs challenging training data from multi-hop questions generated by knowledge graph random walks, padded with tiered distractors derived from search agent trajectories (high-confusability: read but uncited; low-confusability: seen but unopened). A rubric reward provides entity-level process supervision along reasoning chains and is applied only to correct responses to prevent reward hacking. Experiments across three LLMs (4B–30B parameters) on five long-context benchmarks show consistent improvements over strong baselines.
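The two mechanisms in the summary above, distractor tiering and a rubric reward gated on correctness, can be illustrated with a minimal sketch. It is not the paper's implementation: the `Page` record, the substring-based correctness and entity checks, the additive form of the reward, and the `rubric_weight` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record of one page encountered in a search agent trajectory."""
    url: str
    opened: bool  # the agent opened and read the page
    cited: bool   # the agent cited the page in its final answer


def distractor_tier(page: Page) -> str:
    """Map a trajectory page to a tier, following the summary's definitions."""
    if page.cited:
        return "evidence"            # supporting document, not a distractor
    if page.opened:
        return "high_confusability"  # read but uncited
    return "low_confusability"       # seen in results but never opened


def rubric_reward(response: str, answer: str, chain_entities: list[str],
                  rubric_weight: float = 0.5) -> float:
    """Outcome reward plus an entity-level rubric bonus, gated on correctness.

    Assumption: correctness and entity mentions are checked by
    case-insensitive substring match, a stand-in for a real grader.
    """
    text = response.lower()
    if answer.lower() not in text:
        # Incorrect responses earn no rubric credit, so listing the right
        # entities cannot compensate for a wrong final answer.
        return 0.0
    hits = sum(entity.lower() in text for entity in chain_entities)
    coverage = hits / len(chain_entities) if chain_entities else 0.0
    return 1.0 + rubric_weight * coverage
```

With these assumed defaults, a correct response that mentions every entity on the reasoning chain scores 1.5, a correct response that mentions none scores 1.0, and any incorrect response scores 0.0 regardless of entity coverage. The gate is what makes the rubric term safe to add: the bonus can only separate correct responses from one another.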
Related guides (4)

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating
Related events (8)
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs
Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.
TRACE: Tree-structured rollout budget allocation for efficient agentic RL training
TRACE (Tree Rollout Allocation for Contrastive Exploration) is a new framework for improving reinforcement learning with verifiable rewards (RLVR) in multi-turn agentic LLM settings. The method models each ReAct-style thought-action-observation turn as a distinct node, enabling budget allocation across both prompt-level and turn-level prefixes in a tree structure, rather than only at the prompt level. A shared predictor estimates conditional success probability at each anchor to guide allocation, enriching reward contrast within a fixed sampling budget. Empirically, TRACE improves Qwen3-14B multi-hop QA accuracy by 2.8 points over baselines at equal sampling cost.
RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy
Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.
DeepRubric: Evidence-tree rubric supervision cuts RL training cost for deep research agents by 13x
DeepRubric is a data construction framework that improves reinforcement learning efficiency for deep research agents by reversing the typical rubric-generation process: rather than inferring evaluation criteria from a query, it builds an evidence tree of verifiable sub-questions first, then synthesizes aligned query-rubric pairs. The authors construct 9K training examples and train DeepRubric-8B using rubric-based GRPO, achieving comparable performance to prior open-source state-of-the-art deep research models on three benchmarks while using roughly 13x fewer RL GPU-hours. The work addresses a key bottleneck in RL-based training of long-form research agents: unreliable reward signals from incomplete rubrics.
QUBRIC: Co-designing queries and rubrics for RL beyond verifiable rewards
QUBRIC is a framework that jointly optimizes queries and rubrics for reinforcement learning in settings where rewards are not strictly verifiable. The approach uses teacher-derived key points to rewrite open-ended queries into evaluable scenarios, applies contrastive rubric generation to capture teacher-policy gaps, and filters for learnability before GRPO training. Trained only on instruction-following data, QUBRIC achieves a +5.5 point gain on ArenaHard over an SFT baseline and transfers to legal, moral, and narrative reasoning benchmarks (+6.3 points average), suggesting rubric-based RL can complement RLVR in non-verifiable domains.
RL-trained LLMs learn retriever-specific query formulation strategies for RAG
A new arXiv paper presents the first systematic study of using reinforcement learning to teach LLMs to adapt query formulation strategies to different retrieval backends. The authors find that different retrievers have surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), making cross-retriever strategy transfer ineffective. They introduce a branching-based rollout technique to stabilize training over multi-step retrieval trajectories and show gains from retriever-specific human guidance and model scaling.
CORA: Consistency-Oriented Reasoning Alignment addresses thinking-answer gap in multimodal RLVR
Researchers identify and analyze a systematic inconsistency between reasoning traces and final answers in RLVR-trained large vision-language models, showing the problem persists throughout GRPO training and inference. They propose CORA, which introduces a lightweight plug-and-play consistency reward model and a Hybrid Reward Advantage Splitting (HRAS) mechanism to coordinate task and consistency optimization. Experiments across multimodal reasoning benchmarks show CORA improves both task performance and reasoning faithfulness.


