RL-trained LLMs learn retriever-specific query formulation strategies for RAG
A new arXiv paper presents the first systematic study of using reinforcement learning to teach LLMs to adapt query formulation strategies to different retrieval backends. The authors find that different retrievers have surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), making cross-retriever strategy transfer ineffective. They introduce a branching-based rollout technique to stabilize training over multi-step retrieval trajectories and show gains from retriever-specific human guidance and model scaling.
Related events (8)
RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy
Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.
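For readers unfamiliar with the metric: average@32 is usually taken to mean sampling 32 responses per problem, scoring each for correctness, and averaging. The summary does not define it, so the sketch below assumes that common reading. The function name and toy data are hypothetical, and the toy uses 4 samples per problem rather than 32.

```python
from statistics import mean

def average_at_k(correct_by_problem):
    """Mean per-sample accuracy, averaged over problems.

    correct_by_problem: one inner list per problem, holding a boolean
    for each of that problem's k sampled responses.
    """
    per_problem = [
        mean(1.0 if is_correct else 0.0 for is_correct in samples)
        for samples in correct_by_problem
    ]
    return mean(per_problem)

# Made-up data: 2 problems, 4 samples each (assumption, not paper data).
toy = [
    [True, False, False, True],    # 2 of 4 correct
    [False, False, False, True],   # 1 of 4 correct
]
score = average_at_k(toy)
```

Under this definition, a gain of 7.1 points would correspond to the averaged accuracy rising by 0.071 on a 0-to-1 scale.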
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs
Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.
LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards
LongTraceRL is a new RL training framework for improving long-context reasoning in LLMs, addressing limitations of existing RLVR methods. It constructs challenging training data using multi-hop questions from knowledge graph random walks and tiered distractors derived from search agent trajectories (high-confusability: read but uncited; low-confusability: seen but unopened). A rubric reward provides entity-level process supervision along reasoning chains, applied only to correct responses to prevent reward hacking. Experiments across three LLMs (4B–30B parameters) on five long-context benchmarks show consistent improvements over strong baselines.
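The idea of gating the rubric reward on correctness can be illustrated with a small sketch. The summary does not give the reward formula, so the additive combination, the `rubric_weight` parameter, and the substring-based entity matching below are all assumptions made for illustration, not the paper's method.

```python
def gated_rubric_reward(response, answer_is_correct, rubric_entities,
                        rubric_weight=0.5):
    """Outcome reward plus an entity-level rubric bonus.

    The rubric bonus is granted only when the final answer is correct,
    so a wrong response cannot collect reward just by name-dropping
    rubric entities. All names and the formula are hypothetical.
    """
    outcome = 1.0 if answer_is_correct else 0.0
    if not answer_is_correct or not rubric_entities:
        return outcome
    text = response.lower()
    # Count rubric entities mentioned anywhere in the response.
    hits = sum(1 for entity in rubric_entities if entity.lower() in text)
    return outcome + rubric_weight * hits / len(rubric_entities)

entities = ["Marie Curie", "Sorbonne"]
correct_partial = gated_rubric_reward(
    "Marie Curie taught in Paris.", True, entities)
wrong_but_verbose = gated_rubric_reward(
    "Marie Curie taught at the Sorbonne.", False, entities)
```

In this sketch the wrong response scores zero even though it mentions every rubric entity, which is the behavior the gating is meant to guarantee.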
RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training
Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.
Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods
A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.
Adaptive LLM tutoring system with subject-aware prompt routing improves high-school student engagement
Researchers develop and evaluate an LLM-based tutoring system that uses a learned prompt routing model to dynamically select pedagogical strategies based on 14 features extracted from conversation transcripts. The system was trained in simulation and deployed in an A/B test with 359 high-school students (656 conversations), showing sim-to-real transfer and reducing required interactions by ~3 turns. A stochastic routing strategy achieved a notably higher exercise conversion rate (28.1%) compared to a greedy router (19.1%) and static baseline (19.6%).
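The difference between a greedy and a stochastic router can be sketched as follows. The summary does not say how the stochastic router samples, so softmax sampling over per-strategy scores is an assumption here, and the strategy names, scores, and `temperature` parameter are hypothetical.

```python
import math
import random

def greedy_route(scores):
    """Always pick the strategy with the highest predicted score."""
    return max(scores, key=scores.get)

def stochastic_route(scores, temperature=1.0, rng=random):
    """Sample a strategy with probability proportional to
    exp(score / temperature), i.e. a softmax over the scores."""
    names = list(scores)
    top = max(scores.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp((scores[n] - top) / temperature) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical router outputs for one conversation state.
scores = {"hint": 0.2, "worked_example": 0.7, "socratic": 0.1}
greedy_pick = greedy_route(scores)

rng = random.Random(0)
sampled = [stochastic_route(scores, rng=rng) for _ in range(200)]
```

The greedy router returns the same strategy every time for a given state, while the stochastic router spreads its choices across strategies, which keeps lower-scored strategies in play.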
Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages
Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

