Reinforcement learning enables meta-skill for translating unseen low-resource languages via in-context linguistic knowledge
Researchers propose an RL-based training approach for translating extremely low-resource or unseen languages by rewarding models for extracting and applying in-context linguistic knowledge (e.g., grammar books) rather than memorizing specific languages. Using chrF as a surface-level reward signal, RL-trained models outperform both in-context learning and supervised fine-tuning on completely unseen languages at test time. The work extends outcome-based RL beyond math and coding reasoning tasks, suggesting broader applicability to language learning from context.
Related guides (2)
Related events (8)
Luar: Selective Translation via Reinforcement Learning for Multilingual Reasoning
Luar is a reinforcement learning framework that trains reasoning language models to selectively invoke English translation only when direct understanding of a non-English input is deemed unreliable. The approach, built on top of GRPO, outperforms standard multilingual baselines across reasoning benchmarks, with especially large gains on low-resource languages. Analysis confirms the model learns to avoid unnecessary translation when direct reasoning suffices, and generalizes the translation-call behavior to unseen low-resource languages.
Synthetic linguistic reasoning traces improve low-resource machine translation via in-context learning
Researchers propose a pipeline that generates step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks to assist LLMs in translating extremely low-resource languages. Evaluated on Xibe and Chintang across ICL, SFT, and RFT settings, the traces prove most effective as inference-time guidance rather than training data. Models can leverage reliable grammatical analyses at inference time but struggle to learn to generate accurate traces themselves, identifying trace generation quality as the key bottleneck.
LANG: Reinforcement Learning Framework for Multilingual Reasoning with Language-Adaptive Hint Guidance
LANG is a new RL-based framework for improving multilingual reasoning in LLMs that addresses the trade-off between input-language consistency and reasoning quality. It uses language-conditioned hints with a progressive decay schedule and a language-adaptive switch to tailor learning to per-language difficulty. Empirical results on multilingual mathematical benchmarks show improved reasoning without language drift toward English, and the approach generalizes beyond mathematics.
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs
Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.
Reinforcement Learning Recruits a Pre-Existing 'Functional Welfare' Axis in Language Models
Researchers trained language models in a semantically neutral maze environment and extracted concept vectors for rewarded and punished trajectories, finding that RL recruits a pre-existing representational axis encoding functional welfare—how well or badly the system is doing relative to its goals. The punishment vector promotes failure tokens, aligns with negative emotion concepts, and induces refusal and uncertainty when used for steering; the reward vector is its near-antiparallel mirror. Critically, these vectors are effective in models before maze training and appear in pretrain-only models, suggesting the welfare axis pre-exists post-training rather than being created by it. The findings have implications for interpretability, alignment, and understanding how minimal reward signals can broadly reshape model behavior.
Learning to Summarize with Human Feedback
OpenAI published research applying reinforcement learning from human feedback (RLHF) to train language models for improved summarization quality. The work demonstrated that models trained with human preference signals outperform those trained purely on supervised objectives for summarization tasks. This paper is an early foundational contribution to the RLHF methodology that later became central to aligning large language models.
Aligning language models to follow instructions
OpenAI published a blog post describing their work on aligning language models to follow human instructions, corresponding to the InstructGPT research. This work introduced reinforcement learning from human feedback (RLHF) as a core technique for training models to be more helpful, honest, and aligned with user intent. The approach demonstrated that smaller instruction-tuned models could outperform larger base models on human preference evaluations, marking a foundational shift in how language models are trained and deployed.

