6 · arXiv cs.CL (Computation and Language) · 18d ago

Luar: Selective Translation via Reinforcement Learning for Multilingual Reasoning

Luar is a reinforcement learning framework that trains reasoning language models to selectively invoke English translation only when direct understanding of a non-English input is deemed unreliable. The approach, built on top of GRPO, outperforms standard multilingual baselines across reasoning benchmarks, with especially large gains on low-resource languages. Analysis confirms the model learns to avoid unnecessary translation when direct reasoning suffices, and generalizes the translation-call behavior to unseen low-resource languages.

Frontier Model Releases · Evaluation and Benchmarking · Agent and Tool Ecosystem · Alignment and RLHF · GRPO · Luar · Reasoning Language Models · deokhk
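The summary above says Luar is built on GRPO and learns to skip translation when direct reasoning suffices. A minimal sketch of GRPO's group-relative advantage follows, paired with a toy reward. The reward shape, the `translate_cost` value, and both function names are illustrative assumptions and are not taken from the Luar paper. The sketch uses the population standard deviation; implementations differ on this detail.

```python
from statistics import mean, pstdev


def toy_reward(correct: bool, used_translation: bool,
               translate_cost: float = 0.1) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a small
    cost when the response invoked translation (an assumption)."""
    base = 1.0 if correct else 0.0
    return base - (translate_cost if used_translation else 0.0)


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward against the
    mean and standard deviation of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: (answer correct?, translation used?)
group = [(True, False), (True, True), (False, False), (False, True)]
rewards = [toy_reward(c, t) for c, t in group]
advantages = group_relative_advantages(rewards)
```

Under this toy reward, a correct answer without translation receives the highest advantage in its group, so the policy would be nudged away from translation calls that do not change the outcome.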

Related guides (4)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action


GRPO · Concept

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating


Alignment and RLHF · Topic guide

Alignment and RLHF: From Human Feedback to Scalable Post-Training


Related events (8)

6 · arXiv · cs.CL · 29d ago

LANG: Reinforcement Learning Framework for Multilingual Reasoning with Language-Adaptive Hint Guidance

LANG is a new RL-based framework for improving multilingual reasoning in LLMs that addresses the trade-off between input-language consistency and reasoning quality. It uses language-conditioned hints with a progressive decay schedule and a language-adaptive switch to tailor learning to per-language difficulty. Empirical results on multilingual mathematical benchmarks show improved reasoning without language drift toward English, and the approach generalizes beyond mathematics.

Evaluation and Benchmarking · Alignment and RLHF · large language models · LANG · multilingual mathematical benchmarks

5 · arXiv · cs.CL · 15d ago

Reinforcement learning enables meta-skill for translating unseen low-resource languages via in-context linguistic knowledge

Researchers propose an RL-based training approach for translating extremely low-resource or unseen languages by rewarding models for extracting and applying in-context linguistic knowledge (e.g., grammar books) rather than memorizing specific languages. Using chrF as a surface-level reward signal, RL-trained models outperform both in-context learning and supervised fine-tuning on completely unseen languages at test time. The work extends outcome-based RL beyond math and coding reasoning tasks, suggesting broader applicability to language learning from context.

Evaluation and Benchmarking · Alignment and RLHF · Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation · chrF
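The entry above uses chrF, a character n-gram F-score, as the reward signal. The sketch below is a simplified chrF-style score, assuming character n-grams up to order 6, beta = 2, and spaces ignored. The function names are hypothetical, and library implementations differ in details such as whitespace handling and how the n-gram orders are averaged, so treat this as an illustration rather than the paper's reward.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (an assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision and recall over
    orders 1..max_n, then combine them into an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Used as a reward: score a sampled translation against a reference.
reward = simple_chrf("the cat sat", "the cat sat down")
```

Because the score is computed on surface characters, it gives partial credit for near-miss word forms, which is one reason a surface-level metric can serve as a dense reward for translation.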

5 · arXiv · cs.CL · 22d ago

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Loong is a long document translation agent that uses a 3E memory module (Essence-Exemplar-Entity) to store structured historical context, replacing passive full-context attention with RL-optimized adaptive context selection. The agent learns its context retrieval policy via reinforcement learning on self-sampled reasoning trajectories. Evaluations show average gains of up to 13.0 points across three metrics in English↔Chinese, German, and French translation directions, with strong generalization and robustness to noise in ultra-long documents.

Long Context Evolution · Agent and Tool Ecosystem · YutongWang1216 · 3E Memory Module · Reinforcement Learning

4 · arXiv · cs.CL · 17d ago

Synthetic linguistic reasoning traces improve low-resource machine translation via in-context learning

Researchers propose a pipeline that generates step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks to assist LLMs in translating extremely low-resource languages. Evaluated on Xibe and Chintang across ICL, SFT, and RFT settings, the traces prove most effective as inference-time guidance rather than training data. Models can leverage reliable grammatical analyses at inference time but struggle to learn to generate accurate traces themselves, which points to trace generation quality as the key bottleneck.

Evaluation and Benchmarking · Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation? · Universal Dependencies

5 · arXiv · cs.CL · 23d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking · Basque language · LLM-as-a-Judge · mJudge

6 · arXiv · cs.LG · 4d ago

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking · Alignment and RLHF · ExpRL: Exploratory RL for LLM Mid-Training · GRPO · ExpRL

6 · arXiv · cs.CL · 1mo ago

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases · Evaluation and Benchmarking · large language models · Population-Matching Experiment · Pragmatic Reasoning

4 · arXiv · cs.CL · 19d ago

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMT systems (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size: the best local LLMs match or exceed the local NMT systems and the frontier LLM but fall short of the top commercial NMT systems. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking · Open Weights Progress · Ollama · GPT-5.2 · DeepL