Almanac
6 · arXiv · cs.CL (Computation and Language) · 18d ago

Luar: Selective Translation via Reinforcement Learning for Multilingual Reasoning

Luar is a reinforcement learning framework that trains reasoning language models to invoke English translation selectively, only when direct understanding of a non-English input is deemed unreliable. The approach, built on GRPO (Group Relative Policy Optimization), outperforms standard multilingual baselines across reasoning benchmarks, with especially large gains on low-resource languages. Analysis confirms that the model learns to avoid unnecessary translation when direct reasoning suffices, and that it generalizes the translation-call behavior to unseen low-resource languages.
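
For context, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization. The reward values are hypothetical, the helper name `group_relative_advantages` is made up for illustration, and nothing here reflects Luar's actual reward design, which the summary does not specify. Implementations also differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group.

    Assumption: population standard deviation plus a small epsilon,
    which keeps the division defined when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.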

Related events (8)

6 · arXiv · cs.CL · 29d ago

LANG: Reinforcement Learning Framework for Multilingual Reasoning with Language-Adaptive Hint Guidance

LANG is a new RL-based framework for improving multilingual reasoning in LLMs that addresses the trade-off between input-language consistency and reasoning quality. It uses language-conditioned hints with a progressive decay schedule and a language-adaptive switch to tailor learning to per-language difficulty. Empirical results on multilingual mathematical benchmarks show improved reasoning without language drift toward English, and the approach generalizes beyond mathematics.

5 · arXiv · cs.CL · 15d ago

Reinforcement learning enables meta-skill for translating unseen low-resource languages via in-context linguistic knowledge

Researchers propose an RL-based training approach for translating extremely low-resource or unseen languages by rewarding models for extracting and applying in-context linguistic knowledge (e.g., grammar books) rather than memorizing specific languages. Using chrF as a surface-level reward signal, RL-trained models outperform both in-context learning and supervised fine-tuning on completely unseen languages at test time. The work extends outcome-based RL beyond math and coding reasoning tasks, suggesting broader applicability to language learning from context.
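
To make the reward signal concrete, here is a simplified chrF-style score: an F-beta over character n-gram precision and recall, averaged across n-gram orders. This is a sketch under stated assumptions, not the reference chrF implementation and not the paper's exact configuration. The function names, the whitespace handling, and the defaults (`max_n=6`, `beta=2.0`) are illustrative choices.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams. Simplification: whitespace is dropped."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def chrf_reward(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score in [0, 1] for use as a scalar reward."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to form n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical model output scored against a reference translation.
score = chrf_reward("the cat sat on a mat", "the cat sat on the mat")
print(round(score, 3))
```

Because the score depends only on surface character overlap, it needs no tokenizer or language-specific resources, which is one reason such a metric is practical as a reward for languages the model has never seen.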

5 · arXiv · cs.CL · 22d ago

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Loong is a long-document translation agent that uses a 3E memory module (Essence-Exemplar-Entity) to store structured historical context, replacing passive full-context attention with RL-optimized adaptive context selection. The agent learns its context retrieval policy via reinforcement learning on self-sampled reasoning trajectories. Evaluations show average gains of up to 13.0 points across three metrics for translation between English and Chinese, German, and French, with strong generalization and robustness to noise in ultra-long documents.

4 · arXiv · cs.CL · 17d ago

Synthetic linguistic reasoning traces improve low-resource machine translation via in-context learning

Researchers propose a pipeline that generates step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks to assist LLMs in translating extremely low-resource languages. Evaluated on Xibe and Chintang across ICL, SFT, and RFT settings, the traces prove most effective as inference-time guidance rather than as training data. Models can leverage reliable grammatical analyses at inference time but struggle to learn to generate accurate traces themselves, which points to trace generation quality as the key bottleneck.

5 · arXiv · cs.CL · 23d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

6 · arXiv · cs.LG · 4d ago

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

6 · arXiv · cs.CL · 1mo ago

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

4 · arXiv · cs.CL · 19d ago

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMT systems (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size: the best local LLMs match or exceed local NMT systems and the frontier LLM, but fall short of the top commercial NMT systems. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.