5 · arXiv cs.CL (Computation and Language) · 3d ago

Context-aware translation cascades improve multilingual reasoning across 285 languages

A new arXiv preprint identifies a structural flaw in standard translation cascades for multilingual reasoning—each stage discards context needed by later stages—and proposes a training-free fix: providing the original question, English translation, and reasoning trace to the final translation module. The intervention is evaluated on nine multilingual benchmarks across three backbone models and 285 languages, showing strong gains for open-ended generation. The key finding is that preserving the original-language question until the end of the pipeline captures most of the benefit.
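
The difference between a standard cascade and the context-aware variant described above can be sketched as follows. This is a minimal illustration, assuming each stage is a plain text-in/text-out callable; the function names, the prompt wording, and the `context_aware` switch are hypothetical and are not taken from the preprint.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for any text-in/text-out model call.
LLM = Callable[[str], str]


@dataclass
class CascadeResult:
    english_question: str
    reasoning_trace: str
    final_answer: str


def run_cascade(
    question: str,
    target_lang: str,
    translate_in: LLM,
    reason: LLM,
    translate_out: LLM,
    context_aware: bool = True,
) -> CascadeResult:
    """Translate to English, reason in English, translate the answer back."""
    # Stage 1: source language -> English.
    english_question = translate_in(f"Translate to English:\n{question}")

    # Stage 2: reasoning over the English question.
    reasoning_trace = reason(f"Solve step by step:\n{english_question}")

    # Stage 3: back-translation into the source language.
    if context_aware:
        # Context-aware variant: the final module sees the original question,
        # its English translation, and the reasoning trace together.
        prompt = (
            f"Original question ({target_lang}):\n{question}\n\n"
            f"English translation:\n{english_question}\n\n"
            f"English reasoning and answer:\n{reasoning_trace}\n\n"
            f"Write the final answer in {target_lang}."
        )
    else:
        # Standard cascade: the final module only sees the previous stage's output.
        prompt = f"Translate to {target_lang}:\n{reasoning_trace}"

    final_answer = translate_out(prompt)
    return CascadeResult(english_question, reasoning_trace, final_answer)
```

Because the change only affects the prompt passed to the last stage, a sketch like this needs no retraining of any component, which matches the training-free framing in the summary.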

Evaluation and Benchmarking · Multilingual Reasoning Cascades Need More Context

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 26d ago

Synthetic linguistic reasoning traces improve low-resource machine translation via in-context learning

Researchers propose a pipeline that generates step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks to assist LLMs in translating extremely low-resource languages. Evaluated on Xibe and Chintang across ICL, SFT, and RFT settings, the traces prove most effective as inference-time guidance rather than training data. Models can leverage reliable grammatical analyses at inference time but struggle to learn to generate accurate traces themselves, identifying trace generation quality as the key bottleneck.

Evaluation and Benchmarking · Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation? · Universal Dependencies

6 · arXiv · cs.CL · 27d ago

Luar: Selective Translation via Reinforcement Learning for Multilingual Reasoning

Luar is a reinforcement learning framework that trains reasoning language models to selectively invoke English translation only when direct understanding of a non-English input is deemed unreliable. The approach, built on top of GRPO, outperforms standard multilingual baselines across reasoning benchmarks, with especially large gains on low-resource languages. Analysis confirms the model learns to avoid unnecessary translation when direct reasoning suffices, and generalizes the translation-call behavior to unseen low-resource languages.

Frontier Model Releases · Evaluation and Benchmarking · GRPO · Luar · Reasoning Language Models

4 · arXiv · cs.CL · 11d ago

Steerable Model Merging (ST-Merge) improves multilingual reasoning via adaptive gated cross-attention

Researchers propose ST-Merge, a framework for adaptively merging a multilingual model and a reasoning model using a gated cross-attention mechanism that weights each source model's contribution based on input characteristics. The approach addresses the limitation of static one-size-fits-all merging strategies that fail to resolve conflicts between source models. Experiments across 21 languages on four multilingual reasoning benchmarks show consistent improvements over strong baselines.

Evaluation and Benchmarking · Steerable Model Merging

6 · arXiv · cs.CL · 1mo ago

LANG: Reinforcement Learning Framework for Multilingual Reasoning with Language-Adaptive Hint Guidance

LANG is a new RL-based framework for improving multilingual reasoning in LLMs that addresses the trade-off between input-language consistency and reasoning quality. It uses language-conditioned hints with a progressive decay schedule and a language-adaptive switch to tailor learning to per-language difficulty. Empirical results on multilingual mathematical benchmarks show improved reasoning without language drift toward English, and the approach generalizes beyond mathematics.

Evaluation and Benchmarking · Alignment and RLHF · large language models · LANG · multilingual mathematical benchmarks

5 · arXiv · cs.CL · 5d ago

Cross-lingual prompting strategies unlock hidden parametric knowledge in LLMs

A new arXiv preprint investigates how cross-lingual prompting can surface factual knowledge that standard inference techniques fail to retrieve in multilingual LLMs. The authors identify four dimensions of cross-lingual exploration governing parametric knowledge retrieval and evaluate them on multilingual factual benchmarks across 17 typologically diverse languages. Results show cross-lingual exploration improves both factual recall and cross-lingual consistency, and is claimed to be a more compute-efficient approach than scaling native-language inference.

Evaluation and Benchmarking · Cross-Lingual Exploration for Parametric Knowledge

4 · arXiv · cs.CL · 12d ago

Source language selection in cross-lingual in-context learning challenges fine-tuning assumptions

A new arXiv paper conducts a broad empirical study of cross-lingual transfer in few-shot in-context learning (ICL), spanning seven tasks, six models, and a typologically diverse set of languages. The study finds that conventional heuristics from supervised fine-tuning — such as relying on linguistic similarity or data availability — do not consistently transfer to the ICL regime. The authors also analyze language confusion as a key obstacle in generative cross-lingual ICL and propose alternative heuristics for source language selection.

Evaluation and Benchmarking · When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

5 · arXiv · cs.CL · 24d ago

Reinforcement learning enables meta-skill for translating unseen low-resource languages via in-context linguistic knowledge

Researchers propose an RL-based training approach for translating extremely low-resource or unseen languages by rewarding models for extracting and applying in-context linguistic knowledge (e.g., grammar books) rather than memorizing specific languages. Using chrF as a surface-level reward signal, RL-trained models outperform both in-context learning and supervised fine-tuning on completely unseen languages at test time. The work extends outcome-based RL beyond math and coding reasoning tasks, suggesting broader applicability to language learning from context.

Evaluation and Benchmarking · Alignment and RLHF · Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation · chrF

3 · arXiv · cs.CL · 5d ago

Framework for measuring users' mental models of machine translation quality in human-AI collaboration

A new arXiv paper introduces a cross-lingual question answering framework to study how users form mental models of speech translation systems, measuring whether users can predict where MT output is likely to be wrong. The study finds that users develop stronger mental models with practice, particularly when they have some source-language knowledge or access to speech transcriptions. Results suggest cross-lingual QA is a viable downstream task for studying human-AI collaboration in translation contexts.

Evaluation and Benchmarking · Measuring User's Mental Models of Speech Translation in Human-AI Collaboration