4 · arXiv cs.CL (Computation and Language) · 3d ago

Source language selection in cross-lingual in-context learning challenges fine-tuning assumptions

A new arXiv paper conducts a broad empirical study of cross-lingual transfer in few-shot in-context learning (ICL), spanning seven tasks, six models, and a typologically diverse set of languages. The study finds that conventional heuristics from supervised fine-tuning, such as relying on linguistic similarity or data availability, do not consistently transfer to the ICL regime. The authors also analyze language confusion as a key obstacle in generative cross-lingual ICL and propose alternative heuristics for source language selection.

Evaluation and Benchmarking · When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 15d ago

Reinforcement learning enables meta-skill for translating unseen low-resource languages via in-context linguistic knowledge

Researchers propose an RL-based training approach for translating extremely low-resource or unseen languages by rewarding models for extracting and applying in-context linguistic knowledge (e.g., grammar books) rather than memorizing specific languages. Using chrF as a surface-level reward signal, RL-trained models outperform both in-context learning and supervised fine-tuning on completely unseen languages at test time. The work extends outcome-based RL beyond math and coding reasoning tasks, suggesting broader applicability to language learning from context.

Evaluation and Benchmarking · Alignment and RLHF · Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation · chrF
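
For readers unfamiliar with chrF, the surface-level reward mentioned above, here is a minimal sketch of a character n-gram F-score. It is a simplified illustration, not the paper's reward implementation: the function names, the whitespace stripping, and the defaults (`max_n=6`, `beta=2.0`) are assumptions chosen to follow the common chrF convention.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace (an assumed choice)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over averaged char n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because the score only compares character sequences, it can be computed for any language without a tokenizer, which is one plausible reason it suits unseen-language settings.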

5 · arXiv · cs.CL · 4d ago

ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs

Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.

Agent and Tool Ecosystem · Alignment and RLHF · GRPO · ContextRL

3 · arXiv · cs.CL · 12d ago

Supervised vs. in-context learning for Turkish multiword expression classification

A new arXiv paper evaluates Turkish idiomatic light verb construction (LVC) detection as a binary classification task, comparing a supervised BERTurk baseline against three instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting. Results show LLMs have very low LVC recall in zero-shot but improve substantially with demonstrations, though one-shot prompting can introduce strong model-specific biases. The supervised baseline remains competitive, while carefully constructed few-shot prompts allow GPT-OSS-20B and Qwen 2.5-14B to match or exceed it. The study highlights significant prompt sensitivity in Turkish metalinguistic classification tasks.

Evaluation and Benchmarking · Qwen2.5-7B · BERTurk · gpt-oss-20b

6 · arXiv · cs.CL · 17d ago

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking · AI Safety Research · Qwen3-4B · Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

5 · arXiv · cs.CL · 9d ago

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking · Alignment and RLHF · On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

4 · arXiv · cs.CL · 17d ago

Synthetic linguistic reasoning traces improve low-resource machine translation via in-context learning

Researchers propose a pipeline that generates step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks to assist LLMs in translating extremely low-resource languages. Evaluated on Xibe and Chintang across ICL, SFT, and RFT settings, the traces prove most effective as inference-time guidance rather than training data. Models can leverage reliable grammatical analyses at inference time but struggle to learn to generate accurate traces themselves, which points to trace generation quality as the key bottleneck.

Evaluation and Benchmarking · Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation? · Universal Dependencies

6 · arXiv · cs.CL · 12d ago

Study finds local languages provide better cultural knowledge access in LLMs once proficiency is controlled

A new arXiv paper introduces a controlled evaluation framework to disentangle language proficiency from culture-specific knowledge access in LLMs. Using real-world cultural questions across 13 locales and ~80 models, the authors apply item response theory to show that while English dominates on culture-agnostic questions, local languages yield a consistent knowledge-access advantage on culture-specific questions once proficiency differences are factored out. The finding challenges the common interpretation that weaker local-language accuracy implies weaker cultural knowledge, and has implications for how multilingual and regionally-aligned models are evaluated.

Evaluation and Benchmarking · The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs · item response theory
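
As background on item response theory, the sketch below shows the standard two-parameter logistic (2PL) model, in which the probability of a correct answer depends on the gap between a respondent's ability and an item's difficulty. The summary does not say which IRT variant the authors use, so the 2PL form and all names here (`p_correct`, `theta`, `difficulty`, `discrimination`) are assumptions for illustration only.

```python
import math


def p_correct(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """2PL item response function: P(correct) = sigmoid(a * (theta - b)).

    theta          -- latent ability of the respondent (here, a model)
    difficulty     -- item difficulty b; higher means harder
    discrimination -- slope a; how sharply the item separates abilities
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical values: the same ability yields different success
# probabilities on an easier and a harder item.
easy = p_correct(theta=0.5, difficulty=-1.0)
hard = p_correct(theta=0.5, difficulty=2.0)
```

Fitting such a model separates how able a respondent is from how hard each question is, which is the general idea behind controlling for proficiency before comparing languages.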

6 · arXiv · cs.CL · 18d ago

AgentCL: A Rigorous Evaluation Framework for Continual Learning in Language Agents

AgentCL is a new benchmark and evaluation framework designed to rigorously assess continual learning in language agents, addressing gaps in existing benchmarks that focus on retrieval over long-context documents or use naive task streams with limited cross-task analysis. The framework constructs compositional task streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, contrasting them with naive streams to measure transfer gains. The authors also introduce MemProbe, a probing method that stores interactions, insights, and skills while filtering unreliable experiences during consolidation. Empirical results across coding, deep research, and language understanding tasks show that controlled streams better distinguish memory design quality, and that naive streams can mask memory-induced degradation.

Long Context Evolution · Evaluation and Benchmarking · AgentCL · MemProbe · Continual Learning