4 · arXiv cs.CL (Computation and Language) · 2d ago

G-IdiomAlign: Gloss-pivoted benchmark for cross-lingual idiom alignment in LLMs

Researchers introduce G-IdiomAlign, a benchmark anchoring idioms via English glosses from Wiktionary to evaluate cross-lingual idiom equivalence in LLMs. The benchmark supports two evaluation protocols: a multiple-choice task with typed distractors and a gloss-contrastive generation task isolating the effect of explicit semantic pivots. Experiments across diverse LLMs find that literal translation bias is the dominant failure mode, especially for low-resource languages, and that gloss conditioning improves performance but leaves substantial headroom. Mechanistic analysis on Qwen3-8B suggests cross-condition differences are concentrated in attention heads rather than layers.

Evaluation and Benchmarking Qwen3-4B G-IdiomAlign Wiktionary

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 1mo ago

GIM: A Grounded Integration Measure Benchmark for Evaluating Multi-Domain Cognitive Coordination in LLMs

The Grounded Integration Measure (GIM) is a new LLM benchmark of 820 original problems designed to resist benchmark saturation by requiring integration of multiple cognitive operations—constraint satisfaction, state tracking, epistemic vigilance, audience calibration—over broadly accessible knowledge. Unlike knowledge-escalation benchmarks (GPQA, HLE) or pure abstraction benchmarks (ARC-AGI), GIM grounds reasoning in realistic tasks without gating on specialized expertise. The authors calibrate a 2-parameter logistic IRT model over 200k+ prompt-response pairs across 28 models and 47 test configurations, producing the most extensive published study of test-time compute vs. model capability tradeoffs on a fixed benchmark. A key finding is that within-family configuration choices (thinking budget, quantization) matter as much as model selection.
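For readers unfamiliar with the 2-parameter logistic IRT model mentioned above, here is a minimal sketch of the standard 2PL item response function. It is not taken from the GIM paper: the function name `p_correct` and the parameter names (`theta` for model ability, `a` for item discrimination, `b` for item difficulty) are illustrative assumptions following the usual textbook notation.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function (textbook form, not GIM-specific).

    theta: latent ability of the test taker (here, a model configuration)
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which P(correct) is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # When ability equals difficulty, the predicted success rate is 0.5.
    print(p_correct(theta=0.0, a=1.0, b=0.0))
    # A higher-ability configuration on the same item scores higher.
    print(p_correct(theta=1.0, a=1.0, b=0.0))
```

Under this form, calibrating the model means fitting one `theta` per test configuration and one `(a, b)` pair per problem to the observed correct/incorrect responses; how GIM performs that fit is not described in the summary above.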

Frontier Model Releases Evaluation and Benchmarking 2-Parameter Logistic IRT Model GIM (Grounded Integration Measure) test-time compute

4 · arXiv · cs.CL · 17d ago

AlignAtt4LLM adapts simultaneous speech translation policy to decoder-only LLMs for IWSLT 2026

Researchers present AlignAtt4LLM, a simultaneous speech translation system for IWSLT 2026 covering English to German, Italian, and Chinese. The system cascades Qwen3-ASR for incremental transcription with Gemma-4 E4B-it for translation, applying a novel AlignAtt policy adapted for decoder-only LLMs that lack encoder-decoder cross-attention. Key contributions include explicit source span prompting, offline alignment head selection, and query/key capture to recover a usable attention-based read/write policy. The system outperforms IWSLT 2026 baselines for European language pairs in both low- and high-latency regimes.

Evaluation and Benchmarking Multimodal Progress Gemma-4 E4B-it IWSLT 2026 AlignAtt

4 · arXiv · cs.CL · 1mo ago

LexNeo-Bench: Probing LLM Knowledge of Lexical Borrowing in Luxembourgish via Knowledge-Graph Prompting

Researchers introduce LexNeo-Bench, a 3,050-instance benchmark for evaluating LLM performance on lexical borrowing classification and neology detection in Luxembourgish, a low-resource contact language. Three multilingual LLMs are tested across 34 prompt configurations; without external context, models perform near chance on borrowing classification (25–35%). Injecting instance-specific subgraphs from a linguistic knowledge graph raises accuracy to 71–81% and largely closes the gap between small and large models, though neology detection remains difficult. The study highlights the value of lexicon-aware, structured prompting for low-resource multilingual evaluation.

Evaluation and Benchmarking Agent and Tool Ecosystem LexNeo-Bench knowledge graph prompting LuxBorrow

4 · arXiv · cs.CL · 1mo ago

Quantifying Cross-Linguistic Effects of Syncretism on Agreement Attraction Using LLM Processing Proxies

This paper investigates why morphological syncretism amplifies agreement attraction errors in some languages (English, German, Russian) but not others (Turkish, Armenian), a pattern lacking a principled account. The authors use surprisal and attention entropy derived from large language models as proxies for human sentence processing across four languages. LLM-derived measures successfully replicate behavioral findings in English and German, align with Turkish null results, and partially capture Russian patterns. The work demonstrates LLMs as tools for cross-linguistic psycholinguistic investigation.

Evaluation and Benchmarking agreement attraction large language models surprisal

6 · arXiv · cs.CL · 10d ago

The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation

Researchers introduce the 'Shibboleth Effect' — systematic behavioral differences in LLMs when operating in different languages — and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish while Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms — chain-of-thought institutional anchoring and multilingual RLHF alignment — with direct implications for deploying LLMs in diplomatic or crisis-management contexts.

Evaluation and Benchmarking AI Safety Research DeepSeek V4 Mistral Large 2 GPT-4o

4 · arXiv · cs.CL · 12d ago

Phun-Bench: A Chinese benchmark for evaluating LLM phonological understanding

Researchers introduce Phun-Bench, a purpose-built benchmark for evaluating LLMs on phonological understanding in Chinese across three dimensions: Homophony, Rhyme, and Phonetic Similarity. The benchmark is designed to avoid rote-memorization shortcuts that plague existing phonological evals. Results show LLMs can recall correct pronunciations but fail to apply phonological knowledge flexibly as human speakers do, and the authors propose a hypothesis about the underlying mechanism of LLM phonological 'perception'.

Evaluation and Benchmarking Phun-Bench

6 · arXiv · cs.CL · 24d ago

MATCHA: Contrastive Semantic Alignment Metric for LLM Evaluation

MATCHA is a new automatic evaluation metric for LLMs that addresses a fundamental flaw in existing metrics: both token-overlap (ROUGE) and embedding-based (BERTScore) metrics routinely assign near-identical scores to semantically contradictory texts. The metric uses a dual-view approach that rewards proximity to a gold reference while penalizing adversarially generated counterfactual contradictions. Evaluated across eight benchmarks spanning QA, summarization, NLI, and semantic similarity tasks, MATCHA outperforms 23 embedding models and achieves 18.38% and 20.82% improvements over ROUGE-L and BERTScore respectively on TruthfulQA. Code and metric are publicly released.

Evaluation and Benchmarking AI Safety Research TruthfulQA ROUGE-L Siran Li

4 · arXiv · cs.CL · 2d ago

IndicContextEval: Benchmark for context utilisation in Audio LLMs across 8 Indic languages

Researchers introduce IndicContextEval, a 56-hour multilingual speech benchmark covering 555 speakers across 8 Indian languages and 23 professional domains, designed to test whether Audio LLMs genuinely use textual context (domain descriptions, entity lists) or rely on parametric knowledge. The benchmark employs a 7-level prompting framework that progressively introduces contextual signals including adversarial prompts with incorrect entities. Evaluation of five models reveals substantial variation in context utilisation behaviour, exposing a gap in existing ASR benchmarks that test only fixed prompting conditions.

Evaluation and Benchmarking Multimodal Progress IndicContextEval