Phun-Bench: A Chinese benchmark for evaluating LLM phonological understanding
Researchers introduce Phun-Bench, a purpose-built benchmark for evaluating LLMs on phonological understanding in Chinese across three dimensions: Homophony, Rhyme, and Phonetic Similarity. The benchmark is designed to avoid the rote-memorization shortcuts that plague existing phonological evals. Results show that LLMs can recall correct pronunciations but, unlike human speakers, fail to apply phonological knowledge flexibly. The authors also propose a hypothesis about the underlying mechanism of LLM phonological 'perception'.
Related events (8)
FilBench: Benchmarking LLM Capabilities in Filipino Language
FilBench is a new benchmark introduced to evaluate large language models on their ability to understand and generate Filipino. The benchmark targets a historically underrepresented language in NLP evaluation suites, assessing both comprehension and generation tasks. This work addresses gaps in multilingual LLM evaluation coverage, particularly for Southeast Asian languages.
SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability
SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.
BenCzechMark: A Benchmark for Evaluating LLM Czech Language Understanding
BenCzechMark is a new evaluation benchmark designed to assess large language model performance on Czech language tasks. The benchmark addresses the gap in non-English language evaluation, providing a structured way to measure LLM capabilities in Czech across multiple task types. Published on Hugging Face, it contributes to the growing ecosystem of multilingual and language-specific benchmarks.
LexNeo-Bench: Probing LLM Knowledge of Lexical Borrowing in Luxembourgish via Knowledge-Graph Prompting
Researchers introduce LexNeo-Bench, a 3,050-instance benchmark for evaluating LLM performance on lexical borrowing classification and neology detection in Luxembourgish, a low-resource contact language. Three multilingual LLMs are tested across 34 prompt configurations; without external context, models perform near chance on borrowing classification (25–35%). Injecting instance-specific subgraphs from a linguistic knowledge graph raises accuracy to 71–81% and largely closes the gap between small and large models, though neology detection remains difficult. The study highlights the value of lexicon-aware, structured prompting for low-resource multilingual evaluation.
C4STYLI Benchmark: Probing Cultural Aesthetic Stylistics Awareness in LLMs
Researchers introduce C4STYLI, a benchmark of stylized translated movie titles and advertising slogans from Hong Kong and mainland China, designed to evaluate LLMs on cross-cultural aesthetic stylistics. Evaluations reveal that LLMs diverge from human stylistic recognition, with recognition ability varying by text domain and not consistently predicting generation performance. Structural ablation using logistic regression probes shows that LLMs in the Hong Kong setting rely on surface-level linguistic cues rather than deeper stylistic structure, indicating limited cultural sensitivity.
Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard
Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.
MalayPrag: Benchmarking LLM Handling of Discourse Particles in Colloquial Malay
This paper introduces MalayPrag, a benchmark for evaluating LLMs' ability to handle discourse particles in colloquial Malay, a low-resource Southeast Asian language. The authors define five linguistically grounded attributes for interpreting pragmatic functions of discourse particles and test ten off-the-shelf LLMs on three prediction tasks. Results show substantial challenges for current LLMs in connecting discourse particles to their pragmatic functions in Malay. Providing the five structured attributes as scaffolding significantly improves model performance, suggesting that explicit pragmatic frameworks can compensate for low-resource language deficits.
SupraBench: First benchmark for evaluating LLMs on supramolecular chemistry reasoning
Researchers introduce SupraBench, the first benchmark designed to systematically evaluate LLMs on supramolecular chemistry tasks, including binding affinity prediction, top-binder selection, solvent identification, and host-guest description. The work also releases SupraPMC, a 16M-token corpus of supramolecular chemistry articles from Europe PMC, to support domain adaptation. Evaluation of a broad range of open and proprietary LLMs reveals substantial headroom across all tasks, with domain pretraining improving in-distribution regression but creating format-compliance tradeoffs. The benchmark targets a narrow but practically important scientific domain where LLM acceleration could reduce days-long dry-lab verification cycles.
