C4STYLI Benchmark: Probing Cultural Aesthetic Stylistics Awareness in LLMs
Researchers introduce C4STYLI, a benchmark of stylized translated movie titles and advertising slogans from Hong Kong and mainland China, designed to evaluate LLMs on cross-cultural aesthetic stylistics. Evaluations show that LLMs' stylistic recognition diverges from that of human readers, that recognition ability varies by text domain, and that it does not consistently predict generation performance. A structural ablation with logistic regression probes indicates that, in the Hong Kong setting, LLMs rely on surface-level linguistic cues rather than deeper stylistic structure, which points to limited cultural sensitivity.
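To make the probing idea concrete, here is a minimal sketch of how a probe-based ablation could be set up. It is not the authors' pipeline: the two features (`surface_cue`, `structure`), the -1/+1 coding, the toy data, and every function name are assumptions made for illustration. A logistic regression probe is trained on both features, then each feature is neutralized in turn; a large accuracy drop when a feature is removed suggests the probe depends on it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(X, y, lr=0.5, epochs=200):
    """Fit a logistic regression probe with plain batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def accuracy(w, b, X, y):
    """Share of examples where the probe's thresholded prediction matches the label."""
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def ablate(X, index):
    """Replace one feature with 0, its mean under the -1/+1 coding used here."""
    return [[0.0 if j == index else xj for j, xj in enumerate(xi)] for xi in X]

# Hypothetical toy data: columns are [surface_cue, structure].
# The label follows the surface cue only; structure is uninformative by design.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]] * 2
y = [1, 1, 0, 0] * 2

w, b = train_probe(X, y)
acc_full = accuracy(w, b, X, y)
acc_no_surface = accuracy(w, b, ablate(X, 0), y)
acc_no_structure = accuracy(w, b, ablate(X, 1), y)
print(acc_full, acc_no_surface, acc_no_structure)
```

In this toy setup, removing the surface cue drops the probe to chance while removing the structural feature changes nothing. That is the pattern one would read as reliance on surface cues. Real analyses would use features or hidden states extracted from the model under study and evaluate on held-out data.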
Related events (8)
StylisticBias benchmark reveals a small set of visual cues drives most social bias in MLLMs
Researchers introduce StylisticBias, a controlled benchmark of ~25K photorealistic face images with single-attribute variations designed to isolate how specific visual cues shift social judgments in multimodal LLMs. Evaluating six MLLMs across 25 binary social judgment scenarios, they find that age and body type dominate identity-level effects, while fashion style drives the largest attribute-level shifts, with ~15 attributes accounting for ~80% of total bias variation. The benchmark is released publicly on GitHub and Hugging Face, enabling fine-grained bias auditing of multimodal models.
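A claim of the form "about 15 attributes account for about 80% of total bias variation" is a cumulative-share statement. The sketch below shows one way such a figure could be computed from per-attribute effect sizes. The attribute names and numbers are made up, and the summary does not specify how the authors measure variation, so treat this as an illustration of the arithmetic only.

```python
def attributes_for_share(effects, share=0.8):
    """Return the top attributes whose combined effect first reaches `share` of the total."""
    total = sum(effects.values())
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
    chosen, running = [], 0.0
    for name, value in ranked:
        chosen.append(name)
        running += value
        if running >= share * total:
            break
    return chosen

# Hypothetical effect sizes (e.g. absolute shift in judgment rate per attribute).
toy_effects = {"attr_a": 6.0, "attr_b": 3.0, "attr_c": 1.0}
top = attributes_for_share(toy_effects)
print(top, len(top))
```

Here the two largest attributes cover 9 of 10 units of effect, so two attributes are enough to pass the 80% threshold.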
Phun-Bench: A Chinese benchmark for evaluating LLM phonological understanding
Researchers introduce Phun-Bench, a purpose-built benchmark for evaluating LLMs on phonological understanding in Chinese across three dimensions: Homophony, Rhyme, and Phonetic Similarity. The benchmark is designed to avoid rote-memorization shortcuts that plague existing phonological evals. Results show LLMs can recall correct pronunciations but fail to apply phonological knowledge flexibly as human speakers do, and the authors propose a hypothesis about the underlying mechanism of LLM phonological 'perception'.
FRANZ: A Communicative Audit Framework for LLM Response Framing on Subjective Questions
Researchers introduce FRANZ, an automated framework for auditing how LLMs frame responses to subjective, culturally-sensitive questions across four dimensions: cultural positioning, generalizing language, anthropomorphic cues, and conversational maxims. The work is paired with SQUARE, a 376k-question corpus drawn from 57 subreddits and mapped to 7 countries and 19 question categories. Applying FRANZ to three open-weight LLMs reveals statistically significant differences in framing behavior, and uncovers a positive coupling between insider positioning and anthropomorphism that varies by country. The study argues that existing evaluations focused on factual correctness miss important communicative dimensions of LLM outputs.
MalayPrag: Benchmarking LLM Handling of Discourse Particles in Colloquial Malay
This paper introduces MalayPrag, a benchmark for evaluating LLMs' ability to handle discourse particles in colloquial Malay, a low-resource Southeast Asian language. The authors define five linguistically grounded attributes for interpreting pragmatic functions of discourse particles and test ten off-the-shelf LLMs on three prediction tasks. Results show substantial challenges for current LLMs in connecting discourse particles to their pragmatic functions in Malay. Providing the five structured attributes as scaffolding significantly improves model performance, suggesting that explicit pragmatic frameworks can compensate for low-resource language deficits.
Study finds local languages provide better cultural knowledge access in LLMs once proficiency is controlled
A new arXiv paper introduces a controlled evaluation framework to disentangle language proficiency from culture-specific knowledge access in LLMs. Using real-world cultural questions across 13 locales and ~80 models, the authors apply item response theory to show that while English dominates on culture-agnostic questions, local languages yield a consistent knowledge-access advantage on culture-specific questions once proficiency differences are factored out. The finding challenges the common interpretation that weaker local-language accuracy implies weaker cultural knowledge, and has implications for how multilingual and regionally-aligned models are evaluated.
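Item response theory separates a respondent's proficiency from an item's difficulty, which is what allows raw accuracy to be reinterpreted once proficiency is controlled. The summary does not say which IRT variant the authors fit, so the sketch below assumes the simplest one, the one-parameter (Rasch) model, where the probability of a correct answer is a logistic function of proficiency minus difficulty. All numbers and names are hypothetical.

```python
import math

def p_correct(theta, difficulty):
    """Rasch (1PL) model: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def implied_difficulty(accuracy, theta):
    """Invert the Rasch model: the difficulty that yields `accuracy` at proficiency `theta`."""
    logit = math.log(accuracy / (1.0 - accuracy))
    return theta - logit

# Hypothetical case: the same culture-specific question set asked in two languages.
# The model is assumed more proficient in English (theta = 0.5) than in the
# local language (theta = -0.5), and raw accuracy is slightly higher in English.
english = implied_difficulty(accuracy=0.60, theta=0.5)
local = implied_difficulty(accuracy=0.55, theta=-0.5)
print(round(english, 3), round(local, 3))
```

With these made-up numbers, raw accuracy is lower in the local language, yet the implied difficulty is also lower there. Once the proficiency gap is accounted for, the questions are effectively easier to answer in the local language, which is the kind of knowledge-access advantage the study describes.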
Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard
Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.
DEFINED: Data-efficient framework for fine-grained creativity assessment in debate using LLMs
DEFINED is a computational framework for automated creativity assessment in debate. It operationalizes creativity as an eight-dimensional hierarchical metric system, implemented with a pretrained autoregressive language model and a hierarchical scoring head. The system addresses data scarcity through constrained data augmentation and mixed-granularity training on limited expert-annotated data. It outperforms prompt-based LLM evaluators and existing debate scoring methods on authentic competition data. The work is relevant to AI evaluation methodology and to the broader question of whether LLMs can reliably assess complex human cognitive outputs.
FilBench: Benchmarking LLM Capabilities in Filipino Language
FilBench is a new benchmark introduced to evaluate large language models on their ability to understand and generate Filipino. The benchmark targets a historically underrepresented language in NLP evaluation suites, assessing both comprehension and generation tasks. This work addresses gaps in multilingual LLM evaluation coverage, particularly for Southeast Asian languages.
