3arXiv cs.CL (Computation and Language)·2d ago

NuclearQAv2: A benchmark for evaluating LLM competence in nuclear engineering

Researchers introduce NuclearQAv2, a ~1,240 question benchmark for assessing LLM performance on nuclear engineering knowledge across boolean, numeric, and verbal question types. The benchmark is constructed via a hybrid pipeline combining expert-authored questions, existing datasets, and LLM-assisted generation from domain-specific corpora. Evaluation of multiple LLMs reveals strong performance on factual recall but significant gaps in quantitative reasoning and conceptual understanding, highlighting the need for multi-faceted domain-specific evaluation.

Evaluation and Benchmarking AI Safety Research NuclearQAv2

Related guides (2)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.CL·2d ago·source ↗

BINEVAL: Binary question decomposition for interpretable LLM evaluation and prompt optimization

Researchers introduce BINEVAL, a framework that decomposes LLM evaluation criteria into atomic binary yes/no questions, aggregating answers into multi-dimensional interpretable scores. The approach matches or outperforms baselines including UniEval and G-Eval on SummEval, Topical-Chat, and QAGS benchmarks, with particular strength on factual consistency. Beyond evaluation, the binary question feedback is shown to support iterative prompt optimization in both self-update and cross-model settings on IFBench. The framework is training-free and task-agnostic, addressing opacity and ceiling-effect problems common in holistic LLM judges.

Evaluation and Benchmarking Alignment and RLHF IFBench G-Eval SummEval +3 more

5arXiv · cs.CL·17d ago·source ↗

New Polish medical exam benchmark reveals MCQA overestimates LLM clinical competence

Researchers introduce an expanded Polish medical exam benchmark with over 15,000 new questions, two new domains, and four structural modifications designed to reduce multiple-choice artifacts and better test reasoning. Evaluating 21 LLMs under the harder setup, the best-performing model (Qwen3.5-122B) drops 28-31 percentage points compared to standard MCQA scores. The findings suggest standard MCQA benchmarks do not reliably reflect true medical competence, even when data contamination is low. The benchmark is publicly released to support further research.

Evaluation and Benchmarking Qwen3.5-122B Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

6arXiv · cs.AI·18d ago·source ↗

Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks

A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.

Frontier Model Releases Evaluation and Benchmarking Flaws in the LLM Automation Narrative

5arXiv · cs.AI·20d ago·source ↗

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking AI Safety Research How reliable are LLMs when it comes to playing dice?How reliable are LLMs when it comes to playing dice?

4Hugging Face Blog·1mo ago·source ↗

3LM: A Benchmark for Arabic LLMs in STEM and Code

TII UAE has released 3LM, a benchmark designed to evaluate large language models on Arabic-language STEM and coding tasks. The benchmark addresses a gap in multilingual evaluation infrastructure, where Arabic has been underrepresented relative to English and other high-resource languages. It targets both general-purpose and Arabic-specialized LLMs to assess their capabilities in technical domains.

Evaluation and Benchmarking Open Weights Progress 3LM Hugging Face TII UAE

5arXiv · cs.CL·1mo ago·source ↗

QUIET: Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation

QUIET (Quality Understanding via Interlocked Evaluation Testing) is a new benchmark designed to evaluate LLM creative generation capability rather than discriminative recognition, addressing limitations of benchmarks like Story Cloze Test and HellaSwag. The benchmark places 10-20 blanks with explicit content constraints and cascade dependencies into complete stories, requiring open-ended generation rather than multiple-choice selection. Scoring uses an information-theoretic automated protocol operationalizing a 'calibrated surprise' framework: score = satisfy * (1 + lambda * surprise), combining constraint satisfaction with a surprise measure, enabling objective automated evaluation without human graders or LLM-as-Judge subjectivity.

Frontier Model Releases Evaluation and Benchmarking Zou & Xu HellaSwag QUIET +2 more

4Hugging Face Blog·1mo ago·source ↗

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.

Frontier Model Releases Evaluation and Benchmarking 3C3H AraGen Hugging Face

5Hugging Face Blog·1mo ago·source ↗

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.

Evaluation and Benchmarking Agent and Tool Ecosystem LLM-as-a-Judge Judge Arena Hugging Face +2 more