benchmark

NuclearQAv2

benchmarkactiveprovisionalnuclearqav2-58964ebc·1 events·first seen 2d ago

Aliases: NuclearQAv2

More like this (12)

SQA3D Qwen-3-VL-2B QwQ-32B Qwen-2.5-VL-3B MQuAKE QVQ-72B-Preview QwQ-32B-Preview Nuclear Proliferation Risk Classifier PQuAD nuclear norm Qwen3-14B Qwen2.5-14B

Recent events (1)

3arXiv · cs.CL·2d ago·source ↗

NuclearQAv2: A benchmark for evaluating LLM competence in nuclear engineering

Researchers introduce NuclearQAv2, a ~1,240 question benchmark for assessing LLM performance on nuclear engineering knowledge across boolean, numeric, and verbal question types. The benchmark is constructed via a hybrid pipeline combining expert-authored questions, existing datasets, and LLM-assisted generation from domain-specific corpora. Evaluation of multiple LLMs reveals strong performance on factual recall but significant gaps in quantitative reasoning and conceptual understanding, highlighting the need for multi-faceted domain-specific evaluation.

Evaluation and Benchmarking AI Safety Research NuclearQAv2