4 · arXiv cs.CL (Computation and Language) · 12d ago

DEFINED: Data-efficient framework for fine-grained creativity assessment in debate using LLMs

DEFINED is a computational framework for automated creativity assessment in debate scenarios, operationalizing creativity through an eight-dimensional hierarchical metric system implemented via a pretrained autoregressive language model with a hierarchical scoring head. The system addresses data scarcity through constrained data augmentation and mixed-granularity training from limited expert-annotated data. It outperforms prompt-based LLM evaluators and existing debate scoring methods on authentic competition data. The work is relevant to AI evaluation methodology and the broader question of whether LLMs can reliably assess complex human cognitive outputs.

Evaluation and Benchmarking · DEFINED

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 46h ago

Mechanistic analysis of how LLMs encode essay quality in internal representations

Researchers systematically probe the hidden representations of eight LLMs across three essay datasets (ASAP++, CSEE, ENEM) to understand how automated essay scoring (AES) works internally. Using linear probing, dimensionality reduction, and neuron-level analysis, they find essay quality is encoded in a linearly accessible form that emerges progressively across layers and partially transfers across prompts. Individual 'essay scoring neurons' are identified whose activations correlate with scores and respond to targeted interventions, with longer essays relying more on deeper layers. The work contributes to mechanistic interpretability of LLM-based scoring systems.

Evaluation and Benchmarking · From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models · CSEE · ENEM

6 · arXiv · cs.AI · 10d ago

Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks

A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.
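The distinction the summary draws between average performance, error magnitude, and variability can be illustrated with a small sketch. The function name, the scalar ground truth, and the choice of statistics below are all assumptions made for illustration; the paper's actual metrics are not described in this summary.

```python
from statistics import mean, pstdev


def summarize_errors(predictions, truth):
    """Summarize repeated numeric answers to one task (hypothetical helper).

    predictions: answers from repeated attempts at the same analysis task.
    truth: the reference value the answers are compared against.
    """
    errors = [abs(p - truth) for p in predictions]
    return {
        "mean_abs_error": mean(errors),  # average-case error
        "max_abs_error": max(errors),    # worst-case error magnitude
        "spread": pstdev(predictions),   # run-to-run variability
    }


# Two attempts that miss the reference by one unit in opposite directions.
print(summarize_errors([9.0, 11.0], truth=10.0))
```

Reporting the worst case and the spread alongside the mean is what lets two systems with similar average scores be told apart.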

Frontier Model Releases · Evaluation and Benchmarking · Flaws in the LLM Automation Narrative

4 · arXiv · cs.CL · 11d ago

Dep-LLM: Training-free depression diagnosis framework using structured multi-factor LLM reasoning

Dep-LLM is a training-free framework for automatic depression detection from clinical interviews that uses frozen foundation LLMs without fine-tuning. The system decomposes long clinical dialogues into five thematic factors via Chain-of-Thought analysis, applies token-level entropy-based confidence modulation, and integrates multi-factor signals for final diagnosis. Evaluated on DAIC-WOZ and E-DAIC datasets, it outperforms zero-shot baselines across 21 foundation LLMs and surpasses supervised domain-specific and commercial LLMs on multiple metrics.
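One way to picture token-level entropy as a confidence signal is sketched below. It computes the Shannon entropy of a next-token probability distribution and maps it to a confidence in [0, 1] by normalizing against the maximum possible entropy. This mapping and both function names are assumptions; the summary does not say how Dep-LLM actually modulates confidence.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_from_entropy(probs):
    """Hypothetical confidence: 1 minus entropy normalized by log(vocab size)."""
    n = len(probs)
    if n < 2:
        return 1.0  # a single possible token carries no uncertainty
    return 1.0 - token_entropy(probs) / math.log(n)


peaked = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain
flat = [0.25, 0.25, 0.25, 0.25]     # model is maximally uncertain
print(confidence_from_entropy(peaked), confidence_from_entropy(flat))
```

A peaked distribution yields a confidence near 1 and a uniform one yields 0, so a per-factor signal could be down-weighted when the model's tokens were generated under high uncertainty.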

Evaluation and Benchmarking · Agent and Tool Ecosystem · Chain-of-Thought Reasoning · Dep-LLM · DAIC-WOZ

5 · arXiv · cs.CL · 25d ago

QUIET: Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation

QUIET (Quality Understanding via Interlocked Evaluation Testing) is a new benchmark designed to evaluate LLM creative generation capability rather than discriminative recognition, addressing limitations of benchmarks like Story Cloze Test and HellaSwag. The benchmark places 10-20 blanks with explicit content constraints and cascade dependencies into complete stories, requiring open-ended generation rather than multiple-choice selection. Scoring uses an information-theoretic automated protocol operationalizing a 'calibrated surprise' framework: score = satisfy * (1 + lambda * surprise), combining constraint satisfaction with a surprise measure, enabling objective automated evaluation without human graders or LLM-as-Judge subjectivity.
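The scoring rule quoted above is simple enough to write out directly. The sketch below assumes that `satisfy` and `surprise` are already normalized to [0, 1] and that `lam` is a non-negative weight; the summary states neither the ranges nor how each term is computed, so the function name, the default weight, and the validation are hypothetical.

```python
def calibrated_surprise_score(satisfy, surprise, lam=0.5):
    """score = satisfy * (1 + lambda * surprise), as quoted in the summary.

    The [0, 1] ranges and the default lam are illustrative assumptions.
    """
    if not 0.0 <= satisfy <= 1.0 or not 0.0 <= surprise <= 1.0:
        raise ValueError("satisfy and surprise are assumed to lie in [0, 1]")
    if lam < 0.0:
        raise ValueError("lam is assumed to be non-negative")
    return satisfy * (1.0 + lam * surprise)


# A fill that violates every constraint scores zero however surprising it is.
print(calibrated_surprise_score(0.0, 1.0))
# A fully constrained but unsurprising fill keeps its base score.
print(calibrated_surprise_score(1.0, 0.0))
```

Because the form is multiplicative, surprise can only amplify a fill that already satisfies its constraints; it cannot compensate for one that does not.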

Frontier Model Releases · Evaluation and Benchmarking · Zou & Xu · HellaSwag · QUIET

5 · arXiv · cs.CL · 22d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.
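Turning community votes into preference pairs might look like the sketch below. It assumes each reply carries a net score (upvotes minus downvotes) and only pairs replies whose scores differ by a minimum margin; the margin, the function name, and the `prompt`/`chosen`/`rejected` record layout are illustrative assumptions, not details taken from the LLUMI paper.

```python
from itertools import combinations


def build_preference_pairs(post, replies, min_margin=5):
    """Build (chosen, rejected) pairs from vote scores (hypothetical helper).

    post: the original post text, used as the prompt.
    replies: list of (reply_text, net_score) tuples for that post.
    min_margin: assumed minimum score gap for a pair to count as a preference.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(replies, 2):
        if abs(score_a - score_b) < min_margin:
            continue  # too close to call; skip ambiguous pairs
        if score_a > score_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"prompt": post, "chosen": chosen, "rejected": rejected})
    return pairs


replies = [("reply a", 10), ("reply b", 2), ("reply c", 8)]
for pair in build_preference_pairs("post text", replies):
    print(pair["chosen"], ">", pair["rejected"])
```

The margin filter is one plausible way to reduce label noise, since small vote differences may reflect timing or visibility rather than quality.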

Open Weights Progress · AI Safety Research · Reddit · LLUMI · Direct Preference Optimization (DPO)

4 · Hugging Face Blog · 1mo ago

Letting Large Models Debate: The First Multilingual LLM Debate Competition

Hugging Face introduces a multilingual LLM debate competition where large language models compete against each other in structured debates. The initiative explores multi-agent interaction, argumentation quality, and cross-lingual reasoning capabilities. This represents an evaluation framework for assessing LLM persuasion, coherence, and multilingual performance in adversarial settings.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · LLM Debate Competition

5 · Hugging Face Blog · 1mo ago

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.
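For readers unfamiliar with Elo-based ranking, the standard update after one head-to-head comparison is sketched below. It uses the classic Elo formula with a 400-point scale and K = 32; those constants are the conventional defaults and are an assumption here, since the summary does not give Judge Arena's actual parameters.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one comparison.

    score_a: 1.0 if judge A was preferred, 0.0 if B was, 0.5 for a tie.
    k: step size (assumed value, not taken from Judge Arena).
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated judges; voters prefer judge A's verdict.
print(elo_update(1000.0, 1000.0, score_a=1.0))
```

Ratings move more when the outcome is unexpected, so an upset win over a highly rated judge shifts both ratings further than a win between equals.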

Evaluation and Benchmarking · Agent and Tool Ecosystem · LLM-as-a-Judge · Judge Arena · Hugging Face

6 · arXiv · cs.LG · 22d ago

SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability

SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.
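The shift from false positives to false negatives described above can be made concrete by computing both error rates from binary judgments. The sketch below assumes proposals have been binarized into sound (`True`) and unsound (`False`); the binarization, the function name, and the toy data are assumptions for illustration and do not reproduce the benchmark's scoring.

```python
def error_rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate).

    labels: reference judgments, True meaning the proposal is sound.
    predictions: the evaluator's judgments in the same order.
    """
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    negatives = sum(1 for y in labels if not y)
    positives = len(labels) - negatives
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


labels = [True, True, False, False]
optimistic = [True, True, True, False]    # accepts one unsound proposal
aggressive = [True, False, False, False]  # rejects one sound proposal
print(error_rates(labels, optimistic), error_rates(labels, aggressive))
```

In this toy data both evaluators make one mistake each, but the error moves from the false-positive column to the false-negative column, which is the pattern the summary reports under aggressive prompting.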

Evaluation and Benchmarking · AI Safety Research · ICLR · optimism bias · SoundnessBench