Entity · benchmark

QUIET

benchmark · active · quiet-2e745823 · 1 event · first seen May 26, 2026

Aliases: QUIET

Co-occurring entities

Zou & Xu · HellaSwag · Story Cloze Test · Calibrated Surprise

More like this (12)

NequIP · MILD · PQuAD · CLEVER · QVal · QVal · SQUALITY · MQuAKE · QUBRIC · Soft Q-Function · MedQADE · Whisper

Recent events (1)

5 · arXiv · cs.CL · May 26, 2026

QUIET: Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation

QUIET (Quality Understanding via Interlocked Evaluation Testing) is a new benchmark that evaluates the creative generation capability of LLMs rather than discriminative recognition, addressing a limitation of benchmarks such as Story Cloze Test and HellaSwag. It places 10 to 20 blanks with explicit content constraints and cascade dependencies into complete stories, and requires open-ended generation instead of multiple-choice selection. Scoring follows an automated, information-theoretic protocol that operationalizes a 'calibrated surprise' framework: score = satisfy * (1 + lambda * surprise), where satisfy is the constraint-satisfaction term, surprise is the surprise measure, and lambda weights the contribution of surprise. Because satisfy multiplies the whole expression, a completion with zero constraint satisfaction scores zero however surprising it is. This allows objective automated evaluation without human graders or the subjectivity of LLM-as-Judge.
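The scoring formula quoted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the benchmark's implementation: the function name, the assumed range of satisfy (0 to 1), the non-negativity of surprise, and the default lambda of 0.5 are all assumptions made here, since the summary does not specify how either term is computed or what value lambda takes.

```python
def calibrated_surprise_score(satisfy: float, surprise: float, lam: float = 0.5) -> float:
    """Illustrative version of: score = satisfy * (1 + lambda * surprise).

    Assumptions (not stated in the source summary):
      - satisfy lies in [0, 1]
      - surprise is non-negative
      - lam = 0.5 is a placeholder default, not the benchmark's value
    """
    if not 0.0 <= satisfy <= 1.0:
        raise ValueError("satisfy is assumed to lie in [0, 1]")
    if surprise < 0.0:
        raise ValueError("surprise is assumed to be non-negative")
    if lam < 0.0:
        raise ValueError("lam is assumed to be non-negative")
    # satisfy gates the whole score; surprise only adds a bonus on top of it.
    return satisfy * (1.0 + lam * surprise)


if __name__ == "__main__":
    # A completion that satisfies nothing scores zero regardless of surprise.
    print(calibrated_surprise_score(satisfy=0.0, surprise=10.0))
    # A fully satisfying but unsurprising completion scores exactly satisfy.
    print(calibrated_surprise_score(satisfy=1.0, surprise=0.0))
```

Under these assumptions the multiplicative form means surprise can raise the score of a constraint-satisfying completion but can never compensate for one that violates its constraints.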

Frontier Model Releases · Evaluation and Benchmarking · Zou & Xu · HellaSwag · QUIET