Entity · paper

Calibrated Surprise

paper · active · calibrated-surprise-114f85d6 · 2 events · first seen May 26, 2026

Aliases: Calibrated Surprise

Co-occurring entities

Zou & Xu · HellaSwag · QUIET · Story Cloze Test · BC Protocol · Creative Quality Alignment (CQA) · Zou · LIMA · Chain-of-Thought Fine-Tuning · Xu

More like this (12)

Uncertainty Calibration · Expected Calibration Error · surprisal · Surprises in Proper Positive-Only Learning · multicalibration · surprisal is Not a Theory · Variance-Calibrated Modulation · Surprisal Theory is Tautological (without Rational Grounding) · Toward Calibrated Mixture-of-Experts Under Distribution Shift · Optimal Deterministic Multicalibration and Omniprediction · BeyondUncertainty · Post-Training Shifts Confidence: A Three-Stage Analysis of How SFT, RL, and OPD Shape Pre-, Intra-, and Post-CoT Calibration

Recent events (2)

5 · arXiv · cs.CL · May 26, 2026

QUIET: Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation

QUIET (Quality Understanding via Interlocked Evaluation Testing) is a new benchmark designed to evaluate LLM creative generation capability rather than discriminative recognition, addressing limitations of benchmarks like Story Cloze Test and HellaSwag. The benchmark places 10-20 blanks with explicit content constraints and cascade dependencies into complete stories, requiring open-ended generation rather than multiple-choice selection. Scoring uses an information-theoretic automated protocol operationalizing a 'calibrated surprise' framework: score = satisfy * (1 + lambda * surprise), combining constraint satisfaction with a surprise measure, enabling objective automated evaluation without human graders or LLM-as-Judge subjectivity.
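The scoring rule quoted above, score = satisfy * (1 + lambda * surprise), can be illustrated with a minimal Python sketch. The function name, the default value of lambda, and the assumed ranges (satisfy in [0, 1], surprise non-negative) are hypothetical choices made for illustration; the summary does not say how the benchmark computes or normalizes either term, so this is not the paper's implementation.

```python
def calibrated_surprise_score(satisfy: float, surprise: float, lam: float = 0.5) -> float:
    """Combine constraint satisfaction and surprise as
    score = satisfy * (1 + lam * surprise).

    Assumptions (not stated in the source summary):
      - satisfy lies in [0, 1], where 0 means the constraints are not met
      - surprise is a non-negative number
      - lam = 0.5 is an arbitrary illustrative default
    """
    if not 0.0 <= satisfy <= 1.0:
        raise ValueError("satisfy is assumed to lie in [0, 1]")
    if surprise < 0.0:
        raise ValueError("surprise is assumed to be non-negative")
    # Surprise only scales the score up when constraints are satisfied:
    # with satisfy == 0 the product is 0 regardless of surprise.
    return satisfy * (1.0 + lam * surprise)


if __name__ == "__main__":
    print(calibrated_surprise_score(satisfy=0.0, surprise=3.0))
    print(calibrated_surprise_score(satisfy=1.0, surprise=0.0))
    print(calibrated_surprise_score(satisfy=0.5, surprise=2.0))
```

Because the form is multiplicative, a fill that violates its constraints scores zero however surprising it is, while a fully satisfying but unsurprising fill still earns the base score of satisfy.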

Frontier Model Releases · Evaluation and Benchmarking · Zou & Xu · HellaSwag · QUIET · +2 more

4 · arXiv · cs.CL · May 26, 2026

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

This paper empirically validates a creative quality metric from a companion work (Calibrated Surprise, Zou & Xu 2026a) under strict low-resource conditions: ~100 expert chain-of-thought annotations and a small base model. The authors introduce Creative Quality Alignment (CQA) as a class of engineering methods and identify a systematic bias in public alignment datasets toward craft knowledge, with weak coverage of audience modeling and reality-logic. A theoretical argument based on 'architectural duality' in single conditional distribution LLMs is offered to explain why so few examples suffice, distinguishing the result from purely empirical findings like LIMA.

Evaluation and Benchmarking · Alignment and RLHF · BC Protocol · Creative Quality Alignment (CQA) · Zou · +4 more