Entity · paper

Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models

paper · active

1 event · first seen Jun 16, 2026

Co-occurring entities

deepseek-coder · Best-of-N · HumanEval · MBPP

More like this (12)

Form, Not Content? A Preregistered, Placebo-Controlled Evaluation of Learned Error-Conditioned Self-Repair Through Prompts and Weights in Frozen Small Code Models
The Signal-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization
Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
Quantifying Faithful Confidence Expression in Large Reasoning Models
Faithfulness to Refusal: A Causal Audit of Neuron Selectors
Words Speak Louder Than Code: Investigating Cognitive Heuristics in LLM-Based Code Vulnerability Detection
Win by Silence: Deletion Non-Monotonicity, Autonomous Exploitation, and Typed-State Gating in LLM Plan Evaluation
Explicit Fuzzy Logic in the Feed-Forward Layer: Self-Forgetting Quantifiers Discover Legible Grammatical-Licensing Detectors
Same Evidence, Different Target: Decoding How Diagnostic Evidence Bears on Causal Questions from Language-Model States
Operadic consistency: a label-free signal for compositional reasoning failures in LLMs
Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness

Recent events (1)

5 · arXiv · cs.CL · Jun 16, 2026

Post-hoc falsification operators for frozen small code models fail to beat Best-of-N in leakage-free evaluation

A measurement study evaluates 26 post-hoc operators (selection, verification, repair, elimination, and portfolios) applied to frozen small code models (≤1.5B parameters) against a Best-of-N (BoN) baseline under a strict leakage-free, matched-compute protocol. None of the semantic operators improves held-out accuracy over BoN; the paper traces the failure to three structural mechanisms: a coverage wall, a capability scissors, and a near-empty consensus trap. Two non-semantic operators do provide value: an expression-layer recovery method (M1) lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p = 2.4e-4), and an adaptive consensus early-stop saves about 19% of compute with no loss in accuracy. The paper's core lesson is that harness quality and coverage measurement should precede investment in semantic post-hoc reasoning.
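The summary does not say how the paper's adaptive consensus early-stop decides when to halt, so the following is only a minimal sketch of the general idea, not the authors' rule: draw samples one at a time and stop once a single behavioural signature holds a large enough share of the samples drawn so far. The function name `sample_with_consensus_stop`, the `draw` callback, and the `min_samples` and `threshold` defaults are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def sample_with_consensus_stop(
    draw: Callable[[int], Hashable],
    max_samples: int = 10,
    min_samples: int = 3,
    threshold: float = 0.8,
) -> Tuple[Hashable, int]:
    """Draw up to max_samples candidates, stopping once one answer dominates.

    draw(i) is assumed to return a hashable signature for the i-th sample,
    for example the tuple of outputs a candidate program gives on a few
    public inputs. Returns (most common signature, samples actually drawn).
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")

    counts: Counter = Counter()
    for i in range(max_samples):
        counts[draw(i)] += 1
        drawn = i + 1
        top, top_count = counts.most_common(1)[0]
        # Stop early only after a minimum number of samples, and only if
        # the leading signature's share has reached the threshold.
        if drawn >= min_samples and top_count / drawn >= threshold:
            return top, drawn

    # Budget exhausted without consensus: fall back to the plurality answer.
    top, _ = counts.most_common(1)[0]
    return top, max_samples
```

Under this sketch, the compute saved on a task is `1 - drawn / max_samples`; averaging that over a benchmark gives a figure comparable in kind to the roughly 19% reported, though the actual number depends on the paper's own stopping rule and budget.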

Evaluation and Benchmarking · Inference Economics