technique

Best-of-N

technique · active · provisional · best-of-n-e78afc85 · 1 event · first seen 37h ago

Aliases: Best-of-N

Co-occurring entities

Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models · deepseek-coder · HumanEval · MBPP

More like this (12)

Best-of-N Sampling Nx Nex-N2 leave-one-out baseline Pair Opt-dist Rank-to-Distill Top-p sampling n8n OFA (One-For-All) Optimum-NVIDIA Benchmark Everything Everywhere All at Once Top-k Accuracy

Recent events (1)

5 · arXiv · cs.CL · 37h ago

Post-hoc falsification operators for frozen small code models fail to beat Best-of-N in leakage-free evaluation

A measurement study evaluates 26 post-hoc operators (selection, verification, repair, elimination, and portfolios) applied to frozen small code models (≤1.5B parameters) against a Best-of-N (BoN) baseline under a strict leakage-free, matched-compute protocol. None of the semantic operators improves held-out accuracy over BoN. The study traces this failure to three structural mechanisms: a coverage wall, a capability scissors, and a near-empty consensus trap. Two non-semantic operators do provide value: an expression-layer recovery method (M1) lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4), and an adaptive consensus early-stop saves ~19% compute without reducing accuracy. The paper's core lesson is that harness quality and coverage measurement should precede investment in semantic post-hoc reasoning.
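
For readers unfamiliar with the baseline, the following is a minimal Python sketch of Best-of-N selection together with one possible consensus-based early stop. It is illustrative only and is not taken from the paper: the names `generate`, `score`, and `agree` are hypothetical, and treating "consensus" as exact string agreement between samples is an assumption (the study may define agreement differently, for example on execution behavior).

```python
from collections import Counter
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[], str],
    score: Callable[[str], float],
    n: int,
) -> Tuple[str, int]:
    """Draw n candidates; return (highest-scoring candidate, samples used)."""
    candidates: List[str] = [generate() for _ in range(n)]
    return max(candidates, key=score), n


def best_of_n_early_stop(
    generate: Callable[[], str],
    score: Callable[[str], float],
    n: int,
    agree: int,
) -> Tuple[str, int]:
    """Best-of-N with a hypothetical consensus early stop.

    Sampling halts as soon as `agree` identical candidates have been seen
    (assumption: consensus means exact string match). If no consensus is
    reached within n samples, fall back to plain Best-of-N selection.
    """
    counts: Counter = Counter()
    candidates: List[str] = []
    for used in range(1, n + 1):
        candidate = generate()
        candidates.append(candidate)
        counts[candidate] += 1
        if counts[candidate] >= agree:
            return candidate, used  # consensus reached; skip remaining samples
    return max(candidates, key=score), n
```

In this sketch the compute saving comes only from the samples skipped once consensus is reached; how the paper's early stop decides when to halt, and how it arrives at the reported ~19% saving, is not specified on this page.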

Evaluation and Benchmarking · Inference Economics · Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models · deepseek-coder · Best-of-N