Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

paper · active · scaffold-not-vocabulary-a-controlled-two-tier-pre-registered-study-of-a-popperian-code-generation-skill-375d450d · 1 event · first seen Jun 5, 2026

Aliases: Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

Co-occurring entities

Claude Sonnet 4 · HumanEval · Qwen2.5-Coder · Anthropic

More like this (12)

Form, Not Content? A Preregistered, Placebo-Controlled Evaluation of Learned Error-Conditioned Self-Repair Through Prompts and Weights in Frozen Small Code Models
Multi-Agent Scaffold
Grammar-Constrained Decoding
Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages
RL Post-Training Builds Compositional Reasoning Strategies
Code Is More Than Text: Uncertainty Estimation for Code Generation
Words Speak Louder Than Code: Investigating Cognitive Heuristics in LLM-Based Code Vulnerability Detection
CodeParrot
Function-Aware Fill-in-the-Middle as Mid-Training for Coding Agent Foundation Models
SlopCodeBench
Arithmetic Pedagogy for Language Models
A Factorial Study of Synthetic Data Generation for Low-Resource Machine Translation using Grammar Books

Recent events (1)

arXiv · cs.CL · Jun 5, 2026

Pre-registered study finds Popperian code-generation prompt skills add no benefit beyond structural scaffolding

A pre-registered two-tier ablation study tests whether 'Popperian falsificationist' prompt skills improve LLM code generation through their procedural content or merely through structural scaffolding. Using Claude Sonnet 4.6 and Qwen2.5-Coder-0.5B with execution-based evaluation (HumanEval+ unit tests) rather than LLM-as-judge, the authors find that on the small model, structured prompts lift correctness by 20-22 points but the full Popperian skill shows no separable benefit over a labels-only scaffold. The paper contributes a calibrated negative result and a reusable disambiguation protocol for evaluating prompt-skill families, while also documenting that LLM self-judges at 0.5B scale perform no better than random selection.
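To make the phrase "execution-based evaluation" concrete, here is a minimal sketch of scoring generated code by running unit tests rather than asking a model to judge it. It is not the authors' harness: the function names (`passes`, `pass_rate`), the toy `add` task, and the two condition labels are hypothetical, and the numbers it produces are illustrative only, not results from the paper.

```python
def passes(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate defines code that passes every assertion.

    Assumption: tests are plain `assert` statements. A real harness would run
    this in a sandboxed subprocess with a timeout; this sketch does not.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assertions against it
    except Exception:
        # Any error (failed assertion, syntax error, runtime error) is a fail.
        return False
    return True


def pass_rate(candidates: list[str], test_src: str) -> float:
    """Fraction of candidates that pass the unit tests."""
    return sum(passes(c, test_src) for c in candidates) / len(candidates)


# Hypothetical unit tests for a toy task.
TESTS = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"

# Hypothetical generations under two prompt conditions (toy data).
bare_prompt = [
    "def add(a, b):\n    return a - b\n",  # wrong
    "def add(a, b):\n    return a + b\n",  # right
]
labels_only_scaffold = [
    "def add(a, b):\n    return a + b\n",  # right
    "def add(a, b):\n    return b + a\n",  # right
]

bare = pass_rate(bare_prompt, TESTS)
scaffold = pass_rate(labels_only_scaffold, TESTS)
lift_points = 100 * (scaffold - bare)  # difference in percentage points
```

The same comparison, repeated for a full-skill condition against a labels-only condition, is the kind of contrast the summary describes: if the two structured conditions score alike, the lift is attributable to structure rather than to the skill's procedural content.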

Evaluation and Benchmarking · Claude Sonnet 4 · Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill · HumanEval