paper
All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code
paper · active · provisional
More like this (12)
Benchmark Agent
oracle testing
Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals
Baseline Agent
Super-Agent benchmark
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
CodeAgents
Activation Oracles
Meta-Agent Challenge
MemoryAgentBench
Cognizant Agent Foundry
AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
Recent events (1)
Empirical study finds 80% of AI agent-authored test patches lack meaningful verification logic
A large-scale empirical study of 86,156 test-file patches from 33,596 agent-authored GitHub pull requests finds that 80.2% contain weak or no explicit oracle signals: they execute code without meaningfully verifying its behavior. The study covers five coding agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code) across 2,807 repositories and introduces a syntactic taxonomy of eight oracle-signal categories. Although patches with strong oracles show lower raw merge rates, a regression analysis finds that strong oracles are associated with significantly higher odds of merging (odds ratio, OR = 1.28). This suggests that current quality gates based on the mere presence of test files substantially overestimate how much verification these patches actually perform.