Entity · paper

All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code

paper · active · all-smoke-no-alarm-oracle-signals-in-agent-authored-test-code-429b3dc1 · 1 event · first seen Jun 17, 2026

Aliases: All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code

Co-occurring entities

GitHub · Devin · Cursor · Claude Code · OpenAI Codex · GitHub Copilot

More like this (12)

Benchmark Agent
AlphaOracle
oracle testing
Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals
GridDebugAgent
Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability
Baseline Agent
Super-Agent benchmark
Token-Flow Firewall: Semantic Runtime Auditing for Persistent AI Agents
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
OpenSkillRisk: Benchmarking Agent Safety When Using Real-World Risky Third-Party Skills
Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes

Recent events (1)

6 · arXiv · cs.AI · Jun 17, 2026

Empirical study finds 80% of AI agent-authored test patches lack meaningful verification logic

A large-scale empirical study of 86,156 test-file patches from 33,596 agent-authored GitHub PRs finds that 80.2% contain weak or no explicit oracle signals, meaning they execute code without verifying its behavior. The study covers five coding agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code) across 2,807 repositories and introduces a syntactic taxonomy of eight oracle signal categories. Despite lower raw merge rates, regression analysis shows that strong oracles are associated with significantly higher odds of merging (OR = 1.28), suggesting that current quality gates based on test-file presence alone substantially overestimate verification strength.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GitHub · Devin · Cursor · +4 more