SWE-Interact is a new benchmark that evaluates coding agents in realistic multi-turn developer workflows: a user simulator opens with vague instructions and reveals requirements progressively over the conversation. Existing SWE benchmarks provide the complete specification upfront; SWE-Interact instead tests interactive goal discovery and iterative refinement. Frontier models, including Claude Opus 4.8 and GPT-5.5, solve ~50% of the single-turn baseline tasks but only ~25% of SWE-Interact tasks, roughly halving their success rate and revealing a significant capability gap. The benchmark is grounded in large-scale studies of real coding-agent interactions and identifies failure modes such as over-agentic coding, requirement forgetting, and early abandonment under ambiguity.
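The progressive-reveal setup described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the benchmark's actual harness: `SimulatedUser`, `run_episode`, and the two toy agents are hypothetical names invented for illustration, and an "agent" is reduced to a function that returns the set of requirements its latest solution covers. The contrast between the two agents mirrors the requirement-forgetting failure mode.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class SimulatedUser:
    """Hypothetical user simulator: one vague opening request, then one
    hidden requirement revealed per turn."""
    vague_request: str
    hidden_requirements: List[str]
    _cursor: int = 0  # index of the next requirement to reveal

    def reveal_next(self) -> Optional[str]:
        """Return the next hidden requirement, or None when all are revealed."""
        if self._cursor >= len(self.hidden_requirements):
            return None
        requirement = self.hidden_requirements[self._cursor]
        self._cursor += 1
        return requirement


# A toy "agent" maps the user-message history to the set of requirements
# that its current solution satisfies (an assumption made for brevity).
Agent = Callable[[List[str]], Set[str]]


def run_episode(user: SimulatedUser, agent: Agent) -> bool:
    """Run one multi-turn episode; succeed only if the final solution
    covers every requirement revealed during the conversation."""
    history = [user.vague_request]
    covered = agent(history)
    while (requirement := user.reveal_next()) is not None:
        history.append(requirement)
        covered = agent(history)  # the agent revises its solution each turn
    return set(user.hidden_requirements) <= covered


def remembering_agent(history: List[str]) -> Set[str]:
    """Keeps every requirement stated so far (everything after the opener)."""
    return set(history[1:])


def forgetful_agent(history: List[str]) -> Set[str]:
    """Only honors the most recent message, dropping earlier requirements."""
    return set(history[-1:])


def make_user() -> SimulatedUser:
    """Build a fresh simulated user with two hidden requirements."""
    return SimulatedUser(
        vague_request="Add an export feature",
        hidden_requirements=["output must be CSV", "include a header row"],
    )


if __name__ == "__main__":
    print("remembering agent solved:", run_episode(make_user(), remembering_agent))
    print("forgetful agent solved:", run_episode(make_user(), forgetful_agent))
```

In this toy setting the episode is scored only on the final solution, so an agent that satisfies each new request in isolation still fails once an earlier requirement is dropped.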