Entity · benchmark

SWE-Interact

benchmark · active · provisional · swe-interact-9f887376 · 1 event · first seen 3d ago

Aliases: SWE-Interact

Co-occurring entities

SWE-Bench Verified · OpenAI · GPT-5.5 · Claude Opus 4.8 · Anthropic

More like this (12)

SWE-Explore · SWE-Pro · SWE-Agent · SWE-Perf · SWE-Smith · FrontierSWE · SWE-fficiency · DeepSWE · Mini-SWE-Agent · SWE-Marathon · SWE-Bench Lite · DeepSWIP

Recent events (1)

7 · arXiv · cs.LG · 3d ago

SWE-Interact benchmark evaluates coding agents on multi-turn, user-driven software engineering tasks

SWE-Interact is a new benchmark that evaluates coding agents in realistic multi-turn developer workflows: a user simulator opens with vague instructions and progressively reveals requirements. Unlike existing SWE benchmarks, which provide the complete specification upfront, SWE-Interact tests interactive goal discovery and iterative refinement. Frontier models, including Claude Opus 4.8 and GPT-5.5, solve ~50% of single-turn baseline tasks but only ~25% of SWE-Interact tasks, roughly half the single-turn rate. The benchmark is grounded in large-scale studies of real coding-agent interactions and identifies failure modes such as over-agentic coding, requirement forgetting, and early abandonment under ambiguity.
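The interaction pattern described above (a vague opening request, then requirements revealed turn by turn) can be illustrated with a toy sketch. This is not SWE-Interact's actual harness: every name here (`SimulatedUser`, `run_episode`, `attentive_agent`, `forgetful_agent`) is hypothetical, and the pass/fail rule is an assumed stand-in, since the summary does not say how tasks are scored. The forgetful agent mimics the "requirement forgetting" failure mode by honoring only the most recent message.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class SimulatedUser:
    """Hypothetical user simulator: vague request first, then one requirement per turn."""
    vague_request: str
    hidden_requirements: List[str]
    turn: int = 0

    def next_message(self) -> Optional[str]:
        """Return the next user message, or None once everything has been revealed."""
        if self.turn == 0:
            message = self.vague_request
        elif self.turn <= len(self.hidden_requirements):
            message = "Also: " + self.hidden_requirements[self.turn - 1]
        else:
            return None
        self.turn += 1
        return message


# An "agent" here is just a function from the conversation so far to the set of
# requirements its current solution satisfies (an assumption made for illustration).
Agent = Callable[[List[str]], Set[str]]


def run_episode(user: SimulatedUser, agent: Agent) -> bool:
    """Run the dialogue; pass only if the final solution meets every hidden requirement."""
    history: List[str] = []
    satisfied: Set[str] = set()
    while (message := user.next_message()) is not None:
        history.append(message)
        satisfied = agent(history)
    return set(user.hidden_requirements) <= satisfied


def attentive_agent(history: List[str]) -> Set[str]:
    # Tracks every requirement revealed after the opening request.
    return {m.removeprefix("Also: ") for m in history[1:]}


def forgetful_agent(history: List[str]) -> Set[str]:
    # Only honors the latest message, dropping earlier requirements.
    return {m.removeprefix("Also: ") for m in history[-1:] if m.startswith("Also: ")}
```

In this sketch an episode with two hidden requirements passes for `attentive_agent` and fails for `forgetful_agent`, because the latter's final solution reflects only the last requirement it was told about. A single-turn baseline would correspond to handing the agent all requirements in the first message.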

Evaluation and Benchmarking · Agent and Tool Ecosystem · SWE-Interact · SWE-Bench Verified · OpenAI