organization

Harvard University

organization · active · provisional · harvard-university-1c914e24 · 1 event · first seen 36h ago

Aliases: Harvard University

Co-occurring entities

Artificial Analysis · Llama 3.1 70B · Datacurve · IBM · Claude Opus 4.6 · Stanford University · Claude Sonnet 4.5 · ITBench-AA · SWE-bench · SWE-Agent · Stirrup · Meta · DeepSWE · ProgramBench · GPT-5.5

More like this (12)

Yale University · Princeton University · Yale Law School · University of Cambridge · University of Oxford · University of California, Berkeley · University of Chicago · Columbia Law School · University of Virginia · New York University · Northeastern University · Stanford University

Recent events (1)

6 · The Batch · 36h ago

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks, DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis), are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers include GPT-5.5 (70% on DeepSWE) and Claude Opus 4.7 (46.7% on ITBench-AA; 3% on ProgramBench at the 95% pass threshold), illustrating that these benchmarks leave substantial headroom even for frontier models.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Artificial Analysis · Llama 3.1 70B · Datacurve