Entity · benchmark

OSWorld

benchmarkactiveosworld-7a4974c9·9 events·first seen May 28, 2026

Aliases: OSWorld, OSWorld 2.0

Co-occurring entities

More like this (12)

OSWorld-Verified iOSWorld DevicesWorld SpatialWorld SearchOS OS-Shepherd AppWorld ScienceWorld MobileWorld Meta-World Word2World OWL

Recent events (9)

6arXiv · cs.AI·18h ago·source ↗

Empirical study finds inference-time scaling yields diminishing returns for local computer-use agents

Researchers present a systematic empirical study of inference-time scaling across four dimensions (contextual, temporal, structural, parallel) for locally-deployed computer-use agents under hardware constraints. Evaluating Qwen3-VL-8B/30B-A3B, UI-TARS-1.5-7B, and OpenCUA-7B on OSWorld, they find that additional compute often shifts rather than eliminates failure modes—contextual scaling saturates, temporal scaling extends erroneous trajectories, and structural decomposition adds overhead. The findings argue for selective compute allocation and failure-aware control mechanisms tailored to local model capabilities.

Evaluation and Benchmarking Inference Economics Qwen3-30B-A3B Qwen3-4B OpenCUA-7B +4 more

9Hacker News·Jul 24, 2026·source ↗

Anthropic releases Claude Opus 5

Anthropic has announced Claude Opus 5, a new flagship model release. The item originates from Anthropic's official news domain, indicating a primary source announcement. This would represent a significant step beyond the current Claude Opus 4.8 flagship and is likely to be a major frontier model release.

Frontier Model Releases Inference Economics Zapier Claude Max Claude Opus 4.6 +15 more

8Anthropic News·Jul 3, 2026·source ↗

Anthropic introduces extended thinking mode in Claude 3.7 Sonnet with visible reasoning traces

Anthropic released Claude 3.7 Sonnet with an 'extended thinking' capability that allows the model to allocate more compute and reasoning time to difficult problems, with a configurable 'thinking budget' for developers. The model's internal reasoning chain is exposed in raw form as a research preview, enabling transparency but raising faithfulness concerns — Anthropic notes that models often make decisions based on factors not explicitly discussed in their visible thinking. The release also includes improved agentic capabilities ('action scaling') demonstrated on OSWorld computer-use benchmarks and a Pokémon Red gameplay evaluation.

Frontier Model Releases AI Safety Research OSWorld Claude 3.7 Sonnet Anthropic +2 more

9Anthropic News·Jun 3, 2026·source ↗

Anthropic introduces computer use capability, upgraded Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic announced three major developments: an upgraded Claude 3.5 Sonnet with significant coding improvements (SWE-bench Verified rising from 33.4% to 49.0%, surpassing all publicly available models including reasoning models), a new Claude 3.5 Haiku that matches Claude 3 Opus performance at Haiku-tier speed, and a public beta of 'computer use' — a capability allowing Claude to control computers by viewing screens, moving cursors, clicking, and typing. Computer use is available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI, with early adopters including Replit, The Browser Company, and Cognition. Both safety institutes (US AISI and UK AISI) conducted pre-deployment testing, and the model was assessed as remaining within ASL-2 under Anthropic's Responsible Scaling Policy.

Frontier Model Releases Evaluation and Benchmarking OpenAI o1-preview Amazon Bedrock Claude 3.5 Sonnet +15 more

8Anthropic News·Jun 2, 2026·source ↗

Anthropic Releases Computer Use Capability for Claude 3.5 Sonnet

Anthropic has launched a public beta of computer use for Claude 3.5 Sonnet, enabling the model to control a computer by interpreting screenshots and issuing pixel-level cursor and keyboard commands. The model achieves 14.9% on the OSWorld benchmark, roughly double the next-best AI model's 7.7%, though well below human-level performance of 70-75%. Anthropic trained the model on a small set of simple software tools and found it generalized rapidly to broader computer interaction. Safety analysis confirmed the capability remains at AI Safety Level 2, with prompt injection identified as a primary near-term risk.

Evaluation and Benchmarking AI Safety Research prompt injection Claude 3.5 Sonnet Responsible Scaling Policy +6 more

9Anthropic News·Jun 1, 2026·source ↗

Anthropic Releases Claude Sonnet 4.5: Top Coding and Computer-Use Model with Agent SDK

Anthropic has released Claude Sonnet 4.5, claiming it is the best coding model and strongest model for building complex agents, with a 61.4% score on OSWorld (up from 42.2% for Sonnet 4) and state-of-the-art performance on SWE-bench Verified. The release is accompanied by major product upgrades including checkpoints in Claude Code, a native VS Code extension, a Claude Agent SDK giving developers access to the same infrastructure powering Claude Code, and new context editing and memory tools in the Claude API. Pricing is unchanged from Sonnet 4 at $3/$15 per million input/output tokens. Early enterprise customers including Cursor, GitHub Copilot, Devin, Canva, and Figma report significant gains in coding, agentic, and long-context tasks.

Frontier Model Releases Evaluation and Benchmarking Canva Claude for Chrome Figma +13 more

8Anthropic News·Jun 1, 2026·source ↗

Anthropic Acquires Vercept to Advance Claude's Computer Use Capabilities

Anthropic has acquired Vercept, a team specializing in AI perception and interaction for computer use tasks, whose co-founders include Kiana Ehsani, Luca Weihs, and Ross Girshick. Vercept will wind down its external product and join Anthropic to push computer use capabilities further. The announcement coincides with the launch of Claude Sonnet 4.6, which achieved 72.5% on the OSWorld benchmark—up from under 15% in late 2024—approaching human-level performance on tasks like navigating spreadsheets and completing web forms. This follows Anthropic's earlier acquisition of Bun and is part of a broader strategy to build agentic, multi-step task capabilities into Claude.

Frontier Model Releases Evaluation and Benchmarking Claude Sonnet 4 Luca Weihs Kiana Ehsani +7 more

8Anthropic News·Jun 1, 2026·source ↗

Anthropic Releases Claude Sonnet 4.6 with 1M Token Context, Improved Computer Use, and Coding Capabilities

Anthropic has released Claude Sonnet 4.6, positioned as a major upgrade over Sonnet 4.5 with improvements across coding, computer use, long-context reasoning, and agent planning. The model features a 1M token context window in beta and is now the default on claude.ai Free and Pro plans at unchanged pricing ($3/$15 per million tokens). Notably, users preferred Sonnet 4.6 over the prior Opus 4.5 frontier model 59% of the time in coding tasks, and the model shows significant gains on OSWorld computer-use benchmarks alongside improved prompt injection resistance. Safety evaluations found no major alignment concerns and rated it as safe or safer than prior Claude models.

Long Context Evolution Frontier Model Releases claude.ai Claude Sonnet 4 Claude Opus 4.6 +11 more

6arXiv · cs.AI·May 28, 2026·source ↗

LearnWeak: Automated Domain Specialization for Small Computer-Use Agents via Weakness-Targeted Synthesis

LearnWeak is an annotation-free framework for specializing small computer-use agents (CUAs) in specific software domains without deploying large expert models. It uses a stronger reference agent to identify weaknesses in a smaller student agent, synthesizes targeted tasks, and applies an error-aware training objective that disentangles planning from execution errors. On OSWorld, LearnWeak achieves gains of ~11 percentage points over 7B-8B baseline CUAs across eight domains. The work demonstrates that student-aware data synthesis substantially outperforms naive large-scale data generation for domain specialization.

Evaluation and Benchmarking Open Weights Progress LearnWeak computer-use agents OpenCUA-7B +6 more