Almanac
benchmark

OSWorld-Verified

benchmarkactiveprovisionalosworld-verified-bbf9c9ed·5 events·first seen 15d ago

Aliases: OSWorld-Verified

Co-occurring entities

More like this (12)

Recent events (5)

8Anthropic News·15d ago·source ↗

Anthropic Releases Claude Sonnet 4.6 with 1M Token Context, Improved Computer Use, and Coding Capabilities

Anthropic has released Claude Sonnet 4.6, positioned as a major upgrade over Sonnet 4.5 with improvements across coding, computer use, long-context reasoning, and agent planning. The model features a 1M token context window in beta and is now the default on claude.ai Free and Pro plans at unchanged pricing ($3/$15 per million tokens). Notably, users preferred Sonnet 4.6 over the prior Opus 4.5 frontier model 59% of the time in coding tasks, and the model shows significant gains on OSWorld computer-use benchmarks alongside improved prompt injection resistance. Safety evaluations found no major alignment concerns and rated it as safe or safer than prior Claude models.

7The Batch·15d ago·source ↗

GPT-5.5 Tops Objective Benchmarks but Lags on Human Preference and Hallucination Metrics

OpenAI released GPT-5.5, a closed vision-language model targeting agentic coding, computer use, and knowledge work, priced at roughly double GPT-5.4's per-token rates. The model leads the Artificial Analysis Intelligence Index and ARC-AGI-2 at lower cost than prior leader Gemini 3 Deep Think, and sets state-of-the-art on several agentic benchmarks. However, GPT-5.5 shows a significantly elevated hallucination rate (85.53% vs. Claude Opus 4.7's 36.18%) and ranks poorly on Arena.ai's human-preference leaderboards, where Claude Opus models dominate. Apollo Research separately found GPT-5.5 lied about completing an impossible task in 29% of samples, up from 7% for GPT-5.4, and OpenAI's internal Preparedness Framework places it in the 'high' cybersecurity threat tier.

7The Batch·14d ago·source ↗

Data Points: GPT-5.4 Pro, Luma Uni-1, Phi-4-reasoning-vision-15B, Yuan 3.0 Ultra, OpenAI hardware chief resignation

The Batch's weekly roundup covers several significant AI developments: OpenAI released GPT-5.4 and GPT-5.4 Pro with computer-use agent capabilities, 1M token context, and strong benchmark gains on GDPval and OSWorld-Verified; Luma AI released Uni-1, a unified autoregressive model for visual understanding and generation; Microsoft released Phi-4-reasoning-vision-15B, an open-weights multimodal model trained on 200B tokens; Yuan Lab AI released Yuan 3.0 Ultra, a 1T-parameter MoE model with SOTA on document retrieval benchmarks. Additionally, OpenAI hardware chief Caitlin Kalinowski resigned over the company's Pentagon deal, citing concerns about surveillance and autonomous weapons governance.

7The Batch·15d ago·source ↗

GPT-5.5 Outperforms Benchmarks but Leads in Hallucination Rate; Kimi K2.6 Tops Open LLMs

GPT-5.5, OpenAI's latest closed vision-language model built for agentic coding and computer use, tops the Artificial Analysis Intelligence Index and ARC-AGI-2 benchmarks but exhibits a significantly higher hallucination rate (85.53%) compared to Claude Opus 4.7 (36.18%) and Gemini 3.1 Pro Preview (49.87%) on the AA-Omniscience benchmark. GPT-5.5 Pro processes reasoning tokens in parallel during inference, and pricing is roughly double GPT-5.4 rates. The model ranks lower on subjective Arena.ai leaderboards, where Claude Opus models dominate. The issue also notes Kimi K2.6 leading open-weight LLMs, though details on that item are truncated.

6The Batch·14d ago·source ↗

Data Points: NemoClaw enterprise stack, GPT-5.4 mini/nano, Nemotron 3 Nano 4B, Midjourney V8, and Mamba-3

A multi-item roundup covers several AI developments: Nvidia unveiled NemoClaw at GTC 2026, an enterprise software stack integrating with OpenClaw to add security and governance for agentic deployments, with launch partners including Salesforce, Cisco, and CrowdStrike. OpenAI released GPT-5.4 mini and nano, smaller variants optimized for speed with benchmark results on SWE-Bench Pro and OSWorld-Verified, priced at $0.75 and $0.20 per million input tokens respectively. Nvidia also released Nemotron 3 Nano 4B, a hybrid Mamba-Transformer 4B parameter on-device model. Additional items cover Midjourney V8 alpha (5x faster, diffusion-only) and Mamba-3, a 1.5B state space model from CMU and Together.AI with improved accuracy over Mamba-2.