Entity · organization

METR

organization · active · metr-82294f6d · 3 events · first seen Jun 1, 2026

Aliases: METR

More like this (12)

MED · CoMet · MATS · MMTEB · MAST · MetaSyn · MARC · Mercor · EMPATH · Meta · UTMOS · MER-TRANS 2026

Recent events (3)

8 · The Batch · Jul 3, 2026

OpenAI Previews GPT-5.6 Family (Sol, Terra, Luna) with Government-Only Access and Advanced Safety Guardrails

OpenAI announced a preview of three vision-language models — GPT-5.6 Sol, Terra, and Luna — descending in capability and price, currently available only to U.S. government-approved organizations via API and Codex. GPT-5.6 Sol, the flagship tier, features a new 'max reasoning' mode and 'ultra mode' that spawns multiple subagents for multi-step tasks, and achieved state-of-the-art results on Terminal-Bench 2.1 (91.9%) while approaching Claude Mythos 5 on ExploitBench. The models include layered biosecurity and cybersecurity guardrails, with independent evaluations from METR and SecureBio yielding mixed but notable findings — particularly a near-10-point biology knowledge jump over GPT-5.5 and ambiguous autonomous task-duration results from METR. Wider public release is planned within weeks.

Frontier Model Releases · AI Safety Research · World-Class Bio · GPT-5.6 Terra · GPT-5.6 Sol · +11 more

7 · Anthropic News · Jun 4, 2026

Anthropic launches initiative to fund third-party AI safety evaluations

Anthropic announced a funded initiative to source third-party evaluations measuring advanced AI capabilities and safety risks, with priority areas including cybersecurity, CBRN threats, model autonomy, national security risks, social manipulation, and misalignment. The initiative is tied to Anthropic's Responsible Scaling Policy and AI Safety Level (ASL) framework, aiming to address a gap between demand and supply of high-quality safety-relevant evals. Proposals are solicited via an application form, with Anthropic framing the effort as benefiting the broader AI safety ecosystem rather than just internal use.

Evaluation and Benchmarking · AI Safety Research · METR · Google-Proof Q&A · Responsible Scaling Policy · +1 more

7 · The Batch · Jun 1, 2026

Z.ai's GLM-5.1 Open-Weights Model Targets Multi-Hour Agentic Coding Tasks with Iterative Self-Evaluation

Z.ai released GLM-5.1, a 754B-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks, capable of cycling through planning, execution, and strategy revision hundreds of times over sessions lasting up to eight hours. The model achieves the top open-weights score on the Artificial Analysis Intelligence Index and third place on Arena's Code leaderboard, and it leads SWE-Bench Pro in Z.ai's own evaluations at 58.4 percent. Weights are available on Hugging Face under the MIT license, with API pricing roughly 40 percent higher than its predecessor's but still below that of comparable proprietary models. No technical report has been published, leaving architecture and training details undisclosed.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.1 Pro · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +14 more