Almanac
benchmark

HLE

benchmarkactivehle-2d613b3b·3 events·first seen 28d ago

Aliases: HLE

Co-occurring entities

More like this (12)

Recent events (3)

7The Batch·13d ago·source ↗

Google's Aletheia agent uses Gemini 3 Deep Think to generate novel solutions to unsolved Erdős problems

Google researchers introduced Aletheia, an agentic workflow using Gemini 3 Deep Think that generates, verifies, and revises solutions to previously unsolved mathematical problems. Applied to Erdős problems, Aletheia produced 13 correct solutions out of 200 evaluated, with 4 being genuinely novel contributions not found in existing literature. The announcement also reveals Gemini 3 Deep Think's benchmark performance: 48.4% on HLE, 84.6% on ARC-AGI-2, and 93.8% on GPQA Diamond. The system demonstrates both the promise and current limitations of AI-assisted mathematical research, with a 6.5% correct-under-intended-interpretation rate on a hard problem set.

8The Batch·15d ago·source ↗

Claude Mythos Preview: Limited-Release Frontier Model with Exceptional Cybersecurity Capabilities

Anthropic has published a 244-page model card for Claude Mythos Preview, a frontier model not yet commercially available, which autonomously discovered thousands of high-severity vulnerabilities in popular operating systems and browsers during testing. To mitigate risks before potential deployment, Anthropic assembled Project Glasswing, a consortium of over 40 organizations including AWS, Apple, Google, Microsoft, and CrowdStrike, funded with $100M in model credits to patch vulnerabilities proactively. The model substantially outperforms Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across multiple benchmarks including CyberGym (83.1%), Terminal-Bench 2.0 (82%), GPQA Diamond (94.5%), HLE (64.7%), and GraphWalks long-context (80%). The Batch notes parallels to OpenAI's GPT-2 limited-release strategy and characterizes the announcement as having elements of a publicity stunt alongside genuine safety concerns.

6arXiv · cs.CL·28d ago·source ↗

GIM: A Grounded Integration Measure Benchmark for Evaluating Multi-Domain Cognitive Coordination in LLMs

The Grounded Integration Measure (GIM) is a new LLM benchmark of 820 original problems designed to resist benchmark saturation by requiring integration of multiple cognitive operations—constraint satisfaction, state tracking, epistemic vigilance, audience calibration—over broadly accessible knowledge. Unlike knowledge-escalation benchmarks (GPQA, HLE) or pure abstraction benchmarks (ARC-AGI), GIM grounds reasoning in realistic tasks without gating on specialized expertise. The authors calibrate a 2-parameter logistic IRT model over 200k+ prompt-response pairs across 28 models and 47 test configurations, producing the most extensive published study of test-time compute vs. model capability tradeoffs on a fixed benchmark. A key finding is that within-family configuration choices (thinking budget, quantization) matter as much as model selection.