Entity · benchmark

HealthBench

benchmark · active · healthbench-e82d24af · 5 events · first seen May 20, 2026

Aliases: HealthBench, HealthBench Hard

Co-occurring entities

More like this (12)

VitaBench · MedAgentBench · FinBench · IT-Bench · FutureBench · BigCodeBench · PaperBench · SpecBench · LiveBench · WildBench · LiveCodeBench · SorryBench

Recent events (5)

6 · arXiv · cs.CL · 2d ago

SERPO: Self-evolving rubric policy optimization enables test-time RL for open-ended generation

SERPO (Self-Evolving Rubric Policy Optimization) is a new method for test-time reinforcement learning (TTRL) that extends self-evolution to open-ended generation tasks without requiring labeled feedback, external reward models, or stronger judge models. The approach co-evolves response archives, query-specific rubrics, and policy parameters in a closed loop, using a Good-Normal-Bad response organization scheme and probabilistic criterion scoring. Evaluated across six benchmarks, SERPO improves scores on HealthBench and ResearchQA by up to 20.63 and 20.31 points, respectively, over base models and raises the macro-average by up to 8.06 points. The method addresses a key limitation of existing TTRL approaches, which rely on answer voting and cannot generalize to tasks without canonical answers.

Evaluation and Benchmarking · Alignment and RLHF · HealthBench · ResearchQA · SERPO · +1 more

6 · arXiv · cs.CL · Jun 23, 2026

KG grounding helps LLMs only for out-of-training knowledge: controlled clinical QA study

A new arXiv paper investigates when knowledge-graph (KG) grounding improves LLM performance on clinical question answering. It finds that structured KG retrieval over the public biomedical graph PrimeKG provides no meaningful improvement on MedQA (all deltas ≤3.4) because the relevant facts are already in the model's training data. On synthetic counterfactual and hybrid benchmarks containing genuinely novel facts, the same pipeline lifts out-of-training accuracy from chance to ~100%. The paper also reproduces and partially corrects a recent Nature Medicine study on frontier LLMs vs. clinical RAG tools, flagging a score-deflating grader bug and clarifying that the reported ~88 HealthBench score reflects the Consensus variant, not full HealthBench (~46-47). The core finding, that RAG/KG grounding pays off only when the decisive fact is outside the model's training distribution, has direct implications for when retrieval augmentation is worth deploying.

Evaluation and Benchmarking · Enterprise Deployment Patterns · HealthBench · samyama-graph · MedQA · +5 more

6 · arXiv · cs.CL · Jun 17, 2026

RubricsTree: Scalable hierarchical rubric framework for evaluating personal health AI agents

RubricsTree is a new evaluation framework for LLM-powered personal health agents, built around a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics derived from 4,000 real user queries and curated with physician oversight. A context-aware router activates only the rubrics relevant to each query, enabling scalable yet expert-aligned evaluation. The framework outperforms strong LLM-as-a-judge baselines on expert alignment and, when used as a training signal, yields relative gains of up to ~66% on HealthBench across the Gemini, GPT, and Qwen model families. The work addresses a concrete bottleneck in clinical deployment of health AI: the cost-quality tradeoff in evaluation.

Evaluation and Benchmarking · AI Safety Research · HealthBench · RubricsTree · Qwen · +2 more

8 · The Batch · Jun 1, 2026

Meta Introduces Muse Spark: First Closed-Weights Model from Superintelligence Labs

Meta released Muse Spark, its first AI model in roughly a year and the debut product of its Superintelligence Labs, marking a significant departure from its open-weights Llama strategy. The natively multimodal reasoning model supports tool use and multi-agent orchestration, ranks fourth on the Artificial Analysis Intelligence Index, and claims notable token efficiency, matching Llama 4 Maverick with over 10x less training compute. Meta withheld parameter count, architecture, and training details, positioning Muse Spark as a closed commercial product competing with OpenAI, Google, and Anthropic. The release introduces 'thought compression' via RL and a parallel multi-agent 'contemplating' mode, while showing gaps in coding and agentic benchmarks.

Frontier Model Releases · Open Weights Progress · Scale AI · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +18 more

6 · OpenAI Blog · May 20, 2026

Introducing HealthBench

OpenAI has released HealthBench, a new evaluation benchmark designed to assess AI model performance and safety in healthcare settings. The benchmark was developed with input from over 250 physicians and targets realistic clinical scenarios. It aims to establish a shared standard for measuring how well AI models handle health-related tasks.

Evaluation and Benchmarking · AI Safety Research · HealthBench · OpenAI · +1 more