Entity · benchmark

Bias Benchmark for Question Answering

benchmark · active · bias-benchmark-for-question-answering-9089b397 · 1 event · first seen Jun 3, 2026

Aliases: Bias Benchmark for Question Answering

Co-occurring entities

Claude Opus 4.6 · Constitutional AI · Claude Haiku 4.5 · Needle-in-a-Haystack · MMLU · Claude 3 Sonnet · GPQA · GSM8K · Anthropic

More like this (12)

temporally grounded QA benchmark
Long-context Reasoning Benchmarks
CORE benchmark
AI Reproducibility Benchmark
Multi-hop Question Answering
Evidence-Backed Video Question Answering
Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering
ICML 2026 Workshop on Efficient Multimodal Question Answering
multi-turn agent benchmarks
Vals AI Finance Agent Benchmark
Cost-Sensitive Conformal Prediction and Human-in-the-Loop Abstention for Imbalanced High-Stakes Decision Support: A Multi-Domain Benchmark

Recent events (1)

9 · Anthropic News · Jun 3, 2026

Anthropic launches Claude 3 model family: Haiku, Sonnet, and Opus

Anthropic announced the Claude 3 model family on March 4, 2024, comprising three models — Haiku, Sonnet, and Opus — in ascending capability order. Claude 3 Opus claims top performance on major benchmarks including MMLU, GPQA, and GSM8K, with near-perfect recall on long-context evaluations (200K context window, 99%+ NIAH accuracy) and new multimodal vision capabilities. The release also highlights reduced unnecessary refusals, a twofold accuracy improvement over Claude 2.1, and Constitutional AI-based safety tuning. Opus and Sonnet launched immediately via claude.ai and the Claude API across 159 countries, with Haiku to follow.

Long Context Evolution · Frontier Model Releases · Claude Opus 4.6 · Constitutional AI · Claude Haiku 4.5 · +8 more