benchmark
Bias Benchmark for Question Answering
benchmark · active · provisional
bias-benchmark-for-question-answering-9089b397 · 1 event · first seen 14d ago
Aliases: Bias Benchmark for Question Answering
More like this (12)
temporally grounded QA benchmark
Long-context Reasoning Benchmarks
CORE benchmark
AI Reproducibility Benchmark
Multi-hop Question Answering
Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
multi-turn agent benchmarks
Vals AI Finance Agent Benchmark
harness-level benchmarks
Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering
Auto Benchmark Audit (ABA)
tabular question answering
Recent events (1)
Anthropic launches Claude 3 model family: Haiku, Sonnet, and Opus
Anthropic announced the Claude 3 model family on March 4, 2024. It comprises three models, in ascending order of capability: Haiku, Sonnet, and Opus. According to the announcement, Claude 3 Opus leads on major benchmarks including MMLU, GPQA, and GSM8K, achieves near-perfect recall on long-context evaluations (200K-token context window, over 99% accuracy on the Needle In A Haystack (NIAH) test), and adds multimodal vision capabilities. The release also highlights fewer unnecessary refusals, a twofold accuracy improvement over Claude 2.1 on challenging open-ended questions, and safety tuning based on Constitutional AI. Opus and Sonnet were available at launch through claude.ai and the Claude API in 159 countries, with Haiku to follow.