Almanac
benchmark

SimpleQA

benchmark · active · simpleqa-f4314ada · 2 events · first seen 1mo ago

Aliases: SimpleQA

Recent events (2)

5 · OpenAI Blog · 28d ago

Introducing SimpleQA: OpenAI's Factuality Benchmark for Language Models

OpenAI has released SimpleQA, a benchmark designed to measure language model factuality on short, fact-seeking questions. The benchmark targets a specific and well-defined capability: answering direct factual queries accurately. It is intended to provide a clean signal on model truthfulness and calibration for this class of questions.

7 · Mistral AI News · 1mo ago

Mistral AI Launches Agents API with Built-in Connectors, MCP Tools, and Persistent Memory

Mistral AI has released a dedicated Agents API that goes beyond chat completion, providing built-in connectors for code execution, web search, image generation, and document retrieval, along with support for Model Context Protocol (MCP) tools. The API offers stateful conversation management with branching, streaming output, and multi-agent orchestration. Reported benchmark results show large gains from web search: on SimpleQA, Mistral Large rises from 23% to 75% and Mistral Medium from 22% to 82% with search enabled. The release targets enterprise-grade agentic workflows and is accompanied by cookbooks covering GitHub coding assistants, financial analysis, and travel planning.
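
To put the reported SimpleQA figures side by side, the gain from search can be worked out as a difference in percentage points and as a ratio. The sketch below is only arithmetic on the four numbers quoted in the summary; the helper name `search_gain` and the `REPORTED` table are illustrative and not part of any Mistral or OpenAI tooling.

```python
# SimpleQA scores (percent) as quoted in the summary: (without search, with search).
REPORTED = {
    "Mistral Large": (23, 75),
    "Mistral Medium": (22, 82),
}


def search_gain(without_search, with_search):
    """Return (gain in percentage points, ratio of with-search to without-search)."""
    return with_search - without_search, with_search / without_search


for model, (base, augmented) in REPORTED.items():
    points, ratio = search_gain(base, augmented)
    print(f"{model}: +{points} points ({ratio:.1f}x)")
```

On these figures, Mistral Large gains 52 percentage points and Mistral Medium gains 60, so both models score more than three times higher with search enabled.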