Entity · benchmark

SimpleQA

benchmark · active · simpleqa-f4314ada · 4 events · first seen May 18, 2026

Aliases: SimpleQA

Co-occurring entities

Co-LMLM: Continuous-Query Limited Memory Language Models · GPT-4o mini · Claude Sonnet 4.5 · CO-LMLM · FineWeb-Edu · FreshQA · HotpotQA · Decoupled Search Grounding · MCP · OpenAI · Mistral AI · GitHub · Devstral 2 · Black Forest Labs · FLUX1.1 [pro] Ultra · Mistral Large 2 · Mistral-medium · Linear · Mistral Agents API · Model Context Protocol

More like this (12)

StrategyQA · TableQA · FreshQA · TruthfulQA · PubMedQA · ChartQA · IndQA · RealWorldQA · Protocol QA · ResearchQA · FinQA · OfficeQA Pro

Recent events (4)

6 · arXiv · cs.LG · Jul 9, 2026

Co-LMLM: Continuous-query limited memory language models outperform vanilla LLMs on factual tasks at small scale

Researchers introduce CO-LMLM, a limited memory language model that externalizes factual knowledge to a knowledge base during pretraining and retrieves it at inference via continuous vector queries paired with human-readable text values. The approach removes the earlier restriction to relational knowledge bases and Wikipedia-only data by introducing an annotation pipeline for arbitrary text. At 360M parameters, CO-LMLM achieves lower perplexity than models trained on 40x more data, and its factual accuracy on SimpleQA is comparable to GPT-4o mini and above Claude Sonnet 4.5, suggesting that externalized knowledge can make factual grounding considerably more efficient at small scale.

Evaluation and Benchmarking · Open Weights Progress · Co-LMLM: Continuous-Query Limited Memory Language Models · GPT-4o mini · Claude Sonnet 4.5 · +4 more
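The retrieval step described in this summary (a continuous vector query matched against stored keys, returning a human-readable text value) can be illustrated with a toy lookup. This is only a sketch under assumptions: the knowledge base contents, the three-dimensional vectors, and the names `KB`, `cosine`, and `retrieve` are all hypothetical, and nearest-neighbour cosine matching is a stand-in, not the paper's actual mechanism.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical knowledge base: (key vector, human-readable text value).
KB = [
    ([1.0, 0.0, 0.0], "Paris is the capital of France."),
    ([0.0, 1.0, 0.0], "Water boils at 100 degrees Celsius at sea level."),
    ([0.0, 0.0, 1.0], "The Nile flows through Egypt."),
]

def retrieve(query_vec, kb=KB):
    """Return the text value whose key is most similar to the query vector."""
    best_key, best_text = max(kb, key=lambda entry: cosine(query_vec, entry[0]))
    return best_text

# A query vector close to the first key retrieves the first fact.
print(retrieve([0.9, 0.1, 0.0]))
```

Because the stored values are plain text, the retrieved fact can be inspected or edited directly, which is the property the summary highlights.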

6 · arXiv · cs.CL · Jun 18, 2026

Decoupled Search Grounding (DSG): vendor-agnostic MCP-compatible architecture for LLM agent retrieval

Researchers introduce Decoupled Search Grounding (DSG), an architecture that moves real-time search grounding outside the reasoning model via an MCP-compatible gateway, exposing provider routing, caching, and retrieval depth as explicit controls. Evaluated across five frontier models on SimpleQA, FreshQA, and HotpotQA, DSG nearly matches native search accuracy on SimpleQA (86.1% vs. 87.7%) while achieving 91% lower search cost and 68% lower latency via a 99.4% warm-cache hit rate. In a production e-commerce deployment, DSG cuts search cost by over 98% while matching or slightly exceeding native-search accuracy. The work frames real-time grounding as an optimizable interface boundary rather than a fixed model feature, with direct relevance to MCP-based agent infrastructure.

Inference Economics · Enterprise Deployment Patterns · FreshQA · HotpotQA · Decoupled Search Grounding · +3 more
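The cost and latency savings in this summary hinge on a cache that sits in front of the search provider. A minimal sketch of that idea follows, assuming a single provider and an in-memory dictionary cache; `CachingSearchGateway` and `fake_provider` are hypothetical names, and nothing here reflects DSG's actual interface or its MCP wiring.

```python
from typing import Callable, Dict, List

class CachingSearchGateway:
    """Toy stand-in for a search gateway that lives outside the model."""

    def __init__(self, provider: Callable[[str], List[str]]) -> None:
        self.provider = provider
        self.cache: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def search(self, query: str) -> List[str]:
        # Normalize the query so trivially different spellings share an entry.
        key = query.strip().lower()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Cache miss: pay for one provider call and remember the result.
        self.misses += 1
        results = self.provider(query)
        self.cache[key] = results
        return results

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def fake_provider(query: str) -> List[str]:
    """Hypothetical provider; a real one would call a search API."""
    return [f"result for {query}"]

gateway = CachingSearchGateway(fake_provider)
gateway.search("capital of France")
gateway.search("Capital of France ")  # same key after normalization
print(gateway.hit_rate())
```

Only cache misses reach the provider, so a high warm-cache hit rate translates directly into fewer paid search calls.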

5 · OpenAI Blog · May 20, 2026

Introducing SimpleQA: OpenAI's Factuality Benchmark for Language Models

OpenAI has released SimpleQA, a benchmark designed to measure language model factuality on short, fact-seeking questions. The benchmark targets a specific and well-defined capability: answering direct factual queries accurately. It is intended to provide a clean signal on model truthfulness and calibration for this class of questions.

Evaluation and Benchmarking · AI Safety Research · SimpleQA · OpenAI
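To make the idea of scoring short fact-seeking answers concrete, here is a rough sketch of how per-question grades might be aggregated. The summary above does not describe the scoring scheme, so everything here is an assumption for illustration: the three grade labels, the metric names, and the harmonic-mean F-score are not taken from this page, and `score` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List

def score(grades: List[str]) -> Dict[str, float]:
    """Aggregate grades; labels 'correct', 'incorrect', 'not_attempted' are assumed."""
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]
    # Share of all questions answered correctly.
    overall = correct / total if total else 0.0
    # Share of attempted questions answered correctly (rewards abstaining when unsure).
    given_attempted = correct / attempted if attempted else 0.0
    denom = overall + given_attempted
    # Harmonic mean of the two rates.
    f_score = 2 * overall * given_attempted / denom if denom else 0.0
    return {
        "overall_correct": overall,
        "correct_given_attempted": given_attempted,
        "f_score": f_score,
    }

grades = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
print(score(grades))
```

Separating "not attempted" from "incorrect" is what lets a scheme like this say something about calibration as well as accuracy.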

7 · Mistral AI News · May 18, 2026

Mistral AI Launches Agents API with Built-in Connectors, MCP Tools, and Persistent Memory

Mistral AI has released a dedicated Agents API that extends beyond chat completion by providing built-in connectors for code execution, web search, image generation, and document retrieval, alongside support for Model Context Protocol (MCP) tools. The API features stateful conversation management with branching, streaming output, and multi-agent orchestration capabilities. Benchmark results show substantial web search augmentation gains: Mistral Large jumps from 23% to 75% on SimpleQA, and Mistral Medium from 22% to 82% with search enabled. The release targets enterprise-grade agentic workflows and is accompanied by cookbooks covering GitHub coding assistants, financial analysis, and travel planning use cases.

Frontier Model Releases · Inference Economics · Mistral AI · GitHub · Devstral 2 · +9 more
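The search-augmentation gains quoted in this summary are easier to compare as percentage-point differences. The short calculation below uses only the four SimpleQA figures stated above; the `results` dictionary and the `gains` helper are illustrative names, not part of any Mistral tooling.

```python
# SimpleQA accuracy in percent: (without search, with search), as quoted above.
results = {
    "Mistral Large": (23, 75),
    "Mistral Medium": (22, 82),
}

def gains(scores):
    """Percentage-point gain from enabling web search, per model."""
    return {model: after - before for model, (before, after) in scores.items()}

for model, delta in gains(results).items():
    print(f"{model}: +{delta} percentage points")
```

On these figures Mistral Medium gains 60 points and Mistral Large gains 52, so the model with the slightly lower baseline ends up ahead once search is enabled.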