Almanac
benchmark

MCP-Atlas

benchmarkactivemcp-atlas-abd5b051·3 events·first seen 28d ago

Aliases: MCP-Atlas, MCP Atlas

Co-occurring entities

More like this (12)

Recent events (3)

7arXiv · cs.CL·28d ago·source ↗

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory is a fully automated framework for training tool-use LLM agents via Agentic Reinforcement Learning, addressing two key bottlenecks: scalable execution environments and realistic multi-turn training data. It autonomously constructs stateful, executable tool environments from authentic resources and synthesizes natural trajectories with implicit human intents via topology-aware sampling. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories and improves Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks, outperforming prior approaches that use 5x more environments.

7The Batch·14d ago·source ↗

OpenAI GPT-5.4 Pro and GPT-5.4 Thinking challenge Gemini 3.1 Pro Preview for top AI model position

OpenAI released GPT-5.4 in two variants (Pro and Thinking), featuring expanded context windows up to 1.05M tokens, native computer use, tool search capabilities, and adjustable reasoning levels. In independent benchmarks by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning nearly ties Gemini 3.1 Pro Preview on the Intelligence Index (57 vs 57.2 points) but at roughly 3.3x the cost, while leading on coding and agentic sub-indices. The release leapfrogs Claude Opus 4.6 on most benchmarks but faces stiff competition from Google's Gemini 3.1 Pro Preview, which maintains a price and multimodal advantage.

8The Batch·14d ago·source ↗

GPT-5.4 released with tool search, computer use, and frontier benchmark performance

OpenAI released GPT-5.4 in Thinking and Pro variants, featuring an expanded context window (up to 1.05M input tokens), native computer use, tool search capabilities, and adjustable reasoning levels. In independent testing by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning achieved state-of-the-art on GDP-Val-AA, BrowseComp, Terminal-Bench-Hard, SWE-Bench-Pro, and MCP Atlas, while trailing Gemini 3.1 Pro Preview on MMMU-Pro and Humanity's Last Exam. Pricing is set at the top of the market ($30/$180 per million input/output tokens for Pro), and the release also powers Codex, OpenAI's competitor to Claude Code. The item is reported via The Batch (tier 2 commentary) and includes additional context on Andrew Ng's chub CLI tool for agent documentation sharing.