7 · arXiv cs.CL (Computation and Language) · 13h ago

Agents-A1: 35B MoE agent matches trillion-parameter models via horizon scaling

Researchers introduce Agents-A1, a 35B Mixture-of-Experts model that they report matches or exceeds trillion-parameter models such as Kimi-K2 and DeepSeek V4 on long-horizon agentic benchmarks. Rather than scaling raw parameter count, the approach scales agent trajectory length (averaging 45K tokens) and the heterogeneity of agent abilities, using a three-stage training recipe that includes multi-teacher domain-routed distillation. On benchmarks such as SEAL-0, IFBench, HiPhO, and FrontierScience-Olympiad, Agents-A1 achieves leading or competitive results against models with roughly 30x more parameters. The work proposes a practical efficiency path: scaling agentic capability without proportionally scaling compute.

Frontier Model Releases · Inference Economics · Agent and Tool Ecosystem · IFBench · Kimi-K2 · DeepSeek V4 · BrowseComp · FrontierScience-Olympiad · SEAL-0 · HiPhO · Agents-A1 · SciCode · HLE

Related guides (4)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

DeepSeek V4

DeepSeek V4: The Open-Weights Giant Reshaping AI Economics

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

6 · The Batch · 29d ago

Kimi K2.6: Moonshot AI's 1T-Parameter Vision-Language Model Matches Open-Weights Peers, Trails Top Closed Models

Moonshot AI released Kimi K2.6, a 1 trillion-parameter mixture-of-experts vision-language model with 32B active parameters. It is designed for long-horizon autonomous coding sessions lasting multiple days and for multi-agent orchestration that scales to 300 parallel subagents executing up to 4,000 steps. The model scores slightly ahead of Qwen3.6 Max Preview and DeepSeek-V4-Pro on the Artificial Analysis Intelligence Index (54 vs. their 52) while trailing closed models like GPT-5.5 and Claude Opus 4.7. Weights are freely downloadable from Hugging Face under a modified MIT license permitting commercial use, with API access priced at $0.95/$0.16/$4.00 per million input/cached/output tokens. Notable features include a 256K-token context window, native INT4 quantization, a 'preserve thinking' mode for multi-turn reasoning continuity, and a research-preview 'claw groups' feature enabling cross-developer agent collaboration.
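
To make the quoted API pricing concrete, here is a minimal Python sketch that estimates the cost of a single request from the three per-million-token rates above. The function name `request_cost` and the billing assumption (cached tokens are charged at the cached rate instead of the input rate) are illustrative, not taken from Moonshot AI's documentation.

```python
# Rates quoted in the summary above, in USD per million tokens.
PRICE_PER_MILLION = {"input": 0.95, "cached": 0.16, "output": 4.00}


def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request.

    Assumption (hypothetical billing model): cached tokens are billed at the
    cached rate and are NOT also counted as regular input tokens.
    """
    counts = {
        "input": input_tokens,
        "cached": cached_tokens,
        "output": output_tokens,
    }
    total = sum(counts[kind] * PRICE_PER_MILLION[kind] for kind in counts)
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a long agentic turn that reuses most of its context from cache.
    cost = request_cost(input_tokens=200_000, cached_tokens=800_000, output_tokens=50_000)
    print(f"${cost:.3f}")  # prints "$0.518"
```

Under these assumptions, output tokens dominate the bill once most of a long context is served from cache, which is the typical shape of a multi-step agent session.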

Frontier Model Releases · Evaluation and Benchmarking · Artificial Analysis Intelligence Index · Claude Opus 4.6 · Qwen3.6 Max Preview · +14 more

7 · arXiv · cs.CL · 1mo ago

MobileMoE: Scaling Mixture-of-Experts for Sub-Billion Parameter On-Device Deployment

MobileMoE introduces a family of on-device MoE language models with 0.3–0.9B active parameters and 1.3–5.3B total parameters, targeting mobile deployment under memory and compute constraints. The authors derive an on-device MoE scaling law identifying a sweet spot of moderate sparsity with fine-grained and shared experts, then train models through a four-stage recipe including quantization-aware training on open-source data. Across 14 benchmarks, MobileMoE matches or exceeds leading dense on-device LLMs with 2–4× fewer inference FLOPs, and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones at comparable INT4 memory.

Training Infrastructure · Frontier Model Releases · MobileLLM-Pro · OLMoE-1B-7B · INT4 Quantization · +7 more

6 · The Batch · 29d ago

GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain

Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. Z.ai claims top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro; the weights are available under an MIT license via Hugging Face. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work (frontend most, then backend, then infrastructure, and research least), with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.

Frontier Model Releases · Evaluation and Benchmarking · DeepLearning.AI · Artificial Analysis Intelligence Index · SWE-bench · +9 more

6 · arXiv · cs.AI · 18d ago

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility · AgentBeats · MCP · +1 more

6 · arXiv · cs.AI · 22d ago

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · SWE-bench · AARRI-Bench · +2 more

7 · The Batch · 4d ago

Z.ai releases GLM-5.2, a 753B MoE open-weights model claiming top open-model ranking on agentic coding benchmarks

Z.ai released GLM-5.2, a 753-billion-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks, with a 1-million-token input context and an MIT license. The model ranks first among open-weights models on Artificial Analysis's Intelligence Index v4.1 (score 51, behind Claude Opus 4.8 at 56 and GPT-5.5 at 55) and leads all models on PostTrainBench, a benchmark for agentic fine-tuning tasks. Key technical contributions include a modified sparse attention indexer applied every four layers (cutting per-token computation 2.9x at 1M context), a switch from GRPO to PPO for long-horizon RL training, and a reward-hacking mitigation pipeline using rule-based filters and a judge model. API pricing is substantially below that of comparable proprietary models, and the release coincides with U.S. government restrictions on access to Anthropic's frontier models.

Open Weights Progress · Inference Economics · Artificial Analysis Intelligence Index · AA-Briefcase · DeepSeek V4 · +14 more

6 · arXiv · cs.CL · 7d ago

Tmax: Open RL training recipe for terminal-using agents achieves 27% on Terminal-Bench 2.0 with 9B parameters

Researchers present Tmax, an open RL training recipe for terminal-using language model agents, achieving 27% on Terminal-Bench 2.0 with a 9B parameter model while outperforming larger models from prior work. The recipe combines a novel data generation taxonomy using difficulty control, personas, and verifier diversification to produce a terminal environment dataset over 2.5x larger than previously released datasets. Training uses a simple outcome-only RL approach, and the authors release data, models, and code to lower the barrier for academic research on terminal agents.

Evaluation and Benchmarking · Open Weights Progress · Tmax · Hamish Ivison · Terminal-Bench · +1 more

6 · arXiv · cs.CL · 22d ago

Agentopia: Long-term multi-agent life simulation framework for training LLMs on social behavior

Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric mirroring human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a +15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social simulation can generalize to real capability gains.

Agent and Tool Ecosystem · Alignment and RLHF · Agentopia · Agentopia: Long-Term Life Simulation and Learning in Agent Societies