Tmax: Open RL training recipe for terminal-using agents achieves 27% on Terminal-Bench 2.0 with 9B parameters
Researchers present Tmax, an open RL training recipe for terminal-using language model agents, achieving 27% on Terminal-Bench 2.0 with a 9B parameter model while outperforming larger models from prior work. The recipe combines a novel data generation taxonomy using difficulty control, personas, and verifier diversification to produce a terminal environment dataset over 2.5x larger than previously released datasets. Training uses a simple outcome-only RL approach, and the authors release data, models, and code to lower the barrier for academic research on terminal agents.
Related events (8)
Agents-A1: 35B MoE agent matches trillion-parameter models via horizon scaling
Researchers introduce Agents-A1, a 35B Mixture-of-Experts model that the authors claim matches or exceeds trillion-parameter models such as Kimi-K2 and DeepSeek V4 on long-horizon agentic benchmarks. The approach scales agent trajectory length (averaging 45K tokens) and heterogeneous agent abilities rather than raw parameter count, using a three-stage training recipe that includes multi-teacher domain-routed distillation. On benchmarks such as SEAL-0, IFBench, HiPhO, and FrontierScience-Olympiad, Agents-A1 achieves leading or competitive results against models with roughly 30x more parameters. The work proposes a practical efficiency path for scaling agentic capability without proportionally scaling compute.
Bebop: MTP with rejection sampling and TV loss achieves 1.8x RL training speedup
Researchers introduce Bebop, a framework for integrating Multi-Token Prediction (MTP) into large-scale RL training pipelines for LLMs. The work identifies that MTP acceptance rates degrade during RL due to entropy fluctuations, and proposes probabilistic rejection sampling plus a novel end-to-end Total Variation (TV) loss that directly optimizes multi-step acceptance rates, reaching up to 95% acceptance rates and 25% higher inference throughput. Applied to Qwen3.5, Qwen3.6, and Qwen3.7 models, the method yields up to 1.8x end-to-end acceleration in async RL training. Because the MTP heads are trained before RL with the proposed objectives, the approach avoids costly online MTP updates.
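The link between rejection sampling and a Total Variation objective can be illustrated with a minimal sketch. It assumes the standard speculative-sampling acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x))), which is not confirmed to be Bebop's exact procedure. The function names and the toy distributions `p` (target model) and `q` (MTP draft head) are hypothetical. Under this rule, the expected single-token acceptance rate equals 1 - TV(p, q), which is why lowering TV distance raises acceptance.

```python
import random

def total_variation(p, q):
    """TV distance between two discrete distributions over the same vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def acceptance_probability(p, q):
    """Expected acceptance rate of the standard rule: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def speculative_step(p, q, rng):
    """Draft a token from q, accept it with probability min(1, p/q);
    on rejection, resample from the residual max(p - q, 0).
    Assumption: this is the textbook rule, not necessarily Bebop's."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False

def empirical_acceptance(p, q, trials=20000, seed=0):
    """Monte Carlo estimate of the acceptance rate for one drafted token."""
    rng = random.Random(seed)
    accepted = sum(speculative_step(p, q, rng)[1] for _ in range(trials))
    return accepted / trials

# Hypothetical toy distributions over a 3-token vocabulary.
p = [0.5, 0.3, 0.2]  # target model
q = [0.4, 0.4, 0.2]  # draft (MTP) head
```

With these toy values, `acceptance_probability(p, q)` and `1 - total_variation(p, q)` agree up to floating-point error, and `empirical_acceptance(p, q)` should land close to the same figure.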
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
EnvFactory is a fully automated framework for training tool-use LLM agents via Agentic Reinforcement Learning, addressing two key bottlenecks: scalable execution environments and realistic multi-turn training data. It autonomously constructs stateful, executable tool environments from authentic resources and synthesizes natural trajectories with implicit human intents via topology-aware sampling. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories and improves Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks, outperforming prior approaches that use 5x more environments.
T1-Bench: Multi-scenario agent benchmark across 25 real-world domains
T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.
Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective
A Hugging Face blog post authored by LinkedIn describes practical lessons from implementing reinforcement learning training for agentic open-source GPT-class models. The retrospective covers engineering and algorithmic challenges encountered when applying RL to agentic workflows. This is a tier-2 source with no body content available, so the depth and specific findings cannot be fully assessed; the topic sits at the intersection of agentic systems and RLHF/RL training pipelines.
GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain
Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. The model claims top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro, and is available under the MIT license via Hugging Face. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work (frontend most, then backend, then infrastructure, and research least), with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.
OpenThoughts-Agent: Open data curation pipeline for broadly capable agentic models
The OpenThoughts-Agent (OT-Agent) project releases a fully open data curation pipeline for training agentic language models, addressing the gap left by prior efforts (SWE-Smith, SERA, Nemotron-Terminal) that target single benchmarks. The team conducts over 100 controlled ablation experiments and assembles a 100K-example training set, fine-tuning Qwen3-32B to achieve 44.8% average accuracy across seven agentic benchmarks, a 3.9 percentage point improvement over the strongest existing open agentic model (Nemotron-Terminal-32B at 40.9%). Training data, pipeline, experimental data, and models are publicly released at openthoughts.ai.
PEEU method enables 7B GUI agent to outperform 32B model on web task planning
Researchers introduce PEEU (Planning Experience Exploration and Utilization), a training approach for small open-source multimodal LLMs that autonomously explores GUI environments to collect hindsight experience and synthesizes high-level training data for task planning. A 7B model trained with PEEU achieves 30.6% accuracy on real-world benchmarks, outperforming Qwen2.5-VL-32B. The paper also proposes TDHAF, a hierarchical analysis framework revealing that high-level task training yields stronger out-of-distribution generalization than mastering low-level atomic skills alone.


