MobileGym: Verifiable Parallel Simulation Platform for Mobile GUI Agent Training
MobileGym is a browser-hosted simulation environment for mobile GUI agent research. It enables deterministic outcome verification via structured JSON state and scalable online RL through hundreds of parallel instances (~400 MB per instance, ~3 s cold start). The accompanying MobileGym-Bench provides 416 parameterized task templates across 28 apps with deterministic judges. A sim-to-real case study using GRPO on Qwen3-VL-4B-Instruct achieves a gain of 12.8 percentage points on the 256-task test set, with real-device execution retaining 95.1% of simulation-side training gains.
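As a rough illustration of what deterministic outcome verification over structured JSON state can look like, the Python sketch below checks a final app state against a task's expected values. The state layout, the `lookup` and `judge` helpers, and the dotted-path convention are assumptions made for illustration, not MobileGym's actual API.

```python
import json
from typing import Any


def lookup(state: Any, dotted_path: str) -> Any:
    """Walk nested dicts/lists using a dotted path such as 'clock.alarms.0.time'."""
    node = state
    for key in dotted_path.split("."):
        if isinstance(node, list):
            node = node[int(key)]  # list segments are integer indices
        else:
            node = node[key]
    return node


def judge(final_state_json: str, expected: dict) -> bool:
    """Return True only if every expected path holds exactly the expected value."""
    state = json.loads(final_state_json)
    for path, want in expected.items():
        try:
            if lookup(state, path) != want:
                return False
        except (KeyError, IndexError, ValueError, TypeError):
            # A missing or malformed path counts as a failed check.
            return False
    return True


# Hypothetical final state exported by a simulated clock app.
final_state = json.dumps(
    {"clock": {"alarms": [{"time": "07:30", "enabled": True}]}}
)
# Hypothetical task: "set an enabled alarm for 07:30".
task_expected = {"clock.alarms.0.time": "07:30", "clock.alarms.0.enabled": True}

print(judge(final_state, task_expected))
```

Because the check compares structured values rather than screenshots or model opinions, the same final state always produces the same verdict, which is the property that makes such a judge usable as an RL reward signal.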
Related events (8)
RealClawBench: Live benchmark framework built from real developer-agent sessions
RealClawBench is a new benchmark framework that converts real OpenClaw developer-agent sessions into reproducible, automatically scored evaluation tasks. It addresses realism gaps in existing agent benchmarks through reconstructed execution environments and deterministic, verifiable scorers, and releases 281 executable tasks sampled to preserve the source session distribution. An evaluation of 14 contemporary models shows that the best system solves only 65.8% of tasks, indicating substantial headroom on realistic developer-agent workloads.
AgentMob: Training-free LLM agent framework for evidence-grounded mobility prediction
AgentMob is a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making, using a fast path for routine cases and iterative tool use for ambiguous ones. Evaluated on three mobility datasets, it achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on the BW dataset. The framework demonstrates that LLM controllers add most value in resolving ambiguous predictions through adaptive evidence gathering rather than routine cases.
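The fast-path versus iterative split described above can be sketched in Python as follows. This is a toy stand-in under stated assumptions: a frequency heuristic replaces the LLM controller, and `predict_next`, the thresholds, and the tool interface (a callable returning score adjustments per location) are hypothetical names, not AgentMob's implementation.

```python
from collections import Counter


def predict_next(history, tools, fast_threshold=0.7, margin=0.3, max_steps=3):
    """Return (predicted_location, path_taken) for a visit history."""
    counts = Counter(history)
    top, freq = counts.most_common(1)[0]

    # Fast path: one location dominates, so no extra evidence is gathered.
    if freq / len(history) >= fast_threshold:
        return top, "fast"

    # Ambiguous case: start from visit frequencies, then consult tools
    # one at a time until the leader is clearly ahead or the budget runs out.
    scores = Counter({loc: c / len(history) for loc, c in counts.items()})
    for tool in tools[:max_steps]:
        for loc, weight in tool(history).items():
            scores[loc] += weight
        ranked = scores.most_common(2)
        if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
            break
    return scores.most_common(1)[0][0], "iterative"


def evening_prior(history):
    """Hypothetical evidence tool: favors the gym for an evening query."""
    return {"gym": 0.5}


routine = ["home", "work", "home", "home", "home"]
ambiguous = ["home", "gym", "work", "gym", "home"]

print(predict_next(routine, [evening_prior]))
print(predict_next(ambiguous, [evening_prior]))
```

The routine history resolves without any tool call, while the ambiguous one needs extra evidence to break the tie, which mirrors the claim that the controller adds most value on ambiguous predictions.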
iOSWorld: Benchmark for Personalized iOS Phone Agents with Persistent User Identity
Researchers introduce iOSWorld, the first interactive native iOS simulator benchmark designed to evaluate phone agents on personalized, identity-aware tasks across 26 custom-built iOS apps. The benchmark includes 133 tasks spanning single-app, multi-app, and memory/personalization categories, with connected personal data such as transactions, messages, and social relationships. Frontier models reach only 52% overall and 37% on multi-app tasks; privileged vision+XML access improves frontier models by up to 26 percentage points but does not help smaller models. The benchmark is released open-source with all apps, data, tasks, and evaluation code.
AGENTSERVESIM: Hardware-aware simulator for multi-turn LLM agent serving policies
Researchers introduce AGENTSERVESIM, a simulation framework designed to evaluate serving policies for multi-turn LLM agents without requiring dedicated accelerator hardware. The simulator models program-level execution including turn dependencies, tool-induced gaps, and KV-cache residency across HBM, host DRAM, and CXL memory hierarchies. It reproduces real-system behavior within 6% error on key performance metrics while running on commodity CPUs, enabling cost-effective exploration of scheduling, routing, and cache management policies for agentic workloads.
PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards
Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.
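To make the idea of a multi-component programmatic reward concrete, here is a minimal Python sketch. The summary does not list PROVE's actual reward components, so the three used here (tool-name match, argument match, final-state match), the weights, and the `programmatic_reward` function are all assumptions for illustration.

```python
def programmatic_reward(
    predicted_calls,
    reference_calls,
    final_state,
    expected_state,
    weights=(0.3, 0.3, 0.4),
):
    """Score a tool-call trajectory in [0, 1] without any judge model.

    Each call is a dict of the form {"name": str, "args": dict}.
    """
    pairs = list(zip(predicted_calls, reference_calls))
    name_hits = sum(p["name"] == r["name"] for p, r in pairs)
    arg_hits = sum(
        p["name"] == r["name"] and p["args"] == r["args"] for p, r in pairs
    )

    # Extra predicted calls enlarge the denominator, penalizing rambling.
    extra = max(len(predicted_calls) - len(reference_calls), 0)
    denom = max(len(reference_calls), 1) + extra

    # Final-state check against the (stateful) server's expected state.
    state_ok = 1.0 if final_state == expected_state else 0.0

    w_name, w_args, w_state = weights
    return (
        w_name * name_hits / denom
        + w_args * arg_hits / denom
        + w_state * state_ok
    )


reference = [
    {"name": "create_ticket", "args": {"title": "Bug"}},
    {"name": "assign_ticket", "args": {"id": 1, "user": "ana"}},
]
wrong_args = [
    {"name": "create_ticket", "args": {"title": "Bug"}},
    {"name": "assign_ticket", "args": {"id": 1, "user": "bob"}},
]
expected = {"tickets": {1: {"title": "Bug", "assignee": "ana"}}}
observed = {"tickets": {1: {"title": "Bug", "assignee": "bob"}}}

print(programmatic_reward(reference, reference, expected, expected))
print(programmatic_reward(wrong_args, reference, observed, expected))
```

Every term is computed from call records and server state, so the reward is reproducible across runs and needs no learned judge.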
GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain
Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. Z.ai claims top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro; the model is available under the MIT license via HuggingFace. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work (frontend most, then backend, then infrastructure, and research least), with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.
HiViG: History-aware visually grounded critic improves computer use agents across GUI benchmarks
Researchers introduce HiViG, a test-time framework for Computer Use Agents that addresses two weaknesses in existing critic models: short-sighted decision loops and lack of visual grounding. The system trains a multimodal critic on real GUI trajectories to maintain a compact macro-action history and verify execution coordinates against live screenshots before action execution. Evaluated on web, mobile, and desktop benchmarks, HiViG improves average success rates by 5.8% over the strongest baseline with Qwen3-VL-32B and 9.0% with Gemini-3-Flash, with both history and grounding components shown to be independently necessary.
Claw-Anything: Benchmark for Always-On Personal Assistants with Broad Digital World Access
Claw-Anything is a new benchmark designed to evaluate LLM agents acting as always-on personal assistants with access to long-horizon activity histories, interdependent backend services, and multi-device GUI/CLI interaction. The benchmark simulates months of user activity to create complex, noisy world states and evaluates both reactive and proactive assistance. GPT-5.5 achieves only 34.5% pass@1, revealing a substantial capability gap versus prior narrower benchmarks. An accompanying automated data-generation pipeline produces 2,000 training environments and yields a 23.7% improvement over the base model.



