PEEU method enables 7B GUI agent to outperform 32B model on web task planning
Researchers introduce PEEU (Planning Experience Exploration and Utilization), a training approach in which small open-source multimodal LLMs autonomously explore GUI environments, collect hindsight experience, and synthesize high-level training data for task planning. A 7B model trained with PEEU achieves 30.6% accuracy on real-world benchmarks, outperforming Qwen2.5-VL-32B. The paper also proposes TDHAF, a hierarchical analysis framework, which reveals that training on high-level tasks yields stronger out-of-distribution generalization than mastering low-level atomic skills alone.
Related events (8)
EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents
EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.
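The routing idea in the EEVEE summary can be illustrated with a toy sketch. This is not the paper's method: the cluster names, keyword sets, and prompt texts below are hypothetical, and a simple keyword-overlap rule stands in for the learned router and the co-evolution step.

```python
# Hypothetical prompt configurations, one per task cluster (not from the paper).
PROMPTS = {
    "math": "Solve step by step and state the final number.",
    "code": "Write a correct, minimal function and explain edge cases.",
}

# Hypothetical keyword profiles standing in for a learned router.
CLUSTER_KEYWORDS = {
    "math": {"sum", "integer", "equation", "solve"},
    "code": {"function", "python", "bug", "compile"},
}


def route(task: str) -> str:
    """Return the cluster whose keyword set overlaps the task text most."""
    words = set(task.lower().split())
    scores = {name: len(words & kws) for name, kws in CLUSTER_KEYWORDS.items()}
    # On a tie, max() keeps the first cluster in dictionary order.
    return max(scores, key=scores.get)


def build_prompt(task: str) -> str:
    """Prepend the routed cluster's prompt configuration to the task."""
    return f"{PROMPTS[route(task)]}\n\n{task}"


print(route("Solve the equation for x"))
print(route("Fix the bug in this python function"))
```

In a mixed task stream, each incoming task would pass through `route` first, so that tasks from different datasets receive different prompt configurations.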
PlanBench-XL: Benchmark for LLM Agent Planning in Large-Scale Tool Ecosystems
Researchers introduce PlanBench-XL, an interactive benchmark of 327 retail tasks spanning 1,665 tools designed to evaluate LLM agents on long-horizon planning under retrieval-limited tool visibility. The benchmark includes a blocking mechanism simulating real-world disruptions such as missing or failing tools, forcing agents to detect and recover from broken execution paths. Experiments on ten leading LLMs reveal severe performance degradation: GPT-5.4 drops from 51.90% accuracy in unblocked settings to 11.36% under the most severe blocking condition, highlighting fragility in adaptive planning for large, imperfect tool environments.
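To put the reported GPT-5.4 degradation in perspective, the short calculation below derives the absolute and relative drop from the two accuracy figures quoted above. The helper name `accuracy_drop` is illustrative only.

```python
def accuracy_drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = before - after
    relative = absolute / before * 100
    return round(absolute, 2), round(relative, 1)


# Figures quoted in the summary: unblocked vs. most severe blocking condition.
drop_points, drop_percent = accuracy_drop(51.90, 11.36)
print(f"{drop_points} points, {drop_percent}% relative")
```

By this arithmetic, the model loses roughly four-fifths of its unblocked accuracy under the most severe blocking condition.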
Qwen-AgentWorld: Language world models for general agent simulation and planning
Alibaba's Qwen team introduces Qwen-AgentWorld, a pair of language world models (35B-A3B and 397B-A17B) trained to simulate agentic environments across 7 domains using over 10M interaction trajectories. The models are trained via a three-stage pipeline (CPT, SFT, RL) and evaluated on AgentWorldBench, a new benchmark constructed from 5 frontier models across 9 established benchmarks. Beyond simulation, the work demonstrates two downstream use cases: using the world model as a decoupled RL training environment and as a warm-up for agent foundation models, both yielding gains over baselines.
PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards
Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.
Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters
Alibaba's Qwen team releases Qwen1.5-MoE-A2.7B, a mixture-of-experts model with only 2.7 billion activated parameters that claims performance parity with 7B dense models such as Mistral 7B and Qwen1.5-7B. Its activated parameter count is roughly one-third that of the 7B dense models it is compared against, which offers significant compute efficiency gains at inference. This release follows growing industry interest in MoE architectures sparked by Mixtral, and the model is available on GitHub, HuggingFace, and ModelScope.
Qwen releases AgentWorld-35B-A3B: a world-model and environment-simulation MoE for agents
Qwen has released Qwen-AgentWorld-35B-A3B on Hugging Face, a 35B-parameter MoE model (3B active) built on the Qwen3.5 MoE architecture. The model is tagged for world-model and environment-simulation use cases, suggesting it is designed to simulate environments for agent training or evaluation. It is paired with a dataset called AgentWorldBench, indicating an associated evaluation suite. Early engagement is minimal (0 downloads, 4 likes), but the model represents a notable direction in agent-environment modeling from a major open-weights lab.
Agents-A1: 35B MoE agent matches trillion-parameter models via horizon scaling
Researchers introduce Agents-A1, a 35B Mixture-of-Experts model that claims to match or exceed trillion-parameter models like Kimi-K2 and DeepSeek V4 on long-horizon agentic benchmarks. The approach scales agent trajectory length (averaging 45K tokens) and heterogeneous agent abilities rather than raw parameter count, using a three-stage training recipe including multi-teacher domain-routed distillation. On benchmarks such as SEAL-0, IFBench, HiPhO, and FrontierScience-Olympiad, Agents-A1 achieves leading or competitive results against models with roughly 30x more parameters. The work proposes a practical efficiency path for agentic capability scaling without proportional compute scaling.
QwQ-32B: Scaling Reinforcement Learning for Enhanced Reasoning
Alibaba's Qwen team releases QwQ-32B, a 32-billion parameter model trained with scaled Reinforcement Learning to improve reasoning capabilities beyond conventional pretraining and post-training methods. The release draws explicit comparison to DeepSeek R1's cold-start and multi-stage RL training approach. The model is available via Qwen Chat, Hugging Face, ModelScope, and a demo interface. This represents Qwen's exploration of RL scalability as a path to enhanced LLM intelligence.


