5 · arXiv cs.CL (Computation and Language) · 4d ago

PEEU method enables 7B GUI agent to outperform 32B model on web task planning

Researchers introduce PEEU (Planning Experience Exploration and Utilization), a training approach for small open-source multimodal LLMs that autonomously explores GUI environments to collect hindsight experience and synthesizes high-level training data for task planning. A 7B model trained with PEEU achieves 30.6% accuracy on real-world benchmarks, outperforming Qwen2.5-VL-32B. The paper also proposes TDHAF, a hierarchical analysis framework revealing that high-level task training yields stronger out-of-distribution generalization than mastering low-level atomic skills alone.

Evaluation and Benchmarking · Open Weights Progress · Agent and Tool Ecosystem · Planning Experience Exploration and Utilization · Task Decomposition Hierarchical Analysis Framework · Qwen-2.5-VL-3B

Related guides (3)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Related events (8)

5 · arXiv · cs.AI · 20d ago

EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents

EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ACE · DeepSeek V4 · Qwen3-4B-Instruct · +2 more

6 · arXiv · cs.CL · 7d ago

PlanBench-XL: Benchmark for LLM Agent Planning in Large-Scale Tool Ecosystems

Researchers introduce PlanBench-XL, an interactive benchmark of 327 retail tasks spanning 1,665 tools designed to evaluate LLM agents on long-horizon planning under retrieval-limited tool visibility. The benchmark includes a blocking mechanism simulating real-world disruptions such as missing or failing tools, forcing agents to detect and recover from broken execution paths. Experiments on ten leading LLMs reveal severe performance degradation: GPT-5.4 drops from 51.90% accuracy in unblocked settings to 11.36% under the most severe blocking condition, highlighting fragility in adaptive planning for large, imperfect tool environments.

Evaluation and Benchmarking · Agent and Tool Ecosystem · OpenAI · PlanBench-XL · GPT-5.5
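
To put the PlanBench-XL degradation in perspective, the two GPT-5.4 accuracy figures quoted in the summary can be turned into an absolute and a relative drop. The snippet below is only arithmetic on those two numbers; the helper name `accuracy_drop` is made up for illustration and is not part of the benchmark.

```python
# Accuracy figures quoted in the PlanBench-XL summary (percent).
unblocked_acc = 51.90      # unblocked setting
most_blocked_acc = 11.36   # most severe blocking condition


def accuracy_drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


absolute_drop, relative_drop = accuracy_drop(unblocked_acc, most_blocked_acc)
print(f"absolute drop: {absolute_drop:.2f} points")   # 40.54 points
print(f"relative drop: {relative_drop:.1f}%")         # 78.1%
```

In other words, under the harshest blocking condition the model loses roughly four fifths of the accuracy it had in the unblocked setting.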

7 · arXiv · cs.CL · 6d ago

Qwen-AgentWorld: Language world models for general agent simulation and planning

Alibaba's Qwen team introduces Qwen-AgentWorld, a pair of language world models (35B-A3B and 397B-A17B) trained to simulate agentic environments across 7 domains using over 10M interaction trajectories. The models are trained via a three-stage pipeline (CPT, SFT, RL) and evaluated on AgentWorldBench, a new benchmark constructed from 5 frontier models across 9 established benchmarks. Beyond simulation, the work demonstrates two downstream use cases: using the world model as a decoupled RL training environment and as a warm-up for agent foundation models, both yielding gains over baselines.

Frontier Model Releases · Evaluation and Benchmarking · AgentWorldBench · Qwen-AgentWorld-35B-A3B · Alibaba · +3 more

7 · arXiv · cs.CL · 27d ago

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen2.5-7B · GRPO · Qwen3-4B · +7 more
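
The PROVE summary mentions a multi-component programmatic reward that needs no judge model. The sketch below shows what such a reward could look like in general; it is not the paper's reward function. The three components (tool validity, call match, final-state match), the weights, and all names (`ToolCall`, `programmatic_reward`, the example tools) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation; dataclass equality compares name and args."""
    name: str
    args: dict


def programmatic_reward(calls, expected_calls, final_state, expected_state,
                        known_tools, weights=(0.2, 0.3, 0.5)):
    """Score a trajectory with deterministic checks only (hypothetical design)."""
    # Component 1: fraction of emitted calls that name a known tool.
    valid = sum(c.name in known_tools for c in calls) / len(calls) if calls else 0.0
    # Component 2: fraction of expected calls reproduced at the same position.
    # zip() stops at the shorter list, so extra trailing calls are not scored here.
    matched = sum(1 for got, want in zip(calls, expected_calls) if got == want)
    call_score = matched / len(expected_calls) if expected_calls else 1.0
    # Component 3: does the environment end in the expected state?
    state_score = 1.0 if final_state == expected_state else 0.0
    w_valid, w_call, w_state = weights
    return w_valid * valid + w_call * call_score + w_state * state_score


# Hypothetical two-step task against a stateful order server.
known = {"create_order", "confirm_order"}
expected = [ToolCall("create_order", {"sku": "A1"}),
            ToolCall("confirm_order", {"order_id": 1})]
good = programmatic_reward(expected, expected, {"confirmed": True},
                           {"confirmed": True}, known)
partial = programmatic_reward([expected[0], ToolCall("unknown_tool", {})], expected,
                              {"confirmed": False}, {"confirmed": True}, known)
print(f"{good:.2f} {partial:.2f}")  # 1.00 0.25
```

Because every component is a plain comparison against verifiable ground truth, the score is reproducible and cheap to compute, which is the general appeal of programmatic rewards over model-based judging.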

6 · Qwen Research · 1mo ago

Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters

Alibaba's Qwen team releases Qwen1.5-MoE-A2.7B, a mixture-of-experts model with only 2.7 billion activated parameters that claims performance parity with 7B dense models such as Mistral 7B and Qwen1.5-7B. Its activated parameter count is roughly one-third that of those 7B dense models, offering significant compute efficiency gains at inference. This release follows growing industry interest in MoE architectures sparked by Mixtral, and the model is available on GitHub, Hugging Face, and ModelScope.

Frontier Model Releases · Open Weights Progress · Mixtral · Qwen1.5-MoE-A2.7B · Qwen1.5-7B · +6 more

6 · Qwen · 6d ago

Qwen releases AgentWorld-35B-A3B: a world-model and environment-simulation MoE for agents

Qwen has released Qwen-AgentWorld-35B-A3B on Hugging Face, a 35B-parameter MoE model (3B active) built on the Qwen3.5 MoE architecture. The model is tagged for world-model and environment-simulation use cases, suggesting it is designed to simulate environments for agent training or evaluation. It is paired with a dataset called AgentWorldBench, indicating an associated evaluation suite. Early engagement is minimal (0 downloads, 4 likes), but the model represents a notable direction in agent-environment modeling from a major open-weights lab.

Open Weights Progress · Agent and Tool Ecosystem · AgentWorldBench · Qwen-AgentWorld-35B-A3B · Qwen · +1 more

7 · arXiv · cs.CL · 15h ago

Agents-A1: 35B MoE agent matches trillion-parameter models via horizon scaling

Researchers introduce Agents-A1, a 35B Mixture-of-Experts model that claims to match or exceed trillion-parameter models like Kimi-K2 and DeepSeek V4 on long-horizon agentic benchmarks. The approach scales agent trajectory length (averaging 45K tokens) and heterogeneous agent abilities rather than raw parameter count, using a three-stage training recipe including multi-teacher domain-routed distillation. On benchmarks such as SEAL-0, IFBench, HiPhO, and FrontierScience-Olympiad, Agents-A1 achieves leading or competitive results against models with roughly 30x more parameters. The work proposes a practical efficiency path for agentic capability scaling without proportional compute scaling.

Frontier Model Releases · Inference Economics · IFBench · Kimi-K2 · DeepSeek V4 · +8 more

7 · Qwen Research · 1mo ago

QwQ-32B: Scaling Reinforcement Learning for Enhanced Reasoning

Alibaba's Qwen team releases QwQ-32B, a 32-billion parameter model trained with scaled Reinforcement Learning to improve reasoning capabilities beyond conventional pretraining and post-training methods. The release draws explicit comparison to DeepSeek R1's cold-start and multi-stage RL training approach. The model is available via Qwen Chat, Hugging Face, ModelScope, and a demo interface. This represents Qwen's exploration of RL scalability as a path to enhanced LLM intelligence.

Frontier Model Releases · Evaluation and Benchmarking · DeepSeek V4 · Alibaba · Qwen · +6 more