Entity · model

Qwen3-VL-Thinking

model · active · qwen3-vl-thinking-1c3c72a1 · 1 event · first seen May 28, 2026

Aliases: Qwen3-VL-Thinking

More like this (12)

Qwen3-4B-Thinking-2507 · Qwen3-Omni-Thinking · Qwen-3-VL-2B · Qwen-VL · Qwen-2.5-VL-3B · Qwen3VL-8B · Qwen3VL-8B · Qwen3.5-Plus · Qwen3.6-Plus · Qwen3 · Qwen-VLA · Qwen2.5-VL

Recent events (1)

7 · arXiv · cs.CL · May 28, 2026

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning, the 'Thinking-Acting Gap': under standard RL training with GRPO, tool use is attempted in only ~30% of rollouts, and tool-using subgroups in which every rollout is wrong suppress the learning signal. The authors propose AXPO (Agent eXplorative Policy Optimization), which, for such all-wrong subgroups, fixes the thinking prefix and resamples the tool calls, combined with uncertainty-based prefix selection. Evaluated on nine multimodal benchmarks with Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, and the 8B model surpasses the 32B baseline on Pass@4 with one quarter of the parameters.
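One plausible reading of why all-wrong subgroups starve learning is how GRPO scores rollouts: each reward is normalized against its group's mean and standard deviation, so a group whose rewards are all identical (for example, all zero) gives every rollout an advantage of zero. The Python sketch below illustrates that effect and a simplified resampling step in the spirit of AXPO. It is not the authors' implementation: the function names (`grpo_advantages`, `resample_all_wrong`, `select_prefix`), the use of the population standard deviation, the `eps` constant, the toy tool names, and the choice to pick the *most* uncertain prefix are all hypothetical assumptions for illustration.

```python
import itertools
import statistics
from typing import Callable, List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages (r - mean) / (std + eps).

    Assumption: population std; real GRPO implementations vary.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def resample_all_wrong(
    prefix: str,
    sample_tool_call: Callable[[str], str],
    reward: Callable[[str, str], float],
    k: int,
) -> List[float]:
    """Hypothetical AXPO-style step: keep the thinking prefix fixed and
    draw k fresh tool-call continuations, returning their rewards."""
    return [reward(prefix, sample_tool_call(prefix)) for _ in range(k)]


def select_prefix(prefixes: Sequence[str], uncertainty: Callable[[str], float]) -> str:
    """Hypothetical uncertainty-based selection: pick the prefix with the
    highest uncertainty score (the real selection rule may differ)."""
    return max(prefixes, key=uncertainty)


if __name__ == "__main__":
    # An all-wrong group: identical rewards, so every advantage is zero.
    print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))

    # Toy sampler (hypothetical tool names) that cycles through tool calls;
    # only "zoom" is rewarded in this made-up task.
    calls = itertools.cycle(["search", "crop", "zoom"])
    rewards = resample_all_wrong(
        prefix="<think>...</think>",
        sample_tool_call=lambda p: next(calls),
        reward=lambda p, c: 1.0 if c == "zoom" else 0.0,
        k=4,
    )
    print(rewards)
    # Resampling produced reward variance, so advantages are now nonzero.
    print(grpo_advantages(rewards))
```

Run as a script, the first line prints `[0.0, 0.0, 0.0, 0.0]`, showing that the all-wrong group contributes no gradient; once one resampled tool call succeeds, the group's rewards differ and the successful rollout receives a positive advantage.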

Frontier Model Releases · Agent and Tool Ecosystem · AXPO · GRPO · Thinking-Acting Gap