VEPO: Vision-anchored token selection improves RL for visual reasoning
A new arXiv paper identifies a failure mode of entropy-based credit assignment in multimodal reinforcement learning: vision-sensitive tokens with naturally low entropy are systematically ignored, causing the mechanism to collapse in visual reasoning tasks. The authors propose VEPO (Vision-Entropy token-selection for Policy Optimization), which couples visual sensitivity with token entropy via a multiplicative scheme to redirect gradient credit toward tokens that are both visually grounded and semantically informative. VEPO outperforms entropy-only baselines by 2.28 points at 7B scale and 3.15 points at 3B scale on visual reasoning benchmarks.
Related guides (2)
Related events (8)
STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training
Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.
DelTA: Discriminative Token Credit Assignment for RLVR Training
DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.
Reroute: Training-free recoverable visual token routing for vision-language models
A new arXiv preprint proposes Reroute, a training-free plug-in that replaces the standard rank-and-remove visual token pruning paradigm in VLMs with a recoverable routing mechanism. Instead of permanently discarding low-ranked tokens, Reroute defers them to re-enter the candidate pool at later decoder stages, addressing the problem that token importance shifts across decoder depth. Evaluated on LLaVA-1.5 and Qwen backbones augmented with FastV, PDrop, and Nüwa pruning methods, Reroute improves grounding performance under aggressive token reduction without sacrificing general VQA accuracy. The approach preserves the theoretical compute and KV-cache budget of the underlying pruning method.
ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens
ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.
POW3R: Policy-Aware Rubric Rewards for More Efficient RLVR Training
This paper identifies a failure mode in rubric-based reinforcement learning with verifiable rewards (RLVR): static aggregation of criterion weights conflates human-assigned importance with current optimization utility, causing many criteria to be either already saturated or unreachable. The authors introduce POW3R, a framework that dynamically reweights criterion-level rewards during training using rollout-level contrast to emphasize criteria that currently differentiate policy outputs. Across three base policies and two datasets (multimodal and text-only), POW3R wins 24 of 30 comparisons on rubric reward and strict completion metrics, and reaches equivalent performance in 2.5–4× fewer training steps than vanilla GRPO with rubric rewards.
Good Token Hunting: Token Selection Framework for Visual Geometry Transformers
This paper introduces a two-stage token selection framework to address the quadratic computational scaling of global attention in visual geometry transformers used for multi-view 3D reconstruction. The approach combines diversity-based inter-frame selection (frame-level) with entropy-guided intra-frame sparsification (token-level within frames). Experiments demonstrate over 85% acceleration for 500-image scenes while maintaining or improving baseline reconstruction quality, offering a favorable speed-accuracy trade-off.
Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

