Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs: models answer fine-grained visual questions more accurately when given cropped evidence regions than when given full images. The method instantiates a crop-conditioned teacher and a full-image-conditioned student from the same MLLM, then minimizes token-level divergence between them along on-policy rollouts so that regional perception transfers to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. On benchmarks, it performs competitively with or better than larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
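To make the token-level objective concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes the divergence is a forward KL, KL(teacher || student), averaged over the positions of one student-sampled rollout; the summary does not state the direction or the averaging, so both are assumptions. The function names and toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def rollout_distillation_loss(teacher_steps, student_steps):
    """Mean token-level KL along one rollout sampled from the student.

    teacher_steps[i]: logits from the crop-conditioned pass at position i
    student_steps[i]: logits from the full-image pass at position i
    Both passes score the same student-sampled tokens (on-policy).
    """
    if not teacher_steps or len(teacher_steps) != len(student_steps):
        raise ValueError("need matching, non-empty teacher/student positions")
    per_token = [token_kl(t, s) for t, s in zip(teacher_steps, student_steps)]
    return sum(per_token) / len(per_token)


# Toy rollout: 2 token positions, vocabulary of 3 (values are made up).
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
student = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]

loss = rollout_distillation_loss(teacher, student)
identical = rollout_distillation_loss(teacher, teacher)
```

In a real training loop the two logit sequences would come from two forward passes of the same model, one conditioned on the crop and one on the full image, with gradients flowing only through the full-image pass. That detail is also an assumption here.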
Related events (8)
d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs
Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.
OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback
Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework where two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given and feedback anchoring to ground it in the problem context. On Science Q&A tasks, OPCoD achieves Pareto improvement for both models across all evaluated domain pairs, outperforming one-way distillation and single-model fine-tuning baselines.
Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency
This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.
Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates
A new arXiv paper analyzes on-policy distillation (OPD) — a post-training method combining on-policy student trajectories with dense teacher supervision — across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.
ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs
ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps: a language-side gap (mapping abstract questions to visual edits) and a generation-side gap (edit quality degrading with reasoning depth). It does so with a two-stage training recipe that combines supervised fine-tuning on edit trajectories with VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.
PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs
This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.
Visual instruction tuning aligns modalities in intermediate LLM layers, not early ones
A new arXiv paper investigates how visual instruction tuning embeds image features into the layer-wise hierarchy of LLM backbones across diverse vision-language architectures. Using probing analyses and causal interventions, the authors find that instruction tuning routes visual features into intermediate semantic layers, bypassing early unimodal-processing layers. They further show that fine-tuning restricted to these intermediate layers alone preserves full fine-tuning performance on vision-centric benchmarks while reducing training time, suggesting multimodal integration is a localized phenomenon.



