Almanac
← Events
6arXiv cs.CL (Computation and Language)·1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

Related guides (4)

Related events (8)

5arXiv · cs.CL·3d ago·source ↗

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.

6arXiv · cs.CL·22d ago·source ↗

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

5arXiv · cs.CL·5d ago·source ↗

OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework where two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given and feedback anchoring to ground it in the problem context. On Science Q&A tasks, OPCoD achieves Pareto improvement for both models across all evaluated domain pairs, outperforming one-way distillation and single-model fine-tuning baselines.

6arXiv · cs.CL·22d ago·source ↗

Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency

This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.

5arXiv · cs.LG·8d ago·source ↗

Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates

A new arXiv paper analyzes on-policy distillation (OPD) — a post-training method combining on-policy student trajectories with dense teacher supervision — across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.

6arXiv · cs.AI·26d ago·source ↗

ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs

ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps—language-side (mapping abstract questions to visual edits) and generation-side (edit quality degrading with reasoning depth)—via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.

6arXiv · cs.AI·26d ago·source ↗

PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs

This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.

5arXiv · cs.CL·17d ago·source ↗

Visual instruction tuning aligns modalities in intermediate LLM layers, not early ones

A new arXiv paper investigates how visual instruction tuning embeds image features into the layer-wise hierarchy of LLM backbones across diverse vision-language architectures. Using probing analyses and causal interventions, the authors find that instruction tuning routes visual features into intermediate semantic layers, bypassing early unimodal-processing layers. They further show that fine-tuning restricted to these intermediate layers alone preserves full fine-tuning performance on vision-centric benchmarks while reducing training time, suggesting multimodal integration is a localized phenomenon.