ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs
ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps—language-side (mapping abstract questions to visual edits) and generation-side (edit quality degrading with reasoning depth)—via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.
Related guides (3)
Related events (8)
Thinking with images
OpenAI announced a new capability allowing its reasoning models to incorporate images directly into their chain-of-thought process, enabling visual reasoning during intermediate thinking steps rather than only at input/output boundaries. This extends multimodal reasoning to the internal computation layer, potentially improving performance on tasks requiring visual analysis combined with multi-step reasoning. The announcement comes from OpenAI's official blog, indicating a product-level capability update.
Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework
A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
QVQ-Max: Alibaba Qwen Releases Visual Reasoning Model with Multimodal Chain-of-Thought
Alibaba's Qwen team has officially released QVQ-Max, a visual reasoning model succeeding the December 2024 QVQ-72B-Preview. The model is designed to analyze and reason over images and videos, covering domains including mathematics, programming, and creative tasks. It represents a step beyond the exploratory preview, positioning as a production-grade multimodal reasoning system.
Reasoning in Memory (RiM): Latent Reasoning via Working Memory Blocks in LLMs
RiM introduces a latent reasoning method that replaces autoregressive chain-of-thought token generation with fixed sequences of special 'memory block' tokens, allowing LLMs to perform internal computation without externalizing intermediate steps. These memory blocks are processed in a single forward pass rather than generated autoregressively, improving compute efficiency at test time. Training uses a two-stage curriculum: first grounding memory blocks by predicting explicit reasoning steps, then discarding step-level supervision and refining answers iteratively. Experiments across multiple model families and sizes show RiM matches or exceeds existing latent reasoning methods.
Study finds thinking mode in LRMs shifts instruction-following errors by constraint type rather than uniformly degrading performance
A new arXiv paper investigates how enabling built-in chain-of-thought reasoning ('Thinking ON/OFF') in Qwen3 and Hunyuan models affects instruction following on IFEval. Aggregate pass-rate changes are small but 10-20% of prompts switch outcomes, with 'Planning' constraints (global counting, structure) improving under thinking while 'Precision' constraints (exact local form) consistently worsen. Activation patching and trace-relevance analyses reveal an execution gap: thinking traces engage with Planning constraints but fail to translate that engagement into compliance, while Precision failures are more mechanistically recoverable. The findings have practical implications for when to enable reasoning modes in instruction-following applications.
Adversarial Subspace Alignment for Robust Multimodal Knowledge Editing in MLLMs
This paper addresses the generalization gap in multimodal large language model (MLLM) knowledge editing, where edits fail to propagate across semantically equivalent visual and linguistic variations. The authors introduce Latent Adversarial Robustification (LAR), which generates adversarial but semantically coherent variants in joint latent space, and Rank-Constrained Subspace Learning (RCSL), which enforces low-rank alignment of adversarial representations at the edit layer. Together these form the ASAM framework, which formalizes robustness via knowledge units grouping semantically equivalent multimodal inputs. Empirical analysis demonstrates improved generality without sacrificing reliability or locality.


