4 · arXiv cs.AI (Artificial Intelligence) · 44h ago

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Researchers propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework that decomposes visual conditioning queries into a structural-to-semantic cascade for text-to-image generation. The method supervises the structural queries with sketches during training only, so no sketch extraction is needed at inference time and the implicit CoT reasoning runs in a single forward pass. IV-CoT targets persistent weaknesses of multimodal LLMs in object counts, spatial relations, and attribute binding, and reports improved results on the GenEval and T2I-CompBench benchmarks.

Evaluation and Benchmarking Multimodal Progress GenEval T2I-CompBench IV-CoT

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 16d ago

IS-CoT framework addresses long-form generation collapse in LLMs via interleaved structural thinking

Researchers introduce IS-CoT (Interleaved Structural Chain-of-Thought), a framework that embeds a dynamic Plan-Write-Reflect cycle into LLM generation to overcome severe length collapse observed in reasoning-enhanced models for open-ended writing tasks beyond 2,000 words. The authors construct a multi-teacher training dataset of interleaved reasoning traces and train IS-Writer-8B, which achieves state-of-the-art results on LongBench-Write, outperforming DeepSeek-V3.2 by 3.08 points. The work identifies static hierarchical planning as a root cause of long-form degradation and proposes an in-model alternative to external agentic workflows.

Long Context Evolution Evaluation and Benchmarking DeepSeek V4 LongBench-Write IS-Writer-8B +1 more

6 · arXiv · cs.CL · 2d ago

Systematic evaluation reveals limits of multimodal Chain-of-Thought reasoning across perception and reasoning tasks

A new arXiv paper evaluates multimodal Chain-of-Thought (CoT) reasoning across 12 tasks using 22 models (14 non-reasoning, 8 reasoning), finding that CoT is not universally beneficial: it hurts performance on perception tasks like visual grounding and object counting while helping mathematical and scientific reasoning. The study identifies a 'Look Light, Think Heavy' pattern where visual reflection consistently diminishes during reasoning chains even as verbal reflection fluctuates, pointing to deep visual introspection as a key unsolved bottleneck. Open-source multimodal reasoning models show only marginal overall gains, likely due to overemphasis on mathematical reasoning during training.

Evaluation and Benchmarking Multimodal Progress Chain-of-Thought Reasoning Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

7 · OpenAI Blog · 1mo ago

Thinking with images

OpenAI announced a new capability allowing its reasoning models to incorporate images directly into their chain-of-thought process, enabling visual reasoning during intermediate thinking steps rather than only at input/output boundaries. This extends multimodal reasoning to the internal computation layer, potentially improving performance on tasks requiring visual analysis combined with multi-step reasoning. The announcement comes from OpenAI's official blog, indicating a product-level capability update.

Long Context Evolution Frontier Model Releases OpenAI Reasoning Models Chain-of-Thought Reasoning OpenAI +1 more

6 · arXiv · cs.CL · 1mo ago

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

Evaluation and Benchmarking Agent and Tool Ecosystem functional token GRPO Latent-Anchored GRPO +4 more
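
For context on the GRPO family that LA-GRPO extends: standard GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step. It is not the paper's Latent-Anchored variant, whose details are not given in the summary above. The function name, the `eps` parameter, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Hypothetical helper: sketches plain GRPO-style normalization only,
    not the Latent-Anchored variant (LA-GRPO) described in the summary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every response in a group receives the same reward, all advantages come out as zero, so that group contributes no learning signal under this normalization.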

5 · arXiv · cs.CL · 6d ago

Information-theoretic analysis of supervision in latent chain-of-thought reasoning

This paper analyzes Latent Chain-of-Thought (CoT) reasoning — where reasoning occurs in continuous hidden states rather than discrete text — through an information-theoretic lens, identifying a 'dual collapse' failure mode involving gradient attenuation and representational drift. The authors decompose process supervision into Trajectory Supervision and Space Supervision, and introduce the Unified Latent Probe (ULP) to quantify mutual information between latent trajectories and explicit reasoning steps. Experiments reveal an 'Information-Performance Binding' showing reasoning accuracy depends on information fidelity in the latent chain, suggesting supervision should shift from geometric imitation toward mutual information maximization.

Evaluation and Benchmarking Alignment and RLHF EIT-NLP Unified Latent Probe What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

6 · arXiv · cs.AI · 1mo ago

ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs

ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps—language-side (mapping abstract questions to visual edits) and generation-side (edit quality degrading with reasoning depth)—via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.

Agent and Tool Ecosystem Alignment and RLHF Reasoning Enhancement Qwen3-4B ETCHR +5 more

7 · OpenAI Blog · 1mo ago

Reasoning models struggle to control their chains of thought, and that's good

OpenAI introduces CoT-Control, a framework for evaluating how well reasoning models can deliberately manipulate or suppress their chain-of-thought outputs. The finding that models struggle to control their CoT is framed as a positive safety property, reinforcing the argument that visible reasoning traces serve as a meaningful monitorability safeguard. This contributes to ongoing research on whether chain-of-thought transparency is a reliable alignment and oversight tool.

Frontier Model Releases Evaluation and Benchmarking CoT-Control monitorability Chain-of-Thought Reasoning +3 more

6 · arXiv · cs.CL · 20d ago

NF-CoT: Latent reasoning with normalizing flows preserves autoregressive LLM advantages

Researchers propose NF-CoT, a latent reasoning framework that replaces discrete chain-of-thought token streams with continuous intermediate states modeled by normalizing flows embedded inside an LLM backbone. The approach uses a TARFlow-style normalizing flow head alongside the standard language model head, enabling exact likelihoods, KV-cache-compatible left-to-right decoding, and policy-gradient optimization in latent space. On code-generation benchmarks, NF-CoT improves pass rates over both explicit CoT and prior latent-reasoning baselines while reducing intermediate reasoning cost. The work addresses a key limitation of existing latent reasoning methods, which typically sacrifice probabilistic tractability or autoregressive compatibility.

Inference Economics Alignment and RLHF TARFlow NF-CoT Latent Reasoning with Normalizing Flows
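
The "exact likelihoods" property mentioned in the NF-CoT summary follows from the change-of-variables formula that all normalizing flows share: the log-density of a point is the base density of its transformed value plus the log-determinant of the transform's Jacobian. The sketch below illustrates this with the simplest possible case, an elementwise affine flow over a standard normal base. It is not NF-CoT's or TARFlow's architecture; in autoregressive flows the shift and log-scale are typically predicted by a network from earlier elements, whereas here they are fixed, hypothetical constants.

```python
import math


def affine_flow_logpdf(x, shift, log_scale):
    """Exact log-density of x under an elementwise affine flow.

    Forward map: z_i = (x_i - shift_i) * exp(-log_scale_i), with z ~ N(0, I).
    Hypothetical helper for illustration; parameters are fixed constants.
    """
    total = 0.0
    for xi, mu, s in zip(x, shift, log_scale):
        z = (xi - mu) * math.exp(-s)                       # map x to base space
        log_base = -0.5 * (z * z + math.log(2 * math.pi))  # log N(z; 0, 1)
        total += log_base - s                              # log|dz/dx| = -s
    return total


# A two-dimensional "latent state" with illustrative flow parameters
x = [0.3, -1.2]
shift = [0.1, 0.5]
log_scale = [0.2, -0.4]
print(affine_flow_logpdf(x, shift, log_scale))
```

Because the density is computed in closed form rather than estimated, it can serve directly as the log-probability term in a policy-gradient objective, which is the kind of tractability the summary contrasts with prior latent-reasoning methods.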