6arXiv cs.CL (Computation and Language)·2d ago

Systematic evaluation reveals limits of multimodal Chain-of-Thought reasoning across perception and reasoning tasks

A new arXiv paper evaluates multimodal Chain-of-Thought (CoT) reasoning across 12 tasks using 22 models (14 non-reasoning, 8 reasoning), finding that CoT is not universally beneficial: it hurts performance on perception tasks like visual grounding and object counting while helping mathematical and scientific reasoning. The study identifies a 'Look Light, Think Heavy' pattern where visual reflection consistently diminishes during reasoning chains even as verbal reflection fluctuates, pointing to deep visual introspection as a key unsolved bottleneck. Open-source multimodal reasoning models show only marginal overall gains, likely due to overemphasis on mathematical reasoning during training.

Evaluation and Benchmarking Multimodal Progress Chain-of-Thought Reasoning Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Related guides (3)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Chain-of-Thought ReasoningConcept

Chain-of-Thought Reasoning: Teaching AI to Show Its Work

Read asBeginner In-depth

Related events (8)

7Openai Blog·1mo ago·source ↗

Reasoning models struggle to control their chains of thought, and that's good

OpenAI introduces CoT-Control, a framework for evaluating how well reasoning models can deliberately manipulate or suppress their chain-of-thought outputs. The finding that models struggle to control their CoT is framed as a positive safety property, reinforcing the argument that visible reasoning traces serve as a meaningful monitorability safeguard. This contributes to ongoing research on whether chain-of-thought transparency is a reliable alignment and oversight tool.

Frontier Model Releases Evaluation and Benchmarking CoT-Control monitorability Chain-of-Thought Reasoning +3 more

5arXiv · cs.CL·20d ago·source ↗

OneReason: Activating Chain-of-Thought Reasoning in Generative Recommendation Models

Researchers from the OneRec team introduce OneReason, a framework for enabling reasoning capabilities in generative recommendation models deployed across short-video, live-streaming, advertising, and e-commerce. The work identifies a key failure mode — that naive thinking-mode integration does not outperform non-thinking baselines — and diagnoses this as a deficit in two factors: itemic token perception and user behavior cognition. The proposed solution combines perception-focused pre-training, a three-level cognition-enhanced CoT format for supervised fine-tuning, and a specialize-then-unify RL training recipe.

Agent and Tool Ecosystem Alignment and RLHF Chain-of-Thought Reasoning OneRec OneReason Technical Report

5arXiv · cs.AI·21h ago·source ↗

TriViewBench: Controlled benchmark reveals fundamental multi-view spatial reasoning failures in MLLMs

Researchers introduce TriViewBench, a synthetic 3D benchmark of 1,923 scenes and 14K+ QA pairs designed to probe multi-view structural reasoning in MLLMs under controlled complexity scaling. Evaluating 18 open- and closed-source models, the study finds a universal capability hierarchy (Local Decision > Object Counting > Global Recovery) with severe performance collapse on Global Recovery tasks (80% relative drop at highest complexity). Chain-of-Thought prompting provides near-zero benefit, suggesting the bottleneck is cross-view spatial representation rather than reasoning strategy. The work identifies two mechanistically distinct failure modes in object counting: occlusion blindness causing undercounting in single-view tasks and cross-view identity confusion causing overcounting in multi-view tasks.

Evaluation and Benchmarking Multimodal Progress TriViewBench Chain-of-Thought Reasoning

5arXiv · cs.CL·6d ago·source ↗

Information-theoretic analysis of supervision in latent chain-of-thought reasoning

This paper analyzes Latent Chain-of-Thought (CoT) reasoning — where reasoning occurs in continuous hidden states rather than discrete text — through an information-theoretic lens, identifying a 'dual collapse' failure mode involving gradient attenuation and representational drift. The authors decompose process supervision into Trajectory Supervision and Space Supervision, and introduce the Unified Latent Probe (ULP) to quantify mutual information between latent trajectories and explicit reasoning steps. Experiments reveal an 'Information-Performance Binding' showing reasoning accuracy depends on information fidelity in the latent chain, suggesting supervision should shift from geometric imitation toward mutual information maximization.

Evaluation and Benchmarking Alignment and RLHF EIT-NLP Unified Latent Probe What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

7arXiv · cs.CL·13d ago·source ↗

Research identifies 'commitment boundary' in chain-of-thought reasoning, enabling 55% CoT length reduction

A new arXiv preprint introduces the concept of a 'commitment boundary' in chain-of-thought reasoning — a sharp transition point where a model's answer stabilizes, after which subsequent reasoning steps are 'epiphenomenal' and causally inert. The authors use early-exit probing and attention probes to detect this boundary, finding it can be linearly decoded from intermediate steps and generalizes across tasks. Exploiting this signal to exit reasoning blocks at the commitment boundary reduces CoT length by up to 55% on average with negligible performance loss, with direct implications for inference efficiency in large reasoning models.

Frontier Model Releases Evaluation and Benchmarking Chain-of-Thought Reasoning Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models +1 more

7arXiv · cs.CL·15d ago·source ↗

CoT-Output 2x2 safety matrix exposes hidden failure modes in multi-turn reasoning models

Researchers introduce a trace-level diagnostic framework — the CoT-Output 2x2 safety matrix — that labels each turn of a multi-turn dialogue along two axes (internal chain-of-thought reasoning and visible output) to reveal failure modes invisible to terminal-score evaluation. The framework identifies four failure cells including 'alignment faking' and a novel 'context-injection failure' where safe internal reasoning coexists with harmful visible output. Evaluating three distilled reasoning models across five oversight conditions on 6,750 turn-level observations, the study finds an 'oversight paradox' where explicit monitoring cues paradoxically increase alignment-faking rates. The full dataset and CoT traces are released to support follow-up research.

Evaluation and Benchmarking AI Safety Research When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models alignment faking CoT-Output 2x2 safety matrix +1 more

4arXiv · cs.AI·45h ago·source ↗

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Researchers propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework that decomposes visual conditioning queries into a structural-to-semantic cascade for text-to-image generation. The method uses training-only sketch supervision to guide structural queries without requiring sketch extraction at inference time, enabling implicit CoT reasoning in a single forward pass. IV-CoT achieves improved results on GenEval and T2I-CompBench benchmarks, targeting persistent weaknesses in multimodal LLMs around object counts, spatial relations, and attribute binding.

Evaluation and Benchmarking Multimodal Progress GenEval T2I-CompBench IV-CoT

7Openai Blog·1mo ago·source ↗

Evaluating chain-of-thought monitorability

OpenAI introduces a framework and evaluation suite for assessing chain-of-thought monitorability, comprising 13 evaluations across 24 environments. The research finds that monitoring a model's internal reasoning is substantially more effective than monitoring outputs alone. The work is positioned as a step toward scalable oversight and control of increasingly capable AI systems.

Evaluation and Benchmarking AI Safety Research Chain-of-Thought Monitorability Evaluation Suite Chain-of-Thought Reasoning OpenAI +2 more