Entity · paper

When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

paper · active · when-built-in-thinking-helps-and-hurts-constraint-level-error-shifts-in-instruction-following-ee9fbaa8 · 1 event · first seen Jun 9, 2026

Aliases: When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Co-occurring entities

Hunyuan · Alibaba · Qwen3 · Tencent · IFEval

More like this (12)

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
AI-driven constraint reasoning
Chain-of-Thought Fine-Tuning
Instruction Hierarchy
Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity
When Do Learned Diffusion Proposals Help Constraint Solving? A Controlled Study on Continuous Algebraic Systems
Predicting Future Behaviors in Reasoning Models Enables Better Steering
Form, Not Content? A Preregistered, Placebo-Controlled Evaluation of Learned Error-Conditioned Self-Repair Through Prompts and Weights in Frozen Small Code Models
Error-Conditioned Neural Solvers
Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter
Instruction-Following Pruning
Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Recent events (1)

5 · arXiv · cs.CL · Jun 9, 2026

Study finds thinking mode in large reasoning models (LRMs) shifts instruction-following errors by constraint type rather than uniformly degrading performance

A new arXiv paper investigates how enabling built-in chain-of-thought reasoning ('Thinking ON/OFF') in Qwen3 and Hunyuan models affects instruction following on IFEval. Aggregate pass rates change little, but 10-20% of prompts switch outcomes: 'Planning' constraints (global counting, structure) improve under thinking, while 'Precision' constraints (exact local form) consistently worsen. Activation patching and trace-relevance analyses reveal an execution gap: thinking traces engage with Planning constraints but fail to translate that engagement into compliance, whereas Precision failures are more mechanistically recoverable. The findings bear on when to enable reasoning modes in instruction-following applications.
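To illustrate how aggregate pass rates can stay nearly flat while a sizable share of prompts switch outcomes, here is a minimal sketch in Python. It is not the paper's evaluation code: the function name `flip_summary` and the toy outcome lists are hypothetical, chosen only to show the arithmetic of comparing per-prompt pass/fail results with thinking OFF versus ON.

```python
from collections import Counter


def flip_summary(off, on):
    """Compare per-prompt pass/fail outcomes with thinking OFF vs ON.

    off, on: lists of booleans aligned by prompt (True = all constraints met).
    """
    if len(off) != len(on) or not off:
        raise ValueError("need two aligned, non-empty outcome lists")
    n = len(off)
    counts = Counter(zip(off, on))
    gained = counts[(False, True)]  # fail -> pass once thinking is enabled
    lost = counts[(True, False)]    # pass -> fail once thinking is enabled
    return {
        "pass_rate_off": sum(off) / n,
        "pass_rate_on": sum(on) / n,
        "net_change": (gained - lost) / n,
        "flip_rate": (gained + lost) / n,
    }


# Hypothetical toy outcomes for 10 prompts (illustrative only, not paper data).
off = [True, True, True, False, False, True, True, False, True, True]
on = [True, True, False, True, False, True, True, False, True, True]
summary = flip_summary(off, on)
print(summary)
```

In this toy case one prompt is gained and one is lost, so the pass rate is identical in both modes even though 20% of prompts changed outcome. Grouping the same comparison by constraint category would be the natural next step for separating Planning-style from Precision-style shifts.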

Frontier Model Releases · Evaluation and Benchmarking