paper
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
paper · active · provisional
attention-amnesia-in-hybrid-llms-when-cot-fine-tuning-breaks-long-range-recall-and-how-to-fix-it-989cae93 · 1 event · first seen 7d ago
Aliases: Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
More like this (12)
Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
FlashbackCL: Mitigating Temporal Forgetting in Federated Learning
Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models
Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
A sleep-like consolidation mechanism for LLMs
CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference
Learning from the Self-future: On-policy Self-distillation for dLLMs
Language Modeling Loss
Supervised Memory Training
Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
Recent events (1)
QK-Restore: Fixing long-context recall degradation caused by CoT fine-tuning in hybrid LLMs
Researchers find that chain-of-thought supervised fine-tuning (CoT-SFT) systematically degrades long-context recall in hybrid linear-attention models (HypeNet, Jet-Nemotron). Needle-In-A-Haystack performance collapses: for example, HypeNet-9B drops from 67.2% to 9.4% at 256K context. The paper traces the root cause to CoT-SFT biasing attention gradients toward short-range patterns, which corrupts the query-key projections responsible for long-range routing. It proposes QK-Restore, a training-free fix that restores only W_Q and W_K from the pre-SFT checkpoint, recovering long-context capability while preserving reasoning gains.
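The checkpoint merge that QK-Restore describes can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes both checkpoints are available as name-to-weight mappings (for example, PyTorch state dicts) with identical keys, and that query and key projections can be recognised by the hypothetical name suffixes `q_proj.weight` and `k_proj.weight`. The summary does not say how HypeNet or Jet-Nemotron name these parameters, or whether every layer is restored.

```python
from typing import Any, Dict, Mapping, Tuple

# Assumption: Q/K projection weights end with these suffixes. This is a common
# naming convention, not something confirmed for HypeNet or Jet-Nemotron.
QK_SUFFIXES: Tuple[str, ...] = ("q_proj.weight", "k_proj.weight")


def qk_restore(
    sft_state: Mapping[str, Any],
    base_state: Mapping[str, Any],
    qk_suffixes: Tuple[str, ...] = QK_SUFFIXES,
) -> Dict[str, Any]:
    """Return a copy of sft_state whose Q/K projections come from base_state."""
    if sft_state.keys() != base_state.keys():
        raise ValueError("checkpoints must share the same parameter names")

    # Shallow copy: every non-Q/K weight keeps its post-SFT value.
    restored = dict(sft_state)
    for name in sft_state:
        # str.endswith accepts a tuple and matches any of the suffixes.
        if name.endswith(qk_suffixes):
            restored[name] = base_state[name]
    return restored
```

With PyTorch, the result could be passed to `model.load_state_dict(...)`. No gradient step is involved, which is consistent with the "training-free" description.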