paper
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
paper · active · provisional
continual-llm-upcycling-a-predictor-gated-bank-wise-sparsity-training-recipe-for-dense-to-sparse-llms-90d67440 · 1 event · first seen 7d ago
Aliases: Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
More like this (12)
PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
A sleep-like consolidation mechanism for LLMs
Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
Learning from the Self-future: On-policy Self-distillation for dLLMs
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference
ExpRL: Exploratory RL for LLM Mid-Training
Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
code synthesis LLMs
TailLoR: Protecting Principal Components in Parameter-Efficient Continual Learning
long-context LLMs
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
Recent events (1)
Predictor-gated bank-wise sparsity recipe for dense-to-sparse LLM upcycling from Qwen2.5-8B
A new arXiv preprint introduces a continual training recipe to convert dense LLMs into channel-sparse models without post-hoc pruning. Starting from a Qwen2.5-8B checkpoint, the method uses a low-rank predictor to gate FFN channel routing, achieving 4x sparsity in FFN intermediate activations via a bank-wise top-k rule at 32K context. The routing module is trained on the main language modeling path, making the resulting sparsity hardware-oriented rather than approximate. The authors also identify and patch a layer-local long-context failure mode on the RULER-CWE benchmark.
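The summary above describes a low-rank predictor that scores FFN channels, followed by a bank-wise top-k selection. The sketch below shows one way such a gate could look. It is an illustration under stated assumptions, not the authors' implementation: the function names, the contiguous bank layout, and the reading of "4x sparsity" as keeping 1 of every 4 channels per bank are all hypothetical, and the preprint may define these differently.

```python
from typing import List


def matvec(matrix: List[List[float]], vec: List[float]) -> List[float]:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def predictor_scores(
    x: List[float],
    down: List[List[float]],
    up: List[List[float]],
) -> List[float]:
    """Hypothetical low-rank predictor: project the hidden state down to a
    small rank, then up to one score per FFN intermediate channel."""
    return matvec(up, matvec(down, x))


def bankwise_topk_mask(scores: List[float], bank_size: int, k: int) -> List[int]:
    """Keep the k highest-scoring channels inside each contiguous bank.

    Assumption: banks are contiguous, equal-sized slices of the channel axis.
    Every bank keeps exactly k channels, so the active fraction is k / bank_size.
    """
    if bank_size <= 0 or len(scores) % bank_size != 0:
        raise ValueError("number of channels must be a multiple of bank_size")
    if not 0 < k <= bank_size:
        raise ValueError("k must satisfy 0 < k <= bank_size")

    mask = [0] * len(scores)
    for start in range(0, len(scores), bank_size):
        bank = range(start, start + bank_size)
        # Rank channel indices within this bank only, highest score first.
        keep = sorted(bank, key=lambda i: scores[i], reverse=True)[:k]
        for i in keep:
            mask[i] = 1
    return mask


# Eight channels in two banks of four, keeping one channel per bank.
example_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.4, 0.05, 0.3]
example_mask = bankwise_topk_mask(example_scores, bank_size=4, k=1)
```

In this example a global top-2 would select channels 0 and 1, both in the first bank, whereas the bank-wise rule selects channel 0 from the first bank and channel 5 from the second. A fixed number of active channels per bank gives a regular layout, which is one plausible reading of why the summary calls the sparsity hardware-oriented.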