Almanac
paper

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

paperactiveprovisionalcontinual-llm-upcycling-a-predictor-gated-bank-wise-sparsity-training-recipe-for-dense-to-sparse-llms-90d67440·1 events·first seen 7d ago

Aliases: Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

Co-occurring entities

More like this (12)

Recent events (1)

5arXiv · cs.CL·7d ago·source ↗

Predictor-gated bank-wise sparsity recipe for dense-to-sparse LLM upcycling from Qwen2.5-8B

A new arXiv preprint introduces a continual training recipe to convert dense LLMs into channel-sparse models without post-hoc pruning. Starting from a Qwen2.5-8B checkpoint, the method uses a low-rank predictor to gate FFN channel routing, achieving 4x sparsity in FFN intermediate activations via a bank-wise top-k rule at 32K context. The routing module is trained on the main language modeling path, making the resulting sparsity hardware-oriented rather than approximate. The authors also identify and patch a layer-local long-context failure mode on the RULER-CWE benchmark.