NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation
NLL-guided training-free method selects which layers keep full attention for efficient long-context inference
Researchers propose NLL-guided layer selection, a training-free technique for hybrid attention models. It decides which layers use full attention and which use sliding-window attention by measuring how much the negative log-likelihood (NLL) on answer tokens degrades. On LongMemEval with Qwen3-4B, the method reaches 64.6% accuracy with full attention (FA) in only 1/4 of the layers. That matches a periodic 1/2-FA baseline while using half as many full-attention layers, and it beats a periodic 1/4-FA baseline by 10.4 percentage points. Calibration takes roughly 15 minutes of one-time compute, which makes the method practical to deploy. The work improves the efficiency-accuracy tradeoff for long-context LLM inference without any retraining.
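The selection idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each layer is scored by switching that layer alone to sliding-window attention and measuring the increase in mean answer-token NLL on calibration data. The layers with the largest increase keep full attention. `eval_nll` is a hypothetical callback that stands in for running the real model (for example Qwen3-4B) with a given set of sliding-window layers. The layer offset used for the periodic baseline is also an assumption.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence


def answer_token_nll(token_logprobs: Sequence[float], answer_mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over answer tokens only (context tokens are ignored)."""
    picked = [-lp for lp, is_answer in zip(token_logprobs, answer_mask) if is_answer]
    if not picked:
        raise ValueError("answer_mask selects no tokens")
    return sum(picked) / len(picked)


def score_layers(num_layers: int, eval_nll: Callable[[frozenset[int]], float]) -> list[float]:
    """NLL degradation when each layer alone is switched to sliding-window attention.

    eval_nll (hypothetical) runs the calibration data with the given layers on
    sliding-window attention and every other layer on full attention, and
    returns the mean answer-token NLL.
    """
    baseline = eval_nll(frozenset())  # every layer uses full attention
    return [eval_nll(frozenset({i})) - baseline for i in range(num_layers)]


def select_full_attention_layers(scores: Sequence[float], fraction: float = 0.25) -> list[int]:
    """Keep full attention on the layers whose switch to SWA hurts answer NLL the most."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def periodic_layers(num_layers: int, fraction: float = 0.25) -> list[int]:
    """Periodic baseline: every (1/fraction)-th layer keeps full attention (offset assumed)."""
    step = round(1 / fraction)
    return list(range(step - 1, num_layers, step))


if __name__ == "__main__":
    # Toy stand-in for a real model: each layer adds a fixed, made-up NLL
    # penalty when it is switched to sliding-window attention.
    sensitivity = [0.02, 0.30, 0.01, 0.05, 0.50, 0.03, 0.04, 0.10]

    def toy_eval_nll(swa_layers: frozenset[int]) -> float:
        return 1.0 + sum(sensitivity[i] for i in swa_layers)

    scores = score_layers(len(sensitivity), toy_eval_nll)
    print("NLL-guided FA layers:", select_full_attention_layers(scores, 0.25))  # [1, 4]
    print("Periodic FA layers:", periodic_layers(len(sensitivity), 0.25))  # [3, 7]
```

In the toy run, the periodic rule places full attention on layers 3 and 7 regardless of sensitivity. The NLL-guided rule picks layers 1 and 4, whose switch to sliding-window attention costs the most. One-at-a-time scoring needs only `num_layers + 1` calibration passes, but it ignores interactions between layers. A greedy or joint search would capture those interactions at a higher calibration cost.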