Almanac
technique

layer pruning

technique · active · layer-pruning-df87f564 · 1 event · first seen 1mo ago

Aliases: layer pruning

Recent events (1)

5 · arXiv · cs.LG · 1mo ago

Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find

This paper distinguishes two protocols for measuring transformer layer redundancy: replacement (can one layer substitute for another in place?) and interchange (do two layers approximately commute when swapped?). It shows that the two can disagree substantially. Experiments on Pythia (410M, 1.4B) and 8B-scale models (Qwen3-8B, Llama-3.1-8B) indicate that the gap between the protocols grows during training and can change, by several-fold, which layers appear safe to prune. On Qwen3-8B, interchange-guided removal is far safer than replacement-guided removal at the same layer budgets, whereas on Llama-3.1-8B the two protocols tie despite lower interchange KL. The authors recommend scoring both swap-KL metrics before any layer removal or merging, which requires only unlabeled forward passes.
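The difference between the two protocols can be illustrated on a toy model. The sketch below is not the paper's code: it assumes a small stack of random residual layers in pure Python, and it assumes each score is the mean KL divergence between the unmodified model's output distribution and the modified model's output distribution. The names `make_layer`, `replacement_kl`, and `interchange_kl` are hypothetical. Replacement puts layer j's function at position i, leaving position j as it is; interchange swaps the two positions.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL(p || q) for two discrete distributions of equal length
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def make_layer(rng, dim, scale=0.5):
    # Hypothetical toy residual layer: x + tanh(W x), with random W
    W = [[rng.gauss(0, scale) for _ in range(dim)] for _ in range(dim)]
    def layer(x):
        return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
                for xi, row in zip(x, W)]
    return layer

def forward(layers, x):
    # Run the stack and read out a distribution (stand-in for next-token probs)
    for layer in layers:
        x = layer(x)
    return softmax(x)

def mean_kl(base, variant, inputs):
    return sum(kl(forward(base, x), forward(variant, x)) for x in inputs) / len(inputs)

def replacement_kl(layers, i, j, inputs):
    # Assumed protocol: layer j substitutes for layer i in place
    variant = list(layers)
    variant[i] = layers[j]
    return mean_kl(layers, variant, inputs)

def interchange_kl(layers, i, j, inputs):
    # Assumed protocol: layers i and j trade positions
    variant = list(layers)
    variant[i], variant[j] = variant[j], variant[i]
    return mean_kl(layers, variant, inputs)

rng = random.Random(0)
dim = 8
layers = [make_layer(rng, dim) for _ in range(6)]
inputs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]  # unlabeled inputs

print("replacement KL (2 <- 3):", replacement_kl(layers, 2, 3, inputs))
print("interchange KL (2 <-> 3):", interchange_kl(layers, 2, 3, inputs))
```

Both scores need only forward passes on unlabeled inputs, in line with the summary above. In this sketch interchange is symmetric in the pair (i, j) while replacement is not, which is one reason the two scores can rank layer pairs differently.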