Dynamic Short Convolutions Improve Transformers
paper · active · provisional
dynamic-short-convolutions-improve-transformers-2fec1b82 · 1 event · first seen 13d ago
Aliases: Dynamic Short Convolutions Improve Transformers
More like this (12)
Training-Free Looped Transformers
Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers
Sparse Transformer
Variable-Width Transformers
Graph Transformer
Swift Transformers
Sparse Autoencoders
invertible 1x1 convolutions
Disentangled RNNs
Transformer Language Models
Sentence Transformers
transformer-based neural renderer
Recent events (1)
Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers
A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.
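To make the contrast between static and input-dependent filters concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the preprint's formulation: the depthwise causal layout, the linear filter generator, and the names `static_short_conv`, `dynamic_short_conv`, `W_gen`, and `b_gen` are all hypothetical, and the custom Triton kernels mentioned above are not reproduced.

```python
import numpy as np


def static_short_conv(x, w):
    """Causal depthwise short convolution with one fixed filter.

    x: (T, D) sequence of T vectors (for example key, query, or value vectors).
    w: (K, D) filter taps; w[k] weights the input k steps in the past.
    """
    T, D = x.shape
    K = w.shape[0]
    # Left-pad with K-1 zero rows so position t only sees x[t-K+1 .. t].
    xp = np.concatenate([np.zeros((K - 1, D)), x], axis=0)
    y = np.zeros_like(x)
    for k in range(K):
        # xp[K-1-k : K-1-k+T] is x shifted back by k steps.
        y += w[k] * xp[K - 1 - k : K - 1 - k + T]
    return y


def dynamic_short_conv(x, W_gen, b_gen):
    """Same convolution, but the taps at position t are computed from x[t].

    ASSUMPTION: taps come from a single linear map of the current input.
    W_gen: (D, K*D) hypothetical filter-generator weights.
    b_gen: (K*D,)   hypothetical filter-generator bias.
    """
    T, D = x.shape
    K = b_gen.shape[0] // D
    # taps[t, k] weights x[t-k]; every position gets its own filter.
    taps = (x @ W_gen + b_gen).reshape(T, K, D)
    xp = np.concatenate([np.zeros((K - 1, D)), x], axis=0)
    y = np.zeros_like(x)
    for k in range(K):
        y += taps[:, k, :] * xp[K - 1 - k : K - 1 - k + T]
    return y


# Usage: a length-6 sequence of 4-dimensional vectors with kernel size 3.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
W_gen = 0.1 * rng.standard_normal((4, 3 * 4))
b_gen = rng.standard_normal(3 * 4)
y = dynamic_short_conv(x, W_gen, b_gen)
```

In this sketch, setting `W_gen` to zero makes every position use the same taps, so the dynamic version reduces to the static one. Both variants keep the locality bias of a short causal window; the dynamic one only adds input dependence to the taps.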