Almanac
Differential Transformer

model · active · differential-transformer-6cb2c7a8 · 1 event · first seen 28d ago

Aliases: Differential Transformer

Recent events (1)

Hugging Face Blog · 28d ago

Differential Transformer V2

Microsoft has published a post on the Hugging Face Blog introducing Differential Transformer V2, an updated version of its differential attention mechanism for transformers. Differential attention aims to reduce attention noise by computing attention as the difference between two softmax attention maps. The post likely covers improvements to the original design, training dynamics, or scaling behavior in V2.
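
As a rough illustration of the "difference between two softmax attention maps" idea, the sketch below follows the formulation from the original Differential Transformer, not anything specific to V2. The function names, the list-of-lists matrix layout, and the fixed scalar `lam` (a learnable parameter in the original design) are assumptions made for this example, and it leaves out masking, multi-head structure, and any normalization.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for list-of-lists matrices."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        for q in Q
    ]

def differential_attention(Q1, K1, Q2, K2, V, lam):
    """Return (A1 - lam * A2, (A1 - lam * A2) V).

    `lam` is a plain float here (assumption); no masking or normalization.
    """
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    # Subtract the second map so attention common to both maps cancels.
    diff = [
        [a1 - lam * a2 for a1, a2 in zip(r1, r2)]
        for r1, r2 in zip(A1, A2)
    ]
    # Weighted sum of value rows using the differential weights.
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in diff
    ]
    return diff, out

# Toy inputs: 2 tokens, head dimension 2 (values chosen arbitrarily).
Q1 = [[1.0, 0.0], [0.0, 1.0]]
K1 = [[1.0, 0.0], [0.0, 1.0]]
Q2 = [[0.5, 0.5], [0.5, 0.5]]
K2 = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

diff, out = differential_attention(Q1, K1, Q2, K2, V, lam=0.5)
```

Because each softmax row sums to 1, every row of the differential map sums to `1 - lam`, and when the two maps are identical with `lam = 1` the output is zero: whatever both maps attend to equally is cancelled, which is the noise-reduction intuition.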