VASAE

technique · active · provisional · vasae-131edda9 · 1 event · first seen 11h ago

Aliases: VASAE

Recent events (1)

5 · arXiv · cs.CL · 11h ago

VASAE: Vocabulary-Aligned Sparse Autoencoder assigns intrinsic token names to SAE features during training

Researchers introduce VASAE (Vocabulary-Aligned Sparse Autoencoder), a method that anchors SAE features to the vocabulary during training, so that each feature is intrinsically named by the nearest token in the model's embedding space. Applied to GPT-2-small and Llama-3.1-8B, VASAE achieves ~90% feature alignment in shallow-to-middle layers without degrading reconstruction quality, though final-layer alignment is limited. The work targets a longstanding interpretability bottleneck: SAE dictionary features normally require expensive post-hoc labeling. Naming features intrinsically could make mechanistic analysis more scalable.
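
To make the "named by the nearest token" idea concrete, here is a minimal sketch of the lookup step only. It is not the VASAE training procedure, which the summary does not describe. It assumes cosine similarity as the distance measure and a hypothetical threshold of 0.9 for counting a feature as aligned; neither is confirmed by the source. The names `name_features`, `alignment_rate`, and the toy vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_features(feature_dirs, token_embeddings, vocab):
    """Return (nearest token, similarity) for each feature direction.

    Assumption: "nearest" means highest cosine similarity between the
    feature direction and a token embedding.
    """
    names = []
    for direction in feature_dirs:
        sims = [cosine(direction, emb) for emb in token_embeddings]
        best = max(range(len(vocab)), key=lambda i: sims[i])
        names.append((vocab[best], sims[best]))
    return names


def alignment_rate(names, threshold=0.9):
    """Fraction of features whose best similarity meets the threshold.

    Assumption: the threshold is hypothetical, not taken from the paper.
    """
    if not names:
        return 0.0
    return sum(1 for _, sim in names if sim >= threshold) / len(names)


# Toy data: a three-token vocabulary with orthogonal embeddings.
vocab = ["cat", "dog", "tree"]
token_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
feature_dirs = [
    [0.9, 0.1, 0.0],   # close to "cat"
    [0.0, 0.2, 0.98],  # close to "tree"
    [0.5, 0.5, 0.5],   # equidistant from all tokens, so poorly aligned
]

names = name_features(feature_dirs, token_embeddings, vocab)
rate = alignment_rate(names)
```

In this toy setup the first two features map cleanly to "cat" and "tree", while the third falls below the assumed threshold, so two of the three features count as aligned.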