Energy-based transformers as unified predictors of reading difficulty in computational psycholinguistics
A new arXiv preprint introduces energy-based transformer measures as predictors of human reading difficulty, evaluated across three reading-time corpora (Natural Stories, UCL eye-tracking, UCL self-paced reading). The energy measure outperforms surprisal alone and appears to subsume the effects of both surprisal and attention entropy, suggesting it could serve as a single unified predictor. The work connects transformer language models to the literature on Hopfield networks and dense associative memory, marking the first application of energy-based transformer measures in computational psycholinguistics.
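To make the two kinds of predictor concrete, here is a minimal sketch contrasting surprisal with a log-sum-exp energy of the kind used in the modern Hopfield network literature. The summary does not state the preprint's exact energy definition, so the `hopfield_energy` form, the `beta` parameter, and the toy vectors are all assumptions for illustration only.

```python
import math

def surprisal_bits(prob):
    """Surprisal of a token with predicted probability `prob`, in bits."""
    return -math.log2(prob)

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def hopfield_energy(query, keys, beta=1.0):
    """Assumed energy form: E = -(1/beta) * log sum_i exp(beta * <query, key_i>).

    Lower energy means the query matches some stored key well.
    This is an illustrative choice, not the preprint's confirmed measure.
    """
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    return -logsumexp(scores) / beta

# Hypothetical toy data: two stored key vectors.
keys = [[1.0, 0.0], [0.0, 1.0]]
familiar = hopfield_energy([1.0, 0.0], keys)      # aligned with a stored key
unfamiliar = hopfield_energy([-1.0, -1.0], keys)  # aligned with neither key

print(surprisal_bits(0.25), familiar, unfamiliar)
```

Under this assumed form, a state that matches stored patterns gets lower energy than one that matches none, which parallels how a highly predictable token gets lower surprisal.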
Related events (8)
Conditional Scale Entropy: A Wavelet-Derived Tool for Mechanistic Interpretability of Metaphor Processing in Transformers
This paper introduces Conditional Scale Entropy (CSE), a wavelet-derived measure of how transformer computation engages across frequency scales at each layer, and applies it to study metaphor processing in decoder-only language models. The authors prove CSE is invariant to update magnitude, isolating structural computation patterns from intensity. Across architectures ranging from GPT-2 (124M) to LLaMA-2 7B and GPT-oss 20B, metaphorical tokens consistently produce higher spectral breadth than literal tokens in early-to-mid layers, with the effect surviving permutation correction and specificity controls. The work establishes multi-scale coordination as a consistent mechanistic signature of metaphorical language processing and positions CSE as a general interpretability tool for cross-depth structure in transformers.
Roofline-inspired scaling model predicts Transformer fine-tuning energy consumption across GPU configurations
A new arXiv preprint presents a framework for modeling energy consumption during Transformer training on multiple GPUs, using BERT architectural sweeps to relate measured energy to proxies for compute, memory traffic, and hardware efficiency. The approach adapts roofline modeling with a speedup-based hardware-efficiency factor that accounts for tensor parallelism and fully sharded data parallelism. The resulting scaling law accurately predicts training energy across heterogeneous configurations, targeting sustainable and cost-aware system design.
Transformer embeddings shown to intrinsically encode Russell's circumplex model of emotion geometry
A new arXiv paper investigates whether Transformer-based text and speech encoders (RoBERTa, wav2vec 2.0) recover the geometric structure of Russell's circumplex model of affect, a valence-arousal topology from psychology. Experiments on naturalistic datasets (MSP-Podcast) and LLM-generated stimuli show that multimodal fusion achieves perfect topological alignment with Russell's primary emotion ordering, and zero-shot generic text embeddings place fine-grained emotion terms near their human-mapped coordinates. The authors argue this structure is intrinsically encoded in the representations rather than being an artifact of labeling, bridging psychological theory and representation learning.
Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers
A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.
Generating Human-level Text with Contrastive Search in Transformers
Hugging Face introduces contrastive search, a decoding strategy for autoregressive language models that aims to produce more coherent and human-like text than standard methods such as beam search or nucleus sampling. The technique works by balancing a model's confidence in its next-token prediction against a contrastive penalty that discourages repetitive or degenerate outputs. The blog post describes integration of contrastive search into the Hugging Face Transformers library, making it accessible to practitioners.
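The balance between confidence and penalty can be sketched as a single decoding step. This is a plain-Python toy, not the Transformers library implementation. It scores each top-k candidate as `(1 - alpha) * probability - alpha * max cosine similarity` to the hidden states of earlier tokens. The function names, the two-dimensional hidden states, and the probabilities are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_step(candidates, context_states, alpha=0.6):
    """Pick the next token from the top-k candidates.

    candidates: list of (token, probability, hidden_state) tuples.
    context_states: hidden states of the tokens generated so far.
    alpha: weight of the degeneration penalty (0 reduces to greedy search).
    """
    best_token, best_score = None, -math.inf
    for token, prob, state in candidates:
        # Penalty: how similar this candidate is to anything already said.
        penalty = max(cosine(state, h) for h in context_states)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Hypothetical toy inputs.
context = [[1.0, 0.0], [0.9, 0.1]]
candidates = [
    ("the", 0.6, [1.0, 0.0]),    # most likely, but nearly identical to the context
    ("river", 0.4, [0.0, 1.0]),  # less likely, but dissimilar to the context
]
choice = contrastive_step(candidates, context)
print(choice)
```

With `alpha=0.0` the step falls back to picking the most probable candidate. A larger `alpha` increasingly favors tokens whose representations differ from the preceding context.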
Program synthesis used to reverse-engineer transformer attention heads with executable Python surrogates
Researchers propose a pipeline that approximates transformer attention heads with executable Python programs generated by a language model, then re-ranked by held-out predictive accuracy. Applied to GPT-2, TinyLlama-1.1B, and Llama-3B, fewer than 1,000 programs reproduce attention patterns with >75% average IoU similarity on TinyStories. Replacing 25% of attention heads with programmatic surrogates incurs only a 16% average perplexity increase while preserving downstream QA performance, demonstrating a path toward symbolic transparency in neural models.
Probabilistic Time Series Forecasting with Transformers
This Hugging Face blog post introduces probabilistic time series forecasting using Transformer-based models available in the Hugging Face ecosystem. It covers the application of attention-based architectures to sequential prediction tasks with uncertainty quantification. The post serves as a tutorial and capability demonstration for time series modeling within the Transformers library.
The Reformer - Pushing the limits of language modeling
This Hugging Face blog post covers the Reformer, a memory-efficient transformer architecture that uses locality-sensitive hashing (LSH) attention and reversible residual layers to handle very long sequences. The post explains the technical mechanisms that allow Reformer to process sequences up to 1 million tokens with significantly reduced memory footprint compared to standard transformers. It serves as an educational deep-dive into the architectural innovations introduced in the original Reformer paper by Kitaev et al.
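The hashing step behind LSH attention can be sketched as follows. It uses the angular scheme from the Reformer paper: project a vector onto random directions, concatenate the projections with their negations, and take the argmax as the bucket id. This is a minimal plain-Python illustration with made-up dimensions and vectors, not the library's implementation.

```python
import random

def lsh_bucket(vector, projections):
    """Angular LSH: project, concatenate with the negation, take the argmax."""
    projected = [sum(x * r for x, r in zip(vector, row)) for row in projections]
    scores = projected + [-p for p in projected]
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
dim, n_buckets = 4, 8  # hypothetical sizes; n_buckets must be even
# n_buckets // 2 random Gaussian directions, shared by every vector.
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(n_buckets // 2)]

a = [1.0, 2.0, 0.5, -1.0]
b = [2.0, 4.0, 1.0, -2.0]    # same direction as a, twice the length
c = [-1.0, -2.0, -0.5, 1.0]  # opposite direction to a

print(lsh_bucket(a, projections), lsh_bucket(b, projections), lsh_bucket(c, projections))
```

Vectors that point the same way always share a bucket, and nearby directions usually do, so attention can be restricted to tokens in the same bucket instead of comparing every pair. This restriction is where the memory savings on long sequences come from.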
