4 · arXiv cs.CL (Computation and Language) · 34h ago

Energy-based transformers as unified predictors of reading difficulty in computational psycholinguistics

A new arXiv preprint introduces energy-based transformer measures as predictors of human reading difficulty and evaluates them on three reading-time corpora: Natural Stories, UCL eye-tracking, and UCL self-paced reading. The energy measure outperforms surprisal alone and appears to subsume the effects of both surprisal and attention entropy, suggesting it could serve as a single unified predictor. The work connects transformer language models to the literature on Hopfield networks and dense associative memory, and is presented as the first application of energy-based transformer measures in computational psycholinguistics.
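For readers unfamiliar with the three quantities named in this summary, here is a minimal sketch of how each is commonly computed. The summary does not give the preprint's exact energy definition, so `lse_energy` below uses the log-sum-exp form familiar from the modern Hopfield network literature as a stand-in; all function names and the `beta` parameter are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def surprisal_bits(probs, token_id):
    """Surprisal of the observed token in bits: -log2 p(token | context)."""
    return float(-np.log2(probs[token_id]))

def attention_entropy(weights):
    """Shannon entropy (in nats) of a single attention distribution."""
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]  # treat 0 * log(0) as 0
    return float(-(w * np.log(w)).sum())

def lse_energy(query, keys, beta=1.0):
    """Hopfield-style energy: -(1/beta) * log sum_i exp(beta * q . k_i).

    Assumption: this is the textbook log-sum-exp energy, used here only
    as a stand-in for whatever measure the preprint defines.
    """
    scores = beta * (np.asarray(keys, dtype=float) @ np.asarray(query, dtype=float))
    m = scores.max()  # subtract the max for numerical stability
    return float(-(m + np.log(np.exp(scores - m).sum())) / beta)
```

Under this sketch, a query that matches some stored key strongly gets low energy, while a query that matches nothing well gets high energy, which is the intuition behind treating energy as a difficulty signal.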

Evaluation and Benchmarking · Natural Stories · Energy-Based Transformers as Predictors of Reading Difficulty · Hopfield Networks

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.CL · 1mo ago

Conditional Scale Entropy: A Wavelet-Derived Tool for Mechanistic Interpretability of Metaphor Processing in Transformers

This paper introduces Conditional Scale Entropy (CSE), a wavelet-derived measure of how transformer computation engages across frequency scales at each layer, and applies it to study metaphor processing in decoder-only language models. The authors prove CSE is invariant to update magnitude, isolating structural computation patterns from intensity. Across architectures ranging from GPT-2 (124M) to LLaMA-2 7B and GPT-oss 20B, metaphorical tokens consistently produce higher spectral breadth than literal tokens in early-to-mid layers, with the effect surviving permutation correction and specificity controls. The work establishes multi-scale coordination as a consistent mechanistic signature of metaphorical language processing and positions CSE as a general interpretability tool for cross-depth structure in transformers.

Evaluation and Benchmarking · AI Safety Research · Conditional Scale Entropy · mechanistic interpretability · GPT-2 · +3 more

5 · arXiv · cs.CL · 34h ago

Roofline-inspired scaling model predicts Transformer fine-tuning energy consumption across GPU configurations

A new arXiv preprint presents a framework for modeling energy consumption during Transformer training on multiple GPUs, using BERT architectural sweeps to relate measured energy to proxies for compute, memory traffic, and hardware efficiency. The approach adapts roofline modeling with a speedup-based hardware-efficiency factor that accounts for tensor parallelism and fully sharded data parallelism. The resulting scaling law accurately predicts training energy across heterogeneous configurations, targeting sustainable and cost-aware system design.
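As background, the classic roofline bound that this work adapts can be sketched in a few lines. This is the standard time-domain roofline, not the preprint's energy model; the function names and the example hardware numbers are hypothetical.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def roofline_time_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Lower bound on runtime: the slower of the compute and memory limits."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)

# Hypothetical workload on hypothetical hardware (1 TFLOP/s, 100 GB/s).
bound = roofline_time_s(flops=1e12, bytes_moved=1e9,
                        peak_flops_per_s=1e12, peak_bytes_per_s=1e11)
```

A workload is compute-bound when the first term dominates and memory-bound when the second does; the summary indicates the preprint extends this idea with a hardware-efficiency factor for multi-GPU parallelism.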

Training Infrastructure · Inference Economics · The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model · BERT

4 · arXiv · cs.CL · 8d ago

Transformer embeddings shown to intrinsically encode Russell's circumplex model of emotion geometry

A new arXiv paper investigates whether Transformer-based text and speech encoders (RoBERTa, wav2vec 2.0) recover the geometric structure of Russell's circumplex model of affect — a valence-arousal topology from psychology. Experiments on naturalistic datasets (MSP-Podcast) and LLM-generated stimuli show that multimodal fusion achieves perfect topological alignment with Russell's primary emotion ordering, and zero-shot generic text embeddings place fine-grained emotion terms near their human-mapped coordinates. The authors argue this structure is intrinsically encoded in the representations rather than being an artifact of labeling, bridging psychological theory and representation learning.

Evaluation and Benchmarking · Multimodal Progress · Data-Driven Decoding of Russell's Circumplex Model of Affect · RoBERTa · MSP-Podcast · +1 more

6 · arXiv · cs.CL · 20d ago

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.

Training Infrastructure · Frontier Model Releases · Triton · Mamba · Gated DeltaNet-2 · +1 more

4 · Hugging Face Blog · 1mo ago

Generating Human-level Text with Contrastive Search in Transformers

Hugging Face introduces contrastive search, a decoding strategy for autoregressive language models that aims to produce more coherent and human-like text than standard methods such as beam search or nucleus sampling. The technique balances the model's confidence in its next-token prediction against a contrastive penalty that discourages repetitive or degenerate outputs. The blog post describes the integration of contrastive search into the Hugging Face Transformers library, making it accessible to practitioners.
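The trade-off between confidence and the contrastive penalty can be illustrated with a single decoding step. The sketch below assumes the usual formulation, in which each top-k candidate is scored as (1 - alpha) times its probability minus alpha times its maximum cosine similarity to the hidden states of the tokens generated so far. The function name and array layout are hypothetical; this is not the library's implementation.

```python
import numpy as np

def contrastive_step(cand_probs, cand_hidden, ctx_hidden, alpha=0.6):
    """Pick one token among top-k candidates (illustrative sketch).

    cand_probs:  (k,)   model probability of each candidate
    cand_hidden: (k, d) hidden state each candidate would produce
    ctx_hidden:  (t, d) hidden states of the tokens generated so far
    """
    cand = cand_hidden / np.linalg.norm(cand_hidden, axis=1, keepdims=True)
    ctx = ctx_hidden / np.linalg.norm(ctx_hidden, axis=1, keepdims=True)
    # Degeneration penalty: highest cosine similarity to any earlier token.
    penalty = (cand @ ctx.T).max(axis=1)
    scores = (1.0 - alpha) * np.asarray(cand_probs) - alpha * penalty
    return int(np.argmax(scores)), scores
```

With `alpha=0` the step reduces to greedy selection among the candidates; larger values increasingly favor candidates whose representations differ from the existing context.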

Frontier Model Releases · Agent and Tool Ecosystem · Contrastive Search · Hugging Face Transformers · Hugging Face

6 · arXiv · cs.LG · 6d ago

Program synthesis used to reverse-engineer transformer attention heads with executable Python surrogates

Researchers propose a pipeline that approximates transformer attention heads with executable Python programs generated by a language model, then re-ranked by held-out predictive accuracy. Applied to GPT-2, TinyLlama-1.1B, and Llama-3B, fewer than 1,000 programs reproduce attention patterns with >75% average IoU similarity on TinyStories. Replacing 25% of attention heads with programmatic surrogates incurs only a 16% average perplexity increase while preserving downstream QA performance, demonstrating a path toward symbolic transparency in neural models.
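To make the IoU figure concrete, here is one plausible way to compare a real attention pattern with a surrogate's output. The summary does not say how the paper binarizes attention, so the fixed `threshold` below is an assumption, as is the function name.

```python
import numpy as np

def attention_iou(attn_a, attn_b, threshold=0.1):
    """Intersection-over-union of two thresholded attention maps.

    Assumption: a position counts as "attended" when its weight is at
    least `threshold`; the paper may define the masks differently.
    """
    a = np.asarray(attn_a) >= threshold
    b = np.asarray(attn_b) >= threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both maps are empty, so they agree trivially
    return float(np.logical_and(a, b).sum() / union)
```

An IoU of 1.0 means the surrogate attends to exactly the same positions as the original head under the chosen threshold, and 0.0 means no overlap at all.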

Evaluation and Benchmarking · AI Safety Research · Llama 3.2 · GPT-2 · Explaining Attention with Program Synthesis · +2 more

4 · Hugging Face Blog · 1mo ago

Probabilistic Time Series Forecasting with Transformers

This Hugging Face blog post introduces probabilistic time series forecasting using Transformer-based models available in the Hugging Face ecosystem. It covers the application of attention-based architectures to sequential prediction tasks with uncertainty quantification. The post serves as a tutorial and capability demonstration for time series modeling within the Transformers library.

Agent and Tool Ecosystem · Probabilistic Time Series Forecasting · Hugging Face Transformers · Hugging Face

4 · Hugging Face Blog · 1mo ago

The Reformer - Pushing the limits of language modeling

This Hugging Face blog post covers the Reformer, a memory-efficient transformer architecture that uses locality-sensitive hashing (LSH) attention and reversible residual layers to handle very long sequences. The post explains the mechanisms that allow the Reformer to process sequences of up to 1 million tokens with a significantly smaller memory footprint than standard transformers. It serves as an educational deep dive into the architectural innovations introduced in the original Reformer paper by Kitaev et al.
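The bucketing idea behind LSH attention can be sketched with angular hashing via a random projection: each vector is assigned to the bucket given by the argmax over the concatenation of its projection and the negated projection, so vectors pointing in similar directions tend to share a bucket. This is a simplified single-round illustration with hypothetical names, not the Reformer implementation.

```python
import numpy as np

def lsh_buckets(vectors, n_buckets, seed=0):
    """Assign each row of `vectors` to one of `n_buckets` angular buckets.

    n_buckets must be even. A fixed seed keeps the random projection
    identical across calls so bucket ids are comparable.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((vectors.shape[1], n_buckets // 2))
    rotated = vectors @ proj
    # argmax over [xR ; -xR] depends only on the direction of x.
    return np.argmax(np.concatenate([rotated, -rotated], axis=1), axis=1)
```

Attention is then restricted to positions that fall in the same bucket, which is what avoids computing the full quadratic attention matrix.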

Training Infrastructure · Long Context Evolution · Nikita Kitaev · Hugging Face · Reformer · +2 more