6arXiv cs.CL (Computation and Language)·22d ago

Do Language Models Track Entities Across State Changes?

This paper investigates the mechanistic basis of entity tracking (ET) in transformer language models under realistic, multi-operation scenarios involving state changes (PUT, REMOVE, MOVE). The authors find that LMs do not incrementally update world states but instead aggregate relevant information in parallel at the final token once a query is apparent. A key finding is that the REMOVE operation is implemented via a fragile global suppression tag, which predicts specific failure modes confirmed behaviorally. The authors propose a mechanistic fix—nullifying this tag—and argue that behavioral and mechanistic analyses can productively inform each other.

Evaluation and Benchmarking AI Safety Research mechanistic interpretability Entity Tracking Global Suppression Tag Transformer Language Models

Related guides (3)

mechanistic interpretabilityConcept

Mechanistic Interpretability: Looking Inside the AI Black Box

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

4arXiv · cs.CL·19d ago·source ↗

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

This paper investigates whether language models can learn the semantics of rare English constructions (e.g., 'let alone', 'much less'), constructing a novel dataset to test form-meaning pairing understanding. Testing models across parameter counts, architectures, and pretraining dataset sizes, the authors find that modestly sized open-source models can grasp Paired-Focus construction semantics, while models trained on human-scale data fail. Training dynamics analysis reveals that semantic understanding of these constructions emerges later than syntactic knowledge and correlates with gains in world knowledge more broadly.

Evaluation and Benchmarking Open Weights Progress Paired-Focus Constructions constructional semantics scalar adjectival semantics +1 more

6arXiv · cs.CL·22d ago·source ↗

BeliefTrack: Benchmarking and Improving Contextual Belief Management in LLMs

This paper introduces Contextual Belief Management (CBM) as a framework for studying how LLMs should update, preserve, or ignore information across long-horizon interactions. The authors release BeliefTrack, a closed-world benchmark with symbolic verifiers enabling exact turn-level evaluation across Rule Discovery and Circuit Diagnosis tasks. Vanilla LLMs show severe CBM failures; reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average, while representation-level steering achieves 46.1% reduction. Probing experiments reveal latent belief-state dynamics underlying these failures.

Evaluation and Benchmarking Agent and Tool Ecosystem reinforcement learning with belief-state rewards Contextual Belief Management (CBM)BeliefTrack +3 more

6arXiv · cs.CL·25d ago·source ↗

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture where semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and pre-registered replication.

Evaluation and Benchmarking AI Safety Research Qwen2.5-7B-Instruct-1M ReAct stealth-divergence +5 more

5arXiv · cs.CL·17d ago·source ↗

Knowledge editing via locate-then-edit transferred to masked diffusion language models, revealing multi-token failure mode

A new arXiv paper investigates whether locate-then-edit knowledge editing methods, developed for autoregressive models, transfer to masked diffusion language models (MDMs) such as LLaDA and Dream. The authors find that causal tracing identifies the same early-to-mid-layer MLP location in both paradigms, but MDMs degrade systematically on multi-token edits due to partially unmasked intermediate states that the edit was never optimized for. A correction targeting these intermediate states substantially restores multi-token editing performance. The work is the first systematic comparison of knowledge editing across autoregressive and diffusion-based language model paradigms.

Evaluation and Benchmarking Open Weights Progress Knowledge Editing in Masked Diffusion Language Models Qwen Llama +2 more

6arXiv · cs.CL·25d ago·source ↗

Language Models Need Sleep: Periodic Context Consolidation via Fast Weights and SSM Blocks

This paper proposes a sleep-like consolidation mechanism for transformer-based LLMs to address the quadratic scaling of attention with context length. During 'sleep' phases, the model performs N offline recurrent passes over accumulated context, updating fast weights in state-space model (SSM) blocks via a learned local rule, then clears the KV cache. The approach is evaluated on synthetic tasks (cellular automata, multi-hop graph retrieval) and math reasoning, where standard transformers and SSM-attention hybrids fail, with performance scaling with sleep duration N.

Long Context Evolution Frontier Model Releases Transformers Key-Value Cache Sleep Consolidation Mechanism +6 more

7arXiv · cs.CL·4d ago·source ↗

Language models linearly encode a 'value axis' tracking expected goal success, study finds

Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.

AI Safety Research Alignment and RLHF The Value Axis: Language Models Encode Whether They're on the Right Track Direct Preference Optimization (DPO)Qwen3-4B

5arXiv · cs.LG·11d ago·source ↗

Local linear structures in LLM weights and activations are dynamic, not fixed global directions

A new arXiv paper investigates the nature of linear structures in transformer weights and activations, finding strong local low-rank task-gradient structure but rejecting the hypothesis that fixed task planes exist. The authors show that useful bases drift substantially within 100 optimization steps, yet early recovery updates form a trajectory-prefix basis capturing 77% of LoRA recovery displacement. They also establish a formal connection between parameter perturbations and activation steering, finding a 0.58 cosine similarity between gradient-step-induced activation shifts and CAA steering vectors, suggesting linear structures are evolving local geometries rather than stable global task directions.

Evaluation and Benchmarking Alignment and RLHF CAA Qwen-0.5B LoRA +4 more

4Hugging Face Blog·1mo ago·source ↗

The Reformer - Pushing the limits of language modeling

This Hugging Face blog post covers the Reformer, a memory-efficient transformer architecture that uses locality-sensitive hashing (LSH) attention and reversible residual layers to handle very long sequences. The post explains the technical mechanisms that allow Reformer to process sequences up to 1 million tokens with significantly reduced memory footprint compared to standard transformers. It serves as an educational deep-dive into the architectural innovations introduced in the original Reformer paper by Kitaev et al.

Training Infrastructure Long Context Evolution Nikita Kitaev Hugging Face Reformer +2 more