Entity · technique

Rotary Position Embedding (RoPE)

technique · active · rotary-position-embedding-rope--5f483984 · 5 events · first seen May 19, 2026

Aliases: Rotary Position Embedding (RoPE), RoPE (Rotary Position Embedding), 4D Rotary Position Embedding, Rotary Position Embeddings, Rotary Position Embedding

More like this (12)

Modal-Aware Rotary Positional Embedding · 2D-RoPE · 3D-RoPE · RoPE · ST-RoPE · Möbius RoPE · Surface-Anchored Position Embedding · Positional Encoding · PivotRL · How Data Shapes RoPE Frequency Usage: From Positional Scale Matching to Length Generalization · context-rot · RLOO

Recent events (5)

6 · arXiv · cs.CL · Jul 20, 2026

2D-RoPE positional encoding enables frontier-length exact copying in Transformers

A new arXiv preprint demonstrates that frontier LLMs systematically fail at exact string copying within their context windows, attributing the failure to 1D positional encodings that encourage shortcut matching of local contexts rather than precise position retrieval. The authors introduce 2D-RoPE, which assigns each token a row and column ID in a 2D grid, making copying equivalent to a fixed column-offset lookup. Shallow Transformers with 2D-RoPE achieve perfect copying at lengths hundreds of times beyond the training distribution, and the advantage persists in large-scale pretraining on DCLM at up to 1.4B parameters. The result challenges assumptions about what frontier models can reliably do and proposes a concrete architectural modification to positional encoding.

Long Context Evolution · Evaluation and Benchmarking · Frontier Language Models Struggle to Copy: Text Can Be Better Viewed in 2D · Rotary Position Embedding (RoPE) · DCLM · +1 more
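
To make the row-and-column idea in this summary concrete, here is a minimal sketch of a generic axial 2D rotary scheme: half of each vector is rotated by the row ID and the other half by the column ID. The row-major layout in `grid_ids`, the `WIDTH` value, and all function names are assumptions made for illustration; the preprint's actual ID assignment and rotation layout may differ.

```python
import math

def rotate_pairs(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c])
    return out

def grid_ids(token_index, width):
    """Hypothetical row-major layout: token t sits at (t // width, t % width)."""
    return divmod(token_index, width)

def rope_2d(vec, row, col):
    """Axial 2D rotation: the first half encodes the row, the second half the column."""
    half = len(vec) // 2
    return rotate_pairs(vec[:half], row) + rotate_pairs(vec[half:], col)

def score(q, q_index, k, k_index, width):
    """Attention logit between a query and a key after 2D rotation."""
    q_rot = rope_2d(q, *grid_ids(q_index, width))
    k_rot = rope_2d(k, *grid_ids(k_index, width))
    return sum(a * b for a, b in zip(q_rot, k_rot))

WIDTH = 16  # assumed grid width, chosen only for this example
q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [1.0, 0.4, -0.6, 0.9, -0.3, 0.8, 0.5, -1.0]

# Both pairs are one row apart in the same column, at different absolute positions.
score_early = score(q, 20, k, 4, WIDTH)     # (1, 4) attending to (0, 4)
score_late = score(q, 180, k, 164, WIDTH)   # (11, 4) attending to (10, 4)
```

Under these assumptions the two scores agree up to floating-point error, because the logit depends only on the row and column offsets between the two tokens and not on where in the sequence they sit.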

6 · arXiv · cs.LG · Jul 9, 2026

Data-driven theory explains RoPE frequency usage and long-context generalization in transformers

A new arXiv preprint proposes a data-centered explanation for why trained transformers use RoPE positional frequencies non-uniformly: frequencies are selected to match the relative-distance dependency structure of training data, with optimal frequency scaling as 1/W for dependency width W. The paper formalizes a field-resolution tradeoff and connects this frequency-matching principle to position-interpolation-based length generalization, showing that test-time frequency scaling succeeds when longer-context dependencies are approximate dilations of training-time dependencies. Empirical results demonstrate that natural language exhibits approximate self-similarity across positional scales, providing a mechanistic account of why context-length extrapolation methods work when they do.

Long Context Evolution · Evaluation and Benchmarking · Rotary Position Embedding (RoPE) · How Data Shapes RoPE Frequency Usage: From Positional Scale Matching to Length Generalization
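
The test-time frequency scaling mentioned in this summary can be illustrated with a small sketch of linear position interpolation, where positions are multiplied by the ratio of training length to test length before the rotary angles are computed. The function names, the base of 10000, and the lengths below are assumptions for the example and are not taken from the preprint.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """Per-pair rotary frequencies: base^(-2j/dim) for j = 0 .. dim/2 - 1."""
    return [base ** (-2 * j / dim) for j in range(dim // 2)]

def rope_angles(position, dim, scale=1.0, base=10000.0):
    """Rotation angles for one position; scale < 1 compresses positions."""
    return [position * scale * f for f in rope_frequencies(dim, base)]

TRAIN_LEN, TEST_LEN, DIM = 2048, 8192, 8   # illustrative values only
scale = TRAIN_LEN / TEST_LEN               # 0.25

# With interpolation, position 4000 at test time produces the same angles
# that position 1000 produced during training.
angles_scaled = rope_angles(4000, DIM, scale=scale)
angles_train = rope_angles(1000, DIM)

# The last test position stays inside the angle range seen in training.
last_test = rope_angles(TEST_LEN - 1, DIM, scale=scale)
train_limit = rope_angles(TRAIN_LEN, DIM)
stays_in_range = all(a <= b for a, b in zip(last_test, train_limit))
```

Scaling positions by `scale` is the same as multiplying every frequency by `scale`, which is why the summary's condition matters: the trick can only help if dependencies at the longer length look like stretched copies of the ones seen in training.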

6 · The Batch · Jun 2, 2026

Apple's AToken: A Unified Multimodal Tokenizer and Encoder for Images, Videos, and 3D Objects

Apple researchers introduced AToken, a transformer model with a single 4D tokenizer and encoder-decoder architecture that handles images, videos, and 3D objects in a shared token space. The model is trained to both reconstruct and classify all three media types, using a pretrained SigLIP2 vision encoder extended to four dimensions with 4D Rotary Position Embedding. AToken approaches or matches specialized models on image classification (82.2% ImageNet), image reconstruction (0.21 rFID), and 3D reconstruction (28.28 PSNR), while remaining competitive on video tasks. The work addresses a longstanding tension between generation-focused and classification-focused encoders by forcing embeddings to retain both fine visual detail and semantic content.

Frontier Model Releases · Multimodal Progress · FLUX.1-dev · Rotary Position Embedding (RoPE) · Jiasen Lu · +8 more

6 · arXiv · cs.LG · Jun 1, 2026

Positional vs. Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

Researchers train a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks to study how attention heads specialize into positional or symbolic roles during learning. They find that successful task learning correlates with the emergence of 'pure' heads—exclusively positional or symbolic—and provide theoretical constructions showing how single-layer RoPE-based attention realizes these functions geometrically. A novel 'discrepancy' metric formalizes the robustness difference between the two head types, with symbolic mechanisms shown to extrapolate more reliably to longer sequences than positional ones. The findings have implications for understanding length generalization failures in RoPE-based models.

Long Context Evolution · Evaluation and Benchmarking · Transformers · multi-hop reasoning · Rotary Position Embedding (RoPE) · +5 more

4 · Hugging Face Blog · May 19, 2026

You Could Have Designed State of the Art Positional Encoding

A Hugging Face blog post walks through the design space of positional encoding for transformer models, building intuition for why modern schemes like RoPE emerged. The post takes a pedagogical approach, showing how one could derive state-of-the-art positional encoding from first principles. It covers the evolution from absolute to relative positional encodings and the properties that make certain schemes preferable for long-context generalization.

Long Context Evolution · Transformers · Rotary Position Embedding (RoPE) · Positional Encoding · +1 more
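
As a companion to this walkthrough of positional encoding design, the following is a minimal sketch of standard 1D RoPE in plain Python. It rotates each consecutive pair of vector components by an angle proportional to the token position and then checks the relative-position property. The function names, the base of 10000, and the sample vectors are assumptions for illustration and are not taken from the blog post.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, dim, 2):
        # Pair j = i / 2 uses frequency base^(-2j/dim) = base^(-i/dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

The two scores match up to floating-point error: although each vector is rotated by its absolute position, the attention logit ends up depending only on the distance between query and key, which is the property that distinguishes RoPE from purely absolute encodings.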