Entity · technique

Linear Diffusion Transformer

technique · active · linear-diffusion-transformer-e295a3ca · 8 events · first seen May 18, 2026

Aliases: Linear Diffusion Transformer, Diffusion Transformer, diffusion transformer, Diffusion Transformer (DiT), Diffusion Transformers

More like this (12)

Conditional Diffusion Transformer Differential Transformer flow-matching diffusion transformer Sparse-structure Multimodal Diffusion Transformer latent diffusion Stable Diffusion Turbo Looped Transformer Diffusion Models Linear Diffusers latent diffusion model PTL-Diffusion

Recent events (8)

5 · arXiv · cs.AI · Jul 22, 2026

Appearance Pointers: Modality-Agnostic Regional Control for Diffusion Transformers

Researchers introduce 'appearance pointers,' compact tokens that enable precise regional control in Diffusion Transformers (DiTs) by aligning text or image inputs with user-specified spatial masks. A region correspondence network produces these tokens, refined via spatial aggregation, allowing multi-region guidance without retraining the base model or significantly increasing token load. The authors claim it is the first modality-agnostic interface for localized multimodal control in DiTs, and report that it matches or surpasses modality-specific state-of-the-art methods across the evaluated metrics.

Multimodal Progress · Linear Diffusion Transformer · Appearance Pointers -- Multimodal Region Control of Diffusion Transformers

6 · arXiv · cs.AI · Jun 9, 2026

AHA-WAM: Asynchronous world-action modeling with temporal decoupling for robot manipulation

AHA-WAM introduces a dual Diffusion Transformer architecture that decouples world prediction (low-frequency) from action execution (high-frequency) in robot manipulation policies, addressing the inefficiency of existing world-action models that force both branches to operate at the same temporal resolution. The system uses a video DiT with rolling key-value memory as a long-horizon scene planner and a fast action DiT that queries layerwise latent context via joint attention, with Observation-Guided Video-Context Routing enabling asynchronous execution. AHA-WAM reports 92.80% average success on RoboTwin benchmarks and 78.3% on real-world tasks at 24.17 Hz, a 4.59x speedup over Fast-WAM, without robot-data pretraining.

Inference Economics · RoboTwin · Linear Diffusion Transformer · Observation-Guided Video-Context Routing · +2 more
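The temporal decoupling described in the AHA-WAM summary can be illustrated with a toy scheduling loop: a slow branch refreshes a cached context every few steps, and a fast branch acts on every step using the latest cached context. This is only a sketch under stated assumptions. The names `run_decoupled`, `toy_planner`, and `toy_policy` are hypothetical, the loop is a synchronous interleaving rather than true asynchronous execution, and it does not model the paper's routing mechanism or either DiT.

```python
def run_decoupled(num_steps, plan_every, plan_fn, act_fn):
    """Call plan_fn every `plan_every` steps and act_fn on every step."""
    context = None
    actions = []
    for step in range(num_steps):
        if step % plan_every == 0:
            # Slow branch (stand-in for the world model): refresh cached context.
            context = plan_fn(step)
        # Fast branch (stand-in for the action model): reuse the latest context.
        actions.append(act_fn(step, context))
    return actions


plan_calls = []  # records the steps at which the slow branch ran


def toy_planner(step):
    plan_calls.append(step)
    return {"planned_at": step}


def toy_policy(step, context):
    # Returns the current step and the step at which its context was produced.
    return (step, context["planned_at"])


actions = run_decoupled(num_steps=10, plan_every=4,
                        plan_fn=toy_planner, act_fn=toy_policy)
```

In this toy run the planner is invoked three times for ten actions, which is the kind of saving that running the two branches at different rates is meant to provide.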

5 · arXiv · cs.AI · Jun 1, 2026

TunerDiT: Training-free Progressive Steering of Diffusion Transformers for Multi-Event Video Generation

TunerDiT is a training-free method for steering video diffusion transformers (DiTs) to generate long-horizon videos containing multiple sequential events. The approach identifies intrinsic turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained detail, then applies two steering mechanisms: Event-Partitioned Masking and Cross-Event Prompt Fusion. The authors also introduce Meve, a benchmark prompt suite for multi-event video generation, and report state-of-the-art results across 8 metrics, with text-alignment gains that grow as the number of events increases.

Evaluation and Benchmarking · Inference Economics · Meve · TunerDiT · Event-Partitioned Masking · +3 more

7 · arXiv · cs.CL · May 29, 2026

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.

Frontier Model Releases · Evaluation and Benchmarking · Qwen-VLA · DOMINO · R2R · +10 more

7 · arXiv · cs.LG · May 28, 2026

Ω-QVLA: Training-Free W4A4 Quantization for Full Vision-Language-Action Models Including Diffusion Action Heads

Ω-QVLA is a post-training quantization framework that compresses both the LLM backbone and the diffusion-based action head of VLA models to uniform W4A4 precision without mixed-precision schemes or fine-tuning. It combines composite SVD-Hadamard rotation for weight energy equalization with per-step DiT activation scaling to handle dynamic-range drift across denoising steps. On the LIBERO benchmark, it achieves 98.0% and 87.8% task success on Pi 0.5 and GR00T N1.5 respectively, matching or exceeding FP16 baselines, while reducing static memory footprint by 71.3%. Real-world manipulation experiments are reported to confirm that the approach generalizes beyond simulation.

Inference Economics · Agent and Tool Ecosystem · Pi 0.5 · SVD-Hadamard rotation · LIBERO · +6 more
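To give a feel for why a Hadamard rotation helps low-bit quantization, the sketch below rotates a vector with one dominant outlier using an orthonormal Hadamard matrix. The rotation preserves the vector's norm but spreads the outlier's energy across all coordinates, so the largest magnitude (which sets the quantization scale) shrinks. This shows only a plain Hadamard rotation on a made-up vector. It is not the composite SVD-Hadamard scheme from the Ω-QVLA paper, and the helper names `hadamard` and `matvec` are illustrative.

```python
import math


def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = [[1.0]]
    while len(H) < n:
        # [[H, H], [H, -H]] doubles the size at each iteration.
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    s = 1.0 / math.sqrt(n)  # scaling makes the rows orthonormal
    return [[s * x for x in row] for row in H]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Toy weight/activation vector with one dominant outlier channel (made-up data).
x = [8.0, 0.1, -0.1, 0.1, 0.0, -0.1, 0.1, 0.0]
H = hadamard(8)
x_rot = matvec(H, x)

max_before = max(abs(v) for v in x)
max_after = max(abs(v) for v in x_rot)
norm_before = math.sqrt(sum(v * v for v in x))
norm_after = math.sqrt(sum(v * v for v in x_rot))
```

Because the Sylvester matrix is symmetric and orthonormal, applying `H` a second time recovers the original vector, so the rotation can be undone exactly after quantized computation.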

9 · OpenAI Blog · May 20, 2026

Video generation models as world simulators

OpenAI introduces Sora, a large-scale text-conditional video diffusion model built on a transformer architecture that operates on spacetime patches of video and image latent codes. The model is trained jointly on videos and images of variable durations, resolutions, and aspect ratios. Sora can generate up to one minute of high-fidelity video, and OpenAI frames the scaling of video generation models as a path toward general-purpose simulators of the physical world.

Training Infrastructure · Frontier Model Releases · Linear Diffusion Transformer · spacetime patch · OpenAI · +2 more
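The idea of spacetime patches mentioned in the Sora summary can be sketched as follows: a latent video is cut into small blocks that span a few frames and a small spatial window, and each block is flattened into one transformer token. The code is a minimal illustration on a single-channel nested list. The function name `spacetime_patchify` and the patch sizes are assumptions chosen for the example, not values from OpenAI's report.

```python
def spacetime_patchify(video, pt, ph, pw):
    """Split a [T][H][W] nested list into flattened (pt x ph x pw) patches."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("patch sizes must divide the video dimensions")
    tokens = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                # One token = all values inside one spacetime block.
                tokens.append([
                    video[t][h][w]
                    for t in range(t0, t0 + pt)
                    for h in range(h0, h0 + ph)
                    for w in range(w0, w0 + pw)
                ])
    return tokens


# Toy 4-frame, 4x6 single-channel "latent" whose values encode their position.
video = [[[t * 100 + h * 10 + w for w in range(6)] for h in range(4)]
         for t in range(4)]
tokens = spacetime_patchify(video, pt=2, ph=2, pw=3)
```

The number of tokens depends only on the input's duration and resolution, which is one way a single transformer can consume videos and images of different sizes as sequences of different lengths.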

5 · Hugging Face Blog · May 19, 2026

Memory-efficient Diffusion Transformers with Quanto and Diffusers

This Hugging Face blog post describes integrating the Quanto quantization library with the Diffusers framework to reduce memory requirements for diffusion transformer models. The approach enables running large image/video generation models on consumer-grade hardware by applying int8 and int4 quantization to model weights. The post covers practical implementation details and benchmarks showing memory savings for models like Flux and others in the diffusion transformer family.

Inference Economics · Agent and Tool Ecosystem · Quanto · Linear Diffusion Transformer · Hugging Face · +3 more
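As a rough illustration of where the memory savings from int8 weight quantization come from, the sketch below applies symmetric per-tensor quantization to a handful of 32-bit floats using only the Python standard library. It does not use or reproduce the Quanto or Diffusers APIs. The function names and the weight values are made up for the example, and real libraries typically use finer-grained scales.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~= scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [scale * v for v in q]


# Made-up weights stored as 32-bit floats (typecode "f").
weights = array("f", [0.02, -0.5, 0.31, 1.27, -1.0, 0.0])
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = len(weights) * weights.itemsize  # 4 bytes per weight
int8_bytes = len(q) * q.itemsize              # 1 byte per weight
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight shrinks from four bytes to one, at the cost of a rounding error of at most half a quantization step per weight. Models are often stored in 16-bit formats in practice, in which case the saving from int8 is closer to a factor of two.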

5 · GitHub Trending · May 18, 2026

NVlabs/Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

NVIDIA Labs has released Sana, an open-source image synthesis system using a Linear Diffusion Transformer architecture designed for efficient high-resolution image generation. The repository has accumulated 6,261 stars with 472 added in a single day, indicating strong community interest. The project targets improved computational efficiency in diffusion-based image synthesis, a key challenge for scaling to higher resolutions.

Inference Economics · Multimodal Progress · NVIDIA Labs · Linear Diffusion Transformer · Sana
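The "linear" in Linear Diffusion Transformer refers to linear attention, which avoids building the N x N attention matrix of softmax attention. A minimal sketch, assuming a ReLU feature map: the keys and values are first summarized into a small d x d_v matrix, and each query then reads from that summary, so cost grows linearly with the number of tokens. The function names are illustrative and this is not Sana's actual implementation, which may differ in kernel choice, normalization, and layout.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear_attention(Q, K, V, eps=1e-6):
    """Kernel attention in O(N * d * d_v): out_i = phi(q_i) S / (phi(q_i) . z)."""
    d, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d)]  # sum over j of phi(k_j)^T v_j
    z = [0.0] * d                        # sum over j of phi(k_j)
    for k, v in zip(K, V):
        fk = relu(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = relu(q)
        denom = sum(fq[a] * z[a] for a in range(d)) + eps
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V, eps=1e-6):
    """Same kernel attention via explicit N x N weights, used only as a check."""
    out = []
    for q in Q:
        fq = relu(q)
        w = [sum(a * b for a, b in zip(fq, relu(k))) for k in K]
        denom = sum(w) + eps
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / denom
                    for b in range(len(V[0]))])
    return out


# Tiny made-up example: 2 queries, 3 key/value pairs, d = d_v = 2.
Q = [[1.0, -2.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

Both functions compute the same result; the linear version simply reorders the matrix products so that the per-token work no longer depends on the sequence length, which is what makes high-resolution images (with many tokens) cheaper to process.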