7 · arXiv cs.LG (Machine Learning) · 19d ago

RayDer: Scalable Self-Supervised Novel View Synthesis via Unified Feed-Forward Transformer

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone for self-supervised novel view synthesis (NVS). By treating dynamic content as a nuisance factor absorbed by a minimal dynamic state, it enables stable training on unconstrained real-world video without requiring dynamic-scene reconstruction. The model exhibits clean power-law scaling with both data and compute across multiple model sizes, and achieves zero-shot open-set performance competitive with supervised state-of-the-art methods on multiple benchmarks.
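The power-law scaling claim can be made concrete with a small sketch. The example below fits loss ≈ a · C^(−b) by linear regression in log-log space. The data points, the constants 50.0 and 0.08, and the function name `fit_power_law` are hypothetical illustrations and are not taken from the RayDer paper.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) by least squares in log-log space.

    Returns the tuple (a, b). A power law is a straight line in log-log
    coordinates: log(loss) = log(a) - b * log(compute).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Hypothetical, synthetic measurements (NOT results from the paper).
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training compute
loss = 50.0 * compute ** -0.08                # exact power law, no noise

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On real measurements the points would scatter around the fitted line, and the size of the residuals is one rough indication of how clean the scaling is.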

Training Infrastructure · Frontier Model Releases · Multimodal Progress · feed-forward transformer · power-law scaling · CompVis · Self-Supervised Learning · RayDer · novel view synthesis

Related guides (3)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Related events (8)

5 · arXiv · cs.LG · 1mo ago

RefDecoder: Reference-Conditioned Video VAE Decoder for Enhanced Visual Generation

RefDecoder addresses an architectural asymmetry in latent diffusion models: denoising networks are heavily conditioned, but decoders remain unconditional, which causes detail loss and inconsistency. The approach injects high-fidelity reference image signals into the VAE decoding process via reference attention, with a lightweight image encoder mapping reference frames into high-dimensional tokens that are co-processed at each decoder up-sampling stage. Evaluated on the Inter4K, WebVid, and Large Motion benchmarks, RefDecoder achieves up to +2.1 dB PSNR over unconditional baselines and improves VBench I2V scores across subject consistency, background consistency, and overall quality. The module is plug-and-play and compatible with existing video generation systems, including Wan 2.1 and VideoVAE+, without additional fine-tuning.
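To put the reported PSNR gain in perspective, here is a minimal sketch of the standard PSNR definition, PSNR = 10 · log10(MAX² / MSE), together with the mean-squared-error reduction implied by a given dB gain. The function names are hypothetical, and an 8-bit peak value of 255 is an assumption; the source does not state the evaluation settings.

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped arrays."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)

# A +2.1 dB gain corresponds to dividing the MSE by this factor.
ratio = mse_ratio_for_gain(2.1)
print(f"MSE reduction factor: {ratio:.3f}")
```

Because PSNR is logarithmic, a gain of about 2 dB corresponds to roughly a 1.6× lower mean squared error, whatever the absolute PSNR level.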

Inference Economics · Multimodal Progress · VBench · RefDecoder · Inter4K · +4 more

5 · arXiv · cs.AI · 19d ago

TunerDiT: Training-free Progressive Steering of Diffusion Transformers for Multi-Event Video Generation

TunerDiT is a training-free method for steering video diffusion transformers (DiTs) to generate long-horizon videos containing multiple sequential events. The approach identifies intrinsic turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained detail, then applies two steering mechanisms: Event-Partitioned Masking and Cross-Event Prompt Fusion. The authors also introduce Meve, a benchmark prompt suite for multi-event video generation, and report state-of-the-art results across 8 metrics, along with improved scaling of text alignment with event count.

Evaluation and Benchmarking · Inference Economics · Meve · TunerDiT · Event-Partitioned Masking · +3 more

5 · GitHub Trending · 1mo ago

NVlabs/Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

NVIDIA Labs has released Sana, an open-source image synthesis system using a Linear Diffusion Transformer architecture designed for efficient high-resolution image generation. The repository has accumulated 6,261 stars with 472 added in a single day, indicating strong community interest. The project targets improved computational efficiency in diffusion-based image synthesis, a key challenge for scaling to higher resolutions.

Inference Economics · Multimodal Progress · NVIDIA Labs · Linear Diffusion Transformer · Sana

5 · arXiv · cs.AI · 19d ago

Lumos-Nexus: Efficient Frequency Bridging for Reasoning-Driven Video Generation

Lumos-Nexus is a training-efficient unified video generation framework that decouples training and inference to achieve high visual fidelity without prohibitive compute costs. During training, a lightweight generator is aligned with an understanding block; at inference, Unified Progressive Frequency Bridging (UPFB) hands off generation to a high-capacity pretrained generator in a shared latent space for coarse-to-fine refinement. The authors also introduce VR-Bench, a new benchmark for evaluating reasoning-driven video generation. Code and models are publicly released.

Evaluation and Benchmarking · Inference Economics · Lumos-Nexus · VBench · VR-Bench · +3 more

9 · OpenAI Blog · 1mo ago

Video generation models as world simulators

OpenAI introduces Sora, a large-scale text-conditional video diffusion model built on a transformer architecture that operates on spacetime patches of video and image latent codes. The model is trained jointly on videos and images of variable durations, resolutions, and aspect ratios. Sora can generate up to one minute of high-fidelity video, and OpenAI frames the scaling of video generation models as a path toward general-purpose simulators of the physical world.

Training Infrastructure · Frontier Model Releases · Linear Diffusion Transformer · spacetime patch · OpenAI · +2 more

5 · Hugging Face Blog · 1mo ago

Introducing RWKV - An RNN with the advantages of a transformer

Hugging Face introduces RWKV, a recurrent neural network architecture designed to combine the parallelizable training of transformers with the efficient linear-time inference of RNNs. According to the post, the model avoids the quadratic attention bottleneck of standard transformers while maintaining competitive performance. RWKV represents an alternative architectural direction to the dominant transformer paradigm for language modeling.

Frontier Model Releases · Open Weights Progress · Transformers · Recurrent Neural Network · Hugging Face · +2 more

5 · Hugging Face Blog · 1mo ago

Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation

This Hugging Face blog post details a workflow for fine-tuning NVIDIA's Cosmos Predict 2.5 world model using LoRA and DoRA parameter-efficient techniques for robot video generation tasks. The post covers practical implementation steps for adapting the foundation video model to robotics-specific domains. This represents a concrete application of world models to embodied AI, where synthetic video generation can support robot training data pipelines.

Inference Economics · Agent and Tool Ecosystem · DoRA · LoRA · NVIDIA · +3 more

6 · arXiv · cs.CL · 17d ago

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.
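One common way to read a compute advantage off scaling-law fits is sketched below. It assumes both architectures follow loss ≈ a · C^(−b) with a shared exponent b, so the advantage is the ratio of compute each needs to reach the same loss. The summary does not describe the preprint's actual fitting procedure, so this interpretation, the function name `compute_advantage`, and all coefficients are assumptions for illustration.

```python
def compute_advantage(a_base, a_new, b):
    """Ratio C_base / C_new of compute needed to reach the same loss.

    Assumes loss = a * C**(-b) for both models with the same exponent b.
    Setting a_base * C_base**(-b) == a_new * C_new**(-b) and solving gives
    C_base / C_new == (a_base / a_new) ** (1 / b).
    """
    return (a_base / a_new) ** (1.0 / b)

# Hypothetical coefficients, chosen so that the ratio comes out at 1.33.
b = 0.05
a_base = 10.0
a_new = a_base * 1.33 ** (-b)

advantage = compute_advantage(a_base, a_new, b)
print(f"compute advantage: {advantage:.2f}x")
```

Under this reading, a 1.33× advantage means the baseline needs about 33% more training compute to match the variant's loss. Because b is small, a modest shift in the coefficient a translates into a sizable compute multiplier.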

Training Infrastructure · Frontier Model Releases · Triton · Mamba · Gated DeltaNet-2 · +1 more