4arXiv cs.AI (Artificial Intelligence)·16h ago

OrbitForge: Text-to-3D scene generation via reconstruction-anchored video synthesis using Gaussian Splatting

OrbitForge is a new method for converting text-generated videos into 3D Gaussian Splatting scenes without task-specific fine-tuning or score-distillation optimization. The approach uses a frozen video diffusion model as a prior, performs an initial 3D reconstruction via Deformable Gaussian Splatting, detects missing viewpoints from a prescribed orbit, and completes only those views before final reconstruction. On a 300-prompt T3Bench-derived audit, OrbitForge achieves a 359-degree median orbit span and substantially improves coverage quality over a MedianGS-only baseline. The work also argues for coverage-aware evaluation metrics in text-to-3D tasks.
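The coverage step described above (find which viewpoints on a prescribed orbit are missing, then report an orbit span) can be illustrated with a small sketch. This is not the paper's implementation: the summary does not define "orbit span", so the code assumes it is 360 degrees minus the largest empty arc between covered camera azimuths, and the names `orbit_span_deg`, `missing_views`, `n_targets`, and `tol_deg` are hypothetical.

```python
def circ_dist_deg(a, b):
    """Shortest angular distance between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def largest_gap_deg(azimuths_deg):
    """Largest empty arc between consecutive covered azimuths on the circle."""
    if not azimuths_deg:
        return 360.0
    a = sorted(x % 360.0 for x in azimuths_deg)
    gaps = [hi - lo for lo, hi in zip(a, a[1:])]
    gaps.append(360.0 - a[-1] + a[0])  # wrap-around arc from last back to first
    return max(gaps)


def orbit_span_deg(azimuths_deg):
    """ASSUMED definition: 360 degrees minus the largest uncovered arc."""
    return 360.0 - largest_gap_deg(azimuths_deg)


def missing_views(covered_deg, n_targets=36, tol_deg=5.0):
    """Target azimuths on an evenly spaced orbit with no covered view nearby."""
    targets = [i * 360.0 / n_targets for i in range(n_targets)]
    return [
        t for t in targets
        if all(circ_dist_deg(t, c) > tol_deg for c in covered_deg)
    ]


# Example: a generated video that only sweeps the front half of the orbit.
covered = [float(x) for x in range(0, 181, 10)]
span = orbit_span_deg(covered)
todo = missing_views(covered)
```

In this example the video covers azimuths 0 through 180 degrees, so under the assumed definition the span is 180 degrees and the back-half targets (190 through 350 degrees) are the only views that would need completing.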

Multimodal Progress T3Bench 3D Gaussian Splatting VideoMV OrbitForge Deformable Gaussian Splatting

Related guides (1)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

4Hugging Face Blog·1mo ago·source ↗

Introduction to 3D Gaussian Splatting

A Hugging Face blog post introduces 3D Gaussian Splatting, a technique for real-time novel view synthesis and 3D scene reconstruction. The method represents scenes as collections of 3D Gaussians rather than implicit neural fields, enabling fast rendering. The post serves as an educational overview of the technique's mechanics and applications.

Multimodal Progress 3D Gaussian Splatting Hugging Face NeRF

5arXiv · cs.AI·16h ago·source ↗

FLUX3D: Diffusion-aligned sparse representation for high-fidelity image-to-3D Gaussian Splatting

Researchers introduce FLUX3D, an image-to-3D Gaussian Splatting framework that addresses two structural bottlenecks in sparse voxel-based 3D generation: a representation bottleneck from discriminative 2D features and a cross-modal correspondence bottleneck in diffusion transformers. The system introduces Diffusion-Aligned Structured Latents (DA-SLAT) and a Sparse-structure Multimodal Diffusion Transformer (SMDiT) with Modal-Aware Rotary Positional Embedding (MARoPE) to improve 2D-3D alignment. The authors claim substantial improvements in appearance fidelity over current state-of-the-art methods for 3DGS asset generation on their benchmarks.

Multimodal Progress Sparse-structure Multimodal Diffusion Transformer FLUX3D Diffusion-Aligned Structured Latents +1 more

9OpenAI Blog·1mo ago·source ↗

Video generation models as world simulators

OpenAI introduces Sora, a large-scale text-conditional video diffusion model built on a transformer architecture that operates on spacetime patches of video and image latent codes. The model is trained jointly on videos and images of variable durations, resolutions, and aspect ratios. Sora can generate up to one minute of high-fidelity video, and OpenAI frames scaling video generation as a path toward general-purpose physical world simulators.

Training Infrastructure Frontier Model Releases Linear Diffusion Transformer spacetime patch OpenAI +2 more

5arXiv · cs.AI·23d ago·source ↗

TunerDiT: Training-free Progressive Steering of Diffusion Transformers for Multi-Event Video Generation

TunerDiT is a training-free method for steering video diffusion transformers (DiTs) to generate long-horizon videos containing multiple sequential events. The approach identifies intrinsic turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained detail, then applies two steering mechanisms: Event-Partitioned Masking and Cross-Event Prompt Fusion. The authors also introduce Meve, a benchmark prompt suite for multi-event video generation, and report state-of-the-art results across 8 metrics, with text-alignment gains that scale with event count.

Evaluation and Benchmarking Inference Economics Meve TunerDiT Event-Partitioned Masking +3 more

5arXiv · cs.CL·40h ago·source ↗

ORBIT: Training-free multi-attribute behavioral steering via orthogonal subspace rotation

Researchers introduce ORBIT (Orthogonal Rotation-Based Intervention Technique), a training-free activation steering method that simultaneously controls multiple behavioral attributes in language models. The approach constructs a joint subspace from per-attribute steering planes via SVD and applies a single norm-preserving rotation, avoiding the norm imbalance and directional cancellation problems of naive vector summation. The authors also release TraitFactory, a new multi-attribute behavioral benchmark, and evaluate across Llama-3.2-3B, Qwen-2.5-7B, and Llama-3.1-8B. ORBIT outperforms existing training-free baselines on multi-attribute steering while better preserving output coherence.
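The contrast drawn above between a norm-preserving rotation and additive steering can be seen in a toy example. This is only an illustration of that one property, not ORBIT itself: it rotates within a single hand-picked plane and does not build the SVD-based joint subspace, and `rotate_in_plane`, `h`, `u`, and `v` are hypothetical names.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def rotate_in_plane(h, u, v, theta):
    """Rotate h by theta radians inside the plane spanned by orthonormal u, v.

    Components of h outside the plane are left untouched, so the overall
    norm of h is preserved.
    """
    a, b = dot(h, u), dot(h, v)                      # in-plane coordinates
    a2 = a * math.cos(theta) - b * math.sin(theta)   # standard 2D rotation
    b2 = a * math.sin(theta) + b * math.cos(theta)
    return [hi + (a2 - a) * ui + (b2 - b) * vi for hi, ui, vi in zip(h, u, v)]


h = [3.0, 0.0, 4.0]   # stand-in for a hidden activation
u = [1.0, 0.0, 0.0]   # orthonormal basis of an assumed steering plane
v = [0.0, 1.0, 0.0]

rotated = rotate_in_plane(h, u, v, math.pi / 2)    # steer by rotating toward v
added = [hi + 2.0 * vi for hi, vi in zip(h, v)]    # steer by adding 2 * v
```

Here `rotated` keeps the original norm of 5 while `added` grows to about 5.39, which is the kind of norm drift the summary says plain vector addition can introduce.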

Evaluation and Benchmarking Alignment and RLHF TraitFactory Llama 3.2 ORBIT +3 more

7The Batch·22d ago·source ↗

Grok Imagine 1.0 Sharply Cuts Costs for High-Quality Video Generation

xAI launched Grok Imagine 1.0, a text-and-image-to-video model that topped the Artificial Analysis Video Arena leaderboard in both text-to-video and image-to-video categories at launch. The model generates up to 15-second clips with audio at $4.20 per minute of output, significantly undercutting Google Veo 3.1 ($12/min) and OpenAI Sora 2 Pro ($30/min). It is integrated with the X social network, enabling direct generation and sharing, though xAI disclosed no technical details about the model's architecture. The launch highlights continued rapid cost compression in video generation, with a roughly seven-fold price gap ($30 vs. $4.20 per minute) between Sora 2 Pro and Grok Imagine 1.0.
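The price comparison can be checked with a few lines of arithmetic. The per-minute prices below are the ones quoted in the summary; the helper name `price_ratios` is hypothetical.

```python
# Per-minute output prices in USD, as quoted in the summary above.
prices_per_min = {
    "Grok Imagine 1.0": 4.20,
    "Google Veo 3.1": 12.00,
    "OpenAI Sora 2 Pro": 30.00,
}


def price_ratios(prices, baseline):
    """Each model's price expressed as a multiple of the baseline model's price."""
    base = prices[baseline]
    return {name: price / base for name, price in prices.items()}


ratios = price_ratios(prices_per_min, "Grok Imagine 1.0")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Sora 2 Pro comes out at about 7.14 times the Grok Imagine 1.0 price and Veo 3.1 at about 2.86 times, consistent with the "seven-fold" framing.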

Frontier Model Releases Evaluation and Benchmarking Artificial Analysis Grok Imagine Google Veo 3.1 +10 more

5arXiv · cs.AI·1mo ago·source ↗

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT is a new neural architecture that implicitly models continuous 3D geometry from unposed multi-view images without requiring explicit pointmap regression. It learns a continuous neural scene representation in a canonical coordinate system, supporting SDF-based surface queries and color prediction via lightweight decoders. The model is trained with multi-dataset joint optimization using 2D supervision and 3D geometric regularization, achieving strong generalization across mesh reconstruction, novel view synthesis, depth/normal estimation, and camera pose estimation tasks.

Frontier Model Releases Multimodal Progress IVGT Signed Distance Function (SDF) Neural Radiance Field (NeRF) +1 more

4Hugging Face Blog·1mo ago·source ↗

A Dive into Text-to-Video Models

A Hugging Face blog post provides an overview of text-to-video generation models as of mid-2023. The post surveys the landscape of approaches, architectures, and key models in the emerging text-to-video space. As a tier-2 commentary piece, it synthesizes existing work rather than presenting novel research.

Multimodal Progress text-to-video generation Hugging Face