5arXiv cs.AI (Artificial Intelligence)·19d ago

Lumos-Nexus: Efficient Frequency Bridging for Reasoning-Driven Video Generation

Lumos-Nexus is a training-efficient unified video generation framework that decouples training and inference to achieve high visual fidelity without prohibitive compute costs. During training, a lightweight generator is aligned with an understanding block; at inference, Unified Progressive Frequency Bridging (UPFB) hands off generation to a high-capacity pretrained generator in a shared latent space for coarse-to-fine refinement. The authors also introduce VR-Bench, a new benchmark for evaluating reasoning-driven video generation. Code and models are publicly released.

Evaluation and Benchmarking Inference Economics Multimodal Progress Lumos-Nexus VBench VR-Bench Jiazheng Xing Unified Progressive Frequency Bridging

Related guides (3)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Inference EconomicsTopic guide

Inference Economics: The Cost of Running AI in Production

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

9Openai Blog·1mo ago·source ↗

Video generation models as world simulators

OpenAI introduces Sora, a large-scale text-conditional video diffusion model built on a transformer architecture that operates on spacetime patches of video and image latent codes. The model is trained jointly on videos and images of variable durations, resolutions, and aspect ratios. Sora can generate up to one minute of high-fidelity video and OpenAI frames scaling video generation as a path toward general-purpose physical world simulators.

Training Infrastructure Frontier Model Releases Linear Diffusion Transformer spacetime patch OpenAI +2 more

5Hugging Face Blog·1mo ago·source ↗

Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation

This Hugging Face blog post details a workflow for fine-tuning NVIDIA's Cosmos Predict 2.5 world model using LoRA and DoRA parameter-efficient techniques for robot video generation tasks. The post covers practical implementation steps for adapting the foundation video model to robotics-specific domains. This represents a concrete application of world models to embodied AI, where synthetic video generation can support robot training data pipelines.

Inference Economics Agent and Tool Ecosystem DoRA LoRA NVIDIA +3 more

4Hugging Face Blog·1mo ago·source ↗

Build Awesome Datasets for Video Generation

Hugging Face published a blog post on constructing high-quality datasets for video generation models. The post likely covers data collection, preprocessing, and curation pipelines relevant to training video diffusion or generation systems. This is a practical tooling and methodology guide aimed at practitioners working on video AI.

Agent and Tool Ecosystem Multimodal Progress Hugging Face video generation

6arXiv · cs.AI·12d ago·source ↗

MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding

MemDreamer is a plug-and-play framework that decouples perception and reasoning for long-video understanding by incrementally building a three-tier Hierarchical Graph Memory capturing spatiotemporal and causal relations. During inference, a reasoning model uses an Observation-Reason-Action loop with agentic tool-augmented retrieval to navigate the memory graph, constraining the context window to 2% of full-context ingestion while achieving a 12.5-point absolute accuracy gain. The system reaches SOTA on four benchmarks, narrowing the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, proposing agentic capability scaling as a new paradigm for multimodal comprehension.

Long Context Evolution Agent and Tool Ecosystem MemDreamer Hierarchical Graph Memory Observation-Reason-Action +1 more

7arXiv · cs.LG·19d ago·source ↗

RayDer: Scalable Self-Supervised Novel View Synthesis via Unified Feed-Forward Transformer

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone for self-supervised novel view synthesis (NVS). By treating dynamic content as a nuisance factor absorbed by a minimal dynamic state, it enables stable training on unconstrained real-world video without requiring dynamic-scene reconstruction. The model exhibits clean power-law scaling with both data and compute across multiple model sizes, and achieves zero-shot open-set performance competitive with supervised state-of-the-art methods on multiple benchmarks.

Training Infrastructure Frontier Model Releases feed-forward transformer power-law scaling CompVis +4 more

5Hugging Face Blog·1mo ago·source ↗

State of open video generation models in Diffusers

Hugging Face published a survey of open-source video generation models integrated into the Diffusers library as of January 2025. The post covers the current landscape of available open video generation models, their capabilities, and how they are supported within the Diffusers ecosystem. This serves as a reference for practitioners looking to use or compare open-weights video generation models.

Open Weights Progress Agent and Tool Ecosystem Hugging Face video generation Diffusers +1 more

6arXiv · cs.CL·25d ago·source ↗

STORM: Internalized Spatial-Temporal Reasoning for Video-Language Models via Latent Trajectories

STORMS is a two-stage training framework that teaches large vision-language models to perform spatial-temporal video reasoning through bounded continuous latent trajectories rather than explicit textual chain-of-thought, keyframe selection, or external tool use. In Stage I, latent tokens are aligned with thought-video representations derived from generated videos; in Stage II, answer-only supervision internalizes the reasoning process. At inference time, no video regeneration or frame reinsertion is required, reducing latency and engineering complexity. Evaluations on VideoMME, MVBench, TempCompass, and MMVU show improved accuracy with substantially lower inference overhead versus tool-based pipelines.

Inference Economics Agent and Tool Ecosystem MVBench STORMS TempCompass +5 more

4arXiv · cs.AI·12d ago·source ↗

Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework

A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.

Long Context Evolution Multimodal Progress Watch, Remember, Reason: Human-View Video Understanding with MLLMs