Why Video Agent Models Are Next — Ethan He, xAI Grok Imagine
Latent Space interviews Ethan He, the lead behind xAI's Grok Imagine video generation product, covering the product's roughly three-month development. The discussion explores the distinction between video generation models and world models and positions video agents as a significant near-term frontier. Ethan He argues that Grok Imagine is underrated relative to its capabilities.
Related events (8)
Grok Imagine 1.0 Sharply Cuts Costs for High-Quality Video Generation
xAI launched Grok Imagine 1.0, a text-and-image-to-video model that topped the Artificial Analysis Video Arena leaderboard in both the text-to-video and image-to-video categories at launch. The model generates clips of up to 15 seconds with audio at $4.20 per minute of output, significantly undercutting Google Veo 3.1 ($12/min) and OpenAI Sora 2 Pro ($30/min). It is integrated with the X social network, enabling direct generation and sharing, though xAI disclosed no technical details about the model's architecture. The launch highlights continued rapid cost compression in video generation, with a roughly sevenfold price gap between Grok Imagine 1.0 and Sora 2 Pro.
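As a quick check on the pricing figures above, the following Python sketch computes the cost of a 15-second clip and each model's price relative to Grok Imagine 1.0. It assumes billing is strictly proportional to output duration, which the summary does not confirm. The function names `clip_cost` and `price_ratio` are illustrative only, and applying the 15-second length to the other two models is purely for comparison.

```python
# Per-minute prices quoted in the summary (USD per minute of output).
PRICE_PER_MINUTE = {
    "Grok Imagine 1.0": 4.20,
    "Google Veo 3.1": 12.00,
    "OpenAI Sora 2 Pro": 30.00,
}

# Maximum clip length quoted for Grok Imagine 1.0; used for all models
# here only to make the comparison like-for-like.
CLIP_SECONDS = 15


def clip_cost(price_per_minute: float, seconds: float) -> float:
    """Cost of one clip, ASSUMING linear pro-rated billing by duration."""
    return price_per_minute * seconds / 60.0


def price_ratio(model: str, baseline: str = "Grok Imagine 1.0") -> float:
    """How many times more expensive `model` is than `baseline` per minute."""
    return PRICE_PER_MINUTE[model] / PRICE_PER_MINUTE[baseline]


if __name__ == "__main__":
    for name, price in PRICE_PER_MINUTE.items():
        cost = clip_cost(price, CLIP_SECONDS)
        ratio = price_ratio(name)
        print(f"{name}: ${cost:.2f} per {CLIP_SECONDS}s clip, {ratio:.2f}x baseline")
```

Under that assumption, a 15-second clip costs about $1.05 on Grok Imagine 1.0, and the Sora 2 Pro ratio comes out to about 7.1, consistent with the roughly sevenfold gap cited.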
A Dive into Text-to-Video Models
A Hugging Face blog post providing an overview of text-to-video generation models as of mid-2023. The post surveys the landscape of approaches, architectures, and key models in the emerging text-to-video space. As a tier-2 commentary piece, it synthesizes existing work rather than presenting novel research.
OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling
Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).
Video generation models as world simulators
OpenAI introduces Sora, a large-scale text-conditional video diffusion model built on a transformer architecture that operates on spacetime patches of video and image latent codes. The model is trained jointly on videos and images of variable durations, resolutions, and aspect ratios. Sora can generate up to one minute of high-fidelity video, and OpenAI frames scaling video generation as a path toward general-purpose physical world simulators.
[AINews] ImageGen is on the Path to AGI
Latent Space commentary piece reflecting on the continued explosion of GPT-Image-2 usage and its broader implications for AI capabilities. The piece frames recent image generation advances as significant steps on a trajectory toward AGI. Published as part of the AINews series, this is a tier-2 commentary source synthesizing recent developments around GPT-Image-2.
Genie 3: A new frontier for world models
DeepMind has announced Genie 3, a world model capable of generating interactive, navigable 3D environments in real time at 24 fps and 720p resolution. The system maintains consistency for several minutes, representing a significant step up from prior Genie iterations. This positions Genie 3 as a frontier capability demonstration in generative world modeling for interactive applications.
GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain
Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. The model claims top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro, and is available under the MIT license via Hugging Face. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work (frontend most, then backend, then infrastructure, with research least), with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.
OpenAI: Generative Models Overview (2016)
A 2016 OpenAI blog post describing four research projects centered on generative models as a branch of unsupervised learning. The post explains what generative models are, their importance, and potential future directions. This is an archival piece predating modern large language models and diffusion systems, representing early foundational work at OpenAI.


