HAT-4D: Agentic framework for 4D multi-object interaction reconstruction from monocular video
HAT-4D is a new agentic framework that reconstructs 3D geometry, temporal dynamics, and physical interactions of multiple objects from single monocular videos, targeting scalable data collection for Embodied AI and Vision-Language-Action (VLA) model training. The system integrates VLMs with a multi-level human-in-the-loop feedback mechanism to resolve depth ambiguities and occlusions without expensive multi-camera rigs. The authors also introduce MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction with a novel evaluation protocol focused on physical plausibility and temporal consistency. Experiments show state-of-the-art performance on most metrics, and HAT-4D-generated data improves downstream model fine-tuning.
Related guides (2)
Related events (8)
D4RT: DeepMind's Unified 4D Reconstruction and Tracking System, Up to 300x Faster
DeepMind has announced D4RT, a system for unified four-dimensional (spatial + temporal) scene reconstruction and tracking. The method claims up to 300x speed improvements over prior approaches. The announcement positions D4RT as a significant efficiency advance in dynamic 3D scene understanding, with potential applications in robotics, video understanding, and embodied AI.
OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling
Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).
PEVA: Whole-Body Conditioned Egocentric Video Prediction for Embodied World Models
Researchers from BAIR introduce PEVA (Predicting Ego-centric Video from human Actions), a model that generates first-person video frames conditioned on 48-dimensional whole-body kinematic pose trajectories. The model uses an autoregressive conditional diffusion transformer trained on the Nymeria dataset, which pairs real-world egocentric video with body pose capture. PEVA can generate atomic action videos, simulate counterfactuals, and support long video generation, representing a step toward world models grounded in physically embodied human agents.
MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding
MemDreamer is a plug-and-play framework that decouples perception and reasoning for long-video understanding by incrementally building a three-tier Hierarchical Graph Memory capturing spatiotemporal and causal relations. During inference, a reasoning model uses an Observation-Reason-Action loop with agentic tool-augmented retrieval to navigate the memory graph, constraining the context window to 2% of full-context ingestion while achieving a 12.5-point absolute accuracy gain. The system reaches SOTA on four benchmarks, narrowing the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, proposing agentic capability scaling as a new paradigm for multimodal comprehension.
DynaFLIP: Dynamics-Aware Multimodal Pre-Training for Robot Manipulation Perception
DynaFLIP is a pre-training framework that injects motion understanding into visual encoders for robot manipulation by constructing image-language-3D flow triplets from human and robot videos. The method encourages tri-modal alignment via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. The resulting dynamics-aware visual backbone consistently outperforms baselines across diverse downstream policies including VLAs, with gains up to +22.5% in out-of-distribution scenarios. The work argues that robot generalization requires encoding how the world changes under action, not just static scene content.
Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework
A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.
OneCanvas achieves state-of-the-art 3D scene understanding via panoramic reprojection in VLMs
OneCanvas is a new method for 3D scene understanding in Vision-Language Models that aggregates multi-view patch features onto a single equirectangular panoramic canvas using depth and camera pose, avoiding complex geometry encoders or large training budgets. A 3D position embedding restores metric depth information lost during angular projection, and a spatial pretraining curriculum generates on-the-fly supervision for spatial reasoning tasks. The approach achieves state-of-the-art results on SQA3D and VSI-Bench benchmarks while using an order of magnitude less training compute than competing methods, and supports situated reasoning relevant to robotics and embodied AI.
SpatialClaw: Code-as-action interface for agentic 3D/4D spatial reasoning with VLMs
SpatialClaw is a training-free framework that uses code execution as the action interface for vision-language model agents performing spatial reasoning tasks. The system maintains a stateful Python kernel with perception and geometry primitives, allowing the VLM to write iterative executable cells conditioned on prior outputs rather than committing to a full strategy upfront. Evaluated across 20 spatial reasoning benchmarks covering static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the prior state-of-the-art spatial agent by +11.2 points across six VLM backbones.

