paper

HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

paper · active · provisional · hat-4d-lifting-monocular-video-for-4d-multi-object-interactions-via-human-agent-collaboration-bf1f45a2 · 1 event · first seen 17h ago


Co-occurring entities

MVOIK-4D; HAT-4D

More like this (12)

HAT-4D; MVOIK-4D; human-agent collaborative pipeline; Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy; Collaborative Human-Agent Protocol; multi-view 3D reconstruction; Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking; 4D reconstruction; UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning; multi-agent cooperative framework; Watch, Remember, Reason: Human-View Video Understanding with MLLMs; Human-AI Teaming Through the Lens of Calibration

Recent events (1)

5 · arXiv · cs.AI · 17h ago

HAT-4D: Agentic framework for 4D multi-object interaction reconstruction from monocular video

HAT-4D is an agentic framework that reconstructs the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single monocular video. Its stated goal is scalable data collection for Embodied AI and for training Vision-Language-Action (VLA) models. The system combines vision-language models (VLMs) with a multi-level human-in-the-loop feedback mechanism to resolve depth ambiguities and occlusions without expensive multi-camera rigs. The authors also introduce MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, together with an evaluation protocol focused on physical plausibility and temporal consistency. According to the reported experiments, HAT-4D achieves state-of-the-art performance on most metrics, and data generated with HAT-4D improves downstream model fine-tuning.

Evaluation and Benchmarking; Agent and Tool Ecosystem; HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration; MVOIK-4D; HAT-4D