The Batch (DeepLearning.AI) · 1mo ago

6D-Pose Anchor-based Category-level Keypoint-tracker (6-PACK): Deep Learning for 6D Object Tracking in Robotics

A model called 6-PACK uses video from a depth-sensing camera to track an object's 6D pose, that is, its position and orientation in 3D space, extending object tracking beyond standard 2D approaches. The system is designed for robotics applications in which understanding how objects move through physical space is critical. The Batch highlights it as an advance in perception for robotic manipulation and interaction.

Agent and Tool Ecosystem · DeepLearning.AI · The Batch · 6-PACK
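To make "six dimensions" concrete: a 6D pose is commonly stored as a 3x3 rotation plus a 3-vector translation, and a tracker can keep it current by composing small frame-to-frame changes. The sketch below illustrates only that bookkeeping. It is not 6-PACK's keypoint method, and the helper names (`rot_z`, `compose`, `transform`) and the example motion are assumptions made for illustration.

```python
import math

def rot_z(theta):
    """Rotation by theta radians about the z axis, as a 3x3 nested list."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose(delta, pose):
    """Apply an inter-frame change `delta` on top of `pose`; both are (R, t)."""
    dR, dt = delta
    R, t = pose
    new_t = [a + b for a, b in zip(mat_vec(dR, t), dt)]
    return mat_mul(dR, R), new_t

def transform(pose, point):
    """Map a point from the object frame into the camera frame."""
    R, t = pose
    return [a + b for a, b in zip(mat_vec(R, point), t)]

# Hypothetical motion: each frame the object turns 90 degrees about z
# and shifts 0.1 m along x (values chosen only for the example).
identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
delta = (rot_z(math.pi / 2), [0.1, 0.0, 0.0])

pose = identity
for _ in range(2):  # two frames of tracking
    pose = compose(delta, pose)

corner = transform(pose, [1.0, 0.0, 0.0])  # a point fixed on the object
```

Three numbers describe the translation and three more (for example, rotation angles) describe the orientation, which is where the "6D" comes from.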

Related guides (1)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Related events (8)

Google DeepMind Blog · 1mo ago

D4RT: DeepMind's Unified 4D Reconstruction and Tracking System, Up to 300x Faster

DeepMind has announced D4RT, a system that unifies four-dimensional (three spatial dimensions plus time) scene reconstruction and tracking. DeepMind claims speedups of up to 300x over prior approaches. The announcement positions D4RT as a significant efficiency advance in dynamic 3D scene understanding, with potential applications in robotics, video understanding, and embodied AI.

Agent and Tool Ecosystem · Multimodal Progress · DeepMind · 4D reconstruction · D4RT

arXiv · cs.LG · 23d ago

DynaFLIP: Dynamics-Aware Multimodal Pre-Training for Robot Manipulation Perception

DynaFLIP is a pre-training framework that injects motion understanding into visual encoders for robot manipulation by constructing image-language-3D flow triplets from human and robot videos. The method encourages tri-modal alignment via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. The resulting dynamics-aware visual backbone consistently outperforms baselines across diverse downstream policies including VLAs, with gains up to +22.5% in out-of-distribution scenarios. The work argues that robot generalization requires encoding how the world changes under action, not just static scene content.

Frontier Model Releases · Agent and Tool Ecosystem · Vision-Language-Action models · simplex-volume minimization · DynaFLIP
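One way to read "simplex-volume minimization" is geometric: three unit-norm embeddings (image, language, 3D flow) span a simplex with the origin, and its volume shrinks to zero as the three vectors align. The sketch below computes that volume from the Gram determinant. It is an assumption about the general idea, not DynaFLIP's actual loss; the summary does not say which simplex the paper uses or how the term is weighted, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length, i.e. project it onto the hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simplex_volume(a, b, c):
    """Volume of the simplex with vertices at the origin and at three
    unit-normalized embeddings (assumed geometry, see lead-in)."""
    a, b, c = normalize(a), normalize(b), normalize(c)
    ab, ac, bc = dot(a, b), dot(a, c), dot(b, c)
    # Determinant of the 3x3 Gram matrix [[1, ab, ac], [ab, 1, bc], [ac, bc, 1]].
    gram_det = 1.0 + 2.0 * ab * ac * bc - ab * ab - ac * ac - bc * bc
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(gram_det, 0.0)) / 6.0

# Mutually orthogonal embeddings give the largest volume for unit vectors;
# identical embeddings collapse the simplex to zero volume.
misaligned = simplex_volume([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0])
aligned = simplex_volume([1, 2, 3, 4], [1, 2, 3, 4], [2, 4, 6, 8])
```

Because the Gram determinant depends only on pairwise dot products, the same function works for embeddings of any dimension. Minimizing it pulls the three modalities toward a common direction.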

arXiv · cs.AI · 12d ago

Pose-ICL: 3D-aware in-context learning for pose-controllable image generation of custom subjects

Researchers introduce Pose-ICL, a tuning-free framework for generating images of user-specified subjects with accurate pose control. The method uses Surface-Anchored Position Embedding (SAPE) to give 2D diffusion models explicit 3D awareness by anchoring image tokens to volumetric bounding box surface coordinates. Evaluations on 3D assets and real-world subjects show improvements over existing methods in both pose accuracy and identity consistency. The framework is designed for compatibility with existing Diffusion Transformer (DiT) models.

Multimodal Progress · Surface-Anchored Position Embedding · Pose-ICL · Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

arXiv · cs.CL · 3d ago

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

Inference Economics · Agent and Tool Ecosystem · OmniAgent · Qwen2.5-VL-72B · LVBench

arXiv · cs.AI · 18d ago

Humanoid-GPT: GPT-style Transformer trained on 2B-frame motion corpus for zero-shot humanoid control

Researchers introduce Humanoid-GPT, a causal Transformer pre-trained on a 2-billion-frame retargeted motion corpus that unifies major mocap datasets with large-scale in-house recordings for whole-body humanoid control. The model achieves zero-shot generalization to unseen motions and control tasks, overcoming the agility-generalization trade-off seen in prior MLP-based trackers. Scaling analyses demonstrate a new performance frontier for dynamic motion tracking without task-specific fine-tuning.

Frontier Model Releases · Humanoid-GPT · Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Berkeley AI Research (BAIR) Blog · 1mo ago

PEVA: Whole-Body Conditioned Egocentric Video Prediction for Embodied World Models

Researchers from BAIR introduce PEVA (Predicting Ego-centric Video from human Actions), a model that generates first-person video frames conditioned on 48-dimensional whole-body kinematic pose trajectories. The model uses an autoregressive conditional diffusion transformer trained on the Nymeria dataset, which pairs real-world egocentric video with body pose capture. PEVA can generate atomic action videos, simulate counterfactuals, and support long video generation, representing a step toward world models grounded in physically embodied human agents.

Agent and Tool Ecosystem · Multimodal Progress · PEVA · Conditional Diffusion Transformer · Berkeley AI Research (BAIR)

arXiv · cs.AI · 3d ago

OneCanvas achieves state-of-the-art 3D scene understanding via panoramic reprojection in VLMs

OneCanvas is a new method for 3D scene understanding in Vision-Language Models that aggregates multi-view patch features onto a single equirectangular panoramic canvas using depth and camera pose, avoiding complex geometry encoders or large training budgets. A 3D position embedding restores metric depth information lost during angular projection, and a spatial pretraining curriculum generates on-the-fly supervision for spatial reasoning tasks. The approach achieves state-of-the-art results on SQA3D and VSI-Bench benchmarks while using an order of magnitude less training compute than competing methods, and supports situated reasoning relevant to robotics and embodied AI.

Evaluation and Benchmarking · Multimodal Progress · SPBench · VSI-Bench · OneCanvas
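The remark about metric depth being lost during angular projection can be seen in a standard equirectangular mapping, sketched below. A 3D point is reduced to a longitude and latitude, so two points along the same ray land on the same pixel and only the separately stored range tells them apart. The axis convention (x right, y up, z forward), the canvas size, and the function name are assumptions for illustration, not details from the OneCanvas paper.

```python
import math

def project_equirect(point, width, height):
    """Project a 3D point (x right, y up, z forward; assumed convention)
    onto an equirectangular canvas. Returns (u, v, range)."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the origin has no viewing direction")
    lon = math.atan2(x, z)   # longitude in [-pi, pi], 0 = straight ahead
    lat = math.asin(y / r)   # latitude in [-pi/2, pi/2], 0 = horizon
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v, r

# Two points on the same ray, one twice as far away as the other.
near = project_equirect((0.0, 0.0, 1.0), 1024, 512)
far = project_equirect((0.0, 0.0, 2.0), 1024, 512)
```

Here `near` and `far` share the same `(u, v)` pixel and differ only in the returned range. That range is the information a 3D position embedding would need to restore.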

arXiv · cs.AI · 24d ago

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation (CoP)

Researchers introduce Center-of-Pressure (CoP), a tactile representation grounded in physical principles designed to bridge the sim-to-real gap in contact-rich dexterous manipulation. CoP preserves dense contact information while remaining robust for sim-to-real transfer, supported by a differentiable-dynamics-based sensor calibration scheme that estimates taxel orientations without ground-truth force measurements. Evaluated on peg-in-hole insertion and ball balancing tasks, CoP-conditioned policies achieve zero-shot sim-to-real transfer on a multi-fingered robotic hand, outperforming binary-contact and raw-taxel baselines. An emergent finding is that CoP-conditioned policies implicitly encode task-relevant physical properties such as object mass.

Evaluation and Benchmarking · Agent and Tool Ecosystem · multi-fingered dexterous hand · Center-of-Pressure (CoP) tactile representation · ball balancing
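For readers unfamiliar with the term, a center of pressure is conventionally the force-weighted average of the contact locations. The sketch below computes it for a flat taxel pad, assuming 2D taxel coordinates and non-negative normal forces. It shows the textbook definition only; the paper's exact formulation and calibration scheme are not given in the summary, and `center_of_pressure` is a hypothetical helper.

```python
def center_of_pressure(positions, forces):
    """Force-weighted mean of taxel positions on a flat pad.

    positions: list of (x, y) taxel coordinates (assumed 2D layout)
    forces:    list of non-negative normal forces, one per taxel
    Returns None when nothing is in contact.
    """
    total = sum(forces)
    if total <= 0.0:
        return None
    return tuple(
        sum(f * p[axis] for p, f in zip(positions, forces)) / total
        for axis in range(2)
    )

# Two taxels 1 unit apart; the right one carries three times the load.
pads = [(0.0, 0.0), (1.0, 0.0)]
cop = center_of_pressure(pads, [1.0, 3.0])
no_contact = center_of_pressure(pads, [0.0, 0.0])
```

A binary-contact signal would report both taxels as simply "touching", while the center of pressure shifts toward the more heavily loaded one. That shift is the kind of dense contact information the summary refers to.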