6 · Google DeepMind Blog · 1mo ago

D4RT: DeepMind's Unified 4D Reconstruction and Tracking System, Up to 300x Faster

DeepMind has announced D4RT, a system for unified four-dimensional scene reconstruction and tracking, covering three spatial dimensions plus time. DeepMind reports speedups of up to 300x over prior approaches. The announcement positions D4RT as a significant efficiency advance in dynamic 3D scene understanding, with potential applications in robotics, video understanding, and embodied AI.

Agent and Tool Ecosystem · Multimodal Progress · DeepMind · 4D reconstruction · D4RT · 3D tracking

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

3 · The Batch · 1mo ago

6D-Pose Anchor-based Category-level Keypoint-tracker (6-PACK): Deep Learning for 6D Object Tracking in Robotics

A model called 6-PACK uses video from a depth-sensing camera to track an object's 6D pose (its 3D position and 3D orientation), extending AI object tracking beyond standard 2D approaches. The system is designed for robotics applications where understanding how objects move through physical space is critical. The Batch highlights this as a capability advance in perception for robotic manipulation and interaction.

Agent and Tool Ecosystem · DeepLearning.AI · The Batch · 6-PACK
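
To make the "6D" in the 6-PACK summary concrete: a pose can be written as a rotation (three degrees of freedom) plus a translation (three more), and applied to an object's keypoints. The sketch below is only an illustration of that representation, not 6-PACK's implementation; `rotation_z` and `apply_pose` are hypothetical helper names, and the rotation is restricted to the z axis to keep the example short.

```python
import math


def rotation_z(theta):
    """3x3 rotation matrix (list of rows) for a turn of theta radians about z."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_pose(rotation, translation, point):
    """Apply a rigid pose to a 3D point: rotate first, then translate."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical keypoint on an object, moved by a quarter turn and a 1 m lift.
keypoint = (1.0, 0.0, 0.0)
moved = apply_pose(rotation_z(math.pi / 2), (0.0, 0.0, 1.0), keypoint)
print(moved)  # approximately (0, 1, 1), up to floating-point error
```

A tracker that follows several such keypoints from frame to frame can, in principle, recover the rotation and translation that best explain their motion.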

7 · Google DeepMind Blog · 1mo ago

DeepMind Discovers New Solutions to Century-Old Fluid Dynamics Problems

DeepMind has published a new AI-driven method for solving long-standing problems in fluid dynamics, targeting challenges that have remained open for over a century. The approach is positioned as a general framework for leveraging AI techniques to advance mathematics, physics, and engineering. This follows DeepMind's broader research program applying machine learning to fundamental scientific problems, including prior work on protein folding and mathematical reasoning.

Evaluation and Benchmarking · fluid dynamics · DeepMind

6 · arXiv · cs.AI · 2d ago

OneCanvas achieves state-of-the-art 3D scene understanding via panoramic reprojection in VLMs

OneCanvas is a new method for 3D scene understanding in Vision-Language Models that aggregates multi-view patch features onto a single equirectangular panoramic canvas using depth and camera pose, avoiding complex geometry encoders or large training budgets. A 3D position embedding restores metric depth information lost during angular projection, and a spatial pretraining curriculum generates on-the-fly supervision for spatial reasoning tasks. The approach achieves state-of-the-art results on SQA3D and VSI-Bench benchmarks while using an order of magnitude less training compute than competing methods, and supports situated reasoning relevant to robotics and embodied AI.

Evaluation and Benchmarking · Multimodal Progress · SPBench · VSI-Bench · OneCanvas · +1 more
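
The OneCanvas summary mentions that projecting onto an equirectangular canvas loses metric depth. A minimal sketch of a standard equirectangular projection shows why: only the direction of a 3D point determines its pixel, so the distance has to be carried separately. This is a generic projection under an assumed axis convention (x right, y up, z forward), not the paper's code; `project_to_equirect` is a hypothetical name.

```python
import math


def project_to_equirect(point, width, height):
    """Map a 3D point in the panorama-centre frame to (u, v, r).

    u, v are pixel coordinates on a width x height equirectangular canvas;
    r is the distance from the centre, which the pixel position alone
    does not encode.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point coincides with the panorama centre")
    lon = math.atan2(x, z)                        # azimuth, 0 = forward
    lat = math.asin(max(-1.0, min(1.0, y / r)))   # elevation, clamped for safety
    u = ((lon / (2.0 * math.pi) + 0.5) * width) % width
    v = (0.5 - lat / math.pi) * height
    return u, v, r


# Two points along the same ray land on the same pixel; only r differs.
near = project_to_equirect((0.0, 0.0, 2.0), 1024, 512)
far = project_to_equirect((0.0, 0.0, 5.0), 1024, 512)
print(near, far)
```

In this sketch the returned `r` stands in for the information a 3D position embedding would need to restore.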

5 · Google DeepMind Blog · 1mo ago

Using AI to perceive the universe in greater depth

DeepMind published a blog post on an AI system applied to astronomical or cosmological observation, aimed at improving its depth or quality. The post comes from a Tier 1 source (the DeepMind blog), but only the title was available for this summary. Based on the title, the work likely involves a model or technique for processing telescope or sensor data to extract richer scientific information.

Agent and Tool Ecosystem · Google DeepMind

8 · Google DeepMind Blog · 1mo ago

Genie 3: A new frontier for world models

DeepMind has announced Genie 3, a world model capable of generating interactive, navigable 3D environments in real time at 24 fps and 720p resolution. The system maintains consistency for several minutes, representing a significant step up from prior Genie iterations. This positions Genie 3 as a frontier capability demonstration in generative world modeling for interactive applications.

Frontier Model Releases · Agent and Tool Ecosystem · Genie 3 · Google DeepMind · +1 more

7 · Meta AI Blog · 1mo ago

SAM 3.1: Meta Releases Faster Real-Time Video Segmentation Model with Object Multiplexing

Meta has released SAM 3.1, an incremental update to Segment Anything Model 3, introducing object multiplexing that allows tracking up to 16 objects in a single forward pass. This doubles video processing throughput from 16 to 32 FPS on a single H100 GPU, reducing GPU resource requirements and enabling real-time tracking on smaller hardware. SAM 3.1 is a drop-in replacement for SAM 3 and is available via updated model checkpoints and codebase. The broader SAM 3 release also includes text and exemplar prompting, a new Segment Anything Playground, the SA-Co evaluation dataset, and SAM 3D for 3D reconstruction.

Evaluation and Benchmarking · Inference Economics · SA-Co · Segment Anything Playground · Conservation X Labs · +8 more

9 · DeepSeek News · 1mo ago

DeepSeek V4 Preview Release: 1.6T-param Pro and 284B Flash Models with 1M Context, Open-Sourced

DeepSeek has released DeepSeek-V4 as an open-weights preview, comprising two MoE variants: V4-Pro (1.6T total / 49B active parameters) and V4-Flash (284B total / 13B active parameters). Both models support a 1M-token context by default, enabled by a novel architecture that combines token-wise compression with DeepSeek Sparse Attention (DSA). DeepSeek claims that V4-Pro is the open-source state of the art on agentic coding benchmarks and that its math, STEM, and coding performance rivals top closed-source models, while V4-Flash offers near-parity reasoning at lower cost and latency. The API went live at release with OpenAI and Anthropic compatibility, and legacy model endpoints will be retired in July 2026.

Long Context Evolution · Frontier Model Releases · DeepSeek V4 · DeepSeek-V4-Flash · Claude Code · +7 more
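
The total and active parameter counts quoted in the DeepSeek summary imply how sparse each MoE variant is. The short calculation below uses only those four quoted figures; `active_fraction` is a hypothetical helper, and the result says nothing about actual cost or latency, which depend on factors the summary does not give.

```python
def active_fraction(active_params, total_params):
    """Share of a model's parameters that are active, per the quoted counts."""
    return active_params / total_params


# (active, total) parameter counts as quoted in the summary above.
models = {
    "V4-Pro": (49e9, 1.6e12),
    "V4-Flash": (13e9, 284e9),
}

for name, (active, total) in models.items():
    print(f"{name}: {active_fraction(active, total):.1%} of parameters active")
# V4-Pro: 3.1% of parameters active
# V4-Flash: 4.6% of parameters active
```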

6 · arXiv · cs.AI · 11d ago

AHA-WAM: Asynchronous world-action modeling with temporal decoupling for robot manipulation

AHA-WAM introduces a dual Diffusion Transformer architecture that decouples world prediction (low-frequency) from action execution (high-frequency) in robot manipulation policies, addressing the inefficiency of existing world-action models that force both branches to operate at the same temporal resolution. The system uses a rolling key-value memory video DiT as a long-horizon scene planner and a fast action DiT that queries layerwise latent context via joint attention, with Observation-Guided Video-Context Routing enabling asynchronous execution. On RoboTwin benchmarks, AHA-WAM achieves 92.80% average success and 78.3% on real-world tasks at 24.17 Hz, a 4.59x speedup over Fast-WAM, without robot-data pretraining.

Inference Economics · RoboTwin · Linear Diffusion Transformer · Observation-Guided Video-Context Routing · +2 more