6arXiv cs.AI (Artificial Intelligence)·2d ago

OneCanvas achieves state-of-the-art 3D scene understanding via panoramic reprojection in VLMs

OneCanvas is a new method for 3D scene understanding in Vision-Language Models that aggregates multi-view patch features onto a single equirectangular panoramic canvas using depth and camera pose, avoiding complex geometry encoders or large training budgets. A 3D position embedding restores metric depth information lost during angular projection, and a spatial pretraining curriculum generates on-the-fly supervision for spatial reasoning tasks. The approach achieves state-of-the-art results on SQA3D and VSI-Bench benchmarks while using an order of magnitude less training compute than competing methods, and supports situated reasoning relevant to robotics and embodied AI.

Evaluation and Benchmarking Multimodal Progress SPBench VSI-Bench OneCanvas SQA3D

Related guides (2)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·1mo ago·source ↗

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

Evaluation and Benchmarking Agent and Tool Ecosystem Multimodal Large Language Models Thinking-with-Images on-policy self-distillation +4 more

6arXiv · cs.AI·8d ago·source ↗

SpatialClaw: Code-as-action interface for agentic 3D/4D spatial reasoning with VLMs

SpatialClaw is a training-free framework that uses code execution as the action interface for vision-language model agents performing spatial reasoning tasks. The system maintains a stateful Python kernel with perception and geometry primitives, allowing the VLM to write iterative executable cells conditioned on prior outputs rather than committing to a full strategy upfront. Evaluated across 20 spatial reasoning benchmarks covering static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the prior state-of-the-art spatial agent by +11.2 points across six VLM backbones.

Evaluation and Benchmarking Agent and Tool Ecosystem SpatialClaw +1 more

6arXiv · cs.CL·2d ago·source ↗

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

Inference Economics Agent and Tool Ecosystem OmniAgent Qwen2.5-VL-72B LVBench +4 more

7arXiv · cs.CL·22d ago·source ↗

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.

Frontier Model Releases Evaluation and Benchmarking Qwen-VLA DOMINO R2R +10 more

6Google Deepmind Blog·1mo ago·source ↗

D4RT: DeepMind's Unified 4D Reconstruction and Tracking System, Up to 300x Faster

DeepMind has announced D4RT, a system for unified four-dimensional (spatial + temporal) scene reconstruction and tracking. The method claims up to 300x speed improvements over prior approaches. The announcement positions D4RT as a significant efficiency advance in dynamic 3D scene understanding, with potential applications in robotics, video understanding, and embodied AI.

Agent and Tool Ecosystem Multimodal Progress DeepMind 4D reconstruction D4RT +1 more

5Hugging Face Blog·1mo ago·source ↗

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress Inference Economics Vision-Language Models Hugging Face +1 more

4arXiv · cs.AI·16d ago·source ↗

BabyCL: Continual multimodal learning from egocentric child video in a single chronological pass

Researchers introduce BabyCL, a continual learning framework that processes the SAYCam egocentric child video dataset in a single chronological pass rather than shuffled multi-epoch training, more closely mimicking how children actually encounter their environment. The system combines streaming visual representation learning with image-text contrastive objectives, a multi-stage temporal segmentation, and a dual replay buffer managing visual and multimodal histories. BabyCL outperforms streaming baselines on the SAYCam Labeled-S 4AFC benchmark under matched compute budgets, substantially closing the gap to offline training upper bounds. The work advances understanding of whether neural networks can acquire word-referent mappings under biologically plausible training conditions.

Evaluation and Benchmarking Multimodal Progress SAYCam BabyCL SAYCam Labeled-S 4AFC

6arXiv · cs.AI·26d ago·source ↗

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

SpaceNum is a new evaluation framework probing whether Vision-Language Models genuinely ground numerical outputs (coordinates, action magnitudes) in spatial perception, rather than relying on shallow cues. The benchmark defines two bidirectional tasks—Num2Space and Space2Num—across dynamic and static spatial settings. Results show current VLMs perform near random chance on spatial numerical grounding, with explicit reasoning providing only marginal improvement and fine-tuning offering partial gains.

Evaluation and Benchmarking Agent and Tool Ecosystem SpaceNum Vision-Language Models Space2Num +2 more