6 · arXiv cs.AI (Artificial Intelligence) · 8d ago

SpatialClaw: Code-as-action interface for agentic 3D/4D spatial reasoning with VLMs

SpatialClaw is a training-free framework that uses code execution as the action interface for vision-language model (VLM) agents performing spatial reasoning tasks. The system maintains a stateful Python kernel with perception and geometry primitives, so the VLM can write executable cells iteratively, each conditioned on prior outputs, rather than committing to a full strategy upfront. Evaluated on 20 spatial reasoning benchmarks covering static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the prior state-of-the-art spatial agent by 11.2 points across six VLM backbones.
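The stateful-kernel idea can be illustrated with a minimal sketch. This is not SpatialClaw's implementation: the `StatefulKernel` class, the `distance` primitive, and the hard-coded cell strings (standing in for cells a VLM would write) are all hypothetical, chosen only to show how one cell can build on variables left behind by an earlier one.

```python
import contextlib
import io


class StatefulKernel:
    """Keeps one namespace alive across cells, like a notebook kernel."""

    def __init__(self, primitives=None):
        # Names visible to every cell (stand-ins for perception/geometry tools).
        self.namespace = dict(primitives or {})

    def run_cell(self, source):
        """Execute one cell and return what it printed, or the error text."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
        except Exception as exc:
            # Errors are returned as text so the caller can react to them.
            buffer.write(f"{type(exc).__name__}: {exc}\n")
        return buffer.getvalue()


def distance(p, q):
    """Hypothetical geometry primitive: Euclidean distance between 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


kernel = StatefulKernel({"distance": distance})

# Cell 1: store intermediate results in the kernel's namespace.
out1 = kernel.run_cell(
    "chair = (0.0, 0.0, 0.0)\ntable = (3.0, 4.0, 0.0)\nprint('ok')"
)
# Cell 2: written after seeing out1, reusing the variables from cell 1.
out2 = kernel.run_cell("print(distance(chair, table))")
print(out1, out2, sep="")
```

In an agent loop, each returned output string would be fed back to the model before it writes the next cell; here the two cells are fixed strings so the sketch runs without a model.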

Related events (8)

6 · arXiv · cs.CL · 11d ago

SpatialWorld benchmark evaluates interactive spatial reasoning of multimodal agents in real-world tasks

Researchers introduce SpatialWorld, a benchmark for evaluating interactive spatial understanding of multimodal agents across 760 human-annotated tasks spanning household, travel, and social domains. The benchmark integrates eight simulation backends under a shared protocol, requiring agents to operate under vision-only partial observability with egocentric inputs. An evaluation of 15 agents reveals that even the strongest model, GPT-5, achieves a task success rate of only 17.4%, exposing significant gaps in active exploration and long-horizon planning. The work highlights a mismatch between task success and execution efficiency as a key bottleneck for spatial agents.

6 · arXiv · cs.AI · 25d ago

Claw-Anything: Benchmark for Always-On Personal Assistants with Broad Digital World Access

Claw-Anything is a new benchmark designed to evaluate LLM agents acting as always-on personal assistants with access to long-horizon activity histories, interdependent backend services, and multi-device GUI/CLI interaction. The benchmark simulates months of user activity to create complex, noisy world states and evaluates both reactive and proactive assistance. GPT-5.5 achieves only 34.5% pass@1, revealing a substantial capability gap versus prior narrower benchmarks. An accompanying automated data-generation pipeline produces 2,000 training environments and yields a 23.7% improvement over the base model.

6 · arXiv · cs.AI · 26d ago

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

SpaceNum is a new evaluation framework probing whether vision-language models (VLMs) genuinely ground numerical outputs (coordinates, action magnitudes) in spatial perception rather than relying on shallow cues. The benchmark defines two bidirectional tasks, Num2Space and Space2Num, across dynamic and static spatial settings. Results show that current VLMs perform near random chance on spatial numerical grounding, with explicit reasoning providing only marginal improvement and fine-tuning offering partial gains.

6 · arXiv · cs.CL · 8d ago

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

5 · arXiv · cs.CL · 17d ago

RealClawBench: Live benchmark framework built from real developer-agent sessions

RealClawBench is a new benchmark framework that converts real OpenClaw developer-agent sessions into reproducible, automatically scored evaluation tasks. It addresses realism gaps in existing agent benchmarks through reconstructed execution environments and deterministic verifiable scorers, releasing 281 executable tasks sampled to preserve the source session distribution. Evaluation of 14 contemporary models shows the best system solves only 65.8% of tasks, indicating substantial headroom on realistic developer-agent workloads.

6 · arXiv · cs.CL · 23d ago

MaskClaw: Edge-Side Privacy Arbitration System for GUI Agents with Behavior-Driven Skill Evolution

MaskClaw is an edge-side privacy arbitration framework for GUI agents that intercepts screenshots before they leave a trusted environment, applying Allow/Mask/Ask decisions based on local visual evidence and user-specific policy memory. The system addresses the gap where static PII detectors miss context-dependent privacy boundaries and cloud-side VLMs may upload raw screens before deciding what to protect. The authors introduce P-GUI-Evo, a new benchmark built from real UI patterns and sanitized labels, and demonstrate that pattern matching, cloud reasoning, and routing alone each exhibit systematic failure modes. The artifact is open-sourced on GitHub.

5 · arXiv · cs.LG · 2d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

7 · arXiv · cs.CL · 22d ago

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.