Almanac
← Events
6arXiv cs.AI (Artificial Intelligence)·26d ago

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

SpaceNum is a new evaluation framework probing whether Vision-Language Models genuinely ground numerical outputs (coordinates, action magnitudes) in spatial perception, rather than relying on shallow cues. The benchmark defines two bidirectional tasks—Num2Space and Space2Num—across dynamic and static spatial settings. Results show current VLMs perform near random chance on spatial numerical grounding, with explicit reasoning providing only marginal improvement and fine-tuning offering partial gains.

Related guides (3)

Related events (8)

5arXiv · cs.AI·17d ago·source ↗

Imaginative Perception Tokens improve spatial reasoning in vision-language models

Researchers introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive from alternative spatial viewpoints, enabling reasoning about unobserved spatial structure. The approach is evaluated on three new tasks—Perspective Taking, Path Tracing, and Multiview Counting—using ~20K examples built on the BAGEL backbone. IPT supervision consistently outperforms textual chain-of-thought training for spatial tasks, with the authors finding that forcing spatial computation through language can degrade performance, suggesting a modality mismatch. The work provides both a practical supervision technique and a diagnostic finding about the limits of language-mediated spatial reasoning.

6arXiv · cs.AI·8d ago·source ↗

SpatialClaw: Code-as-action interface for agentic 3D/4D spatial reasoning with VLMs

SpatialClaw is a training-free framework that uses code execution as the action interface for vision-language model agents performing spatial reasoning tasks. The system maintains a stateful Python kernel with perception and geometry primitives, allowing the VLM to write iterative executable cells conditioned on prior outputs rather than committing to a full strategy upfront. Evaluated across 20 spatial reasoning benchmarks covering static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the prior state-of-the-art spatial agent by +11.2 points across six VLM backbones.

6arXiv · cs.CL·11d ago·source ↗

SpatialWorld benchmark evaluates interactive spatial reasoning of multimodal agents in real-world tasks

Researchers introduce SpatialWorld, a benchmark for evaluating interactive spatial understanding of multimodal agents across 760 human-annotated tasks spanning household, travel, and social domains. The benchmark integrates eight simulation backends under a shared protocol, requiring agents to operate under vision-only partial observability with egocentric inputs. Evaluation of 15 agents reveals that even the strongest model, GPT-5, achieves only 17.4% task success rate, exposing significant gaps in active exploration and long-horizon planning. The work highlights a mismatch between task success and execution efficiency as a key bottleneck for spatial agents.

5arXiv · cs.CL·24d ago·source ↗

Real Images, Worse Judgments: Evaluating VLMs on Concreteness and Imagery

This paper evaluates whether vision-language models (VLMs) benefit from real image context when making lexical judgments about word concreteness and imagery. The authors find that real-image contexts frequently hurt alignment with human ratings, especially when visual evidence is least relevant to the word being judged. Probing and canonical correlation analysis reveal that real images cause representational shifts and increased sensitivity to spurious visual cues. Instructing models to focus on text-only content at inference time partially mitigates this degradation.

6arXiv · cs.CL·22d ago·source ↗

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

5Hugging Face Blog·1mo ago·source ↗

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

5arXiv · cs.AI·22d ago·source ↗

VisAnomReasoner: Efficient VLM for Time-Series Anomaly Detection via VisAnomBench

Researchers introduce VisAnomBench, a curated benchmark augmenting public time-series anomaly datasets with natural-language rationales generated and selected from multiple large VLMs using task-specific rewards. Fine-tuning on this benchmark produces VisAnomReasoner, a parameter-efficient vision-language model that outperforms all baselines by at least 21.23 and 23.87 percentage points in precision and F1 on VisAnomBench. Cross-benchmark evaluation on TSB-AD-U shows further generalization gains of 9.57 and 13.39 percentage points in precision and F1, respectively.

3Hugging Face Blog·1mo ago·source ↗

Vision Language Models Explained

A Hugging Face blog post providing a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.