6 · arXiv cs.CL (Computation and Language) · 25d ago

STORM: Internalized Spatial-Temporal Reasoning for Video-Language Models via Latent Trajectories

STORMS is a two-stage training framework that teaches large vision-language models to perform spatial-temporal video reasoning through bounded continuous latent trajectories rather than explicit textual chain-of-thought, keyframe selection, or external tool use. In Stage I, latent tokens are aligned with thought-video representations derived from generated videos; in Stage II, answer-only supervision internalizes the reasoning process. At inference time, no video regeneration or frame reinsertion is required, reducing latency and engineering complexity. Evaluations on VideoMME, MVBench, TempCompass, and MMVU show improved accuracy with substantially lower inference overhead versus tool-based pipelines.

Inference Economics · Agent and Tool Ecosystem · Multimodal Progress · MVBench · STORMS · TempCompass · Chain-of-Thought Reasoning · MMVU · VideoMME · latent chain-of-thought

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production


Related events (8)

6 · arXiv · cs.AI · 12d ago

MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding

MemDreamer is a plug-and-play framework that decouples perception and reasoning for long-video understanding by incrementally building a three-tier Hierarchical Graph Memory capturing spatiotemporal and causal relations. During inference, a reasoning model uses an Observation-Reason-Action loop with agentic tool-augmented retrieval to navigate the memory graph, constraining the context window to 2% of full-context ingestion while achieving a 12.5-point absolute accuracy gain. The system reaches SOTA on four benchmarks, narrowing the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, proposing agentic capability scaling as a new paradigm for multimodal comprehension.

Long Context Evolution · Agent and Tool Ecosystem · MemDreamer · Hierarchical Graph Memory · Observation-Reason-Action · +1 more
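The Observation-Reason-Action loop in the MemDreamer summary above can be pictured with a toy sketch. Everything here is an assumption made for illustration: the `Node` class, the `navigate` function, the `is_relevant` callback standing in for the reasoning model, and the simplification of the graph memory to a tree. It is not the authors' implementation; it only shows how expanding relevant nodes alone keeps most of the memory out of the context.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One memory entry; children hold finer-grained entries (hypothetical)."""
    summary: str
    children: list["Node"] = field(default_factory=list)


def navigate(root, is_relevant, max_steps=10):
    """Walk the memory breadth-first, expanding only nodes judged relevant.

    is_relevant is a stand-in for the reasoning model's decision.
    Returns the summaries that were actually read, in order.
    """
    frontier = [root]
    visited = []
    for _ in range(max_steps):
        if not frontier:
            break
        node = frontier.pop(0)               # Observation: read one summary
        visited.append(node.summary)
        if is_relevant(node.summary):        # Reason: is this branch useful?
            frontier.extend(node.children)   # Action: retrieve its children
    return visited


# Toy three-tier memory: video -> scenes -> clips.
memory = Node("video", [
    Node("scene: kitchen", [Node("clip: chopping onions")]),
    Node("scene: garden", [Node("clip: watering plants")]),
])
read = navigate(memory, lambda s: "video" in s or "kitchen" in s)
```

In this toy run the garden clip is never read, because its parent scene was judged irrelevant.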

4 · arXiv · cs.AI · 12d ago

Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework

A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.

Long Context Evolution · Multimodal Progress · Watch, Remember, Reason: Human-View Video Understanding with MLLMs

6 · arXiv · cs.CL · 1mo ago

Probe Trajectories Reveal Reasoning Dynamics in Large Reasoning Models

This paper investigates whether hidden representations of Large Reasoning Models (LRMs) can predict future model behavior by analyzing probe trajectories—the continuous evolution of concept probabilities across Chain-of-Thought reasoning tokens. The authors find that temporal trajectory features (volatility, trend, steady-state) significantly outperform single static probes, with max-pooling achieving up to 95% AUROC across safety and mathematics domains. Two methodological insights are offered: template-based training data matches dynamically generated responses in quality, and pooling strategy is critical to probe performance. The work positions probe trajectories as a complementary safety monitoring framework for LRMs where CoT faithfulness cannot be assumed.

Frontier Model Releases · Evaluation and Benchmarking · Max-Pooling · Chain-of-Thought Reasoning · Probe Trajectories · +4 more
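As a rough illustration of the trajectory features named in the probe-trajectories summary above (volatility, trend, steady-state, max-pooling), the sketch below summarizes a list of per-token probe probabilities. The function name, the `tail` window, and the exact formulas (standard deviation of step changes, least-squares slope, mean of the last few readings) are assumptions chosen for clarity; the paper's definitions may differ.

```python
from statistics import mean, pstdev


def trajectory_features(probs, tail=3):
    """Summarize a per-token probe-probability trajectory (illustrative)."""
    if len(probs) < 2:
        raise ValueError("need at least two probe readings")
    # Step-to-step changes in the probe probability.
    steps = [b - a for a, b in zip(probs, probs[1:])]
    # Least-squares slope of probability against token index.
    n = len(probs)
    x_mean = (n - 1) / 2
    y_mean = mean(probs)
    slope = (
        sum((i - x_mean) * (p - y_mean) for i, p in enumerate(probs))
        / sum((i - x_mean) ** 2 for i in range(n))
    )
    return {
        "max_pool": max(probs),               # peak probe reading
        "volatility": pstdev(steps),          # spread of step changes
        "trend": slope,                       # overall drift per token
        "steady_state": mean(probs[-tail:]),  # average of the final readings
    }


features = trajectory_features([0.1, 0.2, 0.3, 0.4])
```

A static probe would look at one reading only; the dictionary above is the kind of temporal summary a downstream classifier could consume instead.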

6 · arXiv · cs.CL · 1mo ago

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

Evaluation and Benchmarking · Agent and Tool Ecosystem · functional token · GRPO · Latent-Anchored GRPO · +4 more

6 · arXiv · cs.CL · 2d ago

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

Inference Economics · Agent and Tool Ecosystem · OmniAgent · Qwen2.5-VL-72B · LVBench · +4 more

7 · arXiv · cs.AI · 18d ago

Moment-Video: Benchmark Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video is a new benchmark of 1,000 human-verified video-QA pairs designed to evaluate how well video multimodal large language models (MLLMs) handle brief, localized visual events that may span only a few frames. The benchmark covers 7 domains and 25 subcategories across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 proprietary and open-source models reveals severe deficiencies: the best model (Seed-2.0-Pro) achieves only 39.6% accuracy, while most open-source models score below 25%. Diagnostic analyses show that denser frame sampling helps but does not resolve the bottleneck, pointing to fundamental limitations in how current video MLLMs represent and preserve transient visual evidence.

Long Context Evolution · Evaluation and Benchmarking · Multimodal Large Language Models · Moment-Video · Seed-2.0-Pro · +4 more

6 · arXiv · cs.AI · 22d ago

Reasoning in Memory (RiM): Latent Reasoning via Working Memory Blocks in LLMs

RiM introduces a latent reasoning method that replaces autoregressive chain-of-thought token generation with fixed sequences of special 'memory block' tokens, allowing LLMs to perform internal computation without externalizing intermediate steps. These memory blocks are processed in a single forward pass rather than generated autoregressively, improving compute efficiency at test time. Training uses a two-stage curriculum: first grounding memory blocks by predicting explicit reasoning steps, then discarding step-level supervision and refining answers iteratively. Experiments across multiple model families and sizes show RiM matches or exceeds existing latent reasoning methods.

Evaluation and Benchmarking · Inference Economics · latent reasoning · Chain-of-Thought Reasoning · Reasoning in Memory (RiM) · +3 more

5 · arXiv · cs.AI · 17d ago

Imaginative Perception Tokens improve spatial reasoning in vision-language models

Researchers introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive from alternative spatial viewpoints, enabling reasoning about unobserved spatial structure. The approach is evaluated on three new tasks—Perspective Taking, Path Tracing, and Multiview Counting—using ~20K examples built on the BAGEL backbone. IPT supervision consistently outperforms textual chain-of-thought training for spatial tasks, with the authors finding that forcing spatial computation through language can degrade performance, suggesting a modality mismatch. The work provides both a practical supervision technique and a diagnostic finding about the limits of language-mediated spatial reasoning.

Evaluation and Benchmarking · Multimodal Progress · Imaginative Perception Tokens · Path Tracing · Perspective Taking · +2 more