Entity · technique

visual-token compression

technique · active · visual-token-compression-b11eaf0a · 1 event · first seen Jun 2, 2026

Aliases: visual-token compression

Co-occurring entities

Multimodal Large Language Models · Moment-Video · Seed-2.0-Pro · sparse frame sampling · temporal visual event understanding

More like this (12)

thought compression · visual-token activation probing · gradient compression · VisualMem · visual language model · Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models · SKIM (SKIll coMpression) · Planning-aligned Token Compression for Long-Context Autonomous Driving · Channel-wise Vector Quantization · predictive visual code · Vectorize.io · End-to-End Context Compression at Scale

Recent events (1)

7 · arXiv · cs.AI · Jun 2, 2026

Moment-Video: Benchmark Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video is a new benchmark of 1,000 human-verified video-QA pairs designed to evaluate how well video multimodal large language models (MLLMs) handle brief, localized visual events that may span only a few frames. The benchmark covers 7 domains and 25 subcategories across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 proprietary and open-source models reveals severe deficiencies: the best model (Seed-2.0-Pro) achieves only 39.6% accuracy, while most open-source models score below 25%. Diagnostic analyses show that denser frame sampling helps but does not resolve the bottleneck, pointing to fundamental limitations in how current video MLLMs represent and preserve transient visual evidence.

Long Context Evolution · Evaluation and Benchmarking · Multimodal Large Language Models · Moment-Video · Seed-2.0-Pro · +4 more