7 · arXiv cs.CL (Computation and Language) · 18d ago

AdaCodec: Predictive Visual Coding for Efficient Video MLLMs

AdaCodec introduces a predictive visual code interface for video multimodal large language models that exploits temporal redundancy in video. Instead of encoding every sampled frame as an independent RGB image, it sends full visual tokens only for reference frames with high conditional predictive cost, and encodes inter-frame changes as compact P-tokens. Evaluated against a Qwen3-VL-8B per-frame baseline across eleven benchmarks, AdaCodec at 1/7 the token budget (32k vs 224k tokens) surpasses the baseline on all long-video benchmarks while reducing time-to-first-token from 9.26s to 1.62s.
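
The reference-frame versus P-token split described above can be illustrated with a toy sketch. This is not AdaCodec's implementation: the summary does not say how conditional predictive cost is computed, so the mean absolute pixel difference used here is a stand-in assumption, and the function names, the threshold, and the per-frame token counts (256 and 16) are all hypothetical.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed proxy for predictive cost: mean absolute pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def assign_frame_types(frames: List[Sequence[float]], threshold: float) -> List[str]:
    """Label each frame 'I' (full visual tokens) or 'P' (compact change tokens).

    A frame becomes a new reference when it differs from the current
    reference by more than `threshold`; otherwise it is coded as a P-frame.
    """
    labels: List[str] = []
    ref = None
    for frame in frames:
        if ref is None or mean_abs_diff(frame, ref) > threshold:
            labels.append("I")
            ref = frame  # this frame is the new reference
        else:
            labels.append("P")
    return labels


def token_budget(labels: List[str], full_tokens: int, p_tokens: int) -> int:
    """Total tokens when I-frames cost `full_tokens` and P-frames `p_tokens`."""
    return sum(full_tokens if label == "I" else p_tokens for label in labels)


# Four tiny 4-pixel "frames": two near-static scenes separated by a cut.
frames = [[0, 0, 0, 0], [0, 0, 0, 1], [9, 9, 9, 9], [9, 9, 9, 8]]
labels = assign_frame_types(frames, threshold=1.0)
print(labels)                                         # ['I', 'P', 'I', 'P']
print(token_budget(labels, 256, 16))                  # 544
print(token_budget(["I"] * len(frames), 256, 16))     # 1024 (per-frame baseline)
```

For scale, the figures reported above work out to 224k / 32k = 7x fewer tokens and 9.26 / 1.62 ≈ 5.7x lower time-to-first-token.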

Long Context Evolution · Frontier Model Releases · Inference Economics · Multimodal Progress · Multimodal Large Language Models · Qwen3-4B · predictive visual code · P-tokens · AdaCodec

Related guides (4)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Inference Economics · Topic guide

Inference Economics: The Cost Structure of Running AI Models in Production

Related events (8)

6 · arXiv · cs.AI · 22d ago

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

VideoMLA applies Multi-Head Latent Attention (MLA) to causal video diffusion, replacing per-head keys and values with a shared low-rank content latent and a decoupled 3D-RoPE positional key, achieving a 92.7% reduction in per-token KV memory. The paper investigates why MLA works even though pretrained video attention is not low-rank (unlike the spectral assumption that motivates MLA in LLMs), and finds that the MLA bottleneck itself, rather than the pretrained spectrum, determines the effective rank. On VBench, VideoMLA matches short-horizon baselines, achieves the best overall score at long horizons, and delivers a 1.23x throughput improvement on a single NVIDIA B200 GPU.

Training Infrastructure · Long Context Evolution · NVIDIA B200 · KV Cache · 3D-RoPE

5 · arXiv · cs.CL · 11d ago

ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models

Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
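
The greedy, attention-penalized selection described above might look roughly like the sketch below. This is not the ADAS rule from the paper: the summary gives no formula, so the scoring function (confidence minus attention-weighted uncertainty of already-selected positions), the `penalty` weight, and the function name are all assumptions made for illustration.

```python
from typing import List


def adas_style_select(
    confidence: List[float],
    attention: List[List[float]],
    k: int,
    penalty: float = 1.0,
) -> List[int]:
    """Greedily pick up to k positions to commit in one parallel decoding step.

    Assumed score for candidate i, given the already-selected set S:
        confidence[i] - penalty * sum_{j in S} attention[i][j] * (1 - confidence[j])
    so a candidate is discounted when it attends strongly to selected
    positions that are themselves uncertain. The penalty is soft: it lowers
    a candidate's rank but never forbids it.
    """
    selected: List[int] = []
    remaining = set(range(len(confidence)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            discount = sum(
                attention[i][j] * (1.0 - confidence[j]) for j in selected
            )
            return confidence[i] - penalty * discount

        best = max(sorted(remaining), key=score)  # sorted for deterministic ties
        selected.append(best)
        remaining.remove(best)
    return selected


# Position 1 is nearly as confident as position 0 but attends heavily to it.
confidence = [0.70, 0.68, 0.60]
attention = [
    [0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(adas_style_select(confidence, attention, k=2))               # [0, 2]
print(adas_style_select(confidence, attention, k=2, penalty=0.0))  # [0, 1]
```

With the penalty switched off the rule reduces to plain top-k by confidence, which is why a reranker of this shape can be plugged into an existing sampler without retraining.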

Frontier Model Releases · Inference Economics · LLaDA-8B-Base · MATH500 · EB-Sampler

5 · arXiv · cs.LG · 1mo ago

RefDecoder: Reference-Conditioned Video VAE Decoder for Enhanced Visual Generation

RefDecoder addresses an architectural asymmetry in latent diffusion models: denoising networks are heavily conditioned, but decoders remain unconditional, which causes detail loss and inconsistency. The approach injects high-fidelity reference image signals into the VAE decoding process via reference attention, with a lightweight image encoder mapping reference frames into high-dimensional tokens that are co-processed at each decoder up-sampling stage. Evaluated on the Inter4K, WebVid, and Large Motion benchmarks, RefDecoder achieves up to +2.1 dB PSNR over unconditional baselines and improves VBench I2V scores for subject consistency, background consistency, and overall quality. The module is plug-and-play and compatible with existing video generation systems, including Wan 2.1 and VideoVAE+, without additional fine-tuning.

Inference Economics · Multimodal Progress · VBench · RefDecoder · Inter4K

7 · arXiv · cs.AI · 18d ago

Moment-Video: Benchmark Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video is a new benchmark of 1,000 human-verified video-QA pairs designed to evaluate how well video multimodal large language models (MLLMs) handle brief, localized visual events that may span only a few frames. The benchmark covers 7 domains and 25 subcategories across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 proprietary and open-source models reveals severe deficiencies: the best model (Seed-2.0-Pro) achieves only 39.6% accuracy, while most open-source models score below 25%. Diagnostic analyses show that denser frame sampling helps but does not resolve the bottleneck, pointing to fundamental limitations in how current video MLLMs represent and preserve transient visual evidence.

Long Context Evolution · Evaluation and Benchmarking · Multimodal Large Language Models · Moment-Video · Seed-2.0-Pro

6 · arXiv · cs.CL · 2d ago

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

Inference Economics · Agent and Tool Ecosystem · OmniAgent · Qwen2.5-VL-72B · LVBench

6 · arXiv · cs.CL · 22d ago

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking · Alignment and RLHF · LoMo · LLaVA-OneVision-1.5-8B · Qwen3-4B

4 · arXiv · cs.AI · 12d ago

COMPACT-VA: Planning-aligned token compression for long-context autonomous driving

Researchers introduce COMPACT-VA, a working-memory framework that uses a conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on the historical trajectory and on a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves a 68.3% success rate (>6% improvement) with a 3.3x speedup and a 2.7x memory reduction over uncompressed processing.

Long Context Evolution · Inference Economics · conditional VQ-VAE · Planning-aligned Token Compression for Long-Context Autonomous Driving · COMPACT-VA

8 · Qwen Research · 1mo ago

Qwen3-Coder: 480B MoE Agentic Coding Model Released by Alibaba/Qwen Team

Alibaba's Qwen team has released Qwen3-Coder, a family of code-focused models whose flagship variant is Qwen3-Coder-480B-A35B-Instruct, a 480B-parameter Mixture-of-Experts model with 35B active parameters. It supports a native context length of 256K tokens and up to 1M tokens via extrapolation. The team claims state-of-the-art results among open-weight models on agentic coding, browser-use, and tool-use benchmarks, with performance described as comparable to Claude Sonnet 4.

Long Context Evolution · Frontier Model Releases · Claude Sonnet 4 · Alibaba · Qwen3-Coder-480B-A35B-Instruct