6 · arXiv cs.LG (Machine Learning) · 24d ago

LocateAnything: Parallel Box Decoding for Fast and Accurate Vision-Language Grounding

LocateAnything introduces Parallel Box Decoding (PBD), a method that decodes bounding boxes and points as atomic units in a single step rather than sequentially token-by-token, improving both throughput and geometric coherence in visual grounding tasks. The framework is paired with a large-scale data engine producing LocateAnything-Data, a 138-million-sample training dataset for high-precision localization. Evaluations show advances on the speed-accuracy frontier across diverse grounding and detection benchmarks. The work addresses a fundamental architectural mismatch in how current VLMs handle 2D spatial coordinates.
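
To make the throughput argument concrete, here is a minimal sketch that counts decoder steps under the two regimes. It assumes a sequential decoder spends one step per coordinate token (four per box) and an atomic decoder spends one step per box; the function name `decoding_steps` and the four-token serialization are illustrative assumptions, not details taken from the paper.

```python
def decoding_steps(num_boxes: int, tokens_per_box: int = 4, atomic: bool = False) -> int:
    """Count decoder steps needed to emit `num_boxes` boxes.

    Assumption (not from the paper): a sequential decoder spends one step per
    coordinate token, while an atomic decoder spends one step per box.
    """
    if num_boxes < 0 or tokens_per_box < 1:
        raise ValueError("num_boxes must be >= 0 and tokens_per_box >= 1")
    return num_boxes if atomic else num_boxes * tokens_per_box


sequential = decoding_steps(10)             # 10 boxes, 4 coordinate tokens each
parallel = decoding_steps(10, atomic=True)  # one step per box
print(sequential, parallel)
```

Under these assumptions the step count scales with the number of boxes rather than the number of coordinate tokens, which is the source of the claimed throughput gain.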

Evaluation and Benchmarking · Inference Economics · Multimodal Progress · Parallel Box Decoding · IoU · vision-language grounding · LocateAnything

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.AI · 26d ago

PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs

This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.
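
The idea of overlaying a geometric primitive and getting a localization label for free can be sketched as follows. This is a toy illustration on a nested-list "image"; `overlay_rectangle` and its parameters are hypothetical and are not the PGT implementation.

```python
def overlay_rectangle(image, top, left, height, width, value=1):
    """Draw a rectangle outline on a copy of `image` (a list of pixel rows).

    Returns the new image and the box (x_min, y_min, x_max, y_max), which can
    act as a grounding label. Hypothetical helper, not the PGT code.
    """
    rows, cols = len(image), len(image[0])
    bottom, right = top + height - 1, left + width - 1
    if height < 1 or width < 1 or top < 0 or left < 0 or bottom >= rows or right >= cols:
        raise ValueError("rectangle must lie inside the image")
    out = [row[:] for row in image]      # leave the input untouched
    for x in range(left, right + 1):     # top and bottom edges
        out[top][x] = value
        out[bottom][x] = value
    for y in range(top, bottom + 1):     # left and right edges
        out[y][left] = value
        out[y][right] = value
    return out, (left, top, right, bottom)


blank = [[0] * 6 for _ in range(5)]
marked, box = overlay_rectangle(blank, top=1, left=2, height=3, width=3)
```

Because the primitive is placed programmatically, the label is exact by construction, which is what makes such data usable both for training and for probing perception separately from semantic priors.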

Evaluation and Benchmarking · Multimodal Progress · PGT (Procedurally Generated Tasks) · Multimodal Large Language Models · CV-Bench-2D

6 · arXiv · cs.CL · 1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
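
A token-level divergence between a teacher and a student can be sketched as below. The summary does not say which divergence or reduction is used, so this example assumes a forward KL, KL(teacher || student), averaged over tokens; `token_level_kl` is a hypothetical name.

```python
import math


def token_level_kl(teacher_probs, student_probs, eps=1e-12):
    """Mean over tokens of KL(teacher || student).

    Each argument is a list with one entry per generated token; each entry is
    a probability distribution over the same vocabulary.
    Assumption: forward KL with a mean reduction (not confirmed by the source).
    """
    if not teacher_probs or len(teacher_probs) != len(student_probs):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for p, q in zip(teacher_probs, student_probs):
        # Terms with p_i == 0 contribute nothing to the KL sum.
        total += sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_probs)


teacher = [[0.9, 0.1], [0.2, 0.8]]   # crop-conditioned distributions (toy values)
student = [[0.6, 0.4], [0.5, 0.5]]   # full-image-conditioned distributions (toy values)
loss = token_level_kl(teacher, student)
```

In an on-policy setup the token positions would come from sequences sampled by the student itself, with both distributions evaluated on those same tokens.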

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Large Language Models · Thinking-with-Images · on-policy self-distillation

6 · arXiv · cs.AI · 8d ago

SpatialClaw: Code-as-action interface for agentic 3D/4D spatial reasoning with VLMs

SpatialClaw is a training-free framework that uses code execution as the action interface for vision-language model agents performing spatial reasoning tasks. The system maintains a stateful Python kernel with perception and geometry primitives, allowing the VLM to write iterative executable cells conditioned on prior outputs rather than committing to a full strategy upfront. Evaluated across 20 spatial reasoning benchmarks covering static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the prior state-of-the-art spatial agent by +11.2 points across six VLM backbones.
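
The stateful-kernel idea, where each executed cell can build on variables left by earlier cells, can be sketched with a shared namespace. `StatefulKernel` is a hypothetical class written for illustration; it is not SpatialClaw's interface and it performs no sandboxing.

```python
class StatefulKernel:
    """Run code cells against one persistent namespace (illustrative only)."""

    def __init__(self):
        self.namespace = {}

    def run(self, cell: str):
        # Variables defined by earlier cells remain visible to later ones.
        exec(cell, self.namespace)
        # Convention assumed here: a cell reports its output via `result`.
        return self.namespace.get("result")


kernel = StatefulKernel()
kernel.run("points = [(0, 0, 0), (3, 4, 0)]")   # cell 1: store observations
dist = kernel.run(                              # cell 2: reuse them
    "import math\n"
    "result = math.dist(points[0], points[1])"
)
print(dist)
```

The second cell only makes sense after the first has run, which mirrors the iterative, output-conditioned workflow described above.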

Evaluation and Benchmarking · Agent and Tool Ecosystem · SpatialClaw

5 · arXiv · cs.CL · 46h ago

RefRad2D dataset and RadGrounder model enable spatially grounded radiology VLMs without manual annotations

Researchers introduce RefRad2D, a 1.2M-pair bilingual (German/English) CT and MR image-text dataset generated automatically via LLM curation and automated segmentation, requiring no manual spatial annotations. The accompanying RadGrounder model jointly performs report generation, VQA, and spatial grounding via bounding-box or segmentation outputs. On external benchmarks Slake and VQA-RAD, RadGrounder matches specialized medical VLMs while adding grounding supervision without degrading language quality. The work demonstrates that large-scale automatically curated clinical data can transfer to downstream medical VQA tasks.

Evaluation and Benchmarking · Multimodal Progress · RefRad2D · Slake · RadGrounder

5 · arXiv · cs.CL · 11d ago

ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models

Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
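
A greedy, attention-discounted selection rule of this general shape can be sketched as follows. The scoring formula (confidence minus a penalty proportional to attention toward already-selected positions, weighted by their uncertainty) is an assumption made for illustration; the summary does not give ADAS's exact rule, and `greedy_discounted_select` is a hypothetical name.

```python
def greedy_discounted_select(confidence, attention, k, penalty=1.0):
    """Pick up to `k` positions to commit in one parallel decoding step.

    confidence[i]   : model confidence for the candidate token at position i
    attention[i][j] : attention weight from position i to position j
    Assumed score   : confidence[i] - penalty * sum_j attention[i][j] * (1 - confidence[j])
                      over already-selected j (a soft penalty, not a hard constraint).
    """
    selected = []
    remaining = set(range(len(confidence)))
    while remaining and len(selected) < k:
        def score(i):
            discount = sum(attention[i][j] * (1.0 - confidence[j]) for j in selected)
            return confidence[i] - penalty * discount

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


confidence = [0.7, 0.6, 0.55]
attention = [
    [1.0, 0.0, 0.0],
    [0.9, 0.0, 0.1],   # position 1 leans heavily on position 0
    [0.1, 0.1, 0.8],   # position 2 barely attends to position 0
]
print(greedy_discounted_select(confidence, attention, k=2))
```

In this toy case plain confidence ranking would commit positions 0 and 1 together, while the discounted rule prefers position 2 because position 1 depends strongly on a still-uncertain neighbor.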

Frontier Model Releases · Inference Economics · LLaDA-8B-Base · MATH500 · EB-Sampler

6 · arXiv · cs.CL · 22d ago

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking · Alignment and RLHF · LoMo · LLaVA-OneVision-1.5-8B · Qwen3-4B

6 · arXiv · cs.CL · 25d ago

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.
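
The three stages (filter, rank, bucket-wise allocation) can be sketched as a small selection routine. Everything specific here is an assumption for illustration: the dictionary keys, the `min_gain` threshold, and the round-robin allocation stand in for MAGIC's actual signals and budget rule, which the summary does not spell out.

```python
from collections import defaultdict


def select_coreset(examples, budget, min_gain=0.0):
    """Select up to `budget` example ids.

    examples: dicts with hypothetical keys 'id', 'gain', 'quality', 'bucket'.
    """
    # Stage 1: drop examples whose gain does not exceed the threshold.
    kept = [e for e in examples if e["gain"] > min_gain]

    # Stage 2: rank by quality, then group so each bucket is already ordered.
    buckets = defaultdict(list)
    for e in sorted(kept, key=lambda e: e["quality"], reverse=True):
        buckets[e["bucket"]].append(e)

    # Stage 3: spread the budget across buckets (round-robin stand-in).
    chosen, rank = [], 0
    while len(chosen) < budget and any(rank < len(b) for b in buckets.values()):
        for name in sorted(buckets):
            if rank < len(buckets[name]) and len(chosen) < budget:
                chosen.append(buckets[name][rank]["id"])
        rank += 1
    return chosen


examples = [
    {"id": "a", "gain": 0.5, "quality": 0.90, "bucket": "ocr"},
    {"id": "b", "gain": 0.4, "quality": 0.80, "bucket": "ocr"},
    {"id": "c", "gain": 0.0, "quality": 0.99, "bucket": "ocr"},    # removed in stage 1
    {"id": "d", "gain": 0.3, "quality": 0.20, "bucket": "count"},
]
print(select_coreset(examples, budget=2))
```

A pure top-quality ranking would pick "a" and "b" from the same bucket; allocating per bucket keeps the rarer "count" skill represented in the subset.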

Training Infrastructure · Inference Economics · MAGIC · LLaVA-1.5-7B · LLaVA-665K

7 · arXiv · cs.CL · 18d ago

AdaCodec: Predictive Visual Coding for Efficient Video MLLMs

AdaCodec introduces a predictive visual code interface for video multimodal large language models that exploits temporal redundancy in video. Instead of encoding every sampled frame as an independent RGB image, it sends full visual tokens only for reference frames with high conditional predictive cost, and encodes inter-frame changes as compact P-tokens. Evaluated against a Qwen3-VL-8B per-frame baseline across eleven benchmarks, AdaCodec at 1/7 the token budget (32k vs 224k tokens) surpasses the baseline on all long-video benchmarks while reducing time-to-first-token from 9.26s to 1.62s.
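
The reported figures can be checked with a few lines of arithmetic. This sketch reads "32k" and "224k" as 32,000 and 224,000 tokens, which is an assumption about the rounding; the ratio is the same if both are multiples of 1,024.

```python
baseline_tokens = 224_000   # per-frame baseline token budget ("224k", as reported)
adacodec_tokens = 32_000    # AdaCodec token budget ("32k", as reported)
baseline_ttft_s = 9.26      # baseline time-to-first-token, seconds
adacodec_ttft_s = 1.62      # AdaCodec time-to-first-token, seconds

token_ratio = adacodec_tokens / baseline_tokens   # fraction of the baseline budget
ttft_speedup = baseline_ttft_s / adacodec_ttft_s  # how many times faster

print(f"token budget ratio: {token_ratio:.3f}")
print(f"TTFT speedup: {ttft_speedup:.2f}x")
```

The token ratio works out to exactly 1/7, consistent with the summary, and the time-to-first-token figures correspond to a speedup of roughly 5.7x.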

Long Context Evolution · Frontier Model Releases · Multimodal Large Language Models · Qwen3-4B · predictive visual code