LocateAnything: Parallel Box Decoding for Fast and Accurate Vision-Language Grounding
LocateAnything introduces Parallel Box Decoding (PBD), a method that decodes bounding boxes and points as atomic units in a single step rather than sequentially token-by-token, improving both throughput and geometric coherence in visual grounding tasks. The framework is paired with a large-scale data engine producing LocateAnything-Data, a 138-million-sample training dataset for high-precision localization. Evaluations show advances on the speed-accuracy frontier across diverse grounding and detection benchmarks. The work addresses a fundamental architectural mismatch in how current VLMs handle 2D spatial coordinates.
Related guides (3)
Related events (8)
PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs
This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.
Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
SpatialClaw: Code-as-action interface for agentic 3D/4D spatial reasoning with VLMs
SpatialClaw is a training-free framework that uses code execution as the action interface for vision-language model agents performing spatial reasoning tasks. The system maintains a stateful Python kernel with perception and geometry primitives, allowing the VLM to write iterative executable cells conditioned on prior outputs rather than committing to a full strategy upfront. Evaluated across 20 spatial reasoning benchmarks covering static and dynamic 3D/4D tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the prior state-of-the-art spatial agent by +11.2 points across six VLM backbones.
RefRad2D dataset and RadGrounder model enable spatially grounded radiology VLMs without manual annotations
Researchers introduce RefRad2D, a 1.2M-pair bilingual (German/English) CT and MR image-text dataset generated automatically via LLM curation and automated segmentation, requiring no manual spatial annotations. The accompanying RadGrounder model jointly performs report generation, VQA, and spatial grounding via bounding-box or segmentation outputs. On external benchmarks Slake and VQA-RAD, RadGrounder matches specialized medical VLMs while adding grounding supervision without degrading language quality. The work demonstrates that large-scale automatically curated clinical data can transfer to downstream medical VQA tasks.
ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models
Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.
MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models
MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.
AdaCodec: Predictive Visual Coding for Efficient Video MLLMs
AdaCodec introduces a predictive visual code interface for video multimodal large language models that exploits temporal redundancy in video. Instead of encoding every sampled frame as an independent RGB image, it sends full visual tokens only for reference frames with high conditional predictive cost, and encodes inter-frame changes as compact P-tokens. Evaluated against a Qwen3-VL-8B per-frame baseline across eleven benchmarks, AdaCodec at 1/7 the token budget (32k vs 224k tokens) surpasses the baseline on all long-video benchmarks while reducing time-to-first-token from 9.26s to 1.62s.


