What'sUp benchmark

benchmark · active · provisional · what-sup-benchmark-817637a6 · 1 event · first seen 22d ago

Aliases: What'sUp benchmark

Recent events (1)

6 · arXiv · cs.AI · 22d ago

PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs

This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.
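
To make the idea of overlaying a geometric primitive to obtain a supervision signal concrete, here is a minimal sketch. It is not the paper's implementation: the function names (`overlay_circle`, `make_task`), the choice of a red circle, the quadrant question, and the nested-list image representation are all assumptions made for illustration. The point it shows is that, because the shape is placed procedurally, the answer is known exactly and does not depend on the image's semantic content.

```python
import random


def overlay_circle(image, cx, cy, radius, color):
    """Return a copy of `image` (rows of RGB tuples) with a filled circle drawn on it."""
    out = [list(row) for row in image]
    height, width = len(out), len(out[0])
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                out[y][x] = color
    return out


def make_task(image, rng):
    """Hypothetical task generator: place a circle at random, then derive a
    question/answer pair from the known placement (not from the image content)."""
    height, width = len(image), len(image[0])
    radius = max(1, min(height, width) // 8)
    # Keep the whole circle inside the image bounds.
    cx = rng.randrange(radius, width - radius)
    cy = rng.randrange(radius, height - radius)
    augmented = overlay_circle(image, cx, cy, radius, (255, 0, 0))
    horizontal = "left" if cx < width / 2 else "right"
    vertical = "top" if cy < height / 2 else "bottom"
    return {
        "image": augmented,
        "question": "In which quadrant of the image is the red circle centered?",
        "answer": f"{vertical}-{horizontal}",
        "center": (cx, cy),
    }


# Usage: a blank 64x48 image stands in for a real photograph.
blank = [[(0, 0, 0)] * 64 for _ in range(48)]
task = make_task(blank, random.Random(0))
print(task["question"], "->", task["answer"])
```

In a real pipeline the base image would be a natural photograph and the drawing would likely go through an imaging library; the nested-list form is used here only to keep the sketch dependency-free.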