5arXiv cs.CL (Computation and Language)·47h ago

RefRad2D dataset and RadGrounder model enable spatially grounded radiology VLMs without manual annotations

Researchers introduce RefRad2D, a bilingual (German/English) dataset of 1.2M CT and MR image-text pairs, generated automatically through LLM curation and automated segmentation with no manual spatial annotations. The accompanying RadGrounder model jointly performs report generation, VQA, and spatial grounding, producing bounding-box or segmentation outputs. On the external benchmarks Slake and VQA-RAD, RadGrounder matches specialized medical VLMs, and adding grounding supervision does not degrade language quality. The work demonstrates that large-scale, automatically curated clinical data can transfer to downstream medical VQA tasks.

Evaluation and Benchmarking Multimodal Progress RefRad2D Slake RadGrounder VQA-RAD

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6arXiv · cs.CL·9d ago·source ↗

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking Alignment and RLHF OpenMedReason OpenMedReason-Bench +1 more

6arXiv · cs.AI·26d ago·source ↗

PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs

This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.

Evaluation and Benchmarking Multimodal Progress PGT (Procedurally Generated Tasks) Multimodal Large Language Models CV-Bench-2D +2 more
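
The PGT summary describes overlaying geometric primitives on images so that the grounding label is known by construction. A minimal sketch of that idea follows, using a plain nested-list grayscale image. The function names, the square primitive, and the left/right question are illustrative assumptions, not the paper's task set or code.

```python
import random


def overlay_square(image, top, left, size, value=255):
    """Return a copy of a 2D grayscale image with a filled square drawn on it."""
    out = [row[:] for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = value
    return out


def make_grounding_sample(image, size=4, rng=None):
    """Place a square at a random position and derive a QA pair from its location.

    Hypothetical helper: the answer is computed from the known placement,
    so no manual annotation is needed.
    """
    rng = rng or random.Random(0)
    height, width = len(image), len(image[0])
    top = rng.randrange(0, height - size + 1)
    left = rng.randrange(0, width - size + 1)
    augmented = overlay_square(image, top, left, size)
    # The square's horizontal center decides the ground-truth answer.
    center_col = left + size / 2
    answer = "left" if center_col < width / 2 else "right"
    return {
        "image": augmented,
        "question": "Is the square in the left or right half of the image?",
        "answer": answer,
        "box": (left, top, left + size, top + size),  # (x_min, y_min, x_max, y_max)
    }


blank = [[0] * 16 for _ in range(16)]
sample = make_grounding_sample(blank, size=4, rng=random.Random(1))
```

Because the label comes from the placement itself, the same sample can serve as training data or as a diagnostic probe: a wrong answer points at a perception failure rather than a missing semantic prior.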

4arXiv · cs.AI·4d ago·source ↗

FusionRS: Large-scale RGB-infrared-text dataset for dual-modal remote sensing vision-language models

Researchers introduce FusionRS, the first large-scale dataset pairing RGB and infrared remote sensing images with both conventional and IR-aware text captions, designed to support dual-modal vision-language learning. The dataset is constructed by translating public RGB remote sensing images into infrared-style counterparts using image translation. Using FusionRS, the authors train CLIP-style alignment models and fine-tune generative VLMs, demonstrating improvements in RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only baselines. The work addresses a gap in multimodal remote sensing foundation models by providing modality-specific textual supervision for infrared imagery.

Evaluation and Benchmarking Multimodal Progress CLIP FusionRS

5Hugging Face Blog·1mo ago·source ↗

Docmatix: A Large-Scale Dataset for Document Visual Question Answering

Hugging Face released Docmatix, a large-scale dataset designed for Document Visual Question Answering (DocVQA) tasks. The dataset aims to address the scarcity of high-quality training data for document understanding in multimodal models. It is intended to improve fine-tuning of vision-language models on document comprehension tasks.

Evaluation and Benchmarking Multimodal Progress Hugging Face Document Visual Question Answering Docmatix

6arXiv · cs.AI·10d ago·source ↗

FADA: Unified vision-language model for fetal ultrasound interpretation deployable on consumer smartphones

FADA is a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation of fetal ultrasound images through a single pipeline without requiring external labels at inference. The system distills knowledge from four domain-specific foundation models using selective distillation, achieving 0.8820 mean Dice for segmentation and 0.7671 mAP@0.50 for detection, with expert validation confirming clinically acceptable outputs. Notably, the compressed 0.8B model runs entirely offline on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1) in approximately 60 seconds, targeting diagnostic access gaps in low- and middle-income countries where trained sonographers are scarce. Code, models, and data are publicly released.

Inference Economics Multimodal Progress USF-MAE FetalCLIP Qwen3-4B +4 more
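
For readers unfamiliar with the two metrics quoted for FADA, the sketch below shows the standard definitions of the Dice coefficient for binary masks and of box IoU, which is the overlap measure thresholded at 0.50 in mAP@0.50. The function names and the empty-mask convention are assumptions for illustration; this is not the authors' evaluation code, and the full mAP computation (ranking by confidence, averaging precision) is omitted.

```python
def dice(mask_a, mask_b):
    """Dice coefficient between two flat binary masks of equal length."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def box_iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_050(pred_box, true_box):
    """A predicted box counts as a hit at the 0.50 threshold when IoU >= 0.5."""
    return box_iou(pred_box, true_box) >= 0.5
```

A mean Dice such as the one reported is the average of `dice` over the evaluated masks.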

4arXiv · cs.LG·24d ago·source ↗

Normal Guidance: Bell-Curve Regularization for Attention-Based MIL in 3D Medical Imaging

This paper addresses weakly supervised classification of 3D medical images where only volume-level binary labels are available. The authors find that a simple center-focused baseline outperforms attention-based and transformer-based multiple instance learning (MIL) at slice-level classification across brain, thoracic, and abdominal CT datasets. They propose Normal Guidance, a regularization technique that constrains learned attention distributions to follow a bell-shaped curve, achieving better slice-level localization than state-of-the-art MIL methods across datasets totaling over 4 million 2D slices.

Evaluation and Benchmarking Multiple Instance Learning (MIL) Attention-based MIL Normal Guidance +1 more
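
One way to picture a bell-curve constraint on slice attention is as a penalty that grows when the attention distribution drifts away from a discretized Gaussian over slice positions. The sketch below uses a KL divergence for that penalty. The KL form, the default mean at the volume center, and the default standard deviation are all assumptions made for illustration; the summary does not state the loss used in Normal Guidance.

```python
import math


def gaussian_prior(num_slices, mean, std):
    """Discretized, normalized Gaussian over slice indices 0..num_slices-1."""
    weights = [math.exp(-0.5 * ((i - mean) / std) ** 2) for i in range(num_slices)]
    total = sum(weights)
    return [w / total for w in weights]


def bell_curve_penalty(attention, mean=None, std=None, eps=1e-12):
    """KL(attention || Gaussian prior) for one volume's slice attention weights.

    `attention` must be non-negative and sum to 1. The defaults (center of the
    volume, std of one sixth of the slice count) are hypothetical choices.
    """
    n = len(attention)
    mean = (n - 1) / 2 if mean is None else mean
    std = n / 6 if std is None else std
    prior = gaussian_prior(n, mean, std)
    return sum(a * math.log((a + eps) / (p + eps)) for a, p in zip(attention, prior))


num_slices = 9
bell = gaussian_prior(num_slices, mean=4.0, std=1.5)
uniform = [1.0 / num_slices] * num_slices
edge_spike = [1.0] + [0.0] * (num_slices - 1)
```

In training, such a term would be added to the classification loss with a weighting coefficient, so attention that already follows the bell shape incurs almost no extra cost while flat or edge-concentrated attention is penalized.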

5arXiv · cs.LG·11d ago·source ↗

TREAD: VLM-based re-labelling framework improves robot policy generalization via dataset augmentation

TREAD (Task Robustness via Re-Labelling Vision-Action Robot Data) is a scalable framework that uses pretrained Vision-Language Models to augment existing robotics datasets without new data collection. The approach decomposes demonstrations into sub-tasks, segments videos accordingly, and generates linguistically diverse instruction labels, enriching language-action pair diversity. Evaluations on the LIBERO benchmark show improved generalization to novel tasks and goals, addressing a key limitation of current robot learning policies.

Agent and Tool Ecosystem Multimodal Progress TREAD LIBERO

4Hugging Face Blog·1mo ago·source ↗

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.

Evaluation and Benchmarking Multimodal Progress Visual Question Answering LAVE Hugging Face +1 more