4 · arXiv cs.AI (Artificial Intelligence) · 47h ago

SARLO-80: Large-scale VHR SAR-optical-text dataset for multimodal foundation model training

Researchers from ONERA have released SARLO-80, a dataset of 119,566 triplets that combine very-high-resolution complex SAR imagery, aligned optical patches, and natural-language captions, covering 257 locations across 72 countries. The dataset is built from Umbra spotlight acquisitions standardized to an 80 cm slant-range grid, with three caption variants per sample to support vision-language training and evaluation. It addresses a recognized gap in SAR-optical multimodal resources, which have historically been limited to low-resolution, intensity-only products. The dataset and preprocessing code are publicly available on the Hugging Face Hub.
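
For readers unfamiliar with the difference between complex and intensity-only SAR products, the following minimal sketch shows what an intensity-only product discards. The function name `split_complex_pixel` and the sample values are hypothetical and are not taken from the SARLO-80 release; real data would normally be processed as arrays rather than single samples.

```python
import cmath
import math


def split_complex_pixel(z: complex) -> dict:
    """Decompose one complex SAR sample into amplitude, intensity, and phase."""
    amplitude = abs(z)              # magnitude of the complex sample
    intensity = amplitude ** 2      # what an intensity-only product keeps
    phase = cmath.phase(z)          # radians in (-pi, pi]; lost in intensity-only data
    # Intensity is often viewed on a decibel scale; guard against log(0).
    intensity_db = 10.0 * math.log10(intensity) if intensity > 0 else float("-inf")
    return {
        "amplitude": amplitude,
        "intensity": intensity,
        "intensity_db": intensity_db,
        "phase": phase,
    }


# Two hypothetical samples with equal intensity but different phase.
a = split_complex_pixel(complex(3.0, 4.0))
b = split_complex_pixel(complex(-4.0, 3.0))
```

Both samples have the same intensity, so an intensity-only product cannot tell them apart, while the complex representation still separates them by phase.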

Evaluation and Benchmarking · Multimodal Progress · Umbra · SARLO-80 · Hugging Face · ONERA

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.AI · 4d ago

FusionRS: Large-scale RGB-infrared-text dataset for dual-modal remote sensing vision-language models

Researchers introduce FusionRS, which they describe as the first large-scale dataset pairing RGB and infrared remote sensing images with both conventional and IR-aware text captions, designed to support dual-modal vision-language learning. The dataset is constructed by converting public RGB remote sensing images into infrared-style counterparts through image translation. Using FusionRS, the authors train CLIP-style alignment models and fine-tune generative VLMs, reporting improvements in RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only baselines. The work addresses a gap in multimodal remote sensing foundation models by providing modality-specific textual supervision for infrared imagery.

Evaluation and Benchmarking · Multimodal Progress · CLIP · FusionRS

4 · Hugging Face Blog · 1mo ago

Visual Salamandra: Pushing the Boundaries of Multimodal Understanding

BSC-LT (Barcelona Supercomputing Center Language Technologies) has released Visual Salamandra, a 7B multimodal model announced via the Hugging Face blog. The post describes a vision-language model that builds on the Salamandra language model family. Because the source item has no body text, specific capability details and benchmark results are not available from this item alone.

Open Weights Progress · Multimodal Progress · Visual Salamandra · Barcelona Supercomputing Center Language Technologies · Hugging Face · +1 more

7 · arXiv · cs.AI · 22d ago

GPIC: Stanford Releases 28-Trillion-Pixel Permissively Licensed Image Corpus for Visual Generation Research

Stanford Vision Lab introduces GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels comprising 100M training, 200K validation, and 1M test images, all permissively licensed for research and commercial use. Images are captioned by a state-of-the-art vision-language model, safety-filtered, deduplicated, and hosted on Hugging Face. The release includes a benchmarking protocol for generative modeling and a reference baseline using pixel-space flow matching. The dataset addresses a key gap in scalable visual generative modeling research by providing a large, stable, and openly licensed resource.

Training Infrastructure · Evaluation and Benchmarking · GPIC · Stanford Vision Lab · Flow Matching · +3 more

3 · Hugging Face Blog · 1mo ago

Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions

This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.

Multimodal Progress · Hugging Face · OpenAI · RSICD · +1 more

6 · arXiv · cs.CL · 9d ago

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking · Alignment and RLHF · OpenMedReason · OpenMedReason-Bench · +1 more

6 · DeepSeek · 11d ago

DeepSeek releases DeepSeek-OCR-2 vision-language model on Hugging Face

DeepSeek has released DeepSeek-OCR-2, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture and tagged for OCR and vision-language tasks. The model has accumulated over 1.8 million downloads and 980 likes, indicating substantial community uptake. It extends DeepSeek's multimodal model lineup with a specialized document/OCR capability.

Open Weights Progress · Multimodal Progress · DeepSeek-OCR-2 · DeepSeek V4 · Hugging Face

7 · Meta Llama · 11d ago

Meta releases Llama 3.2 90B Vision multimodal model on Hugging Face

Meta released Llama 3.2 90B Vision, a large multimodal model supporting image-text-to-text tasks, published on Hugging Face under the meta-llama organization. The model is part of the Llama 3.2 family and supports English, German, and French. This is a significant open-weights multimodal release from Meta, extending the Llama 3 series with vision capabilities at the 90B parameter scale.

Frontier Model Releases · Open Weights Progress · Llama 3.2 90B Vision · Hugging Face · Meta · +1 more

5 · Hugging Face Blog · 1mo ago

Visual Document Retrieval Goes Multilingual

Hugging Face introduces VDR-2B-Multilingual, a 2-billion parameter vision-language model designed for visual document retrieval across multiple languages. The model enables retrieval of document images without OCR by embedding visual page representations directly. This extends prior visual document retrieval work to multilingual settings, broadening applicability for enterprise document search use cases.
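
The OCR-free retrieval idea described above can be sketched as ranking page embeddings by similarity to a query embedding. This is a minimal illustration only: the vectors, page identifiers, and the helpers `cosine` and `rank_pages` are hypothetical, and in practice the embeddings would come from the vision-language model rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Return 0.0 for a zero vector instead of dividing by zero.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_pages(query_vec, page_vecs):
    """Return (page_id, score) pairs sorted from most to least similar."""
    scored = [(page_id, cosine(query_vec, vec)) for page_id, vec in page_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings of document page images (no OCR text involved).
pages = {
    "invoice_p1": [0.9, 0.1, 0.0],
    "manual_p7": [0.1, 0.8, 0.3],
    "report_p3": [0.5, 0.5, 0.1],
}
query = [1.0, 0.2, 0.0]  # hypothetical embedding of a text query

ranking = rank_pages(query, pages)
```

Because the query and the pages live in the same embedding space, the query language does not need to match the language printed on the page, which is what makes a multilingual setting possible in principle.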

Enterprise Deployment Patterns · Multimodal Progress · OCR-free document embedding · visual document retrieval · Hugging Face · +1 more