4arXiv cs.AI (Artificial Intelligence)·4d ago

FusionRS: Large-scale RGB-infrared-text dataset for dual-modal remote sensing vision-language models

Researchers introduce FusionRS, the first large-scale dataset pairing RGB and infrared remote sensing images with both conventional and IR-aware text captions, designed to support dual-modal vision-language learning. The dataset is constructed by applying image translation to public RGB remote sensing images to produce infrared-style counterparts. Using FusionRS, the authors train CLIP-style alignment models and fine-tune generative VLMs, demonstrating improvements in RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only baselines. The work addresses a gap in multimodal remote sensing foundation models by providing modality-specific textual supervision for infrared imagery.
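
For readers unfamiliar with "CLIP-style alignment", the sketch below shows the symmetric contrastive loss commonly associated with CLIP, applied to two batches of paired embeddings (for example, infrared image embeddings and caption embeddings). It is an illustration under assumptions, not the FusionRS training code: the function names, the temperature of 0.07, and the toy 2-D embeddings are all hypothetical, and the summary does not state the authors' exact objective.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss over paired embeddings.

    emb_a[i] and emb_b[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    The temperature value is an illustrative assumption.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy example: matched pairs should score a lower loss than swapped pairs.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb_matched = [[1.0, 0.0], [0.0, 1.0]]
text_emb_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clip_style_loss(image_emb, text_emb_matched)
loss_swapped = clip_style_loss(image_emb, text_emb_swapped)
```

With these toy inputs the matched loss is close to zero and the swapped loss is large, which is the behavior a contrastive objective relies on to pull paired image and text embeddings together.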

Evaluation and Benchmarking Multimodal Progress CLIP FusionRS

Related guides (2)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

3Hugging Face Blog·1mo ago·source ↗

Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions

This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.

Multimodal Progress Hugging Face OpenAI RSICD +1 more

4arXiv · cs.AI·46h ago·source ↗

SARLO-80: Large-scale VHR SAR-optical-text dataset for multimodal foundation model training

Researchers from ONERA release SARLO-80, a dataset of 119,566 triplets combining very-high-resolution complex SAR imagery, aligned optical patches, and natural-language captions covering 257 locations across 72 countries. The dataset is built from Umbra spotlight acquisitions standardized to an 80 cm slant-range grid, with three caption variants per sample to support vision-language training and evaluation. It addresses a recognized gap in SAR-optical multimodal resources, which have historically been limited to low-resolution intensity-only products. The dataset and preprocessing code are publicly released on Hugging Face Hub.

Evaluation and Benchmarking Multimodal Progress Umbra SARLO-80 Hugging Face +1 more

4arXiv · cs.CL·4d ago·source ↗

RDS Fusion: Hybrid neuro-symbolic gating with compressed CoT for zero-shot irony detection

Researchers introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought reasoning without supervised fine-tuning for irony and sarcasm detection in social media text. Evaluated on TweetEval (N=734) and iSarcasm, the zero-shot system matches fine-tuned BERTweet performance and outperforms supervised SemEval transformer ensembles on the imbalanced iSarcasm dataset. A statistical ablation shows that only the full concurrent fusion of all three signals yields a validated improvement, with individual components providing no significant standalone gain.

Evaluation and Benchmarking TweetEval BERTweet Robust Dual-Signal Fusion +1 more

5arXiv · cs.CL·46h ago·source ↗

RefRad2D dataset and RadGrounder model enable spatially grounded radiology VLMs without manual annotations

Researchers introduce RefRad2D, a 1.2M-pair bilingual (German/English) CT and MR image-text dataset generated automatically via LLM curation and automated segmentation, requiring no manual spatial annotations. The accompanying RadGrounder model jointly performs report generation, VQA, and spatial grounding via bounding-box or segmentation outputs. On external benchmarks Slake and VQA-RAD, RadGrounder matches specialized medical VLMs while adding grounding supervision without degrading language quality. The work demonstrates that large-scale automatically curated clinical data can transfer to downstream medical VQA tasks.

Evaluation and Benchmarking Multimodal Progress RefRad2D Slake RadGrounder +1 more

5Hugging Face Blog·1mo ago·source ↗

Visual Document Retrieval Goes Multilingual

Hugging Face introduces VDR-2B-Multilingual, a 2-billion-parameter vision-language model designed for visual document retrieval across multiple languages. The model retrieves document images without OCR by embedding visual page representations directly. This extends prior visual document retrieval work to multilingual settings, broadening applicability for enterprise document search use cases.

Enterprise Deployment Patterns Multimodal Progress OCR-free document embedding visual document retrieval Hugging Face +1 more

5Hugging Face Blog·1mo ago·source ↗

LeRobot Community Datasets: The "ImageNet" of Robotics — When and How?

Hugging Face's LeRobot blog post discusses the vision and current state of building a large-scale community robotics dataset analogous to ImageNet for computer vision. The post examines what it would take to create a standardized, scalable dataset repository for robot learning, drawing on the LeRobot ecosystem. It addresses data collection formats, community contribution workflows, and the open challenges in making such a resource practically useful for training generalizable robot policies.

Evaluation and Benchmarking Open Weights Progress LeRobot Hugging Face ImageNet +1 more

6arXiv · cs.CL·22d ago·source ↗

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking Alignment and RLHF LoMo LLaVA-OneVision-1.5-8B Qwen3-4B +3 more

5arXiv · cs.CL·12d ago·source ↗

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.

Evaluation and Benchmarking Multimodal Progress DOCCI MS COCO IIW +4 more