5 · arXiv cs.AI (Artificial Intelligence) · 16d ago

FINO: Label-free adaptation of vision foundation models using metadata in scientific domains

Researchers propose FINO, a self-supervised method for adapting vision foundation models to specialized scientific domains without task labels, using metadata as a guidance signal instead. The approach combines a standard self-supervised objective with flexible handling of both discrete and continuous metadata to preserve informative factors while suppressing spurious ones. Evaluated across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO outperforms both unsupervised domain adaptation and fully supervised fine-tuning, including domain-specific state-of-the-art models.

Evaluation and Benchmarking · FINO · Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · Hugging Face Blog · 1mo ago

Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

This Hugging Face blog post provides a technical guide for fine-tuning Microsoft's Florence-2 vision-language models. Florence-2 is a compact yet capable multimodal model supporting tasks like captioning, object detection, and OCR. The post covers practical implementation details for adapting the model to custom datasets using the Hugging Face ecosystem.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Microsoft · Hugging Face · Florence-2

6 · arXiv cs.LG · 23d ago

Label-Free Bias Identification in Vision Models via Gradient Probes on Concept Decompositions

This paper introduces a post-hoc, label-free method for identifying spurious correlations in frozen vision classifiers without requiring bias annotations, group labels, or retraining. The approach applies non-negative matrix factorization to intermediate activations to extract interpretable concept vectors, then ranks them using a gradient-based bias estimator derived from misclassified examples. On Colored MNIST, Waterbirds, and CelebA benchmarks, the method recovers known spurious cues and improves worst-group accuracy by up to 17.9 percentage points on Waterbirds by suppressing top-ranked concepts at inference time. Notably, the method surfaces decision-relevant directions that do not always coincide with annotated attributes, offering both an auditing tool and a debiasing handle for deployed models.

Evaluation and Benchmarking · AI Safety Research · Colored MNIST · Waterbirds · CelebA
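
The worst-group accuracy figure quoted in the bias identification summary above can be illustrated with a minimal sketch. This is not the paper's code: the function name, the toy labels, and the group names are all hypothetical, and the only assumption is the usual definition of the metric (the lowest per-group accuracy across groups).

```python
from collections import defaultdict


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst accuracy, per-group accuracy dict).

    `groups` assigns each example to a subgroup, for example a
    (class, background) pair on a Waterbirds-style benchmark.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Hypothetical toy data: the model does well on the majority group
# and poorly on the minority group.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1]
groups = ["majority"] * 4 + ["minority"] * 2

worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
print(worst)      # 0.5
print(per_group)  # {'majority': 1.0, 'minority': 0.5}
```

Overall accuracy on this toy data is 5/6, which hides the weaker minority group; reporting the minimum over groups makes that gap visible.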

4 · Hugging Face Blog · 1mo ago

Finetuning olmOCR to be a faithful OCR-Engine

TNG Technology Consulting describes a fine-tuning approach applied to olmOCR, a vision-language model designed for document OCR tasks, to improve its faithfulness and reduce hallucinations. The post covers dataset construction, training methodology, and evaluation results showing improved accuracy on document extraction benchmarks. This represents a practical community contribution to the open-weights document-understanding ecosystem.

Open Weights Progress · Agent and Tool Ecosystem · Hugging Face · olmOCR · TNG Technology Consulting

3 · Hugging Face Blog · 1mo ago

Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions

This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.

Multimodal Progress · Hugging Face · OpenAI · RSICD

6 · arXiv cs.CL · 1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Large Language Models · Thinking-with-Images · on-policy self-distillation
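
To make the "token-level divergence" in the Vision-OPD summary above more concrete, here is a rough sketch of a per-token KL divergence between a student and a teacher distribution, averaged over sequence positions. The summary does not say which divergence or direction the paper uses, so KL(student || teacher) is an assumption, and all names and logits below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def token_level_kl(student_logits, teacher_logits):
    """Mean over positions of KL(student || teacher).

    Both arguments are lists of per-position logit vectors over the
    same vocabulary. The direction of the divergence is an assumption.
    """
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(s_row)  # full-image-conditioned student
        q = softmax(t_row)  # crop-conditioned teacher
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_logits)


# Hypothetical logits for a 2-token rollout over a 3-word vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 1.5]]
teacher = [[2.5, 0.2, 0.1], [0.1, 0.2, 2.0]]
print(token_level_kl(student, teacher))
```

In a self-distillation setup of the kind described, both sets of logits would come from the same model under different conditioning, and a loss like this would be minimized with respect to the student only.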

5 · Meta AI Blog · 1mo ago

Meta and World Resources Institute Release Canopy Height Maps v2 Using DINOv3 Self-Supervised Vision Model

Meta AI and the World Resources Institute have released Canopy Height Maps v2 (CHMv2), an open-source global forest mapping system powered by DINOv3, Meta's self-supervised vision model pre-trained on SAT-493M, a large satellite imagery dataset. The new model raises R² from 0.53 to 0.86 relative to the previous DINOv2-based version, with better performance on tall trees and greater geographic consistency. CHMv2 is already being adopted by the UK Forestry Commission, the European Commission's Joint Research Centre, and multiple US city planning initiatives. The model, maps, and dataset are publicly available.

Open Weights Progress · Multimodal Progress · Meta AI · DINOv2 · DINOv3
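
For context on the R² values reported in the canopy height summary above, the following is a minimal sketch of the standard coefficient of determination. It assumes the usual definition (one minus the ratio of residual to total sum of squares); the exact evaluation protocol behind the reported numbers is not described here, and the height values are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are equal")
    return 1 - ss_res / ss_tot


# Hypothetical canopy heights in metres (reference vs. predicted).
reference = [5.0, 12.0, 20.0, 31.0]
predicted = [6.0, 11.0, 22.0, 28.0]
print(r_squared(reference, predicted))
```

A value of 1.0 means predictions match the reference exactly, while 0.0 means the model does no better than always predicting the mean reference height.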

5 · Hugging Face Blog · 1mo ago

Can Foundation Models Label Data Like Humans?

This Hugging Face blog post examines whether foundation models can serve as substitutes for human annotators in RLHF data labeling pipelines. It investigates the reliability and quality of model-generated preference labels compared to human-generated ones, with implications for scalable oversight and alignment research. The analysis is framed around the Open LLM Leaderboard and RLHF methodology.

Evaluation and Benchmarking · Alignment and RLHF · Reinforcement Learning from Human Feedback · Open LLM Leaderboard · Hugging Face

6 · arXiv cs.AI · 10d ago

FADA: Unified vision-language model for fetal ultrasound interpretation deployable on consumer smartphones

FADA is a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation of fetal ultrasound images through a single pipeline without requiring external labels at inference. The system distills knowledge from four domain-specific foundation models using selective distillation, achieving 0.8820 mean Dice for segmentation and 0.7671 mAP@0.50 for detection, with expert validation confirming clinically acceptable outputs. Notably, the compressed 0.8B model runs entirely offline on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1) in approximately 60 seconds, targeting diagnostic access gaps in low- and middle-income countries where trained sonographers are scarce. Code, models, and data are publicly released.

Inference Economics · Multimodal Progress · USF-MAE · FetalCLIP · Qwen3-4B
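
The mean Dice score quoted in the FADA summary above is a standard overlap measure for segmentation masks. A minimal sketch under the usual definition, 2|A ∩ B| / (|A| + |B|), follows; the function name and the tiny masks are hypothetical, and treating two empty masks as a perfect match is a convention chosen here, not something stated in the summary.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice overlap between two binary masks given as nested lists."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    intersection = sum(p and t for p, t in zip(pred, true))
    denom = sum(pred) + sum(true)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2 * intersection / denom


# Hypothetical 3x3 masks: the prediction covers 3 pixels,
# the reference covers 4, and they share 3.
pred_mask = [[1, 1, 0],
             [0, 1, 0],
             [0, 0, 0]]
true_mask = [[1, 1, 0],
             [0, 1, 1],
             [0, 0, 0]]
print(dice_coefficient(pred_mask, true_mask))  # 2*3 / (3+4) = 0.857...
```

A mean Dice would then be the average of this value over all evaluated images or structures.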