FINO: Label-free adaptation of vision foundation models using metadata in scientific domains
Researchers propose FINO, a self-supervised method for adapting vision foundation models to specialized scientific domains without task labels, using metadata as a guidance signal instead. The approach combines a standard self-supervised objective with flexible handling of both discrete and continuous metadata to preserve informative factors while suppressing spurious ones. Evaluated across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO outperforms both unsupervised domain adaptation and fully supervised fine-tuning, including domain-specific state-of-the-art models.
Related events (8)
Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models
This Hugging Face blog post provides a technical guide for fine-tuning Microsoft's Florence-2 vision-language models. Florence-2 is a compact yet capable multimodal model supporting tasks like captioning, object detection, and OCR. The post covers practical implementation details for adapting the model to custom datasets using the Hugging Face ecosystem.
Label-Free Bias Identification in Vision Models via Gradient Probes on Concept Decompositions
This paper introduces a post-hoc, label-free method for identifying spurious correlations in frozen vision classifiers without requiring bias annotations, group labels, or retraining. The approach applies non-negative matrix factorization to intermediate activations to extract interpretable concept vectors, then ranks them using a gradient-based bias estimator derived from misclassified examples. On Colored MNIST, Waterbirds, and CelebA benchmarks, the method recovers known spurious cues and improves worst-group accuracy by up to 17.9 percentage points on Waterbirds by suppressing top-ranked concepts at inference time. Notably, the method surfaces decision-relevant directions that do not always coincide with annotated attributes, offering both an auditing tool and a debiasing handle for deployed models.
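For readers unfamiliar with the metric, worst-group accuracy is the lowest accuracy over a set of predefined subgroups (on Waterbirds, the combinations of bird class and background). The sketch below only illustrates how the metric is computed; it is not the paper's code, and the function name, group names, and toy data are hypothetical.

```python
from collections import defaultdict


def worst_group_accuracy(labels, preds, groups):
    """Return (worst accuracy, per-group accuracies) over the given groups."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Hypothetical toy data: each group is "class/background".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
preds = [0, 0, 0, 1, 1, 1, 0, 0]
groups = [
    "land/land", "land/land", "land/water", "land/water",
    "water/water", "water/water", "water/land", "water/land",
]

worst, per_group = worst_group_accuracy(labels, preds, groups)
print(worst, per_group)
```

Average accuracy can hide a failing subgroup: in this toy data the overall accuracy is 5/8, while one group is entirely misclassified.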
Finetuning olmOCR to be a faithful OCR-Engine
TNG Technology Consulting describes a fine-tuning approach applied to olmOCR, a vision-language model designed for document OCR tasks, to improve its faithfulness and reduce hallucinations. The post covers dataset construction, training methodology, and evaluation results showing improved accuracy on document extraction benchmarks. This represents a practical community contribution to the open-weights document-understanding ecosystem.
Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions
This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.
Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
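As a rough illustration of what "minimizing token-level divergence" between a teacher and a student can mean, the sketch below averages a per-token KL divergence between two sets of next-token logits. It rests on assumptions not stated in the summary: the divergence is taken to be KL(teacher || student), and the logits are toy values. The function names are hypothetical and this is not the Vision-OPD implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def token_level_distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) across token positions (assumed direction)."""
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy logits: 2 token positions, vocabulary of 3 (hypothetical values).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # stands in for the crop-conditioned pass
student = [[1.0, 1.0, 0.0], [0.5, 0.5, 1.0]]   # stands in for the full-image pass

loss = token_level_distillation_loss(teacher, student)
same = token_level_distillation_loss(teacher, teacher)
print(loss, same)
```

In an actual training loop, the teacher logits would be treated as constants and gradients would flow only through the student; that detail is omitted here.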
Meta and World Resources Institute Release Canopy Height Maps v2 Using DINOv3 Self-Supervised Vision Model
Meta AI and the World Resources Institute have released Canopy Height Maps v2 (CHMv2), an open-source global forest mapping system powered by DINOv3, Meta's self-supervised vision model pre-trained on SAT-493M, a large satellite imagery dataset. The new model raises R² from 0.53 to 0.86 relative to the previous DINOv2-based version, with better performance on tall trees and greater geographic consistency. CHMv2 is already being adopted by the UK Forestry Commission, the European Commission's Joint Research Centre, and multiple US city planning initiatives. The model, maps, and dataset are publicly available.
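For context on the reported metric, R² is commonly computed as the coefficient of determination, 1 minus the ratio of residual to total sum of squares. The sketch below assumes that definition (the release may use a different variant, such as squared correlation); the function name and the canopy heights are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical canopy heights in metres: reference vs. model estimate.
reference = [10.0, 20.0, 30.0, 40.0]
estimate = [12.0, 18.0, 33.0, 37.0]

score = r_squared(reference, estimate)
print(round(score, 3))
```

Under this definition, a score of 1.0 means the estimates match the reference exactly, and a score of 0 means they do no better than always predicting the mean height.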
Can Foundation Models Label Data Like Humans?
This Hugging Face blog post examines whether foundation models can serve as substitutes for human annotators in RLHF data labeling pipelines. It investigates the reliability and quality of model-generated preference labels compared to human-generated ones, with implications for scalable oversight and alignment research. The analysis is framed around the Open LLM Leaderboard and RLHF methodology.
FADA: Unified vision-language model for fetal ultrasound interpretation deployable on consumer smartphones
FADA is a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation of fetal ultrasound images through a single pipeline without requiring external labels at inference. The system distills knowledge from four domain-specific foundation models using selective distillation, achieving 0.8820 mean Dice for segmentation and 0.7671 mAP@0.50 for detection, with expert validation confirming clinically acceptable outputs. Notably, the compressed 0.8B model runs entirely offline on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1) in approximately 60 seconds, targeting diagnostic access gaps in low- and middle-income countries where trained sonographers are scarce. Code, models, and data are publicly released.
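The segmentation figure above is a mean Dice score. As a minimal sketch of how such a score is typically computed for binary masks, the code below uses flat lists of 0/1 values; the function names and toy masks are hypothetical, and the convention for two empty masks (score 1.0) is an assumption that evaluation protocols handle differently.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pairs):
    """Average Dice over (prediction, target) mask pairs."""
    scores = [dice_coefficient(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


# Hypothetical toy masks (flattened).
pairs = [
    ([1, 1, 0, 0], [1, 0, 1, 0]),  # half overlap
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect overlap
]
result = mean_dice(pairs)
print(result)
```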
