Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models
This Hugging Face blog post provides a technical guide for fine-tuning Microsoft's Florence-2 vision-language models. Florence-2 is a compact yet capable multimodal model supporting tasks like captioning, object detection, and OCR. The post covers practical implementation details for adapting the model to custom datasets using the Hugging Face ecosystem.
Related events (8)
Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation
This Hugging Face blog post details a workflow for fine-tuning NVIDIA's Cosmos Predict 2.5 world model using LoRA and DoRA parameter-efficient techniques for robot video generation tasks. The post covers practical implementation steps for adapting the foundation video model to robotics-specific domains. This represents a concrete application of world models to embodied AI, where synthetic video generation can support robot training data pipelines.
Vision Language Models (Better, faster, stronger)
A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a commentary piece, it synthesizes the current landscape rather than announcing new research.
A Dive into Vision-Language Models
This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.
Fine-Tuning Gemma Models in Hugging Face
Hugging Face published a guide on fine-tuning Google's Gemma models using parameter-efficient fine-tuning (PEFT) techniques. The post covers practical workflows for adapting Gemma to downstream tasks within the Hugging Face ecosystem. It is part of the broader tooling rollout that accompanied Gemma's release in February 2024.
Finetuning olmOCR to be a faithful OCR-Engine
TNG Technology Consulting describes a fine-tuning approach applied to olmOCR, a vision-language model designed for document OCR tasks, to improve its faithfulness and reduce hallucinations. The post covers dataset construction, training methodology, and evaluation results showing improved accuracy on document extraction benchmarks. This represents a practical community contribution to the open-weights document-understanding ecosystem.
Falcon LLM Integrated into Hugging Face Ecosystem
Hugging Face announced the integration of the Falcon language models (Falcon-7B and Falcon-40B) into its ecosystem, including model hosting, inference APIs, and tooling support. Falcon, developed by the Technology Innovation Institute (TII), topped the Open LLM Leaderboard at the time of release. The post covers usage patterns, fine-tuning guidance, and deployment options within the Hugging Face stack.
FineVideo: Behind the Scenes — HuggingFace Video Dataset Release
Hugging Face published a behind-the-scenes account of FineVideo, a curated dataset aimed at advancing video understanding in AI/ML models. The post details the data collection, annotation, and curation methodology used to build the dataset. FineVideo is positioned as a resource for training and evaluating multimodal video models.
Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions
This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.



