Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model
Hugging Face released IDEFICS, an open-weights reproduction of DeepMind's Flamingo visual language model. The release aims to provide the research community with an accessible, open alternative to proprietary multimodal models. IDEFICS supports image-text interleaved inputs and is available on the Hugging Face Hub.
Related events (8)
Introducing Idefics2: A Powerful 8B Vision-Language Model for the Community
Hugging Face introduces Idefics2, an 8-billion-parameter open vision-language model released for the community. The model is positioned as a capable multimodal system combining vision and language understanding. As an open-weights release from a major AI platform, it contributes to the growing ecosystem of accessible multimodal models.
Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models
This Hugging Face blog post provides a technical guide for fine-tuning Microsoft's Florence-2 vision-language models. Florence-2 is a compact yet capable multimodal model supporting tasks like captioning, object detection, and OCR. The post covers practical implementation details for adapting the model to custom datasets using the Hugging Face ecosystem.
DeepSeek releases DeepSeek-OCR vision-language model on Hugging Face
DeepSeek has released DeepSeek-OCR, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture. The model targets OCR and image feature extraction tasks and has accumulated over 2.4 million downloads and 3,275 likes, indicating significant community uptake. This represents an open-weights multimodal release from a major Chinese AI lab.
Hugging Face open reproduction of DeepSeek-R1
Hugging Face has published an open reproduction of DeepSeek-R1, the reasoning-focused language model, on GitHub. The project aims to replicate DeepSeek-R1's training methodology and capabilities in a fully open setting. This contributes to the broader effort to make frontier reasoning model techniques accessible to the research community.
DeepSeek releases DeepSeek-OCR-2 vision-language model on Hugging Face
DeepSeek has released DeepSeek-OCR-2, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture and tagged for OCR and vision-language tasks. The model has accumulated over 1.8 million downloads and 980 likes, indicating substantial community uptake. It extends DeepSeek's multimodal model lineup with a specialized document/OCR capability.
A Dive into Vision-Language Models
This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.
Vision Language Models (Better, faster, stronger)
A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a commentary piece, it synthesizes the current landscape rather than announcing new research.
The State of Computer Vision at Hugging Face
Hugging Face published a survey of the computer vision ecosystem available through its platform as of early 2023, covering supported model architectures, tasks, datasets, and tooling. The post reviews progress in image classification, object detection, segmentation, and multimodal vision-language models integrated into the Transformers library. It serves as a reference for practitioners on which CV capabilities are accessible via the Hugging Face Hub and APIs.


