Introducing Idefics2: A Powerful 8B Vision-Language Model for the Community
Hugging Face introduces Idefics2, an 8-billion-parameter open vision-language model released for the community. The model is positioned as a capable multimodal system that combines vision and language understanding. As an open-weights release from a major AI platform, it adds to the growing ecosystem of accessible multimodal models.
Related events (8)
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model
Hugging Face released IDEFICS, an open-weights reproduction of DeepMind's Flamingo visual language model. The release aims to give the research community an accessible, open alternative to proprietary multimodal models. IDEFICS supports interleaved image-text inputs and is available on the Hugging Face Hub.
Qwen2.5-VL-32B: Reinforcement-Learning-Optimized Vision-Language Model Released
Alibaba's Qwen team has released Qwen2.5-VL-32B-Instruct, a 32-billion-parameter vision-language model built on the Qwen2.5-VL series and further optimized with reinforcement learning. The model is open-sourced under the Apache 2.0 license and available on Hugging Face and ModelScope. It follows the January 2025 launch of the broader Qwen2.5-VL series, positioning the 32B scale as a balance between capability and deployment practicality.
Vision Language Models (Better, faster, stronger)
A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. It reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a commentary piece, it synthesizes the current landscape rather than announcing new research.
DeepSeek releases DeepSeek-OCR vision-language model on Hugging Face
DeepSeek has released DeepSeek-OCR, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture. The model targets OCR and image feature extraction tasks and has accumulated over 2.4 million downloads and 3,275 likes, indicating significant community uptake. This represents an open-weights multimodal release from a major Chinese AI lab.
DeepSeek releases DeepSeek-OCR-2 vision-language model on Hugging Face
DeepSeek has released DeepSeek-OCR-2, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture and tagged for OCR and vision-language tasks. The model has accumulated over 1.8 million downloads and 980 likes, indicating substantial community uptake. It extends DeepSeek's multimodal model lineup with a specialized document/OCR capability.
SigLIP 2: A better multilingual vision language encoder
Google releases SigLIP 2, an improved multilingual vision-language encoder, announced via the Hugging Face blog. The update targets better multilingual understanding and vision-language alignment than the original SigLIP. The post appears to cover architectural improvements and benchmark results for this encoder, which is commonly used as a backbone in multimodal systems.
Visual Document Retrieval Goes Multilingual
A Hugging Face blog post introduces VDR-2B-Multilingual, a 2-billion-parameter vision-language model designed for visual document retrieval across multiple languages. The model retrieves document images without OCR by embedding visual page representations directly. This extends prior visual document retrieval work to multilingual settings, broadening its applicability to enterprise document search.
Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2
This Hugging Face blog post covers the deployment and acceleration of BridgeTower, a vision-language model, on Intel's Habana Gaudi2 AI accelerator hardware. The piece likely benchmarks inference throughput and training performance on Gaudi2 compared to other hardware. It represents a practical infrastructure and deployment case study for multimodal models on alternative AI accelerators.


