Almanac
← Events
4Hugging Face Blog·1mo ago

The State of Computer Vision at Hugging Face

Hugging Face published a survey of the computer vision ecosystem available through its platform as of early 2023, covering supported model architectures, tasks, datasets, and tooling. The post reviews progress in image classification, object detection, segmentation, and multimodal vision-language models integrated into the Transformers library. It serves as a reference for practitioners on what CV capabilities are accessible via the Hugging Face hub and APIs.

Related guides (3)

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

4Hugging Face Blog·1mo ago·source ↗

Universal Image Segmentation with Mask2Former and OneFormer

Hugging Face published a blog post introducing Mask2Former and OneFormer, two universal image segmentation architectures now available in the Transformers library. These models unify panoptic, instance, and semantic segmentation tasks under a single framework. The post covers model capabilities, usage examples, and integration into the HuggingFace ecosystem.

7Hugging Face Blog·1mo ago·source ↗

Transformers v5: Simple model definitions powering the AI ecosystem

Hugging Face has announced Transformers v5, a major version update to its flagship open-source library. The release focuses on simplified model definitions and architectural improvements to the codebase. As one of the most widely used ML libraries in the ecosystem, this update has broad implications for researchers and practitioners building on top of the Transformers framework.

6Deepseek·11d ago·source ↗

DeepSeek releases DeepSeek-OCR vision-language model on Hugging Face

DeepSeek has released DeepSeek-OCR, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture. The model targets OCR and image feature extraction tasks and has accumulated over 2.4 million downloads and 3,275 likes, indicating significant community uptake. This represents an open-weights multimodal release from a major Chinese AI lab.

6Deepseek·11d ago·source ↗

DeepSeek releases DeepSeek-OCR-2 vision-language model on Hugging Face

DeepSeek has released DeepSeek-OCR-2, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture and tagged for OCR and vision-language tasks. The model has accumulated over 1.8 million downloads and 980 likes, indicating substantial community uptake. It extends DeepSeek's multimodal model lineup with a specialized document/OCR capability.

6Hugging Face Blog·1mo ago·source ↗

Hugging Face and Google Partner for Open AI Collaboration

Hugging Face and Google have announced a partnership focused on open AI collaboration, expanding access to Hugging Face models and tools on Google Cloud Platform. The deal deepens integration between Hugging Face's model hub and Google's cloud infrastructure, enabling easier deployment of open-source models via GCP services. This follows a pattern of major cloud providers forming strategic alliances with leading open-source AI platforms.

3Hugging Face Blog·1mo ago·source ↗

An Overview of Inference Solutions on Hugging Face

Hugging Face published a blog post surveying its inference product offerings as of late 2022. The post covers the range of hosted and API-based inference solutions available on the platform, aimed at helping developers choose appropriate deployment paths. This serves as a reference overview of Hugging Face's inference infrastructure ecosystem at that time.

4Hugging Face Blog·1mo ago·source ↗

A Dive into Vision-Language Models

This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.