4·Hugging Face Blog·1mo ago

New ViT and ALIGN Models From Kakao Brain

Kakao Brain has released new Vision Transformer (ViT) and ALIGN models, announced on the Hugging Face blog. The post covers these vision and vision-language models, which Kakao Brain has contributed to the open ecosystem. They expand the open-weights options available for image-text tasks.

Open Weights Progress, Multimodal Progress, ViT (Vision Transformer), Hugging Face, ALIGN, Kakao Brain

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Open Weights Progress·Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Multimodal Progress·Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

6·Hugging Face Blog·1mo ago

Vision Language Model Alignment in TRL

Hugging Face's TRL library has added support for aligning Vision Language Models (VLMs), extending existing RLHF and preference optimization tooling to multimodal settings. The blog post covers the new capabilities for training VLMs with alignment techniques such as DPO and related methods. This expands the open-source ecosystem for multimodal model fine-tuning and alignment.

Open Weights Progress, Agent and Tool Ecosystem, Direct Preference Optimization (DPO), Vision-Language Models, Hugging Face, +3 more

5·Hugging Face Blog·1mo ago

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress, Inference Economics, Vision-Language Models, Hugging Face, +1 more

3·Hugging Face Blog·1mo ago

Vision Language Models Explained

A Hugging Face blog post provides a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.

Multimodal Progress, Vision-Language Models, Hugging Face

4·Hugging Face Blog·1mo ago

A Dive into Vision-Language Models

This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.

Multimodal Progress, Contrastive Language-Image Pretraining (CLIP), Vision-Language Models, Hugging Face

5·Hugging Face Blog·1mo ago

smolagents Now Supports Vision-Language Models

Hugging Face has added vision-language model (VLM) support to its smolagents framework, enabling agents to process and reason over visual inputs alongside text. This update extends the agentic tooling ecosystem to multimodal workflows. The announcement comes from the Hugging Face blog, which serves as the primary communication channel for the smolagents project.

Agent and Tool Ecosystem, Multimodal Progress, Vision-Language Models, Hugging Face, smolagents

4·Hugging Face Blog·1mo ago

The State of Computer Vision at Hugging Face

Hugging Face published a survey of the computer vision ecosystem available through its platform as of early 2023, covering supported model architectures, tasks, datasets, and tooling. The post reviews progress in image classification, object detection, segmentation, and multimodal vision-language models integrated into the Transformers library. It serves as a reference for practitioners on which CV capabilities are accessible via the Hugging Face Hub and APIs.

Agent and Tool Ecosystem, Multimodal Progress, Transformers, Hugging Face

7·Qwen Research·1mo ago

Qwen2.5-VL-32B: Reinforcement-Learning-Optimized Vision-Language Model Released

Alibaba's Qwen team has released Qwen2.5-VL-32B-Instruct, a 32-billion-parameter vision-language model built on the Qwen2.5-VL series and further optimized with reinforcement learning. The model is open-sourced under the Apache 2.0 license and available on Hugging Face and ModelScope. It follows the January 2025 launch of the broader Qwen2.5-VL series, positioning the 32B scale as a balance between capability and deployment practicality.

Open Weights Progress, Inference Economics, Qwen2.5-VL, Qwen2.5-VL-32B-Instruct, Apache 2.0, +5 more

6·Qwen Research·1mo ago

Introducing Qwen-VL-Plus and Qwen-VL-Max: Upgraded Multimodal Models from Alibaba

Alibaba's Qwen team has launched two enhanced versions of its multimodal model, Qwen-VL-Plus and Qwen-VL-Max, building on the open-sourced Qwen-VL released in September 2023. Key improvements include substantially stronger image reasoning, better detail recognition and text extraction from images, and support for high-definition images exceeding one million pixels across various aspect ratios. The upgrades mark a significant step forward in the Qwen-VL series' generalization and visual understanding.

Frontier Model Releases, Open Weights Progress, Qwen-VL, Qwen-VL-Max, Alibaba, +2 more