4 · Hugging Face Blog · 1mo ago

nanoVLM: Minimal Pure-PyTorch Repository for Training Vision-Language Models

Hugging Face published nanoVLM, a minimal open-source repository designed to make training vision-language models (VLMs) as simple as possible using pure PyTorch. The project aims to lower the barrier to entry for VLM research and experimentation by providing a clean, readable codebase without heavy abstractions. It follows in the tradition of educational ML repositories like nanoGPT, targeting researchers and practitioners who want to understand or customize VLM training from scratch.

Open Weights Progress · Agent and Tool Ecosystem · Multimodal Progress · nanoGPT · nanoVLM · Hugging Face · PyTorch

Related guides (4)

Hugging Face

Hugging Face: The Home of Open-Source AI

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Related events (8)

5 · Hugging Face Blog · 1mo ago

SmolVLM - Small Yet Mighty Vision Language Model

Hugging Face introduces SmolVLM, a compact vision-language model designed to deliver strong multimodal performance at small parameter counts. The model targets edge and resource-constrained deployment scenarios while maintaining competitive capabilities relative to its size. The announcement highlights efficiency improvements in both training and inference for small-scale VLMs.

Open Weights Progress · Inference Economics · SmolVLM · Hugging Face · +1 more

4 · Hugging Face Blog · 1mo ago

A Dive into Vision-Language Models

This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.

Multimodal Progress · Contrastive Language-Image Pretraining (CLIP) · Vision-Language Models · Hugging Face

5 · Hugging Face Blog · 1mo ago

NVIDIA Llama Nemotron Nano VLM Released on Hugging Face Hub

NVIDIA has released the Llama Nemotron Nano VLM on Hugging Face Hub, a compact vision-language model built on the Llama architecture. The model is part of NVIDIA's Nemotron family targeting efficient multimodal inference. This release makes the model accessible to the broader research and developer community through Hugging Face's model hosting infrastructure.

Open Weights Progress · Inference Economics · Llama Nemotron Nano VLM · NVIDIA · Hugging Face · +3 more

4 · Hugging Face Blog · 1mo ago

KV Cache from scratch in nanoVLM

This Hugging Face blog post walks through implementing a key-value (KV) cache from scratch within the nanoVLM framework, a minimal vision-language model codebase. The post serves as a technical tutorial explaining how KV caching works in transformer-based multimodal models and how to integrate it for inference efficiency. It targets practitioners seeking to understand the mechanics of KV caching in the context of VLMs rather than just using it as a black box.

Inference Economics · Multimodal Progress · KV Cache · nanoVLM · Vision-Language Models · +1 more
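
The KV caching idea that the tutorial above covers can be illustrated with a toy example. This is a minimal sketch in plain Python, not nanoVLM's implementation: the names `attend`, `KVCache`, and `full_recompute` are hypothetical, and the query, key, and value vectors are supplied directly instead of being projected from hidden states as they would be in a real transformer. It shows that appending each new token's key and value to a cache gives the same attention output as re-slicing the whole prefix at every step.

```python
import math


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


class KVCache:
    """Hypothetical cache: stores keys and values of all past positions."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Only the new token's key and value are added; earlier ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


def full_recompute(qs, ks, vs):
    """Causal attention without a cache: position t attends over 0..t."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


# Toy inputs (assumed values, one 2-dimensional vector per token).
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
reference = full_recompute(qs, ks, vs)
```

In this toy the keys and values are given up front, so the saving is not visible. In a real decoder each key and value comes from a projection of the token's hidden state, and the cache avoids re-running those projections for every earlier token at each generation step, at the cost of memory that grows with sequence length.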

5 · Hugging Face Blog · 1mo ago

SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data

Hugging Face introduces SmolVLA, a compact Vision-Language-Action model designed for robotics control, trained on community-contributed data from the LeRobot ecosystem. The model targets efficient deployment on resource-constrained hardware while maintaining competitive manipulation performance. This release represents a continuation of Hugging Face's strategy to democratize robotics AI through open community data pipelines.

Open Weights Progress · Agent and Tool Ecosystem · LeRobot · Vision-Language-Action model · Hugging Face · +2 more

3 · Hugging Face Blog · 1mo ago

Vision Language Models Explained

This Hugging Face blog post provides a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. It serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.

Multimodal Progress · Vision-Language Models · Hugging Face

5 · Hugging Face Blog · 1mo ago

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress · Inference Economics · Vision-Language Models · Hugging Face · +1 more

5 · Hugging Face Blog · 1mo ago

SmolVLM2: Bringing Video Understanding to Every Device

Hugging Face introduces SmolVLM2, a family of compact vision-language models designed for video understanding on resource-constrained devices. The models extend the SmolVLM line with video comprehension capabilities while maintaining small footprints suitable for edge and on-device deployment. The release targets democratizing multimodal video understanding beyond cloud-only inference.

Open Weights Progress · Inference Economics · SmolVLM · SmolVLM2 · Hugging Face · +1 more