Almanac
← Events
5Hugging Face Blog·1mo ago

SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data

Hugging Face introduces SmolVLA, a compact Vision-Language-Action model designed for robotics control, trained on community-contributed data from the LeRobot ecosystem. The model targets efficient deployment on resource-constrained hardware while maintaining competitive manipulation performance. This release represents a continuation of Hugging Face's strategy to democratize robotics AI through open community data pipelines.

Related guides (4)

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

SmolVLM - Small Yet Mighty Vision Language Model

Hugging Face introduces SmolVLM, a compact vision-language model designed to deliver strong multimodal performance at small parameter counts. The model targets edge and resource-constrained deployment scenarios while maintaining competitive capabilities relative to its size. The announcement highlights efficiency improvements in both training and inference for small-scale VLMs.

5Hugging Face Blog·1mo ago·source ↗

SmolVLM2: Bringing Video Understanding to Every Device

Hugging Face introduces SmolVLM2, a family of compact vision-language models designed for video understanding on resource-constrained devices. The models extend the SmolVLM line with video comprehension capabilities while maintaining small footprints suitable for edge and on-device deployment. The release targets democratizing multimodal video understanding beyond cloud-only inference.

5Hugging Face Blog·1mo ago·source ↗

smolagents Now Supports Vision-Language Models

Hugging Face has added vision-language model (VLM) support to its smolagents framework, enabling agents to process and reason over visual inputs alongside text. This update extends the agentic tooling ecosystem to multimodal workflows. The announcement comes from the Hugging Face blog, which serves as the primary communication channel for the smolagents project.

6arXiv · cs.CL·8d ago·source ↗

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

5Hugging Face Blog·1mo ago·source ↗

Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine-Tuning, and On-Device Optimizations

NXP and Hugging Face describe a pipeline for deploying Vision-Language-Action (VLA) models on embedded/edge hardware, covering dataset recording, fine-tuning, and on-device optimization techniques. The post targets robotics applications where inference must run on resource-constrained microcontrollers or SoCs rather than cloud GPUs. Key topics include quantization, model compression, and integration with the LeRobot ecosystem. This represents a practical engineering bridge between frontier VLA research and real-world embedded robotics deployment.

5Hugging Face Blog·1mo ago·source ↗

SmolLM: Hugging Face Releases Blazingly Fast Small Language Models

Hugging Face introduces SmolLM, a family of small language models designed for on-device and edge deployment with high speed and competitive performance. The models are positioned as efficient alternatives for resource-constrained environments. The release includes model weights and associated tooling on the Hugging Face Hub.

5Hugging Face Blog·1mo ago·source ↗

SmolVLM Grows Smaller – Introducing the 256M & 500M Models

Hugging Face has released two new ultra-compact vision-language models, SmolVLM-256M and SmolVLM-500M, extending the SmolVLM family to sub-billion parameter sizes. These models are designed for on-device and resource-constrained deployment scenarios. The release continues the trend of pushing capable multimodal models into smaller footprints suitable for edge inference.

5Hugging Face Blog·1mo ago·source ↗

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.