Visual Salamandra: Pushing the Boundaries of Multimodal Understanding
BSC-LT (Barcelona Supercomputing Center Language Technologies) has released Visual Salamandra, a 7B multimodal model announced on the Hugging Face blog. The post describes a vision-language model built on the Salamandra language model family. Because this is a tier-2 source with an empty body, specific capability details and benchmark results are not available from this item alone.
Related events (8)
Vision Language Models (Better, faster, stronger)
A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.
A Dive into Vision-Language Models
This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.
Vision Language Models Explained
A Hugging Face blog post providing a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.
Meta releases Llama 3.2 90B Vision-Instruct multimodal model
Meta released Llama 3.2 90B Vision-Instruct, a large multimodal model supporting image-text-to-text tasks, on Hugging Face. The model is part of the Llama 3.2 family and supports English and German. The listing reports 858 downloads and 358 likes. The release represents Meta's open-weights push into vision-language capabilities at the 90B parameter scale.
Meta releases Llama 3.2 11B Vision Instruct multimodal model
Meta released Llama 3.2 11B Vision Instruct, an open-weights multimodal model supporting image-text-to-text tasks, on Hugging Face. The model is part of the Llama 3.2 family and supports English and German. With over 157K downloads and 1,600 likes, it has seen substantial community adoption.
Meta releases Llama 3.2 11B Vision multimodal model on Hugging Face
Meta released Llama 3.2 11B Vision, an open-weights image-text-to-text model, on Hugging Face. The model is part of the Llama 3.2 family and supports multiple languages including English, German, and French. This represents Meta's entry into open-weights multimodal models at the 11B parameter scale.
smolagents Now Supports Vision-Language Models
Hugging Face has added vision-language model (VLM) support to its smolagents framework, enabling agents to process and reason over visual inputs alongside text. This update extends the agentic tooling ecosystem to multimodal workflows. The announcement comes from the Hugging Face blog, which serves as the primary communication channel for the smolagents project.
Meta releases Llama 3.2 90B Vision multimodal model on Hugging Face
Meta released Llama 3.2 90B Vision, a large multimodal model supporting image-text-to-text tasks, published on Hugging Face under the meta-llama organization. The model is part of the Llama 3.2 family and supports English, German, and French. This is a significant open-weights multimodal release from Meta, extending the Llama 3 series with vision capabilities at the 90B parameter scale.


