
Vision-Language Models
vision-language-models-a9d274ce·20 events·first seen 1mo agoAliases: Large Vision-Language Models, Vision Language Model, Vision Language Models, Vision-Language Model, Vision-Language Models, vision-language models (VLMs), Vision-Language Models (VLMs)
Merged from
Vision Language Models, Vision Language Model, Vision-Language Model, Large Vision-Language Models
Co-occurring entities
More like this (12)
Guides (1)
Recent events (20)
Vision Language Models (Better, faster, stronger)
A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.
A Dive into Vision-Language Models
This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.
Preference Optimization for Vision Language Models
This Hugging Face blog post covers the application of Direct Preference Optimization (DPO) to vision-language models (VLMs). It likely discusses how preference learning techniques originally developed for text-only LLMs can be adapted to multimodal settings. The post addresses training methodology for aligning VLMs with human preferences across both visual and textual modalities.
Vision Language Models Explained
A Hugging Face blog post providing a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.
Real Images, Worse Judgments: Evaluating VLMs on Concreteness and Imagery
This paper evaluates whether vision-language models (VLMs) benefit from real image context when making lexical judgments about word concreteness and imagery. The authors find that real-image contexts frequently hurt alignment with human ratings, especially when visual evidence is least relevant to the word being judged. Probing and canonical correlation analysis reveal that real images cause representational shifts and increased sensitivity to spurious visual cues. Instructing models to focus on text-only content at inference time partially mitigates this degradation.
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
SpaceNum is a new evaluation framework probing whether Vision-Language Models genuinely ground numerical outputs (coordinates, action magnitudes) in spatial perception, rather than relying on shallow cues. The benchmark defines two bidirectional tasks—Num2Space and Space2Num—across dynamic and static spatial settings. Results show current VLMs perform near random chance on spatial numerical grounding, with explicit reasoning providing only marginal improvement and fine-tuning offering partial gains.
VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading
This paper compares matched LLM and VLM pairs in a text-only setting to isolate the effect of multimodal training history on human-like language processing. Using whole-cortex fMRI and eye-tracking data from natural reading, the authors find that multimodal pretraining does not confer a uniform global advantage in human alignment. However, VLMs show selective advantages when sentences contain stronger visual semantic content, with converging evidence from both neural and behavioral measures. The findings suggest language-internal representations remain the primary driver of human text processing alignment.
Get your VLM running in 3 simple steps on Intel CPUs
A Hugging Face blog post describes a workflow for deploying vision-language models (VLMs) on Intel CPUs using OpenVINO, presented as a three-step process. The post targets practitioners looking to run multimodal inference on CPU hardware without requiring GPU resources. This is relevant to the inference-on-edge and CPU-based deployment pattern for multimodal models.
TempGlitch: Benchmark for Evaluating VLMs on Temporal Glitch Detection in Gameplay Videos
TempGlitch is a new benchmark designed to evaluate vision-language models on temporal glitch detection in gameplay videos, distinguishing temporal anomalies (visible only across ordered frames) from spatial ones (visible in a single frame). The benchmark covers five temporal glitch types with paired glitch-free videos for binary evaluation, and tests 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Results show current VLMs perform near chance on temporal glitches, with neither denser frame sampling nor larger model size reliably improving detection. The work highlights a systematic gap in VLM temporal reasoning capabilities relevant to automated video quality assurance.
Vision Language Model Alignment in TRL
Hugging Face's TRL library has added support for aligning Vision Language Models (VLMs), extending existing RLHF and preference optimization tooling to multimodal settings. The blog post covers the new capabilities for training VLMs with alignment techniques such as DPO and related methods. This expands the open-source ecosystem for multimodal model fine-tuning and alignment.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.
Vision-Language Models Suppress Female Representations Under Ambiguous Input
This paper investigates gender bias in vision-language models (VLMs) when inputs are ambiguous (e.g., workers in full gear or seen from behind), finding that models default to male outputs even for strongly female-stereotyped occupations. The authors introduce LALS (Latent Association Leaning Score), a zero-shot metric that probes internal visual-token activations to measure concept associations across layers. Across 15 occupations, 800+ ambiguous images, and four VLMs, they find a systematic decoupling: models internally encode female associations but suppress them before generation, with male signals amplifying end-to-end while female signals peak mid-network and are filtered out. Cultural visual cues like clothing color further modulate these internal associations.
MedFocus: Causal Visual Attribution Framework for Chest X-ray Reasoning in Large Vision-Language Models
This paper addresses the faithfulness of visual attribution methods in Large Vision-Language Models (LVLMs) applied to chest X-ray (CXR) analysis. The authors develop a causal evaluation framework using counterfactual editing to verify whether expert-annotated regions are causally responsible for model predictions, testing 11 attribution methods across six open-source LVLMs. Finding that existing attribution methods frequently fail to identify the actual visual evidence used by models, they propose MedFocus, a concept-based attribution method using unbalanced optimal transport to localize anatomical regions and measure their causal effect on outputs. MedFocus substantially outperforms prior methods and provides spatial, concept-level, and token-level attributions.
smolagents Now Supports Vision-Language Models
Hugging Face has added vision-language model (VLM) support to its smolagents framework, enabling agents to process and reason over visual inputs alongside text. This update extends the agentic tooling ecosystem to multimodal workflows. The announcement comes from the Hugging Face blog, which serves as the primary communication channel for the smolagents project.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
WikiVQABench is a new human-curated VQA benchmark that requires external knowledge beyond visual perception, constructed by combining Wikipedia images, captions, and Wikidata structured knowledge with LLM-generated question candidates reviewed by human annotators. The benchmark evaluates knowledge-intensive reasoning in vision-language models, covering 15 VLMs ranging from 256M to 90B parameters. Accuracy spans 24.7% to 75.6%, indicating meaningful discrimination across model scales. The dataset and code are publicly released.
Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
Chartographer is a framework for generating counterfactual chart variants to rigorously evaluate visual reasoning in vision-language models (VLMs), addressing the problem of shortcut-taking and prior knowledge exploitation in chart QA benchmarks. The system reverse-engineers charts into executable code, generates seed-controlled variants, and derives new ground-truth answers via executable QA logic. Evaluation of proprietary and open-source VLMs reveals that models frequently fail to generalize to counterfactual charts even after correctly answering the original, with failures most common when novel visual reasoning pathways are required.
EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering
EdgeFlow augments Vision Language Models with deterministically extracted Canny edge maps as structural priors to improve flowchart-to-Mermaid conversion in industrial requirements engineering, requiring no annotated training data or fine-tuning. Evaluated on IndusReqFlow, a real-world industrial dataset, it achieves +17.39 pp node-level F1 and +16.94 pp edge-level F1 over off-the-shelf VLMs. Cross-dataset evaluation on a synthetic benchmark shows no significant gains, highlighting the gap between synthetic and industrial benchmarks for VLM-based RE tools.
VisAnomReasoner: Efficient VLM for Time-Series Anomaly Detection via VisAnomBench
Researchers introduce VisAnomBench, a curated benchmark augmenting public time-series anomaly datasets with natural-language rationales generated and selected from multiple large VLMs using task-specific rewards. Fine-tuning on this benchmark produces VisAnomReasoner, a parameter-efficient vision-language model that outperforms all baselines by at least 21.23 and 23.87 percentage points in precision and F1 on VisAnomBench. Cross-benchmark evaluation on TSB-AD-U shows further generalization gains of 9.57 and 13.39 percentage points in precision and F1, respectively.
KV Cache from scratch in nanoVLM
This Hugging Face blog post walks through implementing a key-value (KV) cache from scratch within the nanoVLM framework, a minimal vision-language model codebase. The post serves as a technical tutorial explaining how KV caching works in transformer-based multimodal models and how to integrate it for inference efficiency. It targets practitioners seeking to understand the mechanics of KV caching in the context of VLMs rather than just using it as a black box.
PhotoFlow: Agentic 3D Virtual Photography via Director-Reviewer-Reflector Loop
PhotoFlow introduces a closed-loop agentic system for language-conditioned virtual photography in arbitrary 3D scenes, using a Director-Reviewer-Reflector architecture to iteratively search camera poses and render photographs without preselected viewpoints. The system is evaluated on VPhotoBench, a new benchmark of 47 Blender scenes and 141 language-conditioned missions covering spatial composition and aesthetic criteria. PhotoFlow outperforms one-shot prediction, single-chain reflection, anchor-bank selection, and random search baselines under a six-round rendering budget. The work represents the first formalization of language-conditioned virtual photography as an executable agent task, probing both 3D spatial reasoning and aesthetic judgment in vision-language models.
