Almanac
technique

Vision-Language Models

technique·active·vision-language-models-a9d274ce·20 events·first seen 1mo ago

Aliases: Large Vision-Language Models, Vision Language Model, Vision Language Models, Vision-Language Model, Vision-Language Models, vision-language models (VLMs), Vision-Language Models (VLMs)

Merged from

Vision Language Models, Vision Language Model, Vision-Language Model, Large Vision-Language Models

Recent events (20)

5·Hugging Face Blog·1mo ago

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

4·Hugging Face Blog·1mo ago

A Dive into Vision-Language Models

This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.

5·Hugging Face Blog·1mo ago

Preference Optimization for Vision Language Models

This Hugging Face blog post covers the application of Direct Preference Optimization (DPO) to vision-language models (VLMs). It likely discusses how preference learning techniques originally developed for text-only LLMs can be adapted to multimodal settings. The post addresses training methodology for aligning VLMs with human preferences across both visual and textual modalities.
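
As a point of reference, the DPO objective can be sketched in a few lines. The sketch assumes the summed sequence log-probabilities of the chosen and rejected responses are already available under both the policy and a frozen reference model (for a VLM, conditioned on the image as well as the prompt). The function name `dpo_loss` and the default `beta` are illustrative assumptions, not taken from the post.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # Implicit rewards: how much more the policy favors a response than the reference does
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.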

3·Hugging Face Blog·1mo ago

Vision Language Models Explained

A Hugging Face blog post provides a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.

5·arXiv · cs.CL·24d ago

Real Images, Worse Judgments: Evaluating VLMs on Concreteness and Imagery

This paper evaluates whether vision-language models (VLMs) benefit from real image context when making lexical judgments about word concreteness and imagery. The authors find that real-image contexts frequently hurt alignment with human ratings, especially when visual evidence is least relevant to the word being judged. Probing and canonical correlation analysis reveal that real images cause representational shifts and increased sensitivity to spurious visual cues. Instructing models to focus on text-only content at inference time partially mitigates this degradation.

6·arXiv · cs.AI·26d ago

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

SpaceNum is a new evaluation framework probing whether Vision-Language Models genuinely ground numerical outputs (coordinates, action magnitudes) in spatial perception, rather than relying on shallow cues. The benchmark defines two bidirectional tasks—Num2Space and Space2Num—across dynamic and static spatial settings. Results show current VLMs perform near random chance on spatial numerical grounding, with explicit reasoning providing only marginal improvement and fine-tuning offering partial gains.

5·arXiv · cs.CL·23d ago

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

This paper compares matched LLM and VLM pairs in a text-only setting to isolate the effect of multimodal training history on human-like language processing. Using whole-cortex fMRI and eye-tracking data from natural reading, the authors find that multimodal pretraining does not confer a uniform global advantage in human alignment. However, VLMs show selective advantages when sentences contain stronger visual semantic content, with converging evidence from both neural and behavioral measures. The findings suggest language-internal representations remain the primary driver of human text processing alignment.

4·Hugging Face Blog·1mo ago

Get your VLM running in 3 simple steps on Intel CPUs

A Hugging Face blog post describes a three-step workflow for deploying vision-language models (VLMs) on Intel CPUs using OpenVINO. The post targets practitioners who want to run multimodal inference on CPU hardware without GPU resources. It is relevant to edge inference and CPU-based deployment of multimodal models.

4·arXiv · cs.AI·1mo ago

TempGlitch: Benchmark for Evaluating VLMs on Temporal Glitch Detection in Gameplay Videos

TempGlitch is a new benchmark designed to evaluate vision-language models on temporal glitch detection in gameplay videos, distinguishing temporal anomalies (visible only across ordered frames) from spatial ones (visible in a single frame). The benchmark covers five temporal glitch types with paired glitch-free videos for binary evaluation, and tests 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Results show current VLMs perform near chance on temporal glitches, with neither denser frame sampling nor larger model size reliably improving detection. The work highlights a systematic gap in VLM temporal reasoning capabilities relevant to automated video quality assurance.
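
The two evaluation ingredients mentioned here, frame sampling and paired binary scoring, can be illustrated with a minimal sketch. The uniform sampling rule and the accuracy definition below are assumptions chosen for illustration; the paper's exact protocol may differ, and both function names are hypothetical.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick evenly spaced frame indices, preserving temporal order."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    if num_samples == 1:
        return [num_frames // 2]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def paired_accuracy(predictions):
    """Binary accuracy over (glitch video, glitch-free video) pairs.

    predictions: list of (pred_on_glitch, pred_on_clean) booleans,
    where True means the model reported a glitch.
    """
    # A pair contributes one point for flagging the glitch video
    # and one point for not flagging its glitch-free counterpart.
    correct = sum(int(g) + int(not c) for g, c in predictions)
    return correct / (2 * len(predictions))
```

Under this scoring, a model that reports a glitch on every video gets exactly 0.5, which is why pairing each glitch clip with a glitch-free one makes "near chance" easy to read.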

6·Hugging Face Blog·1mo ago

Vision Language Model Alignment in TRL

Hugging Face's TRL library has added support for aligning Vision Language Models (VLMs), extending existing RLHF and preference optimization tooling to multimodal settings. The blog post covers the new capabilities for training VLMs with alignment techniques such as DPO and related methods. This expands the open-source ecosystem for multimodal model fine-tuning and alignment.

6·arXiv · cs.CL·22d ago

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

6·arXiv · cs.CL·19d ago

Vision-Language Models Suppress Female Representations Under Ambiguous Input

This paper investigates gender bias in vision-language models (VLMs) when inputs are ambiguous (e.g., workers in full gear or seen from behind), finding that models default to male outputs even for strongly female-stereotyped occupations. The authors introduce LALS (Latent Association Leaning Score), a zero-shot metric that probes internal visual-token activations to measure concept associations across layers. Across 15 occupations, 800+ ambiguous images, and four VLMs, they find a systematic decoupling: models internally encode female associations but suppress them before generation, with male signals amplifying end-to-end while female signals peak mid-network and are filtered out. Cultural visual cues like clothing color further modulate these internal associations.

6·arXiv · cs.AI·1mo ago

MedFocus: Causal Visual Attribution Framework for Chest X-ray Reasoning in Large Vision-Language Models

This paper addresses the faithfulness of visual attribution methods in Large Vision-Language Models (LVLMs) applied to chest X-ray (CXR) analysis. The authors develop a causal evaluation framework using counterfactual editing to verify whether expert-annotated regions are causally responsible for model predictions, testing 11 attribution methods across six open-source LVLMs. Finding that existing attribution methods frequently fail to identify the actual visual evidence used by models, they propose MedFocus, a concept-based attribution method using unbalanced optimal transport to localize anatomical regions and measure their causal effect on outputs. MedFocus substantially outperforms prior methods and provides spatial, concept-level, and token-level attributions.

5·Hugging Face Blog·1mo ago

smolagents Now Supports Vision-Language Models

Hugging Face has added vision-language model (VLM) support to its smolagents framework, enabling agents to process and reason over visual inputs alongside text. This update extends the agentic tooling ecosystem to multimodal workflows. The announcement comes from the Hugging Face blog, which serves as the primary communication channel for the smolagents project.

5·arXiv · cs.AI·1mo ago

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

WikiVQABench is a new human-curated VQA benchmark that requires external knowledge beyond visual perception, constructed by combining Wikipedia images, captions, and Wikidata structured knowledge with LLM-generated question candidates reviewed by human annotators. The benchmark evaluates knowledge-intensive reasoning in vision-language models, covering 15 VLMs ranging from 256M to 90B parameters. Accuracy spans 24.7% to 75.6%, indicating meaningful discrimination across model scales. The dataset and code are publicly released.

5·arXiv · cs.CL·24d ago

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chartographer is a framework for generating counterfactual chart variants to rigorously evaluate visual reasoning in vision-language models (VLMs), addressing the problem of shortcut-taking and prior knowledge exploitation in chart QA benchmarks. The system reverse-engineers charts into executable code, generates seed-controlled variants, and derives new ground-truth answers via executable QA logic. Evaluation of proprietary and open-source VLMs reveals that models frequently fail to generalize to counterfactual charts even after correctly answering the original, with failures most common when novel visual reasoning pathways are required.
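
The idea of seed-controlled variants with executable ground truth can be sketched as follows. This is a simplified illustration, not the Chartographer pipeline: the chart is reduced to a label-to-value mapping, rendering is omitted, and `predict` stands in for a hypothetical VLM wrapper that would normally receive the rendered image.

```python
import random


def make_variant(categories, seed, low=10, high=100):
    """Return a counterfactual bar-chart spec with seed-controlled values."""
    rng = random.Random(seed)
    # Sampling without replacement keeps values distinct,
    # so "tallest bar" always has a unique answer.
    values = rng.sample(range(low, high + 1), k=len(categories))
    return dict(zip(categories, values))


def answer_tallest_bar(chart):
    """Executable QA logic: recompute the ground truth from the chart data."""
    return max(chart, key=chart.get)


def consistency(predict, categories, seeds):
    """Fraction of variants on which a predictor matches the derived answer."""
    variants = [make_variant(categories, s) for s in seeds]
    hits = sum(predict(chart) == answer_tallest_bar(chart) for chart in variants)
    return hits / len(variants)
```

A predictor that memorized the answer for the original chart (for example, one that always returns the same label) scores well only on variants where that label happens to remain correct, which is the shortcut behavior this style of evaluation is meant to expose.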

4·arXiv · cs.AI·24d ago

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

EdgeFlow augments Vision Language Models with deterministically extracted Canny edge maps as structural priors to improve flowchart-to-Mermaid conversion in industrial requirements engineering (RE), requiring no annotated training data or fine-tuning. Evaluated on IndusReqFlow, a real-world industrial dataset, it improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points over off-the-shelf VLMs. Cross-dataset evaluation on a synthetic benchmark shows no significant gains, highlighting the gap between synthetic and industrial benchmarks for VLM-based RE tools.
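
Node-level and edge-level F1 for a flowchart conversion can be computed roughly as below. The sketch assumes exact string matching of node labels and of (source, target) edge pairs; the paper's actual matching rules are not stated in this summary, and `set_f1` is a hypothetical helper.

```python
def set_f1(predicted, reference):
    """F1 between a predicted and a reference collection, compared as sets."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction mislabels one node, which also breaks one edge.
reference_nodes = {"Start", "Check", "End"}
predicted_nodes = {"Start", "Check", "Stop"}
reference_edges = {("Start", "Check"), ("Check", "End")}
predicted_edges = {("Start", "Check"), ("Check", "Stop")}

node_f1 = set_f1(predicted_nodes, reference_nodes)
edge_f1 = set_f1(predicted_edges, reference_edges)
```

A gain reported in percentage points is then the difference between two such scores multiplied by 100.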

5·arXiv · cs.AI·22d ago

VisAnomReasoner: Efficient VLM for Time-Series Anomaly Detection via VisAnomBench

Researchers introduce VisAnomBench, a curated benchmark augmenting public time-series anomaly datasets with natural-language rationales generated and selected from multiple large VLMs using task-specific rewards. Fine-tuning on this benchmark produces VisAnomReasoner, a parameter-efficient vision-language model that outperforms all baselines on VisAnomBench by at least 21.23 and 23.87 percentage points in precision and F1, respectively. Cross-benchmark evaluation on TSB-AD-U shows further generalization gains of 9.57 and 13.39 percentage points in precision and F1, respectively.

4·Hugging Face Blog·1mo ago

KV Cache from scratch in nanoVLM

This Hugging Face blog post walks through implementing a key-value (KV) cache from scratch within the nanoVLM framework, a minimal vision-language model codebase. The post serves as a technical tutorial explaining how KV caching works in transformer-based multimodal models and how to integrate it for inference efficiency. It targets practitioners seeking to understand the mechanics of KV caching in the context of VLMs rather than just using it as a black box.
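
The mechanism can be shown with a minimal single-head sketch in plain Python. It is not nanoVLM's implementation: it assumes queries, keys, and values have already been projected, and the names `KVCache`, `attend`, and `full_recompute` are illustrative.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class KVCache:
    """Stores keys and values of already-processed tokens for one attention head."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append only the new token's key/value, then attend over the whole history
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_recompute(queries, keys, values):
    """Causal attention without a cache: position t re-reads keys/values 0..t."""
    return [attend(queries[t], keys[:t + 1], values[:t + 1])
            for t in range(len(queries))]
```

Calling `step` once per generated token yields the same outputs as `full_recompute`, but each step only adds one key/value pair instead of rebuilding the prefix. In a VLM the image tokens form part of that prefix, so they are encoded into the cache once and reused for every generated token.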

5·arXiv · cs.AI·26d ago

PhotoFlow: Agentic 3D Virtual Photography via Director-Reviewer-Reflector Loop

PhotoFlow introduces a closed-loop agentic system for language-conditioned virtual photography in arbitrary 3D scenes, using a Director-Reviewer-Reflector architecture to iteratively search camera poses and render photographs without preselected viewpoints. The system is evaluated on VPhotoBench, a new benchmark of 47 Blender scenes and 141 language-conditioned missions covering spatial composition and aesthetic criteria. PhotoFlow outperforms one-shot prediction, single-chain reflection, anchor-bank selection, and random search baselines under a six-round rendering budget. The authors present the work as the first formalization of language-conditioned virtual photography as an executable agent task, probing both 3D spatial reasoning and aesthetic judgment in vision-language models.