4arXiv cs.CL (Computation and Language)·4d ago

Visually grounded method connects spoken words to written text without textual supervision

Researchers present a method for learning mappings between written words and spoken audio using only images and spoken captions, with no explicit text supervision. The approach uses image captioning to build a written vocabulary, then applies unsupervised word discovery to align spoken utterances to those words. The system outperforms a neural baseline on spoken word retrieval and keyword spotting tasks in English, with implications for low-resource language processing.

Evaluation and Benchmarking Multimodal Progress Connecting Speech to Words through Images

Related guides (2)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

9Openai Blog·1mo ago·source ↗

CLIP: Connecting Text and Images

OpenAI introduced CLIP (Contrastive Language-Image Pre-training), a neural network that learns visual concepts from natural language supervision. CLIP enables zero-shot visual classification by accepting natural language descriptions of categories rather than requiring task-specific training data. The approach mirrors the zero-shot transfer capabilities demonstrated by GPT-2 and GPT-3 in the language domain.

Frontier Model Releases Evaluation and Benchmarking GPT-3 GPT-2 Contrastive Language-Image Pretraining (CLIP)+3 more

5arXiv · cs.CL·12d ago·source ↗

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.

Evaluation and Benchmarking Multimodal Progress DOCCI MS COCO IIW +4 more

6arXiv · cs.CL·25d ago·source ↗

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.

Training Infrastructure Inference Economics MAGIC LLaVA-1.5-7B LLaVA-665K +5 more

6arXiv · cs.AI·26d ago·source ↗

PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs

This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.

Evaluation and Benchmarking Multimodal Progress PGT (Procedurally Generated Tasks)Multimodal Large Language Models CV-Bench-2D +2 more

6arXiv · cs.CL·22d ago·source ↗

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking Alignment and RLHF LoMo LLaVA-OneVision-1.5-8B Qwen3-4B +3 more

4Hugging Face Blog·1mo ago·source ↗

A Dive into Vision-Language Models

This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.

Multimodal Progress Contrastive Language-Image Pretraining (CLIP)Vision-Language Models Hugging Face

6arXiv · cs.CL·25d ago·source ↗

STORM: Internalized Spatial-Temporal Reasoning for Video-Language Models via Latent Trajectories

STORMS is a two-stage training framework that teaches large vision-language models to perform spatial-temporal video reasoning through bounded continuous latent trajectories rather than explicit textual chain-of-thought, keyframe selection, or external tool use. In Stage I, latent tokens are aligned with thought-video representations derived from generated videos; in Stage II, answer-only supervision internalizes the reasoning process. At inference time, no video regeneration or frame reinsertion is required, reducing latency and engineering complexity. Evaluations on VideoMME, MVBench, TempCompass, and MMVU show improved accuracy with substantially lower inference overhead versus tool-based pipelines.

Inference Economics Agent and Tool Ecosystem MVBench STORMS TempCompass +5 more

5arXiv · cs.CL·24d ago·source ↗

Real Images, Worse Judgments: Evaluating VLMs on Concreteness and Imagery

This paper evaluates whether vision-language models (VLMs) benefit from real image context when making lexical judgments about word concreteness and imagery. The authors find that real-image contexts frequently hurt alignment with human ratings, especially when visual evidence is least relevant to the word being judged. Probing and canonical correlation analysis reveal that real images cause representational shifts and increased sensitivity to spurious visual cues. Instructing models to focus on text-only content at inference time partially mitigates this degradation.

Evaluation and Benchmarking Multimodal Progress concreteness ratings canonical correlation analysis Vision-Language Models +2 more