Zero-shot image-to-text generation with BLIP-2
Hugging Face published a blog post introducing BLIP-2, a multimodal model that enables zero-shot image-to-text generation by bridging frozen image encoders and large language models via a lightweight Querying Transformer (Q-Former). The post covers the model's architecture, capabilities, and how to use it via the Hugging Face Transformers library. BLIP-2 achieves strong performance on visual question answering and image captioning tasks without task-specific fine-tuning.
Related guides (4)
Related events (8)
Zero-shot image segmentation with CLIPSeg
This Hugging Face blog post introduces CLIPSeg, a model that performs zero-shot image segmentation by leveraging CLIP-based text and image prompts. The approach allows segmentation of arbitrary image regions without task-specific training, using natural language or reference images as queries. The post likely covers integration into the Hugging Face ecosystem and practical usage examples.
CLIP: Connecting Text and Images
OpenAI introduced CLIP (Contrastive Language-Image Pre-training), a neural network that learns visual concepts from natural language supervision. CLIP enables zero-shot visual classification by accepting natural language descriptions of categories rather than requiring task-specific training data. The approach mirrors the zero-shot transfer capabilities demonstrated by GPT-2 and GPT-3 in the language domain.
SigLIP 2: A better multilingual vision language encoder
Google releases SigLIP 2, an improved multilingual vision-language encoder model published via Hugging Face blog. The update targets better multilingual understanding and vision-language alignment compared to the original SigLIP. The post appears to cover architectural improvements and benchmark results for this encoder model, which is commonly used as a backbone in multimodal systems.
A Dive into Text-to-Video Models
A Hugging Face blog post providing an overview of text-to-video generation models as of mid-2023. The post surveys the landscape of approaches, architectures, and key models in the emerging text-to-video space. As a tier-2 commentary piece, it synthesizes existing work rather than presenting novel research.
Assisted Generation: a new direction toward low-latency text generation
Hugging Face introduces assisted generation (speculative decoding) as a practical technique for reducing LLM inference latency. The approach uses a smaller draft model to propose token candidates that a larger model then verifies in parallel, enabling multiple tokens to be accepted per forward pass. The blog post explains the mechanism and demonstrates integration into the Hugging Face Transformers library.
Qwen releases Qwen3.5-9B-Base multimodal model on Hugging Face
Qwen has released Qwen3.5-9B-Base, a 9-billion-parameter image-text-to-text base model on Hugging Face. The model supports conversational use and is compatible with the transformers library and inference endpoints. With over 153,000 downloads, it has seen substantial early adoption.
Generating Human-level Text with Contrastive Search in Transformers
Hugging Face introduces contrastive search, a decoding strategy for autoregressive language models that aims to produce more coherent and human-like text compared to standard methods like beam search or nucleus sampling. The technique works by balancing a model's confidence in its next-token prediction against a contrastive penalty that discourages repetitive or degenerate outputs. The blog post describes integration of contrastive search into the Hugging Face Transformers library, making it accessible to practitioners.
Pre-Train BERT with Hugging Face Transformers and Habana Gaudi
This Hugging Face blog post from August 2022 describes how to pre-train a BERT model from scratch using the Hugging Face Transformers library on Habana Gaudi hardware accelerators. It covers the full pipeline including data preparation, tokenizer training, and masked language modeling pretraining. The post serves as both a technical tutorial and a demonstration of Habana Gaudi's viability as an alternative AI training accelerator.



