4Hugging Face Blog·1mo ago

Zero-shot image segmentation with CLIPSeg

This Hugging Face blog post introduces CLIPSeg, a model that performs zero-shot image segmentation using CLIP-based text and image prompts. It can segment arbitrary image regions without task-specific training, taking natural language descriptions or reference images as queries. The post also covers using the model within the Hugging Face ecosystem, with practical usage examples.

Agent and Tool Ecosystem Multimodal Progress Hugging Face CLIPSeg CLIP
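
To make the prompt-driven idea concrete, here is a minimal sketch of one possible post-processing step. It assumes a CLIPSeg-style model has already produced one logit map per text prompt. The function name `logits_to_label_map`, the 0.5 threshold, and the toy 2x2 logits are illustrative assumptions, not taken from the post.

```python
import numpy as np

def logits_to_label_map(logits, threshold=0.5):
    """Turn per-prompt logit maps into a single label map.

    logits: array of shape (num_prompts, H, W), one map per text prompt.
    Returns an (H, W) integer array holding the index of the winning
    prompt, or -1 where no prompt is confident enough (background).
    """
    probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid per prompt and pixel
    best = probs.argmax(axis=0)                # most likely prompt per pixel
    confident = probs.max(axis=0) >= threshold # hypothetical cut-off
    return np.where(confident, best, -1)

# Toy logits standing in for real model output (assumption).
prompts = ["cat", "grass"]
logits = np.array([
    [[4.0, -3.0], [-3.0, -3.0]],   # "cat" map
    [[-3.0, 2.0], [2.0, -3.0]],    # "grass" map
])
label_map = logits_to_label_map(logits)
```

Each pixel is assigned to whichever prompt scores highest, and pixels where every prompt scores below the threshold are left as background. A real pipeline would likely also resize the maps back to the input image resolution.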

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

Zero-shot image-to-text generation with BLIP-2

Hugging Face published a blog post introducing BLIP-2, a multimodal model that enables zero-shot image-to-text generation by bridging frozen image encoders and large language models via a lightweight Querying Transformer (Q-Former). The post covers the model's architecture, capabilities, and how to use it via the Hugging Face Transformers library. BLIP-2 achieves strong performance on visual question answering and image captioning tasks without task-specific fine-tuning.

Open Weights Progress Agent and Tool Ecosystem Q-Former Salesforce Research Hugging Face Transformers +3 more

9OpenAI Blog·1mo ago·source ↗

CLIP: Connecting Text and Images

OpenAI introduced CLIP (Contrastive Language-Image Pre-training), a neural network that learns visual concepts from natural language supervision. CLIP enables zero-shot visual classification by accepting natural language descriptions of categories rather than requiring task-specific training data. The approach mirrors the zero-shot transfer capabilities demonstrated by GPT-2 and GPT-3 in the language domain.

Frontier Model Releases Evaluation and Benchmarking GPT-3 GPT-2 Contrastive Language-Image Pretraining (CLIP) +3 more
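
The zero-shot classification step described above can be sketched as follows: embed the image and each candidate caption, compare them by cosine similarity, and apply a softmax. This is a minimal sketch, not OpenAI's implementation. The embeddings are hand-written placeholders rather than real encoder outputs, and the function name `zero_shot_classify` and the temperature of 100 are assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """Return one probability per candidate caption for a single image.

    image_emb: (D,) image embedding.
    text_embs: (N, D) embeddings of N natural language class descriptions.
    """
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (text_embs @ image_emb)
    logits = logits - logits.max()             # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Placeholder vectors standing in for real text/image encoder outputs.
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
image_emb = np.array([0.1, 0.9, 0.2, 0.0])

probs = zero_shot_classify(image_emb, text_embs)
best = labels[int(np.argmax(probs))]
```

Because the classes are just text, changing the label set means editing the `labels` list and re-embedding the captions, with no retraining.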

3Hugging Face Blog·1mo ago·source ↗

Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions

This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.

Multimodal Progress Hugging Face OpenAI RSICD +1 more

4Hugging Face Blog·1mo ago·source ↗

Universal Image Segmentation with Mask2Former and OneFormer

Hugging Face published a blog post introducing Mask2Former and OneFormer, two universal image segmentation architectures now available in the Transformers library. These models unify panoptic, instance, and semantic segmentation tasks under a single framework. The post covers model capabilities, usage examples, and integration into the Hugging Face ecosystem.

Agent and Tool Ecosystem Multimodal Progress Mask2Former OneFormer Hugging Face Transformers +1 more

4Hugging Face Blog·1mo ago·source ↗

Generate Images with Claude and Hugging Face via MCP

Hugging Face published a blog post demonstrating how to use Claude with the Model Context Protocol (MCP) to generate images through Hugging Face's inference infrastructure. The integration allows Claude to call Hugging Face image generation models as tools via MCP, connecting frontier LLMs with open-weight diffusion models. This represents a practical example of the agent-tool ecosystem pattern where LLMs orchestrate specialized model endpoints.

Agent and Tool Ecosystem Multimodal Progress Claude Hugging Face Anthropic +1 more

5arXiv · cs.CL·12d ago·source ↗

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.

Evaluation and Benchmarking Multimodal Progress DOCCI MS COCO IIW +4 more

5arXiv · cs.AI·4d ago·source ↗

ActiveSAM: Training-free open-vocabulary segmentation via image-conditional class pruning on SAM 3

ActiveSAM is a training-free, zero-shot inference framework that wraps Segment Anything Model 3 (SAM 3) to perform open-vocabulary semantic segmentation more efficiently. It estimates an image-conditioned active class subset at low resolution before running full-resolution decoding only on retained classes, using bucketed prompt multiplexing and margin-aware background calibration. Across eight benchmarks, it outperforms the prior state-of-the-art SegEarth-OV3 by ~1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets, with strong robustness to image corruption relevant to autonomous driving and embodied AI.

Evaluation and Benchmarking Inference Economics VILA-Lab Segment Anything Model 3 ActiveSAM +1 more

7OpenAI Blog·1mo ago·source ↗

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2 / unCLIP)

OpenAI published research on hierarchical text-conditional image generation using CLIP latents, the technique underlying DALL-E 2. The approach uses a prior network to map text embeddings to image embeddings, then a diffusion decoder to generate images from those embeddings. This represented a significant advance in text-to-image generation quality and semantic fidelity at the time of release.

Frontier Model Releases Multimodal Progress DALL·E 2 unCLIP OpenAI +2 more