CLIP

model·active·clip-8494d55c·11 events·first seen 1mo ago

Aliases: CLIP

Recent events (11)

9·OpenAI Blog·28d ago

CLIP: Connecting Text and Images

OpenAI introduced CLIP (Contrastive Language-Image Pre-training), a neural network that learns visual concepts from natural language supervision. CLIP enables zero-shot visual classification by accepting natural language descriptions of categories rather than requiring task-specific training data. The approach mirrors the zero-shot transfer capabilities demonstrated by GPT-2 and GPT-3 in the language domain.
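The zero-shot classification step described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: the 3-dimensional embeddings are made-up stand-ins for real encoder outputs, and the function names and the `temperature` value are assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=100.0):
    """Pick the text prompt whose embedding is closest to the image embedding.

    prompt_embeddings maps a natural-language category description to its
    (hypothetical) text-encoder output. temperature is an assumed scale factor
    that sharpens the softmax over cosine similarities.
    """
    labels = list(prompt_embeddings)
    logits = [
        temperature * cosine(image_embedding, prompt_embeddings[label])
        for label in labels
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Toy embeddings, invented for illustration only.
PROMPTS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
IMAGE = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(IMAGE, PROMPTS)
print(label)  # prints: a photo of a dog
```

Adding a new category only requires adding a new prompt to the dictionary, which is what makes the approach "zero-shot": no classifier weights are retrained.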

3·Hugging Face Blog·28d ago

Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions

This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.
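For context, the sketch below shows the symmetric contrastive objective that CLIP-style training optimizes over a batch of paired images and captions. The summary above does not state which loss the fine-tuning run used, so treating it as this standard objective is an assumption; the function names, toy vectors, and the `temperature` default are likewise illustrative.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_embs is assumed to be paired with row i of text_embs,
    so the correct "class" for every row and column is the diagonal entry.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should peak on its own caption.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should peak on its own image.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch, invented for illustration: two images and two captions.
IMAGES = [[1.0, 0.0], [0.0, 1.0]]
CAPTIONS_MATCHED = [[1.0, 0.0], [0.0, 1.0]]
CAPTIONS_SWAPPED = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = clip_contrastive_loss(IMAGES, CAPTIONS_MATCHED)
swapped_loss = clip_contrastive_loss(IMAGES, CAPTIONS_SWAPPED)
print(matched_loss < swapped_loss)  # prints: True
```

Domain adaptation in this setting amounts to continuing to minimize a loss of this kind on in-domain pairs, here satellite images and their captions, so that matched pairs move closer together in the shared embedding space.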

5·OpenAI Blog·28d ago

Multimodal neurons in artificial neural networks

OpenAI researchers discovered neurons in CLIP that respond to the same concept across literal, symbolic, and conceptual representations. This finding parallels multimodal neurons previously observed in biological brains and helps explain CLIP's ability to classify unusual visual renditions of concepts. The work is presented as a step toward understanding the associations and biases learned by CLIP and similar vision-language models.

7·OpenAI Blog·28d ago

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2 / unCLIP)

OpenAI published research on hierarchical text-conditional image generation using CLIP latents, the technique underlying DALL-E 2. The approach uses a prior network to map text embeddings to image embeddings, then a diffusion decoder to generate images from those embeddings. This represented a significant advance in text-to-image generation quality and semantic fidelity at the time of release.

5·arXiv · cs.CL·9d ago

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.
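The general idea of encoding an embedding into sparse latents, masking some of them, and decoding the rest can be sketched as follows. This is a toy illustration under strong assumptions and not the TEVI method itself: the identity weights are placeholders for a trained sparse autoencoder, the mask is set by hand rather than predicted from a caption by a learned module, and all names are hypothetical.

```python
def relu(x):
    """Rectified linear unit for a single value."""
    return x if x > 0.0 else 0.0


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_encode(embedding, enc_weights, enc_bias):
    """Map an embedding to non-negative latent activations."""
    pre_activations = matvec(enc_weights, embedding)
    return [relu(p + b) for p, b in zip(pre_activations, enc_bias)]


def sae_decode(latents, dec_weights):
    """Map latent activations back to embedding space."""
    return matvec(dec_weights, latents)


def masked_reconstruct(embedding, mask, enc_weights, enc_bias, dec_weights):
    """Reconstruct an embedding from only the latents the mask keeps.

    mask holds one value in [0, 1] per latent. In a caption-conditioned
    setup the mask would come from a learned module; here it is an input.
    """
    latents = sae_encode(embedding, enc_weights, enc_bias)
    kept = [m * z for m, z in zip(mask, latents)]
    return sae_decode(kept, dec_weights)


# Placeholder weights: identity maps stand in for a trained autoencoder.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO_BIAS = [0.0, 0.0, 0.0]
IMAGE_EMBEDDING = [0.5, 0.2, 0.7]  # made-up values

full = masked_reconstruct(IMAGE_EMBEDDING, [1.0, 1.0, 1.0], IDENTITY, ZERO_BIAS, IDENTITY)
edited = masked_reconstruct(IMAGE_EMBEDDING, [1.0, 0.0, 1.0], IDENTITY, ZERO_BIAS, IDENTITY)
print(full, edited)
```

With every latent kept, the reconstruction matches the input; zeroing one mask entry removes that latent's contribution. That is the sense in which masking can drop image information a caption does not mention.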

4·Qwen Research·1mo ago

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

Alibaba's Qwen team released Chinese CLIP, a language-specific vision-language contrastive pretraining model targeting Chinese multimodal representation learning. The project addresses the scarcity of open-source CLIP models for Chinese, particularly for cross-modal retrieval tasks. It follows the CLIP framework but is adapted to the Chinese language and cultural context.

5·arXiv · cs.LG·20d ago

AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

AREA is a new method for CLIP-based Class-Incremental Learning (CIL) that decomposes classification into attribute extraction and aggregation stages to combat catastrophic forgetting. Extraction is stabilized by anchoring visual and textual attributes on a hyperspherical embedding space via principal geodesic analysis, while aggregation uses lightweight task-specific experts regularized by a variational information bottleneck. Inference employs optimal transport routing over task attribute manifolds. The authors report that the method consistently outperforms state-of-the-art CIL approaches, and the paper has been accepted at ICML 2026.

4·Hugging Face Blog·28d ago

Zero-shot image segmentation with CLIPSeg

This Hugging Face blog post introduces CLIPSeg, a model that performs zero-shot image segmentation by leveraging CLIP-based text and image prompts. The approach allows segmentation of arbitrary image regions without task-specific training, using natural language or reference images as queries. The post likely covers integration into the Hugging Face ecosystem and practical usage examples.

6·OpenAI Blog·28d ago

Scaling Kubernetes to 7,500 Nodes

OpenAI describes scaling Kubernetes clusters to 7,500 nodes to support large-scale AI training workloads including GPT-3, CLIP, and DALL·E. The post details infrastructure challenges and solutions enabling both massive model training and rapid small-scale research iteration. This represents a significant engineering milestone in ML training infrastructure at the time of publication (January 2021).

5·arXiv · cs.AI·21d ago

Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

This paper introduces Social Gaze Consistency (SGC), a high-level semantic detection axis based on the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals in images. The authors construct a controlled diagnostic dataset with region-specific gaze perturbations and a Block-Compositional Caption Supervision scheme to train detectors without generator-fingerprint memorization shortcuts. Cross-architecture validation shows a +3.7 pp improvement on the COCOAI Interaction subset when SGC is applied to FakeVLM, with gains transferring from a single inpainter (FLUX.1-Fill) to multi-generator suites. The work argues that diffusion models share a spectral weakness in periocular structure, making gaze coherence a robust, backbone-agnostic detection signal orthogonal to existing low-level artifact methods.

4·arXiv · cs.AI·26h ago

FusionRS: Large-scale RGB-infrared-text dataset for dual-modal remote sensing vision-language models

Researchers introduce FusionRS, the first large-scale dataset pairing RGB and infrared remote sensing images with both conventional and IR-aware text captions, designed to support dual-modal vision-language learning. The dataset is constructed by translating public RGB remote sensing images into infrared-style counterparts using image translation. Using FusionRS, the authors train CLIP-style alignment models and fine-tune generative VLMs, demonstrating improvements in RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only baselines. The work addresses a gap in multimodal remote sensing foundation models by providing modality-specific textual supervision for infrared imagery.