Entity · model

CLIP

model · active · clip-8494d55c · 15 events · first seen May 18, 2026

Aliases: CLIP

More like this (12)

Chinese CLIP · CLIPSeg · unCLIP · KLIP · Contrastive Language-Image Pretraining (CLIP) · Paperclip · FetalCLIP · Clippy · PLAID · Adaptive Clip Policy Optimization · CLI-Hub · BLIP-2

Recent events (15)

5 · arXiv · cs.AI · Jul 8, 2026

AirflowAttack: First adversarial attack on infrared remote-sensing VLMs using thermal-airflow perturbations

Researchers introduce AirflowAttack, the first adversarial attack targeting vision-language models deployed on infrared remote-sensing imagery, using physically plausible thermal-airflow turbulence as the perturbation prior. A single input-agnostic perturbation optimized on one surrogate CLIP model achieves a 48.5% mean zero-shot attack success rate across five CLIP backbones, outperforming four IR-specific physical baselines. Applied to six state-of-the-art VLMs, the attack reduces scene-classification accuracy by up to 38.2% (relative) while paradoxically increasing model confidence, so that the models confabulate thermal artifacts as genuine evidence. The work also releases a benchmark spanning eleven models and four tasks, exposing systematic vulnerabilities in security-critical IR VLM deployments.

Evaluation and Benchmarking · AI Safety Research · CLIP · AirflowAttack · +1 more

5 · arXiv · cs.CL · Jul 3, 2026

Training-free mechanistic defense against typographic attacks on CLIP-based vision encoders

Researchers propose a training-free method to defend CLIP-based vision encoders against typographic attacks, where irrelevant text embedded in images biases visual representations toward lexical rather than semantic meaning. The approach uses sampling-based mechanistic interpretability to identify specific Vision Transformer attention heads responsible for encoding lexical information, then applies targeted circuit-level interventions to suppress this behavior. Without any retraining, the method outperforms both supervised and training-free baselines on object classification and improves Visual Question Answering accuracy under typographic attack conditions on RIO-Bench across several state-of-the-art LVLMs.

Evaluation and Benchmarking · AI Safety Research · ViT (Vision Transformer) · Towards Robustness against Typographic Attack with Training-free Concept Localization · RIO-Bench · +2 more

5 · arXiv · cs.AI · Jun 23, 2026

Self-Filtering: Iterative bootstrapped data selection for vision-language model training

Researchers propose Self-Filtering, a bootstrapped data curation method for vision-language models in which a CLIP model iteratively trains on and re-selects its own training data. The approach alternates between filtering high-confidence clean samples and preserving distributional diversity, without requiring curated reference datasets or pre-trained external models. Experiments show downstream performance improvements over standard noisy training pipelines.

Training Infrastructure · Multimodal Progress · Data Selection Through Iterative Self-Filtering for Vision-Language Settings · CLIP

3 · arXiv · cs.CL · Jun 23, 2026

Concept-Constrained Prompt Learning (CCPL) improves CLIP few-shot generalization via concept regularization

Researchers propose Concept-Constrained Prompt Learning (CCPL), a lightweight regularization framework for few-shot CLIP adaptation that anchors learnable class prompts to frozen concept-level text prototypes. The method uses cosine consistency objectives in text space and concept dropout to reduce overfitting to base classes, improving base-to-new generalization. Experiments show gains on DTD (+0.6 HM) and EuroSAT (+2.9 HM) over CoOp, with near-neutral results on OxfordPets, suggesting effectiveness is tied to how well concept prototypes align with dataset semantics.

Evaluation and Benchmarking · Multimodal Progress · EuroSAT · DTD · Concept-Constrained Prompt Learning · +2 more
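
The "HM" figures in the CCPL summary can be read as follows, assuming HM denotes the harmonic mean of base-class and new-class accuracy, as is conventional in base-to-new generalization evaluations. The function name and the accuracy values below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of base-class and new-class accuracy (both in percent)."""
    if base_acc + new_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Hypothetical accuracies, for illustration only.
hm = harmonic_mean(80.0, 60.0)
print(f"HM = {hm:.2f}")
```

Because the harmonic mean is pulled toward the lower of its two inputs, a method that overfits to base classes at the expense of new classes scores poorly on HM even if its base accuracy is high.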

4 · arXiv · cs.AI · Jun 16, 2026

FusionRS: Large-scale RGB-infrared-text dataset for dual-modal remote sensing vision-language models

Researchers introduce FusionRS, the first large-scale dataset pairing RGB and infrared remote sensing images with both conventional and IR-aware text captions, designed to support dual-modal vision-language learning. The dataset is constructed by translating public RGB remote sensing images into infrared-style counterparts using image translation. Using FusionRS, the authors train CLIP-style alignment models and fine-tune generative VLMs, demonstrating improvements in RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only baselines. The work addresses a gap in multimodal remote sensing foundation models by providing modality-specific textual supervision for infrared imagery.

Evaluation and Benchmarking · Multimodal Progress · CLIP · FusionRS

5 · arXiv · cs.CL · Jun 8, 2026

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.

Evaluation and Benchmarking · Multimodal Progress · DOCCI · MS COCO · IIW · +4 more

5 · arXiv · cs.LG · May 28, 2026

AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

AREA is a new method for CLIP-based Class-Incremental Learning (CIL) that decomposes the classification process into attribute extraction and aggregation stages to combat catastrophic forgetting. Extraction is stabilized by anchoring visual and textual attributes on a hyperspherical embedding space via principal geodesic analysis, while aggregation uses lightweight task-specific experts regularized by a variational information bottleneck. Inference employs optimal transport routing over task attribute manifolds. The method is reported to consistently outperform state-of-the-art CIL approaches and has been accepted at ICML 2026.

Evaluation and Benchmarking · Multimodal Progress · Optimal Transport · AREA · Principal Geodesic Analysis · +5 more

5 · arXiv · cs.AI · May 27, 2026

Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

This paper introduces Social Gaze Consistency (SGC), a high-level semantic detection axis based on the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals in images. The authors construct a controlled diagnostic dataset with region-specific gaze perturbations and a Block-Compositional Caption Supervision scheme to train detectors without generator-fingerprint memorization shortcuts. Cross-architecture validation shows +3.7 pp improvement on the COCOAI Interaction subset when applied to FakeVLM, with gains transferring from a single inpainter (FLUX.1-Fill) to multi-generator suites. The work argues that diffusion models share a spectral weakness in periocular structure, making gaze coherence a robust, backbone-agnostic detection signal orthogonal to existing low-level artifact methods.

Evaluation and Benchmarking · AI Safety Research · Effort · FLUX.1-Fill · Social Gaze Consistency · +5 more

9 · OpenAI Blog · May 20, 2026

CLIP: Connecting Text and Images

OpenAI introduced CLIP (Contrastive Language-Image Pre-training), a neural network that learns visual concepts from natural language supervision. CLIP enables zero-shot visual classification by accepting natural language descriptions of categories rather than requiring task-specific training data. The approach mirrors the zero-shot transfer capabilities demonstrated by GPT-2 and GPT-3 in the language domain.

Frontier Model Releases · Evaluation and Benchmarking · GPT-3 · GPT-2 · Contrastive Language-Image Pretraining (CLIP) · +3 more
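
The zero-shot classification described in this entry can be sketched as follows: each candidate category is written as a natural-language prompt, the image and the prompts are embedded, and the image is assigned to the prompt with the highest cosine similarity. This is a minimal sketch, not OpenAI's implementation. The three-dimensional vectors are hypothetical stand-ins for real encoder outputs, and the `logit_scale` of 100 is an assumed temperature value.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, text_embs, logit_scale=100.0):
    """Return a probability per text prompt via a softmax over scaled cosine similarities."""
    img = l2_normalize(image_emb)
    logits = []
    for t in text_embs:
        txt = l2_normalize(t)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings standing in for text-encoder and image-encoder outputs.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

probs = zero_shot_classify(image_emb, text_embs)
best = labels[max(range(len(labels)), key=probs.__getitem__)]
print(best)
```

Changing the set of categories only requires changing the list of prompts, which is why no task-specific training data is needed.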

6 · OpenAI Blog · May 20, 2026

Scaling Kubernetes to 7,500 Nodes

OpenAI describes scaling Kubernetes clusters to 7,500 nodes to support large-scale AI training workloads including GPT-3, CLIP, and DALL·E. The post details infrastructure challenges and solutions enabling both massive model training and rapid small-scale research iteration. This represented a significant engineering milestone in ML training infrastructure at the time of publication (January 2021).

Training Infrastructure · Frontier Model Releases · GPT-3 · Kubernetes · DALL·E 3 · +3 more

5 · OpenAI Blog · May 20, 2026

Multimodal neurons in artificial neural networks

OpenAI researchers discovered neurons in CLIP that respond to the same concept across literal, symbolic, and conceptual representations. This finding parallels multimodal neurons previously observed in biological brains and helps explain CLIP's ability to classify unusual visual renditions of concepts. The work is presented as a step toward understanding the associations and biases learned by CLIP and similar vision-language models.

AI Safety Research · Multimodal Progress · OpenAI · multimodal neurons · CLIP

7 · OpenAI Blog · May 20, 2026

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2 / unCLIP)

OpenAI published research on hierarchical text-conditional image generation using CLIP latents, the technique underlying DALL-E 2. The approach uses a prior network to map text embeddings to image embeddings, then a diffusion decoder to generate images from those embeddings. This represented a significant advance in text-to-image generation quality and semantic fidelity at the time of release.

Frontier Model Releases · Multimodal Progress · DALL·E 3 · unCLIP · OpenAI · +2 more

3 · Hugging Face Blog · May 19, 2026

Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions

This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.

Multimodal Progress · Hugging Face · OpenAI · RSICD · +1 more

4 · Hugging Face Blog · May 19, 2026

Zero-shot image segmentation with CLIPSeg

This Hugging Face blog post introduces CLIPSeg, a model that performs zero-shot image segmentation by leveraging CLIP-based text and image prompts. The approach allows segmentation of arbitrary image regions without task-specific training, using natural language or reference images as queries. The post likely covers integration into the Hugging Face ecosystem and practical usage examples.

Agent and Tool Ecosystem · Multimodal Progress · Hugging Face · CLIPSeg · CLIP

4 · Qwen Research · May 18, 2026

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

Alibaba's Qwen team released Chinese CLIP, a language-specific vision-language contrastive pretraining model targeting Chinese multimodal representation learning. The project addresses a gap in open-source Chinese CLIP models, particularly for cross-modal retrieval tasks. It follows the CLIP framework but is adapted for Chinese language and cultural context.

Open Weights Progress · Multimodal Progress · contrastive vision-language pretraining · Chinese CLIP · CLIP · +1 more