5 · arXiv cs.AI (Artificial Intelligence) · 44h ago

Self-Filtering: Iterative bootstrapped data selection for vision-language model training

Researchers propose Self-Filtering, a bootstrapped data curation method for vision-language models in which a CLIP model iteratively trains on and re-selects its own training data. The approach alternates between filtering high-confidence clean samples and preserving distributional diversity, without requiring curated reference datasets or pre-trained external models. Experiments show downstream performance improvements over standard noisy training pipelines.

Training Infrastructure · Multimodal Progress · Data Selection Through Iterative Self-Filtering for Vision-Language Settings · CLIP
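The alternation described in the summary above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes image-caption cosine similarity as the confidence signal and a per-group quota as the diversity mechanism, and the names `pair_scores`, `select_indices`, `self_filter`, `embed_fn`, `train_fn`, `groups`, and `keep_frac` are all hypothetical.

```python
import numpy as np

def pair_scores(img_emb, txt_emb):
    """Cosine similarity of each image with its own caption (assumed confidence proxy)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return np.sum(img * txt, axis=1)

def select_indices(scores, groups, keep_frac):
    """Keep the top keep_frac of pairs inside every group, so no group disappears."""
    keep = []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        k = max(1, int(np.ceil(keep_frac * idx.size)))
        # Highest-scoring k members of this group.
        top = idx[np.argsort(scores[idx])[::-1][:k]]
        keep.extend(top.tolist())
    return np.array(sorted(keep))

def self_filter(embed_fn, train_fn, n_samples, groups, keep_frac=0.5, rounds=3):
    """Alternate between training on the current subset and re-selecting it.

    train_fn(selected) returns a model; embed_fn(model) returns image and text
    embeddings for the full pool. Both are caller-supplied placeholders.
    """
    selected = np.arange(n_samples)
    for _ in range(rounds):
        model = train_fn(selected)
        img_emb, txt_emb = embed_fn(model)
        selected = select_indices(pair_scores(img_emb, txt_emb), groups, keep_frac)
    return selected
```

In this sketch every round re-scores the full pool rather than only the surviving subset, so a pair dropped early can be re-admitted once the model improves. That is a design choice of the illustration, not something the summary specifies.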

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Related events (8)

5 · OpenAI Blog · 1mo ago

Improving language model behavior by training on a curated dataset

OpenAI published research showing that fine-tuning language models on a small, curated dataset can improve alignment with specific behavioral values. The work demonstrates a targeted approach to shaping model behavior without large-scale retraining. This represents an early contribution to what would become the RLHF and instruction-tuning research lineage.

AI Safety Research · Alignment and RLHF · curated dataset · OpenAI · behavioral fine-tuning

3 · Hugging Face Blog · 1mo ago

Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions

This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.

Multimodal Progress · Hugging Face · OpenAI · RSICD

5 · arXiv · cs.CL · 16d ago

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.

Evaluation and Benchmarking · Multimodal Progress · DOCCI · MS COCO · IIW

5 · arXiv · cs.CL · 14d ago

Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation

A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.

Evaluation and Benchmarking · Alignment and RLHF · Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

4 · arXiv · cs.AI · 20d ago

BabyCL: Continual multimodal learning from egocentric child video in a single chronological pass

Researchers introduce BabyCL, a continual learning framework that processes the SAYCam egocentric child video dataset in a single chronological pass rather than shuffled multi-epoch training, more closely mimicking how children actually encounter their environment. The system combines streaming visual representation learning with image-text contrastive objectives, a multi-stage temporal segmentation, and a dual replay buffer managing visual and multimodal histories. BabyCL outperforms streaming baselines on the SAYCam Labeled-S 4AFC benchmark under matched compute budgets, substantially closing the gap to offline training upper bounds. The work advances understanding of whether neural networks can acquire word-referent mappings under biologically plausible training conditions.

Evaluation and Benchmarking · Multimodal Progress · SAYCam · BabyCL · SAYCam Labeled-S 4AFC

4 · Qwen Research · 1mo ago

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

Alibaba's Qwen team released Chinese CLIP, a language-specific vision-language contrastive pretraining model targeting Chinese multimodal representation learning. The project addresses a gap in open-source Chinese CLIP models, particularly for cross-modal retrieval tasks. It follows the CLIP framework but is adapted to the Chinese language and cultural context.

Open Weights Progress · Multimodal Progress · contrastive vision-language pretraining · Chinese CLIP · CLIP

7 · arXiv · cs.LG · 13d ago

Interpretability-based pipeline for auditing and shaping post-training learning signals

Researchers introduce a data-centric post-training pipeline that applies interpretability methods to preference datasets before optimization, surfacing latent concepts that separate preferred from dispreferred generations. The approach unifies several interpretability-based training protocols as feature or data interventions that shape reward signals. Empirically, the pipeline diagnoses undesirable signals such as sycophancy and over-stylization, mitigates off-target learning, and can amplify desired properties like safety behaviors and model personality. The work reframes post-training from opaque scalar reward optimization into an auditable, concept-level sculpting process.

Evaluation and Benchmarking · AI Safety Research · Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

5 · arXiv · cs.CL · 28d ago

Self-Ensembling Vision-Language Models for Chart Data Extraction

This paper proposes a self-ensembling method for chart-to-table extraction using vision-language models (VLMs), where multiple tabular outputs are sampled from the same VLM for a given chart image and aggregated via per-cell median over numerical values. The approach includes convergence detection and uncertainty estimation based on sample dispersion. The authors also introduce WB-ChartExtract, a new benchmark built from World Bank data featuring charts with ~7x more datapoints than ChartQA. The method achieves up to 23% relative improvement on WB-ChartExtract over single-pass VLM baselines.

Evaluation and Benchmarking · Multimodal Progress · WB-ChartExtract · ChartQA · World Bank
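The per-cell median aggregation described in the summary above might look something like the following sketch. It assumes every sampled table has already been parsed into a numeric grid of the same shape, and it uses the median absolute deviation as one possible dispersion measure; the summary does not say which measure the paper uses, and the function name `aggregate_tables` is hypothetical.

```python
import numpy as np

def aggregate_tables(samples):
    """Per-cell median and dispersion across sampled tables of equal shape."""
    stack = np.stack([np.asarray(t, dtype=float) for t in samples])
    median = np.median(stack, axis=0)
    # Median absolute deviation per cell (assumed uncertainty proxy).
    mad = np.median(np.abs(stack - median), axis=0)
    return median, mad

# Three sampled extractions of a one-row, two-column table.
samples = [[[1, 10]], [[2, 10]], [[9, 10]]]
median, mad = aggregate_tables(samples)
```

Cells where the samples disagree get a larger dispersion value, which is the kind of signal the summary describes as the basis for uncertainty estimation.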