Cosmopedia
dataset · active
cosmopedia-1650c65e · 1 event · first seen 28d ago
Aliases: Cosmopedia
Recent events (1)
Cosmopedia: Creating Large-Scale Synthetic Data for Pre-training LLMs
Hugging Face introduces Cosmopedia, a large-scale synthetic dataset designed for pre-training large language models. The blog post details the methodology for generating diverse, high-quality synthetic text at scale using existing LLMs as data generators. The work addresses the growing challenge of data scarcity and quality in LLM pre-training pipelines.
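The idea of using an existing LLM as a data generator can be sketched roughly as follows. This is a minimal illustration, not the method from the blog post: the seed lists, the prompt template, and the helper names (`build_prompts`, `generate`, `fake_llm`) are all hypothetical, and the model call is replaced by a stand-in function so the example runs offline. It assumes diversity comes from combining several prompt dimensions, which is one plausible way to vary outputs at scale.

```python
from itertools import product

# Hypothetical seed lists; none of these values come from the summary above.
TOPICS = ["photosynthesis", "binary search"]
AUDIENCES = ["young children", "college students"]
FORMATS = ["textbook section", "blog post"]


def build_prompts(topics, audiences, formats):
    """Return one prompt per (topic, audience, format) combination."""
    return [
        f"Write a {fmt} about {topic} for {audience}."
        for topic, audience, fmt in product(topics, audiences, formats)
    ]


def generate(prompts, llm):
    """Apply a caller-supplied `llm` function (prompt -> text) to each prompt."""
    return [{"prompt": p, "text": llm(p)} for p in prompts]


def fake_llm(prompt):
    # Stand-in generator (assumption) so the sketch runs without a model;
    # a real pipeline would call an actual LLM here.
    return f"[generated text for: {prompt}]"


prompts = build_prompts(TOPICS, AUDIENCES, FORMATS)
samples = generate(prompts, fake_llm)
```

With two values per dimension this yields eight distinct prompts; growing any one list multiplies the number of prompts, which is why a combinatorial design like this scales cheaply.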