Almanac
dataset

Cosmopedia

dataset · active · cosmopedia-1650c65e · 1 event · first seen 28d ago

Aliases: Cosmopedia

Recent events (1)

Hugging Face Blog · 28d ago

Cosmopedia: Creating Large-Scale Synthetic Data for Pre-training LLMs

Hugging Face introduces Cosmopedia, a large-scale synthetic dataset designed for pre-training large language models. The blog post details the methodology for generating diverse, high-quality synthetic text at scale using existing LLMs as data generators. The work addresses the growing challenge of data scarcity and quality in LLM pre-training pipelines.