Almanac
dataset

Common Crawl

dataset · active · common-crawl-5e910890 · 3 events · first seen 26d ago

Aliases: Common Crawl

Recent events (3)

6 · arXiv · cs.AI · 26d ago

Temporally Ordered Pre-training Improves LLM Factual Freshness (Kairos)

Researchers from Kyutai pre-train 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training baselines. They introduce a benchmark of over 7,000 temporally grounded questions to evaluate whether models correctly associate facts with their corresponding time periods. Results show sequentially trained models match shuffled baselines on general language understanding while exhibiting more up-to-date and temporally precise factual knowledge. Code, checkpoints, and datasets are released under the Kairos project.

7 · The Batch · 4d ago

Study finds state media in training data causes LLMs to reflect government propaganda in native languages

Researchers from the University of Oregon, Purdue, UCSD, NYU, and Princeton found that state-controlled media is heavily overrepresented in web-scraped training datasets, causing Claude 3 Sonnet and GPT-4o to express significantly more favorable attitudes toward authoritarian governments when prompted in those governments' native languages. Chinese state media accounts for over 40 times as many documents in CulturaX as Chinese Wikipedia does, and both models reproduced state-media strings at rates of 3-5%. When prompted in Chinese rather than English on the same topics, both models favored China's government roughly 68-75% of the time, and the effect scaled with a country's World Press Freedom Index ranking.

6 · arXiv · cs.AI · 4h ago

Stanford EDGAR Filings Dataset: 152B-token open corpus of SEC filings for LLM pretraining

Stanford researchers introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown. They release a 152B-token initial snapshot and describe a larger 550B-token archive. The dataset targets the growing scarcity of high-quality long-context pretraining data and has less than 0.1% overlap with Common Crawl-derived corpora. Two derived benchmarks are also introduced: EDGAR-Forecast for filing-grounded numerical forecasting and EDGAR-OCR for complex financial table transcription. The work addresses a real gap in open long-context training data outside narrow domains like code.