Entity · dataset

Common Crawl

dataset · active · common-crawl-5e910890 · 4 events · first seen May 22, 2026

Aliases: Common Crawl

Co-occurring entities

More like this (12)

Crawl4AI · CRAG · CQL · BrowseComp · news-crawler-LM · Common Pile · Community Tools · BrowseComp-Plus · SearchGen-Corpus-1M · Collective Intelligence Project · Mozilla Common Voice · ClawBot

Recent events (4)

7 · The Batch · Jul 3, 2026

Microsoft reveals MAI-Thinking-1, a from-scratch reasoning model with MoE architecture

Microsoft introduced MAI-Thinking-1, its first reasoning language model built without distillation from third-party models, comparable in size to Claude Sonnet 4.6. The model uses a mixture-of-experts architecture (1T total / 35B active parameters), was pretrained on 30 trillion tokens of primarily licensed human-generated data, and trained via reinforcement learning across specialist models for STEM, coding, and safety. It scored 97.0% on AIME 2025, placing third behind Claude Opus 4.6 and ahead of DeepSeek V3.2, and is available in private preview via Microsoft Foundry. The release marks a strategic shift as Microsoft moves to reduce dependence on OpenAI models following a renegotiated partnership in April 2026.

Training Infrastructure · Frontier Model Releases · MAI-Thinking-1 · Claude Sonnet 4 · Claude Opus 4.6 · +12 more

6 · arXiv cs.AI · Jun 17, 2026

Stanford EDGAR Filings Dataset: 152B-token open corpus of SEC filings for LLM pretraining

Stanford researchers introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown. They release a 152B-token initial snapshot and describe a larger 550B-token archive. The dataset targets the growing scarcity of high-quality long-context pretraining data, and less than 0.1% of it overlaps with Common Crawl-derived corpora. Two derived benchmarks are also introduced: EDGAR-Forecast for filing-grounded numerical forecasting and EDGAR-OCR for complex financial table transcription. The work addresses a gap in open long-context training data outside narrow domains such as code.

Training Infrastructure · Long Context Evolution · EDGAR-OCR · EDGAR-Forecast · Stanford University · +3 more

7 · The Batch · Jun 12, 2026

Study finds state media in training data causes LLMs to reflect government propaganda in native languages

Researchers from University of Oregon, Purdue, UCSD, NYU, and Princeton found that state-controlled media is heavily overrepresented in web-scraped training datasets, causing Claude 3 Sonnet and GPT-4o to express significantly more favorable attitudes toward authoritarian governments when prompted in those governments' native languages. Chinese state media accounts for over 40x more documents in CulturaX than Chinese Wikipedia, and both models reproduced state-media strings at 3-5% rates. When prompted in Chinese, both models favored China's government roughly 68-75% of the time versus English prompts on the same topics, with the effect scaling with a country's World Press Freedom Index ranking.

Frontier Model Releases · Evaluation and Benchmarking · New York University · University of California San Diego · CulturaX · +14 more

6 · arXiv cs.AI · May 22, 2026

Temporally Ordered Pre-training Improves LLM Factual Freshness (Kairos)

Researchers from Kyutai pre-train 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training baselines. They introduce a benchmark of over 7,000 temporally grounded questions to evaluate whether models correctly associate facts with their corresponding time periods. Results show sequentially trained models match shuffled baselines on general language understanding while exhibiting more up-to-date and temporally precise factual knowledge. Code, checkpoints, and datasets are released under the Kairos project.

Training Infrastructure · Frontier Model Releases · Kyutai · Common Crawl · temporally ordered pre-training · +3 more