Entity · benchmark

EDGAR-OCR

benchmark · active · edgar-ocr-714a3b2a · 1 event · first seen Jun 17, 2026

Aliases: EDGAR-OCR

Co-occurring entities

EDGAR-Forecast · Stanford University · Common Crawl · Stanford EDGAR Filings Dataset

More like this (12)

SEC EDGAR · EDGAR-Forecast · Stanford EDGAR Filings Dataset · PP-OCRv6 · OCR-Robust · SECDA-DSE · TrOCR · GLM-OCR · DeepSeek-OCR-2 · Azure OCR · ERC-8004 · olmOCR

Recent events (1)

6 · arXiv · cs.AI · Jun 17, 2026

Stanford EDGAR Filings Dataset: 152B-token open corpus of SEC filings for LLM pretraining

Stanford researchers introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown. The initial release is a 152B-token snapshot, and a larger 550B-token archive is also described. The dataset targets the growing scarcity of high-quality long-context pretraining data and has less than 0.1% overlap with Common Crawl-derived corpora. Two derived benchmarks are introduced alongside it: EDGAR-Forecast, for filing-grounded numerical forecasting, and EDGAR-OCR, for complex financial table transcription. The work addresses a gap in open long-context training data outside narrow domains such as code.

Training Infrastructure Long Context Evolution EDGAR-OCR EDGAR-Forecast Stanford University