Almanac
benchmark

EDGAR-OCR

benchmarkactiveprovisionaledgar-ocr-714a3b2a·1 events·first seen 5h ago

Aliases: EDGAR-OCR

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.AI·5h ago·source ↗

Stanford EDGAR Filings Dataset: 152B-token open corpus of SEC filings for LLM pretraining

Stanford researchers introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown, releasing a 152B-token initial snapshot with a larger 550B-token archive described. The dataset targets the growing scarcity of high-quality long-context pretraining data, with less than 0.1% overlap with Common Crawl-derived corpora. Two derived benchmarks are also introduced: EDGAR-Forecast for filing-grounded numerical forecasting and EDGAR-OCR for complex financial table transcription. The work addresses a real gap in open long-context training data outside narrow domains like code.