benchmark
EDGAR-Forecast
benchmark · active · provisional
edgar-forecast-3e38a887 · 1 event · first seen 7h ago
Aliases: EDGAR-Forecast
Recent events (1)
Stanford EDGAR Filings Dataset: 152B-token open corpus of SEC filings for LLM pretraining
Stanford researchers introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown. They release a 152B-token initial snapshot and describe a larger 550B-token archive. The dataset targets the growing scarcity of high-quality long-context pretraining data and has less than 0.1% overlap with Common Crawl-derived corpora. Two derived benchmarks are also introduced: EDGAR-Forecast, for filing-grounded numerical forecasting, and EDGAR-OCR, for transcription of complex financial tables. The work addresses a gap in open long-context training data outside narrow domains such as code.