6 · arXiv cs.AI (Artificial Intelligence) · 4d ago

Stanford EDGAR Filings Dataset: 152B-token open corpus of SEC filings for LLM pretraining

Stanford researchers introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown. The initial release is a 152B-token snapshot, and the authors describe a larger 550B-token archive. The dataset targets the growing scarcity of high-quality long-context pretraining data and reportedly overlaps by less than 0.1% with Common Crawl-derived corpora. Two derived benchmarks are also introduced: EDGAR-Forecast for filing-grounded numerical forecasting and EDGAR-OCR for complex financial table transcription. The work addresses a real gap in open long-context training data outside narrow domains like code.

Training Infrastructure · Long Context Evolution · Evaluation and Benchmarking · EDGAR-OCR · EDGAR-Forecast · Stanford University · Common Crawl · Stanford EDGAR Filings Dataset

Related guides (3)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 24d ago

IPO-Mine: Toolkit and Dataset for Multimodal Analysis of Long IPO Filings

Researchers introduce IPO-Mine, comprising an open-source toolkit and a large-scale dataset of over 109,000 IPO filings (1994–2026) with 76,000+ extracted images, structured for section-level analysis. The toolkit parses long regulatory documents (often exceeding 500,000 tokens) into standardized text and image outputs. Benchmark tasks on financial chart quality and misleadingness assessment reveal that state-of-the-art multimodal models frequently diverge from expert human judgments, exposing alignment gaps in long-document multimodal reasoning. The dataset and code are publicly released under CC-BY-4.0.

Long Context Evolution · Evaluation and Benchmarking · IPO-Dataset · IPO-Toolkit · IPO-Mine

5 · arXiv · cs.CL · 26d ago

StakeBench: A Market-Commitment-Grounded Benchmark for Financial Language Understanding

StakeBench is a new evaluation framework linking 560,876 comments from 2,261 resolved prediction markets (Polymarket and Manifold) to verified trading positions, actions, and market-odds records, replacing human annotation with observable market behavior as supervision. Four diagnostic tasks test commitment detection, side identification, action anticipation, and collective odds projection, evaluated across 15 LLMs. Results reveal structural failures: models partially recover position-side signals (Directed Accuracy 0.506–0.599) but collapse on action anticipation and fail to beat naive baselines on odds projection. Notably, model scale shows no correlation with performance, and finance-domain fine-tuning does not improve revealed-side identification.

Frontier Model Releases · Evaluation and Benchmarking · Manifold · StakeBench · Polymarket

4 · arXiv · cs.CL · 2d ago

STAGE pipeline generates source-grounded training data for text-to-JSON extraction

Researchers introduce STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a data generation pipeline that uses LLMs to synthesize training data for structured extraction from long unstructured documents, validating outputs against underlying spreadsheets. Evaluated on STAGE-Eval, an 851-example benchmark, the pipeline substantially improves Qwen3-4B performance, raising exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%. The work targets a practical bottleneck in enterprise document processing: reliably converting financial filings and clinical records into machine-readable JSON.
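The summary reports two scores, exact match and value accuracy, without defining them. As a rough illustration of how such metrics commonly differ, here is a minimal sketch in Python. The definitions are assumptions, not the STAGE authors' method: exact match is taken to mean the whole predicted JSON object equals the reference, and value accuracy is taken to mean the fraction of reference leaf values reproduced at the same path. The helper names `flatten`, `exact_match`, and `value_accuracy` and the sample records are hypothetical.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into a {path: leaf_value} mapping."""
    if isinstance(obj, dict):
        items = ((f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items())
    elif isinstance(obj, list):
        items = ((f"{prefix}[{i}]", v) for i, v in enumerate(obj))
    else:
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def exact_match(pred: Any, gold: Any) -> bool:
    """Assumed definition: the prediction equals the reference as a whole."""
    return pred == gold


def value_accuracy(pred: Any, gold: Any) -> float:
    """Assumed definition: share of reference leaves matched at the same path."""
    gold_flat = flatten(gold)
    pred_flat = flatten(pred)
    if not gold_flat:
        return 1.0 if not pred_flat else 0.0
    hits = sum(
        1
        for path, value in gold_flat.items()
        if path in pred_flat and pred_flat[path] == value
    )
    return hits / len(gold_flat)


# Hypothetical reference and prediction for one extracted record.
gold = {
    "revenue": 120.5,
    "currency": "USD",
    "segments": [{"name": "Cloud", "revenue": 80.0}],
}
pred = {
    "revenue": 120.5,
    "currency": "EUR",  # one wrong leaf value
    "segments": [{"name": "Cloud", "revenue": 80.0}],
}

print(exact_match(pred, gold))     # False: a single wrong field fails the record
print(value_accuracy(pred, gold))  # 0.75: three of four leaf values match
```

Under these assumed definitions, exact match is the stricter score, which is consistent with the reported exact-match figures sitting below the value-accuracy figures.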

Evaluation and Benchmarking · Enterprise Deployment Patterns · STAGE · Qwen3-4B · STAGE-Eval

9 · Deepseek News · 1mo ago

DeepSeek V4 Preview Release: 1.6T-param Pro and 284B Flash Models with 1M Context, Open-Sourced

DeepSeek has released DeepSeek-V4 as an open-weights preview, comprising two MoE variants: V4-Pro (1.6T total / 49B active parameters) and V4-Flash (284B total / 13B active parameters). Both models support a 1M-token context by default, enabled by a novel architecture that combines token-wise compression with DeepSeek Sparse Attention (DSA). V4-Pro claims open-source SOTA on agentic coding benchmarks and math, STEM, and coding performance rivaling top closed-source models, while V4-Flash offers near-parity reasoning at lower cost and latency. The API is live with OpenAI and Anthropic compatibility, and legacy model endpoints will be retired in July 2026.

Long Context Evolution · Frontier Model Releases · DeepSeek V4 · DeepSeek-V4-Flash · Claude Code

7 · arXiv · cs.AI · 1mo ago

DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks—open-web search, evidence collection, and multi-step derivation—where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.

Frontier Model Releases · Evaluation and Benchmarking · deep research agents · DeepWeb-Bench · Retrieval-Augmented Generation

4 · Hugging Face Blog · 1mo ago

Introducing the Open FinLLM Leaderboard

Hugging Face has launched the Open FinLLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on financial domain tasks. The leaderboard aims to provide standardized, open evaluation of LLMs across finance-specific capabilities such as financial reasoning, document understanding, and numerical analysis. This fills a gap in domain-specific evaluation infrastructure for the financial sector.

Evaluation and Benchmarking · Enterprise Deployment Patterns · FinBench · Open LLM Leaderboard · Hugging Face

4 · Hugging Face Blog · 1mo ago

Open-Source Text Generation & LLM Ecosystem at Hugging Face

Hugging Face published a blog post surveying the open-source LLM ecosystem as of mid-2023, covering text generation models, tooling, and deployment patterns available on the platform. The post highlights the breadth of open-weight models and associated infrastructure for inference and fine-tuning. It serves as a reference overview of the state of open-source LLMs at that point in time.

Open Weights Progress · Inference Economics · Hugging Face

4 · arXiv · cs.AI · 6d ago

AudioDER: Deduplication-enhanced reasoning dataset for post-training large audio-language models

Researchers introduce AudioDER, a ~191k-sample post-training dataset for Large Audio-Language Models (LALMs) built via an acoustic similarity-based deduplication pipeline to reduce redundancy and improve corpus diversity. Each sample pairs an audio clip with a multiple-choice question, answer candidates, a caption, and a chain-of-thought rationale generated by Qwen3-30B. Post-training Qwen2-Audio-7B-Instruct on AudioDER yields consistent gains on audio reasoning benchmarks including MMAU-mini, MMSU, and MMAR. The work addresses a data quality gap in audio-language training rather than proposing a new model architecture.

Evaluation and Benchmarking · Multimodal Progress · AudioDER · Qwen2-Audio-7B-Instruct · Qwen3-30B