Almanac
4arXiv cs.CL (Computation and Language)·47h ago

STAGE pipeline generates source-grounded training data for text-to-JSON extraction

Researchers introduce STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a data generation pipeline that uses LLMs to synthesize training data for structured extraction from long unstructured documents, validating outputs against underlying spreadsheets. Evaluated on STAGE-Eval, an 851-example benchmark, the pipeline substantially improves Qwen3-4B performance, raising exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%. The work targets a practical bottleneck in enterprise document processing: reliably converting financial filings and clinical records into machine-readable JSON.
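The two reported metrics can be illustrated with a small sketch. The summary does not define them, so the definitions here are assumptions: exact match is taken as full equality of the predicted and reference JSON, and value accuracy as the share of reference leaf values reproduced at the same path. The helper names `flatten`, `exact_match`, and `value_accuracy` are hypothetical and not from the paper.

```python
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Map every leaf value of a JSON-like object to a dotted path."""
    out: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}[{index}]"))
    else:
        out[prefix] = obj
    return out


def exact_match(pred: Any, gold: Any) -> bool:
    """Assumed definition: the whole predicted object equals the reference."""
    return pred == gold


def value_accuracy(pred: Any, gold: Any) -> float:
    """Assumed definition: fraction of reference leaves matched at the same path."""
    gold_leaves = flatten(gold)
    pred_leaves = flatten(pred)
    if not gold_leaves:
        return 1.0 if not pred_leaves else 0.0
    hits = sum(
        1 for path, value in gold_leaves.items()
        if path in pred_leaves and pred_leaves[path] == value
    )
    return hits / len(gold_leaves)


gold = {"company": "Acme", "revenue": {"2023": 10, "2024": 12}}
pred = {"company": "Acme", "revenue": {"2023": 10, "2024": 13}}
print(exact_match(pred, gold), value_accuracy(pred, gold))
```

Under these assumed definitions a single wrong value fails exact match while leaving value accuracy high, which is one way the two numbers can diverge as they do in the reported results.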

Related events (8)

5arXiv · cs.CL·19d ago·source ↗

Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

This paper proposes Semantic Triplet Restoration (STR), a table serialization protocol that rewrites each cell as an atomic fact <item path, feature path, value> to make header-cell alignments explicit for LLMs, replacing HTML/Markdown representations. The authors also introduce TripletQL, a query-aware router that selects relevant triplets per question. Evaluated on four Chinese and English table-QA benchmarks, STR matches or outperforms HTML-based baselines while reducing input token count. Benefits are most pronounced for smaller models and longer tables, suggesting value under constrained inference budgets.
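A minimal sketch of the triplet idea, assuming a table is already available as row header paths, column header paths, and a grid of cells. The function names, the " / " path separator, and the output layout are illustrative assumptions, not the paper's actual protocol, and the TripletQL router is not modeled.

```python
from typing import List, Optional, Sequence, Tuple

Triplet = Tuple[str, str, str]


def table_to_triplets(
    row_paths: Sequence[Sequence[str]],
    col_paths: Sequence[Sequence[str]],
    cells: Sequence[Sequence[Optional[object]]],
) -> List[Triplet]:
    """Rewrite each non-empty cell as (item path, feature path, value)."""
    triplets: List[Triplet] = []
    for row_path, row in zip(row_paths, cells):
        for col_path, value in zip(col_paths, row):
            if value is None:
                continue  # skip empty cells
            triplets.append((" / ".join(row_path), " / ".join(col_path), str(value)))
    return triplets


def serialize(triplets: Sequence[Triplet]) -> str:
    """One atomic fact per line, in <item path, feature path, value> form."""
    return "\n".join(f"<{item}, {feature}, {value}>" for item, feature, value in triplets)


row_paths = [["Revenue", "Products"], ["Revenue", "Services"]]
col_paths = [["2023", "Q1"], ["2023", "Q2"]]
cells = [[10, 12], [7, None]]

triplets = table_to_triplets(row_paths, col_paths, cells)
print(serialize(triplets))
```

Each line carries its full header context, so a model does not have to infer which header a cell belongs to from markup position.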

5arXiv · cs.CL·24d ago·source ↗

Self-Ensembling Vision-Language Models for Chart Data Extraction

This paper proposes a self-ensembling method for chart-to-table extraction using vision-language models (VLMs), where multiple tabular outputs are sampled from the same VLM for a given chart image and aggregated via per-cell median over numerical values. The approach includes convergence detection and uncertainty estimation based on sample dispersion. The authors also introduce WB-ChartExtract, a new benchmark built from World Bank data featuring charts with ~7x more datapoints than ChartQA. The method achieves up to 23% relative improvement on WB-ChartExtract over single-pass VLM baselines.
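The per-cell median aggregation described above can be sketched as follows. This assumes every sampled table has already been parsed into a numeric grid of the same shape. The function name `ensemble_tables` is hypothetical, the population standard deviation is used here as one possible dispersion measure, and the paper's convergence detection is not implemented.

```python
from statistics import median, pstdev
from typing import List, Tuple

Table = List[List[float]]


def ensemble_tables(samples: List[Table]) -> Tuple[Table, Table]:
    """Aggregate same-shaped numeric tables cell by cell.

    Returns (median table, dispersion table). Dispersion is the
    population standard deviation across samples, an assumed stand-in
    for the paper's uncertainty estimate.
    """
    n_rows = len(samples[0])
    n_cols = len(samples[0][0])
    aggregated: Table = []
    dispersion: Table = []
    for r in range(n_rows):
        agg_row, disp_row = [], []
        for c in range(n_cols):
            values = [sample[r][c] for sample in samples]
            agg_row.append(median(values))
            disp_row.append(pstdev(values))
        aggregated.append(agg_row)
        dispersion.append(disp_row)
    return aggregated, dispersion


# Three hypothetical extractions of the same chart; one contains an outlier.
samples = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.2], [3.0, 40.0]],
    [[1.0, 2.1], [3.0, 4.2]],
]
aggregated, dispersion = ensemble_tables(samples)
print(aggregated)
```

The median ignores the single outlying reading of 40.0, while the high dispersion for that cell flags it as uncertain.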

4Hugging Face Blog·1mo ago·source ↗

Efficient Table Pre-training without Real Data: An Introduction to TAPEX

TAPEX is a table pre-training approach that avoids reliance on real tabular data by instead training a language model to simulate SQL query execution over synthetic tables. The method achieves strong performance on table-based question answering and fact verification benchmarks. This Hugging Face blog post introduces the technique and its integration into the Hugging Face ecosystem.
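A rough sketch of how one such training pair could be produced, using Python's standard `sqlite3` module to execute a query over a small synthetic table. The table contents, the single fixed query, the flattened input format, and the function name `make_example` are assumptions for illustration and may differ from what TAPEX actually uses.

```python
import random
import sqlite3
from typing import List, Tuple


def make_example(seed: int = 0) -> Tuple[str, str, List[Tuple[str, int]]]:
    """Build one (model input, target answer) pair from a synthetic table."""
    rng = random.Random(seed)
    scores = rng.sample(range(1, 101), 4)  # distinct values, so the top row is unique
    rows = [(f"city_{i}", score) for i, score in enumerate(scores)]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

    query = "SELECT name FROM t ORDER BY score DESC LIMIT 1"
    answer = conn.execute(query).fetchone()[0]  # the real SQL engine supplies the label
    conn.close()

    # Assumed flattening: the query followed by the header and the rows.
    flat_rows = " ".join(
        f"row {i + 1} : {name} | {score}" for i, (name, score) in enumerate(rows)
    )
    source = f"{query} col : name | score {flat_rows}"
    return source, answer, rows


source, answer, rows = make_example(seed=0)
print(source)
print(answer)
```

Because the database engine computes the target, no human-labeled or real-world table is needed to create supervision.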

7OpenAI Blog·1mo ago·source ↗

Introducing Structured Outputs in the API

OpenAI is introducing Structured Outputs in its API, enabling model responses to reliably conform to developer-supplied JSON Schemas. This feature addresses a longstanding pain point in production deployments where inconsistent output formatting required extensive post-processing. The capability is available via the API and targets developers building applications that depend on structured data from language models.

4GitHub Trending·22d ago·source ↗

MinerU: Document-to-LLM-Ready Markdown/JSON Conversion Tool

MinerU is an open-source Python tool by OpenDataLab that converts complex documents (PDFs, Office files) into structured Markdown or JSON optimized for LLM and agentic workflows. The repository has accumulated 65,610 GitHub stars, with 180 new stars today, indicating sustained community traction. It targets a common preprocessing bottleneck in RAG and agent pipelines.

6arXiv · cs.AI·3d ago·source ↗

Stanford EDGAR Filings Dataset: 152B-token open corpus of SEC filings for LLM pretraining

Stanford researchers introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown. They release a 152B-token initial snapshot and describe a larger 550B-token archive. The dataset targets the growing scarcity of high-quality long-context pretraining data and has less than 0.1% overlap with Common Crawl-derived corpora. Two derived benchmarks are also introduced: EDGAR-Forecast for filing-grounded numerical forecasting and EDGAR-OCR for complex financial table transcription. The work addresses a real gap in open long-context training data outside narrow domains like code.

4arXiv · cs.CL·11d ago·source ↗

TABVERSE benchmark isolates table representation effects across formats in LLMs and VLMs

TABVERSE is a new controlled multimodal benchmark that evaluates LLMs and VLMs on table understanding by holding table content fixed while varying representation format (HTML, Markdown, LaTeX, rendered images). Evaluation across three tasks—Question Answering, Structural Understanding, and Structure Reconstruction—shows that representation choice substantially affects performance, with structured text generally outperforming rendered images and HTML being the most robust text format. The benchmark addresses a gap in existing evaluations where content, format, and modality vary simultaneously, making it impossible to isolate representation effects.
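To make the controlled setup concrete, the sketch below renders one fixed table in three of the text formats named above. The renderer functions are hypothetical helpers rather than the benchmark's code. HTML cells are escaped with the standard `html.escape`, while the Markdown and LaTeX renderers assume cells contain no special characters.

```python
from html import escape
from typing import List, Sequence


def to_markdown(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    lines: List[str] = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    head = "".join(f"<th>{escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    spec = "l" * len(header)  # one left-aligned column per header
    lines = [f"\\begin{{tabular}}{{{spec}}}", " & ".join(header) + " \\\\"]
    lines += [" & ".join(row) + " \\\\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)


header = ["Item", "Value"]
rows = [["Revenue", "10"], ["Cost", "7"]]

# Same content, three representations: only the format varies.
for render in (to_markdown, to_html, to_latex):
    print(render(header, rows))
    print()
```

Holding `header` and `rows` fixed while swapping the renderer is what lets any performance difference be attributed to the representation alone.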

4Hugging Face Blog·1mo ago·source ↗

Improving Prompt Consistency with Structured Generations

This Hugging Face blog post examines how structured generation outputs can improve consistency in LLM evaluation pipelines. It explores techniques for constraining model outputs to specific formats, reducing variability in prompt-based assessments. The post addresses a practical challenge in evaluation workflows where inconsistent response formats degrade measurement reliability.