Almanac
STAGE

technique · active · provisional · stage-3f4983c6 · 1 event · first seen 47h ago

Aliases: STAGE

Recent events (1)

4 · arXiv · cs.CL · 47h ago

STAGE pipeline generates source-grounded training data for text-to-JSON extraction

Researchers introduce STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a data generation pipeline that uses LLMs to synthesize training data for structured extraction from long unstructured documents and validates the generated outputs against the underlying spreadsheets. On STAGE-Eval, an 851-example benchmark, the pipeline raises Qwen3-4B exact match from 31.37% to 74.27% (+42.90 points) and value accuracy from 45.46% to 90.69% (+45.23 points). The work targets a practical bottleneck in enterprise document processing: reliably converting financial filings and clinical records into machine-readable JSON.
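To make the two reported metrics concrete, here is a minimal sketch of how exact match and value accuracy could be computed for a single text-to-JSON prediction. The definitions are assumptions for illustration, not the paper's own scoring code: exact match is taken as whole-object equality after key sorting, and value accuracy as the fraction of gold leaf values reproduced at the same path. The helper names (`flatten`, `exact_match`, `value_accuracy`) and the sample records are hypothetical.

```python
import json
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested JSON into {path: leaf_value} pairs (empty containers yield nothing)."""
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def exact_match(pred: Any, gold: Any) -> bool:
    """Assumed definition: the whole object matches, ignoring key order."""
    return json.dumps(pred, sort_keys=True) == json.dumps(gold, sort_keys=True)


def value_accuracy(pred: Any, gold: Any) -> float:
    """Assumed definition: share of gold leaf values found at the same path in pred."""
    gold_flat = flatten(gold)
    pred_flat = flatten(pred)
    if not gold_flat:
        return 1.0 if not pred_flat else 0.0
    hits = sum(
        1
        for path, value in gold_flat.items()
        if path in pred_flat and pred_flat[path] == value
    )
    return hits / len(gold_flat)


# Hypothetical example: one wrong leaf out of three.
gold = {"company": "Acme", "revenue": {"2023": 120, "2024": 150}}
pred = {"company": "Acme", "revenue": {"2023": 120, "2024": 155}}

print(exact_match(pred, gold))
print(round(value_accuracy(pred, gold), 4))
```

Under these assumed definitions a single wrong field fails exact match while still earning partial credit on value accuracy, which is consistent with the reported value accuracy sitting above exact match in both the before and after figures.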