4 · arXiv cs.CL (Computation and Language) · 18d ago

ODTQA-FoRe: Open-Domain Tabular QA Dataset for Future Data Forecasting and Reasoning

The paper introduces ODTQA-FoRe, a new benchmark dataset for open-domain tabular question answering focused on time-series forecasting and forecast-based reasoning using real estate data. The authors also propose TimeFore, an LLM agent framework that decomposes the task into three roles: a SQL-generating Retriever, a Forecaster that calls external time-series models, and an Analyzer that synthesizes results. The work targets a gap in existing tabular QA systems, which typically cannot perform future-oriented numerical prediction. Experiments demonstrate TimeFore's effectiveness on the new benchmark.
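The three-role split described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the `prices` table and its values are made up, the Retriever runs a fixed SQL query where TimeFore has an LLM generate one, the Forecaster is a plain least-squares trend line standing in for an external time-series model, and the Analyzer is a string template rather than an LLM.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory table of made-up yearly prices (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (region TEXT, year INTEGER, price REAL)")
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?, ?)",
        [("north", 2020, 100.0), ("north", 2021, 110.0),
         ("north", 2022, 120.0), ("north", 2023, 130.0)],
    )
    return conn


def retrieve(conn, region):
    """Retriever role (stand-in): fetch the historical series with SQL.

    In the described framework an LLM writes the query; here it is fixed.
    """
    return conn.execute(
        "SELECT year, price FROM prices WHERE region = ? ORDER BY year",
        (region,),
    ).fetchall()


def forecast(history, steps):
    """Forecaster role (stand-in): extrapolate a least-squares trend line.

    Needs at least two points with distinct years.
    """
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in history)
             / sum((x - mean_x) ** 2 for x, _ in history))
    intercept = mean_y - slope * mean_x
    last_year = history[-1][0]
    return [(last_year + k, intercept + slope * (last_year + k))
            for k in range(1, steps + 1)]


def analyze(region, history, predictions):
    """Analyzer role (stand-in): turn the numbers into a short answer."""
    last_year, last_price = history[-1]
    target_year, predicted = predictions[-1]
    change = (predicted - last_price) / last_price * 100
    return (f"{region}: projected price in {target_year} is {predicted:.1f} "
            f"({change:+.1f}% vs {last_year})")


def answer(conn, region, steps):
    """Chain the three roles: retrieve, forecast, analyze."""
    history = retrieve(conn, region)
    return analyze(region, history, forecast(history, steps))


if __name__ == "__main__":
    print(answer(build_demo_db(), "north", 2))
```

Keeping each role behind its own function means the trend-line stand-in could be swapped for a real forecasting model without touching retrieval or analysis.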

Evaluation and Benchmarking · Agent and Tool Ecosystem · SQL generation · TimeFore · time-series forecasting · ODTQA-FoRe · tabular question answering

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · Hugging Face Blog · 1mo ago

Docmatix: A Large-Scale Dataset for Document Visual Question Answering

Hugging Face released Docmatix, a large-scale dataset designed for Document Visual Question Answering (DocVQA) tasks. The dataset aims to address the scarcity of high-quality training data for document understanding in multimodal models. It is intended to improve fine-tuning of vision-language models on document comprehension tasks.

Evaluation and Benchmarking · Multimodal Progress · Hugging Face · Document Visual Question Answering · Docmatix

5 · arXiv · cs.CL · 11d ago

DocTrace: Structure-Aware On-Demand Hypergraph Memory for Long-Document QA

Researchers introduce DocTrace, a multi-agent RAG framework for long-document question answering that uses query-triggered knowledge organization rather than costly query-agnostic preprocessing. The system combines a lightweight document structural tree index, on-demand hypergraph working memory, and a graph-structured experience memory that stores successful reasoning plans for reuse. Evaluated on four long-document QA datasets, DocTrace outperforms the strongest baseline (ComoRAG) by up to 8.85% F1 and 4.40% EM while reducing computational cost by 53.32%.

Long Context Evolution · Agent and Tool Ecosystem · ComoRAG · DocTrace · Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

4 · Hugging Face Blog · 1mo ago

Efficient Table Pre-training without Real Data: An Introduction to TAPEX

TAPEX is a table pre-training approach that avoids reliance on real tabular data by instead training a language model to simulate SQL query execution over synthetic tables. The method achieves strong performance on table-based question answering and fact verification benchmarks. This Hugging Face blog post introduces the technique and its integration into the Hugging Face ecosystem.

Evaluation and Benchmarking · Agent and Tool Ecosystem · TAPEX · Hugging Face · SQL · +1 more

5 · Hugging Face Blog · 1mo ago

Back to The Future: Evaluating AI Agents on Predicting Future Events

This Hugging Face blog post introduces FutureBench, a benchmark designed to evaluate AI agents on their ability to predict future events, addressing the challenge of data contamination in standard benchmarks by using temporally forward-looking tasks. The approach tests whether agents can reason about and forecast outcomes beyond their training data cutoff. This framing positions future-event prediction as a rigorous, contamination-resistant evaluation methodology for frontier models and agents.

Evaluation and Benchmarking · Agent and Tool Ecosystem · FutureBench · Hugging Face

5 · Hugging Face Blog · 1mo ago

DABStep: Data Agent Benchmark for Multi-step Reasoning

Hugging Face introduces DABStep, a benchmark designed to evaluate data agents on multi-step reasoning tasks. The benchmark targets agentic systems that must perform complex, sequential data operations rather than single-step queries. It aims to fill a gap in evaluation tooling for realistic data analysis workflows involving tool use and chained reasoning.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DABStep · Hugging Face

5 · arXiv · cs.AI · 18d ago

LLM Agent Framework for Last-Mile Time Series Forecasting Revision

This paper introduces a 'last-mile forecasting' framework in which an LLM agent sits atop a statistical forecasting backbone to incorporate weakly structured business context (holidays, campaigns, expert feedback, external events) into decision-ready forecasts. The system uses tool invocation for contextual retrieval, converts reasoning into explicit revision actions under safety constraints, and supports long-horizon forecasting via map-reduce decomposition with a memory bank for post-hoc reflection. The authors validate the approach through real-world case studies, positioning it as a bridge between statistical prediction and operationally usable forecasts.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Map-Reduce Decomposition · Last-Mile Forecasting Framework · Time Series Foundation Models · +2 more

4 · arXiv · cs.CL · 2d ago

CADE framework proposes direct timestep embedding and contrastive alignment for time-series question answering

A new arXiv preprint introduces CADE (Contrastive Alignment with Direct Embedding), a framework for time-series question answering (TSQA) that bypasses the tokenization bottleneck of standard LLMs by mapping each timestep directly into the LLM embedding space via a point-wise linear encoder and MLP projector. The approach also introduces a one-directional supervised contrastive loss to align time-series embeddings with frozen class-name text anchors, bridging the semantic gap between numerical and language representations. Evaluated on the Time-MQA benchmark across six TSQA tasks, CADE outperforms both open-source and proprietary LLM baselines. The work addresses a concrete limitation of patch-based encoders (fixed granularity and poor cross-dataset transfer) with a cleaner architectural alternative.
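To make the idea of aligning a series embedding with frozen text anchors concrete, here is a minimal sketch of one common way such a loss can be written. It is an assumption, not CADE's published formulation: the loss is a softmax cross-entropy over temperature-scaled cosine similarities between one time-series embedding and a fixed set of class anchors, computed only in the series-to-anchor direction. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_contrastive_loss(embedding, anchors, label, temperature=0.1):
    """Cross-entropy of the true class over similarities to all class anchors.

    `anchors` are treated as constants (the frozen text side); in training,
    only the time-series encoder producing `embedding` would receive gradients.
    """
    logits = [cosine(embedding, a) / temperature for a in anchors]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[label]


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]   # toy stand-ins for class-name embeddings
    series_embedding = [0.9, 0.1]        # toy stand-in for an encoded time series
    print(anchor_contrastive_loss(series_embedding, anchors, label=0))
    print(anchor_contrastive_loss(series_embedding, anchors, label=1))
```

The loss is small when the embedding points toward its own class anchor and grows as it drifts toward another class, which is the pull this kind of alignment objective relies on.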

Evaluation and Benchmarking · Multimodal Progress · Time-MQA · Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering · CADE

7 · arXiv · cs.AI · 1mo ago

Toto 2.0: Open-Weights Time Series Foundation Models Demonstrate Scaling Laws from 4M to 2.5B Parameters

Datadog releases Toto 2.0, a family of five open-weights time series forecasting models ranging from 4M to 2.5B parameters, demonstrating consistent forecast quality improvements with scale. The models achieve state-of-the-art results on three benchmarks: BOOM (observability), GIFT-Eval (general-purpose), and TIME (contamination-resistant). The release includes architectural details, a u-muP hyperparameter transfer pipeline, and all base checkpoints under Apache 2.0 license.

Training Infrastructure · Frontier Model Releases · Toto 2.0 · GIFT-Eval · TIME · +5 more