5 · arXiv cs.CL (Computation and Language) · 25d ago

StakeBench: A Market-Commitment-Grounded Benchmark for Financial Language Understanding

StakeBench is a new evaluation framework linking 560,876 comments from 2,261 resolved prediction markets (Polymarket and Manifold) to verified trading positions, actions, and market-odds records, replacing human annotation with observable market behavior as supervision. Four diagnostic tasks test commitment detection, side identification, action anticipation, and collective odds projection, evaluated across 15 LLMs. Results reveal structural failures: models partially recover position-side signals (Directed Accuracy 0.506–0.599) but collapse on action anticipation and fail to beat naive baselines on odds projection. Notably, model scale shows no correlation with performance, and finance-domain fine-tuning does not improve revealed-side identification.
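
To make the odds-projection finding concrete, here is a minimal sketch of what "failing to beat a naive baseline" can look like. It is not the paper's evaluation code. It assumes the naive baseline is persistence (predict that market odds stay at their current value) and that error is measured as mean absolute error; both are assumptions, and the function names and all numbers are hypothetical.

```python
from statistics import mean


def mean_abs_error(predicted, actual):
    """Mean absolute error between two equal-length lists of probabilities."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def persistence_baseline(current_odds):
    """Assumed naive baseline: project that odds do not move."""
    return list(current_odds)


def beats_baseline(model_pred, current_odds, future_odds):
    """Return (model wins?, model error, baseline error)."""
    model_err = mean_abs_error(model_pred, future_odds)
    base_err = mean_abs_error(persistence_baseline(current_odds), future_odds)
    return model_err < base_err, model_err, base_err


# Made-up illustrative values, not data from StakeBench.
current = [0.40, 0.70, 0.55]   # market odds when the comments were posted
future = [0.45, 0.65, 0.60]    # market odds at the projection horizon
model = [0.60, 0.50, 0.30]     # hypothetical model projections

wins, model_err, base_err = beats_baseline(model, current, future)
print(f"model MAE={model_err:.3f}, baseline MAE={base_err:.3f}, beats baseline: {wins}")
```

In this toy case the odds move little between the two time points, so the persistence baseline is hard to beat, and a model that projects large swings scores worse than doing nothing.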

Frontier Model Releases · Evaluation and Benchmarking · Manifold · StakeBench · Polymarket · Directed Accuracy

Related guides (2)

Frontier Model Releases (Topic guide)

Frontier Model Releases: The Race From Language to Action

Evaluation and Benchmarking (Topic guide)

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 23d ago

Stance Detection in Prediction Market Commentary via Counterfactual Augmentation and Market Context

This paper introduces the first stance detection system applied to prediction market commentary (Polymarket), addressing extreme class imbalance (8.7% anti-market comments) through LLM-driven counterfactual augmentation using the Anthropic API. RoBERTa-base is fine-tuned across a 4×3 ablation covering input configurations and augmentation doses. Key findings: market context is the dominant factor (raising 3-class Anti recall from 0.10 to 0.45), 50% synthetic augmentation is optimal, and full augmentation (100%) consistently degrades performance. Attention-based interpretability supports all three findings mechanistically.

Agent and Tool Ecosystem · RoBERTa · counterfactual data augmentation · Polymarket

6 · arXiv · cs.LG · 22d ago

SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability

SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.

Evaluation and Benchmarking · AI Safety Research · ICLR · optimism bias · SoundnessBench

7 · arXiv · cs.AI · 1mo ago

DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks—open-web search, evidence collection, and multi-step derivation—where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.

Frontier Model Releases · Evaluation and Benchmarking · deep research agents · DeepWeb-Bench · Retrieval-Augmented Generation

4 · Hugging Face Blog · 1mo ago

Introducing the Open FinLLM Leaderboard

Hugging Face has launched the Open FinLLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on financial domain tasks. The leaderboard aims to provide standardized, open evaluation of LLMs across finance-specific capabilities such as financial reasoning, document understanding, and numerical analysis. This fills a gap in domain-specific evaluation infrastructure for the financial sector.

Evaluation and Benchmarking · Enterprise Deployment Patterns · FinBench · Open LLM Leaderboard · Hugging Face

7 · arXiv · cs.CL · 25d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.
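
The claim that filtering flawed tasks shifts model rankings can be illustrated with a small sketch. This is not ABA's implementation. It assumes each model has a per-task pass/fail result and that a set of flawed task ids has already been identified by some audit; the model names, task ids, and results are hypothetical.

```python
def mean_score(results, keep):
    """Average pass rate over the task ids listed in `keep`."""
    return sum(results[task] for task in keep) / len(keep)


def rank_models(per_model_results, flawed_tasks=frozenset()):
    """Rank models by mean score after dropping flawed tasks.

    Assumes every model was evaluated on the same task ids.
    Returns (model names sorted best-first, per-model scores).
    """
    any_results = next(iter(per_model_results.values()))
    keep = [task for task in any_results if task not in flawed_tasks]
    scores = {name: mean_score(res, keep) for name, res in per_model_results.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Made-up pass (1) / fail (0) results, not data from the paper.
results = {
    "model_a": {"t1": 1, "t2": 1, "t3": 0, "t4": 0},
    "model_b": {"t1": 1, "t2": 0, "t3": 1, "t4": 1},
}
flawed = {"t3", "t4"}  # hypothetical tasks flagged by an audit

before, before_scores = rank_models(results)
after, after_scores = rank_models(results, flawed)
print("before audit:", before, before_scores)
print("after audit: ", after, after_scores)
```

In this toy example the two models swap places once the flagged tasks are removed, because one model's lead came entirely from tasks the audit rejected.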

Frontier Model Releases · Evaluation and Benchmarking · NeurIPS · Auto Benchmark Audit (ABA) · SWE-Bench Verified

4 · arXiv · cs.CL · 9d ago

Zero-shot LLMs fail to beat baselines on stock prediction; explainability signals retain practical value

A new arXiv preprint evaluates zero-shot NLP pipelines for predicting short-term stock movements from financial news, finding that across multiple models and prediction horizons, zero-shot approaches consistently fail to outperform simple baselines, with especially weak performance on negative price movements. The authors introduce a multi-layered explainability framework linking predictions to token-, article-, and aggregate-level evidence, finding that explainability signals can reliably distinguish trustworthy from unreliable predictions even when accuracy is low. The work argues for a shift toward decision-support systems emphasizing transparency and uncertainty awareness rather than raw predictive accuracy.

Evaluation and Benchmarking · Can News Predict the Market? Limits of Zero-Shot Financial NLP and the Role of Explainable AI

6 · arXiv · cs.AI · 15d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent

6 · arXiv · cs.AI · 4d ago

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

Evaluation and Benchmarking · AI Safety Research · Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations · GAIA · Open LLM Leaderboard