StakeBench: A Market-Commitment-Grounded Benchmark for Financial Language Understanding
StakeBench is a new evaluation framework linking 560,876 comments from 2,261 resolved prediction markets (Polymarket and Manifold) to verified trading positions, actions, and market-odds records, replacing human annotation with observable market behavior as supervision. Fifteen LLMs are evaluated on four diagnostic tasks: commitment detection, side identification, action anticipation, and collective odds projection. Results reveal structural failures: models partially recover position-side signals (Directed Accuracy 0.506–0.599) but collapse on action anticipation and fail to beat naive baselines on odds projection. Notably, model scale shows no correlation with performance, and finance-domain fine-tuning does not improve revealed-side identification.
Related events (8)
Stance Detection in Prediction Market Commentary via Counterfactual Augmentation and Market Context
This paper introduces the first stance detection system applied to prediction market commentary (Polymarket), addressing extreme class imbalance (8.7% anti-market comments) through LLM-driven counterfactual augmentation using the Anthropic API. RoBERTa-base is fine-tuned across a 4×3 ablation covering input configurations and augmentation doses. Key findings: market context is the dominant factor (raising 3-class Anti recall from 0.10 to 0.45), 50% synthetic augmentation is optimal, and full augmentation (100%) consistently degrades performance. Attention-based interpretability supports all three findings mechanistically.
SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability
SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.
DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation
DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks (open-web search, evidence collection, and multi-step derivation), where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.
Introducing the Open FinLLM Leaderboard
Hugging Face has launched the Open FinLLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on financial domain tasks. The leaderboard aims to provide standardized, open evaluation of LLMs across finance-specific capabilities such as financial reasoning, document understanding, and numerical analysis. This fills a gap in domain-specific evaluation infrastructure for the financial sector.
Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)
The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.
Zero-shot LLMs fail to beat baselines on stock prediction; explainability signals retain practical value
A new arXiv preprint evaluates zero-shot NLP pipelines for predicting short-term stock movements from financial news, finding that across multiple models and prediction horizons, zero-shot approaches consistently fail to outperform simple baselines, with especially weak performance on negative price movements. The authors introduce a multi-layered explainability framework linking predictions to token-, article-, and aggregate-level evidence, finding that explainability signals can reliably distinguish trustworthy from unreliable predictions even when accuracy is low. The work argues for a shift toward decision-support systems emphasizing transparency and uncertainty awareness rather than raw predictive accuracy.
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline, from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.
Bayesian audit framework for public AI evaluation archives challenges frontier model claims
A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

