Entity · benchmark

FinanceBench

benchmark · active · financebench-8f8dd514 · 2 events · first seen May 27, 2026

Aliases: FinanceBench

Co-occurring entities

CM-LRS · SEC EDGAR · Llama 3.1 70B · ConvFinQA · Claude Sonnet 4 · Claude Opus 4.6 · FinQA · GPT-5.5 · BrowseComp-Plus · MuSiQue · Qwen3-4B · Retrieval-Augmented Generation · BRANE

More like this (12)

FinBench · PRBench Finance · FeatBench · SorryBench · TokenBench · IFBench · SelectBench · FoldBench · TriggerBench · DeliveryBench · LiveBench · AdvBench

Recent events (2)

5 · arXiv · cs.CL · Jul 24, 2026

CM-LRS: A capital markets reliability benchmark for LLM workflow outputs

Researchers introduce CM-LRS (Capital Markets LLM Reliability Score), a seven-dimension evaluation framework assessing LLM outputs at the workflow level rather than the question-answer layer, targeting regulated capital-markets use cases such as DCM/ECM term extraction, M&A comparables, and issuer profiling. The benchmark is demonstrated on five workflows using public SEC EDGAR and UK takeover filings, scoring four models across four LLM judges. Key findings: frontier closed-source models cluster tightly (Sonnet 4.6 = 4.31, Opus 4.7 = 4.30, GPT-5.5 = 4.09) while Llama 3.3 70B lags at 3.15, with the gap concentrated in retrieval and synthesis tasks rather than extraction. The work advances domain-specific evaluation methodology for high-stakes financial workflows where regulatory defensibility matters.

Evaluation and Benchmarking · Enterprise Deployment Patterns · CM-LRS · SEC EDGAR · Llama 3.1 70B · +6 more

6 · arXiv · cs.AI · May 27, 2026

BRANE: Natural Language Query-to-Configuration Selection for Retrieval Agents

BRANE is a system that dynamically selects retrieval agent pipeline configurations (LLM, retriever, number of hops, synthesis strategy) at inference time based on per-query characteristics and a cost-quality target. It uses an LLM to extract workload features from each query, then applies lightweight per-configuration predictors to estimate correctness, selecting the configuration that maximizes predicted accuracy penalized by cost. Evaluated on MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE matches best-fixed-configuration accuracy at up to 89% lower cost and outperforms LLM-routing and fine-tuned Qwen3-4B baselines. The work frames per-query pipeline configuration as a practical alternative to static workload-level tuning.

Evaluation and Benchmarking · Inference Economics · BrowseComp-Plus · MuSiQue · Qwen3-4B · +4 more
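
The selection rule in the BRANE summary (pick the configuration that maximizes predicted accuracy penalized by cost) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the `Config` fields, the linear penalty `predicted_correctness - cost_weight * cost`, the single `multi_hop` feature, the two toy predictors, and the normalized cost values are all hypothetical. In the system described, an LLM extracts the workload features from each query; here they are passed in as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass(frozen=True)
class Config:
    """One hypothetical retrieval-agent pipeline configuration."""
    name: str
    llm: str
    retriever: str
    hops: int
    cost: float  # assumed: normalized cost per query, in [0, 1]


def select_config(
    features: Features,
    configs: List[Config],
    predictors: Dict[str, Callable[[Features], float]],
    cost_weight: float,
) -> Config:
    """Return the config with the highest predicted correctness minus weighted cost."""
    def score(cfg: Config) -> float:
        # Assumed penalty form: linear in cost. The source only says "penalized by cost".
        return predictors[cfg.name](features) - cost_weight * cfg.cost

    return max(configs, key=score)


# Two made-up configurations: a cheap single-hop pipeline and a costly multi-hop one.
CONFIGS = [
    Config(name="cheap", llm="small-llm", retriever="bm25", hops=1, cost=0.1),
    Config(name="strong", llm="large-llm", retriever="dense", hops=3, cost=0.9),
]

# Toy per-configuration predictors: the cheap pipeline degrades sharply on
# multi-hop queries, the strong one only slightly.
PREDICTORS: Dict[str, Callable[[Features], float]] = {
    "cheap": lambda f: 0.90 - 0.60 * f["multi_hop"],
    "strong": lambda f: 0.95 - 0.10 * f["multi_hop"],
}

if __name__ == "__main__":
    for label, feats in [("simple", {"multi_hop": 0.0}), ("hard", {"multi_hop": 1.0})]:
        chosen = select_config(feats, CONFIGS, PREDICTORS, cost_weight=0.3)
        print(label, "->", chosen.name)
```

In this toy setup, a simple query is routed to the cheap configuration and a multi-hop query to the strong one. Setting `cost_weight` to 0 would always pick the configuration with the highest predicted accuracy, which is one way to read the "cost-quality target" as a tunable trade-off.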