6arXiv cs.AI (Artificial Intelligence)·24d ago

BRANE: Natural Language Query-to-Configuration Selection for Retrieval Agents

BRANE is a system that dynamically selects retrieval agent pipeline configurations (LLM, retriever, number of hops, synthesis strategy) at inference time based on per-query characteristics and a cost-quality target. It uses an LLM to extract workload features from each query, then applies lightweight per-configuration predictors to estimate correctness, selecting the configuration that maximizes predicted accuracy penalized by cost. Evaluated on MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE matches best-fixed-configuration accuracy at up to 89% lower cost and outperforms LLM-routing and fine-tuned Qwen3-4B baselines. The work frames per-query pipeline configuration as a practical alternative to static workload-level tuning.

Evaluation and Benchmarking Inference Economics Agent and Tool Ecosystem BrowseComp-Plus MuSiQue Qwen3-4B FinanceBench Retrieval-Augmented Generation BRANE

Related guides (3)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Inference EconomicsTopic guide

Inference Economics: The Cost of Running AI in Production

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.AI·18d ago·source ↗

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

RASER introduces a family of lightweight routers that decide whether to escalate retrieval complexity for multi-hop QA without making additional LLM calls. Built on top of one-shot RAG using six derived features, RASER-2 and RASER-3 route queries to progressively more expensive retrieval strategies (PRUNE, IRCoT) only when needed. Across six LLMs and three benchmarks, the routers match SOTA F1 while consuming only 41-49% of the tokens required by always-escalating baselines.

Inference Economics Agent and Tool Ecosystem PRUNE IRCoT RASER +1 more

5arXiv · cs.CL·4d ago·source ↗

RL-trained LLMs learn retriever-specific query formulation strategies for RAG

A new arXiv paper presents the first systematic study of using reinforcement learning to teach LLMs to adapt query formulation strategies to different retrieval backends. The authors find that different retrievers have surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), making cross-retriever strategy transfer ineffective. They introduce a branching-based rollout technique to stabilize training over multi-step retrieval trajectories and show gains from retriever-specific human guidance and model scaling.

Evaluation and Benchmarking Agent and Tool Ecosystem Understanding the Behaviors of Environment-aware Information Retrieval LCO-Embedding

6arXiv · cs.AI·15d ago·source ↗

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking Agent and Tool Ecosystem Benchmark Everything Everywhere All at Once Benchmark Agent

4arXiv · cs.CL·10d ago·source ↗

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

Evaluation and Benchmarking Agent and Tool Ecosystem T1-Bench

6arXiv · cs.CL·18d ago·source ↗

AgentCL: A Rigorous Evaluation Framework for Continual Learning in Language Agents

AgentCL is a new benchmark and evaluation framework designed to rigorously assess continual learning in language agents, addressing gaps in existing benchmarks that focus on retrieval over long-context documents or use naive task streams with limited cross-task analysis. The framework constructs compositional task streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, contrasting them with naive streams to measure transfer gains. The authors also introduce MemProbe, a probing method that stores interactions, insights, and skills while filtering unreliable experiences during consolidation. Empirical results across coding, deep research, and language understanding tasks show that controlled streams better distinguish memory design quality, and that naive streams can mask memory-induced degradation.

Long Context Evolution Evaluation and Benchmarking AgentCL MemProbe Continual Learning +3 more

6arXiv · cs.AI·12d ago·source ↗

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking Agent and Tool Ecosystem Claude Opus 4.6 SWE-bench AARRI-Bench +2 more

6arXiv · cs.CL·24d ago·source ↗

Coverage Illusion: Post-Retrieval Cascade Design Reduces LLM Augmentation Overhead in Production RAG

A case study on the Danish National Encyclopedia's RAG system evaluates five retrieval workflows across 20,000 query-workflow pairs, revealing a 'Coverage Illusion' where synthetic queries overestimate the need for LLM augmentation (90%+) versus real production traffic (27.8%). Pre-retrieval routing cannot detect this gap because augmentation necessity is only revealed after index search. A post-retrieval cascade running workflows cheapest-first and escalating to LLM augmentation only on empty results improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and eliminates LLM augmentation for 72.2% of real queries. The work highlights a structural mismatch between synthetic and real query distributions that affects RAG system design assumptions.

Evaluation and Benchmarking Inference Economics RAG Coverage Illusion HyDE +5 more

5arXiv · cs.CL·22d ago·source ↗

GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases

GRASP is a three-stage retrieval framework for semi-structured knowledge bases (SKBs) that combines plan-based graph retrieval, plan-conditioned dense retrieval fusion, and a fine-tuned reranker. It targets applications like product search, academic search, and precision medicine over typed entity-relation graphs. Evaluated on the STaRK benchmarks, GRASP advances average Hit@1 from 62.0 to 73.9, representing a substantial improvement over prior hybrid retrieval systems. Ablation studies confirm the contribution of each component.

Evaluation and Benchmarking Agent and Tool Ecosystem semi-structured knowledge bases GRASP STaRK