LLM agents
llm-agents-d25620a4 · 3 events · first seen 28d ago
Aliases: LLM agents, LLM Agent
Recent events (3)
LLM Agent Framework for Last-Mile Time Series Forecasting Revision
This paper introduces a 'last-mile forecasting' framework where an LLM agent sits atop a statistical forecasting backbone to incorporate weakly structured business context—holidays, campaigns, expert feedback, external events—into decision-ready forecasts. The system uses tool-invocation for contextual retrieval, converts reasoning into explicit revision actions under safety constraints, and supports long-horizon forecasting via map-reduce decomposition with a memory bank for post-hoc reflection. The authors validate the approach through real-world case studies, positioning it as a bridge between statistical prediction and operationally usable forecasts.
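The step of turning agent reasoning into explicit revision actions under safety constraints can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the `RevisionAction` type, the multiplicative form of the adjustment, and the 30% bound are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevisionAction:
    """Hypothetical explicit revision: scale the baseline over a span of steps."""
    start: int     # first horizon index affected (inclusive)
    end: int       # end of the affected span (exclusive)
    factor: float  # multiplicative adjustment proposed by the agent
    reason: str    # free-text justification, e.g. "holiday" or "campaign"


# Assumed safety bound: a revised value may differ from the statistical
# baseline by at most 30% at any step.
MAX_RELATIVE_CHANGE = 0.30


def apply_revisions(baseline, actions, max_change=MAX_RELATIVE_CHANGE):
    """Apply each action to a copy of the baseline, then clip to the safety band."""
    revised = list(baseline)
    for action in actions:
        first = max(action.start, 0)
        last = min(action.end, len(revised))
        for i in range(first, last):
            revised[i] *= action.factor

    clipped = []
    for value, base in zip(revised, baseline):
        # sorted() keeps the band ordered even if a baseline value is negative.
        low, high = sorted((base * (1 - max_change), base * (1 + max_change)))
        clipped.append(min(max(value, low), high))
    return clipped
```

In this sketch an overly aggressive proposal, such as a 1.5x uplift for a campaign, is capped at the band edge instead of being rejected outright. Rejecting the action and asking the agent to revise it would be an equally plausible design.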
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
SkillGenBench is a new benchmark designed to evaluate the ability of LLM agents to generate correct, reusable, and executable skills from raw repositories and documents, rather than merely using pre-provided skills. It covers two generation regimes (task-conditioned and task-agnostic) and two procedural sources (repository-grounded and document-grounded), with standardized execution-based evaluation protocols. Experiments across multiple skill-generation methods reveal substantial performance variation and distinct failure modes depending on source type. The benchmark aims to establish skill generation as an independent research problem within agent systems.
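The two regimes and two sources suggest a small grid of evaluation settings. The sketch below simply enumerates that grid. It assumes the two axes are fully crossed, which the summary does not state, and the `evaluation_settings` helper is a hypothetical name rather than part of the benchmark.

```python
from itertools import product

# Axis labels taken from the summary above.
GENERATION_REGIMES = ("task-conditioned", "task-agnostic")
PROCEDURAL_SOURCES = ("repository-grounded", "document-grounded")


def evaluation_settings():
    """Return every regime/source pair, assuming the axes are fully crossed."""
    return [
        {"regime": regime, "source": source}
        for regime, source in product(GENERATION_REGIMES, PROCEDURAL_SOURCES)
    ]
```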
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Agentic CLEAR is an automatic evaluation framework for LLM-based agentic systems that analyzes behavior at three levels of granularity: system, trace, and node. Unlike existing tools that rely on static error taxonomies or focus only on observability, it dynamically generates textual insights, sits above the observability layer, and provides an accessible UI. Experiments across four benchmarks and seven agentic settings show strong alignment with human-annotated errors and accurate prediction of task success rates.