EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents
EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.
Related guides (3)
Related events (8)
Adaptive LLM tutoring system with subject-aware prompt routing improves high-school student engagement
Researchers develop and evaluate an LLM-based tutoring system that uses a learned prompt routing model to dynamically select pedagogical strategies based on 14 features extracted from conversation transcripts. The system was trained in simulation and deployed in an A/B test with 359 high-school students (656 conversations), showing sim-to-real transfer and reducing required interactions by ~3 turns. A stochastic routing strategy achieved a notably higher exercise conversion rate (28.1%) compared to a greedy router (19.1%) and static baseline (19.6%).
EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments
Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.
MLEvolve: Self-evolving multi-agent framework for automated ML algorithm discovery
MLEvolve is a new LLM-based multi-agent framework for end-to-end machine learning algorithm discovery, addressing limitations of existing MLE agents including information isolation and memoryless search. The system introduces Progressive MCGS (a graph-extended tree search), Retrospective Memory for experience accumulation, and decoupled strategic planning from code generation. Evaluated on MLE-Bench, it achieves state-of-the-art medal and valid submission rates within a 12-hour budget, and also outperforms AlphaEvolve on mathematical algorithm optimization tasks.
PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards
Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.
RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training
Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.
Instruction Sensitivity Undermines Embedding Model Evaluation: Single-Prompt Benchmarks Are Insufficient
This paper presents an empirical study of prompt sensitivity in instruction-tuned embedding models, covering 6 models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations). The authors demonstrate that single-prompt evaluation systematically misrepresents true model performance, with default prompts both understating and overstating capabilities depending on phrasing. A key finding is that leaderboard rankings are not robust: by selecting prompts favorably, any model in the study can be promoted to first place. The authors recommend that benchmarks incorporate prompt robustness metrics, either through multi-prompt evaluation or by reporting sensitivity alongside point estimates.
AgentCL: A Rigorous Evaluation Framework for Continual Learning in Language Agents
AgentCL is a new benchmark and evaluation framework designed to rigorously assess continual learning in language agents, addressing gaps in existing benchmarks that focus on retrieval over long-context documents or use naive task streams with limited cross-task analysis. The framework constructs compositional task streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, contrasting them with naive streams to measure transfer gains. The authors also introduce MemProbe, a probing method that stores interactions, insights, and skills while filtering unreliable experiences during consolidation. Empirical results across coding, deep research, and language understanding tasks show that controlled streams better distinguish memory design quality, and that naive streams can mask memory-induced degradation.
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
EnvFactory is a fully automated framework for training tool-use LLM agents via Agentic Reinforcement Learning, addressing two key bottlenecks: scalable execution environments and realistic multi-turn training data. It autonomously constructs stateful, executable tool environments from authentic resources and synthesizes natural trajectories with implicit human intents via topology-aware sampling. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories and improves Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks, outperforming prior approaches that use 5x more environments.


