Entity · technique

GEPA

technique · active · gepa-d541ea63 · 4 events · first seen May 25, 2026

Aliases: GEPA

More like this (12)

JEPA AdaJEPA HamJEPA GEIS GEOS LegalGPT FERPA SAMPA eGeMAPS GiGPO GPQA MM-EPC

Recent events (4)

6 · arXiv · cs.CL · Jul 16, 2026

Continual-learning evaluation on Terminal-Bench 2.0 tests whether agent optimizer gains compound across tasks

A new arXiv paper introduces a two-phase continual-learning evaluation framework built on Terminal-Bench 2.0 to test whether agent-optimization gains persist and compound when new tasks arrive over time. Three agent-harness optimization methods — GEPA, Meta Harness, and RELAI-VCL — are compared under identical budgets; all improve in static single-phase settings but diverge sharply under continual optimization. RELAI-VCL is the only method that both transfers positively to unseen tasks and continues improving, reaching a 76.4% lifelong average pass rate versus 58.7% for the unoptimized baseline. The key finding is that compounding gains require regression control built into the optimization loop to prevent shortcut solutions.

Evaluation and Benchmarking · Agent and Tool Ecosystem · RELAI Verifiable Continual Learning · Meta Harness · RELAI · +4 more
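
The summary above attributes compounding gains to regression control inside the optimization loop. The paper's actual mechanism is not described here, so the following is only a minimal sketch of one way such a gate could look: a candidate harness is accepted only if no previously seen task gets worse. The function names, the `tolerance` parameter, and the use of a plain mean as the aggregate are all assumptions made for illustration, not the definitions used by RELAI-VCL, GEPA, or Meta Harness.

```python
from typing import Dict


def passes_regression_gate(
    old_rates: Dict[str, float],
    new_rates: Dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Return True if no previously seen task regresses by more than `tolerance`.

    Hypothetical gate: tasks missing from `new_rates` count as a pass rate of 0.0.
    """
    return all(
        new_rates.get(task, 0.0) >= rate - tolerance
        for task, rate in old_rates.items()
    )


def mean_pass_rate(rates: Dict[str, float]) -> float:
    """Unweighted mean pass rate over tasks (an assumed aggregate)."""
    return sum(rates.values()) / len(rates)


# Pass rates on tasks seen in an earlier phase (made-up numbers).
before = {"task_a": 0.8, "task_b": 0.6}

# Candidate 1 keeps old tasks intact while adding a new one.
steady = {"task_a": 0.8, "task_b": 0.7, "task_c": 0.5}

# Candidate 2 scores higher on average but breaks task_a (a shortcut solution).
shortcut = {"task_a": 0.5, "task_b": 0.9, "task_c": 0.9}

accept_steady = passes_regression_gate(before, steady)
accept_shortcut = passes_regression_gate(before, shortcut)
```

In this toy data the shortcut candidate has the higher mean pass rate yet fails the gate, which is the kind of trade-off a purely score-driven optimizer would miss.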

5 · arXiv · cs.AI · Jun 10, 2026

EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents

EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ACE · DeepSeek V4 · Qwen3-4B-Instruct · +2 more

7 · arXiv · cs.AI · May 28, 2026

CORE: Contrastive Reflection for Sample-Efficient Reasoning Improvement

CORE (Contrastive Reflection) is a non-parametric learning algorithm that improves LLM reasoning by comparing successful and unsuccessful reasoning traces to generate compact natural-language 'insights' about reasoning strategies. Across four reasoning tasks, CORE outperforms both parametric baselines (GRPO/RLVR) and non-parametric baselines (GEPA, episodic RAG, MemRL) under fixed rollout budgets, achieving comparable or better gains with as few as five training samples. The method is also more context-efficient than prompt-optimization approaches, storing learned knowledge as interpretable natural-language descriptions rather than raw traces or weight updates. The results suggest contrastive distillation of reasoning traces may be a more efficient route to self-improvement than traditional fine-tuning.

Evaluation and Benchmarking · Inference Economics · RLVR · GRPO · CORE (Contrastive Reflection) · +5 more

7 · arXiv · cs.AI · May 25, 2026

SkillOpt: Systematic Text-Space Optimizer for Self-Evolving Agent Skills

SkillOpt introduces a principled optimization framework for agent skills, treating the skill document as an external trainable state analogous to model weights. A separate optimizer model converts scored rollouts into bounded edits (add/delete/replace) on a skill document, accepting only edits that improve held-out validation scores. Evaluated across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt achieves best or tied performance on all 52 evaluated cells, lifting GPT-5.5 no-skill accuracy by up to +24.8 points inside the Codex agentic loop. Optimized skill artifacts also transfer across model scales and execution environments without further optimization.

Evaluation and Benchmarking · Agent and Tool Ecosystem · TextGrad · SkillOpt · Trace2Skill · +6 more
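
The SkillOpt summary above describes bounded add/delete/replace edits on a skill document, kept only when they improve a held-out validation score. The sketch below illustrates that accept-if-better loop under stated assumptions: the `Edit` type, the line-based document representation, and the toy `validate` scorer are hypothetical stand-ins, and in the described system the edits would come from a separate optimizer model rather than a fixed list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """One bounded edit on a skill document (hypothetical representation)."""
    op: str                      # "add", "delete", or "replace"
    index: int                   # line position the edit applies to
    text: Optional[str] = None   # new line content for add/replace


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new document with the edit applied; the input is not mutated."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(edit.index, edit.text or "")
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        lines[edit.index] = edit.text or ""
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return lines


def optimize(
    skill: List[str],
    proposals: Iterable[Edit],
    validate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedily keep only edits that strictly improve the validation score."""
    best, best_score = list(skill), validate(skill)
    for edit in proposals:
        candidate = apply_edit(best, edit)
        score = validate(candidate)
        if score > best_score:   # ties and regressions are rejected
            best, best_score = candidate, score
    return best, best_score


def toy_validate(doc: List[str]) -> float:
    """Toy stand-in for held-out validation: reward 'check', penalize 'guess'."""
    return float(sum("check" in l for l in doc) - sum("guess" in l for l in doc))


final_doc, final_score = optimize(
    ["guess the answer"],
    [
        Edit("add", 1, "check units"),       # improves the score, accepted
        Edit("replace", 0, "guess harder"),  # no improvement, rejected
        Edit("delete", 0),                   # improves the score, accepted
    ],
    toy_validate,
)
```

Requiring a strict improvement means a neutral edit never enters the document, which keeps the skill file from growing without a measurable benefit.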