6 · arXiv cs.LG (Machine Learning) · 1mo ago

FORGE: Self-Evolving Agent Memory via Population Broadcast Without Weight Updates

FORGE (Failure-Optimized Reflective Graduation and Evolution) is a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents without any gradient updates. It wraps a Reflexion-style inner loop, in which a reflection agent converts failed trajectories into textual heuristics or few-shot demonstrations, and between stages it propagates the best-performing instance's memory across the population. Evaluated on CybORG CAGE-2 (a stochastic network-defense POMDP), FORGE improves average return by 1.7–7.7× over zero-shot and 29–72% over Reflexion across all 12 model-representation conditions tested with four LLM families. Notably, weaker models benefit disproportionately, suggesting the method may help close capability gaps rather than widen the lead of already-strong models.
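The staged loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Instance`, `forge`, `toy_episode`, and `toy_reflect`, the failure threshold, and the choice of mean return as the selection rule are all assumptions, since the summary does not give the actual graduation criteria or memory format.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Instance:
    """One population member: prompt-injected memory plus this stage's returns."""
    memory: List[str] = field(default_factory=list)
    returns: List[float] = field(default_factory=list)


def mean_return(inst: Instance) -> float:
    # Instances with no episodes rank last.
    return sum(inst.returns) / len(inst.returns) if inst.returns else float("-inf")


def forge(
    population_size: int,
    stages: int,
    episodes_per_stage: int,
    run_episode: Callable[[List[str]], Tuple[float, str]],
    reflect: Callable[[str], str],
    failure_threshold: float = 0.0,  # assumed definition of a "failed" episode
) -> List[Instance]:
    population = [Instance() for _ in range(population_size)]
    for _ in range(stages):
        for inst in population:
            inst.returns = []
            for _ in range(episodes_per_stage):
                # Inner Reflexion-style loop: act with the current memory,
                # then turn a failed trajectory into a textual heuristic.
                ret, trajectory = run_episode(inst.memory)
                inst.returns.append(ret)
                if ret < failure_threshold:
                    inst.memory.append(reflect(trajectory))
        # Between stages: broadcast the best instance's memory to everyone.
        # Only text is copied; no model weights are touched.
        best = max(population, key=mean_return)
        best_memory = list(best.memory)
        for inst in population:
            inst.memory = list(best_memory)
    return population


# Toy stand-ins (hypothetical): return improves until three heuristics exist.
def toy_episode(memory: List[str]) -> Tuple[float, str]:
    ret = float(min(len(memory), 3) - 3)
    return ret, f"episode with {len(memory)} heuristics scored {ret}"


def toy_reflect(trajectory: str) -> str:
    return "lesson from: " + trajectory


population = forge(4, 2, 3, toy_episode, toy_reflect)
```

In a real setting `run_episode` would call the LLM agent in the environment and `reflect` would call the reflection agent; both are left as plain callables here so the control flow stays visible.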

Evaluation and Benchmarking Agent and Tool Ecosystem Alignment and RLHF Reflexion Grok-4-Fast ReAct Gemini-2.5-Flash-Lite Qwen3-235B Llama-4-Maverick FORGE CybORG CAGE-2

Related guides (3)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.AI · 24d ago

FluxMem: Connectivity-Evolving Memory Framework for LLM Agents

FluxMem proposes a heterogeneous graph-based memory framework for LLM agents that continuously evolves its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. Unlike static memory repositories, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills successful trajectories into reusable procedural circuits. The system is guided by a single metric for memory generalizability and evolutionary maturity, achieving state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.

Long Context Evolution Evaluation and Benchmarking heterogeneous graph memory LightMem GAIA +6 more

5 · arXiv · cs.CL · 9d ago

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.

Evaluation and Benchmarking Agent and Tool Ecosystem EvoArena GAIA LoCoMo +1 more

4 · arXiv · cs.AI · 4d ago

EvolveNav: Self-evolving memory and preflection for zero-shot object-goal navigation

EvolveNav is a new framework for Zero-Shot Object-Goal Navigation (ZS-OGN) that enables test-time improvement through a self-evolving agentic rule memory built from past trajectories. A retrieval strategy based on upper confidence bound balances semantic relevance and historical success when selecting rules, while a memory-guided preflection module forecasts action outcomes before execution to reduce inefficient exploration. The method achieves a 10.1% improvement in success rate over existing zero-shot baselines with fewer unnecessary steps.
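A UCB-style retrieval rule of the kind mentioned above might look like the sketch below. The exact scoring formula is not given in this summary, so the additive combination of relevance, empirical success rate, and a UCB1-style exploration bonus is an assumption, as are the names `Rule`, `ucb_score`, and `select_rule`.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str
    relevance: float  # semantic similarity to the current query, assumed in [0, 1]
    successes: int    # past episodes where using this rule led to success
    uses: int         # past episodes where this rule was retrieved


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0) -> float:
    # Exploitation: semantic relevance plus historical success rate.
    mean_success = rule.successes / rule.uses if rule.uses else 0.0
    # Exploration: bonus shrinks as a rule is used more often.
    # The +1 terms keep the expression defined for unused rules.
    bonus = c * math.sqrt(math.log(total_uses + 1) / (rule.uses + 1))
    return rule.relevance + mean_success + bonus


def select_rule(rules: List[Rule], c: float = 1.0) -> Rule:
    total_uses = sum(r.uses for r in rules)
    return max(rules, key=lambda r: ucb_score(r, total_uses, c))


rules = [
    Rule("hug the wall in corridors", relevance=0.9, successes=1, uses=10),
    Rule("check kitchens for mugs", relevance=0.6, successes=9, uses=10),
    Rule("scan room before entering", relevance=0.5, successes=0, uses=0),
]
exploit = select_rule(rules, c=0.0)  # favors relevance plus success rate
explore = select_rule(rules, c=5.0)  # large c favors rarely used rules
```

Setting `c` to zero reduces the rule to pure exploitation, while a larger `c` shifts retrieval toward rules with little usage history.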

Evaluation and Benchmarking Agent and Tool Ecosystem EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation EvolveNav

6 · arXiv · cs.CL · 6d ago

GitOfThoughts: Git-based agent memory substrate with sobering findings on memory utility for novel problems

Researchers introduce GitOfThoughts, a system that stores LLM reasoning trees as git repositories, enabling replayable, auditable, and mergeable agent memory at low engineering cost. Across five memory substrates (none, markdown, vector, graph, git), two benchmarks, and two model scales with pre-registered replications, the paper finds that no memory format reliably improves accuracy on novel problems. Memory only helps above a 'copyability threshold' (similarity above roughly 0.8), where retrieved cases are near-duplicates of the current problem, and even then the gain comes from answer retrieval rather than method transfer. The paper also documents a retracted result and a refuted hypothesis, modeling a rigorous evaluation standard.

Evaluation and Benchmarking Agent and Tool Ecosystem GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge GitOfThoughts

7 · arXiv · cs.AI · 1mo ago

MOSS: Self-Evolving Agents via Source-Level Code Rewriting

MOSS is a system enabling autonomous agents to self-evolve by rewriting their own source code rather than being limited to text-mutable artifacts like prompts or skill files. The system anchors each evolution cycle to production-failure evidence, delegates code modification to an external coding-agent CLI, and verifies candidates by replaying failures in ephemeral trial workers before promoting via consent-gated container swap with rollback. On the OpenClaw benchmark, MOSS improves a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention. The authors argue source-level adaptation is strictly more general than text-layer evolution, being Turing-complete and immune to long-context drift.

Evaluation and Benchmarking AI Safety Research MOSS source-level self-rewriting OpenClaw +3 more

6 · arXiv · cs.LG · 11d ago

Future Probe Controlled Generation enables steering of reasoning models without quality degradation

Researchers introduce Future Probe Controlled Generation (FPCG), a text-level steering method for large reasoning models (LRMs) that trains activation probes to predict future behavior likelihoods from intermediate reasoning steps rather than detecting behavior in already-generated text. The probes achieve 64–91% accuracy in predicting the most likely future behavior, revealing a distinct class of internal prediction features separate from detection features. FPCG steers model outputs by sampling candidate sentences and selecting the best according to these probes, achieving steering with minimal output quality degradation and succeeding in cases where activation steering fails. The work provides a principled distinction between detection and prediction features as intervention targets for controlling LRM behavior.
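The sample-then-select step described above can be pictured as a best-of-k loop. This is a hedged sketch only: `steer`, `sample_candidates`, and `probe_score` are hypothetical names, and the scoring callable here stands in for the trained activation probes, which in the described method read model internals rather than raw text.

```python
from typing import Callable, List


def steer(
    prefix: str,
    sample_candidates: Callable[[str, int], List[str]],
    probe_score: Callable[[str, str], float],
    steps: int,
    k: int,
) -> str:
    """Extend `prefix` sentence by sentence, keeping the candidate the probe prefers."""
    text = prefix
    for _ in range(steps):
        # Draw k candidate next sentences from the model (stand-in callable).
        candidates = sample_candidates(text, k)
        if not candidates:
            break
        # Keep the candidate whose predicted future behavior scores highest.
        best = max(candidates, key=lambda sentence: probe_score(text, sentence))
        text += best
    return text


# Toy stand-ins (hypothetical): fixed candidates, probe prefers longer sentences.
def toy_sampler(text: str, k: int) -> List[str]:
    return [" a", " bb", " ccc"][:k]


def toy_probe(text: str, sentence: str) -> float:
    return float(len(sentence))


steered = steer("x", toy_sampler, toy_probe, steps=2, k=3)
```

Because selection happens over sampled text rather than by editing activations, every emitted sentence is one the model could have produced on its own, which is consistent with the reported minimal quality degradation.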

Frontier Model Releases AI Safety Research Predicting Future Behaviors in Reasoning Models Enables Better Steering Future Probe Controlled Generation +1 more

5 · arXiv · cs.CL · 6d ago

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.

Agent and Tool Ecosystem Alignment and RLHF ALFWorld Sokoban RePro +2 more

6 · arXiv · cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking Agent and Tool Ecosystem web navigation benchmark Mem-π large language model agents +3 more