Entity · benchmark

CybORG CAGE-2

benchmark · active · cyborg-cage-2-12383121 · 1 event · first seen May 18, 2026

Aliases: CybORG CAGE-2

Co-occurring entities

Reflexion · Grok-4-Fast · ReAct · Gemini-2.5-Flash-Lite · Qwen3-235B · Llama-4-Maverick · FORGE

More like this (12)

CAGE · TRIBE v2 · CyberSecEval 2 · Mamba2 · Sora 2 · DINOv2 · Codex Labs · LeRobot · Phi-2 · BigCodeArena · Cerebras Systems · Transformers Agents 2.0

Recent events (1)

arXiv · cs.LG · May 18, 2026

FORGE: Self-Evolving Agent Memory via Population Broadcast Without Weight Updates

FORGE (Failure-Optimized Reflective Graduation and Evolution) is a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents without any gradient updates. It wraps a Reflexion-style inner loop, in which a reflection agent converts failed trajectories into textual heuristics or few-shot demonstrations, and between stages it propagates the memory of the best-performing instance across the population. Evaluated on CybORG CAGE-2 (a stochastic network-defense POMDP), FORGE improves average return by 1.7–7.7× over zero-shot and by 29–72% over Reflexion in all 12 model-representation conditions tested across four LLM families. Notably, weaker models benefit disproportionately, suggesting the method may help narrow capability gaps between models rather than mainly amplifying already-strong ones.
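The staged loop summarized above can be sketched in Python. This is a minimal, hypothetical sketch, not the authors' implementation. `forge_sketch`, `Agent`, `toy_episode`, `toy_reflect`, and `HOSTS` are illustrative names, and the toy environment only stands in for CybORG CAGE-2. "Best-performing" is assumed to mean the highest mean return within a stage. The graduation criterion and the choice between heuristics and few-shot demonstrations are not modeled. The LLM policy is treated as frozen, so only the text memory changes.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical signatures: an episode maps (memory, rng) -> (return, trajectory);
# a reflector maps one failed trajectory -> one textual heuristic.
Episode = Callable[[List[str], random.Random], Tuple[float, str]]
Reflector = Callable[[str], str]


@dataclass
class Agent:
    """One population member. The policy is frozen; only this text memory changes."""
    memory: List[str] = field(default_factory=list)


def forge_sketch(
    population_size: int,
    num_stages: int,
    episodes_per_stage: int,
    run_episode: Episode,
    reflect: Reflector,
    failure_threshold: float = 0.0,
    seed: int = 0,
) -> Tuple[List[Agent], List[float]]:
    rng = random.Random(seed)
    population = [Agent() for _ in range(population_size)]
    best_per_stage: List[float] = []

    for _ in range(num_stages):
        mean_returns = []
        for agent in population:
            total = 0.0
            for _ in range(episodes_per_stage):
                # In a real agent the memory would be injected into the prompt here.
                ret, trajectory = run_episode(agent.memory, rng)
                total += ret
                if ret < failure_threshold:
                    # Reflexion-style inner loop: distill the failure into text memory.
                    agent.memory.append(reflect(trajectory))
            mean_returns.append(total / episodes_per_stage)

        # Assumption: the stage winner is the member with the highest mean return.
        best = max(range(population_size), key=lambda i: mean_returns[i])
        best_per_stage.append(mean_returns[best])
        # Broadcast: every member starts the next stage from a copy of the winner's memory.
        broadcast = list(population[best].memory)
        for agent in population:
            agent.memory = list(broadcast)

    return population, best_per_stage


# Toy stand-ins (hypothetical, not CybORG): each distinct lesson reduces the penalty.
HOSTS = ["host-0", "host-1", "host-2", "host-3", "host-4"]


def toy_episode(memory: List[str], rng: random.Random) -> Tuple[float, str]:
    ret = -10.0 + 2.0 * len(set(memory)) + rng.uniform(-1.0, 1.0)
    return ret, rng.choice(HOSTS)  # the "trajectory" is just the host that was lost


def toy_reflect(trajectory: str) -> str:
    return f"Prioritize restoring {trajectory} when it is compromised."
```

With the toy stand-ins, `forge_sketch(4, 5, 3, toy_episode, toy_reflect)` returns a population whose members all share the broadcast memory, together with the best mean return of each stage. Copying the winner's memory wholesale, rather than merging memories from several members, keeps the broadcast step simple. It is one possible design and not necessarily the one FORGE uses.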

Evaluation and Benchmarking · Agent and Tool Ecosystem · Reflexion · Grok-4-Fast · ReAct