Entity · technique

Reflexion

technique · active · reflexion-2c36df82 · 2 events · first seen May 18, 2026

Aliases: Reflexion

Co-occurring entities

Wang et al. 2024 · Self-Refine · Best-of-N · Sample More, Reflect Less: Self-Refine and Reflexion Lose to Repeated Sampling at Equal Token Cost, from 1.5B to 7B · Grok-4-Fast · ReAct · Gemini-2.5-Flash-Lite · Qwen3-235B · Llama-4-Maverick · FORGE · CybORG CAGE-2

More like this (12)

RELEX · CORE (Contrastive Reflection) · Introspection · Cognition · self-refinement · Reversa · Reply · Self-Refine · Deep Think · Tree of Thoughts · Contemplating mode · Hindsight Experience Replay

Recent events (2)

7 · arXiv · cs.CL · 37h ago

Rigorous experiment finds Self-Refine and Reflexion lose to repeated sampling at equal token cost across 1.5B–7B models

A new arXiv paper reports a controlled experiment comparing seven LLM self-improvement methods (including Self-Refine, Reflexion, Best-of-N, and debate) against simple repeated sampling at equal token budgets, across 1.5B, 3B, and 7B open models on two math benchmarks. Using bootstrap confidence intervals and multiplicity correction across 36 paired comparisons, the paper finds that no method reliably outperforms repeated sampling, while ten comparisons come out reliably worse, all involving self-inspection of the model's own output. Notably, Self-Refine and Reflexion remain 3.6–10.1 points below the repeated-sampling baseline even at 7B, and on the smallest model Reflexion silently degraded to a single chain-of-thought pass because it always judged its own answer correct. The results challenge a broad class of iterative self-critique methods and extend the earlier point-estimate findings of Wang et al. (2024) with more rigorous statistics.
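The statistical procedure described above (paired comparisons, bootstrap confidence intervals, multiplicity correction) can be sketched roughly as follows. This is only an illustration under stated assumptions: the summary does not say which correction the paper uses, so Bonferroni is swapped in here, and the function names, the percentile bootstrap, and the per-problem 0/1 correctness encoding are all hypothetical choices, not the paper's code.

```python
import random

def paired_bootstrap_ci(method_correct, baseline_correct, n_comparisons=36,
                        alpha=0.05, n_resamples=5000, seed=0):
    """Percentile bootstrap CI for the mean paired accuracy difference.

    method_correct / baseline_correct: per-problem 0/1 scores on the SAME
    problems, with both arms run at the same token budget (assumed).
    Bonferroni (an assumption): each interval uses alpha / n_comparisons.
    """
    assert len(method_correct) == len(baseline_correct)
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(method_correct, baseline_correct)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    a = alpha / n_comparisons  # corrected per-comparison error rate
    lo_idx = int((a / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - a / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]

def verdict(ci):
    """A method counts as reliably different only if the CI excludes zero."""
    lo, hi = ci
    if lo > 0:
        return "reliably better"
    if hi < 0:
        return "reliably worse"
    return "inconclusive"
```

Under this reading, "reliably worse" means the corrected interval for (method minus repeated sampling) lies entirely below zero, which is a stricter bar than comparing point estimates.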

Evaluation and Benchmarking · Inference Economics · Reflexion · Wang et al. 2024 · Self-Refine · +3 more

6 · arXiv · cs.LG · May 18, 2026

FORGE: Self-Evolving Agent Memory via Population Broadcast Without Weight Updates

FORGE (Failure-Optimized Reflective Graduation and Evolution) is a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents without any gradient updates. It wraps a Reflexion-style inner loop where a reflection agent converts failed trajectories into textual heuristics or few-shot demonstrations, then propagates the best-performing instance's memory across a population between stages. Evaluated on CybORG CAGE-2 (a stochastic network-defense POMDP), FORGE improves average return by 1.7–7.7× over zero-shot and 29–72% over Reflexion across all 12 model-representation conditions tested with four LLM families. Notably, weaker models benefit disproportionately, suggesting the method may help close capability gaps rather than amplify already-strong models.
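The staged loop described above can be pictured with a minimal sketch. It is not FORGE's implementation: `Agent`, `run_stage`, `evolve`, and the `run_episode` / `reflect` callables are hypothetical stand-ins for the environment rollout and the reflection agent, and selecting the best instance by mean return within a stage is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Agent:
    memory: List[str] = field(default_factory=list)    # prompt-injected text memory
    returns: List[float] = field(default_factory=list)  # returns from the current stage

# run_episode(memory) -> (episode_return, trajectory, failed); hypothetical signature
Episode = Callable[[List[str]], Tuple[float, str, bool]]

def run_stage(agent: Agent, run_episode: Episode,
              reflect: Callable[[str], str], episodes: int) -> None:
    """Reflexion-style inner loop: failed trajectories become textual lessons."""
    agent.returns = []
    for _ in range(episodes):
        ret, trajectory, failed = run_episode(agent.memory)
        agent.returns.append(ret)
        if failed:
            agent.memory.append(reflect(trajectory))  # no weight updates, text only

def evolve(population: List[Agent], run_episode: Episode,
           reflect: Callable[[str], str], stages: int, episodes: int) -> List[Agent]:
    """Between stages, broadcast the best instance's memory to everyone."""
    for _ in range(stages):
        for agent in population:
            run_stage(agent, run_episode, reflect, episodes)
        best = max(population, key=lambda a: sum(a.returns) / len(a.returns))
        for agent in population:
            agent.memory = list(best.memory)  # copy so later edits stay independent
    return population
```

The only state that changes is the list of strings in `memory`, which is what makes the approach gradient-free; the model itself is treated as a fixed black box behind `run_episode`.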

Evaluation and Benchmarking · Agent and Tool Ecosystem · Reflexion · Grok-4-Fast · ReAct · +6 more