ALFWorld

benchmark · active · alfworld-c4e69dec · 5 events · first seen Jun 1, 2026

Aliases: ALFWorld

More like this (12)

ALFRED · Meta-World · DevicesWorld · OSWorld · SpatialWorld · AlphaEarth Foundations · ALX · Altana · DreamForge-World 0.1 Preview · World-In-Agent · Open Fable · AIEWF

Recent events (5)

5 · arXiv · cs.AI · Jul 20, 2026

Muon optimizer shows large gains over AdamW in sparse-reward agentic RL on ALFWorld

A new arXiv preprint investigates the Muon optimizer for reinforcement learning post-training of language model agents, comparing it to AdamW on the ALFWorld benchmark using Qwen2.5-0.5B-Instruct. Under Group-in-Group Policy Optimization (GiGPO), applying Muon to hidden weight matrices raises validation success from 0.290 to 0.546 (an 88% relative gain), and lower learning rates push success further, to 0.901. The results are exploratory (single-seed, single-task) but suggest that optimizer choice, advantage estimator, and learning rate interact significantly in agentic RL settings.

Agent and Tool Ecosystem · Alignment and RLHF · ALFWorld · GRPO · Qwen2.5-7B-Instruct-1M · +4 more
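
The phrase "applying Muon to hidden weight matrices" in the summary above can be read as the common practice of sending 2-D hidden matrices to Muon and leaving embeddings, output heads, biases, and norm gains on AdamW. The sketch below illustrates that routing and rechecks the reported relative gain. It is a minimal sketch under assumptions: the parameter names, shapes, and name-based filter are illustrative and not taken from the preprint.

```python
from typing import Dict, List, Tuple


def split_params_for_muon(
    shapes: Dict[str, Tuple[int, ...]]
) -> Tuple[List[str], List[str]]:
    """Route 2-D hidden weight matrices to Muon, everything else to AdamW.

    The name-based filter ("embed", "lm_head") is an assumption made for
    this illustration, not the preprint's rule.
    """
    muon: List[str] = []
    adamw: List[str] = []
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if is_matrix and not is_embedding_or_head:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# Illustrative parameter names and shapes (assumed, not from the preprint).
example_shapes = {
    "model.embed_tokens.weight": (151936, 896),
    "model.layers.0.self_attn.q_proj.weight": (896, 896),
    "model.layers.0.self_attn.q_proj.bias": (896,),
    "model.layers.0.input_layernorm.weight": (896,),
    "lm_head.weight": (151936, 896),
}

muon_names, adamw_names = split_params_for_muon(example_shapes)
gain_percent = round(relative_gain(0.290, 0.546) * 100)
```

Here only the `q_proj.weight` matrix lands in the Muon group, and `gain_percent` reproduces the +88% figure from the success rates quoted in the summary.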

6 · arXiv · cs.LG · Jul 1, 2026

TRIAGE: Role-typed credit assignment framework improves reinforcement learning for agentic tasks

TRIAGE is a new credit assignment framework for agentic reinforcement learning that augments standard GRPO by classifying action segments into semantic roles (decisive progress, useful exploration, no-progress infrastructure, regression) and applying role-conditioned process rewards. The approach addresses two structural blind spots of outcome-only credit: punishing useful exploration in failed rollouts and reinforcing redundant actions in successful ones. Evaluated on ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO and reduces environment-facing turns by 10-15%, with regression detection inside successful trajectories identified as the dominant contributor.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ALFWorld · Search-QA · GRPO · +3 more
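
To make the idea of role-conditioned process rewards in the TRIAGE summary concrete, the sketch below adds a per-role bonus to a trajectory-level outcome advantage. The four role names come from the summary; the reward values and the additive combination rule are assumptions chosen for illustration and should not be read as TRIAGE's actual formulation.

```python
from enum import Enum
from typing import Dict, List


class Role(Enum):
    DECISIVE_PROGRESS = "decisive_progress"
    USEFUL_EXPLORATION = "useful_exploration"
    NO_PROGRESS_INFRASTRUCTURE = "no_progress_infrastructure"
    REGRESSION = "regression"


# Hypothetical per-role process rewards (assumed values, not from the paper).
ROLE_REWARD: Dict[Role, float] = {
    Role.DECISIVE_PROGRESS: 0.5,
    Role.USEFUL_EXPLORATION: 0.25,
    Role.NO_PROGRESS_INFRASTRUCTURE: 0.0,
    Role.REGRESSION: -0.5,
}


def segment_credit(outcome_advantage: float, roles: List[Role]) -> List[float]:
    """Per-segment credit: shared outcome advantage plus a role-specific term."""
    return [outcome_advantage + ROLE_REWARD[role] for role in roles]


# Successful rollout: the regression segment earns less credit than progress.
success_credit = segment_credit(1.0, [Role.DECISIVE_PROGRESS, Role.REGRESSION])

# Failed rollout: useful exploration is penalized less than a regression.
failure_credit = segment_credit(-1.0, [Role.USEFUL_EXPLORATION, Role.REGRESSION])
```

With outcome-only credit every segment in a rollout would receive the same value; the role term is what separates the two blind spots named in the summary.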

6 · arXiv · cs.CL · Jun 30, 2026

WorldEvolver: Self-Evolving World Models for LLM Agent Planning via Test-Time Memory Revision

Researchers introduce WorldEvolver, a framework that equips LLM agents with self-improving world models that revise their context at deployment time without updating model parameters. The system combines episodic memory (retrieval-based simulation), semantic memory (heuristic rule extraction from prediction errors), and selective foresight (confidence-based filtering). Evaluated on ALFWorld and ScienceWorld benchmarks, WorldEvolver achieves state-of-the-art world model prediction accuracy and improved downstream agent success rates across three backbone models. The work addresses a key challenge in long-horizon agent planning: unreliable foresight that can degrade rather than improve decision-making.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ALFWorld · AgentBoard · Word2World · +2 more
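
The "selective foresight" component in the WorldEvolver summary is described as confidence-based filtering. A minimal sketch of that idea follows, assuming each world-model prediction carries a scalar confidence in [0, 1]; the `Prediction` type, the 0.7 threshold, and the example values are hypothetical and not drawn from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    action: str                  # candidate action the agent is considering
    predicted_observation: str   # what the world model expects to see next
    confidence: float            # assumed to lie in [0, 1]


def select_foresight(
    predictions: List[Prediction], threshold: float = 0.7
) -> List[Prediction]:
    """Keep only predictions confident enough to show to the planner."""
    return [p for p in predictions if p.confidence >= threshold]


# Hypothetical predictions for an ALFWorld-style step.
candidates = [
    Prediction("open fridge 1", "The fridge 1 is open.", 0.92),
    Prediction("take mug 1 from desk 1", "You pick up the mug 1.", 0.40),
]
kept = select_foresight(candidates)
```

Dropping the low-confidence prediction means the agent plans without foresight for that action, which matches the summary's concern that unreliable foresight can hurt rather than help.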

5 · arXiv · cs.CL · Jun 15, 2026

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm in which agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves absolute success-rate gains of up to 12% over baseline Qwen-family models without requiring continuous external supervision.

Agent and Tool Ecosystem · Alignment and RLHF · ALFWorld · Sokoban · RePro · +2 more

6 · arXiv · cs.AI · Jun 1, 2026

ReuseRL: Skill Reuse as Compression in Agentic RL via MDL Principle

ReuseRL formalizes agentic reinforcement learning through the Minimum Description Length (MDL) principle, extracting a shared skill dictionary from successful trajectories and augmenting the RL objective with a segmentation cost that penalizes idiosyncratic, non-reusable behaviors. The authors prove a PAC-Bayes generalization bound for this compression penalty. Evaluated on ALFWorld, TextWorld-Cooking, and Countdown-Stepwise, ReuseRL outperforms vanilla GRPO and round-length baselines on both in-distribution and out-of-distribution tasks.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Minimum Description Length · ALFWorld · Countdown-Stepwise · +5 more
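
The segmentation cost in the ReuseRL summary can be pictured as the cheapest way to describe a trajectory using either shared skills or one-off primitive actions. The sketch below computes that minimum with dynamic programming. It is an illustration under assumptions: the cost values, the exact-match rule for skills, and the example actions are hypothetical, and the paper's actual objective may differ.

```python
from typing import List, Sequence, Tuple


def segmentation_cost(
    actions: Sequence[str],
    skills: Sequence[Tuple[str, ...]],
    skill_cost: float = 1.0,
    primitive_cost: float = 2.0,
) -> float:
    """Minimum cost to cover `actions` with dictionary skills or primitives.

    best[i] holds the cheapest description of the first i actions. A single
    primitive is always available, so every prefix has a finite cost.
    """
    n = len(actions)
    best: List[float] = [0.0] + [float("inf")] * n
    for i in range(n):
        # Option 1: describe the next action as a one-off primitive.
        best[i + 1] = min(best[i + 1], best[i] + primitive_cost)
        # Option 2: describe the next k actions with one reusable skill.
        for skill in skills:
            k = len(skill)
            if k > 0 and tuple(actions[i:i + k]) == tuple(skill):
                best[i + k] = min(best[i + k], best[i] + skill_cost)
    return best[n]


# Hypothetical trajectory and skill dictionary.
trajectory = ["goto", "open", "take", "goto", "put"]
dictionary = [("goto", "open", "take")]

cost_with_skills = segmentation_cost(trajectory, dictionary)
cost_without_skills = segmentation_cost(trajectory, [])
```

A trajectory that reuses dictionary skills gets a shorter description than one built entirely from primitives, which is the sense in which such a penalty discourages idiosyncratic, non-reusable behavior.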