benchmark

ScienceWorld

benchmark · active · provisional · scienceworld-18ad90f9 · 1 event · first seen 10h ago

Aliases: ScienceWorld

Co-occurring entities

ALFWorld AgentBoard Word2World WorldEvolver

More like this (12)

SpatialWorld LifeSciBench SciFact OSWorld InternScience BigScience FrontierScience OSWorld-Verified Meta-World DiscoverPhysics SciCode iOSWorld

Recent events (1)

6 · arXiv · cs.CL · 10h ago

WorldEvolver: Self-Evolving World Models for LLM Agent Planning via Test-Time Memory Revision

Researchers introduce WorldEvolver, a framework that equips LLM agents with self-improving world models that revise their context at deployment time without updating model parameters. The system combines episodic memory (retrieval-based simulation), semantic memory (heuristic rule extraction from prediction errors), and selective foresight (confidence-based filtering). Evaluated on ALFWorld and ScienceWorld benchmarks, WorldEvolver achieves state-of-the-art world model prediction accuracy and improved downstream agent success rates across three backbone models. The work addresses a key challenge in long-horizon agent planning: unreliable foresight that can degrade rather than improve decision-making.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ALFWorld · AgentBoard · Word2World · +2 more
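
The three components named in the summary above (episodic memory, semantic memory, selective foresight) can be illustrated with a toy sketch. This is not the WorldEvolver implementation: the class `ToyWorldModel`, the token-overlap similarity, the rule format, and the 0.5 confidence threshold are all assumptions made for illustration, and the real system presumably uses an LLM where this sketch uses string matching.

```python
from dataclasses import dataclass, field


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for a memory-based world model (illustration only)."""

    # Episodic memory: observed (state, action, next_state) transitions.
    episodes: list = field(default_factory=list)
    # Semantic memory: action -> corrective rule text derived from an error.
    rules: dict = field(default_factory=dict)
    # Assumed confidence cut-off for selective foresight.
    threshold: float = 0.5

    def predict(self, state: str, action: str):
        """Return (predicted_next_state, confidence) from the closest episode."""
        best, best_sim = None, 0.0
        # Newest episodes first, so a later observation wins a similarity tie.
        for s, a, nxt in reversed(self.episodes):
            sim = jaccard(f"{s} {a}", f"{state} {action}")
            if sim > best_sim:
                best, best_sim = nxt, sim
        return best, best_sim

    def foresight(self, state: str, action: str):
        """Selective foresight: expose a prediction only when confident enough."""
        pred, conf = self.predict(state, action)
        return pred if pred is not None and conf >= self.threshold else None

    def revise(self, state: str, action: str, actual: str) -> None:
        """Test-time revision: only memory changes, no parameters are updated."""
        pred, _ = self.predict(state, action)
        if pred is not None and pred != actual:
            # Record a heuristic rule extracted from the prediction error.
            self.rules[action] = f"'{action}' may lead to '{actual}', not '{pred}'"
        self.episodes.append((state, action, actual))
```

In this sketch an agent would call `foresight` before acting and fall back to planning without a prediction whenever it returns `None`, then call `revise` with the observed outcome. The rules in `rules` are only stored here; how extracted rules feed back into prediction is left out.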