Almanac
product

CodeClash

product · active · provisional · codeclash-057f983d · 1 event · first seen 11h ago

Aliases: CodeClash

Recent events (1)

5 · arXiv · cs.LG · 11h ago

RevengeBench: Benchmark for Reconstructing Agent Decision Programs from Behavioral Observations

RevengeBench is a new benchmark that tests whether a learner can reconstruct a hidden agent's decision program as executable code from behavioral traces alone. It comprises 75 LLM-generated, Elo-calibrated policies across five game environments and draws on CodeClash tournament trajectories. Before submitting an executable hypothesis, the learner may design controlled behavioral probes (custom opponent policies) to elicit informative behavior. Across twelve frontier LLMs, recovery quality ranges from 34% to 72% of the initial action distance closed, and the reconstructed policies confer a measurable competitive advantage, especially for weaker models. The work frames policy reconstruction as a tractable inverse problem in code space, with implications for opponent modeling and policy interpretability.
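The "fraction of initial action distance closed" figure can be illustrated with a minimal sketch. The summary does not define the metric, so everything here is an assumption made for illustration: action distance is taken to be the share of probe states on which two policies pick different actions, and the names `action_distance` and `fraction_closed` are hypothetical, not taken from the benchmark.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable
Policy = Callable[[State], Action]


def action_distance(target: Policy, candidate: Policy,
                    states: Sequence[State]) -> float:
    """Share of states where the two policies disagree (assumed definition)."""
    if not states:
        raise ValueError("need at least one state")
    mismatches = sum(1 for s in states if target(s) != candidate(s))
    return mismatches / len(states)


def fraction_closed(initial: float, final: float) -> float:
    """Relative reduction in distance from the first guess to the final one."""
    if initial <= 0:
        raise ValueError("initial distance must be positive")
    return (initial - final) / initial


# Toy example: a hidden policy, a naive first guess, and a better reconstruction.
states = list(range(10))
hidden = lambda s: s % 3                          # the policy to be recovered
first_guess = lambda s: 0                         # always plays action 0
reconstructed = lambda s: s % 3 if s < 8 else 0   # correct on most states

d_initial = action_distance(hidden, first_guess, states)
d_final = action_distance(hidden, reconstructed, states)
closed = fraction_closed(d_initial, d_final)
print(f"initial={d_initial:.2f} final={d_final:.2f} closed={closed:.0%}")
```

Under this assumed definition, a score of 100% would mean the reconstruction matches the hidden policy on every probe state, and 0% would mean no improvement over the first guess.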