CodeClash
codeclash-057f983d · 1 event · first seen 11h ago
Aliases: CodeClash
Recent events (1)
RevengeBench: Benchmark for Reconstructing Agent Decision Programs from Behavioral Observations
RevengeBench is a new benchmark of 75 LLM-generated, Elo-calibrated policies across five game environments. It tests whether a learner can reconstruct a hidden agent's decision program as executable code from behavioral traces alone. The benchmark draws on CodeClash tournament trajectories and lets the learner design controlled behavioral probes (custom opponent policies) to elicit informative behavior before submitting an executable hypothesis. Across twelve frontier LLMs, recovery quality ranges from 34% to 72% of the initial action distance closed, and reconstructed policies provide a measurable competitive advantage, especially for weaker models. The work frames policy reconstruction as a tractable inverse problem in code space, with implications for opponent modeling and policy interpretability.
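The recovery figure above ("percent of initial action distance closed") can be illustrated with a small sketch. The summary does not define the metric, so everything here is an assumption made for illustration: action distance is taken to be the fraction of probe states on which two policies choose different actions, the "initial" distance is measured from a hypothetical baseline policy, and the names `action_distance`, `fraction_closed`, `hidden`, `baseline`, and `reconstructed` are invented. The benchmark's actual definition may differ.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable
Policy = Callable[[State], Action]


def action_distance(target: Policy, candidate: Policy, states: Sequence[State]) -> float:
    """Assumed metric: share of states where the two policies pick different actions."""
    if not states:
        raise ValueError("need at least one state to compare policies")
    mismatches = sum(target(s) != candidate(s) for s in states)
    return mismatches / len(states)


def fraction_closed(initial: float, final: float) -> float:
    """Share of the initial distance removed by the reconstruction (assumed formula)."""
    if initial <= 0:
        raise ValueError("initial distance must be positive")
    return (initial - final) / initial


# Toy example: all three policies are hypothetical stand-ins.
def hidden(state: int) -> str:
    # The policy the learner is trying to recover.
    return "attack" if state >= 5 else "defend"


def baseline(state: int) -> str:
    # Starting hypothesis before any reconstruction.
    return "defend"


def reconstructed(state: int) -> str:
    # Learner's hypothesis after probing; off by one on the threshold.
    return "attack" if state >= 6 else "defend"


states = list(range(10))
d_initial = action_distance(hidden, baseline, states)     # disagrees on states 5..9
d_final = action_distance(hidden, reconstructed, states)  # disagrees on state 5 only
closed = fraction_closed(d_initial, d_final)
print(f"initial={d_initial:.2f} final={d_final:.2f} closed={closed:.0%}")
```

In this toy setup the reconstructed policy closes about 80% of the initial distance. Under the same assumed definition, the reported 34% to 72% range would mean the learners' hypotheses removed between roughly a third and three quarters of the starting disagreement with the hidden policy.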