benchmark
BigCodeArena
benchmark · active
bigcodearena-7fccc908 · 1 events · first seen 28d ago
Aliases: BigCodeArena
Recent events (1)
BigCodeArena: Judging code generations end to end with code executions
BigCodeArena is a new evaluation framework for code generation models that judges outputs by executing them end to end, rather than relying only on static metrics or human preference ratings. By running the generated code and evaluating the actual execution results, it aims to give more reliable and objective assessments of coding models' capabilities. The approach is meant to address known limitations of LLM-as-judge and human-annotation methods in code evaluation benchmarks.
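To make the idea of execution-based judging concrete, here is a minimal sketch in Python. It is not BigCodeArena's implementation: the function names `run_candidate` and `judge`, the stdout-comparison criterion, and the 5-second timeout are all assumptions chosen for illustration. A real harness would also need proper sandboxing, which a plain subprocess does not provide.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, timeout_s: float = 5.0) -> dict:
    """Run a generated Python snippet in a separate process and capture the result.

    Hypothetical helper: a subprocess is NOT a security sandbox.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat a hang as a failed execution.
            return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def judge(code: str, expected_stdout: str) -> bool:
    """Pass only if the code exits cleanly and prints the expected output."""
    result = run_candidate(code)
    return result["ok"] and result["stdout"].strip() == expected_stdout.strip()
```

For example, `judge("print(1 + 1)", "2")` runs the snippet and compares what it actually printed against the expected text, so a snippet that looks plausible but crashes or prints something else is rejected.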