BigCodeArena

benchmark · active · bigcodearena-7fccc908 · 1 event · first seen 28d ago

Aliases: BigCodeArena

Recent events (1)

Hugging Face Blog · 28d ago

BigCodeArena: Judging code generations end to end with code executions

BigCodeArena is a new evaluation framework for code generation models that judges outputs by executing them end to end, rather than relying on static metrics or human preference ratings alone. By running the generated code and evaluating the actual execution results, it aims to give more reliable and objective assessments of coding-model capability. The approach targets known limitations of LLM-as-judge and human-annotation methods in code evaluation benchmarks.
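
To make the idea of execution-based judging concrete, here is a minimal Python sketch. It is not BigCodeArena's actual API or pipeline: the names `ExecutionResult` and `judge_by_execution` are hypothetical, and comparing standard output against an expected string is an assumed, simplified pass criterion chosen only for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Outcome of running one generated snippet (hypothetical structure)."""
    passed: bool
    stdout: str
    stderr: str
    timed_out: bool


def judge_by_execution(code: str, expected_stdout: str,
                       timeout_s: float = 5.0) -> ExecutionResult:
    """Run `code` in a fresh Python subprocess and compare its stdout
    with `expected_stdout` (assumed criterion, whitespace-trimmed)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],  # same interpreter, new process
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A snippet that does not finish in time counts as a failure.
        return ExecutionResult(False, "", "", True)

    # Pass only if the process exited cleanly AND printed the expected text.
    passed = (proc.returncode == 0
              and proc.stdout.strip() == expected_stdout.strip())
    return ExecutionResult(passed, proc.stdout, proc.stderr, False)
```

For example, `judge_by_execution("print(sum(range(5)))", "10")` would report a pass, while a snippet that raises an exception would fail and carry the traceback in `stderr`. A plain subprocess is not a sandbox, so a real evaluation system would need proper isolation before running untrusted model output.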