Codex HumanEval

benchmark · active · provisional · codex-humaneval-df837778 · 1 event · first seen 12d ago

Aliases: Codex HumanEval

Recent events (1)

7 · Anthropic News · 12d ago

Anthropic launches Claude 2 with 100K context window and improved coding, reasoning, and safety

Anthropic released Claude 2, featuring a 100K token context window, improved performance on coding (71.2% on Codex HumanEval, up from 56.0%), math (88.0% on GSM8k), and legal reasoning (76.5% on the Bar exam multiple choice section). The model is available via API at the same price as Claude 1.3 and through a new public beta at claude.ai for US and UK users. Safety improvements include a 2x reduction in harmful outputs on internal red-team evaluations compared to Claude 1.3. Early API partners include Jasper and Sourcegraph.