8·OpenAI Blog·1mo ago

Evaluating Large Language Models Trained on Code

OpenAI published research on evaluating large language models trained on code, introducing the Codex model and the HumanEval benchmark for assessing code generation capabilities. The work established foundational methodology for measuring functional correctness of code produced by LLMs using a pass@k metric. This paper became a landmark reference for code-focused LLM evaluation and influenced subsequent code generation research across the field.

Frontier Model Releases · Evaluation and Benchmarking · Agent and Tool Ecosystem · GPT-3 · pass@k · OpenAI · HumanEval · Codex

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Frontier Model Releases·Topic guide

Frontier Model Releases: The Race From Language to Action

Codex

Codex: OpenAI's AI Coding Agent

Related events (8)

4·Hugging Face Blog·1mo ago

Very Large Language Models and How to Evaluate Them

This Hugging Face blog post from October 2022 discusses approaches to zero-shot evaluation of large language models hosted on the Hub. It covers methodologies for benchmarking LLMs without task-specific fine-tuning, addressing the practical challenges of evaluating very large models at scale. The post situates evaluation tooling within the broader ecosystem of open model hosting and assessment.

Evaluation and Benchmarking · Open Weights Progress · zero-shot evaluation · Hugging Face

5·OpenAI Blog·1mo ago

A Hazard Analysis Framework for Code Synthesis Large Language Models

OpenAI published a hazard analysis framework specifically targeting code synthesis LLMs, addressing the safety and risk dimensions of models that generate executable code. The framework likely identifies threat categories, failure modes, and mitigation strategies relevant to deploying code-generating AI systems. This represents an early structured attempt to apply safety engineering methodology to a specific LLM capability domain. The work is relevant to both AI safety research and enterprise deployment considerations for coding assistants.

AI Safety Research · Agent and Tool Ecosystem · hazard analysis framework · code synthesis LLMs · OpenAI

5·Hugging Face Blog·1mo ago

BigCodeBench: The Next Generation of HumanEval

Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BigCodeBench · Hugging Face · HumanEval

5·Hugging Face Blog·1mo ago

BigCodeArena: Judging code generations end to end with code executions

BigCodeArena is a new evaluation framework for code generation models that uses end-to-end code execution to judge outputs rather than relying on static metrics or human preference ratings alone. The approach aims to provide more reliable and objective assessments of coding model capabilities by running generated code and evaluating actual execution results. This addresses known limitations of LLM-as-judge and human annotation methods for code evaluation benchmarks.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BigCode · BigCodeArena · Hugging Face

6·Hugging Face Blog·1mo ago

StarCoder: A State-of-the-Art LLM for Code

Hugging Face and ServiceNow released StarCoder, a large language model for code trained on permissively licensed data from The Stack dataset. The model targets code generation, completion, and understanding tasks and is positioned as an open-weights alternative to proprietary code models. The release includes model weights, training details, and an associated technical report.

Open Weights Progress · Agent and Tool Ecosystem · ServiceNow AI · BigCode · The Stack v2

4·OpenAI Blog·1mo ago

Powering next generation applications with OpenAI Codex

OpenAI announced that Codex is now powering 70 different applications across various use cases via the OpenAI API. The post highlights the breadth of adoption of Codex as a developer tool for code generation and related tasks. This represents an early milestone in the enterprise and developer ecosystem deployment of large language models for coding.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · OpenAI API · OpenAI · OpenAI Codex

5·Hugging Face Blog·1mo ago

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

CyberSecEval 2 is a benchmark framework designed to evaluate both the cybersecurity risks and capabilities of large language models. The framework appears to be hosted or featured on Hugging Face's leaderboard infrastructure, extending prior cybersecurity evaluation work. It assesses LLMs across multiple dimensions of security-relevant behavior, including potential for misuse and defensive capabilities.

Evaluation and Benchmarking · AI Safety Research · CyberSecEval 2 · LlamaGuard · Hugging Face

4·OpenAI Blog·1mo ago

Best practices for deploying language models

Cohere, OpenAI, and AI21 Labs jointly published a preliminary set of best practices for organizations developing or deploying large language models. The document represents an early cross-industry effort to establish shared norms around responsible LLM deployment. This is a 2022 publication surfaced in a tier-1 feed.

AI Safety Research · Enterprise Deployment Patterns · AI21 Labs · Cohere · OpenAI