8·OpenAI Blog·1mo ago

Evaluating Large Language Models Trained on Code

OpenAI published research on evaluating large language models trained on code, introducing the Codex model and the HumanEval benchmark for assessing code generation. The work established a foundational methodology for measuring the functional correctness of LLM-generated code: a problem counts as solved when a generated sample passes its unit tests, and pass@k reports the probability that at least one of k samples does. The paper became a landmark reference for code-focused LLM evaluation and influenced subsequent code generation research.
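As a rough illustration of the pass@k metric mentioned above, the sketch below computes the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts are illustrative assumptions, not code from the paper's release.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Uses 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    draw of k samples (without replacement) contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs (hypothetical input shape)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up counts for three problems, 10 samples each.
    counts = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(counts, k=1))
```

Averaging per-problem estimates, rather than pooling all samples, keeps every problem equally weighted regardless of how many of its samples pass.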

Related events (8)

4·Hugging Face Blog·1mo ago

Very Large Language Models and How to Evaluate Them

This Hugging Face blog post from October 2022 discusses approaches to zero-shot evaluation of large language models hosted on the Hub. It covers methodologies for benchmarking LLMs without task-specific fine-tuning, addressing the practical challenges of evaluating very large models at scale. The post situates evaluation tooling within the broader ecosystem of open model hosting and assessment.

5·OpenAI Blog·1mo ago

A Hazard Analysis Framework for Code Synthesis Large Language Models

OpenAI published a hazard analysis framework specifically targeting code synthesis LLMs, addressing the safety and risk dimensions of models that generate executable code. The framework likely identifies threat categories, failure modes, and mitigation strategies relevant to deploying code-generating AI systems. This represents an early structured attempt to apply safety engineering methodology to a specific LLM capability domain. The work is relevant to both AI safety research and enterprise deployment considerations for coding assistants.

5·Hugging Face Blog·1mo ago

BigCodeBench: The Next Generation of HumanEval

Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.

5·Hugging Face Blog·1mo ago

BigCodeArena: Judging code generations end to end with code executions

BigCodeArena is a new evaluation framework for code generation models that uses end-to-end code execution to judge outputs rather than relying on static metrics or human preference ratings alone. The approach aims to provide more reliable and objective assessments of coding model capabilities by running generated code and evaluating actual execution results. This addresses known limitations of LLM-as-judge and human annotation methods for code evaluation benchmarks.
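To make the idea of execution-based judging concrete, here is a minimal sketch that runs a generated solution together with assertion-style tests in a separate Python process and treats a clean exit as a pass. It is an assumption-laden illustration, not BigCodeArena's actual harness: the helper name `passes_tests`, the timeout value, and the sample snippets are all hypothetical, and a real system would execute untrusted code inside a proper sandbox.

```python
import os
import subprocess
import sys
import tempfile


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Return True if candidate_code followed by test_code exits with status 0.

    Hypothetical helper: this provides process isolation only, NOT a sandbox.
    """
    program = candidate_code + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(program)
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Non-terminating generations count as failures.
            return False
    # A failed assertion or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    tests = "assert add(1, 2) == 3"
    print(passes_tests(good, tests), passes_tests(bad, tests))
```

Judging by exit code keeps the verdict independent of how the generated code is written, which is the property that distinguishes execution-based evaluation from static or preference-based scoring.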

6·Hugging Face Blog·1mo ago

StarCoder: A State-of-the-Art LLM for Code

Hugging Face and ServiceNow released StarCoder, a large language model for code trained on permissively licensed data from The Stack dataset. The model targets code generation, completion, and understanding tasks and is positioned as an open-weights alternative to proprietary code models. The release includes model weights, training details, and an associated technical report.

4·OpenAI Blog·1mo ago

Powering next generation applications with OpenAI Codex

OpenAI announced that Codex is now powering 70 different applications across various use cases via the OpenAI API. The post highlights the breadth of adoption of Codex as a developer tool for code generation and related tasks. This represents an early milestone in the enterprise and developer ecosystem deployment of large language models for coding.

5·Hugging Face Blog·1mo ago

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

CyberSecEval 2 is a benchmark framework designed to evaluate both the cybersecurity risks and capabilities of large language models. The framework appears to be hosted or featured on Hugging Face's leaderboard infrastructure, extending prior cybersecurity evaluation work. It assesses LLMs across multiple dimensions of security-relevant behavior, including potential for misuse and defensive capabilities.

4·OpenAI Blog·1mo ago

Best practices for deploying language models

Cohere, OpenAI, and AI21 Labs jointly published a preliminary set of best practices for organizations developing or deploying large language models. The document, published in 2022, represents an early cross-industry effort to establish shared norms around responsible LLM deployment.