8 · OpenAI Blog · 1mo ago

Estimating Worst-Case Frontier Risks of Open-Weight LLMs

OpenAI introduces a methodology called malicious fine-tuning (MFT) to assess the worst-case risks of releasing open-weight models, applying it to its own model, gpt-oss. The study attempts to elicit the maximum dangerous capability in the biology and cybersecurity domains through targeted fine-tuning. It is a systematic effort to quantify uplift risks before an open-weight release, and its findings inform OpenAI's open-weight release policy.

Evaluation and Benchmarking · Open Weights Progress · AI Safety Research · cybersecurity risk uplift · biology risk uplift · malicious fine-tuning · GPT-OSS · OpenAI

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Related events (8)

7 · OpenAI Blog · 1mo ago

Building an Early Warning System for LLM-Aided Biological Threat Creation

OpenAI published a blueprint for evaluating whether LLMs can meaningfully assist in biological threat creation. In a controlled study with biology experts and students, GPT-4 was found to provide at most mild uplift in biological threat creation accuracy. The results are inconclusive but are framed as a starting point for ongoing safety research and community deliberation on biosecurity risks from AI.

Evaluation and Benchmarking · AI Safety Research · biological threat creation evaluation · OpenAI · GPT-4

7 · OpenAI Blog · 1mo ago

Introducing gpt-oss-safeguard

OpenAI has released gpt-oss-safeguard, a set of open-weight reasoning models designed for safety classification tasks. The models are intended to help developers implement and iterate on custom content safety policies. This represents OpenAI's entry into the open-weight safety tooling space, providing infrastructure-level moderation capabilities that can be customized and deployed independently.

Open Weights Progress · AI Safety Research · gpt-oss-safeguard · OpenAI

7 · OpenAI Blog · 1mo ago

OpenAI Releases gpt-oss-safeguard-120b and gpt-oss-safeguard-20b: Open-Weight Policy-Reasoning Safety Models

OpenAI has released two open-weight reasoning models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, post-trained from the gpt-oss base models to perform policy-conditioned content labeling. The models are designed to reason from a provided policy document and classify content accordingly, functioning as configurable safety classifiers. A technical report accompanies the release, covering capabilities and baseline safety evaluations benchmarked against the underlying gpt-oss models.

Open Weights Progress · AI Safety Research · GPT-OSS · gpt-oss-safeguard · OpenAI

8 · OpenAI Blog · 1mo ago

OpenAI Releases Most Capable Open-Weights Models

OpenAI has released what it describes as its most capable open-weights models, framing the move as a major step toward broader AI accessibility. The announcement emphasizes openness, flexibility, and global reach as core motivations. This marks a significant shift in OpenAI's historically closed model distribution strategy.

Frontier Model Releases · Open Weights Progress · open-weight models · OpenAI

9 · OpenAI Blog · 1mo ago

OpenAI Releases gpt-oss-120b and gpt-oss-20b Open-Weight Models Under Apache 2.0

OpenAI is releasing two open-weight language models, gpt-oss-120b and gpt-oss-20b, under the Apache 2.0 license. OpenAI claims the models outperform similarly sized open models on reasoning tasks and offer strong tool-use capabilities. Both are optimized for efficient deployment, with the smaller model suited to consumer hardware, positioning them as cost-effective options in the open-weights ecosystem.

Frontier Model Releases · Open Weights Progress · Apache 2.0 · GPT-OSS 120B · OpenAI

9 · OpenAI Blog · 1mo ago

OpenAI Releases gpt-oss-120b and gpt-oss-20b Open-Weight Reasoning Models

OpenAI has published model cards for gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models released under the Apache 2.0 license alongside a dedicated gpt-oss usage policy. The release marks a significant move by OpenAI into the open-weights space, offering both a large 120B-parameter model and a smaller 20B variant. It signals a strategic shift for OpenAI, which has historically kept its frontier models proprietary.

Frontier Model Releases · Open Weights Progress · gpt-oss usage policy · Apache 2.0 · GPT-OSS 120B

4 · OpenAI Blog · 1mo ago

Adversarial Attacks on Neural Network Policies

OpenAI published research examining adversarial attacks on neural network-based reinforcement learning policies. The work investigates how small, carefully crafted perturbations to observations can cause trained RL agents to fail catastrophically. This represents an early investigation into the robustness and safety of learned policies under adversarial conditions.

Evaluation and Benchmarking · AI Safety Research · adversarial examples · Adversarial Attacks on Neural Network Policies · Reinforcement Learning

6 · The Batch · 19d ago

GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain

Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, using iterative planning-execution-evaluation loops that span thousands of tool calls. Z.ai claims the model achieves top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro; it is available under the MIT license via Hugging Face. In the accompanying editorial, Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work: frontend the most, followed by backend and infrastructure, with research the least. He draws practical implications for how teams are organized. Secondary items cover opposition to data centers and failure modes of LLM helpfulness.

Frontier Model Releases · Evaluation and Benchmarking · DeepLearning.AI · Artificial Analysis Intelligence Index · SWE-bench