OpenAI has released a research preview of gpt-oss-safeguard, a pair of open-weight safety reasoning models that let developers apply custom safety policies at inference time. The two models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, are both fine-tuned from gpt-oss, licensed under Apache 2.0, and available on Hugging Face for local use.

Why Policy-Conditioned Safety Matters
Conventional moderation models are trained against a single fixed policy. When that policy changes, the model must be retrained or replaced. gpt-oss-safeguard reverses this relationship: it takes the developer-authored policy as input together with the user content, then reasons step by step to decide whether the content violates that policy. This turns safety into a prompting and evaluation task rather than a retraining task, which is better suited to fast-changing or domain-specific harms such as fraud, biological risks, self-harm, or game-specific abuse.
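As a concrete illustration, the sketch below shows one way a developer might pass a custom policy alongside user content through Hugging Face Transformers. The model ID, policy wording, and label scheme are illustrative assumptions, and the exact prompt layout should follow OpenAI's published guidance for the model.

```python
# Minimal sketch: policy-conditioned classification with gpt-oss-safeguard.
# The model ID, policy text, and label scheme are assumptions for illustration.
from transformers import pipeline

MODEL_ID = "openai/gpt-oss-safeguard-20b"  # assumed Hugging Face checkpoint name

policy = """\
POLICY: In-game trading fraud
VIOLATES (1): offers to sell accounts, real-money trades, phishing or scam links.
ALLOWED (0): discussion of trading mechanics, price questions, general chat.
Return a single label, 1 (violates) or 0 (allowed), followed by a short rationale.
"""

content = "DM me, I'll sell you 10k gold for $5, just click this link."

pipe = pipeline("text-generation", model=MODEL_ID, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": policy},  # developer-authored policy, supplied at inference time
    {"role": "user", "content": content},   # the content to judge against that policy
]

out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1]["content"])  # reasoning plus the final label
```

Changing the policy is then just a matter of editing the `policy` string; no weights are touched.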
Same Pattern as OpenAI’s Internal Safety Reasoner
OpenAI states that gpt-oss-safeguard is an open-weight implementation of the Safety Reasoner it uses internally across systems such as GPT-5, ChatGPT Agent, and Sora 2. In production, OpenAI already runs small, high-recall filters first and escalates uncertain or sensitive items to a reasoning model; in recent launches, up to 16 percent of total compute was spent on safety reasoning. The open release lets external teams reproduce this defense-in-depth pattern instead of guessing how OpenAI’s stack works.
Model Sizes and Hardware Fit
The large model, gpt-oss-safeguard-120b, has 117B total parameters with 5.1B active parameters and is sized to fit on a single 80 GB H100-class GPU. The smaller gpt-oss-safeguard-20b has 21B total parameters with 3.6B active parameters and targets lower-latency or smaller GPUs, including 16 GB setups. Both models were trained with the harmony response format, so prompts must follow that structure or results will degrade. The license is Apache 2.0, the same as the parent gpt-oss models, so commercial local deployment is permitted.
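Because the models expect harmony-formatted prompts, one practical approach, assuming the Hugging Face checkpoint ships a harmony-aware chat template, is to let the tokenizer render the format rather than assembling the special tokens by hand:

```python
# Sketch: rely on the tokenizer's chat template to produce harmony-formatted input.
# Assumes the checkpoint ships a harmony-aware chat template; the model ID is illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-safeguard-20b")

messages = [
    {"role": "system", "content": "POLICY: ..."},        # your policy text
    {"role": "user", "content": "content to classify"},  # the item under review
]

# tokenize=False returns the rendered string so the harmony structure can be inspected.
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)  # should show harmony role/channel markers rather than plain concatenated text
```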

Evaluation Results
OpenAI evaluated the models on internal multi-policy tests and on public datasets. In multi-policy accuracy, where the model must correctly apply several policies at once, gpt-oss-safeguard and OpenAI’s internal Safety Reasoner outperform gpt-5-thinking and the open gpt-oss baselines. On the 2022 moderation dataset, the new models slightly outperform both gpt-5-thinking and the internal Safety Reasoner; however, OpenAI notes that this gap is not statistically significant, so it should not be overstated. On ToxicChat, the internal Safety Reasoner still leads, with gpt-oss-safeguard close behind. This places the open models in the competitive range for real moderation tasks.
Recommended Deployment Pattern
OpenAI is explicit that running a reasoning model on every request is expensive. The recommended setup is to run small, fast, high-recall classifiers on all traffic, send only uncertain or sensitive content to gpt-oss-safeguard, and, when user experience requires fast responses, run the reasoner asynchronously. This mirrors OpenAI’s own production guidance and reflects the fact that dedicated task-specific classifiers can still win when a large, high-quality labeled dataset is available.
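A schematic of that routing logic might look like the sketch below. The thresholds and the `cheap_score` and `reason_over_policy` helpers are hypothetical placeholders, not OpenAI’s implementation.

```python
# Sketch of the layered pipeline: a fast, high-recall classifier screens all traffic,
# and only the uncertain band is escalated to the policy-conditioned reasoner,
# asynchronously so user-facing latency is unaffected.
import asyncio

CLEAR_ALLOW, CLEAR_BLOCK = 0.2, 0.8  # illustrative confidence thresholds

def cheap_score(content: str) -> float:
    """Placeholder for a small, task-specific classifier returning P(violation)."""
    return 0.5 if "sell" in content.lower() else 0.05  # toy heuristic only

async def reason_over_policy(policy: str, content: str) -> str:
    """Placeholder for an async call into gpt-oss-safeguard with the policy attached."""
    await asyncio.sleep(0)  # real code would call a model server here
    return "needs_review"

async def moderate(policy: str, content: str) -> str:
    score = cheap_score(content)
    if score < CLEAR_ALLOW:
        return "allow"   # confidently clean: never touches the reasoner
    if score > CLEAR_BLOCK:
        return "block"   # confidently violating: fast path
    # Uncertain band: hand off to the reasoner off the hot path and enforce later.
    asyncio.create_task(reason_over_policy(policy, content))
    return "allow_pending_review"
```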
Key Takeaways
- gpt-oss-safeguard is a research preview of two open-weight safety reasoning models, 120b and 20b, that classify content against developer-supplied policies at inference time, so policy changes do not require retraining.
- The models implement the same Safety Reasoner pattern OpenAI uses internally across GPT-5, ChatGPT Agent, and Sora 2, where a fast first-pass filter routes only risky or ambiguous content to a slower reasoning model.
- Both models are fine-tuned from gpt-oss, keep the harmony response format, and are sized for real deployments: the 120b model fits on a single H100-class GPU, the 20b model targets 16 GB-class hardware, and both are Apache 2.0 on Hugging Face.
- On internal multi-policy evaluations and on the 2022 moderation dataset, the safeguard models outperform gpt-5-thinking and the gpt-oss baselines, but OpenAI notes that the small margin over the internal Safety Reasoner is not statistically significant.
- OpenAI recommends using these models in a layered moderation pipeline, together with community resources such as ROOST, so platforms can express custom taxonomies, audit the chain of thought, and update policies without touching weights.
OpenAI is taking an internal safety pattern and making it reproducible, which is the most important part of this launch. The models are open-weight, policy-conditioned, and Apache 2.0, so platforms can finally apply their own taxonomies instead of accepting fixed labels. gpt-oss-safeguard slightly exceeds the internal Safety Reasoner on the 2022 moderation dataset, although that margin is not statistically significant, and it outperforms gpt-5-thinking on multi-policy accuracy, which shows the approach is already usable. The recommended layered deployment is realistic for production.
Michal Sutter
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.