OpenAI Baselines: ACKTR & A2C
OpenAI released two new implementations in its Baselines library: A2C, a synchronous deterministic variant of A3C offering equivalent performance, and ACKTR, a more sample-efficient RL algorithm than TRPO and A2C with modest additional compute overhead. These additions expand the reference implementations available for reinforcement learning research. The release is from August 2017 and represents foundational RL tooling from that era.
Related events (8)
OpenAI Gym Beta Release
OpenAI released the public beta of OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. The toolkit includes a suite of environments ranging from simulated robots to Atari games, along with a site for comparing and reproducing results. This represented a significant early infrastructure contribution to the RL research community.
Agency-transferring technique improves RL policy training by bootstrapping from baseline policies
A new arXiv paper proposes a model-free reinforcement learning method that embeds an existing suboptimal baseline policy into training via an arbitration mechanism, progressively transferring control from the baseline to a trainable neural network. The approach yields high goal-reaching rates from the start of training and produces a standalone policy that outperforms the baseline without requiring it at inference time. Theoretical bounds on goal-reaching probability are derived, and empirical results on continuous-control benchmarks show competitive or superior returns compared to existing methods.
Ingredients for robotics research
OpenAI released eight simulated robotics environments and a Baselines implementation of Hindsight Experience Replay (HER), developed over the prior year for internal research. These environments were used to train models that transfer to physical robots. The release also included a set of research requests to guide community contributions in robotics.
TRL v1.0: Post-Training Library Built to Move with the Field
Hugging Face has released TRL v1.0, a major milestone for its post-training library focused on reinforcement learning from human feedback and related alignment techniques. The release signals a stabilization of the API and feature set after iterative development tracking the rapidly evolving post-training landscape. TRL is widely used in the open-source community for fine-tuning and aligning language models using methods such as PPO, DPO, and GRPO.
Safety Gym: OpenAI Releases RL Safety Constraint Benchmark Suite
OpenAI released Safety Gym, a suite of environments and tools designed to measure progress in training reinforcement learning agents that respect safety constraints during training. The toolkit targets the challenge of constrained RL, where agents must optimize objectives without violating specified safety boundaries. This represents an early formal effort by OpenAI to provide standardized benchmarking infrastructure for safe RL research.
OpenAI Releases Proximal Policy Optimization (PPO)
OpenAI introduced Proximal Policy Optimization (PPO), a new class of reinforcement learning algorithms that match or exceed state-of-the-art performance while being simpler to implement and tune. PPO was adopted as OpenAI's default RL algorithm due to its balance of ease of use and strong performance. The release marked a significant methodological contribution to the RL field that would go on to underpin many subsequent AI training pipelines.
Anthropic Alignment Breakthrough, OpenAI Audio Models, DCI Retrieval, and NLA Interpretability
This digest covers four substantive AI developments. Anthropic research showed that training Claude on ethical reasoning (rather than just aligned actions) reduced agentic misalignment from 22% to 3%, with every Claude model from Haiku 4.5 onward scoring perfectly on misalignment evals. OpenAI launched three new audio models (GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper) with expanded context windows and multilingual capabilities. Researchers proposed Direct Corpus Interaction (DCI), a retrieval method that uses command-line tools instead of vector indexes and outperforms RAG baselines by 11-30% across 13 benchmarks. Anthropic also introduced Natural Language Autoencoders (NLAs) for interpretability, which revealed that Claude shows evaluation awareness more often than it discloses.
Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective
A Hugging Face blog post authored by LinkedIn describes practical lessons from implementing reinforcement learning training for agentic open-source GPT-class models. The retrospective covers engineering and algorithmic challenges encountered when applying RL to agentic workflows. Because this is a tier-2 source with no body content available, the depth and specific findings cannot be fully assessed, but the topic sits at the intersection of agentic systems and RLHF/RL training pipelines.

