Almanac
← Events
4Hugging Face Blog·1mo ago

PipelineRL: ServiceNow's Pipeline-Based Reinforcement Learning Framework for LLMs

ServiceNow introduces PipelineRL, a reinforcement learning training framework for large language models published via the Hugging Face blog. The post describes a pipeline-based approach to RL training, likely addressing throughput and efficiency challenges in RLHF or similar post-training workflows. As a tier-2 source with minimal body content, the technical depth is unclear but the topic is relevant to alignment and training infrastructure.

Related guides (4)

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

Putting RL back in RLHF: RLOO Implementation on Hugging Face

Hugging Face published a blog post introducing RLOO (REINFORCE Leave-One-Out), a reinforcement learning algorithm aimed at making the RL component of RLHF more practical and effective. The post discusses implementation details and motivations for revisiting pure RL-based fine-tuning approaches within the TRL library. This represents a technical contribution to the alignment and RLHF tooling ecosystem, offering an alternative to PPO-based RLHF pipelines.

5Hugging Face Blog·1mo ago·source ↗

Illustrating Reinforcement Learning from Human Feedback (RLHF)

This Hugging Face blog post provides an illustrated overview of Reinforcement Learning from Human Feedback (RLHF), explaining the technique used to align large language models with human preferences. It covers the core pipeline: pretraining a language model, collecting human preference data, training a reward model, and fine-tuning with RL. Published in December 2022, it served as an accessible reference during the period when RLHF was becoming central to frontier model development.

5Hugging Face Blog·1mo ago·source ↗

StackLLaMA: A hands-on guide to train LLaMA with RLHF

Hugging Face published a detailed tutorial demonstrating how to fine-tune Meta's LLaMA model using Reinforcement Learning from Human Feedback (RLHF) on StackExchange data. The guide covers the full pipeline: supervised fine-tuning, reward model training, and PPO-based RL optimization. It serves as a practical reference for practitioners seeking to replicate RLHF workflows on open-weight models using the TRL library.

4Hugging Face Blog·1mo ago·source ↗

vLLM V0 to V1: Correctness Before Corrections in RL

A ServiceNow AI blog post on Hugging Face discusses lessons learned migrating reinforcement learning training pipelines from vLLM V0 to V1. The piece focuses on correctness issues encountered during the transition and how they were diagnosed and resolved before applying RL corrections. This is relevant to practitioners using vLLM as an inference backend for RL-based LLM training workflows.

6Hugging Face Blog·1mo ago·source ↗

Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries

A Hugging Face blog post surveys 16 open-source reinforcement learning libraries for LLM training, analyzing their architectural approaches to async and synchronous token generation pipelines. The piece distills practical lessons about throughput, scalability, and design trade-offs across the ecosystem. It serves as a comparative landscape analysis for practitioners building or choosing RL training infrastructure for language models.

6Hugging Face Blog·1mo ago·source ↗

The N Implementation Details of RLHF with PPO

This Hugging Face blog post catalogs the numerous low-level implementation details that matter when applying Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for language model fine-tuning. It covers practical engineering choices—such as reward normalization, KL penalty scheduling, value function initialization, and batch construction—that are often omitted from papers but significantly affect training stability and final performance. The post serves as a practitioner's reference for reproducing and improving RLHF pipelines.

8Openai Blog·1mo ago·source ↗

Aligning language models to follow instructions

OpenAI published a blog post describing their work on aligning language models to follow human instructions, corresponding to the InstructGPT research. This work introduced reinforcement learning from human feedback (RLHF) as a core technique for training models to be more helpful, honest, and aligned with user intent. The approach demonstrated that smaller instruction-tuned models could outperform larger base models on human preference evaluations, marking a foundational shift in how language models are trained and deployed.

5Openai Blog·1mo ago·source ↗

RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

OpenAI published RL², a meta-reinforcement learning approach in which a slow outer RL process trains a recurrent neural network whose hidden state encodes a fast inner learning algorithm. The method allows agents to rapidly adapt to new tasks within a single episode by leveraging experience accumulated across many training tasks. This work is an early foundational contribution to meta-learning for RL, predating the modern agent and LLM era but relevant to understanding the intellectual lineage of in-context and few-shot learning in AI systems.