4 · Hugging Face Blog · 1mo ago

vLLM V0 to V1: Correctness Before Corrections in RL

A ServiceNow AI blog post on Hugging Face discusses lessons learned migrating reinforcement learning training pipelines from vLLM V0 to V1. The piece focuses on correctness issues encountered during the transition and how they were diagnosed and resolved before applying RL corrections. This is relevant to practitioners using vLLM as an inference backend for RL-based LLM training workflows.

Inference Economics · Agent and Tool Ecosystem · Alignment and RLHF · ServiceNow AI · Reinforcement Learning from Human Feedback · vLLM

Related guides (3)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

4 · Hugging Face Blog · 1mo ago

PipelineRL: ServiceNow's Pipeline-Based Reinforcement Learning Framework for LLMs

ServiceNow introduces PipelineRL, a reinforcement learning training framework for large language models, published via the Hugging Face blog. The post describes a pipeline-based approach to RL training, likely addressing throughput and efficiency challenges in RLHF or similar post-training workflows. The source is tier-2 with minimal body content, so the technical depth is unclear, but the topic is relevant to alignment and training infrastructure.

Training Infrastructure · Agent and Tool Ecosystem · ServiceNow AI · PipelineRL · Hugging Face · +1 more

4 · Hugging Face Blog · 1mo ago

How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

A Hugging Face blog post describes a chatbot arena experiment evaluating LLMs' ability to self-correct errors, using Keras and TPUs as the infrastructure backbone. The experiment appears to use a head-to-head arena format to assess self-correction capabilities across models. This touches on both evaluation methodology and a core capability question about whether LLMs can reliably identify and fix their own mistakes.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Chatbot Arena · Keras · TPU · +1 more

5 · Hugging Face Blog · 1mo ago

StackLLaMA: A hands-on guide to train LLaMA with RLHF

Hugging Face published a detailed tutorial demonstrating how to fine-tune Meta's LLaMA model using Reinforcement Learning from Human Feedback (RLHF) on StackExchange data. The guide covers the full pipeline: supervised fine-tuning, reward model training, and PPO-based RL optimization. It serves as a practical reference for practitioners seeking to replicate RLHF workflows on open-weight models using the TRL library.

Open Weights Progress · Agent and Tool Ecosystem · Reinforcement Learning from Human Feedback · PPO · StackLLaMA · +5 more

6 · Hugging Face Blog · 1mo ago

TRL v1.0: Post-Training Library Built to Move with the Field

Hugging Face has released TRL v1.0, a major milestone for its post-training library focused on reinforcement learning from human feedback and related alignment techniques. The release signals a stabilization of the API and feature set after iterative development tracking the rapidly evolving post-training landscape. TRL is widely used in the open-source community for fine-tuning and aligning language models using methods such as PPO, DPO, and GRPO.

Open Weights Progress · Agent and Tool Ecosystem · GRPO · PPO · DPO · +3 more

5 · Hugging Face Blog · 1mo ago

Putting RL back in RLHF: RLOO Implementation on Hugging Face

Hugging Face published a blog post introducing RLOO (REINFORCE Leave-One-Out), a reinforcement learning algorithm aimed at making the RL component of RLHF more practical and effective. The post discusses implementation details and motivations for revisiting pure RL-based fine-tuning approaches within the TRL library. This represents a technical contribution to the alignment and RLHF tooling ecosystem, offering an alternative to PPO-based RLHF pipelines.

Agent and Tool Ecosystem · Alignment and RLHF · RLOO · Reinforcement Learning from Human Feedback · PPO · +2 more
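
For readers unfamiliar with the leave-one-out idea behind RLOO, the following is a minimal sketch of the baseline computation in general terms. It is not taken from the blog post or from TRL's code; the function name `rloo_advantages` is hypothetical, and it assumes one scalar reward per sampled completion of a single prompt.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k sampled completions of one prompt.

    Hypothetical helper for illustration only; not TRL's API.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    # The baseline for sample i is the mean reward of the other k - 1 samples,
    # so each sample is compared against its siblings rather than against a
    # learned value function.
    return [r - (total - r) / (k - 1) for r in rewards]


# Three completions of the same prompt with rewards 1, 2 and 3.
print(rloo_advantages([1.0, 2.0, 3.0]))
```

In a REINFORCE-style update, each advantage would weight the log-probability of its own completion. Because the baseline uses only the sibling samples, no separate critic model is needed, which is one common reason this approach is described as a lighter alternative to PPO.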

6 · arXiv · cs.LG · 4d ago

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking · Alignment and RLHF · ExpRL: Exploratory RL for LLM Mid-Training · GRPO · ExpRL

5 · arXiv · cs.CL · 4d ago

RL-trained LLMs learn retriever-specific query formulation strategies for RAG

A new arXiv paper presents the first systematic study of using reinforcement learning to teach LLMs to adapt query formulation strategies to different retrieval backends. The authors find that different retrievers have surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), making cross-retriever strategy transfer ineffective. They introduce a branching-based rollout technique to stabilize training over multi-step retrieval trajectories and show gains from retriever-specific human guidance and model scaling.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Understanding the Behaviors of Environment-aware Information Retrieval · LCO-Embedding

4 · Hugging Face Blog · 1mo ago

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.

Evaluation and Benchmarking · Multimodal Progress · Visual Question Answering · LAVE · Hugging Face · +1 more