6 · arXiv cs.CL (Computation and Language) · 17d ago

Taiji: Pareto Optimal Policy Optimization for LLM-enhanced recommendation at Kuaishou scale

Researchers from Kuaishou present Taiji, an LLM-as-Enhancer framework for industrial recommender systems that addresses two bottlenecks. During SFT, it generates high-quality chain-of-thought data via reverse-engineered reasoning and rejection sampling. During RL alignment, it balances semantic and ID-based rewards with a new algorithm called Pareto Optimal Policy Optimization (POPO). The system has been deployed on Kuaishou's advertising platform since May 2026, serving over 400 million daily users. The paper contributes both a practical deployment case study and a novel RL alignment technique for the LLM4Rec paradigm.

Enterprise Deployment Patterns · Alignment and RLHF · Taiji · Pareto Optimal Policy Optimization · Kuaishou
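The summary above describes balancing two reward signals, semantic and ID-based, rather than collapsing them into one score. The sketch below is not POPO itself, whose details are not given here; it only illustrates the underlying notion of Pareto optimality over candidate responses scored on two rewards. The function name `pareto_front` and the sample reward values are hypothetical.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of points that no other point dominates.

    A point dominates another if it is at least as good on both
    rewards and strictly better on at least one (higher is better).
    """
    front = []
    for i, (a1, b1) in enumerate(points):
        dominated = any(
            (a2 >= a1 and b2 >= b1) and (a2 > a1 or b2 > b1)
            for j, (a2, b2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical (semantic_reward, id_based_reward) pairs for four candidates.
rewards = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
print(pareto_front(rewards))  # [0, 1, 3]
```

Candidate 2 is dropped because candidate 1 is better on both rewards; the remaining three each represent a different trade-off that no single weighted sum would necessarily prefer.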

Related guides (2)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality


Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Related events (8)

5 · arXiv · cs.CL · 1mo ago

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

Frontier Model Releases · Evaluation and Benchmarking · RLVR · ROUGE-L · AIME24 · +10 more
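To make the contrast in the LamPO summary concrete, the sketch below computes a GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation) next to a plain aggregation of pairwise reward gaps. This is a simplified illustration, not LamPO's objective: the confidence-aware log-probability weighting is omitted, and the function names and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style scalar advantage: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def uniform_pairwise_advantages(rewards):
    """Average reward gap between each response and the others in its group.

    Every comparison has weight 1 here. A weighted scheme would replace
    that constant with a per-pair weight.
    """
    n = len(rewards)
    return [sum(r_i - r_j for r_j in rewards) / (n - 1) for r_i in rewards]


group = [1.0, 0.0, 0.0, 1.0]  # hypothetical verifiable rewards for 4 samples
print(group_relative_advantages(group))
print(uniform_pairwise_advantages(group))
```

With uniform weights, the pairwise form is just the mean-centered reward scaled by n / (n - 1), so any difference from a mean-centered baseline has to come from how the individual comparisons are weighted.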

5 · arXiv · cs.CL · 2d ago

Turing-RL: Reinforcement learning with Turing-Test-based rewards for user simulator training

Researchers propose Turing-RL, a method for training LLM-based user simulators using a discriminative reward signal that scores how indistinguishable generated responses are from real user responses, rather than matching a single ground-truth output. An LLM judge evaluates indistinguishability given the user's history, and the simulator is trained via RL to maximize this reward. Evaluated on conversational chat and Reddit forum discussion domains, Turing-RL outperforms log-probability and similarity-reward baselines on both LLM and human evaluation metrics. The work has implications for agent assistant training, personalization system evaluation, and social science research.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Turing-RL

5 · Hugging Face Blog · 1mo ago

Preference Tuning LLMs with Direct Preference Optimization Methods

A Hugging Face blog post surveys Direct Preference Optimization (DPO) and related preference tuning methods for aligning large language models. The post covers the landscape of DPO variants and their practical application via the TRL library. It serves as a technical reference for practitioners implementing RLHF alternatives.

Agent and Tool Ecosystem · Alignment and RLHF · Reinforcement Learning from Human Feedback · Direct Preference Optimization (DPO) · Hugging Face · +1 more

6 · arXiv · cs.LG · 4d ago

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking · Alignment and RLHF · ExpRL: Exploratory RL for LLM Mid-Training · GRPO · ExpRL

6 · arXiv · cs.CL · 2d ago

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.

Frontier Model Releases · Alignment and RLHF · DAPO · AIME 2026 · GRPO · +2 more
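The STARE summary mentions selecting tokens by batch-internal surprisal quantiles and reweighting their advantages. The sketch below shows only that selection-and-reweight mechanic under loud assumptions: the quantile `q`, the multiplier `weight`, the choice to scale down rather than up, and all function names are hypothetical, and the closed-loop entropy gate is left out entirely.

```python
import math


def surprisal(token_probs):
    """Surprisal of each sampled token: -log p(token)."""
    return [-math.log(p) for p in token_probs]


def quantile_threshold(values, q):
    """Simple rank-based quantile over the values in the batch."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def reweight(advantages, token_probs, q=0.8, weight=0.5):
    """Scale the advantage of tokens whose surprisal reaches the q-quantile.

    q and weight are illustrative values, not taken from the paper.
    """
    s = surprisal(token_probs)
    thr = quantile_threshold(s, q)
    return [a * weight if s_t >= thr else a for a, s_t in zip(advantages, s)]


# Hypothetical batch: five tokens, one of them sampled with low probability.
probs = [0.9, 0.8, 0.05, 0.7, 0.6]
advs = [1.0, 1.0, 1.0, 1.0, 1.0]
print(reweight(advs, probs))  # [1.0, 1.0, 0.5, 1.0, 1.0]
```

Because the threshold is computed from the batch itself, the set of affected tokens adapts as the policy's confidence changes during training.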

5 · arXiv · cs.CL · 22d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress · AI Safety Research · Reddit · LLUMI · Direct Preference Optimization (DPO) · +3 more
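As a rough illustration of how community endorsement signals like those in the LLUMI summary could become preference pairs, the sketch below pairs replies to the same post and treats the higher-scored reply as preferred. The pairing rule, the `min_gap` threshold, the field names, and the example data are all assumptions; the prompt/chosen/rejected layout is the one commonly used for DPO datasets.

```python
from itertools import combinations


def build_preference_pairs(prompt, replies, min_gap=1):
    """Pair replies to one post; the higher net score becomes 'chosen'.

    replies: list of dicts with 'text' and 'score' (upvotes minus downvotes).
    Pairs whose score gap is below min_gap are skipped as too ambiguous.
    """
    pairs = []
    for a, b in combinations(replies, 2):
        if abs(a["score"] - b["score"]) < min_gap:
            continue
        chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
        pairs.append(
            {"prompt": prompt, "chosen": chosen["text"], "rejected": rejected["text"]}
        )
    return pairs


# Hypothetical post and replies.
post = "I have been feeling overwhelmed lately."
replies = [
    {"text": "That sounds hard. What has helped before?", "score": 10},
    {"text": "Just get over it.", "score": 2},
    {"text": "Everyone feels that way.", "score": 2},
]
for pair in build_preference_pairs(post, replies):
    print(pair["chosen"], "| preferred over |", pair["rejected"])
```

The two low-scored replies tie, so they produce no pair with each other; only comparisons with a clear score gap contribute training signal.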

4 · arXiv · cs.AI · 47h ago

G2Rec: Scalable framework unifying graph-based user modeling with semantic tokenization for generative recommendation

Researchers propose G2Rec, a framework that combines holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation systems. The approach addresses limitations of existing methods—scalability issues in graph serialization and lack of supervision in semantic tokenization—by learning user interest prototypes without ground-truth labels. The system has been deployed in production across product surfaces and evaluated on public datasets, showing improvements over prior methods.

Enterprise Deployment Patterns · G2Rec

7 · arXiv · cs.AI · 29d ago

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.

Evaluation and Benchmarking · Inference Economics · GRPO · pass@k · AlphaEvolve · +4 more