Pareto Optimal Policy Optimization

technique · active · provisional · pareto-optimal-policy-optimization-dcc46cea · 1 event · first seen 13d ago

Aliases: Pareto Optimal Policy Optimization

Recent events (1)

6 · arXiv · cs.CL · 13d ago

Taiji: Pareto Optimal Policy Optimization for LLM-enhanced recommendation at Kuaishou scale

Researchers from Kuaishou present Taiji, an LLM-as-Enhancer framework for industrial recommender systems. It addresses two bottlenecks: during SFT, generating high-quality chain-of-thought data via reverse-engineered reasoning and rejection sampling; and during RL alignment, balancing semantic and ID-based rewards via a new algorithm, Pareto Optimal Policy Optimization (POPO). The system has been deployed on Kuaishou's advertising platform since May 2026, serving over 400 million daily users. The paper contributes both a practical deployment case study and a novel RL alignment technique for the LLM4Rec paradigm.
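
The summary does not describe how POPO works internally, so the following is only a minimal sketch of the general idea of Pareto optimality across two reward signals, not the paper's algorithm. It assumes each sampled response has already been scored with a semantic reward and an ID-based reward; the names `Reward`, `dominates`, and `pareto_front`, and the example scores, are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical reward pair: (semantic_reward, id_reward). Higher is better.
Reward = Tuple[float, float]


def dominates(a: Reward, b: Reward) -> bool:
    """Return True if a is no worse than b on both rewards and strictly better on at least one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(rewards: List[Reward]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(other, r) for j, other in enumerate(rewards) if j != i)
    ]


# Illustrative scores for five sampled responses (made-up values).
candidates = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9), (0.1, 0.1)]
print(pareto_front(candidates))  # [0, 1, 3]
```

Here the responses at indices 0, 1, and 3 each represent a different trade-off that cannot be improved on one reward without losing on the other, while indices 2 and 4 are dominated. How an RL objective would actually use such a front (for example, to weight or select samples) is a design choice the summary leaves unspecified.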