Entity · technique

LamPO

technique · active · lampo-09aecf92 · 1 event · first seen May 21, 2026

Aliases: LamPO

Co-occurring entities

RLVR ROUGE-L AIME24 GRPO AIME25 PPO Qwen3-4B GPQA Diamond Phi-4-mini Qwen3-1.7B MATH-500

More like this (12)

LaMP-2 OLMo LAMBDA SmolLM2 LPU 3LM LayoutLM DLAM SmolLM3 LAMDA-CL LAMMPS MDLM

Recent events (1)

5 · arXiv · cs.CL · May 21, 2026

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, which aggregates pairwise reward gaps within each response group and weights each comparison by a confidence-aware log-probability difference. The method retains the critic-free, clipped-update structure of PPO-style training and optionally adds a ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants, along with more stable training dynamics.
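To make the contrast with GRPO concrete, here is a minimal Python sketch of the idea as summarized above. It is not the paper's implementation. The summary does not give the exact weighting formula, so `pairwise_weight` (a sigmoid that upweights pairs the policy currently orders against the reward) is a hypothetical placeholder, and the function names, the `temperature` parameter, and the sample numbers are all assumptions made for illustration. Only the GRPO-style standardization and the PPO clipped surrogate follow their standard, well-known forms.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style scalar advantage: each reward standardized within its group."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_weight(logp_i, logp_j, reward_gap, temperature=1.0):
    """HYPOTHETICAL confidence-aware weight (an assumption, not from the source).

    Close to 1 when the policy's log-probability ordering disagrees with the
    reward ordering, close to 0 when the policy already ranks the pair correctly.
    """
    if reward_gap == 0:
        return 0.0
    sign = 1.0 if reward_gap > 0 else -1.0
    margin = sign * (logp_i - logp_j) / temperature
    return _sigmoid(-margin)


def pairwise_decomposed_advantages(rewards, logps, weight_fn=pairwise_weight):
    """Advantage of response i as a weighted mean of reward gaps against
    every other response j in the same group."""
    g = len(rewards)
    if g < 2:
        raise ValueError("need at least two responses per group")
    advantages = []
    for i in range(g):
        total = 0.0
        for j in range(g):
            if i == j:
                continue
            gap = rewards[i] - rewards[j]
            total += weight_fn(logps[i], logps[j], gap) * gap
        advantages.append(total / (g - 1))
    return advantages


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Illustrative group of four sampled responses (made-up numbers).
rewards = [1.0, 0.0, 0.0, 1.0]      # verifiable 0/1 rewards
logps = [-12.0, -9.0, -15.0, -8.0]  # sequence log-probabilities under the policy

scalar_adv = grpo_advantages(rewards)
pairwise_adv = pairwise_decomposed_advantages(rewards, logps)
uniform_adv = pairwise_decomposed_advantages(
    rewards, logps, weight_fn=lambda lp_i, lp_j, gap: 1.0
)
```

With uniform weights the pairwise aggregate reduces to G/(G-1) times the mean-centered reward, so any difference from a GRPO-style baseline in this sketch comes entirely from the weighting term. The two correct responses above receive the same GRPO advantage but different pairwise advantages, because the policy is less confident in the first one.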

Frontier Model Releases · Evaluation and Benchmarking · RLVR · ROUGE-L · AIME24 · +10 more