technique
policy gradient
technique · active
policy-gradient-67c24139 · 1 event · first seen 26d ago
Aliases: policy gradient
Co-occurring entities
More like this (12)
Policy Gradient Methods
Evolved Policy Gradients
diffusion-based policy
Wasserstein Policy Gradient
gradient accumulation
Dual-Evidence Gradient Purification
Proximal Policy Optimization
behavioral-gradient validator
gradient flow dynamics
political bias evaluation
gradient noise scale
GRPO (Group Relative Policy Optimization)
Recent events (1)
DelTA: Discriminative Token Credit Assignment for RLVR Training
DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.
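To make the idea of per-token coefficients concrete, here is a minimal NumPy sketch of a token-weighted policy-gradient surrogate loss. It is not DelTA's implementation: the summary above does not say how the coefficients are estimated, so the hypothetical `coeffs` argument is simply passed in, and the function name `weighted_pg_loss` and the sequence-level `advantages` are assumptions made for illustration. With all coefficients equal to 1, the expression reduces to an ordinary token-averaged policy-gradient loss.

```python
import numpy as np

def weighted_pg_loss(logprobs, advantages, coeffs, mask):
    """Token-weighted policy-gradient surrogate loss (illustrative only).

    logprobs:   (batch, seq) log-probabilities of the sampled tokens
    advantages: (batch,)     one advantage per sequence (assumed here)
    coeffs:     (batch, seq) per-token weights; hypothetical input, since
                             their estimation is not described in the summary
    mask:       (batch, seq) 1 for real tokens, 0 for padding
    """
    # Each token contributes -coeff * advantage * log pi(token).
    per_token = -coeffs * advantages[:, None] * logprobs * mask
    # Average over real (unmasked) tokens only.
    return per_token.sum() / mask.sum()

logprobs = np.array([[-1.0, -2.0], [-0.5, -0.5]])
advantages = np.array([1.0, -1.0])
mask = np.ones_like(logprobs)

# Uniform weights: the standard token-averaged loss.
uniform = weighted_pg_loss(logprobs, advantages, np.ones_like(logprobs), mask)

# Downweight the first token of sequence 0 and amplify the second.
coeffs = np.array([[0.5, 2.0], [1.0, 1.0]])
weighted = weighted_pg_loss(logprobs, advantages, coeffs, mask)
```

In this sketch, raising a token's coefficient scales that token's contribution to the gradient, while a coefficient near zero removes it from the update. That is the lever the summary describes for keeping frequent formatting tokens from dominating.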