Entity · technique

policy gradient

technique · active · policy-gradient-67c24139 · 1 event · first seen May 21, 2026

Aliases: policy gradient

Co-occurring entities

DelTA · Qwen3-8B-Base · token credit assignment · Qwen · Reinforcement Learning with Verifiable Rewards · Qwen3-14B-Base

More like this (12)

Policy Gradient Methods · Evolved Policy Gradients · diffusion-based policy · Wasserstein Policy Gradient · gradient accumulation · Dual-Evidence Gradient Purification · Proximal Policy Optimization · Mask-Aware Policy Gradients for Diffusion Language Models · behavioral-gradient validator · gradient flow dynamics · political bias evaluation · gradient noise scale

Recent events (1)

6 · arXiv · cs.CL · May 21, 2026

DelTA: Discriminative Token Credit Assignment for RLVR Training

DelTA introduces a discriminative token credit assignment method for reinforcement learning with verifiable rewards (RLVR). It targets a specific failure mode: high-frequency formatting tokens dominate policy gradient updates. The method estimates per-token coefficients that amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 on Qwen3-14B-Base, with additional gains on code generation tasks.
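To make the idea of per-token coefficients concrete, here is a minimal sketch of how such coefficients could enter a REINFORCE-style policy gradient. It is not DelTA's estimator: the summary does not say how the coefficients are computed, so the `coeffs` values below are hand-picked placeholders, and the function name, shapes, and the scalar sequence-level `advantage` are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weighted_pg_logit_grad(logits, actions, advantage, coeffs):
    """Gradient w.r.t. logits of the loss
        -advantage * sum_t coeffs[t] * log pi(actions[t]).

    logits:    (T, V) per-position logits of a softmax policy
    actions:   (T,)   sampled token ids
    advantage: scalar sequence-level advantage (assumed)
    coeffs:    (T,)   hypothetical per-token coefficients
    """
    probs = softmax(logits)
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(actions)), actions] = 1.0
    # d/dlogits of log pi(a_t) is (onehot - probs); scale it per token.
    return -advantage * coeffs[:, None] * (onehot - probs)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))      # 4 tokens, vocabulary of 5
actions = np.array([0, 2, 2, 4])

uniform = np.ones(4)                            # plain policy gradient
reweighted = np.array([0.1, 1.5, 1.5, 0.1])     # placeholder coefficients

g_uniform = weighted_pg_logit_grad(logits, actions, 1.0, uniform)
g_reweighted = weighted_pg_logit_grad(logits, actions, 1.0, reweighted)
```

With uniform coefficients every token contributes equally to the update. In the reweighted case the first and last positions (standing in for formatting tokens) contribute a tenth as much, while the middle positions are amplified by 1.5.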

Frontier Model Releases · Evaluation and Benchmarking · DelTA · Qwen3-8B-Base · policy gradient