Almanac
model

Qwen3-14B-Base

model · active · qwen3-14b-base-95402e8a · 1 event · first seen 27d ago

Aliases: Qwen3-14B-Base

Recent events (1)

6 · arXiv · cs.CL · 27d ago

DelTA: Discriminative Token Credit Assignment for RLVR Training

DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.
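
To make the idea of per-token coefficients concrete, here is a toy sketch. It is not DelTA's estimator, which the summary above does not specify. It assumes that "sides" means responses with positive versus negative reward, and it uses a simple frequency contrast as a stand-in for discriminativeness. The function names, the `floor` parameter, and the example tokens are all hypothetical.

```python
from collections import Counter


def token_coefficients(positive, negative, floor=0.1):
    """Toy per-token weights (assumption, not the paper's formula).

    Tokens equally frequent on both sides get `floor`; tokens seen on
    only one side get 1.0; everything else falls in between.
    """
    pos_counts = Counter(tok for seq in positive for tok in seq)
    neg_counts = Counter(tok for seq in negative for tok in seq)
    pos_total = sum(pos_counts.values()) or 1
    neg_total = sum(neg_counts.values()) or 1

    coeffs = {}
    for tok in set(pos_counts) | set(neg_counts):
        p = pos_counts[tok] / pos_total  # relative frequency on the positive side
        n = neg_counts[tok] / neg_total  # relative frequency on the negative side
        contrast = abs(p - n) / (p + n)  # 0 if shared equally, 1 if one-sided
        coeffs[tok] = floor + (1.0 - floor) * contrast
    return coeffs


def weighted_token_advantages(sequence, advantage, coeffs):
    """Scale a sequence-level advantage by each token's coefficient.

    Tokens without a coefficient keep the unscaled advantage.
    """
    return [advantage * coeffs.get(tok, 1.0) for tok in sequence]


# Hypothetical rollouts: both share formatting tokens, one token differs.
positive = [["<think>", "factor", "</think>"]]
negative = [["<think>", "guess", "</think>"]]

coeffs = token_coefficients(positive, negative)
weights = weighted_token_advantages(positive[0], 1.0, coeffs)
```

In this toy setup the shared formatting tokens `<think>` and `</think>` receive the floor weight, while `factor` and `guess` keep full weight, so the update is driven mostly by the tokens that distinguish the two sides.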