technique
Gradient-Guided Reward Optimization
technique · active · provisional
gradient-guided-reward-optimization-9c2867b9 · 1 event · first seen 8d ago
Aliases: Gradient-Guided Reward Optimization
Co-occurring entities
More like this (12)
Scaling Laws for Reward Model Overoptimization
Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
In-Context Reward Adaptation
Entropy-Regularized Reinforcement Learning
rule-based reinforcement learning rewards
Gravity-Weighted Direct Preference Optimization
RREDCoT: Segment-Level Reward Redistribution for Reasoning Models
Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization
Hybrid Reward Advantage Splitting
Evolved Policy Gradients
Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization
Rule-Based Rewards
Recent events (1)
GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment
Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, instead of relying on sampling-intensive re-ranking approaches such as Best-of-N. The method monitors token-level entropy to detect distribution drift and steers the generation trajectory directly; the authors claim improved robustness to reward hacking with minimal computational overhead. The reported experiments show gains on safety, helpfulness, and reasoning benchmarks over standard inference-time alignment baselines.
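The summary above does not give GGRO's actual algorithm, so the following is only a toy sketch of the general idea it describes: measure token-level entropy at a decoding step and, when it is high, let a reward-gradient signal steer the token choice. Everything here is an assumption made for illustration: the function names, the entropy threshold, the additive logit shift, and the `reward_grad` vector, which stands in for a per-token gradient signal that a real reward model would have to supply.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_token(logits, reward_grad, entropy_threshold=1.0, step_size=1.0):
    """Pick the next token id; returns (token_id, was_nudged).

    Hypothetical rule, not the published method: decode greedily when the
    model is confident, and otherwise shift each logit by a reward-gradient
    score before taking the argmax.
    """
    entropy = token_entropy(softmax(logits))
    if entropy <= entropy_threshold:
        # Low uncertainty: keep the base model's greedy choice.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return best, False
    # High uncertainty: nudge the logits along the (assumed) reward gradient.
    steered = [l + step_size * g for l, g in zip(logits, reward_grad)]
    best = max(range(len(steered)), key=lambda i: steered[i])
    return best, True

# Hypothetical reward-gradient scores for a 4-token vocabulary.
reward_grad = [0.1, -0.2, 0.9, 0.0]
confident = choose_token([10.0, 0.0, 0.0, 0.0], reward_grad)
uncertain = choose_token([0.0, 0.0, 0.0, 0.0], reward_grad)
```

In this sketch the confident step ignores the reward signal entirely, while the uniform (maximum-entropy) step is decided by it. Unlike Best-of-N, nothing is sampled and re-ranked, which is the contrast the summary draws; the cost of obtaining `reward_grad` from a real reward model is not modelled here.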