RREDCoT
RREDCoT: Segment-level reward redistribution for chain-of-thought reasoning via self-approximated credit assignment
RREDCoT is a new method for redistributing rewards across segments of chain-of-thought (CoT) traces during RL fine-tuning of reasoning language models. It targets the high-variance delayed-reward problem inherent in GRPO-style training, where the reward arrives only once the full trace is complete. Rather than estimating intermediate state values with computationally expensive Monte Carlo (MC) sampling, the method uses the model itself to approximate the optimal reward redistribution, with no additional generation passes. The paper evaluates RREDCoT against MC sampling and several attribution baselines, and analyzes segmentation strategies and state value estimation. This is relevant to the active research thread on improving the stability and efficiency of RL fine-tuning for reasoning models.
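
The summary does not say how RREDCoT computes its redistribution, so the following is only a minimal sketch of the general idea of segment-level reward redistribution from state values, not the paper's algorithm. It assumes a value estimate is already available at the start of each segment (from MC sampling, a model-based approximation, or any other source) and assigns each segment the change in estimated value it produces. The function name `redistribute_reward` and its parameters are hypothetical.

```python
from typing import List


def redistribute_reward(values: List[float], final_reward: float) -> List[float]:
    """Spread one delayed reward over the K segments of a trace.

    values[k] is an assumed estimate of the expected final reward at the
    start of segment k (so values[0] is the estimate before any reasoning).
    How these estimates are obtained is outside the scope of this sketch.
    """
    if not values:
        raise ValueError("need one value estimate per segment")
    # Append the observed reward as the value after the last segment.
    boundaries = list(values) + [final_reward]
    # Each segment is credited with the value change it caused.
    return [boundaries[k + 1] - boundaries[k] for k in range(len(values))]


if __name__ == "__main__":
    # Three segments; the second segment raises the estimate the most.
    segment_rewards = redistribute_reward([0.5, 0.25, 0.75], final_reward=1.0)
    print(segment_rewards)
```

Because the differences telescope, the segment rewards always sum to `final_reward - values[0]`, so the redistribution changes when credit is assigned without changing the total return relative to the initial estimate.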