Almanac

RREDCoT

technique · active · provisional · rredcot-13203903 · 1 event · first seen 12d ago

Aliases: RREDCoT

Recent events (1)

5 · arXiv · cs.LG · 12d ago

RREDCoT: Segment-level reward redistribution for chain-of-thought reasoning via self-approximated credit assignment

RREDCoT is a new method for redistributing rewards across segments of Chain-of-Thought traces during RL fine-tuning of reasoning language models, addressing the high-variance delayed-reward problem inherent in GRPO-style training. Rather than using computationally expensive Monte Carlo sampling for intermediate state value estimation, the method uses the model itself to approximate optimal reward redistribution without additional generation passes. The paper evaluates RREDCoT against MC sampling and several attribution baselines, analyzing segmentation strategies and state value estimation. This is relevant to the active research thread on improving RL fine-tuning stability and efficiency for reasoning models.
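
To make the idea of segment-level reward redistribution concrete, here is a minimal sketch of one generic scheme: each segment is credited with the change in estimated state value across it, so the per-segment rewards telescope back to the final reward. This is not taken from the paper. The function name `redistribute_reward`, its parameters, and the numeric value estimates are hypothetical, and how RREDCoT actually obtains its estimates from the model is not described in the summary above.

```python
from typing import List, Sequence


def redistribute_reward(
    boundary_values: Sequence[float],
    final_reward: float,
    initial_value: float = 0.0,
) -> List[float]:
    """Spread a single delayed reward over the segments of one trace.

    boundary_values[k] is an assumed estimate of the expected final reward
    after segment k+1 has been generated (one entry per interior boundary).
    Each segment is credited with the value difference across it, so the
    credits sum to final_reward - initial_value.
    """
    # Value trajectory: before the trace, at each interior boundary, at the end.
    trajectory = [initial_value, *boundary_values, final_reward]
    # Credit for a segment = value after it minus value before it.
    return [later - earlier for earlier, later in zip(trajectory, trajectory[1:])]


if __name__ == "__main__":
    # Hypothetical three-segment trace that ends with reward 1.0.
    credits = redistribute_reward([0.25, 0.75], final_reward=1.0)
    print(credits)       # per-segment rewards
    print(sum(credits))  # equals final_reward - initial_value
```

In this sketch the middle segment receives the largest credit because the assumed value estimate rises most across it. A Monte Carlo baseline would obtain `boundary_values` by sampling extra completions from each boundary, which is the cost the summary says RREDCoT avoids.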