Predicting Future Behaviors in Reasoning Models Enables Better Steering

paper · active · provisional · predicting-future-behaviors-in-reasoning-models-enables-better-steering-05d4f39d · 1 event · first seen 7d ago

Recent events (1)

arXiv · cs.LG · 7d ago

Future Probe Controlled Generation steers reasoning models with minimal quality degradation

Researchers introduce Future Probe Controlled Generation (FPCG), a text-level steering method for large reasoning models (LRMs). Rather than detecting a behavior in already-generated text, FPCG trains activation probes to predict, from intermediate reasoning steps, how likely each behavior is to occur later. The probes predict the most likely future behavior with 64–91% accuracy, which reveals a class of internal prediction features distinct from detection features. FPCG steers outputs by sampling candidate sentences and keeping the one the probes score best; this achieves steering with minimal degradation in output quality and succeeds in cases where activation steering fails. The work thus gives a principled distinction between detection and prediction features as intervention targets for controlling LRM behavior.
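
The sample-then-select step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the linear-plus-sigmoid probe form, the function names (`probe_probability`, `select_sentence`), and the toy stand-ins (`toy_activation`, `toy_candidates`, `TOY_WEIGHTS`, `TOY_BIAS`) are all assumptions made for illustration. In a real setup the activation would come from the reasoning model's hidden states and the candidates from sampling the model itself.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def probe_probability(activation: Vector, weights: Vector, bias: float) -> float:
    """Assumed probe form: a linear map followed by a sigmoid.

    Returns the estimated probability that the target behavior occurs later.
    """
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def select_sentence(
    prefix: str,
    candidates: List[str],
    get_activation: Callable[[str], Vector],
    weights: Vector,
    bias: float,
) -> str:
    """Return the candidate whose continuation the probe scores highest."""
    if not candidates:
        raise ValueError("at least one candidate sentence is required")
    return max(
        candidates,
        key=lambda s: probe_probability(get_activation(prefix + s), weights, bias),
    )


# --- Hypothetical stand-ins; a real system would query the model instead. ---

def toy_activation(text: str) -> List[float]:
    """Toy 'activation': word counts standing in for hidden-state features."""
    lowered = text.lower()
    return [float(lowered.count("check")), float(lowered.count("answer"))]


def toy_candidates() -> List[str]:
    """Toy 'sampler': fixed candidates standing in for sampled sentences."""
    return [" So the answer is 4.", " Let me check that again.", " Moving on."]


TOY_WEIGHTS = [2.0, -1.0]  # assumed probe weights, not learned from data
TOY_BIAS = 0.0

if __name__ == "__main__":
    chosen = select_sentence(
        "2 + 2 equals something.",
        toy_candidates(),
        toy_activation,
        TOY_WEIGHTS,
        TOY_BIAS,
    )
    print(chosen)
```

Because selection happens over sampled text rather than by editing activations, every emitted sentence is one the model itself produced, which is one plausible reading of why the summary reports little quality loss.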