UBP2

technique · active · provisional · ubp2-202f148b · 1 event · first seen 2d ago

Aliases: UBP2

Recent events (1)

4 · arXiv · cs.AI · 2d ago

UBP2: Model-based preference RL with uncertainty-balanced exploration achieves sublinear regret

UBP2 (Uncertainty-Balanced Preference Planning) is a model-based reinforcement learning method that improves sample efficiency in preference-based RL by jointly reasoning over uncertainties in reward, dynamics, and value functions. The approach uses ensembles to score candidate trajectories and provides a principled exploitation-exploration tradeoff without ad hoc heuristics. The authors prove sublinear regret guarantees for finite- and infinite-horizon settings and demonstrate substantially better sample efficiency than model-free baselines on the Meta-World benchmark.
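
To make the idea of ensemble-based trajectory scoring concrete, here is a minimal sketch of one common way to trade off exploitation and exploration: score each candidate by the ensemble's mean return estimate plus a bonus proportional to the ensemble's disagreement. This is an illustrative assumption, not the scoring rule from the UBP2 paper, which the summary above does not specify. The names `score_candidates`, `select_trajectory`, and `beta`, and the toy feature-based trajectories, are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Hypothetical types: a trajectory summarised as a feature vector, and an
# ensemble member that maps a trajectory to an estimated return.
Trajectory = Sequence[float]
ReturnModel = Callable[[Trajectory], float]


def score_candidates(
    candidates: Sequence[Trajectory],
    ensemble: Sequence[ReturnModel],
    beta: float = 1.0,
) -> List[float]:
    """Score each candidate as ensemble mean + beta * ensemble std.

    The mean term rewards exploitation; the std term (disagreement between
    ensemble members) rewards exploration. This rule is an assumption made
    for illustration, not the published UBP2 objective.
    """
    scores = []
    for traj in candidates:
        estimates = [model(traj) for model in ensemble]
        scores.append(mean(estimates) + beta * pstdev(estimates))
    return scores


def select_trajectory(
    candidates: Sequence[Trajectory],
    ensemble: Sequence[ReturnModel],
    beta: float = 1.0,
) -> int:
    """Return the index of the highest-scoring candidate."""
    scores = score_candidates(candidates, ensemble, beta)
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy ensemble: members agree on the first feature, disagree on the second.
ensemble = [
    lambda t: t[0],
    lambda t: t[0] + t[1],
    lambda t: t[0] - t[1],
]

# Candidate 0 is well understood; candidate 1 has a lower mean estimate
# but the ensemble disagrees about it.
candidates = [(1.0, 0.0), (0.8, 1.0)]

greedy_choice = select_trajectory(candidates, ensemble, beta=0.0)
exploratory_choice = select_trajectory(candidates, ensemble, beta=1.0)
```

With `beta=0.0` the selection is purely greedy and picks candidate 0, whose mean estimate is highest. With `beta=1.0` the disagreement bonus outweighs the gap in means and candidate 1 is chosen instead. A method that reasons jointly over reward, dynamics, and value uncertainty would combine several such terms; this sketch collapses them into a single return estimate for brevity.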