technique

UBP2

technique · active · provisional · ubp2-202f148b · 1 event · first seen 2d ago

Aliases: UBP2

Co-occurring entities

Meta-World

More like this (12)

GP-UCB AB-UPT BLIP-2 u-muP Uni-1 MBPP BOPTEST LM1B UMAP AUPRC E2B TPU

Recent events (1)

4 · arXiv · cs.AI · 2d ago

UBP2: Model-based preference RL with uncertainty-balanced exploration achieves sublinear regret

UBP2 (Uncertainty-Balanced Preference Planning) is a model-based reinforcement learning method that improves sample efficiency in preference-based RL by jointly reasoning over uncertainties in reward, dynamics, and value functions. The approach uses ensembles to score candidate trajectories and provides a principled exploitation-exploration tradeoff without ad hoc heuristics. The authors prove sublinear regret guarantees for finite- and infinite-horizon settings and demonstrate substantially better sample efficiency than model-free baselines on the Meta-World benchmark.
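The summary above does not give UBP2's actual scoring rule, so the following is only a minimal sketch of the general idea of ensemble-based trajectory scoring. It assumes a generic optimism-style rule (ensemble mean plus a disagreement bonus), which may differ from what the authors use. It also collapses the reward, dynamics, and value uncertainties into a single ensemble of return predictors. The names `score_trajectory`, `select_trajectory`, and the weight `beta` are hypothetical and chosen for illustration.

```python
import statistics
from typing import Callable, Sequence

# Hypothetical types: a trajectory is a sequence of scalar features, and each
# ensemble member maps a trajectory to a predicted return.
Trajectory = Sequence[float]
ReturnModel = Callable[[Trajectory], float]


def score_trajectory(
    traj: Trajectory, ensemble: Sequence[ReturnModel], beta: float
) -> float:
    """Score = ensemble mean (exploitation) + beta * ensemble spread (exploration).

    This is a generic optimism-style rule assumed for illustration,
    not the scoring rule from the UBP2 paper.
    """
    predictions = [model(traj) for model in ensemble]
    mean = statistics.fmean(predictions)
    spread = statistics.pstdev(predictions)  # disagreement as an uncertainty proxy
    return mean + beta * spread


def select_trajectory(
    candidates: Sequence[Trajectory],
    ensemble: Sequence[ReturnModel],
    beta: float = 1.0,
) -> int:
    """Return the index of the candidate with the highest score."""
    return max(
        range(len(candidates)),
        key=lambda i: score_trajectory(candidates[i], ensemble, beta),
    )


# Toy ensemble of two models that each read a different feature.
toy_ensemble = [lambda t: t[0], lambda t: t[1]]
toy_candidates = [
    [1.0, 1.0],  # models agree: mean 1.0, no disagreement
    [0.0, 1.8],  # models disagree: mean 0.9, large disagreement
]

greedy_choice = select_trajectory(toy_candidates, toy_ensemble, beta=0.0)
exploratory_choice = select_trajectory(toy_candidates, toy_ensemble, beta=1.0)
```

With `beta=0.0` the rule is purely greedy and picks the candidate the ensemble agrees on. With `beta=1.0` the disagreement bonus tips the choice toward the uncertain candidate, which is the kind of exploitation-exploration balance the summary refers to.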

Evaluation and Benchmarking · Alignment and RLHF · Meta-World · UBP2