Almanac
benchmark

Meta-World

benchmark · active · provisional · meta-world-3fe1d069 · 1 event · first seen 2d ago

Aliases: Meta-World

Recent events (1)

4 · arXiv · cs.AI · 2d ago

UBP2: Model-based preference RL with uncertainty-balanced exploration achieves sublinear regret

UBP2 (Uncertainty-Balanced Preference Planning) is a model-based reinforcement learning method that improves sample efficiency in preference-based RL by jointly reasoning over uncertainties in reward, dynamics, and value functions. The approach uses ensembles to score candidate trajectories and provides a principled exploitation-exploration tradeoff without ad hoc heuristics. The authors prove sublinear regret guarantees for finite- and infinite-horizon settings and demonstrate substantially better sample efficiency than model-free baselines on the Meta-World benchmark.
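The idea of scoring candidate trajectories with an ensemble can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes each ensemble member has already produced one predicted return per candidate trajectory, and it folds the separate reward, dynamics, and value uncertainties described in the summary into a single disagreement term (the ensemble standard deviation). The function names, the `beta` weight, and the sample numbers are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def score_trajectory(ensemble_returns, beta=1.0):
    """Optimistic score: ensemble mean plus beta times ensemble disagreement.

    ensemble_returns holds one predicted return per ensemble member for a
    single candidate trajectory. beta is a hypothetical exploration weight,
    not a parameter taken from the UBP2 paper.
    """
    return mean(ensemble_returns) + beta * pstdev(ensemble_returns)


def select_trajectory(candidates, beta=1.0):
    """Return the index of the candidate with the highest score."""
    scores = [score_trajectory(returns, beta) for returns in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical ensemble predictions: one list of returns per candidate.
candidates = [
    [1.0, 1.0, 1.0],  # members agree, moderate return
    [0.0, 1.0, 2.0],  # same mean, members disagree strongly
    [0.5, 0.5, 0.5],  # members agree, low return
]

greedy_choice = select_trajectory(candidates, beta=0.0)
exploring_choice = select_trajectory(candidates, beta=1.0)
```

With `beta=0.0` the selection is purely exploitative and depends only on the mean prediction. A positive `beta` favors the candidate the ensemble disagrees on most, which is one simple way to trade exploitation against exploration.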