Almanac
5 · arXiv cs.LG (Machine Learning) · 9d ago

UniIntervene: Agentic model reduces human interventions in real-world robot RL by 57%

UniIntervene is a proposed agentic intervention model for human-in-the-loop reinforcement learning (HiL-RL) that autonomously detects unproductive exploration and recovers robot policies toward high-value states, replacing the bulk of human corrections. The system uses future-conditioned action-value estimation, a temporal value-risk critic, and a goal-conditioned recovery policy drawing from a memory of past interventions. Experiments on real-world robotic manipulation tasks show a 57% reduction in human interventions and an 8.6% improvement in average success rate over state-of-the-art HiL-RL baselines.

Related events (8)

6 · arXiv · cs.LG · 4d ago

HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies

Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.

6 · arXiv · cs.AI · 17d ago

AgenticRL: Self-refining LLM-guided reward design and policy refinement for UAV navigation

AgenticRL is a framework that uses a multimodal GPT agent to automate reward function generation, policy training via PPO, and closed-loop self-refinement for UAV navigation tasks. The agent evaluates trained policies through diagnostic feedback, identifies failure modes, and iteratively refines rewards without human intervention. Evaluated across five navigation tasks, the closed-loop refinement improves policy behavior by 71% over initial rewards, with sim-to-real transfer achieving 91% real-world success rate and 94% sim-to-real accuracy.

6 · arXiv · cs.CL · 25d ago

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLMs via RL-Driven Prompt Optimisation

SafeCtrl-RL is a framework for controlling LLM safety at inference time without retraining or modifying model parameters. It formulates dialogue generation as a sequential decision process where an RL agent dynamically selects prompt adjustment strategies based on contextual feedback, iteratively suppressing unsafe outputs. The authors frame this as 'inference-time behavioural unlearning' and report improvements in safety and response quality across multiple LLMs and unsafe dialogue scenarios, outperforming existing prompt-based optimisation baselines.

4 · arXiv · cs.LG · 11d ago

Agency-transferring technique improves RL policy training by bootstrapping from baseline policies

A new arXiv paper proposes a model-free reinforcement learning method that embeds an existing suboptimal baseline policy into training via an arbitration mechanism, progressively transferring control from the baseline to a trainable neural network. The approach yields high goal-reaching rates from the start of training and produces a standalone policy that outperforms the baseline without requiring it at inference time. Theoretical bounds on goal-reaching probability are derived, and empirical results on continuous-control benchmarks show competitive or superior returns compared to existing methods.
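As a rough illustration of what "progressively transferring control" could look like, here is a minimal sketch. It assumes a linear annealing schedule and a random per-step choice between the two policies. The summary does not specify the paper's actual arbitration mechanism, so `baseline_probability`, `arbitrate`, and the linear schedule are all hypothetical stand-ins.

```python
import random


def baseline_probability(step: int, total_steps: int) -> float:
    """Chance of letting the baseline act at this step.

    ASSUMPTION: linear decay from 1.0 to 0.0 over training; the paper's
    real arbitration rule is not described in the summary above.
    """
    if total_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / total_steps)


def arbitrate(state, step, total_steps, baseline_policy, learned_policy, rng=random):
    """Choose which policy acts; returns (action, used_baseline)."""
    if rng.random() < baseline_probability(step, total_steps):
        return baseline_policy(state), True
    return learned_policy(state), False


# Hypothetical toy policies, for illustration only.
def baseline(state):
    return "baseline-action"


def learner(state):
    return "learned-action"


early = arbitrate(None, 0, 100, baseline, learner)    # baseline always acts at step 0
late = arbitrate(None, 100, 100, baseline, learner)   # learner always acts at the end
```

Under this assumed schedule the baseline handles every step at the start of training and none at the end, which matches the summary's claim that the final policy runs without the baseline at inference time.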

5 · arXiv · cs.AI · 1mo ago

HITL-D: Human-In-The-Loop Diffusion for Shared Control in Robotic Manipulation

HITL-D is a shared control framework that combines diffusion-based policies with human teleoperation for robotic manipulation tasks. The system autonomously updates end-effector orientation conditioned on scene point clouds and Cartesian position, reducing the number of joystick axes operators must manage. A 12-participant user study found 40% faster task completion, 37% lower perceived workload, and improved subjective ratings versus traditional teleoperation. The work addresses a relatively unexplored intersection of diffusion policy methods and human-in-the-loop control.

5 · arXiv · cs.CL · 5d ago

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.

7 · OpenAI Blog · 1mo ago

Learning from Human Preferences: OpenAI and DeepMind Collaborate on Reward Learning from Comparisons

OpenAI, in collaboration with DeepMind's safety team, published a method for learning reward functions directly from human preference comparisons between pairs of agent behaviors, eliminating the need to hand-code goal functions. The algorithm infers human intent by asking evaluators which of two proposed behaviors is preferable, addressing risks from misspecified reward functions. This work is an early foundational contribution to what would become reinforcement learning from human feedback (RLHF). It targets both safety and alignment concerns around reward hacking and proxy gaming.
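One common way to turn pairwise comparisons into a reward-learning signal is a Bradley-Terry style model over the summed predicted reward of each behavior. The sketch below assumes that formulation. The summary does not state the exact loss, and the function names are illustrative rather than taken from the source.

```python
import math


def preference_probability(return_a: float, return_b: float) -> float:
    """P(evaluator prefers A over B), as a logistic function of the
    difference in summed predicted rewards (assumed Bradley-Terry model)."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def preference_loss(return_a: float, return_b: float, label: float) -> float:
    """Cross-entropy between the predicted preference and the human label.

    label = 1.0 if the evaluator preferred A, 0.0 if they preferred B.
    """
    eps = 1e-12  # guards against log(0)
    p = preference_probability(return_a, return_b)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# If the reward model already scores A higher, agreeing with a human who
# preferred A costs less than disagreeing with one who preferred B.
agree = preference_loss(2.0, 0.0, 1.0)
disagree = preference_loss(2.0, 0.0, 0.0)
```

Minimizing this loss over many labeled pairs pushes the reward model to score preferred behaviors higher, which is how it avoids a hand-coded goal function.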

6 · OpenAI Blog · 1mo ago

Improving Model Safety Behavior with Rule-Based Rewards

OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. This represents a practical contribution to alignment methodology from a Tier 1 lab.