6 · arXiv cs.LG (Machine Learning) · 15d ago

HANDOFF: Unified humanoid whole-body controller distilled from complementary specialist teachers

HANDOFF is a single whole-body controller for humanoid robots that exposes a compact, explicit command-space interface between task planning and motor control. It is trained by multi-teacher KL distillation into a mixture-of-experts student from three specialist teachers: whole-body motion tracking, locomotion, and fall recovery. Evaluated on the Unitree G1, it matches state-of-the-art velocity tracking and executes natural-language-driven tasks through a VLM-based agentic planner without task-specific fine-tuning. The work shows a practical path to deploying language-driven agentic planners on physical humanoid hardware.
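To make the training recipe concrete, here is a minimal sketch of multi-teacher KL distillation into a mixture-of-experts student. It is not the authors' implementation. The summary does not say how actions are parameterized or how samples are assigned to teachers, so the sketch assumes diagonal-Gaussian action distributions, a softmax gate that blends expert means, and one responsible teacher per sample. All function names and dictionary keys are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_student(gate_logits, expert_means, log_std):
    """Blend expert action means with gate weights (assumed student form)."""
    weights = softmax(gate_logits)
    dim = len(expert_means[0])
    mean = [sum(w * mu[d] for w, mu in zip(weights, expert_means))
            for d in range(dim)]
    std = [math.exp(s) for s in log_std]  # shared log-std is an assumption
    return mean, std


def gaussian_kl(mu_t, std_t, mu_s, std_s):
    """KL(teacher || student) for diagonal Gaussians, summed over action dims."""
    return sum(
        math.log(ss / st) + (st ** 2 + (mt - ms) ** 2) / (2 * ss ** 2) - 0.5
        for mt, st, ms, ss in zip(mu_t, std_t, mu_s, std_s)
    )


def distillation_loss(batch):
    """Mean KL from each sample's responsible teacher to the student."""
    total = 0.0
    for sample in batch:
        mu_s, std_s = moe_student(
            sample["gate_logits"], sample["expert_means"], sample["log_std"]
        )
        total += gaussian_kl(
            sample["teacher_mean"], sample["teacher_std"], mu_s, std_s
        )
    return total / len(batch)
```

In this sketch the loss is zero when the blended student distribution equals the teacher's, and it grows as the student mean drifts from the teacher mean in any action dimension.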

Agent and Tool Ecosystem · Multimodal Progress · Mixture of Experts · Unitree G1 · HANDOFF

Related guides (3)

Mixture of Experts · Concept

Mixture of Experts: How AI Models Do More by Using Less

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

5 · arXiv · cs.AI · 1mo ago

HITL-D: Human-In-The-Loop Diffusion for Shared Control in Robotic Manipulation

HITL-D is a shared control framework that combines diffusion-based policies with human teleoperation for robotic manipulation tasks. The system autonomously updates end-effector orientation conditioned on scene point clouds and Cartesian position, reducing the number of joystick axes operators must manage. A 12-participant user study found 40% faster task completion, 37% lower perceived workload, and improved subjective ratings versus traditional teleoperation. The work addresses a relatively unexplored intersection of diffusion policy methods and human-in-the-loop control.

Agent and Tool Ecosystem · Alignment and RLHF · HITL-D · diffusion-based policy · shared control

7 · arXiv · cs.CL · 22d ago

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.

Frontier Model Releases · Evaluation and Benchmarking · Qwen-VLA · DOMINO · R2R

5 · arXiv · cs.LG · 8d ago

Mana framework achieves zero-shot sim-to-real transfer for dexterous articulated tool manipulation

Researchers introduce Mana (Manipulation Animator), a sim-to-real framework that reframes dexterous robotic manipulation as an animation problem using a coarse-to-fine pipeline of procedurally generated grasp keyframes, motion planning, and reinforcement learning. The system requires minimal human input (under one minute per tool) and achieves zero-shot sim-to-real transfer across four articulated tools with varying joint types and scales. The work addresses a longstanding gap in dexterous robotics: articulated tool use, which requires coordinating internal degrees of freedom and contact-rich interactions, has been underexplored relative to rigid object manipulation.

Agent and Tool Ecosystem · Mana · Mana: Dexterous Manipulation of Articulated Tools

4 · arXiv · cs.LG · 47h ago

UNIEGO: Hierarchical multi-teacher distillation for unified egocentric video representation

Researchers introduce UNIEGO, an egocentric video encoder trained via a hierarchical multi-teacher distillation framework using nine teachers spanning ego-exo viewpoints, RGB/depth/skeleton modalities, and four foundation models. A key contribution is the interposition of Proxy models that translate heterogeneous teacher knowledge into a homogeneous space, followed by Selective Proxy Distillation (SPD) which adaptively selects reliable supervision signals per training sample. UNIEGO achieves state-of-the-art results on action recognition, video retrieval, and action segmentation across three ego-exo benchmarks. The work addresses a practical deployment constraint: the unified model runs from egocentric video alone despite being trained with multi-modal, multi-viewpoint supervision.

Evaluation and Benchmarking · Multimodal Progress · Selective Proxy Distillation · UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning · UNIEGO

6 · arXiv · cs.LG · 22d ago

DynaFLIP: Dynamics-Aware Multimodal Pre-Training for Robot Manipulation Perception

DynaFLIP is a pre-training framework that injects motion understanding into visual encoders for robot manipulation by constructing image-language-3D flow triplets from human and robot videos. The method encourages tri-modal alignment via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. The resulting dynamics-aware visual backbone consistently outperforms baselines across diverse downstream policies including VLAs, with gains up to +22.5% in out-of-distribution scenarios. The work argues that robot generalization requires encoding how the world changes under action, not just static scene content.
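As a rough illustration of the simplex-volume idea, the sketch below computes the area of the triangle whose vertices are three unit-normalized embeddings (image, language, 3D flow). This is an assumption about which simplex is meant, not the paper's loss: the summary does not define the volume, and the cosine regularization and contrastive terms are omitted. The function names are hypothetical.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangle_area(image_emb, text_emb, flow_emb):
    """Area of the 2-simplex spanned by three unit embeddings.

    Uses the Gram determinant of the two edge vectors that leave the
    image embedding; the area shrinks to zero as the three modalities align.
    """
    a, b, c = (normalize(v) for v in (image_emb, text_emb, flow_emb))
    u = [y - x for x, y in zip(a, b)]  # edge: image -> text
    w = [y - x for x, y in zip(a, c)]  # edge: image -> flow
    gram_det = dot(u, u) * dot(w, w) - dot(u, w) ** 2
    return 0.5 * math.sqrt(max(gram_det, 0.0))  # clamp rounding error
```

Minimizing this quantity over a batch would pull the three embeddings of a triplet toward one point on the sphere, which is one way to read "tri-modal alignment".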

Frontier Model Releases · Agent and Tool Ecosystem · Vision-Language-Action models · simplex-volume minimization · DynaFLIP

6 · OpenAI Blog · 1mo ago

Learning Dexterity: OpenAI Trains Robot Hand for Physical Object Manipulation

OpenAI announced that it had trained a human-like robot hand to manipulate physical objects with what it describes as unprecedented dexterity. The system uses reinforcement learning to develop fine motor control in a dexterous robotic hand. This work is an early milestone in OpenAI's robotics research program, predating its later work on solving a Rubik's Cube with a robot hand.

Agent and Tool Ecosystem · OpenAI Dexterous Hand · Reinforcement Learning · OpenAI

5 · arXiv · cs.AI · 1mo ago

DexHoldem: A Real-World Benchmark for Dexterous Embodied Agents Using Texas Hold'em Manipulation

DexHoldem is a new system-level benchmark for evaluating dexterous embodied agents on a ShadowHand robot performing Texas Hold'em card manipulation tasks. It provides 1,470 teleoperated demonstrations across 14 manipulation primitives, a physical policy benchmark, and an agentic perception benchmark for structured game-state recovery. Top performers include π₀.₅ at 61.2% task completion and Claude Opus 4.7 at 34.3% strict perception accuracy, with GPT 5.5 achieving 66.8% field-wise accuracy. The benchmark exposes gaps between isolated visual sub-capabilities and full closed-loop embodied decision-making.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · π₀.₅ · Physical Intelligence

6 · arXiv · cs.AI · 17d ago

Humanoid-GPT: GPT-style Transformer trained on 2B-frame motion corpus for zero-shot humanoid control

Researchers introduce Humanoid-GPT, a causal Transformer pre-trained on a 2-billion-frame retargeted motion corpus that unifies major mocap datasets with large-scale in-house recordings for whole-body humanoid control. The model achieves zero-shot generalization to unseen motions and control tasks, overcoming the agility-generalization trade-off seen in prior MLP-based trackers. Scaling analyses demonstrate a new performance frontier for dynamic motion tracking without task-specific fine-tuning.

Frontier Model Releases · Humanoid-GPT · Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking