HERMES

Human-to-Robot Embodied Learning From Multi-Source
Motion Data for Mobile Dexterous Manipulation


Zhecheng Yuan1,2*    Tianming Wei1,2*    Langzhe Gu1,2    Pu Hua1,2
Tianhai Liang1,2    Yuanpei Chen3    Huazhe Xu1,2
1Tsinghua University     2Shanghai Qi Zhi Institute     3Peking University
*Equal Contribution

Mobile Bimanual Dexterous Manipulation

Scan Items
Clean Plate
Tidy Up
Pour Tea
Clean Table

Learning From Human Video

Clean Plate
Pour Tea

Learning From Mocap Data

Putoff Burner
Arrange Flowers

Spatial Generalization

Although we collect only a single human motion trajectory, we edit it to enable randomization of its placement in space. Combined with reinforcement learning for motion refinement and an efficient sim-to-real approach, HERMES achieves spatial generalization from just a single demonstration.
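One way to picture the trajectory editing step is to apply a random rigid transform to every waypoint of the single demonstration, so the relative hand-object motion is preserved while its placement in the workspace varies. The sketch below uses a planar (x-y plus yaw) transform; the function name, jitter ranges, and placeholder trajectory are illustrative assumptions, not the exact editing procedure used by HERMES.

```python
import numpy as np

def randomize_trajectory(traj_xyz, rng, xy_range=0.1, yaw_range=np.pi / 6):
    """Apply one random planar rigid transform to a whole demo trajectory.

    traj_xyz: (T, 3) array of waypoint positions. The same rotation and
    translation are applied to every waypoint, so the motion's internal
    geometry is unchanged while its location in space is randomized.
    """
    yaw = rng.uniform(-yaw_range, yaw_range)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    t = np.array([rng.uniform(-xy_range, xy_range),
                  rng.uniform(-xy_range, xy_range),
                  0.0])
    return traj_xyz @ R.T + t

rng = np.random.default_rng(0)
demo = np.zeros((50, 3))                 # placeholder single demonstration
demo[:, 0] = np.linspace(0.0, 0.3, 50)   # a straight reach along x
edited = randomize_trajectory(demo, rng)
```

Because the transform is rigid, distances between waypoints are preserved; only the start pose and heading of the motion change across samples.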


Instance Generalization

Generalizing across instances that differ in shape and appearance poses a greater challenge. As shown below, benefiting from various randomization strategies and the inherent properties of depth images, HERMES also generalizes to bottles and other objects with diverse shapes and appearances.


Closed-loop PnP

Our closed-loop PnP approach enables precise localization for the robot. As demonstrated in the video, regardless of its initial position, the robot continuously refines its pose toward the target through closed-loop PnP, ultimately arriving accurately at the desired location.
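The closed-loop refinement can be sketched as a repeated estimate-then-correct loop. In the real system the current pose would come from a fresh PnP solve at every step; in this minimal sketch the pose is read directly, and the gain, tolerance, and iteration cap are illustrative assumptions rather than deployed values.

```python
import numpy as np

def refine_to_target(start_pose, target_pose, gain=0.4, tol=1e-3, max_iters=200):
    """Closed-loop refinement sketch: re-estimate the pose each step and
    command a proportional correction toward the target, as the base would
    after each PnP solve. Stops once the residual error is within `tol`."""
    pose = np.asarray(start_pose, dtype=float).copy()
    target = np.asarray(target_pose, dtype=float)
    for _ in range(max_iters):
        error = target - pose            # error from the latest pose estimate
        if np.linalg.norm(error) < tol:
            break
        pose += gain * error             # proportional step toward the target
    return pose

final = refine_to_target([0.5, -0.3, 0.8], [0.0, 0.0, 0.0])
```

The key property is that each correction uses the most recent estimate, so the loop converges to the target regardless of the initial pose, mirroring the behavior shown in the video.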


Long-horizon Navigation

Enabled by the navigation foundation model, HERMES achieves long-horizon indoor and outdoor navigation without relying on LiDAR or fine-grained mapping. Coupled with our closed-loop PnP approach, it provides precise localization across a variety of complex environments.

Outdoor
Indoor

Simulation Results

scanbottle
cleanplate
cleantable
flowervase
pourteapot
putoffburner

Method

The HERMES pipeline comprises four stages:
Stage 1: HERMES supports multi-source human motion data. We also provide a dedicated data acquisition pipeline for extracting human motion and object trajectories.
Stage 2: After obtaining the one-shot human motion, we retarget it to the robot to obtain the corresponding kinematic motion. However, retargeting alone does not enable the robot to accomplish the task, as it does not model the interactions and dynamics between the robot and the objects. We therefore employ reinforcement learning (RL) to capture these intrinsic properties and adapt the robot's behavior so that it can complete the task. Subsequently, we distill the state-based expert policy into a vision-based student policy for sim-to-real transfer.
Stage 3: To allow the robot to operate autonomously across diverse scenarios, HERMES is equipped with navigation and localization capabilities. We adopt ViNT for long-horizon navigation and design a closed-loop PnP module to achieve precise localization, thereby enabling seamless integration with downstream manipulation tasks.
Stage 4: The distilled vision-based student policy is deployed in the real world to accomplish diverse tasks.
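The distillation step in Stage 2 amounts to supervising a vision-based student with the state-based teacher's actions. The sketch below reduces this to plain regression: the linear "teacher", the stand-in visual features, and all dimensions are illustrative assumptions, not the deployed networks.

```python
import numpy as np

# Sketch of teacher-to-student distillation: the state-based expert labels
# simulated rollouts with actions, and the student is fit to predict those
# actions from (stand-in) visual features instead of privileged state.
rng = np.random.default_rng(0)
T_teacher = rng.normal(size=(4, 8))            # teacher: state -> action
states = rng.normal(size=(256, 8))             # privileged sim states
features = states @ rng.normal(size=(8, 16))   # stand-in for visual features
actions = states @ T_teacher.T                 # teacher action labels

# Student: regress actions from visual features via least squares.
W, *_ = np.linalg.lstsq(features, actions, rcond=None)
student_actions = features @ W
```

In HERMES the student is a vision-based policy trained on rendered observations; this toy version only illustrates the supervisory relationship between the two policies.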


Failure Cases



Q & A

Q: Why do temporary pauses occur during the real-time PnP phase?
A: Due to the limited control precision of the mobile base, we reduced the integral gain \( K_i \) in PID control to minimize localization error. During the final refinement stage, the pose deviation becomes minimal, causing the integral term to accumulate slowly. As a result, the control signal may fall below the minimum actuation threshold for base movement, leading to temporary pauses in the base's motion.
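The stalling mechanism described above can be reproduced in a toy controller: with a small \( K_i \) and a tiny residual error, the command falls below the actuator's minimum threshold and the base stops until the integral term re-accumulates. All gains, thresholds, and the one-dimensional plant below are illustrative, not the deployed values.

```python
def pid_step(error, integral, kp=0.5, ki=0.2, dt=0.05, deadband=0.03):
    """PI step with an actuation deadband: commands below `deadband`
    produce no base motion, modeling the minimum actuation threshold."""
    integral += error * dt
    u = kp * error + ki * integral
    applied = u if abs(u) >= deadband else 0.0   # below threshold: base stalls
    return applied, integral

error, integral, paused_steps = 0.04, 0.0, 0
for _ in range(100):
    u, integral = pid_step(error, integral)
    if u == 0.0:
        paused_steps += 1                # base is temporarily paused
    error -= 0.1 * u                     # toy plant: command reduces the error
```

With a residual error of 0.04, the proportional term alone sits under the deadband, so the base pauses until the slowly growing integral term pushes the command over the threshold, exactly the temporary pauses visible in the video.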

Q: Why does the robot sway left and right during navigation?
A: The actions output by ViNT tend to fluctuate laterally, and we use them directly without any smoothing or post-processing. We found that the raw action outputs yield better performance when handling sharp turns, intersections, and obstacle avoidance.

Q: How does HERMES determine the final target position?
A: We first employ the depth camera to capture the target pose. Subsequently, following the same navigation settings as ViNT, we infer actions from the RGB images in the recorded trajectory. Once navigation completes, the robot's pose is refined to the recorded target pose via closed-loop PnP.

Q: Does HERMES require camera calibration?
A: We do not calibrate the camera; we only ensure that the scene content captured in real-world images roughly matches that in simulation images. In addition, we apply camera view randomization during training to enhance robustness to view changes.
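Camera-view randomization can be sketched as jittering the camera extrinsics around a nominal mount pose at each training episode, so the policy tolerates the uncalibrated real camera placement. The nominal pose and jitter magnitudes below are illustrative assumptions.

```python
import numpy as np

NOMINAL_POS = np.array([0.0, -0.5, 0.6])   # meters, in robot base frame (assumed)
NOMINAL_YAW = 0.0                           # radians (assumed)

def sample_camera_pose(rng, pos_jitter=0.03, yaw_jitter=np.deg2rad(5)):
    """Sample a perturbed camera pose around the nominal mount.

    Position is jittered uniformly per axis and yaw uniformly within
    +/- yaw_jitter, emulating imprecise real-world camera placement.
    """
    pos = NOMINAL_POS + rng.uniform(-pos_jitter, pos_jitter, size=3)
    yaw = NOMINAL_YAW + rng.uniform(-yaw_jitter, yaw_jitter)
    return pos, yaw

rng = np.random.default_rng(42)
poses = [sample_camera_pose(rng) for _ in range(100)]
```

Training under such perturbed views removes the need for exact calibration, since the policy never assumes a single fixed viewpoint in the first place.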


Website modified from TWIST.