Although we collect only a single human motion demonstration, we edit this trajectory to enable spatial randomization. Combined with reinforcement learning for motion refinement and an efficient sim-to-real approach, HERMES achieves spatial generalization from just a single demonstration.
Generalizing across instances that differ in shape and appearance poses a greater challenge. As shown below, benefiting from various randomization strategies and the inherent properties of depth images, HERMES also generalizes to bottles and other objects with diverse shapes and appearances.
Our closed-loop PnP approach enables precise localization for the robot. As demonstrated in the video, regardless of its initial position, the robot continuously refines its pose through closed-loop PnP, aligning with the target pose and ultimately arriving accurately at the desired location.
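The closed-loop refinement described above can be sketched as a simple iterative correction loop. The function names, gains, and pose representation below are illustrative, not HERMES code; in the real system the pose error would come from a PnP solver (e.g. OpenCV's solvePnP against the recorded target view), which is replaced here by a placeholder.

```python
import numpy as np

def estimate_pose_error(current_pose, target_pose):
    """Placeholder for the PnP step: in practice the error would be
    estimated by solving PnP against the recorded target-view features."""
    return target_pose - current_pose

def closed_loop_refine(current_pose, target_pose, kp=0.5, tol=1e-3, max_iters=100):
    """Repeatedly estimate the pose error and command a proportional
    correction until the pose converges on the target."""
    pose = current_pose.astype(float).copy()
    for _ in range(max_iters):
        error = estimate_pose_error(pose, target_pose)
        if np.linalg.norm(error) < tol:
            break
        pose += kp * error  # apply a fraction of the correction each cycle
    return pose

# Regardless of the initial pose, the loop converges to the target.
start = np.array([1.0, -2.0, 0.3])   # x, y, yaw (illustrative)
target = np.array([0.0, 0.0, 0.0])
final = closed_loop_refine(start, target)
```

Because each cycle re-estimates the error from a fresh observation, the loop compensates for base drift and control imprecision rather than relying on a single open-loop measurement.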
Enabled by the navigation foundation model, HERMES achieves long-horizon indoor and outdoor navigation without relying on LiDAR or fine-grained mapping. Coupled with our closed-loop PnP approach, it provides precise localization across a variety of complex environments.
The HERMES pipeline comprises four stages:
Stage 1: HERMES supports multi-source human motion data. We also provide a dedicated data acquisition pipeline for extracting human motion and object trajectories.
Stage 2: After obtaining the one-shot motion, we retarget it to the corresponding kinematic robot motion. However, retargeting alone does not enable the robot to accomplish the task, as it does not model the interactions and dynamics between the robot and the objects. We therefore employ reinforcement learning (RL) to capture these intrinsic properties and relationships, allowing the robot to adapt its behavior and complete the task. Finally, we distill the state-based expert policy into a vision-based student policy for sim-to-real transfer.
Stage 3: To allow the robot to operate autonomously across diverse scenarios, HERMES is equipped with navigation and localization capabilities. We adopt ViNT for long-horizon navigation and design a closed-loop PnP module to achieve precise localization, thereby enabling seamless integration with downstream manipulation tasks.
Stage 4: The distilled vision-based student policy is deployed in the real world to accomplish diverse tasks.
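The distillation step in Stage 2 amounts to behavior cloning: the student is regressed onto the teacher's actions while seeing only its own observations. The tiny linear teacher/student models, the synthetic "rendering" from state to observation, and all sizes below are purely illustrative assumptions; HERMES distills a state-based RL expert into a depth-image student policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Privileged teacher: maps full simulation state to expert actions.
W_teacher = rng.normal(size=(4, 8))

# The student sees only observations. Here a fixed linear map stands in
# for rendering states to (flattened) depth observations, for the demo.
render = rng.normal(size=(16, 8))

# Distillation: regress the student onto teacher actions (behavior cloning).
W_student = np.zeros((4, 16))
lr = 0.01
losses = []
for step in range(2000):
    states = rng.normal(size=(8, 32))   # batch of 32 simulation states
    obs = render @ states               # what the student observes
    targets = W_teacher @ states        # teacher supervision (privileged)
    err = W_student @ obs - targets
    losses.append(float(np.mean(err ** 2)))
    W_student -= lr * (err @ obs.T) / obs.shape[1]  # MSE gradient step
```

The key property is that supervision comes from privileged state while the student's input is the deployable observation, so the learned policy transfers to the real robot where privileged state is unavailable.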
Q: Why do temporary pauses occur during the real-time PnP phase?
A: Due to the limited control precision of the mobile base, we reduce the integral gain \( K_i \) in the PID controller to minimize localization error. During the final refinement stage, the pose deviation becomes very small, so the integral term accumulates slowly. The control signal can then fall below the minimum actuation threshold of the base, leading to temporary pauses in its motion.
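This deadband behavior can be reproduced with a minimal one-dimensional PID sketch. The gains, threshold, and dynamics below are illustrative values, not the ones used on the robot.

```python
# Minimal 1-D PID position controller with an actuation deadband: near the
# target the command drops below the base's minimum actuation threshold,
# and with a small Ki the integral term re-accumulates slowly, producing
# visible pauses before motion resumes.

def pid_step(error, integral, prev_error, kp, ki, kd, dt):
    integral += error * dt
    derivative = (error - prev_error) / dt
    command = kp * error + ki * integral + kd * derivative
    return command, integral

def simulate(target=1.0, kp=0.8, ki=0.02, kd=0.0,
             dt=0.05, threshold=0.05, steps=400):
    """Return (positions, paused_flags); paused means |command| < threshold."""
    pos, integral, prev_error = 0.0, 0.0, target
    positions, paused = [], []
    for _ in range(steps):
        error = target - pos
        command, integral = pid_step(error, integral, prev_error,
                                     kp, ki, kd, dt)
        prev_error = error
        if abs(command) < threshold:
            command = 0.0          # below actuation threshold: base pauses
            paused.append(True)
        else:
            paused.append(False)
        pos += command * dt        # treat the command as a velocity
        positions.append(pos)
    return positions, paused

positions, paused = simulate()
```

Far from the target the command easily clears the threshold; close to it, motion stalls until the slowly growing integral term pushes the command back over the deadband.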
Q: Why does the robot sway left and right during navigation?
A: The actions output by ViNT tend to fluctuate laterally, and we use them directly without any smoothing. We found that the raw action outputs yield better performance when handling sharp turns, intersections, and obstacle avoidance, so we did not apply additional processing.
Q: How does HERMES determine the final target position?
A: We first use the depth camera to capture the target pose. Then, following the same navigation setup as ViNT, we infer actions from the RGB images in the recorded trajectory. Once navigation is complete, the robot's pose is refined to the recorded target pose via closed-loop PnP.
Q: Does HERMES require camera calibration?
A: We do not calibrate the camera; we only ensure that the scene content captured in real-world images roughly matches that in simulation images. In addition, we apply camera view randomization during training to enhance robustness to view changes.
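Camera view randomization of the kind mentioned above can be sketched as sampling a small perturbation of the camera extrinsics for each training episode. The perturbation ranges and the yaw-only rotation below are illustrative assumptions, not the values used in HERMES.

```python
import numpy as np

def yaw_matrix(angle):
    """Rotation about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def randomize_camera(base_position, base_yaw, rng,
                     pos_range=0.05, yaw_range=np.deg2rad(5.0)):
    """Sample a perturbed camera pose around the nominal mount pose.
    Returns (position, rotation_matrix)."""
    dpos = rng.uniform(-pos_range, pos_range, size=3)
    dyaw = rng.uniform(-yaw_range, yaw_range)
    return base_position + dpos, yaw_matrix(base_yaw + dyaw)

rng = np.random.default_rng(0)
base_pos = np.array([0.3, 0.0, 0.4])  # illustrative nominal camera position
pos, rot = randomize_camera(base_pos, 0.0, rng)
```

Training the student policy under such perturbed views removes the need for exact calibration: the policy only requires the real camera to fall somewhere inside the randomized distribution.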
Website modified from TWIST.