Maniwhere

Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

Abstract. Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across combinations of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with a Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen visual generalization. To validate the effectiveness of Maniwhere, we meticulously design 8 tasks spanning articulated-object, bi-manual, and dexterous hand manipulation, demonstrating Maniwhere's strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods.


Method

Maniwhere employs a multi-view representation learning objective to implicitly capture shared semantic information and correspondences across different viewpoints. In addition, we fuse an STN module into the visual encoder to further enhance the robot's robustness to view changes. Subsequently, to achieve sim2real transfer, we utilize a curriculum-based domain randomization approach to stabilize RL training and prevent divergence. The trained policy can then be transferred to real-world environments in a zero-shot manner.
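As a rough illustration of the first two ideas, the sketch below shows one way an STN module could be fused into a CNN visual encoder and paired with a multi-view alignment objective. This is a minimal sketch, not Maniwhere's released code: the module sizes, the InfoNCE form of the loss, and the temperature are all assumptions.

```python
# Hedged sketch: STN-fused encoder + multi-view alignment loss (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Spatial Transformer Network: predicts an affine warp of the input image."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 16, 7, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, 6),
        )
        # Initialize the localization head to the identity transform for stability.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

class Encoder(nn.Module):
    """CNN encoder with the STN applied to the raw image first (assumed placement)."""
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.stn = STN(in_channels)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim),
        )

    def forward(self, x):
        return self.conv(self.stn(x))

def multiview_alignment_loss(z_a, z_b):
    """Pull features of the same state rendered from two camera views together,
    InfoNCE-style over the batch; the paper's exact objective may differ."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / 0.1  # temperature 0.1 is an assumption
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)
```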
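The curriculum-based randomization can likewise be sketched as a schedule that keeps randomization off during an initial warmup and then linearly ramps its strength, so that early RL updates are not destabilized. The warmup length, ramp shape, and randomization ranges below are illustrative assumptions, not Maniwhere's hyperparameters.

```python
# Hedged sketch: curriculum-scheduled domain randomization (assumed schedule).
import random

class RandomizationCurriculum:
    def __init__(self, warmup_steps=100_000, total_steps=1_000_000):
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def strength(self, step: int) -> float:
        """Return 0 during warmup, then ramp linearly to full strength."""
        if step < self.warmup_steps:
            return 0.0
        frac = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        return min(1.0, frac)

    def sample_params(self, step: int) -> dict:
        # Randomization ranges scale with the current curriculum strength.
        s = self.strength(step)
        return {
            "light_intensity_jitter": random.uniform(-0.5, 0.5) * s,
            "camera_yaw_deg": random.uniform(-30.0, 30.0) * s,
            "texture_swap_prob": 0.8 * s,
        }

curriculum = RandomizationCurriculum()
print(curriculum.sample_params(step=500_000))
```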


Task

For evaluation, we use 3 types of robotic arms and 2 types of dexterous hands to design a total of 8 diverse tasks.

Tasks in Simulation

LiftCube
LiftCube Dex
Open Drawer
PickPlace Dex
CloseLaptop
Button Dex
PickPlace
HandOver

Real-world Setup

Our real-world experiments encompass 3 types of robotic arms, 2 dexterous hands, and various tasks, including articulated-object and bi-manual manipulation.

Results

In the following tasks, we directly transfer the simulation-trained policies to the real world in a zero-shot manner. The videos are recorded from the same view as the policy's input images.

Task 1: CloseLaptop


Task 2: PickPlace Dex


Task 3: HandOver


Task 4: PickPlace


Task 5: Open Drawer


Position Perturbation

Maniwhere automatically tracks the positions of both the pick and place objects, rather than relying on a scripted policy.


Dynamic Camera Views

Maniwhere is also capable of performing well under dynamically changing viewpoints.


Instance Generalization

Thanks to the general grasping capability of the dexterous hand, we find that Maniwhere is not limited to a single object when executing lifting behaviors and can generalize across instances with various shapes and sizes.

Results
Cube
Dice (small)
Apple
Pitaya
Dice (big)
Plush Toy

Failure Case


This webpage template is borrowed from DP3.