Zhecheng Yuan*,
Sizhe Yang*,
Pu Hua,
Can Chang,
Visual Reinforcement Learning (Visual RL), which couples RL with high-dimensional visual observations, has consistently confronted the long-standing challenge of generalization. Despite the focus on algorithms aimed at resolving visual generalization problems, we argue that the devil is in the existing benchmarks, as they are restricted to isolated tasks and generalization categories, undermining a comprehensive evaluation of agents' visual generalization capabilities. To bridge this gap, we introduce RL-ViGen: a novel Reinforcement Learning Benchmark for Visual Generalization, which contains diverse tasks and a wide spectrum of generalization classes, thereby facilitating the derivation of more reliable conclusions. Furthermore, RL-ViGen incorporates the latest generalization-oriented visual RL algorithms into a unified framework, under which our experiments indicate that no single existing algorithm prevails universally across tasks. Our aspiration is that RL-ViGen will serve as a catalyst in this field, laying a foundation for the future creation of universal visual generalization RL agents suitable for real-world scenarios. Our code and implemented algorithms are available at https://github.com/gemcollector/RL-ViGen.
RL-ViGen consists of 5 distinct task categories, spanning locomotion, table-top manipulation, autonomous driving, indoor navigation, and dexterous hand manipulation. In contrast to prior benchmarks, RL-ViGen employs a diverse array of task types for evaluating the agent's generalization performance.
CARLA serves as a realistic, high-fidelity simulator for autonomous driving. RL-ViGen provides an enhanced range of dynamic weather conditions and more complex road conditions across different scene structures. Furthermore, flexible camera angle adjustments are also integrated.
Adroit is a sophisticated environment tailored for dexterous hand manipulation tasks. In RL-ViGen, we have enriched the Adroit environment by incorporating diverse visual appearances, camera perspectives, hand types, lighting changes, and object shapes.
Robosuite is a modular simulation platform designed to support robot learning. RL-ViGen incorporates dynamic backgrounds and adaptive lighting conditions, bringing the simulation closer to the real world.
As an efficient and photorealistic 3D simulator, Habitat supports numerous visual navigation tasks. RL-ViGen adds further scenarios with different visual and lighting settings.
DeepMind Control is a popular continuous-control visual RL benchmark. RL-ViGen introduces objects and corresponding tasks from real-world locomotion and manipulation applications, such as Unitree quadrupedal robots and the Franka arm.
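To ground the descriptions above, the snippet below shows the standard pixel-based interaction loop on one of the underlying suites (DeepMind Control via dm_control). RL-ViGen's own wrappers, task names, and observation settings may differ, so treat this only as an illustration of the observation/action cycle that all five task categories share.

# Illustrative pixel-based interaction loop on DeepMind Control.
# RL-ViGen's wrappers and task IDs may differ; this only shows the generic cycle.
from dm_control import suite
import numpy as np

env = suite.load(domain_name="walker", task_name="walk")
time_step = env.reset()
while not time_step.last():
    # Visual RL agents consume rendered images rather than proprioceptive states.
    pixels = env.physics.render(height=84, width=84, camera_id=0)
    action = np.random.uniform(env.action_spec().minimum,
                               env.action_spec().maximum,
                               size=env.action_spec().shape)
    time_step = env.step(action)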
To better illustrate the difference between the novel scenarios and the original training environments, we show the training scenes of each environment in our paper. You can also choose any environment as your training scenario.
Training scenes: Door, Lift, TwoArm PegInHole (Robosuite); Door, Hammer, Pen (Adroit); Unitree Stand, Unitree Walk, Anymal Stand (DeepMind Control); Navigation (Habitat); Driving (CARLA).
Our benchmark offers a wide range of generalization categories, including visual appearances, camera views, variations in lighting conditions, scene structures, and cross-embodiment settings, thereby providing a thorough evaluation of algorithms' robustness and generalization abilities.
In RL-ViGen, different components within the environment can be recolored with a wide range of colors. Meanwhile, a dynamic video background is also introduced in the challenging setting.
Lighting changes are inevitable in the real world. Therefore, to enable agents to adapt to lighting variations, we provide interfaces for altering light intensity, light colors, and dynamic shadows.
In the real world, agents encounter camera configurations, angles, or positions that may deviate from those experienced during training. We offer access to set cameras at different angles and distances; in addition, the number of cameras can be adjusted accordingly.
Adapting learned skills and knowledge to different physical bodies or embodiments is essential for an agent to perform well across various platforms. Our benchmark also provides access to modify the embodiment of trained agents in terms of model type, size, and other physical properties.
The ability to understand and adapt to different spatial arrangements and organization patterns within various scenes is crucial for a generalizable agent. Our benchmark enables modifications to scene structure by adjusting maps and patterns or by introducing extra objects.
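To make these categories concrete, an evaluation scenario could be described with a configuration along the following lines. This is a minimal illustrative sketch: the keys and values are hypothetical placeholders and do not mirror RL-ViGen's actual configuration schema.

# Hypothetical evaluation configuration covering the five generalization
# categories; keys and values are illustrative placeholders only.
eval_config = {
    "visual_appearance": {
        "component_colors": "random",    # recolor scene components
        "video_background": True,        # dynamic video background (hard setting)
    },
    "lighting": {
        "intensity_scale": 1.5,          # brighten or dim the scene
        "light_color": (1.0, 0.8, 0.8),  # tinted light source
        "moving_shadows": True,
    },
    "camera_view": {
        "yaw_offset_deg": 15,            # rotate the viewpoint away from the training pose
        "distance_scale": 1.2,
        "num_cameras": 2,
    },
    "embodiment": {
        "robot_model": "franka",         # swap the trained body for another model
        "link_size_scale": 1.1,
    },
    "scene_structure": {
        "map": "alt_layout_01",          # different spatial arrangement
        "extra_objects": 3,              # clutter the scene with additional objects
    },
}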
In our benchmark, we assemble 7 leading visual RL algorithms under a unified training and evaluation framework and investigate the generalization ability of each approach in RL-ViGen.
All agents are trained in the same fixed training environment and evaluated on various unseen scenarios in a zero-shot manner.
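Concretely, zero-shot evaluation freezes the trained policy and rolls it out, with no further updates, in each unseen scenario. Below is a minimal sketch that assumes a generic policy.act(obs) interface, a hypothetical make_eval_env factory, and a gym-style step API; none of these names are RL-ViGen's actual entry points.

import numpy as np

def evaluate_zero_shot(policy, make_eval_env, scenario_names, episodes=10):
    """Roll out a frozen policy in unseen scenarios; no gradient updates are performed."""
    scores = {}
    for name in scenario_names:
        env = make_eval_env(name)             # unseen variation of the training task
        returns = []
        for _ in range(episodes):
            obs, done, ep_return = env.reset(), False, 0.0
            while not done:
                action = policy.act(obs)      # evaluation-mode (e.g. deterministic) action
                obs, reward, done, _ = env.step(action)
                ep_return += reward
            returns.append(ep_return)
        env.close()
        scores[name] = float(np.mean(returns))
    return scores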
Here, we list a few representative experiments to evaluate each algorithm in RL-ViGen. More experimental results and details can be found in our paper.
Each generalization algorithm possesses its own unique strengths. Notably, PIE-G demonstrates superior performance in realistic and complex scenarios, while SRM exhibits remarkable robustness under significant image frequency variations.
In terms of camera views, the exceptional performance of SGQN is largely due to its heavy reliance on saliency maps, which enhance the agent's awareness of object geometry and relative positioning.
For scene structures, the performance of all algorithms falls short of expectations, suggesting that current visual RL algorithms and generalization approaches are not adequately robust to changes in scene structure.
The overall performance of all algorithms is suboptimal; however, generalization-based methods, which incorporate more diverse information during training, exhibit a slight advantage in cross-embodiment generalization over methods primarily focused on sample efficiency.