
Teaching a Robot to Walk with AI - Introduction to Continuous Control Environments

25 Sep 2020, CPOL, 6 min read
In this article, we get set up with the Bullet physics simulator as a basis for reinforcement learning in continuous control environments.
Here we look at PyBullet's CartPole environment trained with DQN (including a Python script for training the agent), the same environment trained with PPO, and the continuous CartPole environment trained with PPO.


Hello and welcome to my second series of articles about practical deep reinforcement learning.

Last time we learned the basics of deep reinforcement learning by training an agent to play Atari Breakout. If you haven’t already read that series, it’s probably the best place to start.

All the environments we looked at before had discrete action spaces: At any timestep, the agent must pick an action to perform from a list, such as "do nothing", "move left", "move right", or "fire".

This time our focus will shift to environments with continuous action spaces: At each timestep, the agent must choose a collection of floating-point numbers for the action. In particular, we will be using the Humanoid environment in which we encourage a human figure to learn to walk.
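To make the distinction concrete, here is a minimal sketch in plain Python that needs no reinforcement learning library. The action names come from the example above; the three-element action vector and its bounds of -1 to 1 are illustrative assumptions, not the Humanoid environment's real action space.

```python
import random

# Discrete action space: the agent picks one entry from a fixed list.
DISCRETE_ACTIONS = ["do nothing", "move left", "move right", "fire"]

def sample_discrete_action(rng):
    """Return the index of one of the available actions."""
    return rng.randrange(len(DISCRETE_ACTIONS))

# Continuous action space: the agent outputs one float per dimension.
# The number of dimensions and the bounds here are hypothetical.
ACTION_DIM = 3
LOW, HIGH = -1.0, 1.0

def sample_continuous_action(rng):
    """Return a list of floats, one per action dimension, within bounds."""
    return [rng.uniform(LOW, HIGH) for _ in range(ACTION_DIM)]

rng = random.Random(0)
print(DISCRETE_ACTIONS[sample_discrete_action(rng)])
print(sample_continuous_action(rng))
```

A trained policy replaces the random sampling, but the shape of what it has to produce is the same: a single index in the first case, a vector of floats in the second.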

We will continue using Ray’s RLlib library for the reinforcement learning process. Continuous action spaces reduce the number of learning algorithms available to us, as you can see in RLlib’s feature compatibility matrix.


These articles assume you have some familiarity with Python 3 and installing components.

The continuous control environments we will be training are more challenging than those in the previous series. Even on a powerful machine with a good GPU and plenty of CPU cores, some of these experiments will take several days to train. Many companies will gladly rent Linux servers by the hour, and that is the approach I have taken when running these training sessions. It works particularly well together with tmux or screen, enabling you to disconnect and reconnect to the running sessions (and to run several experiments at the same time if you have access to a powerful enough machine).

A Simple Set-Up Script

For this series of articles, I ran these training sessions (and a lot more that didn’t make the cut!) on rented Linux servers running Ubuntu. I found it convenient to have all the requirements in a single setup script.

The main dependency that we didn’t have last time is PyBullet, a Python wrapper around the Bullet physics simulation engine.


# Get a clean host up to speed for running ML experiments

apt-get update
apt-get upgrade --quiet

apt-get install -y xvfb x11-utils
apt-get install -y git
apt-get install -y ffmpeg

pip install gym==0.17.2
pip install ray[rllib]==0.8.6
pip install PyBullet==2.8.6
pip install pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.*
pip install pandas==1.1.0

# should be pip install tensorflow_probability, but had to do this:
pip install tfp-nightly==0.11.0.dev20200708
# because of this issue:

So my routine for a fresh server was to apt-get install wget, then wget the script above from the public server I had hosted it on, and then run it to install all my dependencies.
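After running the script on a fresh server, it can be worth confirming that the pinned versions really were installed. The following is a small sketch using only the standard library (Python 3.8 or later); the helper name check_pins is my own, and the pins are copied from the setup script above.

```python
from importlib import metadata

# Pins copied from the setup script above.
PINNED = {
    "gym": "0.17.2",
    "ray": "0.8.6",
    "pybullet": "2.8.6",
    "pandas": "1.1.0",
}

def check_pins(pinned):
    """Return {package: problem} for every missing or mismatched pin."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != wanted:
            problems[name] = f"installed {installed}, wanted {wanted}"
    return problems

if __name__ == "__main__":
    for name, problem in check_pins(PINNED).items():
        print(f"{name}: {problem}")
```

If the script prints nothing, every pinned package is present at the expected version.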

For downloading the results from a remote server, you can use scp (or PuTTY’s pscp for downloading to Windows).

CartPole revisited

Bullet CartPole Environment With DQN

About the Environment

If you remember the CartPole environment from the last series of articles, I described it as the "Hello World" of reinforcement learning. It had a wheeled cart with a pole balanced on it, and the agent could move the cart left or right, getting rewarded for keeping the pole balanced for as long as possible.

The CartPole is one of the simpler reinforcement learning environments and still has a discrete action space. PyBullet includes its own version (instead of the one from OpenAI’s Gym, which we used last time), which you can try running to check that PyBullet is installed correctly.

Let’s take a look at the specific environment we will be using.

>>> import gym
>>> import pybullet_envs
>>> env = gym.make('CartPoleBulletEnv-v1')
>>> env.observation_space
Box(4,)
>>> env.action_space
Discrete(2)

So, the observation space is a four-dimensional continuous space, but the action space has just two discrete options (left or right). This is what we are used to.

We’ll start by using the Deep Q Network (DQN) algorithm, which we encountered in the previous series.


Here is a Python script for training an agent in PyBullet’s CartPole environment. The general structure is the same as the structure we used for the experiments in the previous series: Set up a virtual display so that we can run headlessly on a server, restart Ray cleanly, register the environment and then set it to train until it has reached the target reward.

Note that it is the import pybullet_envs that registers the PyBullet environment names with Gym. I found that I needed to import it inside the make_env function instead of at the top of the file, presumably because this function gets called in isolation on Ray’s remote workers.

import pyvirtualdisplay
_display = pyvirtualdisplay.Display(visible=False, size=(1400, 900))
_ = _display.start()

import gym
import ray
from ray import tune
from ray.rllib.agents.dqn import DQNTrainer
from ray.tune.registry import register_env

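# Restart Ray cleanly in case a previous session is still running
ray.shutdown()
ray.init(include_webui=False, ignore_reinit_error=True)

ENV = 'CartPoleBulletEnv-v1'
def make_env(env_config):
    # Importing pybullet_envs registers the Bullet environments with Gym;
    # it is done here so that it also happens on Ray's remote workers
    import pybullet_envs
    return gym.make(ENV)
register_env(ENV, make_env)

TARGET_REWARD = 190
TRAINER = DQNTrainer

tune.run(
    TRAINER,
    stop={"episode_reward_mean": TARGET_REWARD},
    config={
        "env": ENV,
        "num_workers": 7,
        "num_gpus": 1,
        "monitor": True,
        "evaluation_num_episodes": 50,
    },
)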

You can tweak the values for num_gpus and num_workers based on the hardware you are using. The value of num_workers can be up to one less than the number of CPU cores you have available.
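If you move between machines of different sizes, you could derive the worker count rather than hard-coding it. This is only a sketch: the helper name suggest_num_workers is hypothetical, and it simply applies the "one less than the number of cores" rule described above.

```python
import os

def suggest_num_workers(cpu_count=None):
    """Apply the rule of thumb: one worker per CPU core, minus one."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(cpu_count - 1, 0)

# Assuming an 8-core machine, this gives the value used in the config above.
print(suggest_num_workers(8))  # 7
```

The result can then be passed as the "num_workers" entry of the config dictionary.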

By setting monitor to True and evaluation_num_episodes to 50, we told RLlib to save progress statistics and mp4 videos of the training sessions every 50 episodes (the default is every 10). These will appear in the ray_results directory.


For my run, learning progress (episode_reward_mean from progress.csv) looked like this, taking 176 seconds to reach the target reward of 190:

Image 1
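To work out the time-to-target figure for your own run, you can scan progress.csv for the first row that reaches the target reward. The sketch below uses only the standard library and relies on the episode_reward_mean and time_total_s columns; the function name seconds_to_reach is my own, and the sample rows are made up for illustration rather than taken from the run described here.

```python
import csv
import io

def seconds_to_reach(progress_csv, target_reward):
    """Return time_total_s of the first row whose episode_reward_mean
    reaches target_reward, or None if it never does."""
    for row in csv.DictReader(progress_csv):
        reward = row.get("episode_reward_mean", "")
        if reward in ("", "nan"):
            continue  # no completed episodes were reported in this iteration
        if float(reward) >= target_reward:
            return float(row["time_total_s"])
    return None

# Made-up rows for illustration only; these are not results from the article.
sample = io.StringIO(
    "episode_reward_mean,time_total_s\n"
    "22.5,10.0\n"
    "120.0,60.0\n"
    "191.3,95.5\n"
)
print(seconds_to_reach(sample, 190))  # 95.5
```

For a real run, open the progress.csv inside the trial's directory under ray_results (with newline="") and pass the file object in place of the sample.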

Note that we didn’t change any of the interesting parameters, such as the learning rate. This environment is simple enough that RLlib’s default parameters for the DQN algorithm do a good job without needing any finessing.


The trained agent interacting with its environment looks something like this:

Image 2

Bullet CartPole Environment With PPO

Changing the Algorithm

A good thing about using RLlib is that switching algorithms is simple. To train using Proximal Policy Optimisation (PPO) instead of DQN, add the following import:

from ray.rllib.agents.ppo import PPOTrainer

and change the definition of TRAINER as follows:

TRAINER = PPOTrainer

We will find out more details about PPO in a later article when we use it for training the Humanoid environment. For now, we just treat it as a black box and use its default learning parameters.


When I ran this, training took 109.6 seconds and progress looked like this:

Image 3

Now that’s a lovely smooth graph! And it trained more swiftly, too.

The videos look similar to those from the training session with the DQN algorithm, so I won’t repeat them here.

Continuous CartPole Environment With PPO

Introduction to the Environment

As mentioned before, traditional CartPole has a discrete action space: We train a policy to pick "move left" or "move right."

PyBullet also includes a continuous version of the CartPole environment, where the action specifies a single floating-point value representing the force to apply to the cart (positive for one direction, negative for the other).

>>> import gym
>>> import pybullet_envs
>>> env = gym.make('CartPoleContinuousBulletEnv-v0')
>>> env.observation_space
Box(4,)
>>> env.action_space
Box(1,)

The observation space is the same as before, but this time the action space is continuous instead of discrete.

The DQN algorithm cannot work with continuous action spaces. PPO can, so let’s do it.
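The constraint can be written down as a small lookup. This is a hand-written summary covering only the two algorithms used in this article, not something read from RLlib; the names SUPPORTED_ACTION_SPACES and algorithms_for are hypothetical.

```python
# Which kinds of action space each algorithm in this article can handle.
# Hand-written summary; only DQN and PPO are listed.
SUPPORTED_ACTION_SPACES = {
    "DQN": {"discrete"},
    "PPO": {"discrete", "continuous"},
}

def algorithms_for(action_space_kind):
    """Return the algorithm names that support the given action space kind."""
    return sorted(
        name
        for name, kinds in SUPPORTED_ACTION_SPACES.items()
        if action_space_kind in kinds
    )

print(algorithms_for("discrete"))    # ['DQN', 'PPO']
print(algorithms_for("continuous"))  # ['PPO']
```

For the full picture across all of RLlib's algorithms, the feature compatibility matrix in the RLlib documentation is the place to look.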


Change the declaration of the environment in the training code as follows:

ENV = 'CartPoleContinuousBulletEnv-v0'


This time the model took 108.6 seconds to train (no significant difference from the discrete environment), and progress looked like this:

Image 4


The resulting video looks like this:

Image 5

There’s more oscillation happening, but it has learned to balance the pole.

Next time

In the next article, we will take a look at two of the simpler locomotion environments that PyBullet makes available and train agents to solve them.



This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By
Web Developer
United Kingdom