Cartpole: Tweaking the Options

philoxenic

Jun 26, 2020

CPOL

In this article, we will see what’s going on behind the scenes and what options are available for tuning the learning process.

In the previous, and first, article in this series, we went over the reinforcement learning background and set up some helper functions. At the end, we solved a simple cartpole environment.

This time, we will take a look behind the scenes to see what options we have for tweaking the learning. We’ll finish by looking at another classic reinforcement learning environment: the mountain car.

Understanding the Cartpole Environment

We can interrogate a Gym environment to find out how it expects us to interact with it:

import gym
env = gym.make("CartPole-v0")  # environment IDs are case-sensitive
print(env.observation_space)
>>> Box(4,)

This tells us that we should expect four values in each observation.

print(env.action_space)
>>> Discrete(2)

This is known as a discrete action space; in this case, move left or move right. Staying still is not an option: the cart will be in a constant state of motion. Some other environments have continuous action spaces where, for example, an agent has to decide exactly how much voltage to apply to a servo to move a robot arm.

Let’s step through one episode of interaction with the cartpole environment. We reset the environment and then take the same action (0, or LEFT) over and over again, until the pole topples too far or the cart moves out of bounds.

env.reset()
while True:
  update = env.step(action=0)
  print(update)
  observation, reward, done, info = update
  if done:
    break

With each interaction, we get a new observation, details about the reward (+1 for each timestep we manage to survive), a notification of whether or not the episode is finished, and some (empty) information.
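To see how those per-step rewards add up to an episode score, here is a minimal sketch of a helper that runs one episode and totals the reward. It assumes the older Gym API used in this article, where step() returns four values. The FixedLengthEnv class is a hypothetical stand-in (not part of Gym) so that the sketch runs on its own.

```python
def run_episode(env, policy):
    """Run one episode and return (total_reward, steps).

    Assumes step() returns (observation, reward, done, info),
    as in the loop shown above.
    """
    observation = env.reset()
    total_reward, steps = 0.0, 0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


class FixedLengthEnv:
    """Hypothetical stand-in for a Gym environment.

    It pays +1 per timestep and ends after `length` steps.
    """

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


# Always push left, just like the loop above.
total, steps = run_episode(FixedLengthEnv(length=9), policy=lambda obs: 0)
print(total, steps)
```

With Gym installed, the same helper should work if you pass in the cartpole environment instead of the stand-in.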

OpenAI’s wiki page for the environment goes into more detail, including an explanation of the four floating-point numbers in the observation: the cart’s position, its velocity, the angle of the pole, and the velocity of the tip of the pole.
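If you find bare indices hard to read, one option is to wrap each observation in a named tuple. This is only a sketch: the field names below are my own labels for the four values listed above, not identifiers from Gym, and the sample numbers are made up.

```python
from collections import namedtuple

# Field names are illustrative labels, in the order described above.
CartPoleObservation = namedtuple(
    "CartPoleObservation",
    ["cart_position", "cart_velocity", "pole_angle", "pole_tip_velocity"],
)


def label_observation(observation):
    """Wrap a four-value observation so fields can be read by name."""
    return CartPoleObservation(*observation)


# Made-up sample values, standing in for a real observation.
obs = label_observation([0.01, -0.02, 0.03, 0.04])
print(obs.pole_angle)
```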

Changing the Learning Configuration

In the previous article, we used a DQNTrainer. Since we didn’t specify much in the config, RLlib used defaults. We can show these using the Python standard library’s pretty-printing module:

import pprint
import ray.rllib.agents.dqn

pprint.pprint(ray.rllib.agents.dqn.DEFAULT_CONFIG)

The output here should give you some idea of how customisable the library is. For example:

double_q, dueling, and prioritized_replay all default to True: these options can be used to help the agent learn faster
lr is the learning rate; it is something you’re likely to want to tweak but it can have dramatic effects on how the agent learns

We’ll be looking at a few more options throughout the remainder of this series. I hope that will whet your appetite for investigating others!

Add the following lines to the config dictionary you used to train on the cartpole environment last time:

"lr": 0.001,
"explore": False,

This will increase the learning rate from the default 0.0005 to 0.001, and will turn off the exploration schedule. If exploration is on, the agent might take an action chosen at random, with a probability that decays over time, instead of just taking the action that it thinks is best. This can stop the agent from settling too early on a poor strategy. The cartpole environment is simple enough that we don’t need to worry about this.
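To make the idea of a decaying exploration probability concrete, here is a minimal sketch of an epsilon-greedy action choice with a linear decay. This is not RLlib’s implementation: the function names and the start, end, and decay_steps values are assumptions chosen for illustration.

```python
import random


def epsilon_at(step, start=1.0, end=0.02, decay_steps=10000):
    """Linearly decay the exploration probability from start to end.

    The default values are illustrative assumptions.
    """
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def choose_action(q_values, step, rng=random):
    """Pick a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Early in training almost every action is random;
# late in training the agent mostly picks its best guess.
print(epsilon_at(0), epsilon_at(5000), epsilon_at(50000))
print(choose_action([0.1, 0.9], step=50000))
```

Turning exploration off corresponds to always taking the second branch: the agent only ever picks the action it currently rates highest.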

Run the training again and see how it goes. You might need to run each configuration several times to get a clear picture. Tweak some of the other parameters to see if you can get the training time down. This is a good environment to experiment with because you know whether your changes have been successful within a few minutes.

Mountain Car Environment

Mountain car is another classic reinforcement learning environment. Your agent has to learn to get a cart to the top of a mountain by pushing it left and right, in as few timesteps as possible.

Note the reward structure: you lose one point from your score for every timestep that passes between the start and the mountain car reaching the top of the hill. So the target score is also a negative number, just less negative than the scores you get in the early stages of training. The episode will automatically terminate after 200 timesteps, so the worst score is -200.
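Negative targets are easy to get backwards, so here is a small sketch of the scoring arithmetic described above. The helper names are hypothetical, and the -110 default is just the target I chose for this article.

```python
MAX_EPISODE_STEPS = 200  # the episode is cut off after this many timesteps


def episode_score(steps_to_goal):
    """Score for one episode: -1 per timestep, capped by the step limit.

    Pass None if the car never reached the top.
    """
    if steps_to_goal is None:
        return -MAX_EPISODE_STEPS
    return -min(steps_to_goal, MAX_EPISODE_STEPS)


def target_reached(mean_score, target=-110):
    """Higher (less negative) is better, so the test is >=, not <=."""
    return mean_score >= target


scores = [episode_score(n) for n in (None, 180, 105)]
print(scores)                # [-200, -180, -105]
print(target_reached(-105))  # True
print(target_reached(-180))  # False
```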

Here is the code I used:

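from ray import tune
from ray.rllib.agents.dqn import DQNTrainer

ENV = 'MountainCar-v0'
TARGET_REWARD = -110  # note the negative target
TRAINER = DQNTrainer

tune.run(
    TRAINER,
    stop={"episode_reward_mean": TARGET_REWARD},  # stop once the mean score reaches the target
    config={
        "env": ENV,
        "num_workers": 0,  # run in a single process
        "num_gpus": 0,
        "monitor": True,
        "evaluation_num_episodes": 25,
        "lr": 0.001,  # learning rate, up from the default 0.0005
        "explore": False,  # turn off the exploration schedule
    }
)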

For me, this solved the environment in less than 10 minutes. As before, experiment with the configuration to see if you can improve the performance.

From the next article onwards, we will work with more complicated environments based on the Atari Breakout game.