
Learning Breakout From RAM – Part 1

2 Jul 2020 · CPOL · 3 min read
In this article, we will learn from the contents of the game’s RAM instead of the pixels: first from the full 128 bytes, and then from a hand-picked subset of them.

In the previous article, we used RLlib’s IMPALA agent to learn the Atari Breakout environment from pixels in a respectable time.

Here, we will take it one step further and try to learn from the contents of the game’s RAM instead of the pixels.

As a software engineer, I expected the RAM environments to be easier to learn. After all, it seems likely that one location in memory would hold the x-coordinate of the bat, and two more would hold the position of the ball. If I were trying to write some code to play this game without machine learning, that’s probably where I’d want to start. If forced to use graphics, I’d just process them to extract this information anyway, so surely it’s simpler to skip that step.

Turns out I was wrong! It’s easier to learn from the images than from the RAM. Modern convolutional neural network architectures are good at extracting useful features from images. In contrast, Atari programmers, with only 128 bytes of RAM to work with, were accustomed to coming up with all sorts of "neat tricks" to pack as much information into the space as possible. One byte might represent a single number, or two numbers of four bits each, or eight one-bit flags...
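
As a purely hypothetical illustration (this is not Breakout’s actual memory map), the same byte could be read in several different ways:

Python
# Hypothetical illustration of byte packing: the same 8 bits could encode
# one number, two 4-bit numbers, or eight 1-bit flags, depending on how the
# original programmer chose to use them.
value = 0b1011_0110                    # one raw byte read from RAM (182 in decimal)

one_number = value                     # a single 8-bit value: 182
high_nibble = (value >> 4) & 0x0F      # two 4-bit values: 11 ...
low_nibble = value & 0x0F              # ... and 6
flags = [(value >> bit) & 1 for bit in range(8)]  # eight 1-bit flags

print(one_number, high_nibble, low_nibble, flags)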

Learning From RAM

Here is the code I used:

Python
import ray
from ray import tune
from ray.rllib.agents.dqn import DQNTrainer

ray.shutdown()
ray.init(include_webui=False, ignore_reinit_error=True)

ENV = "Breakout-ramDeterministic-v4"  # observations are the 128 bytes of RAM
TARGET_REWARD = 200
TRAINER = DQNTrainer

tune.run(
    TRAINER,
    stop={"episode_reward_mean": TARGET_REWARD},
    config={
      "env": ENV,
      "monitor": True,                       # save episode stats and videos to the output directory
      "evaluation_num_episodes": 25,
      "double_q": True,                      # use Double DQN
      "hiddens": [128],                      # hidden layer size for the Q-value head
      "num_workers": 0,                      # collect experience in the trainer process only
      "num_gpus": 1,
      "target_network_update_freq": 12_000,  # update the target network every 12,000 steps
      "lr": 5E-6,
      "adam_epsilon": 1E-5,
      "learning_starts": 150_000,            # steps of experience to collect before learning begins
      "buffer_size": 1_500_000,              # replay buffer capacity
    }
)

This is the progress up to the point where I stopped the process:

[Image 1: training progress graph]

This is not a great success. I let the training run for 54 hours to achieve a score of 40. So it had learnt something, and the graph suggests it was continuing to improve, but progress was very slow. In the next article, we will see how to do better.

Learning From a Subset of the RAM

It is tempting to think that, even though the Atari has only 128 bytes of memory, many of the stored values are effectively noise as far as the agent is concerned. For example, somewhere in there will be the player’s current score, and using this as an input feature won’t help the learning.

So I tried to identify a subset of bits that carry useful information. By logging the observations and seeing which ones seemed to be changing meaningfully (that is, had lots of distinct values over the first hundred timesteps), I picked out the following column indexes as "interesting": 70, 71, 72, 74, 75, 90, 94, 95, 99, 101, 103, 105, and 119.
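
The logging itself was nothing sophisticated. A minimal sketch of the idea, assuming the old gym API used elsewhere in this article and random actions (the threshold of ten distinct values is arbitrary), might look like this:

Python
import gym
import numpy as np

# Run the RAM environment for a hundred timesteps with random actions and
# count how many distinct values each of the 128 bytes takes on.
env = gym.make("Breakout-ramDeterministic-v4")
obs = env.reset()
history = [obs]
for _ in range(100):
    obs, reward, done, info = env.step(env.action_space.sample())
    history.append(obs)
    if done:
        obs = env.reset()

history = np.array(history)  # shape: (timesteps, 128)
distinct = [len(np.unique(history[:, col])) for col in range(128)]
interesting = [col for col, n in enumerate(distinct) if n >= 10]
print(interesting)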

Here is the code I used for training a model using only these values. I switched over to using the PPO algorithm because it seemed to perform a bit better than DQN.

The interesting part is the TruncateObservation class, which simplifies the observation space from 128 bytes down to 13.

Python
import pyvirtualdisplay

# Start a virtual display so that video recording works on a headless machine
_display = pyvirtualdisplay.Display(visible=False, size=(1400, 900))
_ = _display.start()

import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

ray.shutdown()
ray.init(include_webui=False, ignore_reinit_error=True)

import numpy as np
import gym
from gym import ObservationWrapper
from gym.spaces import Box
from ray.tune.registry import register_env


class TruncateObservation(ObservationWrapper):
    """Reduce the 128-byte RAM observation to the 13 'interesting' bytes."""

    interesting_columns = [70, 71, 72, 74, 75, 90, 94, 95, 99, 101, 103, 105, 119]

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = Box(low=0, high=255, shape=(len(self.interesting_columns),), dtype=np.uint8)

    def observation(self, obs):
        # print(obs.tolist())  # print full observation to find interesting columns
        return obs[self.interesting_columns]  # keep only the interesting bytes


def env_creator(env_config):
    env = gym.make('Breakout-ramDeterministic-v4')
    env = TruncateObservation(env)
    return env


register_env("simpler_breakout", env_creator)
ENV = "simpler_breakout"
TARGET_REWARD = 200
TRAINER = PPOTrainer


tune.run(
    TRAINER,
    stop={"episode_reward_mean": TARGET_REWARD},
    config={
      "env": ENV,
      "num_workers": 1,
      "num_gpus": 0,
      "monitor": True,                 # save episode stats and videos to the output directory
      "evaluation_num_episodes": 25,
    }
)

The learning performance looked like this:

[Image 2: training progress graph]

It achieved a score of 42 after 27 hours of training, at which point I stopped the process. This looks more promising than trying to train on all the bytes. Let me know in the comments if you manage to do better than this with a slightly different subset of memory locations. For example, "Learning from the memory of Atari 2600" by Jakub Sygnowski and Henryk Michalewski calls out memory locations 95 to 105 as particularly influential.
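
Trying that range only needs the list of columns in the wrapper to change. For example (a sketch of the variation, not something I have trained myself):

Python
# Hypothetical variation, keeping the rest of the training script unchanged:
# use the byte range 95 to 105 (inclusive) highlighted by Sygnowski and
# Michalewski instead of the hand-picked columns.
interesting_columns = list(range(95, 106))  # [95, 96, ..., 105]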

In the next article, we will see how we can improve by approaching the RAM in a slightly different way.

This article is part of the series 'Deep Reinforcement Learning and Game Optimization'.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

