In earlier articles in this series we looked at the Humanoid Bullet environment, where the objective is to teach a model of a humanoid to walk forwards without falling over.

This time we will look at how to tweak the environment so that the agent learns a slightly different task: walking backwards instead of forwards.

You won’t find much, if anything, about this in the machine learning literature. Indeed, getting it to work at all was a bit of a struggle! Naively, I thought I could just set a different target x-coordinate on the environment, and that would be it. However, when I tried that I ended up with agents that simply learnt to throw themselves backwards from the starting line, at which point the episode would finish. It’s fun to watch, but it’s not what I was trying to achieve.

Instead, it turned out that I needed to override the target x-coordinate in both the environment and the robot, and also move the starting point away from the origin (0, 0, 0). It took a lot of trial and error to get this to work! I never did manage to track down where in the code the episode was being forcibly ended when the x-coordinate went negative. If you work this out, please let me know. The PyBullet code wasn’t designed with this sort of extensibility in mind.

Code

Here is the code I used. The interesting part is in the custom environment’s reset function, where it sets the new target location and start position:

import pyvirtualdisplay

_display = pyvirtualdisplay.Display(visible=False, size=(1400, 900))
_ = _display.start()

import ray
from ray import tune
from ray.rllib.agents.sac import SACTrainer
from pybullet_envs.gym_locomotion_envs import HumanoidBulletEnv

ray.shutdown()
ray.init(include_webui=False, ignore_reinit_error=True)

from ray.tune.registry import register_env

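class RunBackwardsEnv(HumanoidBulletEnv):
    """Humanoid environment whose walk target lies in the negative x direction."""

    def reset(self):
        state = super().reset()

        # Override the target on the environment (the default is 1e3).
        self.walk_target_x = -1e3

        # The robot holds its own target, so override that too,
        # and move the start position away from the origin.
        self.robot.walk_target_x = -1e3
        self.robot.start_pos_x = 500
        self.robot.body_xyz = [500, 0, 0]

        return state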

def make_env(env_config):
    env = RunBackwardsEnv()
    return env

ENV = 'HumanoidBulletEnvReverseReward-v0'
register_env(ENV, make_env)

TARGET_REWARD = 6000
TRAINER = SACTrainer

tune.run(
    TRAINER,
    stop={"episode_reward_mean": TARGET_REWARD},
    config={
        "env": ENV,
        "num_workers": 15,
        "num_gpus": 1,
        "monitor": True,
        "evaluation_num_episodes": 50,
        "optimization": {
            "actor_learning_rate": 1e-3,
            "critic_learning_rate": 1e-3,
            "entropy_learning_rate": 3e-4,
        },
        "train_batch_size": 128,
        "target_network_update_freq": 1,
        "learning_starts": 1000,
        "buffer_size": 1_000_000,
        "observation_filter": "MeanStdFilter",
    }
)

By default, the environment’s target x-coordinate is 1000. I set it to -1000, but I’m not sure the agent ever gets that far: I suspect the episode is forcibly terminated when the agent’s x-coordinate passes zero.
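To see why the sign of the target matters, here is a minimal sketch of a potential-based progress reward, which is how I understand the locomotion environments to reward movement towards the target. It is not code taken from PyBullet: the function names, the timestep value and the example coordinates are all illustrative assumptions.

```python
import math

# Assumed control timestep, for illustration only.
DT = 0.0165


def potential(body_xy, target_xy, dt=DT):
    """Negative distance from the body to the target, scaled by the timestep."""
    dist = math.hypot(target_xy[0] - body_xy[0], target_xy[1] - body_xy[1])
    return -dist / dt


def progress(prev_xy, next_xy, target_xy, dt=DT):
    """Progress reward for one step: the change in potential."""
    return potential(next_xy, target_xy, dt) - potential(prev_xy, target_xy, dt)


# Hypothetical positions: start at x = 500 and move 1 cm either way.
start = (500.0, 0.0)
step_back = (499.99, 0.0)
step_fwd = (500.01, 0.0)

backwards_target = (-1000.0, 0.0)  # the target used in this article
default_target = (1000.0, 0.0)     # the environment's default target

reward_back = progress(start, step_back, backwards_target)
reward_fwd = progress(start, step_fwd, backwards_target)
reward_back_default = progress(start, step_back, default_target)

print("step backwards, target -1000:", reward_back)
print("step forwards,  target -1000:", reward_fwd)
print("step backwards, target +1000:", reward_back_default)
```

With the target at -1000, a step in the negative x direction earns a positive progress reward and a step forwards earns a negative one. With the default target the signs are reversed, which is why the target has to change before the agent has any incentive to walk backwards.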

Graph

Here is a graph of the average reward during training, which ran for 46.8 hours.

As you can see, the learning process was not particularly smooth, and it looks like the agent might have continued to improve if I had left it for longer.

Video

Here is what the trained agent looked like. Perhaps it’s not an elegant gait, but given how many of my earlier experiments had failed because the agent could not go backwards past the origin, I was thrilled to finally see this working.

Next time

In the next and final article in this series we will look at even deeper customisation: editing the XML-based model of the figure and then training the result.