Implementations of reinforcement learning algorithms.


Supported algorithms:

  • Proximal Policy Optimization (PPO)
    • Asynchronous Proximal Policy Optimization (APPO) uses Ray to decouple rollout generation from the learner. Rollout, policy inference, and evaluation are each Ray actors that can be independently scaled depending on the environment. Workers can use the same GPU as the learner or can be assigned to other GPUs to maximize learning throughput.
    • Distributed Proximal Policy Optimization (DPPO) extends APPO with the Hugging Face Accelerate library to support multi-GPU learning using PyTorch's DistributedDataParallel.
  • Advantage Actor-Critic (A2C)
  • Behavior Cloning (ACBC): While not a reinforcement learning algorithm, it's included as a model pretrainer for the other algorithms.
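
To illustrate the decoupling that the APPO entry describes, here is a minimal sketch in which rollout producers and a learner communicate only through a queue. It uses Python's standard `threading` and `queue` modules as a stand-in for Ray actors and is not this project's implementation; the names `rollout_worker`, `learner`, and `run_decoupled` are hypothetical.

```python
import queue
import threading


def rollout_worker(worker_id, n_rollouts, out_queue):
    """Produce placeholder rollouts; a real worker would step an environment."""
    for step in range(n_rollouts):
        out_queue.put({"worker": worker_id, "step": step})


def learner(in_queue, total_rollouts):
    """Consume rollouts as they arrive, regardless of which worker made them."""
    consumed = []
    for _ in range(total_rollouts):
        consumed.append(in_queue.get())  # blocks until a rollout is available
    return consumed


def run_decoupled(n_workers=3, n_rollouts=4):
    """Start the workers, let the learner drain the queue, then join."""
    rollouts = queue.Queue()
    workers = [
        threading.Thread(target=rollout_worker, args=(i, n_rollouts, rollouts))
        for i in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    batch = learner(rollouts, n_workers * n_rollouts)
    for worker in workers:
        worker.join()
    return batch
```

Because the learner depends only on the queue, the number of producers can change without touching the learner, which is the property that lets rollout generation scale independently.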

Supported environments:

  • microRTS: small real-time strategy (RTS) game implementation in Java. MicroRTS-Py's vectorized gym environment was used as a starting point and extended.
  • Lux AI Season 2: turn-based resource management game on a larger grid-based map (48x48 and later 64x64) than the original Lux AI game.
  • Classic, Box2D, MuJoCo, and Atari envs from gymnasium

Removed support:

  • Deep Q-Network (DQN): Removed in v0.2.0 because the rollout generation redesign doesn't support a replay buffer yet.
  • procgen: Removed in 2ee4de7 because of its gym 0.21 dependency. v0.1.0 uses gymnasium.
  • PyBullet: Removed in 2ee4de7 because of its gym dependency.

Prerequisites: Weights & Biases (WandB)

Training and benchmarking assume you have a Weights & Biases project to upload runs to. By default, training runs go to the rl-algo-impls project while benchmarks go to rl-algo-impls-benchmarks. During training and benchmarking runs, videos of the best models and the model weights are uploaded to WandB.

Before doing anything below, you'll need to create a wandb account and run wandb login.

Setup and Usage

Lambda Labs instance for benchmarking

Benchmark runs are uploaded to WandB, where they can be made into reports (for example). So far I've found Lambda Labs A10 instances to be a good balance of performance (14 hours to train PPO in 14 environments [5 basic gymnasium, 4 MuJoCo, CarRacing-v2, and 4 Atari] across 3 seeds) vs cost ($0.60/hr).
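
The figures above can be checked with a few lines of arithmetic. This sketch assumes the 14 hours is the total wall-clock time for the whole benchmark on a single instance; the function name `benchmark_summary` is hypothetical.

```python
# Figures taken from the paragraph above.
ENV_COUNTS = {"basic gymnasium": 5, "MuJoCo": 4, "CarRacing-v2": 1, "Atari": 4}
SEEDS = 3
BENCHMARK_HOURS = 14  # assumed to be total wall-clock time on one instance
PRICE_CENTS_PER_HOUR = 60  # $0.60/hr, kept in cents to avoid float rounding


def benchmark_summary():
    """Return the environment count, run count, and approximate cost in USD."""
    n_envs = sum(ENV_COUNTS.values())
    n_runs = n_envs * SEEDS
    cost_cents = BENCHMARK_HOURS * PRICE_CENTS_PER_HOUR
    return {"environments": n_envs, "runs": n_runs, "cost_usd": cost_cents / 100}
```

Under that assumption, a full PPO benchmark is 42 runs and costs roughly $8.40 of instance time.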

git clone
cd rl-algo-impls
# git checkout BRANCH_NAME if running on non-main branch
bash ./scripts/ [--microrts] [--lux] # End of script will prompt for WandB API key
bash ./scripts/ [-a {"ppo"}] [-e ENVS] [-j {6}] [-p {rl-algo-impls-benchmarks}] [-s {"1 2 3"}]

Benchmarking runs are uploaded by default to the rl-algo-impls-benchmarks project. Runs upload videos of the running best model and the weights of the best and last models. Benchmarking runs are tagged with a shortened commit hash (e.g., benchmark_5598ebc) and the hostname (e.g., host_192-9-145-26).
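
A minimal sketch of how tags in this shape could be built. The format is inferred from the two examples above rather than taken from the benchmark script; the helper name `benchmark_tags`, the 7-character hash length, and the dot-to-dash replacement in the hostname are all assumptions.

```python
def benchmark_tags(commit_hash: str, hostname: str, short_len: int = 7) -> list:
    """Build WandB-style tags such as benchmark_5598ebc and host_192-9-145-26."""
    # Assumption: the tag uses the first `short_len` characters of the hash.
    commit_tag = f"benchmark_{commit_hash[:short_len]}"
    # Assumption: dots are replaced with dashes; a hostname that is already
    # dash-separated passes through unchanged.
    host_tag = f"host_{hostname.replace('.', '-')}"
    return [commit_tag, host_tag]
```

Filtering WandB runs by both tags narrows a report to one commit on one machine.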

Publishing models to Hugging Face

Publishing benchmarks to Hugging Face requires logging in with a write-capable API token:

git config --global credential.helper store
huggingface-cli login
# For example: python --wandb-tags host_192-9-147-166 benchmark_1d4094f --wandb-report-url
# --virtual-display likely must be specified if running on a remote machine.
poetry run python --wandb-tags HOST_TAG COMMIT_TAG --wandb-report-url WANDB_REPORT_URL [--virtual-display]

Google Colab Pro+

Three notebooks in the colab directory are set up for use with Google Colab:

  • colab_benchmark.ipynb: Even with a Google Colab Pro+ subscription, you can only run part of the benchmark at a time. The file recommends 4 splits (basic+mujoco, carracing, atari1, atari2) because a full run would otherwise exceed the 24-hour session limit. This mostly comes from being unable to get pool_size above 1 because of WandB errors.
  • colab_train.ipynb: Train models while being able to specify the env, seeds, and algo. By default training runs are uploaded to the rl-algo-impls project.
  • colab_enjoy.ipynb: Download models from WandB and evaluate them. Training is likely to be more interesting given videos are uploaded.



My local development has been on an M1 Mac. These instructions might not be complete, but this is approximately the setup and usage I've been using:

  1. Install libraries with homebrew
brew install swig
brew install --cask xquartz
brew install pipx
  2. Download and install Miniconda for arm64
curl -O
  3. Create a conda environment from this repo's environment.yml
conda env create -f environment.yml -n rai_py38_poetry
conda activate rai_py38_poetry
  4. Install other dependencies with poetry
pipx install poetry
poetry install -E all


Training, benchmarking, and watching the agents playing the environments can be done locally:

poetry run python [-h] [--algo {ppo}] [--env ENV [ENV ...]] [--seed [SEED ...]] [--wandb-project-name WANDB_PROJECT_NAME] [--wandb-tags [WANDB_TAGS ...]] [--pool-size POOL_SIZE] [--virtual-display]

Training by default uploads to the rl-algo-impls WandB project. Training creates videos of the running best model, which will cause popups. Creating the first video requires a display, so you shouldn't shut off the display until the video of the initial model is created (1-5 minutes depending on environment). The --virtual-display flag should allow headless mode, but that hasn't been reliable on macOS.
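
The usage string above can be mirrored with a small `argparse` sketch. This is a reconstruction from the flags shown, not the repository's actual parser: the defaults for `--algo`, `--env`, `--seed`, and `--pool-size` are assumptions, and `build_parser` and `expand_runs` are hypothetical names. Only the default project name, rl-algo-impls, comes from the text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags listed in the usage string."""
    parser = argparse.ArgumentParser(description="Sketch of the training CLI")
    parser.add_argument("--algo", default="ppo", choices=["ppo"])  # default assumed
    parser.add_argument("--env", nargs="+", default=["CartPole-v1"])  # default assumed
    parser.add_argument("--seed", type=int, nargs="*", default=[1])  # default assumed
    parser.add_argument("--wandb-project-name", default="rl-algo-impls")
    parser.add_argument("--wandb-tags", nargs="*", default=[])
    parser.add_argument("--pool-size", type=int, default=1)  # default assumed
    parser.add_argument("--virtual-display", action="store_true")
    return parser


def expand_runs(args: argparse.Namespace) -> list:
    """Return one (algo, env, seed) tuple per training run requested."""
    return [(args.algo, env, seed) for env in args.env for seed in args.seed]
```

For example, passing two environments and three seeds expands to six runs, which is the kind of workload a pool size above 1 would parallelize.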

poetry run python [-h] [--algo {ppo}] [--env ENV] [--seed SEED] [--render RENDER] [--best BEST] [--n_episodes N_EPISODES] [--deterministic-eval DETERMINISTIC_EVAL] [--no-print-returns]
# OR
poetry run python [--wandb-run-path WANDB_RUN_PATH]

The first form, where you specify algo, env, and seed, loads a model you trained locally with those parameters and renders the agent playing the environment.

The second form downloads the model and hyperparameters from a WandB run. An example run path is sgoodfriend/rl-algo-impls-benchmarks/09gea50g.
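
A WandB run path has three slash-separated parts: entity, project, and run ID. The sketch below splits and validates one; `RunPath` and `parse_run_path` are hypothetical helpers for illustration, not part of this project or of the wandb library.

```python
from typing import NamedTuple


class RunPath(NamedTuple):
    entity: str
    project: str
    run_id: str


def parse_run_path(run_path: str) -> RunPath:
    """Split 'entity/project/run_id' into its parts, rejecting other shapes."""
    parts = run_path.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected entity/project/run_id, got {run_path!r}")
    return RunPath(*parts)
```

Applied to the example above, this yields the entity sgoodfriend, the project rl-algo-impls-benchmarks, and the run ID 09gea50g.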


Hyperparameters are specified in YAML files in the hyperparams directory, organized by game (atari is a special case covering all Atari games).

gym-microrts Setup

Requires the Java SDK to be installed first:

poetry install -E microrts

Lux AI Season 2 Setup

Lux training uses a Jux fork that adds support for environments not being in lockstep, stats collection, and other improvements. By default the fork installs the CPU-only version of JAX, which isn't ideal for training but is useful for development.

poetry run pip install vec-noise # lux requires vec-noise, which isn't poetry installable
poetry install -E lux

For actual training, you'll need an NVIDIA GPU, and you must follow these instructions to install jax[cuda11_pip]==0.4.7 after installing the lux dependencies:

poetry run pip install vec-noise # lux requires vec-noise, which isn't poetry installable
poetry install -E lux
# If CUDA 12 installed, use `cuda12_pip` instead.
poetry run pip install --upgrade "jax[cuda11_pip]==0.4.7" -f

Citing this Project

To cite the microRTS work in this project:

      title={A Competition Winning Deep Reinforcement Learning Agent in microRTS}, 
      author={Scott Goodfriend},

If citing parts of this project NOT microRTS:

      title={rl-algo-impls: Implementations of reinforcement learning algorithms},
      author={Scott Goodfriend},