Rainbow: Combining Improvements in Deep Reinforcement Learning

Last update: Dec 29, 2022

Overview

Rainbow

Rainbow: Combining Improvements in Deep Reinforcement Learning [1].

Results and pretrained models can be found in the releases.

Run the original Rainbow with the default arguments:

python main.py

Data-efficient Rainbow [9] can be run using the following options (note that the "unbounded" memory is implemented here in practice by manually setting the memory capacity to be the same as the maximum number of timesteps):

python main.py --target-update 2000 \
               --T-max 100000 \
               --learn-start 1600 \
               --memory-capacity 100000 \
               --replay-frequency 1 \
               --multi-step 20 \
               --architecture data-efficient \
               --hidden-size 256 \
               --learning-rate 0.0001 \
               --evaluation-interval 10000

Note that pretrained models from the 1.3 release used a (slightly) incorrect network architecture. To use these, change the padding in the first convolutional layer from 0 to 1 (DeepMind uses "valid" (no) padding).
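The shape arithmetic behind this note can be checked with a short sketch. It assumes the standard DQN convolutional stack (84x84 input; kernel/stride pairs of 8/4, 4/2 and 3/1), which is not stated in the text above, and the helper names are hypothetical. Under that assumption both paddings give the same feature-map size, which would explain why the 1.3 weights still load once the padding is changed; only the computed activations differ.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_size(first_layer_padding, input_size=84):
    # Assumed DQN-style stack: (kernel, stride) = (8, 4), (4, 2), (3, 1).
    size = conv_out(input_size, 8, 4, first_layer_padding)
    size = conv_out(size, 4, 2)
    size = conv_out(size, 3, 1)
    return size


print(feature_map_size(0))  # 7 ("valid" padding)
print(feature_map_size(1))  # 7 (padding of 1 in the first layer)
```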

Requirements

To install all dependencies with Anaconda, run conda env create -f environment.yml, then activate the environment with source activate rainbow.

Available Atari games can be found in the atari-py ROMs folder.

References

[1] Rainbow: Combining Improvements in Deep Reinforcement Learning
[2] Playing Atari with Deep Reinforcement Learning
[3] Deep Reinforcement Learning with Double Q-learning
[4] Prioritized Experience Replay
[5] Dueling Network Architectures for Deep Reinforcement Learning
[6] Reinforcement Learning: An Introduction
[7] A Distributional Perspective on Reinforcement Learning
[8] Noisy Networks for Exploration
[9] When to Use Parametric Models in Reinforcement Learning?

Comments

Prioritised Experience Replay

I am interested in implementing Rainbow too. I haven't gone deep into the code yet, but I saw in the Readme.md that Prioritised Experience Replay is not checked. Will this feature be implemented, or is it already working? In their paper, DeepMind actually show that Prioritized Experience Replay is the most important feature: the "no priority" ablation has the biggest performance gap from the full Rainbow.
bug help wanted

opened by marintoro 28
Replicating DeepMind results
As of 5c252ea, this repo has been checked over several times for discrepancies, but is still unable to replicate DeepMind's results. This issue is to discuss any further points that may need fixing.

[x] Should the loss be averaged or summed over the minibatch?

[x] Should noisy network updating use independent noise per transition in the batch [v1] or the same noise but another noise sample for action selection [v2]?

[x] Is the max priority over all time, or just from the current buffer (may be the former)? Results and paper indicate former.

[x] Are priorities added as δ, or δ + ε? A single ablation run indicates that adding ε causes performance to drop more at the end of training, and ε shouldn't be needed with a KL loss.

[x] Most people implement PER by storing priorities already raised to the power of α, but the maths indicates that the raw values should be stored and sampling should be done with respect to everything raised to the power of α? α isn't changed here, so this is not an issue.
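For reference, the α question in the last item can be sketched with the proportional prioritisation formulas from the PER paper [4]: P(i) = p_i^α / Σ_k p_k^α, with importance-sampling weights w_i = (N · P(i))^(-β) normalised by their maximum. This is a minimal illustration, not this repository's memory code, and the function names are hypothetical. With a fixed α, storing raw priorities and exponentiating at sampling time gives the same distribution as storing p^α directly.

```python
def sampling_probabilities(priorities, alpha):
    """Proportional PER: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta):
    """w_i = (N * P(i))^(-beta), normalised so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** -beta for p in probs]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


raw = [1.0, 2.0, 4.0]  # raw priorities, e.g. per-transition losses
alpha = 0.5

# Option A: store raw values, apply alpha when sampling.
probs_raw = sampling_probabilities(raw, alpha)

# Option B: store values already raised to alpha, sample proportionally.
stored = [p ** alpha for p in raw]
probs_stored = sampling_probabilities(stored, 1.0)

print(probs_raw)
print(probs_stored)
```

The two options only diverge if α changes during training, because stored values would then carry a stale exponent.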

Space Invaders (averaged losses):

Space Invaders (summed losses):
help wanted
opened by Kaixhin 24
Resume support

Added preliminary support for resuming. Initial testing looks like it works, but I'd appreciate if anyone else gets a chance to play with it in their setup.

I didn't add an explicit resume flag, although we could do that. Currently, the assumption is that if you provide the --memory-save-path argument, you want the memory saved there, by default after every testing round. If you provide the --model argument and do not provide the --evaluate flag, the assumption is that you want to resume, and that --memory-save-path exists.

Another flag we could add is a --T_start flag, akin to --T-max, to specify where training is resuming from and so improve the logging of resumed models. What do you think?

Choosing to compress at all, and choosing to use bz2 specifically, came after a quick benchmark I did with some pickled memories I had. It drops them from ~2GB to <100 MB, and bz2 took somewhere around 2-3 minutes, while pickling without it took around 40 seconds.
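The compression approach described above can be sketched with the standard library alone. This is a minimal illustration rather than the code from the pull request; save_memory and load_memory are hypothetical names, and the memory object is assumed to be picklable.

```python
import bz2
import pickle


def save_memory(memory, path):
    """Pickle `memory` and write it through a bz2-compressed file."""
    with bz2.open(path, "wb") as f:
        pickle.dump(memory, f)


def load_memory(path):
    """Read a bz2-compressed pickle back into a Python object."""
    with bz2.open(path, "rb") as f:
        return pickle.load(f)
```

The trade-off is the one reported above: a much smaller file in exchange for a noticeably longer save time.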

opened by guydav 14
Performance of release v1.0 on Space Invaders

I just ran release v1.0 (commit 952fcb4) on Space Invaders for the whole weekend (around 25M steps). I used the exact same code with the exact same random seed, but got much lower performance than the one you are showing. Here are the plots of rewards and Q-values.

Could you explain exactly how you got your results for this release? Did you run multiple experiments with different random seeds and average them, or just take the best one? Or maybe it's an issue with PyTorch, atari_py or another library? Could you list all your library versions?

opened by marintoro 13
Testing should not be deterministic

There is an --evaluation-episodes parameter, but in the current implementation, since we always act greedily, all the episodes are going to be exactly the same. To get a better evaluation, I think you should add a deterministic=False option when testing (i.e. instead of taking the action with the highest Q-value, sample over all the actions using each Q-value as the probability).

I implemented that on my branch on the last commit [email protected] (it's really straightforward)
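A minimal sketch of the proposal follows. The branch itself is not shown here, so this is an assumption about how such sampling could look: because Q-values can be negative they cannot serve as probabilities directly, and the sketch passes them through a softmax first. The select_action name and the deterministic switch are hypothetical.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Turn Q-values into a probability distribution (numerically stable)."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(q_values, deterministic=True, rng=random):
    """Greedy action by default; otherwise sample from softmax(Q)."""
    indices = range(len(q_values))
    if deterministic:
        return max(indices, key=lambda i: q_values[i])
    return rng.choices(indices, weights=softmax(q_values), k=1)[0]
```

Passing a seeded random.Random instance as rng keeps stochastic evaluation reproducible across runs.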

Btw, I launched a training run last night and everything worked properly. I don't have access to a powerful computer yet, so the agent was still performing poorly (it was in the early stage of training). I just wanted to know whether you have already launched a long training run, on which game, and whether you compared it to a standard DRL algorithm (simple DQN, for example). There may still be some non-breaking errors in the implementation that could be sneaky to spot and debug (if the agent learns worse than simple DQN, for example, there must be something wrong).

opened by marintoro 8
TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not collections.deque

Traceback (most recent call last):
  File "main.py", line 81, in <module>
    state, done = env.reset(), False
  File "C:\Users\simon\Desktop\DQN\RL-AlphaGO\Rainbow-master\env.py", line 53, in reset
    return torch.stack(self.state_buffer, 0)
TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not collections.deque

Could somebody give a hand?

opened by forhonourlx 7
disable env.reset() after every episode

Hi, if I would like to keep the environment as it is after each training episode, should I just comment out line 147 in main.py, or should I also comment out line 130? Also, what should I do if I just want to reset the agent's position but keep the environment as it is after each training episode?

Thank you.
question

opened by zyzhang1130 5
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

......
self.actions.get(action): 4
self.actions.get(action): 4
self.actions.get(action): 4
self.actions.get(action): 4
self.actions.get(action): 1
self.actions.get(action): 1
self.actions.get(action): 1
self.actions.get(action): 1
self.actions.get(action): None

Traceback (most recent call last):
  File "main.py", line 103, in <module>
    next_state, reward, done = env.step(action)  # Step
  File "C:\Users\simon\Desktop\DQN\RL-AlphaGO\Rainbow-master\env.py", line 63, in step
    reward += self.ale.act(self.actions.get(action))
  File "C:\Program Files\Python35\lib\site-packages\atari_py\ale_python_interface.py", line 159, in act
    return ale_lib.act(self.obj, int(action))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
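The last line of the debug output above shows the lookup returning None, which is what dict.get does for a missing key. One plausible reading, offered as an assumption rather than a confirmed diagnosis, is that the action index passed to env.step was not a key of the mapping (for example an out-of-range index or a tensor instead of a plain int). The mapping and helper below are hypothetical and only illustrate the failure mode and a guard that fails earlier with a clearer message.

```python
# Hypothetical mapping from agent action indices to emulator action codes.
actions = dict(enumerate([0, 1, 3, 4, 11, 12]))


def lookup_action(actions, action):
    """Return the emulator action, raising a clear error for unknown indices."""
    if action not in actions:
        raise KeyError(
            "action %r is not one of the valid indices %r" % (action, sorted(actions))
        )
    return actions[action]


print(actions.get(2))   # a valid index returns its emulator action code
print(actions.get(99))  # a missing key silently returns None
```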

opened by forhonourlx 5
Unit test Prioritised Experience Replay Memory

PER was reported to cause issues (decreasing the performance of a DQN) when ported to another codebase. Although PER can cause performance to decrease, it is still likely that there exists a bug within it.
bug

opened by Kaixhin 5
Policy and reward function

Hi, there are certain things I would like to modify in the policy and reward function. Where is the policy stored after each epoch of training? Is there some way to call/index/assign it with a flag? Thanks for answering.

opened by zyzhang1130 4
Memory capacity for example data-efficient Rainbow?

Hi folks,

I'm running the data-efficient Rainbow as a baseline for a project I'm starting, and one thing isn't making sense in my head. The original Rainbow paper uses a 1M transition buffer, and comparatively, the data-efficient paper (Appendix E) claims to use an unbounded memory.

Do you have any sense of what does an unbounded memory even mean in practice? Is there any particular reason you chose to make it smaller than the default Rainbow's memory buffer, rather than larger?

Thank you!
question

opened by guydav 4
A problem about one game in ALE cannot be trained

Hi, Kai! I found an issue that occurs when I set the game "defender" as the environment. It only displays the hyperparameter settings ("args"), and no training results are output, unlike with other games.

Thanks!
bug

opened by Hugh-Cai 1
Stuck in memory._retrieve when batch size > 32

Hi,

I noticed that Rainbow doesn't work when the batch size is greater than 32 (I tried 64, 128 and 256): it gets stuck in the recursive call memory._retrieve. Why does this happen? Is there something I can do about this (to increase the batch size), or does the batch size need to be small?
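For context, a prioritised replay memory is commonly backed by a sum tree, where retrieval walks from the root to a leaf. The sketch below is a generic, iterative version written for illustration; it is not the repository's memory._retrieve and is not a diagnosis of the hang reported here. The class and method names are hypothetical. The descent takes at most one step per tree level, so it terminates whenever the sampled value lies within the total priority.

```python
class SumTree:
    """Binary tree stored in a list; each internal node holds the sum of its children."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes, then leaves

    def update(self, leaf, priority):
        """Set the priority of a leaf and propagate the change to the root."""
        index = leaf + self.capacity - 1
        change = priority - self.tree[index]
        while True:
            self.tree[index] += change
            if index == 0:
                break
            index = (index - 1) // 2

    def total(self):
        return self.tree[0]

    def retrieve(self, value):
        """Find the leaf whose cumulative priority range contains `value`."""
        index = 0
        while True:
            left = 2 * index + 1
            if left >= len(self.tree):  # no children: this is a leaf
                return index - (self.capacity - 1)
            if value <= self.tree[left]:
                index = left
            else:
                value -= self.tree[left]
                index = left + 1
```

Sampling a batch then amounts to drawing one value per segment of [0, total()] and calling retrieve on each.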

Thanks
question

opened by jiwoongim 1
Is the evaluation procedure different?

Hi Kai,

In the Rainbow paper, the evaluation procedure is described as

The average scores of the agent are evaluated during training, every 1M steps in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play).

However, the code as written tests for a fixed number of episodes. Am I missing anything? Or is this the procedure from the data-efficient Rainbow paper (I couldn't find a detailed description there).
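The difference between the two procedures can be made concrete with a sketch of frame-budget evaluation as described in the quoted passage. Here run_episode is a hypothetical callable that plays one episode, truncated at the given number of frames, and returns (score, frames_used); it must use at least one frame per call. The quote does not say what happens to an episode that crosses the 500K-frame budget, so letting it finish is an assumption of this sketch.

```python
def evaluate_by_frames(run_episode, frame_budget=500_000, max_episode_frames=108_000):
    """Average episode score over a frame budget rather than a fixed episode count."""
    scores = []
    frames_used = 0
    while frames_used < frame_budget:
        # Each episode is truncated at max_episode_frames (30 minutes of play).
        score, frames = run_episode(max_episode_frames)
        frames_used += frames
        scores.append(score)
    return sum(scores) / len(scores), frames_used
```

With a fixed number of episodes, by contrast, the amount of evaluation data varies with episode length, so short-lived agents are evaluated on far fewer frames.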

Thanks!
enhancement question

opened by guydav 8
Human-expert normalized scores

The Rainbow DQN paper uses human-expert normalized scores, so I am not sure how to evaluate the training results against the original paper. Do you know what values were used for human expert scores?

I found snippets of the values used in papers here and there, but I am not sure whether we can use the same numbers, or how to compute a single normalized value across all Atari games.
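As a reference point, the human-normalized score is conventionally computed per game as 100 × (agent − random) / (human − random), and the Rainbow paper [1] summarises games with the median of those values. The sketch below shows that arithmetic only; the function names are hypothetical, and the per-game random and human baselines must be taken from the papers because none are given here.

```python
from statistics import median


def human_normalized_score(agent, random_score, human):
    """Percentage of the gap between random play and human play closed by the agent."""
    return 100.0 * (agent - random_score) / (human - random_score)


def median_human_normalized(results):
    """`results` maps game name -> (agent, random, human) raw scores."""
    return median(human_normalized_score(*scores) for scores in results.values())
```

A score of 0 means random-level play and 100 means human-level play, so values from different games become comparable before taking the median.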

opened by ThisIsIsaac 4
Pinned memory experience replay

A more efficient implementation would allocate a giant tensor in advance for each item (e.g. state, action) in a transition tuple, furthermore pin it (as long as the machine has enough RAM spare - should be at least 6GB?), and use asynchronous copies to GPU.
enhancement

opened by Kaixhin 0

Releases(1.4)

1.4(Jun 18, 2019)

Pretrained models for data-efficient Rainbow. Reported scores matched for most games (sometimes models are a bit worse, sometimes a bit better).

Reward and Q-value plots are provided for each of the following games: Alien, Amidar, Assault, Asterix, Bank Heist, Battlezone, Boxing, Breakout, Chopper Command, Crazy Climber, Demon Attack, Freeway, Frostbite, Gopher, H.E.R.O., James Bond 007, Kangaroo, Krull, Kung-Fu Master, Ms. Pac-Man, Pong, Private Eye, Q*bert, Road Runner, Seaquest, Up'n Down.
Source code(tar.gz)
Source code(zip)
alien.pth(6.44 MB)
amidar.pth(5.24 MB)
assault.pth(4.79 MB)
asterix.pth(5.09 MB)
bank_heist.pth(6.44 MB)
battle_zone.pth(40.15 KB)
boxing.pth(6.44 MB)
breakout.pth(4.34 MB)
chopper_command.pth(6.44 MB)
crazy_climber.pth(5.09 MB)
demon_attack.pth(4.64 MB)
freeway.pth(4.19 MB)
frostbite.pth(6.44 MB)
gopher.pth(4.94 MB)
hero.pth(6.44 MB)
jamesbond.pth(6.44 MB)
kangaroo.pth(6.44 MB)
krull.pth(6.44 MB)
kung_fu_master.pth(5.84 MB)
ms_pacman.pth(5.09 MB)
pong.pth(4.64 MB)
private_eye.pth(6.44 MB)
qbert.pth(4.64 MB)
road_runner.pth(6.44 MB)
seaquest.pth(6.44 MB)
up_n_down.pth(4.64 MB)
1.3(Oct 24, 2018)

Pretrained models for several games. Note that performance can vary greatly between runs on some games (particularly hard exploration games such as Frostbite, H.E.R.O. and Montezuma's Revenge). Reported scores achieved for all listed games except H.E.R.O. and Montezuma's Revenge.

Reward and Q-value plots are provided for each of the following games: Asteroids, Boxing, Breakout, Beam Rider, Enduro, Freeway, Frostbite, H.E.R.O., Montezuma's Revenge, Ms. Pac-Man, Pong, Q*bert, Seaquest, Space Invaders, Video Pinball.
Source code(tar.gz)
Source code(zip)
asteroids.pth(41.55 MB)
beam_rider.pth(40.05 MB)
boxing.pth(42.75 MB)
breakout.pth(38.56 MB)
enduro.pth(40.05 MB)
freeway.pth(38.26 MB)
frostbite.pth(42.75 MB)
hero.pth(42.75 MB)
montezuma_revenge.pth(42.75 MB)
ms_pacman.pth(40.05 MB)
pong.pth(39.15 MB)
qbert.pth(39.15 MB)
seaquest.pth(42.75 MB)
space_invaders.pth(39.15 MB)
video_pinball.pth(40.05 MB)

Owner

Kai Arulkumaran

Researcher, programmer, DJ, transhumanist.
