Deep reinforcement learning methods
Their advantages and shortcomings
Ashley Hill
CEA, LIST, LCSR
4th May 2020
Ashley Hill (CEA, LIST, LCSR) Deep reinforcement learning methods 4th May 2020 1 / 97
Who am I?
Ashley Hill, PhD student at CEA
Reinforcement learning
Reinforcement learning History of deep learning
Reinforcement learning Reinforcement learning introduction
[Figure: agent-environment interaction loop, labelled with reward r_t, action a_t, then r_{t+1} and a_{t+1}]
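A minimal sketch of the interaction loop (action a_t sent to the environment, reward r_{t+1} received back), written in plain Python. The toy environment `CoinFlipEnv`, the helper `run_episode`, and the `reset`/`step` method names are assumptions made for illustration; they are not taken from the slides.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip, reward 1.0 if correct."""

    def __init__(self, horizon=10, seed=0):
        self.rng = random.Random(seed)
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # a single dummy state

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return 0, reward, done


def run_episode(env, policy):
    """Run one episode and return the sum of rewards collected."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # agent picks a_t from s_t
        state, reward, done = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        total_reward += reward
    return total_reward


# A fixed policy that always answers 1; a learning agent would adapt it.
episode_return = run_episode(CoinFlipEnv(), policy=lambda state: 1)
```

The loop itself does not depend on the learning method: DQN, DDPG, A2C and PPO differ in how `policy` is represented and updated from the collected rewards.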
Overheated
1 Sutton, Barto, et al., Introduction to reinforcement learning.
Deep Q network
Deep Q network Examples
Deep Q network Building the Deep Q network
1 Hornik, Stinchcombe, and White, “Universal approximation of an unknown mapping
2 Sutton, Barto, et al., Introduction to reinforcement learning.
3 Sutton, Barto, et al., Introduction to reinforcement learning.
Deep Q network Stabilizing the Deep Q network
Deep Q network Deep Q network (DQN) method
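A minimal sketch of two pieces commonly found in a DQN training loop, assuming the usual setup in which bootstrap values come from a separate target network and actions are chosen epsilon-greedily. The function names `td_target` and `epsilon_greedy`, the default `gamma`, and the list-of-floats representation of Q values are assumptions for illustration, not taken from the slides.

```python
import random


def td_target(reward, next_q_values, done, gamma=0.99):
    """Regression target for Q(s, a): r + gamma * max_a' Q_target(s', a').

    `next_q_values` is assumed to be the target network's output for s'.
    On a terminal transition there is nothing to bootstrap from.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0, rng=rng)
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.9)
```

The network is then trained to reduce the gap between Q(s, a) and `target`, typically on minibatches sampled from a replay buffer.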
Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient Examples
Deep Deterministic Policy Gradient Building the Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient Exploring with Deep Deterministic Policy Gradient
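Because the DDPG policy is deterministic, exploration is usually obtained by adding noise to the action it outputs. The sketch below assumes independent Gaussian noise for simplicity (the DDPG paper uses an Ornstein-Uhlenbeck process instead); the function name `noisy_action` and its parameters are hypothetical.

```python
import random


def noisy_action(mu_action, low, high, sigma, rng):
    """Add zero-mean Gaussian noise to each action component, then clip.

    `mu_action` is the deterministic action from the actor, given as a list
    of floats; `low` and `high` are scalar bounds of the action space.
    """
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in mu_action]


rng = random.Random(0)
explored = noisy_action([0.5, -0.2], low=-1.0, high=1.0, sigma=0.1, rng=rng)
```

Setting `sigma` to 0 recovers the deterministic action, which is what one would typically use at evaluation time.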
Deep Deterministic Policy Gradient Deep Deterministic Policy Gradient (DDPG) method
[Figure: DDPG model, with outputs a and q(s, a)]
5 Lillicrap et al., “Continuous control with deep reinforcement learning”.
Advantage Actor Critic
Advantage Actor Critic Off-policy, On-policy, policy gradient, and value function
Advantage Actor Critic Building the Advantage Actor Critic
(from https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752)
Advantage Actor Critic Advantage Actor Critic (A2C) model
[Figure: A2C model, with outputs µ_a, σ_a and v(s)]
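A minimal sketch of how the critic output v(s) can be turned into an advantage signal for the actor, assuming the advantage is estimated as the discounted return minus v(s). The function names, the list-based rollout format and the `bootstrap_value` parameter are assumptions for illustration, not taken from the slides.

```python
def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """Discounted return at each step of a rollout, computed backwards.

    `bootstrap_value` stands for v(s) of the state after the last step
    (0.0 if the rollout ended in a terminal state).
    """
    returns, running = [], bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage estimate: how much better the return was than v(s) predicted."""
    return [g - v for g, v in zip(returns, values)]


rollout_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
rollout_advantages = advantages(rollout_returns, values=[1.0, 1.0, 1.0])
```

The actor is then pushed towards actions with positive advantage and away from those with negative advantage, while the critic is regressed towards the returns.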
6 Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning”.
7 Silver et al., “Mastering the game of go without human knowledge”.
Overview
Overview Sample efficiency
Overview Overview of seen methods
Overview Common Pitfalls
Conclusion
Conclusion Stable Baselines
Conclusion Practical session (TP)
Conclusion End of the presentation
Appendix
- fixed tf.session().__enter__() being used, rather than sess = tf.session() and passing the session to the objects
- fixed uneven scoping of TensorFlow Sessions throughout the code
- fixed rolling vecwrapper to handle observations that are not only grayscale images
- fixed deepq saving the environment when trying to save itself
- fixed ValueError: Cannot take the length of Shape with unknown rank. in acktr, when running run_atari.py script.
- fixed calling baselines sequentially no longer creates graph conflicts
- fixed mean on empty array warning with deepq
- fixed kfac eigen decomposition not cast to float64, when the parameter use_float64 is set to True
- fixed Dataset data loader, not correctly resetting id position if shuffling is disabled
- fixed EOFError when reading from connection in the worker in subproc_vec_env.py
- fixed behavior clone weight loading and saving for GAIL
- avoid taking square root of negative number in trpo_mpi.py
- fixed render function ignoring parameters when using wrapped environments
- fixed numpy warning when using DDPG Memory
- fixed DummyVecEnv not copying the observation array when stepping and resetting
- fixed graphs issues, so models won't collide in names
- fixed behavior clone weight loading for GAIL
- fixed Tensorflow using all the GPU VRAM
- fixed models so that they are all compatible with vectorized environments
- fixed `set_global_seed` to update `gym.spaces`'s random seed
- fixed PPO1 and TRPO performance issues when learning identity function
- fixed DQN wrapping for atari
- fixed ACER buffer with constant values assuming n_stack=4
- fixed some RL algorithms not clipping the action to be in the action space, when using `gym.spaces.Box`
- removed unused, undocumented and crashing function reset_task in subproc_vec_env.py
- ...
[Figure with labels π(z0) and π(z1)]
Neural networks
Neural networks Introduction
Neural networks Perceptron
Neural networks Multilayer Perceptron (MLP)
[Figure: multilayer perceptron diagram, with layer s(0) (units 1 to 7), layer s(1) (units 1 to 5) and connections indexed 0,0 through 5,7]
Neural networks Policy Gradient issues
Neural networks Building the Proximal Policy Optimization
Neural networks Proximal Policy Optimization (PPO) model
[Figure: PPO model, with outputs µ_a, σ_a and v(s)]
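A minimal sketch of the per-sample clipped surrogate objective from the PPO paper, assuming log-probabilities of the taken action under the new and old policies are available. The function name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions for illustration, not taken from the slides.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A), with r = pi_new / pi_old.

    The value is to be maximised; taking the minimum removes any incentive
    to move the probability ratio r outside [1 - eps, 1 + eps].
    """
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes the action twice as likely and the advantage is
# positive, so the gain is capped at (1 + eps) * A.
capped = ppo_clip_objective(math.log(2.0), 0.0, advantage=1.0)
```

With a negative advantage and the same ratio, the unclipped term is the smaller one, so the penalty for making a bad action more likely is not capped.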
8 Schulman et al., “Proximal Policy Optimization Algorithms”.