


1. NPFL122, Lecture 6: Rainbow. Milan Straka, November 19, 2018. Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics. Unless otherwise stated.

2. Function Approximation

We will approximate the state-value function $v$ and/or the action-value function $q$, choosing from a family of functions parametrized by a weight vector $w \in \mathbb{R}^d$. We denote the approximations as $\hat{v}(s, w)$ and $\hat{q}(s, a, w)$.

We utilize the Mean Squared Value Error objective, denoted $\overline{VE}$:

$$\overline{VE}(w) \stackrel{\mathrm{def}}{=} \sum_{s \in \mathcal{S}} \mu(s) \big[ v_\pi(s) - \hat{v}(s, w) \big]^2,$$

where the state distribution $\mu(s)$ is usually the on-policy distribution.
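As a concrete illustration, here is a minimal NumPy sketch of the $\overline{VE}$ objective for a linear approximator $\hat{v}(s, w) = w^\top x(s)$; the feature matrix, true values, and state distribution below are hypothetical placeholders, not from the lecture.

```python
import numpy as np

def msve(w, features, v_pi, mu):
    """Mean Squared Value Error for a linear approximator v_hat(s, w) = features[s] @ w.

    features: |S| x d feature matrix, v_pi: |S| true values, mu: |S| on-policy distribution.
    """
    v_hat = features @ w                      # approximate values for all states
    return np.sum(mu * (v_pi - v_hat) ** 2)   # mu-weighted squared errors

# Example usage with 3 hypothetical states and 2 features.
features = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
v_pi = np.array([1.0, 2.0, 3.0])
mu = np.array([0.5, 0.3, 0.2])
print(msve(np.zeros(2), features, v_pi, mu))
```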

3. Gradient and Semi-Gradient Methods

The functional approximation (i.e., the weight vector $w$) is usually optimized using gradient methods, for example as

$$w_{t+1} \leftarrow w_t - \tfrac{1}{2} \alpha \nabla \big[ v_\pi(S_t) - \hat{v}(S_t, w_t) \big]^2$$
$$\phantom{w_{t+1}} \leftarrow w_t + \alpha \big[ v_\pi(S_t) - \hat{v}(S_t, w_t) \big] \nabla \hat{v}(S_t, w_t).$$

As usual, $v_\pi(S_t)$ is estimated by a suitable sample. For example, in Monte Carlo methods we use the episodic return $G_t$, and in temporal difference methods we employ bootstrapping and use $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$.
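Below is a small sketch of one semi-gradient TD(0) update under the same linear-approximation assumption as above; the function and argument names are my own.

```python
import numpy as np

def semi_gradient_td0_update(w, x_t, r_next, x_next, alpha, gamma, done):
    """One semi-gradient TD(0) step for a linear v_hat(s, w) = w @ x(s)."""
    v_t = w @ x_t
    # Bootstrapped sample of v_pi(S_t): R_{t+1} + gamma * v_hat(S_{t+1}, w).
    target = r_next if done else r_next + gamma * (w @ x_next)
    # Semi-gradient: the target is treated as a constant, so the gradient of
    # v_hat(S_t, w) with respect to w is just the feature vector x_t.
    return w + alpha * (target - v_t) * x_t
```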

4. Deep Q Network

An off-policy Q-learning algorithm with a convolutional neural network function approximation of the action-value function. Training can be extremely brittle (and can even diverge, as shown earlier).

Figure 1 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al. (the network architecture: convolutional layers followed by fully connected layers).

5. Deep Q Networks

Preprocessing: the 210×160 128-color images are converted to grayscale and then resized to 84×84.

A frame skipping technique is used, i.e., only every 4th frame (out of 60 per second) is considered, and the selected action is repeated on the other frames.

Input to the network are the last 4 frames (considering only the frames kept by frame skipping), i.e., an image with 4 channels.

The network is fairly standard, consisting of
- 32 filters of size 8×8 with stride 4 and ReLU,
- 64 filters of size 4×4 with stride 2 and ReLU,
- 64 filters of size 3×3 with stride 1 and ReLU,
- a fully connected layer with 512 units and ReLU,
- an output layer with 18 output units (one for each action).
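A possible PyTorch sketch of the described architecture follows; the lecture does not prescribe a framework, and the class name and the 18-action default are illustrative.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Sketch of the described DQN architecture; input is a batch of 4 stacked 84x84 frames."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 is the spatial size after the convolutions
            nn.Linear(512, num_actions),             # one output per action
        )

    def forward(self, x):
        return self.layers(x)
```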

6. Deep Q Networks

The network is trained with RMSProp to minimize the loss

$$L \stackrel{\mathrm{def}}{=} \mathbb{E}_{(s, a, r, s') \sim \text{data}} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta) \big)^2 \Big].$$

An $\varepsilon$-greedy behavior policy is utilized.

Important improvements:
- experience replay: the generated episodes are stored in a buffer as $(s, a, r, s')$ quadruples, and for training a transition is sampled uniformly;
- separate target network $\bar\theta$: to prevent instabilities, a separate target network is used to estimate the action-value function. Its weights are not trained, but copied from the trained network once in a while;
- reward clipping of $r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)$ to $[-1, 1]$.
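The following sketch shows how one training step with a separate target network might look in PyTorch, assuming the DQNNetwork above and a hypothetical `batch` of tensors sampled uniformly from the replay buffer. It is an illustration of the idea, not the original implementation; the error clipping from the slide is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def dqn_train_step(network, target_network, optimizer, batch, gamma=0.99):
    # batch: states [B,4,84,84], actions [B] (int64), rewards [B], next_states, dones [B] (0/1 floats)
    states, actions, rewards, next_states, dones = batch
    q_values = network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # the target network is not trained
        next_q = target_network(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones)
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Once in a while, the target network weights are copied from the trained network:
# target_network.load_state_dict(network.state_dict())
```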

7. Deep Q Networks Hyperparameters

Hyperparameter                              Value
minibatch size                              32
replay buffer size                          1M
target network update frequency             10k
discount factor                             0.99
training frames                             50M
RMSProp learning rate and momentum          0.00025, 0.95
initial ε, final ε, and frame of final ε    1.0, 0.1, 1M
replay start size                           50k
no-op max                                   30
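For convenience, the same hyperparameters as a plain Python dictionary; the key names are my own, not from the paper.

```python
# DQN hyperparameters from the table above, as a config dict.
DQN_HYPERPARAMS = {
    "minibatch_size": 32,
    "replay_buffer_size": 1_000_000,
    "target_network_update_frequency": 10_000,
    "discount_factor": 0.99,
    "training_frames": 50_000_000,
    "rmsprop_learning_rate": 0.00025,
    "rmsprop_momentum": 0.95,
    "epsilon_initial": 1.0,
    "epsilon_final": 0.1,
    "epsilon_final_frame": 1_000_000,
    "replay_start_size": 50_000,
    "no_op_max": 30,
}
```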

8. Rainbow

There have been many suggested improvements to the DQN architecture. At the end of 2017, the paper Rainbow: Combining Improvements in Deep Reinforcement Learning combined seven of them into a single architecture called Rainbow.

Figure 1 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.

9. Rainbow DQN Extensions: Double Q-learning

Similarly to double Q-learning, instead of

$$r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta),$$

we minimize

$$r + \gamma Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \bar\theta\big) - Q(s, a; \theta).$$

Figure 1 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.: the orange bars show the bias in a single Q-learning update when the action values are $Q(s, a) = V_*(s) + \epsilon_a$ and the errors $\{\epsilon_a\}_{a=1}^m$ are independent standard normal random variables. The second set of action values $Q'$, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions. (The plot shows the error as a function of the number of actions.)
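A sketch of the Double DQN target computation in PyTorch, using the same hypothetical network objects as before: the online network (parameters $\theta$) selects the argmax action, and the target network (parameters $\bar\theta$) evaluates it.

```python
import torch

def double_dqn_targets(network, target_network, rewards, next_states, dones, gamma=0.99):
    """Compute r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta_bar) for a batch."""
    with torch.no_grad():
        best_actions = network(next_states).argmax(dim=1, keepdim=True)          # argmax under online parameters
        next_q = target_network(next_states).gather(1, best_actions).squeeze(1)  # evaluated under target parameters
        return rewards + gamma * next_q * (1 - dones)
```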

10. Rainbow DQN Extensions: Double Q-learning

Figure 2 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al. (panels: true value and an estimate; all estimates and max; bias as a function of state; average error; comparing $\max_a Q_t(s, a)$, $Q_*(s, a)$, and the Double-Q estimate).

11. Rainbow DQN Extensions: Double Q-learning

Figure 3 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al. (value estimates of DQN and Double DQN over 200 million training steps on Alien, Space Invaders, Time Pilot, and Zaxxon, plus value estimates and scores on Wizard of Wor and Asterix).

12. Rainbow DQN Extensions: Double Q-learning

Table 1 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.

Table 2 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.

13. Rainbow DQN Extensions: Prioritized Replay

Instead of sampling the transitions uniformly from the replay buffer, we prefer those with a large TD error. Therefore, we sample transitions according to their probability

$$p_t \propto \Big| r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta) \Big|^\omega,$$

where $\omega$ controls the shape of the distribution (which is uniform for $\omega = 0$ and corresponds to the TD error for $\omega = 1$).

New transitions are inserted into the replay buffer with maximum probability to support exploration of all encountered transitions.
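A simple sketch of proportional prioritized sampling follows. It uses a plain O(N) computation over stored |TD error| values (the paper uses a sum-tree for efficiency), and the small constant `eps` added to the priorities is an assumption to keep every probability nonzero.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, omega=0.6, eps=1e-6):
    """Sample transition indices with probability proportional to |TD error|^omega."""
    priorities = (np.abs(td_errors) + eps) ** omega   # omega = 0 recovers uniform sampling
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(td_errors), size=batch_size, p=probs)
    return indices, probs[indices]                    # sampled indices and their probabilities
```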

14. Rainbow DQN Extensions: Prioritized Replay

Because we now sample transitions according to $p_t$ instead of uniformly, the on-policy distribution and the sampling distribution differ. To compensate, we therefore utilize importance sampling with the ratio

$$\rho_t = \left( \frac{1}{N \cdot p_t} \right)^\beta.$$

The authors in fact utilize $\rho_t / \max_i \rho_i$ "for stability reasons".
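And a matching sketch of the importance-sampling correction, normalized by the maximum weight as described above; the `beta` default is illustrative.

```python
import numpy as np

def importance_weights(sampled_probs, buffer_size, beta=0.4):
    """Compute (1 / (N * p_t))^beta, normalized by the maximum weight."""
    weights = (1.0 / (buffer_size * sampled_probs)) ** beta
    return weights / weights.max()   # normalization "for stability reasons"
```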
