Deep Neural Networks and Deep Reinforcement Learning


  1. Deep Neural Networks and Deep Reinforcement Learning. References: Deep Learning, Goodfellow, Bengio and Courville [chapt. 6, 7, 8]; AIMA [sect. 21.1-21.3]; Sutton and Barto, Reinforcement Learning: An Introduction, 2nd Edition [sect. 5.1-5.3, 6.1-6.3, 6.5]

  2. Outline ♦ Neural Networks: intro ♦ Deep Neural Networks ♦ Deep Reinforcement Learning ♦ Deep Q Network ♦ Slides based on a course offered by Prof. Pascal Poupart at the Univ. of Waterloo.

  3. Reinforcement Learning: key points ♦ MDPs and Value Iteration (planning): $V_{k+1}(s) = \max_a \sum_{s'} T(s,a,s')\,(R(s,a,s') + \gamma V_k(s'))$ ♦ TD Learning and Q-Learning (reinforcement learning): $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,(R(s,a,s') + \gamma \max_{a'} Q(s',a'))$ ♦ Key issue: the number of states and actions may be too large to maintain and update a Q-table
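
A minimal sketch of the tabular Q-learning update written above, using a NumPy table; the toy table sizes and the example transition are illustrative assumptions, not part of the slides.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular update: Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# Toy example: 5 states, 2 actions, one observed transition (s=0, a=1, r=1.0, s'=3)
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```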

  4. Reinforcement Learning: example of large state spaces ♦ the game of Go: $3^{361}$ board configurations

  5. Reinforcement Learning: example of large state spaces ♦ cart-pole control problem: the state $(x, x', \theta, \theta')$ is continuous

  6. Reinforcement Learning: example of large state spaces ♦ Atari games: 210 x 160 x 3 pixel observations (RGB channels)

  7. Key idea: function approximation ♦ Which functions are we interested in approximating? the policy $\pi(s) \to a$, the value function $V(s) \in \mathbb{R}$, the Q-function $Q(s,a) \in \mathbb{R}$

  8. Q-function approximation ♦ The state is a vector of features: $s = (x_1, x_2, \dots, x_n)^T$; CartPole: $s = (x, x', \theta, \theta')^T$; Atari: the pixel values ♦ Linear approximation: $Q(s,a) \approx \sum_{i=1}^{n} w_{ai} x_i$ ♦ Non-linear (neural network): $Q(s,a) \approx g(x; w)$
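
A minimal sketch of the linear approximation above, with one weight vector per action so that $Q(s,a) \approx w_a^T x$; the CartPole-like feature layout and values are illustrative assumptions.

```python
import numpy as np

n_features, n_actions = 4, 2          # e.g. CartPole: (x, x', theta, theta') and {left, right}
W = np.random.uniform(-1, 1, size=(n_actions, n_features))

def q_linear(s, W):
    """Q(s, a) ~ w_a^T x for every action a; returns a vector of Q-values."""
    return W @ s

s = np.array([0.1, -0.5, 0.02, 0.3])  # one CartPole-like state
print(q_linear(s, W))                  # Q(s, a) for a = 0, 1
```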

  9. Feed-forward ANN ♦ network of units (computational neurons) ♦ a DAG connecting units with weighted edges ♦ each unit computes $h(w^T x + b)$, where $w$ are the weights, $x$ the inputs to the unit, $b$ the bias, and $h$ the activation function, usually non-linear

  10. One hidden layer ANN ♦ hidden units: $z_j = h_1(w_j^{(1)}\, x + b_j^{(1)})$ ♦ output units: $y_k = h_2(w_k^{(2)}\, z + b_k^{(2)})$ ♦ overall: $y_k = h_2\big(\sum_j w_{kj}^{(2)}\, h_1\big(\sum_i w_{ji}^{(1)} x_i + b_j^{(1)}\big) + b_k^{(2)}\big)$, where $w$ are the weights, $x$ the inputs, $b$ the biases, and $h_1, h_2$ the activation functions, usually non-linear
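
A minimal NumPy sketch of the one-hidden-layer forward pass above; the layer sizes and the choice of tanh hidden units with an identity output are illustrative assumptions.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h1=np.tanh, h2=lambda v: v):
    """y = h2(W2 h1(W1 x + b1) + b2): hidden layer, then output layer."""
    z = h1(W1 @ x + b1)   # hidden units z_j
    y = h2(W2 @ z + b2)   # output units y_k
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # 4 input features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # 8 hidden units
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)    # 2 outputs
print(forward(x, W1, b1, W2, b2))
```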

  11. Activation function h ♦ threshold: $h(x) = 1$ if $x \geq 0$, $-1$ otherwise ♦ sigmoid: $h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$ ♦ tanh: $h(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ ♦ gaussian: $h(x) = e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$ ♦ identity: $h(x) = x$ ♦ rectified linear (ReLU): $h(x) = \max\{0, x\}$
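
A minimal sketch of the listed activation functions in NumPy; the default $\mu$ and $\sigma$ of the gaussian are illustrative assumptions.

```python
import numpy as np

def threshold(x): return np.where(x >= 0, 1.0, -1.0)
def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def tanh(x):      return np.tanh(x)
def gaussian(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)
def identity(x):  return x
def relu(x):      return np.maximum(0.0, x)
```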

  12. Universal approximation property. Theorem (Hornik et al., 1989; Cybenko, 1989): a feed-forward network with a linear output layer and at least one hidden layer with any "squashing" activation function (sigmoid/tanh/gaussian) can approximate any function arbitrarily closely, provided that the network is given enough hidden units. ♦ "any function" means any continuous function on a closed and bounded subset of $\mathbb{R}^n$ (relationship with Borel measurability)

  13. Minimize least squared error ♦ Key idea to optimize the weights: minimize the error of the network output (loss): $E(W) = \sum_n E_n(W) = \sum_n \frac{1}{2}\,|f(x_n; W) - y_n|^2$ ♦ Non-convex optimization problem: train by gradient descent. Given a sample $(x_n, y_n)$, update the weights as follows: $w_{ji} \leftarrow w_{ji} - \eta \frac{\partial E_n}{\partial w_{ji}}$ ♦ The backpropagation algorithm computes the gradient in an ANN
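
A minimal sketch of one gradient-descent step on the squared error above, with the gradient computed by backpropagation for a one-hidden-layer network with tanh hidden units and a linear output layer; the architecture and the learning rate are illustrative assumptions.

```python
import numpy as np

def sgd_step(x, t, W1, b1, W2, b2, eta=0.01):
    """One gradient-descent step on E_n = 0.5 * ||f(x; W) - t||^2 via backpropagation."""
    # forward pass
    a1 = W1 @ x + b1
    z = np.tanh(a1)
    y = W2 @ z + b2                    # linear output layer
    # backward pass
    dy = y - t                         # dE/dy
    dW2, db2 = np.outer(dy, z), dy
    dz = W2.T @ dy
    da1 = dz * (1.0 - z ** 2)          # tanh'(a1) = 1 - tanh(a1)^2
    dW1, db1 = np.outer(da1, x), da1
    # gradient-descent update on every weight and bias
    return (W1 - eta * dW1, b1 - eta * db1, W2 - eta * dW2, b2 - eta * db2)
```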

  14. Deep Neural Networks ♦ Deep NN: an ANN with many hidden layers ♦ Benefit: high expressivity (i.e., compact representation) ♦ Issues: can we train a deep NN in the same way? can we avoid overfitting?

  15. Example: Image Classification ♦ ImageNet Large Scale Visual Recognition Challenge

  16. Vanishing gradient ♦ Deep neural networks that use "squashing" functions (e.g., sigmoid, tanh) suffer from vanishing gradients

  17. Sigmoid and hyperbolic functions ♦ The derivatives of the sigmoid and tanh are at most one (and typically much smaller)! ♦ when backpropagating gradients we multiply several numbers that are less than one, so the product shrinks

  18. Example: vanishing gradient ♦ $y = t(w_3\, t(w_2\, t(w_1 x)))$, where $t(\cdot)$ is the tanh function; write $a_1 = w_1 x$, $a_2 = w_2 t(a_1)$, $a_3 = w_3 t(a_2)$ ♦ common weight initialization in $(-1, 1)$ ♦ the tanh function and its derivative are less than 1 ♦ vanishing gradient: $\frac{\partial y}{\partial w_3} = t'(a_3)\, t(a_2)$, $\quad \frac{\partial y}{\partial w_2} = t'(a_3)\, w_3\, t'(a_2)\, t(a_1) \leq \frac{\partial y}{\partial w_3}$, $\quad \frac{\partial y}{\partial w_1} = t'(a_3)\, w_3\, t'(a_2)\, w_2\, t'(a_1)\, x \leq \frac{\partial y}{\partial w_2}$
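
A minimal numerical sketch of the example above, applying the chain rule to $y = t(w_3\, t(w_2\, t(w_1 x)))$; the specific input and the random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
w1, w2, w3 = rng.uniform(-1, 1, size=3)   # common initialization in (-1, 1)
x = 0.5

# forward pass: y = t(w3 t(w2 t(w1 x)))
a1 = w1 * x
a2 = w2 * np.tanh(a1)
a3 = w3 * np.tanh(a2)

def dtanh(a):
    return 1.0 - np.tanh(a) ** 2

# chain rule, exactly as on the slide
dy_dw3 = dtanh(a3) * np.tanh(a2)
dy_dw2 = dtanh(a3) * w3 * dtanh(a2) * np.tanh(a1)
dy_dw1 = dtanh(a3) * w3 * dtanh(a2) * w2 * dtanh(a1) * x

# the magnitudes typically shrink toward the input layer (vanishing gradient)
print(abs(dy_dw3), abs(dy_dw2), abs(dy_dw1))
```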

  19. Mitigations for vanishing gradients ♦ typical solutions to mitigate vanishing gradients: pre-training, rectified linear units (ReLU), batch normalization, skip connections

  20. Rectified Linear Units (ReLU) ♦ Rectified linear: $h(x) = \max(0, x)$; its gradient is 0 or 1, it is piecewise linear, and it yields sparse computation ♦ Smooth approximation (softplus): $h(x) = \log(1 + e^x)$ ♦ Softplus does not mitigate gradient vanishing
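
A minimal sketch of ReLU and softplus with their derivatives, illustrating why the ReLU gradient does not shrink while the softplus gradient stays below one; the numerically stable softplus formulation is an implementation choice, not from the slides.

```python
import numpy as np

def relu(x):          return np.maximum(0.0, x)
def relu_grad(x):     return (x > 0).astype(float)       # gradient is exactly 0 or 1

def softplus(x):      # stable log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|})
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))
def softplus_grad(x): return 1.0 / (1.0 + np.exp(-x))    # = sigmoid(x), always < 1

x = np.array([-2.0, 0.0, 3.0])
print(relu_grad(x), softplus_grad(x))
```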

  21. Deep Reinforcement Learning: key points ♦ For many real-world domains we cannot explicitly represent the key functions for RL ($\pi(s)$, $V(s)$, $Q(s,a)$) ♦ We can try to approximate them: linear approximation, neural network approximation (Deep RL) ♦ A Deep Q-Network (DQN) approximates $Q(s,a)$ with a DNN

  22. Gradient Q-Learning ♦ approximate $Q(s,a)$ with a parametrized function $Q_w(s,a)$ ♦ minimize the squared error between estimate and target: estimate $Q_w(s,a)$, target $r(s,a,s') + \gamma \max_{a'} Q_w(s',a')$ ♦ squared error: $Err(w) = (Q_w(s,a) - r(s,a,s') - \gamma \max_{a'} Q_w(s',a'))^2$ ♦ gradient: $\frac{\partial Err(w)}{\partial w} = 2\,(Q_w(s,a) - r(s,a,s') - \gamma \max_{a'} Q_w(s',a'))\,\frac{\partial Q_w(s,a)}{\partial w}$ (the scalar 2 is a constant factor and not important for the update)

  23. Gradient Q-Learning Algorithm. Algorithm 1 (Gradient Q-Learning): 1: initialize weights $w$ randomly in $[-1, 1]$; 2: initialize $s$ (observe the current state); 3: loop; 4: select and execute action $a$; 5: observe the new state $s'$ and receive the immediate reward $r$; 6: $\frac{\partial Err(w)}{\partial w} = (Q_w(s,a) - r - \gamma \max_{a'} Q_w(s',a'))\,\frac{\partial Q_w(s,a)}{\partial w}$; 7: update the weights: $w \leftarrow w - \alpha \frac{\partial Err(w)}{\partial w}$; 8: update the state: $s \leftarrow s'$; 9: end loop
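
A minimal sketch of Algorithm 1 with a linear approximator $Q_w(s,a) = w_a^T x$, so that $\partial Q_w(s,a)/\partial w_a = x$; the Gym-style environment interface (`env.reset`, `env.step` returning `(s_next, r, done)`) and the epsilon-greedy action selection are illustrative assumptions.

```python
import numpy as np

def gradient_q_learning(env, n_features, n_actions, alpha=0.01, gamma=0.99,
                        epsilon=0.1, n_steps=10_000):
    """Gradient Q-learning with a linear approximator Q_w(s, a) = w_a^T x."""
    rng = np.random.default_rng(0)
    W = rng.uniform(-1, 1, size=(n_actions, n_features))    # step 1
    s = env.reset()                                          # step 2
    for _ in range(n_steps):                                 # step 3
        # step 4: epsilon-greedy selection (an assumption; the slide only says "select")
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(W @ s))
        s_next, r, done = env.step(a)                        # step 5
        target = r if done else r + gamma * np.max(W @ s_next)
        td_error = W[a] @ s - target                         # Q_w(s,a) - target
        W[a] -= alpha * td_error * s                         # steps 6-7: dQ_w/dw_a = x
        s = env.reset() if done else s_next                  # step 8
    return W
```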

  24. Convergence of tabular Q-Learning ♦ $Q(s,a) \leftarrow Q(s,a) + \alpha\,(r(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a))$ ♦ Tabular Q-learning converges to the optimal policy: if you explore enough, and if you make the learning rate small enough... but do not decrease it too quickly, e.g. $\alpha = 1/n(s,a)$, where $n(s,a)$ is the number of visits to $(s,a)$

  25. Convergence of linear gradient Q-Learning ♦ linear approximation of $Q(s,a)$: $Q(s,a) \approx \sum_i w_{ai} x_i = w_a^T x$ ♦ learning rate $\alpha_t = 1/t$ ♦ gradient Q-learning update: $w \leftarrow w - \alpha_t\,(Q_w(s,a) - r - \gamma \max_{a'} Q_w(s',a'))\,\frac{\partial Q_w(s,a)}{\partial w}$ ♦ under these conditions, linear gradient Q-learning converges

  26. Non-convergence of non-linear gradient Q-Learning ♦ non-linear approximation of $Q(s,a)$: $Q(s,a) \approx g(x; w)$ ♦ even if $\alpha_t = 1/t$, gradient Q-learning may not converge ♦ Issue: we update the weights to reduce the error for a specific experience (i.e., a specific $(s,a)$), but by changing the weights we may change $Q(s,a)$ potentially everywhere; this also happens with linear approximation, but in that case convergence can still be guaranteed

  27. Mitigating divergence ♦ Two main approaches to mitigate divergence: 1. experience replay; 2. use two different networks: a Q-network and a target network
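
A minimal sketch of the two-network idea mentioned above: the target is computed with a separate, periodically synchronized copy of the weights. The linear approximator and the copy interval are illustrative assumptions, not the slides' specification.

```python
import numpy as np

class TwoNetworkQ:
    """Keep a separate target network: targets use older weights, updated only periodically."""
    def __init__(self, n_features, n_actions, copy_every=1000):
        rng = np.random.default_rng(0)
        self.W = rng.uniform(-1, 1, size=(n_actions, n_features))  # Q-network (linear here)
        self.W_target = self.W.copy()                               # target network
        self.copy_every, self.steps = copy_every, 0

    def target(self, r, s_next, gamma=0.99):
        return r + gamma * np.max(self.W_target @ s_next)           # uses the frozen weights

    def update(self, s, a, r, s_next, alpha=0.01, gamma=0.99):
        td_error = self.W[a] @ s - self.target(r, s_next, gamma)
        self.W[a] -= alpha * td_error * s
        self.steps += 1
        if self.steps % self.copy_every == 0:
            self.W_target = self.W.copy()                           # periodic synchronization
```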

  28. Experience replay ♦ Store previous experiences (i.e., tuples $(s, a, s', r)$) and reuse them at each step: store previous $(s, a, s', r)$ in a dedicated memory buffer; at each step sample a mini-batch from this buffer and use it to update the weights ♦ Benefits: 1. reduces the correlation between successive samples (increases stability); 2. reduces the number of interactions with the environment (increases data efficiency)
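
A minimal sketch of a replay buffer as described above; the buffer capacity and mini-batch size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, s', r) tuples with uniform mini-batch sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are evicted automatically

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# usage: at every step, store the transition and update the weights on a sampled mini-batch
# buf = ReplayBuffer(); buf.store(s, a, s_next, r); batch = buf.sample(32)
```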
