  1. Deep Reinforcement Learning. Shan-Hung Wu (shwu@cs.nthu.edu.tw), Department of Computer Science, National Tsing Hua University, Taiwan. Machine Learning course.

  2. Outline: 1. Introduction; 2. Value-based Deep RL (Deep Q-Network, Improvements); 3. Policy-based Deep RL (Pathwise Derivative Methods, Policy Gradient/Optimization Methods, Variance Reduction and Actor-Critic).

  5. (Tabular) RL.
     Q-learning: $Q^*(s,a) \leftarrow Q^*(s,a) + \eta\,[(R(s,a,s') + \gamma \max_{a'} Q^*(s',a')) - Q^*(s,a)]$
     SARSA: $Q^\pi(s,a) \leftarrow Q^\pi(s,a) + \eta\,[(R(s,a,s') + \gamma\, Q^\pi(s',\pi(s'))) - Q^\pi(s,a)]$
     In realistic environments with large state/action spaces, these methods require a huge table to store the $Q^*$/$Q^\pi$ values: Maze: $O(10^{1})$ states; Tetris: $O(10^{60})$; Atari: $O(10^{16922})$ pixel configurations. And what about continuous states/actions? We may also not be able to visit all $(s,a)$ pairs in limited training time. (A minimal tabular sketch follows below.)

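To make the tabular updates above concrete, here is a minimal Q-learning sketch with an epsilon-greedy exploration policy. Everything environment-specific is an assumption for illustration: the hypothetical `env` interface (`reset()` returning an integer state index, `step(a)` returning `(s_next, reward, done)`) and the values of `eta`, `gamma`, and `epsilon` are not from the slides.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions,
                       eta=0.1, gamma=0.99, epsilon=0.1, episodes=1000):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = np.zeros((n_states, n_actions))              # the Q* table
    for _ in range(episodes):
        s, done = env.reset(), False                 # hypothetical env interface
        while not done:
            # epsilon-greedy exploration policy derived from the current table
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD target: r + gamma * max_a' Q(s', a') (zero beyond terminal states)
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += eta * (target - Q[s, a])      # Q-learning update
            s = s_next
    return Q
```

SARSA would differ only in the target: instead of `np.max(Q[s_next])` it would use `Q[s_next, a_next]`, where `a_next` is the action actually chosen by the same exploration policy.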

  7. Generalizing across States.
     Idea: learn a function $f_{Q^*}(s,a;\Theta)$ (resp. $f_{Q^\pi}$) that approximates $Q^*(s,a)$ (resp. $Q^\pi(s,a)$) for all $s,a$:
     - trained from a small number (millions) of samples,
     - generalizes to unseen states/actions,
     - requires only a smaller $\Theta$ to store.
     E.g., in Q-learning, $Q^*$ should satisfy the Bellman optimality equation:
     $Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')], \quad \forall s,a.$
     Algorithm (TD estimate): initialize $\Theta$ arbitrarily, then iterate until convergence:
     1. Take action $a$ from $s$ using some exploration policy $\pi'$ derived from $f_{Q^*}$ (e.g., $\varepsilon$-greedy).
     2. Observe $s'$ and the reward $R(s,a,s')$, then update $\Theta$ using SGD: $\Theta \leftarrow \Theta - \eta \nabla_\Theta C(\Theta)$, where
        $C(\Theta) = \big(R(s,a,s') + \gamma \max_{a'} f_{Q^*}(s',a';\Theta) - f_{Q^*}(s,a;\Theta)\big)^2.$
     (A sketch of one such step, with a linear approximator, follows below.)

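Below is a minimal sketch of one TD step with a function approximator, here a linear model $f_{Q^*}(s,a;\Theta) = \Theta^\top \phi(s,a)$ over a hypothetical feature map `phi`; the feature map, the action list, and the step sizes are assumptions. As is common practice, the target is treated as a constant when differentiating the squared TD error (a semi-gradient step).

```python
import numpy as np

def td_step(theta, phi, s, a, r, s_next, actions, eta=0.01, gamma=0.99):
    """One semi-gradient TD update for a linear approximator f(s,a) = theta . phi(s,a).

    `phi` is a hypothetical feature map returning a vector the same size as `theta`.
    """
    q_sa = theta @ phi(s, a)
    # TD target: reward plus discounted max over next actions,
    # treated as a constant with respect to theta (semi-gradient).
    target = r + gamma * max(theta @ phi(s_next, a2) for a2 in actions)
    # d/dtheta of (target - theta.phi)^2 is -2 (target - theta.phi) phi;
    # the constant factor 2 is absorbed into eta.
    theta += eta * (target - q_sa) * phi(s, a)
    return theta

def epsilon_greedy(theta, phi, s, actions, epsilon=0.1):
    """Exploration policy pi' derived from the current approximator."""
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]
    return max(actions, key=lambda a: theta @ phi(s, a))
```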

  9. Works, if Combined with Careful Feature Engineering.
     Tetris [1]: states: $O(10^{60})$ configurations; actions: rotation and translation of the falling piece; $f(s,a;\Theta)$ and $C(\Theta)$ modeled as an approximate linear programming problem; hand-crafted features (22 in total).
     Why not use a deep neural network to represent $Q_\Theta$? One model for different tasks, and automatically learned features.

  10. Deep RL.
      Value-based: use DNNs to represent the value/Q-function, e.g., DQN. Note that $\pi^*(s) = \arg\max_a Q^*(s,a)$ is only feasible when the actions are discrete.
      Policy-based: use DNNs to represent the policy $\pi$, e.g., DDPG, Actor-Critic, A3C, TRPO, PPO.
      Model-based: deep RL when the MDP/environment model is known, e.g., AlphaGo.

  11. Outline (repeated as a section divider; next: Value-based Deep RL).

  14. DNNs for $Q^*$.
      Use a DNN $f_{Q^*}(s,a;\Theta)$ to represent $Q^*(s,a)$.
      Algorithm (TD): initialize $\Theta$ arbitrarily, then iterate until convergence:
      1. Take action $a$ from $s$ using some exploration policy $\pi'$ derived from $f_{Q^*}$ (e.g., $\varepsilon$-greedy).
      2. Observe $s'$ and the reward $R(s,a,s')$, then update $\Theta$ using SGD: $\Theta \leftarrow \Theta - \eta \nabla_\Theta C(\Theta)$, where
         $C(\Theta) = \big(R(s,a,s') + \gamma \max_{a'} f_{Q^*}(s',a';\Theta) - f_{Q^*}(s,a;\Theta)\big)^2.$
      However, this diverges because:
      - samples are correlated (violating the i.i.d. assumption on training examples), and
      - the target is non-stationary ($f_{Q^*}(s',a')$ changes as $\Theta$ is updated for the current $a$).

  15. Outline (repeated as a section divider; next: Deep Q-Network).

  17. Deep Q-Network (DQN).
      The naive TD algorithm diverges because samples are correlated and the target is non-stationary.
      Stabilization techniques proposed by (Nature) DQN [5]: experience replay and a delayed target network.

  19. Experience Replay.
      Use a replay memory $D$ to store recently seen transitions $(s,a,r,s')$; sample mini-batches from $D$ to update $\Theta$.
      Algorithm (TD): initialize $\Theta$ arbitrarily, then iterate until convergence:
      1. Take action $a$ from $s$ using $\pi'$ derived from $f_{Q^*}$ (e.g., $\varepsilon$-greedy).
      2. Observe $s'$ and the reward $R$; add $(s,a,R,s')$ to $D$.
      3. Sample a mini-batch of transitions $(s^{(i)}, a^{(i)}, R^{(i)}, s^{(i+1)})$ from $D$ and do $\Theta \leftarrow \Theta - \eta \nabla_\Theta C(\Theta)$, where
         $C(\Theta) = \sum_i \big(R^{(i)} + \gamma \max_{a'} f_{Q^*}(s^{(i+1)},a';\Theta) - f_{Q^*}(s^{(i)},a^{(i)};\Theta)\big)^2.$
      (A minimal replay-memory sketch follows below.)

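A minimal replay-memory sketch matching steps 2 and 3 above; the capacity and batch size are illustrative assumptions, and `deque(maxlen=...)` provides the "recently seen" behavior by discarding the oldest transitions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of recently seen transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between consecutive
        # transitions, so mini-batches are closer to i.i.d.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```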

  21. Delayed Target Network.
      To avoid chasing a moving target, compute the target value with a network parametrized by old parameters $\Theta^-$.
      Algorithm (TD): initialize $\Theta$ arbitrarily and $\Theta^- = \Theta$, then iterate until convergence:
      1. Take action $a$ from $s$ using $\pi'$ derived from $f_{Q^*}$ (e.g., $\varepsilon$-greedy).
      2. Observe $s'$ and the reward $R$; add $(s,a,R,s')$ to $D$.
      3. Sample a mini-batch of transitions $(s^{(i)}, a^{(i)}, R^{(i)}, s^{(i+1)})$ from $D$ and do $\Theta \leftarrow \Theta - \eta \nabla_\Theta C(\Theta)$, where
         $C(\Theta) = \sum_i \big(R^{(i)} + \gamma \max_{a'} f_{Q^*}(s^{(i+1)},a';\Theta^-) - f_{Q^*}(s^{(i)},a^{(i)};\Theta)\big)^2.$
      4. Update $\Theta^- \leftarrow \Theta$ every $K$ iterations.
      (A sketch of one such update follows below.)

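Here is a sketch of one such update written in PyTorch, purely as an assumption (the slides do not prescribe a framework). `online_net` holds $\Theta$, `target_net` holds $\Theta^-$, and both are assumed to map a batch of states to one Q-value per action; the loss follows the mini-batch cost above with the target computed from the frozen $\Theta^-$.

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: targets come from target_net (Theta^-), gradients flow
    only through online_net (Theta)."""
    s, a, r, s_next, done = batch            # a: LongTensor of action indices
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                    # Theta^- is frozen: no moving target
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * q_next * (1.0 - done)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every K iterations, copy the online weights into the target network:
#   target_net.load_state_dict(online_net.state_dict())
```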

  23. Other Tricks.
      Optimization techniques matter in deep RL: optimization error may lead to wrong transitions (trajectories) and, in turn, a bad final policy.
      - Clip rewards for better-conditioned gradients (the downside: the agent can no longer distinguish small from large rewards).
      - Batch normalization helps.
      - Use RMSProp instead of vanilla SGD for an adaptive learning rate.
      (A sketch of the clipping and RMSProp tricks follows below.)

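A sketch of the two concrete tricks from the slide above, again assuming PyTorch: clip rewards before they enter the TD target, and use RMSProp instead of vanilla SGD. The hyperparameter values shown are illustrative assumptions, not numbers taken from the slides.

```python
import torch

def clip_reward(r):
    # Rewards are clipped to [-1, 1]: gradients are better conditioned across
    # games, but small and large rewards can no longer be distinguished.
    return torch.clamp(r, -1.0, 1.0)

# RMSProp keeps a running average of squared gradients, giving each parameter
# an adaptive learning rate (hyperparameters are illustrative):
#   optimizer = torch.optim.RMSprop(online_net.parameters(),
#                                   lr=2.5e-4, alpha=0.95, eps=0.01)
```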
  24. DQN on Atari.
      49 Atari 2600 games. States: raw pixels. Actions: 18 joystick/button positions. Rewards: changes in the game score. (A sketch of a convolutional Q-network for this setting follows below.)

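For reference, here is a sketch of a convolutional Q-network of the kind used on Atari; the layer sizes follow the commonly cited Nature-DQN architecture and should be read as an assumption rather than something specified on this slide. The input is a stack of 4 preprocessed 84x84 grayscale frames; the output is one Q-value per joystick/button action (up to 18).

```python
import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""

    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),               # one Q-value per action
        )

    def forward(self, x):                            # x: (batch, 4, 84, 84) in [0, 1]
        return self.head(self.features(x))
```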