SLIDE 1

Deep Reinforcement Learning

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

SLIDE 2

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 5

(Tabular) RL

Q-learning: Q∗(s,a) ← Q∗(s,a) + η[(R(s,a,s′) + γ maxa′ Q∗(s′,a′)) − Q∗(s,a)]
SARSA: Qπ(s,a) ← Qπ(s,a) + η[(R(s,a,s′) + γQπ(s′,π(s′))) − Qπ(s,a)]
In realistic environments with large state/action spaces, this requires a large table to store the Q∗/Qπ values
Maze: O(10^1), Tetris: O(10^60), Atari: O(10^16922) pixel configurations
Continuous states/actions?

May not be able to visit all (s,a)’s in limited training time
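To make the tabular updates above concrete, here is a minimal Python sketch of Q-learning with an explicit table; the environment interface (reset/step/sample_action) is an assumption for illustration, not something specified on the slides.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       eta=0.1, gamma=0.99, eps=0.1):
    """Q-learning with one table entry per (s, a) pair."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration policy derived from the current Q table
            a = env.sample_action() if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # TD update toward R(s, a, s') + gamma * max_a' Q(s', a')
            target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += eta * (target - Q[s, a])
            s = s_next
    return Q
```

The cost of the table here is exactly the O(|S| × |A|) storage issue raised above.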

SLIDE 7

Generalizing across States

Idea: to learn a function fQ∗(s,a;Θ) (resp. fQπ) that approximates Q∗(s,a) (resp. Qπ(s,a)), ∀s,a

Trained by a small number (millions) of samples
Generalizes to unseen states/actions
Smaller Θ to store
E.g., in Q-learning, Q∗ should satisfy the Bellman optimality equation: Q∗(s,a) = ∑s′ P(s′|s;a)[R(s,a,s′) + γ maxa′ Q∗(s′,a′)], ∀s,a
Algorithm (TD estimate): initialize Θ arbitrarily, iterate until convergence:
1. Take action a from s using some exploration policy π′ derived from fQ∗ (e.g., ε-greedy)
2. Observe s′ and reward R(s,a,s′), then update Θ using SGD: Θ ← Θ − η∇ΘC, where C(Θ) = [R(s,a,s′) + γ maxa′ fQ∗(s′,a′;Θ) − fQ∗(s,a;Θ)]^2

SLIDE 9

Works, with Careful Feature Engineering

Tetris: [1]

States: O(10^60) configurations
Actions: rotations and translations of the falling piece
f(s,a;Θ) and C(Θ) modeled as an approximate linear programming problem
Hand-crafted features (22 in total)
Why not use a deep neural network to represent QΘ?
One model for different tasks
Automatically learned features

SLIDE 10

Deep RL

Value-based: use DNNs to represent value/Q-function

E.g., DQN
π∗(s) ← argmaxa Q∗(s,a) is only feasible if actions are discrete

Policy-based: use DNNs to represent policy π

E.g., DDPG, Actor-Critic, A3C, TRPO, PPO

Model-based: deep RL when MDP/env. model is known

E.g., AlphaGo

SLIDE 11

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 14

DNNs for Q∗

Use a DNN fQ∗(s,a;Θ) to represent Q∗(s,a)
Algorithm (TD): initialize Θ arbitrarily, iterate until convergence:
1. Take action a from s using some exploration policy π′ derived from fQ∗ (e.g., ε-greedy)
2. Observe s′ and reward R(s,a,s′), then update Θ using SGD: Θ ← Θ − η∇ΘC, where C(Θ) = [R(s,a,s′) + γ maxa′ fQ∗(s′,a′;Θ) − fQ∗(s,a;Θ)]^2
However, this diverges due to:
Samples are correlated (violating the i.i.d. assumption on training examples)
Non-stationary target (fQ∗(s′,a′;Θ) changes as Θ is updated for the current a)

SLIDE 15

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 17

Deep Q-Network (DQN)

Naive TD algorithm diverges due to:
Samples are correlated
Non-stationary target
Stabilization techniques proposed by (Nature) DQN [5]:
Experience replay
Delayed target network

SLIDE 19

Experience Replay

Use a replay memory D to store recently seen transitions (s,a,r,s′)'s
Sample a mini-batch from D and update Θ
Algorithm (TD): initialize Θ arbitrarily, iterate until convergence:
1. Take action a from s using π′ derived from fQ∗ (e.g., ε-greedy)
2. Observe s′ and reward R, add (s,a,R,s′) to D
3. Sample a mini-batch of (s(i),a(i),R(i),s(i+1))'s from D and do Θ ← Θ − η∇ΘC, where C(Θ) = ∑i [R(i) + γ maxa′ fQ∗(s(i+1),a′;Θ) − fQ∗(s(i),a(i);Θ)]^2
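A replay memory is just a bounded FIFO buffer with uniform mini-batch sampling; the sketch below is a minimal version (the capacity and batch size are illustrative defaults, not values from the slides).

```python
import random
from collections import deque

class ReplayMemory:
    """Stores recent transitions (s, a, r, s_next, done) and samples them uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # breaks temporal correlation
        return map(list, zip(*batch))   # lists of states, actions, rewards, next states, dones

    def __len__(self):
        return len(self.buffer)
```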

SLIDE 21

Delayed Target Network

To avoid chasing a moving target, set the target value as a network output parametrized by an old Θ−
Algorithm (TD): initialize Θ arbitrarily and Θ− = Θ, iterate until convergence:
1. Take action a from s using π′ derived from fQ∗ (e.g., ε-greedy)
2. Observe s′ and reward R, add (s,a,R,s′) to D
3. Sample a mini-batch of (s(i),a(i),R(i),s(i+1))'s from D and do Θ ← Θ − η∇ΘC, where C(Θ) = ∑i [R(i) + γ maxa′ fQ∗(s(i+1),a′;Θ−) − fQ∗(s(i),a(i);Θ)]^2
4. Update Θ− ← Θ every K iterations
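Combining the replay memory with the delayed target network, one DQN update might look like the following PyTorch sketch; q_net and target_net are assumed to map a batch of states to one Q-value per action, and memory is assumed to be a buffer like the one sketched earlier, with states stored as tensors.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, memory, optimizer, gamma=0.99, batch_size=32):
    """One SGD step on the squared TD error, with targets from the frozen network."""
    s, a, r, s_next, done = memory.sample(batch_size)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)                # f_Q*(s(i), a(i); Theta)
    with torch.no_grad():                                  # target uses the old Theta^-
        target = r + gamma * (1 - done) * target_net(s_next).max(1).values
    loss = F.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(q_net, target_net):
    """Step 4: copy Theta into Theta^- every K iterations."""
    target_net.load_state_dict(q_net.state_dict())
```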

SLIDE 23

Other Tricks

Optimization techniques matter in deep RL

Optimization error may lead to wrong transitions (trajectories)
And a bad final policy

Reward clipping for better conditioned gradients

Can't differentiate between small and large rewards
Better to use batch normalization

Use RMSProp instead of vanilla SGD for adaptive learning rate

SLIDE 24

DQN on Atari

49 Atari 2600 games
States: raw pixels
Actions: 18 joystick/button positions
Rewards: changes in score

SLIDE 26

Network Architecture

End-to-end from raw pixels to Q∗(s,a)
CNN + fully connected layers
Input: state s, a stack of raw pixels from the last 4 frames
Output: 18 Q∗(s,a)'s (one for each action)
Network architecture is fixed across all games
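A PyTorch sketch of a network in this style (4 stacked frames in, one Q-value per action out); the specific layer sizes below are illustrative choices, not values taken from the slides.

```python
import torch.nn as nn

class AtariQNet(nn.Module):
    """CNN + fully connected layers: 4 stacked 84x84 frames -> one Q*(s, a) per action."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),        # 18 outputs, one per joystick/button action
        )

    def forward(self, x):                     # x: (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.head(self.features(x))
```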

SLIDE 27

Results

SLIDE 28

Effect of Stability Techniques

Delayed target network is less useful for large networks

SLIDE 29

Predicted Q∗ Values for Pong

SLIDE 30

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 31

Improvements since DQN

Stabilization:

Double DQN [9]
Prioritized replay [7]

Modeling additional prior:

Dueling network [10]

Exploration:

NoisyNet [2]

Large-scale implementation

SLIDE 32

Double DQN I

DQN update rule: Θ ← Θ−η∇ΘC, where

C(Θ) = ∑i [R(i) + γ maxa′ fQ∗(s(i+1),a′;Θ−) − fQ∗(s(i),a(i);Θ)]^2

There is an upward bias in maxa′ fQ∗(s(i+1),a′;Θ−)

fQ∗(s(i+1),a′;Θ−) with high positive error is preferred

At each step, the positive error is added to fQ∗(s(i),a(i);Θ)

SLIDE 33

Double DQN II

Double DQN (DDQN) [9]:

C(Θ) = ∑i [R(i) + γ fQ∗(s(i+1), argmaxa′ fQ∗(s(i+1),a′;Θ); Θ−) − fQ∗(s(i),a(i);Θ)]^2
Uses Θ to select the best action
Uses Θ− to evaluate the best action

Random (unbiased) error added to fQ∗(s(i),a(i);Θ) at each step
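The only change relative to DQN is how the bootstrap target is built: the online network selects the action, the delayed network evaluates it. A sketch of that step alone, using the same tensor conventions as the DQN update above:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """R(i) + gamma * f_Q*(s(i+1), argmax_a' f_Q*(s(i+1), a'; Theta); Theta^-)."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)         # select with Theta
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)   # evaluate with Theta^-
        return r + gamma * (1 - done) * q_eval
```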

SLIDE 34

Prioritized Replay [7]

Not all (s,a,R,s′)'s from D are equally helpful to training fQ∗
Sample (s,a,R,s′)'s with probability proportional to the "surprise" in terms of the Bellman equation: |R + γ maxa′ fQ∗(s′,a′;Θ−) − fQ∗(s,a;Θ)|
Rank-based alternative: make D a priority queue
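A minimal sketch of proportional prioritization: each transition's priority is its absolute Bellman error, and sampling probabilities follow those priorities. The exponent alpha and the small eps are assumed hyperparameters, not values given on the slide.

```python
import numpy as np

def prioritized_indices(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha   # eps keeps every probability positive
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```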

SLIDE 38

Dueling Network [10]

Q∗(s,a) = V∗(s)+A∗(s,a)

A∗(s,a) the advantage function of a

Idea: to model this prior and learn fQ∗(s,a) = fV∗(s)+fA∗(s,a)

Not well-defined: fQ∗(s,a) = (fV∗(s)+c)+(fA∗(s,a)−c) for any c

Dueling DQN: fQ∗(s,a) = fV∗(s)+(fA∗(s,a)−maxa′ fA∗(s,a′))

The best action a∗ has zero advantage and fQ∗(s,a∗) = fV∗(s)

Stabilized version: fQ∗(s,a) = fV∗(s) + (fA∗(s,a) − (1/|A|) ∑a′ fA∗(s,a′))

fV∗ and fA∗ are off-target (by a constant) but fA∗ changes more slowly
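A PyTorch sketch of the dueling head using the mean-subtracted (stabilized) combination from the last bullet; in_dim is the output size of whatever shared feature trunk (e.g., the Atari CNN above) feeds it.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits into a state-value stream f_V*(s) and an advantage stream f_A*(s, a)."""
    def __init__(self, in_dim, n_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                        # (batch, 1)
        a = self.advantage(features)                    # (batch, n_actions)
        return v + (a - a.mean(dim=1, keepdim=True))    # f_Q*(s, a) with the mean advantage removed
```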

SLIDE 39

Improvement over Prioritized DDQN

SLIDE 40

See, Attend, and Drive

Atari game: Enduro
Attention masks: ∂fV∗/∂s (s) and ∂fA∗/∂s (s)

fQ∗(s,a) = fV∗(s) + fA∗(s,a)

fV∗ pays attention to the road
fA∗ pays attention only when there are obstacles in front

SLIDE 41

NoisyNet [2]

Instead of using ε-greedy for exploration, add noise to Θ
The level of noise is learned by SGD along with Θ
Improvement over Dueling DQN:

SLIDE 43

Scaling Up DQN on Single Machine

Exploits multi-threading of modern CPUs/GPUs
Run/train multiple agents in parallel (one per thread/GPU)
Θ shared between threads in main memory
Data-parallelism

Parallelism decorrelates samples

Alternative to experience replay

SLIDE 44

Scaling Out DQN with Gorila [6]

Distributed system architecture for large-scale RL
10x faster than Nature DQN
Applied to recommender systems at Google

SLIDE 45

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 48

Why Policy Network?

Policy-based deep RL: use a DNN gπ(s;Φ) to approximate π(s). Why?
In DQN (or any method based on value/policy iteration), one needs to solve π∗(s) = argmaxa′ Q∗(s,a′) or π̂(s) = argmaxa′ Qπ(s,a′)
Not applicable to a continuous action space A, common in, e.g., robotics

π may be easier to learn than Q or V

SLIDE 51

Modeling π

π can be either

deterministic: gπ(s;Φ) = a, or stochastic: gπ(s;Φ) = P(a|s)

Pathwise derivative methods

For deterministic π and continuous A
To find Φ such that gπ(s;Φ) gives the action a maximizing Q∗(s,a)
Changes the trajectory of an episode in the graph of cumulative rewards

Policy gradient/optimization methods

For stochastic π
To find Φ such that gπ yields trajectories with high cumulative rewards
Do not change the trajectory (but its probability) of an episode

SLIDE 52

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 54

Deep Deterministic Policy Gradient (DDPG) [4]

Based on DQN

Q-learning is off-policy and works with changing exploration strategies

Deterministic policy: gπ∗(s;Φ) = a ∈ R
Goal: to find Φ maximizing Es[fQ∗(s,a;Θ)], where a = gπ∗(s;Φ)
SGD update rule: Φ ← Φ + η ∂Es[fQ∗(s,a;Θ)]/∂Φ = Φ + η Es[∂fQ∗/∂a (s,a;Θ) · ∂gπ∗/∂Φ (s;Φ)]

SLIDE 55

DDPG Algorithm (TD)

Initialize Θ and Φ arbitrarily, set Θ− = Θ and Φ− = Φ, iterate until convergence:
1. Take action a = gπ∗(s;Φ) + z from s, where z is random noise for exploration
2. Observe s′ and reward R, add (s,a,R,s′) to D
3. Sample a mini-batch of (s(i),a(i),R(i),s(i+1))'s from D
4. Update Θ: Θ ← Θ − η∇ΘC, where C(Θ) = ∑i [R(i) + γ fQ∗(s(i+1), gπ∗(s(i+1);Φ−); Θ−) − fQ∗(s(i),a(i);Θ)]^2
5. Update Φ: Φ ← Φ + λ ∑i ∂fQ∗/∂a (s(i), gπ∗(s(i);Φ); Θ) · ∂gπ∗/∂Φ (s(i);Φ)
6. Update Θ− ← τΘ + (1−τ)Θ− and Φ− ← τΦ + (1−τ)Φ−
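A condensed PyTorch sketch of steps 4-6: the critic is regressed onto the delayed-network target, the actor is pushed along ∂fQ∗/∂a · ∂gπ∗/∂Φ simply by backpropagating −fQ∗(s, gπ∗(s)) through the actor, and the delayed copies are updated softly. The actor/critic modules, their optimizers, and the sampled batch are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_tgt, critic_tgt,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch                              # tensors sampled from the replay memory D

    # Step 4: critic update on the squared TD error, targets from the delayed networks
    with torch.no_grad():
        target = r + gamma * critic_tgt(s_next, actor_tgt(s_next)).squeeze(1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(1), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 5: actor update, ascend E_s[f_Q*(s, g_pi*(s; Phi); Theta)] by minimizing its negation
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Step 6: soft update, Theta^- <- tau*Theta + (1 - tau)*Theta^- (likewise for Phi^-)
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_tgt, p in zip(tgt.parameters(), src.parameters()):
            p_tgt.data.mul_(1 - tau).add_(tau * p.data)
```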

SLIDE 56

Limitations

Only applicable to a continuous action space A
Cannot backprop through samples when calculating ∂Es[fQ∗(s,gπ∗(s;Φ);Θ)]/∂Φ

For discrete A, it’s more natural to use a DNN to model a stochastic policy: gπ(s) = P(a|s),∀a

SLIDE 57

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 60

Episodic Policy Gradient

Policy gradient/optimization methods

For a stochastic policy: gπ(s) = P(a|s;Φ), ∀a (discrete or continuous)
Do not change the trajectory (but its probability) of an episode

Given an episode, let τ = {(s(t),a(t),R(t),s(t+1))}t be the sequence of state-action transitions

Action a(t) sampled from gπ(s(t))

Let R(τ) = ∑t γ^t R(t); our goal: argmaxΦ Eτ[R(τ);Φ] = argmaxΦ ∑τ P(τ;Φ)R(τ)

SLIDE 63

Policy Gradient

Let J(Φ) = ∑τ P(τ;Φ)R(τ), we have:

∇ΦJ(Φ) = ∇Φ ∑τ P(τ;Φ)R(τ)
= ∑τ ∇ΦP(τ;Φ) R(τ)
= ∑τ P(τ;Φ) (∇ΦP(τ;Φ) / P(τ;Φ)) R(τ)
= ∑τ P(τ;Φ) ∇Φ logP(τ;Φ) R(τ)
= ∑τ P(τ;Φ) ∇Φ log ∏t P(s(t+1)|s(t),a(t)) P(a(t)|s(t);Φ) R(τ)
= ∑τ P(τ;Φ) ∇Φ ∑t [logP(s(t+1)|s(t),a(t)) + logP(a(t)|s(t);Φ)] R(τ)
= ∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) R(τ)
= ∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) ∑t′ γ^t′ R(t′)(s(t′),a(t′),s(t′+1))
= ∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) ∑_{t′=t}^{H} γ^t′ R(t′)

Assumes that the environment is MDP-like

But no need for the exact model

SLIDE 65

REINFORCE Algorithm

∇ΦJ(Φ) = ∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) ∑_{t′=t}^{H} γ^t′ R(t′)
REINFORCE (MC estimate): initialize Φ arbitrarily, iterate until convergence:
1. Run episodes {τ(i)}i by sampling actions from g(·;Φ)
2. For each time step t in an episode, compute R(i,t) = ∑_{t′=t}^{H(i)} γ^t′ R(i,t′)
3. Update Φ using SGD: Φ ← Φ + η∇ΦĴ, where ∇ΦĴ(Φ) = ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) R(i,t)

REINFORCE-style policy gradient: ∇ log prob. of actions × episodic rewards
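A PyTorch sketch of one REINFORCE update for a discrete action space; policy is assumed to output action logits, and the episode to be given as lists of states (tensors), actions (ints), and rewards (floats).

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    # Step 2: R(i, t) = sum_{t' >= t} gamma^t' * R(i, t')  (discounted from the episode start, as above)
    returns, running = [], 0.0
    for t in reversed(range(len(rewards))):
        running = (gamma ** t) * rewards[t] + running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    # Step 3: grad log-probability of the taken actions, weighted by the episodic returns
    log_probs = Categorical(logits=policy(torch.stack(states))).log_prob(torch.tensor(actions))
    loss = -(log_probs * returns).sum()      # minimizing the negation ascends J(Phi)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```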

SLIDE 66

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 69

Variance

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) ∑_{t′=t}^{H(i)} γ^t′ R(i,t′)
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) is an MC estimate of Qπ(s(i,t),a(i,t)) = E_{{s(t′),a(t′)}t′}[∑t′ γ^t′ R(t′) | s(0) = s(i,t), a(0) = a(i,t)] using samples rolled out from a single episode

TD vs. MC estimate:
TD: biased, but low variance
MC: unbiased, but high variance

How to lower the variance of the vanilla policy gradient algorithm?
To reduce the magnitude of ∑_{t′=t}^{H(i)} γ^t′ R(i,t′)
E.g., use a smaller γ or subtract a baseline from ∑_{t′=t}^{H(i)} γ^t′ R(i,t′)
To approximate ∑_{t′=t}^{H(i)} γ^t′ R(i,t′) by a DNN and take advantage of its generalizability
To collect more samples

SLIDE 71

Baseline I

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) (∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b)
b reduces variance without adding bias as long as it is independent of the actions:
∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) b = ∑τ P(τ;Φ) ∇Φ logP(τ;Φ) b = ∑τ P(τ;Φ) (∇ΦP(τ;Φ) / P(τ;Φ)) b = ∇Φ ∑τ P(τ;Φ) b = ∇Φ b = 0
Of what value?

SLIDE 73

Baseline II

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) (∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b)
The larger the b the better, but ∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b still needs to guide gπ to output good τ
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) is an estimate of Qπ(s(i,t),a(i,t))
b is an estimate of Vπ(s(i,t)) = E_{{s(t′),a(t′)}t′}[∑t′ γ^t′ R(t′) | s(0) = s(i,t)] [3]
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b estimates Qπ(s(i,t),a(i,t)) − Vπ(s(i,t)), the advantage of π at state s(i,t)
In REINFORCE: b = (1 / |{(j,t′′) : s(j,t′′) = s(i,t)}|) ∑_{(j,t′′): s(j,t′′)=s(i,t)} ∑_{t′=t′′}^{H(j)} γ^t′ R(j,t′)

SLIDE 75

Function Approximations

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) (∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b)
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) estimates Qπ(s(i,t),a(i,t)) using rollouts from a single episode
Actor-critic: why not use a DNN fQπ(s,a;Θ) to approximate Qπ(s,a), ∀s,a?
Baseline b = ∑_{j: s(j,t)=s(i,t)} ∑_{t′=t}^{H(j)} γ^t′ R(j,t′) estimates Vπ(s(i,t))

Advantage actor-critic: approximates Vπ(s) with fVπ(s;Θ)

SLIDE 76

Advantage Actor-Critic (b = fVπ(s;Θ))

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) (∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b)
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) ≈ Qπ(s(i,t),a(i,t)) can be approximated by R(i,t) + γ fVπ(s(i,t+1);Θ)
No need for fQπ
Bellman expectation equation for a stochastic π: Vπ(s) = ∑a π(a|s) ∑s′ P(s′|s;a)[R(s,a,s′) + γVπ(s′)], ∀s
Algorithm (TD): initialize Θ and Φ arbitrarily, iterate until convergence:
1. Take an action a from s using g(s;Φ)
2. Observe s′ and reward R, compute Q̂π ← R + γ fVπ(s′;Θ)
3. Update fVπ: Θ ← Θ − η∇Θ[Q̂π − fVπ(s;Θ)]^2
4. Update gπ: Φ ← Φ + λ∇Φ logP(a|s;Φ)(Q̂π − fVπ(s;Θ))
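A PyTorch sketch of one advantage actor-critic step for a single transition, following the four steps above; policy (actor, action logits) and value_net (critic, scalar fVπ) are assumed networks, each with its own optimizer.

```python
import torch
from torch.distributions import Categorical

def a2c_step(policy, value_net, policy_opt, value_opt,
             s, a, r, s_next, done, gamma=0.99):
    s, s_next = s.unsqueeze(0), s_next.unsqueeze(0)     # single transition -> batch of one

    # Step 2: TD target Q_hat = R + gamma * f_V(s'; Theta)
    with torch.no_grad():
        q_hat = r + gamma * (1.0 - done) * value_net(s_next).squeeze()

    # Step 3: critic update, minimize (Q_hat - f_V(s; Theta))^2
    v = value_net(s).squeeze()
    value_loss = (q_hat - v) ** 2
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Step 4: actor update, ascend log P(a|s; Phi) * (Q_hat - f_V(s; Theta))
    advantage = (q_hat - v).detach()                    # treat the advantage as a constant
    log_prob = Categorical(logits=policy(s).squeeze(0)).log_prob(torch.tensor(a))
    policy_loss = -log_prob * advantage
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```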

SLIDE 78

Pitfall: Exploration

To learn fVπ based on value iteration, the agent has to explore enough
But gπ is optimized for exploitation only: Φ ← Φ + λ∇Φ logP(a|s;Φ)(Q̂π − fVπ(s;Θ))
Solution? Maximize the entropy of gπ(s;Φ) as well:
Φ ← Φ + λ∇Φ[logP(a|s;Φ)(Q̂π − fVπ(s;Θ)) + µH(a ∼ gπ(s;Φ))]

The larger µ, the more exploration
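Only the actor loss changes; a sketch of the entropy-regularized version of step 4 from the actor-critic update above, where mu plays the role of µ:

```python
import torch
from torch.distributions import Categorical

def actor_loss_with_entropy(policy, s, a, advantage, mu=0.01):
    """-[log P(a|s; Phi) * advantage + mu * H(a ~ g_pi(s; Phi))], to be minimized."""
    dist = Categorical(logits=policy(s).squeeze(0))          # s: a (1, ...) state batch of one
    return -(dist.log_prob(torch.tensor(a)) * advantage + mu * dist.entropy())
```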

SLIDE 80

Asynchronous Advantage Actor-Critic (A3C)

TD estimate reduces variance at the cost of bias/divergence
A3C: use asynchronous workers to stabilize fVπ training

An alternative to experience replay

SLIDE 81

A3C on Labyrinth

Task: to collect apples (+1 reward) and escape (+10 reward)
End-to-end learning from pixels to policy
State s(t) modeled by a recurrent neural network (LSTM)

To have long-term memory

SLIDE 84

Variance Reduction by Having More Samples

Update rule for gπ in A3C: Φ ← Φ + λ∇Φ logP(a(t)|s(t);Φ) Âπ(t),
where Âπ(t) = Q̂π(t) − fVπ(s(t);Θ) = (R(t) + γ fVπ(s(t+1);Θ)) − fVπ(s(t);Θ)
Bellman expectation equation holds for multiple time differences:
Qπ(s(t),a(t)) = E[R(t) + γVπ(s(t+1)) | s(t), a(t)]
= E[R(t) + γR(t+1) + γ^2 Vπ(s(t+2))]
= E[R(t) + γR(t+1) + γ^2 R(t+2) + γ^3 Vπ(s(t+3))]
= ··· = E[R(t) + γR(t+1) + γ^2 R(t+2) + γ^3 R(t+3) + ···]
A3C replaces Âπ(t) with a K-step lookahead:
ÂK(t) ← R(t) + γR(t+1) + ··· + γ^(K−1) R(t+K−1) + γ^K fVπ(s(t+K);Θ) − fVπ(s(t);Θ)

Update of Φ lags K time steps behind the current action

SLIDE 86

Generalized Advantage Estimation (GAE)

ÂK(t) = R(t) + γR(t+1) + ··· + γ^(K−1) R(t+K−1) + γ^K fVπ(s(t+K);Θ) − fVπ(s(t);Θ)
Define TD error at time t: δ(t) = R(t) + γ fVπ(s(t+1);Θ) − fVπ(s(t);Θ)
We have ÂK(t) = δ(t) + γδ(t+1) + ··· + γ^(K−1) δ(t+K−1)
GAE [8]: let Âπ(t) be the exponential moving average of Â1(t), Â2(t), ···:
Âπ(t) ← Â1(t) + λÂ2(t) + λ^2 Â3(t) + ···
= δ(t) + λ(δ(t) + γδ(t+1)) + λ^2 (δ(t) + γδ(t+1) + γ^2 δ(t+2)) + ···
= (1/(1−λ)) δ(t) + (λγ/(1−λ)) δ(t+1) + (λ^2 γ^2/(1−λ)) δ(t+2) + ···
∝ δ(t) + λγ δ(t+1) + (λγ)^2 δ(t+2) + ···
Biased, but with much lower variance
δ(t)'s can have lower magnitudes when fVπ is good enough
In TD: Âπ(t) ← δ(t) + λγ δ(t+1) + ··· + (λγ)^K δ(t+K)
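A plain-Python sketch of GAE over one rollout: compute the TD errors δ(t), then accumulate them backwards in time with weight λγ; values is assumed to hold fVπ(s(t);Θ) for t = 0..T, including the bootstrap value of the final state.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_hat(t) = delta(t) + (lam*gamma) delta(t+1) + (lam*gamma)^2 delta(t+2) + ..."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages, acc = [0.0] * T, 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + lam * gamma * acc     # exponentially weighted sum of future TD errors
        advantages[t] = acc
    return advantages
```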

SLIDE 90

Policy Optimization vs. Policy/Value Iteration

Policy optimization:

Optimize policy “directly”
More compatible with auxiliary objectives & rich NN architectures (e.g., RNN)
More likely to work across different tasks/settings

Policy/value-iteration-based methods (e.g., DQN, DDPG):

Optimize policy “indirectly” (via Q/V, exploiting Bellman equations)
More compatible with different exploration strategies
Sensitive to tasks/settings, but more sample-efficient when it works

SLIDE 91

Reference I

[1] Vivek F. Farias and Benjamin Van Roy. Tetris: A study of randomized constraint sampling. In Probabilistic and Randomized Methods for Design Under Uncertainty, pages 189–201. Springer, 2006.
[2] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
[3] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

SLIDE 92

Reference II

[4] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[6] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.

SLIDE 93

Reference III

[7] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[8] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[9] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, pages 2094–2100, 2016.
[10] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
