

1. Deep Reinforcement Learning
   Axel Perschmann
   Supervisor: Ahmed Abdulkadir
   Seminar: Current Works in Computer Vision
   Research Group: Pattern Recognition and Image Processing
   Albert-Ludwigs-Universität Freiburg, 07 July 2016

2. Outline
   1 Reinforcement Learning: Basics; Learning Methods
   2 Asynchronous Reinforcement Learning: Related Work; Multi-threaded Learning
   3 Experiments: Benchmark Environments; Network Architecture; Results
   4 Conclusion

3. Motivation: Learning by Experience
   The agent interacts with the environment over a number of discrete timesteps t. At each step it
   - chooses an action based on the current situation,
   - receives feedback,
   - and updates its future choice of actions.
   Image: de.slideshare.net/ckmarkohchang/language-understanding-for-textbased-games-using-deep-reinforcement-learning
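
A minimal sketch of this interaction loop, assuming a hypothetical environment object whose reset() and step() methods return the next state, the reward, and a terminal flag; it is an illustration, not code from the talk.

```python
def run_episode(env, choose_action, max_steps=1000):
    """One episode of agent-environment interaction over discrete timesteps t."""
    state = env.reset()                          # initial situation s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = choose_action(state)            # choose an action based on the current situation
        state, reward, done = env.step(action)   # receive feedback: next state and reward
        total_reward += reward                   # a learning agent would also update itself here
        if done:                                 # terminal state reached
            break
    return total_reward
```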

4. Motivation: Learning by Experience (Environment)
   The environment is a Markov decision process, a 5-tuple $(T, S, A, f, c)$:
   - discrete decision points $t \in T = \{0, 1, \ldots, N\}$
   - system states $s_t \in S$
   - actions $a_t \in A$
   - transition function $s_{t+1} = f(s_t, a_t, w_t)$, equivalently transition probabilities $p_{ij}(a)$
   - direct costs/rewards $c : S \times A \to \mathbb{R}$
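
To make the five components concrete, here is a made-up two-state MDP written out as plain tables; the states, actions, probabilities, and rewards are illustrative assumptions, not part of the presentation.

```python
import random

states = [0, 1]
actions = ["stay", "move"]

# Transition probabilities p_ij(a): P[(s, a)][s'] = probability of moving to s'.
P = {
    (0, "stay"): {0: 0.9, 1: 0.1}, (0, "move"): {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9}, (1, "move"): {0: 0.8, 1: 0.2},
}

# Direct rewards c: S x A -> R.
C = {(0, "stay"): 0.0, (0, "move"): 1.0, (1, "stay"): 2.0, (1, "move"): 0.0}

def f(s, a):
    """Stochastic transition s_{t+1} = f(s_t, a_t, w_t); the noise w_t is the random draw."""
    next_states, probs = zip(*P[(s, a)].items())
    return random.choices(next_states, weights=probs)[0]
```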

5. Motivation: Learning by Experience (Policy)
   The agent's behaviour is defined by its policy $\pi$, with $\pi_t(s_t) = a$.
   $\pi$ can be non-stationary, $\hat{\pi} = (\pi_1, \pi_2, \ldots, \pi_N)$, or stationary, $\hat{\pi} = (\pi, \pi, \ldots, \pi)$.

6. Motivation: Learning by Experience (Goal)
   The goal of RL is to maximize the long-term return: learn the optimal policy $\pi^*$, which always selects the action $a$ that maximizes the long-term return.

7. Motivation: Learning by Experience (Goal)
   The goal of RL is to maximize the long-term return: learn the optimal policy $\pi^*$, which always selects the action $a$ that maximizes the long-term return. To do this, estimate the value of states and actions.

8. Value Functions
   State value function: the expected return when following $\pi$ from state $s$,
   $V^{\pi}(s) = \mathbb{E}_{\pi}[R_t \mid s_t = s] = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$.
   Action value function: the expected return starting from state $s$, taking action $a$, then following $\pi$,
   $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_t \mid s_t = s, a_t = a] = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$,
   where $0 \le \gamma \le 1$ is the discount factor.
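
A small numerical sketch of the discounted return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ for a finite, made-up reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite list of future rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Made-up episode: zero reward until a final reward of +1 three steps ahead.
print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # 0.9**3 = 0.729
```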

9. Learning to Choose 'Good' Actions
   Simply extract the optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$.

10. Learning to Choose 'Good' Actions
   Simply extract the optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
   Problems:
   1. A tabular $Q(s, a)$ is only feasible for small environments. Replace $Q(s, a)$ by a function approximator $Q(s, a; \theta)$, e.g. a neural network.

11. Learning to Choose 'Good' Actions
   Simply extract the optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
   Problems:
   1. A tabular $Q(s, a)$ is only feasible for small environments. Replace $Q(s, a)$ by a function approximator $Q(s, a; \theta)$, e.g. a neural network.
   2. The environment is unknown: unknown transition functions and unknown rewards, hence no model. Learn the optimal $Q^*(s, a; \theta)$ by updating $\theta$.
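
A minimal sketch of greedy policy extraction from an approximate action-value function; the tiny linear "network", its shapes, and the random weights are my own illustrative assumptions, not the architecture used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical approximator Q(s, .; theta) = s @ W + b,
# mapping a 4-dimensional state to Q-values for 3 discrete actions.
theta_W = rng.normal(size=(4, 3))
theta_b = np.zeros(3)

def q_values(state):
    """Approximate Q(s, a; theta) for all actions at once."""
    return state @ theta_W + theta_b

def greedy_policy(state):
    """pi*(s) = argmax_a Q(s, a; theta)."""
    return int(np.argmax(q_values(state)))

print(greedy_policy(rng.normal(size=4)))  # index of the selected action
```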

12. Learn Optimal Value Functions
   'Trial and error' approach:
   - take action $a_t$ according to the $\epsilon$-greedy policy,
   - receive the new state $s_{t+1}$ and reward $r_t$,
   - continue until a terminal state is reached,
   - use the history to update $Q(s, a; \theta)$ or $V(s; \theta)$, then restart.
   Source: Reinforcement Learning Lecture, ue01.pdf
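
A sketch of the $\epsilon$-greedy action selection used in this trial-and-error loop; the placeholder Q-function and the number of actions are assumptions for illustration only.

```python
import random
import numpy as np

def q_values(state):
    """Placeholder for the approximator Q(s, .; theta)."""
    return np.zeros(3)

def epsilon_greedy(state, n_actions=3, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)    # exploration
    return int(np.argmax(q_values(state)))    # exploitation
```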

13. Value-Based Model-Free Reinforcement Learning Methods
   Strategies to update the parameters $\theta$ iteratively.
   One-step Q-learning (off-policy technique):
   $L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$   (1)

14. Value-Based Model-Free Reinforcement Learning Methods
   Strategies to update the parameters $\theta$ iteratively.
   One-step Q-learning (off-policy technique):
   $L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$   (1)
   One-step SARSA (on-policy technique):
   $L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$   (2)

15. Value-Based Model-Free Reinforcement Learning Methods
   Strategies to update the parameters $\theta$ iteratively.
   One-step Q-learning (off-policy technique):
   $L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$   (1)
   One-step SARSA (on-policy technique):
   $L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$   (2)
   In both cases, obtaining a reward only affects the $(s, a)$-pair that led to the reward.

16. Value-Based Model-Free Reinforcement Learning Methods
   Strategies to update the parameters $\theta$ iteratively.
   One-step Q-learning (off-policy technique):
   $L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$   (1)
   One-step SARSA (on-policy technique):
   $L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$   (2)
   n-step Q-learning (off-policy technique) uses the target
   $r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a Q(s_{t+n}, a; \theta_i)$   (3)
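
A sketch comparing the three update targets on made-up transition data; the rewards and bootstrapped Q-values are invented to show how the targets differ, not to reproduce the training procedure.

```python
import numpy as np

gamma = 0.99

# Made-up data for one transition (s, a, r, s') with 3 actions.
r = 1.0
q_next = np.array([0.2, 0.5, 0.1])   # Q(s', .; theta_{i-1})
a_next = 2                           # action actually taken in s' by the current policy

# One-step Q-learning target (off-policy): bootstrap with the best next action.
target_q = r + gamma * q_next.max()

# One-step SARSA target (on-policy): bootstrap with the action actually taken.
target_sarsa = r + gamma * q_next[a_next]

# n-step Q-learning target: n rewards followed by a bootstrapped maximum.
rewards = [1.0, 0.0, 0.5]                 # r_t, r_{t+1}, r_{t+2} (n = 3)
q_bootstrap = np.array([0.3, 0.7, 0.0])   # Q(s_{t+n}, .; theta_i)
n = len(rewards)
target_n_step = sum(gamma ** k * rk for k, rk in enumerate(rewards)) + gamma ** n * q_bootstrap.max()

print(target_q, target_sarsa, target_n_step)
```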

17. Policy-Based Model-Free Reinforcement Learning Method
   Alternative approach: parameterize the policy $\pi(s; \theta)$ and update the parameters $\theta$ in the direction of the gradient
   $\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t(s_t))$   (4)
   The gradient is scaled by an estimate of the advantage of taking action $a_t$ in state $s_t$:
   - $R_t$ is an estimate of $Q^{\pi}(s_t, a_t)$,
   - $b_t(s_t)$ is a baseline, an estimate of $V^{\pi}(s_t)$.

18. Policy-Based Model-Free Reinforcement Learning Method
   Alternative approach: parameterize the policy $\pi(s; \theta)$ and update the parameters $\theta$ in the direction of the gradient
   $\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t(s_t))$   (4)
   The gradient is scaled by an estimate of the advantage of taking action $a_t$ in state $s_t$:
   - $R_t$ is an estimate of $Q^{\pi}(s_t, a_t)$,
   - $b_t(s_t)$ is a baseline, an estimate of $V^{\pi}(s_t)$.
   This yields an actor-critic architecture: the actor is the policy $\pi$; the critic is the baseline $b_t(s_t) \approx V^{\pi}(s_t)$.
   Source: Reinforcement Learning Lecture, ue07.pdf
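
A minimal numerical sketch of the policy-gradient update with a baseline, using a softmax policy over a few discrete actions; the linear parameterization, the return $R_t$, and the baseline value are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

n_actions, state_dim, alpha = 3, 4, 0.01
theta = rng.normal(scale=0.1, size=(state_dim, n_actions))  # policy parameters

def softmax_policy(state):
    """pi(. | s; theta) as a softmax over linear action preferences."""
    prefs = state @ theta
    prefs -= prefs.max()                # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Made-up experience: one state, the sampled action, its return R_t and baseline b_t(s_t).
state = rng.normal(size=state_dim)
probs = softmax_policy(state)
action = rng.choice(n_actions, p=probs)
R_t, b_t = 1.0, 0.4                     # R_t estimates Q(s_t, a_t); b_t estimates V(s_t)

# grad_theta log pi(a_t | s_t; theta) for a softmax-linear policy.
grad_log_pi = np.outer(state, np.eye(n_actions)[action] - probs)

# Ascend the gradient, scaled by the advantage estimate (R_t - b_t): the critic's
# baseline reduces the variance of the actor's update.
theta += alpha * (R_t - b_t) * grad_log_pi
```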

19. Outline
   1 Reinforcement Learning: Basics; Learning Methods
   2 Asynchronous Reinforcement Learning: Related Work; Multi-threaded Learning
   3 Experiments: Benchmark Environments; Network Architecture; Results
   4 Conclusion

20. Related Work (1): Deep Q-Networks (DQN) [Mnih et al., 2015]
   A deep neural network is used as a non-linear function approximator. Two techniques avoid divergence:
   - experience replay: perform Q-learning updates on random samples of past experience,
   - target network: keep the network used for targets fixed for several thousand iterations before updating its weights.
   Both make the training data less non-stationary and stabilize training.
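
A compact sketch of the two stabilization ideas, experience replay and a periodically synchronized target network; the buffer size, the synchronization interval, and the dictionary stand-in for network parameters are illustrative assumptions rather than details from the DQN paper.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # stores past transitions (s, a, r, s_next, done)
sync_every = 10_000                     # iterations between target-network updates

online_params = {"w": 0.0}              # stand-in for the online network parameters
target_params = dict(online_params)     # frozen copy used when computing targets

def remember(transition):
    """Append one transition to the replay memory."""
    replay_buffer.append(transition)

def sample_minibatch(batch_size=32):
    """Q-learning updates are performed on random samples of past experience."""
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))

def maybe_sync_target(step):
    """Keep the target network fixed between synchronizations, copying the online weights only every few thousand steps."""
    global target_params
    if step % sync_every == 0:
        target_params = dict(online_params)
```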
