SLIDE 1

Deep Reinforcement Learning

Axel Perschmann

Supervisor: Ahmed Abdulkadir
Seminar: Current Works in Computer Vision
Research Group: Pattern Recognition and Image Processing
Albert-Ludwigs-Universität Freiburg

7 July 2016
SLIDE 2

Contents

1. Reinforcement Learning: Basics, Learning Methods
2. Asynchronous Reinforcement Learning: Related Work, Multi-threaded Learning
3. Experiments: Benchmark Environments, Network Architecture, Results
4. Conclusion


SLIDE 3

Motivation: Learning by experience

The agent interacts with the environment over a number of discrete time steps t. At each step it:
• chooses an action based on the current situation
• receives feedback
• updates its future choice of actions

Image: de.slideshare.net/ckmarkohchang/language-understanding-for-textbased-games-using-deep-reinforcement-learning

SLIDE 4

Motivation: Learning by experience (Environment)

A Markov decision process is a 5-tuple $(T, S, A, f, c)$:
• discrete decision points $t \in T = \{0, 1, \ldots, N\}$
• system state $s_t \in S$
• actions $a_t \in A$
• transition function $s_{t+1} = f(s_t, a_t, w_t)$, with transition probabilities $p_{ij}(a)$
• direct costs/rewards $c : S \times A \to \mathbb{R}$

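To make the 5-tuple concrete, here is a toy two-state sketch (all states, probabilities, and rewards are invented for illustration; the random policy is a placeholder):

```python
import numpy as np

# Toy MDP with S = {0, 1} and A = {0, 1} (numbers invented).
# P[s, a, s'] plays the role of the transition probabilities p_ij(a);
# R[s, a] is the direct reward c(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

rng = np.random.default_rng(0)
s = 0
for t in range(5):                      # discrete decision points t
    a = int(rng.integers(2))            # placeholder random policy
    s_next = int(rng.choice(2, p=P[s, a]))  # s_{t+1} = f(s_t, a_t, w_t)
    print(t, s, a, R[s, a], s_next)
    s = s_next
```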

SLIDE 5

Motivation: Learning by experience (Policy)

The agent's behavior is defined by its policy $\pi$, with $\pi_t(s_t) = a$. A policy can be stationary or non-stationary: $\hat{\pi} = (\pi, \pi, \ldots, \pi)$ or $\hat{\pi} = (\pi_1, \pi_2, \ldots, \pi_N)$, respectively.


SLIDE 6-7

Motivation: Learning by experience (Goal)

RL goal: maximize the long-term return, i.e.
• learn the optimal policy $\pi^*$
• $\pi^*$ always selects the action $a$ that maximizes the long-term return
• estimate the value of states and actions



SLIDE 8

Value functions

State value function: expected return when following $\pi$ from state $s$:

$$V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big]$$

Action value function: expected return starting from state $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big]$$

where $0 \le \gamma \le 1$ is the discount rate.

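The expectations above can be approximated empirically. A small sketch of the discounted return; the `rollout` helper in the comment is hypothetical:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{k>=0} gamma^k * r_{t+k+1} for one observed reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# V^pi(s) is an expectation, so it can be estimated by averaging returns over
# many episodes that start in s and follow pi. `rollout` is a hypothetical
# helper returning the reward sequence of one such episode:
# v_hat = np.mean([discounted_return(rollout(s)) for _ in range(1000)])
```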

SLIDE 9-11

Learning to choose ’good’ actions

Simply extract the optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$. Problems:

1. A table $Q(s, a)$ is only feasible for small environments.
⇒ replace $Q(s, a)$ by a function approximator $Q(s, a; \theta)$, e.g. using a neural network

2. Unknown environment: unknown transition functions, unknown rewards.
⇒ no model ⇒ learn the optimal $Q^*(s, a; \theta)$ by updating $\theta$



SLIDE 12

Learn optimal value functions

Source: Reinforcement Learning Lecture, ue01.pdf

’trial and error’ approach

• take action $a_t$ according to the $\epsilon$-greedy policy
• receive the new state $s_{t+1}$ and reward $r_t$
• continue until a terminal state is reached
• use the history to change $Q(s, a; \theta)$ or $V(s; \theta)$, then restart

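A minimal sketch of the $\epsilon$-greedy choice used in this loop (the function name is ours, not from the slides):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # random exploratory action
    return int(np.argmax(q_values))               # greedy action w.r.t. Q

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))
```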

SLIDE 13-16

Value-based model-free reinforcement learning methods

Strategies to update the parameters $\theta$ iteratively:

One-step Q-learning (off-policy technique):

$$L_i(\theta_i) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)^2\Big] \quad (1)$$


One-step SARSA (on-policy technique):

$$L_i(\theta_i) = \mathbb{E}\Big[\big(r + \gamma Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)^2\Big] \quad (2)$$

⇒ obtaining a reward only affects the $(s, a)$-pair that led to the reward.

n-step Q-learning (off-policy technique), with target

$$r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n \max_a Q(s_{t+n+1}, a; \theta_i) \quad (3)$$
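The three targets differ only in how they bootstrap. A numpy sketch (function names are ours, not from the paper):

```python
import numpy as np

GAMMA = 0.99

def q_learning_target(r, q_next, terminal):
    """Eq. (1): y = r + gamma * max_a' Q(s', a'; theta_{i-1})."""
    return r if terminal else r + GAMMA * np.max(q_next)

def sarsa_target(r, q_next, a_next, terminal):
    """Eq. (2): y = r + gamma * Q(s', a'; theta_{i-1}), with a' the action
    the current policy actually takes in s' (on-policy)."""
    return r if terminal else r + GAMMA * q_next[a_next]

def n_step_targets(rewards, bootstrap):
    """Eq. (3): backward recursion R <- r + gamma * R, seeded with
    max_a Q(s_{t+n+1}, a) passed as `bootstrap`."""
    targets, R = [], bootstrap
    for r in reversed(rewards):
        R = r + GAMMA * R
        targets.append(R)
    return targets[::-1]
```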

SLIDE 17-18

Policy-based model-free reinforcement learning method

Alternative approach: parameterize the policy $\pi(s; \theta)$ and update the parameters $\theta$ towards the gradient

$$\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t(s_t)) \quad (4)$$

i.e. scale the gradient by the estimate's certainty and by the advantage of taking action $a_t$ in state $s_t$:
• $R_t$ is an estimate of $Q^\pi(s_t, a_t)$
• $b_t$ is an estimate of $V^\pi(s_t)$


Source: Reinforcement Learning Lecture, ue07.pdf

⇒ actor-critic architecture:
• Actor: the policy $\pi$
• Critic: the baseline $b_t(s_t) \approx V^\pi(s_t)$

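A sketch of Eq. (4) as a PyTorch loss for a single $(s_t, a_t)$ pair; the 0.5 weight on the critic term is our assumption, not from the slides:

```python
import torch

def actor_critic_loss(log_prob_a, value, R):
    """log_prob_a: log pi(a_t | s_t; theta) from the actor,
    value: b_t(s_t) from the critic, R: estimated return R_t."""
    advantage = R - value.detach()           # (R_t - b_t(s_t)), critic frozen
    policy_loss = -log_prob_a * advantage    # actor: ascend Eq. (4)
    value_loss = (R - value) ** 2            # critic: regress b_t towards R_t
    return policy_loss + 0.5 * value_loss    # 0.5 weighting is our choice
```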

SLIDE 19

Contents

1. Reinforcement Learning: Basics, Learning Methods
2. Asynchronous Reinforcement Learning: Related Work, Multi-threaded Learning
3. Experiments: Benchmark Environments, Network Architecture, Results
4. Conclusion


SLIDE 20

Related Work (1)

Deep Q-Networks (DQN) [Mnih et al., 2015]: a deep neural network as non-linear function approximator. Techniques to avoid divergence:
• Experience replay: perform Q-learning updates on random samples of past experience.
• Target network: fix the network for several thousand iterations before updating its weights.
⇒ makes the training data less non-stationary and stabilizes training.

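A minimal sketch of an experience replay buffer (the capacity and batch size are illustrative, not the paper's values):

```python
import random
from collections import deque

replay = deque(maxlen=100_000)            # fixed-capacity replay memory

def store(s, a, r, s2, terminal):
    replay.append((s, a, r, s2, terminal))

def sample_minibatch(batch_size=32):
    """Q-learning updates are computed on random past transitions,
    which breaks the temporal correlation of consecutive samples."""
    return random.sample(replay, batch_size)
```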

SLIDE 21

Related Work (2)

General Reinforcement Learning Architecture (Gorila) of Nair et al. [2015]: asynchronous training in a distributed setting. Each process
• has its own environment copy
• uses experience replay
• updates a target copy of the model
• occasionally receives updated parameters from the target network
⇒ 130 CPU cores with 100 parallel actor-learners.


SLIDE 22-23

Asynchronous Reinforcement Learning

Deep neural network as non-linear function approximator; multiple threads on a single machine:
• minimize communication cost
• reduce training time (roughly linear in the number of threads)

Different exploration policies in each actor:
• maximize diversity and decorrelate updates in time
• no experience replay



Asynchronous variants of four standard RL algorithms:
• one-step Q-learning
• one-step SARSA
• n-step Q-learning
• advantage actor-critic


SLIDE 24

Asynchronous one-step Q-learning

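As a stand-in for the algorithm box from Mnih et al. [2016], here is a minimal runnable sketch of the per-thread loop, with a tabular Q on the toy MDP from Slide 4 replacing the network $Q(s, a; \theta)$. The paper accumulates gradients over several steps before applying them; this is simplified here to a locked in-place update:

```python
import threading
import numpy as np

# Toy MDP from Slide 4; the tabular Q stands in for Q(s, a; theta).
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
GAMMA, ALPHA, EPS = 0.99, 0.1, 0.1
Q = np.zeros((2, 2))          # shared parameters theta
Q_target = Q.copy()           # shared target parameters theta^-
step_count = [0]              # shared global step counter
lock = threading.Lock()

def worker(seed, steps=20000, target_every=1000):
    rng = np.random.default_rng(seed)     # different exploration per thread
    s = 0
    for _ in range(steps):
        a = int(rng.integers(2)) if rng.random() < EPS else int(np.argmax(Q[s]))
        s2 = int(rng.choice(2, p=P[s, a]))
        y = R[s, a] + GAMMA * np.max(Q_target[s2])   # one-step target, Eq. (1)
        with lock:                                   # async update of theta
            Q[s, a] += ALPHA * (y - Q[s, a])
            step_count[0] += 1
            if step_count[0] % target_every == 0:
                Q_target[:] = Q                      # refresh theta^-
        s = s2

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(Q)
```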

SLIDE 25

Asynchronous one-step SARSA

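The SARSA worker differs only in its on-policy bootstrap: the thread commits to the next action before updating. A sketch reusing `Q`, `Q_target`, `P`, `R`, `lock`, and the constants from the previous block:

```python
def sarsa_worker(seed, steps=20000):
    """On-policy variant: bootstrap on the action actually taken in s'."""
    rng = np.random.default_rng(seed)
    def pick(s):
        return int(rng.integers(2)) if rng.random() < EPS else int(np.argmax(Q[s]))
    s = 0
    a = pick(s)
    for _ in range(steps):
        s2 = int(rng.choice(2, p=P[s, a]))
        a2 = pick(s2)                              # a' chosen by the policy
        y = R[s, a] + GAMMA * Q_target[s2, a2]     # SARSA target, Eq. (2)
        with lock:
            Q[s, a] += ALPHA * (y - Q[s, a])
        s, a = s2, a2
```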

SLIDE 26

Asynchronous n-step Q-learning

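A sketch of the n-step variant: act for up to n steps, then update every visited pair with its own n-step return (same shared names as above; gradient accumulation again omitted):

```python
def n_step_q_worker(seed, n=5, segments=4000):
    rng = np.random.default_rng(seed)
    s = 0
    for _ in range(segments):
        traj = []
        for _ in range(n):                          # act for up to n steps
            a = int(rng.integers(2)) if rng.random() < EPS else int(np.argmax(Q[s]))
            s2 = int(rng.choice(2, p=P[s, a]))
            traj.append((s, a, R[s, a]))
            s = s2
        ret = np.max(Q_target[s])          # bootstrap: max_a Q(s_{t+n+1}, a)
        with lock:
            for (si, ai, ri) in reversed(traj):     # R <- r + gamma * R
                ret = ri + GAMMA * ret
                Q[si, ai] += ALPHA * (ret - Q[si, ai])
```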

SLIDE 27

Asynchronous advantage actor-critic (A3C)

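A3C replaces the Q-table with the actor-critic update of Eq. (4). A PyTorch sketch of one update from a short rollout of a single actor-learner; the entropy bonus follows the paper's description, while the exact coefficients are our assumptions:

```python
import torch

def a3c_update(model, optimizer, trajectory, bootstrap_value,
               gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """trajectory: list of (log_prob_a, value, entropy, reward) collected by
    one actor-learner over at most t_max steps; bootstrap_value: V(s_{t+n})
    or 0 at a terminal state. Gradients flow into the shared model."""
    R = bootstrap_value
    policy_loss = value_loss = entropy_bonus = 0.0
    for log_prob, value, entropy, reward in reversed(trajectory):
        R = reward + gamma * R                    # n-step return
        advantage = R - value.detach()
        policy_loss = policy_loss - log_prob * advantage
        value_loss = value_loss + (R - value).pow(2)
        entropy_bonus = entropy_bonus + entropy   # encourages exploration
    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```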

SLIDE 28

Contents

1. Reinforcement Learning: Basics, Learning Methods
2. Asynchronous Reinforcement Learning: Related Work, Multi-threaded Learning
3. Experiments: Benchmark Environments, Network Architecture, Results
4. Conclusion


SLIDE 29

Benchmark Environments (1)

Arcade Learning Environment [Bellemare et al., 2013]:
• simulator for 57 Atari 2600 games
• most commonly used benchmark for RL algorithms
• input: 160x210 frames with 128 colors
• actions: 18 (discrete joystick positions and buttons)
• rewards: game score (sparse)


SLIDE 30

Benchmark Environments (2)

The Open Racing Car Simulator (TORCS) [Wymann et al., 2014]:
• input: RGB frames
• actions: discretized
• rewards: the agent's speed (permanent)
• four settings: slow/fast car, with/without opponent bots


SLIDE 31

Benchmark Environments (3)

Physics simulator MuJoCo [Todorov et al., 2012]:
• motor control tasks
• input: physical state (joint positions, velocities), RGB frames
• challenge: continuous action space ⇒ policy-based method
• proof of concept: A3C was designed for discrete control problems


SLIDE 32

Benchmark Environments (4)

Labyrinth:
• randomly generated 3D mazes
• input: 84x84 RGB frames
• rewards: +1 (picking up fruit), +10 (entering a portal)
• challenge: exploration of random mazes within 60 seconds
• only used to evaluate the best-performing agent


SLIDE 33-34

Network Architecture

Network architecture from Mnih et al. [2013]:
• convolutional layer with 16 filters of size 8x8, stride 4
• convolutional layer with 32 filters of size 4x4, stride 2
• fully connected layer with 256 hidden units
(all hidden layers followed by a rectifier nonlinearity)

Output layer:
• value-based methods: linear output, the action value $Q(s, a)$
• actor-critic method: a linear output for the state value $V(s)$ and a softmax output for $\pi(a \mid s)$, the probability of selecting each action $a$

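A PyTorch sketch of this trunk with both heads, assuming the standard preprocessed 84x84x4 input of Mnih et al. [2013] (the preprocessing itself is not shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACNetwork(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)   # 16 @ 20x20
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)  # 32 @ 9x9
        self.fc = nn.Linear(32 * 9 * 9, 256)
        self.policy_head = nn.Linear(256, num_actions)  # softmax -> pi(a|s)
        self.value_head = nn.Linear(256, 1)             # linear  -> V(s)

    def forward(self, x):                  # x: (batch, 4, 84, 84)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        return F.softmax(self.policy_head(x), dim=-1), self.value_head(x)

# For the value-based methods, replace the two heads by a single
# nn.Linear(256, num_actions) emitting Q(s, a).
pi, v = ACNetwork(num_actions=18)(torch.zeros(1, 4, 84, 84))
```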

SLIDE 35

Experimental Setup

Single machine: 16 actor-learner threads, no GPU; updates after every 5 actions.
• Optimizer: shared RMSProp
• initial learning rate sampled from LogUniform($10^{-4}$, $10^{-2}$)
• learning rate annealed to 0 over the course of training

Value-based methods:
• shared target network, updated every 40,000 frames
• exploration rate $\epsilon$ decreasing over the first four million frames

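LogUniform sampling means the exponent, not the rate itself, is drawn uniformly; a one-line sketch:

```python
import numpy as np

rng = np.random.default_rng()
learning_rate = 10 ** rng.uniform(-4, -2)   # LogUniform(1e-4, 1e-2)
```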

SLIDE 36

Atari 2600 games: Training speed

DQN was trained on a single Nvidia K40 GPU; the proposed methods were trained on 16 CPU cores.
• asynchronous methods learn faster than DQN
• n-step methods learn faster than one-step methods
• winner: the policy-based actor-critic method


SLIDE 37

Atari 2600 games: Training speedup using threads

Time required to reach a fixed reference score, measured over seven Atari games:
• at least a 12x speedup with 16 worker threads
• superlinear speedup for one-step methods

Assumption: with multiple threads, one-step methods often require less data to achieve a particular score.


SLIDE 38

Atari 2600 games: Score comparison

A3C, LSTM: additionally used 256 LSTM cells after the final hidden layer.

A3C outperforms the current state-of-the-art methods in half the training time.


SLIDE 39

TORCS: Score vs. Training Time

A3C reached 75-90% of the score obtained by a human tester


SLIDE 40

MuJoCo: Score vs. Learning Rate

A wide range of learning rates leads to good performance.

https://youtu.be/Ajjc08-iPx8


SLIDE 41

Labyrinth: Score vs. Training Steps

Top 5 agents out of 50 random learning rates. Training took approximately 3 days.


SLIDE 42

Contents

1. Reinforcement Learning: Basics, Learning Methods
2. Asynchronous Reinforcement Learning: Related Work, Multi-threaded Learning
3. Experiments: Benchmark Environments, Network Architecture, Results
4. Conclusion


SLIDE 43

Conclusion

Mnih et al. [2016] presented asynchronous variants of four standard RL algorithms. Stable training through reinforcement learning is possible for:
• value-based and policy-based methods
• off-policy and on-policy methods
• discrete and continuous domains

Parallel actor-learners have a stabilizing effect: Q-learning is possible without experience replay, and the result is a new state of the art on the Atari domain.


SLIDE 44

References

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279.

Chang, M. (2015). Language understanding for text-based games using deep reinforcement learning. de.slideshare.net/ckmarkohchang/language-understanding-for-textbased-games-using-deep-reinforcement-learning.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783.

Nair, A., Srinivasan, P., Blackwell, S., et al. (2015). Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296.

Sutton, R. (2015). Introduction to reinforcement learning. http://slideplayer.com/slide/7966867.

Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. http://www.cs.washington.edu/homes/todorov/papers/mujoco.pdf.

Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. (2014). TORCS, The Open Racing Car Simulator. http://www.torcs.org.

SLIDE 45

Thank you for your attention! Any questions?