  1. Deep Reinforcement Learning. M. Soleymani, Sharif University of Technology, Fall 2017. Slides are based on Fei-Fei Li and colleagues' lectures (CS231n, Stanford, 2017) and some of Sergey Levine's lectures (CS294-112, Berkeley, 2016).

  2. Supervised Learning • Data: (x, y) – x is data – y is label • Goal: Learn a function to map x -> y • Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

  3. Unsupervised Learning • Data: x – Just data, no labels! • Goal: Learn some underlying hidden structure of the data • Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

  4. Reinforcement Learning • Problems involving an agent interacting with an environment, which provides numeric reward signals • Goal: learn how to take actions in order to maximize reward – Concerned with taking sequences of actions • Described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward

  5. Overview • What is Reinforcement Learning? • Markov Decision Processes • Q-Learning • Policy Gradients

  6. Reinforcement Learning

  7. Reinforcement Learning

  8. Reinforcement Learning

  9. Reinforcement Learning

  10. Robot Locomotion

  11. Motor Control and Robotics • Robotics: – Observations: camera images, joint angles – Actions: joint torques – Rewards: stay balanced, navigate to target locations, serve and protect humans

  12. Atari Games

  13. Go

  14. How Does RL Relate to Other Machine Learning Problems? • Differences between RL and supervised learning: – You don't have full access to the function you're trying to optimize • must query it through interaction – Interacting with a stateful world: inputs x_t depend on your previous actions

  15. How can we mathematically formalize the RL problem?

  16. Markov Decision Process

  17. Markov Decision Process • At time step t = 0, environment samples initial state s_0 ~ p(s_0) • Then, for t = 0 until done: – Agent selects action a_t – Environment samples reward r_t ~ R(· | s_t, a_t) – Environment samples next state s_{t+1} ~ P(· | s_t, a_t) – Agent receives reward r_t and next state s_{t+1} • A policy π is a function from S to A that specifies what action to take in each state • Objective: find policy π* that maximizes cumulative discounted reward: Σ_{t≥0} γ^t r_t = r_0 + γ r_1 + γ² r_2 + ⋯
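
To make the interaction loop concrete, here is a minimal sketch of one episode in an MDP. The `env` and `policy` interfaces below are illustrative assumptions, not part of the slides:

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """One episode of agent-environment interaction in an MDP (sketch)."""
    state = env.reset()                               # environment samples s_0 ~ p(s_0)
    discounted_return, discount = 0.0, 1.0
    for t in range(max_steps):
        action = policy(state)                        # agent selects a_t = pi(s_t)
        reward, next_state, done = env.step(action)   # environment samples r_t and s_{t+1}
        discounted_return += discount * reward        # accumulate gamma^t * r_t
        discount *= gamma
        state = next_state
        if done:
            break
    return discounted_return
```

The RL objective is then to find the policy π that maximizes the expected value of this discounted return over episodes.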

  18. A simple MDP: Grid World

  19. A simple MDP: Grid World

  20. The optimal policy π* • We want to find the optimal policy π* that maximizes the sum of rewards. • How do we handle the randomness (initial state, transition probabilities, ...)? – Maximize the expected sum of rewards!
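
Written out (a standard formalization consistent with the MDP objective on slide 17, not reproduced verbatim from this slide):

```latex
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, \pi\right],
\qquad \text{with } s_0 \sim p(s_0),\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t).
```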

  21. Definitions: Value function and Q-value function

  22. Value function for policy π
      V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]          Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
      • V^π(s): how good it is for the agent to be in state s when its policy is π – it is simply the expected sum of discounted rewards upon starting in state s and taking actions according to π
      • Bellman equations:
      V^π(s) = E[ r + γ V^π(s′) | s, π ]                   Q^π(s, a) = E[ r + γ Q^π(s′, a′) | s, a, π ]

  23. Bellman optimality equation
      V*(s) = max_{a∈A(s)} E[ r + γ V*(s′) | s, a ]
      Q*(s, a) = E[ r + γ max_{a′} Q*(s′, a′) | s, a ]

  24. Bellman equation

  25. Optimal policy • It can also be computed as: π*(s) = argmax_{a∈A(s)} Q*(s, a)

  26. Solving for the optimal policy: Value iteration
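
This slide's body is not preserved in the transcript; as a reminder, value iteration repeatedly applies the Bellman optimality update V(s) ← max_a E[ r + γ V(s′) | s, a ] until convergence. A tabular sketch under the assumption that the MDP's rewards and transitions are given as explicit arrays (all names are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s, a, s'] = transition probability, R[s, a] = expected reward (sketch).
    Iterates V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, a, s') V(s') ]."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V          # Q[s, a] under the current value estimate
        V_new = Q.max(axis=1)          # Bellman optimality update
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
    return V_new, policy
```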

  27. Solving for the optimal policy: Q-learning algorithm
      • Initialize Q(s, a) arbitrarily
      • Repeat (for each episode):
         – Initialize s
         – Repeat (for each step of the episode):
            • Choose a from s using a policy derived from Q (e.g., greedy, ε-greedy)
            • Take action a, receive reward r, observe new state s′
            • Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
            • s ← s′
         – until s is terminal
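
A minimal tabular sketch of this algorithm; the environment interface and hyperparameter values are illustrative assumptions:

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch following the pseudocode above."""
    Q = np.zeros((n_states, n_actions))          # initialize Q(s, a) arbitrarily
    for _ in range(n_episodes):
        s = env.reset()                          # initialize s
        done = False
        while not done:                          # for each step of the episode
            # epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            r, s_next, done = env.step(a)        # take a, receive r, observe s'
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```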

  28. Problem • Not scalable: must compute Q(s, a) for every state-action pair – It is computationally infeasible to compute this for the entire state space! • Solution: use a function approximator to estimate Q(s, a) – E.g., a neural network!

  29. Solving for the optimal policy: Q-learning

  30. Solving for the optimal policy: Q-learning Iteratively try to make the Q-value close to the target value (y_i) it should have, according to the Bellman equation. [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
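
Concretely, the per-iteration regression problem from the cited paper can be written as below; using the previous iteration's parameters θ_{i−1} in the target follows the 2013 workshop paper (the Nature 2015 version uses a separate, periodically copied target network):

```latex
y_i = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right],
\qquad
L_i(\theta_i) = \mathbb{E}_{s,a}\!\left[\big( y_i - Q(s, a; \theta_i) \big)^{2}\right].
```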

  31. Case Study: Playing Atari Games (seen before) [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  32. Q-network Architecture • Last FC layer has 4-d output (if 4 actions): Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4) • Number of actions is between 4 and 18, depending on the Atari game • A single feedforward pass computes Q-values for all actions from the current state => efficient! [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
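
The slide does not list exact layer sizes, so the sketch below fills them in from the 2013 workshop paper (4 stacked 84×84 frames, two convolutional layers, one 256-unit fully connected layer); treat those numbers as assumptions. A minimal PyTorch version:

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q-network sketch: input is a stack of 4 preprocessed 84x84 frames,
    output is one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 16x20x20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # -> 32x9x9
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),   # Q(s, a_1), ..., Q(s, a_n)
        )

    def forward(self, x):
        return self.head(self.features(x))

# One forward pass yields Q-values for all actions of a batch of states.
q_net = AtariQNetwork(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))   # shape: (1, 4)
```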

  33. Training the Q-network: Experience Replay • Learning from batches of consecutive samples is problematic: – Samples are correlated => inefficient learning – Current Q-network parameters determine the next training samples • can lead to bad feedback loops • Address these problems using experience replay: – Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) – Train the Q-network on random minibatches of transitions from the replay memory • Each transition can also contribute to multiple weight updates => greater data efficiency • Smooths out learning and avoids oscillations or divergence in the parameters
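
A minimal sketch of such a replay memory; the capacity value and the extra `done` flag are illustrative additions:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (s_t, a_t, r_t, s_{t+1}, done) transitions and serves random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive samples.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```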

  34. Putting it together: Deep Q-Learning with Experience Replay [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  35. Putting it together: Deep Q-Learning with Experience Replay Initialize replay memory, Q-network [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  36. Putting it together: Deep Q-Learning with Experience Replay Play M episodes (full games) [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  37. Putting it together: Deep Q-Learning with Experience Replay Initialize state (starting game screen pixels) at the beginning of each episode [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  38. Putting it together: Deep Q-Learning with Experience Replay For each time-step of game [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  39. Putting it together: Deep Q-Learning with Experience Replay With small probability, select a random action (explore), otherwise select greedy action from current policy [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  40. Putting it together: Deep Q-Learning with Experience Replay Take the selected action, observe the reward and next state [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  41. Putting it together: Deep Q-Learning with Experience Replay Store transition in replay memory [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  42. Putting it together: Deep Q-Learning with Experience Replay Sample a random minibatch of transitions and perform a gradient descent step [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
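
Putting the preceding steps together, here is a compact sketch of the whole loop, reusing the Q-network and replay memory sketched above. The simplified environment interface, the hyperparameter values, and the use of the online network itself for the Bellman target (the Nature 2015 paper uses a separate target network) are all assumptions:

```python
import random
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, memory, n_actions, n_episodes=100,
              batch_size=32, gamma=0.99, epsilon=0.1, lr=2.5e-4):
    """Deep Q-learning with experience replay (sketch). `env` is assumed to return
    preprocessed state tensors from reset() and (next_state, reward, done) from step();
    `q_net` and `memory` are the sketches above."""
    optimizer = torch.optim.RMSprop(q_net.parameters(), lr=lr)

    for episode in range(n_episodes):                      # play M episodes
        state, done = env.reset(), False                   # initial game screen
        while not done:                                    # each time step of the game
            # epsilon-greedy: explore with small probability, otherwise act greedily.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    action = q_net(state.unsqueeze(0)).argmax(dim=1).item()

            next_state, reward, done = env.step(action)    # observe r_t and s_{t+1}
            memory.push(state, action, reward, next_state, done)
            state = next_state

            if len(memory) < batch_size:
                continue
            # Sample a random minibatch and perform one gradient descent step.
            states, actions, rewards, next_states, dones = memory.sample(batch_size)
            states, next_states = torch.stack(states), torch.stack(next_states)
            actions = torch.tensor(actions)
            rewards = torch.tensor(rewards, dtype=torch.float32)
            dones = torch.tensor(dones, dtype=torch.float32)

            q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            with torch.no_grad():                          # Bellman target y_i
                target = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
            loss = F.mse_loss(q_sa, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```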

  43. Results on 49 Games • The architecture and hyperparameter values were the same for all 49 games. • DQN achieved performance comparable to or better than an experienced human on 29 out of 49 games. [V. Mnih et al., Human-level control through deep reinforcement learning, Nature 2015]

  44. Policy Gradients • What is a problem with Q-learning? – The Q-function can be very complicated! • Hard to learn the exact value of every (state, action) pair • But the policy can be much simpler • Can we learn a policy directly, e.g. by finding the best policy from a collection of policies?

  45. The goal of RL: the policy that must be learnt

  46. Policy Gradients

  47. REINFORCE algorithm

  48. REINFORCE algorithm

  49. REINFORCE algorithm

  50. ⇒ Can estimate with Monte Carlo sampling

  51. REINFORCE algorithm

  52. REINFORCE algorithm
      ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} r(τ^(i)) ∇_θ log p(τ^(i); θ)
      ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} ( Σ_{t≥0} r(s_t^(i), a_t^(i)) ) ( Σ_{t≥0} ∇_θ log π_θ(a_t^(i) | s_t^(i)) )
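
A minimal sketch of one REINFORCE update implementing the estimator above; the `policy` module (assumed to return a torch Categorical distribution over actions) and the trajectory format are assumptions:

```python
import torch

def reinforce_update(policy, optimizer, trajectories, gamma=1.0):
    """One gradient ascent step on (1/N) sum_i r(tau_i) * sum_t log pi_theta(a_t | s_t).
    `trajectories` is a list of (states, actions, rewards) tuples sampled with the current policy."""
    loss = 0.0
    for states, actions, rewards in trajectories:
        # Total (optionally discounted) return of the trajectory, r(tau).
        traj_return = sum((gamma ** t) * r for t, r in enumerate(rewards))
        # Sum of log pi_theta(a_t | s_t) over the trajectory.
        log_probs = sum(policy(s).log_prob(torch.tensor(a)) for s, a in zip(states, actions))
        # Minimizing the negative performs gradient ascent on the estimator.
        loss = loss - traj_return * log_probs
    loss = loss / len(trajectories)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```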
