CS 4803 / 7643: Deep Learning
Topics: Policy Gradients, Actor-Critic


  1. CS 4803 / 7643: Deep Learning
  • Topics:
    – Policy Gradients
    – Actor-Critic
  • Zsolt Kira, Georgia Tech

  2. Administrative
  • PS3/HW3 due Tuesday 03/31
  • PS4/HW4 is optional and due 04/03
  • There are lots of bonus/extra-credit questions there!
  • Sessions with Facebook for project (fill out spreadsheet)

  3. Administrative
  • How to ask questions during live lecture:
    – Use the Q&A window (other students can upvote)
    – Raise hands

  4. Topics we’ll cover
  • Overview of RL
    – RL vs. other forms of learning
    – RL “API”
    – Applications
  • Framework: Markov Decision Processes (MDPs)
    – Definitions and notation
    – Policies and value functions
  • Solving MDPs
    – Value Iteration (recap)
    – Q-Value Iteration (new)
    – Policy Iteration
  • Reinforcement learning
    – Value-based RL (Q-learning, Deep Q-Learning)
    – Policy-based RL (policy gradients)
    – Actor-Critic

  5. Recap: MDPs
  • Markov Decision Processes (MDPs) are defined by:
  • States: $s \in \mathcal{S}$
  • Actions: $a \in \mathcal{A}$
  • Rewards: $r = R(s, a, s')$
  • Transition function: $T(s, a, s') = p(s' \mid s, a)$
  • Discount factor: $\gamma \in [0, 1]$
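To make the definitions concrete, here is a minimal sketch of a tabular MDP in plain Python. The tiny two-state chain, its state/action names, and the `T`/`R` dictionaries are illustrative assumptions, not from the slides.

```python
# A minimal tabular MDP as plain Python data (toy example, not from the lecture).
from typing import Dict, List, Tuple

states = ["s0", "s1", "terminal"]
actions = ["left", "right"]
gamma = 0.9  # discount factor

# T[(s, a)] -> list of (next_state, probability); R[(s, a, s')] -> reward
T: Dict[Tuple[str, str], List[Tuple[str, float]]] = {
    ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("terminal", 1.0)],
    ("s1", "left"):  [("s0", 1.0)],
}
R = {("s1", "right", "terminal"): 1.0}  # all other transitions give reward 0
```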

  6. Value Function
  Following a policy $\pi$ produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \ldots$
  How good is a state? The value function at state s is the expected cumulative reward from state s (and following the policy thereafter):
  $$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\Big]$$
  How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s (and following the policy thereafter):
  $$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\Big]$$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  7. Optimal Quantities
  Following the optimal policy produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \ldots$
  How good is a state? The optimal value function at state s is the expected cumulative reward from state s, acting optimally thereafter:
  $$V^*(s) = \max_{\pi} \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\Big]$$
  How good is a state-action pair? The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter:
  $$Q^*(s, a) = \max_{\pi} \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\Big]$$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  8. Recap: Optimal Value Function
  The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter:
  $$Q^*(s, a) = \max_{\pi} \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\Big]$$

  9. Recap: Optimal Value Function
  The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter:
  $$Q^*(s, a) = \max_{\pi} \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\Big]$$
  Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$

  10. Bellman Optimality Equations
  • Relations: $V^*(s) = \max_a Q^*(s, a)$ and $\pi^*(s) = \arg\max_a Q^*(s, a)$
  • Recursive optimality equations:
  $$Q^*(s, a) = \sum_{s'} T(s, a, s')\Big[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\Big]$$
  $$V^*(s) = \max_a \sum_{s'} T(s, a, s')\Big[R(s, a, s') + \gamma V^*(s')\Big]$$

  11. Value Iteration (VI)
  Repeatedly apply the Bellman backup until the values converge:
  $$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')\Big[R(s, a, s') + \gamma V_k(s')\Big]$$
  [NOTE: Here we are showing calculations for the action we know is the argmax (go right), but in general we have to compute this for each action and return the max.]
  Slide credit: Pieter Abbeel
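As a rough sketch (reusing the toy `states`/`actions`/`T`/`R`/`gamma` assumed above), value iteration is a short loop; the fixed iteration count and tabular representation are illustrative choices, not from the lecture.

```python
# Value iteration: repeatedly apply the Bellman backup over all states.
def value_iteration(states, actions, T, R, gamma, n_iters=100):
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V_new = {}
        for s in states:
            # One Q-value per available action; terminal states have none.
            q_values = [
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T[(s, a)])
                for a in actions if (s, a) in T
            ]
            V_new[s] = max(q_values) if q_values else 0.0
        V = V_new
    return V
```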

  12. Snapshot of Demo – Gridworld V Values
  Noise = 0.2, Discount = 0.9, Living reward = 0
  Slide Credit: http://ai.berkeley.edu

  13. Computing Actions from Values
  • Let’s imagine we have the optimal values V*(s)
  • How should we act? It’s not obvious!
  • We need to do a one-step calculation:
  $$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\Big[R(s, a, s') + \gamma V^*(s')\Big]$$
  • This is called policy extraction, since it gets the policy implied by the values
  Slide Credit: http://ai.berkeley.edu
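A sketch of that one-step lookahead in code, under the same toy-MDP assumptions as before. Note that it needs the model (`T` and `R`):

```python
# Policy extraction: one-step lookahead using the model T and R.
def extract_policy(states, actions, T, R, gamma, V):
    policy = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in T:
                continue
            q = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T[(s, a)])
            if q > best_q:
                best_a, best_q = a, q
        policy[s] = best_a  # None for terminal states with no actions
    return policy
```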

  14. Snapshot of Demo – Gridworld Q Values
  Noise = 0.2, Discount = 0.9, Living reward = 0
  Slide Credit: http://ai.berkeley.edu

  15. Computing Actions from Q-Values
  • Let’s imagine we have the optimal Q-values Q*(s, a)
  • How should we act? Completely trivial to decide: $\pi^*(s) = \arg\max_a Q^*(s, a)$
  • Important lesson: actions are easier to select from Q-values than from values!
  Slide Credit: http://ai.berkeley.edu
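The contrast with the policy-extraction sketch above shows up directly in code: with Q-values, action selection needs no model at all. A minimal (illustrative) selector:

```python
# Acting from Q-values: no transition or reward model required.
def act_from_q(Q, actions, s):
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```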

  16. Recap: Learning-Based Methods
  • Typically, we don’t know the environment:
  • The transition function T(s, a, s') is unknown: we don’t know how actions affect the environment
  • The reward function R(s, a, s') is unknown: we don’t know what/when the good actions are

  17. Recap: Learning-Based Methods
  • Typically, we don’t know the environment:
  • The transition function T(s, a, s') is unknown: we don’t know how actions affect the environment
  • The reward function R(s, a, s') is unknown: we don’t know what/when the good actions are
  • But we can learn by trial and error:
    – Gather experience (data) by performing actions
    – Approximate the unknown quantities from data

  18. Sample-Based Policy Evaluation?
  • We want to improve our estimate of V by computing these averages:
  $$V_{k+1}^{\pi}(s) \leftarrow \sum_{s'} T(s, \pi(s), s')\Big[R(s, \pi(s), s') + \gamma V_k^{\pi}(s')\Big]$$
  • Idea: take samples of outcomes s' (by doing the action!) and average:
  $$V_{k+1}^{\pi}(s) \leftarrow \frac{1}{n} \sum_{i} \Big[r_i + \gamma V_k^{\pi}(s'_i)\Big]$$
  [Figure: state s under action π(s) branching to sampled successors s'_1, s'_2, s'_3]
  • What’s the difficulty of this algorithm? Almost works! But we can’t rewind time to get sample after sample from state s.

  19. Temporal Difference Learning
  • Big idea: learn from every experience!
    – Update V(s) each time we experience a transition (s, a, s', r)
    – Likely outcomes s' will contribute updates more often
  • Temporal difference learning of values
    – Policy still fixed, still doing evaluation!
    – Move values toward the value of whatever successor occurs: running average
  Sample of V(s): $\text{sample} = r + \gamma V^{\pi}(s')$
  Update to V(s): $V^{\pi}(s) \leftarrow (1 - \alpha)\, V^{\pi}(s) + \alpha \cdot \text{sample}$
  Same update: $V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\, (\text{sample} - V^{\pi}(s))$
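The update on the slide is a one-liner in code. This sketch assumes a tabular dictionary `V` and a single observed transition; the default values of `alpha` and `gamma` are illustrative:

```python
# TD(0) update for one observed transition (s, a, s', r).
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    sample = r + gamma * V.get(s_next, 0.0)                  # sample of V(s)
    V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))  # running average
```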

  20. Deep Q-Learning
  • Q-Learning with linear function approximators
    – Has some theoretical guarantees
  • Deep Q-Learning: fit a deep Q-network
    – Works well in practice
    – Q-network can take RGB images
  Image Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  21. Recap: Deep Q-Learning
  • Collect a dataset $\{(s_i, a_i, s'_i, r_i)\}$
  • Loss for a single data point:
  $$L(\theta) = \Big(\underbrace{r + \gamma \max_{a'} Q(s', a'; \theta)}_{\text{target Q-value}} - \underbrace{Q(s, a; \theta)}_{\text{predicted Q-value}}\Big)^2$$
  • Act optimally according to the learned Q-function: pick the action with the best Q-value
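A sketch of that loss in PyTorch for a minibatch. The separate `target_net` copy is a common stabilization trick, and the tensor layout is an assumption; neither is specified on the slide:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                    # tensors from a minibatch
    # Predicted Q(s, a; theta) for the actions actually taken
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target: r + gamma * max_a' Q(s', a'); no bootstrap at terminal states
        q_next = target_net(s_next).max(dim=1).values
        q_target = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_pred, q_target)
```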

  22. Exploration Problem
  • What should the behavior policy be?
  • Greedy? -> Local minima, no exploration
  • An exploration strategy: ε-greedy
    – With probability 1 − ε, act greedily: $a = \arg\max_a Q(s, a; \theta)$
    – With probability ε, take a random action
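A minimal ε-greedy selector, assuming a PyTorch Q-network and a discrete action space (the names here are illustrative):

```python
import random
import torch

def epsilon_greedy(q_net, s, n_actions, eps=0.1):
    if random.random() < eps:
        return random.randrange(n_actions)                   # explore
    with torch.no_grad():
        return int(q_net(s.unsqueeze(0)).argmax(dim=1))      # exploit
```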

  23. Experience Replay
  • Address this problem using experience replay
  • A replay buffer stores transitions
  • Continually update the replay buffer as game (experience) episodes are played; older samples are discarded
  • Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
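A minimal replay buffer sketch: `collections.deque` with `maxlen` gives exactly the discard-oldest behavior described above (the capacity value is illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random minibatch decorrelates consecutive samples
        return random.sample(self.buffer, batch_size)
```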

  24. Getting to the optimal policy
  • Transition function and reward function known → use value / policy iteration → obtain “optimal” policy

  25. Getting to the optimal policy
  • Transition function and reward function known → use value / policy iteration → obtain “optimal” policy
  • Transition function and reward function unknown → estimate Q-values from data (Q-learning, previous class) → obtain “optimal” policy

  26. Getting to the optimal policy
  • Transition function and reward function known → use value / policy iteration → obtain “optimal” policy
  • Transition function and reward function unknown → estimate Q-values from data → obtain “optimal” policy
  • Transition function and reward function unknown → estimate T and R from data, then use value / policy iteration (Homework!)

  27. Getting to the optimal policy
  • Transition function and reward function known → use value / policy iteration → obtain “optimal” policy
  • Transition function and reward function unknown → estimate Q-values from data → obtain “optimal” policy
  • Transition function and reward function unknown → estimate T and R from data, then use value / policy iteration
  • Transition function and reward function unknown → learn the policy directly from data (This class!)

  28. Learning the optimal policy
  • Class of policies defined by parameters $\theta$: $\pi_\theta(a \mid s)$
  • E.g., $\theta$ can be the parameters of a linear transformation, a deep network, etc.

  29. Learning the optimal policy
  • Class of policies defined by parameters $\theta$: $\pi_\theta(a \mid s)$
  • E.g., $\theta$ can be the parameters of a linear transformation, a deep network, etc.
  • Want to maximize the expected cumulative reward:
  $$J(\theta) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\Big]$$

  30. Learning the optimal policy
  • Class of policies defined by parameters $\theta$: $\pi_\theta(a \mid s)$
  • E.g., $\theta$ can be the parameters of a linear transformation, a deep network, etc.
  • Want to maximize: $J(\theta) = \mathbb{E}\big[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\big]$
  • In other words, find $\theta^* = \arg\max_\theta J(\theta)$

  31. Learning the optimal policy
  Sample a few trajectories by acting according to $\pi_\theta$

  32. REINFORCE algorithm
  Mathematically, we can write:
  $$J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\big[r(\tau)\big]$$
  where $r(\tau)$ is the reward of a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots)$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  33. REINFORCE algorithm
  Expected reward:
  $$J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\big[r(\tau)\big] = \int_\tau r(\tau)\, p(\tau; \theta)\, d\tau$$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  34. REINFORCE algorithm
  Expected reward: $J(\theta) = \int_\tau r(\tau)\, p(\tau; \theta)\, d\tau$
  Now let’s differentiate this:
  $$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau; \theta)\, d\tau$$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  35. REINFORCE algorithm
  Expected reward: $J(\theta) = \int_\tau r(\tau)\, p(\tau; \theta)\, d\tau$
  Now let’s differentiate this:
  $$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau; \theta)\, d\tau$$
  Intractable! The gradient of an expectation is problematic when p depends on θ.
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  36. REINFORCE algorithm
  Expected reward: $J(\theta) = \int_\tau r(\tau)\, p(\tau; \theta)\, d\tau$
  Now let’s differentiate this:
  $$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau; \theta)\, d\tau$$
  Intractable! The gradient of an expectation is problematic when p depends on θ.
  However, we can use a nice trick:
  $$\nabla_\theta p(\tau; \theta) = p(\tau; \theta)\, \frac{\nabla_\theta p(\tau; \theta)}{p(\tau; \theta)} = p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)$$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  37. REINFORCE algorithm
  Expected reward: $J(\theta) = \int_\tau r(\tau)\, p(\tau; \theta)\, d\tau$
  Now let’s differentiate this:
  $$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau; \theta)\, d\tau$$
  Intractable! The gradient of an expectation is problematic when p depends on θ.
  However, we can use a nice trick:
  $$\nabla_\theta p(\tau; \theta) = p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)$$
  If we inject this back:
  $$\nabla_\theta J(\theta) = \int_\tau \big(r(\tau)\, \nabla_\theta \log p(\tau; \theta)\big)\, p(\tau; \theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau; \theta)}\big[r(\tau)\, \nabla_\theta \log p(\tau; \theta)\big]$$
  Can estimate with Monte Carlo sampling!
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  38. REINFORCE algorithm
  Can we compute those quantities without knowing the transition probabilities? We have:
  $$p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  39. REINFORCE algorithm
  Can we compute those quantities without knowing the transition probabilities? We have:
  $$p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
  Thus:
  $$\log p(\tau; \theta) = \sum_{t \ge 0} \Big[\log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\Big]$$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  40. REINFORCE algorithm
  Can we compute those quantities without knowing the transition probabilities? We have:
  $$p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
  Thus:
  $$\log p(\tau; \theta) = \sum_{t \ge 0} \Big[\log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\Big]$$
  And when differentiating:
  $$\nabla_\theta \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
  Doesn’t depend on the transition probabilities!
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  41. REINFORCE algorithm
  Can we compute those quantities without knowing the transition probabilities? We have:
  $$p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
  Thus:
  $$\log p(\tau; \theta) = \sum_{t \ge 0} \Big[\log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\Big]$$
  And when differentiating:
  $$\nabla_\theta \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
  Doesn’t depend on the transition probabilities!
  Therefore, when sampling a trajectory τ, we can estimate $\nabla_\theta J(\theta)$ with
  $$\nabla_\theta J(\theta) \approx r(\tau) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
  Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
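Putting the final estimator into code: a sketch of the REINFORCE surrogate loss in PyTorch for one sampled trajectory. The `policy_net` interface (states in, action logits out) and tensor shapes are assumptions for illustration:

```python
import torch

def reinforce_loss(policy_net, states, actions, trajectory_reward):
    """Surrogate loss whose gradient is -r(tau) * sum_t grad log pi(a_t | s_t)."""
    logits = policy_net(states)                        # shape (T, n_actions)
    log_probs = torch.log_softmax(logits, dim=1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Minimizing this loss performs gradient ascent on J(theta)
    return -trajectory_reward * log_pi.sum()
```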
