CS 4803 / 7643: Deep Learning
Topic: Reinforcement Learning (RL)


  1. CS 4803 / 7643: Deep Learning
     Topic:
     – Reinforcement Learning (RL)
     – Overview
     – Markov Decision Processes
     Zsolt Kira, Georgia Tech

  2. Administrative
     • PS3/HW3 due March 15th!
     • Projects
       – 2 new FB projects up (https://www.cc.gatech.edu/classes/AY2020/cs7643_spring/fb_projects.html)
         • Project 1: Confident Machine Translation
         • Project 2: Habitat Embodied Navigation Challenge @ CVPR20
         • Project 3: MRI analysis
         • Project 4: Transfer learning for machine translation quality estimation
       – Tentative FB plan:
         • March 20th: Phone call with FB
         • April 5th: Written Q&A
         • April 15th: Phone call with FB
       – Fill out spreadsheet: https://gtvault-my.sharepoint.com/:x:/g/personal/sdharur3_gatech_edu/EVXbNc4oxelMmj1T5WsEIRQBE4Hn532GeLQVcmOnWdG2Jg?e=dIGNfX

  3. From Last Time
     • Overview of RL
       – RL vs other forms of learning
       – RL “API”
       – Applications
     • Framework: Markov Decision Processes (MDPs)
       – Definitions and notation
       – Policies and Value Functions
     • Solving MDPs
       – Value Iteration
       – Policy Iteration
     • Reinforcement learning
       – Value-based RL (Q-learning, Deep Q-Learning)
       – Policy-based RL (Policy gradients)
     Last lecture: focus on MDPs; no learning (deep or otherwise)
     Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  4. RL API
     • At each step t the agent:
       – Executes action a_t
       – Receives observation o_t
       – Receives scalar reward r_t
     • The environment:
       – Receives action a_t
       – Emits observation o_{t+1}
       – Emits scalar reward r_{t+1}
     Slide Credit: David Silver
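This loop maps directly onto code. Below is a minimal sketch of the agent-environment interaction in Python, assuming a Gym-style environment with reset()/step() methods; RandomAgent and the exact return signature of step() are illustrative assumptions, not part of the lecture.

```python
import random

class RandomAgent:
    """Hypothetical agent that ignores observations and picks random actions."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, observation, reward):
        # A real agent would use (observation, reward) to update its behavior.
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """One episode of the RL 'API': agent emits a_t, env returns o_{t+1}, r_{t+1}."""
    obs, reward, done = env.reset(), 0.0, False
    total = 0.0
    for t in range(max_steps):
        action = agent.act(obs, reward)        # agent executes a_t
        obs, reward, done = env.step(action)   # env emits o_{t+1}, r_{t+1}
        total += reward
        if done:
            break
    return total
```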

  5. Markov Decision Process (MDP)
     – RL operates within a framework called a Markov Decision Process
     – MDPs: general formulation for decision making under uncertainty
     Defined by (S, A, R, P, γ):
       S : set of possible states [start state = s_0, optional terminal / absorbing state]
       A : set of possible actions
       R(s, a, s') : distribution of reward given a (state, action, next state) tuple
       P(s' | s, a) : transition probability distribution, also written as T(s, a, s')
       γ : discount factor
     – Life is a trajectory: s_0, a_0, r_0, s_1, a_1, r_1, …
     – Markov property: the current state completely characterizes the state of the world
     – Assumption: the most recent observation is a sufficient statistic of the history
     Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
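To make the (S, A, R, P, γ) pieces concrete, here is a tiny invented two-state MDP spelled out as plain Python dictionaries. The "rested"/"tired" example is an illustrative assumption, not from the slides.

```python
# A toy 2-state MDP, spelled out as (S, A, P, R, gamma).
states = ["rested", "tired"]      # S
actions = ["work", "sleep"]       # A
gamma = 0.9                       # discount factor

# P[s][a] is a dict {s_next: probability}; each row sums to 1.
P = {
    "rested": {"work": {"tired": 0.8, "rested": 0.2},
               "sleep": {"rested": 1.0}},
    "tired":  {"work": {"tired": 1.0},
               "sleep": {"rested": 0.7, "tired": 0.3}},
}

# R[s][a] is the expected immediate reward for taking a in s
# (a simplification of the general R(s, a, s') form on the slide).
R = {
    "rested": {"work": 2.0, "sleep": 0.0},
    "tired":  {"work": 1.0, "sleep": 0.0},
}
```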

  6. Markov Decision Process (MDP)
     – An MDP state projects a search tree
     – Observability:
       – Full: in a fully observable MDP, the agent observes the true state (o_t = s_t)
         Example: Chess
       – Partial: in a partially observable MDP, the agent constructs its own state, e.g. using the history, beliefs over the world state, or an RNN, …
         Example: Poker
     Slide Credit: Emma Brunskill, Byron Boots

  7. Markov Decision Process (MDP)
     – In RL, we don’t have access to T(s, a, s') or R(s, a, s') (i.e., the environment)
       – Need to actually try out actions and states to learn
       – Sometimes, we need to model the environment
     – Last time, we assumed we do have access to how the world works,
       and that our goal is to find an optimal behavior strategy for the agent

  8. Canonical Example: Grid World
     • Agent lives in a grid
     • Walls block the agent’s path
     • Actions do not always go as planned:
       – 80% of the time, action North takes the agent North (if there is no wall)
       – 10% of the time, North takes the agent West; 10% East
       – If there is a wall, the agent stays put
     • State: the agent’s location
     • Actions: N, E, S, W
     • Rewards: +1 / -1 at absorbing states
       – Also a small “living” reward each step (negative)
     Slide credit: Pieter Abbeel
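A sketch of the noisy transition model described on this slide (80% intended direction, 10% to each perpendicular side, stay put on walls and edges). The grid layout and helper names are illustrative assumptions rather than course code.

```python
import random

# 3x4 grid from the classic example: 'X' is a wall, '+1'/'-1' are absorbing states.
GRID = [[" ", " ", " ", "+1"],
        [" ", "X", " ", "-1"],
        [" ", " ", " ", " "]]

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
# Perpendicular "slip" directions for each intended action.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(state, action):
    """Sample the next state: 80% intended direction, 10% each perpendicular slip.
    (Rewards and absorbing-state handling are omitted; this only shows the noisy dynamics.)
    """
    roll = random.random()
    if roll < 0.8:
        direction = action
    elif roll < 0.9:
        direction = SLIPS[action][0]
    else:
        direction = SLIPS[action][1]
    r, c = state
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    # Stay put if the move leaves the grid or hits a wall.
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == "X":
        return state
    return (nr, nc)
```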

  9. Policy
     • A policy is how the agent acts
     • Formally, a map from states to actions:
       – Deterministic: a = π(s)
       – Stochastic: π(a | s) = P(a_t = a | s_t = s)
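A quick illustration of both kinds of policy as plain Python; the states and probabilities are made up for illustration.

```python
import random

# Deterministic policy: a plain map from state to action.
pi_det = {"rested": "work", "tired": "sleep"}

# Stochastic policy: a map from state to a distribution over actions.
pi_stoch = {"rested": {"work": 0.9, "sleep": 0.1},
            "tired":  {"work": 0.2, "sleep": 0.8}}

def sample_action(pi, state):
    """Draw an action from a stochastic policy pi(a | s)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```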

  10. The optimal policy π*
     What’s a good policy?
       – Maximizes current reward? Sum of all future rewards?
       – Discounted future rewards! (Typically for a fixed horizon T)
     Formally: π* = argmax_π E[ Σ_{t≥0} γ^t r_t | π ],
     with s_0 ~ p(s_0), a_t ~ π(· | s_t), s_{t+1} ~ p(· | s_t, a_t)

  11. The optimal policy π*
     – Reward at every non-terminal state (living reward / penalty)
     Slide Credit: Byron Boots, CS 7641

  12. Value Function
     • A value function is a prediction of future reward
     • State-Value Function, or simply Value Function
       – How good is a state?
       – Am I screwed? Am I winning this game?
     • Action-Value Function, or Q-function
       – How good is a state-action pair?
       – Should I do this now?

  13. Value Function
     Following a policy π produces sample trajectories s_0, a_0, r_0, s_1, a_1, …
     How good is a state?
       The value function at state s is the expected cumulative reward from state s (following the policy thereafter):
       V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
     How good is a state-action pair?
       The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s (following the policy thereafter):
       Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
     Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  14. Optimal Quantities
     Given an optimal policy π* that produces sample trajectories s_0, a_0, r_0, s_1, a_1, …
     How good is a state?
       The optimal value function at state s is the expected cumulative reward starting in s and acting optimally thereafter:
       V*(s) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
     How good is a state-action pair?
       The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter:
       Q*(s, a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
     Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  15. Recursive definition of value
     • Extracting optimal value / policy from Q-values:
       V*(s) = max_a Q*(s, a)
       π*(s) = argmax_a Q*(s, a)
     Slide credit: Byron Boots, CS 7641

  16. Recursive definition of value
     • Extracting optimal value / policy from Q-values:
       V*(s) = max_a Q*(s, a)
       π*(s) = argmax_a Q*(s, a)
     • Bellman Equations:
       Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
     Slide credit: Byron Boots, CS 7641

  18. Recursive definition of value
     • Extracting optimal value / policy from Q-values:
       V*(s) = max_a Q*(s, a)
       π*(s) = argmax_a Q*(s, a)
     • Bellman Equations:
       Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
       V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
     • Characterize optimal values in a way we’ll use over and over
     Slide credit: Byron Boots, CS 7641

  19. Value Iteration (VI)
     • Bellman equations characterize optimal values; VI is a fixed-point dynamic-programming method for computing them
     Slide credit: Byron Boots, CS 7641

  20. Value Iteration (VI)
     • Bellman equations characterize optimal values; VI is a fixed-point dynamic-programming method for computing them
     • Algorithm
       – Initialize values of all states: V_0(s) = 0
       – Update: V_{i+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i(s') ]
       – Repeat until convergence (to V*)
     Slide credit: Byron Boots, CS 7641

  21. Value Iteration (VI)
     • Bellman equations characterize optimal values; VI is a fixed-point dynamic-programming method for computing them
     • Algorithm
       – Initialize values of all states: V_0(s) = 0
       – Update: V_{i+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i(s') ]
       – Repeat until convergence (to V*)
     • Complexity per iteration (DP): O(|S|²|A|)
     Slide credit: Byron Boots, CS 7641

  22. Value Iteration (VI)
     • Bellman equations characterize optimal values; VI is a fixed-point dynamic-programming method for computing them
     • Algorithm
       – Initialize values of all states: V_0(s) = 0
       – Update: V_{i+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i(s') ]
       – Repeat until convergence (to V*)
     • Complexity per iteration (DP): O(|S|²|A|)
     • Convergence
       – Guaranteed for γ < 1
       – Sketch: the approximations get refined towards the optimal values
       – In practice, the policy may converge before the values do
     Slide credit: Byron Boots, CS 7641
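Below is a short sketch of the value-iteration loop, assuming the MDP is given as a transition table T[s][a] = {s': prob} (like the P dictionary in the toy MDP above) and a reward function R(s, a, s'). It illustrates the update above and is not the course's reference implementation.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration.

    Assumed inputs (illustrative):
      T[s][a] -> {s_next: probability}
      R(s, a, s_next) -> immediate reward
    Returns an approximation of the optimal value function V*.
    """
    V = {s: 0.0 for s in states}                 # V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            # Bellman update: max over actions of the expected one-step return.
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2])
                    for s2, p in T[s][a].items())
                for a in actions
            )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:                          # converged (approximately) to V*
            return V
```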

  23. Value Iteration (VI)
     [NOTE: Here we show the calculation only for the action we know is the argmax (go right), but in general we have to compute this for each action and return the max]
     Slide credit: Pieter Abbeel

  24. Q-Value Iteration
     • Value Iteration update: V_{i+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i(s') ]
     • Remember: V*(s) = max_a Q*(s, a)
     • Q-Value Iteration update: Q_{i+1}(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_i(s', a') ]
     The algorithm is the same as value iteration, but it loops over actions as well as states
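The Q-value-iteration variant of the same sketch, with the max moved inside the expectation, using the same assumed T/R format as the value-iteration sketch above.

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Tabular Q-value iteration (same assumed T/R format as value_iteration)."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        Q_new = {}
        for s in states:
            for a in actions:
                # Q(s,a) = sum_s' T(s,a,s') [ R(s,a,s') + gamma * max_a' Q(s',a') ]
                Q_new[(s, a)] = sum(
                    p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T[s][a].items()
                )
        delta = max(abs(Q_new[k] - Q[k]) for k in Q)
        Q = Q_new
        if delta < tol:
            return Q
```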

  26. Snapshot of Demo – Gridworld V Values
     Noise = 0.2, Discount = 0.9, Living reward = 0
     Slide Credit: http://ai.berkeley.edu

  27. Computing Actions from Values
     • Let’s imagine we have the optimal values V*(s)
     • How should we act?
       – It’s not obvious!
     • We need to do a one-step calculation:
       π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
     • This is called policy extraction, since it gets the policy implied by the values
     Slide Credit: http://ai.berkeley.edu
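A sketch of this one-step lookahead; note that extracting a policy from V* still needs the model (T and R). Same assumed dictionary format as the earlier sketches.

```python
def extract_policy(V, states, actions, T, R, gamma=0.9):
    """pi*(s) = argmax_a sum_s' T(s,a,s') [ R(s,a,s') + gamma * V*(s') ]"""
    pi = {}
    for s in states:
        pi[s] = max(
            actions,
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                              for s2, p in T[s][a].items())
        )
    return pi
```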

  28. Snapshot of Demo – Gridworld Q Values
     Noise = 0.2, Discount = 0.9, Living reward = 0
     Slide Credit: http://ai.berkeley.edu

  29. Computing Actions from Q-Values
     • Let’s imagine we have the optimal Q-values Q*(s, a)
     • How should we act?
       – Completely trivial to decide: π*(s) = argmax_a Q*(s, a)
     • Important lesson: actions are easier to select from Q-values than from values!
     Slide Credit: http://ai.berkeley.edu
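By contrast, extracting a policy from Q* needs no model at all, just an argmax over actions (same illustrative conventions as above).

```python
def greedy_from_q(Q, states, actions):
    """pi*(s) = argmax_a Q*(s, a); no transition model needed."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```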

  30. Demo
     • https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html

  31. Next class
     • Solving MDPs
       – Policy Iteration
     • Reinforcement learning
       – Value-based RL
         • Q-learning
         • Deep Q-Learning
     Slide Credit: David Silver

  32. Policy Iteration
     (C) Dhruv Batra

  33. Policy Iteration
     • Policy iteration: start with an arbitrary policy π_0 and refine it.

  34. Policy Iteration
     • Policy iteration: start with an arbitrary policy π_0 and refine it.
     • Involves repeating two steps:
       – Policy Evaluation: compute V^π (similar to VI)
       – Policy Refinement: greedily change actions as per V^π

  35. Policy Iteration
     • Policy iteration: start with an arbitrary policy π_0 and refine it.
     • Involves repeating two steps:
       – Policy Evaluation: compute V^π (similar to VI)
       – Policy Refinement: greedily change actions as per V^π
     • Why do policy iteration?
       – π_i often converges to π* much sooner than V_i converges to V*
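A sketch of the two alternating steps under the same assumed tabular T/R format; the evaluation step here is done iteratively (similar to VI but without the max) rather than by solving the linear system exactly.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-6):
    """Alternate policy evaluation and greedy policy refinement until stable."""
    pi = {s: actions[0] for s in states}          # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: compute V^pi (iterative, no max over actions).
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                            for s2, p in T[s][pi[s]].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break
        # Policy refinement: act greedily with respect to V^pi.
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                         for s2, p in T[s][a].items()))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:
            return pi, V
```

As the slide notes, the policy often stabilizes well before the values have fully converged, which is why the stopping check here is on π rather than on V.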
