 
CS 4803 / 7643: Deep Learning
Topic:
– Reinforcement Learning (RL)
– Overview
– Markov Decision Processes
Zsolt Kira, Georgia Tech
Administrative
• PS3/HW3 due March 15th!
• Projects
  – 2 new FB projects up (https://www.cc.gatech.edu/classes/AY2020/cs7643_spring/fb_projects.html)
    • Project 1: Confident Machine Translation
    • Project 2: Habitat Embodied Navigation Challenge @ CVPR20
    • Project 3: MRI analysis
    • Project 4: Transfer learning for machine translation quality estimation
  – Tentative FB plan:
    • March 20th: Phone call with FB
    • April 5th: Written Q&A
    • April 15th: Phone call with FB
  – Fill out spreadsheet: https://gtvault-my.sharepoint.com/:x:/g/personal/sdharur3_gatech_edu/EVXbNc4oxelMmj1T5WsEIRQBE4Hn532GeLQVcmOnWdG2Jg?e=dIGNfX
From Last Time
• Overview of RL
• RL vs other forms of learning
• RL “API”
• Applications
• Framework: Markov Decision Processes (MDPs)
• Definitions and notations
• Policies and Value Functions
• Solving MDPs
• Value Iteration
• Policy Iteration
• Reinforcement learning
• Value-based RL (Q-learning, Deep Q-Learning)
• Policy-based RL (Policy gradients)
Last lecture:
– Focus on MDPs
– No learning (deep or otherwise)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
RL API
• At each step $t$ the agent:
  – Executes action $a_t$
  – Receives observation $o_t$
  – Receives scalar reward $r_t$
• The environment:
  – Receives action $a_t$
  – Emits observation $o_{t+1}$
  – Emits scalar reward $r_{t+1}$
Slide Credit: David Silver
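The interaction loop above can be sketched in a few lines of Python. This is only an illustration: `ToyEnvironment`, `RandomAgent`, and `run_episode` are hypothetical names made up for this sketch, not part of any library or of the lecture.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: reward 1.0 if the action matches a hidden coin."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.coin = self.rng.randint(0, 1)

    def reset(self):
        return 0  # initial observation o_0 (carries no information here)

    def step(self, action):
        # Receives a_t, emits (o_{t+1}, r_{t+1}).
        reward = 1.0 if action == self.coin else 0.0
        observation = self.coin  # reveal the coin that was just used
        self.coin = self.rng.randint(0, 1)
        return observation, reward


class RandomAgent:
    """Hypothetical agent that ignores observations and acts at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.randint(0, 1)


def run_episode(env, agent, num_steps=10):
    observation = env.reset()
    total_reward = 0.0
    history = []
    for _ in range(num_steps):
        action = agent.act(observation)          # agent executes a_t
        observation, reward = env.step(action)   # environment emits o_{t+1}, r_{t+1}
        history.append((action, observation, reward))
        total_reward += reward
    return total_reward, history


total_reward, history = run_episode(ToyEnvironment(), RandomAgent())
```

The agent only ever sees observations and rewards; everything about how they are produced stays inside the environment object.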
Markov Decision Process (MDP)
- RL operates within a framework called a Markov Decision Process
- MDPs: General formulation for decision making under uncertainty
Defined by $(\mathcal{S}, \mathcal{A}, \mathcal{R}, T, \gamma)$:
  $\mathcal{S}$: set of possible states [start state = $s_0$, optional terminal / absorbing state]
  $\mathcal{A}$: set of possible actions
  $\mathcal{R}(s, a, s')$: distribution of reward given (state, action, next state) tuple
  $T(s, a, s')$: transition probability distribution, also written as $p(s' \mid s, a)$
  $\gamma$: discount factor
- Life is a trajectory: $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, \ldots$
- Markov property: Current state completely characterizes state of the world
- Assumption: Most recent observation is a sufficient statistic of history
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Markov Decision Process (MDP)
- MDP state projects a search tree
- Observability:
  - Full: In a fully observable MDP, the agent observes the state directly ($o_t = s_t$)
    - Example: Chess
  - Partial: In a partially observable MDP, the agent constructs its own state, e.g. from the history, from beliefs about the world state, or with an RNN, …
    - Example: Poker
Slide Credit: Emma Brunskill, Byron Boots
Markov Decision Process (MDP)
- In RL, we don’t have access to the transition distribution $T$ or the reward distribution $\mathcal{R}$ (i.e. the environment)
  - Need to actually try actions and states out to learn
  - Sometimes, need to model the environment
- Last time, assumed we do have access to how the world works
  - And that our goal is to find an optimal behavior strategy for an agent
Canonical Example: Grid World
• Agent lives in a grid
• Walls block the agent’s path
• Actions do not always go as planned
  • 80% of the time, action North takes the agent North (if there is no wall)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall, the agent stays put
• State: Agent’s location
• Actions: N, E, S, W
• Rewards: +1 / -1 at absorbing states
  • Also small “living” reward each step (negative)
Slide credit: Pieter Abbeel
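The noisy action model described above can be written as a transition distribution. The following is a minimal sketch under assumed conventions that are not in the slides: states are `(row, col)` tuples with row 0 at the top, and the grid size and wall position in the example are made up.

```python
# Assumed conventions (not from the slides): (row, col) states, row 0 at the top.
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
# The two directions an action can slip to (10% each when noise = 0.2).
PERPENDICULAR = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def move(state, direction, rows, cols, walls):
    """Deterministic move; the agent stays put if blocked by a wall or the border."""
    d_row, d_col = MOVES[direction]
    nxt = (state[0] + d_row, state[1] + d_col)
    inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
    return nxt if inside and nxt not in walls else state


def transition_distribution(state, action, rows, cols, walls, noise=0.2):
    """Return {next_state: probability} for taking `action` in `state`."""
    left, right = PERPENDICULAR[action]
    outcomes = [(action, 1.0 - noise), (left, noise / 2), (right, noise / 2)]
    dist = {}
    for direction, prob in outcomes:
        nxt = move(state, direction, rows, cols, walls)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


# Hypothetical 3x4 grid with one wall cell at (1, 1).
north_from_corner = transition_distribution((2, 0), "N", rows=3, cols=4, walls={(1, 1)})
east_into_wall = transition_distribution((1, 0), "E", rows=3, cols=4, walls={(1, 1)})
```

In `east_into_wall` the intended move is blocked, so the 80% mass stays on the current cell while the two slips still move the agent north or south.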
Policy
• A policy is how the agent acts
• Formally, map from states to actions
  – Deterministic: $\pi(s) = a$
  – Stochastic: $\pi(a \mid s) = p(a_t = a \mid s_t = s)$
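The optimal policy $\pi^*$
What’s a good policy?
Maximizes current reward? Sum of all future reward?
Discounted future rewards! (Typically for a fixed horizon T)
Formally: $\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi\right]$ with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$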
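To make "discounted future rewards" concrete, the discounted sum of one sampled reward sequence can be computed as below. This is a small sketch; the function name `discounted_return` and the example reward values (including the -0.04 living reward) are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma**t * rewards[t] by accumulating from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
simple = discounted_return([1.0, 1.0, 1.0], gamma=0.5)

# Hypothetical Grid World episode: three steps of living penalty, then the +1 exit.
episode = discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9)
```

The expectation in the formal definition averages this quantity over trajectories sampled by following the policy.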
The optimal policy $\pi^*$
[Figure: optimal policy on the grid, labeled with the reward at every non-terminal state (living reward / penalty)]
Slide Credit: Byron Boots, CS 7641
Value Function
• A value function is a prediction of future reward
• State Value Function or simply Value Function
  – How good is a state?
  – Am I screwed? Am I winning this game?
• Action-Value Function or Q-function
  – How good is a state-action pair?
  – Should I do this now?
Value Function
Following a policy $\pi$ that produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \ldots$
How good is a state? The value function at state $s$ is the expected cumulative reward from state $s$ (and following the policy thereafter):
$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ (and following the policy thereafter):
$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
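Optimal Quantities
Given an optimal policy $\pi^*$ that produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \ldots$
How good is a state? The optimal value function at state $s$ is the expected cumulative reward from state $s$ when acting optimally thereafter:
$V^*(s) = \max_{\pi} \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The optimal Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ and acting optimally thereafter:
$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n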
Recursive definition of value
• Extracting optimal value / policy from Q-values:
  $V^*(s) = \max_a Q^*(s, a)$
  $\pi^*(s) = \arg\max_a Q^*(s, a)$
• Bellman Equations (with $p(s' \mid s, a)$ the transition probability and $r(s, a, s')$ the expected reward for that transition):
  $Q^*(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V^*(s') \right]$
  $Q^*(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$
  $V^*(s) = \max_a \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V^*(s') \right]$
• Characterize optimal values in a way we’ll use over and over
Slide credit: Byron Boots, CS 7641
Value Iteration (VI)
• Bellman equations characterize optimal values; VI is a fixed-point DP solution method to compute them
• Algorithm
  – Initialize values of all states: $V_0(s) = 0$
  – Update: $V_{i+1}(s) \leftarrow \max_a \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V_i(s') \right]$
  – Repeat until convergence (to $V^*$)
• Complexity per iteration (DP): $O(|S|^2 |A|)$
• Convergence
  – Guaranteed for $\gamma < 1$
  – Sketch: Approximations get refined towards optimal values
  – In practice, policy may converge before values do
Slide credit: Byron Boots, CS 7641
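The algorithm above can be sketched in plain Python. The MDP encoding (`P[s][a]` as a list of `(probability, next_state, reward)` tuples), the three-state toy MDP, and the stopping tolerance are all assumptions made for this illustration, not part of the lecture.

```python
# Hypothetical toy MDP: P[s][a] = [(probability, next_state, reward), ...].
# State 2 is terminal (no actions); "go" succeeds 80% of the time.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {},
}


def value_iteration(P, gamma=0.9, tol=1e-8, max_iters=10_000):
    V = {s: 0.0 for s in P}  # V_0(s) = 0 for all states
    for _ in range(max_iters):
        new_V = {}
        for s, actions in P.items():
            if not actions:  # terminal state keeps value 0
                new_V[s] = 0.0
                continue
            # Bellman backup: max over actions of the expected one-step return.
            new_V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:  # values have (numerically) stopped changing
            break
    return V


V = value_iteration(P)
```

Each sweep touches every (state, action, next state) triple once, which is where the $O(|S|^2 |A|)$ cost per iteration comes from.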
Value Iteration (VI)
[NOTE: Here we are showing calculations only for the action we know is the argmax (go right), but in general we have to calculate this for each action and take the max]
Slide credit: Pieter Abbeel
Q-Value Iteration
• Value Iteration Update: $V_{i+1}(s) \leftarrow \max_a \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V_i(s') \right]$
• Remember: $V^*(s) = \max_a Q^*(s, a)$
• Q-Value Iteration Update: $Q_{i+1}(s, a) \leftarrow \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q_i(s', a') \right]$
The algorithm is the same as value iteration, but it loops over actions as well as states
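A minimal sketch of the Q-value iteration update, using the same kind of hypothetical toy MDP encoding (`P[s][a]` as a list of `(probability, next_state, reward)` tuples). The fixed iteration count is an assumption chosen for simplicity rather than a convergence test.

```python
# Hypothetical toy MDP: P[s][a] = [(probability, next_state, reward), ...].
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {},  # terminal state: no actions
}


def q_value_iteration(P, gamma=0.9, num_iters=500):
    Q = {s: {a: 0.0 for a in P[s]} for s in P}  # Q_0(s, a) = 0
    for _ in range(num_iters):
        new_Q = {}
        for s in P:                             # loop over states ...
            new_Q[s] = {}
            for a, outcomes in P[s].items():    # ... and over actions
                new_Q[s][a] = sum(
                    # max over a' of Q_i(s', a'); 0 for terminal states
                    p * (r + gamma * max(Q[s2].values(), default=0.0))
                    for p, s2, r in outcomes
                )
        Q = new_Q
    return Q


Q = q_value_iteration(P)
```

Compared with value iteration, the max moves inside the sum: it is taken over next actions $a'$ rather than over the current action.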
Snapshot of Demo – Gridworld V Values
[Figure: Gridworld state values] Noise = 0.2, Discount = 0.9, Living reward = 0
Slide Credit: http://ai.berkeley.edu
Computing Actions from Values
• Let’s imagine we have the optimal values $V^*(s)$
• How should we act?
  – It’s not obvious!
• We need to do a one-step calculation (a one-step look-ahead using the model):
  $\pi^*(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V^*(s') \right]$
• This is called policy extraction, since it gets the policy implied by the values
Slide Credit: http://ai.berkeley.edu
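Policy extraction can be sketched as a one-step look-ahead over a known model. The toy MDP and its encoding are hypothetical, and `V_star` holds values worked out by hand for this toy MDP (they solve its Bellman equations with $\gamma = 0.9$); none of this comes from the lecture.

```python
# Hypothetical toy MDP: P[s][a] = [(probability, next_state, reward), ...].
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {},  # terminal state: no actions
}

# Optimal values for this toy MDP with gamma = 0.9, derived by hand:
# V*(1) = 0.8 + 0.18 V*(1) and V*(0) = 0.72 V*(1) + 0.18 V*(0).
V_star = {1: 0.8 / 0.82, 2: 0.0}
V_star[0] = 0.72 * V_star[1] / 0.82


def extract_policy(P, V, gamma=0.9):
    """Greedy one-step look-ahead: needs the transition model P, not just V."""
    policy = {}
    for s, actions in P.items():
        if not actions:  # nothing to choose in a terminal state
            continue

        def lookahead(a):
            return sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a])

        policy[s] = max(actions, key=lookahead)
    return policy


policy = extract_policy(P, V_star)
```

The look-ahead is what makes this step non-trivial: the values alone do not say which action leads to the good states.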
Snapshot of Demo – Gridworld Q Values
[Figure: Gridworld Q-values] Noise = 0.2, Discount = 0.9, Living reward = 0
Slide Credit: http://ai.berkeley.edu
Computing Actions from Q-Values
• Let’s imagine we have the optimal q-values: $Q^*(s, a)$
• How should we act?
  – Completely trivial to decide!
  $\pi^*(s) = \arg\max_a Q^*(s, a)$
• Important lesson: actions are easier to select from q-values than values!
Slide Credit: http://ai.berkeley.edu
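With Q-values in hand, action selection is a plain argmax and needs no transition model. A tiny sketch, where the Q-table contents and state names are made-up numbers for illustration only:

```python
# Hypothetical Q-table: Q[state][action] -> value (numbers are made up).
Q = {
    "s0": {"N": 0.51, "E": 0.72, "S": 0.30, "W": 0.45},
    "s1": {"N": 0.90, "E": 0.10, "S": 0.20, "W": 0.05},
}


def greedy_action(Q, state):
    """Return argmax_a Q(state, a)."""
    return max(Q[state], key=Q[state].get)


actions = {s: greedy_action(Q, s) for s in Q}
```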
Demo
• https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
Next class
• Solving MDPs
  – Policy Iteration
• Reinforcement learning
  – Value-based RL
    • Q-learning
    • Deep Q Learning
Slide Credit: David Silver
Policy Iteration
(C) Dhruv Batra
Policy Iteration
• Policy iteration: Start with an arbitrary policy $\pi_0$ and refine it.
• Involves repeating two steps:
  – Policy Evaluation: Compute $V^{\pi}$ (similar to VI)
  – Policy Refinement: Greedily change actions as per $V^{\pi}$
• Why do policy iteration?
  – $\pi_i$ often converges to $\pi^*$ much sooner than $V^{\pi_i}$ converges to $V^*$
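The two alternating steps can be sketched as follows. The toy MDP, its encoding (`P[s][a]` as a list of `(probability, next_state, reward)` tuples), the iterative evaluation with a tolerance, and the choice of initial policy are all assumptions for this illustration.

```python
# Hypothetical toy MDP: P[s][a] = [(probability, next_state, reward), ...].
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {},  # terminal state: no actions
}


def backup(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s under values V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def evaluate_policy(P, policy, gamma=0.9, tol=1e-10):
    """Policy evaluation: like VI, but with the action fixed by the policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in policy:
            v = backup(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v - V[s]))
            V[s] = v  # in-place update
        if delta < tol:
            return V


def policy_iteration(P, gamma=0.9):
    # Arbitrary initial policy: the first listed action in each non-terminal state.
    policy = {s: next(iter(P[s])) for s in P if P[s]}
    while True:
        V = evaluate_policy(P, policy, gamma)
        stable = True
        for s in policy:  # policy refinement: act greedily with respect to V
            q = {a: backup(P, V, s, a, gamma) for a in P[s]}
            best = max(q, key=q.get)
            if q[best] > q[policy[s]] + 1e-12:  # switch only on strict improvement
                policy[s] = best
                stable = False
        if stable:
            return policy, V


policy, V = policy_iteration(P)
```

On this toy MDP the loop stops after three evaluation rounds, once a refinement pass leaves the policy unchanged.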