CS 4803 / 7643: Deep Learning. Topics: Policy Gradients, Actor-Critic


slide-1
SLIDE 1

CS 4803 / 7643: Deep Learning

Zsolt Kira Georgia Tech

Topics:

– Policy Gradients
– Actor-Critic

slide-2
SLIDE 2

2

Administrative

  • PS3/HW3 due Tuesday 03/31
  • PS4/HW4 is optional and due 04/03
  • There are lots of bonus/Extra credit questions there!
  • Sessions with Facebook for project (fill out spreadsheet)
slide-3
SLIDE 3

3

Administrative

  • How to ask questions during live lecture:
  • Use Q&A window (other students can upvote)
  • Raise hands
slide-4
SLIDE 4

4

Topics we’ll cover

  • Overview of RL
  • RL vs other forms of learning
  • RL “API”
  • Applications
  • Framework: Markov Decision Processes (MDPs)
  • Definitions and notations
  • Policies and Value Functions
  • Solving MDPs
  • Value Iteration (recap)
  • Q-Value Iteration (new)
  • Policy Iteration
  • Reinforcement learning
  • Value-based RL (Q-learning, Deep-Q Learning)
  • Policy-based RL (Policy gradients)
  • Actor-Critic
slide-5
SLIDE 5

5

  • Markov Decision Processes (MDP):
  • States:
  • Actions:
  • Rewards:
  • Transition Function:
  • Discount Factor:

Recap: MDPs
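In standard notation (the slide's own symbols did not survive extraction, so this is the usual convention rather than a verbatim copy), an MDP is the tuple

(\mathcal{S}, \mathcal{A}, R, T, \gamma): \quad s \in \mathcal{S}, \quad a \in \mathcal{A}, \quad r = R(s, a, s'), \quad T(s, a, s') = p(s' \mid s, a), \quad \gamma \in [0, 1].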

slide-6
SLIDE 6

6

Value Function

Following a policy π that produces sample trajectories s_0, a_0, r_0, s_1, a_1, …

How good is a state? The value function at state s is the expected cumulative reward from state s (following the policy thereafter). How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s (and following the policy thereafter).
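In the usual notation (a reconstruction consistent with the verbal definitions above, not a verbatim copy of the slide):

V^{\pi}(s) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s,\ \pi\Big], \qquad Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s,\ a_{0} = a,\ \pi\Big].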

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-7
SLIDE 7

7

Optimal Quantities

Given the optimal policy π* that produces sample trajectories s_0, a_0, r_0, s_1, a_1, …

How good is a state? The optimal value function at state s is the expected cumulative reward from state s, acting optimally thereafter. How good is a state-action pair? The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s, acting optimally thereafter.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-8
SLIDE 8

8

Recap: Optimal Value Function

The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter

slide-9
SLIDE 9

9

Recap: Optimal Value Function

The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter. Optimal policy:
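In standard notation (reconstructed from the verbal definitions above; the slide's exact symbols may differ):

V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a).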

slide-10
SLIDE 10

Bellman Optimality Equations

  • Relations:
  • Recursive optimality equations:
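A standard way to write the recursive (Bellman) optimality equations, assuming the MDP notation used earlier:

V^{*}(s) = \max_{a} Q^{*}(s, a), \qquad Q^{*}(s, a) = \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^{*}(s')\big].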

10

slide-11
SLIDE 11

Value Iteration (VI)

11 Slide credit: Pieter Abbeel

[NOTE: Here we are showing calculations for the action we know is the argmax (go right), but in general we have to calculate this for each action and return the max.]
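The value iteration update being illustrated, in the usual notation (not a verbatim copy of the slide):

V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V_{k}(s')\big].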

slide-12
SLIDE 12
slide-13
SLIDE 13

Snapshot of Demo – Gridworld V Values

Noise = 0.2 Discount = 0.9 Living reward = 0

Slide Credit: http://ai.berkeley.edu

slide-14
SLIDE 14

Computing Actions from Values

  • Let’s imagine we have the optimal values V*(s)
  • How should we act?
  • It’s not obvious!
  • We need to do a one step calculation
  • This is called policy extraction, since it gets the policy implied by the values
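The one-step calculation, in standard form (assuming the Bellman notation above):

\pi^{*}(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^{*}(s')\big].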

Slide Credit: http://ai.berkeley.edu

slide-15
SLIDE 15

Snapshot of Demo – Gridworld Q Values

Noise = 0.2 Discount = 0.9 Living reward = 0

Slide Credit: http://ai.berkeley.edu

slide-16
SLIDE 16

Computing Actions from Q-Values

  • Let’s imagine we have the optimal q-values:
  • How should we act?
  • Completely trivial to decide!
  • Important lesson: actions are easier to select from q-values than from values!

Slide Credit: http://ai.berkeley.edu

slide-17
SLIDE 17

Recap: Learning Based Methods

  • Typically, we don’t know the environment
  • Transition function unknown: how do actions affect the environment?
  • Reward function unknown: what/when are the good actions?

17

slide-18
SLIDE 18

Recap: Learning Based Methods

  • Typically, we don’t know the environment
  • Transition function unknown: how do actions affect the environment?
  • Reward function unknown: what/when are the good actions?
  • But, we can learn by trial and error.
  • Gather experience (data) by performing actions.
  • Approximate unknown quantities from data.

18

slide-19
SLIDE 19

Sample-Based Policy Evaluation?

  • We want to improve our estimate of V by computing these averages:
  • Idea: Take samples of outcomes s’ (by doing the action!) and average

[Diagram: from state s, take action π(s), observe sampled outcomes s'_1, s'_2, s'_3.]

What’s the difficulty of this algorithm? Almost! But we can’t rewind time to get sample after sample from state s.

slide-20
SLIDE 20

Temporal Difference Learning

  • Big idea: learn from every experience!
  • Update V(s) each time we experience a transition (s, a, s’, r)
  • Likely outcomes s’ will contribute updates more often
  • Temporal difference learning of values
  • Policy still fixed, still doing evaluation!
  • Move values toward the value of whatever successor occurs: running average

[Diagram: transition s, π(s), s'. Sample of V(s); update to V(s); same update in incremental form, see the equations below.]
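The sample and update being described, written out in the usual TD(0) form (a reconstruction, since the slide's equations did not survive extraction):

\text{sample} = r + \gamma\, V^{\pi}(s'), \qquad V^{\pi}(s) \leftarrow (1 - \alpha)\, V^{\pi}(s) + \alpha \cdot \text{sample} \;\;\equiv\;\; V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,\big(\text{sample} - V^{\pi}(s)\big).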

slide-21
SLIDE 21

Deep Q-Learning

  • Q-Learning with linear function approximators
  • Has some theoretical guarantees
  • Deep Q-Learning: Fit a deep Q-Network
  • Works well in practice
  • Q-Network can take RGB images

21

Image Credits: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-22
SLIDE 22
  • Collect a dataset
  • Loss for a single data point:
  • Act optimally according to the learnt Q function:

Recap: Deep Q-Learning

22

(Equation annotations: target Q-value; predicted Q-value; pick the action with the best Q-value.)
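A standard way to write the per-sample loss and the greedy action (the target-network parameters \theta^{-} are a common refinement and may not appear on the slide):

y_{i} = r + \gamma \max_{a'} Q(s', a';\, \theta^{-}) \quad (\text{target Q-value}), \qquad L_{i}(\theta) = \big(y_{i} - Q(s, a;\, \theta)\big)^{2} \quad (Q(s, a; \theta) \text{ is the predicted Q-value}), \qquad \pi(s) = \arg\max_{a} Q(s, a;\, \theta).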

slide-23
SLIDE 23

Exploration Problem

  • What should the behavior policy used to collect experience be?
  • Greedy? -> Local minima, no exploration
  • An exploration strategy (see the sketch below):
23
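A common exploration strategy here is ε-greedy; the sketch below is a minimal illustration under that assumption (the helper q_values is hypothetical, not from the slides):

```python
import random

def epsilon_greedy_action(q_values, state, num_actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy w.r.t. Q)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                # explore
    q = q_values(state)                                     # sequence of Q(s, a), one entry per action
    return max(range(num_actions), key=lambda a: q[a])      # exploit: argmax_a Q(s, a)
```

In practice ε is usually annealed from a large value toward a small one as training progresses.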
slide-24
SLIDE 24

Experience Replay

  • Address this problem using experience replay
  • A replay buffer stores transitions
  • Continually update the replay buffer as game (experience) episodes are played; older samples are discarded
  • Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples (a minimal sketch follows)
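A minimal sketch of such a buffer, assuming nothing beyond the description above (capacity and field names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)         # oldest transitions are discarded automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))  # store one transition per step played

    def sample(self, batch_size):
        # Random minibatch: decorrelates consecutive samples before training the Q-network
        return random.sample(self.buffer, batch_size)
```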

24 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-25
SLIDE 25

25

Getting to the optimal policy

Transition function and reward function known: use value / policy iteration and obtain the “optimal” policy.

slide-26
SLIDE 26

26

Getting to the optimal policy

Transition function and reward function known: use value / policy iteration and obtain the “optimal” policy.
Transition function and reward function unknown: estimate Q-values from data, then obtain the “optimal” policy (previous class: Q-learning).

slide-27
SLIDE 27

27

Getting to the optimal policy

Transition function and reward function known: use value / policy iteration and obtain the “optimal” policy.
Transition function and reward function unknown: either estimate the transition and reward functions from data and then run value / policy iteration (Homework!), or estimate Q-values from data.

slide-28
SLIDE 28

28

Getting to the optimal policy

Transition function and reward function known: use value / policy iteration and obtain the “optimal” policy.
Transition function and reward function unknown: estimate the transition and reward functions from data, estimate Q-values from data, or (this class!) learn the policy directly.

slide-29
SLIDE 29

29

  • Class of policies defined by parameters θ
  • E.g., θ can be the parameters of a linear transformation, a deep network, etc.

Learning the optimal policy

slide-30
SLIDE 30

30

  • Class of policies defined by parameters θ
  • E.g., θ can be the parameters of a linear transformation, a deep network, etc.
  • Want to maximize:

Learning the optimal policy

slide-31
SLIDE 31

31

  • Class of policies defined by parameters θ
  • E.g., θ can be the parameters of a linear transformation, a deep network, etc.
  • Want to maximize (see the expression below):
  • In other words,

Learning the optimal policy
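A standard way to write the objective (notation reconstructed, not copied from the slide): maximize the expected total discounted reward over trajectories induced by the policy,

J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\, \theta)}\Big[\sum_{t \ge 0} \gamma^{t} r_{t}\Big], \qquad \theta^{*} = \arg\max_{\theta} J(\theta).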

slide-32
SLIDE 32

32

Learning the optimal policy

Sample a few trajectories by acting according to the current policy π_θ

slide-33
SLIDE 33

REINFORCE algorithm

Mathematically, we can write the objective as an expectation over trajectories, where r(τ) is the reward of a trajectory τ

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-34
SLIDE 34

REINFORCE algorithm

Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-35
SLIDE 35

REINFORCE algorithm

Now let’s differentiate this: Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-36
SLIDE 36

REINFORCE algorithm

Intractable! The gradient of an expectation is problematic when p depends on θ

Now let’s differentiate this: Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-37
SLIDE 37

REINFORCE algorithm

Intractable! The gradient of an expectation is problematic when p depends on θ

Now let’s differentiate this: However, we can use a nice trick: Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-38
SLIDE 38

REINFORCE algorithm

Intractable! The gradient of an expectation is problematic when p depends on θ. We can estimate it with Monte Carlo sampling.

Now let’s differentiate this: However, we can use a nice trick: If we inject this back: Expected reward:
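Written out, the trick is the log-derivative (likelihood-ratio) identity ∇p = p ∇ log p, which turns the gradient of an expectation back into an expectation (this is the standard derivation; symbols may differ slightly from the slide):

\nabla_{\theta} J(\theta) = \nabla_{\theta} \int p(\tau; \theta)\, r(\tau)\, d\tau = \int p(\tau; \theta)\, \nabla_{\theta} \log p(\tau; \theta)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p(\tau; \theta)}\big[r(\tau)\, \nabla_{\theta} \log p(\tau; \theta)\big],

which can then be estimated with Monte Carlo samples of trajectories.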

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-39
SLIDE 39

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-40
SLIDE 40

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have: Thus:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-41
SLIDE 41

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have: Thus: And when differentiating:

Doesn’t depend on transition probabilities!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-42
SLIDE 42

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have: Thus: And when differentiating: Therefore, when sampling a trajectory τ, we can estimate the gradient of J(θ) with

Doesn’t depend on transition probabilities!
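Concretely (standard derivation, reconstructed): the trajectory probability factorizes into transition terms and policy terms, and only the policy terms depend on θ, so

p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_{t}, a_{t})\; \pi_{\theta}(a_{t} \mid s_{t}), \qquad \nabla_{\theta} \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}),

giving the sampled estimate \nabla_{\theta} J(\theta) \approx r(\tau) \sum_{t \ge 0} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}).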

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-43
SLIDE 43

43

Policy Gradients

Doesn’t depend on Transition probabilities!

slide-44
SLIDE 44

44

Policy Gradients

slide-45
SLIDE 45

45

Policy Gradients

slide-46
SLIDE 46

46

  • 1. Sample trajectories by acting according to π_θ
  • 2. Compute the policy gradient (as derived above)
  • 3. Update the policy parameters θ

REINFORCE

[Loop diagram: run the policy and sample trajectories → compute the policy gradient → update the policy.] A minimal sketch of one iteration follows. Slide credit: Sergey Levine
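A minimal PyTorch-style sketch of one iteration of this loop, assuming a policy network policy(s) that returns a torch categorical distribution and a hypothetical env whose reset()/step() return a state and (next_state, reward, done); names and details are illustrative, not from the slides:

```python
import torch

def reinforce_iteration(policy, optimizer, env, gamma=0.99):
    """One REINFORCE iteration: sample a trajectory, compute the policy gradient, update the policy."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:                                   # 1. run the policy and sample a trajectory
        dist = policy(torch.as_tensor(s, dtype=torch.float32))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)

    # 2. policy gradient: grad J ~ r(tau) * sum_t grad log pi_theta(a_t | s_t)
    r_tau = sum((gamma ** t) * r for t, r in enumerate(rewards))
    loss = -r_tau * torch.stack(log_probs).sum()      # negate: optimizers minimize

    optimizer.zero_grad()                             # 3. update the policy parameters theta
    loss.backward()
    optimizer.step()
```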

slide-47
SLIDE 47

Pong from pixels

47

Image Credit: http://karpathy.github.io/2016/05/31/rl/

slide-48
SLIDE 48

Pong from pixels

48

Image Credit: http://karpathy.github.io/2016/05/31/rl/

slide-49
SLIDE 49

Intuition

(C) Dhruv Batra 49

slide-50
SLIDE 50

50

Policy Gradients

Formalizes notion of “trial and error”:

  • If reward is high, probability of actions seen is increased
  • If reward is low, probability of actions seen is reduced
  • But in expectation, it averages out
slide-51
SLIDE 51

51

Issues with Policy Gradients

  • Credit assignment is hard!
  • Which specific action led to the increase in reward?
  • Suffers from high variance, leading to unstable training
  • How to reduce the variance?
  • Subtract a constant from the reward!
slide-52
SLIDE 52

52

Issues with Policy Gradients

  • Credit assignment is hard!
  • Which specific action led to the increase in reward?
  • Suffers from high variance, leading to unstable training
  • How to reduce the variance?
  • Subtract a constant from the reward!
  • Why does it work?
  • What is the best choice of b?

Homework!

slide-53
SLIDE 53

Variance reduction

Gradient estimator: First idea: Push up probabilities of an action seen, only by the cumulative future reward from that state

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-54
SLIDE 54

Variance reduction

Gradient estimator: First idea: Push up probabilities of an action seen, only by the cumulative future reward from that state. Second idea: Use a discount factor γ to ignore delayed effects
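Combining the two ideas, the estimator weights each log-probability by the discounted reward-to-go from that time step (standard form, reconstructed):

\nabla_{\theta} J(\theta) \approx \sum_{t \ge 0} \Big(\sum_{t' \ge t} \gamma^{\,t' - t}\, r_{t'}\Big)\, \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}).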

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-55
SLIDE 55

55

Issues with Policy Gradients

  • Credit assignment is hard!
  • Which specific action led to the increase in reward?
  • Suffers from high variance, leading to unstable training
slide-56
SLIDE 56

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-57
SLIDE 57

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-58
SLIDE 58

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) − V^π(s_t) is large. On the contrary, we are unhappy with an action if it’s small.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-59
SLIDE 59

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-60
SLIDE 60

60

  • Learn both policy and Q function
  • Use the “actor” to sample trajectories
  • Use the Q function to “evaluate” or “critique” the policy

Actor-Critic

slide-61
SLIDE 61

61

  • Learn both policy and Q function
  • Use the “actor” to sample trajectories
  • Use the Q function to “evaluate” or “critique” the policy
  • REINFORCE:
  • Actor-critic:

Actor-Critic

slide-62
SLIDE 62

62

  • Learn both policy and Q function
  • Use the “actor” to sample trajectories
  • Use the Q function to “evaluate” or “critique” the policy
  • REINFORCE:
  • Actor-critic:
  • Q function is unknown too! Update using

Actor-Critic

slide-63
SLIDE 63

63

  • Initialize s, (policy network) and (Q network)

Actor-Critic

slide-64
SLIDE 64

64

  • Initialize s, (policy network) and (Q network)
  • sample action

Actor-Critic

slide-65
SLIDE 65

65

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state

Actor-Critic

slide-66
SLIDE 66

66

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state
  • evaluate “actor” using “critic”

Actor-Critic

slide-67
SLIDE 67

67

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state
  • evaluate “actor” using “critic” and update policy:

Actor-Critic

slide-68
SLIDE 68

68

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state
  • evaluate “actor” using “critic” and update policy:
  • Update “critic”:
  • Recall Q-learning

Actor-Critic

slide-69
SLIDE 69

69

  • Initialize s, the policy network π_θ, and the Q-network Q_φ
  • Sample an action a ~ π_θ(a|s)
  • For each step:
  • Sample reward and next state
  • Evaluate the “actor” using the “critic” and update the policy:
  • Update the “critic”:
  • Recall Q-learning
  • Update the Q-network accordingly (see the sketch below)

Actor-Critic
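A minimal PyTorch-style sketch of one inner step of this loop, under the assumptions that actor(s) returns an action distribution, critic(s, a) returns a scalar Q-value tensor, and env.step(a) returns (next_state, reward, done). It uses a SARSA-style TD target r + γ Q(s', a') with a' drawn from the current policy, whereas the slide's critic update may instead use the Q-learning max; all names and details are illustrative, not from the slides:

```python
def actor_critic_step(actor, critic, actor_opt, critic_opt, env, s, a, gamma=0.99):
    """One actor-critic step: sample a transition, update the actor (policy), then the critic (Q)."""
    # s, a: current state and action (a is assumed to be a tensor, as returned by .sample())
    s_next, r, done = env.step(a)            # sample reward and next state
    a_next = actor(s_next).sample()          # a' ~ pi_theta(. | s')

    # Actor: ascend Q_phi(s, a) * grad log pi_theta(a | s); the critic's value is held fixed
    actor_loss = -critic(s, a).detach() * actor(s).log_prob(a)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Critic: TD target r + gamma * Q_phi(s', a'), squared-error loss as in Q-learning
    target = r + gamma * (0.0 if done else critic(s_next, a_next).detach())
    critic_loss = (critic(s, a) - target) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    return s_next, a_next, done
```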
slide-70
SLIDE 70

70

Actor-critic

  • In general, replacing the policy evaluation or the “critic” leads to different flavors of the actor-critic

  • REINFORCE:
  • Q – Actor Critic
slide-71
SLIDE 71

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) − V^π(s_t) is large. On the contrary, we are unhappy with an action if it’s small. Using this, we get the estimator:
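The resulting estimator, in standard form (reconstructed; notation may differ from the slide):

\nabla_{\theta} J(\theta) \approx \sum_{t \ge 0} \big(Q^{\pi}(s_{t}, a_{t}) - V^{\pi}(s_{t})\big)\, \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}).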

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-72
SLIDE 72

72

Actor-critic

  • In general, replacing the policy evaluation or the “critic” leads to different flavors of the actor-critic

  • REINFORCE:
  • Q – Actor Critic
  • Advantage Actor Critic:

“How much better is an action than expected?”
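The quantity in quotes is the advantage function; in standard notation (not copied from the slide), A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s), and the advantage actor-critic weights \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) by an estimate of A^{\pi}(s_{t}, a_{t}).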

slide-73
SLIDE 73

73

  • Policy Learning:
  • Policy gradients
  • REINFORCE
  • Reducing Variance (Homework!)
  • Actor-Critic:
  • Other ways of performing “policy evaluation”
  • Variants of Actor-critic

Summary