SLIDE 1

Lecture 4: Model Free Control

Emma Brunskill

CS234 Reinforcement Learning, Winter 2020

Structure closely follows much of David Silver's Lecture 5. For additional reading please see SB Sections 5.2-5.4, 6.4, 6.5, and 6.7.

SLIDE 2

Refresh Your Knowledge 3. Piazza Poll

Which of the following equations express a TD update?

1. V(st) = r(st, at) + γ Σ_{s′} p(s′|st, at) V(s′)
2. V(st) = (1 − α)V(st) + α(r(st, at) + γV(st+1))
3. V(st) = (1 − α)V(st) + α Σ_{i=t}^{H} r(si, ai)
4. V(st) = (1 − α)V(st) + α maxa(r(st, a) + γV(st+1))
5. Not sure

Bootstrapping is when

1. Samples of (s, a, s′) transitions are used to approximate the true expectation over next states
2. An estimate of the next state value is used instead of the true next state value
3. It is used in Monte Carlo policy evaluation
4. Not sure

SLIDE 3

Refresh Your Knowledge 3. Piazza Poll

Which of the following equations express a TD update?

  Equation 2: V(st) = (1 − α)V(st) + α(r(st, at) + γV(st+1))

Bootstrapping is when an estimate of the next state value is used instead of the true next state value
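To make the chosen TD update concrete, here is a minimal Python sketch of a single TD(0) backup; the state indices and parameter values are illustrative assumptions, not from the slides:

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # One TD(0) backup: move V[s] toward the bootstrapped target r + gamma * V[s_next].
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
    return V

# Hypothetical usage on a 7-state chain (like the Mars rover used later in the lecture):
V = np.zeros(7)
V = td0_update(V, s=5, r=0.0, s_next=6)
```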

SLIDE 4

Table of Contents

1. Generalized Policy Iteration
2. Importance of Exploration
3. Maximization Bias

SLIDE 5

Class Structure

Last time: Policy evaluation with no knowledge of how the world works (MDP model not given)
This time: Control (making decisions) without a model of how the world works
Next time: Generalization – Value function approximation

SLIDE 6

Evaluation to Control

Last time: how good is a specific policy?

Given no access to the decision process model parameters
Instead have to estimate from data / experience

Today: how can we learn a good policy?

SLIDE 7

Recall: Reinforcement Learning Involves

Optimization
Delayed consequences
Exploration
Generalization

SLIDE 8

Today: Learning to Control Involves

Optimization: Goal is to identify a policy with high expected rewards (similar to Lecture 2 on computing an optimal policy given decision process models)
Delayed consequences: May take many time steps to evaluate whether an earlier decision was good or not
Exploration: Necessary to try different actions to learn what actions can lead to high rewards

SLIDE 9

Today: Model-free Control

Generalized policy improvement
Importance of exploration
Monte Carlo control
Model-free control with temporal difference (SARSA, Q-learning)
Maximization bias

SLIDE 10

Model-free Control Examples

Many applications can be modeled as an MDP: Backgammon, Go, robot locomotion, helicopter flight, Robocup soccer, autonomous driving, customer ad selection, invasive species management, patient treatment
For many of these and other problems, either:

  The MDP model is unknown but can be sampled
  The MDP model is known but is computationally infeasible to use directly, except through sampling

SLIDE 11

On and Off-Policy Learning

On-policy learning

Direct experience
Learn to estimate and evaluate a policy from experience obtained from following that policy

Off-policy learning

Learn to estimate and evaluate a policy using experience gathered from following a different policy

SLIDE 12

Table of Contents

1. Generalized Policy Iteration
2. Importance of Exploration
3. Maximization Bias

SLIDE 13

Recall Policy Iteration

Initialize policy π
Repeat:
  Policy evaluation: compute Vπ
  Policy improvement: update π:

    π′(s) = arg maxa [ R(s, a) + γ Σ_{s′∈S} P(s′|s, a) Vπ(s′) ] = arg maxa Qπ(s, a)

Now we want to do the above two steps without access to the true dynamics and reward models.
Last lecture introduced methods for model-free policy evaluation.

SLIDE 14

Model Free Policy Iteration

Initialize policy π
Repeat:
  Policy evaluation: compute Qπ
  Policy improvement: update π

SLIDE 15

MC for On Policy Q Evaluation

Initialize N(s, a) = 0, G(s, a) = 0, Qπ(s, a) = 0, ∀s ∈ S, ∀a ∈ A
Loop:
  Using policy π, sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
  Compute Gi,t = ri,t + γ ri,t+1 + γ² ri,t+2 + · · · + γ^{Ti−t} ri,Ti for each t
  For each state-action pair (s, a) visited in episode i:
    For the first time t (first-visit MC) or every time t (every-visit MC) that (s, a) is visited in episode i:
      N(s, a) = N(s, a) + 1, G(s, a) = G(s, a) + Gi,t
      Update estimate Qπ(s, a) = G(s, a)/N(s, a)
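A minimal Python sketch of first-visit MC on-policy Q evaluation; the episode format (a list of (state, action, reward) tuples) is an assumption for illustration, not specified in the slides:

```python
from collections import defaultdict

def mc_q_evaluation(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of Q^pi from episodes generated by pi."""
    N = defaultdict(int)        # visit counts per (s, a)
    G_sum = defaultdict(float)  # summed returns per (s, a)
    Q = defaultdict(float)
    for episode in episodes:
        # Compute the return G_t at every step t by scanning backwards.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        # First-visit: only update the first occurrence of each (s, a).
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            N[(s, a)] += 1
            G_sum[(s, a)] += returns[t]
            Q[(s, a)] = G_sum[(s, a)] / N[(s, a)]
    return Q
```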

SLIDE 16

Model-free Generalized Policy Improvement

Given an estimate Qπi(s, a) ∀s, a
Update the new policy: πi+1(s) = arg maxa Qπi(s, a)

SLIDE 17

Model-free Policy Iteration

Initialize policy π
Repeat:
  Policy evaluation: compute Qπ
  Policy improvement: update π given Qπ

May need to modify policy evaluation:

  If π is deterministic, can't compute Q(s, a) for any a ≠ π(s)

How to interleave policy evaluation and improvement?

  Policy improvement is now using an estimated Q

SLIDE 18

Table of Contents

1. Generalized Policy Iteration
2. Importance of Exploration
3. Maximization Bias

SLIDE 19

Policy Evaluation with Exploration

Want to compute a model-free estimate of Qπ
In general this seems subtle:

  Need to try all (s, a) pairs, but then follow π
  Want to ensure the resulting estimate Qπ is good enough so that policy improvement is a monotonic operator

For certain classes of policies we can ensure all (s, a) pairs are tried such that asymptotically Qπ converges to the true value

SLIDE 21

ǫ-greedy Policies

Simple idea to balance exploration and exploitation
Let |A| be the number of actions
Then an ǫ-greedy policy w.r.t. a state-action value Q(s, a) is

  π(a|s) = arg maxa Q(s, a)   with probability 1 − ǫ + ǫ/|A|
  π(a|s) = a′ ≠ arg maxa Q(s, a)   with probability ǫ/|A| for each other action a′
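A minimal Python sketch of sampling from such an ǫ-greedy policy; the Q-table layout (a 2-D array indexed by state then action) is an assumption for illustration:

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon, rng):
    """With prob epsilon pick uniformly at random, else act greedily.
    Each action therefore has probability at least epsilon / |A|, and the
    greedy action has probability 1 - epsilon + epsilon / |A|."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[s]))               # exploit

# Hypothetical usage with a small random Q-table (7 states, 2 actions):
rng = np.random.default_rng(0)
Q = rng.normal(size=(7, 2))
a = epsilon_greedy_action(Q, s=3, epsilon=0.5, rng=rng)
```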

SLIDE 22

For Later Practice: MC for On Policy Q Evaluation

Initialize N(s, a) = 0, G(s, a) = 0, Qπ(s, a) = 0, ∀s ∈ S, ∀a ∈ A
Loop:
  Using policy π, sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
  Compute Gi,t = ri,t + γ ri,t+1 + γ² ri,t+2 + · · · + γ^{Ti−t} ri,Ti for each t
  For each state-action pair (s, a) visited in episode i:
    For the first time t (first-visit MC) or every time t (every-visit MC) that (s, a) is visited in episode i:
      N(s, a) = N(s, a) + 1, G(s, a) = G(s, a) + Gi,t
      Update estimate Qπ(s, a) = G(s, a)/N(s, a)

Mars rover with new actions:

r(−, a1) = [ 1 0 0 0 0 0 +10], r(−, a2) = [ 0 0 0 0 0 0 +5], γ = 1.

Assume current greedy π(s) = a1 ∀s, with ǫ = .5
Sample trajectory from the ǫ-greedy policy

Trajectory = (s3, a1, 0, s2, a2, 0, s3, a1, 0, s2, a2, 0, s1, a1, 1, terminal)

First visit MC estimate of Q of each (s, a) pair? Qǫ−π(−, a1) = [1 0 1 0 0 0 0], Qǫ−π(−, a2) = [0 1 0 0 0 0 0]
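As a quick check, the mc_q_evaluation sketch above reproduces these estimates on this trajectory (using 0-based state/action indices, an encoding choice of this illustration):

```python
# Trajectory (s3, a1, 0, s2, a2, 0, s3, a1, 0, s2, a2, 0, s1, a1, 1), 0-based indices:
episode = [(2, 0, 0.0), (1, 1, 0.0), (2, 0, 0.0), (1, 1, 0.0), (0, 0, 1.0)]
Q = mc_q_evaluation([episode], gamma=1.0)
print(Q[(2, 0)], Q[(1, 1)], Q[(0, 0)])  # 1.0 1.0 1.0, matching Qǫ−π above
```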

SLIDE 24

Monotonic ǫ-greedy Policy Improvement

Theorem

For any ǫ-greedy policy πi, the ǫ-greedy policy w.r.t. Qπi, πi+1 is a monotonic improvement V πi+1 ≥ V πi

Qπi(s, πi+1(s)) = Σ_{a∈A} πi+1(a|s) Qπi(s, a)

               = (ǫ/|A|) Σ_{a∈A} Qπi(s, a) + (1 − ǫ) maxa Qπi(s, a)

               = (ǫ/|A|) Σ_{a∈A} Qπi(s, a) + (1 − ǫ) maxa Qπi(s, a) · Σ_{a∈A} (πi(a|s) − ǫ/|A|) / (1 − ǫ)

               ≥ (ǫ/|A|) Σ_{a∈A} Qπi(s, a) + (1 − ǫ) Σ_{a∈A} [(πi(a|s) − ǫ/|A|) / (1 − ǫ)] Qπi(s, a)

               = Σ_{a∈A} πi(a|s) Qπi(s, a) = Vπi(s)

Here the third line multiplies the max term by Σ_{a∈A} (πi(a|s) − ǫ/|A|) / (1 − ǫ), which equals 1 because πi is ǫ-greedy; since these weights are nonnegative and sum to 1, the max is at least their weighted average, which gives the inequality.

Therefore Vπi+1 ≥ Vπi (by the policy improvement theorem)

SLIDE 26

Greedy in the Limit of Infinite Exploration (GLIE)

Definition of GLIE

All state-action pairs are visited an infinite number of times: limi→∞ Ni(s, a) → ∞
The behavior policy (the policy used to act in the world) converges to the greedy policy: limi→∞ π(a|s) → arg maxa Q(s, a) with probability 1

A simple GLIE strategy is ǫ-greedy where ǫ is reduced to 0 with the rate ǫi = 1/i

SLIDE 27

Monte Carlo Online Control / On Policy Improvement

1: Initialize Q(s, a) = 0, N(s, a) = 0 ∀(s, a), set ǫ = 1, k = 1
2: πk = ǫ-greedy(Q) // Create initial ǫ-greedy policy
3: loop
4:   Sample k-th episode (sk,1, ak,1, rk,1, sk,2, . . . , sk,Tk) given πk
5:   Compute Gk,t = rk,t + γ rk,t+1 + γ² rk,t+2 + · · · + γ^{Tk−t} rk,Tk for each t
6:   for t = 1, . . . , Tk do
7:     if first visit to (sk,t, ak,t) in episode k then
8:       N(sk,t, ak,t) = N(sk,t, ak,t) + 1
9:       Q(sk,t, ak,t) = Q(sk,t, ak,t) + (1/N(sk,t, ak,t)) (Gk,t − Q(sk,t, ak,t))
10:    end if
11:  end for
12:  k = k + 1, ǫ = 1/k
13:  πk = ǫ-greedy(Q) // Policy improvement
14: end loop
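A compact Python sketch of this GLIE Monte Carlo control loop; the gym-style environment with reset() and step(a) returning (next_state, reward, done) is an assumed interface, not part of the slides:

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, n_episodes=10000, gamma=1.0):
    """GLIE Monte Carlo control: first-visit updates, eps-greedy with eps = 1/k."""
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k  # GLIE schedule: exploration decays to zero
        # Roll out one episode under the current eps-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Returns G_t for every step, computed by a backward scan.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            returns[t] = G
        # First-visit incremental updates: Q <- Q + (G - Q) / N.
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            N[s, a] += 1
            Q[s, a] += (returns[t] - Q[s, a]) / N[s, a]
    return Q
```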

SLIDE 28
Poll. Check Your Understanding: MC for On Policy Control

Mars rover with new actions:

r(−, a1) = [ 1 0 0 0 0 0 +10], r(−, a2) = [ 0 0 0 0 0 0 +5], γ = 1.

Assume current greedy π(s) = a1 ∀s, with ǫ = .5, and Q(s, a) = 0 for all (s, a)
Sample trajectory from the ǫ-greedy policy

Trajectory = (s3, a1, 0, s2, a2, 0, s3, a1, 0, s2, a2, 0, s1, a1, 1, terminal)

First visit MC estimate of Q of each (s, a) pair? Qǫ−π(−, a1) = [1 0 1 0 0 0 0]
After this trajectory (select all that apply):
1. Qǫ−π(−, a2) = [0 0 0 0 0 0 0]
2. The new greedy policy would be: π = [1 tie 1 tie tie tie tie]
3. The new greedy policy would be: π = [1 2 1 tie tie tie tie]
4. If ǫ = 1/3, the new π(s1) = a1 with prob 2/3, else selects randomly
5. Not sure

SLIDE 29

Check Your Understanding: MC for On Policy Control

Mars rover with new actions:

r(−, a1) = [ 1 0 0 0 0 0 +10], r(−, a2) = [ 0 0 0 0 0 0 +5], γ = 1.

Assume current greedy π(s) = a1 ∀s, with ǫ = .5
Sample trajectory from the ǫ-greedy policy

Trajectory = (s3, a1, 0, s2, a2, 0, s3, a1, 0, s2, a2, 0, s1, a1, 1, terminal)

First visit MC estimate of Q of each (s, a) pair? Qǫ−π(−, a1) = [1 0 1 0 0 0 0], Qǫ−π(−, a2) = [0 1 0 0 0 0 0]
What is π(s) = arg maxa Qǫ−π(s, a) ∀s? π = [1 2 1 tie tie tie tie]
What is the new ǫ-greedy policy, if k = 3, ǫ = 1/k? With probability 2/3 choose π(s), else choose randomly. For example, π(s1) = a1 with probability 2/3, else randomly choose an action.

SLIDE 30

GLIE Monte-Carlo Control

Theorem

GLIE Monte-Carlo control converges to the optimal state-action value function Q(s, a) → Q∗(s, a)

SLIDE 31

Model-free Policy Iteration

Initialize policy π
Repeat:
  Policy evaluation: compute Qπ
  Policy improvement: update π given Qπ

What about TD methods?

SLIDE 32

Model-free Policy Iteration with TD Methods

Use temporal difference methods for the policy evaluation step
Initialize policy π
Repeat:
  Policy evaluation: compute Qπ using temporal difference updating with an ǫ-greedy policy
  Policy improvement: same as Monte Carlo policy improvement, set π to ǫ-greedy(Qπ)

First consider SARSA, which is an on-policy algorithm.

SLIDE 33

General Form of SARSA Algorithm

1: Set initial ǫ-greedy policy π randomly, t = 0, initial state st = s0
2: Take at ∼ π(st)
3: Observe (rt, st+1)
4: loop
5:   Take action at+1 ∼ π(st+1) // Sample action from policy
6:   Observe (rt+1, st+2)
7:   Update Q given (st, at, rt, st+1, at+1)
8:   Perform policy improvement
9:   t = t + 1
10: end loop

SLIDE 34

General Form of SARSA Algorithm

1: Set initial ǫ-greedy policy π, t = 0, initial state st = s0
2: Take at ∼ π(st) // Sample action from policy
3: Observe (rt, st+1)
4: loop
5:   Take action at+1 ∼ π(st+1)
6:   Observe (rt+1, st+2)
7:   Q(st, at) ← Q(st, at) + α(rt + γQ(st+1, at+1) − Q(st, at))
8:   π(st) = arg maxa Q(st, a) w. prob 1 − ǫ, else random
9:   t = t + 1
10: end loop
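A minimal Python sketch of tabular SARSA with a fixed ǫ; the environment interface is the same assumed gym-style one as in the MC sketch above:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=1000, alpha=0.5, gamma=1.0, eps=0.1):
    """Tabular SARSA: on-policy TD control bootstrapping on the action actually taken next."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    def act(s):
        # eps-greedy behavior policy derived from the current Q
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = act(s_next)
            # On-policy target: bootstrap on Q of the action we will actually take.
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```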

SLIDE 35

Worked Example: SARSA for Mars Rover


Initialize ǫ = 1/k, k = 1, and α = 0.5, with Q(−, a1) = [ 1 0 0 0 0 0 +10], Q(−, a2) = [ 1 0 0 0 0 0 +5], γ = 1
Assume the starting state is s6 and sample action a1

SLIDE 36

Worked Example: SARSA for Mars Rover


Initialize ǫ = 1/k, k = 1, and α = 0.5, with Q(−, a1) = [ 1 0 0 0 0 0 +10], Q(−, a2) = [ 1 0 0 0 0 0 +5], γ = 1
Tuple: (s6, a1, 0, s7, a2, 5, s7)
Q(s6, a1) = .5 ∗ 0 + .5 ∗ (0 + γ Q(s7, a2)) = .5 ∗ 5 = 2.5

SLIDE 37

SARSA Initialization

Mars rover with new actions:

r(−, a1) = [ 1 0 0 0 0 0 +10], r(−, a2) = [ 0 0 0 0 0 0 +5], γ = 1.

Initialize ǫ = 1/k, k = 1, and α = 0.5, with Q(−, a1) = r(−, a1), Q(−, a2) = r(−, a2)
SARSA tuple: (s6, a1, 0, s7, a2, 5, s7)
Does how Q is initialized matter (initially? asymptotically?)? Asymptotically no, under mild conditions, but at the beginning, yes

SLIDE 38

Convergence Properties of SARSA

Theorem

SARSA for finite-state and finite-action MDPs converges to the optimal action-value, Q(s, a) → Q∗(s, a), under the following conditions:

1. The policy sequence πt(a|s) satisfies the condition of GLIE
2. The step-sizes αt satisfy the Robbins-Monro sequence:

   Σ_{t=1}^{∞} αt = ∞ and Σ_{t=1}^{∞} αt² < ∞

For example, αt = 1/t satisfies the above condition.

Would one want to use a step size choice that satisfies the above in practice? Likely not.

SLIDE 39

Q-Learning: Learning the Optimal State-Action Value

SARSA is an on-policy learning algorithm: it estimates the value of the current behavior policy (the policy used to take actions in the world), and then updates that policy
Alternatively, can we directly estimate the value of π∗ while acting with another behavior policy πb?
Yes! Q-learning, an off-policy RL algorithm
Key idea: maintain state-action Q estimates and bootstrap using the value of the best future action
Recall SARSA: Q(st, at) ← Q(st, at) + α((rt + γ Q(st+1, at+1)) − Q(st, at))
Q-learning: Q(st, at) ← Q(st, at) + α((rt + γ maxa′ Q(st+1, a′)) − Q(st, at))

SLIDE 40

Q-Learning with ǫ-greedy Exploration

1: Initialize Q(s, a), ∀s ∈ S, a ∈ A, t = 0, initial state st = s0
2: Set πb to be ǫ-greedy w.r.t. Q
3: loop
4:   Take at ∼ πb(st) // Sample action from policy
5:   Observe (rt, st+1)
6:   Q(st, at) ← Q(st, at) + α(rt + γ maxa Q(st+1, a) − Q(st, at))
7:   π(st) = arg maxa Q(st, a) w. prob 1 − ǫ, else random
8:   t = t + 1
9: end loop
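A minimal Python sketch of tabular Q-learning; it differs from the SARSA sketch above only in the bootstrapped target (max over next actions rather than the action actually taken next). The environment interface is the same assumed one:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=1000, alpha=0.5, gamma=1.0, eps=0.1):
    """Tabular Q-learning: off-policy TD control bootstrapping on max_a Q(s', a)."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # eps-greedy behavior policy
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target: max over next actions, regardless of what is taken next.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```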

SLIDE 41

Worked Example: ǫ-greedy Q-Learning Mars


Initialize ǫ = 1/k, k = 1, and α = 0.5, with Q(−, a1) = [ 1 0 0 0 0 0 +10], Q(−, a2) = [ 1 0 0 0 0 0 +5], γ = 1
As in the SARSA example, start in s6 and take a1

SLIDE 42

Worked Example: ǫ-greedy Q-Learning Mars


Initialize ǫ = 1/k, k = 1, and α = 0.5, with Q(−, a1) = [ 1 0 0 0 0 0 +10], Q(−, a2) = [ 1 0 0 0 0 0 +5], γ = 1
Tuple: (s6, a1, 0, s7)
Q(s6, a1) = 0 + .5 ∗ (0 + γ maxa′ Q(s7, a′) − 0) = .5 ∗ 10 = 5
Recall that in the SARSA update we saw Q(s6, a1) = 2.5, because we used the actual action taken at s7 instead of the max
Does how Q is initialized matter (initially? asymptotically?)? Asymptotically no, under mild conditions, but at the beginning, yes

SLIDE 43

Check Your Understanding: SARSA and Q-Learning

SARSA: Q(st, at) ← Q(st, at) + α(rt + γ Q(st+1, at+1) − Q(st, at))
Q-Learning: Q(st, at) ← Q(st, at) + α(rt + γ maxa′ Q(st+1, a′) − Q(st, at))
Select all that are true:
1. Both SARSA and Q-learning may update their policy after every step
2. If ǫ = 0 for all time steps, and Q is initialized randomly, a SARSA Q state update will be the same as a Q-learning Q state update
3. Not sure

SLIDE 44

Q-Learning with ǫ-greedy Exploration

What conditions are sufficient to ensure that Q-learning with ǫ-greedy exploration converges to the optimal Q∗?

  Visit all (s, a) pairs infinitely often, and have the step-sizes αt satisfy the Robbins-Monro sequence. Note: the algorithm does not have to be greedy in the limit of infinite exploration (GLIE) to satisfy this (ǫ could be kept large).

What conditions are sufficient to ensure that Q-learning with ǫ-greedy exploration converges to the optimal π∗?

  The algorithm is GLIE, along with the above requirement, to ensure the Q value estimates converge to the optimal Q.

SLIDE 46

Table of Contents

1. Generalized Policy Iteration
2. Importance of Exploration
3. Maximization Bias

SLIDE 47

Maximization Bias¹

Consider a single-state MDP (|S| = 1) with 2 actions, where both actions have 0-mean random rewards: E[r|a = a1] = E[r|a = a2] = 0
Then Q(s, a1) = Q(s, a2) = 0 = V(s)
Assume there are prior samples of taking actions a1 and a2
Let Q̂(s, a1), Q̂(s, a2) be the finite-sample estimates of Q
Use an unbiased estimator for Q, e.g. Q̂(s, a1) = (1/n(s, a1)) Σ_{i=1}^{n(s,a1)} ri(s, a1)
Let π̂ = arg maxa Q̂(s, a) be the greedy policy w.r.t. the estimated Q̂

¹ Example from Mannor, Simester, Sun and Tsitsiklis. Bias and Variance Approximation in Value Function Estimates. Management Science 2007.

SLIDE 48

Maximization Bias² Proof

Consider a single-state MDP (|S| = 1) with 2 actions, where both actions have 0-mean random rewards: E[r|a = a1] = E[r|a = a2] = 0
Then Q(s, a1) = Q(s, a2) = 0 = V(s)
Assume there are prior samples of taking actions a1 and a2
Let Q̂(s, a1), Q̂(s, a2) be the finite-sample estimates of Q
Use an unbiased estimator for Q, e.g. Q̂(s, a1) = (1/n(s, a1)) Σ_{i=1}^{n(s,a1)} ri(s, a1)
Let π̂ = arg maxa Q̂(s, a) be the greedy policy w.r.t. the estimated Q̂
Even though each estimate of the state-action values is unbiased, the estimate of π̂'s value V̂π̂ can be biased:

  V̂π̂(s) = E[max(Q̂(s, a1), Q̂(s, a2))]
         ≥ max(E[Q̂(s, a1)], E[Q̂(s, a2)])
         = max(0, 0) = 0 = Vπ̂(s),

where the inequality comes from Jensen's inequality.

² Example from Mannor, Simester, Sun and Tsitsiklis. Bias and Variance Approximation in Value Function Estimates. Management Science 2007.
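A small simulation sketch (an illustration of the argument above, not from the slides) that exhibits this bias with Gaussian rewards:

```python
import numpy as np

# Both actions have true mean reward 0; estimate each Q from n noisy samples,
# then look at the expected value of the greedy estimate max(Q1_hat, Q2_hat).
rng = np.random.default_rng(0)
n_samples, n_trials = 5, 100000
q_hat = rng.normal(0.0, 1.0, size=(n_trials, 2, n_samples)).mean(axis=2)
print(q_hat.max(axis=1).mean())  # ~0.25 > 0: the max of unbiased estimates is biased upward
```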

SLIDE 49

Double Q-Learning

The greedy policy w.r.t. estimated Q values can yield a maximization bias during finite-sample learning
Avoid using the max of estimates as an estimate of the max of the true values
Instead, split samples and use them to create two independent, unbiased estimates Q1(s1, ai) and Q2(s1, ai) ∀a:

  Use one estimate to select the max action: a∗ = arg maxa Q1(s1, a)
  Use the other estimate to estimate the value of a∗: Q2(s, a∗)
  This yields an unbiased estimate: E[Q2(s, a∗)] = Q(s, a∗)

Why does this yield an unbiased estimate of the max state-action value? Because the samples used to estimate the value are independent of those used to select the action
If acting online, can alternate which samples are used to update Q1 and Q2, using the other to select the action chosen
The next slides extend this to the full MDP case (with more than one state)

SLIDE 50

Double Q-Learning

1: Initialize Q1(s, a) and Q2(s, a), ∀s ∈ S, a ∈ A, t = 0, initial state st = s0
2: loop
3:   Select at using ǫ-greedy π(s) = arg maxa (Q1(st, a) + Q2(st, a))
4:   Observe (rt, st+1)
5:   if (with 0.5 probability) then
6:     Q1(st, at) ← Q1(st, at) + α(rt + γ Q2(st+1, arg maxa Q1(st+1, a)) − Q1(st, at))
7:   else
8:     Q2(st, at) ← Q2(st, at) + α(rt + γ Q1(st+1, arg maxa Q2(st+1, a)) − Q2(st, at))
9:   end if
10:  t = t + 1
11: end loop

Compared to Q-learning, how does this change the: memory requirements, computation requirements per step, amount of data required?

SLIDE 51

Double Q-Learning


Compared to Q-learning, how does this change the memory requirements, computation requirements per step, and amount of data required? It doubles the memory and keeps the same computation requirements per step; the data requirements are subtle – double Q-learning might reduce the amount of exploration needed due to its lower bias.
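A Python sketch of tabular double Q-learning following the algorithm above, with the same assumed gym-style environment interface as the earlier sketches:

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, n_episodes=1000, alpha=0.5, gamma=1.0, eps=0.1):
    """Double Q-learning: one table selects the argmax action, the other evaluates it."""
    Q1 = np.zeros((n_states, n_actions))
    Q2 = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy is eps-greedy w.r.t. Q1 + Q2.
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q1[s] + Q2[s]))
            s_next, r, done = env.step(a)
            if rng.random() < 0.5:
                # Q1 selects the next action, Q2 evaluates it.
                a_star = int(np.argmax(Q1[s_next]))
                target = r + (0.0 if done else gamma * Q2[s_next, a_star])
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                a_star = int(np.argmax(Q2[s_next]))
                target = r + (0.0 if done else gamma * Q1[s_next, a_star])
                Q2[s, a] += alpha * (target - Q2[s, a])
            s = s_next
    return Q1, Q2
```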

SLIDE 52

Double Q-Learning (Figure 6.7 in Sutton and Barto 2018)

Due to the maximization bias, Q-learning spends much more time selecting suboptimal actions than double Q-learning.

SLIDE 53

What You Should Know

Be able to implement MC on-policy control, SARSA, and Q-learning
Compare them according to properties of how quickly they update, (informally) bias and variance, and computational cost
Define conditions for these algorithms to converge to the optimal Q and optimal π, and give at least one way to guarantee such conditions are met

SLIDE 54

Class Structure

Last time: Policy evaluation with no knowledge of how the world works (MDP model not given)
This time: Control (making decisions) without a model of how the world works
Next time: Generalization – Value function approximation

SLIDE 55

Q-Learning with ǫ-greedy Exploration

1: Initialize Q(s, a), ∀s ∈ S, a ∈ A, t = 0, initial state st = s0
2: Set πb to be ǫ-greedy w.r.t. Q
3: loop
4:   Take at ∼ πb(st) // Sample action from policy
5:   Observe (rt, st+1)
6:   Update Q given (st, at, rt, st+1)
7:   Perform policy improvement: set πb to be ǫ-greedy w.r.t. Q
8:   t = t + 1
9: end loop

SLIDE 56

Worked Example: SARSA and Q-learning

Mars rover with new actions:

r(−, a1) = [ 1 0 0 0 0 0 +10], r(−, a2) = [ 0 0 0 0 0 0 +5], γ = 1.

Assume current greedy π(s) = a1 ∀s, with ǫ = .5
Sample trajectory from the ǫ-greedy policy

Trajectory = (s3, a1, 0, s2, a2, 0, s3, a1, 0, s2, a2, 0, s1, a1, 1, terminal)

New ǫ-greedy policy under MC, if k = 3, ǫ = 1/k: with probability 2/3 choose π = [1 2 1 tie tie tie tie], else choose randomly
Q-learning updates? Initialize ǫ = 1/k, k = 1, and α = 0.5
π is random with probability ǫ, else π = [ 1 1 1 2 1 2 1]
First tuple: (s3, a1, 0, s2). Do a Q-learning update and compute the new ǫ-greedy π for (s3, a1):
  Q(st, at) ← Q(st, at) + α(rt + γ maxa Q(st+1, a) − Q(st, at))
  Update: Q(s3, a1) = 0, and set k = 2
  The new policy is random with probability 1/k, else π(s3) = arg maxa Q(s3, a) = tie between actions 1 and 2
Does how Q is initialized matter (initially? asymptotically?)?

SLIDE 57

Check Your Understanding: SARSA and Updating

1: Set initial ǫ-greedy policy π, t = 0, initial state st = s0
2: Take at ∼ π(st) // Sample action from policy
3: Observe (rt, st+1)
4: loop
5:   Take action at+1 ∼ π(st+1)
6:   Observe (rt+1, st+2)
7:   Q(st, at) ← Q(st, at) + α(rt + γQ(st+1, at+1) − Q(st, at))
8:   π(st) = arg maxa Q(st, a) w. prob 1 − ǫ, else random
9:   t = t + 1
10: end loop

Select all that are true:
1. SARSA may update its policy after every step
2. It is best to update the policy after each step
3. It is best to update Q at each step, but sometimes better to update the policy less frequently
4. Not sure

SLIDE 58

Off-Policy Control Using TD Methods

In policy evaluation, assume there was a behavior policy πb used to act
πb determines the actual rewards received
Now consider how to improve the behavior policy (policy improvement)
Let the behavior policy πb be ǫ-greedy with respect to (w.r.t.) the current estimate of the optimal Q(s, a)
