

SLIDE 1

Lecture 3: Model-Free Policy Evaluation: Policy Evaluation Without Knowing How the World Works¹

Emma Brunskill

CS234 Reinforcement Learning

Winter 2020

¹Material builds on structure from David Silver's Lecture 4: Model-Free Prediction.

Other resources: Sutton and Barto, Jan 1 2018 draft, Chapters/Sections 5.1, 5.5, 6.1-6.3


SLIDE 2

Refresh Your Knowledge 2 [Piazza Poll]

What is the max number of iterations of policy iteration in a tabular MDP?
1. |A|^|S|
2. |S|^|A|
3. |A||S|
4. Unbounded
5. Not sure

In a tabular MDP, asymptotically value iteration will always yield a policy with the same value as the policy returned by policy iteration.
1. True
2. False
3. Not sure

Can value iteration require more iterations than |A|^|S| to compute the optimal value function? (Assume |A| and |S| are small enough that each round of value iteration can be done exactly.)
1. True
2. False
3. Not sure


SLIDE 3

Refresh Your Knowledge 2

• What is the max number of iterations of policy iteration in a tabular MDP?
• Can value iteration require more iterations than |A|^|S| to compute the optimal value function? (Assume |A| and |S| are small enough that each round of value iteration can be done exactly.)
• In a tabular MDP, asymptotically value iteration will always yield a policy with the same value as the policy returned by policy iteration.


SLIDE 4

Today’s Plan

Last Time:
• Markov reward / decision processes
• Policy evaluation & control when we have a true model (of how the world works)

Today:
• Policy evaluation without known dynamics & reward models

Next Time:
• Control when we don't have a model of how the world works


SLIDE 5

This Lecture: Policy Evaluation

• Estimating the expected return of a particular policy if we don't have access to true MDP models
  • Dynamic programming
  • Monte Carlo policy evaluation
  • Temporal Difference (TD)
• Policy evaluation when we don't have a model of how the world works
  • Given on-policy samples
• Metrics to evaluate and compare algorithms


SLIDE 6

Recall

Definition of Return, Gt (for an MRP)

Discounted sum of rewards from time step t to horizon:
Gt = rt + γrt+1 + γ²rt+2 + γ³rt+3 + ···

Definition of State Value Function, V π(s)

Expected return from starting in state s under policy π:
V π(s) = Eπ[Gt|st = s] = Eπ[rt + γrt+1 + γ²rt+2 + γ³rt+3 + ··· |st = s]

Definition of State-Action Value Function, Qπ(s, a)

Expected return from starting in state s, taking action a and then following policy π:
Qπ(s, a) = Eπ[Gt|st = s, at = a] = Eπ[rt + γrt+1 + γ²rt+2 + γ³rt+3 + ··· |st = s, at = a]


SLIDE 7

Dynamic Programming for Evaluating Value of Policy π

Initialize V π_0(s) = 0 for all s
For k = 1 until convergence:
  For all s in S:
    V π_k(s) = r(s, π(s)) + γ Σ_{s′∈S} p(s′|s, π(s)) V π_{k−1}(s′)

• V π_k(s) is the exact k-horizon value of state s under policy π
• V π_k(s) is an estimate of the infinite-horizon value of state s under policy π
• V π(s) = Eπ[Gt|st = s] ≈ Eπ[rt + γV_{k−1}|st = s]
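To make the loop concrete, here is a minimal tabular sketch in Python; the array layouts for P and R and the function name are assumptions, not from the slides:

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma, tol=1e-8):
    """Tabular DP policy evaluation, a sketch of the loop above.

    P: (S, A, S) array, P[s, a, s2] = p(s2 | s, a)   (assumed layout)
    R: (S, A) array, R[s, a] = r(s, a)               (assumed layout)
    policy: length-S array, policy[s] = action taken in state s
    """
    S = R.shape[0]
    V = np.zeros(S)  # V_0(s) = 0 for all s
    while True:
        # V_k(s) = r(s, pi(s)) + gamma * sum_{s'} p(s'|s, pi(s)) V_{k-1}(s')
        V_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                          for s in range(S)])
        if np.max(np.abs(V_new - V)) < tol:  # run "until convergence"
            return V_new
        V = V_new
```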


SLIDE 8

Dynamic Programming Policy Evaluation

V π(s) ← Eπ[rt + γV_{k−1}|st = s]



SLIDE 12

Dynamic Programming Policy Evaluation

V π(s) ← Eπ[rt + γV_{k−1}|st = s]

Bootstrapping: the update for V uses an estimate



SLIDE 14

Policy Evaluation: V π(s) = Eπ[Gt|st = s]

Gt = rt + γrt+1 + γ²rt+2 + γ³rt+3 + ··· in MDP M under policy π

Dynamic Programming: V π(s) ≈ Eπ[rt + γV_{k−1}|st = s]
• Requires a model of MDP M
• Bootstraps future return using a value estimate
• Requires the Markov assumption: bootstrapping regardless of history

What if we don't know the dynamics model P and/or reward model R?

Today: Policy evaluation without a model
• Given data and/or the ability to interact in the environment
• Efficiently compute a good estimate of the value of a policy π

For example: estimate expected total purchases during an online shopping session for a new automated product recommendation policy.

SLIDE 15

This Lecture Overview: Policy Evaluation

• Dynamic Programming
• Evaluating the quality of an estimator
• Monte Carlo policy evaluation
• Policy evaluation when we don't know the dynamics and/or reward model
  • Given on-policy samples
• Temporal Difference (TD)
• Metrics to evaluate and compare algorithms


SLIDE 16

Monte Carlo (MC) Policy Evaluation

Gt = rt + γrt+1 + γ²rt+2 + γ³rt+3 + ··· in MDP M under policy π
V π(s) = E_{T∼π}[Gt|st = s]

• Expectation over trajectories T generated by following π
• Simple idea: Value = mean return
• If trajectories are all finite, sample a set of trajectories & average returns

SLIDE 17

Monte Carlo (MC) Policy Evaluation

• If trajectories are all finite, sample a set of trajectories & average returns
• Does not require MDP dynamics/rewards
• No bootstrapping
• Does not assume state is Markov
• Can only be applied to episodic MDPs
  • Averaging over returns from a complete episode
  • Requires each episode to terminate

SLIDE 18

Monte Carlo (MC) On Policy Evaluation

Aim: estimate V π(s) given episodes generated under policy π

s1, a1, r1, s2, a2, r2, . . . where the actions are sampled from π

• Gt = rt + γrt+1 + γ²rt+2 + γ³rt+3 + ··· in MDP M under policy π
• V π(s) = Eπ[Gt|st = s]
• MC computes the empirical mean return
• Often done in an incremental fashion: after each episode, update the estimate of V π


SLIDE 19

First-Visit Monte Carlo (MC) On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S
Loop:
• Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
• Define Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· + γ^{Ti−1}ri,Ti as the return from time step t onwards in the ith episode
• For each state s visited in episode i:
  • For the first time t that state s is visited in episode i:
    • Increment counter of total first visits: N(s) = N(s) + 1
    • Increment total return: G(s) = G(s) + Gi,t
    • Update estimate: V π(s) = G(s)/N(s)
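A minimal Python sketch of this procedure (the episode format and function name are assumptions; the same code also covers the every-visit variant of slide 22 via a flag):

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma, first_visit=True):
    """Tabular first-visit (or every-visit) MC policy evaluation sketch.
    episodes: list of trajectories, each a list of (s, a, r) triples
    sampled by following pi."""
    N = defaultdict(int)        # visit counts N(s)
    G_sum = defaultdict(float)  # cumulative returns G(s)
    V = {}
    for episode in episodes:
        # G_t for every t, computed backwards: G_t = r_t + gamma * G_{t+1}
        returns, G = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if first_visit and s in seen:
                continue  # first-visit: only the first occurrence of s counts
            seen.add(s)
            N[s] += 1
            G_sum[s] += returns[t]
            V[s] = G_sum[s] / N[s]
    return V
```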


SLIDE 20

Bias, Variance and MSE

• Consider a statistical model that is parameterized by θ and that determines a probability distribution over observed data, P(x|θ)
• Consider a statistic θ̂ that provides an estimate of θ and is a function of observed data x
  • E.g., for a Gaussian distribution with known variance, the average of a set of i.i.d. data points is an estimate of the mean of the Gaussian
• Definition: the bias of an estimator θ̂ is: Bias_θ(θ̂) = E_{x|θ}[θ̂] − θ
• Definition: the variance of an estimator θ̂ is: Var(θ̂) = E_{x|θ}[(θ̂ − E[θ̂])²]
• Definition: the mean squared error (MSE) of an estimator θ̂ is: MSE(θ̂) = Var(θ̂) + Bias_θ(θ̂)²
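The MSE expression above is the standard bias-variance decomposition; a short derivation sketch in LaTeX, assuming the usual definition MSE(θ̂) = E[(θ̂ − θ)²]:

```latex
\begin{align*}
\mathrm{MSE}(\hat{\theta})
  &= \mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2\big] \\
  &= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}]
      + \mathbb{E}[\hat{\theta}] - \theta)^2\big] \\
  % cross term vanishes because E[ \hat{theta} - E[\hat{theta}] ] = 0
  &= \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\mathrm{Var}(\hat{\theta})}
   + \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\mathrm{Bias}_{\theta}(\hat{\theta})^2}
\end{align*}
```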


SLIDE 21

First-Visit Monte Carlo (MC) On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S
Loop:
• Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
• Define Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· + γ^{Ti−1}ri,Ti as the return from time step t onwards in the ith episode
• For each state s visited in episode i:
  • For the first time t that state s is visited in episode i:
    • Increment counter of total first visits: N(s) = N(s) + 1
    • Increment total return: G(s) = G(s) + Gi,t
    • Update estimate: V π(s) = G(s)/N(s)

Properties:
• The V π estimator is an unbiased estimator of the true Eπ[Gt|st = s]
• By the law of large numbers, as N(s) → ∞, V π(s) → Eπ[Gt|st = s]


SLIDE 22

Every-Visit Monte Carlo (MC) On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S
Loop:
• Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
• Define Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· + γ^{Ti−1}ri,Ti as the return from time step t onwards in the ith episode
• For each state s visited in episode i:
  • For every time t that state s is visited in episode i:
    • Increment counter of total visits: N(s) = N(s) + 1
    • Increment total return: G(s) = G(s) + Gi,t
    • Update estimate: V π(s) = G(s)/N(s)


SLIDE 23

Every-Visit Monte Carlo (MC) On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S
Loop:
• Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
• Define Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· + γ^{Ti−1}ri,Ti as the return from time step t onwards in the ith episode
• For each state s visited in episode i:
  • For every time t that state s is visited in episode i:
    • Increment counter of total visits: N(s) = N(s) + 1
    • Increment total return: G(s) = G(s) + Gi,t
    • Update estimate: V π(s) = G(s)/N(s)

Properties:
• The every-visit MC estimator of V π is a biased estimator of V π
• But it is a consistent estimator and often has better MSE


SLIDE 24

Worked Example First Visit MC On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S
Loop:
• Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
• Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· + γ^{Ti−1}ri,Ti
• For each state s visited in episode i:
  • For the first time t that state s is visited in episode i:
    • N(s) = N(s) + 1, G(s) = G(s) + Gi,t
    • Update estimate V π(s) = G(s)/N(s)

Mars rover: R = [1 0 0 0 0 0 +10] for any action; π(s) = a1 ∀s, γ = 1; any action from s1 and s7 terminates the episode.
Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal)


SLIDE 25

Worked Example MC On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S
Loop:
• Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
• Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· + γ^{Ti−1}ri,Ti
• For each state s visited in episode i:
  • For the first or every time t that state s is visited in episode i:
    • N(s) = N(s) + 1, G(s) = G(s) + Gi,t
    • Update estimate V π(s) = G(s)/N(s)

Mars rover: R = [1 0 0 0 0 0 +10] for any action.
Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal)
• Let γ = 1. What is the first-visit MC estimate of V for each state?
• Now let γ = 0.9. Compare the first-visit & every-visit MC estimates of s2.
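A quick Python check of the γ = 0.9 question (a sketch; encoding the trajectory as per-step state/reward lists is an assumption):

```python
gamma = 0.9
# Trajectory (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal)
states = ["s3", "s2", "s2", "s1"]
rewards = [0, 0, 0, 1]

# Returns G_t computed backwards: G_t = r_t + gamma * G_{t+1}
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)                   # [0.729, 0.81, 0.9, 1.0]

first_visit = returns[states.index("s2")]  # G at t = 1
every_visit = sum(g for s, g in zip(states, returns) if s == "s2") / 2
print(first_visit, every_visit)            # 0.81 vs (0.81 + 0.9) / 2 = 0.855
```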


SLIDE 26

Incremental Monte Carlo (MC) On Policy Evaluation

After each episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . .
• Define Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· as the return from time step t onwards in the ith episode
• For state s visited at time step t in episode i:
  • Increment counter of total first visits: N(s) = N(s) + 1
  • Update estimate:
    V π(s) = V π(s) · (N(s) − 1)/N(s) + Gi,t/N(s) = V π(s) + (1/N(s)) · (Gi,t − V π(s))
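A one-function Python sketch of this update (the names are assumptions); passing alpha=None recovers the running-mean 1/N(s) step size:

```python
def incremental_mc_update(V, N, s, G_t, alpha=None):
    """One incremental MC update. With the default step 1/N(s) this is
    exactly the running mean of returns; a constant alpha weights recent
    returns more, which can help in non-stationary domains."""
    N[s] = N.get(s, 0) + 1
    step = 1.0 / N[s] if alpha is None else alpha
    v = V.get(s, 0.0)
    V[s] = v + step * (G_t - v)
```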


SLIDE 27

Check Your Understanding: Piazza Poll Incremental MC

First or Every Visit MC:
• Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
• Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· + γ^{Ti−1}ri,Ti
• For all s, for the first or every time t that state s is visited in episode i:
  • N(s) = N(s) + 1, G(s) = G(s) + Gi,t
  • Update estimate V π(s) = G(s)/N(s)

Incremental MC:
• Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti
• Gi,t = ri,t + γri,t+1 + γ²ri,t+2 + ··· + γ^{Ti−1}ri,Ti
• For t = 1 : Ti: V π(si,t) = V π(si,t) + α(Gi,t − V π(si,t))

Select all that are true:
1. Incremental MC with α = 1 is the same as first-visit MC
2. Incremental MC with α = 1/N(s) is the same as first-visit MC
3. Incremental MC with α = 1/N(s) is the same as every-visit MC
4. Incremental MC with α > 1/N(s) could be helpful in non-stationary domains



SLIDE 29

MC Policy Evaluation

V π(s) = V π(s) + α(Gi,t − V π(s))



SLIDE 31

Monte Carlo (MC) Policy Evaluation Key Limitations

• Generally a high-variance estimator
  • Reducing variance can require a lot of data
  • In cases where data is very hard or expensive to acquire, or the stakes are high, MC may be impractical
• Requires episodic settings
  • Episode must end before data from the episode can be used to update V


SLIDE 32

Monte Carlo (MC) Policy Evaluation Summary

Aim: estimate V π(s) given episodes generated under policy π

s1, a1, r1, s2, a2, r2, . . . where the actions are sampled from π

• Gt = rt + γrt+1 + γ²rt+2 + γ³rt+3 + ··· under policy π
• V π(s) = Eπ[Gt|st = s]
• Simple: estimates the expectation by an empirical average (given episodes sampled from the policy of interest)
• Updates the V estimate using a sample of the return to approximate the expectation
• No bootstrapping
• Does not assume a Markov process
• Converges to the true value under some (generally mild) assumptions


SLIDE 33

This Lecture: Policy Evaluation

• Estimating the expected return of a particular policy if we don't have access to true MDP models
  • Dynamic programming
  • Monte Carlo policy evaluation
  • Temporal Difference (TD)
• Policy evaluation when we don't have a model of how the world works
  • Given on-policy samples
• Metrics to evaluate and compare algorithms


SLIDE 34

Temporal Difference Learning

“If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.” – Sutton and Barto 2017

• Combination of Monte Carlo & dynamic programming methods
• Model-free
• Bootstraps and samples
• Can be used in episodic or infinite-horizon non-episodic settings
• Immediately updates the estimate of V after each (s, a, r, s′) tuple


SLIDE 35

Temporal Difference Learning for Estimating V

• Aim: estimate V π(s) given episodes generated under policy π
• Gt = rt + γrt+1 + γ²rt+2 + γ³rt+3 + ··· in MDP M under policy π
• V π(s) = Eπ[Gt|st = s]
• Recall the Bellman operator (if we know the MDP models):
  BπV(s) = r(s, π(s)) + γ Σ_{s′∈S} p(s′|s, π(s)) V(s′)
• In incremental every-visit MC, update the estimate using one sample of the return (for the current ith episode):
  V π(s) = V π(s) + α(Gi,t − V π(s))
• Insight: we have an estimate of V π; use it to estimate the expected return:
  V π(s) = V π(s) + α([rt + γV π(st+1)] − V π(s))


SLIDE 36

Temporal Difference [TD(0)] Learning

Aim: estimate V π(s) given episodes generated under policy π

s1, a1, r1, s2, a2, r2, . . . where the actions are sampled from π

• Simplest TD learning: update the value towards the estimated value:
  V π(st) = V π(st) + α([rt + γV π(st+1)] − V π(st)), where [rt + γV π(st+1)] is the TD target
• TD error: δt = rt + γV π(st+1) − V π(st)
• Can immediately update the value estimate after each (s, a, r, s′) tuple
• Don't need an episodic setting


SLIDE 37

Temporal Difference [TD(0)] Learning Algorithm

Input: α
Initialize V π(s) = 0, ∀s ∈ S
Loop:
• Sample tuple (st, at, rt, st+1)
• V π(st) = V π(st) + α([rt + γV π(st+1)] − V π(st)), where [rt + γV π(st+1)] is the TD target
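A tabular Python sketch of this loop (the tuple format, terminal handling and function name are assumptions):

```python
def td0_policy_evaluation(transitions, alpha, gamma, terminal="terminal"):
    """Tabular TD(0). `transitions` is a stream of (s, a, r, s_next)
    tuples generated by following pi; the terminal marker's value is 0."""
    V = {}
    for (s, a, r, s_next) in transitions:
        v_next = 0.0 if s_next == terminal else V.get(s_next, 0.0)
        td_target = r + gamma * v_next            # [r_t + gamma V(s_{t+1})]
        td_error = td_target - V.get(s, 0.0)      # delta_t
        V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```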


SLIDE 38

Worked Example TD Learning

Input: α
Initialize V π(s) = 0, ∀s ∈ S
Loop:
• Sample tuple (st, at, rt, st+1)
• V π(st) = V π(st) + α([rt + γV π(st+1)] − V π(st)), where [rt + γV π(st+1)] is the TD target

Example:
• Mars rover: R = [1 0 0 0 0 0 +10] for any action; π(s) = a1 ∀s, γ = 1; any action from s1 and s7 terminates the episode
• Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal)
• First-visit MC estimate of V of each state? [1 1 1 0 0 0 0]
• TD estimate of all states (initialized at 0) with α = 1?
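Running the td0_policy_evaluation sketch from slide 37 on this trajectory (α = 1, γ = 1) reproduces the answer given later on slide 46:

```python
transitions = [("s3", "a1", 0, "s2"), ("s2", "a1", 0, "s2"),
               ("s2", "a1", 0, "s1"), ("s1", "a1", 1, "terminal")]
V = td0_policy_evaluation(transitions, alpha=1.0, gamma=1.0)
print(V)  # {'s3': 0.0, 's2': 0.0, 's1': 1.0}, i.e. [1 0 0 0 0 0 0]
```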


SLIDE 39

Check Your Understanding: Piazza Poll Temporal Difference [TD(0)] Learning Algorithm

Input: α
Initialize V π(s) = 0, ∀s ∈ S
Loop:
• Sample tuple (st, at, rt, st+1)
• V π(st) = V π(st) + α([rt + γV π(st+1)] − V π(st)), where [rt + γV π(st+1)] is the TD target

Select all that are true:
1. If α = 0, TD will value recent experience more
2. If α = 1, TD will value recent experience exclusively
3. If α = 1, in MDPs where the policy goes through states with multiple possible next states, V may always oscillate
4. There exist deterministic MDPs where α = 1 TD will converge



SLIDE 41

Temporal Difference Policy Evaluation

V π(st) = V π(st) + α([rt + γV π(st+1)] − V π(st))


SLIDE 42

This Lecture: Policy Evaluation

• Estimating the expected return of a particular policy if we don't have access to true MDP models
  • Dynamic programming
  • Monte Carlo policy evaluation
  • Temporal Difference (TD)
• Policy evaluation when we don't have a model of how the world works
  • Given on-policy samples
  • Given off-policy samples
• Metrics to evaluate and compare algorithms


SLIDE 43

Check Your Understanding: Properties of Algorithms for Evaluation

| Property | DP | MC | TD |
|---|---|---|---|
| Usable w/ no models of domain | | | |
| Handles continuing (non-episodic) setting | | | |
| Assumes Markov process | | | |
| Converges to true value in limit¹ | | | |
| Unbiased estimate of value | | | |

DP = Dynamic Programming, MC = Monte Carlo, TD = Temporal Difference

¹For tabular representations of the value function. More on this in later lectures.

SLIDE 44

Some Important Properties to Evaluate Model-free Policy Evaluation Algorithms

• Bias/variance characteristics
• Data efficiency
• Computational efficiency


SLIDE 45

Bias/Variance of Model-free Policy Evaluation Algorithms

• Return Gt is an unbiased estimate of V π(st)
• TD target [rt + γV π(st+1)] is a biased estimate of V π(st)
  • But often much lower variance than a single return Gt
  • The return is a function of a multi-step sequence of random actions, states & rewards
  • The TD target has only one random action, reward and next state
• MC
  • Unbiased (for first-visit)
  • High variance
  • Consistent (converges to the true value) even with function approximation
• TD
  • Some bias
  • Lower variance
  • TD(0) converges to the true value with a tabular representation
  • TD(0) does not always converge with function approximation

SLIDE 46

!" !# !$ !% !& !' !(

) !" = +1 ) !# = 0 ) !$ = 0 ) !% = 0 ) !& = 0 ) !' = 0 ) !( = +10

./01/!123 .2456 7214 89/: .2456 7214

Mars rover: R = [ 1 0 0 0 0 0 +10] for any action π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal) First visit MC estimate of V of each state? [1 1 1 0 0 0 0] TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0] TD(0) only uses a data point (s, a, r, s′) once Monte Carlo takes entire return from s to end of episode


SLIDE 47

Batch MC and TD

Batch (offline) solution for a finite dataset:
• Given a set of K episodes
• Repeatedly sample an episode from the K
• Apply MC or TD(0) to the sampled episode

What do MC and TD(0) converge to?


SLIDE 48

AB Example: (Ex. 6.4, Sutton & Barto, 2018)

Two states A, B with γ = 1. Given 8 episodes of experience:
• A, 0, B, 0
• B, 1 (observed 6 times)
• B, 0

Imagine running TD updates over the data an infinite number of times.
V(B) = 0.75 by TD or MC (first visit or every visit)


SLIDE 49

AB Example: (Ex. 6.4, Sutton & Barto, 2018)

TD update: V π(st) = V π(st) + α([rt + γV π(st+1)] − V π(st)), where [rt + γV π(st+1)] is the TD target

Two states A, B with γ = 1. Given 8 episodes of experience:
• A, 0, B, 0
• B, 1 (observed 6 times)
• B, 0

Imagine running TD updates over the data an infinite number of times.
V(B) = 0.75 by TD or MC. What about V(A)?


SLIDE 50

Batch MC and TD: Convergence

Monte Carlo in the batch setting converges to the values with minimum MSE (mean squared error):
• Minimizes loss with respect to the observed returns
• In the AB example, V(A) = 0

TD(0) converges to the DP policy value V π for the MDP with the maximum likelihood model estimates:

P̂(s′|s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Lk−1} 𝟙(sk,t = s, ak,t = a, sk,t+1 = s′)

r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Lk−1} 𝟙(sk,t = s, ak,t = a) rt,k

• Compute V π using this model
• In the AB example, V(A) = 0.75
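A small Python sketch checking the AB numbers (the episode encoding and variable names are assumptions; γ = 1 and a single dummy action):

```python
from collections import defaultdict

# Slide-48 data as (state, reward) pairs per episode
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

N = defaultdict(int)          # N(s)
r_sum = defaultdict(float)    # sum of rewards observed in s
n_AB = 0                      # count of A -> B transitions
for ep in episodes:
    for t, (s, r) in enumerate(ep):
        N[s] += 1
        r_sum[s] += r
        if s == "A" and t + 1 < len(ep) and ep[t + 1][0] == "B":
            n_AB += 1

r_hat = {s: r_sum[s] / N[s] for s in N}   # r_hat(B) = 6/8 = 0.75
p_AB = n_AB / N["A"]                      # = 1.0 in this data

V_B = r_hat["B"]                          # B is followed by termination
V_A = r_hat["A"] + p_AB * V_B             # 0 + 1.0 * 0.75 = 0.75
print(V_A, V_B)                           # 0.75 0.75
```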


SLIDE 51

Some Important Properties to Evaluate Model-free Policy Evaluation Algorithms

Data efficiency & computational efficiency:
• In simplest TD, use (s, a, r, s′) once to update V(s)
  • O(1) operation per update
  • In an episode of length L, O(L)
• In MC, have to wait until the episode finishes, then also O(L)
• MC can be more data efficient than simple TD
• But TD exploits the Markov structure
  • If in a Markov domain, leveraging this is helpful


SLIDE 52

Alternative: Certainty Equivalence V π MLE MDP Model Estimates

Model-based option for policy evaluation without true models.
After each (s, a, r, s′) tuple:
• Recompute the maximum likelihood MDP model for (s, a):

P̂(s′|s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Lk−1} 𝟙(sk,t = s, ak,t = a, sk,t+1 = s′)

r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Lk−1} 𝟙(sk,t = s, ak,t = a) rt,k

• Compute V π using the MLE MDP² (e.g., see method from Lecture 2)

²Requires initializing for all (s, a) pairs.

SLIDE 53

Alternative: Certainty Equivalence V π MLE MDP Model Estimates

Model-based option for policy evaluation without true models.
After each (s, a, r, s′) tuple:
• Recompute the maximum likelihood MDP model for (s, a):

P̂(s′|s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Lk−1} 𝟙(sk,t = s, ak,t = a, sk,t+1 = s′)

r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Lk−1} 𝟙(sk,t = s, ak,t = a) rt,k

• Compute V π using the MLE MDP

Cost: updating the MLE model and MDP planning at each update (O(|S|³) for the analytic matrix solution, O(|S|²|A|) for iterative methods)
• Very data efficient and very computationally expensive
• Consistent
• Can also easily be used for off-policy evaluation

SLIDE 54

!" !# !$ !% !& !' !(

) !" = +1 ) !# = 0 ) !$ = 0 ) !% = 0 ) !& = 0 ) !' = 0 ) !( = +10

./01/!123 .2456 7214 89/: .2456 7214

Mars rover: R = [ 1 0 0 0 0 0 +10] for any action π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal) First visit MC estimate of V of each state? [1 1 1 0 0 0 0] Every visit MC estimate of V of s2? 1 TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0] What is the certainty equivalent estimate?
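The certainty-equivalent question can be checked by building the MLE model from the single trajectory and solving V = r̂ + P̂V; a Python sketch (γ = 1; states never visited keep their initialization of 0, which relies on the footnote on slide 52):

```python
import numpy as np

# Trajectory as (state, reward, next_state); None marks termination
traj = [("s3", 0, "s2"), ("s2", 0, "s2"), ("s2", 0, "s1"), ("s1", 1, None)]

states = ["s1", "s2", "s3"]
idx = {s: i for i, s in enumerate(states)}
N = np.zeros(3)
P = np.zeros((3, 3))   # MLE transitions among visited, non-terminal states
r = np.zeros(3)        # MLE rewards
for s, rew, s_next in traj:
    N[idx[s]] += 1
    r[idx[s]] += rew
    if s_next is not None:
        P[idx[s], idx[s_next]] += 1
P /= N[:, None]
r /= N

V = np.linalg.solve(np.eye(3) - P, r)   # solve V = r + P V (gamma = 1)
print(dict(zip(states, V)))  # s1, s2, s3 all 1.0, i.e. [1 1 1 0 0 0 0]
```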


SLIDE 55

Summary: Policy Evaluation

Estimating the expected return of a particular policy if we don't have access to true MDP models (e.g., evaluating average purchases per session for a new product recommendation system):
• Dynamic Programming
• Monte Carlo policy evaluation
• Temporal Difference (TD)

Policy evaluation when we don't have a model of how the world works:
• Given on-policy samples
• Given off-policy samples

Metrics to evaluate and compare algorithms:
• Robustness to Markov assumption
• Bias/variance characteristics
• Data efficiency
• Computational efficiency


SLIDE 56

Today’s Plan

Last Time:
• Markov reward / decision processes
• Policy evaluation & control when we have a true model (of how the world works)

Today:
• Policy evaluation without known dynamics & reward models

Next Time:
• Control when we don't have a model of how the world works
