  1. Lecture 3: Model-Free Policy Evaluation: Policy Evaluation Without Knowing How the World Works¹
     Emma Brunskill, CS234 Reinforcement Learning, Winter 2020
     ¹ Material builds on structure from David Silver’s Lecture 4: Model-Free Prediction. Other resources: Sutton and Barto, Jan 1 2018 draft, Chapters/Sections 5.1, 5.5, 6.1-6.3.

  2. Refresh Your Knowledge 2 [Piazza Poll]
     What is the max number of iterations of policy iteration in a tabular MDP?
       1. |A|·|S|
       2. |S|^|A|
       3. |A|^|S|
       4. Unbounded
       5. Not sure
     In a tabular MDP, asymptotically value iteration will always yield a policy with the same value as the policy returned by policy iteration.
       1. True
       2. False
       3. Not sure
     Can value iteration require more iterations than |A|^|S| to compute the optimal value function? (Assume |A| and |S| are small enough that each round of value iteration can be done exactly.)
       1. True
       2. False
       3. Not sure

  3. Refresh Your Knowledge 2
     What is the max number of iterations of policy iteration in a tabular MDP?
     Can value iteration require more iterations than |A|^|S| to compute the optimal value function? (Assume |A| and |S| are small enough that each round of value iteration can be done exactly.)
     In a tabular MDP, asymptotically value iteration will always yield a policy with the same value as the policy returned by policy iteration.

  4. Today’s Plan
     Last time: Markov reward / decision processes; policy evaluation & control when we have a true model (of how the world works)
     Today: Policy evaluation without known dynamics & reward models
     Next time: Control when we don’t have a model of how the world works

  5. This Lecture: Policy Evaluation
     Estimating the expected return of a particular policy if we don’t have access to the true MDP model
     - Dynamic programming
     - Monte Carlo policy evaluation
       - Policy evaluation when we don’t have a model of how the world works
       - Given on-policy samples
     - Temporal Difference (TD)
     - Metrics to evaluate and compare algorithms

  6. Recall
     Definition of Return, G_t (for an MRP): discounted sum of rewards from time step t to horizon
       G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ...
     Definition of State Value Function, V^π(s): expected return from starting in state s under policy π
       V^π(s) = E_π[G_t | s_t = s] = E_π[r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ... | s_t = s]
     Definition of State-Action Value Function, Q^π(s, a): expected return from starting in state s, taking action a, and then following policy π
       Q^π(s, a) = E_π[G_t | s_t = s, a_t = a] = E_π[r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ... | s_t = s, a_t = a]
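
To make the return concrete, here is a minimal sketch of computing a discounted return from a finite reward sequence; the function name and example rewards are illustrative, not from the slides.

```python
# Minimal sketch: compute G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# for a finite list of rewards. Names and example values are illustrative.
def discounted_return(rewards, gamma):
    g = 0.0
    # Work backwards: at each step, G = r + gamma * (return from the next step on)
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards observed from time t onward
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```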

  7. Dynamic Programming for Evaluating the Value of Policy π
     Initialize V^π_0(s) = 0 for all s
     For k = 1 until convergence:
       For all s in S:
         V^π_k(s) = r(s, π(s)) + γ Σ_{s' ∈ S} p(s' | s, π(s)) V^π_{k-1}(s')
     V^π_k(s) is the exact k-horizon value of state s under policy π
     V^π_k(s) is an estimate of the infinite-horizon value of state s under policy π
       V^π(s) = E_π[G_t | s_t = s] ≈ E_π[r_t + γ V_{k-1} | s_t = s]
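
As a concrete illustration of the iterative update above, here is a minimal sketch of tabular dynamic programming policy evaluation. The array layout (P[s, a, s'] = p(s'|s,a), R[s, a] = r(s,a)), the deterministic policy representation, the function name, and the stopping tolerance are assumptions, not from the slides.

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation:
        V_k(s) = r(s, pi(s)) + gamma * sum_{s'} p(s'|s, pi(s)) * V_{k-1}(s').
    Assumed layout: P[s, a, s'] = p(s'|s,a), R[s, a] = r(s,a),
    policy[s] = action chosen by the deterministic policy pi in state s.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                      # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            V_new[s] = R[s, a] + gamma * P[s, a] @ V   # Bellman backup under pi
        if np.max(np.abs(V_new - V)) < tol:     # stop once the estimate has converged
            return V_new
        V = V_new
```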

  8. Dynamic Programming Policy Evaluation
     V^π(s) ← E_π[r_t + γ V_{k-1} | s_t = s]

  12. Dynamic Programming Policy Evaluation
      V^π(s) ← E_π[r_t + γ V_{k-1} | s_t = s]
      Bootstrapping: the update for V uses an estimate (V_{k-1})

  14. Policy Evaluation: V^π(s) = E_π[G_t | s_t = s]
      G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ... in MDP M under policy π
      Dynamic programming: V^π(s) ≈ E_π[r_t + γ V_{k-1} | s_t = s]
      - Requires a model of the MDP M
      - Bootstraps the future return using a value estimate
      - Requires the Markov assumption: bootstrapping regardless of history
      What if we don’t know the dynamics model P and/or the reward model R?
      Today: Policy evaluation without a model
      - Given data and/or the ability to interact in the environment
      - Efficiently compute a good estimate of the value of a policy π
      - For example: estimate expected total purchases during an online shopping session for a new automated product recommendation policy

  15. This Lecture Overview: Policy Evaluation
      - Dynamic programming
      - Evaluating the quality of an estimator
      - Monte Carlo policy evaluation
        - Policy evaluation when we don’t know the dynamics and/or reward model
        - Given on-policy samples
      - Temporal Difference (TD)
      - Metrics to evaluate and compare algorithms

  16. Monte Carlo (MC) Policy Evaluation
      G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ... in MDP M under policy π
      V^π(s) = E_{T∼π}[G_t | s_t = s]
      Expectation over trajectories T generated by following π
      Simple idea: value = mean return
      If trajectories are all finite, sample a set of trajectories & average returns

  17. Monte Carlo (MC) Policy Evaluation
      If trajectories are all finite, sample a set of trajectories & average returns
      - Does not require MDP dynamics/rewards
      - No bootstrapping
      - Does not assume state is Markov
      - Can only be applied to episodic MDPs
        - Averaging over returns from a complete episode
        - Requires each episode to terminate

  18. Monte Carlo (MC) On-Policy Evaluation
      Aim: estimate V^π(s) given episodes generated under policy π
        s_1, a_1, r_1, s_2, a_2, r_2, ... where the actions are sampled from π
      G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ... in MDP M under policy π
      V^π(s) = E_π[G_t | s_t = s]
      MC computes the empirical mean return
      Often this is done in an incremental fashion: after each episode, update the estimate of V^π (see the sketch below)
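
The incremental update mentioned above is typically a running mean: after observing a return G for state s, set N(s) ← N(s) + 1 and then V^π(s) ← V^π(s) + (1/N(s)) (G − V^π(s)). A minimal sketch, assuming dictionary-based storage and illustrative names (not from the slides):

```python
# Minimal sketch of the incremental (running-mean) update for V^pi.
# V and N are dictionaries mapping states to the current estimate / visit count.
def incremental_update(V, N, s, G):
    N[s] = N.get(s, 0) + 1
    v = V.get(s, 0.0)
    # With step size 1/N(s) this reproduces the empirical mean of observed returns;
    # a fixed step size alpha would instead weight recent returns more heavily.
    V[s] = v + (G - v) / N[s]
```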

  19. First-Visit Monte Carlo (MC) On-Policy Evaluation
      Initialize N(s) = 0, G(s) = 0 for all s ∈ S
      Loop:
        Sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}
        Define G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + ... + γ^{T_i - t} r_{i,T_i} as the return from time step t onwards in the i-th episode
        For each state s visited in episode i:
          For the first time t that state s is visited in episode i:
            Increment the counter of total first visits: N(s) = N(s) + 1
            Increment the total return: G(s) = G(s) + G_{i,t}
            Update the estimate: V^π(s) = G(s) / N(s)
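
A minimal sketch of the first-visit algorithm above, assuming each episode is given as a list of (state, action, reward) tuples collected by following π; the episode format and function name are assumptions, not from the slides.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-Visit Monte Carlo on-policy evaluation.
    episodes: iterable of episodes, each a list of (state, action, reward) tuples
              generated by following the policy pi being evaluated.
    Returns a dict V with V[s] = G(s) / N(s).
    """
    N = defaultdict(int)          # number of first visits to each state
    G_total = defaultdict(float)  # sum of first-visit returns for each state
    for episode in episodes:
        # Compute the return from time step t onwards, for every t in the episode
        returns = []
        g = 0.0
        for (_, _, r) in reversed(episode):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        # Update counters only at the first visit to each state in this episode
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                G_total[s] += returns[t]
    return {s: G_total[s] / N[s] for s in N}
```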
