Lecture 4: Model Free Control


  1. Lecture 4: Model Free Control. Emma Brunskill, CS234 Reinforcement Learning, Winter 2020. Structure closely follows much of David Silver's Lecture 5. For additional reading please see SB Sections 5.2-5.4, 6.4, 6.5, 6.7.

  2. Refresh Your Knowledge 3. Piazza Poll
     Which of the following equations express a TD update?
     1. V(s_t) = r(s_t, a_t) + γ Σ_{s'} p(s' | s_t, a_t) V(s')
     2. V(s_t) = (1 − α) V(s_t) + α (r(s_t, a_t) + γ V(s_{t+1}))
     3. V(s_t) = (1 − α) V(s_t) + α Σ_{i=t}^{H} r(s_i, a_i)
     4. V(s_t) = (1 − α) V(s_t) + α max_a (r(s_t, a) + γ V(s_{t+1}))
     5. Not sure
     Bootstrapping is when:
     1. Samples of (s, a, s') transitions are used to approximate the true expectation over next states
     2. An estimate of the next state value is used instead of the true next state value
     3. Used in Monte-Carlo policy evaluation
     4. Not sure

  3. Refresh Your Knowledge 3. Piazza Poll
     Which of the following equations express a TD update? True: V(s_t) = (1 − α) V(s_t) + α (r(s_t, a_t) + γ V(s_{t+1}))
     Bootstrapping is when an estimate of the next state value is used instead of the true next state value.
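
As a quick reference, a one-line sketch of that TD(0) update in code; the table V and the argument names are illustrative assumptions, not the course's code:

```python
def td0_update(V, s_t, r_t, s_next, alpha, gamma):
    """In-place TD(0) update: V(s_t) <- (1 - alpha) V(s_t) + alpha (r_t + gamma V(s_next))."""
    V[s_t] = (1 - alpha) * V[s_t] + alpha * (r_t + gamma * V[s_next])
    return V
```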

  4. Table of Contents: 1. Generalized Policy Iteration  2. Importance of Exploration  3. Maximization Bias

  5. Class Structure Last time: Policy evaluation with no knowledge of how the world works (MDP model not given) This time: Control (making decisions) without a model of how the world works Next time: Generalization – Value function approximation

  6. Evaluation to Control Last time: how good is a specific policy? Given no access to the decision process model parameters, instead have to estimate from data / experience Today: how can we learn a good policy?

  7. Recall: Reinforcement Learning Involves: Optimization, Delayed consequences, Exploration, Generalization

  8. Today: Learning to Control Involves Optimization: Goal is to identify a policy with high expected rewards (similar to Lecture 2 on computing an optimal policy given decision process models) Delayed consequences: May take many time steps to evaluate whether an earlier decision was good or not Exploration: Necessary to try different actions to learn what actions can lead to high rewards

  9. Today: Model-free Control: Generalized policy improvement; Importance of exploration; Monte Carlo control; Model-free control with temporal difference (SARSA, Q-learning); Maximization bias

  10. Model-free Control Examples Many applications can be modeled as an MDP: Backgammon, Go, Robot locomotion, Helicopter flight, Robocup soccer, Autonomous driving, Customer ad selection, Invasive species management, Patient treatment For many of these and other problems, either: the MDP model is unknown but can be sampled, or the MDP model is known but it is computationally infeasible to use directly, except through sampling

  11. On and Off-Policy Learning On-policy learning: Direct experience; learn to estimate and evaluate a policy from experience obtained from following that policy Off-policy learning: Learn to estimate and evaluate a policy using experience gathered from following a different policy

  12. Table of Contents: 1. Generalized Policy Iteration  2. Importance of Exploration  3. Maximization Bias

  13. Recall Policy Iteration
      Initialize policy π
      Repeat:
        Policy evaluation: compute V^π
        Policy improvement: update π
          π'(s) = arg max_a [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V^π(s') ] = arg max_a Q^π(s, a)
      Now want to do the above two steps without access to the true dynamics and reward models
      Last lecture introduced methods for model-free policy evaluation
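
For contrast with the model-free versions that follow, a rough sketch of this model-based improvement step, assuming hypothetical tabular arrays R[s, a] and P[s, a, s'] (names not from the lecture):

```python
import numpy as np

def greedy_improvement(R, P, V, gamma):
    """One model-based improvement step: pi'(s) = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].
    R: |S| x |A| reward array, P: |S| x |A| x |S| transition array, V: length-|S| value array."""
    Q = R + gamma * P @ V     # Q[s, a] = R[s, a] + gamma * sum over s' of P[s, a, s'] * V[s']
    return Q.argmax(axis=1)   # new deterministic policy: one greedy action index per state
```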

  14. Model Free Policy Iteration Initialize policy π Repeat: Policy evaluation: compute Q^π Policy improvement: update π

  15. MC for On Policy Q Evaluation
      Initialize N(s, a) = 0, G(s, a) = 0, Q^π(s, a) = 0, ∀s ∈ S, ∀a ∈ A
      Loop
        Using policy π sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}
        G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + ··· + γ^{T_i − 1} r_{i,T_i}
        For each state, action pair (s, a) visited in episode i
          For first or every time t that (s, a) is visited in episode i
            N(s, a) = N(s, a) + 1, G(s, a) = G(s, a) + G_{i,t}
            Update estimate Q^π(s, a) = G(s, a) / N(s, a)
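
To make the loop above concrete, a minimal Python sketch of the first-visit variant; the episode representation (a list of (state, action, reward) tuples) and the function names are illustrative assumptions, not the course's code:

```python
from collections import defaultdict

def mc_q_evaluation(sample_episode, num_episodes, gamma):
    """First-visit Monte Carlo estimate of Q^pi from episodes generated by following pi.
    sample_episode() is assumed to return [(s_1, a_1, r_1), ..., (s_T, a_T, r_T)]."""
    N = defaultdict(int)        # visit counts N(s, a)
    G_sum = defaultdict(float)  # cumulative returns G(s, a)
    Q = defaultdict(float)      # value estimates Q(s, a)
    for _ in range(num_episodes):
        episode = sample_episode()
        # Compute returns backwards: G_t = r_t + gamma * G_{t+1}
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:   # first-visit: only the first occurrence in the episode counts
                continue
            seen.add((s, a))
            N[(s, a)] += 1
            G_sum[(s, a)] += returns[t]
            Q[(s, a)] = G_sum[(s, a)] / N[(s, a)]
    return Q
```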

  16. Model-free Generalized Policy Improvement Given an estimate Q^{π_i}(s, a) ∀s, a Update new policy: π_{i+1}(s) = arg max_a Q^{π_i}(s, a)

  17. Model-free Policy Iteration
      Initialize policy π
      Repeat:
        Policy evaluation: compute Q^π
        Policy improvement: update π given Q^π
      May need to modify policy evaluation: if π is deterministic, can't compute Q(s, a) for any a ≠ π(s)
      How to interleave policy evaluation and improvement? Policy improvement is now using an estimated Q

  18. Table of Contents: 1. Generalized Policy Iteration  2. Importance of Exploration  3. Maximization Bias

  19. Policy Evaluation with Exploration Want to compute a model-free estimate of Q^π In general seems subtle: need to try all (s, a) pairs but then follow π Want to ensure resulting estimate Q^π is good enough so that policy improvement is a monotonic operator For certain classes of policies can ensure all (s, a) pairs are tried such that asymptotically Q^π converges to the true value

  20. ε-greedy Policies Simple idea to balance exploration and exploitation Let |A| be the number of actions Then an ε-greedy policy w.r.t. a state-action value Q(s, a) is π(a|s) =

  21. ε-greedy Policies
      Simple idea to balance exploration and exploitation
      Let |A| be the number of actions
      Then an ε-greedy policy w.r.t. a state-action value Q(s, a) is
        π(a|s) = arg max_a Q(s, a),        with prob 1 − ε + ε/|A|
                 a' ≠ arg max_a Q(s, a),   with prob ε/|A|
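
A small sketch of sampling an action from this ε-greedy policy; the tabular representation Q_s (a list of action values for the current state) is an assumption for illustration:

```python
import random

def epsilon_greedy_action(Q_s, epsilon):
    """Sample an action from the epsilon-greedy policy w.r.t. Q(s, .).
    Q_s is the list of action values for the current state.
    With prob 1 - epsilon take the greedy action; otherwise pick uniformly at random,
    so the greedy action has total probability 1 - epsilon + epsilon/|A| and every
    other action has probability epsilon/|A|."""
    if random.random() < 1 - epsilon:
        return max(range(len(Q_s)), key=lambda a: Q_s[a])  # greedy action
    return random.randrange(len(Q_s))                      # uniform random action
```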

  22. For Later Practice: MC for On Policy Q Evaluation
      Initialize N(s, a) = 0, G(s, a) = 0, Q^π(s, a) = 0, ∀s ∈ S, ∀a ∈ A
      Loop
        Using policy π sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}
        G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + ··· + γ^{T_i − 1} r_{i,T_i}
        For each state, action pair (s, a) visited in episode i
          For first or every time t that (s, a) is visited in episode i
            N(s, a) = N(s, a) + 1, G(s, a) = G(s, a) + G_{i,t}
            Update estimate Q^π(s, a) = G(s, a) / N(s, a)
      Mars rover with new actions: r(·, a_1) = [1 0 0 0 0 0 +10], r(·, a_2) = [0 0 0 0 0 0 +5], γ = 1.
      Assume current greedy π(s) = a_1 ∀s, ε = 0.5
      Sample trajectory from ε-greedy policy:
        Trajectory = (s_3, a_1, 0, s_2, a_2, 0, s_3, a_1, 0, s_2, a_2, 0, s_1, a_1, 1, terminal)
      First visit MC estimate of Q of each (s, a) pair?
        Q^{ε-π}(·, a_1) = [1 0 1 0 0 0 0],  Q^{ε-π}(·, a_2) = [0 1 0 0 0 0 0]
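
As a sanity check on the answer above, a short script (the integer state encoding and tuple episode format are my own, hypothetical choices) that reproduces the first-visit MC estimates for the given trajectory with γ = 1:

```python
# States s1..s7 encoded as 1..7, actions as 'a1'/'a2'; gamma = 1.
episode = [(3, 'a1', 0), (2, 'a2', 0), (3, 'a1', 0), (2, 'a2', 0), (1, 'a1', 1)]

# With gamma = 1 the return G_t is just the sum of rewards from step t to the end.
returns = [sum(r for _, _, r in episode[t:]) for t in range(len(episode))]

Q, seen = {}, set()
for t, (s, a, _) in enumerate(episode):
    if (s, a) not in seen:        # first-visit: count only the first occurrence
        seen.add((s, a))
        Q[(s, a)] = returns[t]    # single episode, so the estimate is just G_t

print(Q)  # {(3, 'a1'): 1, (2, 'a2'): 1, (1, 'a1'): 1}, matching the slide's answer
```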

  23. Monotonic ε-greedy Policy Improvement
      Theorem: For any ε-greedy policy π_i, the ε-greedy policy w.r.t. Q^{π_i}, π_{i+1}, is a monotonic improvement: V^{π_{i+1}} ≥ V^{π_i}
      Q^{π_i}(s, π_{i+1}(s)) = Σ_{a ∈ A} π_{i+1}(a|s) Q^{π_i}(s, a)
                             = (ε/|A|) Σ_{a ∈ A} Q^{π_i}(s, a) + (1 − ε) max_a Q^{π_i}(s, a)
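
The slide stops after the second equality; following the standard argument (the deck notes it closely follows David Silver's Lecture 5), a sketch of how the derivation typically continues:

```latex
% Sketch of the remaining step of the monotonic epsilon-greedy improvement argument.
% Since pi_i is epsilon-greedy, pi_i(a|s) >= epsilon/|A|, so the weights below are
% non-negative and sum to one, and the max dominates the weighted average.
\[
\begin{aligned}
Q^{\pi_i}(s, \pi_{i+1}(s))
  &= \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s,a) + (1-\epsilon) \max_{a} Q^{\pi_i}(s,a) \\
  &\ge \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s,a)
     + (1-\epsilon) \sum_{a \in A} \frac{\pi_i(a \mid s) - \frac{\epsilon}{|A|}}{1-\epsilon}\, Q^{\pi_i}(s,a) \\
  &= \sum_{a \in A} \pi_i(a \mid s)\, Q^{\pi_i}(s,a) \;=\; V^{\pi_i}(s).
\end{aligned}
\]
```

The inequality holds because the weights (π_i(a|s) − ε/|A|)/(1 − ε) are non-negative and sum to one, so the max is at least that weighted average; monotonic improvement V^{π_{i+1}} ≥ V^{π_i} then follows from the usual policy improvement argument.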
