
Lecture 12: Fast Reinforcement Learning Part II - PowerPoint PPT Presentation

Emma Brunskill, CS234 Reinforcement Learning, Winter 2018. With many slides from or derived from David Silver.


  1. Lecture 12: Fast Reinforcement Learning Part II. Emma Brunskill, CS234 Reinforcement Learning, Winter 2018. With many slides from or derived from David Silver; worked examples are new.

  2. Class Structure. Last time: Fast Learning, Exploration/Exploitation Part 1. This time: Fast Learning Part II. Next time: Batch RL.

  3. Table of Contents: (1) Metrics for evaluating RL algorithms; (2) Principles for RL Exploration; (3) Probability Matching; (4) Information State Search; (5) MDPs; (6) Principles for RL Exploration; (7) Metrics for evaluating RL algorithms.

  4. Performance Criteria of RL Algorithms: Empirical performance; Convergence (to something ...); Asymptotic convergence to the optimal policy; Finite sample guarantees: probably approximately correct (PAC); Regret (with respect to optimal decisions); Optimal decisions given the information available; PAC uniform.

  5. Table of Contents: (1) Metrics for evaluating RL algorithms; (2) Principles for RL Exploration; (3) Probability Matching; (4) Information State Search; (5) MDPs; (6) Principles for RL Exploration; (7) Metrics for evaluating RL algorithms.

  6. Principles: Naive Exploration (last time); Optimistic Initialization (last time); Optimism in the Face of Uncertainty (last time + this time); Probability Matching (last time + this time); Information State Search (this time).

  7. Multiarmed Bandits. A multi-armed bandit is a tuple (A, R). A is a known set of m actions (arms); R^a(r) = P[r | a] is an unknown probability distribution over rewards. At each step t the agent selects an action a_t ∈ A and the environment generates a reward r_t ∼ R^{a_t}. Goal: maximize the cumulative reward $\sum_{\tau=1}^{t} r_\tau$.
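
A minimal sketch of this interaction in Python, assuming Bernoulli reward distributions (as in the broken-toe example later in the lecture). The class and method names are illustrative, not from the slides:

```python
import numpy as np

class BernoulliBandit:
    """Multi-armed bandit with m arms; arm a pays +1 with (unknown) probability theta[a], else 0."""

    def __init__(self, thetas, seed=0):
        self.thetas = np.asarray(thetas)        # true (hidden) success probabilities, one per arm
        self.rng = np.random.default_rng(seed)

    @property
    def num_arms(self):
        return len(self.thetas)

    def pull(self, a):
        """Select action a; the environment draws r_t ~ R^{a_t} = Bernoulli(theta[a])."""
        return float(self.rng.random() < self.thetas[a])
```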

  8. Regret. The action-value is the mean reward for action a: Q(a) = E[r | a]. The optimal value is V* = Q(a*) = max_{a ∈ A} Q(a). Regret is the opportunity loss for one step: l_t = E[V* − Q(a_t)]. Total regret is the total opportunity loss: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right]$. Maximizing cumulative reward ⇔ minimizing total regret.
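
As a concrete illustration of the definitions above, total regret can be tallied directly when the true action values are known, as they are in a simulation. The helper below is a hypothetical addition, not part of the slides:

```python
import numpy as np

def total_regret(true_means, actions_taken):
    """L_t = sum over steps of (V* - Q(a_tau)), computed from the true action values."""
    q = np.asarray(true_means)
    v_star = q.max()                                 # V* = max_a Q(a)
    gaps = v_star - q[np.asarray(actions_taken)]     # per-step regret l_tau
    return gaps.sum()

# e.g. with Q = (0.95, 0.9, 0.1), pulling arms [0, 1, 2, 0] gives 0 + 0.05 + 0.85 + 0 = 0.9
print(total_regret([0.95, 0.9, 0.1], [0, 1, 2, 0]))
```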

  9. Optimism Under Uncertainty: Upper Confidence Bounds. Estimate an upper confidence bound Û_t(a) for each action value, such that Q(a) ≤ Q̂_t(a) + Û_t(a) with high probability. This depends on the number of times N_t(a) that action a has been selected: small N_t(a) → large Û_t(a) (the estimated value is uncertain); large N_t(a) → small Û_t(a) (the estimated value is accurate). Select the action maximizing the Upper Confidence Bound (UCB): a_t = arg max_{a ∈ A} [ Q̂_t(a) + Û_t(a) ].

  10. UCB1. This leads to the UCB1 algorithm: $a_t = \arg\max_{a \in \mathcal{A}} \left[ Q(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right]$. Theorem: the UCB algorithm achieves logarithmic asymptotic total regret, $\lim_{t \to \infty} L_t \le 8 \log t \sum_{a : \Delta_a > 0} \frac{1}{\Delta_a}$.
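
A minimal sketch of UCB1 in Python, run against the BernoulliBandit class sketched earlier; it initializes by pulling each arm once (as in the worked example that follows) and then selects the arm with the largest upper confidence bound. The function name and structure are illustrative:

```python
import numpy as np

def ucb1(bandit, num_steps):
    """Run UCB1: a_t = argmax_a [ Q(a) + sqrt(2 ln t / N_t(a)) ]."""
    m = bandit.num_arms
    counts = np.zeros(m)       # N_t(a)
    values = np.zeros(m)       # empirical Q(a)
    history = []

    for t in range(1, num_steps + 1):
        if t <= m:
            a = t - 1                                  # initialization: sample each arm once
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)  # exploration bonus U_t(a)
            a = int(np.argmax(values + bonus))
        r = bandit.pull(a)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]       # incremental mean update
        history.append((a, r))
    return values, counts, history
```

For example, ucb1(BernoulliBandit([0.95, 0.9, 0.1]), 1000) should spend most of its pulls on the first two arms and only rarely on the third, since the bonus term shrinks for arms whose estimates are already accurate.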

  11. Toy Example: Ways to Treat Broken Toes. Consider deciding how best to treat patients with broken toes. Imagine there are 3 possible options: (1) surgery, (2) buddy taping the broken toe to another toe, (3) doing nothing. The outcome measure is a binary variable: whether the toe has healed (+1) or not healed (0) after 6 weeks, as assessed by x-ray. (Note: this is a made-up example, not the actual expected efficacies of the various treatment options for a broken toe.)

  12. Toy Example: Ways to Treat Broken Toes. Consider deciding how best to treat patients with broken toes. Imagine there are 3 common options: (1) surgery, (2) a surgical boot, (3) buddy taping the broken toe to another toe. The outcome measure is a binary variable: whether the toe has healed (+1) or not (0) after 6 weeks, as assessed by x-ray. Model this as a multi-armed bandit with 3 arms, where each arm is a Bernoulli variable with an unknown parameter θ_i. Check your understanding: what does a pull of an arm / taking an action correspond to? Why is it reasonable to model this as a multi-armed bandit instead of a Markov decision process? (Note: this is a made-up example, not the actual expected efficacies of the various treatment options for a broken toe.)

  13. Toy Example: Ways to Treat Broken Toes. Imagine the true (unknown) parameters for each arm (action) are: surgery Q(a_1) = θ_1 = 0.95; buddy taping Q(a_2) = θ_2 = 0.9; doing nothing Q(a_3) = θ_3 = 0.1. (Note: this is a made-up example, not the actual expected efficacies of the various treatment options for a broken toe.)
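
In code, this instance maps onto the BernoulliBandit sketch from earlier as follows; the variable names are illustrative, and a "pull" corresponds to treating one patient and observing healed / not healed at 6 weeks (which also answers the check-your-understanding question above):

```python
# Toy instance: arm 0 = surgery, arm 1 = buddy taping, arm 2 = do nothing.
# Assumes the BernoulliBandit class sketched after the bandit definition slide.
toe_bandit = BernoulliBandit([0.95, 0.90, 0.10])

# One pull = treat one patient with the chosen option and observe the binary outcome.
reward = toe_bandit.pull(0)   # r ~ Bernoulli(0.95): 1 if healed at 6 weeks, else 0
```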

  14. Toy Example: Ways to Treat Broken Toes, Thompson Sampling. True (unknown) parameters for each arm (action): surgery Q(a_1) = θ_1 = 0.95; buddy taping Q(a_2) = θ_2 = 0.9; doing nothing Q(a_3) = θ_3 = 0.1. Optimism under uncertainty, UCB1 (Auer, Cesa-Bianchi, Fischer 2002): 1. Sample each arm once. (Note: this is a made-up example, not the actual expected efficacies of the various treatment options for a broken toe.)
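
The slide title names Thompson sampling, the probability-matching strategy from the previous lecture, although the steps listed are the UCB1 ones. For contrast, here is a minimal Beta-Bernoulli Thompson sampling sketch, not taken from the slides, of how that alternative would choose among the three arms:

```python
import numpy as np

def thompson_sampling(bandit, num_steps, seed=0):
    """Beta-Bernoulli Thompson sampling: sample theta_hat ~ Beta(alpha, beta) per arm, pull the argmax."""
    rng = np.random.default_rng(seed)
    m = bandit.num_arms
    alpha = np.ones(m)   # Beta posterior parameters, starting from a uniform Beta(1, 1) prior
    beta = np.ones(m)
    for _ in range(num_steps):
        theta_hat = rng.beta(alpha, beta)   # one posterior sample per arm
        a = int(np.argmax(theta_hat))       # probability matching: pull the sampled best arm
        r = bandit.pull(a)
        alpha[a] += r                       # posterior update: count successes...
        beta[a] += 1 - r                    # ...and failures for the pulled arm
    return alpha, beta
```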

  15. Toy Example: Ways to Treat Broken Toes, Optimism. True (unknown) parameters for each arm (action): surgery Q(a_1) = θ_1 = 0.95; buddy taping Q(a_2) = θ_2 = 0.9; doing nothing Q(a_3) = θ_3 = 0.1. UCB1 (Auer, Cesa-Bianchi, Fischer 2002): 1. Sample each arm once: take action a_1 (r ∼ Bernoulli(0.95)), get +1, Q(a_1) = 1; take action a_2 (r ∼ Bernoulli(0.90)), get +1, Q(a_2) = 1; take action a_3 (r ∼ Bernoulli(0.1)), get 0, Q(a_3) = 0. (Note: this is a made-up example, not the actual expected efficacies of the various treatment options for a broken toe.)

  16. Toy Example: Ways to Treat Broken Toes, Optimism. True (unknown) parameters for each arm (action): surgery Q(a_1) = θ_1 = 0.95; buddy taping Q(a_2) = θ_2 = 0.9; doing nothing Q(a_3) = θ_3 = 0.1. UCB1 (Auer, Cesa-Bianchi, Fischer 2002): 1. Sample each arm once: take action a_1 (r ∼ Bernoulli(0.95)), get +1, Q(a_1) = 1; take action a_2 (r ∼ Bernoulli(0.90)), get +1, Q(a_2) = 1; take action a_3 (r ∼ Bernoulli(0.1)), get 0, Q(a_3) = 0. 2. Set t = 3 and compute the upper confidence bound on each action: $\mathrm{ucb}(a) = Q(a) + \sqrt{\frac{2 \ln t}{N_t(a)}}$. (Note: this is a made-up example, not the actual expected efficacies of the various treatment options for a broken toe.)

  17. Toy Example: Ways to Treat Broken Toes, Optimism. True (unknown) parameters for each arm (action): surgery Q(a_1) = θ_1 = 0.95; buddy taping Q(a_2) = θ_2 = 0.9; doing nothing Q(a_3) = θ_3 = 0.1. UCB1 (Auer, Cesa-Bianchi, Fischer 2002): 1. Sample each arm once: take action a_1 (r ∼ Bernoulli(0.95)), get +1, Q(a_1) = 1; take action a_2 (r ∼ Bernoulli(0.90)), get +1, Q(a_2) = 1; take action a_3 (r ∼ Bernoulli(0.1)), get 0, Q(a_3) = 0. 2. Set t = 3 and compute the upper confidence bound on each action: $\mathrm{ucb}(a) = Q(a) + \sqrt{\frac{2 \ln t}{N_t(a)}}$. 3. With t = 3, select the action a_t = arg max_a ucb(a). 4. Observe the reward. 5. Recompute the upper confidence bound on each action. (Note: this is a made-up example, not the actual expected efficacies of the various treatment options for a broken toe.)
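
A quick check of the numbers at this point in the walkthrough (assuming ties are broken arbitrarily): with t = 3 and N_t(a) = 1 for every arm, the exploration bonus is √(2 ln 3) ≈ 1.48, so ucb(a_1) = ucb(a_2) ≈ 2.48 and ucb(a_3) ≈ 1.48, and UCB1 pulls a_1 or a_2 next. In code:

```python
import numpy as np

q = np.array([1.0, 1.0, 0.0])       # empirical Q(a) after one pull of each arm
n = np.array([1, 1, 1])             # N_t(a)
t = 3
ucb = q + np.sqrt(2 * np.log(t) / n)
print(ucb)                          # ≈ [2.48, 2.48, 1.48]
print(int(np.argmax(ucb)))          # 0 (argmax breaks the a_1 / a_2 tie toward the first arm)
```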
