

  1. Lecture 11: Fast Reinforcement Learning. Emma Brunskill, CS234 Reinforcement Learning, Winter 2019. With many slides from or derived from David Silver.

  2. Class Structure. Last time: Midterm. This time: Fast Learning. Next time: Fast Learning.

  3. Up Till Now. Discussed optimization, generalization, and delayed consequences.

  4. Teach Computers to Help Us.

  5. Computational Efficiency and Sample Efficiency.

  6. Algorithms Seen So Far. How many steps did it take for DQN to learn a good policy for Pong?

  7. Evaluation Criteria. How do we evaluate how "good" an algorithm is? Does it converge? Does it converge to the optimal policy? How quickly does it reach the optimal policy? How many mistakes does it make along the way? We will introduce different measures for evaluating RL algorithms.

  8. Settings, Frameworks & Approaches. Over the next couple of lectures we will consider two settings, multiple frameworks, and multiple approaches. Settings: bandits (single decisions) and MDPs. Frameworks: evaluation criteria for formally assessing the quality of an RL algorithm. Approaches: classes of algorithms for achieving particular evaluation criteria in a certain setting. Note: we will see that some approaches can achieve multiple frameworks in multiple settings.

  9. Today. Setting: introduction to multi-armed bandits. Framework: regret. Approach: optimism under uncertainty. Framework: Bayesian regret. Approach: probability matching / Thompson sampling.

  10. Multi-armed Bandits. A multi-armed bandit is a tuple $(\mathcal{A}, \mathcal{R})$. $\mathcal{A}$ is a known set of $m$ actions (arms), and $\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards. At each step $t$ the agent selects an action $a_t \in \mathcal{A}$ and the environment generates a reward $r_t \sim \mathcal{R}^{a_t}$. Goal: maximize the cumulative reward $\sum_{\tau=1}^{t} r_\tau$.
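
To make the tuple $(\mathcal{A}, \mathcal{R})$ concrete, here is a minimal sketch of a Bernoulli bandit environment in Python; the class name `BernoulliBandit`, the `probs` parameter, and the example arm probabilities are illustrative choices, not part of the lecture.

```python
import numpy as np

class BernoulliBandit:
    """Minimal multi-armed bandit (A, R): m arms with hidden Bernoulli reward parameters."""

    def __init__(self, probs, seed=0):
        # probs[i] is the (unknown to the learner) success probability of arm i
        self.probs = np.asarray(probs)
        self.rng = np.random.default_rng(seed)

    @property
    def num_arms(self):
        return len(self.probs)

    def pull(self, a):
        # The environment generates a reward r_t ~ R^{a_t}
        return float(self.rng.random() < self.probs[a])

# Example: the agent repeatedly picks actions and accumulates reward
bandit = BernoulliBandit([0.95, 0.90, 0.10])
total_reward = sum(bandit.pull(a) for a in [0, 1, 2, 0, 0])
```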

  11. Regret. The action-value is the mean reward for action $a$: $Q(a) = \mathbb{E}[r \mid a]$. The optimal value is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$. Regret is the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$. Total regret is the total opportunity loss: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right]$. Maximizing cumulative reward $\iff$ minimizing total regret.
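
As a quick numeric illustration of these definitions, the sketch below computes per-step and total regret for a fixed action sequence, assuming the true values $Q(a)$ are available to the evaluator (the learner never sees them); all values are made up.

```python
# True action values (known only to the evaluator, not to the learner)
Q = {"a1": 0.95, "a2": 0.90, "a3": 0.10}
V_star = max(Q.values())

actions_taken = ["a1", "a2", "a3", "a1", "a2"]

# Per-step regret l_t = V* - Q(a_t); total regret L_t is their sum
per_step_regret = [V_star - Q[a] for a in actions_taken]
total_regret = sum(per_step_regret)
print(per_step_regret)   # [0.0, 0.05, 0.85, 0.0, 0.05] (up to float rounding)
print(total_regret)      # ~0.95
```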

  12. Evaluating Regret. The count $N_t(a)$ is the expected number of selections of action $a$. The gap $\Delta_a$ is the difference in value between action $a$ and the optimal action $a^*$: $\Delta_a = V^* - Q(a)$. Regret is a function of gaps and counts: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right] = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\,(V^* - Q(a)) = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\,\Delta_a$. A good algorithm ensures small counts for actions with large gaps, but the gaps are not known.
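
The count-and-gap decomposition can be checked against the per-step definition using the same made-up values; this is only an illustrative calculation.

```python
Q = {"a1": 0.95, "a2": 0.90, "a3": 0.10}
V_star = max(Q.values())
gaps = {a: V_star - q for a, q in Q.items()}     # Delta_a = V* - Q(a)

# Counts N_t(a) for the action sequence a1, a2, a3, a1, a2
counts = {"a1": 2, "a2": 2, "a3": 1}

# L_t = sum_a N_t(a) * Delta_a
total_regret = sum(counts[a] * gaps[a] for a in Q)
print(total_regret)  # ~0.95, matching the per-step computation above
```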

  13. Greedy Algorithm. We consider algorithms that estimate $\hat{Q}_t(a) \approx Q(a)$, estimating the value of each action by Monte-Carlo evaluation: $\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \mathbf{1}(a_\tau = a)$. The greedy algorithm selects the action with the highest estimated value: $a_t^* = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$. Greedy can lock onto a suboptimal action, forever.
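
A minimal sketch of the greedy algorithm with incremental Monte-Carlo estimates, assuming a hypothetical `pull(a)` reward function with made-up Bernoulli parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.95, 0.90, 0.10]          # hidden from the learner

def pull(a):
    return float(rng.random() < true_probs[a])

num_arms, T = len(true_probs), 1000
Q_hat = np.zeros(num_arms)               # Monte-Carlo estimates Q_hat_t(a)
N = np.zeros(num_arms)                   # counts N_t(a)

for t in range(T):
    # Greedy: always pick the arm with the highest current estimate.
    # If early luck favors a suboptimal arm, greedy can lock onto it forever.
    a = int(np.argmax(Q_hat))
    r = pull(a)
    N[a] += 1
    Q_hat[a] += (r - Q_hat[a]) / N[a]    # incremental sample mean
```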

  14. ε-Greedy Algorithm. The ε-greedy algorithm proceeds as follows: with probability $1 - \epsilon$, select $a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$; with probability $\epsilon$, select a random action. It will always be making a sub-optimal decision an $\epsilon$ fraction of the time. We have already used this in prior homeworks.
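
A corresponding sketch of ε-greedy under the same assumptions (hypothetical `pull(a)`, made-up parameters); the only change from greedy is the random exploration branch.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.95, 0.90, 0.10]          # hidden from the learner

def pull(a):
    return float(rng.random() < true_probs[a])

num_arms, T, eps = len(true_probs), 1000, 0.1
Q_hat = np.zeros(num_arms)
N = np.zeros(num_arms)

for t in range(T):
    if rng.random() < eps:
        a = int(rng.integers(num_arms))          # explore: uniformly random arm
    else:
        a = int(np.argmax(Q_hat))                # exploit: current greedy arm
    r = pull(a)
    N[a] += 1
    Q_hat[a] += (r - Q_hat[a]) / N[a]
```

Because $\epsilon$ stays fixed here, the agent keeps exploring at the same rate even after the estimates are accurate, which is why its per-step regret does not go to zero.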

  15. Toy Example: Ways to Treat Broken Toes. Consider deciding how best to treat patients with broken toes. Imagine we have 3 possible options: (1) surgery, (2) buddy taping the broken toe to another toe, (3) do nothing. The outcome measure / reward is a binary variable: whether the toe has healed (+1) or not healed (0) after 6 weeks, as assessed by x-ray. (Note: this is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.)

  16. Toy Example: Ways to Treat Broken Toes. Consider deciding how best to treat patients with broken toes. Imagine we have 3 common options: (1) surgery, (2) a surgical boot, (3) buddy taping the broken toe to another toe. The outcome measure is a binary variable: whether the toe has healed (+1) or not (0) after 6 weeks, as assessed by x-ray. Model this as a multi-armed bandit with 3 arms, where each arm is a Bernoulli variable with an unknown parameter $\theta_i$. Check your understanding: what does a pull of an arm / taking an action correspond to? Why is it reasonable to model this as a multi-armed bandit instead of a Markov decision process? (Note: this is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.)

  17. Toy Example: Ways to Treat Broken Toes. Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are: surgery $Q(a_1) = \theta_1 = 0.95$, buddy taping $Q(a_2) = \theta_2 = 0.9$, doing nothing $Q(a_3) = \theta_3 = 0.1$. (Note: this is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.)

  18. Toy Example: Ways to Treat Broken Toes, Greedy. Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are: surgery $Q(a_1) = \theta_1 = 0.95$, buddy taping $Q(a_2) = \theta_2 = 0.9$, doing nothing $Q(a_3) = \theta_3 = 0.1$. Greedy: sample each arm once. Take action $a_1$ ($r \sim$ Bernoulli(0.95)), get +1, so $\hat{Q}(a_1) = 1$. Take action $a_2$ ($r \sim$ Bernoulli(0.90)), get +1, so $\hat{Q}(a_2) = 1$. Take action $a_3$ ($r \sim$ Bernoulli(0.1)), get 0, so $\hat{Q}(a_3) = 0$. What is the probability of greedy selecting each arm next? Assume ties are split uniformly. (Note: this is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.)
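
One way to check this question numerically: with estimates $\hat{Q} = (1, 1, 0)$, greedy with uniform tie-breaking splits its choice over the tied maximizers. The short sketch below implements that rule by hand (illustrative only).

```python
import numpy as np

Q_hat = np.array([1.0, 1.0, 0.0])        # estimates after one pull of each arm

# Greedy with uniform tie-breaking: spread probability over all maximizing arms
best = np.flatnonzero(Q_hat == Q_hat.max())
p_greedy = np.zeros(len(Q_hat))
p_greedy[best] = 1.0 / len(best)
print(p_greedy)    # [0.5, 0.5, 0.0]: a1 and a2 each with probability 1/2, a3 never
```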

  19. Toy Example: Ways to Treat Broken Toes, Optimism, Assessing Regret of Greedy. The true (unknown) Bernoulli reward parameters for each arm (action) are: surgery $Q(a_1) = \theta_1 = 0.95$, buddy taping $Q(a_2) = \theta_2 = 0.9$, doing nothing $Q(a_3) = \theta_3 = 0.1$. Greedy:
     Action | Optimal Action | Regret
     a_1    | a_1            |
     a_2    | a_1            |
     a_3    | a_1            |
     a_1    | a_1            |
     a_2    | a_1            |
     Will greedy ever select $a_3$ again? If yes, why? If not, is this a problem?

  20. Toy Example: Ways to Treat Broken Toes, ε-Greedy. Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are: surgery $Q(a_1) = \theta_1 = 0.95$, buddy taping $Q(a_2) = \theta_2 = 0.9$, doing nothing $Q(a_3) = \theta_3 = 0.1$. ε-greedy: sample each arm once. Take action $a_1$ ($r \sim$ Bernoulli(0.95)), get +1, so $\hat{Q}(a_1) = 1$. Take action $a_2$ ($r \sim$ Bernoulli(0.90)), get +1, so $\hat{Q}(a_2) = 1$. Take action $a_3$ ($r \sim$ Bernoulli(0.1)), get 0, so $\hat{Q}(a_3) = 0$. Let $\epsilon = 0.1$. What is the probability ε-greedy will pull each arm next? Assume ties are split uniformly. (Note: this is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.)
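
The analogous check for ε-greedy with $\epsilon = 0.1$, assuming exploration picks uniformly among all three arms (illustrative only).

```python
import numpy as np

Q_hat = np.array([1.0, 1.0, 0.0])
eps = 0.1

best = np.flatnonzero(Q_hat == Q_hat.max())
p = np.full(len(Q_hat), eps / len(Q_hat))        # explore: eps spread over all arms
p[best] += (1 - eps) / len(best)                 # exploit: 1 - eps split over tied maximizers
print(p)    # [0.4833..., 0.4833..., 0.0333...]
```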

  21. Toy Example: Ways to Treat Broken Toes, Optimism, Assessing Regret of Greedy. The true (unknown) Bernoulli reward parameters for each arm (action) are: surgery $Q(a_1) = \theta_1 = 0.95$, buddy taping $Q(a_2) = \theta_2 = 0.9$, doing nothing $Q(a_3) = \theta_3 = 0.1$. UCB1 (Auer, Cesa-Bianchi & Fischer 2002):
     Action | Optimal Action | Regret
     a_1    | a_1            |
     a_2    | a_1            |
     a_3    | a_1            |
     a_1    | a_1            |
     a_2    | a_1            |
     Will ε-greedy ever select $a_3$ again? If $\epsilon$ is fixed, how many times will each arm be selected?
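
Since the slide names UCB1 (Auer, Cesa-Bianchi & Fischer 2002), here is a hedged sketch of the optimism-under-uncertainty approach on the same toy bandit; the $\sqrt{2 \ln t / N_t(a)}$ bonus is the standard UCB1 form, and the variable names, seed, and horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.95, 0.90, 0.10]          # hidden from the learner

def pull(a):
    return float(rng.random() < true_probs[a])

num_arms, T = len(true_probs), 1000
Q_hat = np.zeros(num_arms)
N = np.zeros(num_arms)

for t in range(1, T + 1):
    if t <= num_arms:
        a = t - 1                                          # pull each arm once to initialize
    else:
        # UCB1: add an optimism bonus sqrt(2 ln t / N(a)) to the empirical mean,
        # so rarely tried arms look better until their uncertainty shrinks
        ucb = Q_hat + np.sqrt(2.0 * np.log(t) / N)
        a = int(np.argmax(ucb))
    r = pull(a)
    N[a] += 1
    Q_hat[a] += (r - Q_hat[a]) / N[a]
```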
