

SLIDE 1

Lecture 13: Fast Reinforcement Learning 1

Emma Brunskill

CS234 Reinforcement Learning

Winter 2020

1 With a few slides derived from David Silver

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 1 / 40

SLIDE 2

Refresh Your Knowledge Fast RL Part II

The prior over arm 1 is Beta(1,2) (left figure) and the prior over arm 2 is Beta(1,1) (right figure). Select all that are true.

1. Sample 3 parameters: 0.1, 0.5, 0.3. These are more likely to come from the Beta(1,2) distribution than from Beta(1,1).
2. Sample 3 parameters: 0.2, 0.5, 0.8. These are more likely to come from the Beta(1,1) distribution than from Beta(1,2).
3. It is impossible that the true Bernoulli parameter is 0 if the prior is Beta(1,1).
4. Not sure

The prior over arm 1 is Beta(1,2) (left) and the prior over arm 2 is Beta(1,1) (right). The true parameters are θ1 = 0.4 for arm 1 and θ2 = 0.6 for arm 2. Thompson sampling = TS.

1. TS could sample θ = 0.5 (arm 1) and θ = 0.55 (arm 2).
2. For the sampled thetas (0.5, 0.55), TS is optimistic with respect to the true arm parameters for all arms.
3. For the sampled thetas (0.5, 0.55), TS will choose the true optimal arm for this round.
4. Not sure

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 2 / 40

SLIDE 3

Class Structure

Last time: Fast Learning (Bayesian bandits to MDPs)
This time: Fast Learning III (MDPs)
Next time: Batch RL

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 3 / 40

SLIDE 4

Settings, Frameworks & Approaches

Over these 3 lectures we will consider 2 settings, multiple frameworks, and multiple approaches.

Settings: bandits (single decisions), MDPs

Frameworks: evaluation criteria for formally assessing the quality of an RL algorithm. So far we have seen empirical evaluations, asymptotic convergence, regret, and probably approximately correct (PAC).

Approaches: classes of algorithms for achieving particular evaluation criteria in a certain setting. So far for exploration we have seen: greedy, ε-greedy, optimism, and Thompson sampling, for multi-armed bandits.

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 4 / 40

SLIDE 5

Table of Contents

1. MDPs
2. Bayesian MDPs
3. Generalization and Exploration
4. Summary

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 5 / 40

SLIDE 6

Fast RL in Markov Decision Processes

A very similar set of frameworks and approaches is relevant for fast learning in reinforcement learning.

Frameworks:
- Regret
- Bayesian regret
- Probably approximately correct (PAC)

Approaches:
- Optimism under uncertainty
- Probability matching / Thompson sampling

Framework: probably approximately correct

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 6 / 40

SLIDE 7

Fast RL in Markov Decision Processes

Montezuma’s revenge https://www.youtube.com/watch?v=ToSe CUG0F4

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 7 / 40

SLIDE 8

Model-Based Interval Estimation with Exploration Bonus (MBIE-EB)

(Strehl and Littman, Journal of Computer and System Sciences, 2008)

1: Given ε, δ, m
2: β = (1 / (1 − γ)) √(0.5 ln(2|S||A|m/δ))
3: nsas(s, a, s′) = 0, ∀s ∈ S, a ∈ A, s′ ∈ S
4: rc(s, a) = 0, nsa(s, a) = 0, Q̃(s, a) = 1/(1 − γ), ∀s ∈ S, a ∈ A
5: t = 0, s_t = s_init
6: loop
7:   a_t = arg max_{a ∈ A} Q̃(s_t, a)
8:   Observe reward r_t and state s_{t+1}
9:   nsa(s_t, a_t) = nsa(s_t, a_t) + 1, nsas(s_t, a_t, s_{t+1}) = nsas(s_t, a_t, s_{t+1}) + 1
10:  rc(s_t, a_t) = [rc(s_t, a_t)(nsa(s_t, a_t) − 1) + r_t] / nsa(s_t, a_t)
11:  R̂(s_t, a_t) = rc(s_t, a_t) and T̂(s′|s_t, a_t) = nsas(s_t, a_t, s′) / nsa(s_t, a_t), ∀s′ ∈ S
12:  while not converged do
13:    Q̃(s, a) = R̂(s, a) + γ Σ_{s′} T̂(s′|s, a) max_{a′} Q̃(s′, a′) + β / √(nsa(s, a)), ∀s ∈ S, a ∈ A
14:  end while
15: end loop

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 8 / 40
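A minimal tabular sketch of the loop above, assuming a small finite MDP wrapped in a hypothetical `env` object whose `reset()` returns a state index and whose `step(a)` returns `(next_state, reward)`; the constants and convergence tolerance are illustrative, not the paper's settings.

```python
import numpy as np

def mbie_eb(env, n_states, n_actions, gamma=0.95, m=100, delta=0.05,
            total_steps=10_000, vi_iters=200):
    """Sketch of MBIE-EB: greedy acting on a Q function computed from the
    empirical model plus an exploration bonus beta / sqrt(n(s, a))."""
    beta = (1.0 / (1.0 - gamma)) * np.sqrt(
        0.5 * np.log(2 * n_states * n_actions * m / delta))
    q_max = 1.0 / (1.0 - gamma)                            # optimistic initial value

    n_sa = np.zeros((n_states, n_actions))                 # visit counts n(s, a)
    n_sas = np.zeros((n_states, n_actions, n_states))      # counts n(s, a, s')
    r_sum = np.zeros((n_states, n_actions))                # running reward sums
    Q = np.full((n_states, n_actions), q_max)

    s = env.reset()
    for _ in range(total_steps):
        a = int(np.argmax(Q[s]))                           # line 7: greedy action
        s_next, r = env.step(a)

        # Lines 9-11: update counts and the empirical reward/transition model.
        n_sa[s, a] += 1
        n_sas[s, a, s_next] += 1
        r_sum[s, a] += r
        visited = n_sa > 0
        R_hat = np.where(visited, r_sum / np.maximum(n_sa, 1), 0.0)
        T_hat = n_sas / np.maximum(n_sa[:, :, None], 1)

        # Lines 12-14: value iteration on the model with the exploration bonus.
        bonus = beta / np.sqrt(np.maximum(n_sa, 1))
        for _ in range(vi_iters):
            Q_new = R_hat + gamma * (T_hat @ Q.max(axis=1)) + bonus
            Q_new = np.where(visited, Q_new, q_max)        # unvisited stay optimistic
            if np.max(np.abs(Q_new - Q)) < 1e-6:
                Q = Q_new
                break
            Q = Q_new

        s = s_next
    return Q
```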

SLIDE 9

Framework: PAC for MDPs

For a given ε and δ, an RL algorithm A is PAC if on all but N steps, the action selected by algorithm A on time step t, a_t, is ε-close to the optimal action, where N is a polynomial function of (|S|, |A|, γ, ε, δ)

Is this true for all algorithms?

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 9 / 40
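A sketch of the same criterion written as a counting statement, in the standard PAC-MDP form (the literature usually states the polynomial in 1/ε, 1/δ, and 1/(1 − γ) rather than in ε, δ, γ directly):

```latex
\text{With probability at least } 1-\delta:\qquad
\#\{\, t \;:\; V^{\pi_t}(s_t) < V^{*}(s_t) - \epsilon \,\}
\;\le\; N\!\left(|S|,\,|A|,\,\tfrac{1}{\epsilon},\,\tfrac{1}{\delta},\,\tfrac{1}{1-\gamma}\right)
```

where π_t denotes the (possibly non-stationary) policy the algorithm follows at time t and N is polynomial in its arguments.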

SLIDE 10

MBIE-EB is a PAC RL Algorithm

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 10 / 40

SLIDE 11

A Sufficient Set of Conditions to Make a RL Algorithm PAC

Strehl, A. L., Li, L., & Littman, M. L. (2006). Incremental model-based learners with formal learning-time guarantees. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (pp. 485-493)

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 11 / 40

SLIDE 12

A Sufficient Set of Conditions to Make a RL Algorithm PAC

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 12 / 40

SLIDE 13

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 13 / 40

SLIDE 14

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 14 / 40

SLIDE 15

How Does MBIE-EB Fulfill these Conditions?

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 15 / 40

SLIDE 16

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 16 / 40

SLIDE 17

Table of Contents

1. MDPs
2. Bayesian MDPs
3. Generalization and Exploration
4. Summary

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 17 / 40

SLIDE 18

Refresher: Bayesian Bandits

Bayesian bandits exploit prior knowledge of rewards, p[R]
They compute the posterior distribution of rewards p[R | h_t], where h_t = (a_1, r_1, . . . , a_{t−1}, r_{t−1})
Use the posterior to guide exploration:
- Upper confidence bounds (Bayesian UCB)
- Probability matching (Thompson sampling)

Better performance if prior knowledge is accurate

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 18 / 40

SLIDE 19

Refresher: Bernoulli Bandits

Consider a bandit problem where the reward of an arm is a binary outcome {0, 1} sampled from a Bernoulli with parameter θ

E.g. advertisement click-through rate, patient treatment succeeds/fails, ...

The Beta distribution Beta(α, β) is conjugate for the Bernoulli distribution:

p(θ | α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1)

where Γ(x) is the Gamma function. Assume the prior over θ is a Beta(α, β) as above. Then, after observing a reward r ∈ {0, 1}, the updated posterior over θ is Beta(r + α, 1 − r + β).

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 19 / 40
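As a quick worked example of this conjugate update (the numbers are illustrative, not from the slide): starting from the uniform prior Beta(1, 1), observing r = 1 gives posterior Beta(2, 1), and a further observation r = 0 gives Beta(2, 2), whose mean α / (α + β) = 0.5 matches the empirical success rate of the two observed pulls.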

SLIDE 20

Thompson Sampling for Bandits

1: Initialize prior over each arm a, p(R_a)
2: loop
3:   For each arm a, sample a reward distribution R_a from the posterior
4:   Compute action-value function Q(a) = E[R_a]
5:   a_t = arg max_{a ∈ A} Q(a)
6:   Observe reward r
7:   Update posterior p(R_a | r) using Bayes law
8: end loop

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 20 / 40
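A minimal sketch of this loop for Bernoulli arms with Beta priors (the conjugate case from the previous slide). The true arm parameters, horizon, and seed are illustrative assumptions, not part of the slide.

```python
import numpy as np

def thompson_sampling_bernoulli(true_thetas, horizon=1000, seed=0):
    """Thompson sampling for Bernoulli bandits with Beta(1, 1) priors on each arm."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_thetas)
    alpha = np.ones(n_arms)  # Beta posterior parameter (prior 1 + #successes)
    beta = np.ones(n_arms)   # Beta posterior parameter (prior 1 + #failures)
    total_reward = 0.0

    for _ in range(horizon):
        # Sample one plausible parameter per arm from its posterior,
        # then act greedily with respect to the sampled parameters.
        sampled_thetas = rng.beta(alpha, beta)
        a = int(np.argmax(sampled_thetas))

        # Pull the arm and observe a Bernoulli reward.
        r = float(rng.random() < true_thetas[a])
        total_reward += r

        # Conjugate posterior update: Beta(alpha + r, beta + 1 - r).
        alpha[a] += r
        beta[a] += 1.0 - r

    return total_reward, alpha, beta

# Example use (arm parameters chosen to match the quiz on slide 2):
# total, a, b = thompson_sampling_bernoulli([0.4, 0.6])
```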

SLIDE 21

Bayesian Model-Based RL

Maintain a posterior distribution over MDP models
Estimate both transitions and rewards, p[P, R | h_t], where h_t = (s_1, a_1, r_1, . . . , s_t) is the history
Use the posterior to guide exploration:
- Upper confidence bounds (Bayesian UCB)
- Probability matching (Thompson sampling)

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 21 / 40

SLIDE 22

Thompson Sampling: Model-Based RL

Thompson sampling implements probability matching:

π(s, a | h_t) = P[Q(s, a) ≥ Q(s, a′), ∀a′ ≠ a | h_t] = E_{P,R | h_t}[ 1(a = arg max_{a ∈ A} Q(s, a)) ]

Use Bayes law to compute the posterior distribution p[P, R | h_t]
Sample an MDP (P, R) from the posterior
Solve the MDP using your favorite planning algorithm to get Q∗(s, a)
Select the optimal action for the sampled MDP, a_t = arg max_{a ∈ A} Q∗(s_t, a)

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 22 / 40

SLIDE 23

Thompson Sampling for MDPs

1: Initialize prior over the dynamics and reward models for each (s, a): p(R_as), p(T(s′ | s, a))
2: Initialize state s_0
3: loop
4:   Sample an MDP M: for each (s, a) pair, sample a dynamics model T(s′ | s, a) and a reward model R(s, a)
5:   Compute Q∗_M, the optimal value for MDP M
6:   a_t = arg max_{a ∈ A} Q∗_M(s_t, a)
7:   Observe reward r_t and next state s_{t+1}
8:   Update posteriors p(R_{a_t s_t} | r_t), p(T(s′ | s_t, a_t) | s_{t+1}) using Bayes rule
9:   t = t + 1
10: end loop

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 23 / 40
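A minimal sketch of posterior sampling for a small tabular MDP, assuming Dirichlet priors over each transition row and Beta priors over Bernoulli rewards (a common conjugate choice; the slide does not fix a prior family), and a hypothetical `env` with `reset()`/`step(a)` returning `(next_state, reward)`. Note that the slide's loop resamples an MDP at every time step; resampling once per episode, as sketched here, is the common PSRL variant.

```python
import numpy as np

def psrl_episode(env, n_states, n_actions, dirichlet, reward_alpha, reward_beta,
                 gamma=0.95, horizon=100, vi_iters=500, rng=None):
    """One episode of Thompson sampling (posterior sampling) for a tabular MDP."""
    rng = rng or np.random.default_rng()

    # Step 4: sample an MDP M from the posterior, one (s, a) pair at a time.
    T = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            T[s, a] = rng.dirichlet(dirichlet[s, a])
    R = rng.beta(reward_alpha, reward_beta)          # mean Bernoulli reward per (s, a)

    # Step 5: compute Q*_M for the sampled MDP by value iteration.
    Q = np.zeros((n_states, n_actions))
    for _ in range(vi_iters):
        Q = R + gamma * (T @ Q.max(axis=1))

    # Steps 6-9: act greedily w.r.t. the sampled MDP and update the posteriors.
    s = env.reset()
    for _ in range(horizon):
        a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        dirichlet[s, a, s_next] += 1                 # conjugate Dirichlet update
        reward_alpha[s, a] += r                      # conjugate Beta update (r in {0, 1})
        reward_beta[s, a] += 1 - r
        s = s_next
    return Q
```

The priors would typically be initialized as `np.ones((n_states, n_actions, n_states))` for the Dirichlet counts and `np.ones((n_states, n_actions))` for each Beta parameter, then passed in and updated across episodes.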

SLIDE 24

Check Your Understanding: Fast RL III

Strategic exploration in MDPs (select all):

1. Doesn't really matter because the distribution of data is independent of the policy followed
2. Can involve using optimism with respect to both the possible dynamics and reward models in order to compute an optimistic Q function
3. Is known as PAC if the number of time steps on which a less than near-optimal decision is made is guaranteed to be less than an exponential function of the problem domain parameters (state space cardinality, etc.)
4. Not sure

In Thompson sampling for MDPs:

1. TS samples the reward model parameters and could use the empirical average for the dynamics model parameters and obtain the same performance
2. Must perform MDP planning every time the posterior is updated
3. Has the same computational cost each step as Q-learning
4. Not sure

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 24 / 40

SLIDE 25

Resampling in Coordinated Exploration

Concurrent PAC RL. Guo and Brunskill. AAAI 2015
Coordinated Exploration in Concurrent Reinforcement Learning. Dimakopoulou and Van Roy. ICML 2018
https://www.youtube.com/watch?v=xjGK-wm0PkI&feature=youtu.be

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 25 / 40

SLIDE 26

Table of Contents

1. MDPs
2. Bayesian MDPs
3. Generalization and Exploration
4. Summary

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 26 / 40

SLIDE 27

Generalization and Strategic Exploration

Active area of ongoing research: combining generalization & strategic exploration
Many approaches are grounded in the principles outlined here:
- Optimism under uncertainty
- Thompson sampling

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 27 / 40

SLIDE 28

Generalization and Optimism

Recall the MBIE-EB algorithm for finite state and action domains. What needs to be modified for continuous / extremely large state and/or action spaces?

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 28 / 40

SLIDE 29

Model-Based Interval Estimation with Exploration Bonus (MBIE-EB)

(Strehl and Littman, Journal of Computer and System Sciences, 2008)

1: Given ε, δ, m
2: β = (1 / (1 − γ)) √(0.5 ln(2|S||A|m/δ))
3: nsas(s, a, s′) = 0, ∀s ∈ S, a ∈ A, s′ ∈ S
4: rc(s, a) = 0, nsa(s, a) = 0, Q̃(s, a) = 1/(1 − γ), ∀s ∈ S, a ∈ A
5: t = 0, s_t = s_init
6: loop
7:   a_t = arg max_{a ∈ A} Q̃(s_t, a)
8:   Observe reward r_t and state s_{t+1}
9:   nsa(s_t, a_t) = nsa(s_t, a_t) + 1, nsas(s_t, a_t, s_{t+1}) = nsas(s_t, a_t, s_{t+1}) + 1
10:  rc(s_t, a_t) = [rc(s_t, a_t)(nsa(s_t, a_t) − 1) + r_t] / nsa(s_t, a_t)
11:  R̂(s_t, a_t) = rc(s_t, a_t) and T̂(s′|s_t, a_t) = nsas(s_t, a_t, s′) / nsa(s_t, a_t), ∀s′ ∈ S
12:  while not converged do
13:    Q̃(s, a) = R̂(s, a) + γ Σ_{s′} T̂(s′|s, a) max_{a′} Q̃(s′, a′) + β / √(nsa(s, a)), ∀s ∈ S, a ∈ A
14:  end while
15: end loop

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 29 / 40

SLIDE 30

Generalization and Optimism

Recall the MBIE-EB algorithm for finite state and action domains. What needs to be modified for continuous / extremely large state and/or action spaces?

Estimating uncertainty:

Counts of (s, a) and (s, a, s′) tuples are not useful if we expect to encounter any given state only once

Computing a policy:

Model-based planning will fail

So far, model-free approaches have generally had more success than model-based approaches for extremely large domains

Building good transition models to predict pixels is challenging

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 30 / 40

SLIDE 31

Recall: Value Function Approximation with Control

For Q-learning, use a TD target r + γ max_{a′} Q̂(s′, a′; w), which leverages the max of the current function approximation value:

Δw = α ( r(s) + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w) ) ∇_w Q̂(s, a; w)

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 31 / 40

SLIDE 32

Recall: Value Function Approximation with Control

For Q-learning, use a TD target r + γ max_{a′} Q̂(s′, a′; w), which leverages the max of the current function approximation value, now with an exploration bonus r_bonus added to the target:

Δw = α ( r(s) + r_bonus(s, a) + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w) ) ∇_w Q̂(s, a; w)

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 32 / 40

SLIDE 33

Recall: Value Function Approximation with Control

For Q-learning, use a TD target r + γ max_{a′} Q̂(s′, a′; w), which leverages the max of the current function approximation value:

Δw = α ( r(s) + r_bonus(s, a) + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w) ) ∇_w Q̂(s, a; w)

r_bonus(s, a) should reflect uncertainty about future reward from (s, a)
Approaches for deep RL that make an estimate of visits / density of visits include: Bellemare et al. NIPS 2016; Ostrovski et al. ICML 2017; Tang et al. NIPS 2017
Note: bonus terms are computed at time of visit; during episodic replay they can become outdated.

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 33 / 40
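A minimal sketch of the bonus-augmented TD update with linear value function approximation; the hashed pseudo-count is a crude stand-in for the density models cited above, and all names, the feature discretization, and the bonus scale β are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

class CountBonusQLearner:
    """Q-learning with linear function approximation and a count-based
    exploration bonus added to the TD target."""

    def __init__(self, n_features, n_actions, alpha=0.1, gamma=0.99, beta=0.1):
        self.w = np.zeros((n_actions, n_features))   # one weight vector per action
        self.counts = defaultdict(int)                # visit counts over hashed (s, a)
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def q(self, phi):
        return self.w @ phi                           # Q(s, a; w) for all actions

    def bonus(self, phi, a):
        # Hash a coarse discretization of the features to get a pseudo-count,
        # then return beta / sqrt(count), computed at time of visit.
        key = (tuple(np.round(phi, 1)), a)
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])

    def update(self, phi, a, r, phi_next, done):
        target = r + self.bonus(phi, a)
        if not done:
            target += self.gamma * np.max(self.q(phi_next))
        td_error = target - self.q(phi)[a]
        # For linear function approximation, grad_w Q(s, a; w) is phi for w[a].
        self.w[a] += self.alpha * td_error * phi
```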

SLIDE 34

Benefits of Strategic Exploration: Montezuma’s revenge

Figure: Bellemare et al. ”Unifying Count-Based Exploration and Intrinsic Motivation”

Enormously better than standard DQN with an ε-greedy approach

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 34 / 40

SLIDE 35

Generalization and Strategic Exploration: Thompson Sampling

Leveraging the Bayesian perspective has also inspired some approaches
One approach: Thompson sampling over representation & parameters (Mandel, Liu, Brunskill, Popovic IJCAI 2016)

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 35 / 40

SLIDE 36

Generalization and Strategic Exploration: Thompson Sampling

Leveraging the Bayesian perspective has also inspired some approaches
One approach: Thompson sampling over representation & parameters (Mandel, Liu, Brunskill, Popovic IJCAI 2016)
For scaling up to very large domains, it is again useful to consider model-free approaches
Non-trivial: we would like to be able to sample from a posterior over possible Q∗
Bootstrapped DQN (Osband et al. NIPS 2016):
- Train C DQN agents using bootstrapped samples
- When acting, choose the action with the highest Q value over any of the C agents
- Some performance gain, but not as effective as reward bonus approaches

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 36 / 40
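A minimal sketch of the acting rule described in the bullets above (greedy with respect to the max over C bootstrapped heads), assuming the heads are already trained and exposed as callables from state to a vector of Q values; this shows only action selection, not the bootstrapped training.

```python
import numpy as np

def act_bootstrapped(q_heads, state):
    """Pick the action whose Q value is highest under any of the C heads,
    one simple variant of acting with Bootstrapped DQN."""
    # q_heads: list of callables, each mapping a state to a vector of Q values.
    all_q = np.stack([q(state) for q in q_heads])   # shape (C, n_actions)
    return int(np.argmax(all_q.max(axis=0)))        # max over heads, then over actions
```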

SLIDE 37

Generalization and Strategic Exploration: Thompson Sampling

Leveraging the Bayesian perspective has also inspired some approaches
One approach: Thompson sampling over representation & parameters (Mandel, Liu, Brunskill, Popovic IJCAI 2016)
For scaling up to very large domains, it is again useful to consider model-free approaches
Non-trivial: we would like to be able to sample from a posterior over possible Q∗
Bootstrapped DQN (Osband et al. NIPS 2016)
Efficient Exploration through Bayesian Deep Q-Networks (Azizzadenesheli, Anandkumar, NeurIPS workshop 2017):
- Use a deep neural network
- On the last layer, use Bayesian linear regression
- Be optimistic with respect to the resulting posterior
- Very simple; empirically much better than just doing linear regression on the last layer or bootstrapped DQN, though not as good as reward bonuses in some cases

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 37 / 40
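A minimal sketch of the last-layer idea, assuming a fixed feature map φ(s) taken from the penultimate layer of a trained network, one regression per action; the Gaussian prior, noise variance, and Thompson-style weight sampling are standard Bayesian linear regression choices, not details given on the slide.

```python
import numpy as np

class LastLayerBLR:
    """Bayesian linear regression over fixed last-layer features phi(s),
    one regression per action; sample weights to get a Thompson-style Q."""

    def __init__(self, n_features, n_actions, prior_var=1.0, noise_var=1.0):
        self.noise_var = noise_var
        # Per-action posterior: precision matrix and unnormalized mean term.
        self.precision = np.stack([np.eye(n_features) / prior_var
                                   for _ in range(n_actions)])
        self.b = np.zeros((n_actions, n_features))

    def update(self, phi, a, target):
        """Add one (features, action, regression target) observation."""
        self.precision[a] += np.outer(phi, phi) / self.noise_var
        self.b[a] += phi * target / self.noise_var

    def sample_q(self, phi, rng):
        """Sample one plausible Q value per action from the weight posterior."""
        q = np.zeros(len(self.b))
        for a in range(len(self.b)):
            cov = np.linalg.inv(self.precision[a])
            mean = cov @ self.b[a]
            w = rng.multivariate_normal(mean, cov)
            q[a] = w @ phi
        return q

# Acting: a = argmax of sample_q(phi(s), rng); regression targets come from the usual TD backup.
```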

SLIDE 38

Table of Contents

1. MDPs
2. Bayesian MDPs
3. Generalization and Exploration
4. Summary

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 38 / 40

SLIDE 39

Summary: What You Are Expected to Know

Define the tension between exploration and exploitation in RL and why it does not arise in supervised or unsupervised learning
Be able to define and compare different criteria for "good" performance (empirical, convergence, asymptotic, regret, PAC)
Be able to map algorithms discussed in detail in class to the performance criteria they satisfy

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 39 / 40

SLIDE 40

Class Structure

Last time: Fast Learning (Bayesian bandits to MDPs)
This time: Fast Learning III (MDPs)
Next time: Batch RL

Emma Brunskill (CS234 Reinforcement Learning ) Lecture 13: Fast Reinforcement Learning 1 Winter 2020 40 / 40