

  1. Lecture 12: Fast Reinforcement Learning
     Emma Brunskill, CS234 Reinforcement Learning, Winter 2020
     With some slides derived from David Silver

  2. Refresh Your Understanding: Multi-armed Bandits
     Select all that are true:
     1. UCB selects the arm with $\arg\max_a \hat{Q}_t(a) + \sqrt{\frac{1}{N_t(a)} \log(1/\delta)}$
     2. Over an infinite trajectory, UCB will sample all arms an infinite number of times
     3. UCB still would learn to pull the optimal arm the most if we instead used $\arg\max_a \hat{Q}_t(a) + \sqrt{\frac{1}{N_t(a)} \log(t/\delta)}$
     4. UCB uses a bonus on top of average empirical reward. If the bonus was > 0 but small, the resulting algorithm might still suffer linear regret
     5. Algorithms that minimize regret also maximize reward
     6. Not sure

  3. Class Structure
     Last time: Fast Learning (Bandits and regret)
     This time: Fast Learning (Bayesian bandits)
     Next time: Fast Learning and Exploration

  4. Recall Motivation
     Fast learning is important when our decisions impact the real world

  5. Settings, Frameworks & Approaches
     Over the next couple of lectures we will consider 2 settings, multiple frameworks, and approaches
     Settings: Bandits (single decisions), MDPs
     Frameworks: evaluation criteria for formally assessing the quality of an RL algorithm. So far we have seen empirical evaluations, asymptotic convergence, and regret
     Approaches: classes of algorithms for achieving particular evaluation criteria in a certain setting. So far for exploration we have seen: greedy, ε-greedy, optimism

  6. Table of Contents
     1. Recall: Multi-armed Bandit framework
     2. Optimism Under Uncertainty for Bandits
     3. Bayesian Bandits and Bayesian Regret Framework
     4. Probability Matching
     5. Framework: Probably Approximately Correct for Bandits
     6. MDPs

  7. Recall: Multi-armed Bandits
     A multi-armed bandit is a tuple $(\mathcal{A}, \mathcal{R})$
     $\mathcal{A}$: known set of m actions (arms)
     $\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards
     At each step t the agent selects an action $a_t \in \mathcal{A}$
     The environment generates a reward $r_t \sim \mathcal{R}^{a_t}$
     Goal: maximize cumulative reward $\sum_{\tau=1}^{t} r_\tau$
     Regret is the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$
     Total regret is the total opportunity loss: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} (V^* - Q(a_\tau))\right]$
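     The following minimal simulation sketch (not from the slides; the Bernoulli arm means, the pull helper, the horizon, and the uniform-random placeholder policy are all illustrative assumptions) shows how per-step and total regret accumulate under the definitions above:

     import numpy as np

     rng = np.random.default_rng(0)

     # Assumed Bernoulli arms: true_means[a] plays the role of Q(a), the mean reward of arm a.
     true_means = np.array([0.3, 0.5, 0.7])
     v_star = true_means.max()                 # V* = max_a Q(a)

     def pull(a):
         """Sample a reward r ~ R_a (here, a Bernoulli draw with mean true_means[a])."""
         return float(rng.random() < true_means[a])

     T = 1000
     total_regret = 0.0
     for t in range(T):
         a_t = int(rng.integers(len(true_means)))   # placeholder policy: pick an arm uniformly at random
         r_t = pull(a_t)
         total_regret += v_star - true_means[a_t]   # per-step regret l_t = V* - Q(a_t)

     print(f"total regret after {T} steps: {total_regret:.1f}")   # grows linearly for this random policy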

  8. Table of Contents
     1. Recall: Multi-armed Bandit framework
     2. Optimism Under Uncertainty for Bandits
     3. Bayesian Bandits and Bayesian Regret Framework
     4. Probability Matching
     5. Framework: Probably Approximately Correct for Bandits
     6. MDPs

  9. Approach: Optimism Under Uncertainty
     Estimate an upper confidence $U_t(a)$ for each action value, such that $Q(a) \leq U_t(a)$ with high probability
     This depends on the number of times $N_t(a)$ action a has been selected
     Select the action maximizing the Upper Confidence Bound (UCB): $a_t = \arg\max_{a \in \mathcal{A}} U_t(a)$
     Theorem: The UCB algorithm achieves logarithmic asymptotic total regret
     $\lim_{t \to \infty} L_t \leq 8 \log t \sum_{a \mid \Delta_a > 0} \Delta_a$
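     A minimal sketch of this selection rule for Bernoulli arms (the arms, the value of δ, and pulling each arm once first so that $N_t(a) > 0$ are assumptions for illustration; swapping $\log(1/\delta)$ for $\log(t/\delta)$ gives the time-dependent bonus from the refresher question):

     import numpy as np

     rng = np.random.default_rng(1)
     true_means = np.array([0.3, 0.5, 0.7])    # assumed Bernoulli arms, for illustration only
     n_arms, T, delta = len(true_means), 2000, 0.05

     counts = np.zeros(n_arms)                 # N_t(a): number of pulls of each arm
     sums = np.zeros(n_arms)                   # running sum of observed rewards per arm

     for t in range(T):
         if t < n_arms:
             a = t                                          # pull each arm once so N_t(a) > 0
         else:
             q_hat = sums / counts                          # empirical mean \hat{Q}_t(a)
             bonus = np.sqrt(np.log(1.0 / delta) / counts)  # upper-confidence bonus
             a = int(np.argmax(q_hat + bonus))              # a_t = argmax_a U_t(a)
         r = float(rng.random() < true_means[a])
         counts[a] += 1
         sums[a] += r

     print("pull counts per arm:", counts)     # the highest-mean arm should receive most pulls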

  10. Simpler Optimism?
     Do we need to formally model uncertainty to get the "right" level of optimism?

  11. Greedy Bandit Algorithms and Optimistic Initialization
     Simple optimism under uncertainty approach:
     Pretend we have already observed one pull of each arm, and saw some optimistic reward
     Average these fake pulls and rewards in when computing the average empirical reward

  12. Greedy Bandit Algorithms and Optimistic Initialization
     Simple optimism under uncertainty approach:
     Pretend we have already observed one pull of each arm, and saw some optimistic reward
     Average these fake pulls and rewards in when computing the average empirical reward
     Comparing regret results (a sketch of the optimistic-initialization idea follows below):
     Greedy: linear total regret
     Constant ε-greedy: linear total regret
     Decaying ε-greedy: sublinear regret if we can use the right schedule for decaying ε, but that requires knowledge of the gaps, which are unknown
     Optimistic initialization: sublinear regret if values are initialized sufficiently optimistically, else linear regret
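     A minimal sketch of optimistic initialization (the arms, the single fake pull per arm, and the optimistic reward of 1.0 are assumptions for illustration; whether regret is sublinear depends on how optimistic and how heavily weighted the fake data is, as the comparison above notes):

     import numpy as np

     rng = np.random.default_rng(2)
     true_means = np.array([0.3, 0.5, 0.7])    # assumed Bernoulli arms, for illustration only
     n_arms, T = len(true_means), 2000
     r_opt, n_fake = 1.0, 1                    # one imagined pull per arm at the maximum possible reward

     counts = np.full(n_arms, float(n_fake))   # pull counts, including the imagined pulls
     sums = np.full(n_arms, r_opt * n_fake)    # reward sums, including the imagined optimistic rewards

     for t in range(T):
         a = int(np.argmax(sums / counts))     # plain greedy on the optimistically initialized averages
         r = float(rng.random() < true_means[a])
         counts[a] += 1
         sums[a] += r

     print("real pulls per arm:", counts - n_fake)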

  13. Table of Contents
     1. Recall: Multi-armed Bandit framework
     2. Optimism Under Uncertainty for Bandits
     3. Bayesian Bandits and Bayesian Regret Framework
     4. Probability Matching
     5. Framework: Probably Approximately Correct for Bandits
     6. MDPs

  14. Bayesian Bandits
     So far we have made no assumptions about the reward distribution $\mathcal{R}$ (except bounds on rewards)
     Bayesian bandits exploit prior knowledge of rewards, $p[\mathcal{R}]$
     They compute the posterior distribution of rewards $p[\mathcal{R} \mid h_t]$, where $h_t = (a_1, r_1, \ldots, a_{t-1}, r_{t-1})$
     Use the posterior to guide exploration:
     Upper confidence bounds (Bayesian UCB)
     Probability matching (Thompson Sampling)
     Better performance if prior knowledge is accurate
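     As a preview of probability matching, a minimal Thompson Sampling sketch for Bernoulli arms with Beta(1, 1) priors (the arms and the prior are assumptions for illustration): sample a mean for each arm from its posterior, pull the arm whose sample is largest, and update that arm's posterior with the observed reward.

     import numpy as np

     rng = np.random.default_rng(3)
     true_means = np.array([0.3, 0.5, 0.7])    # assumed Bernoulli arms, for illustration only
     n_arms, T = len(true_means), 2000

     alpha = np.ones(n_arms)                   # Beta(1, 1) prior over each arm's success probability
     beta = np.ones(n_arms)

     for t in range(T):
         theta = rng.beta(alpha, beta)         # one posterior sample per arm
         a = int(np.argmax(theta))             # probability matching: act greedily on the samples
         r = float(rng.random() < true_means[a])
         alpha[a] += r                         # conjugate posterior update for a Bernoulli reward
         beta[a] += 1.0 - r

     print("posterior means:", alpha / (alpha + beta))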

  15. Short Refresher / Review on Bayesian Inference
     In the Bayesian view, we start with a prior over the unknown parameters
     Here, the unknown distribution over the rewards for each arm
     Given observations / data about that parameter, update our uncertainty over the unknown parameters using Bayes Rule

  16. Short Refresher / Review on Bayesian Inference
     In the Bayesian view, we start with a prior over the unknown parameters
     Here, the unknown distribution over the rewards for each arm
     Given observations / data about that parameter, update our uncertainty over the unknown parameters using Bayes Rule
     For example, let the reward of arm i be drawn from a probability distribution that depends on parameter $\phi_i$
     Initial prior over $\phi_i$ is $p(\phi_i)$
     Pull arm i and observe reward $r_{i1}$
     Use Bayes rule to update the estimate over $\phi_i$:

  17. Short Refresher / Review on Bayesian Inference
     In the Bayesian view, we start with a prior over the unknown parameters
     Here, the unknown distribution over the rewards for each arm
     Given observations / data about that parameter, update our uncertainty over the unknown parameters using Bayes Rule
     For example, let the reward of arm i be drawn from a probability distribution that depends on parameter $\phi_i$
     Initial prior over $\phi_i$ is $p(\phi_i)$
     Pull arm i and observe reward $r_{i1}$
     Use Bayes rule to update the estimate over $\phi_i$:
     $p(\phi_i \mid r_{i1}) = \frac{p(r_{i1} \mid \phi_i)\, p(\phi_i)}{p(r_{i1})} = \frac{p(r_{i1} \mid \phi_i)\, p(\phi_i)}{\int_{\phi_i} p(r_{i1} \mid \phi_i)\, p(\phi_i)\, d\phi_i}$
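     For intuition, a tiny numerical version of this update for a single Bernoulli arm, approximating the integral in the denominator with a grid over $\phi_i$ (the grid resolution, the uniform prior, and the observed reward are assumptions for illustration):

     import numpy as np

     phi = np.linspace(0.0, 1.0, 201)               # candidate values of phi_i
     prior = np.ones_like(phi) / len(phi)           # uniform prior p(phi_i), assumed for illustration

     r_i1 = 1.0                                     # suppose we pull arm i and observe reward 1
     likelihood = phi ** r_i1 * (1 - phi) ** (1 - r_i1)   # p(r_i1 | phi_i) for a Bernoulli arm

     unnormalized = likelihood * prior              # numerator of Bayes rule
     posterior = unnormalized / unnormalized.sum()  # denominator: the sum approximates the integral over phi_i

     print("posterior mean of phi_i:", float((phi * posterior).sum()))   # ~2/3, matching a Beta(2, 1) posterior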

  18. Short Refresher / Review on Bayesian Inference II
     In the Bayesian view, we start with a prior over the unknown parameters
     Given observations / data about that parameter, update our uncertainty over the unknown parameters using Bayes Rule
     $p(\phi_i \mid r_{i1}) = \frac{p(r_{i1} \mid \phi_i)\, p(\phi_i)}{\int_{\phi_i} p(r_{i1} \mid \phi_i)\, p(\phi_i)\, d\phi_i}$
     In general, computing this update may be tricky to do exactly with no additional structure on the form of the prior and data likelihood

  19. Short Refresher / Review on Bayesian Inference: Conjugate
     In the Bayesian view, we start with a prior over the unknown parameters
     Given observations / data about that parameter, update our uncertainty over the unknown parameters using Bayes Rule
     $p(\phi_i \mid r_{i1}) = \frac{p(r_{i1} \mid \phi_i)\, p(\phi_i)}{\int_{\phi_i} p(r_{i1} \mid \phi_i)\, p(\phi_i)\, d\phi_i}$
     In general computing this update may be tricky
     But sometimes it can be done analytically
     If the parametric representation of the prior and posterior is the same, the prior and model are called conjugate. For example, exponential families have conjugate priors
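     As a concrete instance (a standard example, not spelled out on this slide): for a Bernoulli arm with a Beta prior, the update above has a closed form and the posterior stays in the Beta family:
     Prior: $p(\phi_i) = \mathrm{Beta}(\phi_i \mid \alpha, \beta) \propto \phi_i^{\alpha-1}(1-\phi_i)^{\beta-1}$
     Likelihood: $p(r_{i1} \mid \phi_i) = \phi_i^{r_{i1}}(1-\phi_i)^{1-r_{i1}}$ for $r_{i1} \in \{0, 1\}$
     Posterior: $p(\phi_i \mid r_{i1}) \propto \phi_i^{\alpha + r_{i1} - 1}(1-\phi_i)^{\beta + (1-r_{i1}) - 1} = \mathrm{Beta}(\phi_i \mid \alpha + r_{i1},\ \beta + 1 - r_{i1})$
     Each pull therefore just increments one of the two Beta parameters, which is what makes exact posterior updates cheap for Bernoulli bandits.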
