
Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

AISTATS 2020

Motivating example: clinical trials

  • We do not have complete information about the effectiveness or side-effects of the drugs.
  • Aim: Infer the “best” drug by running a sequence of trials.
  • Abstraction to multi-armed bandits: each drug choice is mapped to an arm, and the drug’s effectiveness is mapped to the arm’s reward.
  • Administering a drug is an action equivalent to pulling the corresponding arm. The trial goes on for T rounds.
Bandits 101: problem setup

Initialize the expected rewards according to some prior knowledge.
for t = 1 → T do
    SELECT: Use a bandit algorithm to decide which arm to pull.
    ACT and OBSERVE: Pull the selected arm and observe the reward.
    UPDATE: Update the estimated reward for the arm(s).
end

  • Stochastic bandits: the reward for each arm is sampled i.i.d. from its underlying distribution.
  • Objective: minimize the expected cumulative regret R(T):

        R(T) = Σ_{t=1}^{T} ( E[reward of the best arm] − E[reward of the arm pulled in round t] )

  • Minimizing R(T) boils down to an exploration-exploitation trade-off (a minimal code sketch of the interaction loop follows below).
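The SELECT / ACT & OBSERVE / UPDATE loop above is easy to express in code. Below is a minimal illustrative Python sketch of the interaction protocol and of how the expected cumulative regret is accumulated. The BernoulliBandit environment, the RandomAgent placeholder, and the select/update interface are hypothetical names introduced only for this example, not part of the paper's code.

    import numpy as np

    class BernoulliBandit:
        """Stochastic K-armed bandit: each arm's reward is Bernoulli(means[i])."""
        def __init__(self, means, seed=0):
            self.means = np.asarray(means, dtype=float)
            self.rng = np.random.default_rng(seed)

        def pull(self, arm):
            # Reward is sampled i.i.d. from the arm's underlying distribution.
            return float(self.rng.random() < self.means[arm])

    class RandomAgent:
        """Placeholder agent: pulls arms uniformly at random."""
        def __init__(self, K, seed=1):
            self.K = K
            self.rng = np.random.default_rng(seed)

        def select(self, t):
            return int(self.rng.integers(self.K))

        def update(self, arm, reward):
            pass

    def run_bandit(env, agent, T):
        """Generic bandit loop; returns the expected cumulative regret R(T)."""
        best_mean = env.means.max()
        regret = 0.0
        for t in range(1, T + 1):
            arm = agent.select(t)                 # SELECT
            reward = env.pull(arm)                # ACT and OBSERVE
            agent.update(arm, reward)             # UPDATE
            regret += best_mean - env.means[arm]  # per-round expected regret
        return regret

    env = BernoulliBandit([0.2, 0.5, 0.7])
    print(run_bandit(env, RandomAgent(K=3), T=1000))

Any bandit algorithm that exposes the same select/update interface can be plugged into this loop.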

Bandits 101: structured bandits

  • In problems with a large number of arms, learning about each arm separately is inefficient ⇒ use a shared parameterization for the arms.
  • Structured bandits: each arm i has a feature vector x_i, and there exists an unknown vector θ* such that E[reward of arm i] = g(x_i, θ*).
  • Linear bandits: g(x_i, θ*) = ⟨x_i, θ*⟩.
  • Generalized linear bandits: g is a strictly increasing, differentiable link function, e.g. g(x_i, θ*) = 1/(1 + exp(−⟨x_i, θ*⟩)) for logistic bandits (see the short sketch below).
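As a small illustration of the two reward models, here is a minimal Python sketch. The function names are hypothetical, and x and theta_star stand in for x_i and θ*:

    import numpy as np

    def linear_reward(x, theta_star):
        # Linear bandit: expected reward is the inner product <x_i, theta*>.
        return float(np.dot(x, theta_star))

    def logistic_reward(x, theta_star):
        # Logistic bandit: a generalized linear model with the sigmoid link.
        return float(1.0 / (1.0 + np.exp(-np.dot(x, theta_star))))

    x = np.array([0.5, -0.2, 1.0])
    theta_star = np.array([0.3, 0.8, -0.1])
    print(linear_reward(x, theta_star), logistic_reward(x, theta_star))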

Bandits 101: algorithms

  • Optimism in the Face of Uncertainty (OFU): uses closed-form high-probability confidence sets.
      • Theoretically optimal. Does not depend on the exact distribution of rewards.
      • Poor empirical performance on typical problem instances.
  • Thompson Sampling (TS): a randomized strategy that samples from a posterior distribution.
      • Good empirical performance on typical problem instances.
      • Depends on the reward distributions. Computationally expensive in the absence of closed-form posteriors. Theoretically sub-optimal in the (generalized) linear bandit setting.

Can we obtain the best of OFU and TS?

The RandUCB meta-algorithm: theoretical study

RandUCB Meta-algorithm

  • Generic OFU algorithm: if µ_i(t) is the estimated mean reward for arm i at round t and C_i(t) is the corresponding confidence interval, pick the arm with the largest upper confidence bound:

        i_t = argmax_{i ∈ [K]} { µ_i(t) + β C_i(t) }.

    Here, β is deterministic and chosen to trade off exploration and exploitation optimally.

  • RandUCB: replace the deterministic β by a random variable Z_t:

        i_t = argmax_{i ∈ [K]} { µ_i(t) + Z_t C_i(t) },

    where Z_1, . . . , Z_T are i.i.d. samples from the sampling distribution.

  • Uncoupled RandUCB: draw an independent sample Z_{i,t} for each arm:

        i_t = argmax_{i ∈ [K]} { µ_i(t) + Z_{i,t} C_i(t) }.

    (A code sketch of these selection rules follows below.)
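To make the selection rules concrete, here is a minimal sketch of the coupled and uncoupled variants, assuming the per-arm means mu and confidence widths conf have already been computed. The array names and the sample_z helper are hypothetical, introduced only for illustration:

    import numpy as np

    def randucb_select(mu, conf, sample_z, coupled=True, rng=None):
        """Pick argmax_i { mu[i] + Z * conf[i] } with a random exploration scale Z."""
        rng = rng or np.random.default_rng()
        K = len(mu)
        if coupled:
            z = sample_z(rng)                                    # one Z_t shared by all arms
            scores = mu + z * conf
        else:
            z = np.array([sample_z(rng) for _ in range(K)])      # one Z_{i,t} per arm
            scores = mu + z * conf
        return int(np.argmax(scores))

With sample_z returning the deterministic constant β, this reduces to the generic OFU rule.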

RandUCB Meta-algorithm

  • General sampling distribution: a discrete distribution on the interval [L, U], supported on M equally-spaced points α_1 = L, . . . , α_M = U. Define p_m := P(Z = α_m).
  • Default sampling distribution: a Gaussian truncated to the interval [0, U], with tunable hyper-parameters ε, σ > 0 such that p_M = ε and, for 1 ≤ m ≤ M − 1, p_m ∝ exp(−α_m²/(2σ²)).
  • Default choice across bandit problems: coupled RandUCB with U = O(β), M = 10, ε = 10⁻⁸, σ = 0.25 (constructed explicitly in the sketch below).
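The default distribution is simple to construct explicitly. Below is a sketch of one plausible implementation of the discretized, truncated-Gaussian sampling distribution described above. The function names and argument defaults are my own; the defaults mirror the values M = 10, ε = 10⁻⁸, σ = 0.25 from the slide:

    import numpy as np

    def default_sampling_distribution(U, M=10, eps=1e-8, sigma=0.25):
        """Support points alpha_1..alpha_M on [0, U] and their probabilities p_m.

        p_M is pinned to eps, and the remaining mass follows a discretized
        Gaussian shape proportional to exp(-alpha_m^2 / (2 sigma^2)).
        """
        alphas = np.linspace(0.0, U, M)                # alpha_1 = 0, ..., alpha_M = U
        weights = np.exp(-alphas[:-1] ** 2 / (2 * sigma ** 2))
        probs = np.empty(M)
        probs[:-1] = (1.0 - eps) * weights / weights.sum()   # normalize the first M-1 points
        probs[-1] = eps                                      # p_M = eps
        return alphas, probs

    def sample_z(rng, alphas, probs):
        """Draw one sample Z from the discrete sampling distribution."""
        return float(rng.choice(alphas, p=probs))

    rng = np.random.default_rng(0)
    alphas, probs = default_sampling_distribution(U=2.0)
    print(sample_z(rng, alphas, probs))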

RandUCB for multi-armed bandits

  • Let Y_i(t) be the sum of rewards obtained for arm i until round t and s_i(t) be the number of pulls of arm i until round t. Mean µ_i(t) = Y_i(t)/s_i(t) and confidence interval C_i(t) = √(1/s_i(t)).
  • OFU algorithm for MAB: pull each arm once, and for t > K, pull arm

        i_t = argmax_i { µ_i(t) + β √(1/s_i(t)) }.

  • UCB1 [Auer, Cesa-Bianchi and Fischer 2002]: β = √(2 ln T).
  • RandUCB: L = 0, U = 2√(ln T) (see the code sketch after this list).
  • We can also construct optimistic Thompson sampling and adaptive ε-greedy algorithms.
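Putting the pieces together for the multi-armed case: the sketch below maintains Y_i(t) and s_i(t), uses C_i(t) = √(1/s_i(t)), and scales it by a random Z drawn from a sampling distribution on [0, 2√(ln T)]. It is a minimal illustration in the spirit of the slides, not the authors' reference implementation; the class and method names are my own.

    import numpy as np

    class RandUCBAgentMAB:
        """RandUCB for K-armed bandits: score_i = mu_i + Z * sqrt(1 / s_i)."""

        def __init__(self, K, T, M=10, eps=1e-8, sigma=0.25, coupled=True, seed=0):
            self.K, self.coupled = K, coupled
            self.rng = np.random.default_rng(seed)
            self.rewards = np.zeros(K)   # Y_i(t): summed rewards per arm
            self.pulls = np.zeros(K)     # s_i(t): pull counts per arm
            # Discretized truncated-Gaussian sampling distribution on [0, U].
            U = 2.0 * np.sqrt(np.log(T))
            self.alphas = np.linspace(0.0, U, M)
            w = np.exp(-self.alphas[:-1] ** 2 / (2 * sigma ** 2))
            self.probs = np.append((1 - eps) * w / w.sum(), eps)

        def _z(self, size=None):
            return self.rng.choice(self.alphas, size=size, p=self.probs)

        def select(self, t):
            if t <= self.K:                          # pull each arm once first
                return t - 1
            mu = self.rewards / self.pulls
            conf = np.sqrt(1.0 / self.pulls)
            z = self._z() if self.coupled else self._z(size=self.K)
            return int(np.argmax(mu + z * conf))

        def update(self, arm, reward):
            self.rewards[arm] += reward
            self.pulls[arm] += 1

This agent plugs directly into the interaction loop sketched earlier via the same select/update interface.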

Regret of RandUCB for multi-armed bandits

Theorem 1 (Instance-dependent regret of uncoupled RandUCB for MAB). If ∆_i = µ_1 − µ_i is the gap for arm i, and Z takes M different values 0 ≤ α_1 ≤ · · · ≤ α_M with probabilities p_1, p_2, . . . , p_M, the regret R(T) of uncoupled RandUCB can be bounded as:

    R(T) = O( Σ_{i: ∆_i > 0} ∆_i^{-1} × ( M/p_M + T exp(−2α_M²) + α_M² ) ).

  • Using U = α_M = 2√(ln T) results in the problem-dependent O( ln T × Σ_{i: ∆_i > 0} ∆_i^{-1} ) regret (see the arithmetic sketch below).
  • A standard reduction implies a problem-independent O(√(KT)) regret, matching that of UCB1 and Thompson sampling [Agrawal and Goyal, 2012].
  • We also show the same problem-independent regret for the default coupled variant of RandUCB.
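To see why the choice U = α_M = 2√(ln T) gives the logarithmic rate, one can substitute it into the three terms of the bound. The following is a sketch of the arithmetic, using the form of Theorem 1 as reconstructed above, and using that M and p_M are constants independent of T:

    \[
    \alpha_M = 2\sqrt{\ln T} \;\Longrightarrow\;
    T e^{-2\alpha_M^2} = T e^{-8\ln T} = T^{-7} \le 1,
    \qquad \alpha_M^2 = 4\ln T,
    \]
    \[
    R(T) = O\Big(\sum_{i:\,\Delta_i>0} \Delta_i^{-1}\Big(\tfrac{M}{p_M} + 1 + 4\ln T\Big)\Big)
         = O\Big(\ln T \times \sum_{i:\,\Delta_i>0} \Delta_i^{-1}\Big).
    \]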

RandUCB for linear bandits

  • Let X_t = x_{i_t}, M_t := λ I_d + Σ_{ℓ=1}^{t−1} X_ℓ X_ℓ^T, and θ_t := M_t^{-1} Σ_{ℓ=1}^{t−1} Y_ℓ X_ℓ. Mean µ_i(t) = ⟨θ_t, x_i⟩ and confidence width C_i(t) = ‖x_i‖_{M_t^{-1}}.
  • OFU algorithm for linear bandits: pull arm

        i_t = argmax_{i ∈ [K]} { ⟨θ_t, x_i⟩ + β ‖x_i‖_{M_t^{-1}} }.

  • OFU [Abbasi-Yadkori, Pál and Szepesvári 2011]: β = √λ + √( ½ ln(T² λ^{−d} det(M_t)) ).
  • RandUCB: L = 0, U = 3 ( √λ + √( ½ d ln(T + T²/(dλ)) ) ) (see the code sketch after this list).
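Here is a sketch of how these statistics can be maintained with ridge regression, assuming the arm features are stored in a K×d array arms. The class and method names are illustrative only, not the paper's code:

    import numpy as np

    class LinearBanditStats:
        """Maintain M_t = lambda*I + sum X X^T and theta_t = M_t^{-1} sum Y X."""

        def __init__(self, d, lam=1.0):
            self.M = lam * np.eye(d)     # regularized Gram matrix M_t
            self.b = np.zeros(d)         # running sum of Y_l * X_l

        def update(self, x, y):
            self.M += np.outer(x, x)
            self.b += y * x

        def scores(self, arms, z):
            """Return mu_i + z * C_i for each arm, where C_i = ||x_i||_{M^{-1}}."""
            M_inv = np.linalg.inv(self.M)
            theta = M_inv @ self.b                       # ridge-regression estimate theta_t
            mu = arms @ theta                            # <theta_t, x_i>
            conf = np.sqrt(np.einsum('ij,jk,ik->i', arms, M_inv, arms))  # ||x_i||_{M^{-1}}
            return mu + z * conf

At round t one would draw Z from the sampling distribution with U = 3(√λ + √(½ d ln(T + T²/(dλ)))) and pull the arm maximizing scores(arms, z).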

Regret of RandUCB for linear bandits

Theorem 2. Let c_1 = √λ + √( ½ d ln(T + T²/(dλ)) ) and c_3 := 2d ln(1 + T/(dλ)). For any c_2 > c_1, the regret of RandUCB for linear bandits is bounded by:

    (c_1 + c_2) ( 1 + 2 / ( P(Z > c_1) − P(|Z| > c_2) ) ) × √(c_3 T) + T · P(|Z| > c_2) + 1.

  • Setting U = 3c_1 and choosing c_2 ≥ U ensures that P(Z > c_1) is a positive constant and P(|Z| > c_2) = 0, resulting in an O(d√T) regret bound (see the rate sketch below).
  • The regret bound does not depend on K and holds for infinitely many arms.
  • Matches the bound of OFU in [Abbasi-Yadkori et al., 2011] and is better than the O(d^{3/2} √T) bound for TS [Agrawal and Goyal, 2013].
  • We prove a similar O(d√T) bound for generalized linear bandits.
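As a rough sanity check of the stated rate, under the reconstruction above and treating λ as a constant:

    \[
    c_1 = \Theta\big(\sqrt{d\ln T}\big), \quad c_2 = U = 3c_1, \quad c_3 = \Theta(d\ln T)
    \;\Longrightarrow\;
    (c_1 + c_2)\sqrt{c_3 T} = \Theta\big(d\ln T\,\sqrt{T}\big) = \tilde{O}\big(d\sqrt{T}\big),
    \]

and since P(|Z| > c_2) = 0 and P(Z > c_1) is bounded below by a constant, the remaining terms contribute at most a constant.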

The RandUCB meta-algorithm: empirical study

Experiments - multi-armed bandit

  • B-TS: Thompson Sampling with a Beta posterior.
  • KL-UCB [Garivier and Cappé, 2011]: UCB with tighter confidence intervals.
  • Randomized exploration baselines: Giro [Kveton et al., 2019c], PHE [Kveton et al., 2019b].

Experiments - linear bandit

  • Lin-TS: Thompson Sampling with a Gaussian posterior
  • ε-greedy [Langford and Zhang, 2008]
  • Randomized exploration baseline: LinPHE [Kveton et al., 2019a]

Experiments - logistic bandit

  • GLM-TS [Kveton et al., 2019d]: TS with a Laplace approximation to the posterior.
  • GLM-UCB [Filippi et al., 2010] and UCB-GLM [Li et al., 2017]
  • ε-greedy [Langford and Zhang, 2008]
  • Randomized exploration baseline: LogPHE [Kveton et al., 2019d]

Proposed RandUCB, a generic meta-algorithm achieving the theoretical performance of UCB and the practical performance of Thompson sampling.

Paper: https://arxiv.org/abs/1910.04928
Code: https://github.com/vaswanis/randucb

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2011.

Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, 2012.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, 2013.

Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In NIPS, 2010.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, 2011.

Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic linear bandits. In UAI, 2019a.

Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic multi-armed bandits. In IJCAI, 2019b.

Branislav Kveton, Csaba Szepesvári, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad Ghavamzadeh. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In ICML, 2019c.

Branislav Kveton, Manzil Zaheer, Csaba Szepesvári, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. arXiv:1906.08947, 2019d.

John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2008.

Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In ICML, 2017.