

  1. Old Dog Learns New Tricks: Randomized UCB for Bandit Problems (AISTATS 2020)

  2-4. Motivating example: clinical trials
  • Do not have complete information about the effectiveness or side-effects of the drugs.
  • Aim: Infer the "best" drug by running a sequence of trials.
  • Abstraction to multi-armed bandits: Each drug choice is mapped to an arm and the drug's effectiveness is mapped to the arm's reward.
  • Administering a drug is an action that is equivalent to pulling the corresponding arm. The trial goes on for T rounds.

  5-8. Bandits 101: problem setup

  Initialize the expected rewards according to some prior knowledge.
  for t = 1 → T do
      SELECT: Use a bandit algorithm to decide which arm to pull.
      ACT and OBSERVE: Pull the selected arm and observe the reward.
      UPDATE: Update the estimated reward for the arm(s).
  end

  • Stochastic bandits: The reward for each arm is sampled i.i.d. from its underlying distribution.
  • Objective: Minimize the expected cumulative regret R(T):

        R(T) = \sum_{t=1}^{T} \Big( \mathbb{E}[\text{reward of the best arm}] - \mathbb{E}[\text{reward of the arm pulled in round } t] \Big)

  • Minimizing R(T) boils down to an exploration-exploitation trade-off.
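To make the interaction protocol concrete, here is a minimal Python sketch of the loop above, assuming Bernoulli rewards and a caller-supplied select_arm function standing in for any bandit algorithm (names and details are illustrative, not from the paper):

    import numpy as np

    def run_bandit(select_arm, true_means, T, rng=np.random.default_rng(0)):
        """Generic stochastic-bandit loop with Bernoulli rewards (illustrative only)."""
        K = len(true_means)
        pulls = np.zeros(K)          # s_i(t): number of pulls of each arm so far
        reward_sums = np.zeros(K)    # Y_i(t): cumulative reward of each arm so far
        regret = 0.0
        best_mean = max(true_means)
        for t in range(1, T + 1):
            # SELECT: the bandit algorithm picks an arm from the current statistics.
            i = select_arm(t, reward_sums, pulls)
            # ACT and OBSERVE: pull the selected arm and observe a stochastic reward.
            reward = rng.binomial(1, true_means[i])
            # UPDATE: refresh the empirical statistics for the pulled arm.
            pulls[i] += 1
            reward_sums[i] += reward
            # Expected regret accumulates the gap between the best and the chosen arm.
            regret += best_mean - true_means[i]
        return regret

The SELECT / ACT and OBSERVE / UPDATE steps map one-to-one onto the comments, and the returned value tracks the expected cumulative regret R(T).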

  9. Bandits 101: structured bandits
  • In problems with a large number of arms, learning about each arm separately is inefficient ⟹ use a shared parameterization for the arms.
  • Structured bandits: Each arm i has a feature vector x_i and there exists an unknown vector θ* such that E[reward for arm i] = g(x_i, θ*).
  • Linear bandits: g(x_i, θ*) = ⟨x_i, θ*⟩.
  • Generalized linear bandits: g is a strictly increasing, differentiable link function, e.g. g(x_i, θ*) = 1 / (1 + exp(−⟨x_i, θ*⟩)) for logistic bandits.
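For illustration, a short Python sketch (hypothetical, not code from the paper) of how the shared parameterization determines every arm's expected reward under the linear and logistic link functions:

    import numpy as np

    def expected_rewards(X, theta_star, link="linear"):
        """Expected reward of each arm under a shared parameter theta_star.

        X is a (K, d) matrix whose rows are the arm feature vectors x_i.
        """
        scores = X @ theta_star                      # <x_i, theta*> for every arm
        if link == "linear":                         # linear bandit: g is the identity
            return scores
        if link == "logistic":                       # logistic bandit: sigmoid link
            return 1.0 / (1.0 + np.exp(-scores))
        raise ValueError(f"unknown link: {link}")

Only θ* is unknown to the learner; the features x_i are observed, so estimating θ* shares information across all arms.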

  10-11. Bandits 101: algorithms
  • Optimism in the Face of Uncertainty (OFU): Uses closed-form high-probability confidence sets.
      • Theoretically optimal. Does not depend on the exact distribution of rewards.
      • Poor empirical performance on typical problem instances.
  • Thompson Sampling (TS): Randomized strategy that samples from a posterior distribution.
      • Good empirical performance on typical problem instances.
      • Depends on the reward distributions. Computationally expensive in the absence of closed-form posteriors. Theoretically sub-optimal in the (generalized) linear bandit setting.

  Can we obtain the best of OFU and TS?

  12. The RandUCB meta-algorithm: theoretical study

  13-15. The RandUCB meta-algorithm
  • Generic OFU algorithm: If μ̂_i(t) is the estimated mean reward for arm i at round t and C_i(t) is the corresponding confidence set, pick the arm with the largest upper confidence bound:

        i_t = \arg\max_{i \in [K]} \{ \hat{\mu}_i(t) + \beta \, C_i(t) \}.

    Here, β is deterministic and chosen to trade off exploration and exploitation optimally.
  • RandUCB: Replace the deterministic β by a random variable Z_t:

        i_t = \arg\max_{i \in [K]} \{ \hat{\mu}_i(t) + Z_t \, C_i(t) \}.

    Z_1, ..., Z_T are i.i.d. samples from the sampling distribution.
  • Uncoupled RandUCB: Draw an independent sample Z_{i,t} for each arm:

        i_t = \arg\max_{i \in [K]} \{ \hat{\mu}_i(t) + Z_{i,t} \, C_i(t) \}.
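A minimal Python sketch of this selection rule (the helper names are mine, not the paper's); the only change relative to a deterministic UCB index is that β is replaced by a freshly drawn sample Z:

    import numpy as np

    def randucb_select(mu_hat, conf, sample_z, coupled=True, rng=np.random.default_rng()):
        """Pick an arm via the RandUCB index mu_hat[i] + Z * conf[i].

        mu_hat[i] : empirical mean reward of arm i
        conf[i]   : confidence-interval width C_i(t) of arm i
        sample_z  : callable drawing one sample from the sampling distribution
        coupled   : if True, one Z_t is shared by all arms; otherwise each arm i
                    gets its own independent Z_{i,t}.
        """
        K = len(mu_hat)
        if coupled:
            z = sample_z(rng) * np.ones(K)                    # one Z_t for every arm
        else:
            z = np.array([sample_z(rng) for _ in range(K)])   # Z_{i,t} per arm
        return int(np.argmax(mu_hat + z * conf))

With coupled=True a single Z_t perturbs every arm's index identically; with coupled=False each arm gets its own Z_{i,t}, matching the uncoupled variant.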

  16-18. The RandUCB meta-algorithm
  • General sampling distribution: A discrete distribution on the interval [L, U], supported on M equally-spaced points α_1 = L, ..., α_M = U. Define p_m := P(Z = α_m).
  • Default sampling distribution: A Gaussian distribution truncated to the interval [0, U], with tunable hyper-parameters ε, σ > 0 such that p_M = ε and, for 1 ≤ m ≤ M − 1,

        p_m \propto \exp(-\alpha_m^2 / (2\sigma^2)).

  • Default choice across bandit problems: Coupled RandUCB with U = O(β), M = 10, ε = 10^{-8}, σ = 0.25.
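These defaults translate directly into code; the following illustrative sketch (function and variable names are mine) builds the M support points and their probabilities and returns a sampler that can play the role of sample_z in the previous sketch:

    import numpy as np

    def default_sampling_distribution(U, M=10, eps=1e-8, sigma=0.25):
        """Discretized Gaussian on [0, U]: p_M = eps, p_m ∝ exp(-alpha_m^2 / (2 sigma^2))."""
        alphas = np.linspace(0.0, U, M)                  # alpha_1 = 0, ..., alpha_M = U
        probs = np.exp(-alphas ** 2 / (2 * sigma ** 2))  # unnormalized Gaussian weights
        probs[-1] = 0.0
        probs = (1.0 - eps) * probs / probs.sum()        # first M-1 points share mass 1 - eps
        probs[-1] = eps                                  # top point alpha_M = U gets mass eps
        def sample_z(rng=np.random.default_rng()):
            return rng.choice(alphas, p=probs)
        return alphas, probs, sample_z

Intuitively, the mass ε on the top point α_M = U retains a small amount of full-width optimism, while the Gaussian weights put most of the mass on smaller exploration bonuses.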

  19-22. RandUCB for multi-armed bandits
  • Let Y_i(t) be the sum of rewards obtained for arm i until round t and s_i(t) be the number of pulls of arm i until round t. Then the empirical mean is μ̂_i(t) = Y_i(t) / s_i(t) and the confidence interval is

        C_i(t) = \sqrt{1 / s_i(t)}.

  • OFU algorithm for MAB: Pull each arm once, and for t > K, pull the arm

        i_t = \arg\max_i \Big\{ \hat{\mu}_i(t) + \beta \sqrt{1 / s_i(t)} \Big\}.

  • UCB1 [Auer, Cesa-Bianchi and Fischer, 2002]: β = \sqrt{2 \ln T}.
  • RandUCB: L = 0, U = \sqrt{2 \ln T}.
  • We can also construct optimistic Thompson sampling and adaptive ε-greedy algorithms.
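Putting these pieces together, here is a self-contained illustrative sketch of RandUCB for multi-armed bandits with Bernoulli rewards and the stated defaults (M = 10, ε = 10⁻⁸, σ = 0.25, U = √(2 ln T)); all names are mine and the details are a plausible reading of the slides rather than the authors' reference implementation:

    import numpy as np

    def randucb_mab(true_means, T, M=10, eps=1e-8, sigma=0.25, seed=0):
        """Coupled RandUCB for stochastic multi-armed bandits with Bernoulli rewards (sketch)."""
        rng = np.random.default_rng(seed)
        K = len(true_means)
        # Default sampling distribution: discretized Gaussian on [0, U] with U = sqrt(2 ln T).
        U = np.sqrt(2 * np.log(T))
        alphas = np.linspace(0.0, U, M)
        probs = np.exp(-alphas ** 2 / (2 * sigma ** 2))
        probs[-1] = 0.0
        probs = (1.0 - eps) * probs / probs.sum()
        probs[-1] = eps
        pulls, sums = np.zeros(K), np.zeros(K)
        for t in range(T):
            if t < K:
                i = t                                    # pull each arm once
            else:
                mu_hat = sums / pulls                    # empirical means Y_i(t) / s_i(t)
                conf = np.sqrt(1.0 / pulls)              # confidence widths C_i(t)
                z = rng.choice(alphas, p=probs)          # coupled: one Z_t shared by all arms
                i = int(np.argmax(mu_hat + z * conf))    # RandUCB index
            reward = rng.binomial(1, true_means[i])      # Bernoulli reward for the chosen arm
            pulls[i] += 1
            sums[i] += reward
        return sums, pulls

For example, randucb_mab([0.3, 0.5, 0.7], T=10000) should end up allocating most of its pulls to the third arm.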
