
Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

AISTATS 2020

Motivating example: clinical trials

  • We do not have complete information about the effectiveness or side-effects of the drugs.
  • Aim: Infer the “best” drug by running a sequence of trials.
  • Abstraction to multi-armed bandits: each drug choice is mapped to an arm, and the drug’s effectiveness is mapped to the arm’s reward.
  • Administering a drug is an action equivalent to pulling the corresponding arm. The trial goes on for T rounds.
Bandits 101: problem setup

Initialize the expected rewards according to some prior knowledge.
for t = 1 → T do
    SELECT: Use a bandit algorithm to decide which arm to pull.
    ACT and OBSERVE: Pull the selected arm and observe the reward.
    UPDATE: Update the estimated reward for the arm(s).
end

  • Stochastic bandits: the reward for each arm is sampled i.i.d. from its underlying distribution.
  • Objective: minimize the expected cumulative regret R(T):

        R(T) = Σ_{t=1}^{T} ( E[reward of the best arm] − E[reward of the arm pulled in round t] )

  • Minimizing R(T) boils down to an exploration-exploitation trade-off (a minimal code sketch of the interaction loop follows below).
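The SELECT / ACT & OBSERVE / UPDATE loop above is easy to express in code. Below is a minimal illustrative Python sketch of the interaction protocol and of how the expected cumulative regret is accumulated. The BernoulliBandit environment, the RandomAgent placeholder, and the select/update interface are hypothetical names introduced only for this example, not part of the paper's code.

    import numpy as np

    class BernoulliBandit:
        """Stochastic K-armed bandit: each arm's reward is Bernoulli(means[i])."""
        def __init__(self, means, seed=0):
            self.means = np.asarray(means, dtype=float)
            self.rng = np.random.default_rng(seed)

        def pull(self, arm):
            # Reward is sampled i.i.d. from the arm's underlying distribution.
            return float(self.rng.random() < self.means[arm])

    class RandomAgent:
        """Placeholder agent: pulls arms uniformly at random."""
        def __init__(self, K, seed=1):
            self.K = K
            self.rng = np.random.default_rng(seed)

        def select(self, t):
            return int(self.rng.integers(self.K))

        def update(self, arm, reward):
            pass

    def run_bandit(env, agent, T):
        """Generic bandit loop; returns the expected cumulative regret R(T)."""
        best_mean = env.means.max()
        regret = 0.0
        for t in range(1, T + 1):
            arm = agent.select(t)                 # SELECT
            reward = env.pull(arm)                # ACT and OBSERVE
            agent.update(arm, reward)             # UPDATE
            regret += best_mean - env.means[arm]  # per-round expected regret
        return regret

    env = BernoulliBandit([0.2, 0.5, 0.7])
    print(run_bandit(env, RandomAgent(K=3), T=1000))

Any bandit algorithm that exposes the same select/update interface can be plugged into this loop.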

Bandits 101: structured bandits

  • In problems with a large number of arms, learning about each arm separately is inefficient ⇒ use a shared parameterization for the arms.
  • Structured bandits: each arm i has a feature vector x_i, and there exists an unknown vector θ* such that E[reward of arm i] = g(x_i, θ*).
  • Linear bandits: g(x_i, θ*) = ⟨x_i, θ*⟩.
  • Generalized linear bandits: g is a strictly increasing, differentiable link function, e.g. g(x_i, θ*) = 1/(1 + exp(−⟨x_i, θ*⟩)) for logistic bandits (see the short sketch below).
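As a small illustration of the two reward models, here is a minimal Python sketch. The function names are hypothetical, and x and theta_star stand in for x_i and θ*:

    import numpy as np

    def linear_reward(x, theta_star):
        # Linear bandit: expected reward is the inner product <x_i, theta*>.
        return float(np.dot(x, theta_star))

    def logistic_reward(x, theta_star):
        # Logistic bandit: a generalized linear model with the sigmoid link.
        return float(1.0 / (1.0 + np.exp(-np.dot(x, theta_star))))

    x = np.array([0.5, -0.2, 1.0])
    theta_star = np.array([0.3, 0.8, -0.1])
    print(linear_reward(x, theta_star), logistic_reward(x, theta_star))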

Bandits 101: algorithms

  • Optimism in the Face of Uncertainty (OFU): uses closed-form high-probability confidence sets.
      • Theoretically optimal. Does not depend on the exact distribution of rewards.
      • Poor empirical performance on typical problem instances.
  • Thompson Sampling (TS): a randomized strategy that samples from a posterior distribution.
      • Good empirical performance on typical problem instances.
      • Depends on the reward distributions. Computationally expensive in the absence of closed-form posteriors. Theoretically sub-optimal in the (generalized) linear bandit setting.

Can we obtain the best of OFU and TS?

The RandUCB meta-algorithm: theoretical study

RandUCB Meta-algorithm

  • Generic OFU algorithm: if µ_i(t) is the estimated mean reward for arm i at round t and C_i(t) is the corresponding confidence interval, pick the arm with the largest upper confidence bound:

        i_t = argmax_{i ∈ [K]} { µ_i(t) + β C_i(t) }.

    Here, β is deterministic and chosen to trade off exploration and exploitation optimally.

  • RandUCB: replace the deterministic β by a random variable Z_t:

        i_t = argmax_{i ∈ [K]} { µ_i(t) + Z_t C_i(t) },

    where Z_1, . . . , Z_T are i.i.d. samples from the sampling distribution.

  • Uncoupled RandUCB: draw an independent sample Z_{i,t} for each arm:

        i_t = argmax_{i ∈ [K]} { µ_i(t) + Z_{i,t} C_i(t) }.

    (A code sketch of these selection rules follows below.)
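To make the selection rules concrete, here is a minimal sketch of the coupled and uncoupled variants, assuming the per-arm means mu and confidence widths conf have already been computed. The array names and the sample_z helper are hypothetical, introduced only for illustration:

    import numpy as np

    def randucb_select(mu, conf, sample_z, coupled=True, rng=None):
        """Pick argmax_i { mu[i] + Z * conf[i] } with a random exploration scale Z."""
        rng = rng or np.random.default_rng()
        K = len(mu)
        if coupled:
            z = sample_z(rng)                                    # one Z_t shared by all arms
            scores = mu + z * conf
        else:
            z = np.array([sample_z(rng) for _ in range(K)])      # one Z_{i,t} per arm
            scores = mu + z * conf
        return int(np.argmax(scores))

With sample_z returning the deterministic constant β, this reduces to the generic OFU rule.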

RandUCB Meta-algorithm

  • General sampling distribution: a discrete distribution on the interval [L, U], supported on M equally-spaced points α_1 = L, . . . , α_M = U. Define p_m := P(Z = α_m).
  • Default sampling distribution: a Gaussian truncated to the interval [0, U], with tunable hyper-parameters ε, σ > 0 such that p_M = ε and, for 1 ≤ m ≤ M − 1, p_m ∝ exp(−α_m²/(2σ²)).
  • Default choice across bandit problems: coupled RandUCB with U = O(β), M = 10, ε = 10⁻⁸, σ = 0.25 (constructed explicitly in the sketch below).
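The default distribution is simple to construct explicitly. Below is a sketch of one plausible implementation of the discretized, truncated-Gaussian sampling distribution described above. The function names and argument defaults are my own; the defaults mirror the values M = 10, ε = 10⁻⁸, σ = 0.25 from the slide:

    import numpy as np

    def default_sampling_distribution(U, M=10, eps=1e-8, sigma=0.25):
        """Support points alpha_1..alpha_M on [0, U] and their probabilities p_m.

        p_M is pinned to eps, and the remaining mass follows a discretized
        Gaussian shape proportional to exp(-alpha_m^2 / (2 sigma^2)).
        """
        alphas = np.linspace(0.0, U, M)                # alpha_1 = 0, ..., alpha_M = U
        weights = np.exp(-alphas[:-1] ** 2 / (2 * sigma ** 2))
        probs = np.empty(M)
        probs[:-1] = (1.0 - eps) * weights / weights.sum()   # normalize the first M-1 points
        probs[-1] = eps                                      # p_M = eps
        return alphas, probs

    def sample_z(rng, alphas, probs):
        """Draw one sample Z from the discrete sampling distribution."""
        return float(rng.choice(alphas, p=probs))

    rng = np.random.default_rng(0)
    alphas, probs = default_sampling_distribution(U=2.0)
    print(sample_z(rng, alphas, probs))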

RandUCB for multi-armed bandits

  • Let Y_i(t) be the sum of rewards obtained for arm i until round t and s_i(t) be the number of pulls of arm i until round t. Mean µ_i(t) = Y_i(t)/s_i(t) and confidence interval C_i(t) = √(1/s_i(t)).
  • OFU algorithm for MAB: pull each arm once, and for t > K, pull arm

        i_t = argmax_i { µ_i(t) + β √(1/s_i(t)) }.

  • UCB1 [Auer, Cesa-Bianchi and Fischer 2002]: β = √(2 ln T).
  • RandUCB: L = 0, U = 2√(ln T) (see the code sketch after this list).
  • We can also construct optimistic Thompson sampling and adaptive ε-greedy algorithms.
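Putting the pieces together for the multi-armed case: the sketch below maintains Y_i(t) and s_i(t), uses C_i(t) = √(1/s_i(t)), and scales it by a random Z drawn from a sampling distribution on [0, 2√(ln T)]. It is a minimal illustration in the spirit of the slides, not the authors' reference implementation; the class and method names are my own.

    import numpy as np

    class RandUCBAgentMAB:
        """RandUCB for K-armed bandits: score_i = mu_i + Z * sqrt(1 / s_i)."""

        def __init__(self, K, T, M=10, eps=1e-8, sigma=0.25, coupled=True, seed=0):
            self.K, self.coupled = K, coupled
            self.rng = np.random.default_rng(seed)
            self.rewards = np.zeros(K)   # Y_i(t): summed rewards per arm
            self.pulls = np.zeros(K)     # s_i(t): pull counts per arm
            # Discretized truncated-Gaussian sampling distribution on [0, U].
            U = 2.0 * np.sqrt(np.log(T))
            self.alphas = np.linspace(0.0, U, M)
            w = np.exp(-self.alphas[:-1] ** 2 / (2 * sigma ** 2))
            self.probs = np.append((1 - eps) * w / w.sum(), eps)

        def _z(self, size=None):
            return self.rng.choice(self.alphas, size=size, p=self.probs)

        def select(self, t):
            if t <= self.K:                          # pull each arm once first
                return t - 1
            mu = self.rewards / self.pulls
            conf = np.sqrt(1.0 / self.pulls)
            z = self._z() if self.coupled else self._z(size=self.K)
            return int(np.argmax(mu + z * conf))

        def update(self, arm, reward):
            self.rewards[arm] += reward
            self.pulls[arm] += 1

This agent plugs directly into the interaction loop sketched earlier via the same select/update interface.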

Regret of RandUCB for multi-armed bandits

Theorem 1 (Instance-dependent regret of uncoupled RandUCB for MAB). If ∆_i = µ_1 − µ_i is the gap for arm i, and Z takes M different values 0 ≤ α_1 ≤ · · · ≤ α_M with probabilities p_1, p_2, . . . , p_M, the regret R(T) of uncoupled RandUCB can be bounded as:

    R(T) = O( Σ_{i: ∆_i > 0} ∆_i^{-1} × ( M/p_M + T exp(−2α_M²) + α_M² ) ).

  • Using U = α_M = 2√(ln T) results in the problem-dependent O( ln T × Σ_{i: ∆_i > 0} ∆_i^{-1} ) regret (see the arithmetic sketch below).
  • A standard reduction implies a problem-independent O(√(KT)) regret, matching that of UCB1 and Thompson sampling [Agrawal and Goyal, 2012].
  • We also show the same problem-independent regret for the default coupled variant of RandUCB.
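To see why the choice U = α_M = 2√(ln T) gives the logarithmic rate, one can substitute it into the three terms of the bound. The following is a sketch of the arithmetic, using the form of Theorem 1 as reconstructed above, and using that M and p_M are constants independent of T:

    \[
    \alpha_M = 2\sqrt{\ln T} \;\Longrightarrow\;
    T e^{-2\alpha_M^2} = T e^{-8\ln T} = T^{-7} \le 1,
    \qquad \alpha_M^2 = 4\ln T,
    \]
    \[
    R(T) = O\Big(\sum_{i:\,\Delta_i>0} \Delta_i^{-1}\Big(\tfrac{M}{p_M} + 1 + 4\ln T\Big)\Big)
         = O\Big(\ln T \times \sum_{i:\,\Delta_i>0} \Delta_i^{-1}\Big).
    \]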

RandUCB for linear bandits

  • Let X_t = x_{i_t}, M_t := λ I_d + Σ_{ℓ=1}^{t−1} X_ℓ X_ℓ^T, and θ_t := M_t^{-1} Σ_{ℓ=1}^{t−1} Y_ℓ X_ℓ. Mean µ_i(t) = ⟨θ_t, x_i⟩ and confidence width C_i(t) = ‖x_i‖_{M_t^{-1}}.
  • OFU algorithm for linear bandits: pull arm

        i_t = argmax_{i ∈ [K]} { ⟨θ_t, x_i⟩ + β ‖x_i‖_{M_t^{-1}} }.

  • OFU [Abbasi-Yadkori, Pál and Szepesvári 2011]: β = √λ + √( ½ ln(T² λ^{−d} det(M_t)) ).
  • RandUCB: L = 0, U = 3 ( √λ + √( ½ d ln(T + T²/(dλ)) ) ) (see the code sketch after this list).
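Here is a sketch of how these statistics can be maintained with ridge regression, assuming the arm features are stored in a K×d array arms. The class and method names are illustrative only, not the paper's code:

    import numpy as np

    class LinearBanditStats:
        """Maintain M_t = lambda*I + sum X X^T and theta_t = M_t^{-1} sum Y X."""

        def __init__(self, d, lam=1.0):
            self.M = lam * np.eye(d)     # regularized Gram matrix M_t
            self.b = np.zeros(d)         # running sum of Y_l * X_l

        def update(self, x, y):
            self.M += np.outer(x, x)
            self.b += y * x

        def scores(self, arms, z):
            """Return mu_i + z * C_i for each arm, where C_i = ||x_i||_{M^{-1}}."""
            M_inv = np.linalg.inv(self.M)
            theta = M_inv @ self.b                       # ridge-regression estimate theta_t
            mu = arms @ theta                            # <theta_t, x_i>
            conf = np.sqrt(np.einsum('ij,jk,ik->i', arms, M_inv, arms))  # ||x_i||_{M^{-1}}
            return mu + z * conf

At round t one would draw Z from the sampling distribution with U = 3(√λ + √(½ d ln(T + T²/(dλ)))) and pull the arm maximizing scores(arms, z).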

Regret of RandUCB for linear bandits

Theorem 2. Let c_1 = √λ + √( ½ d ln(T + T²/(dλ)) ) and c_3 := 2d ln(1 + T/(dλ)). For any c_2 > c_1, the regret of RandUCB for linear bandits is bounded by:

    (c_1 + c_2) ( 1 + 2 / ( P(Z > c_1) − P(|Z| > c_2) ) ) × √(c_3 T) + T · P(|Z| > c_2) + 1.

  • Setting U = 3c_1 and choosing c_2 ≥ U ensures that P(Z > c_1) is a positive constant and P(|Z| > c_2) = 0, resulting in an O(d√T) regret bound (see the rate sketch below).
  • The regret bound does not depend on K and holds for infinitely many arms.
  • Matches the bound of OFU in [Abbasi-Yadkori et al., 2011] and is better than the O(d^{3/2} √T) bound for TS [Agrawal and Goyal, 2013].
  • We prove a similar O(d√T) bound for generalized linear bandits.
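As a rough sanity check of the stated rate, under the reconstruction above and treating λ as a constant:

    \[
    c_1 = \Theta\big(\sqrt{d\ln T}\big), \quad c_2 = U = 3c_1, \quad c_3 = \Theta(d\ln T)
    \;\Longrightarrow\;
    (c_1 + c_2)\sqrt{c_3 T} = \Theta\big(d\ln T\,\sqrt{T}\big) = \tilde{O}\big(d\sqrt{T}\big),
    \]

and since P(|Z| > c_2) = 0 and P(Z > c_1) is bounded below by a constant, the remaining terms contribute at most a constant.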

The RandUCB meta-algorithm: empirical study

Experiments - multi-armed bandit

  • B-TS: Thompson Sampling with a Beta posterior.
  • KL-UCB [Garivier and Cappé, 2011]: UCB with tighter confidence intervals.
  • Randomized exploration baselines: Giro [Kveton et al., 2019c], PHE [Kveton et al., 2019b].

Experiments - linear bandit

  • Lin-TS: Thompson Sampling with a Gaussian posterior
  • ε-greedy [Langford and Zhang, 2008]
  • Randomized exploration baseline: LinPHE [Kveton et al., 2019a]

Experiments - logistic bandit

  • GLM-TS [Kveton et al., 2019d]: TS with a Laplace approximation to the posterior.
  • GLM-UCB [Filippi et al., 2010] and UCB-GLM [Li et al., 2017]
  • ε-greedy [Langford and Zhang, 2008]
  • Randomized exploration baseline: LogPHE [Kveton et al., 2019d]

Proposed RandUCB, a generic meta-algorithm achieving the theoretical performance of UCB and the practical performance of Thompson sampling.

Paper: https://arxiv.org/abs/1910.04928
Code: https://github.com/vaswanis/randucb

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2011.

Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, 2012.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, 2013.

Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In NIPS, 2010.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, 2011.

Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic linear bandits. In UAI, 2019a.

Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic multi-armed bandits. In IJCAI, 2019b.

Branislav Kveton, Csaba Szepesvári, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad Ghavamzadeh. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In ICML, 2019c.

Branislav Kveton, Manzil Zaheer, Csaba Szepesvári, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. arXiv:1906.08947, 2019d.

John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2008.

Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In ICML, 2017.