SLIDE 1

Upper confidence bound strategy on stochastic bandits

Multi-armed bandit: K arms; at each step we choose one arm to pull while the other K−1 arms stay frozen (no reward).

  • Stochastic bandit: Each arm has a fixed reward distribution in all rounds.
  • Adversarial bandit: The arms can change their payout in each round.
  • Markovian bandit: The activated arm changes its state in a 'Markovian' fashion.

We will only look at stochastic bandits and Markovian bandits.

Stochastic bandits

K arms with unknown, fixed probability distributions $\nu_1, \dots, \nu_K$ on $[0, 1]$. At each step $t = 1, 2, \dots$ we choose an arm $I_t \in \{1, \dots, K\}$ and draw a reward $X_{I_t,t} \sim \nu_{I_t}$, independent of the past. Let $\mu_i$ be the mean of $\nu_i$, $\mu^* = \max_{i=1,\dots,K} \mu_i$ and $i^* \in \operatorname{argmax}_{i=1,\dots,K} \mu_i$.

The regret after $n$ rounds is defined as

$$R_n := \max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}$$

The pseudo-regret is

$$\bar{R}_n := \max_{i=1,\dots,K} \mathbb{E}\Big[\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\Big] = n\mu^* - \sum_{t=1}^{n} \mathbb{E}[\mu_{I_t}]$$

By defining $N_n(i) = \sum_{t=1}^{n} \mathbf{1}_{\{I_t = i\}}$, i.e. the number of times arm $i$ is pulled up to time $n$, and letting $\Delta_i = \mu^* - \mu_i$, we can rewrite the pseudo-regret (using $n = \sum_{i=1}^{K} N_n(i)$ and $\sum_{t=1}^{n} \mu_{I_t} = \sum_{i=1}^{K} N_n(i)\,\mu_i$) as

$$\bar{R}_n = \sum_{i=1}^{K} \mathbb{E}[N_n(i)]\,\mu^* - \sum_{i=1}^{K} \mathbb{E}[N_n(i)\,\mu_i] = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[N_n(i)]$$
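As a quick illustration of this decomposition (the numbers here are made up for the example, not from the talk): with $K = 2$ arms of means $\mu_1 = 0.5$ and $\mu_2 = 0.7$ we have $\mu^* = 0.7$, $\Delta_1 = 0.2$ and $\Delta_2 = 0$, so any strategy that pulls the suboptimal arm $\mathbb{E}[N_n(1)] = 50$ times in expectation incurs

$$\bar{R}_n = \Delta_1\, \mathbb{E}[N_n(1)] = 0.2 \cdot 50 = 10.$$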

The upper confidence bound strategy (UCB)

For the UCB strategy we need the following assumption: there is a convex function $\psi$ on $\mathbb{R}$ such that, for all $\lambda \ge 0$,

$$\ln \mathbb{E}\, e^{\lambda(X - \mathbb{E}[X])} \le \psi(\lambda) \quad\text{and}\quad \ln \mathbb{E}\, e^{\lambda(\mathbb{E}[X] - X)} \le \psi(\lambda) \tag{1}$$

Note that if $X \in [0, 1]$ we can take $\psi(\lambda) = \lambda^2/8$ (Hoeffding's lemma).

The Legendre-Fenchel transform (also known as the convex conjugate) of $\psi$ is defined as

$$\psi^*(\epsilon) = \sup_{\lambda \in \mathbb{R}} \big(\lambda\epsilon - \psi(\lambda)\big)$$

Note that for $\psi(\lambda) = \lambda^2/8$ we have $\psi^*(\epsilon) = 2\epsilon^2$.
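The stated value of $\psi^*$ can be checked directly: for $\psi(\lambda) = \lambda^2/8$ the supremum

$$\psi^*(\epsilon) = \sup_{\lambda \in \mathbb{R}} \Big(\lambda\epsilon - \frac{\lambda^2}{8}\Big)$$

is attained at $\lambda = 4\epsilon$ (set the derivative $\epsilon - \lambda/4$ to zero), giving $\psi^*(\epsilon) = 4\epsilon^2 - 2\epsilon^2 = 2\epsilon^2$, and hence $(\psi^*)^{-1}(y) = \sqrt{y/2}$ for $y \ge 0$.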

Let $\hat{\mu}_{i,s}$ be the sample mean of the rewards from arm $i$ after $s$ pulls; since the rewards are i.i.d., $\hat{\mu}_{i,s} = \frac{1}{s}\sum_{t=1}^{s} X_{i,t}$ in distribution. By Markov's inequality and by equation (1) we obtain

$$\mathbb{P}\big(\mu_i - \hat{\mu}_{i,s} > \epsilon\big) \le e^{-s\,\psi^*(\epsilon)} \tag{2}$$

And by defining $\delta = e^{-s\,\psi^*(\epsilon)}$ we have, with probability at least $1 - \delta$,

$$\hat{\mu}_{i,s} + (\psi^*)^{-1}\Big(\frac{1}{s}\ln\frac{1}{\delta}\Big) > \mu_i$$

Hence, for a parameter $\alpha > 0$ the $(\alpha, \psi)$-UCB strategy is to select the arm

$$I_t \in \operatorname*{argmax}_{i=1,\dots,K} \left[\hat{\mu}_{i,N_{t-1}(i)} + (\psi^*)^{-1}\Big(\frac{\alpha \ln t}{N_{t-1}(i)}\Big)\right]$$
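To make the strategy concrete, here is a minimal sketch of the $(\alpha, \psi)$-UCB rule for rewards in $[0, 1]$, using Hoeffding's choice $\psi(\lambda) = \lambda^2/8$ so that $(\psi^*)^{-1}(y) = \sqrt{y/2}$. The function name, the toy Bernoulli arms and the value $\alpha = 2.5$ are illustrative choices, not part of the talk.

```python
import math
import random

def alpha_psi_ucb(pull, K, n, alpha=2.5):
    """Run n rounds of the (alpha, psi)-UCB strategy for rewards in [0, 1].
    pull(i) is assumed to return one reward drawn from nu_i.
    Uses psi(lambda) = lambda^2 / 8, hence (psi*)^{-1}(y) = sqrt(y / 2)."""
    counts = [0] * K        # N_{t-1}(i): number of pulls of arm i so far
    means = [0.0] * K       # empirical means mu_hat_{i, N_{t-1}(i)}
    choices = []
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1     # pull every arm once to initialise the estimates
        else:
            arm = max(
                range(K),
                key=lambda i: means[i]
                + math.sqrt(alpha * math.log(t) / (2 * counts[i])),
            )
        x = pull(arm)
        counts[arm] += 1
        means[arm] += (x - means[arm]) / counts[arm]   # incremental mean update
        choices.append(arm)
    return counts, means, choices

# Usage sketch: two Bernoulli arms with hypothetical means 0.4 and 0.6.
true_means = [0.4, 0.6]
counts, means, _ = alpha_psi_ucb(lambda i: float(random.random() < true_means[i]), K=2, n=5000)
```

The term $\sqrt{\alpha \ln t / (2 N_{t-1}(i))}$ in the code is exactly $(\psi^*)^{-1}\big(\alpha \ln t / N_{t-1}(i)\big)$ for this choice of $\psi$.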

SLIDE 2

Theorem (Pseudo-regret for UCB strategy): Assume that the $\nu_i$ satisfy the convexity assumption (1). Then the pseudo-regret of the $(\alpha, \psi)$-UCB strategy with $\alpha > 2$ satisfies

$$\bar{R}_n \le \sum_{i:\Delta_i > 0} \left(\frac{\alpha\,\Delta_i}{\psi^*(\Delta_i/2)} \ln n + \frac{\alpha}{\alpha - 2}\right)$$

If we have $X \in [0, 1]$, using $\psi^*(\epsilon) = 2\epsilon^2$, then

$$\bar{R}_n \le \sum_{i:\Delta_i > 0} \left(\frac{2\alpha}{\Delta_i} \ln n + \frac{\alpha}{\alpha - 2}\right)$$
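A tiny sketch of how the $[0,1]$-reward bound behaves numerically; the helper name and the example gaps are made up for illustration.

```python
import math

def ucb_pseudo_regret_bound(gaps, n, alpha):
    """Evaluate sum over suboptimal arms of (2*alpha/Delta_i)*ln(n) + alpha/(alpha - 2),
    the pseudo-regret bound for (alpha, psi)-UCB with rewards in [0, 1]."""
    assert alpha > 2, "the bound requires alpha > 2"
    return sum(2 * alpha / d * math.log(n) + alpha / (alpha - 2) for d in gaps if d > 0)

# e.g. gaps Delta = (0.1, 0.3) after n = 10_000 rounds
print(ucb_pseudo_regret_bound([0.1, 0.3], n=10_000, alpha=2.5))
```

Note how the bound grows as a gap $\Delta_i$ shrinks: arms that are only slightly suboptimal are the hardest to rule out.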

Lower bound for Bernoulli-distributed rewards

For the following result we assume that the rewards are Bernoulli-distributed, i.e. $\nu_i = \mathrm{Ber}(\mu_i)$ with $\mu_i \in [0, 1]$.

Theorem (Lower bound): Assume that for every arm $i$ with $\Delta_i > 0$ we have $\mathbb{E}[N_n(i)] = o(n^a)$ for all $a > 0$. Then we have

$$\liminf_{n\to\infty} \frac{\bar{R}_n}{\ln n} \ge \sum_{i:\Delta_i > 0} \frac{\Delta_i}{\mathrm{kl}(\mu_i, \mu^*)}$$

where

$$\mathrm{kl}(\mu_i, \mu^*) = \mu_i \ln\frac{\mu_i}{\mu^*} + (1 - \mu_i)\ln\frac{1 - \mu_i}{1 - \mu^*}$$

is the Kullback-Leibler divergence between $\mathrm{Ber}(\mu_i)$ and $\mathrm{Ber}(\mu^*)$.
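For concreteness, a small sketch that computes the Bernoulli KL divergence and the lower-bound constant $\sum_{i:\Delta_i > 0} \Delta_i/\mathrm{kl}(\mu_i, \mu^*)$; the function names and the example means are illustrative, not from the talk.

```python
import math

def bernoulli_kl(p, q):
    """kl(p, q) between Ber(p) and Ber(q), with the convention 0 * ln(0) = 0."""
    terms = []
    if p > 0:
        terms.append(p * math.log(p / q))
    if p < 1:
        terms.append((1 - p) * math.log((1 - p) / (1 - q)))
    return sum(terms)

def lower_bound_constant(mus):
    """Sum over suboptimal arms of Delta_i / kl(mu_i, mu_star)."""
    mu_star = max(mus)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star) for mu in mus if mu < mu_star)

# e.g. three Bernoulli arms with means 0.4, 0.5 and 0.7
print(lower_bound_constant([0.4, 0.5, 0.7]))
```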

Comparison of lower & upper bound

We have that

$$\mathrm{kl}(\mu_i, \mu^*) \le \frac{(\mu^* - \mu_i)^2}{\mu^*(1 - \mu^*)}$$

which follows from $\ln x \le x - 1$. Hence, the lower bound satisfies

$$\liminf_{n\to\infty} \frac{\bar{R}_n}{\ln n} \ge \sum_{i:\mu^* - \mu_i > 0} \frac{\mu^*(1 - \mu^*)}{\mu^* - \mu_i}$$

Comparing this with the upper bound

$$\bar{R}_n \le \sum_{i:\mu^* - \mu_i > 0} \left(\frac{2\alpha}{\mu^* - \mu_i}\ln n + \frac{\alpha}{\alpha - 2}\right)$$

we see that, for Bernoulli-distributed rewards, the upper and lower bounds differ only by constant factors: both grow like $\ln n$, with a constant of order $1/(\mu^* - \mu_i)$ per suboptimal arm.
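The inequality on $\mathrm{kl}(\mu_i, \mu^*)$ used above can be obtained by applying $\ln x \le x - 1$ to both logarithms (a short check, writing $p = \mu_i$ and $q = \mu^*$):

$$\mathrm{kl}(p, q) = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q} \le p\Big(\frac{p}{q} - 1\Big) + (1-p)\Big(\frac{1-p}{1-q} - 1\Big) = \frac{(p-q)^2}{q(1-q)}$$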

SLIDE 3

Markovian bandits

Again we consider K arms; at each step we can choose one arm to pull while the remaining K−1 arms stay frozen. But now the pulled arm can change its state in a 'Markovian' fashion: the arm produces the reward $r(x_t)$ and changes state to $x_{t+1}$ according to a Markov dynamic $x \to y$ with transition probability $P(x, y)$.

The goal now is to maximize a $\beta$-discounted reward

$$\mathbb{E}\left[\sum_{t=0}^{\infty} r_{i_t}\big(x_{i_t}(t)\big)\,\beta^t\right]$$

where $i_t$ is the arm pulled at time $t$ and $0 < \beta < 1$ is the discounting factor. This discounted reward is maximized by forward induction. It can be shown (not part of the talk) that the Gittins index

$$G_i(x_i) = \sup_{\tau \ge 1} \frac{\mathbb{E}\left[\sum_{t=0}^{\tau-1} r_i\big(x_i(t)\big)\,\beta^t \,\middle|\, x_i(0) = x_i\right]}{\mathbb{E}\left[\sum_{t=0}^{\tau-1} \beta^t \,\middle|\, x_i(0) = x_i\right]},$$

where $\tau$ is a stopping time, is enough to determine which arm is to be pulled: at each step we pull an arm with the biggest index. Note that the numerator is the expected discounted reward up to $\tau$ and the denominator is the expected discounted time up to $\tau$. Hence, we can find the best strategy by computing the Gittins index for every arm, where each index is independent of all other arms. Thus, instead of a single problem on the joint state space we only need to solve K independent one-arm problems, which greatly reduces the computational work.
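Computing $G_i(x_i)$ directly from the stopping-time definition is awkward; a common reformulation (due to Katehakis and Veinott, not covered in the talk) expresses the index of state $x$ as $(1-\beta)$ times the optimal value of a "restart-in-$x$" problem, which can be solved by value iteration. A minimal sketch for one arm with a finite state space, assuming a transition matrix $P$, a reward vector $r$ and a discount factor $\beta$:

```python
import numpy as np

def gittins_index(P, r, beta, x, tol=1e-10, max_iter=100_000):
    """Gittins index of state x for a single arm with finite state space,
    transition matrix P (rows sum to 1), reward vector r and discount beta,
    via restart-in-x value iteration: in every state we may either continue
    from the current state or restart the arm in state x."""
    r = np.asarray(r, dtype=float)
    V = np.zeros(len(r))
    for _ in range(max_iter):
        continue_value = r + beta * P @ V        # keep playing from the current state
        restart_value = r[x] + beta * P[x] @ V   # behave as if the arm were in state x
        V_new = np.maximum(continue_value, restart_value)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1 - beta) * V[x]

# Toy arm: state 0 pays 1 and moves to state 1, which pays 0 and is absorbing.
P = np.array([[0.0, 1.0], [0.0, 1.0]])
r = [1.0, 0.0]
print(gittins_index(P, r, beta=0.9, x=0), gittins_index(P, r, beta=0.9, x=1))
```

For this toy arm the indices come out as 1 and 0, matching the intuition that an arm sitting in the absorbing zero-reward state is worthless.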