SLIDE 1
Upper confidence bound strategy for stochastic bandits
Multi-armed bandit: K arms; at each step we choose one arm to pull while the other K−1 arms stay frozen (no reward).
- Stochastic bandit: each arm has a fixed reward distribution in all rounds.
- Adversarial bandit: the payouts can change in every round.
- Markovian bandit: the activated arm changes state in a 'Markovian' fashion.
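As a concrete illustration of the stochastic setting, here is a minimal sketch of a K-armed Bernoulli bandit; the class name and the arm means are made up for illustration:

```python
import random

class BernoulliBandit:
    """Illustrative K-armed stochastic bandit: arm i pays 1 with fixed probability mu[i]."""
    def __init__(self, mu, seed=0):
        self.mu = mu                      # fixed per-arm means (the stochastic setting)
        self.rng = random.Random(seed)

    def pull(self, i):
        # Only the chosen arm produces a reward; the other K-1 arms stay frozen.
        return 1.0 if self.rng.random() < self.mu[i] else 0.0

bandit = BernoulliBandit([0.3, 0.5, 0.8])
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # sample mean, close to mu[2] = 0.8
```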
Here we only consider stochastic bandits and Markovian bandits. A stochastic bandit has K arms with unknown, fixed probability distributions ν_1, …, ν_K on [0, 1]. At each step t = 1, 2, … we choose an arm I_t ∈ {1, …, K} and draw a reward X_{I_t,t} ∼ ν_{I_t}, independent of the past. Let µ_i be the mean of ν_i, µ* = max_{i=1,…,K} µ_i and i* ∈ argmax_{i=1,…,K} µ_i.

The regret after n rounds is defined as

R_n := max_{i=1,…,K} Σ_{t=1}^n X_{i,t} − Σ_{t=1}^n X_{I_t,t}

The pseudo-regret is

R̄_n := max_{i=1,…,K} E[ Σ_{t=1}^n X_{i,t} − Σ_{t=1}^n X_{I_t,t} ] = nµ* − Σ_{t=1}^n E[µ_{I_t}]
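To make the pseudo-regret formula concrete: for a policy that picks arms uniformly at random, E[µ_{I_t}] is the average of the arm means in every round, so the pseudo-regret simplifies to n(µ* − (1/K) Σ_i µ_i). A small sketch with illustrative means:

```python
mu = [0.3, 0.5, 0.8]      # illustrative arm means
n = 1000
mu_star = max(mu)
# Uniform-random policy: E[mu_{I_t}] = average of the means in every round,
# so n*mu* - sum_t E[mu_{I_t}] = n * (mu* - mean(mu)).
pseudo_regret = n * mu_star - n * (sum(mu) / len(mu))
print(pseudo_regret)      # about 266.67
```

The pseudo-regret grows linearly in n for this policy; a good strategy such as UCB instead achieves regret logarithmic in n.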
By defining N_n(i) = Σ_{t=1}^n 1{I_t = i}, i.e. the number of times arm i is pulled up to time n, and letting Δ_i = µ* − µ_i, we can rewrite the pseudo-regret as

R̄_n = Σ_{i=1}^K E[N_n(i)] µ* − Σ_{i=1}^K E[N_n(i)] µ_i = Σ_{i=1}^K Δ_i E[N_n(i)]

The upper confidence bound strategy (UCB)
For the UCB strategy we need the following assumption: there is a convex function ψ on ℝ such that, for all λ ≥ 0,

ln E e^{λ(X − E[X])} ≤ ψ(λ) and ln E e^{λ(E[X] − X)} ≤ ψ(λ)   (1)

Note that if X ∈ [0, 1] we can take ψ(λ) = λ²/8 (Hoeffding's lemma).

The Legendre–Fenchel transform (also known as the convex conjugate) of ψ is defined as

ψ*(ε) = sup_{λ∈ℝ} (λε − ψ(λ))

Note that for ψ(λ) = λ²/8 we have ψ*(ε) = 2ε², since λε − λ²/8 is maximized at λ = 4ε.

Let µ̂_{i,s} be the sample mean of the first s rewards from arm i, i.e. µ̂_{i,s} = (1/s) Σ_{t=1}^s X_{i,t}, which is well defined in distribution since the rewards are i.i.d. By Markov's inequality and by assumption (1) we obtain

P(µ_i − µ̂_{i,s} > ε) ≤ e^{−s ψ*(ε)}   (2)

By setting δ = e^{−s ψ*(ε)} we have, with probability at least 1 − δ,

µ̂_{i,s} + (ψ*)^{−1}( (1/s) ln(1/δ) ) > µ_i

Hence, for a parameter α > 0, the (α, ψ)-UCB strategy is to select the arm

I_t ∈ argmax_{i=1,…,K} [ µ̂_{i,N_{t−1}(i)} + (ψ*)^{−1}( α ln t / N_{t−1}(i) ) ]
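Putting the pieces together, the following is a minimal sketch of the (α, ψ)-UCB strategy specialized to rewards in [0, 1], where ψ(λ) = λ²/8 gives ψ*(ε) = 2ε² and hence (ψ*)⁻¹(x) = √(x/2). The Bernoulli environment, seed, and parameter values are illustrative, not part of the original:

```python
import math
import random

def ucb(mu, n_rounds, alpha=2.0, seed=0):
    """(alpha, psi)-UCB for Bernoulli arms with psi(lambda) = lambda^2 / 8,
    so the index is mu_hat + sqrt(alpha * ln t / (2 * N))."""
    rng = random.Random(seed)
    K = len(mu)
    counts = [0] * K          # N_{t-1}(i): pulls of arm i so far
    sums = [0.0] * K          # cumulative reward of arm i
    for t in range(1, n_rounds + 1):
        if t <= K:
            i = t - 1         # pull every arm once to initialize
        else:
            # index = sample mean + (psi*)^{-1}(alpha ln t / N), with psi*(eps) = 2 eps^2
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(alpha * math.log(t) / (2 * counts[j])))
        reward = 1.0 if rng.random() < mu[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts

counts = ucb([0.3, 0.5, 0.8], 5000)
print(counts)  # the best arm (index 2) should dominate the pull counts
```

The exploration bonus √(α ln t / (2N)) shrinks as an arm is pulled more often, so suboptimal arms are pulled only O(ln n) times, which bounds each term Δ_i E[N_n(i)] of the pseudo-regret.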