On conditional versus marginal bias in multi-armed bandits
Jaehyeok Shin¹, Aaditya Ramdas¹,² and Alessandro Rinaldo¹
Dept. of Statistics and Data Science¹, Machine Learning Dept.², CMU
Stochastic multi-armed bandits (MABs)
[Figure: sample means $\hat{\mu}(t)$ of Item 1, Item 2, ..., Item K, with times $T$ and $\mathcal{U}$ marked (e.g. Starr & Woodroofe [1968], Shin et al. [2019])]
$X^*_\infty = (X^*_{i,k})_{i \in \mathbb{N},\, 1 \le k \le K} \in \mathbb{R}^{\mathbb{N} \times K}$, where the entries $X^*_{1,k}, X^*_{2,k}, \dots$ of each column $k$ are i.i.d.
$\mathcal{D}^*_\infty = X^*_\infty \cup \{W_t\}_{t=1}^{\infty}$
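A minimal Python sketch of the tabular model, under assumptions that are not in the slides: rewards are Gaussian with unit variance, only a finite slice of the infinite table is drawn, the external randomness W_t is Uniform(0, 1), and the helper names (`make_table`, `make_external_randomness`, `run_from_table`, `choose_arm`) are hypothetical. The n-th pull of item k reads the n-th entry of column k, so a whole run is a deterministic function of the table and of the sequence W.

```python
import random


def make_table(means, n_rows, seed=0):
    """Finite slice of the table X*: entry [i][k] is the (i+1)-th reward of item k.

    Assumption: column k holds i.i.d. N(means[k], 1) draws.
    """
    rng = random.Random(seed)
    return [[rng.gauss(mu, 1.0) for mu in means] for _ in range(n_rows)]


def make_external_randomness(horizon, seed=1):
    """Pre-drawn W_1, ..., W_horizon (assumed Uniform(0, 1)), independent of the table."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(horizon)]


def run_from_table(table, w, choose_arm, horizon):
    """Play `horizon` rounds; the n-th pull of item k reads row n of column k."""
    K = len(table[0])
    counts = [0] * K   # N_k(t): number of pulls of item k so far
    sums = [0.0] * K   # running reward sums, so mu_hat_k = sums[k] / counts[k]
    for t in range(horizon):
        k = choose_arm(t, counts, sums, w[t])
        sums[k] += table[counts[k]][k]   # next unread entry in column k
        counts[k] += 1
    return counts, sums


# Usage: a hypothetical rule that picks an item uniformly at random using W_t.
table = make_table(means=[1.0, 0.0, 0.0], n_rows=20)
w = make_external_randomness(horizon=20)
counts, sums = run_from_table(table, w, lambda t, c, s, wt: int(wt * len(c)), 20)
print(counts)
```

Fixing the table and W and changing a single entry of one column is then a direct way to check how the number of pulls of each item responds to that entry.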
[Figure: items with means $\mu_1, \mu_2, \dots, \mu_K$; observations $Y_1, Y_2, \dots$ are collected over time $t = 1, t = 2, \dots$]
Increasing
$A_t = \arg\max_k \{\hat{\mu}_k(t) + u(N_k(t))\}$ (upper confidence bound)
$\mathcal{U} = \inf\{t : \exists k,\ N_k(t) \ge 1 + \lambda \sum_{j \ne k} N_j(t)\}$
$\kappa = \arg\max_k N_k(\mathcal{U})$
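The three rules above (sample by upper confidence bound, stop once one item has been pulled at least 1 + λ times the total of the others, choose the most-pulled item) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: rewards are unit-variance Gaussians, every item is pulled once at the start so that the sample means exist, the bonus `u(n) = sqrt(2 log(1/delta) / n)` is a hypothetical stand-in because the slides do not specify u, and `max_steps` is a safety cap added here.

```python
import math
import random


def ucb_with_stopping(means, lam=1.0, delta=0.05, max_steps=10_000, seed=0):
    """Run UCB sampling until the stopping rule fires; return (t, kappa, counts, mu_hat)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K   # N_k(t)
    sums = [0.0] * K   # running reward sums

    def u(n):
        # Hypothetical confidence bonus; the slides leave u unspecified.
        return math.sqrt(2.0 * math.log(1.0 / delta) / n)

    def pull(k):
        counts[k] += 1
        sums[k] += rng.gauss(means[k], 1.0)   # assumed N(mu_k, 1) reward

    def stopped():
        # Exists k with N_k(t) >= 1 + lam * sum_{j != k} N_j(t).
        total = sum(counts)
        return any(counts[k] >= 1 + lam * (total - counts[k]) for k in range(K))

    for k in range(K):   # initialisation: one pull per item
        pull(k)
    t = K                # t counts the pulls made so far
    while not stopped() and t < max_steps:
        a = max(range(K), key=lambda k: sums[k] / counts[k] + u(counts[k]))
        pull(a)
        t += 1

    kappa = max(range(K), key=lambda k: counts[k])   # most-pulled item
    mu_hat = [sums[k] / counts[k] for k in range(K)]
    return t, kappa, counts, mu_hat


t, kappa, counts, mu_hat = ucb_with_stopping([1.0, 0.0, 0.0])
print("stopped at t =", t, "chosen item index =", kappa, "pull counts =", counts)
```

Both the stopping time and the chosen item depend on the observed rewards, which is why the sample means returned at the end need not be unbiased.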
Decreasing
[Figure: lil'UCB on 3 items (µ1 = 1). CDF of item 1 plotted against y, with one curve each for "Item 1 is chosen", "Item 2 is chosen" and "Item 3 is chosen". Mean bias = (0.2, −0.93, −1.14).]
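The distinction between the bias of item 1's sample mean conditional on which item is chosen and its marginal bias can be estimated by Monte Carlo. The sketch below is a toy illustration and does not reproduce the lil'UCB experiment or its numbers: it assumes unit-variance Gaussian rewards and a deliberately simple hypothetical experiment (`run_once`) in which every item is pulled a fixed number of times and the item with the largest sample mean is chosen.

```python
import random


def run_once(rng, means, n=5):
    """Toy experiment (assumption, not lil'UCB): pull every item n times,
    then choose the item with the largest sample mean."""
    mu_hat = [sum(rng.gauss(mu, 1.0) for _ in range(n)) / n for mu in means]
    kappa = max(range(len(means)), key=lambda k: mu_hat[k])
    return kappa, mu_hat


def conditional_bias(means, item=0, n_runs=20000, seed=0):
    """Monte Carlo estimates of E[mu_hat_item - mu_item | kappa = k] for each k,
    together with the marginal bias E[mu_hat_item - mu_item]."""
    rng = random.Random(seed)
    K = len(means)
    err_sum = [0.0] * K   # summed estimation error, split by chosen item
    n_chosen = [0] * K    # how often each item was chosen
    for _ in range(n_runs):
        kappa, mu_hat = run_once(rng, means)
        err_sum[kappa] += mu_hat[item] - means[item]
        n_chosen[kappa] += 1
    cond = [err_sum[k] / n_chosen[k] if n_chosen[k] else float("nan")
            for k in range(K)]
    marginal = sum(err_sum) / n_runs
    return cond, marginal


cond, marginal = conditional_bias(means=[0.0, 0.0, 0.0])
print("bias of item 1 given each chosen item:", cond)
print("marginal bias of item 1:", marginal)
```

In this toy setting the number of pulls is fixed, so the marginal bias is zero in expectation, while conditioning on the chosen item shifts the estimate up when item 1 is chosen and down when another item is chosen.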