SLIDE 1 Are sample means in multi-armed bandits positively or negatively biased?
Jaehyeok Shin¹, Aaditya Ramdas¹,² and Alessandro Rinaldo¹
¹ Dept. of Statistics and Data Science, ² Machine Learning Dept., CMU
Poster #12 @ Hall B + C
SLIDE 2
Stochastic multi-armed bandit
[Figure: $K$ arms with means $\mu_1, \mu_2, \dots, \mu_K$; pulling an arm returns a random reward $Y$.]
SLIDE 3
Adaptive sampling scheme to maximize rewards / to identify the best arm
[Figure: arms with means $\mu_1, \mu_2, \dots, \mu_K$ above a time axis.]
SLIDE 11
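Adaptive sampling scheme to maximize rewards / to identify the best arm
[Figure: arms with means $\mu_1, \mu_2, \dots, \mu_K$ above a time axis; reward $Y_1$ is observed at $t = 1$, $Y_2$ at $t = 2$, and so on until the stopping time $\mathcal{T}$.]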
SLIDE 12
Collected data can be used to identify an interesting arm...
[Figure: the same timeline, with rewards $Y_1, Y_2, \dots$ collected up to the stopping time $\mathcal{T}$; one arm is flagged "Interesting!".]
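SLIDE 13
...and data can be used to estimate the mean.
[Figure: the same timeline, with rewards $Y_1, Y_2, \dots$ collected up to the stopping time $\mathcal{T}$; the sample mean $\hat{\mu}_\kappa(\mathcal{T})$ is computed at $\mathcal{T}$.]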
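The sampling, stopping, and choosing steps on the preceding slides can be sketched as a single loop. The Python below is a minimal illustration, not the authors' code: the function `run_bandit`, the unit-variance Gaussian rewards, and the three example rules (round-robin sampling, stopping after 30 pulls, choosing the arm with the largest sample mean) are all assumptions made for this example.

```python
import random

def run_bandit(mu, sample_rule, stop_rule, choose_rule, max_steps=1000, seed=0):
    """Run one adaptive experiment; return (chosen arm, stopping time, its sample mean)."""
    rng = random.Random(seed)
    counts = [0] * len(mu)    # N_k(t): number of pulls of each arm so far
    sums = [0.0] * len(mu)    # running sum of rewards for each arm
    t = 0
    while t < max_steps:
        k = sample_rule(counts, sums, t)   # which arm to pull next
        sums[k] += rng.gauss(mu[k], 1.0)   # assumed Gaussian reward, variance 1
        counts[k] += 1
        t += 1
        if stop_rule(counts, sums, t):     # data-dependent stopping time
            break
    kappa = choose_rule(counts, sums)      # data-dependent chosen arm
    return kappa, t, sums[kappa] / counts[kappa]

# Hypothetical rules, chosen only to make the example runnable.
kappa, stop_time, mean_hat = run_bandit(
    mu=[0.0, 0.5, 1.0],
    sample_rule=lambda counts, sums, t: t % len(counts),
    stop_rule=lambda counts, sums, t: t >= 30,
    choose_rule=lambda counts, sums: max(
        range(len(counts)), key=lambda k: sums[k] / counts[k]
    ),
)
print(kappa, stop_time, round(mean_hat, 3))
```

Swapping in other rules changes how the data are collected, which is what determines the sign of the bias of the returned sample mean.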
SLIDE 14
$\mathbb{E}[\hat{\mu}_\kappa(\mathcal{T}) - \mu_\kappa] \le$ or $\ge 0$?
SLIDE 17
Nie et al. 2018 : Sample mean is negatively biased. (Fixed arm $k$, fixed time $t$)
$\mathbb{E}[\hat{\mu}_k(t) - \mu_k] \le 0$
This work : Sample mean of chosen arm at stopping time (Chosen arm $\kappa$, stopping time $\mathcal{T}$)
$\mathbb{E}[\hat{\mu}_\kappa(\mathcal{T}) - \mu_\kappa]$
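A small Monte Carlo experiment can make the fixed-arm, fixed-time statement concrete. The sketch below is an illustration under assumed settings (two arms with equal means, unit-variance Gaussian rewards, a greedy sampling rule, and the hypothetical helper `greedy_fixed_time_bias`); it is not the setup used by Nie et al. or by this work.

```python
import random

def greedy_fixed_time_bias(horizon=20, n_runs=20000, seed=0):
    """Monte Carlo estimate of E[mu_hat_0(t) - mu_0] for arm 0 at a fixed time t = horizon."""
    rng = random.Random(seed)
    mu = [0.0, 0.0]  # assumed true means
    total = 0.0
    for _ in range(n_runs):
        # Pull each arm once to initialise the sample means.
        sums = [rng.gauss(mu[0], 1.0), rng.gauss(mu[1], 1.0)]
        counts = [1, 1]
        for _ in range(horizon - 2):
            # Greedy rule: pull the arm whose current sample mean is larger.
            k = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
            sums[k] += rng.gauss(mu[k], 1.0)
            counts[k] += 1
        total += sums[0] / counts[0] - mu[0]
    return total / n_runs

bias = greedy_fixed_time_bias()
print(f"estimated bias of arm 0 at a fixed time: {bias:.3f}")
```

Under the greedy rule an arm that starts with low samples tends not to be pulled again, so its low sample mean is never corrected; this is one intuition for the negative sign.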
SLIDE 21
This work : Sample mean of chosen arm at stopping time is ...
$\mathbb{E}[\hat{\mu}_\kappa(\mathcal{T}) - \mu_\kappa]$
(a) negatively biased under 'optimistic sampling'.
(b) positively biased under 'optimistic stopping'.
(c) positively biased under 'optimistic choosing'.
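Cases (b) and (c) can be checked numerically in toy settings. The following sketch is only an illustration under assumptions that are not in the slides: all true means are 0, rewards are unit-variance Gaussian, "optimistic stopping" is modelled as stopping once a single arm's running mean exceeds a threshold, and "optimistic choosing" is modelled as picking the arm with the largest sample mean after a fixed number of pulls. The helper names are hypothetical.

```python
import random

def optimistic_stopping_bias(threshold=0.3, max_n=20, n_runs=10000, seed=0):
    """One arm with true mean 0; stop as soon as the running mean exceeds `threshold`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        s, n = 0.0, 0
        while n < max_n:
            s += rng.gauss(0.0, 1.0)
            n += 1
            if s / n > threshold:  # larger samples trigger an earlier stop
                break
        total += s / n  # the true mean is 0, so this is the estimation error
    return total / n_runs

def optimistic_choosing_bias(n_arms=5, n_per_arm=10, n_runs=10000, seed=0):
    """All arms have true mean 0 and equal sample sizes; report the largest sample mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        means = [
            sum(rng.gauss(0.0, 1.0) for _ in range(n_per_arm)) / n_per_arm
            for _ in range(n_arms)
        ]
        total += max(means)  # sample mean of the chosen (best-looking) arm
    return total / n_runs

stopping_bias = optimistic_stopping_bias()
choosing_bias = optimistic_choosing_bias()
print(f"stopping: {stopping_bias:.3f}, choosing: {choosing_bias:.3f}")
```

In both toy settings the estimated bias comes out positive, in line with (b) and (c).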
SLIDE 27
Monotone effect of a sample
Theorem [Informal]: if $\mathbf{1}(\kappa = k) / N_k(\mathcal{T})$ is an increasing function of each sample from arm $k$, the sample mean of arm $k$ has positive bias; if it is decreasing, negative bias.
Agnostic to algorithm
Positive bias under best arm identification, sequential testing
Includes Nie et al. 2018 as a special case
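The monotone condition on $\mathbf{1}(\kappa = k) / N_k(\mathcal{T})$ can be probed directly by fixing all randomness in advance and bumping a single sample. The sketch below is an assumption-laden illustration, not the authors' proof: it uses a pre-drawn table of rewards for two arms, a greedy sampling rule, a fixed arm ($\kappa = 0$) and a fixed time, and the hypothetical helper `greedy_pull_count`. In that setting the ratio reduces to $1 / N_0(t)$, so it is decreasing whenever $N_0(t)$ does not shrink after a sample from arm 0 is increased.

```python
import random

def greedy_pull_count(table, horizon):
    """Run greedy on a fixed table of rewards (table[k][i] is the i-th reward of arm k).

    Returns N_0(horizon), the number of pulls of arm 0.
    """
    counts = [1, 1]
    sums = [table[0][0], table[1][0]]  # each arm is pulled once to start
    for _ in range(horizon - 2):
        k = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
        sums[k] += table[k][counts[k]]  # next unused reward of arm k
        counts[k] += 1
    return counts[0]

rng = random.Random(1)
horizon, delta = 15, 0.5
violations = 0
for _ in range(2000):
    table = [[rng.gauss(0.0, 1.0) for _ in range(horizon)] for _ in range(2)]
    bumped = [row[:] for row in table]
    bumped[0][rng.randrange(horizon)] += delta  # increase one sample from arm 0
    # Count cases where increasing a sample led to fewer pulls of arm 0.
    if greedy_pull_count(bumped, horizon) < greedy_pull_count(table, horizon):
        violations += 1
print("violations:", violations)
```

Finding no violations is consistent with $1 / N_0(t)$ being decreasing in each sample from arm 0 under greedy sampling, which is the "decreasing, negative bias" branch of the informal theorem.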
SLIDE 28
Are sample means in multi-armed bandits positively or negatively biased?
Jaehyeok Shin, Aaditya Ramdas and Alessandro Rinaldo
Poster #12 @ Hall B + C