On conditional versus marginal bias in multi-armed bandits (PowerPoint presentation)


SLIDE 1

On conditional versus marginal bias in multi-armed bandits

Jaehyeok Shin¹, Aaditya Ramdas¹˒² and Alessandro Rinaldo¹

Dept. of Statistics and Data Science¹, Machine Learning Dept.², CMU

SLIDE 2

Stochastic Multi-armed bandits (MABs)

[Figure: K arms with unknown mean rewards μ1, μ2, …, μK; pulling an arm yields a "random reward" Y ∼ that arm's distribution.]

SLIDES 3-11

Adaptive sampling scheme to maximize rewards / to identify the best arm.

[Figure, built up over slides 3-11: K arms with means μ1, μ2, …, μK. At each time t = 1, 2, …, the scheme adaptively picks an arm and observes a reward Yt; sampling continues until a stopping time 𝒰.]
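As a concrete (hypothetical) instance of such an adaptive sampling scheme, here is a minimal sketch, not from the talk: an ε-greedy sampler with Gaussian rewards. The names `run_bandit` and `eps` are illustrative choices.

```python
import numpy as np

def run_bandit(mus, T=100, eps=0.1, rng=None):
    """One run of an adaptive sampling scheme (epsilon-greedy, for
    illustration): at each time t the scheme pulls an arm -- usually the
    one with the highest sample mean so far -- and observes a Gaussian
    reward Y_t centered at the pulled arm's true mean."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(mus)
    sums = np.zeros(K)    # running reward sums per arm
    counts = np.zeros(K)  # N_k(t): number of pulls of arm k so far
    for t in range(T):
        if t < K:                    # warm-up: pull each arm once
            a = t
        elif rng.random() < eps:     # explore uniformly at random
            a = int(rng.integers(K))
        else:                        # exploit the best-looking arm
            a = int(np.argmax(sums / counts))
        sums[a] += mus[a] + rng.standard_normal()
        counts[a] += 1
    return sums / counts, counts     # sample means and pull counts at T
```

Here the scheme stops at a fixed horizon T; the stopping time 𝒰 on the slides may instead be data-dependent.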

SLIDES 12-13

Collected data can be used to identify an interesting arm, e.g., κ = arg max_k μ̂k(𝒰).

[Figure: the same sampling process, run until the stopping time 𝒰.]

SLIDE 14

...and the data can be used to conduct statistical inferences, e.g., via μ̂κ(𝒰), the sample mean of the chosen arm κ at a stopping time 𝒰.

SLIDE 15

Q. What is the sign of the bias of the sample mean?

𝔼[μ̂κ(𝒰) − μκ] ≤ or ≥ 0?

Xu et al. [2013]: an informal argument for why the sample mean is negatively biased for "optimistic" algorithms.

Villar et al. [2015]: demonstrate this negative bias in a simulation study motivated by using MABs for clinical trials.

SLIDE 17

Nie et al. [2018]: the sample mean of a fixed arm k at a fixed time t is negatively biased,

𝔼[μ̂k(t) − μk] ≤ 0,

for MABs designed to maximize cumulative reward.

Shin et al. [2019]: introduced a "monotonicity property" characterizing the bias 𝔼[μ̂κ(𝒰) − μκ] of the sample mean of a chosen arm κ at a stopping time 𝒰, for more general classes of MABs.
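The sign in Nie et al. [2018] can be checked numerically. Below is a minimal Monte Carlo sketch, not the paper's experiment: greedy sampling stands in for a reward-maximizing MAB, and `fixed_arm_bias` is a hypothetical name.

```python
import numpy as np

def fixed_arm_bias(mus, t=30, reps=10000, seed=0):
    """Monte Carlo estimate of E[mu_hat_k(t) - mu_k] for each FIXED arm k
    at a FIXED time t, under greedy sampling (always pull the arm with
    the highest sample mean), a stand-in for a reward-maximizing MAB."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    total = np.zeros(K)
    for _ in range(reps):
        sums = np.zeros(K)
        counts = np.zeros(K)
        for s in range(t):
            a = s if s < K else int(np.argmax(sums / counts))
            sums[a] += mus[a] + rng.standard_normal()
            counts[a] += 1
        total += sums / counts - np.asarray(mus)
    return total / reps  # one bias estimate per arm
```

With two arms of equal mean, both estimates come out negative: greedy sampling abandons an arm after unlucky draws, so low sample means are never corrected.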

SLIDES 19-22

However, the current understanding of bias is limited in two aspects.

  • 1. Existing results concern the bias of the sample mean only. We study the bias of monotone functions of the rewards.
  • 2. Existing guarantees cover only the marginal bias. We extend previous results to cover the conditional bias.
SLIDE 23

Marginal vs conditional bias

[Figure: K prototype items with mean revenues μ1, μ2, …, μK.]

Want to screen out some items by testing, for k = 1, …, K,

H0k: μk ≥ c  vs  H1k: μk < c.

SLIDE 24

Marginal vs conditional bias

For a single item, test H0: μ ≥ c vs H1: μ < c.

[Figure: two sample-mean trajectories μ̂(t). One stays high through the horizon T: "Keep the item" (fail to reject the null). The other drops and is stopped early: "Screen out the item at 𝒰" (reject the null).]

SLIDE 25

Marginally, the sample mean is negatively biased:

𝔼[μ̂k − μk] ≤ 0,  k = 1, …, K

(e.g., Starr & Woodroofe [1968], Shin et al. [2019])

[Figure: sample-mean trajectories μ̂(t) for items 1, 2, …, K.]

"Underestimating the true mean revenue."

SLIDES 26-27

[Figure: the same sample-mean trajectories for items 1, 2, …, K.]

...however, we usually do not evaluate the sample mean every time.

SLIDE 28

Conditioned on the "active" event, the sample mean is positively biased:

𝔼[μ̂k − μk ∣ item k is active] ≥ 0,  k = 1, …, K

"Overestimating the true mean revenue."
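Both signs can be seen in one small simulation. This is an illustrative sketch, not the talk's experiment: the screening rule (drop an item the first time its running mean falls below the threshold c) and the name `screening_bias` are assumptions.

```python
import numpy as np

def screening_bias(mu=0.0, c=0.0, T=50, reps=10000, seed=0):
    """One item with true mean mu, tested against a threshold c.
    Rewards arrive one at a time; the item is screened out the first time
    its running sample mean drops below c, else it stays active through T.
    Returns (marginal bias, bias conditional on 'item still active')."""
    rng = np.random.default_rng(seed)
    final_means = np.empty(reps)
    active = np.zeros(reps, dtype=bool)
    for r in range(reps):
        s, n, alive = 0.0, 0, True
        while n < T and alive:
            s += mu + rng.standard_normal()
            n += 1
            if s / n < c:      # sample mean fell below c: screen out
                alive = False
        final_means[r] = s / n
        active[r] = alive
    bias = final_means - mu
    return float(bias.mean()), float(bias[active].mean())
```

Averaged over all runs the bias is negative (screened-out items stop on a low sample mean), while conditioned on the item still being active the sample mean has stayed above c the whole time, so the bias is positive.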

SLIDE 29

Conditional bias of the empirical cumulative distribution function (CDF)

Let F̂k,𝒰 be the empirical CDF of arm k at time 𝒰, and Fk the true CDF of arm k, where e.g. C = {Reject the null}, {Chosen as the best arm}, …

For a fixed y ∈ ℝ,

𝔼[F̂k,𝒰(y) − Fk(y) ∣ C] ≤ or ≥ 0?

SLIDES 30-36

Tabular model of MABs

[Figure, built up over slides 30-36: a "hypothetical table" X*∞ ∈ ℝ^(ℕ×K),

X*1,1  X*1,2  …  X*1,K
X*2,1  X*2,2  …  X*2,K
  ⋮      ⋮          ⋮

where column k is drawn i.i.d. from arm k's reward distribution (with mean μk). At each time t = 1, 2, …, the algorithm observes the entry in row t of the pulled arm's column (the rewards Y1, Y2, … in the figure).]

SLIDE 37

Hypothetical dataset

𝒠*∞ = X*∞ ∪ {Wt}∞t=1

(hypothetical table ∪ random seeds)

SLIDE 38

Hypothetical dataset

Given 𝒠*∞ = X*∞ ∪ {Wt}∞t=1, the conditioning event C, the stopping time 𝒰, and Nk(t) for each t and k can be expressed as functions of 𝒠*∞.
slide-39
SLIDE 39

Monotone effect of a sample

(Negative conditional bias of the empirical CDF) 𝔽 [ ̂ F k,𝒰(y) − Fk(y) ∣ C] ≤ 0

19

Theorem Suppose arm has a finite mean. If is an increasing function of each while keeping all other entries in fixed then we have

k

1(C) Nk(𝒰)

X*

i,k

𝒠*

slide-40
SLIDE 40

Monotone effect of a sample

(Negative conditional bias of the empirical CDF) 𝔽 [ ̂ F k,𝒰(y) − Fk(y) ∣ C] ≤ 0 (Positive conditional bias of the sample mean) 𝔽 [ ̂ μk(𝒰) − μk ∣ C] ≥ 0

19

Theorem Suppose arm has a finite mean. If is an increasing function of each while keeping all other entries in fixed then we have

k

1(C) Nk(𝒰)

X*

i,k

𝒠*

SLIDES 41-42

Monotone effect of a sample

Theorem. Suppose arm k has a finite mean. If 1(C)/Nk(𝒰) is a decreasing function of each X*i,k while keeping all other entries in 𝒠* fixed, then we have

(positive conditional bias of the empirical CDF)  𝔼[F̂k,𝒰(y) − Fk(y) ∣ C] ≥ 0

(negative conditional bias of the sample mean)  𝔼[μ̂k(𝒰) − μk ∣ C] ≤ 0

SLIDE 43

E.g.: Best arm identification

[Figure: K prototype items with mean revenues μ1, μ2, …, μK.]

Want to figure out which one has the largest revenue.

SLIDES 44-47

E.g.: Best arm identification via the lil' UCB algorithm

At each time t, pull the arm with the largest upper confidence bound:

At = arg max_k [ μ̂k(t) + u(Nk(t)) ]    (upper confidence bound)

Stop at

𝒰 = inf{ t : ∃k, Nk(t) ≥ 1 + λ Σ_{j≠k} Nj(t) }

and report the most-sampled arm, κ = arg max_k Nk(𝒰).
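A minimal sketch of this pull/stop/report rule follows. It is illustrative only: the exploration bonus `u(n)` and the default λ here are placeholder choices, not the tuned lil'UCB constants from the original algorithm.

```python
import numpy as np

def lil_ucb(mus, lam=2.0, u=lambda n: np.sqrt(2 * np.log(1 + n) / n), seed=0):
    """Simplified lil'UCB-style run: pull the arm maximizing
    mu_hat_k(t) + u(N_k(t)); stop once some arm's pull count N_k reaches
    1 + lam * (sum of the other arms' counts); report the most-pulled arm."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    sums = np.zeros(K)
    counts = np.zeros(K)
    for k in range(K):                          # initialize: pull each arm once
        sums[k] = mus[k] + rng.standard_normal()
        counts[k] = 1.0
    # Stopping time U: exists k with N_k >= 1 + lam * sum_{j != k} N_j.
    # The most-pulled arm satisfies this first, so checking the max suffices.
    while counts.max() < 1 + lam * (counts.sum() - counts.max()):
        a = int(np.argmax(sums / counts + u(counts)))
        sums[a] += mus[a] + rng.standard_normal()
        counts[a] += 1
    kappa = int(np.argmax(counts))              # chosen arm at time U
    return kappa, counts
```

By construction, the returned arm's pull count satisfies the stopping inequality at time 𝒰.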

SLIDE 48

E.g.: Best arm identification

Two events to condition on:
a) Item 1 is chosen as the best.
b) Item 1 is NOT chosen as the best.

SLIDES 49-51

E.g.: Best arm identification

a) Item 1 is chosen as the best (κ = 1).

For each sample X*i,1 from item 1, the quantity 1(κ = 1)/N1(𝒰) is increasing in X*i,1. By the theorem, this yields a negative conditional bias of the empirical CDF and a positive conditional bias of the sample mean.

SLIDES 52-54

E.g.: Best arm identification

b) Item 1 is NOT chosen as the best (κ ≠ 1).

For each sample X*i,1 from item 1, the quantity 1(κ ≠ 1)/N1(𝒰) is decreasing in X*i,1. By the theorem, this yields a positive conditional bias of the empirical CDF and a negative conditional bias of the sample mean.

SLIDE 55

[Figure: average empirical CDF of item 1 (y from −2 to 4) under lilʹUCB on 3 items (µ1 = 1), conditioned on each event: "Item 1 is chosen", "Item 2 is chosen", "Item 3 is chosen". Conditional mean bias = (0.2, −0.93, −1.14), respectively.]
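This sign pattern (positive bias when item 1 is chosen, negative otherwise) can be reproduced qualitatively with a simplified stand-in: a plain UCB rule run to a fixed horizon instead of lil'UCB with its stopping rule. All names and parameters below are illustrative.

```python
import numpy as np

def conditional_bias_item1(mus=(0.0, 0.0, 0.0), T=60, reps=5000, seed=0):
    """Fixed-horizon UCB stand-in for the lil'UCB experiment: sample for T
    steps with a UCB rule, choose kappa = argmax sample mean, and record
    item 1's sample-mean error.  Returns the average error of item 1's
    sample mean conditioned on each possible chosen arm kappa."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    errs = [[] for _ in range(K)]
    for _ in range(reps):
        sums = np.zeros(K)
        counts = np.zeros(K)
        for t in range(T):
            if t < K:   # warm-up: pull each arm once
                a = t
            else:       # UCB: sample mean plus exploration bonus
                a = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
            sums[a] += mus[a] + rng.standard_normal()
            counts[a] += 1
        kappa = int(np.argmax(sums / counts))
        errs[kappa].append(sums[0] / counts[0] - mus[0])
    return [float(np.mean(e)) for e in errs]
```

With three identical arms, the first entry (item 1 chosen) comes out positive and the other two come out negative, matching the sign pattern in the figure, though not its exact magnitudes.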

SLIDE 56

On conditional versus marginal bias in multi-armed bandits

Jaehyeok Shin, Aaditya Ramdas and Alessandro Rinaldo

Thank you!