SLIDE 1

Nonparametric Bandits with Covariates

Philippe Rigollet, Princeton University
with A. Zeevi (Columbia University)
Support from NSF (DMS-0906424)


SLIDE 4

Example: Real-time web page optimization

Which ad will generate the most clicks/$?

SLIDE 5

Characteristics of the problem

  • A choice must be made for each customer.
  • Cannot observe the outcome of the alternative choice.
  • Try to maximize the rewards.

Exploration vs. exploitation dilemma:
  • Exploration: which one is the best?
  • Exploitation: display the best as much as possible.

SLIDE 6

Two armed bandit problem: setup

  • Two arms (e.g., actions, ads): i ∈ {1, 2}.
  • At time t, a random reward Y_t^{(i)} is observed when arm i is pulled.
  • A policy π is a sequence π_1, π_2, … ∈ {1, 2}, which indicates which arm to pull at each time t.
  • Performance: expected cumulative reward at time n,
      E[ ∑_{t=1}^n Y_t^{(π_t)} ].
  • Goal: maximize reward.

SLIDE 7

Two armed bandit problem: regret

  • Oracle policy π⋆ = (π⋆_1, π⋆_2, …) pulls at each time t the best arm (in expectation):
      π⋆_t = argmax_{i=1,2} E[Y_t^{(i)}].
  • We measure our performance by the regret (see the simulation sketch below)
      R_n(π) = E[ ∑_{t=1}^n ( Y_t^{(π⋆_t)} − Y_t^{(π_t)} ) ].
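To make the regret concrete, here is a minimal simulation sketch (not from the talk); the arm means 0.4 and 0.6 and the uniformly random comparison policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
mu = np.array([0.4, 0.6])  # hypothetical expected rewards (assumption)

# Oracle policy: always pull the arm with the larger expected reward.
oracle_arm = int(np.argmax(mu))

# Naive comparison policy: pull an arm uniformly at random at each time t.
pulls = rng.integers(0, 2, size=n)

# Bernoulli rewards for both arms at every round (a policy sees only its own).
rewards = (rng.random((n, 2)) < mu).astype(float)

regret = rewards[:, oracle_arm].sum() - rewards[np.arange(n), pulls].sum()
print(f"Realized regret over n={n} rounds: {regret:.0f}")
# One Monte Carlo draw of R_n; its expectation is n*(0.6 - 0.5) = 1000.
```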

SLIDE 9

Static Environment

  • The problem is not new: Robbins ('52), Lai & Robbins ('85).
  • Key assumption: static environment, i.e., the (unknown) expected rewards µ_i = E[Y_t^{(i)}] are constant.
  • One way to solve the problem is to use an Upper Confidence Bounds (UCB) policy (see the sketch below).
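For reference, a minimal sketch of the UCB policy in the static setting (following Auer, Cesa-Bianchi & Fischer, 2002); the Bernoulli arms and their means are illustrative assumptions. The confidence radius √(2 log t / s) reappears as B_t(s) on slide 34.

```python
import numpy as np

def ucb(pull_arm, n, n_arms=2):
    """Static UCB: at each t, pull the arm maximizing mean + sqrt(2 log t / pulls)."""
    counts = np.zeros(n_arms)  # number of times each arm was pulled
    sums = np.zeros(n_arms)    # cumulative reward of each arm
    for t in range(1, n + 1):
        if t <= n_arms:
            i = t - 1          # pull each arm once to initialize
        else:
            i = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
        sums[i] += pull_arm(i)
        counts[i] += 1
    return counts

# Hypothetical Bernoulli arms with means 0.4 and 0.6 (assumption).
rng = np.random.default_rng(0)
counts = ucb(lambda i: float(rng.random() < (0.4, 0.6)[i]), n=10_000)
print(counts)  # the better arm should receive almost all pulls
```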


SLIDE 12

Side information

SLIDE 13

Side information and covariates

  • At time t, the reward of each arm i ∈ {1, 2} depends on a covariate X_t ∈ X ⊂ ℝ^d:
      Y_t^{(i)} = f^{(i)}(X_t) + ε_t,  t = 1, 2, …, i = 1, 2,
    with standard regression assumptions on {ε_t} (a generative sketch follows below).
  • A policy is now a sequence of functions π_t : X → {1, 2}.
  • Oracle policy:
      π⋆(x) = argmax_{i=1,2} E[Y_t^{(i)} | X_t = x] = argmax_{i=1,2} f^{(i)}(x).
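A minimal sketch of this data-generating model on X = [0, 1]; the two smooth reward functions and the Gaussian noise level are illustrative assumptions, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical Hölder-smooth reward functions on [0, 1] (assumptions).
f = (lambda x: 0.5 + 0.3 * np.sin(3 * x),
     lambda x: 0.5 + 0.3 * (x - 0.5))

def pull(i, x, sigma=0.05):
    """Observed reward of arm i at covariate x: f^{(i)}(x) + centered noise."""
    return f[i](x) + sigma * rng.standard_normal()

x = rng.random()                       # covariate X_t ~ Uniform[0, 1]
oracle_arm = int(f[1](x) > f[0](x))    # pi*(x) = argmax_i f^{(i)}(x)
print(f"x = {x:.3f}, oracle arm = {oracle_arm + 1}, reward = {pull(oracle_arm, x):.3f}")
```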

SLIDE 14

Assumptions on the expected rewards

Assume now that X = [0, 1].

  • 1. Constant: static model studied by Lai & Robbins:
      f^{(i)}(x) = µ_i, i = 1, 2, with µ_i unknown.
  • 2. Linear: one-armed bandit problem, studied by Goldenshluger & Zeevi (2008):
      f^{(1)}(x) = x − θ with θ unknown; f^{(2)}(x) = 0 is constant and known.
  • 3. Smooth: we assume that the functions are Hölder smooth with parameter β ≤ 1:
      |f^{(i)}(x) − f^{(i)}(x′)| ≤ L|x − x′|^β.
    (Consistency studied by Yang & Zhu, 2002.)

SLIDE 15

Constant rewards

[Figure: constant f^{(1)} and f^{(2)} on [0, 1]]

SLIDE 16

One-armed linear reward

[Figure: linear f^{(1)} and constant f^{(2)} on [0, 1]]

SLIDE 17

Smooth rewards

[Figure: smooth f^{(1)} and f^{(2)} on [0, 1]]

SLIDE 18

Nonparametric bandit with covariates

SLIDE 19

Two armed bandit problem with uniform covariates

  • Covariates: {X_t} i.i.d. in [0, 1] with uniform distribution.
  • Rewards: Y_t^{(i)} ∈ [0, 1] with
      E[Y_t^{(i)} | X_t] = f^{(i)}(X_t),  t = 1, 2, …, i = 1, 2,
    where |f^{(i)}(x) − f^{(i)}(x′)| ≤ L|x − x′|^β, β ≤ 1, i = 1, 2.
  • Oracle policy pulls at time t
      π⋆(X_t) = argmax_{i=1,2} f^{(i)}(X_t).
  • Regret:
      R_n(π) = E[ ∑_{t=1}^n ( f^{(π⋆(X_t))}(X_t) − f^{(π_t(X_t))}(X_t) ) ].

SLIDE 21

Margin condition

Margin condition (a numerical check follows below):
    P( 0 < |f^{(1)}(X) − f^{(2)}(X)| ≤ δ ) ≤ Cδ^α.

  • First used by Goldenshluger and Zeevi (2008) in the one-armed bandit setting.
  • In the one-armed setup, it is an assumption on the distribution of X only.
  • Here the marginal distribution is fixed (e.g., uniform), so the condition measures how close the two functions are.

Proposition: conflict between α and β
    αβ > 1 ⇒ π⋆ is a.s. constant on [0, 1].
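The margin exponent of a given pair of functions can be sanity-checked by Monte Carlo; a sketch with a hypothetical pair crossing at x = 1/2, for which P(0 < |f^{(1)}(X) − f^{(2)}(X)| ≤ δ) = 2δ, i.e. α = 1 (consistent with αβ ≤ 1 when β = 1).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pair (assumption): f1 crosses the constant f2 at x = 0.5.
f1 = lambda x: x
f2 = lambda x: np.full_like(x, 0.5)

x = rng.random(1_000_000)              # X ~ Uniform[0, 1]
gap = np.abs(f1(x) - f2(x))
for delta in (0.2, 0.1, 0.05):
    p = np.mean((gap > 0) & (gap <= delta))
    print(f"delta = {delta}: P(0 < gap <= delta) ~ {p:.4f}  (exact value: 2*delta)")
```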

SLIDE 22

Illustration of the margin condition

[Figure: f^{(1)} and f^{(2)} on [0, 1]]

SLIDE 23

Illustration of the margin condition

[Figure: f^{(1)} and f^{(2)} on [0, 1], α = 1]

SLIDE 24

Illustration of the margin condition

[Figure: f^{(1)} and f^{(2)} on [0, 1], α = 2, β = 1/2]

SLIDE 25

Binning (Exploiting smoothness)

  • Fix M > 1. Consider the bins
      B_j = [j/M, (j + 1)/M).
  • Consider the average reward on each bin,
      f̄_j^{(i)} = (1/p_j) ∫_{B_j} f^{(i)}(x) dx,
    and set Z_t = j iff X_t ∈ B_j (see the sketch below).
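The discretization itself is one line; a sketch mapping a covariate to its bin index Z:

```python
def bin_index(x, M):
    """Z = j iff x lies in B_j = [j/M, (j+1)/M); x = 1 is folded into the last bin."""
    return min(int(x * M), M - 1)

print([bin_index(x, M=8) for x in (0.0, 0.12, 0.5, 0.99, 1.0)])  # [0, 0, 4, 7, 7]
```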

SLIDE 26

Binned UCB

  • For uniformly distributed X_t, we have
      p_j = P(Z_t = j) = P(X_t ∈ B_j) = 1/M.
  • The rewards satisfy
      E[Y_t^{(i)} | Z_t = j] = f̄_j^{(i)},  t = 1, 2, …, i = 1, 2.
  • Play UCB on the pairs (Z_t, Y_t), t = 1, …, n.


SLIDE 29

Binned problem

[Figure: f^{(1)} and f^{(2)} on [0, 1], partitioned into bins]

SLIDE 30

Binned problem

[Figure: binned averages f̄^{(1)} and f̄^{(2)} on [0, 1]]

SLIDE 31

Two armed bandit problem with discrete covariates

  • Covariates: {Z_t} i.i.d. in {1, …, M} with
      P(Z_t = j) = p_j,  t = 1, 2, …
  • Rewards: Y_t^{(i)} ∈ [0, 1] with
      E[Y_t^{(i)} | Z_t = j] = f̄_j^{(i)},  t = 1, 2, …, i = 1, 2.
  • Oracle policy pulls at time t
      π⋆(Z_t) = argmax_{i=1,2} f̄_{Z_t}^{(i)}.


SLIDE 33

Regret

  • Regret given by
      R_n(π) = E[ ∑_{j=1}^M ∑_{t=1}^n ( f̄_j^{(π⋆(j))} − f̄_j^{(π_t(j))} ) 1(Z_t = j) ].
  • Idea: play independently for each j = 1, …, M.

SLIDE 34

UCB policy for discrete covariate

  • Based on Upper Confidence Bounds given by concentration inequalities (Hoeffding or Bernstein):
      B_t(s) := √( (2 log t) / s ).
  • Define the number of times, before time t, at which π̂ prescribed pulling arm i while Z_s = j:
      N_j^{(i)}(t) = ∑_{s=1}^t 1(Z_s = j, π̂_s(Z_s) = i).
  • Average reward collected at those times:
      Ȳ_j^{(i)}(t) = (1 / N_j^{(i)}(t)) ∑_{s=1}^t Y_s^{(i)} 1(Z_s = j, π̂_s(Z_s) = i).

SLIDE 35

A first bound on the regret

Binned UCB policy: conditionally on Z_t = j,
    π̂_t(j) = argmax_{i=1,2} [ Ȳ_j^{(i)}(t) + B_t(N_j^{(i)}(t)) ]
(a code sketch of this policy follows below).

Theorem 1. A first bound on the regret
Denote by ∆_j = |f̄_j^{(1)} − f̄_j^{(2)}|. Then
    R_n(π̂) ≤ C ∑_{j=1}^M ( ∆_j + (log n)/∆_j ).

Direct consequence of Auer, Cesa-Bianchi & Fischer (2002).
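Combining slides 25, 26, and 34, a compact sketch of the binned UCB policy: one independent UCB instance per bin; the reward functions and noise level are the illustrative assumptions used in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(3)

def binned_ucb(f, n, M, sigma=0.05):
    """Binned UCB: run an independent UCB instance inside each bin B_j."""
    counts = np.zeros((M, 2))            # N_j^{(i)}(t)
    sums = np.zeros((M, 2))              # running reward sums per (bin, arm)
    regret = 0.0
    for t in range(1, n + 1):
        x = rng.random()                 # X_t ~ Uniform[0, 1]
        j = min(int(x * M), M - 1)       # Z_t: bin index of X_t
        if counts[j].min() == 0:         # pull each arm once in this bin first
            i = int(np.argmin(counts[j]))
        else:                            # Ybar_j^{(i)}(t) + B_t(N_j^{(i)}(t))
            i = int(np.argmax(sums[j] / counts[j]
                              + np.sqrt(2 * np.log(t) / counts[j])))
        sums[j, i] += f[i](x) + sigma * rng.standard_normal()
        counts[j, i] += 1
        regret += max(f[0](x), f[1](x)) - f[i](x)   # oracle gap at X_t
    return regret

f = (lambda x: 0.5 + 0.3 * np.sin(3 * x), lambda x: 0.5 + 0.3 * (x - 0.5))
n = 20_000
M = max(1, round((n / np.log(n)) ** (1 / 3)))  # M ~ (n/log n)^{1/(2b+1)}, b = 1
print(f"M = {M}, regret = {binned_ucb(f, n, M):.1f}")
```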

SLIDE 36

Margin condition

    ∑_{j=1}^M ( ∆_j + (log n)/∆_j )

  • The previous bound can become arbitrarily large if one of the ∆_j, j = 1, …, M, becomes too small.
  • Using the margin condition we can make local conclusions on the gaps ∆_j: there are few j's such that ∆_j is small.

SLIDE 37

Upper bound

Theorem 2. A bound on the regret for the binned UCB policy
Fix α > 0 and 0 < β ≤ 1, and choose M ∼ (n/log n)^{1/(2β+1)}. Then

    R_n(π̂) ≤ Cn (n/log n)^{−β(1+α)/(2β+1)}  if α < 1,
    R_n(π̂) ≤ Cn (n/log n)^{−2β/(2β+1)}      if α > 1

(a numeric illustration follows below).
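A quick numeric reading of Theorem 2 (a sketch; n and β are illustrative): the effective exponent is β(1 + min(α, 1))/(2β + 1), so the rate stops improving once α reaches 1.

```python
import numpy as np

n, beta = 1e6, 1.0
for alpha in (0.5, 2.0):
    expo = beta * (1 + min(alpha, 1.0)) / (2 * beta + 1)
    bound = n * (n / np.log(n)) ** (-expo)  # Cn (n/log n)^{-expo}, up to C
    print(f"alpha = {alpha}: exponent {expo:.3f}, bound ~ {bound:.2e}")
```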

SLIDE 38

Suboptimality for α > 1

  • If α > 1, the bound becomes
      R_n(π̂) ≤ C ( nM^{−β(1+α)} + M log n ).
  • Minimum for
      M ∼ (n/log n)^{1/(β(1+α)+1)},
    which yields
      R_n(π̂) ≤ Cn (n/log n)^{−β(1+α)/(β(1+α)+1)}.
  • Problem: too many bins. Solution: online/adaptive construction of the bins.

SLIDE 39

Conditional distributions

  • The distribution of Y^{(i)}|X belongs to P = {P_θ, θ ∈ Θ}, where θ is the mean parameter: θ = ∫ x dP_θ(x).
  • Assume that the family P is such that, for any θ, θ′ ∈ Θ ⊂ ℝ,
      K(P_θ, P_θ′) ≤ (θ − θ′)²/κ,  κ > 0.
  • Satisfied in particular for Gaussian (location) and Bernoulli families (the Gaussian case is worked out below).
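For instance, the Gaussian location family satisfies the condition with κ = 2σ², since the Kullback–Leibler divergence between two Gaussians with common variance is (a standard computation, spelled out here for completeness):

```latex
K\bigl(\mathcal{N}(\theta,\sigma^2),\,\mathcal{N}(\theta',\sigma^2)\bigr)
  = \frac{(\theta-\theta')^2}{2\sigma^2}
  \;\le\; \frac{(\theta-\theta')^2}{\kappa},
  \qquad \kappa = 2\sigma^2 .
```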

SLIDE 40

Minimax lower bound

Theorem 3.
Let αβ ≤ 1 and let the covariates {X_t} be uniformly distributed on [0, 1]^d. Assume also that {P_θ^{(i)}, θ ∈ Im f^{(i)}(X)} satisfies the Kullback–Leibler condition above for each i = 1, 2. Then, for any policy π,

    sup_{f^{(1)}, f^{(2)} ∈ Σ(β, L)} R_n(π) ≥ Cn · n^{−β(1+α)/(2β+1)},

for some positive constant C.

SLIDE 41

Comments

  • Same bound as in the full information case (see Audibert & Tsybakov, 07).
  • Gap (of logarithmic size) between upper and lower bounds.

SLIDE 42

Extensions

  • Higher dimension d ≥ 2: bin with respect to the ‖·‖∞ norm (hypercube bins); then
      R_n(π̂) ≤ C(d) n (n/log n)^{−β(1+α)/(2β+d)}.
  • The lower bound also holds.
  • Unknown n: doubling trick.

SLIDE 43

K-armed bandit

  • K-armed bandit problem: the margin condition becomes
      P( 0 < min_{i ≠ i⋆(X)} |f^{(i)}(X) − f^{(i⋆(X))}(X)| ≤ δ ) ≤ Cδ^α,
    where i⋆(x) = argmax_{1≤i≤K} f^{(i)}(x). Then
      R_n(π̂) ≤ CKn (n/log n)^{−β(1+α)/(2β+1)}.

SLIDE 44

Conclusion

  • We introduced a simple model to handle covariates and proposed a naive policy.
  • It attains near-optimal regret rates.
  • Same rates as in the full information case, but new techniques.
  • Current research:
    1. Adaptive partitioning to handle α > 1.
    2. Use of kernel-type (smooth) regression estimators (to fill the gap?).
    3. Time-varying rewards.