SLIDE 1
The Contextual Bandits Problem
A New, Fast, and Simple Algorithm
Alekh Agarwal (MSR), Daniel Hsu (Columbia), Satyen Kale (Yahoo), John Langford (MSR), Lihong Li (MSR), Rob Schapire (MSR/Princeton)
SLIDE 2 Example: Ad/Content Placement
- repeat:
- 1. website visited by user (with profile, browsing history,
etc.)
- 2. website chooses ad/content to present to user
- 3. user responds (clicks, leaves page, etc.)
- goal: make choices that elicit desired user behavior
SLIDE 3 Example: Medical Treatment
- repeat:
- 1. doctor visited by patient (with symptoms, test results,
etc.)
- 2. doctor chooses treatment
- 3. patient responds (recovers, gets worse, etc.)
- goal: make choices that maximize favorable outcomes
SLIDE 4 The Contextual Bandits Problem
- repeat:
- 1. learner presented with context
- 2. learner chooses an action
- 3. learner observes reward (but only for chosen action)
- goal: learn to choose actions to maximize rewards
SLIDE 5 The Contextual Bandits Problem
- repeat:
- 1. learner presented with context
- 2. learner chooses an action
- 3. learner observes reward (but only for chosen action)
- goal: learn to choose actions to maximize rewards
- general and fundamental problem: how to learn to make
intelligent decisions through experience
SLIDE 6 Issues
- classic dilemma:
- exploit what has already been learned
- explore to learn which behaviors give best results
SLIDE 7 Issues
- classic dilemma:
- exploit what has already been learned
- explore to learn which behaviors give best results
- in addition, must use context effectively
- many choices of behavior possible
- may never see same context twice
SLIDE 8 Issues
- classic dilemma:
- exploit what has already been learned
- explore to learn which behaviors give best results
- in addition, must use context effectively
- many choices of behavior possible
- may never see same context twice
- selection bias: if explore while exploiting, will tend to get
highly skewed data
SLIDE 9 Issues
- classic dilemma:
- exploit what has already been learned
- explore to learn which behaviors give best results
- in addition, must use context effectively
- many choices of behavior possible
- may never see same context twice
- selection bias: if explore while exploiting, will tend to get
highly skewed data
SLIDE 10 This Talk
- new and general algorithm for contextual bandits
- optimal statistical performance
- far faster and simpler than predecessors
SLIDE 11 Formal Model
- repeat
- 1a. learner observes context xt
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
SLIDE 12 Formal Model
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
SLIDE 13 Formal Model
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: maximize total reward: ∑_{t=1}^T r_t(a_t)
SLIDE 14 Formal Model
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: maximize total reward: ∑_{t=1}^T r_t(a_t)
- assume pairs (xt, rt) chosen at random i.i.d.
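To make the protocol concrete, here is a minimal Python sketch of the interaction loop under the i.i.d. assumption; the environment, the context fields, and the reward distribution are illustrative stand-ins rather than anything from the talk.

    import random

    K = 3  # number of actions

    def draw_context_and_rewards():
        # illustrative i.i.d. environment: draws the pair (x_t, r_t) at random
        x = {"sex": random.choice(["male", "female"]), "age": random.randint(18, 70)}
        r = {a: random.random() for a in range(1, K + 1)}  # full reward vector, hidden from the learner
        return x, r

    def run(learner_choose, T=1000):
        total = 0.0
        for t in range(T):
            x, r = draw_context_and_rewards()  # steps 1a/1b: context observed, rewards chosen but hidden
            a = learner_choose(x)              # step 2: learner picks an action in {1, ..., K}
            total += r[a]                      # step 3: only the chosen action's reward is revealed
        return total

    # e.g. a learner that ignores the context and always plays action 1:
    print(run(lambda x: 1))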
SLIDE 15
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)
SLIDE 16
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
SLIDE 17
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
SLIDE 18
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
total reward = 0.2 +
SLIDE 19
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)
total reward = 0.2 +
SLIDE 20
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
total reward = 0.2 +
SLIDE 21
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
total reward = 0.2 + 1.0 +
SLIDE 22
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)
total reward = 0.2 + 1.0 +
SLIDE 23
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
total reward = 0.2 + 1.0 +
SLIDE 24
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
total reward = 0.2 + 1.0 + 0.1 + · · ·
SLIDE 25 Special Case: Multi-armed Bandit Problem
- no context
- try to do as well as best single action
SLIDE 26 Special Case: Multi-armed Bandit Problem
- no context
- try to do as well as best single action
- tacitly assuming there is one action that gives high
rewards
- e.g.: single treatment/ad/content that is right for entire
population
SLIDE 27 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
SLIDE 28 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
If (sex = male) choose action 2
Else if (age > 45) choose action 1
Else choose action 3
SLIDE 29 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
If (sex = male) choose action 2
Else if (age > 45) choose action 1
Else choose action 3
- policy π : (context x) → (action a)
SLIDE 30 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
If (sex = male) choose action 2
Else if (age > 45) choose action 1
Else choose action 3
- policy π : (context x) → (action a)
- learner generally working with some rich policy space Π
- e.g.: all decision trees (“if-then-else” rules)
SLIDE 31 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
If (sex = male) choose action 2
Else if (age > 45) choose action 1
Else choose action 3
- policy π : (context x) → (action a)
- learner generally working with some rich policy space Π
- e.g.: all decision trees (“if-then-else” rules)
- assume Π finite, but typically extremely large
- tacit assumption:
∃ (unknown) policy π ∈ Π that gives high rewards
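As a concrete illustration, the decision rule above is just a function from contexts to actions; the context fields and the tiny policy space below are illustrative assumptions, not part of the talk.

    def example_policy(x):
        # the decision rule from the slide, as a policy pi: context -> action in {1, 2, 3}
        if x["sex"] == "male":
            return 2
        elif x["age"] > 45:
            return 1
        else:
            return 3

    def constant_policy(a):
        # a policy that ignores the context and always plays action a
        return lambda x: a

    # a toy finite policy space Pi (the real Pi would be far richer, e.g. all decision trees)
    Pi = [example_policy] + [constant_policy(a) for a in (1, 2, 3)]

    print(example_policy({"sex": "female", "age": 48}))  # -> 1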
SLIDE 32 Learning with Context and Policies
- goal: learn through experimentation to do (almost) as well as
best π ∈ Π
- policies may be very complex and expressive
⇒ powerful approach
SLIDE 33 Learning with Context and Policies
- goal: learn through experimentation to do (almost) as well as
best π ∈ Π
- policies may be very complex and expressive
⇒ powerful approach
- challenges:
- Π extremely large
- need to be learning about all policies simultaneously
while also performing as well as the best
- when action selected, only observe reward for policies
that would have chosen same action
- exploration versus exploitation on a gigantic scale!
SLIDE 34 Formal Model (revisited)
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: want high total (or average) reward
relative to best policy π ∈ Π
SLIDE 35 Formal Model (revisited)
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: want high total (or average) reward
relative to best policy π ∈ Π
(1/T) ∑_{t=1}^T r_t(a_t)   [learner’s average reward]
SLIDE 36 Formal Model (revisited)
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: want high total (or average) reward
relative to best policy π ∈ Π
max_{π∈Π} (1/T) ∑_{t=1}^T r_t(π(x_t))   [best policy’s average reward]
 −  (1/T) ∑_{t=1}^T r_t(a_t)   [learner’s average reward]
SLIDE 37 An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]
- Exp4 solves this problem
- maintains weights over all policies in Π
SLIDE 38 An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]
- Exp4 solves this problem
- maintains weights over all policies in Π
- regret is essentially optimal: O(√(KT ln |Π|))
- even works for adversarial (i.e., non-random, non-iid) data
SLIDE 39 An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]
- Exp4 solves this problem
- maintains weights over all policies in Π
- regret is essentially optimal: O(√(KT ln |Π|))
- even works for adversarial (i.e., non-random, non-iid) data
- but: time/space are linear in |Π|
- too slow if |Π| gigantic
SLIDE 40 An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]
- Exp4 solves this problem
- maintains weights over all policies in Π
- regret is essentially optimal: O(√(KT ln |Π|))
- even works for adversarial (i.e., non-random, non-iid) data
- but: time/space are linear in |Π|
- too slow if |Π| gigantic
- seems hopeless to do better for fully general policy spaces
- this talk: aim for time/space only poly(log |Π|)
when Π is “well structured”
SLIDE 41 The (Fantasy) Full-Information Setting
- say see rewards for all actions
SLIDE 42 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)
(marked entry in each row = learner’s action)
SLIDE 43 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(marked entry in each row = learner’s action)
SLIDE 44 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(marked entry in each row = learner’s action)
learner’s total reward = 0.2 +
SLIDE 45 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
SLIDE 46 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
- for any π, can compute rewards would have received
SLIDE 47 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 + 1.0 + 0.5 + · · ·
- for any π, can compute rewards would have received
SLIDE 48 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 + 1.0 + 0.5 + · · ·
- for any π, can compute rewards would have received
- average is good estimate of π’s expected reward
SLIDE 49 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 + 1.0 + 0.5 + · · ·
- for any π, can compute rewards would have received
- average is good estimate of π’s expected reward
- choose empirically best π ∈ Π
- regret = O(√(ln |Π| / T))
SLIDE 50 “Arg-Max Oracle” (AMO)
- to apply, just need “oracle” (algorithm/subroutine) for finding
best π ∈ Π on observed rewards
- input: (x_1, r_1), . . . , (x_T, r_T), where x_t = context and r_t = (r_t(1), . . . , r_t(K)) = rewards for all actions
- output: π̂ = arg max_{π∈Π} ∑_{t=1}^T r_t(π(x_t))
SLIDE 51 “Arg-Max Oracle” (AMO)
- to apply, just need “oracle” (algorithm/subroutine) for finding
best π ∈ Π on observed rewards
- input: (x_1, r_1), . . . , (x_T, r_T), where x_t = context and r_t = (r_t(1), . . . , r_t(K)) = rewards for all actions
- output: π̂ = arg max_{π∈Π} ∑_{t=1}^T r_t(π(x_t))
- really just (cost-sensitive) classification:
context ↔ example
action ↔ label/class
policy ↔ classifier
reward ↔ gain/(negative) cost
SLIDE 52 “Arg-Max Oracle” (AMO)
- to apply, just need “oracle” (algorithm/subroutine) for finding
best π ∈ Π on observed rewards
- input: (x_1, r_1), . . . , (x_T, r_T), where x_t = context and r_t = (r_t(1), . . . , r_t(K)) = rewards for all actions
- output: π̂ = arg max_{π∈Π} ∑_{t=1}^T r_t(π(x_t))
- really just (cost-sensitive) classification:
context ↔ example
action ↔ label/class
policy ↔ classifier
reward ↔ gain/(negative) cost
- so: if have “good” classification algorithm for Π, can use to
find good policy (in full-information setting)
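Here is a minimal sketch of such an oracle for a small finite policy class, assuming the full-information data format above with each r_t stored as a dict over actions; brute-force enumeration of Π is only for illustration, since in practice the call would go to an off-the-shelf cost-sensitive classification learner.

    def amo(Pi, data):
        # arg-max oracle: return the policy in Pi with the largest total reward on
        # fully-labeled data = [(x_1, r_1), ..., (x_T, r_T)], where each r_t maps
        # every action to its reward
        return max(Pi, key=lambda pi: sum(r[pi(x)] for x, r in data))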
SLIDE 53 But in the Bandit Setting...
- ...only see rewards for actions taken
SLIDE 54 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
SLIDE 55 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
SLIDE 56 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
SLIDE 57 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
- for any policy π, only observe π’s rewards on subset of rounds
SLIDE 58 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 ?? + 1.0 + 0.5 ?? + · · ·
- for any policy π, only observe π’s rewards on subset of rounds
SLIDE 59 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 ?? + 1.0 + 0.5 ?? + · · ·
- for any policy π, only observe π’s rewards on subset of rounds
- might like to use AMO to find empirically good policy
- problems:
- only see some rewards
- observed rewards highly biased
(due to skewed choice of actions)
SLIDE 60 Key Question
- still: AMO is a natural primitive
- key question: can we solve the contextual bandits problem
given access to AMO?
SLIDE 61 Key Question
- still: AMO is a natural primitive
- key question: can we solve the contextual bandits problem
given access to AMO?
- can we use an AMO on bandit data by somehow:
- filling in missing data
- overcoming bias
SLIDE 62 Key Question
- still: AMO is a natural primitive
- key question: can we solve the contextual bandits problem
given access to AMO?
- can we use an AMO on bandit data by somehow:
- filling in missing data
- overcoming bias
- want:
- optimal regret
- time/space bounds poly(log |Π|)
SLIDE 63 Key Question
- still: AMO is a natural primitive
- key question: can we solve the contextual bandits problem
given access to AMO?
- can we use an AMO on bandit data by somehow:
- filling in missing data
- overcoming bias
- want:
- optimal regret
- time/space bounds poly(log |Π|)
- AMO is theoretical idealization
- captures structure in policy space
- in practice, can use off-the-shelf classification algorithm
SLIDE 64 ε-Greedy/Epoch-Greedy
[Langford & Zhang]
- partially solved by the ε-greedy/epoch-greedy algorithm
- on each round, choose action:
- according to “best” policy so far (with probability 1 − ε)
- uniformly at random (with probability ε)
SLIDE 65 ε-Greedy/Epoch-Greedy
[Langford & Zhang]
- partially solved by the ε-greedy/epoch-greedy algorithm
- on each round, choose action:
- according to “best” policy so far (with probability 1 − ε) [can find with AMO]
- uniformly at random (with probability ε)
SLIDE 66 ε-Greedy/Epoch-Greedy
[Langford & Zhang]
- partially solved by the ε-greedy/epoch-greedy algorithm
- on each round, choose action:
- according to “best” policy so far (with probability 1 − ε) [can find with AMO]
- uniformly at random (with probability ε)
- regret: O((K ln |Π| / T)^{1/3})
- fast and simple, but not optimal
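A minimal sketch of this ε-greedy scheme, reusing the amo function sketched earlier; the exploration rate, the per-round oracle call, and the inverse-propensity fill-in of unobserved rewards are illustrative choices rather than the tuned epoch-greedy algorithm.

    import random

    def eps_greedy(Pi, K, environment, T=1000, eps=0.1):
        # with probability eps explore uniformly, otherwise follow the empirically
        # best policy found so far; unobserved rewards are filled in as 0 and the
        # observed one is divided by its (known) probability, so the AMO sees
        # unbiased full reward vectors
        data = []                   # list of (x, r_hat) pairs
        best = random.choice(Pi)    # arbitrary initial policy
        total = 0.0
        for t in range(T):
            x, r = environment()    # r: dict action -> reward; only r[a] will be observed
            a = random.randint(1, K) if random.random() < eps else best(x)
            p = eps / K + (1 - eps) * (1.0 if a == best(x) else 0.0)  # probability of the chosen action
            r_hat = {b: (r[a] / p if b == a else 0.0) for b in range(1, K + 1)}
            data.append((x, r_hat))
            total += r[a]
            best = amo(Pi, data)    # re-fit the empirically best policy with the oracle
        return total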
SLIDE 67 “Monster” Algorithm
[Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin & Zhang]
- RandomizedUCB (aka “Monster”) algorithm gets optimal
regret using AMO
- solves multiple optimization problems using ellipsoid algorithm
- very slow: makes a very large number of AMO calls on every round
SLIDE 68 Main Result
- new, simple algorithm for contextual bandits with AMO access
- (nearly) optimal regret: Õ(√(KT ln |Π|))
- fast: calls AMO far less than once per round!
- on average, calls the AMO only Õ(√(K / (T ln |Π|))) times per round
SLIDE 69 Main Result
- new, simple algorithm for contextual bandits with AMO access
- (nearly) optimal regret: Õ(√(KT ln |Π|))
- fast: calls AMO far less than once per round!
- on average, calls the AMO only Õ(√(K / (T ln |Π|))) times per round
- rest of talk: sketching main ideas of the algorithm
SLIDE 70 De-biasing Biased Estimates
- selection bias is major problem:
- only observe reward for single action
- exploring while exploiting leads to inherently biased
estimates
SLIDE 71 De-biasing Biased Estimates
- selection bias is major problem:
- only observe reward for single action
- exploring while exploiting leads to inherently biased
estimates
- nevertheless: can use simple trick to get unbiased estimates
for all actions
SLIDE 72 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
SLIDE 73 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
- estimate: r̂(a) = r(a)/p(a) if a chosen, else 0
- then E[r̂(a)] = r(a)
SLIDE 74 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
- estimate: r̂(a) = r(a)/p(a) if a chosen, else 0
- E[r̂(a)] = r(a) — unbiased!
∴ can estimate reward for all actions
SLIDE 75 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
- estimate: r̂(a) = r(a)/p(a) if a chosen, else 0
- E[r̂(a)] = r(a) — unbiased!
∴ can estimate reward for all actions
∴ can estimate expected reward for any policy π:
  R̂(π) = (1/(t−1)) ∑_{τ=1}^{t−1} r̂_τ(π(x_τ)) = Ê[r̂(π(x))]
SLIDE 76 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
- estimate: r̂(a) = r(a)/p(a) if a chosen, else 0
- E[r̂(a)] = r(a) — unbiased!
∴ can estimate reward for all actions
∴ can estimate expected reward for any policy π:
  R̂(π) = (1/(t−1)) ∑_{τ=1}^{t−1} r̂_τ(π(x_τ)) = Ê[r̂(π(x))]
∴ can estimate regret of any policy π:
  Regret(π) = R̂(π̂) − R̂(π),  where π̂ = arg max_{π′∈Π} R̂(π′)  (can find π̂ using the AMO)
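A minimal Python sketch of these estimates; the data layout (reward estimates as dicts keyed by action) is an illustrative assumption carried over from the earlier sketches.

    def ips_reward_vector(a_chosen, reward_observed, p, K):
        # inverse-propensity estimate of the full reward vector:
        # r_hat(a) = r(a) / p(a) if a was the chosen action, else 0
        return {a: (reward_observed / p if a == a_chosen else 0.0)
                for a in range(1, K + 1)}

    def estimate_policy_reward(pi, history):
        # R_hat(pi): average estimated reward pi would have received,
        # where history is a list of (x, r_hat) pairs built with ips_reward_vector
        if not history:
            return 0.0
        return sum(r_hat[pi(x)] for x, r_hat in history) / len(history)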
SLIDE 77 Variance Control
- estimates are unbiased — done?
SLIDE 78 Variance Control
- estimates are unbiased — done?
- no! — variance may be extremely large
SLIDE 79 Variance Control
- estimates are unbiased — done?
- no! — variance may be extremely large
- can show: variance(r̂(a)) ≤ 1/p(a)
SLIDE 80 Variance Control
- estimates are unbiased — done?
- no! — variance may be extremely large
- can show: variance(r̂(a)) ≤ 1/p(a)
∴ to get good estimates, must ensure that 1/p(a) is not too large
SLIDE 81 Randomizing over Policies
- need to choose actions (semi-)randomly
SLIDE 82 Randomizing over Policies
- need to choose actions (semi-)randomly
- approach: on each round,
- compute distribution Q over policy space Π
- randomly pick π ∼ Q
- on current context x, choose action π(x)
SLIDE 83 Randomizing over Policies
- need to choose actions (semi-)randomly
- approach: on each round,
- compute distribution Q over policy space Π
- randomly pick π ∼ Q
- on current context x, choose action π(x)
- Q induces distribution over actions (for any x):
Q(a|x) = Pr_{π∼Q}[π(x) = a]
SLIDE 84 Randomizing over Policies
- need to choose actions (semi-)randomly
- approach: on each round,
- compute distribution Q over policy space Π
- randomly pick π ∼ Q
- on current context x, choose action π(x)
- Q induces distribution over actions (for any x):
Q(a|x) = Pr_{π∼Q}[π(x) = a]
- seems will require time/space O(|Π|) to compute Q over
space Π
- will see later how to avoid!
SLIDE 85 How to Pick Q
- on each round, want to pick Q with:
- 1. low (estimated) regret
i.e., choose actions we think will give high reward
SLIDE 86 How to Pick Q
- on each round, want to pick Q with:
- 1. low (estimated) regret
i.e., choose actions we think will give high reward
- 2. low (estimated) variance
i.e., ensure future estimates will be accurate
SLIDE 87 How to Pick Q
- on each round, want to pick Q with:
- 1. low (estimated) regret
[exploit] i.e., choose actions we think will give high reward
- 2. low (estimated) variance
[explore] i.e., ensure future estimates will be accurate
SLIDE 88 Low Regret
Regret(π) = estimated regret of π
SLIDE 89 Low Regret
Regret(π) = estimated regret of π
- so: estimated regret for random π ∼ Q is
  ∑_π Q(π) Regret(π) = E_{π∼Q}[Regret(π)]
SLIDE 90 Low Regret
Regret(π) = estimated regret of π
- so: estimated regret for random π ∼ Q is
  ∑_π Q(π) Regret(π) = E_{π∼Q}[Regret(π)]
- want: ∑_π Q(π) Regret(π) ≤ [small]
SLIDE 91 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
SLIDE 92 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
1/Q(π(x)|x) = variance if action chosen by π
SLIDE 93 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
1/Q(π(x)|x) = variance if action chosen by π
- can estimate expected variance for actions chosen by π:
  V̂_Q(π) = Ê[1/Q(π(x)|x)] = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ)
SLIDE 94 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
1/Q(π(x)|x) = variance if action chosen by π
- can estimate expected variance for actions chosen by π:
  V̂_Q(π) = Ê[1/Q(π(x)|x)] = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ)
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
SLIDE 95 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
1/Q(π(x)|x) = variance if action chosen by π
- can estimate expected variance for actions chosen by π:
  V̂_Q(π) = Ê[1/Q(π(x)|x)] = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ)
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
- detail: problematic if Q(a|x) too close to zero
SLIDE 96 Low Variance
1/Qµ(a|x) = variance of estimate of reward for action a
1/Qµ(π(x)|x) = variance if action chosen by π
- can estimate expected variance for actions chosen by π:
  V̂_Q(π) = Ê[1/Qµ(π(x)|x)] = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Qµ(π(x_τ)|x_τ)
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
- detail: problematic if Q(a|x) too close to zero
- to avoid, “smooth” probabilities by occasionally picking an action uniformly at random: Qµ(a|x) = (1 − Kµ)Q(a|x) + µ
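A minimal sketch of the smoothed projection Qµ(·|x) and of the variance estimate V̂_Q(π), assuming Q is stored as a dict of policy weights summing to 1; that representation is an illustrative choice.

    def smoothed_action_probs(Q, x, K, mu):
        # Q(a|x) = Pr_{pi ~ Q}[pi(x) = a], then Q_mu(a|x) = (1 - K*mu) * Q(a|x) + mu
        probs = {a: 0.0 for a in range(1, K + 1)}
        for pi, w in Q.items():
            probs[pi(x)] += w
        return {a: (1 - K * mu) * p + mu for a, p in probs.items()}

    def estimated_variance(Q, pi, contexts, K, mu):
        # V_hat_Q(pi): average over past contexts of 1 / Q_mu(pi(x)|x)
        if not contexts:
            return 0.0
        return sum(1.0 / smoothed_action_probs(Q, x, K, mu)[pi(x)]
                   for x in contexts) / len(contexts)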
SLIDE 97 Pulling Together
- want: ∑_π Q(π) Regret(π) ≤ [small]
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
SLIDE 98 Pulling Together
- want: ∑_π Q(π) Regret(π) ≤ [small]
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
- want: ∑_π Q(π) = 1
SLIDE 99 Pulling Together
- ∑_π Q(π) Regret(π) ≤ C0
- C1 · V̂_Q(π) ≤ C0  for all π ∈ Π
- ∑_π Q(π) = 1
SLIDE 100 Pulling Together
- ∑_π Q(π) Regret(π) ≤ C0
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π
- ∑_π Q(π) = 1
- can fill in constants
- make easier by:
- allowing higher variance for policies with higher regret
(poor policies can be eliminated even with fairly poor performance estimates)
SLIDE 101 Pulling Together
- ∑_π Q(π) Regret(π) ≤ C0
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π
- ∑_π Q(π) ≤ 1
- can fill in constants
- make easier by:
- allowing higher variance for policies with higher regret
(poor policies can be eliminated even with fairly poor performance estimates)
- only require Q to be sub-distribution
(can put all remaining mass on the policy π̂ with maximum estimated reward)
SLIDE 102 Optimization Problem “OP”
find Q such that:
- ∑_π Q(π) Regret(π) ≤ C0
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π
- ∑_π Q(π) ≤ 1
SLIDE 103 Optimization Problem “OP”
find Q such that:
- ∑_π Q(π) Regret(π) ≤ C0   [regret constraint]
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π   [variance constraint]
- ∑_π Q(π) ≤ 1   [sub-distribution]
SLIDE 104 Optimization Problem “OP”
find Q such that:
- ∑_π Q(π) Regret(π) ≤ C0   [regret constraint]
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π   [variance constraint]
- ∑_π Q(π) ≤ 1   [sub-distribution]
[Dudík et al.]
SLIDE 105 Optimization Problem “OP”
find Q such that:
- ∑_π Q(π) Regret(π) ≤ C0   [regret constraint]
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π   [variance constraint]
- ∑_π Q(π) ≤ 1   [sub-distribution]
[Dudík et al.]
- seems awful:
- |Π| variables
- |Π| constraints
- constraints involve nasty non-linear functions
(recall V̂_Q(π) = Ê[1/Qµ(π(x)|x)])
- not even clear if feasible
SLIDE 106 If We Can Solve It...
- Theorem: if can solve OP on every round (for appropriate constants), then will get regret Õ(√(KT ln |Π|))
SLIDE 107 If We Can Solve It...
- Theorem: if can solve OP on every round (for appropriate constants), then will get regret Õ(√(KT ln |Π|))
- proof idea:
- regret constraint ensures low regret
(if estimates are good enough)
- variance constraint ensures that they actually will be
good enough
- essentially same approach as [Dudík et al.]
SLIDE 108 How to Solve?
- basic idea:
- find a violated constraint
- (attempt to) fix it
- repeat
SLIDE 109 How to Solve? (cont.)
- Q ← 0
- repeat:
- 1. if Q “too big” then rescale
- (i.e., multiply Q by scalar < 1)
- ensures sub-distribution and regret constraints are
satisfied
SLIDE 110 How to Solve? (cont.)
- Q ← 0
- repeat:
- 1. if Q “too big” then rescale
- (i.e., multiply Q by scalar < 1)
- ensures sub-distribution and regret constraints are
satisfied
- 2. find π ∈ Π for which corresponding variance constraint is
violated
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
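A high-level Python sketch of this loop; the update size α and the constants are left as caller-supplied placeholders because the talk elides their exact formulas, and find_violated_policy stands for the single AMO call described on the following slides.

    def solve_op(C0, C1, regret_hat, find_violated_policy, compute_alpha):
        # regret_hat(pi): estimated regret of pi
        # find_violated_policy(Q): a policy whose variance constraint is violated
        #     (found with one AMO call), or None if all constraints hold
        # compute_alpha(Q, pi): update size (exact formula omitted in the talk)
        Q = {}                                        # start from the all-zeros weight vector
        while True:
            # step 1: if Q is "too big", rescale so the sub-distribution
            # and regret constraints hold
            mass = sum(w * (C0 + regret_hat(pi)) for pi, w in Q.items())
            if mass > C0:
                Q = {pi: w * C0 / mass for pi, w in Q.items()}
            # step 2: look for a violated variance constraint
            pi = find_violated_policy(Q)
            if pi is None:
                return Q                              # all constraints satisfied: output Q
            Q[pi] = Q.get(pi, 0.0) + compute_alpha(Q, pi)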
SLIDE 111 More Detail: Rescaling Step
if ∑_π Q(π) (C0 + Regret(π)) > C0,
then rescale Q (multiply by scalar < 1) so that it holds with equality
SLIDE 112 More Detail: Rescaling Step
if ∑_π Q(π) (C0 + Regret(π)) > C0,
then rescale Q (multiply by scalar < 1) so that it holds with equality
⇒ ∑_π Q(π) (C0 + Regret(π)) ≤ C0
SLIDE 113 More Detail: Rescaling Step
if ∑_π Q(π) (C0 + Regret(π)) > C0,
then rescale Q (multiply by scalar < 1) so that it holds with equality
⇒ ∑_π Q(π) (C0 + Regret(π)) ≤ C0
which implies:
∑_π Q(π) ≤ 1   [sub-distribution]
∑_π Q(π) Regret(π) ≤ C0   [regret constraint]
SLIDE 114 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
SLIDE 115 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
- if it halts, then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π   [variance constraint]
SLIDE 116 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
- if it halts, then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π   [variance constraint]
- can execute step using AMO:
- can construct “pseudo-rewards” r̃_τ for which (∀π):
  C1 · V̂_Q(π) − Regret(π) = ∑_τ r̃_τ(π(x_τ)) + [constant]
SLIDE 117 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
- if it halts, then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π   [variance constraint]
- can execute step using AMO:
- can construct “pseudo-rewards” r̃_τ for which (∀π):
  C1 · V̂_Q(π) − Regret(π) = ∑_τ r̃_τ(π(x_τ)) + [constant]
- so: can maximize with AMO
- will find violating constraint (if one exists)
SLIDE 118 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
- if it halts, then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π   [variance constraint]
- can execute step using AMO:
- can construct “pseudo-rewards” r̃_τ for which (∀π):
  C1 · V̂_Q(π) − Regret(π) = ∑_τ r̃_τ(π(x_τ)) + [constant]
- so: can maximize with AMO
- will find violating constraint (if one exists)
∴ one AMO call per iteration
SLIDE 119 Why Does It Work?
- so: if halts, then outputs solution to OP
- but how long will it take to halt (if ever)?
- to answer, analyze using a potential function
SLIDE 120 A Potential Function
- define potential function to quantify progress:
  Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))] + B · ∑_π Q(π) Regret(π)
SLIDE 121 A Potential Function
- define potential function to quantify progress:
  Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))] + B · ∑_π Q(π) Regret(π)
  (second term: low regret)
- defined for all nonnegative vectors Q over Π
(not just sub-distributions)
SLIDE 122 A Potential Function
- define potential function to quantify progress:
  Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))] + B · ∑_π Q(π) Regret(π)
  (second term: low regret)
- defined for all nonnegative vectors Q over Π
(not just sub-distributions)
- properties:
- Φ(Q) ≥ 0
- convex
SLIDE 123 A Potential Function
- define potential function to quantify progress:
  Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))] + B · ∑_π Q(π) Regret(π)
  (second term: low regret)
- defined for all nonnegative vectors Q over Π
(not just sub-distributions)
- properties:
- Φ(Q) ≥ 0
- convex
- if Q minimizes Φ then Q is a solution to OP
- key proof step: ∂Φ/∂Q(π) ∝ variance constraint for π
  ∴ OP is feasible
SLIDE 124 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
SLIDE 125 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
- can lower-bound how much Φ decreases on each update
- can also show rescaling step never increases Φ
SLIDE 126 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
- can lower-bound how much Φ decreases on each update
- can also show rescaling step never increases Φ
- since Φ ≥ 0, gives bound on number of iterations of the
algorithm
SLIDE 127 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
- can lower-bound how much Φ decreases on each update
- can also show rescaling step never increases Φ
- since Φ ≥ 0, gives bound on number of iterations of the
algorithm
- Theorem: On round t, the algorithm halts after at most Õ(√(Kt / ln |Π|)) iterations (and calls to the AMO).
SLIDE 128 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
- can lower-bound how much Φ decreases on each update
- can also show rescaling step never increases Φ
- since Φ ≥ 0, gives bound on number of iterations of the
algorithm
- Theorem: On round t, the algorithm halts after at most Õ(√(Kt / ln |Π|)) iterations (and calls to the AMO).
- as corollary, also get bound on sparsity of Q
SLIDE 129 Epochs and Warm Start
- so far, assumed solve OP from scratch on each round
- naively, gives Õ(√(K T³ / ln |Π|)) calls to the AMO over T rounds
SLIDE 130 Epochs and Warm Start
- so far, assumed solve OP from scratch on each round
- naively, gives Õ(√(K T³ / ln |Π|)) calls to the AMO over T rounds
- can do much better!
- first improvement: since data iid, can use same solution for
many rounds, i.e., for long “epochs”
- gives same (near optimal) regret
- essentially no computation required on rounds where Q
not updated
SLIDE 131 Epochs and Warm Start
- so far, assumed solve OP from scratch on each round
- naively, gives Õ(√(K T³ / ln |Π|)) calls to the AMO over T rounds
- can do much better!
- first improvement: since data iid, can use same solution for
many rounds, i.e., for long “epochs”
- gives same (near optimal) regret
- essentially no computation required on rounds where Q
not updated
- second improvement: can initialize algorithm with the
previous solution (rather than starting fresh each time)
- works because each new example can only cause Φ to
increase slightly
SLIDE 132 Epochs and Warm Start (cont.)
if only update Q on rounds 1, 4, 9, 16, 25, . . .
- get same (near optimal) regret
- only need Õ(√(KT / ln |Π|)) calls to the AMO in total over the entire sequence of T rounds
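A tiny sketch of that epoch schedule (Q is recomputed only on square-numbered rounds and reused in between); the function name is illustrative.

    def epoch_schedule(T):
        # rounds on which Q is recomputed: the square numbers 1, 4, 9, 16, ...
        rounds, m = [], 1
        while m * m <= T:
            rounds.append(m * m)
            m += 1
        return rounds

    print(epoch_schedule(30))  # -> [1, 4, 9, 16, 25]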
SLIDE 133 Summary
- new algorithm for contextual bandits problem with AMO
access
- (nearly) optimal regret
- simple and fast
- only requires an average of Õ(√(K / (T ln |Π|))) AMO calls per round
SLIDE 134 Open Problems and Future Directions
- try out experimentally
- is there an algorithm that uses an online (rather than batch) oracle?
- is there a lower bound on number of AMO calls necessary to
solve this problem?
- can we find a similar algorithm that handles adversarial data?