The Contextual Bandits Problem: A New, Fast, and Simple Algorithm (PowerPoint PPT presentation)

SLIDE 1

The Contextual Bandits Problem

A New, Fast, and Simple Algorithm

Alekh Agarwal (MSR), Daniel Hsu (Columbia), Satyen Kale (Yahoo), John Langford (MSR), Lihong Li (MSR), Rob Schapire (MSR/Princeton)

SLIDE 2

Example: Ad/Content Placement

  • repeat:
  • 1. website visited by user (with profile, browsing history, etc.)
  • 2. website chooses ad/content to present to user
  • 3. user responds (clicks, leaves page, etc.)
  • goal: make choices that elicit desired user behavior

SLIDE 3

Example: Medical Treatment

  • repeat:
  • 1. doctor visited by patient (with symptoms, test results, etc.)
  • 2. doctor chooses treatment
  • 3. patient responds (recovers, gets worse, etc.)
  • goal: make choices that maximize favorable outcomes

SLIDES 4-5

The Contextual Bandits Problem

  • repeat:
  • 1. learner presented with context
  • 2. learner chooses an action
  • 3. learner observes reward (but only for chosen action)
  • goal: learn to choose actions to maximize rewards
  • general and fundamental problem: how to learn to make intelligent decisions through experience

SLIDES 6-9

Issues

  • classic dilemma:
  • exploit what has already been learned
  • explore to learn which behaviors give best results
  • in addition, must use context effectively
  • many choices of behavior possible
  • may never see same context twice
  • selection bias: if explore while exploiting, will tend to get highly skewed data
  • efficiency

SLIDE 10

This Talk

  • new and general algorithm for contextual bandits
  • optimal statistical performance
  • far faster and simpler than predecessors

SLIDES 11-14

Formal Model

  • repeat
  • 1a. learner observes context xt
  • 1b. reward vector rt ∈ [0, 1]^K chosen (but not observed)
  • 2. learner selects action at ∈ {1, . . . , K}
  • 3. learner receives observed reward rt(at)
  • goal: maximize total reward: ∑_{t=1}^{T} rt(at)
  • assume pairs (xt, rt) chosen at random i.i.d.

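
A minimal Python sketch of this interaction protocol. The toy environment, the uniform-random placeholder learner, and all names below are illustrative assumptions, not part of the talk:

    import random

    K = 3  # number of actions

    def draw_context_and_rewards():
        # i.i.d. draw of (x_t, r_t); a toy stand-in for the real environment
        x = (random.choice(["Male", "Female"]), random.randint(18, 80))
        r = [random.random() for _ in range(K)]  # r_t in [0,1]^K, never shown in full
        return x, r

    def choose_action(x):
        # placeholder learner: ignores the context and explores uniformly
        return random.randrange(K)

    T = 1000
    total_reward = 0.0
    for t in range(T):
        x_t, r_t = draw_context_and_rewards()   # steps 1a/1b
        a_t = choose_action(x_t)                # step 2: learner picks an action
        total_reward += r_t[a_t]                # step 3: only r_t(a_t) is observed
    print("average reward:", total_reward / T)
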
SLIDES 15-24

Example

  Context                Actions
                         1      2      3
  (Male, 50, . . .)      1.0    0.2    0.0
  (Female, 18, . . .)    1.0    0.0    1.0
  (Female, 48, . . .)    0.5    0.1    0.7
  . . .

  total reward = 0.2 + 1.0 + 0.1 + · · ·

SLIDES 25-26

Special Case: Multi-armed Bandit Problem

  • no context
  • try to do as well as best single action
  • tacitly assuming there is one action that gives high rewards
  • e.g.: single treatment/ad/content that is right for entire population

SLIDES 27-31

Policies

  • in contextual bandits setting, can use context to choose actions
  • may exist good policy (decision rule) for choosing actions based on context
  • e.g.:
      If (sex = male) choose action 2
      Else if (age > 45) choose action 1
      Else choose action 3
  • policy π : (context x) → (action a)
  • learner generally working with some rich policy space Π
  • e.g.: all decision trees (“if-then-else” rules)
  • assume Π finite, but typically extremely large
  • tacit assumption: ∃ (unknown) policy π ∈ Π that gives high rewards
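
Concretely, a policy is just a mapping from contexts to actions. A small illustrative Python sketch (the tuple representation of a context and the helper policy_always are assumptions made for the example, not the talk's notation):

    # The slide's decision rule, written as a policy pi: context -> action.
    # A context is assumed to be a (sex, age, ...) tuple; actions are numbered 1..3 as on the slide.
    def example_policy(x):
        sex, age = x[0], x[1]
        if sex == "male":
            return 2
        elif age > 45:
            return 1
        else:
            return 3

    # A tiny explicit policy space Pi; in practice Pi is huge (e.g. all decision trees).
    def policy_always(a):
        return lambda x: a

    Pi = [example_policy] + [policy_always(a) for a in (1, 2, 3)]

    print(example_policy(("male", 50)))    # -> 2
    print(example_policy(("female", 48)))  # -> 1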

SLIDES 32-33

Learning with Context and Policies

  • goal: learn through experimentation to do (almost) as well as best π ∈ Π
  • policies may be very complex and expressive ⇒ powerful approach
  • challenges:
  • Π extremely large
  • need to be learning about all policies simultaneously while also performing as well as the best
  • when action selected, only observe reward for policies that would have chosen same action
  • exploration versus exploitation on a gigantic scale!

SLIDES 34-36

Formal Model (revisited)

  • repeat
  • 1a. learner observes context xt
  • 1b. reward vector rt ∈ [0, 1]^K chosen (but not observed)
  • 2. learner selects action at ∈ {1, . . . , K}
  • 3. learner receives observed reward rt(at)
  • goal: want high total (or average) reward relative to best policy π ∈ Π
  • i.e., want small regret:
      max_{π∈Π} (1/T) ∑_{t=1}^{T} rt(π(xt)) − (1/T) ∑_{t=1}^{T} rt(at)
      (best policy’s average reward minus learner’s average reward)

SLIDES 37-40

An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]

  • Exp4 solves this problem
  • maintains weights over all policies in Π
  • regret is essentially optimal: O(√(K ln |Π| / T))
  • even works for adversarial (i.e., non-random, non-iid) data
  • but: time/space are linear in |Π|
  • too slow if |Π| gigantic
  • seems hopeless to do better for fully general policy spaces
  • this talk: aim for time/space only poly(log |Π|) when Π is “well structured”

SLIDES 41-49

The (Fantasy) Full-Information Setting

  • say we see rewards for all actions

  Context                Actions
                         1      2      3
  (Male, 50, . . .)      1.0    0.2    0.0
  (Female, 18, . . .)    1.0    0.0    1.0
  (Female, 48, . . .)    0.5    0.1    0.7
  . . .

  (markers in the original slide indicate the learner’s action and π’s action in each row)

  learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
  π’s total reward = 0.0 + 1.0 + 0.5 + · · ·

  • for any π, can compute rewards it would have received
  • average is good estimate of π’s expected reward
  • choose empirically best π ∈ Π
  • regret = O(√(ln |Π| / T))

SLIDES 50-52

“Arg-Max Oracle” (AMO)

  • to apply, just need “oracle” (algorithm/subroutine) for finding best π ∈ Π on observed rewards
  • input: (x1, r1), . . . , (xT, rT)
      xt = context
      rt = (rt(1), . . . , rt(K)) = rewards for all actions
  • output: π̂ = arg max_{π∈Π} ∑_{t=1}^{T} rt(π(xt))
  • really just (cost-sensitive) classification:
      context ↔ example
      action ↔ label/class
      policy ↔ classifier
      reward ↔ gain/(negative) cost
  • so: if have “good” classification algorithm for Π, can use to find good policy (in full-information setting)
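
For a small, explicitly enumerated policy class, the AMO can be sketched as a brute-force arg-max; in practice it would be replaced by a cost-sensitive classification learner. Everything below (0-indexed actions, list-of-pairs data format) is an illustrative assumption:

    def amo(data, Pi):
        """Arg-max oracle over an explicit policy list Pi.
        data: list of (x, r) pairs, where r is the full reward vector indexed by
        actions 0..K-1 and each policy pi maps a context x to an action index.
        Returns the policy with the highest total reward on the data."""
        def total_reward(pi):
            return sum(r[pi(x)] for x, r in data)
        return max(Pi, key=total_reward)
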

SLIDES 53-59

But in the Bandit Setting...

  • ...only see rewards for actions taken

  Context                Actions
                         1      2      3
  (Male, 50, . . .)      1.0    0.2    0.0
  (Female, 18, . . .)    1.0    0.0    1.0
  (Female, 48, . . .)    0.5    0.1    0.7
  . . .

  (markers in the original slide indicate the learner’s action and π’s action in each row; only the learner’s rewards are observed)

  learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
  π’s total reward = 0.0 ?? + 1.0 + 0.5 ?? + · · ·

  • for any policy π, only observe π’s rewards on subset of rounds
  • might like to use AMO to find empirically good policy
  • problems:
  • only see some rewards
  • observed rewards highly biased (due to skewed choice of actions)

SLIDES 60-63

Key Question

  • still: AMO is a natural primitive
  • key question: can we solve the contextual bandits problem given access to AMO?
  • can we use an AMO on bandit data by somehow:
  • filling in missing data
  • overcoming bias
  • want:
  • optimal regret
  • time/space bounds poly(log |Π|)
  • AMO is theoretical idealization
  • captures structure in policy space
  • in practice, can use off-the-shelf classification algorithm

SLIDES 64-66

ε-Greedy/Epoch-Greedy
[Langford & Zhang]

  • partially solved by the ε-greedy/epoch-greedy algorithm
  • on each round, choose action:
  • according to “best” policy so far (with probability 1 − ε) [can find with AMO]
  • uniformly at random (with probability ε)
  • regret = O((K ln |Π| / T)^{1/3})
  • fast and simple, but not optimal
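
A rough Python sketch of the ε-greedy rule for one round, with the AMO step written as a brute-force arg-max over an explicit policy list; the data layout and all names are assumptions made for illustration, not the algorithm as specified by Langford & Zhang:

    import random

    def epsilon_greedy_round(x_t, Pi, history, eps, K):
        """history: list of (x, r_hat) pairs with full (estimated) reward vectors,
        so the arg-max below can score every policy; actions are 0..K-1."""
        def est_reward(pi):
            return sum(r_hat[pi(x)] for x, r_hat in history)
        best_pi = max(Pi, key=est_reward) if history else random.choice(Pi)
        if random.random() < eps:
            return random.randrange(K)   # explore: uniform random action (prob. eps)
        return best_pi(x_t)              # exploit: follow the empirically best policy
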
SLIDE 67

“Monster” Algorithm
[Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin & Zhang]

  • RandomizedUCB (aka “Monster”) algorithm gets optimal regret using AMO
  • solves multiple optimization problems using ellipsoid algorithm
  • very slow: calls AMO about Õ(T^4) times on every round

SLIDES 68-69

Main Result

  • new, simple algorithm for contextual bandits with AMO access
  • (nearly) optimal regret: Õ(√(K ln |Π| / T))
  • fast: calls AMO far less than once per round!
  • on average, calls AMO Õ(√(K / (T ln |Π|))) ≪ 1 times per round
  • rest of talk: sketching main ideas of the algorithm

SLIDES 70-71

De-biasing Biased Estimates

  • selection bias is major problem:
  • only observe reward for single action
  • exploring while exploiting leads to inherently biased estimates
  • nevertheless: can use simple trick to get unbiased estimates for all actions

SLIDES 72-76

De-biasing Biased Estimates (cont.)

  • say r(a) = (unknown) reward for action a
      p(a) = (known) probability of choosing a
  • define r̂(a) = r(a)/p(a) if a chosen, else 0
  • then E[r̂(a)] = r(a) — unbiased!
  ∴ can estimate reward for all actions
  ∴ can estimate expected reward for any policy π:
      R̂(π) = (1/(t − 1)) ∑_{τ=1}^{t−1} r̂τ(π(xτ)) = Ê[r̂(π(x))]
  ∴ can estimate regret of any policy π:
      Regret(π) = max_{π̂∈Π} R̂(π̂) − R̂(π)
  • can find maximizing π̂ using AMO
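
The inverse-propensity trick from this slide, as a short Python sketch plus a simulation check of unbiasedness; the toy numbers are made up for illustration:

    import random

    def ips_estimate(a_chosen, r_observed, p, K):
        """r_hat(a) = r(a)/p(a) if a was chosen, else 0.
        Unbiased because E[r_hat(a)] = p(a) * r(a)/p(a) = r(a)."""
        r_hat = [0.0] * K
        r_hat[a_chosen] = r_observed / p[a_chosen]
        return r_hat

    # unbiasedness check by simulation
    K, p, r = 3, [0.5, 0.3, 0.2], [1.0, 0.2, 0.0]
    n, est = 100000, [0.0] * K
    for _ in range(n):
        a = random.choices(range(K), weights=p)[0]
        est = [e + v / n for e, v in zip(est, ips_estimate(a, r[a], p, K))]
    print(est)  # close to [1.0, 0.2, 0.0]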

SLIDES 77-80

Variance Control

  • estimates are unbiased — done?
  • no! — variance may be extremely large
  • can show variance(r̂(a)) ≤ 1/p(a)
  ∴ to get good estimates, must ensure that 1/p(a) not too large
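
The bound follows in one line from the definition of r̂: since r̂(a) = r(a)/p(a) with probability p(a) and 0 otherwise, and r(a) ∈ [0, 1],

    variance(r̂(a)) ≤ E[r̂(a)^2] = p(a) · (r(a)/p(a))^2 = r(a)^2 / p(a) ≤ 1/p(a).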

SLIDES 81-84

Randomizing over Policies

  • need to choose actions (semi-)randomly
  • approach: on each round,
  • compute distribution Q over policy space Π
  • randomly pick π ∼ Q
  • on current context x, choose action π(x)
  • Q induces distribution over actions (for any x): Q(a|x) = Pr_{π∼Q}[π(x) = a]
  • seems will require time/space O(|Π|) to compute Q over space Π
  • will see later how to avoid!
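
A sketch of how a sparse Q (nonzero weight on only a few policies) induces the action distribution Q(·|x); the dict-of-policies representation is an assumption made for illustration, and sparsity is what keeps the computation far below O(|Π|):

    def action_distribution(Q, x, K):
        """Q: dict mapping a policy (a function context -> action in 0..K-1) to its probability.
        Returns the induced distribution Q(.|x) = Pr_{pi ~ Q}[pi(x) = a] over actions."""
        dist = [0.0] * K
        for pi, w in Q.items():
            dist[pi(x)] += w
        return dist
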
SLIDES 85-87

How to Pick Q

  • on each round, want to pick Q with:
  • 1. low (estimated) regret [exploit]
      i.e., choose actions think will give high reward
  • 2. low (estimated) variance [explore]
      i.e., ensure future estimates will be accurate

SLIDES 88-90

Low Regret

  Regret(π) = estimated regret of π

  • so: estimated regret for random π ∼ Q is ∑_π Q(π) Regret(π) = E_{π∼Q}[Regret(π)]
  • want small: ∑_π Q(π) Regret(π) ≤ [small]

SLIDES 91-96

Low Variance

  • 1/Qµ(a|x) = variance of estimate of reward for action a
  • so 1/Qµ(π(x)|x) = variance if action chosen by π
  • can estimate expected variance for actions chosen by π:
      V̂_Q(π) = Ê[1/Qµ(π(x)|x)] = (1/(t − 1)) ∑_{τ=1}^{t−1} 1/Qµ(π(xτ)|xτ)
  • want small: V̂_Q(π) ≤ [small] for all π ∈ Π
  • detail: problematic if Q(a|x) too close to zero
  • to avoid, “smooth” probabilities by occasionally picking action uniformly at random: Qµ(a|x) = (1 − Kµ)Q(a|x) + µ
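
A Python sketch of the smoothing step and of sampling an action from Qµ(·|x). It assumes Q is represented sparsely as a dict over policies and that any leftover mass of a sub-distribution has already been placed on the empirically best policy, so the weights sum to 1:

    import random

    def smoothed_action_distribution(Q, x, K, mu):
        """Q^mu(a|x) = (1 - K*mu) * Q(a|x) + mu: mix with a uniform floor so every
        action has probability at least mu, keeping 1/Q^mu(a|x) bounded."""
        dist = [0.0] * K
        for pi, w in Q.items():
            dist[pi(x)] += w
        return [(1 - K * mu) * q + mu for q in dist]

    def sample_action(Q, x, K, mu):
        probs = smoothed_action_distribution(Q, x, K, mu)
        return random.choices(range(K), weights=probs)[0]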

SLIDES 97-101

Pulling Together

  • want Q such that:
      ∑_π Q(π) Regret(π) ≤ C0
      C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π
      ∑_π Q(π) ≤ 1
  • can fill in constants
  • make easier by:
  • allowing higher variance for policies with higher regret
      (poor policies can be eliminated even with fairly poor performance estimates)
  • only require Q to be sub-distribution
      (can put all remaining mass on π̂ with maximum estimated reward)

SLIDES 102-105

Optimization Problem “OP”

  find Q such that:
      ∑_π Q(π) Regret(π) ≤ C0                          [regret constraint]
      C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π       [variance constraint]
      ∑_π Q(π) ≤ 1                                      [sub-distribution]

  • similar to [Dudík et al.]
  • seems awful:
  • |Π| variables
  • |Π| constraints
  • constraints involve nasty non-linear functions (recall V̂_Q(π) = Ê[1/Qµ(π(x)|x)])
  • not even clear if feasible

SLIDES 106-107

If We Can Solve It...

  • Theorem: if can solve OP on every round (for appropriate constants), then will get regret Õ(√(K ln |Π| / T)).
  • proof idea:
  • regret constraint ensures low regret (if estimates are good enough)
  • variance constraint ensures that they actually will be good enough
  • essentially same approach as [Dudík et al.]

SLIDE 108

How to Solve?

  • basic idea:
  • find a violated constraint
  • (attempt to) fix it
  • repeat

SLIDES 109-110

How to Solve? (cont.)

  • Q ← 0
  • repeat:
  • 1. if Q “too big” then rescale (i.e., multiply Q by scalar < 1)
  • ensures sub-distribution and regret constraints are satisfied
  • 2. find π ∈ Π for which corresponding variance constraint is violated
  • a. if none exists, halt and output Q
  • b. else Q(π) ← Q(π) + α where α = [some formula]
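
The control flow of this solver, as a structural Python skeleton. The constraint checks, the rescaling rule, and the step size α are left as placeholder callables standing in for the quantities defined on these slides; this is a sketch of the loop, not the actual implementation:

    def solve_op(Q, policies, rescale_needed, rescale, violates_variance_constraint, step_size):
        """Coordinate-descent-style loop: rescale if Q is 'too big', then look for a
        policy whose variance constraint is violated and increase its weight.
        In the real algorithm the search below is a single AMO call, not a scan over Pi."""
        while True:
            if rescale_needed(Q):
                rescale(Q)  # multiply Q by a scalar < 1
            violated = next((pi for pi in policies if violates_variance_constraint(Q, pi)), None)
            if violated is None:
                return Q    # all constraints satisfied: Q solves OP
            Q[violated] = Q.get(violated, 0.0) + step_size(Q, violated)  # alpha = [some formula]
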
SLIDES 111-113

More Detail: Rescaling Step

  • 1. [detailed version]
      if ∑_π Q(π)(C0 + Regret(π)) > C0
      then rescale Q (multiply by scalar < 1) so holds with equality
  • after this step, have ∑_π Q(π)(C0 + Regret(π)) ≤ C0, which implies:
      ∑_π Q(π) ≤ 1                      [sub-distribution]
      ∑_π Q(π) Regret(π) ≤ C0           [regret constraint]

SLIDES 114-118

More Detail: Checking Variance Constraints

  • 2. [detailed version]
      find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
  • a. if none exists, halt and output Q
  • b. else Q(π) ← Q(π) + α where α = [some formula]
  • if halts then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π [variance constraint]
  • can execute step using AMO:
  • can construct “pseudo-rewards” r̃τ for which (∀π): C1 · V̂_Q(π) − Regret(π) = ∑_τ r̃τ(π(xτ)) + [constant]
  • so: can maximize with AMO
  • will find violating constraint (if one exists)
  ∴ one AMO call per iteration

SLIDE 119

Why Does It Work?

  • so: if halts, then outputs solution to OP
  • but how long will it take to halt (if ever)?
  • to answer, analyze using a potential function

SLIDES 120-123

A Potential Function

  • define potential function to quantify progress:
      Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))]  +  B · ∑_π Q(π) Regret(π)
              [low variance]                  [low regret]
  • defined for all nonnegative vectors Q over Π (not just sub-distributions)
  • properties:
  • Φ(Q) ≥ 0
  • convex
  • if Q minimizes Φ then Q is a solution to OP
  • key proof step: ∂Φ/∂Q(π) ∝ variance constraint for π
  ∴ OP is feasible
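
A Python sketch of evaluating this potential for a sparse Q, using KL(uniform ‖ Qµ(·|x)) for the relative-entropy term; the constants A, B, µ and the regret estimates are placeholders (their actual values come from the analysis, not from this sketch):

    from math import log

    def potential(Q, contexts, regret_hat, K, mu, A, B):
        """Phi(Q) = A * avg_x KL(uniform || Q^mu(.|x)) + B * sum_pi Q(pi) * Regret(pi)."""
        def q_mu(x):
            # induced, smoothed action distribution for context x
            dist = [0.0] * K
            for pi, w in Q.items():
                dist[pi(x)] += w
            return [(1 - K * mu) * p + mu for p in dist]
        kl = 0.0
        for x in contexts:
            qm = q_mu(x)
            kl += sum((1.0 / K) * log((1.0 / K) / qm[a]) for a in range(K))
        variance_part = kl / len(contexts)
        regret_part = sum(w * regret_hat[pi] for pi, w in Q.items())
        return A * variance_part + B * regret_part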

SLIDES 124-128

Analysis

  • algorithm turns out to be (roughly) coordinate descent on Φ
  • each step adjusts Q along one coordinate direction Q(π)
  • can lower-bound how much Φ decreases on each update
  • can also show rescaling step never increases Φ
  • since Φ ≥ 0, gives bound on number of iterations of the algorithm
  • Theorem: On round t, algorithm halts after at most Õ(√(Kt / ln |Π|)) iterations (and calls to AMO).
  • as corollary, also get bound on sparsity of Q

SLIDES 129-131

Epochs and Warm Start

  • so far, assumed solve OP from scratch on each round
  • naively, gives Õ(T^{3/2}) calls to AMO in T rounds
  • can do much better!
  • first improvement: since data i.i.d., can use same solution for many rounds, i.e., for long “epochs”
  • gives same (near optimal) regret
  • essentially no computation required on rounds where Q not updated
  • second improvement: can initialize algorithm with the previous solution (rather than starting fresh each time)
  • works because each new example can only cause Φ to increase slightly

SLIDE 132

Epochs and Warm Start (cont.)

  • putting together: if only update Q on rounds 1, 4, 9, 16, 25, . . .
  • get same (near optimal) regret
  • only need Õ(√(KT / ln |Π|)) calls to AMO total for entire sequence of T rounds
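
A tiny Python sketch of this update schedule: re-solve OP only on the perfect-square rounds, reusing (warm-starting from) the previous Q in between:

    from math import isqrt

    def is_update_round(t):
        """True exactly on rounds 1, 4, 9, 16, 25, ... (the perfect squares)."""
        return isqrt(t) ** 2 == t

    print([t for t in range(1, 30) if is_update_round(t)])  # [1, 4, 9, 16, 25]
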
SLIDE 133

Summary

  • new algorithm for contextual bandits problem with AMO access
  • (nearly) optimal regret
  • simple and fast
  • only requires an average of Õ(√(K / (T ln |Π|))) ≪ 1 AMO calls per round

SLIDE 134

Open Problems and Future Directions

  • try out experimentally
  • is there an algorithm that uses an online (rather than batch) oracle?
  • is there a lower bound on number of AMO calls necessary to solve this problem?
  • can we find a similar algorithm that handles adversarial data?