SLIDE 1
The Contextual Bandits Problem
A New, Fast, and Simple Algorithm
Alekh Agarwal (MSR), Daniel Hsu (Columbia), Satyen Kale (Yahoo), John Langford (MSR), Lihong Li (MSR), Rob Schapire (MSR/Princeton)
SLIDE 2 Example: Ad/Content Placement
- repeat:
- 1. website visited by user (with profile, browsing history,
etc.)
- 2. website chooses ad/content to present to user
- 3. user responds (clicks, leaves page, etc.)
- goal: make choices that elicit desired user behavior
SLIDE 3 Example: Medical Treatment
- repeat:
- 1. doctor visited by patient (with symptoms, test results,
etc.)
- 2. doctor chooses treatment
- 3. patient responds (recovers, gets worse, etc.)
- goal: make choices that maximize favorable outcomes
SLIDE 4 The Contextual Bandits Problem
- repeat:
- 1. learner presented with context
- 2. learner chooses an action
- 3. learner observes reward (but only for chosen action)
- goal: learn to choose actions to maximize rewards
SLIDE 5 The Contextual Bandits Problem
- repeat:
- 1. learner presented with context
- 2. learner chooses an action
- 3. learner observes reward (but only for chosen action)
- goal: learn to choose actions to maximize rewards
- general and fundamental problem: how to learn to make
intelligent decisions through experience
SLIDE 6 Issues
- classic dilemma:
- exploit what has already been learned
- explore to learn which behaviors give best results
SLIDE 7 Issues
- classic dilemma:
- exploit what has already been learned
- explore to learn which behaviors give best results
- in addition, must use context effectively
- many choices of behavior possible
- may never see same context twice
SLIDE 8 Issues
- classic dilemma:
- exploit what has already been learned
- explore to learn which behaviors give best results
- in addition, must use context effectively
- many choices of behavior possible
- may never see same context twice
- selection bias: if explore while exploiting, will tend to get
highly skewed data
SLIDE 9 Issues
- classic dilemma:
- exploit what has already been learned
- explore to learn which behaviors give best results
- in addition, must use context effectively
- many choices of behavior possible
- may never see same context twice
- selection bias: if explore while exploiting, will tend to get
highly skewed data
SLIDE 10 This Talk
- new and general algorithm for contextual bandits
- optimal statistical performance
- far faster and simpler than predecessors
SLIDE 11 Formal Model
- repeat
- 1a. learner observes context xt
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
SLIDE 12 Formal Model
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
SLIDE 13 Formal Model
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: maximize total reward: ∑_{t=1}^T r_t(a_t)
SLIDE 14 Formal Model
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: maximize total reward: ∑_{t=1}^T r_t(a_t)
- assume pairs (xt, rt) chosen at random i.i.d.
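To make the protocol concrete, here is a minimal Python sketch of the interaction loop under the i.i.d. assumption; the environment, the context fields, and the reward distribution are illustrative stand-ins rather than anything from the talk.

    import random

    K = 3  # number of actions

    def draw_context_and_rewards():
        # illustrative i.i.d. environment: draws the pair (x_t, r_t) at random
        x = {"sex": random.choice(["male", "female"]), "age": random.randint(18, 70)}
        r = {a: random.random() for a in range(1, K + 1)}  # full reward vector, hidden from the learner
        return x, r

    def run(learner_choose, T=1000):
        total = 0.0
        for t in range(T):
            x, r = draw_context_and_rewards()  # steps 1a/1b: context observed, rewards chosen but hidden
            a = learner_choose(x)              # step 2: learner picks an action in {1, ..., K}
            total += r[a]                      # step 3: only the chosen action's reward is revealed
        return total

    # e.g. a learner that ignores the context and always plays action 1:
    print(run(lambda x: 1))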
SLIDE 15
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)
SLIDE 16
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
SLIDE 17
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
SLIDE 18
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
total reward = 0.2 +
SLIDE 19
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)
total reward = 0.2 +
SLIDE 20
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
total reward = 0.2 +
SLIDE 21
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
total reward = 0.2 + 1.0 +
SLIDE 22
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)
total reward = 0.2 + 1.0 +
SLIDE 23
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
total reward = 0.2 + 1.0 +
SLIDE 24
Example
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
total reward = 0.2 + 1.0 + 0.1 + · · ·
SLIDE 25 Special Case: Multi-armed Bandit Problem
- no context
- try to do as well as best single action
SLIDE 26 Special Case: Multi-armed Bandit Problem
- no context
- try to do as well as best single action
- tacitly assuming there is one action that gives high
rewards
- e.g.: single treatment/ad/content that is right for entire
population
SLIDE 27 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
SLIDE 28 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
If (sex = male) choose action 2
Else if (age > 45) choose action 1
Else choose action 3
SLIDE 29 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
If (sex = male) choose action 2
Else if (age > 45) choose action 1
Else choose action 3
- policy π : (context x) → (action a)
SLIDE 30 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
If (sex = male) choose action 2
Else if (age > 45) choose action 1
Else choose action 3
- policy π : (context x) → (action a)
- learner generally working with some rich policy space Π
- e.g.: all decision trees (“if-then-else” rules)
SLIDE 31 Policies
- in contextual bandits setting, can use context to choose
actions
- may exist good policy (decision rule) for choosing actions
based on context
If (sex = male) choose action 2
Else if (age > 45) choose action 1
Else choose action 3
- policy π : (context x) → (action a)
- learner generally working with some rich policy space Π
- e.g.: all decision trees (“if-then-else” rules)
- assume Π finite, but typically extremely large
- tacit assumption:
∃ (unknown) policy π ∈ Π that gives high rewards
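As a concrete illustration, the decision rule above is just a function from contexts to actions; the context fields and the tiny policy space below are illustrative assumptions, not part of the talk.

    def example_policy(x):
        # the decision rule from the slide, as a policy pi: context -> action in {1, 2, 3}
        if x["sex"] == "male":
            return 2
        elif x["age"] > 45:
            return 1
        else:
            return 3

    def constant_policy(a):
        # a policy that ignores the context and always plays action a
        return lambda x: a

    # a toy finite policy space Pi (the real Pi would be far richer, e.g. all decision trees)
    Pi = [example_policy] + [constant_policy(a) for a in (1, 2, 3)]

    print(example_policy({"sex": "female", "age": 48}))  # -> 1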
SLIDE 32 Learning with Context and Policies
- goal: learn through experimentation to do (almost) as well as
best π ∈ Π
- policies may be very complex and expressive
⇒ powerful approach
SLIDE 33 Learning with Context and Policies
- goal: learn through experimentation to do (almost) as well as
best π ∈ Π
- policies may be very complex and expressive
⇒ powerful approach
- challenges:
- Π extremely large
- need to be learning about all policies simultaneously
while also performing as well as the best
- when action selected, only observe reward for policies
that would have chosen same action
- exploration versus exploitation on a gigantic scale!
SLIDE 34 Formal Model (revisited)
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: want high total (or average) reward
relative to best policy π ∈ Π
SLIDE 35 Formal Model (revisited)
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: want high total (or average) reward
relative to best policy π ∈ Π
(1/T) ∑_{t=1}^T r_t(a_t)   [learner’s average reward]
SLIDE 36 Formal Model (revisited)
- repeat
- 1a. learner observes context xt
- 1b. reward vector rt ∈ [0, 1]K chosen (but not observed)
- 2. learner selects action at ∈ {1, . . . , K}
- 3. learner receives observed reward rt(at)
- goal: want high total (or average) reward
relative to best policy π ∈ Π
max_{π∈Π} (1/T) ∑_{t=1}^T r_t(π(x_t))   [best policy’s average reward]
 −  (1/T) ∑_{t=1}^T r_t(a_t)   [learner’s average reward]
SLIDE 37 An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]
- Exp4 solves this problem
- maintains weights over all policies in Π
SLIDE 38 An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]
- Exp4 solves this problem
- maintains weights over all policies in Π
- regret is essentially optimal: O(√(KT ln |Π|))
- even works for adversarial (i.e., non-random, non-iid) data
SLIDE 39 An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]
- Exp4 solves this problem
- maintains weights over all policies in Π
- regret is essentially optimal: O(√(KT ln |Π|))
- even works for adversarial (i.e., non-random, non-iid) data
- but: time/space are linear in |Π|
- too slow if |Π| gigantic
SLIDE 40 An Algorithm that Solves this Problem
[Auer, Cesa-Bianchi, Freund, Schapire]
- Exp4 solves this problem
- maintains weights over all policies in Π
- regret is essentially optimal: O(√(KT ln |Π|))
- even works for adversarial (i.e., non-random, non-iid) data
- but: time/space are linear in |Π|
- too slow if |Π| gigantic
- seems hopeless to do better for fully general policy spaces
- this talk: aim for time/space only poly(log |Π|)
when Π is “well structured”
SLIDE 41 The (Fantasy) Full-Information Setting
- say see rewards for all actions
SLIDE 42 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)
(marked entry in each row = learner’s action)
SLIDE 43 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(marked entry in each row = learner’s action)
SLIDE 44 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(marked entry in each row = learner’s action)
learner’s total reward = 0.2 +
SLIDE 45 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
SLIDE 46 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
- for any π, can compute rewards would have received
SLIDE 47 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 + 1.0 + 0.5 + · · ·
- for any π, can compute rewards would have received
SLIDE 48 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 + 1.0 + 0.5 + · · ·
- for any π, can compute rewards would have received
- average is good estimate of π’s expected reward
SLIDE 49 The (Fantasy) Full-Information Setting
- say see rewards for all actions
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 + 1.0 + 0.5 + · · ·
- for any π, can compute rewards would have received
- average is good estimate of π’s expected reward
- choose empirically best π ∈ Π
- regret = O(√(ln |Π| / T))
SLIDE 50 “Arg-Max Oracle” (AMO)
- to apply, just need “oracle” (algorithm/subroutine) for finding
best π ∈ Π on observed rewards
- input: (x_1, r_1), . . . , (x_T, r_T), where x_t = context and r_t = (r_t(1), . . . , r_t(K)) = rewards for all actions
- output: π̂ = arg max_{π∈Π} ∑_{t=1}^T r_t(π(x_t))
SLIDE 51 “Arg-Max Oracle” (AMO)
- to apply, just need “oracle” (algorithm/subroutine) for finding
best π ∈ Π on observed rewards
- input: (x_1, r_1), . . . , (x_T, r_T), where x_t = context and r_t = (r_t(1), . . . , r_t(K)) = rewards for all actions
- output: π̂ = arg max_{π∈Π} ∑_{t=1}^T r_t(π(x_t))
- really just (cost-sensitive) classification:
context ↔ example
action ↔ label/class
policy ↔ classifier
reward ↔ gain/(negative) cost
SLIDE 52 “Arg-Max Oracle” (AMO)
- to apply, just need “oracle” (algorithm/subroutine) for finding
best π ∈ Π on observed rewards
- input: (x_1, r_1), . . . , (x_T, r_T), where x_t = context and r_t = (r_t(1), . . . , r_t(K)) = rewards for all actions
- output: π̂ = arg max_{π∈Π} ∑_{t=1}^T r_t(π(x_t))
- really just (cost-sensitive) classification:
context ↔ example
action ↔ label/class
policy ↔ classifier
reward ↔ gain/(negative) cost
- so: if have “good” classification algorithm for Π, can use to
find good policy (in full-information setting)
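Here is a minimal sketch of such an oracle for a small finite policy class, assuming the full-information data format above with each r_t stored as a dict over actions; brute-force enumeration of Π is only for illustration, since in practice the call would go to an off-the-shelf cost-sensitive classification learner.

    def amo(Pi, data):
        # arg-max oracle: return the policy in Pi with the largest total reward on
        # fully-labeled data = [(x_1, r_1), ..., (x_T, r_T)], where each r_t maps
        # every action to its reward
        return max(Pi, key=lambda pi: sum(r[pi(x)] for x, r in data))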
SLIDE 53 But in the Bandit Setting...
- ...only see rewards for actions taken
SLIDE 54 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
SLIDE 55 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
SLIDE 56 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(marked entry in each row = learner’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
SLIDE 57 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
- for any policy π, only observe π’s rewards on subset of rounds
SLIDE 58 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 ?? + 1.0 + 0.5 ?? + · · ·
- for any policy π, only observe π’s rewards on subset of rounds
SLIDE 59 But in the Bandit Setting...
- ...only see rewards for actions taken
Context                Action 1   Action 2   Action 3
(Male, 50, . . .)         1.0        0.2        0.0
(Female, 18, . . .)       1.0        0.0        1.0
(Female, 48, . . .)       0.5        0.1        0.7
. . .
(one marked entry per row = learner’s action, the other = π’s action)
learner’s total reward = 0.2 + 1.0 + 0.1 + · · ·
π’s total reward = 0.0 ?? + 1.0 + 0.5 ?? + · · ·
- for any policy π, only observe π’s rewards on subset of rounds
- might like to use AMO to find empirically good policy
- problems:
- only see some rewards
- observed rewards highly biased
(due to skewed choice of actions)
SLIDE 60 Key Question
- still: AMO is a natural primitive
- key question: can we solve the contextual bandits problem
given access to AMO?
SLIDE 61 Key Question
- still: AMO is a natural primitive
- key question: can we solve the contextual bandits problem
given access to AMO?
- can we use an AMO on bandit data by somehow:
- filling in missing data
- overcoming bias
SLIDE 62 Key Question
- still: AMO is a natural primitive
- key question: can we solve the contextual bandits problem
given access to AMO?
- can we use an AMO on bandit data by somehow:
- filling in missing data
- overcoming bias
- want:
- optimal regret
- time/space bounds poly(log |Π|)
SLIDE 63 Key Question
- still: AMO is a natural primitive
- key question: can we solve the contextual bandits problem
given access to AMO?
- can we use an AMO on bandit data by somehow:
- filling in missing data
- overcoming bias
- want:
- optimal regret
- time/space bounds poly(log |Π|)
- AMO is theoretical idealization
- captures structure in policy space
- in practice, can use off-the-shelf classification algorithm
SLIDE 64 ε-Greedy/Epoch-Greedy
[Langford & Zhang]
- partially solved by the ε-greedy/epoch-greedy algorithm
- on each round, choose action:
- according to “best” policy so far (with probability 1 − ε)
- uniformly at random (with probability ε)
SLIDE 65 ε-Greedy/Epoch-Greedy
[Langford & Zhang]
- partially solved by the ε-greedy/epoch-greedy algorithm
- on each round, choose action:
- according to “best” policy so far (with probability 1 − ε) [can find with AMO]
- uniformly at random (with probability ε)
SLIDE 66 ε-Greedy/Epoch-Greedy
[Langford & Zhang]
- partially solved by the ε-greedy/epoch-greedy algorithm
- on each round, choose action:
- according to “best” policy so far (with probability 1 − ε) [can find with AMO]
- uniformly at random (with probability ε)
- regret: O((K ln |Π| / T)^{1/3})
- fast and simple, but not optimal
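A minimal sketch of this ε-greedy scheme, reusing the amo function sketched earlier; the exploration rate, the per-round oracle call, and the inverse-propensity fill-in of unobserved rewards are illustrative choices rather than the tuned epoch-greedy algorithm.

    import random

    def eps_greedy(Pi, K, environment, T=1000, eps=0.1):
        # with probability eps explore uniformly, otherwise follow the empirically
        # best policy found so far; unobserved rewards are filled in as 0 and the
        # observed one is divided by its (known) probability, so the AMO sees
        # unbiased full reward vectors
        data = []                   # list of (x, r_hat) pairs
        best = random.choice(Pi)    # arbitrary initial policy
        total = 0.0
        for t in range(T):
            x, r = environment()    # r: dict action -> reward; only r[a] will be observed
            a = random.randint(1, K) if random.random() < eps else best(x)
            p = eps / K + (1 - eps) * (1.0 if a == best(x) else 0.0)  # probability of the chosen action
            r_hat = {b: (r[a] / p if b == a else 0.0) for b in range(1, K + 1)}
            data.append((x, r_hat))
            total += r[a]
            best = amo(Pi, data)    # re-fit the empirically best policy with the oracle
        return total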
SLIDE 67 “Monster” Algorithm
[Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin & Zhang]
- RandomizedUCB (aka “Monster”) algorithm gets optimal
regret using AMO
- solves multiple optimization problems using ellipsoid algorithm
- very slow: makes a very large number of AMO calls on every round
SLIDE 68 Main Result
- new, simple algorithm for contextual bandits with AMO access
- (nearly) optimal regret: Õ(√(KT ln |Π|))
- fast: calls AMO far less than once per round!
- on average, calls the AMO only Õ(√(K / (T ln |Π|))) times per round
SLIDE 69 Main Result
- new, simple algorithm for contextual bandits with AMO access
- (nearly) optimal regret: Õ(√(KT ln |Π|))
- fast: calls AMO far less than once per round!
- on average, calls the AMO only Õ(√(K / (T ln |Π|))) times per round
- rest of talk: sketching main ideas of the algorithm
SLIDE 70 De-biasing Biased Estimates
- selection bias is major problem:
- only observe reward for single action
- exploring while exploiting leads to inherently biased
estimates
SLIDE 71 De-biasing Biased Estimates
- selection bias is major problem:
- only observe reward for single action
- exploring while exploiting leads to inherently biased
estimates
- nevertheless: can use simple trick to get unbiased estimates
for all actions
SLIDE 72 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
SLIDE 73 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
- estimate: r̂(a) = r(a)/p(a) if a chosen, else 0
- then E[r̂(a)] = r(a)
SLIDE 74 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
- estimate: r̂(a) = r(a)/p(a) if a chosen, else 0
- E[r̂(a)] = r(a) — unbiased!
∴ can estimate reward for all actions
SLIDE 75 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
- estimate: r̂(a) = r(a)/p(a) if a chosen, else 0
- E[r̂(a)] = r(a) — unbiased!
∴ can estimate reward for all actions
∴ can estimate expected reward for any policy π:
  R̂(π) = (1/(t−1)) ∑_{τ=1}^{t−1} r̂_τ(π(x_τ)) = Ê[r̂(π(x))]
SLIDE 76 De-biasing Biased Estimates (cont.)
- say r(a) = (unknown) reward for action a
p(a) = (known) probability of choosing a
- estimate: r̂(a) = r(a)/p(a) if a chosen, else 0
- E[r̂(a)] = r(a) — unbiased!
∴ can estimate reward for all actions
∴ can estimate expected reward for any policy π:
  R̂(π) = (1/(t−1)) ∑_{τ=1}^{t−1} r̂_τ(π(x_τ)) = Ê[r̂(π(x))]
∴ can estimate regret of any policy π:
  Regret(π) = R̂(π̂) − R̂(π),  where π̂ = arg max_{π′∈Π} R̂(π′)  (can find π̂ using the AMO)
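A minimal Python sketch of these estimates; the data layout (reward estimates as dicts keyed by action) is an illustrative assumption carried over from the earlier sketches.

    def ips_reward_vector(a_chosen, reward_observed, p, K):
        # inverse-propensity estimate of the full reward vector:
        # r_hat(a) = r(a) / p(a) if a was the chosen action, else 0
        return {a: (reward_observed / p if a == a_chosen else 0.0)
                for a in range(1, K + 1)}

    def estimate_policy_reward(pi, history):
        # R_hat(pi): average estimated reward pi would have received,
        # where history is a list of (x, r_hat) pairs built with ips_reward_vector
        if not history:
            return 0.0
        return sum(r_hat[pi(x)] for x, r_hat in history) / len(history)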
SLIDE 77 Variance Control
- estimates are unbiased — done?
SLIDE 78 Variance Control
- estimates are unbiased — done?
- no! — variance may be extremely large
SLIDE 79 Variance Control
- estimates are unbiased — done?
- no! — variance may be extremely large
- can show: variance(r̂(a)) ≤ 1/p(a)
SLIDE 80 Variance Control
- estimates are unbiased — done?
- no! — variance may be extremely large
- can show: variance(r̂(a)) ≤ 1/p(a)
∴ to get good estimates, must ensure that 1/p(a) is not too large
SLIDE 81 Randomizing over Policies
- need to choose actions (semi-)randomly
SLIDE 82 Randomizing over Policies
- need to choose actions (semi-)randomly
- approach: on each round,
- compute distribution Q over policy space Π
- randomly pick π ∼ Q
- on current context x, choose action π(x)
SLIDE 83 Randomizing over Policies
- need to choose actions (semi-)randomly
- approach: on each round,
- compute distribution Q over policy space Π
- randomly pick π ∼ Q
- on current context x, choose action π(x)
- Q induces distribution over actions (for any x):
Q(a|x) = Pr_{π∼Q}[π(x) = a]
SLIDE 84 Randomizing over Policies
- need to choose actions (semi-)randomly
- approach: on each round,
- compute distribution Q over policy space Π
- randomly pick π ∼ Q
- on current context x, choose action π(x)
- Q induces distribution over actions (for any x):
Q(a|x) = Pr_{π∼Q}[π(x) = a]
- seems will require time/space O(|Π|) to compute Q over
space Π
- will see later how to avoid!
SLIDE 85 How to Pick Q
- on each round, want to pick Q with:
- 1. low (estimated) regret
i.e., choose actions we think will give high reward
SLIDE 86 How to Pick Q
- on each round, want to pick Q with:
- 1. low (estimated) regret
i.e., choose actions we think will give high reward
- 2. low (estimated) variance
i.e., ensure future estimates will be accurate
SLIDE 87 How to Pick Q
- on each round, want to pick Q with:
- 1. low (estimated) regret
[exploit] i.e., choose actions we think will give high reward
- 2. low (estimated) variance
[explore] i.e., ensure future estimates will be accurate
SLIDE 88 Low Regret
Regret(π) = estimated regret of π
SLIDE 89 Low Regret
Regret(π) = estimated regret of π
- so: estimated regret for random π ∼ Q is
  ∑_π Q(π) Regret(π) = E_{π∼Q}[Regret(π)]
SLIDE 90 Low Regret
Regret(π) = estimated regret of π
- so: estimated regret for random π ∼ Q is
  ∑_π Q(π) Regret(π) = E_{π∼Q}[Regret(π)]
- want: ∑_π Q(π) Regret(π) ≤ [small]
SLIDE 91 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
SLIDE 92 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
1/Q(π(x)|x) = variance if action chosen by π
SLIDE 93 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
1/Q(π(x)|x) = variance if action chosen by π
- can estimate expected variance for actions chosen by π:
  V̂_Q(π) = Ê[1/Q(π(x)|x)] = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ)
SLIDE 94 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
1/Q(π(x)|x) = variance if action chosen by π
- can estimate expected variance for actions chosen by π:
  V̂_Q(π) = Ê[1/Q(π(x)|x)] = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ)
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
SLIDE 95 Low Variance
1/Q(a|x) = variance of estimate of reward for action a
1/Q(π(x)|x) = variance if action chosen by π
- can estimate expected variance for actions chosen by π:
  V̂_Q(π) = Ê[1/Q(π(x)|x)] = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ)
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
- detail: problematic if Q(a|x) too close to zero
SLIDE 96 Low Variance
1/Qµ(a|x) = variance of estimate of reward for action a
1/Qµ(π(x)|x) = variance if action chosen by π
- can estimate expected variance for actions chosen by π:
  V̂_Q(π) = Ê[1/Qµ(π(x)|x)] = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Qµ(π(x_τ)|x_τ)
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
- detail: problematic if Q(a|x) too close to zero
- to avoid, “smooth” probabilities by occasionally picking an action uniformly at random: Qµ(a|x) = (1 − Kµ)Q(a|x) + µ
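A minimal sketch of the smoothed projection Qµ(·|x) and of the variance estimate V̂_Q(π), assuming Q is stored as a dict of policy weights summing to 1; that representation is an illustrative choice.

    def smoothed_action_probs(Q, x, K, mu):
        # Q(a|x) = Pr_{pi ~ Q}[pi(x) = a], then Q_mu(a|x) = (1 - K*mu) * Q(a|x) + mu
        probs = {a: 0.0 for a in range(1, K + 1)}
        for pi, w in Q.items():
            probs[pi(x)] += w
        return {a: (1 - K * mu) * p + mu for a, p in probs.items()}

    def estimated_variance(Q, pi, contexts, K, mu):
        # V_hat_Q(pi): average over past contexts of 1 / Q_mu(pi(x)|x)
        if not contexts:
            return 0.0
        return sum(1.0 / smoothed_action_probs(Q, x, K, mu)[pi(x)]
                   for x in contexts) / len(contexts)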
SLIDE 97 Pulling Together
- want: ∑_π Q(π) Regret(π) ≤ [small]
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
SLIDE 98 Pulling Together
- want: ∑_π Q(π) Regret(π) ≤ [small]
- want: V̂_Q(π) ≤ [small] for all π ∈ Π
- want: ∑_π Q(π) = 1
SLIDE 99 Pulling Together
- ∑_π Q(π) Regret(π) ≤ C0
- C1 · V̂_Q(π) ≤ C0  for all π ∈ Π
- ∑_π Q(π) = 1
SLIDE 100 Pulling Together
- ∑_π Q(π) Regret(π) ≤ C0
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π
- ∑_π Q(π) = 1
- can fill in constants
- make easier by:
- allowing higher variance for policies with higher regret
(poor policies can be eliminated even with fairly poor performance estimates)
SLIDE 101 Pulling Together
- ∑_π Q(π) Regret(π) ≤ C0
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π
- ∑_π Q(π) ≤ 1
- can fill in constants
- make easier by:
- allowing higher variance for policies with higher regret
(poor policies can be eliminated even with fairly poor performance estimates)
- only require Q to be sub-distribution
(can put all remaining mass on the policy π̂ with maximum estimated reward)
SLIDE 102 Optimization Problem “OP”
find Q such that:
- ∑_π Q(π) Regret(π) ≤ C0
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π
- ∑_π Q(π) ≤ 1
SLIDE 103 Optimization Problem “OP”
find Q such that:
- ∑_π Q(π) Regret(π) ≤ C0   [regret constraint]
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π   [variance constraint]
- ∑_π Q(π) ≤ 1   [sub-distribution]
SLIDE 104 Optimization Problem “OP”
find Q such that:
- ∑_π Q(π) Regret(π) ≤ C0   [regret constraint]
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π   [variance constraint]
- ∑_π Q(π) ≤ 1   [sub-distribution]
[Dudík et al.]
SLIDE 105 Optimization Problem “OP”
find Q such that:
- ∑_π Q(π) Regret(π) ≤ C0   [regret constraint]
- C1 · V̂_Q(π) ≤ C0 + Regret(π)  for all π ∈ Π   [variance constraint]
- ∑_π Q(π) ≤ 1   [sub-distribution]
[Dudík et al.]
- seems awful:
- |Π| variables
- |Π| constraints
- constraints involve nasty non-linear functions
(recall V̂_Q(π) = Ê[1/Qµ(π(x)|x)])
- not even clear if feasible
SLIDE 106 If We Can Solve It...
- Theorem: if can solve OP on every round (for appropriate constants), then will get regret Õ(√(KT ln |Π|))
SLIDE 107 If We Can Solve It...
- Theorem: if can solve OP on every round (for appropriate constants), then will get regret Õ(√(KT ln |Π|))
- proof idea:
- regret constraint ensures low regret
(if estimates are good enough)
- variance constraint ensures that they actually will be
good enough
- essentially same approach as [Dudík et al.]
SLIDE 108 How to Solve?
- basic idea:
- find a violated constraint
- (attempt to) fix it
- repeat
SLIDE 109 How to Solve? (cont.)
- Q ← 0
- repeat:
- 1. if Q “too big” then rescale
- (i.e., multiply Q by scalar < 1)
- ensures sub-distribution and regret constraints are
satisfied
SLIDE 110 How to Solve? (cont.)
- Q ← 0
- repeat:
- 1. if Q “too big” then rescale
- (i.e., multiply Q by scalar < 1)
- ensures sub-distribution and regret constraints are
satisfied
- 2. find π ∈ Π for which corresponding variance constraint is
violated
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
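A high-level Python sketch of this loop; the update size α and the constants are left as caller-supplied placeholders because the talk elides their exact formulas, and find_violated_policy stands for the single AMO call described on the following slides.

    def solve_op(C0, C1, regret_hat, find_violated_policy, compute_alpha):
        # regret_hat(pi): estimated regret of pi
        # find_violated_policy(Q): a policy whose variance constraint is violated
        #     (found with one AMO call), or None if all constraints hold
        # compute_alpha(Q, pi): update size (exact formula omitted in the talk)
        Q = {}                                        # start from the all-zeros weight vector
        while True:
            # step 1: if Q is "too big", rescale so the sub-distribution
            # and regret constraints hold
            mass = sum(w * (C0 + regret_hat(pi)) for pi, w in Q.items())
            if mass > C0:
                Q = {pi: w * C0 / mass for pi, w in Q.items()}
            # step 2: look for a violated variance constraint
            pi = find_violated_policy(Q)
            if pi is None:
                return Q                              # all constraints satisfied: output Q
            Q[pi] = Q.get(pi, 0.0) + compute_alpha(Q, pi)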
SLIDE 111 More Detail: Rescaling Step
if ∑_π Q(π) (C0 + Regret(π)) > C0,
then rescale Q (multiply by scalar < 1) so that it holds with equality
SLIDE 112 More Detail: Rescaling Step
if ∑_π Q(π) (C0 + Regret(π)) > C0,
then rescale Q (multiply by scalar < 1) so that it holds with equality
⇒ ∑_π Q(π) (C0 + Regret(π)) ≤ C0
SLIDE 113 More Detail: Rescaling Step
if ∑_π Q(π) (C0 + Regret(π)) > C0,
then rescale Q (multiply by scalar < 1) so that it holds with equality
⇒ ∑_π Q(π) (C0 + Regret(π)) ≤ C0
which implies:
∑_π Q(π) ≤ 1   [sub-distribution]
∑_π Q(π) Regret(π) ≤ C0   [regret constraint]
SLIDE 114 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
SLIDE 115 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
- if it halts, then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π   [variance constraint]
SLIDE 116 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
- if it halts, then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π   [variance constraint]
- can execute step using AMO:
- can construct “pseudo-rewards” r̃_τ for which (∀π):
  C1 · V̂_Q(π) − Regret(π) = ∑_τ r̃_τ(π(x_τ)) + [constant]
SLIDE 117 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
- if it halts, then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π   [variance constraint]
- can execute step using AMO:
- can construct “pseudo-rewards” r̃_τ for which (∀π):
  C1 · V̂_Q(π) − Regret(π) = ∑_τ r̃_τ(π(x_τ)) + [constant]
- so: can maximize with AMO
- will find violating constraint (if one exists)
SLIDE 118 More Detail: Checking Variance Constraints
find π ∈ Π for which C1 · V̂_Q(π) − Regret(π) > C0
- a. if none exists, halt and output Q
- b. else Q(π) ← Q(π) + α where α = [some formula]
- if it halts, then C1 · V̂_Q(π) ≤ C0 + Regret(π) for all π ∈ Π   [variance constraint]
- can execute step using AMO:
- can construct “pseudo-rewards” r̃_τ for which (∀π):
  C1 · V̂_Q(π) − Regret(π) = ∑_τ r̃_τ(π(x_τ)) + [constant]
- so: can maximize with AMO
- will find violating constraint (if one exists)
∴ one AMO call per iteration
SLIDE 119 Why Does It Work?
- so: if halts, then outputs solution to OP
- but how long will it take to halt (if ever)?
- to answer, analyze using a potential function
SLIDE 120 A Potential Function
- define potential function to quantify progress:
  Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))] + B · ∑_π Q(π) Regret(π)
SLIDE 121 A Potential Function
- define potential function to quantify progress:
  Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))] + B · ∑_π Q(π) Regret(π)
  (second term: low regret)
- defined for all nonnegative vectors Q over Π
(not just sub-distributions)
SLIDE 122 A Potential Function
- define potential function to quantify progress:
  Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))] + B · ∑_π Q(π) Regret(π)
  (second term: low regret)
- defined for all nonnegative vectors Q over Π
(not just sub-distributions)
- properties:
- Φ(Q) ≥ 0
- convex
SLIDE 123 A Potential Function
- define potential function to quantify progress:
  Φ(Q) = A · Ê[RE(uniform ‖ Qµ(·|x))] + B · ∑_π Q(π) Regret(π)
  (second term: low regret)
- defined for all nonnegative vectors Q over Π
(not just sub-distributions)
- properties:
- Φ(Q) ≥ 0
- convex
- if Q minimizes Φ then Q is a solution to OP
- key proof step: ∂Φ/∂Q(π) ∝ variance constraint for π
  ∴ OP is feasible
SLIDE 124 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
SLIDE 125 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
- can lower-bound how much Φ decreases on each update
- can also show rescaling step never increases Φ
SLIDE 126 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
- can lower-bound how much Φ decreases on each update
- can also show rescaling step never increases Φ
- since Φ ≥ 0, gives bound on number of iterations of the
algorithm
SLIDE 127 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
- can lower-bound how much Φ decreases on each update
- can also show rescaling step never increases Φ
- since Φ ≥ 0, gives bound on number of iterations of the
algorithm
- Theorem: On round t, the algorithm halts after at most Õ(√(Kt / ln |Π|)) iterations (and calls to the AMO).
SLIDE 128 Analysis
- algorithm turns out to be (roughly) coordinate descent on Φ
- each step adjusts Q along one coordinate direction Q(π)
- can lower-bound how much Φ decreases on each update
- can also show rescaling step never increases Φ
- since Φ ≥ 0, gives bound on number of iterations of the
algorithm
- Theorem: On round t, the algorithm halts after at most Õ(√(Kt / ln |Π|)) iterations (and calls to the AMO).
- as corollary, also get bound on sparsity of Q
SLIDE 129 Epochs and Warm Start
- so far, assumed solve OP from scratch on each round
- naively, gives Õ(√(K T³ / ln |Π|)) calls to the AMO over T rounds
SLIDE 130 Epochs and Warm Start
- so far, assumed solve OP from scratch on each round
- naively, gives Õ(√(K T³ / ln |Π|)) calls to the AMO over T rounds
- can do much better!
- first improvement: since data iid, can use same solution for
many rounds, i.e., for long “epochs”
- gives same (near optimal) regret
- essentially no computation required on rounds where Q
not updated
SLIDE 131 Epochs and Warm Start
- so far, assumed solve OP from scratch on each round
- naively, gives Õ(√(K T³ / ln |Π|)) calls to the AMO over T rounds
- can do much better!
- first improvement: since data iid, can use same solution for
many rounds, i.e., for long “epochs”
- gives same (near optimal) regret
- essentially no computation required on rounds where Q
not updated
- second improvement: can initialize algorithm with the
previous solution (rather than starting fresh each time)
- works because each new example can only cause Φ to
increase slightly
SLIDE 132 Epochs and Warm Start (cont.)
if only update Q on rounds 1, 4, 9, 16, 25, . . .
- get same (near optimal) regret
- only need Õ(√(KT / ln |Π|)) calls to the AMO in total over the entire sequence of T rounds
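A tiny sketch of that epoch schedule (Q is recomputed only on square-numbered rounds and reused in between); the function name is illustrative.

    def epoch_schedule(T):
        # rounds on which Q is recomputed: the square numbers 1, 4, 9, 16, ...
        rounds, m = [], 1
        while m * m <= T:
            rounds.append(m * m)
            m += 1
        return rounds

    print(epoch_schedule(30))  # -> [1, 4, 9, 16, 25]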
SLIDE 133 Summary
- new algorithm for contextual bandits problem with AMO
access
- (nearly) optimal regret
- simple and fast
- only requires an average of Õ(√(K / (T ln |Π|))) AMO calls per round
SLIDE 134 Open Problems and Future Directions
- try out experimentally
- is there an algorithm that uses an online (rather than batch) oracle?
- is there a lower bound on number of AMO calls necessary to
solve this problem?
- can we find a similar algorithm that handles adversarial data?