slide-1
SLIDE 1

Reducing contextual bandits to supervised learning

Daniel Hsu

Columbia University

Based on joint work with A. Agarwal, S. Kale, J. Langford, L. Li, and R. Schapire

1

slide-6
SLIDE 6

Learning to interact: example #1

Practicing physician

Loop:

  • 1. Patient arrives with symptoms, medical history, genome . . .
  • 2. Prescribe treatment.
  • 3. Observe impact on patient’s health (e.g., improves, worsens).

Goal: prescribe treatments that yield good health outcomes.

2

slide-11
SLIDE 11

Learning to interact: example #2

Website operator

Loop:

  • 1. User visits website with profile, browsing history . . .
  • 2. Choose content to display on website.
  • 3. Observe user reaction to content (e.g., click, “like”).

Goal: choose content that yields desired user behavior.

3

slide-19
SLIDE 19

Contextual bandit problem

For t = 1, 2, . . . , T:

  • 0. Nature draws (xt, rt) from dist. D over X × [0, 1]^A.
  • 1. Observe context xt ∈ X.

[e.g., user profile, search query]

  • 2. Choose action at ∈ A.

[e.g., ad to display]

  • 3. Collect reward rt(at) ∈ [0, 1].

[e.g., 1 if click, 0 otherwise]

Task: choose at’s that yield high expected reward (w.r.t. D).

Contextual: use features xt to choose good actions at.

Bandit: rt(a) for a ≠ at is not observed.
(Non-bandit setting: the whole reward vector rt ∈ [0, 1]^A is observed.)

4
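This protocol is easy to simulate. The sketch below (Python; the logistic reward model, the dimensions, and the uniformly random action choice are all invented for illustration and are not part of the talk) just runs the observe/choose/collect loop: the learner sees xt, picks at, and observes only rt(at).

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 10, 1000            # actions, context dimension, rounds (arbitrary)

# Hypothetical reward model: expected reward of action a in context x is
# sigmoid(<w_a, x>) for an unknown weight vector w_a.
true_w = rng.normal(size=(K, d))

def draw_context():
    return rng.normal(size=d)

def draw_rewards(x):
    # Nature's full reward vector r_t over all actions; the learner never sees it all.
    p = 1.0 / (1.0 + np.exp(-true_w @ x))
    return (rng.random(K) < p).astype(float)

total = 0.0
for t in range(T):
    x_t = draw_context()         # 1. observe context
    a_t = rng.integers(K)        # 2. choose an action (uniformly at random here)
    r_t = draw_rewards(x_t)      # nature's draw (hidden)
    total += r_t[a_t]            # 3. collect only the reward of the chosen action

print("average reward:", total / T)
```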

slide-23
SLIDE 23

Challenges

  • 1. Exploration vs. exploitation.

◮ Use what you’ve already learned (exploit), but also learn about actions that could be good (explore).
◮ Must balance to get good statistical performance.

  • 2. Must use context.

◮ Want to do as well as the best policy (i.e., decision rule) π: context x → action a, from some policy class Π (a set of decision rules).
◮ Computationally constrained w/ large Π (e.g., all decision trees).

  • 3. Selection bias, especially while exploiting.

5

slide-26
SLIDE 26

Learning objective

Regret (i.e., relative performance) to a policy class Π:

    max_{π∈Π} (1/T) Σ_{t=1}^{T} rt(π(xt))   −   (1/T) Σ_{t=1}^{T} rt(at)
    [average reward of best policy]             [average reward of learner]

Strong benchmark if Π contains a policy w/ high expected reward!

Goal: regret → 0 as fast as possible as T → ∞.

6
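To make the objective concrete, the sketch below computes this empirical regret on a toy log; the random data and the tiny policy class are invented for illustration. Note that it is only computable here because the toy log stores the full reward vectors, which a bandit learner never observes.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 1000, 4

contexts = rng.normal(size=(T, 2))          # x_t
rewards = rng.random(size=(T, K))           # full reward vectors r_t (toy)
played = rng.integers(K, size=T)            # actions a_t the learner actually took

# A tiny illustrative policy class: constant policies plus one threshold rule.
policies = [lambda x, a=a: a for a in range(K)]
policies.append(lambda x: 0 if x[0] > 0 else 1)

learner_avg = rewards[np.arange(T), played].mean()
best_policy_avg = max(
    np.mean([rewards[t, pi(contexts[t])] for t in range(T)]) for pi in policies
)
print("empirical regret:", best_policy_avg - learner_avg)
```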

slide-29
SLIDE 29

Our result

New fast and simple algorithm for contextual bandits.

◮ Operates via reduction to supervised learning

(with computationally-efficient reduction).

◮ Statistically (near) optimal regret bound.

7

slide-33
SLIDE 33

Need for exploration

No-exploration approach:

  • 1. Using historical data, learn a “reward predictor” for each action a ∈ A based on context x ∈ X:  r̂(a | x).
  • 2. Then deploy policy π̂, given by π̂(x) := arg max_{a∈A} r̂(a | x), and collect more data.

Suffers from selection bias.

8

slide-41
SLIDE 41

Using no-exploration

Example: two contexts {X, Y}, two actions {A, B}.
Suppose initial policy says π̂(X) = A and π̂(Y) = B.

  Observed rewards          Reward estimates
        A     B                   A     B
  X    0.7    —              X   0.7   0.5
  Y     —    0.1             Y   0.5   0.1

New policy: π̂′(X) = π̂′(Y) = A.

  Observed rewards          Reward estimates          True rewards
        A     B                   A     B                   A     B
  X    0.7    —              X   0.7   0.5             X   0.7   1.0
  Y    0.3   0.1             Y   0.3   0.1             Y   0.3   0.1

Never try action B in context X. Ω(1) regret.

9
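The failure is easy to reproduce in code. The sketch below simulates the same two-context, two-action example with a purely greedy learner; the 0.5 default estimate and the tie-breaking rule are illustrative assumptions, but the outcome matches the slide: action B is never tried in context X, so its 1.0 reward is never discovered.

```python
import random

random.seed(0)

# True expected rewards from the example: context -> action -> reward.
true_reward = {"X": {"A": 0.7, "B": 1.0},
               "Y": {"A": 0.3, "B": 0.1}}

estimates = {c: {a: 0.5 for a in "AB"} for c in "XY"}   # default estimate 0.5
counts = {c: {a: 0 for a in "AB"} for c in "XY"}

for t in range(10_000):
    x = random.choice("XY")
    # No exploration: always play the action with the highest current estimate.
    a = max("AB", key=lambda act: estimates[x][act])
    r = true_reward[x][a]                # deterministic rewards keep it simple
    counts[x][a] += 1
    estimates[x][a] += (r - estimates[x][a]) / counts[x][a]   # running average

print(counts)   # counts["X"]["B"] stays 0: B is never tried in context X
```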

slide-47
SLIDE 47

Dealing with policies

Feedback in round t: reward of chosen action rt(at).

◮ Tells us about policies π ∈ Π s.t. π(xt) = at.
◮ Not informative about other policies!

Possible approach: track average reward of each π ∈ Π.

◮ Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995).
◮ Statistically optimal regret bound O(√(K log N / T)) for K := |A| actions and N := |Π| policies after T rounds.
◮ Explicit bookkeeping is computationally intractable for large N.

But perhaps policy class Π has some structure . . .

10

slide-52
SLIDE 52

Hypothetical “full-information” setting

If we observed rewards for all actions . . .

◮ Like supervised learning, have labeled data after t rounds:

(x1, ρ1), . . . , (xt, ρt) ∈ X × R^A .

      context  →  features
      actions  →  classes
      rewards  →  −costs
       policy  →  classifier

◮ Can often exploit structure of Π to get tractable algorithms.

Abstraction: arg max oracle (AMO)

    AMO( {(xi, ρi)}_{i=1}^{t} ) := arg max_{π∈Π} Σ_{i=1}^{t} ρi(π(xi)) .

Can’t directly use this in bandit setting.

11
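For a small, explicitly enumerated policy class the AMO can be implemented by brute force, as in the sketch below; the threshold policies and random data are invented for illustration. In practice the oracle would be a cost-sensitive classification (or regression) learner appropriate to Π.

```python
import numpy as np

def amo(policies, data):
    """Arg max oracle by brute force: return argmax over the policy list of
    sum_i rho_i(pi(x_i)), where data is a list of (context, reward-vector) pairs."""
    def total(pi):
        return sum(rho[pi(x)] for x, rho in data)
    return max(policies, key=total)

# Illustrative use: threshold policies on a 1-d context, two actions.
rng = np.random.default_rng(2)
data = [(x, np.array([0.2, 0.8]) if x > 0 else np.array([0.6, 0.1]))
        for x in rng.normal(size=50)]
policies = [lambda x, th=th: int(x > th) for th in np.linspace(-2, 2, 21)]
best = amo(policies, data)
print(sum(rho[best(x)] for x, rho in data))   # total reward of the returned policy
```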

slide-57
SLIDE 57

Using AMO with some exploration

Explore-then-exploit

  • 1. In first τ rounds, choose at ∈ A u.a.r. to get unbiased estimates r̂t of rt for all t ≤ τ.
  • 2. Get π̂ := AMO({(xt, r̂t)}_{t=1}^{τ}).
  • 3. Henceforth use at := π̂(xt), for t = τ+1, τ+2, . . . , T.

Regret bound with best τ: ∼ T^{−1/3} (sub-optimal).

(Dependencies on |A| and |Π| hidden.)

12
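A runnable sketch of explore-then-exploit, assuming a toy environment, a small threshold policy class, and the brute-force oracle idea from the previous slide; the importance-weighted estimates r̂t used here are the ones defined later in the talk.

```python
import numpy as np

rng = np.random.default_rng(3)
K, T, tau = 3, 2000, 400   # actions, horizon, exploration budget (all illustrative)

def draw_context():
    return rng.normal()

def expected_reward(x, a):
    return [0.3, 0.5 + 0.3 * (x > 0), 0.4][a]

def draw_reward(x, a):
    return float(rng.random() < expected_reward(x, a))

# Small explicit policy class: "play a_hi if x > th else a_lo".
policies = [lambda x, lo=lo, hi=hi, th=th: (hi if x > th else lo)
            for lo in range(K) for hi in range(K) for th in (-0.5, 0.0, 0.5)]

# Phase 1: uniform exploration, importance-weighted reward estimates.
data = []
for t in range(tau):
    x = draw_context()
    a = rng.integers(K)
    r_hat = np.zeros(K)
    r_hat[a] = draw_reward(x, a) * K       # r / p(a) with p(a) = 1/K
    data.append((x, r_hat))

# One AMO call on the estimated data (brute force over the explicit class).
pi_hat = max(policies, key=lambda pi: sum(rh[pi(x)] for x, rh in data))

# Phase 2: exploit pi_hat for the remaining rounds.
total = 0.0
for t in range(tau, T):
    x = draw_context()
    total += draw_reward(x, pi_hat(x))
print("exploit-phase average reward:", total / (T - tau))
```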

slide-61
SLIDE 61

Previous contextual bandit algorithms

Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995):
  Optimal regret, but explicitly enumerates Π.

Greedy (Langford & Zhang, NIPS 2007):
  Sub-optimal regret, but one call to AMO.

Monster (Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin, & Zhang, UAI 2011):
  Near optimal regret, but O(T^6) calls to AMO.

13

slide-62
SLIDE 62

Our result

Let K := |A| and N := |Π|.

Our result: a new, fast and simple algorithm.

◮ Regret bound: Õ(√(K log N / T)).  Near optimal.

◮ # calls to AMO: Õ(√(TK / log N)).  Less than once per round!

14

slide-63
SLIDE 63

Rest of the talk

Components of the new algorithm:

Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS

  • 1. “Classical” tricks: randomization, inverse propensity weighting.
  • 2. Efficient algorithm for balancing exploration/exploitation.
  • 3. Additional tricks: warm-start and epoch structure.

15

slide-64
SLIDE 64
  • 1. Classical tricks

16

slide-67
SLIDE 67

What would’ve happened if I had done X?

For t = 1, 2, . . . , T:

  • 0. Nature draws (xt, rt) from dist. D over X × [0, 1]^A.
  • 1. Observe context xt ∈ X.

[e.g., user profile, search query]

  • 2. Choose action at ∈ A.

[e.g., ad to display]

  • 3. Collect reward rt(at) ∈ [0, 1].

[e.g., 1 if click, 0 otherwise]

Q: How do I learn about rt(a) for actions a I don’t actually take?

A: Randomize. Draw at ∼ pt for some pre-specified prob. dist. pt.

17

slide-73
SLIDE 73

Inverse propensity weighting (Horvitz & Thompson, JASA 1952)

Importance-weighted estimate of reward from round t:

    ∀a ∈ A   r̂t(a) := rt(at) · 1{a = at} / pt(at)
                     = rt(at) / pt(at)  if a = at ,  and 0 otherwise.

Unbiasedness:

    E_{at∼pt}[ r̂t(a) ] = Σ_{a′∈A} pt(a′) · rt(a′) · 1{a = a′} / pt(a′) = rt(a) .

Range and variance: upper-bounded by 1 / pt(a).

Estimate avg. reward of policy: Rewt(π) := (1/t) Σ_{i=1}^{t} r̂i(π(xi)).

How should we choose the pt?

18
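A small sketch of the estimator together with a Monte Carlo check of the unbiasedness identity above; the reward vector and the action distribution are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 4
r = rng.random(K)                   # a fixed reward vector r_t for the check
p = rng.dirichlet(np.ones(K))       # the action distribution p_t

def ipw_estimate(a_played, reward_observed, p):
    """Inverse-propensity-weighted estimate r_hat_t: nonzero only at the played action."""
    r_hat = np.zeros_like(p)
    r_hat[a_played] = reward_observed / p[a_played]
    return r_hat

# Check E_{a_t ~ p_t}[r_hat_t(a)] = r_t(a) by averaging many simulated rounds.
n = 200_000
acc = np.zeros(K)
for _ in range(n):
    a = rng.choice(K, p=p)
    acc += ipw_estimate(a, r[a], p)
print(np.round(acc / n, 3), "vs", np.round(r, 3))
```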

slide-75
SLIDE 75

Hedging over policies

Get action distributions via policy distributions.

    (Q, x)  (policy distribution, context)   →   p  (action distribution)

Policy distribution: Q = (Q(π) : π ∈ Π), probability dist. over policies π in the policy class Π.

19

slide-76
SLIDE 76

Hedging over policies

Get action distributions via policy distributions.

    (Q, x)  (policy distribution, context)   →   p  (action distribution)

 1: Pick initial distribution Q1 over policies Π.
 2: for round t = 1, 2, . . . do
 3:    Nature draws (xt, rt) from dist. D over X × [0, 1]^A.
 4:    Observe context xt.
 5:    Compute distribution pt over A (using Qt and xt).
 6:    Pick action at ∼ pt.
 7:    Collect reward rt(at).
 8:    Compute new distribution Qt+1 over policies Π.
 9: end for

19
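A minimal sketch of this loop, assuming Qt is stored as a sparse dictionary over policies; the projection Qt(·|x), the toy reward, and the two hand-written policies are illustrative, and the Qt+1 update (the hard part, addressed next) is simply omitted.

```python
import numpy as np

rng = np.random.default_rng(5)
K = 3

# A policy distribution: sparse dict {policy: weight}; policies map context -> action.
Q = {(lambda x: 0): 0.5, (lambda x: int(x > 0)): 0.5}

def action_dist(Q, x, K):
    """p = Q(.|x): push the policy weights through the context."""
    p = np.zeros(K)
    for pi, w in Q.items():
        p[pi(x)] += w
    return p

for t in range(5):
    x_t = rng.normal()                  # observe context
    p_t = action_dist(Q, x_t, K)        # policy distribution -> action distribution
    a_t = rng.choice(K, p=p_t)          # randomize
    r_t = float(rng.random() < 0.5)     # observed reward of the chosen action (toy)
    r_hat = np.zeros(K)
    r_hat[a_t] = r_t / p_t[a_t]         # inverse propensity weighting
    # Computing Q_{t+1} is the real work (next section); left unchanged here.
```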

slide-77
SLIDE 77
  • 2. Efficient construction of good policy distributions

20

slide-82
SLIDE 82

Our approach

Q: How do we choose Qt for good exploration/exploitation?

Caveat: Qt must be efficiently computable + representable!

Our approach:

  • 1. Define convex feasibility problem (over distributions Q on Π) such that solutions yield (near) optimal regret bounds.
  • 2. Design algorithm that finds a sparse solution Q.

Algorithm only accesses Π via calls to AMO ⇒ nnz(Q) = O(# AMO calls).

21

slide-87
SLIDE 87

The “good policy distribution” problem

Convex feasibility problem for policy distribution Q

    Σ_{π∈Π} Q(π) · Regt(π)  ≤  √(K log N / t)        (Low regret)

    var[ Rewt(π) ]  ≤  b(π)   ∀π ∈ Π                 (Low variance)

Using feasible Qt in round t gives near-optimal regret.

But |Π| variables and >|Π| constraints, . . .

22

slide-92
SLIDE 92

Solving the convex feasibility problem

Solver for “good policy distribution” problem

Start with some Q (e.g., Q := 0), then repeat:

  • 1. If “low regret” constraint violated, then fix by rescaling: Q := cQ for some c < 1.
  • 2. Find most violated “low variance” constraint—say, corresponding to policy π̂—and update Q(π̂) := Q(π̂) + α.
       (If no such violated constraint, stop and return Q.)

(c < 1 and α > 0 have closed-form formulae.)
(Technical detail: Q can be a sub-distribution that sums to less than one.)

23

slide-93
SLIDE 93

Implementation via AMO

Finding “low variance” constraint violation:

  • 1. Create fictitious rewards for each i = 1, 2, . . . , t:

        r̃i(a) := r̂i(a) + µ / Q(a|xi)   ∀a ∈ A ,

     where µ ≈ √((log N)/(Kt)).

  • 2. Obtain π̂ := AMO( {(xi, r̃i)}_{i=1}^{t} ).

  • 3. Rewt(π̂) > threshold iff π̂’s “low variance” constraint is violated.

24
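A sketch of this violation search, reusing the brute-force-oracle idea from earlier; the smoothed projection Q(a|x) and the threshold are passed in by the caller, since their exact forms are not spelled out on the slide.

```python
import numpy as np

def find_violated_policy(policies, history, Q_proj, mu, threshold):
    """Look for a policy whose "low variance" constraint is violated.

    history : list of (x_i, r_hat_i) pairs (context, IPW reward vector)
    Q_proj  : function x -> array of Q(a|x) over actions (assumed positive)
    mu      : roughly sqrt(log N / (K t))
    """
    # 1. Fictitious rewards r_tilde_i(a) = r_hat_i(a) + mu / Q(a|x_i).
    fictitious = [(x, r_hat + mu / Q_proj(x)) for x, r_hat in history]
    # 2. One AMO call on the fictitious data (brute force over an explicit class).
    def total(pi):
        return sum(rho[pi(x)] for x, rho in fictitious)
    pi_hat = max(policies, key=total)
    # 3. Threshold test: violated iff the fictitious average reward is too large.
    if total(pi_hat) / len(history) > threshold:
        return pi_hat
    return None
```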

slide-96
SLIDE 96

Iteration bound

Solver is coordinate descent for minimizing potential function

    Φ(Q) := c1 · Ex[ RE( uniform ∥ Q(·|x) ) ] + c2 · Σ_{π∈Π} Q(π) · Regt(π) .

(Partial derivative w.r.t. Q(π) is “low variance” constraint for π.)

Returns a feasible solution after Õ(√(Kt / log N)) steps.

(Actually use (1 − ε) · Q + ε · uniform inside RE expression.)

25

slide-97
SLIDE 97

Algorithm

 1: Pick initial distribution Q1 over policies Π.
 2: for round t = 1, 2, . . . do
 3:    Nature draws (xt, rt) from dist. D over X × [0, 1]^A.
 4:    Observe context xt.
 5:    Compute action distribution pt := Qt( · |xt).
 6:    Pick action at ∼ pt.
 7:    Collect reward rt(at).
 8:    Compute new policy distribution Qt+1 using coordinate descent + AMO.
 9: end for

26

slide-101
SLIDE 101

Recap

Feasible solution to the “good policy distribution” problem gives near optimal regret bound.

New coordinate descent algorithm: repeatedly find a violated constraint and adjust Q to satisfy it.

Analysis: in round t, nnz(Qt+1) = O(# AMO calls) = Õ(√(Kt / log N)).

27

slide-102
SLIDE 102
  • 3. Additional tricks: warm-start and epoch structure

28

slide-104
SLIDE 104

Total complexity over all rounds

In round t, coordinate descent for computing Qt+1 requires Õ(√(Kt / log N)) AMO calls.

To compute Qt+1 in all rounds t = 1, 2, . . . , T, need Õ(√(K / log N) · T^1.5) AMO calls over T rounds.

29
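The T^1.5 total is just the per-round cost summed over rounds; a quick check of the arithmetic, suppressing the Õ factors:

```latex
\[
\sum_{t=1}^{T} \sqrt{\frac{Kt}{\log N}}
= \sqrt{\frac{K}{\log N}} \sum_{t=1}^{T} \sqrt{t}
\le \sqrt{\frac{K}{\log N}} \cdot \tfrac{2}{3}\,(T+1)^{3/2}
= O\!\left(\sqrt{\frac{K}{\log N}}\,T^{1.5}\right).
\]
```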

slide-109
SLIDE 109

Warm start

To compute Qt+1 using coordinate descent, initialize with Qt.

  • 1. Total epoch-to-epoch increase in potential is Õ(√(T/K)) over all T rounds (w.h.p.—exploiting i.i.d. assumption).
  • 2. Each coordinate descent step decreases potential by Ω(√(log N / K)).
  • 3. Over all T rounds, total # calls to AMO ≤ Õ(√(KT / log N)).

But still need an AMO call to even check if Qt is feasible!

30

slide-114
SLIDE 114

Epoch trick

Regret analysis: Qt has low instantaneous per-round regret—this also crucially relies on i.i.d. assumption.

⇒ same Qt can be used for O(t) more rounds!

Epoch trick: split T rounds into epochs, only compute Qt at start of each epoch.

Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, . . . (log T updates), so Õ(√(KT / log N)) AMO calls overall.

Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, . . . (√T updates), so Õ(√(K / log N)) AMO calls per update, on average.

31

slide-115
SLIDE 115

Warm start + epoch trick

Over all T rounds:

◮ Update policy distribution on rounds 1^2, 2^2, 3^2, 4^2, . . . , i.e., total of √T times.

◮ Total # calls to AMO: Õ(√(KT / log N)).

◮ # AMO calls per update (on average): Õ(√(K / log N)).

32
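As a sanity check, the last two bullets are consistent with each other: √T updates times the average per-update cost gives the stated total.

```latex
\[
\underbrace{\sqrt{T}}_{\#\text{ updates}}
\;\times\;
\underbrace{\tilde{O}\!\left(\sqrt{K/\log N}\right)}_{\text{avg. AMO calls per update}}
\;=\;
\tilde{O}\!\left(\sqrt{KT/\log N}\right)
\text{ AMO calls over all } T \text{ rounds.}
\]
```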

slide-116
SLIDE 116
  • 4. Closing remarks and open problems

33

slide-121
SLIDE 121

Recap

  • 1. New algorithm for general contextual bandits
  • 2. Accesses policy class Π only via AMO.
  • 3. Defined convex feasibility problem over policy distributions that are good for exploration/exploitation: Regret ≤ Õ(√(K log N / T)).
       Coordinate descent finds an Õ(√(KT / log N))-sparse solution.
  • 4. Epoch structure allows the policy distribution to change very infrequently; combine with warm start for computational improvements.

34

slide-126
SLIDE 126

Open problems

  • 1. Empirical evaluation.
  • 2. Adaptive algorithm that takes advantage of problem easiness.
  • 3. Alternatives to AMO.

Thanks!

35

slide-129
SLIDE 129

Projections of policy distributions

Given policy distribution Q and context x,

    ∀a ∈ A   Q(a|x) := Σ_{π∈Π} Q(π) · 1{π(x) = a}

(so Q → Q(·|x) is a linear map).

We actually use

    pt := Qt^{µt}( · |xt) := (1 − Kµt) Qt( · |xt) + µt · 1

so every action has probability at least µt (to be determined).

37
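A minimal sketch of both maps, assuming Q is a sparse dictionary over policies whose weights sum to one (the real algorithm allows a sub-distribution whose leftover mass is handled separately):

```python
import numpy as np

def project(Q, x, K):
    """Q(.|x): push the policy-distribution weights through the context x."""
    p = np.zeros(K)
    for pi, w in Q.items():        # Q is a sparse dict {policy: weight}
        p[pi(x)] += w
    return p

def smoothed(Q, x, K, mu):
    """p_t := (1 - K*mu) * Q(.|x) + mu * 1, so every action has probability >= mu."""
    return (1.0 - K * mu) * project(Q, x, K) + mu

# Example: two policies on a 1-d context, K = 3 actions, mu = 0.05.
Q = {(lambda x: 0): 0.6, (lambda x: 2 if x > 0 else 1): 0.4}
print(smoothed(Q, 0.5, 3, 0.05))   # [0.56 0.05 0.39] -- every entry >= mu
```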

slide-130
SLIDE 130

The potential function

Φ(Q) := (t µt / (1 − Kµt)) · E_{x∈Ht}[ RE( uniform ∥ Q^{µt}(·|x) ) ]  +  Σ_{π∈Π} Q(π) · Regt(π) / (Kt · µt) ,

38