Linear (and contextual) Bandits: Rich decision sets (and side information)


SLIDE 1

Linear (and contextual) Bandits: Rich decision sets (and side information) Sham M. Kakade

Machine Learning for Big Data CSE547/STAT548 University of Washington

S. M. Kakade (UW), Optimization for Big data, 1 / 14

SLIDE 2

Announcements...

Poster session: June 1, 9:00-11:30am

Request: CSE grad students, could you please help others with poster printing? Aravind: ask by 2pm on Wednesday for help printing. Prepare at most a 2-minute verbal summary. Come early to set up. Submit your poster on Canvas.

Due dates: please be on time.

Today: review of linear bandits, then contextual bandits and game trees?


SLIDE 3

Review


SLIDE 4

Bandits in practice: two major issues

The decision space is very large.

  • Drug cocktails
  • Ad design

We often have “side information” when making a decision.

  • the history of a user


SLIDE 5

More real motivations...


SLIDE 6

Linear bandits

An additive-effects model. Suppose each round we take a decision x ∈ D ⊂ ℝ^d.

  • x is a path on a graph
  • x is a feature vector of properties of an ad
  • x indicates which drugs are being taken

Upon taking action x, we get reward r, with expectation: E[r|x] = µ⊤x

  • Only d unknown parameters (and “effectively” 2^d actions)

We desire an algorithm A (mapping histories to decisions) which has low regret:

Tµ⊤x⋆ − ∑_{t=1}^T E[µ⊤x_t | A] ≤ ??   (where x⋆ is the best decision)

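The additive-effects model above can be sketched in a few lines of Python. Everything here (the toy decision set `D`, the noise level, the function names) is an illustrative assumption, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                       # number of unknown parameters
mu = rng.normal(size=d)     # the true (unknown) parameter vector
mu /= np.linalg.norm(mu)    # normalize so rewards stay on a bounded scale

def expected_reward(x):
    """E[r | x] = mu^T x: the reward is linear in the decision's features."""
    return float(mu @ x)

def observed_reward(x, noise_std=0.1):
    """A noisy realization of the reward for decision x."""
    return expected_reward(x) + noise_std * rng.normal()

# A small finite decision set D (e.g. feature vectors of candidate ads):
# only d unknowns, even though the set of decisions can be much larger.
D = rng.normal(size=(8, d))
x_star = D[np.argmax(D @ mu)]   # the best decision x* in D
```

The point of the model: the learner only ever observes noisy rewards `observed_reward(x)`, yet the d-dimensional structure lets it generalize across decisions it has never tried.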

SLIDE 7

Example: Shortest paths...


SLIDE 8

Algorithm Idea

Again, let’s think of optimism in the face of uncertainty. We have observed rewards r_1, …, r_{t−1} and taken decisions x_1, …, x_{t−1}. Questions:

  • What is an estimate of E[r|x], and what is our uncertainty?
  • What is an estimate of µ, and what is our uncertainty?


SLIDE 9

Regression!

Define:

A_t := ∑_{τ<t} x_τ x_τ⊤ + λI,   b_t := ∑_{τ<t} x_τ r_τ

Our estimate of µ:

µ̂_t = A_t^{−1} b_t

Confidence of our estimate:

‖µ − µ̂_t‖²_{A_t} ≤ O(d log t)

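A minimal sketch of this ridge-regression estimate (the function name and toy data are my own; λ is the `lam` argument):

```python
import numpy as np

def ridge_estimate(xs, rs, lam=1.0):
    """Return (mu_hat, A_t), where A_t = sum_{tau<t} x x^T + lam*I,
    b_t = sum_{tau<t} x r, and mu_hat = A_t^{-1} b_t.

    xs: (n, d) array of past decisions; rs: (n,) array of observed rewards.
    """
    d = xs.shape[1]
    A = xs.T @ xs + lam * np.eye(d)   # A_t
    b = xs.T @ rs                     # b_t
    return np.linalg.solve(A, b), A   # solve, rather than forming A^{-1} explicitly
```

With noiseless linear rewards and a small λ the estimate recovers µ almost exactly; with noise, the ellipsoid ‖µ − µ̂_t‖²_{A_t} ≤ O(d log t) quantifies the remaining uncertainty.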

SLIDE 10

LinUCB

Again, optimism in the face of uncertainty. Define:

B_t := {ν : ‖ν − µ̂_t‖²_{A_t} ≤ O(d log t)}

(LinUCB) Take action:

x_t = argmax_{x∈D} max_{ν∈B_t} ν⊤x

then update A_t, B_t, b_t, and µ̂_t. Equivalently, take action:

x_t = argmax_{x∈D} µ̂_t⊤x + √(d log t) · √(x⊤ A_t^{−1} x)

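The equivalent closed-form action rule can be sketched as follows. Here `beta` stands in for the O(d log t) confidence radius, and the name `linucb_choose` is illustrative:

```python
import numpy as np

def linucb_choose(D, mu_hat, A, beta):
    """Return the index of  argmax_x  mu_hat^T x + sqrt(beta) * sqrt(x^T A^{-1} x)
    over the rows of D (each row is one feasible decision x)."""
    A_inv = np.linalg.inv(A)
    # x^T A^{-1} x for every row of D at once
    widths = np.sqrt(np.einsum('ij,jk,ik->i', D, A_inv, D))
    return int(np.argmax(D @ mu_hat + np.sqrt(beta) * widths))
```

The first term exploits (the estimated reward), the second explores (directions where A_t has seen little data get a large bonus).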

SLIDE 11

LinUCB: Geometry


SLIDE 12

LinUCB: Confidence intervals


SLIDE 13

Today


SLIDE 14

LinUCB

Regret bound of LinUCB:

Tµ⊤x⋆ − ∑_{t=1}^T E[µ⊤x_t] ≤ Õ(d√T)

(this is the best possible, up to log factors). Compare to O(√(KT)) for K arms.

  • Independent of the number of actions; the K-arm case is a special case.
  • Thompson sampling: this is a good algorithm in practice.


SLIDE 15

Proof Idea...

Stats: need to show that B_t is a valid confidence region.

Geometric lemma: the regret is upper bounded by the log ratio of the volume of the posterior covariance to the volume of the prior covariance. Then just bound the worst-case log volume change.


SLIDE 16

What about context?


SLIDE 17

The Contextual Bandit Game

Game: for t = 1, 2, . . .

  • At each time t, we obtain a context (e.g. side information, user information) c_t.
  • Our feasible action set is A_t. We choose arm a_t ∈ A_t and receive reward r_{t,a_t}.
  • (What assumptions on the reward process?)

Goal: algorithm A should have low regret:

E[∑_t (r_{t,a⋆_t} − r_t) | A] ≤ ??

where E[r_{t,a⋆_t}] is the optimal expected reward at time t.


SLIDE 18

How should we model outcomes?

Example: ad (or movie, song, etc.) prediction. What is the probability that a user u clicks on an ad a? How should we model the click probability of a for user u?

Featurizations: suppose we have φ_ad(a) ∈ ℝ^{d_ad} and φ_user(u) ∈ ℝ^{d_user}. We could make an “outer product” feature vector x as:

x(a, u) = Vector(φ_ad(a) φ_user(u)⊤) ∈ ℝ^{d_ad · d_user}

We could model the probabilities as:

E[click = 1 | a, u] = µ⊤x(a, u)   (or log-linear)

How do we estimate µ?

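A minimal sketch of the outer-product featurization (the function name is illustrative):

```python
import numpy as np

def outer_product_features(phi_ad, phi_user):
    """x(a, u) = Vector(phi_ad(a) phi_user(u)^T), of dimension d_ad * d_user."""
    return np.outer(phi_ad, phi_user).ravel()
```

Note the consequence: if µ = Vector(M) for a (d_ad × d_user) matrix M, then µ⊤x(a, u) = φ_ad(a)⊤ M φ_user(u), i.e. this is a bilinear model in the ad and user features.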

SLIDE 19

Contextual Linear bandits

Suppose each round t, we take a decision x ∈ D_t ⊂ ℝ^d (D_t may be time-varying).

  • Map each ad/user pair to x(a, u): D_t = {x(a, u_t) | a is a feasible ad at time t}.
  • Our decision is a feature vector x ∈ D_t.

Upon taking action x_t ∈ D_t, we get reward r_t, with expectation:

E[r_t | x_t ∈ D_t] = µ⊤x_t

(here µ is assumed constant over time). Our regret:

E[∑_t (µ⊤x_{t,a⋆_t} − µ⊤x_t) | A] ≤ ??

(where x_{t,a⋆_t} is the best decision at time t)


SLIDE 20

Algorithm

Let’s just run LinUCB (or Thompson sampling). Nothing really changes:

  • A_t and b_t have the same updating rules.
  • Now our decision is:

x_t = argmax_{x∈D_t} max_{ν∈B_t} ν⊤x

i.e.

x_t = argmax_{x∈D_t} µ̂_t⊤x + √(d log t) · √(x⊤ A_t^{−1} x)

  • The regret bound is still O(d√T).

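Putting the pieces together, here is a sketch of LinUCB run over time-varying decision sets D_t. The schedule `beta = d * log(t+1)` stands in for the O(d log t) confidence radius, and all names are illustrative assumptions:

```python
import numpy as np

def contextual_linucb(decision_sets, reward_fn, d, lam=1.0):
    """Run LinUCB over (possibly time-varying) decision sets D_t.

    decision_sets: iterable of (n_arms_t, d) arrays, one per round.
    reward_fn: maps a chosen feature vector x to an observed reward.
    Returns the chosen arm indices and the final estimate mu_hat.
    """
    A = lam * np.eye(d)                 # A_t = lam*I + sum x x^T
    b = np.zeros(d)                     # b_t = sum x r
    chosen = []
    for t, D_t in enumerate(decision_sets, start=1):
        mu_hat = np.linalg.solve(A, b)
        beta = d * np.log(t + 1)        # stands in for the O(d log t) radius
        A_inv = np.linalg.inv(A)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', D_t, A_inv, D_t))
        i = int(np.argmax(D_t @ mu_hat + np.sqrt(beta) * widths))
        x = D_t[i]
        r = reward_fn(x)
        A += np.outer(x, x)             # same updates as the non-contextual case
        b += r * x
        chosen.append(i)
    return chosen, np.linalg.solve(A, b)
```

On a fixed decision set with noiseless linear rewards, this sketch concentrates its pulls on the best arm while still occasionally exploring the others.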

SLIDE 21

Acknowledgements

http://gdrro.lip6.fr/sites/default/files/JourneeCOSdec2015-Kaufman.pdf
https://sites.google.com/site/banditstutorial/
http://www.yisongyue.com/courses/cs159/lectures/LinUCB.pdf
