SLIDE 1

Weighted bandits or: How bandits learn distorted values that are not expected

Prashanth L.A.∗

Joint work with Aditya Gopalan†, Michael Fu∗ and Steve Marcus∗

∗ University of Maryland, College Park † Indian Institute of Science

SLIDE 2

Going to office - bandit style

Every day:

  • 1. Pick a route to the office
  • 2. Reach the office and record the (suffered) delay

SLIDE 3

Why not distort?

Delays are stochastic. In choosing between routes, humans *need not* minimize expected delay.

SLIDE 4

Why not distort?

Two-route scenario 1: Average delay of Route 2 is slightly above that of Route 1, but Route 2 has a *small* chance of *very* low delay → I might prefer Route 2.

Two-route scenario 2: Average delay of Route 2 is slightly below that of Route 1, but Route 2 has a *small* chance of *very* high delay, e.g. jammed traffic → I might prefer Route 1.

SLIDE 6

What we do

Probability distortion [1]

[Plot: weight w(p) = p^0.69 / (p^0.69 + (1 − p)^0.69)^(1/0.69) against probability p]

Multi-armed bandits

The weight-distorted value µk for any arm k ∈ {1, . . . , K} is

µk = ∫₀^∞ w(P[Yk > z]) dz − ∫₀^∞ w(P[−Yk > z]) dz,

where Yk is the r.v. corresponding to the stochastic costs from arm k, and the weight function w : [0, 1] → [0, 1] satisfies w(0) = 0 and w(1) = 1.

[1] Cumulative prospect theory: Tversky & Kahneman (1992); Rank-dependent expected utility: Quiggin (1982)
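The distortion curve on this slide can be sketched in a few lines of Python; the exponent 0.69 is the one shown in the plot, and the function name `w` simply mirrors the slide's notation:

```python
import numpy as np

def w(p, eta=0.69):
    """Tversky-Kahneman probability weight w(p) = p^eta / (p^eta + (1-p)^eta)^(1/eta).
    Overweights small probabilities and underweights large ones."""
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1.0 - p)**eta) ** (1.0 / eta)
```

For small p we get w(p) > p (rare events are overweighted), while for large p, w(p) < p; the endpoints w(0) = 0 and w(1) = 1 are preserved.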

SLIDE 7

1-slide summary

K-armed bandits

  • Upper Confidence Bound (UCB) + distortions
  • Sublinear regret O(n^((2−α)/2)), where α ∈ (0, 1) is the Hölder exponent of w
  • Matching lower bound

Linearly parameterized bandits

  • Optimism in the Face of Uncertainty Linear (OFUL) + arm-dependent noise model
  • Regret O(d√n polylog(n)) for sub-Gaussian cost distributions

Application: Traveler's route choice

  • optimize the route choice of a human traveler using the GLD traffic simulator
  • implement vanilla OFUL and weight-distorted OFUL (WOFUL)
  • exhibit a qualitative difference between WOFUL and OFUL routes

SLIDE 8

Outline

  • K-armed bandits
  • Linear bandits
  • Routing application

SLIDE 9

Bandit model

Known: # of arms K and horizon n
Unknown: distributions Fk, k = 1, . . . , K, with distorted values µ1, . . . , µK
Interaction: in each round m = 1, . . . , n

  • pull arm Im ∈ {1, . . . , K}
  • observe a sample cost from FIm

Benchmark: µ∗ = min_{k∈{1,...,K}} { µk := ∫₀^∞ w(1 − Fk(z)) dz }

Regret: Rn = Σ_{k=1}^{K} Tk(n) µk − n µ∗ = Σ_{k=1}^{K} Tk(n) ∆k

  • Tk(n) is the # of times arm k is pulled up to time n
  • ∆k = µk − µ∗ is the gap

Goal: minimize the expected regret E[Rn] = Σ_{k=1}^{K} E[Tk(n)] ∆k
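The regret decomposition Rn = Σk Tk(n)∆k is easy to sanity-check numerically; a minimal sketch with made-up pull counts and distorted values:

```python
import numpy as np

def regret(pull_counts, mu, mu_star):
    """Regret bookkeeping from the slide: R_n = sum_k T_k(n) * Delta_k,
    with gaps Delta_k = mu_k - mu_star and mu_star = min_k mu_k."""
    gaps = np.asarray(mu, dtype=float) - mu_star
    return float(np.dot(pull_counts, gaps))

# hedged example: 3 arms with fictitious distorted values 0.4, 0.5, 0.7
r = regret(pull_counts=[90, 8, 2], mu=[0.4, 0.5, 0.7], mu_star=0.4)
```

A good algorithm concentrates pulls on the optimal arm, so the large count (90) multiplies a zero gap and contributes nothing.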

SLIDE 10

UCB values [1]

UCB(k) = µ̂k − σ̂k

  • µ̂k: mean-reward estimate
  • σ̂k: confidence width

At each round t, select a tap. Optimize the quality of the n selected beers.

[1] Auer et al. (2002) Finite-time analysis of the multiarmed bandit problem. In: MLJ.

SLIDE 11

Assumptions

(A1). Weight w is Hölder continuous with constant L and exponent α ∈ (0, 1]
(A2). The arms' costs are bounded by M > 0 a.s.

Weighted UCB

Pull each arm once.
For each round m = 1, 2, . . . do
  For each arm k = 1, . . . , K do
    Compute an estimate µ̂_{k,Tk(m−1)} of the weight-distorted value µk
    UCB index: UCB(k, m) = µ̂_{k,Tk(m−1)} − LM ( 3 log m / (2 Tk(m − 1)) )^(α/2)
  Pull arm Im = argmin_{k∈{1,...,K}} UCB(k, m).
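A minimal sketch of the index computation and arm-selection step, assuming the distorted-value estimates µ̂k are already available (L, M, α are the constants from (A1)-(A2); the numbers below are fictitious):

```python
import numpy as np

def wucb_indices(estimates, pulls, m, L=1.0, M=1.0, alpha=1.0):
    """Cost-minimization W-UCB indices from the slide:
    UCB(k, m) = mu_hat_k - L*M*(3*log(m) / (2*T_k(m-1)))**(alpha/2)."""
    width = L * M * (3.0 * np.log(m) / (2.0 * np.asarray(pulls, float))) ** (alpha / 2.0)
    return np.asarray(estimates, dtype=float) - width

# hedged example: 3 arms after 22 total rounds; arm 2 was pulled only twice,
# so its confidence width is large and it gets explored despite a low estimate
idx = wucb_indices(estimates=[0.5, 0.6, 0.4], pulls=[10, 10, 2], m=22)
arm = int(np.argmin(idx))
```

The optimism is on the low side because arms carry costs: the width is subtracted, and the arm with the smallest index is pulled.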

SLIDE 13

Weight-distorted value estimation

Problem: Estimate the weight-distorted value µk = ∫₀^∞ w(1 − Fk(z)) dz for some k ∈ {1, . . . , K}
Input: Samples Yk,1, . . . , Yk,j from distribution Fk
Solution:

µ̂_{k,j} := Σ_{i=1}^{j} Y_{k,[i]} ( w((j + 1 − i)/j) − w((j − i)/j) ),

where Y_{k,[1]} ≤ · · · ≤ Y_{k,[j]} are the order statistics of the samples.

Interpretation: µ̂_{k,j} = ∫₀^∞ w(1 − F̂_{k,j}(z)) dz, where

F̂_{k,j}(x) := (1/j) Σ_{i=1}^{j} I[Yk,i ≤ x]

is the empirical distribution function for arm k.

Sample complexity: Under (A1) and (A2), for all ϵ > 0 and any k ∈ {1, . . . , K},

P( |µ̂_{k,j} − µk| > ϵ ) ≤ 2 exp( −2j (ϵ/(LM))^(2/α) ).

Hence µk lies in [ µ̂_{k,j} − LM (3 log m / (2j))^(α/2), µ̂_{k,j} + LM (3 log m / (2j))^(α/2) ] w.h.p.
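The order-statistic estimator above is straightforward to implement; a sketch assuming the weight function `w` is supplied by the caller:

```python
import numpy as np

def distorted_value_estimate(samples, w):
    """Estimate mu_k = integral of w(1 - F_k(z)) dz from i.i.d. cost samples
    by plugging in the empirical CDF: a w-weighted sum of order statistics."""
    y = np.sort(np.asarray(samples, dtype=float))   # Y_[1] <= ... <= Y_[j]
    j = len(y)
    i = np.arange(1, j + 1)
    weights = w((j + 1 - i) / j) - w((j - i) / j)   # increments of w(1 - F_hat)
    return float(np.dot(y, weights))
```

With the identity weight w(p) = p the increments all equal 1/j, so the estimator reduces to the sample mean, which is a quick correctness check; the weights also telescope to w(1) − w(0) = 1, so constant samples are recovered exactly.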

SLIDE 15

How I learn to stop regretting..

Upper bound

Gap-dependent: E[Rn] ≤ Σ_{k: ∆k>0} 3 (2LM)^(2/α) log n / (2 ∆k^(2/α−1)) + MK ( 1 + 2π²/3 )

Gap-independent: E[Rn] ≤ M K^(α/2) ( (3/2) (2L)^(2/α) log n + c )^(α/2) n^((2−α)/2)

For α < 1, the bound above is worse than the usual UCB upper bound of O(√n).

Lower bound: For any sub-polynomial regret algorithm, ∃ a stochastic environment and a Hölder weight w such that

E[Rn] = Ω( Σ_{k: ∆k>0} (LM)^(2/α) log n / (4 ∆k^(2/α−1)) )

f(n) = Ω(g(n)) ⇔ f(n) ≥ c g(n) for some positive c and all n > n0

SLIDE 16

Outline

K-armed bandits Linear bandits Routing application

12

SLIDE 18

Linear bandit model

Choose xIm → Observe cm → Estimate θ (via ridge regression)

cm := xImᵀ (θ + Nm)

Unknown parameter: θ ∈ R^d
Large set of arms: xi ∈ R^d, i = 1, . . . , K, with K ≫ 1
Gaussian noise: Nm := (Nm¹, . . . , Nm^d) is a random vector of i.i.d. standard Gaussian r.v.s
Linearity ⇒ no need to estimate the mean reward of every arm; estimating θ is enough.

Optimize the beer you drink, before you get drunk.

SLIDE 19

Arm-dependent noise model

Routing example:

[Figure: road network with lanes 1–12, source src and destination dst]

Dimension: d = # of lanes
Route: x is a collection of edges, encoded by a vector of 0–1 values
Edge weight: for any edge j, θj specifies the edge delay
Noise model: cm := xImᵀ (θ + Nm) for any Im ∈ {1, . . . , K}

Previous linear bandit algorithms, e.g. OFUL [1], assume cm := xImᵀ θ + ξm, where ξm is standard Gaussian.

[1] Abbasi-Yadkori et al. (2011) Improved algorithms for linear stochastic bandits. In NIPS.
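The difference between the two noise models can be made concrete: under the arm-dependent model the cost of a route x has variance ‖x‖², not a fixed constant. A hedged sketch (the route vector and θ below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def arm_dependent_cost(x, theta, rng):
    """Arm-dependent noise model from the slide: c = x^T (theta + N),
    with N a vector of i.i.d. standard Gaussians, so Var(c) = ||x||^2."""
    return float(x @ (theta + rng.standard_normal(theta.shape)))

theta = np.array([1.0, 2.0, 0.5])      # hypothetical per-edge delays
x = np.array([1.0, 0.0, 1.0])          # hypothetical 0-1 route vector
c = arm_dependent_cost(x, theta, rng)  # mean x @ theta = 1.5, variance 2
```

Repeated draws have mean xᵀθ and variance ‖x‖², whereas OFUL's model would give every route the same unit noise variance.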

SLIDE 21

WOFUL Algorithm

Initialization: A1 = λ I_{d×d}, b1 = 0, θ̂1 = 0.

For each round m = 1, 2, . . . do
  Confidence ellipsoid: Set Cm := { θ ∈ R^d : ‖θ − θ̂m‖_{Am} ≤ Dm },
  where Dm := √( 2 log( det(Am)^(1/2) / (λ^(d/2) δ) ) ) + β √λ
  Arm selection + feedback: Let (xm, θ̃m) := argmin_{(x,θ′)∈X×Cm} µx(θ′). Choose arm xm and observe cost cm.
  Update statistics: Am+1 = Am + xm xmᵀ / ‖xm‖², bm+1 = bm + cm xm / ‖xm‖, and θ̂m+1 = A_{m+1}^{−1} bm+1

Remarks:
  • Dm ensures that θ lies in Cm with high probability
  • OFUL's choice of optimizing xᵀθ over the ellipsoid won't work with distortions
  • The statistics updates implement ridge regression
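The ridge-regression bookkeeping in the update step can be sketched as follows; the norm-scaled updates match the slide, and λ below is the regularization constant from the initialization:

```python
import numpy as np

def woful_update(A, b, x, c):
    """One round of WOFUL's ridge-regression statistics:
    A += x x^T / ||x||^2,  b += c x / ||x||,  theta_hat = A^{-1} b."""
    nx = np.linalg.norm(x)
    A = A + np.outer(x, x) / nx**2
    b = b + c * x / nx
    theta_hat = np.linalg.solve(A, b)   # solve A theta = b instead of inverting A
    return A, b, theta_hat

# hedged example: d = 2, lambda = 1 regularizer, one fictitious observation
d, lam = 2, 1.0
A, b = lam * np.eye(d), np.zeros(d)
A, b, theta_hat = woful_update(A, b, x=np.array([1.0, 0.0]), c=0.5)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the standard numerically stable way to realize θ̂m+1 = A_{m+1}^{−1} bm+1.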

SLIDE 24

WOFUL but no-regret

Assume 0 ≤ w(p) ≤ 1 for all p ∈ (0, 1), xᵀθ ∈ [−1, 1] for all x ∈ X, and ‖θ‖₂ ≤ β.

Benchmark: µ∗ := min_{x′∈X} µ_{x′}(θ)

Regret: Rn = Σ_{m=1}^{n} µ_{xm}(θ) − n µ∗

Upper bound: For any δ > 0, the regret Rn of WOFUL satisfies

P( Rn ≤ √(32 d n) Dn log n for all n ≥ 1 ) ≥ 1 − δ,

where Dn = c √( d log( n ℓ² / (λδ) ) ). Hence the bound above is Õ(d√n) and matches the upper bound of OFUL.

Õ(·) is a variant of O(·) that ignores log factors.

SLIDE 25

Outline

K-armed bandits Linear bandits Routing application

17

SLIDE 26

Take a detour and be happy in unexpected ways..

[Figure: road network from src to dst; xOFUL is the blue-dotted route, xWOFUL is the green-dotted detour]

          Expected value xalgᵀ θ̂off    Distorted value µ_{xalg}(θ̂off)
OFUL      51.71                        −2.9
WOFUL     57.86                        −6.75

Ground truth: θ̂off is a ridge-regression-based estimate of the true θ, obtained by running a *long* independent simulation.

SLIDE 27

Dilbert on AI
