Weighted bandits or: How bandits learn distorted values that are not expected
Prashanth L.A. Joint work with Aditya Gopalan, Michael Fu and Steve Marcus
University of Maryland, College Park · Indian Institute of Science
Going to office - bandit style

Every day:
- 1. Pick a route to the office
- 2. Reach the office and record the (suffered) delay
Why not distort?
Delays are stochastic. In choosing between routes, humans *need not* minimize expected delay.
Why not distort?
Two-route scenario 1: the average delay of Route 2 is slightly above that of Route 1, but Route 2 has a *small* chance of *very* low delay. I might prefer Route 2.
Two-route scenario 2: the average delay of Route 2 is slightly below that of Route 1, but Route 2 has a *small* chance of *very* high delay, e.g. jammed traffic. I might prefer Route 1.
What we do
Probability distortion [1]

[Figure: weight function w(p) = p^0.69 / (p^0.69 + (1 − p)^0.69)^{1/0.69} plotted against the probability p on [0, 1]; small probabilities are overweighted and large probabilities underweighted.]
Multi-armed bandits
The weight-distorted value µk for any arm k ∈ {1, . . . , K} is
µk = ∫₀^∞ w(P[Yk > z]) dz − ∫₀^∞ w(P[−Yk > z]) dz,
where Yk is the r.v. corresponding to stochastic costs from arm k, and the weight function w : [0, 1] → [0, 1] satisfies w(0) = 0, w(1) = 1.
[1] Cumulative prospect theory: Tversky & Kahneman (1992). Rank-dependent expected utility: Quiggin (1982).
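As a concrete illustration (my addition, not from the slides), here is a minimal Python sketch of the Tversky–Kahneman weight function above, together with a plug-in estimate of the weight-distorted value of a cost distribution; the function names and the discretization grid are assumptions made for this sketch.

```python
import numpy as np

def tk_weight(p, eta=0.69):
    """Tversky-Kahneman weight: w(p) = p^eta / (p^eta + (1 - p)^eta)^(1/eta)."""
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1 - p)**eta) ** (1 / eta)

def distorted_value(samples, w=tk_weight, grid_size=2000):
    """Plug-in estimate of mu = int_0^inf w(P[Y > z]) dz - int_0^inf w(P[-Y > z]) dz."""
    y = np.asarray(samples, dtype=float)
    z = np.linspace(0.0, np.abs(y).max(), grid_size)
    dz = z[1] - z[0]
    pos = w((y[None, :] > z[:, None]).mean(axis=1))   # w(P[Y > z]) on the grid
    neg = w((-y[None, :] > z[:, None]).mean(axis=1))  # w(P[-Y > z]) on the grid
    return float((pos - neg).sum() * dz)              # Riemann sum of the difference

# Route with a small chance of a very high delay: the distortion
# inflates that rare event relative to the plain expectation.
rng = np.random.default_rng(0)
delays = rng.choice([10.0, 60.0], size=10_000, p=[0.95, 0.05])
print(delays.mean(), distorted_value(delays))
```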
1-slide summary
K-armed bandits
- Upper Confidence Bound (UCB) + distortions
- Sublinear regret O(n^{(2−α)/2}), where α ∈ (0, 1] is the Hölder exponent of w
- Matching lower bound
Linearly parameterized bandits
- Optimism in the Face of Uncertainty Linear (OFUL) + arm-dependent noise model
- Regret O(d√n polylog(n)) for sub-Gaussian cost distributions
Application: Traveler’s route choice
- optimizing the route choice of a human traveler using the GLD traffic simulator
- implement vanilla OFUL and weight-distorted OFUL (WOFUL)
- exhibit a qualitative difference between WOFUL and OFUL routes
Outline
- K-armed bandits
- Linear bandits
- Routing application
Bandit model
Known: # of arms K and horizon n
Unknown: distributions Fk, k = 1, . . . , K, with distorted values µ1, . . . , µK
Interaction: in each round m = 1, . . . , n
- pull arm Im ∈ {1, . . . , K}
- observe a sample cost from F_{Im}

Benchmark: µ∗ = min_{k=1,...,K} { µk := ∫₀^∞ w(1 − Fk(z)) dz }

Regret: Rn = Σ_{k=1}^K Tk(n) µk − n µ∗ = Σ_{k=1}^K Tk(n) ∆k
- Tk(n) is the # of times arm k is pulled up to time n
- ∆k = µk − µ∗ is the gap

Goal: minimize the expected regret ERn = Σ_{k=1}^K E[Tk(n)] ∆k
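To make the decomposition concrete (my addition, with made-up numbers), the regret follows directly from pull counts and gaps:

```python
import numpy as np

mu = np.array([1.0, 1.3, 1.8])    # hypothetical distorted values mu_k (arm 0 is best)
pulls = np.array([900, 80, 20])   # T_k(n), so n = pulls.sum() = 1000
gaps = mu - mu.min()              # Delta_k = mu_k - mu*
print(pulls @ gaps)               # R_n = sum_k T_k(n) Delta_k = 40.0
# Identical to the first form: sum_k T_k(n) mu_k - n mu*
print(pulls @ mu - pulls.sum() * mu.min())
```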
UCB values [1]

UCB(k) = µ̂k − σ̂k, where µ̂k is the mean-reward estimate and σ̂k is the confidence width.

(Beer-tap analogy: at each round t, select a tap; optimize the quality of the n selected beers.)
[1] Auer et al. (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal.
Assumptions
(A1) The weight w is Hölder continuous with constant L and exponent α ∈ (0, 1].
(A2) The arms' costs are bounded by M > 0 a.s.
Weighted UCB
Pull each arm once.
For each round m = 1, 2, . . . do
  For each arm k = 1, . . . , K do
    Compute an estimate µ̂_{k,Tk(m−1)} of the weight-distorted value µk
    UCB index: UCB(k, m) = µ̂_{k,Tk(m−1)} − LM ( 3 log m / (2 Tk(m − 1)) )^{α/2}
  Pull arm Im = arg min_{k∈{1,...,K}} UCB(k, m)
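A minimal Python sketch of this loop (my reconstruction under the stated assumptions; `pull_arm` and `estimate_mu` are hypothetical interfaces, with `estimate_mu` standing in for the weight-distorted value estimator on the next slide):

```python
import math
import numpy as np

def weighted_ucb(pull_arm, estimate_mu, K, n, L, M, alpha):
    """W-UCB for costs: play the arm minimizing
    mu_hat_k - L*M * (3 log m / (2 T_k(m-1)))^(alpha/2)."""
    samples = [[pull_arm(k)] for k in range(K)]  # pull each arm once
    for m in range(K + 1, n + 1):
        index = [
            estimate_mu(samples[k])
            - L * M * (3 * math.log(m) / (2 * len(samples[k]))) ** (alpha / 2)
            for k in range(K)
        ]
        k_min = int(np.argmin(index))            # optimistic arm (costs are minimized)
        samples[k_min].append(pull_arm(k_min))
    return samples
```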
Weight-distorted value estimation
Problem: estimate the weight-distorted value µk = ∫₀^∞ w(1 − Fk(z)) dz for some k ∈ {1, . . . , K}
Input: samples Yk,1, . . . , Yk,j from distribution Fk
Solution: µ̂k,j := Σ_{i=1}^{j} Y[k,i] ( w((j + 1 − i)/j) − w((j − i)/j) ),
where Y[k,1] ≤ · · · ≤ Y[k,j] are the order statistics of the samples.
Interpretation: µ̂k,j = ∫₀^∞ w(1 − F̂k,j(z)) dz, where F̂k,j(x) := (1/j) Σ_{i=1}^{j} I[Yk,i ≤ x] is the empirical distribution function of arm k.

Sample complexity: under (A1) and (A2), for all ϵ > 0 and any k ∈ {1, . . . , K}, we have
P( |µ̂k,j − µk| > ϵ ) ≤ 2 exp( −2j (ϵ/LM)^{2/α} ).

Consequently, µk lies in [ µ̂k,j − LM (3 log m / (2j))^{α/2}, µ̂k,j + LM (3 log m / (2j))^{α/2} ] w.h.p.
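A short Python sketch of this estimator (my addition; `w` can be the `tk_weight` function from earlier). Plugged in as `estimate_mu` above, it completes the W-UCB loop:

```python
import numpy as np

def distorted_value_estimate(samples, w):
    """mu_hat_{k,j} = sum_i Y_[k,i] * ( w((j+1-i)/j) - w((j-i)/j) ),
    with Y_[k,1] <= ... <= Y_[k,j] the ascending order statistics."""
    y = np.sort(np.asarray(samples, dtype=float))
    j = len(y)
    i = np.arange(1, j + 1)
    coeffs = w((j + 1 - i) / j) - w((j - i) / j)  # telescopes to w(1) - w(0) = 1
    return float(y @ coeffs)

# Sanity check: with the identity weight, the estimate is the sample mean.
assert np.isclose(distorted_value_estimate([3.0, 1.0, 2.0], w=lambda p: p), 2.0)
```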
How I learn to stop regretting..
Upper bound
Gap-dependent: ERn ≤ Σ_{k:∆k>0} 3(2LM)^{2/α} log n / (2 ∆k^{2/α−1}) + MK(1 + 2π²/3).
Gap-independent: ERn ≤ M K^{α/2} ( (3/2)(2L)^{2/α} log n + c )^{α/2} n^{(2−α)/2}.
For α < 1, the bound above is worse than the usual UCB upper bound of O(√n).

Lower bound
For any sub-polynomial regret algorithm, ∃ a stochastic environment and a Hölder weight w such that
ERn = Ω( Σ_{k:∆k>0} (LM)^{2/α} log n / (4 ∆k^{2/α−1}) ).

Here f(n) = Ω(g(n)) ⇔ f(n) ≥ c g(n) for some positive c and all n > n0.
Outline
- K-armed bandits
- Linear bandits
- Routing application
Linear bandit model
In each round: choose x_{Im}, observe cm := x_{Im}^T (θ + Nm), and estimate θ via ridge regression.

Unknown parameter: θ ∈ R^d
Large set of arms: xi ∈ R^d, i = 1, . . . , K, with K ≫ 1
Gaussian noise: Nm := (N_m^1, . . . , N_m^d) is a random vector of i.i.d. standard Gaussian r.v.s
Linearity ⇒ no need to estimate the mean reward of all arms; estimating θ is enough

(Beer analogy: optimize the beer you drink, before you get drunk.)
Arm-dependent noise model
Routing example:
[Figure: grid road network with numbered lanes 1–12, from src to dst.]

Dimension: d = # of lanes
Route: x is a collection of edges, encoded as a vector of 0–1 values
Edge weight: for any edge j, θj specifies the edge delay
Noise model: cm := x_{Im}^T (θ + Nm) for any Im ∈ {1, . . . , K}

Previous linear bandit algorithms, e.g. OFUL [1], assume cm := x_{Im}^T θ + ξm, where ξm is standard Gaussian.

[1] Abbasi-Yadkori et al. (2011) Improved algorithms for linear stochastic bandits. In NIPS.
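To make the distinction concrete (my sketch, not from the slides; the sizes and delay ranges are made up), costs under the two noise models can be simulated as follows. Under the arm-dependent model the cost variance scales with the number of edges on the route, rather than being a fixed 1 as in OFUL:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 12, 20
theta = rng.uniform(1.0, 5.0, size=d)              # unknown edge delays theta_j
routes = (rng.random((K, d)) < 0.4).astype(float)  # 0-1 route vectors x_i

def cost_arm_dependent(x):
    """This talk's model: c = x^T (theta + N), noise enters per edge."""
    return x @ (theta + rng.standard_normal(d))

def cost_oful(x):
    """Standard OFUL model: c = x^T theta + xi, one scalar Gaussian per round."""
    return x @ theta + rng.standard_normal()

x = routes[0]
var_hat = np.var([cost_arm_dependent(x) for _ in range(20_000)])
print(var_hat, x @ x)  # Var(c | x) = ||x||^2 = number of edges on the route
```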
WOFUL Algorithm
Initialization: A1 = λ I_{d×d}, b1 = 0, θ̂1 = 0.

For each round m = 1, 2, . . . do
  Confidence ellipsoid: set Cm := { θ ∈ R^d : ∥θ − θ̂m∥_{Am} ≤ Dm },
  where Dm := √( 2 log( det(Am)^{1/2} / (λ^{d/2} δ) ) ) + β√λ
  (ensures θ lies in Cm with high probability; OFUL's choice of x^Tθ within the ellipsoid won't work with distortions)
  Arm selection + feedback: let (xm, θ̃m) := arg min_{(x,θ′)∈X×Cm} µx(θ′); choose arm xm and observe cost cm
  Update statistics (ridge regression): Am+1 = Am + xm xm^T / ∥xm∥², bm+1 = bm + cm xm / ∥xm∥, and θ̂m+1 = A_{m+1}^{−1} b_{m+1}
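A compact Python sketch of one WOFUL round (my reconstruction, not the authors' code): the inner arg min over the ellipsoid Cm is approximated here by sampling candidate θ′ vectors, which is an assumption made purely for illustration; `mu_x(x, theta)` is a hypothetical handle to the weight-distorted value of arm x under parameter θ.

```python
import numpy as np

def woful_round(A, b, arms, mu_x, delta, beta, lam, rng, n_cand=256):
    """Build the confidence ellipsoid C_m, pick the optimistic (arm, theta') pair."""
    d = A.shape[0]
    theta_hat = np.linalg.solve(A, b)
    D = np.sqrt(2 * np.log(np.sqrt(np.linalg.det(A)) / (lam ** (d / 2) * delta))) \
        + beta * np.sqrt(lam)
    # Sample theta' in {theta : ||theta - theta_hat||_A <= D}: map the unit
    # ball through A^{-1/2} (Cholesky factor of A^{-1}) and shift by theta_hat.
    Linv = np.linalg.cholesky(np.linalg.inv(A))
    u = rng.standard_normal((n_cand, d))
    u *= rng.random((n_cand, 1)) ** (1 / d) / np.linalg.norm(u, axis=1, keepdims=True)
    cands = theta_hat + D * (u @ Linv.T)
    x_m, _ = min(((x, th) for x in arms for th in cands),
                 key=lambda pair: mu_x(pair[0], pair[1]))
    return x_m

def woful_update(A, b, x, c):
    """Ridge-regression statistics with the arm-norm scaling from the slide."""
    A = A + np.outer(x, x) / (x @ x)
    b = b + c * x / np.linalg.norm(x)
    return A, b, np.linalg.solve(A, b)  # theta_hat_{m+1} = A^{-1} b_{m+1}
```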
WOFUL but no-regret
Assume 0 ≤ w(p) ≤ 1 for all p ∈ (0, 1), x^Tθ ∈ [−1, 1] for all x ∈ X, and ∥θ∥2 ≤ β.
Benchmark: µ∗ := min_{x′∈X} µ_{x′}(θ)
Regret: Rn = Σ_{m=1}^{n} µ_{xm}(θ) − n µ∗

Upper bound: for any δ > 0, the regret Rn of WOFUL satisfies
P( Rn ≤ √(32 d n) Dn log n for all n ≥ 1 ) ≥ 1 − δ,
where Dn = c √( d log( n ℓ² / (λδ) ) ). Hence the bound above is Õ(d√n) and matches the upper bound of OFUL.

Õ(·) is a variant of O(·) that ignores log factors.
Outline
- K-armed bandits
- Linear bandits
- Routing application
Take a detour and be happy in unexpected ways..
[Figure: the grid road network from src to dst again, comparing the route chosen under the expected value with the route chosen under the distorted value.]