

  1. Weighted bandits or: How bandits learn distorted values that are not expected. Prashanth L.A.∗, joint work with Aditya Gopalan†, Michael Fu∗ and Steve Marcus∗. ∗University of Maryland, College Park; †Indian Institute of Science

  2. Going to office, bandit style. On every day: 1. Pick a route to the office. 2. Reach the office and record the (suffered) delay.

  3. Why not distort? Delays are stochastic. In choosing between routes, humans *need not* minimize expected delay.

  4–5. Why not distort? Two-route scenario 1: the average delay of Route 2 is slightly above that of Route 1, but Route 2 has a *small* chance of *very* low delay. I might prefer Route 2. Two-route scenario 2: the average delay of Route 2 is slightly below that of Route 1, but Route 2 has a *small* chance of *very* high delay, e.g. jammed traffic. I might prefer Route 1.

  6. What we do: multi-armed bandits + probability distortion. Background: rank-dependent expected utility, Quiggin (1982) [1]; cumulative prospect theory, Tversky & Kahneman (1992). [Figure: weight function w(p) = p^0.69 / (p^0.69 + (1 − p)^0.69)^{1/0.69} plotted against the probability p.] The weight-distorted value µ_k of any arm k ∈ {1, …, K} is
  µ_k = ∫_0^∞ w(P[Y_k > z]) dz − ∫_0^∞ w(P[−Y_k > z]) dz,
  where Y_k is the r.v. corresponding to the stochastic costs from arm k, and the weight function w : [0, 1] → [0, 1] satisfies w(0) = 0, w(1) = 1.
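As a quick illustration (not part of the slides), the Tversky–Kahneman weight above is easy to evaluate numerically; note how it inflates small probabilities and deflates large ones:

```python
def w(p: float, eta: float = 0.69) -> float:
    """Tversky-Kahneman probability weight w(p) = p^eta / (p^eta + (1-p)^eta)^(1/eta)."""
    return p**eta / (p**eta + (1.0 - p) ** eta) ** (1.0 / eta)

# Endpoints are preserved: w(0) = 0 and w(1) = 1.
print(w(0.01))  # ≈ 0.04, i.e. a 1% chance is weighted as if it were ~4%
print(w(0.99))  # ≈ 0.95, i.e. a 99% chance is slightly underweighted
```

This overweighting of rare events is exactly what makes the two-route scenarios above rational under a distorted value.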

  7. 1-slide summary.
  K-armed bandits:
  • Upper Confidence Bound (UCB) + distortions
  • sublinear regret O(n^{(2−α)/2}), where α ∈ (0, 1) is the Hölder exponent of w
  • matching lower bound
  Linearly parameterized bandits:
  • Optimism in the Face of Uncertainty Linear (OFUL) + arm-dependent noise model
  • regret O(d √n polylog(n)) for sub-Gaussian cost distributions
  Application: traveler's route choice
  • optimize the route choice of a human traveler using the GLD traffic simulator
  • implement vanilla OFUL and weight-distorted OFUL (WOFUL)
  • exhibit a qualitative difference between WOFUL and OFUL routes

  8. Outline: K-armed bandits · Linear bandits · Routing application

  9. Bandit model.
  Known: number of arms K and horizon n.
  Unknown: distributions F_k, k = 1, …, K, with distorted values µ_1, …, µ_K.
  Interaction: in each round m = 1, …, n,
  • pull arm I_m ∈ {1, …, K}
  • observe a sample cost from F_{I_m}
  Benchmark: µ* = min_{k = 1, …, K} µ_k, where µ_k := ∫_0^∞ w(1 − F_k(z)) dz.
  Regret: R_n = Σ_{k=1}^K T_k(n) µ_k − n µ* = Σ_{k=1}^K T_k(n) ∆_k, where T_k(n) is the number of times arm k is pulled up to time n and ∆_k = µ_k − µ* is the gap.
  Goal: minimize the expected regret E R_n = Σ_{k=1}^K E[T_k(n)] ∆_k.

  10. UCB values. Running example: at each round t, select a tap; optimize the quality of the n selected beers. The UCB index combines a mean-reward estimate with a confidence width: UCB(k) = µ̂_k − σ̂_k (the minus sign because we minimize costs).
  [1] Auer et al. (2002) Finite-time analysis of the multiarmed bandit problem. In: MLJ.

  11. Weighted UCB.
  Assumptions:
  (A1) The weight w is Hölder continuous with constant L and exponent α ∈ (0, 1].
  (A2) The arms' costs are bounded by M > 0 a.s.
  Algorithm: pull each arm once. For each round m = 1, 2, … do:
  for each arm k = 1, …, K, compute an estimate µ̂_{k, T_k(m−1)} of the weight-distorted value µ_k and form the UCB index
  UCB(k, m) = µ̂_{k, T_k(m−1)} − LM ( 3 log m / (2 T_k(m−1)) )^{α/2}.
  Pull arm I_m = arg min_{k ∈ {1, …, K}} UCB(k, m).
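A minimal simulation of this loop might look as follows. The environment, the constants L and M, and the arm distributions are made up for illustration; `estimate_mu` is the order-statistics estimator of the distorted value described on the next slide:

```python
import math
import random

ETA = 0.69                     # Hölder exponent alpha of the TK weight (assumed)
L_CONST, M_BOUND = 0.5, 1.0    # assumed Hölder constant L and a.s. cost bound M

def w(p, eta=ETA):
    """Tversky-Kahneman probability weight."""
    return p**eta / (p**eta + (1 - p) ** eta) ** (1 / eta)

def estimate_mu(samples):
    """Order-statistics estimator of the distorted value (nonnegative costs)."""
    y, j = sorted(samples), len(samples)
    return sum(y[i - 1] * (w((j + 1 - i) / j) - w((j - i) / j)) for i in range(1, j + 1))

def wucb(pull, K, n, alpha=ETA):
    """Weighted UCB for cost minimization: pull each arm once, then repeatedly
    pull the arm minimizing mu_hat_k - L*M*(3 log m / (2 T_k(m-1)))^(alpha/2)."""
    costs = [[pull(k)] for k in range(K)]
    for m in range(K + 1, n + 1):
        def index(k):
            width = L_CONST * M_BOUND * (3 * math.log(m) / (2 * len(costs[k]))) ** (alpha / 2)
            return estimate_mu(costs[k]) - width
        best = min(range(K), key=index)
        costs[best].append(pull(best))
    return [len(c) for c in costs]  # pull counts T_k(n)

# Toy environment (made up): arm 0's delays are uniform on [0, 0.1], the other
# arms' on [0.5, 1], so arm 0 should collect most of the pulls.
random.seed(0)
pulls = wucb(lambda k: random.uniform(0, 0.1) if k == 0 else random.uniform(0.5, 1), K=3, n=500)
```

The lower confidence index (estimate minus width) is the cost-minimization counterpart of the usual UCB; the width shrinks at the Hölder-dependent rate T^{−α/2} instead of the familiar T^{−1/2}.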

  12–13. Weight-distorted value estimation.
  Problem: estimate the weight-distorted value µ_k = ∫_0^∞ w(1 − F_k(z)) dz for some k ∈ {1, …, K}.
  Input: samples Y_{k,1}, …, Y_{k,j} from the distribution F_k.
  Solution: µ̂_{k,j} := Σ_{i=1}^j Y_{[k,i]} ( w((j + 1 − i)/j) − w((j − i)/j) ), where Y_{[k,1]} ≤ ⋯ ≤ Y_{[k,j]} are the order statistics.
  Interpretation: µ̂_{k,j} = ∫_0^∞ w(1 − F̂_{k,j}(z)) dz, where F̂_{k,j}(x) := (1/j) Σ_{i=1}^j I[Y_{k,i} ≤ x] is the empirical distribution function for arm k.
  Sample complexity: under (A1) and (A2), for all ϵ > 0 and any k ∈ {1, …, K},
  P( |µ̂_{k,j} − µ_k| > ϵ ) ≤ 2 exp( −2j (ϵ/LM)^{2/α} ).
  Consequence: w.h.p., µ_k lies in [ µ̂_{k,j} − LM (3 log m/(2j))^{α/2}, µ̂_{k,j} + LM (3 log m/(2j))^{α/2} ].
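The telescoping structure of the estimator makes it easy to sanity-check. A minimal sketch (the two-point cost distribution is illustrative, not from the talk): for a sample containing only the values 0 and 1, the sum collapses to w(fraction of ones), which matches µ_k = w(P[Y_k = 1]) for such a two-point cost:

```python
def w(p, eta=0.69):
    """Tversky-Kahneman weight function."""
    return p**eta / (p**eta + (1 - p) ** eta) ** (1 / eta)

def estimate_mu(samples):
    """mu_hat_{k,j} = sum_i Y_[k,i] * (w((j+1-i)/j) - w((j-i)/j)),
    with Y_[k,1] <= ... <= Y_[k,j] sorted increasing (nonnegative costs)."""
    y, j = sorted(samples), len(samples)
    return sum(y[i - 1] * (w((j + 1 - i) / j) - w((j - i) / j)) for i in range(1, j + 1))

# For 70 zeros and 30 ones, the weights on the ones telescope to
# w(30/100) - w(0/100), so mu_hat equals w(0.3) exactly.
mu_hat = estimate_mu([0.0] * 70 + [1.0] * 30)
print(abs(mu_hat - w(0.3)) < 1e-9)  # True
```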

  14–15. How I learn to stop regretting…
  Upper bound (gap-dependent):
  E R_n ≤ Σ_{k: ∆_k > 0} 3 (2LM)^{2/α} log n / ∆_k^{2/α − 1} + MK (1 + 2π²/3).
  Upper bound (gap-independent):
  E R_n ≤ M K^{α/2} ( 2 (2L)^{2/α} · 3 log n )^{α/2} n^{(2 − α)/2} + c.
  For α < 1, the bound above is worse than the usual UCB upper bound of O(√n).
  Lower bound: for any sub-polynomial regret algorithm, ∃ a stochastic environment and a Hölder weight w such that
  E R_n = Ω( Σ_{k: ∆_k > 0} (LM)^{2/α} log n / (4 ∆_k^{2/α − 1}) ).
  Here f(n) = Ω(g(n)) ⇔ f(n) ≥ c g(n) for some positive c and all n > n_0.
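One way to see where the n^{(2−α)/2} rate comes from is the standard balancing argument (sketched here; the additive MK constant is dropped). Arms with gap at most ∆ contribute at most n∆ in total, so the gap-dependent bound gives

```latex
\mathbb{E} R_n \;\le\; n\Delta \;+\; \sum_{k:\,\Delta_k>\Delta} \frac{3\,(2LM)^{2/\alpha}\log n}{\Delta_k^{2/\alpha-1}}
\;\le\; n\Delta \;+\; \frac{3K\,(2LM)^{2/\alpha}\log n}{\Delta^{2/\alpha-1}} .
% Balancing the two terms, i.e. choosing
% \Delta = \bigl(3K\,(2LM)^{2/\alpha}\log n \,/\, n\bigr)^{\alpha/2},
% yields  \mathbb{E} R_n = O\bigl(LM\,K^{\alpha/2}(\log n)^{\alpha/2}\, n^{(2-\alpha)/2}\bigr).
```

For α = 1 (Lipschitz w) this recovers the familiar √n rate; the distortion only costs extra regret when w is rougher than Lipschitz.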

  16. Outline: K-armed bandits · Linear bandits · Routing application

  17–18. Linear bandit model.
  Large set of arms: x_i ∈ R^d, i = 1, …, K, with K ≫ 1.
  Unknown parameter θ ∈ R^d.
  Interaction: choose x_{I_m} and observe the cost c_m := x_{I_m}^T (θ + N_m).
  Gaussian noise: N_m := (N_m^1, …, N_m^d) is a random vector of i.i.d. standard Gaussian r.v.s.
  Estimate θ using ridge regression.
  Linearity ⇒ estimating θ is enough; there is no need to estimate the mean reward of all arms. (Running example: optimize the beer you drink before you get drunk.)

  19. Arm-dependent noise model.
  Previous linear bandit algorithms, e.g. OFUL [1], assume c_m := x_{I_m}^T θ + ξ_m, where ξ_m is standard Gaussian. Our noise model: c_m := x_{I_m}^T (θ + N_m) for any I_m ∈ {1, …, K}.
  Routing example: a route x is a collection of edges, encoded by a vector of 0–1 values; the dimension d = number of lanes; edge weight: for any edge j, θ_j specifies the edge delay. [Figure: road network from src to dst with numbered junctions.]
  [1] Abbasi-Yadkori et al. (2011) Improved algorithms for linear stochastic bandits. In NIPS.
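A tiny sketch of this cost model (the dimension, the route vector, and all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                                   # number of edges in the toy network
theta = rng.uniform(1.0, 5.0, size=d)   # theta_j: (unknown) mean delay of edge j
route = np.array([1, 0, 1, 1, 0, 0], dtype=float)  # 0-1 encoding of edges used

# Arm-dependent noise: each edge carries its own standard Gaussian perturbation,
# so the observed delay is c = x^T (theta + N) rather than x^T theta + xi.
cost = route @ (theta + rng.normal(size=d))
```

Note the difference from OFUL's model: here the noise variance scales with ‖x‖², i.e. longer routes are noisier, which is what the arm-dependent analysis accounts for.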

  20–21. WOFUL algorithm.
  • OFUL's choice of x^T θ within the confidence ellipsoid won't work with probability distortions.
  Initialization: A_1 = λ I_{d×d}, b_1 = 0, θ̂_1 = 0.
  For each round m = 1, 2, … do:
  Confidence ellipsoid: set C_m := { θ ∈ R^d : ‖θ − θ̂_m‖_{A_m} ≤ D_m }, where D_m := √( 2 log( det(A_m)^{1/2} λ^{−d/2} / δ ) ) + β √λ.
  • Ensures θ lies in C_m with high probability.
  Arm selection + feedback: let (x_m, θ̃_m) := arg min_{(x, θ′) ∈ X × C_m} µ̃_x(θ′); choose arm x_m and observe the cost c_m.
  Update statistics (ridge regression updates): A_{m+1} = A_m + x_m x_m^T / ‖x_m‖², b_{m+1} = b_m + c_m x_m / ‖x_m‖, θ̂_{m+1} = A_{m+1}^{−1} b_{m+1}.
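The bookkeeping in the update step can be sketched as below. This covers only the ridge-regression part; the joint minimization of the distorted value over X × C_m is problem-specific and left out. The class name and the toy usage are illustrative, and the toy's consistency check uses unit-norm arms, for which the normalized updates coincide with plain ridge regression:

```python
import numpy as np

class RidgeStats:
    """Normalized ridge statistics from the slide:
    A_{m+1} = A_m + x x^T / ||x||^2,  b_{m+1} = b_m + c x / ||x||,
    theta_hat_{m+1} = A_{m+1}^{-1} b_{m+1}."""

    def __init__(self, d, lam=1.0):
        self.lam = lam
        self.A = lam * np.eye(d)
        self.b = np.zeros(d)

    def update(self, x, c):
        nx = np.linalg.norm(x)
        self.A += np.outer(x, x) / nx**2
        self.b += c * x / nx
        return np.linalg.solve(self.A, self.b)   # theta_hat_{m+1}

    def radius(self, beta, delta):
        """D_m = sqrt(2 log(det(A_m)^{1/2} lam^{-d/2} / delta)) + beta sqrt(lam)."""
        d = self.A.shape[0]
        _, logdet = np.linalg.slogdet(self.A)
        return np.sqrt(2 * (0.5 * logdet - 0.5 * d * np.log(self.lam) - np.log(delta))) + beta * np.sqrt(self.lam)

# Toy usage: unit-norm arms, costs c = x^T (theta + N); theta_hat drifts toward theta.
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
stats = RidgeStats(d=3)
for _ in range(5000):
    x = rng.normal(size=3)
    x /= np.linalg.norm(x)
    est = stats.update(x, x @ (theta + rng.normal(size=3)))
```

The per-arm normalization by ‖x‖ reflects the arm-dependent noise model of slide 19, where the noise level scales with the chosen arm.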
