Weighted bandits or: How bandits learn distorted values that are not expected
Prashanth L.A. Joint work with Aditya Gopalan, Michael Fu and Steve Marcus
University of Maryland, College Park · Indian Institute of Science
Going to office - bandit style

Every day:
- 1. Pick a route to the office
- 2. Reach the office and record the (suffered) delay
Why not distort?
Delays are stochastic. In choosing between routes, humans *need not* minimize expected delay.
Why not distort?
Two-route scenario 1: the average delay of Route 2 is slightly above that of Route 1, but Route 2 has a *small* chance of *very* low delay. I might prefer Route 2.
Two-route scenario 2: the average delay of Route 2 is slightly below that of Route 1, but Route 2 has a *small* chance of *very* high delay, e.g. jammed traffic. I might prefer Route 1.
What we do
Probability distortion [1]

[Figure: weight function w(p) = p^0.69 / (p^0.69 + (1 − p)^0.69)^{1/0.69} plotted against the probability p on [0, 1]; small probabilities are overweighted and large probabilities underweighted.]
Multi-armed bandits
The weight-distorted value µk for any arm k ∈ {1, . . . , K} is
µk = ∫₀^∞ w(P[Yk > z]) dz − ∫₀^∞ w(P[−Yk > z]) dz,
where Yk is the r.v. corresponding to stochastic costs from arm k, and the weight function w : [0, 1] → [0, 1] satisfies w(0) = 0, w(1) = 1.
[1] Cumulative prospect theory: Tversky & Kahneman (1992). Rank-dependent expected utility: Quiggin (1982).
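As a concrete illustration (my addition, not from the slides), here is a minimal Python sketch of the Tversky–Kahneman weight function above, together with a plug-in estimate of the weight-distorted value of a cost distribution; the function names and the discretization grid are assumptions made for this sketch.

```python
import numpy as np

def tk_weight(p, eta=0.69):
    """Tversky-Kahneman weight: w(p) = p^eta / (p^eta + (1 - p)^eta)^(1/eta)."""
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1 - p)**eta) ** (1 / eta)

def distorted_value(samples, w=tk_weight, grid_size=2000):
    """Plug-in estimate of mu = int_0^inf w(P[Y > z]) dz - int_0^inf w(P[-Y > z]) dz."""
    y = np.asarray(samples, dtype=float)
    z = np.linspace(0.0, np.abs(y).max(), grid_size)
    dz = z[1] - z[0]
    pos = w((y[None, :] > z[:, None]).mean(axis=1))   # w(P[Y > z]) on the grid
    neg = w((-y[None, :] > z[:, None]).mean(axis=1))  # w(P[-Y > z]) on the grid
    return float((pos - neg).sum() * dz)              # Riemann sum of the difference

# Route with a small chance of a very high delay: the distortion
# inflates that rare event relative to the plain expectation.
rng = np.random.default_rng(0)
delays = rng.choice([10.0, 60.0], size=10_000, p=[0.95, 0.05])
print(delays.mean(), distorted_value(delays))
```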
1-slide summary
K-armed bandits
- Upper Confidence Bound (UCB) + distortions
- Sublinear regret O(n^{(2−α)/2}), where α ∈ (0, 1] is the Hölder exponent of w
- Matching lower bound
Linearly parameterized bandits
- Optimism in the Face of Uncertainty Linear (OFUL) + arm-dependent noise model
- Regret O(d√n polylog(n)) for sub-Gaussian cost distributions
Application: Traveler’s route choice
- optimizing the route choice of a human traveler using the GLD traffic simulator
- implement vanilla OFUL and weight-distorted OFUL (WOFUL)
- exhibit a qualitative difference between WOFUL and OFUL routes
Outline
- K-armed bandits
- Linear bandits
- Routing application
Bandit model
Known: # of arms K and horizon n
Unknown: distributions Fk, k = 1, . . . , K, with distorted values µ1, . . . , µK
Interaction: in each round m = 1, . . . , n
- pull arm Im ∈ {1, . . . , K}
- observe a sample cost from F_{Im}

Benchmark: µ∗ = min_{k=1,...,K} { µk := ∫₀^∞ w(1 − Fk(z)) dz }

Regret: Rn = Σ_{k=1}^K Tk(n) µk − n µ∗ = Σ_{k=1}^K Tk(n) ∆k
- Tk(n) is the # of times arm k is pulled up to time n
- ∆k = µk − µ∗ is the gap

Goal: minimize the expected regret ERn = Σ_{k=1}^K E[Tk(n)] ∆k
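To make the decomposition concrete (my addition, with made-up numbers), the regret follows directly from pull counts and gaps:

```python
import numpy as np

mu = np.array([1.0, 1.3, 1.8])    # hypothetical distorted values mu_k (arm 0 is best)
pulls = np.array([900, 80, 20])   # T_k(n), so n = pulls.sum() = 1000
gaps = mu - mu.min()              # Delta_k = mu_k - mu*
print(pulls @ gaps)               # R_n = sum_k T_k(n) Delta_k = 40.0
# Identical to the first form: sum_k T_k(n) mu_k - n mu*
print(pulls @ mu - pulls.sum() * mu.min())
```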
UCB values [1]

UCB(k) = µ̂k − σ̂k, where µ̂k is the mean-reward estimate and σ̂k is the confidence width.

(Beer-tap analogy: at each round t, select a tap; optimize the quality of the n selected beers.)
[1] Auer et al. (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal.
Assumptions
(A1) The weight w is Hölder continuous with constant L and exponent α ∈ (0, 1].
(A2) The arms' costs are bounded by M > 0 a.s.
Weighted UCB
Pull each arm once.
For each round m = 1, 2, . . . do
  For each arm k = 1, . . . , K do
    Compute an estimate µ̂_{k,Tk(m−1)} of the weight-distorted value µk
    UCB index: UCB(k, m) = µ̂_{k,Tk(m−1)} − LM ( 3 log m / (2 Tk(m − 1)) )^{α/2}
  Pull arm Im = arg min_{k∈{1,...,K}} UCB(k, m)
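A minimal Python sketch of this loop (my reconstruction under the stated assumptions; `pull_arm` and `estimate_mu` are hypothetical interfaces, with `estimate_mu` standing in for the weight-distorted value estimator on the next slide):

```python
import math
import numpy as np

def weighted_ucb(pull_arm, estimate_mu, K, n, L, M, alpha):
    """W-UCB for costs: play the arm minimizing
    mu_hat_k - L*M * (3 log m / (2 T_k(m-1)))^(alpha/2)."""
    samples = [[pull_arm(k)] for k in range(K)]  # pull each arm once
    for m in range(K + 1, n + 1):
        index = [
            estimate_mu(samples[k])
            - L * M * (3 * math.log(m) / (2 * len(samples[k]))) ** (alpha / 2)
            for k in range(K)
        ]
        k_min = int(np.argmin(index))            # optimistic arm (costs are minimized)
        samples[k_min].append(pull_arm(k_min))
    return samples
```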
Weight-distorted value estimation
Problem: estimate the weight-distorted value µk = ∫₀^∞ w(1 − Fk(z)) dz for some k ∈ {1, . . . , K}
Input: samples Yk,1, . . . , Yk,j from distribution Fk
Solution: µ̂k,j := Σ_{i=1}^{j} Y[k,i] ( w((j + 1 − i)/j) − w((j − i)/j) ),
where Y[k,1] ≤ · · · ≤ Y[k,j] are the order statistics of the samples.
Interpretation: µ̂k,j = ∫₀^∞ w(1 − F̂k,j(z)) dz, where F̂k,j(x) := (1/j) Σ_{i=1}^{j} I[Yk,i ≤ x] is the empirical distribution function of arm k.

Sample complexity: under (A1) and (A2), for all ϵ > 0 and any k ∈ {1, . . . , K}, we have
P( |µ̂k,j − µk| > ϵ ) ≤ 2 exp( −2j (ϵ/LM)^{2/α} ).

Consequently, µk lies in [ µ̂k,j − LM (3 log m / (2j))^{α/2}, µ̂k,j + LM (3 log m / (2j))^{α/2} ] w.h.p.
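A short Python sketch of this estimator (my addition; `w` can be the `tk_weight` function from earlier). Plugged in as `estimate_mu` above, it completes the W-UCB loop:

```python
import numpy as np

def distorted_value_estimate(samples, w):
    """mu_hat_{k,j} = sum_i Y_[k,i] * ( w((j+1-i)/j) - w((j-i)/j) ),
    with Y_[k,1] <= ... <= Y_[k,j] the ascending order statistics."""
    y = np.sort(np.asarray(samples, dtype=float))
    j = len(y)
    i = np.arange(1, j + 1)
    coeffs = w((j + 1 - i) / j) - w((j - i) / j)  # telescopes to w(1) - w(0) = 1
    return float(y @ coeffs)

# Sanity check: with the identity weight, the estimate is the sample mean.
assert np.isclose(distorted_value_estimate([3.0, 1.0, 2.0], w=lambda p: p), 2.0)
```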
How I learn to stop regretting..
Upper bound
Gap-dependent: ERn ≤ Σ_{k:∆k>0} 3(2LM)^{2/α} log n / (2 ∆k^{2/α−1}) + MK(1 + 2π²/3).
Gap-independent: ERn ≤ M K^{α/2} ( (3/2)(2L)^{2/α} log n + c )^{α/2} n^{(2−α)/2}.
For α < 1, the bound above is worse than the usual UCB upper bound of O(√n).

Lower bound
For any sub-polynomial regret algorithm, ∃ a stochastic environment and a Hölder weight w such that
ERn = Ω( Σ_{k:∆k>0} (LM)^{2/α} log n / (4 ∆k^{2/α−1}) ).

Here f(n) = Ω(g(n)) ⇔ f(n) ≥ c g(n) for some positive c and all n > n0.
Outline
- K-armed bandits
- Linear bandits
- Routing application
Linear bandit model
In each round: choose x_{Im}, observe cm := x_{Im}^T (θ + Nm), and estimate θ via ridge regression.

Unknown parameter: θ ∈ R^d
Large set of arms: xi ∈ R^d, i = 1, . . . , K, with K ≫ 1
Gaussian noise: Nm := (N_m^1, . . . , N_m^d) is a random vector of i.i.d. standard Gaussian r.v.s
Linearity ⇒ no need to estimate the mean reward of all arms; estimating θ is enough

(Beer analogy: optimize the beer you drink, before you get drunk.)
Arm-dependent noise model
Routing example:
[Figure: grid road network with numbered lanes 1–12, from src to dst.]

Dimension: d = # of lanes
Route: x is a collection of edges, encoded as a vector of 0–1 values
Edge weight: for any edge j, θj specifies the edge delay
Noise model: cm := x_{Im}^T (θ + Nm) for any Im ∈ {1, . . . , K}

Previous linear bandit algorithms, e.g. OFUL [1], assume cm := x_{Im}^T θ + ξm, where ξm is standard Gaussian.

[1] Abbasi-Yadkori et al. (2011) Improved algorithms for linear stochastic bandits. In NIPS.
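To make the distinction concrete (my sketch, not from the slides; the sizes and delay ranges are made up), costs under the two noise models can be simulated as follows. Under the arm-dependent model the cost variance scales with the number of edges on the route, rather than being a fixed 1 as in OFUL:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 12, 20
theta = rng.uniform(1.0, 5.0, size=d)              # unknown edge delays theta_j
routes = (rng.random((K, d)) < 0.4).astype(float)  # 0-1 route vectors x_i

def cost_arm_dependent(x):
    """This talk's model: c = x^T (theta + N), noise enters per edge."""
    return x @ (theta + rng.standard_normal(d))

def cost_oful(x):
    """Standard OFUL model: c = x^T theta + xi, one scalar Gaussian per round."""
    return x @ theta + rng.standard_normal()

x = routes[0]
var_hat = np.var([cost_arm_dependent(x) for _ in range(20_000)])
print(var_hat, x @ x)  # Var(c | x) = ||x||^2 = number of edges on the route
```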
WOFUL Algorithm
Initialization: A1 = λ I_{d×d}, b1 = 0, θ̂1 = 0.

For each round m = 1, 2, . . . do
  Confidence ellipsoid: set Cm := { θ ∈ R^d : ∥θ − θ̂m∥_{Am} ≤ Dm },
  where Dm := √( 2 log( det(Am)^{1/2} / (λ^{d/2} δ) ) ) + β√λ
  (ensures θ lies in Cm with high probability; OFUL's choice of x^Tθ within the ellipsoid won't work with distortions)
  Arm selection + feedback: let (xm, θ̃m) := arg min_{(x,θ′)∈X×Cm} µx(θ′); choose arm xm and observe cost cm
  Update statistics (ridge regression): Am+1 = Am + xm xm^T / ∥xm∥², bm+1 = bm + cm xm / ∥xm∥, and θ̂m+1 = A_{m+1}^{−1} b_{m+1}
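A compact Python sketch of one WOFUL round (my reconstruction, not the authors' code): the inner arg min over the ellipsoid Cm is approximated here by sampling candidate θ′ vectors, which is an assumption made purely for illustration; `mu_x(x, theta)` is a hypothetical handle to the weight-distorted value of arm x under parameter θ.

```python
import numpy as np

def woful_round(A, b, arms, mu_x, delta, beta, lam, rng, n_cand=256):
    """Build the confidence ellipsoid C_m, pick the optimistic (arm, theta') pair."""
    d = A.shape[0]
    theta_hat = np.linalg.solve(A, b)
    D = np.sqrt(2 * np.log(np.sqrt(np.linalg.det(A)) / (lam ** (d / 2) * delta))) \
        + beta * np.sqrt(lam)
    # Sample theta' in {theta : ||theta - theta_hat||_A <= D}: map the unit
    # ball through A^{-1/2} (Cholesky factor of A^{-1}) and shift by theta_hat.
    Linv = np.linalg.cholesky(np.linalg.inv(A))
    u = rng.standard_normal((n_cand, d))
    u *= rng.random((n_cand, 1)) ** (1 / d) / np.linalg.norm(u, axis=1, keepdims=True)
    cands = theta_hat + D * (u @ Linv.T)
    x_m, _ = min(((x, th) for x in arms for th in cands),
                 key=lambda pair: mu_x(pair[0], pair[1]))
    return x_m

def woful_update(A, b, x, c):
    """Ridge-regression statistics with the arm-norm scaling from the slide."""
    A = A + np.outer(x, x) / (x @ x)
    b = b + c * x / np.linalg.norm(x)
    return A, b, np.linalg.solve(A, b)  # theta_hat_{m+1} = A^{-1} b_{m+1}
```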
WOFUL but no-regret
Assume 0 ≤ w(p) ≤ 1 for all p ∈ (0, 1), x^Tθ ∈ [−1, 1] for all x ∈ X, and ∥θ∥2 ≤ β.
Benchmark: µ∗ := min_{x′∈X} µ_{x′}(θ)
Regret: Rn = Σ_{m=1}^{n} µ_{xm}(θ) − n µ∗

Upper bound: for any δ > 0, the regret Rn of WOFUL satisfies
P( Rn ≤ √(32 d n) Dn log n for all n ≥ 1 ) ≥ 1 − δ,
where Dn = c √( d log( n ℓ² / (λδ) ) ). Hence the bound above is Õ(d√n) and matches the upper bound of OFUL.

Õ(·) is a variant of O(·) that ignores log factors.
Outline
- K-armed bandits
- Linear bandits
- Routing application
Take a detour and be happy in unexpected ways..
[Figure: the grid road network from src to dst again, comparing the route chosen under the expected value with the route chosen under the distorted value.]