Nonparametric Bandits with Covariates
  1. Nonparametric Bandits with Covariates. Philippe Rigollet (Princeton University), with A. Zeevi (Columbia University). Support from NSF (DMS-0906424).

  2–4. Example: Real-time web page optimization. Which ad will generate the most $ / clicks?

  5. Characteristics of the problem
  • A choice must be made for each customer.
  • The outcome of the alternative choice cannot be observed.
  • Try to maximize the rewards.
  Exploration vs. exploitation dilemma:
  • Exploration: which one is the best?
  • Exploitation: display the best one as much as possible.

  6. Two-armed bandit problem: setup
  • Two arms (e.g. actions, ads): $i \in \{1, 2\}$.
  • At time $t$, a random reward $Y_t^{(i)}$ is observed when arm $i$ is pulled.
  • A policy $\pi$ is a sequence $\pi_1, \pi_2, \ldots \in \{1, 2\}$ indicating which arm to pull at each time $t$.
  • Performance: expected cumulative reward at time $n$, $\mathbb{E}\bigl[\sum_{t=1}^{n} Y_t^{(\pi_t)}\bigr]$.
  • Goal: maximize the reward.

  7. Two-armed bandit problem: regret
  • The oracle policy $\pi^\star = (\pi^\star_1, \pi^\star_2, \ldots)$ pulls at each time $t$ the best arm (in expectation): $\pi^\star_t = \operatorname{argmax}_{i=1,2} \mathbb{E}\bigl[Y_t^{(i)}\bigr]$.
  • We measure our performance by the regret $R_n(\pi) = \mathbb{E}\bigl[\sum_{t=1}^{n} \bigl(Y_t^{(\pi^\star_t)} - Y_t^{(\pi_t)}\bigr)\bigr]$.

  8. Static Environment
  • The problem is not new: Robbins ('52), Lai & Robbins ('85).

  9. Static Environment (continued)
  • Key assumption: static environment, i.e. the (unknown) expected rewards $\mu_i = \mathbb{E}\bigl[Y_t^{(i)}\bigr]$ are constant.
  • One way to solve the problem is to use an Upper Confidence Bounds (UCB) policy (see the sketch below).
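  To make the UCB idea concrete, here is a minimal sketch for the static two-armed case, using the Hoeffding-style index (empirical mean plus $\sqrt{2 \log t / N^{(i)}(t)}$) that reappears later as $B_t(s)$ for the binned policy. The Bernoulli reward model and all names are illustrative assumptions, not taken from the slides.

```python
import math
import random

def ucb_two_armed(pull, n):
    """Hoeffding-based UCB for two arms over n rounds.

    pull(i) must return a reward in [0, 1] for arm i in {0, 1}.
    Returns the list of arms that were pulled.
    """
    counts = [0, 0]      # N^(i): number of times each arm was pulled
    sums = [0.0, 0.0]    # cumulative reward of each arm
    choices = []
    for t in range(1, n + 1):
        if t <= 2:
            arm = t - 1  # pull each arm once to initialize the averages
        else:
            # index = empirical mean + exploration bonus sqrt(2 log t / N^(i))
            index = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                     for i in (0, 1)]
            arm = 0 if index[0] >= index[1] else 1
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
    return choices

# Illustrative static environment: Bernoulli rewards with constant means 0.5 and 0.6.
random.seed(0)
means = [0.5, 0.6]
choices = ucb_two_armed(lambda i: float(random.random() < means[i]), n=10_000)
print("fraction of pulls of the better arm:", choices.count(1) / len(choices))
```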

  10–12. Side information (figure slides).

  13. Side information and covariates
  • At time $t$, the reward of each arm $i \in \{1, 2\}$ depends on a covariate $X_t \in \mathcal{X} \subset \mathbb{R}^d$: $Y_t^{(i)} = f^{(i)}(X_t) + \varepsilon_t$, $t = 1, 2, \ldots$, $i = 1, 2$, with standard regression assumptions on $\{\varepsilon_t\}$.
  • A policy is now a sequence of functions $\pi_t : \mathcal{X} \to \{1, 2\}$.
  • Oracle policy: $\pi^\star(x) = \operatorname{argmax}_{i=1,2} \mathbb{E}\bigl[Y_t^{(i)} \mid X_t = x\bigr] = \operatorname{argmax}_{i=1,2} f^{(i)}(x)$.

  14. Assumptions on the expected rewards
  Assume now that $\mathcal{X} = [0, 1]$.
  1. Constant: static model studied by Lai & Robbins: $f^{(i)}(x) = \mu_i$, $i = 1, 2$, with $\mu_i$ unknown.
  2. Linear: one-armed bandit problem, studied by Goldenshluger & Zeevi (2008): $f^{(1)}(x) = x - \theta$ with $\theta$ unknown, and $f^{(2)}(x) = 0$ constant and known.
  3. Smooth: we assume that the functions are Hölder smooth with parameter $\beta \le 1$: $|f^{(i)}(x) - f^{(i)}(x')| \le L |x - x'|^\beta$. (Consistency studied by Yang & Zhu, 2002.)

  15. Constant rewards (figure: constant $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$).

  16. One-armed linear reward (figure: linear $f^{(1)}$ and $f^{(2)} \equiv 0$ on $[0, 1]$).

  17. Smooth rewards (figure: smooth $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$).

  18. Nonparametric bandit with covariates

  19. Two-armed bandit problem with uniform covariates
  • Covariates: $\{X_t\}$ i.i.d. uniform on $[0, 1]$.
  • Rewards: $Y_t^{(i)} \in [0, 1]$ with $\mathbb{E}\bigl[Y_t^{(i)} \mid X_t\bigr] = f^{(i)}(X_t)$, $t = 1, 2, \ldots$, $i = 1, 2$, where $|f^{(i)}(x) - f^{(i)}(x')| \le L |x - x'|^\beta$, $\beta \le 1$, $i = 1, 2$.
  • The oracle policy pulls at time $t$: $\pi^\star(X_t) = \operatorname{argmax}_{i=1,2} f^{(i)}(X_t)$.
  • Regret: $R_n(\pi) = \mathbb{E}\bigl[\sum_{t=1}^{n} \bigl( f^{(\pi^\star(X_t))}(X_t) - f^{(\pi_t(X_t))}(X_t) \bigr)\bigr]$.
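  A small simulation sketch of this setup can be useful for sanity-checking policies against the regret defined above. The particular Lipschitz mean functions and the Bernoulli reward model are assumptions chosen for illustration; any policy object exposing choose(t, x) and update(x, arm, y) can be plugged in.

```python
import math
import random

# Illustrative Lipschitz (beta = 1) mean reward functions on [0, 1], both valued in [0, 1].
f = [lambda x: 0.5 + 0.3 * math.sin(2 * math.pi * x),   # f^(1)
     lambda x: 0.5 + 0.3 * (x - 0.5)]                    # f^(2)

def run(policy, n, seed=0):
    """Simulate n rounds of the covariate bandit and return the cumulative regret.

    policy.choose(t, x) returns an arm in {0, 1}; policy.update(x, arm, y) observes the reward.
    """
    rng = random.Random(seed)
    regret = 0.0
    for t in range(1, n + 1):
        x = rng.random()                              # X_t ~ Uniform[0, 1]
        arm = policy.choose(t, x)
        y = float(rng.random() < f[arm](x))           # Bernoulli reward with mean f^(arm)(x)
        policy.update(x, arm, y)
        regret += max(f[0](x), f[1](x)) - f[arm](x)   # oracle mean minus chosen mean
    return regret

class AlwaysArm:
    """Baseline that ignores the covariate and always pulls the same arm."""
    def __init__(self, arm):
        self.arm = arm
    def choose(self, t, x):
        return self.arm
    def update(self, x, arm, y):
        pass

print(run(AlwaysArm(0), n=5000))   # a constant policy incurs regret growing linearly in n
```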

  20. Margin condition
  $\mathbb{P}\bigl(0 < |f^{(1)}(X) - f^{(2)}(X)| \le \delta\bigr) \le C \delta^\alpha$.
  • First used by Goldenshluger and Zeevi (2008) in the one-armed bandit setting.
  • In the one-armed setup, it is an assumption on the distribution of $X$ only.
  • Here the marginal is fixed (e.g. uniform), so the condition measures how close the two functions are.

  21. Margin condition (continued)
  Proposition (conflict between $\alpha$ and $\beta$): if $\alpha\beta > 1$, then $\pi^\star$ is a.s. constant on $[0, 1]$.
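  A concrete instance (not on the slides) may help: take $X$ uniform on $[0, 1]$, $f^{(1)}(x) = x$ and $f^{(2)}(x) = 1/2$, both functions chosen here purely for illustration. Then

\[
  \mathbb{P}\bigl(0 < |f^{(1)}(X) - f^{(2)}(X)| \le \delta\bigr)
  = \mathbb{P}\bigl(0 < |X - \tfrac12| \le \delta\bigr)
  = \min(2\delta, 1) \le 2\delta ,
\]

  so the margin condition holds with $\alpha = 1$ and $C = 2$. Both functions are Lipschitz ($\beta = 1$), hence $\alpha\beta = 1$, and consistently with the proposition the oracle policy is not constant: it switches arms at $x = 1/2$.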

  22. Illustration of the margin condition (figure: $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$).

  23. Illustration of the margin condition, $\alpha = 1$ (figure: $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$).

  24. Illustration of the margin condition (figure: $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$, with values of $\alpha$ and $\beta$ annotated).

  25. Binning (exploiting smoothness)
  • Fix $M > 1$ and consider the bins $B_j = [\,j/M, (j+1)/M\,)$.
  • Consider the average reward on each bin, $\bar f_j^{(i)} = \frac{1}{p_j} \int_{B_j} f^{(i)}(x)\,\mathrm{d}x$, and set $Z_t = j$ iff $X_t \in B_j$.
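  A sketch of the binning step, assuming a uniform covariate on $[0, 1]$; the helper names and the formula used to pick $M$ (taken from Theorem 2 later in the talk, up to constants) are illustrative.

```python
import math

def bin_index(x, M):
    """Return the bin j such that x lies in B_j = [j/M, (j+1)/M)."""
    return min(int(M * x), M - 1)   # clamp so that x = 1.0 falls into the last bin

def number_of_bins(n, beta):
    """Pick M ~ (n / log n)^(1 / (2*beta + 1)), the choice used in Theorem 2 (up to constants)."""
    return max(1, round((n / math.log(n)) ** (1.0 / (2 * beta + 1))))

# Example: n = 10_000 rounds and beta = 1 (Lipschitz rewards) give roughly 10 bins.
print(number_of_bins(10_000, beta=1.0))
```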

  26. Binned UCB
  • For uniformly distributed $X_t$ we have $p_j = \mathbb{P}(Z_t = j) = \mathbb{P}(X_t \in B_j) = 1/M$.
  • The rewards satisfy $\mathbb{E}\bigl[Y_t^{(i)} \mid Z_t = j\bigr] = \bar f_j^{(i)}$, $t = 1, 2, \ldots$, $i = 1, 2$.
  • Play UCB on $(Z_t, Y_t)$, $t = 1, \ldots, n$.
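  One step left implicit here is why binning costs little: by the Hölder assumption, each binned mean stays uniformly close to the underlying mean function on its bin. A short verification (for the uniform design, where $p_j = 1/M$):

\[
  \bigl| f^{(i)}(x) - \bar f^{(i)}_j \bigr|
  = \Bigl| \tfrac{1}{p_j} \int_{B_j} \bigl( f^{(i)}(x) - f^{(i)}(x') \bigr)\,\mathrm{d}x' \Bigr|
  \le \sup_{x' \in B_j} L\,|x - x'|^{\beta}
  \le L\,M^{-\beta}
  \qquad \text{for every } x \in B_j .
\]

  This $L M^{-\beta}$ approximation error is the bias that is traded off against the per-bin estimation error when choosing $M$ in Theorem 2.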

  27–30. Binned problem (figures: $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$ and their binned averages $\bar f^{(1)}$, $\bar f^{(2)}$).

  31. Two-armed bandit problem with discrete covariates
  • Covariates: $\{Z_t\}$ i.i.d. in $\{1, \ldots, M\}$ with $\mathbb{P}(Z_t = j) = p_j$, $t = 1, 2, \ldots$
  • Rewards: $Y_t^{(i)} \in [0, 1]$ with $\mathbb{E}\bigl[Y_t^{(i)} \mid Z_t = j\bigr] = \bar f_j^{(i)}$, $t = 1, 2, \ldots$, $i = 1, 2$.
  • The oracle policy pulls at time $t$: $\pi^\star(Z_t) = \operatorname{argmax}_{i=1,2} \bar f_{Z_t}^{(i)}$.

  32. Regret
  • The regret is given by $R_n(\pi) = \mathbb{E}\bigl[\sum_{j=1}^{M} \sum_{t=1}^{n} \bigl( \bar f_j^{(\pi^\star(j))} - \bar f_j^{(\pi_t(j))} \bigr) \mathbf{1}(Z_t = j)\bigr]$.

  33. Regret (continued)
  • Idea: play independently for each $j = 1, \ldots, M$.

  34. UCB policy for a discrete covariate
  • Based on upper confidence bounds given by concentration inequalities (Hoeffding or Bernstein): $B_t(s) := \sqrt{2 \log t / s}$.
  • Define the number of times $\hat\pi$ prescribed to pull arm $i$ while $Z_t = j$, before time $t$: $N_j^{(i)}(t) = \sum_{s=1}^{t} \mathbf{1}(Z_s = j,\ \hat\pi_s(Z_s) = i)$.
  • Average reward collected at those times: $\bar Y_j^{(i)}(t) = \frac{1}{N_j^{(i)}(t)} \sum_{s=1}^{t} Y_s^{(i)} \mathbf{1}(Z_s = j,\ \hat\pi_s(Z_s) = i)$.
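  Putting the pieces together, here is a minimal sketch of the binned UCB policy built from the quantities $N_j^{(i)}(t)$, $\bar Y_j^{(i)}(t)$ and $B_t(s)$ defined above. The class name, the two-arm restriction and the choose/update interface are assumptions made to match the simulation sketch given after slide 19.

```python
import math

class BinnedUCB:
    """UCB run separately in each of M bins, with the index
    Ybar_j^(i)(t) + sqrt(2 * log(t) / N_j^(i)(t))."""

    def __init__(self, M):
        self.M = M
        self.counts = [[0, 0] for _ in range(M)]    # N_j^(i)(t): pulls of arm i in bin j
        self.sums = [[0.0, 0.0] for _ in range(M)]  # cumulative reward of arm i in bin j

    def _bin(self, x):
        return min(int(self.M * x), self.M - 1)     # B_j = [j/M, (j+1)/M), clamping x = 1.0

    def choose(self, t, x):
        j = self._bin(x)
        for i in (0, 1):                            # pull each arm once per bin to initialize
            if self.counts[j][i] == 0:
                return i
        index = [self.sums[j][i] / self.counts[j][i]
                 + math.sqrt(2.0 * math.log(t) / self.counts[j][i]) for i in (0, 1)]
        return 0 if index[0] >= index[1] else 1

    def update(self, x, arm, reward):
        j = self._bin(x)
        self.counts[j][arm] += 1
        self.sums[j][arm] += reward
```

  With the (hypothetical) simulation sketch given after slide 19, `run(BinnedUCB(M=10), n=5000)` returns the regret of this policy, which can then be compared with the constant baseline.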

  35. A first bound on the regret
  • Binned UCB policy: conditionally on $Z_t = j$, $\hat\pi_t(j) = \operatorname{argmax}_{i=1,2} \bigl[\, \bar Y_j^{(i)}(t) + B_t\bigl(N_j^{(i)}(t)\bigr) \bigr]$.
  Theorem 1 (a first bound on the regret). Denote $\Delta_j = |\bar f_j^{(1)} - \bar f_j^{(2)}|$. Then
  $R_n(\hat\pi) \le C \sum_{j=1}^{M} \Bigl( \Delta_j + \frac{\log n}{\Delta_j} \Bigr)$.
  This is a direct consequence of Auer, Cesa-Bianchi & Fischer (2002).

  36. Margin condition
  $\sum_{j=1}^{M} \Bigl( \Delta_j + \frac{\log n}{\Delta_j} \Bigr)$
  • The previous bound can become arbitrarily large if one of the $\Delta_j$, $j = 1, \ldots, M$, becomes too small.
  • Using the margin condition we can draw local conclusions on the gaps $\Delta_j$: there are few $j$'s such that $\Delta_j$ is small.

  37. Upper bound
  Theorem 2 (a bound on the regret of the binned UCB policy). Fix $\alpha > 0$ and $0 < \beta \le 1$, and choose $M \sim (n / \log n)^{\frac{1}{2\beta+1}}$. Then
  $R_n(\hat\pi) \le C n \bigl( n / \log n \bigr)^{-\frac{\beta(1+\alpha)}{2\beta+1}}$ if $\alpha < 1$, and $R_n(\hat\pi) \le C n \bigl( n / \log n \bigr)^{-\frac{2\beta}{2\beta+1}}$ if $\alpha > 1$.

  38. Suboptimality for $\alpha > 1$
  • If $\alpha > 1$, the bound becomes $R_n(\hat\pi) \le C \bigl( n M^{-\beta(1+\alpha)} + M \log n \bigr)$.
  • This is minimized for $M \sim (n / \log n)^{\frac{1}{\beta(1+\alpha)+1}}$,
  • which yields $R_n(\hat\pi) \le C n \bigl( n / \log n \bigr)^{-\frac{\beta(1+\alpha)}{\beta(1+\alpha)+1}}$.
  • Problem: too many bins. Solution: online/adaptive construction of the bins.
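  The balancing step behind this choice of $M$ is elementary but worth writing out: setting $n M^{-\beta(1+\alpha)} \asymp M \log n$ gives $M^{\beta(1+\alpha)+1} \asymp n / \log n$, i.e.

\[
  M \asymp \Bigl(\frac{n}{\log n}\Bigr)^{\frac{1}{\beta(1+\alpha)+1}},
  \qquad\text{hence}\qquad
  n M^{-\beta(1+\alpha)} \asymp M \log n
  \asymp n \Bigl(\frac{n}{\log n}\Bigr)^{-\frac{\beta(1+\alpha)}{\beta(1+\alpha)+1}} ,
\]

  which is the rate displayed above.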

  39. Conditional distributions
  • The distribution of $Y^{(i)} \mid X$ belongs to $\mathcal{P} = \{P_\theta,\ \theta \in \Theta\}$, where $\theta$ is the mean parameter: $\theta = \int x \,\mathrm{d}P_\theta(x)$.
  • Assume that the family $\mathcal{P}$ is such that $K(P_\theta, P_{\theta'}) \le (\theta - \theta')^2 / \kappa$ for some $\kappa > 0$ and any $\theta, \theta' \in \Theta \subset \mathbb{R}$.
  • This is satisfied in particular for Gaussian (location) and Bernoulli families.

  40. Minimax lower bound
  Theorem 3. Let $\alpha\beta \le 1$ and let the covariates $\{X_t\}$ be uniformly distributed on $[0, 1]^d$. Assume also that $\{P^{(i)}_\theta,\ \theta \in \operatorname{Im} f^{(i)}\}$ satisfies the Kullback–Leibler condition above for each $i = 1, 2$. Then, for any policy $\pi$,
  $\sup_{f^{(1)}, f^{(2)} \in \Sigma(\beta, L)} R_n(\pi) \ge C n \cdot n^{-\frac{\beta(1+\alpha)}{2\beta+1}}$
  for some positive constant $C$.

  41. Comments
  • Same bound as in the full-information case (see Audibert & Tsybakov, 2007).
  • There is a gap (of logarithmic size) between the upper and lower bounds.

  42. Extensions
  • Higher dimension $d \ge 2$ (choose $\|\cdot\|_\infty$): $R_n(\hat\pi) \le C(d)\, n \bigl( n / \log n \bigr)^{-\frac{\beta(1+\alpha)}{2\beta+d}}$.
  • The lower bound also holds in this case.
  • Unknown horizon $n$: doubling trick (sketched below).
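  For the unknown-horizon case, the doubling trick mentioned on the slide restarts a horizon-tuned policy on epochs of geometrically growing length. A generic sketch follows; the make_policy and play_round callables are assumed interfaces, not anything specified in the talk.

```python
def doubling_trick(make_policy, play_round):
    """Run a horizon-dependent policy without knowing the total number of rounds.

    make_policy(horizon) builds a fresh policy tuned for a known horizon
    (e.g. it can set M ~ (horizon / log horizon)^(1 / (2*beta + 1))).
    play_round(policy, t) plays one global round t and returns False when the
    stream of rounds ends.  Epoch k lasts 2**k rounds, so each restart knows
    its own local horizon, and the epochs played up to round n total fewer
    than 2n rounds.
    """
    t = 0
    k = 0
    while True:
        horizon = 2 ** k
        policy = make_policy(horizon)
        for _ in range(horizon):
            t += 1
            if not play_round(policy, t):
                return
        k += 1
```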

  43. $K$-armed bandit
  • $K$-armed bandit problem: the margin condition becomes $\mathbb{P}\bigl(0 < \min_{i \ne i^\star(X)} |f^{(i)}(X) - f^{(i^\star(X))}(X)| \le \delta\bigr) \le C \delta^\alpha$, where $i^\star(x) = \operatorname{argmax}_{1 \le i \le K} f^{(i)}(x)$.
  • $R_n(\hat\pi) \le C K n \bigl( n / \log n \bigr)^{-\frac{\beta(1+\alpha)}{2\beta+1}}$.
