Nonparametric Bandits with Covariates
  1. Nonparametric Bandits with Covariates. Philippe Rigollet (Princeton University), with A. Zeevi (Columbia University). Support from NSF (DMS-0906424).

  2–4. Example: Real-time web page optimization. Which ad will generate the most $ / clicks?

  5. Characteristics of the problem
  • A choice must be made for each customer.
  • The outcome of the alternative choice cannot be observed.
  • Try to maximize the rewards.
  Exploration vs. exploitation dilemma:
  • Exploration: which one is the best?
  • Exploitation: display the best one as much as possible.

  6. Two-armed bandit problem: setup
  • Two arms (e.g. actions, ads): $i \in \{1, 2\}$.
  • At time $t$, a random reward $Y_t^{(i)}$ is observed when arm $i$ is pulled.
  • A policy $\pi$ is a sequence $\pi_1, \pi_2, \ldots \in \{1, 2\}$ indicating which arm to pull at each time $t$.
  • Performance: expected cumulative reward at time $n$, $\mathbb{E}\bigl[\sum_{t=1}^{n} Y_t^{(\pi_t)}\bigr]$.
  • Goal: maximize the reward.

  7. Two-armed bandit problem: regret
  • The oracle policy $\pi^\star = (\pi^\star_1, \pi^\star_2, \ldots)$ pulls at each time $t$ the best arm (in expectation): $\pi^\star_t = \operatorname{argmax}_{i=1,2} \mathbb{E}\bigl[Y_t^{(i)}\bigr]$.
  • We measure our performance by the regret $R_n(\pi) = \mathbb{E}\bigl[\sum_{t=1}^{n} \bigl(Y_t^{(\pi^\star_t)} - Y_t^{(\pi_t)}\bigr)\bigr]$.

  8. Static Environment
  • The problem is not new: Robbins ('52), Lai & Robbins ('85).

  9. Static Environment (continued)
  • Key assumption: static environment, i.e. the (unknown) expected rewards $\mu_i = \mathbb{E}\bigl[Y_t^{(i)}\bigr]$ are constant.
  • One way to solve the problem is to use an Upper Confidence Bounds (UCB) policy (see the sketch below).
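  To make the UCB idea concrete, here is a minimal sketch for the static two-armed case, using the Hoeffding-style index (empirical mean plus $\sqrt{2 \log t / N^{(i)}(t)}$) that reappears later as $B_t(s)$ for the binned policy. The Bernoulli reward model and all names are illustrative assumptions, not taken from the slides.

```python
import math
import random

def ucb_two_armed(pull, n):
    """Hoeffding-based UCB for two arms over n rounds.

    pull(i) must return a reward in [0, 1] for arm i in {0, 1}.
    Returns the list of arms that were pulled.
    """
    counts = [0, 0]      # N^(i): number of times each arm was pulled
    sums = [0.0, 0.0]    # cumulative reward of each arm
    choices = []
    for t in range(1, n + 1):
        if t <= 2:
            arm = t - 1  # pull each arm once to initialize the averages
        else:
            # index = empirical mean + exploration bonus sqrt(2 log t / N^(i))
            index = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                     for i in (0, 1)]
            arm = 0 if index[0] >= index[1] else 1
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
    return choices

# Illustrative static environment: Bernoulli rewards with constant means 0.5 and 0.6.
random.seed(0)
means = [0.5, 0.6]
choices = ucb_two_armed(lambda i: float(random.random() < means[i]), n=10_000)
print("fraction of pulls of the better arm:", choices.count(1) / len(choices))
```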

  10–12. Side information (figure slides).

  13. Side information and covariates
  • At time $t$, the reward of each arm $i \in \{1, 2\}$ depends on a covariate $X_t \in \mathcal{X} \subset \mathbb{R}^d$: $Y_t^{(i)} = f^{(i)}(X_t) + \varepsilon_t$, $t = 1, 2, \ldots$, $i = 1, 2$, with standard regression assumptions on $\{\varepsilon_t\}$.
  • A policy is now a sequence of functions $\pi_t : \mathcal{X} \to \{1, 2\}$.
  • Oracle policy: $\pi^\star(x) = \operatorname{argmax}_{i=1,2} \mathbb{E}\bigl[Y_t^{(i)} \mid X_t = x\bigr] = \operatorname{argmax}_{i=1,2} f^{(i)}(x)$.

  14. Assumptions on the expected rewards
  Assume now that $\mathcal{X} = [0, 1]$.
  1. Constant: static model studied by Lai & Robbins: $f^{(i)}(x) = \mu_i$, $i = 1, 2$, with $\mu_i$ unknown.
  2. Linear: one-armed bandit problem, studied by Goldenshluger & Zeevi (2008): $f^{(1)}(x) = x - \theta$ with $\theta$ unknown, and $f^{(2)}(x) = 0$ constant and known.
  3. Smooth: we assume that the functions are Hölder smooth with parameter $\beta \le 1$: $|f^{(i)}(x) - f^{(i)}(x')| \le L |x - x'|^\beta$. (Consistency studied by Yang & Zhu, 2002.)

  15. Constant rewards (figure: constant $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$).

  16. One-armed linear reward (figure: linear $f^{(1)}$ and $f^{(2)} \equiv 0$ on $[0, 1]$).

  17. Smooth rewards (figure: smooth $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$).

  18. Nonparametric bandit with covariates

  19. Two-armed bandit problem with uniform covariates
  • Covariates: $\{X_t\}$ i.i.d. uniform on $[0, 1]$.
  • Rewards: $Y_t^{(i)} \in [0, 1]$ with $\mathbb{E}\bigl[Y_t^{(i)} \mid X_t\bigr] = f^{(i)}(X_t)$, $t = 1, 2, \ldots$, $i = 1, 2$, where $|f^{(i)}(x) - f^{(i)}(x')| \le L |x - x'|^\beta$, $\beta \le 1$, $i = 1, 2$.
  • The oracle policy pulls at time $t$: $\pi^\star(X_t) = \operatorname{argmax}_{i=1,2} f^{(i)}(X_t)$.
  • Regret: $R_n(\pi) = \mathbb{E}\bigl[\sum_{t=1}^{n} \bigl( f^{(\pi^\star(X_t))}(X_t) - f^{(\pi_t(X_t))}(X_t) \bigr)\bigr]$.
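  A small simulation sketch of this setup can be useful for sanity-checking policies against the regret defined above. The particular Lipschitz mean functions and the Bernoulli reward model are assumptions chosen for illustration; any policy object exposing choose(t, x) and update(x, arm, y) can be plugged in.

```python
import math
import random

# Illustrative Lipschitz (beta = 1) mean reward functions on [0, 1], both valued in [0, 1].
f = [lambda x: 0.5 + 0.3 * math.sin(2 * math.pi * x),   # f^(1)
     lambda x: 0.5 + 0.3 * (x - 0.5)]                    # f^(2)

def run(policy, n, seed=0):
    """Simulate n rounds of the covariate bandit and return the cumulative regret.

    policy.choose(t, x) returns an arm in {0, 1}; policy.update(x, arm, y) observes the reward.
    """
    rng = random.Random(seed)
    regret = 0.0
    for t in range(1, n + 1):
        x = rng.random()                              # X_t ~ Uniform[0, 1]
        arm = policy.choose(t, x)
        y = float(rng.random() < f[arm](x))           # Bernoulli reward with mean f^(arm)(x)
        policy.update(x, arm, y)
        regret += max(f[0](x), f[1](x)) - f[arm](x)   # oracle mean minus chosen mean
    return regret

class AlwaysArm:
    """Baseline that ignores the covariate and always pulls the same arm."""
    def __init__(self, arm):
        self.arm = arm
    def choose(self, t, x):
        return self.arm
    def update(self, x, arm, y):
        pass

print(run(AlwaysArm(0), n=5000))   # a constant policy incurs regret growing linearly in n
```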

  20. Margin condition
  $\mathbb{P}\bigl(0 < |f^{(1)}(X) - f^{(2)}(X)| \le \delta\bigr) \le C \delta^\alpha$.
  • First used by Goldenshluger and Zeevi (2008) in the one-armed bandit setting.
  • In the one-armed setup, it is an assumption on the distribution of $X$ only.
  • Here the marginal is fixed (e.g. uniform), so the condition measures how close the two functions are.

  21. Margin condition (continued)
  Proposition (conflict between $\alpha$ and $\beta$): if $\alpha\beta > 1$, then $\pi^\star$ is a.s. constant on $[0, 1]$.
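  A concrete instance (not on the slides) may help: take $X$ uniform on $[0, 1]$, $f^{(1)}(x) = x$ and $f^{(2)}(x) = 1/2$, both functions chosen here purely for illustration. Then

\[
  \mathbb{P}\bigl(0 < |f^{(1)}(X) - f^{(2)}(X)| \le \delta\bigr)
  = \mathbb{P}\bigl(0 < |X - \tfrac12| \le \delta\bigr)
  = \min(2\delta, 1) \le 2\delta ,
\]

  so the margin condition holds with $\alpha = 1$ and $C = 2$. Both functions are Lipschitz ($\beta = 1$), hence $\alpha\beta = 1$, and consistently with the proposition the oracle policy is not constant: it switches arms at $x = 1/2$.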

  22. Illustration of the margin condition (figure: $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$).

  23. Illustration of the margin condition, $\alpha = 1$ (figure: $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$).

  24. Illustration of the margin condition (figure: $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$, with values of $\alpha$ and $\beta$ annotated).

  25. Binning (exploiting smoothness)
  • Fix $M > 1$ and consider the bins $B_j = [\,j/M, (j+1)/M\,)$.
  • Consider the average reward on each bin, $\bar f_j^{(i)} = \frac{1}{p_j} \int_{B_j} f^{(i)}(x)\,\mathrm{d}x$, and set $Z_t = j$ iff $X_t \in B_j$.
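  A sketch of the binning step, assuming a uniform covariate on $[0, 1]$; the helper names and the formula used to pick $M$ (taken from Theorem 2 later in the talk, up to constants) are illustrative.

```python
import math

def bin_index(x, M):
    """Return the bin j such that x lies in B_j = [j/M, (j+1)/M)."""
    return min(int(M * x), M - 1)   # clamp so that x = 1.0 falls into the last bin

def number_of_bins(n, beta):
    """Pick M ~ (n / log n)^(1 / (2*beta + 1)), the choice used in Theorem 2 (up to constants)."""
    return max(1, round((n / math.log(n)) ** (1.0 / (2 * beta + 1))))

# Example: n = 10_000 rounds and beta = 1 (Lipschitz rewards) give roughly 10 bins.
print(number_of_bins(10_000, beta=1.0))
```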

  26. Binned UCB
  • For uniformly distributed $X_t$ we have $p_j = \mathbb{P}(Z_t = j) = \mathbb{P}(X_t \in B_j) = 1/M$.
  • The rewards satisfy $\mathbb{E}\bigl[Y_t^{(i)} \mid Z_t = j\bigr] = \bar f_j^{(i)}$, $t = 1, 2, \ldots$, $i = 1, 2$.
  • Play UCB on $(Z_t, Y_t)$, $t = 1, \ldots, n$.
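  One step left implicit here is why binning costs little: by the Hölder assumption, each binned mean stays uniformly close to the underlying mean function on its bin. A short verification (for the uniform design, where $p_j = 1/M$):

\[
  \bigl| f^{(i)}(x) - \bar f^{(i)}_j \bigr|
  = \Bigl| \tfrac{1}{p_j} \int_{B_j} \bigl( f^{(i)}(x) - f^{(i)}(x') \bigr)\,\mathrm{d}x' \Bigr|
  \le \sup_{x' \in B_j} L\,|x - x'|^{\beta}
  \le L\,M^{-\beta}
  \qquad \text{for every } x \in B_j .
\]

  This $L M^{-\beta}$ approximation error is the bias that is traded off against the per-bin estimation error when choosing $M$ in Theorem 2.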

  27–30. Binned problem (figures: $f^{(1)}$ and $f^{(2)}$ on $[0, 1]$ and their binned averages $\bar f^{(1)}$, $\bar f^{(2)}$).

  31. Two-armed bandit problem with discrete covariates
  • Covariates: $\{Z_t\}$ i.i.d. in $\{1, \ldots, M\}$ with $\mathbb{P}(Z_t = j) = p_j$, $t = 1, 2, \ldots$
  • Rewards: $Y_t^{(i)} \in [0, 1]$ with $\mathbb{E}\bigl[Y_t^{(i)} \mid Z_t = j\bigr] = \bar f_j^{(i)}$, $t = 1, 2, \ldots$, $i = 1, 2$.
  • The oracle policy pulls at time $t$: $\pi^\star(Z_t) = \operatorname{argmax}_{i=1,2} \bar f_{Z_t}^{(i)}$.

  32. Regret
  • The regret is given by $R_n(\pi) = \mathbb{E}\bigl[\sum_{j=1}^{M} \sum_{t=1}^{n} \bigl( \bar f_j^{(\pi^\star(j))} - \bar f_j^{(\pi_t(j))} \bigr) \mathbf{1}(Z_t = j)\bigr]$.

  33. Regret (continued)
  • Idea: play independently for each $j = 1, \ldots, M$.

  34. UCB policy for a discrete covariate
  • Based on upper confidence bounds given by concentration inequalities (Hoeffding or Bernstein): $B_t(s) := \sqrt{2 \log t / s}$.
  • Define the number of times $\hat\pi$ prescribed to pull arm $i$ while $Z_t = j$, before time $t$: $N_j^{(i)}(t) = \sum_{s=1}^{t} \mathbf{1}(Z_s = j,\ \hat\pi_s(Z_s) = i)$.
  • Average reward collected at those times: $\bar Y_j^{(i)}(t) = \frac{1}{N_j^{(i)}(t)} \sum_{s=1}^{t} Y_s^{(i)} \mathbf{1}(Z_s = j,\ \hat\pi_s(Z_s) = i)$.
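  Putting the pieces together, here is a minimal sketch of the binned UCB policy built from the quantities $N_j^{(i)}(t)$, $\bar Y_j^{(i)}(t)$ and $B_t(s)$ defined above. The class name, the two-arm restriction and the choose/update interface are assumptions made to match the simulation sketch given after slide 19.

```python
import math

class BinnedUCB:
    """UCB run separately in each of M bins, with the index
    Ybar_j^(i)(t) + sqrt(2 * log(t) / N_j^(i)(t))."""

    def __init__(self, M):
        self.M = M
        self.counts = [[0, 0] for _ in range(M)]    # N_j^(i)(t): pulls of arm i in bin j
        self.sums = [[0.0, 0.0] for _ in range(M)]  # cumulative reward of arm i in bin j

    def _bin(self, x):
        return min(int(self.M * x), self.M - 1)     # B_j = [j/M, (j+1)/M), clamping x = 1.0

    def choose(self, t, x):
        j = self._bin(x)
        for i in (0, 1):                            # pull each arm once per bin to initialize
            if self.counts[j][i] == 0:
                return i
        index = [self.sums[j][i] / self.counts[j][i]
                 + math.sqrt(2.0 * math.log(t) / self.counts[j][i]) for i in (0, 1)]
        return 0 if index[0] >= index[1] else 1

    def update(self, x, arm, reward):
        j = self._bin(x)
        self.counts[j][arm] += 1
        self.sums[j][arm] += reward
```

  With the (hypothetical) simulation sketch given after slide 19, `run(BinnedUCB(M=10), n=5000)` returns the regret of this policy, which can then be compared with the constant baseline.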

  35. A first bound on the regret
  • Binned UCB policy: conditionally on $Z_t = j$, $\hat\pi_t(j) = \operatorname{argmax}_{i=1,2} \bigl[\, \bar Y_j^{(i)}(t) + B_t\bigl(N_j^{(i)}(t)\bigr) \bigr]$.
  Theorem 1 (a first bound on the regret). Denote $\Delta_j = |\bar f_j^{(1)} - \bar f_j^{(2)}|$. Then
  $R_n(\hat\pi) \le C \sum_{j=1}^{M} \Bigl( \Delta_j + \frac{\log n}{\Delta_j} \Bigr)$.
  This is a direct consequence of Auer, Cesa-Bianchi & Fischer (2002).

  36. Margin condition
  $\sum_{j=1}^{M} \Bigl( \Delta_j + \frac{\log n}{\Delta_j} \Bigr)$
  • The previous bound can become arbitrarily large if one of the $\Delta_j$, $j = 1, \ldots, M$, becomes too small.
  • Using the margin condition we can draw local conclusions on the gaps $\Delta_j$: there are few $j$'s such that $\Delta_j$ is small.

  37. Upper bound
  Theorem 2 (a bound on the regret of the binned UCB policy). Fix $\alpha > 0$ and $0 < \beta \le 1$, and choose $M \sim (n / \log n)^{\frac{1}{2\beta+1}}$. Then
  $R_n(\hat\pi) \le C n \bigl( n / \log n \bigr)^{-\frac{\beta(1+\alpha)}{2\beta+1}}$ if $\alpha < 1$, and $R_n(\hat\pi) \le C n \bigl( n / \log n \bigr)^{-\frac{2\beta}{2\beta+1}}$ if $\alpha > 1$.

  38. Suboptimality for $\alpha > 1$
  • If $\alpha > 1$, the bound becomes $R_n(\hat\pi) \le C \bigl( n M^{-\beta(1+\alpha)} + M \log n \bigr)$.
  • This is minimized for $M \sim (n / \log n)^{\frac{1}{\beta(1+\alpha)+1}}$,
  • which yields $R_n(\hat\pi) \le C n \bigl( n / \log n \bigr)^{-\frac{\beta(1+\alpha)}{\beta(1+\alpha)+1}}$.
  • Problem: too many bins. Solution: online/adaptive construction of the bins.
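  The balancing step behind this choice of $M$ is elementary but worth writing out: setting $n M^{-\beta(1+\alpha)} \asymp M \log n$ gives $M^{\beta(1+\alpha)+1} \asymp n / \log n$, i.e.

\[
  M \asymp \Bigl(\frac{n}{\log n}\Bigr)^{\frac{1}{\beta(1+\alpha)+1}},
  \qquad\text{hence}\qquad
  n M^{-\beta(1+\alpha)} \asymp M \log n
  \asymp n \Bigl(\frac{n}{\log n}\Bigr)^{-\frac{\beta(1+\alpha)}{\beta(1+\alpha)+1}} ,
\]

  which is the rate displayed above.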

  39. Conditional distributions
  • The distribution of $Y^{(i)} \mid X$ belongs to $\mathcal{P} = \{P_\theta,\ \theta \in \Theta\}$, where $\theta$ is the mean parameter: $\theta = \int x \,\mathrm{d}P_\theta(x)$.
  • Assume that the family $\mathcal{P}$ is such that $K(P_\theta, P_{\theta'}) \le (\theta - \theta')^2 / \kappa$ for some $\kappa > 0$ and any $\theta, \theta' \in \Theta \subset \mathbb{R}$.
  • This is satisfied in particular for Gaussian (location) and Bernoulli families.

  40. Minimax lower bound
  Theorem 3. Let $\alpha\beta \le 1$ and let the covariates $\{X_t\}$ be uniformly distributed on $[0, 1]^d$. Assume also that $\{P^{(i)}_\theta,\ \theta \in \operatorname{Im} f^{(i)}\}$ satisfies the Kullback–Leibler condition above for each $i = 1, 2$. Then, for any policy $\pi$,
  $\sup_{f^{(1)}, f^{(2)} \in \Sigma(\beta, L)} R_n(\pi) \ge C n \cdot n^{-\frac{\beta(1+\alpha)}{2\beta+1}}$
  for some positive constant $C$.

  41. Comments
  • Same bound as in the full-information case (see Audibert & Tsybakov, 2007).
  • There is a gap (of logarithmic size) between the upper and lower bounds.

  42. Extensions
  • Higher dimension $d \ge 2$ (choose $\|\cdot\|_\infty$): $R_n(\hat\pi) \le C(d)\, n \bigl( n / \log n \bigr)^{-\frac{\beta(1+\alpha)}{2\beta+d}}$.
  • The lower bound also holds in this case.
  • Unknown horizon $n$: doubling trick (sketched below).
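  For the unknown-horizon case, the doubling trick mentioned on the slide restarts a horizon-tuned policy on epochs of geometrically growing length. A generic sketch follows; the make_policy and play_round callables are assumed interfaces, not anything specified in the talk.

```python
def doubling_trick(make_policy, play_round):
    """Run a horizon-dependent policy without knowing the total number of rounds.

    make_policy(horizon) builds a fresh policy tuned for a known horizon
    (e.g. it can set M ~ (horizon / log horizon)^(1 / (2*beta + 1))).
    play_round(policy, t) plays one global round t and returns False when the
    stream of rounds ends.  Epoch k lasts 2**k rounds, so each restart knows
    its own local horizon, and the epochs played up to round n total fewer
    than 2n rounds.
    """
    t = 0
    k = 0
    while True:
        horizon = 2 ** k
        policy = make_policy(horizon)
        for _ in range(horizon):
            t += 1
            if not play_round(policy, t):
                return
        k += 1
```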

  43. $K$-armed bandit
  • $K$-armed bandit problem: the margin condition becomes $\mathbb{P}\bigl(0 < \min_{i \ne i^\star(X)} |f^{(i)}(X) - f^{(i^\star(X))}(X)| \le \delta\bigr) \le C \delta^\alpha$, where $i^\star(x) = \operatorname{argmax}_{1 \le i \le K} f^{(i)}(x)$.
  • $R_n(\hat\pi) \le C K n \bigl( n / \log n \bigr)^{-\frac{\beta(1+\alpha)}{2\beta+1}}$.
