

  1. New Perspectives for Multi-Armed Bandits and Their Applications. Vianney Perchet, CMLA, ENS Paris-Saclay. Workshop Learning & Statistics, IHES, January 19, 2017.

  2. Motivations & Objectives

  3. Classical Examples of Bandit Problems
      – Size of data: n patients, each with some probability of getting cured
      – Choose one of two treatments to prescribe
      – Observations: patients cured or dead
      – 1) Inference: find the best treatment between the red and the blue
      – 2) Cumulative: save as many patients as possible

  4. Classical Examples of Bandit Problems
      – Size of data: n banners, each with some probability of being clicked
      – Choose one of two ads to display
      – Observations: banner clicked or ignored
      – 1) Inference: find the best ad between the red and the blue
      – 2) Cumulative: get as many clicks as possible

  5. Classical Examples of Bandit Problems
      – Size of data: n auctions, each with some expected revenue
      – Choose one of two strategies (bid / opt out) to follow
      – Observations: auction won or lost
      – 1) Inference: find the best strategy between the red and the blue
      – 2) Cumulative: win as many profitable auctions as possible

  6. Classical Examples of Bandit Problems
      – Size of data: n mails, each with some probability of being spam
      – Choose one of two actions: spam or ham
      – Observations: mail correctly or incorrectly classified
      – 1) Inference: find the best strategy between the red and the blue
      – 2) Cumulative: minimize the number of errors

  7.–20. Two-Armed Bandit (back to the clinical-trial example)
      – Patients arrive and are treated sequentially.
      – Save as many as possible.
      (Slides 7–20 step through successive frames of a figure illustrating the sequential treatment of the patients.)

  21. A bit of theory

  22. Stochastic Multi-Armed Bandit

  23. K-Armed Stochastic Bandit Problems
      – K actions i ∈ {1, …, K}; outcomes X^i_t ∈ ℝ, i.i.d., bounded or (sub-)Gaussian, e.g. X^i_1, X^i_2, … ∼ N(μ_i, 1)
      – Non-anticipative policy: π_t ∈ {1, …, K}, a function of the past observations X^{π_1}_1, X^{π_2}_2, …, X^{π_{t−1}}_{t−1}
      – Goal: maximize the expected reward E ∑_{t=1}^T X^{π_t}_t = ∑_{t=1}^T μ_{π_t}
      – Performance: cumulative regret R_T = max_i ∑_{t=1}^T μ_i − ∑_{t=1}^T μ_{π_t} = ∑_{i ≠ ⋆} ∆_i ∑_{t=1}^T 1{π_t = i}, with ∆_i = μ_⋆ − μ_i the "gap" or cost of error i.

  24. Most famous algorithm [Auer, Cesa-Bianchi, Fischer, '02]
      • UCB, "Upper Confidence Bound":
        π_{t+1} = arg max_i { X̄^i_t + √(2 log(t) / T_i(t)) },
        where T_i(t) = ∑_{s=1}^t 1{π_s = i} and X̄^i_t = (1/T_i(t)) ∑_{s ≤ t : π_s = i} X^i_s
      • Regret: E R_T ≲ ∑_k (log(T)/∆_k) ∧ T ∆_k
      • Worst case: E R_T ≲ sup_∆ (K log(T)/∆) ∧ T ∆ ≂ √(K T log(T))
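
For readers who want to experiment, here is a minimal Python sketch of the UCB rule on this slide, assuming Gaussian rewards with unit variance; the arm means passed in the example call are made up for illustration.

```python
import math
import random

def ucb(mus, T, sigma=1.0):
    """Minimal UCB-style simulation on K Gaussian arms (illustrative only)."""
    K = len(mus)
    counts = [0] * K            # T_i(t): number of pulls of arm i
    means = [0.0] * K           # empirical mean of arm i
    regret = 0.0
    best = max(mus)
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1           # pull each arm once to initialise
        else:
            # index = empirical mean + sqrt(2 log t / T_i(t))
            i = max(range(K),
                    key=lambda k: means[k] + math.sqrt(2 * math.log(t) / counts[k]))
        x = random.gauss(mus[i], sigma)           # observe reward of the chosen arm
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]    # online mean update
        regret += best - mus[i]                   # pseudo-regret increment Delta_i
    return regret

# Example: two arms with gap Delta = 0.2, horizon T = 10_000
print(ucb([0.5, 0.3], 10_000))
```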

  25. Ideas of proof
      π_{t+1} = arg max_i { X̄^i_t + √(2 log(t) / T_i(t)) }
      • 2-line proof:
        π_{t+1} = i ≠ ⋆ ⟺ X̄^⋆_t + √(2 log(t)/T_⋆(t)) ≤ X̄^i_t + √(2 log(t)/T_i(t))
        "⟹" ∆_i ≲ √(2 log(t)/T_i(t)), hence T_i(t) ≲ log(t)/∆_i^2
      • The implication "⟹" actually holds with overwhelming probability.
      • The number of mistakes on arm i grows as log(t)/∆_i^2; each mistake costs ∆_i.
        Regret at stage T ≲ ∑_i (log(T)/∆_i^2) × ∆_i ≂ ∑_i log(T)/∆_i
      • "Optimal": no algorithm can always have a regret smaller than ∑_i log(T)/∆_i.

  26. Other algorithms
      • MOSS [Audibert, Bubeck], a variant of UCB: R_T ≲ (K/∆_min) log(T ∆_min/K), worst case R_T ≤ √(T K)
      • ETC [Perchet, Rigollet], pulls arms in round robin then eliminates: R_T ≲ ∑_k log(T ∆_k)/∆_k, worst case R_T ≤ √(T K log(K))
      • Infinite number of actions x ∈ [0, 1]^d with ∆(x) 1-Lipschitz: discretize + UCB gives R_T ≲ T ε + √(T ε^{−d}) (≤ T^{2/3} when d = 1)
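
The "pull in round robin then eliminate" idea can be sketched as follows: keep pulling the surviving arms in round robin and drop an arm once its upper confidence bound falls below the lower confidence bound of the current leader. This is only an illustrative sketch, not the exact algorithm analyzed by Perchet and Rigollet; the confidence radius √(2 log(T)/n) is one common choice.

```python
import math
import random

def successive_elimination(mus, T, sigma=1.0):
    """Round-robin pulls with elimination of provably suboptimal arms (sketch)."""
    K = len(mus)
    alive = set(range(K))
    counts, means = [0] * K, [0.0] * K
    regret, best, t = 0.0, max(mus), 0
    while t < T:
        for i in list(alive):                      # one round-robin pass
            if t >= T:
                break
            x = random.gauss(mus[i], sigma)
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]
            regret += best - mus[i]
            t += 1
        if len(alive) > 1:
            # eliminate arms whose UCB falls below the leader's LCB
            rad = lambda i: math.sqrt(2 * math.log(T) / counts[i])
            leader = max(alive, key=lambda i: means[i] - rad(i))
            lcb = means[leader] - rad(leader)
            alive = {i for i in alive if means[i] + rad(i) >= lcb}
    return regret

print(successive_elimination([0.5, 0.3, 0.1], 10_000))
```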

  27. Very interesting... but useful? No... Here is a list of reasons.

  28. On the basic assumptions
      1. Stochastic: data are not i.i.d., patients are different (ill-posedness, feature selection/model selection)
      2. Different timing: several actions for one reward (POMDPs, bias/variance trade-off)
      3. Delays: rewards not received instantaneously (grouping, evaluations)
      4. Combinatorial: several decisions at each stage (combinatorial optimization, cascading)
      5. Non-linearity: concave gains, diminishing returns, etc.

  29. Investigating them (past, present, future)

  30. Patients are different
      • We assumed (implicitly?) that all patients/users are identical.
      • Treatment efficiency (probability of click) depends on age, gender, ...
      • These covariates or contexts are observed/known before deciding between the blue and the red pill.
      • The decision (and the regret...) should ultimately depend on them.

  31. General Model of Contextual Bandits
      • Covariates: ω_t ∈ Ω = [0, 1]^d, i.i.d., with law μ equivalent to λ (the cookies of a user, the medical history, etc.)
      • Decisions: π_t ∈ {1, …, K}; the decision can (and should) depend on the context ω_t
      • Rewards: X^k_t ∈ [0, 1] ∼ ν_k(ω_t), with E[X^k | ω] = μ_k(ω); the expected reward of action k depends on the context ω
      • Objectives: find the best decision given the request, i.e. minimize the regret R_T := ∑_{t=1}^T μ_{π^⋆(ω_t)}(ω_t) − μ_{π_t}(ω_t)
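
To make the notation concrete, here is a small sketch computing the contextual regret R_T of a fixed policy on a one-dimensional context space; the three regression functions standing in for μ_k(ω) are invented for illustration.

```python
import random

# Hypothetical regression functions mu_k(omega) on Omega = [0, 1]
mus = [lambda w: 0.5 + 0.3 * w,      # arm 0
       lambda w: 0.8 - 0.4 * w,      # arm 1
       lambda w: 0.4]                # arm 2, constant

def contextual_regret(policy, T):
    """Pseudo-regret of `policy(omega) -> arm` against the oracle pi*(omega)."""
    regret = 0.0
    for _ in range(T):
        w = random.random()                    # covariate omega_t ~ U([0, 1])
        star = max(mu(w) for mu in mus)        # mu_{pi*(omega_t)}(omega_t)
        regret += star - mus[policy(w)](w)     # gap of the played arm at omega_t
    return regret

# A context-ignoring policy that always plays arm 0 pays a linear regret
print(contextual_regret(lambda w: 0, 10_000))
```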

  32.–33. Regularity assumptions
      1. Smoothness of the problem: every μ_k is β-Hölder, with β ∈ (0, 1]: ∃ L > 0, ∀ ω, ω′ ∈ X, ‖μ(ω) − μ(ω′)‖ ≤ L ‖ω − ω′‖^β
      2. Complexity of the problem (α-margin condition): ∃ C_0 > 0, P_X[ 0 < μ^⋆(ω) − μ^♯(ω) < δ ] ≤ C_0 δ^α,
         where μ^⋆(ω) = max_k μ_k(ω) is the maximal μ_k and μ^♯(ω) = max{ μ_k(ω) s.t. μ_k(ω) < μ^⋆(ω) } is the second max.
      With K > 2: μ^⋆ is β-Hölder but μ^♯ is not continuous.

  34.–39. Regularity: an easy example (α big). Figure built up over slides 34–39: the regression functions μ_1(ω), μ_2(ω), μ_3(ω), then μ^⋆(ω) and μ^♯(ω).

  40.–45. Regularity: a hard example (α small). Figure built up over slides 40–45: the regression functions μ_1(ω), μ_2(ω), μ_3(ω), then μ^⋆(ω) and μ^♯(ω).

  46.–48. Binned policy. Figure over slides 46–48: the same regression functions μ_1(ω), μ_2(ω), μ_3(ω), with the context space partitioned into bins.

  49. Binned Successive Elimination (BSE)
      Theorem [P. and Rigollet ('13)]: if α < 1, then E[R_T(BSE)] ≲ T (K log(K)/T)^{β(1+α)/(2β+d)}, with bin side (K log(K)/T)^{1/(2β+d)}.
      • For K = 2, this matches the lower bound: minimax optimal w.r.t. T.
      • Same bound as with full monitoring [Audibert and Tsybakov, '07].
      • No log(T): the difficulty of nonparametric estimation washes away the effects of exploration/exploitation.
      • α < 1: cannot attain fast rates for easy problems; hence adaptive partitioning!
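
A rough sketch of the binned idea behind BSE, on a one-dimensional context space for simplicity: partition [0, 1] into bins and run an independent elimination race in each bin. The number of bins, the confidence radius and the toy regression functions below are illustrative choices, not the constants prescribed by the theorem.

```python
import math
import random

# toy regression functions mu_k(omega), as in the contextual example above
mus = [lambda w: 0.5 + 0.3 * w, lambda w: 0.8 - 0.4 * w, lambda w: 0.4]

def bse(T, n_bins=8, sigma=0.1):
    """Binned Successive Elimination (sketch): one elimination race per bin."""
    K = len(mus)
    # per-bin statistics: surviving arms, pull counts, empirical means
    alive = [set(range(K)) for _ in range(n_bins)]
    counts = [[0] * K for _ in range(n_bins)]
    means = [[0.0] * K for _ in range(n_bins)]
    regret = 0.0
    for _ in range(T):
        w = random.random()
        b = min(int(w * n_bins), n_bins - 1)        # bin containing the context
        # play the least-pulled surviving arm in this bin (round robin)
        i = min(alive[b], key=lambda k: counts[b][k])
        x = random.gauss(mus[i](w), sigma)
        counts[b][i] += 1
        means[b][i] += (x - means[b][i]) / counts[b][i]
        regret += max(mu(w) for mu in mus) - mus[i](w)
        # eliminate arms that look suboptimal inside this bin
        if len(alive[b]) > 1 and min(counts[b][k] for k in alive[b]) > 0:
            rad = lambda k: math.sqrt(2 * math.log(T) / counts[b][k])
            best = max(means[b][k] - rad(k) for k in alive[b])
            alive[b] = {k for k in alive[b] if means[b][k] + rad(k) >= best}
    return regret

print(bse(50_000))
```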
