
Introduction to Multi-Armed Bandits and Reinforcement Learning

Introduction to Multi-Armed Bandits and Reinforcement Learning. Training School on Machine Learning for Communications, Paris, 23-25 September 2019. Who am I? Hi, I'm Lilian Besson, finishing my PhD in telecommunication and machine learning.


  1. Regret decomposition.
  N_a(t): number of selections of arm a in the first t rounds. ∆_a := µ⋆ − µ_a: sub-optimality gap of arm a.
  Regret decomposition: R_ν(A, T) = ∑_{a=1}^K ∆_a E[N_a(T)].

  2. Regret decomposition.
  N_a(t): number of selections of arm a in the first t rounds. ∆_a := µ⋆ − µ_a: sub-optimality gap of arm a.
  Regret decomposition: R_ν(A, T) = ∑_{a=1}^K ∆_a E[N_a(T)].
  Proof.
  R_ν(A, T) = µ⋆ T − E[ ∑_{t=1}^T X_{A_t, t} ] = µ⋆ T − E[ ∑_{t=1}^T µ_{A_t} ]
            = E[ ∑_{t=1}^T (µ⋆ − µ_{A_t}) ]
            = ∑_{a=1}^K (µ⋆ − µ_a) E[ ∑_{t=1}^T 1(A_t = a) ] = ∑_{a=1}^K ∆_a E[N_a(T)].

  3. Regret decomposition.
  N_a(t): number of selections of arm a in the first t rounds. ∆_a := µ⋆ − µ_a: sub-optimality gap of arm a.
  Regret decomposition: R_ν(A, T) = ∑_{a=1}^K ∆_a E[N_a(T)].
  A strategy with small regret should:
  ◮ not select too often the arms for which ∆_a > 0 (sub-optimal arms)
  ◮ ... which requires trying all arms to estimate the values of the ∆_a
  ⇒ Exploration / Exploitation trade-off!
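As a quick sanity check of this decomposition, here is a small Monte-Carlo sketch in Python (the Bernoulli means, horizon and the uniformly random policy are illustrative assumptions, not taken from the slides): it estimates the regret both directly and via ∑_a ∆_a E[N_a(T)], and the two estimates should agree.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.5, 0.7])      # hypothetical Bernoulli means; mu_star = 0.7
K, T, n_runs = len(mu), 1000, 2000
gaps = mu.max() - mu                # Delta_a

regret_direct, counts = 0.0, np.zeros(K)
for _ in range(n_runs):
    arms = rng.integers(K, size=T)                         # A_t for a uniformly random policy
    rewards = (rng.random(T) < mu[arms]).astype(float)     # X_{A_t, t}
    regret_direct += mu.max() * T - rewards.sum()          # mu_star * T - collected rewards
    counts += np.bincount(arms, minlength=K)               # accumulates N_a(T)

regret_direct /= n_runs
regret_decomposed = gaps @ (counts / n_runs)               # sum_a Delta_a * E[N_a(T)]
print(regret_direct, regret_decomposed)                    # the two estimates should agree
```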

  4. Two naive strategies.
  ◮ Idea 1: draw each arm T/K times ⇒ EXPLORATION
  → R_ν(A, T) = (1/K) ( ∑_{a : µ_a < µ⋆} ∆_a ) T = Ω(T)

  5. Two naive strategies.
  ◮ Idea 1: draw each arm T/K times ⇒ EXPLORATION
  → R_ν(A, T) = (1/K) ( ∑_{a : µ_a < µ⋆} ∆_a ) T = Ω(T)
  ◮ Idea 2: always trust the empirical best arm ⇒ EXPLOITATION
  A_{t+1} = argmax_{a ∈ {1,...,K}} µ̂_a(t), using estimates of the unknown means,
  µ̂_a(t) = (1/N_a(t)) ∑_{s=1}^t X_{a,s} 1(A_s = a)
  → R_ν(A, T) ≥ (1 − µ_1) × µ_2 × (µ_1 − µ_2) T = Ω(T)
  (with K = 2 Bernoulli arms of means µ_1 ≠ µ_2)
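For concreteness, here is a minimal Python sketch of these two naive baselines, the ones labelled "Uniform" and "FTL" in the numerical illustration later on; Bernoulli arms and the function names are my own assumptions, not the authors' code.

```python
import numpy as np

def uniform_exploration(mu, T, rng):
    """Idea 1: draw each arm (roughly) T/K times, in round-robin."""
    K = len(mu)
    arms = np.arange(T) % K
    rewards = (rng.random(T) < mu[arms]).astype(float)   # Bernoulli rewards
    return arms, rewards

def follow_the_leader(mu, T, rng):
    """Idea 2 ("FTL"): after one initial pull of each arm (added here to avoid
    empty averages), always play the arm with the largest empirical mean."""
    K = len(mu)
    counts, sums = np.zeros(K), np.zeros(K)
    arms, rewards = np.empty(T, dtype=int), np.empty(T)
    for t in range(T):
        a = t if t < K else int(np.argmax(sums / counts))  # greedy after initialization
        r = float(rng.random() < mu[a])
        counts[a] += 1; sums[a] += r
        arms[t], rewards[t] = a, r
    return arms, rewards
```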

  6. A better idea: Explore-Then-Commit (ETC).
  Given m ∈ {1, ..., T/K},
  ◮ draw each arm m times
  ◮ compute the empirical best arm â = argmax_a µ̂_a(Km)
  ◮ keep playing this arm until round T: A_{t+1} = â for t ≥ Km
  ⇒ EXPLORATION followed by EXPLOITATION

  8. A better idea: Explore-Then-Commit (ETC).
  Analysis for K = 2 arms. If µ_1 > µ_2, ∆ := µ_1 − µ_2.
  R_ν(ETC, T) = ∆ E[N_2(T)] = ∆ E[ m + (T − Km) 1(â = 2) ]
              ≤ ∆ m + (∆ T) × P( µ̂_{1,m} ≤ µ̂_{2,m} )
  µ̂_{a,m}: empirical mean of the first m observations from arm a.
  ⇒ requires a concentration inequality

  9. A better idea: Explore-Then-Commit (ETC).
  Analysis for two arms. µ_1 > µ_2, ∆ := µ_1 − µ_2.
  Assumption 1: ν_1, ν_2 are bounded in [0, 1].
  R_ν(ETC, T) = ∆ E[N_2(T)] = ∆ E[ m + (T − Km) 1(â = 2) ]
              ≤ ∆ m + (∆ T) × exp(−m ∆² / 2)
  µ̂_{a,m}: empirical mean of the first m observations from arm a.
  ⇒ Hoeffding's inequality

  10. A better idea: Explore-Then-Commit (ETC).
  Analysis for two arms. µ_1 > µ_2, ∆ := µ_1 − µ_2.
  Assumption 2: ν_1 = N(µ_1, σ²), ν_2 = N(µ_2, σ²) are Gaussian arms.
  R_ν(ETC, T) = ∆ E[N_2(T)] = ∆ E[ m + (T − Km) 1(â = 2) ]
              ≤ ∆ m + (∆ T) × exp(−m ∆² / (4σ²))
  µ̂_{a,m}: empirical mean of the first m observations from arm a.
  ⇒ Gaussian tail inequality

  13. A better idea: Explore-Then-Commit (ETC).
  Analysis for two arms. µ_1 > µ_2, ∆ := µ_1 − µ_2.
  Assumption: ν_1 = N(µ_1, σ²), ν_2 = N(µ_2, σ²) are Gaussian arms.
  For m = (4σ²/∆²) log( T∆²/(4σ²) ),
  R_ν(ETC, T) ≤ (4σ²/∆) [ log( T∆²/(4σ²) ) + 1 ] = O( (1/∆) log(T) ).
  + logarithmic regret!
  − requires the knowledge of T (≃ OKAY) and ∆ (NOT OKAY)
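Below is a minimal Python sketch of ETC for Gaussian arms with the tuned value of m above; the means, horizon and the use of the pseudo-regret µ⋆T − ∑_t µ_{A_t} are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def etc(mu, T, m, rng, sigma=1.0):
    """Explore-Then-Commit for Gaussian arms: pull each arm m times, then commit
    to the empirical best arm for the remaining T - K*m rounds."""
    K = len(mu)
    chosen = np.repeat(np.arange(K), m)                            # exploration phase
    means = np.array([rng.normal(mu[a], sigma, m).mean() for a in range(K)])
    best = int(np.argmax(means))                                   # empirical best arm after K*m pulls
    chosen = np.concatenate([chosen, np.full(T - K * m, best)])    # exploitation phase
    return mu.max() * T - mu[chosen].sum()                         # pseudo-regret of this run

# Tuned m for K = 2 arms when Delta is known, as on the slide.
T, sigma, Delta = 10_000, 1.0, 0.5
m = int(np.ceil(4 * sigma**2 / Delta**2 * np.log(T * Delta**2 / (4 * sigma**2))))
rng = np.random.default_rng(1)
# Average pseudo-regret over 100 runs; compare with the logarithmic bound above.
print(np.mean([etc(np.array([0.5, 1.0]), T, m, rng) for _ in range(100)]))
```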

  14. Sequential Explore-Then-Commit (2 Gaussian arms).
  ◮ explore uniformly until the random time
    τ = inf{ t ∈ N : |µ̂_1(t) − µ̂_2(t)| > √( 8σ² log(T/t) / t ) }
  [Figure: a sample path of µ̂_1(t) − µ̂_2(t) together with the stopping threshold, for t = 1, ..., 1000.]
  ◮ â_τ = argmax_a µ̂_a(τ) and A_{t+1} = â_τ for t ∈ {τ + 1, ..., T}
  R_ν(S-ETC, T) ≤ (4σ²/∆) log(T∆²) + C √(log(T)) = O( (1/∆) log(T) ).
  ⇒ same regret rate, without knowing ∆ [Garivier et al. 2016]
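A hedged sketch of the Sequential ETC rule above, for the two Gaussian arms used in the next slide's experiment (ν_1 = N(1, 1), ν_2 = N(1.5, 1)); this is an illustration of the stopping rule, not the authors' implementation.

```python
import numpy as np

def sequential_etc(mu, T, rng, sigma=1.0):
    """Sequential ETC for 2 Gaussian arms: explore in round-robin until
    |mu_hat_1(t) - mu_hat_2(t)| > sqrt(8 sigma^2 log(T/t) / t), then commit
    to the empirical best arm for the remaining rounds."""
    sums, counts = np.zeros(2), np.zeros(2)
    regret = 0.0
    for t in range(1, T + 1):
        a = (t - 1) % 2                               # uniform exploration phase
        regret += mu.max() - mu[a]
        sums[a] += rng.normal(mu[a], sigma)
        counts[a] += 1
        if counts.min() > 0:
            means = sums / counts
            if abs(means[0] - means[1]) > np.sqrt(8 * sigma**2 * np.log(T / t) / t):
                best = int(np.argmax(means))          # commit: A_s = best for s = t+1, ..., T
                regret += (T - t) * (mu.max() - mu[best])
                return regret
    return regret

rng = np.random.default_rng(0)
print(np.mean([sequential_etc(np.array([1.0, 1.5]), 1000, rng) for _ in range(100)]))
```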

  15. Numerical illustration.
  Two Gaussian arms: ν_1 = N(1, 1) and ν_2 = N(1.5, 1).
  [Figure: expected regret as a function of time for Uniform, FTL and Sequential-ETC (left), with a zoom on Sequential-ETC (right).]
  Expected regret estimated over N = 500 runs for Sequential-ETC versus our two naive baselines
  (dashed lines: empirical 0.05 and 0.95 quantiles of the regret).

  16. Is this a good regret rate?
  For two-armed Gaussian bandits,
  R_ν(ETC, T) ≲ (4σ²/∆) log(T∆²) = O( (1/∆) log(T) ).
  ⇒ problem-dependent logarithmic regret bound: R_ν(algo, T) = O(log(T)).
  Observation: blows up when ∆ tends to zero...
  R_ν(ETC, T) ≲ min( (4σ²/∆) log(T∆²), ∆T )
             ≤ sup_{u > 0} min( (4σ²/u) log(Tu²), uT ) ≤ C √T.
  ⇒ problem-independent square-root regret bound: R_ν(algo, T) = O(√T).

  17. Best possible regret? Lower Bounds

  18. The Lai and Robbins lower bound.
  Context: a parametric bandit model where each arm is parameterized by its mean, ν = (ν_{µ_1}, ..., ν_{µ_K}), µ_a ∈ I:
  distributions ν ⇔ means µ = (µ_1, ..., µ_K).
  Key tool: the Kullback-Leibler divergence
  kl(µ, µ′) := KL( ν_µ, ν_{µ′} ) = E_{X ∼ ν_µ}[ log (dν_µ / dν_{µ′})(X) ].
  Theorem [Lai and Robbins, 1985]
  For uniformly efficient algorithms (R_µ(A, T) = o(T^α) for all α ∈ (0, 1) and all µ ∈ I^K),
  µ_a < µ⋆ ⇒ liminf_{T→∞} E_µ[N_a(T)] / log T ≥ 1 / kl(µ_a, µ⋆).

  19. The Lai and Robbins lower bound.
  Same statement, with the Kullback-Leibler divergence specialized to Gaussian bandits with variance σ²:
  kl(µ, µ′) := (µ − µ′)² / (2σ²).

  20. The Lai and Robbins lower bound.
  Same statement, with the Kullback-Leibler divergence specialized to Bernoulli bandits:
  kl(µ, µ′) := µ log(µ/µ′) + (1 − µ) log( (1 − µ)/(1 − µ′) ).
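The two kl expressions above are easy to code; the sketch below (helper names and the example instance are my own) also evaluates the Lai-Robbins asymptotic quantity log(T)/kl(µ_a, µ⋆), which lower-bounds the expected number of draws of a sub-optimal arm.

```python
import numpy as np

def kl_gaussian(mu, mu_prime, sigma2=1.0):
    """kl(mu, mu') for Gaussian arms with common variance sigma^2."""
    return (mu - mu_prime) ** 2 / (2 * sigma2)

def kl_bernoulli(mu, mu_prime, eps=1e-12):
    """kl(mu, mu') for Bernoulli arms (clipped away from 0 and 1 for numerical safety)."""
    mu, mu_prime = np.clip(mu, eps, 1 - eps), np.clip(mu_prime, eps, 1 - eps)
    return mu * np.log(mu / mu_prime) + (1 - mu) * np.log((1 - mu) / (1 - mu_prime))

# Lai & Robbins: for a uniformly efficient algorithm and a sub-optimal arm a,
# E[N_a(T)] >= (1 + o(1)) * log(T) / kl(mu_a, mu_star). Hypothetical 2-arm Bernoulli instance:
mu_star, mu_a, T = 0.25, 0.20, 20_000
print(np.log(T) / kl_bernoulli(mu_a, mu_star))   # asymptotic lower bound on E[N_a(T)]
```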

  21. Some room for better algorithms?
  ◮ For two-armed Gaussian bandits, ETC satisfies
    R_ν(ETC, T) ≲ (4σ²/∆) log(T∆²) = O( (1/∆) log(T) ), with ∆ = |µ_1 − µ_2|.
  ◮ The Lai and Robbins lower bound yields, for large values of T,
    R_ν(A, T) ≳ (2σ²/∆) log(T) = Ω( (1/∆) log(T) ), as kl(µ_1, µ_2) = (µ_1 − µ_2)² / (2σ²).
  ⇒ Explore-Then-Commit is not asymptotically optimal.

  22. Mixing Exploration and Exploitation

  23. A simple strategy: ε-greedy.
  The ε-greedy rule [Sutton and Barto, 98] is the simplest way to alternate exploration and exploitation.
  ε-greedy strategy: at round t,
  ◮ with probability ε: A_t ∼ U({1, ..., K})
  ◮ with probability 1 − ε: A_t = argmax_{a=1,...,K} µ̂_a(t).
  ⇒ Linear regret: R_ν(ε-greedy, T) ≥ ε ((K − 1)/K) ∆_min T, where ∆_min = min_{a : µ_a < µ⋆} ∆_a.

  24. A simple strategy: ε-greedy.
  A simple fix: make ε decreasing!
  ε_t-greedy strategy: at round t,
  ◮ with probability ε_t := min(1, K/(d² t)) (a probability decreasing with t): A_t ∼ U({1, ..., K})
  ◮ with probability 1 − ε_t: A_t = argmax_{a=1,...,K} µ̂_a(t − 1).
  Theorem [Auer et al. 02]
  If 0 < d ≤ ∆_min, then R_ν(ε_t-greedy, T) = O( (K/d²) log(T) ).
  ⇒ requires the knowledge of a lower bound on ∆_min.
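A minimal Python sketch of this ε_t-greedy rule; Bernoulli rewards, the example means and the extra "pull every arm once" guard are assumptions added for the illustration.

```python
import numpy as np

def epsilon_t_greedy(mu, T, d, rng):
    """epsilon_t-greedy with epsilon_t = min(1, K/(d^2 * t)), as in [Auer et al. 02]."""
    K = len(mu)
    counts, sums = np.zeros(K), np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, K / (d**2 * t))
        if rng.random() < eps_t or counts.min() == 0:
            a = int(rng.integers(K))                 # explore uniformly at random
        else:
            a = int(np.argmax(sums / counts))        # exploit the empirical best arm
        r = float(rng.random() < mu[a])              # Bernoulli reward (assumption)
        counts[a] += 1; sums[a] += r
        regret += mu.max() - mu[a]
    return regret

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.25])                           # hypothetical means, Delta_min = 0.05
print(epsilon_t_greedy(mu, 10_000, d=0.05, rng=rng))
```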

  25. The Optimism Principle: Upper Confidence Bound Algorithms

  26. The optimism principle.
  Step 1: construct a set of statistically plausible models.
  ◮ For each arm a, build a confidence interval I_a(t) on the mean µ_a:
    I_a(t) = [ LCB_a(t), UCB_a(t) ]
  (LCB = Lower Confidence Bound, UCB = Upper Confidence Bound)
  Figure: confidence intervals on the means after t rounds.

  27. The optimism principle.
  Step 2: act as if the best possible model were the true model ("optimism in the face of uncertainty").
  Figure: confidence intervals on the means after t rounds.
  Optimistic bandit model: max_{µ̃ ∈ C(t)} max_{a=1,...,K} µ̃_a.
  ◮ That is, select A_{t+1} = argmax_{a=1,...,K} UCB_a(t).

  28. Optimistic Algorithms: Building Confidence Intervals, Analysis of UCB(α)

  29. How to build confidence intervals?
  We need UCB_a(t) such that P( µ_a ≤ UCB_a(t) ) ≳ 1 − 1/t.
  ⇒ tool: concentration inequalities.
  Example: rewards are σ²-sub-Gaussian:
  E[Z] = µ and E[ e^{λ(Z − µ)} ] ≤ e^{λ²σ²/2}.   (1)
  Hoeffding inequality: for Z_i i.i.d. satisfying (1) and every (fixed) s ≥ 1,
  P( (Z_1 + ... + Z_s)/s ≥ µ + x ) ≤ e^{−s x² / (2σ²)}.
  ◮ ν_a bounded in [0, 1]: 1/4-sub-Gaussian
  ◮ ν_a = N(µ_a, σ²): σ²-sub-Gaussian

  30. How to build confidence intervals?
  The same bound holds for the lower deviation:
  P( (Z_1 + ... + Z_s)/s ≤ µ − x ) ≤ e^{−s x² / (2σ²)}.
  ◮ ν_a bounded in [0, 1]: 1/4-sub-Gaussian
  ◮ ν_a = N(µ_a, σ²): σ²-sub-Gaussian

  31. How to build confidence intervals?
  Hoeffding inequality: for Z_i i.i.d. satisfying (1) and every (fixed) s ≥ 1,
  P( (Z_1 + ... + Z_s)/s ≤ µ − x ) ≤ e^{−s x² / (2σ²)}.
  ⚠ This cannot be used directly in a bandit model, as the number of observations s from each arm is random!

  32. How to build confidence intervals?
  ◮ N_a(t) = ∑_{s=1}^t 1(A_s = a): number of selections of a after t rounds
  ◮ µ̂_{a,s} = (1/s) ∑_{k=1}^s Y_{a,k}: average of the first s observations from arm a
  ◮ µ̂_a(t) = µ̂_{a, N_a(t)}: empirical estimate of µ_a after t rounds
  Hoeffding inequality + union bound:
  P( µ_a ≤ µ̂_a(t) + σ √( α log(t) / N_a(t) ) ) ≥ 1 − 1 / t^{α/2 − 1}

  33. How to build confidence intervals?
  Hoeffding inequality + union bound:
  P( µ_a ≤ µ̂_a(t) + σ √( α log(t) / N_a(t) ) ) ≥ 1 − 1 / t^{α/2 − 1}
  Proof.
  P( µ_a > µ̂_a(t) + σ √( α log(t) / N_a(t) ) ) ≤ P( ∃ s ≤ t : µ_a > µ̂_{a,s} + σ √( α log(t) / s ) )
    ≤ ∑_{s=1}^t P( µ̂_{a,s} < µ_a − σ √( α log(t) / s ) )
    ≤ ∑_{s=1}^t 1 / t^{α/2} = 1 / t^{α/2 − 1}.

  34. A first UCB algorithm.
  UCB(α) selects A_{t+1} = argmax_a UCB_a(t), where
  UCB_a(t) = µ̂_a(t) [exploitation term] + √( α log(t) / N_a(t) ) [exploration bonus].
  ◮ this form of UCB was first proposed for Gaussian rewards [Katehakis and Robbins, 95]
  ◮ popularized by [Auer et al. 02] for bounded rewards: UCB1, for α = 2
    → see the next talk at 4pm!
  ◮ the analysis of UCB(α) was further refined to hold for α > 1/2 in that case [Bubeck, 11, Cappé et al. 13]
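A minimal sketch of UCB(α) for Bernoulli rewards; the initialization by one pull per arm and the example means are assumptions of the sketch, not part of the slide.

```python
import numpy as np

def ucb_alpha(mu, T, alpha, rng):
    """UCB(alpha): play each arm once, then the arm maximizing
    mu_hat_a(t) + sqrt(alpha * log(t) / N_a(t))."""
    K = len(mu)
    counts, sums = np.zeros(K), np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                   # initialization: one pull per arm
        else:
            indices = sums / counts + np.sqrt(alpha * np.log(t) / counts)
            a = int(np.argmax(indices))
        r = float(rng.random() < mu[a])                 # Bernoulli reward
        counts[a] += 1; sums[a] += r
        regret += mu.max() - mu[a]
    return regret

rng = np.random.default_rng(42)
print(ucb_alpha(np.array([0.1, 0.2, 0.3, 0.4, 0.5]), 10_000, alpha=2.0, rng=rng))
```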

  35. A UCB algorithm in action (movie).

  36. Optimistic Algorithms: Building Confidence Intervals, Analysis of UCB(α)

  37. Regret of UCB(α) for bounded rewards.
  Theorem [Auer et al. 02]
  UCB(α) with parameter α = 2 satisfies
  R_ν(UCB1, T) ≤ 8 ( ∑_{a : µ_a < µ⋆} 1/∆_a ) log(T) + (1 + π²/3) ( ∑_{a=1}^K ∆_a ).
  Theorem
  For every α > 1 and every sub-optimal arm a, there exists a constant C_α > 0 such that
  E_µ[N_a(T)] ≤ ( 4α / (µ⋆ − µ_a)² ) log(T) + C_α.
  It follows that
  R_ν(UCB(α), T) ≤ 4α ( ∑_{a : µ_a < µ⋆} 1/∆_a ) log(T) + K C_α.

  38. Intermediate summary.
  ◮ Several ways to solve the exploration/exploitation trade-off:
    ◮ Explore-Then-Commit
    ◮ ε-greedy
    ◮ Upper Confidence Bound algorithms
  ◮ Good concentration inequalities are crucial to build good UCB algorithms!
  ◮ Performance lower bounds motivate the design of (optimal) algorithms.

  39. A Bayesian Look at the MAB Model

  40. Bayesian Bandits: Two points of view, Bayes-UCB, Thompson Sampling

  41. Historical perspective.
  1952 Robbins: formulation of the MAB problem
  1985 Lai and Robbins: lower bound, first asymptotically optimal algorithm
  1987 Lai: asymptotic regret of kl-UCB
  1995 Agrawal: UCB algorithms
  1995 Katehakis and Robbins: a UCB algorithm for Gaussian bandits
  2002 Auer et al.: UCB1 with finite-time regret bound
  2009 UCB-V, MOSS, ...
  2011, 13 Cappé et al.: finite-time regret bound for kl-UCB

  42. Historical perspective.
  1933 Thompson: a Bayesian mechanism for clinical trials
  1952 Robbins: formulation of the MAB problem
  1956 Bradt et al., Bellman: optimal solution of a Bayesian MAB problem
  1979 Gittins: first Bayesian index policy
  1985 Lai and Robbins: lower bound, first asymptotically optimal algorithm
  1985 Berry and Fristedt: Bandit Problems, a survey on the Bayesian MAB
  1987 Lai: asymptotic regret of kl-UCB + study of its Bayesian regret
  1995 Agrawal: UCB algorithms
  1995 Katehakis and Robbins: a UCB algorithm for Gaussian bandits
  2002 Auer et al.: UCB1 with finite-time regret bound
  2009 UCB-V, MOSS, ...
  2010 Thompson Sampling is re-discovered
  2011, 13 Cappé et al.: finite-time regret bound for kl-UCB
  2012, 13 Thompson Sampling is asymptotically optimal

  43. Frequentist versus Bayesian bandit.
  ν_µ = (ν_{µ_1}, ..., ν_{µ_K}) ∈ (P)^K.
  ◮ Two probabilistic models ⇒ two points of view!
    Frequentist model: µ_1, ..., µ_K are unknown parameters; arm a: (Y_{a,s})_s i.i.d. ∼ ν_{µ_a}.
    Bayesian model: µ_1, ..., µ_K are drawn from a prior distribution µ_a ∼ π_a; arm a: (Y_{a,s})_s | µ i.i.d. ∼ ν_{µ_a}.
  ◮ The regret can be computed in each case:
    Frequentist regret (regret): R_µ(A, T) = E_µ[ ∑_{t=1}^T (µ⋆ − µ_{A_t}) ].
    Bayesian regret (Bayes risk): R_π(A, T) = E_{µ ∼ π}[ ∑_{t=1}^T (µ⋆ − µ_{A_t}) ] = ∫ R_µ(A, T) dπ(µ).

  44. Frequentist and Bayesian algorithms.
  ◮ Two types of tools to build bandit algorithms:
    Frequentist tools: MLE estimators of the means, confidence intervals.
    Bayesian tools: posterior distributions π_a^t = L( µ_a | Y_{a,1}, ..., Y_{a,N_a(t)} ).

  45. Example: Bernoulli bandits.
  Bernoulli bandit model: µ = (µ_1, ..., µ_K).
  ◮ Bayesian view: µ_1, ..., µ_K are random variables, with prior distribution µ_a ∼ U([0, 1]).
  ⇒ posterior distribution:
  π_a(t) = L( µ_a | R_1, ..., R_t ) = Beta( S_a(t) + 1, N_a(t) − S_a(t) + 1 ),
  where S_a(t) counts the ones and N_a(t) − S_a(t) the zeros.
  [Figure: prior π_0 and posteriors π_a(t), π_a(t+1) after observing X_{t+1} = 1 (left) or X_{t+1} = 0 (right).]
  S_a(t) = ∑_{s=1}^t R_s 1(A_s = a): sum of the rewards from arm a.
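The posterior update is one line of code. A small sketch using SciPy (the arm's true mean and the simulated observations are hypothetical, for illustration only):

```python
import numpy as np
from scipy.stats import beta

# Beta posterior update for one Bernoulli arm with a uniform prior Beta(1, 1).
rng = np.random.default_rng(0)
mu_a = 0.35
rewards = rng.random(50) < mu_a            # 50 simulated Bernoulli(mu_a) observations from arm a

S = int(rewards.sum())                     # number of ones, S_a(t)
N = len(rewards)                           # number of pulls of arm a, N_a(t)
posterior = beta(S + 1, N - S + 1)         # pi_a(t) = Beta(S_a(t)+1, N_a(t)-S_a(t)+1)
print(posterior.mean(), posterior.interval(0.95))   # posterior mean and a 95% credible interval
```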

  46. Bayesian algorithm.
  A Bayesian bandit algorithm exploits the posterior distributions of the means to decide which arm to select.

  47. Bayesian Bandits: Two points of view, Bayes-UCB, Thompson Sampling

  48. The Bayes-UCB algorithm.
  ◮ Π_0 = (π_1(0), ..., π_K(0)): a prior distribution over (µ_1, ..., µ_K)
  ◮ Π_t = (π_1(t), ..., π_K(t)): the posterior distribution over the means (µ_1, ..., µ_K) after t observations
  The Bayes-UCB algorithm chooses at time t
  A_{t+1} = argmax_{a=1,...,K} Q( 1 − 1/(t (log t)^c), π_a(t) ),
  where Q(α, π) is the quantile of order α of the distribution π.
  [Figure: the quantile Q(α, π) of order α of a distribution π.]

  49. The Bayes-UCB algorithm.
  Bernoulli rewards with uniform prior:
  ◮ π_a(0) i.i.d. ∼ U([0, 1]) = Beta(1, 1)
  ◮ π_a(t) = Beta( S_a(t) + 1, N_a(t) − S_a(t) + 1 )

  50. The Bayes-UCB algorithm.
  Gaussian rewards with Gaussian prior:
  ◮ π_a(0) i.i.d. ∼ N(0, κ²)
  ◮ π_a(t) = N( S_a(t) / (N_a(t) + σ²/κ²), σ² / (N_a(t) + σ²/κ²) )
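A hedged sketch of Bayes-UCB for Bernoulli arms with a uniform prior, using SciPy's Beta quantile function; the default c = 0 (quantile order 1 − 1/t) is a practical choice of mine, whereas the theorem two slides below uses c ≥ 5.

```python
import numpy as np
from scipy.stats import beta

def bayes_ucb(mu, T, rng, c=0):
    """Bayes-UCB for Bernoulli arms, uniform prior: play the arm whose Beta posterior
    has the largest quantile of order 1 - 1/(t (log t)^c)."""
    K = len(mu)
    S, N = np.zeros(K), np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        level = 1 - 1 / (t * max(np.log(t), 1) ** c)    # quantile order (c = 0 gives 1 - 1/t)
        indices = beta.ppf(level, S + 1, N - S + 1)     # quantiles of the K posteriors
        a = int(np.argmax(indices))
        r = float(rng.random() < mu[a])
        S[a] += r; N[a] += 1
        regret += mu.max() - mu[a]
    return regret

rng = np.random.default_rng(1)
print(bayes_ucb(np.array([0.1, 0.2, 0.3, 0.4, 0.5]), 5_000, rng))
```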

  51. Bayes-UCB in action (movie).

  52. Theoretical results in the Bernoulli case.
  ◮ Bayes-UCB is asymptotically optimal for Bernoulli rewards.
  Theorem [K., Cappé, Garivier 2012]
  Let ε > 0. The Bayes-UCB algorithm using a uniform prior over the arms and parameter c ≥ 5 satisfies
  E_µ[N_a(T)] ≤ ( (1 + ε) / kl(µ_a, µ⋆) ) log(T) + o_{ε,c}(log(T)).

  53. Bayesian Bandits: Insights from the Optimal Solution, Bayes-UCB, Thompson Sampling

  54. Historical perspective.
  1933 Thompson: in the context of clinical trials, the allocation of a treatment should be some increasing function of its posterior probability of being optimal
  2010 Thompson Sampling rediscovered under different names: Bayesian Learning Automaton [Granmo, 2010], Randomized probability matching [Scott, 2010]
  2011 An empirical evaluation of Thompson Sampling: an efficient algorithm, beyond simple bandit models [Li and Chapelle, 2011]
  2012 First (logarithmic) regret bound for Thompson Sampling [Agrawal and Goyal, 2012]
  2012 Thompson Sampling is asymptotically optimal for Bernoulli bandits [K., Korda and Munos, 2012] [Agrawal and Goyal, 2013]
  2013- Many successful uses of Thompson Sampling beyond Bernoulli bandits (contextual bandits, reinforcement learning)

  55. Thompson Sampling.
  Two equivalent interpretations:
  ◮ "select an arm at random according to its probability of being the best"
  ◮ "draw a possible bandit model from the posterior distribution and act optimally in this sampled model"
  ≠ optimistic!
  Thompson Sampling: a randomized Bayesian algorithm
  ∀ a ∈ {1, ..., K}, θ_a(t) ∼ π_a(t), and A_{t+1} = argmax_{a=1,...,K} θ_a(t).
  [Figure: posterior distributions of two arms, with the samples θ_1(t), θ_2(t) and the true means µ_1, µ_2.]
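A minimal sketch of Thompson Sampling for Bernoulli arms with Beta(1, 1) priors (the means and horizon below are illustrative):

```python
import numpy as np

def thompson_sampling(mu, T, rng):
    """Thompson Sampling for Bernoulli arms, uniform (Beta(1,1)) priors:
    sample theta_a(t) from each posterior and play the arm with the largest sample."""
    K = len(mu)
    S, N = np.zeros(K), np.zeros(K)
    regret = 0.0
    for _ in range(T):
        theta = rng.beta(S + 1, N - S + 1)      # one posterior sample per arm
        a = int(np.argmax(theta))
        r = float(rng.random() < mu[a])
        S[a] += r; N[a] += 1
        regret += mu.max() - mu[a]
    return regret

rng = np.random.default_rng(2)
print(thompson_sampling(np.array([0.1, 0.05, 0.02, 0.01, 0.2]), 20_000, rng))
```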

  56. Thompson Sampling is asymptotically optimal.
  Problem-dependent regret
  ∀ ε > 0, E_µ[N_a(T)] ≤ ( (1 + ε) / kl(µ_a, µ⋆) ) log(T) + o_{µ,ε}(log(T)).
  This result holds:
  ◮ for Bernoulli bandits, with a uniform prior [K., Korda, Munos 12] [Agrawal and Goyal 13]
  ◮ for Gaussian bandits, with a Gaussian prior [Agrawal and Goyal 17]
  ◮ for exponential family bandits, with the Jeffreys prior [Korda et al. 13]
  Problem-independent regret [Agrawal and Goyal 13]
  For Bernoulli and Gaussian bandits, Thompson Sampling satisfies R_µ(TS, T) = O( √(KT log(T)) ).
  ◮ Thompson Sampling is also asymptotically optimal for Gaussian arms with unknown mean and variance [Honda and Takemura, 14]

  57. Understanding Thompson Sampling.
  ◮ A key ingredient in the analysis of [K., Korda and Munos 12]:
  Proposition
  There exist constants b = b(µ) ∈ (0, 1) and C_b < ∞ such that ∑_{t=1}^∞ P( N_1(t) ≤ t^b ) ≤ C_b.
  { N_1(t) ≤ t^b } = { there exists a time range of length at least t^{1−b} − 1 with no draw of arm 1 }
  [Figure: a posterior distribution, with µ_2, µ_2 + δ and µ_1 marked.]

  58. Bayesian versus Frequentist algorithms.
  ◮ Short horizon, T = 1000 (average over N = 10000 runs), K = 2 Bernoulli arms, µ_1 = 0.2, µ_2 = 0.25.
  [Figure: regret curves for KLUCB, KLUCB+, KLUCB-H+, Bayes-UCB, Thompson Sampling and FH-Gittins.]

  59. Bayesian versus Frequentist algorithms.
  ◮ Long horizon, T = 20000 (average over N = 50000 runs), K = 10 Bernoulli arms,
  µ = [0.1, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01].

  60. Other Bandit Models

  61. Other Bandit Models: Many different extensions, Piece-wise stationary bandits, Multi-player bandits

  70. Many other bandit models and problems (1/2).
  Most famous extensions:
  ◮ (centralized) multiple actions
  ◮ multiple choice: choose m ∈ {2, ..., K − 1} arms (fixed size)
  ◮ combinatorial: choose a subset of arms S ⊂ {1, ..., K} (large space)
  ◮ non stationary
  ◮ piece-wise stationary / abruptly changing
  ◮ slowly-varying
  ◮ adversarial...
  ◮ (decentralized) collaborative/communicating bandits over a graph
  ◮ (decentralized) non-communicating multi-player bandits
  → Implemented in our library SMPyBandits!

  71. Many other bandit models and problems (2/2).
  And many more extensions...
  ◮ non-stochastic, Markov models (rested/restless)
  ◮ best arm identification (vs reward maximization)
    ◮ fixed budget setting
    ◮ fixed confidence setting
    ◮ PAC (probably approximately correct) algorithms
  ◮ bandits with (differential) privacy constraints
  ◮ for some applications (content recommendation):
    ◮ contextual bandits: observe a reward and a context (C_t ∈ R^d)
    ◮ cascading bandits
    ◮ delayed feedback bandits
  ◮ structured bandits (low-rank, many-armed, Lipschitz, etc.)
  ◮ X-armed, continuous-armed bandits

  72. Other Bandit Models: Many different extensions, Piece-wise stationary bandits, Multi-player bandits

  75. Piece-wise stationary bandits.
  Stationary MAB problems
  Arm a gives rewards sampled from the same distribution at every time step: ∀ t, r_a(t) i.i.d. ∼ ν_a = B(µ_a).
  Non-stationary MAB problems?
  (Possibly) different distributions at every time step: ∀ t, r_a(t) ∼ ν_a(t) = B(µ_a(t)).
  ⇒ harder problem! And very hard if µ_a(t) can change at any step!
  Piece-wise stationary problems!
  → the literature usually focuses on the easier case, where there are at most Y_T = o(√T) intervals, on which the means are all stationary.

  76. Example of a piece-wise stationary MAB problem.
  We plot the means µ_1(t), µ_2(t), µ_3(t) of K = 3 arms. There are Y_T = 4 break-points and 5 sequences between t = 1 and t = T = 5000.
  [Figure: history of the successive means of the K = 3 arms for a non-stationary Bernoulli MAB with 4 break-points, over the time steps t = 1, ..., T = 5000.]
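To make the setting concrete, here is a small sketch of a piece-wise stationary Bernoulli environment with Y_T = 4 break-points; the segment boundaries and the means are invented for illustration and are not those of the figure.

```python
import numpy as np

# Piece-wise stationary Bernoulli bandit: the means are constant on each of the
# 5 sequences delimited by 4 break-points (all numbers below are illustrative).
T, K = 5000, 3
break_points = [0, 1000, 2000, 3000, 4000, T]
means_per_segment = np.array([
    [0.2, 0.5, 0.8],
    [0.8, 0.5, 0.2],
    [0.5, 0.8, 0.2],
    [0.2, 0.8, 0.5],
    [0.8, 0.2, 0.5],
])

mu_t = np.empty((T, K))
for i in range(len(break_points) - 1):
    mu_t[break_points[i]:break_points[i + 1]] = means_per_segment[i]

rng = np.random.default_rng(0)
def reward(a, t):
    """Reward of arm a at (0-indexed) time t: a Bernoulli draw with the current mean mu_a(t)."""
    return float(rng.random() < mu_t[t, a])
```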
