

  1. Thompson Sampling Algorithms for Mean-Variance Bandits
     Qiuyu Zhu, Vincent Y. F. Tan
     Institute of Operations Research and Analytics, National University of Singapore
     ICML 2020

  2. Stochastic multi-armed bandit

     Problem formulation: A stochastic multi-armed bandit is a collection of distributions
     $\nu = (P_1, P_2, \ldots, P_K)$, where $K$ is the number of arms. In each period $t \in [T]$:
       1. The player picks an arm $i(t) \in [K]$.
       2. The player observes the reward $X_{i(t),t} \sim P_{i(t)}$ for the chosen arm.

     Learning policy: A policy $\pi : (t, A_1, X_1, \ldots, A_{t-1}, X_{t-1}) \mapsto [K]$ is
     characterised by
     $$ i(t) = \pi\big(t,\, i(1), X_{i(1),1},\, \ldots,\, i(t-1), X_{i(t-1),t-1}\big), \qquad t = 1, \ldots, T, $$
     i.e., the player can only use past observations in the current decision.
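
As a minimal sketch of this interaction protocol (Python; the Gaussian arms are the ones from the motivation slides below, and the uniformly random policy is only a placeholder, not one of the paper's algorithms):

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([1.0, 3.0, 3.3])        # illustrative arms (cf. the motivation slides)
variances = np.array([3.0, 0.1, 4.0])
K, T = len(means), 1000

history = []                             # (i(t), X_{i(t),t}) pairs available to the policy
for t in range(T):
    # A policy maps (t, i(1), X_{i(1),1}, ..., i(t-1), X_{i(t-1),t-1}) to an arm;
    # here a placeholder uniformly random policy.
    arm = int(rng.integers(K))
    reward = rng.normal(means[arm], np.sqrt(variances[arm]))   # X_{i(t),t} ~ P_{i(t)}
    history.append((arm, reward))
```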

  3. The learning objective

     Objective: minimize the expected cumulative regret
     $$ R_n = \mathbb{E}\bigg[\sum_{t=1}^{n} \big(X_{i^*,t} - X_{i(t),t}\big)\bigg]
            = \mathbb{E}\bigg[\sum_{t=1}^{n} \big(\mu^* - \mu_{i(t)}\big)\bigg]
            = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_{i,n}], $$
     where $\mu_i$ is the mean of arm $i$, $i^* = \arg\max_i \mu_i$, $\Delta_i = \mu^* - \mu_i$,
     and $T_{i,n} = \sum_{t=1}^{n} \mathbf{1}\{i(t) = i\}$.
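
A small sketch of the gap decomposition, continuing the interaction loop above (single-run estimate using realized pull counts rather than expectations):

```python
# R_n = sum_i Delta_i * T_{i,n}, evaluated for the uniformly random placeholder policy.
pulls = np.bincount([arm for arm, _ in history], minlength=K)   # T_{i,n}
gaps = means.max() - means                                      # Delta_i = mu* - mu_i
print(np.dot(gaps, pulls))                                      # cumulative (pseudo-)regret
```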

  4. Motivation

  5. Motivation
     Mean = $(-1.44,\ 3.00,\ 3.12)$

  6. Motivation
     True reward distributions: Arm 1 $\sim \mathcal{N}(1, 3)$, Arm 2 $\sim \mathcal{N}(3, 0.1)$, Arm 3 $\sim \mathcal{N}(3.3, 4)$.

  7. Motivation
     True reward distributions: Arm 1 $\sim \mathcal{N}(1, 3)$, Arm 2 $\sim \mathcal{N}(3, 0.1)$, Arm 3 $\sim \mathcal{N}(3.3, 4)$.
     Some applications require a trade-off between risk and return.

  8. Mean-variance multi-armed bandit

     Definition 1 (Mean-Variance): The mean-variance of an arm $i$ with mean $\mu_i$,
     variance $\sigma_i^2$ and coefficient of absolute risk tolerance $\rho > 0$ is defined as
     $$ \mathrm{MV}_i = \rho\,\mu_i - \sigma_i^2. $$

     Definition 2 (Empirical Mean-Variance): Given i.i.d. samples $\{X_{i,t}\}_{t=1}^{s}$ from
     the distribution $\nu_i$, the empirical mean-variance is defined as
     $$ \widehat{\mathrm{MV}}_{i,s} = \rho\,\hat{\mu}_{i,s} - \hat{\sigma}_{i,s}^2, $$
     where $\hat{\mu}_{i,s}$ and $\hat{\sigma}_{i,s}^2$ are the empirical mean and variance, respectively.
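
A small sketch of Definition 2 (the sample size and the choice $\rho = 1$ are illustrative):

```python
import numpy as np

def empirical_mean_variance(samples, rho):
    """Definition 2: MV_hat_{i,s} = rho * mu_hat_{i,s} - sigma_hat_{i,s}^2."""
    samples = np.asarray(samples)
    return rho * samples.mean() - samples.var()   # np.var uses the 1/s (biased) convention

rng = np.random.default_rng(1)
draws = rng.normal(3.0, np.sqrt(0.1), size=100)   # e.g. 100 draws from Arm 2 ~ N(3, 0.1)
print(empirical_mean_variance(draws, rho=1.0))    # close to rho*3 - 0.1 = 2.9 for rho = 1
```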

  9. The learning objective

     For a given policy $\pi$, let $\{Z_t : t = 1, 2, \ldots, n\}$ denote its rewards over $n$
     rounds. Its empirical mean-variance is
     $$ \widehat{\mathrm{MV}}_n(\pi) = \rho\,\hat{\mu}_n(\pi) - \hat{\sigma}_n^2(\pi), \quad\text{where}\quad
        \hat{\mu}_n(\pi) = \frac{1}{n}\sum_{t=1}^{n} Z_t, \qquad
        \hat{\sigma}_n^2(\pi) = \frac{1}{n}\sum_{t=1}^{n} \big(Z_t - \hat{\mu}_n(\pi)\big)^2. $$

     Definition 3 (Regret): The expected regret of a policy $\pi$ over $n$ rounds is defined as
     $$ \mathbb{E}[R_n(\pi)] = n\,\Big(\mathrm{MV}_1 - \mathbb{E}\big[\widehat{\mathrm{MV}}_n(\pi)\big]\Big), $$
     where we assume the first arm is the best arm.
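
Continuing the interaction sketch above, these quantities can be evaluated directly (single-run estimate, not the expectation in Definition 3; $\rho = 1$ is an illustrative choice, under which the MV-optimal arm of the motivation example is arm 2, index 1):

```python
rho = 1.0
rewards = np.array([r for _, r in history])               # Z_1, ..., Z_n
n = len(rewards)
mv_hat_policy = rho * rewards.mean() - rewards.var()      # MV_hat_n(pi)
mv_best = rho * means[1] - variances[1]                   # MV_1 of the MV-optimal arm
print(n * (mv_best - mv_hat_policy))                      # single-run regret estimate
```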

  10. The variances

      Law of total variance:
      $$ \mathrm{Var}(\mathrm{reward}) = \mathbb{E}\big[\mathrm{Var}(\mathrm{reward} \mid \mathrm{arm})\big]
         + \mathrm{Var}\big(\mathbb{E}[\mathrm{reward} \mid \mathrm{arm}]\big). $$

      Figure 1: Reward process.
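
A quick numerical check of the law of total variance for a mixture over arms (the arm-selection probabilities below are illustrative, not from the slides):

```python
import numpy as np

means = np.array([1.0, 3.0, 3.3])
variances = np.array([3.0, 0.1, 4.0])
weights = np.array([0.2, 0.5, 0.3])                               # P(arm = i)

within = np.dot(weights, variances)                               # E[Var(reward | arm)]
between = np.dot(weights, means**2) - np.dot(weights, means)**2   # Var(E[reward | arm])
print(within + between)                                           # = Var(reward) of the mixture
```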

  11. Pseudo-regret

      Definition 4: The expected pseudo-regret of a policy $\pi$ over $n$ rounds is defined as
      $$ \mathbb{E}\big[\widetilde{R}_n(\pi)\big]
         = \sum_{i=2}^{K} \mathbb{E}[T_{i,n}]\,\Delta_i
         + \frac{1}{n}\sum_{i=1}^{K}\sum_{j \neq i} \mathbb{E}[T_{i,n} T_{j,n}]\,\Gamma_{i,j}^2, $$
      where $\Delta_i = \sigma_i^2 - \sigma_1^2 - \rho\,(\mu_i - \mu_1)$ is the gap between
      $\mathrm{MV}_i$ and $\mathrm{MV}_1$, and $\Gamma_{i,j}$ is the gap between $\mu_i$ and $\mu_j$.

      Lemma 1: The difference between the expected regret and the expected pseudo-regret can be
      bounded as follows:
      $$ \mathbb{E}[R_n(\pi)] \;\leq\; \mathbb{E}\big[\widetilde{R}_n(\pi)\big] + 3\sum_{i=1}^{K} \sigma_i^2. $$

  12. Pseudo-regret

      Simplification of the pseudo-regret:
      $$ \frac{1}{n}\sum_{i=1}^{K}\sum_{j \neq i} \mathbb{E}[T_{i,n} T_{j,n}]\,\Gamma_{i,j}^2
         \;\leq\; 2\sum_{i=2}^{K} \mathbb{E}[T_{i,n}]\,\Gamma_{i,\max}^2, \qquad (1) $$
      where $\Gamma_{i,\max}^2 = \max\{(\mu_i - \mu_j)^2 : j = 1, \ldots, K\}$.

      By applying Definition 4, Lemma 1 and Eqn. (1), it suffices to bound the expected number
      of pulls of the suboptimal arms, $\mathbb{E}[T_{i,n}]$.
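
As a sketch, the simplified pseudo-regret bound of Definition 4 plus Eqn. (1) can be evaluated from the pull counts of the interaction sketch above (again with the illustrative $\rho = 1$; single-run counts stand in for the expectations):

```python
rho = 1.0
pulls = np.bincount([arm for arm, _ in history], minlength=K)     # T_{i,n}
mv = rho * means - variances                                      # MV_i
best = int(np.argmax(mv))                                         # MV-optimal arm
deltas = mv[best] - mv                                            # Delta_i (Definition 4)
gamma_max_sq = np.array([((means[i] - means) ** 2).max() for i in range(K)])  # Gamma_{i,max}^2
subopt = np.arange(K) != best
print(np.sum(pulls[subopt] * (deltas[subopt] + 2 * gamma_max_sq[subopt])))    # upper bound
```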

  13. Thompson Sampling
      True reward distributions: $\mathcal{N}(1, 3)$, $\mathcal{N}(3, 0.1)$, $\mathcal{N}(3.3, 4)$.
      $t = 0$: samples $(1.30,\ 1.22,\ -0.07)$ → play arm 1 → get reward $-1.44$ → update posteriors.

  14. Thompson Sampling
      True reward distributions: $\mathcal{N}(1, 3)$, $\mathcal{N}(3, 0.1)$, $\mathcal{N}(3.3, 4)$.
      $t = 1$: samples $(0.17,\ -0.24,\ 0.65)$ → play arm 3 → get reward $0.62$ → update posteriors.

  15. Thompson Sampling
      True reward distributions: $\mathcal{N}(1, 3)$, $\mathcal{N}(3, 0.1)$, $\mathcal{N}(3.3, 4)$.
      $t = 10$: samples $(-0.24,\ 2.15,\ 3.23)$ → play arm 2 → get reward $2.12$ → update posteriors.
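
A minimal sketch of the sample/play/update loop illustrated on these slides, assuming Gaussian arms and the same $\mathcal{N}(\hat\mu, 1/(T+1))$ posterior on each mean that the algorithms below use (the drawn numbers will of course differ from the slide's values):

```python
import numpy as np

rng = np.random.default_rng(0)
means, variances = np.array([1.0, 3.0, 3.3]), np.array([3.0, 0.1, 4.0])
mu_hat = np.zeros(3)                                          # posterior means
pulls = np.zeros(3)                                           # T_{i,t}

for t in range(100):
    theta = rng.normal(mu_hat, 1.0 / np.sqrt(pulls + 1.0))    # sample one mean per arm
    arm = int(np.argmax(theta))                               # play the best-looking arm
    x = rng.normal(means[arm], np.sqrt(variances[arm]))       # observe the reward
    pulls[arm] += 1                                           # update the posterior
    mu_hat[arm] += (x - mu_hat[arm]) / pulls[arm]
```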

  16. TS algorithm for mean learning

      Algorithm 1: Thompson Sampling for Mean Learning (MTS)
      1: Input: $\hat{\mu}_{i,0} = 0$, $T_{i,0} = 0$, $\alpha_{i,0} = \tfrac{1}{2}$, $\beta_{i,0} = \tfrac{1}{2}$.
      2: for each $t = 1, 2, \ldots$ do
      3:   Sample $\theta_i(t)$ from $\mathcal{N}\big(\hat{\mu}_{i,t-1},\, 1/(T_{i,t-1}+1)\big)$.
      4:   Play arm $i(t) = \arg\max_i \rho\,\theta_i(t) - 2\beta_{i,t-1}$ and observe $X_{i(t),t}$.
      5:   $(\hat{\mu}_{i(t),t},\, T_{i(t),t},\, \alpha_{i(t),t},\, \beta_{i(t),t}) =
           \mathrm{Update}(\hat{\mu}_{i(t),t-1},\, T_{i(t),t-1},\, \alpha_{i(t),t-1},\, \beta_{i(t),t-1},\, X_{i(t),t})$.
      6: end for
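
A runnable sketch of Algorithm 1. The Update routine is not spelled out on the slide, so the incremental update below (empirical mean plus a running sum of squared deviations, i.e. roughly $2\beta_{i,t}$, cf. slide 21) is an assumption; likewise, the variance penalty in step 4 is taken here to be the empirical variance, an interpretation of the slide's $\rho\theta_i(t) - 2\beta_{i,t-1}$ objective rather than the authors' exact implementation:

```python
import numpy as np

def mts(means, variances, rho, horizon, seed=0):
    rng = np.random.default_rng(seed)
    K = len(means)
    mu_hat = np.zeros(K)       # empirical means  (mu_hat_{i,t})
    pulls = np.zeros(K)        # pull counts      (T_{i,t})
    ssd = np.zeros(K)          # sum of squared deviations (roughly 2 * beta_{i,t})
    for t in range(horizon):
        theta = rng.normal(mu_hat, 1.0 / np.sqrt(pulls + 1.0))   # step 3: sample theta_i(t)
        var_hat = ssd / np.maximum(pulls, 1.0)                   # empirical variance (assumed penalty)
        arm = int(np.argmax(rho * theta - var_hat))              # step 4 (see caveat in the lead-in)
        x = rng.normal(means[arm], np.sqrt(variances[arm]))      # observe X_{i(t),t}
        pulls[arm] += 1                                          # step 5: assumed Update
        delta = x - mu_hat[arm]
        mu_hat[arm] += delta / pulls[arm]
        ssd[arm] += delta * (x - mu_hat[arm])                    # Welford-style variance update
    return pulls

print(mts(np.array([1.0, 3.0, 3.3]), np.array([3.0, 0.1, 4.0]), rho=1.0, horizon=2000))
# with rho = 1, arm 2 (index 1) has the largest mean-variance and should dominate the pulls
```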

  17. Regret bound

      Theorem 1: If $\rho > \max\{\sigma_1^2 / \Gamma_{1,i} : i = 2, \ldots, K\}$, the asymptotic
      expected regret incurred by MTS for mean-variance Gaussian bandits satisfies
      $$ \lim_{n\to\infty} \frac{\mathbb{E}[R_n(\mathrm{MTS})]}{\log n}
         \;\leq\; \sum_{i=2}^{K} \big(\Delta_i + 2\Gamma_{i,\max}^2\big)\,
         \frac{2\rho^2}{\big(\rho\,\Gamma_{1,i} - \sigma_1^2\big)^2}. $$

      Remark 1 (The bound): Since $\Delta_i = \sigma_i^2 - \sigma_1^2 + \rho\,\Gamma_{1,i}$, as
      $\rho$ tends to $+\infty$ we observe that
      $$ \lim_{n\to\infty} \frac{\mathbb{E}[R_n(\mathrm{MTS})]}{\rho \log n}
         \;\leq\; \sum_{i=2}^{K} \frac{2}{\Gamma_{1,i}}. $$
      This bound is near-optimal according to [Agrawal and Goyal, 2012].
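
Spelling out the limiting step behind Remark 1 (using $\Delta_i = \sigma_i^2 - \sigma_1^2 + \rho\,\Gamma_{1,i}$ from the remark): each summand of the Theorem 1 bound, divided by $\rho$, satisfies
$$ \frac{\Delta_i + 2\Gamma_{i,\max}^2}{\rho}\cdot\frac{2\rho^2}{\big(\rho\,\Gamma_{1,i} - \sigma_1^2\big)^2}
   = \frac{2\rho\,\big(\sigma_i^2 - \sigma_1^2 + \rho\,\Gamma_{1,i} + 2\Gamma_{i,\max}^2\big)}{\big(\rho\,\Gamma_{1,i} - \sigma_1^2\big)^2}
   \;\longrightarrow\; \frac{2}{\Gamma_{1,i}} \quad \text{as } \rho \to \infty, $$
which yields the limit stated in Remark 1.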

  18. TS algorithm for variance learning

      Algorithm 2: Thompson Sampling for Variance Learning (VTS)
      1: Input: $\hat{\mu}_{i,0} = 0$, $T_{i,0} = 0$, $\alpha_{i,0} = \tfrac{1}{2}$, $\beta_{i,0} = \tfrac{1}{2}$.
      2: for each $t = 1, 2, \ldots$ do
      3:   Sample $\tau_i(t)$ from $\mathrm{Gamma}(\alpha_{i,t-1}, \beta_{i,t-1})$.
      4:   Play arm $i(t) = \arg\max_{i\in[K]} \rho\,\hat{\mu}_{i,t-1} - 1/\tau_i(t)$ and observe $X_{i(t),t}$.
      5:   $(\hat{\mu}_{i(t),t},\, T_{i(t),t},\, \alpha_{i(t),t},\, \beta_{i(t),t}) =
           \mathrm{Update}(\hat{\mu}_{i(t),t-1},\, T_{i(t),t-1},\, \alpha_{i(t),t-1},\, \beta_{i(t),t-1},\, X_{i(t),t})$.
      6: end for
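
A sketch of Algorithm 2 under an assumed Update rule for the Gamma parameters: the shape grows by $1/2$ per observation and the rate accumulates half the squared deviations (a standard Normal-Gamma-style posterior on the precision, consistent with slide 21 but not spelled out on this slide):

```python
import numpy as np

def vts(means, variances, rho, horizon, seed=0):
    rng = np.random.default_rng(seed)
    K = len(means)
    mu_hat = np.zeros(K)       # empirical means
    pulls = np.zeros(K)        # T_{i,t}
    alpha = np.full(K, 0.5)    # Gamma shape (alpha_{i,t})
    beta = np.full(K, 0.5)     # Gamma rate  (beta_{i,t})
    for t in range(horizon):
        tau = rng.gamma(alpha, 1.0 / beta)                       # step 3 (numpy takes scale = 1/rate)
        arm = int(np.argmax(rho * mu_hat - 1.0 / np.maximum(tau, 1e-12)))   # step 4
        x = rng.normal(means[arm], np.sqrt(variances[arm]))      # observe X_{i(t),t}
        pulls[arm] += 1                                          # step 5: assumed Update
        delta = x - mu_hat[arm]
        mu_hat[arm] += delta / pulls[arm]
        alpha[arm] += 0.5                                        # assumption: shape += 1/2 per sample
        beta[arm] += 0.5 * delta * (x - mu_hat[arm])             # assumption: rate tracks half the squared deviations
    return pulls

print(vts(np.array([1.0, 3.0, 3.3]), np.array([3.0, 0.1, 4.0]), rho=1.0, horizon=2000))
```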

  19. Regret bound

      Theorem 2: Let $h(x) = \tfrac{1}{2}(x - 1 - \log x)$. If $\rho \leq \min\{\Delta_i/\Gamma_i : \Delta_i/\Gamma_i > 0\}$,
      the asymptotic regret incurred by VTS for mean-variance Gaussian bandits satisfies
      $$ \lim_{n\to\infty} \frac{\mathbb{E}[R_n(\mathrm{VTS})]}{\log n}
         \;\leq\; \sum_{i=2}^{K} \frac{\Delta_i + 2\Gamma_{i,\max}^2}{h\big(\sigma_i^2/\sigma_1^2\big)}. $$

      Remark 2 (Order optimality): Vakili and Zhao (2015) proved that the expected regret of any
      consistent algorithm is $\Omega\big((\log n)/\Delta^2\big)$, where $\Delta = \min_{i \neq 1} \Delta_i$.
      Since $h(x) = (x-1)^2/4 + o\big((x-1)^2\big)$ as $x \to 1$, MTS and VTS are order-optimal in
      both $n$ and $\Delta$.
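
For concreteness, the Taylor expansion used in Remark 2 follows from $\log x = (x-1) - \tfrac{(x-1)^2}{2} + o\big((x-1)^2\big)$:
$$ h(x) = \tfrac{1}{2}\big(x - 1 - \log x\big)
        = \tfrac{1}{2}\Big((x-1) - (x-1) + \tfrac{(x-1)^2}{2} + o\big((x-1)^2\big)\Big)
        = \tfrac{(x-1)^2}{4} + o\big((x-1)^2\big) \quad \text{as } x \to 1. $$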

  20. TS algorithm for mean-variance learning

      Algorithm 3: Thompson Sampling for Mean-Variance Bandits (MVTS)
      1: Input: $\hat{\mu}_{i,0} = 0$, $T_{i,0} = 0$, $\alpha_{i,0} = \tfrac{1}{2}$, $\beta_{i,0} = \tfrac{1}{2}$.
      2: for each $t = 1, 2, \ldots$ do
      3:   Sample $\tau_i(t)$ from $\mathrm{Gamma}(\alpha_{i,t-1}, \beta_{i,t-1})$.
      4:   Sample $\theta_i(t)$ from $\mathcal{N}\big(\hat{\mu}_{i,t-1},\, 1/(T_{i,t-1}+1)\big)$.
      5:   Play arm $i(t) = \arg\max_{i\in[K]} \rho\,\theta_i(t) - 1/\tau_i(t)$ and observe $X_{i(t),t}$.
      6:   $(\hat{\mu}_{i(t),t},\, T_{i(t),t},\, \alpha_{i(t),t},\, \beta_{i(t),t}) =
           \mathrm{Update}(\hat{\mu}_{i(t),t-1},\, T_{i(t),t-1},\, \alpha_{i(t),t-1},\, \beta_{i(t),t-1},\, X_{i(t),t})$.
      7: end for
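
A sketch of Algorithm 3, combining the mean sample of MTS with the precision sample of VTS; the Update rule is the same assumed Normal-Gamma-style update as in the VTS sketch above:

```python
import numpy as np

def mvts(means, variances, rho, horizon, seed=0):
    rng = np.random.default_rng(seed)
    K = len(means)
    mu_hat = np.zeros(K)
    pulls = np.zeros(K)
    alpha = np.full(K, 0.5)
    beta = np.full(K, 0.5)
    for t in range(horizon):
        tau = rng.gamma(alpha, 1.0 / beta)                        # step 3: precision samples
        theta = rng.normal(mu_hat, 1.0 / np.sqrt(pulls + 1.0))    # step 4: mean samples
        arm = int(np.argmax(rho * theta - 1.0 / np.maximum(tau, 1e-12)))   # step 5
        x = rng.normal(means[arm], np.sqrt(variances[arm]))       # observe X_{i(t),t}
        pulls[arm] += 1                                           # step 6: assumed Update
        delta = x - mu_hat[arm]
        mu_hat[arm] += delta / pulls[arm]
        alpha[arm] += 0.5
        beta[arm] += 0.5 * delta * (x - mu_hat[arm])
    return pulls

print(mvts(np.array([1.0, 3.0, 3.3]), np.array([3.0, 0.1, 4.0]), rho=1.0, horizon=2000))
```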

  21. Hierarchical structure of Thompson samples

      Top level (distributions of the sufficient statistics):
      $$ \hat{\mu}_{i,T_{i,t}} \sim \mathcal{N}\big(\mu_i,\, \sigma_i^2/T_{i,t}\big), \qquad
         2\beta_{i,t}/\sigma_i^2 \sim \chi^2_{s-1}. $$
      Middle level (the Thompson samples drawn from them):
      $$ \theta_{i,t} \sim \mathcal{N}\big(\hat{\mu}_{i,T_{i,t}},\, 1/T_{i,t}\big), \qquad
         \tau_{i,t} \sim \mathrm{Gamma}(\alpha_{i,t}, \beta_{i,t}). $$
      Bottom level (the combined mean-variance sample):
      $$ \widehat{\mathrm{MV}}_{i,t} = \rho\,\theta_{i,t} - 1/\tau_{i,t}. $$

      Figure 2: Hierarchical structure of the mean-variance Thompson samples in MVTS.
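
As a sketch, this hierarchy can be simulated for a single arm (the arm parameters and pull count are illustrative, and the Gamma parameters follow the assumed Update rule from the sketches above):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_i, var_i, rho, s = 3.0, 0.1, 1.0, 50           # illustrative arm parameters and pull count

data = rng.normal(mu_i, np.sqrt(var_i), size=s)   # rewards observed from the arm
mu_hat = data.mean()                              # top level: mu_hat ~ N(mu_i, var_i / s)
alpha = 0.5 + 0.5 * s                             # assumed Update: Gamma shape
beta = 0.5 + 0.5 * ((data - mu_hat) ** 2).sum()   # assumed Update: 2*beta ~ sum of squared deviations

theta = rng.normal(mu_hat, 1.0 / np.sqrt(s))      # middle level: mean Thompson sample
tau = rng.gamma(alpha, 1.0 / beta)                # middle level: precision Thompson sample
print(rho * theta - 1.0 / tau)                    # bottom level: mean-variance Thompson sample
```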
