1. Announcements Ø HW 1 is due now

2. CS6501: Topics in Learning and Game Theory (Fall 2019) — Adversarial Multi-Armed Bandits. Instructor: Haifeng Xu

3. Outline Ø The Adversarial Multi-armed Bandit Problem Ø A Basic Algorithm: Exp3 Ø Regret Analysis of Exp3

4. Recap: Online Learning So Far
Setup: $T$ rounds; the following occurs at round $t$:
1. Learner picks a distribution $p_t$ over actions $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Action $i_t \sim p_t$ is chosen and learner incurs cost $c_t(i_t)$
4. Learner observes $c_t$ (for use in future time steps)
Performance is typically measured by regret: $R_T = \sum_{i \in [n]} \sum_{t \in [T]} c_t(i)\, p_t(i) - \min_{k \in [n]} \sum_{t \in [T]} c_t(k)$
The multiplicative weight update algorithm has regret $O(\sqrt{T \ln n})$.
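As a concrete illustration (not part of the slides), the regret defined above could be computed in Python roughly as follows; the array names `dists` and `costs` are illustrative assumptions.

```python
import numpy as np

def regret(dists, costs):
    """R_T = sum_t sum_i c_t(i) p_t(i)  -  min_k sum_t c_t(k).

    dists: (T, n) array; row t is the learner's distribution p_t over n actions.
    costs: (T, n) array; row t is the adversary's cost vector c_t in [0,1]^n.
    """
    expected_algorithm_cost = np.sum(dists * costs)      # sum_t sum_i c_t(i) * p_t(i)
    best_fixed_action_cost = np.min(costs.sum(axis=0))   # min_k sum_t c_t(k)
    return expected_algorithm_cost - best_fixed_action_cost
```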

5. Recap: Online Learning So Far
Convergence to equilibrium
Ø In repeated zero-sum games, if both players use a no-regret learning algorithm, their average strategy converges to an NE
Ø In general games, the average strategy converges to a CCE
Swap regret – a "stronger" regret concept and better convergence
Ø Def: each action $i$ has a chance to deviate to another action $s(i)$
Ø In repeated general games, if both players use a no-swap-regret learning algorithm, their average strategy converges to a CE
There is a general reduction, converting any learning algorithm with regret $R$ to one with swap regret $nR$.

6. This Lecture: Address Partial Feedback
Ø In online learning, the whole cost vector $c_t$ is observed by the learner, even though she takes only a single action $i_t$
• Realistic in some applications, e.g., stock investment
Ø In many cases, we only see the reward of the action we take
• For example: slot machines, a.k.a. multi-armed bandits

7. Other Applications with Partial Feedback
Ø Online advertisement placement or web ranking
• Action: ad placement or ranking of web pages
• Cannot see the feedback for untaken actions

8. Other Applications with Partial Feedback
Ø Online advertisement placement or web ranking
• Action: ad placement or ranking of web pages
• Cannot see the feedback for untaken actions
Ø Recommendation systems
• Action = recommended option (e.g., a restaurant)
• Do not know other options' feedback

9. Other Applications with Partial Feedback
Ø Online advertisement placement or web ranking
• Action: ad placement or ranking of web pages
• Cannot see the feedback for untaken actions
Ø Recommendation systems
• Action = recommended option (e.g., a restaurant)
• Do not know other options' feedback
Ø Clinical trials
• Action = a treatment
• Don't know what would happen under treatments not chosen
Ø Playing strategic games
• Cannot observe opponents' strategies, only the payoff of the taken action
• E.g., poker games, competition in markets

10. Adversarial Multi-Armed Bandits (MAB)
Ø Very much like online learning, except with partial feedback
• The name "bandit" is inspired by slot machines
Ø Model: at each time step $t = 1, \cdots, T$, the following occurs in order:
1. Learner picks a distribution $p_t$ over arms $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Arm $i_t \sim p_t$ is chosen and learner incurs cost $c_t(i_t)$
4. Learner only observes $c_t(i_t)$ (for use in future time steps)
Ø Though we cannot observe $c_t$, the adversary still picks $c_t$ before $i_t$ is sampled
Q: since the learner does not observe $c_t(i)$ for $i \neq i_t$, can the adversary arbitrarily modify these $c_t(i)$'s after $i_t$ has been selected?
No, because that would make $c_t$ depend on the sampled $i_t$, which is not allowed
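A minimal sketch (not from the slides) of the interaction protocol above, with an oblivious adversary whose cost vectors are fixed in advance for simplicity; the key point from the slide is that $c_t$ never depends on the realized $i_t$. The `learner` object and its `get_distribution`/`update` methods are hypothetical names used only for illustration.

```python
import numpy as np

def run_bandit(learner, cost_matrix, seed=0):
    """Adversarial MAB protocol sketch.

    cost_matrix: (T, n) array; row t is c_t, committed to before i_t is sampled.
    learner:     hypothetical object exposing get_distribution() -> p_t and
                 update(i_t, observed_cost).
    """
    rng = np.random.default_rng(seed)
    T, n = cost_matrix.shape
    total_cost = 0.0
    for t in range(T):
        p_t = learner.get_distribution()   # 1. learner commits to p_t
        c_t = cost_matrix[t]               # 2. adversary's c_t is already fixed
        i_t = rng.choice(n, p=p_t)         # 3. arm i_t ~ p_t is drawn
        total_cost += c_t[i_t]
        learner.update(i_t, c_t[i_t])      # 4. only c_t(i_t) is revealed
    return total_cost
```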

11. Outline Ø The Adversarial Multi-armed Bandit Problem Ø A Basic Algorithm: Exp3 Ø Regret Analysis of Exp3

12. Recall the algorithm for the full information setting:
Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$, $\forall i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i \in [n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot (1 - \varepsilon c_t(i))$

13. Recall the algorithm for the full information setting:
Same algorithm as above, but with the exponential form of the update in step 3: $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon c_t(i)}$
Recall $1 - \varepsilon \approx e^{-\varepsilon}$ for small $\varepsilon$
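For reference, the approximation used here is just the first-order Taylor expansion of the exponential:

$$e^{-\varepsilon} = 1 - \varepsilon + \frac{\varepsilon^2}{2} - \cdots = 1 - \varepsilon + O(\varepsilon^2),$$

so for small $\varepsilon$ the multiplicative update $1 - \varepsilon c_t(i)$ and the exponential update $e^{-\varepsilon c_t(i)}$ are nearly the same.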

14. Recall the algorithm for the full information setting:
Same algorithm, with the exponential update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon c_t(i)}$ in step 3 (recall $1 - \varepsilon \approx e^{-\varepsilon}$ for small $\varepsilon$)
Ø In this lecture we will use this exponential-weight variant, and prove its regret bound en route
Ø Also called Exponential Weight Update (EWU)
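A minimal Python sketch of this full-information EWU learner, assuming the entire cost vector is observed each round; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def ewu(costs, eps):
    """Exponential Weight Update in the full-information setting.

    costs: (T, n) array; row t is the cost vector c_t in [0,1]^n, fully observed.
    eps:   learning-rate parameter epsilon.
    Returns a (T, n) array whose row t is the distribution p_t played at round t.
    """
    T, n = costs.shape
    w = np.ones(n)                          # w_1(i) = 1 for all i
    dists = np.zeros((T, n))
    for t in range(T):
        dists[t] = w / w.sum()              # p_t(i) = w_t(i) / W_t
        w = w * np.exp(-eps * costs[t])     # w_{t+1}(i) = w_t(i) * e^{-eps * c_t(i)}
    return dists
```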

15. Recall the algorithm for the full information setting:
Basic idea of Exp3
Ø Want to use EWU, but we do not know the cost vector $c_t$ → try to estimate $c_t$!
Ø Well, we really only have $c_t(i_t)$; what can we do?
• Estimate $\hat{c}_t = (0, \cdots, 0, c_t(i_t), 0, \cdots, 0)^T$ (nonzero only at entry $i_t$)? Too optimistic
• Estimate $\hat{c}_t = (0, \cdots, 0, c_t(i_t)/p_t(i_t), 0, \cdots, 0)^T$ instead
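To see why the first estimate is "too optimistic" while the second is not, compare their expectations over $i_t \sim p_t$ (the second computation is carried out on slide 17):

$$\text{naive:}\quad \mathbb{E}_{i_t \sim p_t}[\hat{c}_t(i)] = p_t(i)\, c_t(i) \le c_t(i), \qquad \text{importance-weighted:}\quad \mathbb{E}_{i_t \sim p_t}[\hat{c}_t(i)] = p_t(i) \cdot \frac{c_t(i)}{p_t(i)} = c_t(i).$$

The naive estimate systematically underestimates every arm's cost, while dividing by $p_t(i_t)$ exactly compensates for how rarely each arm is observed.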

16. Exp3: a Basic Algorithm for Adversarial MAB
Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$, $\forall i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i \in [n]} w_t(i)$; pick arm $i_t$ with probability $w_t(i)/W_t$
2. Observe $c_t(i_t)$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon \hat{c}_t(i)}$ where $\hat{c}_t = (0, \cdots, 0, c_t(i_t)/p_t(i_t), 0, \cdots, 0)^T$ (nonzero only at entry $i_t$)
Ø That is, the weight is updated only for the pulled arm
• Because we really don't know how good the other arms are at round $t$
• But $i_t$ is more heavily penalized now
• Attention: $c_t(i_t)/p_t(i_t)$ may be extremely large if $p_t(i_t)$ is small
Ø Called Exp3: Exponential-weight algorithm for Exploration and Exploitation
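A minimal Python sketch of Exp3 as stated above, run against a cost sequence fixed in advance; the function name, seed, and array layout are illustrative assumptions, not part of the slides.

```python
import numpy as np

def exp3(costs, eps, seed=0):
    """Exp3 with the importance-weighted cost estimator described above.

    costs: (T, n) array; row t is the adversary's cost vector c_t in [0,1]^n,
           fixed before i_t is drawn. The learner only ever reads costs[t, i_t].
    eps:   learning-rate parameter epsilon.
    Returns the total realized cost of the learner.
    """
    rng = np.random.default_rng(seed)
    T, n = costs.shape
    w = np.ones(n)                           # w_1(i) = 1 for all i
    total_cost = 0.0
    for t in range(T):
        p = w / w.sum()                      # p_t(i) = w_t(i) / W_t
        i_t = rng.choice(n, p=p)             # pull arm i_t ~ p_t
        c_obs = costs[t, i_t]                # the only feedback received
        total_cost += c_obs

        c_hat = np.zeros(n)
        c_hat[i_t] = c_obs / p[i_t]          # can be very large if p_t(i_t) is small
        w = w * np.exp(-eps * c_hat)         # update only the pulled arm's weight
    return total_cost
```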

17. A Closer Look at the Estimator $\hat{c}_t$
Ø $\hat{c}_t$ is random – it depends on the randomly sampled $i_t \sim p_t$
Ø $\hat{c}_t$ is an unbiased estimator of $c_t$, i.e., $\mathbb{E}_{i_t \sim p_t}[\hat{c}_t] = c_t$
• Because given $p_t$, for any $i$ we have $\mathbb{E}_{i_t \sim p_t}[\hat{c}_t(i)] = \Pr[i_t = i] \cdot \frac{c_t(i)}{p_t(i)} + \Pr[i_t \neq i] \cdot 0 = p_t(i) \cdot \frac{c_t(i)}{p_t(i)} = c_t(i)$
Ø This is exactly the reason for our choice of $\hat{c}_t$
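The unbiasedness calculation can also be checked numerically; a small Monte Carlo sketch, where the particular $p_t$, $c_t$, and sample count are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p_t = np.array([0.2, 0.5, 0.3])    # current distribution over 3 arms (made up)
c_t = np.array([0.9, 0.1, 0.6])    # true, hidden cost vector (made up)

samples = 200_000
draws = rng.choice(len(p_t), size=samples, p=p_t)   # i_t ~ p_t, drawn repeatedly

# The estimator puts c_t(i_t)/p_t(i_t) at entry i_t and 0 elsewhere, so averaging
# over samples gives (empirical frequency of arm i) * c_t(i)/p_t(i) for each i.
counts = np.bincount(draws, minlength=len(p_t))
c_hat_mean = counts * (c_t / p_t) / samples

print(np.round(c_hat_mean, 3))     # close to c_t = [0.9, 0.1, 0.6]
```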

18. Regret $R_T = \sum_{i \in [n]} \sum_{t \in [T]} c_t(i)\, p_t(i) - \min_{k \in [n]} \sum_{t \in [T]} c_t(k)$
Some key differences from online learning
Ø $R_T$ is random (even though it already takes the expectation over $i_t \sim p_t$)
• Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
• That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary!

19. Regret: $R_T$ is random (continued from the previous slide). Illustration of one possible run: weights start at $w_1(i) = 1$ for all $i$; if arm 1 happens to be pulled in round 1, then entering round 2 the pulled arm's weight satisfies $w_2(1) < 1$ while $w_2(i) = 1$ for all $i \neq 1$.

20. Regret: $R_T$ is random (continued). A different run of the same algorithm: if arm 2 happens to be pulled in round 1 instead, then entering round 2 we have $w_2(2) < 1$ while $w_2(i) = 1$ for all $i \neq 2$, so the weights, and hence $p_2$ and the eventual regret, differ across runs.
