SLIDE 1

Announcements

➢ HW 1 is due now

SLIDE 2

CS6501: Topics in Learning and Game Theory (Fall 2019)

Adversarial Multi-Armed Bandits

Instructor: Haifeng Xu

SLIDE 3

Outline

➢ The Adversarial Multi-armed Bandit Problem
➢ A Basic Algorithm: Exp3
➢ Regret Analysis of Exp3

SLIDE 4

Recap: Online Learning So Far

Setup: $T$ rounds; the following occurs at round $t$:
1. Learner picks a distribution $p_t$ over actions $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Action $i_t \sim p_t$ is chosen and the learner incurs cost $c_t(i_t)$
4. Learner observes $c_t$ (for use in future time steps)

Performance is typically measured by regret:
$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

The multiplicative weight update algorithm has regret $O(\sqrt{T \ln n})$.
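For concreteness, here is a minimal Python sketch of this full-information protocol with the multiplicative weight update; the adversary callback `costs_fn` and all names are illustrative, not from the slides.

```python
import numpy as np

def mwu(T, n, costs_fn, eps, seed=0):
    """Multiplicative weights in the full-information setting.

    costs_fn(t, p) must return the cost vector c_t in [0,1]^n; it may
    depend on the learner's distribution p_t but not on the sampled action.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(n)                      # weights w_1(i) = 1
    exp_cost, cum_costs = 0.0, np.zeros(n)
    for t in range(T):
        p = w / w.sum()                 # p_t(i) = w_t(i) / W_t
        c = costs_fn(t, p)              # the FULL cost vector is revealed
        i = rng.choice(n, p=p)          # action i_t ~ p_t (cost c[i] is incurred)
        exp_cost += p @ c               # track sum_i c_t(i) p_t(i), as in R_T
        cum_costs += c
        w *= 1.0 - eps * c              # update every action's weight
    return exp_cost - cum_costs.min()   # regret R_T against the best fixed action
```

With $\varepsilon \approx \sqrt{\ln n / T}$ this matches the $O(\sqrt{T \ln n})$ rate quoted above.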

SLIDE 5

Recap: Online Learning So Far

Convergence to equilibrium
➢ In repeated zero-sum games, if both players use a no-regret learning algorithm, their average strategies converge to a Nash equilibrium (NE)
➢ In general games, the average strategy converges to a coarse correlated equilibrium (CCE)

Swap regret – a "stronger" regret concept with better convergence
➢ Def: each action $i$ has a chance to deviate to another action $\delta(i)$
➢ There is a general reduction converting any learning algorithm with regret $R$ into one with swap regret $nR$
➢ In repeated general games, if both players use a no-swap-regret learning algorithm, their average strategy converges to a correlated equilibrium (CE)

SLIDE 6

This Lecture: Address Partial Feedback

➢ In online learning, the whole cost vector $c_t$ is observed by the learner, even though she takes only a single action $i_t$
  • Realistic in some applications, e.g., stock investment
➢ In many cases, we only see the reward of the action we take
  • For example: slot machines, a.k.a. multi-armed bandits
SLIDE 7

Other Applications with Partial Feedback

➢ Online advertisement placement or web ranking
  • Action: an ad placement or a ranking of web pages
  • Cannot see the feedback for untaken actions
SLIDE 8

Other Applications with Partial Feedback

➢ Online advertisement placement or web ranking
  • Action: an ad placement or a ranking of web pages
  • Cannot see the feedback for untaken actions

➢ Recommendation systems:
  • Action = recommended option (e.g., a restaurant)
  • Do not know other options' feedback
SLIDE 9

Other Applications with Partial Feedback

➢ Online advertisement placement or web ranking
  • Action: an ad placement or a ranking of web pages
  • Cannot see the feedback for untaken actions

➢ Recommendation systems:
  • Action = recommended option (e.g., a restaurant)
  • Do not know other options' feedback

➢ Clinical trials
  • Action = a treatment
  • Don't know what would happen for treatments not chosen

➢ Playing strategic games
  • Cannot observe opponents' strategies; only know the payoff of the taken action
  • E.g., poker games, competition in markets
SLIDE 10

Adversarial Multi-Armed Bandits (MAB)

➢ Very much like online learning, except with partial feedback
  • The name "bandit" is inspired by slot machines

➢ Model: at each time step $t = 1, \cdots, T$, the following occurs in order:
1. Learner picks a distribution $p_t$ over arms $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Arm $i_t \sim p_t$ is chosen and the learner incurs cost $c_t(i_t)$
4. Learner only observes $c_t(i_t)$ (for use in future time steps)

➢ Though we cannot observe $c_t$, the adversary still picks $c_t$ before $i_t$ is sampled

Q: since the learner does not observe $c_t(i)$ for $i \neq i_t$, can the adversary arbitrarily modify these $c_t(i)$'s after $i_t$ has been selected?

No, because that would make $c_t$ depend on the sampled $i_t$, which is not allowed.
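To make the feedback restriction concrete, a one-round sketch of this protocol (the `adversary` callback is an illustrative assumption):

```python
import numpy as np

def bandit_round(p, adversary, rng):
    """One round of the adversarial bandit protocol."""
    c = adversary(p)             # adversary commits to c_t before i_t is sampled
    i = rng.choice(len(p), p=p)  # arm i_t ~ p_t
    return i, c[i]               # learner sees ONLY c_t(i_t), never the full c_t
```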

SLIDE 11

Outline

➢ The Adversarial Multi-armed Bandit Problem
➢ A Basic Algorithm: Exp3
➢ Regret Analysis of Exp3

SLIDE 12

Recall the algorithm for the full-information setting:

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot (1 - \varepsilon\, c_t(i))$

SLIDE 13

Recall the algorithm for the full-information setting, now with an exponential update:

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, c_t(i)}$

Recall $1 - \varepsilon \approx e^{-\varepsilon}$ for small $\varepsilon$

SLIDE 14

➢ In this lecture we will use this exponential-weight variant, and prove its regret bound en route
➢ Also called Exponential Weight Update (EWU)

Recall the algorithm for the full-information setting:

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, c_t(i)}$

Recall $1 - \varepsilon \approx e^{-\varepsilon}$ for small $\varepsilon$
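For later reference (the upper bound is exactly what Fact 2 of the regret analysis uses), the standard two-sided Taylor bounds behind this approximation can be written as:

```latex
1 - \varepsilon \;\le\; e^{-\varepsilon} \;\le\; 1 - \varepsilon + \frac{\varepsilon^2}{2}
\qquad \text{for all } \varepsilon \ge 0 .
```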

SLIDE 15

Basic idea of Exp3

➢ Want to use EWU, but we do not know the vector $c_t$
➢ Well, we really only have $c_t(i_t)$ – what can we do? → try to estimate $c_t$!

Estimate $\hat{c}_t = (0, \cdots, 0, c_t(i_t), 0, \cdots, 0)$? Too optimistic.
Estimate $\hat{c}_t = (0, \cdots, 0, c_t(i_t)/p_t(i_t), 0, \cdots, 0)$ instead.

Recall the algorithm for the full-information setting:

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, c_t(i)}$

SLIDE 16

Exp3: a Basic Algorithm for Adversarial MAB

➢ That is, the weight is updated only for the pulled arm
  • Because we really don't know how good the other arms are at time $t$
  • But $i_t$ is penalized more heavily now
  • Attention: $c_t(i_t)/p_t(i_t)$ may be extremely large if $p_t(i_t)$ is small
➢ Called Exp3: the Exponential-weight algorithm for Exploration and Exploitation

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $p_t(i) = w_t(i)/W_t$
2. Observe cost $c_t(i_t)$ of the pulled arm
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, \hat{c}_t(i)}$, where $\hat{c}_t = (0, \cdots, 0, c_t(i_t)/p_t(i_t), 0, \cdots, 0)$
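A minimal Python sketch of the boxed Exp3 loop (the `adversary` callback and the seeding are illustrative assumptions, not part of the algorithm):

```python
import numpy as np

def exp3(T, n, adversary, eps, seed=0):
    """Exp3 with the importance-weighted cost estimate c_hat."""
    rng = np.random.default_rng(seed)
    w = np.ones(n)                    # w_1(i) = 1
    for t in range(T):
        p = w / w.sum()               # p_t(i) = w_t(i) / W_t
        c = adversary(p)              # c_t is fixed before the arm is drawn
        i = rng.choice(n, p=p)        # pull arm i_t ~ p_t, incur cost c[i]
        c_hat = np.zeros(n)
        c_hat[i] = c[i] / p[i]        # only the pulled coordinate is nonzero
        w *= np.exp(-eps * c_hat)     # so only the pulled arm's weight changes
    return w
```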

SLIDE 17

A Closer Look at the Estimator $\hat{c}_t$

➢ $\hat{c}_t$ is random – it depends on the randomly sampled $i_t \sim p_t$
➢ $\hat{c}_t$ is an unbiased estimator of $c_t$, i.e., $\mathbb{E}_{i_t \sim p_t}[\hat{c}_t] = c_t$
  • Because given $p_t$, for any $i$ we have
$$\mathbb{E}_{i_t \sim p_t}[\hat{c}_t(i)] = \mathbb{P}(i_t = i) \cdot \frac{c_t(i)}{p_t(i)} + \mathbb{P}(i_t \neq i) \cdot 0 = p_t(i) \cdot \frac{c_t(i)}{p_t(i)} = c_t(i)$$
➢ This is exactly the reason for our choice of $\hat{c}_t$
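A quick Monte Carlo sanity check of this unbiasedness claim (throwaway simulation code; the particular $p_t$ and $c_t$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
p = np.array([0.1, 0.2, 0.3, 0.4])   # a fixed distribution p_t
c = rng.uniform(size=n)              # a fixed cost vector c_t

est, trials = np.zeros(n), 200_000
for _ in range(trials):
    i = rng.choice(n, p=p)           # i_t ~ p_t
    est[i] += c[i] / p[i]            # accumulate the estimator c_hat_t
print(est / trials)                  # empirical E[c_hat_t]: close to c
print(c)
```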

SLIDE 18

Regret

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

Some key differences from online learning:

➢ $R_T$ is random (even though it already takes an expectation over $i_t \sim p_t$)
  • Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
  • That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary!

SLIDE 19

Regret

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

Some key differences from online learning:

➢ $R_T$ is random (even though it already takes an expectation over $i_t \sim p_t$)
  • Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
  • That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary!

Illustration: $w_1(i) = 1\ \forall i$ → round 1, pull arm 1 → $w_2(i) = 1\ \forall i \neq 1$, $w_2(1) < 1$ → round 2 → $\cdots$

SLIDE 20

Regret

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

Some key differences from online learning:

➢ $R_T$ is random (even though it already takes an expectation over $i_t \sim p_t$)
  • Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
  • That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary!

Illustration: $w_1(i) = 1\ \forall i$ → round 1, pull arm 2 → $w_2(i) = 1\ \forall i \neq 2$, $w_2(2) < 1$ → round 2 → $\cdots$

SLIDE 21

Regret

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

Some key differences from online learning:

➢ $R_T$ is random (even though it already takes an expectation over $i_t \sim p_t$)
  • Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
  • That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary
➢ The cost vector $c_t$ is also random, as it generally depends on $p_t$
  • The adversary maps the distribution $p_t$ to a cost vector $c_t$
➢ This is not the case in online learning
  • If we run the same algorithm multiple times, we obtain the same $R_T$ value when facing the same adversary

SLIDE 22

Regret

➢ Therefore, in principle, we have to upper bound $\mathbb{E}(R_T)$, where the expectation is over the randomness of arm sampling

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

$$\mathbb{E}(R_T) = \mathbb{E}\Big[\sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)\Big] = \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \mathbb{E}\Big[\min_{k\in[n]} \sum_{t\in[T]} c_t(k)\Big]$$

by linearity of expectation

SLIDE 23

Regret

➢ Therefore, in principle, we have to upper bound $\mathbb{E}(R_T)$, where the expectation is over the randomness of arm sampling

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

$$\mathbb{E}(R_T) = \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \mathbb{E}\Big[\min_{k\in[n]} \sum_{t\in[T]} c_t(k)\Big] \geq \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)]$$

because $\min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)] \geq \mathbb{E}\big[\min_{k\in[n]} \sum_{t\in[T]} c_t(k)\big]$ (proof: homework exercise)
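The inequality in the last step is deferred to homework; a quick simulation that illustrates it, using i.i.d. uniform costs purely as a stand-in adversary (an assumption for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)
runs, T, n = 10_000, 50, 3
# totals[r, k] = sum_t c_t(k) in run r
totals = rng.uniform(size=(runs, T, n)).sum(axis=1)
print(totals.mean(axis=0).min())  # min_k E[ sum_t c_t(k) ]
print(totals.min(axis=1).mean())  # E[ min_k sum_t c_t(k) ]  -- never larger
```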

SLIDE 24

Regret

➢ Therefore, in principle, we have to upper bound $\mathbb{E}(R_T)$, where the expectation is over the randomness of arm sampling

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

$$\mathbb{E}(R_T) = \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \mathbb{E}\Big[\min_{k\in[n]} \sum_{t\in[T]} c_t(k)\Big] \geq \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)]$$

The last expression is called the pseudo-regret $\bar{R}_T$

➢ A good regret guarantee implies a good pseudo-regret guarantee, but not the reverse

SLIDE 25

Bounding regret turns out to be challenging

➢ Exp3 is not sufficient to guarantee small regret
➢ Next, we instead prove that Exp3 has small pseudo-regret
  • As is typical in many works
➢ A slight modification of Exp3 can be proved to have small regret

SLIDE 26

Outline

➢ The Adversarial Multi-armed Bandit Problem
➢ A Basic Algorithm: Exp3
➢ Regret Analysis of Exp3

SLIDE 27

High-level idea of the proof

➢ Pretend to be in the full-information setting with cost equal to the estimate $\hat{c}_t$
➢ Relate $\hat{c}_t$ to $c_t$, since we know $\hat{c}_t$ is an unbiased estimator of $c_t$

Theorem. The pseudo-regret of Exp3 is $O(\sqrt{nT \ln n})$.
SLIDE 28

Imitate a Full-Info Setting with Cost $\hat{c}_t$

➢ Recall the regret bound for the full-information setting:
$$R_T^{\text{full}} \leq \frac{\ln n}{\varepsilon} + \varepsilon\, T$$
➢ This holds for any cost vector, thus also for $\hat{c}_t$
➢ But... one issue is that $\hat{c}_t(i_t)$ may be greater than 1
➢ Not a big issue – the same analysis yields the following bound:
$$R_T^{\text{full}} \leq \frac{\ln n}{\varepsilon} + \varepsilon \max_{i} \sum_{t\in[T]} \hat{c}_t(i)^2$$

Real issue: $\hat{c}_t(i)$ may be so large that we cannot bound $R_T^{\text{full}}$

SLIDE 29

Imitate a Full-Info Setting with Cost $\hat{c}_t$

A regret bound as follows turns out to work for our proof:
$$R_T^{\text{full}} \leq \frac{\ln n}{\varepsilon} + \varepsilon \sum_{t} \sum_{i} p_t(i)\, \hat{c}_t(i)^2$$

➢ That is, instead of taking $\max_i$, the bound here averages over $i$
➢ Why is this more useful?
  • The $p_t(i)$ term will help cancel the $p_t(i)$ denominator in $\hat{c}_t(i) = c_t(i)/p_t(i)$
  • This turns out to be enough to bound the regret

SLIDE 30

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $p_t(i) = w_t(i)/W_t$
2. Observe cost vector $\hat{c}_t \geq 0$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, \hat{c}_t(i)}$

Note: this yields a bound of $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} T$ when $\hat{c}_t \in [0,1]^n$

SLIDE 31

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Proof: similar technique – carefully bound a certain quantity
➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

SLIDE 32

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Proof: similar technique – carefully bound a certain quantity
➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Why this term?
➢ It tracks the weight decrease (will be clear on the next slide)
➢ Algebraic reasons: $e^{-\varepsilon} \approx 1 - \varepsilon + \varepsilon^2/2$, which gives rise to both the term $p_t(i)\,\hat{c}_t(i)$ and the term $p_t(i)\,\hat{c}_t(i)^2$

SLIDE 33

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 1. $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)} = W_{t+1}/W_t$, where $W_t = \sum_i w_t(i)$.
  • The term $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$ is the rate at which $W_t$ decreases
  • Formal proof: HW exercise

SLIDE 34

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 1. $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)} = W_{t+1}/W_t$, where $W_t = \sum_i w_t(i)$.
  • The term $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$ is the rate at which $W_t$ decreases
  • Formal proof: HW exercise
  • Corollary. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) = \log W_{T+1} - \log n$
  • Telescoping sum and $W_1 = n$
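Spelled out, the telescoping behind the corollary combines Fact 1 with $W_1 = n$:

```latex
\sum_{t=1}^{T} \log\Big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\Big)
= \sum_{t=1}^{T} \log\frac{W_{t+1}}{W_t}
= \log W_{T+1} - \log W_1
= \log W_{T+1} - \log n .
```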

SLIDE 35

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 2. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) \leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$.

Follows from algebraic calculation

SLIDE 36

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 2. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) \leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$.

$$\sum_t \log\Big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\Big) \leq \sum_t \log\Big(\sum_{i\in[n]} p_t(i)\,\Big[1 - \varepsilon \hat{c}_t(i) + \frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big]\Big)$$

By $e^{-x} \leq 1 - x + x^2/2$ for $x \geq 0$

Follows from algebraic calculation

SLIDE 37

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 2. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) \leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$.

$$\sum_t \log\Big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\Big) \leq \sum_t \log\Big(\sum_{i\in[n]} p_t(i)\,\Big[1 - \varepsilon \hat{c}_t(i) + \frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big]\Big)$$
$$= \sum_t \log\Big(1 - \sum_{i\in[n]} p_t(i)\,\varepsilon \hat{c}_t(i) + \sum_{i\in[n]} p_t(i)\,\frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big)$$

Since $\sum_{i\in[n]} p_t(i) = 1$

Follows from algebraic calculation

SLIDE 38

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 2. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) \leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$.

$$\sum_t \log\Big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\Big) \leq \sum_t \log\Big(\sum_{i\in[n]} p_t(i)\,\Big[1 - \varepsilon \hat{c}_t(i) + \frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big]\Big)$$
$$= \sum_t \log\Big(1 - \sum_{i\in[n]} p_t(i)\,\varepsilon \hat{c}_t(i) + \sum_{i\in[n]} p_t(i)\,\frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big)$$
$$\leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$$

Since $\log(1+x) \leq x$ for any $x > -1$

Follows from algebraic calculation
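A quick numerical check of the two elementary inequalities used in this chain (throwaway code; the grids are arbitrary):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 1_000)                # entries eps * c_hat_t(i) are >= 0
assert np.all(np.exp(-x) <= 1.0 - x + x**2 / 2)  # e^{-x} <= 1 - x + x^2/2 for x >= 0

y = np.linspace(-0.99, 10.0, 1_000)              # log(1 + y) needs y > -1
assert np.all(np.log1p(y) <= y + 1e-12)          # log(1 + y) <= y
print("both inequalities hold on the test grids")
```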

SLIDE 39

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$
➢ Combining the two facts yields the lemma
  • HW exercise

SLIDE 40

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Recall the pseudo-regret definition:
$$\bar{R}_T = \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t] - \min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)] = \max_{k\in[n]} \Big( \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t] - \sum_{t\in[T]} \mathbb{E}[c_t(k)] \Big) = \max_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)]$$

The term $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)]$ is the pseudo-regret from action $k$.

SLIDE 41

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ That is, the expected pseudo-regret from $k$ w.r.t. the true cost $c_t$ equals that w.r.t. the estimated cost $\hat{c}_t$

Recall the pseudo-regret definition:
$$\bar{R}_T = \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t] - \min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)] = \max_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)]$$

(pseudo-regret from action $k$)

SLIDE 42

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ Proof:
$$\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)] = \mathbb{E}\big[\,\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k) \mid p_t]\,\big]$$

Because the randomness of $\hat{c}_t$ comes from:
1. the randomness of $i_t \sim p_t$
2. the randomness of $p_t$ itself, which depends on $i_1, \cdots, i_{t-1}$
SLIDE 43

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ Proof:
$$\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)] = \mathbb{E}\big[\,\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k) \mid p_t]\,\big] = \mathbb{E}\big[\,\mathbb{E}[c_t \cdot p_t - c_t(k) \mid p_t]\,\big]$$

Because conditioned on $p_t$, $\hat{c}_t$ is an unbiased estimator of $c_t$

SLIDE 44

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ Proof:
$$\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)] = \mathbb{E}\big[\,\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k) \mid p_t]\,\big] = \mathbb{E}\big[\,\mathbb{E}[c_t \cdot p_t - c_t(k) \mid p_t]\,\big] = \mathbb{E}[c_t \cdot p_t - c_t(k)]$$

SLIDE 45

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$

By Lemma 2 (first equality) and Lemma 1 (the inequality)

SLIDE 46

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2 \,\Big|\, p_t\Big]\Big]$$

By conditional expectation

SLIDE 47

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2 \,\Big|\, p_t\Big]\Big] = \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \mathbb{E}\big[\hat{c}_t(i)^2 \mid p_t\big]\Big]$$

By linearity of expectation

SLIDE 48

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2 \,\Big|\, p_t\Big]\Big] = \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \mathbb{E}\big[\hat{c}_t(i)^2 \mid p_t\big]\Big]$$

Observe $\mathbb{E}\big[\hat{c}_t(i)^2 \mid p_t\big] = 0 \cdot (1 - p_t(i)) + \Big(\frac{c_t(i)}{p_t(i)}\Big)^2 \cdot p_t(i) = \frac{c_t(i)^2}{p_t(i)}$


SLIDE 50

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2 \,\Big|\, p_t\Big]\Big] = \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \mathbb{E}\big[\hat{c}_t(i)^2 \mid p_t\big]\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\sum_t \sum_i c_t(i)^2\Big] \leq \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, nT$$

Picking $\varepsilon = \sqrt{\frac{2 \ln n}{nT}}$ yields a pseudo-regret bound of $O(\sqrt{nT \ln n})$
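The tuning step numerically, as a tiny helper (illustrative; it just minimizes the displayed bound over $\varepsilon$):

```python
import math

def exp3_tuned_bound(n, T):
    """Minimize ln(n)/eps + (eps/2) * n * T over eps > 0."""
    eps = math.sqrt(2 * math.log(n) / (n * T))    # eps* = sqrt(2 ln n / (nT))
    bound = math.log(n) / eps + eps * n * T / 2   # equals sqrt(2 nT ln n)
    return eps, bound

print(exp3_tuned_bound(n=10, T=100_000))          # bound ~ sqrt(2 * nT * ln n)
```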

SLIDE 51

Summary of the Proof

➢ A tighter regret bound for the full-information setting
➢ Treat the (realized) estimate $\hat{c}_t$ as the cost in the full-information setting
➢ Expected pseudo-regret w.r.t. $c_t$ equals expected pseudo-regret w.r.t. $\hat{c}_t$
➢ Upper bound the pseudo-regret by taking expectation over the $\hat{c}_t$'s

SLIDE 52

The True Regret and Beyond

➢ Exp3 does not guarantee good true regret, still because $c_t(i)/p_t(i)$ may be too large
  • Pseudo-regret "smooths out" $p_t(i)$ by taking expectations first
➢ To obtain good true regret, we need to modify Exp3 by adding some uniform exploration so that $p_t(i)$ is never too small (see the sketch after this list)
  • More intricate analysis, but it yields the same regret bound $O(\sqrt{nT \ln n})$
➢ In addition to adversarial feedback, a "nicer" setting is when the cost of each arm is drawn from a fixed but unknown distribution
  • Called stochastic multi-armed bandits
  • Naturally, Exp3 and its regret bound $O(\sqrt{nT \ln n})$ still apply
  • But a better algorithm called Upper Confidence Bound (UCB) yields a much better regret bound of $O(n \ln T)$
  • Different analysis techniques
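As a rough illustration of the uniform-exploration fix mentioned in the second bullet above, here is a hedged sketch in the spirit of Exp3 with explicit exploration; the mixing parameter `gamma` and its tuning are assumptions here, not the tuned constants from the literature.

```python
import numpy as np

def exp3_with_exploration(T, n, adversary, eps, gamma, seed=0):
    """Exp3 mixed with gamma-uniform exploration so that p_t(i) >= gamma / n."""
    rng = np.random.default_rng(seed)
    w = np.ones(n)
    for t in range(T):
        q = w / w.sum()
        p = (1 - gamma) * q + gamma / n   # floor every probability at gamma/n
        c = adversary(p)
        i = rng.choice(n, p=p)
        c_hat = np.zeros(n)
        c_hat[i] = c[i] / p[i]            # now bounded by n / gamma
        w *= np.exp(-eps * c_hat)
    return w
```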
SLIDE 53

Thank You

Haifeng Xu
University of Virginia
hx4ad@virginia.edu