
Advanced Econometrics 2, Hilary term 2021: Multi-armed bandits



  1. Advanced Econometrics 2, Hilary term 2021: Multi-armed bandits. Maximilian Kasy, Department of Economics, Oxford University.

  2. Agenda
◮ Thus far: “Supervised machine learning” – data are given. Next: “Active learning” – experimentation.
◮ Setup: the multi-armed bandit problem. An adaptive experiment with an exploration / exploitation trade-off.
◮ Two popular approximate algorithms: 1. Thompson sampling, 2. the Upper Confidence Bound algorithm.
◮ Characterizing regret.
◮ Characterizing an exact solution: the Gittins index.
◮ Extension to settings with covariates (contextual bandits).

  3. Takeaways for this part of class
◮ When experimental units arrive over time and we can adapt our treatment choices, we can learn the optimal treatment quickly.
◮ Treatment choice: a trade-off between 1. choosing good treatments now (exploitation), and 2. learning for future treatment choices (exploration).
◮ Optimal solutions are hard, but good heuristics are available.
◮ We will derive a bound on the regret of one heuristic,
  ◮ bounding the number of times a sub-optimal treatment is chosen,
  ◮ using large deviations bounds (cf. testing!).
◮ We will also derive a characterization of the optimal solution in the infinite-horizon case. This relies on a separate index for each arm.

  4. The multi-armed bandit: Setup
◮ Treatments: $D_t \in \{1, \dots, k\}$.
◮ Experimental units come in sequentially over time. One unit per time period $t = 1, 2, \dots$
◮ Potential outcomes: i.i.d. over time, $Y_t = Y_t^{D_t}$, $Y_t^d \sim F^d$, $E[Y_t^d] = \theta^d$.
◮ Treatment assignment can depend on past treatments and outcomes, $D_{t+1} = d_t(D_1, \dots, D_t, Y_1, \dots, Y_t)$.
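A minimal simulation sketch of this setup. The Bernoulli outcome distributions, the $\theta^d$ values, and the class name `BanditEnvironment` are illustrative assumptions, not part of the slides:

```python
import numpy as np

class BanditEnvironment:
    """Minimal k-armed bandit DGP: arm d yields Y_t^d ~ Bernoulli(theta[d]), i.i.d. over t."""

    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta, dtype=float)  # mean outcomes theta^d
        self.rng = np.random.default_rng(seed)

    def draw(self, d):
        """Outcome Y_t = Y_t^{D_t} for the arm d assigned in the current period."""
        return int(self.rng.binomial(1, self.theta[d]))


env = BanditEnvironment(theta=[0.3, 0.5, 0.7])  # hypothetical theta^d values
y = env.draw(1)  # outcome of assigning treatment d = 1 to the first unit
```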

  5. The multi-armed bandit: Setup continued
◮ Optimal treatment: $d^* = \operatorname{argmax}_d \theta^d$, $\theta^* = \max_d \theta^d = \theta^{d^*}$.
◮ Expected regret for treatment $d$: $\Delta^d = E\big[Y^{d^*} - Y^d\big] = \theta^{d^*} - \theta^d$.
◮ Finite horizon objective: average outcome, $U_T = \frac{1}{T}\sum_{1 \le t \le T} Y_t$.
◮ Infinite horizon objective: discounted average outcome, $U_\infty = \sum_{t \ge 1} \beta^t Y_t$.
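To make these objects concrete, here is a small sketch that simulates a uniformly random (non-adaptive) assignment policy and computes the realized $U_T$ and the expected regret $\frac{1}{T}\sum_t \Delta^{D_t}$ of that policy. The $\theta^d$ values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.5, 0.7])       # hypothetical theta^d; theta* = 0.7, d* = 2
Delta = theta.max() - theta             # expected regret Delta^d of each arm
T = 10_000

D = rng.integers(0, len(theta), size=T)  # uniformly random, non-adaptive assignments
Y = rng.binomial(1, theta[D])            # realized outcomes Y_t = Y_t^{D_t}

U_T = Y.mean()        # finite horizon objective (realized average outcome)
R_T = Delta[D].mean() # expected finite horizon regret of this policy
print(U_T, R_T)       # R_T should be close to Delta.mean() = 0.2 for this policy
```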

  6. The multi-armed bandit: Expectations of objectives
◮ Expected finite horizon objective: $$E[U_T] = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T} \theta^{D_t}\Big].$$
◮ Expected infinite horizon objective: $$E[U_\infty] = E\Big[\textstyle\sum_{t \ge 1} \beta^t \theta^{D_t}\Big].$$
◮ Expected finite horizon regret: compare to always assigning the optimal treatment $d^*$, $$R_T = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T}\big(Y_t^{d^*} - Y_t\big)\Big] = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T} \Delta^{D_t}\Big].$$

  7. The multi-armed bandit: Practice problem
◮ Show that these equalities hold.
◮ Interpret these objectives.
◮ Relate them to our decision theory terminology.

  8. Two popular algorithms: Upper Confidence Bound (UCB) algorithm
◮ Define $$\bar{Y}_t^d = \frac{1}{T_t^d}\sum_{1 \le s \le t} \mathbf{1}(D_s = d)\cdot Y_s, \qquad T_t^d = \sum_{1 \le s \le t} \mathbf{1}(D_s = d), \qquad B_t^d = B(T_t^d).$$
◮ $B(\cdot)$ is a decreasing function, giving the width of the “confidence interval.” We will specify this function later.
◮ At time $t+1$, choose $$D_{t+1} = \operatorname{argmax}_d\; \bar{Y}_t^d + B_t^d.$$
◮ “Optimism in the face of uncertainty.”
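A sketch of the UCB rule in Python, assuming Bernoulli outcomes and using the bonus for distributions bounded by $[0,1]$ that is specified on a later slide, $B_t^d = \sqrt{\alpha\log(t)/(2T_t^d)}$. The function name `ucb_bandit` and the values of $\theta$ and $\alpha$ are illustrative:

```python
import numpy as np

def ucb_bandit(theta, T=5000, alpha=3.0, seed=0):
    """UCB: at time t+1 pick argmax_d Ybar_t^d + B_t^d, after pulling each arm once."""
    rng = np.random.default_rng(seed)
    k = len(theta)
    pulls = np.zeros(k)   # T_t^d: number of times arm d has been assigned
    sums = np.zeros(k)    # running sum of outcomes for each arm
    for t in range(1, T + 1):
        if t <= k:                   # initialization: pull every arm once
            d = t - 1
        else:
            ybar = sums / pulls
            bonus = np.sqrt(alpha * np.log(t) / (2 * pulls))  # B_t^d for Y in [0,1]
            d = int(np.argmax(ybar + bonus))
        y = rng.binomial(1, theta[d])  # Bernoulli outcomes as an example of bounded Y
        pulls[d] += 1
        sums[d] += y
    return pulls

print(ucb_bandit([0.3, 0.5, 0.7]))  # most pulls should go to the best arm
```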

  9. Two popular algorithms: Thompson sampling
◮ Start with a Bayesian prior for $\theta$.
◮ Assign each treatment with probability equal to the posterior probability that it is optimal.
◮ Put differently, obtain one draw $\hat\theta_{t+1}$ from the posterior given $(D_1, \dots, D_t, Y_1, \dots, Y_t)$, and choose $$D_{t+1} = \operatorname{argmax}_d\; \hat\theta_{t+1}^d.$$
◮ Easily extendable to more complicated dynamic decision problems, complicated priors, etc.!

  10. Two popular algorithms: Thompson sampling - the binomial case
◮ Assume that $Y \in \{0, 1\}$, $Y_t^d \sim \mathrm{Ber}(\theta^d)$.
◮ Start with a uniform prior for $\theta$ on $[0, 1]^k$.
◮ Then the posterior for $\theta^d$ at time $t+1$ is a Beta distribution with parameters $$\alpha_t^d = 1 + T_t^d \cdot \bar{Y}_t^d, \qquad \beta_t^d = 1 + T_t^d \cdot \big(1 - \bar{Y}_t^d\big).$$
◮ Thus $D_{t+1} = \operatorname{argmax}_d\; \hat\theta_{t+1}^d$, where $\hat\theta_{t+1}^d \sim \mathrm{Beta}(\alpha_t^d, \beta_t^d)$ is a random draw from the posterior.
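A sketch of Beta-Bernoulli Thompson sampling along these lines. The function name `thompson_bernoulli` and the $\theta^d$ values are illustrative:

```python
import numpy as np

def thompson_bernoulli(theta, T=5000, seed=0):
    """Thompson sampling for Bernoulli outcomes, starting from a uniform prior on [0,1]^k."""
    rng = np.random.default_rng(seed)
    k = len(theta)
    successes = np.zeros(k)  # T_t^d * Ybar_t^d
    failures = np.zeros(k)   # T_t^d * (1 - Ybar_t^d)
    for t in range(T):
        draw = rng.beta(1 + successes, 1 + failures)  # one draw from each arm's posterior
        d = int(np.argmax(draw))                      # assign the arm that looks best
        y = rng.binomial(1, theta[d])
        successes[d] += y
        failures[d] += 1 - y
    return successes + failures  # number of pulls per arm

print(thompson_bernoulli([0.3, 0.5, 0.7]))
```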

  11. Regret bounds
◮ Back to the general case.
◮ Recall expected finite horizon regret, $$R_T = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T}\big(Y_t^{d^*} - Y_t\big)\Big] = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T} \Delta^{D_t}\Big].$$
◮ Thus, $T \cdot R_T = \sum_d E[T_T^d]\cdot \Delta^d$.
◮ Good algorithms will have $E[T_T^d]$ small when $\Delta^d > 0$.
◮ We will next derive upper bounds on $E[T_T^d]$ for the UCB algorithm.
◮ We will then state that for large $T$ similar upper bounds hold for Thompson sampling.
◮ There is also a lower bound on regret across all possible algorithms which is the same, up to a constant.
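The decomposition in the third bullet follows by grouping periods according to the assigned arm, using $T_T^d = \sum_{1 \le t \le T} \mathbf{1}(D_t = d)$:

$$T \cdot R_T = E\Big[\sum_{1 \le t \le T} \Delta^{D_t}\Big] = E\Big[\sum_d \Delta^d \sum_{1 \le t \le T} \mathbf{1}(D_t = d)\Big] = \sum_d \Delta^d \, E[T_T^d].$$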

  12. Regret bounds: Probability theory preliminary - large deviations
◮ Suppose that $E[\exp(\lambda \cdot (Y - E[Y]))] \le \exp(\psi(\lambda))$.
◮ Let $\bar{Y}_T = \frac{1}{T}\sum_{1 \le t \le T} Y_t$ for i.i.d. $Y_t$. Then, by Markov’s inequality and independence across $t$, $$P\big(\bar{Y}_T - E[Y] > \varepsilon\big) \le \frac{E[\exp(\lambda \cdot (\bar{Y}_T - E[Y]))]}{\exp(\lambda \cdot \varepsilon)} = \frac{\prod_{1 \le t \le T} E[\exp((\lambda / T)\cdot(Y_t - E[Y]))]}{\exp(\lambda \cdot \varepsilon)} \le \exp\big(T\,\psi(\lambda / T) - \lambda \cdot \varepsilon\big).$$

  13. Regret bounds: Large deviations continued
◮ Define the Legendre transformation of $\psi$ as $$\psi^*(\varepsilon) = \sup_{\lambda \ge 0}\,\big[\lambda \cdot \varepsilon - \psi(\lambda)\big].$$
◮ Taking the inf over $\lambda$ on the previous slide implies $$P\big(\bar{Y}_T - E[Y] > \varepsilon\big) \le \exp\big(-T \cdot \psi^*(\varepsilon)\big).$$
◮ For distributions bounded by $[0, 1]$: $\psi(\lambda) = \lambda^2 / 8$ and $\psi^*(\varepsilon) = 2\varepsilon^2$.
◮ For normal distributions: $\psi(\lambda) = \lambda^2\sigma^2 / 2$ and $\psi^*(\varepsilon) = \varepsilon^2 / (2\sigma^2)$.
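As a check on the two special cases (a short derivation, not on the slides), the suprema can be computed in closed form from the first-order condition in $\lambda$, for $\varepsilon \ge 0$:

$$\psi(\lambda) = \tfrac{\lambda^2}{8}: \quad \lambda^* = 4\varepsilon, \quad \psi^*(\varepsilon) = 4\varepsilon^2 - 2\varepsilon^2 = 2\varepsilon^2;$$
$$\psi(\lambda) = \tfrac{\lambda^2\sigma^2}{2}: \quad \lambda^* = \varepsilon/\sigma^2, \quad \psi^*(\varepsilon) = \tfrac{\varepsilon^2}{\sigma^2} - \tfrac{\varepsilon^2}{2\sigma^2} = \tfrac{\varepsilon^2}{2\sigma^2}.$$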

  14. Regret bounds: Applied to the bandit setting
◮ Suppose that for all $d$ $$E[\exp(\lambda \cdot (Y^d - \theta^d))] \le \exp(\psi(\lambda)), \qquad E[\exp(-\lambda \cdot (Y^d - \theta^d))] \le \exp(\psi(\lambda)).$$
◮ Recall / define $$\bar{Y}_t^d = \frac{1}{T_t^d}\sum_{1 \le s \le t} \mathbf{1}(D_s = d)\cdot Y_s, \qquad B_t^d = (\psi^*)^{-1}\!\left(\frac{\alpha\log(t)}{T_t^d}\right).$$
◮ Then we get $$P\big(\bar{Y}_t^d - \theta^d > B_t^d\big) \le \exp\big(-T_t^d \cdot \psi^*(B_t^d)\big) = \exp(-\alpha\log(t)) = t^{-\alpha}, \qquad P\big(\bar{Y}_t^d - \theta^d < -B_t^d\big) \le t^{-\alpha}.$$

  15. Regret bounds: Why this choice of $B(\cdot)$?
◮ A smaller $B(\cdot)$ is better for exploitation.
◮ A larger $B(\cdot)$ is better for exploration.
◮ Special cases:
  ◮ Distributions bounded by $[0, 1]$: $B_t^d = \sqrt{\dfrac{\alpha\log(t)}{2 T_t^d}}$.
  ◮ Normal distributions: $B_t^d = \sqrt{\dfrac{2\sigma^2\alpha\log(t)}{T_t^d}}$.
◮ The $\alpha\log(t)$ term ensures that coverage goes to 1, but slowly enough not to waste too much in terms of exploitation.
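The two special-case bonus formulas as a small code sketch; the function names and the default $\alpha$ are illustrative assumptions:

```python
import numpy as np

def bonus_bounded(t, n_d, alpha=3.0):
    """B_t^d for outcomes bounded in [0,1]: sqrt(alpha * log(t) / (2 * T_t^d))."""
    return np.sqrt(alpha * np.log(t) / (2 * n_d))

def bonus_normal(t, n_d, sigma=1.0, alpha=3.0):
    """B_t^d for normal outcomes: sqrt(2 * sigma^2 * alpha * log(t) / T_t^d)."""
    return np.sqrt(2 * sigma**2 * alpha * np.log(t) / n_d)

# The bonus shrinks with the number of pulls n_d and grows slowly (logarithmically) in t.
print(bonus_bounded(t=100, n_d=10), bonus_bounded(t=100, n_d=100))
```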

  16. Regret bounds: When $d$ is chosen by the UCB algorithm
◮ By definition of UCB, at least one of these three events has to hold when $d$ is chosen at time $t+1$: $$\bar{Y}_t^{d^*} + B_t^{d^*} \le \theta^* \quad (1)$$ $$\bar{Y}_t^{d} - B_t^{d} > \theta^d \quad (2)$$ $$2 B_t^d > \Delta^d \quad (3)$$
◮ (1) and (2) have low probability. By the previous slide, $$P\big(\bar{Y}_t^{d^*} + B_t^{d^*} \le \theta^*\big) \le t^{-\alpha}, \qquad P\big(\bar{Y}_t^{d} - B_t^{d} > \theta^d\big) \le t^{-\alpha}.$$
◮ (3) only happens when $T_t^d$ is small. By definition of $B_t^d$, (3) happens iff $$T_t^d < \frac{\alpha\log(t)}{\psi^*(\Delta^d / 2)}.$$

  17. Regret bounds: Practice problem
Show that at least one of statements (1), (2), or (3) has to be true whenever $D_{t+1} = d$, for the UCB algorithm.

  18. Regret bounds: Bounding $E[T_T^d]$
◮ Let $$\tilde{T}_T^d = \left\lceil \frac{\alpha\log(T)}{\psi^*(\Delta^d / 2)} \right\rceil.$$
◮ Forcing the algorithm to pick $d$ for the first $\tilde{T}_T^d$ periods can only increase $T_T^d$.
◮ We can collect our results to get $$E[T_T^d] = \sum_{1 \le t \le T} E[\mathbf{1}(D_t = d)] \le \tilde{T}_T^d + \sum_{\tilde{T}_T^d < t \le T} E[\mathbf{1}(D_t = d)]$$ $$\le \tilde{T}_T^d + \sum_{\tilde{T}_T^d < t \le T} E\big[\mathbf{1}\big(\text{(1) or (2) is true at } t\big)\big]$$ $$\le \tilde{T}_T^d + \sum_{\tilde{T}_T^d < t \le T} \Big( E\big[\mathbf{1}\big(\text{(1) is true at } t\big)\big] + E\big[\mathbf{1}\big(\text{(2) is true at } t\big)\big] \Big)$$ $$\le \tilde{T}_T^d + \sum_{\tilde{T}_T^d < t \le T} 2\, t^{-\alpha + 1} \le \tilde{T}_T^d + \frac{\alpha}{\alpha - 2}.$$

  19. Regret bounds: Upper bound on expected regret for UCB
◮ We thus get: $$E[T_T^d] \le \frac{\alpha\log(T)}{\psi^*(\Delta^d / 2)} + \frac{\alpha}{\alpha - 2}, \qquad R_T \le \frac{1}{T}\sum_d \left( \frac{\alpha\log(T)}{\psi^*(\Delta^d / 2)} + \frac{\alpha}{\alpha - 2} \right)\cdot \Delta^d.$$
◮ Expected regret (difference to the optimal policy) goes to 0 at a rate of $O(\log(T)/T)$ – pretty fast!
◮ While the cost of “getting treatment wrong” is $\Delta^d$, the difficulty of figuring out the right treatment is of order $1/\psi^*(\Delta^d / 2)$. Typically, this is of order $(1/\Delta^d)^2$.
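For instance, for distributions bounded by $[0,1]$ we have $\psi^*(\Delta^d/2) = 2(\Delta^d/2)^2 = (\Delta^d)^2/2$, so the bound specializes (summing over arms with $\Delta^d > 0$) to

$$E[T_T^d] \le \frac{2\alpha\log(T)}{(\Delta^d)^2} + \frac{\alpha}{\alpha - 2}, \qquad R_T \le \frac{1}{T}\sum_{d:\,\Delta^d > 0} \left( \frac{2\alpha\log(T)}{(\Delta^d)^2} + \frac{\alpha}{\alpha - 2} \right)\Delta^d,$$

which makes the $(1/\Delta^d)^2$ order of the difficulty term explicit.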

  20. Regret bounds: Related bounds - rate optimality
◮ Lower bound: Consider the bandit problem with binary outcomes and any algorithm such that $E[T_t^d] = o(t^a)$ for all $a > 0$. Then $$\liminf_{T \to \infty} \frac{T \cdot R_T}{\log(T)} \ge \sum_d \frac{\Delta^d}{kl(\theta^d, \theta^*)},$$ where $kl(p, q) = p \cdot \log(p/q) + (1 - p)\cdot\log((1 - p)/(1 - q))$.
◮ Upper bound for Thompson sampling: In the binary outcome setting, Thompson sampling achieves this bound, i.e., $$\liminf_{T \to \infty} \frac{T \cdot R_T}{\log(T)} = \sum_d \frac{\Delta^d}{kl(\theta^d, \theta^*)}.$$
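A small sketch evaluating the lower-bound constant $\sum_{d:\,\Delta^d > 0} \Delta^d / kl(\theta^d, \theta^*)$ for a hypothetical parameter vector; the sum runs over suboptimal arms, and the interpretation as the asymptotic slope of $T \cdot R_T$ in $\log(T)$ follows the bound above:

```python
import numpy as np

def kl_bernoulli(p, q):
    """kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)) for Bernoulli distributions."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = np.array([0.3, 0.5, 0.7])  # hypothetical theta^d
theta_star = theta.max()
constant = sum((theta_star - th) / kl_bernoulli(th, theta_star)
               for th in theta if th < theta_star)
print(constant)  # lower-bound constant for this hypothetical theta
```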
