
Advanced Econometrics 2, Hilary term 2021: Multi-armed bandits



  1. Advanced Econometrics 2, Hilary term 2021: Multi-armed bandits. Maximilian Kasy, Department of Economics, Oxford University.

  2. Agenda
◮ Thus far: “Supervised machine learning” – data are given. Next: “Active learning” – experimentation.
◮ Setup: the multi-armed bandit problem. An adaptive experiment with an exploration / exploitation trade-off.
◮ Two popular approximate algorithms: 1. Thompson sampling, 2. the Upper Confidence Bound algorithm.
◮ Characterizing regret.
◮ Characterizing an exact solution: the Gittins index.
◮ Extension to settings with covariates (contextual bandits).

  3. Takeaways for this part of class
◮ When experimental units arrive over time and we can adapt our treatment choices, we can learn the optimal treatment quickly.
◮ Treatment choice: a trade-off between 1. choosing good treatments now (exploitation), and 2. learning for future treatment choices (exploration).
◮ Optimal solutions are hard, but good heuristics are available.
◮ We will derive a bound on the regret of one heuristic,
  ◮ bounding the number of times a sub-optimal treatment is chosen,
  ◮ using large deviations bounds (cf. testing!).
◮ We will also derive a characterization of the optimal solution in the infinite-horizon case. This relies on a separate index for each arm.

  4. The multi-armed bandit: Setup
◮ Treatments: $D_t \in \{1, \dots, k\}$.
◮ Experimental units come in sequentially over time. One unit per time period $t = 1, 2, \dots$
◮ Potential outcomes: i.i.d. over time, $Y_t = Y_t^{D_t}$, $Y_t^d \sim F^d$, $E[Y_t^d] = \theta^d$.
◮ Treatment assignment can depend on past treatments and outcomes, $D_{t+1} = d_t(D_1, \dots, D_t, Y_1, \dots, Y_t)$.
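A minimal simulation sketch of this setup. The Bernoulli outcome distributions, the $\theta^d$ values, and the class name `BanditEnvironment` are illustrative assumptions, not part of the slides:

```python
import numpy as np

class BanditEnvironment:
    """Minimal k-armed bandit DGP: arm d yields Y_t^d ~ Bernoulli(theta[d]), i.i.d. over t."""

    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta, dtype=float)  # mean outcomes theta^d
        self.rng = np.random.default_rng(seed)

    def draw(self, d):
        """Outcome Y_t = Y_t^{D_t} for the arm d assigned in the current period."""
        return int(self.rng.binomial(1, self.theta[d]))


env = BanditEnvironment(theta=[0.3, 0.5, 0.7])  # hypothetical theta^d values
y = env.draw(1)  # outcome of assigning treatment d = 1 to the first unit
```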

  5. The multi-armed bandit: Setup continued
◮ Optimal treatment: $d^* = \operatorname{argmax}_d \theta^d$, $\theta^* = \max_d \theta^d = \theta^{d^*}$.
◮ Expected regret for treatment $d$: $\Delta^d = E\big[Y^{d^*} - Y^d\big] = \theta^{d^*} - \theta^d$.
◮ Finite horizon objective: average outcome, $U_T = \frac{1}{T}\sum_{1 \le t \le T} Y_t$.
◮ Infinite horizon objective: discounted average outcome, $U_\infty = \sum_{t \ge 1} \beta^t Y_t$.
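To make these objects concrete, here is a small sketch that simulates a uniformly random (non-adaptive) assignment policy and computes the realized $U_T$ and the expected regret $\frac{1}{T}\sum_t \Delta^{D_t}$ of that policy. The $\theta^d$ values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.5, 0.7])       # hypothetical theta^d; theta* = 0.7, d* = 2
Delta = theta.max() - theta             # expected regret Delta^d of each arm
T = 10_000

D = rng.integers(0, len(theta), size=T)  # uniformly random, non-adaptive assignments
Y = rng.binomial(1, theta[D])            # realized outcomes Y_t = Y_t^{D_t}

U_T = Y.mean()        # finite horizon objective (realized average outcome)
R_T = Delta[D].mean() # expected finite horizon regret of this policy
print(U_T, R_T)       # R_T should be close to Delta.mean() = 0.2 for this policy
```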

  6. The multi-armed bandit: Expectations of objectives
◮ Expected finite horizon objective: $$E[U_T] = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T} \theta^{D_t}\Big].$$
◮ Expected infinite horizon objective: $$E[U_\infty] = E\Big[\textstyle\sum_{t \ge 1} \beta^t \theta^{D_t}\Big].$$
◮ Expected finite horizon regret: compare to always assigning the optimal treatment $d^*$, $$R_T = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T}\big(Y_t^{d^*} - Y_t\big)\Big] = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T} \Delta^{D_t}\Big].$$

  7. The multi-armed bandit: Practice problem
◮ Show that these equalities hold.
◮ Interpret these objectives.
◮ Relate them to our decision theory terminology.

  8. Two popular algorithms: Upper Confidence Bound (UCB) algorithm
◮ Define $$\bar{Y}_t^d = \frac{1}{T_t^d}\sum_{1 \le s \le t} \mathbf{1}(D_s = d)\cdot Y_s, \qquad T_t^d = \sum_{1 \le s \le t} \mathbf{1}(D_s = d), \qquad B_t^d = B(T_t^d).$$
◮ $B(\cdot)$ is a decreasing function, giving the width of the “confidence interval.” We will specify this function later.
◮ At time $t+1$, choose $$D_{t+1} = \operatorname{argmax}_d\; \bar{Y}_t^d + B_t^d.$$
◮ “Optimism in the face of uncertainty.”
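A sketch of the UCB rule in Python, assuming Bernoulli outcomes and using the bonus for distributions bounded by $[0,1]$ that is specified on a later slide, $B_t^d = \sqrt{\alpha\log(t)/(2T_t^d)}$. The function name `ucb_bandit` and the values of $\theta$ and $\alpha$ are illustrative:

```python
import numpy as np

def ucb_bandit(theta, T=5000, alpha=3.0, seed=0):
    """UCB: at time t+1 pick argmax_d Ybar_t^d + B_t^d, after pulling each arm once."""
    rng = np.random.default_rng(seed)
    k = len(theta)
    pulls = np.zeros(k)   # T_t^d: number of times arm d has been assigned
    sums = np.zeros(k)    # running sum of outcomes for each arm
    for t in range(1, T + 1):
        if t <= k:                   # initialization: pull every arm once
            d = t - 1
        else:
            ybar = sums / pulls
            bonus = np.sqrt(alpha * np.log(t) / (2 * pulls))  # B_t^d for Y in [0,1]
            d = int(np.argmax(ybar + bonus))
        y = rng.binomial(1, theta[d])  # Bernoulli outcomes as an example of bounded Y
        pulls[d] += 1
        sums[d] += y
    return pulls

print(ucb_bandit([0.3, 0.5, 0.7]))  # most pulls should go to the best arm
```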

  9. Two popular algorithms: Thompson sampling
◮ Start with a Bayesian prior for $\theta$.
◮ Assign each treatment with probability equal to the posterior probability that it is optimal.
◮ Put differently, obtain one draw $\hat\theta_{t+1}$ from the posterior given $(D_1, \dots, D_t, Y_1, \dots, Y_t)$, and choose $$D_{t+1} = \operatorname{argmax}_d\; \hat\theta_{t+1}^d.$$
◮ Easily extendable to more complicated dynamic decision problems, complicated priors, etc.!

  10. Two popular algorithms: Thompson sampling - the binomial case
◮ Assume that $Y \in \{0, 1\}$, $Y_t^d \sim \mathrm{Ber}(\theta^d)$.
◮ Start with a uniform prior for $\theta$ on $[0, 1]^k$.
◮ Then the posterior for $\theta^d$ at time $t+1$ is a Beta distribution with parameters $$\alpha_t^d = 1 + T_t^d \cdot \bar{Y}_t^d, \qquad \beta_t^d = 1 + T_t^d \cdot \big(1 - \bar{Y}_t^d\big).$$
◮ Thus $D_{t+1} = \operatorname{argmax}_d\; \hat\theta_{t+1}^d$, where $\hat\theta_{t+1}^d \sim \mathrm{Beta}(\alpha_t^d, \beta_t^d)$ is a random draw from the posterior.
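A sketch of Beta-Bernoulli Thompson sampling along these lines. The function name `thompson_bernoulli` and the $\theta^d$ values are illustrative:

```python
import numpy as np

def thompson_bernoulli(theta, T=5000, seed=0):
    """Thompson sampling for Bernoulli outcomes, starting from a uniform prior on [0,1]^k."""
    rng = np.random.default_rng(seed)
    k = len(theta)
    successes = np.zeros(k)  # T_t^d * Ybar_t^d
    failures = np.zeros(k)   # T_t^d * (1 - Ybar_t^d)
    for t in range(T):
        draw = rng.beta(1 + successes, 1 + failures)  # one draw from each arm's posterior
        d = int(np.argmax(draw))                      # assign the arm that looks best
        y = rng.binomial(1, theta[d])
        successes[d] += y
        failures[d] += 1 - y
    return successes + failures  # number of pulls per arm

print(thompson_bernoulli([0.3, 0.5, 0.7]))
```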

  11. Regret bounds
◮ Back to the general case.
◮ Recall expected finite horizon regret, $$R_T = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T}\big(Y_t^{d^*} - Y_t\big)\Big] = E\Big[\tfrac{1}{T}\textstyle\sum_{1 \le t \le T} \Delta^{D_t}\Big].$$
◮ Thus, $T \cdot R_T = \sum_d E[T_T^d]\cdot \Delta^d$.
◮ Good algorithms will have $E[T_T^d]$ small when $\Delta^d > 0$.
◮ We will next derive upper bounds on $E[T_T^d]$ for the UCB algorithm.
◮ We will then state that for large $T$ similar upper bounds hold for Thompson sampling.
◮ There is also a lower bound on regret across all possible algorithms which is the same, up to a constant.
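The decomposition in the third bullet follows by grouping periods according to the assigned arm, using $T_T^d = \sum_{1 \le t \le T} \mathbf{1}(D_t = d)$:

$$T \cdot R_T = E\Big[\sum_{1 \le t \le T} \Delta^{D_t}\Big] = E\Big[\sum_d \Delta^d \sum_{1 \le t \le T} \mathbf{1}(D_t = d)\Big] = \sum_d \Delta^d \, E[T_T^d].$$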

  12. Regret bounds: Probability theory preliminary - large deviations
◮ Suppose that $E[\exp(\lambda \cdot (Y - E[Y]))] \le \exp(\psi(\lambda))$.
◮ Let $\bar{Y}_T = \frac{1}{T}\sum_{1 \le t \le T} Y_t$ for i.i.d. $Y_t$. Then, by Markov’s inequality and independence across $t$, $$P\big(\bar{Y}_T - E[Y] > \varepsilon\big) \le \frac{E[\exp(\lambda \cdot (\bar{Y}_T - E[Y]))]}{\exp(\lambda \cdot \varepsilon)} = \frac{\prod_{1 \le t \le T} E[\exp((\lambda / T)\cdot(Y_t - E[Y]))]}{\exp(\lambda \cdot \varepsilon)} \le \exp\big(T\,\psi(\lambda / T) - \lambda \cdot \varepsilon\big).$$

  13. Regret bounds: Large deviations continued
◮ Define the Legendre transformation of $\psi$ as $$\psi^*(\varepsilon) = \sup_{\lambda \ge 0}\,\big[\lambda \cdot \varepsilon - \psi(\lambda)\big].$$
◮ Taking the inf over $\lambda$ on the previous slide implies $$P\big(\bar{Y}_T - E[Y] > \varepsilon\big) \le \exp\big(-T \cdot \psi^*(\varepsilon)\big).$$
◮ For distributions bounded by $[0, 1]$: $\psi(\lambda) = \lambda^2 / 8$ and $\psi^*(\varepsilon) = 2\varepsilon^2$.
◮ For normal distributions: $\psi(\lambda) = \lambda^2\sigma^2 / 2$ and $\psi^*(\varepsilon) = \varepsilon^2 / (2\sigma^2)$.
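As a check on the two special cases (a short derivation, not on the slides), the suprema can be computed in closed form from the first-order condition in $\lambda$, for $\varepsilon \ge 0$:

$$\psi(\lambda) = \tfrac{\lambda^2}{8}: \quad \lambda^* = 4\varepsilon, \quad \psi^*(\varepsilon) = 4\varepsilon^2 - 2\varepsilon^2 = 2\varepsilon^2;$$
$$\psi(\lambda) = \tfrac{\lambda^2\sigma^2}{2}: \quad \lambda^* = \varepsilon/\sigma^2, \quad \psi^*(\varepsilon) = \tfrac{\varepsilon^2}{\sigma^2} - \tfrac{\varepsilon^2}{2\sigma^2} = \tfrac{\varepsilon^2}{2\sigma^2}.$$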

  14. Regret bounds: Applied to the bandit setting
◮ Suppose that for all $d$ $$E[\exp(\lambda \cdot (Y^d - \theta^d))] \le \exp(\psi(\lambda)), \qquad E[\exp(-\lambda \cdot (Y^d - \theta^d))] \le \exp(\psi(\lambda)).$$
◮ Recall / define $$\bar{Y}_t^d = \frac{1}{T_t^d}\sum_{1 \le s \le t} \mathbf{1}(D_s = d)\cdot Y_s, \qquad B_t^d = (\psi^*)^{-1}\!\left(\frac{\alpha\log(t)}{T_t^d}\right).$$
◮ Then we get $$P\big(\bar{Y}_t^d - \theta^d > B_t^d\big) \le \exp\big(-T_t^d \cdot \psi^*(B_t^d)\big) = \exp(-\alpha\log(t)) = t^{-\alpha}, \qquad P\big(\bar{Y}_t^d - \theta^d < -B_t^d\big) \le t^{-\alpha}.$$

  15. Regret bounds: Why this choice of $B(\cdot)$?
◮ A smaller $B(\cdot)$ is better for exploitation.
◮ A larger $B(\cdot)$ is better for exploration.
◮ Special cases:
  ◮ Distributions bounded by $[0, 1]$: $B_t^d = \sqrt{\dfrac{\alpha\log(t)}{2 T_t^d}}$.
  ◮ Normal distributions: $B_t^d = \sqrt{\dfrac{2\sigma^2\alpha\log(t)}{T_t^d}}$.
◮ The $\alpha\log(t)$ term ensures that coverage goes to 1, but slowly enough not to waste too much in terms of exploitation.
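The two special-case bonus formulas as a small code sketch; the function names and the default $\alpha$ are illustrative assumptions:

```python
import numpy as np

def bonus_bounded(t, n_d, alpha=3.0):
    """B_t^d for outcomes bounded in [0,1]: sqrt(alpha * log(t) / (2 * T_t^d))."""
    return np.sqrt(alpha * np.log(t) / (2 * n_d))

def bonus_normal(t, n_d, sigma=1.0, alpha=3.0):
    """B_t^d for normal outcomes: sqrt(2 * sigma^2 * alpha * log(t) / T_t^d)."""
    return np.sqrt(2 * sigma**2 * alpha * np.log(t) / n_d)

# The bonus shrinks with the number of pulls n_d and grows slowly (logarithmically) in t.
print(bonus_bounded(t=100, n_d=10), bonus_bounded(t=100, n_d=100))
```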

  16. Regret bounds: When $d$ is chosen by the UCB algorithm
◮ By definition of UCB, at least one of these three events has to hold when $d$ is chosen at time $t+1$: $$\bar{Y}_t^{d^*} + B_t^{d^*} \le \theta^* \quad (1)$$ $$\bar{Y}_t^{d} - B_t^{d} > \theta^d \quad (2)$$ $$2 B_t^d > \Delta^d \quad (3)$$
◮ (1) and (2) have low probability. By the previous slide, $$P\big(\bar{Y}_t^{d^*} + B_t^{d^*} \le \theta^*\big) \le t^{-\alpha}, \qquad P\big(\bar{Y}_t^{d} - B_t^{d} > \theta^d\big) \le t^{-\alpha}.$$
◮ (3) only happens when $T_t^d$ is small. By definition of $B_t^d$, (3) happens iff $$T_t^d < \frac{\alpha\log(t)}{\psi^*(\Delta^d / 2)}.$$

  17. Regret bounds: Practice problem
Show that at least one of statements (1), (2), or (3) has to be true whenever $D_{t+1} = d$, for the UCB algorithm.

  18. Regret bounds: Bounding $E[T_T^d]$
◮ Let $$\tilde{T}_T^d = \left\lceil \frac{\alpha\log(T)}{\psi^*(\Delta^d / 2)} \right\rceil.$$
◮ Forcing the algorithm to pick $d$ for the first $\tilde{T}_T^d$ periods can only increase $T_T^d$.
◮ We can collect our results to get $$E[T_T^d] = \sum_{1 \le t \le T} E[\mathbf{1}(D_t = d)] \le \tilde{T}_T^d + \sum_{\tilde{T}_T^d < t \le T} E[\mathbf{1}(D_t = d)]$$ $$\le \tilde{T}_T^d + \sum_{\tilde{T}_T^d < t \le T} E\big[\mathbf{1}\big(\text{(1) or (2) is true at } t\big)\big]$$ $$\le \tilde{T}_T^d + \sum_{\tilde{T}_T^d < t \le T} \Big( E\big[\mathbf{1}\big(\text{(1) is true at } t\big)\big] + E\big[\mathbf{1}\big(\text{(2) is true at } t\big)\big] \Big)$$ $$\le \tilde{T}_T^d + \sum_{\tilde{T}_T^d < t \le T} 2\, t^{-\alpha + 1} \le \tilde{T}_T^d + \frac{\alpha}{\alpha - 2}.$$

  19. Regret bounds: Upper bound on expected regret for UCB
◮ We thus get: $$E[T_T^d] \le \frac{\alpha\log(T)}{\psi^*(\Delta^d / 2)} + \frac{\alpha}{\alpha - 2}, \qquad R_T \le \frac{1}{T}\sum_d \left( \frac{\alpha\log(T)}{\psi^*(\Delta^d / 2)} + \frac{\alpha}{\alpha - 2} \right)\cdot \Delta^d.$$
◮ Expected regret (difference to the optimal policy) goes to 0 at a rate of $O(\log(T)/T)$ – pretty fast!
◮ While the cost of “getting treatment wrong” is $\Delta^d$, the difficulty of figuring out the right treatment is of order $1/\psi^*(\Delta^d / 2)$. Typically, this is of order $(1/\Delta^d)^2$.
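For instance, for distributions bounded by $[0,1]$ we have $\psi^*(\Delta^d/2) = 2(\Delta^d/2)^2 = (\Delta^d)^2/2$, so the bound specializes (summing over arms with $\Delta^d > 0$) to

$$E[T_T^d] \le \frac{2\alpha\log(T)}{(\Delta^d)^2} + \frac{\alpha}{\alpha - 2}, \qquad R_T \le \frac{1}{T}\sum_{d:\,\Delta^d > 0} \left( \frac{2\alpha\log(T)}{(\Delta^d)^2} + \frac{\alpha}{\alpha - 2} \right)\Delta^d,$$

which makes the $(1/\Delta^d)^2$ order of the difficulty term explicit.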

  20. Regret bounds: Related bounds - rate optimality
◮ Lower bound: Consider the bandit problem with binary outcomes and any algorithm such that $E[T_t^d] = o(t^a)$ for all $a > 0$. Then $$\liminf_{T \to \infty} \frac{T \cdot R_T}{\log(T)} \ge \sum_d \frac{\Delta^d}{kl(\theta^d, \theta^*)},$$ where $kl(p, q) = p \cdot \log(p/q) + (1 - p)\cdot\log((1 - p)/(1 - q))$.
◮ Upper bound for Thompson sampling: In the binary outcome setting, Thompson sampling achieves this bound, i.e., $$\liminf_{T \to \infty} \frac{T \cdot R_T}{\log(T)} = \sum_d \frac{\Delta^d}{kl(\theta^d, \theta^*)}.$$
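A small sketch evaluating the lower-bound constant $\sum_{d:\,\Delta^d > 0} \Delta^d / kl(\theta^d, \theta^*)$ for a hypothetical parameter vector; the sum runs over suboptimal arms, and the interpretation as the asymptotic slope of $T \cdot R_T$ in $\log(T)$ follows the bound above:

```python
import numpy as np

def kl_bernoulli(p, q):
    """kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)) for Bernoulli distributions."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = np.array([0.3, 0.5, 0.7])  # hypothetical theta^d
theta_star = theta.max()
constant = sum((theta_star - th) / kl_bernoulli(th, theta_star)
               for th in theta if th < theta_star)
print(constant)  # lower-bound constant for this hypothetical theta
```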
