

  1. Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade, Machine Learning for Big Data, CSE547/STAT548, University of Washington

  2. Announcements... HW 4 posted soon (short). Poster session: June 1, 9-11:30a; ask TA/CSE students for help printing. Projects: the term is approaching the end... Today: quick overview of parallelization and deep learning; bandits: (1) the vanilla K-arm setting, (2) linear bandits and ad placement, (3) game trees?

  3. The problem. In unsupervised learning, we just have data... In supervised learning, we have inputs X and labels Y (often we spend resources to get these labels). In reinforcement learning (very general), we act in the world, there is "state", and we observe rewards. Bandit settings: we have K decisions each round, and we only receive feedback for the chosen decision...

  4. Gambling in a casino...

  5. Multi-Armed Bandit Game. K independent arms: $a \in \{1, \dots, K\}$. Each arm $a$ returns a random reward $R_a$ if pulled; (simpler case) assume $R_a$ is not time varying. Game: you choose arm $a_t$ at time $t$. You then observe $X_t = R_{a_t}$, where $R_{a_t}$ is sampled from the underlying distribution of that arm. The distribution of $R_a$ is not known.
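As a concrete illustration (not from the slides), here is a minimal sketch of this game as code; the Bernoulli reward distributions and the particular means are hypothetical choices.

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit with fixed (time-invariant) reward distributions."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means)         # unknown to the player
        self.K = len(means)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # Observe X_t = R_{a_t}: a fresh sample from arm a's distribution only.
        return float(self.rng.random() < self.means[a])

# Example: 3 arms with hidden means 0.3, 0.5, 0.7.
bandit = BernoulliBandit([0.3, 0.5, 0.7])
x = bandit.pull(2)  # reward is revealed only for the chosen arm
```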

  6. More real motivations...

  7. Ad placement...

  8. The Goal. We would like to maximize our long-term future reward. Our (possibly randomized) sequential strategy/algorithm $\mathcal{A}$ is $a_t = \mathcal{A}(a_1, X_1, a_2, X_2, \dots, a_{t-1}, X_{t-1})$. In $T$ rounds, our reward is $\sum_{t=1}^{T} E[X_t \mid \mathcal{A}]$, where the expectation is with respect to the reward process and our algorithm. Objective: what is a strategy which maximizes our long-term reward?

  9. Our Regret. Suppose $\mu_a = E[R_a]$. Assume $0 \le \mu_a \le 1$. Let $\mu^* = \max_a \mu_a$. In expectation, the best we can do is obtain $\mu^* T$ reward in $T$ steps. In $T$ rounds, our regret is $\mu^* T - E\left[\sum_{t=1}^{T} X_t \mid \mathcal{A}\right] \le \;??$. Objective: what is a strategy which makes our regret small?

  10. A Naive Strategy. For the first $\tau$ rounds, sample each arm $\tau/K$ times. For the remainder of the rounds, choose the arm with the best observed empirical reward. How good is this strategy? How do we set $\tau$? Let's look at confidence intervals.
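A minimal sketch of this explore-then-commit strategy, assuming the BernoulliBandit environment sketched earlier (the parameter tau is left to the caller):

```python
import numpy as np

def explore_then_commit(bandit, T, tau):
    """Sample each arm ~tau/K times, then commit to the best empirical arm."""
    K = bandit.K
    counts = np.zeros(K)
    sums = np.zeros(K)
    total_reward = 0.0

    for t in range(T):
        if t < tau:
            a = t % K  # round-robin exploration for the first tau rounds
        else:
            a = int(np.argmax(sums / np.maximum(counts, 1)))  # exploit best empirical mean
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
        total_reward += x
    return total_reward
```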

  11. Hoeffding's bound. If we pull arm $a$ a total of $N_a$ times, our empirical estimate for arm $a$ is $\hat{\mu}_a = \frac{1}{N_a} \sum_{t\,:\,a_t = a} X_t$. By Hoeffding's bound, with probability greater than $1 - \delta$, $|\hat{\mu}_a - \mu_a| \le O\left(\sqrt{\frac{\log(1/\delta)}{N_a}}\right)$. By the union bound, with probability greater than $1 - \delta$, for all $a$, $|\hat{\mu}_a - \mu_a| \le O\left(\sqrt{\frac{\log(K/\delta)}{N_a}}\right)$.
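As a worked example (numbers not from the slides), one explicit form of Hoeffding's bound for rewards in $[0,1]$ gives, with $N_a = 400$ pulls and $\delta = 0.01$:

```latex
\[
|\hat{\mu}_a - \mu_a| \;\le\; \sqrt{\frac{\log(2/\delta)}{2 N_a}}
  \;=\; \sqrt{\frac{\log 200}{800}} \;\approx\; 0.08
  \qquad \text{with probability} \ge 0.99 .
\]
```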

  12. Our regret. (Exploration rounds) What is our regret for the first $\tau$ rounds? (Exploitation rounds) What is our regret for the remaining $T - \tau$ rounds? Our total regret is $\mu^* T - \sum_{t=1}^{T} X_t \le \tau + O\left((T - \tau)\sqrt{\frac{\log(K/\delta)}{\tau/K}}\right)$. How do we choose $\tau$?

  13. The Naive Strategy's Regret. Choose $\tau = K^{1/3} T^{2/3}$ and $\delta = 1/T$. Theorem: our total (expected) regret is $\mu^* T - E\left[\sum_{t=1}^{T} X_t \mid \mathcal{A}\right] \le O\left(K^{1/3} T^{2/3} (\log(KT))^{1/3}\right)$.
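A sketch of where this choice of $\tau$ comes from (the balancing step is not spelled out on the slide): set the exploration and exploitation terms of the previous bound equal, suppressing constants and treating $T - \tau \approx T$.

```latex
\[
\tau \;\approx\; T\sqrt{\frac{K\log(K/\delta)}{\tau}}
\;\Longrightarrow\; \tau^{3/2} \approx T\sqrt{K\log(K/\delta)}
\;\Longrightarrow\; \tau \approx K^{1/3} T^{2/3} \bigl(\log(K/\delta)\bigr)^{1/3}.
\]
% Substituting back, both terms are of this same order; taking \delta = 1/T
% gives the K^{1/3} T^{2/3} (\log(KT))^{1/3} regret bound in the theorem.
```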

  14. Can we be more adaptive? Are we still pulling arms that we know are sub-optimal? How do we know this? Let $N_{a,t}$ be the number of times we have pulled arm $a$ up to time $t$. Confidence interval at time $t$: with probability greater than $1 - \delta$, $|\hat{\mu}_{a,t} - \mu_a| \le O\left(\sqrt{\frac{\log(1/\delta)}{N_{a,t}}}\right)$. With $\delta \to \delta/(TK)$, the above bound will hold for all arms $a \in [K]$ and all timesteps $t \le T$.

  15. Example

  16. Example

  17. Example

  18. Confidence Bounds...

  19. UCB: a reasonable state of our uncertainty...

  20. Upper Confidence Bound (UCB) Algorithm. At each time $t$, pull arm $a_t = \arg\max_a \left[\hat{\mu}_{a,t} + c\sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}\right] =: \arg\max_a \left[\hat{\mu}_{a,t} + \mathrm{ConfBound}_{a,t}\right]$ (where $c \le 10$ is a constant). Observe reward $X_t$. Update $\hat{\mu}_{a,t}$, $N_{a,t}$, and $\mathrm{ConfBound}_{a,t}$. How well does this do?
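A minimal sketch of the UCB rule in code, assuming the BernoulliBandit environment sketched earlier; the bonus term follows the slide's form, while details such as the initialization are illustrative.

```python
import numpy as np

def ucb(bandit, T, delta=0.01, c=2.0):
    """Pull the arm with the largest optimistic (upper confidence) index."""
    K = bandit.K
    counts = np.zeros(K)
    sums = np.zeros(K)
    total_reward = 0.0

    for t in range(T):
        if t < K:
            a = t  # pull each arm once so every N_{a,t} > 0
        else:
            mu_hat = sums / counts
            conf_bound = c * np.sqrt(np.log(K * T / delta) / counts)  # ConfBound_{a,t}
            a = int(np.argmax(mu_hat + conf_bound))
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
        total_reward += x
    return total_reward
```

Run against the 3-arm example above, `ucb(bandit, T=10_000)` should concentrate most of its pulls on the arm with the largest mean while still occasionally revisiting the others.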

  21. Instantaneous Regret. With probability greater than $1 - \delta$, all the confidence bounds will hold. Question: if $\hat{\mu}_{a,t} + \mathrm{ConfBound}_{a,t} \le \mu^*$, could UCB pull arm $a$ at time $t$? Question: if we pull arm $a$ at time $t$, how much regret do we pay? I.e., $\mu^* - \mu_{a_t} \le \;??$

  22. Total Regret. Theorem: the total (expected) regret of UCB is $\mu^* T - E\left[\sum_{t=1}^{T} X_t \mid \mathcal{A}\right] \le \sqrt{KT \log(KT)}$. This is better than the naive strategy. Up to log factors, it is optimal. Practical algorithm?

  23. Proof Idea: for $K = 2$. Suppose arm $a = 2$ is not optimal. Claim 1: all confidence intervals will be valid (with probability $\ge 1 - \delta$). Claim 2: if we pull arm $a = 1$, then we pay no regret. Claim 3: if we pull $a = 2$, then we pay at most $2\,C_{a,t}$ regret. To see this: $\hat{\mu}_{a,t} + C_{a,t} \ge \hat{\mu}_{1,t} + C_{1,t} \ge \mu^*$ (why?), and $\mu_a \ge \hat{\mu}_{a,t} - C_{a,t}$ (why?). The total regret is then $\sum_t C_{a,t} \le \sum_t \frac{1}{\sqrt{N_{a,t}}}$ (up to constants and log factors). Note that $N_{a,t} \le t$ (and increasing).
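To finish the calculation (a standard step, stated here for completeness): over the rounds in which arm $a$ is pulled, $N_{a,t}$ takes each value $1, 2, \dots, N_{a,T}$ exactly once, so

```latex
\[
\sum_{t \,:\, a_t = a} \frac{1}{\sqrt{N_{a,t}}}
  \;\le\; \sum_{n=1}^{N_{a,T}} \frac{1}{\sqrt{n}}
  \;\le\; 2\sqrt{N_{a,T}} .
\]
% Summing over arms and using Cauchy-Schwarz, \sum_a \sqrt{N_{a,T}} \le \sqrt{K T};
% together with the \sqrt{\log(KT/\delta)} factor in each C_{a,t}, this gives the
% \sqrt{K T \log(KT)} total regret bound of the theorem.
```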

  24. Acknowledgements: http://gdrro.lip6.fr/sites/default/files/JourneeCOSdec2015-Kaufman.pdf and https://sites.google.com/site/banditstutorial/
