Multi-Arm Bandit: Sutton and Barto (Sutton slides and Silver 1)



  1. Multi-Arm Bandit: Sutton and Barto (Sutton slides and Silver 1)

  2. Multi-Arm Bandits (Sutton and Barto, Chapter 2): the simplest reinforcement learning problem

  3. The Exploration/Exploitation Dilemma Online decision-making involves a fundamental choice: • Exploitation: make the best decision given current information • Exploration: gather more information The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions

  4. Examples • Restaurant selection: Exploitation is going to your favourite restaurant; Exploration is trying a new restaurant • Online banner advertisements: Exploitation is showing the most successful advert; Exploration is showing a different advert • Oil drilling: Exploitation is drilling at the best known location; Exploration is drilling at a new location • Game playing: Exploitation is playing the move you believe is best; Exploration is playing an experimental move

  5. Multi‐Armed Bandit
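
Slide 5 introduces the bandit problem itself; the usual formalisation from Silver's lecture is restated below for reference (the notation is assumed, not copied from the slide image):

```latex
% A multi-armed bandit is a tuple <A, R>:
%   A                          -- a known set of k actions ("arms")
%   R^a(r) = P[R = r | A = a]  -- an unknown reward distribution for each action a
% At each step t the agent picks A_t \in A and the environment returns R_t \sim R^{A_t};
% the goal is to maximise the cumulative reward:
\max \; \mathbb{E}\Big[\textstyle\sum_{\tau=1}^{t} R_\tau\Big]
```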

  6. Regret

  7. Counting Regret
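
Slides 6 and 7 refer to the regret quantities from Silver's lecture; the standard definitions are restated below for reference (notation assumed, not copied from the slide images):

```latex
% Action value, optimal value, and per-action gap
Q(a) = \mathbb{E}\,[\,r \mid a\,], \qquad V^{*} = Q(a^{*}) = \max_{a} Q(a), \qquad \Delta_a = V^{*} - Q(a)
% Total regret after t steps; N_t(a) counts selections of action a ("counting regret")
L_t = \mathbb{E}\Big[\sum_{\tau=1}^{t} \big(V^{*} - Q(A_\tau)\big)\Big]
    = \sum_{a \in \mathcal{A}} \mathbb{E}\,[N_t(a)]\;\Delta_a
```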

  8. Multi-Armed Bandits: Linear or Sublinear Regret [Figure: total regret vs. time-steps for the greedy, ε-greedy, and decaying ε-greedy algorithms] If an algorithm forever explores it will have linear total regret. If an algorithm never explores it will have linear total regret. Is it possible to achieve sublinear total regret?

  9. Complexity of regret

  10. Overview • Action‐value methods – Epsilon‐greedy strategy – Incremental implementation – Stationary vs. non‐stationary environment – Optimistic initial values • UCB action selection • Gradient bandit algorithms • Associative search (contextual bandits)

  11. Basics • Maximize total reward collected – vs learn (optimal) policy (RL) • Episode is one step • Complex function of – True value – Uncertainty – Number of time steps – Stationary vs non‐stationary?

  12. Action‐Value Methods

  13. Greedy Algorithms

  14. ε‐Greedy Algorithm
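
Slides 12 to 14 cover the sample-average estimate and the greedy / ε-greedy selection rules of Sutton and Barto (Section 2.2); restated here for reference:

```latex
% Sample-average estimate of the value of action a at time t
Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}\{A_i = a\}}{\sum_{i=1}^{t-1} \mathbf{1}\{A_i = a\}}
% Greedy selection; the epsilon-greedy variant instead picks a uniformly random action
% with probability epsilon, and the greedy action otherwise
A_t = \operatorname*{arg\,max}_{a} Q_t(a)
```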

  15. A simple bandit algorithm
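
The algorithm box referred to here is the simple bandit algorithm of Sutton and Barto (Section 2.4). A minimal Python sketch of the same idea, assuming a stationary k-armed task with unit-variance Gaussian rewards as in the 10-armed testbed on the next slide; the function name simple_bandit and all parameter values are illustrative, not taken from the slides:

```python
import numpy as np

def simple_bandit(q_true, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy bandit with incremental sample-average estimates."""
    rng = np.random.default_rng(seed)
    k = len(q_true)
    Q = np.zeros(k)    # value estimates Q(a)
    N = np.zeros(k)    # action counts N(a)
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))       # explore: uniformly random action
        else:
            a = int(np.argmax(Q))          # exploit: greedy action (ties -> lowest index)
        r = rng.normal(q_true[a], 1.0)     # reward drawn from N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental sample-average update
        rewards.append(r)
    return Q, rewards

# One bandit task drawn as in the 10-armed testbed: q*(a) ~ N(0, 1)
q_star = np.random.default_rng(1).normal(0.0, 1.0, size=10)
Q_est, _ = simple_bandit(q_star)
```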

  16. One Bandit Task from the 10‐Armed Testbed (Figure 2.1): an example bandit problem from the 10-armed testbed. The true value q*(a) of each of the ten actions was selected according to a normal distribution with mean zero and unit variance, and then the actual rewards were selected according to a normal distribution with mean q*(a) and unit variance, as suggested by the gray reward distributions shown for q*(1) through q*(10). Run for 1000 steps; repeat the whole thing 2000 times with different bandit tasks.

  17. ε‐Greedy Methods on the 10‐Armed Testbed

  18. Averaging ⟶ learning rule • To simplify notation, let us focus on one action • We consider only its rewards, and its estimate Q_n after n−1 rewards: Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1} • How can we do this incrementally (without storing all the rewards)? • Could store a running sum and count (and divide), or equivalently:

  19. Derivation of incremental update
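
The derivation referenced here is the standard one from Sutton and Barto (Section 2.4), reconstructed below from the sample-average definition of Q_n on the previous slide:

```latex
Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i
        = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big)
        = \frac{1}{n}\big(R_n + (n-1)\,Q_n\big)
        = Q_n + \frac{1}{n}\big(R_n - Q_n\big)
% i.e. NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
```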

  20. Tracking a Non‐stationary Problem
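
This slide concerns the constant step-size variant (Sutton and Barto, Section 2.5); the update and the exponential recency-weighted average it implies are restated here for reference:

```latex
Q_{n+1} = Q_n + \alpha\,\big(R_n - Q_n\big), \qquad \alpha \in (0, 1]
% Unrolling the recursion gives an exponential recency-weighted average of past rewards:
Q_{n+1} = (1-\alpha)^{n} Q_1 + \sum_{i=1}^{n} \alpha\,(1-\alpha)^{\,n-i}\, R_i
```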

  21. Standard stochastic approximation convergence conditions
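
The conditions referred to here are the usual Robbins-Monro step-size conditions (Sutton and Barto, Section 2.5), restated for reference; the sample-average choice α_n = 1/n satisfies both, while a constant α does not, which is exactly what lets it track a non-stationary problem:

```latex
\sum_{n=1}^{\infty} \alpha_n(a) = \infty
\qquad \text{and} \qquad
\sum_{n=1}^{\infty} \alpha_n^{2}(a) < \infty
% alpha_n = 1/n satisfies both; a constant alpha violates the second condition
```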

  22. Optimistic Initial Values

  23. Optimistic Initial Values • All methods so far depend on Q_1(a), i.e., they are biased. So far we have used Q_1(a) = 0 • Suppose we initialize the action values optimistically, e.g., Q_1(a) = 5, on the 10-armed testbed (with α = 0.1) [Figure: % optimal action vs. steps over 1000 plays, comparing the optimistic greedy method (Q_1 = 5, ε = 0) with the realistic ε-greedy method (Q_1 = 0, ε = 0.1)]
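
For comparison with the plot described above, a minimal sketch of the optimistic, purely greedy agent it refers to, assuming the constant step-size update with α = 0.1 and Q_1(a) = 5; variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
q_star = rng.normal(0.0, 1.0, size=10)   # one 10-armed testbed task, q*(a) ~ N(0, 1)

Q = np.full(10, 5.0)                     # optimistic initial values Q_1(a) = 5
alpha = 0.1
for _ in range(1000):
    a = int(np.argmax(Q))                # purely greedy, epsilon = 0
    r = rng.normal(q_star[a], 1.0)
    Q[a] += alpha * (r - Q[a])           # constant step-size update
```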

  24. Decaying ε‐greedy
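
Slide 24 refers to a decaying ε_t schedule; the schedule below is the one I recall from Silver's lecture and should be read as a reconstruction rather than a quote (c, d, and the logarithmic-regret claim are from that lecture, and the schedule requires advance knowledge of the gaps):

```latex
% Pick a decay schedule \varepsilon_1, \varepsilon_2, \dots; for example, with
c > 0, \qquad d = \min_{a \,:\, \Delta_a > 0} \Delta_a, \qquad
\varepsilon_t = \min\Big\{1, \; \frac{c\,|\mathcal{A}|}{d^{2}\, t}\Big\}
% this decaying epsilon_t-greedy schedule has logarithmic asymptotic total regret,
% but it needs the gaps Delta_a in advance.
```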

  25. Optimism in the face of uncertainty

  26. Optimism in the face of uncertainty

  27. Upper Confidence Bounds

  28. Hoeffding’s Inequality
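
The inequality on this slide is Hoeffding's bound as used in Silver's lecture to motivate UCB; restated here, together with the step that turns it into the exploration bonus on the next slides:

```latex
% Hoeffding's inequality for i.i.d. X_1,\dots,X_t \in [0,1] with sample mean \bar{X}_t
\mathbb{P}\big[\,\mathbb{E}[X] > \bar{X}_t + u\,\big] \le e^{-2 t u^{2}}
% Applied to the rewards of action a, with empirical mean \hat{Q}_t(a) and bonus U_t(a):
\mathbb{P}\big[\,Q(a) > \hat{Q}_t(a) + U_t(a)\,\big] \le e^{-2 N_t(a)\, U_t(a)^{2}}
% Setting the right-hand side to a target probability p and solving for the bonus:
U_t(a) = \sqrt{\frac{-\log p}{2\, N_t(a)}}, \qquad
p = t^{-4} \;\Rightarrow\; U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}
```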

  29. Calculating UCB

  30. Upper Confidence Bound (UCB) action selection • A clever way of reducing exploration over time • Focus on actions whose estimate has a large degree of uncertainty • Estimate an upper bound on the true action values • Select the action with the largest (estimated) upper bound [Figure: average reward vs. steps, comparing UCB with c = 2 against ε-greedy with ε = 0.1]
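
A minimal sketch of UCB action selection in the form used by Sutton and Barto (Section 2.7), with an exploration bonus c·sqrt(ln t / N_t(a)); the helper name ucb_action is illustrative, and c = 2 matches the setting quoted in the figure:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried arms are chosen first."""
    Q, N = np.asarray(Q, dtype=float), np.asarray(N, dtype=float)
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])              # an arm with N(a) = 0 is maximally uncertain
    bonus = c * np.sqrt(np.log(t) / N)      # exploration bonus shrinks as N(a) grows
    return int(np.argmax(Q + bonus))

# Example usage with k = 10 arms at time step t = 1
a = ucb_action(Q=np.zeros(10), N=np.zeros(10), t=1)
```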

  31. Complexity of the UCB Algorithm Theorem: the UCB algorithm achieves logarithmic asymptotic total regret: \lim_{t \to \infty} L_t \leq 8 \log t \sum_{a \,\mid\, \Delta_a > 0} \Delta_a

  32. UCB vs ε‐greedy on 10‐armed bandit

  33. UCB vs ε‐greedy on 10‐armed bandit

  34. Gradient‐Bandit Algorithms • Let H_t(a) be a learned preference for taking action a [Figure: % optimal action vs. steps on the 10-armed testbed, for α = 0.1 and α = 0.4, with and without a reward baseline]

  35. Derivation of gradient‐bandit algorithm
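
The derivation on this slide arrives at the gradient-bandit update of Sutton and Barto (Section 2.8); the soft-max policy and the resulting update are restated here for reference:

```latex
% Soft-max policy over the learned preferences H_t(a)
\pi_t(a) = \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}
% Stochastic gradient ascent on expected reward, with baseline \bar{R}_t (average reward so far)
H_{t+1}(A_t) = H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,\big(1 - \pi_t(A_t)\big)
H_{t+1}(a)   = H_t(a)   - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \qquad \text{for all } a \neq A_t
```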

  36. Summary: Comparison of Bandit Algorithms

  37. Conclusions • These are all simple methods – but they are complicated enough; we will build on them – we should understand them completely – there are still open questions • Our first algorithms that learn from evaluative feedback – and thus must balance exploration and exploitation • Our first algorithms that appear to have a goal – they learn to maximize reward by trial and error

  38. Our first dimensions! • Problems vs Solution Methods (Bandits?) • Evaluative vs Instructive (Problem or Solution?) • Associative vs Non-associative

  39. Problem space: a two-by-two grid with columns Single State and Associative, and rows Instructive feedback and Evaluative feedback (all four cells empty at this point)

  40. Problem space: the Evaluative feedback / Single State cell is filled in with Bandits (Function optimization)

  41. Problem space: Supervised learning is added under Instructive feedback (Associative column)

  42. Problem space: Averaging is added under Instructive feedback (Single State column)

  43. Problem space, completed grid:
                             Single State                       Associative
      Instructive feedback   Averaging                          Supervised learning
      Evaluative feedback    Bandits (Function optimization)    Associative Search (Contextual bandits)
