CS885 Reinforcement Learning Lecture 8a: May 25, 2018 Multi-armed Bandits - PowerPoint PPT Presentation


  1. CS885 Reinforcement Learning Lecture 8a: May 25, 2018 Multi-armed Bandits [SutBar] Sec. 2.1-2.7, [Sze] Sec. 4.2.1-4.2.2 University of Waterloo CS885 Spring 2018 Pascal Poupart 1

  2. Outline • Exploration/exploitation tradeoff • Regret • Multi-armed bandits – ε-greedy strategies – Upper confidence bounds

  3. Exploration/Exploitation Tradeoff • Fundamental problem of RL due to the active nature of the learning process • Consider one-state RL problems known as bandits

  4. Stochastic Bandits • Formal definition: – Single state: S = {s} – A: set of actions (also known as arms) – Space of rewards (often re-scaled to be [0,1]) • No transition function to be learned since there is a single state • We simply need to learn the stochastic reward function

  5. Origin • The term bandit comes from gambling, where slot machines can be thought of as one-armed bandits. • Problem: which slot machine should we play at each turn when their payoffs are not necessarily the same and are initially unknown?

  6. Examples • Design of experiments (clinical trials) • Online ad placement • Web page personalization • Games • Networks (packet routing)

  7. Online Ad Optimization

  8. Online Ad Optimization • Problem: which ad should be presented? • Answer: present the ad with the highest payoff: Payoff = ClickThroughRate × Payment – Click-through rate: probability that the user clicks on the ad – Payment: $$ paid by the advertiser • Amount determined by an auction

  9. Simplified Problem • Assume payment is 1 unit for all ads • Need to estimate click-through rate • Formulate as a bandit problem: – Arms: the set of possible ads – Rewards: 0 (no click) or 1 (click) • In what order should ads be presented to maximize revenue? – How should we balance exploitation and exploration?
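The simplified ad problem above is a Bernoulli bandit: each arm is an ad, and a pull returns 1 (click) or 0 (no click). A minimal sketch in Python; the click-through rates below are made-up illustration values, not from the lecture.

```python
import random

class BernoulliBandit:
    """Single-state bandit: each arm (ad) pays 1 on a click, 0 otherwise."""

    def __init__(self, ctrs, seed=0):
        self.ctrs = ctrs                 # true (unknown) click-through rates
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Reward: 1 (click) with probability ctrs[arm], else 0 (no click)
        return 1 if self.rng.random() < self.ctrs[arm] else 0

bandit = BernoulliBandit([0.02, 0.05, 0.01])   # hypothetical CTRs
clicks = sum(bandit.pull(1) for _ in range(10_000))
print(clicks / 10_000)                         # empirical CTR of ad 1
```

The learner only ever sees the 0/1 rewards; the `ctrs` vector is exactly the stochastic reward function it has to estimate.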

  10. Simple yet difficult problem • Simple: the description of the problem is short • Difficult: no known tractable optimal solution

  11. Simple heuristics • Greedy strategy: select the arm with the highest average so far – May get stuck due to lack of exploration • ε-greedy: select an arm at random with probability ε and otherwise do a greedy selection – Convergence rate depends on the choice of ε
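The ε-greedy rule above can be sketched in a few lines (a minimal illustration; `means` holds the empirical average reward of each arm so far):

```python
import random

def epsilon_greedy(means, epsilon, rng=random):
    """Explore a uniformly random arm with probability epsilon,
    otherwise exploit the arm with the highest average so far."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))                   # explore
    return max(range(len(means)), key=lambda a: means[a])  # exploit

# With epsilon = 0 this reduces to the pure greedy strategy:
print(epsilon_greedy([0.1, 0.9, 0.3], 0.0))  # → 1
```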

  12. Regret • Let R(a) be the unknown average reward of arm a • Let r* = max_a R(a) and a* = argmax_a R(a) • Denote by loss(a) the expected regret of a: loss(a) = r* − R(a) • Denote by Loss(n) the expected cumulative regret for n time steps: Loss(n) = Σ_{t=1}^{n} loss(a_t)
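Plugging in the definitions above, per-step and cumulative regret can be computed directly (`R` below is a hypothetical vector of true means, chosen for illustration):

```python
def loss(R, a):
    """Expected regret of arm a: loss(a) = r* - R(a)."""
    return max(R) - R[a]

def cumulative_loss(R, arms):
    """Loss(n) = sum_{t=1}^{n} loss(a_t) over the arms chosen at each step."""
    return sum(loss(R, a) for a in arms)

R = [0.2, 0.5, 0.4]                      # hypothetical true means, so r* = 0.5
print(cumulative_loss(R, [0, 1, 1, 2]))  # 0.3 + 0 + 0 + 0.1 ≈ 0.4
```

Note that the optimal arm contributes zero regret, so only the exploratory pulls of suboptimal arms accumulate loss.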

  13. Theoretical Guarantees • When ε is constant, then – For large enough t: Pr(a_t ≠ a*) ≈ ε – Expected cumulative regret: Loss(n) ≈ Σ_{t=1}^{n} ε = O(n) • Linear regret • When ε_t ∝ 1/t – For large enough t: Pr(a_t ≠ a*) ≈ ε_t = O(1/t) – Expected cumulative regret: Loss(n) ≈ Σ_{t=1}^{n} 1/t = O(log n) • Logarithmic regret
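The two regret bounds above differ only in how the per-step exploration probability sums up over time; the sums (with constants dropped) can be compared directly:

```python
def loss_constant_eps(eps, n):
    # Sum_{t=1}^{n} eps = eps * n  ->  O(n): linear regret
    return eps * n

def loss_decaying_eps(n):
    # Sum_{t=1}^{n} 1/t  (harmonic number)  ->  O(log n): logarithmic regret
    return sum(1.0 / t for t in range(1, n + 1))

for n in (100, 10_000):
    print(n, loss_constant_eps(0.1, n), round(loss_decaying_eps(n), 2))
```

The harmonic sum stays within a constant of log n, while the constant-ε sum grows linearly with the horizon.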

  14. Empirical mean • Problem: how far is the empirical mean R̃(a) from the true mean R(a)? • If we knew that |R(a) − R̃(a)| ≤ bound – Then we would know that R(a) ≤ R̃(a) + bound – And we could select the arm with the best R̃(a) + bound • Over time, additional data will allow us to refine R̃(a) and compute a tighter bound.

  15. Optimism in the Face of Uncertainty • Suppose that we have an oracle that returns an upper bound UB_n(a) on R(a) for each arm, based on n trials of arm a • Suppose the upper bound returned by this oracle converges to R(a) in the limit: – i.e., lim_{n→∞} UB_n(a) = R(a) • Optimistic algorithm – At each step, select argmax_a UB_n(a)

  16. Convergence • Theorem: An optimistic strategy that always selects argmax_a UB_n(a) will converge to a* • Proof by contradiction: – Suppose that we converge to a suboptimal arm a after infinitely many trials – Then R(a) = UB_∞(a) ≥ UB_∞(a′) = R(a′) ∀a′ – But R(a) ≥ R(a′) ∀a′ contradicts our assumption that a is suboptimal

  17. Probabilistic Upper Bound • Problem: we can't compute an upper bound with certainty since we are sampling • However, we can obtain measures f(a) that are upper bounds most of the time – i.e., Pr(R(a) ≤ f(a)) ≥ 1 − δ – Example: Hoeffding's inequality Pr( R(a) ≤ R̃(a) + √(log(1/δ) / (2 n_a)) ) ≥ 1 − δ, where n_a is the number of trials for arm a
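Hoeffding's bound above is easy to evaluate numerically. A sketch for rewards in [0, 1]; the empirical mean and δ below are illustrative values:

```python
import math

def hoeffding_upper_bound(mean_hat, n_a, delta):
    """Upper confidence bound that holds with probability >= 1 - delta:
    R(a) <= mean_hat + sqrt(log(1/delta) / (2 * n_a))."""
    return mean_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n_a))

# The bound tightens toward the empirical mean as the arm is pulled more often:
for n_a in (10, 100, 1000):
    print(n_a, round(hoeffding_upper_bound(0.5, n_a, 0.05), 3))
```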

  18. Upper Confidence Bound (UCB) • Set δ_n = 1/n^4 in Hoeffding's bound • Choose the arm with the highest Hoeffding bound
      UCB(h)
        R ← 0, n ← 0, n_a ← 0 ∀a
        Repeat until n = h
          Execute a = argmax_a R̃(a) + √(2 log n / n_a)
          Receive r
          R ← R + r
          R̃(a) ← (n_a R̃(a) + r) / (n_a + 1)
          n ← n + 1, n_a ← n_a + 1
        Return R
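The UCB pseudocode above can be sketched in Python; here `pull(a)` stands in for executing arm a and observing a reward in [0, 1], and the Bernoulli arms at the bottom are made-up test values:

```python
import math
import random

def ucb(pull, num_arms, horizon):
    """UCB sketch: play argmax_a  mean[a] + sqrt(2 log n / n_a),
    the bonus obtained by setting delta = 1/n^4 in Hoeffding's bound."""
    total, n = 0.0, 0
    counts = [0] * num_arms       # n_a: number of pulls of each arm
    means = [0.0] * num_arms      # empirical mean reward of each arm
    while n < horizon:
        if n < num_arms:
            a = n                 # pull every arm once to initialize
        else:
            a = max(range(num_arms),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(n) / counts[i]))
        r = pull(a)
        total += r
        means[a] = (counts[a] * means[a] + r) / (counts[a] + 1)  # running mean
        counts[a] += 1
        n += 1
    return total

rng = random.Random(0)
ctrs = [0.2, 0.8]                 # hypothetical Bernoulli arms
total = ucb(lambda a: 1 if rng.random() < ctrs[a] else 0, 2, 2000)
print(total / 2000)               # average reward; the better arm dominates
```

Pulling each arm once up front mirrors the n_a ← 0 ∀a initialization: it keeps every n_a positive before the bonus term is evaluated.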

  19. UCB Convergence • Theorem: Although Hoeffding's bound is probabilistic, UCB converges • Idea: As n increases, the term √(2 log n / n_a) increases, ensuring that all arms are tried infinitely often • Expected cumulative regret: Loss(n) = O(log n) – Logarithmic regret

  20. Summary • Stochastic bandits – Exploration/exploitation tradeoff • ε-greedy and UCB – Theory: logarithmic expected cumulative regret • In practice: – UCB often performs better than ε-greedy – Many variants of UCB improve performance
