CS885 Reinforcement Learning Lecture 8a: May 25, 2018 Multi-armed Bandits

SLIDE 1

CS885 Reinforcement Learning Lecture 8a: May 25, 2018

Multi-armed Bandits [SutBar] Sec. 2.1-2.7, [Sze] Sec. 4.2.1-4.2.2

CS885 Spring 2018, Pascal Poupart, University of Waterloo

SLIDE 2

Outline

  • Exploration/exploitation tradeoff
  • Regret
  • Multi-armed bandits

    – $\epsilon$-greedy strategies
    – Upper confidence bounds

SLIDE 3

Exploration/Exploitation Tradeoff

  • Fundamental problem of RL due to the active nature of the learning process
  • Consider one-state RL problems known as bandits

SLIDE 4

Stochastic Bandits

  • Formal definition:
    – Single state: S = {s}
    – A: set of actions (also known as arms)
    – Space of rewards (often re-scaled to be [0,1])
  • No transition function to be learned since there is a single state
  • We simply need to learn the stochastic reward function
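
The formal definition above can be sketched as a small simulated environment. This is only an illustration under assumptions not in the slides: rewards are Bernoulli (0 or 1), and the class name `BernoulliBandit` and the success probabilities are hypothetical.

```python
import random


class BernoulliBandit:
    """Single-state stochastic bandit: each arm pays 1 with a fixed hidden probability."""

    def __init__(self, success_probs, seed=0):
        self.success_probs = list(success_probs)  # hidden from the learner
        self.rng = random.Random(seed)

    @property
    def num_arms(self):
        return len(self.success_probs)

    def pull(self, arm):
        """Return a reward in {0, 1} sampled from the arm's reward distribution."""
        return 1 if self.rng.random() < self.success_probs[arm] else 0


# Hypothetical three-armed bandit; the learner only ever sees sampled rewards.
bandit = BernoulliBandit([0.2, 0.5, 0.8])
rewards = [bandit.pull(2) for _ in range(1000)]
empirical_mean = sum(rewards) / len(rewards)  # estimate of the arm's mean reward
```

There is no state to track and no transition model: the only thing to estimate is the mean reward of each arm.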

SLIDE 5

Origin

  • The term bandit comes from gambling, where slot machines can be thought of as one-armed bandits.
  • Problem: which slot machine should we play at each turn when their payoffs are not necessarily the same and initially unknown?

SLIDE 6

Examples

  • Design of experiments (Clinical Trials)
  • Online ad placement
  • Web page personalization
  • Games
  • Networks (packet routing)

SLIDE 7

Online Ad Optimization

SLIDE 8

Online Ad Optimization

  • Problem: which ad should be presented?
  • Answer: present ad with highest payoff

    $\text{payoff} = \text{clickThroughRate} \times \text{payment}$

    – Click-through rate: probability that user clicks on ad
    – Payment: amount paid by the advertiser
        • Amount determined by an auction
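
As a small worked example of the payoff rule, the sketch below picks the ad with the highest expected payoff. The ad names, click-through rates, and payments are hypothetical values chosen for illustration; in practice the click-through rates are unknown and must be estimated.

```python
# Hypothetical ads: (click-through rate, payment per click).
ads = {
    "ad_a": (0.10, 1.00),
    "ad_b": (0.04, 3.00),
    "ad_c": (0.25, 0.30),
}


def expected_payoff(click_through_rate, payment):
    """payoff = clickThroughRate x payment"""
    return click_through_rate * payment


payoffs = {name: expected_payoff(ctr, pay) for name, (ctr, pay) in ads.items()}
best_ad = max(payoffs, key=payoffs.get)  # ad with the highest expected payoff
```

Note that the ad with the highest click-through rate is not necessarily the one with the highest payoff.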

SLIDE 9

Simplified Problem

  • Assume payment is 1 unit for all ads
  • Need to estimate click-through rate
  • Formulate as a bandit problem:
    – Arms: the set of possible ads
    – Rewards: 0 (no click) or 1 (click)
  • In what order should ads be presented to maximize revenue?
    – How should we balance exploitation and exploration?

SLIDE 10

Simple yet difficult problem

  • Simple: description of the problem is short
  • Difficult: no known tractable optimal solution

SLIDE 11

Simple heuristics

  • Greedy strategy: select the arm with the highest average reward so far
    – May get stuck due to lack of exploration
  • $\epsilon$-greedy: select an arm at random with probability $\epsilon$ and otherwise do a greedy selection
    – Convergence rate depends on choice of $\epsilon$
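
The two heuristics above can be sketched in a few lines. This is a minimal illustration, assuming a simulated Bernoulli bandit with hypothetical success probabilities; setting `epsilon=0` gives the purely greedy strategy. The function name and parameters are not from the slides.

```python
import random


def epsilon_greedy(success_probs, epsilon, num_steps, seed=0):
    """Run epsilon-greedy on a simulated Bernoulli bandit; return (total reward, pull counts)."""
    rng = random.Random(seed)
    num_arms = len(success_probs)
    counts = [0] * num_arms        # number of pulls of each arm
    averages = [0.0] * num_arms    # empirical average reward of each arm
    total_reward = 0

    for _ in range(num_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(num_arms)                          # explore
        else:
            arm = max(range(num_arms), key=lambda a: averages[a])  # exploit
        reward = 1 if rng.random() < success_probs[arm] else 0
        counts[arm] += 1
        averages[arm] += (reward - averages[arm]) / counts[arm]    # incremental mean
        total_reward += reward

    return total_reward, counts


# Hypothetical bandit whose last arm is the best one.
total, counts = epsilon_greedy([0.2, 0.5, 0.8], epsilon=0.1, num_steps=5000)
```

With a constant `epsilon`, a fixed fraction of the pulls keeps going to random arms no matter how much data has been collected.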

SLIDE 12

Regret

  • Let $R(a)$ be the unknown average reward of $a$
  • Let $r^* = \max_a R(a)$ and $a^* = \arg\max_a R(a)$
  • Denote by $\mathrm{loss}(a)$ the expected regret of $a$

    $\mathrm{loss}(a) = r^* - R(a)$

  • Denote by $\mathrm{Loss}_n$ the expected cumulative regret for $n$ time steps

    $\mathrm{Loss}_n = \sum_{t=1}^{n} \mathrm{loss}(a_t)$
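
The regret definitions can be checked on a toy example. The sketch below assumes the true mean rewards are known, which is only possible in a simulation; the values and the sequence of chosen arms are hypothetical.

```python
def expected_regret(mean_rewards, arm):
    """loss(a) = r* - R(a), where r* is the best mean reward."""
    return max(mean_rewards) - mean_rewards[arm]


def cumulative_regret(mean_rewards, chosen_arms):
    """Loss_n = sum over t of loss(a_t)."""
    return sum(expected_regret(mean_rewards, a) for a in chosen_arms)


mean_rewards = [0.2, 0.5, 0.8]    # hypothetical R(a); unknown to a real learner
chosen_arms = [0, 1, 2, 2, 1, 2]  # arms a_1 ... a_6 selected by some strategy
loss_n = cumulative_regret(mean_rewards, chosen_arms)  # 0.6 + 0.3 + 0 + 0 + 0.3 + 0
```

Only pulls of suboptimal arms contribute to the cumulative regret; pulls of the best arm add zero.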

SLIDE 13

Theoretical Guarantees

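  • When $\epsilon$ is constant, then
    – For large enough $t$: $\Pr(a_t \neq a^*) \approx \epsilon$
    – Expected cumulative regret: $\mathrm{Loss}_n \approx \sum_{t=1}^{n} \epsilon = O(n)$
        • Linear regret
  • When $\epsilon_t \propto 1/t$
    – For large enough $t$: $\Pr(a_t \neq a^*) \approx \epsilon_t = O(1/t)$
    – Expected cumulative regret: $\mathrm{Loss}_n \approx \sum_{t=1}^{n} \frac{1}{t} = O(\log n)$
        • Logarithmic regret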
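
The two growth rates can be compared numerically. The sketch below only evaluates the two sums from the bullets above, ignoring constant factors such as the size of the reward gap; the function names and the value of `epsilon` are assumptions made for illustration.

```python
import math


def regret_constant_epsilon(epsilon, n):
    """Sum_{t=1}^{n} epsilon, which grows linearly in n."""
    return sum(epsilon for _ in range(1, n + 1))


def regret_decaying_epsilon(n):
    """Sum_{t=1}^{n} 1/t (a harmonic number), which grows like log n."""
    return sum(1.0 / t for t in range(1, n + 1))


# Columns: n, linear sum with epsilon = 0.1, harmonic sum, log n for reference.
for n in (10, 1000, 100000):
    print(n, regret_constant_epsilon(0.1, n), regret_decaying_epsilon(n), math.log(n))
```

The harmonic sum stays within 1 of $\log n$, which is where the logarithmic regret of the decaying schedule comes from.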

SLIDE 14

Empirical mean

  • Problem: how far is the empirical mean $\tilde{R}(a)$ from the true mean $R(a)$?
  • If we knew that $|R(a) - \tilde{R}(a)| \le \text{bound}$
    – Then we would know that $R(a) \le \tilde{R}(a) + \text{bound}$
    – And we could select the arm with best $\tilde{R}(a) + \text{bound}$
  • Over time, additional data will allow us to refine $\tilde{R}(a)$ and compute a tighter bound.

SLIDE 15

Positivism in the Face of Uncertainty

  • Suppose that we have an oracle that returns an upper bound $UB_n(a)$ on $R(a)$ for each arm based on $n$ trials of arm $a$.
  • Suppose the upper bound returned by this oracle converges to $R(a)$ in the limit:
    – i.e. $\lim_{n \to \infty} UB_n(a) = R(a)$
  • Optimistic algorithm
    – At each step, select $\arg\max_a UB_n(a)$

SLIDE 16

Convergence

  • Theorem: An optimistic strategy that always selects $\arg\max_a UB_n(a)$ will converge to $a^*$
  • Proof by contradiction:
    – Suppose that we converge to a suboptimal arm $a$ after infinitely many trials.
    – Then $R(a) = UB_\infty(a) \ge UB_\infty(a') = R(a') \;\forall a'$
    – But $R(a) \ge R(a') \;\forall a'$ contradicts our assumption that $a$ is suboptimal.

SLIDE 17

Probabilistic Upper Bound

  • Problem: We can’t compute an upper bound with certainty since we are sampling
  • However, we can obtain measures $f$ that are upper bounds most of the time
    – i.e., $\Pr(R(a) \le f(a)) \ge 1 - \delta$
    – Example: Hoeffding’s inequality

      $\Pr\left(R(a) \le \tilde{R}(a) + \sqrt{\frac{\log(1/\delta)}{2 n_a}}\right) \ge 1 - \delta$

      where $n_a$ is the number of trials for arm $a$
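
To see how the Hoeffding term shrinks with more trials, it can be evaluated directly. This is a minimal sketch assuming rewards lie in [0, 1]; the function names and the value `delta=0.05` are hypothetical.

```python
import math


def hoeffding_radius(num_trials, delta):
    """sqrt(log(1/delta) / (2 n_a)): the amount added to the empirical mean."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * num_trials))


def upper_bound(empirical_mean, num_trials, delta):
    """Value that exceeds the true mean with probability at least 1 - delta."""
    return empirical_mean + hoeffding_radius(num_trials, delta)


# The radius shrinks as the number of trials n_a of an arm grows.
for n_a in (10, 100, 1000):
    print(n_a, round(hoeffding_radius(n_a, delta=0.05), 4))
```

The radius decreases at rate $1/\sqrt{n_a}$, so quadrupling the number of trials halves it.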

SLIDE 18

Upper Confidence Bound (UCB)

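  • Set $\delta_n = 1/n^4$ in Hoeffding’s bound
  • Choose $a$ with highest Hoeffding bound

UCB($h$)
    $V \leftarrow 0$, $n \leftarrow 0$, $n_a \leftarrow 0 \;\forall a$
    Repeat until $n = h$
        Execute $\arg\max_a \tilde{R}(a) + \sqrt{\frac{2 \log n}{n_a}}$
        Receive $r$
        $V \leftarrow V + r$
        $\tilde{R}(a) \leftarrow \frac{n_a \tilde{R}(a) + r}{n_a + 1}$
        $n \leftarrow n + 1$, $n_a \leftarrow n_a + 1$
    Return $V$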
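
A runnable sketch of the UCB pseudocode follows. Two details are assumptions rather than part of the slide: arms with $n_a = 0$ are given an infinite score so that each is tried once (this avoids dividing by zero), and the step counter starts at 1 so that $\log n$ is defined. The simulated Bernoulli rewards and their probabilities are hypothetical.

```python
import math
import random


def ucb(success_probs, horizon, seed=0):
    """UCB on a simulated Bernoulli bandit; returns (total reward V, pull counts n_a)."""
    rng = random.Random(seed)
    num_arms = len(success_probs)
    counts = [0] * num_arms       # n_a
    averages = [0.0] * num_arms   # empirical mean reward of each arm
    total_reward = 0              # V

    def score(arm, n):
        if counts[arm] == 0:
            return float("inf")   # assumption: untried arms are selected first
        return averages[arm] + math.sqrt(2.0 * math.log(n) / counts[arm])

    for n in range(1, horizon + 1):
        arm = max(range(num_arms), key=lambda a: score(a, n))
        reward = 1 if rng.random() < success_probs[arm] else 0
        total_reward += reward
        # Running-average update from the pseudocode.
        averages[arm] = (counts[arm] * averages[arm] + reward) / (counts[arm] + 1)
        counts[arm] += 1

    return total_reward, counts


# Hypothetical bandit whose last arm is the best one.
total, counts = ucb([0.2, 0.5, 0.8], horizon=5000)
```

Unlike $\epsilon$-greedy, there is no exploration parameter to tune: exploration comes entirely from the bonus term, which is large for arms that have been tried only a few times.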

SLIDE 19

UCB Convergence

  • Theorem: Although Hoeffding’s bound is probabilistic, UCB converges.
  • Idea: As $n$ increases, the term $\sqrt{\frac{2 \log n}{n_a}}$ increases for any arm whose count $n_a$ stays fixed, ensuring that all arms are tried infinitely often
  • Expected cumulative regret: $\mathrm{Loss}_n = O(\log n)$
    – Logarithmic regret

SLIDE 20

Summary

  • Stochastic bandits
    – Exploration/exploitation tradeoff
  • $\epsilon$-greedy and UCB
    – Theory: logarithmic expected cumulative regret (for $\epsilon$-greedy, when $\epsilon_t \propto 1/t$)
  • In practice:
    – UCB often performs better than $\epsilon$-greedy
    – Many variants of UCB improve performance
