

SLIDE 1

Multi-Arm Bandits (Sutton and Barto)

Based on slides by Sutton and Silver

SLIDE 2

Multi-Arm Bandits

Sutton and Barto, Chapter 2

The simplest reinforcement learning problem

SLIDE 3

The Exploration/Exploitation Dilemma

Online decision-making involves a fundamental choice:

  • Exploitation: make the best decision given current information
  • Exploration: gather more information

The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions.

SLIDE 4

Examples

  • Restaurant Selection. Exploitation: go to your favourite restaurant. Exploration: try a new restaurant.
  • Online Banner Advertisements. Exploitation: show the most successful advert. Exploration: show a different advert.
  • Oil Drilling. Exploitation: drill at the best known location. Exploration: drill at a new location.
  • Game Playing. Exploitation: play the move you believe is best. Exploration: play an experimental move.

SLIDE 5

Multi-Armed Bandit
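
The body of this slide is an image in the original; the standard formulation it refers to (Sutton and Barto, Chapter 2) is: you face a repeated choice among k actions, and after each choice you receive a reward drawn from a stationary distribution that depends on the chosen action. The value of an action is its expected reward,

q*(a) = E[ R_t | A_t = a ],

and the goal is to maximize the expected total reward over some horizon.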

SLIDE 6

Regret

SLIDE 7

Counting Regret
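
The bodies of the two regret slides are images; the definitions they rely on (following Silver's bandit lecture, which these titles match) are:

  • Optimal value: V* = q(a*) = max_a q(a)
  • Regret of one step (the opportunity loss): l_t = E[ V* − q(A_t) ]
  • Total regret: L_t = E[ Σ_{τ=1..t} ( V* − q(A_τ) ) ]
  • Counting form: with N_t(a) the number of times a has been selected and gap Δ_a = V* − q(a),
    L_t = Σ_a E[ N_t(a) ] · Δ_a

Maximizing cumulative reward is therefore the same as minimizing total regret.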

SLIDE 8

Multi-Armed Bandits Regret

[Figure: total regret vs. time-steps for the greedy, ε-greedy, and decaying ε-greedy algorithms.]

If an algorithm forever explores, it will have linear total regret. If an algorithm never explores, it will also have linear total regret. Is it possible to achieve sublinear total regret?

Linear or Sublinear Regret

SLIDE 9

Complexity of regret

SLIDE 10

Overview

  • Action‐value methods

    – Epsilon-greedy strategy
    – Incremental implementation
    – Stationary vs. non-stationary environment
    – Optimistic initial values

  • UCB action selection
  • Gradient bandit algorithms
  • Associative search (contextual bandits)


SLIDE 11

Basics

  • Maximize total reward collected

– vs learn (optimal) policy (RL)

  • Episode is one step
  • Complex function of

    – True value
    – Uncertainty
    – Number of time steps
    – Stationary vs. non-stationary?


SLIDE 12

Action‐Value Methods
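
The slide body is an image; the sample-average method it introduces (Sutton and Barto, Section 2.2) estimates each action value by averaging the rewards received when that action was taken:

Q_t(a) = ( Σ_{i=1..t−1} R_i · 1{A_i = a} ) / ( Σ_{i=1..t−1} 1{A_i = a} ),

which converges to q*(a) as the action is taken infinitely often (law of large numbers).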

SLIDE 13

Greedy Algorithms

SLIDE 14

ε-Greedy Algorithm
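
The selection rules on slides 13 and 14 are shown as images; in the notation above they are:

  • Greedy: A_t = argmax_a Q_t(a)
  • ε-greedy: with probability 1 − ε select the greedy action; with probability ε select an action uniformly at random (so every action keeps being sampled)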

SLIDE 15

A simple bandit algorithm
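
The pseudocode box on this slide (the "simple bandit algorithm" from Sutton and Barto, Section 2.4) can be sketched in Python roughly as follows; bandit(a) stands in for whatever routine returns a reward for pulling arm a, and ties in the argmax are broken by lowest index rather than at random, a small simplification:

```python
import numpy as np

def simple_bandit(bandit, k, steps, eps=0.1, rng=None):
    """ε-greedy action selection with incremental sample-average estimates."""
    rng = rng or np.random.default_rng()
    Q = np.zeros(k)   # value estimates Q(a)
    N = np.zeros(k)   # selection counts N(a)
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))       # explore: uniformly random arm
        else:
            a = int(np.argmax(Q))          # exploit: greedy arm
        r = bandit(a)                      # observe reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental update: Q += (R - Q) / N
    return Q, N
```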

SLIDE 16

One Bandit Task from the 10-Armed Testbed

[Figure: reward distributions of a single 10-armed bandit task; each action a = 1, ..., 10 has a true value q*(a), and its rewards are distributed around that value.]

Run for 1000 steps. Repeat the whole thing 2000 times with different bandit tasks.

Figure 2.1: An example bandit problem from the 10-armed testbed. The true value q*(a) of each of the ten actions was selected according to a normal distribution with mean zero and unit variance, and then the actual rewards were selected according to a mean q*(a), unit-variance normal distribution, as suggested by these gray distributions.
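
A minimal sketch of one run of this protocol, assuming the Gaussian setup from the caption and reusing the simple_bandit sketch from slide 15 (the book averages the per-step results over 2000 independently drawn tasks):

```python
import numpy as np

rng = np.random.default_rng(0)
k, steps = 10, 1000

q_star = rng.normal(0.0, 1.0, size=k)           # true values q*(a) ~ N(0, 1)
bandit = lambda a: rng.normal(q_star[a], 1.0)   # reward for arm a ~ N(q*(a), 1)

Q, N = simple_bandit(bandit, k=k, steps=steps, eps=0.1, rng=rng)
print("best arm:", int(np.argmax(q_star)), "most-pulled arm:", int(np.argmax(N)))
```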

SLIDE 17

ε-Greedy Methods on the 10-Armed Testbed

SLIDE 18

Averaging ⟶ learning rule

  • To simplify notation, let us focus on one action
  • We consider only its rewards, and its estimate Q_n after n − 1 rewards:

    Q_n = ( R_1 + R_2 + · · · + R_{n−1} ) / ( n − 1 )

  • How can we do this incrementally (without storing all the rewards)?
  • Could store a running sum and count (and divide), or equivalently use the incremental update derived on the next slide

SLIDE 19

Derivation of incremental update
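
The derivation on the slide is an image; the standard steps (Sutton and Barto, Section 2.4) are:

Q_{n+1} = (1/n) Σ_{i=1..n} R_i
        = (1/n) ( R_n + Σ_{i=1..n−1} R_i )
        = (1/n) ( R_n + (n − 1) Q_n )
        = Q_n + (1/n) ( R_n − Q_n )

i.e. NewEstimate ← OldEstimate + StepSize · (Target − OldEstimate), with step size 1/n.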

SLIDE 20

Tracking a Non‐stationary Problem
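
The slide body is an image; the technique it refers to (Sutton and Barto, Section 2.5) replaces the 1/n step size with a constant α ∈ (0, 1]:

Q_{n+1} = Q_n + α ( R_n − Q_n ),

which is an exponential recency-weighted average,

Q_{n+1} = (1 − α)^n Q_1 + Σ_{i=1..n} α (1 − α)^{n−i} R_i,

so recent rewards count more than old ones, which is what you want when the true values drift over time.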

SLIDE 21

Standard stochastic approximation convergence conditions
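
The conditions on the slide are the standard Robbins-Monro requirements on the step sizes α_n(a) that guarantee convergence with probability 1:

Σ_{n=1..∞} α_n(a) = ∞   and   Σ_{n=1..∞} α_n(a)² < ∞.

The sample-average step size α_n = 1/n satisfies both; a constant step size α violates the second, so the estimates never fully converge and instead keep tracking the most recent rewards, which is desirable in non-stationary problems.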

SLIDE 22

Optimistic Initial Values

SLIDE 23

Optimistic Initial Values

So far we have used Q1(a) = 0.

  • All methods so far depend on Q1(a), i.e., they are biased.
  • Suppose we initialize the action values optimistically (Q1(a) = 5), e.g., on the 10-armed testbed (with α = 0.1).

[Figure: % optimal action vs. steps (plays) on the 10-armed testbed, comparing optimistic greedy (Q1 = 5, ε = 0) with realistic ε-greedy (Q1 = 0, ε = 0.1); the optimistic method performs worse at first but better in the long run.]

SLIDE 24

Decaying ε-greedy
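
The schedule on the slide is not reproduced in the transcript. As a purely illustrative example (not necessarily the one shown), one simple choice is ε_t = min(1, c / t) for some constant c > 0, so exploration is heavy early on and vanishes as the value estimates improve; with a suitably chosen decay schedule, decaying ε-greedy achieves sublinear (logarithmic) total regret.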

SLIDE 25

Optimism in the face of uncertainty

SLIDE 26

Optimism in the face of uncertainty

SLIDE 27

Upper Confidence Bounds

SLIDE 28

Hoeffding’s Inequality
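
The statement on the slide is an image; as applied to bandit rewards (following Silver's lecture), Hoeffding's inequality says that for i.i.d. random variables X_1, ..., X_t in [0, 1] with sample mean X̄_t,

P[ E[X] > X̄_t + u ] ≤ exp( −2 t u² ).

Applied to the rewards of arm a, conditioned on it having been selected N_t(a) times:

P[ q(a) > Q_t(a) + U_t(a) ] ≤ exp( −2 N_t(a) U_t(a)² ).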

SLIDE 29

Calculating UCB
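
The calculation on the slide is an image; the usual derivation (again following Silver) picks a small probability p that the true value exceeds the upper confidence bound and solves

exp( −2 N_t(a) U_t(a)² ) = p   ⇒   U_t(a) = sqrt( −log p / ( 2 N_t(a) ) ).

Reducing p as more rewards are observed, e.g. p = t^{−4}, gives

U_t(a) = sqrt( 2 log t / N_t(a) ),

which shrinks for frequently tried arms and ensures the optimal action is eventually selected.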

SLIDE 30

Upper Confidence Bound (UCB) action selection

  • A clever way of reducing exploration over time
  • Focus on actions whose estimate has large degree of uncertainty
  • Estimate an upper bound on the true action values
  • Select the action with the largest (estimated) upper bound

[Figure: average reward vs. steps on the 10-armed testbed, comparing UCB (c = 2) with ε-greedy (ε = 0.1).]
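
A minimal sketch of this selection rule with the c·sqrt(ln t / N_t(a)) bonus from the book; Q, N, and t are assumed to be maintained exactly as in the earlier ε-greedy sketch, and this function simply replaces the ε-greedy choice inside that loop:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Return argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    untried = np.where(N == 0)[0]
    if untried.size > 0:
        return int(untried[0])             # try every arm once before using the bound
    bonus = c * np.sqrt(np.log(t) / N)     # uncertainty bonus shrinks as N(a) grows
    return int(np.argmax(Q + bonus))
```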

SLIDE 31

Complexity of UCB Algorithm

Theorem: The UCB algorithm achieves logarithmic asymptotic total regret,

lim_{t→∞} L_t ≤ 8 log t · Σ_{a | Δ_a > 0} Δ_a

SLIDE 32

UCB vs ε‐greedy on 10‐armed bandit

SLIDE 33

UCB vs ε‐greedy on 10‐armed bandit

SLIDE 34

Gradient‐Bandit Algorithms

  • Let Ht(a) be a learned preference for taking action a

[Figure: % optimal action vs. steps for gradient-bandit algorithms with α = 0.1 and α = 0.4, with and without a reward baseline; the variants that use a baseline perform markedly better.]
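
The update equations on slides 34-35 are images; from Sutton and Barto (Section 2.8), the preferences define a softmax policy and are updated by stochastic gradient ascent on the expected reward:

π_t(a) = Pr{ A_t = a } = e^{H_t(a)} / Σ_b e^{H_t(b)}

H_{t+1}(A_t) = H_t(A_t) + α ( R_t − R̄_t ) ( 1 − π_t(A_t) )
H_{t+1}(a)   = H_t(a)   − α ( R_t − R̄_t ) π_t(a)        for all a ≠ A_t

where R̄_t is the average of the rewards received so far and serves as the baseline.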

SLIDE 35

Derivation of gradient‐bandit algorithm

SLIDE 36

SLIDE 37

SLIDE 38

SLIDE 39

Summary Comparison of Bandit Algorithms

SLIDE 40

Conclusions

  • These are all simple methods
  • but they are complicated enough—we will build on them
  • we should understand them completely
  • there are still open questions
  • Our first algorithms that learn from evaluative feedback
  • and thus must balance exploration and exploitation
  • Our first algorithms that appear to have a goal

—that learn to maximize reward by trial and error

SLIDE 41

Our first dimensions!

  • Problems vs Solution Methods
  • Evaluative vs Instructive
  • Associative vs Non-associative

Bandits? Problem or Solution?

SLIDE 42

Problem space

                       Single State              Associative
Instructive feedback
Evaluative feedback

SLIDE 43

Problem space

                       Single State              Associative
Instructive feedback
Evaluative feedback    Bandits
                       (Function optimization)

SLIDE 44

Problem space

                       Single State              Associative
Instructive feedback                             Supervised learning
Evaluative feedback    Bandits
                       (Function optimization)

SLIDE 45

Problem space

                       Single State              Associative
Instructive feedback   Averaging                 Supervised learning
Evaluative feedback    Bandits
                       (Function optimization)

SLIDE 46

Problem space

                       Single State              Associative
Instructive feedback   Averaging                 Supervised learning
Evaluative feedback    Bandits                   Associative Search
                       (Function optimization)   (Contextual bandits)