

SLIDE 1

Class 3: Multi-Arm Bandits (Sutton and Barto, Chapter 2)


Based on Sutton's and Silver's slides

SLIDE 2

Multi-Arm Bandits

Sutton and Barto, Chapter 2

The simplest reinforcement learning problem

SLIDE 3

The Exploration/Exploitation Dilemma


Online decision-making involves a fundamental choice:

  • Exploitation: make the best decision given current information
  • Exploration: gather more information

The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions.

SLIDE 4

Examples


  • Restaurant Selection. Exploitation: go to your favourite restaurant. Exploration: try a new restaurant.
  • Online Banner Advertisements. Exploitation: show the most successful advert. Exploration: show a different advert.
  • Oil Drilling. Exploitation: drill at the best known location. Exploration: drill at a new location.
  • Game Playing. Exploitation: play the move you believe is best. Exploration: play an experimental move.

SLIDE 5

You are the algorithm! (bandit1)

SLIDE 6

The k-armed Bandit Problem

  • On each of a sequence of time steps, t = 1, 2, 3, …, you choose an action A_t from k possibilities, and receive a real-valued reward R_t
  • The reward depends only on the action taken; each action a has a true value q*(a), its expected reward
  • These true values are unknown, and so are the reward distributions
  • Nevertheless, you must maximize your total reward
  • You must both try actions to learn their values (explore), and prefer those that appear best (exploit)
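To make the setup concrete, here is a minimal sketch of a k-armed bandit in Python. It assumes Gaussian rewards, as in the 10-armed testbed that appears later in these slides; the class name and defaults are illustrative, not from the source.

    import numpy as np

    class GaussianBandit:
        """A stationary k-armed bandit: arm a pays out rewards ~ N(q_star[a], 1)."""

        def __init__(self, k=10, rng=None):
            self.rng = rng or np.random.default_rng()
            self.k = k
            # True action values q*(a), drawn once per bandit task.
            self.q_star = self.rng.normal(0.0, 1.0, size=k)

        def pull(self, a):
            # Reward for choosing arm a: mean q*(a), unit variance.
            return self.rng.normal(self.q_star[a], 1.0)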

SLIDE 7

The Exploration/Exploitation Dilemma

SLIDE 8

Regret


The action value is the mean reward for action a:

  • q*(a) = E[ r | a ]

The optimal value V* is:

  • V* = q*(a*) = max_{a ∈ A} q*(a)

The regret is the opportunity loss for one step:

  • l_t = E[ V* − q*(a_t) ]

The total regret is the total opportunity loss:

  • L_t = E[ Σ_{τ=1}^{t} ( V* − q*(a_τ) ) ]
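Since a simulator knows the true values, regret is easy to measure in experiments. A small sketch, reusing the hypothetical GaussianBandit class above:

    def total_regret(bandit, actions):
        """Total regret L_t of a sequence of chosen arms, given known true values."""
        v_star = bandit.q_star.max()
        # Per-step regret l_t is the gap between V* and the chosen arm's value.
        return sum(v_star - bandit.q_star[a] for a in actions)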

SLIDE 9

Multi-Armed Bandits: Regret

SLIDE 10

Multi-Armed Bandits: Regret

[Figure: total regret vs. time-steps for the greedy, ε-greedy, and decaying ε-greedy strategies.]

If an algorithm explores forever, it will have linear total regret. If an algorithm never explores, it will also have linear total regret. Is it possible to achieve sublinear total regret?
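One route to sublinear regret is to decay ε over time. A minimal sketch, continuing with numpy as np from the earlier sketch; the 1/t schedule below is a common illustrative choice, not a tuned schedule from the source:

    def epsilon_greedy_decaying(Q, t, rng, c=1.0):
        """Pick an arm ε-greedily with a decaying schedule ε_t = min(1, c/t)."""
        eps = min(1.0, c / max(t, 1))
        if rng.random() < eps:
            return int(rng.integers(len(Q)))  # explore
        return int(np.argmax(Q))              # exploit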

SLIDE 11

Complexity of regret


SLIDE 12

Overview

  • Action-value methods

    – Epsilon-greedy strategy
    – Incremental implementation
    – Stationary vs. non-stationary environments
    – Optimistic initial values

  • UCB action selection
  • Gradient bandit algorithms
  • Associative search (contextual bandits)


SLIDE 13

Basics

  • Maximize total reward collected
    – vs. learning an (optimal) policy, as in full RL
  • An episode is one step
  • The problem is a complex function of:
    – True values
    – Uncertainty
    – Number of time steps
    – Stationary vs. non-stationary?


SLIDE 14

Action-Value Methods
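The slide body is an image. For reference, the sample-average method it covers (Sutton and Barto, Ch. 2) estimates each action's value as the mean of the rewards received when that action was taken, and greedy selection takes the argmax:

\[
Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a]},
\qquad
A_t \doteq \operatorname*{arg\,max}_a Q_t(a).
\]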

SLIDE 15

ε-Greedy Action Selection

  • In greedy action selection, you always exploit
  • In ε-greedy, you are usually greedy, but with probability ε you instead pick an action at random (possibly the greedy action again)
  • This is perhaps the simplest way to balance exploration and exploitation
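A minimal sketch of the selection rule (the function name is illustrative), continuing with numpy as np from the earlier sketches:

    def epsilon_greedy(Q, epsilon, rng):
        """Greedy arm with probability 1 − ε; uniformly random arm with probability ε."""
        if rng.random() < epsilon:
            # Explore: the random draw may re-pick the greedy arm.
            return int(rng.integers(len(Q)))
        return int(np.argmax(Q))  # exploit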

SLIDE 16

A simple bandit algorithm
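The slide shows the pseudocode box from Sutton and Barto (Ch. 2): initialize Q(a) = 0 and N(a) = 0, then repeatedly select an arm ε-greedily, observe the reward, and update the sample average incrementally. A direct Python transcription, reusing the GaussianBandit and epsilon_greedy sketches above:

    def simple_bandit(bandit, epsilon=0.1, steps=1000, rng=None):
        """ε-greedy bandit with incremental sample-average updates."""
        rng = rng or np.random.default_rng()
        Q = np.zeros(bandit.k)  # value estimates Q(a)
        N = np.zeros(bandit.k)  # action counts N(a)
        rewards = []
        for _ in range(steps):
            a = epsilon_greedy(Q, epsilon, rng)
            r = bandit.pull(a)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]  # incremental sample-average update
            rewards.append(r)
        return Q, rewards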

SLIDE 17

One bandit task from the 10-armed testbed

[Figure: reward distribution of each action a = 1, …, 10, with the true values q*(1), …, q*(10) marked; x-axis: Action, y-axis: Reward distribution.]

Run for 1000 steps. Repeat the whole thing 2000 times with different bandit tasks.

Figure 2.1: An example bandit problem from the 10-armed testbed. The true value q*(a) of each of the ten actions was selected according to a normal distribution with mean zero and unit variance, and then the actual rewards were selected according to a normal distribution with mean q*(a) and unit variance, as suggested by these gray distributions.
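A sketch of the experimental protocol just described, reusing the earlier sketches (2000 independent bandit tasks, 1000 steps each, averaging the reward at each step):

    def run_testbed(num_tasks=2000, steps=1000, epsilon=0.1, seed=0):
        """Average per-step reward over many independent 10-armed bandit tasks."""
        rng = np.random.default_rng(seed)
        avg_reward = np.zeros(steps)
        for _ in range(num_tasks):
            bandit = GaussianBandit(k=10, rng=rng)
            _, rewards = simple_bandit(bandit, epsilon=epsilon, steps=steps, rng=rng)
            avg_reward += rewards
        return avg_reward / num_tasks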

SLIDE 18

ε-Greedy Methods on the 10-Armed Testbed

SLIDE 19

Averaging ⟶ learning rule

  • To simplify notation, let us focus on one action
  • We consider only its rewards, and its estimate Q_n after its first n − 1 rewards:

      Q_n = (R_1 + R_2 + · · · + R_{n−1}) / (n − 1)

  • How can we do this incrementally (without storing all the rewards)?
  • We could store a running sum and count (and divide), or, equivalently, derive an incremental update (next slide)

SLIDE 20

Derivation of incremental update
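The slide body is an image; the standard derivation (Sutton and Barto, Sec. 2.4) takes the sample average and peels off the latest reward:

\[
Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i
        = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big)
        = \frac{1}{n}\big(R_n + (n-1)\,Q_n\big)
        = Q_n + \frac{1}{n}\big(R_n - Q_n\big),
\]

an instance of the general pattern NewEstimate ← OldEstimate + StepSize · (Target − OldEstimate).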

SLIDE 21

Tracking a Non-stationary Problem
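The slide body is an image. The key idea in Sutton and Barto is to replace the 1/n step size with a constant α ∈ (0, 1], which turns the estimate into an exponential recency-weighted average. A sketch of the update:

    def constant_alpha_update(Q, a, r, alpha=0.1):
        """Track a non-stationary arm: recent rewards outweigh old ones.

        Unrolling the recursion gives
          Q_{n+1} = (1 − α)^n Q_1 + Σ_i α (1 − α)^(n−i) R_i,
        so the weight on reward R_i decays exponentially with its age.
        """
        Q[a] += alpha * (r - Q[a])
        return Q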

SLIDE 22

Standard stochastic approximation convergence conditions
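For reference, the conditions the slide title refers to: the step sizes α_n(a) assure convergence with probability 1 when

\[
\sum_{n=1}^{\infty} \alpha_n(a) = \infty
\qquad\text{and}\qquad
\sum_{n=1}^{\infty} \alpha_n^2(a) < \infty .
\]

The sample-average step size α_n = 1/n satisfies both; a constant α violates the second, so the estimate never fully converges, which is exactly what we want for tracking non-stationary problems.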

SLIDE 23

Optimistic Initial Values

So far we have used Q_1(a) = 0.

  • All methods so far depend on Q_1(a), i.e., they are biased.
  • Suppose instead we initialize the action values optimistically (Q_1(a) = 5), e.g., on the 10-armed testbed (with α = 0.1).

[Figure: % Optimal action vs. Steps (1000 plays) on the 10-armed testbed: optimistic greedy (Q_1 = 5, ε = 0) vs. realistic ε-greedy (Q_1 = 0, ε = 0.1).]
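In code, optimism is just the initialization; a sketch of why it forces early exploration:

    k = 10
    Q = np.full(k, 5.0)  # optimistic: every arm initially looks better than any q*(a)
    # Even a purely greedy learner now explores: each pull drags Q[a] down
    # toward q*(a), so arms that have not yet "disappointed" keep getting tried.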

SLIDE 24

Upper Confidence Bound (UCB) action selection

  • A clever way of reducing exploration over time
  • Focus on actions whose estimate has large degree of uncertainty
  • Estimate an upper bound on the true action values
  • Select the action with the largest (estimated) upper bound

[Figure: Average reward vs. Steps on the 10-armed testbed: UCB (c = 2) vs. ε-greedy (ε = 0.1).]
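The selection rule behind the UCB curve (Sutton and Barto, Ch. 2) adds an uncertainty bonus to each estimate, A_t = argmax_a [ Q_t(a) + c √(ln t / N_t(a)) ]. A sketch:

    def ucb_select(Q, N, t, c=2.0):
        """UCB: argmax_a Q[a] + c * sqrt(ln t / N[a]); untried arms go first."""
        if np.any(N == 0):
            return int(np.argmin(N))        # each arm is tried once before UCB applies
        bonus = c * np.sqrt(np.log(t) / N)  # shrinks as N[a] grows
        return int(np.argmax(Q + bonus))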

SLIDE 25

Theorem

The UCB algorithm achieves logarithmic asymptotic total regret:

  lim_{t→∞} L_t ≤ 8 log t · Σ_{a | Δ_a > 0} Δ_a

where Δ_a = V* − q*(a) is the gap of action a.

Complexity of the UCB Algorithm

SLIDE 26

Gradient-Bandit Algorithms

  • Let H_t(a) be a learned preference for taking action a; the update rule is sketched after the figure below

[Figure: % Optimal action vs. Steps (1000) on the 10-armed testbed for the gradient bandit with α = 0.1 and α = 0.4, with and without a reward baseline.]
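A sketch of the algorithm the figure compares: actions are sampled from a softmax over the preferences H_t(a), and preferences move according to whether the reward beats a running baseline (the average reward so far):

    def gradient_bandit_step(H, baseline, bandit, alpha=0.1, rng=None):
        """One step of the gradient-bandit algorithm with a reward baseline."""
        rng = rng or np.random.default_rng()
        pi = np.exp(H - H.max())
        pi /= pi.sum()                 # softmax policy π_t over preferences
        a = rng.choice(len(H), p=pi)
        r = bandit.pull(a)
        onehot = np.zeros(len(H))
        onehot[a] = 1.0
        # Chosen action's preference moves with (r − baseline); the others move opposite.
        H += alpha * (r - baseline) * (onehot - pi)
        return H, r

The caller maintains the baseline, e.g. incrementally as baseline += (r − baseline) / t.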

SLIDE 27

Derivation of gradient-bandit algorithm
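The slide body is an image; the punchline of the derivation (Sutton and Barto, Sec. 2.8) is that the update above is stochastic gradient ascent on the expected reward, because

\[
\frac{\partial\,\mathbb{E}[R_t]}{\partial H_t(a)}
= \mathbb{E}\big[(R_t - \bar R_t)\,(\mathbb{1}[A_t = a] - \pi_t(a))\big],
\]

so sampling the right-hand side gives the update H_{t+1}(a) = H_t(a) + α (R_t − R̄_t)(1[A_t = a] − π_t(a)).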

SLIDE 28
SLIDE 29
SLIDE 30
SLIDE 31

Summary Comparison of Bandit Algorithms

SLIDE 32

Conclusions

  • These are all simple methods
  • but they are complicated enough that we will build on them
  • we should understand them completely
  • there are still open questions
  • Our first algorithms that learn from evaluative feedback, and thus must balance exploration and exploitation
  • Our first algorithms that appear to have a goal: they learn to maximize reward by trial and error