Multi-Arm Bandits
Sutton and Barto, Chapter 2 (Sutton's slides and Silver's slides)
The simplest reinforcement learning problem

The Exploration/Exploitation Dilemma
Online decision-making involves a fundamental choice:
Exploitation: make the best decision given current information.
Exploration: gather more information.
The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions.
Examples:
Restaurant selection. Exploitation: go to your favourite restaurant. Exploration: try a new restaurant.
Online banner advertisements. Exploitation: show the most successful advert. Exploration: show a different advert.
Oil drilling. Exploitation: drill at the best known location. Exploration: drill at a new location.
Game playing. Exploitation: play the move you believe is best. Exploration: play an experimental move.
Multi-Armed Bandits: Regret
The regret of an action is the opportunity loss relative to the optimal action; the total regret L_t is the expected sum of these losses over the first t time-steps. Maximizing cumulative reward is equivalent to minimizing total regret.
[Figure: total regret vs. time-steps for greedy, ε-greedy, and decaying ε-greedy]
If an algorithm explores forever, it will have linear total regret.
If an algorithm never explores, it will also have linear total regret.
Is it possible to achieve sublinear total regret?
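As an illustration (not from the slides), here is a minimal simulation sketch of why these statements hold: constant exploration keeps paying the gap to the best arm, while a decaying schedule can cut the exploration cost over time. The three-armed bandit, the 10/(t+1) decay schedule, and all names below are assumptions made for this example.

import numpy as np

def run_bandit(epsilon_schedule, q_true, steps=10_000, seed=0):
    """Epsilon-greedy on a Gaussian bandit; returns cumulative expected regret."""
    rng = np.random.default_rng(seed)
    k = len(q_true)
    Q = np.zeros(k)                          # sample-average value estimates
    N = np.zeros(k)                          # action counts
    best = np.max(q_true)
    total, regret = 0.0, np.zeros(steps)
    for t in range(steps):
        if rng.random() < epsilon_schedule(t):
            a = int(rng.integers(k))         # explore: random arm
        else:
            a = int(np.argmax(Q))            # exploit: current best estimate
        r = rng.normal(q_true[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample average
        total += best - q_true[a]            # expected regret of the chosen arm
        regret[t] = total
    return regret

q_true = np.array([0.1, 0.3, 0.9])
print(run_bandit(lambda t: 0.0, q_true)[-1])                     # greedy: linear regret
print(run_bandit(lambda t: 0.1, q_true)[-1])                     # fixed epsilon: linear regret
print(run_bandit(lambda t: min(1.0, 10 / (t + 1)), q_true)[-1])  # decaying epsilon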
A simple bandit algorithm
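The pseudocode box itself did not survive the export. Below is a hedged Python rendering of the simple bandit algorithm from Sutton and Barto, Chapter 2: ε-greedy action selection with incrementally computed sample-average value estimates. The bandit argument and the default parameter values are placeholders.

import numpy as np

def simple_bandit(k, bandit, epsilon=0.1, steps=1000, seed=0):
    """Initialize Q(a) = 0 and N(a) = 0; repeat: pick a (mostly) greedy action,
    observe a reward, and update Q(a) by an incremental sample average."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(k)                       # action-value estimates, Q1(a) = 0
    N = np.zeros(k)                       # times each action has been selected
    for _ in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))      # explore
        else:
            A = int(np.argmax(Q))         # exploit
        R = bandit(A)                     # sample a reward for action A
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]         # incremental sample-average update
    return Q, N

# Example use on an assumed 10-armed Gaussian bandit:
q_star = np.random.default_rng(1).normal(0.0, 1.0, 10)
Q, N = simple_bandit(10, lambda a: np.random.normal(q_star[a], 1.0))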
One bandit task from the 10-armed testbed: run for 1000 steps, then repeat the whole thing 2000 times with different bandit tasks, averaging performance over the runs.
Figure 2.1: An example bandit problem from the 10-armed testbed. The true value q*(a) of each of the ten actions was selected according to a normal distribution with mean zero and unit variance, and then the actual rewards were selected according to a normal distribution with mean q*(a) and unit variance, as suggested by the gray distributions.
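A minimal sketch of this testbed protocol in Python: q*(a) is drawn from N(0, 1) for each task, rewards are drawn from N(q*(a), 1), each run lasts 1000 steps, and the whole thing is repeated over 2000 tasks. The ε-greedy agent and ε = 0.1 here are assumptions; the book compares several settings on this testbed.

import numpy as np

rng = np.random.default_rng(0)
runs, steps, k, eps = 2000, 1000, 10, 0.1
optimal_action = np.zeros(steps)              # fraction of runs picking the best arm
for _ in range(runs):
    q_star = rng.normal(0.0, 1.0, k)          # true values q*(a) ~ N(0, 1)
    best = int(np.argmax(q_star))
    Q, N = np.zeros(k), np.zeros(k)
    for t in range(steps):
        A = int(rng.integers(k)) if rng.random() < eps else int(np.argmax(Q))
        R = rng.normal(q_star[A], 1.0)        # reward ~ N(q*(A), 1)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]
        optimal_action[t] += (A == best)
optimal_action /= runs                        # average "% optimal action" curve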
Estimating action values by sample averages:
Q_n = (R_1 + R_2 + ... + R_{n-1}) / (n - 1)
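This average need not be recomputed from all stored rewards. The standard incremental rewriting used throughout the chapter is
Q_{n+1} = Q_n + (1/n) (R_n - Q_n),
i.e., NewEstimate = OldEstimate + StepSize * (Target - OldEstimate). Replacing 1/n with a constant step size α gives Q_{n+1} = Q_n + α (R_n - Q_n), an exponential recency-weighted average that is preferred for nonstationary problems.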
Optimistic initial values: so far we have used Q1(a) = 0. Initializing the estimates optimistically instead, e.g. Q1(a) = +5, encourages early exploration even for a purely greedy method.
[Figure: % optimal action vs. steps (plays) on the 10-armed testbed with constant step size α = 0.1, comparing optimistic greedy (Q1 = 5, ε = 0) with realistic ε-greedy (Q1 = 0, ε = 0.1)]
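A hedged sketch of the two settings in that figure; the loop mirrors the earlier sketches, and only the initial value Q1 and the constant step size α = 0.1 change.

import numpy as np

def run(q_star, Q1, eps, alpha=0.1, steps=1000, seed=0):
    """Constant-step-size epsilon-greedy agent with configurable initial values."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.full(k, float(Q1))                 # optimistic if Q1 exceeds the true values
    best = int(np.argmax(q_star))
    hits = np.zeros(steps)
    for t in range(steps):
        A = int(rng.integers(k)) if rng.random() < eps else int(np.argmax(Q))
        R = rng.normal(q_star[A], 1.0)
        Q[A] += alpha * (R - Q[A])            # constant step size, alpha = 0.1
        hits[t] = (A == best)
    return hits

q_star = np.random.default_rng(1).normal(0.0, 1.0, 10)
optimistic = run(q_star, Q1=5.0, eps=0.0)     # Q1 = 5, epsilon = 0
realistic  = run(q_star, Q1=0.0, eps=0.1)     # Q1 = 0, epsilon = 0.1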
[Figure: average reward vs. steps on the 10-armed testbed, UCB (c = 2) vs. ε-greedy (ε = 0.1)]
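UCB selects the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)), an optimism-in-the-face-of-uncertainty bonus that shrinks as an action is tried more often. A minimal sketch of that selection rule (an action with N(a) = 0 counts as maximizing and is taken first):

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Upper-Confidence-Bound action selection: argmax_a Q(a) + c*sqrt(ln t / N(a))."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])               # untried actions are maximizing
    bonus = c * np.sqrt(np.log(t) / N)       # exploration bonus shrinks with N(a)
    return int(np.argmax(Q + bonus))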
Theorem: The UCB algorithm achieves logarithmic asymptotic total regret:
lim_{t→∞} L_t ≤ 8 log t · Σ_{a : Δ_a > 0} Δ_a
where Δ_a is the gap between the optimal action's value and the value of action a.
[Figure: % optimal action vs. steps for the gradient bandit algorithm on the 10-armed testbed, with and without a reward baseline, for step sizes α = 0.1 and α = 0.4]
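The figure refers to the gradient bandit algorithm: it maintains numerical preferences H(a), chooses actions with softmax probabilities π(a), and moves preferences up or down according to whether the reward is above or below a baseline (the running average of all rewards so far). A minimal sketch following the update rule in Sutton and Barto; the testbed and parameter values are assumptions.

import numpy as np

def gradient_bandit(q_star, alpha=0.1, steps=1000, use_baseline=True, seed=0):
    """Softmax over preferences H; update H toward rewards above/below a
    running average-reward baseline (gradient bandit algorithm)."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    H = np.zeros(k)                          # action preferences
    baseline = 0.0                           # average of all rewards so far
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                       # softmax action probabilities
        A = int(rng.choice(k, p=pi))
        R = rng.normal(q_star[A], 1.0)
        if use_baseline:
            baseline += (R - baseline) / t
        one_hot = np.zeros(k)
        one_hot[A] = 1.0
        H += alpha * (R - baseline) * (one_hot - pi)   # preference update
    return H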
Bandit algorithms are methods that learn to maximize reward by trial and error.
Bandits in context:

                       Single State               Associative
Instructive feedback   Averaging                  Supervised learning
Evaluative feedback    Bandits                    Associative search
                       (Function optimization)    (Contextual bandits)