Exploration/Exploitation in Multi-armed Bandits
Deep Reinforcement Learning and Control Katerina Fragkiadaki
Carnegie Mellon School of Computer Science Spring 2019, CMU 10-403
Used Materials Disclaimer: Some of the material and slides are borrowed from Russ Salakhutdinov, who in turn borrowed from Rich Sutton’s class and David Silver’s class on Reinforcement Learning.
Supervision suggests correct actions. Reward, in contrast, only provides a signal of whether the actions the agent selects are good or bad; it does not say which actions are correct, or even how far the selected actions are from the optimal ones!
Agent and environment interact at discrete time steps $t = 0, 1, 2, 3, \ldots$. The agent observes the state at step $t$, $S_t \in \mathcal{S}$, produces an action at step $t$, $A_t \in \mathcal{A}(S_t)$, gets the resulting reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and the resulting next state, $S_{t+1} \in \mathcal{S}$, yielding the trajectory
$$S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, S_{t+3}, \ldots$$
[Figure: agent-environment interaction loop. The agent sends action $A_t$ to the environment; the environment returns reward $R_{t+1}$ and next state $S_{t+1}$.]
A closer look at exploration-exploitation balancing in a simplified RL setup
In the multi-armed bandit setup the interaction is the same, except that the state does not change: the agent observes the single fixed state $S_t = s$, and the experience reduces to the sequence
$$A_t, R_{t+1}, A_{t+1}, R_{t+2}, A_{t+2}, R_{t+3}, \ldots$$
[Figure: the same agent-environment loop, with a fixed state.]
One-armed bandit = slot machine (English slang)
At each timestep $t$ the agent chooses one of the $K$ arms and plays it. The $i$-th arm, when played at timestep $t$, produces a reward $r_{i,t}$ drawn from a probability distribution $\mathcal{Q}_i$ with mean $\mu_i$. The agent knows neither the arms' reward distributions nor their means.
Agent's objective: maximize the cumulative reward collected over time, which requires identifying and playing the arm with the highest mean reward.
Alternative notation for the mean arm rewards: $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a], \; \forall a \in \{1, \ldots, k\}$.
[Figure: three slot machines (Bernoulli bandits) with win probabilities 0.6, 0.4, and 0.45.]
Recall: the Bernoulli distribution is the discrete probability distribution of a random variable that takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$, i.e., the probability distribution of any single experiment that asks a yes-no question.
[Figure: one bandit task from the 10-armed testbed. Each action $a \in \{1, \ldots, 10\}$ has its own reward distribution, a Gaussian with mean $q_*(a)$ and unit variance: $R_t \sim \mathcal{N}(q_*(a), 1)$.]
Netflix artwork: for a particular movie, we want to decide which image to show (to all Netflix users). Each arm is a candidate image; at each timestep some user (not necessarily the same user) is shown the image, and the reward is whether the user clicks the title and watches the movie (so the image should be informative, not clickbait).
We maintain action-value estimates $Q_t(a) \approx q_*(a), \; \forall a$, and define the greedy action
$$A_t^* \doteq \arg\max_a Q_t(a).$$
If $A_t = A_t^*$ you are exploiting; if $A_t \neq A_t^*$ you are exploring. You cannot do both at any single timestep, and as the estimates improve we would like to explore less with time.
The action value (expected reward) of action $a$: $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a], \; \forall a \in \{1, \ldots, k\}$.
The optimal value: $v_* = q_*(a^*) = \max_{a \in \mathcal{A}} q_*(a)$.
Maximizing cumulative reward is equivalent to minimizing regret ($\text{reward} = -\text{regret}$).
The instantaneous regret at step $t$: $I_t = \mathbb{E}[\, v_* - q_*(a_t) \,]$.
The total regret up to time $T$: $L_T = \mathbb{E}\!\left[\sum_{t=1}^{T} \big( v_* - q_*(a_t) \big)\right]$.
Define the gap of action $a$ relative to the optimal action $a^*$: $\Delta_a = v_* - q_*(a)$. With $N_T(a)$ the number of times action $a$ has been selected up to time $T$, the total regret decomposes as
$$L_T = \mathbb{E}\!\left[\sum_{t=1}^{T} \big( v_* - q_*(a_t) \big)\right] = \sum_{a \in \mathcal{A}} \mathbb{E}[N_T(a)]\,\big(v_* - q_*(a)\big) = \sum_{a \in \mathcal{A}} \mathbb{E}[N_T(a)]\,\Delta_a.$$
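As a quick worked illustration of this decomposition (using the Bernoulli arms with means 0.6, 0.4, 0.45 from the earlier example, and hypothetical pull counts), suppose that over $T = 1000$ steps the agent pulls arm 2 a hundred times and arm 3 fifty times in expectation. Then $v_* = 0.6$, the gaps are $\Delta_1 = 0$, $\Delta_2 = 0.2$, $\Delta_3 = 0.15$, and
$$L_T = \sum_{a} \mathbb{E}[N_T(a)]\,\Delta_a = 850 \cdot 0 + 100 \cdot 0.2 + 50 \cdot 0.15 = 27.5.$$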
Sample-average estimate of the action value:
$$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}},$$
where $N_t(a) = \sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}$ is the number of times action $a$ has been taken by time $t$. If the action is taken an infinite number of times, the estimate converges to the true value:
$$\lim_{N_t(a) \to \infty} Q_t(a) = q_*(a).$$
Write the sample average of an action's first $n-1$ rewards as
$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n - 1}.$$
Instead of storing all past rewards, the estimate can be updated incrementally:
$$Q_{n+1} = Q_n + \frac{1}{n}\big[ R_n - Q_n \big],$$
an instance of the general rule
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\big[ \text{Target} - \text{OldEstimate} \big],$$
where the bracketed term is the error of the current estimate.
Derivation:
$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\Big( R_n + \sum_{i=1}^{n-1} R_i \Big) = \frac{1}{n}\Big( R_n + (n-1)\,\frac{1}{n-1}\sum_{i=1}^{n-1} R_i \Big) = \frac{1}{n}\big( R_n + (n-1) Q_n \big) = \frac{1}{n}\big( R_n + n Q_n - Q_n \big) = Q_n + \frac{1}{n}\big[ R_n - Q_n \big].$$
With a constant step size $\alpha \in (0, 1]$ instead of $1/n$:
$$Q_{n+1} \doteq Q_n + \alpha\big[ R_n - Q_n \big] = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i.$$
The smaller the $i$, the smaller the multiplier $\alpha(1-\alpha)^{n-i}$, so earlier rewards are forgotten.
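A minimal sketch of both update rules in code (the function names and the toy reward stream are illustrative, not from the slides):

```python
import numpy as np

# Incremental action-value updates, matching the formulas above:
#   sample average   Q_{n+1} = Q_n + (1/n)[R_n - Q_n]
#   constant alpha   Q_{n+1} = Q_n + alpha [R_n - Q_n]

def update_sample_average(q, n, reward):
    """Exact running mean of the first n rewards, stored incrementally."""
    n += 1
    q += (reward - q) / n
    return q, n

def update_constant_alpha(q, reward, alpha=0.1):
    """Exponential recency-weighted average: old rewards are gradually forgotten."""
    return q + alpha * (reward - q)

# Sanity check: the sample-average update reproduces np.mean of the reward stream.
rewards = np.random.default_rng(0).normal(loc=0.5, scale=1.0, size=1000)
q, n = 0.0, 0
for r in rewards:
    q, n = update_sample_average(q, n, r)
print(q, rewards.mean())   # the two numbers should agree up to floating-point error
```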
We have seen how to form estimates of the bandits' mean rewards. Next we discuss the action-selection strategy (the policy).
A first strategy, the greedy method with an initial exploration phase (a code sketch follows the list below):
1. Allocate a fixed time period to exploration, during which you try the bandits uniformly at random.
2. Estimate the mean rewards of all actions: $Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{t-1} r_i \mathbf{1}(A_i = a)$.
3. Select the action that is optimal for the estimated mean rewards given all data thus far, breaking ties at random: $a_t = \arg\max_{a \in \mathcal{A}} Q_t(a)$.
4. GOTO 3.
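As promised, a minimal sketch of this explore-then-greedy scheme, assuming Bernoulli arms; the arm means, step counts, and variable names are illustrative:

```python
import numpy as np

# Explore uniformly for a fixed period, then act greedily on the running estimates.
rng = np.random.default_rng(0)
true_means = np.array([0.6, 0.4, 0.45])   # hidden from the agent
k, explore_steps, total_steps = len(true_means), 30, 1000

Q = np.zeros(k)   # estimated mean reward per arm
N = np.zeros(k)   # number of pulls per arm

for t in range(total_steps):
    if t < explore_steps:
        a = int(rng.integers(k))                           # step 1: uniform exploration
    else:
        a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # step 3: greedy, random tie-break
    r = float(rng.random() < true_means[a])                # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                              # step 2: incremental mean estimate

print("estimates:", Q.round(3), " pulls:", N.astype(int))
```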
After the fixed exploration period we have formed the following reward estimates: $Q_t(a_1) = 0.3$, $Q_t(a_2) = 0.5$, $Q_t(a_3) = 0.1$. Q: Will the greedy method always pick the second action?
ε-greedy: with probability $1 - \varepsilon$ pick the greedy action (exploitation); with probability $\varepsilon$ instead pick an action at random, possibly the greedy action again (exploration).
A simple bandit algorithm
Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0
Repeat forever:
    A ← argmax_a Q(a) with probability 1 − ε (breaking ties randomly), or a random action with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1 / N(A)) [R − Q(A)]
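A minimal Python sketch of this ε-greedy loop on the 10-armed Gaussian testbed; the choices of ε, the number of steps, and the function name are illustrative:

```python
import numpy as np

# epsilon-greedy on a k-armed Gaussian bandit: q*(a) ~ N(0,1), R_t ~ N(q*(a), 1).
def eps_greedy_bandit(q_star, eps=0.1, steps=1000, rng=None):
    rng = rng or np.random.default_rng()
    k = len(q_star)
    Q, N = np.zeros(k), np.zeros(k)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))                           # explore: random action
        else:
            a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # exploit: greedy, random tie-break
        r = rng.normal(q_star[a], 1.0)                         # R ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                              # incremental sample average
        total_reward += r
    return Q, N, total_reward / steps

q_star = np.random.default_rng(0).normal(0.0, 1.0, size=10)   # q*(a) ~ N(0, 1)
print("average reward:", round(eps_greedy_bandit(q_star, eps=0.1)[2], 3))
```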
We sample 10-armed bandit instantiations with $q_*(a) \sim \mathcal{N}(0, 1)$ and $R_t \sim \mathcal{N}(q_*(a), 1)$, and compare $\varepsilon = 0$ (greedy), $\varepsilon = 0.01$, and $\varepsilon = 0.1$.
[Figure: average reward over the first 1000 steps for the three methods.]
Q: In the limit (after an infinite number of steps), which method will result in the largest average reward?
[Figure: % optimal action over the first 1000 steps for the three methods.]
Q: Which method will find the optimal action in the limit?
Q: Does the performance of these methods depend on the initialization of the action-value estimates?
Greedy action selection can get stuck playing a suboptimal action if rewards are stochastic. One remedy is optimistic initial values: initialize the action-value estimates optimistically, so that the running estimate is just an incremental estimate that includes one 'hallucinated' initial optimistic value:
$$Q_t(a_t) = Q_{t-1}(a_t) + \frac{1}{N_t(a_t)} \big( r_t - Q_{t-1}(a_t) \big).$$
We initialize with the following reward estimates for Bernoulli bandits: $Q_t(a_1) = 1$, $Q_t(a_2) = 1$, $Q_t(a_3) = 1$. Q: When is it possible that greedy action selection will not try out all the actions?
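A minimal sketch of optimistic initialization with purely greedy selection on the 10-armed Gaussian testbed; the initial value $Q_1 = 5$ matches the figure below, while the hallucinated-sample bookkeeping and other names are illustrative assumptions:

```python
import numpy as np

# Greedy selection (eps = 0) still explores early if every estimate starts at an
# optimistic value; the initial value is treated as one 'hallucinated' observation.
rng = np.random.default_rng(0)
q_star = rng.normal(0.0, 1.0, size=10)   # q*(a) ~ N(0, 1)
k, steps, q1 = len(q_star), 1000, 5.0

Q = np.full(k, q1)    # optimistic initial estimates
N = np.ones(k)        # each arm starts with one hallucinated sample

for _ in range(steps):
    a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # purely greedy, random tie-break
    r = rng.normal(q_star[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]    # each real reward pulls the estimate down toward the truth

print("every arm tried at least once:", bool(np.all(N > 1)))
```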
[Figure: % optimal action over 1000 plays on the 10-armed testbed ($q_*(a) \sim \mathcal{N}(0, 1)$, $R_t \sim \mathcal{N}(q_*(a), 1)$), comparing optimistic greedy ($Q_1(a) = 5$, $\varepsilon = 0$) with realistic ε-greedy ($Q_1(a) = 0$, $\varepsilon = 0.1$).]
To achieve that, we need to reason about the uncertainty of our action-value estimates. Upper Confidence Bound (UCB) action selection picks the action with the largest estimated upper confidence bound on the mean reward conditioned on selecting action $a$; the bound shrinks the more often an action has been selected.
[Figure: per-action estimated mean and estimated upper confidence bound.]
$$A_t \doteq \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \,\right]$$
(note: $c$ is a hyper-parameter that trades off exploration and exploitation)
[Figure: average reward over steps on the 10-armed testbed, UCB with $c = 2$ vs ε-greedy with $\varepsilon = 0.1$.]
The problem with using only mean estimates is that we cannot reason about uncertainty. For example: arm 1 has 1000 pulls and 600 wins, so $Q_t(a_1) = 0.6$; arm 2 has 1000 pulls and 400 wins, so $Q_t(a_2) = 0.4$; arm 3 has 10 pulls and 4 wins, so $Q_t(a_3) = 0.4$. Arms 2 and 3 have the same mean estimate, but the estimate for arm 3 is far less certain.
Bayes' rule enables us to reverse conditional probabilities:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
(Slide from Nando de Freitas)
The doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don't have the disease). The good news is that this is a rare disease, striking only 1 in 10,000 people. What are the chances that you actually have the disease?
The test is 99% accurate: $P(T{=}1 \mid D{=}1) = 0.99$ and $P(T{=}0 \mid D{=}0) = 0.99$, where $T$ denotes the test result and $D$ the disease. The disease affects 1 in 10,000 people: $P(D{=}1) = 0.0001$. Applying Bayes' rule:
$$P(D{=}1 \mid T{=}1) = \frac{P(T{=}1 \mid D{=}1)\, P(D{=}1)}{P(T{=}1 \mid D{=}1)\, P(D{=}1) + P(T{=}1 \mid D{=}0)\, P(D{=}0)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx 0.0098.$$
Despite the positive test, the probability that you actually have the disease is below 1%.
The Bayesian recipe. Step 1: Given $n$ data points, $\mathcal{D} = x_{1:n} = \{x_1, x_2, \ldots, x_n\}$, write down the expression for the likelihood $p(\mathcal{D} \mid \theta)$. Step 2: Specify a prior $p(\theta)$. Step 3: Compute the posterior
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}.$$
Let's consider a Beta distribution prior over the mean rewards of the Bernoulli bandits: $\mathrm{Beta}(\alpha, \beta)$. Its mean is $\frac{\alpha}{\alpha + \beta}$, and the larger $\alpha + \beta$ is, the more concentrated the distribution.
Computing the posterior $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$ under this prior, the posterior is also a Beta, because the Beta is the conjugate prior for the Bernoulli distribution. A closed-form solution for the Bayesian update is possible only for conjugate prior-likelihood pairs: after observing $a$ successes and $b$ failures, the posterior is $\mathrm{Beta}(\alpha + a, \beta + b)$.
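A minimal sketch of this conjugate Beta-Bernoulli update in code; the prior parameters and the observed counts are illustrative:

```python
import numpy as np

# Prior Beta(alpha, beta); after `wins` successes and `losses` failures, the
# posterior is Beta(alpha + wins, beta + losses) in closed form (conjugacy).
alpha, beta = 1.0, 1.0        # uniform prior over the arm's mean reward
wins, losses = 4, 6           # e.g. 10 pulls of a Bernoulli arm, 4 of them rewarded

post_alpha, post_beta = alpha + wins, beta + losses
print("posterior mean:", post_alpha / (post_alpha + post_beta))   # 5/12, about 0.417

# Sampling from the posterior is what the posterior-sampling strategy (next) relies on.
rng = np.random.default_rng(0)
samples = rng.beta(post_alpha, post_beta, size=10_000)
print("posterior std (Monte Carlo):", round(float(samples.std()), 3))
```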
Thompson sampling: represent a posterior distribution over the mean rewards of the arms, $p(\theta_1, \theta_2, \ldots, \theta_k)$. At each step, draw one sample $\theta_1, \theta_2, \ldots, \theta_k \sim \hat{p}(\theta_1, \theta_2, \ldots, \theta_k)$ and play the action that is best under the sampled parameters: $a = \arg\max_a \mathbb{E}_\theta[r(a)]$.
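A minimal sketch of this posterior-sampling strategy for Bernoulli bandits with independent Beta posteriors per arm; the arm means and step count are illustrative:

```python
import numpy as np

# Thompson sampling: sample a mean for every arm from its posterior, play the argmax,
# then apply the conjugate Beta update to the played arm.
rng = np.random.default_rng(0)
true_means = np.array([0.6, 0.4, 0.45])   # hidden from the agent
k, steps = len(true_means), 2000

alpha = np.ones(k)   # Beta(1, 1) prior for every arm
beta = np.ones(k)

for _ in range(steps):
    theta = rng.beta(alpha, beta)           # one posterior sample per arm
    a = int(np.argmax(theta))               # best arm under the sampled means
    r = int(rng.random() < true_means[a])   # Bernoulli reward
    alpha[a] += r                           # conjugate update: success
    beta[a] += 1 - r                        # conjugate update: failure

print("posterior means:", (alpha / (alpha + beta)).round(3))
print("pulls per arm:", (alpha + beta - 2).astype(int))
```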
Looking ahead: the equivalent of mean arm rewards in general MDPs are Q-functions. Contextual bandits sit in between: they add states (or "contexts"), and the distribution over rewards then depends on the context as well as on the chosen action.
Netflix artwork, revisited: for a particular title and a particular user, we can use the contextual multi-armed bandit formulation to decide which image to show, per title per user. The arms are the candidate images, the context describes the user (e.g., movies previously watched, time and day of the week, etc.), and the reward is whether the user clicks the image and watches the movie.