MVA-RL Course
Reinforcement Learning Algorithms
A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
SequeL – INRIA Lille
How to solve an RL problem incrementally
Oct. 15th, 2013
◮ Dynamic programming algorithms require an explicit definition of
  ◮ the transition probabilities p(·|x, a)
  ◮ the reward function r(x, a)
◮ This knowledge is often unavailable (e.g., wind intensity, ...)
◮ Can we relax this assumption?
◮ Learning with a generative model. A black-box simulator f of the environment is available: for any state-action pair (x, a) it returns a sampled next state and reward.
◮ Episodic learning. Multiple trajectories can be repeatedly generated from a starting state x, i.e., $(x_0^i = x, x_1^i, \dots, x_{T_i}^i)_{i=1}^n$.
◮ Online learning. At each time t the agent is at state $x_t$, it takes an action $a_t$, and it only observes the reward $r_t$ and the next state $x_{t+1}$ along a single trajectory (the three access models are sketched below).
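A minimal sketch in Python of the three access models, assuming a toy two-state chain and illustrative function names (none of these choices come from the slides):

```python
import random

# Illustrative two-state chain: state 1 loops on itself with prob. 1 - P_EXIT
# (reward 1) and reaches the terminal state 0 with prob. P_EXIT; the action is ignored.
P_EXIT = 0.3

def generative_model(x, a):
    """Generative model f: sample (next state, reward) for ANY queried pair (x, a)."""
    if x == 0:
        return 0, 0.0
    return (0, 1.0) if random.random() < P_EXIT else (1, 1.0)

def episode(x0, policy, max_steps=1000):
    """Episodic access: generate a whole trajectory from a chosen starting state x0."""
    traj, x = [], x0
    for _ in range(max_steps):
        if x == 0:                        # terminal state reached
            break
        a = policy(x)
        x_next, r = generative_model(x, a)
        traj.append((x, a, r, x_next))
        x = x_next
    return traj

# Online access: the agent only observes the stream (x_t, a_t, r_t, x_{t+1})
# generated while it acts; an online algorithm would update its estimates here.
for x, a, r, x_next in episode(1, policy=lambda state: 0):
    pass
```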
Mathematical Tools
Let $X$ be a random variable and $\{X_n\}_{n\in\mathbb{N}}$ a sequence of random variables.
◮ $\{X_n\}$ converges to $X$ almost surely, $X_n \xrightarrow{a.s.} X$, if $\mathbb{P}\big(\lim_{n\to\infty} X_n = X\big) = 1$,
◮ $\{X_n\}$ converges to $X$ in probability, $X_n \xrightarrow{P} X$, if for any $\epsilon > 0$, $\lim_{n\to\infty} \mathbb{P}\big[|X_n - X| > \epsilon\big] = 0$,
◮ $\{X_n\}$ converges to $X$ in law (or in distribution), $X_n \xrightarrow{D} X$, if for any bounded continuous function $f$, $\lim_{n\to\infty} \mathbb{E}[f(X_n)] = \mathbb{E}[f(X)]$.

Remark: $X_n \xrightarrow{a.s.} X \;\Rightarrow\; X_n \xrightarrow{P} X \;\Rightarrow\; X_n \xrightarrow{D} X$.
Proposition (Markov Inequality)
Let $X$ be a non-negative random variable and $a > 0$. Then $\mathbb{P}(X \ge a) \le \mathbb{E}[X]/a$.

Proof. $\mathbb{P}(X \ge a) = \mathbb{E}[\mathbb{I}\{X \ge a\}] = \mathbb{E}[\mathbb{I}\{X/a \ge 1\}] \le \mathbb{E}[X/a] = \mathbb{E}[X]/a$. $\square$
Proposition (Hoeffding Inequality)
Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $a \le X \le b$. Then for any $s \in \mathbb{R}$, $\mathbb{E}[e^{sX}] \le e^{s^2(b-a)^2/8}$.

Proof. From convexity of the exponential function, for any $a \le x \le b$,
$$e^{sx} \le \frac{x-a}{b-a}\, e^{sb} + \frac{b-x}{b-a}\, e^{sa}.$$
Let $p = -a/(b-a)$; then (recall that $\mathbb{E}[X] = 0$)
$$\mathbb{E}[e^{sX}] \le \frac{b}{b-a}\, e^{sa} - \frac{a}{b-a}\, e^{sb} = \big(1 - p + p\, e^{s(b-a)}\big) e^{-p s (b-a)} = e^{\phi(u)}$$
with $u = s(b-a)$ and $\phi(u) = -pu + \log(1 - p + p e^{u})$, whose derivative is
$$\phi'(u) = -p + \frac{p}{p + (1-p)e^{-u}},$$
so that $\phi(0) = \phi'(0) = 0$, and
$$\phi''(u) = \frac{p(1-p)e^{-u}}{\big(p + (1-p)e^{-u}\big)^2} \le \frac{1}{4}.$$
Thus from Taylor's theorem there exists a $\theta \in [0, u]$ such that
$$\phi(u) = \phi(0) + u\,\phi'(0) + \frac{u^2}{2}\,\phi''(\theta) \le \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}. \qquad \square$$
Proposition (Chernoff-Hoeffding Inequality)
Let $X_1, \dots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$ and $\mu_i = \mathbb{E}[X_i]$. Then for any $\epsilon > 0$,
$$\mathbb{P}\Big(\sum_{i=1}^n (X_i - \mu_i) \ge \epsilon\Big) \le \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big).$$

Proof. For any $s > 0$,
$$\mathbb{P}\Big(\sum_{i=1}^n (X_i - \mu_i) \ge \epsilon\Big) = \mathbb{P}\big(e^{s\sum_{i=1}^n (X_i - \mu_i)} \ge e^{s\epsilon}\big) \le e^{-s\epsilon}\,\mathbb{E}\big[e^{s\sum_{i=1}^n (X_i - \mu_i)}\big] \quad \text{(Markov inequality)}$$
$$= e^{-s\epsilon} \prod_{i=1}^n \mathbb{E}\big[e^{s(X_i - \mu_i)}\big] \quad \text{(independent random variables)}$$
$$\le e^{-s\epsilon} \prod_{i=1}^n e^{s^2 (b_i - a_i)^2/8} \quad \text{(Hoeffding inequality)} \;=\; e^{-s\epsilon + s^2 \sum_{i=1}^n (b_i - a_i)^2/8}.$$
If we choose $s = 4\epsilon / \sum_{i=1}^n (b_i - a_i)^2$, the result follows. Similar arguments hold for $\mathbb{P}\big(\sum_{i=1}^n (X_i - \mu_i) \le -\epsilon\big)$. $\square$
Definition
Let $X$ be a random variable with mean $\mu = \mathbb{E}[X]$ and let $x_1, \dots, x_n$ be i.i.d. realizations of $X$. The empirical (Monte-Carlo) estimate of $\mu$ is
$$\mu_n = \frac{1}{n}\sum_{i=1}^n x_i.$$
◮ Unbiased estimator: $\mathbb{E}[\mu_n] = \mu$ (and $\mathbb{V}[\mu_n] = \mathbb{V}[X]/n$)
◮ Weak law of large numbers: $\mu_n \xrightarrow{P} \mu$
◮ Strong law of large numbers: $\mu_n \xrightarrow{a.s.} \mu$
◮ Central limit theorem (CLT): $\sqrt{n}(\mu_n - \mu) \xrightarrow{D} \mathcal{N}(0, \mathbb{V}[X])$
◮ Finite sample guarantee (see the sketch below): if $X \in [a, b]$, then by the Chernoff-Hoeffding inequality, for any $\epsilon > 0$,
$$\mathbb{P}\big(|\mu_n - \mu| > \epsilon\big) \le 2\exp\Big(-\frac{2 n \epsilon^2}{(b-a)^2}\Big).$$
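A quick numerical illustration of this guarantee, assuming Bernoulli samples and arbitrary values of $n$ and $\epsilon$ (none taken from the slides):

```python
import math, random

# Empirical mean of a [0,1]-valued r.v. versus the Chernoff-Hoeffding bound
# P(|mu_n - mu| >= eps) <= 2 exp(-2 n eps^2).
mu, n, eps, runs = 0.4, 200, 0.05, 5000
random.seed(0)

violations = 0
for _ in range(runs):
    mu_n = sum(1.0 if random.random() < mu else 0.0 for _ in range(n)) / n
    if abs(mu_n - mu) >= eps:
        violations += 1

print(f"empirical frequency : {violations / runs:.3f}")
print(f"C-H bound           : {2 * math.exp(-2 * n * eps ** 2):.3f}")
```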
Definition
The stochastic approximation of the mean $\mu = \mathbb{E}[X]$ is defined by the recurrence
$$\mu_n = (1 - \eta_n)\,\mu_{n-1} + \eta_n\, x_n,$$
where $(\eta_n)_n$ is a sequence of learning rates (step sizes). When $\eta_n = 1/n$ this is the recursive definition of the empirical mean.
Proposition (Borel-Cantelli)
Let $\{E_n\}$ be a sequence of events. If $\sum_{n \ge 1} \mathbb{P}(E_n) < \infty$, then
$$\mathbb{P}\Big(\limsup_{n\to\infty} E_n\Big) = \mathbb{P}\Big(\bigcap_{N \ge 1} \bigcup_{n \ge N} E_n\Big) = 0,$$
i.e., with probability 1 only a finite number of the events $E_n$ occur.
Proposition
If the learning rates satisfy $\sum_n \eta_n = \infty$ and $\sum_n \eta_n^2 < \infty$, then $\mu_n \xrightarrow{a.s.} \mu$.
For learning rates of the form $\eta_n = 1/n^{\alpha}$, in order to satisfy the two conditions we need $1/2 < \alpha \le 1$. In fact, for instance,
$$\alpha = 2 \;\Rightarrow\; \sum_n \eta_n = \sum_n \frac{1}{n^2} = \frac{\pi^2}{6} < \infty \quad \text{(see the Basel problem)},$$
$$\alpha = 1/2 \;\Rightarrow\; \sum_n \eta_n^2 = \sum_n \Big(\frac{1}{\sqrt{n}}\Big)^2 = \sum_n \frac{1}{n} = \infty \quad \text{(harmonic series)}.$$
Proof (case $\alpha = 1$). Let $(\epsilon_k)_k$ be a sequence such that $\epsilon_k \to 0$. Almost sure convergence corresponds to
$$\mathbb{P}\Big(\lim_{n\to\infty} \mu_n = \mu\Big) = 1.$$
From the Chernoff-Hoeffding inequality, for any fixed $n$,
$$\mathbb{P}\big(|\mu_n - \mu| > \epsilon_k\big) \le 2\exp\Big(-\frac{2 n \epsilon_k^2}{(b-a)^2}\Big). \qquad (1)$$
Let $\{E_n\}$ be the sequence of events $E_n = \{|\mu_n - \mu| > \epsilon_k\}$. By (1), $\sum_n \mathbb{P}(E_n) < \infty$, and from the Borel-Cantelli lemma we obtain that with probability 1 there exists only a finite number of $n$ values such that $|\mu_n - \mu| > \epsilon_k$. Then for any $\epsilon_k$ there exists a finite $n_k$ such that
$$\mathbb{P}\big(\forall n \ge n_k,\; |\mu_n - \mu| \le \epsilon_k\big) = 1.$$
Repeating the argument for all $\epsilon_k$ in the sequence leads to the statement.

Remark: when $\alpha = 1$, $\mu_n$ is the Monte-Carlo estimate and this corresponds to the strong law of large numbers. A more precise and accurate proof is here: http://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/
Proof (cont'd, case $1/2 < \alpha < 1$). The stochastic approximation $\mu_n$ unrolls as
$$\mu_1 = x_1$$
$$\mu_2 = (1 - \eta_2)\mu_1 + \eta_2 x_2 = (1 - \eta_2) x_1 + \eta_2 x_2$$
$$\mu_3 = (1 - \eta_3)\mu_2 + \eta_3 x_3 = (1 - \eta_2)(1 - \eta_3) x_1 + \eta_2 (1 - \eta_3) x_2 + \eta_3 x_3$$
$$\dots$$
$$\mu_n = \sum_{i=1}^n \lambda_i x_i, \quad \text{with } \lambda_i = \eta_i \prod_{j=i+1}^n (1 - \eta_j) \text{ such that } \sum_{i=1}^n \lambda_i = 1.$$
By the Chernoff-Hoeffding inequality (assuming for simplicity $x_i \in [0, 1]$),
$$\mathbb{P}\Big(\Big|\sum_{i=1}^n \lambda_i x_i - \sum_{i=1}^n \lambda_i \mathbb{E}[x_i]\Big| > \epsilon\Big) \le 2\exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^n \lambda_i^2}\Big).$$
Proof (cont'd, case $1/2 < \alpha < 1$). From the definition of $\lambda_i$,
$$\log \lambda_i = \log \eta_i + \sum_{j=i+1}^n \log(1 - \eta_j) \le \log \eta_i - \sum_{j=i+1}^n \eta_j,$$
since $\log(1-x) < -x$. Thus $\lambda_i \le \eta_i\, e^{-\sum_{j=i+1}^n \eta_j}$ and, for any $1 \le m \le n$,
$$\sum_{i=1}^n \lambda_i^2 \le \sum_{i=1}^n \eta_i^2\, e^{-2\sum_{j=i+1}^n \eta_j} \overset{(a)}{\le} \sum_{i=1}^m e^{-2\sum_{j=i+1}^n \eta_j} + \sum_{i=m+1}^n \eta_i^2 \overset{(b)}{\le} m\, e^{-2(n-m)\eta_n} + (n-m)\,\eta_m^2 \overset{(c)}{=} m\, e^{-2(n-m) n^{-\alpha}} + (n-m)\, m^{-2\alpha}.$$
Proof (cont'd, case $1/2 < \alpha < 1$). Let $m = n^{\beta}$ with $\beta = \big(1 + \tfrac{1}{2\alpha}\big)/2$ (so that $1 - 2\alpha\beta = 1/2 - \alpha$). Then
$$\sum_{i=1}^n \lambda_i^2 \le n\, e^{-2(1 - n^{\beta - 1})\, n^{1-\alpha}} + n^{1/2 - \alpha} \le 2\, n^{1/2 - \alpha}$$
for $n$ big enough, which leads to
$$\mathbb{P}\big(|\mu_n - \mu| > \epsilon\big) \le 2\exp\Big(-\frac{\epsilon^2}{n^{1/2 - \alpha}}\Big)$$
(recall that $\sum_i \lambda_i \mathbb{E}[x_i] = \mu$). From this point we follow the same steps as for $\alpha = 1$ (application of the Borel-Cantelli lemma) and obtain the convergence result for $\mu_n$.
Definition
More generally, consider the stochastic approximation of the fixed point of an operator $T$: given a filtration $\mathcal{F}_n$, the estimates $V_n \in \mathbb{R}^N$ are updated as
$$V_{n+1}(x) = \big(1 - \eta_n(x)\big)\, V_n(x) + \eta_n(x)\big((T V_n)(x) + \omega_n(x)\big),$$
where $\omega_n(x)$ is a zero-mean noise term.
Proposition
Let $T$ be a contraction in max-norm with fixed point $V$, and let the noise satisfy $\mathbb{E}[\omega_n(x)\,|\,\mathcal{F}_n] = 0$ and $\mathbb{E}[\omega_n^2(x)\,|\,\mathcal{F}_n] \le c\,(1 + \|V_n\|^2)$ for some constant $c$. If for every state $x$ the learning rates satisfy
$$\sum_n \eta_n(x) = \infty, \qquad \sum_n \eta_n^2(x) < \infty,$$
then $V_n \xrightarrow{a.s.} V$.
Remark: the recursive estimate $\mu_n = \mu_{n-1} + \eta_n (x_n - \mu_{n-1})$ is often referred to as the stochastic gradient algorithm, and it converges to $\mu$ almost surely under the previous conditions on the learning rates (see the sketch below).
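A minimal sketch of the stochastic approximation of the mean, assuming Bernoulli samples and a few arbitrary values of $\alpha$ in the admissible range:

```python
import random

# mu_n = (1 - eta_n) mu_{n-1} + eta_n x_n with eta_n = 1 / n^alpha;
# alpha = 1 recovers the plain running (Monte-Carlo) average.
def stochastic_mean(samples, alpha):
    mu = 0.0
    for n, x in enumerate(samples, start=1):
        eta = n ** (-alpha)              # learning rate eta_n
        mu = (1 - eta) * mu + eta * x    # recursive (stochastic gradient) update
    return mu

random.seed(1)
xs = [1.0 if random.random() < 0.3 else 0.0 for _ in range(100_000)]
for alpha in (0.6, 0.8, 1.0):            # all satisfy 1/2 < alpha <= 1
    print(f"alpha = {alpha}: mu_n = {stochastic_mean(xs, alpha):.4f}   (true mean 0.3)")
```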
The Monte-Carlo Algorithm

The generic interaction protocol:
1. Set $t = 0$
2. Set the initial state $x_0$
3. While ($x_t$ not terminal)
   3.1 Take action $a_t$
   3.2 Observe next state $x_{t+1}$ and reward $r_t$
   3.3 Set $t = t + 1$
   EndWhile
[Figure: $n$ trajectories $(x_0, x_1^{(i)}, x_2^{(i)}, \dots, x_{T^{(i)}}^{(i)})$, $i = 1, \dots, n$, all starting from the same state $x_0$.]
Monte-Carlo evaluation of a fixed policy $\pi$:
1. Set $t = 0$
2. Set the initial state $x_0$
3. While ($x_t$ not terminal)
   3.1 Take action $a_t = \pi(x_t)$
   3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
   3.3 Set $t = t + 1$
   EndWhile
[Figure: the same $n$ trajectories from $x_0$, now annotated with the rewards $r^{\pi}(x_t^{(i)})$ collected along each trajectory.]
◮ Infinite time horizon with terminal state: the problem never terminates within a fixed horizon, but under $\pi$ each trajectory eventually reaches an absorbing terminal state (after a random number of steps $T^{(i)}$).
◮ Return of trajectory $i$:
$$R^{(i)}(x_0) = \sum_{t=0}^{T^{(i)}} \gamma^t\, r^{\pi}\big(x_t^{(i)}\big)$$
◮ Estimated value function:
$$\widehat V_n(x_0) = \frac{1}{n}\sum_{i=1}^n R^{(i)}(x_0)$$
For $i = 1, \dots, n$
1. Set $t = 0$
2. Set the initial state $x_0$
3. While ($x_t$ not terminal)
   3.1 Take action $a_t = \pi(x_t)$
   3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
   3.3 Set $t = t + 1$
   EndWhile
4. Update $\widehat V_n(x_0)$ using the MC approximation (the average of the returns observed so far)
EndFor
A minimal implementation of this scheme is sketched below.
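The sketch uses as environment the two-state chain introduced as an example later in this section (exit probability $p$, reward 1 in state 1, $\gamma = 1$); the specific constants are arbitrary.

```python
import random

# Monte-Carlo evaluation of V^pi(x0) on the two-state chain: V^pi(1) = 1/p.
p, gamma, n = 0.25, 1.0, 20_000
random.seed(2)

def rollout(x0=1, max_steps=100_000):
    """Return of one trajectory: discounted sum of rewards until the terminal state."""
    ret, disc, x = 0.0, 1.0, x0
    for _ in range(max_steps):
        if x == 0:                      # terminal state
            break
        ret += disc * 1.0               # reward r^pi(1) = 1
        disc *= gamma
        x = 0 if random.random() < p else 1
    return ret

V_hat = sum(rollout() for _ in range(n)) / n    # average of the n returns
print(f"V_hat(1) = {V_hat:.3f}   (exact value 1/p = {1 / p:.3f})")
```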
◮ All returns are unbiased estimators of $V^{\pi}(x_0)$:
$$\mathbb{E}\big[R^{(i)}(x_0)\big] = \mathbb{E}\big[r^{\pi}(x_0^{(i)}) + \gamma\, r^{\pi}(x_1^{(i)}) + \dots + \gamma^{T^{(i)}} r^{\pi}(x_{T^{(i)}}^{(i)})\big] = V^{\pi}(x_0)$$
◮ Thus $\widehat V_n(x_0) \xrightarrow{a.s.} V^{\pi}(x_0)$ (law of large numbers)
◮ Finite-sample guarantees are also possible (e.g., via the Chernoff-Hoeffding inequality)
◮ Interrupt trajectories after $H$ steps
◮ Return computed over the truncated trajectory: $R_H^{(i)}(x_0) = \sum_{t=0}^{H} \gamma^t\, r^{\pi}(x_t^{(i)})$
◮ Loss in accuracy limited to $\gamma^H \frac{r_{\max}}{1-\gamma}$, since the discarded tail is bounded by $\sum_{t > H} \gamma^t\, r_{\max} \le \gamma^H \frac{r_{\max}}{1-\gamma}$
[Figure: the $n$ trajectories from $x_0$, highlighting the time steps at which a given state $x$ is visited (e.g., $x_1^{(i)} = x$, $x_2^{(n)} = x$); the returns collected from those visits can be used to estimate $V^{\pi}(x)$.]
◮ First-visit MC. For each state $x$ we only consider the return collected from the first time $x$ is visited in each trajectory.
◮ Every-visit MC. Given a trajectory $(x_0 = x, x_1, x_2, \dots, x_T)$, we consider the returns collected from every visit to $x$ and average them.
Example.
[Figure: two-state chain; state 1 has a self-loop of probability $1-p$ and moves to the terminal state 0 with probability $p$.]
The reward is 1 while in state 1 (and 0 in the terminal state), with $\gamma = 1$. All trajectories are of the form $(x_0 = 1, x_1 = 1, \dots, x_T = 0)$. By the Bellman equation,
$$V(1) = 1 + (1-p)\,V(1) + p \cdot 0, \quad \text{hence } V(1) = \frac{1}{p}, \text{ since } V(0) = 0.$$
◮ First-visit MC: state 1 is first visited at $t = 0$, so from a single trajectory of length $T$ the estimate is the full return $\widehat V(1) = T$, which is unbiased ($\mathbb{E}[T] = 1/p$) with variance $\mathbb{V}[T] = \frac{1-p}{p^2}$.
◮ Every-visit MC: state 1 is visited at $t = 0, 1, \dots, T-1$ and the return from the visit at time $t$ is $T - t$, so from a single trajectory the estimate is
$$\widehat V(1) = \frac{1}{T}\sum_{t=0}^{T-1} (T - t) = \frac{T+1}{2},$$
which is a biased estimate of $1/p$.
Let us consider $n$ independent trajectories, each of length $T_i$. The total number of samples is $\sum_{i=1}^n T_i$ and the every-visit estimator $\widehat V_n$ is
$$\widehat V_n(1) = \frac{\sum_{i=1}^n \sum_{t=0}^{T_i - 1} (T_i - t)}{\sum_{i=1}^n T_i} = \frac{\sum_{i=1}^n T_i(T_i+1)/2}{\sum_{i=1}^n T_i} = \frac{\frac{1}{n}\sum_{i=1}^n T_i(T_i+1)}{\frac{2}{n}\sum_{i=1}^n T_i} \xrightarrow{a.s.} \frac{\mathbb{E}[T^2] + \mathbb{E}[T]}{2\,\mathbb{E}[T]} = \frac{1}{p} = V^{\pi}(1)$$
$\Rightarrow$ consistent estimator. For a single trajectory, the MSE of the every-visit estimator is
$$\mathbb{E}\Big[\Big(\frac{T+1}{2} - \frac{1}{p}\Big)^2\Big] = \frac{1}{2p^2} - \frac{3}{4p} + \frac{1}{4} \;\le\; \frac{1}{p^2} - \frac{1}{p},$$
i.e., it is smaller than the variance of the (unbiased) first-visit estimator.
◮ Every-visit MC: biased but consistent estimator.
◮ First-visit MC: unbiased estimator, but with potentially bigger variance.

Remark: when the state space is large, the probability of visiting the same state multiple times is low, so the performance of the two methods tends to be the same (see the sketch below).
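A quick numerical check of the single-trajectory analysis above, simulating the example chain (the value of $p$ and the number of runs are arbitrary):

```python
import random

# First-visit estimate of V(1) = 1/p is the full length T; every-visit estimate is (T+1)/2.
p, runs = 0.25, 200_000
random.seed(3)

def traj_length():
    """T = number of steps spent in state 1 before reaching the terminal state."""
    T = 1
    while random.random() >= p:
        T += 1
    return T

first_visit, every_visit = [], []
for _ in range(runs):
    T = traj_length()
    first_visit.append(float(T))        # return from the first visit
    every_visit.append((T + 1) / 2)     # average of the returns T, T-1, ..., 1

mse = lambda v: sum((x - 1 / p) ** 2 for x in v) / len(v)
print(f"first-visit MSE: {mse(first_visit):.2f}  (theory 1/p^2 - 1/p = {1 / p**2 - 1 / p:.2f})")
print(f"every-visit MSE: {mse(every_visit):.2f}  (theory 1/(2p^2) - 3/(4p) + 1/4 = {1 / (2 * p**2) - 3 / (4 * p) + 0.25:.2f})")
```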
◮ Use subtrajectories: each visit to a state $x$ along a trajectory provides a return that can be used to estimate $V^{\pi}(x)$
◮ Restart from random states over $X$
◮ The estimate $\widehat V_n(x)$ is then the average of the returns of all (sub)trajectories starting at $x$
◮ Return of trajectory $i$: $R^{(i)}(x_0) = \sum_{t=0}^{T^{(i)}} \gamma^t\, r^{\pi}(x_t^{(i)})$
◮ Estimated value function after trajectory $i$:
$$\widehat V_i(x_0) = (1 - \alpha_i)\,\widehat V_{i-1}(x_0) + \alpha_i\, R^{(i)}(x_0)$$
For $i = 1, \dots, n$
1. Set $t = 0$
2. Set the initial state $x_0$
3. While ($x_t$ not terminal)
   3.1 Take action $a_t = \pi(x_t)$
   3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
   3.3 Set $t = t + 1$
   EndWhile
4. Update $\widehat V_i(x_0)$ using the TD(1) approximation (instead of the plain MC average)
EndFor
◮ If $\alpha_i = 1/i$, then TD(1) is just the incremental version of the MC estimator:
$$\widehat V_i(x_0) = \frac{i-1}{i}\,\widehat V_{i-1}(x_0) + \frac{1}{i}\, R^{(i)}(x_0)$$
◮ Using a generic step-size (learning rate) $\alpha_i$ gives more flexibility to the algorithm (see the sketch below)
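A sketch of the incremental update on the same example chain; the constant step size 0.05 is an arbitrary illustration of that flexibility (it keeps adapting, at the price of residual noise):

```python
import random

# V_i = V_{i-1} + a_i (R_i - V_{i-1}) on the two-state chain (V(1) = 1/p, gamma = 1).
p = 0.25
random.seed(4)

def mc_return():
    """Undiscounted return of one trajectory = its length T."""
    T = 1
    while random.random() >= p:
        T += 1
    return float(T)

V_running, V_const = 0.0, 0.0
for i in range(1, 10_001):
    R = mc_return()
    V_running += (1.0 / i) * (R - V_running)   # a_i = 1/i  -> plain MC average
    V_const   += 0.05 * (R - V_const)          # constant a_i -> forgets old returns faster

print(f"a_i = 1/i : {V_running:.3f}    a_i = 0.05 : {V_const:.3f}    true value : {1 / p:.3f}")
```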
Proposition
If the step-sizes satisfy
$$\sum_{i=1}^{\infty} \alpha_i = \infty, \qquad \sum_{i=1}^{\infty} \alpha_i^2 < \infty,$$
then $\widehat V_n(x_0) \xrightarrow{a.s.} V^{\pi}(x_0)$.
◮ Non-episodic problems: truncated trajectories
◮ Multiple sub-trajectories
  ◮ updates of all the states using sub-trajectories
  ◮ state-dependent learning rate $\alpha_i(x)$
  ◮ $i$ is the index of the number of updates in that specific state
◮ The estimate $\widehat V_n(x)$ then converges almost surely to $V^{\pi}(x)$ at every state $x$, under the same (now state-dependent) step-size conditions.
◮ Bellman error of a function $V$ in a state $x$:
$$\mathcal{B}(V)(x) = (T^{\pi} V)(x) - V(x) = r^{\pi}(x) + \gamma \sum_y p\big(y\,|\,x, \pi(x)\big)\, V(y) - V(x)$$
◮ Temporal difference of a function $V$ after transition $(x_t, r_t, x_{t+1})$:
$$\delta_t = r_t + \gamma\, V(x_{t+1}) - V(x_t)$$
◮ Estimated value function after transition $(x_t, r_t, x_{t+1})$:
$$\widehat V^{\pi}(x_t) \leftarrow \widehat V^{\pi}(x_t) + \alpha(x_t)\, \delta_t$$
1. Set $t = 0$
2. Set the initial state $x_0$
3. While ($x_t$ not terminal)
   3.1 Take action $a_t = \pi(x_t)$
   3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
   3.3 Update $\widehat V^{\pi}(x_t)$ using the TD(0) approximation
   3.4 Set $t = t + 1$
   EndWhile
(The end-of-trajectory TD(1)/MC updates are no longer needed.)
◮ The update rule is $\widehat V^{\pi}(x_t) \leftarrow \widehat V^{\pi}(x_t) + \alpha(x_t)\, \delta_t$
◮ The temporal difference is an unbiased sample of the Bellman error:
$$\mathbb{E}\big[\delta_t \,|\, x_t = x\big] = r^{\pi}(x) + \gamma \sum_y p\big(y\,|\,x, \pi(x)\big)\, \widehat V^{\pi}(y) - \widehat V^{\pi}(x) = \mathcal{B}(\widehat V^{\pi})(x)$$
Proposition
If all the states are visited infinitely often and the state-dependent step-sizes satisfy, for every $x$,
$$\sum_{i=1}^{\infty} \alpha_i(x) = \infty, \qquad \sum_{i=1}^{\infty} \alpha_i^2(x) < \infty,$$
then $\widehat V^{\pi} \xrightarrow{a.s.} V^{\pi}$.
Set $\widehat V^{\pi}(x) = 0$, $\forall x \in X$
For $i = 1, \dots, n$
  Set $t = 0$ and the initial state $x_0$
  While ($x_t$ not terminal)
    4.1 Take action $a_t = \pi(x_t)$
    4.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
    4.3 Compute the TD $\delta_t = r_t + \gamma\, \widehat V^{\pi}(x_{t+1}) - \widehat V^{\pi}(x_t)$
    4.4 Update the value function estimate in $x_t$ as $\widehat V^{\pi}(x_t) \leftarrow \widehat V^{\pi}(x_t) + \alpha_i(x_t)\, \delta_t$
    4.5 Update the learning rate, e.g., $\alpha(x_t) = \frac{1}{\#\text{visits}(x_t)}$
    4.6 Set $t = t + 1$
  EndWhile
EndFor
A minimal implementation of this loop is sketched below.
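The sketch below assumes the two-state example chain ($p = 0.25$, $\gamma = 1$) and $\alpha(x) = 1/\#\text{visits}(x)$; the number of episodes is arbitrary.

```python
import random

# TD(0) on the two-state chain: V(1) should approach 1/p.
p, gamma, n_episodes = 0.25, 1.0, 5_000
random.seed(5)

V = {0: 0.0, 1: 0.0}            # value estimates; the terminal state 0 stays at 0
visits = {1: 0}

for _ in range(n_episodes):
    x = 1                        # initial state
    while x != 0:                # While (x_t not terminal)
        r = 1.0                  # reward r^pi(x_t)
        x_next = 0 if random.random() < p else 1
        visits[x] += 1
        alpha = 1.0 / visits[x]                  # alpha(x_t) = 1 / #visits(x_t)
        delta = r + gamma * V[x_next] - V[x]     # temporal difference
        V[x] += alpha * delta                    # TD(0) update
        x = x_next

print(f"TD(0) estimate V(1) = {V[1]:.3f}   (exact 1/p = {1 / p:.3f})")
```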
TD(1)
◮ Update rule (at the end of each trajectory):
$$\widehat V_i(x_0) = (1 - \alpha_i)\,\widehat V_{i-1}(x_0) + \alpha_i\, R^{(i)}(x_0) = \widehat V_{i-1}(x_0) + \alpha_i\big(R^{(i)}(x_0) - \widehat V_{i-1}(x_0)\big)$$
◮ No bias, large variance

TD(0)
◮ Update rule (at each step):
$$\widehat V^{\pi}(x_t) \leftarrow \widehat V^{\pi}(x_t) + \alpha(x_t)\, \delta_t$$
◮ Potential bias, small variance
The $\lambda$-Bellman operator
Definition
For $\lambda \in [0, 1)$, the $\lambda$-Bellman operator $T^{\pi}_{\lambda}$ is
$$T^{\pi}_{\lambda} = (1 - \lambda) \sum_{m \ge 0} \lambda^{m}\, (T^{\pi})^{m+1},$$
and its fixed point is still $V^{\pi}$.

Remark: it is a convex combination of the $m$-step Bellman operators $(T^{\pi})^m$, weighted by a sequence of coefficients defined as a function of $\lambda$.
◮ Temporal difference of a function $V$: $\delta_s = r_s + \gamma\, V(x_{s+1}) - V(x_s)$
◮ Estimated value function (forward view): at the end of a trajectory of length $T$, for every visited state $x_t$,
$$\widehat V^{\pi}(x_t) \leftarrow \widehat V^{\pi}(x_t) + \alpha(x_t) \sum_{s=t}^{T} (\gamma\lambda)^{s-t}\, \delta_s$$
◮ Eligibility traces $z \in \mathbb{R}^N$
◮ For every transition $x_t \to x_{t+1}$:
  ◮ compute the temporal difference $\delta_t = r^{\pi}(x_t) + \gamma\, \widehat V^{\pi}(x_{t+1}) - \widehat V^{\pi}(x_t)$
  ◮ update the traces
$$z(x) = \begin{cases} \lambda\, z(x) & \text{if } x \neq x_t \\ 1 + \lambda\, z(x) & \text{if } x = x_t \\ 0 & \text{if } x_t = x_0 \text{ (reset the traces at the start of each trajectory)} \end{cases}$$
  ◮ update the value estimate at every state $x$:
$$\widehat V^{\pi}(x) \leftarrow \widehat V^{\pi}(x) + \alpha(x)\, z(x)\, \delta_t$$
(A sketch of this update is given below.)
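The sketch uses a hypothetical deterministic 5-state corridor (not from the slides) purely to illustrate the mechanics of the trace update:

```python
# TD(lambda) with accumulating eligibility traces on a 5-state corridor:
# states 0..4 move deterministically to the right with reward 1; state 5 is terminal.
gamma, lam, alpha, n_episodes, N = 0.9, 0.8, 0.1, 2_000, 5
V = [0.0] * (N + 1)                     # V[N] is the terminal state, kept at 0

for _ in range(n_episodes):
    z = [0.0] * N                       # reset the traces at x_0
    x = 0
    while x != N:                       # While (x_t not terminal)
        x_next, r = x + 1, 1.0
        delta = r + gamma * V[x_next] - V[x]            # temporal difference
        for s in range(N):                              # trace update from the rule above
            z[s] = 1 + lam * z[s] if s == x else lam * z[s]
        for s in range(N):                              # V(s) <- V(s) + alpha z(s) delta
            V[s] += alpha * z[s] * delta
        x = x_next

exact = [sum(gamma ** k for k in range(N - i)) for i in range(N)]
print("TD(lambda):", [round(v, 3) for v in V[:N]])
print("exact     :", [round(v, 3) for v in exact])
```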
◮ $\lambda < 1$: smaller variance w.r.t. $\lambda = 1$ (MC/TD(1)).
◮ $\lambda > 0$: faster propagation of rewards w.r.t. $\lambda = 0$.
[Figure: empirical error of TD($\lambda$) as a function of $\lambda \in [0, 1]$.]
The Q-learning Algorithm

The generic interaction protocol (actions are no longer fixed by a given policy $\pi$):
1. Set $t = 0$
2. Set the initial state $x_0$
3. While ($x_t$ not terminal)
   3.1 Take action $a_t$
   3.2 Observe next state $x_{t+1}$ and reward $r_t$
   3.3 Set $t = t + 1$
   EndWhile
Policy iteration:
◮ Policy evaluation: given $\pi_k$, compute $Q^{\pi_k}$.
◮ Policy improvement: compute the greedy policy
$$\pi_{k+1}(x) \in \arg\max_{a \in A} Q^{\pi_k}(x, a).$$
Idea: alternate policy evaluation and policy improvement.
◮ Define a greedy exploratory (softmax) policy with temperature $\tau$:
$$\pi_Q(a|x) \propto \exp\big(Q(x, a)/\tau\big).$$
The higher $Q(x, a)$, the higher the probability of taking action $a$ in state $x$.
◮ Compute the temporal difference on the trajectory $x_t, a_t, r_t, x_{t+1}, a_{t+1}$ (with actions chosen according to $\pi_Q(\cdot|x)$):
$$\delta_t = r_t + \gamma\, Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)$$
◮ Update the estimate of $Q$ as
$$Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha(x_t, a_t)\, \delta_t$$
(This is the SARSA update; a sketch follows below.)
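The sketch assumes a hypothetical 6-cell corridor with two actions, reward 1 on reaching the rightmost terminal cell, and a simple temperature schedule; none of these choices come from the slides.

```python
import math, random

# SARSA with a Boltzmann (softmax) policy pi_Q(a|x) proportional to exp(Q(x,a)/tau).
N, gamma, alpha, n_episodes = 6, 0.9, 0.1, 5_000
random.seed(6)
Q = [[0.0, 0.0] for _ in range(N)]      # Q[x][a]; cell N-1 is terminal

def softmax_action(x, tau):
    """Sample an action from pi_Q(.|x)."""
    prefs = [math.exp(Q[x][a] / tau) for a in (0, 1)]
    u = random.random() * (prefs[0] + prefs[1])
    return 0 if u < prefs[0] else 1

def step(x, a):
    """Move left (a=0) or right (a=1); reward 1 when the terminal cell is reached."""
    x_next = max(0, x - 1) if a == 0 else min(N - 1, x + 1)
    return x_next, (1.0 if x_next == N - 1 else 0.0)

for ep in range(n_episodes):
    tau = max(0.05, 1.0 / (1 + ep / 100))          # decreasing temperature -> greedier policy
    x = 0
    a = softmax_action(x, tau)
    while x != N - 1:
        x_next, r = step(x, a)
        a_next = softmax_action(x_next, tau)
        delta = r + gamma * Q[x_next][a_next] - Q[x][a]   # SARSA temporal difference
        Q[x][a] += alpha * delta                          # Q(x_t,a_t) update
        x, a = x_next, a_next

print("greedy action per cell:", [max((0, 1), key=lambda act: Q[x][act]) for x in range(N - 1)])
```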
◮ The TD updates make $Q$ track $Q^{\pi_Q}$, the action-value function of the current exploration policy
◮ The update of $\pi_Q$ improves the policy
◮ A decreasing temperature allows the policy to become more and more greedy (deterministic)
Idea: use the TD approach for the optimal Bellman operator.
◮ Compute the (optimal) temporal difference on the trajectory $x_t, a_t, r_t, x_{t+1}$ (with actions chosen arbitrarily!):
$$\delta_t = r_t + \gamma \max_{a'} Q(x_{t+1}, a') - Q(x_t, a_t)$$
◮ Update the estimate of $Q$ as
$$Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha(x_t, a_t)\, \delta_t$$
Proposition
If all the state-action pairs are visited infinitely often and the step-sizes satisfy, for every $(x, a)$,
$$\sum_{i=1}^{\infty} \alpha_i(x, a) = \infty, \qquad \sum_{i=1}^{\infty} \alpha_i^2(x, a) < \infty,$$
then $Q \xrightarrow{a.s.} Q^{*}$.
For $i = 1, \dots, n$
  Set $t = 0$ and the initial state $x_0$
  While ($x_t$ not terminal)
    3.1 Take action $a_t$ according to a suitable exploration policy
    3.2 Observe next state $x_{t+1}$ and reward $r_t$
    3.3 Compute the temporal difference
        $\delta_t = r_t + \gamma\, Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)$  (SARSA)
        $\delta_t = r_t + \gamma \max_{a'} Q(x_{t+1}, a') - Q(x_t, a_t)$  (Q-learning)
    3.4 Update the Q-function $Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha(x_t, a_t)\, \delta_t$
    3.5 Set $t = t + 1$
  EndWhile
EndFor
A minimal Q-learning implementation is sketched below.
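The sketch reuses the hypothetical corridor from the SARSA example, with a uniformly random exploration policy (any behavior that keeps visiting all state-action pairs would do):

```python
import random

# Q-learning with arbitrary (uniform) exploration on the 6-cell corridor.
N, gamma, n_episodes = 6, 0.9, 20_000
random.seed(7)
Q = [[0.0, 0.0] for _ in range(N)]
visits = [[0, 0] for _ in range(N)]

def step(x, a):
    x_next = max(0, x - 1) if a == 0 else min(N - 1, x + 1)
    return x_next, (1.0 if x_next == N - 1 else 0.0)

for _ in range(n_episodes):
    x = 0
    while x != N - 1:
        a = random.randint(0, 1)                       # actions chosen arbitrarily
        x_next, r = step(x, a)
        visits[x][a] += 1
        alpha = 1.0 / visits[x][a]                     # alpha(x_t, a_t) = 1 / #visits
        delta = r + gamma * max(Q[x_next]) - Q[x][a]   # optimal temporal difference
        Q[x][a] += alpha * delta                       # Q-learning update
        x = x_next

print("Q(x, right)  :", [round(Q[x][1], 3) for x in range(N - 1)])
print("gamma^(N-2-x):", [round(gamma ** (N - 2 - x), 3) for x in range(N - 1)])
```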
Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr