

slide-1
SLIDE 1

MVA-RL Course

Reinforcement Learning Algorithms

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

slide-2
SLIDE 2

How to solve incrementally an RL problem

Reinforcement Learning Algorithms

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 2/83

slide-3
SLIDE 3

How to solve incrementally an RL problem

Reinforcement Learning Algorithms

Tools Policy Evaluation Policy Learning

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 2/83

slide-4
SLIDE 4

Notice: from now on we will often work in the episodic discounted setting. Most results extend smoothly to other settings.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 3/83

slide-5
SLIDE 5

Notice: from now on we will often work in the episodic discounted setting. Most results extend smoothly to other settings. The value functions can be represented exactly (no approximation error).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 3/83

slide-6
SLIDE 6

In This Lecture

◮ Dynamic programming algorithms require an explicit definition of
  ◮ the transition probabilities p(·|x, a)
  ◮ the reward function r(x, a)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 4/83

slide-7
SLIDE 7

In This Lecture

◮ Dynamic programming algorithms require an explicit definition of
  ◮ the transition probabilities p(·|x, a)
  ◮ the reward function r(x, a)
◮ This knowledge is often unavailable (e.g., wind intensity, human-computer interaction).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 4/83

slide-8
SLIDE 8

In This Lecture

◮ Dynamic programming algorithms require an explicit definition of
  ◮ the transition probabilities p(·|x, a)
  ◮ the reward function r(x, a)
◮ This knowledge is often unavailable (e.g., wind intensity, human-computer interaction).
◮ Can we relax this assumption?

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 4/83

slide-9
SLIDE 9

In This Lecture

◮ Learning with a generative model. A black-box simulator f of the environment is available: given (x, a), f(x, a) = {y, r} with y ∼ p(·|x, a) and r = r(x, a).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 5/83

slide-10
SLIDE 10

In This Lecture

◮ Learning with a generative model. A black-box simulator f of the environment is available: given (x, a), f(x, a) = {y, r} with y ∼ p(·|x, a) and r = r(x, a).
◮ Episodic learning. Multiple trajectories can be repeatedly generated from the same state x, each terminating when a reset condition is met: (x^i_0 = x, x^i_1, . . . , x^i_{T_i}), i = 1, . . . , n.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 5/83

slide-11
SLIDE 11

In This Lecture

◮ Learning with a generative model. A black-box simulator f of the environment is available: given (x, a), f(x, a) = {y, r} with y ∼ p(·|x, a) and r = r(x, a).
◮ Episodic learning. Multiple trajectories can be repeatedly generated from the same state x, each terminating when a reset condition is met: (x^i_0 = x, x^i_1, . . . , x^i_{T_i}), i = 1, . . . , n.
◮ Online learning. At each time t the agent is in state xt, takes action at, observes a transition to state xt+1, and receives a reward rt. We assume that xt+1 ∼ p(·|xt, at) and rt = r(xt, at) (i.e., the MDP assumption).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 5/83
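A minimal sketch (not from the slides) of the interface these settings assume: a hypothetical black-box simulator chain_step returning y ∼ p(·|x, a) and a reward, used here under the online-learning protocol. The chain dynamics, rewards and behaviour policy are arbitrary illustrative choices.

```python
import random

def chain_step(x, a, n_states=5, p_success=0.8):
    """Hypothetical black-box simulator f(x, a) -> (y, r) for a small chain MDP:
    action a = -1/+1 tries to move left/right and succeeds with prob. p_success,
    otherwise the state is unchanged; reward 1 when the right end is reached."""
    y = min(max(x + a, 0), n_states - 1) if random.random() < p_success else x
    r = 1.0 if y == n_states - 1 else 0.0
    return y, r

# Online-learning protocol: the agent only observes (x_t, a_t, r_t, x_{t+1}).
x = 0
for t in range(10):
    a = random.choice([-1, +1])        # a behaviour policy would go here
    y, r = chain_step(x, a)
    print(t, x, a, r, y)
    x = y
```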

slide-12
SLIDE 12

Mathematical Tools

How to solve incrementally an RL problem

Reinforcement Learning Algorithms

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 6/83

slide-13
SLIDE 13

Mathematical Tools

How to solve incrementally an RL problem

Reinforcement Learning Algorithms

Tools Policy Evaluation Policy Learning

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 6/83

slide-14
SLIDE 14

Mathematical Tools

Concentration Inequalities

Let X be a random variable and {Xn}n∈N a sequence of r.v.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 7/83

slide-15
SLIDE 15

Mathematical Tools

Concentration Inequalities

Let X be a random variable and {Xn}n∈N a sequence of r.v.

◮ {Xn} converges to X almost surely, Xn →a.s. X, if
P( lim_{n→∞} Xn = X ) = 1,

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 7/83

slide-16
SLIDE 16

Mathematical Tools

Concentration Inequalities

Let X be a random variable and {Xn}n∈N a sequence of r.v.

◮ {Xn} converges to X almost surely, Xn →a.s. X, if
P( lim_{n→∞} Xn = X ) = 1,
◮ {Xn} converges to X in probability, Xn →P X, if for any ε > 0,
lim_{n→∞} P[ |Xn − X| > ε ] = 0,

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 7/83

slide-17
SLIDE 17

Mathematical Tools

Concentration Inequalities

Let X be a random variable and {Xn}n∈N a sequence of r.v.

◮ {Xn} converges to X almost surely, Xn →a.s. X, if
P( lim_{n→∞} Xn = X ) = 1,
◮ {Xn} converges to X in probability, Xn →P X, if for any ε > 0,
lim_{n→∞} P[ |Xn − X| > ε ] = 0,
◮ {Xn} converges to X in law (or in distribution), Xn →D X, if for any bounded continuous function f,
lim_{n→∞} E[f(Xn)] = E[f(X)].

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 7/83

slide-18
SLIDE 18

Mathematical Tools

Concentration Inequalities

Let X be a random variable and {Xn}n∈N a sequence of r.v.

◮ {Xn} converges to X almost surely, Xn →a.s. X, if
P( lim_{n→∞} Xn = X ) = 1,
◮ {Xn} converges to X in probability, Xn →P X, if for any ε > 0,
lim_{n→∞} P[ |Xn − X| > ε ] = 0,
◮ {Xn} converges to X in law (or in distribution), Xn →D X, if for any bounded continuous function f,
lim_{n→∞} E[f(Xn)] = E[f(X)].

Remark: Xn →a.s. X ⇒ Xn →P X ⇒ Xn →D X.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 7/83

slide-19
SLIDE 19

Mathematical Tools

Concentration Inequalities

Proposition (Markov Inequality)

Let X be a positive random variable. Then for any a > 0, P(X ≥ a) ≤ E[X]/a.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 8/83

slide-20
SLIDE 20

Mathematical Tools

Concentration Inequalities

Proposition (Markov Inequality)

Let X be a positive random variable. Then for any a > 0, P(X ≥ a) ≤ E[X]/a.

Proof. P(X ≥ a) = E[I{X ≥ a}] = E[I{X/a ≥ 1}] ≤ E[X/a] = E[X]/a.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 8/83

slide-21
SLIDE 21

Mathematical Tools

Concentration Inequalities

Proposition (Hoeffding Inequality)

Let X be a centered random variable bounded in [a, b]. Then for any s ∈ R, E[e^{sX}] ≤ e^{s²(b−a)²/8}.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 9/83

slide-22
SLIDE 22

Mathematical Tools

Concentration Inequalities

Proof. By convexity of the exponential function, for any a ≤ x ≤ b,
e^{sx} ≤ ((x − a)/(b − a)) e^{sb} + ((b − x)/(b − a)) e^{sa}.
Let p = −a/(b − a). Then (recall that E[X] = 0)
E[e^{sX}] ≤ (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb} = (1 − p + p e^{s(b−a)}) e^{−ps(b−a)} = e^{φ(u)},
with u = s(b − a) and φ(u) = −pu + log(1 − p + p e^u), whose derivative is
φ′(u) = −p + p/(p + (1 − p)e^{−u}),
so that φ(0) = φ′(0) = 0 and φ″(u) = p(1 − p)e^{−u}/(p + (1 − p)e^{−u})² ≤ 1/4.
Thus, by Taylor’s theorem, there exists θ ∈ [0, u] such that
φ(u) = φ(0) + u φ′(0) + (u²/2) φ″(θ) ≤ u²/8 = s²(b − a)²/8.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 10/83

slide-23
SLIDE 23

Mathematical Tools

Concentration Inequalities

Proposition (Chernoff-Hoeffding Inequality)

Let Xi ∈ [ai, bi] be n independent r.v. with means µi = E[Xi]. Then for any ε > 0,
P( |Σ_{i=1}^n (Xi − µi)| ≥ ε ) ≤ 2 exp( −2ε² / Σ_{i=1}^n (bi − ai)² ).
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 11/83

slide-24
SLIDE 24

Mathematical Tools

Concentration Inequalities

Proof.
P( Σ_{i=1}^n (Xi − µi) ≥ ε ) = P( e^{s Σ_{i=1}^n (Xi − µi)} ≥ e^{sε} )
≤ e^{−sε} E[ e^{s Σ_{i=1}^n (Xi − µi)} ],   (Markov inequality)
= e^{−sε} Π_{i=1}^n E[ e^{s(Xi − µi)} ],   (independent random variables)
≤ e^{−sε} Π_{i=1}^n e^{s²(bi − ai)²/8},   (Hoeffding inequality)
= e^{−sε + s² Σ_{i=1}^n (bi − ai)²/8}.
If we choose s = 4ε / Σ_{i=1}^n (bi − ai)², the result follows.
A similar argument holds for P( Σ_{i=1}^n (Xi − µi) ≤ −ε ).
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 12/83

slide-25
SLIDE 25

Mathematical Tools

Monte-Carlo Approximation of a Mean

Definition

Let X be a random variable with mean µ = E[X] and variance σ² = V[X], and let xn ∼ X be n i.i.d. realizations of X. The Monte-Carlo approximation of the mean (i.e., the empirical mean) built on the n i.i.d. realizations is defined as
µn = (1/n) Σ_{i=1}^n xi.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 13/83

slide-26
SLIDE 26

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[µn] = µ (and V[µn] = V[X]/n)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 14/83

slide-27
SLIDE 27

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[µn] = µ (and V[µn] = V[X]/n)
◮ Weak law of large numbers: µn →P µ.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 14/83

slide-28
SLIDE 28

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[µn] = µ (and V[µn] = V[X]/n)
◮ Weak law of large numbers: µn →P µ.
◮ Strong law of large numbers: µn →a.s. µ.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 14/83

slide-29
SLIDE 29

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[µn] = µ (and V[µn] = V[X]/n)
◮ Weak law of large numbers: µn →P µ.
◮ Strong law of large numbers: µn →a.s. µ.
◮ Central limit theorem (CLT): √n (µn − µ) →D N(0, V[X]).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 14/83

slide-30
SLIDE 30

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[µn] = µ (and V[µn] = V[X]/n)
◮ Weak law of large numbers: µn →P µ.
◮ Strong law of large numbers: µn →a.s. µ.
◮ Central limit theorem (CLT): √n (µn − µ) →D N(0, V[X]).
◮ Finite sample guarantee (for X bounded in [a, b]):
P( |(1/n) Σ_{t=1}^n Xt − E[X1]| > ε ) ≤ 2 exp( −2nε²/(b − a)² ),
where the left-hand term is the deviation, ε is the accuracy, and the right-hand side is the confidence.
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 14/83

slide-31
SLIDE 31

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[µn] = µ (and V[µn] = V[X]/n)
◮ Weak law of large numbers: µn →P µ.
◮ Strong law of large numbers: µn →a.s. µ.
◮ Central limit theorem (CLT): √n (µn − µ) →D N(0, V[X]).
◮ Finite sample guarantee:
P( |(1/n) Σ_{t=1}^n Xt − E[X1]| > (b − a)√(log(2/δ)/(2n)) ) ≤ δ
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 15/83

slide-32
SLIDE 32

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[µn] = µ (and V[µn] = V[X]/n)
◮ Weak law of large numbers: µn →P µ.
◮ Strong law of large numbers: µn →a.s. µ.
◮ Central limit theorem (CLT): √n (µn − µ) →D N(0, V[X]).
◮ Finite sample guarantee:
P( |(1/n) Σ_{t=1}^n Xt − E[X1]| > ε ) ≤ δ   if n ≥ (b − a)² log(2/δ) / (2ε²).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 16/83

slide-33
SLIDE 33

Mathematical Tools

Exercise

Simulate n Bernoulli of probability p and verify the correctness and the accuracy of the C-H bounds.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 17/83
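A possible way to carry out the exercise (a sketch; the values of p, n, δ and the number of repetitions are arbitrary choices): compare the empirical deviation of the Bernoulli mean with the Chernoff-Hoeffding accuracy (b − a)√(log(2/δ)/(2n)) from the previous slide.

```python
import math
import random

p, n, delta = 0.3, 1000, 0.05          # arbitrary choices for the exercise
n_runs = 10000                          # independent repetitions

# Chernoff-Hoeffding: with prob. >= 1 - delta, |mu_n - p| <= sqrt(log(2/delta) / (2n))
bound = math.sqrt(math.log(2 / delta) / (2 * n))    # here b - a = 1

violations = 0
for _ in range(n_runs):
    mu_n = sum(random.random() < p for _ in range(n)) / n
    if abs(mu_n - p) > bound:
        violations += 1

print(f"C-H deviation bound: {bound:.4f}")
print(f"empirical violation frequency: {violations / n_runs:.4f} (should be <= {delta})")
```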

slide-34
SLIDE 34

Mathematical Tools

Stochastic Approximation of a Mean

Definition

Let X be a random variable bounded in [0, 1] with mean µ = E[X], and let xn ∼ X be n i.i.d. realizations of X. The stochastic approximation of the mean is
µn = (1 − ηn) µ_{n−1} + ηn xn,
with µ1 = x1 and where (ηn) is a sequence of learning steps.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 18/83

slide-35
SLIDE 35

Mathematical Tools

Stochastic Approximation of a Mean

Definition

Let X be a random variable bounded in [0, 1] with mean µ = E[X], and let xn ∼ X be n i.i.d. realizations of X. The stochastic approximation of the mean is
µn = (1 − ηn) µ_{n−1} + ηn xn,
with µ1 = x1 and where (ηn) is a sequence of learning steps.
Remark: when ηn = 1/n this is the recursive definition of the empirical mean.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 18/83
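A small sketch of the recursion µn = (1 − ηn)µ_{n−1} + ηn xn with ηn = n^{−α}; the Bernoulli distribution and the values of α are illustrative assumptions, not from the slides.

```python
import random

def stochastic_mean(samples, alpha):
    """Stochastic approximation of the mean with learning steps eta_n = n**(-alpha)."""
    mu = samples[0]                         # mu_1 = x_1
    for n, x in enumerate(samples[1:], start=2):
        eta = n ** (-alpha)
        mu = (1 - eta) * mu + eta * x       # mu_n = (1 - eta_n) mu_{n-1} + eta_n x_n
    return mu

random.seed(0)
xs = [float(random.random() < 0.3) for _ in range(100000)]   # i.i.d. Bernoulli(0.3)
for alpha in (0.6, 0.8, 1.0):               # 1/2 < alpha <= 1: both conditions hold
    print(alpha, stochastic_mean(xs, alpha))
```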

slide-36
SLIDE 36

Mathematical Tools

Stochastic Approximation of a Mean

Proposition (Borel-Cantelli)

Let (En)n≥1 be a sequence of events such that Σ_{n≥1} P(En) < ∞. Then, with probability 1, only finitely many of the En occur. More formally,
P( lim sup_{n→∞} En ) = P( ∩_{n=1}^∞ ∪_{k=n}^∞ Ek ) = 0.
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 19/83

slide-37
SLIDE 37

Mathematical Tools

Stochastic Approximation of a Mean

Proposition

If the learning steps ηn ≥ 0 are such that
Σ_{n≥0} ηn = ∞  and  Σ_{n≥0} ηn² < ∞,
then µn →a.s. µ, and we say that µn is a consistent estimator.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 20/83

slide-38
SLIDE 38

Mathematical Tools

Stochastic Approximation of a Mean

  • Proof. We focus on the case ηn = n^{−α}.
In order to satisfy the two conditions we need 1/2 < α ≤ 1. In fact, for instance,
α = 2 ⇒ Σ_{n≥1} 1/n² = π²/6 < ∞ (see the Basel problem), so the first condition fails, while
α = 1/2 ⇒ Σ_{n≥1} (1/√n)² = Σ_{n≥1} 1/n = ∞ (harmonic series), so the second condition fails.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 21/83

slide-39
SLIDE 39

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont’d). Case α = 1. Let (εk)k be a sequence such that εk → 0. Almost sure convergence corresponds to
P( lim_{n→∞} µn = µ ) = P( ∀k, ∃nk, ∀n ≥ nk, |µn − µ| ≤ εk ) = 1.
From the Chernoff-Hoeffding inequality, for any fixed n,
P( |µn − µ| ≥ ε ) ≤ 2e^{−2nε²}.    (1)
Let {En} be the sequence of events En = {|µn − µ| ≥ ε}. From C-H, Σ_{n≥1} P(En) < ∞, and from the Borel-Cantelli lemma we obtain that, with probability 1, there exist only finitely many values of n such that |µn − µ| ≥ ε.
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 22/83

slide-40
SLIDE 40

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont’d). Case α = 1. Then for any εk there exist only finitely many instants at which |µn − µ| ≥ εk, which corresponds to: ∃nk such that
P( ∀n ≥ nk, |µn − µ| ≤ εk ) = 1.
Repeating the argument for all εk in the sequence leads to the statement.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 23/83

slide-41
SLIDE 41

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont’d). Case α = 1. Then for any εk there exist only finitely many instants at which |µn − µ| ≥ εk, which corresponds to: ∃nk such that
P( ∀n ≥ nk, |µn − µ| ≤ εk ) = 1.
Repeating the argument for all εk in the sequence leads to the statement.

Remark: when α = 1, µn is the Monte-Carlo estimate and this result corresponds to the strong law of large numbers. A more precise and accurate proof is here: http://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 23/83

slide-42
SLIDE 42

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont’d). Case 1/2 < α < 1. The stochastic approximation µn unrolls as
µ1 = x1
µ2 = (1 − η2)µ1 + η2x2 = (1 − η2)x1 + η2x2
µ3 = (1 − η3)µ2 + η3x3 = (1 − η2)(1 − η3)x1 + η2(1 − η3)x2 + η3x3
. . .
µn = Σ_{i=1}^n λi xi,  with λi = ηi Π_{j=i+1}^n (1 − ηj), so that Σ_{i=1}^n λi = 1.
By the C-H inequality,
P( |Σ_{i=1}^n λi xi − Σ_{i=1}^n λi E[xi]| ≥ ε ) = P( |µn − µ| ≥ ε ) ≤ e^{−2ε² / Σ_{i=1}^n λi²}.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 24/83

slide-43
SLIDE 43

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont’d). Case 1/2 < α < 1. From the definition of λi,
log λi = log ηi + Σ_{j=i+1}^n log(1 − ηj) ≤ log ηi − Σ_{j=i+1}^n ηj,
since log(1 − x) < −x. Thus λi ≤ ηi e^{−Σ_{j=i+1}^n ηj} and, for any 1 ≤ m ≤ n,
Σ_{i=1}^n λi² ≤ Σ_{i=1}^n ηi² e^{−2 Σ_{j=i+1}^n ηj}
 ≤(a) Σ_{i=1}^m e^{−2 Σ_{j=i+1}^n ηj} + Σ_{i=m+1}^n ηi²
 ≤(b) m e^{−2(n−m)ηn} + (n − m)ηm²
 =(c) m e^{−2(n−m)n^{−α}} + (n − m)m^{−2α}.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 25/83

slide-44
SLIDE 44

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont’d). Case 1/2 < α < 1. Let m = n^β with β such that 1 − 2αβ = 1/2 − α (i.e., β = (1 + 1/(2α))/2):
Σ_{i=1}^n λi² ≤ n e^{−2(1 − n^{−1/4}) n^{1−α}} + n^{1/2−α} ≤ 2n^{1/2−α}
for n large enough, which leads to
P( |µn − µ| ≥ ε ) ≤ e^{−ε² / n^{1/2−α}}.
From this point on we follow the same steps as for α = 1 (application of the Borel-Cantelli lemma) and obtain the convergence result for µn.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 26/83

slide-45
SLIDE 45

Mathematical Tools

Stochastic Approximation of a Fixed Point

Definition

Let T : R^N → R^N be a contraction in some norm ‖·‖ with fixed point V. For any function W and state x, a noisy observation T̂W(x) = TW(x) + b(x) is available.
For any x ∈ X = {1, . . . , N}, we define the stochastic approximation
V_{n+1}(x) = (1 − ηn(x)) Vn(x) + ηn(x) T̂Vn(x) = (1 − ηn(x)) Vn(x) + ηn(x) (TVn(x) + bn(x)),
where ηn is a sequence of learning steps.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 27/83

slide-46
SLIDE 46

Mathematical Tools

Stochastic Approximation of a Fixed Point

Proposition

Let Fn = {V0, . . . , Vn, b0, . . . , b_{n−1}, η0, . . . , ηn} be the filtration of the algorithm and assume that
E[bn(x) | Fn] = 0  and  E[bn²(x) | Fn] ≤ c(1 + ‖Vn‖²)
for some constant c. If the learning rates ηn(x) are positive and satisfy the stochastic approximation conditions
Σ_{n≥0} ηn(x) = ∞,  Σ_{n≥0} ηn²(x) < ∞,
then for any x ∈ X, Vn(x) →a.s. V(x).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 28/83

slide-47
SLIDE 47

Mathematical Tools

Stochastic Approximation of a Zero

Robbins-Monro (1951) algorithm. Given a noisy function f, find x∗ such that f(x∗) = 0. At each xn, observe yn = f(xn) + bn (with bn a zero-mean independent noise) and compute xn+1 = xn − ηn yn.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 29/83

slide-48
SLIDE 48

Mathematical Tools

Stochastic Approximation of a Zero

Robbins-Monro (1951) algorithm. Given a noisy function f, find x∗ such that f(x∗) = 0. At each xn, observe yn = f(xn) + bn (with bn a zero-mean independent noise) and compute xn+1 = xn − ηn yn.
If f is an increasing function, then under the same assumptions on the learning steps, xn →a.s. x∗.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 29/83
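A sketch of the Robbins-Monro iteration on a toy problem (the increasing function f and the Gaussian noise are arbitrary illustrative choices): we look for x∗ with f(x∗) = 0 using noisy evaluations only.

```python
import random

def noisy_f(x):
    """Noisy observation y_n = f(x_n) + b_n of the increasing function f(x) = x - 2."""
    return (x - 2.0) + random.gauss(0.0, 1.0)

x = 0.0
for n in range(1, 100_001):
    eta = 1.0 / n                  # sum eta_n = inf, sum eta_n^2 < inf
    x = x - eta * noisy_f(x)       # x_{n+1} = x_n - eta_n y_n

print(x)                           # close to the root x* = 2
```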

slide-49
SLIDE 49

Mathematical Tools

Stochastic Approximation of a Minimum

Kiefer-Wolfowitz (1952) algorithm. Given a function f and noisy observations of its gradient, find x∗ = arg min f(x).
At each xn, observe gn = ∇f(xn) + bn (with bn a zero-mean independent noise) and compute xn+1 = xn − ηn gn.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 30/83

slide-50
SLIDE 50

Mathematical Tools

Stochastic Approximation of a Minimum

Kiefer-Wolfowitz (1952) algorithm. Given a function f and noisy observations of its gradient, find x∗ = arg min f(x).
At each xn, observe gn = ∇f(xn) + bn (with bn a zero-mean independent noise) and compute xn+1 = xn − ηn gn.
If the Hessian ∇²f is positive, then under the same assumptions on the learning steps, xn →a.s. x∗.
Remark: this is often referred to as the stochastic gradient algorithm.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 30/83
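The corresponding stochastic gradient sketch (the quadratic f and the noise level are again arbitrary): the iterate follows noisy gradients gn = ∇f(xn) + bn.

```python
import random

def noisy_grad(x):
    """Noisy gradient g_n = grad f(x_n) + b_n of f(x) = (x - 3)^2 / 2 (minimum at x* = 3)."""
    return (x - 3.0) + random.gauss(0.0, 1.0)

x = 0.0
for n in range(1, 100_001):
    eta = 1.0 / n                  # same Robbins-Monro step-size conditions
    x = x - eta * noisy_grad(x)    # x_{n+1} = x_n - eta_n g_n

print(x)                           # close to the minimizer x* = 3
```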

slide-51
SLIDE 51

The Monte-Carlo Algorithm

How to solve incrementally an RL problem

Reinforcement Learning Algorithms

Tools Policy Evaluation Policy Learning

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 31/83

slide-52
SLIDE 52

The Monte-Carlo Algorithm

The RL Interaction Protocol

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0 [possibly random]

[execute one trajectory]

  • 3. While (xt not terminal)

3.1 Take action at
3.2 Observe next state xt+1 and reward rt
3.3 Set t = t + 1
EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 32/83

slide-53
SLIDE 53

The Monte-Carlo Algorithm

The RL Interaction Protocol

[Diagram: n sampled trajectories starting from x0; the i-th trajectory visits the states x^(i)_1, x^(i)_2, . . . , x^(i)_{T(i)}.]

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 33/83

slide-54
SLIDE 54

The Monte-Carlo Algorithm

Policy Evaluation

Objective: given a policy π, evaluate its quality at the (fixed) initial state x0

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 34/83

slide-55
SLIDE 55

The Monte-Carlo Algorithm

Policy Evaluation

Objective: given a policy π, evaluate its quality at the (fixed) initial state x0
For i = 1, . . . , n
  • 1. Set t = 0
  • 2. Set initial state x0 [possibly random]
[execute one trajectory]
  • 3. While (xt not terminal)
3.1 Take action at = π(xt)
3.2 Observe next state xt+1 and reward rt = r^π(xt)
3.3 Set t = t + 1
EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 34/83

slide-56
SLIDE 56

The Monte-Carlo Algorithm

The RL Interaction Protocol

[Diagram: the same n trajectories, now annotated with the rewards r^π(x^(i)_t) collected at each visited state.]

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 35/83

slide-57
SLIDE 57

The Monte-Carlo Algorithm

State Value Function

◮ Infinite time horizon with a terminal state: the problem never terminates, but the agent will eventually reach a termination state.
V^π(x) = E[ Σ_{t=0}^T γ^t r(xt, π(xt)) | x0 = x; π ],
where T is the first (random) time when the termination state is reached.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 36/83

slide-58
SLIDE 58

The Monte-Carlo Algorithm

Monte-Carlo Approximation

Idea: we can approximate an expectation by an average!

◮ Return of trajectory i:
R_i(x0) = Σ_{t=0}^{T(i)} γ^t r^π(x^(i)_t)
◮ Estimated value function:
V̂^π_n(x0) = (1/n) Σ_{i=1}^n R_i(x0)
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 37/83

slide-59
SLIDE 59

The Monte-Carlo Algorithm

Monte-Carlo Approximation

For i = 1, . . . , n
  • 1. Set t = 0
  • 2. Set initial state x0 [possibly random]
[execute one trajectory]
  • 3. While (xt not terminal)
3.1 Take action at = π(xt)
3.2 Observe next state xt+1 and reward rt = r^π(xt)
3.3 Set t = t + 1
EndWhile
EndFor
Collect the trajectories and compute V̂^π_n(x0) using the MC approximation

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 38/83
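A sketch of this Monte-Carlo policy evaluation loop on a tiny episodic chain; the chain, the discount factor and the fixed policy are illustrative assumptions, not from the slides, and the true value can be checked against the Bellman equation.

```python
import random

GAMMA, P_END = 0.95, 0.1

def step(x, a):
    """Hypothetical episodic chain: in state 1 the reward is 1 and the episode
    moves to the terminal state 0 with probability P_END (whatever the action)."""
    y = 0 if random.random() < P_END else 1
    return y, 1.0

def policy(x):
    return 0                                   # a fixed (dummy) policy pi

def mc_evaluate(x0, n):
    """Monte-Carlo estimate V_n(x0): average of the returns of n trajectories."""
    total = 0.0
    for _ in range(n):
        x, t, ret = x0, 0, 0.0
        while x != 0:                          # execute one trajectory
            x, r = step(x, policy(x))
            ret += GAMMA ** t * r              # R_i(x0) = sum_t gamma^t r^pi(x_t)
            t += 1
        total += ret
    return total / n

# The Bellman equation gives V(1) = 1 / (1 - GAMMA * (1 - P_END)) ~ 6.90
print(mc_evaluate(x0=1, n=20000))
```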

slide-60
SLIDE 60

The Monte-Carlo Algorithm

Monte-Carlo Approximation: Properties

◮ All returns are unbiased estimates of V^π(x0):
E[R^(i)(x0)] = E[ r^π(x^(i)_0) + γ r^π(x^(i)_1) + · · · + γ^{T(i)} r^π(x^(i)_{T(i)}) ] = V^π(x0)
◮ Thus V̂^π_n(x0) →a.s. V^π(x0).
◮ Finite-sample guarantees are also possible

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 39/83

slide-61
SLIDE 61

The Monte-Carlo Algorithm

Monte-Carlo Approximation: Extensions

Non-episodic problems

◮ Interrupt trajectories after H steps:
R_i(x0) = Σ_{t=0}^H γ^t r^π(x^(i)_t)
◮ Loss in accuracy limited to γ^H rmax/(1 − γ)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 40/83

slide-62
SLIDE 62

The Monte-Carlo Algorithm

Monte-Carlo Approximation: Extensions

Multiple subtrajectories

[Diagram: the same trajectories, where different trajectories pass through a common state x (e.g., x^(i)_1 = x and x^(n)_2 = x).]
All subtrajectories starting from x can be used to estimate V^π(x)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 41/83

slide-63
SLIDE 63

The Monte-Carlo Algorithm

First-visit and Every-Visit Monte-Carlo

Remark: any trajectory (x0, x1, x2, . . . , xT) also contains the sub-trajectory (xt, xt+1, . . . , xT), whose return R(xt) = r^π(xt) + · · · + r^π(x_{T−1}) could be used to build an estimator of V^π(xt).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 42/83

slide-64
SLIDE 64

The Monte-Carlo Algorithm

First-visit and Every-Visit Monte-Carlo

Remark: any trajectory (x0, x1, x2, . . . , xT) also contains the sub-trajectory (xt, xt+1, . . . , xT), whose return R(xt) = r^π(xt) + · · · + r^π(x_{T−1}) could be used to build an estimator of V^π(xt).
◮ First-visit MC. For each state x we only consider the sub-trajectory starting at the first visit to x. Unbiased estimator, only one sample per trajectory.
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 42/83

slide-65
SLIDE 65

The Monte-Carlo Algorithm

First-visit and Every-Visit Monte-Carlo

Remark: any trajectory (x0, x1, x2, . . . , xT) also contains the sub-trajectory (xt, xt+1, . . . , xT), whose return R(xt) = r^π(xt) + · · · + r^π(x_{T−1}) could be used to build an estimator of V^π(xt).
◮ First-visit MC. For each state x we only consider the sub-trajectory starting at the first visit to x. Unbiased estimator, only one sample per trajectory.
◮ Every-visit MC. Given a trajectory (x0 = x, x1, x2, . . . , xT), we list all the m sub-trajectories starting from x up to xT and average them all to obtain an estimate. More than one sample per trajectory, but a biased estimator.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 42/83

slide-66
SLIDE 66

The Monte-Carlo Algorithm

Question

More samples or no bias? ⇒ Sometimes a biased estimator is preferable if consistent!

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 43/83

slide-67
SLIDE 67

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

Example: 2-state Markov chain
[Diagram: state 1 has a self-loop with probability 1 − p and moves to the terminal state 0 with probability p.]
The reward is 1 while in state 1 (and 0 in the terminal state). All trajectories are of the form (x0 = 1, x1 = 1, . . . , xT = 0). By the Bellman equation,
V(1) = 1 + (1 − p)V(1) + p · 0 = 1/p,  since V(0) = 0.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 44/83

slide-68
SLIDE 68

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

We measure the mean squared error (MSE) of V̂ w.r.t. V:
E[(V̂ − V)²] = (E[V̂] − V)²  [Bias²]  +  E[(V̂ − E[V̂])²]  [Variance]
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 45/83

slide-69
SLIDE 69

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

First-visit Monte-Carlo. All the trajectories start from state 1, so the return of one single trajectory is exactly T, i.e., V̂ = T. The time-to-end T is a geometric r.v. with expectation E[V̂] = E[T] = 1/p = V^π(1) ⇒ unbiased estimator.
Thus the MSE of V̂ coincides with the variance of T, which is E[(T − 1/p)²] = 1/p² − 1/p.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 46/83

slide-70
SLIDE 70

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

Every-visit Monte-Carlo. Given one trajectory, we can construct T sub-trajectories (the number of times state 1 is visited), where the t-th sub-trajectory has return T − t.
V̂ = (1/T) Σ_{t=0}^{T−1} (T − t) = (1/T) Σ_{t′=1}^T t′ = (T + 1)/2.
The corresponding expectation is E[(T + 1)/2] = (1 + p)/(2p) ≠ V^π(1) ⇒ biased estimator.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 47/83

slide-71
SLIDE 71

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

Let’s consider n independent trajectories, each of length Ti. The total number of samples is Σ_{i=1}^n Ti and the estimator V̂n is
V̂n = [ Σ_{i=1}^n Σ_{t=0}^{Ti−1} (Ti − t) ] / Σ_{i=1}^n Ti = [ Σ_{i=1}^n Ti(Ti + 1)/2 ] / Σ_{i=1}^n Ti = [ (1/n) Σ_{i=1}^n Ti(Ti + 1) ] / [ (2/n) Σ_{i=1}^n Ti ]
 →a.s. (E[T²] + E[T]) / (2E[T]) = 1/p = V^π(1) ⇒ consistent estimator.
The MSE of the (single-trajectory) estimator is
E[( (T + 1)/2 − 1/p )²] = 1/(2p²) − 3/(4p) + 1/4 ≤ 1/p² − 1/p.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 48/83
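A sketch that reproduces this comparison on the 2-state chain above (undiscounted, reward 1 in state 1): both estimators are built from the same simulated trajectories and their MSEs are compared against V(1) = 1/p. The values of p, the number of trajectories and the number of repetitions are arbitrary.

```python
import random

def simulate_T(p):
    """Trajectory length: number of visits to state 1 before termination (geometric, mean 1/p)."""
    T = 1
    while random.random() >= p:
        T += 1
    return T

def compare(p=0.3, n_traj=10, n_runs=20000):
    v_true = 1.0 / p
    mse_first = mse_every = 0.0
    for _ in range(n_runs):
        Ts = [simulate_T(p) for _ in range(n_traj)]
        v_first = sum(Ts) / n_traj                             # first-visit: one return T_i per trajectory
        v_every = sum(T * (T + 1) / 2 for T in Ts) / sum(Ts)   # every-visit: returns T_i, T_i - 1, ..., 1
        mse_first += (v_first - v_true) ** 2
        mse_every += (v_every - v_true) ** 2
    print("first-visit MSE:", mse_first / n_runs)
    print("every-visit MSE:", mse_every / n_runs)

compare()
```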

slide-72
SLIDE 72

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

In general
◮ Every-visit MC: biased but consistent estimator.
◮ First-visit MC: unbiased estimator, but with potentially bigger MSE.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 49/83

slide-73
SLIDE 73

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

In general
◮ Every-visit MC: biased but consistent estimator.
◮ First-visit MC: unbiased estimator, but with potentially bigger MSE.

Remark: when the state space is large, the probability of visiting the same state multiple times is low, so the performance of the two methods tends to be the same.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 49/83

slide-74
SLIDE 74

The Monte-Carlo Algorithm

Monte-Carlo Approximation: Extensions

Full estimate of V^π over all x ∈ X
◮ Use subtrajectories
◮ Restart from random states over X

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 50/83

slide-75
SLIDE 75

The Monte-Carlo Algorithm

Monte-Carlo Approximation: Limitations

◮ The estimate V̂^π(x0) is computed only once all the trajectories have terminated

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 51/83

slide-76
SLIDE 76

The Monte-Carlo Algorithm

Temporal Difference TD(1)

Idea: we can approximate an expectation by an incremental average!

◮ Return of trajectory i:
R_i(x0) = Σ_{t=0}^{T(i)} γ^t r^π(x^(i)_t)
◮ Estimated value function after trajectory i:
V̂^π_i(x0) = (1 − αi) V̂^π_{i−1}(x0) + αi R_i(x0)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 52/83

slide-77
SLIDE 77

The Monte-Carlo Algorithm

Temporal Difference TD(1)

For i = 1, . . . , n
  • 1. Set t = 0
  • 2. Set initial state x0 [possibly random]
[execute one trajectory]
  • 3. While (xt not terminal)
3.1 Take action at = π(xt)
3.2 Observe next state xt+1 and reward rt = r^π(xt)
3.3 Set t = t + 1
EndWhile
  • 4. Update V̂^π_i(x0) using the TD(1) approximation
EndFor

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 53/83

slide-78
SLIDE 78

The Monte-Carlo Algorithm

Temporal Difference TD(1): Properties

◮ If αi = 1/i, then TD(1) is just the incremental version of the empirical mean:
V̂^π_i(x0) = ((i − 1)/i) V̂^π_{i−1}(x0) + (1/i) R_i(x0)
◮ Using a generic step-size (learning rate) αi gives flexibility to the algorithm

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 54/83

slide-79
SLIDE 79

The Monte-Carlo Algorithm

Temporal Difference TD(1): Properties

Proposition

If the learning rate satisfies the Robbins-Monro conditions
Σ_{i≥0} αi = ∞,  Σ_{i≥0} αi² < ∞,
then V̂^π_n(x0) →a.s. V^π(x0)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 55/83

slide-80
SLIDE 80

The Monte-Carlo Algorithm

Temporal Difference TD(1): Extensions

◮ Non-episodic problems: truncated trajectories
◮ Multiple sub-trajectories
  ◮ Updates of all the states using sub-trajectories
  ◮ State-dependent learning rate αi(x)
  ◮ i is the index of the number of updates in that specific state

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 56/83

slide-81
SLIDE 81

The Monte-Carlo Algorithm

Temporal Difference TD(1): Limitations

◮ The estimate V̂^π(x0) is updated only once the trajectory has completely terminated

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 57/83

slide-82
SLIDE 82

The Monte-Carlo Algorithm

The Bellman Equation

Proposition

For any stationary policy π = (π, π, . . . ), the state value function at a state x ∈ X satisfies the Bellman equation:
V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 58/83

slide-83
SLIDE 83

The Monte-Carlo Algorithm

Temporal Difference TD(0)

Idea: we can approximate V π by estimating the Bellman error

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 59/83

slide-84
SLIDE 84

The Monte-Carlo Algorithm

Temporal Difference TD(0)

Idea: we can approximate V π by estimating the Bellman error

◮ Bellman error of a function V at a state x:
B^π(V; x) = r^π(x) + γ Σ_y p(y|x, π(x)) V(y) − V(x).

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 59/83

slide-85
SLIDE 85

The Monte-Carlo Algorithm

Temporal Difference TD(0)

Idea: we can approximate V π by estimating the Bellman error

◮ Bellman error of a function V at a state x:
B^π(V; x) = r^π(x) + γ Σ_y p(y|x, π(x)) V(y) − V(x).
◮ Temporal difference of a function V̂^π for a transition (xt, rt, xt+1):
δt = rt + γ V̂^π(xt+1) − V̂^π(xt)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 59/83

slide-86
SLIDE 86

The Monte-Carlo Algorithm

Temporal Difference TD(0)

Idea: we can approximate V π by estimating the Bellman error

◮ Bellman error of a function V at a state x:
B^π(V; x) = r^π(x) + γ Σ_y p(y|x, π(x)) V(y) − V(x).
◮ Temporal difference of a function V̂^π for a transition (xt, rt, xt+1):
δt = rt + γ V̂^π(xt+1) − V̂^π(xt)
◮ Estimated value function after the transition (xt, rt, xt+1):
V̂^π(xt) = (1 − αi(xt)) V̂^π(xt) + αi(xt)( rt + γ V̂^π(xt+1) ) = V̂^π(xt) + αi(xt) δt

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 59/83

slide-87
SLIDE 87

The Monte-Carlo Algorithm

Temporal Difference TD(0)

For i = 1, . . . , n
  • 1. Set t = 0
  • 2. Set initial state x0 [possibly random]
[execute one trajectory]
  • 3. While (xt not terminal)
3.1 Take action at = π(xt)
3.2 Observe next state xt+1 and reward rt = r^π(xt)
3.3 Set t = t + 1
3.4 Update V̂^π(xt) using the TD(0) approximation
EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 60/83

slide-88
SLIDE 88

The Monte-Carlo Algorithm

Temporal Difference TD(0): Properties

◮ The update rule
V̂^π(xt) = (1 − αi(xt)) V̂^π(xt) + αi(xt)( rt + γ V̂^π(xt+1) )
bootstraps the current estimate of V̂^π at other states.
◮ The temporal difference is an unbiased sample of the Bellman error:
E[δt] = E[ rt + γ V̂^π(xt+1) − V̂^π(xt) ] = T^π V̂^π(xt) − V̂^π(xt)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 61/83

slide-89
SLIDE 89

The Monte-Carlo Algorithm

Temporal Difference TD(0): Properties

Proposition

If the learning rate satisfies the Robbins-Monro conditions in all states x ∈ X

  • i=0

αi(x) = ∞,

  • i=0

α2

i (x) < ∞,

and all states are visited infinitely often, then for all x ∈ X

  • V π(x) a.s.

− → V π(x)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 62/83

slide-90
SLIDE 90

The Monte-Carlo Algorithm

Temporal Difference TD(0)

For i = 1, . . . , n
  • 1. Set t = 0
  • 2. Set V̂^π(x) = 0, ∀x ∈ X
  • 3. Set initial state x0
  • 4. While (xt not terminal)
4.1 Take action at = π(xt)
4.2 Observe next state xt+1 and reward rt = r^π(xt)
4.3 Set t = t + 1
4.4 Compute the TD δt = rt + γ V̂^π(xt+1) − V̂^π(xt)
4.5 Update the value function estimate at xt as V̂^π(xt) = V̂^π(xt) + αi(xt) δt
4.6 Update the learning rate, e.g., α(xt) = 1/#visits(xt)
EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 63/83
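A sketch of this TD(0) loop on a small illustrative chain (the chain, the fixed policy and γ are assumptions for the example, not from the slides):

```python
import random

GAMMA, N_STATES = 0.95, 5

def step(x, a):
    """Hypothetical chain: move right w.p. 0.9 (stay otherwise); reward 1 on
    entering the rightmost state, which is terminal."""
    y = min(x + 1, N_STATES - 1) if random.random() < 0.9 else x
    return y, (1.0 if y == N_STATES - 1 else 0.0)

V = [0.0] * N_STATES                     # V of the terminal state stays 0
visits = [0] * N_STATES

for episode in range(5000):
    x = 0
    while x != N_STATES - 1:
        y, r = step(x, 0)                # fixed policy pi (single dummy action)
        visits[x] += 1
        alpha = 1.0 / visits[x]          # state-dependent learning rate 1/#visits(x)
        delta = r + GAMMA * V[y] - V[x]  # temporal difference
        V[x] += alpha * delta            # TD(0) update
        x = y

print([round(v, 3) for v in V])
```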

slide-91
SLIDE 91

The Monte-Carlo Algorithm

Comparison between TD(1) and TD(0)

TD(1)
◮ Update rule
V̂^π(xt) = V̂^π(xt) + α(xt)[ δt + γ δ_{t+1} + · · · + γ^{T−t} δ_T ].
◮ No bias, large variance
TD(0)
◮ Update rule
V̂^π(xt) = V̂^π(xt) + α(xt) δt.
◮ Potential bias, small variance

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 64/83

slide-92
SLIDE 92

The Monte-Carlo Algorithm

Comparison between TD(1) and TD(0)

TD(1)
◮ Update rule
V̂^π(xt) = V̂^π(xt) + α(xt)[ δt + γ δ_{t+1} + · · · + γ^{T−t} δ_T ].
◮ No bias, large variance
TD(0)
◮ Update rule
V̂^π(xt) = V̂^π(xt) + α(xt) δt.
◮ Potential bias, small variance

⇒ TD(λ) performs intermediate updates!

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 64/83

slide-93
SLIDE 93

The Monte-Carlo Algorithm

The T^π_λ Bellman operator

Definition

Given λ < 1, the Bellman operator T^π_λ is
T^π_λ = (1 − λ) Σ_{m≥0} λ^m (T^π)^{m+1}.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 65/83

slide-94
SLIDE 94

The Monte-Carlo Algorithm

The T^π_λ Bellman operator

Definition

Given λ < 1, the Bellman operator T^π_λ is
T^π_λ = (1 − λ) Σ_{m≥0} λ^m (T^π)^{m+1}.

Remark: this is a convex combination of the m-step Bellman operators (T^π)^m, weighted by coefficients defined as a function of λ.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 65/83

slide-95
SLIDE 95

The Monte-Carlo Algorithm

Temporal Difference TD(λ)

Idea: use the whole series of temporal differences to update V̂^π
◮ Temporal difference of a function V̂^π for a transition (xt, rt, xt+1):
δt = rt + γ V̂^π(xt+1) − V̂^π(xt)
◮ Estimated value function:
V̂^π(xt) = V̂^π(xt) + αi(xt) Σ_{s=t}^T (γλ)^{s−t} δs

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 66/83

slide-96
SLIDE 96

The Monte-Carlo Algorithm

Temporal Difference TD(λ)

Idea: use the whole series of temporal differences to update V̂^π
◮ Temporal difference of a function V̂^π for a transition (xt, rt, xt+1):
δt = rt + γ V̂^π(xt+1) − V̂^π(xt)
◮ Estimated value function:
V̂^π(xt) = V̂^π(xt) + αi(xt) Σ_{s=t}^T (γλ)^{s−t} δs
⇒ Still requires the whole trajectory before updating...

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 66/83

slide-97
SLIDE 97

The Monte-Carlo Algorithm

Temporal Difference TD(λ): Eligibility Traces

◮ Eligibility traces z ∈ R^N
◮ For every transition xt → xt+1:

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 67/83

slide-98
SLIDE 98

The Monte-Carlo Algorithm

Temporal Difference TD(λ): Eligibility Traces

◮ Eligibility traces z ∈ R^N
◮ For every transition xt → xt+1:
  • 1. Compute the temporal difference
δt = r^π(xt) + γ V̂^π(xt+1) − V̂^π(xt)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 67/83

slide-99
SLIDE 99

The Monte-Carlo Algorithm

Temporal Difference TD(λ): Eligibility Traces

◮ Eligibility traces z ∈ R^N
◮ For every transition xt → xt+1:
  • 1. Compute the temporal difference
δt = r^π(xt) + γ V̂^π(xt+1) − V̂^π(xt)
  • 2. Update the eligibility traces
z(x) = λ z(x) if x ≠ xt,
z(x) = 1 + λ z(x) if x = xt,
(reset the traces if xt = x0)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 67/83

slide-100
SLIDE 100

The Monte-Carlo Algorithm

Temporal Difference TD(λ): Eligibility Traces

◮ Eligibility traces z ∈ R^N
◮ For every transition xt → xt+1:
  • 1. Compute the temporal difference
δt = r^π(xt) + γ V̂^π(xt+1) − V̂^π(xt)
  • 2. Update the eligibility traces
z(x) = λ z(x) if x ≠ xt,
z(x) = 1 + λ z(x) if x = xt,
(reset the traces if xt = x0)
  • 3. For all states x ∈ X:
V̂^π(x) ← V̂^π(x) + α(x) z(x) δt.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 67/83
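The eligibility-trace version in code, sketched on the same illustrative chain as before; following the slides, the trace decays with λ only and is reset at the start of each trajectory, and a constant step-size is used instead of the state-dependent α(x).

```python
import random

GAMMA, LAMBDA, ALPHA, N_STATES = 0.95, 0.7, 0.05, 5

def step(x):
    """Same illustrative chain: move right w.p. 0.9; reward 1 at the terminal right end."""
    y = min(x + 1, N_STATES - 1) if random.random() < 0.9 else x
    return y, (1.0 if y == N_STATES - 1 else 0.0)

V = [0.0] * N_STATES

for episode in range(5000):
    x = 0
    z = [0.0] * N_STATES                         # reset the traces at x_0
    while x != N_STATES - 1:
        y, r = step(x)
        delta = r + GAMMA * V[y] - V[x]          # temporal difference
        for s in range(N_STATES):
            z[s] = LAMBDA * z[s] + (1.0 if s == x else 0.0)   # trace update
            V[s] += ALPHA * z[s] * delta         # TD(lambda): update every state
        x = y

print([round(v, 3) for v in V])
```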

slide-101
SLIDE 101

The Monte-Carlo Algorithm

Sensitivity to λ

◮ λ < 1: smaller variance w.r.t. λ = 1 (MC/TD(1)).
◮ λ > 0: faster propagation of rewards w.r.t. λ = 0.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 68/83

slide-102
SLIDE 102

The Monte-Carlo Algorithm

Example: Sensitivity to λ

Linear chain example
[Diagram: a 5-state linear chain with rewards −1 and 1 at the two endpoints.]
[Plot: the MSE of V̂n w.r.t. V^π after n = 100 trajectories, as a function of λ ∈ [0, 1].]

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 69/83

slide-103
SLIDE 103

The Q-learning Algorithm

How to solve incrementally an RL problem

Reinforcement Learning Algorithms

Tools Policy Evaluation Policy Learning

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 70/83

slide-104
SLIDE 104

The Q-learning Algorithm

Question

How do we compute the optimal policy online? ⇒ Q-learning!

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 71/83

slide-105
SLIDE 105

The Q-learning Algorithm

Learning the Optimal Policy

Objective: learn the optimal policy π∗ through direct interaction with the environment

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 72/83

slide-106
SLIDE 106

The Q-learning Algorithm

Learning the Optimal Policy

Objective: learn the optimal policy π∗ through direct interaction with the environment
For i = 1, . . . , n
  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)
3.1 Take action at
3.2 Observe next state xt+1 and reward rt
3.3 Set t = t + 1
EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 72/83

slide-107
SLIDE 107

The Q-learning Algorithm

Policy Iteration

  • 1. Let π0 be any stationary policy
  • 2. At each iteration k = 1, 2, . . . , K
◮ Policy evaluation: given πk, compute Q^{πk}.
◮ Policy improvement: compute the greedy policy
πk+1(x) ∈ arg max_{a∈A} Q^{πk}(x, a)
  • 3. Return the last policy πK
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 73/83

slide-108
SLIDE 108

The Q-learning Algorithm

SARSA

Idea: alternate policy evaluation and policy improvement

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 74/83

slide-109
SLIDE 109

The Q-learning Algorithm

SARSA

Idea: alternate policy evaluation and policy improvement

◮ Define a greedy exploratory policy with temperature τ:
π_Q(a|x) = exp(Q(x, a)/τ) / Σ_{a′} exp(Q(x, a′)/τ)
The higher Q(x, a), the higher the probability of taking action a in state x.

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 74/83

slide-110
SLIDE 110

The Q-learning Algorithm

SARSA

Idea: alternate policy evaluation and policy improvement

◮ Define a greedy exploratory policy with temperature τ:
π_Q(a|x) = exp(Q(x, a)/τ) / Σ_{a′} exp(Q(x, a′)/τ)
The higher Q(x, a), the higher the probability of taking action a in state x.
◮ Compute the temporal difference on the transition (xt, at, rt, xt+1, at+1) (with actions chosen according to π_Q(a|x)):
δt = rt + γ Q̂(xt+1, at+1) − Q̂(xt, at)

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 74/83

slide-111
SLIDE 111

The Q-learning Algorithm

SARSA

Idea: alternate policy evaluation and policy improvement

◮ Define a greedy exploratory policy with temperature τ:
π_Q(a|x) = exp(Q(x, a)/τ) / Σ_{a′} exp(Q(x, a′)/τ)
The higher Q(x, a), the higher the probability of taking action a in state x.
◮ Compute the temporal difference on the transition (xt, at, rt, xt+1, at+1) (with actions chosen according to π_Q(a|x)):
δt = rt + γ Q̂(xt+1, at+1) − Q̂(xt, at)
◮ Update the estimate of Q as
Q̂(xt, at) = Q̂(xt, at) + α(xt, at) δt

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 74/83

slide-112
SLIDE 112

The Q-learning Algorithm

SARSA: Properties

◮ The TD updates make Q̂ converge to Q^π
◮ The update of π_Q allows the policy to improve
◮ A decreasing temperature makes the policy more and more greedy
⇒ If τ → 0 at a proper rate, then Q̂ → Q∗ and π_Q → π∗

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 75/83

slide-113
SLIDE 113

The Q-learning Algorithm

SARSA: Limitations

The actions at need to be selected according to the current Q ⇒ On-policy learning

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 76/83

slide-114
SLIDE 114

The Q-learning Algorithm

The Optimal Bellman Equation

Proposition

The optimal value function V∗ (i.e., V∗ = max_π V^π) is the solution to the optimal Bellman equation:
V∗(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V∗(y) ].
  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 77/83

slide-115
SLIDE 115

The Q-learning Algorithm

Q-Learning

Idea: use TD for the optimal Bellman operator

◮ Compute the (optimal) temporal difference on the transition (xt, at, rt, xt+1) (with actions chosen arbitrarily!):
δt = rt + γ max_{a′} Q̂(xt+1, a′) − Q̂(xt, at)
◮ Update the estimate of Q as
Q̂(xt, at) = Q̂(xt, at) + α(xt, at) δt

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 78/83

slide-116
SLIDE 116

The Q-learning Algorithm

Q-Learning: Properties

Proposition

If the learning rate satisfies the Robbins-Monro conditions in all states x ∈ X

  • i=0

αi(x) = ∞,

  • i=0

α2

i (x) < ∞,

and all states are visited infinitely often, then for all x ∈ X

  • Q(x) a.s.

− → Q∗(x) Remark: “infinitely often” requires a steady exploration policy

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 79/83

slide-117
SLIDE 117

The Q-learning Algorithm

Learning the Optimal Policy

For i = 1, . . . , n
  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)
3.1 Take action at according to a suitable exploration policy
3.2 Observe next state xt+1 and reward rt
3.3 Compute the temporal difference
δt = rt + γ Q̂(xt+1, at+1) − Q̂(xt, at)   (SARSA)
δt = rt + γ max_{a′} Q̂(xt+1, a′) − Q̂(xt, at)   (Q-learning)
3.4 Update the Q-function: Q̂(xt, at) = Q̂(xt, at) + α(xt, at) δt
3.5 Set t = t + 1
EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 80/83
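A sketch of this loop with the Q-learning temporal difference; the chain environment, the ε-greedy exploration (instead of the softmax policy of the SARSA slides) and the 1/#visits step-size are illustrative assumptions, not from the slides.

```python
import random

GAMMA, EPS, N_STATES = 0.95, 0.1, 5
ACTIONS = (-1, +1)

def step(x, a):
    """Hypothetical chain: the action moves left/right w.p. 0.9 (stay otherwise);
    reward 1 on reaching the rightmost state, which is terminal."""
    y = min(max(x + a, 0), N_STATES - 1) if random.random() < 0.9 else x
    return y, (1.0 if y == N_STATES - 1 else 0.0)

Q = {(x, a): 0.0 for x in range(N_STATES) for a in ACTIONS}
visits = {sa: 0 for sa in Q}

def explore(x):
    """epsilon-greedy exploration (keeps every state-action pair visited infinitely often)."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(x, a)])

for episode in range(5000):
    x = 0
    while x != N_STATES - 1:
        a = explore(x)
        y, r = step(x, a)
        visits[(x, a)] += 1
        alpha = 1.0 / visits[(x, a)]
        # Q-learning TD: bootstrap with max_a' Q(y, a') (off-policy);
        # for SARSA, sample a' = explore(y) first and use Q[(y, a')] instead.
        delta = r + GAMMA * max(Q[(y, b)] for b in ACTIONS) - Q[(x, a)]
        Q[(x, a)] += alpha * delta
        x = y

print({sa: round(q, 2) for sa, q in Q.items()})
```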

slide-118
SLIDE 118

The Q-learning Algorithm

The Grid-World Problem

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 81/83

slide-119
SLIDE 119

The Q-learning Algorithm

Bibliography I

  • A. LAZARIC – Reinforcement Learning Algorithms

Oct 15th, 2013 - 82/83

slide-120
SLIDE 120

The Q-learning Algorithm

Reinforcement Learning

Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr