
SLIDE 1

MVA-RL Course

Sample Complexity of ADP Algorithms

A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

SLIDE 2

Sources of Error

◮ Approximation error. If X is large or continuous, value functions V cannot be represented correctly ⇒ use an approximation space F.

◮ Estimation error. If the reward r and dynamics p are unknown, the Bellman operators T and T^π cannot be computed exactly ⇒ estimate the Bellman operators from samples.

SLIDE 3

In This Lecture

◮ Infinite horizon setting with discount γ
◮ Study the impact of estimation error

SLIDE 4

In This Lecture: Warning!!

Problem: are these performance bounds accurate/useful?
Answer: of course not! :)
Reason: upper bounds, non-tight analysis, worst case.

SLIDE 5

In This Lecture: Warning!!

Chernoff-Hoeffding inequality:

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1]\right| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}}\right) \le \delta$$

⇒ worst-case w.r.t. all the distributions bounded in [a, b], loose for other distributions.
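To see how loose this worst-case bound can be, here is a small simulation, not from the original slides; the Bernoulli(0.01) distribution and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 1000, 0.05, 10_000
a, b = 0.0, 1.0

# Hoeffding deviation radius: (b - a) * sqrt(log(2/delta) / (2n))
radius = (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

# A low-variance distribution bounded in [0, 1]: Bernoulli(0.01)
p = 0.01
deviations = np.abs(rng.binomial(1, p, size=(trials, n)).mean(axis=1) - p)

print(f"Hoeffding radius          : {radius:.4f}")
print(f"empirical (1-delta)-radius: {np.quantile(deviations, 1 - delta):.4f}")
# The empirical quantile is several times smaller than the worst-case
# radius: the bound holds, but is loose for this distribution.
```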

SLIDE 6

In This Lecture: Warning!!

Question: so why should we derive/study these bounds?
Answer:

◮ General guarantees
◮ Rates of convergence (not always available in asymptotic analysis)
◮ Explicit dependency on the design parameters
◮ Explicit dependency on the problem parameters
◮ First guess on how to tune parameters
◮ Better understanding of the algorithms

SLIDE 7

Outline

Sample Complexity of LSTD
    The Algorithm
    LSTD and LSPI Error Bounds
Sample Complexity of Fitted Q-iteration

SLIDE 8

Outline

Sample Complexity of LSTD
    The Algorithm
    LSTD and LSPI Error Bounds
Sample Complexity of Fitted Q-iteration

SLIDE 9

Least-Squares Temporal-Difference Learning (LSTD)

◮ Linear function space F =

  • f : f (·) = d

j=1 αjϕj(·)

  • ◮ V π is the fixed-point of T π

V π = T πV π

◮ V π may not belong to F

V π / ∈ F

◮ Best approximation of V π in F is

ΠV π = arg min

f ∈F ||V π − f ||

(Π is the projection onto F)

F V π

T π

ΠV π

SLIDE 10

Least-Squares Temporal-Difference Learning (LSTD)

◮ LSTD searches for the fixed-point of Π∞T^π instead (Π∞ is the projection onto F w.r.t. the L∞-norm)
◮ Π∞T^π is a contraction in L∞-norm
◮ The L∞-projection is numerically expensive when the number of states is large or infinite
◮ LSTD thus searches for the fixed-point of Π_{2,ρ} T^π, where
$$\Pi_{2,\rho}\, g = \arg\min_{f\in\mathcal F} \|g - f\|_{2,\rho}$$

SLIDE 11

Least-Squares Temporal-Difference Learning (LSTD)

When the fixed-point of Π_ρ T^π exists, we call it the LSTD solution:
$$V_{TD} = \Pi_\rho T^\pi V_{TD}$$

[Figure: V_TD is the fixed-point of the projected Bellman operator Π_ρ T^π]

The fixed-point condition is equivalent to the orthogonality conditions
$$\langle T^\pi V_{TD} - V_{TD},\, \varphi_i\rangle_\rho = 0, \qquad i = 1,\dots,d$$
$$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD},\, \varphi_i\rangle_\rho = 0$$
$$\underbrace{\langle r^\pi, \varphi_i\rangle_\rho}_{b_i} - \sum_{j=1}^{d} \underbrace{\langle \varphi_j - \gamma P^\pi \varphi_j,\, \varphi_i\rangle_\rho}_{A_{ij}}\, \alpha_{TD}^{(j)} = 0 \;\longrightarrow\; A\,\alpha_{TD} = b$$

SLIDE 12

LSTD Algorithm

◮ In general, Π_ρ T^π is not a contraction and does not have a fixed-point.
◮ If ρ = ρ^π, the stationary distribution of π, then Π_{ρ^π} T^π has a unique fixed-point.

Proposition (LSTD Performance)
$$\|V^\pi - V_{TD}\|_{\rho^\pi} \le \frac{1}{\sqrt{1-\gamma^2}}\, \inf_{V\in\mathcal F} \|V^\pi - V\|_{\rho^\pi}$$

SLIDE 13

LSTD Algorithm

Empirical LSTD
◮ We observe a trajectory (X_0, R_0, X_1, R_1, ..., X_N) where X_{t+1} ∼ P(·|X_t, π(X_t)) and R_t = r(X_t, π(X_t))
◮ We build estimators of the matrix A and vector b:
$$\hat A_{ij} = \frac{1}{N}\sum_{t=0}^{N-1} \varphi_i(X_t)\big[\varphi_j(X_t) - \gamma\,\varphi_j(X_{t+1})\big], \qquad \hat b_i = \frac{1}{N}\sum_{t=0}^{N-1} \varphi_i(X_t)\, R_t$$
$$\hat A\,\hat\alpha_{TD} = \hat b, \qquad \hat V_{TD}(\cdot) = \varphi(\cdot)^\top \hat\alpha_{TD}$$
When N → ∞, Â → A and b̂ → b, and thus α̂_TD → α_TD and V̂_TD → V_TD.
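To make the estimators concrete, here is a minimal Python sketch of empirical LSTD; the trajectory format and the feature map phi are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def lstd(states, rewards, phi, gamma):
    """Empirical LSTD: solve A_hat @ alpha = b_hat from a single trajectory.

    states  : sequence X_0, ..., X_N
    rewards : sequence R_0, ..., R_{N-1}
    phi     : feature map, phi(x) -> np.ndarray of shape (d,)
    """
    d = phi(states[0]).shape[0]
    A_hat, b_hat = np.zeros((d, d)), np.zeros(d)
    N = len(rewards)
    for t in range(N):
        f_t, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f_t, f_t - gamma * f_next) / N   # A_hat_ij
        b_hat += f_t * rewards[t] / N                      # b_hat_i
    alpha = np.linalg.solve(A_hat, b_hat)                  # A_hat alpha = b_hat
    return lambda x: phi(x) @ alpha                        # V_TD_hat(x)

# Hypothetical usage with polynomial features on a scalar state:
# V_hat = lstd(xs, rs, lambda x: np.array([1.0, x, x**2]), gamma=0.95)
```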
SLIDE 14

Outline

Sample Complexity of LSTD
    The Algorithm
    LSTD and LSPI Error Bounds
Sample Complexity of Fitted Q-iteration

SLIDE 15

LSTD Error Bound

When the Markov chain induced by the policy under evaluation π admits a stationary distribution ρ^π (e.g., the Markov chain is ergodic / β-mixing):

Theorem (LSTD Error Bound)
Let Ṽ be the truncated LSTD solution computed using n samples along a trajectory generated by following the policy π. Then with probability 1 − δ,
$$\|V^\pi - \tilde V\|_{\rho^\pi} \le \frac{c}{\sqrt{1-\gamma^2}}\, \inf_{f\in\mathcal F}\|V^\pi - f\|_{\rho^\pi} + O\left(\sqrt{\frac{d\,\log(d/\delta)}{n\,\nu}}\right)$$
◮ n = number of samples, d = dimension of the linear function space F
◮ ν = smallest eigenvalue of the Gram matrix (∫ φ_i φ_j dρ^π)_{i,j} (assumption: the eigenvalues of the Gram matrix are strictly positive, so that the model-based LSTD solution exists)
◮ the β-mixing coefficients are hidden in the O(·) notation

SLIDE 16

LSTD Error Bound

$$\|V^\pi - \tilde V\|_{\rho^\pi} \le \underbrace{\frac{c}{\sqrt{1-\gamma^2}}\, \inf_{f\in\mathcal F}\|V^\pi - f\|_{\rho^\pi}}_{\text{approximation error}} + \underbrace{O\left(\sqrt{\frac{d\,\log(d/\delta)}{n\,\nu}}\right)}_{\text{estimation error}}$$

◮ Approximation error: depends on how well the function space F can approximate the value function V^π
◮ Estimation error: depends on the number of samples n, the dimension of the function space d, the smallest eigenvalue of the Gram matrix ν, and the mixing properties of the Markov chain (hidden in the O(·))

SLIDE 17

LSPI Error Bound

Theorem (LSPI Error Bound)

Let V_{−1} ∈ F̃ be an arbitrary initial value function, Ṽ_0, ..., Ṽ_{K−1} be the sequence of truncated value functions generated by LSPI after K iterations, and π_K be the greedy policy w.r.t. Ṽ_{K−1}. Then with probability 1 − δ, we have
$$\|V^* - V^{\pi_K}\|_\mu \le \frac{4\gamma}{(1-\gamma)^2}\left\{\sqrt{C\,C_{\mu,\rho}}\,\Big[c\,E_0(\mathcal F) + O\Big(\sqrt{\tfrac{d\,\log(dK/\delta)}{n\,\nu_\rho}}\Big)\Big] + \gamma^{\frac{K-1}{2}}\, R_{\max}\right\}$$

SLIDE 18

LSPI Error Bound

Theorem (LSPI Error Bound)
(Statement and bound as on the previous slide.)

◮ Approximation error: E_0(F) = sup_{π∈G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}

SLIDE 19

LSPI Error Bound

Theorem (LSPI Error Bound)
(Statement and bound as above.)

◮ Approximation error: E_0(F) = sup_{π∈G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}
◮ Estimation error: depends on n, d, ν_ρ, K

SLIDE 20

LSPI Error Bound

Theorem (LSPI Error Bound)
(Statement and bound as above.)

◮ Approximation error: E_0(F) = sup_{π∈G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}
◮ Estimation error: depends on n, d, ν_ρ, K
◮ Initialization error: error due to the choice of the initial value function or initial policy, |V^* − V^{π_0}|

SLIDE 21

LSPI Error Bound

(LSPI bound as above.)

Lower-Bounding Distribution
There exists a distribution ρ such that for any policy π ∈ G(F̃), we have ρ ≤ C ρ^π, where C < ∞ is a constant and ρ^π is the stationary distribution of π. Furthermore, we can define the concentrability coefficient C_{µ,ρ} as before.

SLIDE 22

LSPI Error Bound

(LSPI bound and Lower-Bounding Distribution assumption as above.)

◮ ν_ρ = the smallest eigenvalue of the Gram matrix (∫ φ_i φ_j dρ)_{i,j}
SLIDE 23

Outline

Sample Complexity of LSTD
Sample Complexity of Fitted Q-iteration
    Error at Each Iteration
    Error Propagation
    The Final Bound

SLIDE 24

Linear Fitted Q-iteration

Input: space F, number of iterations K, sampling distribution ρ, number of samples n
Initial function Q̂_0 ∈ F
For k = 1, ..., K
◮ Draw n samples (x_i, a_i) i.i.d. from ρ
◮ Sample x'_i ∼ p(·|x_i, a_i) and r_i = r(x_i, a_i)
◮ Compute y_i = r_i + γ max_a Q̂_{k−1}(x'_i, a)
◮ Build the training set {((x_i, a_i), y_i)}_{i=1}^{n}
◮ Solve the least-squares problem
$$\hat\alpha_k = \arg\min_{f_\alpha\in\mathcal F}\; \frac{1}{n}\sum_{i=1}^{n}\big(f_\alpha(x_i,a_i) - y_i\big)^2$$
◮ Return Q̂_k = Trunc(f_{α̂_k})
Return π_K(·) = arg max_a Q̂_K(·, a) (greedy policy)
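A minimal Python sketch of this loop, under illustrative assumptions: a finite action set, helper samplers for ρ, p and r, and truncation at V_max. All helper names are hypothetical, not from the original slides.

```python
import numpy as np

def linear_fqi(phi, actions, sample_next, reward, rho_sampler,
               gamma, vmax, K, n):
    """Linear Fitted Q-iteration (sketch).

    phi         : feature map, phi(x, a) -> np.ndarray of shape (d,)
    actions     : finite list of actions
    sample_next : (x, a) -> x' drawn from p(.|x, a)
    reward      : (x, a) -> r(x, a)
    rho_sampler : n -> list of n state-action pairs drawn from rho
    """
    d = phi(*rho_sampler(1)[0]).shape[0]
    alpha = np.zeros(d)                                   # Q_hat_0 = 0 (in F)
    # Truncated Q_hat_k; reads the current alpha at call time
    q = lambda x, a: float(np.clip(phi(x, a) @ alpha, -vmax, vmax))
    for _ in range(K):
        xa = rho_sampler(n)
        Phi = np.array([phi(x, a) for x, a in xa])
        # Targets y_i = r_i + gamma * max_a Q_hat_{k-1}(x'_i, a)
        y = np.array([reward(x, a)
                      + gamma * max(q(sample_next(x, a), b) for b in actions)
                      for x, a in xa])
        # Least-squares fit of f_alpha to the targets
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    # Greedy policy w.r.t. Q_hat_K
    return lambda x: max(actions, key=lambda a: q(x, a))
```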

SLIDE 25

Theoretical Objectives

Objective 1: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution µ:
$$\|Q^* - Q^{\pi_K}\|_\mu \le\ ???$$

SLIDE 26

Outline

Sample Complexity of LSTD
Sample Complexity of Fitted Q-iteration
    Error at Each Iteration
    Error Propagation
    The Final Bound

SLIDE 27

Linear Fitted Q-iteration

(Algorithm repeated from Slide 24.)

SLIDE 28

Linear Fitted Q-iteration

(Sampling, target-building, least-squares, and truncation steps repeated from Slide 24.)

SLIDE 29

Theoretical Objectives

Target: at each iteration we want to approximate Q_k = T Q̂_{k−1}
Objective 2: derive an intermediate bound on the prediction error [random design]:
$$\|Q_k - \hat Q_k\|_\rho \le\ ???$$

SLIDE 30

Theoretical Objectives

Target: at each iteration we have samples {(x_i, a_i)}_{i=1}^{n} (drawn from ρ)
Objective 3: derive an intermediate bound on the prediction error on the samples [deterministic design]:
$$\frac{1}{n}\sum_{i=1}^{n}\big(Q_k(x_i,a_i) - \hat Q_k(x_i,a_i)\big)^2 = \|Q_k - \hat Q_k\|_{\hat\rho}^2 \le\ ???$$

SLIDE 31

Theoretical Objectives

$$\text{Obj 3: } \|Q_k - \hat Q_k\|_{\hat\rho} \le ??? \;\Rightarrow\; \text{Obj 2: } \|Q_k - \hat Q_k\|_\rho \le ??? \;\Rightarrow\; \text{Obj 1: } \|Q^* - Q^{\pi_K}\|_\mu \le ???$$

SLIDE 32

Theoretical Objectives

Returned solution:
$$f_{\hat\alpha_k} = \arg\min_{f_\alpha\in\mathcal F}\; \frac{1}{n}\sum_{i=1}^{n}\big(f_\alpha(x_i,a_i) - y_i\big)^2$$
Best solution:
$$f_{\alpha_k^*} = \arg\inf_{f_\alpha\in\mathcal F} \|f_\alpha - Q_k\|_\rho$$

SLIDE 33

Additional Notation

Given the set of inputs {(x_i, a_i)}_{i=1}^{n} drawn from ρ:
◮ Vector space F_n = {z ∈ R^n : z_i = f_α(x_i, a_i), f_α ∈ F} ⊂ R^n
◮ Empirical L_2-norm:
$$\|f_\alpha\|_{\hat\rho}^2 = \frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i,a_i)^2 = \frac{1}{n}\sum_{i=1}^{n} z_i^2 = \|z\|_n^2$$
◮ Empirical orthogonal projection:
$$\hat\Pi y = \arg\min_{z\in F_n} \|y - z\|_n$$
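In the linear case the empirical projection is just a least-squares fit on the feature matrix; a small sketch (variable names are illustrative):

```python
import numpy as np

def empirical_projection(Phi, y):
    """Project y onto F_n, the span of the columns of Phi, w.r.t. ||.||_n.

    Phi : (n, d) matrix with rows phi(x_i, a_i)
    y   : (n,) target vector
    """
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi @ alpha  # the vector in F_n closest to y
```

The vector returned by the least-squares step of LinearFQI is exactly this projection applied to the observed targets y.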

SLIDE 34

Additional Notation

◮ Target vector:
$$q_i = Q_k(x_i,a_i) = \mathcal T\hat Q_{k-1}(x_i,a_i) = r(x_i,a_i) + \gamma \int_{\mathcal X} \max_{a'} \hat Q_{k-1}(x',a')\, p(dx'|x_i,a_i)$$
◮ Observed target vector:
$$y_i = r_i + \gamma \max_{a'} \hat Q_{k-1}(x'_i, a')$$
◮ Noise vector (zero-mean and bounded):
$$\xi_i = q_i - y_i, \qquad |\xi_i| \le V_{\max}, \qquad \mathbb E[\xi_i\,|\,x_i] = 0$$

SLIDE 35

Additional Notation

[Figure: the observed vector y, the target q, and the noise ξ = q − y relative to the subspace F_n]

SLIDE 36

Additional Notation

◮ Optimal solution in F_n:
$$\hat\Pi q = \arg\min_{z\in F_n} \|q - z\|_n$$
◮ Returned vector: q̂_i = f_{α̂_k}(x_i, a_i), i.e.
$$\hat q = \hat\Pi y = \arg\min_{z\in F_n} \|y - z\|_n$$

SLIDE 37

Additional Notation

[Figure: q, y, and their projections Π̂q and q̂ = Π̂y onto the subspace F_n]
SLIDE 38

Theoretical Analysis

$$\|Q_k - f_{\hat\alpha_k}\|_{\hat\rho}^2 = \|q - \hat q\|_n^2$$

[Figure: decomposition of q − q̂ via the projection Π̂q and the projected noise ξ̂]

$$\|q - \hat q\|_n \le \|q - \hat\Pi q\|_n + \|\hat\Pi q - \hat q\|_n = \|q - \hat\Pi q\|_n + \|\hat\xi\|_n$$

SLIDE 39

Theoretical Analysis

$$\underbrace{\|q - \hat q\|_n}_{\text{prediction err.}} \le \underbrace{\|q - \hat\Pi q\|_n}_{\text{approx. err.}} + \underbrace{\|\hat\xi\|_n}_{\text{estim. err.}}$$

◮ Prediction error: distance between the learned function and the target function
◮ Approximation error: distance between the best function in F and the target function ⇒ depends on F
◮ Estimation error: distance between the best function in F and the learned function ⇒ depends on the samples

SLIDE 40

Theoretical Analysis

The projected noise ξ̂ = Π̂ξ satisfies
$$\|\hat\xi\|_n^2 = \langle\hat\xi, \hat\xi\rangle = \langle\hat\xi, \xi\rangle$$
The projected noise belongs to F_n ⇒ ∃ f_β ∈ F such that f_β(x_i, a_i) = ξ̂_i for all (x_i, a_i). By definition of the inner product,
$$\|\hat\xi\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\,\xi_i$$

SLIDE 41

Theoretical Analysis

The noise ξ has zero mean and is bounded in [−V_max, V_max]. Thus, for any fixed f_β ∈ F (the expectation is conditioned on the (x_i, a_i)):
$$\mathbb E_\xi\Big[\frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i,a_i)\,\xi_i\Big] = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i,a_i)\,\mathbb E_\xi[\xi_i] = 0$$
and, by the Cauchy-Schwarz inequality,
$$\Big(\frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i,a_i)\,\xi_i\Big)^2 \le 4V_{\max}^2\; \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i,a_i)^2 = 4V_{\max}^2\, \|f_\beta\|_{\hat\rho}^2$$
⇒ we can use concentration inequalities

SLIDE 42

Theoretical Analysis

Problem: f_β is a random function.
Solution: we need functional concentration inequalities.

SLIDE 43

Theoretical Analysis

Define the space of normalized functions
$$\mathcal G = \Big\{ g(\cdot) = \frac{f_\alpha(\cdot)}{\|f_\alpha\|_{\hat\rho}} : f_\alpha \in \mathcal F \Big\}$$
[by definition] ⇒ ∀g ∈ G, ||g||_ρ̂ ≤ 1
[F is a linear space] ⇒ V(G) = d + 1

SLIDE 44

Theoretical Analysis

Applying Pollard's inequality to the space G: for any g ∈ G,
$$\Big|\frac{1}{n}\sum_{i=1}^{n} g(x_i,a_i)\,\xi_i\Big| \le 4V_{\max}\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}}$$
with probability 1 − δ (w.r.t. the realization of the noise ξ).
SLIDE 45

Theoretical Analysis

By definition of g:
$$\Big|\frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i,a_i)\,\xi_i\Big| \le 4V_{\max}\,\|f_\alpha\|_{\hat\rho}\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}}$$
For the specific f_β equivalent to ξ̂:
$$\langle\hat\xi, \xi\rangle \le 4V_{\max}\,\|\hat\xi\|_n\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}}$$
Recalling the objective:
$$\|\hat\xi\|_n^2 \le 4V_{\max}\,\|\hat\xi\|_n\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}} \;\Rightarrow\; \|\hat\Pi q - \hat q\|_n \le 4V_{\max}\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}}$$

SLIDE 46

Theoretical Analysis

[Figure: decomposition of q − q̂ into the approximation term q − Π̂q and the projected noise ξ̂]

Theorem (see e.g. Lazaric et al., '11)
At each iteration k and given a set of state-action pairs {(x_i, a_i)}, LinearFQI returns an approximation q̂ such that
$$\|q - \hat q\|_n \le \|q - \hat\Pi q\|_n + \|\hat\Pi q - \hat q\|_n \le \|q - \hat\Pi q\|_n + O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$

SLIDE 47

Theoretical Analysis

Moving back from vectors to functions:
$$\|q - \hat q\|_n = \|Q_k - f_{\hat\alpha_k}\|_{\hat\rho}, \qquad \|q - \hat\Pi q\|_n \le \|Q_k - f_{\alpha_k^*}\|_{\hat\rho}$$
$$\Rightarrow\quad \|Q_k - f_{\hat\alpha_k}\|_{\hat\rho} \le \|Q_k - f_{\alpha_k^*}\|_{\hat\rho} + O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$

SLIDE 48

Theoretical Analysis

By definition of truncation (Q̂_k = Trunc(f_{α̂_k})):

Theorem
At each iteration k and given a set of state-action pairs {(x_i, a_i)}, LinearFQI returns an approximation Q̂_k such that (Objective 3)
$$\|Q_k - \hat Q_k\|_{\hat\rho} \le \|Q_k - f_{\hat\alpha_k}\|_{\hat\rho} \le \|Q_k - f_{\alpha_k^*}\|_{\hat\rho} + O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$

SLIDE 49

Theoretical Analysis

Remark: in order to move from Objective 3 to Objective 2 we need to move from empirical to expected L_2-norms.
Since Q̂_k is truncated, it is bounded in [−V_max, V_max]:
$$2\|Q_k - \hat Q_k\|_{\hat\rho} \ge \|Q_k - \hat Q_k\|_\rho - O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$
The best solution f_{α*_k} is a fixed function in F:
$$\|Q_k - f_{\alpha_k^*}\|_{\hat\rho} \le 2\|Q_k - f_{\alpha_k^*}\|_\rho + O\left(\big(V_{\max} + L\|\alpha_k^*\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right)$$

SLIDE 50

Theoretical Analysis

Theorem
At each iteration k, LinearFQI returns an approximation Q̂_k such that (Objective 2)
$$\|Q_k - \hat Q_k\|_\rho \le 4\|Q_k - f_{\alpha_k^*}\|_\rho + O\left(\big(V_{\max} + L\|\alpha_k^*\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right) + O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$
with probability 1 − δ.

SLIDE 51

Theoretical Analysis

(The Objective 2 bound is restated on the next slides, highlighting one term at a time.)

SLIDE 52

Theoretical Analysis

(Objective 2 bound as above; first term: the approximation error 4||Q_k − f_{α*_k}||_ρ.)

Remarks
◮ No algorithm can do better
◮ Constant 4
◮ Depends on the space F
◮ Changes with the iteration k

SLIDE 53

Theoretical Analysis

(Objective 2 bound as above; second term.)

Remarks
◮ Vanishing to zero as O(n^{−1/2})
◮ Depends on the features (L) and on the best solution (||α*_k||)

SLIDE 54

Theoretical Analysis

(Objective 2 bound as above; third term.)

Remarks
◮ Vanishing to zero as O(n^{−1/2})
◮ Depends on the dimensionality of the space (d) and the number of samples (n)

SLIDE 55

Outline

Sample Complexity of LSTD
Sample Complexity of Fitted Q-iteration
    Error at Each Iteration
    Error Propagation
    The Final Bound

SLIDE 56

Theoretical Analysis

Objective 1: bound ||Q^* − Q^{π_K}||_µ
◮ Problem 1: the test norm µ is different from the sampling norm ρ
◮ Problem 2: we have bounds for Q̂_k, not for the performance of the corresponding π_k
◮ Problem 3: we have bounds for one single iteration
SLIDE 57

Propagation of Errors

◮ Bellman operators
$$\mathcal T Q(x,a) = r(x,a) + \gamma \int_{\mathcal X} \max_{a'} Q(x', a')\, p(dx'|x,a)$$
$$\mathcal T^\pi Q(x,a) = r(x,a) + \gamma \int_{\mathcal X} Q(x', \pi(x'))\, p(dx'|x,a)$$
◮ Optimal action-value function: Q^* = T Q^*
◮ Greedy policies: π(x) = arg max_a Q(x,a) and π^*(x) = arg max_a Q^*(x,a)
◮ Prediction error: ε_k = Q_k − Q̂_k
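For a finite MDP these operators are one-liners; a minimal sketch where the array shapes and names are illustrative assumptions:

```python
import numpy as np

def bellman_optimal(Q, P, R, gamma):
    """Optimal Bellman operator T for a finite MDP.

    Q : (S, A) action-value array
    P : (S, A, S) transition kernel, P[x, a, x2] = p(x2|x, a)
    R : (S, A) reward array
    """
    return R + gamma * P @ Q.max(axis=1)

def bellman_policy(Q, P, R, gamma, pi):
    """Policy Bellman operator T^pi; pi is an (S,) array of actions."""
    S = Q.shape[0]
    return R + gamma * P @ Q[np.arange(S), pi]
```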

SLIDE 58

Propagation of Errors

Step 1: upper bound on the propagation (problem 3)
By definition T Q̂_k ≥ T^{π*} Q̂_k, and Q̂_{k+1} = T Q̂_k − ε_k, so
$$Q^* - \hat Q_{k+1} = \underbrace{\mathcal T^{\pi^*} Q^*}_{\text{fixed point}} - \mathcal T^{\pi^*}\hat Q_k + \underbrace{\mathcal T^{\pi^*}\hat Q_k - \mathcal T\hat Q_k}_{\le 0} + \epsilon_k \le \gamma P^{\pi^*}(Q^* - \hat Q_k) + \epsilon_k$$
Unrolling the recursion,
$$Q^* - \hat Q_K \le \sum_{k=0}^{K-1}\gamma^{K-k-1}\,(P^{\pi^*})^{K-k-1}\,\epsilon_k + \gamma^K (P^{\pi^*})^K (Q^* - \hat Q_0)$$

SLIDE 59

Propagation of Errors

Step 2: lower bound on the propagation (problem 3)
By definition T Q^* ≥ T^{π_k} Q^*, and T Q̂_k = T^{π_k} Q̂_k (π_k is greedy w.r.t. Q̂_k), so
$$Q^* - \hat Q_{k+1} = \underbrace{\mathcal T Q^* - \mathcal T^{\pi_k} Q^*}_{\ge 0} + \underbrace{\mathcal T^{\pi_k} Q^* - \mathcal T\hat Q_k}_{\text{greedy policy}} + \epsilon_k \ge \mathcal T^{\pi_k}Q^* - \mathcal T^{\pi_k}\hat Q_k + \epsilon_k$$
$$Q^* - \hat Q_{k+1} \ge \gamma P^{\pi_k}(Q^* - \hat Q_k) + \epsilon_k$$
$$Q^* - \hat Q_K \ge \sum_{k=0}^{K-1}\gamma^{K-k-1}\,\big(P^{\pi_{K-1}} P^{\pi_{K-2}} \cdots P^{\pi_{k+1}}\big)\,\epsilon_k$$
SLIDE 60

Propagation of Errors

Step 3: from Q̂_K to π_K (problem 2)
By definition T^{π_K} Q̂_K = T Q̂_K ≥ T^{π*} Q̂_K, so
$$Q^* - Q^{\pi_K} = \underbrace{\mathcal T^{\pi^*}Q^*}_{\text{fixed point}} - \mathcal T^{\pi^*}\hat Q_K + \underbrace{\mathcal T^{\pi^*}\hat Q_K - \mathcal T^{\pi_K}\hat Q_K}_{\le 0} + \mathcal T^{\pi_K}\hat Q_K - \underbrace{\mathcal T^{\pi_K}Q^{\pi_K}}_{\text{fixed point}}$$
$$Q^* - Q^{\pi_K} \le \gamma P^{\pi^*}(Q^* - \hat Q_K) + \gamma P^{\pi_K}\big(\hat Q_K - Q^* + Q^* - Q^{\pi_K}\big)$$
$$(I - \gamma P^{\pi_K})(Q^* - Q^{\pi_K}) \le \gamma\big(P^{\pi^*} - P^{\pi_K}\big)(Q^* - \hat Q_K)$$

SLIDE 61

Propagation of Errors

Step 3 (continued): plugging in the error propagation (problem 2)
$$Q^* - Q^{\pi_K} \le (I - \gamma P^{\pi_K})^{-1}\Big\{\sum_{k=0}^{K-1}\gamma^{K-k}\Big[(P^{\pi^*})^{K-k} - P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]\epsilon_k + \gamma^{K+1}\Big[(P^{\pi^*})^{K+1} - P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_0}\Big](Q^* - \hat Q_0)\Big\}$$

SLIDE 62

Propagation of Errors

Step 4: rewrite in compact form
$$Q^* - Q^{\pi_K} \le \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big[\sum_{k=0}^{K-1}\alpha_k A_k|\epsilon_k| + \alpha_K A_K |Q^* - \hat Q_0|\Big]$$
◮ α_k: weights (Σ_k α_k = 1)
◮ A_k: summarize the P^{π_i} terms

SLIDE 63

Propagation of Errors

Step 5: take the norm w.r.t. the test distribution µ
$$\|Q^* - Q^{\pi_K}\|_\mu^2 = \int \mu(dx,da)\,\big(Q^*(x,a) - Q^{\pi_K}(x,a)\big)^2$$
$$\le \Big[\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big]^2 \int \mu(dx,da)\,\Big[\sum_{k=0}^{K-1}\alpha_k A_k|\epsilon_k| + \alpha_K A_K|Q^* - \hat Q_0|\Big]^2(x,a)$$
$$\le \Big[\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big]^2 \int \mu(dx,da)\,\Big[\sum_{k=0}^{K-1}\alpha_k A_k\,\epsilon_k^2 + \alpha_K A_K\,(Q^* - \hat Q_0)^2\Big](x,a)$$
(the last step uses Jensen's inequality, since the weights α_k sum to 1)
SLIDE 64

Propagation of Errors

Focusing on one single term:
$$\mu A_k = \frac{1-\gamma}{2}\,\mu\,(I-\gamma P^{\pi_K})^{-1}\Big[(P^{\pi^*})^{K-k} + P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]$$
$$= \frac{1-\gamma}{2}\Big[\sum_{m\ge 0}\gamma^m\,\mu\,(P^{\pi_K})^m (P^{\pi^*})^{K-k} + \sum_{m\ge 0}\gamma^m\,\mu\,(P^{\pi_K})^m P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]$$

SLIDE 65

Propagation of Errors

Assumption: concentrability terms
$$c(m) = \sup_{\pi_1\dots\pi_m}\Big\|\frac{d(\mu P^{\pi_1}\cdots P^{\pi_m})}{d\rho}\Big\|_\infty, \qquad C_{\mu,\rho} = (1-\gamma)^2\sum_{m\ge 1} m\,\gamma^{m-1}\,c(m) < +\infty$$
Remark: related to the top-Lyapunov exponent ⇒ C_{µ,ρ} < ∞ is a weak stability condition.

SLIDE 66

Propagation of Errors

Step 5 (continued): take the norm w.r.t. the test distribution µ
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big]^2\Big[\sum_{k=0}^{K-1}\alpha_k(1-\gamma)\sum_{m\ge 0}\gamma^m\,c(m+K-k)\,\|\epsilon_k\|_\rho^2 + \alpha_K (2V_{\max})^2\Big]$$

SLIDE 67

Propagation of Errors

Step 5 (conclusion): take the norm w.r.t. the test distribution µ (problem 1)
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma}{(1-\gamma)^2}\Big]^2\, C_{\mu,\rho}\,\max_k\|\epsilon_k\|_\rho^2 + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\Big)$$

SLIDE 68

Outline

Sample Complexity of LSTD
Sample Complexity of Fitted Q-iteration
    Error at Each Iteration
    Error Propagation
    The Final Bound

SLIDE 69

Plugging the Per–Iteration Error

$$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma}{(1-\gamma)^2}\Big]^2\, C_{\mu,\rho}\,\max_k\|\epsilon_k\|_\rho^2 + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\Big)$$
where
$$\|\epsilon_k\|_\rho = \|Q_k - \hat Q_k\|_\rho \le 4\|Q_k - f_{\alpha_k^*}\|_\rho + O\Big(\big(V_{\max}+L\|\alpha_k^*\|\big)\sqrt{\tfrac{\log 1/\delta}{n}}\Big) + O\Big(V_{\max}\sqrt{\tfrac{d\,\log n/\delta}{n}}\Big)$$

SLIDE 70

Plugging the Per–Iteration Error

The inherent Bellman error:
$$\|Q_k - f_{\alpha_k^*}\|_\rho = \inf_{f\in\mathcal F}\|Q_k - f\|_\rho = \inf_{f\in\mathcal F}\|\mathcal T\hat Q_{k-1} - f\|_\rho \le \inf_{f\in\mathcal F}\|\mathcal T f_{\alpha_{k-1}} - f\|_\rho \le \sup_{g\in\mathcal F}\,\inf_{f\in\mathcal F}\|\mathcal T g - f\|_\rho = d(\mathcal F, \mathcal T\mathcal F)$$

SLIDE 71

Plugging the Per–Iteration Error

f_{α*_k} is the orthogonal projection of Q_k onto F w.r.t. ρ:
$$\|f_{\alpha_k^*}\|_\rho \le \|Q_k\|_\rho = \|\mathcal T\hat Q_{k-1}\|_\rho \le \|\mathcal T\hat Q_{k-1}\|_\infty \le V_{\max}$$

SLIDE 72

Plugging the Per–Iteration Error

Gram matrix G_{ij} = E_{(x,a)∼ρ}[φ_i(x,a) φ_j(x,a)], with smallest eigenvalue ω:
$$\|f_\alpha\|_\rho^2 = \|\varphi^\top\alpha\|_\rho^2 = \alpha^\top G\,\alpha \ge \omega\,\alpha^\top\alpha = \omega\,\|\alpha\|^2$$
$$\max_k\|\alpha_k^*\| \le \max_k \frac{\|f_{\alpha_k^*}\|_\rho}{\sqrt\omega} \le \frac{V_{\max}}{\sqrt\omega}$$
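A quick numerical sanity check of the inequality α⊤Gα ≥ ω||α||²; the features and the sampling distribution are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x = rng.normal(size=m)                       # states drawn from rho = N(0, 1)
Phi = np.stack([np.ones(m), np.sin(x), np.cos(x)], axis=1)   # d = 3 features
G = Phi.T @ Phi / m                          # Monte Carlo Gram matrix
omega = np.linalg.eigvalsh(G).min()          # smallest eigenvalue

alpha = rng.normal(size=3)
assert alpha @ G @ alpha >= omega * (alpha @ alpha)
print(f"omega = {omega:.3f}")
```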

SLIDE 73

The Final Bound

Theorem (see e.g., Munos, '03)
LinearFQI with a space F of d features and n samples at each iteration returns a policy π_K after K iterations such that
$$\|Q^* - Q^{\pi_K}\|_\mu \le \frac{2\gamma}{(1-\gamma)^2}\sqrt{C_{\mu,\rho}}\Big[4\, d(\mathcal F, \mathcal T\mathcal F) + O\Big(V_{\max}\big(1 + \tfrac{L}{\sqrt{\omega}}\big)\sqrt{\tfrac{d\,\log n/\delta}{n}}\Big)\Big] + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\, V_{\max}^2\Big)$$

SLIDE 74

The Final Bound

Theorem (final bound, as on the previous slide)

The propagation (and the change of norms) makes the problem more complex
⇒ how do we choose the sampling distribution?

SLIDE 75

The Final Bound

Theorem (final bound, as above)

The approximation error is worse than in regression
⇒ how do we adapt to the Bellman operator?

SLIDE 76

The Final Bound

Theorem (final bound, as above)

The dependency on γ is worse than at each iteration
⇒ is it possible to avoid it?

SLIDE 77

The Final Bound

Theorem (final bound, as above)

The error decreases exponentially in K
⇒ K ≈ log(1/ǫ)/(1 − γ) iterations suffice for a target accuracy ǫ
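A one-line derivation of this choice of K, assuming the goal is to drive the last term of the bound below a target accuracy ǫ:

$$\gamma^K \le \epsilon \iff K \ge \frac{\log(1/\epsilon)}{\log(1/\gamma)} \approx \frac{\log(1/\epsilon)}{1-\gamma}, \qquad \text{since } \log(1/\gamma) \approx 1-\gamma \text{ for } \gamma \text{ close to } 1.$$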

SLIDE 78

The Final Bound

Theorem (final bound, as above)

The smallest eigenvalue ω of the Gram matrix enters the bound
⇒ design the features so as to be orthogonal w.r.t. ρ

SLIDE 79

The Final Bound

Theorem (final bound, as above)

The asymptotic rate O(d/n) is the same as for regression
SLIDE 80

Summary

◮ At each iteration FQI solves a regression problem ⇒ least-squares prediction error bound
◮ The error is propagated through iterations ⇒ propagation of any error

SLIDE 81

Bibliography I

SLIDE 82

Reinforcement Learning

Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr