SLIDE 1

Finite-Sample Analysis in Reinforcement Learning

Mohammad Ghavamzadeh

INRIA Lille – Nord Europe, Team SequeL

SLIDE 2

Outline

1. Introduction to RL and DP
2. Approximate Dynamic Programming (AVI & API)
3. How does Statistical Learning Theory come into the picture?
4. Error Propagation (AVI & API error propagation)
5. An AVI Algorithm (Fitted Q-Iteration)
   - FQI: error at each iteration
   - Final performance bound of FQI
6. An API Algorithm (Least-Squares Policy Iteration)
   - Error at each iteration (LSTD error)
   - Final performance bound of LSPI
7. Discussion

SLIDE 3

Sequential Decision-Making under Uncertainty

Examples:
- Move around in the physical world (e.g. driving, navigation)
- Play and win a game
- Retrieve information over the web
- Medical diagnosis and treatment
- Maximize the throughput of a factory
- Optimize the performance of a rescue team

SLIDE 4

Reinforcement Learning (RL)

RL: a class of learning problems in which an agent interacts with a dynamic, stochastic, and incompletely known environment.

Goal: learn an action-selection strategy, or policy, that optimizes some measure of the agent's long-term performance.

Interaction: modeled as an MDP or a POMDP.

SLIDE 5

Markov Decision Process

An MDP M is a tuple ⟨X, A, r, p, γ⟩, where
- the state space X is a bounded closed subset of R^d,
- the set of actions A is finite (|A| < ∞),
- the reward function r : X × A → R is bounded by R_max,
- the transition model p(·|x, a) is a distribution over X,
- γ ∈ (0, 1) is a discount factor.

Policy: a mapping from states to actions, π(x) ∈ A.
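For concreteness, here is a minimal sketch of how a finite-state, finite-action instance of such an MDP could be represented in code; the container, field names, and the exact policy-evaluation helper are illustrative assumptions, not part of the talk.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMDP:
    """A finite MDP (X, A, r, p, gamma) with |X| states and |A| actions (illustrative)."""
    P: np.ndarray      # transition model, shape (|A|, |X|, |X|); P[a, x, y] = p(y | x, a)
    R: np.ndarray      # reward function, shape (|X|, |A|), bounded by R_max
    gamma: float       # discount factor in (0, 1)

    def value_of_policy(self, pi: np.ndarray) -> np.ndarray:
        """Exact V^pi for a deterministic policy pi (array of action indices),
        obtained by solving the linear system (I - gamma * P^pi) V = r^pi."""
        n = self.R.shape[0]
        P_pi = self.P[pi, np.arange(n), :]    # P^pi[x, y] = p(y | x, pi(x))
        r_pi = self.R[np.arange(n), pi]       # r^pi[x] = r(x, pi(x))
        return np.linalg.solve(np.eye(n) - self.gamma * P_pi, r_pi)
```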

SLIDE 6

Value Function

For a policy π:

Value function V^π : X → R
  V^π(x) = E[ ∑_{t=0}^∞ γ^t r(X_t, π(X_t)) | X_0 = x ]

Action-value function Q^π : X × A → R
  Q^π(x, a) = E[ ∑_{t=0}^∞ γ^t r(X_t, A_t) | X_0 = x, A_0 = a ]

SLIDE 7

Notation

Bellman Operator

Bellman operator for policy π, T^π : B_V(X; V_max) → B_V(X; V_max):
  (T^π V)(x) = r(x, π(x)) + γ ∫_X p(dy | x, π(x)) V(y)

V^π is the unique fixed point of the Bellman operator.

The action-value function Q^π is defined as
  Q^π(x, a) = r(x, a) + γ ∫_X p(dy | x, a) V^π(y)

SLIDE 8

Optimal Value Function and Optimal Policy

Optimal value function:
  V*(x) = sup_π V^π(x)   ∀x ∈ X

Optimal action-value function:
  Q*(x, a) = sup_π Q^π(x, a)   ∀x ∈ X, ∀a ∈ A

A policy π is optimal if V^π(x) = V*(x) for all x ∈ X.

SLIDE 9

Notation

Bellman Optimality Operator

Bellman optimality operator T : B_V(X; V_max) → B_V(X; V_max):
  (T V)(x) = max_{a∈A} [ r(x, a) + γ ∫_X p(dy | x, a) V(y) ]

V* is the unique fixed point of the Bellman optimality operator.

The optimal action-value function Q* is defined as
  Q*(x, a) = r(x, a) + γ ∫_X p(dy | x, a) V*(y)

SLIDE 10

Properties of Bellman Operators

Monotonicity: if V_1 ≤ V_2 component-wise, then T^π V_1 ≤ T^π V_2 and T V_1 ≤ T V_2.

Max-norm contraction: for all V_1, V_2 ∈ B_V(X; V_max),
  ||T^π V_1 − T^π V_2||_∞ ≤ γ ||V_1 − V_2||_∞
  ||T V_1 − T V_2||_∞ ≤ γ ||V_1 − V_2||_∞

SLIDE 11

Dynamic Programming Algorithms

Value Iteration
- start with an arbitrary value function V_0 (or action-value function Q_0)
- at each iteration k: V_{k+1} = T V_k (resp. Q_{k+1} = T Q_k)

Convergence: lim_{k→∞} V_k = V*, since
  ||V* − V_{k+1}||_∞ = ||T V* − T V_k||_∞ ≤ γ ||V* − V_k||_∞ ≤ γ^{k+1} ||V* − V_0||_∞ → 0 as k → ∞.
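As an illustration, here is a minimal sketch of tabular value iteration, V_{k+1} = T V_k, on a toy MDP; the arrays P, R, gamma and the 2-state, 2-action numbers are purely illustrative, not taken from the talk.

```python
import numpy as np

# Toy MDP (illustrative numbers): P[a, x, y] = p(y | x, a), R[x, a] = r(x, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)                               # arbitrary V_0
for k in range(1000):
    # (T V)(x) = max_a [ r(x, a) + gamma * sum_y p(y | x, a) V(y) ]
    Q = R + gamma * np.einsum('axy,y->xa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:     # stop once the sup-norm change is tiny
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)              # greedy policy w.r.t. the final V
print(V, greedy_policy)
```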

SLIDE 12

Dynamic Programming Algorithms

Policy Iteration
- start with an arbitrary policy π_0
- at each iteration k:
  - Policy evaluation: compute Q^{π_k}
  - Policy improvement: compute the greedy policy w.r.t. Q^{π_k},
      π_{k+1}(x) = (G π_k)(x) = argmax_{a∈A} Q^{π_k}(x, a)

Convergence: PI generates a sequence of policies with increasing performance (V^{π_{k+1}} ≥ V^{π_k}) and stops after a finite number of iterations with the optimal policy π*, since
  V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}.
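As a companion to the previous sketch, here is a minimal exact policy iteration loop on the same kind of toy tabular MDP; again, the arrays and numbers are illustrative assumptions rather than material from the talk.

```python
import numpy as np

# Toy MDP (illustrative numbers): P[a, x, y] = p(y | x, a), R[x, a] = r(x, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
gamma = 0.9
n_states = R.shape[0]

pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy pi_0
while True:
    # Policy evaluation: solve (I - gamma * P^pi) V = r^pi exactly.
    P_pi = P[pi, np.arange(n_states), :]
    r_pi = R[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: greedy policy w.r.t. Q^{pi_k}.
    Q = R + gamma * np.einsum('axy,y->xa', P, V)
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):            # PI stops after finitely many iterations
        break
    pi = pi_new

print(pi, V)
```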

SLIDE 13

Approximate Dynamic Programming

SLIDE 14

Approximate Dynamic Programming Algorithms

Value Iteration
- start with an arbitrary action-value function Q_0
- at each iteration k: Q_{k+1} = T Q_k

What if Q_{k+1} ≈ T Q_k? Does ||Q* − Q_{k+1}|| ≤ γ ||Q* − Q_k|| still hold?

SLIDE 15

Approximate Dynamic Programming Algorithms

Policy Iteration
- start with an arbitrary policy π_0
- at each iteration k:
  - Policy evaluation: compute Q^{π_k}
  - Policy improvement: compute the greedy policy w.r.t. Q^{π_k},
      π_{k+1}(x) = (G π_k)(x) = argmax_{a∈A} Q^{π_k}(x, a)

What if we cannot compute Q^{π_k} exactly and compute an approximation Q̂^{π_k} ≈ Q^{π_k} instead?
  π_{k+1}(x) = argmax_{a∈A} Q̂^{π_k}(x, a) = (Ĝ π_k)(x)   −→   does V^{π_{k+1}} ≥ V^{π_k} still hold?

SLIDE 16

Statistical Learning Theory in RL & ADP

Approximate Value Iteration (AVI): Q_{k+1} ≈ T Q_k

Finding a function that best approximates T Q_k:
  Q̃ = argmin_f ||f − T Q_k||_μ

Only noisy observations T̂ Q_k of T Q_k are available:
  target function = T Q_k,   noisy observation = T̂ Q_k

We minimize the empirical error
  Q_{k+1} = Q̂ = argmin_f ||f − T̂ Q_k||_μ̂
with the target of minimizing the true error
  Q̃ = argmin_f ||f − T Q_k||_μ

Objective: keep
  ||Q̂ − T Q_k||_μ ≤ ||Q̂ − Q̃||_μ (estimation error) + ||Q̃ − T Q_k||_μ (approximation error)
small  →  a regression problem.

SLIDE 22

Statistical Learning Theory in RL & ADP

Approximate Policy Iteration (API) - policy evaluation

Finding a function that best approximates Q^{π_k}:
  Q̃ = argmin_f ||f − Q^{π_k}||_μ

Only noisy observations Q̂^{π_k} of Q^{π_k} are available:
  target function = Q^{π_k},   noisy observation = Q̂^{π_k}

We minimize the empirical error
  Q̂ = argmin_f ||f − Q̂^{π_k}||_μ̂
with the target of minimizing the true error
  Q̃ = argmin_f ||f − Q^{π_k}||_μ

Objective: keep
  ||Q̂ − Q^{π_k}||_μ ≤ ||Q̂ − Q̃||_μ (estimation error) + ||Q̃ − Q^{π_k}||_μ (approximation error)
small  →  a regression problem.

SLIDE 23

Statistical Learning Theory in RL & ADP

Approximate Policy Iteration (API): π_{k+1} ≈ G π_k

Finding a policy that best approximates G π_k:
  π̃ = argmin_f L(f, π_k; μ)

We minimize the empirical error
  π_{k+1} = π̂ = argmin_f L̂(f, π_k; μ̂)
with the target of minimizing the true error
  π̃ = argmin_f L(f, π_k; μ)

Objective: keep
  L(π̂, π_k; μ) ≤ L(π̂, π̃; μ) (estimation error) + L(π̃, π_k; μ) (approximation error)
small  →  a classification problem (we do not discuss it in this talk).

SLIDE 24

Statistical Learning Theory in RL & ADP

Approximate Policy Iteration (API) - policy evaluation

Finding the fixed point of T^{π_k}, when only noisy observations T̂^{π_k} of T^{π_k} are available: a fixed-point problem.

SLIDE 25

SLT in RL & ADP

- Supervised learning methods (regression, classification) appear in the inner loop of ADP algorithms (performance at each iteration).
- Tools from SLT used to analyze supervised learning methods can therefore be used in RL and ADP (e.g., how many samples are required to achieve a certain performance).

What makes RL more challenging?
- The objective is not always to recover a target function from its noisy observations (fixed-point vs. regression).
- The target sometimes has to be approximated from sample trajectories (non-i.i.d. samples).
- Propagation of error (control problem).

Is there any hope?

SLIDE 26

Approximate Value Iteration (AVI)

V_{k+1} = T V_k + ε_k,   or   ||V_{k+1} − T V_k||_∞ = ε_k

Proposition (AVI Error Propagation)
Run AVI for K iterations and let π_K = G V_K. Then
  ||V* − V^{π_K}||_∞ ≤ (2γ / (1 − γ)^2) max_{0≤k<K} ε_k + (2γ^{K+1} / (1 − γ)) ||V* − V_0||_∞.

Proof
  ||V* − V_{k+1}||_∞ ≤ ||T V* − T V_k||_∞ + ||T V_k − V_{k+1}||_∞ = γ ||V* − V_k||_∞ + ε_k,
so
  ||V* − V_K||_∞ ≤ ∑_{k=0}^{K−1} γ^{K−1−k} ε_k + γ^K ||V* − V_0||_∞ ≤ (1 / (1 − γ)) max_{0≤k<K} ε_k + γ^K ||V* − V_0||_∞.
The result follows from ||V* − V^{π_K}||_∞ ≤ (2γ / (1 − γ)) ||V* − V_K||_∞.

SLIDE 27

Approximate Policy Iteration (API)

1. V_k = V^{π_k} + ε_k,   or   ||V_k − V^{π_k}||_∞ = ε_k   (policy evaluation error)
2. V_k = T^{π_k} V_k + ε_k,   or   ||V_k − T^{π_k} V_k||_∞ = ε_k   (Bellman residual)

Proposition (API Asymptotic Performance)
(1) limsup_{k→∞} ||V* − V^{π_k}||_∞ ≤ (2γ / (1 − γ)^2) limsup_{k→∞} ||V_k − V^{π_k}||_∞
(2) limsup_{k→∞} ||V* − V^{π_k}||_∞ ≤ (2γ / (1 − γ)^2) limsup_{k→∞} ||V_k − T^{π_k} V_k||_∞
SLIDE 28

Approximate Dynamic Programming (ADP)

Proposition (AVI Asymptotic Performance)
  limsup_{k→∞} ||V* − V^{π_k}||_∞ ≤ (2γ / (1 − γ)^2) limsup_{k→∞} ||V_{k+1} − T V_k||_∞

Proposition (API Asymptotic Performance)
(1) limsup_{k→∞} ||V* − V^{π_k}||_∞ ≤ (2γ / (1 − γ)^2) limsup_{k→∞} ||V_k − V^{π_k}||_∞
(2) limsup_{k→∞} ||V* − V^{π_k}||_∞ ≤ (2γ / (1 − γ)^2) limsup_{k→∞} ||V_k − T^{π_k} V_k||_∞
SLIDE 29

Error Propagation

SLIDE 30

AVI Error Propagation

Error at each iteration k: ε_k = T V_k − V_{k+1}. π_K is the greedy policy w.r.t. V_{K−1}: π_K = G(V_{K−1}).

Proposition (AVI Pointwise Error Bound)
  V* − V^{π_K} ≤ (I − γ P^{π_K})^{−1} { ∑_{k=0}^{K−1} γ^{K−k} [ (P^{π*})^{K−k} + P^{π_K} P^{π_{K−1}} ⋯ P^{π_{k+1}} ] |ε_k|
                                        + γ^{K+1} [ (P^{π*})^{K+1} + P^{π_K} P^{π_{K−1}} ⋯ P^{π_0} ] |V* − V_0| }
SLIDE 31

AVI Error Propagation

Proposition (AVI Lp Error Bound)   ε_k = T V_k − V_{k+1}
  ||V* − V^{π_K}||_{p,ρ} ≤ (2γ / (1 − γ)^2) [ C_{ρ,μ}^{1/p} max_{0≤k<K} ||ε_k||_{p,μ} + 2γ^{K/p} V_max ]   (A1)
  ||V* − V^{π_K}||_∞   ≤ (2γ / (1 − γ)^2) [ C_μ^{1/p}   max_{0≤k<K} ||ε_k||_{p,μ} + 2γ^{K/p} V_max ]   (A2)

- ||ε_k||_{p,μ}: error at each iteration k, with ε_k = T V_k − V_{k+1}
- 2γ^{K/p} V_max: initialization error |V* − V_0|
- C_{ρ,μ}, C_μ: concentrability coefficients; the final performance is evaluated w.r.t. a measure ρ ≠ μ

SLIDE 35

AVI Error Propagation (Concentrability Coefficients)

The final performance is evaluated w.r.t. a measure ρ ≠ μ: ||V* − V^{π_K}||_{p,ρ}.

Assumption 1 (Uniformly Stochastic Transitions)
For all x ∈ X and a ∈ A, there exists a constant C_μ < ∞ such that p(·|x, a) ≤ C_μ μ(·).

Assumption 2 (Discounted-Average Concentrability of Future-State Distribution)
For any sequence of policies {π_m}_{m≥1}, there exists a constant c_{ρ,μ}(m) < ∞ such that ρ P^{π_1} P^{π_2} ⋯ P^{π_m} ≤ c_{ρ,μ}(m) μ. We define
  C_{ρ,μ} = (1 − γ)^2 ∑_{m≥1} m γ^{m−1} c_{ρ,μ}(m).
Note that C_{ρ,μ} ≤ C_μ.
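The last inequality follows from a short calculation; here is one way to see it (a sketch, assuming Assumption 1 holds, so that every c_{ρ,μ}(m) can be bounded by C_μ):

```latex
% Why C_{\rho,\mu} \le C_\mu under Assumption 1: since p(\cdot|x,a) \le C_\mu \mu(\cdot),
% the last transition already bounds the future-state distribution, so
%   \rho P^{\pi_1} \cdots P^{\pi_m} \le C_\mu \mu, \quad \text{hence } c_{\rho,\mu}(m) \le C_\mu .
\begin{aligned}
C_{\rho,\mu}
  &= (1-\gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} c_{\rho,\mu}(m)
   \;\le\; (1-\gamma)^2 \, C_\mu \sum_{m \ge 1} m \gamma^{m-1}
   \;=\; (1-\gamma)^2 \, C_\mu \, \frac{1}{(1-\gamma)^2}
   \;=\; C_\mu .
\end{aligned}
```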

SLIDE 36

API Error Propagation

Error at each iteration k: ε_k = V_k − T^{π_k} V_k. π_K is the greedy policy w.r.t. V_{K−1}: π_K = G(V_{K−1}).

Proposition (API Pointwise Error Bound)
  V* − V^{π_K} ≤ γ ∑_{k=0}^{K−1} (γ P^{π*})^{K−k−1} E_k |ε_k| + (γ P^{π*})^K |V* − V^{π_0}|,
where E_k = P^{π_{k+1}} (I − γ P^{π_{k+1}})^{−1} − P^{π*} (I − γ P^{π_k})^{−1}.

SLIDE 37

API Error Propagation

Proposition (API Lp Error Bound)   ε_k = V_k − T^{π_k} V_k
  ||V* − V^{π_K}||_{p,ρ} ≤ (2γ / (1 − γ)^2) [ C_{ρ,μ}^{1/p} max_{0≤k<K} ||ε_k||_{p,μ} + 2γ^{K/p} V_max ]   (A1)
  ||V* − V^{π_K}||_∞   ≤ (2γ / (1 − γ)^2) [ C_μ^{1/p}   max_{0≤k<K} ||ε_k||_{p,μ} + 2γ^{K/p} V_max ]   (A2)

- ||ε_k||_{p,μ}: error at each iteration k, with ε_k = V_k − T^{π_k} V_k
- 2γ^{K/p} V_max: initialization error |V* − V^{π_0}|
- C_{ρ,μ}, C_μ: concentrability coefficients; the final performance is evaluated w.r.t. a measure ρ ≠ μ
SLIDE 39

API Error Propagation (Concentrability Coefficients)

The final performance is evaluated w.r.t. a measure ρ ≠ μ: ||V* − V^{π_K}||_{p,ρ}.

Assumption 1 (Uniformly Stochastic Transitions)
For all x ∈ X and a ∈ A, there exists a constant C_μ < ∞ such that p(·|x, a) ≤ C_μ μ(·).

Assumption 2 (Discounted-Average Concentrability of Future-State Distribution)
For any policy π and any non-negative integers s and t, there exists a constant c_{ρ,μ}(s, t) < ∞ such that ρ (P^{π*})^s (P^π)^t ≤ c_{ρ,μ}(s, t) μ. We define
  C_{ρ,μ} = (1 − γ)^2 ∑_{s=0}^∞ ∑_{t=0}^∞ γ^{s+t} c_{ρ,μ}(s, t).
Note that C_{ρ,μ} ≤ C_μ.

SLIDE 40

Finite-Sample Performance Bound of an AVI Algorithm

SLIDE 41

Approximate Value Iteration (AVI)

If F is a function space, then V_{k+1} can be defined as
  V_{k+1} = argmin_{V∈F} ||V − T V_k||_?  =  Π_? T V_k
(the projection of T V_k onto F according to the norm L_?).

If V_{k+1} = Π_∞ T V_k, then AVI converges to the unique fixed point of Π_∞ T, i.e., the V ∈ F such that V = Π_∞ T V
(T is a contraction in L_∞-norm and Π_∞ is non-expansive).

If we consider another norm, e.g. L_2(μ), then AVI does not necessarily converge
(Π_{2,μ} T is not necessarily a contraction).

SLIDE 42

An Approximate Value Iteration Algorithm

Linear function space
  F = { f : f(·) = ∑_{j=1}^d α_j φ_j(·) },   {φ_j}_{j=1}^d ∈ B(X × A; L),
  φ : X × A → R^d,   φ(·) = ( φ_1(·), …, φ_d(·) )

Fitted Q-Iteration (FQI)

At each iteration k:
- Generate N samples of the form (X_i, A_i, X'_i, R_i), where (X_i, A_i) ∼ μ, X'_i ∼ p(·|X_i, A_i), R_i ∼ r(X_i, A_i).
- Build the training set D_k = { ((X_i, A_i), T̂ Q_k(X_i, A_i)) }_{i=1}^N, where
    T̂ Q_k(X_i, A_i) = R_i + γ max_{a∈A} Q_k(X'_i, a).
- Solve the regression problem
    Q_{k+1} = argmin_{f∈F} ||f − T̂ Q_k||_N^2 = argmin_{f∈F} (1/N) ∑_{i=1}^N [ f(X_i, A_i) − T̂ Q_k(X_i, A_i) ]^2.
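To make the loop concrete, here is a minimal sketch of one possible FQI implementation with a linear-in-features Q-function and a finite action set. The function name, sample format, and feature map are illustrative assumptions, and for simplicity the same batch is reused across iterations, whereas the slide regenerates N samples at each iteration.

```python
import numpy as np

def fitted_q_iteration(sample_batch, phi, n_actions, gamma, n_iterations, d):
    """A minimal FQI sketch (illustrative, not the talk's reference code).

    sample_batch : list of (x, a, x_next, r) tuples with (x, a) drawn from mu
    phi          : feature map phi(x, a) -> np.ndarray of length d
    Returns the final weight vector alpha, with Q(x, a) ~= phi(x, a) . alpha
    """
    alpha = np.zeros(d)                                            # Q_0 = 0
    Phi = np.array([phi(x, a) for (x, a, _, _) in sample_batch])   # N x d design matrix

    for _ in range(n_iterations):
        # Empirical Bellman targets: T_hat Q_k(X_i, A_i) = R_i + gamma * max_a Q_k(X'_i, a)
        targets = np.array([
            r + gamma * max(phi(x_next, a).dot(alpha) for a in range(n_actions))
            for (_, _, x_next, r) in sample_batch
        ])
        # Linear least-squares regression of the targets onto the features
        alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

    return alpha
```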

SLIDE 43

FQI - Error at Each Iteration

Theorem (FQI - Error at Iteration k)
Let F be a d-dimensional linear space, let D_k = { (X_i, A_i, X'_i, R_i) }_{i=1}^N with (X_i, A_i) iid ∼ μ, X'_i ∼ p(·|X_i, A_i), R_i = r(X_i, A_i) be the training set, and let Q̃ be the truncated solution at the k-th iteration of FQI. Then with probability 1 − δ,
  ||Q̃ − T Q_k||_μ ≤ 4 inf_{f∈F} ||f − T Q_k||_μ + O( ||α*_k|| √(log(1/δ) / N) ) + O( √(d log(N/δ) / N) ).

Note that Q_{k+1} = Q̃.
- N = number of samples, d = dimension of the linear function space F.
- α*_k is the coefficient vector of f_{α*_k} = Π_{2,μ} T Q_k, the best approximation of T Q_k in F w.r.t. μ.


SLIDE 46

FQI - Error at Each Iteration

FQI - Error at Iteration k
  ||Q̃ − T Q_k||_μ ≤ 4 inf_{f∈F} ||f − T Q_k||_μ   (approximation error)
                    + O( ||α*_k|| √(log(1/δ) / N) ) + O( √(d log(N/δ) / N) )   (estimation error)

Approximation error: depends on how well the function space F can approximate T Q_k.

Estimation error: depends on the number of samples N, the dimension d of the function space, and ||α*_k||.

SLIDE 47

FQI Error Bound

Theorem (FQI Error Bound)
Let Q_{−1} ∈ F be an arbitrary initial action-value function, Q_0, …, Q_{K−1} be the sequence of truncated action-value functions generated by FQI after K iterations, and π_K be the greedy policy w.r.t. Q_{K−1}. Then with probability 1 − δ,
  ||V* − V^{π_K}||_ρ ≤ (2γ / (1 − γ)^2) [ √(C_{ρ,μ}) ( d_μ(T F, F) + O( Q_max √(log(K/δ) / (N ν_μ)) ) + O( √(d log(NK/δ) / N) ) ) + 2γ^{K/2} Q_max ]

- Approximation error: d_μ(T F, F) = sup_{f∈F} inf_{g∈F} ||g − T f||_μ
- Estimation error: depends on N, d, ν_μ, K. Note that ||α*_k|| ≤ Q_max / √ν_μ, where ν_μ is the smallest eigenvalue of the Gram matrix ( ∫ φ_i φ_j dμ )_{i,j}.
- Initialization error: error due to the choice of the initial action-value function, |Q* − Q_0|.


SLIDE 51

Finite-Sample Performance Bound of an API Algorithm

SLIDE 52

Least-Squares Temporal-Difference Learning (LSTD)

Linear function space
  F = { f : f(·) = ∑_{j=1}^d α_j φ_j(·) },   {φ_j}_{j=1}^d ∈ B(X; L),
  φ : X → R^d,   φ(·) = ( φ_1(·), …, φ_d(·) )

- V^π is the fixed point of T^π: T^π V^π = V^π.
- V^π may not belong to F: V^π ∉ F.
- LSTD searches for the fixed point of Π_? T^π instead (Π_? is the projection onto F w.r.t. the L_?-norm).
- Π_∞ T^π is a contraction in L_∞-norm, but the L_∞-projection is numerically expensive when the number of states is large or infinite.
- LSTD therefore searches for the fixed point of Π_{2,μ} T^π, where Π_{2,μ} g = argmin_{f∈F} ||f − g||_{2,μ}.
SLIDE 53

Least-Squares Temporal-Difference Learning (LSTD)

When the fixed point of Π_μ T^π exists, we call it the LSTD solution:
  V_TD = Π_μ T^π V_TD

[Figure: geometric picture of F, V^π, Π_μ V^π, T^π V_TD, and the LSTD solution V_TD = Π_μ T^π V_TD]

  ⟨T^π V_TD − V_TD, φ_i⟩_μ = 0,   i = 1, …, d
  ⟨r^π + γ P^π V_TD − V_TD, φ_i⟩_μ = 0
  ⟨r^π, φ_i⟩_μ  −  ∑_{j=1}^d ⟨φ_j − γ P^π φ_j, φ_i⟩_μ · α_TD^{(j)} = 0,   with b_i = ⟨r^π, φ_i⟩_μ and A_ij = ⟨φ_j − γ P^π φ_j, φ_i⟩_μ
  −→  A α_TD = b

In general, Π_μ T^π is not a contraction and does not have a fixed point. If μ = μ^π, the stationary distribution of π, then Π_{μ^π} T^π has a unique fixed point.

SLIDE 54

LSTD Algorithm

Proposition (LSTD Performance)
  ||V^π − V_TD||_{μ^π} ≤ (1 / √(1 − γ^2)) inf_{V∈F} ||V^π − V||_{μ^π}

LSTD Algorithm
- We observe a trajectory generated by following the policy π: (X_0, R_0, X_1, R_1, …, X_N), where X_{t+1} ∼ p(·|X_t, π(X_t)) and R_t = r(X_t, π(X_t)).
- We build estimators of the matrix A and the vector b:
    Â_ij = (1/N) ∑_{t=0}^{N−1} φ_i(X_t) [ φ_j(X_t) − γ φ_j(X_{t+1}) ],
    b̂_i  = (1/N) ∑_{t=0}^{N−1} φ_i(X_t) R_t
- We solve Â α̂_TD = b̂ and set V̂_TD(·) = φ(·)^⊤ α̂_TD.

When N → ∞, Â → A and b̂ → b, and thus α̂_TD → α_TD and V̂_TD → V_TD.
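As an illustration of these estimators, here is a minimal LSTD sketch that builds Â and b̂ from a single trajectory and solves for the weight vector; the trajectory format, function name, and feature map are illustrative assumptions, not part of the talk.

```python
import numpy as np

def lstd(trajectory, phi, gamma, d):
    """A minimal LSTD sketch (illustrative; not the talk's reference code).

    trajectory : list [(X_0, R_0), (X_1, R_1), ..., (X_N, None)] generated by
                 following the evaluated policy pi
    phi        : feature map phi(x) -> np.ndarray of length d
    Returns alpha_TD_hat with V_TD_hat(x) ~= phi(x) . alpha_TD_hat
    """
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    N = len(trajectory) - 1

    for t in range(N):
        x_t, r_t = trajectory[t]
        x_next, _ = trajectory[t + 1]
        f_t, f_next = phi(x_t), phi(x_next)
        # A_hat[i, j] += phi_i(X_t) * (phi_j(X_t) - gamma * phi_j(X_{t+1}))
        A_hat += np.outer(f_t, f_t - gamma * f_next)
        # b_hat[i]   += phi_i(X_t) * R_t
        b_hat += f_t * r_t

    A_hat /= N
    b_hat /= N
    # Solve A_hat alpha = b_hat (assumes A_hat is invertible; in practice one may
    # regularize or use a pseudo-inverse).
    return np.linalg.solve(A_hat, b_hat)
```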

SLIDE 55

LSTD Error Bound

When the Markov chain induced by the policy under evaluation π has a stationary distribution μ^π (the chain is ergodic, e.g. β-mixing), then:

Theorem (LSTD Error Bound)
Let Ṽ be the truncated LSTD solution computed using n samples along a trajectory generated by following the policy π. Then with probability 1 − δ,
  ||V^π − Ṽ||_{μ^π} ≤ (c / √(1 − γ^2)) inf_{f∈F} ||V^π − f||_{μ^π} + O( √( d log(d/δ) / (n ν) ) )

- n = number of samples, d = dimension of the linear function space F.
- ν = the smallest eigenvalue of the Gram matrix ( ∫ φ_i φ_j dμ^π )_{i,j}
  (assume the eigenvalues of the Gram matrix are strictly positive, i.e., the model-based LSTD solution exists).
- The β-mixing coefficients are hidden in the O(·) notation.

SLIDE 56

LSTD Error Bound

LSTD Error Bound
  ||V^π − Ṽ||_{μ^π} ≤ (c / √(1 − γ^2)) inf_{f∈F} ||V^π − f||_{μ^π}   (approximation error)
                      + O( √( d log(d/δ) / (n ν) ) )   (estimation error)

Approximation error: depends on how well the function space F can approximate the value function V^π.

Estimation error: depends on the number of samples n, the dimension d of the function space, the smallest eigenvalue ν of the Gram matrix, and the mixing properties of the Markov chain (hidden in the O(·) notation).

SLIDE 57

LSPI Error Bound

Theorem (LSPI Error Bound)
Let V_{−1} ∈ F be an arbitrary initial value function, V_0, …, V_{K−1} be the sequence of truncated value functions generated by LSPI after K iterations, and π_K be the greedy policy w.r.t. V_{K−1}. Then with probability 1 − δ,
  ||V* − V^{π_K}||_ρ ≤ (4γ / (1 − γ)^2) [ √(C C_{ρ,μ}) ( c E_0(F) + O( √( d log(dK/δ) / (n ν_μ) ) ) ) + γ^{(K−1)/2} R_max ]

- Approximation error: E_0(F) = sup_{π∈G(F)} inf_{f∈F} ||V^π − f||_{μ^π}
- Estimation error: depends on n, d, ν_μ, K.
- Initialization error: error due to the choice of the initial value function or initial policy, |V* − V^{π_0}|.
SLIDE 61

LSPI Error Bound

LSPI Error Bound
  ||V* − V^{π_K}||_ρ ≤ (4γ / (1 − γ)^2) [ √(C C_{ρ,μ}) ( c E_0(F) + O( √( d log(dK/δ) / (n ν_μ) ) ) ) + γ^{(K−1)/2} R_max ]

Lower-Bounding Distribution
There exists a distribution μ such that for any policy π ∈ G(F), we have μ ≤ C μ^π, where C < ∞ is a constant and μ^π is the stationary distribution of π. Furthermore, we can define the concentrability coefficient C_{ρ,μ} as before.

ν_μ = the smallest eigenvalue of the Gram matrix ( ∫ φ_i φ_j dμ )_{i,j}
SLIDE 63

Discussion

- We obtain the optimal rates of regression and classification for RL (ADP) algorithms.
- What makes RL more challenging then?
  - the propagation of error (control problem)
  - the approximation error is more complex
  - the sampling problem (how to choose μ: the exploration problem)

SLIDE 64

Other Finite-Sample Analysis Results in RL

- Approximate Value Iteration [MS08]
- Approximate Policy Iteration
  - LSTD and LSPI [LGM10, LGM11]
  - Bellman Residual Minimization [MMLG10]
  - Modified Bellman Residual Minimization [ASM08]
  - Classification-based Policy Iteration [FYG06, LGM10, GLGS11]
- Regularized Approximate Dynamic Programming
  - L2-Regularization: L2-Regularized Policy Iteration [FGSM08], L2-Regularized Fitted Q-Iteration [FGSM09]
  - L1-Regularization and High-Dimensional RL: Lasso-TD [GLMH11], LSTD (LSPI) with Random Projections [GLMM10]

SLIDE 65

Bibliography I

- Antos, A., Szepesvári, Cs., and Munos, R. Learning Near-Optimal Policies with Bellman Residual Minimization-based Fitted Policy Iteration and a Single Sample Path. Machine Learning Journal, 71:89–129, 2008.
- Farahmand, A., Ghavamzadeh, M., Szepesvári, Cs., and Mannor, S. Regularized Policy Iteration. Proceedings of Advances in Neural Information Processing Systems 21, pp. 441–448, 2008.
- Farahmand, A., Ghavamzadeh, M., Szepesvári, Cs., and Mannor, S. Regularized Fitted Q-iteration for Planning in Continuous-Space Markovian Decision Problems. Proceedings of the American Control Conference, pp. 725–730, 2009.
- Fern, A., Yoon, S., and Givan, R. Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes. Journal of Artificial Intelligence Research, 25:85–118, 2006.
- Gabillon, V., Lazaric, A., Ghavamzadeh, M., and Scherrer, B. Classification-based Policy Iteration with a Critic. Proceedings of the Twenty-Eighth International Conference on Machine Learning, pp. 1049–1056, 2011.
- Ghavamzadeh, M., Lazaric, A., Munos, R., and Hoffman, M. Finite-Sample Analysis of Lasso-TD. Proceedings of the Twenty-Eighth International Conference on Machine Learning, pp. 1177–1184, 2011.
- Ghavamzadeh, M., Lazaric, A., Maillard, O., and Munos, R. LSTD with Random Projections. Proceedings of Advances in Neural Information Processing Systems 23, pp. 721–729, 2010.

SLIDE 66

Bibliography II

- Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of a Classification-based Policy Iteration Algorithm. Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp. 607–614, 2010.
- Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-Sample Analysis of LSTD. Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp. 615–622, 2010.
- Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-Sample Analysis of Least-Squares Policy Iteration. Accepted at the Journal of Machine Learning Research, 2011.
- Maillard, O., Munos, R., Lazaric, A., and Ghavamzadeh, M. Finite-Sample Analysis of Bellman Residual Minimization. Proceedings of the Second Asian Conference on Machine Learning, pp. 299–314, 2010.
- Munos, R. and Szepesvári, Cs. Finite-Time Bounds for Fitted Value Iteration. Journal of Machine Learning Research, 9:815–857, 2008.
- Munos, R. Performance Bounds in Lp-norm for Approximate Value Iteration. SIAM Journal of Control and Optimization, 2007.
- Munos, R. Error Bounds for Approximate Policy Iteration. Proceedings of the Nineteenth International Conference on Machine Learning, pp. 560–567, 2003.