SLIDE 1

MVA-RL Course

Approximate Dynamic Programming

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

SLIDE 3

Approximate Dynamic Programming

(a.k.a. Batch Reinforcement Learning)

  • Approximate Value Iteration
  • Approximate Policy Iteration

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 2/82

SLIDE 6

From DP to ADP

◮ Dynamic programming algorithms require an explicit definition of
  ◮ transition probabilities p(·|x, a)
  ◮ reward function r(x, a)

◮ This knowledge is often unavailable (e.g., wind intensity, human-computer interaction).

◮ Can we rely on samples?

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 3/82

SLIDE 9

From DP to ADP

◮ Dynamic programming algorithms require an exact representation of value functions and policies

◮ This is often impossible since their shape is too "complicated" (e.g., large or continuous state space).

◮ Can we use approximations?

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 4/82

slide-10
SLIDE 10

The Objective

Find a policy π such that the performance loss ||V ∗ − V π|| is as small as possible

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 5/82

SLIDE 13

From Approximation Error to Performance Loss

Question: if V is an approximation of the optimal value function V∗ with an error

    error = V − V∗

how does it translate to the (loss of) performance of the greedy policy

    π(x) ∈ arg max_{a∈A} Σ_y p(y|x, a) [ r(x, a, y) + γ V(y) ]

i.e.

    performance loss = V∗ − V^π

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 6/82

slide-14
SLIDE 14

From Approximation Error to Performance Loss

Proposition

Let V ∈ R^N be an approximation of V∗ and π its corresponding greedy policy. Then

    ||V∗ − V^π||_∞  ≤  (2γ / (1 − γ)) ||V∗ − V||_∞
    (performance loss)      (approx. error)

Furthermore, there exists ε > 0 such that if ||V − V∗||_∞ ≤ ε, then π is optimal.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 7/82
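
As a quick numeric illustration of the proposition (a minimal sketch; the values of γ and of the approximation error below are assumptions, not numbers from the lecture):

    # Plug sample values into ||V* - V^pi||_inf <= 2*gamma/(1-gamma) * ||V* - V||_inf.
    gamma = 0.9          # assumed discount factor
    approx_error = 0.1   # assumed sup-norm approximation error ||V* - V||_inf
    loss_bound = 2 * gamma / (1 - gamma) * approx_error
    print(loss_bound)    # 1.8 -- the guarantee degrades sharply as gamma approaches 1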

slide-15
SLIDE 15

From Approximation Error to Performance Loss

Proof.

    ||V∗ − V^π||_∞ ≤ ||T V∗ − T^π V||_∞ + ||T^π V − T^π V^π||_∞
                   ≤ ||T V∗ − T V||_∞ + γ ||V − V^π||_∞          (since π is greedy w.r.t. V, T^π V = T V)
                   ≤ γ ||V∗ − V||_∞ + γ ( ||V − V∗||_∞ + ||V∗ − V^π||_∞ )

and rearranging the last inequality gives ||V∗ − V^π||_∞ ≤ (2γ / (1 − γ)) ||V∗ − V||_∞.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 8/82

slide-16
SLIDE 16

Approximate Dynamic Programming

(a.k.a. Batch Reinforcement Learning)

  • Approximate Value Iteration
  • Approximate Policy Iteration

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 9/82

SLIDE 19

From Approximation Error to Performance Loss

Question: how do we compute a good V?

Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function V∗.

Solution: value iteration produces functions which are close to the optimal value function V∗.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 10/82

SLIDE 21

Value Iteration: the Idea

  • 1. Let Q0 be any action-value function
  • 2. At each iteration k = 1, 2, . . . , K
      ◮ Compute

          Qk+1(x, a) = T Qk(x, a) = r(x, a) + γ Σ_y p(y|x, a) max_b Qk(y, b)

  • 3. Return the greedy policy

          πK(x) ∈ arg max_{a∈A} QK(x, a).

◮ Problem: how can we approximate T Qk?
◮ Problem: if Qk+1 ≠ T Qk, does (approximate) value iteration still work?

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 11/82
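
A minimal tabular sketch of the exact update above, on a randomly generated toy MDP (the arrays P, R and the constants below are assumptions for illustration only):

    import numpy as np

    n_states, n_actions, gamma, K = 3, 2, 0.9, 100
    rng = np.random.default_rng(0)
    P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_actions)]  # P[a][x, y] = p(y|x, a)
    R = rng.uniform(size=(n_states, n_actions))                                      # R[x, a] = r(x, a)

    Q = np.zeros((n_states, n_actions))                        # Q_0: any action-value function
    for _ in range(K):
        V = Q.max(axis=1)                                      # max_b Q_k(y, b)
        Q = np.array([R[:, a] + gamma * P[a] @ V for a in range(n_actions)]).T   # Q_{k+1} = T Q_k
    greedy_policy = Q.argmax(axis=1)                           # pi_K(x) in arg max_a Q_K(x, a)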

SLIDE 23

Linear Fitted Q-iteration: the Approximation Space

Linear space (used to approximate action–value functions)

    F = { f_α(x, a) = Σ_{j=1}^{d} α_j ϕ_j(x, a),  α ∈ R^d }

with features

    ϕ_j : X × A → [0, L],    φ(x, a) = [ϕ_1(x, a) . . . ϕ_d(x, a)]^⊤

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 12/82
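
One possible concrete choice of φ(x, a) for a one-dimensional state space and a finite action set (an assumption for illustration; the lecture only requires bounded features):

    import numpy as np

    centers = np.linspace(0.0, 10.0, 5)    # RBF centers over the state space (assumed)
    n_actions, width = 2, 2.0

    def phi(x, a):
        rbf = np.exp(-((x - centers) ** 2) / (2 * width ** 2))   # state features, values in [0, 1]
        feat = np.zeros(len(centers) * n_actions)                # one block of features per action
        feat[a * len(centers):(a + 1) * len(centers)] = rbf
        return feat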

slide-24
SLIDE 24

Linear Fitted Q-iteration: the Samples

Assumption: access to a generative model, that is a black-box simulator sim() of the environment is available. Given (x, a), sim(x, a) = {y, r}, with y ∼ p(·|x, a), r = r(x, a)

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 13/82

SLIDE 34

Linear Fitted Q-iteration

Input: space F, number of iterations K, sampling distribution ρ, number of samples n
Initial function Q0 ∈ F
For k = 1, . . . , K

  • 1. Draw n samples (x_i, a_i) i.i.d. ∼ ρ
  • 2. Sample x′_i ∼ p(·|x_i, a_i) and r_i = r(x_i, a_i)
  • 3. Compute y_i = r_i + γ max_a Qk−1(x′_i, a)
  • 4. Build the training set {((x_i, a_i), y_i)}_{i=1}^{n}
  • 5. Solve the least-squares problem

        α̂_k = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^{n} ( f_α(x_i, a_i) − y_i )²

  • 6. Return Q_k = f_{α̂_k} (truncation may be needed)

Return π_K(·) = arg max_a Q_K(·, a) (greedy policy)

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 14/82
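
A minimal sketch of the loop above. Assumed available (names carried over from the earlier sketches, not part of the lecture): a generative model sim(x, a), a feature map phi(x, a), a sampler rho(), a finite action set, a discount gamma and v_max = Rmax/(1 − γ):

    import numpy as np

    def linear_fqi(phi, sim, rho, actions, d, gamma, v_max, K=50, n=1000):
        alpha = np.zeros(d)                                    # Q_0 = f_alpha with alpha = 0
        for _ in range(K):
            samples = [rho() for _ in range(n)]                # 1. draw (x_i, a_i) ~ rho
            Phi = np.array([phi(x, a) for x, a in samples])    # n x d design matrix
            y = np.empty(n)
            for i, (x, a) in enumerate(samples):
                x_next, r = sim(x, a)                          # 2. x'_i, r_i from the generative model
                q_next = max(phi(x_next, b) @ alpha for b in actions)
                y[i] = r + gamma * np.clip(q_next, -v_max, v_max)   # 3. target, with truncated Q_{k-1}
            alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # 5. least-squares fit
        return lambda x: max(actions, key=lambda a: phi(x, a) @ alpha)   # greedy policy pi_K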

SLIDE 36

Linear Fitted Q-iteration: Sampling

  • 1. Draw n samples (x_i, a_i) i.i.d. ∼ ρ
  • 2. Sample x′_i ∼ p(·|x_i, a_i) and r_i = r(x_i, a_i)

◮ In practice the sampling can be done once, before running the algorithm
◮ The sampling distribution ρ should cover the state-action space in all relevant regions
◮ If it is not possible to choose ρ, a database of samples can be used

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 15/82

SLIDE 38

Linear Fitted Q-iteration: The Training Set

  • 4. Compute y_i = r_i + γ max_a Qk−1(x′_i, a)
  • 5. Build the training set {((x_i, a_i), y_i)}_{i=1}^{n}

◮ Each y_i is an unbiased sample of T Qk−1(x_i, a_i), since

    E[y_i | x_i, a_i] = E[ r_i + γ max_a Qk−1(x′_i, a) ] = r(x_i, a_i) + γ E[ max_a Qk−1(x′_i, a) ]
                      = r(x_i, a_i) + γ ∫_X max_a Qk−1(y, a) p(dy|x_i, a_i) = T Qk−1(x_i, a_i)

◮ The problem "reduces" to standard regression
◮ The training set should be recomputed at each iteration

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 16/82

SLIDE 40

Linear Fitted Q-iteration: The Regression Problem

  • 6. Solve the least-squares problem

        α̂_k = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^{n} ( f_α(x_i, a_i) − y_i )²

  • 7. Return Q_k = f_{α̂_k} (truncation may be needed)

◮ Thanks to the linear space we can solve it in closed form:
  ◮ Build the matrix Φ = [ φ(x_1, a_1)^⊤ . . . φ(x_n, a_n)^⊤ ]
  ◮ Compute α̂_k = (Φ^⊤Φ)^{−1} Φ^⊤ y (least-squares solution)
  ◮ Truncate to [−Vmax, Vmax] (with Vmax = Rmax/(1 − γ))

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 17/82
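
The closed-form step as a short sketch (in practice lstsq or a pseudo-inverse is preferred when Φ^⊤Φ is ill-conditioned):

    import numpy as np

    def fit_and_truncate(Phi, y, v_max):
        alpha = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)            # alpha_k = (Phi^T Phi)^(-1) Phi^T y
        q_fn = lambda feat: np.clip(feat @ alpha, -v_max, v_max)   # truncation to [-Vmax, Vmax]
        return alpha, q_fn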

slide-41
SLIDE 41

Sketch of the Analysis

[Figure: the sequence Q0, Q1, . . . , QK generated by the algorithm; at each step the Bellman operator T is applied and the regression introduces an error ǫk between T Qk−1 and Qk, and these errors accumulate into the final error between QπK and Q∗.]

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 18/82

SLIDE 44

Theoretical Objectives

Objective: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution µ

    ||Q∗ − QπK||_µ ≤ ???

Sub-Objective 1: derive an intermediate bound on the prediction error at any iteration k w.r.t. the sampling distribution ρ

    ||T Qk−1 − Qk||_ρ ≤ ???

Sub-Objective 2: analyze how the error at each iteration is propagated through iterations

    ||Q∗ − QπK||_µ ≤ propagation(||T Qk−1 − Qk||_ρ)

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 19/82

SLIDE 49

The Sources of Error

◮ Desired solution

    Q̃k = T Qk−1

◮ Best solution (w.r.t. the sampling distribution ρ)

    fα∗_k = arg inf_{fα∈F} ||fα − Q̃k||_ρ

  ⇒ Error from the approximation space F

◮ Returned solution

    α̂k = arg min_{fα∈F} (1/n) Σ_{i=1}^{n} ( fα(x_i, a_i) − y_i )²

  ⇒ Error from the (random) samples

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 20/82

slide-50
SLIDE 50

Per-Iteration Error

Theorem

At each iteration k, Linear-FQI returns an approximation Qk such that (Sub-Objective 1)

    ||Q̃k − Qk||_ρ ≤ 4 ||Q̃k − fα∗_k||_ρ + O( (Vmax + L||α∗_k||) √(log(1/δ)/n) ) + O( Vmax √(d log(n/δ)/n) )

with probability 1 − δ.

Tools: concentration of measure inequalities, covering space, linear algebra, union bounds, special tricks for linear spaces, ...

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 21/82

SLIDES 51-54

Per-Iteration Error: Remarks

    ||Q̃k − Qk||_ρ ≤ 4 ||Q̃k − fα∗_k||_ρ + O( (Vmax + L||α∗_k||) √(log(1/δ)/n) ) + O( Vmax √(d log(n/δ)/n) )

On the approximation error term 4 ||Q̃k − fα∗_k||_ρ:
  ◮ No algorithm can do better
  ◮ Constant 4
  ◮ Depends on the space F
  ◮ Changes with the iteration k

On the first estimation error term:
  ◮ Vanishing to zero as O(n^{−1/2})
  ◮ Depends on the features (L) and on the best solution (||α∗_k||)

On the second estimation error term:
  ◮ Vanishing to zero as O(n^{−1/2})
  ◮ Depends on the dimensionality of the space (d) and the number of samples (n)

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 25/82

SLIDE 58

Error Propagation

Objective: ||Q∗ − QπK||_µ

◮ Problem 1: the test norm µ is different from the sampling norm ρ
◮ Problem 2: we have bounds for Qk, not for the performance of the corresponding πk
◮ Problem 3: we have bounds for one single iteration

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 26/82

SLIDE 60

Error Propagation

Transition kernel for a fixed policy: Pπ.

◮ m-step (worst-case) concentration of the future state distribution

    c(m) = sup_{π1...πm} || d(µ Pπ1 · · · Pπm) / dρ ||_∞ < ∞

◮ Average (discounted) concentration

    Cµ,ρ = (1 − γ)² Σ_{m≥1} m γ^{m−1} c(m) < +∞

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 27/82

slide-61
SLIDE 61

Error Propagation

Remark: relationship to the top-Lyapunov exponent

    L+ = sup_π lim sup_{m→∞} (1/m) log+ ||ρ Pπ1 Pπ2 · · · Pπm||

If L+ ≤ 0 (stable system), then c(m) has at most polynomial growth and Cµ,ρ is finite.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 28/82

slide-62
SLIDE 62

Error Propagation

Proposition

Let ǫk = Q̃k − Qk be the error at each iteration. Then after K iterations the performance loss of the greedy policy πK satisfies

    ||Q∗ − QπK||²_µ ≤ (2 / (1 − γ)²) Cµ,ρ max_k ||ǫk||²_ρ + O( (γ^K / (1 − γ)³) V²max )

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 29/82

SLIDE 64

The Final Bound

Bringing everything together...

    ||Q∗ − QπK||²_µ ≤ (2 / (1 − γ)²) Cµ,ρ max_k ||ǫk||²_ρ + O( (γ^K / (1 − γ)³) V²max )

with

    ||ǫk||_ρ = ||Q̃k − Qk||_ρ ≤ 4 ||Q̃k − fα∗_k||_ρ + O( (Vmax + L||α∗_k||) √(log(1/δ)/n) ) + O( Vmax √(d log(n/δ)/n) )

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 30/82

slide-65
SLIDE 65

The Final Bound

Theorem (see e.g., Munos,’03) LinearFQI with a space F of d features, with n samples at each iteration returns a policy πK after K iterations such that

||Q∗ − QπK ||µ ≤ 2γ (1 − γ)2

  • Cµ,ρ
  • 4d(F, T F) + O
  • Vmax
  • 1 +

L √ω

  • d log n/δ

n

  • + O
  • γK

(1 − γ)3 Vmax2

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 31/82

SLIDES 66-67

The Final Bound: Remarks

    ||Q∗ − QπK||_µ ≤ (2γ / (1 − γ)²) √Cµ,ρ [ 4 d(F, T F) + O( Vmax (1 + L/√ω) √(d log(n/δ)/n) ) ] + O( (γ^K / (1 − γ)³) V²max )

◮ The propagation (and the change of norms) makes the problem more complex
  ⇒ how do we choose the sampling distribution?

◮ The approximation error d(F, T F) is worse than in regression
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 33/82

slide-68
SLIDE 68

The Final Bound

The inherent Bellman error

    ||Q̃k − fα∗_k||_ρ = inf_{f∈F} ||Q̃k − f||_ρ
                      = inf_{f∈F} ||T Qk−1 − f||_ρ
                      ≤ inf_{f∈F} ||T fα̂k−1 − f||_ρ
                      ≤ sup_{g∈F} inf_{f∈F} ||T g − f||_ρ = d(F, T F)

Question: how to design F to make it "compatible" with the Bellman operator?
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 34/82

SLIDES 69-72

The Final Bound: Remarks (continued)

    ||Q∗ − QπK||_µ ≤ (2γ / (1 − γ)²) √Cµ,ρ [ 4 d(F, T F) + O( Vmax (1 + L/√ω) √(d log(n/δ)/n) ) ] + O( (γ^K / (1 − γ)³) V²max )

◮ The dependency on γ is worse than at each single iteration
  ⇒ is it possible to avoid it?

◮ The error decreases exponentially in K
  ⇒ K ≈ log(1/ǫ)/(1 − γ) iterations suffice for a residual term of order ǫ

◮ ω is the smallest eigenvalue of the Gram matrix
  ⇒ design the features so as to be orthogonal w.r.t. ρ

◮ The asymptotic rate O(d/n) is the same as for regression
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 38/82

slide-73
SLIDE 73

Summary

[Summary diagram relating: the Markov decision process, the dynamic programming algorithm, the approximation space (size d, features, ω), the samples (number n, sampling distribution ρ, sampling strategy), the approximation algorithm, and the resulting performance, via the range Vmax, the concentrability Cµ,ρ, the inherent error d(F, T F), the per-iteration error Q̃k − Qk, and its propagation.]

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 39/82

slide-74
SLIDE 74

Other implementations

Replace the regression step with

◮ K-nearest neighbours
◮ Regularized linear regression with L1 or L2 regularisation
◮ Neural networks
◮ Support vector regression
◮ ...

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 40/82
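
A sketch of the same fitted Q-iteration loop with a pluggable regressor, here scikit-learn's KNeighborsRegressor as one example (the interface sim, rho, actions, gamma is assumed as in the earlier linear sketch; any object with fit/predict would do):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    def fitted_q_iteration(sim, rho, actions, gamma, K=50, n=1000):
        model = None
        for _ in range(K):
            samples = [rho() for _ in range(n)]
            X = np.array([[x, a] for x, a in samples])            # regress on (state, action) pairs
            y = np.empty(n)
            for i, (x, a) in enumerate(samples):
                x_next, r = sim(x, a)
                q_next = 0.0 if model is None else max(
                    model.predict([[x_next, b]])[0] for b in actions)
                y[i] = r + gamma * q_next
            model = KNeighborsRegressor(n_neighbors=5).fit(X, y)  # the regression step being swapped
        return lambda x: max(actions, key=lambda a: model.predict([[x, a]])[0])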

SLIDE 79

Example: the Optimal Replacement Problem

State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost:
  ◮ c(x, R) = C
  ◮ c(x, K) = c(x), maintenance plus extra costs.
Dynamics:
  ◮ p(·|x, R) = exp(β), with density d(y) = β exp(−βy) I{y ≥ 0},
  ◮ p(·|x, K) = x + exp(β), with density d(y − x).
Problem: minimize the discounted expected cost over an infinite horizon.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 41/82
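
A sketch of the generative model for this problem. The wear dynamics follow the slide (exponential increments); the numerical values of C, β and the maintenance cost c(x) are illustrative assumptions, not the values behind the lecture's figures:

    import numpy as np

    rng = np.random.default_rng(0)
    beta, C, gamma = 0.6, 30.0, 0.95

    def maintenance_cost(x):
        return 4.0 * x                       # assumed increasing maintenance cost c(x)

    def sim(x, a):
        if a == "R":                         # replace: wear restarts from an exp(beta) draw
            y = rng.exponential(1.0 / beta)
            return y, C
        y = x + rng.exponential(1.0 / beta)  # keep: wear increases by an exp(beta) increment
        return y, maintenance_cost(x)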

SLIDE 83

Example: the Optimal Replacement Problem

Optimal value function

    V∗(x) = min{ c(x) + γ ∫_x^∞ d(y − x) V∗(y) dy,  C + γ ∫_0^∞ d(y) V∗(y) dy }

Optimal policy: the action that attains the minimum.

[Figure: management cost c(x) and optimal value function V∗ as functions of the wear x ∈ [0, 10]; the optimal policy keeps (K) for low wear and replaces (R) beyond a threshold.]

Linear approximation space

    F := { Vα(x) = Σ_{k=1}^{20} αk cos(kπ x / xmax) }
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 42/82
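
The cosine features spanning the space F above, with xmax = 10 (the range of the wear axis in the figures; an assumption here):

    import numpy as np

    xmax, n_features = 10.0, 20

    def features(x):
        k = np.arange(1, n_features + 1)
        return np.cos(k * np.pi * x / xmax)   # [cos(pi x/xmax), ..., cos(20 pi x/xmax)]

    # V_alpha(x) = features(x) @ alpha for a coefficient vector alpha in R^20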

SLIDE 85

Example: the Optimal Replacement Problem

Collect N samples on a uniform grid.

Figure: Left: the target values computed as {T V0(xn)}1≤n≤N. Right: the approximation V1 ∈ F of the target function T V0.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 43/82

slide-86
SLIDE 86

Example: the Optimal Replacement Problem


Figure: Left: the target values computed as {T V1(xn)}1≤n≤N. Center: the approximation V2 ∈ F of T V1. Right: the approximation Vn ∈ F after n iterations.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 44/82

slide-87
SLIDE 87

Example: the Optimal Replacement Problem

Simulation

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 45/82

slide-88
SLIDE 88

Approximate Dynamic Programming

(a.k.a. Batch Reinforcement Learning)

  • Approximate Value Iteration
  • Approximate Policy Iteration

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 46/82

SLIDE 90

Policy Iteration: the Idea

  • 1. Let π0 be any stationary policy
  • 2. At each iteration k = 1, 2, . . . , K
      ◮ Policy evaluation: given πk, compute Vk = V^πk.
      ◮ Policy improvement: compute the greedy policy

          πk+1(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^πk(y) ].

  • 3. Return the last policy πK

◮ Problem: how can we approximate V^πk?
◮ Problem: if Vk ≠ V^πk, does (approximate) policy iteration still work?

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 47/82
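
A minimal tabular sketch of the exact scheme above, reusing the toy arrays P, R and gamma from the value-iteration sketch (assumptions for illustration); policy evaluation solves the linear system V = r_pi + γ P_pi V:

    import numpy as np

    def policy_iteration(P, R, gamma, K=20):
        n_states, n_actions = R.shape
        pi = np.zeros(n_states, dtype=int)                   # pi_0: arbitrary stationary policy
        for _ in range(K):
            P_pi = np.array([P[pi[x]][x] for x in range(n_states)])
            r_pi = R[np.arange(n_states), pi]
            V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)   # evaluation of pi_k
            Q = np.array([R[:, a] + gamma * P[a] @ V for a in range(n_actions)]).T
            pi = Q.argmax(axis=1)                            # greedy improvement
        return pi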

slide-91
SLIDE 91

Approximate Policy Iteration: performance loss

Problem: the algorithm is no longer guaranteed to converge.

[Figure: ||V∗ − V^πk||, which oscillates around an asymptotic error level as k grows.]

Proposition

The asymptotic performance of the policies πk generated by the API algorithm is related to the approximation error as:

    lim sup_{k→∞} ||V∗ − V^πk||_∞  ≤  (2γ / (1 − γ)²) lim sup_{k→∞} ||Vk − V^πk||_∞
    (performance loss)                                 (approximation error)
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 48/82

SLIDE 93

Least-Squares Policy Iteration (LSPI)

LSPI uses

◮ Linear space to approximate value functions*

    F = { f_α(x) = Σ_{j=1}^{d} α_j ϕ_j(x),  α ∈ R^d }

◮ Least-Squares Temporal Difference (LSTD) algorithm for policy evaluation.

*In practice we use approximations of action-value functions.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 49/82

slide-94
SLIDE 94

Least-Squares Temporal-Difference Learning (LSTD)

◮ V^π may not belong to F:  V^π ∉ F

◮ The best approximation of V^π in F is

    Π V^π = arg min_{f∈F} ||V^π − f||

  (Π is the projection onto F)

[Figure: V^π outside the subspace F and its projection ΠV^π onto F.]

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 50/82

slide-95
SLIDE 95

Least-Squares Temporal-Difference Learning (LSTD)

◮ V^π is the fixed point of T^π

    V^π = T^π V^π = r^π + γ P^π V^π

◮ LSTD searches for the fixed point of Π_{2,ρ} T^π, where

    Π_{2,ρ} g = arg min_{f∈F} ||g − f||_{2,ρ}

◮ When the fixed point of Π_ρ T^π exists, we call it the LSTD solution

    V_TD = Π_ρ T^π V_TD

[Figure: V^π, its projection Π_ρ V^π, and the LSTD fixed point V_TD = Π_ρ T^π V_TD in the subspace F.]

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 51/82

SLIDE 98

Least-Squares Temporal-Difference Learning (LSTD)

V_TD = Π_ρ T^π V_TD

◮ The projection Π_ρ is orthogonal in expectation w.r.t. the space F spanned by the features ϕ1, . . . , ϕd:

    E_{x∼ρ}[ (T^π V_TD(x) − V_TD(x)) ϕi(x) ] = 0,  ∀i ∈ [1, d]
    ⟨T^π V_TD − V_TD, ϕi⟩_ρ = 0

◮ By definition of the Bellman operator

    ⟨r^π + γ P^π V_TD − V_TD, ϕi⟩_ρ = 0
    ⟨r^π, ϕi⟩_ρ − ⟨(I − γ P^π) V_TD, ϕi⟩_ρ = 0

◮ Since V_TD ∈ F, there exists α_TD such that V_TD(x) = φ(x)^⊤ α_TD

    ⟨r^π, ϕi⟩_ρ − Σ_{j=1}^{d} ⟨(I − γ P^π) ϕj α_{TD,j}, ϕi⟩_ρ = 0
    ⟨r^π, ϕi⟩_ρ − Σ_{j=1}^{d} ⟨(I − γ P^π) ϕj, ϕi⟩_ρ α_{TD,j} = 0

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 52/82

slide-99
SLIDE 99

Least-Squares Temporal-Difference Learning (LSTD)

V_TD = Π_ρ T^π V_TD
    ⇓
⟨r^π, ϕi⟩_ρ − Σ_{j=1}^{d} ⟨(I − γ P^π) ϕj, ϕi⟩_ρ α_{TD,j} = 0,   with b_i = ⟨r^π, ϕi⟩_ρ and A_{i,j} = ⟨(I − γ P^π) ϕj, ϕi⟩_ρ
    ⇓
A α_TD = b

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 53/82

SLIDE 101

Least-Squares Temporal-Difference Learning (LSTD)

◮ Problem: in general, Π_ρ T^π is not a contraction and does not have a fixed point.
◮ Solution: if ρ = ρ^π (the stationary distribution of π), then Π_{ρ^π} T^π has a unique fixed point.

◮ Problem: in general, Π_ρ T^π cannot be computed (because P^π and r^π are unknown).
◮ Solution: use samples coming from a "trajectory" of π.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 54/82


Least-Squares Policy Iteration (LSPI)

Input: space F, iterations K, sampling distribution ρ, num of samples n Initial policy π0 For k = 1, . . . , K

  • 1. Generate a trajectory of length n from the stationary dist. ρπk

(x1, πk(x1), r1, x2, πk(x2), r2, . . . , xn−1, πk(xn−1), rn−1, xn)

  • 2. Compute the empirical matrix

Ak and the vector bk [ Ak]i,j = 1 n

n

  • t=1

(ϕj(xt) − γϕj(xt+1)ϕi(xt) ≈ (I − γPπ)ϕj, ϕiρπk [ bk]i = 1 n

n

  • t=1

ϕi(xt)rt ≈ r π, ϕiρπk

  • 3. Solve the linear system αk =

A−1

k

bk

  • 4. Compute the greedy policy πk+1 w.r.t.

Vk = fαk Return the last policy πK

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 55/82
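
A minimal sketch of the LSTD step inside this loop. Assumed available: a state feature map phi(x) in R^d and a trajectory [(x_1, r_1), ..., (x_n, r_n)] collected by following pi_k, in temporal order (the 1/n factors cancel when solving the system):

    import numpy as np

    def lstd(trajectory, phi, gamma, d):
        A_hat = np.zeros((d, d))
        b_hat = np.zeros(d)
        for (x_t, r_t), (x_next, _) in zip(trajectory[:-1], trajectory[1:]):
            f_t, f_next = phi(x_t), phi(x_next)
            A_hat += np.outer(f_t, f_t - gamma * f_next)   # accumulates [A_k]_{i,j}
            b_hat += f_t * r_t                             # accumulates [b_k]_i
        alpha = np.linalg.solve(A_hat, b_hat)              # alpha_k = A_k^{-1} b_k
        return alpha                                       # V_k(x) ~ phi(x) @ alpha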

slide-109
SLIDE 109

Least-Squares Policy Iteration (LSPI)

  • 1. Generate a trajectory of length n from the stationary distribution ρ^πk:

        (x1, πk(x1), r1, x2, πk(x2), r2, . . . , xn−1, πk(xn−1), rn−1, xn)

◮ The first few samples may be discarded because they are not actually drawn from the stationary distribution ρ^πk
◮ Off-policy samples could be used with importance weighting
◮ In practice, i.i.d. states drawn from an arbitrary distribution (but with actions from πk) may be used

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 56/82

slide-110
SLIDE 110

Least-Squares Policy Iteration (LSPI)

  • 4. Compute the greedy policy πk+1 w.r.t. V̂k = f_{α̂k}

◮ Computing the greedy policy from V̂k is difficult (it requires the transition model), so move to LSTD-Q and compute

    πk+1(x) = arg max_a Q̂k(x, a)
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 57/82

SLIDE 112

Least-Squares Policy Iteration (LSPI)

For k = 1, . . . , K

  • 1. Generate a trajectory of length n from the stationary distribution ρ^πk:

        (x1, πk(x1), r1, x2, πk(x2), r2, . . . , xn−1, πk(xn−1), rn−1, xn)
  ...
  • 4. Compute the greedy policy πk+1 w.r.t. V̂k = f_{α̂k}

Problem: this process may be unstable because πk does not cover the state space properly.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 58/82

SLIDE 115

LSTD Algorithm

When n → ∞, then Â → A and b̂ → b, and thus α̂_TD → α_TD and V̂_TD → V_TD.

Proposition (LSTD Performance)

If LSTD is used to estimate the value of π with an infinite number of samples drawn from the stationary distribution ρ^π, then

    ||V^π − V_TD||_{ρ^π} ≤ (1 / √(1 − γ²)) inf_{V∈F} ||V^π − V||_{ρ^π}

Problem: we don't have an infinite number of samples...
Problem 2: V_TD is a fixed-point solution, not a standard machine learning problem...

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 59/82

SLIDE 117

LSTD Error Bound

Assumption: the Markov chain induced by the policy πk has a stationary distribution ρ^πk and it is ergodic and β-mixing.

Theorem (LSTD Error Bound)

At any iteration k, if LSTD uses n samples obtained from a single trajectory of πk and a d-dimensional space, then with probability 1 − δ

    ||V^πk − V̂k||_{ρ^πk} ≤ (c / √(1 − γ²)) inf_{f∈F} ||V^πk − f||_{ρ^πk} + O( √( d log(d/δ) / (n ν) ) )

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 60/82

slide-118
SLIDE 118

LSTD Error Bound

    ||V^π − V̂||_{ρ^π} ≤ (c / √(1 − γ²)) inf_{f∈F} ||V^π − f||_{ρ^π}   +   O( √( d log(d/δ) / (n ν) ) )
                          (approximation error)                           (estimation error)

◮ Approximation error: it depends on how well the function space F can approximate the value function V^π
◮ Estimation error: it depends on the number of samples n, the dimension of the function space d, the smallest eigenvalue of the Gram matrix ν, and the mixing properties of the Markov chain (hidden in the O)

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 61/82

SLIDES 119-120

LSTD Error Bound

    ||V^πk − V̂k||_{ρ^πk} ≤ (c / √(1 − γ²)) inf_{f∈F} ||V^πk − f||_{ρ^πk}   +   O( √( d log(d/δ) / (n νk) ) )
                              (approximation error)                           (estimation error)

◮ n is the number of samples and d the dimensionality of F
◮ νk = the smallest eigenvalue of the Gram matrix ( ∫ ϕi ϕj dρ^πk )_{i,j}
  (Assumption: the eigenvalues of the Gram matrix are strictly positive - existence of the model-based LSTD solution)
◮ The β-mixing coefficients are hidden in the O(·) notation

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 63/82

slide-121
SLIDE 121

LSPI Error Bound

Theorem (LSPI Error Bound)

If LSPI is run over K iterations, then the performance loss of the policy πK is

    ||V∗ − V^πK||_µ ≤ (4γ / (1 − γ)²) [ √(C Cµ,ρ) ( c E0(F) + O( √( d log(dK/δ) / (n νρ) ) ) ) + γ^K Rmax ]

with probability 1 − δ.
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 64/82

SLIDE 124

LSPI Error Bound

Theorem (LSPI Error Bound)

If LSPI is run over K iterations, then the performance loss of the policy πK is

    ||V∗ − V^πK||_µ ≤ (4γ / (1 − γ)²) [ √(C Cµ,ρ) ( c E0(F) + O( √( d log(dK/δ) / (n νρ) ) ) ) + γ^K Rmax ]

with probability 1 − δ.

◮ Approximation error: E0(F) = sup_{π∈G(F)} inf_{f∈F} ||V^π − f||_{ρ^π}, where G(F) is the set of greedy policies w.r.t. functions in F
◮ Estimation error: depends on n, d, νρ, K
◮ Initialization error: the error due to the choice of the initial value function or initial policy, |V∗ − V^π0|

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 67/82

slide-125
SLIDE 125

LSPI Error Bound

LSPI Error Bound

||V ∗ − V πK ||µ ≤ 4γ (1 − γ)2

  • CCµ,ρ
  • cE0(F) + O
  • d log(dK/δ)

n νρ

  • + γKRmax
  • Lower-Bounding Distribution

There exists a distribution ρ such that for any policy π ∈ G( F), we have ρ ≤ Cρπ, where C < ∞ is a constant and ρπ is the stationary distribution of π. Furthermore, we can define the concentrability coefficient Cµ,ρ as before.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 68/82

slide-126
SLIDE 126

LSPI Error Bound

    ||V∗ − V^πK||_µ ≤ (4γ / (1 − γ)²) [ √(C Cµ,ρ) ( c E0(F) + O( √( d log(dK/δ) / (n νρ) ) ) ) + γ^K Rmax ]

Lower-Bounding Distribution

There exists a distribution ρ such that for any policy π ∈ G(F), we have ρ ≤ C ρ^π, where C < ∞ is a constant and ρ^π is the stationary distribution of π. Furthermore, we can define the concentrability coefficient Cµ,ρ as before.

◮ νρ = the smallest eigenvalue of the Gram matrix ( ∫ ϕi ϕj dρ )_{i,j}
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 69/82

slide-127
SLIDE 127

Bellman Residual Minimization (BRM): the idea

[Figure: V^π, the space F, and the two candidates arg min_{V∈F} ||V^π − V|| and V_BR = arg min_{V∈F} ||T^π V − V||.]

Let µ be a distribution over X. V_BR is the minimizer of the Bellman residual w.r.t. T^π:

    V_BR = arg min_{V∈F} ||T^π V − V||_{2,µ}

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 70/82

slide-128
SLIDE 128

Bellman Residual Minimization (BRM): the idea

The mapping α → T^π V_α − V_α is affine; the function α → ||T^π V_α − V_α||²_µ is quadratic.

⇒ The minimum is obtained by computing the gradient and setting it to zero:

    ⟨ r^π + (γP^π − I) Σ_{j=1}^{d} φj αj , (γP^π − I) φi ⟩_µ = 0,

which can be rewritten as A α = b, with

    A_{i,j} = ⟨ φi − γP^π φi , φj − γP^π φj ⟩_µ,    b_i = ⟨ φi − γP^π φi , r^π ⟩_µ

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 71/82

SLIDE 130

Bellman Residual Minimization (BRM): the idea

Remark: the system admits a solution whenever the features φi are linearly independent w.r.t. µ.

Remark: let {ψi = φi − γP^π φi}_{i=1...d}; then the previous system can be interpreted as the linear regression problem

    min_α ||α · ψ − r^π||_µ

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 72/82

slide-131
SLIDE 131

BRM: the approximation error

Proposition

We have

    ||V^π − V_BR|| ≤ ||(I − γP^π)^{−1}|| (1 + γ||P^π||) inf_{V∈F} ||V^π − V||.

If µ^π is the stationary distribution of π, then ||P^π||_{µ^π} = 1 and ||(I − γP^π)^{−1}||_{µ^π} ≤ 1/(1 − γ), thus

    ||V^π − V_BR||_{µ^π} ≤ ((1 + γ)/(1 − γ)) inf_{V∈F} ||V^π − V||_{µ^π}.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 73/82

slide-132
SLIDE 132

BRM: the implementation

Assumption: a generative model is available.

◮ Draw n states Xt ∼ µ
◮ Call the generative model on (Xt, At) (with At = π(Xt)) and obtain Rt = r(Xt, At), Yt ∼ p(·|Xt, At)
◮ Compute

    B̂(V) = (1/n) Σ_{t=1}^{n} [ V(Xt) − ( Rt + γV(Yt) ) ]²,   where T̂V(Xt) = Rt + γV(Yt)

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 74/82

slide-133
SLIDE 133

BRM: the implementation

Problem: this estimator is biased and not consistent! In fact,

    E[B̂(V)] = E[ ( V(Xt) − T^πV(Xt) + T^πV(Xt) − T̂V(Xt) )² ]
             = ||T^πV − V||²_µ + E[ ( T^πV(Xt) − T̂V(Xt) )² ]

⇒ minimizing B̂(V) does not correspond to minimizing B(V) (even when n → ∞).

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 75/82

slide-134
SLIDE 134

BRM: the implementation

Solution: in each state Xt, generate two independent samples Yt and Y′t ∼ p(·|Xt, At).

Define

    B̂(V) = (1/n) Σ_{t=1}^{n} [ V(Xt) − ( Rt + γV(Yt) ) ] [ V(Xt) − ( Rt + γV(Y′t) ) ].

⇒ B̂ → B for n → ∞.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 76/82

slide-135
SLIDE 135

BRM: the implementation

The function α → B̂(Vα) is quadratic and we obtain the linear system

    Â_{i,j} = (1/n) Σ_{t=1}^{n} ( φi(Xt) − γφi(Yt) ) ( φj(Xt) − γφj(Y′t) ),

    b̂_i = (1/n) Σ_{t=1}^{n} ( φi(Xt) − γ (φi(Yt) + φi(Y′t)) / 2 ) Rt.
  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 77/82
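
A minimal sketch of Bellman Residual Minimization with the double-sampling trick. Assumed available: a state feature map phi(x) in R^d, a policy pi, a state sampler mu(), and a generative model sim(x, a) that can be queried twice independently at the same (x, a):

    import numpy as np

    def brm(phi, pi, mu, sim, gamma, d, n=1000):
        A_hat = np.zeros((d, d))
        b_hat = np.zeros(d)
        for _ in range(n):
            x = mu()
            a = pi(x)
            y1, r = sim(x, a)                      # first independent next state
            y2, _ = sim(x, a)                      # second independent next state
            z1 = phi(x) - gamma * phi(y1)
            z2 = phi(x) - gamma * phi(y2)
            A_hat += np.outer(z1, z2)              # A_{i,j}: (phi_i(X)-g phi_i(Y)) (phi_j(X)-g phi_j(Y'))
            b_hat += (phi(x) - gamma * (phi(y1) + phi(y2)) / 2) * r
        alpha = np.linalg.solve(A_hat, b_hat)      # V_BR(x) ~ phi(x) @ alpha
        return alpha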

slide-136
SLIDE 136

BRM: the approximation error

Proof. We relate the Bellman residual to the approximation error as

    V^π − V = V^π − T^πV + T^πV − V = γP^π(V^π − V) + T^πV − V
    (I − γP^π)(V^π − V) = T^πV − V.

Taking the norm on both sides we obtain

    ||V^π − V_BR|| ≤ ||(I − γP^π)^{−1}|| ||T^πV_BR − V_BR||

and

    ||T^πV_BR − V_BR|| = inf_{V∈F} ||T^πV − V|| ≤ (1 + γ||P^π||) inf_{V∈F} ||V^π − V||.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 78/82

slide-137
SLIDE 137

BRM: the approximation error

Proof. If we consider the stationary distribution µ^π, then ||P^π||_{µ^π} = 1.

The matrix (I − γP^π)^{−1} can be written as the power series Σ_{t≥0} γ^t (P^π)^t. Applying the norm we obtain

    ||(I − γP^π)^{−1}||_{µ^π} ≤ Σ_{t≥0} γ^t ||P^π||^t_{µ^π} ≤ 1/(1 − γ)

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 79/82

slide-138
SLIDE 138

LSTD vs BRM

◮ Different assumptions: BRM requires a generative model, LSTD requires a single trajectory.
◮ The performance is evaluated differently: BRM under any distribution, LSTD under the stationary distribution µ^π.

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 80/82

slide-139
SLIDE 139

Bibliography I

  • A. LAZARIC – Reinforcement Learning Algorithms

Dec 2nd, 2014 - 81/82

slide-140
SLIDE 140

Reinforcement Learning

Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr