
SLIDE 1

On Conservative Policy Iteration

Bruno Scherrer

INRIA Lorraine, LORIA

ICML 2014

1 / 13

SLIDE 2

Motivation / Context

Large Markov Decision Process
A policy space Π
A reference policy π ∈ Π
On-Policy data from π

Can we compute a provably better policy?

Conservative Policy Iteration (Kakade & Langford, 2002; Kakade, 2003)
When local (gradient) optimization induces a (good) global performance guarantee

2 / 13

SLIDE 5

Outline

1. Markov Decision Processes
2. Conservative Policy Iteration
3. Practical Issues for a Guaranteed Improvement

3 / 13

SLIDE 7

Infinite-Horizon Markov Decision Process

(Puterman, 1994; Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998)

Markov Decision Process (MDP):

X is the state space,
A is the action space,
r : X → R is the reward function, ($r_t = r(x_t)$)
p : X × A → ∆X is the transition function. ($x_{t+1} \sim p(\cdot \mid x_t, a_t)$)

Problem: Find a policy π : X → A that maximizes the value $v_\pi(x)$ for all x:

$$v_\pi(x) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; x_0 = x,\ \{\forall t,\ a_t = \pi(x_t)\} \right], \qquad \gamma \in (0, 1).$$
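As an illustration of the value definition, the sketch below estimates the discounted return of a deterministic policy by a single truncated rollout. The 2-state "stay/switch" MDP, the array layout `P[a, x, y]`, and the function name `rollout_return` are assumptions made for this example, not part of the slides.

```python
import numpy as np

def rollout_return(P, r, policy, x0, gamma, horizon, rng):
    """Truncated discounted return sum_{t < horizon} gamma^t r(x_t).

    P[a, x, y] = p(y | x, a), r[x] = r(x), policy[x] = pi(x) (deterministic).
    """
    x, total, discount = x0, 0.0, 1.0
    for _ in range(horizon):
        total += discount * r[x]                     # r_t = r(x_t)
        x = rng.choice(len(r), p=P[policy[x], x])    # x_{t+1} ~ p(. | x_t, a_t)
        discount *= gamma
    return total

# Hypothetical 2-state, 2-action MDP: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([0.0, 1.0])        # reward 1 only in state 1
gamma = 0.9
policy = np.array([1, 0])       # switch out of state 0, then stay in state 1

rng = np.random.default_rng(0)
estimate = rollout_return(P, r, policy, x0=0, gamma=gamma, horizon=200, rng=rng)
exact = gamma / (1.0 - gamma)   # rewards 0, 1, 1, ... sum to gamma + gamma^2 + ...
print(estimate, exact)
```

Transitions are deterministic in this toy example, so one rollout suffices; with stochastic transitions one would average many rollouts.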

5 / 13

SLIDE 9

Notations

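For any policy π, $v_\pi$ is the unique solution of the Bellman equation:

$$\forall x,\quad v_\pi(x) = r(x) + \gamma \sum_{y \in X} p(y \mid x, \pi(x))\, v_\pi(y) \;\Leftrightarrow\; v_\pi = T_\pi v_\pi \;\Leftrightarrow\; v_\pi = r + \gamma P_\pi v_\pi \;\Leftrightarrow\; v_\pi = (I - \gamma P_\pi)^{-1} r.$$

The optimal value $v_*$ is the unique solution of the Bellman optimality equation:

$$\forall x,\quad v_*(x) = \max_{a \in A} \Big[ r(x) + \gamma \sum_{y \in X} p(y \mid x, a)\, v_*(y) \Big] \;\Leftrightarrow\; v_* = T v_* \;\Leftrightarrow\; v_* = \max_{\pi} T_\pi v_*.$$

π is a greedy policy w.r.t. v, written π = Gv, iff

$$\forall x,\quad \pi(x) \in \arg\max_{a \in A} \Big[ r(x) + \gamma \sum_{y \in X} p(y \mid x, a)\, v(y) \Big] \;\Leftrightarrow\; T_\pi v = T v.$$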
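The notations above can be sketched numerically for a small tabular MDP. The code below is a minimal illustration, assuming transitions stored as `P[a, x, y]` and a hypothetical 2-state "stay/switch" MDP; the helper names are invented for the example.

```python
import numpy as np

def policy_matrix(P, policy):
    """P_pi[x, y] = p(y | x, pi(x)) for a deterministic policy."""
    return P[policy, np.arange(P.shape[1])]

def evaluate(P, r, policy, gamma):
    """v_pi = (I - gamma P_pi)^{-1} r, computed with a linear solve."""
    return np.linalg.solve(np.eye(len(r)) - gamma * policy_matrix(P, policy), r)

def q_values(P, r, v, gamma):
    """Q[a, x] = r(x) + gamma * sum_y p(y | x, a) v(y)."""
    return r[None, :] + gamma * (P @ v)

def bellman_opt(P, r, v, gamma):
    """[T v](x) = max_a Q[a, x]."""
    return q_values(P, r, v, gamma).max(axis=0)

def greedy(P, r, v, gamma):
    """A greedy policy G v (ties broken toward the lowest action index)."""
    return q_values(P, r, v, gamma).argmax(axis=0)

# Hypothetical MDP: action 0 stays, action 1 switches; reward 1 in state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([0.0, 1.0])
gamma = 0.9

pi = np.array([1, 0])                  # switch out of state 0, stay in state 1
v_pi = evaluate(P, r, pi, gamma)       # approximately [9, 10]
print(v_pi, bellman_opt(P, r, v_pi, gamma), greedy(P, r, v_pi, gamma))
```

Here $v_\pi$ is a fixed point of T and π is greedy with respect to its own value, so π is optimal for this toy problem.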

6 / 13

SLIDE 11

Approximate Policy Iteration

(Exact) Policy Iteration: $\pi_{k+1} \leftarrow G v_{\pi_k}$ (where $v_{\pi_k} = T_{\pi_k} v_{\pi_k}$)

Guaranteed improvement in all states

π is (ε, ν)-approximately greedy with respect to v, written $\pi = G_\epsilon(\nu, v)$, iff

$$\nu^T (T v - T_\pi v) = \mathbb{E}_{x \sim \nu} \{ [Tv](x) - [T_\pi v](x) \} \le \epsilon.$$

API (Bertsekas & Tsitsiklis, 1996): $\pi_{k+1} \leftarrow G_\epsilon(\nu, v_{\pi_k})$

Performance may decrease in all states!
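A minimal sketch of exact policy iteration and of the greediness gap $\nu^T (T v - T_\pi v)$ on a tabular MDP. It assumes exact evaluation, transitions stored as `P[a, x, y]`, and a hypothetical 2-state MDP; `greedy_gap` is a name chosen for this example.

```python
import numpy as np

def evaluate(P, r, policy, gamma):
    """v_pi = (I - gamma P_pi)^{-1} r for a deterministic policy."""
    P_pi = P[policy, np.arange(len(r))]
    return np.linalg.solve(np.eye(len(r)) - gamma * P_pi, r)

def q_values(P, r, v, gamma):
    """Q[a, x] = r(x) + gamma * sum_y p(y | x, a) v(y)."""
    return r[None, :] + gamma * (P @ v)

def policy_iteration(P, r, gamma, policy, max_iters=100):
    """Exact PI: pi_{k+1} <- G v_{pi_k}, stopped when the policy is stable."""
    for _ in range(max_iters):
        v = evaluate(P, r, policy, gamma)
        new_policy = q_values(P, r, v, gamma).argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, evaluate(P, r, policy, gamma)

def greedy_gap(P, r, v, gamma, policy, nu):
    """nu^T (T v - T_pi v); pi is (eps, nu)-approximately greedy iff this is <= eps."""
    q = q_values(P, r, v, gamma)
    return float(nu @ (q.max(axis=0) - q[policy, np.arange(len(r))]))

# Hypothetical MDP: action 0 stays, action 1 switches; reward 1 in state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([0.0, 1.0])
gamma, nu = 0.9, np.array([0.5, 0.5])

stay = np.array([0, 0])
v_stay = evaluate(P, r, stay, gamma)                    # approximately [0, 10]
gap_stay = greedy_gap(P, r, v_stay, gamma, stay, nu)    # 0.5 * 9 = 4.5
pi_star, v_star = policy_iteration(P, r, gamma, stay)
gap_star = greedy_gap(P, r, v_star, gamma, pi_star, nu)
print(gap_stay, pi_star, v_star, gap_star)
```

The gap is an average under ν, so a policy can satisfy the ε-criterion while being far from greedy in states that ν rarely visits.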

8 / 13

SLIDE 14

Conservative Policy Iteration as a Projected Gradient Ascent Algorithm

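π: current policy
π′: alternative policy
$\pi_\alpha = (1 - \alpha)\pi + \alpha\pi'$: α-mixture of π and π′

Taylor expansion of $\alpha \mapsto \nu^T v_{\pi_\alpha} = \mathbb{E}_{x \sim \nu}[v_{\pi_\alpha}(x)]$ around α = 0:

$$
\begin{aligned}
\nu^T (v_{\pi_\alpha} - v_\pi) &= \nu^T \big[ (I - \gamma P_{\pi_\alpha})^{-1} r - v_\pi \big] \\
&= \nu^T (I - \gamma P_{\pi_\alpha})^{-1} (r - v_\pi + \gamma P_{\pi_\alpha} v_\pi) \\
&= \nu^T \big[ (I - \gamma P_\pi)^{-1} + O(\alpha) \big] (T_{\pi_\alpha} v_\pi - T_\pi v_\pi) \\
&= \nu^T \big[ (I - \gamma P_\pi)^{-1} + O(\alpha) \big]\, \alpha\, (T_{\pi'} v_\pi - T_\pi v_\pi) \\
&= \frac{\alpha}{1 - \gamma}\, d_{\nu,\pi}^T (T_{\pi'} v_\pi - T_\pi v_\pi) + O(\alpha^2)
\end{aligned}
$$

with $d_{\nu,\pi}^T = (1 - \gamma)\, \nu^T (I - \gamma P_\pi)^{-1}$.

The steepest direction is $\pi' \in G v_\pi$.
Choosing $\pi' \in G_\epsilon(d_{\nu,\pi}, v_\pi)$ amounts to finding an approximately steepest direction.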
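The first-order term of this expansion can be checked numerically. The sketch below assumes stochastic policies stored as arrays `pi[x, a]`, transitions stored as `P[a, x, y]`, and a hypothetical 2-state MDP; it compares the predicted slope with a finite difference of $\nu^T v_{\pi_\alpha}$ at a small α.

```python
import numpy as np

def transition(P, pi):
    """P_pi[x, y] = sum_a pi[x, a] p(y | x, a)."""
    return np.einsum("xa,axy->xy", pi, P)

def value(P, r, pi, gamma):
    """v_pi = (I - gamma P_pi)^{-1} r."""
    return np.linalg.solve(np.eye(len(r)) - gamma * transition(P, pi), r)

def occupancy(P, pi, nu, gamma):
    """d_{nu,pi}^T = (1 - gamma) nu^T (I - gamma P_pi)^{-1}."""
    A = np.eye(len(nu)) - gamma * transition(P, pi)
    return (1.0 - gamma) * np.linalg.solve(A.T, nu)

def slope(P, r, pi, pi_alt, nu, gamma):
    """(1 / (1 - gamma)) d_{nu,pi}^T (T_{pi'} v_pi - T_pi v_pi)."""
    v = value(P, r, pi, gamma)
    # r depends on the state only, so T_{pi'} v - T_pi v = gamma (P_{pi'} - P_pi) v.
    diff = gamma * (transition(P, pi_alt) - transition(P, pi)) @ v
    return float(occupancy(P, pi, nu, gamma) @ diff) / (1.0 - gamma)

# Hypothetical MDP: action 0 stays, action 1 switches; reward 1 in state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([0.0, 1.0])
gamma, nu = 0.9, np.array([0.5, 0.5])

pi = np.array([[1.0, 0.0], [1.0, 0.0]])      # always stay
pi_alt = np.array([[0.0, 1.0], [1.0, 0.0]])  # switch in state 0, stay in state 1

predicted = slope(P, r, pi, pi_alt, nu, gamma)
alpha = 1e-6
pi_alpha = (1.0 - alpha) * pi + alpha * pi_alt
gain = float(nu @ (value(P, r, pi_alpha, gamma) - value(P, r, pi, gamma)))
print(predicted, gain / alpha)   # the two numbers should nearly agree
```

The remaining discrepancy shrinks linearly with α, which is what a second-order remainder predicts.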

9 / 13

SLIDE 22

Conservative Policy Iteration

CPI (Kakade & Langford, 2002; Kakade, 2003):

$$\pi'_{k+1} \leftarrow G_\epsilon(d_{\nu,\pi_k}, v_{\pi_k}), \qquad \pi_{k+1} \leftarrow (1 - \alpha)\pi_k + \alpha\pi'_{k+1}$$

Convergence to a local maximum... Either $d_{\nu,\pi_k}^T (T_{\pi'_{k+1}} v_{\pi_k} - T_{\pi_k} v_{\pi_k})$

is big: the slope is big and we make a lot of progress
is small (< ε): $\pi_k$ satisfies a relaxed optimality equation: $\pi_k \in G_{2\epsilon}(d_{\nu,\pi_k}, v_{\pi_k})$, which implies a global performance guarantee:

$$\mu^T (v_{\pi_*} - v_{\pi_k}) \le \frac{C_{\pi_*}}{(1 - \gamma)^2}\, (2\epsilon) \quad \text{where} \quad d_{\mu,\pi_*} \le C_{\pi_*}\, \nu.$$

This performance guarantee can be arbitrarily better than that known for Approximate PI (see also PSDP∞, which can be exponentially faster than CPI) (Scherrer, 2014).
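The CPI update can be sketched on a tabular MDP as follows. This is an idealized illustration, not the sampled algorithm: it assumes exact evaluation, an exactly greedy direction (ε = 0, in which case the weighting by $d_{\nu,\pi_k}$ plays no role), a fixed step size α, and a hypothetical 2-state MDP with transitions stored as `P[a, x, y]`.

```python
import numpy as np

def transition(P, pi):
    """P_pi[x, y] = sum_a pi[x, a] p(y | x, a) for a stochastic policy pi[x, a]."""
    return np.einsum("xa,axy->xy", pi, P)

def value(P, r, pi, gamma):
    """v_pi = (I - gamma P_pi)^{-1} r."""
    return np.linalg.solve(np.eye(len(r)) - gamma * transition(P, pi), r)

def cpi(P, r, pi, nu, gamma, alpha, n_iters):
    """Idealized CPI: pi_{k+1} <- (1 - alpha) pi_k + alpha pi'_{k+1}, with pi'_{k+1} = G v_{pi_k}."""
    n_actions = P.shape[0]
    history = []                                   # nu^T v_{pi_k} before each update
    for _ in range(n_iters):
        v = value(P, r, pi, gamma)
        history.append(float(nu @ v))
        q = r[None, :] + gamma * (P @ v)           # Q[a, x]
        pi_alt = np.eye(n_actions)[q.argmax(axis=0)]   # one-hot greedy policy, shape (X, A)
        pi = (1.0 - alpha) * pi + alpha * pi_alt       # conservative mixture step
    return pi, history

# Hypothetical MDP: action 0 stays, action 1 switches; reward 1 in state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([0.0, 1.0])
gamma, nu = 0.9, np.array([0.5, 0.5])

pi0 = np.array([[1.0, 0.0], [1.0, 0.0]])           # always stay
pi_final, history = cpi(P, r, pi0, nu, gamma, alpha=0.5, n_iters=30)
final_value = float(nu @ value(P, r, pi_final, gamma))
print(history[:3], final_value)                    # final_value approaches 9.5
```

With an exactly greedy direction every mixture step improves the value in all states; the conservative step matters once the direction is only approximately greedy and estimated from samples.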

10 / 13

SLIDE 27

Guaranteed Improvement ?

Recall the expansion of $\alpha \mapsto \nu^T v_{\pi_\alpha}$:

$$\nu^T (v_{\pi_\alpha} - v_\pi) = \frac{\alpha}{1 - \gamma}\, d_{\nu,\pi}^T (T_{\pi'} v_\pi - T_\pi v_\pi) + O(\alpha^2)$$

The remainder can be bounded explicitly, and the gradient amplitude replaced by its estimate:

$$\nu^T (v_{\pi_\alpha} - v_\pi) \;\ge\; \frac{\alpha}{1 - \gamma}\, d_{\nu,\pi}^T (T_{\pi'} v_\pi - T_\pi v_\pi) - \frac{2\gamma\alpha^2}{(1 - \gamma)^2}\, V_{\max} \;\ge\; \frac{\alpha}{1 - \gamma}\, (\hat{A} - \rho) - \frac{2\gamma\alpha^2}{(1 - \gamma)^2}\, V_{\max}$$

We need to estimate the direction $\pi' \in G_\epsilon(d_{\nu,\pi}, v_\pi)$ and the amplitude of the gradient:

$$\big| \hat{A} - d_{\nu,\pi}^T (T_{\pi'} v_\pi - T_\pi v_\pi) \big| \le \rho$$

Samples from $d_{\nu,\pi}$ ⇒ resets to ν
Samples from $T_{\pi'} v_\pi - T_\pi v_\pi$ ⇒ exploration
Optimize α.

Can we estimate $\hat{A}$ without a simulator/generative model?
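The "optimize α" step can be made concrete by maximizing the quadratic lower bound in α. The sketch below is one way to do it, assuming the bound above holds with the given $\hat{A}$, ρ and $V_{\max}$; the function name `conservative_step` and the numerical values are hypothetical. Setting the derivative to zero gives $\alpha^* = (1 - \gamma)(\hat{A} - \rho) / (4\gamma V_{\max})$ and a guaranteed gain of $(\hat{A} - \rho)^2 / (8\gamma V_{\max})$ when $\alpha^* \le 1$.

```python
def conservative_step(a_hat, rho, gamma, v_max):
    """Maximize alpha/(1-gamma) * (a_hat - rho) - 2*gamma*alpha^2/(1-gamma)^2 * v_max over [0, 1].

    Returns (alpha, guaranteed_gain). If a_hat - rho <= 0 the bound certifies
    no improvement, so the policy is left unchanged (alpha = 0).
    """
    margin = a_hat - rho
    if margin <= 0.0:
        return 0.0, 0.0
    # Unconstrained maximizer of the concave quadratic, clipped to a valid mixture weight.
    alpha = min(1.0, (1.0 - gamma) * margin / (4.0 * gamma * v_max))
    gain = alpha * margin / (1.0 - gamma) - 2.0 * gamma * alpha**2 * v_max / (1.0 - gamma) ** 2
    return alpha, gain

# Hypothetical numbers: estimated amplitude 1.0, accuracy 0.2, gamma 0.9, V_max 10.
alpha_star, gain_star = conservative_step(a_hat=1.0, rho=0.2, gamma=0.9, v_max=10.0)
print(alpha_star, gain_star)
```

The resulting step is small when γ is close to 1 or when the estimate is barely above its error ρ, which is the price of a guaranteed improvement.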

12 / 13

SLIDE 34

Summary

Can we improve a policy in a large MDP?

Yes: perform gradient ascent (CPI).

If we repeat the process, can we get stuck in local optima?

Yes, but any (approximately) local optimum satisfies a relaxed optimality equation that implies a nice global guarantee.

Can we do this without a simulator?

13 / 13

SLIDE 39

References I

Archibald, T., McKinnon, K., & Thomas, L. 1995. On the Generation of Markov Decision Processes. Journal of the Operational Research Society, 46, 354–361.
Bertsekas, D.P., & Tsitsiklis, J.N. 1996. Neuro-Dynamic Programming. Athena Scientific.
Kakade, S., & Langford, J. 2002. Approximately optimal approximate reinforcement learning. In: ICML.
Kakade, S.M. 2003. On the sample complexity of reinforcement learning. Ph.D. thesis, University College London.
Puterman, M. 1994. Markov Decision Processes. Wiley, New York.
Scherrer, B. 2014. Approximate Policy Iteration Schemes: A Comparison. In: ICML.

14 / 13

SLIDE 40

References II