

SLIDE 1

Reinforcement Learning: Approximate Dynamic Programming

Decision Making Under Uncertainty, Chapter 10

Christos Dimitrakakis

Chalmers

November 21, 2013

SLIDE 2

1. Introduction
   - Error bounds
   - Features

2. Approximate policy iteration
   - Estimation building blocks
   - The value estimation step
   - Policy estimation
   - Rollout-based policy iteration methods
   - Least Squares Methods

3. Approximate Value Iteration
   - Approximate backwards induction
   - State aggregation
   - Representative states

SLIDE 3

Introduction

Definition 1 (u-greedy policy and value function)

$\pi^*_u \in \arg\max_\pi L_\pi u, \qquad v^*_u = L u, \qquad (1.1)$

where $\pi : S \to \mathscr{D}(A)$ maps from states to action distributions.

Parametric value function estimation

$V_\Theta = \{v_\theta \mid \theta \in \Theta\}, \qquad \theta^* \in \arg\min_{\theta \in \Theta} \|v_\theta - u\|_\phi \qquad (1.2)$

where $\|\cdot\|_\phi \triangleq \int_S |\cdot| \,\mathrm{d}\phi$.

Parametric policy estimation

$\Pi_\Theta = \{\pi_\theta \mid \theta \in \Theta\}, \qquad \theta^* \in \arg\min_{\theta \in \Theta} \|\pi_\theta - \pi^*_u\| \qquad (1.3)$

where $\pi^*_u = \arg\max_{\pi \in \Pi} L_\pi u$.

SLIDE 4

Introduction / Error bounds

Theorem 2

Consider a finite MDP $\mu$ with discount factor $\gamma < 1$ and a vector $u \in V$ such that $\|u - V^*_\mu\|_\infty = \epsilon$. If $\pi$ is the u-greedy policy, then

$\|V^\pi_\mu - V^*_\mu\|_\infty \le \frac{2\gamma\epsilon}{1 - \gamma}.$

In addition, $\exists \epsilon_0 > 0$ s.t. if $\epsilon < \epsilon_0$, then $\pi$ is optimal.

SLIDE 5

Introduction / Features

Feature mapping $f : S \times A \to X$.

For $X \subset \mathbb{R}^n$, the feature mapping can be written in vector form:

$f(s, a) = (f_1(s, a), \ldots, f_n(s, a))^\top \qquad (1.4)$

Example 3 (Radial Basis Functions)

Let $d$ be a metric on $S \times A$ and $\{(s_i, a_i) \mid i = 1, \ldots, n\}$ a set of centres. Then we define each element of $f$ as:

$f_i(s, a) \triangleq \exp\{-d[(s, a), (s_i, a_i)]\}. \qquad (1.5)$

These functions are sometimes called kernels.

Example 4 (Tilings)

Let $G = \{X_1, \ldots, X_n\}$ be a partition of $S \times A$ of size $n$. Then:

$f_i(s, a) \triangleq \mathbb{I}\{(s, a) \in X_i\}. \qquad (1.6)$
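As a concrete illustration, a minimal Python sketch of both feature constructions follows. The centre list, the metric, and the toy usage at the end are hypothetical choices made for the example, not part of the slides.

```python
import numpy as np

def rbf_features(s, a, centres, d):
    """Radial basis features f_i(s,a) = exp(-d[(s,a),(s_i,a_i)]), as in (1.5)."""
    return np.array([np.exp(-d((s, a), (s_i, a_i))) for (s_i, a_i) in centres])

def tiling_features(s, a, tiles):
    """Tiling features f_i(s,a) = I{(s,a) in X_i}, as in (1.6).
    `tiles` is a list of membership tests, one per element of the partition."""
    return np.array([1.0 if in_tile(s, a) else 0.0 for in_tile in tiles])

# Toy usage: a one-dimensional state with two actions (hypothetical).
centres = [(0.0, 0), (0.5, 1), (1.0, 0)]
d = lambda x, y: abs(x[0] - y[0]) + float(x[1] != y[1])
print(rbf_features(0.4, 1, centres, d))   # three RBF activations
```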

SLIDE 6

Approximate policy iteration

Algorithm 1 Generic approximate policy iteration algorithm

input: initial value function $v_0$, approximate Bellman operator $\hat{L}$, approximate value estimator $\hat{V}$.
for $k = 1, \ldots$ do
    $\pi_k = \arg\min_{\pi \in \hat\Pi} \|\hat{L}_\pi v_{k-1} - L v_{k-1}\|$   // policy improvement
    $v_k = \arg\min_{v \in \hat{V}} \|v - V^{\pi_k}_\mu\|$   // policy evaluation
end for
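A minimal Python skeleton of Algorithm 1, assuming the caller supplies the two approximate steps as functions; the names `improve` and `evaluate` are illustrative, not from the slides.

```python
def approximate_policy_iteration(v0, improve, evaluate, n_iter=10):
    """Generic approximate policy iteration (Algorithm 1).
    improve(v):   returns a policy approximately greedy with respect to v.
    evaluate(pi): returns an approximation of V^pi in the chosen value class."""
    v, policies = v0, []
    for _ in range(n_iter):
        pi = improve(v)      # policy improvement step
        v = evaluate(pi)     # policy evaluation step
        policies.append(pi)
    return policies, v
```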

SLIDE 7

Approximate policy iteration

Theoretical guarantees

Assumption 1

Consider a discounted problem with discount factor $\gamma$ and iterates $v_k, \pi_k$ such that, for all $k$:

$\|v_k - V^{\pi_k}\|_\infty \le \epsilon \qquad (2.1)$

$\|L_{\pi_{k+1}} v_k - L v_k\|_\infty \le \delta \qquad (2.2)$

Theorem 5 ([6], Proposition 6.2)

Under Assumption 1,

$\limsup_{k \to \infty} \|V^{\pi_k} - V^*\|_\infty \le \frac{\delta + 2\gamma\epsilon}{(1 - \gamma)^2}. \qquad (2.3)$

SLIDE 8

Approximate policy iteration / Estimation building blocks

Lookahead policies

Single-step lookahead

$\pi_q(a \mid i) > 0$ iff $a \in \arg\max_{a' \in A} q(i, a') \qquad (2.4)$

$q(i, a) \triangleq r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a) \, u(j). \qquad (2.5)$

T-step lookahead

$\pi(i; q_T) = \arg\max_{a \in A} q_T(i, a), \qquad (2.6)$

where $u_k$ is recursively defined as:

$q_k(i, a) = r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a) \, u_{k-1}(j) \qquad (2.7)$

$u_k(i) = \max\{q_k(i, a) \mid a \in A\} \qquad (2.8)$

and $u_0 = u$.
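A small numpy sketch of the single-step and T-step lookahead computations for a finite MDP. The array layout (reward matrix `r` of shape S x A, transition tensor `P` of shape S x A x S) is an assumption made for the example.

```python
import numpy as np

def q_from_u(r, P, u, gamma):
    """q(i,a) = r(i,a) + gamma * sum_j P(j|i,a) u(j), eq. (2.5)."""
    return r + gamma * P @ u                 # shape (S, A)

def one_step_lookahead_policy(r, P, u, gamma):
    """A deterministic pi_q: in each state, pick a maximiser of q(i, .), cf. (2.4)."""
    return np.argmax(q_from_u(r, P, u, gamma), axis=1)

def t_step_lookahead_policy(r, P, u0, gamma, T):
    """T-step lookahead, eqs. (2.6)-(2.8): back up u0 for T steps, act greedily on q_T."""
    u = u0
    for _ in range(T):                       # assumes T >= 1
        q = q_from_u(r, P, u, gamma)
        u = q.max(axis=1)
    return np.argmax(q, axis=1)
```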

SLIDE 9

Approximate policy iteration / Estimation building blocks

Rollout policies

Rollout estimate of the q-factor

$q(i, a) = \frac{1}{K_i} \sum_{k=1}^{K_i} \sum_{t=0}^{T_k - 1} r(s_{t,k}, a_{t,k}),$

where $s_{t,k}, a_{t,k} \sim P^\pi_\mu(\cdot \mid s_0 = i, a_0 = a)$, and $T_k \sim \mathrm{Geom}(1 - \gamma)$.

Rollout policy estimation

Given a set of samples $q(i, a)$ for $i \in \hat{S}$, we estimate

$\min_\theta \|\pi_\theta - \pi^*_q\|_\phi,$

for some $\phi$ on $\hat{S}$.
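A minimal Monte Carlo sketch of the rollout q-factor estimate. The simulator interface (`env_step(s, a)` returning a reward and next state, `policy(s)` returning an action) is an assumed convention for the example; as on the slide, the random horizon drawn from Geom(1 - gamma) replaces explicit discounting.

```python
import numpy as np

def rollout_q(env_step, policy, s0, a0, gamma, n_rollouts, seed=0):
    """Average undiscounted return over n_rollouts trajectories of random
    length T_k ~ Geom(1 - gamma), each started from (s0, a0) and then
    following `policy`."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        T = rng.geometric(1.0 - gamma)   # horizon in {1, 2, ...}
        s, a, ret = s0, a0, 0.0
        for _ in range(T):               # rewards for t = 0, ..., T-1
            reward, s = env_step(s, a)
            ret += reward
            a = policy(s)
        total += ret
    return total / n_rollouts
```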

SLIDE 10

Approximate policy iteration / The value estimation step

Generalised linear model using features (or kernel)

Feature mapping $f : S \to \mathbb{R}^n$, parameters $\theta \in \mathbb{R}^n$.

$v_\theta(s) = \sum_{i=1}^{n} \theta_i f_i(s) \qquad (2.9)$

Fitting a value function

$c(\theta) = \sum_{s \in \hat{S}} c_s(\theta), \qquad c_s(\theta) = \phi(s) \, \|v_\theta(s) - v(s)\|_p^\kappa. \qquad (2.10)$

Example 6

The case $p = 2, \kappa = 2$:

$\theta'_j = \theta_j - 2\alpha\phi(s)\,[v_\theta(s) - v(s)]\, f_j(s). \qquad (2.11)$
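A one-line stochastic gradient step for the p = 2, kappa = 2 case of (2.10)-(2.11), sketched in Python. The target `v_target` stands for v(s), however it was obtained (e.g. a rollout estimate); the function signature is an assumption of the example.

```python
import numpy as np

def value_sgd_step(theta, f, s, v_target, phi_s, alpha):
    """theta_j <- theta_j - 2 * alpha * phi(s) * (v_theta(s) - v(s)) * f_j(s), eq. (2.11).
    f(s) returns the feature vector of s; phi_s is the weight phi(s)."""
    feats = np.asarray(f(s))
    v_theta = feats @ theta              # v_theta(s), eq. (2.9)
    return theta - 2.0 * alpha * phi_s * (v_theta - v_target) * feats
```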

SLIDE 11

Approximate policy iteration / Policy estimation

Generalised linear model using features (or kernel)

Feature mapping $f : S \times A \to \mathbb{R}^n$, parameters $\theta \in \mathbb{R}^n$.

$\pi_\theta(a \mid s) = \frac{g(s, a)}{h(s)}, \qquad g(s, a) = \sum_{i=1}^{n} \theta_i f_i(s, a), \qquad h(s) = \sum_{b \in A} g(s, b) \qquad (2.12)$

Fitting a policy through a cost function

$c(\theta) = \sum_{s \in \hat{S}} c_s(\theta), \qquad c_s(\theta) = \phi(s) \, \|\pi_\theta(\cdot \mid s) - \pi(\cdot \mid s)\|_p^\kappa. \qquad (2.13)$

The case $p = 1, \kappa = 1$:

$\theta'_j = \theta_j - \alpha\phi(s)\Big[\pi_\theta(a \mid s) \sum_{b \in A} f_j(s, b) - f_j(s, a)\Big]. \qquad (2.14)$
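A sketch of the ratio parameterisation (2.12) and the p = 1, kappa = 1 update (2.14). It assumes nonnegative features and parameters so that g(s, a)/h(s) is a valid probability; `actions`, `f`, and `phi_s` are caller-supplied and the names are illustrative.

```python
import numpy as np

def policy_probs(theta, f, s, actions):
    """pi_theta(a|s) = g(s,a) / sum_b g(s,b), with g(s,a) = theta . f(s,a), eq. (2.12)."""
    g = np.array([theta @ np.asarray(f(s, a)) for a in actions])
    return g / g.sum()

def policy_sgd_step(theta, f, s, a_target, actions, phi_s, alpha):
    """One update of eq. (2.14) towards the target action a_target."""
    pi_a = policy_probs(theta, f, s, actions)[actions.index(a_target)]
    sum_f = sum(np.asarray(f(s, b)) for b in actions)        # sum_b f(s, b), componentwise
    return theta - alpha * phi_s * (pi_a * sum_f - np.asarray(f(s, a_target)))
```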

SLIDE 12

Approximate policy iteration / Rollout-based policy iteration methods

Algorithm 2 Rollout Sampling Approximate Policy Iteration

for $k = 1, \ldots$ do
    Select a set of representative states $\hat{S}_k$.
    for $n = 1, \ldots$ do
        Select a state $s_n \in \hat{S}_k$ maximising $U_n(s)$ and perform a rollout.
        If $\hat{a}^*(s_n)$ is optimal w.p. $1 - \delta$, put $s_n$ in $\hat{S}_k(\delta)$ and remove it from $\hat{S}_k$.
    end for
    Calculate $q_k \approx Q^{\pi_k}$ from the rollouts.
    Train a classifier $\pi_{\theta_{k+1}}$ on the set of states $\hat{S}_k(\delta)$ with actions $\hat{a}^*(s)$.
end for

SLIDE 13

Approximate policy iteration / Least Squares Methods

Least squares value estimation

Projection

Setting $v = \Phi\theta$, where $\Phi$ is a feature matrix and $\theta$ is a parameter vector, we have

$\Phi\theta = r + \gamma P_{\mu,\pi} \Phi\theta \qquad (2.15)$

$\theta = [(I - \gamma P_{\mu,\pi})\Phi]^{-1} r \qquad (2.16)$

Replacing the inverse with the pseudo-inverse, with $A = (I - \gamma P_{\mu,\pi})\Phi$:

$\tilde{A}^{-1} \triangleq A^\top (A A^\top)^{-1}.$

Empirical constructions

Given a set of data points $\{(s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n\}$, which may not be consecutive, we define:

1. $r = (r_i)_i$.
2. $\Phi_i = f(s_i, a_i)$, $\Phi = (\Phi_i)_i$.
3. $P_{\mu,\pi} = P_\mu P_\pi$, with $P_{\mu,\pi}(i, j) = \mathbb{I}\{j = i + 1\}$.
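A short numpy sketch of the projection solve (2.15)-(2.16), assuming the reward vector r and the policy's transition matrix P are known exactly (the idealised setting; the sample-based version follows on the next slide). Using `numpy.linalg.pinv` stands in for the pseudo-inverse substitution described above.

```python
import numpy as np

def ls_value_weights(Phi, P, r, gamma):
    """Least-squares solution of Phi theta = r + gamma P Phi theta:
    theta = pinv((I - gamma P) Phi) r, cf. eqs. (2.15)-(2.16).
    Phi: (S, n) feature matrix, P: (S, S) transition matrix, r: (S,) rewards."""
    A = (np.eye(P.shape[0]) - gamma * P) @ Phi
    return np.linalg.pinv(A) @ r
```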

SLIDE 14

Approximate policy iteration / Least Squares Methods

Algorithm 3 LSTDQ: Least Squares Temporal Differences on q-factors

input: data $D = \{(s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n\}$, feature mapping $f$, policy $\pi$

$\theta = [(I - \gamma P_{\mu,\pi})\Phi]^{-1} r$

Algorithm 4 LSPI: Least Squares Policy Iteration

input: data $D = \{(s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n\}$, feature mapping $f$

Set $\pi_0$ arbitrarily.
for $k = 1, \ldots$ do
    $\theta_k = \mathrm{LSTDQ}(D, f, \pi_{k-1})$.
    $\pi_k = \pi^*_{\Phi\theta_k}$.
end for
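A sample-based sketch of LSTDQ and the LSPI loop in numpy. It uses the standard empirical form A = sum_i f(s_i,a_i)(f(s_i,a_i) - gamma f(s'_i, pi(s'_i)))^T, b = sum_i r_i f(s_i,a_i), theta = pinv(A) b, which is one common way to instantiate the matrix expression above; that exact form is an assumption of the example rather than a transcription of the slide.

```python
import numpy as np

def lstdq(data, f, policy, gamma, n_features):
    """Least Squares Temporal Differences on q-factors from a batch of transitions."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, r, s_next in data:
        phi = np.asarray(f(s, a))
        phi_next = np.asarray(f(s_next, policy(s_next)))
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.pinv(A) @ b

def lspi(data, f, actions, gamma, n_features, n_iter=10):
    """LSPI: alternate LSTDQ with greedy policy extraction from Q(s,a) = theta . f(s,a)."""
    theta = np.zeros(n_features)
    for _ in range(n_iter):
        greedy = lambda s, th=theta: max(actions, key=lambda a: th @ np.asarray(f(s, a)))
        theta = lstdq(data, f, greedy, gamma, n_features)
    return theta
```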

SLIDE 15

Approximate Value Iteration / Approximate backwards induction

$V^*_t(s) = \max_{a \in A} \big\{ r(s, a) + \gamma \, \mathbb{E}_\mu (V^*_{t+1} \mid s_t = s, a_t = a) \big\} \qquad (3.1)$

Iterative approximation

$\hat{V}_t(s) = \max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a) \, v_{t+1}(s') \Big] \qquad (3.2)$

$v_t = \arg\min \big\{ \|v - \hat{V}_t\| \;\big|\; v \in V \big\} \qquad (3.3)$

Online gradient estimation

$\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta \|v_t - \hat{V}_t\| \qquad (3.4)$
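A numpy sketch of one approximate backwards-induction step for a finite MDP with a linear value model v_theta = F theta. The squared-error choice for the norm in (3.4) is an assumption; the slide leaves the norm unspecified.

```python
import numpy as np

def bellman_backup(r, P, v_next, gamma):
    """hat V_t(s) = max_a [ r(s,a) + gamma sum_s' P(s'|s,a) v_{t+1}(s') ], eq. (3.2).
    r: (S, A), P: (S, A, S), v_next: (S,)."""
    return (r + gamma * P @ v_next).max(axis=1)

def fitted_backup_step(theta, F, r, P, v_next, gamma, alpha):
    """One gradient step of eq. (3.4) on 0.5 * ||F theta - hat V_t||^2.
    F: (S, n) feature matrix; returns the updated parameter vector."""
    target = bellman_backup(r, P, v_next, gamma)
    grad = F.T @ (F @ theta - target)
    return theta - alpha * grad
```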

SLIDE 16

Approximate Value Iteration / State aggregation

Aggregated estimate

Let $G = \{S_0, S_1, \ldots, S_n\}$ be a partition of $S$ (with $S_0$ possibly empty), let $\theta \in \mathbb{R}^n$, and let $f_k(s_t) = \mathbb{I}\{s_t \in S_k\}$. Then the approximate value function is

$v(s) = \theta(k), \quad \text{if } s \in S_k, \; k \neq 0. \qquad (3.5)$

Online gradient estimate

Consider the case $\|\cdot\| = \|\cdot\|_2^2$. For $s_t \in S_k$:

$\theta_{t+1}(k) = (1 - \alpha)\theta_t(k) + \alpha \max_{a \in A} \Big[ r(s_t, a) + \gamma \sum_j P(j \mid s_t, a) \, v_t(j) \Big] \qquad (3.6)$

For $s_t \notin S_k$:

$\theta_{t+1}(k) = \theta_t(k). \qquad (3.7)$
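A sketch of the aggregation update (3.6)-(3.7) for a finite MDP. It assumes S_0 is empty and a helper `aggregate_of(s)` that maps each state to its aggregate index in 0..n-1; both the helper and the array layout are assumptions of the example.

```python
import numpy as np

def aggregation_step(theta, s_t, r, P, aggregate_of, gamma, alpha):
    """Update the entry of theta for the aggregate containing s_t, eq. (3.6);
    all other entries are left unchanged, eq. (3.7).
    r: (S, A), P: (S, A, S); theta has one entry per aggregate."""
    n_states = P.shape[-1]
    v = np.array([theta[aggregate_of(j)] for j in range(n_states)])   # v_t induced by theta
    target = (r[s_t] + gamma * P[s_t] @ v).max()                       # bracketed term in (3.6)
    theta = theta.copy()
    k = aggregate_of(s_t)
    theta[k] = (1 - alpha) * theta[k] + alpha * target
    return theta
```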

SLIDE 17

Approximate Value Iteration / Representative states

Representative states approximation

Let $\hat{S}$ be a set of $n$ representative states, $\theta \in \mathbb{R}^n$, and $f$ a feature mapping such that

$\sum_{i=1}^{n} f_i(s) = 1, \quad \forall s \in S.$

Representative state update

For $i \in \hat{S}$:

$\theta_{t+1}(i) = \max_{a \in A} \Big[ r(i, a) + \gamma \int v_t(s) \,\mathrm{d}P(s \mid i, a) \Big] \qquad (3.8)$

with

$v_t(s) = \sum_{i=1}^{n} f_i(s) \, \theta_t(i). \qquad (3.9)$
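A sketch of one sweep of (3.8)-(3.9) for a finite state space, where a sum over states stands in for the integral. The array layout (representative states listed in `rep_states`, interpolation matrix `F` with rows summing to one) is an assumption of the example.

```python
import numpy as np

def representative_sweep(theta, rep_states, r, P, F, gamma):
    """theta(i) <- max_a [ r(i,a) + gamma * sum_s P(s|i,a) v_t(s) ] for each
    representative state i, with v_t = F theta, eqs. (3.8)-(3.9).
    r: (S, A), P: (S, A, S), F: (S, n) interpolation coefficients."""
    v = F @ theta                              # interpolated values v_t(s)
    new_theta = theta.copy()
    for idx, i in enumerate(rep_states):
        new_theta[idx] = (r[i] + gamma * P[i] @ v).max()
    return new_theta
```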

SLIDE 18

Approximate Value Iteration / Representative states

Bellman error methods

$\min_\theta \|v_\theta - L v_\theta\| \qquad (3.10)$

Gradient update

When the norm is $\|v_\theta - L v_\theta\| = \sum_{s \in \hat{S}} D_\theta(s)^2$, with

$D_\theta(s) = v_\theta(s) - \max_{a \in A} \Big[ r(s, a) + \gamma \int_S v_\theta(j) \,\mathrm{d}P(j \mid s, a) \Big], \qquad (3.11)$

then the gradient update becomes

$\theta_{t+1} = \theta_t - \alpha D_{\theta_t}(s_t) \nabla_\theta D_{\theta_t}(s_t) \qquad (3.12)$

$\nabla_\theta D_{\theta_t}(s_t) = \nabla_\theta v_{\theta_t}(s_t) - \gamma \int_S \nabla_\theta v_{\theta_t}(j) \,\mathrm{d}P(j \mid s_t, a^*_t) \qquad (3.13)$

$a^*_t = \arg\max_{a \in A} \Big[ r(s_t, a) + \gamma \int_S v_\theta(j) \,\mathrm{d}P(j \mid s_t, a) \Big] \qquad (3.14)$
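A numpy sketch of the Bellman residual gradient step (3.12)-(3.14) for a finite MDP and a linear model v_theta = F theta, with sums replacing the integrals. It follows the reconstruction of (3.11) above, i.e. the residual uses the full backup r + gamma P v.

```python
import numpy as np

def bellman_residual_step(theta, F, r, P, s_t, gamma, alpha):
    """One gradient step on 0.5 * D_theta(s_t)^2 for a linear value model.
    F: (S, n) features, r: (S, A), P: (S, A, S)."""
    v = F @ theta
    backups = r[s_t] + gamma * P[s_t] @ v          # r(s_t,a) + gamma sum_j P(j|s_t,a) v(j)
    a_star = int(np.argmax(backups))               # eq. (3.14)
    D = v[s_t] - backups[a_star]                   # residual D_theta(s_t), eq. (3.11)
    grad_D = F[s_t] - gamma * (P[s_t, a_star] @ F) # eq. (3.13)
    return theta - alpha * D * grad_D              # eq. (3.12)
```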

SLIDE 19

Approximate Value Iteration / Representative states

A litany of approximation algorithms

- Fitted Q-iteration [2].
- Fitted value iteration [16].
- Rollout sampling policy iteration [13].
- State aggregation [19, 4].
- Bellman error minimisation [1, 12, 14].
- Least-squares methods [8, 7, 15].

SLIDE 20

References

[1] A. Antos, C. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
[2] András Antos, Rémi Munos, and Csaba Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002.
[4] A. Bernstein. Adaptive state aggregation for reinforcement learning. Master's thesis, Technion – Israel Institute of Technology, 2007.
[5] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2001.
[6] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[7] J.A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233–246, 2002.
[8] S.J. Bradtke and A.G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
[9] Herman Chernoff. Sequential design of experiments. Annals of Mathematical Statistics, 30(3):755–770, 1959.

SLIDE 21

References

[10] Herman Chernoff. Sequential models for clinical trials. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4, pages 805–812. University of California Press, 1966.
[11] Morris H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, 1970.
[12] Christos Dimitrakakis. Monte-Carlo utility estimates for Bayesian reinforcement learning. In IEEE 52nd Annual Conference on Decision and Control (CDC 2013), 2013. arXiv:1303.2506.
[13] Christos Dimitrakakis and Michail G. Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning, 72(3):157–171, September 2008. Presented at ECML'08.
[14] Mohammad Ghavamzadeh and Yaakov Engel. Bayesian policy gradient algorithms. In NIPS 2006, 2006.
[15] M.G. Lagoudakis and R. Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003.
[16] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research, 9:815–857, 2008.
[17] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New Jersey, US, 1994.
[18] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[19] S. Singh, T. Jaakkola, and M.I. Jordan. Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems, pages 361–368, 1995.

SLIDE 22

References

[20] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML 2010, 2010.
[21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.