Incremental Basis Construction from Temporal Difference Error

Yi Sun, Faustino Gomez, Mark Ring, Jürgen Schmidhuber

IDSIA, USI & SUPSI, Switzerland

June 2011

Sun,Gomez,Ring,Schmidhuber (IDSIA) Incremental Basis Construction 06/11 1 / 17

Preliminary

A Markov Reward Process (MRP) is defined by the 4-tuple $\langle \mathcal{S}, P, r, \gamma \rangle$:

  • $\mathcal{S} = \{1, \dots, S\}$ is the state space
  • $P$ is an $S \times S$ transition matrix with $P_{i,j} = \Pr[s_{t+1} = j \mid s_t = i]$
  • $r \in \mathbb{R}^S$ is the reward function
  • $\gamma \in [0, 1)$ is the discount factor

The value function $v \in \mathbb{R}^S$ is the solution of the Bellman equation $v = r + \gamma P v$. Let $L = I - \gamma P$; then $v = L^{-1} r$.
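As a minimal sketch in pure Python, the value function of a small MRP can be obtained by solving the linear system $Lv = r$ instead of forming $L^{-1}$ explicitly. The three-state cycle, its reward, and $\gamma = 0.9$ are hypothetical values chosen for illustration, and `solve` and `matvec` are helper names introduced here, not part of the slides.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


# Hypothetical MRP: deterministic cycle 0 -> 1 -> 2 -> 0, reward 1 in state 0.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
r = [1.0, 0.0, 0.0]
gamma = 0.9

S = len(P)
L = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(S)] for i in range(S)]
v = solve(L, r)  # v = L^{-1} r

# The solution satisfies the Bellman equation v = r + gamma * P v.
Pv = matvec(P, v)
residual = max(abs(v[i] - r[i] - gamma * Pv[i]) for i in range(S))
print(v, residual)
```

For this cycle, $v_0 = 1/(1 - \gamma^3)$, which the solver reproduces up to rounding.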

Preliminary

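Linear function approximation (LFA): $\hat v = \Phi\theta$, where

  • $\Phi = [\phi_1, \dots, \phi_N]$ are $N$ ($N \ll S$) basis functions
  • $\theta = [\theta_1, \dots, \theta_N]^\top$ are the weights

The Bellman error $\varepsilon \in \mathbb{R}^S$ is defined as $\varepsilon = r + \gamma P \hat v - \hat v = r - L\Phi\theta$.

  • $\varepsilon \equiv 0 \iff v \equiv \Phi\theta$
  • $\varepsilon$ is the expectation of the TD error, given the current state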
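The link between the Bellman error and the TD error can be checked directly: averaging the one-step TD error $\delta = r(s) + \gamma \hat v(s') - \hat v(s)$ over next states $s' \sim P(s, \cdot)$ gives $\varepsilon(s)$. The MRP, the single basis function, and the weight below are hypothetical, and the reward is assumed to depend only on the current state, as in the MRP definition.

```python
# Hypothetical 3-state MRP, one basis function, and an arbitrary weight.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
r = [1.0, 0.0, 2.0]
gamma = 0.9
Phi = [[1.0], [0.5], [2.0]]  # S x N basis matrix with N = 1
theta = [3.0]

S = len(P)
v_hat = [sum(f * w for f, w in zip(Phi[s], theta)) for s in range(S)]

# Bellman error: eps = r + gamma * P v_hat - v_hat, i.e. r - L Phi theta.
eps = [r[s] + gamma * sum(P[s][t] * v_hat[t] for t in range(S)) - v_hat[s]
       for s in range(S)]


def td_error(s, s_next):
    """TD error of a single transition s -> s_next, with reward r[s]."""
    return r[s] + gamma * v_hat[s_next] - v_hat[s]


# Expected TD error: average over next states, weighted by the row P[s].
expected_td = [sum(P[s][t] * td_error(s, t) for t in range(S)) for s in range(S)]
print(eps, expected_td)
```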

Preliminary

The LFA $\hat v = \Phi\theta$ depends on both $\theta$ and $\Phi$.

To find $\theta$:

  • TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc.

To construct $\Phi$:

  • Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
  • Proto-value basis functions (Mahadevan et al., 2006)
  • Reduced-rank predictive state representations (Boots and Gordon, 2010)
  • L1-regularized feature selection (Kolter and Ng, 2009)
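To make the "find $\theta$" step concrete, here is a minimal LSTD sketch in pure Python: it accumulates $A = \sum_t \phi(s_t)(\phi(s_t) - \gamma\phi(s_{t+1}))^\top$ and $b = \sum_t \phi(s_t) r_t$ along one sampled trajectory and solves $A\theta = b$. The MRP, trajectory length, seed, and tabular features are hypothetical choices; with tabular features on a deterministic cycle the estimate coincides with the exact value function, which makes the result easy to check.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def lstd(states, rewards, features, gamma):
    """LSTD: theta = A^{-1} b with A = sum phi_t (phi_t - gamma phi_{t+1})^T, b = sum phi_t r_t."""
    N = len(features[0])
    A = [[0.0] * N for _ in range(N)]
    b = [0.0] * N
    for t in range(len(states) - 1):
        f, g = features[states[t]], features[states[t + 1]]
        for i in range(N):
            b[i] += f[i] * rewards[t]
            for j in range(N):
                A[i][j] += f[i] * (f[j] - gamma * g[j])
    return solve(A, b)


# Hypothetical MRP (3-state cycle) sampled into a single trajectory.
P = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
r = [1.0, 0.0, 0.0]
gamma = 0.9
rng = random.Random(0)
states = [0]
for _ in range(300):
    states.append(rng.choices(range(len(P)), weights=P[states[-1]])[0])
rewards = [r[s] for s in states]

tabular = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
theta = lstd(states, rewards, tabular, gamma)
print(theta)
```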

Bellman Error Basis Functions

Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007)

Construction:

  • $\phi^{(1)} = r$
  • At stage $k \ge 1$:
    • Compute the TD fixpoint $\theta^{(k)}$ w.r.t. the $k$ current basis functions $\Phi^{(k)}$
    • Compute the Bellman error $\varepsilon^{(k)} = r - L\Phi^{(k)}\theta^{(k)}$
    • Expand: $\Phi^{(k+1)} = [\Phi^{(k)}, \varepsilon^{(k)}]$

Sequences of BEBFs form an orthogonal basis (Parr et al., 2007).

  • Given enough of them, any value function can be represented
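The construction can be sketched for a small, fully known MRP. This is a model-based illustration, not the authors' code: the TD fixpoint is computed in closed form from $\Phi^\top L \Phi \theta = \Phi^\top r$, i.e. with an unweighted projection ($D = I$). That matches the stationary weighting here only because the hypothetical four-state chain is doubly stochastic. `td_fixpoint`, `bellman_error`, and `bebf` are illustrative names.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def td_fixpoint(P, r, gamma, Phi):
    """theta with Phi^T (r - L Phi theta) = 0; unweighted projection (D = I) assumed."""
    S, N = len(Phi), len(Phi[0])
    LPhi = [[Phi[s][n] - gamma * sum(P[s][t] * Phi[t][n] for t in range(S))
             for n in range(N)] for s in range(S)]
    A = [[sum(Phi[s][i] * LPhi[s][j] for s in range(S)) for j in range(N)] for i in range(N)]
    b = [sum(Phi[s][i] * r[s] for s in range(S)) for i in range(N)]
    return solve(A, b)


def bellman_error(P, r, gamma, Phi, theta):
    """eps = r - L Phi theta = r + gamma P v_hat - v_hat."""
    S = len(r)
    v_hat = [sum(f * w for f, w in zip(Phi[s], theta)) for s in range(S)]
    return [r[s] + gamma * sum(P[s][t] * v_hat[t] for t in range(S)) - v_hat[s]
            for s in range(S)]


def bebf(P, r, gamma, max_basis, tol=1e-10):
    """Grow Phi with Bellman error basis functions, starting from phi^(1) = r."""
    Phi = [[x] for x in r]
    while True:
        theta = td_fixpoint(P, r, gamma, Phi)
        eps = bellman_error(P, r, gamma, Phi, theta)
        if len(Phi[0]) >= max_basis or max(abs(e) for e in eps) < tol:
            return Phi, theta, eps
        Phi = [row + [e] for row, e in zip(Phi, eps)]  # Phi^(k+1) = [Phi^(k), eps^(k)]


# Hypothetical 4-state MRP: stay with prob. 0.5, otherwise move to the next state.
S = 4
P = [[0.5 if t == s else 0.5 if t == (s + 1) % S else 0.0 for t in range(S)]
     for s in range(S)]
r = [1.0, 0.0, 0.0, 0.0]
Phi, theta, eps = bebf(P, r, gamma=0.9, max_basis=S)
print(len(Phi[0]), max(abs(e) for e in eps))
```

With four states, four BEBFs span $\mathbb{R}^4$, so the final Bellman error is numerically zero, and the columns of `Phi` come out mutually orthogonal because each new Bellman error is orthogonal to the current basis at the TD fixpoint.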

Bellman Error Basis Functions

Problem with BEBF:

  • Slow convergence when $\gamma \to 1$
  • Reason: BEBFs fail to take the transition structure into account

Theorem

Let $\hat J^{(k)}$ and $\hat J^{(k+1)}$ be the squared value errors corresponding to the BEBF basis functions $\Phi^{(k)}$ and $\Phi^{(k+1)}$. Then $\rho^{(k)} = \hat J^{(k+1)} / \hat J^{(k)} \le \gamma^2$.

A Simple Example

  • $P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$
  • $r \in \mathbb{R}^2$ moves along the unit square
  • Start from an empty basis set; the first BEBF is the reward
  • The distance between the curve and the origin denotes $(\rho^{(1)})^{1/2}$
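For this two-state chain, one such error ratio can be computed in closed form. The sketch below assumes that the squared value error is the plain Euclidean $\lVert v - \hat v \rVert^2$ (uniform weighting, which is the stationary distribution of this $P$) and compares the empty basis ($\hat v = 0$) with the basis $[r]$ at its TD fixpoint; the slides do not spell out the norm, so this need not match their figure exactly. `error_ratio` is a hypothetical helper name.

```python
def error_ratio(r, gamma):
    """Squared value error with basis [r] (TD fixpoint) over that of the empty basis.

    Specialised to the 2-state swap chain P = [[0, 1], [1, 0]]; Euclidean norm assumed.
    """
    det = 1.0 - gamma ** 2
    # v = L^{-1} r with L = [[1, -gamma], [-gamma, 1]]
    v = [(r[0] + gamma * r[1]) / det, (gamma * r[0] + r[1]) / det]
    Lr = [r[0] - gamma * r[1], r[1] - gamma * r[0]]
    # TD fixpoint for the single basis function r: theta = (r . r) / (r . L r)
    theta = (r[0] ** 2 + r[1] ** 2) / (r[0] * Lr[0] + r[1] * Lr[1])
    err = [v[0] - theta * r[0], v[1] - theta * r[1]]
    return (err[0] ** 2 + err[1] ** 2) / (v[0] ** 2 + v[1] ** 2)


gamma = 0.9
for r in [(1.0, 0.0), (1.0, 1.0), (1.0, -1.0), (1.0, 0.5)]:
    print(r, error_ratio(r, gamma))
```

For $r = (1, 0)$ the ratio equals $\gamma^2$ exactly, while for $r$ along an eigenvector of $P$, such as $(1, 1)$ or $(1, -1)$, the reward alone already represents $v$ and the ratio is 0.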

V-BEBF: Main Idea

Fix $\hat v = \Phi\theta$ as the current value function estimate. Then:

  • Adding $\phi = v - \hat v$ with weight 1 eliminates the error completely
  • A simple derivation gives $\phi = v - \Phi\theta = L^{-1}r - L^{-1}L\Phi\theta = L^{-1}(r - L\Phi\theta) = L^{-1}\varepsilon$
  • Observe: $\phi$ is the solution of the Bellman equation $\phi = \varepsilon + \gamma P\phi$
    • $\phi$ is the value function of the Bellman error (V-BEBF)
    • $\phi$ can be estimated by any RL algorithm, using the TD error as the reward
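The derivation can be checked numerically: solve $L\phi = \varepsilon$ for the V-BEBF and add it to a deliberately poor estimate with weight 1. The MRP, basis, and weight are hypothetical, and $\phi$ is computed exactly here with a small Gaussian-elimination helper (`solve`, introduced for this sketch) rather than estimated by an RL algorithm.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


# Hypothetical 3-state MRP with a poor current estimate v_hat = Phi theta.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
r = [1.0, 0.0, 2.0]
gamma = 0.9
Phi = [[1.0], [0.5], [2.0]]
theta = [3.0]

S = len(P)
L = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(S)] for i in range(S)]
v = solve(L, r)  # true value function
v_hat = [sum(f * w for f, w in zip(Phi[s], theta)) for s in range(S)]
eps = [r[s] - sum(L[s][t] * v_hat[t] for t in range(S)) for s in range(S)]  # r - L Phi theta

phi = solve(L, eps)                                    # V-BEBF: phi = L^{-1} eps
v_new = [v_hat[s] + 1.0 * phi[s] for s in range(S)]    # add phi with weight 1
print(max(abs(a - b) for a, b in zip(v_new, v)))
```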

V-BEBF: Comparison to BEBF

  • Both are reward-sensitive: both are built from the Bellman error
  • When computed exactly, representing a value function may require a long sequence of BEBFs, but a single V-BEBF is enough
  • When approximated, the sequence of V-BEBFs converges much faster than the sequence of BEBFs as $\gamma \to 1$

V-BEBF: Framework

V-BEBF suggests a natural way to organize RL learners in a hierarchy:

  • A primary learner builds its estimate on a set of basis functions and propagates its TD error to a secondary learner
  • The secondary learner estimates the value function of the TD error, which then becomes a new basis function for the primary learner
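A minimal online sketch of this two-level hierarchy, with both learners running TD(0). The slides leave the secondary algorithm open ("any RL algorithm"), so TD(0), the step sizes `alpha` and `beta`, the fixed stage length, and starting each new basis function at weight 1 are all assumptions; `run_hierarchy` is a hypothetical name.

```python
import random


def run_hierarchy(P, r, gamma, Psi, n_stages, steps, alpha, beta, seed=0):
    """Primary TD(0) learner on refined features, secondary TD(0) learner on raw features Psi.

    In each stage the secondary learner estimates the value function of the primary
    learner's TD error; its estimate Psi u is then appended as a new refined basis
    function with initial weight 1 (hypothetical schedule and step sizes).
    """
    rng = random.Random(seed)
    S, M = len(Psi), len(Psi[0])
    U = []      # U[k] holds the raw weights of refined basis function k
    theta = []  # primary learner's weights, one per refined basis function
    s = 0

    def features(state):
        # Refined features phi_k(state) = Psi[state] . U[k]
        return [sum(p * u for p, u in zip(Psi[state], uk)) for uk in U]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    for _ in range(n_stages):
        u = [0.0] * M  # secondary learner's weights for this stage
        for _ in range(steps):
            s_next = rng.choices(range(S), weights=P[s])[0]
            f, f_next = features(s), features(s_next)
            delta = r[s] + gamma * dot(f_next, theta) - dot(f, theta)  # primary TD error
            theta = [t + alpha * delta * fi for t, fi in zip(theta, f)]
            # Secondary TD(0) update, using the primary TD error as its reward
            delta2 = delta + gamma * dot(Psi[s_next], u) - dot(Psi[s], u)
            u = [ui + beta * delta2 * gi for ui, gi in zip(u, Psi[s])]
            s = s_next
        U.append(u)
        theta.append(1.0)
    return U, theta


# Hypothetical example: the 3-state cycle with tabular raw features.
P = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
r = [1.0, 0.0, 0.0]
Psi = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
U, theta = run_hierarchy(P, r, 0.9, Psi, n_stages=2, steps=3000, alpha=0.1, beta=0.5)
v_hat = [sum(th * sum(p * u for p, u in zip(Psi[s], uk)) for th, uk in zip(theta, U))
         for s in range(3)]
print(v_hat)
```

In this toy case the primary learner starts with no basis functions, so its TD error is the reward itself, and the first secondary estimate already approximates the full value function.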

Incremental Basis Projection

We are given a set of $M$ raw basis functions $\Psi = [\psi_1, \dots, \psi_M]$.

  • From $\Psi$ we construct $N$ refined basis functions through a linear mapping: $\Phi = [\phi_1, \dots, \phi_N] = \Psi[w_1, \dots, w_N]$
  • IBP: construct one $w_k$ at stage $k$

$\hat v = \Phi\theta = \Psi W\theta$, with $W = [w_1, \dots, w_N]$
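The identity $\Phi\theta = \Psi(W\theta)$ means that refined weights $\theta$ correspond to raw weights $W\theta$. The small check below uses made-up values for $\Psi$ ($S = 4$, $M = 3$), $W$ ($N = 2$), and $\theta$; `matmul` and `matvec` are helpers introduced here.

```python
# Hypothetical raw features (S = 4, M = 3) and refinement weights W (M x N, N = 2).
Psi = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0],
       [1.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
W = [[0.5, 0.0],
     [0.0, 2.0],
     [1.0, -1.0]]
theta = [2.0, 0.5]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


Phi = matmul(Psi, W)                    # refined basis: Phi = Psi W
v_refined = matvec(Phi, theta)          # v_hat = Phi theta
v_raw = matvec(Psi, matvec(W, theta))   # same estimate via raw weights W theta
print(v_refined, v_raw)                 # both [2.5, 2.5, 2.0, 1.5]
```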

Incremental Basis Projection

If the value function is a linear combination of the refined basis functions, it is also a linear combination of the raw basis functions. So why construct refined basis functions at all?

  • Small number of basis functions ⇒ fast convergence
  • Small number of basis functions ⇒ high estimation accuracy
  • Only the learner of the refined basis functions works on the raw basis functions, so the raw basis functions affect the estimate only indirectly

IBP with V-BEBF

  • Approximate each column $w_k$ so that $\Psi w_k$ approximates the V-BEBF
  • Impose sparsity constraints on $w_k$ to keep the computation tractable
    • Each refined basis function depends on only a handful of raw basis functions
    • In this work we simply choose $B \ll M$ entries of $w_k$ at random
  • Combine with LSTD to obtain a batch version ($O(M^{3/2})$ time, $O(M)$ storage), or with TD to obtain an online version ($O(MB)$)
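One batch IBP stage can be sketched as follows, assuming a fully known model: compute the current TD fixpoint and Bellman error, pick $B$ raw features at random, and set the non-zero entries of $w_k$ to the TD fixpoint of the MRP whose reward is $\varepsilon$ (the quantity LSTD estimates from samples). The unweighted projection ($D = I$), the hypothetical doubly stochastic chain, and the names `td_fixpoint` and `ibp_v_stage` are assumptions of this sketch, not the authors' implementation.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def td_fixpoint(P, reward, gamma, F):
    """Model-based TD fixpoint of features F (S x N) for a reward vector; D = I assumed."""
    S, N = len(F), len(F[0])
    LF = [[F[s][n] - gamma * sum(P[s][t] * F[t][n] for t in range(S)) for n in range(N)]
          for s in range(S)]
    A = [[sum(F[s][i] * LF[s][j] for s in range(S)) for j in range(N)] for i in range(N)]
    b = [sum(F[s][i] * reward[s] for s in range(S)) for i in range(N)]
    return solve(A, b)


def ibp_v_stage(P, r, gamma, Psi, W, B, rng):
    """Append a column w_k with at most B non-zeros so that Psi w_k approximates the V-BEBF.

    W is a list of columns, each of length M. Returns the Bellman error that was used.
    """
    S, M = len(Psi), len(Psi[0])
    v_hat = [0.0] * S
    if W:
        Phi = [[sum(Psi[s][m] * w[m] for m in range(M)) for w in W] for s in range(S)]
        theta = td_fixpoint(P, r, gamma, Phi)
        v_hat = [sum(f * th for f, th in zip(Phi[s], theta)) for s in range(S)]
    eps = [r[s] + gamma * sum(P[s][t] * v_hat[t] for t in range(S)) - v_hat[s]
           for s in range(S)]
    idx = sorted(rng.sample(range(M), B))           # B raw features chosen at random
    sub = [[Psi[s][m] for m in idx] for s in range(S)]
    coef = td_fixpoint(P, eps, gamma, sub)          # value function of eps on those features
    w = [0.0] * M
    for m, c in zip(idx, coef):
        w[m] = c
    W.append(w)
    return eps


# Hypothetical doubly stochastic 4-state chain with tabular raw features (M = S).
S = 4
P = [[0.5 if t == s else 0.5 if t == (s + 1) % S else 0.0 for t in range(S)]
     for s in range(S)]
r = [1.0, 0.0, 0.0, 0.0]
Psi = [[1.0 if i == j else 0.0 for j in range(S)] for i in range(S)]
W, rng = [], random.Random(0)
eps1 = ibp_v_stage(P, r, 0.9, Psi, W, B=S, rng=rng)
eps2 = ibp_v_stage(P, r, 0.9, Psi, W, B=S, rng=rng)
print(max(abs(e) for e in eps2))  # Bellman error after one exact V-BEBF
```

With $B = M$ and tabular raw features the first column is the exact V-BEBF, so the next Bellman error vanishes; with $B \ll M$ each column only approximates it.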

Experiments

  • Randomly generated MRPs with 500 states and branching factor 5
  • Randomly generated binary raw basis functions (30% non-zero)
  • Error measured as the mean-squared value error w.r.t. the LSTD solution
  • In the batch case, $B = N = \sqrt{M}$ and the training trajectory length is 5000
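A sketch of how such a test problem might be generated. The slides give only the sizes; choosing successors uniformly without repetition, drawing transition probabilities from normalized uniform weights, drawing rewards uniformly from $[0, 1)$, and making each feature entry non-zero independently with probability 0.3 are all assumptions, and `random_mrp` and `random_binary_features` are hypothetical names.

```python
import random


def random_mrp(S=500, branching=5, seed=0):
    """Random MRP in which every state has `branching` distinct successor states."""
    rng = random.Random(seed)
    P = []
    for _ in range(S):
        succ = rng.sample(range(S), branching)   # assumption: successors uniform, no repeats
        w = [1.0 - rng.random() for _ in succ]    # in (0, 1], so every successor keeps p > 0
        total = sum(w)
        row = [0.0] * S
        for j, wj in zip(succ, w):
            row[j] = wj / total
        P.append(row)
    r = [rng.random() for _ in range(S)]          # assumption: rewards uniform in [0, 1)
    return P, r


def random_binary_features(S, M, density=0.3, seed=1):
    """S x M binary raw basis functions; each entry is 1 with probability `density`."""
    rng = random.Random(seed)
    return [[1.0 if rng.random() < density else 0.0 for _ in range(M)] for _ in range(S)]


P, r = random_mrp()
Psi = random_binary_features(len(P), M=200)
```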

Batch

[Figure: batch results for M = 200 and M = 1000, with γ = 0.99 and γ = 0.999; x-axis: no. refined basis functions; methods: IBP-V, IBP-B, RFP]

Online

[Figure: online results for M = 200 and M = 1000, with γ = 0.99 and γ = 0.999; x-axis: time steps; methods: IBP-V-sps, IBP-V, IBP-B-sps, IBP-B, TD, IBP-V-cor]

Conclusion

Simple method for incrementally building up basis functions: just use the value function of the Bellman error

  • Rather effective compared to BEBF when $\gamma \to 1$
  • Extensions:
    • Deeper hierarchy
    • Multiple secondary learners
    • Incorporating memory for the secondary learner
