
Incremental Basis Construction from Temporal Difference Error

Yi Sun, Faustino Gomez, Mark Ring, Jürgen Schmidhuber. IDSIA, USI & SUPSI, Switzerland. June 2011.


  1. Incremental Basis Construction from Temporal Difference Error. Yi Sun, Faustino Gomez, Mark Ring, Jürgen Schmidhuber. IDSIA, USI & SUPSI, Switzerland. June 2011. Sun, Gomez, Ring, Schmidhuber (IDSIA) Incremental Basis Construction 06/11 1 / 17

  9. Preliminary
     A Markov Reward Process (MRP) is defined by the 4-tuple $\langle \mathcal{S}, P, r, \gamma \rangle$:
     - $\mathcal{S} = \{1, \dots, S\}$ is the state space
     - $P$ is an $S \times S$ transition matrix with $\{P\}_{i,j} = \Pr[s_{t+1} = j \mid s_t = i]$
     - $r \in \mathbb{R}^S$ is the reward function
     - $\gamma \in [0, 1)$ is the discount factor
     The value function $v \in \mathbb{R}^S$ is the solution of the Bellman equation $v = r + \gamma P v$. Let $L = I - \gamma P$; then $v = L^{-1} r$.
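Solving the Bellman equation amounts to solving the linear system $Lv = r$ with $L = I - \gamma P$. The sketch below does this for a small MRP in plain Python. The two-state `P`, `r`, and `gamma` values and the helper names `solve` and `value_function` are illustrative assumptions, not taken from the slides.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def value_function(P, r, gamma):
    """Return v solving (I - gamma * P) v = r."""
    S = len(r)
    L = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(S)]
         for i in range(S)]
    return solve(L, r)


# Hypothetical two-state MRP, chosen only for illustration.
P = [[0.9, 0.1],
     [0.2, 0.8]]
r = [1.0, 0.0]
gamma = 0.9

v = value_function(P, r, gamma)

# Check the Bellman equation v = r + gamma * P v componentwise.
residual = [
    v[i] - (r[i] + gamma * sum(P[i][j] * v[j] for j in range(len(v))))
    for i in range(len(v))
]
```

Because $\gamma < 1$ and $P$ is a stochastic matrix, $L$ is invertible, so the solve is well defined for any valid MRP.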

  16. Preliminary
      Linear function approximation (LFA): $\hat{v} = \Phi\theta$, where
      - $\Phi = [\phi_1, \dots, \phi_N]$ are $N$ ($N \ll S$) basis functions
      - $\theta = [\theta_1, \dots, \theta_N]^\top$ are the weights
      The Bellman error $\varepsilon \in \mathbb{R}^S$ is defined as $\varepsilon = r + \gamma P \hat{v} - \hat{v} = r - L\Phi\theta$.
      - $\varepsilon \equiv 0 \iff v \equiv \Phi\theta$
      - $\varepsilon$ is the expectation of the TD error
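As a small numerical illustration of the Bellman error and its reading as an expected TD error, the sketch below computes $\varepsilon = r + \gamma P \hat{v} - \hat{v}$ for a fixed $\Phi$ and $\theta$, then compares it with the exact expectation of the one-step TD error $r_i + \gamma \hat{v}_j - \hat{v}_i$ over the next state $j$. The MRP, the basis `Phi`, the weights `theta`, and the function names are assumptions made for the example.

```python
def bellman_error(P, r, gamma, Phi, theta):
    """Return (eps, v_hat) with v_hat = Phi theta and eps = r + gamma P v_hat - v_hat."""
    S = len(r)
    v_hat = [sum(Phi[i][n] * theta[n] for n in range(len(theta))) for i in range(S)]
    eps = [
        r[i] + gamma * sum(P[i][j] * v_hat[j] for j in range(S)) - v_hat[i]
        for i in range(S)
    ]
    return eps, v_hat


def expected_td_error(P, r, gamma, v_hat):
    """Expectation over the next state j of the TD error r_i + gamma v_hat_j - v_hat_i."""
    S = len(r)
    return [
        sum(P[i][j] * (r[i] + gamma * v_hat[j] - v_hat[i]) for j in range(S))
        for i in range(S)
    ]


# Hypothetical two-state MRP and a hand-picked basis (rows are states).
P = [[0.9, 0.1],
     [0.2, 0.8]]
r = [1.0, 0.0]
gamma = 0.9
Phi = [[1.0, 0.0],
       [1.0, 1.0]]
theta = [2.0, 1.0]

eps, v_hat = bellman_error(P, r, gamma, Phi, theta)
delta_bar = expected_td_error(P, r, gamma, v_hat)
```

Here the reward is assumed to depend only on the current state, matching $r \in \mathbb{R}^S$; under that assumption `eps` and `delta_bar` agree state by state.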

  25. Preliminary
      The LFA $\hat{v} = \Phi\theta$ depends on both $\theta$ and $\Phi$.
      - To find $\theta$: TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc.
      - To construct $\Phi$:
        - Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
        - Proto-value basis functions (Mahadevan et al., 2006)
        - Reduced-rank predictive state representations (Boots and Gordon, 2010)
        - L1-regularized feature selection (Kolter and Ng, 2009)

  33. Bellman Error Basis Functions
      Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007).
      Construction:
      - $\phi^{(1)} = r$
      - At stage $k > 1$:
        - Compute the TD fixpoint $\theta^{(k)}$ w.r.t. the current $k$ basis functions $\Phi^{(k)}$
        - Get the Bellman error $\varepsilon^{(k)} = r - L\Phi^{(k)}\theta^{(k)}$
        - Expand: $\Phi^{(k+1)} = [\Phi^{(k)} \,\vdots\, \varepsilon^{(k)}]$
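The construction loop can be sketched with exact (model-based) quantities. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not say how states are weighted in the TD fixpoint, so the sketch assumes uniform weighting and solves $\Phi^\top L \Phi\,\theta = \Phi^\top r$. The three-state chain, the stopping tolerance, and all function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def td_fixpoint(basis, L, r):
    """Solve Phi^T L Phi theta = Phi^T r (uniform state weighting, an assumption)."""
    L_cols = [matvec(L, phi) for phi in basis]           # L phi_b for each basis column
    A = [[dot(phi_a, Lphi_b) for Lphi_b in L_cols] for phi_a in basis]
    return solve(A, [dot(phi, r) for phi in basis])


def build_bebf(P, r, gamma, max_basis, tol=1e-10):
    """Grow the basis by appending the current Bellman error at each stage."""
    S = len(r)
    L = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(S)]
         for i in range(S)]
    basis = [list(r)]                                    # phi^(1) = r
    while True:
        theta = td_fixpoint(basis, L, r)
        v_hat = [sum(t * phi[i] for t, phi in zip(theta, basis)) for i in range(S)]
        eps = [ri - lvi for ri, lvi in zip(r, matvec(L, v_hat))]
        # Stop before appending a (near) zero column, which would make the system singular.
        if max(abs(e) for e in eps) < tol or len(basis) >= max_basis:
            return basis, theta, v_hat, eps
        basis.append(eps)                                # Phi^(k+1) = [Phi^(k) : eps^(k)]


# Hypothetical three-state chain with reward only in the last state.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
r = [0.0, 0.0, 1.0]
gamma = 0.9

basis, theta, v_hat, eps = build_bebf(P, r, gamma, max_basis=3)
```

On this chain each new Bellman error lies outside the span of the previous basis functions, so three basis functions span the whole state space and the final Bellman error vanishes up to rounding.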
