Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments




  1. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments
     Yi Sun, Faustino Gomez, Jürgen Schmidhuber
     IDSIA, USI & SUPSI, Switzerland. August 2011.

  2. Motivation
     - An intelligent agent is sent to explore an unknown environment.
     - It learns through sequential interactions.
     - Its time and resources are limited.
     - Question: how should the agent choose its actions so that it learns the environment as effectively as possible?
     - Example: learning the transition model of a Markovian environment from only 100 ⟨s, a, s′⟩ triples.

  3. Preliminary
     A Markov Reward Process (MRP) is defined by the 4-tuple ⟨S, P, r, γ⟩:
     - S = {1, …, S} is the state space.
     - P is an S × S transition matrix with {P}ᵢ,ⱼ = Pr[sₜ₊₁ = j ∣ sₜ = i].
     - r ∈ ℝ^S is the reward function.
     - γ ∈ [0, 1) is the discount factor.
     The value function v ∈ ℝ^S is the solution of the Bellman equation v = r + γPv.
     Let L = I − γP; then v = L⁻¹r.
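     As a concrete sanity check of v = L⁻¹r, here is a minimal sketch assuming numpy; the 3-state transition matrix and rewards are made-up toy values, not from the talk:

```python
import numpy as np

# Minimal sketch: value function of a toy 3-state MRP via v = L^{-1} r,
# with L = I - gamma * P. P and r below are illustrative, not from the slides.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])   # row-stochastic transition matrix
r = np.array([0.0, 1.0, -0.5])   # reward per state

L = np.eye(3) - gamma * P
v = np.linalg.solve(L, r)                  # solve L v = r
assert np.allclose(v, r + gamma * P @ v)   # v satisfies the Bellman equation
print(v)
```

     Note that np.linalg.solve is used rather than explicitly inverting L, which is both cheaper and numerically safer.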

  4. Preliminary
     Linear function approximation (LFA): v̂ = Φθ, where
     - Φ = [φ₁, …, φ_N] are N (N ≪ S) basis functions, and
     - θ = [θ₁, …, θ_N]ᵀ are the weights.
     The Bellman error ε ∈ ℝ^S is defined as ε = r + γPv̂ − v̂ = r − LΦθ.
     - ε ≡ 0 ⟺ v ≡ Φθ.
     - ε is the expectation of the TD error.
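     A minimal sketch of the Bellman error under LFA, reusing the toy MRP above; the two basis functions and the least-squares fit for θ are illustrative assumptions, not the talk's method:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
r = np.array([0.0, 1.0, -0.5])
L = np.eye(3) - gamma * P

# N = 2 basis functions over S = 3 states (N << S in realistic settings)
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
# One illustrative way to pick theta: least squares on r ~ L Phi theta
theta, *_ = np.linalg.lstsq(L @ Phi, r, rcond=None)

v_hat = Phi @ theta
eps = r + gamma * (P @ v_hat) - v_hat         # Bellman error of v_hat
assert np.allclose(eps, r - L @ Phi @ theta)  # equivalently, eps = r - L Phi theta
print(eps)
```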

  5. Preliminary
     The LFA v̂ = Φθ depends on both θ and Φ.
     To find θ: TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc. (a minimal LSTD sketch follows this list).
     To construct Φ:
     - Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
     - Proto-value basis functions (Mahadevan et al., 2006)
     - Reduced-rank predictive state representations (Boots and Gordon, 2010)
     - L1-regularized feature selection (Kolter and Ng, 2009)
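     Since the slides only name LSTD, here is a minimal LSTD(0) sketch for estimating θ from sampled transitions; the function name, the ridge term, and the (s, reward, s′) transition format are assumptions for illustration:

```python
import numpy as np

def lstd0(transitions, phi, gamma=0.9, reg=1e-6):
    """LSTD(0) sketch (after Bradtke et al., 1996): solve A theta = b with
    A = sum_t phi(s_t)(phi(s_t) - gamma * phi(s_{t+1}))^T,
    b = sum_t phi(s_t) * r_t."""
    n = len(phi(transitions[0][0]))
    A = reg * np.eye(n)            # small ridge term keeps A invertible
    b = np.zeros(n)
    for s, rew, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += rew * f
    return np.linalg.solve(A, b)
```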

  6. Bellman Error Basis Functions
     Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007).
     Construction:
     - φ(1) = r.
     - At stage k > 1, fit θ with the current basis and append the resulting Bellman error, φ(k) = r − LΦθ, as the next basis function.
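     A minimal sketch of this construction in the model-based setting above (P and r known); the least-squares fit for θ and the stopping tolerance are illustrative choices:

```python
import numpy as np

def bebf_basis(P, r, gamma, k_max, tol=1e-10):
    """Grow a basis by repeatedly appending the Bellman error:
    phi(1) = r; at stage k > 1, phi(k) = r - L Phi theta."""
    L = np.eye(len(r)) - gamma * P
    Phi = r.reshape(-1, 1)                        # phi(1) = r
    for _ in range(1, k_max):
        theta, *_ = np.linalg.lstsq(L @ Phi, r, rcond=None)
        eps = r - L @ (Phi @ theta)               # Bellman error of current fit
        if np.linalg.norm(eps) < tol:             # value function represented exactly
            break
        Phi = np.column_stack([Phi, eps])         # phi(k) = eps
    return Phi
```

     Each appended column is exactly the direction in which the current approximation fails to satisfy the Bellman equation, which is the sense in which the Bellman error "points towards" the value function.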
