Bayesian RL Tutorial: Gaussian Process Temporal Difference Learning

  1. Bayesian RL Tutorial

  2. Gaussian Process Temporal Difference Learning Yaakov Engel Collaborators: Shie Mannor, Ron Meir

  3. Why use GPs in RL?
     • A Bayesian approach to value estimation
     • Forces us to make our assumptions explicit
     • Non-parametric: priors are placed and inference is performed directly in function space (kernels)
     • But it can also be defined parametrically
     • Domain knowledge is intuitively coded in the prior
     • Provides a full posterior over values, not just point estimates
     • Efficient, on-line implementations, suitable for large problems

  4. Gaussian Processes
     Definition: "an indexed set of jointly Gaussian random variables".
     Note: the index set X may be just about any set. Example: F(x), with index x ∈ [0, 1]^n.
     F's distribution is specified by its mean and covariance:
         E[F(x)] = m(x),   Cov[F(x), F(x′)] = k(x, x′)
     Conditions on k: symmetric, positive definite ⇒ k is a Mercer kernel.
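
     A minimal numerical illustration of this definition: at any finite set of index points, a GP's values are jointly Gaussian with mean m(x) and covariance k(x, x′). The zero mean and squared-exponential kernel below are assumptions chosen for the sketch, not taken from the slides.

     ```python
     import numpy as np

     def rbf_kernel(X1, X2, length_scale=1.0):
         """Squared-exponential (Mercer) kernel k(x, x')."""
         d2 = (X1[:, None] - X2[None, :]) ** 2
         return np.exp(-0.5 * d2 / length_scale**2)

     # Finite set of index points in [0, 1]
     x = np.linspace(0.0, 1.0, 50)
     m = np.zeros_like(x)              # assumed mean function m(x) = 0
     K = rbf_kernel(x, x)              # covariance matrix K[i, j] = k(x_i, x_j)

     # Draw joint Gaussian samples of F over the 50 index points;
     # the small jitter keeps K numerically positive definite.
     rng = np.random.default_rng(0)
     samples = rng.multivariate_normal(m, K + 1e-10 * np.eye(len(x)), size=3)
     print(samples.shape)  # (3, 50)
     ```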

  5. Example: Parametric GP
     A linear combination of basis functions with random weights W = (W_1, ..., W_n)ᵀ:
         F(x) = φ(x)ᵀ W,   φ(x) = (φ_1(x), ..., φ_n(x))ᵀ
     If W ∼ N{m_w, C_w}, then F is a GP with
         E[F(x)] = φ(x)ᵀ m_w,   Cov[F(x), F(x′)] = φ(x)ᵀ C_w φ(x′)
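
     A short sketch of the parametric case. The feature map and the weight prior below are assumptions made for illustration; the two functions simply evaluate the stated prior moments.

     ```python
     import numpy as np

     def phi(x):
         """Assumed feature map: a bias term plus three fixed Gaussian bumps."""
         centers = np.array([-1.0, 0.0, 1.0])
         return np.concatenate(([1.0], np.exp(-0.5 * (x - centers) ** 2)))

     m_w = np.zeros(4)                    # assumed prior mean of W
     C_w = np.diag([1.0, 0.5, 0.5, 0.5])  # assumed prior covariance of W

     def prior_mean(x):
         return phi(x) @ m_w              # E[F(x)] = phi(x)^T m_w

     def prior_cov(x, xp):
         return phi(x) @ C_w @ phi(xp)    # Cov[F(x), F(x')] = phi(x)^T C_w phi(x')

     print(prior_mean(0.3), prior_cov(0.3, -0.2))
     ```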

  6. Conditioning – the Gauss-Markov Theorem
     Theorem: let Z and Y be random vectors jointly distributed according to the multivariate normal distribution
         (Z, Y)ᵀ ∼ N{ (m_z, m_y)ᵀ, [ C_zz  C_zy ; C_yz  C_yy ] }.
     Then Z | Y ∼ N{ Ẑ, P }, where
         Ẑ = m_z + C_zy C_yy⁻¹ (Y − m_y)
         P = C_zz − C_zy C_yy⁻¹ C_yz
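
     The theorem translates directly into a few lines of linear algebra. A generic sketch (block sizes and the use of a linear solve instead of an explicit inverse are implementation choices, not from the slides):

     ```python
     import numpy as np

     def condition_gaussian(m_z, m_y, C_zz, C_zy, C_yz, C_yy, y):
         """Posterior moments of Z given Y = y for jointly Gaussian (Z, Y)."""
         # Z_hat = m_z + C_zy C_yy^{-1} (y - m_y)
         z_hat = m_z + C_zy @ np.linalg.solve(C_yy, y - m_y)
         # P = C_zz - C_zy C_yy^{-1} C_yz
         P = C_zz - C_zy @ np.linalg.solve(C_yy, C_yz)
         return z_hat, P

     # Tiny example: scalar Z and Y with covariance 0.8
     m_z, m_y = np.zeros(1), np.zeros(1)
     C_zz, C_yy = np.eye(1), np.eye(1)
     C_zy = C_yz = 0.8 * np.eye(1)
     print(condition_gaussian(m_z, m_y, C_zz, C_zy, C_yz, C_yy, np.array([1.0])))
     ```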

  7. GP Regression
     Sample: ((x_1, y_1), ..., (x_t, y_t))
     Model equation: Y(x_i) = F(x_i) + N(x_i)
     GP prior on F: F ∼ N{0, k(·, ·)}
     Noise: the N(x_i) are IID zero-mean Gaussian with variance σ²_N
     (Graphical model: each observation Y(x_i) is the latent value F(x_i) plus noise N(x_i).)

  8. GP Regression (ctd.)
     Denote: Y_t = (Y(x_1), ..., Y(x_t))ᵀ,  k_t(x) = (k(x_1, x), ..., k(x_t, x))ᵀ,  K_t = [k_t(x_1), ..., k_t(x_t)].
     Then:
         (F(x), Y_t)ᵀ ∼ N{ (0, 0)ᵀ, [ k(x, x)  k_t(x)ᵀ ; k_t(x)  K_t + σ² I ] }
     Now apply the conditioning formula to compute the posterior moments of F(x), given Y_t = y_t = (y_1, ..., y_t)ᵀ.
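
     Putting the model and the conditioning formula together gives GP regression in a few lines. A minimal sketch, assuming a squared-exponential kernel, an assumed noise level, and noisy sinc-function targets roughly in the spirit of the example on the next slide:

     ```python
     import numpy as np

     def rbf(a, b, ell=1.0):
         return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

     def gp_posterior(x_train, y_train, x_query, sigma_n=0.1):
         """Posterior mean and variance of F at x_query given (x_i, y_i) pairs."""
         K_t = rbf(x_train, x_train)                   # K_t[i, j] = k(x_i, x_j)
         k_q = rbf(x_train, x_query)                   # k_t(x) for each query point
         A = K_t + sigma_n**2 * np.eye(len(x_train))   # K_t + sigma_N^2 I
         mean = k_q.T @ np.linalg.solve(A, y_train)    # k_t(x)^T (K_t + s^2 I)^{-1} y_t
         var = rbf(x_query, x_query).diagonal() - np.sum(k_q * np.linalg.solve(A, k_q), axis=0)
         return mean, var

     rng = np.random.default_rng(1)
     x_train = np.linspace(-5, 5, 20)
     y_train = np.sinc(x_train) + 0.1 * rng.standard_normal(20)
     print(gp_posterior(x_train, y_train, np.array([0.0, 2.5])))
     ```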

  9. Example
     (Figure: GP regression on the sinc function, showing the training set, the true SINC curve, the SGPR posterior mean and a σ confidence band; test error = 0.131.)

  10. Markov Decision Processes
      State space: X, state x ∈ X
      Action space: A, action a ∈ A
      Joint state-action space: Z = X × A, z = (x, a)
      Transition prob. density: x_{t+1} ∼ p(·| x_t, a_t)
      Reward prob. density: R(x_t, a_t) ∼ q(·| x_t, a_t)
      (Block diagram: a controller feeds actions a_t into the MDP, which emits rewards r_t and next states x_{t+1}, fed back through a unit delay z⁻¹.)

  11. Control and Returns
      Stationary policy: a_t ∼ μ(·| x_t)
      Path: ξ_μ = (z_0, z_1, ...)
      Discounted return: D(ξ_μ) = Σ_{i=0}^∞ γ^i R(z_i)
      Value function: V^μ(x) = E_μ[ D(ξ_μ) | x_0 = x ]
      State-action value func.: Q^μ(z) = E_μ[ D(ξ_μ) | z_0 = z ]
      Goal: find a policy μ* maximizing V^μ(x) ∀ x ∈ X
      Note: if Q*(x, a) = Q^{μ*}(x, a) is available, then an optimal action for state x is given by any a* ∈ argmax_a Q*(x, a).
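
      A small Monte Carlo sketch of these definitions (the `step` and `policy` interfaces, horizon, and discount factor are assumptions): it estimates V^μ(x_0) by averaging truncated discounted returns over sampled paths.

      ```python
      import numpy as np

      def mc_value_estimate(x0, step, policy, gamma=0.95, n_paths=1000, horizon=200, seed=0):
          """Estimate V^mu(x0) = E[ sum_i gamma^i R(z_i) | x_0 = x0 ] by Monte Carlo.

          `policy(x, rng)` returns an action; `step(x, a, rng)` returns (reward, next_state).
          The finite horizon truncates the infinite discounted sum (gamma^horizon is negligible).
          """
          rng = np.random.default_rng(seed)
          returns = []
          for _ in range(n_paths):
              x, total, discount = x0, 0.0, 1.0
              for _ in range(horizon):
                  a = policy(x, rng)
                  r, x = step(x, a, rng)
                  total += discount * r
                  discount *= gamma
              returns.append(total)
          return np.mean(returns)
      ```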

  12. Value-Based RL
      (Block diagram: the MDP and the policy μ(a|x) interact, producing learning data from the induced Markov reward process; a value estimator uses this data to maintain V̂^μ(x) or Q̂^μ(x, a), which in turn informs the policy.)

  13. Bellman's Equation
      For a fixed policy μ:
          V^μ(x) = E_{x′, a | x}[ R̄(x, a) + γ V^μ(x′) ]
      Optimal value and policy:
          V*(x) = max_μ V^μ(x),   μ* = argmax_μ V^μ(x)
      How to solve it?
      - Methods based on Value Iteration (e.g. Q-learning)
      - Methods based on Policy Iteration (e.g. SARSA, OPI, Actor-Critic)
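
      For a small finite MDP with known transition probabilities, Bellman's equation for a fixed μ is a linear system and can be solved exactly. The 3-state chain below is an illustrative assumption, not an example from the slides.

      ```python
      import numpy as np

      # Assumed toy chain under a fixed policy mu: P[i, j] = Pr(x' = j | x = i)
      P = np.array([[0.9, 0.1, 0.0],
                    [0.0, 0.9, 0.1],
                    [0.0, 0.0, 1.0]])
      r_bar = np.array([0.0, 0.0, 1.0])   # expected one-step reward R_bar(x) under mu
      gamma = 0.9

      # Bellman: V = r_bar + gamma P V  =>  (I - gamma P) V = r_bar
      V = np.linalg.solve(np.eye(3) - gamma * P, r_bar)
      print(V)
      ```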

  14. Solution Method Taxonomy
      RL algorithms split into purely policy-based methods (policy gradient) and value-function based methods; the latter divide into Value Iteration type (Q-Learning) and Policy Iteration type (Actor-Critic, OPI, SARSA).
      PI methods need a "subroutine" for policy evaluation.

  15. What's Missing?
      Shortcomings of current policy evaluation methods:
      • Some methods can only be applied to small problems
      • No probabilistic interpretation: how good is the estimate?
      • Only parametric methods are capable of operating on-line
      • Non-parametric methods are more flexible, but only work off-line
      • Small-step-size (stochastic approximation) methods use data inefficiently
      • Finite-time solutions lack interpretability; all statements are asymptotic
      • Convergence issues

  16. GP Temporal Difference Learning
      Model equations:
          R(x_i) = V(x_i) − γ V(x_{i+1}) + N(x_i, x_{i+1})
      Or, in compact form: R_t = H_{t+1} V_{t+1} + N_t, where
          H_t = [ 1  −γ   0  ...  0
                  0   1  −γ  ...  0
                  ...
                  0   0  ...  1  −γ ]
      Our (Bayesian) goal: find the posterior distribution of V, given a sequence of observed states and rewards.
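
      A small sketch of the compact form (the trajectory length, discount factor, and value vector are assumptions): it builds the banded H matrix and applies it to a value vector, which reproduces the temporal differences V(x_i) − γ V(x_{i+1}) appearing in the model equation.

      ```python
      import numpy as np

      def make_H(rows, gamma):
          """Banded matrix with rows (..., 1, -gamma, ...), shape (rows, rows + 1)."""
          H = np.zeros((rows, rows + 1))
          for i in range(rows):
              H[i, i] = 1.0
              H[i, i + 1] = -gamma
          return H

      gamma = 0.9
      V = np.array([1.0, 2.0, 1.5, 0.5])   # assumed values V(x_1), ..., V(x_4)
      H = make_H(3, gamma)
      print(H @ V)                          # V(x_i) - gamma V(x_{i+1}) for i = 1..3
      ```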

  17. Deterministic Dynamics
      Bellman's equation: V(x_i) = R̄(x_i) + γ V(x_{i+1})
      Define: N(x) = R(x) − R̄(x)
      Assumption: the N(x_i) are Normal, IID, with variance σ².
      Model equations: R(x_i) = V(x_i) − γ V(x_{i+1}) + N(x_i)
      In compact form: R_t = H_{t+1} V_{t+1} + N_t, with N_t ∼ N{0, σ² I}

  18. Stochastic Dynamics
      The discounted return: D(x_i) = E_μ D(x_i) + (D(x_i) − E_μ D(x_i)) = V(x_i) + ΔV(x_i)
      For a stationary MDP: D(x_i) = R(x_i) + γ D(x_{i+1}),  where x_{i+1} ∼ p(·| x_i, a_i), a_i ∼ μ(·| x_i)
      Substitute and rearrange:
          R(x_i) = V(x_i) − γ V(x_{i+1}) + N(x_i, x_{i+1}),   where N(x_i, x_{i+1}) := ΔV(x_i) − γ ΔV(x_{i+1})
      Assumption: the ΔV(x_i) are Normal, i.i.d., with variance σ².
      In compact form: R_t = H_{t+1} V_{t+1} + N_t, with N_t ∼ N{0, σ² H_{t+1} H_{t+1}ᵀ}

  19. The Posterior
      General noise covariance: Cov[N_t] = Σ_t
      Joint distribution:
          (R_{t−1}, V(x))ᵀ ∼ N{ (0, 0)ᵀ, [ H_t K_t H_tᵀ + Σ_t   H_t k_t(x) ; k_t(x)ᵀ H_tᵀ   k(x, x) ] }
      Condition on R_{t−1}:
          E[V(x) | R_{t−1} = r_{t−1}] = k_t(x)ᵀ α_t
          Cov[V(x), V(x′) | R_{t−1} = r_{t−1}] = k(x, x′) − k_t(x)ᵀ C_t k_t(x′)
      where
          α_t = H_tᵀ (H_t K_t H_tᵀ + Σ_t)⁻¹ r_{t−1},   C_t = H_tᵀ (H_t K_t H_tᵀ + Σ_t)⁻¹ H_t.
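
      These posterior moments are another instance of the conditioning formula, now with the Bellman-style matrix H_t. Below is a batch (offline, non-sparse) sketch under an assumed kernel, noise level, and made-up trajectory; the online, sparsified updates appear on the GPSARSA slide.

      ```python
      import numpy as np

      def gptd_posterior(x_traj, rewards, k, gamma=0.95, sigma=1.0, stochastic=True):
          """Batch GPTD posterior: returns alpha, C such that
          E[V(x)|r] = k_t(x)^T alpha and Cov[V(x),V(x')|r] = k(x,x') - k_t(x)^T C k_t(x')."""
          t = len(x_traj)                       # t states, t - 1 rewards
          K = np.array([[k(xi, xj) for xj in x_traj] for xi in x_traj])
          H = np.zeros((t - 1, t))
          for i in range(t - 1):
              H[i, i], H[i, i + 1] = 1.0, -gamma
          # Noise covariance: sigma^2 H H^T (stochastic dynamics) or sigma^2 I (deterministic)
          Sigma = sigma**2 * (H @ H.T if stochastic else np.eye(t - 1))
          G = H @ K @ H.T + Sigma
          alpha = H.T @ np.linalg.solve(G, rewards)
          C = H.T @ np.linalg.solve(G, H)
          return alpha, C

      # Tiny example on a scalar state trajectory (all numbers are made up)
      k = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
      x_traj = np.array([0.0, 0.4, 0.9, 1.3])
      rewards = np.array([0.1, 0.2, 1.0])
      alpha, C = gptd_posterior(x_traj, rewards, k)
      x_query = 0.5
      k_q = np.array([k(xi, x_query) for xi in x_traj])
      print(k_q @ alpha, k(x_query, x_query) - k_q @ C @ k_q)   # posterior mean and variance
      ```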

  20. Learning State-Action Values
      Under a fixed stationary policy μ, the state-action pairs z_t form a Markov chain, just like the states x_t. Consequently Q^μ(z) behaves just like V^μ(x):
          R(z_i) = Q(z_i) − γ Q(z_{i+1}) + N(z_i, z_{i+1})
      Posterior moments:
          E[Q(z) | R_{t−1} = r_{t−1}] = k_t(z)ᵀ α_t
          Cov[Q(z), Q(z′) | R_{t−1} = r_{t−1}] = k(z, z′) − k_t(z)ᵀ C_t k_t(z′)

  21. Policy Improvement
      Optimistic Policy Iteration algorithms work by maintaining a policy evaluator Q̂_t and selecting the action at time t semi-greedily w.r.t. the current state-action value estimates Q̂_t(x_t, ·).
      Policy evaluator             | Parameters | OPI algorithm
      Online TD(λ) (Sutton)        | w_t        | SARSA (Rummery & Niranjan)
      Online GPTD (Engel et al.)   | α_t, C_t   | GPSARSA (Engel et al.)

  22. GPSARSA Algorithm
      Initialize α_0 = 0, C_0 = 0, D_0 = {z_0}, c_0 = 0, d_0 = 0, 1/s_0 = 0
      for t = 1, 2, ...
          observe x_{t−1}, a_{t−1}, r_{t−1}, x_t
          a_t = SemiGreedyAction(x_t, D_{t−1}, α_{t−1}, C_{t−1})
          d_t = (γ σ²_{t−1} / s_{t−1}) d_{t−1} + temporal difference
          c_t = ...,  s_t = ...
          α_t = (α_{t−1}, 0)ᵀ + (c_t / s_t) d_t
          C_t = [ C_{t−1}  0 ; 0ᵀ  0 ] + (1 / s_t) c_t c_tᵀ
          D_t = D_{t−1} ∪ {z_t}
      end for
      return α_t, C_t, D_t
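
      One plausible implementation of SemiGreedyAction, offered as an assumption since the slides leave the rule open: score each candidate action by the posterior mean of Q(x, a) plus an optimism bonus proportional to its posterior standard deviation, and pick greedily with random tie-breaking. The dictionary D holds the visited state-action pairs.

      ```python
      import numpy as np

      def semi_greedy_action(x, D, alpha, C, k, actions, beta=1.0, rng=None):
          """Pick an action for state x from the GP posterior over Q(x, a).

          D: list of visited state-action pairs z_i = (x_i, a_i)
          alpha, C: GPSARSA posterior parameters; k: kernel on state-action pairs
          beta: optimism coefficient trading off mean value against uncertainty
          (one of several possible semi-greedy rules, assumed for illustration)
          """
          if rng is None:
              rng = np.random.default_rng()
          scores = []
          for a in actions:
              z = (x, a)
              k_z = np.array([k(z, zi) for zi in D])
              mean = k_z @ alpha                           # E[Q(z) | data]
              var = max(k(z, z) - k_z @ C @ k_z, 0.0)      # posterior variance
              scores.append(mean + beta * np.sqrt(var))    # optimistic score
          scores = np.array(scores)
          best = np.flatnonzero(scores == scores.max())
          return actions[rng.choice(best)]
      ```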

  23. A 2D Navigation Task
      (Figure: contour plot of results on the 2D navigation task; contour levels range from roughly −60 to 0.)

  24. Challenges
      • How to use value uncertainty?
      • What's a disciplined way to select actions?
      • What's the best noise covariance?
      • Bias, variance, learning curves
      • POMDPs
      • More complicated tasks
      Questions?
