Gaussian Process Temporal Difference Learning – Theory and Practice
Yaakov Engel
Collaborators: Shie Mannor, Ron Meir, Peter Szabo, Dmitry Volkinshtein, Nadav Aharony, Tzachi Zehavi
Kernel-RL workshop – ICML’06

Timeline
- ICML’03: "Bayes meets Bellman" paper – GPTD model for MDPs with deterministic transitions
- ICML’05: "RL with GPs" paper – GPTD model for general MDPs + GPSARSA for learning control
- NIPS’05: "Learning to Control an Octopus Arm" – GPTD applied to a high-dimensional control problem
- OPNET’05: Network association control with GPSARSA
Why use GPs in RL?
- A Bayesian approach to value estimation
- Forces us to make our assumptions explicit
- Non-parametric – priors are placed and inference is performed directly in function space (kernels)
- But, can also be defined parametrically
- Domain knowledge intuitively coded in priors
- Provides full posterior, not just point estimates
- Efficient, on-line implementations, suitable for large problems
The Bayesian Approach
[Diagram: Z → Y (hidden process Z generates the observable Y)]
- Z – hidden process, Y – observable
- We want to infer Z from measurements of Y
- Statistical dependence between Z and Y known: P(Y |Z)
- Place prior over Z, reflecting our uncertainty: P(Z)
- Observe Y = y
- Compute posterior: P(Z|Y = y) = P(y|Z)P(Z) / ∫ dZ′ P(y|Z′)P(Z′)
Gaussian Processes
Definition: "An indexed set of jointly Gaussian random variables."
Note: the index set X may be just about any set.
Example: F(x), with index x ∈ [0, 1]^n.
F's distribution is specified by its mean and covariance:
  E[F(x)] = m(x),  Cov[F(x), F(x′)] = k(x, x′)
m is a function X → R, k is a function X × X → R.
Conditions on k: symmetric, positive definite ⇒ k is a Mercer kernel.
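To make the definition concrete, here is a minimal sketch (my own illustration, not from the talk) that draws sample functions from a GP prior over [0, 1] with zero mean and an assumed squared-exponential kernel and length-scale:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential (Mercer) kernel: k(x, x') = exp(-(x - x')^2 / (2 l^2))
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# Index set: 200 points in [0, 1]; F restricted to them is jointly Gaussian.
xs = np.linspace(0.0, 1.0, 200)
m = np.zeros_like(xs)                  # mean function m(x) = 0
K = rbf_kernel(xs, xs)                 # covariance matrix [k(x, x')]

# Draw three sample functions from the prior N(m, K); the jitter keeps K numerically PSD.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(m, K + 1e-10 * np.eye(len(xs)), size=3)
```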
GP Regression
Model equation: Y(x) = F(x) + N(x)
Prior: F ∼ N{0, k(·, ·)}
Noise: N ∼ N{0, σ²δ(· − ·)}
Goal: find the posterior distribution of F, given a sample of Y (via Bayes' rule)
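A minimal NumPy sketch of this computation, assuming a squared-exponential kernel and a made-up noise level and data set; this is the exact-GP posterior, not the sparse SGPR algorithm behind the next slide's figure:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_posterior(x_train, y_train, x_test, sigma=0.1, ell=1.0):
    # Posterior of F given noisy observations Y(x) = F(x) + N(x), Var[N(x)] = sigma^2
    K = rbf(x_train, x_train, ell)                     # k(X, X)
    ks = rbf(x_train, x_test, ell)                     # k(X, X*)
    G = np.linalg.inv(K + sigma**2 * np.eye(len(x_train)))
    mean = ks.T @ G @ y_train                          # E[F(x*) | y]
    cov = rbf(x_test, x_test, ell) - ks.T @ G @ ks     # posterior covariance
    return mean, cov

# Toy data: noisy sinc samples, similar in spirit to the figure on the next slide.
x = np.linspace(-10.0, 10.0, 50)
y = np.sinc(x / np.pi) + 0.1 * np.random.randn(len(x))
mu, cov = gp_posterior(x, y, np.linspace(-10.0, 10.0, 200))
```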
Example
[Figure: GP regression on the sinc function – training set, SINC target, SGPR estimate with σ confidence band; test error = 0.131]
Markov Decision Processes
X: state space, U: action space
p: X × X × U → [0, 1], xt+1 ∼ p(·|xt, ut)
q: R × X × U → [0, 1], R(xt, ut) ∼ q(·|xt, ut)
A stationary policy: µ: U × X → [0, 1], ut ∼ µ(·|xt)
Discounted return: Dµ(x) = Σ_{i=0..∞} γ^i R(xi, ui), given x0 = x
Value function: V µ(x) = Eµ[Dµ(x)]
Goal: find a policy µ∗ maximizing V µ(x) for all x ∈ X
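For concreteness, a tiny sketch (my own, with a made-up reward sequence) of the discounted return along one sampled trajectory:

```python
def discounted_return(rewards, gamma=0.95):
    # D(x0) = sum_i gamma^i R(x_i, u_i) along one sampled trajectory
    d = 0.0
    for r in reversed(rewards):
        d = r + gamma * d
    return d

# V(x0) is the expectation of this quantity over trajectories (and policy randomness).
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))   # -> 0.81
```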
Bellman’s Equation
For a fixed policy µ: V µ(x) = E_{x′,u|x}[R(x, u) + γV µ(x′)]
Optimal value and policy: V ∗(x) = max_µ V µ(x),  µ∗ = argmax_µ V µ(x)
How to solve it?
- Methods based on Value Iteration (e.g. Q-learning)
- Methods based on Policy Iteration (e.g. SARSA, OPI, Actor-Critic)
Solution Method Taxonomy
[Diagram: taxonomy of RL algorithms]
- Value-function based
  - Value Iteration type (Q-Learning)
  - Policy Iteration type (Actor-Critic, OPI, SARSA)
- Purely policy based (Policy Gradient)
PI methods need a “subroutine” for policy evaluation
What’s Missing?
Shortcomings of current policy evaluation methods:
- Some methods can only be applied to small problems
- No probabilistic interpretation - how good is the estimate?
- Only parametric methods are capable of operating on-line
- Non-parametric methods are more flexible but only work off-line
- Small-step-size (stoch. approx.) methods use data inefficiently
- Finite-time solutions lack interpretability – all statements are asymptotic
- Convergence issues
Gaussian Process Temporal Difference Learning
Model equations: R(xi) = V (xi) − γV (xi+1) + N(xi, xi+1)
Or, in compact form: Rt = Ht+1Vt+1 + Nt, where

  Ht = [ 1  −γ           ]
       [     1  −γ       ]
       [        ⋱   ⋱    ]
       [           1  −γ ]

(each row has a 1 on the diagonal and −γ immediately to its right).
Our (Bayesian) goal: find the posterior distribution of V, given a sequence of observed states and rewards.
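A small sketch of the Ht matrix implied by these model equations (my own construction; the indexing convention is only assumed to match the slides):

```python
import numpy as np

def make_H(t, gamma):
    # t x (t+1) matrix: 1 on the diagonal, -gamma on the superdiagonal, so that
    # (H v)_i = v_i - gamma * v_{i+1}, matching R(x_i) = V(x_i) - gamma V(x_{i+1}) + noise
    H = np.zeros((t, t + 1))
    idx = np.arange(t)
    H[idx, idx] = 1.0
    H[idx, idx + 1] = -gamma
    return H

print(make_H(3, 0.9))
```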
Deterministic Dynamics
Bellman’s equation: V (xi) = R̄(xi) + γV (xi+1)
Define: N(x) = R(x) − R̄(x)
Assumption: the N(xi) are Normal, i.i.d., with variance σ².
Model equations: R(xi) = V (xi) − γV (xi+1) + N(xi)
In compact form: Rt = Ht+1Vt+1 + Nt, with Nt ∼ N(0, σ²I)
Stochastic Dynamics
The discounted return: D(xi) = EµD(xi) + (D(xi) − EµD(xi)) = V (xi) + ∆V (xi)
For a stationary MDP: D(xi) = R(xi) + γD(xi+1), where xi+1 ∼ p(·|xi, ui), ui ∼ µ(·|xi)
Substitute and rearrange: R(xi) = V (xi) − γV (xi+1) + N(xi, xi+1), with N(xi, xi+1) := ∆V (xi) − γ∆V (xi+1)
Assumption: the ∆V (xi) are Normal, i.i.d., with variance σ².
In compact form: Rt = Ht+1Vt+1 + Nt, with Nt ∼ N(0, σ²Ht+1Ht+1⊤)
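A sketch contrasting the two noise covariances (deterministic vs. general dynamics); this is my own illustration of the formulas, with the function and variable names assumed:

```python
import numpy as np

def gptd_noise_covariance(H, sigma, deterministic=False):
    # Deterministic transitions: white noise, Sigma_t = sigma^2 * I
    # General MDP: N_t = H * (Delta V), so Sigma_t = sigma^2 * H H^T (correlated, tridiagonal)
    t = H.shape[0]
    if deterministic:
        return sigma**2 * np.eye(t)
    return sigma**2 * (H @ H.T)
```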
The Posterior
General noise covariance: Cov[Nt] = Σt
Joint distribution:
  (Rt−1, V (x)) ∼ N( 0, [ HtKtHt⊤ + Σt , Htkt(x) ; kt(x)⊤Ht⊤ , k(x, x) ] )
Invoke Bayes' rule:
  E[V (x) | Rt−1 = rt−1] = kt(x)⊤αt
  Cov[V (x), V (x′) | Rt−1 = rt−1] = k(x, x′) − kt(x)⊤Ct kt(x′)
where
  kt(x) = (k(x0, x), . . . , k(xt, x))⊤,  Kt = [kt(x0), . . . , kt(xt)],
  αt = Ht⊤(HtKtHt⊤ + Σt)⁻¹ rt−1,  Ct = Ht⊤(HtKtHt⊤ + Σt)⁻¹ Ht
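Putting the formulas on this slide together, here is a hedged, non-sparse sketch of the GPTD posterior for 1-D states with an assumed RBF kernel; the algorithms in the papers use on-line sparsification, which is omitted here:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gptd_posterior(states, rewards, x_query, gamma=0.95, sigma=1.0, ell=1.0):
    # Exact (non-sparse) GPTD posterior, directly from the joint-Gaussian formulas.
    # states: array of t+1 visited 1-D states; rewards: array of t observed rewards.
    t = len(rewards)
    K = rbf(states, states, ell)                       # K_t
    H = np.zeros((t, t + 1))
    H[np.arange(t), np.arange(t)] = 1.0
    H[np.arange(t), np.arange(t) + 1] = -gamma
    Sigma = sigma**2 * (H @ H.T)                       # stochastic-dynamics noise model
    G = np.linalg.inv(H @ K @ H.T + Sigma)             # (H K H^T + Sigma)^-1
    alpha = H.T @ G @ np.asarray(rewards)              # alpha_t
    C = H.T @ G @ H                                    # C_t
    kq = rbf(states, np.atleast_1d(x_query), ell)      # k_t(x), shape (t+1, 1)
    mean = kq.T @ alpha                                # E[V(x) | r_{t-1}]
    var = rbf(np.atleast_1d(x_query), np.atleast_1d(x_query), ell) - kq.T @ C @ kq
    return mean.item(), var.item(), alpha, C
```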
A Parametric Gaussian Process Model
A linear combination of features: V (x) = φ(x)⊤W
[Diagram: input x feeds basis functions φ1(x), . . . , φn(x); their outputs are weighted by W1, . . . , Wn and summed (Σ) to give V (x)]
Prior on W: Gaussian, with E[W] = 0, Cov[W, W] = I
Prior on V: Gaussian, with E[V (x)] = 0, Cov[V (x), V (x′)] = φ(x)⊤φ(x′)
Comparison of Models
                      Parametric                Nonparametric
Parametrization       V (x) = φ(x)⊤W            none – V itself
Prior                 W ∼ N{0, I}               V ∼ N{0, k(·, ·)}
E[V (x)]              0                         0
Cov[V (x), V (x′)]    φ(x)⊤φ(x′)                k(x, x′)
We seek               W | Rt−1                  V (x) | Rt−1

If we can find a set of basis functions satisfying φ(x)⊤φ(x′) = k(x, x′), the two models become equivalent. In fact, such a set always exists [Mercer]. However, it may be infinite.
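A small numerical check (my own example, not from the slides) of the feature-map/kernel correspondence, using a degree-2 polynomial kernel whose feature map happens to be finite:

```python
import numpy as np

# Explicit finite feature map for the 1-D polynomial kernel k(x, x') = (x x' + 1)^2:
# phi(x) = (1, sqrt(2) x, x^2), so phi(x)^T phi(x') = 1 + 2 x x' + (x x')^2 = k(x, x').
def phi(x):
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

def k(x, xp):
    return (x * xp + 1.0) ** 2

x, xp = 0.7, -1.3
print(phi(x) @ phi(xp), k(x, xp))   # identical prior covariances -> equivalent GP models
```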
Relation to Monte-Carlo Estimation
In the stochastic model: Σt = σ²Ht+1Ht+1⊤
Also, let: (Yt)i = Σ_{j=i..t} γ^(j−i) R(xj, uj)
Then:
  E[W|Rt] = (ΦtΦt⊤ + σ²I)⁻¹ ΦtYt
  Cov[W|Rt] = σ²(ΦtΦt⊤ + σ²I)⁻¹
That's the solution to GP regression on Monte-Carlo samples of the discounted return.
MAP / ML Solutions
Since the posterior is Gaussian, the MAP estimate is
  ŵMAP,t+1 = E[W|Rt] = (ΦtΦt⊤ + σ²I)⁻¹ ΦtYt
Performing ML inference using the same model we get
  ŵML,t+1 = (ΦtΦt⊤)⁻¹ ΦtYt
That's the LSTD(1) (Least-Squares Monte-Carlo) solution.
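A sketch of both estimates, assuming a feature matrix Φt with one column per visited state and the Monte-Carlo return targets Yt from the previous slide; the names and shapes are my own choices:

```python
import numpy as np

def mc_targets(rewards, gamma):
    # (Y_t)_i = sum_{j >= i} gamma^(j - i) R(x_j, u_j): Monte-Carlo discounted returns
    Y = np.zeros(len(rewards))
    acc = 0.0
    for i in reversed(range(len(rewards))):
        acc = rewards[i] + gamma * acc
        Y[i] = acc
    return Y

def map_and_ml_weights(Phi, Y, sigma):
    # Phi: one column phi(x_i) per visited state; Y: the MC targets for those states
    A = Phi @ Phi.T
    w_map = np.linalg.solve(A + sigma**2 * np.eye(A.shape[0]), Phi @ Y)  # E[W | R_t]
    w_ml = np.linalg.solve(A, Phi @ Y)                                   # LSTD(1) solution
    return w_map, w_ml
```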
Policy Improvement
How can we perform policy improvement?
State values? Not without a transition model (and even then it is tricky).
State-action (Q-) values? Yes!
Idea: use a state-action value GP. How?
- Define a state-action kernel: k((x, u), (x′, u′))
- Run GPTD on state-action pairs
- Use some semi-greedy action selection rule
We call this GPSARSA.
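One possible realization (an assumption for illustration, not necessarily the kernel or selection rule used in the experiments): a product kernel over states and actions, plus ε-greedy selection from the posterior mean Q-values:

```python
import numpy as np

def state_action_kernel(x, u, xp, up, ell_x=1.0, ell_u=1.0):
    # Product of an RBF kernel over states and one over actions:
    # k((x, u), (x', u')) = k_x(x, x') * k_u(u, u')
    kx = np.exp(-0.5 * np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / ell_x**2)
    ku = np.exp(-0.5 * np.sum((np.asarray(u) - np.asarray(up)) ** 2) / ell_u**2)
    return kx * ku

def semi_greedy(q_means, actions, rng, eps=0.1):
    # epsilon-greedy over the posterior mean Q-values of the candidate actions
    if rng.random() < eps:
        return actions[rng.integers(len(actions))]
    return actions[int(np.argmax(q_means))]
```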
A Simple Experiment
[Figure: contour plot of the learned values in the simple experiment (contour levels from −60 to −10)]
The Octopus Arm
- Can bend and twist at any point
- Can do this in any direction
- Can be elongated and shortened
- Can change cross section
- Can grab using any part of the arm
- Virtually infinitely many DOF
Our Arm Model
[Diagram: arm model – compartments C1 … CN between muscle pairs #1 … #N+1, dorsal and ventral sides from arm base to arm tip, actuated by longitudinal and transverse muscles]
Actions
Each action specifies a set of fixed activations – one for each muscle in the arm.
[Figure: the six basic actions, Action #1 through Action #6]
Base rotation adds duplicates of actions 1,2,4 and 5 with positive and negative torques applied to the base.
The Control Problem
Starting from a random position, bring {any part, tip} of the arm into contact with a goal region, optimally.
Optimality criteria: time, energy, obstacle avoidance
Constraint: we only have access to sampled trajectories
Our approach: define the problem as an MDP and solve it using a GPTD algorithm
The Task
[Figure: snapshot of the arm performing the task at t = 1.38]
Movies
Association Control in WLANs
Setting: n users, m ≪ n access points (APs)
The problem: associate users with APs, optimally.
Complications: users are not all the same, they move around and change their behavior over time; and what is meant by "optimally"?
Idea: model the system as an MDP, solve using GPSARSA
Results:
- Tested on simple networks using the OPNET simulator
- Preliminary results look promising
- More work is needed
Challenges
- How to use value uncertainty?
- What’s a disciplined way to select actions?
- What’s the best noise covariance?
- Bias, variance, learning curves
- POMDPs
- More complicated tasks
Questions?