Multi-task Gaussian Process Prediction
Chris Williams
Joint work with Edwin Bonilla, Kian Ming A. Chai, Stefan Klanke and Sethu Vijayakumar
Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, UK
September 2008

Motivation: Multi-task Learning
Sharing information across tasks, e.g. exam score prediction, compiler performance prediction, robot inverse dynamics
Assuming task relatedness can be detrimental (Caruana, 1997; Baxter, 2000)
Task descriptors unavailable or difficult to define
◮ e.g. compiler performance prediction: code features, responses
Learning inter-task dependencies based on task identities
Correlations between tasks directly induced
GP framework

Outline
The Model
Making Predictions and Learning Hyperparameters
Cancellation of Transfer
Related Work
Experiments and Results
MTL in Robot Inverse Dynamics
Conclusions and Discussion

Multi-task Setting
Given a set $X$ of $N$ distinct inputs $x_1, \ldots, x_N$:
Complete set of responses: $y = (y_{11}, \ldots, y_{N1}, y_{12}, \ldots, y_{N2}, \ldots, y_{1M}, \ldots, y_{NM})^T$
$y_{i\ell}$: response for the $\ell$th task on the $i$th input $x_i$
$Y$: $N \times M$ matrix such that $y = \mathrm{vec}\, Y$
Goal: given observations $y_o \subset y$:
◮ make predictions of unobserved values $y_u$
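The stacking convention above can be checked with a small NumPy sketch. The sizes and the helper name `flat_index` are illustrative assumptions, not part of the talk; the only point is that $\mathrm{vec}\, Y$ stacks the columns of $Y$, so all responses for task 1 come first, then task 2, and so on.

```python
import numpy as np

N, M = 4, 3  # hypothetical sizes: 4 inputs, 3 tasks

# Y[i, l] holds the response of task l on input x_i (made-up values)
Y = np.arange(N * M, dtype=float).reshape(N, M)

# vec stacks the columns of Y: all of task 1, then task 2, ...
y = Y.reshape(-1, order="F")

def flat_index(i, l, n_inputs=N):
    """Position of y_{il} in y, for 0-based input index i and task index l."""
    return l * n_inputs + i
```

With this ordering, a Boolean mask of length $NM$ is one simple way to mark which entries of $y$ belong to $y_o$ and which to $y_u$.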

Multi-task GP
The Model
We place a (zero mean) GP prior over the latent functions $\{f_\ell\}$:
$\langle f_\ell(x) f_m(x') \rangle = K^f_{\ell m}\, k^x(x, x')$, $y_{i\ell} \sim \mathcal{N}(f_\ell(x_i), \sigma^2_\ell)$
$K^f$: PSD matrix that specifies the inter-task similarities
$k^x$: covariance function over inputs
$\sigma^2_\ell$: noise variance for the $\ell$th task
Additionally, $k^x$: stationary correlation function, e.g. squared exponential
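As a rough illustration of this prior, the sketch below builds the covariance of the stacked responses for a complete design and draws one sample. It assumes the task-by-task ordering $y = \mathrm{vec}\, Y$, under which the covariance is $K^f \otimes K^x + D \otimes I$; the function names, input locations, lengthscale and noise levels are all made up for the example.

```python
import numpy as np

def sq_exp(X1, X2, lengthscale=1.0):
    """Squared-exponential correlation function k^x (unit variance)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def multitask_cov(Kf, Kx, noise_var):
    """Covariance of y = vec Y: Kf (x) Kx + D (x) I, with D = diag(noise_var)."""
    N = Kx.shape[0]
    return np.kron(Kf, Kx) + np.kron(np.diag(noise_var), np.eye(N))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 6)[:, None]   # N = 6 one-dimensional inputs
L = rng.normal(size=(3, 3))
Kf = L @ L.T                            # any L gives a PSD task matrix (M = 3)
Kx = sq_exp(X, X)
Sigma = multitask_cov(Kf, Kx, noise_var=np.full(3, 0.1))

# One joint draw of all N * M noisy responses from the prior
y = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma)
```

Entry $(\ell N + i,\ m N + j)$ of `np.kron(Kf, Kx)` equals $K^f_{\ell m}\, k^x(x_i, x_j)$, which matches the covariance stated on the slide.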

Multi-task GP (2)
[Figure: two graphical models. Other approaches: nodes $\theta$, $f_1, f_2, f_3$, $y_1, y_2, y_3$. Our approach: nodes $f_1, f_2, f_3$, $y_1, y_2, y_3$.]
Observations on one task can affect predictions on the others
Bonilla et al. (2007), Yu et al. (2007): $K^f_{\ell m} = k^f(t_\ell, t_m)$
Multi-task clustering easily modelled

Making Predictions
The mean prediction on a new data-point $x_*$ for task $\ell$ is given by:
$\bar{f}_\ell(x_*) = (k^f_\ell \otimes k^x_*)^T \Sigma^{-1} y$, with $\Sigma = K^f \otimes K^x + D \otimes I$
where:
$k^f_\ell$ selects the $\ell$th column of $K^f$
$k^x_*$: vector of covariances between $x_*$ and the training points
$K^x$: matrix of covariances between all pairs of training points
$D$: diagonal matrix in which the $(\ell, \ell)$th element is $\sigma^2_\ell$
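The predictive mean above translates almost line by line into NumPy. This is a minimal sketch under two assumptions: every task is observed at every training input (so $y$ has length $NM$ and is stacked task by task), and the dense $\Sigma$ is small enough to solve directly. The function name `predict_mean` and the toy numbers are illustrative only.

```python
import numpy as np

def predict_mean(Kf, Kx, kx_star, noise_var, y, task):
    """Mean prediction for `task` at a test input, given all N*M responses.

    y is stacked task by task (y = vec Y); kx_star[i] = k^x(x_*, x_i).
    """
    N = Kx.shape[0]
    # Sigma = K^f (x) K^x + D (x) I
    Sigma = np.kron(Kf, Kx) + np.kron(np.diag(noise_var), np.eye(N))
    # k^f_l (x) k^x_*, where k^f_l is column `task` of K^f
    k_star = np.kron(Kf[:, task], kx_star)
    return k_star @ np.linalg.solve(Sigma, y)

# Toy usage with made-up numbers: 2 tasks, 3 training inputs, test input 1.5
X = np.array([0.0, 1.0, 2.0])
Kx = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
kx_star = np.exp(-0.5 * (X - 1.5) ** 2)
Kf = np.array([[1.0, 0.8], [0.8, 1.0]])
noise_var = np.array([0.1, 0.1])
y = np.array([0.1, 0.9, 0.4, 0.0, 1.1, 0.5])
mean_task0 = predict_mean(Kf, Kx, kx_star, noise_var, y, task=0)
```

Setting `Kf` to the identity makes $\Sigma$ block diagonal, and the prediction for each task then reduces to an ordinary single-task GP prediction, which is a convenient sanity check.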

Learning Hyperparameters
Given $y_o$: learn $\theta_x$ of $k^x$, $K^f$, $\sigma^2_\ell$ to maximize $p(y_o \mid X)$.
We note that: $y \mid X \sim \mathcal{N}(0, \Sigma)$
(a) Gradient-based method:
◮ $K^f = LL^T$ (recall $K^f$ must be PSD)
◮ Kronecker structure
(b) EM:
◮ learning of $\theta_x$ and $K^f$ in the M-step is decoupled
◮ closed-form updates for $K^f$ and $D$
◮ $K^f$ guaranteed PSD
$\hat{K}^f = N^{-1} \left\langle F^T \left( K^x(\hat{\theta}_x) \right)^{-1} F \right\rangle$
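For option (a), one possible way to set up the objective is sketched below: $K^f$ is parametrized through a lower-triangular factor $L$ so that it stays PSD, and the negative log marginal likelihood of $y \mid X \sim \mathcal{N}(0, \Sigma)$ is evaluated via a Cholesky factorization. This is an illustration only. It assumes complete observations, keeps $K^x$ fixed, works with log noise variances to keep them positive, and forms the dense $\Sigma$ rather than exploiting the Kronecker structure mentioned on the slide. All function names are hypothetical.

```python
import numpy as np

def unpack_lower(params, M):
    """Fill an M x M lower-triangular matrix L from a flat parameter vector."""
    L = np.zeros((M, M))
    L[np.tril_indices(M)] = params
    return L

def neg_log_marginal_likelihood(L_params, log_noise_var, Kx, y, M):
    """-log p(y | X) for y ~ N(0, Kf (x) Kx + D (x) I) with Kf = L L^T."""
    N = Kx.shape[0]
    L = unpack_lower(L_params, M)
    Kf = L @ L.T                                   # PSD by construction
    D = np.diag(np.exp(log_noise_var))             # positive noise variances
    Sigma = np.kron(Kf, Kx) + np.kron(D, np.eye(N))
    chol = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(chol, y)                   # z^T z = y^T Sigma^{-1} y
    half_logdet = np.log(np.diag(chol)).sum()      # 0.5 * log det Sigma
    return 0.5 * z @ z + half_logdet + 0.5 * N * M * np.log(2 * np.pi)

# Toy evaluation with made-up numbers: 2 tasks, 4 inputs
rng = np.random.default_rng(0)
M, N = 2, 4
X = np.linspace(0.0, 3.0, N)
Kx = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
L_params = np.array([1.0, 0.5, 0.8])   # lower triangle of L, row by row
log_noise_var = np.log(np.array([0.1, 0.2]))
y = rng.normal(size=N * M)
nll = neg_log_marginal_likelihood(L_params, log_noise_var, Kx, y, M)
```

A function of this shape could be handed to a generic gradient-based optimizer over `L_params` and `log_noise_var`; the kernel parameters $\theta_x$ would enter through `Kx`.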

Noiseless observations + grid = Cancellation of Transfer
[Figure: grid design with training inputs $x_1, x_2, x_3, x_4$ and test input $x_*$ for tasks $f_1, f_2, f_3$]
We can show that if there is a grid design and no observation noise then:
$f(x_*, \ell) = (k^x_*)^T (K^x)^{-1} y_{\cdot \ell}$
The predictions for task $\ell$ depend only on the targets $y_{\cdot \ell}$
Similar result for the covariances
This is known as autokrigeability in geostatistics
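The cancellation can be checked numerically. With no noise, $\Sigma = K^f \otimes K^x$, so $\Sigma^{-1} = (K^f)^{-1} \otimes (K^x)^{-1}$ and $(k^f_\ell)^T (K^f)^{-1}$ is the $\ell$th unit vector, provided $K^f$ is invertible. The sketch below compares the full multi-task prediction with the single-task formula on a made-up grid; all values are arbitrary, and a small diagonal term is added to $K^f$ only to keep it well conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 3
X = np.arange(N, dtype=float)      # grid design: every task observed at every input
x_star = 2.5                       # hypothetical test input

def k_x(a, b, lengthscale=1.0):
    """Squared-exponential correlation between two sets of 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

Kx = k_x(X, X)
kx_star = k_x(X, np.array([x_star]))[:, 0]

A = rng.normal(size=(M, M))
Kf = A @ A.T + 0.1 * np.eye(M)     # arbitrary invertible task covariance
Y = rng.normal(size=(N, M))        # noiseless targets, one column per task
y = Y.reshape(-1, order="F")       # y = vec Y

task = 1
Sigma = np.kron(Kf, Kx)            # no D (x) I term: observations are noise free
multi = np.kron(Kf[:, task], kx_star) @ np.linalg.solve(Sigma, y)
single = kx_star @ np.linalg.solve(Kx, Y[:, task])
```

Here `multi` and `single` agree up to round-off, and changing `Kf` or the other columns of `Y` leaves the prediction for `task` unchanged.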

Related Work
Early work on MTL: Thrun (1996), Caruana (1997)
Minka (1997) and some other later GP work assume that multiple tasks share the same hyperparameters but are otherwise uncorrelated
Co-kriging in geostatistics
Evgeniou et al. (2005) induce correlations between tasks based on a correlated prior over linear regression parameters
Conti & O’Hagan (2007): emulating multi-output simulators
Use of task descriptors so that $K^f_{\ell m} = k^f(t_\ell, t_m)$, e.g. Yu et al. (2007), Bonilla et al. (2007)
Semiparametric latent factor model (SLFM) of Teh et al. (2005) has $P$ latent processes, each with its own covariance function. Noiseless outputs are obtained by linear mixing of these latent functions. Our model is similar, but simpler, in that all of the $P$ latent processes share the same covariance function; this reduces the number of free parameters to be fitted and should help to minimize overfitting

Experiments
Compiler performance prediction
$y$: speed-up of a program (task) when applying a transformation sequence $x$
11 C programs, 13 transformations, sequences of length 5
“bag-of-characters” representation for $x$
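The slide does not spell out the encoding, so the following is only a guess at what a bag-of-characters representation might look like: each of the 13 transformations is given a hypothetical one-letter code, and a sequence is mapped to a vector of counts that ignores order.

```python
from collections import Counter

import numpy as np

# 13 hypothetical one-letter codes, one per transformation (assumed, not from the talk)
ALPHABET = "abcdefghijklm"

def bag_of_characters(sequence, alphabet=ALPHABET):
    """Count how often each transformation code occurs, ignoring order."""
    counts = Counter(sequence)
    return np.array([counts[c] for c in alphabet], dtype=float)

# A made-up length-5 transformation sequence
x = bag_of_characters("abbca")
```

Under this assumption every input $x$ is a 13-dimensional count vector whose entries sum to 5, and sequences that differ only in the order of their transformations map to the same $x$.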
