

SLIDE 1

Introduction to Machine Learning

  • 12. Gaussian Processes

Alex Smola Carnegie Mellon University

http://alex.smola.org/teaching/cmu2013-10-701 10-701

SLIDE 2

The Normal Distribution

http://www.gaussianprocess.org/gpml/chapters/

SLIDE 3

The Normal Distribution

SLIDE 4

Gaussians in Space

SLIDE 5

Gaussians in Space

samples in R^2

SLIDE 6

The Normal Distribution

  • Density for scalar variables
  • Density in d dimensions
  • Principal components
  • Eigenvalue decomposition
  • Product representation

p(x) = (2πσ²)^{-1/2} exp(−(x−µ)²/(2σ²))   (scalar)

p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp(−½ (x−µ)^⊤ Σ^{-1} (x−µ))   (d dimensions)

Σ = U^⊤ Λ U   (eigenvalue decomposition)

p(x) = (2π)^{-d/2} |Λ|^{-1/2} exp(−½ (U(x−µ))^⊤ Λ^{-1} U(x−µ))
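
A small Octave sketch evaluating the d-dimensional log-density directly from the formula above (mu, Sigma, and x are toy placeholder values, not from the slides):

% log of p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)' inv(Sigma) (x-mu))
d = 2;
mu = [0; 0];
Sigma = [2, 0.5; 0.5, 1];
x = [1; -1];
logp = -0.5 * d * log(2*pi) - 0.5 * log(det(Sigma)) ...
       - 0.5 * (x - mu)' * (Sigma \ (x - mu))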

SLIDE 7

The Normal Distribution

Σ = U^⊤ Λ U   (principal components)

Product representation over the principal components:

p(x) = (2π)^{-d/2} ∏_{i=1}^{d} Λ_{ii}^{-1/2} exp(−½ (U(x−µ))^⊤ Λ^{-1} U(x−µ))

SLIDE 8

Why do we care?

  • Central limit theorem shows that in the limit all averages behave like Gaussians
  • Easy to estimate parameters (MLE)
  • Distribution with largest uncertainty (entropy) for a given mean and covariance
  • Works well even if the assumptions are wrong

µ = (1/m) ∑_{i=1}^{m} x_i   and   Σ = (1/m) ∑_{i=1}^{m} x_i x_i^⊤ − µµ^⊤

SLIDE 9

Why do we care?

  • Central limit theorem shows that in the limit all averages behave like Gaussians
  • Easy to estimate parameters (MLE)

% X: d x m data matrix, m: sample size
mu = (1/m) * sum(X, 2);              % mean: average over the m columns
sigma = (1/m) * (X * X') - mu * mu'; % covariance: E[x x'] - mu mu'

µ = (1/m) ∑_{i=1}^{m} x_i   and   Σ = (1/m) ∑_{i=1}^{m} x_i x_i^⊤ − µµ^⊤

SLIDE 10

Sampling from a Gaussian

  • Case 1 - We have a normal distribution generator (randn) for z ∼ N(0, 1)
  • We want x ∼ N(µ, Σ)
  • Recipe: x = µ + Lz where z ∼ N(0, 1) and Σ = LL^⊤ (see the sketch after this list)
  • Proof: E[(x − µ)(x − µ)^⊤] = E[Lzz^⊤L^⊤] = L E[zz^⊤] L^⊤ = LL^⊤ = Σ
  • Case 2 - Box-Muller transform for U[0, 1]

p(x) = (2π)^{-1} exp(−½ ‖x‖²)   ⇒   p(φ, r) = (2π)^{-1} r exp(−½ r²)

F(φ, r) = (φ/2π) · [1 − exp(−½ r²)]
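
A minimal Octave sketch of the Case 1 recipe (mu and Sigma are toy placeholder values):

mu = [1; 2];
Sigma = [2, 0.5; 0.5, 1];
L = chol(Sigma, 'lower');   % Cholesky factor, Sigma = L * L'
z = randn(2, 1);            % z ~ N(0, 1) componentwise
x = mu + L * z;             % x ~ N(mu, Sigma)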

SLIDE 11

Sampling from a Gaussian

p(x) = (2π)^{-1} exp(−½ ‖x‖²)   ⇒   p(φ, r) = (2π)^{-1} r exp(−½ r²)

F(φ, r) = (φ/2π) · [1 − exp(−½ r²)]

(figure: the two-dimensional Gaussian in polar coordinates, axes r and φ)

SLIDE 12

Sampling from a Gaussian

  • Cumulative distribution function

Draw the radial and angle components separately:

tmp1 = rand();              % uniform draw for the radius
tmp2 = rand();              % uniform draw for the angle
r = sqrt(-2 * log(tmp1));   % inverts 1 - exp(-r^2/2)
phi = 2 * pi * tmp2;        % inverts phi/(2*pi)
x1 = r * sin(phi);
x2 = r * cos(phi);

F(φ, r) = (φ/2π) · [1 − exp(−½ r²)]

SLIDE 13

Sampling from a Gaussian

  • Cumulative distribution function

Draw the radial and angle components separately:

tmp1 = rand();              % uniform draw for the radius
tmp2 = rand();              % uniform draw for the angle
r = sqrt(-2 * log(tmp1));   % inverts 1 - exp(-r^2/2)
phi = 2 * pi * tmp2;        % inverts phi/(2*pi)
x1 = r * sin(phi);
x2 = r * cos(phi);

F(φ, r) = (φ/2π) · [1 − exp(−½ r²)]

Why can we use tmp1 instead of 1-tmp1?

SLIDE 14

Example: correlating weight and height

SLIDE 15

Example: correlating weight and height

assume Gaussian correlation

SLIDE 16

p(weight|height) = p(height, weight) / p(height) ∝ p(height, weight)

SLIDE 17

p(x_2|x_1) ∝ exp( −½ [x_1 − µ_1; x_2 − µ_2]^⊤ [Σ_11, Σ_12; Σ_21, Σ_22]^{-1} [x_1 − µ_1; x_2 − µ_2] )

keep linear and quadratic terms of the exponent

SLIDE 18

The gory math

Correlated Observations
Assume that the random variables t ∈ R^n, t' ∈ R^{n'} are jointly normal with mean (µ, µ') and covariance matrix K:

p(t, t') ∝ exp( −½ [t − µ; t' − µ']^⊤ [K_tt, K_tt'; K_tt'^⊤, K_t't']^{-1} [t − µ; t' − µ'] )

Inference
Given t, estimate t' via p(t'|t). Translation into machine learning language: we learn t' from t.

Practical Solution
Since t'|t ∼ N(µ̃, K̃), we only need to collect all terms in p(t, t') depending on t' by matrix inversion, hence

K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ [K_tt^{-1}(t − µ)]

where the bracketed term K_tt^{-1}(t − µ) is independent of t' and can be precomputed.

Handbook of Matrices, Lütkepohl 1997 (big timesaver)
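
A small Octave sketch of the conditioning formulas above, with toy placeholder values for the joint mean and covariance blocks (all values assumed):

Ktt   = [1.0, 0.3; 0.3, 1.0];   % K_tt
Ktt2  = [0.8; 0.2];             % K_tt'
Kt2t2 = 1.0;                    % K_t't'
mu = [0; 0];  mu2 = 0;
t  = [0.5; -1.0];               % observed values of t
Ktilde  = Kt2t2 - Ktt2' * (Ktt \ Ktt2);    % predictive variance of t'
mutilde = mu2 + Ktt2' * (Ktt \ (t - mu));  % predictive mean of t'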

SLIDE 19

Mini Summary

  • Normal distribution
    p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp(−½ (x−µ)^⊤ Σ^{-1} (x−µ))
  • Sampling: to draw x ∼ N(µ, Σ), use x = µ + Lz where z ∼ N(0, 1) and Σ = LL^⊤
  • Estimating mean and variance
    µ = (1/m) ∑_{i=1}^{m} x_i   and   Σ = (1/m) ∑_{i=1}^{m} x_i x_i^⊤ − µµ^⊤
  • Conditional distribution is Gaussian, too!
    p(x_2|x_1) ∝ exp( −½ [x_1 − µ_1; x_2 − µ_2]^⊤ [Σ_11, Σ_12; Σ_21, Σ_22]^{-1} [x_1 − µ_1; x_2 − µ_2] )

SLIDE 20

Gaussian Processes

SLIDE 21

Gaussian Process

Key Idea
Instead of a fixed set of random variables t, t' we assume a stochastic process t : X → R, e.g. X = R^n. Previously we had X = {age, height, weight, ...}.

Definition of a Gaussian Process
A stochastic process t : X → R where all (t(x_1), ..., t(x_m)) are normally distributed.

Parameters of a GP
Mean µ(x) := E[t(x)]
Covariance function k(x, x') := Cov(t(x), t(x'))

Simplifying Assumption
We assume knowledge of k(x, x') and set µ = 0.

SLIDE 22

Gaussian Process

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K for those points
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available
SLIDE 23

Gaussian Process

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K for those points
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available
  • Only looks smooth (evaluated at many points)

SLIDE 24

Gaussian Process

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K for those points
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available

p(t|X) = (2π)^{-m/2} |K|^{-1/2} exp( −½ (t − µ)^⊤ K^{-1} (t − µ) )   where K_ij = k(x_i, x_j) and µ_i = µ(x_i)
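
A minimal Octave sketch of sampling a GP at a finite set of points, assuming a squared-exponential kernel and zero mean (both choices are illustrative, not fixed by the slides):

m = 100;
x = linspace(-3, 3, m)';                % points where we want to sample
K = exp(-0.5 * (x - x').^2);            % K_ij = k(x_i, x_j)
L = chol(K + 1e-8 * eye(m), 'lower');   % small jitter keeps the factorization stable
t = L * randn(m, 1);                    % one draw of t ~ N(0, K)
plot(x, t);                             % values at these points only, not f itself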

SLIDE 25

Kernels ...

Covariance Function
  • Function of two arguments
  • Leads to matrix with nonnegative eigenvalues
  • Describes correlation between pairs of observations

Kernel
  • Function of two arguments
  • Leads to matrix with nonnegative eigenvalues
  • Similarity measure between pairs of observations

Lucky Guess
We suspect that kernels and covariance functions are the same ... yes!

SLIDE 26

Mini Summary

  • Gaussian Process
  • Think distribution over function values (not functions)
  • Defined by mean and covariance function
  • Generates vectors of arbitrary dimensionality (via X)
  • Covariance function via kernels

p(t|X) = (2π)^{-m/2} |K|^{-1/2} exp( −½ (t − µ)^⊤ K^{-1} (t − µ) )

SLIDE 27

Gaussian Process Regression

SLIDE 28

p(weight|height) = p(height, weight) / p(height) ∝ p(height, weight)

Gaussian Processes for Inference

X = {height, weight}

SLIDE 29

Joint Gaussian Model

  • Random variables (t, t') are drawn from a GP
  • Observe subset t
  • Predict t' using p(t'|t)
  • Linear expansion (precompute things)
  • Predictive uncertainty is data independent: good for experimental design

K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ [K_tt^{-1}(t − µ)]

p(t, t') ∝ exp( −½ [t − µ; t' − µ']^⊤ [K_tt, K_tt'; K_tt'^⊤, K_t't']^{-1} [t − µ; t' − µ'] )

SLIDE 30

Linear Gaussian Process Regression

Linear kernel: k(x, x') = ⟨x, x'⟩, so the kernel matrix is X^⊤X.

Mean and covariance:

K̃ = X'^⊤X' − X'^⊤X (X^⊤X)^{-1} X^⊤X' = X'^⊤ (1 − P_X) X'
µ̃ = X'^⊤ [X (X^⊤X)^{-1} t]

µ̃ is a linear function of X'.

Problem
The covariance matrix X^⊤X has at most rank n. After n observations (x ∈ R^n) the variance vanishes. This is not realistic. "Flat pancake" or "cigar" distribution. (A small numerical check follows below.)
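
A small Octave check of this rank problem, using a pseudoinverse for the singular kernel matrix (the setup here is a toy assumption):

n = 2;                          % input dimension
X  = randn(n, 5);               % 5 training inputs as columns; rank(X' * X) <= n
x0 = randn(n, 1);               % a test input
vtilde = x0' * x0 - (X' * x0)' * pinv(X' * X) * (X' * x0)
% vtilde is ~0 up to round-off: after n observations the linear-kernel GP
% claims zero predictive variance everywhere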

SLIDE 31

Degenerate Covariance

SLIDE 32

(figure: graphical model x → t)

Degenerate Covariance

SLIDE 33

(figure: graphical model x → t → y)

‘fatten up’ covariance

Degenerate Covariance

SLIDE 34

(figure: graphical model x → t → y)

‘fatten up’ covariance

Degenerate Covariance

SLIDE 35

(figure: graphical model x → t → y)

‘fatten up’ covariance

t ∼ N(µ, K),   y_i ∼ N(t_i, σ²)

Degenerate Covariance

SLIDE 36

Additive Noise

Indirect Model
Instead of observing t(x) we observe y = t(x) + ξ, where ξ is a nuisance term. This yields

p(Y|X) = ∫ ∏_{i=1}^{m} p(y_i|t_i) p(t|X) dt

where we can now find a maximum a posteriori solution for t by maximizing the integrand (we will use this later).

Additive Normal Noise
If ξ ∼ N(0, σ²) then y is the sum of two Gaussian random variables. Means and variances add up:

y ∼ N(µ, K + σ²1)

SLIDE 37

Data

SLIDE 38

Predictive mean: k(x, X)^⊤ (K(X, X) + σ²1)^{-1} y

SLIDE 39

Variance

SLIDE 40

Putting it all together

SLIDE 41

Putting it all together

SLIDE 42

Ugly details

Covariance Matrices
Additive noise: K = K_kernel + σ²1

Predictive mean and variance:

K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = K_tt'^⊤ K_tt^{-1} t

Pointwise prediction with noise:

K̃ = K_t't' + σ²1 − K_tt'^⊤ (K_tt + σ²1)^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ (K_tt + σ²1)^{-1} (y − µ)

SLIDE 43

Pseudocode

ktrtr = k(xtrain, xtrain) + sigma2 * eye(mtr);   % K_tt + sigma^2 1
ktetr = k(xtest, xtrain);                        % cross-covariances, test vs. train
ktete = k(xtest, xtest);                         % K_t't'
alpha = ktrtr \ ytr;         % better if you use cholesky
yte = ktetr * alpha;         % predictive mean
sigmate = ktete + sigma2 * eye(mte) - ...        % predictive covariance
          ktetr * (ktetr / ktrtr)';              % note the minus sign

K̃ = K_t't' + σ²1 − K_tt'^⊤ (K_tt + σ²1)^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ (K_tt + σ²1)^{-1} (y − µ)
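
One possible usage sketch for the pseudocode above; the squared-exponential kernel, toy data, and noise level are illustrative assumptions, not from the slides:

k = @(A, B) exp(-0.5 * (sum(A.^2, 1)' - 2 * A' * B + sum(B.^2, 1)));  % inputs as columns
mtr = 20;  mte = 50;  sigma2 = 0.01;
xtrain = linspace(-3, 3, mtr);               % 1 x mtr
xtest  = linspace(-3, 3, mte);               % 1 x mte
ytr = sin(xtrain') + sqrt(sigma2) * randn(mtr, 1);
% ... then run the lines above to obtain yte and sigmate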

SLIDE 44

The connection between SVM and GP

Gaussian Process on Parameters
t ∼ N(µ, K) where K_ij = k(x_i, x_j)

Linear Model in Feature Space
t(x) = ⟨Φ(x), w⟩ + µ(x) where w ∼ N(0, 1)

The covariance between t(x) and t(x') is then given by

E_w[⟨Φ(x), w⟩⟨w, Φ(x')⟩] = ⟨Φ(x), Φ(x')⟩ = k(x, x')

Conclusion
A linear model in feature space induces a Gaussian Process.
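
A quick Monte Carlo sanity check of this identity in Octave, with an assumed explicit feature map Φ(x) = [x; x²] (purely illustrative):

Phi = @(x) [x; x.^2];               % explicit feature map (assumption)
x1 = 0.5;  x2 = -1.0;
W = randn(2, 100000);               % columns are draws of w ~ N(0, 1)
empirical = mean((Phi(x1)' * W) .* (Phi(x2)' * W))  % ~ <Phi(x1), Phi(x2)>
exact     = Phi(x1)' * Phi(x2)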

SLIDE 45

Mini Summary

  • Latent variables t drawn from a Gaussian Process
  • Observations y are t corrupted with noise
  • Observations y are drawn from Gaussian Process
  • Estimate y’|y,x,x’ (matrix inversion)
  • SVM kernel is GP kernel

µ → µ   and   K → K + σ²1

K̃ = K_t't' + σ²1 − K_tt'^⊤ (K_tt + σ²1)^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ (K_tt + σ²1)^{-1} (y − µ)

SLIDE 46

Gaussian Process Classification

SLIDE 47
  • Regression
  • Data y is scalar
  • Connection to t is by additive noise
  • (Binary) Classification
  • Data y in {-1, 1}
  • Connection to t is by logistic model

Gaussian Process Classification

(figure: graphical model x → t → y)

Regression: t ∼ N(µ, K) and y_i ∼ N(t_i, σ²), i.e. p(y_i|t_i) = (2πσ²)^{-1/2} exp(−(y_i − t_i)²/(2σ²))

Classification: t ∼ N(µ, K) and p(y_i|t_i) = 1/(1 + exp(−y_i t_i))

SLIDE 48

Logistic function

p(y_i|t_i) = 1/(1 + exp(−y_i t_i))

SLIDE 49

Gaussian Process Classification

  • Regression

We can integrate out the latent variable t.

  • Classification

Closed form solution is not possible (we cannot solve the integral in t).

Regression: t ∼ N(µ, K) and y_i ∼ N(t_i, σ²), hence y ∼ N(µ, K + σ²1)

Classification: t ∼ N(µ, K) and y_i ∼ Logistic(t_i)

p(t|y, x) ∝ p(t|x) ∏_{i=1}^{m} p(y_i|t_i) ∝ exp(−½ t^⊤ K^{-1} t) ∏_{i=1}^{m} 1/(1 + exp(−y_i t_i))

SLIDE 50

Gaussian Process Classification

  • What we should do: integrate out t, t'
    p(y'|y, x, x') = ∫ d(t, t') p(y'|t') p(y|t) p(t, t'|x, x')
    But this is very, very expensive (e.g. MCMC)
  • Maximum a Posteriori approximation
  • Find t̂ := argmax_t p(y|t) p(t|x)
  • Ignore correlation in test data (horrible)
  • Find t̂'(x') := argmax_{t'} p(t̂, t'|x, x')
  • Estimate y'|y, x, x' ∼ Logistic(t̂'(x'))

SLIDE 51

Maximum a Posteriori Approximation

  • Step 1 - maximize p(t|y, x), i.e. (see the Newton sketch after this list)
    minimize_t ½ t^⊤ K^{-1} t + ∑_{i=1}^{m} log(1 + exp(−y_i t_i))
  • Step 2 - find t'|t for the MAP estimate of t
    t' = K_tt'^⊤ K_tt^{-1} t   (precompute K_tt^{-1} t)
  • Step 3 - estimate p(y'|t') = 1/(1 + exp(−y' t'))
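
A minimal Newton-method sketch for Step 1 in Octave, assuming the kernel matrix K (m x m) and labels y in {-1, +1} are given; it omits the scaling tricks a real implementation needs:

sigmoid = @(z) 1 ./ (1 + exp(-z));
m = length(y);
t = zeros(m, 1);
for iter = 1:20
  g = K \ t - y .* sigmoid(-y .* t);         % gradient of the Step 1 objective
  W = diag(sigmoid(t) .* (1 - sigmoid(t)));  % curvature of the logistic loss
  t = t - (eye(m) + K * W) \ (K * g);        % Newton step: (K^{-1} + W)^{-1} g = (1 + KW)^{-1} K g
end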

SLIDE 52

Clean Data

SLIDE 53

Noisy Data

SLIDE 54

Connection to SVMs revisited

  • SVM objective
  • Logistic regression objective (MAP estimation)
  • Reparametrize

SVM objective:

minimize_α ½ α^⊤ K α + ∑_{i=1}^{m} max(0, 1 − y_i [Kα]_i)

Logistic regression objective (MAP estimation):

minimize_t ½ t^⊤ K^{-1} t + ∑_{i=1}^{m} log(1 + exp(−y_i t_i))

Reparametrize with α = K^{-1} t:

minimize_α ½ α^⊤ K α + ∑_{i=1}^{m} log(1 + exp(−y_i [Kα]_i))

SLIDE 55

More loss functions

  • Logistic
  • Huberized loss
  • Soft margin

Logistic: log[1 + exp(−f(x))]   (asymptotically linear for f(x) → −∞, asymptotically 0 for f(x) → +∞)

Huberized loss:
  0                 if f(x) > 1
  ½ (1 − f(x))²     if f(x) ∈ [0, 1]
  ½ − f(x)          if f(x) < 0

Soft margin: max(0, 1 − f(x))
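
A short Octave sketch evaluating the three losses over a range of margin values (names and range are arbitrary):

f = linspace(-2, 2, 401)';
soft  = max(0, 1 - f);                            % soft margin (hinge)
logi  = log(1 + exp(-f));                         % logistic
huber = (f < 0) .* (0.5 - f) + ...                % huberized: linear branch
        (f >= 0 & f <= 1) .* 0.5 .* (1 - f).^2;   % quadratic branch; 0 for f > 1
plot(f, [soft, logi, huber]);
legend('soft margin', 'logistic', 'huberized');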

SLIDE 56

Mini Summary

  • Latent variables t drawn from a Gaussian Process
  • Observations y drawn from a logistic model
  • Impossible to integrate out the latent variables
  • Maximum a posteriori inference (with many hacks to make it scale)
  • Optimization problem is similar to SVM (different loss, and reparametrized via α = K^{-1} t)
  • Advanced topic - adjusting K via prior on k