

SLIDE 1

Introduction to Machine Learning

  • 12. Gaussian Processes

Alex Smola Carnegie Mellon University

http://alex.smola.org/teaching/cmu2013-10-701 10-701

SLIDE 2

The Normal Distribution

http://www.gaussianprocess.org/gpml/chapters/

SLIDE 3

The Normal Distribution

SLIDE 4

Gaussians in Space

SLIDE 5

Gaussians in Space

samples in R^2

SLIDE 6

The Normal Distribution

  • Density for scalar variables
  • Density in d dimensions
  • Principal components
  • Eigenvalue decomposition
  • Product representation

p(x) = (2πσ²)^{-1/2} exp(−(x−µ)²/(2σ²))   (scalar)

p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp(−½ (x−µ)^⊤ Σ^{-1} (x−µ))   (d dimensions)

Σ = U^⊤ Λ U   (eigenvalue decomposition)

p(x) = (2π)^{-d/2} |Λ|^{-1/2} exp(−½ (U(x−µ))^⊤ Λ^{-1} U(x−µ))
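
A small Octave sketch evaluating the d-dimensional log-density directly from the formula above (mu, Sigma, and x are toy placeholder values, not from the slides):

% log of p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)' inv(Sigma) (x-mu))
d = 2;
mu = [0; 0];
Sigma = [2, 0.5; 0.5, 1];
x = [1; -1];
logp = -0.5 * d * log(2*pi) - 0.5 * log(det(Sigma)) ...
       - 0.5 * (x - mu)' * (Sigma \ (x - mu))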

SLIDE 7

The Normal Distribution

Σ = U^⊤ Λ U   (principal components)

Product representation over the principal components:

p(x) = (2π)^{-d/2} ∏_{i=1}^{d} Λ_{ii}^{-1/2} exp(−½ (U(x−µ))^⊤ Λ^{-1} U(x−µ))

SLIDE 8

Why do we care?

  • Central limit theorem shows that in the limit all averages behave like Gaussians
  • Easy to estimate parameters (MLE)
  • Distribution with largest uncertainty (entropy) for a given mean and covariance
  • Works well even if the assumptions are wrong

µ = (1/m) ∑_{i=1}^{m} x_i   and   Σ = (1/m) ∑_{i=1}^{m} x_i x_i^⊤ − µµ^⊤

SLIDE 9

Why do we care?

  • Central limit theorem shows that in the limit all averages behave like Gaussians
  • Easy to estimate parameters (MLE)

% X: d x m data matrix, m: sample size
mu = (1/m) * sum(X, 2);              % mean: average over the m columns
sigma = (1/m) * (X * X') - mu * mu'; % covariance: E[x x'] - mu mu'

µ = (1/m) ∑_{i=1}^{m} x_i   and   Σ = (1/m) ∑_{i=1}^{m} x_i x_i^⊤ − µµ^⊤

SLIDE 10

Sampling from a Gaussian

  • Case 1 - We have a normal distribution generator (randn) for z ∼ N(0, 1)
  • We want x ∼ N(µ, Σ)
  • Recipe: x = µ + Lz where z ∼ N(0, 1) and Σ = LL^⊤ (see the sketch after this list)
  • Proof: E[(x − µ)(x − µ)^⊤] = E[Lzz^⊤L^⊤] = L E[zz^⊤] L^⊤ = LL^⊤ = Σ
  • Case 2 - Box-Muller transform for U[0, 1]

p(x) = (2π)^{-1} exp(−½ ‖x‖²)   ⇒   p(φ, r) = (2π)^{-1} r exp(−½ r²)

F(φ, r) = (φ/2π) · [1 − exp(−½ r²)]
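
A minimal Octave sketch of the Case 1 recipe (mu and Sigma are toy placeholder values):

mu = [1; 2];
Sigma = [2, 0.5; 0.5, 1];
L = chol(Sigma, 'lower');   % Cholesky factor, Sigma = L * L'
z = randn(2, 1);            % z ~ N(0, 1) componentwise
x = mu + L * z;             % x ~ N(mu, Sigma)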

SLIDE 11

Sampling from a Gaussian

p(x) = (2π)^{-1} exp(−½ ‖x‖²)   ⇒   p(φ, r) = (2π)^{-1} r exp(−½ r²)

F(φ, r) = (φ/2π) · [1 − exp(−½ r²)]

(figure: the two-dimensional Gaussian in polar coordinates, axes r and φ)

SLIDE 12

Sampling from a Gaussian

  • Cumulative distribution function

Draw the radial and angle components separately:

tmp1 = rand();              % uniform draw for the radius
tmp2 = rand();              % uniform draw for the angle
r = sqrt(-2 * log(tmp1));   % inverts 1 - exp(-r^2/2)
phi = 2 * pi * tmp2;        % inverts phi/(2*pi)
x1 = r * sin(phi);
x2 = r * cos(phi);

F(φ, r) = (φ/2π) · [1 − exp(−½ r²)]

SLIDE 13

Sampling from a Gaussian

  • Cumulative distribution function

Draw the radial and angle components separately:

tmp1 = rand();              % uniform draw for the radius
tmp2 = rand();              % uniform draw for the angle
r = sqrt(-2 * log(tmp1));   % inverts 1 - exp(-r^2/2)
phi = 2 * pi * tmp2;        % inverts phi/(2*pi)
x1 = r * sin(phi);
x2 = r * cos(phi);

F(φ, r) = (φ/2π) · [1 − exp(−½ r²)]

Why can we use tmp1 instead of 1-tmp1?

SLIDE 14

Example: correlating weight and height

SLIDE 15

Example: correlating weight and height

assume Gaussian correlation

SLIDE 16

p(weight|height) = p(height, weight) / p(height) ∝ p(height, weight)

SLIDE 17

p(x_2|x_1) ∝ exp( −½ [x_1 − µ_1; x_2 − µ_2]^⊤ [Σ_11, Σ_12; Σ_21, Σ_22]^{-1} [x_1 − µ_1; x_2 − µ_2] )

keep linear and quadratic terms of the exponent

SLIDE 18

The gory math

Correlated Observations
Assume that the random variables t ∈ R^n, t' ∈ R^{n'} are jointly normal with mean (µ, µ') and covariance matrix K:

p(t, t') ∝ exp( −½ [t − µ; t' − µ']^⊤ [K_tt, K_tt'; K_tt'^⊤, K_t't']^{-1} [t − µ; t' − µ'] )

Inference
Given t, estimate t' via p(t'|t). Translation into machine learning language: we learn t' from t.

Practical Solution
Since t'|t ∼ N(µ̃, K̃), we only need to collect all terms in p(t, t') depending on t' by matrix inversion, hence

K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ [K_tt^{-1}(t − µ)]

where the bracketed term K_tt^{-1}(t − µ) is independent of t' and can be precomputed.

Handbook of Matrices, Lütkepohl 1997 (big timesaver)
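
A small Octave sketch of the conditioning formulas above, with toy placeholder values for the joint mean and covariance blocks (all values assumed):

Ktt   = [1.0, 0.3; 0.3, 1.0];   % K_tt
Ktt2  = [0.8; 0.2];             % K_tt'
Kt2t2 = 1.0;                    % K_t't'
mu = [0; 0];  mu2 = 0;
t  = [0.5; -1.0];               % observed values of t
Ktilde  = Kt2t2 - Ktt2' * (Ktt \ Ktt2);    % predictive variance of t'
mutilde = mu2 + Ktt2' * (Ktt \ (t - mu));  % predictive mean of t'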

SLIDE 19

Mini Summary

  • Normal distribution
    p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp(−½ (x−µ)^⊤ Σ^{-1} (x−µ))
  • Sampling: to draw x ∼ N(µ, Σ), use x = µ + Lz where z ∼ N(0, 1) and Σ = LL^⊤
  • Estimating mean and variance
    µ = (1/m) ∑_{i=1}^{m} x_i   and   Σ = (1/m) ∑_{i=1}^{m} x_i x_i^⊤ − µµ^⊤
  • Conditional distribution is Gaussian, too!
    p(x_2|x_1) ∝ exp( −½ [x_1 − µ_1; x_2 − µ_2]^⊤ [Σ_11, Σ_12; Σ_21, Σ_22]^{-1} [x_1 − µ_1; x_2 − µ_2] )

SLIDE 20

Gaussian Processes

SLIDE 21

Gaussian Process

Key Idea
Instead of a fixed set of random variables t, t' we assume a stochastic process t : X → R, e.g. X = R^n. Previously we had X = {age, height, weight, ...}.

Definition of a Gaussian Process
A stochastic process t : X → R where all (t(x_1), ..., t(x_m)) are normally distributed.

Parameters of a GP
Mean µ(x) := E[t(x)]
Covariance function k(x, x') := Cov(t(x), t(x'))

Simplifying Assumption
We assume knowledge of k(x, x') and set µ = 0.

SLIDE 22

Gaussian Process

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K for those points
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available
SLIDE 23

Gaussian Process

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K for those points
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available
  • Only looks smooth (evaluated at many points)

SLIDE 24

Gaussian Process

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K for those points
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available

p(t|X) = (2π)^{-m/2} |K|^{-1/2} exp( −½ (t − µ)^⊤ K^{-1} (t − µ) )   where K_ij = k(x_i, x_j) and µ_i = µ(x_i)
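
A minimal Octave sketch of sampling a GP at a finite set of points, assuming a squared-exponential kernel and zero mean (both choices are illustrative, not fixed by the slides):

m = 100;
x = linspace(-3, 3, m)';                % points where we want to sample
K = exp(-0.5 * (x - x').^2);            % K_ij = k(x_i, x_j)
L = chol(K + 1e-8 * eye(m), 'lower');   % small jitter keeps the factorization stable
t = L * randn(m, 1);                    % one draw of t ~ N(0, K)
plot(x, t);                             % values at these points only, not f itself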

SLIDE 25

Kernels ...

Covariance Function
  • Function of two arguments
  • Leads to matrix with nonnegative eigenvalues
  • Describes correlation between pairs of observations

Kernel
  • Function of two arguments
  • Leads to matrix with nonnegative eigenvalues
  • Similarity measure between pairs of observations

Lucky Guess
We suspect that kernels and covariance functions are the same ... yes!

SLIDE 26

Mini Summary

  • Gaussian Process
  • Think distribution over function values (not functions)
  • Defined by mean and covariance function
  • Generates vectors of arbitrary dimensionality (via X)
  • Covariance function via kernels

p(t|X) = (2π)^{-m/2} |K|^{-1/2} exp( −½ (t − µ)^⊤ K^{-1} (t − µ) )

SLIDE 27

Gaussian Process Regression

SLIDE 28

p(weight|height) = p(height, weight) / p(height) ∝ p(height, weight)

Gaussian Processes for Inference

X = {height, weight}

SLIDE 29

Joint Gaussian Model

  • Random variables (t, t') are drawn from a GP
  • Observe subset t
  • Predict t' using p(t'|t)
  • Linear expansion (precompute things)
  • Predictive uncertainty is data independent: good for experimental design

K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ [K_tt^{-1}(t − µ)]

p(t, t') ∝ exp( −½ [t − µ; t' − µ']^⊤ [K_tt, K_tt'; K_tt'^⊤, K_t't']^{-1} [t − µ; t' − µ'] )

SLIDE 30

Linear Gaussian Process Regression

Linear kernel: k(x, x') = ⟨x, x'⟩, so the kernel matrix is X^⊤X.

Mean and covariance:

K̃ = X'^⊤X' − X'^⊤X (X^⊤X)^{-1} X^⊤X' = X'^⊤ (1 − P_X) X'
µ̃ = X'^⊤ [X (X^⊤X)^{-1} t]

µ̃ is a linear function of X'.

Problem
The covariance matrix X^⊤X has at most rank n. After n observations (x ∈ R^n) the variance vanishes. This is not realistic. "Flat pancake" or "cigar" distribution. (A small numerical check follows below.)
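
A small Octave check of this rank problem, using a pseudoinverse for the singular kernel matrix (the setup here is a toy assumption):

n = 2;                          % input dimension
X  = randn(n, 5);               % 5 training inputs as columns; rank(X' * X) <= n
x0 = randn(n, 1);               % a test input
vtilde = x0' * x0 - (X' * x0)' * pinv(X' * X) * (X' * x0)
% vtilde is ~0 up to round-off: after n observations the linear-kernel GP
% claims zero predictive variance everywhere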

SLIDE 31

Degenerate Covariance

SLIDE 32

(figure: graphical model x → t)

Degenerate Covariance

SLIDE 33

(figure: graphical model x → t → y)

‘fatten up’ covariance

Degenerate Covariance

SLIDE 34

(figure: graphical model x → t → y)

‘fatten up’ covariance

Degenerate Covariance

SLIDE 35

(figure: graphical model x → t → y)

‘fatten up’ covariance

t ∼ N(µ, K),   y_i ∼ N(t_i, σ²)

Degenerate Covariance

SLIDE 36

Additive Noise

Indirect Model
Instead of observing t(x) we observe y = t(x) + ξ, where ξ is a nuisance term. This yields

p(Y|X) = ∫ ∏_{i=1}^{m} p(y_i|t_i) p(t|X) dt

where we can now find a maximum a posteriori solution for t by maximizing the integrand (we will use this later).

Additive Normal Noise
If ξ ∼ N(0, σ²) then y is the sum of two Gaussian random variables. Means and variances add up:

y ∼ N(µ, K + σ²1)

SLIDE 37

Data

SLIDE 38

Predictive mean: k(x, X)^⊤ (K(X, X) + σ²1)^{-1} y

SLIDE 39

Variance

SLIDE 40

Putting it all together

SLIDE 41

Putting it all together

SLIDE 42

Ugly details

Covariance Matrices
Additive noise: K = K_kernel + σ²1

Predictive mean and variance:

K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = K_tt'^⊤ K_tt^{-1} t

Pointwise prediction with noise:

K̃ = K_t't' + σ²1 − K_tt'^⊤ (K_tt + σ²1)^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ (K_tt + σ²1)^{-1} (y − µ)

SLIDE 43

Pseudocode

ktrtr = k(xtrain, xtrain) + sigma2 * eye(mtr);   % K_tt + sigma^2 1
ktetr = k(xtest, xtrain);                        % cross-covariances, test vs. train
ktete = k(xtest, xtest);                         % K_t't'
alpha = ktrtr \ ytr;         % better if you use cholesky
yte = ktetr * alpha;         % predictive mean
sigmate = ktete + sigma2 * eye(mte) - ...        % predictive covariance
          ktetr * (ktetr / ktrtr)';              % note the minus sign

K̃ = K_t't' + σ²1 − K_tt'^⊤ (K_tt + σ²1)^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ (K_tt + σ²1)^{-1} (y − µ)
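
One possible usage sketch for the pseudocode above; the squared-exponential kernel, toy data, and noise level are illustrative assumptions, not from the slides:

k = @(A, B) exp(-0.5 * (sum(A.^2, 1)' - 2 * A' * B + sum(B.^2, 1)));  % inputs as columns
mtr = 20;  mte = 50;  sigma2 = 0.01;
xtrain = linspace(-3, 3, mtr);               % 1 x mtr
xtest  = linspace(-3, 3, mte);               % 1 x mte
ytr = sin(xtrain') + sqrt(sigma2) * randn(mtr, 1);
% ... then run the lines above to obtain yte and sigmate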

SLIDE 44

The connection between SVM and GP

Gaussian Process on Parameters
t ∼ N(µ, K) where K_ij = k(x_i, x_j)

Linear Model in Feature Space
t(x) = ⟨Φ(x), w⟩ + µ(x) where w ∼ N(0, 1)

The covariance between t(x) and t(x') is then given by

E_w[⟨Φ(x), w⟩⟨w, Φ(x')⟩] = ⟨Φ(x), Φ(x')⟩ = k(x, x')

Conclusion
A linear model in feature space induces a Gaussian Process.
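
A quick Monte Carlo sanity check of this identity in Octave, with an assumed explicit feature map Φ(x) = [x; x²] (purely illustrative):

Phi = @(x) [x; x.^2];               % explicit feature map (assumption)
x1 = 0.5;  x2 = -1.0;
W = randn(2, 100000);               % columns are draws of w ~ N(0, 1)
empirical = mean((Phi(x1)' * W) .* (Phi(x2)' * W))  % ~ <Phi(x1), Phi(x2)>
exact     = Phi(x1)' * Phi(x2)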

SLIDE 45

Mini Summary

  • Latent variables t drawn from a Gaussian Process
  • Observations y are t corrupted with noise
  • Observations y are drawn from Gaussian Process
  • Estimate y’|y,x,x’ (matrix inversion)
  • SVM kernel is GP kernel

µ → µ   and   K → K + σ²1

K̃ = K_t't' + σ²1 − K_tt'^⊤ (K_tt + σ²1)^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ (K_tt + σ²1)^{-1} (y − µ)

SLIDE 46

Gaussian Process Classification

SLIDE 47
  • Regression
  • Data y is scalar
  • Connection to t is by additive noise
  • (Binary) Classification
  • Data y in {-1, 1}
  • Connection to t is by logistic model

Gaussian Process Classification

(figure: graphical model x → t → y)

Regression: t ∼ N(µ, K) and y_i ∼ N(t_i, σ²), i.e. p(y_i|t_i) = (2πσ²)^{-1/2} exp(−(y_i − t_i)²/(2σ²))

Classification: t ∼ N(µ, K) and p(y_i|t_i) = 1/(1 + exp(−y_i t_i))

SLIDE 48

Logistic function

p(y_i|t_i) = 1/(1 + exp(−y_i t_i))

SLIDE 49

Gaussian Process Classification

  • Regression

We can integrate out the latent variable t.

  • Classification

Closed form solution is not possible (we cannot solve the integral in t).

Regression: t ∼ N(µ, K) and y_i ∼ N(t_i, σ²), hence y ∼ N(µ, K + σ²1)

Classification: t ∼ N(µ, K) and y_i ∼ Logistic(t_i)

p(t|y, x) ∝ p(t|x) ∏_{i=1}^{m} p(y_i|t_i) ∝ exp(−½ t^⊤ K^{-1} t) ∏_{i=1}^{m} 1/(1 + exp(−y_i t_i))

SLIDE 50

Gaussian Process Classification

  • What we should do: integrate out t, t'
    p(y'|y, x, x') = ∫ d(t, t') p(y'|t') p(y|t) p(t, t'|x, x')
    But this is very, very expensive (e.g. MCMC)
  • Maximum a Posteriori approximation
  • Find t̂ := argmax_t p(y|t) p(t|x)
  • Ignore correlation in test data (horrible)
  • Find t̂'(x') := argmax_{t'} p(t̂, t'|x, x')
  • Estimate y'|y, x, x' ∼ Logistic(t̂'(x'))

SLIDE 51

Maximum a Posteriori Approximation

  • Step 1 - maximize p(t|y, x), i.e. (see the Newton sketch after this list)
    minimize_t ½ t^⊤ K^{-1} t + ∑_{i=1}^{m} log(1 + exp(−y_i t_i))
  • Step 2 - find t'|t for the MAP estimate of t
    t' = K_tt'^⊤ K_tt^{-1} t   (precompute K_tt^{-1} t)
  • Step 3 - estimate p(y'|t') = 1/(1 + exp(−y' t'))
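
A minimal Newton-method sketch for Step 1 in Octave, assuming the kernel matrix K (m x m) and labels y in {-1, +1} are given; it omits the scaling tricks a real implementation needs:

sigmoid = @(z) 1 ./ (1 + exp(-z));
m = length(y);
t = zeros(m, 1);
for iter = 1:20
  g = K \ t - y .* sigmoid(-y .* t);         % gradient of the Step 1 objective
  W = diag(sigmoid(t) .* (1 - sigmoid(t)));  % curvature of the logistic loss
  t = t - (eye(m) + K * W) \ (K * g);        % Newton step: (K^{-1} + W)^{-1} g = (1 + KW)^{-1} K g
end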

SLIDE 52

Clean Data

SLIDE 53

Noisy Data

SLIDE 54

Connection to SVMs revisited

  • SVM objective
  • Logistic regression objective (MAP estimation)
  • Reparametrize

SVM objective:

minimize_α ½ α^⊤ K α + ∑_{i=1}^{m} max(0, 1 − y_i [Kα]_i)

Logistic regression objective (MAP estimation):

minimize_t ½ t^⊤ K^{-1} t + ∑_{i=1}^{m} log(1 + exp(−y_i t_i))

Reparametrize with α = K^{-1} t:

minimize_α ½ α^⊤ K α + ∑_{i=1}^{m} log(1 + exp(−y_i [Kα]_i))

SLIDE 55

More loss functions

  • Logistic
  • Huberized loss
  • Soft margin

Logistic: log[1 + exp(−f(x))]   (asymptotically linear for f(x) → −∞, asymptotically 0 for f(x) → +∞)

Huberized loss:
  0                 if f(x) > 1
  ½ (1 − f(x))²     if f(x) ∈ [0, 1]
  ½ − f(x)          if f(x) < 0

Soft margin: max(0, 1 − f(x))
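
A short Octave sketch evaluating the three losses over a range of margin values (names and range are arbitrary):

f = linspace(-2, 2, 401)';
soft  = max(0, 1 - f);                            % soft margin (hinge)
logi  = log(1 + exp(-f));                         % logistic
huber = (f < 0) .* (0.5 - f) + ...                % huberized: linear branch
        (f >= 0 & f <= 1) .* 0.5 .* (1 - f).^2;   % quadratic branch; 0 for f > 1
plot(f, [soft, logi, huber]);
legend('soft margin', 'logistic', 'huberized');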

SLIDE 56

Mini Summary

  • Latent variables t drawn from a Gaussian Process
  • Observations y drawn from a logistic model
  • Impossible to integrate out the latent variables
  • Maximum a posteriori inference (with many hacks to make it scale)
  • Optimization problem is similar to SVM (different loss, and reparametrized via α = K^{-1} t)
  • Advanced topic - adjusting K via prior on k