6.1 Gaussians, 6 Bayesian Kernel Methods, Alexander Smola

SLIDE 1

6.1 Gaussians

6 Bayesian Kernel Methods

Alexander Smola Introduction to Machine Learning 10-701 http://alex.smola.org/teaching/10-701-15

SLIDE 2

Normal Distribution

http://www.gaussianprocess.org/gpml/chapters/

SLIDE 3

The Normal Distribution

SLIDE 4

Gaussians in Space

SLIDE 5

Gaussians in Space

samples in R^2

SLIDE 6

The Normal Distribution

  • Density for scalar variables


  • Density in d dimensions
  • Principal components
  • Eigenvalue decomposition
  • Product representation

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}

\Sigma = U^\top \Lambda U, \qquad p(x) = (2\pi)^{-d/2}\, e^{-\frac{1}{2}(U(x-\mu))^\top \Lambda^{-1} U(x-\mu)}

p(x) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)}

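As a quick numerical sanity check, the eigendecomposition form and the direct d-dimensional density can be compared on a tiny made-up example (a sketch in the MATLAB-style pseudocode used later in the deck; the numbers are arbitrary):

 % Sketch: evaluate the d-dimensional Gaussian density two ways (toy numbers).
 mu    = [0; 1];
 Sigma = [2 0.5; 0.5 1];
 x     = [1; 1];
 d     = length(mu);

 [V, Lambda] = eig(Sigma);        % Sigma = V * Lambda * V'
 U = V';                          % so that Sigma = U' * Lambda * U
 z = U * (x - mu);                % rotate into the principal axes

 p_eig    = (2*pi)^(-d/2) * prod(diag(Lambda))^(-1/2) * exp(-0.5 * z' * (Lambda \ z));
 p_direct = (2*pi)^(-d/2) * det(Sigma)^(-1/2) * exp(-0.5 * (x-mu)' * (Sigma \ (x-mu)));
 % p_eig and p_direct agree up to rounding error.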
SLIDE 7

Recall - Gaussian is in the Exponential Family

  • Binomial Distribution
  • Discrete Distribution


(ex is unit vector for x)

  • Gaussian
  • Poisson (counting measure 1/x!)
  • Dirichlet, Beta, Gamma,

Wishart, ...

\phi(x) = x \;\text{(Binomial)}, \qquad \phi(x) = e_x \;\text{(Discrete)}, \qquad \phi(x) = \left(x, \tfrac{1}{2}xx^\top\right) \;\text{(Gaussian)}, \qquad \phi(x) = x \;\text{(Poisson)}

SLIDE 8

Recall - Gaussian is in the Exponential Family

  • Binomial Distribution
  • Discrete Distribution


(ex is unit vector for x)

  • Gaussian
  • Poisson (counting measure 1/x!)
  • Dirichlet, Beta, Gamma,

Wishart, ...

\phi(x) = x \;\text{(Binomial)}, \qquad \phi(x) = e_x \;\text{(Discrete)}, \qquad \phi(x) = \left(x, \tfrac{1}{2}xx^\top\right) \;\text{(Gaussian)}, \qquad \phi(x) = x \;\text{(Poisson)}

-\partial_\theta \log p(X; \theta) = m\left[\mathbf{E}[\phi(x)] - \frac{1}{m}\sum_{i=1}^{m}\phi(x_i)\right]

SLIDE 9

The Normal Distribution

\Sigma = U^\top \Lambda U \qquad \text{(principal components)}

p(x) = (2\pi)^{-d/2} \prod_{i=1}^{d} \Lambda_{ii}^{-1/2}\; e^{-\frac{1}{2}(U(x-\mu))^\top \Lambda^{-1} U(x-\mu)}

SLIDE 10

Why do we care?

  • Central limit theorem shows that in the limit all

averages behave like Gaussians

  • Easy to estimate parameters (MLE)



 


  • Distribution with largest uncertainty (entropy) for

a given mean and covariance.

  • Works well even if the assumptions are wrong

\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad\text{and}\qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top - \mu\mu^\top

SLIDE 11

Why do we care?

  • Central limit theorem shows that in the limit all

averages behave like Gaussians

  • Easy to estimate parameters (MLE)



 
 
 % X: data matrix (d x m), m: sample size
 mu = (1/m)*sum(X,2)
 sigma = (1/m)*(X*X') - mu*mu'

\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad\text{and}\qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top - \mu\mu^\top

SLIDE 12

Sampling from a Gaussian

  • Case 1 - We have a standard normal distribution (randn)
  • We want x ∼ N(µ, Σ)
  • Recipe: x = µ + Lz where z ∼ N(0, 1) and Σ = LLᵀ
  • Proof:
  • Case 2 - Box-Müller transform for U[0,1]

x \sim \mathcal{N}(\mu, \Sigma): \quad x = \mu + Lz \;\text{ where }\; z \sim \mathcal{N}(0, \mathbf{1}) \;\text{ and }\; \Sigma = LL^\top

\mathbf{E}\left[(x-\mu)(x-\mu)^\top\right] = \mathbf{E}\left[Lzz^\top L^\top\right] = L\,\mathbf{E}\left[zz^\top\right]L^\top = LL^\top = \Sigma

p(x) = \frac{1}{2\pi} e^{-\frac{1}{2}\|x\|^2} \;\Rightarrow\; p(\phi, r) = \frac{1}{2\pi}\, r\, e^{-\frac{1}{2}r^2}, \qquad F(\phi, r) = \frac{\phi}{2\pi}\left[1 - e^{-\frac{1}{2}r^2}\right]

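A minimal sketch of Case 1 (the toy mu and Sigma below are made-up numbers; chol gives a lower-triangular L with Σ = LLᵀ):

 % Sketch: sample from N(mu, Sigma) via x = mu + L*z with Sigma = L*L'.
 mu    = [0; 1];
 Sigma = [2 0.5; 0.5 1];
 m     = 1000;                      % number of samples

 L = chol(Sigma, 'lower');          % Sigma = L * L'
 Z = randn(2, m);                   % standard normal draws
 X = mu + L * Z;                    % each column is a draw from N(mu, Sigma)

 % Empirical check: sample mean and covariance approach mu and Sigma.
 mu_hat    = mean(X, 2);
 Sigma_hat = (X - mu_hat) * (X - mu_hat)' / m;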
SLIDE 13

Sampling from a Gaussian

p(x) = \frac{1}{2\pi} e^{-\frac{1}{2}\|x\|^2} \;\Rightarrow\; p(\phi, r) = \frac{1}{2\pi}\, r\, e^{-\frac{1}{2}r^2}, \qquad F(\phi, r) = \frac{\phi}{2\pi}\left[1 - e^{-\frac{1}{2}r^2}\right]

[Figure: samples of a 2D standard normal shown in polar coordinates (r, φ)]

SLIDE 14

Sampling from a Gaussian

  • Cumulative distribution function



 
 Draw radial and angle component separately
 
 tmp1 = rand()
 tmp2 = rand()
 r = sqrt(-2*log(tmp1))
 x1 = r*sin(2*pi*tmp2)
 x2 = r*cos(2*pi*tmp2)


F(\phi, r) = \frac{\phi}{2\pi}\left[1 - e^{-\frac{1}{2}r^2}\right]

SLIDE 15

Sampling from a Gaussian

  • Cumulative distribution function



 
 Draw radial and angle component separately
 
 tmp1 = rand()
 tmp2 = rand()
 r = sqrt(-2*log(tmp1))
 x1 = r*sin(2*pi*tmp2)
 x2 = r*cos(2*pi*tmp2)


F(\phi, r) = \frac{\phi}{2\pi}\left[1 - e^{-\frac{1}{2}r^2}\right]

Why can we use tmp1 instead of 1-tmp1?

SLIDE 16

Principal Component Analysis

SLIDE 17

Data Visualization

  • 53 blood and urine measurements from 65 people
  • Difficult to see the correlations between features

        H-WBC   H-RBC   H-Hgb   H-Hct   H-MCV   H-MCH   H-MCHC
  A1     8.0    4.82    14.1    41.0    85.0    29.0    34.0
  A2     7.3    5.02    14.7    43.0    86.0    29.0    34.0
  A3     4.3    4.48    14.1    41.0    91.0    32.0    35.0
  A4     7.5    4.47    14.9    45.0   101.0    33.0    33.0
  A5     7.3    5.52    15.4    46.0    84.0    28.0    33.0
  A6     6.9    4.86    16.0    47.0    97.0    33.0    34.0
  A7     7.8    4.68    14.7    43.0    92.0    31.0    34.0
  A8     8.6    4.82    15.8    42.0    88.0    33.0    37.0
  A9     5.1    4.71    14.0    43.0    92.0    30.0    32.0

  (rows: instances, one per person; columns: features)

SLIDE 18

[Plot: measurement value vs. measurement index, one curve per person]

Data Visualization

  • Spectral format (65 curves, one for each person)
  • Difficult to compare different patients
SLIDE 19

Data Visualization

[Plot: H-Bands value per person (x-axis: Person, y-axis: H-Bands)]

One plot per person …

SLIDE 20

Data Visualization

[Plots: bi-variate scatter of C-Triglycerides vs. C-LDH, and tri-variate scatter of C-Triglycerides, C-LDH, M-EPI]

Even 3 dimensions are already difficult. How to extend this?

SLIDE 21

Compact Summaries via PCA

\operatorname*{minimize}_{\operatorname{rank} P = k}\; \frac{1}{m}\sum_{i=1}^{m}\|x_i - Px_i\|^2 \qquad \text{where } \frac{1}{m}\sum_{i=1}^{m} x_i = \mu \;\;\text{(assume centering)}

\operatorname{tr}\left[\frac{1}{m}\sum_{i=1}^{m}\left(x_i x_i^\top - P x_i x_i^\top P^\top\right)\right] = \operatorname{tr}\Sigma - \operatorname{tr} P\Sigma P^\top \qquad\text{so we maximize } \operatorname{tr} P\Sigma P^\top

SLIDE 22

Compact Summaries via PCA

  • Is there a representation better than the coordinate axes?
  • Is it really necessary to show all the 53 dimensions?
  • What if there are strong correlations between features?
  • What if there’s some additive noise?

\operatorname*{minimize}_{\operatorname{rank} P = k}\; \frac{1}{m}\sum_{i=1}^{m}\|x_i - Px_i\|^2 \qquad \text{where } \frac{1}{m}\sum_{i=1}^{m} x_i = \mu \;\;\text{(assume centering)}

\operatorname{tr}\left[\frac{1}{m}\sum_{i=1}^{m}\left(x_i x_i^\top - P x_i x_i^\top P^\top\right)\right] = \operatorname{tr}\Sigma - \operatorname{tr} P\Sigma P^\top \qquad\text{so we maximize } \operatorname{tr} P\Sigma P^\top

SLIDE 23

Compact Summaries via PCA

  • Is there a representation better than the coordinate axes?
  • Is it really necessary to show all the 53 dimensions?
  • What if there are strong correlations between features?
  • What if there’s some additive noise?
  • Find the smallest subspace that keeps most information

\operatorname*{minimize}_{\operatorname{rank} P = k}\; \frac{1}{m}\sum_{i=1}^{m}\|x_i - Px_i\|^2 \qquad \text{where } \frac{1}{m}\sum_{i=1}^{m} x_i = \mu \;\;\text{(assume centering)}

\operatorname{tr}\left[\frac{1}{m}\sum_{i=1}^{m}\left(x_i x_i^\top - P x_i x_i^\top P^\top\right)\right] = \operatorname{tr}\Sigma - \operatorname{tr} P\Sigma P^\top \qquad\text{so we maximize } \operatorname{tr} P\Sigma P^\top

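A minimal sketch of this recipe (the d x m data layout with one column per instance and the toy numbers are illustrative assumptions):

 % Sketch: rank-k PCA via eigendecomposition of the sample covariance.
 X = randn(5, 200);                 % toy data: d = 5 features, m = 200 instances
 k = 2;                             % target dimension

 mu    = mean(X, 2);
 Xc    = X - mu;                    % center the data
 Sigma = Xc * Xc' / size(X, 2);     % sample covariance

 [U, Lambda] = eig(Sigma);
 [~, order]  = sort(diag(Lambda), 'descend');
 Uk = U(:, order(1:k));             % top-k principal directions

 Z    = Uk' * Xc;                   % k x m compact representation
 Xrec = mu + Uk * Z;                % reconstruction P*x = Uk*Uk'*(x - mu) + mu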
SLIDE 24

Compact Summaries via PCA

Maximizing tr PΣPᵀ minimizes the residual:

\text{Residual} = \operatorname{tr}\Sigma - \operatorname{tr} P\Sigma P^\top = \sum_{i=1}^{d}\sigma_i^2 - \sum_{i=1}^{k}\sigma_i^2 = \sum_{i=k+1}^{d}\sigma_i^2

x = z + \epsilon \;\text{ where }\; z \sim \mathcal{N}(\mu, \Sigma) \;\text{ and }\; \epsilon \sim \mathcal{N}(0, \sigma^2\mathbf{1}) \quad\Rightarrow\quad \text{joint covariance } \Sigma + \sigma^2\mathbf{1}, \;\text{ joint eigenvalues } \sigma_i^2 + \sigma^2

SLIDE 25

Compact Summaries via PCA

  • Subspace optimization
  • Signal to Noise ratio optimization
  • Assume data x is generated with additive noise: x = z + ε
  • Joint covariance matrix is Σ + σ²1
  • Joint eigenvalues are σᵢ² + σ², so we can ignore
    everything below the noise threshold

Maximizing tr PΣPᵀ minimizes the residual:

\text{Residual} = \operatorname{tr}\Sigma - \operatorname{tr} P\Sigma P^\top = \sum_{i=1}^{d}\sigma_i^2 - \sum_{i=1}^{k}\sigma_i^2 = \sum_{i=k+1}^{d}\sigma_i^2

x = z + \epsilon \;\text{ where }\; z \sim \mathcal{N}(\mu, \Sigma) \;\text{ and }\; \epsilon \sim \mathcal{N}(0, \sigma^2\mathbf{1}) \quad\Rightarrow\quad \text{joint covariance } \Sigma + \sigma^2\mathbf{1}, \;\text{ joint eigenvalues } \sigma_i^2 + \sigma^2

SLIDE 26

2d dataset

SLIDE 27

First principal axis

SLIDE 28

Second principal axis

SLIDE 29

Eigenfaces (PCA on images)

SLIDE 30

Eigenfaces (PCA on images)

SLIDE 31

When projecting strange data

  • Original images
  • Reconstruction doesn’t look like the original
SLIDE 32

Inference

[Scatter plot: height vs. weight]

SLIDE 33

Correlating weight and height

SLIDE 34

Correlating weight and height

assume Gaussian correlation

SLIDE 35

p(\text{weight}\mid\text{height}) = \frac{p(\text{height}, \text{weight})}{p(\text{height})} \propto p(\text{height}, \text{weight})

SLIDE 36

p(x_2 \mid x_1) \propto \exp\left(-\frac{1}{2}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^\top \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\right)

(keep the linear and quadratic terms of the exponent)

SLIDE 37

The gory math

Correlated Observations: Assume that the random variables t ∈ R^n and t' ∈ R^{n'} are jointly normal with mean (µ, µ') and covariance matrix K:

p(t, t') \propto \exp\left(-\frac{1}{2}\begin{pmatrix} t-\mu \\ t'-\mu' \end{pmatrix}^\top \begin{pmatrix} K_{tt} & K_{tt'} \\ K_{tt'}^\top & K_{t't'} \end{pmatrix}^{-1} \begin{pmatrix} t-\mu \\ t'-\mu' \end{pmatrix}\right)

Inference: Given t, estimate t' via p(t'|t). Translated into machine learning language: we learn t' from t.

Practical Solution: Since t'|t ∼ N(µ̃, K̃), we only need to collect all terms in p(t, t') depending on t', by matrix inversion, hence

\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'} \qquad\text{and}\qquad \tilde{\mu} = \mu' + K_{tt'}^\top \underbrace{K_{tt}^{-1}(t-\mu)}_{\text{independent of } t'}

Handbook of Matrices, Lütkepohl 1997 (big timesaver)
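As a sketch, the two formulas can be checked numerically on a tiny made-up example (the partition sizes and the numbers below are arbitrary assumptions):

 % Sketch: conditional mean and covariance of a partitioned Gaussian.
 mu   = [0; 0];                 % mean of the observed block t
 mup  = 1;                      % mean of the block t' to be predicted
 Ktt  = [1.0 0.5; 0.5 1.0];     % Cov(t, t)
 Kttp = [0.3; 0.6];             % Cov(t, t')
 Kpp  = 1.0;                    % Cov(t', t')

 t = [0.2; -0.4];               % observed values

 mu_tilde = mup + Kttp' * (Ktt \ (t - mu));   % conditional mean of t' | t
 K_tilde  = Kpp - Kttp' * (Ktt \ Kttp);       % conditional covariance of t' | t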

SLIDE 38

Mini Summary

  • Normal distribution
  • Sampling from it:
    use x = µ + Lz where z ∼ N(0, 1) and Σ = LLᵀ
  • Estimating mean and covariance (MLE)
  • Conditional distribution is Gaussian, too!

p(x) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)}

x \sim \mathcal{N}(\mu, \Sigma): \quad x = \mu + Lz \;\text{ where }\; z \sim \mathcal{N}(0, \mathbf{1}) \;\text{ and }\; \Sigma = LL^\top

\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad\text{and}\qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top - \mu\mu^\top

p(x_2 \mid x_1) \propto \exp\left(-\frac{1}{2}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^\top \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\right)

SLIDE 39

6.2 Gaussian Processes

6 Bayesian Kernel Methods

Alexander Smola Introduction to Machine Learning 10-701 http://alex.smola.org/teaching/10-701-15

SLIDE 40

Gaussian Process

SLIDE 41

Gaussian Process

Key Idea: Instead of a fixed set of random variables t, t' we assume a stochastic process t : X → R, e.g. X = R^n. Previously we had X = {age, height, weight, ...}.

Definition of a Gaussian Process: a stochastic process t : X → R where every finite collection (t(x_1), ..., t(x_m)) is jointly normally distributed.

Parameters of a GP:
  Mean:                 µ(x) := E[t(x)]
  Covariance function:  k(x, x') := Cov(t(x), t(x'))

Simplifying Assumption: we assume knowledge of k(x, x') and set µ = 0.

SLIDE 42

Gaussian Process

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K at these points
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available
SLIDE 43

Gaussian Process

  • Only looks smooth

(evaluated at many points)

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K at these points
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available
SLIDE 44

Gaussian Process

p(t|X) = (2\pi)^{-m/2}\, |K|^{-1/2} \exp\left(-\frac{1}{2}(t-\mu)^\top K^{-1}(t-\mu)\right) \qquad\text{where } K_{ij} = k(x_i, x_j) \;\text{ and }\; \mu_i = \mu(x_i)

  • Sampling from a Gaussian Process
  • Points x where we want to sample
  • Compute covariance matrix K at these points (see the sketch after this list)
  • Can only obtain values at those points!
  • In general entire function f(x) is NOT available
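A minimal sketch of this sampling procedure (the Gaussian RBF kernel, the grid of points, and the jitter term are illustrative assumptions, not prescribed by the slides):

 % Sketch: draw one sample of a zero-mean GP at a finite set of points.
 x = linspace(-3, 3, 100)';                     % points where we want values
 K = exp(-0.5 * (x - x').^2);                   % K_ij = k(x_i, x_j), RBF kernel (assumption)
 L = chol(K + 1e-10 * eye(numel(x)), 'lower');  % tiny jitter for numerical stability
 t = L * randn(numel(x), 1);                    % one draw t ~ N(0, K)
 plot(x, t)                                     % looks like a smooth function,
                                                % but only these 100 values exist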
SLIDE 45

Kernels ...

Covariance Function: a function of two arguments; it leads to a matrix with nonnegative eigenvalues and describes the correlation between pairs of observations.

Kernel: a function of two arguments; it leads to a matrix with nonnegative eigenvalues and acts as a similarity measure between pairs of observations.

Lucky Guess: we suspect that kernels and covariance functions are the same ...

yes!

SLIDE 46

Mini Summary

  • Gaussian Process
  • Think distribution over function values (not functions)
  • Defined by mean and covariance function
  • Generates vectors of arbitrary dimensionality (via X)
  • Covariance function via kernels

p(t|X) = (2\pi)^{-m/2}\, |K|^{-1/2} \exp\left(-\frac{1}{2}(t-\mu)^\top K^{-1}(t-\mu)\right)

SLIDE 47

Gaussian Process Regression

SLIDE 48

Inference

p(\text{weight}\mid\text{height}) = \frac{p(\text{height}, \text{weight})}{p(\text{height})} \propto p(\text{height}, \text{weight})

Gaussian Processes for Inference

X = {height, weight}

SLIDE 49

Joint Gaussian Model

  • Random variables (t, t') are drawn from a GP
  • Observe subset t
  • Predict t' using p(t'|t)
  • Linear expansion (precompute things)
  • Predictive uncertainty is data independent
    (good for experimental design)

\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'} \qquad\text{and}\qquad \tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1}(t-\mu)

p(t, t') \propto \exp\left(-\frac{1}{2}\begin{pmatrix} t-\mu \\ t'-\mu' \end{pmatrix}^\top \begin{pmatrix} K_{tt} & K_{tt'} \\ K_{tt'}^\top & K_{t't'} \end{pmatrix}^{-1} \begin{pmatrix} t-\mu \\ t'-\mu' \end{pmatrix}\right)

SLIDE 50

Linear Gaussian Process Regression

Linear kernel: k(x, x') = ⟨x, x'⟩, so the kernel matrix is X^\top X.

Mean and covariance:

\tilde{K} = X'^\top X' - X'^\top X (X^\top X)^{-1} X^\top X' = X'^\top(\mathbf{1} - P_X)X' \qquad \tilde{\mu} = X'^\top\left[X (X^\top X)^{-1} t\right]

so \tilde{\mu} is a linear function of X'.

Problem: The covariance matrix X^\top X has at most rank n. After n observations (x ∈ R^n) the variance vanishes. This is not realistic: we get a "flat pancake" or "cigar" distribution.

SLIDE 51

Degenerate Covariance

SLIDE 52

[Graphical model: x → t]

Degenerate Covariance

SLIDE 53

[Graphical model: x → t → y]

‘fatten up’ covariance

Degenerate Covariance

SLIDE 54

[Graphical model: x → t → y]

‘fatten up’ covariance

Degenerate Covariance

SLIDE 55

[Graphical model: x → t → y]

‘fatten up’ covariance

t \sim \mathcal{N}(\mu, K) \qquad\text{and}\qquad y_i \sim \mathcal{N}(t_i, \sigma^2)

Degenerate Covariance

SLIDE 56

Additive Noise

Indirect Model: Instead of observing t(x) we observe y = t(x) + ξ, where ξ is a nuisance term. This yields

p(Y|X) = \int \prod_{i=1}^{m} p(y_i|t_i)\, p(t|X)\, dt

where we can now find a maximum a posteriori solution for t by maximizing the integrand (we will use this later).

Additive Normal Noise: If ξ ∼ N(0, σ²) then y is the sum of two Gaussian random variables; means and variances add up:

y \sim \mathcal{N}(\mu, K + \sigma^2\mathbf{1})

SLIDE 57

Data

SLIDE 58

Predictive mean

k(x, X)^\top \left(K(X, X) + \sigma^2\mathbf{1}\right)^{-1} y

SLIDE 59

Variance

SLIDE 60

Putting it all together

SLIDE 61

Putting it all together

SLIDE 62

Ugly details

Covariance Matrices: additive noise gives K = K_{\text{kernel}} + \sigma^2\mathbf{1}.

Predictive mean and variance (noise free):

\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'} \qquad\text{and}\qquad \tilde{\mu} = K_{tt'}^\top K_{tt}^{-1} t

Pointwise prediction with noise:

\tilde{K} = K_{t't'} + \sigma^2\mathbf{1} - K_{tt'}^\top\left(K_{tt} + \sigma^2\mathbf{1}\right)^{-1} K_{tt'} \qquad\text{and}\qquad \tilde{\mu} = \mu' + K_{tt'}^\top\left(K_{tt} + \sigma^2\mathbf{1}\right)^{-1}(y-\mu)

SLIDE 63

Pseudocode

ktrtr = k(xtrain,xtrain) + sigma2 * eye(mtr)
ktetr = k(xtest,xtrain)
ktete = k(xtest,xtest)
alpha = ktrtr \ ytr                 % better if you use Cholesky
yte = ktetr * alpha                 % predictive mean
sigmate = ktete + sigma2 * eye(mte) - ...
          ktetr * (ktrtr \ ktetr')  % predictive covariance

\tilde{K} = K_{t't'} + \sigma^2\mathbf{1} - K_{tt'}^\top\left(K_{tt} + \sigma^2\mathbf{1}\right)^{-1} K_{tt'} \qquad\text{and}\qquad \tilde{\mu} = \mu' + K_{tt'}^\top\left(K_{tt} + \sigma^2\mathbf{1}\right)^{-1}(y-\mu)

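For concreteness, here is one way the pseudocode might be exercised end to end; the RBF kernel handle, the toy data, and the sigma2 value are assumptions for illustration only:

 % Toy run of the GP regression pseudocode (illustrative kernel and data).
 k = @(A,B) exp(-0.5 * (sum(A.^2,2) - 2*A*B' + sum(B.^2,2)'));  % RBF kernel (assumption)

 xtrain = linspace(-3, 3, 20)';
 ytr    = sin(xtrain) + 0.1 * randn(size(xtrain));
 xtest  = linspace(-4, 4, 100)';
 mtr = numel(xtrain); mte = numel(xtest);
 sigma2 = 0.01;                     % assumed noise variance

 ktrtr = k(xtrain,xtrain) + sigma2 * eye(mtr);
 ktetr = k(xtest,xtrain);
 ktete = k(xtest,xtest);
 alpha = ktrtr \ ytr;
 yte = ktetr * alpha;                                            % predictive mean
 sigmate = ktete + sigma2 * eye(mte) - ktetr * (ktrtr \ ktetr'); % predictive covariance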
SLIDE 64

The connection between SVM and GP

Gaussian Process on Parameters: t ∼ N(µ, K) where K_{ij} = k(x_i, x_j).

Linear Model in Feature Space: t(x) = ⟨Φ(x), w⟩ + µ(x) where w ∼ N(0, 1).

The covariance between t(x) and t(x') is then given by

\mathbf{E}_w\left[\langle\Phi(x), w\rangle\,\langle w, \Phi(x')\rangle\right] = \langle\Phi(x), \Phi(x')\rangle = k(x, x')

Conclusion:

Linear model in feature space induces a Gaussian Process

SLIDE 65

Mini Summary

  • Latent variables t drawn from a Gaussian Process
  • Observations y are t corrupted with noise
  • Observations y are drawn from Gaussian Process

  • Estimate y’|y,x,x’ (matrix inversion)
  • SVM kernel is GP kernel

\mu \to \mu \qquad\text{and}\qquad K \to K + \sigma^2\mathbf{1}

\tilde{K} = K_{t't'} + \sigma^2\mathbf{1} - K_{tt'}^\top\left(K_{tt} + \sigma^2\mathbf{1}\right)^{-1} K_{tt'} \qquad\text{and}\qquad \tilde{\mu} = \mu' + K_{tt'}^\top\left(K_{tt} + \sigma^2\mathbf{1}\right)^{-1}(y-\mu)

SLIDE 66

Gaussian Process Classification

SLIDE 67

Gaussian Process Classification

  • Regression
  • Data y is scalar
  • Connection to t is by additive noise



 


  • (Binary) Classification
  • Data y in {-1, 1}
  • Connection to t is by logistic model

[Graphical model: x → t → y]

Regression: \quad t \sim \mathcal{N}(\mu, K) \;\text{ and }\; y_i \sim \mathcal{N}(t_i, \sigma^2), \;\text{ i.e. }\; p(y_i|t_i) = \left(2\pi\sigma^2\right)^{-1/2} e^{-\frac{1}{2\sigma^2}(y_i - t_i)^2}

Classification: \quad t \sim \mathcal{N}(\mu, K) \;\text{ and }\; p(y_i|t_i) = \frac{1}{1 + e^{-y_i t_i}}

SLIDE 68

Logistic function

p(y_i|t_i) = \frac{1}{1 + e^{-y_i t_i}}

SLIDE 69

Recall - Binomial Distribution

  • Features
  • Domain is {-1, 1}
  • Normalization
  • Probability

\phi(x) = x, \qquad g(\theta) = \log\left[e^{-1\cdot\theta} + e^{1\cdot\theta}\right] = \log 2\cosh\theta

p(x|\theta) = \exp\left(x\theta - g(\theta)\right) = \frac{e^{x\theta}}{e^{-\theta} + e^{\theta}} = \frac{1}{1 + e^{-2x\theta}} \qquad\text{(the logistic function)}

SLIDE 70

Gaussian Process Classification

  • Regression



 We can integrate out the latent variable t.

  • Classification


Closed form solution is not possible
 
 (we cannot solve the integral in t).


Regression: \quad t \sim \mathcal{N}(\mu, K) \;\text{ and }\; y_i \sim \mathcal{N}(t_i, \sigma^2) \;\Rightarrow\; y \sim \mathcal{N}(\mu, K + \sigma^2\mathbf{1})

Classification: \quad t \sim \mathcal{N}(\mu, K) \;\text{ and }\; y_i \sim \mathrm{Logistic}(t_i)

p(t|y, x) \propto p(t|x) \prod_{i=1}^{m} p(y_i|t_i) \propto \exp\left(-\frac{1}{2}t^\top K^{-1} t\right)\prod_{i=1}^{m}\frac{1}{1 + e^{-y_i t_i}}

SLIDE 71

Gaussian Process Classification

  • Integrating out t,t’



 
 is very very expensive (e.g. MCMC)

  • Maximum a Posteriori approximation
  • Find the mode t̂ of p(y|t) p(t|x)
  • Ignore correlation in test data (horrible)
  • Find t̂'(x') maximizing p(t̂, t'|x, x')
  • Estimate y'|y, x, x' from the logistic model at t̂'(x')

p(y'|y, x, x') = \int d(t, t')\; p(y'|t')\, p(y|t)\, p(t, t'|x, x')

\hat{t} := \operatorname*{argmax}_{t}\; p(y|t)\, p(t|x) \qquad \hat{t}'(x') := \operatorname*{argmax}_{t'}\; p(\hat{t}, t'|x, x') \qquad y'|y, x, x' \sim \mathrm{Logistic}(\hat{t}'(x'))

SLIDE 72

Maximum a Posteriori Approximation

  • Step 1 - maximize p(t|y,x)
  • Step 2 - find t’|t for MAP estimate of t


  • Step 3 - estimate p(y’|t’)

Step 1: \quad \operatorname*{minimize}_{t}\; \frac{1}{2}t^\top K^{-1} t + \sum_{i=1}^{m}\log\left(1 + e^{-y_i t_i}\right)

Step 2: \quad t' = K_{tt'}^\top K_{tt}^{-1} t \qquad\text{(precompute } K_{tt'}^\top K_{tt}^{-1}\text{)}

Step 3: \quad p(y'|t') = \frac{1}{1 + e^{-y' t'}}

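A minimal sketch of Step 1 using Newton's method, which is the usual choice here; the RBF kernel, the toy labels, and the iteration count are illustrative assumptions, not from the slides:

 % Sketch: MAP estimate of t for GP classification (Newton / Laplace style).
 % Objective: 0.5 * t'*inv(K)*t + sum(log(1 + exp(-y .* t)))
 k = @(A,B) exp(-0.5 * (sum(A.^2,2) - 2*A*B' + sum(B.^2,2)'));  % RBF kernel (assumption)
 x = [randn(20,1) - 2; randn(20,1) + 2];   % toy 1-d inputs
 y = [-ones(20,1); ones(20,1)];            % labels in {-1, +1}
 m = numel(x);

 K = k(x, x) + 1e-6 * eye(m);              % small jitter for stability
 t = zeros(m, 1);
 for iter = 1:20
     s = 1 ./ (1 + exp(y .* t));           % sigma(-y_i * t_i)
     g = y .* s;                           % gradient of log p(y|t)
     W = diag(s .* (1 - s));               % minus Hessian of log p(y|t)
     t = K * ((eye(m) + W*K) \ (W*t + g)); % Newton step, avoiding inv(K)
 end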
SLIDE 73

Clean Data

SLIDE 74

Noisy Data

SLIDE 75

Connection to SVMs

  • SVM objective


  • Logistic regression objective (MAP estimation)
  • Reparametrize

SVM objective:

\operatorname*{minimize}_{\alpha}\; \frac{1}{2}\alpha^\top K\alpha + \sum_{i=1}^{m}\max\left(0, 1 - y_i[K\alpha]_i\right)

Logistic regression objective (MAP estimation):

\operatorname*{minimize}_{t}\; \frac{1}{2}t^\top K^{-1} t + \sum_{i=1}^{m}\log\left(1 + e^{-y_i t_i}\right)

Reparametrizing with \alpha = K^{-1}t gives

\operatorname*{minimize}_{\alpha}\; \frac{1}{2}\alpha^\top K\alpha + \sum_{i=1}^{m}\log\left(1 + \exp\left(-y_i[K\alpha]_i\right)\right)

SLIDE 76

More loss functions

  • Logistic
  • Huberized loss
  • Soft margin

Logistic loss (asymptotically linear on one side, asymptotically 0 on the other):
\log\left[1 + e^{-f(x)}\right]

Huberized loss:
\begin{cases} 0 & \text{if } f(x) > 1 \\ \frac{1}{2}\left(1 - f(x)\right)^2 & \text{if } f(x) \in [0, 1] \\ \frac{1}{2} - f(x) & \text{if } f(x) < 0 \end{cases}

Soft margin loss:
\max\left(0, 1 - f(x)\right)

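A small sketch that evaluates the three losses over a grid of margin values f(x) (the grid range and plotting choices are arbitrary):

 % Sketch: compare logistic, soft-margin and huberized losses on a margin grid.
 f = linspace(-3, 3, 601)';

 logistic   = log(1 + exp(-f));
 softmargin = max(0, 1 - f);
 huber      = zeros(size(f));
 middle     = (f >= 0) & (f <= 1);
 huber(middle) = 0.5 * (1 - f(middle)).^2;
 huber(f < 0)  = 0.5 - f(f < 0);        % huber stays 0 for f > 1

 plot(f, [logistic, softmargin, huber])
 legend('logistic', 'soft margin', 'huberized')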
SLIDE 77

Mini Summary

  • Latent variables drawn from Gaussian Process
  • Observation drawn from logistic model
  • Impossible to integrate out latent variables
  • Maximum a posteriori inference


(with many hacks to make it scale)

  • Optimization problem is similar to SVM


(different loss and parametrization )

  • Advanced topic - adjusting K via prior on k

\alpha = K^{-1}t

SLIDE 78

Further reading

  • Girosi, 1998. An equivalence between sparse approximation and Support Vector Machines.
    ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-1606.pdf
  • Smola, Schoelkopf and Mueller, 1998. The connection between regularization operators and Support Vector Kernels.
    http://alex.smola.org/teaching/berkeley2012/slides/Smola1998connection.pdf
  • Schoelkopf, Smola and Mueller, 1998. Nonlinear Component Analysis as a Kernel Eigenvalue Problem.
    http://www.mitpressjournals.org/doi/abs/10.1162/089976698300017467
  • Schoelkopf, Smola and Herbrich, 2001. A Generalized Representer Theorem.
    alex.smola.org/papers/2001/SchHerSmo01.pdf
  • Teo, Globerson, Roweis and Smola, 2008. Convex Learning with Invariances.
    http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2007_1047.pdf
  • Rasmussen, 2006. Gaussian Processes for Machine Learning.
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.3414