Scalable Machine Learning - 5. (Generalized) Linear Models - Alex Smola



slide-1
SLIDE 1

Scalable Machine Learning

  • 5. (Generalized) Linear Models

Alex Smola Yahoo! Research and ANU

http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12

slide-2
SLIDE 2

Administrative stuff

  • Solutions will be posted by tomorrow
  • New problem set will be available by tomorrow
  • Midterm project presentations are on March 13
  • Describe what you will do
  • Why it’s important
  • What you’ve achieved so far
  • Show why you think you’re going to succeed
  • 10 minutes per team (6 slides maximum)
  • Up to 10 pages supporting documentation
slide-3
SLIDE 3
  • 5. (Generalized) Linear Models
slide-4
SLIDE 4
  • Kernel trick
  • Simple kernels
  • Kernel PCA
  • Mean Classifier
  • Support Vectors
  • Support Vector Machine classification
  • Regression
  • Logistic regression
  • Novelty detection
  • Gaussian Process Estimation
  • Regression
  • Classification
  • Heteroscedastic Regression

(Generalized) Linear Models

slide-5
SLIDE 5

Kernels - a Preview

slide-6
SLIDE 6

Solving XOR

  • XOR not linearly separable
  • Mapping into 3 dimensions makes it easily solvable

$(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$
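As a quick illustration (not from the slides), a minimal NumPy sketch of this lifting: after mapping $(x_1, x_2)$ to $(x_1, x_2, x_1 x_2)$, the XOR pattern is separated by a plain linear threshold on the third coordinate.

```python
# Minimal sketch (NumPy assumed): XOR becomes linearly separable after the
# feature map (x1, x2) -> (x1, x2, x1*x2).
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                 # XOR labels in {-1, +1} encoding

phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# In the lifted space the hyperplane w = (0, 0, 1), b = 0 separates the classes.
w, b = np.array([0.0, 0.0, 1.0]), 0.0
print(np.sign(phi @ w + b) == y)             # [ True  True  True  True]
```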

slide-7
SLIDE 7

Feature Space Mapping

  • Naive Nonlinearization Strategy
  • Express data x in terms of features ɸ(x)
  • Solve problem in feature space
  • Requires explicit feature computation
  • Kernel trick
  • Write algorithm in terms of inner products
  • Replace $\langle x, x' \rangle$ by $k(x, x')$
  • Works well for dimension-insensitive methods
  • Kernel matrix K is positive semidefinite

$$\langle x, x' \rangle \;\longrightarrow\; k(x, x') := \langle \phi(x), \phi(x') \rangle$$

slide-8
SLIDE 8
slide-9
SLIDE 9

Polynomial Kernels

  • Linear: $k(x, x') := \langle x, x' \rangle$
  • Quadratic (for $x \in \mathbb{R}^2$): $k(x, x') := \left\langle (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2),\ (x_1'^2, x_2'^2, \sqrt{2}\,x_1' x_2') \right\rangle = \langle x, x' \rangle^2$
  • Homogeneous polynomial: $k(x, x') := \langle x, x' \rangle^p = \sum_{|\alpha| = p} \binom{p}{\alpha} \prod_i (x_i x_i')^{\alpha_i}$ with $\alpha \in \mathbb{N}^d$
  • Inhomogeneous polynomial: $k(x, x') := (\langle x, x' \rangle + c)^p = \sum_{i=0}^{p} \binom{p}{i} c^{p-i} \langle x, x' \rangle^i$

(all defined via the inner product)

slide-10
SLIDE 10

More Kernels

  • Gaussian kernel: $k(x, x') := \exp\left(-\gamma \|x - x'\|^2\right)$ (one can check that this is a convolution of Gaussians)
  • Brownian bridge: $k(x, x') := \min(x, x')$ for $x, x' \ge 0$
  • Set intersection: $k(A, B) := |A \cap B|$
  • Strings, more fancy set kernels, graphs, etc.
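A minimal sketch of a few of these kernels in NumPy (bandwidth, degree, and offset values are illustrative); it also checks numerically that each kernel matrix is positive semidefinite, as claimed above.

```python
# Sketch only: some of the kernels above, plus a numerical PSD check.
import numpy as np

def linear(X, Y):
    return X @ Y.T

def polynomial(X, Y, c=1.0, p=3):            # inhomogeneous polynomial kernel
    return (X @ Y.T + c) ** p

def gaussian(X, Y, gamma=0.5):
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
for k in (linear, polynomial, gaussian):
    K = k(X, X)
    assert np.linalg.eigvalsh(K).min() > -1e-8    # PSD up to round-off
```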

slide-11
SLIDE 11

Support Vector Machines

slide-12
SLIDE 12

Classification

http://maktoons.blogspot.com/2009/03/support-vector-machine.html

slide-13
SLIDE 13

Support Vectors

Hyperplanes: decision boundary $\{x \mid \langle w, x \rangle + b = 0\}$ and margin hyperplanes $\{x \mid \langle w, x \rangle + b = \pm 1\}$ for the classes $y_i = +1$ and $y_i = -1$.

Optimization problem (maximize the margin):

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1$$

Margin: for points on the two margin hyperplanes, $\langle w, x_1 \rangle + b = 1$ and $\langle w, x_2 \rangle + b = -1$, hence $\langle w, x_1 - x_2 \rangle = 2$ and therefore

$$\left\langle \frac{w}{\|w\|},\ x_1 - x_2 \right\rangle = \frac{2}{\|w\|}$$

slide-14
SLIDE 14

Support Vectors

Primal problem:

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1$$

Dual problem: with $K_{ij} = y_i y_j \langle x_i, x_j \rangle$ and $w = \sum_i \alpha_i y_i x_i$,

$$\underset{\alpha}{\text{minimize}} \quad \frac{1}{2}\alpha^\top K \alpha - \mathbf{1}^\top \alpha \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \ \text{ and } \ \alpha_i \ge 0$$

slide-15
SLIDE 15

Karush Kuhn Tucker conditions

KKT optimality condition:

$$\alpha_i\left[y_i(\langle x_i, w \rangle + b) - 1\right] = 0$$

Hence $y_i(\langle x_i, w \rangle + b) > 1$ implies $\alpha_i = 0$, and $\alpha_i > 0$ implies $y_i(\langle x_i, w \rangle + b) = 1$.

slide-16
SLIDE 16

Properties

  • Weight vector w is a weighted linear combination of instances
  • Only points on the margin matter (we can ignore the rest and get the same solution)

  • Only inner products matter
  • Quadratic program
  • We can replace the inner product by a kernel
  • Keeps instances away from the margin

Java demo: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
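Since the quadratic program only sees the data through inner products, replacing them by a kernel is a one-line change in practice. A hedged sketch using scikit-learn's SVC (a library choice and toy data set of my own, not from the slides):

```python
# Sketch: kernelized soft-margin SVM classification via scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)      # not linearly separable in R^2

clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
print(len(clf.support_))        # only the support vectors determine the solution
print(clf.score(X, y))
```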

slide-17
SLIDE 17

Example

slide-18
SLIDE 18

Example

slide-19
SLIDE 19

Why large margins?

  • Maximum robustness relative to uncertainty
  • Symmetry breaking
  • Independent of correctly classified instances
  • Easy to find for easy problems

(Figure: separable point sets with margin ρ and perturbation radius r.)

slide-20
SLIDE 20

Inseparable data

Quadratic program has no feasible solution

slide-21
SLIDE 21

Adding slack variables

  • Hard margin problem

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1$$

  • With slack variables

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0$$

the problem is always feasible. Proof: $w = 0$, $b = 0$, $\xi_i = 1$ is feasible (this also yields an upper bound on the objective).

slide-22
SLIDE 22

Support Vectors

Primal problem (soft margin):

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0$$

Dual problem: with $K_{ij} = y_i y_j \langle x_i, x_j \rangle$ and $w = \sum_i \alpha_i y_i x_i$,

$$\underset{\alpha}{\text{minimize}} \quad \frac{1}{2}\alpha^\top K \alpha - \mathbf{1}^\top \alpha \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \ \text{ and } \ \alpha_i \in [0, C]$$

slide-23
SLIDE 23

Classification with errors

slide-24
SLIDE 24

Nonlinear separation

  • Increasing C allows for more nonlinearities
  • Decreases number of errors
  • SV boundary need not be contiguous
slide-25
SLIDE 25

Loss function point of view

  • Constrained quadratic program

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0$$

  • Risk minimization setting (empirical risk)

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_i \max\left[0,\ 1 - y_i\left[\langle w, x_i \rangle + b\right]\right]$$

This follows from finding the minimal slack variable for a given (w, b) pair.
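A minimal sketch of this unconstrained view, assuming NumPy: subgradient descent on $\frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(\langle w, x_i\rangle + b))$. Step size and iteration count are illustrative choices, not part of the slides.

```python
# Sketch: soft-margin SVM as regularized hinge-loss minimization.
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=200):
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                        # points with nonzero slack
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```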

slide-26
SLIDE 26

Soft margin as proxy for binary

  • Soft margin loss
  • Binary loss

$$\max(0, 1 - y f(x)) \qquad\qquad \mathbf{1}\{y f(x) < 0\}$$

(The soft margin loss is a convex upper bound on the binary loss, both viewed as functions of the margin $y f(x)$.)

slide-27
SLIDE 27

More loss functions

  • Logistic: $\log\left[1 + e^{-f(x)}\right]$
  • Huberized loss:

$$l(f(x)) = \begin{cases} 0 & \text{if } f(x) > 1 \\ \frac{1}{2}(1 - f(x))^2 & \text{if } f(x) \in [0, 1] \\ \frac{1}{2} - f(x) & \text{if } f(x) < 0 \end{cases}$$

  • Soft margin: $\max(0, 1 - f(x))$

All of these are (asymptotically) linear for large negative margins and (asymptotically) 0 for large positive margins.

slide-28
SLIDE 28

Risk minimization view

  • Find function f minimizing classification error
  • Compute empirical average
  • Minimization is nonconvex
  • Overfitting as we minimize empirical error
  • Compute convex upper bound on the loss
  • Add regularization for capacity control

$$R[f] := \mathbf{E}_{x, y \sim p(x, y)}\left[\mathbf{1}\{y f(x) \le 0\}\right] \qquad R_{\text{emp}}[f] := \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{y_i f(x_i) \le 0\}$$

$$R_{\text{reg}}[f] := \frac{1}{m}\sum_{i=1}^{m} \max(0, 1 - y_i f(x_i)) + \lambda\,\Omega[f]$$

Here $\lambda\,\Omega[f]$ is the regularization term; λ controls the trade-off between data fit and capacity.

slide-29
SLIDE 29

Regression

slide-30
SLIDE 30

Regression Estimation

  • Find function f minimizing regression error
  • Compute empirical average

Overfitting as we minimize empirical error

  • Add regularization for capacity control

$$R[f] := \mathbf{E}_{x, y \sim p(x, y)}\left[l(y, f(x))\right] \qquad R_{\text{emp}}[f] := \frac{1}{m}\sum_{i=1}^{m} l(y_i, f(x_i))$$

$$R_{\text{reg}}[f] := \frac{1}{m}\sum_{i=1}^{m} l(y_i, f(x_i)) + \lambda\,\Omega[f]$$

slide-31
SLIDE 31

Squared loss

$l(y, f(x)) = \frac{1}{2}(y - f(x))^2$

slide-32
SLIDE 32

l1 loss

l(y, f(x)) = |y − f(x)|

slide-33
SLIDE 33

ε-insensitive Loss

$l(y, f(x)) = \max(0, |y - f(x)| - \epsilon)$

slide-34
SLIDE 34

Penalized least mean squares

  • Optimization problem

$$\underset{w}{\text{minimize}} \quad \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(y_i - \langle x_i, w \rangle\right)^2 + \frac{\lambda}{2}\|w\|^2$$

  • Solution: setting the derivative to zero,

$$\partial_w[\ldots] = \frac{1}{m}\sum_{i=1}^{m}\left[x_i x_i^\top w - x_i y_i\right] + \lambda w = \left[\frac{1}{m}XX^\top + \lambda\mathbf{1}\right]w - \frac{1}{m}Xy = 0$$

hence

$$w = \left[XX^\top + \lambda m\mathbf{1}\right]^{-1}Xy$$

(For the matrix inverse use CG or the Sherman-Morrison-Woodbury identity.)

  • Only inner products between the data points matter.
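A minimal NumPy sketch of the closed-form solution above, with X holding one observation per column as on the slide; the synthetic data are illustrative only.

```python
# Sketch: penalized least mean squares, w = (X X^T + lambda*m*I)^{-1} X y.
import numpy as np

def penalized_lms(X, y, lam=0.1):
    d, m = X.shape                        # columns of X are observations
    A = X @ X.T + lam * m * np.eye(d)
    return np.linalg.solve(A, X @ y)      # prefer solve / CG over an explicit inverse

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(3, 100))
y = X.T @ w_true + 0.01 * rng.normal(size=100)
print(penalized_lms(X, y, lam=0.01))      # close to w_true
```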

slide-35
SLIDE 35

SVM Regression (ϵ-insensitive loss)

(Figure: data points with the fitted function and an ε-tube around it; points outside the tube incur a slack ξ. Right: the ε-insensitive loss as a function of y − f(x).)

We don't care about deviations within the tube.

slide-36
SLIDE 36

SVM Regression (ϵ-insensitive loss)

  • Optimization problem (as constrained QP)

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\left[\xi_i + \xi_i^*\right]$$

subject to

$$\langle w, x_i \rangle + b \le y_i + \epsilon + \xi_i \ \text{ and } \ \xi_i \ge 0, \qquad \langle w, x_i \rangle + b \ge y_i - \epsilon - \xi_i^* \ \text{ and } \ \xi_i^* \ge 0$$

  • Lagrange function

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\left[\xi_i + \xi_i^*\right] - \sum_{i=1}^{m}\left[\eta_i \xi_i + \eta_i^* \xi_i^*\right] + \sum_{i=1}^{m}\alpha_i\left[\langle w, x_i \rangle + b - y_i - \epsilon - \xi_i\right] + \sum_{i=1}^{m}\alpha_i^*\left[y_i - \epsilon - \xi_i^* - \langle w, x_i \rangle - b\right]$$

slide-37
SLIDE 37

SVM Regression (ϵ-insensitive loss)

  • First order conditions

$$\partial_w L = 0 = w + \sum_i\left[\alpha_i - \alpha_i^*\right]x_i \qquad \partial_b L = 0 = \sum_i\left[\alpha_i - \alpha_i^*\right]$$

$$\partial_{\xi_i} L = 0 = C - \eta_i - \alpha_i \qquad \partial_{\xi_i^*} L = 0 = C - \eta_i^* - \alpha_i^*$$

  • Dual problem

$$\underset{\alpha, \alpha^*}{\text{minimize}} \quad \frac{1}{2}(\alpha - \alpha^*)^\top K(\alpha - \alpha^*) + \epsilon\,\mathbf{1}^\top(\alpha + \alpha^*) + y^\top(\alpha - \alpha^*)$$

$$\text{subject to} \quad \mathbf{1}^\top(\alpha - \alpha^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C]$$

slide-38
SLIDE 38

Properties

  • Ignores ‘typical’ instances with small error
  • Only the upper or the lower bound is active at any time (we cannot violate both bounds simultaneously)
  • Quadratic program in 2n variables can be solved as cheaply as the standard SVM problem
  • Robustness with respect to outliers
  • l1 loss yields the same problem without epsilon
  • Huber’s robust loss yields a similar problem but with an added quadratic penalty on the coefficients
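For completeness, a hedged sketch of ε-insensitive regression using scikit-learn's SVR, which solves exactly this kind of dual; the data, kernel width, C, and ε are made-up values.

```python
# Sketch: support vector regression with the epsilon-insensitive loss.
import numpy as np
from sklearn.svm import SVR

X = np.linspace(-3, 3, 200)[:, None]
y = np.sinc(X.ravel()) + 0.05 * np.random.default_rng(0).normal(size=200)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, y)
print(len(model.support_))    # points outside the epsilon-tube become support vectors
```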

slide-39
SLIDE 39

Regression example

(Figure: sinc x ± 0.1 and the SVR approximation.)

slide-40
SLIDE 40

Regression example

(Figure: sinc x ± 0.2 and the SVR approximation.)

slide-41
SLIDE 41

Regression example

(Figure: sinc x ± 0.5 and the SVR approximation.)

slide-42
SLIDE 42

Regression example

(Figure: support vectors for the three tube widths.)

slide-43
SLIDE 43

Huber’s robust loss

$$l(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{if } |y - f(x)| < 1 \\ |y - f(x)| - \frac{1}{2} & \text{otherwise} \end{cases}$$

Quadratic for small residuals, linear in the tails; acts as a trimmed mean estimator.

slide-44
SLIDE 44

Novelty Detection

slide-45
SLIDE 45

Basic Idea

Data: observations $(x_i)$ generated from some $P(x)$, e.g. network usage patterns, handwritten digits, alarm sensors, factory status.

Task: find unusual events, clean the database, distinguish typical examples.

slide-46
SLIDE 46

Applications

Network intrusion detection: detect whether someone is trying to hack the network, downloading tons of MP3s, or doing anything else unusual on the network.

Jet engine failure detection: you can’t destroy jet engines just to see how they fail.

Database cleaning: we want to find out whether someone stored bogus information in a database (typos, etc.), mislabelled digits, ugly digits, bad photographs in an electronic album.

Fraud detection: credit cards, telephone bills, medical records.

Self-calibrating alarm devices: car alarms (adjusts itself to where the car is parked), home alarms (furniture, temperature, windows, etc.).

slide-47
SLIDE 47

Novelty Detection via Density Estimation

Key idea: novel data is data that we don’t see frequently, so it must lie in low-density regions.

Step 1: estimate the density. Given observations $x_1, \ldots, x_m$, form a density estimate via Parzen windows.

Step 2: threshold the density. Sort the data according to density and use it for rejection. Practical implementation: compute

$$p(x_i) = \frac{1}{m}\sum_{j} k(x_i, x_j)$$

for all i, sort according to magnitude, and flag the points with smallest $p(x_i)$ as novel.
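A minimal NumPy sketch of exactly these two steps, with a Gaussian Parzen window; the bandwidth and the rejection fraction are illustrative.

```python
# Sketch: Parzen-window density estimate + thresholding for novelty detection.
import numpy as np

def parzen_novelty(X, gamma=0.5, frac=0.05):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)
    p = K.mean(axis=1)                       # p(x_i) = 1/m sum_j k(x_i, x_j)
    threshold = np.quantile(p, frac)
    return p <= threshold                    # True = low-density (novel) point

X = np.random.default_rng(0).normal(size=(500, 2))
print(parzen_novelty(X).sum())               # roughly 5% of the points
```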

slide-48
SLIDE 48

Order Statistics of Densities

slide-49
SLIDE 49

Typical Data

slide-50
SLIDE 50

Outliers

slide-51
SLIDE 51

A better way

Problems: we do not care about estimating the density properly in regions of high density (a waste of capacity); we only care about the relative density for thresholding purposes; and we want to eliminate a certain fraction of observations and tune our estimator specifically for this fraction.

Solution: areas of low density can be approximated as the level set of an auxiliary function. There is no need to estimate p(x) directly; use a proxy of p(x) instead. Specifically: find f(x) such that x is novel if f(x) ≤ c where c is some constant, i.e. f(x) describes the amount of novelty.

slide-52
SLIDE 52

Problems with density estimation

Maximum a posteriori:

$$\underset{\theta}{\text{minimize}} \quad \sum_{i=1}^{m}\left[g(\theta) - \langle \phi(x_i), \theta \rangle\right] + \frac{1}{2\sigma^2}\|\theta\|^2$$

Advantages: convex optimization problem; concentration of measure.

Problems: the normalization $g(\theta)$ may be painful to compute; for density estimation we need no normalized $p(x|\theta)$; and there is no need to perform particularly well in high-density regions.

slide-53
SLIDE 53

Thresholding

slide-54
SLIDE 54

Optimization Problem

MAP:

$$\underset{\theta}{\text{minimize}} \quad -\sum_{i=1}^{m}\log p(x_i \mid \theta) + \frac{1}{2\sigma^2}\|\theta\|^2$$

Novelty:

$$\underset{\theta}{\text{minimize}} \quad \sum_{i=1}^{m}\max\left(-\log\frac{p(x_i \mid \theta)}{\exp(\rho - g(\theta))},\ 0\right) + \frac{1}{2}\|\theta\|^2 = \sum_{i=1}^{m}\max\left(\rho - \langle \phi(x_i), \theta \rangle,\ 0\right) + \frac{1}{2}\|\theta\|^2$$

Advantages: no normalization $g(\theta)$ needed; no need to perform particularly well in high-density regions (the estimator focuses on low-density regions); a quadratic program.

slide-55
SLIDE 55

Maximum Distance Hyperplane

Idea: find a hyperplane, given by $f(x) = \langle w, x \rangle + b = 0$, that has maximum distance from the origin yet is still closer to the origin than the observations.

Hard margin:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad \langle w, x_i \rangle \ge 1$$

Soft margin:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad \langle w, x_i \rangle \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0$$

slide-56
SLIDE 56

Optimization Problem

Primal problem:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad \langle w, x_i \rangle - 1 + \xi_i \ge 0 \ \text{ and } \ \xi_i \ge 0$$

Lagrange function L: subtract the constraints, multiplied by Lagrange multipliers ($\alpha_i$ and $\eta_i$), from the primal objective function. The Lagrange function L has a saddle point at the optimum.

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\left(\langle w, x_i \rangle - 1 + \xi_i\right) - \sum_{i=1}^{m}\eta_i\xi_i \quad \text{subject to} \quad \alpha_i, \eta_i \ge 0$$

slide-57
SLIDE 57

Dual Problem

Optimality conditions:

$$\partial_w L = w - \sum_{i=1}^{m}\alpha_i x_i = 0 \ \Longrightarrow\ w = \sum_{i=1}^{m}\alpha_i x_i \qquad \partial_{\xi_i} L = C - \alpha_i - \eta_i = 0 \ \Longrightarrow\ \alpha_i \in [0, C]$$

Now substitute the optimality conditions back into L.

Dual problem:

$$\text{minimize} \quad \frac{1}{2}\sum_{i, j=1}^{m}\alpha_i\alpha_j\langle x_i, x_j \rangle - \sum_{i=1}^{m}\alpha_i \quad \text{subject to} \quad \alpha_i \in [0, C]$$

All this is only possible due to the convexity of the primal problem.

slide-58
SLIDE 58

Minimum enclosing ball

  • Observations on the surface of the ball
  • Find the minimum enclosing ball
  • Equivalent to the single-class SVM

(Figure: enclosing ball of radius R around the data, with margin ρ/||w||.)

slide-59
SLIDE 59

Adaptive thresholds

Problem: depending on C, the number of novel points will vary. We would like to specify the fraction ν beforehand.

Solution: use the hyperplane separating the data from the origin, $H := \{x \mid \langle w, x \rangle = \rho\}$, where the threshold ρ is adaptive.

Intuition: let the hyperplane shift by shifting ρ; adjust it such that the ‘right’ number of observations is considered novel; do this automatically.

slide-60
SLIDE 60

Optimization Problem

Primal problem:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m}\xi_i - m\nu\rho \quad \text{where} \quad \langle w, x_i \rangle \ge \rho - \xi_i \ \text{ and } \ \xi_i \ge 0$$

Dual problem:

$$\text{minimize} \quad \frac{1}{2}\sum_{i, j=1}^{m}\alpha_i\alpha_j\langle x_i, x_j \rangle \quad \text{where} \quad \alpha_i \in [0, 1] \ \text{ and } \ \sum_{i=1}^{m}\alpha_i = \nu m$$

This is similar to the SV classification problem and can be solved with standard solvers.

slide-61
SLIDE 61

The ν-property theorem

  • Optimization problem

$$\underset{w}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m}\xi_i - m\nu\rho \quad \text{subject to} \quad \langle w, x_i \rangle \ge \rho - \xi_i \ \text{ and } \ \xi_i \ge 0$$

  • The solution satisfies:
  • At most a fraction ν of the points are novel
  • At most a fraction (1 − ν) of the points aren’t novel
  • The fraction of points on the boundary vanishes for large m (for non-pathological kernels)
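In practice this ν-parameterized single-class problem is available off the shelf; a hedged sketch with scikit-learn's OneClassSVM (the library and the parameter values are my own choices, not from the slides):

```python
# Sketch: single-class SVM with the nu-trick; nu bounds the novel fraction.
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.default_rng(0).normal(size=(500, 2))
oc = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.5).fit(X)
labels = oc.predict(X)                 # +1 = typical, -1 = novel
print((labels == -1).mean())           # close to (and at most about) nu = 0.1
```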

slide-62
SLIDE 62

Proof

  • Move the boundary at optimality.
  • For a smaller threshold, the m⁻ points on the wrong side of the margin contribute $\delta(m_- - \nu m) \le 0$.
  • For a larger threshold, the m⁺ points not on the ‘good’ side of the margin yield $\delta(m_+ - \nu m) \ge 0$.
  • Combining the inequalities: $\frac{m_-}{m} \le \nu \le \frac{m_+}{m}$.
  • The margin is a set of measure 0.

slide-63
SLIDE 63

Toy example

ν, width c:      0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1
frac. SVs/OLs:   0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38
margin ρ/||w||:  0.84 | 0.70 | 0.62 | 0.48

(threshold and smoothness requirements)

slide-64
SLIDE 64

Novelty detection for OCR

Better estimates since we only optimize in low density regions. Specifically tuned for small number of outliers. Only estimates of a level-set. For ν = 1 we get the Parzen-windows estimator back.

slide-65
SLIDE 65

Classification with the ν-trick

changing kernel width and threshold

slide-66
SLIDE 66

Structured Estimation (preview)

slide-67
SLIDE 67

Large Margin Condition

  • Binary classifier: correct class chosen with large margin y f(x)
  • Multiple classes
  • Score function per class f(x, y)
  • Want the correct class to have a much larger score than any incorrect class:

$$f(x, y) - f(x, y') \ge 1 \quad \text{for all } y' \ne y$$

  • Structured loss function (e.g. coal & diamonds): require a loss-dependent margin

$$f(x, y) - f(x, y') \ge \Delta(y, y') \quad \text{for all } y' \ne y$$

slide-68
SLIDE 68

Large Margin Classifiers

  • Large margin without rescaling (convex) (Guestrin, Taskar, Koller):

$$l(x, y, f) = \sup_{y' \in Y}\left[f(x, y') - f(x, y) + \Delta(y, y')\right]$$

  • Large margin with rescaling (convex) (Tsochantaridis, Hofmann, Joachims, Altun):

$$l(x, y, f) = \sup_{y' \in Y}\left[f(x, y') - f(x, y) + 1\right]\Delta(y, y')$$

  • Both losses majorize the misclassification loss $\Delta\!\left(y,\ \operatorname{argmax}_{y'} f(x, y')\right)$
  • Proof by plugging the argmax into the definition

slide-69
SLIDE 69

Many applications

  • Ranking (DCG, NDCG)
  • Graph matching (linear assignment)
  • ROC and Fβ scores
  • Sequence annotation (named entities, activity)
  • Segmentation
  • Natural Language Translation
  • Image annotation / scene understanding
  • Caution - this loss is generally not consistent!
slide-70
SLIDE 70

Extensions

  • Invariances
  • Add prior knowledge (e.g. in OCR)
  • Make estimates robust against malicious abuse (e.g. spam filtering)

  • Tighter upper bounds
  • Convex bound can be very loose
  • Overweights noisy data
  • Structured version of ramp loss
  • Can be shown to be consistent
slide-71
SLIDE 71

More Kernel Algorithms

slide-72
SLIDE 72

Kernel PCA

slide-73
SLIDE 73

Principal Component Analysis

  • Gaussian density model

$$p(x; \mu, \Sigma) = (2\pi)^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right)$$

  • Estimate variance by the empirical average

$$\hat{\Sigma} = \frac{1}{m}\sum_{i=1}^{m}x_i x_i^\top - \hat{\mu}\hat{\mu}^\top \quad \text{where} \quad \hat{\mu} = \frac{1}{m}\sum_{i=1}^{m}x_i$$

  • Good approximation by a low-rank model
  • Extract leading eigenvalues of the covariance
  • Data might lie in a subspace

slide-74
SLIDE 74

Principal Component Analysis

  • Generative approximation of the data

$$x = \sum_i \sigma_i v_i \alpha_i \quad \text{where} \quad \alpha_i \sim \mathcal{N}(0, 1)$$

  • Heuristic: a good explanation of the data implies that we have meaningful dimensions of the data
  • Linear feature extraction: $g_i(x) = \langle v_i, x \rangle$
  • PCA is the reconstruction with smallest $l_2$ error

slide-75
SLIDE 75

http://www.plantsciences.ucdavis.edu/gepts/pb143/LEC17/pq0921251003.gif

good for exploratory data analysis

slide-76
SLIDE 76

Kernel PCA

(Figure: linear PCA in $\mathbb{R}^2$ with $k(x, x') = \langle x, x' \rangle$ versus kernel PCA after the feature map $\Phi: \mathbb{R}^2 \to H$, e.g. with $k(x, x') = \langle x, x' \rangle^d$.)

slide-77
SLIDE 77

PCA via inner products

  • Eigenvector condition: $\Sigma v = \lambda v$, i.e.

$$\frac{1}{m}\sum_i \bar{x}_i\bar{x}_i^\top v = \lambda v \quad \text{for} \quad \bar{x}_i = x_i - \frac{1}{m}\sum_j x_j, \quad \text{hence} \quad v = \sum_j \alpha_j\bar{x}_j$$

  • Kernel PCA: taking inner products with $\bar{x}_l$,

$$\bar{x}_l^\top\,\frac{1}{m}\sum_i \bar{x}_i\bar{x}_i^\top v = \lambda\,\bar{x}_l^\top v$$

yields

$$\frac{1}{m}\bar{K}\bar{K}\alpha = \lambda\bar{K}\alpha \quad \Longrightarrow \quad \frac{1}{m}\bar{K}\alpha = \lambda\alpha \quad \text{where} \quad \bar{K}_{ij} = \langle \bar{x}_i, \bar{x}_j \rangle$$
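A minimal NumPy sketch of kernel PCA following these equations: center the kernel matrix, take its leading eigenvectors, and rescale them so the corresponding feature-space directions have unit norm. The kernel choice and width are illustrative.

```python
# Sketch: kernel PCA via the centered kernel matrix.
import numpy as np

def kernel_pca(K, n_components=2):
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kbar = K - one @ K - K @ one + one @ K @ one        # centering in feature space
    lam, alpha = np.linalg.eigh(Kbar / m)               # (1/m) Kbar alpha = lambda alpha
    idx = np.argsort(lam)[::-1][:n_components]
    lam, alpha = lam[idx], alpha[:, idx]
    alpha = alpha / np.sqrt(np.maximum(lam * m, 1e-12)) # unit-norm eigenfunctions
    return Kbar @ alpha                                 # projections of the training data

X = np.random.default_rng(0).normal(size=(100, 3))
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
Z = kernel_pca(np.exp(-0.5 * d2), n_components=2)
print(Z.shape)                                          # (100, 2)
```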

slide-78
SLIDE 78

Two dimensional feature extraction

Noisy parabola, polynomial kernels of increasing order (order 1 is PCA).

(Figure: contour plots of the leading kernel PCA features, with the corresponding eigenvalue shown per panel.)

slide-79
SLIDE 79

Feature extraction

(Figure: the leading extracted features, with eigenvalues ranging from 0.251 down to 0.002.)

slide-80
SLIDE 80

Mean Classifier

slide-81
SLIDE 81

‘Trivial’ classifier

  • Represent each class by its mean in feature space
  • Classify along the direction of maximum discrepancy between the classes
  • Trivial to ‘train’

(Figure: class means c+ and c−, a test point x, and the direction w = c+ − c−.)

slide-82
SLIDE 82

‘Trivial’ classifier

  • Class means

$$\mu_+ = \frac{1}{m_+}\sum_{i: y_i = 1}\phi(x_i) \quad \text{and} \quad \mu_- = \frac{1}{m_-}\sum_{i: y_i = -1}\phi(x_i)$$

  • Classifier (much like the Watson-Nadaraya estimator)

$$f(x) = \langle \mu_+ - \mu_-,\ \phi(x) \rangle = \sum_i \frac{y_i}{m_{y_i}}\,k(x_i, x)$$
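A minimal sketch of this classifier, assuming a precomputed kernel matrix between test and training points (NumPy only):

```python
# Sketch: mean classifier in feature space, f(x) = sum_i y_i / m_{y_i} * k(x_i, x).
import numpy as np

def mean_classifier(K_test_train, y_train):
    m_pos = (y_train == 1).sum()
    m_neg = (y_train == -1).sum()
    weights = np.where(y_train == 1, 1.0 / m_pos, -1.0 / m_neg)
    return K_test_train @ weights          # take the sign for the predicted class
```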

slide-83
SLIDE 83

More kernel methods

  • Canonical Correlation analysis
  • Two sample test
  • Mean in feature space is sufficient to fully

represent a distribution

  • Compare them by computing distance
  • Independence test
  • Compare joint and product of marginals
  • Structured feature extraction
  • Find directions of high significance and low

function complexity

slide-84
SLIDE 84

Conditional Models

slide-85
SLIDE 85

Gaussian Processes

slide-86
SLIDE 86

Weight & height

slide-87
SLIDE 87

Weight & height

assume Gaussian correlation

slide-88
SLIDE 88

$$p(\text{weight} \mid \text{height}) = \frac{p(\text{height}, \text{weight})}{p(\text{height})} \propto p(\text{height}, \text{weight})$$

slide-89
SLIDE 89

$$p(x_2 \mid x_1) \propto \exp\left(-\frac{1}{2}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}^{-1}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}\right)$$

keep linear and quadratic terms of exponent

slide-90
SLIDE 90

The gory math

Correlated observations: assume that the random variables $t \in \mathbb{R}^n$, $t' \in \mathbb{R}^{n'}$ are jointly normal with mean $(\mu, \mu')$ and covariance matrix K:

$$p(t, t') \propto \exp\left(-\frac{1}{2}\begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}^\top \begin{bmatrix} K_{tt} & K_{tt'} \\ K_{tt'}^\top & K_{t't'} \end{bmatrix}^{-1}\begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}\right)$$

Inference: given t, estimate t' via $p(t' \mid t)$. Translation into machine learning language: we learn t' from t.

Practical solution: since $t' \mid t \sim \mathcal{N}(\tilde{\mu}, \tilde{K})$, we only need to collect all terms in $p(t, t')$ depending on t', by matrix inversion, hence

$$\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1}K_{tt'} \quad \text{and} \quad \tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1}(t - \mu)$$
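A minimal NumPy sketch of these two formulas, conditioning a joint Gaussian on the observed block:

```python
# Sketch: conditional mean and covariance of t' given t for a joint Gaussian.
import numpy as np

def condition_gaussian(mu_t, mu_tp, K_tt, K_ttp, K_tptp, t_obs):
    mu_cond = mu_tp + K_ttp.T @ np.linalg.solve(K_tt, t_obs - mu_t)
    K_cond = K_tptp - K_ttp.T @ np.linalg.solve(K_tt, K_ttp)
    return mu_cond, K_cond
```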

slide-91
SLIDE 91

Gaussian Process

Key idea: instead of a fixed set of random variables t, t' we assume a stochastic process $t: X \to \mathbb{R}$, e.g. $X = \mathbb{R}^n$. Previously we had X = {age, height, weight, ...}.

Definition of a Gaussian process: a stochastic process $t: X \to \mathbb{R}$ where all $(t(x_1), \ldots, t(x_m))$ are jointly normally distributed.

Parameters of a GP: mean $\mu(x) := \mathbf{E}[t(x)]$ and covariance function $k(x, x') := \mathrm{Cov}(t(x), t(x'))$.

Simplifying assumption: we assume knowledge of $k(x, x')$ and set $\mu = 0$.

slide-92
SLIDE 92

Kernels ...

Covariance function: a function of two arguments; leads to a matrix with nonnegative eigenvalues; describes the correlation between pairs of observations.

Kernel: a function of two arguments; leads to a matrix with nonnegative eigenvalues; a similarity measure between pairs of observations.

Lucky guess: we suspect that kernels and covariance functions are the same ...

slide-93
SLIDE 93

The connection

Gaussian process on parameters: $t \sim \mathcal{N}(\mu, K)$ where $K_{ij} = k(x_i, x_j)$.

Linear model in feature space: $t(x) = \langle \Phi(x), w \rangle + \mu(x)$ where $w \sim \mathcal{N}(0, \mathbf{1})$. The covariance between t(x) and t(x') is then given by

$$\mathbf{E}_w\left[\langle \Phi(x), w \rangle\langle w, \Phi(x') \rangle\right] = \langle \Phi(x), \Phi(x') \rangle = k(x, x')$$

Conclusion: a small weight vector in “feature space”, as commonly used in SVMs, amounts to observing t with high p(t). The log prior $\log p(t)$ corresponds to the margin term $\|w\|^2$. We will get back to this later.

slide-94
SLIDE 94

Regression

slide-95
SLIDE 95

Joint Gaussian Model

  • Random variables (t, t’) are drawn from a GP
  • Observe a subset t of them
  • Predict the rest using

$$\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1}K_{tt'} \quad \text{and} \quad \tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1}(t - \mu)$$

  • Linear expansion (precompute things)
  • Predictive uncertainty is data independent (good for experimental design)
  • Predictive variance vanishes if K is rank deficient

slide-96
SLIDE 96

Some kernels

Observation: any function k leading to a symmetric matrix with nonnegative eigenvalues is a valid covariance function.

Necessary and sufficient condition (Mercer’s theorem): k needs to be a nonnegative integral kernel.

Examples of kernels k(x, x'):
Linear: $\langle x, x' \rangle$
Laplacian RBF: $\exp(-\lambda\|x - x'\|)$
Gaussian RBF: $\exp(-\lambda\|x - x'\|^2)$
Polynomial: $(\langle x, x' \rangle + c)^d$, $c \ge 0$, $d \in \mathbb{N}$
B-Spline: $B_{2n+1}(x - x')$
Cond. expectation: $\mathbf{E}_c[p(x|c)\,p(x'|c)]$

slide-97
SLIDE 97

Linear ‘GP regression’

Linear kernel: $k(x, x') = \langle x, x' \rangle$, kernel matrix $X^\top X$.

Mean and covariance:

$$\tilde{K} = X'^\top X' - X'^\top X(X^\top X)^{-1}X^\top X' = X'^\top(\mathbf{1} - P_X)X' \qquad \tilde{\mu} = X'^\top\left[X(X^\top X)^{-1}t\right]$$

$\tilde{\mu}$ is a linear function of X'.

Problem: the covariance matrix $X^\top X$ has at most rank n. After n observations ($x \in \mathbb{R}^n$) the variance vanishes. This is not realistic (“flat pancake” or “cigar” distribution).

slide-98
SLIDE 98

Degenerate Covariance

slide-99
SLIDE 99

Additive Noise

Indirect model: instead of observing t(x) we observe $y = t(x) + \xi$, where ξ is a nuisance term. This yields

$$p(Y \mid X) = \int \prod_{i=1}^{m} p(y_i \mid t_i)\ p(t \mid X)\,dt$$

where we can now find a maximum a posteriori solution for t by maximizing the integrand (we will use this later).

Additive normal noise: if $\xi \sim \mathcal{N}(0, \sigma^2)$ then y is the sum of two Gaussian random variables, so means and variances add up: $y \sim \mathcal{N}(\mu, K + \sigma^2\mathbf{1})$.

slide-100
SLIDE 100

Data

slide-101
SLIDE 101

Predictive mean: $k(x, X)^\top\left(K(X, X) + \sigma^2\mathbf{1}\right)^{-1}y$

slide-102
SLIDE 102

Variance

slide-103
SLIDE 103

Putting it all together

slide-104
SLIDE 104

Putting it all together

slide-105
SLIDE 105

Ugly details

Covariance matrices: with additive noise, $K = K_{\text{kernel}} + \sigma^2\mathbf{1}$.

Predictive mean and variance:

$$\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1}K_{tt'} \quad \text{and} \quad \tilde{\mu} = K_{tt'}^\top K_{tt}^{-1}t$$

Pointwise prediction:

$$K_{tt} = K + \sigma^2\mathbf{1}, \qquad K_{t't'} = k(x, x) + \sigma^2, \qquad K_{tt'} = (k(x_1, x), \ldots, k(x_m, x))$$

Plug this into the mean and covariance equations.
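Putting the pointwise formulas together, a minimal NumPy sketch of GP regression with additive noise; the Gaussian covariance function and its width are illustrative choices.

```python
# Sketch: GP regression predictive mean and variance with additive noise.
import numpy as np

def gp_predict(X, y, X_star, gamma=0.5, sigma2=0.1):
    def k(A, B):
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)

    K = k(X, X) + sigma2 * np.eye(len(X))            # K_tt = K + sigma^2 * 1
    K_star = k(X, X_star)                            # K_tt'
    mean = K_star.T @ np.linalg.solve(K, y)          # k(x,X)^T (K + sigma^2 1)^{-1} y
    var = (k(X_star, X_star).diagonal() + sigma2
           - np.einsum("ij,ij->j", K_star, np.linalg.solve(K, K_star)))
    return mean, var
```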

slide-106
SLIDE 106

Gaussian Process Conditional Models

slide-107
SLIDE 107

Exponential Families

slide-108
SLIDE 108

Exponential Families

  • Density function

$$p(x; \theta) = \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right) \quad \text{where} \quad g(\theta) = \log\sum_{x'}\exp\left(\langle \phi(x'), \theta \rangle\right)$$

slide-109
SLIDE 109

Exponential Families

  • Density function
  • Log partition function generates cumulants

$$p(x; \theta) = \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right) \quad \text{where} \quad g(\theta) = \log\sum_{x'}\exp\left(\langle \phi(x'), \theta \rangle\right)$$

$$\partial_\theta g(\theta) = \mathbf{E}\left[\phi(x)\right] \qquad \partial_\theta^2 g(\theta) = \mathrm{Var}\left[\phi(x)\right]$$

slide-110
SLIDE 110

Exponential Families

  • Density function
  • Log partition function generates cumulants
  • g is convex (second derivative is p.s.d.)

$$p(x; \theta) = \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right) \quad \text{where} \quad g(\theta) = \log\sum_{x'}\exp\left(\langle \phi(x'), \theta \rangle\right)$$

$$\partial_\theta g(\theta) = \mathbf{E}\left[\phi(x)\right] \qquad \partial_\theta^2 g(\theta) = \mathrm{Var}\left[\phi(x)\right]$$

slide-111
SLIDE 111

Conditional Exponential Families

$$p(y \mid x; \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\right) \quad \text{where} \quad g(\theta \mid x) = \log\sum_{y'}\exp\left(\langle \phi(x, y'), \theta \rangle\right)$$

$$\partial_\theta g(\theta \mid x) = \mathbf{E}\left[\phi(x, y) \mid x\right] \qquad \partial_\theta^2 g(\theta \mid x) = \mathrm{Var}\left[\phi(x, y) \mid x\right]$$

slide-112
SLIDE 112

Conditional Exponential Families

  • Density function

$$p(y \mid x; \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\right) \quad \text{where} \quad g(\theta \mid x) = \log\sum_{y'}\exp\left(\langle \phi(x, y'), \theta \rangle\right)$$

$$\partial_\theta g(\theta \mid x) = \mathbf{E}\left[\phi(x, y) \mid x\right] \qquad \partial_\theta^2 g(\theta \mid x) = \mathrm{Var}\left[\phi(x, y) \mid x\right]$$

slide-113
SLIDE 113

Conditional Exponential Families

  • Density function
  • Log partition function generates cumulants

$$p(y \mid x; \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\right) \quad \text{where} \quad g(\theta \mid x) = \log\sum_{y'}\exp\left(\langle \phi(x, y'), \theta \rangle\right)$$

$$\partial_\theta g(\theta \mid x) = \mathbf{E}\left[\phi(x, y) \mid x\right] \qquad \partial_\theta^2 g(\theta \mid x) = \mathrm{Var}\left[\phi(x, y) \mid x\right]$$

slide-114
SLIDE 114

Conditional Exponential Families

  • Density function
  • Log partition function generates cumulants
  • g is convex (second derivative is p.s.d.)

$$p(y \mid x; \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\right) \quad \text{where} \quad g(\theta \mid x) = \log\sum_{y'}\exp\left(\langle \phi(x, y'), \theta \rangle\right)$$

$$\partial_\theta g(\theta \mid x) = \mathbf{E}\left[\phi(x, y) \mid x\right] \qquad \partial_\theta^2 g(\theta \mid x) = \mathrm{Var}\left[\phi(x, y) \mid x\right]$$

slide-115
SLIDE 115

Key Idea

  • Gaussian Process indexed by (x,y)
  • Binary y yields classification
  • Set for y yields multiclass
  • Integer y yields Poisson regression
  • Scalar y yields heteroscedastic regression
  • Sequence for y yields CRF
  • ... and lots more ...
  • The GP is in the latent variables

(Regression is special case where we can integrate)

slide-116
SLIDE 116

Conditional GP Model

  • Data likelihood

$$p(y \mid x, t(x)) := e^{t(x, y) - g(t(x))} \quad \text{where} \quad g(t(x)) = \log\sum_{y}e^{t(x, y)}$$

  • Prior: $t \sim \mathcal{N}(\mu, K)$
  • Posterior distribution

$$p(t \mid X, Y) \propto \exp\left(\sum_i\left[t(x_i, y_i) - g(t(x_i))\right] - \frac{1}{2}t^\top K^{-1}t\right)$$

  • Maximize with respect to t for the MAP estimate

slide-117
SLIDE 117

Logistic Regression

slide-118
SLIDE 118

Binomial Model

  • Binary label space {-1, 1}
  • We can center t(x, y) as y t(x) (a constant offset doesn’t change the model)
  • Log-likelihood

$$-\log p(y \mid t) = \log\left[e^{t} + e^{-t}\right] - yt = \log\left[1 + e^{-2yt}\right]$$

  • After rescaling by 2 this is the logistic loss
  • MAP estimation problem

$$\underset{t}{\text{minimize}} \quad \frac{1}{2}t^\top K^{-1}t + \sum_{i=1}^{m}\log\left[1 + e^{-y_i t_i}\right]$$
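A minimal sketch of this MAP problem: writing t = Kα turns the objective into a smooth function of α, which plain gradient descent can handle. The step size and iteration count are illustrative, and no numerical safeguards are included.

```python
# Sketch: MAP estimate for GP (kernel) logistic regression via gradient descent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kernel_logreg_map(K, y, lr=0.05, iters=500):
    alpha = np.zeros(len(y))
    for _ in range(iters):
        t = K @ alpha
        g = -y * sigmoid(-y * t)           # d/dt of log(1 + exp(-y t))
        grad = K @ (alpha + g)             # gradient of 1/2 a^T K a + sum_i loss(t_i)
        alpha -= lr * grad
    return alpha                           # predict with sign(K_test_train @ alpha)
```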

slide-119
SLIDE 119

More loss functions

  • Logistic: $\log\left[1 + e^{-f(x)}\right]$
  • Huberized loss:

$$l(f(x)) = \begin{cases} 0 & \text{if } f(x) > 1 \\ \frac{1}{2}(1 - f(x))^2 & \text{if } f(x) \in [0, 1] \\ \frac{1}{2} - f(x) & \text{if } f(x) < 0 \end{cases}$$

  • Soft margin: $\max(0, 1 - f(x))$

All of these are (asymptotically) linear for large negative margins and (asymptotically) 0 for large positive margins.

slide-120
SLIDE 120

Clean Data

slide-121
SLIDE 121

Noisy Data

slide-122
SLIDE 122

Heteroscedastic Estimation

slide-123
SLIDE 123

Motivation

  • GP Regression has variance estimate

independent of observed data

  • Assumes that we know variance globally

beforehand

  • This is nonsense!
  • Estimate mean and variance jointly
  • Easily possible in an exponential family model

Le, Canu, Smola, 2005

slide-124
SLIDE 124

Recall - Normal distributions

Engineer’s favorite:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) \quad \text{where} \quad x \in \mathbb{R} =: X$$

Massaging the math:

$$p(x) = \exp\Big(\big\langle \underbrace{(x,\ -\tfrac{1}{2}x^2)}_{\phi(x)},\ \theta\big\rangle - \underbrace{\Big(\frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\Big)}_{g(\theta)}\Big)$$

Using the substitution $\theta_1 := \mu\sigma^{-2}$ and $\theta_2 := \sigma^{-2}$ yields

$$g(\theta) = \frac{1}{2}\left[\theta_1^2\,\theta_2^{-1} + \log 2\pi - \log\theta_2\right]$$

slide-125
SLIDE 125

Basic Idea

Sufficient statistic: we pick $\phi(x, y) = (y\,\phi_1(x),\ y^2\,\phi_2(x))$, that is

$$k((x, y), (x', y')) = k_1(x, x')\,y y' + k_2(x, x')\,y^2 y'^2 \quad \text{where} \quad y, y' \in \mathbb{R}$$

Hence we estimate mean and variance simultaneously.

Optimization problem (writing the natural parameters as kernel expansions $t_1(x_i) = \sum_j \alpha_{1j}k_1(x_i, x_j)$ and $t_2(x_i) = \sum_j \alpha_{2j}k_2(x_i, x_j)$):

$$\text{minimize} \quad \sum_{i=1}^{m}\left[-\frac{1}{4}\,t_1(x_i)^\top t_2(x_i)^{-1}t_1(x_i) - \frac{1}{2}\log\det\left(-2\,t_2(x_i)\right) - y_i^\top t_1(x_i) - y_i^\top t_2(x_i)\,y_i\right] + \frac{1}{2\sigma^2}\sum_{i, j}\left[\alpha_{1i}^\top\alpha_{1j}\,k_1(x_i, x_j) + \operatorname{tr}\left(\alpha_{2i}\alpha_{2j}^\top\right)k_2(x_i, x_j)\right]$$

$$\text{subject to} \quad \sum_{i=1}^{m}\alpha_{2i}\,k_2(x_i, x_j) \preceq 0 \ \text{ for all } j$$

The problem is convex. The log-determinant from the normalization of the Gaussian acts as a barrier function, i.e. a nice SDP.

slide-126
SLIDE 126
slide-127
SLIDE 127
slide-128
SLIDE 128

Computational Issues

Newton method with CG solver: use Newton’s method to compute the update direction, with a CG solver instead of inverting the Hessian.
Lazy evaluation: never build the explicit Hessian.
Reduced rank: use an incomplete Cholesky factorization for a low-rank approximation.

Result (runtime versus sample size m):

m:               100   200   500   1k    2k    5k    10k   20k
Direct Hessian:  8     18    90    607   3551
Hessian vector:  9     15    38    115   752
Reduced rank:    7     7     12    30    54    179   368   727

This yields scaling of O(m^2.1), O(m^1.4), and O(m^0.95), respectively.

slide-129
SLIDE 129

Standard GP

slide-130
SLIDE 130

Heteroscedastic GP mean

slide-131
SLIDE 131

Heteroscedastic GP variance

slide-132
SLIDE 132
  • Kernel trick
  • Simple kernels
  • Kernel PCA
  • Mean Classifier
  • Support Vectors
  • Support Vector Machine classification
  • Regression
  • Logistic regression
  • Novelty detection
  • Gaussian Process Estimation
  • Regression
  • Classification
  • Heteroscedastic Regression

(Generalized) Linear Models

slide-133
SLIDE 133

Further reading

  • Ramp loss consistency

http://books.nips.cc/papers/files/nips24/NIPS2011_1222.pdf

  • Ranking and structured estimation

http://users.cecs.anu.edu.au/~chteo/pub/LeSmoChaTeo09.pdf

  • Invariances and convexity

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11755

  • Ramp loss for structured estimation

http://users.cecs.anu.edu.au/~chteo/pub/Chaetal09.pdf

  • Structured estimation (with margin rescaling)

http://ttic.uchicago.edu/~altun/pubs/AltHofTso06.pdf

  • Structured estimation (without margin rescaling)

http://www.seas.upenn.edu/~taskar/pubs/icml05.pdf

  • Ben Taskar’s tutorial

http://www.seas.upenn.edu/~taskar/nips07tut/nips07tut.ppt

slide-134
SLIDE 134

Further reading

  • SVM Tutorial (regression)

http://alex.smola.org/papers/2003/SmoSch03b.pdf

  • SVM Tutorial (classification)

http://www.umiacs.umd.edu/~joseph/support-vector-machines4.pdf

  • Introductory chapter of Kernel book

http://alex.smola.org/teaching/berkeley2012/slides/lwk_chapter1.pdf

  • Introductory chapter of structured estimation book

http://alex.smola.org/teaching/berkeley2012/slides/se_chapter2.pdf

  • Kernel PCA

http://dl.acm.org/citation.cfm?id=295919.295960