Scalable Machine Learning - 5. (Generalized) Linear Models - Alex Smola



slide-1
SLIDE 1

Scalable Machine Learning

  • 5. (Generalized) Linear Models

Alex Smola Yahoo! Research and ANU

http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12

slide-2
SLIDE 2

Administrative stuff

  • Solutions will be posted by tomorrow
  • New problem set will be available by tomorrow
  • Midterm project presentations are on March 13
  • Describe what you will do
  • Why it’s important
  • What you’ve achieved so far
  • Show why you think you’re going to succeed
  • 10 minutes per team (6 slides maximum)
  • Up to 10 pages supporting documentation
slide-3
SLIDE 3
  • 5. (Generalized) Linear Models
slide-4
SLIDE 4
  • Kernel trick
  • Simple kernels
  • Kernel PCA
  • Mean Classifier
  • Support Vectors
  • Support Vector Machine classification
  • Regression
  • Logistic regression
  • Novelty detection
  • Gaussian Process Estimation
  • Regression
  • Classification
  • Heteroscedastic Regression

(Generalized) Linear Models

slide-5
SLIDE 5

Kernels - a Preview

slide-6
SLIDE 6

Solving XOR

  • XOR not linearly separable
  • Mapping into 3 dimensions makes it easily solvable

$(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$
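As a quick illustration (not from the slides), a minimal NumPy sketch of this lifting: after mapping $(x_1, x_2)$ to $(x_1, x_2, x_1 x_2)$, the XOR pattern is separated by a plain linear threshold on the third coordinate.

```python
# Minimal sketch (NumPy assumed): XOR becomes linearly separable after the
# feature map (x1, x2) -> (x1, x2, x1*x2).
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                 # XOR labels in {-1, +1} encoding

phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# In the lifted space the hyperplane w = (0, 0, 1), b = 0 separates the classes.
w, b = np.array([0.0, 0.0, 1.0]), 0.0
print(np.sign(phi @ w + b) == y)             # [ True  True  True  True]
```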

slide-7
SLIDE 7

Feature Space Mapping

  • Naive Nonlinearization Strategy
  • Express data x in terms of features ɸ(x)
  • Solve problem in feature space
  • Requires explicit feature computation
  • Kernel trick
  • Write algorithm in terms of inner products
  • Replace $\langle x, x' \rangle$ by $k(x, x')$
  • Works well for dimension-insensitive methods
  • Kernel matrix K is positive semidefinite

$$\langle x, x' \rangle \;\longrightarrow\; k(x, x') := \langle \phi(x), \phi(x') \rangle$$

slide-8
SLIDE 8
slide-9
SLIDE 9

Polynomial Kernels

  • Linear: $k(x, x') := \langle x, x' \rangle$
  • Quadratic (for $x \in \mathbb{R}^2$): $k(x, x') := \left\langle (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2),\ (x_1'^2, x_2'^2, \sqrt{2}\,x_1' x_2') \right\rangle = \langle x, x' \rangle^2$
  • Homogeneous polynomial: $k(x, x') := \langle x, x' \rangle^p = \sum_{|\alpha| = p} \binom{p}{\alpha} \prod_i (x_i x_i')^{\alpha_i}$ with $\alpha \in \mathbb{N}^d$
  • Inhomogeneous polynomial: $k(x, x') := (\langle x, x' \rangle + c)^p = \sum_{i=0}^{p} \binom{p}{i} c^{p-i} \langle x, x' \rangle^i$

(all defined via the inner product)

slide-10
SLIDE 10

More Kernels

  • Gaussian kernel: $k(x, x') := \exp\left(-\gamma \|x - x'\|^2\right)$ (one can check that this is a convolution of Gaussians)
  • Brownian bridge: $k(x, x') := \min(x, x')$ for $x, x' \ge 0$
  • Set intersection: $k(A, B) := |A \cap B|$
  • Strings, more fancy set kernels, graphs, etc.
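A minimal sketch of a few of these kernels in NumPy (bandwidth, degree, and offset values are illustrative); it also checks numerically that each kernel matrix is positive semidefinite, as claimed above.

```python
# Sketch only: some of the kernels above, plus a numerical PSD check.
import numpy as np

def linear(X, Y):
    return X @ Y.T

def polynomial(X, Y, c=1.0, p=3):            # inhomogeneous polynomial kernel
    return (X @ Y.T + c) ** p

def gaussian(X, Y, gamma=0.5):
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
for k in (linear, polynomial, gaussian):
    K = k(X, X)
    assert np.linalg.eigvalsh(K).min() > -1e-8    # PSD up to round-off
```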

slide-11
SLIDE 11

Support Vector Machines

slide-12
SLIDE 12

Classification

http://maktoons.blogspot.com/2009/03/support-vector-machine.html

slide-13
SLIDE 13

Support Vectors

Hyperplanes: decision boundary $\{x \mid \langle w, x \rangle + b = 0\}$ and margin hyperplanes $\{x \mid \langle w, x \rangle + b = \pm 1\}$ for the classes $y_i = +1$ and $y_i = -1$.

Optimization problem (maximize the margin):

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1$$

Margin: for points on the two margin hyperplanes, $\langle w, x_1 \rangle + b = 1$ and $\langle w, x_2 \rangle + b = -1$, hence $\langle w, x_1 - x_2 \rangle = 2$ and therefore

$$\left\langle \frac{w}{\|w\|},\ x_1 - x_2 \right\rangle = \frac{2}{\|w\|}$$

slide-14
SLIDE 14

Support Vectors

Primal problem:

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1$$

Dual problem: with $K_{ij} = y_i y_j \langle x_i, x_j \rangle$ and $w = \sum_i \alpha_i y_i x_i$,

$$\underset{\alpha}{\text{minimize}} \quad \frac{1}{2}\alpha^\top K \alpha - \mathbf{1}^\top \alpha \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \ \text{ and } \ \alpha_i \ge 0$$

slide-15
SLIDE 15

Karush Kuhn Tucker conditions

KKT optimality condition:

$$\alpha_i\left[y_i(\langle x_i, w \rangle + b) - 1\right] = 0$$

Hence $y_i(\langle x_i, w \rangle + b) > 1$ implies $\alpha_i = 0$, and $\alpha_i > 0$ implies $y_i(\langle x_i, w \rangle + b) = 1$.

slide-16
SLIDE 16

Properties

  • Weight vector w is a weighted linear combination of instances
  • Only points on the margin matter (we can ignore the rest and get the same solution)

  • Only inner products matter
  • Quadratic program
  • We can replace the inner product by a kernel
  • Keeps instances away from the margin

Java demo: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
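Since the quadratic program only sees the data through inner products, replacing them by a kernel is a one-line change in practice. A hedged sketch using scikit-learn's SVC (a library choice and toy data set of my own, not from the slides):

```python
# Sketch: kernelized soft-margin SVM classification via scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)      # not linearly separable in R^2

clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
print(len(clf.support_))        # only the support vectors determine the solution
print(clf.score(X, y))
```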

slide-17
SLIDE 17

Example

slide-18
SLIDE 18

Example

slide-19
SLIDE 19

Why large margins?

  • Maximum robustness relative to uncertainty
  • Symmetry breaking
  • Independent of correctly classified instances
  • Easy to find for easy problems

(Figure: separable point sets with margin ρ and perturbation radius r.)

slide-20
SLIDE 20

Inseparable data

Quadratic program has no feasible solution

slide-21
SLIDE 21

Adding slack variables

  • Hard margin problem

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1$$

  • With slack variables

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0$$

the problem is always feasible. Proof: $w = 0$, $b = 0$, $\xi_i = 1$ is feasible (this also yields an upper bound on the objective).

slide-22
SLIDE 22

Support Vectors

Primal problem (soft margin):

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0$$

Dual problem: with $K_{ij} = y_i y_j \langle x_i, x_j \rangle$ and $w = \sum_i \alpha_i y_i x_i$,

$$\underset{\alpha}{\text{minimize}} \quad \frac{1}{2}\alpha^\top K \alpha - \mathbf{1}^\top \alpha \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \ \text{ and } \ \alpha_i \in [0, C]$$

slide-23
SLIDE 23

Classification with errors

slide-24
SLIDE 24

Nonlinear separation

  • Increasing C allows for more nonlinearities
  • Decreases number of errors
  • SV boundary need not be contiguous
slide-25
SLIDE 25

Loss function point of view

  • Constrained quadratic program

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left[\langle w, x_i \rangle + b\right] \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0$$

  • Risk minimization setting (empirical risk)

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_i \max\left[0,\ 1 - y_i\left[\langle w, x_i \rangle + b\right]\right]$$

This follows from finding the minimal slack variable for a given (w, b) pair.
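A minimal sketch of this unconstrained view, assuming NumPy: subgradient descent on $\frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(\langle w, x_i\rangle + b))$. Step size and iteration count are illustrative choices, not part of the slides.

```python
# Sketch: soft-margin SVM as regularized hinge-loss minimization.
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=200):
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                        # points with nonzero slack
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```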

slide-26
SLIDE 26

Soft margin as proxy for binary

  • Soft margin loss
  • Binary loss

$$\max(0, 1 - y f(x)) \qquad\qquad \mathbf{1}\{y f(x) < 0\}$$

(The soft margin loss is a convex upper bound on the binary loss, both viewed as functions of the margin $y f(x)$.)

slide-27
SLIDE 27

More loss functions

  • Logistic: $\log\left[1 + e^{-f(x)}\right]$
  • Huberized loss:

$$l(f(x)) = \begin{cases} 0 & \text{if } f(x) > 1 \\ \frac{1}{2}(1 - f(x))^2 & \text{if } f(x) \in [0, 1] \\ \frac{1}{2} - f(x) & \text{if } f(x) < 0 \end{cases}$$

  • Soft margin: $\max(0, 1 - f(x))$

All of these are (asymptotically) linear for large negative margins and (asymptotically) 0 for large positive margins.

slide-28
SLIDE 28

Risk minimization view

  • Find function f minimizing classification error
  • Compute empirical average
  • Minimization is nonconvex
  • Overfitting as we minimize empirical error
  • Compute convex upper bound on the loss
  • Add regularization for capacity control

$$R[f] := \mathbf{E}_{x, y \sim p(x, y)}\left[\mathbf{1}\{y f(x) \le 0\}\right] \qquad R_{\text{emp}}[f] := \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{y_i f(x_i) \le 0\}$$

$$R_{\text{reg}}[f] := \frac{1}{m}\sum_{i=1}^{m} \max(0, 1 - y_i f(x_i)) + \lambda\,\Omega[f]$$

Here $\lambda\,\Omega[f]$ is the regularization term; λ controls the trade-off between data fit and capacity.

slide-29
SLIDE 29

Regression

slide-30
SLIDE 30

Regression Estimation

  • Find function f minimizing regression error
  • Compute empirical average

Overfitting as we minimize empirical error

  • Add regularization for capacity control

$$R[f] := \mathbf{E}_{x, y \sim p(x, y)}\left[l(y, f(x))\right] \qquad R_{\text{emp}}[f] := \frac{1}{m}\sum_{i=1}^{m} l(y_i, f(x_i))$$

$$R_{\text{reg}}[f] := \frac{1}{m}\sum_{i=1}^{m} l(y_i, f(x_i)) + \lambda\,\Omega[f]$$

slide-31
SLIDE 31

Squared loss

$l(y, f(x)) = \frac{1}{2}(y - f(x))^2$

slide-32
SLIDE 32

l1 loss

l(y, f(x)) = |y − f(x)|

slide-33
SLIDE 33

ε-insensitive Loss

$l(y, f(x)) = \max(0, |y - f(x)| - \epsilon)$

slide-34
SLIDE 34

Penalized least mean squares

  • Optimization problem

$$\underset{w}{\text{minimize}} \quad \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(y_i - \langle x_i, w \rangle\right)^2 + \frac{\lambda}{2}\|w\|^2$$

  • Solution: setting the derivative to zero,

$$\partial_w[\ldots] = \frac{1}{m}\sum_{i=1}^{m}\left[x_i x_i^\top w - x_i y_i\right] + \lambda w = \left[\frac{1}{m}XX^\top + \lambda\mathbf{1}\right]w - \frac{1}{m}Xy = 0$$

hence

$$w = \left[XX^\top + \lambda m\mathbf{1}\right]^{-1}Xy$$

(For the matrix inverse use CG or the Sherman-Morrison-Woodbury identity.)

  • Only inner products between the data points matter.
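A minimal NumPy sketch of the closed-form solution above, with X holding one observation per column as on the slide; the synthetic data are illustrative only.

```python
# Sketch: penalized least mean squares, w = (X X^T + lambda*m*I)^{-1} X y.
import numpy as np

def penalized_lms(X, y, lam=0.1):
    d, m = X.shape                        # columns of X are observations
    A = X @ X.T + lam * m * np.eye(d)
    return np.linalg.solve(A, X @ y)      # prefer solve / CG over an explicit inverse

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(3, 100))
y = X.T @ w_true + 0.01 * rng.normal(size=100)
print(penalized_lms(X, y, lam=0.01))      # close to w_true
```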

slide-35
SLIDE 35

SVM Regression (ϵ-insensitive loss)

(Figure: data points with the fitted function and an ε-tube around it; points outside the tube incur a slack ξ. Right: the ε-insensitive loss as a function of y − f(x).)

We don't care about deviations within the tube.

slide-36
SLIDE 36

SVM Regression (ϵ-insensitive loss)

  • Optimization problem (as constrained QP)

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\left[\xi_i + \xi_i^*\right]$$

subject to

$$\langle w, x_i \rangle + b \le y_i + \epsilon + \xi_i \ \text{ and } \ \xi_i \ge 0, \qquad \langle w, x_i \rangle + b \ge y_i - \epsilon - \xi_i^* \ \text{ and } \ \xi_i^* \ge 0$$

  • Lagrange function

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\left[\xi_i + \xi_i^*\right] - \sum_{i=1}^{m}\left[\eta_i \xi_i + \eta_i^* \xi_i^*\right] + \sum_{i=1}^{m}\alpha_i\left[\langle w, x_i \rangle + b - y_i - \epsilon - \xi_i\right] + \sum_{i=1}^{m}\alpha_i^*\left[y_i - \epsilon - \xi_i^* - \langle w, x_i \rangle - b\right]$$

slide-37
SLIDE 37

SVM Regression (ϵ-insensitive loss)

  • First order conditions

$$\partial_w L = 0 = w + \sum_i\left[\alpha_i - \alpha_i^*\right]x_i \qquad \partial_b L = 0 = \sum_i\left[\alpha_i - \alpha_i^*\right]$$

$$\partial_{\xi_i} L = 0 = C - \eta_i - \alpha_i \qquad \partial_{\xi_i^*} L = 0 = C - \eta_i^* - \alpha_i^*$$

  • Dual problem

$$\underset{\alpha, \alpha^*}{\text{minimize}} \quad \frac{1}{2}(\alpha - \alpha^*)^\top K(\alpha - \alpha^*) + \epsilon\,\mathbf{1}^\top(\alpha + \alpha^*) + y^\top(\alpha - \alpha^*)$$

$$\text{subject to} \quad \mathbf{1}^\top(\alpha - \alpha^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C]$$

slide-38
SLIDE 38

Properties

  • Ignores ‘typical’ instances with small error
  • Only the upper or the lower bound is active at any time (we cannot violate both bounds simultaneously)
  • Quadratic program in 2n variables can be solved as cheaply as the standard SVM problem
  • Robustness with respect to outliers
  • l1 loss yields the same problem without epsilon
  • Huber’s robust loss yields a similar problem but with an added quadratic penalty on the coefficients
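For completeness, a hedged sketch of ε-insensitive regression using scikit-learn's SVR, which solves exactly this kind of dual; the data, kernel width, C, and ε are made-up values.

```python
# Sketch: support vector regression with the epsilon-insensitive loss.
import numpy as np
from sklearn.svm import SVR

X = np.linspace(-3, 3, 200)[:, None]
y = np.sinc(X.ravel()) + 0.05 * np.random.default_rng(0).normal(size=200)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, y)
print(len(model.support_))    # points outside the epsilon-tube become support vectors
```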

slide-39
SLIDE 39

Regression example

(Figure: sinc x ± 0.1 and the SVR approximation.)

slide-40
SLIDE 40

Regression example

(Figure: sinc x ± 0.2 and the SVR approximation.)

slide-41
SLIDE 41

Regression example

(Figure: sinc x ± 0.5 and the SVR approximation.)

slide-42
SLIDE 42

Regression example

(Figure: support vectors for the three tube widths.)

slide-43
SLIDE 43

Huber’s robust loss

$$l(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{if } |y - f(x)| < 1 \\ |y - f(x)| - \frac{1}{2} & \text{otherwise} \end{cases}$$

Quadratic for small residuals, linear in the tails; acts as a trimmed mean estimator.

slide-44
SLIDE 44

Novelty Detection

slide-45
SLIDE 45

Basic Idea

Data: observations $(x_i)$ generated from some $P(x)$, e.g. network usage patterns, handwritten digits, alarm sensors, factory status.

Task: find unusual events, clean the database, distinguish typical examples.

slide-46
SLIDE 46

Applications

Network intrusion detection: detect whether someone is trying to hack the network, downloading tons of MP3s, or doing anything else unusual on the network.

Jet engine failure detection: you can’t destroy jet engines just to see how they fail.

Database cleaning: we want to find out whether someone stored bogus information in a database (typos, etc.), mislabelled digits, ugly digits, bad photographs in an electronic album.

Fraud detection: credit cards, telephone bills, medical records.

Self-calibrating alarm devices: car alarms (adjusts itself to where the car is parked), home alarms (furniture, temperature, windows, etc.).

slide-47
SLIDE 47

Novelty Detection via Density Estimation

Key idea: novel data is data that we don’t see frequently, so it must lie in low-density regions.

Step 1: estimate the density. Given observations $x_1, \ldots, x_m$, form a density estimate via Parzen windows.

Step 2: threshold the density. Sort the data according to density and use it for rejection. Practical implementation: compute

$$p(x_i) = \frac{1}{m}\sum_{j} k(x_i, x_j)$$

for all i, sort according to magnitude, and flag the points with smallest $p(x_i)$ as novel.
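A minimal NumPy sketch of exactly these two steps, with a Gaussian Parzen window; the bandwidth and the rejection fraction are illustrative.

```python
# Sketch: Parzen-window density estimate + thresholding for novelty detection.
import numpy as np

def parzen_novelty(X, gamma=0.5, frac=0.05):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)
    p = K.mean(axis=1)                       # p(x_i) = 1/m sum_j k(x_i, x_j)
    threshold = np.quantile(p, frac)
    return p <= threshold                    # True = low-density (novel) point

X = np.random.default_rng(0).normal(size=(500, 2))
print(parzen_novelty(X).sum())               # roughly 5% of the points
```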

slide-48
SLIDE 48

Order Statistics of Densities

slide-49
SLIDE 49

Typical Data

slide-50
SLIDE 50

Outliers

slide-51
SLIDE 51

A better way

Problems: we do not care about estimating the density properly in regions of high density (a waste of capacity); we only care about the relative density for thresholding purposes; and we want to eliminate a certain fraction of observations and tune our estimator specifically for this fraction.

Solution: areas of low density can be approximated as the level set of an auxiliary function. There is no need to estimate p(x) directly; use a proxy of p(x) instead. Specifically: find f(x) such that x is novel if f(x) ≤ c where c is some constant, i.e. f(x) describes the amount of novelty.

slide-52
SLIDE 52

Problems with density estimation

Maximum a posteriori:

$$\underset{\theta}{\text{minimize}} \quad \sum_{i=1}^{m}\left[g(\theta) - \langle \phi(x_i), \theta \rangle\right] + \frac{1}{2\sigma^2}\|\theta\|^2$$

Advantages: convex optimization problem; concentration of measure.

Problems: the normalization $g(\theta)$ may be painful to compute; for density estimation we need no normalized $p(x|\theta)$; and there is no need to perform particularly well in high-density regions.

slide-53
SLIDE 53

Thresholding

slide-54
SLIDE 54

Optimization Problem

MAP:

$$\underset{\theta}{\text{minimize}} \quad -\sum_{i=1}^{m}\log p(x_i \mid \theta) + \frac{1}{2\sigma^2}\|\theta\|^2$$

Novelty:

$$\underset{\theta}{\text{minimize}} \quad \sum_{i=1}^{m}\max\left(-\log\frac{p(x_i \mid \theta)}{\exp(\rho - g(\theta))},\ 0\right) + \frac{1}{2}\|\theta\|^2 = \sum_{i=1}^{m}\max\left(\rho - \langle \phi(x_i), \theta \rangle,\ 0\right) + \frac{1}{2}\|\theta\|^2$$

Advantages: no normalization $g(\theta)$ needed; no need to perform particularly well in high-density regions (the estimator focuses on low-density regions); a quadratic program.

slide-55
SLIDE 55

Maximum Distance Hyperplane

Idea: find a hyperplane, given by $f(x) = \langle w, x \rangle + b = 0$, that has maximum distance from the origin yet is still closer to the origin than the observations.

Hard margin:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad \langle w, x_i \rangle \ge 1$$

Soft margin:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad \langle w, x_i \rangle \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0$$

slide-56
SLIDE 56

Optimization Problem

Primal problem:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad \langle w, x_i \rangle - 1 + \xi_i \ge 0 \ \text{ and } \ \xi_i \ge 0$$

Lagrange function L: subtract the constraints, multiplied by Lagrange multipliers ($\alpha_i$ and $\eta_i$), from the primal objective function. The Lagrange function L has a saddle point at the optimum.

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\left(\langle w, x_i \rangle - 1 + \xi_i\right) - \sum_{i=1}^{m}\eta_i\xi_i \quad \text{subject to} \quad \alpha_i, \eta_i \ge 0$$

slide-57
SLIDE 57

Dual Problem

Optimality conditions:

$$\partial_w L = w - \sum_{i=1}^{m}\alpha_i x_i = 0 \ \Longrightarrow\ w = \sum_{i=1}^{m}\alpha_i x_i \qquad \partial_{\xi_i} L = C - \alpha_i - \eta_i = 0 \ \Longrightarrow\ \alpha_i \in [0, C]$$

Now substitute the optimality conditions back into L.

Dual problem:

$$\text{minimize} \quad \frac{1}{2}\sum_{i, j=1}^{m}\alpha_i\alpha_j\langle x_i, x_j \rangle - \sum_{i=1}^{m}\alpha_i \quad \text{subject to} \quad \alpha_i \in [0, C]$$

All this is only possible due to the convexity of the primal problem.

slide-58
SLIDE 58

Minimum enclosing ball

  • Observations on the surface of the ball
  • Find the minimum enclosing ball
  • Equivalent to the single-class SVM

(Figure: enclosing ball of radius R around the data, with margin ρ/||w||.)

slide-59
SLIDE 59

Adaptive thresholds

Problem: depending on C, the number of novel points will vary. We would like to specify the fraction ν beforehand.

Solution: use the hyperplane separating the data from the origin, $H := \{x \mid \langle w, x \rangle = \rho\}$, where the threshold ρ is adaptive.

Intuition: let the hyperplane shift by shifting ρ; adjust it such that the ‘right’ number of observations is considered novel; do this automatically.

slide-60
SLIDE 60

Optimization Problem

Primal problem:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m}\xi_i - m\nu\rho \quad \text{where} \quad \langle w, x_i \rangle \ge \rho - \xi_i \ \text{ and } \ \xi_i \ge 0$$

Dual problem:

$$\text{minimize} \quad \frac{1}{2}\sum_{i, j=1}^{m}\alpha_i\alpha_j\langle x_i, x_j \rangle \quad \text{where} \quad \alpha_i \in [0, 1] \ \text{ and } \ \sum_{i=1}^{m}\alpha_i = \nu m$$

This is similar to the SV classification problem and can be solved with standard solvers.

slide-61
SLIDE 61

The ν-property theorem

  • Optimization problem

$$\underset{w}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m}\xi_i - m\nu\rho \quad \text{subject to} \quad \langle w, x_i \rangle \ge \rho - \xi_i \ \text{ and } \ \xi_i \ge 0$$

  • The solution satisfies:
  • At most a fraction ν of the points are novel
  • At most a fraction (1 − ν) of the points aren’t novel
  • The fraction of points on the boundary vanishes for large m (for non-pathological kernels)
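In practice this ν-parameterized single-class problem is available off the shelf; a hedged sketch with scikit-learn's OneClassSVM (the library and the parameter values are my own choices, not from the slides):

```python
# Sketch: single-class SVM with the nu-trick; nu bounds the novel fraction.
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.default_rng(0).normal(size=(500, 2))
oc = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.5).fit(X)
labels = oc.predict(X)                 # +1 = typical, -1 = novel
print((labels == -1).mean())           # close to (and at most about) nu = 0.1
```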

slide-62
SLIDE 62

Proof

  • Move the boundary at optimality.
  • For a smaller threshold, the m⁻ points on the wrong side of the margin contribute $\delta(m_- - \nu m) \le 0$.
  • For a larger threshold, the m⁺ points not on the ‘good’ side of the margin yield $\delta(m_+ - \nu m) \ge 0$.
  • Combining the inequalities: $\frac{m_-}{m} \le \nu \le \frac{m_+}{m}$.
  • The margin is a set of measure 0.

slide-63
SLIDE 63

Toy example

ν, width c:      0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1
frac. SVs/OLs:   0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38
margin ρ/||w||:  0.84 | 0.70 | 0.62 | 0.48

(threshold and smoothness requirements)

slide-64
SLIDE 64

Novelty detection for OCR

Better estimates since we only optimize in low density regions. Specifically tuned for small number of outliers. Only estimates of a level-set. For ν = 1 we get the Parzen-windows estimator back.

slide-65
SLIDE 65

Classification with the ν-trick

changing kernel width and threshold

slide-66
SLIDE 66

Structured Estimation (preview)

slide-67
SLIDE 67

Large Margin Condition

  • Binary classifier: correct class chosen with large margin y f(x)
  • Multiple classes
  • Score function per class f(x, y)
  • Want the correct class to have a much larger score than any incorrect class:

$$f(x, y) - f(x, y') \ge 1 \quad \text{for all } y' \ne y$$

  • Structured loss function (e.g. coal & diamonds): require a loss-dependent margin

$$f(x, y) - f(x, y') \ge \Delta(y, y') \quad \text{for all } y' \ne y$$

slide-68
SLIDE 68

Large Margin Classifiers

  • Large margin without rescaling (convex) (Guestrin, Taskar, Koller):

$$l(x, y, f) = \sup_{y' \in Y}\left[f(x, y') - f(x, y) + \Delta(y, y')\right]$$

  • Large margin with rescaling (convex) (Tsochantaridis, Hofmann, Joachims, Altun):

$$l(x, y, f) = \sup_{y' \in Y}\left[f(x, y') - f(x, y) + 1\right]\Delta(y, y')$$

  • Both losses majorize the misclassification loss $\Delta\!\left(y,\ \operatorname{argmax}_{y'} f(x, y')\right)$
  • Proof by plugging the argmax into the definition

slide-69
SLIDE 69

Many applications

  • Ranking (DCG, NDCG)
  • Graph matching (linear assignment)
  • ROC and Fβ scores
  • Sequence annotation (named entities, activity)
  • Segmentation
  • Natural Language Translation
  • Image annotation / scene understanding
  • Caution - this loss is generally not consistent!
slide-70
SLIDE 70

Extensions

  • Invariances
  • Add prior knowledge (e.g. in OCR)
  • Make estimates robust against malicious abuse (e.g. spam filtering)

  • Tighter upper bounds
  • Convex bound can be very loose
  • Overweights noisy data
  • Structured version of ramp loss
  • Can be shown to be consistent
slide-71
SLIDE 71

More Kernel Algorithms

slide-72
SLIDE 72

Kernel PCA

slide-73
SLIDE 73

Principal Component Analysis

  • Gaussian density model

$$p(x; \mu, \Sigma) = (2\pi)^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right)$$

  • Estimate variance by the empirical average

$$\hat{\Sigma} = \frac{1}{m}\sum_{i=1}^{m}x_i x_i^\top - \hat{\mu}\hat{\mu}^\top \quad \text{where} \quad \hat{\mu} = \frac{1}{m}\sum_{i=1}^{m}x_i$$

  • Good approximation by a low-rank model
  • Extract leading eigenvalues of the covariance
  • Data might lie in a subspace

slide-74
SLIDE 74

Principal Component Analysis

  • Generative approximation of the data

$$x = \sum_i \sigma_i v_i \alpha_i \quad \text{where} \quad \alpha_i \sim \mathcal{N}(0, 1)$$

  • Heuristic: a good explanation of the data implies that we have meaningful dimensions of the data
  • Linear feature extraction: $g_i(x) = \langle v_i, x \rangle$
  • PCA is the reconstruction with smallest $l_2$ error

slide-75
SLIDE 75

http://www.plantsciences.ucdavis.edu/gepts/pb143/LEC17/pq0921251003.gif

good for exploratory data analysis

slide-76
SLIDE 76

Kernel PCA

(Figure: linear PCA in $\mathbb{R}^2$ with $k(x, x') = \langle x, x' \rangle$ versus kernel PCA after the feature map $\Phi: \mathbb{R}^2 \to H$, e.g. with $k(x, x') = \langle x, x' \rangle^d$.)

slide-77
SLIDE 77

PCA via inner products

  • Eigenvector condition: $\Sigma v = \lambda v$, i.e.

$$\frac{1}{m}\sum_i \bar{x}_i\bar{x}_i^\top v = \lambda v \quad \text{for} \quad \bar{x}_i = x_i - \frac{1}{m}\sum_j x_j, \quad \text{hence} \quad v = \sum_j \alpha_j\bar{x}_j$$

  • Kernel PCA: taking inner products with $\bar{x}_l$,

$$\bar{x}_l^\top\,\frac{1}{m}\sum_i \bar{x}_i\bar{x}_i^\top v = \lambda\,\bar{x}_l^\top v$$

yields

$$\frac{1}{m}\bar{K}\bar{K}\alpha = \lambda\bar{K}\alpha \quad \Longrightarrow \quad \frac{1}{m}\bar{K}\alpha = \lambda\alpha \quad \text{where} \quad \bar{K}_{ij} = \langle \bar{x}_i, \bar{x}_j \rangle$$
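A minimal NumPy sketch of kernel PCA following these equations: center the kernel matrix, take its leading eigenvectors, and rescale them so the corresponding feature-space directions have unit norm. The kernel choice and width are illustrative.

```python
# Sketch: kernel PCA via the centered kernel matrix.
import numpy as np

def kernel_pca(K, n_components=2):
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kbar = K - one @ K - K @ one + one @ K @ one        # centering in feature space
    lam, alpha = np.linalg.eigh(Kbar / m)               # (1/m) Kbar alpha = lambda alpha
    idx = np.argsort(lam)[::-1][:n_components]
    lam, alpha = lam[idx], alpha[:, idx]
    alpha = alpha / np.sqrt(np.maximum(lam * m, 1e-12)) # unit-norm eigenfunctions
    return Kbar @ alpha                                 # projections of the training data

X = np.random.default_rng(0).normal(size=(100, 3))
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
Z = kernel_pca(np.exp(-0.5 * d2), n_components=2)
print(Z.shape)                                          # (100, 2)
```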

slide-78
SLIDE 78

Two dimensional feature extraction

Noisy parabola, polynomial kernels of increasing order (order 1 is PCA).

(Figure: contour plots of the leading kernel PCA features, with the corresponding eigenvalue shown per panel.)

slide-79
SLIDE 79

Feature extraction

(Figure: the leading extracted features, with eigenvalues ranging from 0.251 down to 0.002.)

slide-80
SLIDE 80

Mean Classifier

slide-81
SLIDE 81

‘Trivial’ classifier

  • Represent each class by its mean in feature space
  • Classify along the direction of maximum discrepancy between the classes
  • Trivial to ‘train’

(Figure: class means c+ and c−, a test point x, and the direction w = c+ − c−.)

slide-82
SLIDE 82

‘Trivial’ classifier

  • Class means

$$\mu_+ = \frac{1}{m_+}\sum_{i: y_i = 1}\phi(x_i) \quad \text{and} \quad \mu_- = \frac{1}{m_-}\sum_{i: y_i = -1}\phi(x_i)$$

  • Classifier (much like the Watson-Nadaraya estimator)

$$f(x) = \langle \mu_+ - \mu_-,\ \phi(x) \rangle = \sum_i \frac{y_i}{m_{y_i}}\,k(x_i, x)$$
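A minimal sketch of this classifier, assuming a precomputed kernel matrix between test and training points (NumPy only):

```python
# Sketch: mean classifier in feature space, f(x) = sum_i y_i / m_{y_i} * k(x_i, x).
import numpy as np

def mean_classifier(K_test_train, y_train):
    m_pos = (y_train == 1).sum()
    m_neg = (y_train == -1).sum()
    weights = np.where(y_train == 1, 1.0 / m_pos, -1.0 / m_neg)
    return K_test_train @ weights          # take the sign for the predicted class
```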

slide-83
SLIDE 83

More kernel methods

  • Canonical Correlation analysis
  • Two sample test
  • Mean in feature space is sufficient to fully

represent a distribution

  • Compare them by computing distance
  • Independence test
  • Compare joint and product of marginals
  • Structured feature extraction
  • Find directions of high significance and low

function complexity

slide-84
SLIDE 84

Conditional Models

slide-85
SLIDE 85

Gaussian Processes

slide-86
SLIDE 86

Weight & height

slide-87
SLIDE 87

Weight & height

assume Gaussian correlation

slide-88
SLIDE 88

$$p(\text{weight} \mid \text{height}) = \frac{p(\text{height}, \text{weight})}{p(\text{height})} \propto p(\text{height}, \text{weight})$$

slide-89
SLIDE 89

$$p(x_2 \mid x_1) \propto \exp\left(-\frac{1}{2}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}^{-1}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}\right)$$

keep linear and quadratic terms of exponent

slide-90
SLIDE 90

The gory math

Correlated observations: assume that the random variables $t \in \mathbb{R}^n$, $t' \in \mathbb{R}^{n'}$ are jointly normal with mean $(\mu, \mu')$ and covariance matrix K:

$$p(t, t') \propto \exp\left(-\frac{1}{2}\begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}^\top \begin{bmatrix} K_{tt} & K_{tt'} \\ K_{tt'}^\top & K_{t't'} \end{bmatrix}^{-1}\begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}\right)$$

Inference: given t, estimate t' via $p(t' \mid t)$. Translation into machine learning language: we learn t' from t.

Practical solution: since $t' \mid t \sim \mathcal{N}(\tilde{\mu}, \tilde{K})$, we only need to collect all terms in $p(t, t')$ depending on t', by matrix inversion, hence

$$\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1}K_{tt'} \quad \text{and} \quad \tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1}(t - \mu)$$
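A minimal NumPy sketch of these two formulas, conditioning a joint Gaussian on the observed block:

```python
# Sketch: conditional mean and covariance of t' given t for a joint Gaussian.
import numpy as np

def condition_gaussian(mu_t, mu_tp, K_tt, K_ttp, K_tptp, t_obs):
    mu_cond = mu_tp + K_ttp.T @ np.linalg.solve(K_tt, t_obs - mu_t)
    K_cond = K_tptp - K_ttp.T @ np.linalg.solve(K_tt, K_ttp)
    return mu_cond, K_cond
```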

slide-91
SLIDE 91

Gaussian Process

Key idea: instead of a fixed set of random variables t, t' we assume a stochastic process $t: X \to \mathbb{R}$, e.g. $X = \mathbb{R}^n$. Previously we had X = {age, height, weight, ...}.

Definition of a Gaussian process: a stochastic process $t: X \to \mathbb{R}$ where all $(t(x_1), \ldots, t(x_m))$ are jointly normally distributed.

Parameters of a GP: mean $\mu(x) := \mathbf{E}[t(x)]$ and covariance function $k(x, x') := \mathrm{Cov}(t(x), t(x'))$.

Simplifying assumption: we assume knowledge of $k(x, x')$ and set $\mu = 0$.

slide-92
SLIDE 92

Kernels ...

Covariance function: a function of two arguments; leads to a matrix with nonnegative eigenvalues; describes the correlation between pairs of observations.

Kernel: a function of two arguments; leads to a matrix with nonnegative eigenvalues; a similarity measure between pairs of observations.

Lucky guess: we suspect that kernels and covariance functions are the same ...

slide-93
SLIDE 93

The connection

Gaussian process on parameters: $t \sim \mathcal{N}(\mu, K)$ where $K_{ij} = k(x_i, x_j)$.

Linear model in feature space: $t(x) = \langle \Phi(x), w \rangle + \mu(x)$ where $w \sim \mathcal{N}(0, \mathbf{1})$. The covariance between t(x) and t(x') is then given by

$$\mathbf{E}_w\left[\langle \Phi(x), w \rangle\langle w, \Phi(x') \rangle\right] = \langle \Phi(x), \Phi(x') \rangle = k(x, x')$$

Conclusion: a small weight vector in “feature space”, as commonly used in SVMs, amounts to observing t with high p(t). The log prior $\log p(t)$ corresponds to the margin term $\|w\|^2$. We will get back to this later.

slide-94
SLIDE 94

Regression

slide-95
SLIDE 95

Joint Gaussian Model

  • Random variables (t, t’) are drawn from a GP
  • Observe a subset t of them
  • Predict the rest using

$$\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1}K_{tt'} \quad \text{and} \quad \tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1}(t - \mu)$$

  • Linear expansion (precompute things)
  • Predictive uncertainty is data independent (good for experimental design)
  • Predictive variance vanishes if K is rank deficient

slide-96
SLIDE 96

Some kernels

Observation: any function k leading to a symmetric matrix with nonnegative eigenvalues is a valid covariance function.

Necessary and sufficient condition (Mercer’s theorem): k needs to be a nonnegative integral kernel.

Examples of kernels k(x, x'):
Linear: $\langle x, x' \rangle$
Laplacian RBF: $\exp(-\lambda\|x - x'\|)$
Gaussian RBF: $\exp(-\lambda\|x - x'\|^2)$
Polynomial: $(\langle x, x' \rangle + c)^d$, $c \ge 0$, $d \in \mathbb{N}$
B-Spline: $B_{2n+1}(x - x')$
Cond. expectation: $\mathbf{E}_c[p(x|c)\,p(x'|c)]$

slide-97
SLIDE 97

Linear ‘GP regression’

Linear kernel: $k(x, x') = \langle x, x' \rangle$, kernel matrix $X^\top X$.

Mean and covariance:

$$\tilde{K} = X'^\top X' - X'^\top X(X^\top X)^{-1}X^\top X' = X'^\top(\mathbf{1} - P_X)X' \qquad \tilde{\mu} = X'^\top\left[X(X^\top X)^{-1}t\right]$$

$\tilde{\mu}$ is a linear function of X'.

Problem: the covariance matrix $X^\top X$ has at most rank n. After n observations ($x \in \mathbb{R}^n$) the variance vanishes. This is not realistic (“flat pancake” or “cigar” distribution).

slide-98
SLIDE 98

Degenerate Covariance

slide-99
SLIDE 99

Additive Noise

Indirect model: instead of observing t(x) we observe $y = t(x) + \xi$, where ξ is a nuisance term. This yields

$$p(Y \mid X) = \int \prod_{i=1}^{m} p(y_i \mid t_i)\ p(t \mid X)\,dt$$

where we can now find a maximum a posteriori solution for t by maximizing the integrand (we will use this later).

Additive normal noise: if $\xi \sim \mathcal{N}(0, \sigma^2)$ then y is the sum of two Gaussian random variables, so means and variances add up: $y \sim \mathcal{N}(\mu, K + \sigma^2\mathbf{1})$.

slide-100
SLIDE 100

Data

slide-101
SLIDE 101

Predictive mean: $k(x, X)^\top\left(K(X, X) + \sigma^2\mathbf{1}\right)^{-1}y$

slide-102
SLIDE 102

Variance

slide-103
SLIDE 103

Putting it all together

slide-104
SLIDE 104

Putting it all together

slide-105
SLIDE 105

Ugly details

Covariance matrices: with additive noise, $K = K_{\text{kernel}} + \sigma^2\mathbf{1}$.

Predictive mean and variance:

$$\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1}K_{tt'} \quad \text{and} \quad \tilde{\mu} = K_{tt'}^\top K_{tt}^{-1}t$$

Pointwise prediction:

$$K_{tt} = K + \sigma^2\mathbf{1}, \qquad K_{t't'} = k(x, x) + \sigma^2, \qquad K_{tt'} = (k(x_1, x), \ldots, k(x_m, x))$$

Plug this into the mean and covariance equations.
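Putting the pointwise formulas together, a minimal NumPy sketch of GP regression with additive noise; the Gaussian covariance function and its width are illustrative choices.

```python
# Sketch: GP regression predictive mean and variance with additive noise.
import numpy as np

def gp_predict(X, y, X_star, gamma=0.5, sigma2=0.1):
    def k(A, B):
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)

    K = k(X, X) + sigma2 * np.eye(len(X))            # K_tt = K + sigma^2 * 1
    K_star = k(X, X_star)                            # K_tt'
    mean = K_star.T @ np.linalg.solve(K, y)          # k(x,X)^T (K + sigma^2 1)^{-1} y
    var = (k(X_star, X_star).diagonal() + sigma2
           - np.einsum("ij,ij->j", K_star, np.linalg.solve(K, K_star)))
    return mean, var
```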

slide-106
SLIDE 106

Gaussian Process Conditional Models

slide-107
SLIDE 107

Exponential Families

slide-108
SLIDE 108

Exponential Families

  • Density function

$$p(x; \theta) = \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right) \quad \text{where} \quad g(\theta) = \log\sum_{x'}\exp\left(\langle \phi(x'), \theta \rangle\right)$$

slide-109
SLIDE 109

Exponential Families

  • Density function
  • Log partition function generates cumulants

$$p(x; \theta) = \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right) \quad \text{where} \quad g(\theta) = \log\sum_{x'}\exp\left(\langle \phi(x'), \theta \rangle\right)$$

$$\partial_\theta g(\theta) = \mathbf{E}\left[\phi(x)\right] \qquad \partial_\theta^2 g(\theta) = \mathrm{Var}\left[\phi(x)\right]$$

slide-110
SLIDE 110

Exponential Families

  • Density function
  • Log partition function generates cumulants
  • g is convex (second derivative is p.s.d.)

$$p(x; \theta) = \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right) \quad \text{where} \quad g(\theta) = \log\sum_{x'}\exp\left(\langle \phi(x'), \theta \rangle\right)$$

$$\partial_\theta g(\theta) = \mathbf{E}\left[\phi(x)\right] \qquad \partial_\theta^2 g(\theta) = \mathrm{Var}\left[\phi(x)\right]$$

slide-111
SLIDE 111

Conditional Exponential Families

$$p(y \mid x; \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\right) \quad \text{where} \quad g(\theta \mid x) = \log\sum_{y'}\exp\left(\langle \phi(x, y'), \theta \rangle\right)$$

$$\partial_\theta g(\theta \mid x) = \mathbf{E}\left[\phi(x, y) \mid x\right] \qquad \partial_\theta^2 g(\theta \mid x) = \mathrm{Var}\left[\phi(x, y) \mid x\right]$$

slide-112
SLIDE 112

Conditional Exponential Families

  • Density function

$$p(y \mid x; \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\right) \quad \text{where} \quad g(\theta \mid x) = \log\sum_{y'}\exp\left(\langle \phi(x, y'), \theta \rangle\right)$$

$$\partial_\theta g(\theta \mid x) = \mathbf{E}\left[\phi(x, y) \mid x\right] \qquad \partial_\theta^2 g(\theta \mid x) = \mathrm{Var}\left[\phi(x, y) \mid x\right]$$

slide-113
SLIDE 113

Conditional Exponential Families

  • Density function
  • Log partition function generates cumulants

$$p(y \mid x; \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\right) \quad \text{where} \quad g(\theta \mid x) = \log\sum_{y'}\exp\left(\langle \phi(x, y'), \theta \rangle\right)$$

$$\partial_\theta g(\theta \mid x) = \mathbf{E}\left[\phi(x, y) \mid x\right] \qquad \partial_\theta^2 g(\theta \mid x) = \mathrm{Var}\left[\phi(x, y) \mid x\right]$$

slide-114
SLIDE 114

Conditional Exponential Families

  • Density function
  • Log partition function generates cumulants
  • g is convex (second derivative is p.s.d.)

$$p(y \mid x; \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\right) \quad \text{where} \quad g(\theta \mid x) = \log\sum_{y'}\exp\left(\langle \phi(x, y'), \theta \rangle\right)$$

$$\partial_\theta g(\theta \mid x) = \mathbf{E}\left[\phi(x, y) \mid x\right] \qquad \partial_\theta^2 g(\theta \mid x) = \mathrm{Var}\left[\phi(x, y) \mid x\right]$$

slide-115
SLIDE 115

Key Idea

  • Gaussian Process indexed by (x,y)
  • Binary y yields classification
  • Set for y yields multiclass
  • Integer y yields Poisson regression
  • Scalar y yields heteroscedastic regression
  • Sequence for y yields CRF
  • ... and lots more ...
  • The GP is in the latent variables

(Regression is special case where we can integrate)

slide-116
SLIDE 116

Conditional GP Model

  • Data likelihood

$$p(y \mid x, t(x)) := e^{t(x, y) - g(t(x))} \quad \text{where} \quad g(t(x)) = \log\sum_{y}e^{t(x, y)}$$

  • Prior: $t \sim \mathcal{N}(\mu, K)$
  • Posterior distribution

$$p(t \mid X, Y) \propto \exp\left(\sum_i\left[t(x_i, y_i) - g(t(x_i))\right] - \frac{1}{2}t^\top K^{-1}t\right)$$

  • Maximize with respect to t for the MAP estimate

slide-117
SLIDE 117

Logistic Regression

slide-118
SLIDE 118

Binomial Model

  • Binary label space {-1, 1}
  • We can center t(x, y) as y t(x) (a constant offset doesn’t change the model)
  • Log-likelihood

$$-\log p(y \mid t) = \log\left[e^{t} + e^{-t}\right] - yt = \log\left[1 + e^{-2yt}\right]$$

  • After rescaling by 2 this is the logistic loss
  • MAP estimation problem

$$\underset{t}{\text{minimize}} \quad \frac{1}{2}t^\top K^{-1}t + \sum_{i=1}^{m}\log\left[1 + e^{-y_i t_i}\right]$$
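A minimal sketch of this MAP problem: writing t = Kα turns the objective into a smooth function of α, which plain gradient descent can handle. The step size and iteration count are illustrative, and no numerical safeguards are included.

```python
# Sketch: MAP estimate for GP (kernel) logistic regression via gradient descent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kernel_logreg_map(K, y, lr=0.05, iters=500):
    alpha = np.zeros(len(y))
    for _ in range(iters):
        t = K @ alpha
        g = -y * sigmoid(-y * t)           # d/dt of log(1 + exp(-y t))
        grad = K @ (alpha + g)             # gradient of 1/2 a^T K a + sum_i loss(t_i)
        alpha -= lr * grad
    return alpha                           # predict with sign(K_test_train @ alpha)
```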

slide-119
SLIDE 119

More loss functions

  • Logistic: $\log\left[1 + e^{-f(x)}\right]$
  • Huberized loss:

$$l(f(x)) = \begin{cases} 0 & \text{if } f(x) > 1 \\ \frac{1}{2}(1 - f(x))^2 & \text{if } f(x) \in [0, 1] \\ \frac{1}{2} - f(x) & \text{if } f(x) < 0 \end{cases}$$

  • Soft margin: $\max(0, 1 - f(x))$

All of these are (asymptotically) linear for large negative margins and (asymptotically) 0 for large positive margins.

slide-120
SLIDE 120

Clean Data

slide-121
SLIDE 121

Noisy Data

slide-122
SLIDE 122

Heteroscedastic Estimation

slide-123
SLIDE 123

Motivation

  • GP Regression has variance estimate

independent of observed data

  • Assumes that we know variance globally

beforehand

  • This is nonsense!
  • Estimate mean and variance jointly
  • Easily possible in an exponential family model

Le, Canu, Smola, 2005

slide-124
SLIDE 124

Recall - Normal distributions

Engineer’s favorite:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) \quad \text{where} \quad x \in \mathbb{R} =: X$$

Massaging the math:

$$p(x) = \exp\Big(\big\langle \underbrace{(x,\ -\tfrac{1}{2}x^2)}_{\phi(x)},\ \theta\big\rangle - \underbrace{\Big(\frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\Big)}_{g(\theta)}\Big)$$

Using the substitution $\theta_1 := \mu\sigma^{-2}$ and $\theta_2 := \sigma^{-2}$ yields

$$g(\theta) = \frac{1}{2}\left[\theta_1^2\,\theta_2^{-1} + \log 2\pi - \log\theta_2\right]$$

slide-125
SLIDE 125

Basic Idea

Sufficient statistic: we pick $\phi(x, y) = (y\,\phi_1(x),\ y^2\,\phi_2(x))$, that is

$$k((x, y), (x', y')) = k_1(x, x')\,y y' + k_2(x, x')\,y^2 y'^2 \quad \text{where} \quad y, y' \in \mathbb{R}$$

Hence we estimate mean and variance simultaneously.

Optimization problem (writing the natural parameters as kernel expansions $t_1(x_i) = \sum_j \alpha_{1j}k_1(x_i, x_j)$ and $t_2(x_i) = \sum_j \alpha_{2j}k_2(x_i, x_j)$):

$$\text{minimize} \quad \sum_{i=1}^{m}\left[-\frac{1}{4}\,t_1(x_i)^\top t_2(x_i)^{-1}t_1(x_i) - \frac{1}{2}\log\det\left(-2\,t_2(x_i)\right) - y_i^\top t_1(x_i) - y_i^\top t_2(x_i)\,y_i\right] + \frac{1}{2\sigma^2}\sum_{i, j}\left[\alpha_{1i}^\top\alpha_{1j}\,k_1(x_i, x_j) + \operatorname{tr}\left(\alpha_{2i}\alpha_{2j}^\top\right)k_2(x_i, x_j)\right]$$

$$\text{subject to} \quad \sum_{i=1}^{m}\alpha_{2i}\,k_2(x_i, x_j) \preceq 0 \ \text{ for all } j$$

The problem is convex. The log-determinant from the normalization of the Gaussian acts as a barrier function, i.e. a nice SDP.

slide-126
SLIDE 126
slide-127
SLIDE 127
slide-128
SLIDE 128

Computational Issues

Newton method with CG solver: use Newton’s method to compute the update direction, with a CG solver instead of inverting the Hessian.
Lazy evaluation: never build the explicit Hessian.
Reduced rank: use an incomplete Cholesky factorization for a low-rank approximation.

Result (runtime versus sample size m):

m:               100   200   500   1k    2k    5k    10k   20k
Direct Hessian:  8     18    90    607   3551
Hessian vector:  9     15    38    115   752
Reduced rank:    7     7     12    30    54    179   368   727

This yields scaling of O(m^2.1), O(m^1.4), and O(m^0.95), respectively.

slide-129
SLIDE 129

Standard GP

slide-130
SLIDE 130

Heteroscedastic GP mean

slide-131
SLIDE 131

Heteroscedastic GP variance

slide-132
SLIDE 132
  • Kernel trick
  • Simple kernels
  • Kernel PCA
  • Mean Classifier
  • Support Vectors
  • Support Vector Machine classification
  • Regression
  • Logistic regression
  • Novelty detection
  • Gaussian Process Estimation
  • Regression
  • Classification
  • Heteroscedastic Regression

(Generalized) Linear Models

slide-133
SLIDE 133

Further reading

  • Ramp loss consistency

http://books.nips.cc/papers/files/nips24/NIPS2011_1222.pdf

  • Ranking and structured estimation

http://users.cecs.anu.edu.au/~chteo/pub/LeSmoChaTeo09.pdf

  • Invariances and convexity

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11755

  • Ramp loss for structured estimation

http://users.cecs.anu.edu.au/~chteo/pub/Chaetal09.pdf

  • Structured estimation (with margin rescaling)

http://ttic.uchicago.edu/~altun/pubs/AltHofTso06.pdf

  • Structured estimation (without margin rescaling)

http://www.seas.upenn.edu/~taskar/pubs/icml05.pdf

  • Ben Taskar’s tutorial

http://www.seas.upenn.edu/~taskar/nips07tut/nips07tut.ppt

slide-134
SLIDE 134

Further reading

  • SVM Tutorial (regression)

http://alex.smola.org/papers/2003/SmoSch03b.pdf

  • SVM Tutorial (classification)

http://www.umiacs.umd.edu/~joseph/support-vector-machines4.pdf

  • Introductory chapter of Kernel book

http://alex.smola.org/teaching/berkeley2012/slides/lwk_chapter1.pdf

  • Introductory chapter of structured estimation book

http://alex.smola.org/teaching/berkeley2012/slides/se_chapter2.pdf

  • Kernel PCA

http://dl.acm.org/citation.cfm?id=295919.295960