SLIDE 1

Introduction to Machine Learning
6. Kernel Methods

Alex Smola, Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701 (course 10-701)

SLIDE 2

Regression

SLIDE 3

Regression Estimation

  • Find a function f minimizing the regression error
  • Compute the empirical average
  • Overfitting occurs as we minimize the empirical error
  • Add regularization for capacity control

R[f] := E_{(x,y)∼p(x,y)} [l(y, f(x))]

R_emp[f] := (1/m) Σ_{i=1}^m l(y_i, f(x_i))

R_reg[f] := (1/m) Σ_{i=1}^m l(y_i, f(x_i)) + λ Ω[f]

SLIDE 4

Squared loss

l(y, f(x)) = (1/2)(y − f(x))²

SLIDE 5

l1 loss

l(y, f(x)) = |y − f(x)|

SLIDE 6

ε-insensitive Loss

l(y, f(x)) = max(0, |y − f(x)| − ε)

SLIDE 7

Penalized least mean squares

  • Optimization problem:

minimize_w (1/(2m)) Σ_{i=1}^m (y_i − ⟨x_i, w⟩)² + (λ/2) ‖w‖²

  • Solution:

∂_w [...] = (1/m) Σ_{i=1}^m [x_i x_i^⊤ w − x_i y_i] + λw = [(1/m) X X^⊤ + λ1] w − (1/m) X y = 0

hence w = [X X^⊤ + λm 1]^{−1} X y

X X^⊤ is the outer product matrix in X; solve the resulting linear system via conjugate gradient or the Sherman-Morrison-Woodbury identity.
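A minimal NumPy sketch of this closed form (the data layout, with points as the columns of X, and all names are illustrative, not from the slides):

```python
import numpy as np

def ridge(X, y, lam):
    """w = (X X^T + lam*m*I)^{-1} X y, for X of shape (d, m) with points as columns."""
    d, m = X.shape
    return np.linalg.solve(X @ X.T + lam * m * np.eye(d), X @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))                         # d = 3 features, m = 50 points
y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge(X, y, lam=0.01))                         # close to (1, -2, 0.5)
```

np.linalg.solve performs a direct solve; for large d one would switch to conjugate gradient, or to Sherman-Morrison-Woodbury when m ≪ d, as the slide suggests.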

SLIDE 8

Penalized least mean squares ... now with kernels

  • Optimization problem
  • Representer Theorem (Kimeldorf & Wahba, 1971)

minimize_w (1/(2m)) Σ_{i=1}^m (y_i − ⟨φ(x_i), w⟩)² + (λ/2) ‖w‖²

Decompose w = w∥ + w⊥ into a component in the span of the data and an orthogonal one; then

‖w‖² = ‖w∥‖² + ‖w⊥‖²

and only w∥ enters the empirical risk.

SLIDE 9

Penalized least mean squares ... now with kernels

  • Optimization problem
  • Representer Theorem (Kimeldorf & Wahba, 1971): the optimal solution lies in the span of the data, w = Σ_i α_i φ(x_i)
  • Proof: the risk term depends on the data only via ⟨φ(x_i), w⟩, and regularization ensures that the orthogonal part is 0
  • Optimization problem in terms of α:

minimize_α (1/(2m)) Σ_{i=1}^m (y_i − Σ_j K_ij α_j)² + (λ/2) Σ_{i,j} α_i α_j K_ij

  • Solve for α as a linear system: α = (K + mλ 1)^{−1} y
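The kernelized solution is just as short; a sketch assuming a Gaussian RBF kernel (the kernel choice, bandwidth, and names are illustrative):

```python
import numpy as np

def rbf(A, B, sigma=0.5):
    """Gaussian RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kernel_ridge_fit(X, y, lam):
    """alpha = (K + m*lam*I)^{-1} y, exactly as on the slide."""
    m = len(X)
    return np.linalg.solve(rbf(X, X) + m * lam * np.eye(m), y)

X = np.linspace(-3, 3, 40)[:, None]
y = np.sinc(X[:, 0])                                 # the sinc target used in later slides
alpha = kernel_ridge_fit(X, y, lam=1e-3)
f0 = rbf(np.array([[0.0]]), X) @ alpha               # f(x) = sum_i alpha_i k(x_i, x)
print(f0)                                            # close to sinc(0) = 1
```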

SLIDE 10

SVM Regression (ε-insensitive loss)

[Figure: the ε-insensitive tube around f(x). Points inside the tube (within ±ε of the fit) incur no loss; a point outside the tube incurs slack ξ proportional to its distance y − f(x) from the tube.]

don’t care about deviations within the tube

SLIDE 11

SVM Regression (ε-insensitive loss)

  • Optimization Problem (as constrained QP)
  • Lagrange Function

minimize_{w,b} (1/2)‖w‖² + C Σ_{i=1}^m [ξ_i + ξ_i*]
subject to ⟨w, x_i⟩ + b ≤ y_i + ε + ξ_i and ξ_i ≥ 0
and        ⟨w, x_i⟩ + b ≥ y_i − ε − ξ_i* and ξ_i* ≥ 0

L = (1/2)‖w‖² + C Σ_{i=1}^m [ξ_i + ξ_i*] − Σ_{i=1}^m [η_i ξ_i + η_i* ξ_i*] + Σ_{i=1}^m α_i [⟨w, x_i⟩ + b − y_i − ε − ξ_i] + Σ_{i=1}^m α_i* [y_i − ε − ξ_i* − ⟨w, x_i⟩ − b]

SLIDE 12

SVM Regression (ε-insensitive loss)

  • First order conditions
  • Dual problem

∂_w L = 0 = w + Σ_i [α_i − α_i*] x_i
∂_b L = 0 = Σ_i [α_i − α_i*]
∂_{ξ_i} L = 0 = C − η_i − α_i
∂_{ξ_i*} L = 0 = C − η_i* − α_i*

minimize_{α,α*} (1/2)(α − α*)^⊤ K (α − α*) + ε 1^⊤(α + α*) + y^⊤(α − α*)
subject to 1^⊤(α − α*) = 0 and α_i, α_i* ∈ [0, C]
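One rarely codes this QP by hand; as an illustration, scikit-learn's SVR solves an equivalent dual (hyperparameter values below are arbitrary):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200)[:, None]
y = np.sinc(X[:, 0]) + 0.1 * rng.normal(size=200)

# epsilon is the width of the insensitive tube, C the penalty on the slacks
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(svr.predict([[0.0]]), len(svr.support_))  # only points on/outside the tube become SVs
```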

SLIDE 13

Properties

  • Ignores ‘typical’ instances with small error
  • Only the upper or the lower bound is active at any time
  • A QP in 2n variables, as cheap as the SVM classification problem
  • Robust with respect to outliers
  • The l1 loss yields the same problem without ε
  • Huber’s robust loss yields a similar problem, but with an added quadratic penalty on the coefficients

SLIDE 14

Regression example

[Figure: SV regression of sinc x with ε = 0.1; the curves sinc x + 0.1 and sinc x − 0.1 bound the tube around the approximation.]

SLIDE 15

Regression example

[Figure: SV regression of sinc x with ε = 0.2; the curves sinc x + 0.2 and sinc x − 0.2 bound the tube around the approximation.]

SLIDE 16

Regression example

[Figure: SV regression of sinc x with ε = 0.5; the curves sinc x + 0.5 and sinc x − 0.5 bound the tube around the approximation.]

SLIDE 17

Regression example

[Figure: the support vectors for the three fits above; the wider the tube, the fewer support vectors.]

SLIDE 18

Huber’s robust loss

l(y, f(x)) = (1/2)(y − f(x))²   if |y − f(x)| < 1
l(y, f(x)) = |y − f(x)| − 1/2   otherwise

Quadratic for small deviations, linear for large ones; related to a trimmed mean estimator.
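A NumPy sketch of this loss, with the switching point generalized to a parameter σ (σ = 1 recovers the formula above):

```python
import numpy as np

def huber(r, sigma=1.0):
    """Huber's robust loss: quadratic for |r| < sigma, linear beyond,
    continuous at the joint."""
    a = np.abs(r)
    return np.where(a < sigma, 0.5 * a**2, sigma * a - 0.5 * sigma**2)

print(huber(np.array([0.2, 0.9, 5.0])))  # [0.02, 0.405, 4.5]
```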

SLIDE 19

Novelty Detection

SLIDE 20

Basic Idea

Data: observations x_i generated from some P(x), e.g. network usage patterns, handwritten digits, alarm sensors, factory status.

Task: find unusual events, clean the database, distinguish typical examples.

SLIDE 21

Applications

Network Intrusion Detection: detect whether someone is trying to hack the network, downloading tons of MP3s, or doing anything else unusual on the network.

Jet Engine Failure Detection: you can’t destroy jet engines just to see how they fail.

Database Cleaning: find out whether someone stored bogus information in a database (typos, etc.), mislabelled digits, ugly digits, bad photographs in an electronic album.

Fraud Detection: credit cards, telephone bills, medical records.

Self-calibrating alarm devices: car alarms (adjusting to where the car is parked), home alarms (furniture, temperature, windows, etc.).

SLIDE 22

Novelty Detection via Density Estimation

Key Idea: novel data is data that we don’t see frequently, so it must lie in low density regions.

Step 1: Estimate the density. Given observations x_1, ..., x_m, form a density estimate via Parzen windows.

Step 2: Threshold the density. Sort the data according to density and use it for rejection.

Practical implementation: compute p(x_i) = (1/m) Σ_j k(x_i, x_j) for all i and sort according to magnitude. Pick the smallest p(x_i) as novel points.
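The practical implementation is a few lines of NumPy; here is a sketch assuming a Gaussian Parzen window (bandwidth and data are illustrative):

```python
import numpy as np

def parzen_novelty_scores(X, sigma=1.0):
    """p(x_i) = (1/m) sum_j k(x_i, x_j) with a Gaussian Parzen window."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma**2))
    return K.mean(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(95, 2)),            # typical points
               rng.normal(6.0, 1.0, size=(5, 2))])  # a far-away cluster
p = parzen_novelty_scores(X)
print(np.argsort(p)[:5])  # the 5 lowest-density points are flagged as novel
```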

SLIDE 23

Order Statistics of Densities

SLIDE 24

Typical Data

SLIDE 25

Outliers

SLIDE 26

A better way

Problems: We do not care about estimating the density properly in regions of high density (a waste of capacity). We only care about the relative density for thresholding purposes. We want to eliminate a certain fraction of observations and tune our estimator specifically for this fraction.

Solution: Areas of low density can be approximated as the level set of an auxiliary function. There is no need to estimate p(x) directly; use a proxy of p(x). Specifically: find f(x) such that x is novel if f(x) ≤ c, where c is some constant, i.e. f(x) describes the amount of novelty.

SLIDE 27
  • Exponential Family for density estimation
  • MAP estimation

Problems with density estimation

p(x|θ) = exp(⟨φ(x), θ⟩ − g(θ))

MAP estimation: minimize_θ Σ_i [g(θ) − ⟨φ(x_i), θ⟩] + (1/(2σ²)) ‖θ‖²

Advantages: a convex optimization problem; concentration of measure.

Problems: the normalization g(θ) may be painful to compute; for density estimation we do not need a normalized p(x|θ); and there is no need to perform particularly well in high density regions.

SLIDE 28

Thresholding

SLIDE 29

Optimization Problem

MAP:

minimize_θ −Σ_{i=1}^m log p(x_i|θ) + (1/(2σ²)) ‖θ‖²

Novelty:

minimize_θ Σ_{i=1}^m max(ρ − g(θ) − log p(x_i|θ), 0) + (1/2) ‖θ‖² = Σ_{i=1}^m max(ρ − ⟨φ(x_i), θ⟩, 0) + (1/2) ‖θ‖²

Advantages: no normalization g(θ) needed; no need to perform particularly well in high density regions (the estimator focuses on the low-density regions); it is a quadratic program.

SLIDE 30

Maximum Distance Hyperplane

Idea: find the hyperplane, given by f(x) = ⟨w, x⟩ + b = 0, that has maximum distance from the origin yet is still closer to the origin than the observations.

Hard Margin: minimize (1/2)‖w‖² subject to ⟨w, x_i⟩ ≥ 1

Soft Margin: minimize (1/2)‖w‖² + C Σ_{i=1}^m ξ_i subject to ⟨w, x_i⟩ ≥ 1 − ξ_i and ξ_i ≥ 0

SLIDE 31

Optimization Problem

Primal Problem: minimize (1/2)‖w‖² + C Σ_{i=1}^m ξ_i subject to ⟨w, x_i⟩ − 1 + ξ_i ≥ 0 and ξ_i ≥ 0

Lagrange Function L: subtract the constraints, multiplied by Lagrange multipliers (α_i and η_i), from the primal objective function. The Lagrange function L has a saddlepoint at the optimum:

L = (1/2)‖w‖² + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i (⟨w, x_i⟩ − 1 + ξ_i) − Σ_{i=1}^m η_i ξ_i subject to α_i, η_i ≥ 0

SLIDE 32

Dual Problem

Optimality Conditions:

∂_w L = w − Σ_{i=1}^m α_i x_i = 0  ⟹  w = Σ_{i=1}^m α_i x_i
∂_{ξ_i} L = C − α_i − η_i = 0  ⟹  α_i ∈ [0, C]

Now substitute the optimality conditions back into L.

Dual Problem: minimize (1/2) Σ_{i,j=1}^m α_i α_j ⟨x_i, x_j⟩ − Σ_{i=1}^m α_i subject to α_i ∈ [0, C]

All of this is only possible due to the convexity of the primal problem.

SLIDE 33

Minimum enclosing ball

  • Observations on the surface of the ball
  • Find the minimum enclosing ball
  • Equivalent to the single class SVM

[Figure: a ball of radius R enclosing the data points; the margin from the origin is ρ/‖w‖.]

SLIDE 34

Adaptive thresholds

Problem: depending on C, the number of novel points will vary. We would like to specify the fraction ν beforehand.

Solution: use the hyperplane separating the data from the origin, H := {x | ⟨w, x⟩ = ρ}, where the threshold ρ is adaptive.

Intuition: let the hyperplane shift by shifting ρ; adjust ρ such that the ‘right’ number of observations is considered novel; do this automatically.

SLIDE 35

Optimization Problem

Primal Problem: minimize (1/2)‖w‖² + Σ_{i=1}^m ξ_i − mνρ where ⟨w, x_i⟩ ≥ ρ − ξ_i and ξ_i ≥ 0

Dual Problem: minimize (1/2) Σ_{i,j=1}^m α_i α_j ⟨x_i, x_j⟩ where α_i ∈ [0, 1] and Σ_{i=1}^m α_i = νm

This is similar to the SV classification problem and can be solved with standard tools.
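This ν-parametrized problem is what scikit-learn's OneClassSVM implements; a usage sketch on toy data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(95, 2)),            # typical points
               rng.normal(6.0, 1.0, size=(5, 2))])  # outliers

# nu upper-bounds the fraction of points treated as novel (the nu-property)
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X)
print((clf.predict(X) == -1).sum())  # number of points flagged as novel
```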

SLIDE 36

The ν-property theorem

  • Optimization problem:

minimize_w (1/2)‖w‖² + Σ_{i=1}^m ξ_i − mνρ subject to ⟨w, x_i⟩ ≥ ρ − ξ_i and ξ_i ≥ 0

  • The solution satisfies:
  • At most a fraction ν of the points are novel
  • At most a fraction (1 − ν) of the points aren’t novel
  • The fraction of points on the boundary vanishes for large m (for non-pathological kernels)

SLIDE 37

Proof

  • Move the boundary at optimality.
  • For a smaller threshold, the m− points on the wrong side of the margin contribute δ(m− − νm) ≤ 0.
  • For a larger threshold, the m+ points not on the ‘good’ side of the margin yield δ(m+ − νm) ≥ 0.
  • Combining the inequalities: m−/m ≤ ν ≤ m+/m.
  • The margin is a set of measure 0.

SLIDE 38

Toy example

ν, width c      0.5, 0.5    0.5, 0.5    0.1, 0.5    0.5, 0.1
frac. SVs/OLs   0.54, 0.43  0.59, 0.47  0.24, 0.03  0.65, 0.38
margin ρ/‖w‖    0.84        0.70        0.62        0.48

The ν-trick trades off the threshold against smoothness requirements.

SLIDE 39

Novelty detection for OCR

Better estimates, since we only optimize in low density regions. Specifically tuned for a small number of outliers. Only an estimate of a level-set. For ν = 1 we recover the Parzen-windows estimator.

SLIDE 40

Classification with the ν-trick

[Figure: classification with the ν-trick while changing the kernel width and the threshold.]

SLIDE 41

Convex Optimization


SLIDE 42

Selecting Variables

SLIDE 43
Constrained Quadratic Program

  • Optimization problem:
  • Support Vector classification
  • Support Vector regression
  • Novelty detection
  • Solving it:
  • Off-the-shelf solvers for small problems (see the sketch below)
  • Solve a sequence of subproblems
  • Optimization in primal space (the w space)

minimize_α (1/2) α^⊤ Q α + l^⊤ α subject to Cα + b ≤ 0
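For small instances, an off-the-shelf QP solver is enough. A sketch using the cvxopt package (assuming it is available): its standard form min ½x^⊤Px + q^⊤x subject to Gx ≤ h matches the slide's notation with P = Q, q = l, G = C, h = −b. The toy values of Q, l, C, b below are illustrative.

```python
import numpy as np
from cvxopt import matrix, solvers

Q = matrix(np.array([[2.0, 0.5], [0.5, 1.0]]))   # toy positive definite Q
l = matrix(np.array([-1.0, -1.0]))
C = matrix(np.vstack([-np.eye(2), np.eye(2)]))   # encodes box constraints 0 <= alpha <= 1
b = matrix(np.array([0.0, 0.0, -1.0, -1.0]))     # written as C alpha + b <= 0

sol = solvers.qp(Q, l, C, -b)                    # h = -b in cvxopt's convention
print(np.array(sol["x"]).ravel())
```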

SLIDE 44

Convex problem

SLIDE 45

Subproblems

  • Original optimization problem:

minimize_α (1/2) α^⊤ Q α + l^⊤ α subject to Cα + b ≤ 0

  • Key idea: solve subproblems one at a time; decompose α = (α_a, α_f) into an active and a fixed set
  • The subproblem is again a convex problem:

minimize_{α_a} (1/2) α_a^⊤ Q_aa α_a + [l_a + Q_af α_f]^⊤ α_a subject to C_a α_a + [b + C_f α_f] ≤ 0

  • Updating subproblems is cheap

SLIDE 46

SLIDE 47

SLIDE 48

w = Σ_i y_i α_i x_i

α_i = 0      ⟹  y_i [⟨w, x_i⟩ + b] ≥ 1
0 < α_i < C  ⟹  y_i [⟨w, x_i⟩ + b] = 1
α_i = C      ⟹  y_i [⟨w, x_i⟩ + b] ≤ 1

α_i [y_i (⟨w, x_i⟩ + b) + ξ_i − 1] = 0 and η_i ξ_i = 0

Picking observations

  • Most violated margin condition
  • Points on the boundary
  • Points with a nonzero Lagrange multiplier that are correctly classified
SLIDE 49

Selecting variables

  • Incrementally grow the active set (chunking)
  • Select a promising subset of active variables (SVMLight)
  • Select pairs of variables (SMO)
SLIDE 50

  • Being smart about hardware
  • Data flow from disk to CPU
  • IO speeds

[Figure: a reading thread streams data sequentially from disk into a cached working set in RAM, while a training thread reads the working set with random access and updates the parameters.]

System   Capacity   Bandwidth   IOPs
Disk     3 TB       150 MB/s    10^2
SSD      256 GB     500 MB/s    5·10^4
RAM      16 GB      30 GB/s     10^8
Cache    16 MB      100 GB/s    10^9

SLIDE 51

[Figure: the same architecture as the previous slide, annotated to emphasize reusing the data already cached in the working set.]

SLIDE 52

Runtime Example (Matsushima, Vishwanathan, Smola, 2012)

[Figure: objective gap (log scale, 10^−1 down to 10^−11) vs. training time on the dna dataset with C = 1.0; StreamSVM converges orders of magnitude faster than SBM and BM, the fastest competitors.]

SLIDE 53

Primal Space Methods

SLIDE 54

Gradient Descent

  • Assume we can optimize in feature space directly
  • Minimize the regularized risk:

R[w] = (1/m) Σ_{i=1}^m l(x_i, y_i, w) + (λ/2) ‖w‖²

  • Compute the gradient and update:

g = ∂_w R[w],   w ← w − γ g

  • This fails in narrow canyons
  • Wasteful if we have lots of similar data

SLIDE 55

Stochastic gradient descent

  • Empirical risk as an expectation:

(1/m) Σ_{i=1}^m l(y_i, ⟨φ(x_i), w⟩) = E_{i∼{1,...,m}} [l(y_i, ⟨φ(x_i), w⟩)]

  • Stochastic gradient descent (pick a random (x_t, y_t)):

w_{t+1} ← w_t − η_t ∂_w l(y_t, ⟨φ(x_t), w_t⟩)

  • Often we require that the parameters are restricted to some convex set X, hence we project onto it:

w_{t+1} ← π_X [w_t − η_t ∂_w l(y_t, ⟨φ(x_t), w_t⟩)] where π_X(w) = argmin_{x∈X} ‖x − w‖
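A compact sketch of the projected update, with X taken to be a norm ball (an illustrative choice of constraint set; all names are mine):

```python
import numpy as np

def projected_sgd(X, y, grad, steps=2000, eta0=0.5, radius=10.0, seed=0):
    """w_{t+1} = pi_X[w_t - eta_t g_t], projecting onto a ball of the given radius."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(steps):
        i = rng.integers(len(X))                  # pick a random (x, y) pair
        w -= eta0 / np.sqrt(t + 1) * grad(X[i], y[i], w)   # O(t^{-1/2}) learning rate
        n = np.linalg.norm(w)
        if n > radius:                            # projection pi_X
            w *= radius / n
    return w

# gradient of the squared loss (1/2)(y - <x, w>)^2
sq_grad = lambda x, yi, w: -(yi - x @ w) * x
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(projected_sgd(X, y, sq_grad))               # approaches (1, -2, 0.5)
```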

SLIDE 56

Some applications ... and many more

  • Classification, soft margin loss: l(x, y, w) = max(0, 1 − y ⟨w, φ(x)⟩)
  • Classification, logistic loss: l(x, y, w) = log(1 + exp(−y ⟨w, φ(x)⟩))
  • Regression, quadratic loss: l(x, y, w) = (y − ⟨w, φ(x)⟩)²
  • Regression, l1 loss: l(x, y, w) = |y − ⟨w, φ(x)⟩|
  • Regression, Huber’s loss: l(x, y, w) = (1/(2σ²))(y − ⟨w, φ(x)⟩)² if |y − ⟨w, φ(x)⟩| ≤ σ, else (1/σ)|y − ⟨w, φ(x)⟩| − 1/2
  • Novelty detection: l(x, w) = max(0, 1 − ⟨w, φ(x)⟩)
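The same losses as NumPy one-liners, written in terms of the prediction f = ⟨w, φ(x)⟩ and ready to pair with the SGD sketch above:

```python
import numpy as np

soft_margin = lambda f, y: np.maximum(0.0, 1.0 - y * f)   # classification
logistic    = lambda f, y: np.log1p(np.exp(-y * f))       # classification
quadratic   = lambda f, y: (y - f) ** 2                   # regression
l1_loss     = lambda f, y: np.abs(y - f)                  # regression
novelty     = lambda f: np.maximum(0.0, 1.0 - f)          # novelty detection
```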

SLIDE 57

Convergence in Expectation

  • Proof: show that the parameters converge to the minimum (from Nesterov and Vial):

E[l(θ̄)] − l* ≤ (R² + L² Σ_{t=0}^{T−1} η_t²) / (2 Σ_{t=0}^{T−1} η_t)

where l(θ) = E_{(x,y)} [l(y, ⟨φ(x), θ⟩)] is the expected loss, l* = inf_{θ∈X} l(θ), θ̄ = Σ_{t=0}^{T−1} η_t θ_t / Σ_{t=0}^{T−1} η_t is the parameter average, R bounds the initial distance to the optimum, and L bounds the gradient norms.

  • Setup: pick θ* ∈ argmin_{θ∈X} l(θ) and set r_t := ‖θ* − θ_t‖.

SLIDE 58

Proof

  • Expand the distance to the optimum after one projected step:

r_{t+1}² = ‖π_X[θ_t − η_t g_t] − θ*‖² ≤ ‖θ_t − η_t g_t − θ*‖² = r_t² + η_t² ‖g_t‖² − 2η_t ⟨θ_t − θ*, g_t⟩

hence, by convexity (applied twice),

E[r_{t+1}² − r_t²] ≤ η_t² L² + 2η_t [l* − E[l(θ_t)]] ≤ η_t² L² + 2η_t [l* − E[l(θ̄)]]

  • Summing this inequality over t proves the claim.
  • This yields a randomized algorithm for minimizing objective functions (run it log-many times and pick the best, or average; the median trick).

SLIDE 59

Rates

  • Guarantee:

E[l(θ̄)] − l* ≤ (R² + L² Σ_{t=0}^{T−1} η_t²) / (2 Σ_{t=0}^{T−1} η_t)

  • If we know R, L, T, pick the constant learning rate η = R/(L√T); then

E[l(θ̄)] − l* ≤ R(1 + 1/T)L / (2√T) < LR/√T

  • If we don’t know T, pick η_t = O(t^{−1/2}). This costs us an additional log term:

E[l(θ̄)] − l* = O(log T / √T)

SLIDE 60

Strong Convexity

  • Strong convexity gives a quadratic lower bound:

l_i(θ′) ≥ l_i(θ) + ⟨∂_θ l_i(θ), θ′ − θ⟩ + (λ/2) ‖θ − θ′‖²

  • Use this to bound the expected deviation:

r_{t+1}² ≤ r_t² + η_t² ‖g_t‖² − 2η_t ⟨θ_t − θ*, g_t⟩ ≤ r_t² + η_t² L² − 2η_t [l_t(θ_t) − l_t(θ*)] − λη_t r_t²

hence E[r_{t+1}²] ≤ (1 − λη_t) E[r_t²] − 2η_t [E[l(θ_t)] − l*] + η_t² L²

  • Exponentially decaying averaging:

θ̄ = (1 − σ)/(1 − σ^T) Σ_{t=0}^{T−1} σ^{T−1−t} θ_t

and plugging this into the discrepancy yields

l(θ̄) − l* ≤ (2L²/(λT)) log[1 + λR√T/(2L)] for η = (2/(λT)) log[1 + λR√T/(2L)]

SLIDE 61

More variants

  • Adversarial guarantees: the update

θ_{t+1} ← π_X [θ_t − η_t ∂_θ l(y_t, ⟨φ(x_t), θ_t⟩)]

has low regret (average instantaneous cost) for arbitrary orders of the data (useful for game theory).

  • Ratliff, Bagnell, Zinkevich: O(t^{−1/2}) learning rate
  • Shalev-Shwartz, Srebro, Singer (Pegasos): O(t^{−1}) learning rate (but need constants)
  • Bartlett, Rakhlin, Hazan: add a strong convexity penalty

SLIDE 62

Regularization

SLIDE 63

Problems with Kernels

Myth: Support Vector machines work because they map data into a high-dimensional feature space. And your statistician (Bellman) told you: the higher the dimensionality, the more data you need.

Example (density estimation): assuming data in [0, 1]^m, 1000 observations in [0, 1] give you on average 100 instances per bin (bins of size 0.1^m), but only 1/100 of an instance per bin in [0, 1]^5.

Worrying Fact: some kernels map into an infinite-dimensional space, e.g. k(x, x′) = exp(−‖x − x′‖² / (2σ²)).

Encouraging Fact: SVMs work well in practice ...

SLIDE 64

Solving the Mystery

The Truth is in the Margins: maybe the maximum margin requirement is what saves us when finding a classifier, i.e., we minimize ‖w‖².

Risk Functional: rewrite the optimization problems in a unified form

R_reg[f] = Σ_{i=1}^m c(x_i, y_i, f(x_i)) + Ω[f]

where c(x, y, f(x)) is a loss function and Ω[f] is a regularizer; Ω[f] = (λ/2)‖w‖² for linear functions.

For classification c(x, y, f(x)) = max(0, 1 − y f(x)). For regression c(x, y, f(x)) = max(0, |y − f(x)| − ε).

SLIDE 65

Typical SVM loss

[Figure: the soft margin loss and the ε-insensitive loss.]

SLIDE 66

Soft Margin Loss

Original Optimization Problem:

minimize_{w,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξ_i subject to y_i f(x_i) ≥ 1 − ξ_i and ξ_i ≥ 0 for all 1 ≤ i ≤ m

Regularization Functional:

minimize_w (λ/2)‖w‖² + Σ_{i=1}^m max(0, 1 − y_i f(x_i))

For fixed f, clearly ξ_i ≥ max(0, 1 − y_i f(x_i)). For ξ_i > max(0, 1 − y_i f(x_i)) we can decrease it until the bound is matched, improving the objective function. Hence both methods are equivalent.

SLIDE 67

Why Regularization?

What we really wanted ... find some f(x) such that the expected loss E[c(x, y, f(x))] is small.

What we ended up doing ... find some f(x) such that the empirical average of the loss is small:

E_emp[c(x, y, f(x))] = (1/m) Σ_{i=1}^m c(x_i, y_i, f(x_i))

However, just minimizing the empirical average does not guarantee anything for the expected loss (overfitting).

Safeguard against overfitting: we need to constrain the class of functions f ∈ F somehow. Adding Ω[f] as a penalty does exactly that.
SLIDE 68

Some regularization ideas

Small Derivatives: we want a function f which is smooth on the entire domain. In this case we could use

Ω[f] = ∫_X ‖∂_x f(x)‖² dx = ⟨∂_x f, ∂_x f⟩.

Small Function Values: if we have no further knowledge about the domain X, minimizing ‖f‖² might be sensible, i.e.,

Ω[f] = ‖f‖² = ⟨f, f⟩.

Splines: here we want to find f such that both ‖f‖² and ‖∂²_x f‖² are small. Hence we can minimize

Ω[f] = ‖f‖² + ‖∂²_x f‖² = ⟨(f, ∂²_x f), (f, ∂²_x f)⟩

SLIDE 69

Regularization

Regularization Operators: we map f into some Pf, which is small for desirable f and large otherwise, and minimize

Ω[f] = ‖Pf‖² = ⟨Pf, Pf⟩.

For all previous examples we can find such a P.

Function Expansion for Regularization Operator: using a linear function expansion of f in terms of some f_i, that is, for f(x) = Σ_i α_i f_i(x), we can compute

Ω[f] = ⟨P Σ_i α_i f_i, P Σ_j α_j f_j⟩ = Σ_{i,j} α_i α_j ⟨P f_i, P f_j⟩.

SLIDE 70

Regularization and Kernels

Regularization for Ω[f] = (1/2)‖w‖²:

w = Σ_i α_i Φ(x_i)  ⟹  ‖w‖² = Σ_{i,j} α_i α_j k(x_i, x_j)

This looks very similar to Σ_{i,j} α_i α_j ⟨P f_i, P f_j⟩.

Key Idea: if we could find P and k such that k(x, x′) = ⟨P k(x, ·), P k(x′, ·)⟩, we could show that using a kernel means we are minimizing the empirical risk plus a regularization term.

Solution: Greens Functions. A sufficient condition is that k is the Greens function of P*P, that is, ⟨P*P k(x, ·), f(·)⟩ = f(x). One can show that this is necessary and sufficient.

SLIDE 71

Building Kernels

Kernels from Regularization Operators: given an operator P*P, we can find k by solving the self-consistency equation

⟨P k(x, ·), P k(x′, ·)⟩ = k(x, ·)^⊤ (P*P) k(x′, ·) = k(x, x′)

and take f to be in the span of all k(x, ·). So we can find k for a given measure of smoothness.

Regularization Operators from Kernels: given a kernel k, we can find some P*P for which the self-consistency equation is satisfied. So we can find a measure of smoothness for a given k.

SLIDE 72

Spectrum and Kernels

Effective Function Class: keeping Ω[f] small means that f(x) cannot take on arbitrary function values. Hence we study the function class

F_C = { f | (1/2) ⟨Pf, Pf⟩ ≤ C }

Example: for f = Σ_i α_i k(x_i, x) this implies (1/2) α^⊤ K α ≤ C.

Kernel matrix:

K = ( 5 2
      2 1 )

[Figure: the set of admissible coefficients α and the corresponding function values.]

SLIDE 73

Fourier Regularization

Goal: find a measure of smoothness that depends on the frequency properties of f and not on the position of f.

A hint, rewriting ‖f‖² + ‖∂_x f‖² (where f̃(ω) denotes the Fourier transform of f):

‖f‖² + ‖∂_x f‖² = ∫ |f(x)|² + |∂_x f(x)|² dx = ∫ (1 + ω²) |f̃(ω)|² dω = ∫ |f̃(ω)|² / p(ω) dω with p(ω) = 1/(1 + ω²)

Idea: generalize to arbitrary p(ω), i.e.

Ω[f] := (1/2) ∫ |f̃(ω)|² / p(ω) dω

SLIDE 74

Greens Function

Theorem: for regularization functionals Ω[f] := (1/2) ∫ |f̃(ω)|² / p(ω) dω, the self-consistency condition

⟨P k(x, ·), P k(x′, ·)⟩ = k(x, ·)^⊤ (P*P) k(x′, ·) = k(x, x′)

is satisfied if k has p(ω) as its Fourier transform, i.e.,

k(x, x′) = ∫ exp(i⟨ω, x − x′⟩) p(ω) dω

Consequences: small p(ω) corresponds to a high penalty (strong regularization). Ω[f] is translation invariant, that is, Ω[f(·)] = Ω[f(· − x)].

SLIDE 75

Examples

Laplacian Kernel: k(x, x′) = exp(−‖x − x′‖) with p(ω) ∝ (1 + ‖ω‖²)^{−1}

Gaussian Kernel: k(x, x′) = exp(−‖x − x′‖²/(2σ²)) with p(ω) ∝ exp(−σ²‖ω‖²/2)

The Fourier transform of k shows its regularization properties: the more rapidly p(ω) decays, the more high frequencies are filtered out.

SLIDE 76

Rules of thumb

The Fourier transform is sufficient to check whether k(x, x′) satisfies Mercer’s condition: only check whether k̃(ω) ≥ 0. Example: k(x, x′) = sinc(x − x′) has k̃(ω) = χ_{[−π,π]}(ω), hence k is a proper kernel.

The width of the kernel is often more important than the type of kernel (short range decay properties matter).

This is a convenient way of incorporating prior knowledge, e.g. for speech data we could use the autocorrelation function.

A sum of derivatives becomes a polynomial in Fourier space.
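The sinc claim can be checked numerically by sampling a Gram matrix and inspecting its spectrum (a sanity check, not a proof; note that np.sinc(t) = sin(πt)/(πt)):

```python
import numpy as np

x = np.linspace(-5, 5, 80)
K = np.sinc((x[:, None] - x[None, :]) / np.pi)   # sin(u)/u for u = x - x'
print(np.linalg.eigvalsh(K).min())               # nonnegative up to round-off
```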

SLIDE 77

Polynomial Kernels

Functional Form: k(x, x′) = κ(⟨x, x′⟩)

Series Expansion: polynomial kernels admit an expansion in terms of Legendre polynomials (L^N_n: order n in R^N):

k(x, x′) = Σ_{n=0}^∞ b_n L^N_n(⟨x, x′⟩)

Consequence: the L^N_n (and their rotations) form an orthonormal basis on the unit sphere, P*P is rotation invariant, and P*P is diagonal with respect to the L^N_n. In other words,

(P*P) L^N_n(⟨x, ·⟩) = b_n^{−1} L^N_n(⟨x, ·⟩)

SLIDE 78

Polynomial Kernels

The decay properties of b_n determine the smoothness of the functions specified by κ(⟨x, x′⟩). For N → ∞ all terms of L^N_n but x^n vanish, hence a Taylor series k(x, x′) = Σ_i a_i ⟨x, x′⟩^i gives a good guess.

Inhomogeneous Polynomial: k(x, x′) = (⟨x, x′⟩ + 1)^p with a_n = (p choose n) if n ≤ p

Vovk’s Real Polynomial: k(x, x′) = (1 − ⟨x, x′⟩^p) / (1 − ⟨x, x′⟩) with a_n = 1 if n < p

SLIDE 79

Mini Summary

  • Regularized Risk Functional: from optimization problems to loss functions
  • Regularization: a safeguard against overfitting
  • Regularization and Kernels: examples of regularizers; regularization operators; Greens functions and the self-consistency condition
  • Fourier Regularization: translation invariant regularizers; regularization in Fourier space; the kernel is the inverse Fourier transform of the weight p(ω)
  • Polynomial kernels and series expansions

SLIDE 80

Text Analysis (string kernels)

SLIDE 81

String Kernel (pre)History

SLIDE 82

The Kernel Perspective

  • Design a kernel implementing good features
  • Many variants
  • Bag of words (AT&T labs 1995, e.g. Vapnik)
  • Matching substrings (Haussler, Watkins 1998)
  • Spectrum kernel (Leslie, Eskin, Noble, 2000)
  • Suffix tree (Vishwanathan, Smola, 2003)
  • Suffix array (Teo, Vishwanathan, 2006)
  • Rational kernels (Mohri, Cortes, Haffner, 2004 ...)

k(x, x′) = ⟨φ(x), φ(x′)⟩ and f(x) = ⟨φ(x), w⟩ = Σ_i α_i k(x_i, x)

SLIDE 83

Bag of words

  • At least since 1995 known in AT&T labs

(to be or not to be) (be:2, or:1, not:1, to:2)

  • Joachims 1998: Use sparse vectors
  • Haffner 2001: Inverted index for faster training
  • Lots of work on feature weighting (TF/IDF)
  • Variants of it deployed in many spam filters

k(x, x′) = Σ_w n_w(x) n_w(x′) and f(x) = Σ_w ω_w n_w(x)
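A bag-of-words kernel is a few lines of Python (whitespace tokenization is an assumed simplification):

```python
from collections import Counter

def bow_kernel(doc_a, doc_b):
    """Bag-of-words kernel: k(x, x') = sum_w n_w(x) * n_w(x')."""
    na, nb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    return sum(na[w] * nb[w] for w in na.keys() & nb.keys())

print(bow_kernel("to be or not to be", "to be"))  # 2*1 (to) + 2*1 (be) = 4
```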

SLIDE 84

Substring (mis)matching

  • Watkins 1998-99 (dynamic alignment, etc.)
  • Haussler 1999 (convolution kernels)
  • In general O(|x| · |x′|) runtime (e.g. Cristianini, Shawe-Taylor, Lodhi, 2001)
  • Dynamic programming solution for pair-HMMs

k(x, x′) = Σ_{w∈x} Σ_{w′∈x′} κ(w, w′)

[Figure: a pair-HMM with START, A, B, AB, and END states and their transition weights.]

SLIDE 85

Spectrum Kernel

  • Leslie, Eskin, Noble & coworkers, 2002
  • Key idea is to focus on the features directly
  • Linear time operation to get the features
  • Limited amount of mismatch (exponential in the number of missed characters)
  • Explicit feature construction (good & fast for DNA sequences; see the sketch after this list)

SLIDE 86

Suffix Tree Kernel

  • Vishwanathan & Smola, 2003 (O(|x| + |x′|) time)
  • Mismatch-free kernel + arbitrary weights
  • Linear time construction (Ukkonen, 1995)
  • Find matches for the second string in linear time (Chang & Lawler, 1994)
  • Precompute the weights on the path

k(x, x′) = Σ_w ω_w n_w(x) n_w(x′)

SLIDE 87

Are we done?

  • Large vocabulary size
  • Need to build dictionary
  • Approximate matches are still a problem
  • Suffix tree/array is storage inefficient (40-60x)
  • Realtime computation
  • Memory constraints (keep in RAM)
  • Difficult to implement
SLIDE 88

Multitask Learning

SLIDE 89

Multitask Learning

[Figure: one separate classifier per user.]

SLIDE 90

Multitask Learning

[Figure: per-user classifiers fed labels (1: spam, 0: not spam, 1: donut?, 0: quality) by users who are malicious, educated, misinformed, confused, or silent.]

SLIDE 91

Multitask Learning

[Figure: a single shared classifier serving all users (malicious, educated, misinformed, confused, silent).]

SLIDE 92

Multitask Learning

[Figure: per-user classifiers combined with a global classifier.]

SLIDE 93

Collaborative Classification

  • Primal representation:

f(x, u) = ⟨φ(x), w⟩ + ⟨φ(x), w_u⟩ = ⟨φ(x) ⊗ (1, e_u), w⟩

  • Kernel representation, the multitask kernel (e.g. Pontil & Micchelli, Daumé):

k((x, u), (x′, u′)) = k(x, x′)[1 + δ_{u,u′}]

This usually does not scale well ...

  • Problem: the dimensionality is 10^13. That is 40 TB of space.

SLIDE 94

Collaborative Classification

[Figure: the weight vector for an email splits into a global part w and a per-user part w_user.]

(Primal and kernel representation as on the previous slide.)

SLIDE 95

Collaborative Classification

[Figure: equivalently, the email features are replicated as (1 + e_user), pairing the shared weights w with user-specific weights.]

(Primal and kernel representation as on the previous slide.)

SLIDE 96

Hashing

SLIDE 97

Hash Kernels

SLIDE 98

Hash Kernels

Example email: “Hey, please mention subtly during your talk that people should use Yahoo* search more often. Thanks” (*in the old days)

[Figure: the instance is a sparse count vector (1, 2, 1, 1, ...) over a dictionary, plus a task/user feature (user = barney); both parts are sparse.]

SLIDE 99

Hash Kernels

[Figure: the same sparse dictionary representation is mapped by a hash function h() into R^m; the hashed count vector (1, 3, 2, 1, ...) stays sparse.]

SLIDE 100

Hash Kernels

[Figure: every token is hashed twice, e.g. h(‘mention’) and h(‘mention_barney’), each multiplied by a random sign s(m), s(m_b) ∈ {−1, 1}; before hashing the instance x_i lives in R^{N×(U+1)}.]

Similar to the count hash (Charikar, Chen, Farach-Colton, 2003).
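A sketch of the signed hashing trick, including the user-tagged duplicate feature from the figure (Python's built-in hash stands in for a fixed hash function such as MurmurHash; all names are illustrative):

```python
import numpy as np

def hashed_features(tokens, user, n_bins=2**20):
    """Signed feature hashing ('hash kernel'): each token is hashed twice,
    once globally and once tagged with the user (the multitask trick)."""
    x = np.zeros(n_bins)
    for tok in tokens:
        for key in (tok, tok + "_" + user):   # global feature + user-specific feature
            h = hash(key)
            sign = 1.0 if h & 1 else -1.0     # pseudo-random sign in {-1, +1}
            x[(h >> 1) % n_bins] += sign      # bucket index from the remaining bits
    return x

x = hashed_features("use yahoo search more often".split(), user="barney")
print(np.count_nonzero(x))                    # hashing preserves sparsity: ~10 nonzeros
```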

SLIDES 101-105 (incremental build)

Advantages of hashing

  • No dictionary!
  • Content drift is no problem
  • All memory is used for classification
  • Finite memory guarantee (via online learning)
  • No memory needed for projection (vs. LSH)
  • Implicit mapping into a high dimensional space!
  • It is sparsity preserving! (vs. LSH)
SLIDE 106

Approximate Orthogonality

We can do multi-task learning!

[Figure: the maps ξ() and h() send R^large down to R^small; the hashed images of different tasks’ features are approximately orthogonal.]

SLIDE 107

Guarantees

  • For a random hash function the inner product between different subspaces vanishes with high probability:

Pr{ |⟨w_v, h_u(x)⟩| > ε } ≤ 2 exp(−C ε² m)

  • We can use this for multitask learning: a direct sum in Hilbert space becomes a plain sum in hash space.
  • The hashed inner product is unbiased. Proof: take the expectation over the random signs.
  • The variance is O(1/n). Proof: brute force expansion.
  • Restricted isometry property (Kumar, Sarlos, Dasgupta 2010).

SLIDE 108

Spam classification results

[Figure: spam misclassification rate relative to the baseline vs. the number of bits in the hash table; the personalized hashed variant improves on the global hashed variant and the baseline once enough bits are used. N = 20M, U = 400K.]

SLIDE 109

Lazy users ...

[Figure: histogram of labeled emails per user; number of users (log scale, 1 to 1,000,000) vs. number of labels (0 to 523). The vast majority of users label very few emails.]


SLIDE 110

Results by user group

SLIDE 111

Results by user group

[Figure: misclassification rate relative to the baseline vs. number of hash bits, split by the number of labeled emails per user group.]

SLIDE 112

Results by user group

[Figure: the same plot, highlighted.]

SLIDE 113

Details

SLIDE 114

Estimation details

  • Works best with stochastic gradient descent (or any other primal space method)
  • Never instantiate the hash map explicitly (see the sketch below)
  • Random memory access pattern (latency)
  • Multiclass classification: use a joint hash

f(x) = ⟨w, φ(x)⟩ = Σ_s w[h(s)] n_s(x)
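A sketch of evaluating f(x) = Σ_s w[h(s)] n_s(x) directly from token counts (again with Python's built-in hash standing in for a fixed hash):

```python
from collections import Counter
import numpy as np

def hashed_predict(w, tokens, n_bins):
    """Evaluate the linear model straight from token counts, never
    materializing the feature vector phi(x)."""
    return sum(w[hash(s) % n_bins] * n_s for s, n_s in Counter(tokens).items())

w = np.zeros(2**16)                            # hashed weight vector, e.g. trained by SGD
print(hashed_predict(w, "cheap pills now".split(), n_bins=2**16))
```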

SLIDE 115

Approximate Matches

  • General idea:

k(x, x′) = Σ_{w∈x} Σ_{w′∈x′} κ(w, w′) for |w − w′| ≤ δ

  • Simplification: weigh by the mismatch amount |w − w′|
  • Map into fragments: dog → (*og, d*g, do*)
  • Hash the fragments and weigh them based on the mismatch amount
  • Exponential in the amount of mismatch, but not in the alphabet size

SLIDE 116
Memory access patterns

  • Cache size is a few MBs; very fast random memory access
  • RAM (DDR3 or better) is GBs; fast sequential memory access (burst read)
  • The CPU caches memory read from RAM
  • Random memory access is very slow

[Figure: access patterns for a plain vector vs. a hashed sequence.]

SLIDE 117

Speeding up access

  • Key idea: bound the range of h(i, j) in the inner loop “for j = 1 to n: access h(i, j)”
  • Linear offset h(i, j) = h(i) + j: bad collisions in i
  • Sum of hash functions h(i, j) = h(i) + h′(j): bad collisions in j
  • Optimal Golomb Ruler h(i, j) = h(i) + OGR(j) (Langford): NP hard in general
  • Feistel Network / Cryptography h(i, j) = h(i) + crypt(j|i) (new)