

slide-1
SLIDE 1

1 1

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7, 2012

Machine Learning Summer School 2012, Kyoto. Version 2012.09.04. The latest version of the slides is downloadable at http://www.ism.ac.jp/~fukumizu/MLSS2012/

Kernel Methods for Statistical Learning

slide-2
SLIDE 2

Lecture Plan

I. Introduction to kernel methods

II. Various kernel methods: kernel PCA, kernel CCA, kernel ridge regression, etc.

III. Support vector machine: a brief introduction to SVM

IV. Theoretical backgrounds of kernel methods: mathematical aspects of positive definite kernels

V. Nonparametric inference with positive definite kernels: recent advances of kernel methods

2

slide-3
SLIDE 3

 General references (more detailed lists are given at the end of each section)

– Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press. 2002.
– Lecture slides (more detailed than this course). The page contains Japanese text, but the slides are written in English. Slides: 1, 2, 3, 4, 5, 6, 7, 8
– For Japanese readers only (sorry!):

  • Fukumizu, "Introduction to Kernel Methods: Data Analysis with Positive Definite Kernels" (in Japanese). Asakura Shoten (2010).

  • Akaho, "Kernel Multivariate Analysis: New Developments in Nonlinear Data Analysis" (in Japanese). Iwanami Shoten (2008).

3

slide-4
SLIDE 4

1

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies September 6-7 Machine Learning Summer School 2012, Kyoto

  • I. Introduction to Kernel Methods

I-1

slide-5
SLIDE 5

2

Outline

  • 1. Linear and nonlinear data analysis
  • 2. Principles of kernel methods
  • 3. Positive definite kernels and feature spaces

I-2

slide-6
SLIDE 6

3

Linear and Nonlinear Data Analysis

I-3

slide-7
SLIDE 7

What is data analysis?

– Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. –Wikipedia

I-4

slide-8
SLIDE 8

Linear data analysis

– ‘Table’ of numbers → matrix expression
– Linear algebra is used for the methods of analysis

  • Correlation,
  • Linear regression analysis,
  • Principal component analysis,
  • Canonical correlation analysis, etc.

X = ( X^(1), X^(2), …, X^(N) ),   X^(i) = ( X_1^(i), …, X_m^(i) )ᵀ ∈ R^m   (m-dimensional, N data)

I-5

slide-9
SLIDE 9

 Example 1: Principal component analysis (PCA)

PCA: project the data onto the subspace of largest variance.

1st direction = argmax_{‖a‖=1} Var[aᵀX]

Var[aᵀX] = (1/N) Σ_{i=1}^N ( aᵀX_i − (1/N) Σ_{j=1}^N aᵀX_j )² = aᵀ V_XX a,

where V_XX = (1/N) Σ_{i=1}^N ( X_i − (1/N) Σ_j X_j )( X_i − (1/N) Σ_j X_j )ᵀ is the (empirical) covariance matrix of X.
I-6

slide-10
SLIDE 10

PCA ⇔ eigenproblem of the covariance matrix V_XX:

– 1st principal direction = argmax_{‖a‖=1} aᵀ V_XX a = unit eigenvector for the largest eigenvalue of V_XX.
– p-th principal direction = unit eigenvector for the p-th largest eigenvalue of V_XX.

I-7

slide-11
SLIDE 11

 Example 2: Linear classification

– Binary classification.

Input data X = ( X^(1), …, X^(N) ),  X^(i) ∈ R^m;  class labels Y^(1), …, Y^(N) ∈ {±1}.

Find a linear classifier h(x) = sgn( aᵀx + b ) so that h(X^(i)) = Y^(i) for all (or most) i.

– Examples: Fisher’s linear discriminant analyzer, linear SVM, etc.

I-8

slide-12
SLIDE 12

Are linear methods enough?

[Figure: data in (x1, x2) that are linearly inseparable become linearly separable after the transform (z1, z2, z3) = (x1², x2², √2 x1 x2).]

Watch the following movie! http://jp.youtube.com/watch?v=3liCbRZPrZA
I-9

slide-13
SLIDE 13

 Another example: correlation

ρ_XY = Cov[X, Y] / √( Var[X] Var[Y] ) = E[(X − E[X])(Y − E[Y])] / √( E[(X − E[X])²] E[(Y − E[Y])²] )

[Figure: scatter plot of (X, Y) with ρ = 0.94.]

I-10

slide-14
SLIDE 14

[Figure: a nonlinear relation between X and Y with ρ(X, Y) = 0.17 becomes nearly linear after the transform (X, Y) → (X², Y), giving ρ(X², Y) = 0.96.]

I-11

slide-15
SLIDE 15

Nonlinear transform helps!

Analysis of data is a process of inspecting, cleaning, transforming,

and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. – Wikipedia. Kernel method = a systematic way of transforming data into a high- dimensional feature space to extract nonlinearity or higher-order moments of data.

I-12

slide-16
SLIDE 16

Principles of kernel methods

I-13

slide-17
SLIDE 17

Kernel method: Big picture

– Idea of the kernel method
– What kind of space is appropriate as a feature space?
  • It should incorporate various nonlinear information of the original data.
  • The inner product should be computable; it is essential for many linear methods.

Feature map Φ: (space of original data) → H_k (feature space), x_i ↦ Φ(x_i). Do linear analysis of Φ(x_1), …, Φ(x_n) in the feature space (e.g. SVM).
I-14

slide-18
SLIDE 18

 Computational issue

– For example, how about using a power series expansion?  (X, Y, Z) ↦ (X, Y, Z, X², Y², Z², XY, YZ, ZX, …)
– But many recent data are high-dimensional (e.g. microarrays, images), and the above expansion is intractable. E.g. up to 2nd-order moments of a 10,000-dimensional vector, the dimension of the feature space is

10000C1 + 10000C2 = 50,005,000 (!)

– A cleverer way is needed → the kernel method.

I-15

slide-19
SLIDE 19

Feature space by positive definite kernel

– Feature map Φ: Ω → H from the original space to the feature space:  X_1, …, X_n ↦ Φ(X_1), …, Φ(X_n).
– With a special choice of feature space, we have a function (positive definite kernel) k(x, y) such that

⟨Φ(X_i), Φ(X_j)⟩ = k(X_i, X_j)   (kernel trick).

– Many linear methods use only the inner products of the data and do not need the explicit form of the vectors Φ(X_i) (e.g. PCA).
I-16

slide-20
SLIDE 20

Positive definite kernel

Definition.  Ω: a set.  k: Ω × Ω → R is a positive definite kernel if

1) (symmetry)  k(x, y) = k(y, x),
2) (positivity)  for arbitrary x_1, …, x_n ∈ Ω, the Gram matrix ( k(x_i, x_j) )_{i,j} is positive semidefinite, i.e.

Σ_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0   for any c_1, …, c_n ∈ R.
I-17

slide-21
SLIDE 21

Examples: positive definite kernels on R^m (proofs are given in Section IV)

  • Euclidean inner product:  k(x, y) = xᵀy
  • Gaussian RBF kernel:  k_G(x, y) = exp( −‖x − y‖² / (2σ²) )
  • Laplacian kernel:  k_L(x, y) = exp( −α Σ_{i=1}^m |x_i − y_i| )
  • Polynomial kernel:  k_P(x, y) = ( xᵀy + c )^d   (c ≥ 0, d ∈ N)
I-18
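To make the formulas above concrete, here is a minimal sketch (not from the slides) of how the corresponding Gram matrices can be computed with NumPy; the parameter names sigma, alpha, c, d mirror the symbols on this slide.

```python
import numpy as np

def gram_gauss(X, sigma):
    """Gaussian RBF Gram matrix: k(x,y) = exp(-||x-y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T          # pairwise squared distances
    return np.exp(-d2 / (2 * sigma**2))

def gram_laplace(X, alpha):
    """Laplacian Gram matrix: k(x,y) = exp(-alpha * sum_i |x_i - y_i|)."""
    d1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise L1 distances
    return np.exp(-alpha * d1)

def gram_poly(X, c, d):
    """Polynomial Gram matrix: k(x,y) = (x^T y + c)^d."""
    return (X @ X.T + c) ** d

# Positive semidefiniteness can be checked numerically via the eigenvalues:
X = np.random.randn(20, 3)
K = gram_gauss(X, sigma=1.0)
print(np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
```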

slide-22
SLIDE 22

Proposition 1.1  Let H be a vector space with inner product ⟨·,·⟩ and Φ: Ω → H be a map (feature map). If k: Ω × Ω → R is defined by k(x, y) = ⟨Φ(x), Φ(y)⟩, then k(x, y) is necessarily positive definite.

– Positive definiteness is necessary.
– Proof)  Σ_{i,j} c_i c_j k(x_i, x_j) = ⟨ Σ_i c_i Φ(x_i), Σ_j c_j Φ(x_j) ⟩ = ‖ Σ_i c_i Φ(x_i) ‖² ≥ 0.
I-19

slide-23
SLIDE 23

– A positive definite kernel is also sufficient.

Theorem 1.2 (Moore–Aronszajn)  For a positive definite kernel k on Ω, there is a Hilbert space H (reproducing kernel Hilbert space, RKHS) that consists of functions on Ω such that
1)  k(·, x) ∈ H for any x ∈ Ω;
2)  span{ k(·, x) | x ∈ Ω } is dense in H;
3)  (reproducing property)  ⟨ f, k(·, x) ⟩ = f(x)  for any f ∈ H, x ∈ Ω.

* Hilbert space: a vector space with an inner product whose induced topology is complete.

I-20

slide-24
SLIDE 24

Feature map by positive definite kernel

– Feature space = RKHS.  Feature map:  Φ: Ω → H,  x ↦ k(·, x);  X_1, …, X_n ↦ k(·, X_1), …, k(·, X_n).
– Kernel trick (by the reproducing property):  ⟨Φ(x), Φ(y)⟩ = ⟨ k(·, x), k(·, y) ⟩ = k(x, y).
– Prepare only a positive definite kernel: we do not need an explicit form of the feature vector or feature space. All we need for kernel methods are the kernel values k(X_i, X_j).
I-21

slide-25
SLIDE 25

1

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7

Machine Learning Summer School 2012, Kyoto

  • II. Various Kernel Methods

1

slide-26
SLIDE 26

Outline

  • 1. Kernel PCA
  • 2. Kernel CCA
  • 3. Kernel ridge regression
  • 4. Some topics on kernel methods

II-2

slide-27
SLIDE 27

Kernel Principal Component Analysis

II-3

slide-28
SLIDE 28

Principal Component Analysis

 PCA (review)

– Linear method for dimension reduction of data.
– Project the data in the directions of large variance.

1st principal axis = argmax_{‖a‖=1} Var[aᵀX],

Var[aᵀX] = (1/n) Σ_{i=1}^n ( aᵀX_i − (1/n) Σ_j aᵀX_j )² = aᵀ V_XX a,

where V_XX = (1/n) Σ_i ( X_i − (1/n)Σ_j X_j )( X_i − (1/n)Σ_j X_j )ᵀ.
II-4

slide-29
SLIDE 29

From PCA to Kernel PCA

– Kernel PCA: nonlinear dimension reduction of data (Schölkopf et al. 1998).
– Do PCA in the feature space:

max_{‖a‖=1} Var[aᵀX]   →   max_{‖f‖_H=1} Var[ ⟨f, Φ(X)⟩ ],

where Var[aᵀX] = (1/n) Σ_i ( aᵀX_i − (1/n)Σ_s aᵀX_s )²  and  Var[⟨f, Φ(X)⟩] = (1/n) Σ_i ( ⟨f, Φ(X_i)⟩ − (1/n)Σ_s ⟨f, Φ(X_s)⟩ )².

II-5

slide-30
SLIDE 30

It is sufficient to assume the form

f = Σ_{i=1}^n c_i ( Φ(X_i) − (1/n) Σ_{s=1}^n Φ(X_s) ).

Orthogonal directions to the data can be neglected: if f = Σ_i c_i( Φ(X_i) − (1/n)Σ_s Φ(X_s) ) + f⊥, where f⊥ is orthogonal to span{ Φ(X_i) − (1/n)Σ_s Φ(X_s) }, then the objective function of kernel PCA does not depend on f⊥.

Then, with the centered feature vectors Φ̃(X_i) := Φ(X_i) − (1/n)Σ_s Φ(X_s) and the centered Gram matrix (K̃_X)_{ij} := ⟨Φ̃(X_i), Φ̃(X_j)⟩,

Var[ ⟨f, Φ̃(X)⟩ ] = (1/n) cᵀ K̃_X² c,   ‖f‖²_H = cᵀ K̃_X c.   [Exercise]

II-6

slide-31
SLIDE 31

 Objective function of kernel PCA:

max  cᵀ K̃_X² c   subject to   cᵀ K̃_X c = 1,

where (K̃_X)_{ij} = k(X_i, X_j) − (1/n) Σ_{s=1}^n k(X_i, X_s) − (1/n) Σ_{t=1}^n k(X_t, X_j) + (1/n²) Σ_{s,t=1}^n k(X_s, X_t).

The centered Gram matrix K̃_X is expressed with the Gram matrix K_X = ( k(X_i, X_j) )_{ij} as

K̃_X = ( I_n − (1/n) 1_n 1_nᵀ ) K_X ( I_n − (1/n) 1_n 1_nᵀ ),   1_n = (1, …, 1)ᵀ ∈ R^n,   I_n = unit matrix.   [Exercise]

II-7

slide-32
SLIDE 32

– Kernel PCA can be solved by eigendecomposition.
– Kernel PCA algorithm:

  • Compute the centered Gram matrix K̃_X.
  • Eigendecompose K̃_X = Σ_{i=1}^n λ_i u_i u_iᵀ, with eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n and unit eigenvectors u_1, …, u_n.
  • The p-th principal component of X_i is √λ_p (u_p)_i.
II-8
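As a concrete illustration of the algorithm above, here is a minimal sketch (assumptions: a Gaussian kernel and NumPy's eigh for the eigendecomposition; this is not the authors' code).

```python
import numpy as np

def kernel_pca(X, sigma=1.0, n_components=2):
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix I - (1/n) 11^T
    Kc = J @ K @ J                                # centered Gram matrix K~
    lam, U = np.linalg.eigh(Kc)                   # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]                # sort descending
    # p-th principal component of X_i is sqrt(lambda_p) * U[i, p]
    return np.sqrt(np.maximum(lam[:n_components], 0)) * U[:, :n_components]

scores = kernel_pca(np.random.randn(50, 5), sigma=2.0)
print(scores.shape)   # (50, 2)
```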

slide-33
SLIDE 33

Derivation of kernel method in general

– Consider feature vectors with kernels:  ⟨Φ(X_i), Φ(X_j)⟩ = k(X_i, X_j).
– A linear method is applied to the feature vectors (kernelization).
– Typically, only the inner products ⟨f, Φ(X_i)⟩ = f(X_i) are used to express the objective function of the method.
– The solution is given in the form f = Σ_{i=1}^n c_i Φ(X_i) (representer theorem, see Sec. IV), and thus everything is written in terms of Gram matrices.

These steps are common to the derivation of any kernel method.
II-9

slide-34
SLIDE 34

Example of Kernel PCA

 Wine data (from UCI repository)

13-dimensional chemical measurements for three types of wine; 178 data. Class labels are NOT used in PCA, but are shown in the figures. First two principal components:

[Figure: projections onto the first two principal components — linear PCA vs. kernel PCA (Gaussian kernel, σ = 3).]
II-10

slide-35
SLIDE 35
[Figure: kernel PCA (Gaussian kernel) of the wine data for σ = 2, 4, 5, with k_G(x, y) = exp( −‖x − y‖² / (2σ²) ).]
II-11

slide-36
SLIDE 36

Noise Reduction with Kernel PCA

– PCA can be used for noise reduction (principal directions represent the signal, the other directions noise).
– Applying kernel PCA to noise reduction:
  • Compute the d-dimensional subspace V_d spanned by the d principal directions.
  • For a new data point x, let GΦ(x) be the projection of Φ(x) onto V_d — the noise-reduced feature vector.
  • Compute a preimage in the data space for the noise-reduced feature vector:

x̂ = argmin_{x'} ‖ Φ(x') − GΦ(x) ‖².

Note: GΦ(x) is not necessarily given as the image Φ(x') of any point x', hence the preimage problem.

[Figure: Φ(x), its projection GΦ(x), and the preimage x̂.]
II-12

slide-37
SLIDE 37

 USPS data

[Figure: USPS digits — original data (not used for PCA), noisy images, noise-reduced images by linear PCA, and noise-reduced images by kernel PCA (Gaussian kernel). Generated with the Matlab stprtool toolbox (by V. Franc).]

II-13

slide-38
SLIDE 38

Properties of Kernel PCA

– Nonlinear features can be extracted.
– Can be used as preprocessing for other analyses such as classification (dimension reduction / feature extraction).
– The results depend on the choice of kernel and kernel parameters, and interpreting the results may not be straightforward.
– How to choose a kernel and kernel parameters?
  • Cross-validation is not straightforward, unlike SVM.
  • If kernel PCA is used as preprocessing, the performance of the final analysis should be maximized.

II-14

slide-39
SLIDE 39

Kernel Canonical Correlation Analysis

II-15

slide-40
SLIDE 40

Canonical Correlation Analysis

– Canonical correlation analysis (CCA, Hotelling 1936): linear dependence of two multivariate random vectors.
  • Data: (X_1, Y_1), …, (X_N, Y_N);  X_i: m-dimensional, Y_i: ℓ-dimensional.

Find directions a and b so that the correlation of aᵀX and bᵀY is maximized:

ρ = max_{a,b} Corr[aᵀX, bᵀY] = max_{a,b} Cov[aᵀX, bᵀY] / √( Var[aᵀX] Var[bᵀY] ).

[Figure: projections u = aᵀX and v = bᵀY.]

II-16
II-16

slide-41
SLIDE 41

 Solution of CCA

max_{a,b}  aᵀ V_XY b   subject to   aᵀ V_XX a = bᵀ V_YY b = 1.

– Rewritten as a generalized eigenproblem [Exercise: derive this. Hint: use the Lagrange multiplier method.]:

( O  V_XY ; V_YX  O ) (a; b) = ρ ( V_XX  O ; O  V_YY ) (a; b).

– Solution:  a = V_XX^{−1/2} u_1,  b = V_YY^{−1/2} v_1,  where u_1 (resp. v_1) is the left (resp. right) first singular vector of V_XX^{−1/2} V_XY V_YY^{−1/2}.

II-17

slide-42
SLIDE 42

Kernel CCA

– Kernel CCA (Akaho 2000, Melzer et al. 2002, Bach & Jordan 2002):
  • captures dependence (not only correlation) of two random variables.
  • Data: (X_1, Y_1), …, (X_N, Y_N), arbitrary variables.
  • Consider CCA for the feature vectors with kernels k_X and k_Y:

X_1, …, X_N ↦ Φ_X(X_1), …, Φ_X(X_N) ∈ H_X,   Y_1, …, Y_N ↦ Φ_Y(Y_1), …, Φ_Y(Y_N) ∈ H_Y.

max_{f∈H_X, g∈H_Y}  Cov[f(X), g(Y)] / √( Var[f(X)] Var[g(Y)] )
  = max_{f∈H_X, g∈H_Y}  Σ_i ⟨f, Φ_X(X_i)⟩ ⟨Φ_Y(Y_i), g⟩ / √( Σ_i ⟨f, Φ_X(X_i)⟩²  Σ_i ⟨g, Φ_Y(Y_i)⟩² ).

II-18

slide-43
SLIDE 43

– We can assume f = Σ_{i=1}^N α_i Φ_X(X_i) and g = Σ_{i=1}^N β_i Φ_Y(Y_i). Then the problem becomes

max_{α, β ∈ R^N}  αᵀ K̃_X K̃_Y β / √( αᵀ K̃_X² α · βᵀ K̃_Y² β ),

where K̃_X and K̃_Y are the centered Gram matrices.

– Regularization (to avoid trivial solutions):

max_{f, g}  Σ_i ⟨f, Φ_X(X_i)⟩⟨Φ_Y(Y_i), g⟩ / √( ( Σ_i ⟨f, Φ_X(X_i)⟩² + ε_N ‖f‖²_{H_X} ) ( Σ_i ⟨g, Φ_Y(Y_i)⟩² + ε_N ‖g‖²_{H_Y} ) ).

– Solution: generalized eigenproblem (of the same kind as kernel PCA):

( O  K̃_X K̃_Y ; K̃_Y K̃_X  O ) (α; β) = ρ ( K̃_X² + ε_N K̃_X  O ; O  K̃_Y² + ε_N K̃_Y ) (α; β).

II-19
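A rough sketch of the regularized generalized eigenproblem above (assumptions: Gaussian kernels, scipy's symmetric generalized eigensolver, and a small extra jitter added for numerical stability; eps is the regularization coefficient from the slide).

```python
import numpy as np
from scipy.linalg import eigh

def center_gram(K):
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J

def gauss_gram(X, sigma):
    sq = np.sum(X**2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))

def kernel_cca(X, Y, sigma_x=1.0, sigma_y=1.0, eps=1e-2, jitter=1e-9):
    n = X.shape[0]
    Kx = center_gram(gauss_gram(X, sigma_x))
    Ky = center_gram(gauss_gram(Y, sigma_y))
    Z = np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])                        # left-hand side
    B = np.block([[Kx @ Kx + eps * Kx, Z], [Z, Ky @ Ky + eps * Ky]])  # right-hand side
    B += jitter * np.eye(2 * n)                                       # numerical safeguard
    vals, vecs = eigh(A, B)                        # generalized symmetric eigenproblem
    alpha, beta = vecs[:n, -1], vecs[n:, -1]       # top canonical directions
    return vals[-1], alpha, beta

rho, a, b = kernel_cca(np.random.randn(60, 3), np.random.randn(60, 2))
print(rho)
```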

slide-44
SLIDE 44

Application of KCCA

– Application to image retrieval (Hardoon et al. 2004).
  • X_i: image;  Y_i: corresponding text (extracted from the same webpage).
  • Idea: use the d eigenvectors f_1, …, f_d and g_1, …, g_d as feature spaces that contain the dependence between X and Y.
  • Given a new text Y_new, compute its feature vector ( ⟨g_1, Φ_Y(Y_new)⟩, …, ⟨g_d, Φ_Y(Y_new)⟩ ) and find the image X_i whose feature vector ( ⟨f_1, Φ_X(X_i)⟩, …, ⟨f_d, Φ_X(X_i)⟩ ) has the highest inner product with it.
  • Example text: “at phoenix sky harbor on july 6, 1997. 757-2s7, n907wa, ...”

II-20
f

slide-45
SLIDE 45

– Example:

  • Gaussian kernel for images.
  • Bag-of-words kernel (frequency of words) for texts.

Text -- “height: 6-11, weight: 235 lbs, position: forward, born: september 18, 1968, split, croatia college: none” Extracted images

II-21

slide-46
SLIDE 46

Kernel Ridge Regression

II-22

slide-47
SLIDE 47

Ridge regression

(X_1, Y_1), …, (X_n, Y_n): data  (X_i ∈ R^m, Y_i ∈ R).

Ridge regression: linear regression with an L² penalty:

min_a  Σ_{i=1}^n | Y_i − aᵀX_i |² + λ ‖a‖².

– Solution (minimizer of the quadratic objective):

â = ( XᵀX + λ I_m )^{−1} XᵀY,   where  X = ( X_1ᵀ ; ⋮ ; X_nᵀ ) ∈ R^{n×m},  Y = ( Y_1, …, Y_n )ᵀ ∈ R^n;

equivalently  â = ( V_XX + (λ/n) I_m )^{−1} (1/n) XᵀY  with  V_XX = (1/n) XᵀX.

– Ridge regression is preferred when V_XX is (close to) singular.
(The constant term is omitted for simplicity.)

II-23

slide-48
SLIDE 48

Kernel ridge regression

– (X_1, Y_1), …, (X_n, Y_n):  X_i arbitrary data, Y_i ∈ R.
– Kernelization of ridge regression with a positive definite kernel k for X:

min_{f∈H}  Σ_{i=1}^n | Y_i − ⟨f, Φ(X_i)⟩_H |² + λ ‖f‖²_H,   equivalently   min_{f∈H}  Σ_{i=1}^n | Y_i − f(X_i) |² + λ ‖f‖²_H.

Ridge regression on H = nonlinear ridge regression.

II-24

slide-49
SLIDE 49

– The solution has the form f = Σ_{j=1}^n c_j Φ(X_j):

Let f = Σ_j c_j Φ(X_j) + f⊥ = f_Φ + f⊥, where f_Φ ∈ span{ Φ(X_j) } and f⊥ is in the orthogonal complement. Then

Objective = Σ_j ( Y_j − ⟨f_Φ + f⊥, Φ(X_j)⟩ )² + λ ‖f_Φ + f⊥‖²
          = Σ_j ( Y_j − ⟨f_Φ, Φ(X_j)⟩ )² + λ ( ‖f_Φ‖² + ‖f⊥‖² ).

The first term does not depend on f⊥, and the second is minimized at f⊥ = 0.

– Objective function in terms of c:  ‖ Y − K_X c ‖² + λ cᵀ K_X c.
– Solution:  ĉ = ( K_X + λ I_n )^{−1} Y.   Function:  f̂(x) = Yᵀ ( K_X + λ I_n )^{−1} k_X(x),  where  k_X(x) = ( k(x, X_1), …, k(x, X_n) )ᵀ.

II-25
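A minimal sketch of the kernel ridge regression solution above (assumptions: a Gaussian kernel and the parameter name lam for the regularization coefficient λ).

```python
import numpy as np

def gauss_gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kernel_ridge_fit(X, Y, sigma=1.0, lam=1e-2):
    K = gauss_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(Y)), Y)   # c = (K + lam I)^{-1} Y

def kernel_ridge_predict(c, X_train, X_new, sigma=1.0):
    return gauss_gram(X_new, X_train, sigma) @ c           # f(x) = sum_j c_j k(x, X_j)

X = np.random.randn(100, 3)
Y = np.sin(X[:, 0]) + 0.1 * np.random.randn(100)
c = kernel_ridge_fit(X, Y, sigma=1.5, lam=0.1)
print(kernel_ridge_predict(c, X, X[:5], sigma=1.5))
```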

slide-50
SLIDE 50

Regularization

– The minimization min_f Σ_j ( Y_j − f(X_j) )² may be attained with zero errors, but such a function may not be unique.
– Regularization:

min_{f∈H}  Σ_{j=1}^n | Y_j − f(X_j) |² + λ ‖f‖²_H.

  • Regularization with a smoothness penalty is preferred for uniqueness and smoothness.
  • The link between the RKHS norm and smoothness is discussed in Sec. IV.

II-26

slide-51
SLIDE 51

Comparison

 Kernel ridge regression vs. local linear regression

Y = 1 / ( 1.5 + ‖X‖² ) + ε,   X ~ N(0, I_d),   ε ~ N(0, 0.1²),   n = 100, 500 runs.

Kernel ridge regression with the Gaussian kernel; local linear regression with the Epanechnikov kernel (‘locfit’ in R). Bandwidth parameters are chosen by CV.

[Figure: mean square errors vs. the dimension of X (1 to 100) for the kernel method and local linear regression.]

II-27

slide-52
SLIDE 52

 Local linear regression (e.g., Fan and Gijbels 1996)

– K: smoothing kernel ( K(x) ≥ 0, ∫ K(x) dx = 1, not necessarily positive definite ).
– Local linear regression:  E[Y | X = x₀] is estimated by solving

min_{a,b}  Σ_{i=1}^n ( Y_i − a − bᵀ(X_i − x₀) )² K_h(X_i − x₀),   K_h(x) = h^{−d} K(x/h).

  • For each x₀, this minimization can be solved by linear algebra.
  • The statistical properties of this estimator are well studied.
  • For one-dimensional X it works nicely with some theoretical optimality, but it is weak for high-dimensional data.
II-28

slide-53
SLIDE 53

Some topics on kernel methods

II-29

  • Representer theorem
  • Structured data
  • Kernel choice
  • Low rank approximation
slide-54
SLIDE 54

Representer theorem

(X_1, Y_1), …, (X_n, Y_n): data;  k: positive definite kernel for X;  H: corresponding RKHS;  Ψ: monotonically increasing function on R₊.

Theorem 2.1 (representer theorem, Kimeldorf & Wahba 1970)  The solution to the minimization problem

min_{f∈H}  L( (X_1, Y_1, f(X_1)), …, (X_n, Y_n, f(X_n)) ) + Ψ( ‖f‖ )

is attained by f = Σ_{j=1}^n c_j Φ(X_j) with some c_1, …, c_n ∈ R.

The proof is essentially the same as the one for kernel ridge regression. [Exercise: complete the proof]
II-30

slide-55
SLIDE 55

Structured Data

– Structured data: non-vectorial data with some structure.

  • Sequence data (variable length):

DNA sequence, Protein (sequence of amino acid) Text (sequence of words)

  • Graph data (Koji’s lecture)

Chemical compound etc.

  • Tree data

Parse tree

  • Histograms / probability

measures – Many kernels uses counts of substructures (Haussler 1999).

S NP VP Det N The cat chased the mouse. V Det N NP

II-31

slide-56
SLIDE 56

Spectrum kernel

– p-spectrum kernel (Leslie et al. 2002): a positive definite kernel for strings.

k_p(s, t) = number of occurrences of common substrings of length p (counted with multiplicity).

– Example: s = “statistics”, t = “pastapistan”, 3-spectrum:
s: sta, tat, ati, tis, ist, sti, tic, ics
t: pas, ast, sta, tap, api, pis, ist, sta, tan

       sta tat ati tis ist sti tic ics pas ast tap api pis tan
Φ(s):   1   1   1   1   1   1   1   1   0   0   0   0   0   0
Φ(t):   2   0   0   0   1   0   0   0   1   1   1   1   1   1

K₃(s, t) = 1·2 + 1·1 = 3.

– A linear-time ( O( p(|s| + |t|) ) ) algorithm with suffix trees is known (Vishwanathan & Smola 2003).
II-32
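A small sketch of the p-spectrum kernel above as a direct dictionary-based count (O(|s||t|) in the worst case, rather than the linear-time suffix-tree algorithm the slide mentions).

```python
from collections import Counter

def spectrum_kernel(s, t, p=3):
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    # inner product of the substring-count feature vectors
    return sum(cs[w] * ct[w] for w in cs if w in ct)

print(spectrum_kernel("statistics", "pastapistan", p=3))   # 3, as in the example
```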

slide-57
SLIDE 57

– Application: kernel PCA of ‘words’ with the 3-spectrum kernel.

[Figure: 2-D kernel PCA plot of words such as mathematics, engineering, pioneering, cybernetics, physics, psychology, metaphysics, methodology, biology, biostatistics, statistics, informatics, bioinformatics.]

II-33

slide-58
SLIDE 58

Choice of kernel

 Choice of kernel

– Choice of the kernel family (e.g. polynomial or Gaussian).
– Choice of parameters (e.g. the bandwidth parameter in the Gaussian kernel).

 General principles

– Reflect the structure of the data (e.g., kernels for structured data).
– For supervised learning (e.g., SVM): cross-validation.
– For unsupervised learning (e.g., kernel PCA):
  • No general method exists.
  • Guideline: make or use a relevant supervised problem, and use CV.
– Learning a kernel: multiple kernel learning (MKL)

k(x, y) = Σ_{i=1}^M c_i k_i(x, y),   optimize the coefficients c_i.

II-34

slide-59
SLIDE 59

Low rank approximation

– Gram matrix: n × n, where n is the sample size. Large n causes computational problems; e.g. inversion or eigendecomposition costs O(n³) time.
– Low-rank approximation:

K ≈ R Rᵀ,   where R is an n × r matrix (r < n).

– The decay of the eigenvalues of a Gram matrix is often quite fast (see Widom 1963, 1964; Bach & Jordan 2002).

[Diagram: the n × n matrix K approximated by the product of an n × r factor R and its transpose.]

II-35

slide-60
SLIDE 60

– Two major methods:
  • Incomplete Cholesky factorization (Fine & Scheinberg 2001): O(nr²) time and O(nr) space.
  • Nyström approximation (Williams & Seeger 2001): random sampling + eigendecomposition.

– Example: kernel ridge regression. Computing Yᵀ( K_X + λ I_n )^{−1} k_X(x) costs O(n³) time. With the low-rank approximation K_X ≈ R Rᵀ and the Woodbury formula*,

Yᵀ( K_X + λ I_n )^{−1} k_X(x) ≈ Yᵀ( R Rᵀ + λ I_n )^{−1} k_X(x) = (1/λ) [ Yᵀ k_X(x) − Yᵀ R ( RᵀR + λ I_r )^{−1} Rᵀ k_X(x) ],

which costs O(r²n + r³) time.

* Woodbury (Sherman–Morrison–Woodbury) formula:  ( A + UV )^{−1} = A^{−1} − A^{−1} U ( I + V A^{−1} U )^{−1} V A^{−1}.
II-36
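A sketch of the Woodbury speed-up above applied to kernel ridge regression, with the low-rank factor R built Nyström-style from a random column subset (that construction is an assumption here, not the only choice).

```python
import numpy as np

def gauss_gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def low_rank_kernel_ridge(X, Y, x_new, sigma=1.0, lam=1e-2, r=50, seed=0):
    n = len(Y)
    idx = np.random.default_rng(seed).choice(n, size=r, replace=False)
    C = gauss_gram(X, X[idx], sigma)                     # n x r columns of K
    W = gauss_gram(X[idx], X[idx], sigma)                # r x r block
    # Nystrom factor R with R R^T ~ K:  R = C W^{-1/2}
    U, s, _ = np.linalg.svd(W)
    R = C @ U @ np.diag(1.0 / np.sqrt(np.maximum(s, 1e-12))) @ U.T
    kx = gauss_gram(X, x_new, sigma)                     # k_X(x) for the new points
    # Woodbury: (R R^T + lam I)^{-1} k = (k - R (R^T R + lam I)^{-1} R^T k) / lam
    inner = np.linalg.solve(R.T @ R + lam * np.eye(r), R.T @ kx)
    alpha = (kx - R @ inner) / lam
    return Y @ alpha                                     # predictions Y^T (K + lam I)^{-1} k_X(x)

X = np.random.randn(500, 4)
Y = np.sin(X[:, 0])
print(low_rank_kernel_ridge(X, Y, X[:3]))
```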

slide-61
SLIDE 61

Other kernel methods

– Kernel Fisher discriminant analysis (kernel FDA) (Mika et al. 1999)
– Kernel logistic regression (Roth 2001; Zhu & Hastie 2005)
– Kernel partial least squares (kernel PLS) (Rosipal & Trejo 2001)
– Kernel K-means clustering (Dhillon et al. 2004)
etc., etc., ...
– Variants of SVM → Section III.

II-37

slide-62
SLIDE 62

Summary: Properties of kernel methods

– Various classical linear methods can be kernelized → linear algorithms on the RKHS.
– The solution typically has the form f = Σ_{i=1}^n c_i Φ(X_i)  (representer theorem).
– The problem is reduced to manipulation of Gram matrices of size n (sample size).
  • Advantage for high-dimensional data.
  • For a large number of data, low-rank approximation is used effectively.
– Structured data:
  • a kernel can be defined on any set,
  • so kernel methods can be applied to any type of data.

II-38

slide-63
SLIDE 63

References

  • Akaho. (2000) Kernel Canonical Correlation Analysis. Proc. 3rd Workshop on

Induction-based Information Sciences (IBIS2000). (in Japanese) Bach, F.R. and M.I. Jordan. (2002) Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48. Dhillon, I. S., Y. Guan, and B. Kulis. (2004) Kernel k-means, spectral clustering and normalized cuts. Proc. 10th ACM SIGKDD Intern. Conf. Knowledge Discovery and Data Mining (KDD), 551–556. Fan, J. and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman Hall/CRC, 1996. Fine, S. and K. Scheinberg. (2001) Efficient SVM Training Using Low-Rank Kernel

  • Representations. Journal of Machine Learning Research, 2:243-264.

Gökhan, B., T. Hofmann, B. Schölkopf, A.J. Smola, B. Taskar, S.V.N. Vishwanathan. (2007) Predicting Structured Data. MIT Press. Hardoon, D.R., S. Szedmak, and J. Shawe-Taylor. (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639– 2664. Haussler, D. Convolution kernels on discrete structures. Tech Report UCSC-CRL-99-10, Department of Computer Science, University of California at Santa Cruz, 1999.

II-39

slide-64
SLIDE 64

Leslie, C., E. Eskin, and W.S. Noble. (2002) The spectrum kernel: A string kernel for SVM protein classification, in Proc. Pacific Symposium on Biocomputing, 564–575. Melzer, T., M. Reiter, and H. Bischof. (2001) Nonlinear feature extraction using generalized canonical correlation analysis. Proc. Intern. Conf. .Artificial Neural Networks (ICANN 2001), 353–360. Mika, S., G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. (1999) Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, edits, Neural Networks for Signal Processing, volume IX, 41–48. IEEE. Rosipal, R. and L.J. Trejo. (2001) Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research, 2: 97–123. Roth, V. (2001) Probabilistic discriminative kernel classifiers for multi-class problems. In Pattern Recognition: Proc. 23rd DAGM Symposium, 246–253. Springer. Schölkopf, B., A. Smola, and K-R. Müller. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319. Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press. 2002. Vishwanathan, S. V. N. and A.J. Smola. (2003) Fast kernels for string and tree matching. Advances in Neural Information Processing Systems 15, 569–576. MIT Press. Williams, C. K. I. and M. Seeger. (2001) Using the Nyström method to speed up kernel

  • machines. Advances in Neural Information Processing Systems, 13:682–688.

II-40

slide-65
SLIDE 65

Widom, H. (1963) Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109:278–295.
Widom, H. (1964) Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17:215–229.

II-41

slide-66
SLIDE 66

Appendix

II-42

slide-67
SLIDE 67

Exercise for kernel PCA

– ‖f‖²_H = ⟨ Σ_{j=1}^n c_j Φ̃(X_j), Σ_{k=1}^n c_k Φ̃(X_k) ⟩ = Σ_{j,k} c_j c_k ⟨ Φ̃(X_j), Φ̃(X_k) ⟩ = cᵀ K̃_X c.

– Var[ ⟨f, Φ̃(X)⟩ ] = (1/n) Σ_{k=1}^n ⟨ Σ_{j=1}^n c_j Φ̃(X_j), Φ̃(X_k) ⟩²
  = (1/n) Σ_{k=1}^n ( Σ_j c_j K̃_{jk} ) ( Σ_h c_h K̃_{hk} ) = (1/n) cᵀ K̃_X² c.

II-43

slide-68
SLIDE 68

1

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7

Machine Learning Summer School 2012, Kyoto

  • III. Support Vector Machines

A Brief Introduction

1

slide-69
SLIDE 69

Large margin classifier

– (X_1, Y_1), …, (X_n, Y_n): training data
  • X_i: input (m-dimensional)
  • Y_i ∈ {±1}: binary teaching data
– Linear classifier:

f_w(x) = wᵀx + b,   h(x) = sgn( f_w(x) ).

We wish to construct f_w from the training data so that a new input x is correctly classified.

[Figure: separating hyperplane, with f_w(x) ≥ 0 on one side and f_w(x) < 0 on the other.]

III-2

slide-70
SLIDE 70
 Large margin criterion

Assumption: the data are linearly separable. Among the infinitely many separating hyperplanes, choose the one that gives the largest margin.
– Margin = distance between the two classes measured along the direction of w.
– Support vector machine: the hyperplane that gives the largest margin.
– The classifier is the middle of the margin.
– “Supporting points” on the two boundaries are called support vectors.

[Figure: two classes of points, the maximum-margin hyperplane, the margin, and the support vectors.]
III-3

slide-71
SLIDE 71

– Fix the scale (rescaling of (w, b) does not change the hyperplane):

min_{i: Y_i = 1} ( wᵀX_i + b ) = 1,   max_{i: Y_i = −1} ( wᵀX_i + b ) = −1.

Then Margin = 2 / ‖w‖.   [Exercise] Prove this.
III-4

slide-72
SLIDE 72

 Support vector machine (linear, hard margin)  (Boser, Guyon, Vapnik 1992)

Objective function:

max_{w,b}  1/‖w‖   subject to   wᵀX_i + b ≥ 1 if Y_i = 1,   wᵀX_i + b ≤ −1 if Y_i = −1,

or equivalently

SVM (hard margin):   min_{w,b}  ‖w‖²   subject to   Y_i( wᵀX_i + b ) ≥ 1   (∀i).

– Quadratic program (QP):
  • minimization of a quadratic function with linear constraints;
  • convex, no local optima (Vandenberghe’s lecture);
  • many solvers available (Chih-Jen Lin’s lecture).

III-5

slide-73
SLIDE 73

 Soft-margin SVM

– “Linear separability” is too strong an assumption; relax it:

Hard margin:  Y_i( wᵀX_i + b ) ≥ 1      Soft margin:  Y_i( wᵀX_i + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0.

SVM (soft margin):   min_{w,b,ξ}  ‖w‖² + C Σ_{i=1}^n ξ_i   subject to   Y_i( wᵀX_i + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0   (∀i).

– This is also a QP.
– The coefficient C must be given; cross-validation is often used.

III-6

slide-74
SLIDE 74

[Figure: soft-margin SVM — the hyperplanes wᵀx + b = 1 and wᵀx + b = −1, with slack variables for points violating the margin: wᵀX_i + b = 1 − ξ_i on the positive side and wᵀX_k + b = −1 + ξ_k on the negative side.]

III-7

slide-75
SLIDE 75

SVM and regularization

– Soft-margin SVM is equivalent to the regularization problem

min_{w,b}  Σ_{i=1}^n [ 1 − Y_i( wᵀX_i + b ) ]₊ + λ ‖w‖²   (loss + regularization term),

where [z]₊ = max{0, z}.
  • Loss function: hinge loss  ℓ( f(x), y ) = [ 1 − y f(x) ]₊.
  • c.f. ridge regression (squared error):  min_{w,b}  Σ_{i=1}^n ( Y_i − ( wᵀX_i + b ) )² + λ ‖w‖².

[Exercise] Confirm the above equivalence.

III-8

slide-76
SLIDE 76

Kernelization: nonlinear SVM

– (X_1, Y_1), …, (X_n, Y_n): training data
  • X_i: input on an arbitrary space Ω
  • Y_i ∈ {±1}: binary teaching data
– Kernelization: positive definite kernel k on Ω (RKHS H), feature vectors Φ(X_1), …, Φ(X_n).
– A linear classifier on H gives a nonlinear classifier on Ω:

h(x) = sgn( ⟨f, Φ(x)⟩ + b ) = sgn( f(x) + b ),   f ∈ H.
III-9

slide-77
SLIDE 77

 Nonlinear SVM

min_{f,b}  Σ_{i=1}^n [ 1 − Y_i( f(X_i) + b ) ]₊ + λ ‖f‖²_H,

or equivalently

min_{f,b,ξ}  ‖f‖² + C Σ_{i=1}^n ξ_i   subject to   Y_i( ⟨f, Φ(X_i)⟩ + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0   (∀i).

By the representer theorem, f = Σ_{k=1}^n w_k Φ(X_k), so

nonlinear SVM (soft margin):   min_{w,b,ξ}  wᵀK w + C Σ_{i=1}^n ξ_i   subject to   Y_i( (Kw)_i + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0   (∀i).

  • This is again a QP.
III-10

slide-78
SLIDE 78

 Dual problem

– By the Lagrange multiplier method, the dual problem is

SVM (dual problem):   max_α  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j Y_i Y_j K_{ij}   subject to   0 ≤ α_i ≤ C (∀i),  Σ_{i=1}^n Y_i α_i = 0.

  • The dual problem is often preferred in practice.
  • The classifier is expressed by

f*(x) + b* = Σ_{i=1}^n α_i* Y_i k(x, X_i) + b*.

– Sparse expression: only the data with 0 < α_i* ≤ C appear in the summation → support vectors.
III-11
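A short sketch of nonlinear soft-margin SVM with a precomputed Gram matrix (assumption: scikit-learn's SVC is used as the QP solver; this is an illustration, not the authors' implementation).

```python
import numpy as np
from sklearn.svm import SVC

def gauss_gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 1.0)          # nonlinearly separable labels

K = gauss_gram(X, X, sigma=1.0)
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)     # solves the dual QP
print("number of support vectors:", len(clf.support_))

# Prediction for new points uses the kernel values against the training data:
X_new = rng.normal(size=(5, 2))
print(clf.predict(gauss_gram(X_new, X, sigma=1.0)))
```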

slide-79
SLIDE 79

 KKT condition

Theorem  The solutions of the primal and dual problems of SVM are characterized by the following conditions:
(1)  1 − Y_i( f*(X_i) + b* ) − ξ_i* ≤ 0  (∀i)   [primal constraint]
(2)  ξ_i* ≥ 0  (∀i)   [primal constraint]
(3)  0 ≤ α_i* ≤ C  (∀i)   [dual constraint]
(4)  α_i* ( 1 − Y_i( f*(X_i) + b* ) − ξ_i* ) = 0  (∀i)   [complementary slackness]
(5)  ξ_i* ( C − α_i* ) = 0  (∀i)   [complementary slackness]
(6)  Σ_{k=1}^n K_{ik} w_k* − Σ_{k=1}^n α_k* Y_k K_{ik} = 0  (∀i)
(7)  Σ_{k=1}^n α_k* Y_k = 0.

III-12

slide-80
SLIDE 80

 Sparse expression

– Two types of support vectors:
  • support vectors with 0 < α_i < C  ( Y_i F(X_i) = 1, on the margin boundary );
  • support vectors with α_i = C  ( Y_i F(X_i) ≤ 1, inside or beyond the margin ).

F*(x) = f*(x) + b* = Σ_{i: support vectors} α_i* Y_i k(x, X_i) + b*.

[Figure: the two types of support vectors relative to the margin boundaries.]

III-13

slide-81
SLIDE 81

Summary of SVM

– One of the kernel methods:
  • kernelization of the linear large-margin classifier;
  • computation depends on Gram matrices of size n.
– Quadratic program:
  • no local optima;
  • many solvers are available;
  • further efficient optimization methods are available (e.g. SMO, Platt 1999).
– Sparse representation:
  • the solution is written with a small number of support vectors.
– Regularization:
  • the objective function can be regarded as regularization with the hinge loss function.

III-14

slide-82
SLIDE 82

– NOT discussed on SVM in this lecture:
  • Many successful applications
  • Multi-class extension
    – Combination of binary classifiers (1-vs-1, 1-vs-rest)
    – Generalization of the large-margin criterion: Crammer & Singer (2001), Mangasarian & Musicant (2001), Lee, Lin, & Wahba (2004), etc.
  • Other extensions
    – Support vector regression (Vapnik 1995)
    – ν-SVM (Schölkopf et al. 2000)
    – Structured output (Collins & Duffy 2001, Taskar et al. 2004, Altun et al. 2003, etc.)
    – One-class SVM (Schölkopf et al. 2001)
  • Optimization
    – Solving the primal problem
    – Implementation (Chih-Jen Lin’s lecture)

III-15

slide-83
SLIDE 83

References

Altun, Y., I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proc. 20th Intern. Conf. Machine Learning, 2003. Boser, B.E., I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin

  • classifiers. In D. Haussler, editor, Fifth Annual ACM Workshop on Computational

Learning Theory, pages 144–152, Pittsburgh, PA, 1992. ACM Press. Crammer, K. and Y. Singer. On the algorithmic implementation of multiclass kernel- based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. Collins, M. and N. Duffy. Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14, pages 625–632. MIT Press, 2001. Mangasarian, O. L. and David R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161–177, 2001 Lee, Y., Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99: 67–81, 2004. Schölkopf, B., A. Smola, R.C. Williamson, and P.L. Bartlett. (2000) New support vector algorithms. Neural Computation, 12(5):1207–1245.

III-16

slide-84
SLIDE 84

Schölkopf, B., J.C. Platt, J. Shawe-Taylor, R.C. Williamson, and A.J.Smola. (2001) Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer 1995. Platt, J. Fast training of support vector machines using sequential minimal optimization. In

  • B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods -

Support Vector Learning, pages 185–208. MIT Press, 1999. Books on Application domains: – Lee, S.-W., A. Verri (Eds.) Pattern Recognition with Support Vector Machines: First International Workshop, SVM 2002, Niagara Falls, Canada, August 10, 2002.

  • Proceedings. Lecture Notes in Computer Science 2388, Springer, 2002.

– Joachims, T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Springer, 2002. – Schölkopf, B., K. Tsuda, J.-P. Vert (Eds.). Kernel Methods in Computational

  • Biology. MIT Press, 2004.

III-17

slide-85
SLIDE 85

1

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7

Machine Learning Summer School 2012, Kyoto

  • IV. Theoretical Backgrounds of

Kernel Methods

1

slide-86
SLIDE 86

C-valued positive definite kernel

Definition.  Ω: a set.  k: Ω × Ω → C is a positive definite kernel if for arbitrary x_1, …, x_n ∈ Ω and c_1, …, c_n ∈ C,

Σ_{i,j=1}^n c_i c̄_j k(x_i, x_j) ≥ 0.

Remark: From the above condition, the Gram matrix ( k(x_i, x_j) )_{ij} is necessarily Hermitian, i.e. k(y, x) = conj( k(x, y) ).   [Exercise]
IV-2

slide-87
SLIDE 87

Operations that preserve positive definiteness

Proposition 4.1  If k_i: Ω × Ω → C (i = 1, 2, …) are positive definite kernels, then so are the following:
1. (non-negative combination)  a k_1 + b k_2  (a, b ≥ 0);
2. (product)  k_1 k_2:  (x, y) ↦ k_1(x, y) k_2(x, y);
3. (limit)  lim_{i→∞} k_i(x, y), assuming the limit exists.

Proof. 1 and 3 are trivial from the definition. For 2, it suffices to prove that the Hadamard product (element-wise product) of two positive semidefinite matrices is positive semidefinite.

Remark: The set of positive definite kernels on Ω is a closed cone, where the topology is defined by point-wise convergence.
IV-3

slide-88
SLIDE 88

Proposition 4.2  Let A and B be positive semidefinite Hermitian matrices. Then the Hadamard product K = A ∗ B (element-wise product) is positive semidefinite.

Proof)  Eigendecompose A = U Λ Uᴴ with Λ = diag(λ_1, …, λ_n), i.e. A_{ij} = Σ_{p=1}^n λ_p U_i^p conj(U_j^p)  (λ_p ≥ 0 by the positive semidefiniteness). Then

Σ_{i,j=1}^n c_i c̄_j K_{ij} = Σ_{i,j} c_i c̄_j A_{ij} B_{ij} = Σ_{p=1}^n λ_p Σ_{i,j=1}^n ( c_i U_i^p ) conj( c_j U_j^p ) B_{ij} ≥ 0.

IV-4

slide-89
SLIDE 89

Normalization

Proposition 4.3  Let k be a positive definite kernel on Ω and f: Ω → C an arbitrary function. Then

k̃(x, y) := f(x) k(x, y) conj( f(y) )

is positive definite. In particular, f(x) conj( f(y) ) is a positive definite kernel.

– Proof [Exercise]
– Example (normalization):  k̃(x, y) = k(x, y) / √( k(x, x) k(y, y) )  is positive definite, and ‖Φ̃(x)‖ = 1 for every x ∈ Ω.
IV-5

slide-90
SLIDE 90

Proof of positive definiteness

– Euclidean inner product xᵀy: easy (Prop. 1.1).
– Polynomial kernel ( xᵀy + c )^d  (c ≥ 0):

( xᵀy + c )^d = ( xᵀy )^d + a_1 ( xᵀy )^{d−1} + ⋯ + a_d,   a_i ≥ 0,

a product and non-negative combination of p.d. kernels.
– Gaussian RBF kernel exp( −‖x − y‖² / (2σ²) ):

exp( −‖x − y‖² / (2σ²) ) = e^{−‖x‖²/(2σ²)} e^{ xᵀy / σ² } e^{−‖y‖²/(2σ²)},

and e^{ xᵀy / σ² } = 1 + (1/1!) (xᵀy/σ²) + (1/2!) ( xᵀy/σ² )² + ⋯ is positive definite (Prop. 4.1). Proposition 4.3 then completes the proof.
– The Laplacian kernel is discussed later.
IV-6

slide-91
SLIDE 91

Shift-invariant kernel

– A positive definite kernel k(x, y) on R^m is called shift-invariant if it has the form k(x, y) = φ(x − y).
– Examples: Gaussian and Laplacian kernels.
– Fourier kernel (C-valued positive definite kernel): for each ω,

k_F(x, y) = exp( √−1 ωᵀ(x − y) ) = exp( √−1 ωᵀx ) conj( exp( √−1 ωᵀy ) )   (Prop. 4.3).

– If k(x, y) = φ(x − y) is positive definite, the function φ is called positive definite.
IV-7

slide-92
SLIDE 92

Bochner’s theorem

Theorem 4.4 (Bochner)  Let φ be a continuous function on R^m. Then φ is a (C-valued) positive definite function if and only if there is a finite non-negative Borel measure Λ on R^m such that

φ(z) = ∫ exp( √−1 ωᵀz ) dΛ(ω).

– Bochner’s theorem characterizes all continuous shift-invariant positive definite kernels. The functions { exp( √−1 ωᵀz ) : ω ∈ R^m } are the generators of the cone (see Prop. 4.1).
– Λ is the inverse Fourier (or Fourier–Stieltjes) transform of φ.
– Roughly speaking, the shift-invariant positive definite functions are the class with non-negative Fourier transform.
– Sufficiency is easy:  Σ_{i,j} c_i c̄_j φ(z_i − z_j) = ∫ | Σ_i c_i e^{√−1 ωᵀz_i} |² dΛ(ω) ≥ 0.  Necessity is more difficult.

IV-8

slide-93
SLIDE 93

RKHS in the frequency domain

Suppose a (shift-invariant) kernel k has the form

k(x, y) = ∫ exp( √−1 ωᵀ(x − y) ) ρ(ω) dω,   ρ(ω) > 0.

Then the RKHS H_k is given by

H_k = { f ∈ L²(R^m, dx) : ∫ |f̂(ω)|² / ρ(ω) dω < ∞ },   ⟨f, g⟩ = ∫ f̂(ω) conj( ĝ(ω) ) / ρ(ω) dω,

where f̂ is the Fourier transform of f:  f̂(ω) = (2π)^{−m} ∫ f(x) exp( −√−1 ωᵀx ) dx.

IV-9

slide-94
SLIDE 94

 Gaussian kernel

k_G(x, y) = exp( −‖x − y‖² / (2σ²) ),   ρ_G(ω) ∝ exp( −σ²‖ω‖² / 2 ),

H_{k_G} = { f ∈ L²(R^m, dx) : ∫ |f̂(ω)|² exp( σ²‖ω‖²/2 ) dω < ∞ },   ⟨f, g⟩ ∝ ∫ f̂(ω) conj( ĝ(ω) ) exp( σ²‖ω‖²/2 ) dω.

 Laplacian kernel (on R)

k_L(x, y) = exp( −β|x − y| ),   ρ_L(ω) ∝ 1 / ( ω² + β² ),

H_{k_L} = { f ∈ L²(R, dx) : ∫ |f̂(ω)|² ( ω² + β² ) dω < ∞ },   ⟨f, g⟩ ∝ ∫ f̂(ω) conj( ĝ(ω) ) ( ω² + β² ) dω.

– The required decay of f̂ at high frequencies differs between the Gaussian and Laplacian kernels.

IV-10

slide-95
SLIDE 95

RKHS by polynomial kernel

– Polynomial kernel on R:  k_P(x, y) = ( xy + c )^d,  c ≥ 0,  d ∈ N.

k_P(x, z₀) = z₀^d x^d + ( d choose 1 ) c z₀^{d−1} x^{d−1} + ( d choose 2 ) c² z₀^{d−2} x^{d−2} + ⋯ + c^d.

The span of these functions (over z₀) is the set of polynomials of degree at most d.

Proposition 4.5  If c ≠ 0, the RKHS is the space of polynomials of degree at most d.

[Proof: exercise. Hint: find coefficients b_i satisfying Σ_{i=0}^d b_i k(x, z_i) = Σ_{i=0}^d a_i x^i as the solution of a linear equation.]

IV-11

slide-96
SLIDE 96

Sum and product

(H_1, k_1), (H_2, k_2): two RKHSs and positive definite kernels on Ω.

 Sum — RKHS for k_1 + k_2:

H_1 + H_2 = { f: Ω → R | ∃ f_1 ∈ H_1, ∃ f_2 ∈ H_2, f = f_1 + f_2 },
‖f‖² = min{ ‖f_1‖²_{H_1} + ‖f_2‖²_{H_2} : f = f_1 + f_2, f_1 ∈ H_1, f_2 ∈ H_2 }.

 Product — RKHS for k_1 k_2:

H_1 ⊗ H_2 = the tensor product as a vector space; { f = Σ_{i=1}^n f_i g_i : f_i ∈ H_1, g_i ∈ H_2 } is dense in H_1 ⊗ H_2, with

⟨ Σ_{i=1}^n f_i^{(1)} g_i^{(1)}, Σ_{j=1}^m f_j^{(2)} g_j^{(2)} ⟩ = Σ_{i=1}^n Σ_{j=1}^m ⟨ f_i^{(1)}, f_j^{(2)} ⟩_{H_1} ⟨ g_i^{(1)}, g_j^{(2)} ⟩_{H_2}.

IV-12

slide-97
SLIDE 97

Summary of Section IV

– Positive definiteness of kernels is preserved by
  • non-negative combinations,
  • products,
  • point-wise limits,
  • normalization.
– Bochner’s theorem: characterization of the continuous shift-invariant kernels on R^m.
– Explicit forms of RKHSs:
  • the RKHS of a shift-invariant kernel has an explicit expression in the frequency domain;
  • polynomial kernels give RKHSs of polynomials;
  • sums and products of kernels give sums and tensor products of RKHSs.

IV-13

slide-98
SLIDE 98

References

Aronszajn., N. Theory of reproducing kernels. Trans. American Mathematical Society, 68(3):337–404, 1950. Saitoh., S. Integral transforms, reproducing kernels, and their applications. Addison Wesley Longman, 1997.

IV-14

slide-99
SLIDE 99

Solution to Exercises

 C-valued positive definiteness

Using the definition with a single point, k(x, x) is real and non-negative for every x. For any x and y, applying the definition with coefficients (c, 1), c ∈ C, we have

|c|² k(x, x) + c k(x, y) + c̄ k(y, x) + k(y, y) ≥ 0.

Since the left-hand side is real, its complex conjugate also satisfies

|c|² k(x, x) + c̄ conj( k(x, y) ) + c conj( k(y, x) ) + k(y, y) ≥ 0.

The difference of the two left-hand sides,

c ( k(x, y) − conj( k(y, x) ) ) − conj( c ( k(x, y) − conj( k(y, x) ) ) ),

is therefore real. On the other hand, α − conj(α) is purely imaginary for any complex number α, so c ( k(x, y) − conj( k(y, x) ) ) must be real for every c ∈ C, which forces k(x, y) − conj( k(y, x) ) = 0. This implies k(y, x) = conj( k(x, y) ).
IV-15

slide-100
SLIDE 100

1

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies September 6-7 Machine Learning Summer School 2012, Kyoto

  • V. Nonparametric Inference with

Positive Definite Kernels

1

slide-101
SLIDE 101

Outline

  • 1. Mean and Variance on RKHS
  • 2. Statistical tests with kernels
  • 3. Conditional probabilities and beyond.

V-2

slide-102
SLIDE 102

Introduction

 “Kernel methods” for statistical inference

– We have seen that kernelization of linear methods provides nonlinear methods, which capture the ‘nonlinearity’ or ‘higher-order moments’ of the original data, e.g. nonlinear SVM, kernel PCA, kernel CCA, etc.
– This section discusses more basic statistics on the RKHS:

Ω (original space) → H (RKHS) by the feature map X ↦ Φ(X) = k(·, X);

mean, covariance, and conditional covariance on H.

V-3

slide-103
SLIDE 103

Mean and covariance on RKHS

V-4

slide-104
SLIDE 104

Mean on RKHS

X: random variable taking value on a measurable space Ω k: measurable positive definite kernel on Ω. H: RKHS.

Always assumes “bounded” kernel for simplicity: sup

  • , ∞.

: feature vector = random variable on RKHS. – Definition. The kernel mean ∈ of X on H is defined by – Reproducing expectation:

* Notation: depends on , also. But, we do not show it for simplicity.

5

 

) ( , X f E f mX  ) , ( ) ( X k X   

Φ ⋅, .

∀ ∈

V-5

slide-105
SLIDE 105

Covariance operator

X, Y: random variables taking values in Ω_X and Ω_Y, respectively; H_X, H_Y: RKHSs given by kernels k_X on Ω_X and k_Y on Ω_Y.

– Definition. Cross-covariance operator Σ_YX : H_X → H_Y,

Σ_YX ≡ E[ Φ_Y(Y) ⟨Φ_X(X), ·⟩ ] − m_Y ⟨m_X, ·⟩,

i.e. simply the covariance of the feature vectors; c.f. the Euclidean case V_YX = E[YXᵀ] − E[Y]E[X]ᵀ (covariance matrix).

– Reproducing covariance:

⟨g, Σ_YX f⟩ = E[ f(X) g(Y) ] − E[ f(X) ] E[ g(Y) ]  ( = Cov[ f(X), g(Y) ] )   for all f ∈ H_X, g ∈ H_Y.

V-6
slide-106
SLIDE 106

– Standard identification:  H* ≅ H,  ⟨h, ·⟩ ↔ h.
– The operator Σ_YX can therefore be regarded as an element of H_Y ⊗ H_X:

Σ_YX ≅ E[ Φ_Y(Y) ⊗ Φ_X(X) ] − m_Y ⊗ m_X ∈ H_Y ⊗ H_X.

[Diagram: X on Ω_X and Y on Ω_Y are mapped by Φ_X and Φ_Y into H_X and H_Y; Σ_YX acts between the two RKHSs.]

V-7

slide-107
SLIDE 107

Characteristic kernel  (Fukumizu, Bach, Jordan 2004, 2009; Sriperumbudur et al. 2010)

– The kernel mean can capture higher-order moments of the variable.

Example.  X: R-valued random variable; k: positive definite kernel on R; suppose k admits a Taylor series expansion on R,

k(u, x) = c_0 + c_1 (ux) + c_2 (ux)² + ⋯   (c_i > 0),   e.g. k(x, u) = exp(xu).

Then the kernel mean works as a moment-generating function:

m_X(u) = E[ k(u, X) ] = c_0 + c_1 E[X] u + c_2 E[X²] u² + ⋯,   (1/c_1) (d/du) m_X(u) |_{u=0} = E[X].

V-8

slide-108
SLIDE 108

P: the family of all probabilities on a measurable space (Ω, B); H: RKHS on Ω with a bounded measurable kernel k; m_P: the kernel mean on H for a variable with distribution P.

– Definition. A bounded measurable positive definite kernel k is called characteristic (w.r.t. P) if the mapping

P → H,   P ↦ m_P

is one-to-one, i.e.  m_P = m_Q  ⇔  P = Q.

– The kernel mean with a characteristic kernel uniquely determines a probability:

E_{X∼P}[ f(X) ] = E_{X∼Q}[ f(X) ]  for all f ∈ H   ⇔   P = Q.

V-9
slide-109
SLIDE 109

– Analogy to the “characteristic function”: with the Fourier kernel  k_F(x, u) = exp( √−1 xᵀu ),

Ch.f.(u) = E[ k_F(X, u) ].

  • The characteristic function uniquely determines a Borel probability on R^m.
  • Likewise, the kernel mean m_X(u) = E[ k(u, X) ] with a characteristic kernel uniquely determines a probability on (Ω, B). Note: Ω may not be Euclidean.
– The RKHS of a characteristic kernel must be large enough! Examples on R^m (proved later): Gaussian and Laplacian kernels. Polynomial kernels are not characteristic.

V-10

slide-110
SLIDE 110

Statistical inference with kernel means

– Statistical inference: inference on some properties of probabilities.
– With characteristic kernels, such problems can be cast into inference on kernel means:

  • Two-sample problem:  P = Q?   ↔   m_P = m_Q?
  • Independence test:  P_{XY} = P_X ⊗ P_Y?   ↔   m_{XY} = m_X ⊗ m_Y?

V-11

slide-111
SLIDE 111

Empirical Estimation

 Empirical kernel mean

– An advantage of the RKHS approach is easy empirical estimation.
– X_1, …, X_n: i.i.d. sample;  Φ(X_1), …, Φ(X_n): the corresponding sample on the RKHS.

Empirical mean:  m̂_X^{(n)} = (1/n) Σ_{i=1}^n Φ(X_i) = (1/n) Σ_{i=1}^n k(·, X_i).

Theorem 5.1 (strong √n-consistency)  Assume E[ k(X, X) ] < ∞. Then

‖ m̂_X^{(n)} − m_X ‖ = O_p( n^{−1/2} ).
V-12
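A tiny sketch of the empirical kernel mean above: with a Gaussian kernel, m̂(·) = (1/n) Σ_i k(·, X_i) can be evaluated at any point u (the kernel choice and bandwidth are assumptions here).

```python
import numpy as np

def empirical_kernel_mean(u, X, sigma=1.0):
    """Evaluate m_hat at the points u, given the sample X."""
    d2 = np.sum(u**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * u @ X.T
    return np.exp(-d2 / (2 * sigma**2)).mean(axis=1)

X = np.random.randn(1000, 1)                 # sample from N(0,1)
u = np.linspace(-3, 3, 7).reshape(-1, 1)
print(empirical_kernel_mean(u, X))           # estimates E[k(u, X)] at each u
```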

slide-112
SLIDE 112

 Empirical covariance operator

(X_1, Y_1), …, (X_n, Y_n): i.i.d. sample on Ω_X × Ω_Y.

An estimator of Σ_YX is defined by

Σ̂_YX^{(n)} = (1/n) Σ_{i=1}^n ( k_Y(·, Y_i) − m̂_Y ) ⊗ ( k_X(·, X_i) − m̂_X ).

Theorem 5.2   ‖ Σ̂_YX^{(n)} − Σ_YX ‖_HS = O_p( n^{−1/2} ).

– Hilbert–Schmidt norm: the analogue of the Frobenius norm of a matrix (sum of squared entries), often used for operators on infinite-dimensional spaces:  ‖A‖²_HS = Σ_{i,j} ⟨ ψ_j, A φ_i ⟩² for orthonormal bases {φ_i}, {ψ_j}.

V-13

slide-113
SLIDE 113

Statistical test with kernels

V-14

slide-114
SLIDE 114

Two‐sample problem

– Two-sample homogeneity test: two i.i.d. samples are given,

X_1, …, X_n ~ P   and   Y_1, …, Y_ℓ ~ Q.

Q: Are they sampled from the same distribution?
Null hypothesis H0: P = Q.  Alternative H1: P ≠ Q.

– Practically important: we often wish to distinguish two things.
  • Are the experimental results of treatment and control significantly different?
  • Were the plays “Henry VI” and “Henry II” written by the same author?
– If the means of P and Q are different, we may use that for the test. If they are identical, we need to look at higher-order information.

V-15

slide-115
SLIDE 115

– If the mean and variance are the same, it is a very difficult problem.
– Example: do they have the same distribution?

[Figure: twelve scatter plots of two samples (n = ℓ = 100 each) whose means and variances match.]

V-16

slide-116
SLIDE 116

Kernel method for the two-sample problem  (Gretton et al. NIPS 2007, 2010, JMLR 2012)

 Kernel approach

– Comparison of P and Q  ↔  comparison of m_X and m_Y.

 Maximum Mean Discrepancy (MMD)

– In population:

MMD² = ‖ m_X − m_Y ‖²_H = E[ k(X, X̃) ] + E[ k(Y, Ỹ) ] − 2 E[ k(X, Y) ],

where X̃ and Ỹ are independent copies of X and Y.
– With a characteristic kernel, MMD = 0 if and only if P = Q.
– Also,

MMD = ‖ m_X − m_Y ‖ = sup_{f: ‖f‖≤1} ⟨ f, m_X − m_Y ⟩ = sup_{f: ‖f‖≤1} ( ∫ f(x) dP(x) − ∫ f(x) dQ(x) ),

hence the name “maximum mean discrepancy”.

V-17

slide-117
SLIDE 117

– Test statistic: the empirical estimator

MMD²_emp = ‖ m̂_X − m̂_Y ‖²_H = (1/n²) Σ_{i,j=1}^n k(X_i, X_j) − (2/(nℓ)) Σ_{i=1}^n Σ_{a=1}^ℓ k(X_i, Y_a) + (1/ℓ²) Σ_{a,b=1}^ℓ k(Y_a, Y_b).

– The asymptotic distributions under H0 and H1 are known (see Appendix):
  • Null distribution: n · MMD²_emp converges to an infinite mixture of χ² variables.
  • Alternative: √n ( MMD²_emp − MMD² ) converges to a normal distribution.
– Approximation of the null distribution:
  • approximation of the mixture coefficients;
  • fitting a Pearson curve by moment matching;
  • bootstrap (Arcones & Giné 1992).

V-18
slide-118
SLIDE 118

Experiments

Percentage of accepting H0, comparing MMD-B with classical two-sample tests (WW: Wald–Wolfowitz test, KS: Kolmogorov–Smirnov test; see Appendix).

Data set    Attr.       MMD-B   WW     KS
Neural I    Same        96.5    97.0   95.0
            Different    0.0     0.0   10.0
Neural II   Same        94.6    95.0   94.5
            Different    3.3     0.8   31.8
Health      Same        95.5    94.7   96.1
            Different    1.0     2.8   44.0
Subtype     Same        99.1    94.6   97.3
            Different    0.0     0.0   28.4

Data size / dimension — Neural I: 4000 / 63; Neural II: 1000 / 100; Health: 25 / 12600; Subtype: 25 / 2118. Comparison of two databases. (Gretton et al. JMLR 2012)

V-19

slide-119
SLIDE 119

Experiments for mixtures

[Figure: average values of MMD² over 100 samples for N(0,1) vs. Unif and N(0,1) vs. a mixture, plotted against the mixture parameter, for N_X = N_Y = 100, 200, 500.]
slide-120
SLIDE 120

Independence test

 Independence

– X and Y are independent if and only if
  • probability density functions:  p_{XY}(x, y) = p_X(x) p_Y(y);
  • cumulative distribution functions:  F_{XY}(x, y) = F_X(x) F_Y(y);
  • characteristic functions:  E[ e^{√−1 (uᵀX + vᵀY)} ] = E[ e^{√−1 uᵀX} ] E[ e^{√−1 vᵀY} ].

 Independence test

Given an i.i.d. sample (X_1, Y_1), …, (X_n, Y_n), are X and Y independent?
– Null hypothesis H0:  P_{XY} = P_X ⊗ P_Y  (independent).
– Alternative H1:  P_{XY} ≠ P_X ⊗ P_Y  (not independent).

V-21

slide-121
SLIDE 121

[Figure: scatter plots of (X, Y) samples — two independent cases and two dependent cases.]

V-22

slide-122
SLIDE 122

Independence with kernels

Theorem 5.3 (Fukumizu, Bach, Jordan, JMLR 2004)  If the product kernel k_X k_Y is characteristic, then

X ⫫ Y   ⇔   Σ_YX = O.

Recall Σ_YX ≅ m_{XY} − m_X ⊗ m_Y ∈ H_Y ⊗ H_X: a comparison between m_{XY} (the kernel mean of P_{XY}) and m_X ⊗ m_Y (the kernel mean of P_X ⊗ P_Y).

– Dependence measure: Hilbert–Schmidt independence criterion

HSIC(X, Y) := ‖ Σ_YX ‖²_HS
  = E[ k_X(X, X̃) k_Y(Y, Ỹ) ] + E[ k_X(X, X̃) ] E[ k_Y(Y, Ỹ) ] − 2 E_{(X,Y)}[ E_{X̃}[ k_X(X, X̃) ] E_{Ỹ}[ k_Y(Y, Ỹ) ] ],

where (X̃, Ỹ) is an independent copy of (X, Y).   [Exercise]

V-23

slide-123
SLIDE 123

Independence test with kernels  (Gretton, Fukumizu, Teo, Song, Schölkopf, Smola. NIPS 2008)

– Test statistic:

HSIC_emp(X, Y) = (1/n²) Σ_{i,j=1}^n k_X(X_i, X_j) k_Y(Y_i, Y_j) − (2/n³) Σ_{i=1}^n Σ_{j,k=1}^n k_X(X_i, X_j) k_Y(Y_i, Y_k) + (1/n⁴) Σ_{i,j=1}^n k_X(X_i, X_j) Σ_{k,l=1}^n k_Y(Y_k, Y_l),

or equivalently

HSIC_emp(X, Y) = (1/n²) Tr[ K̃_X K̃_Y ],   K̃_X, K̃_Y: centered Gram matrices.

– This is a special case of MMD comparing P_{XY} and P_X ⊗ P_Y with the product kernel k_X k_Y.
– The asymptotic distributions are given similarly to MMD (Gretton et al. 2009).

V-24
slide-124
SLIDE 124

Comparison with an existing method

 Distance covariance (Székely et al., 2007; Székely & Rizzo, 2009)

– Distance covariance / distance correlation has gained popularity in the statistics community as a dependence measure beyond Pearson correlation.
– Definition.  X, Y: m- and ℓ-dimensional random vectors;

dCov²(X, Y) := E[ ‖X − X′‖ ‖Y − Y′‖ ] + E[ ‖X − X′‖ ] E[ ‖Y − Y′‖ ] − 2 E[ ‖X − X′‖ ‖Y − Y″‖ ],

where (X′, Y′) and (X″, Y″) are i.i.d. copies of (X, Y).

Distance correlation:  dCor(X, Y) := dCov(X, Y) / √( dCov(X, X) dCov(Y, Y) ).

V-25
slide-125
SLIDE 125

 Relation between distance covariance and MMD/HSIC

Proposition 5.4 (Sejdinovic, Gretton, Sriperumbudur, F. ICML 2012)  The kernel on Euclidean spaces

k(x, x′) = ½ ( ‖x‖ + ‖x′‖ − ‖x − x′‖ )

is positive definite, and the distance covariance coincides (up to a constant factor) with the HSIC defined with this kernel for X and Y.

– Distance covariance is a specific instance of MMD/HSIC.
– The positive definite kernel approach is more general in the choice of kernels, and thus may perform better.
– Extension:  k(x, x′) = ½ ( ‖x‖^q + ‖x′‖^q − ‖x − x′‖^q )  is positive definite for 0 < q ≤ 2, and one can define dCov_q(X, Y) := HSIC with this kernel.

V-26

slide-126
SLIDE 126

Experiments

[Figure: ratio of accepting independence vs. the degree of dependence, comparing the original distance covariance with HSIC using a Gaussian kernel.]

V-27

slide-127
SLIDE 127

 Choice of kernel for MMD

– Heuristics for the Gaussian kernel: the median heuristic,

σ = median{ ‖X_i − X_j‖ : i ≠ j, i, j = 1, …, n }.

– Using the performance of the statistical test: choose the kernel to optimize the Type II error of the test statistic (Gretton et al. NIPS 2012).

Challenging open questions remain.

V-28

slide-128
SLIDE 128

Conditional probabilities and beyond

V-29

slide-129
SLIDE 129

Conditional probability

– Conditional probabilities appear in many machine learning problems:
  • Regression / classification: direct inference of p(Y|X) or E[Y|X] — already seen in Section II.
  • Bayesian inference.
  • Conditional independence / dependence:
    – graphical modeling: conditional independence relations among variables;
    – causality;
    – dimension reduction / feature extraction.

[Diagram: a graphical model over variables X1, …, X5.]

V-30

slide-130
SLIDE 130

Conditional kernel mean

 Conditional kernel mean

Definition.  m_{Y|x} := E[ Φ_Y(Y) | X = x ] = E[ k_Y(·, Y) | X = x ].

– Simply the kernel mean of P(Y | X = x).
– With a characteristic kernel it determines the conditional probability.
– Again, inference problems on conditional probabilities can be solved as inference on conditional kernel means.
– But how can we estimate it?

V-31

slide-131
SLIDE 131

Covariance operator revisited

 (Uncentered) cross-covariance operator

X, Y: random variables on Ω_X, Ω_Y;  k_X, k_Y: positive definite kernels on Ω_X, Ω_Y with E[k_X(X, X)], E[k_Y(Y, Y)] < ∞.

– Definition. Uncentered cross-covariance operator:

C_YX ≡ E[ Φ_Y(Y) ⟨Φ_X(X), ·⟩ ] : H_X → H_Y   (or C_YX ≅ E[ Φ_Y(Y) ⊗ Φ_X(X) ] ∈ H_Y ⊗ H_X).

– Reproducing property:

⟨g, C_YX f⟩ = E[ f(X) g(Y) ]   for all f ∈ H_X, g ∈ H_Y,   i.e.  ⟨C_YX, g ⊗ f⟩ = E[ f(X) g(Y) ].

V-32

slide-132
SLIDE 132

Conditional probability by regression

– Recall that for zero-mean Gaussian random variables X, Y,

E[Y | X = x] = V_YX V_XX^{−1} x,

given by the solution A of the least-mean-square problem min_A E‖ Y − A X ‖².
– For the feature vectors,

E[ Φ_Y(Y) | X = x ] ≈ C_YX C_XX^{−1} Φ_X(x),

given by the solution of min_A E‖ Φ_Y(Y) − A Φ_X(X) ‖².
  • C_XX^{−1} is not well defined in infinite-dimensional cases, but a regularized estimator can be justified.

V-33

slide-133
SLIDE 133

Estimator for the conditional kernel mean

– Empirical estimation: given (X_1, Y_1), …, (X_n, Y_n),

m̂_{Y|x} = Ĉ_YX ( Ĉ_XX + ε_n I )^{−1} k_X(·, x).

In Gram-matrix expression,

m̂_{Y|x} = Σ_{i=1}^n w_i(x) k_Y(·, Y_i),   w(x) = ( G_X + n ε_n I_n )^{−1} k_X(x),   k_X(x) = ( k_X(x, X_1), …, k_X(x, X_n) )ᵀ.

Proposition 5.5 (consistency)  If the kernel is characteristic, the conditional mean belongs to the appropriate tensor-product space, and the regularization coefficient ε_n → 0 slowly enough as n → ∞, then for every x,

m̂_{Y|x} → E[ Φ_Y(Y) | X = x ]   in H_Y in probability.

slide-134
SLIDE 134

Conditional covariance

 Review: Gaussian variables

Conditional covariance matrix:  V_{YY|X} := V_YY − V_YX V_XX^{−1} V_XY.
Fact:  V_{YY|X} = Cov[Y, Y | X = x]  for any x.

 Conditional cross-covariance operator

Definition.  X, Y, Z: random variables on Ω_X, Ω_Y, Ω_Z;  k_X, k_Y, k_Z: positive definite kernels on Ω_X, Ω_Y, Ω_Z.

Σ_{YZ|X} ≡ Σ_YZ − Σ_YX Σ_XX^{−1} Σ_XZ.

– Reproducing averaged conditional covariance:

Proposition 5.6  If k_X is characteristic, then for all f ∈ H_Z, g ∈ H_Y,

⟨g, Σ_{YZ|X} f⟩ = E[ Cov[ f(Z), g(Y) | X ] ].

V-35
slide-135
SLIDE 135

– An interpretation: compare the conditional kernel means of P_{YZ|X} and P_{Y|X} P_{Z|X},

E[ Φ_Y(Y) ⊗ Φ_Z(Z) | X ]   vs.   E[ Φ_Y(Y) | X ] ⊗ E[ Φ_Z(Z) | X ].

The dependence on x is not easy to handle, so average it out:

E_X[ E[ Φ_Y(Y) ⊗ Φ_Z(Z) | X ] − E[ Φ_Y(Y) | X ] ⊗ E[ Φ_Z(Z) | X ] ].

– Empirical estimator:

Σ̂_{YZ|X} ≡ Σ̂_YZ − Σ̂_YX ( Σ̂_XX + ε_n I )^{−1} Σ̂_XZ.

V-36

slide-136
SLIDE 136

Conditional independence

– Recall: for Gaussian random variables,

X ⫫ Y | Z   ⇔   V_{XY|Z} = O.

– Because of the average over Z, Σ_{XY|Z} = O does not by itself imply conditional independence, which requires Cov[ f(X), g(Y) | Z = z ] = 0 for each z.
– Trick: consider

Σ_{Y Ẍ | Z},   where Ẍ := (X, Z)

and the product kernel k_X k_Z is used for Ẍ.

Theorem 5.7 (Fukumizu et al. JMLR 2004)  Assume k_X, k_Y, k_Z are characteristic. Then

X ⫫ Y | Z   ⇔   Σ_{Y Ẍ | Z} = O.

– Related conditional covariance operators can be used similarly.

V-37

slide-137
SLIDE 137

Applications of conditional dependence measures

 Conditional independence test (Fukumizu et al. NIPS 2007)

– The squared HS-norm ‖ Σ̂_{Y Ẍ | Z} ‖²_HS can be used for a conditional independence test.
– Unlike the independence test, the asymptotic null distribution is not available; a permutation test is needed.
– Background: conditional independence testing with continuous non-Gaussian variables is not easy, and remains a challenging open problem.

V-38

slide-138
SLIDE 138

 Causal inference

– A directed acyclic graph (DAG) is used to represent the causal structure among variables.
– The structure can be learned by conditional independence tests; the above test can be applied (Sun, Janzing, Schölkopf, F. ICML 2007).

[Diagram: distinguishing candidate causal structures over X, Y, Z using tests of X ⫫ Y and X ⫫ Y | Z.]

 Feature extraction / dimension reduction for supervised learning → see next.

V-39

slide-139
SLIDE 139

Dimension reduction and conditional independence

 Dimension reduction for supervised learning

Input: X = (X_1, …, X_m);  output: Y (either continuous or discrete).
Goal: find an effective dimension reduction (EDR) space spanned by an m × d matrix B such that

p(Y | X) = p̃(Y | BᵀX),   where BᵀX = ( b_1ᵀX, …, b_dᵀX ) is the linear feature vector,

with no further assumptions on the conditional p.d.f. p.

 Conditional independence: the above is equivalent to  Y ⫫ X | BᵀX.

[Diagram: X splits into U = BᵀX (the effective subspace spanned by B) and V; only U influences Y.]

V-40
slide-140
SLIDE 140

Gradient‐based method

(Samarov 1993; Hristache et al 2001)

Average Derivative Estimation (ADE)

– Assumptions:
  • Y is one-dimensional.
  • EDR space:  p(y | X) = p̃(y | Bᵀ X).

– The gradient of the regression function lies in the EDR space at each x:

        ∂/∂x E[Y | X = x] = ∂/∂x ∫ y p̃(y | Bᵀ x) dy = B ∫ y ∂p̃(y | z)/∂z |_{z = Bᵀ x} dy.

– B̂ = average or PCA of the gradients at many x.
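A compact sketch of the ADE idea follows: estimate the gradient of the regression function at many points by local linear regression, then take the leading directions of those gradients. The Gaussian weighting, the bandwidth, and the use of an (uncentered) SVD as the "PCA of gradients" are assumptions of this sketch.

```python
import numpy as np

def local_linear_gradient(X, Y, x0, bandwidth):
    """Gradient of the local linear fit of Y on X at x0 (Gaussian weights)."""
    d = X - x0                                   # local coordinates, shape (n, m)
    w = np.exp(-np.sum(d**2, 1) / (2 * bandwidth**2))
    A = np.hstack([np.ones((X.shape[0], 1)), d]) # intercept + slopes
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A + 1e-8 * np.eye(A.shape[1]), A.T @ W @ Y)
    return beta[1:]                              # slope part = gradient estimate

def ade(X, Y, d, bandwidth=0.7):
    """Average Derivative Estimation: leading directions of local gradient estimates."""
    grads = np.array([local_linear_gradient(X, Y, x0, bandwidth) for x0 in X])
    _, _, Vt = np.linalg.svd(grads, full_matrices=False)
    return Vt[:d].T                              # m x d estimate of the EDR basis

# Toy example: Y depends on X only through b^T X with b = (1, 1, 0, 0)/sqrt(2).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
b = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
Y = np.sin(X @ b) + 0.1 * rng.normal(size=400)
B_hat = ade(X, Y, d=1)
print("estimated direction:", B_hat[:, 0])       # roughly proportional to b
```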

V-41

slide-141
SLIDE 141

Kernel Helps!

– Weakness of ADE:
  • Difficulty of estimating gradients in a high-dimensional space.
    ADE uses local polynomial regression » sensitive to the bandwidth.
  • It may find only a subspace of the effective subspace, e.g.,

        Y = f(Z_1) + ε,   ε ~ N(0, σ(Z_2)²),   (Z_1, Z_2) = Bᵀ X:

    the gradient of E[Y | X] spans only the first direction, although X ⫫ Y | Bᵀ X requires both.

– Kernel method
  • Can handle the conditional probability in regression form, E[ Φ_Y(Y) | X = x ].
  • Characterizes conditional independence.

V-42

slide-142
SLIDE 142

Derivative with kernel

– Reproducing the derivative (e.g., Steinwart & Christmann, Chap. 4): assume k_X(·, x) is differentiable. Then ∂k_X(·, x)/∂x ∈ H_X and

        ⟨ f, ∂k_X(·, x)/∂x ⟩_{H_X} = ∂f(x)/∂x    for any f ∈ H_X.

– Combining this with the estimator of the conditional kernel mean, define

        M̂(x)_{ij} = ⟨ ∂/∂x_i Ê[ Φ_Y(Y) | X = x ], ∂/∂x_j Ê[ Φ_Y(Y) | X = x ] ⟩_{H_Y}
                  = ⟨ Ĉ_{YX} ( Ĉ_{XX} + ε_n I )^{-1} ∂k_X(·, x)/∂x_i,  Ĉ_{YX} ( Ĉ_{XX} + ε_n I )^{-1} ∂k_X(·, x)/∂x_j ⟩.

The top eigenvectors of M̂(x) estimate the EDR space.

V-43

slide-143
SLIDE 143

Gradient‐based kernel dimension reduction (gKDR)

(Fukumizu & Leng, NIPS 2012)

– Method
  • Compute

        M̃_n = (1/n) Σ_{i=1}^n ∇k_X(X_i)ᵀ ( G_X + n ε_n I_n )^{-1} G_Y ( G_X + n ε_n I_n )^{-1} ∇k_X(X_i),

    where G_X = ( k_X(X_i, X_j) ) and G_Y = ( k_Y(Y_i, Y_j) ) are the Gram matrices and

        ∇k_X(X_i) = ( ∂k_X(X_1, x)/∂x |_{x = X_i}, …, ∂k_X(X_n, x)/∂x |_{x = X_i} )ᵀ    (an n × m matrix).

  • Compute the top d eigenvectors of M̃_n → estimator B̂.

– gKDR estimates the subspace that realizes the conditional independence.
– Choice of kernel: cross-validation with some regressor/classifier, e.g., the kNN method.
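The NumPy sketch below implements the M̃_n computation above for Gaussian kernels, for which ∂k_X(x′, x)/∂x = k_X(x′, x)(x′ - x)/σ². The bandwidths and the regularization ε are illustrative assumptions; the slide recommends choosing the kernel by cross-validation with a regressor or classifier.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def gkdr(X, Y, d, sigma_x, sigma_y, eps):
    """Gradient-based kernel dimension reduction (sketch with Gaussian kernels)."""
    n, m = X.shape
    Gx = gauss_gram(X, X, sigma_x)
    Gy = gauss_gram(Y, Y, sigma_y)
    F = np.linalg.solve(Gx + n * eps * np.eye(n), Gy)       # (Gx + n eps I)^{-1} Gy ...
    F = np.linalg.solve(Gx + n * eps * np.eye(n), F.T).T    # ... (Gx + n eps I)^{-1}
    M = np.zeros((m, m))
    for i in range(n):
        # rows of Dk: d/dx k_X(X_j, x) at x = X_i (Gaussian kernel derivative)
        Dk = Gx[:, i][:, None] * (X - X[i]) / sigma_x**2    # n x m
        M += Dk.T @ F @ Dk
    M /= n
    eigval, eigvec = np.linalg.eigh(M)
    return eigvec[:, ::-1][:, :d]                           # top-d eigenvectors (m x d)

# Toy example: Y depends on X only through the first two coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
Y = (np.sin(X[:, 0]) + X[:, 1]**2 + 0.1 * rng.normal(size=300))[:, None]
B_hat = gkdr(X, Y, d=2, sigma_x=1.5, sigma_y=1.0, eps=1e-3)
print(B_hat)      # columns should roughly span the (e1, e2) plane
```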

V-44

slide-144
SLIDE 144

Experiment: ISOLET

– Speech signals of the 26 alphabet letters.
– 617-dimensional (continuous) features.
– 6238 training data / 1559 test data (data from the UCI repository).
– Results: classification errors (%) for the test data by SVM.

        Dim    gKDR    gKDR-v    CCA
        10     14.43   16.87     13.09
        15      7.50    7.57      8.66
        20      5.00    4.75      6.54
        25      4.75    4.30      6.09
        30       -      3.85       -
        35       -      3.85       -
        40       -      3.59       -
        45       -      3.53       -
        50       -      3.08       -

  c.f. C4.5 + ECOC: 6.61%; neural networks (best): 3.27%.

V-45

slide-145
SLIDE 145

Experiment: Amazon Commerce Reviews

– Author identification for Amazon commerce reviews.
– Dim = 10000 (linguistic style: e.g., usage of digits and punctuation, word and sentence lengths, word frequencies, etc.).
– Sample size = (# authors) × 30.

(Data from UCI repository.)

V-46

slide-146
SLIDE 146

– Example of 2-dim plots for 3 authors.
– gKDR (dim = # authors) vs. correlation-based variable selection (dim = 500 / 2000).

10-fold cross-validation errors (%):

        # Authors   gKDR + 5NN   gKDR + SVM   Corr (500) + SVM   Corr (2000) + SVM
        10             9.3          12.0           15.7                 8.3
        20            16.2          16.2           30.2                18.0
        30            20.1          18.0           29.2                24.0
        40            22.8          21.8           35.4                25.0
        50            22.7          19.5           41.1                29.0

V-47

slide-147
SLIDE 147

Bayesian inference with kernels

(Fukumizu, Song, Gretton NIPS 2011)

 Bayes’ rule

        q(x | y) = p(y | x) π(x) / ∫ p(y | x̃) π(x̃) dx̃,    π: prior.

 Kernel Bayes' rule

– m_Π: kernel mean of the prior, estimated as m̂_Π = Σ_j γ_j k_X(·, U_j).
– p(y | x): the relation between X and Y is represented through kernel means and covariances computed from the joint sample
        (X_1, Y_1), …, (X_n, Y_n) ~ P,  i.i.d.
– Goal: compute the kernel mean of the posterior q(x | y):

        m̂_{X|y} = Σ_{i=1}^n w_i k_X(·, X_i),
        w = Λ G_Y ( (Λ G_Y)² + δ_n I_n )^{-1} Λ k_Y(y),    Λ = Diag(μ̂),

  where μ̂ = ( G_X + n ε_n I_n )^{-1} G_{XU} γ expresses the prior kernel mean on the sample points, G_{XU} = ( k_X(X_i, U_j) ), and k_Y(y) = ( k_Y(y, Y_1), …, k_Y(y, Y_n) )ᵀ.

V-48
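A schematic NumPy sketch of the weight computation is given below. It follows the Λ-based formula above; the Gaussian kernels, the bandwidths, and the regularization parameters ε and δ are assumptions of this sketch, and the exact normalizations and regularization schedules should be taken from the Kernel Bayes' Rule paper rather than from this illustration.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kbr_posterior_weights(X, Y, U, gamma, y_obs, sx, sy, eps, delta):
    """Posterior kernel-mean weights w:  m^_{X|y} = sum_i w_i k_X(., X_i)."""
    n = X.shape[0]
    Gx  = gauss_gram(X, X, sx)
    Gy  = gauss_gram(Y, Y, sy)
    Gxu = gauss_gram(X, U, sx)
    # weights expressing the prior kernel mean on the joint-sample points
    mu = np.linalg.solve(Gx + n * eps * np.eye(n), Gxu @ gamma)
    L = np.diag(mu)                                  # Lambda = Diag(mu^)
    LGy = L @ Gy
    ky = gauss_gram(Y, y_obs[None, :], sy)[:, 0]
    return LGy @ np.linalg.solve(LGy @ LGy + delta * np.eye(n), L @ ky)

# Toy example: X ~ prior, Y = X + noise; observe y = 1.0 and estimate E[X | y].
rng = np.random.default_rng(0)
n = 300
U = rng.normal(size=(n, 1)); gamma = np.full(n, 1.0 / n)   # prior sample, uniform weights
X = rng.normal(size=(n, 1))
Y = X + 0.3 * rng.normal(size=(n, 1))
w = kbr_posterior_weights(X, Y, U, gamma, np.array([1.0]),
                          sx=0.5, sy=0.5, eps=1e-2, delta=1e-4)
print("posterior mean estimate of X given y=1.0:", float(w @ X))
```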

slide-148
SLIDE 148

Bayesian inference using kernel Bayes' rule

– NO PARAMETRIC MODELS, BUT SAMPLES! – When is it useful?

  • The explicit form of the conditional p.d.f. p(y | x) or of the prior π is unavailable, but sampling is easy.
    – Approximate Bayesian Computation (ABC), process priors.
  • The conditional p.d.f. p(y | x) is unknown, but a sample from (X, Y) is given in the training phase.
    – Example: nonparametric HMM (shown later).
  • If both p(y | x) and π are known, there are many good sampling methods, such as MCMC, SMC, etc., but they may take a long time. KBR uses only matrix operations.

V-49

slide-149
SLIDE 149

Application: nonparametric HMM

Model:

        p(X_0, …, X_T, Y_0, …, Y_T) = π(X_0) ∏_{t=1}^T q(X_t | X_{t-1}) ∏_{t=0}^T p(Y_t | X_t).

– Assume: the transition q(x′ | x) and/or the observation model p(y | x) is not known, but data (X_t, Y_t) are available in a training phase.
  Examples:
  • Measurement of the hidden states is expensive.
  • Hidden states are measured with a time delay.
– Testing phase (e.g., filtering): given observations ỹ_1, …, ỹ_t, estimate the hidden state x_t.
– Sequential filtering/prediction uses Bayes' rule → KBR is applied.

[Diagram: HMM graphical model X_0 → X_1 → ⋯ → X_T with observations Y_0, Y_1, …, Y_T.]

V-50

slide-150
SLIDE 150

 Camera angles

– Hidden X_t: angles of a video camera located at a corner of a room.
– Observed Y_t: movie frame of the room + additive Gaussian noise.
– Data: 3600 downsampled frames of 20 × 20 RGB pixels (1200 dim.).
– The first 1800 frames for training, the second half for testing.

Average MSE for the camera angles (10 runs):

        noise          KBR (Trace)      Kalman filter (Q)
        σ² = 10⁻⁴      0.15 ± 0.01      0.56 ± 0.02
        σ² = 10⁻³      0.21 ± 0.01      0.54 ± 0.02

To represent the SO(3) model, Tr[AB⁻¹] is used for KBR and the quaternion expression for the Kalman filter.

V-51

slide-151
SLIDE 151

Summary of Part V

 Kernel mean embedding of probabilities

– The kernel mean gives a representation of a probability distribution.
– Inference on probabilities can be cast into inference on kernel means, e.g., the two-sample test and the independence test.

 Conditional probabilities

– Conditional probabilities can be handled with kernel means and covariances

  • Conditional independence test
  • Graphical modeling
  • Causal inference
  • Dimension reduction
  • Bayesian inference

V-52

slide-152
SLIDE 152

References

Fukumizu, K., F.R. Bach and M.I. Jordan. (2004) Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research 5:73-99.
Fukumizu, K., F.R. Bach and M.I. Jordan. (2009) Kernel dimension reduction in regression. Annals of Statistics 37(4):1871-1905.
Fukumizu, K., L. Song and A. Gretton. (2011) Kernel Bayes' Rule. Advances in Neural Information Processing Systems 24 (NIPS 2011), 1737-1745.
Gretton, A., K.M. Borgwardt, M. Rasch, B. Schölkopf and A.J. Smola. (2007) A Kernel Method for the Two-Sample-Problem. Advances in Neural Information Processing Systems 19, 513-520.
Gretton, A., Z. Harchaoui, K. Fukumizu and B. Sriperumbudur. (2010) A Fast, Consistent Kernel Two-Sample Test. Advances in Neural Information Processing Systems 22, 673-681.
Gretton, A., K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf and A. Smola. (2008) A Kernel Statistical Test of Independence. Advances in Neural Information Processing Systems 20, 585-592.

V-53

slide-153
SLIDE 153

Hristache, M., A. Juditsky, J. Polzehl and V. Spokoiny. (2001) Structure Adaptive Approach for Dimension Reduction. Annals of Statistics 29(6):1537-1566.
Samarov, A. (1993) Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association 88:836-847.
Sejdinovic, D., A. Gretton, B. Sriperumbudur and K. Fukumizu. (2012) Hypothesis testing using pairwise distances and associated kernels. Proc. 29th International Conference on Machine Learning (ICML 2012).
Sriperumbudur, B.K., A. Gretton, K. Fukumizu, B. Schölkopf and G.R.G. Lanckriet. (2010) Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research 11:1517-1561.
Sun, X., D. Janzing, B. Schölkopf and K. Fukumizu. (2007) A kernel-based causal learning algorithm. Proc. 24th Annual International Conference on Machine Learning (ICML 2007), 855-862.
Székely, G.J., M.L. Rizzo and N.K. Bakirov. (2007) Measuring and testing dependence by correlation of distances. Annals of Statistics 35(6):2769-2794.
Székely, G.J. and M.L. Rizzo. (2009) Brownian distance covariance. Annals of Applied Statistics 3(4):1236-1265.

V-54

slide-154
SLIDE 154

Appendices

V-55

slide-155
SLIDE 155

Statistical Test: quick introduction

 How should we set the threshold?

Example: based on MMD, we wish to decide whether two variables have the same distribution.
Simple-minded idea: set a small value like t = 0.001.

        MMD(X, Y) < t   →  perhaps the same,
        MMD(X, Y) ≥ t   →  different.

But the threshold should depend on the properties of X and Y.

 Statistical hypothesis test

– A statistical way of deciding whether a hypothesis is true or not.
– The decision is based on a sample → we cannot be 100% certain.

V-56

slide-156
SLIDE 156

 Procedure of hypothesis test

  • Null hypothesis H0 = hypothesis assumed to be true

“X and Y have the same distribution”

  • Prepare a test statistic T_N, e.g., T_N = MMD²_emp.
  • Null distribution: the distribution of T_N under H0.
  • Set the significance level α (typically α = 0.05 or 0.01).
  • Compute the critical region: choose t with α = Pr(T_N > t under H0).
  • Reject the null hypothesis if T_N > t:
    the probability that MMD²_emp > t under H0 is very small.
  • Otherwise, accept H0 (negatively, i.e., for lack of evidence against it).
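As a concrete instance of this procedure with T_N = MMD²_emp, the sketch below approximates the null distribution by permuting the pooled sample, a standard alternative to the asymptotic null distribution discussed later; the Gaussian kernel and its bandwidth are illustrative assumptions.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma):
    """Biased empirical MMD^2 between samples X and Y (Gaussian kernel)."""
    Kxx = gauss_gram(X, X, sigma)
    Kyy = gauss_gram(Y, Y, sigma)
    Kxy = gauss_gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def mmd_permutation_test(X, Y, sigma=1.0, n_perm=500, seed=0):
    """Two-sample test: the permutation distribution plays the role of the
       null distribution of T_N, and the p-value is compared with alpha."""
    rng = np.random.default_rng(seed)
    T = mmd2(X, Y, sigma)
    Z = np.vstack([X, Y]); n = X.shape[0]
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(Z.shape[0])
        null.append(mmd2(Z[idx[:n]], Z[idx[n:]], sigma))
    return T, float(np.mean(np.array(null) >= T))    # statistic, p-value

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
Y = rng.normal(loc=0.5, size=(100, 1))
T, p = mmd_permutation_test(X, Y)
print("MMD^2 =", T, " p-value =", p)                  # reject H0 if p < alpha
```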

V-57

slide-157
SLIDE 157

[Figure: p.d.f. of the null distribution; the area α (5%, 1%, etc.) to the right of the threshold t is the critical region (one-sided test).]

  • If H0 is the truth, the value of T_N should follow the null distribution.
  • If H1 is the truth, the value of T_N should be very large.
  • Set the threshold t with risk α.
  • The threshold depends on the distribution of the data.

The area to the right of the observed T_N is the p-value:  T_N > t  ⇔  p-value < α (the significance level).

V-58

slide-158
SLIDE 158

 Type I and Type II error

– Type I error = false positive (here, rejecting H0, i.e., declaring "different", is the positive).
– Type II error = false negative.

                          TRUTH: H0              TRUTH: Alternative
        Reject H0         Type I error           True positive
                          (false positive)
        Accept H0         True negative          Type II error
                                                 (false negative)

  • The significance level α controls the type I error.
  • Under a fixed type I error, a good test statistic should give a small type II error.

V-59

slide-159
SLIDE 159

MMD: Asymptotic distribution

 Under H0

X_1, …, X_n ~ P,  Y_1, …, Y_ℓ ~ Q:  i.i.d.  Let N = n + ℓ.
Assume
  • n/N → ρ,  ℓ/N → 1 - ρ  (0 < ρ < 1)  as N → ∞.

Under the null hypothesis P = Q,

        N · MMD²_emp  ⇒  Σ_{i=1}^∞ λ_i ( Z_i² - 1/(ρ(1-ρ)) )      (N → ∞),

where Z_1, Z_2, … are i.i.d. with law N(0, 1/(ρ(1-ρ))), and λ_i are the eigenvalues of the integral operator on L²(P) defined by the centered kernel

        k̃(x, x′) = k(x, x′) - E_X[k(x, X)] - E_X[k(X, x′)] + E_{X,X′}[k(X, X′)],

X′: an independent copy of X.
V-60

slide-160
SLIDE 160

 Under H1

Under the alternative P ≠ Q,

        √N ( MMD²_emp - MMD² )  ⇒  N(0, σ²)      (N → ∞),

where, for n = ℓ (general proportions give an analogous constant),

        σ² = 4 ( E_z[ ( E_{z′}[ h(z, z′) ] )² ] - ( E_{z,z′}[ h(z, z′) ] )² ),

with z = (X, Y) and h the U-statistic kernel of MMD².

– The asymptotic distributions are derived from the general theory of U-statistics (e.g., van der Vaart 1998, Chapter 12).
– Estimation of the null distribution:
  • Estimation of the eigenvalues λ_i.
  • Approximation by a Pearson curve with moment matching.
  • Bootstrap MMD (Arcones & Giné 1992).

V-61

slide-161
SLIDE 161

Conventional methods for two sample problem

 Kolmogorov-Smirnov (K-S) test for two samples

One-dimensional variables.
– Empirical distribution function:

        F_N(t) = (1/N) Σ_{i=1}^N I( X_i ≤ t ).

– KS test statistic:

        D_N = sup_{t ∈ R} | F_N^{(1)}(t) - F_N^{(2)}(t) |.

– The asymptotic null distribution is known (not shown here).

[Figure: the two empirical distribution functions F_N^{(1)}(t) and F_N^{(2)}(t), with D_N the maximum vertical gap between them.]
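A short sketch of computing the two-sample KS statistic D_N (the null distribution itself is omitted here, as on the slide):

```python
import numpy as np

def ks_two_sample_stat(x1, x2):
    """D_N = sup_t |F^(1)(t) - F^(2)(t)| for one-dimensional samples."""
    x1, x2 = np.sort(x1), np.sort(x2)
    t = np.concatenate([x1, x2])                       # candidate jump points
    F1 = np.searchsorted(x1, t, side="right") / len(x1)
    F2 = np.searchsorted(x2, t, side="right") / len(x2)
    return np.max(np.abs(F1 - F2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(loc=0.3, size=200)
print("D_N =", ks_two_sample_stat(x1, x2))
# A p-value based on the asymptotic null distribution is available,
# e.g., via scipy.stats.ks_2samp.
```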

V-62

slide-162
SLIDE 162

 Wald-Wolfowitz run test

One-dimensional samples.
– Combine the two samples and plot the points in ascending order.
– Label the points based on the original two groups.
– Count the number of runs R, i.e., maximal consecutive sequences of the same label (e.g., R = 10 in the illustration).
– Test statistic:

        T_N = ( R - E[R] ) / sqrt( Var[R] )  →  N(0, 1)    under H0.

– In the one-dimensional case, less powerful than the KS test.

 Multidimensional extension of the KS and WW tests

– A minimum spanning tree is used (Friedman & Rafsky 1979).
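A sketch of the run-count statistic follows, using the standard null moments E[R] = 2 n₁ n₂ / N + 1 and Var[R] = 2 n₁ n₂ (2 n₁ n₂ - N) / (N² (N - 1)); these explicit moment formulas are supplied here as standard facts, since the slide only states T_N = (R - E[R]) / sqrt(Var[R]).

```python
import numpy as np

def runs_test_stat(x1, x2):
    """Wald-Wolfowitz: T_N = (R - E[R]) / sqrt(Var[R]), approx. N(0,1) under H0."""
    labels = np.concatenate([np.zeros(len(x1)), np.ones(len(x2))])
    order = np.argsort(np.concatenate([x1, x2]))
    s = labels[order]
    R = 1 + int(np.sum(s[1:] != s[:-1]))               # number of runs
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    ER = 2 * n1 * n2 / N + 1
    VR = 2 * n1 * n2 * (2 * n1 * n2 - N) / (N**2 * (N - 1))
    return (R - ER) / np.sqrt(VR), R

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(loc=1.0, size=100)
T, R = runs_test_stat(x1, x2)
print("runs R =", R, " T_N =", T)   # strongly negative T_N suggests different distributions
```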

V-63