Kenji Fukumizu, The Institute of Statistical Mathematics

Kernel Methods for Statistical Learning. Kenji Fukumizu, The Institute of Statistical Mathematics / Graduate University for Advanced Studies. September 6-7, 2012, Machine Learning Summer School 2012, Kyoto. Version 2012.09.04. The latest version of


1. Principal Component Analysis
– PCA (review): a linear method for dimension reduction of data.
– Project the data onto the directions of large variance.
– 1st principal axis: $\mathrm{argmax}_{\|a\|=1}\, \mathrm{Var}[a^T X]$, where
  $\mathrm{Var}[a^T X] = \frac{1}{n}\sum_{i=1}^n \bigl(a^T\bigl(X_i - \tfrac{1}{n}\sum_{j=1}^n X_j\bigr)\bigr)^2 = a^T V_{XX}\, a$,
  $V_{XX} = \frac{1}{n}\sum_{i=1}^n \bigl(X_i - \tfrac{1}{n}\sum_{j=1}^n X_j\bigr)\bigl(X_i - \tfrac{1}{n}\sum_{j=1}^n X_j\bigr)^T$.

2. From PCA to kernel PCA
– Kernel PCA: nonlinear dimension reduction of data (Schölkopf et al. 1998).
– Do PCA in the feature space:
  $\max_{\|a\|=1} \mathrm{Var}[a^T X] = \frac{1}{n}\sum_{i=1}^n \bigl(a^T\bigl(X_i - \tfrac{1}{n}\sum_{s=1}^n X_s\bigr)\bigr)^2$
  becomes
  $\max_{\|f\|_{\mathcal H}=1} \mathrm{Var}[\langle f, \Phi(X)\rangle] = \frac{1}{n}\sum_{i=1}^n \bigl\langle f,\, \Phi(X_i) - \tfrac{1}{n}\sum_{s=1}^n \Phi(X_s)\bigr\rangle^2$.

3. It is sufficient to assume $f = \sum_{i=1}^n c_i\bigl(\Phi(X_i) - \tfrac{1}{n}\sum_{s=1}^n \Phi(X_s)\bigr)$.
– Directions orthogonal to the data can be neglected: writing $f = \sum_{i=1}^n c_i\bigl(\Phi(X_i) - \tfrac{1}{n}\sum_{s=1}^n \Phi(X_s)\bigr) + f_\perp$, where $f_\perp$ is orthogonal to $\mathrm{span}\bigl\{\Phi(X_i) - \tfrac{1}{n}\sum_{s=1}^n \Phi(X_s)\bigr\}$, the objective function of kernel PCA does not depend on $f_\perp$.
– Then $\mathrm{Var}\bigl[\langle f, \widetilde\Phi(X)\rangle\bigr] = \frac{1}{n}\, c^T \widetilde K_X^2\, c$ [Exercise] and $\|f\|_{\mathcal H}^2 = c^T \widetilde K_X\, c$,
  where $\widetilde K_{X, ij} := \langle \widetilde\Phi(X_i), \widetilde\Phi(X_j)\rangle$ (centered Gram matrix) and $\widetilde\Phi(X_i) := \Phi(X_i) - \tfrac{1}{n}\sum_{s=1}^n \Phi(X_s)$ (centered feature vector).

4. Objective function of kernel PCA
– $\max\; c^T \widetilde K_X^2\, c$ subject to $c^T \widetilde K_X\, c = 1$.
– The centered Gram matrix $\widetilde K_X$ is expressed with the Gram matrix $K_X = \bigl(k(X_i, X_j)\bigr)_{ij}$ as
  $\widetilde K_X = \bigl(I_n - \tfrac{1}{n}\mathbf 1_n \mathbf 1_n^T\bigr)\, K_X\, \bigl(I_n - \tfrac{1}{n}\mathbf 1_n \mathbf 1_n^T\bigr)$,
  where $\mathbf 1_n = (1, \dots, 1)^T \in \mathbf R^n$ and $I_n$ is the identity matrix; elementwise,
  $\widetilde K_{X, ij} = k(X_i, X_j) - \tfrac{1}{n}\sum_{s=1}^n k(X_i, X_s) - \tfrac{1}{n}\sum_{t=1}^n k(X_t, X_j) + \tfrac{1}{n^2}\sum_{t, s=1}^n k(X_t, X_s)$.  [Exercise]
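
A quick numerical illustration of the centering formula above, given as a minimal NumPy sketch; the Gaussian kernel and the random toy data are assumptions for the example, not part of the slides.

```python
import numpy as np

def center_gram(K):
    """Centered Gram matrix: (I - 1 1^T / n) K (I - 1 1^T / n)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return H @ K @ H

# toy check with a Gaussian kernel on random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sigma = 1.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))
K_tilde = center_gram(K)
print(np.allclose(K_tilde.sum(axis=0), 0.0))  # rows/columns of the centered matrix sum to zero
```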

5. Kernel PCA algorithm
– Kernel PCA can be solved by eigendecomposition:
  • compute the centered Gram matrix $\widetilde K_X$;
  • eigendecompose $\widetilde K_X = \sum_{i=1}^N \lambda_i u_i u_i^T$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_N \ge 0$ and unit eigenvectors $u_1, u_2, \dots, u_N$;
  • the $p$-th principal component of $X_i$ is $\sqrt{\lambda_p}\, u_{p, i}$ (the $i$-th entry of $u_p$ scaled by $\sqrt{\lambda_p}$).
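
A minimal NumPy sketch of the algorithm on this slide. The Gaussian kernel, its bandwidth, and the random data are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def gauss_gram(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_pca(K, n_components):
    """Kernel PCA scores from an (uncentered) Gram matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                          # centered Gram matrix
    lam, U = np.linalg.eigh(Kc)             # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]          # sort descending
    lam = np.clip(lam, 0.0, None)           # guard against tiny negative values
    # p-th principal component of X_i is sqrt(lambda_p) * u_{p,i}
    return np.sqrt(lam[:n_components]) * U[:, :n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))
scores = kernel_pca(gauss_gram(X, sigma=3.0), n_components=2)
print(scores.shape)   # (100, 2)
```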

6. Derivation of kernel methods in general
– Consider feature vectors given by kernels.
– Apply a linear method to the feature vectors (kernelization).
– Typically, only the inner products $\langle \Phi(X_i), \Phi(X_j)\rangle = k(X_i, X_j)$ and $\langle f, \Phi(X_i)\rangle$ are used to express the objective function of the method.
– The solution is given in the form $f = \sum_{i=1}^n c_i \Phi(X_i)$ (representer theorem, see Sec. IV), and thus everything is written in terms of Gram matrices.
– These steps are common to the derivation of any kernel method.

7. Example of kernel PCA
– Wine data (from the UCI repository): 13-dimensional chemical measurements for three types of wine, 178 data points.
– Class labels are NOT used in PCA, but are shown in the figures.
– Figures: first two principal components, linear PCA vs. kernel PCA (Gaussian kernel, $\sigma = 3$).

8. Kernel PCA (Gaussian kernel)
– Figures: first two kernel principal components for $\sigma = 2, 4, 5$, with $k_G(x, y) = \exp\bigl(-\|x - y\|^2 / (2\sigma^2)\bigr)$.

9. Noise reduction with kernel PCA
– PCA can be used for noise reduction (the principal directions represent the signal, the remaining directions the noise).
– Apply kernel PCA for noise reduction:
  • compute the $d$-dimensional subspace $V_d$ spanned by the $d$ principal directions;
  • for a new data point $x$, let $Gx$ be the projection of $\Phi(x)$ onto $V_d$: the noise-reduced feature vector;
  • compute a preimage $\hat x$ in data space for the noise-reduced feature vector $Gx$: $\hat x = \mathrm{argmin}_{x'} \|\Phi(x') - Gx\|^2$.
– Note: $Gx$ is not necessarily given as an image of $\Phi$.

10. USPS data
– Figures: original data (NOT used for PCA), noisy images, noise-reduced images (linear PCA), noise-reduced images (kernel PCA, Gaussian kernel).
– Generated by the Matlab stprtool toolbox (by V. Franc).

11. Properties of kernel PCA
– Nonlinear features can be extracted.
– Can be used as preprocessing for other analyses such as classification (dimension reduction / feature extraction).
– The results depend on the choice of kernel and kernel parameters; interpreting the results may not be straightforward.
– How to choose a kernel and kernel parameters?
  • Cross-validation is not straightforward, unlike for SVM.
  • If kernel PCA is a preprocessing step, the performance of the final analysis should be maximized.

12. Kernel Canonical Correlation Analysis

13. Canonical correlation analysis
– Canonical correlation analysis (CCA, Hotelling 1936): linear dependence between two multivariate random vectors.
  • Data: $(X_1, Y_1), \dots, (X_N, Y_N)$, with $X_i$ $m$-dimensional and $Y_i$ $\ell$-dimensional.
– Find the directions $a$ and $b$ so that the correlation of $a^T X$ and $b^T Y$ is maximized:
  $\rho = \max_{a, b}\, \mathrm{Corr}[a^T X, b^T Y] = \max_{a, b} \dfrac{\mathrm{Cov}[a^T X, b^T Y]}{\sqrt{\mathrm{Var}[a^T X]\, \mathrm{Var}[b^T Y]}}$.

14. Solution of CCA
– $\max_{a, b}\; a^T \widehat V_{XY}\, b$ subject to $a^T \widehat V_{XX}\, a = b^T \widehat V_{YY}\, b = 1$.
– Rewritten as a generalized eigenproblem:
  $\begin{pmatrix} O & \widehat V_{XY} \\ \widehat V_{YX} & O \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \rho \begin{pmatrix} \widehat V_{XX} & O \\ O & \widehat V_{YY} \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix}$.
  [Exercise: derive this. Hint: use the Lagrange multiplier method.]
– Solution: $a = \widehat V_{XX}^{-1/2} u_1$, $b = \widehat V_{YY}^{-1/2} v_1$, where $u_1$ ($v_1$, resp.) is the first left (right, resp.) singular vector of $\widehat V_{XX}^{-1/2} \widehat V_{XY} \widehat V_{YY}^{-1/2}$.
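
A minimal NumPy sketch of the whitening-plus-SVD solution above. The small ridge term `eps` added to the covariance matrices is my numerical-stability assumption, not part of the slide.

```python
import numpy as np

def linear_cca(X, Y, eps=1e-8):
    """First pair of canonical directions (a, b) via whitening + SVD.

    X: (n, m) and Y: (n, l) data matrices, rows are observations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Vxx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])  # small ridge for stability
    Vyy = Yc.T @ Yc / n + eps * np.eye(Y.shape[1])
    Vxy = Xc.T @ Yc / n

    def inv_sqrt(V):                                 # inverse square root via eigendecomposition
        lam, U = np.linalg.eigh(V)
        return U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

    Wx, Wy = inv_sqrt(Vxx), inv_sqrt(Vyy)
    U, s, Vt = np.linalg.svd(Wx @ Vxy @ Wy)
    a = Wx @ U[:, 0]                                 # first canonical direction for X
    b = Wy @ Vt[0, :]                                # first canonical direction for Y
    return a, b, s[0]                                # s[0] is the first canonical correlation
```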

15. Kernel CCA
– Kernel CCA (Akaho 2000, Melzer et al. 2001, Bach & Jordan 2002): dependence (not only correlation) of two random variables.
– Data: $(X_1, Y_1), \dots, (X_N, Y_N)$, arbitrary variables.
– Consider CCA for the feature vectors given by kernels $k_X$ and $k_Y$:
  $X_1, \dots, X_N \mapsto \Phi_X(X_1), \dots, \Phi_X(X_N) \in \mathcal H_X$, $\quad Y_1, \dots, Y_N \mapsto \Phi_Y(Y_1), \dots, \Phi_Y(Y_N) \in \mathcal H_Y$;
  $\max_{f \in \mathcal H_X,\, g \in \mathcal H_Y} \dfrac{\mathrm{Cov}[f(X), g(Y)]}{\sqrt{\mathrm{Var}[f(X)]\, \mathrm{Var}[g(Y)]}} = \max_{f \in \mathcal H_X,\, g \in \mathcal H_Y} \dfrac{\sum_{i=1}^N \langle f, \widetilde\Phi_X(X_i)\rangle\, \langle \widetilde\Phi_Y(Y_i), g\rangle}{\sqrt{\sum_{i=1}^N \langle f, \widetilde\Phi_X(X_i)\rangle^2}\, \sqrt{\sum_{i=1}^N \langle g, \widetilde\Phi_Y(Y_i)\rangle^2}}$.

16. – We can assume $f = \sum_{i=1}^N \alpha_i \widetilde\Phi_X(X_i)$ and $g = \sum_{i=1}^N \beta_i \widetilde\Phi_Y(Y_i)$ (same argument as in kernel PCA). Then
  $\max_{\alpha \in \mathbf R^N,\, \beta \in \mathbf R^N} \dfrac{\alpha^T \widetilde K_X \widetilde K_Y\, \beta}{\sqrt{\alpha^T \widetilde K_X^2\, \alpha}\, \sqrt{\beta^T \widetilde K_Y^2\, \beta}}$,
  where $\widetilde K_X$ and $\widetilde K_Y$ are the centered Gram matrices.
– Regularization (to avoid a trivial solution):
  $\max_{f \in \mathcal H_X,\, g \in \mathcal H_Y} \dfrac{\sum_{i=1}^N \langle f, \widetilde\Phi_X(X_i)\rangle\, \langle \widetilde\Phi_Y(Y_i), g\rangle}{\sqrt{\sum_{i=1}^N \langle f, \widetilde\Phi_X(X_i)\rangle^2 + \varepsilon_N \|f\|_{\mathcal H_X}^2}\; \sqrt{\sum_{i=1}^N \langle g, \widetilde\Phi_Y(Y_i)\rangle^2 + \varepsilon_N \|g\|_{\mathcal H_Y}^2}}$.
– Solution: generalized eigenproblem
  $\begin{pmatrix} O & \widetilde K_X \widetilde K_Y \\ \widetilde K_Y \widetilde K_X & O \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \rho \begin{pmatrix} \widetilde K_X^2 + \varepsilon_N \widetilde K_X & O \\ O & \widetilde K_Y^2 + \varepsilon_N \widetilde K_Y \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$.
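
A sketch that feeds the regularized generalized eigenproblem above to SciPy's symmetric solver. The centered Gram matrices are assumed precomputed, and the small jitter term is my addition to keep the right-hand matrix positive definite; neither is prescribed by the slide.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca(Kx, Ky, eps, jitter=1e-10):
    """First kernel-CCA coefficient vectors (alpha, beta) from centered Gram matrices."""
    n = Kx.shape[0]
    Z = np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky],
                  [Ky @ Kx, Z]])                     # left-hand block matrix (symmetric)
    Rx = Kx @ Kx + eps * Kx
    Ry = Ky @ Ky + eps * Ky
    B = np.block([[Rx, Z], [Z, Ry]]) + jitter * np.eye(2 * n)
    w, V = eigh(A, B)                                # generalized symmetric eigenproblem
    top = V[:, -1]                                   # eigenvector of the largest eigenvalue
    return top[:n], top[n:], w[-1]                   # alpha, beta, regularized correlation
```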

17. Application of kernel CCA
– Application to image retrieval (Hardoon et al. 2004).
  • $X_i$: image; $Y_i$: corresponding text (extracted from the same web page).
  • Idea: use the $d$ eigenvectors $f_1, \dots, f_d$ and $g_1, \dots, g_d$ as feature spaces that capture the dependence between $X$ and $Y$.
  • Given a new word $Y_{new}$, compute its feature vector $\bigl(\langle g_1, \Phi_Y(Y_{new})\rangle, \dots, \langle g_d, \Phi_Y(Y_{new})\rangle\bigr)$ and find the image $X_i$ whose feature vector $\bigl(\langle f_1, \Phi_X(X_i)\rangle, \dots, \langle f_d, \Phi_X(X_i)\rangle\bigr)$ has the highest inner product with it.
– Example text: "at phoenix sky harbor on july 6, 1997. 757-2s7, n907wa, ..."

18. – Example:
  • Gaussian kernel for images.
  • Bag-of-words kernel (frequency of words) for texts.
– Query text: "height: 6-11, weight: 235 lbs, position: forward, born: september 18, 1968, split, croatia college: none"
– Figure: extracted images.

19. Kernel Ridge Regression

20. Ridge regression
– Data: $(X_1, Y_1), \dots, (X_n, Y_n)$ with $X_i \in \mathbf R^m$, $Y_i \in \mathbf R$.
– Ridge regression: linear regression with an $L^2$ penalty,
  $\min_a \sum_{i=1}^n |Y_i - a^T X_i|^2 + \lambda \|a\|^2$
  (the constant term is omitted for simplicity).
– Solution (of a quadratic function):
  $\hat a = (V_{XX} + \lambda I)^{-1} X^T Y$, where $V_{XX} = X^T X$, $X = \begin{pmatrix} X_1^T \\ \vdots \\ X_n^T \end{pmatrix} \in \mathbf R^{n \times m}$, $Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \in \mathbf R^n$.
– Ridge regression is preferred when $V_{XX}$ is (close to) singular.
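
The closed form above in a few lines of NumPy; `X` is the $n \times m$ data matrix with rows $X_i^T$ and `Y` the response vector (a sketch for illustration only).

```python
import numpy as np

def ridge(X, Y, lam):
    """Solve min_a sum_i (Y_i - a^T X_i)^2 + lam * ||a||^2 in closed form."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)
```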

21. Kernel ridge regression
– $(X_1, Y_1), \dots, (X_n, Y_n)$: $X_i$ arbitrary data, $Y_i \in \mathbf R$.
– Kernelization of ridge regression with a positive definite kernel $k$ for $X$:
  $\min_{f \in \mathcal H} \sum_{i=1}^n |Y_i - \langle f, \Phi(X_i)\rangle_{\mathcal H}|^2 + \lambda \|f\|_{\mathcal H}^2$  (ridge regression on $\mathcal H$),
  or equivalently
  $\min_{f \in \mathcal H} \sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda \|f\|_{\mathcal H}^2$  (nonlinear ridge regression).

22. – The solution is given in the form $f = \sum_{j=1}^n c_j \Phi(X_j)$:
  let $f = \sum_{i=1}^n c_i \Phi(X_i) + f_\perp = f_\Phi + f_\perp$, where $f_\Phi \in \mathrm{span}\{\Phi(X_i)\}_{i=1}^n$ and $f_\perp$ lies in its orthogonal complement. Then
  Objective $= \sum_{i=1}^n \bigl(Y_i - \langle f_\Phi + f_\perp, \Phi(X_i)\rangle\bigr)^2 + \lambda\bigl(\|f_\Phi\|^2 + \|f_\perp\|^2\bigr) = \sum_{i=1}^n \bigl(Y_i - \langle f_\Phi, \Phi(X_i)\rangle\bigr)^2 + \lambda\bigl(\|f_\Phi\|^2 + \|f_\perp\|^2\bigr)$.
  The first term does not depend on $f_\perp$, and the second term is minimized when $f_\perp = 0$.
– Objective function in $c$: $\|Y - K_X c\|^2 + \lambda\, c^T K_X c$.
– Solution: $\hat c = (K_X + \lambda I_n)^{-1} Y$; the fitted function is
  $\hat f(x) = Y^T (K_X + \lambda I_n)^{-1} \mathbf k_x$, where $\mathbf k_x = \bigl(k(x, X_1), \dots, k(x, X_n)\bigr)^T$.
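
A minimal sketch of kernel ridge regression as derived above. The Gaussian kernel and its bandwidth are illustrative choices, not fixed by the slide.

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, Y, lam, sigma):
    K = gauss_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)   # c_hat = (K + lam I)^{-1} Y

def kernel_ridge_predict(Xtrain, c, Xtest, sigma):
    return gauss_kernel(Xtest, Xtrain, sigma) @ c          # f_hat(x) = sum_i c_i k(x, X_i)
```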

23. Regularization
– The minimization $\min_f \sum_i \bigl(Y_i - f(X_i)\bigr)^2$ may be attained with zero errors, but the minimizing function may not be unique.
– Regularization:
  $\min_{f \in \mathcal H} \sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda \|f\|_{\mathcal H}^2$.
  • Regularization with a smoothness penalty is preferred for uniqueness and smoothness.
  • The link between the RKHS norm and smoothness is discussed in Sec. IV.

24. Comparison
– Kernel ridge regression vs. local linear regression:
  $Y = 1/(1.5 + \|X\|^2) + Z$, $\; X \sim N(0, I_d)$, $\; Z \sim N(0, 0.1^2)$, $\; n = 100$, 500 runs.
– Kernel ridge regression with the Gaussian kernel; local linear regression with the Epanechnikov kernel ('locfit' in R is used).
– Bandwidth parameters are chosen by CV.
– Figure: mean square errors of the two methods for dimensions of $X$ from 1 to 100.

25. Local linear regression (e.g., Fan and Gijbels 1996)
– $K$: smoothing kernel ($K(x) \ge 0$, $\int K(x)\, dx = 1$; not necessarily positive definite), $K_h(x) = h^{-d} K(x / h)$.
– Local linear regression: $E[Y \mid X = x_0]$ is estimated by $\hat a$, where
  $\min_{a, b} \sum_{i=1}^n \bigl(Y_i - a - b^T (X_i - x_0)\bigr)^2 K_h(X_i - x_0)$.
  • For each $x_0$, this minimization can be solved by linear algebra.
  • The statistical properties of this estimator are well studied.
  • For one-dimensional $X$, it works nicely with some theoretical optimality.
  • But it is weak for high-dimensional data.
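
A sketch of the locally weighted least-squares fit at a single query point $x_0$. The radial form of the Epanechnikov weights and the bandwidth handling are my simplifying assumptions for the example.

```python
import numpy as np

def epanechnikov(u):
    return np.clip(0.75 * (1.0 - u ** 2), 0.0, None)

def local_linear(X, Y, x0, h):
    """Estimate E[Y | X = x0] by locally weighted linear least squares."""
    d = X - x0                                            # (n, m) local coordinates
    w = epanechnikov(np.linalg.norm(d, axis=1) / h)        # kernel weights
    A = np.hstack([np.ones((len(X), 1)), d])               # regress on [1, X_i - x0]
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(sw * A, sw[:, 0] * Y, rcond=None)
    return coef[0]                                         # intercept a_hat = estimate at x0
```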

26. Some topics on kernel methods
• Representer theorem
• Structured data
• Kernel choice
• Low-rank approximation

27. Representer theorem
– $(X_1, Y_1), \dots, (X_n, Y_n)$: data; $k$: positive definite kernel for $X$; $\mathcal H$: the corresponding RKHS; $\Psi$: monotonically increasing function on $\mathbf R_+$.
– Theorem 2.1 (representer theorem, Kimeldorf & Wahba 1970). The solution to the minimization problem
  $\min_{f \in \mathcal H}\; F\bigl((X_1, Y_1, f(X_1)), \dots, (X_n, Y_n, f(X_n))\bigr) + \Psi\bigl(\|f\|_{\mathcal H}\bigr)$
  is attained by $f = \sum_{i=1}^n c_i \Phi(X_i)$ with some $c_1, \dots, c_n \in \mathbf R$.
– The proof is essentially the same as the one for kernel ridge regression. [Exercise: complete the proof.]

28. Structured data
– Structured data: non-vectorial data with some structure.
  • Sequence data (variable length): DNA sequences, proteins (sequences of amino acids), text (sequences of words).
  • Graph data (Koji's lecture): chemical compounds, etc.
  • Tree data: parse trees (e.g., the parse tree of "The cat chased the mouse.").
  • Histograms / probability measures.
– Many kernels use counts of substructures (Haussler 1999).

29. Spectrum kernel
– p-spectrum kernel (Leslie et al. 2002): a positive definite kernel for strings,
  $k_p(s, t)$ = the number of occurrences of common substrings of length $p$ (the inner product of the length-$p$ substring count vectors).
– Example: $s$ = "statistics", $t$ = "pastapistan".
  3-spectrum of $s$: sta, tat, ati, tis, ist, sti, tic, ics; 3-spectrum of $t$: pas, ast, sta, tap, api, pis, ist, sta, tan.

  substring:  sta tat ati tis ist sti tic ics pas ast tap api pis tan
  $\Phi(s)$:   1   1   1   1   1   1   1   1   0   0   0   0   0   0
  $\Phi(t)$:   2   0   0   0   1   0   0   0   1   1   1   1   1   1

  $K_3(s, t) = 1 \cdot 2 + 1 \cdot 1 = 3$.
– A linear-time ($O(p(|s| + |t|))$) algorithm using suffix trees is known (Vishwanathan & Smola 2003).
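
The example above in a few lines of Python, using naive hash-based counting of length-$p$ substrings rather than the linear-time suffix-tree algorithm cited on the slide.

```python
from collections import Counter

def spectrum_kernel(s, t, p):
    """p-spectrum kernel: inner product of length-p substring count vectors."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[sub] * ct[sub] for sub in cs.keys() & ct.keys())

print(spectrum_kernel("statistics", "pastapistan", 3))   # 3, as on the slide
```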

30. – Application: kernel PCA of 'words' with the 3-spectrum kernel.
– Figure: scatter plot of the first two kernel principal components for the words bioinformatics, informatics, mathematics, biology, cybernetics, psychology, physics, statistics, methodology, metaphysics, biostatistics, pioneering, engineering.

31. Choice of kernel
– Choice of kernel (e.g., polynomial or Gaussian) and of its parameters (e.g., the bandwidth parameter of the Gaussian kernel).
– General principles:
  • Reflect the structure of the data (e.g., kernels for structured data).
  • For supervised learning (e.g., SVM): cross-validation.
  • For unsupervised learning (e.g., kernel PCA): no general method exists. Guideline: make or use a relevant supervised problem, and use CV for it.
– Learning a kernel: multiple kernel learning (MKL),
  $k(x, y) = \sum_{i=1}^M c_i k_i(x, y)$, optimize the $c_i$.

32. Low-rank approximation
– The Gram matrix is $n \times n$, where $n$ is the sample size; large $n$ causes computational problems. E.g., inversion and eigendecomposition cost $O(n^3)$ time.
– Low-rank approximation: $K \approx R R^T$, where $R$ is an $n \times r$ matrix ($r < n$).
– The decay of the eigenvalues of a Gram matrix is often quite fast (see Widom 1963, 1964; Bach & Jordan 2002).

33. – Two major methods:
  • Incomplete Cholesky factorization (Fine & Scheinberg 2001): $O(nr^2)$ time and $O(nr)$ space.
  • Nyström approximation (Williams & Seeger 2001): random sampling + eigendecomposition.
– Example: kernel ridge regression, $Y^T (K_X + \lambda I_n)^{-1} \mathbf k_x$, takes $O(n^3)$ time.
  With the low-rank approximation $K_X \approx R R^T$ and the Woodbury formula*,
  $Y^T (K_X + \lambda I_n)^{-1} \mathbf k_x \approx Y^T (R R^T + \lambda I_n)^{-1} \mathbf k_x = \dfrac{1}{\lambda}\Bigl( Y^T \mathbf k_x - Y^T R\, (R^T R + \lambda I_r)^{-1} R^T \mathbf k_x \Bigr)$,
  which takes $O(r^2 n + r^3)$ time.
– * Woodbury (Sherman-Morrison-Woodbury) formula: $(A + UV)^{-1} = A^{-1} - A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}$.
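
A sketch of the low-rank trick above: a Nyström factor $R$ with $R R^T \approx K$ built from a random column subset, then the Woodbury step applied to the kernel ridge predictor. The subset size, the random selection, and the pseudo-inverse square root of the subsampled block are my choices, not prescribed by the slide.

```python
import numpy as np

def nystrom_factor(K, r, rng):
    """R (n x r) with R R^T approximately equal to K, from a random column subset."""
    n = K.shape[0]
    idx = rng.choice(n, size=r, replace=False)
    C = K[:, idx]                          # n x r block of columns
    W = K[np.ix_(idx, idx)]                # r x r subsampled block
    lam, U = np.linalg.eigh(W)
    lam = np.clip(lam, 1e-12, None)        # guard small/negative eigenvalues
    return C @ U @ np.diag(1.0 / np.sqrt(lam)) @ U.T   # R R^T ~ C W^+ C^T

def krr_predict_lowrank(R, Y, kx, lam):
    """Approximate Y^T (K + lam I)^{-1} k_x via (R R^T + lam I)^{-1} and Woodbury."""
    r = R.shape[1]
    inner = np.linalg.solve(R.T @ R + lam * np.eye(r), R.T @ kx)   # (R^T R + lam I)^{-1} R^T k_x
    return (Y @ kx - Y @ (R @ inner)) / lam
```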

34. Other kernel methods
– Kernel Fisher discriminant analysis (kernel FDA) (Mika et al. 1999)
– Kernel logistic regression (Roth 2001; Zhu & Hastie 2005)
– Kernel partial least squares (kernel PLS) (Rosipal & Trejo 2001)
– Kernel k-means clustering (Dhillon et al. 2004)
– etc.
– Variants of SVM: see Section III.

35. Summary: properties of kernel methods
– Various classical linear methods can be kernelized: linear algorithms on an RKHS.
– The solution typically has the form $f = \sum_{i=1}^n c_i \Phi(X_i)$ (representer theorem).
– The problem is reduced to the manipulation of Gram matrices of size $n$ (the sample size).
  • Advantage for high-dimensional data.
  • For a large number of data points, low-rank approximation is used effectively.
– Structured data:
  • a kernel can be defined on any set;
  • kernel methods can therefore be applied to any type of data.

36. References
Akaho, S. (2000) Kernel canonical correlation analysis. Proc. 3rd Workshop on Induction-based Information Sciences (IBIS2000). (In Japanese.)
Bach, F.R. and M.I. Jordan. (2002) Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48.
Dhillon, I.S., Y. Guan, and B. Kulis. (2004) Kernel k-means, spectral clustering and normalized cuts. Proc. 10th ACM SIGKDD Intern. Conf. Knowledge Discovery and Data Mining (KDD), 551–556.
Fan, J. and I. Gijbels. (1996) Local Polynomial Modelling and Its Applications. Chapman & Hall/CRC.
Fine, S. and K. Scheinberg. (2001) Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264.
Bakır, G., T. Hofmann, B. Schölkopf, A.J. Smola, B. Taskar, and S.V.N. Vishwanathan (Eds.). (2007) Predicting Structured Data. MIT Press.
Hardoon, D.R., S. Szedmak, and J. Shawe-Taylor. (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664.
Haussler, D. (1999) Convolution kernels on discrete structures. Tech. Report UCSC-CRL-99-10, Department of Computer Science, University of California at Santa Cruz.

37.
Leslie, C., E. Eskin, and W.S. Noble. (2002) The spectrum kernel: A string kernel for SVM protein classification. Proc. Pacific Symposium on Biocomputing, 564–575.
Melzer, T., M. Reiter, and H. Bischof. (2001) Nonlinear feature extraction using generalized canonical correlation analysis. Proc. Intern. Conf. Artificial Neural Networks (ICANN 2001), 353–360.
Mika, S., G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. (1999) Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, eds., Neural Networks for Signal Processing, volume IX, 41–48. IEEE.
Rosipal, R. and L.J. Trejo. (2001) Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research, 2:97–123.
Roth, V. (2001) Probabilistic discriminative kernel classifiers for multi-class problems. In Pattern Recognition: Proc. 23rd DAGM Symposium, 246–253. Springer.
Schölkopf, B., A. Smola, and K.-R. Müller. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319.
Schölkopf, B. and A. Smola. (2002) Learning with Kernels. MIT Press.
Vishwanathan, S.V.N. and A.J. Smola. (2003) Fast kernels for string and tree matching. Advances in Neural Information Processing Systems 15, 569–576. MIT Press.
Williams, C.K.I. and M. Seeger. (2001) Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems, 13:682–688.

38.
Widom, H. (1963) Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109:278–295.
Widom, H. (1964) Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17:215–229.

39. Appendix

40. Exercise for kernel PCA
– $\|f\|_{\mathcal H}^2 = c^T \widetilde K_X\, c$:
  $\Bigl\langle \sum_{i=1}^n c_i \widetilde\Phi(X_i),\, \sum_{j=1}^n c_j \widetilde\Phi(X_j) \Bigr\rangle = \sum_{i, j} c_i c_j \bigl\langle \widetilde\Phi(X_i), \widetilde\Phi(X_j) \bigr\rangle = c^T \widetilde K_X\, c$.
– $\mathrm{Var}\bigl[\langle f, \widetilde\Phi(X)\rangle\bigr] = \frac{1}{n}\, c^T \widetilde K_X^2\, c$:
  $\mathrm{Var}\bigl[\langle f, \widetilde\Phi(X)\rangle\bigr] = \frac{1}{n} \sum_{j=1}^n \Bigl( \sum_{i=1}^n c_i \bigl\langle \widetilde\Phi(X_i), \widetilde\Phi(X_j) \bigr\rangle \Bigr)^2 = \frac{1}{n} \sum_{j=1}^n \sum_{i=1}^n \sum_{h=1}^n c_i c_h\, \widetilde K_{ij} \widetilde K_{hj} = \frac{1}{n}\, c^T \widetilde K_X^2\, c. \;\blacksquare$

41. III. Support Vector Machines: A Brief Introduction
Kenji Fukumizu, The Institute of Statistical Mathematics / Graduate University for Advanced Studies
September 6-7, Machine Learning Summer School 2012, Kyoto

42. Large margin classifier
– $(X_1, Y_1), \dots, (X_n, Y_n)$: training data
  • $X_i$: input ($m$-dimensional)
  • $Y_i \in \{\pm 1\}$: binary teaching data (class labels)
– Linear classifier: $f_w(x) = w^T x + b$, $\quad h(x) = \mathrm{sgn}\bigl(f_w(x)\bigr)$.
– We wish to construct $f_w(x)$ from the training data so that a new data point $x$ is correctly classified (figure: the decision boundary $f_w(x) = 0$ separating the regions $f_w(x) \ge 0$ and $f_w(x) < 0$).

43. Large margin criterion
– Assumption: the data are linearly separable.
– Among the infinitely many separating hyperplanes, choose the one giving the largest margin.
– Margin = distance between the two classes measured along the direction of $w$.
– Support vector machine: the hyperplane giving the largest margin; the classifier is the middle of the margin.
– The "supporting points" on the two boundaries are called support vectors (see figure).

44. – Fix the scale (rescaling of $(w, b)$ does not change the plane):
  $\min_i\, (w^T X_i + b) = 1$ over $Y_i = +1$, $\qquad \max_i\, (w^T X_i + b) = -1$ over $Y_i = -1$.
– Then Margin $= \dfrac{2}{\|w\|}$. [Exercise: prove this.]

45. Support vector machine (linear, hard margin) (Boser, Guyon & Vapnik 1992)
– Objective function:
  $\max_{w, b}\; \dfrac{1}{\|w\|}$ subject to $w^T X_i + b \ge 1$ if $Y_i = 1$ and $w^T X_i + b \le -1$ if $Y_i = -1$.
– SVM (hard margin):
  $\min_{w, b}\; \|w\|^2$ subject to $Y_i (w^T X_i + b) \ge 1 \;(\forall i)$.
– This is a quadratic program (QP):
  • minimization of a quadratic function under linear constraints;
  • convex, no local optima (Vandenberghe's lecture);
  • many solvers available (Chih-Jen Lin's lecture).

46. Soft-margin SVM
– "Linear separability" is too strong an assumption; relax it:
  hard margin: $Y_i (w^T X_i + b) \ge 1$;  soft margin: $Y_i (w^T X_i + b) \ge 1 - \xi_i$, $\;\xi_i \ge 0$.
– SVM (soft margin):
  $\min_{w, b, \xi}\; \|w\|^2 + C \sum_{i=1}^n \xi_i$ subject to $Y_i (w^T X_i + b) \ge 1 - \xi_i$, $\;\xi_i \ge 0 \;(\forall i)$.
– This is also a QP.
– The coefficient $C$ must be given; cross-validation is often used to choose it.

47. – Figure: the margin boundaries $w^T x + b = 1$ and $w^T x + b = -1$; a point $X_i$ with nonzero slack satisfies $w^T X_i + b = 1 - \xi_i$ (for $Y_i = +1$) or $w^T X_j + b = -1 + \xi_j$ (for $Y_j = -1$).

48. SVM and regularization
– Soft-margin SVM is equivalent to the regularization problem
  $\min_{w, b}\; \sum_{i=1}^n \bigl(1 - Y_i (w^T X_i + b)\bigr)_+ + \lambda \|w\|^2$, where $(z)_+ = \max\{0, z\}$
  (first term: loss; second term: regularization).
  • Loss function: hinge loss $\ell(f(x), y) = \bigl(1 - y f(x)\bigr)_+$.
  • c.f. ridge regression (squared error loss): $\min_{w, b} \sum_{i=1}^n \bigl(Y_i - (w^T X_i + b)\bigr)^2 + \lambda \|w\|^2$.
– [Exercise] Confirm the above equivalence.
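
Because of this equivalence, the soft-margin objective can also be minimized directly in the regularized form. The following plain subgradient-descent sketch for the linear case is only an illustration of the hinge-loss formulation, not the QP approach of the slides; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def linear_svm_subgradient(X, Y, lam, lr=0.01, n_iter=2000):
    """Minimize sum_i (1 - Y_i (w^T X_i + b))_+ + lam * ||w||^2 by subgradient descent."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iter):
        margin = Y * (X @ w + b)
        active = margin < 1.0                              # points with nonzero hinge loss
        grad_w = -(Y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -Y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```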

49. Kernelization: nonlinear SVM
– $(X_1, Y_1), \dots, (X_n, Y_n)$: training data
  • $X_i$: input on an arbitrary space $\Omega$
  • $Y_i \in \{\pm 1\}$: binary teaching data
– Kernelization: a positive definite kernel $k$ on $\Omega$ (RKHS $\mathcal H$), feature vectors $\Phi(X_1), \dots, \Phi(X_n)$.
– A linear classifier on $\mathcal H$ gives a nonlinear classifier on $\Omega$:
  $h(x) = \mathrm{sgn}\bigl(\langle f, \Phi(x)\rangle + b\bigr) = \mathrm{sgn}\bigl(f(x) + b\bigr), \quad f \in \mathcal H$.

50. Nonlinear SVM
– $\min_{f, b, \xi}\; \|f\|^2 + C \sum_{i=1}^n \xi_i$ subject to $Y_i\bigl(\langle f, \Phi(X_i)\rangle + b\bigr) \ge 1 - \xi_i$, $\;\xi_i \ge 0 \;(\forall i)$,
  or equivalently
  $\min_{f, b}\; \sum_{i=1}^n \bigl(1 - Y_i (f(X_i) + b)\bigr)_+ + \lambda \|f\|_{\mathcal H}^2$.
– By the representer theorem, $f = \sum_{j=1}^n w_j \Phi(X_j)$.
– Nonlinear SVM (soft margin):
  $\min_{w, b, \xi}\; w^T K w + C \sum_{i=1}^n \xi_i$ subject to $Y_i\bigl((K w)_i + b\bigr) \ge 1 - \xi_i$, $\;\xi_i \ge 0 \;(\forall i)$.
– This is again a QP.

51. Dual problem
– Primal: $\min_{w, b, \xi}\; w^T K w + C \sum_{i=1}^n \xi_i$ subject to $Y_i\bigl((K w)_i + b\bigr) \ge 1 - \xi_i$, $\;\xi_i \ge 0 \;(\forall i)$.
– By the Lagrange multiplier method, the dual problem is
  SVM (dual problem): $\max_{\alpha}\; \sum_{i=1}^n \alpha_i - \sum_{i, j=1}^n \alpha_i \alpha_j Y_i Y_j K_{ij}$ subject to $0 \le \alpha_i \le C \;(\forall i)$, $\;\sum_{i=1}^n Y_i \alpha_i = 0$.
  • The dual problem is often preferred.
  • The classifier is expressed by $f^*(x) + b^* = \sum_{i=1}^n \alpha_i^* Y_i\, k(x, X_i) + b^*$.
– Sparse expression: only the data with $0 < \alpha_i^* \le C$ appear in the summation; these are the support vectors.
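
A small sketch that hands the dual problem on this slide (with the slide's scaling of the quadratic term) to a generic constrained optimizer from SciPy; for real problems a dedicated QP or SMO solver would be used, as the slides note.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(K, Y, C):
    """Maximize sum_i a_i - sum_ij a_i a_j Y_i Y_j K_ij, 0 <= a_i <= C, sum_i Y_i a_i = 0."""
    n = len(Y)
    Q = (Y[:, None] * Y[None, :]) * K

    def neg_dual(a):
        return -(a.sum() - a @ Q @ a)

    cons = {"type": "eq", "fun": lambda a: a @ Y}
    res = minimize(neg_dual, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=[cons])
    return res.x                                   # dual coefficients alpha_i
```

With the dual coefficients, the classifier is evaluated as $\sum_i \alpha_i^* Y_i k(x, X_i) + b^*$, and $b^*$ can be recovered from any support vector with $0 < \alpha_i^* < C$ via $Y_i\bigl(f^*(X_i) + b^*\bigr) = 1$ (see the KKT conditions on the next slide).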

52. KKT conditions
Theorem. The solutions of the primal and dual problems of the SVM are characterized by the following conditions:
  (1) $1 - Y_i\bigl(f^*(X_i) + b^*\bigr) - \xi_i^* \le 0 \;(\forall i)$  [primal constraint]
  (2) $\xi_i^* \ge 0 \;(\forall i)$  [primal constraint]
  (3) $0 \le \alpha_i^* \le C \;(\forall i)$  [dual constraint]
  (4) $\alpha_i^*\bigl(1 - Y_i (f^*(X_i) + b^*) - \xi_i^*\bigr) = 0 \;(\forall i)$  [complementary slackness]
  (5) $\xi_i^*\bigl(C - \alpha_i^*\bigr) = 0 \;(\forall i)$  [complementary slackness]
  (6) $\sum_{j=1}^n K_{ij} w_j^* - \sum_{j=1}^n \alpha_j^* Y_j K_{ij} = 0 \;(\forall i)$  [stationarity]
  (7) $\sum_{j=1}^n \alpha_j^* Y_j = 0$  [stationarity]

53. Sparse expression
– $F^*(x) = f^*(x) + b^* = \sum_{i:\, X_i \text{ support vector}} \alpha_i^* Y_i\, k(x, X_i) + b^*$.
– Two types of support vectors (see figure):
  • $0 < \alpha_i < C$: on the margin boundary, $Y_i F^*(X_i) = 1$;
  • $\alpha_i = C$: $Y_i F^*(X_i) \le 1$.

54. Summary of SVM
– One of the kernel methods:
  • kernelization of the linear large margin classifier;
  • computation depends on Gram matrices of size $n$.
– Quadratic program:
  • no local optima;
  • many solvers are available;
  • further efficient optimization methods exist (e.g., SMO, Platt 1999).
– Sparse representation: the solution is written in terms of a small number of support vectors.
– Regularization: the objective function can be regarded as regularization with the hinge loss function.

55. – NOT discussed on SVM in this lecture:
  • Many successful applications.
  • Multi-class extensions:
    – combination of binary classifiers (1-vs-1, 1-vs-rest);
    – generalization of the large margin criterion: Crammer & Singer (2001), Mangasarian & Musicant (2001), Lee, Lin & Wahba (2004), etc.
  • Other extensions:
    – support vector regression (Vapnik 1995);
    – $\nu$-SVM (Schölkopf et al. 2000);
    – structured output (Collins & Duffy 2001, Taskar et al. 2004, Altun et al. 2003, etc.);
    – one-class SVM (Schölkopf et al. 2001).
  • Optimization: solving the primal problem.
  • Implementation (Chih-Jen Lin's lecture).

56. References
Altun, Y., I. Tsochantaridis, and T. Hofmann. (2003) Hidden Markov support vector machines. Proc. 20th Intern. Conf. Machine Learning.
Boser, B.E., I.M. Guyon, and V.N. Vapnik. (1992) A training algorithm for optimal margin classifiers. In D. Haussler, ed., Fifth Annual ACM Workshop on Computational Learning Theory, 144–152. ACM Press.
Crammer, K. and Y. Singer. (2001) On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292.
Collins, M. and N. Duffy. (2001) Convolution kernels for natural language. Advances in Neural Information Processing Systems 14, 625–632. MIT Press.
Mangasarian, O.L. and D.R. Musicant. (2001) Lagrangian support vector machines. Journal of Machine Learning Research, 1:161–177.
Lee, Y., Y. Lin, and G. Wahba. (2004) Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81.
Schölkopf, B., A. Smola, R.C. Williamson, and P.L. Bartlett. (2000) New support vector algorithms. Neural Computation, 12(5):1207–1245.

57.
Schölkopf, B., J.C. Platt, J. Shawe-Taylor, R.C. Williamson, and A.J. Smola. (2001) Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471.
Vapnik, V.N. (1995) The Nature of Statistical Learning Theory. Springer.
Platt, J. (1999) Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, eds., Advances in Kernel Methods - Support Vector Learning, 185–208. MIT Press.
Books on application domains:
– Lee, S.-W. and A. Verri (Eds.). (2002) Pattern Recognition with Support Vector Machines: First International Workshop, SVM 2002, Niagara Falls, Canada, August 10, 2002, Proceedings. Lecture Notes in Computer Science 2388, Springer.
– Joachims, T. (2002) Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Springer.
– Schölkopf, B., K. Tsuda, and J.-P. Vert (Eds.). (2004) Kernel Methods in Computational Biology. MIT Press.

58. IV. Theoretical Backgrounds of Kernel Methods
Kenji Fukumizu, The Institute of Statistical Mathematics / Graduate University for Advanced Studies
September 6-7, Machine Learning Summer School 2012, Kyoto

59. $\mathbf C$-valued positive definite kernel
Definition. Let $\Omega$ be a set. $k : \Omega \times \Omega \to \mathbf C$ is a positive definite kernel if for arbitrary $x_1, \dots, x_n \in \Omega$ and $c_1, \dots, c_n \in \mathbf C$,
  $\sum_{i, j=1}^n c_i \overline{c_j}\, k(x_i, x_j) \ge 0$.
Remark: from the above condition, the Gram matrix $\bigl(k(x_i, x_j)\bigr)_{ij}$ is necessarily Hermitian, i.e., $k(y, x) = \overline{k(x, y)}$. [Exercise]

60. Operations that preserve positive definiteness
Proposition 4.1. If $k_i : X \times X \to \mathbf C$ ($i = 1, 2, \dots$) are positive definite kernels, then so are the following:
  1. (positive combination) $a k_1 + b k_2$ ($a, b \ge 0$);
  2. (product) $k_1 k_2 : (x, y) \mapsto k_1(x, y)\, k_2(x, y)$;
  3. (limit) $\lim_{i \to \infty} k_i(x, y)$, assuming the limit exists.
Proof. 1 and 3 are trivial from the definition. For 2, it suffices to prove that the Hadamard product (element-wise product) of two positive semidefinite matrices is positive semidefinite.
Remark: the set of positive definite kernels on $X$ is thus a closed cone, where the topology is defined by pointwise convergence.

61. Proposition 4.2. Let $A$ and $B$ be positive semidefinite Hermitian matrices. Then the Hadamard product $K = A * B$ (element-wise product) is positive semidefinite.
Proof. Eigendecomposition of $A$: $A = U \Lambda U^*$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$, i.e.,
  $A_{ij} = \sum_{p=1}^n \lambda_p\, U_{ip} \overline{U_{jp}}$ ($\lambda_p \ge 0$ by the positive semidefiniteness).
Then
  $\sum_{i, j=1}^n c_i \overline{c_j} K_{ij} = \sum_{p=1}^n \lambda_p \sum_{i, j=1}^n c_i \overline{c_j}\, U_{ip} \overline{U_{jp}}\, B_{ij} = \sum_{p=1}^n \lambda_p \sum_{i, j=1}^n (c_i U_{ip})\, \overline{(c_j U_{jp})}\, B_{ij} \ge 0. \;\blacksquare$

62. Normalization
Proposition 4.3. Let $k$ be a positive definite kernel on $\Omega$, and $f : \Omega \to \mathbf C$ an arbitrary function. Then
  $\tilde k(x, y) := f(x)\, k(x, y)\, \overline{f(y)}$
is positive definite. In particular, $f(x)\overline{f(y)}$ is a positive definite kernel.
– Proof: [Exercise]
– Example (normalization): $\tilde k(x, y) = \dfrac{k(x, y)}{\sqrt{k(x, x)\, k(y, y)}}$ is positive definite, and $\|\widetilde\Phi(x)\| = 1$ for any $x \in \Omega$.
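
The normalization in the example above, applied to a real Gram matrix, in a couple of lines of NumPy (a sketch for illustration only):

```python
import numpy as np

def normalize_gram(K):
    """K_tilde(x, y) = K(x, y) / sqrt(K(x, x) K(y, y)); the diagonal becomes 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```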

63. Proofs of positive definiteness
– Euclidean inner product $x^T y$: easy (Prop. 1.1).
– Polynomial kernel $(x^T y + c)^d$ ($c \ge 0$):
  $(x^T y + c)^d = (x^T y)^d + a_1 (x^T y)^{d-1} + \dots + a_d$, with $a_i \ge 0$:
  a product and non-negative combination of positive definite kernels.
– Gaussian RBF kernel $\exp\bigl(-\|x - y\|^2 / (2\sigma^2)\bigr)$:
  $\exp\Bigl(-\dfrac{\|x - y\|^2}{2\sigma^2}\Bigr) = e^{-\|x\|^2 / (2\sigma^2)}\; e^{x^T y / \sigma^2}\; e^{-\|y\|^2 / (2\sigma^2)}$.
  Note that $e^{x^T y / \sigma^2} = 1 + \dfrac{1}{1!\, \sigma^2}\, x^T y + \dfrac{1}{2!\, \sigma^4}\, (x^T y)^2 + \dots$
  is positive definite (Prop. 4.1). Proposition 4.3 then completes the proof.
– The Laplacian kernel is discussed later.

64. Shift-invariant kernel
– A positive definite kernel $k(x, y)$ on $\mathbf R^m$ is called shift-invariant if it is of the form $k(x, y) = \phi(x - y)$.
– Examples: Gaussian and Laplacian kernels.
– Fourier kernel ($\mathbf C$-valued positive definite kernel): for each $\omega$,
  $k_F(x, y) = \exp\bigl(\sqrt{-1}\,\omega^T (x - y)\bigr) = \exp\bigl(\sqrt{-1}\,\omega^T x\bigr)\, \overline{\exp\bigl(\sqrt{-1}\,\omega^T y\bigr)}$ (Prop. 4.3).
– If $k(x, y) = \phi(x - y)$ is positive definite, the function $\phi$ is called positive definite.

65. Bochner's theorem
Theorem 4.4 (Bochner). Let $\phi$ be a continuous function on $\mathbf R^m$. Then $\phi$ is ($\mathbf C$-valued) positive definite if and only if there is a finite non-negative Borel measure $\Lambda$ on $\mathbf R^m$ such that
  $\phi(z) = \int \exp\bigl(\sqrt{-1}\,\omega^T z\bigr)\, d\Lambda(\omega)$.
– Bochner's theorem characterizes all continuous shift-invariant positive definite kernels; the functions $\exp(\sqrt{-1}\,\omega^T z)$, $\omega \in \mathbf R^m$, are the generators of the cone (see Prop. 4.1).
– $\Lambda$ is the inverse Fourier (or Fourier-Stieltjes) transform of $\phi$.
– Roughly speaking, the shift-invariant positive definite functions are the class of functions with non-negative Fourier transform.
– Sufficiency is easy: $\sum_{i, j} c_i \overline{c_j}\, \phi(z_i - z_j) = \int \bigl| \sum_i c_i e^{\sqrt{-1}\,\omega^T z_i} \bigr|^2\, d\Lambda(\omega) \ge 0$. Necessity is difficult.

66. RKHS in the frequency domain
Suppose a (shift-invariant) kernel $k$ has the form
  $k(x, y) = \int \exp\bigl(\sqrt{-1}\,\omega^T (x - y)\bigr)\, \rho(\omega)\, d\omega$, $\quad \rho(\omega) > 0$.
Then the RKHS $\mathcal H_k$ is given by
  $\mathcal H_k = \Bigl\{ f \in L^2(\mathbf R^m, dx) \;\Bigm|\; \int \dfrac{|\hat f(\omega)|^2}{\rho(\omega)}\, d\omega < \infty \Bigr\}$, $\qquad \langle f, g \rangle = \int \dfrac{\hat f(\omega)\, \overline{\hat g(\omega)}}{\rho(\omega)}\, d\omega$,
where $\hat f$ is the Fourier transform of $f$:
  $\hat f(\omega) = \dfrac{1}{(2\pi)^m} \int f(x)\, \exp\bigl(-\sqrt{-1}\,\omega^T x\bigr)\, dx$.

67. Gaussian and Laplacian kernels
– Gaussian kernel:
  $k_G(x, y) = \exp\Bigl(-\dfrac{\|x - y\|^2}{2\sigma^2}\Bigr)$, $\qquad \rho_G(\omega) = \Bigl(\dfrac{\sigma}{\sqrt{2\pi}}\Bigr)^m \exp\Bigl(-\dfrac{\sigma^2 \|\omega\|^2}{2}\Bigr)$,
  $\mathcal H_{k_G} = \Bigl\{ f \in L^2(\mathbf R^m, dx) \;\Bigm|\; \int |\hat f(\omega)|^2 \exp\Bigl(\dfrac{\sigma^2 \|\omega\|^2}{2}\Bigr)\, d\omega < \infty \Bigr\}$,
  $\langle f, g \rangle = \Bigl(\dfrac{\sqrt{2\pi}}{\sigma}\Bigr)^m \int \hat f(\omega)\, \overline{\hat g(\omega)}\, \exp\Bigl(\dfrac{\sigma^2 \|\omega\|^2}{2}\Bigr)\, d\omega$.
– Laplacian kernel (on $\mathbf R$):
  $k_L(x, y) = \exp\bigl(-\beta |x - y|\bigr)$, $\qquad \rho_L(\omega) = \dfrac{1}{\pi} \cdot \dfrac{\beta}{\omega^2 + \beta^2}$,
  $\mathcal H_{k_L} = \Bigl\{ f \in L^2(\mathbf R, dx) \;\Bigm|\; \int |\hat f(\omega)|^2\, (\omega^2 + \beta^2)\, d\omega < \infty \Bigr\}$,
  $\langle f, g \rangle = \dfrac{\pi}{\beta} \int \hat f(\omega)\, \overline{\hat g(\omega)}\, (\omega^2 + \beta^2)\, d\omega$.
– The required decay of $\hat f$ at high frequencies is different for the Gaussian and Laplacian kernels.

68. RKHS of the polynomial kernel
– Polynomial kernel on $\mathbf R$: $k_p(x, y) = (xy + c)^d$, $\; c \ge 0$, $d \in \mathbf N$.
  $k_p(x, z_0) = z_0^d\, x^d + \binom{d}{1} c\, z_0^{d-1} x^{d-1} + \binom{d}{2} c^2 z_0^{d-2} x^{d-2} + \dots + c^d$.
  The span of these functions (over $z_0$) is the space of polynomials of degree at most $d$.
Proposition 4.5. If $c \ne 0$, the RKHS is the space of polynomials of degree at most $d$.
[Proof: exercise. Hint: find $b_i$ satisfying $\sum_{i=0}^d b_i\, k(x, z_i) = \sum_{i=0}^d a_i x^i$ as the solution of a linear equation.]

69. Sum and product
$(\mathcal H_1, k_1)$, $(\mathcal H_2, k_2)$: two RKHSs and positive definite kernels on $\Omega$.
– Sum: the RKHS of $k_1 + k_2$ is
  $\mathcal H_1 + \mathcal H_2 = \{ f : \Omega \to \mathbf R \mid \exists f_1 \in \mathcal H_1, \exists f_2 \in \mathcal H_2,\; f = f_1 + f_2 \}$,
  $\|f\|^2 = \min\bigl\{ \|f_1\|_{\mathcal H_1}^2 + \|f_2\|_{\mathcal H_2}^2 \;\bigm|\; f = f_1 + f_2,\; f_1 \in \mathcal H_1, f_2 \in \mathcal H_2 \bigr\}$.
– Product: the RKHS of $k_1 k_2$ is
  $\mathcal H_1 \otimes \mathcal H_2$, the tensor product as a vector space;
  $\bigl\{ f = \sum_{i=1}^n f_i g_i \mid f_i \in \mathcal H_1, g_i \in \mathcal H_2 \bigr\}$ is dense in $\mathcal H_1 \otimes \mathcal H_2$, with
  $\Bigl\langle \sum_{i=1}^n f_i^1 g_i^1,\, \sum_{j=1}^m f_j^2 g_j^2 \Bigr\rangle = \sum_{i=1}^n \sum_{j=1}^m \langle f_i^1, f_j^2 \rangle_{\mathcal H_1}\, \langle g_i^1, g_j^2 \rangle_{\mathcal H_2}$.

70. Summary of Section IV
– Positive definiteness of kernels is preserved by
  • non-negative combinations,
  • products,
  • pointwise limits,
  • normalization.
– Bochner's theorem: characterization of the continuous shift-invariant kernels on $\mathbf R^m$.
– Explicit forms of RKHSs:
  • the RKHS of a shift-invariant kernel has an explicit expression in the frequency domain;
  • polynomial kernels give RKHSs of polynomials;
  • the RKHSs of sums and products can be given explicitly.

71. References
Aronszajn, N. (1950) Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.
Saitoh, S. (1997) Integral Transforms, Reproducing Kernels, and Their Applications. Addison Wesley Longman.

72. Solution to exercises
– $\mathbf C$-valued positive definiteness (the Gram matrix is Hermitian):
  Using the definition with a single point, $k(x, x)$ is real and non-negative for all $x$. For any $x$ and $y$, applying the definition with the points $(x, y)$ and coefficients $(c, 1)$, $c \in \mathbf C$, gives
  $|c|^2 k(x, x) + c\, k(x, y) + \bar c\, k(y, x) + k(y, y) \ge 0$.
  Since the left-hand side is real, it equals its complex conjugate
  $|c|^2 k(x, x) + \bar c\, \overline{k(x, y)} + c\, \overline{k(y, x)} + k(y, y)$.
  Subtracting the two expressions,
  $c\bigl(k(x, y) - \overline{k(y, x)}\bigr) + \bar c\bigl(k(y, x) - \overline{k(x, y)}\bigr) = 0$ for every $c \in \mathbf C$.
  Taking $c = 1$ and $c = \sqrt{-1}$ shows that $k(y, x) - \overline{k(x, y)} = 0$, i.e., $k(y, x) = \overline{k(x, y)}$.

73. V. Nonparametric Inference with Positive Definite Kernels
Kenji Fukumizu, The Institute of Statistical Mathematics / Graduate University for Advanced Studies
September 6-7, Machine Learning Summer School 2012, Kyoto
