
Kernel Methods - Lei Tang, Arizona State University, Jul. 26th, 2007 - PowerPoint PPT Presentation



  1. Kernel Methods. Lei Tang, Arizona State University. Jul. 26th, 2007.

  2. Introduction. Linear parametric models for regression and classification learn a fixed set of parameters from the training data and then discard it. Memory-based methods, such as Parzen probability density estimation and k-nearest neighbor, instead store the entire training set in order to make predictions for future data: they are fast to “train” but slow at prediction. Is it possible to connect these two different formulations?


  5. Dual Representations. Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally. Consider the regularized sum-of-squares error

     J(w) = \frac{1}{2} \sum_{n=1}^{N} \left( w^T \phi(x_n) - t_n \right)^2 + \frac{\lambda}{2} w^T w    (1)

  6. The derivative with respect to w is

     \nabla J(w) = \sum_{n=1}^{N} \left( w^T \phi(x_n) - t_n \right) \phi(x_n) + \lambda w = 0

     \Rightarrow \quad w = -\frac{1}{\lambda} \sum_{n=1}^{N} \left( w^T \phi(x_n) - t_n \right) \phi(x_n) = \sum_{n=1}^{N} a_n \phi(x_n) = \Phi^T a,

     where a_n = -\frac{1}{\lambda} \left( w^T \phi(x_n) - t_n \right).

  7. Substituting w = \Phi^T a into J(w), and writing K = \Phi \Phi^T for the Gram matrix,

     J(w) = \frac{1}{2} (\Phi w - t)^T (\Phi w - t) + \frac{\lambda}{2} w^T w

     J(a) = \frac{1}{2} a^T \Phi\Phi^T \Phi\Phi^T a - a^T \Phi\Phi^T t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi\Phi^T a
          = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a.

     Setting the gradient of J(a) with respect to a to zero gives

     a = (K + \lambda I_N)^{-1} t,

     and the prediction for a new input x becomes

     y(x) = w^T \phi(x) = a^T \Phi \phi(x) = a^T k(x) = k(x)^T (K + \lambda I_N)^{-1} t,

     where k(x) is the vector with elements k_n(x) = k(x_n, x) = \phi(x_n)^T \phi(x).
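A minimal NumPy sketch of this dual solution (kernel ridge regression) follows: it forms the Gram matrix K, solves a = (K + \lambda I_N)^{-1} t, and predicts with y(x) = k(x)^T a. The Gaussian (RBF) kernel, the toy sine data, and all variable names are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x1_i - x2_j||^2). (Assumed kernel choice.)"""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)

# Toy 1-D regression data (made up for illustration).
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 1))
t_train = np.sin(X_train).ravel() + 0.1 * rng.normal(size=30)

lam = 0.1                                            # regularization parameter lambda
K = rbf_kernel(X_train, X_train)                     # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(30), t_train)   # a = (K + lambda I_N)^{-1} t

# Prediction y(x) = k(x)^T a = sum_n a_n k(x_n, x) for new inputs.
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_new = rbf_kernel(X_new, X_train) @ a
print(y_new)
```

Note that np.linalg.solve is used instead of forming the explicit inverse, which is the numerically preferable way to apply (K + \lambda I_N)^{-1}.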


  10. Advantages of dual methods. The dual formulation allows the solution to be expressed entirely in terms of the kernel function k(x, x'). In the dual formulation we need to invert an N × N matrix, a = (K + \lambda I_N)^{-1} t, whereas in the original parameter space we need to invert an M × M matrix, w = (\lambda I_M + \Phi^T \Phi)^{-1} \Phi^T t. If the number of instances N is smaller than the feature dimensionality M, the dual formulation is preferred. The dual formulation also works directly on kernels and avoids the explicit introduction of the feature vector \phi(x). A comparison of the two solutions is sketched below.
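The trade-off between the two matrix inversions can be checked directly when an explicit (finite) feature map is available. The sketch below uses a linear kernel, K = \Phi\Phi^T, with made-up data shapes (N = 20 instances, M = 500 features, so N < M): the primal solution solves an M × M system, the dual an N × N system, and both yield the same predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 500                      # few instances, many features: N < M favors the dual
Phi = rng.normal(size=(N, M))       # design matrix, row n is phi(x_n)^T
t = rng.normal(size=N)
lam = 0.5

# Primal solution: solve an M x M system.
w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Dual solution: solve an N x N system, with K = Phi Phi^T the linear-kernel Gram matrix.
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)

print(np.allclose(w, Phi.T @ a))        # True: w = Phi^T a
print(np.allclose(Phi @ w, K @ a))      # True: identical training predictions
```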


  14. The Representer Theorem. More general case: denote by \Omega : [0, \infty) \to \mathbb{R} a strictly monotonically increasing function, by X a set, and by c an arbitrary loss function. Then each minimizer f \in H of the regularized risk

      c\big( (x_1, t_1, f(x_1)), \ldots, (x_N, t_N, f(x_N)) \big) + \Omega( \|f\|_H )

      admits a representation of the form

      f(x) = \sum_{n=1}^{N} a_n k(x_n, x).

      To be proved later ...

  15. A toy example. Define

      \phi([x]_1, [x]_2) = \left( [x]_1^2, [x]_2^2, \sqrt{2}\,[x]_1 [x]_2 \right)   or   \phi([x]_1, [x]_2) = \left( [x]_1^2, [x]_2^2, [x]_1 [x]_2, [x]_2 [x]_1 \right).

      Then

      \langle \phi(x), \phi(x') \rangle = [x]_1^2 [x']_1^2 + [x]_2^2 [x']_2^2 + 2 [x]_1 [x]_2 [x']_1 [x']_2 = \left( [x]_1 [x']_1 + [x]_2 [x']_2 \right)^2 = \langle x, x' \rangle^2.

      The dot product in the 3-dimensional feature space can be computed without computing \phi.
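A quick numerical check of this toy example (the two input vectors are arbitrary): the explicit 3-dimensional map with the \sqrt{2} cross term and the squared dot product give the same value.

```python
import numpy as np

def phi(x):
    """Explicit feature map ([x]_1^2, [x]_2^2, sqrt(2) * [x]_1 [x]_2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x  = np.array([1.0, 2.0])        # arbitrary example inputs
xp = np.array([3.0, -1.0])

lhs = phi(x) @ phi(xp)           # dot product in the 3-dim feature space
rhs = (x @ xp) ** 2              # <x, x'>^2, computed without phi
print(lhs, rhs)                  # both equal 1.0 for these inputs
```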

  16. More general case. Suppose the input vector dimension is M, and define the feature mapping \phi_d as the collection of all d-th order products (monomials) of the entries of x,

      [x]_{j_1} \cdot [x]_{j_2} \cdots [x]_{j_d}, \qquad j_1, \ldots, j_d \in \{1, \ldots, M\}.

      After the mapping, the dimension becomes M^d, so computing the inner product explicitly requires at least O(M^d) operations. However,

      \langle \phi_d(x), \phi_d(x') \rangle = \sum_{j_1=1}^{M} \cdots \sum_{j_d=1}^{M} [x]_{j_1} \cdots [x]_{j_d} \cdot [x']_{j_1} \cdots [x']_{j_d}
                                            = \prod_{k=1}^{d} \left( \sum_{j_k=1}^{M} [x]_{j_k} [x']_{j_k} \right)
                                            = \left( \sum_{j=1}^{M} [x]_j [x']_j \right)^d = \langle x, x' \rangle^d,

      so the inner product requires only O(M) computation. A numerical check of this identity is sketched after this slide.
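The identity above can be verified numerically for small M and d by enumerating all M^d monomials explicitly, as in the sketch below (dimensions, inputs, and the helper phi_d are illustrative assumptions). The explicit map needs M^d coordinates, while the kernel evaluation is a single O(M) dot product raised to the d-th power.

```python
import itertools
import numpy as np

def phi_d(x, d):
    """All d-th order monomials [x]_{j1} * ... * [x]_{jd}; the result has M^d entries."""
    return np.array([np.prod([x[j] for j in idx])
                     for idx in itertools.product(range(len(x)), repeat=d)])

rng = np.random.default_rng(2)
M, d = 4, 3
x, xp = rng.normal(size=M), rng.normal(size=M)

explicit = phi_d(x, d) @ phi_d(xp, d)   # O(M^d) work: 4^3 = 64 coordinates here
kernel   = (x @ xp) ** d                # O(M) work
print(np.allclose(explicit, kernel))    # True
```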


  18. Myths of Kernels. A kernel is a similarity measure. A kernel corresponds to a dot product in a feature space H via a mapping \phi:

      k(x, x') = \langle \phi(x), \phi(x') \rangle.

      Questions: (1) What kind of kernel functions admit the above form? (2) Given a kernel, how do we construct an associated feature space?


  20. Positive Definite Kernels. Gram matrix: given a function k : X^2 \to \mathbb{R} and inputs x_1, \ldots, x_N \in X, the N \times N matrix with entries K_{ij} := k(x_i, x_j) is called the Gram matrix. Positive definite kernel: a function k on X \times X which, for any choice of x_1, x_2, \ldots, x_N \in X, gives rise to a positive semi-definite Gram matrix is called a positive definite kernel. A positive definite kernel can always be written as an inner product under some feature mapping! A small check of the definition is sketched below.
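To make the definition concrete, the following sketch (the kernel choice and the data are arbitrary assumptions) builds the Gram matrix of the polynomial kernel k(x, x') = \langle x, x' \rangle^2 on a handful of random points and checks that its eigenvalues are non-negative, i.e. that the Gram matrix is positive semi-definite.

```python
import numpy as np

def poly_kernel(x, xp, d=2):
    """Polynomial kernel k(x, x') = <x, x'>^d (assumed choice of positive definite kernel)."""
    return (x @ xp) ** d

rng = np.random.default_rng(3)
points = rng.normal(size=(8, 5))     # N = 8 random points in R^5

# Gram matrix K_ij = k(x_i, x_j).
K = np.array([[poly_kernel(xi, xj) for xj in points] for xi in points])

eigvals = np.linalg.eigvalsh(K)      # K is symmetric, so eigvalsh applies
print(np.all(eigvals >= -1e-10))     # True: positive semi-definite up to round-off
```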

