

  1. MIT 9.520/6.860, Fall 2018
     Statistical Learning Theory and Applications
     Class 04: Features and Kernels
     Lorenzo Rosasco

  2. Linear functions
     Let $H_{lin}$ be the space of linear functions $f(x) = w^\top x$.
     ◮ $f \leftrightarrow w$ is one to one,
     ◮ inner product $\langle f, \bar f \rangle_H := w^\top \bar w$,
     ◮ norm/metric $\|f - \bar f\|_H := \|w - \bar w\|$.

  3. An observation
     The function norm controls pointwise convergence. Since
     $|f(x) - \bar f(x)| \le \|x\| \, \|w - \bar w\|, \quad \forall x \in X,$
     then $w_j \to w \ \Rightarrow\ f_j(x) \to f(x), \ \forall x \in X$.

  4. ERM
     $\min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2, \quad \lambda \ge 0,$
     ◮ $\lambda \to 0$: ordinary least squares (bias to the minimal norm solution),
     ◮ $\lambda > 0$: ridge regression (stable).

  5. Computations
     Let $X_n \in \mathbb{R}^{n \times d}$ and $\hat Y \in \mathbb{R}^n$. The ridge regression solution is
     $w_\lambda = (X_n^\top X_n + n\lambda I)^{-1} X_n^\top \hat Y$   [time $O(nd^2 \vee d^3)$, memory $O(nd \vee d^2)$],
     but also
     $w_\lambda = X_n^\top (X_n X_n^\top + n\lambda I)^{-1} \hat Y$   [time $O(dn^2 \vee n^3)$, memory $O(nd \vee n^2)$].
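
     As a sanity check on these two formulas, here is a minimal NumPy sketch (the data, sizes, and variable names X, Y, lam are made-up illustrations, not from the slides): both expressions return the same $w_\lambda$, but the first solves a $d \times d$ system and the second an $n \times n$ system.

     import numpy as np

     # synthetic data, purely for illustration
     n, d, lam = 50, 5, 0.1
     rng = np.random.default_rng(0)
     X = rng.standard_normal((n, d))   # X_n in R^{n x d}
     Y = rng.standard_normal(n)        # Y in R^n

     # primal form: (X^T X + n*lam*I)^{-1} X^T Y  -- a d x d system, O(nd^2 v d^3) time
     w_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

     # dual form: X^T (X X^T + n*lam*I)^{-1} Y  -- an n x n system, O(dn^2 v n^3) time
     w_dual = X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), Y)

     print(np.allclose(w_primal, w_dual))  # True: both give the same w_lambda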

  6. Representer theorem in disguise
     We noted that
     $w_\lambda = X_n^\top c = \sum_{i=1}^n x_i c_i \quad \Leftrightarrow \quad \hat f_\lambda(x) = \sum_{i=1}^n x^\top x_i \, c_i,$
     with $c = (X_n X_n^\top + n\lambda I)^{-1} \hat Y$ and $(X_n X_n^\top)_{ij} = x_i^\top x_j$.
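
     Continuing the sketch above (same made-up X, Y, n, lam), the dual coefficients $c$ reproduce exactly the predictions of $w_\lambda$, which is this slide's statement in code form.

     # c = (X X^T + n*lam*I)^{-1} Y, with (X X^T)_{ij} = x_i^T x_j
     c = np.linalg.solve(X @ X.T + n * lam * np.eye(n), Y)

     # f_lambda(x) = sum_i (x^T x_i) c_i, evaluated here at the training points
     f_hat = (X @ X.T) @ c

     print(np.allclose(f_hat, X @ w_primal))  # True: predictions from c match those from w_lambda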

  7. Limits of linear functions (regression)

  8. Limits of linear functions (classification)

  9. Nonlinear functions
     Two main possibilities:
     $f(x) = w^\top \Phi(x), \qquad f(x) = \Phi(w^\top x),$
     where $\Phi$ is a nonlinear map.
     ◮ The former choice leads to linear spaces of functions (the spaces are linear, NOT the functions!).
     ◮ The latter choice can be iterated: $f(x) = \Phi(w_L^\top \Phi(w_{L-1}^\top \cdots \Phi(w_1^\top x)))$.

  10. Features and feature maps
     $f(x) = w^\top \Phi(x)$, where $\Phi : X \to \mathbb{R}^p$,
     $\Phi(x) = (\varphi_1(x), \ldots, \varphi_p(x))^\top$ and $\varphi_j : X \to \mathbb{R}$, for $j = 1, \ldots, p$.
     ◮ $X$ need not be $\mathbb{R}^d$.
     ◮ We can also write $f(x) = \sum_{j=1}^p w_j \varphi_j(x)$.
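
     A concrete toy example (not from the slides; the map and weights are made up): a degree-two polynomial feature map on $X = \mathbb{R}^2$, so that a function linear in $\Phi(x)$ is a nonlinear (quadratic) function of $x$.

     import numpy as np

     def feature_map(x):
         # Phi : R^2 -> R^6, Phi(x) = (1, x1, x2, x1^2, x2^2, x1*x2); each entry is a feature phi_j(x)
         x1, x2 = x
         return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

     w = np.array([0.5, -1.0, 2.0, 0.3, 0.0, 1.5])  # illustrative weights
     x = np.array([0.2, -0.7])
     f_x = w @ feature_map(x)                       # f(x) = w^T Phi(x) = sum_j w_j phi_j(x)
     print(f_x)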

  11. Geometric view
     $f(x) = w^\top \Phi(x)$

  12. An example

  13. More examples
     The equation $f(x) = w^\top \Phi(x) = \sum_{j=1}^p w_j \varphi_j(x)$ suggests thinking of features as some form of basis. Indeed we can consider
     ◮ the Fourier basis,
     ◮ wavelets and their variations,
     ◮ ...

  14. And even more examples
     Any set of functions $\varphi_j : X \to \mathbb{R}$, $j = 1, \ldots, p$, can be considered.
     Feature design/engineering:
     ◮ vision: SIFT, HOG
     ◮ audio: MFCC
     ◮ ...

  15. Nonlinear functions using features
     Let $H_\Phi$ be the space of linear functions $f(x) = w^\top \Phi(x)$.
     ◮ $f \leftrightarrow w$ is one to one, if the $(\varphi_j)_j$ are linearly independent,
     ◮ inner product $\langle f, \bar f \rangle_{H_\Phi} := w^\top \bar w$,
     ◮ norm/metric $\|f - \bar f\|_{H_\Phi} := \|w - \bar w\|$.
     In this case $|f(x) - \bar f(x)| \le \|\Phi(x)\| \, \|w - \bar w\|, \ \forall x \in X$.

  16. Back to ERM
     $\min_{w \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2 + \lambda \|w\|^2, \quad \lambda \ge 0.$
     Equivalently,
     $\min_{f \in H_\Phi} \ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{H_\Phi}^2, \quad \lambda \ge 0.$

  17. Computations using features
     Let $\hat\Phi \in \mathbb{R}^{n \times p}$ with $(\hat\Phi)_{ij} = \varphi_j(x_i)$. The ridge regression solution is
     $w_\lambda = (\hat\Phi^\top \hat\Phi + n\lambda I)^{-1} \hat\Phi^\top \hat Y$   [time $O(np^2 \vee p^3)$, memory $O(np \vee p^2)$],
     but also
     $w_\lambda = \hat\Phi^\top (\hat\Phi \hat\Phi^\top + n\lambda I)^{-1} \hat Y$   [time $O(pn^2 \vee n^3)$, memory $O(np \vee n^2)$].
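
     The point of the two formulas is the $n$-versus-$p$ trade-off. A hedged sketch (illustrative function and variable names, not from the slides) that exploits it by solving whichever linear system is smaller:

     import numpy as np

     def ridge_with_features(Phi_hat, Y, lam):
         """Return w_lambda for a feature matrix Phi_hat in R^{n x p}, picking the cheaper system."""
         n, p = Phi_hat.shape
         if p <= n:
             # primal: (Phi^T Phi + n*lam*I)^{-1} Phi^T Y, a p x p system
             return np.linalg.solve(Phi_hat.T @ Phi_hat + n * lam * np.eye(p), Phi_hat.T @ Y)
         # dual: Phi^T (Phi Phi^T + n*lam*I)^{-1} Y, an n x n system
         return Phi_hat.T @ np.linalg.solve(Phi_hat @ Phi_hat.T + n * lam * np.eye(n), Y)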

  18. Representer theorem a little less in disguise
     Analogously to before,
     $w_\lambda = \hat\Phi^\top c = \sum_{i=1}^n \Phi(x_i) c_i \quad \Leftrightarrow \quad \hat f_\lambda(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i) \, c_i,$
     $c = (\hat\Phi \hat\Phi^\top + n\lambda I)^{-1} \hat Y, \qquad (\hat\Phi \hat\Phi^\top)_{ij} = \Phi(x_i)^\top \Phi(x_j),$
     $\Phi(x)^\top \Phi(\bar x) = \sum_{s=1}^p \varphi_s(x) \varphi_s(\bar x).$

  19. Unleash the features
     ◮ Can we consider linearly dependent features?
     ◮ Can we consider $p = \infty$?

  20. An observation
     For $X = \mathbb{R}$ consider
     $\varphi_j(x) = x^{j-1} e^{-x^2 \gamma} \sqrt{\tfrac{(2\gamma)^{j-1}}{(j-1)!}}, \quad j = 1, 2, \ldots, \infty$   (so $\varphi_1(x) = e^{-x^2\gamma}$).
     Then
     $\sum_{j=1}^\infty \varphi_j(x)\varphi_j(\bar x) = \sum_{j=1}^\infty \sqrt{\tfrac{(2\gamma)^{j-1}}{(j-1)!}}\, x^{j-1} e^{-x^2\gamma} \sqrt{\tfrac{(2\gamma)^{j-1}}{(j-1)!}}\, \bar x^{j-1} e^{-\bar x^2\gamma}$
     $= e^{-x^2\gamma} e^{-\bar x^2\gamma} \sum_{j=1}^\infty \tfrac{(2\gamma)^{j-1}}{(j-1)!} (x\bar x)^{j-1} = e^{-x^2\gamma} e^{-\bar x^2\gamma} e^{2x\bar x\gamma} = e^{-|x-\bar x|^2\gamma}.$
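
     A quick numerical check of this identity (a sketch; gamma, the evaluation points, and the truncation level are arbitrary choices): summing the first 30 features already matches the closed form to machine precision.

     import numpy as np
     from math import factorial

     gamma, x, xbar = 0.5, 0.7, -0.3

     def phi(x, j, gamma):
         # phi_j(x) = sqrt((2*gamma)^(j-1) / (j-1)!) * x^(j-1) * exp(-gamma * x^2)
         return np.sqrt((2 * gamma) ** (j - 1) / factorial(j - 1)) * x ** (j - 1) * np.exp(-gamma * x ** 2)

     series = sum(phi(x, j, gamma) * phi(xbar, j, gamma) for j in range(1, 31))
     closed = np.exp(-gamma * (x - xbar) ** 2)   # e^{-gamma |x - xbar|^2}
     print(series, closed)                       # the two values agree to high precision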

  21. From features to kernels
     $\Phi(x)^\top \Phi(\bar x) = \sum_{j=1}^\infty \varphi_j(x) \varphi_j(\bar x) = k(x, \bar x)$
     We might be able to compute the series in closed form. The function $k$ is called a kernel. Can we run ridge regression?

  22. Kernel ridge regression
     We have
     $\hat f_\lambda(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i) c_i = \sum_{i=1}^n k(x, x_i) c_i,$
     $c = (\hat K + n\lambda I)^{-1} \hat Y, \qquad (\hat K)_{ij} = \Phi(x_i)^\top \Phi(x_j) = k(x_i, x_j).$
     $\hat K$ is the kernel matrix, the Gram (inner product) matrix of the data. "The kernel trick."
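
     A minimal kernel ridge regression sketch with the Gaussian kernel (synthetic 1-D data; the choices of gamma, lam, and all variable names are illustrative, not prescribed by the slides):

     import numpy as np

     def gaussian_kernel(A, B, gamma):
         # k(x, xbar) = exp(-gamma * |x - xbar|^2), for all pairs of rows of A and B
         d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
         return np.exp(-gamma * d2)

     rng = np.random.default_rng(0)
     X = rng.uniform(-3, 3, (40, 1))
     Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

     gamma, lam, n = 1.0, 1e-3, X.shape[0]
     K = gaussian_kernel(X, X, gamma)                 # kernel (Gram) matrix, K_ij = k(x_i, x_j)
     c = np.linalg.solve(K + n * lam * np.eye(n), Y)  # c = (K + n*lam*I)^{-1} Y

     X_test = np.linspace(-3, 3, 5)[:, None]
     f_test = gaussian_kernel(X_test, X, gamma) @ c   # f(x) = sum_i k(x, x_i) c_i
     print(f_test)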

  23. Kernels
     ◮ Can we start from kernels instead of features?
     ◮ Which functions $k : X \times X \to \mathbb{R}$ define kernels we can use?

  24. Positive definite kernels
     A function $k : X \times X \to \mathbb{R}$ is called positive definite
     ◮ if the matrix $\hat K$ is positive semidefinite for every choice of points $x_1, \ldots, x_n$, i.e.
       $a^\top \hat K a \ge 0, \quad \forall a \in \mathbb{R}^n;$
     ◮ equivalently, $\sum_{i,j=1}^n k(x_i, x_j) a_i a_j \ge 0$ for any $a_1, \ldots, a_n \in \mathbb{R}$, $x_1, \ldots, x_n \in X$.
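
     A numerical illustration of the definition, not a proof (random points and coefficients, linear kernel as the example): the quadratic form $a^\top \hat K a$ stays nonnegative and the kernel matrix has no negative eigenvalues.

     import numpy as np

     rng = np.random.default_rng(0)
     X = rng.standard_normal((30, 4))   # 30 points in R^4
     K = X @ X.T                        # linear kernel: K_ij = x_i^T x_j

     for _ in range(5):
         a = rng.standard_normal(30)
         assert a @ K @ a >= -1e-10     # a^T K a >= 0 (up to round-off) for every a

     print(np.linalg.eigvalsh(K).min())  # smallest eigenvalue is >= 0 up to round-off: K is PSD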

  25. Inner product kernels are pos. def.
     Assume $\Phi : X \to \mathbb{R}^p$, $p \le \infty$, and $k(x, \bar x) = \Phi(x)^\top \Phi(\bar x)$. Note that
     $\sum_{i,j=1}^n k(x_i, x_j) a_i a_j = \sum_{i,j=1}^n \Phi(x_i)^\top \Phi(x_j) a_i a_j = \Big\| \sum_{i=1}^n \Phi(x_i) a_i \Big\|^2 \ge 0.$
     Clearly $k$ is symmetric.

  26. But there are many pos. def. kernels
     Classic examples:
     ◮ linear $k(x, \bar x) = x^\top \bar x$
     ◮ polynomial $k(x, \bar x) = (x^\top \bar x + 1)^s$
     ◮ Gaussian $k(x, \bar x) = e^{-\|x - \bar x\|^2 \gamma}$
     But one can also consider
     ◮ kernels on probability distributions
     ◮ kernels on strings
     ◮ kernels on functions
     ◮ kernels on groups
     ◮ kernels on graphs
     ◮ ...
     It is natural to think of a kernel as a measure of similarity.
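
     The three classic kernels above, written as small NumPy functions (a sketch, vectorized over rows; the parameter values s and gamma are free choices):

     import numpy as np

     def linear_kernel(A, B):
         return A @ B.T                       # k(x, xbar) = x^T xbar

     def polynomial_kernel(A, B, s=3):
         return (A @ B.T + 1.0) ** s          # k(x, xbar) = (x^T xbar + 1)^s

     def gaussian_kernel(A, B, gamma=1.0):
         d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
         return np.exp(-gamma * d2)           # k(x, xbar) = exp(-gamma * |x - xbar|^2)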

  27. From pos. def. kernels to functions
     Let $X$ be any set and $k$ a pos. def. kernel.
     ◮ Consider the space $H_k$ of functions
       $f(x) = \sum_{i=1}^N k(x, x_i) a_i$
       for any $a_1, \ldots, a_N \in \mathbb{R}$, $x_1, \ldots, x_N \in X$ and any $N \in \mathbb{N}$.
     ◮ Define an inner product on $H_k$: for $\bar f(x) = \sum_{j=1}^{\bar N} k(x, \bar x_j) \bar a_j$,
       $\langle f, \bar f \rangle_{H_k} = \sum_{i=1}^{N} \sum_{j=1}^{\bar N} k(x_i, \bar x_j) \, a_i \bar a_j.$
     ◮ $H_k$ can be completed to a Hilbert space.
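
     A small sketch of this inner product for two finite expansions (Gaussian kernel on $\mathbb{R}$; the points and coefficients are made up): the inner product is just a bilinear form in the coefficient vectors through the kernel.

     import numpy as np

     def k(x, xbar, gamma=1.0):
         return np.exp(-gamma * (x - xbar) ** 2)   # Gaussian kernel on R

     # f(x) = sum_i a_i k(x, x_i),  fbar(x) = sum_j abar_j k(x, xbar_j)
     x_pts, a = np.array([0.0, 1.0, 2.0]), np.array([1.0, -0.5, 0.3])
     xbar_pts, abar = np.array([0.5, 1.5]), np.array([0.7, 0.2])

     K_cross = k(x_pts[:, None], xbar_pts[None, :])                 # K_cross[i, j] = k(x_i, xbar_j)
     inner = a @ K_cross @ abar                                     # <f, fbar>_{H_k}
     norm_f = np.sqrt(a @ k(x_pts[:, None], x_pts[None, :]) @ a)    # ||f||_{H_k}
     print(inner, norm_f)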

  28. A key result
     Functions defined by Gaussian kernels with large and small widths.

  29. An illustration
     Theorem. Given a pos. def. $k$ there exists $\Phi$ s.t.
     $k(x, \bar x) = \langle \Phi(x), \Phi(\bar x) \rangle_{H_k}$ and $H_\Phi \simeq H_k$.
     Roughly speaking,
     $f(x) = w^\top \Phi(x) \quad \simeq \quad f(x) = \sum_{i=1}^N k(x, x_i) a_i.$

  30. From features and kernels to RKHS and beyond
     $H_k$ and $H_\Phi$ have many properties, characterizations, and connections:
     ◮ reproducing property
     ◮ reproducing kernel Hilbert spaces (RKHS)
     ◮ Mercer theorem (Karhunen-Loève expansion)
     ◮ Gaussian processes
     ◮ Cameron-Martin spaces

  31. Reproducing property
     Note that by definition of $H_k$:
     ◮ $k_x = k(x, \cdot)$ is a function in $H_k$;
     ◮ for all $f \in H_k$, $x \in X$, $f(x) = \langle f, k_x \rangle_{H_k}$, called the reproducing property;
     ◮ note that $|f(x) - \bar f(x)| \le \|k_x\|_{H_k} \, \|f - \bar f\|_{H_k}, \ \forall x \in X$.
     The above observations have a converse.

  32. RKHS
     Definition. A RKHS $H$ is a Hilbert space with a function $k : X \times X \to \mathbb{R}$ s.t.
     ◮ $k_x = k(x, \cdot) \in H$,
     ◮ and $f(x) = \langle f, k_x \rangle_H$.
     Theorem. If $H$ is a RKHS then $k$ is pos. def.

  33. Evaluation functionals in a RKHS
     If $H$ is a RKHS then the evaluation functionals $e_x(f) = f(x)$ are continuous, i.e.
     $|e_x(f) - e_x(\bar f)| \lesssim \|f - \bar f\|_H, \quad \forall x \in X,$
     since $e_x(f) = \langle f, k_x \rangle_H$.
     Note that $L^2(\mathbb{R}^d)$ or $C(\mathbb{R}^d)$ don't have this property!

  34. Alternative RKHS definition
     It turns out that the previous property also characterizes a RKHS.
     Theorem. A Hilbert space with continuous evaluation functionals is a RKHS.

  35. Summing up
     ◮ From linear to nonlinear functions
       ◮ using features
       ◮ using kernels
     plus
     ◮ pos. def. functions
     ◮ reproducing property
     ◮ RKHS
