MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications
Class 04: Features and Kernels
Lorenzo Rosasco
Linear functions
Let H_lin be the space of linear functions f(x) = w⊤x.
◮ f ↔ w is one to one,
◮ inner product ⟨f, f̄⟩_H := w⊤w̄,
◮ norm/metric ‖f − f̄‖_H := ‖w − w̄‖.
An observation
The function norm controls point-wise convergence: since
|f(x) − f̄(x)| ≤ ‖x‖ ‖w − w̄‖, ∀x ∈ X,
it follows that w_j → w ⇒ f_j(x) → f(x), ∀x ∈ X.
ERM
min_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − w⊤x_i)² + λ‖w‖²,  λ ≥ 0

◮ λ → 0: ordinary least squares (biased towards the minimal norm solution),
◮ λ > 0: ridge regression (stable).
Computations
Let X_n ∈ R^{n×d} and Y ∈ R^n. The ridge regression solution is
w_λ = (X_n⊤X_n + nλI)^{−1} X_n⊤Y,  time O(nd² ∨ d³), memory O(nd ∨ d²),
but also
w_λ = X_n⊤(X_nX_n⊤ + nλI)^{−1} Y,  time O(dn² ∨ n³), memory O(nd ∨ n²).
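As a sanity check, here is a minimal NumPy sketch (random data and an arbitrary λ, both purely illustrative) comparing the two formulas. They agree up to round-off, so one picks whichever inverse is cheaper: d×d when n ≫ d, n×n when d ≫ n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))                    # X_n: one sample per row
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Primal form: invert a d x d matrix, O(nd^2 v d^3) time.
w_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# Dual form: invert an n x n matrix, O(dn^2 v n^3) time.
w_dual = X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), Y)

print(np.allclose(w_primal, w_dual))               # True
```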
Representer theorem in disguise
We noted that
w_λ = X_n⊤c = Σ_{i=1}^n x_i c_i  ⇔  f̂_λ(x) = Σ_{i=1}^n x⊤x_i c_i,
with c = (X_nX_n⊤ + nλI)^{−1} Y and (X_nX_n⊤)_{ij} = x_i⊤x_j.
Limits of linear functions
[Figure: regression example]

[Figure: classification example]
Nonlinear functions
Two main possibilities:
f(x) = w⊤Φ(x)  or  f(x) = Φ(w⊤x),
where Φ is a nonlinear map.
◮ The former choice leads to linear spaces of functions¹.
◮ The latter choice can be iterated: f(x) = Φ(w_L⊤ Φ(w_{L−1}⊤ · · · Φ(w_1⊤ x))).

¹The spaces are linear, NOT the functions!
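The iterated form is a feed-forward network. Below is a minimal NumPy sketch; the depth, the layer widths, and the choice Φ = tanh are all illustrative assumptions.

```python
import numpy as np

def forward(x, weights, phi=np.tanh):
    """Iterated nonlinearity: f(x) = phi(w_L^T phi(... phi(w_1^T x)))."""
    h = x
    for w in weights:
        h = phi(w.T @ h)                 # one layer: linear map, then nonlinearity
    return h

# Hypothetical shapes: input in R^3, two hidden layers of width 5, scalar output.
rng = np.random.default_rng(0)
ws = [rng.standard_normal((3, 5)),
      rng.standard_normal((5, 5)),
      rng.standard_normal((5, 1))]
print(forward(rng.standard_normal(3), ws))
```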
Features and feature maps
f(x) = w⊤Φ(x), where Φ : X → R^p, Φ(x) = (ϕ_1(x), ..., ϕ_p(x))⊤, and ϕ_j : X → R for j = 1,...,p.
◮ X need not be R^d.
◮ We can also write f(x) = Σ_{j=1}^p w_j ϕ_j(x).
Geometric view
f(x) = w⊤Φ(x)
An example

[Figure]
More examples
The expression f(x) = w⊤Φ(x) = Σ_{j=1}^p w_j ϕ_j(x) suggests thinking of features as some form of basis. Indeed we can consider
◮ the Fourier basis,
◮ wavelets and their variations,
◮ ...
And even more examples
Any set of functions ϕ_j : X → R, j = 1,...,p, can be considered. Feature design/engineering:
◮ vision: SIFT, HOG
◮ audio: MFCC
◮ ...
Nonlinear functions using features
Let H_Φ be the space of linear functions f(x) = w⊤Φ(x).
◮ f ↔ w is one to one, if the (ϕ_j)_j are linearly independent,
◮ inner product ⟨f, f̄⟩_{H_Φ} := w⊤w̄,
◮ norm/metric ‖f − f̄‖_{H_Φ} := ‖w − w̄‖.
In this case |f(x) − f̄(x)| ≤ ‖Φ(x)‖ ‖w − w̄‖, ∀x ∈ X.
Back to ERM
min_{w∈R^p} (1/n) Σ_{i=1}^n (y_i − w⊤Φ(x_i))² + λ‖w‖²,  λ ≥ 0,

equivalent to

min_{f∈H_Φ} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_{H_Φ},  λ ≥ 0.
Computations using features
Let Φ̂ ∈ R^{n×p} with (Φ̂)_{ij} = ϕ_j(x_i). The ridge regression solution is
w_λ = (Φ̂⊤Φ̂ + nλI)^{−1} Φ̂⊤Y,  time O(np² ∨ p³), memory O(np ∨ p²),
but also
w_λ = Φ̂⊤(Φ̂Φ̂⊤ + nλI)^{−1} Y,  time O(pn² ∨ n³), memory O(np ∨ n²).
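The same computation with an explicit feature matrix Φ̂; a minimal sketch assuming monomial features ϕ_j(x) = x^{j−1} on X = R and a synthetic sine target, both purely illustrative.

```python
import numpy as np

def feature_map(x, p=4):
    """Monomial features phi_j(x) = x**(j-1), j = 1..p (illustrative choice)."""
    return np.stack([x**j for j in range(p)], axis=1)

rng = np.random.default_rng(1)
n, lam = 40, 1e-3
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + 0.05 * rng.standard_normal(n)   # nonlinear target

Phi = feature_map(x)                         # (Phi)_ij = phi_j(x_i), an n x p matrix
p = Phi.shape[1]
w = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(p), Phi.T @ y)

x_test = np.linspace(-1, 1, 5)
print(feature_map(x_test) @ w)               # nonlinear predictions, linear in w
```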
Representer theorem a little less in disguise
Analogously to before,
w_λ = Φ̂⊤c = Σ_{i=1}^n Φ(x_i) c_i  ⇔  f̂_λ(x) = Σ_{i=1}^n Φ(x)⊤Φ(x_i) c_i,
with c = (Φ̂Φ̂⊤ + nλI)^{−1} Y, (Φ̂Φ̂⊤)_{ij} = Φ(x_i)⊤Φ(x_j), and
Φ(x)⊤Φ(x̄) = Σ_{s=1}^p ϕ_s(x)ϕ_s(x̄).
Unleash the features
◮ Can we consider linearly dependent features?
◮ Can we consider p = ∞?
An observation
For X = R consider
ϕ_j(x) = x^{j−1} e^{−x²γ} √( (2γ)^{j−1} / (j−1)! ),  j = 1,...,∞.
Then
Σ_{j=1}^∞ ϕ_j(x)ϕ_j(x̄) = e^{−x²γ} e^{−x̄²γ} Σ_{j=1}^∞ ((2γ)^{j−1}/(j−1)!) (x x̄)^{j−1} = e^{−x²γ} e^{−x̄²γ} e^{2γx x̄} = e^{−|x−x̄|²γ}.
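A quick numeric check of this identity, truncating the series; the truncation level, γ, and the test points are arbitrary choices.

```python
import numpy as np
from math import factorial

def phi(x, j, gamma):
    """phi_j(x) = x^(j-1) e^(-x^2 gamma) sqrt((2 gamma)^(j-1) / (j-1)!)."""
    return x**(j - 1) * np.exp(-gamma * x**2) * np.sqrt((2 * gamma)**(j - 1) / factorial(j - 1))

gamma, x, xbar = 0.5, 0.7, -0.3
series = sum(phi(x, j, gamma) * phi(xbar, j, gamma) for j in range(1, 30))
closed_form = np.exp(-gamma * (x - xbar)**2)
print(series, closed_form)   # the two values agree to machine precision
```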
From features to kernels
Φ(x)⊤Φ(x̄) = Σ_{j=1}^∞ ϕ_j(x)ϕ_j(x̄) = k(x, x̄).
We might be able to compute the series in closed form. The function k is called a kernel. Can we run ridge regression?
Kernel ridge regression
We have
f̂_λ(x) = Σ_{i=1}^n Φ(x)⊤Φ(x_i) c_i = Σ_{i=1}^n k(x, x_i) c_i,
c = (K̂ + nλI)^{−1} Y,  (K̂)_{ij} = Φ(x_i)⊤Φ(x_j) = k(x_i, x_j).

K̂ is the kernel matrix, the Gram (inner products) matrix of the data.

“The kernel trick”
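Putting the pieces together, a minimal kernel ridge regression sketch with the Gaussian kernel; the data, γ, and λ are illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """k(x, xbar) = exp(-gamma ||x - xbar||^2) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
n, gamma, lam = 60, 2.0, 1e-3
X = rng.uniform(-3, 3, (n, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X, gamma)                   # kernel (Gram) matrix
c = np.linalg.solve(K + n * lam * np.eye(n), Y)    # c = (K + n lam I)^{-1} Y

X_test = np.linspace(-3, 3, 5)[:, None]
print(gaussian_kernel(X_test, X, gamma) @ c)       # f(x) = sum_i k(x, x_i) c_i
```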
Kernels
◮ Can we start from kernels instead of features?
◮ Which functions k : X × X → R define kernels we can use?
Positive definite kernels
A function k : X × X → R is called positive definite
◮ if the matrix K̂ is positive semidefinite for every choice of points x_1,...,x_n, i.e. a⊤K̂a ≥ 0, ∀a ∈ R^n;
◮ equivalently, Σ_{i,j=1}^n k(x_i, x_j) a_i a_j ≥ 0, for any a_1,...,a_n ∈ R and x_1,...,x_n ∈ X.
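One can probe this condition numerically for a given kernel: sample points, build K̂, and check that its eigenvalues are nonnegative. A sketch (not a proof) with the Gaussian kernel; the sample size, dimension, and γ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))

# Gaussian kernel matrix K_ij = exp(-gamma ||x_i - x_j||^2).
gamma = 0.5
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

# Positive semidefinite <=> all eigenvalues >= 0 (up to round-off).
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True
```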
Inner product kernels are pos. def.
Assume Φ : X → R^p, p ≤ ∞, and k(x, x̄) = Φ(x)⊤Φ(x̄). Note that
Σ_{i,j=1}^n k(x_i, x_j) a_i a_j = Σ_{i,j=1}^n Φ(x_i)⊤Φ(x_j) a_i a_j = ‖Σ_{i=1}^n Φ(x_i) a_i‖² ≥ 0.
Clearly k is also symmetric.
But there are many pos. def. kernels
Classic examples:
◮ linear k(x, x̄) = x⊤x̄,
◮ polynomial k(x, x̄) = (x⊤x̄ + 1)^s,
◮ Gaussian k(x, x̄) = e^{−‖x−x̄‖²γ}.

But one can also consider
◮ kernels on probability distributions,
◮ kernels on strings,
◮ kernels on functions,
◮ kernels on groups,
◮ kernels on graphs,
◮ ...

It is natural to think of a kernel as a measure of similarity.
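The classic examples as one-liners; the hyperparameter values s and γ below are hypothetical choices.

```python
import numpy as np

linear     = lambda x, xb: x @ xb
polynomial = lambda x, xb, s=2: (x @ xb + 1) ** s
gaussian   = lambda x, xb, gamma=0.5: np.exp(-gamma * np.sum((x - xb) ** 2))

x, xb = np.array([1.0, 0.0]), np.array([0.8, 0.2])
print(linear(x, xb), polynomial(x, xb), gaussian(x, xb))
```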
From pos. def. kernels to functions
Let X be any set. Given a pos. def. kernel k:
◮ consider the space H_k of functions f(x) = Σ_{i=1}^N k(x, x_i) a_i, for any a_1,...,a_N ∈ R, x_1,...,x_N ∈ X and any N ∈ N;
◮ define an inner product on H_k by ⟨f, f̄⟩_{H_k} = Σ_{i=1}^N Σ_{j=1}^{N̄} k(x_i, x̄_j) a_i ā_j;
◮ H_k can be completed to a Hilbert space.
An illustration

[Figure: functions defined by Gaussian kernels with large and small widths.]

A key result

Theorem
Given a pos. def. kernel k, there exists Φ s.t.
k(x, x̄) = ⟨Φ(x), Φ(x̄)⟩_{H_k}  and  H_Φ ≃ H_k.
Roughly speaking,
f(x) = w⊤Φ(x)  ≃  f(x) = Σ_{i=1}^N k(x, x_i) a_i.
From features and kernels to RKHS and beyond
H_k and H_Φ have many properties, characterizations, and connections:
◮ reproducing property,
◮ reproducing kernel Hilbert spaces (RKHS),
◮ Mercer theorem (Karhunen-Loève expansion),
◮ Gaussian processes,
◮ Cameron-Martin spaces.
Reproducing property
Note that, by definition of H_k:
◮ k_x = k(x, ·) is a function in H_k;
◮ for all f ∈ H_k and x ∈ X, f(x) = ⟨f, k_x⟩_{H_k}, called the reproducing property;
◮ consequently |f(x) − f̄(x)| ≤ ‖k_x‖_{H_k} ‖f − f̄‖_{H_k}, ∀x ∈ X.
The above observations have a converse.
RKHS Definition
A RKHS H is a Hilbert with a function k : X × X → R s.t. ◮ kx = k(x,·) ∈ Hk, ◮ and f(x) = f,kxHk .
Theorem
If H is a RKHS then k is pos. def.
L.Rosasco, 9.520/6.860 2018
Evaluation functionals in a RKHS
If H is a RKHS, then the evaluation functionals e_x(f) = f(x) are continuous, i.e.
|e_x(f) − e_x(f̄)| ≤ ‖k_x‖_H ‖f − f̄‖_H,  ∀x ∈ X,
since e_x(f) = ⟨f, k_x⟩_H. Note that L²(R^d) and C(R^d) do not have this property!
Alternative RKHS definition
It turns out that the previous property also characterizes a RKHS.
Theorem
A Hilbert space with continuous evaluation functionals is a RKHS.
Summing up
◮ From linear to nonlinear functions
  ◮ using features
  ◮ using kernels
plus
◮ pos. def. functions
◮ reproducing property
◮ RKHS