SLIDE 1

MIT 9.520/6.860, Fall 2017
Statistical Learning Theory and Applications
Class 20: Dictionary Learning
SLIDE 2

What is data representation?

Let X be a data space

[Figure: Φ maps M ⊂ X into F; Ψ maps Φ(M) back into X, giving Ψ ◦ Φ(M)]

A data representation is a map Φ : X → F, from the data space to a representation space F. A data reconstruction is a map Ψ : F → X.

SLIDE 3

Road map

Last class:

◮ Prologue: Learning theory and data representation
◮ Part I: Data representations by design

This class:

◮ Part II: Data representations by unsupervised learning

– Dictionary Learning
– PCA
– Sparse coding
– K-means, K-flats

Next class:

◮ Part III: Deep data representations

SLIDE 4

Notation

X: data space

◮ X = R^d or X = C^d (also more general later)
◮ x ∈ X

Data representation: Φ : X → F. ∀x ∈ X, ∃z ∈ F : Φ(x) = z.

F: representation space

◮ F = R^p or F = C^p
◮ z ∈ F

Data reconstruction: Ψ : F → X. ∀z ∈ F, ∃x ∈ X : Ψ(z) = x

SLIDE 5

Why learning?

Ideally: automatic, autonomous learning

◮ with as little prior information as possible,

but also . . .

◮ . . . with as little human supervision as possible.

f(x) = ⟨w, Φ(x)⟩_F,   ∀x ∈ X

Two-step learning scheme:

◮ supervised or unsupervised learning of Φ : X → F
◮ supervised learning of w in F

SLIDE 6

Unsupervised representation learning

Samples from a distribution ρ on the input space X:

S = {x1, . . . , xn} ∼ ρ^n

Training set S from ρ (supported on X_ρ).

Goal: find Φ(x) which is “good” not only for S but for other x ∼ ρ.

Principles for unsupervised learning of “good” representations?

SLIDE 7

Unsupervised representation learning principles

Two main concepts:

1. Similarity preservation: it holds

Φ(x) ∼ Φ(x′) ⇔ x ∼ x′,   ∀x, x′ ∈ X

2. Reconstruction: there exists a map Ψ : F → X such that

Ψ ◦ Φ(x) ∼ x, ∀x ∈ X

SLIDE 8

Plan

We will first introduce a reconstruction-based framework for learning data representations, and then discuss several examples in some detail. We will mostly consider X = R^d and F = R^p.

◮ Representation: Φ : X → F.
◮ Reconstruction: Ψ : F → X.

If the maps are linear:

◮ Representation: Φ(x) = Cx (coding)
◮ Reconstruction: Ψ(z) = Dz (decoding)

SLIDE 9

Reconstruction based data representation

Basic idea: the quality of a representation Φ is measured by the reconstruction error provided by an associated reconstruction Ψ,

‖x − Ψ ◦ Φ(x)‖,

where Ψ ◦ Φ denotes the composition of Φ and Ψ.

SLIDE 10

Empirical data and population

Given S = {x1, . . . , xn}, minimize the empirical reconstruction error

Ê(Φ, Ψ) = (1/n) Σ_{i=1}^n ‖xi − Ψ ◦ Φ(xi)‖²,

as a proxy to the expected reconstruction error

E(Φ, Ψ) = ∫_X dρ(x) ‖x − Ψ ◦ Φ(x)‖²,

where ρ is the data distribution (fixed but unknown).
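As a concrete illustration, here is a minimal NumPy sketch that evaluates the empirical reconstruction error for the linear coding/decoding maps of the previous slide (the matrices C, D and all sizes are made-up examples, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 20, 5, 100

X = rng.standard_normal((n, d))   # rows are samples x_i in R^d
C = rng.standard_normal((p, d))   # coding map:   Phi(x) = C x
D = rng.standard_normal((d, p))   # decoding map: Psi(z) = D z

Z = X @ C.T                       # representations z_i = C x_i
X_rec = Z @ D.T                   # reconstructions Psi(Phi(x_i)) = D C x_i

# empirical reconstruction error: (1/n) sum_i ||x_i - D C x_i||^2
err = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print(err)
```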

SLIDE 11

Empirical data and population

min_{Φ,Ψ} E(Φ, Ψ),   E(Φ, Ψ) = ∫_X dρ(x) ‖x − Ψ ◦ Φ(x)‖²

Caveat

Reconstruction alone is not enough: copying the data, i.e. Ψ ◦ Φ = I, gives zero reconstruction error!

SLIDE 12

Parsimonious reconstruction

Reconstruction is meaningful only with constraints!

◮ constraints implement some form of parsimonious reconstruction,
◮ identified with a form of regularization,
◮ choice of the constraints corresponds to different algorithms.

Fundamental difference with supervised learning: problem is not well defined!

SLIDE 13

Parsimonious reconstruction

[Figure: Φ maps M ⊂ X into F; Ψ maps Φ(M) back into X, giving Ψ ◦ Φ(M)]

SLIDE 14

Dictionary learning

Let X = R^d, F = R^p.

1. Linear reconstruction:

Ψ(z) = Dz,   D ∈ D,

with D a subset of the space of linear maps from F to X.

2. Nearest neighbor representation:

Φ(x) = Φ_Ψ(x) = arg min_{z∈F_λ} ‖x − Dz‖²,   D ∈ D, F_λ ⊂ F.

SLIDE 15

Linear reconstruction and dictionaries

A reconstruction D ∈ D can be identified with a d × p dictionary matrix with columns a1, . . . , ap ∈ R^d. Reconstruction of x ∈ X corresponds to a suitable linear expansion on the dictionary D, with coefficients βk = zk, z ∈ F_λ:

x = Dz = Σ_{k=1}^p ak zk = Σ_{k=1}^p ak βk,   β1, . . . , βp ∈ R.

SLIDE 16

Nearest neighbor representation

Φ(x) = Φ_Ψ(x) = arg min_{z∈F_λ} ‖x − Dz‖²,   D ∈ D, F_λ ⊂ F.

Nearest neighbor (NN) representation since, for D ∈ D and letting X_λ = DF_λ, Φ(x) provides the closest point to x in X_λ:

d(x, X_λ) = min_{x′∈X_λ} ‖x − x′‖² = min_{z′∈F_λ} ‖x − Dz′‖².

SLIDE 17

Nearest neighbor representation (cont.)

NN representations are defined by a constrained inverse problem,

min_{z∈F_λ} ‖x − Dz‖².

Alternatively, let F_λ = F and add a regularization term R : F → R:

min_{z∈F} ‖x − Dz‖² + λR(z).

Note: the formulations coincide for R(z) = 1_{F_λ}(z), z ∈ F (the indicator function of F_λ).

SLIDE 18

Dictionary learning

Empirical reconstruction error minimization,

min_{Φ,Ψ} Ê(Φ, Ψ) = min_{Φ,Ψ} (1/n) Σ_{i=1}^n ‖xi − Ψ ◦ Φ(xi)‖²,

for joint dictionary and representation learning:

min_{D∈D} (1/n) Σ_{i=1}^n min_{zi∈F_λ} ‖xi − Dzi‖²

(outer minimization: dictionary learning; inner minimization: representation learning).

Dictionary learning

◮ learning a regularized representation on a dictionary,
◮ while simultaneously learning the dictionary itself.

SLIDE 19

Examples

The DL framework encompasses a number of approaches.

◮ PCA (& kernel PCA)
◮ K-SVD
◮ Sparse coding
◮ K-means
◮ K-flats
◮ . . .

SLIDE 20

Principal Component Analysis (PCA)

Let F_λ = F_k = R^k, k ≤ min{n, d}, and D = {D : F → X, linear | D∗D = I}.

◮ D is a d × k matrix with orthogonal, unit-norm columns
◮ Reconstruction:

Dz = Σ_{j=1}^k aj zj,   z ∈ F

◮ Representation:

D∗ : X → F,   D∗x = (⟨a1, x⟩, . . . , ⟨ak, x⟩),   x ∈ X

SLIDE 21

PCA and subset selection

DD∗ : X → X,   DD∗x = Σ_{j=1}^k aj ⟨aj, x⟩,   x ∈ X.

P = DD∗ is a projection¹ onto the subspace of R^d spanned by a1, . . . , ak.

¹P = P² (idempotent)

SLIDE 22

Rewriting PCA

min_{D∈D} (1/n) Σ_{i=1}^n min_{zi∈F_k} ‖xi − Dzi‖²

Note that

Φ(x) = D∗x = arg min_{z∈F_k} ‖x − Dz‖²,   ∀x ∈ X.

Rewrite the minimization (setting z = D∗x) as

min_{D∈D} (1/n) Σ_{i=1}^n ‖xi − DD∗xi‖².

Subspace learning

Finding the k-dimensional orthogonal projection P = DD∗ with the best (empirical) reconstruction.

SLIDE 23

Learning a linear representation with PCA

Subspace learning

Finding the k-dimensional orthogonal projection with the best reconstruction.

[Figure: data space X]

SLIDE 24

PCA computation

Recall the solution for k = 1. For all x ∈ X,

DD∗x = ⟨a, x⟩ a,   ‖x − ⟨a, x⟩ a‖² = ‖x‖² − |⟨a, x⟩|²,

with a ∈ R^d such that ‖a‖ = 1. Then, equivalently:

min_{D∈D} (1/n) Σ_{i=1}^n ‖xi − DD∗xi‖²  ⇔  max_{a∈R^d, ‖a‖=1} (1/n) Σ_{i=1}^n |⟨a, xi⟩|².

SLIDE 25

PCA computation (cont.)

Let X be the n × d data matrix and V = (1/n) X^T X. Then

(1/n) Σ_{i=1}^n |⟨a, xi⟩|² = (1/n) Σ_{i=1}^n ⟨a, xi⟩⟨a, xi⟩ = ⟨a, (1/n) Σ_{i=1}^n ⟨a, xi⟩ xi⟩ = ⟨a, Va⟩.

Then, equivalently:

max_{a∈R^d, ‖a‖=1} (1/n) Σ_{i=1}^n |⟨a, xi⟩|²  ⇔  max_{a∈R^d, ‖a‖=1} ⟨a, Va⟩

SLIDE 26

PCA is an eigenproblem

max_{a∈R^d, ‖a‖=1} ⟨a, Va⟩

◮ Solutions are the stationary points of the Lagrangian

L(a, λ) = ⟨a, Va⟩ − λ(‖a‖² − 1).

◮ Setting ∂L/∂a = 0 gives

Va = λa,   ⟨a, Va⟩ = λ.

The optimization problem is solved by the eigenvector of V associated to the largest eigenvalue.

Note: the reasoning extends to k > 1 – the solution is given by the first k eigenvectors of V.
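A minimal NumPy sketch of this computation (uncentered PCA with V = (1/n) X^T X, exactly as on the slides; per slide 28, subtracting the sample mean first gives the affine variant; the data and sizes are made up):

```python
import numpy as np

def pca(X, k):
    """Top-k PCA of the n x d data matrix X (rows are samples).

    Returns the d x k dictionary D (first k eigenvectors of V)
    and the n x k representations Z with rows z_i = D* x_i.
    """
    n = X.shape[0]
    V = (X.T @ X) / n                 # V = (1/n) X^T X
    evals, evecs = np.linalg.eigh(V)  # eigenvalues in ascending order
    D = evecs[:, ::-1][:, :k]         # first k eigenvectors of V
    return D, X @ D

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
D, Z = pca(X, k=3)
X_rec = Z @ D.T                       # reconstructions D D* x_i
err = np.mean(np.sum((X - X_rec) ** 2, axis=1))
```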

SLIDE 27

PCA model

Assumes the support of the data distribution is well approximated by a low-dimensional linear subspace.

[Figure: data space X]

Can we consider an affine representation? Can we consider non-linear representations using PCA?

SLIDE 28

PCA and affine dictionaries

Consider the problem, with D as in PCA:

min_{D∈D, b∈R^d} (1/n) Σ_{i=1}^n min_{zi∈F_k} ‖xi − Dzi − b‖².

The above problem is equivalent to

min_{D∈D} (1/n) Σ_{i=1}^n ‖x̄i − DD∗x̄i‖²,

with x̄i = xi − m, i = 1, . . . , n (centered data; m is the sample mean, see the next slide).

Note: computations are unchanged, but one needs to consider centered data.

SLIDE 29

PCA and affine dictionaries (cont.)

min_{D∈D, b∈R^d} (1/n) Σ_{i=1}^n min_{zi∈F_k} ‖xi − Dzi − b‖²  ⇔  min_{D∈D} (1/n) Σ_{i=1}^n ‖x̄i − DD∗x̄i‖²

Proof.

◮ Note that Φ(x) = D∗(x − b) (by optimality for z), so that

min_{D∈D, b∈R^d} (1/n) Σ_{i=1}^n ‖xi − b − P(xi − b)‖² = min_{D∈D, b∈R^d} (1/n) Σ_{i=1}^n ‖Q(xi − b)‖²,

with P = DD∗ and Q = I − P.

◮ Solving with respect to b,

Qb = Qm,   m = (1/n) Σ_{i=1}^n xi,

so that Φ(x) = D∗(x − m).

SLIDE 30

Projective coordinates

We can rewrite Dz + b = D′z′, if we let

◮ D′: matrix obtained by adding to D a column equal to b
◮ z′: vector obtained by adding to z a coordinate equal to 1.
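In code this is a one-line identity; a small NumPy check (all names and sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
D = rng.standard_normal((d, k))
b = rng.standard_normal(d)
z = rng.standard_normal(k)

D_prime = np.hstack([D, b[:, None]])  # add b to D as an extra column
z_prime = np.append(z, 1.0)           # add a coordinate equal to 1

assert np.allclose(D @ z + b, D_prime @ z_prime)
```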

SLIDE 31

PCA beyond linearity

[Figure: data space X]

SLIDE 34

Kernel PCA

Consider a feature map and the associated (reproducing) kernel,

Φ̃ : X → F,   K(x, x′) = ⟨Φ̃(x), Φ̃(x′)⟩_F.

Empirical reconstruction error in the feature space:

min_{D∈D} (1/n) Σ_{i=1}^n min_{zi∈F_k} ‖Φ̃(xi) − Dzi‖²_F.

SLIDE 35

Kernel PCA (cont.)

Similar to (linear) PCA (for k = 1),

max_{a∈F, ‖a‖_F=1} ⟨a, Va⟩_F,   where   Va = (1/n) Σ_{i=1}^n ⟨Φ̃(xi), a⟩_F Φ̃(xi).

The representation is given by

Φ(x) = ⟨v, Φ̃(x)⟩_F,   ∀x ∈ X,

with v the eigenvector of V with largest eigenvalue. This can be computed for an arbitrary feature map/kernel.

SLIDE 36

A representer theorem for kernel PCA

Φ(x) = ⟨Φ̃(x), v⟩_F = (1/(nσ)) Σ_{i=1}^n K(xi, x) ui.

Proof (linear case: K(x, x′) = ⟨x, x′⟩, for all x, x′ ∈ X).

◮ Let K̂ = (1/n) XX^T and V = (1/n) X^T X.
◮ V and K̂ have the same (non-zero) eigenvalues.
◮ If u is an eigenvector of K̂ with eigenvalue σ, K̂u = σu, then

v = (1/(nσ)) X^T u = (1/(nσ)) Σ_{i=1}^n xi ui

is an eigenvector of V, also with eigenvalue σ. Then, for all x ∈ X,

Φ(x) = ⟨x, v⟩ = (1/(nσ)) Σ_{i=1}^n ⟨xi, x⟩ ui.

The result extends to an arbitrary kernel: x → Φ̃(x), ⟨Φ̃(x), Φ̃(x′)⟩_F = K(x, x′).
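A minimal sketch of the resulting kernel PCA computation (the Gaussian kernel is an illustrative choice, not prescribed by the slides; u is rescaled so that the corresponding v has unit norm in F, following the proof above):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.5):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kpca_top_component(X, gamma=0.5):
    """Top kernel-PCA component: returns the map x -> Phi(x)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    evals, evecs = np.linalg.eigh(K / n)            # eigendecompose (1/n) K
    sigma, u = evals[-1], evecs[:, -1]              # largest eigenpair
    u = u * np.sqrt(n * sigma) / np.linalg.norm(u)  # scale so ||v||_F = 1
    def phi(x):
        # Phi(x) = (1/(n*sigma)) * sum_i K(x_i, x) u_i
        k = gaussian_kernel(X, x[None, :], gamma)[:, 0]
        return (k @ u) / (n * sigma)
    return phi

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
phi = kpca_top_component(X)
print(phi(X[0]))
```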

SLIDE 37

Comments on PCA, KPCA

◮ PCA allows finding a good representation for a data distribution supported close to a linear/affine subspace.
◮ Non-linear extension using kernels.

Note:

◮ Connection between KPCA and manifold learning, e.g. Laplacian/diffusion maps.
◮ Off-set/re-centering is not needed if the kernel is rich enough.

SLIDE 38

Sparse coding

One of the first and most famous dictionary learning techniques. It corresponds to

◮ F = R^p, p ≥ d,
◮ F_λ = {z ∈ F : ‖z‖₁ ≤ λ}, λ > 0,
◮ D = {D : F → X | ‖Dej‖ ≤ 1, j = 1, . . . , p}.

Hence,

min_{D∈D} (1/n) Σ_{i=1}^n min_{zi∈F_λ} ‖xi − Dzi‖²

(outer minimization: dictionary learning; inner minimization: sparse representation).

SLIDE 39

Computations for sparse coding

min_{D∈D} (1/n) Σ_{i=1}^n min_{zi∈R^p, ‖zi‖₁≤λ} ‖xi − Dzi‖²

◮ not jointly convex in (D, {zi}). . .
◮ but separately convex in {zi} and in D.
◮ Alternating minimization is natural:
– Fix D, compute {zi}.
– Fix {zi}, compute D.
◮ (other approaches are possible, see e.g. [Schnass ’15, Elad et al. ’06])

SLIDE 40

Representation computation

1. Given the dictionary D, solve

min_{zi∈R^p, ‖zi‖₁≤λ} ‖xi − Dzi‖²,   i = 1, . . . , n.

These problems are convex and correspond to sparse estimation; they can be solved using convex optimization techniques.

Splitting/proximal methods

Given z(0), iterate

z(t+1) = S_λ(z(t) + γt D∗(xi − Dz(t))),   t = 0, . . . , tmax,

with S_λ the soft-thresholding operator, S_λ(u) = max{|u| − λ, 0} · u/|u|, u ∈ R, applied componentwise.
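A minimal sketch of this iteration for the penalized form min_z ‖x − Dz‖² + λ‖z‖₁ (ISTA); the step size 1/L and the threshold λγ follow the standard proximal-gradient recipe, details the slide leaves implicit:

```python
import numpy as np

def soft_threshold(u, t):
    # S_t(u) = sign(u) * max(|u| - t, 0), componentwise
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def sparse_code(x, D, lam, n_iter=200):
    """ISTA for min_z ||x - D z||^2 + lam * ||z||_1."""
    z = np.zeros(D.shape[1])
    gamma = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)  # 1/L, L = 2 ||D||_2^2
    for _ in range(n_iter):
        # gradient step on ||x - D z||^2, then soft-thresholding
        z = soft_threshold(z + 2 * gamma * D.T @ (x - D @ z), lam * gamma)
    return z
```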

SLIDE 41

Dictionary computation

2. Given the representations {Φ(xi) = zi}, i = 1, . . . , n, solve

min_{D∈D} (1/n) Σ_{i=1}^n ‖xi − DΦ(xi)‖² = min_{D∈D} (1/n) ‖X − ZD∗‖²_F,

where Z is the n × p matrix with rows zi and ‖·‖_F the Frobenius norm. The problem is convex and solvable using convex optimization techniques.

Splitting/proximal methods

Given D(0), iterate

D(t+1) = P(D(t) + γt (X − ZD(t)∗)∗Z),   t = 0, . . . , tmax,

with P the proximal operator (a projection) induced by the constraints ‖Dej‖ ≤ 1, applied to each column Dj of D:

P(Dj) = Dj/‖Dj‖ if ‖Dj‖ > 1,   P(Dj) = Dj if ‖Dj‖ ≤ 1.
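A matching sketch of the projected-gradient dictionary update (X is the n × d data matrix and Z the n × p coefficient matrix, as on the slide; the fixed step size is a simplifying assumption):

```python
import numpy as np

def project_columns(D):
    # project each column onto the unit ball: D_j <- D_j / max(||D_j||, 1)
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms

def dictionary_update(X, Z, D0, n_iter=100):
    """Projected gradient for min_D (1/n) ||X - Z D^T||_F^2, ||D e_j|| <= 1."""
    n = X.shape[0]
    gamma = n / (2 * np.linalg.norm(Z, 2) ** 2)  # ~1/L for this objective
    D = D0.copy()
    for _ in range(n_iter):
        grad = -(2.0 / n) * (X - Z @ D.T).T @ Z  # gradient w.r.t. D
        D = project_columns(D - gamma * grad)
    return D
```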

SLIDE 42

Sparse coding model

◮ Assumes the support of the data distribution to be a union of (p choose s) subspaces, i.e. the subspaces spanned by all possible subsets of s dictionary atoms, where s is the sparsity level.²
◮ More general penalties, more general geometric assumptions.

²Image credit: Elhamifar, Eldar, 2013

SLIDE 43

K-means & vector quantization

Typically seen as a clustering algorithm in machine learning. . . but it is also a classical vector quantization (VQ) approach.³ We revisit this point of view from a data representation perspective.

³Image: Wikipedia

SLIDE 44

K-means & vector quantization (cont.)

K-means corresponds to

◮ F_λ = F_k = {e1, . . . , ek}, the canonical basis in R^k, k ≤ n
◮ D = {D : F → X | linear}.

Empirical reconstruction error:

min_{D∈D} (1/n) Σ_{i=1}^n min_{zi∈{e1,...,ek}} ‖xi − Dzi‖²

The problem is not convex in (D, {zi}). Approximate solution through alternating minimization (AM).

SLIDE 45

K-means solution: alternating minimization (Lloyd’s algorithm)

Initialize the dictionary D.

1. Let {Φ(xi) = zi}, i = 1, . . . , n, be the solutions of the problems

min_{zi∈{e1,...,ek}} ‖xi − Dzi‖²,   i = 1, . . . , n.

Assignment: Vj = {x ∈ S | Φ(x) = ej} (multiple points have the same representation since k ≤ n).

2. Update: let aj = Dej (a single dictionary atom),

min_{D∈D} (1/n) Σ_{i=1}^n ‖xi − DΦ(xi)‖² = min_{a1,...,ak∈R^d} (1/n) Σ_{j=1}^k Σ_{x∈Vj} ‖x − aj‖².

SLIDE 46

Step 1: assignment

Solving the discrete problem:

min_{zi∈{e1,...,ek}} ‖xi − Dzi‖²,   i = 1, . . . , n.

[Figure: Voronoi partition induced by centroids c1, c2, c3]

Voronoi sets (data clusters):

Vj = {x ∈ S | Φ(x) = ej},   j = 1, . . . , k.

SLIDE 47

Step 2: dictionary update

min_{D∈D} (1/n) Σ_{i=1}^n ‖xi − DΦ(xi)‖² = min_{a1,...,ak∈R^d} (1/n) Σ_{j=1}^k Σ_{x∈Vj} ‖x − aj‖²,

where Φ(xi) = zi and aj = Dej. The minimization with respect to each column aj of D is independent of all the others.

Centroid computation

cj = arg min_{aj∈R^d} Σ_{x∈Vj} ‖x − aj‖² = (1/|Vj|) Σ_{x∈Vj} x,   j = 1, . . . , k.

The minimum for each column is the centroid of the corresponding Voronoi set.
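A minimal sketch of Lloyd's algorithm combining the assignment step (slide 46) and the centroid update above (random initialization is used for brevity; slide 49 gives the k-means++ alternative):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: X is n x d; returns k x d centroids and assignments."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Step 1 (assignment): nearest centroid for each point
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Step 2 (update): centroid of each non-empty Voronoi set
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```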

SLIDE 48

K-means convergence

The algorithm for solving K-means is known as Lloyd’s algorithm.

◮ Alternating minimization approach:
⇒ the value of the objective function can be shown to be non-increasing along the iterations.

◮ Only a finite number of possible partitions into k clusters:
⇒ convergence to a local minimum in a finite number of steps is ensured.

SLIDE 49

K-means initialization

Convergence to a global minimum can be ensured (with high probability), provided a suitable initialization. Intuition: spreading out the initial k centroids.

K-means++ [Arthur, Vassilvitskii ’07]

1. Choose a centroid uniformly at random from the data.
2. Compute the distance of each data point to the nearest centroid already chosen:

D(x, {cj}) = min_{cj} ‖x − cj‖²,   ∀x ∈ S.

3. Choose a new centroid from the data, with probabilities proportional to these distances.
4. Repeat steps 2 and 3 until k centers have been chosen.
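A short sketch of this seeding procedure (squared distances are used as sampling weights, matching step 2):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: X is n x d; returns k x d initial centroids."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # step 1: uniform choice
    for _ in range(k - 1):
        # step 2: squared distance to the nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        # step 3: sample a new center with probability proportional to d2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```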

SLIDE 50

K-means model

[Figure: M = supp{ρ} approximated by centroids c1, c2, c3; each point x is represented by its nearest centroid]

◮ Representation: extreme sparse representation, only one non-zero coefficient (vector quantization).
◮ Reconstruction: piecewise constant approximation of the data; each point is reconstructed by the nearest mean.

Extensions consider higher-order approximations, e.g. piecewise linear.

SLIDE 51

K-flats & piece-wise linear representation

[Figure: M = supp{ρ} approximated by flats Ψ1, Ψ2, Ψ3; each point x is represented via the nearest flat]

◮ K-flats representation: structured sparse representation; the coefficients are the projection onto a flat.
◮ K-flats reconstruction: piecewise linear approximation of the data; each point is reconstructed by projection onto the nearest flat.

SLIDE 52

Remarks on K-flats

[Figure: as on the previous slide]

◮ Principled way to enrich the K-means representation (cf. softmax).
◮ Generalized VQ.
◮ Geometric structured dictionary learning.
◮ Non-local approximations.

SLIDE 53

K-flats computations: alternating minimization

1. Initialize the flats Ψ1, . . . , Ψk.
2. Assign each point to its nearest flat:

Vj = {x ∈ S | ‖x − ΨjΨj∗x‖ ≤ ‖x − ΨtΨt∗x‖, ∀t ≠ j}.

3. Update the flats by computing a (local) PCA in each cell Vj, j = 1, . . . , k.
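A minimal sketch of this alternating scheme (each flat Ψj is stored as a d × s orthonormal basis; linear, non-affine flats and random initial bases are simplifying assumptions):

```python
import numpy as np

def k_flats(X, k, s, n_iter=20, seed=0):
    """Alternating minimization for K-flats: X is n x d, s = flat dimension."""
    rng = np.random.default_rng(seed)
    # random orthonormal d x s bases as initial flats
    bases = [np.linalg.qr(rng.standard_normal((X.shape[1], s)))[0]
             for _ in range(k)]
    for _ in range(n_iter):
        # assignment: residual ||x - Psi_j Psi_j* x||^2 for each flat
        res = np.stack([((X - X @ B @ B.T) ** 2).sum(1) for B in bases], axis=1)
        labels = res.argmin(axis=1)
        # update: local PCA (top-s eigenvectors) in each non-degenerate cell
        for j in range(k):
            Xj = X[labels == j]
            if len(Xj) >= s:
                V = (Xj.T @ Xj) / len(Xj)
                bases[j] = np.linalg.eigh(V)[1][:, ::-1][:, :s]
    return bases, labels
```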

SLIDE 54

Kernel K-means & K-flats

It is easy to extend K-means and K-flats using kernels,

Φ̃ : X → H,   K(x, x′) = ⟨Φ̃(x), Φ̃(x′)⟩_H.

Consider the empirical reconstruction problem in the feature space:

min_{D∈D} (1/n) Σ_{i=1}^n min_{zi∈{e1,...,ek}} ‖Φ̃(xi) − Dzi‖²_H.

Note: the computations can be performed in closed form.

◮ Kernel K-means: distance computation.
◮ Kernel K-flats: distance computation + local KPCA.
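For kernel K-means, the assignment step needs feature-space distances to the cell means, which expand into kernel evaluations only. A sketch, assuming a precomputed n × n kernel matrix K and non-empty cells:

```python
import numpy as np

def kernel_kmeans_distances(K, labels, k):
    """Squared feature-space distances ||Phi(x_i) - m_j||^2 from the kernel matrix.

    m_j is the mean of Phi over cell j:
    ||Phi(x_i) - m_j||^2 = K_ii - (2/|V_j|) sum_{l in V_j} K_il
                           + (1/|V_j|^2) sum_{l,l' in V_j} K_ll'
    """
    n = K.shape[0]
    d2 = np.empty((n, k))
    for j in range(k):
        idx = np.flatnonzero(labels == j)  # assumes each cell is non-empty
        d2[:, j] = (np.diag(K)
                    - 2 * K[:, idx].mean(axis=1)
                    + K[np.ix_(idx, idx)].mean())
    return d2  # next assignment: d2.argmin(axis=1)
```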

SLIDE 55

Wrap-up: parsimonious reconstruction

Algorithms, computations & models.

We have not talked about:

◮ Statistics/stability:

P( min_D (1/n) Σ_{i=1}^n min_{zi∈F_k} ‖xi − Dzi‖² − min_D ∫ dρ(x) min_{z∈F_k} ‖x − Dz‖² > ε )

◮ Geometry/quantization:

min_D ∫ dρ(x) min_{z∈F_k} ‖x − Dz‖² → 0 as k → ∞

◮ Computations: non-convex optimization? algorithmic guarantees?

SLIDE 56

Road map

This class:

◮ Part II: Data representations by unsupervised learning

– Dictionary Learning
– PCA
– Sparse coding
– K-means, K-flats

Next class:

◮ Part III: Deep data representations (unsupervised, supervised)

– Neural Networks basics
– Autoencoders
– ConvNets