

  1. MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 20: Dictionary Learning

  2. What is data representation?
Let X be a data space.
A data representation is a map Φ : X → F, from the data space to a representation space F.
A data reconstruction is a map Ψ : F → X.
[Figure: Φ maps a set M ⊆ X into Φ(M) ⊆ F, and Ψ ∘ Φ maps it back into X as Ψ ∘ Φ(M).]

  3. Road map
Last class:
◮ Prologue: Learning theory and data representation
◮ Part I: Data representations by design
This class:
◮ Part II: Data representations by unsupervised learning
  – Dictionary Learning
  – PCA
  – Sparse coding
  – K-means, K-flats
Next class:
◮ Part III: Deep data representations

  4. Notation
X : data space
◮ X = R^d or X = C^d (also more general later)
◮ x ∈ X
F : representation space
◮ F = R^p or F = C^p
◮ z ∈ F
Data representation: Φ : X → F.  ∀ x ∈ X, ∃ z ∈ F : Φ(x) = z.
Data reconstruction: Ψ : F → X.  ∀ z ∈ F, ∃ x ∈ X : Ψ(z) = x.

  5. Why learning?
Ideally: automatic, autonomous learning
◮ with as little prior information as possible, but also...
◮ ...with as little human supervision as possible.
f(x) = ⟨w, Φ(x)⟩_F,  ∀ x ∈ X
Two-step learning scheme:
◮ supervised or unsupervised learning of Φ : X → F
◮ supervised learning of w in F

  6. Unsupervised representation learning
Samples from a distribution ρ on the input space X:
S = {x_1, ..., x_n} ∼ ρ^n
Training set S from ρ (supported on X_ρ).
Goal: find Φ(x) which is “good” not only for S but for other x ∼ ρ.
Principles for unsupervised learning of “good” representations?

  7. Unsupervised representation learning principles
Two main concepts:
1. Similarity preservation: it holds that Φ(x) ∼ Φ(x′) ⇔ x ∼ x′, for all x, x′ ∈ X.
2. Reconstruction: there exists a map Ψ : F → X such that Ψ ∘ Φ(x) ∼ x, for all x ∈ X.

  8. Plan
We will first introduce a reconstruction-based framework for learning data representations, and then discuss several examples in some detail.
We will mostly consider X = R^d and F = R^p.
◮ Representation: Φ : X → F
◮ Reconstruction: Ψ : F → X
If the maps are linear:
◮ Representation: Φ(x) = Cx (coding)
◮ Reconstruction: Ψ(z) = Dz (decoding)
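A minimal NumPy sketch of the linear coding/decoding pair above; the matrices C and D here are arbitrary random placeholders, used only to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 5, 3                       # data dimension and representation dimension (arbitrary)
C = rng.standard_normal((p, d))   # coding map:  Phi(x) = C x
D = rng.standard_normal((d, p))   # decoding map: Psi(z) = D z

x = rng.standard_normal(d)
z = C @ x                         # representation of x in F = R^p
x_rec = D @ z                     # reconstruction Psi(Phi(x)) back in X = R^d
print(np.linalg.norm(x - x_rec))  # reconstruction error ||x - D C x||
```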

  9. Reconstruction based data representation
Basic idea: the quality of a representation Φ is measured by the reconstruction error provided by an associated reconstruction Ψ,
‖x − Ψ ∘ Φ(x)‖,
where Ψ ∘ Φ denotes the composition of Φ and Ψ.

  10. Empirical data and population
Given S = {x_1, ..., x_n}, minimize the empirical reconstruction error
Ê(Φ, Ψ) = (1/n) Σ_{i=1}^n ‖x_i − Ψ ∘ Φ(x_i)‖²,
as a proxy to the expected reconstruction error
E(Φ, Ψ) = ∫_X dρ(x) ‖x − Ψ ∘ Φ(x)‖²,
where ρ is the data distribution (fixed but unknown).
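A sketch of how the empirical reconstruction error could be computed, assuming the training set is stored as the rows of an n × d NumPy array; the function and variable names are illustrative, not from the course:

```python
import numpy as np

def empirical_error(X, encode, decode):
    """Average squared reconstruction error (1/n) sum_i ||x_i - Psi(Phi(x_i))||^2."""
    residuals = X - np.array([decode(encode(x)) for x in X])
    return np.mean(np.sum(residuals ** 2, axis=1))

# Example with a random linear coding/decoding pair (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d, p = 100, 10, 4
X = rng.standard_normal((n, d))
C = rng.standard_normal((p, d))
D = rng.standard_normal((d, p))
print(empirical_error(X, lambda x: C @ x, lambda z: D @ z))
```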

  11. Empirical data and population
min_{Φ, Ψ} E(Φ, Ψ),   E(Φ, Ψ) = ∫_X dρ(x) ‖x − Ψ ∘ Φ(x)‖²
Caveat: reconstruction alone is not enough... copying the data, i.e. Ψ ∘ Φ = I, gives zero reconstruction error!

  12. Parsimonious reconstruction
Reconstruction is meaningful only with constraints!
◮ Constraints implement some form of parsimonious reconstruction,
◮ identified with a form of regularization;
◮ the choice of the constraints corresponds to different algorithms.
Fundamental difference with supervised learning: the problem is not well defined!

  13. Parsimonious reconstruction
[Figure: same diagram as slide 2 — Φ maps M ⊆ X into Φ(M) ⊆ F, and Ψ ∘ Φ maps it back into X as Ψ ∘ Φ(M).]

  14. Dictionary learning
‖x − Ψ ∘ Φ(x)‖
Let X = R^d, F = R^p.
1. Linear reconstruction: Ψ(z) = Dz, D ∈ D, with D a subset of the space of linear maps from F to X.
2. Nearest neighbor representation:
Φ(x) = Φ_Ψ(x) = argmin_{z ∈ F_λ} ‖x − Dz‖²,   D ∈ D, F_λ ⊂ F.

  15. Linear reconstruction and dictionaries
The reconstruction D ∈ D can be identified with a d × p dictionary matrix with columns a_1, ..., a_p ∈ R^d.
Reconstruction of x ∈ X corresponds to a suitable linear expansion on the dictionary D with coefficients β_k = z_k, z ∈ F_λ:
x = Dz = Σ_{k=1}^p a_k z_k = Σ_{k=1}^p a_k β_k,   β_1, ..., β_p ∈ R.

  16. Nearest neighbor representation
Φ(x) = Φ_Ψ(x) = argmin_{z ∈ F_λ} ‖x − Dz‖²,   D ∈ D, F_λ ⊂ F.
This is a nearest neighbor (NN) representation since, for D ∈ D and letting X_λ = D F_λ, Φ(x) provides the closest point to x in X_λ:
d(x, X_λ) = min_{x′ ∈ X_λ} ‖x − x′‖² = min_{z′ ∈ F_λ} ‖x − Dz′‖².
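If F_λ happens to be a finite set of admissible codes (for instance the canonical basis vectors, the K-means-style choice mentioned in the road map), the NN representation can be computed by brute force. A sketch under that assumption:

```python
import numpy as np

def nn_representation(x, D, codes):
    """Return the z in the finite set `codes` minimizing ||x - D z||^2."""
    errors = [np.sum((x - D @ z) ** 2) for z in codes]
    return codes[int(np.argmin(errors))]

rng = np.random.default_rng(0)
d, p = 6, 4
D = rng.standard_normal((d, p))
codes = list(np.eye(p))             # F_lambda = canonical basis vectors (K-means-like choice)
x = rng.standard_normal(d)
z = nn_representation(x, D, codes)
print(z, np.sum((x - D @ z) ** 2))  # selected code and distance d(x, X_lambda)
```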

  17. Nearest neighbor representation (cont.)
NN representations are defined by a constrained inverse problem,
min_{z ∈ F_λ} ‖x − Dz‖².
Alternatively, let F_λ = F and add a regularization term R : F → R:
min_{z ∈ F} { ‖x − Dz‖² + λ R(z) }.
Note: the two formulations coincide for R(z) = 1_{F_λ}(z), z ∈ F (the indicator function of F_λ).
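For one particular choice of penalty, R(z) = ‖z‖² (a Tikhonov/ridge choice, used here only as an example), the regularized problem has the closed-form solution z = (DᵀD + λI)⁻¹ Dᵀ x. A sketch:

```python
import numpy as np

def ridge_code(x, D, lam):
    """argmin_z ||x - D z||^2 + lam * ||z||^2  (Tikhonov-regularized representation)."""
    p = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(p), D.T @ x)

rng = np.random.default_rng(0)
d, p, lam = 8, 5, 0.1
D = rng.standard_normal((d, p))
x = rng.standard_normal(d)
z = ridge_code(x, D, lam)
print(np.sum((x - D @ z) ** 2) + lam * np.sum(z ** 2))  # value of the regularized objective
```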

  18. Dictionary learning
Empirical reconstruction error minimization
min_{Φ, Ψ} Ê(Φ, Ψ) = min_{Φ, Ψ} (1/n) Σ_{i=1}^n ‖x_i − Ψ ∘ Φ(x_i)‖²
for joint dictionary and representation learning:
min_{D ∈ D} (1/n) Σ_{i=1}^n min_{z_i ∈ F_λ} ‖x_i − D z_i‖²
(the outer minimization over D is dictionary learning; the inner minimization over z_i is representation learning).
Dictionary learning:
◮ learning a regularized representation on a dictionary,
◮ while simultaneously learning the dictionary itself.
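In practice this joint problem is often tackled by alternating minimization: fix D and compute the codes z_i, then fix the codes and update D. The sketch below uses ridge-regularized codes and a least-squares dictionary update; this is one simple instantiation, not necessarily the algorithm intended in the slides:

```python
import numpy as np

def dictionary_learning(X, p, lam=0.1, iters=20, seed=0):
    """Alternate between coding (ridge) and dictionary update (least squares)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    D = rng.standard_normal((d, p))
    for _ in range(iters):
        # Representation step: z_i = argmin_z ||x_i - D z||^2 + lam * ||z||^2.
        Z = np.linalg.solve(D.T @ D + lam * np.eye(p), D.T @ X.T).T   # n x p codes
        # Dictionary step: D = argmin_D sum_i ||x_i - D z_i||^2.
        D = np.linalg.lstsq(Z, X, rcond=None)[0].T                    # d x p dictionary
    return D, Z

X = np.random.default_rng(1).standard_normal((200, 10))
D, Z = dictionary_learning(X, p=4)
print(np.mean(np.sum((X - Z @ D.T) ** 2, axis=1)))  # empirical reconstruction error
```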

  19. Examples
The DL framework encompasses a number of approaches:
◮ PCA (& kernel PCA)
◮ K-SVD
◮ Sparse coding
◮ K-means
◮ K-flats
◮ ...

  20. Principal Component Analysis (PCA)
Let F_λ = F_k = R^k, k ≤ min{n, d}, and
D = {D : F → X, linear | D*D = I}.
◮ D is a d × k matrix with orthogonal, unit-norm columns.
◮ Reconstruction: Dz = Σ_{j=1}^k a_j z_j, z ∈ F.
◮ Representation: D* : X → F, D*x = (⟨a_1, x⟩, ..., ⟨a_k, x⟩), x ∈ X.
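A small sketch of the PCA coding/decoding pair; the orthonormal dictionary is built here from a QR factorization of a random matrix, only so that D*D = I holds:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3
D, _ = np.linalg.qr(rng.standard_normal((d, k)))  # d x k, orthonormal columns: D^T D = I_k

x = rng.standard_normal(d)
z = D.T @ x    # representation D* x = (<a_1, x>, ..., <a_k, x>)
x_rec = D @ z  # reconstruction D D* x, the projection of x onto span(a_1, ..., a_k)

print(np.allclose(D.T @ D, np.eye(k)))    # True: D* D = I
print(np.allclose(x_rec, (D @ D.T) @ x))  # True: same projection
```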

  21. PCA and subset selection
DD* : X → X,   DD*x = Σ_{j=1}^k a_j ⟨a_j, x⟩,   x ∈ X.
P = DD* is a projection¹ onto the subspace of R^d spanned by a_1, ..., a_k.
¹ P = P² (idempotent)

  22. Rewriting PCA
min_{D ∈ D} (1/n) Σ_{i=1}^n min_{z_i ∈ F_k} ‖x_i − D z_i‖²
(the inner minimization is representation learning).
Note that
Φ(x) = D*x = argmin_{z ∈ F_k} ‖x − Dz‖²,   ∀ x ∈ X,
so the minimization can be rewritten (setting z = D*x) as
min_{D ∈ D} (1/n) Σ_{i=1}^n ‖x_i − DD*x_i‖².
Subspace learning: finding the k-dimensional orthogonal projection DD* with the best (empirical) reconstruction.

  23. Learning a linear representation with PCA
Subspace learning: finding the k-dimensional orthogonal projection with the best reconstruction.
[Figure: illustration in the data space X.]

  24. PCA computation
Recall the solution for k = 1. For all x ∈ X,
DD*x = ⟨a, x⟩ a,   ‖x − ⟨a, x⟩ a‖² = ‖x‖² − |⟨a, x⟩|²,
with a ∈ R^d such that ‖a‖ = 1. Then, equivalently:
min_{D ∈ D} (1/n) Σ_{i=1}^n ‖x_i − DD*x_i‖²  ⇔  max_{a ∈ R^d, ‖a‖=1} (1/n) Σ_{i=1}^n |⟨a, x_i⟩|².
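A quick numerical check of the identity ‖x − ⟨a, x⟩a‖² = ‖x‖² − |⟨a, x⟩|² for unit-norm a, which is what turns reconstruction minimization into the maximization on the right:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 7
x = rng.standard_normal(d)
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                     # ||a|| = 1

lhs = np.sum((x - np.dot(a, x) * a) ** 2)  # ||x - <a, x> a||^2
rhs = np.sum(x ** 2) - np.dot(a, x) ** 2   # ||x||^2 - |<a, x>|^2
print(np.isclose(lhs, rhs))                # True
```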

  25. PCA computation (cont.)
Let X̂ be the n × d data matrix (with rows x_iᵀ) and V = (1/n) X̂ᵀ X̂. Then
(1/n) Σ_{i=1}^n |⟨a, x_i⟩|² = (1/n) Σ_{i=1}^n ⟨a, x_i⟩⟨a, x_i⟩ = ⟨a, (1/n) Σ_{i=1}^n ⟨a, x_i⟩ x_i⟩ = ⟨a, Va⟩.
Then, equivalently:
max_{a ∈ R^d, ‖a‖=1} (1/n) Σ_{i=1}^n |⟨a, x_i⟩|²  ⇔  max_{a ∈ R^d, ‖a‖=1} ⟨a, Va⟩.

  26. PCA is an eigenproblem
max_{a ∈ R^d, ‖a‖=1} ⟨a, Va⟩
◮ Solutions are the stationary points of the Lagrangian
L(a, λ) = ⟨a, Va⟩ − λ(‖a‖² − 1).
◮ Setting ∂L/∂a = 0 gives Va = λa, hence ⟨a, Va⟩ = λ.
The optimization problem is solved by the eigenvector of V associated with the largest eigenvalue.
Note: the reasoning extends to k > 1 – the solution is given by the first k eigenvectors of V.
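Putting the last few slides together: form V = (1/n) X̂ᵀ X̂ and take its leading k eigenvectors as the dictionary columns. A NumPy sketch (names are illustrative, not course code):

```python
import numpy as np

def pca_dictionary(X, k):
    """Top-k eigenvectors of V = (1/n) X^T X as a d x k dictionary."""
    n, d = X.shape
    V = (X.T @ X) / n
    eigvals, eigvecs = np.linalg.eigh(V)  # symmetric V, eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]        # columns = leading eigenvectors

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
D = pca_dictionary(X, k=3)
# Empirical reconstruction error of the rank-k projection D D^T.
print(np.mean(np.sum((X - X @ D @ D.T) ** 2, axis=1)))
```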

  27. PCA model
Assumes the support of the data distribution is well approximated by a low-dimensional linear subspace.
[Figure: data in the space X concentrated near a linear subspace.]
Can we consider an affine representation?
Can we consider non-linear representations using PCA?

  28. PCA and affine dictionaries
Consider the problem, with D as in PCA:
min_{D ∈ D, b ∈ R^d} (1/n) Σ_{i=1}^n min_{z_i ∈ F_k} ‖x_i − D z_i − b‖².
The above problem is equivalent to
min_{D ∈ D} (1/n) Σ_{i=1}^n ‖x̄_i − DD* x̄_i‖²
with x̄_i = x_i − m, i = 1, ..., n, where m is the sample mean.
Note: computations are unchanged, but one needs to consider centered data.
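The affine case in code: subtract the sample mean m, run the same eigencomputation on the centered data, and reconstruct as m + DD*(x − m). A self-contained sketch, with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) + 5.0          # data with a non-zero mean
k = 3

m = X.mean(axis=0)                                # sample mean m
Xc = X - m                                        # centered data  x_i - m
V = (Xc.T @ Xc) / Xc.shape[0]                     # same V as before, on centered data
D = np.linalg.eigh(V)[1][:, ::-1][:, :k]          # top-k eigenvectors of V

X_rec = m + Xc @ D @ D.T                          # affine reconstruction  m + D D* (x_i - m)
print(np.mean(np.sum((X - X_rec) ** 2, axis=1)))  # empirical reconstruction error
```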
