RegML 2016 Class 7: Dictionary Learning (Lorenzo Rosasco)


  1. RegML 2016 Class 7 Dictionary learning Lorenzo Rosasco UNIGE-MIT-IIT June 30, 2016

  2. Data representation A data representation is a mapping of data into a new format better suited for further processing.

  3. Data representation (cont.) Given a data space X, a data representation is a map Φ : X → F to a representation space F. It goes under different names in different fields: ◮ machine learning: feature map ◮ signal processing: analysis operator/transform ◮ information theory: encoder ◮ computational geometry: embedding

  4. Supervised or Unsupervised? Supervised (labelled/annotated) data are expensive! Ideally, a good data representation should reduce the need for (human) annotation... ⇒ unsupervised learning of Φ

  5. Unsupervised representation learning Samples S = {x_1, ..., x_n} from a distribution ρ on the input space X are available. What are the principles for learning a "good" representation in an unsupervised fashion?

  6. Unsupervised representation learning principles Two main concepts: 1. Reconstruction: there exists a map Ψ : F → X such that Ψ∘Φ(x) ∼ x, for all x ∈ X. 2. Similarity preservation: it holds that Φ(x) ∼ Φ(x′) ⇔ x ∼ x′, for all x, x′ ∈ X. Most unsupervised work has focused on reconstruction rather than on similarity; we give an overview next.

  7. Reconstruction based data representation Basic idea: the quality of a representation Φ is measured by the reconstruction error provided by an associated reconstruction Ψ, ‖x − Ψ∘Φ(x)‖.

  8. Empirical data and population Given S = {x_1, ..., x_n}, minimize the empirical reconstruction error Ê(Φ, Ψ) = (1/n) Σ_{i=1}^n ‖x_i − Ψ∘Φ(x_i)‖², as a proxy for the expected reconstruction error E(Φ, Ψ) = ∫ dρ(x) ‖x − Ψ∘Φ(x)‖², where ρ is the data distribution (fixed but unknown).
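A minimal numerical illustration (not from the slides) of the empirical reconstruction error for a linear encoder/decoder pair; the matrices Phi and Psi and the data sizes below are arbitrary choices made for the example.

    import numpy as np

    def empirical_reconstruction_error(X, Phi, Psi):
        # Average of ||x_i - Psi(Phi(x_i))||^2 over the rows x_i of X.
        residuals = X - X @ Phi.T @ Psi.T   # row-wise application of Psi o Phi
        return np.mean(np.sum(residuals ** 2, axis=1))

    # Hypothetical example: random 3-dimensional data encoded in 2 dimensions.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    Phi = rng.standard_normal((2, 3))   # representation map Phi : R^3 -> R^2
    Psi = rng.standard_normal((3, 2))   # reconstruction map Psi : R^2 -> R^3
    print(empirical_reconstruction_error(X, Phi, Psi))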

  9. Empirical data and population min_{Φ,Ψ} E(Φ, Ψ), E(Φ, Ψ) = ∫ dρ(x) ‖x − Ψ∘Φ(x)‖². Caveat... But reconstruction alone is not enough: copying data, i.e. Ψ∘Φ = I, gives zero reconstruction error!

  10. Dictionary learning ‖x − Ψ∘Φ(x)‖ Let X = R^d, F = R^p. 1. Linear reconstruction: Ψ ∈ D, with D a subset of the space of linear maps from F to X. 2. Nearest neighbor representation: for Ψ ∈ D, Φ(x) = Φ_Ψ(x) = argmin_{β∈F_λ} ‖x − Ψβ‖², where F_λ is a subset of F.

  11. Linear reconstruction and dictionaries Each reconstruction Ψ ∈ D can be identified with a dictionary matrix with columns a_1, ..., a_p ∈ R^d. The reconstruction of an input x ∈ X corresponds to a suitable linear expansion on the dictionary, x = Σ_{j=1}^p a_j β_j, with β_1, ..., β_p ∈ R.

  12. Nearest neighbor representation Φ(x) = Φ_Ψ(x) = argmin_{β∈F_λ} ‖x − Ψβ‖². The above representation is called nearest neighbor (NN) since, for Ψ ∈ D and X_λ = ΨF_λ, the representation Φ(x) provides the closest point to x in X_λ, d(x, X_λ) = min_{x′∈X_λ} ‖x − x′‖² = min_{β∈F_λ} ‖x − Ψβ‖².

  13. Nearest neighbor representation (cont.) NN representations are defined by a constrained inverse problem, min_{β∈F_λ} ‖x − Ψβ‖². Alternatively, let F_λ = F and add a regularization term R_λ : F → R, min_{β∈F} ( ‖x − Ψβ‖² + R_λ(β) ).
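Two standard choices of the penalty R_λ, added here as an illustration (they are not listed on the slide): a quadratic (Tikhonov) penalty gives a closed-form representation, while the ℓ1 penalty, used by sparse coding below, must be computed iteratively.

    % Tikhonov penalty: closed-form representation
    R_\lambda(\beta) = \lambda \|\beta\|^2
      \;\Longrightarrow\;
      \Phi(x) = \arg\min_{\beta} \|x - \Psi\beta\|^2 + \lambda\|\beta\|^2
              = (\Psi^*\Psi + \lambda I)^{-1} \Psi^* x
    % \ell_1 penalty: sparse representation, no closed form (see slide 27)
    R_\lambda(\beta) = \lambda \|\beta\|_1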

  14. Dictionary learning Then min_{Ψ,Φ} (1/n) Σ_{i=1}^n ‖x_i − Ψ∘Φ(x_i)‖² becomes min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{β_i∈F_λ} ‖x_i − Ψβ_i‖², where the outer minimization over Ψ is dictionary learning and the inner minimization over the β_i is representation learning. Dictionary learning: ◮ learning a regularized representation on a dictionary... ◮ while simultaneously learning the dictionary itself.

  15. Examples The framework introduced above encompasses a large number of approaches. ◮ PCA (& kernel PCA) ◮ KSVD ◮ Sparse coding ◮ K-means ◮ K-flats ◮ . . .

  16. Example 1: Principal Component Analysis (PCA) Let F_λ = F_k = R^k, k ≤ min{n, d}, and D = {Ψ : F → X linear | Ψ*Ψ = I}. ◮ Ψ is a d × k matrix with orthogonal, unit norm columns, Ψβ = Σ_{j=1}^k a_j β_j, β ∈ F. ◮ Ψ* : X → F, Ψ*x = (⟨a_1, x⟩, ..., ⟨a_k, x⟩), x ∈ X.

  17. PCA & best subspace ◮ ΨΨ* : X → X, ΨΨ*x = Σ_{j=1}^k a_j ⟨a_j, x⟩, x ∈ X. [figure: decomposition of x into its projection ⟨x, a⟩a along a direction a and the residual x − ⟨x, a⟩a] ◮ P = ΨΨ* is the projection (P = P²) onto the subspace of R^d spanned by a_1, ..., a_k.
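A one-line check, added for completeness, that P is indeed a projection; it uses only the constraint Ψ*Ψ = I from the definition of D:

    P^2 = \Psi\Psi^* \, \Psi\Psi^* = \Psi(\Psi^*\Psi)\Psi^* = \Psi\Psi^* = P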

  18. Rewriting PCA Note that Φ(x) = Ψ*x = argmin_{β∈F_k} ‖x − Ψβ‖², for all x ∈ X, so that we can rewrite the PCA minimization as min_{Ψ∈D} (1/n) Σ_{i=1}^n ‖x_i − ΨΨ*x_i‖². Subspace learning: the problem of finding a k-dimensional orthogonal projection giving the best reconstruction.

  19. PCA computation Let X̂ be the n × d data matrix and C = (1/n) X̂ᵀX̂. The PCA optimization problem is solved by the eigenvectors of C associated with the k largest eigenvalues.
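A minimal numpy sketch of this computation; the synthetic data and the value of k are arbitrary choices for the example, and in practice the data are usually centered first (a step not mentioned on the slide).

    import numpy as np

    def pca_dictionary(X, k):
        # Columns of Psi are the eigenvectors of C = (1/n) X^T X
        # associated with the k largest eigenvalues.
        n, d = X.shape
        C = (X.T @ X) / n
        eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues
        return eigvecs[:, ::-1][:, :k]           # top-k eigenvectors as columns

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 5))  # data on a 2-dim subspace of R^5
    Psi = pca_dictionary(X, k=2)
    reconstruction = X @ Psi @ Psi.T             # row-wise Psi Psi^* x_i
    print(np.mean(np.sum((X - reconstruction) ** 2, axis=1)))        # approximately zero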

  20. Learning a linear representation with PCA Subspace learning: the problem of finding a k-dimensional orthogonal projection giving the best reconstruction. [figure] PCA assumes the support of the data distribution to be well approximated by a low dimensional linear subspace.

  21.-23. PCA beyond linearity [figures only]

  24. Kernel PCA Consider a feature map φ : X → H and the associated (reproducing) kernel K(x, x′) = ⟨φ(x), φ(x′)⟩_H. We can consider the empirical reconstruction in the feature space, min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{β_i∈H} ‖φ(x_i) − Ψβ_i‖²_H. Connection to manifold learning...
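A minimal kernel PCA sketch under assumptions not specified on the slide (a Gaussian kernel and centering in feature space): the representation of each training point is read off the top eigenvectors of the centered kernel matrix.

    import numpy as np

    def kernel_pca(K, k):
        # Given an n x n kernel matrix K, return the n x k matrix of representations.
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n      # centering in feature space
        Kc = H @ K @ H
        eigvals, eigvecs = np.linalg.eigh(Kc)    # ascending order
        eigvals, eigvecs = eigvals[::-1][:k], eigvecs[:, ::-1][:, :k]
        # Scale so that columns are the feature-space projections.
        return eigvecs * np.sqrt(np.maximum(eigvals, 0))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # Gaussian kernel
    Z = kernel_pca(K, k=2)
    print(Z.shape)   # (50, 2)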

  25. Example 2: Sparse coding One of the first and most famous dictionary learning techniques. It corresponds to ◮ F = R^p, p ≥ d, ◮ F_λ = {β ∈ F : ‖β‖_1 ≤ λ}, λ > 0, ◮ D = {Ψ : F → X | ‖Ψe_j‖ ≤ 1}. Hence, min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{β_i∈F_λ} ‖x_i − Ψβ_i‖², where the outer problem is dictionary learning and the inner one is sparse representation.

  26. Sparse coding (cont.) min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{β_i∈R^p, ‖β_i‖_1≤λ} ‖x_i − Ψβ_i‖² ◮ The problem is not convex... but it is separately convex in the β_i's and in Ψ. ◮ An alternating minimization is fairly natural (other approaches are possible, see e.g. [Schnass '15, Elad et al. '06]).

  27. Representation computation Given a dictionary, the problems min_{β∈F_λ} ‖x_i − Ψβ‖², i = 1, ..., n, are convex and correspond to sparse representation problems. They can be solved using convex optimization techniques. Splitting/proximal methods: β^{t+1} = T_{γ,λ}(β^t − γΨ*(Ψβ^t − x_i)), given β^0, for t = 0, ..., T_max, with T_{γ,λ} the soft-thresholding operator.
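A sketch of the iteration above for a single input x, written for the penalized form min_β ½‖x − Ψβ‖² + λ‖β‖_1 rather than the constrained one; the step size and the fixed number of iterations are assumptions made here.

    import numpy as np

    def soft_threshold(v, tau):
        # Soft-thresholding operator, applied component-wise.
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def sparse_code(x, Psi, lam, T_max=500):
        # Iterative soft-thresholding for one data point x, dictionary Psi (d x p).
        beta = np.zeros(Psi.shape[1])
        gamma = 1.0 / np.linalg.norm(Psi, 2) ** 2        # step size <= 1 / ||Psi||^2
        for _ in range(T_max):
            grad = Psi.T @ (Psi @ beta - x)              # gradient of the quadratic term
            beta = soft_threshold(beta - gamma * grad, gamma * lam)
        return beta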

  28. Dictionary computation Given Φ(x_i) = β_i, i = 1, ..., n, we have min_{Ψ∈D} (1/n) Σ_{i=1}^n ‖x_i − Ψ∘Φ(x_i)‖² = min_{Ψ∈D} (1/n) ‖X̂ − BΨ*‖²_F, where X̂ is the n × d data matrix, B is the n × p matrix with rows β_i, i = 1, ..., n, and ‖·‖_F denotes the Frobenius norm. It is a convex problem, solvable via standard techniques. Splitting/proximal methods: Ψ^{t+1} = P(Ψ^t − γ_t (Ψ^t B* − X̂*) B), given Ψ^0, for t = 0, ..., T_max, where P is the column-wise projection corresponding to the constraints: P(Ψ^j) = Ψ^j / ‖Ψ^j‖ if ‖Ψ^j‖ > 1, and P(Ψ^j) = Ψ^j if ‖Ψ^j‖ ≤ 1, with Ψ^j the j-th column of Ψ.
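A matching sketch of the dictionary step: a gradient step on the Frobenius objective followed by the column-wise projection P; the step size and iteration count are assumptions. Alternating this update with the sparse coding step of the previous slide gives a basic alternating-minimization loop.

    import numpy as np

    def project_columns(Psi):
        # The map P: rescale any column with norm larger than 1 back to the unit sphere.
        norms = np.linalg.norm(Psi, axis=0)
        return Psi / np.maximum(norms, 1.0)

    def dictionary_update(X, B, Psi, steps=100):
        # X: n x d data matrix, B: n x p matrix of codes (rows beta_i), Psi: d x p dictionary.
        gamma = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)    # conservative step size
        for _ in range(steps):
            grad = (Psi @ B.T - X.T) @ B        # gradient of (1/2)||X - B Psi^*||_F^2 in Psi
            Psi = project_columns(Psi - gamma * grad)
        return Psi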

  29. Sparse coding model ◮ Sparse coding assumes the support of the data distribution to be a union of subspaces, i.e. all (p choose s) possible s-dimensional subspaces in R^p, where s is the sparsity level. ◮ More general penalties, more general geometric assumptions.

  30. Example 3: K-means & vector quantization K-means is typically seen as a clustering algorithm in machine learning... but it is also a classical vector quantization approach. Here we revisit this point of view from a data representation perspective. K-means corresponds to ◮ F_λ = F_k = {e_1, ..., e_k}, the canonical basis in R^k, k ≤ n, ◮ D = {Ψ : F → X | linear}.

  31. K-means computation min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{β_i∈{e_1,...,e_k}} ‖x_i − Ψβ_i‖². The K-means problem is not convex. Alternating minimization: 1. Initialize the dictionary Ψ^0. 2. Let Φ(x_i) = β_i, i = 1, ..., n, be the solutions of the problems min_{β∈{e_1,...,e_k}} ‖x_i − Ψβ‖², i = 1, ..., n, and set V_j = {x ∈ S | Φ(x) = e_j} (multiple points have the same representation since k ≤ n). 3. Letting a_j = Ψe_j, we can write min_{Ψ∈D} (1/n) Σ_{i=1}^n ‖x_i − Ψ∘Φ(x_i)‖² = min_{a_1,...,a_k∈R^d} (1/n) Σ_{j=1}^k Σ_{x∈V_j} ‖x − a_j‖².
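A compact sketch of the alternating minimization above (Lloyd's algorithm); initializing from randomly chosen data points and keeping empty clusters unchanged are choices made here, not prescribed by the slide.

    import numpy as np

    def kmeans(X, k, iters=50, seed=0):
        # X: n x d data matrix; returns the d x k dictionary of centers a_1, ..., a_k.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        A = X[rng.choice(n, size=k, replace=False)].T        # d x k initial dictionary
        for _ in range(iters):
            # Representation step: Phi(x_i) = e_j with j the index of the closest atom.
            dists = np.sum((X[:, None, :] - A.T[None, :, :]) ** 2, axis=2)   # n x k
            labels = np.argmin(dists, axis=1)
            # Dictionary step: a_j = mean of the points in V_j.
            for j in range(k):
                if np.any(labels == j):
                    A[:, j] = X[labels == j].mean(axis=0)
        return A, labels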
