Part 3: Latent representations and unsupervised learning


1. Part 3: Latent representations and unsupervised learning. Dale Schuurmans, University of Alberta.

2. Supervised versus unsupervised learning. Prominent training principles: generative versus discriminative. [Diagram: two graphical models relating the label y and the data x, one per principle.] Discriminative training (predicting y from x) is typical for supervised learning, while generative training (modelling how the data are produced) is typical for unsupervised learning.

3. Unsupervised representation learning. Consider generative training: a latent representation φ is used to generate the observed data x. [Diagram: φ → x.]

4. Unsupervised representation learning. Examples:
• dimensionality reduction (PCA, exponential family PCA)
• sparse coding
• independent component analysis
• deep learning
• ...
Usually involves learning both a latent representation for the data and a data reconstruction model. The context could be unsupervised, semi-supervised, or supervised.

5. Challenge. Optimal feature discovery appears to be generally intractable. One has to jointly train:
• the latent representation
• the data reconstruction model
Usually one resorts to alternating minimization (sole exception: PCA).

  6. First consider unsupervised feature discovery

7. Unsupervised feature discovery. Single layer case = matrix factorization:
X ≈ B Φ
where X (n × t) is the original data, B (n × m) is the learned dictionary, and Φ (m × t) is the new representation; t = # training examples, n = # original features, m = # new features.
Choose B and Φ to minimize the data reconstruction loss
L(BΦ; X) = Σ_{i=1}^{t} L(BΦ_{:i}; X_{:i})
Seek desired structure in the latent feature representation:
• Φ low rank: dimensionality reduction
• Φ sparse: sparse coding
• Φ rows independent: independent component analysis
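To make the factorization setup concrete, here is a minimal numpy sketch under squared loss; the dimensions and the synthetic data are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Shapes follow the slide's convention: columns of X are training examples.
n, m, t = 20, 5, 100                          # original features, new features, examples
rng = np.random.default_rng(0)

B = rng.normal(size=(n, m))                   # learned dictionary (n x m)
Phi = rng.normal(size=(m, t))                 # latent representation (m x t)
X = B @ Phi + 0.1 * rng.normal(size=(n, t))   # synthetic data with X ≈ B Phi

def reconstruction_loss(X_hat, X):
    """L(X_hat; X) = sum_i L(X_hat[:, i]; X[:, i]) with per-column squared loss."""
    return sum(np.sum((X_hat[:, i] - X[:, i]) ** 2) for i in range(X.shape[1]))

print(reconstruction_loss(B @ Phi, X))        # loss of the factorization on this data
```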

8. Generalized matrix factorization. Assume the reconstruction loss L(x̂; x) is convex in its first argument.
Bregman divergence: L(x̂; x) = D_F(x̂ ‖ x) = D_{F*}(f(x) ‖ f(x̂)), where F is a strictly convex potential with transfer f = ∇F. Tries to make x̂ ≈ x.
Matching loss: L(x̂; x) = D_F(x̂ ‖ f^{-1}(x)) = D_{F*}(x ‖ f(x̂)). Tries to make f(x̂) ≈ x. (A nonlinear predictor, but the loss is still convex in x̂.)
Regular exponential family: L(x̂; x) = −log p_B(x | φ) = D_F(x̂ ‖ f^{-1}(x)) − F*(x) − const.
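As a quick sanity check on these definitions (an illustrative sketch, not from the slides), the Bregman divergence D_F(x̂ ‖ x) = F(x̂) − F(x) − ⟨∇F(x), x̂ − x⟩ reproduces squared loss for F(z) = ½‖z‖² and the unnormalized KL divergence for the negative-entropy potential.

```python
import numpy as np

def bregman(F, grad_F, x_hat, x):
    """D_F(x_hat || x) = F(x_hat) - F(x) - <grad F(x), x_hat - x>."""
    return F(x_hat) - F(x) - grad_F(x) @ (x_hat - x)

x_hat = np.array([0.2, 0.5, 0.3])
x = np.array([0.1, 0.6, 0.3])

# F(z) = 0.5 ||z||^2  ->  D_F(x_hat || x) = 0.5 ||x_hat - x||^2 (squared loss)
sq = bregman(lambda z: 0.5 * z @ z, lambda z: z, x_hat, x)
print(np.isclose(sq, 0.5 * np.sum((x_hat - x) ** 2)))                   # True

# F(z) = sum_i z_i log z_i (negative entropy)  ->  unnormalized KL divergence
neg_entropy = lambda z: np.sum(z * np.log(z))
grad_neg_entropy = lambda z: np.log(z) + 1.0
kl = bregman(neg_entropy, grad_neg_entropy, x_hat, x)
print(np.isclose(kl, np.sum(x_hat * np.log(x_hat / x) - x_hat + x)))    # True
```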

9. Training problem:
min_{B ∈ R^{n×m}} min_{Φ ∈ R^{m×t}} L(BΦ; X)
How to impose desired structure on Φ?

10. Training problem:
min_{B ∈ R^{n×m}} min_{Φ ∈ R^{m×t}} L(BΦ; X)
How to impose desired structure on Φ?
Dimensionality reduction: fix the number of features m < min(n, t).
• But only known to be tractable if L(X̂; X) = ‖X̂ − X‖²_F (PCA)
• No known efficient algorithm for other standard losses
Problem: the rank(Φ) = m constraint is too hard.
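The tractable PCA case can be written in a few lines via the truncated SVD (Eckart–Young); this sketch skips the usual mean-centering step for brevity.

```python
import numpy as np

# With Frobenius (squared) loss, the optimal rank-m factorization X ≈ B Phi
# is given by the truncated SVD (Eckart-Young theorem).
n, t, m = 20, 100, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, t)) + 0.01 * rng.normal(size=(n, t))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
B = U[:, :m]                          # dictionary: top m left singular vectors
Phi = np.diag(s[:m]) @ Vt[:m]         # rank-m latent representation
print(np.linalg.norm(B @ Phi - X, 'fro') ** 2)    # minimal rank-m squared loss
```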

11. Training problem:
min_{B ∈ B_2^m} min_{Φ ∈ R^{m×t}} L(BΦ; X) + α ‖Φ‖_{2,1}
How to impose desired structure on Φ?
Relaxed dimensionality reduction (subspace learning): add the rank-reducing regularizer
‖Φ‖_{2,1} = Σ_{j=1}^{m} ‖Φ_{j:}‖_2
which favors null rows in Φ. But a constraint must be added on B: each column B_{:j} ∈ B_2 = {b : ‖b‖_2 ≤ 1} (otherwise Φ can be made small just by making B large).
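The two ingredients of this relaxation, the row-wise (2,1) norm and the unit-ball constraint on the columns of B, are easy to state in code; the helper names below are illustrative.

```python
import numpy as np

def block_norm_2_1(Phi):
    """||Phi||_{2,1}: sum of Euclidean norms of the rows of Phi.
    Null rows cost nothing, so minimizing it favors using few latent features."""
    return np.sum(np.linalg.norm(Phi, axis=1))

def project_columns_unit_ball(B):
    """Rescale any column with ||B[:, j]||_2 > 1 back onto the unit 2-norm ball."""
    return B / np.maximum(np.linalg.norm(B, axis=0), 1.0)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 100))
Phi[3:] = 0.0                         # two latent features left unused
print(block_norm_2_1(Phi))            # only the three nonzero rows contribute
```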

12. Training problem:
min_{B ∈ B_q^m} min_{Φ ∈ R^{m×t}} L(BΦ; X) + α ‖Φ‖_{1,1}
How to impose desired structure on Φ?
Sparse coding: use the sparsity-inducing regularizer
‖Φ‖_{1,1} = Σ_{j=1}^{m} Σ_{i=1}^{t} |Φ_{ji}|
which favors sparse entries in Φ. Again a constraint must be added on B: each column B_{:j} ∈ B_q = {b : ‖b‖_q ≤ 1} (otherwise Φ can be made small just by making B large).
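For fixed B and squared loss, the Φ subproblem is a standard l1-regularized least-squares problem; one common solver is iterative soft thresholding (ISTA). A minimal sketch, where the names and the choice of solver are assumptions rather than part of the slides:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Proximal operator of tau * ||.||_{1,1}: entrywise shrinkage toward zero."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def sparse_code(B, X, alpha, n_iter=200):
    """ISTA for  min_Phi 0.5 ||B Phi - X||_F^2 + alpha ||Phi||_{1,1}  with B fixed.
    Step size 1 / ||B||_2^2 matches the Lipschitz constant of the smooth term."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2
    Phi = np.zeros((B.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = B.T @ (B @ Phi - X)        # gradient of the squared reconstruction loss
        Phi = soft_threshold(Phi - step * grad, step * alpha)
    return Phi
```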

13. Training problem:
min_{B ∈ R^{n×m}} min_{Φ ∈ R^{m×t}} L(BΦ; X) + α D(Φ)
How to impose desired structure on Φ?
Independent component analysis: usually enforces BΦ = X as a constraint,
• but interpolation is generally a bad idea;
• instead, just minimize the reconstruction loss plus a dependence measure D(Φ) as a regularizer.
Difficulty: formulating a reasonable convex dependence penalty.

14. Training problem. Consider subspace learning and sparse coding:
min_{B ∈ B^m} min_{Φ ∈ R^{m×t}} L(BΦ; X) + α ‖Φ‖
The choice of ‖Φ‖ and B determines the type of representation recovered.

15. Training problem. Consider subspace learning and sparse coding:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖
The choice of ‖Φ‖ and B determines the type of representation recovered.
Problem: there is still a rank constraint imposed by the number of new features m.
Idea: just relax m → ∞
• Rely on the sparsity-inducing norm ‖Φ‖ to select features.

16. Training problem. Consider subspace learning and sparse coding:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖
Still have a problem: the optimization problem is not jointly convex in B and Φ.

17. Training problem. Consider subspace learning and sparse coding:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖
Still have a problem: the optimization problem is not jointly convex in B and Φ.
Idea 1: Alternate!
• convex in B given Φ
• convex in Φ given B
Any other form of local training could also be used.
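A minimal sketch of Idea 1 for the sparse-coding variant with squared loss; the solver choices and the projection heuristic for the B step are assumptions for illustration, not the tutorial's method.

```python
import numpy as np

def soft_threshold(Z, tau):
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def alternating_min(X, m, alpha, n_rounds=20, n_inner=100):
    """Alternate between the two convex subproblems of
       min_{B: ||B[:,j]||_2 <= 1}  min_Phi  0.5 ||B Phi - X||_F^2 + alpha ||Phi||_{1,1}.
    The joint problem is not convex, so this only reaches a local solution,
    which is precisely the limitation of Idea 1."""
    rng = np.random.default_rng(0)
    n, t = X.shape
    B = rng.normal(size=(n, m))
    B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    Phi = np.zeros((m, t))
    for _ in range(n_rounds):
        # Phi step: ISTA iterations for the convex sparse-coding subproblem.
        step = 1.0 / max(np.linalg.norm(B, 2) ** 2, 1e-12)
        for _ in range(n_inner):
            Phi = soft_threshold(Phi - step * (B.T @ (B @ Phi - X)), step * alpha)
        # B step: unconstrained least squares, then rescale columns onto the
        # unit ball (a simple heuristic for the norm-constrained dictionary update).
        B = X @ np.linalg.pinv(Phi)
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    return B, Phi
```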

18. Training problem. Consider subspace learning and sparse coding:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖
Still have a problem: the optimization problem is not jointly convex in B and Φ.
Idea 2: Boost!
• Implicitly fix B to a universal dictionary
• Keep a row-wise sparse Φ
• Incrementally select a column of B (the "weak learning problem")
• Update the sparse Φ
Convergence can be proved under broad conditions.
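Below is a rough, heavily simplified sketch of the boosting idea for squared loss and the unit 2-norm ball: the "weak learning" step here proposes the leading left singular vector of the current residual, and the Φ update is a crude shrinkage of a least-squares refit. Both choices are illustrative assumptions, not the tutorial's actual algorithm or its convergence guarantee.

```python
import numpy as np

def boost_dictionary(X, alpha, n_rounds=10):
    """Greedy sketch of Idea 2: start with an empty dictionary, add one column
    per round via a weak-learning step, then refit a sparse representation."""
    n, t = X.shape
    B = np.zeros((n, 0))
    Phi = np.zeros((0, t))
    for _ in range(n_rounds):
        residual = X - B @ Phi
        # Weak learning step (illustrative): the unit 2-norm vector most aligned
        # with the residual is its leading left singular vector.
        U, _, _ = np.linalg.svd(residual, full_matrices=False)
        B = np.hstack([B, U[:, :1]])
        # Refit Phi over the selected columns, then shrink entries toward zero
        # as a stand-in for the sparsity-inducing norm.
        Phi = np.linalg.lstsq(B, X, rcond=None)[0]
        Phi = np.sign(Phi) * np.maximum(np.abs(Phi) - alpha, 0.0)
    return B, Phi
```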

19. Training problem. Consider subspace learning and sparse coding:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖
Still have a problem? The optimization problem is not jointly convex in B and Φ.
Idea 3: Solve!
• The globally optimal joint B and Φ can in fact be found easily
• But this requires a significant reformulation

  20. A useful observation

21. Equivalent reformulation. Theorem:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖_{p,1}
  = min_{X̂ ∈ R^{n×t}} L(X̂; X) + α |||X̂|||
• |||·||| is an induced matrix norm on X̂, determined by B and ‖·‖_{p,1}
Important fact: norms are always convex.
Computational strategy:
1. Solve for the optimal response matrix X̂ first (convex minimization)
2. Then recover the optimal B and Φ from X̂

22. Example: subspace learning
min_{B ∈ B_2^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖_{2,1}
  = min_{X̂ ∈ R^{n×t}} L(X̂; X) + α ‖X̂‖_tr
Recovery:
• Let U Σ V' = svd(X̂)
• Set B = U and Φ = Σ V'
Preserves optimality:
• ‖B_{:j}‖_2 = 1, hence B ∈ B_2^n
• ‖Φ‖_{2,1} = ‖Σ V'‖_{2,1} = Σ_j σ_j ‖V_{:j}‖_2 = Σ_j σ_j = ‖X̂‖_tr
Thus L(BΦ; X) + α ‖Φ‖_{2,1} = L(X̂; X) + α ‖X̂‖_tr
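A short numerical check of this recovery on a random X̂, verifying that B = U has unit-norm columns and that ‖Φ‖_{2,1} equals the trace norm of X̂:

```python
import numpy as np

rng = np.random.default_rng(0)
X_hat = rng.normal(size=(6, 10))

U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
B, Phi = U, np.diag(s) @ Vt                    # recovery from the slide: B = U, Phi = Sigma V'

trace_norm = np.sum(s)                                  # ||X_hat||_tr
block_norm = np.sum(np.linalg.norm(Phi, axis=1))        # ||Phi||_{2,1}

print(np.allclose(B @ Phi, X_hat))                      # the factorization reproduces X_hat
print(np.allclose(np.linalg.norm(B, axis=0), 1.0))      # columns of B lie on the unit 2-ball
print(np.isclose(block_norm, trace_norm))               # ||Phi||_{2,1} = ||X_hat||_tr
```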

23. Example: sparse coding
min_{B ∈ B_q^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖_{1,1}
  = min_{X̂ ∈ R^{n×t}} L(X̂; X) + α ‖X̂'‖_{q,1}
Recovery:
• B = [ X̂_{:1}/‖X̂_{:1}‖_q , ..., X̂_{:t}/‖X̂_{:t}‖_q ]  (rescaled columns)
• Φ = diag( ‖X̂_{:1}‖_q , ..., ‖X̂_{:t}‖_q )  (diagonal matrix)
Preserves optimality:
• ‖B_{:j}‖_q = 1, hence B ∈ B_q^t
• ‖Φ‖_{1,1} = Σ_j ‖X̂_{:j}‖_q = ‖X̂'‖_{q,1}
Thus L(BΦ; X) + α ‖Φ‖_{1,1} = L(X̂; X) + α ‖X̂'‖_{q,1}
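The same kind of numerical check for the sparse-coding recovery, taking q = 2 for concreteness (and assuming no zero columns in X̂):

```python
import numpy as np

q = 2
rng = np.random.default_rng(0)
X_hat = rng.normal(size=(6, 10))

col_norms = np.linalg.norm(X_hat, ord=q, axis=0)   # ||X_hat[:, i]||_q for each example
B = X_hat / col_norms                              # rescaled columns, each on the unit q-ball
Phi = np.diag(col_norms)                           # diagonal latent representation

print(np.allclose(B @ Phi, X_hat))                            # reproduces X_hat
print(np.allclose(np.linalg.norm(B, ord=q, axis=0), 1.0))     # B is in B_q^t
print(np.isclose(np.sum(np.abs(Phi)), np.sum(col_norms)))     # ||Phi||_{1,1} = ||X_hat'||_{q,1}
```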

24. Example: sparse coding. Outcome: sparse coding with ‖·‖_{1,1} regularization = vector quantization:
• drops some examples
• memorizes the remaining examples
The optimal solution is not overcomplete. These observations could not have been made using local solvers.

25. Simple extensions
• Missing observations in X
• Robustness to outliers in X
min_{X̂ ∈ R^{n×t}} min_{S ∈ R^{n×t}} L((X̂ + S)_Ω; X_Ω) + α |||X̂||| + β ‖S‖_{1,1}
where Ω = observed indices in X and S = speckled outlier noise.
(Jointly convex in X̂ and S.)
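A proximal-gradient sketch of this jointly convex extension, assuming squared loss on the observed entries and the trace norm as |||·||| (both choices are assumptions made for illustration):

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding: proximal operator of tau * (trace norm)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(Z, tau):
    """Entrywise soft thresholding: proximal operator of tau * ||.||_{1,1}."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def robust_complete(X, mask, alpha, beta, n_iter=500, step=0.5):
    """Proximal gradient for
       min_{X_hat, S} 0.5 ||mask * (X_hat + S - X)||_F^2 + alpha ||X_hat||_tr + beta ||S||_{1,1}
    where mask is 1 on the observed index set Omega and 0 elsewhere.
    The smooth term has Lipschitz constant 2 over the joint variable, so step = 0.5 is safe."""
    X_hat = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        R = mask * (X_hat + S - X)     # gradient of the smooth term w.r.t. both blocks
        X_hat = svt(X_hat - step * R, step * alpha)
        S = soft(S - step * R, step * beta)
    return X_hat, S
```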

26. Explaining the useful result. Theorem:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖_{p,1}
  = min_{X̂ ∈ R^{n×t}} L(X̂; X) + α |||X̂|||
where |||X̂||| = ‖X̂'‖^*_{(B, p*)}, the dual of an induced matrix norm.

27. Explaining the useful result. Theorem:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖_{p,1}
  = min_{X̂ ∈ R^{n×t}} L(X̂; X) + α |||X̂|||
where |||X̂||| = ‖X̂'‖^*_{(B, p*)}, the dual of an induced matrix norm.
A dual norm:
‖X̂'‖^*_{(B, p*)} = max_{‖Λ'‖_{(B, p*)} ≤ 1} tr(Λ' X̂)
(standard definition of a dual norm)

28. Explaining the useful result. Theorem:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖_{p,1}
  = min_{X̂ ∈ R^{n×t}} L(X̂; X) + α |||X̂|||
where |||X̂||| = ‖X̂'‖^*_{(B, p*)}, the dual of an induced matrix norm.
A dual norm:
‖X̂'‖^*_{(B, p*)} = max_{‖Λ'‖_{(B, p*)} ≤ 1} tr(Λ' X̂)
(standard definition of a dual norm)
of a vector-norm induced matrix norm:
‖Λ'‖_{(B, p*)} = max_{b ∈ B} ‖Λ' b‖_{p*}
(easy to prove this yields a norm on matrices)
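To connect this machinery to the earlier examples: for the subspace-learning case (B = B_2 and p = 2, so p* = 2), the induced norm max_{‖b‖_2 ≤ 1} ‖Λ'b‖_2 is the spectral norm, whose dual is the trace norm, matching slide 22. The small sampling check below is illustrative; random unit vectors only give a lower bound that approaches the exact spectral norm.

```python
import numpy as np

rng = np.random.default_rng(0)
Lam = rng.normal(size=(6, 10))            # a test matrix Lambda with the shape of X_hat

# Induced norm for B = unit 2-ball and p* = 2:  max_{||b||_2 <= 1} ||Lam' b||_2.
# For this choice it equals the largest singular value (spectral norm) of Lam'.
b = rng.normal(size=(Lam.shape[0], 20000))
b /= np.linalg.norm(b, axis=0)            # random directions on the unit sphere
sampled = np.max(np.linalg.norm(Lam.T @ b, axis=0))

print(sampled, np.linalg.norm(Lam.T, 2))  # sampled lower bound vs exact spectral norm
```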

29. Proof outline:
min_{B ∈ B^∞} min_{Φ ∈ R^{∞×t}} L(BΦ; X) + α ‖Φ‖_{p,1}
  = min_{X̂ ∈ R^{n×t}} min_{B ∈ B^∞} min_{Φ : BΦ = X̂} L(X̂; X) + α ‖Φ‖_{p,1}
  = min_{X̂ ∈ R^{n×t}} L(X̂; X) + α min_{B ∈ B^∞} min_{Φ : BΦ = X̂} ‖Φ‖_{p,1}
