SLIDE 1

Part 3: Latent representations and unsupervised learning

Dale Schuurmans University of Alberta

SLIDE 2

Supervised versus unsupervised learning

Prominent training principles

[Diagram] Discriminative: model $y$ from $x$ ($x \to y$), typical for supervised learning. Generative: model $x$ from $y$ ($y \to x$), typical for unsupervised learning.

SLIDE 3

Unsupervised representation learning

Consider generative training: a latent representation $\phi$ generates the observation $x$ ($\phi \to x$).

SLIDE 4

Unsupervised representation learning

Examples

  • dimensionality reduction (PCA, exponential family PCA)
  • sparse coding
  • independent component analysis
  • deep learning
  • …

Usually involves learning both a latent representation for the data and a data reconstruction model.

Context could be: unsupervised, semi-supervised, or supervised.

SLIDE 5

Challenge

Optimal feature discovery appears to be generally intractable

Have to jointly train

  • latent representation
  • data reconstruction model

Usually resort to alternating minimization

(sole exception: PCA)

SLIDE 6

First consider unsupervised feature discovery

SLIDE 7

Unsupervised feature discovery

Single layer case = matrix factorization

$$X \approx B\,\Phi$$

where $X$ is the original data ($n \times t$), $B$ is the learned dictionary ($n \times m$), and $\Phi$ is the new representation ($m \times t$), with $t$ = # training examples, $n$ = # original features, $m$ = # new features.

Choose $B$ and $\Phi$ to minimize the data reconstruction loss

$$L(B\Phi; X) = \sum_{i=1}^{t} L(B\Phi_{:i}; X_{:i})$$

Seek desired structure in the latent feature representation:

  • $\Phi$ low rank : dimensionality reduction
  • $\Phi$ sparse : sparse coding
  • $\Phi$ rows independent : independent component analysis
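
To make the bookkeeping concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) of the factorization shapes and a squared-loss instance of the column-wise reconstruction loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 20, 5, 100                 # original features, new features, training examples

X = rng.standard_normal((n, t))      # original data, n x t
B = rng.standard_normal((n, m))      # learned dictionary, n x m
Phi = rng.standard_normal((m, t))    # new representation, m x t

# Squared reconstruction loss, summed column-wise over training examples:
# L(B Phi; X) = sum_i L(B Phi_{:i}; X_{:i})
loss = 0.5 * np.sum((B @ Phi - X) ** 2)
print(loss)
```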

SLIDE 8

Generalized matrix factorization

Assume the reconstruction loss $L(\hat x; x)$ is convex in its first argument.

Bregman divergence

$$L(\hat x; x) = D_F(\hat x \,\|\, x) = D_{F^*}\!\big(f(x) \,\|\, f(\hat x)\big)$$

($F$ a strictly convex potential with transfer $f = \nabla F$.) Tries to make $\hat x \approx x$.

Matching loss

$$L(\hat x; x) = D_F\!\big(\hat x \,\|\, f^{-1}(x)\big) = D_{F^*}\!\big(x \,\|\, f(\hat x)\big)$$

Tries to make $f(\hat x) \approx x$. (A nonlinear predictor, but the loss is still convex in $\hat x$.)

Regular exponential family

$$L(\hat x; x) = -\log p_B(x \mid \phi) = D_F\!\big(\hat x \,\|\, f^{-1}(x)\big) - F^*(x) - \text{const}$$
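
As a concrete illustration (my own sketch, using standard potentials), the Bregman divergence recovers squared loss for $F(u) = \tfrac12\|u\|^2$ and generalized KL for $F(u) = \sum_i u_i \log u_i$:

```python
import numpy as np

def bregman(F, grad_F, x_hat, x):
    """Bregman divergence D_F(x_hat || x) = F(x_hat) - F(x) - <grad F(x), x_hat - x>."""
    return F(x_hat) - F(x) - grad_F(x) @ (x_hat - x)

x_hat = np.array([0.2, 0.5, 0.3])
x = np.array([0.3, 0.4, 0.3])

# F(u) = 0.5 ||u||^2 gives squared loss: D_F(x_hat || x) = 0.5 ||x_hat - x||^2
sq = bregman(lambda u: 0.5 * u @ u, lambda u: u, x_hat, x)

# F(u) = sum u log u gives the generalized KL divergence
kl = bregman(lambda u: np.sum(u * np.log(u)),
             lambda u: np.log(u) + 1.0, x_hat, x)
print(sq, kl)
```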

SLIDE 9

Training problem

$$\min_{B \in \mathbb{R}^{n \times m}} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X)$$

How to impose desired structure on $\Phi$?

SLIDE 10

Training problem

$$\min_{B \in \mathbb{R}^{n \times m}} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X)$$

How to impose desired structure on $\Phi$? Dimensionality reduction

Fix # features $m < \min(n, t)$.

  • But only known to be tractable if $L(\hat X; X) = \|\hat X - X\|_F^2$ (PCA)
  • No known efficient algorithm for other standard losses

Problem

The $\mathrm{rank}(\Phi) = m$ constraint is too hard.

SLIDE 11

Training problem

$$\min_{B \in \mathcal{B}_2^m} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{2,1}$$

How to impose desired structure on $\Phi$? Relaxed dimensionality reduction (subspace learning)

Add the rank-reducing regularizer $\|\Phi\|_{2,1} = \sum_{j=1}^{m} \|\Phi_{j:}\|_2$, which favors null rows in $\Phi$ (see the sketch below). But need to add a constraint to $B$: $B_{:j} \in \mathcal{B}_2 = \{b : \|b\|_2 \le 1\}$ (otherwise $\Phi$ can be made small just by making $B$ large).
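
A small numerical illustration (mine, not from the slides) of why $\|\cdot\|_{2,1}$ is rank-reducing: among matrices with the same elementwise $\|\cdot\|_{1,1}$, it prefers the one whose nonzeros concentrate in few rows:

```python
import numpy as np

def norm_21(Phi):
    # ||Phi||_{2,1} = sum of Euclidean norms of the rows -> favors null rows
    return np.sum(np.linalg.norm(Phi, axis=1))

def norm_11(Phi):
    # ||Phi||_{1,1} = sum of absolute entries -> favors sparse entries
    return np.sum(np.abs(Phi))

Phi_row_sparse = np.array([[1.0, 2.0, 2.0],
                           [0.0, 0.0, 0.0]])    # one null row (low rank)
Phi_entry_sparse = np.array([[1.0, 0.0, 2.0],
                             [0.0, 2.0, 0.0]])  # scattered nonzeros

print(norm_11(Phi_row_sparse), norm_11(Phi_entry_sparse))  # 5.0  5.0   (tied)
print(norm_21(Phi_row_sparse), norm_21(Phi_entry_sparse))  # 3.0  ~4.24 (row-sparse wins)
```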

SLIDE 12

Training problem

$$\min_{B \in \mathcal{B}_q^m} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{1,1}$$

How to impose desired structure on $\Phi$? Sparse coding

Use the sparsity-inducing regularizer $\|\Phi\|_{1,1} = \sum_{j=1}^{m} \sum_{i=1}^{t} |\Phi_{ji}|$, which favors sparse entries in $\Phi$. Need to add a constraint to $B$: $B_{:j} \in \mathcal{B}_q = \{b : \|b\|_q \le 1\}$ (otherwise $\Phi$ can be made small just by making $B$ large).

SLIDE 13

Training problem

$$\min_{B \in \mathbb{R}^{n \times m}} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha D(\Phi)$$

How to impose desired structure on $\Phi$? Independent component analysis

Usually enforces $B\Phi = X$ as a constraint,

  • but interpolation is generally a bad idea
  • instead, just minimize the reconstruction loss plus a dependence measure $D(\Phi)$ as a regularizer

Difficulty

Formulating a reasonable convex dependence penalty.

SLIDE 14

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^m} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

The choice of norm $\|\Phi\|$ and constraint set $\mathcal{B}$ determines the type of representation recovered.

SLIDE 15

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

The choice of norm $\|\Phi\|$ and constraint set $\mathcal{B}$ determines the type of representation recovered.

Problem

Still have the rank constraint imposed by the number of new features $m$.

Idea

Just relax $m \to \infty$:

  • rely on the sparsity-inducing norm $\|\Phi\|$ to select features

SLIDE 16

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

Still have a problem

The optimization problem is not jointly convex in $B$ and $\Phi$.

SLIDE 17

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

Still have a problem

The optimization problem is not jointly convex in $B$ and $\Phi$.

Idea 1: Alternate! (see the sketch below)

  • convex in $B$ given $\Phi$
  • convex in $\Phi$ given $B$

Could use any other form of local training.
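
A minimal alternating-minimization sketch (my own illustration, assuming squared loss, an elementwise $\ell_1$ penalty on $\Phi$, unit 2-norm column constraints on $B$, and simple proximal/projected gradient phases):

```python
import numpy as np

def soft_threshold(A, tau):
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def alternate(X, m, alpha=0.1, outer=50, inner=20):
    n, t = X.shape
    rng = np.random.default_rng(0)
    B, Phi = rng.standard_normal((n, m)), rng.standard_normal((m, t))
    for _ in range(outer):
        # Phi-phase: proximal gradient on 0.5||B Phi - X||_F^2 + alpha ||Phi||_{1,1}
        step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
        for _ in range(inner):
            Phi = soft_threshold(Phi - step * B.T @ (B @ Phi - X), step * alpha)
        # B-phase: projected gradient on 0.5||B Phi - X||_F^2, columns kept in the unit ball
        step = 1.0 / (np.linalg.norm(Phi, 2) ** 2 + 1e-12)
        for _ in range(inner):
            B -= step * (B @ Phi - X) @ Phi.T
            B /= np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1.0)
    return B, Phi

X = np.random.default_rng(1).standard_normal((20, 100))
B, Phi = alternate(X, m=10)
print(0.5 * np.sum((B @ Phi - X) ** 2) + 0.1 * np.abs(Phi).sum())
```

Each phase is a convex problem, but the joint objective is not, so this only reaches a local solution; that limitation is exactly what motivates Idea 3.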

SLIDE 18

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

Still have a problem

The optimization problem is not jointly convex in $B$ and $\Phi$.

Idea 2: Boost!

  • Implicitly fix $B$ to a universal dictionary
  • Keep a row-wise sparse $\Phi$
  • Incrementally select columns of $B$ (the "weak learning problem")
  • Update the sparse $\Phi$

Can prove convergence under broad conditions.

SLIDE 19

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

Still have a problem?

The optimization problem is not jointly convex in $B$ and $\Phi$.

Idea 3: Solve!

  • Can easily solve for the globally optimal joint $B$ and $\Phi$
  • But requires a significant reformulation

SLIDE 20

A useful observation

SLIDE 21

Equivalent reformulation

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \,|\!|\!|\hat X|\!|\!|$$

  • $|\!|\!|\cdot|\!|\!|$ is an induced matrix norm on $\hat X$ determined by $\mathcal{B}$ and $\|\cdot\|_{p,1}$

Important fact

Norms are always convex.

Computational strategy

  1. Solve for the optimal response matrix $\hat X$ first (convex minimization)
  2. Then recover the optimal $B$ and $\Phi$ from $\hat X$

SLIDE 22

Example: subspace learning

$$\min_{B \in \mathcal{B}_2^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{2,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \|\hat X\|_{tr}$$

Recovery

  • Let $U \Sigma V' = \mathrm{svd}(\hat X)$
  • Set $B = U$ and $\Phi = \Sigma V'$

Preserves optimality

  • $\|B_{:j}\|_2 = 1$, hence each column of $B$ lies in $\mathcal{B}_2$
  • $\|\Phi\|_{2,1} = \|\Sigma V'\|_{2,1} = \sum_j \sigma_j \|V_{:j}\|_2 = \sum_j \sigma_j = \|\hat X\|_{tr}$

Thus $L(\hat X; X) + \alpha \|\hat X\|_{tr} = L(B\Phi; X) + \alpha \|\Phi\|_{2,1}$. (See the sketch below.)
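
This recovery is easy to check numerically; a minimal sketch (mine) with a random stand-in for the optimal response matrix $\hat X$:

```python
import numpy as np

rng = np.random.default_rng(0)
X_hat = rng.standard_normal((8, 30))   # stand-in for an optimal response matrix

U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
B = U                      # dictionary: orthonormal columns, so ||B_{:j}||_2 = 1
Phi = np.diag(s) @ Vt      # representation: Sigma V'

# Row j of Phi is sigma_j * V_{:j}', so its 2-norm is sigma_j:
norm_21 = np.sum(np.linalg.norm(Phi, axis=1))
trace_norm = np.sum(s)
print(np.allclose(B @ Phi, X_hat))        # True: exact reconstruction
print(np.allclose(norm_21, trace_norm))   # True: ||Phi||_{2,1} == ||X_hat||_tr
```

The matching regularizer values are exactly the "preserves optimality" claim above.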

SLIDE 23

Example: sparse coding

$$\min_{B \in \mathcal{B}_q^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{1,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \|\hat X'\|_{q,1}$$

Recovery

  • $B = \left[ \frac{\hat X_{:1}}{\|\hat X_{:1}\|_q}, \ \ldots, \ \frac{\hat X_{:t}}{\|\hat X_{:t}\|_q} \right]$ (rescaled columns)
  • $\Phi = \mathrm{diag}\!\left( \|\hat X_{:1}\|_q, \ldots, \|\hat X_{:t}\|_q \right)$ (diagonal matrix)

Preserves optimality

  • $\|B_{:j}\|_q = 1$, hence each column of $B$ lies in $\mathcal{B}_q$
  • $\|\Phi\|_{1,1} = \sum_j \|\hat X_{:j}\|_q = \|\hat X'\|_{q,1}$

Thus $L(\hat X; X) + \alpha \|\hat X'\|_{q,1} = L(B\Phi; X) + \alpha \|\Phi\|_{1,1}$. (A sketch follows.)
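
The analogous check for the sparse-coding recovery (again my own sketch, with a random stand-in for $\hat X$ and $q = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)
q = 2
X_hat = rng.standard_normal((8, 30))

col_norms = np.linalg.norm(X_hat, ord=q, axis=0)
B = X_hat / col_norms          # rescaled columns: ||B_{:j}||_q = 1
Phi = np.diag(col_norms)       # diagonal representation

print(np.allclose(B @ Phi, X_hat))                        # True: exact reconstruction
print(np.allclose(np.abs(Phi).sum(), col_norms.sum()))    # ||Phi||_{1,1} == ||X_hat'||_{q,1}
```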

SLIDE 24

Example: sparse coding

Outcome

Sparse coding with $\|\cdot\|_{1,1}$ regularization = vector quantization:

  • drops some examples
  • memorizes the remaining examples

The optimal solution is not overcomplete.

Could not make these observations using local solvers.

SLIDE 25

Simple extensions

  • Missing observations in $X$
  • Robustness to outliers in $X$

$$\min_{S \in \mathbb{R}^{n \times t}} \ \min_{\hat X \in \mathbb{R}^{n \times t}} L\big( (\hat X + S)_\Omega \,;\, X_\Omega \big) + \alpha \,|\!|\!|\hat X|\!|\!| + \beta \|S\|_{1,1}$$

where $\Omega$ = observed indices in $X$ and $S$ = speckled outlier noise. (Jointly convex in $\hat X$ and $S$.)
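
A minimal sketch (my own illustration) of evaluating this objective, assuming squared loss and the trace norm as the induced norm $|\!|\!|\hat X|\!|\!|$:

```python
import numpy as np

def robust_masked_objective(X_hat, S, X, mask, alpha, beta):
    """Squared loss on observed entries + trace norm of X_hat + l1 penalty on S."""
    R = (X_hat + S - X) * mask                 # residual restricted to observed indices
    loss = 0.5 * np.sum(R ** 2)
    trace_norm = np.sum(np.linalg.svd(X_hat, compute_uv=False))
    return loss + alpha * trace_norm + beta * np.abs(S).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 40))
mask = rng.random((10, 40)) < 0.8              # ~80% of entries observed
print(robust_masked_objective(X.copy(), np.zeros_like(X), X, mask, 1e-3, 1e-2))
```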

SLIDE 26

Explaining the useful result

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \,|\!|\!|\hat X|\!|\!|$$

for an induced matrix norm $|\!|\!|\hat X|\!|\!| = \|\hat X'\|^*_{(\mathcal{B},\,p^*)}$.

SLIDE 27

Explaining the useful result

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \,|\!|\!|\hat X|\!|\!|$$

for an induced matrix norm $|\!|\!|\hat X|\!|\!| = \|\hat X'\|^*_{(\mathcal{B},\,p^*)}$.

A dual norm

$$\|\hat X'\|^*_{(\mathcal{B},\,p^*)} = \max_{\|\Lambda'\|_{(\mathcal{B},\,p^*)} \le 1} \mathrm{tr}(\Lambda' \hat X)$$

(standard definition of a dual norm)

SLIDE 28

Explaining the useful result

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \,|\!|\!|\hat X|\!|\!|$$

for an induced matrix norm $|\!|\!|\hat X|\!|\!| = \|\hat X'\|^*_{(\mathcal{B},\,p^*)}$.

A dual norm

$$\|\hat X'\|^*_{(\mathcal{B},\,p^*)} = \max_{\|\Lambda'\|_{(\mathcal{B},\,p^*)} \le 1} \mathrm{tr}(\Lambda' \hat X)$$

(standard definition of a dual norm)

For a vector-norm induced matrix norm

$$\|\Lambda'\|_{(\mathcal{B},\,p^*)} = \max_{b \in \mathcal{B}} \|\Lambda' b\|_{p^*}$$

(easy to prove this yields a norm on matrices)
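
A quick numerical sanity check (mine): for $\mathcal{B} = \mathcal{B}_2$ and $p^* = 2$ this induced norm is the spectral norm of $\Lambda'$, so random unit vectors should approach it from below:

```python
import numpy as np

rng = np.random.default_rng(0)
Lam = rng.standard_normal((8, 5))   # Lambda

# Induced norm ||Lambda'||_(B_2, 2) = max_{||b||_2 <= 1} ||Lambda' b||_2
best = 0.0
for _ in range(20000):
    b = rng.standard_normal(8)
    b /= np.linalg.norm(b)          # sample from the unit sphere in B_2
    best = max(best, np.linalg.norm(Lam.T @ b))

print(best, np.linalg.norm(Lam.T, 2))   # random-search lower bound vs sigma_max
```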

SLIDE 29

Proof outline

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1}$$
$$= \min_{\hat X \in \mathbb{R}^{n \times t}} \ \min_{B \in \mathcal{B}^\infty} \ \min_{\Phi : B\Phi = \hat X} L(\hat X; X) + \alpha \|\Phi\|_{p,1}$$
$$= \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \min_{B \in \mathcal{B}^\infty} \ \min_{\Phi : B\Phi = \hat X} \|\Phi\|_{p,1}$$

SLIDE 30

Proof outline

For any $B \in \mathcal{B}^\infty$ that spans the columns of $\hat X$:

$$\min_{\Phi : B\Phi = \hat X} \|\Phi\|_{p,1} = \min_{\Phi} \max_{\Lambda} \max_{\|V\|_{p^*,\infty} \le 1} \mathrm{tr}(V'\Phi) + \mathrm{tr}\big(\Lambda'(\hat X - B\Phi)\big)$$
$$= \max_{\|V\|_{p^*,\infty} \le 1} \max_{\Lambda} \min_{\Phi} \mathrm{tr}(\Lambda'\hat X) + \mathrm{tr}\big(\Phi'(V - B'\Lambda)\big)$$
$$= \max_{\|V\|_{p^*,\infty} \le 1} \ \max_{\Lambda : B'\Lambda = V} \mathrm{tr}(\Lambda'\hat X) = \max_{\Lambda : \|B'\Lambda\|_{p^*,\infty} \le 1} \mathrm{tr}(\Lambda'\hat X)$$

SLIDE 31

Proof outline

Substituting this back:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \min_{B \in \mathcal{B}^\infty} \ \max_{\Lambda : \|B'\Lambda\|_{p^*,\infty} \le 1} \mathrm{tr}(\Lambda'\hat X)$$


SLIDE 33

Proof outline

Minimizing over the dictionary tightens the dual constraint to hold for every $B$:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \max_{\Lambda : \|B'\Lambda\|_{p^*,\infty} \le 1 \ \forall B \in \mathcal{B}^\infty} \mathrm{tr}(\Lambda'\hat X)$$


SLIDE 35

Proof outline

Since $\mathcal{B}^\infty$ collects arbitrarily many columns from $\mathcal{B}$, the constraint reduces to individual columns:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \max_{\Lambda : \|b'\Lambda\|_{p^*} \le 1 \ \forall b \in \mathcal{B}} \mathrm{tr}(\Lambda'\hat X)$$


SLIDE 37

Proof outline

This is exactly the unit ball of the vector-norm induced matrix norm:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \max_{\Lambda : \|\Lambda'\|_{(\mathcal{B},\,p^*)} \le 1} \mathrm{tr}(\Lambda'\hat X)$$


SLIDE 39

Proof outline

Finally, by the definition of the dual norm:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \|\hat X'\|^*_{(\mathcal{B},\,p^*)} \qquad \square$$


SLIDE 41

Closed form induced norms

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \|\hat X'\|^*_{(\mathcal{B},\,p^*)}$$

Special cases

  • $\mathcal{B}_2$, $\|\Phi\|_{2,1}$ : $\|\hat X'\|^*_{(\mathcal{B}_2,\,2)} = \|\hat X\|_{tr}$ (subspace learning)
  • $\mathcal{B}_q$, $\|\Phi\|_{1,1}$ : $\|\hat X'\|^*_{(\mathcal{B}_q,\,\infty)} = \|\hat X'\|_{q,1}$ (sparse coding)
  • $\mathcal{B}_1$, $\|\Phi\|_{p,1}$ : $\|\hat X'\|^*_{(\mathcal{B}_1,\,p^*)} = \|\hat X\|_{p,1}$

SLIDE 42

Some simple experiments

SLIDE 43

Experimental results

  • Alternate: repeatedly optimize over $B$, $\Phi$ successively
  • Global: recover the global joint minimizer over $B$, $\Phi$

SLIDE 44

Experimental results: Sparse coding

Objective value achieved (×10⁻²)

data set    COIL   WBC    BCI    Ionos  G241N
Alternate   1.314  4.918  0.898  1.612  1.312
Global      0.207  0.659  0.306  0.330  0.207

(squared loss, q = 2, α = 10⁻⁵)

SLIDE 45

Experimental results: Sparse coding

Run time (seconds)

data set    COIL   WBC    BCI    Ionos  G241N
Alternate   1.95   10.54  0.88   1.71   2.37
Global      0.06   0.01   0.01   0.01   0.09

(squared loss, q = 2, α = 10⁻⁵)

SLIDE 46

Experimental results: Subspace learning

Objective value achieved (×10⁻²)

data set    COIL   WBC    BCI    Ionos  G241N
Alternate   1.314  4.957  0.903  1.632  1.313
Global      0.072  0.072  0.092  0.079  0.205

(squared loss, α = 10⁻⁵)

SLIDE 47

Experimental results: Subspace learning

Run time (seconds)

data set    COIL   WBC    BCI    Ionos  G241N
Alternate   2.40   9.31   1.12   0.47   2.43
Global      2.18   0.06   0.19   0.06   2.11

(squared loss, α = 10⁻⁵)

SLIDE 48

Catch

Every norm is convex, but not every induced matrix norm is tractable:

$$\|X\|_2 = \sigma_{\max}(X) \qquad \|X\|_1 = \max_j \sum_i |X_{ij}| \qquad \|X\|_\infty = \max_i \sum_j |X_{ij}|$$

$\|X\|_p$ is NP-hard to approximate for $p \notin \{1, 2, \infty\}$.

Question

Any other useful induced matrix norms that are tractable?

Yes!
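
The three tractable cases above are immediate to compute; a minimal NumPy sketch (my own illustration):

```python
import numpy as np

X = np.random.default_rng(0).standard_normal((6, 9))

norm_2 = np.linalg.norm(X, 2)           # sigma_max(X)
norm_1 = np.abs(X).sum(axis=0).max()    # max column sum: max_j sum_i |X_ij|
norm_inf = np.abs(X).sum(axis=1).max()  # max row sum: max_i sum_j |X_ij|
print(norm_2, norm_1, norm_inf)
```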

SLIDE 49

Semi-supervised feature discovery

SLIDE 50

Semi-supervised feature discovery

Block factorization (labeled and unlabeled inputs stacked with the outputs):

$$\begin{bmatrix} X_l & X_u \\ Y_l & \varnothing \end{bmatrix} \approx \begin{bmatrix} B \\ W \end{bmatrix} \begin{bmatrix} \Phi_l & \Phi_u \end{bmatrix} \qquad \big( (n+k) \times (t_l + t_u) \big)$$

where $t_l$ = # labeled, $t_u$ = # unlabeled, $n$ = # original features, $k$ = # output dimensions, $t = t_l + t_u$.

Learn

  • $\Phi = [\Phi_l, \Phi_u]$ : data representation
  • $B$ : input reconstruction model, $f(B\Phi) \approx X$
  • $W$ : output reconstruction model, $h(W\Phi_l) \approx Y_l$

SLIDE 51

Semi-supervised feature discovery

Let $Z = \begin{bmatrix} X_l & X_u \\ Y_l & \varnothing \end{bmatrix}$ and $U = \begin{bmatrix} B \\ W \end{bmatrix}$.

Formulation

$$\min_{B \in \mathcal{B}^\infty} \ \min_{W \in \mathcal{W}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L_u(B\Phi; X) + \beta L_s(W\Phi_l; Y_l) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat Z \in \mathbb{R}^{(n+k) \times t}} \tilde L(\hat Z; Z) + \alpha \|\hat Z'\|^*_{(\mathcal{U},\,p^*)}$$

Note

Imposing separate constraints on $B$ and $W$.

Questions

  • Is the induced norm $\|\hat Z'\|^*_{(\mathcal{U},\,p^*)}$ efficiently computable?
  • Can the optimal $B$, $W$, $\Phi$ be recovered from the optimal $\hat Z$?

SLIDE 52

Example: sparse coding formulation

Regularizer: $\|\Phi\|_{1,1}$. Constraints: $\mathcal{B}_{q_1} = \{b : \|b\|_{q_1} \le 1\}$, $\mathcal{W}_{q_2} = \{w : \|w\|_{q_2} \le \gamma\}$, $\mathcal{U}^{q_2}_{q_1} = \mathcal{B} \times \mathcal{W}$.

Theorem

$$\|\hat Z'\|^*_{(\mathcal{U}^{q_2}_{q_1},\,\infty)} = \sum_j \max\!\Big( \|\hat Z^X_{:j}\|_{q_1}, \ \tfrac{1}{\gamma} \|\hat Z^Y_{:j}\|_{q_2} \Big)$$

  • efficiently computable

Recovery

$$\Phi_{jj} = \max\!\Big( \|\hat Z^X_{:j}\|_{q_1}, \ \tfrac{1}{\gamma} \|\hat Z^Y_{:j}\|_{q_2} \Big) \ \text{(diagonal matrix)}, \qquad U = \hat Z \Phi^{-1}$$

Preserves optimality. But still reduces to a form of vector quantization. (A computational sketch follows.)
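
A minimal sketch (mine) of computing this induced norm and the diagonal recovery; the block sizes n, k and the constants γ, q₁, q₂ are illustrative assumptions:

```python
import numpy as np

def semisup_sparse_norm(Z_hat, n, gamma, q1=2, q2=2):
    """Induced norm: sum_j max(||Z^X_{:j}||_{q1}, (1/gamma) ||Z^Y_{:j}||_{q2})."""
    ZX, ZY = Z_hat[:n], Z_hat[n:]           # input block (n rows), output block (k rows)
    col = np.maximum(np.linalg.norm(ZX, ord=q1, axis=0),
                     np.linalg.norm(ZY, ord=q2, axis=0) / gamma)
    return col.sum(), col

rng = np.random.default_rng(0)
n, k, t, gamma = 10, 3, 25, 0.5
Z_hat = rng.standard_normal((n + k, t))

val, col = semisup_sparse_norm(Z_hat, n, gamma)
Phi = np.diag(col)                # recovery: diagonal Phi
U = Z_hat @ np.diag(1.0 / col)    # U = Z_hat Phi^{-1}
print(val, np.allclose(U @ Phi, Z_hat))
```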

SLIDE 53

Example: subspace learning formulation

Regularizer: $\|\Phi\|_{2,1}$. Constraints: $\mathcal{B}_2 = \{b : \|b\|_2 \le 1\}$, $\mathcal{W}_2 = \{w : \|w\|_2 \le \gamma\}$, $\mathcal{U}^2_2 = \mathcal{B} \times \mathcal{W}$.

Theorem

$$\|\hat Z'\|^*_{(\mathcal{U}^2_2,\,2)} = \max_{\rho \ge 0} \big\| D_\rho^{-1} \hat Z \big\|_{tr} \qquad \text{where } D_\rho = \begin{bmatrix} \sqrt{1+\gamma\rho}\; I^X & 0 \\ 0 & \sqrt{\tfrac{1+\gamma\rho}{\rho}}\; I^Y \end{bmatrix}$$

  • efficiently computable: quasi-concave in $\rho$ (see the sketch below)
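
A minimal sketch (my own illustration) of evaluating this norm by a one-dimensional search over ρ; it uses golden-section search on log ρ, relying on the quasi-concavity noted above, and the block sizes and γ are assumptions:

```python
import numpy as np

def norm_via_rho(Z_hat, n, gamma, iters=60):
    """max_{rho > 0} || D_rho^{-1} Z_hat ||_tr via golden-section search in log(rho)."""
    def objective(log_rho):
        rho = np.exp(log_rho)
        dX = 1.0 / np.sqrt(1.0 + gamma * rho)      # D_rho^{-1} scaling on the X block
        dY = np.sqrt(rho / (1.0 + gamma * rho))    # D_rho^{-1} scaling on the Y block
        D_inv_Z = np.vstack([dX * Z_hat[:n], dY * Z_hat[n:]])
        return np.linalg.svd(D_inv_Z, compute_uv=False).sum()   # trace norm

    lo, hi = np.log(1e-6), np.log(1e6)
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
    for _ in range(iters):                          # golden-section for a unimodal maximum
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
        a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
    return objective((lo + hi) / 2.0)

rng = np.random.default_rng(0)
print(norm_via_rho(rng.standard_normal((13, 25)), n=10, gamma=0.5))
```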

SLIDE 54

Example: subspace learning formulation

Lemma: dual norm

$$\|\Lambda'\|^2_{(\mathcal{U}^2_2,\,2)} = \max_{h : \|h^X\|_2 = 1,\ \|h^Y\|_2 = \gamma} h' \Lambda \Lambda' h = \max_{H \succeq 0,\ \mathrm{tr}(H I^X) = 1,\ \mathrm{tr}(H I^Y) = \gamma} \mathrm{tr}(H \Lambda \Lambda')$$
$$= \min_{\lambda \ge 0,\ \nu \ge 0 \,:\, \Lambda\Lambda' \preceq \lambda I^X + \nu I^Y} \lambda + \gamma\nu = \min_{\lambda \ge 0,\ \nu \ge 0 \,:\, \|D_{\nu/\lambda}\Lambda\|^2_{sp} \le \lambda + \gamma\nu} \lambda + \gamma\nu$$
$$= \min_{\lambda \ge 0,\ \nu \ge 0} \|D_{\nu/\lambda}\Lambda\|^2_{sp} = \min_{\rho \ge 0} \|D_\rho \Lambda\|^2_{sp}$$

SLIDE 55

Example: subspace learning formulation

Can easily derive the target norm from the dual norm:

$$\|\hat Z'\|^*_{(\mathcal{U}^2_2,\,2)} = \max_{\|\Lambda'\|_{(\mathcal{U}^2_2,\,2)} \le 1} \mathrm{tr}(\Lambda'\hat Z) = \max_{\rho \ge 0} \ \max_{\Lambda : \|D_\rho \Lambda\|_{sp} \le 1} \mathrm{tr}(\Lambda'\hat Z)$$
$$= \max_{\rho \ge 0} \ \max_{\tilde\Lambda : \|\tilde\Lambda\|_{sp} \le 1} \mathrm{tr}\big(\tilde\Lambda' D_\rho^{-1} \hat Z\big) = \max_{\rho \ge 0} \big\| D_\rho^{-1} \hat Z \big\|_{tr}$$

(proves the theorem)

SLIDE 56

Example: subspace learning formulation

Computational strategy

Solve in the dual, since $\|\Lambda'\|_{(\mathcal{U}^2_2,\,2)}$ can be computed efficiently via a partitioned power method iteration:

$$\min_{\Lambda} \tilde L^\star(\Lambda; Z) + \alpha^\star \|\Lambda'\|_{(\mathcal{U}^2_2,\,2)}$$

Given $\hat\Lambda$:

  • Recover $\hat Z^X$ and $\hat Z^Y_l$ by solving

$$\min_{\hat Z^X,\, \hat Z^Y_l} L_u(\hat Z^X; X) + L_s(\hat Z^Y_l; Y_l) - \mathrm{tr}\big(\hat Z^{X\prime} \hat\Lambda^X\big) - \mathrm{tr}\big(\hat Z^{Y\prime}_l \hat\Lambda^Y_l\big)$$

  • Recover $\hat Z^Y_u$ by minimizing $\|\hat Z'\|^*_{(\mathcal{U}^2_2,\,2)}$ (keeping $\hat Z^X$, $\hat Z^Y_l$ fixed)

SLIDE 57

Example: subspace learning formulation

Recovery

Given the optimal $\hat Z$, recover $U$ and $\Phi$ iteratively by repeating:

  • $(\Phi^{(\ell)}, \Lambda^{(\ell)}) \in \arg\min_\Phi \max_\Lambda \|\Phi\|_{2,1} + \mathrm{tr}\big(\Lambda'(\hat Z - U^{(\ell)}\Phi)\big)$
  • $u^{(\ell+1)} \in \arg\max_{u \in \mathcal{U}^2_2} \|u'\Lambda^{(\ell)}\|_2$
  • $U^{(\ell+1)} = [U^{(\ell)}, u^{(\ell+1)}]$

Converges to the optimal $U$ and $\Phi$:

  • $U^{(\ell)}\Phi^{(\ell)} = \hat Z$ for all $\ell$
  • $\|\Phi^{(\ell)}\|_{2,1} \to \|\hat Z'\|^*_{(\mathcal{U}^2_2,\,2)}$

SLIDE 58

Some simple experiments

SLIDE 59

Experimental results: Subspace learning

  • Staged: first locally optimize $B$, $\Phi$, then optimize $W$
  • Alternate: repeatedly optimize over $B$, $W$, $\Phi$ successively
  • Global: recover the joint global minimizer over $B$, $\Phi$, $W$

SLIDE 60

Experimental results: Subspace learning

Objective value achieved

data set    COIL   WBC    BCI    Ionos  G241N
Staged      1.384  1.321  0.799  0.769  1.381
Alternate   0.076  0.122  0.609  0.081  0.076
Global      0.070  0.113  0.069  0.078  0.070

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

SLIDE 61

Experimental results: Subspace learning

Run time (seconds)

data set    COIL   WBC   BCI   Ionos  G241N
Staged      272    73    45    28     290
Alternate   2352   324   227   112    2648
Global      106    8     25    61     94

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

SLIDE 62

Experimental results: Subspace learning

Transductive generalization error

data set                COIL   WBC    BCI    Ionos  G241N
Staged                  0.476  0.200  0.452  0.335  0.484
Alternate               0.464  0.388  0.440  0.457  0.478
Global                  0.388  0.134  0.380  0.243  0.380
(Lee et al. 2009)       0.414  0.168  0.436  0.350  0.452
(Goldberg et al. 2010)  0.484  0.288  0.540  0.338  0.524

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

SLIDE 63

Conclusion

Global training can be more efficient than local training

Alternation is inherently slow to converge

Global training simplifies practical application

  • no under-training
  • only need to guard against over-fitting
  • can use standard regularization techniques