Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang
Supervised vs. Unsupervised
Math formulation for supervised learning
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$
• s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y)\sim D}\,[l(f, x, y)]$
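The setup above can be sketched numerically. This is a minimal illustration, assuming a squared loss $l(f, x, y) = (f(x) - y)^2$ and a simple linear hypothesis; the data-generating process here is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)                          # inputs drawn i.i.d. from some D
y = 2.0 * x + rng.normal(scale=0.1, size=n)     # noisy labels

def f(x):
    return 2.0 * x                              # a candidate hypothesis in H

# Empirical loss: (1/n) * sum_i l(f, x_i, y_i)
empirical_loss = np.mean((f(x) - y) ** 2)
print(empirical_loss)   # roughly the label-noise variance (0.01)
```

The expected loss $L(f)$ cannot be computed exactly from samples; the empirical average above is its standard estimate.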
Unsupervised learning
• Given training data $\{x_i : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Extract some "structure" from the data
• There is no single general framework
• Typical unsupervised tasks:
  • Summarization: clustering, dimension reduction
  • Learning probabilistic models: latent variable models, density estimation
Principal Component Analysis (PCA)
High dimensional data
• Example 1: images
  Dimension: 300 × 300 = 90,000
High dimensional data
• Example 2: documents
• Features:
  • Unigram (count of each word): thousands
  • Bigram (co-occurrence contextual information): millions
• Netflix survey: 480189 users x 17770 movies

           Movie 1  Movie 2  Movie 3  Movie 4  Movie 5  Movie 6  …
  User 1      5        ?        ?        1        3        ?
  User 2      ?        ?        3        1        2        5
  User 3      4        3        1        ?        5        1
  …

Example from Nina Balcan
Principal Component Analysis (PCA)
• Data analysis point of view: a dimension reduction technique on a given set of high dimensional data $\{x_i : 1 \le i \le n\}$
• Math point of view: eigen-decomposition of the covariance matrix (or singular value decomposition of the data matrix)
• Classic, commonly used tool
Principal Component Analysis (PCA)
• Extract hidden lower dimensional structure of the data
• Try to capture as much of the variance structure as possible
• Computation: solved by singular value decomposition (SVD)
Principal Component Analysis (PCA)
• Definition: an orthogonal projection or transformation of the data into a (typically lower dimensional) subspace so that the variance of the projected data is maximized.
Figure from isomorphismes @stackexchange
Principal Component Analysis (PCA)
• An illustration of the projection to 1 dimension
• Pay attention to the variance of the projected points
Figure from amoeba@stackexchange
Principal Component Analysis (PCA)
• Principal Components (PCs) are directions that capture most of the variance in the data
• First PC: direction of greatest variability in the data
  • Data points are most spread out when projected on the first PC compared to any other direction
• Second PC: next direction of greatest variability, orthogonal to the first PC
• Third PC: next direction of greatest variability, orthogonal to the first and second PCs
• …
Math formulation
• Suppose the data are centered: $\sum_{i=1}^{n} x_i = 0$
• Then their projections on any direction $w$ are centered: $\sum_{i=1}^{n} w^T x_i = 0$
• First PC: maximize the variance of the projections
  $\max_w \sum_{i=1}^{n} (w^T x_i)^2, \quad \text{s.t. } w^T w = 1$
  equivalent to
  $\max_w\; w^T X X^T w, \quad \text{s.t. } w^T w = 1$
  where the columns of $X$ are the data points
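A small numerical check of this formulation: among unit vectors $w$, the top eigenvector of $X X^T$ attains the largest projected variance $\sum_i (w^T x_i)^2$. The data here is synthetic and the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Columns of X are centered 2-D data points, stretched along the first axis.
X = rng.normal(size=(2, 500)) * np.array([[3.0], [1.0]])
X = X - X.mean(axis=1, keepdims=True)

# Eigenvector for the largest eigenvalue of the (unnormalized) covariance X X^T.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # eigh returns ascending eigenvalues
w_star = eigvecs[:, -1]

def projected_variance(w):
    return np.sum((w @ X) ** 2)              # sum_i (w^T x_i)^2

# No random unit direction should beat the top eigenvector.
best_random = max(
    projected_variance(v / np.linalg.norm(v))
    for v in rng.normal(size=(1000, 2))
)
assert projected_variance(w_star) >= best_random
```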
Math formulation
• First PC: $\max_w\; w^T X X^T w, \quad \text{s.t. } w^T w = 1$, where the columns of $X$ are the data points
• Solved by the Lagrangian: there exists $\lambda$ such that
  $\max_w\; w^T X X^T w - \lambda w^T w$
  $\frac{\partial}{\partial w}\left(w^T X X^T w - \lambda w^T w\right) = 2 X X^T w - 2\lambda w = 0 \;\Rightarrow\; X X^T w = \lambda w$
Computation: Eigen-decomposition
• First PC: $X X^T w = \lambda w$
• $X X^T$: the covariance matrix
• $w$: an eigenvector of the covariance matrix
• First PC: the first eigenvector of the covariance matrix
• Top $k$ PCs: a similar argument shows they are the top $k$ eigenvectors
Computation: Eigen-decomposition
• Top $k$ PCs: the top $k$ eigenvectors
  $X X^T U = U \Lambda$ where $\Lambda$ is a diagonal matrix
• $U$: the left singular vectors of $X$
• Recall the SVD decomposition theorem:
  An $m \times n$ real matrix $X$ has factorization $X = U \Sigma V^T$, where $U$ is an $m \times m$ orthogonal matrix, $\Sigma$ is an $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $V$ is an $n \times n$ orthogonal matrix.
Equivalent view: low rank approximation
• First PC maximizes variance: $\max_w\; w^T X X^T w, \quad \text{s.t. } w^T w = 1$
• Alternative viewpoint: find a vector $w$ such that the projection yields minimum MSE reconstruction
  $\min_w \frac{1}{n} \sum_{i=1}^{n} \|x_i - w w^T x_i\|^2, \quad \text{s.t. } w^T w = 1$
Equivalent view: low rank approximation
• Alternative viewpoint: find a vector $w$ such that the projection yields minimum MSE reconstruction
  $\min_w \frac{1}{n} \sum_{i=1}^{n} \|x_i - w w^T x_i\|^2, \quad \text{s.t. } w^T w = 1$
Figure from Nina Balcan
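The equivalence follows from $\|x_i - w w^T x_i\|^2 = \|x_i\|^2 - (w^T x_i)^2$ for unit $w$: since $\sum_i \|x_i\|^2$ is fixed, minimizing reconstruction error is the same as maximizing projected variance. A quick numerical check of this identity (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 100))
X = X - X.mean(axis=1, keepdims=True)   # columns are centered data points

w = rng.normal(size=3)
w /= np.linalg.norm(w)                  # unit direction, w^T w = 1

# Mean squared reconstruction error of the rank-1 projection w w^T.
recon_err = np.mean(np.sum((X - np.outer(w, w @ X)) ** 2, axis=0))
total = np.mean(np.sum(X ** 2, axis=0))      # (1/n) sum_i ||x_i||^2
variance = np.mean((w @ X) ** 2)             # (1/n) sum_i (w^T x_i)^2

# Reconstruction error = total energy - projected variance.
assert np.isclose(recon_err, total - variance)
```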
Summary
• PCA: orthogonal projection that maximizes variance
• Low rank approximation: orthogonal projection that minimizes error
• Eigen-decomposition/SVD
• All equivalent for centered data
Sparse coding
A latent variable view of PCA
• Let $h_i = w^T x_i$
• Data point viewed as $x_i = w h_i + \text{noise}$
[Figure: graphical model with latent variable $h$, parameter $w$, and observation $x_i$]
A latent variable view of PCA
• Consider the top $k$ PCs $U$
• Let $h_i = U^T x_i$
• Data point viewed as $x_i = U h_i + \text{noise}$
[Figure: graphical model with latent variable $h$, parameter $U$, and observation $x$]
A latent variable view of PCA
• PCA structure assumption: $h$ has low dimension. What about other assumptions?
• Consider the top $k$ PCs $U$
• Let $h_i = U^T x_i$
• Data point viewed as $x_i = U h_i + \text{noise}$
Sparse coding
• Structure assumption: $h$ is sparse, i.e., $\|h\|_0$ is small
• Dimension of $h$ can be large
Sparse coding
• Latent variable probabilistic model view:
  $x \mid h = W h + N(0, \gamma I)$, where $h$ is sparse
• E.g., from a Laplacian prior: $p(h) = \frac{\lambda}{2} \exp\left(-\frac{\lambda}{2}\|h\|_1\right)$
Sparse coding
• Suppose $W$ is known. MLE on $h$:
  $h^* = \arg\max_h \log p(h \mid x)$
  $h^* = \arg\min_h\; \lambda \|h\|_1 + \frac{1}{\gamma}\|x - W h\|^2$
• Suppose both $W$ and $h$ are unknown
• Typically alternate between updating $W$ and $h$
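The inference step above (solving for $h$ with $W$ fixed) can be sketched with ISTA, a standard proximal-gradient method for this $\ell_1$-regularized objective. This is a minimal illustration, not the slides' algorithm; `lam`, the step size, and the synthetic data are assumed choices for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(x, W, lam=0.1, n_iters=500):
    """Minimize lam*||h||_1 + ||x - W h||^2 over h by proximal gradient."""
    h = np.zeros(W.shape[1])
    # Gradient of ||x - W h||^2 is Lipschitz with constant 2*||W||_2^2,
    # so a step of 0.5/||W||_2^2 guarantees monotone descent.
    step = 0.5 / np.linalg.norm(W, 2) ** 2
    for _ in range(n_iters):
        grad = -2.0 * W.T @ (x - W @ h)          # gradient of the quadratic term
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(4)
W = rng.normal(size=(20, 50))                    # overcomplete dictionary (k > dim)
h_true = np.zeros(50)
h_true[[3, 17, 42]] = [1.0, -2.0, 1.5]           # sparse code: only 3 nonzeros
x = W @ h_true + 0.01 * rng.normal(size=20)      # x = W h + small noise
h_hat = ista(x, W)
```

The alternating scheme in the last bullet would wrap a dictionary update for `W` (e.g., a least-squares step) around this inference step.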
Sparse coding
• Historical note: a study of the visual system
• Bruno A. Olshausen and David Field. "Emergence of simple-cell receptive field properties by learning a sparse code for natural images." Nature 381.6583 (1996): 607-609.
Project paper list
Supervised learning
• AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
• GoogLeNet: Going Deeper with Convolutions
• Residual Network: Deep Residual Learning for Image Recognition
Unsupervised learning
• Deep belief networks: A fast learning algorithm for deep belief nets
• Reducing the Dimensionality of Data with Neural Networks
• Variational autoencoder: Auto-Encoding Variational Bayes
• Generative Adversarial Nets
Recurrent neural networks
• Long Short-Term Memory
• Memory Networks
• Sequence to Sequence Learning with Neural Networks
You choose the paper that interests you!
• Need to consult with the TA
• Heavier responsibility on the student side if you customize the project
• Check recent papers in the conferences ICML, NIPS, ICLR
• Check papers by leading researchers: Hinton, LeCun, Bengio, etc.
• Explore whether deep learning can be applied to your application
• arXiv not recommended: too many deep learning papers
Recommended
More recommended