Lecture 7: Factor Analysis


  1. Deep Learning Basics, Lecture 7: Factor Analysis. Princeton University COS 495. Instructor: Yingyu Liang

  2. Supervised vs. Unsupervised

  3. Math formulation for supervised learning
     • Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
     • Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$
     • s.t. the expected loss $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$ is small
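
A minimal sketch added for illustration (not from the slides): computing the empirical loss $\hat{L}(f)$ for squared loss with a hypothetical linear predictor $f(x) = w^T x$ on synthetic data. All names and data here are assumptions.

```python
import numpy as np

# Toy sketch (assumed setup, not from the lecture): empirical risk for squared loss
# with a linear predictor f(x) = w^T x on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # n = 100 samples, x in R^3
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def empirical_risk(w, X, y):
    """L_hat(f) = (1/n) * sum_i l(f, x_i, y_i), with l the squared loss."""
    return np.mean((X @ w - y) ** 2)

print(empirical_risk(np.zeros(3), X, y))            # risk of the all-zeros predictor
```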

  4. Unsupervised learning
     • Given training data $\{x_i : 1 \le i \le n\}$ i.i.d. from distribution $D$
     • Extract some "structure" from the data
     • There is no single general framework
     • Typical unsupervised tasks:
       • Summarization: clustering, dimension reduction
       • Learning probabilistic models: latent variable models, density estimation

  5. Principal Component Analysis (PCA)

  6. High dimensional data
     • Example 1: images. Dimension: 300 x 300 = 90,000

  7. High dimensional data
     • Example 2: documents
     • Features:
       • Unigram (count of each word): thousands
       • Bigram (co-occurrence contextual information): millions
     • Netflix survey: 480,189 users x 17,770 movies

                 Movie 1  Movie 2  Movie 3  Movie 4  Movie 5  Movie 6  ...
       User 1       5        ?        ?        1        3        ?
       User 2       ?        ?        3        1        2        5
       User 3       4        3        1        ?        5        1
       ...

     (Example from Nina Balcan)

  8. Principal Component Analysis (PCA)
     • Data analysis point of view: a dimension reduction technique applied to a given set of high dimensional data $\{x_i : 1 \le i \le n\}$
     • Math point of view: eigen-decomposition of the covariance matrix (or singular value decomposition of the data matrix)
     • Classic, commonly used tool

  9. Principal Component Analysis (PCA)
     • Extract hidden lower dimensional structure of the data
     • Try to capture as much of the variance structure as possible
     • Computation: solved by singular value decomposition (SVD)

  10. Principal Component Analysis (PCA)
      • Definition: an orthogonal projection or transformation of the data into a (typically lower dimensional) subspace so that the variance of the projected data is maximized
      • (Figure from isomorphismes @ StackExchange)

  11. Principal Component Analysis (PCA)
      • An illustration of the projection onto 1 dimension
      • Pay attention to the variance of the projected points
      • (Figure from amoeba @ StackExchange)

  12. Principal Component Analysis (PCA)
      • Principal Components (PCs) are directions that capture most of the variance in the data
      • First PC: direction of greatest variability in the data
      • Data points are most spread out when projected on the first PC compared to any other direction
      • Second PC: next direction of greatest variability, orthogonal to the first PC
      • Third PC: next direction of greatest variability, orthogonal to the first and second PCs
      • ...

  13. Math formulation
      • Suppose the data are centered: $\sum_{i=1}^{n} x_i = 0$
      • Then their projections onto any direction $v$ are also centered: $\sum_{i=1}^{n} v^T x_i = 0$
      • First PC: maximize the variance of the projections
        $\max_{v} \sum_{i=1}^{n} (v^T x_i)^2$, s.t. $v^T v = 1$
      • Equivalent to
        $\max_{v} v^T X X^T v$, s.t. $v^T v = 1$
        where the columns of $X$ are the data points

  14. Math formulation
      • First PC:
        $\max_{v} v^T X X^T v$, s.t. $v^T v = 1$
        where the columns of $X$ are the data points
      • Solved by the Lagrangian: there exists $\lambda$ such that the problem becomes
        $\max_{v} \; v^T X X^T v - \lambda v^T v$
      • Setting the derivative to zero:
        $\frac{\partial}{\partial v} \left( v^T X X^T v - \lambda v^T v \right) = 0 \;\Rightarrow\; (X X^T - \lambda I) v = 0 \;\Rightarrow\; X X^T v = \lambda v$

  15. Computation: Eigen-decomposition
      • First PC: $X X^T v = \lambda v$
      • $X X^T$: the covariance matrix of the centered data
      • $v$: an eigenvector of the covariance matrix
      • First PC: the top eigenvector of the covariance matrix
      • Top $k$ PCs: a similar argument shows they are the top $k$ eigenvectors
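
A small NumPy sketch added for illustration (not the lecture's code, synthetic data): checking that the direction found by power iteration on $X X^T$ agrees with the top eigenvector returned by np.linalg.eigh.

```python
import numpy as np

# Sketch: the first PC is the top eigenvector of X X^T (columns of X are data points).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 300))
X = X - X.mean(axis=1, keepdims=True)       # center the data

C = X @ X.T                                 # (unnormalized) covariance matrix

v = rng.normal(size=4)                      # power iteration for the top eigenvector
for _ in range(1000):
    v = C @ v
    v /= np.linalg.norm(v)

top = np.linalg.eigh(C)[1][:, -1]           # eigenvector of the largest eigenvalue
assert np.isclose(abs(v @ top), 1.0, atol=1e-6)
```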

  16. Computation: Eigen-decomposition
      • Top $k$ PCs: the top $k$ eigenvectors
        $X X^T U = U \Lambda$, where $\Lambda$ is a diagonal matrix
      • The columns of $U$ are the left singular vectors of $X$
      • Recall the SVD decomposition theorem: an $m \times n$ real matrix $M$ has a factorization $M = U \Sigma V^T$, where $U$ is an $m \times m$ orthogonal matrix, $\Sigma$ is an $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $V$ is an $n \times n$ orthogonal matrix.
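
The following NumPy sketch (an added illustration on synthetic data, not part of the slides) shows the two computational routes giving the same top-$k$ PCs: the top eigenvectors of $X X^T$ versus the left singular vectors of $X$.

```python
import numpy as np

# Sketch: top-k PCs via eigen-decomposition of X X^T vs. the SVD of X.
# X has one data point per column (d x n), as in the slides.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)            # center the data
k = 2

# Route 1: top-k eigenvectors of X X^T.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)       # eigenvalues in ascending order
U_eig = eigvecs[:, ::-1][:, :k]

# Route 2: top-k left singular vectors of X.
U_svd = np.linalg.svd(X, full_matrices=False)[0][:, :k]

# The two agree up to the sign of each column.
for i in range(k):
    assert np.isclose(abs(U_eig[:, i] @ U_svd[:, i]), 1.0, atol=1e-8)

H = U_svd.T @ X                                  # projections h_i = U^T x_i (k x n)
```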

  17. Equivalent view: low rank approximation
      • First PC maximizes variance:
        $\max_{v} v^T X X^T v$, s.t. $v^T v = 1$
      • Alternative viewpoint: find a vector $v$ such that the projection yields the minimum MSE reconstruction
        $\min_{v} \frac{1}{n} \sum_{i=1}^{n} \| x_i - v v^T x_i \|^2$, s.t. $v^T v = 1$

  18. Equivalent view: low rank approximation
      • Alternative viewpoint: find a vector $v$ such that the projection yields the minimum MSE reconstruction
        $\min_{v} \frac{1}{n} \sum_{i=1}^{n} \| x_i - v v^T x_i \|^2$, s.t. $v^T v = 1$
      • (Figure from Nina Balcan)
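
Why the two views coincide (a one-line derivation added here; it is not spelled out on the slide): for a unit vector $v$, the reconstruction error of each point splits into a constant term and the negative squared projection.

```latex
% For v with v^T v = 1:
\| x_i - v v^T x_i \|^2
  = x_i^T x_i - 2 (v^T x_i)^2 + (v^T x_i)^2 \, v^T v
  = \| x_i \|^2 - (v^T x_i)^2 .
% Since \|x_i\|^2 does not depend on v, minimizing the average reconstruction error
% is the same as maximizing \sum_i (v^T x_i)^2, i.e., the projected variance.
```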

  19. Summary
      • PCA: the orthogonal projection that maximizes variance
      • Low rank approximation: the orthogonal projection that minimizes reconstruction error
      • Computed by eigen-decomposition / SVD
      • All equivalent for centered data

  20. Sparse coding

  21. A latent variable view of PCA
      • Let $h_i = v^T x_i$
      • Each data point is viewed as $x_i = v h_i + \text{noise}$
      • (Diagram: latent variable $h$, direction $v$, data point $x_i$)

  22. A latent variable view of PCA
      • Consider the top $k$ PCs $U$
      • Let $h_i = U^T x_i$
      • Each data point is viewed as $x_i = U h_i + \text{noise}$
      • (Diagram: latent variable $h$, matrix $U$, data point $x$)

  23. A latent variable view of PCA
      • Consider the top $k$ PCs $U$
      • Let $h_i = U^T x_i$
      • Each data point is viewed as $x_i = U h_i + \text{noise}$
      • PCA structure assumption: $h$ is low dimensional. What about other assumptions?
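
A short self-contained NumPy sketch of this latent-variable reading (an added illustration on synthetic data): compute the codes $h_i = U^T x_i$, reconstruct $\hat{x}_i = U h_i$, and treat the residual as noise.

```python
import numpy as np

# Sketch of the latent-variable view of PCA: h_i = U^T x_i, x_i = U h_i + noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 400))
X = X - X.mean(axis=1, keepdims=True)                  # centered data, one point per column

k = 2
U = np.linalg.svd(X, full_matrices=False)[0][:, :k]    # top-k PCs (left singular vectors)

H = U.T @ X                                            # latent codes, one per column
X_hat = U @ H                                          # reconstructions U h_i
noise = X - X_hat                                      # the part PCA does not explain
print(np.mean(np.sum(noise ** 2, axis=0)))             # average squared reconstruction error
```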

  24. Sparse coding
      • Structure assumption: $h$ is sparse, i.e., $\| h \|_0$ is small
      • The dimension of $h$ can be large
      • (Diagram: latent variable $h$, dictionary $W$, data point $x$)

  25. Sparse coding
      • Latent variable probabilistic model view:
        $p(x \mid h)$: $x = W h + N(0, \tfrac{1}{\beta} I)$, with $h$ sparse
      • E.g., $h$ from a Laplacian prior: $p(h) = \frac{\lambda}{2} \exp(-\frac{\lambda}{2} \| h \|_1)$
      • (Diagram: latent variable $h$, dictionary $W$, data point $x$)
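
A short derivation added here (not spelled out on the slide, and assuming the Gaussian noise covariance $\tfrac{1}{\beta} I$ above) of how this model leads to the optimization problem on the next slide: take the log posterior of $h$ and drop constants.

```latex
% MAP inference for h given x, with the dictionary W fixed:
h^* = \arg\max_h \log p(h \mid x)
    = \arg\max_h \big[ \log p(x \mid h) + \log p(h) \big].
% Gaussian noise with covariance (1/\beta) I and the Laplacian prior give,
% up to additive constants,
\log p(x \mid h) = -\tfrac{\beta}{2} \| x - W h \|_2^2 + c_1, \qquad
\log p(h) = -\tfrac{\lambda}{2} \| h \|_1 + c_2,
% so maximizing the posterior is (after multiplying by 2) the same as
h^* = \arg\min_h \; \lambda \| h \|_1 + \beta \, \| x - W h \|_2^2 .
```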

  26. Sparse coding
      • Suppose $W$ is known. The maximum a posteriori (MAP) estimate of $h$ is
        $h^* = \arg\max_h \log p(h \mid x)$
        $h^* = \arg\min_h \; \lambda \| h \|_1 + \beta \| x - W h \|_2^2$
      • Suppose both $W$ and $h$ are unknown:
        typically alternate between updating $W$ and updating $h$
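
A minimal NumPy sketch of the alternating scheme (an added illustrative implementation, not the lecture's algorithm; the solver choices, sizes, and hyperparameters are assumptions): the codes $h$ are updated by ISTA, i.e., proximal gradient on the objective above, and the dictionary $W$ by regularized least squares with unit-norm columns.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_codes(X, W, lam, beta, n_steps=200):
    """ISTA on each column of X: min_h lam*||h||_1 + beta*||x - W h||^2."""
    step = 1.0 / (2 * beta * np.linalg.norm(W, 2) ** 2)   # 1 / Lipschitz constant
    H = np.zeros((W.shape[1], X.shape[1]))
    for _ in range(n_steps):
        grad = 2 * beta * W.T @ (W @ H - X)               # gradient of the smooth part
        H = soft_threshold(H - step * grad, step * lam)
    return H

def update_dictionary(X, H, eps=1e-8):
    """Least-squares dictionary update, then normalize columns to unit norm."""
    W = X @ H.T @ np.linalg.pinv(H @ H.T + eps * np.eye(H.shape[0]))
    return W / (np.linalg.norm(W, axis=0, keepdims=True) + eps)

# Alternating minimization on synthetic data (hypothetical sizes: d=20, n=500, 50 atoms).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
W = rng.normal(size=(20, 50))
W /= np.linalg.norm(W, axis=0, keepdims=True)
for _ in range(10):
    H = sparse_codes(X, W, lam=0.1, beta=1.0)             # update h with W fixed
    W = update_dictionary(X, H)                           # update W with h fixed
```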

  27. Sparse coding
      • Historical note: study of the visual system
      • Bruno A. Olshausen and David Field. "Emergence of simple-cell receptive field properties by learning a sparse code for natural images." Nature 381.6583 (1996): 607-609.

  28. Project paper list

  29. Supervised learning
      • AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
      • GoogLeNet: Going Deeper with Convolutions
      • Residual Network: Deep Residual Learning for Image Recognition

  30. Unsupervised learning
      • Deep belief networks: A Fast Learning Algorithm for Deep Belief Nets
      • Reducing the Dimensionality of Data with Neural Networks
      • Variational autoencoder: Auto-Encoding Variational Bayes
      • Generative Adversarial Nets

  31. Recurrent neural networks
      • Long Short-Term Memory
      • Memory Networks
      • Sequence to Sequence Learning with Neural Networks

  32. You choose the paper that interests you!
      • You need to consult with the TA
      • Heavier responsibility falls on the student if you customize the project
      • Check recent papers in the conferences ICML, NIPS, ICLR
      • Check papers by leading researchers: Hinton, LeCun, Bengio, etc.
      • Explore whether deep learning can be applied to your application
      • arXiv is not recommended: too many deep learning papers
