  1. Deep Learning Basics, Lecture 7: Factor Analysis. Princeton University COS 495. Instructor: Yingyu Liang

  2. Supervised vs. Unsupervised

  3. Math formulation for supervised learning
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$
  • s.t. the expected loss $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$ is small
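For concreteness, a minimal numpy sketch of the empirical loss in this formulation, assuming squared loss and a linear hypothesis class; the function and variable names are illustrative, not from the lecture:

```python
# Minimal sketch of the empirical loss, assuming squared loss and a linear
# hypothesis f(x) = theta^T x; names are illustrative, not from the slides.
import numpy as np

def empirical_loss(theta, X, y):
    """L_hat(f) = (1/n) * sum_i l(f, x_i, y_i), with l the squared error.

    X: (n, d) array of inputs x_i; y: (n,) array of labels y_i.
    """
    preds = X @ theta                  # f(x_i) for every training point
    return np.mean((preds - y) ** 2)   # average loss over the n samples
```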

  4. Unsupervised learning
  • Given training data $x_i, 1 \le i \le n$, i.i.d. from distribution $D$
  • Extract some "structure" from the data
  • There is no general framework
  • Typical unsupervised tasks:
    • Summarization: clustering, dimension reduction
    • Learning probabilistic models: latent variable models, density estimation

  5. Principal Component Analysis (PCA)

  6. High dimensional data
  • Example 1: images
  • Dimension: 300 × 300 = 90,000

  7. High dimensional data
  • Example 2: documents
  • Features:
    • Unigram (count of each word): thousands
    • Bigram (co-occurrence contextual information): millions
  • Netflix survey: 480,189 users × 17,770 movies

             Movie 1  Movie 2  Movie 3  Movie 4  Movie 5  Movie 6  ...
    User 1      5        ?        ?        1        3        ?
    User 2      ?        ?        3        1        2        5
    User 3      4        3        1        ?        5        1
    ...

  Example from Nina Balcan

  8. Principal Component Analysis (PCA)
  • Data analysis point of view: a dimension reduction technique for a given set of high dimensional data $x_i, 1 \le i \le n$
  • Math point of view: eigen-decomposition of the covariance (or singular value decomposition of the data)
  • A classic, commonly used tool

  9. Principal Component Analysis (PCA)
  • Extracts hidden lower dimensional structure of the data
  • Tries to capture as much of the variance structure as possible
  • Computation: solved by singular value decomposition (SVD)

  10. Principal Component Analysis (PCA)
  • Definition: an orthogonal projection or transformation of the data into a (typically lower dimensional) subspace so that the variance of the projected data is maximized
  Figure from isomorphismes @stackexchange

  11. Principal Component Analysis (PCA)
  • An illustration of the projection to 1 dimension
  • Pay attention to the variance of the projected points
  Figure from amoeba @stackexchange

  12. Principal Component Analysis (PCA)
  • Principal Components (PCs) are directions that capture most of the variance in the data
  • First PC: direction of greatest variability in the data
    • Data points are most spread out when projected onto the first PC, compared to any other direction
  • Second PC: next direction of greatest variability, orthogonal to the first PC
  • Third PC: next direction of greatest variability, orthogonal to the first and second PCs
  • ...

  13. Math formulation
  • Suppose the data are centered: $\sum_{i=1}^{n} x_i = 0$
  • Then their projections onto any direction $v$ are also centered: $\sum_{i=1}^{n} v^T x_i = 0$
  • First PC: maximize the variance of the projections
    $\max_v \sum_{i=1}^{n} (v^T x_i)^2$, s.t. $v^T v = 1$
    which is equivalent to
    $\max_v v^T X X^T v$, s.t. $v^T v = 1$
    where the columns of $X$ are the data points
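A small numpy sketch of this objective, assuming (as on the slide) that the columns of $X$ are the centered data points; the random data and the helper `projected_variance` are illustrative only:

```python
# Sketch of the first-PC objective, with data points as the columns of X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))          # d = 5 features, n = 200 points
X = X - X.mean(axis=1, keepdims=True)  # center the data: sum_i x_i = 0

def projected_variance(v, X):
    """sum_i (v^T x_i)^2 = v^T X X^T v for a unit-norm direction v."""
    v = v / np.linalg.norm(v)          # enforce the constraint v^T v = 1
    return np.sum((v @ X) ** 2)

v = rng.normal(size=5)
print(projected_variance(v, X))        # variance captured by this direction
```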

  14. Math formulation
  • First PC:
    $\max_v v^T X X^T v$, s.t. $v^T v = 1$
    where the columns of $X$ are the data points
  • Solved via the Lagrangian: there exists $\lambda$ such that the problem becomes
    $\max_v \; v^T X X^T v - \lambda v^T v$
    $\frac{\partial}{\partial v} \left( v^T X X^T v - \lambda v^T v \right) = 0 \;\Rightarrow\; (X X^T - \lambda I) v = 0 \;\Rightarrow\; X X^T v = \lambda v$

  15. Computation: eigen-decomposition
  • First PC: $X X^T v = \lambda v$
  • $X X^T$: the covariance matrix of the centered data (up to the $1/n$ scaling)
  • $v$: an eigenvector of the covariance matrix
  • First PC: the top eigenvector of the covariance matrix
  • Top $k$ PCs: a similar argument shows they are the top $k$ eigenvectors
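A possible numpy sketch of this computation under the same conventions (centered data, points as columns): it finds the first PC as the top eigenvector of $X X^T$ and checks the eigenvalue equation from the previous slide.

```python
# Sketch: first PC as the top eigenvector of X X^T (centered data as columns).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)

cov = X @ X.T                           # (unnormalized) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: ascending eigenvalues for symmetric matrices
first_pc = eigvecs[:, -1]               # eigenvector with the largest eigenvalue

# Check the eigenvalue equation X X^T v = lambda v.
print(np.allclose(cov @ first_pc, eigvals[-1] * first_pc))
```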

  16. Computation: eigen-decomposition
  • Top $k$ PCs: the top $k$ eigenvectors, $X X^T U = U \Lambda$, where $\Lambda$ is a diagonal matrix
  • $U$: the left singular vectors of $X$
  • Recall the SVD theorem: an $m \times n$ real matrix $M$ has a factorization $M = U \Sigma V^T$, where $U$ is an $m \times m$ orthogonal matrix, $\Sigma$ is an $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $V$ is an $n \times n$ orthogonal matrix
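A short numpy sketch, under the same assumptions, computing the top-$k$ PCs from the SVD and checking that the left singular vectors satisfy $X X^T U = U \Lambda$ with $\Lambda = \mathrm{diag}(\sigma_i^2)$; the value of $k$ and the random data are illustrative:

```python
# Sketch: top-k PCs via SVD of the centered data matrix X (points as columns).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(S) Vt
top_k_pcs = U[:, :k]                              # columns: top-k principal directions

# Eigenvalues of X X^T are the squared singular values: X X^T U = U diag(S^2).
print(np.allclose((X @ X.T) @ top_k_pcs, top_k_pcs * (S[:k] ** 2)))
```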

  17. Equivalent view: low rank approximation
  • First PC maximizes the variance:
    $\max_v v^T X X^T v$, s.t. $v^T v = 1$
  • Alternative viewpoint: find the vector $v$ such that projecting onto it yields the minimum MSE reconstruction
    $\min_v \frac{1}{n} \sum_{i=1}^{n} \|x_i - v v^T x_i\|^2$, s.t. $v^T v = 1$

  18. Equivalent view: low rank approximation
  • Alternative viewpoint: find the vector $v$ such that projecting onto it yields the minimum MSE reconstruction
    $\min_v \frac{1}{n} \sum_{i=1}^{n} \|x_i - v v^T x_i\|^2$, s.t. $v^T v = 1$
  Figure from Nina Balcan

  19. Summary
  • PCA: the orthogonal projection that maximizes variance
  • Low rank approximation: the orthogonal projection that minimizes reconstruction error
  • Eigen-decomposition / SVD
  • All equivalent for centered data
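A quick numerical sanity check of these equivalences on random centered data: the first left singular vector both maximizes projected variance and minimizes mean squared reconstruction error over a set of random unit directions (the data and candidate directions are illustrative):

```python
# Sketch checking that the SVD direction maximizes variance and minimizes
# reconstruction error among random unit directions, for centered data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)

def variance(v):  return np.sum((v @ X) ** 2)                               # v^T X X^T v
def recon_err(v): return np.mean(np.sum((X - np.outer(v, v) @ X) ** 2, axis=0))

v_pca = np.linalg.svd(X)[0][:, 0]                 # first left singular vector
candidates = rng.normal(size=(100, 5))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

assert all(variance(v_pca) >= variance(v) for v in candidates)
assert all(recon_err(v_pca) <= recon_err(v) for v in candidates)
```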

  20. Sparse coding

  21. A latent variable view of PCA
  • Let $h_i = v^T x_i$
  • The data point is viewed as $x_i = v h_i + \text{noise}$
  (Diagram: latent $h$, direction $v$, observation $x_i$)

  22. A latent variable view of PCA
  • Consider the top $k$ PCs $U$
  • Let $h_i = U^T x_i$
  • The data point is viewed as $x_i = U h_i + \text{noise}$
  (Diagram: latent $h$, matrix $U$, observation $x$)

  23. A latent variable view of PCA
  • PCA structure assumption: $h$ has low dimension. What about other assumptions?
  • Consider the top $k$ PCs $U$
  • Let $h_i = U^T x_i$
  • The data point is viewed as $x_i = U h_i + \text{noise}$
  (Diagram: latent $h$, matrix $U$, observation $x$)
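A brief numpy sketch of this latent variable view, under the same conventions: encode each point as $h_i = U^T x_i$ with the top-$k$ PCs and treat the residual $x_i - U h_i$ as the noise term (sizes are illustrative):

```python
# Sketch of the latent variable view: h_i = U^T x_i, x_i = U h_i + noise.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)

k = 2
U = np.linalg.svd(X, full_matrices=False)[0][:, :k]  # top-k principal directions

H = U.T @ X          # latent codes h_i, one per column (k x n)
X_hat = U @ H        # low-dimensional reconstruction U h_i
noise = X - X_hat    # residual the k-dimensional code cannot explain
print(H.shape, np.mean(np.sum(noise ** 2, axis=0)))
```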

  24. Sparse coding
  • Structure assumption: $h$ is sparse, i.e., $\|h\|_0$ is small
  • The dimension of $h$ can be large
  (Diagram: latent $h$, dictionary $W$, observation $x$)

  25. Sparse coding
  • Latent variable probabilistic model view: $p(x \mid h) = N(W h, \frac{1}{\beta} I)$, with $h$ sparse
  • E.g., sparsity from a Laplacian prior: $p(h) = \frac{\lambda}{2} \exp(-\frac{\lambda}{2} \|h\|_1)$
  (Diagram: latent $h$, dictionary $W$, observation $x$)
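A hedged sketch of this probabilistic view: the unnormalized log-posterior $\log p(h \mid x) = \log p(x \mid h) + \log p(h) + \text{const}$ with the Gaussian likelihood and Laplacian prior above; the particular $W$, $\beta$, and $\lambda$ values are illustrative:

```python
# Sketch: unnormalized log-posterior of a code h under the model on this slide,
# log p(h|x) = log p(x|h) + log p(h) + const; W, beta, lam are illustrative.
import numpy as np

def log_posterior_unnormalized(h, x, W, beta=1.0, lam=0.5):
    log_likelihood = -0.5 * beta * np.sum((x - W @ h) ** 2)  # Gaussian N(Wh, I/beta), up to a constant
    log_prior = -0.5 * lam * np.sum(np.abs(h))               # Laplacian prior, up to a constant
    return log_likelihood + log_prior

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 50))       # dictionary: 50-dim sparse code, 10-dim data
h = np.zeros(50); h[[3, 17]] = 1.0  # a sparse code
x = W @ h + 0.1 * rng.normal(size=10)
print(log_posterior_unnormalized(h, x, W))
```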

  26. Sparse coding
  • Suppose $W$ is known. The MLE (MAP estimate) of $h$ is
    $h^* = \arg\max_h \log p(h \mid x)$
    $h^* = \arg\min_h \; \lambda \|h\|_1 + \beta \|x - W h\|_2^2$
  • Suppose both $W$ and $h$ are unknown
  • Typically alternate between updating $W$ and $h$
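One way to implement this alternating scheme is sketched below: $h$ is estimated by ISTA (proximal gradient), a standard solver for this $\ell_1$-regularized objective but not necessarily the one used in the lecture, and $W$ is updated by a gradient step with column renormalization; step sizes and iteration counts are illustrative.

```python
# Sketch of alternating sparse coding: infer h by ISTA for the objective
# lam*||h||_1 + beta*||x - W h||^2, then take a gradient step on W.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def infer_h(x, W, lam=0.5, beta=1.0, n_iter=200):
    """Minimize lam*||h||_1 + beta*||x - W h||_2^2 over h by ISTA."""
    h = np.zeros(W.shape[1])
    step = 1.0 / (2.0 * beta * np.linalg.norm(W, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = -2.0 * beta * W.T @ (x - W @ h)           # gradient of the smooth term
        h = soft_threshold(h - step * grad, step * lam)  # prox step for lam*||.||_1
    return h

def update_W(X, H, W, lr=0.01):
    # One gradient step on the reconstruction term, then renormalize columns
    # so the scale is carried by h rather than W (a common convention).
    W = W + lr * (X - W @ H) @ H.T
    return W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 100))                    # 100 data points, 10-dim, as columns
W = rng.normal(size=(10, 50))
for _ in range(5):                                # alternate between h and W
    H = np.stack([infer_h(x, W) for x in X.T], axis=1)
    W = update_W(X, H, W)
print(np.mean(np.sum((X - W @ H) ** 2, axis=0)))  # reconstruction error
```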

  27. Sparse coding
  • Historical note: originated in the study of the visual system
  • Bruno A. Olshausen and David J. Field. "Emergence of simple-cell receptive field properties by learning a sparse code for natural images." Nature 381.6583 (1996): 607-609.

  28. Project paper list

  29. Supervised learning
  • AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
  • GoogLeNet: Going Deeper with Convolutions
  • Residual Network (ResNet): Deep Residual Learning for Image Recognition

  30. Unsupervised learning
  • Deep belief networks: A fast learning algorithm for deep belief nets
  • Reducing the Dimensionality of Data with Neural Networks
  • Variational autoencoder: Auto-Encoding Variational Bayes
  • Generative Adversarial Nets

  31. Recurrent neural networks
  • Long Short-Term Memory
  • Memory Networks
  • Sequence to Sequence Learning with Neural Networks

  32. You choose the paper that interests you!
  • Need to consult with the TA
  • Heavier responsibility on the student side if you customize the project
  • Check recent papers in the conferences ICML, NIPS, ICLR
  • Check papers by leading researchers: Hinton, LeCun, Bengio, etc.
  • Explore whether deep learning can be applied to your application
  • Browsing arXiv is not recommended as a starting point: too many deep learning papers
