
Big Data Management & Analytics EXERCISE 8 – TEXT PROCESSING, PCA



  1. Big Data Management & Analytics EXERCISE 8 – TEXT PROCESSING, PCA · 21st of December, 2015 · Sabrina Friedl · LMU Munich

  2. Principal Component Analysis (PCA) – REVISION AND EXAMPLE

  3. Goals of PCA – Find a lower-dimensional representation of data to: ◦ Detect hidden correlations ◦ Remove or summarize redundant, irrelevant, or noisy features ◦ Facilitate interpretation and visualization (visualization is actually only possible for few dimensions) ◦ Make storage and processing of the data easier [Figure: projecting data from d=3 down to d=2]

  4. Idea of PCA – A good data representation retains the main differences between data points but eliminates irrelevant variances. ◦ Given a matrix X: n data points with d dimensions (features) ◦ Find the k directions (linear combinations of dimensions) with the highest variance = principal components v₁, v₂, … vₖ ◦ Project the data points onto these directions ◦ General form: XP = Y, where X = raw data matrix, P = (v₁, v₂, … vₖ) = transformation matrix, with shapes (n × d) · (d × k) = (n × k); Y is the k-dimensional representation of X
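To make the shapes concrete, here is a minimal NumPy sketch of the general form XP = Y (the variable names and random data are illustrative, not from the slides):

```python
import numpy as np

# X: raw data matrix with n data points (rows) and d features (columns)
n, d, k = 100, 3, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))

# P: (d x k) transformation matrix whose columns are directions in feature
# space; here a placeholder -- PCA chooses them as the top-k eigenvectors.
P = np.eye(d)[:, :k]

# XP = Y: (n x d) @ (d x k) -> (n x k) lower-dimensional representation
Y = X @ P
print(Y.shape)  # (100, 2)
```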

  5. PCA – Graphical Intuition [Figure: the data is first centered, then transformed by P]

  6. How to get Principal Components? Calculate the eigenvalues and eigenvectors of the covariance matrix Σ = COV(X, X) (Sigma here is the name of the matrix, not the sum symbol!). It describes the pairwise covariance between all features. For a centered data matrix X with µ = 0 we can calculate the covariance matrix as Σ = (1/n) XᵀX
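A minimal NumPy sketch of this computation on made-up data; note that `np.cov` with `bias=True` uses the same 1/n normalization as the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # n = 100 points, d = 3 features

# Center the data so every feature has mean 0
Xc = X - X.mean(axis=0)

# Covariance matrix of the centered data: Sigma = (1/n) X^T X, shape (d x d)
n = Xc.shape[0]
Sigma = (Xc.T @ Xc) / n

# Cross-check against NumPy's own covariance (bias=True selects 1/n)
assert np.allclose(Sigma, np.cov(Xc, rowvar=False, bias=True))
```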

  7. Eigenvalues and Eigenvectors: a non-zero vector v is an eigenvector of Σ with eigenvalue λ if Σv = λv
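A sketch of this step in NumPy: since Σ is symmetric, `np.linalg.eigh` is the appropriate routine (it returns eigenvalues in ascending order, so we re-sort them descending; the example matrix is made up):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # a symmetric covariance matrix

# eigh handles symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Sort descending so the first column is the first principal component
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column v of eigvecs satisfies Sigma @ v = lambda * v
v, lam = eigvecs[:, 0], eigvals[0]
assert np.allclose(Sigma @ v, lam * v)
```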

  8. Dimension Reduction – For the d dimensions of X we get d eigenvalues and eigenvectors. The transformation matrix is then constructed by putting the eigenvectors as columns into a matrix: T = (v₁, v₂, … v_d). Eigendecomposition: Σ = TΛTᵀ, where Σ = covariance matrix, T = (v₁, v₂, … v_d) = transformation matrix, Λ = diagonal matrix with the eigenvalues on the diagonal. To get a k-dimensional representation Y of the (centered) data X we take only the first k eigenvectors (principal components) of T and call this matrix P. We calculate: XP = Y. To transform back: Z = YPᵀ (the reconstruction of X, lossy if k < d)
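Continuing the NumPy sketch (all names illustrative), the reduction and the back-transform look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)              # centered data
Sigma = (Xc.T @ Xc) / Xc.shape[0]    # covariance matrix

eigvals, T = np.linalg.eigh(Sigma)   # T = (v1, ..., vd) as columns
order = np.argsort(eigvals)[::-1]    # sort by descending eigenvalue
T = T[:, order]

k = 2
P = T[:, :k]       # first k principal components
Y = Xc @ P         # XP = Y, the (n x k) representation
Z = Y @ P.T        # back-transform: reconstruction of Xc (lossy for k < d)
print(Y.shape, Z.shape)  # (100, 2) (100, 3)
```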

  9. PCA – Summary of Steps: 1. Center the data X: xᵢ − µᵢ (subtract each feature's mean) 2. Calculate the covariance matrix: Σ = (1/n) XᵀX 3. Calculate the eigenvalues and eigenvectors of Σ ◦ Calculate the eigenvalues λ by finding the zeros of the characteristic polynomial: det(Σ − λI) = 0 ◦ Calculate the eigenvectors by solving (Σ − λI)v = 0 4. Select the k eigenvectors with the largest eigenvalues and create P = (v₁, v₂, … vₖ) 5. Transform the original (n × d) matrix X to an (n × k) representation: XP = Y
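Putting the five steps together, a minimal self-contained sketch (the function name and test data are illustrative):

```python
import numpy as np

def pca(X, k):
    """Reduce X (n x d) to a k-dimensional representation Y = XP."""
    # 1. Center the data
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix: Sigma = (1/n) X^T X
    Sigma = (Xc.T @ Xc) / Xc.shape[0]
    # 3. Eigenvalues/eigenvectors of the symmetric matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    # 4. Select the k eigenvectors with the largest eigenvalues
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # 5. Transform: XP = Y, an (n x k) matrix
    return Xc @ P

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = pca(X, k=2)
print(Y.shape)  # (100, 2)
```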

  10. Useful links ◦ KDD II script: http://www.dbs.ifi.lmu.de/Lehre/KDD_II/WS1516/skript/KDD2-2-HDData.DimensionalityReduction.pdf ◦ A tutorial about PCA: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

