Big Data Management & Analytics – EXERCISE 8: TEXT PROCESSING, PCA


SLIDE 1

Big Data Management & Analytics

EXERCISE 8 – TEXT PROCESSING, PCA

21st of December, 2015

Sabrina Friedl, LMU Munich


SLIDE 2

Principal Component Analysis (PCA)

REVISION AND EXAMPLE


SLIDE 3

Goals of PCA


Find a lower-dimensional representation of data to:

  • Detect hidden correlations
  • Remove (summarize) redundant, irrelevant or noisy features
  • Facilitate interpretation and visualization (visualization is only possible for a few dimensions)
  • Make storage and processing of data easier

[Figure: the same data shown in d = 3 and in a reduced d = 2 representation]

SLIDE 4

Idea of PCA

A good data representation retains the main differences between data points but eliminates irrelevant variance

  • Given matrix X: n data points with d dimensions (features)
  • Find k directions (linear combinations of dimensions) with the highest variance = principal components: v1, v2, ... vk
  • Project the data points onto these directions
  • General form: XP = Y

(n x d) * (d x k) = (n x k)

X = raw data matrix, P = (v1, v2, ... vk) transformation matrix, Y = k-dimensional representation of X
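As a quick check of the shapes involved, here is a minimal NumPy sketch of the projection step; the matrices are random stand-ins (in real PCA, P holds eigenvectors of the covariance matrix, as the next slides show):

```python
import numpy as np

n, d, k = 100, 5, 2        # n data points, d original features, k target dimensions

X = np.random.randn(n, d)  # stand-in raw data matrix (n x d)
P = np.random.randn(d, k)  # stand-in transformation matrix (d x k)

Y = X @ P                  # project: (n x d) * (d x k) = (n x k)
print(Y.shape)             # (100, 2)
```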

SLIDE 5

PCA – Graphical Intuition

[Figure: center the data, then transform by P]

SLIDE 6

How to get Principal Components?


Calculate the eigenvalues and eigenvectors of the covariance matrix

Σ = COV(X, X) describes the pairwise covariance between all features. For a centered data matrix X with µ = 0 we can calculate the covariance matrix as:

Σ = (1/n) XᵀX

Sigma here is the name of the matrix, not the sum symbol!
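A small NumPy sketch of this formula; the data matrix is a random stand-in, and np.cov is called with bias=True so that it also divides by n rather than n − 1:

```python
import numpy as np

X = np.random.randn(100, 3)   # stand-in data: n = 100 points, d = 3 features
X = X - X.mean(axis=0)        # center so that each feature has mean 0

Sigma = (X.T @ X) / len(X)    # covariance matrix: (1/n) X^T X, shape (d x d)

# np.cov with rowvar=False treats rows as observations; bias=True divides by n
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```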

SLIDE 7

Eigenvalues and Eigenvectors

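As a reminder of the definitions used on the following slides: an eigenvector v of Σ satisfies Σv = λv for some scalar λ, its eigenvalue. A minimal NumPy check on a hand-picked symmetric matrix, using np.linalg.eigh (appropriate here because covariance matrices are symmetric):

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])            # small symmetric (covariance-like) matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order,
                                          # eigenvectors as columns

v, lam = eigvecs[:, 0], eigvals[0]
assert np.allclose(Sigma @ v, lam * v)    # defining property: Sigma v = lambda v
```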

SLIDE 8

Dimension Reduction


For π‘œ dimensions of π‘Œ we get π‘œ eigevalues and eigenvectors. The transformation matrix is then constructed by putting the eigenvectors as columns into a matrix: T = 𝑀1, 𝑀2, … π‘€π‘œ Eigendecomposition: Ξ£ = π‘ˆΙ…π‘ˆπ‘ˆ To get a k-dimensional representation Y of (centered) data X we take only the first k eigenvectors (principal components) of T and call this matrix P. We calculate: 𝒀𝑸 = Y To transform back: Z = π‘π‘„π‘ˆ

Ξ£ = covariance matrix T = (v1, v2,... vn) transformation matrix Ι… = diagonalised matrix with eigenvalues on diagonal
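A minimal NumPy sketch of this slide, with a random stand-in data matrix: eigendecompose Σ, sort the eigenvectors by decreasing eigenvalue, keep the first k as P, project, and transform back:

```python
import numpy as np

X = np.random.randn(100, 4)
X = X - X.mean(axis=0)              # PCA assumes centered data

Sigma = (X.T @ X) / len(X)          # covariance matrix (d x d)
eigvals, T = np.linalg.eigh(Sigma)  # columns of T are eigenvectors

order = np.argsort(eigvals)[::-1]   # sort by decreasing eigenvalue
T = T[:, order]

k = 2
P = T[:, :k]                        # first k principal components (d x k)
Y = X @ P                           # k-dimensional representation (n x k)
X_back = Y @ P.T                    # back-transform: approximates X when k < d
print(Y.shape, X_back.shape)        # (100, 2) (100, 4)
```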

SLIDE 9

PCA – Summary of Steps

1. Center the data X: xⱼ − µⱼ
2. Calculate the covariance matrix: Σ = (1/n) XᵀX
3. Calculate the eigenvalues and eigenvectors of Σ
  • Calculate the eigenvalues λ by finding the zeros of the characteristic polynomial: det(Σ − λI) = 0
  • Calculate the eigenvectors by solving (Σ − λI)v = 0
4. Select the k eigenvectors with the biggest eigenvalues and create P = (v1, v2, ... vk)
5. Transform the original (n x d) matrix X to an (n x k) representation: XP = Y
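Putting the five steps together, a minimal end-to-end sketch in NumPy (the function name pca and the toy input are illustrative, not from the exercise):

```python
import numpy as np

def pca(X, k):
    """Reduce X (n x d) to a k-dimensional representation, following steps 1-5."""
    X_centered = X - X.mean(axis=0)               # 1. center the data
    Sigma = (X_centered.T @ X_centered) / len(X)  # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # 3. eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]             # 4. biggest eigenvalues first
    P = eigvecs[:, order[:k]]                     #    P = (v1, ..., vk)
    return X_centered @ P                         # 5. XP = Y, shape (n x k)

Y = pca(np.random.randn(50, 10), k=3)
print(Y.shape)  # (50, 3)
```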

SLIDE 10

Useful links

  • KDD II script: http://www.dbs.ifi.lmu.de/Lehre/KDD_II/WS1516/skript/KDD2-2-HDData.DimensionalityReduction.pdf
  • A tutorial about PCA: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
