SLIDE 1

Principal Components Analysis

David Benjamin, Broad DSDE Methods February 10, 2016

SLIDE 2

What is PCA?

PCA turns high-dimensional data into low-dimensional data by throwing out directions with low variance. Keep y, throw out x. Assumption: noise smaller than signal.

SLIDE 3

What about correlations?

PCA turns high-dimensional data into low-dimensional data by throwing out directions with low variance. Find the pink and green axes. Throw out the pink component. Resulting low-dimensional data is projection onto green axis.

SLIDE 4

Covariance matrix

$$\Sigma_{ij} = \frac{1}{N} \sum_n (x_{ni} - \mu_i)(x_{nj} - \mu_j) \neq 0 \text{ if } x_i \text{ and } x_j \text{ are correlated.}$$

Figure: $\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy} & \Sigma_{yy} \end{pmatrix}$ with $\Sigma_{xy} > 0$.

We want coordinates that make $\Sigma$ diagonal.
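A minimal numpy sketch (our addition, not from the slides): for correlated 2-D data the off-diagonal entry $\Sigma_{xy}$ is positive, and rotating into the eigenvector basis diagonalizes $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: y tracks x, so the off-diagonal Sigma_xy is positive.
x = rng.normal(size=1000)
y = x + 0.3 * rng.normal(size=1000)
X = np.column_stack([x, y])

Sigma = np.cov(X, rowvar=False, bias=True)  # the (1/N)-normalized covariance above
print(Sigma)                                # Sigma_xy > 0: x and y are correlated

# In the eigenvector basis (the "pink and green axes") the covariance is diagonal.
eigvals, eigvecs = np.linalg.eigh(Sigma)
Z = (X - X.mean(axis=0)) @ eigvecs
print(np.cov(Z, rowvar=False, bias=True))   # diagonal: approximately diag(eigvals)
```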
SLIDE 5

PCA recipe

Coordinates (principal components) that make $\Sigma$ diagonal are the eigenvectors of $\Sigma$.

PCA recipe:
• Calculate the covariance matrix $\Sigma$.
• Find eigenvectors $v_k$ and eigenvalues $\lambda_k$ such that $\Sigma v_k = \lambda_k v_k$; $\lambda_k$ is the variance in the $v_k$ direction.
• Use a heuristic to choose $K$ eigenvectors to keep.
• Data is now $K$-dimensional: $x \approx \mu + \sum_{k=1}^K c_k v_k$, where $c_k = (x - \mu) \cdot v_k$.
• Generative model: $x = \mu + \sum_{k=1}^K c_k v_k + \text{noise}$.
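A minimal numpy implementation of this recipe (a sketch; the function name `pca` is ours, not the slides'):

```python
import numpy as np

def pca(X, K):
    """The slide's recipe: covariance, eigenvectors, keep the top K."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)   # 1. covariance matrix Sigma
    lam, V = np.linalg.eigh(Sigma)               # 2. eigenpairs, ascending order
    order = np.argsort(lam)[::-1][:K]            # 3. heuristic: keep the K largest
    lam, V = lam[order], V[:, order]
    C = (X - mu) @ V                             # c_k = (x - mu) . v_k
    X_approx = mu + C @ V.T                      # x ~= mu + sum_k c_k v_k
    return mu, lam, V, C, X_approx

# Example usage on synthetic 5-D data, keeping 2 components.
X = np.random.default_rng(0).normal(size=(200, 5))
mu, lam, V, C, X_approx = pca(X, K=2)
```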

SLIDE 6

Eigenfaces

Pixel images are very high-dimensional vectors. Run PCA and look at the principal components... Not strictly “eigenfaces,” but eigen-variation in faces relative to the average face.

SLIDE 7

Eigenfaces

Pixel images are very high-dimensional vectors. Run PCA and look at the principal components... Clockwise from top left: full head of hair, sunken eyes, war paint, your interpretation goes here. Not strictly “eigenfaces,” but eigen-variation in faces relative to the average face.
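A sketch of the eigenfaces computation using scikit-learn's LFW face dataset (our choice of data; the slides don't say which images they used):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

# Each face image is flattened into one very high-dimensional row vector.
faces = fetch_lfw_people(min_faces_per_person=20)  # downloads on first use
n, h, w = faces.images.shape

pca = PCA(n_components=16).fit(faces.data)

# The "eigenfaces" are the principal components reshaped back into images:
# directions of variation relative to the mean face, not faces themselves.
fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for ax, component in zip(axes.ravel(), pca.components_):
    ax.imshow(component.reshape(h, w), cmap="gray")
    ax.axis("off")
plt.show()
```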

SLIDE 8

Eigenfaces

SLIDE 9

PCA map of Europe

Data: $x_{ni}$ = genotype (0, 1, 2) of SNP $i$ in person $n$.

SLIDE 10

PCA map of Europe

Applications:
• Classification / genealogy
• Population stratification in GWAS (regress against PCs)

SLIDE 11

PCA map of Europe

Applications:
• Classification / genealogy
• Population stratification in GWAS (regress against PCs)

Do the PCs correspond to the map suspiciously well? Why do the genes of a population migrating north keep going straight along the first PC? Why is Hungary-Austria parallel to Switzerland-France?
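A sketch of the stratification correction on synthetic genotypes (all data below is made up for illustration; real analyses use dedicated tools):

```python
import numpy as np

# Synthetic stand-in: X[n, i] = genotype (0, 1, 2) of SNP i in person n.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(500, 2000)).astype(float)
phenotype = rng.normal(size=500)

# Top PCs of the genotype matrix (via SVD of the centered data) proxy for ancestry.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :10] * s[:10]          # each person's coordinates on the top 10 PCs

# Stratification correction: regress the phenotype on a SNP *and* the PCs,
# so ancestry-driven confounding is absorbed by the PC covariates.
design = np.column_stack([np.ones(len(X)), X[:, 0], pcs])
beta, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
print("SNP 0 effect, adjusted for ancestry PCs:", beta[1])
```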

SLIDE 12

Copy number variation from exome capture

Crash course in exome capture:
• get DNA
• exon DNA hybridizes to baits; throw out remaining DNA
• sequence exon DNA

SLIDE 13

Copy number variation from exome capture

Crash course in exome capture:
• get DNA
• exon DNA hybridizes to baits; throw out remaining DNA
• sequence exon DNA

Copy number variation:
• align sequenced DNA to reference genome
• count the number of reads from each exon
• more (fewer) reads implies duplication (deletion)

SLIDE 14

Copy number variation from exome capture

SLIDE 15

Copy number variation from exome capture

SLIDE 16

Copy number variation from exome capture

$$x = \mu + \sum_k (v_k^\top x)\, v_k + \text{copy number signal}$$

$$\Rightarrow \text{copy number signal} = x - \mu - \sum_k (v_k^\top x)\, v_k$$

PCs v come from non-tumor samples with no CNVs!
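A sketch of this projection (the helper name `cnv_signal` is ours; we project the mean-centered profile, i.e. use $v_k^\top (x - \mu)$, and real pipelines also normalize read counts first):

```python
import numpy as np

def cnv_signal(sample_counts, normal_panel, K=5):
    """Residual after projecting out the top-K PCs of a panel of normals.
    The residual is the candidate copy-number signal, as on the slide."""
    mu = normal_panel.mean(axis=0)
    # PCs v_k of the non-tumor panel (assumed CNV-free), via SVD.
    _, _, Vt = np.linalg.svd(normal_panel - mu, full_matrices=False)
    V = Vt[:K].T                  # columns are v_1 ... v_K
    x = sample_counts - mu
    # signal = x - mu - sum_k (v_k^T (x - mu)) v_k
    return x - V @ (V.T @ x)
```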

SLIDE 17

Pitfalls

PCs might not be good for classification

SLIDE 18

Pitfalls

• PCs might not be good for classification
• Low-dimensional space might be non-linear

SLIDE 19

Pitfalls

• PCs might not be good for classification
• Low-dimensional space might be non-linear
• Non-issue: $\Sigma$ is a big matrix (use iterative PCA, FastPCA, flashpca, ...)
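For instance, scikit-learn's randomized solver (one option alongside the methods the slide names) finds the top PCs without ever forming $\Sigma$:

```python
import numpy as np
from sklearn.decomposition import PCA

# With 50,000 features, the covariance matrix would be 50,000 x 50,000;
# a randomized solver finds the top PCs directly from the data matrix.
X = np.random.default_rng(2).normal(size=(1000, 50000))
pca = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)
print(pca.components_.shape)   # (10, 50000): no giant covariance matrix built
```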

SLIDE 20

Generalizations

$x = \mu + \sum_k c_k v_k + \text{noise}$ is part of a larger model: probabilistic PCA.

SLIDE 21

Generalizations

• $x = \mu + \sum_k c_k v_k + \text{noise}$ is part of a larger model: probabilistic PCA.
• Don't like heuristics for choosing the number of PCs to use: Bayesian PCA.

SLIDE 22

Generalizations

• $x = \mu + \sum_k c_k v_k + \text{noise}$ is part of a larger model: probabilistic PCA.
• Don't like heuristics for choosing the number of PCs to use: Bayesian PCA.
• Data are not linear: nonlinear dimensionality reduction (tSNE, autoencoders, GPLVM, Isomap, SOM, ...)
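scikit-learn's PCA in fact fits the probabilistic PCA model of Tipping & Bishop under the hood; a small sketch on synthetic data matching the generative model:

```python
import numpy as np
from sklearn.decomposition import PCA

# A 3-dimensional signal embedded in 20 dimensions plus isotropic noise,
# i.e. exactly the generative model x = mu + sum_k c_k v_k + noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 20))
X += 0.1 * rng.normal(size=X.shape)

pca = PCA(n_components=3).fit(X)
print(pca.noise_variance_)         # fitted variance of the "+ noise" term
print(pca.score_samples(X[:5]))    # per-sample log-likelihood under this model
```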

SLIDE 23

Equations

Find the direction (unit vector) $v$ of greatest variance. The projection of $x$ onto $v$ is $x^\top v$.

$$\sigma^2 = \frac{1}{N} \sum_n \left(x_n^\top v - \mu^\top v\right)^2 = \frac{1}{N} \sum_n \left((x_n - \mu)^\top v\right)^2 = v^\top \left[\frac{1}{N} \sum_n (x_n - \mu)(x_n - \mu)^\top\right] v = v^\top \Sigma v$$

Set $\nabla_v = 0$ with a Lagrange multiplier enforcing $v^\top v = 1$:

$$\nabla_v \left[ v^\top \Sigma v + \lambda (1 - v^\top v) \right] = 0 \;\Rightarrow\; \Sigma v = \lambda v$$

Dotting with $v^\top$ gives $\lambda = \lambda v^\top v = v^\top \Sigma v = \sigma^2$.
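A quick numerical check of this derivation (our addition, not from the slides): the top eigenvalue equals the variance along the top eigenvector, and no other unit vector does better.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 2.0]], size=100000)
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)   # the 1/N covariance from the slide

lam, V = np.linalg.eigh(Sigma)
v = V[:, -1]                                  # eigenvector of the largest eigenvalue

# sigma^2 = (1/N) sum_n ((x_n - mu)^T v)^2 equals lambda, as derived above.
sigma2 = np.mean(((X - mu) @ v) ** 2)
print(sigma2, lam[-1])

# No other unit vector gives a larger projected variance.
for theta in np.linspace(0.0, np.pi, 7):
    u = np.array([np.cos(theta), np.sin(theta)])
    assert np.mean(((X - mu) @ u) ** 2) <= sigma2 + 1e-9
```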