PCA and Admixture proportions for low depth NGS data Anders - - PowerPoint PPT Presentation


PCA and Admixture proportions for low depth NGS data Anders Albrechtsen Structured populations analysis based on individual allele frequencies Analysis of low depth sequencing data Admixture proportions Individual allele frequencies


slide-1
SLIDE 1

PCA and Admixture ’proportions’ for low depth NGS data

Anders Albrechtsen

slide-2
SLIDE 2

Structured populations analysis based on individual allele frequencies

Analysis of low depth sequencing data

Figure: panels for admixture proportions, PCA, and individual allele frequencies (PCA); population labels thHan(40,919), alHan(80,714), SouthHan(20,969)

slide-3
SLIDE 3


Admixture clustering /PCA - which is more informative?

slide-4
SLIDE 4


This morning

  1. Structured populations: population structure and PCA
  2. Analysis based on individual allele frequencies: admixture proportions vs. PCA

slide-5
SLIDE 5


PCA for NGS: Ancient Eskimo (Rasmussen et al., 2010)

Figure: First principal components of selected populations.

slide-6
SLIDE 6


Genotype data

5 individuals, genotypes

        Ind1  Ind2  Ind3  Ind4  Ind5
SNP1    AG    AG    AG    AA    AA
SNP2    TT    TA    AA    AT    AA
SNP3    AA    AC    AC    CC    AC
SNP4    GG    GG    GC    CC    CC
SNP5    TT    TC    TC    CC    CC
SNP6    AA    AA    AC    AC    AC
SNP7    TT    TT    TC    TC    CC

G: 5 individuals, allele counts

        Ind1  Ind2  Ind3  Ind4  Ind5
SNP1    1     1     1     0     0
SNP2    0     1     2     1     2
SNP3    2     1     1     0     1
SNP4    0     0     1     2     2
SNP5    2     1     1     0     0
SNP6    0     0     1     1     1
SNP7    2     2     1     1     0
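
The step from genotypes to allele counts can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name `allele_counts` and the list `counted` (which allele is counted at each SNP) are assumptions, with the counted alleles inferred from the counts shown on the slide.

```python
import numpy as np

# Genotypes from the slide: rows are SNPs, columns are individuals 1-5.
genotypes = [
    ["AG", "AG", "AG", "AA", "AA"],
    ["TT", "TA", "AA", "AT", "AA"],
    ["AA", "AC", "AC", "CC", "AC"],
    ["GG", "GG", "GC", "CC", "CC"],
    ["TT", "TC", "TC", "CC", "CC"],
    ["AA", "AA", "AC", "AC", "AC"],
    ["TT", "TT", "TC", "TC", "CC"],
]

# Allele counted at each SNP (assumption: inferred from the slide's counts).
counted = ["G", "A", "A", "C", "T", "C", "T"]


def allele_counts(genotypes, counted):
    """Count copies (0, 1 or 2) of the chosen allele in every genotype."""
    return np.array(
        [[g.count(a) for g in row] for row, a in zip(genotypes, counted)]
    )


G = allele_counts(genotypes, counted)  # shape: (7 SNPs, 5 individuals)
print(G)
```

Which allele is counted at a site does not change distances between individuals, since swapping the allele maps every count g to 2 - g.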

slide-7
SLIDE 7


singular value decomposition

SVD (singular value decomposition): $G = U D V^T$

  • G (genotype matrix) does not have to be symmetric

PCA for a covariance matrix (C) or pairwise distance matrix: $C = V \sqrt{D} V^T$

  • The first principal component/eigenvector accounts for as much of the variability in the data as possible
  • C is symmetric
  • Optimally the multidimensional data is identically distributed
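
The link between the two decompositions can be checked numerically. The sketch below is an illustration under assumptions: it uses the allele counts from the genotype example (with zeros inferred from the genotype table), centers each site, and takes C as the plain cross-product of the centered data, which is one possible choice of symmetric matrix.

```python
import numpy as np

# Allele counts from the genotype example (rows: SNPs, columns: individuals).
# Zero entries are an assumption: they are inferred from the genotype table.
G = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 2, 1, 2],
    [2, 1, 1, 0, 1],
    [0, 0, 1, 2, 2],
    [2, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [2, 2, 1, 1, 0],
], dtype=float)

X = G.T                  # individuals x sites
X = X - X.mean(axis=0)   # center every site

# SVD works directly on the (non-symmetric) data matrix.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition works on the symmetric matrix C = X X^T.
C = X @ X.T
evals, evecs = np.linalg.eigh(C)            # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # largest eigenvalue first

pc1_svd = U[:, 0]      # first left singular vector
pc1_eig = evecs[:, 0]  # first eigenvector of C

print(np.allclose(d ** 2, evals, atol=1e-8))  # squared singular values
print(abs(pc1_svd @ pc1_eig))                 # same axis up to sign
```

The sign of a principal component is arbitrary, so the two vectors are compared through the absolute value of their dot product.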
slide-8
SLIDE 8


IBS distances

5 individuals, allele counts

        Ind1  Ind2  Ind3  Ind4  Ind5
SNP1    1     1     1     0     0
SNP2    0     1     2     1     2
SNP3    2     1     1     0     1
SNP4    0     0     1     2     2
SNP5    2     1     1     0     0
SNP6    0     0     1     1     1
SNP7    2     2     1     1     0

Total distance

        Ind1  Ind2  Ind3  Ind4  Ind5
Ind1    0     3     7     10    11
Ind2    3     0     4     7     8
Ind3    7     4     0     5     4
Ind4    10    7     5     0     3
Ind5    11    8     4     3     0

1-dimensional projection

        Ind1   Ind2   Ind3   Ind4   Ind5
1st     0.65   0.36   -0.08  -0.4   -0.53
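
The total distances can be reproduced by summing absolute differences in allele counts over sites. The sketch below does that, then projects the distance matrix onto one axis with classical multidimensional scaling. The slide does not say how its projection was computed, so the MDS step is an assumption and the values need not match the slide exactly; the function names are hypothetical.

```python
import numpy as np

# Allele counts (rows: SNPs, columns: individuals); zeros are inferred
# from the genotype table and are therefore an assumption.
G = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 2, 1, 2],
    [2, 1, 1, 0, 1],
    [0, 0, 1, 2, 2],
    [2, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [2, 2, 1, 1, 0],
])


def ibs_distance(G):
    """Sum over sites of |g_i - g_j| for every pair of individuals."""
    return np.abs(G[:, :, None] - G[:, None, :]).sum(axis=0)


def principal_coordinates(D, n_axes=1):
    """Classical MDS: eigenvectors of the double-centered squared distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D.astype(float) ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_axes]  # largest eigenvalues first
    return evecs[:, order]


D = ibs_distance(G)
axis1 = principal_coordinates(D)[:, 0]
print(D)
print(axis1)
```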
slide-9
SLIDE 9


Approximation of the genotype covariance

  • $M$: number of sites
  • $G$: genotypes
  • $G^j$: genotypes for individual j
  • $G^j_k$: genotype for site k in individual j
  • $f_k$: allele frequency for site k

Known genotypes: covariance between individuals i and j (Patterson N, Price AL, Reich D, PLoS Genet. 2006)

$$\mathrm{cov}(G^i, G^j) = \frac{1}{M} \sum_{k=1}^{M} \frac{(G^i_k - 2f_k)(G^j_k - 2f_k)}{2f_k(1 - f_k)} = \frac{1}{M} \tilde{G} \tilde{G}^T$$

$$\tilde{G}^i_k = \frac{G^i_k - 2f_k}{\sqrt{2f_k(1 - f_k)}}, \qquad \mathrm{var}(G_k) = 2f_k(1 - f_k)$$
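
A minimal sketch of this covariance, assuming the allele frequency $f_k$ is estimated as the sample mean of the genotypes divided by two (the slide does not say how $f_k$ is obtained). The function name `genotype_covariance` is hypothetical, and monomorphic sites are dropped because their variance is zero.

```python
import numpy as np


def genotype_covariance(X):
    """Covariance between individuals from an (individuals x sites) 0/1/2 matrix."""
    X = np.asarray(X, dtype=float)
    f = X.mean(axis=0) / 2.0       # assumption: sample allele frequency
    keep = (f > 0) & (f < 1)       # monomorphic sites have zero variance
    X, f = X[:, keep], f[keep]
    X_tilde = (X - 2 * f) / np.sqrt(2 * f * (1 - f))  # standardized genotypes
    M = X.shape[1]
    return X_tilde @ X_tilde.T / M


# Allele counts from the genotype example, individuals x sites
# (zeros are inferred from the genotype table).
X = np.array([
    [1, 0, 2, 0, 2, 0, 2],
    [1, 1, 1, 0, 1, 0, 2],
    [1, 2, 1, 1, 1, 1, 1],
    [0, 1, 0, 2, 0, 1, 1],
    [0, 2, 1, 2, 0, 1, 0],
], dtype=float)

C = genotype_covariance(X)
print(np.round(C, 3))
```

Because every site is centered at its own sample mean, each row of C sums to zero, which is a quick sanity check on an implementation.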

slide-10
SLIDE 10


Dealing with Missingness

Covariance matrix - Eigensoft (Patterson N, Price AL, Reich D, PLoS Genet. 2006)

If a genotype is missing then $\tilde{G}^i_k$ is set to zero

  • $E[\tilde{G}^i_k] = 0$ for a random individual
  • $E[\mathrm{cov}(G^i, G^j)] \neq 0$, i.e. relatedness or population structure
  • Or a site is discarded
  • Not possible for large samples
  • Will likely cause bias

IBS matrix: the site is skipped for the pair of individuals

  • Missingness must be random
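
The two ways of handling missing genotypes can be sketched side by side. This is an illustration under assumptions: missing genotypes are coded as NaN, allele frequencies come from the observed genotypes only, and the pairwise IBS distance is averaged over the sites both individuals have, so that pairs with different amounts of missing data stay comparable. The function names are hypothetical.

```python
import numpy as np


def covariance_zero_fill(X):
    """Set the standardized genotype to zero wherever the genotype is missing."""
    X = np.asarray(X, dtype=float)
    f = np.nanmean(X, axis=0) / 2.0            # frequency from observed data
    X_tilde = (X - 2 * f) / np.sqrt(2 * f * (1 - f))
    X_tilde = np.where(np.isnan(X_tilde), 0.0, X_tilde)
    return X_tilde @ X_tilde.T / X.shape[1]


def ibs_distance_pairwise(X):
    """Skip a site for a pair if either individual is missing at that site."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            D[i, j] = np.abs(X[i, both] - X[j, both]).mean()
    return D


# Allele counts from the genotype example (individuals x sites),
# with two genotypes masked purely for illustration.
X = np.array([
    [1, 0, 2, 0, 2, 0, 2],
    [1, 1, 1, 0, 1, 0, 2],
    [1, 2, 1, 1, 1, 1, 1],
    [0, 1, 0, 2, 0, 1, 1],
    [0, 2, 1, 2, 0, 1, 0],
], dtype=float)
X[0, 1] = np.nan
X[3, 4] = np.nan

C = covariance_zero_fill(X)
D = ibs_distance_pairwise(X)
print(np.round(C, 3))
print(np.round(D, 3))
```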
slide-11
SLIDE 11


Population frequencies cause depth bias

Figure: PC 1 vs. PC 2 for Pop 1, Pop 2 and Pop 3 in four panels (known genotypes; called genotypes; E(G|f); NGSadmix/admixRelate), shown together with admixture proportions (0.0 to 1.0) and average depth per individual (5 to 20).

slide-12
SLIDE 12


NGS framework for heterogeneous samples

Admixture-aware priors: instead of a single allele frequency, we use a different prior for each individual.

Admixture proportion priors: individual allele frequency at site i: $\pi_i = q^1_i f^1_i + q^2_i f^2_i + \dots + q^k_i f^k_i$

PCA-based priors are also possible: the individual allele frequency is predicted from the PCA.
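
A small sketch of how an individual allele frequency can act as a prior on genotypes. Assumptions not stated on the slide: the prior on the genotype given the individual frequency follows Hardy-Weinberg proportions, and the genotype likelihoods below are made-up placeholders. All function names are hypothetical.

```python
import numpy as np


def individual_frequencies(Q, F):
    """Q: (individuals x K) admixture proportions, F: (sites x K) frequencies."""
    return Q @ F.T  # (individuals x sites)


def hwe_prior(pi):
    """P(G = 0, 1, 2) given individual allele frequency pi (assumed HWE)."""
    pi = np.asarray(pi, dtype=float)
    return np.stack([(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2], axis=-1)


def genotype_posterior(likelihoods, pi):
    """Combine genotype likelihoods with the individual-specific prior."""
    post = likelihoods * hwe_prior(pi)
    return post / post.sum(axis=-1, keepdims=True)


# Two individuals, K = 2 ancestral populations, three sites (toy numbers).
Q = np.array([[1.0, 0.0],
              [0.5, 0.5]])
F = np.array([[0.1, 0.9],
              [0.2, 0.2],
              [0.7, 0.3]])
Pi = individual_frequencies(Q, F)

# Placeholder likelihoods: completely uninformative reads at every site.
GL = np.full((2, 3, 3), 1.0 / 3.0)
post = genotype_posterior(GL, Pi)
expected_G = post @ np.array([0.0, 1.0, 2.0])
print(Pi)
print(expected_G)
```

With uninformative likelihoods the posterior falls back to the prior, so the expected genotype equals twice the individual allele frequency.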

slide-13
SLIDE 13


individual allele frequencies from PCA

Figure: PCAngsd framework

slide-14
SLIDE 14


1000 Genomes - true genotypes

Figure: 1000 Genomes data

slide-15
SLIDE 15


1000 Genomes - called genotypes from low depth

Figure: 1000 Genomes data

slide-16
SLIDE 16


1000 Genomes - Genotype likelihood with frequency prior

Figure: 1000 Genomes data

slide-17
SLIDE 17


1000 Genomes - Genotype likelihood with individual frequency prior

Figure: 1000 Genomes data

slide-18
SLIDE 18


Admixture vs. PCA

Indirect goal of both ADMIXTURE and PCA: to predict the individual allele frequencies $\Pi$ from lower-dimensional matrices. $E(G) = 2\Pi$

ADMIXTURE

  • K = N-1: $G = 2QF^T$
  • K low: $\Pi \approx QF^T$

PCA

  • K = N-1: $\tilde{G} = UDV^T$
  • K low: $\tilde{\Pi} \approx U_{[1:K]} D V^T_{[1:K]}$

ADMIXTURE → PCA

$$\mathrm{cov}(\tilde{G}^i, \tilde{G}^j) = \frac{1}{M} \sum_{m=1}^{M} \frac{(\Pi^i_m - f_m)(\Pi^j_m - f_m)}{f_m(1 - f_m)} = \frac{1}{M} \tilde{G} \tilde{G}^T$$

PCA → ADMIXTURE

$$\operatorname{argmin}_{Q,F} \lVert \Pi - QF^T \rVert_F^2$$

Solved with NMF with penalty
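
The relation between the two low-dimensional models can be illustrated on simulated data. In the sketch below, Q and F are drawn at random (an assumption, purely for illustration), individual frequencies are built as $\Pi = QF^T$, and a truncated SVD of the site-centered $\Pi$ recovers them. Because the rows of Q sum to one, centering removes one dimension, so K populations need only K-1 components. The penalized NMF step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 20, 200, 3

# Simulated admixture model (toy values, not real data).
Q = rng.dirichlet(np.ones(K), size=N)      # individuals x K, rows sum to 1
F = rng.uniform(0.05, 0.95, size=(M, K))   # sites x K ancestral frequencies
Pi = Q @ F.T                               # individual allele frequencies

f = Pi.mean(axis=0)                        # average frequency per site
Pi_centered = Pi - f


def truncated_svd(X, k):
    """Best rank-k approximation of X in the Frobenius norm."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]


# ADMIXTURE -> PCA: covariance implied by the individual allele frequencies.
Pi_tilde = Pi_centered / np.sqrt(f * (1 - f))
C = Pi_tilde @ Pi_tilde.T / M

# PCA -> individual allele frequencies: K - 1 components are enough here.
Pi_hat = f + truncated_svd(Pi_centered, K - 1)
print(np.abs(Pi_hat - Pi).max())
```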

slide-19
SLIDE 19


140K Chinese ultra-low-depth genomes

PCA colored by province

Figure: flash PCAngsd

slide-20
SLIDE 20


Selection scan from PCA for NGS data

FastPCA test statistic from Galinsky et al. (2016):

$$\frac{M}{D_k^2} (2\tilde{\Pi}_m V_k)^2 \sim \chi^2$$

Selection scan in >140k Han Chinese with low depth sequencing (< 0.1X)

Figure: population labels thHan(40,919), alHan(80,714), SouthHan(20,969)
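
A rough sketch of a PC-based selection statistic of this form. Assumptions: the input is an already standardized (individuals x sites) matrix, the individual-space singular vector plays the role of $V_k$, and p-values use one degree of freedom as in Galinsky et al. (2016); the slide does not state the degrees of freedom. The data are random noise and the function names are hypothetical.

```python
import math

import numpy as np


def pc_selection_statistics(X, n_pcs):
    """Per-site statistic (M / D_k^2) * (x_m . u_k)^2 for the first n_pcs PCs."""
    N, M = X.shape
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    proj = U[:, :n_pcs].T @ X              # (n_pcs x sites) projections
    return (M / d[:n_pcs, None] ** 2) * proj ** 2


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


rng = np.random.default_rng(1)
X = rng.standard_normal((30, 500))         # toy data without any selection
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize every site

stats = pc_selection_statistics(X, n_pcs=2)
pvalues = np.array([[chi2_sf_1df(s) for s in row] for row in stats])
print(stats.mean(axis=1))
print(pvalues.min())
```

By construction the statistic averages to one across sites for every component, which matches the mean of a chi-square variable with one degree of freedom.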

slide-21
SLIDE 21


The end

Conclusion

  • Calling genotypes can cause major bias for PCA and admixture analysis
  • Using genotype likelihoods instead can solve the problems
  • Admixture analysis and PCA are related and can both be used to estimate individual allele frequencies
  • Individual allele frequencies can be used for selection scans