PCA and Admixture proportions for low depth NGS data
Anders Albrechtsen
Structured populations analysis based on individual allele frequencies
Analysis of low depth sequencing data
Admixture proportions PCA Individual allele frequencies (PCA)
Admixture clustering vs. PCA: which is more informative?
This morning
1. Structured populations: population structure and PCA
2. Analysis based on individual allele frequencies: admixture proportions vs. PCA
PCA for NGS: Ancient Eskimo (Rasmussen et al., 2010)
Figure: First principal components of selected populations.
Genotype data

5 individuals, genotypes

      Ind1  Ind2  Ind3  Ind4  Ind5
SNP1  AG    AG    AG    AA    AA
SNP2  TT    TA    AA    AT    AA
SNP3  AA    AC    AC    CC    AC
SNP4  GG    GG    GC    CC    CC
SNP5  TT    TC    TC    CC    CC
SNP6  AA    AA    AC    AC    AC
SNP7  TT    TT    TC    TC    CC

G: 5 individuals, allele counts

      Ind1  Ind2  Ind3  Ind4  Ind5
SNP1  1     1     1     0     0
SNP2  0     1     2     1     2
SNP3  2     1     1     0     1
SNP4  0     0     1     2     2
SNP5  2     1     1     0     0
SNP6  0     0     1     1     1
SNP7  2     2     1     1     0
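The step from genotypes to allele counts can be sketched in a few lines of Python. This is only an illustration: the choice of which allele is counted at each SNP (`counted`) is an assumption inferred from the counts on the slide, and the helper name `allele_counts` is made up for this example.

```python
import numpy as np

# Genotypes from the slide: rows are SNP1..SNP7, columns are Ind1..Ind5.
genotypes = [
    "AG AG AG AA AA",
    "TT TA AA AT AA",
    "AA AC AC CC AC",
    "GG GG GC CC CC",
    "TT TC TC CC CC",
    "AA AA AC AC AC",
    "TT TT TC TC CC",
]
# Assumption: the allele counted at each SNP, inferred from the slide's counts.
counted = ["G", "A", "A", "C", "T", "C", "T"]

def allele_counts(rows, alleles):
    """Return a sites x individuals matrix with the number of copies
    (0, 1 or 2) of the chosen allele in each genotype."""
    return np.array(
        [[g.count(a) for g in row.split()] for row, a in zip(rows, alleles)]
    )

G = allele_counts(genotypes, counted)
print(G)
```

Counting the other allele at a site would flip that row (2 minus the count), which changes the sign of the centred values but not the distances or covariances between individuals.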
Singular value decomposition
SVD (singular value decomposition): $G = U D V^T$
- G (genotype matrix) does not have to be symmetric
PCA for a covariance matrix (C) or pairwise distance: $C = V \sqrt{D} V^T$
- The first principal component/eigenvector accounts for as much of the variability in the data as possible
- C is symmetric
- Optimally the multidimensional data is identically distributed
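The link between the two decompositions can be checked numerically. The sketch below is an illustration rather than the method used in the talk: it centres the example allele counts per site (an assumption), takes the SVD of the centred matrix, and compares it with the eigendecomposition of the symmetric matrix `X @ X.T`.

```python
import numpy as np

# Allele counts from the example, arranged as individuals x sites.
G = np.array([
    [1, 0, 2, 0, 2, 0, 2],
    [1, 1, 1, 0, 1, 0, 2],
    [1, 2, 1, 1, 1, 1, 1],
    [0, 1, 0, 2, 0, 1, 1],
    [0, 2, 1, 2, 0, 1, 0],
], dtype=float)

X = G - G.mean(axis=0)                             # centre each site
U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) Vt

C = X @ X.T                                        # symmetric N x N matrix
evals, evecs = np.linalg.eigh(C)                   # eigenvalues in ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]         # reorder to descending

print(np.round(d ** 2, 3))   # squared singular values of X
print(np.round(evals, 3))    # eigenvalues of X X^T
```

In this sketch the eigenvalues of the symmetric matrix equal the squared singular values of `X`, and the leading eigenvector matches the first column of `U` up to sign.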
IBS distances

5 individuals, allele counts

      Ind1  Ind2  Ind3  Ind4  Ind5
SNP1  1     1     1     0     0
SNP2  0     1     2     1     2
SNP3  2     1     1     0     1
SNP4  0     0     1     2     2
SNP5  2     1     1     0     0
SNP6  0     0     1     1     1
SNP7  2     2     1     1     0

Total distance

      Ind1  Ind2  Ind3  Ind4  Ind5
Ind1  0     3     7     10    11
Ind2  3     0     4     7     8
Ind3  7     4     0     5     4
Ind4  10    7     5     0     3
Ind5  11    8     4     3     0

1 dimensional projection

      Ind1  Ind2  Ind3   Ind4  Ind5
1st   0.65  0.36  -0.08  -0.4  -0.53
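The total distances can be reproduced by summing absolute differences in allele counts over sites. The one-dimensional projection in the sketch below uses classical multidimensional scaling on squared distances, which is an assumption: the slide does not say which transformation was used, so the resulting values need not match the slide's numbers exactly, and the sign of an eigenvector is arbitrary.

```python
import numpy as np

# Allele counts from the example, arranged as individuals x sites.
G = np.array([
    [1, 0, 2, 0, 2, 0, 2],
    [1, 1, 1, 0, 1, 0, 2],
    [1, 2, 1, 1, 1, 1, 1],
    [0, 1, 0, 2, 0, 1, 1],
    [0, 2, 1, 2, 0, 1, 0],
])

# Total IBS distance: sum over sites of |G_i - G_j| for every pair (i, j).
D = np.abs(G[:, None, :] - G[None, :, :]).sum(axis=2)
print(D)

# Classical MDS (assumed here): double-centre the squared distances,
# then take the eigenvector with the largest eigenvalue.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
evals, evecs = np.linalg.eigh(B)      # ascending eigenvalues
pc1 = evecs[:, -1]                    # unit-norm first axis
print(np.round(pc1, 2))
```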
Approximation of the genotype covariance
- $M$: number of sites
- $G$: genotypes
- $G^j$: genotypes for individual j
- $G^j_k$: genotype for site k in individual j
- $f_k$: allele frequency for site k

Known genotypes: covariance between individuals i and j (Patterson N, Price AL, Reich D, PLoS Genet. 2006)

$$\mathrm{cov}(G^i, G^j) = \frac{1}{M} \sum_{k=1}^{M} \frac{(G^i_k - 2f_k)(G^j_k - 2f_k)}{2f_k(1 - f_k)} = \frac{1}{M} \tilde{G} \tilde{G}^T$$

$$\tilde{G}^i_k = \frac{G^i_k - 2f_k}{\sqrt{2f_k(1 - f_k)}}, \qquad \mathrm{var}(G_k) = 2f_k(1 - f_k)$$
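As a minimal sketch, the covariance above can be computed directly with NumPy. The function name `genotype_covariance` is hypothetical, and using the sample allele frequencies as $f_k$ is an assumption made for this example.

```python
import numpy as np

def genotype_covariance(G, f):
    """Covariance between individuals from known genotypes.

    G: individuals x sites matrix of allele counts (0, 1, 2).
    f: per-site allele frequencies, strictly between 0 and 1.
    """
    G_tilde = (G - 2 * f) / np.sqrt(2 * f * (1 - f))   # standardised genotypes
    M = G.shape[1]                                     # number of sites
    return G_tilde @ G_tilde.T / M

# Allele counts from the example, arranged as individuals x sites.
G = np.array([
    [1, 0, 2, 0, 2, 0, 2],
    [1, 1, 1, 0, 1, 0, 2],
    [1, 2, 1, 1, 1, 1, 1],
    [0, 1, 0, 2, 0, 1, 1],
    [0, 2, 1, 2, 0, 1, 0],
], dtype=float)

f = G.mean(axis=0) / 2        # assumption: sample allele frequencies
C = genotype_covariance(G, f)
print(np.round(C, 2))
```

Sites with a frequency of exactly 0 or 1 would give a zero denominator, so they would have to be filtered out before calling the function.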
Dealing with Missingness
Covariance matrix, Eigensoft (Patterson N, Price AL, Reich D, PLoS Genet. 2006)
If a genotype is missing then $\tilde{G}^i_k$ is set to zero
- $E[\tilde{G}^i_k] = 0$ for a random individual
- $E[\mathrm{cov}(G^i, G^j)] = 0$, i.e. the value expected with no relatedness or population structure
Or a site is discarded
- Not possible for large samples
- Will likely cause bias
IBS matrix: the site is skipped for the pair of individuals
- Missingness must be random
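The two strategies can be sketched as follows. This is an illustration under stated assumptions, not Eigensoft's code: missing genotypes are encoded as `NaN`, allele frequencies are estimated from the observed genotypes, and the IBS distance is averaged over the sites compared for each pair so that pairs with different numbers of usable sites stay comparable. Both function names are hypothetical.

```python
import numpy as np

def covariance_with_missing(G, f):
    """Covariance where a missing standardised genotype is set to zero."""
    G_tilde = (G - 2 * f) / np.sqrt(2 * f * (1 - f))
    G_tilde = np.where(np.isnan(G_tilde), 0.0, G_tilde)   # missing -> 0
    return G_tilde @ G_tilde.T / G.shape[1]

def ibs_distance_with_missing(G):
    """Mean IBS distance per pair, skipping sites missing in either individual."""
    n = G.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            both = ~np.isnan(G[i]) & ~np.isnan(G[j])      # sites usable for this pair
            D[i, j] = np.abs(G[i, both] - G[j, both]).mean()
    return D

# Example allele counts (individuals x sites) with one genotype made missing.
G = np.array([
    [1, 0, 2, 0, 2, 0, 2],
    [1, 1, 1, 0, 1, 0, 2],
    [1, 2, 1, 1, 1, 1, 1],
    [0, 1, 0, 2, 0, 1, 1],
    [0, 2, 1, 2, 0, 1, 0],
], dtype=float)
G[0, 3] = np.nan                      # Ind1 is missing at SNP4

f = np.nanmean(G, axis=0) / 2         # assumption: frequencies from observed data
C = covariance_with_missing(G, f)
D = ibs_distance_with_missing(G)
print(np.round(C, 2))
print(np.round(D, 2))
```

With low depth data the amount of missingness tends to follow sequencing depth rather than being random, which is the situation where both shortcuts can mislead.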
Population frequencies cause depth bias
Figure: admixture proportions (0.0 to 1.0) and PC 1 vs. PC 2 for Pop 1, Pop 2 and Pop 3, with a legend for average depth per individual (5 to 20). PCA panels: known genotypes; called genotypes; E(G|f); NGSadmix/admixRelate.
NGS framework for heterogeneous samples
Admixture aware priors
Instead of a single allele frequency we will use a different prior for each individual.
Admixture proportions priors
Individual allele frequency at site i: $\pi_i = q_i^1 f_i^1 + q_i^2 f_i^2 + \dots + q_i^k f_i^k$
PCA based priors are also possible
Individual allele frequency predicted from the PCA
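A small sketch of how an individual-specific prior could be used with genotype likelihoods. All numbers are hypothetical. The Hardy-Weinberg form of the genotype prior is an assumption made for this example, and the function names `hwe_prior` and `posterior_expected_genotype` are made up; this is not the NGSadmix or PCAngsd implementation.

```python
import numpy as np

# Hypothetical values: 2 individuals, K = 2 populations, 3 sites.
Q = np.array([[1.0, 0.0],
              [0.5, 0.5]])            # admixture proportions, individuals x K
F = np.array([[0.1, 0.9],
              [0.4, 0.2],
              [0.7, 0.3]])            # population allele frequencies, sites x K

# Individual allele frequencies: weighted sum of population frequencies.
Pi = Q @ F.T                          # individuals x sites

def hwe_prior(pi):
    """Genotype prior P(G = 0, 1, 2) given an allele frequency.
    Assumption: Hardy-Weinberg proportions."""
    return np.stack([(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2], axis=-1)

def posterior_expected_genotype(gl, pi):
    """Posterior mean genotype from genotype likelihoods and a prior frequency.

    gl: likelihoods P(data | G = 0, 1, 2), shape individuals x sites x 3.
    pi: allele frequencies used as prior, shape individuals x sites.
    """
    post = gl * hwe_prior(pi)
    post /= post.sum(axis=-1, keepdims=True)
    return post @ np.array([0.0, 1.0, 2.0])

# With completely uninformative likelihoods the posterior falls back on the prior.
flat_gl = np.ones((2, 3, 3))
E = posterior_expected_genotype(flat_gl, Pi)
print(Pi)
print(E)
```

When the data carry no information, the expected genotype is twice the individual allele frequency, so an individual with little data is placed according to its own ancestry rather than the sample average.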
individual allele frequencies from PCA
Figure: PCAngsd framework
1000 Genomes - true genotypes
Figure: 1000 Genomes data
1000 Genomes - called genotypes from low depth
Figure: 1000 Genomes data
1000 Genomes - Genotype likelihood with frequency prior
Figure: 1000 Genomes data
1000 Genomes - Genotype likelihood with individual frequency prior
Figure: 1000 Genomes data
Admixture vs. PCA
Indirect goal of both ADMIXTURE and PCA: to predict the individual allele frequencies $\Pi$ from lower dimensional matrices. $E(G) = 2\Pi$

ADMIXTURE
- $K = N-1$: $G = 2QF^T$
- $K$ low: $\Pi \approx QF^T$

PCA
- $K = N-1$: $\tilde{G} = UDV^T$
- $K$ low: $\tilde{\Pi} \approx U_{[1:K]} D V^T_{[1:K]}$

ADMIXTURE → PCA

$$\mathrm{cov}(\tilde{G}^i, \tilde{G}^j) = \frac{1}{M} \sum_{m=1}^{M} \frac{(\Pi^i_m - f_m)(\Pi^j_m - f_m)}{f_m(1 - f_m)} = \frac{1}{M} \tilde{G} \tilde{G}^T$$

PCA → ADMIXTURE

$$\operatorname{argmin}_{Q,F} \lVert \Pi - QF^T \rVert_F^2$$

Solved with NMF with penalty
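The PCA side of this comparison can be sketched with a truncated SVD. This is an illustration under assumptions, not the PCAngsd algorithm: it starts from known genotypes, standardises with sample allele frequencies, maps the low-rank reconstruction back to the frequency scale through $E(G) = 2\Pi$, and clips the result to [0, 1]. The function name is hypothetical.

```python
import numpy as np

def individual_frequencies_from_pca(G, K):
    """Estimate individual allele frequencies from the top K components.

    G: individuals x sites matrix of allele counts; every site must be polymorphic.
    """
    f = G.mean(axis=0) / 2                         # sample allele frequencies
    scale = np.sqrt(2 * f * (1 - f))
    G_tilde = (G - 2 * f) / scale                  # standardised genotypes
    U, d, Vt = np.linalg.svd(G_tilde, full_matrices=False)
    low_rank = (U[:, :K] * d[:K]) @ Vt[:K]         # rank-K reconstruction
    Pi = (low_rank * scale + 2 * f) / 2            # back to frequencies, E(G) = 2 Pi
    return np.clip(Pi, 0.0, 1.0)                   # assumption: keep within [0, 1]

# Allele counts from the example, arranged as individuals x sites.
G = np.array([
    [1, 0, 2, 0, 2, 0, 2],
    [1, 1, 1, 0, 1, 0, 2],
    [1, 2, 1, 1, 1, 1, 1],
    [0, 1, 0, 2, 0, 1, 1],
    [0, 2, 1, 2, 0, 1, 0],
], dtype=float)

Pi_low = individual_frequencies_from_pca(G, K=1)
Pi_full = individual_frequencies_from_pca(G, K=G.shape[0] - 1)
print(np.round(Pi_low, 2))
print(np.round(2 * Pi_full, 2))
```

With K = N-1 the reconstruction returns the genotypes themselves (twice the estimated frequencies equals G), while a low K smooths each individual towards the structure captured by the leading components.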
140K Chinese ultra low depth genomes
PCA colored by province
Figure: flash PCAngsd
Selection scan from PCA for NGS data
FastPCA test statistic from Galinsky et al. (2016):

$$\frac{M}{D_k^2} \left(2\tilde{\Pi}_m V_k\right)^2 \sim \chi^2$$

Selection scan in >140k Han Chinese with low depth sequencing (< 0.1X)
Figure legend: thHan (40,919), alHan (80,714), SouthHan (20,969)
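A rough sketch of a statistic of this form on simulated data. Several things here are assumptions: the data are simulated without structure or selection, standardised genotypes stand in for the $2\tilde{\Pi}$ term, the matrix is arranged as sites x individuals so that $V_k$ is a unit vector over individuals, and the chi-square has one degree of freedom. None of this is taken from the FastPCA or PCAngsd code.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_ind = 500, 40

# Simulated genotypes, sites x individuals (no structure, no selection).
true_f = rng.uniform(0.1, 0.9, size=n_sites)
G = rng.binomial(2, true_f[:, None], size=(n_sites, n_ind)).astype(float)

# Standardise with sample frequencies; drop sites that are not polymorphic.
fs = G.mean(axis=1) / 2
keep = (fs > 0) & (fs < 1)
G, fs = G[keep], fs[keep]
M = G.shape[0]
X = (G - 2 * fs[:, None]) / np.sqrt(2 * fs * (1 - fs))[:, None]

# SVD: X = U diag(d) Vt; row k of Vt is principal component k over individuals.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

k = 0
stat = (M / d[k] ** 2) * (X @ Vt[k]) ** 2       # one statistic per site

# Upper tail of a chi-square with 1 degree of freedom: erfc(sqrt(x / 2)).
pvals = np.array([math.erfc(math.sqrt(s / 2)) for s in stat])
print(stat[:5])
print(pvals[:5])
```

Because `X @ Vt[k]` equals `d[k] * U[:, k]`, the statistic is `M * U[:, k] ** 2`, that is, the squared site loading rescaled so that its average over sites is 1.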
The end
Conclusion
- Calling genotypes can cause major bias for PCA and Admixture analysis
- Using genotype likelihoods instead can solve the problems
- Admixture analysis and PCA are related and can both be used to estimate individual allele frequencies
- Individual allele frequencies can be used for selection scans