SLIDE 1

An Application of Principal Components Analysis in Genetics

Samuel Morrissette April 14, 2020

Samuel Morrissette PCA in Genetics April 14, 2020 1 / 22

SLIDE 2

1 Background and Terminology
2 Eigenstrat Algorithm and Definitions
3 Results
4 Example - Bovine Data
5 Conclusion

SLIDE 3

Section 1 Background and Terminology

SLIDE 4

Genetic Association Studies

Genetic association studies:

◮ Test for an association between certain genetic variants (alleles) and a particular disease or trait.

◮ Are frequently conducted through a case-control study.

Expected occurrence of alleles in case group vs. control group

SLIDE 5

Population Stratification

Population stratification refers to the differences in allele frequencies arising from systematic ancestral differences. Case-control studies may be confounded by population stratification.

◮ Overrepresentation of a population in the case or control group can result in spurious associations.
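The confounding mechanism can be made concrete with a small calculation. The sketch below uses purely hypothetical allele frequencies and sample counts (none of these numbers come from the slides) and shows how the expected allele frequency differs between cases and controls even though the allele has no effect on disease.

```r
# Hypothetical allele frequencies in two ancestral populations (assumed values).
freq <- c(popA = 0.7, popB = 0.3)

# Hypothetical sample composition: popA is overrepresented among cases.
cases    <- c(popA = 80, popB = 20)
controls <- c(popA = 20, popB = 80)

# Expected allele frequency in a group is the sample-size-weighted average
# of the population frequencies.
expected_freq <- function(n, p) sum(n * p) / sum(n)

case_freq    <- expected_freq(cases, freq)
control_freq <- expected_freq(controls, freq)

# A nonzero difference arises purely from ancestry, not from any causal effect.
freq_difference <- case_freq - control_freq
```

With balanced ancestry in both groups the difference would be zero, which is why the imbalance alone is enough to produce a spurious signal.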

SLIDE 6

Correcting for Population Stratification

Avoiding population stratification is difficult and likely unrealistic.
Correcting for population stratification is more realistic.

◮ Genomic control and structured association were two of the most common methods
◮ Eigenstrat, proposed by Price et al. in 2006, has since become the prevailing approach

SLIDE 7

Section 2 Eigenstrat Algorithm and Definitions

SLIDE 8

Eigenstrat Algorithm

Eigenstrat consists of three main steps:

1 Apply PCA to random SNPs (preferably unrelated to the candidate SNPs of interest) to infer “axes of variation”
2 Adjust the candidate SNPs and phenotypes of the samples based on these axes
3 Compute a test statistic using adjusted values

SLIDE 9

Axes of Variation

The axes of variation:
Are defined as the top principal components.
Can capture differences in genetic variation attributable to ancestry.

◮ May have a geographical interpretation within continents (figure not shown)
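A minimal sketch of how axes of variation might be computed is given below. It simulates two hypothetical populations (all sizes, frequencies, and variable names are assumptions for illustration), normalizes each SNP using a smoothed allele frequency estimate in the spirit of Price et al., and takes the leading eigenvectors of the individual-by-individual covariance matrix.

```r
set.seed(1)
n_per_pop <- 50                      # individuals per population (assumed)
n_snps    <- 500                     # number of random SNPs (assumed)
pop <- rep(c(1, 2), each = n_per_pop)

# Hypothetical population-specific allele frequencies for each SNP.
p1 <- runif(n_snps, 0.1, 0.9)
p2 <- pmin(pmax(p1 + rnorm(n_snps, sd = 0.2), 0.05), 0.95)

# Genotype matrix: SNPs in rows, individuals in columns, entries 0/1/2.
geno <- sapply(pop, function(k) rbinom(n_snps, 2, if (k == 1) p1 else p2))

# Center each SNP, then scale by sqrt(p(1 - p)) using a smoothed frequency estimate.
n     <- ncol(geno)
p_hat <- (1 + rowSums(geno)) / (2 + 2 * n)
x     <- (geno - rowMeans(geno)) / sqrt(p_hat * (1 - p_hat))

# Covariance between individuals and its leading eigenvectors (the axes).
covmat <- crossprod(x) / nrow(x)
eig    <- eigen(covmat, symmetric = TRUE)
axes   <- eig$vectors[, 1:2, drop = FALSE]

# How strongly the first axis tracks the (known, simulated) population label.
axis1_vs_pop <- abs(cor(axes[, 1], pop))
```

In real data the population labels are not used to build the axes; they appear here only to check that the first axis recovers the simulated ancestry.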

SLIDE 10

Adjustment and Test Statistic Calculation

The genotypes of the candidate SNPs and phenotypes of the samples are adjusted

◮ Adjustment corrects for population stratification

The Eigenstrat test statistic is then calculated based on these adjusted genotypes and phenotypes
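The adjustment and test statistic can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the axis is a simulated stand-in for a top principal component, adjustment is done by taking linear-regression residuals, and the statistic is taken to be (N - K - 1) times the squared correlation of the adjusted values, compared against a chi-square distribution with 1 degree of freedom, which is how the description in Price et al. (2006) is understood here. The names `adjust` and `eigenstrat_stat` are hypothetical.

```r
set.seed(2)
n <- 200

# Hypothetical axis of variation, one value per sample (assumed, K = 1).
axis1 <- rnorm(n)
axes  <- cbind(axis1)

# Candidate genotype and phenotype both depend on ancestry, not on each other.
geno  <- rbinom(n, 2, plogis(axis1))
pheno <- rbinom(n, 1, plogis(axis1))

# Adjust a vector by regressing out the axes and keeping the residuals.
adjust <- function(v, axes) residuals(lm(v ~ axes))

eigenstrat_stat <- function(geno, pheno, axes) {
  k     <- ncol(axes)
  g_adj <- adjust(geno, axes)
  p_adj <- adjust(pheno, axes)
  (length(geno) - k - 1) * cor(g_adj, p_adj)^2
}

stat    <- eigenstrat_stat(geno, pheno, axes)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

# Unadjusted analogue for comparison: N times the squared correlation.
naive_stat <- n * cor(geno, pheno)^2
```

After adjustment the genotype residuals are uncorrelated with the axis by construction, so any remaining correlation with the phenotype cannot be explained by that axis.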

SLIDE 11

Section 3 Results

SLIDE 12

Testing Scenarios

Price et al. tested the Eigenstrat algorithm on simulated data.
Simulated candidate SNPs fell into three categories:

1 Random SNPs with no association to disease
2 Highly differentiated SNPs with no association to disease
3 Causal SNPs associated with a disease

Results were compared with:

◮ Armitage trend test statistic (uncorrected for stratification)
◮ Genomic control (corrects for stratification using a uniform inflation factor)
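The two comparison methods can be sketched in base R. The genotype counts and the inflated null statistics below are hypothetical, and the genomic control step uses the common median-based estimate of the inflation factor, which is an assumption about the variant used rather than something stated on the slide.

```r
# Hypothetical counts of cases and controls by genotype (0, 1, 2 copies).
cases    <- c(30, 50, 20)
controls <- c(45, 45, 10)

# Armitage trend test via base R: trend in the proportion of cases across genotypes.
trend      <- prop.trend.test(x = cases, n = cases + controls, score = 0:2)
trend_stat <- unname(trend$statistic)

# Genomic control: estimate one inflation factor from many null statistics.
set.seed(3)
null_stats <- 1.5 * rchisq(10000, df = 1)   # inflated by 1.5 by construction
lambda     <- max(1, median(null_stats) / qchisq(0.5, df = 1))

# The same factor divides every statistic, whatever the SNP's differentiation.
gc_stat <- trend_stat / lambda
gc_p    <- pchisq(gc_stat, df = 1, lower.tail = FALSE)
```

Because the factor is uniform, highly differentiated SNPs can remain under-corrected while weakly differentiated ones are over-corrected, which is the limitation the per-SNP adjustment in Eigenstrat is meant to address.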

SLIDE 13

Advantages of Eigenstrat

Eigenstrat corrected for stratification better than the uncorrected and genomic control-corrected test statistics in all simulation scenarios.

◮ Fewer spurious associations in non-causal SNPs.
◮ More powerful when detecting true associations at causal SNPs.

Computationally tractable

SLIDE 14

Section 4 Example - Bovine Data

SLIDE 15

microbov data

We illustrate how PCA can correct for population stratification using bovine data from the “adegenet” package in R.
microbov: sample of 704 cattle from Africa and France, coded as 373 allele-level markers from 30 microsatellite loci (treated as SNPs in this example).

R Code

dim(data)
## [1] 373 704
data[1:3,1:3]
##            AFBIBOR9503 AFBIBOR9504 AFBIBOR9505
## INRA63.167
## INRA63.171
## INRA63.173

SLIDE 16

Bovines by Country

The first principal component clearly separates the cattle by country of origin.

Figure: scatter plot of PC1 and PC2, colored by country (Africa, France).

SLIDE 17

Bovines by Breed

The breed of each bovine is also included in the data. The first two principal components show some separation by breed.

Figure: scatter plot of PC1 and PC2, colored by breed (Borgou, Zebu, Lagunaire, NDama, Somba, Aubrac, Bazadais, BlondeAquitaine, BretPieNoire, Charolais, Gascon, Limousin, MaineAnjou, Montbeliard, Salers).

SLIDE 18

Simulation Setup

10,000 candidate SNPs with no association to disease are created using highly differentiated allele frequencies between countries:

◮ 0.8 for Africa
◮ 0.2 for France

The case-control simulation study will include:

◮ 100 cases from Africa and 50 from France
◮ 50 controls from Africa and 100 from France
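A rough sketch of this setup in base R follows. It is not the presenter's code and is not expected to reproduce the reported counts: individuals are simulated rather than drawn from microbov, and the country indicator is used as a stand-in for the first principal component (an assumption made so the block is self-contained). The adjusted statistic is a simplified Eigenstrat-style statistic, (N - K - 1) times the squared correlation of regression residuals with K = 1.

```r
set.seed(4)
n_snps <- 10000
alpha  <- 1e-4

# Sample layout: 100 + 50 cases, then 50 + 100 controls.
country <- rep(c("Africa", "France", "Africa", "France"),
               times = c(100, 50, 50, 100))
status  <- rep(c(1, 0), times = c(150, 150))   # 1 = case, 0 = control
n <- length(status)

# Null SNPs: allele frequency depends on country only, never on case status.
freq <- ifelse(country == "Africa", 0.8, 0.2)
geno <- matrix(rbinom(n * n_snps, 2, rep(freq, times = n_snps)), nrow = n)

threshold <- qchisq(alpha, df = 1, lower.tail = FALSE)

# Uncorrected trend statistic, computed as N times the squared correlation.
trend_stat <- n * as.vector(cor(geno, status))^2

# Assumption: the country indicator stands in for the first principal component.
anc_axis <- as.numeric(country == "Africa")
g_adj    <- residuals(lm(geno ~ anc_axis))     # one residual column per SNP
p_adj    <- residuals(lm(status ~ anc_axis))
adj_stat <- (n - 1 - 1) * as.vector(cor(g_adj, p_adj))^2

# Empirical Type I error rates at the chosen significance level.
rate_trend <- mean(trend_stat > threshold)
rate_adj   <- mean(adj_stat > threshold)
```

Using the true country label is an idealized correction; axes estimated from real genotype data are noisier, so the adjusted error rate in practice can sit above what this sketch produces.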

SLIDE 19

Simulation Results

Using 10,000 candidate SNPs and a significance level of 0.0001:
There were 6743 spurious associations detected using the Cochran-Armitage trend test statistic (Type I error rate = 0.6743).
There were 23 spurious associations detected using the Eigenstrat test statistic (Type I error rate = 0.0023).
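As a check on the arithmetic, each Type I error rate is the number of spurious associations divided by the number of null SNPs tested. The expected count at the nominal level is added here for comparison and is not stated on the slide:

```latex
\[
\hat{\alpha}_{\text{trend}} = \frac{6743}{10000} = 0.6743, \qquad
\hat{\alpha}_{\text{Eigenstrat}} = \frac{23}{10000} = 0.0023, \qquad
10000 \times 0.0001 = 1 \ \text{expected false positive at the nominal level.}
\]
```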

SLIDE 20

Section 5 Conclusion

SLIDE 21

Conclusion

Principal components analysis plays an important role in detecting and correcting for population stratification.
Eigenstrat outperformed the alternatives at the time of publication and continues to be one of the most widely used methods of correction today.
“Eigenstrat is not a panacea”. Association studies should still be designed properly.

◮ Poor designs may result in a loss of power with Eigenstrat

SLIDE 22

References

Balding, D. A tutorial on statistical methods for population association studies. Nat Rev Genet 7, 781-791 (2006).

Novembre, J., Johnson, T., Bryc, K. et al. Genes mirror geography within Europe. Nature 456, 98-101 (2008).

Price, A., Patterson, N., Plenge, R. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909 (2006).
