SLIDE 1

Genotype Imputation in Genome-wide Association Studies

Fernando Rivadeneira (1,2)

(1) Department of Internal Medicine

(2) Department of Epidemiology

Course “SNP’s and Human Diseases” Rotterdam November 12th, 2018

SLIDE 2

Imputation facilitates meta-analysis and has been key to the success of GWAS

Large sample sizes are needed to detect moderate effects
One convenient approach: meta-analysis
However, different GWAS may use different genotyping chips (each examining a different set of SNPs)

– E.g., 3-way meta-analysis on lipid concentrations
    FUSION: Illumina 300K
    DGI and SardiNIA: Affymetrix 500K
– Illumina 300K and Affymetrix 500K have <10% (45K) SNPs in common
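
To illustrate why shared markers matter, here is a minimal sketch of a fixed-effect (inverse-variance weighted) meta-analysis restricted to the SNPs present in every study. The study dictionaries, SNP names, effect sizes and standard errors are hypothetical, and `fixed_effect_meta` is only an illustrative name. The lipid meta-analysis mentioned on the slide may have used a different weighting scheme.

```python
import math


def fixed_effect_meta(estimates):
    """Combine (beta, standard error) pairs with inverse-variance weights."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se, beta / se  # pooled effect, pooled SE, z-score


# Hypothetical per-study results: SNP -> (beta, standard error).
study_a = {"rs1": (0.10, 0.05), "rs2": (0.02, 0.04)}
study_b = {"rs1": (0.12, 0.05), "rs3": (0.08, 0.06)}

# Without imputation only SNPs typed on both chips can be combined.
shared = sorted(set(study_a) & set(study_b))
results = {snp: fixed_effect_meta([study_a[snp], study_b[snp]]) for snp in shared}

for snp, (beta, se, z) in results.items():
    print(f"{snp}: beta={beta:.3f} se={se:.4f} z={z:.2f}")
```

In this toy case only one of the three SNPs survives the intersection, which is the situation imputation is meant to relieve.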

SLIDE 3

Why is imputation important?

[Figure: GWAS data with missing genotypes lead either directly to an association study of genotyped data, or through imputation to a reference panel to GWAS imputed data and an association study of imputed data. Adapted from Marchini & Howie, Nat. Rev. Genet. 2010]

SLIDE 4

Imputation allows integration of results across different platforms (example HMGA region)


SLIDE 5

LDL-C association illustrates how sparse the SNP density is when limited to shared markers

[Figure: LDL-C association at three marker densities. Panels: SNPs typed by all 3 groups (~45K); Affy panel SNPs (~321K); imputed SNPs (~2.25 million)]

Willer et al, Nat Genet 40: 161-9, 2008

SLIDE 6

Increase in genome coverage is also facilitated by imputation

HapMap estimated ~10M common SNPs in the human genome (>3M genotyped)
GWAS examine 100K – 1M SNPs
These SNPs can be proxies for many others

– E.g. Illumina 300K covers ~78% of HapMap

What do we achieve with imputation?

SLIDE 7

With imputation we fill in the gaps (missing information) using a reference panel

Start with …

– 100K – 1M SNPs from a GWAS
– ~3M SNPs genotyped in HapMap …

~40-60 million SNPs sequenced in the reference (HRC)

Then impute genotypes in the study samples for ALL the markers present in the reference set but absent from the study

SLIDE 8

Imputation works by determining “shared” regions of the genome and copying over

Rationale: Even unrelated individuals typically share short stretches of chromosome
Heuristic procedure: Identify the shared stretches of chromosome and impute by copying over

SLIDE 9

Using the LD relationships between SNPs, genotypes can be predicted (imputed) for SNPs not measured experimentally

Study Sample

Observed Genotypes

. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .

SLIDE 10

Observed genotypes are phased and compared to the set of phased haplotypes of the reference

Study Sample

Observed Genotypes

. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .

Reference Haplotypes (HapMap / 1000 Genomes Project / HRC)

C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T C C T G T G C
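
The comparison of an observed haplotype with the reference can be sketched with a deliberately naive matching rule. This is not the MACH algorithm: it only keeps reference haplotypes that agree with the study haplotype at every typed marker and reports allele proportions among them at each position. The haplotypes are transcribed from this slide; the function names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

# Reference haplotypes transcribed from the slide (21 markers each).
REFERENCE = [
    "CGAGATCTCCTTCTTCTGTGC",
    "CGAGATCTCCCGACCTCATGG",
    "CCAAGCTCTTTTCTTCTGTGC",
    "CGAAGCTCTTTTCTTCTGTGC",
    "CGAGACTCTCCGACCTTATGC",
    "TGGGATCTCCCGACCTCATGC",
    "CGAGATCTCCCGACCTTGTGC",
    "CGAGACTCTTTTCTTTTGTAC",
    "CGAGACTCTCCGACCTCGTGC",
    "CGAAGCTCTTTTCTCCTGTGC",
]
# First observed study haplotype; "." marks an untyped marker.
OBSERVED = "....A.......A....A..."


def matching_haplotypes(observed, reference):
    """Indices of reference haplotypes identical to the sample at typed markers."""
    typed = [m for m, allele in enumerate(observed) if allele != "."]
    return [i for i, hap in enumerate(reference)
            if all(hap[m] == observed[m] for m in typed)]


def impute_by_copying(observed, reference):
    """Allele proportions per marker among the fully matching haplotypes."""
    matches = matching_haplotypes(observed, reference)
    imputed = []
    for m in range(len(observed)):
        counts = Counter(reference[i][m] for i in matches)
        total = sum(counts.values())
        imputed.append({a: Fraction(n, total) for a, n in counts.items()})
    return matches, imputed


matches, imputed = impute_by_copying(OBSERVED, REFERENCE)
print("matching reference haplotypes (0-based):", matches)
print("allele proportions at the first marker:", imputed[0])
```

Exact matching fails as soon as no single reference haplotype fits the whole region, which is why a mosaic of haplotypes is searched for instead.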

SLIDE 11

A probabilistic search for mosaics among the reference haplotypes is done, looking for similar stretches of flanking haplotypes

Study Sample

Observed Genotypes

. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .

Reference Haplotypes (HapMap / 1000 Genomes Project / HRC)

C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T C C T G T G C

SLIDE 12

The next step is to impute missing genotypes from the reference using an algorithm that models each haplotype conditional on all others

Study Sample

Observed Genotypes (imputed alleles in lower case)

c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t c t c A t g c

Reference Haplotypes (HapMap / 1000 Genomes Project / HRC)

C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T C C T G T G C

SLIDE 13

Observed Genotypes

c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t c t c A t g c

Reference Haplotypes

C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T C C T G T G C

Imputation increases power by increasing sample size (individuals with missing genotypes), allowing higher LD and decreasing error

Example

Genotype       GG    GA    AA
Best guess     0     1     2
Allele dose    0.68

Best-guess information is never used with low imputation quality scores!
A little information is always better than NO information!
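
The difference between a best guess and an allele dose can be shown with a short sketch. The three genotype probabilities below are hypothetical; they were chosen only so that the dose comes out at the 0.68 shown in the example. The function name is an assumption as well.

```python
def summarise_genotype(probabilities):
    """probabilities = (P(GG), P(GA), P(AA)), i.e. 0, 1 or 2 copies of allele A."""
    # Allele dose: expected number of copies of the counted allele.
    dose = sum(copies * p for copies, p in enumerate(probabilities))
    # Best guess: the single most probable genotype, coded 0, 1 or 2.
    best_guess = max(range(3), key=lambda copies: probabilities[copies])
    return dose, best_guess


# Hypothetical posterior probabilities for one individual at one SNP.
dose, best_guess = summarise_genotype((0.40, 0.52, 0.08))
print(f"allele dose = {dose:.2f}, best guess = {best_guess}")
```

The best guess discards the 48% probability that the genotype is not GA, whereas the dose carries that uncertainty into the association test.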

SLIDE 14

Imputation works, providing reliable test statistics

Michigan Age-related Macular Degeneration Study
Dense genotyping in a 123Kb region overlapping CFH
Used 11 tagSNPs to predict 84 SNPs
Imputed genotypes differ from the experimental ones only <1% of the time

[Figure: Comparison of Test Statistics, imputed versus experimental]

Li et al, Nat Genet 38: 1049-54, 2006

SLIDE 15

Imputation works, providing reliable test statistics and effect estimates

Allele frequency        P-value                        Odds ratio
Imputed   Genotyped     Imputed       Genotyped        Imputed   Genotyped
.024      .021          2.5 x 10^-6   6.3 x 10^-6      2.57      2.20
.543      .540          5.3 x 10^-6   1.1 x 10^-5      1.33      1.31
.114      .136          2.0 x 10^-5   4.1 x 10^-5      1.47      1.41
.494      .490          6.6 x 10^-5   5.5 x 10^-5      1.28      1.28
.927      .924          7.5 x 10^-5   9.0 x 10^-5      1.72      1.65
.744      .753          1.4 x 10^-4   3.9 x 10^-4      1.33      1.30
.289      .291          1.7 x 10^-4   1.2 x 10^-4      1.27      1.28
.970      .973          1.9 x 10^-4   3.6 x 10^-5      2.47      2.58
.401      .361          6.3 x 10^-4   1.6 x 10^-3      1.26      1.22
.817      .816          9.5 x 10^-4   1.0 x 10^-3      1.31      1.30
.605      .605          9.9 x 10^-4   1.2 x 10^-3      1.23      1.22

Scott et al, Science 316: 1341-5, 2007

SLIDE 16

Imputation algorithm uses a hidden Markov model (MACH)

Hidden State Sm: The pair of contributing reference haplotypes at marker m
Data Gm: Observed genotypes at marker m
Goal: Infer Sm
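
A simplified sketch of such a hidden Markov model follows. It is an assumption-laden toy, not the MACH implementation: the hidden state is a single copied reference haplotype rather than a pair, and the `switch` and `error` rates are fixed, arbitrary values instead of being estimated. The haplotypes are the ones shown on the earlier slides; all function names are illustrative.

```python
REFERENCE = [
    "CGAGATCTCCTTCTTCTGTGC", "CGAGATCTCCCGACCTCATGG",
    "CCAAGCTCTTTTCTTCTGTGC", "CGAAGCTCTTTTCTTCTGTGC",
    "CGAGACTCTCCGACCTTATGC", "TGGGATCTCCCGACCTCATGC",
    "CGAGATCTCCCGACCTTGTGC", "CGAGACTCTTTTCTTTTGTAC",
    "CGAGACTCTCCGACCTCGTGC", "CGAAGCTCTTTTCTCCTGTGC",
]
OBSERVED = "....A.......A....A..."  # "." marks an untyped marker


def normalise(values):
    total = sum(values)
    return [v / total for v in values]


def posterior_copy_probs(observed, reference, switch=0.05, error=0.01):
    """Forward-backward posterior P(S_m = h | data) for a haploid copying model."""
    n_hap, n_mark = len(reference), len(observed)

    def emission(h, m):
        if observed[m] == ".":
            return 1.0  # untyped marker carries no information
        return 1.0 - error if reference[h][m] == observed[m] else error

    # Forward pass, rescaled at every marker so each row sums to 1.
    forward = [normalise([emission(h, 0) for h in range(n_hap)])]
    for m in range(1, n_mark):
        prev = forward[-1]
        forward.append(normalise([
            emission(h, m) * ((1.0 - switch) * prev[h] + switch / n_hap)
            for h in range(n_hap)]))

    # Backward pass with the same stay-or-switch transition.
    backward = [[1.0] * n_hap for _ in range(n_mark)]
    for m in range(n_mark - 2, -1, -1):
        nxt = [emission(k, m + 1) * backward[m + 1][k] for k in range(n_hap)]
        jump = switch * sum(nxt) / n_hap
        backward[m] = normalise([(1.0 - switch) * nxt[h] + jump
                                 for h in range(n_hap)])

    return [normalise([f * b for f, b in zip(forward[m], backward[m])])
            for m in range(n_mark)]


def allele_probabilities(posterior, reference):
    """Sum the copying probabilities over haplotypes carrying each allele."""
    result = []
    for m, probs in enumerate(posterior):
        alleles = {}
        for h, p in enumerate(probs):
            alleles[reference[h][m]] = alleles.get(reference[h][m], 0.0) + p
        result.append(alleles)
    return result


posterior = posterior_copy_probs(OBSERVED, REFERENCE)
alleles = allele_probabilities(posterior, REFERENCE)
print("".join(max(a, key=a.get) for a in alleles))  # best-guess haplotype
```

Unlike exact matching, every reference haplotype keeps a non-zero probability at each marker, so the output is a probability per allele rather than a hard call.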

SLIDE 17

Output from the imputation (MACH)

Iteration 1         g A c c t c A t g g
                    t C t t t c A t g g
Iteration 2         g A c c t c A t g g
                    t C c c t c A t g c
Iteration 3         g A c c t c A t g g
                    t C c c t c A t g c
Iteration 4         g A c c t c A t g g
                    t C c c t c A t g c

Consensus           g A c c t c A t g g
                    t C c c t c A t g c
Quality Score       1   1   3/4   3/4   1   1   1   1   1   3/4
Reference Allele    g A c c t c A t g g
Dosage              1   1   7/4   7/4   2   2   2   2   2   5/4

Consensus: best guess, i.e. the most frequently occurring genotype guess across all iterations
Quality Score: proportion of iterations where the guessed genotype agrees with the consensus
Dosage: estimated fractional count of the reference allele
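
The three summaries can be recomputed from the four iterations shown on this slide. The sketch below is only an illustration of the definitions given here, with an assumed function name; it does not reproduce how MACH samples the iterations.

```python
from collections import Counter
from fractions import Fraction

# Haplotype pairs sampled in each iteration, transcribed from the slide.
ITERATIONS = [
    ("gAcctcAtgg", "tCtttcAtgg"),
    ("gAcctcAtgg", "tCcctcAtgc"),
    ("gAcctcAtgg", "tCcctcAtgc"),
    ("gAcctcAtgg", "tCcctcAtgc"),
]
REFERENCE_ALLELE = "gAcctcAtgg"


def summarise(iterations, reference_allele):
    """Per-marker consensus genotype, quality score and dosage."""
    n_iter = len(iterations)
    consensus, quality, dosage = [], [], []
    for m, ref in enumerate(reference_allele):
        # Unordered genotype at marker m in each iteration.
        genotypes = [tuple(sorted((h1[m], h2[m]))) for h1, h2 in iterations]
        best, count = Counter(genotypes).most_common(1)[0]
        consensus.append(best)
        # Share of iterations agreeing with the consensus genotype.
        quality.append(Fraction(count, n_iter))
        # Average number of copies of the reference allele.
        dosage.append(Fraction(sum(g.count(ref) for g in genotypes), n_iter))
    return consensus, quality, dosage


consensus, quality, dosage = summarise(ITERATIONS, REFERENCE_ALLELE)
print("quality:", " ".join(str(q) for q in quality))
print("dosage: ", " ".join(str(d) for d in dosage))
```

The printed rows agree with the Quality Score and Dosage rows of the slide.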

SLIDE 18

With the advent of large sequenced reference sets the imputation pipeline has been redefined (pre-phasing and imputation)

For large projects (n > 1,000 individuals) phasing each chromosome can take a while.
Split data into chunks of 2000 markers (+50 in flanking regions)
(MACH) Phase each chunk independently
Ligate chunks to reconstruct the chromosome (using flanking regions)
(Minimac) Impute pre-phased chunks with a particular reference (e.g. different versions of 1000G)

⇒ Prefer splitting over markers rather than over individuals
⇒ http://genome.sph.umich.edu/wiki/Minimac:_1000_Genomes_Imputation_Cookbook
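
The chunking step can be sketched as follows. This is not the tool used in the cookbook; `chunk_markers` is a hypothetical helper, and it assumes the 50 flanking markers are added on each side of a 2000-marker core and clipped at the chromosome ends.

```python
def chunk_markers(n_markers, core=2000, flank=50):
    """Split marker indices 0..n_markers-1 into cores with overlapping flanks.

    Each chunk is described by half-open index ranges:
    [start, end) is what gets phased, [core_start, core_end) is what is kept
    after ligation.
    """
    chunks = []
    for core_start in range(0, n_markers, core):
        core_end = min(core_start + core, n_markers)
        chunks.append({
            "start": max(core_start - flank, 0),
            "core_start": core_start,
            "core_end": core_end,
            "end": min(core_end + flank, n_markers),
        })
    return chunks


for chunk in chunk_markers(5000):
    print(chunk)
```

Neighbouring chunks share the flanking markers, which is what allows the phased pieces to be ligated back into a chromosome.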

SLIDE 19

With the advent of large sequenced reference sets the imputation pipeline has been redefined (pre-phasing and imputation)

Genotyping Data → QC-ing → Phasing & Imputing → Analyzing

SLIDE 20

Genotyping Data → QC-ing → Phasing & Imputing → Analyzing

Genotyping Data → QC-ing → Phasing → 3 x Imputing → 3 x Analyzing

With the advent of large sequenced reference sets the imputation pipeline has been redefined (pre-phasing and imputation)

SLIDE 21

Output files (Mach)

Dosage file

ID          TYPE      SNP1    SNP2   SNP3   SNP4    SNP5
RS3->232    ML_DOSE   2       2      2      2       2
RS3->2921   ML_DOSE   2       1      2      2       2
RS3->3370   ML_DOSE   1.999   1      2      2       2
RS3->3542   ML_DOSE   2       1      2      1.968   1.998

Info file

SNP          Al1   Al2   Freq1    MAF      Quality   Rsq
rs12828708   A     G     0.9603   0.0397   0.9707    0.7232
rs10880855   T     C     0.5149   0.4851   0.9991    0.9985
rs7979218    G     A     0.9673   0.0327   0.9826    0.7903
rs7315793    C     T     0.9537   0.0463   0.9554    0.6538
rs4768098    A     G     0.6954   0.3046   0.9984    0.9971
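
A minimal sketch of how such an info file might be read and filtered on Rsq is given below. It assumes a whitespace-delimited file with exactly the header shown on this slide; the function names are hypothetical, and the default threshold is the r2 > 0.3 convention quoted elsewhere in this deck.

```python
# Info-file rows transcribed from the slide.
INFO_TEXT = """\
SNP Al1 Al2 Freq1 MAF Quality Rsq
rs12828708 A G 0.9603 0.0397 0.9707 0.7232
rs10880855 T C 0.5149 0.4851 0.9991 0.9985
rs7979218 G A 0.9673 0.0327 0.9826 0.7903
rs7315793 C T 0.9537 0.0463 0.9554 0.6538
rs4768098 A G 0.6954 0.3046 0.9984 0.9971
"""


def read_info(text):
    """Parse whitespace-delimited info rows into dictionaries."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    records = []
    for line in lines[1:]:
        fields = dict(zip(header, line.split()))
        for key in ("Freq1", "MAF", "Quality", "Rsq"):
            fields[key] = float(fields[key])
        records.append(fields)
    return records


def well_imputed(records, min_rsq=0.3):
    """Names of SNPs whose estimated Rsq exceeds the threshold."""
    return [r["SNP"] for r in records if r["Rsq"] > min_rsq]


records = read_info(INFO_TEXT)
print(well_imputed(records))        # all five SNPs pass r2 > 0.3
print(well_imputed(records, 0.7))   # a stricter, arbitrary cut-off
```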

SLIDE 22

Accuracy of the imputation process needs to be assessed

Imputation Accuracy: Concordance rate between imputed genotypes and experimental genotypes

Measures of Accuracy

– Estimated Accuracy: Proportion of rounds where the imputed genotype agrees with the consensus (best-guess genotype) across all rounds
– Estimated r2

SLIDE 23

Mach r2 measures hidden vs imputed genotypes (r2>0.3 by convention)

$$E(r^2 \text{ with true genotypes}) = \frac{\mathrm{Var}(\text{dosage})}{E[\mathrm{Var}(\text{dosage})]}$$

where, under HWE,

$$\begin{aligned}
E[\mathrm{Var}(\text{dosage})] &= E(\text{dosage}^2) - [E(\text{dosage})]^2 \\
&= 2^2 \cdot p_{AA} + 1^2 \cdot p_{Aa} - (2 \cdot p_{AA} + 1 \cdot p_{Aa})^2 \\
&= 2^2 \cdot p_A^2 + 1^2 \cdot 2p_A(1-p_A) - [2 \cdot p_A^2 + 1 \cdot 2p_A(1-p_A)]^2 \\
&= 2p_A(1-p_A)
\end{aligned}$$
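
As a rough numerical illustration of this ratio, the sketch below divides the observed variance of the dosages by the variance expected under HWE, with the allele frequency estimated as half the mean dosage. The dosage vectors are hypothetical, the function name is an assumption, and the estimator implemented in MACH may differ in detail.

```python
def estimated_rsq(dosages):
    """Observed dosage variance divided by the HWE expectation 2p(1-p)."""
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2.0  # estimated frequency of the counted allele
    observed_var = sum((d - mean) ** 2 for d in dosages) / n
    expected_var = 2.0 * p * (1.0 - p)
    if expected_var == 0.0:
        return 0.0  # monomorphic marker: the ratio is undefined
    return observed_var / expected_var


# Hypothetical dosages: confidently called genotypes versus the same
# individuals with dosages shrunk towards the mean by uncertainty.
confident = [0.0, 1.0, 2.0, 1.0]
uncertain = [0.5, 1.0, 1.5, 1.0]
print(estimated_rsq(confident), estimated_rsq(uncertain))
```

Dosages that are pulled towards the population mean lose variance, so the ratio drops below 1 as the imputation becomes less certain.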

SLIDE 24

Which factors affect imputation quality? Imputation panel

– Original imputation panel
– More comprehensive imputation panel
– Larger imputation panel

SLIDE 25

Imputed genotypes translate into improved genome coverage

Commercial Genotyping Platform Coverage (%)

Platform   Before Imputation   After Imputation
A100       31                  46
A250S      44                  65
A250N      48                  68
A500       61                  82
A1000      76                  91
I300       74                  90
I550       85                  94
I650       86                  94
I1000      89                  95

SLIDE 26

Imputed genotypes translate into improved genome coverage and boosted statistical power (HapMap data)

Simulated studies used a genotyped SNP panel that captures 80% of common variants with pairwise r2 > 0.80.

Power

Disease SNP MAF   Genotyped SNPs Only   Genotyped + Imputed SNPs
2.5%              24.4%                 56.2%
5%                55.8%                 73.8%
10%               77.4%                 87.2%
20%               85.6%                 92.0%
50%               93.0%                 96.0%

SLIDE 27

Improved imputation quality also boosts the test statistics of common variants (genome coverage vs imputation quality)

[Figure: APOE (rs429358) association under two reference panels, 1000GP R06-2010 (60 samples in reference) and 1000GP R08-2010 (283 samples in reference). APOE MACH RSQ = 0.4 with P = 7.1 x 10^-15, versus APOE MACH RSQ = 0.6 with P = 9.7 x 10^-20]

*Rotterdam Study I data on Alzheimer's dementia for IGAP consortium

SLIDE 28

SLIDE 29

The 1000 Genomes Project is a follow-up to the HapMap consortium

To aid association studies of complex diseases
Sequencing the genomes of >1000 individuals

– A more complete catalog of common variants and a catalog of rare variants
– Using our imputation strategy to reconstruct haplotypes

www.1000genomes.org

– All sequence data available from Short Read Archive

SLIDE 30

Increase in size of reference panel reduces the imputation error

Number of Reference Individuals   Imputation Error
60                                1.31%
100                               0.88%
200                               0.52%
500                               0.40%

SLIDE 31

[Figure: existing SNPs versus new SNPs. 1000G, Nature, 2010]

Large fraction of newer SNPs are less frequent (rare) and population specific

SLIDE 32

Can we impute better with a reference panel arising from a “local” population?

SLIDE 33

Genome of the Netherlands GoNL

SLIDE 34

Evaluating the imputation quality – GONL vs. 1000G

Study Sample (the same sample is imputed against both panels)

. . . . G . . . . . . . C . . . . A . . .

Reference haplotypes (GONL Reference Panel and 1000G Reference Panel, five haplotypes each)

C G A G A C T C T C C G A C C T T A T G C
C G A A G C T C T T T T C T C C T G T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A C T C T C C G A C C T C G T G C
C G A G A T C T C C C G A C C T T G T G C

C G A G A C T C T C C G A C C T T A T G C
C G A G A C T C T C C G A C C T C G T G C
C G A G A T C T C C C G A C C T T G T G C
C G A A G C T C T T T T C T C C T G T G C
T G G G A T C T C C C G A C C T C A T G C

SLIDE 35

GoNL has improved accuracy, especially for less common non-reference alleles

Eskil Kreiner

SLIDE 36

SLIDE 37

SLIDE 38

SLIDE 39

Increase in Power/Coverage

SLIDE 40

Imputation Servers

SLIDE 41

SLIDE 42

Summary Imputation

Imputing reconstructs haplotypes from genotype data in an efficient & effective way
Larger imputation panels provide the best performance:

– Accurate measures of uncertainty
– Increased coverage of the genome (towards lower MAF)
– Improved power of association analyses

SLIDE 43

Acknowledgements

MaCH Development:

– Dr. Yun Li
– Dr. Gonçalo Abecasis
– Dr. Michael Boehnke
– Paul Scheet
– Jun Ding
– http://www.sph.umich.edu/csg/abecasis/MACH

SLIDE 44

Acknowledgements

GENETIC INVESTIGATIONS OF ANTHROPOMETRIC TRAITS

CHARGE

SLIDE 45

McGill University, CA: Brent Richards, Vince Forgetta, Houfeng Zheng (China)
NIH/AGES, US/Iceland: Tamara Harris, Vilmundur Gudnason, Albert Vernon-Smith, Guðny Eiriksdottir
University of Maryland, US: Laura M Yerges-Armstrong
Indiana University, US: Daniel L Koller, Michael J. Eacons, Munro Peacock
University of Pittsburgh, US: Jane Cauley
HKSC, Hong Kong: Annie Kung
Aarhus University, Denmark: Bente Langdahl, Lisa Husted
VU MC, The Netherlands: Paul Lips, Natasja van Schoor
Greek Osteoporosis Study: Panagoula Kollia
ErasmusMC, Netherlands: André Uitterlinden, Joyce van Meurs, Pascal Arp, Mila Jhamai, Jeroen van Rooij, Robert Kraaij, Carola Zillikens, Cornelia van Duiijn, Albert Hofman, Oscar Franco, Vincent Jaddoe
University of Ioannina, GRE: Vangelis Evangelou, Evangelia Ntzani, John Ioannidis
deCODE, Iceland: Unnur Styrskarsdottir, Unnur Thorsteinsdottir
University of Edinburgh, UK: Stuart Ralston, Omar Albagha, James Wilson, Nerea Alonso
Oxford University, UK: Jonathan Reeve
Cambridge University, UK: Stephen Kaptoge
Bristol University, UK: David Evans, John Tobias, John Kemp
CHOP, Philadelphia, US: Struan Grant, Babette Zemmel, Alessandra Chessi
Sahlgrenska Academy, SWE: Claes Ohlsen, Joel Eriksson, Mattias Lotrenzon, Liesbeth van den Put
King’s College, UK: Tim Spector, Scott Wilson (UWE)
Brisbane University, AUS: Matthew Brown, Emma Duncan
University of Sydney, AUS: John Eisman
HSL, Harvard University, US: Doug Kiel, David Karasik, Yi-Hsiang Hsu
Boston University: Adrienne Cupples, Ching-Ti Liu
University of Washington, US: John Robbins
University of Barcelona, SPA: Susana Balcells & Daniel Gringberg
Sheffield University, UK: Eugene McCloskey
University of Southampton, UK: Cyrus Cooper & Elaine Denisson
University of Aberdeen: Lynne Hocking
University of Ljubljana: Janja Marc
University of Western Australia: Richard Prince & Joshua Lewis
Women’s Health Initiative: Rebecca Jackson
Graz University, Austria: Barbara Obermayer-Pietsch
Malmo University, Sweden: Kristina Ǻkesson & Fiona McGuigan
Hosp. Cantabria, Spain: Jose A. Riancho
Quebec University, Canada: Francois Rousseau
Inst Biochem & Genetics, Russia: Elsa K. Khusnutdinova

Phenotypes

Johns Hopkins University, US: Thomas J. Beck
University of Geneva, SWI: Didier Hans

Functional work

Rochester University, US: Cheryl Ackert-Bicknell
ErasmusMC, Netherlands: Jerooen van den Peppel, Bram van der Erden
University of Oslo, Norway: Kaare Gautvik & Sjur Reppe
Sanger Institute, UK: Vijay Yadav
New York University, US: Matthew Maurano
University of Missouri, KC, US: Lynda Bonewald