Genome-Wide Association Studies: Caitlin Collins, Thibaut Jombart

SLIDE 1

Genome-Wide Association Studies

Caitlin Collins, Thibaut Jombart

MRC Centre for Outbreak Analysis and Modelling Imperial College London

Genetic data analysis using 30-10-2014

SLIDE 2

Outline

  • Introduction to GWAS
  • Study design
      • GWAS design
      • Issues and considerations in GWAS
  • Testing for association
      • Univariate methods
      • Multivariate methods
          • Penalized regression methods
          • Factorial methods

SLIDE 3

Genomics & GWAS

SLIDE 4

The genomics revolution

  • Sequencing technology
      • 1977 – Sanger
      • 1995 – 1st bacterial genomes
          • < 10,000 bases per day per machine
      • 2003 – 1st human genome
          • > 10,000,000,000,000 bases per day per machine
  • GWAS publications
      • 2005 – 1st GWAS
          • Age-related macular degeneration
      • 2014 – 1,991 publications
          • 14,342 associations

SLIDE 5

A few GWAS discoveries…

SLIDE 6

So what is GWAS?

  • Genome-Wide Association Study
      • Looking for SNPs… associated with a phenotype.
  • Purpose:
      • Explain
          • Understanding
          • Mechanisms
          • Therapeutics
      • Predict
          • Intervention
          • Prevention
          • Understanding not required

SLIDE 7

Association

  • Definition
      • Any relationship between two measured quantities that renders them statistically dependent.
  • Heritability
      • The proportion of variance explained by genetics
      • P = G + E + G*E
      • Heritability > 0

[Figure: data matrix of n individuals (cases and controls) × p SNPs]
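
The decomposition P = G + E + G*E is usually read in terms of variances. A minimal sketch, assuming the genetic and environmental terms are uncorrelated; the symbols V_P, V_G, V_E and V_{G×E} (phenotypic, genetic, environmental and interaction variance) are notation introduced here, not taken from the slides, and H² denotes broad-sense heritability:

```latex
\begin{align}
V_P &= V_G + V_E + V_{G \times E} \\
H^2 &= \frac{V_G}{V_P}
\end{align}
```

Under this reading, "Heritability > 0" simply means V_G > 0: some of the phenotypic variance is attributable to genetic differences, which is what makes a search for associated SNPs worthwhile.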

SLIDE 8

SLIDE 9

Why?

  • Environment, Gene-Environment interactions
  • Complex traits, small effects, rare variants
  • Gene expression levels
  • GWAS methodology?

SLIDE 10

Study Design

SLIDE 11

GWAS design

  • Case-Control
      • Well-defined “case”
      • Known heritability
  • Variations
      • Quantitative phenotypic data
          • E.g. height, biomarker concentrations
      • Explicit models
          • E.g. dominant or recessive

SLIDE 12

Issues & Considerations

  • Data quality
      • 1% rule
  • Controlling for confounding
      • Sex, age, health profile
      • Correlation with other variables
      • Population stratification*
      • Linkage disequilibrium*

SLIDE 13

Population stratification

  • Definition
      • “Population stratification” = population structure
      • Systematic difference in allele frequencies between sub-populations…
      • … possibly due to different ancestry
  • Problem
      • Violates assumed population homogeneity, independent observations
          • → Confounding, spurious associations
      • Case population more likely to be related than Control population
          • → Over-estimation of significance of associations

SLIDE 14

Population stratification II

  • Solutions
      • Visualise
          • Phylogenetics
          • PCA
      • Correct
          • Genomic Control
          • Regression on Principal Components of PCA
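
One of the corrections listed above, Genomic Control, can be sketched in a few lines. This is a minimal illustration, assuming 1-degree-of-freedom chi-squared association statistics (one per SNP); the function name `genomic_control` and the example statistics are hypothetical. The inflation factor λ is the median observed statistic divided by the median of a chi-squared distribution with 1 df (about 0.4549), and each statistic is divided by λ.

```python
from statistics import median

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (lambda_gc, corrected statistics) for 1-df chi-squared tests."""
    lambda_gc = median(chi2_stats) / CHI2_1DF_MEDIAN
    # Only deflate: an inflation factor below 1 is conventionally set to 1.
    lambda_gc = max(lambda_gc, 1.0)
    return lambda_gc, [x / lambda_gc for x in chi2_stats]


# Hypothetical per-SNP statistics, inflated by population structure.
stats = [0.2, 0.5, 0.9, 1.3, 2.0, 3.1, 15.0]
lam, corrected = genomic_control(stats)
print(f"lambda = {lam:.3f}")
```

After correction the median statistic sits at its expected null value, so only SNPs that stand out from the genome-wide inflation remain significant.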

SLIDE 15

Linkage disequilibrium (LD)

  • Definition
      • Alleles at separate loci are NOT independent of each other
  • Problem?
      • Too much LD is a problem
          • → noise >> signal
      • Some (predictable) LD can be beneficial
          • → enables use of “marker” SNPs
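
The non-independence of alleles at two loci is commonly quantified with the coefficient D and the squared correlation r². The sketch below assumes two biallelic loci with known haplotype and allele frequencies; the function name `ld_stats` and the example frequencies are hypothetical, chosen to show the two extremes.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    # D is the excess of the observed haplotype frequency over independence.
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Perfect LD: allele A is always found together with allele B.
d, r2 = ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5)
# Linkage equilibrium: the haplotype frequency equals the product.
d0, r2_0 = ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5)
```

An r² close to 1 is what makes a genotyped "marker" SNP a usable proxy for an ungenotyped causal variant nearby.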

SLIDE 16

Testing for Association

SLIDE 17

Methods for association testing

  • Standard GWAS
      • Univariate methods
  • Incorporating interactions
      • Multivariate methods
          • Penalized regression methods (LASSO)
          • Factorial methods (DAPC-based FS)

SLIDE 18

Univariate methods

  • Approach
      • Individual test statistics
      • Correction for multiple testing
  • Variations
      • Testing
          • Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA
          • Gold Standard: Fisher’s exact test
      • Correcting
          • Bonferroni
          • Gold Standard: FDR

[Figure: data matrix of n individuals (cases and controls) × p SNPs]
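
The univariate approach (one test per SNP, then a correction for multiple testing) can be sketched as follows. This is an illustration only: it assumes each SNP is summarised as a 2x2 table of allele counts, uses a two-sided Fisher's exact test, and takes "FDR" to mean the Benjamini-Hochberg procedure, which the slides do not specify. All function names and counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical allele counts, one table per SNP:
# (minor in cases, major in cases, minor in controls, major in controls)
tables = [(3, 1, 1, 3), (10, 0, 0, 10), (5, 5, 5, 5)]
pvals = [fisher_exact_two_sided(*t) for t in tables]
p_bonf = bonferroni(pvals)
p_fdr = benjamini_hochberg(pvals)
```

Bonferroni controls the chance of any false positive and grows more conservative with the number of SNPs, whereas the FDR adjustment controls the expected proportion of false positives among the SNPs declared significant.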

SLIDE 19

Univariate – Strengths & weaknesses

Strengths
  • Straightforward
  • Computationally fast
  • Conservative
  • Easy to interpret

Weaknesses
  • Multivariate system, univariate framework
  • Effect size of individual SNPs may be too small
  • Marginal effects of individual SNPs ≠ combined effects

SLIDE 20

What about interactions?

SLIDE 21

Interactions

  • Epistasis
      • “Deviation from linearity under a general linear model”

$$Y_i = w_0 + w_1 A_i + w_2 B_i + \mathbf{w_3 A_i B_i}$$

  • With p predictors, there are:
      • $\binom{p}{k} \approx \frac{p^k}{k!}$ k-way interactions
      • p = 1,000,000 → ≈ 5 × 10^11 pairwise interactions
      • That’s 500 BILLION possible pair-wise interactions!
  • Need some way to limit the number of pairwise interactions considered…
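
The size of the interaction search space is easy to check numerically. The sketch below counts k-way interactions exactly with the binomial coefficient and compares it with the p^k / k! approximation. The helper names are hypothetical, and p is set to one million predictors, which is the value that corresponds to roughly 5 × 10^11 (500 billion) pairs.

```python
from math import comb, factorial


def n_interactions(p, k):
    """Exact number of k-way interactions among p predictors."""
    return comb(p, k)


def n_interactions_approx(p, k):
    """Large-p approximation p^k / k!."""
    return p ** k / factorial(k)


p = 1_000_000
exact_pairs = n_interactions(p, 2)
approx_pairs = n_interactions_approx(p, 2)
print(exact_pairs)   # 499999500000
print(approx_pairs)  # 500000000000.0
```

Moving from pairs to triples multiplies the count by roughly p/3, which is why exhaustive testing of higher-order interactions is not feasible at genome scale.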

SLIDE 22

Multivariate methods

  • Penalized Regression
  • LASSO penalized regression
  • Ridge regression
  • The elastic net
  • Bayesian Approaches
  • Bayesian Epistasis Association Mapping
  • Bayesian partitioning
  • Bayesian Logistic Regression with Stochastic Search Variable Selection
  • Factorial Methods
  • Supervised-PCA
  • Sparse-PCA
  • DAPC-based FS (snpzip)
  • Logic Trees
  • Logic regression
  • Monte Carlo Logic Regression
  • Logic feature selection
  • Modified Logic Regression-Gene Expression Programming
  • Genetic Programming for Association Studies
  • Neural Networks
  • Genetic programming optimized neural networks
  • Parametric decreasing method
  • Non-parametric Methods
  • Multi-factor dimensionality reduction method
  • Odds-ratio-based MDR
  • Restricted partitioning method
  • Combinatorial partitioning method
  • Random forests
  • Set association approach

SLIDE 23

Multivariate methods (ii)

  • Penalized regression methods
      • LASSO penalized regression
  • Factorial methods
      • DAPC-based feature selection

SLIDE 24

Penalized regression methods

  • Approach
      • Regression models multivariate association
      • Shrinkage estimation → feature selection
  • Variations
      • LASSO, Ridge, Elastic net, Logic regression
      • Gold Standard: LASSO penalized regression

SLIDE 25

LASSO penalized regression

  • Regression
      • Generalized linear model (“glm”)
  • Penalization
      • L1 norm
      • Coefficients → 0
      • Feature selection!
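
How an L1 penalty drives coefficients to exactly zero can be seen in a small coordinate-descent sketch. This is a simplification, not the method used in the talk: it fits a plain linear model without intercept (rather than a glm) by minimising (1/2n)·||y − Xβ||² + λ·||β||₁. The function names, the ±1 genotype coding and the data are hypothetical.

```python
def soft_threshold(a, lam):
    """Shrink a towards zero by lam; values within [-lam, lam] become 0."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero predictor carries no information
            # Covariance of predictor j with the partial residual.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                beta[j] = new_beta
    return beta


# Hypothetical data: 4 individuals, 3 predictors coded as +1/-1.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
# The response depends strongly on predictor 0 and only weakly on predictor 1.
y = [2.1, -1.9, 1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

With λ = 0.5 the weak and the null predictors are set to exactly zero while the strong one is kept, shrunk from 2.0 to 1.5. This is the feature-selection behaviour the slide refers to, and it also shows why the choice of λ matters.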

SLIDE 26

LASSO – Strengths & weaknesses

Strengths
  • Stability
  • Interpretability
  • Likely to accurately select the most influential predictors
  • Sparsity

Weaknesses
  • Multicollinearity
  • Not designed for high-p
  • Computationally intensive
  • Calibration of penalty parameters
      • User-defined → variability
  • Sparsity
  • NO p-values!

SLIDE 27

Factorial methods

  • Approach
      • Place all variables (SNPs) in a multivariate space
      • Identify discriminant axis → best separation
      • Select variables with the highest contributions to that axis
  • Variations
      • Supervised-PCA, Sparse-PCA, DA, DAPC-based FS
      • Our focus: DAPC with feature selection (snpzip)

SLIDE 28

DAPC-based feature selection

[Figure: density of individuals along the discriminant axis for healthy (“controls”) and diseased (“cases”) groups; alleles-by-individuals data matrix; bar chart of the contribution of alleles a–e to the discriminant axis]

SLIDE 29

DAPC-based feature selection

  • Where should we draw the line?
      • → Hierarchical clustering

[Figure: density of individuals along the discriminant axis; bar chart of the contribution of alleles a–e to the discriminant axis, with the selection threshold marked “?”]
SLIDE 30

Hierarchical clustering (FS)

[Figure: bar chart of the contribution of alleles a–e to the discriminant axis, with the selection threshold resolved: “Hooray!”]
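
The idea of letting a clustering of the contributions decide where to draw the line can be sketched as below. This is a simplified stand-in, not the snpzip implementation: it assumes single-linkage clustering cut into two groups, which for one-dimensional values amounts to splitting at the largest gap between sorted contributions. The function name and the contribution values are hypothetical.

```python
def select_by_largest_gap(contributions):
    """Split variables into a low and a high group; return the high group.

    contributions: dict mapping variable name -> contribution to the axis.
    """
    ranked = sorted(contributions.items(), key=lambda kv: kv[1])
    # Gaps between consecutive sorted contributions.
    gaps = [ranked[i + 1][1] - ranked[i][1] for i in range(len(ranked) - 1)]
    # Cutting at the widest gap gives the two single-linkage clusters.
    cut = gaps.index(max(gaps))
    return sorted(name for name, _ in ranked[cut + 1:])


# Hypothetical contributions of alleles a-e to the discriminant axis.
contributions = {"a": 0.05, "b": 0.10, "c": 0.08, "d": 0.40, "e": 0.37}
selected = select_by_largest_gap(contributions)
print(selected)  # ['d', 'e']
```

No threshold is supplied by the user, so the number of selected variables depends on the data. Other linkage criteria can place the cut differently.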

SLIDE 31

DAPC – Strengths & weaknesses

Strengths
  • More likely to catch all relevant SNPs (signal)
  • Computationally quick
  • Good exploratory tool
  • Redundancy > sparsity

Weaknesses
  • Sensitive to n.pca
  • N.snps.selected varies
  • No “p-value”
  • Redundancy > sparsity

SLIDE 32

Conclusions

  • Study design
      • GWAS design
      • Issues and considerations in GWAS
  • Testing for association
      • Univariate methods
      • Multivariate methods
          • Penalized regression methods
          • Factorial methods

SLIDE 33

Thanks for listening!

SLIDE 34

Questions?