Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01


SLIDE 1

Jason Mezey jgm45@cornell.edu April 16, 2019 (T) 10:10-11:25

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01

Lecture 19: Alternative Tests, Haplotype Testing, and Minimal GWAS Steps

SLIDE 2

Summary of lecture 19

  • Today we will briefly review Generalized Linear Models (GLMs), the class that includes linear and logistic regression
  • We will discuss the use of other (alternative) testing approaches in GWAS (i.e., other than regressions)
  • We will discuss haplotype testing and the intuition behind how haplotypes are defined
  • Finally, we will discuss the minimal steps you should consider when performing a GWAS

SLIDE 3

Announcements

BTRY 4830/6830 & PBSB.5201.01 Quantitative Genomics and Genetics Spring 2019

Project (Version 1) Posted April 16; Due 11:59PM May 7

1 Introduction and instructions

The goal of the class project is for you to demonstrate what you have learned by performing a GWAS analysis on real data. To accomplish this, assume that you have been provided data by a collaborator who wants to identify positions of causal polymorphisms (loci). You will perform an in-depth analysis and write a report for your collaborator that explains your methods and results. Instructions: While we provide some general guidelines for how to proceed below, the techniques you use to analyze the data and how you construct your report will be up to you. Do however note the following instructions (PLEASE READ THESE CAREFULLY!!):

SLIDE 4

Announcements

(1) Your project must be uploaded by 11:59PM, May 7. If it is late for any reason, standard grading policies apply.

(2) You are allowed to work together with other students in the class to analyze these data. However, note that turning in a report that describes exactly the same analyses as a fellow student is not a good strategy for getting a good grade. Also note that you must write your own report.

(3) This is an ‘open book’ assignment, such that you are allowed to use any resources online, in books, etc. You may also ask third parties (i.e. people not in the class) for suggestions on what analyses to perform, but you cannot have a third party do any of the analyses (or write any code for you!).

(4) You are also allowed to use any software or programming language that you would like as part of your analysis. However, we expect that some of the tasks will be performed in R (also note that you are welcome to use any packages, functions, etc. in R).

(5) Your final project will include a SINGLE report file and a SINGLE file including all of your R code (ideally an .rmd file!) and / or commands or scripts you used to run other software packages. That is, for your R code, the best way to maximize your grade is to have well-commented code that we can run from the command line. If you use other software for some of the tasks, a reasonable approach is to include commented-out descriptions in your code that provide details on how you ran the software, e.g. what parameters you used.

SLIDE 5

Announcements

(6) The report file must be no more than 8 pages (single-sided), with NO MORE than 5 pages of text and NO MORE than 3 pages of figures / tables.

(7) For your report, you must describe what you did in detail (a good guide: have you provided enough detail that someone reading your report could replicate what you have done?). You also need to describe the results you have obtained from your analysis. You may also wish to include some text to describe interpretations and conclusions that may be of interest to your collaborator, including statistical and, possibly, biological interpretations. For your Figures and Tables, note that clarity and clear labels are a strategy for maximizing your grade.

(8) We will grade on two broad criteria: 1. the overall quality of the analyses / report, 2. the amount of effort put into your project. Note that ‘effort’ does not mean running many analyses without thinking carefully about why you are running them or how they fit together to provide a clear picture of results. A guide to maximizing your grade on effort is to think carefully about how to produce the best possible report that you can and then put in as many hours as you wish to devote to the project to accomplish this objective (your effort level will be clear to us).

SLIDE 6

Announcements

2 The experiment and data

The experiment: Among the recent large-scale human genomics resources is Genetic European Variation in Health and Disease (gEUVADIS): http://www.geuvadis.org/ with samples from 4 different European populations. Each of these individuals was part of the 1000 Genomes project and their genomes were sequenced and analyzed to identify SNP genotypes. For expression profiling, lymphoblastoid cell lines (LCL) were generated from each sample and mRNA levels were quantified through RNA sequencing. Each of these gene expression measurements may be thought of as a phenotype and one can do a GWAS analysis on each individually, which is called an ‘expression Quantitative Trait Locus’ or eQTL analysis, an unnecessarily fancy name for a GWAS when the phenotype is gene expression.

What you have been provided is a small subset of these data that are publicly available. Specifically, you have been provided 50,000 of the SNP genotypes for 344 samples from the CEU (Utah residents with European ancestry), FIN (Finns), GBR (British), and TSI (Toscani) populations. For these same individuals, you have also been provided the expression levels of five genes. You have also been provided information on the population and gender of each of these individuals, and information regarding the position of each gene and SNP in the genome. A description of the broader data set from which these data were extracted can be found in: http://www.geuvadis.org/web/geuvadis/RNAseq-project and in other papers relating to analysis of the GEUVADIS data.

SLIDE 7

Announcements

The data: These have been provided to you in five total files: ‘phenotypes.csv’, ‘genotypes.csv’, ‘covars.csv’, ‘gene info.csv’, ‘SNP info.csv’. ‘phenotypes.csv’ contains the phenotype data for 344 samples and 5 genes. ‘genotypes.csv’ contains the SNP data for 344 samples and 50000 genotypes. ‘gene info.csv’ contains information about each gene that was measured. The ‘chromosome’ column indicates the chromosome where the gene is located, ‘start’ marks the position in the chromosome where the region of the gene begins and ‘end’ marks the position where the region ends, ‘symbol’ contains the common gene name of the measured transcript and ‘probe’ contains the ids of the transcripts that match the column names of the phenotype data. ‘SNP info.csv’ contains the additional information on the genotypes and has four columns. The 1st column contains the chromosome number of each SNP, the 2nd column contains the physical position of the SNP on the chromosome, the 3rd column contains the ‘rsID’ (the name of each SNP), in order.

SLIDE 8

Announcements

3 Your assignment and hints for getting started

Your GWAS assignment is to find the position of as many causal polymorphisms as possible for the five expressed genes using the data (note that each ‘hit’ will potentially indicate an eQTL). You may / should use any and as many analysis approaches as you think are useful to accomplish this goal. In your report, you will need to describe in detail what you did, why you did it, and describe results in a manner that your ‘non-statistical’ collaborator will be able to understand, e.g. explain your terms, provide interpretations, etc. A few hints:

  • Apply the applicable steps of a ‘minimum GWAS’ analysis.
  • In your report, justify why you applied each individual step and statistical approach.
  • In your report, provide a summary of your results and what they mean.
  • You may want to consider going to various resources online (e.g. genecards, UCSC genome browser, dbSNP, many others) to incorporate biological information into your interpretation and hypotheses concerning what you may have found.

  • Ask Olivia, Scott, and Jason for thoughts and ideas!

Good luck!

SLIDE 9

Review: Logistic GWAS

  • Now we have all the critical components for performing a GWAS with a case / control phenotype!
  • The procedure (and goals!) are the same as before: for a sample of n individuals, where for each we have measured a case / control phenotype and N genotypes, we perform N hypothesis tests
  • To perform these hypothesis tests, we need to run our IRLS algorithm (twice!) for EACH marker to get the MLE of the parameters under the alternative (= no restrictions on the betas!) and under the null (= null hypothesis parameters set to zero!) and use these to calculate our LRT test statistic for each marker
  • We then use these N LRT statistics to calculate N p-values by using a chi-square distribution (how do we do this in R?)
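
As a minimal sketch of the last step: for a chi-square distribution with 2 degrees of freedom, the upper-tail probability has the closed form exp(-x/2), so a p-value can be computed without a statistics library. In R, the equivalent call is `pchisq(LRT, df = 2, lower.tail = FALSE)`. The function name and the example LRT values below are illustrative assumptions, not course material.

```python
import math

def lrt_pvalue_df2(lrt):
    """Upper-tail p-value of a chi-square (df = 2) for one LRT statistic."""
    if lrt < 0:
        raise ValueError("LRT statistic must be non-negative")
    # For df = 2 the chi-square survival function is exactly exp(-x / 2).
    return math.exp(-lrt / 2.0)

# Hypothetical LRT statistics for three markers (one p-value per marker).
lrt_stats = [0.0, 5.991, 30.0]
p_values = [lrt_pvalue_df2(s) for s in lrt_stats]
```

The closed form only holds for df = 2; for other degrees of freedom a general chi-square routine is needed.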

SLIDE 10

Review: logistic hypothesis testing

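  • Recall that our null and alternative hypotheses are:

$$H_0: \beta_a = 0 \cap \beta_d = 0 \qquad H_A: \beta_a \neq 0 \cup \beta_d \neq 0$$

  • We will use the LRT for the null (0) and alternative (1):

$$LRT = -2\ln\Lambda = -2\ln\frac{L(\hat{\theta}_0|y)}{L(\hat{\theta}_1|y)}$$

$$LRT = -2\ln\Lambda = 2l(\hat{\theta}_1|y) - 2l(\hat{\theta}_0|y)$$

  • For our case, we need the following:

$$l(\hat{\theta}_1|y) = l(\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d|y) \qquad l(\hat{\theta}_0|y) = l(\hat{\beta}_\mu, 0, 0|y)$$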

SLIDE 11

Review: IRLS algorithm

  • For logistic regression (and GLMs in general!) we will construct an algorithm to find the parameters that correspond to the maximum of the log-likelihood:

$$l(\beta) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)) + (1 - y_i)\ln(1 - \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)) \right]$$

  • For logistic regression (and GLMs in general!) we will construct an Iterative Re-weighted Least Squares (IRLS) algorithm, which has the following structure:
  • 1. Choose starting values for the β’s. Since we have a vector of three β’s in our case, we assign these numbers and call the resulting vector β[0].
  • 2. Using the re-weighting equation (described next slide), update the β[t] vector.
  • 3. At each step t > 0 check if β[t+1] ≈ β[t] (i.e. if these are approximately equal) using an appropriate function. If the value is below a defined threshold, stop. If not, repeat steps 2, 3.
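
The three steps can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the course's implementation: the update used is the standard Newton / IRLS form for logistic regression, β[t+1] = β[t] + (XᵀWX)⁻¹Xᵀ(y − γ⁻¹(Xβ[t])) with W = diag(p(1 − p)), which is assumed to match the re-weighting equation referenced above. The convergence check (largest absolute change in β) is one possible choice, and the data and helper names are hypothetical.

```python
import math

def inv_logit(v):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def linear_predictor(row, beta):
    return sum(b * x for b, x in zip(beta, row))

def log_lik(X, y, beta):
    """Logistic log-likelihood l(beta) for 0/1 phenotypes y."""
    total = 0.0
    for row, yi in zip(X, y):
        p = inv_logit(linear_predictor(row, beta))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def irls_logistic(X, y, tol=1e-10, max_iter=50):
    k, n = len(X[0]), len(y)
    beta = [0.0] * k                                   # step 1: beta[0]
    for _ in range(max_iter):
        p = [inv_logit(linear_predictor(row, beta)) for row in X]
        w = [pi * (1.0 - pi) for pi in p]              # IRLS weights
        xtwx = [[sum(w[i] * X[i][a] * X[i][c] for i in range(n))
                 for c in range(k)] for a in range(k)]
        score = [sum(X[i][a] * (y[i] - p[i]) for i in range(n))
                 for a in range(k)]
        step = solve(xtwx, score)                      # step 2: update
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:            # step 3: converged?
            break
    return beta

# Hypothetical data: 10 individuals per genotype, (Xa, Xd) coded as
# A1A1 = (-1, -1), A1A2 = (0, 1), A2A2 = (1, -1); y = 1 for cases.
cases_per_genotype = {(-1.0, -1.0): 2, (0.0, 1.0): 5, (1.0, -1.0): 8}
X_alt, X_null, y = [], [], []
for (xa, xd), n_cases in cases_per_genotype.items():
    for j in range(10):
        X_alt.append([1.0, xa, xd])
        X_null.append([1.0])
        y.append(1 if j < n_cases else 0)

beta_alt = irls_logistic(X_alt, y)     # MLE under the alternative
beta_null = irls_logistic(X_null, y)   # MLE under the null (beta_a = beta_d = 0)
lrt = 2 * log_lik(X_alt, y, beta_alt) - 2 * log_lik(X_null, y, beta_null)
p_value = math.exp(-lrt / 2)           # chi-square (df = 2) upper tail
```

Running the algorithm twice, once with the full design matrix and once with the intercept only, gives the two log-likelihoods needed for the LRT of a single marker. The sketch does not guard against perfectly separated data, where fitted probabilities reach 0 or 1.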

SLIDE 12

Review: Logistic Regression LRT

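  • For our logistic regression, the LRT has the same equations:

$$LRT = -2\ln\Lambda = 2l(\hat{\theta}_1|y) - 2l(\hat{\theta}_0|y)$$

  • Using the following estimates for the null hypothesis and the alternative, making use of the IRLS algorithm:

$$\hat{\theta}_0 = \{\hat{\beta}_\mu, \hat{\beta}_a = 0, \hat{\beta}_d = 0\} \qquad \hat{\theta}_1 = \{\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d\}$$

$$l(\hat{\theta}_1|y) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d)) + (1 - y_i)\ln(1 - \gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d)) \right]$$

$$l(\hat{\theta}_0|y) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat{\beta}_\mu)) + (1 - y_i)\ln(1 - \gamma^{-1}(\hat{\beta}_\mu)) \right]$$

  • Under the null hypothesis, the LRT is still distributed as a chi-square with 2 degrees of freedom (why?):

$$LRT \rightarrow \chi^2_{df=2}$$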

SLIDE 13

Review: Logistic Regression p-value

  • To calculate our p-value, we need to know the distribution of our LRT statistic under the null hypothesis:

$$LRT = -2\ln\Lambda = 2l(\hat{\theta}_1|y) - 2l(\hat{\theta}_0|y)$$

  • There is no simple form for this distribution for any given n (contrast with F-statistics!!) but we know that as n goes to infinity, the distribution is chi-square, i.e.:

$$LRT \rightarrow \chi^2_{df} \quad \text{as } n \rightarrow \infty$$

SLIDE 14

Review: logistic covariates

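  • For our logistic regression with a covariate, the LRT has the same equations:

$$LRT = -2\ln\Lambda = 2l(\hat{\theta}_1|y) - 2l(\hat{\theta}_0|y)$$

  • Using the following estimates for the null hypothesis and the alternative, making use of the IRLS algorithm (just add an additional parameter!):

$$\hat{\theta}_0 = \{\hat{\beta}_\mu, \hat{\beta}_a = 0, \hat{\beta}_d = 0, \hat{\beta}_z\} \qquad \hat{\theta}_1 = \{\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d, \hat{\beta}_z\}$$

$$l(\hat{\theta}_1|y) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d + x_{i,z}\hat{\beta}_z)) + (1 - y_i)\ln(1 - \gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d + x_{i,z}\hat{\beta}_z)) \right]$$

$$l(\hat{\theta}_0|y) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat{\beta}_\mu + x_{i,z}\hat{\beta}_z)) + (1 - y_i)\ln(1 - \gamma^{-1}(\hat{\beta}_\mu + x_{i,z}\hat{\beta}_z)) \right]$$

  • Under the null hypothesis, the LRT is still distributed as a chi-square with 2 degrees of freedom (why?):

$$LRT \rightarrow \chi^2_{df=2}$$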

SLIDE 15
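
Logistic covariates: summary

  • How about with covariates? Say you need to include a single covariate “Z” (note: same structure for more than one). We start with the same hypotheses:

$$H_0: \beta_a = 0 \cap \beta_d = 0 \qquad H_A: \beta_a \neq 0 \cup \beta_d \neq 0$$

  • We need the logistic model for this case, i.e. $Y_i = \gamma^{-1}(X\beta) + \epsilon_i$:

$$Y_i = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d + x_{i,z}\beta_z}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d + x_{i,z}\beta_z}} + \epsilon_i$$

  • And the associated likelihood equation:

$$l(\beta) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d + x_{i,z}\beta_z)) + (1 - y_i)\ln(1 - \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d + x_{i,z}\beta_z)) \right]$$

  • Where we need to substitute the $\hat{\beta}$ for the following two cases:

$$l(\hat{\theta}_1|y) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d + x_{i,z}\hat{\beta}_z)) + (1 - y_i)\ln(1 - \gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d + x_{i,z}\hat{\beta}_z)) \right]$$

$$l(\hat{\theta}_0|y) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat{\beta}_\mu + x_{i,z}\hat{\beta}_z)) + (1 - y_i)\ln(1 - \gamma^{-1}(\hat{\beta}_\mu + x_{i,z}\hat{\beta}_z)) \right]$$

  • So use the same IRLS algorithm with the appropriate equation and x matrix with new columns (run the algorithm twice as before!):

$$MLE(\hat{\beta}) = MLE(\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d, \hat{\beta}_z)$$

  • Substitute the MLEs, calculate the LRT:

$$LRT = -2\ln\Lambda = 2l(\hat{\theta}_1|y) - 2l(\hat{\theta}_0|y)$$

  • And use a chi-square distribution to calculate the p-val!

$$LRT \rightarrow \chi^2_{df}$$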
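
To make "x matrix with new columns" concrete, here is a small sketch of how the two design matrices (alternative and null) might be built for one marker and one covariate. The genotype coding (X_a = -1, 0, 1 and X_d = -1, 1, -1) follows the convention this lecture uses for haplotype genotypes. The function name, the allele-count input format, and the example data are assumptions for illustration.

```python
# Assumed input: genotype given as the number of A2 alleles (0, 1 or 2).
# Mapping to (Xa, Xd): A1A1 -> (-1, -1), A1A2 -> (0, 1), A2A2 -> (1, -1).
CODING = {0: (-1.0, -1.0), 1: (0.0, 1.0), 2: (1.0, -1.0)}

def design_matrices(genotypes, covariate):
    """Return (X_alt, X_null) for one marker and one covariate z."""
    x_alt, x_null = [], []
    for g, z in zip(genotypes, covariate):
        xa, xd = CODING[g]
        # Columns for beta_mu, beta_a, beta_d, beta_z.
        x_alt.append([1.0, xa, xd, float(z)])
        # Under the null beta_a = beta_d = 0, so only beta_mu and beta_z remain.
        x_null.append([1.0, float(z)])
    return x_alt, x_null

genotypes = [0, 1, 2, 1]   # hypothetical A2 allele counts
covariate = [0, 1, 1, 0]   # hypothetical binary covariate
X_alt, X_null = design_matrices(genotypes, covariate)
```

The alternative matrix has two more columns than the null matrix, which is where the 2 degrees of freedom of the LRT come from: the covariate parameter is estimated under both hypotheses.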

SLIDE 16

Review: Generalized Linear Models (GLMs)

  • To introduce GLMs, we will introduce the overall structure first, and second describe how linear and logistic models fit into this framework
  • There is some variation in presenting the properties of a GLM, but we will present them using three (models that have these properties are considered GLMs):
  • The probability distribution of the response variable Y conditional on the independent variable X is in the exponential family of distributions:

$$\Pr(Y|X) \sim \text{expfamily}$$

  • A link function relating the independent variables and parameters to the expected value of the response variable (where we often use the inverse!!):

$$\gamma: E(Y|X) \rightarrow X\beta, \qquad \gamma(E(Y|X)) = X\beta, \qquad E(Y|X) = \gamma^{-1}(X\beta)$$

  • The error random variable $\epsilon$ has a variance which is a function of ONLY $X\beta$

SLIDE 17

Review: inference with GLMs

  • We perform inference in a GLM framework using the same approach, i.e. MLE of the beta parameters using an IRLS algorithm (just substitute the appropriate link function in the equations, etc.)
  • We can also perform a hypothesis test using an LRT (where the sampling distribution as the sample size goes to infinity is chi-square)
  • In short, what you have learned can be applied for most types of regression modeling you will likely need to apply (!!)
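
As a small illustration of "substitute the appropriate link function": linear regression corresponds to the identity link and logistic regression to the logit link, so the expected response differs only in the inverse link applied to the linear predictor Xβ. The function names and numbers below are hypothetical.

```python
import math

def identity(eta):
    """Inverse link for linear regression: E(Y|X) = X beta."""
    return eta

def logistic(eta):
    """Inverse link for logistic regression: E(Y|X) = e^eta / (1 + e^eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

def expected_response(x_row, beta, inverse_link):
    eta = sum(x * b for x, b in zip(x_row, beta))   # linear predictor X beta
    return inverse_link(eta)

x_row = [1.0, 0.0, 1.0]     # hypothetical (1, Xa, Xd) for a heterozygote
beta = [0.5, 2.0, -0.5]     # hypothetical (beta_mu, beta_a, beta_d)
linear_mean = expected_response(x_row, beta, identity)
logistic_mean = expected_response(x_row, beta, logistic)
```

Passing the inverse link as an argument keeps the rest of the fitting code unchanged when the model changes.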

SLIDE 18

Alternative tests in GWAS I

  • Since our basic null / alternative hypothesis construction in GWAS covers a large number of possible relationships between genotypes and phenotypes, there are a large number of tests that we could apply in a GWAS
  • e.g. t-tests, ANOVA, Wald’s test, non-parametric permutation based tests, Kruskal-Wallis tests, other rank based tests, chi-square, Fisher’s exact, Cochran-Armitage, etc. (see PLINK for a somewhat comprehensive list of tests used in GWAS)
  • When can we use different tests? The only restriction is that our data conform to the assumptions of the test (examples?)
  • We could therefore apply a diversity of tests for any given GWAS

SLIDE 19

Alternative tests in GWAS II

  • Should we use different tests in a GWAS (and why)? Yes we should. The reason is that different tests have different performance depending on the (unknown) conditions of the system and experiment, i.e. some may perform better than others
  • In general, since we don’t know the true conditions (and therefore which test will be best suited) we should run a number of tests and compare results
  • How to compare results of different GWAS is a fuzzy case (= no unconditional rules) but a reasonable approach is to treat each test as a distinct GWAS analysis and compare the hits across analyses using the following rules:
  • If all methods identify the same hits (= genomic locations) then this is good evidence that there is a causal polymorphism
  • If methods do not agree on the position (e.g. some are significant, some are not) we should attempt to determine the reason for the discrepancy (this requires that we understand the tests and have experience)

SLIDE 20

Alternative tests in GWAS III

  • We do not have time in this course to do a comprehensive review of possible tests (keep in mind, every time you learn a new test in a statistics class, there is a good chance you could apply it in a GWAS!)
  • Let’s consider a few examples of alternative tests that could be applied
  • Remember that to apply these alternative tests, you will perform N alternative tests, one for each marker-phenotype combination, where for each case, we are testing the following hypotheses with different (implicit) codings of X (!!):

$$H_0: Cov(Y, X) = 0 \qquad H_A: Cov(Y, X) \neq 0$$

SLIDE 21

Alternative test examples I

  • First, let’s consider a case-control phenotype and consider a chi-square test (which has deep connections to our logistic regression test under certain assumptions but it has slightly different properties!)
  • To construct the test statistic, we consider the counts of genotype-phenotype combinations (first table) and calculate the expected numbers in each cell (second table):

Observed counts:

        Case    Control
A1A1    n11     n12       n1.
A1A2    n21     n22       n2.
A2A2    n31     n32       n3.
        n.1     n.2       n

Expected counts:

        Case          Control
A1A1    (n.1n1.)/n    (n.2n1.)/n    n1.
A1A2    (n.1n2.)/n    (n.2n2.)/n    n2.
A2A2    (n.1n3.)/n    (n.2n3.)/n    n3.
        n.1           n.2           n

  • We then construct the following test statistic:

$$LRT = -2\ln\Lambda = -2\sum_{i=1}^{3}\sum_{j=1}^{2} n_{ij}\ln\left(\frac{n_{i.}n_{.j}}{n\,n_{ij}}\right)$$

  • Where the (asymptotic) distribution when the null hypothesis is true, i.e. as the sample size tends to infinity, is $\chi^2_{d.f.=2}$, since d.f. = (#columns-1)(#rows-1) = 2
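
The calculation can be sketched as follows for a 3 x 2 genotype-by-phenotype table. The likelihood ratio statistic compares each observed count with its expected count (row total x column total / n). The Pearson form of the chi-square statistic is included for comparison, which is an addition of this sketch. The counts and function names are hypothetical.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return [[rt * ct / n for ct in col_tot] for rt in row_tot]

def lrt_statistic(table):
    """-2 * sum n_ij * ln(expected_ij / n_ij) over all cells."""
    stat = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            if o > 0:   # an empty cell contributes nothing to the sum
                stat += o * math.log(e / o)
    return -2.0 * stat

def pearson_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected_counts(table))
               for o, e in zip(obs_row, exp_row))

# Hypothetical counts: rows A1A1, A1A2, A2A2; columns (case, control).
counts = [[2, 8], [5, 5], [8, 2]]
lrt = lrt_statistic(counts)
pearson = pearson_statistic(counts)
p_value = math.exp(-lrt / 2)   # chi-square upper tail, valid for d.f. = 2 only
```

Both statistics are compared to the same chi-square distribution with 2 degrees of freedom; they agree asymptotically but can differ in small samples.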

SLIDE 22

Alternative test examples II

  • Second, let’s consider a Fisher’s exact test
  • Note that the LRT for the null hypothesis under the chi-square test was only asymptotically exact, i.e. it is exact as sample size n approaches infinity but it is not exact for smaller sample sizes (although we hope it is close!)
  • Could we construct a test that is exact for smaller sample sizes? Yes, we can calculate a Fisher’s test statistic for our sample, where the distribution under the null hypothesis is exact for any sample size (I will let you look up how to calculate this statistic and the distribution under the null on your own):

        Case    Control
A1A1    n11     n12
A1A2    n21     n22
A2A2    n31     n32

  • Given this test is exact, why would we ever use Chi-square / what is a rule for when we should use one versus the other?

SLIDE 23

Alternative test examples III

  • Third, let’s consider ways of grouping the cells, where we could apply either a chi-square or a Fisher’s exact test
  • For MAF = A1, we can apply a “recessive” (first table) and “dominance” test (second table):

               Case    Control
A1A1           n11     n12
A1A2 ∪ A2A2    n21     n22

               Case    Control
A1A1 ∪ A1A2    n11     n12
A2A2           n21     n22

  • We could also apply an “allele test” (note these test names are from PLINK):

      Case    Control
A1    n11     n12
A2    n21     n22

  • When should we expect one of these tests to perform better than the others?
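
A sketch of these groupings, followed by a two-sided Fisher's exact test on a resulting 2 x 2 table. The two-sided p-value here sums the hypergeometric probabilities of all tables that are no more likely than the observed one, which is one common convention and an assumption of this sketch. Function names and counts are hypothetical.

```python
from math import comb

def collapse(table):
    """table: rows A1A1, A1A2, A2A2; columns (case, control)."""
    (a, b), (c, d), (e, f) = table
    recessive = [[a, b], [c + e, d + f]]          # A1A1 vs. A1A2 or A2A2
    dominance = [[a + c, b + d], [e, f]]          # A1A1 or A1A2 vs. A2A2
    # Allele test: count alleles, so each homozygote contributes two.
    allele = [[2 * a + c, 2 * b + d], [c + 2 * e, d + 2 * f]]
    return recessive, dominance, allele

def fisher_exact_2x2(t):
    """Two-sided Fisher's exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = t
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))

counts = [[2, 8], [5, 5], [8, 2]]     # hypothetical genotype counts
recessive, dominance, allele = collapse(counts)
p_recessive = fisher_exact_2x2(recessive)
```

Note that the allele table treats the two alleles of an individual as separate observations, so its total is 2n.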

SLIDE 24

Comparing results of multiple analyses of the same GWAS data I

  • I’ve run my initial analyses using several tests and produced the following (now what!?):

SLIDE 25

Comparing results of multiple analyses of the same GWAS data II

  • The best case is that the same markers (SNPs) pass a multiple test correction regardless of the testing approach used, i.e. the result is robust to testing approach.
  • In cases where this does not happen (most) it becomes helpful to understand why test results could be different:
  • Are tests capturing additive vs. dominance effects?
  • Are some tests less powerful than others, or do they depend on certain assumptions being true? Are they handling missing data in different ways?
  • Are particular covariates altering the results if included/excluded? Why might this be?
  • Does it depend on how you partition the data (e.g. batch effects)?
  • This can help narrow down the set of tests you feel are the most informative. In general, a good publishing strategy is limiting yourself to one or two tests that both give you significant results that you believe!
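
One way to organize such a comparison is to compute the set of hits per test and then look at the overlap. The sketch below uses a Bonferroni correction as the multiple test correction, which is an assumption (the slide does not name a specific correction). The test names, SNP names, and p-values are made up for illustration.

```python
def bonferroni_hits(p_values, alpha=0.05):
    """Markers whose p-value passes alpha / (number of tests performed)."""
    threshold = alpha / len(p_values)
    return {snp for snp, p in p_values.items() if p < threshold}

# Hypothetical p-values from two testing approaches on the same four SNPs.
results = {
    "logistic_lrt": {"snp1": 1e-9, "snp2": 0.20, "snp3": 4e-4, "snp4": 0.70},
    "chi_square":   {"snp1": 3e-8, "snp2": 0.15, "snp3": 0.03, "snp4": 0.65},
}

hits = {test: bonferroni_hits(p) for test, p in results.items()}
robust = set.intersection(*hits.values())          # significant in every test
discordant = set.union(*hits.values()) - robust    # significant in only some
```

Markers in `robust` are the best case described above. Markers in `discordant` are the ones where it is worth asking why the tests disagree.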

SLIDE 26

Haplotype testing I

  • We have just extended our GWAS framework to handle additional phenotypes
  • We can also extend our GWAS framework to handle genotypes defined using a different approach
  • In this case, let’s consider using haplotype alleles in our testing framework
  • Note that a haplotype collapses genetic marker information but in some cases, testing using haplotypes is more effective than testing one genetic marker at a time

SLIDE 27

Haplotype testing II

  • Haplotype - a series of ordered, linked alleles that are inherited together
  • For the moment, let’s consider a haplotype to define a “function” that takes a set of alleles at several loci A, B, C, D, etc. and outputs a haplotype allele:

$$h = f(A_i, B_j, ...)$$

  • For example, if these loci are each a SNP with the following alleles (A,G), (A,T), (G,C), (G,C) we could define the following haplotype alleles:

$$h_1 = (A, A, C, C) \qquad h_2 = (G, T, G, G)$$

SLIDE 28

Haplotype testing III

  • Note that how we define haplotype alleles is somewhat arbitrary but in general, we define a haplotype for a set of genetic markers (loci) that are physically linked and that frequently occur in a population
  • How many markers is somewhat arbitrary, e.g. we often define sets that match observed patterns of LD
  • How many haplotype alleles we define is also somewhat arbitrary, where we define haplotype alleles that have appreciable frequency in the population
  • For example, for the four loci with alleles (A,G), (A,T), (G,C), (G,C) how many haplotype alleles could we define?
  • However, it could be that only the following two combinations have relatively “high” allele frequencies (say >0.05 = arbitrary!):

$$h_1 = (A, A, C, C) \qquad h_2 = (G, T, G, G)$$

  • In such a case, we can collapse the many alleles into just a few!

SLIDE 29

Haplotype testing IV

  • As an example of haplotype allele collapsing, say for our case of four loci (A,G), (A,T), (G,C), (G,C), we have lots of LD (!!) such that there are only 4 alleles in the population (i.e. all other combinations have frequency of zero!):

$$h^*_1 = (A, A, C, C), \quad h^*_2 = (G, T, G, G), \quad h^*_3 = (A, A, G, C), \quad h^*_4 = (G, T, C, G)$$

  • Let’s also say that the frequencies of the third and fourth of these in the population are < 0.01
  • In this case, we can define just two haplotype alleles that collapse the other alleles as follows (where * means “any” genetic marker allele):

$$h_1 = (A, A, *, C) \qquad h_2 = (G, T, *, G)$$

such that $h_1 = h^*_1 \cup h^*_3$ and $h_2 = h^*_2 \cup h^*_4$.

  • NOTE: we are therefore losing information using this approach!!
  • It is often the case that only a few haplotypes are at appreciable frequency
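
The collapsing rule can be written as a small pattern match, where `*` matches any marker allele. The patterns are the h1 and h2 defined above. The function names and the handling of unmatched haplotypes (returning `None`) are choices made for this sketch.

```python
# Haplotype allele definitions from the example; "*" matches any allele.
PATTERNS = {
    "h1": ("A", "A", "*", "C"),
    "h2": ("G", "T", "*", "G"),
}

def matches(pattern, haplotype):
    return all(p == "*" or p == a for p, a in zip(pattern, haplotype))

def collapse(haplotype):
    """Map an observed four-locus haplotype to a collapsed haplotype allele."""
    for name, pattern in PATTERNS.items():
        if matches(pattern, haplotype):
            return name
    return None   # not one of the defined haplotype alleles

observed = [("A", "A", "C", "C"), ("G", "T", "G", "G"),
            ("A", "A", "G", "C"), ("G", "T", "C", "G")]
collapsed = [collapse(h) for h in observed]
```

The third locus no longer influences the result, which is exactly the information lost by collapsing.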

SLIDE 30

GWAS with haplotypes I

  • Once we have defined haplotype alleles, we can proceed with a GWAS using our framework (just substitute haplotype alleles and genotypes for genetic marker alleles and genotypes!)
  • For example, in a case where we only have two haplotype alleles, we can code our independent variables for our regression model as follows:

$$X_a(h_1h_1) = -1, \quad X_a(h_1h_2) = 0, \quad X_a(h_2h_2) = 1$$

$$X_d(h_1h_1) = -1, \quad X_d(h_1h_2) = 1, \quad X_d(h_2h_2) = -1$$

  • All other aspects remain the same (although what is the effect on our interpretation of where the causal polymorphism is located?)
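
The coding above translates directly into a small helper that takes the two haplotype alleles carried by a diploid individual. The function name and the string labels "h1" / "h2" are assumptions of this sketch.

```python
def code_haplotype_genotype(allele_1, allele_2):
    """Return (Xa, Xd) for a diploid genotype made of haplotype alleles h1 / h2."""
    n_h2 = [allele_1, allele_2].count("h2")   # number of h2 copies: 0, 1 or 2
    xa = n_h2 - 1                             # -1, 0, 1
    xd = 1 if n_h2 == 1 else -1               # 1 for heterozygotes, else -1
    return xa, xd

codes = [code_haplotype_genotype(*g)
         for g in (("h1", "h1"), ("h1", "h2"), ("h2", "h1"), ("h2", "h2"))]
```

The order of the two alleles does not matter, so h1h2 and h2h1 receive the same coding.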

SLIDE 31

GWAS with haplotypes II

  • Given that we are losing information by using a haplotype testing approach in a GWAS, why might we want to use this approach?
  • As one example consider the following case of haplotypes in a population:

A1 B1 (C1)* D2 E1
A1 B2 (C1)* D1 E1
A2 B1 (C1)* D1 E1
A1 B1 (C1)* D1 E2

A2 B2 (C2)* D1 E2
A2 B1 (C2)* D2 E2
A1 B2 (C2)* D2 E2
A2 B2 (C2)* D2 E1

SLIDE 32

Advantages of haplotype testing

  • In some cases (system and sample dependent!), the haplotype is a better “tag” of the causal polymorphism than any of the surrounding markers
  • In such a case, Corr(Xh, X) > Corr(X’, X), where Xh is the haplotype, X’ is an individual marker and X is the causal polymorphism, and the haplotype test therefore has a higher probability of correctly rejecting the null hypothesis
  • Another “advantage” is that by putting together markers, we are performing fewer total tests in our GWAS (in what sense is this an advantage!?)
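
The eight example haplotypes from the previous slide illustrate this numerically. In the sketch below each haplotype is assumed to be equally frequent, and the haplotype allele is defined by a majority rule over the four measured markers (A, B, D, E). Both are assumptions made here for illustration; the slide does not state them.

```python
from statistics import mean

# (A, B, C, D, E) alleles coded 1 or 2; C is the (unmeasured) causal site.
HAPLOTYPES = [
    (1, 1, 1, 2, 1), (1, 2, 1, 1, 1), (2, 1, 1, 1, 1), (1, 1, 1, 1, 2),
    (2, 2, 2, 1, 2), (2, 1, 2, 2, 2), (1, 2, 2, 2, 2), (2, 2, 2, 2, 1),
]
MARKERS = {"A": 0, "B": 1, "D": 3, "E": 4}

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

causal = [h[2] for h in HAPLOTYPES]

# Correlation of each single measured marker with the causal site.
marker_corr = {name: corr([h[i] for h in HAPLOTYPES], causal)
               for name, i in MARKERS.items()}

# Hypothetical haplotype allele: 1 if at least three of the four measured
# markers carry allele 1, otherwise 2.
hap_allele = [1 if sum(h[i] == 1 for i in MARKERS.values()) >= 3 else 2
              for h in HAPLOTYPES]
hap_corr = corr(hap_allele, causal)
```

Under these assumptions every single marker disagrees with the causal site on two of the eight haplotypes, while the haplotype allele matches it on all eight, so the haplotype is the better tag.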

SLIDE 33

Disadvantages of haplotype testing

  • Collapsing to haplotypes may produce a better tag but it also may not (!!), i.e. sometimes (in fact often!) individual genetic markers are better tags of the causal polymorphism
  • Another disadvantage is resolution, since we absolutely cannot resolve the position of the causal polymorphism to a region smaller than the range of the haplotype alleles, i.e. large haplotypes give lower resolution
  • If we had measured the causal polymorphism in our data, should we use haplotype testing? (i.e. in the future, the importance of haplotype testing may decrease)

SLIDE 34

Should I apply haplotype testing in my GWAS?

  • Yes! But apply both an individual marker testing approach (always!) as well as a haplotype test (optional)
  • The reason is that we never know the true answer in our GWAS (as with any statistical analysis!) so it doesn’t hurt us to explore our dataset with as many techniques as we want to apply
  • In fact, this will be a continuing theme of the class, i.e. keep analyzing GWAS with as many methods as you find useful
  • However, since we never know the right answer for certain, if we get conflicting results, which one do we interpret as “correct”!?

SLIDE 35

Where do haplotypes come from?

  • A deep discussion of the origin of haplotypes (remember: a fuzzy definition!) is another subject that is in the realm of population genetics and therefore we cannot discuss this in detail in this class (again: I encourage you to take a class on population genetics!)
  • However, we can get an intuition about where haplotypes come from by remembering that new haplotype alleles originate from mutations and can also be produced by recombination
  • In fact, these two processes also underlie the amount of LD in the population and therefore what blocks of alleles are inherited as a haplotype (and we therefore use them to define haplotypes using system specific criteria)

SLIDE 36

Defining haplotypes

  • We could spend multiple lectures on how people define haplotypes for given systems and the algorithms used for this purpose (so we will just briefly mention the main concepts here)
  • To define haplotypes, we need to “phase” measured genotype markers, decide on the number of genotype markers to put together into a haplotype block, and decide how many haplotype alleles to consider
  • Remember: there are no universal rules for doing this (system dependent!)

SLIDE 37

Phasing haplotypes

  • To get a sense of the phasing problem, consider a case where we have two markers that are right next to each other on a chromosome and we know we want to put them together in a haplotype block
  • Say one marker is (A,T) and the other marker is (G,C) and we are considering a diploid individual who is a heterozygote for both of these markers. Which of the marker alleles are physically linked in this individual?
  • Figuring this out for individuals in a sample is the phasing problem and there are many algorithms for accomplishing this goal (note that in the future, technology may make this a non-issue...)
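
To see why the double heterozygote is ambiguous, the sketch below enumerates every pair of haplotypes consistent with a set of unphased genotypes. It only lists the possibilities; it is not a phasing algorithm, and the function name and input format are assumptions.

```python
from itertools import product

def possible_phasings(genotypes):
    """genotypes: one (allele, allele) tuple per marker, unphased.

    Returns the set of unordered haplotype pairs consistent with them.
    """
    pairs = set()
    # For each marker, choose which allele sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = tuple(g[c] for g, c in zip(genotypes, choice))
        hap2 = tuple(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(frozenset((hap1, hap2)))   # chromosome order is irrelevant
    return pairs

double_het = [("A", "T"), ("G", "C")]
phasings = possible_phasings(double_het)
```

For the individual on the slide this gives two possibilities: (A,G) with (T,C), or (A,C) with (T,G). An individual that is homozygous at one of the two markers has only one possible phasing, so the ambiguity comes entirely from sites that are jointly heterozygous.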

SLIDE 38

Deciding on how many genotypes to include in a haplotype block

  • Again, while there is no set rule, how we decide on genotypes to include in a haplotype block depends on LD
  • The general rule: if we have a set of markers in high LD with each other but low LD with other markers, we use this as a guide for defining the haplotype block

[Figure: plot over a genomic region spanning positions 211.5 to 212.0 that contains the genes RCOR3, TRAF5, C1orf97, RD3, SLC30A1, NEK2 and LPGAT1. Legend entries: Single marker test, Conditional test, VBAY, Lasso, Adaptive Lasso, 2D-MCP, LOG, NEG; independent hit vs. non-independent hit.]

SLIDE 39

Deciding on how many haplotype alleles to consider

  • Again, there are no set rules for how many haplotype alleles to define, but in general, we define a set where the frequency in a population is above some MAF threshold (which depends on the system)
  • With a MAF cutoff of say 0.05, this generally limits us to 2-5 haplotype alleles (e.g. in humans!)
  • There are however cases where we might want to consider rarer haplotypes (what are some of these?)
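
A frequency cutoff of this kind can be sketched with a simple count over phased haplotypes. The 0.05 default mirrors the example cutoff above. The function name and the sample of 100 haplotypes are hypothetical.

```python
from collections import Counter

def common_haplotype_alleles(haplotypes, min_freq=0.05):
    """Return {haplotype: frequency} for haplotypes at or above min_freq."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {h: c / total for h, c in counts.items() if c / total >= min_freq}

# Hypothetical sample of 100 phased four-locus haplotypes.
sample = ([("A", "A", "C", "C")] * 58 + [("G", "T", "G", "G")] * 40
          + [("A", "A", "G", "C")] * 1 + [("G", "T", "C", "G")] * 1)
common = common_haplotype_alleles(sample)
```

In this sample only two of the four observed haplotypes pass the cutoff. The two rare ones would either be dropped or collapsed into the common alleles.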

SLIDE 40

Haplotype GWAS wrap-up

  • Haplotypes are a physical and sampling consequence of how genetic systems work (just like LD!)
  • Definitions of haplotype blocks and haplotype alleles depend on the system and context (fuzzy definition)
  • Regardless of how we define them, once we have haplotype alleles, we can use them as we would genetic markers in our GWAS analysis framework
  • While optional, it is never a bad idea to perform a haplotype analysis of your GWAS in addition to your single marker analysis (ALWAYS do a single marker analysis)

SLIDE 41

That’s it for today

  • See you on Thurs.!