Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 - - PowerPoint PPT Presentation

quantitative genomics and genetics btry 4830 6830 pbsb
SMART_READER_LITE
LIVE PREVIEW

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 - - PowerPoint PPT Presentation

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture18: Alternative tests and haplotype testing Jason Mezey jgm45@cornell.edu April 18, 2017 (Th) 8:40-9:55 Announcements Project is posted (!!) Midterm grades will


slide-1
SLIDE 1

Jason Mezey jgm45@cornell.edu April 18, 2017 (Th) 8:40-9:55

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01

Lecture18: Alternative tests and haplotype testing

slide-2
SLIDE 2

Announcements

  • Project is posted (!!)
  • Midterm grades will be available Fri.
  • Information about the Final:
  • Same format as midterm (take-home, work on your own)
  • Cumulative as far as material covered
  • Scheduling (NOT FINALIZED

YET!) probably available Fri. May 19 and Due Mon., May 21

slide-3
SLIDE 3

Conceptual Overview

Genetic System

Does A1 -> A2 affect Y?

Sample or experimental pop

Measured individuals (genotype, phenotype)

Pr(Y|X)

Model params

Reject / DNR

Regression model

F-test

slide-4
SLIDE 4

Review: Logistic GWAS

  • Now we have all the critical components for performing a GWAS

with a case / control phenotype!

  • The procedure (and goals!) are the same as before, for a sample of n

individuals where for each we have measured a case / control phenotype and N genotypes, we perform N hypothesis tests

  • To perform these hypothesis tests, we need to run our IRLS

algorithm for EACH marker to get the MLE of the parameters under the alternative (= no restrictions on the beta’s!) and use these to calculate our LRT test statistic for each marker

  • We then use these N LRT statistics to calculate N p-values by using

a chi-square distribution (how do we do this is R?)

slide-5
SLIDE 5

Introduction to logistic covariates

  • Recall that in a GWAS, we are considering the following

regression model and hypotheses to assess a possible association for every marker with the phenotype

  • Also recall that with these hypotheses we are actually

testing:

Y = −1(µ + Xaa + Xdd)

H0 : a = 0 \ d = 0 HA : a 6= 0 [ d 6= 0

H0 : Cov(Y, Xa) = 0 \ Cov(Y, Xd) = 0 HA : Cov(Y, Xa) 6= 0 [ Cov(Y, Xd) 6= 0

slide-6
SLIDE 6

Modeling logistic covariates I

  • Therefore, if we have a factor that is correlated with our

phenotype and we do not handle it in some manner in our analysis, we risk producing false positives AND/OR reduce the power of our tests!

  • The good news is that, assuming we have measured the

factor (i.e. it is part of our GWAS dataset) then we can incorporate the factor in our model as a covariate:

  • The effect of this is that we will estimate the covariate

model parameter and this will account for the correlation of the factor with phenotype (such that we can test for our marker correlation without false positives / lower power!)

Y = −1(µ + Xaa + Xdd + Xzz)

slide-7
SLIDE 7

Modeling logistic covariates II

  • For our a logistic regression, our LRT (logistic) we have the same

equations:

  • Using the following estimates for the null hypothesis and the alternative

making use of the IRLS algorithm (just add an additional parameter!):

  • Under the null hypothesis, the LRT is still distributed as a Chi-square with

2 degree of freedom (why?):

LRT = −2lnΛ = 2l(ˆ θ1|y) − 2l(ˆ θ0|y) l(ˆ θ1|y) =

n

X

i=1

h yiln(γ−1(ˆ βµ + xi,a ˆ βa + xi,d ˆ βd + xi,z ˆ βz)) + (1 − yi)ln(1 − γ−1(ˆ βµ + xi,a ˆ βa + xi,d ˆ βd + xi,z ˆ βz)) i

LRT ! χ2

d f=2

ˆ θ0 = {ˆ βµ, ˆ βa = 0, ˆ βd = 0, ˆ βz}

ˆ θ1 = {ˆ βµ, ˆ βa, ˆ βd, ˆ βz}

l(ˆ ✓0|y) =

n

X

i=1

 yiln(−1(ˆ µ + xi,z ˆ z)) + (1 yi)ln(1 −1(ˆ µ + xi,z ˆ z))

slide-8
SLIDE 8

Inference with GLMs

  • We perform inference in a GLM framework using the

same approach, i.e. MLE of the beta parameters using an IRLS algorithm (just substitute the appropriate link function in the equations, etc.)

  • We can also perform a hypothesis test using a LRT

(where the sampling distribution as the sample size goes to infinite is chi-square)

  • In short, what you have learned can be applied for most

types of regression modeling you will likely need to apply (!!)

slide-9
SLIDE 9

Introduction to Generalized Linear Models (GLMs) I

  • We have introduced linear and logistic regression models for

GWAS analysis because these are the most versatile framework for performing a GWAS (there are many less versatile alternatives!)

  • These two models can handle our genetic coding (in fact any genetic

coding) where we have discrete categories (although they can also handle X that can take on a continuous set of values!)

  • They can also handle (the sampling distribution) of phenotypes that

have normal (linear) and Bernoulli error (logistic)

  • How about phenotypes with different error (sampling)

distributions? Linear and logistic regression models are members of a broader class called Generalized Linear Models (GLMs), where

  • ther models in this class can handle additional phenotypes (error

distributions)

slide-10
SLIDE 10

Introduction to Generalized Linear Models (GLMs) II

  • To introduce GLMs, we will introduce the overall structure first, and second

describe how linear and logistic models fit into this framework

  • There is some variation in presenting the properties of a GLM, but we will present

them using three (models that have these properties are considered GLMs):

  • The probability distribution of the response variable

Y conditional on the independent variable X is in the exponential family of distributions

  • A link function relating the independent variables and parameters to the

expected value of the response variable (where we often use the inverse!!)

  • The error random variable has a variance which is a function of ONLY

. Pr(Y |X) ∼ expfamily.

: : E(Y|X) → X, (E(Y|X)) = X

E(Y|X) = −1(X)

le ✏

= X

slide-11
SLIDE 11

Exponential family I

  • The exponential family is includes a broad set of probability distributions that can

be expressed in the following `natural’ form:

  • As an example, for the normal distribution, we have the following:
  • Note that many continuous and discrete distributions are in this family (normal,

binomial, poisson, lognormal, multinomial, several categorical distributions, exponential, gamma distribution, beta distribution, chi-square) but not all (examples that are not!?) and since we can model response variables with these distributions, we can model phenotypes with these distributions in a GWAS using a GLM (!!)

  • Note that the normal distribution is in this family (linear) as is Bernoulli or more

accurately Binomial (logistic)

Pr(Y ) ∼ e

Y θ−b(θ) φ

+c(Y,φ) ve: ✓ = µ, = 2, b(✓) = ✓2 2 , c(Y, ) = −1 2 Y 2 + log(2⇡) !

slide-12
SLIDE 12

Exponential family II

  • Instead of the `natural’ form, the exponential family is often expressed in the

following form:

  • To convert from one to the other, make the following substitutions:
  • Note that the dispersion parameter is now no longer a direct part of this

formulation

  • Which is used depends on the application (i.e., for glm’s the `natural’ form has an

easier to use form + the dispersion parameter is useful for model fitting, while the form on this slide provides advantages for other types of applications

Pr(Y ) ∼ h(Y )s(θ)e

Pk

i=1 wi(θ)ti(Y )

k = 1, h(Y ) = ec(Y,φ), s(θ) = e− b(θ)

φ , w(θ) = θ

φ, t(Y ) = Y

slide-13
SLIDE 13

GLM link function

  • A “link” function is just a function (!!) that acts on the expected

value of Y given X:

  • This function is defined in such a way such that it has a useful form

for a GLM although there are some general restrictions on the form

  • f this function, the most important is that they need to be

monotonic such that we can define an inverse:

  • For the logistic regression, we have selected the following link

function, which is a logit function (a “canonical link”) where the inverse is the logistic function (but note that others are also used for binomial response variables):

  • What is the link function for a normal distribution?

the inverse Y = f(X), as f−1(Y ) = X.

E(Y|X) = −1(X) = eXβ 1 + eXβ

γ(E(Y|X)) = ln

eXβ 1+eXβ

1 −

eXβ 1+eXβ

!

slide-14
SLIDE 14

GLM error function

  • The variance of the error term in a GLM must be function of

ONLY the independent variable and beta parameter vector:

  • This is the case for a linear regression (note the variance of the

error is constant!!):

  • As an example, this is the case for the logistic regression (note

the error changes depending on the value of X!!):

V ar(✏) = f(X)

V ar(✏) = f(X) = 2

V ar(✏) = −1(X)(1 − −1(X)) V ar(✏i) = −1(µ + Xi,aa + Xi,dd)(1 − −1(µ + Xi,aa + Xi,dd)

✏ ⇠ N(0, 2

✏ )

slide-15
SLIDE 15

Alternative tests in GWAS I

  • Since our basic null / alternative hypothesis construction in

GWAS covers a large number of possible relationships between genotypes and phenotypes, there are a large number

  • f tests that we could apply in a GWAS
  • e.g. t-tests, ANOVA, Wald’s test, non-parametric permutation

based tests, Kruskal-Wallis tests, other rank based tests, chi- square, Fisher’s exact, Cochran-Armitage, etc. (see PLINK for a somewhat comprehensive list of tests used in GWAS)

  • When can we use different tests? The only restriction is that
  • ur data conform to the assumptions of the test (examples?)
  • We could therefore apply a diversity of tests for any given

GWAS

slide-16
SLIDE 16
  • Should we use different tests in a GWAS (and why)?

Yes we should - the reason is different tests have different performance depending on the (unknown) conditions of the system and experiment, i.e. some may perform better than others

  • In general, since we don’t know the true conditions (and therefore which

will be best suited) we should run a number of tests and compare results

  • How to compare results of different GWAS is a fuzzy case (=no non-

conditional rules) but a reasonable approach is to treat each test as a distinct GWAS analysis and compare the hits across analyses using the following rules:

  • If all methods identify the same hits (=genomic locations) then this is

good evidence that there is a causal polymorphism

  • If methods do not agree on the position (e.g. some are significant, some

are not) we should attempt to determine the reason for the discrepancy (this requires that we understand the tests and experience)

Alternative tests in GWAS II

slide-17
SLIDE 17
  • We do not have time in this course to do a comprehensive

review of possible tests (keep in mind, every time you learn a new test in a statistics class, there is a good chance you could apply it in a GWAS!)

  • Let’s consider a few examples alternative tests that could be

applied

  • Remember that to apply these alternative tests, you will

perform N alternative tests for each marker-phenotype combinations, where for each case, we are testing the following hypotheses with different (implicit) codings of X (!!):

H0 : Cov(Y, X) = 0 HA : Cov(Y, X) 6= 0

Alternative tests in GWAS III

slide-18
SLIDE 18

Alternative test examples I

  • First, let’s consider a case-control phenotype and consider a chi-square

test (which has deep connections to our logistic regression test under certain assumptions but it has slightly different properties!)

  • To construct the test statistic, we consider the counts of genotype-

phenotype combinations (left) and calculate the expected numbers in each cell (right):

  • We then construct the following test statistic:
  • Where the (asymptotic) distribution when the null hypothesis is true is:

in this c is χ2

d.f.=2.

ize tends to infinite, i.e. when the sam is d.f. = (#columns-1)(#rows-1) = 2, can therefore calculate the statistic in

LRT = −2lnΛ = −2

3

X

i=1 2

X

j=1

nijln ni n.inj. !

Case Control A1A1 n11 n12 n1. A1A2 n21 n22 n2. A2A2 n31 n32 n3. n.1 n.2 n

Case Control A1A1 (n.1n1.)/n (n.2n1.)/n n1. A1A2 (n.1n2.)/n (n.2n2.)/n n2. A2A2 (n.1n3.)/n (n.2n3.)/n n3. n.1 n.2 n

slide-19
SLIDE 19

Alternative test examples II

  • Second, let’s consider a Fisher’s exact test
  • Note the the LRT for the null hypothesis under the chi-square test was
  • nly asymptotically exact, i.e. it is exact as sample size n approaches infinite

but it is not exact for smaller sample sizes (although we hope it is close!)

  • Could we construct a test that is exact for smaller sample sizes?

Yes, we can calculate a Fisher’s test statistic for our sample, where the distribution under the null hypothesis is exact for any sample size (I will let you look up how to calculate this statistic and the distribution under the null on your own):

  • Given this test is exact, why would we ever use Chi-square / what is a rule

for when we should use one versus the other?

Case Control A1A1 n11 n21 A1A2 n21 n22 A2A2 n31 n32 hi-square test) is also often

slide-20
SLIDE 20

Alternative test examples III

  • Third, let’s ways of grouping the cells, where we could apply either a chi-

square or a Fisher’s exact test

  • For MAF = A1, we can apply a “recessive” (left) and “dominance” test

(right):

  • We could also apply an “allele test” (note these test names are from

PLINK):

  • When should we expect one of these tests to perform better than the
  • thers?

Case Control A1A1 n11 n12 A1A2 ∪ A2A2 n21 n22 Case Control A1A1 ∪ A1A2 n11 n12 A2A2 n21 n22 Case Control A1 n11 n12 A2 n21 n22

slide-21
SLIDE 21

Haplotype testing I

  • We have just extended our GWAS framework to

handle additional phenotypes

  • We can also extend our GWAS framework to handle

genotypes defined using a different approach

  • In this case, let’s consider using haplotype alleles in our

testing framework

  • Note that a haplotype collapses genetic marker

information but in some cases, testing using haplotypes is more effective than testing one genetic marker at a time

slide-22
SLIDE 22

Haplotype testing II

  • Haplotype - a series of ordered, linked alleles that

are inherited together

  • For the moment, let’s consider a haplotype to define a

“function” that takes a set of alleles at several loci A, B, C, D, etc. and outputs a haplotype allele:

  • For example, if these loci are each a SNP with the

following alleles (A,G), (A,T),(G,C),(G,C) we could define the following haplotype alleles:

h = f(Ai, Bj, ...)

s h1 = (A, A, C, C)

s (A, G), (A, T), (G, d h2 = (G, T, G, G).

slide-23
SLIDE 23

Haplotype testing III

  • Note that how we define haplotype alleles is somewhat arbitrary but in

general, we define a haplotype for a set of genetic markers (loci) that are physically linked that are frequently occur in a population

  • How many markers is somewhat arbitrary, e.g. we often define sets that match
  • bserved patterns of LD
  • How many haplotype alleles we define is also somewhat arbitrary, where we

define haplotype alleles that have appreciable frequenecy in the population

  • For example, four the four loci with alleles (A,G), (A,T),(G,C),(G,C) how

many haplotype alleles could we define?

  • However, it could be that only the following two combinations have

relatively “high” allele frequencies (say >0.05 = arbitrary!)

  • In such a case, we can collapse the many alleles into just a few!

s h1 = (A, A, C, C)

s (A, G), (A, T), (G, d h2 = (G, T, G, G).

slide-24
SLIDE 24

Haplotype testing IV

  • As an example of haplotype allele collapsing, say for our case of four

loci (A,G), (A,T),(G,C),(G,C), we have lots of LD (!!) such that there are only 4 alleles in the population (i.e. all other combinations have frequency of zero!):

  • Let’s also say that the frequencies of the third and fourth of these in

the population are < 0.01

  • In this case, we can define just two haplotype alleles that collapse

the other alleles as follows (where * means “any” genetic marker allele):

  • NOTE: we are therefore loosing information using this approach!!
  • ften the case that only a few haplotypes are at appreciable frequency, e.g.

s h∗

1 = (A, A, C, C),h∗ 2 = (G, T, G, G),h∗ 3 = (A, A, G, C),h∗ 4 = (G, T, C, G)

s h1 = (A, A, ∗, C)

∗ ∗

d h2 = (G, T, ∗, G) d h = h∗ ∪ h∗. ∗ at h1 = h∗

1 ∪h∗ 3

d h2 = (G, T, ∗ d h2 = h∗

2 ∪ h∗ 4.

slide-25
SLIDE 25

GWAS with haplotypes I

  • Once we have defined haplotype alleles, we can

proceed with a GWAS using our framework (just substitute haplotype alleles and genotypes for genetic marker alleles and genotypes!)

  • For example, in a case where we only have two

haplotype alleles, we can code our independent variables for our regression model as follows:

  • All other aspects remain the same (although what is the

effect on our interpretation of where the causal polymorphism is located?)

Xa(h1h1) = −1, Xa(h1h2) = 0, Xa(h2h2) = 1 Xd(h1h1) = −1, Xd(h1h2) = 1, Xd(h2h2) = −1

slide-26
SLIDE 26

GWAS with haplotypes II

  • Given that we are losing information by using a

haplotype testing approach in a GWAS, why might we want to use this approach?

  • As one example consider the following case of

haplotypes in a population:

A1 B1 (C1)⇤ D2 E1 A1 B2 (C1)⇤ D1 E1 A2 B1 (C1)⇤ D1 E1 A1 B1 (C1)⇤ D1 E2

A2 B2 (C2)⇤ D1 E2 A2 B1 (C2)⇤ D2 E2 A1 B2 (C2)⇤ D2 E2 A2 B2 (C2)⇤ D2 E1

slide-27
SLIDE 27

Advantages of haplotype testing

  • In some cases (system and sample dependent!), the

haplotype is a better “tag” of the causal polymorphism than any of the surrounding markers

  • In such a case, the Corr(Xh, X) > Corr (X’, X) and

therefore has a higher probability of correctly rejecting the null hypothesis

  • Another “advantage” is by putting together markers, we

are performing less total tests in our GWAS (in what sense is this an advantage!?)

slide-28
SLIDE 28

Disadvantages of haplotype testing

  • Collapsing to haplotypes may produce a better tag but

it also may not (!!), i.e. sometimes (in fact often!) individual genetic markers are better tags of the causal polymorphism

  • Another disadvantage is resolution, since we absolutely

cannot resolve the position of the causal polymorphism to a position smaller than the range of the haplotype alleles, i.e. large haplotypes can have smaller resolution

  • If we had measured the causal polymorphism in our

data, should we use haplotype testing (i.e. in the future, the importance of haplotype testing may decrease)

slide-29
SLIDE 29

Should I apply haplotype testing in my GWAS?

  • Yes! but apply both an individual marker testing approach

(always!) as well as a haplotype test (optional)

  • The reason is that we never know the true answer in our

GWAS (as with any statistical analysis!) so it doesn’t hurt us to explore our dataset with as many techniques as we want to apply

  • In fact, this will be a continuing theme of the class, i.e. keep

analyzing GWAS with as many methods as you find useful

  • However, since we never know the right answer for certain,

if we get conflicting results, which one do we interpret as “correct”!?

slide-30
SLIDE 30

Where do haplotypes come from?

  • A deep discussion of the origin of haplotypes (remember: a

fuzzy definition!) is another subject that is in the realm of population genetics and therefore we cannot discuss this in detail in this class (again: I encourage you to take a class on population genetics!)

  • However, we can get an intuition about where haplotypes

come from by remembering that the origin of new haplotype alleles are mutations and that new haplotype alleles can be produced by recombination

  • In fact, these two processes also underlie the amount of LD

in the population and therefore what blocks of alleles are inherited as a haplotype (and we therefore use them to define haplotypes using system specific criteria)

slide-31
SLIDE 31

Defining haplotypes

  • We could spend multiple lectures on how people define

haplotypes for given systems and the algorithms used for this purpose (so we will just briefly mention the main concepts here)

  • To define haplotypes, we need to “phase” measured

genotype markers, decide on the number of genotype markers to put together into a haplotype block, and decide how many haplotype alleles to consider

  • Remember: there are no universal rules for doing this

(system dependent!)

slide-32
SLIDE 32

Phasing haplotypes

  • To get a sense of the phasing problem, consider a case

where we have two markers that are right next to each

  • ther on a chromosome and we know we want to put them

together in a haplotype block

  • Say one marker is (A,T) and the other marker is (G,C) and

we are considering a diploid individual who is a heterozygote for both of these markers, which of the marker alleles are physically linked in this individual?

  • Figuring this out for individuals in a sample is the phasing

problem and there are many algorithms for accomplishing this goal (note that in the future, technology may make this a non-issue...)

slide-33
SLIDE 33

Deciding on how many genotypes to include in a haplotype block

  • Again, while there is no set rule, how we decide on

genotypes to include in a haplotype block depends on LD

  • The general rule: if we have a set of markers in high LD with

each other but low LD with other markers, we use this as a guide for defining the haplotype block

  • 0.2

0.4 0.6 0.8 5 10 15 0.2 0.4 0.6 0.8 1

RCOR3 TRAF5 C1orf97 RD3 SLC30A1 NEK2 LPGAT1

genes 30 60 1 2

211.5 211.6 211.7 211.8 211.9 212.0

* *

Single marker test Conditional test VBAY Lasso Adaptive Lasso 2D-MCP LOG NEG Independent hit Non−independent hit

10 15

genes

slide-34
SLIDE 34

Deciding on how many haplotype alleles to consider

  • Again, there are no set rules for how many haplotype alleles

to define, but in general, we define a set where the frequency in a population is above some MAF threshold (which depends on the system)

  • With a MAF cutoff of say 0.05, this generally limits us to 2-5

haplotype alleles (e.g. in humans!)

  • There are however cases where we might want to consider

rarer haplotypes (what are some of these?)

slide-35
SLIDE 35

Haplotype GWAS wrap-up

  • Haplotypes are a physical and sampling consequence of how

genetic systems work (just like LD!)

  • Definitions of haplotype blocks and haplotype alleles depend
  • n the system and context (fuzzy definition)
  • Regardless of how we define them, once we have haplotype

alleles, we can use them as we would genetic markers in our GWAS analysis framework

  • While optional, it is never a bad idea to perform a haplotype

analysis of your GWAS in addition to your single marker analysis (ALWAYS do a single marker analysis)

slide-36
SLIDE 36

That’s it for today

  • See you on Tues.!