Jason Mezey jgm45@cornell.edu April 18, 2017 (Th) 8:40-9:55
Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 - - PowerPoint PPT Presentation
Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 - - PowerPoint PPT Presentation
Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture18: Alternative tests and haplotype testing Jason Mezey jgm45@cornell.edu April 18, 2017 (Th) 8:40-9:55 Announcements Project is posted (!!) Midterm grades will
Announcements
- Project is posted (!!)
- Midterm grades will be available Fri.
- Information about the Final:
- Same format as midterm (take-home, work on your own)
- Cumulative as far as material covered
- Scheduling (NOT FINALIZED
YET!) probably available Fri. May 19 and Due Mon., May 21
Conceptual Overview
Genetic System
Does A1 -> A2 affect Y?
Sample or experimental pop
Measured individuals (genotype, phenotype)
Pr(Y|X)
Model params
Reject / DNR
Regression model
F-test
Review: Logistic GWAS
- Now we have all the critical components for performing a GWAS
with a case / control phenotype!
- The procedure (and goals!) are the same as before, for a sample of n
individuals where for each we have measured a case / control phenotype and N genotypes, we perform N hypothesis tests
- To perform these hypothesis tests, we need to run our IRLS
algorithm for EACH marker to get the MLE of the parameters under the alternative (= no restrictions on the beta’s!) and use these to calculate our LRT test statistic for each marker
- We then use these N LRT statistics to calculate N p-values by using
a chi-square distribution (how do we do this is R?)
Introduction to logistic covariates
- Recall that in a GWAS, we are considering the following
regression model and hypotheses to assess a possible association for every marker with the phenotype
- Also recall that with these hypotheses we are actually
testing:
Y = −1(µ + Xaa + Xdd)
H0 : a = 0 \ d = 0 HA : a 6= 0 [ d 6= 0
H0 : Cov(Y, Xa) = 0 \ Cov(Y, Xd) = 0 HA : Cov(Y, Xa) 6= 0 [ Cov(Y, Xd) 6= 0
Modeling logistic covariates I
- Therefore, if we have a factor that is correlated with our
phenotype and we do not handle it in some manner in our analysis, we risk producing false positives AND/OR reduce the power of our tests!
- The good news is that, assuming we have measured the
factor (i.e. it is part of our GWAS dataset) then we can incorporate the factor in our model as a covariate:
- The effect of this is that we will estimate the covariate
model parameter and this will account for the correlation of the factor with phenotype (such that we can test for our marker correlation without false positives / lower power!)
Y = −1(µ + Xaa + Xdd + Xzz)
Modeling logistic covariates II
- For our a logistic regression, our LRT (logistic) we have the same
equations:
- Using the following estimates for the null hypothesis and the alternative
making use of the IRLS algorithm (just add an additional parameter!):
- Under the null hypothesis, the LRT is still distributed as a Chi-square with
2 degree of freedom (why?):
LRT = −2lnΛ = 2l(ˆ θ1|y) − 2l(ˆ θ0|y) l(ˆ θ1|y) =
n
X
i=1
h yiln(γ−1(ˆ βµ + xi,a ˆ βa + xi,d ˆ βd + xi,z ˆ βz)) + (1 − yi)ln(1 − γ−1(ˆ βµ + xi,a ˆ βa + xi,d ˆ βd + xi,z ˆ βz)) i
LRT ! χ2
d f=2
ˆ θ0 = {ˆ βµ, ˆ βa = 0, ˆ βd = 0, ˆ βz}
ˆ θ1 = {ˆ βµ, ˆ βa, ˆ βd, ˆ βz}
l(ˆ ✓0|y) =
n
X
i=1
yiln(−1(ˆ µ + xi,z ˆ z)) + (1 yi)ln(1 −1(ˆ µ + xi,z ˆ z))
Inference with GLMs
- We perform inference in a GLM framework using the
same approach, i.e. MLE of the beta parameters using an IRLS algorithm (just substitute the appropriate link function in the equations, etc.)
- We can also perform a hypothesis test using a LRT
(where the sampling distribution as the sample size goes to infinite is chi-square)
- In short, what you have learned can be applied for most
types of regression modeling you will likely need to apply (!!)
Introduction to Generalized Linear Models (GLMs) I
- We have introduced linear and logistic regression models for
GWAS analysis because these are the most versatile framework for performing a GWAS (there are many less versatile alternatives!)
- These two models can handle our genetic coding (in fact any genetic
coding) where we have discrete categories (although they can also handle X that can take on a continuous set of values!)
- They can also handle (the sampling distribution) of phenotypes that
have normal (linear) and Bernoulli error (logistic)
- How about phenotypes with different error (sampling)
distributions? Linear and logistic regression models are members of a broader class called Generalized Linear Models (GLMs), where
- ther models in this class can handle additional phenotypes (error
distributions)
Introduction to Generalized Linear Models (GLMs) II
- To introduce GLMs, we will introduce the overall structure first, and second
describe how linear and logistic models fit into this framework
- There is some variation in presenting the properties of a GLM, but we will present
them using three (models that have these properties are considered GLMs):
- The probability distribution of the response variable
Y conditional on the independent variable X is in the exponential family of distributions
- A link function relating the independent variables and parameters to the
expected value of the response variable (where we often use the inverse!!)
- The error random variable has a variance which is a function of ONLY
. Pr(Y |X) ∼ expfamily.
: : E(Y|X) → X, (E(Y|X)) = X
E(Y|X) = −1(X)
le ✏
= X
Exponential family I
- The exponential family is includes a broad set of probability distributions that can
be expressed in the following `natural’ form:
- As an example, for the normal distribution, we have the following:
- Note that many continuous and discrete distributions are in this family (normal,
binomial, poisson, lognormal, multinomial, several categorical distributions, exponential, gamma distribution, beta distribution, chi-square) but not all (examples that are not!?) and since we can model response variables with these distributions, we can model phenotypes with these distributions in a GWAS using a GLM (!!)
- Note that the normal distribution is in this family (linear) as is Bernoulli or more
accurately Binomial (logistic)
Pr(Y ) ∼ e
Y θ−b(θ) φ
+c(Y,φ) ve: ✓ = µ, = 2, b(✓) = ✓2 2 , c(Y, ) = −1 2 Y 2 + log(2⇡) !
Exponential family II
- Instead of the `natural’ form, the exponential family is often expressed in the
following form:
- To convert from one to the other, make the following substitutions:
- Note that the dispersion parameter is now no longer a direct part of this
formulation
- Which is used depends on the application (i.e., for glm’s the `natural’ form has an
easier to use form + the dispersion parameter is useful for model fitting, while the form on this slide provides advantages for other types of applications
Pr(Y ) ∼ h(Y )s(θ)e
Pk
i=1 wi(θ)ti(Y )
k = 1, h(Y ) = ec(Y,φ), s(θ) = e− b(θ)
φ , w(θ) = θ
φ, t(Y ) = Y
GLM link function
- A “link” function is just a function (!!) that acts on the expected
value of Y given X:
- This function is defined in such a way such that it has a useful form
for a GLM although there are some general restrictions on the form
- f this function, the most important is that they need to be
monotonic such that we can define an inverse:
- For the logistic regression, we have selected the following link
function, which is a logit function (a “canonical link”) where the inverse is the logistic function (but note that others are also used for binomial response variables):
- What is the link function for a normal distribution?
the inverse Y = f(X), as f−1(Y ) = X.
E(Y|X) = −1(X) = eXβ 1 + eXβ
γ(E(Y|X)) = ln
eXβ 1+eXβ
1 −
eXβ 1+eXβ
!
GLM error function
- The variance of the error term in a GLM must be function of
ONLY the independent variable and beta parameter vector:
- This is the case for a linear regression (note the variance of the
error is constant!!):
- As an example, this is the case for the logistic regression (note
the error changes depending on the value of X!!):
V ar(✏) = f(X)
V ar(✏) = f(X) = 2
✏
V ar(✏) = −1(X)(1 − −1(X)) V ar(✏i) = −1(µ + Xi,aa + Xi,dd)(1 − −1(µ + Xi,aa + Xi,dd)
✏ ⇠ N(0, 2
✏ )
Alternative tests in GWAS I
- Since our basic null / alternative hypothesis construction in
GWAS covers a large number of possible relationships between genotypes and phenotypes, there are a large number
- f tests that we could apply in a GWAS
- e.g. t-tests, ANOVA, Wald’s test, non-parametric permutation
based tests, Kruskal-Wallis tests, other rank based tests, chi- square, Fisher’s exact, Cochran-Armitage, etc. (see PLINK for a somewhat comprehensive list of tests used in GWAS)
- When can we use different tests? The only restriction is that
- ur data conform to the assumptions of the test (examples?)
- We could therefore apply a diversity of tests for any given
GWAS
- Should we use different tests in a GWAS (and why)?
Yes we should - the reason is different tests have different performance depending on the (unknown) conditions of the system and experiment, i.e. some may perform better than others
- In general, since we don’t know the true conditions (and therefore which
will be best suited) we should run a number of tests and compare results
- How to compare results of different GWAS is a fuzzy case (=no non-
conditional rules) but a reasonable approach is to treat each test as a distinct GWAS analysis and compare the hits across analyses using the following rules:
- If all methods identify the same hits (=genomic locations) then this is
good evidence that there is a causal polymorphism
- If methods do not agree on the position (e.g. some are significant, some
are not) we should attempt to determine the reason for the discrepancy (this requires that we understand the tests and experience)
Alternative tests in GWAS II
- We do not have time in this course to do a comprehensive
review of possible tests (keep in mind, every time you learn a new test in a statistics class, there is a good chance you could apply it in a GWAS!)
- Let’s consider a few examples alternative tests that could be
applied
- Remember that to apply these alternative tests, you will
perform N alternative tests for each marker-phenotype combinations, where for each case, we are testing the following hypotheses with different (implicit) codings of X (!!):
H0 : Cov(Y, X) = 0 HA : Cov(Y, X) 6= 0
Alternative tests in GWAS III
Alternative test examples I
- First, let’s consider a case-control phenotype and consider a chi-square
test (which has deep connections to our logistic regression test under certain assumptions but it has slightly different properties!)
- To construct the test statistic, we consider the counts of genotype-
phenotype combinations (left) and calculate the expected numbers in each cell (right):
- We then construct the following test statistic:
- Where the (asymptotic) distribution when the null hypothesis is true is:
in this c is χ2
d.f.=2.
ize tends to infinite, i.e. when the sam is d.f. = (#columns-1)(#rows-1) = 2, can therefore calculate the statistic in
LRT = −2lnΛ = −2
3
X
i=1 2
X
j=1
nijln ni n.inj. !
Case Control A1A1 n11 n12 n1. A1A2 n21 n22 n2. A2A2 n31 n32 n3. n.1 n.2 n
Case Control A1A1 (n.1n1.)/n (n.2n1.)/n n1. A1A2 (n.1n2.)/n (n.2n2.)/n n2. A2A2 (n.1n3.)/n (n.2n3.)/n n3. n.1 n.2 n
Alternative test examples II
- Second, let’s consider a Fisher’s exact test
- Note the the LRT for the null hypothesis under the chi-square test was
- nly asymptotically exact, i.e. it is exact as sample size n approaches infinite
but it is not exact for smaller sample sizes (although we hope it is close!)
- Could we construct a test that is exact for smaller sample sizes?
Yes, we can calculate a Fisher’s test statistic for our sample, where the distribution under the null hypothesis is exact for any sample size (I will let you look up how to calculate this statistic and the distribution under the null on your own):
- Given this test is exact, why would we ever use Chi-square / what is a rule
for when we should use one versus the other?
Case Control A1A1 n11 n21 A1A2 n21 n22 A2A2 n31 n32 hi-square test) is also often
Alternative test examples III
- Third, let’s ways of grouping the cells, where we could apply either a chi-
square or a Fisher’s exact test
- For MAF = A1, we can apply a “recessive” (left) and “dominance” test
(right):
- We could also apply an “allele test” (note these test names are from
PLINK):
- When should we expect one of these tests to perform better than the
- thers?
Case Control A1A1 n11 n12 A1A2 ∪ A2A2 n21 n22 Case Control A1A1 ∪ A1A2 n11 n12 A2A2 n21 n22 Case Control A1 n11 n12 A2 n21 n22
Haplotype testing I
- We have just extended our GWAS framework to
handle additional phenotypes
- We can also extend our GWAS framework to handle
genotypes defined using a different approach
- In this case, let’s consider using haplotype alleles in our
testing framework
- Note that a haplotype collapses genetic marker
information but in some cases, testing using haplotypes is more effective than testing one genetic marker at a time
Haplotype testing II
- Haplotype - a series of ordered, linked alleles that
are inherited together
- For the moment, let’s consider a haplotype to define a
“function” that takes a set of alleles at several loci A, B, C, D, etc. and outputs a haplotype allele:
- For example, if these loci are each a SNP with the
following alleles (A,G), (A,T),(G,C),(G,C) we could define the following haplotype alleles:
h = f(Ai, Bj, ...)
s h1 = (A, A, C, C)
s (A, G), (A, T), (G, d h2 = (G, T, G, G).
Haplotype testing III
- Note that how we define haplotype alleles is somewhat arbitrary but in
general, we define a haplotype for a set of genetic markers (loci) that are physically linked that are frequently occur in a population
- How many markers is somewhat arbitrary, e.g. we often define sets that match
- bserved patterns of LD
- How many haplotype alleles we define is also somewhat arbitrary, where we
define haplotype alleles that have appreciable frequenecy in the population
- For example, four the four loci with alleles (A,G), (A,T),(G,C),(G,C) how
many haplotype alleles could we define?
- However, it could be that only the following two combinations have
relatively “high” allele frequencies (say >0.05 = arbitrary!)
- In such a case, we can collapse the many alleles into just a few!
s h1 = (A, A, C, C)
s (A, G), (A, T), (G, d h2 = (G, T, G, G).
Haplotype testing IV
- As an example of haplotype allele collapsing, say for our case of four
loci (A,G), (A,T),(G,C),(G,C), we have lots of LD (!!) such that there are only 4 alleles in the population (i.e. all other combinations have frequency of zero!):
- Let’s also say that the frequencies of the third and fourth of these in
the population are < 0.01
- In this case, we can define just two haplotype alleles that collapse
the other alleles as follows (where * means “any” genetic marker allele):
- NOTE: we are therefore loosing information using this approach!!
- ften the case that only a few haplotypes are at appreciable frequency, e.g.
s h∗
1 = (A, A, C, C),h∗ 2 = (G, T, G, G),h∗ 3 = (A, A, G, C),h∗ 4 = (G, T, C, G)
s h1 = (A, A, ∗, C)
∗ ∗
d h2 = (G, T, ∗, G) d h = h∗ ∪ h∗. ∗ at h1 = h∗
1 ∪h∗ 3
d h2 = (G, T, ∗ d h2 = h∗
2 ∪ h∗ 4.
GWAS with haplotypes I
- Once we have defined haplotype alleles, we can
proceed with a GWAS using our framework (just substitute haplotype alleles and genotypes for genetic marker alleles and genotypes!)
- For example, in a case where we only have two
haplotype alleles, we can code our independent variables for our regression model as follows:
- All other aspects remain the same (although what is the
effect on our interpretation of where the causal polymorphism is located?)
Xa(h1h1) = −1, Xa(h1h2) = 0, Xa(h2h2) = 1 Xd(h1h1) = −1, Xd(h1h2) = 1, Xd(h2h2) = −1
GWAS with haplotypes II
- Given that we are losing information by using a
haplotype testing approach in a GWAS, why might we want to use this approach?
- As one example consider the following case of
haplotypes in a population:
A1 B1 (C1)⇤ D2 E1 A1 B2 (C1)⇤ D1 E1 A2 B1 (C1)⇤ D1 E1 A1 B1 (C1)⇤ D1 E2
A2 B2 (C2)⇤ D1 E2 A2 B1 (C2)⇤ D2 E2 A1 B2 (C2)⇤ D2 E2 A2 B2 (C2)⇤ D2 E1
Advantages of haplotype testing
- In some cases (system and sample dependent!), the
haplotype is a better “tag” of the causal polymorphism than any of the surrounding markers
- In such a case, the Corr(Xh, X) > Corr (X’, X) and
therefore has a higher probability of correctly rejecting the null hypothesis
- Another “advantage” is by putting together markers, we
are performing less total tests in our GWAS (in what sense is this an advantage!?)
Disadvantages of haplotype testing
- Collapsing to haplotypes may produce a better tag but
it also may not (!!), i.e. sometimes (in fact often!) individual genetic markers are better tags of the causal polymorphism
- Another disadvantage is resolution, since we absolutely
cannot resolve the position of the causal polymorphism to a position smaller than the range of the haplotype alleles, i.e. large haplotypes can have smaller resolution
- If we had measured the causal polymorphism in our
data, should we use haplotype testing (i.e. in the future, the importance of haplotype testing may decrease)
Should I apply haplotype testing in my GWAS?
- Yes! but apply both an individual marker testing approach
(always!) as well as a haplotype test (optional)
- The reason is that we never know the true answer in our
GWAS (as with any statistical analysis!) so it doesn’t hurt us to explore our dataset with as many techniques as we want to apply
- In fact, this will be a continuing theme of the class, i.e. keep
analyzing GWAS with as many methods as you find useful
- However, since we never know the right answer for certain,
if we get conflicting results, which one do we interpret as “correct”!?
Where do haplotypes come from?
- A deep discussion of the origin of haplotypes (remember: a
fuzzy definition!) is another subject that is in the realm of population genetics and therefore we cannot discuss this in detail in this class (again: I encourage you to take a class on population genetics!)
- However, we can get an intuition about where haplotypes
come from by remembering that the origin of new haplotype alleles are mutations and that new haplotype alleles can be produced by recombination
- In fact, these two processes also underlie the amount of LD
in the population and therefore what blocks of alleles are inherited as a haplotype (and we therefore use them to define haplotypes using system specific criteria)
Defining haplotypes
- We could spend multiple lectures on how people define
haplotypes for given systems and the algorithms used for this purpose (so we will just briefly mention the main concepts here)
- To define haplotypes, we need to “phase” measured
genotype markers, decide on the number of genotype markers to put together into a haplotype block, and decide how many haplotype alleles to consider
- Remember: there are no universal rules for doing this
(system dependent!)
Phasing haplotypes
- To get a sense of the phasing problem, consider a case
where we have two markers that are right next to each
- ther on a chromosome and we know we want to put them
together in a haplotype block
- Say one marker is (A,T) and the other marker is (G,C) and
we are considering a diploid individual who is a heterozygote for both of these markers, which of the marker alleles are physically linked in this individual?
- Figuring this out for individuals in a sample is the phasing
problem and there are many algorithms for accomplishing this goal (note that in the future, technology may make this a non-issue...)
Deciding on how many genotypes to include in a haplotype block
- Again, while there is no set rule, how we decide on
genotypes to include in a haplotype block depends on LD
- The general rule: if we have a set of markers in high LD with
each other but low LD with other markers, we use this as a guide for defining the haplotype block
- 0.2
0.4 0.6 0.8 5 10 15 0.2 0.4 0.6 0.8 1
RCOR3 TRAF5 C1orf97 RD3 SLC30A1 NEK2 LPGAT1
genes 30 60 1 2
211.5 211.6 211.7 211.8 211.9 212.0
* *
Single marker test Conditional test VBAY Lasso Adaptive Lasso 2D-MCP LOG NEG Independent hit Non−independent hit
10 15
genes
Deciding on how many haplotype alleles to consider
- Again, there are no set rules for how many haplotype alleles
to define, but in general, we define a set where the frequency in a population is above some MAF threshold (which depends on the system)
- With a MAF cutoff of say 0.05, this generally limits us to 2-5
haplotype alleles (e.g. in humans!)
- There are however cases where we might want to consider
rarer haplotypes (what are some of these?)
Haplotype GWAS wrap-up
- Haplotypes are a physical and sampling consequence of how
genetic systems work (just like LD!)
- Definitions of haplotype blocks and haplotype alleles depend
- n the system and context (fuzzy definition)
- Regardless of how we define them, once we have haplotype
alleles, we can use them as we would genetic markers in our GWAS analysis framework
- While optional, it is never a bad idea to perform a haplotype
analysis of your GWAS in addition to your single marker analysis (ALWAYS do a single marker analysis)
That’s it for today
- See you on Tues.!