Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01



SLIDE 1

Lecture 26: Introduction to Bayesian MCMC and wrap-up (last class!)

Jason Mezey jgm45@cornell.edu May 12, 2020 (T) 8:40-9:55

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01

SLIDE 2

Announcements

  • Reminder: Project due 11:59PM TODAY May 12 (!!)
  • The FINAL EXAM (!!)
  • Same format as midterm (i.e., take home, open book, no restrictions on material you may access) BUT ONCE THE EXAM STARTS YOU MAY NOT ASK ANYONE ABOUT ANYTHING THAT COULD RELATE TO THE EXAM (!!!!)
  • Timing: available evening May 16 (!!) (Sat.) and will be due 11:59PM May 20 (Weds.)
  • If you prepare, the exam should take 8-12 hours (i.e., allocate about 1 day if you are well prepared)
  • You will have to do a logistic regression analysis of GWAS data!
SLIDE 3

Summary of lecture 26

  • Today we will complete our discussion of Bayesian statistics by (briefly) introducing MCMC algorithms
  • We will then do a quick wrap-up by mentioning other topics of interest and thoughts that you may want to consider for charting your future learning

SLIDE 4

Review: Intro to Bayesian analysis I

  • Remember that in a Bayesian (not frequentist!) framework, our parameter(s) have a probability distribution associated with them that reflects our belief in the values that might be the true value of the parameter
  • Since we are treating the parameter as a random variable, we can consider the joint distribution of the parameter AND a sample Y produced under a probability model: Pr(θ ∩ Y)
  • For inference, we are interested in the probability the parameter takes a certain value given a sample: Pr(θ|y)
  • Using Bayes theorem, we can write: Pr(θ|y) = Pr(y|θ)Pr(θ) / Pr(y)
  • Also note that since the sample is fixed (i.e. we are considering a single sample), Pr(y) = c, so we can rewrite this as: Pr(θ|y) ∝ Pr(y|θ)Pr(θ)
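The proportionality above can be made concrete numerically; a minimal sketch using a Bernoulli likelihood over a grid of θ values (the data, grid, and uniform prior are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Grid of candidate values for theta = Pr(success) of a Bernoulli model
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)                   # uniform prior: Pr(theta) = c

# Hypothetical data: 7 successes in n = 10 trials
k, n = 7, 10
likelihood = theta**k * (1 - theta)**(n - k)  # Pr(y|theta)

# Pr(theta|y) is proportional to Pr(y|theta)Pr(theta); normalizing over the
# grid supplies the constant Pr(y)
posterior = likelihood * prior
posterior /= posterior.sum()
```

Normalizing at the end is exactly why the constant Pr(y) can be ignored when writing the posterior as a proportionality.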
SLIDE 5

Review: Intro to Bayesian analysis II

  • Let’s consider the structure of our main equation in Bayesian statistics: Pr(θ|y) ∝ Pr(y|θ)Pr(θ)
  • Note that the left hand side, Pr(θ|y), is called the posterior probability
  • The first term of the right hand side is something we have seen before, i.e. the likelihood (!!): Pr(y|θ) = L(θ|y)
  • The second term of the right hand side, Pr(θ), is new and is called the prior
  • Note that the prior is how we incorporate our assumptions concerning the values the true parameter value may take
  • In a Bayesian framework, we are making two assumptions (unlike a frequentist framework, where we make one assumption): 1. the probability distribution that generated the sample, 2. the probability distribution of the parameter

SLIDE 6

Review: Bayesian estimation

  • Inference in a Bayesian framework differs from a frequentist framework in both estimation and hypothesis testing
  • For example, for estimation in a Bayesian framework, we always construct estimators using the posterior probability distribution, for example: θ̂ = mean(θ|y) = ∫ θPr(θ|y)dθ, or θ̂ = median(θ|y)
  • Estimates in a Bayesian framework can be different than in a likelihood (Frequentist) framework since estimator construction is fundamentally different (!!)
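Given draws from Pr(θ|y) (analytical or from an MCMC sampler), these estimators are just summaries of the draws; a minimal sketch in which the normal draws below stand in for a real posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in draws from the posterior Pr(theta|y); in practice these would come
# from an analytical posterior or an MCMC sampler
posterior_draws = rng.normal(loc=1.5, scale=0.3, size=10_000)

theta_hat_mean = posterior_draws.mean()        # posterior-mean estimator
theta_hat_median = np.median(posterior_draws)  # posterior-median estimator
```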

SLIDE 7

Review: Bayesian hypothesis testing

  • For hypothesis testing in a Bayesian analysis, we use the same null and alternative hypothesis framework: H0 : θ ∈ Θ0 versus HA : θ ∈ ΘA
  • However, the approach to hypothesis testing is completely different than in a frequentist framework: we use a Bayes factor to indicate the relative support for one hypothesis versus the other: Bayes = ∫θ∈Θ0 Pr(y|θ)Pr(θ)dθ / ∫θ∈ΘA Pr(y|θ)Pr(θ)dθ
  • Note that a downside to using a Bayes factor to assess hypotheses is that it can be difficult to assign priors for hypotheses that have completely different ranges of support (e.g. the null is a point and the alternative is a range of values)
  • As a consequence, people often use an alternative “pseudo-Bayesian” approach to hypothesis testing that makes use of credible intervals (which is what we will use in this course)

SLIDE 8

Review: Bayesian credible intervals

  • Recall that in a Frequentist framework we can estimate a confidence interval at some level (say 0.95), which is an interval that would include the value of the parameter 0.95 of the time were we to perform the experiment an infinite number of times, calculating the confidence interval each time (note: a strange definition...)
  • In a Bayesian framework, the parallel concept is a credible interval, which has a completely different interpretation: this interval has a given probability of including the parameter value (!!)
  • The definition of a credible interval is as follows: c.i.(θ) = ∫_{−cα}^{cα} Pr(θ|y)dθ = 1 − α
  • Note that we can assess a null hypothesis using a credible interval by determining if this interval includes the value of the parameter under the null hypothesis (!!)
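Given draws from Pr(θ|y), a central credible interval is just a pair of quantiles; a minimal sketch (the simulated normal draws stand in for a real posterior):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in draws from the posterior Pr(theta|y)
posterior_draws = rng.normal(loc=0.8, scale=0.2, size=20_000)

alpha = 0.05
lower, upper = np.percentile(posterior_draws,
                             [100 * alpha / 2, 100 * (1 - alpha / 2)])

# "Pseudo-Bayesian" test of H0: theta = 0 -- reject if the null value falls
# outside the 1 - alpha credible interval
reject_null = not (lower <= 0.0 <= upper)
```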

SLIDE 9

Review: Bayesian inference: genetic model I

  • We are now ready to tackle Bayesian inference for our genetic model (note that we will focus on the linear regression model but we can perform Bayesian inference for any GLM!): Y = βμ + Xaβa + Xdβd + ε, ε ~ N(0, σ²ε)
  • Recall for a sample generated under this model, we can write: y = xβ + ε, ε ~ multiN(0, Iσ²ε)
  • In this case, for the purposes of mapping, we are interested in the following hypotheses: H0 : βa = 0 ∩ βd = 0 versus HA : βa ≠ 0 ∪ βd ≠ 0
  • We are therefore interested in the marginal posterior probability of these two parameters

SLIDE 10

Review: Bayesian inference: genetic model II

  • To calculate these probabilities, we need to assign a joint probability distribution for the prior: Pr(βμ, βa, βd, σ²ε)
  • One possible choice is as follows (are these proper or improper!?): Pr(βμ, βa, βd, σ²ε) = Pr(βμ)Pr(βa)Pr(βd)Pr(σ²ε), with Pr(βμ) = Pr(βa) = Pr(βd) = c and Pr(σ²ε) = c
  • Under this prior the complete posterior distribution is multivariate normal (!!):

Pr(βμ, βa, βd, σ²ε|y) ∝ Pr(y|βμ, βa, βd, σ²ε)

Pr(θ|y) ∝ (σ²ε)^(−n/2) e^(−(y−xβ)ᵀ(y−xβ) / (2σ²ε))

SLIDE 11

Review: Bayesian inference: genetic model III

  • For the linear model with sample: y = xβ + ε, ε ~ multiN(0, Iσ²ε)
  • The complete posterior probability for the genetic model is: Pr(βμ, βa, βd, σ²ε|y) ∝ Pr(y|βμ, βa, βd, σ²ε)Pr(βμ, βa, βd, σ²ε)
  • With a uniform prior this is: Pr(βμ, βa, βd, σ²ε|y) ∝ Pr(y|βμ, βa, βd, σ²ε)
  • The marginal posterior probability of the parameters we are interested in is: Pr(βa, βd|y) = ∫∫ Pr(βμ, βa, βd, σ²ε|y) dβμ dσ²ε

SLIDE 12

Review: Bayesian inference: genetic model IV

  • Assuming uniform (improper!) priors, the marginal distribution is: Pr(βa, βd|y) = ∫∫ Pr(βμ, βa, βd, σ²ε|y) dβμ dσ²ε ~ multi-t-distribution
  • With the following parameter values:

mean(Pr(βa, βd|y)) = [β̂a, β̂d]ᵀ = C⁻¹[Xa, Xd]ᵀy

cov(Pr(βa, βd|y)) = [ (y − [Xa, Xd][β̂a, β̂d]ᵀ)ᵀ(y − [Xa, Xd][β̂a, β̂d]ᵀ) / (n − 6) ] C⁻¹

C = [ XaᵀXa  XaᵀXd
      XdᵀXa  XdᵀXd ]

d.f.(multi-t) = n − 4

  • With these estimates (equations) we can now construct a credible interval for our genetic null hypothesis and test a marker for a phenotype association, and we can perform a GWAS by doing this for each marker (!!)
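The mean and covariance formulas can be evaluated directly; a minimal sketch on simulated genotypes and a simulated mean-centered phenotype (so βμ drops out, per the marginalization), with effect sizes that are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
g = rng.integers(0, 3, size=n)           # genotypes as allele counts 0,1,2
xa = g - 1.0                             # additive coding: -1, 0, 1
xd = np.where(g == 1, 1.0, -1.0)         # dominance coding: -1, 1, -1
y = 0.5 * xa + 0.2 * xd + rng.normal(0.0, 1.0, n)  # mean-centered phenotype

X = np.column_stack([xa, xd])
C = X.T @ X                              # [[Xa'Xa, Xa'Xd], [Xd'Xa, Xd'Xd]]
post_mean = np.linalg.solve(C, X.T @ y)  # mean of Pr(beta_a, beta_d | y)
resid = y - X @ post_mean
post_cov = (resid @ resid) / (n - 6) * np.linalg.inv(C)
```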

SLIDE 13

Review: Bayesian inference: genetic model V

[Figure: contour plots of the marginal posterior Pr(βa, βd|y) over βa and βd, each with a 0.95 credible interval — in one case the interval includes βa = βd = 0 (cannot reject H0!), in the other it does not (reject H0!)]

SLIDE 14

Review: Bayesian inference for more “complex” posterior distributions

  • For a linear regression with a simple (uniform) prior, we have a simple closed form of the overall posterior
  • This is not always the case (= often not the case), since we may often choose to put together more complex priors with our likelihood, or consider a more complicated likelihood equation (e.g. for a logistic regression!)
  • To perform hypothesis testing in these more complex cases, we still need to determine the credible interval from the posterior (or marginal) probability distribution, so we need to determine the form of this distribution
  • To do this we will need an algorithm, and we will introduce the Markov chain Monte Carlo (MCMC) algorithm for this purpose

SLIDE 15

Review: Stochastic processes

  • To introduce the MCMC algorithm for our purpose, we need to consider models from another branch of probability (remember, probability is a field much larger than the components that we use for statistics / inference!): stochastic processes
  • Stochastic process (intuitive def) - a collection of random vectors (variables) with defined conditional relationships, often indexed by an ordered set t
  • We will be interested in one particular class of models within this probability sub-field: Markov processes (or more specifically, Markov chains)
  • Our MCMC will be a Markov chain (probability model)
SLIDE 16

Review: Markov processes

  • A Markov chain can be thought of as a random vector (or more accurately, a set of random vectors), which we will index with t: Xt, Xt+1, Xt+2, ..., Xt+k or Xt, Xt−1, Xt−2, ..., Xt−k
  • Markov chain - a stochastic process that satisfies the Markov property: Pr(Xt|Xt−1, Xt−2, ..., Xt−k) = Pr(Xt|Xt−1)
  • While we often assume each of the random variables in a Markov chain is in the same class of random variables (e.g. Bernoulli, normal, etc.), we allow the parameters of these random variables to be different, e.g. at time t and t+1
  • How does this differ from a random vector of an iid sample!?

SLIDE 17

Example of a Markov chain

  • As an example, let’s consider a Markov chain where each random variable in the chain has a Bernoulli distribution: X1 ~ Bern(0.2), X2 ~ Bern(0.45), ..., X1001 ~ Bern(0.4), X1002 ~ Bern(0.4)
  • Note that we could draw observations from this Markov chain (since it is just a random vector with a probability distribution!), e.g. X1, X2, ..., X1001, X1002 = 1,0,...,1,1 or 0,1,...,1,1 or 0,0,...,0,0 or 0,1,...,0,0
  • How does this differ from an iid random vector?
  • Note that for t late in this process, the parameters of the Bernoulli distributions are the same (= they do not change over time)
  • In our case, we will be interested in Markov chains that “evolve” to such stationary distributions
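The “evolution” toward a stationary Bernoulli distribution can be simulated; a minimal sketch with hypothetical transition probabilities (not the ones behind the slide’s numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical two-state chain: Pr(X_{t+1}=1|X_t=0) and Pr(X_{t+1}=1|X_t=1)
p01, p11 = 0.2, 0.6

def run_chain(t_max, x0=0):
    x, draws = x0, []
    for _ in range(t_max):
        x = int(rng.random() < (p11 if x == 1 else p01))
        draws.append(x)
    return draws

# Marginal Pr(X_t = 1) late in the chain, estimated across independent runs
late_freq = np.mean([run_chain(50)[-1] for _ in range(5000)])
# The chain converges to the stationary Bernoulli parameter:
stationary = p01 / (p01 + (1 - p11))   # = 1/3 for these transition probabilities
```

Regardless of the starting state, the marginal distribution late in the chain matches the stationary parameter.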

SLIDE 18

Stationary distributions and MCMC

  • If a Markov chain has certain properties (irreducible and ergodic), we can prove that the chain will evolve (more accurately, converge!) to a unique (!!) stationary distribution and will not leave this stationary distribution (where it is often possible to determine the parameters of the stationary distribution!)
  • For such Markov chains, if we consider enough iterations t+k (where k may be very large, e.g. infinite), we will reach a point where each following random variable is in the unique stationary distribution: Pr(Xt+k) = Pr(Xt+k+1) = ...
  • For the purposes of Bayesian inference, we are going to set up a Markov chain that evolves to a unique stationary distribution that is exactly the posterior probability distribution that we are interested in (!!!)
  • To use this chain, we will run the Markov chain for enough iterations to reach this stationary distribution and then we will take a sample from this chain to determine (or more accurately, approximate) our posterior
  • This is Bayesian Markov chain Monte Carlo (MCMC)!

SLIDE 19

An example of Bayesian MCMC

  • MCMC = Xt+k, Xt+k+1, Xt+k+2, ..., Xt+k+m
  • Sample = 0.1, −0.08, −1.4, ..., 0.5 (a sample from the chain approximating, e.g., Pr(μ|y))
  • θ̂ = median(Pr(θ|y)) ≈ median(θ[t_ab], ..., θ[t_ab+k])

SLIDE 20

Constructing an MCMC

  • Instructions for constructing an MCMC using the Metropolis-Hastings approach:
    1. Choose θ[0], where Pr(θ[0]|y) > 0.
    2. Sample a proposal parameter value θ* from a jumping distribution J(θ*|θ[t]), where t = 0 or any subsequent iteration.
    3. Calculate r = (Pr(θ*|y)J(θ[t]|θ*)) / (Pr(θ[t]|y)J(θ*|θ[t])).
    4. Set θ[t+1] = θ* with Pr(θ[t+1] = θ*) = min(r, 1) and θ[t+1] = θ[t] with Pr(θ[t+1] = θ[t]) = 1 − min(r, 1).
  • Running the MCMC algorithm:
    1. Set up the Metropolis-Hastings algorithm.
    2. Initialize the values for θ[0].
    3. Iterate the algorithm for t >> 0, such that we are past t_ab, which is the iteration after the ‘burn-in’ phase, where the realizations of θ[t] start to behave as though they are sampled from the stationary distribution of the Metropolis-Hastings Markov chain (we will discuss how many iterations are necessary for a burn-in below).
    4. Sample the chain for a set of iterations after the burn-in and use these to approximate the posterior distribution and perform Bayesian inference.
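The four construction steps can be sketched in a few lines; a minimal sketch assuming a normal likelihood with known σ = 1, a flat prior, and a symmetric normal jumping distribution (so the J terms in r cancel) — the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.0, size=100)   # simulated sample; sigma = 1 assumed known

def log_post(theta):
    # log Pr(theta|y) up to a constant: flat prior x normal likelihood
    return -0.5 * np.sum((y - theta) ** 2)

def metropolis_hastings(n_iter=20_000, step=0.2, theta0=0.0):
    theta, lp, draws = theta0, log_post(theta0), []
    for _ in range(n_iter):
        prop = theta + rng.normal(0.0, step)      # symmetric J: cancels in r
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(r, 1)
            theta, lp = prop, lp_prop
        draws.append(theta)                       # on rejection, keep theta[t]
    return np.array(draws)

draws = metropolis_hastings()
posterior_sample = draws[2_000:]   # discard burn-in iterations
```

For this flat-prior normal model the posterior mean is the sample mean and the posterior standard deviation is 1/√n, so the chain’s post-burn-in draws should concentrate accordingly.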

SLIDE 21

Constructing an MCMC for genetic analysis

  • For a given marker in our GWAS, we define our glm (which gives us our likelihood) and our prior (which we provide!), and our goal is then to construct an MCMC with a stationary distribution (which we will sample to get the posterior “histogram”): θ[t] = [βμ, βa, βd, σ²ε][t], θ[t+1] = [βμ, βa, βd, σ²ε][t+1], ...
  • One approach is setting up a Metropolis-Hastings algorithm by defining a jumping distribution
  • Another approach is to use a special case of the Metropolis-Hastings algorithm called the Gibbs sampler (requires no rejections!), which samples each parameter from the conditional posterior distributions (which requires you to derive these relationships = not always possible!):

Pr(βμ|βa, βd, σ²ε, y)
Pr(βa|βμ, βd, σ²ε, y)
Pr(βd|βμ, βa, σ²ε, y)
Pr(σ²ε|βμ, βa, βd, y)

SLIDE 22

Importance of MCMC

  • Constructing an MCMC for Bayesian inference is extremely practical
  • The constraint is that MCMC algorithms are computationally intensive
  • This is one reason the surge in the practical use of Bayesian data analysis came when computers increased in speed
  • This is definitely the case in genetic analysis, where the number of Bayesian MCMC approaches has steadily increased over the last decade or so
  • One issue is that, even with a fast computer, MCMC algorithms can be inefficient (they take a long time to converge, they do not sample modes of a complex posterior efficiently, etc.)
  • There are therefore other algorithmic approaches to Bayesian genetic inference, e.g. variational Bayes

SLIDE 23

Topics that we don’t have time to cover (but 2019 lectures available!) - will briefly mention last lecture…

  • Alternative tests in GWAS (2019 Lecture 19)
  • Haplotype testing (2019 Lecture 19)
  • Multiple regression analysis / epistasis (2019 Lecture 21)
  • Multivariate regression analysis / eQTL (2019 Lecture 21)
  • Basics of linkage analysis (2019 Lecture 24)
  • Basics of inbred line analysis (2019 Lecture 25)
  • Basics of evolutionary quantitative genetics (2019 Lecture 25)
SLIDE 24

Alternative tests in GWAS I

  • Since our basic null / alternative hypothesis construction in GWAS covers a large number of possible relationships between genotypes and phenotypes, there are a large number of tests that we could apply in a GWAS
  • e.g. t-tests, ANOVA, Wald’s test, non-parametric permutation based tests, Kruskal-Wallis tests, other rank based tests, chi-square, Fisher’s exact, Cochran-Armitage, etc. (see PLINK for a somewhat comprehensive list of tests used in GWAS)
  • When can we use different tests? The only restriction is that our data conform to the assumptions of the test (examples?)
  • We could therefore apply a diversity of tests for any given GWAS

SLIDE 25

Alternative tests in GWAS II

  • We do not have time in this course to do a comprehensive review of possible tests (keep in mind, every time you learn a new test in a statistics class, there is a good chance you could apply it in a GWAS!)
  • Let’s consider a few example alternative tests that could be applied
  • Remember that to apply these alternative tests, you will perform N alternative tests, one for each marker-phenotype combination, where for each case, we are testing the following hypotheses with different (implicit) codings of X (!!): H0 : Cov(Y, X) = 0 versus HA : Cov(Y, X) ≠ 0

SLIDE 26

Alternative test example

  • First, let’s consider a case-control phenotype and consider a chi-square test (which has deep connections to our logistic regression test under certain assumptions but has slightly different properties!)
  • To construct the test statistic, we consider the counts of genotype-phenotype combinations and calculate the expected numbers in each cell:

Observed counts:
          Case   Control
  A1A1    n11    n12       n1.
  A1A2    n21    n22       n2.
  A2A2    n31    n32       n3.
          n.1    n.2       n

Expected counts:
          Case          Control
  A1A1    (n.1n1.)/n    (n.2n1.)/n    n1.
  A1A2    (n.1n2.)/n    (n.2n2.)/n    n2.
  A2A2    (n.1n3.)/n    (n.2n3.)/n    n3.
          n.1           n.2           n

  • We then construct the following test statistic: LRT = −2lnΛ = 2 Σ(i=1..3) Σ(j=1..2) nij ln(nij n / (ni. n.j))
  • The (asymptotic) distribution when the null hypothesis is true, i.e. as the sample size tends to infinity, is χ² with d.f. = (#columns − 1)(#rows − 1) = 2; we can therefore calculate the statistic and assess significance against a χ² distribution with d.f. = 2
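The statistic can be computed directly from a table of counts; a minimal sketch with made-up counts (and using the fact that for d.f. = 2 the χ² survival function has the closed form e^(−x/2)):

```python
import numpy as np

# Hypothetical 3x2 genotype-by-phenotype count table (rows A1A1, A1A2, A2A2)
obs = np.array([[30.0, 10.0],
                [45.0, 35.0],
                [25.0, 55.0]])

n = obs.sum()
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n   # (ni. * n.j) / n

# LRT (G) statistic: 2 * sum_ij n_ij * ln(n_ij / E_ij)
lrt = 2.0 * np.sum(obs * np.log(obs / expected))
# chi-square survival function for d.f. = 2 is exactly exp(-x/2)
p_value = np.exp(-lrt / 2.0)
```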

SLIDE 27

Alternative tests in GWAS III

  • Should we use different tests in a GWAS (and why)? Yes we should - the reason is different tests have different performance depending on the (unknown) conditions of the system and experiment, i.e. some may perform better than others
  • In general, since we don’t know the true conditions (and therefore which test will be best suited), we should run a number of tests and compare results
  • How to compare results of different GWAS analyses is a fuzzy case (= no non-conditional rules), but a reasonable approach is to treat each test as a distinct GWAS analysis and compare the hits across analyses using the following rules:
  • If all methods identify the same hits (= genomic locations) then this is good evidence that there is a causal polymorphism
  • If methods do not agree on the position (e.g. some are significant, some are not) we should attempt to determine the reason for the discrepancy (this requires that we understand the tests, plus experience)

SLIDE 28

Comparing results of multiple analyses of the same GWAS data I

  • I’ve run my initial analyses using several tests and produced the following (now what!?):

[Figure: results of the same GWAS data analyzed with several different tests]

SLIDE 29

Comparing results of multiple analyses of the same GWAS data II

  • The best case is that the same markers (SNPs) pass a multiple test correction regardless of the testing approach used, i.e. the result is robust to testing approach.
  • In cases where this does not happen (most), it becomes helpful to understand why test results could be different:
  • Are tests capturing additive vs. dominance effects?
  • Are some tests less powerful than others, or dependent on certain assumptions being true? Are they handling missing data in different ways?
  • Are particular covariates altering the results if included/excluded? Why might this be?
  • Does it depend on how you partition the data (e.g. batch effects)?
  • This can help narrow down the set of tests you feel are the most informative. In general, a good publishing strategy is limiting yourself to one or two tests that both give you significant results that you believe!

SLIDE 30

Topics that we don’t have time to cover (but 2019 lectures available!) - will briefly mention last lecture…

  • Alternative tests in GWAS (2019 Lecture 19)
  • Haplotype testing (2019 Lecture 19)
  • Multiple regression analysis / epistasis (2019 Lecture 21)
  • Multivariate regression analysis / eQTL (2019 Lecture 21)
  • Basics of linkage analysis (2019 Lecture 24)
  • Basics of inbred line analysis (2019 Lecture 25)
  • Basics of evolutionary quantitative genetics (2019 Lecture 25)
SLIDE 31

Haplotype testing I

  • We have just extended our GWAS framework to handle additional phenotypes
  • We can also extend our GWAS framework to handle genotypes defined using a different approach
  • In this case, let’s consider using haplotype alleles in our testing framework
  • Note that a haplotype collapses genetic marker information, but in some cases, testing using haplotypes is more effective than testing one genetic marker at a time

SLIDE 32

Why does the concept of a haplotype make sense?

  • This is because of LD!
  • The general rule: if we have a set of markers in high LD with each other but low LD with other markers, we use this as a guide for defining the haplotype block

[Figure: LD structure and association signals across a genomic region (≈211.5-212.0 Mb) spanning the genes RCOR3, TRAF5, C1orf97, RD3, SLC30A1, NEK2, and LPGAT1, with hits marked for single marker test, conditional test, VBAY, Lasso, Adaptive Lasso, 2D-MCP, LOG, NEG (independent vs. non-independent hits)]

SLIDE 33

Haplotype testing II

  • Haplotype - a series of ordered, linked alleles that are inherited together
  • For the moment, let’s consider a haplotype to define a “function” that takes a set of alleles at several loci A, B, C, D, etc. and outputs a haplotype allele: h = f(Ai, Bj, ...)
  • For example, if these loci are each a SNP with the following alleles (A,G), (A,T), (G,C), (G,C), we could define the following haplotype alleles: h1 = (A, A, C, C) and h2 = (G, T, G, G)

SLIDE 34

Defining haplotypes

  • We could spend multiple lectures on how people define haplotypes for given systems and the algorithms used for this purpose (so we will just briefly mention the main concepts here)
  • To define haplotypes, we need to “phase” measured genotype markers, decide on the number of genotype markers to put together into a haplotype block, and decide how many haplotype alleles to consider
  • Remember: there are no universal rules for doing this (system dependent!)

SLIDE 35

GWAS with haplotypes I

  • Once we have defined haplotype alleles, we can proceed with a GWAS using our framework (just substitute haplotype alleles and genotypes for genetic marker alleles and genotypes!)
  • For example, in a case where we only have two haplotype alleles, we can code our independent variables for our regression model as follows:

Xa(h1h1) = −1, Xa(h1h2) = 0, Xa(h2h2) = 1
Xd(h1h1) = −1, Xd(h1h2) = 1, Xd(h2h2) = −1

  • All other aspects remain the same (although what is the effect on our interpretation of where the causal polymorphism is located?)
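The two-allele coding above can be written as a small helper; a minimal sketch where the genotype labels h1h1/h1h2/h2h2 are hypothetical string codes chosen for this example:

```python
# Code haplotype-pair "genotypes" into regression dummies, mirroring the
# Xa/Xd coding used for single genetic markers
def haplotype_coding(genotype: str) -> tuple[float, float]:
    xa = {"h1h1": -1.0, "h1h2": 0.0, "h2h2": 1.0}[genotype]
    xd = {"h1h1": -1.0, "h1h2": 1.0, "h2h2": -1.0}[genotype]
    return xa, xd

codes = [haplotype_coding(g) for g in ("h1h1", "h1h2", "h2h2")]
```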

SLIDE 36

Advantages of haplotype testing

  • In some cases (system and sample dependent!), the haplotype is a better “tag” of the causal polymorphism than any of the surrounding markers
  • In such a case, Corr(Xh, X) > Corr(X’, X), and the haplotype test therefore has a higher probability of correctly rejecting the null hypothesis
  • Another “advantage” is that by putting together markers, we are performing fewer total tests in our GWAS (in what sense is this an advantage!?)

SLIDE 37

Disadvantages of haplotype testing

  • Collapsing to haplotypes may produce a better tag, but it also may not (!!), i.e. sometimes (in fact often!) individual genetic markers are better tags of the causal polymorphism
  • Another disadvantage is resolution, since we absolutely cannot resolve the position of the causal polymorphism to a position smaller than the range of the haplotype alleles, i.e. large haplotypes can have lower resolution
  • If we had measured the causal polymorphism in our data, would we need haplotype testing? (i.e. in the future, the importance of haplotype testing may decrease)

SLIDE 38

Should I apply haplotype testing in my GWAS?

  • Yes! But apply both an individual marker testing approach (always!) as well as a haplotype test (optional)
  • The reason is that we never know the true answer in our GWAS (as with any statistical analysis!), so it doesn’t hurt us to explore our dataset with as many techniques as we want to apply
  • In fact, this will be a continuing theme of the class, i.e. keep analyzing GWAS with as many methods as you find useful
  • However, since we never know the right answer for certain, if we get conflicting results, which one do we interpret as “correct”!?

SLIDE 39

Topics that we don’t have time to cover (but 2019 lectures available!) - will briefly mention last lecture…

  • Alternative tests in GWAS (2019 Lecture 19)
  • Haplotype testing (2019 Lecture 19)
  • Multiple regression analysis / epistasis (2019 Lecture 21)
  • Multivariate regression analysis / eQTL (2019 Lecture 21)
  • Basics of linkage analysis (2019 Lecture 24)
  • Basics of inbred line analysis (2019 Lecture 25)
  • Basics of evolutionary quantitative genetics (2019 Lecture 25)
SLIDE 40

Introduction to epistasis I

  • So far, we have applied a GWAS analysis by considering statistical models between one genetic marker and the phenotype
  • This is the standard approach applied in all GWAS analyses and the one that you should apply as a first step when analyzing GWAS data (always!)
  • However, we could start considering more than one marker in each of the statistical models we consider
  • One reason we might want to do this is to test for statistical interactions among genetic markers (or more specifically, between the causal polymorphisms that they are tagging)

SLIDE 41

Introduction to epistasis III

  • As an example, for a sample that we can appropriately model with a linear regression model, we can plot the phenotypes associated with each of the nine classes:

[Figure: phenotype plotted against the nine two-locus genotype classes]

  • In this case, both marginal loci are additive

SLIDE 42

Introduction to epistasis IV

  • With nine classes, we also get the possibility of conditional relationships we have not seen before:

[Figure: phenotype plotted against the nine two-locus genotype classes, where the effect of one locus depends on the genotype at the other]

  • This is an example of epistasis

SLIDE 43

Notes about epistasis I

  • epistasis - a case where the effect of an allele substitution at one locus A1 -> A2 alters the effect of substituting an allele at another locus B1 -> B2
  • This may be equivalently phrased as a change in the expected phenotype (genotypic value) for a genotype at one locus conditional on the genotype at another locus
  • Note that there is a symmetry in epistasis such that if the effect of at least one allelic substitution (from one genotype to another) for one locus depends on the genotype at the other locus, then at least one allelic substitution of the other locus will be dependent as well
  • A consequence of this symmetry is that if there is an epistatic relationship between two loci, BOTH will be causal polymorphisms for the phenotype (!!!)
  • If there is an epistatic effect (= relationship) between loci, we would therefore like to know this information
  • Note that we need not consider such relationships for only a pair of loci; such relationships can exist among three (three-way), four (four-way), etc.
  • The amount of epistasis among loci for any given phenotype is unknown (but without question it is ubiquitous!!)

SLIDE 44

Notes about epistasis II

  • Note that the definition of epistasis is entirely statistical (!!) and says nothing about mechanism (although people have mis-appropriated the term in this way)
  • The term epistasis was coined by Fisher in the 1920’s
  • Epistasis is sometimes called genotype by genotype, G by G, or G x G
  • Geneticists often use the term “modifiers” to describe the dependence of genetic effects at a locus on the state of another locus - this is just epistasis (!!)
  • We can also consider the effects of a locus when considering the entire “genetic background” (i.e. all the states in the rest of the genome!) - this is also epistasis (!!)

SLIDE 45

Modeling epistasis I

  • To model epistasis, we are going to use our same GLM framework (!!)
  • The parameterization (using Xa and Xd) that we have considered so far perfectly models any case where there is no epistasis
  • We will account for the possibility of epistasis by constructing additional dummy variables and adding additional parameters (so that we have 9 total in our GLM)

SLIDE 46

Modeling epistasis II

  • Recall the dummy variables we have constructed so far:

Xa,1 = −1 for A1A1, 0 for A1A2, 1 for A2A2;  Xd,1 = −1 for A1A1, 1 for A1A2, −1 for A2A2
Xa,2 = −1 for B1B1, 0 for B1B2, 1 for B2B2;  Xd,2 = −1 for B1B1, 1 for B1B2, −1 for B2B2

  • We will use these dummy variables to construct additional dummy variables in our GLM (and add additional parameters) to account for epistasis:

Y = γ⁻¹(βμ + Xa,1βa,1 + Xd,1βd,1 + Xa,2βa,2 + Xd,2βd,2 + Xa,1Xa,2βa,a + Xa,1Xd,2βa,d + Xd,1Xa,2βd,a + Xd,1Xd,2βd,d)
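The codings above can be assembled into the nine-column design; a minimal sketch, assuming each genotype is given as an allele count 0/1/2 (an encoding choice for this sketch, not from the slides):

```python
import numpy as np

# Build one row of the 9-column epistasis design from two genotypes, each
# encoded as an allele count 0/1/2
def epistasis_row(g1: int, g2: int) -> list[float]:
    xa1, xd1 = g1 - 1.0, 1.0 if g1 == 1 else -1.0
    xa2, xd2 = g2 - 1.0, 1.0 if g2 == 1 else -1.0
    return [1.0, xa1, xd1, xa2, xd2,
            xa1 * xa2, xa1 * xd2, xd1 * xa2, xd1 * xd2]

# One row per two-locus genotype class: 9 classes x 9 parameters
X = np.array([epistasis_row(g1, g2) for g1 in (0, 1, 2) for g2 in (0, 1, 2)])
```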

SLIDE 47

Inference for epistasis I

  • To infer epistatic relationships we will use the exact same genetic framework and statistical framework that we have been considering
  • For the genetic framework, we are still testing markers that we are assuming are in LD with causal polymorphisms that could have an epistatic relationship (so we are indirectly inferring that there is epistasis from the marker genotypes)
  • For inference, we are going to estimate epistatic parameters using the same approach as before (!!), i.e. for a linear model:

X = [1, Xa,1, Xd,1, Xa,2, Xd,2, Xa,a, Xa,d, Xd,a, Xd,d]
β = [βμ, βa,1, βd,1, βa,2, βd,2, βa,a, βa,d, βd,a, βd,d]ᵀ
β̂ = (XᵀX)⁻¹Xᵀy
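Estimation and a likelihood ratio test for the epistasis-only hypothesis can be sketched end-to-end; a minimal sketch on simulated data (the effect sizes and the additive-by-additive epistatic effect are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
g1, g2 = rng.integers(0, 3, n), rng.integers(0, 3, n)
xa1, xd1 = g1 - 1.0, np.where(g1 == 1, 1.0, -1.0)
xa2, xd2 = g2 - 1.0, np.where(g2 == 1, 1.0, -1.0)

X_alt = np.column_stack([np.ones(n), xa1, xd1, xa2, xd2,
                         xa1 * xa2, xa1 * xd2, xd1 * xa2, xd1 * xd2])
X_null = X_alt[:, :5]                    # drop the four interaction columns

# Simulated phenotype with an additive-by-additive epistatic effect
y = 1.0 + 0.4 * xa1 + 0.3 * xa2 + 0.5 * xa1 * xa2 + rng.normal(0.0, 1.0, n)

def rss(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # (X'X)^{-1} X'y
    return np.sum((y - X @ beta) ** 2)

# LRT for H0: beta_aa = beta_ad = beta_da = beta_dd = 0; asymptotically
# chi-square with d.f. = 9 - 5 = 4
lrt = n * np.log(rss(X_null, y) / rss(X_alt, y))
```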

SLIDE 48

Inference for epistasis II

  • For hypothesis testing, we will just use an LRT calculated the same way as before (!!)
  • For an F-statistic for a linear regression, or for a logistic regression, estimate the parameters under the null and alternative models and substitute these into the likelihood equations, which have the same form as before (with some additional dummy variables and parameters)
  • The only difference is the degrees of freedom for a given test we consider = number of parameters in the alternative model − the number of parameters in the null model

SLIDE 49

Inference for epistasis III

  • For example, we could use the entire model to test the same hypothesis that we have been considering for a single marker:

H0 : βa,1 = 0 ∩ βd,1 = 0
HA : βa,1 ≠ 0 ∪ βd,1 ≠ 0

  • We could also test whether either marker has evidence of being a causal polymorphism:

H0 : βa,1 = 0 ∩ βd,1 = 0 ∩ βa,2 = 0 ∩ βd,2 = 0
HA : βa,1 ≠ 0 ∪ βd,1 ≠ 0 ∪ βa,2 ≠ 0 ∪ βd,2 ≠ 0

  • We can also test just for epistasis (note this is equivalent to testing an interaction effect in an ANOVA!):

H0 : βa,a = 0 ∩ βa,d = 0 ∩ βd,a = 0 ∩ βd,d = 0
HA : βa,a ≠ 0 ∪ βa,d ≠ 0 ∪ βd,a ≠ 0 ∪ βd,d ≠ 0

  • We can also test the entire model (what is the interpretation in this case!?):

H0 : βa,1 = 0 ∩ βd,1 = 0 ∩ βa,2 = 0 ∩ βd,2 = 0 ∩ βa,a = 0 ∩ βa,d = 0 ∩ βd,a = 0 ∩ βd,d = 0
HA : βa,1 ≠ 0 ∪ βd,1 ≠ 0 ∪ βa,2 ≠ 0 ∪ βd,2 ≠ 0 ∪ βa,a ≠ 0 ∪ βa,d ≠ 0 ∪ βd,a ≠ 0 ∪ βd,d ≠ 0

SLIDE 50

Topics that we don’t have time to cover (but 2019 lectures available!) - will briefly mention last lecture…

  • Alternative tests in GWAS (2019 Lecture 19)
  • Haplotype testing (2019 Lecture 19)
  • Multiple regression analysis / epistasis (2019 Lecture 21)
  • Multivariate regression analysis / eQTL (2019 Lecture 21)
  • Basics of linkage analysis (2019 Lecture 24)
  • Basics of inbred line analysis (2019 Lecture 25)
  • Basics of evolutionary quantitative genetics (2019 Lecture 25)
SLIDE 51

Analysis with more phenotypes

  • So far, we have considered a GWAS analysis where we have a single phenotype and many genotypes, the latter collected by genomics technologies
  • Genomics technologies can also be used to measure many phenotypes (e.g., genome-wide gene expression, proteomics, etc.)
  • We also often have a situation where we have both many genotypes and many phenotypes
  • The framework you have learned in this class still applies (!!), i.e., the first step in these analyses is still testing pairs of variables at a time

slide-52
SLIDE 52
  • Consider a case where you have collected genome-wide gene

expression or proteomic data for a tissue of a mouse experiment where there are only two conditions: “wild type" and “mutant”:

  • To analyze these data, regress each phenotype (e.g., a gene

expression measurement) on the condition (e.g., coded 0 / 1)

  • One phenotype variable at a time (just like a GWAS!!)

Many phenotypes and one experimental condition I

Data = [ z11 … z1k   y11 … y1m   x11 … x1N ]
       [  ⋮       ⋮     ⋮       ⋮     ⋮       ⋮ ]
       [ zn1 … znk   yn1 … ynm   xn1 … xnN ]

slide-53
SLIDE 53
  • There is one important diagnostic difference in the many phenotype

analysis: your QQ plots need not conform to the rules of GWAS QQ plots (please take note of this!!)

  • That is, when you have a single treatment (or genotype) where you

are considering the impact on many phenotypes, it is possible the treatment / genotype impacts many phenotypes (and therefore produces many significant tests!)

Many phenotypes and one experimental condition II

slide-54
SLIDE 54
  • Why is this?
  • That is, why is it that when analyzing GWAS data (= regressing one phenotype on many genotypes) correct statistical model fitting cannot produce many highly significant tests, while an analysis of many phenotypes on one genotype can produce many significant test results (and these can be the appropriate result)?

  • The reason is in a GWAS, we are assuming the underlying true case

is many causal genotypes each contributing to variation in the one phenotype, such that if there are many, each of their effects is relatively small (!!)

  • In a many-phenotypes-with-one-treatment situation, the treatment (or genotype) may separately impact many of the phenotypes (!!)

Many phenotypes and one experimental condition III

slide-55
SLIDE 55
  • From the statistical modeling point of view, we can view a GWAS as a

multiple regression model (i.e., a single Y with many X’s):

  • While for a case with many phenotypes and a single treatment (e.g., a

single genotype) the correct model is a multivariate regression (i.e., many Y’s with a single X)

  • We could also have many phenotypes and many genotypes (e.g., eQTL)

Many phenotypes and one experimental condition IV

Data = [ z11 … z1k   y11 … y1m   x11 … x1N ]
       [  ⋮       ⋮     ⋮       ⋮     ⋮       ⋮ ]
       [ zn1 … znk   yn1 … ynm   xn1 … xnN ]

slide-56
SLIDE 56
  • While the right first analysis step when dealing with many

variables is testing pairs of variables at a time (e.g., one phenotype - one genotype) could we construct statistical models that consider more genotypes or more phenotypes at the same time?

  • Yes!
  • We could fit multiple regressions with many genotypes (you’ve

done multiple regressions already!)

  • We could fit multivariate regressions with many

Y’s and one treatment

  • We could even fit a multivariate-multiple regression model (!!)

Multiple and multivariate models I

slide-57
SLIDE 57
  • The problem with the multivariate regression approach is that many aspects get more complicated and, in practice, you often get the same information as fitting one Y and X pair at a time

  • The problem with multiple regressions with many X’s is the over-fitting problem, requiring other techniques (e.g., penalized or regularized regressions), and in practice you often get the same information as fitting one Y and X pair

  • Same for multivariate-multiple regression situations like eQTL

designs (let’s take a quick look at this concept first)

  • For multiple regressions, we sometimes like to consider a few

more X’s to capture “interactions” (=epistasis)
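To make the over-fitting point concrete, here is a sketch of a ridge regression (one of the penalized/regularized techniques mentioned above) in a setting with more markers than samples; all dimensions and effect sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 500                      # many more markers than samples

X = rng.normal(0.0, 1.0, (n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                  # only the first 5 markers are causal
y = X @ beta_true + rng.normal(0.0, 1.0, n)

# Ordinary least squares is ill-posed when p > n; adding the ridge penalty
# lam * ||beta||^2 makes the normal equations solvable
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The causal markers should receive larger estimated effects on average
print(np.abs(beta_ridge[:5]).mean(), np.abs(beta_ridge[5:]).mean())
```

The penalty shrinks all estimates toward zero (a bias), which is the price paid for a well-posed fit when p > n.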

Multiple and multivariate models II

slide-58
SLIDE 58
  • expression Quantitative Trait Locus (eQTL) - a polymorphic locus where an

experimental exchange of one allele for another produces a change in expression on average under specified conditions:

  • The allelic states defined by the original mutation event define the causal

polymorphism of the eQTL

  • Intuitive example: if rs27290 was a causal allele, changing A -> G would change the

measured expression of ERAP2

Example: eQTL!

eQTL

[Figure: ERAP2 expression (y-axis, ~3.5–6.0) plotted by rs27290 genotype (A/A, A/G, G/G)]

X : A1 → A2 ⇒ ∆Y | Z
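As an illustration only (the genotype frequencies and effect size below are simulated, not real rs27290/ERAP2 data), a single eQTL test is just a regression of expression on an additively coded genotype:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 300

# Hypothetical eQTL: genotype coded as 0, 1, 2 copies of one allele,
# shifting expression additively (simulated values)
g = rng.binomial(2, 0.4, n)
expr = 4.0 + 0.5 * g + rng.normal(0.0, 0.5, n)

fit = stats.linregress(g, expr)      # one marker, one expression phenotype
print(fit.slope, fit.pvalue)
```

An eQTL scan repeats exactly this test across all marker-by-transcript pairs, just like a GWAS repeated over many phenotypes.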

slide-59
SLIDE 59

Topics that we don’t have time to cover (but 2019 lectures available!) - will briefly mention last lecture…

  • Alternative tests in GWAS (2019 Lecture 19)
  • Haplotype testing (2019 Lecture 19)
  • Multiple regression analysis / epistasis (2019 Lecture 21)
  • Multivariate regression analysis / eQTL (2019 Lecture 21)
  • Basics of linkage analysis (2019 Lecture 24)
  • Basics of inbred line analysis (2019 Lecture 25)
  • Basics of evolutionary quantitative genetics (2019 Lecture 25)
slide-60
SLIDE 60
  • Use of pedigrees has a long history in genetics, where the use of family pedigrees stretches back ~100 years, i.e. before genetic markers (!!)

  • The observation that led people to analyze pedigrees was that Mendelian diseases (= phenotype determined by a single locus where genotype is highly predictive of phenotype) tend to run in families

  • The genetics of such diseases could therefore be studied by analyzing a family pedigree

  • Given the disease focus, it is perhaps not surprising that family pedigree analysis was the main tool of medical genetics

Pedigrees in genetics I

slide-61
SLIDE 61
  • When the first genetic markers appeared, it was natural to use

these to identify positions in the genome that may have the causal polymorphisms responsible for the Mendelian disease

  • In fact, analysis of pedigrees in combination with just a few markers was the first step in identifying the causal polymorphisms for many Mendelian diseases, i.e. they could identify the general position in a chromosome, which could be investigated further with additional markers, etc.

  • In the late 70’s - 90’s a large number of Mendelian causal disease polymorphisms were found using such techniques

  • Pedigree analysis therefore dominates the medical genetics literature (where now this field is wrapped into the more diffusely defined field of quantitative genomics!)

Pedigrees in genetics II

slide-62
SLIDE 62
  • segregation analysis - inference concerning whether a phenotype

(disease) is consistent with a Mendelian disease given a pedigree (no genetic data!)

  • identity by descent (ibd) - inference concerning whether two (or more) individuals share alleles because they inherited them from a common ancestor (note: such analyses can be performed without markers but more recently, markers have allowed finer ibd inference and ibd inference without a pedigree!)

  • linkage analysis - use of a genetic markers on a pedigree to map the

position of causal polymorphisms affecting a phenotype (which may be Mendelian or complex)

  • family based testing - the use of genetic markers and many small pedigrees to map the position of causal polymorphisms (again Mendelian or complex)

  • Note that there are others (!!) and that we will provide simple examples that illustrate the last two

Types of pedigree analysis

slide-63
SLIDE 63
  • The reason that we do not focus on pedigree analysis in this class is that having high-coverage marker data makes many of the pedigree analyses unnecessary

  • As an example, pedigree (linkage) analysis was useful when we only had a

few markers because we could use the pedigree to infer the states of unseen markers

  • Once we can measure all the markers there is no need to use a pedigree
  • In fact, we can easily map the positions of Mendelian disease causal

polymorphisms without a pedigree (and we now do this all the time)

  • What’s worse, using pedigree (linkage) analysis to map causal polymorphisms for complex phenotypes has turned out to produce poor (= not useful) inferences (!!)

  • However, understanding the basic intuition of these methods is critical for

understanding the literature in quantitative genetics and for derived pedigree methods that are still used

Importance of pedigree analysis now

slide-64
SLIDE 64
  • Both linkage analysis and association analysis have the same goal: identify

positions in the genome where there are causal polymorphisms using genetic markers

  • Recall that we are modeling the following in association analysis:
  • We are not concerned that the marker we are testing is not the causal

marker, but we would prefer to test the causal marker (if we could!)

  • Note that if we could model the relationship of the unmeasured causal

polymorphism Xcp and observed genetic marker X, we could use this information:

  • This is what we do in linkage analysis (!!)

Connection between linkage / association analysis I

Pr(Y |X)

Pr(Y |Xcp)Pr(Xcp|X)

slide-65
SLIDE 65
  • Note that the first of these two terms is called the penetrance model (and there are many ways to model penetrance!) and the second term is modeled based on the structure of an observed pedigree, which allows us to infer the conditional relationship of the causal polymorphism and observed genetic marker by inferring a recombination probability parameter r (confusingly, this is often symbolized as θ in the literature!):

  • We can therefore use the same statistical (inference) tools we have used before, but our models will be a little more complex and we will be inferring not only parameters that relate the genotype and phenotype (e.g. regression β’s) but also the parameter r (!!)

  • If we are dealing with a Mendelian trait (which is the case for many linkage analyses), the causal polymorphism perfectly describes the phenotype, so we do not need to be concerned with the penetrance model:

Connection between linkage / association analysis II

Pr(Y |Xcp)Pr(Xcp|X, r(Xcp,X))

Pr(Xcp|X, r(Xcp,X))

slide-66
SLIDE 66
  • In the literature, we often symbolize the combination of Xcp and X as a single g (for the

genotype involving both of these polymorphisms) so we may re-write this equation as the probability of a vector of a sample of n of these genotypes:

  • To convert this probability model into a more standard pedigree notation, note that

we can write out the genotypes of the n individuals in the sample

  • Using the pedigree information, we can write the following conditional relationships

relating parents (father = gf, mother = gm) to their offspring (where individuals without parents in the pedigree are called founders):

  • Finally, for inference, we need to consider all possible genotype configurations that

could occur for these n individuals (=classic pedigree equation):

Connection between linkage / association analysis III

Pr(Xcp|X, r) = Pr(g|r)

∏_{i=1}^{f} Pr(gi) ∏_{j=f+1}^{n} Pr(gj | gj,f, gj,m, r)

Pr(g1, ..., gn|r) = ∑_{Θg} ∏_{i=1}^{f} Pr(gi) ∏_{j=f+1}^{n} Pr(gj | gj,f, gj,m, r)

slide-67
SLIDE 67
  • Again, note that in general, linkage analysis provides useful information

when you have a Mendelian phenotype and low marker coverage

  • If you have a more complex phenotype or higher marker coverage, it is

better just to test each marker one at a time, since the additional model complexities in linkage analysis tend to reduce the efficacy of the inference

  • A downside of using pedigree designs for mapping with high marker coverage is they have high LD (why?) so resolution is low

  • An upside is the individuals in the sample can be enriched for a disease

(particularly important if the disease is rare) and by considering individuals in a pedigree, this provides some control of genetic background (e.g. epistasis) and other issues!

  • This latter control is why family-based tests are also still used

Linkage analysis wrap-up

slide-68
SLIDE 68

Topics that we don’t have time to cover (but 2019 lectures available!) - will briefly mention last lecture…

  • Alternative tests in GWAS (2019 Lecture 19)
  • Haplotype testing (2019 Lecture 19)
  • Multiple regression analysis / epistasis (2019 Lecture 21)
  • Multivariate regression analysis / eQTL (2019 Lecture 21)
  • Basics of linkage analysis (2019 Lecture 24)
  • Basics of inbred line analysis (2019 Lecture 25)
  • Basics of evolutionary quantitative genetics (2019 Lecture 25)
slide-69
SLIDE 69
  • inbred line design - a sampling experiment where the

individuals in the sample have a known relationship that is a consequence of controlled breeding

  • Note that the relationships may be known exactly (e.g. all individuals have the same grandparents) or known within a set of rules (e.g. the individuals were produced by brother-sister breeding for k generations)

  • Note that inbred line designs are a form of pedigree (= a sample of individuals for which we have information on relationships among individuals)

Analysis of inbred lines

slide-70
SLIDE 70
  • Inbred lines have played a critical role in agricultural

genetics (actually, both inbred lines and pedigrees have been important)

  • This is particularly true for crop species, where people

have been producing inbred lines throughout history and (more recently) for the explicit purposes of genetic analysis

  • In genetic analysis, these have played an important

historical role, leading to the identification of some of the first causal polymorphisms for complex (non-Mendelian!) phenotypes

Historical importance of inbred lines

slide-71
SLIDE 71
  • Inbred lines continue to play a critical role in both agriculture (most

plants we eat are inbred!) and in genetics

  • The reason they continue to be important in genetic analysis is we can control the genetic background (e.g. epistasis!) and, once we know causal polymorphisms, we can introgress the section of genome containing the causal polymorphism through inbreeding designs or now through “exact” approaches like CRISPR (or TALEN) (!!)

  • Where they used to be critically important in Quantitative Genetics was when we had access to many fewer genetic markers: inbreeding designs allowed “strong” inference for the (unmeasured) markers in between

  • This usage is less important now, but for understanding the Quant Gen literature (e.g. the specialized mapping methods applied to these lines) we will consider several specialized designs and how we analyze them

  • How should I analyze (high density) marker data for inbred lines? = do a GWAS analysis one marker at a time (!!) (maybe use a mixed model to account for inbred line structure…)

Importance of inbred lines

slide-72
SLIDE 72
  • The reason that inbred line designs are useful is we can infer

the unobserved markers (with low error!) even with very few markers

  • The reason is inbred line designs result in homozygosity of the resulting lines (although they may be homozygous for different genotypes!)

  • Therefore, inbreeding, in combination with uncontrolled

random sampling (=genetic drift) results in lines that are homozygous for one of the genotypes of the parents

Consequences of inbreeding

slide-73
SLIDE 73
  • A few main examples (non-exhaustive!):
  • B1 (Backcross) - cross between two inbred lines where offspring are crossed

back to one or both parents

  • F2 - cross between two inbred lines where offspring are crossed to each other

to produce the mapping population

  • NILs (Near Isogenic Lines) - cross between two inbred lines, followed by

repeated backcrossing to one of the parent populations, followed by inbreeding

  • RILs (Recombinant Inbred Lines) - an F2 cross followed by inbreeding of the offspring
  • Isofemale lines - offspring of a single female from an outbred (=non-inbred!)

population are inbred

  • We will discuss NILs and briefly mention the F2 design to provide a foundation for

the major concepts in the literature

Types of inbred line designs (important in genetic analysis)

slide-74
SLIDE 74

Inbred line A (homozygous) × Inbred line B (homozygous) → F1 (cross these to each other) → F2

Example: interval mapping (F2)

slide-75
SLIDE 75

Example: interval mapping (F2)

  • A limitation of NILs is the resolution is the size of the smallest

“introgressed” region

  • The goal of “interval mapping” is to take advantage of different

designs but with many possible recombination events, so we could map to a smaller region with a pedigree analysis approach

  • Recall the general structure of the pedigree likelihood equation (note

we could also use a Bayesian approach!):

  • For interval mapping, we will use a version of this equation (what

assumptions!?) to infer the state of unmeasured polymorphism “Q” that is in the proximity of markers we have measured:

  • The first of these equations is just our glm (!!) or similar penetrance

model, where we will consider an example of one type of inbreeding design (F2) to show the structure of the second

Pr(Y |Xcp=Q)Pr(Xcp|X, r(Xcp=Q,X)) = ∑_{Θg} ∏_{i=1}^{f} Pr(yi|gi)Pr(gi) ∏_{j=f+1}^{n} Pr(yj|gj)Pr(gj | gj,f, gj,m, r)

Pr(Y |Xcp=Q)Pr(Xcp|X, r(Xcp=Q,X)) = ∏_{i=1}^{n} ∑_{Θg} Pr(yi | gi,Q) Pr(gi,Q | gi,A, gi,B, r)

slide-76
SLIDE 76
  • We can therefore substitute these conditional probabilities into our

main equation and calculate the likelihood over possible values of r

  • In practice we perform a LRT comparing the null of no causal

polymorphism for an alternative where there is a causal polymorphism in the marker defined region, where if we reject, we consider there to be a causal polymorphism in the region

  • Note that the LRT is sometimes expressed as a “LOD” score (just the LRT in base 10!), which is the LRT times a constant (!!)

  • Note that once we have rejected the null for a region, we can identify

the position within the interval by finding the position where a given value of r maximizes the likelihood, i.e. hence “interval mapping”

  • We can translate this to a relative position if we have a physical map and

recombination map (another complex subject!)

Pr(Y |Xcp=Q)Pr(Xcp|X, r(Xcp=Q,X)) = ∏_{i=1}^{n} ∑_{Θg} Pr(yi | gi,Q) Pr(gi,Q | gi,A, gi,B, r)

Example: interval mapping (F2)
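To make the second term of this equation concrete, here is a simplified sketch computing Pr(gi,Q | marker genotype, r) for an F2 using a single marker at recombination fraction r (the slide’s equation conditions on two flanking markers gi,A and gi,B; the one-marker case shows the idea):

```python
import numpy as np

# In an F2, each parent is an F1 heterozygote M_A Q_A / M_B Q_B, and a
# gamete is recombinant between marker M and putative QTL Q with prob. r
def gamete_probs(r):
    # keys are (marker allele, QTL allele); 0 = from line A, 1 = from line B
    return {(0, 0): (1 - r) / 2, (1, 1): (1 - r) / 2,
            (0, 1): r / 2, (1, 0): r / 2}

def qtl_given_marker(r):
    """P(Q genotype | M genotype); genotypes counted as # of line-B alleles."""
    gp = gamete_probs(r)
    joint = np.zeros((3, 3))             # joint[marker genotype, Q genotype]
    for (m1, q1), p1 in gp.items():      # one gamete from each F1 parent
        for (m2, q2), p2 in gp.items():
            joint[m1 + m2, q1 + q2] += p1 * p2
    return joint / joint.sum(axis=1, keepdims=True)

# At r = 0.1, a marker homozygous for line A implies Q is very likely
# also homozygous for line A
P = qtl_given_marker(0.1)
print(P[0])                              # [0.81, 0.18, 0.01]
```

These conditional probabilities are what get multiplied against the penetrance terms Pr(yi | gi,Q); profiling the resulting likelihood over r (i.e. over position) is what gives interval mapping its name.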

slide-77
SLIDE 77

Value of interval mapping

  • Similar to the case of using a linkage (pedigree) analysis to map causal

polymorphisms for complex (non-Mendelian) phenotypes, in practice, interval mapping turns out to be not very useful

  • The reason is the same as for linkage analysis of complex phenotypes: fitting a complex model does not provide very exact inferences

  • This is not to say inbred line designs are not useful (remember: the

control of genetic background, etc.) but the best approach for analyzing these data is to test one marker at a time, i.e. just like in a GWAS!

  • Given that we can now easily produce many markers across a region,

we would get the same result as the ideal interval mapping result (!!)

  • Interval mapping (and its many variants) is therefore no longer (and should no longer be) used, but understanding this technique is important for interpreting the literature (!!)

slide-78
SLIDE 78

Topics that we don’t have time to cover (but 2019 lectures available!) - will briefly mention last lecture…

  • Alternative tests in GWAS (2019 Lecture 19)
  • Haplotype testing (2019 Lecture 19)
  • Multiple regression analysis / epistasis (2019 Lecture 21)
  • Multivariate regression analysis / eQTL (2019 Lecture 21)
  • Basics of linkage analysis (2019 Lecture 24)
  • Basics of inbred line analysis (2019 Lecture 25)
  • Basics of evolutionary quantitative genetics (2019 Lecture 25)
slide-79
SLIDE 79
  • Intro. to classic quantitative genetics I
  • The last concepts we will discuss are from the field of genetics

before we knew about DNA (!!) and therefore before genetic markers

  • A way of thinking about the field of genetics before genetic markers: geneticists used the observed similarity between relatives to determine how much they could explain about the underlying genetics (they could infer quite a bit!)

  • These inferences were used to model the patterns of phenotypes

they observed in populations, how phenotypes evolved (=how the mean of a phenotype in a population changed over time), to guide plant and animal breeding to produce desired changes in phenotypes, etc.

  • The history goes back > 100 years where many of the concepts are

important and continue to re-appear in quantitative genomics

slide-80
SLIDE 80
  • Intro. to classic quantitative genetics II
  • We can understand the major concepts in classic quantitative genetics using our glm framework (!!)

  • We will focus on phenotypes with normal error (= linear regression)

but the concepts generalize

  • The most important concept for understanding classic quantitative

genetics is understanding narrow sense heritability (often just referred to as heritability), which is a property of a phenotype we measure:

  • Note that this is a fraction with additive genetic variance (VA) in the

numerator and phenotypic variance (VP) in the denominator

  • The strange notation comes from a derivation by Sewall Wright (there

are several derivations of heritability!) using path analysis, a type of probabilistic graphical model called a structural equation model

h² = VA / VP

slide-81
SLIDE 81
  • RA Fisher used it to resolve the Mendelian versus Biometry argument that

had gone on for ~30 years (with one paper!!) showing that a single genetic model could explain both patterns of inheritance

  • RA Fisher also used heritability to demonstrate why Darwin’s evolution by

natural selection was not only possible but occurred under extremely plausible conditions (“Fisher’s fundamental theorem”):

  • More generally for evolution, heritability determines whether a phenotype

changes under selection or genetic drift:

  • We can use parts of heritability (additive genetic variance) to predict the

relative offspring phenotype values from breeding two individuals (= breeding values)

  • One of the most robust observations in biology: all reasonable phenotypes

have non-zero heritability (!!), implying at least one causal polymorphism affects every phenotype (what else does it imply!?)

Why heritability is important

∆w̄ = h²w VP   (Fisher’s fundamental theorem)

∆Ȳ = h² s   (response to selection)

VP,t+1 = h²t VP,t / Ne   (genetic drift)

slide-82
SLIDE 82

The components of heritability

  • Recall that heritability is a fraction of two terms:
  • The denominator is the total variance for the phenotype (VP), which we

can calculate for the entire population as follows (or estimate using a sample):

  • The numerator is the additive genetic variance (VA) in the phenotype, which can be calculated for any phenotype (regardless of the complexity of the genetics!)

  • However, this is easiest to understand when assuming there is a single causal polymorphism for the phenotype

  • In this case, VA is the following, where the parameter α is from our linear regression fitting only the “additive” term (not the dominance term!!):

h² = VA / VP

VP = (1/n) ∑_{i=1}^{n} (Yi − Ȳ)²

VA = 2 MAF (1 − MAF) α² = 2p(1 − p)α²
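A quick numerical sketch of this formula for a single causal locus (the MAF, effect size α, and error variance below are assumed values, not from real data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, alpha = 100_000, 0.3, 0.5      # MAF and additive effect (assumed values)

g = rng.binomial(2, p, n)            # genotypes at the single causal locus
x_alpha = g - 1.0                    # X_alpha coding: -1, 0, 1
y = 2.0 + alpha * x_alpha + rng.normal(0.0, 1.0, n)

VA = 2 * p * (1 - p) * alpha**2      # additive genetic variance: 2p(1-p)alpha^2
VP = y.var()                         # total phenotypic variance (estimated)
h2 = VA / VP
print(h2)                            # close to 0.105 / 1.105 ~ 0.095
```

Note that VP estimated from the sample is close to VA plus the error variance, so the ratio recovers the theoretical heritability.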

slide-83
SLIDE 83

Additive genetic variance I

  • Recall that in our original regression (for a single causal polymorphism, assuming we are fitting this model for the actual causal polymorphism, not a marker in LD!), we had two dummy variables and two parameters:

  • For additive genetic variance, we will only define one dummy variable

(even if there is dominance in the system!):

  • Given this model, it should be clear that the effects of dominance end up

in the error term (!!) just as for the case with un-modeled covariates

  • We can then derive the additive genetic variance as follows:

Xa(A1A1) = −1, Xa(A1A2) = 0, Xa(A2A2) = 1
Xd(A1A1) = −1, Xd(A1A2) = 1, Xd(A2A2) = −1

Y = βμ + Xa βa + Xd βd + ε

Xα(A1A1) = −1, Xα(A1A2) = 0, Xα(A2A2) = 1

Y = βμ + Xα α + ε

VA = 2p(1 − p)α²

slide-84
SLIDE 84

Parameter relationships (!!): genetic regression model and VA

  • Another classic parameterization of genetic effects is the following
  • We can convert these to our regression parameters by solving the

following equations and making appropriate substitutions:

  • Note one last important relationship:

GA1A1 = 0, GA1A2 = a + d, GA2A2 = 2a

0 = βμ − βa − βd,   a + d = βμ + βd,   2a = βμ + βa − βd

α = a + d(p1 − p2)

VA = 2p(1 − p)α² = 2p(1 − p)(a + d(p1 − p2))²
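The conversion between the two parameterizations can be checked numerically; solving the three conversion equations gives βμ = a + d/2, βa = a, βd = d/2 (the a and d values below are purely illustrative):

```python
import numpy as np

a, d = 1.2, 0.4                      # illustrative genotypic-effect values

# Solve the three conversion equations for (beta_mu, beta_a, beta_d)
A = np.array([[1.0, -1.0, -1.0],     # beta_mu - beta_a - beta_d = 0
              [1.0,  0.0,  1.0],     # beta_mu + beta_d          = a + d
              [1.0,  1.0, -1.0]])    # beta_mu + beta_a - beta_d = 2a
b = np.array([0.0, a + d, 2 * a])
beta_mu, beta_a, beta_d = np.linalg.solve(A, b)
print(beta_mu, beta_a, beta_d)       # expect (a + d/2, a, d/2) = (1.4, 1.2, 0.2)
```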

slide-85
SLIDE 85

Additive genetic variance II

  • There is a consequence of whether we fit two or one “slope” parameters in our regression model

  • If we consider two slope parameters (a, d) (as we have done all semester!) the true values of the parameters are the same regardless of the allele frequency (MAF) of the causal polymorphism

  • If we consider one regression parameter (α) the true value of this parameter depends on the allele frequency (MAF) of the causal polymorphism

  • The latter means that the true parameter value will change with changes in allele frequencies (!!)

  • Stated another way, if we were to estimate this additive genetic regression parameter, there would be a different correct answer depending on the allele frequency in the population (!!)

slide-86
SLIDE 86
  • Yes! It’s an important concept for thinking about evolution, the

structure of variation in populations, etc.

  • It is often important for determining our chances of using a GWAS to map the locations of causal polymorphisms (why is this?)

  • We often use marginal heritabilities, i.e. the heritability due to a

single marker to provide a quantification of effects (note that we use different concepts such as relative risks and related concepts when dealing with case / control data):

  • In short, heritability is an important concept, but now you have the

tools to understand heritability in terms of regressions (!!) and this will provide a framework for understanding related concepts

Do we still use heritability in quantitative genomics?

h²m = 2pi(1 − pi)α²i / VP

slide-87
SLIDE 87

Conceptual Overview

Genetic System (Does A1 -> A2 affect Y?) → Sample or experimental pop → Measured individuals (genotype, phenotype) → Pr(Y|X) (Regression model) → Model params (F-test) → Reject / DNR

slide-88
SLIDE 88

Conceptual Overview

System Question → Experiment → Sample → Probability Model → Inference (Estimator, Distribution, Hypothesis Test)

slide-89
SLIDE 89

Pep Talk: keep learning (!!)

  • How to learn stats, math, coding, etc. (my suggestion):
  • Figure out what you are passionate about (it will involve quantitative aspects!) and

build your understanding by hooking it into your passion

  • If you spend more than a few hours / day trying to understand something and can’t, it means you are missing a critical component that you have not learned = put it down and come back to it at a later date (you’ll be surprised how you’ll learn something later that suddenly makes it clear…)

  • Don’t memorize theorems, constantly study, etc. = know what you do understand,

keep adding to this, and learn it over time by hooking it into your passion

  • Don’t be intimidated by others or yourself
  • Anyone trying to make you feel bad because they “know math” and you don’t is

confused (knowing math someone else has developed does not mean you’re smart…)

  • I’m too old to learn this, I don’t understand what people are saying / the material in

the class so I’m not smart enough to learn it, etc. = NOT TRUE

  • You can stop learning this for extended periods and lose faith in yourself for years - you can always come back and keep learning and you WILL LEARN IT (trust me)

slide-90
SLIDE 90

That’s it (!!)

  • Good luck on the final!