Lecture 26: Introduction to Bayesian MCMC and wrap-up (last class!)
Jason Mezey jgm45@cornell.edu May 12, 2020 (T) 8:40-9:55
Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01
Announcements
Reminder: Project due 11:59PM
material you may access BUT ONCE THE EXAM STARTS YOU MAY NOT ASK ANYONE ABOUT ANYTHING THAT COULD RELATE TO THE EXAM (!!!!)
20 (Weds.)
if you are well prepared)
by (briefly) introducing MCMC algorithms
charting your future learning
have a probability distribution associated with them that reflects our belief in the values that might be the true value of the parameter
joint distribution of the parameter AND a sample Y produced under a probability model:
certain value given a sample:
sample) we can rewrite this as follows:
Pr(θ ∩ Y)
Pr(θ|y)
$$\Pr(\theta \mid y) = \frac{\Pr(y \mid \theta)\Pr(\theta)}{\Pr(y)}$$
Pr(θ|y) ∝ Pr(y|θ)Pr(θ)
likelihood (!!):
values the true parameter value may take
where we make one assumption): 1. the probability distribution that generated the sample, 2. the probability distribution of the parameter
Pr(θ|y) ∝ Pr(y|θ)Pr(θ)
Pr(θ|y) is the posterior probability, Pr(θ) is the prior, and Pr(y|θ) = L(θ|y) is the likelihood
framework in both estimation and hypothesis testing
construct estimators using the posterior probability distribution, for example:
likelihood (Frequentist) framework since estimator construction is fundamentally different (!!)
$$\hat{\theta} = \mathrm{mean}(\theta \mid y) = \int \theta \Pr(\theta \mid y)\, d\theta$$
$$\hat{\theta} = \mathrm{median}(\theta \mid y)$$
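As an illustration of these two estimators, here is a minimal sketch that approximates the posterior on a grid. The data (k successes in n Bernoulli trials), the uniform prior, and the grid resolution are all assumptions chosen for the example, not part of the lecture.

```python
import numpy as np

# Hypothetical data: k successes in n Bernoulli trials (illustrative values).
n, k = 20, 6

# Grid of candidate parameter values; endpoints excluded to avoid log(0).
theta = np.linspace(0.0005, 0.9995, 2000)
d_theta = theta[1] - theta[0]

# Unnormalized log posterior: log likelihood plus a constant (uniform) prior.
log_post = k * np.log(theta) + (n - k) * np.log(1.0 - theta)
post = np.exp(log_post - log_post.max())
post /= post.sum() * d_theta      # normalize so the density integrates to 1

# Posterior mean: numerical version of the integral of theta * Pr(theta | y).
post_mean = np.sum(theta * post) * d_theta

# Posterior median: first grid point where the cumulative probability reaches 0.5.
cdf = np.cumsum(post) * d_theta
post_median = theta[np.searchsorted(cdf, 0.5)]
```

With a uniform prior the exact posterior here is Beta(k + 1, n - k + 1), so the grid mean can be checked against (k + 1) / (n + 2).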
hypothesis framework:
frequentist framework, where we use a Bayes factor to indicate the relative support for one hypothesis versus the other:
difficult to assign priors for hypotheses that have completely different ranges of support (e.g. the null is a point and alternative is a range of values)
hypothesis testing that makes use of credible intervals (which is what we will use in this course)
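$$H_0: \theta \in \Theta_0 \qquad H_A: \theta \in \Theta_A$$
$$\mathrm{Bayes} = \frac{\int_{\theta \in \Theta_0} \Pr(y \mid \theta)\Pr(\theta)\, d\theta}{\int_{\theta \in \Theta_A} \Pr(y \mid \theta)\Pr(\theta)\, d\theta}$$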
at some level (say 0.95), which is an interval that would include the value of the parameter 0.95 of the time if we performed the experiment an infinite number of times, calculating the confidence interval each time (note: a strange definition...)
completely different interpretation: this interval has a given probability of including the parameter value (!!)
if this interval includes the value of the parameter under the null hypothesis (!!)
$$c.i.(\theta) = \int_{-c_\alpha}^{c_\alpha} \Pr(\theta \mid y)\, d\theta = 1 - \alpha$$
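A credible interval can also be read directly off a sample from the posterior. The sketch below is one way to do this, assuming a Beta posterior (from hypothetical Bernoulli data with a uniform prior) and an equal-tailed interval; the function name reject_null is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: k successes in n Bernoulli trials with a uniform prior,
# so the posterior is Beta(k + 1, n - k + 1) (a standard conjugate result).
n, k = 20, 6
draws = rng.beta(k + 1, n - k + 1, size=50_000)

# Equal-tailed 1 - alpha credible interval from the posterior sample.
alpha = 0.05
lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2])

def reject_null(theta_null, lower, upper):
    """Reject H0 when the null parameter value falls outside the interval."""
    return not (lower <= theta_null <= upper)
```

The same quantile step applies to a sample produced by an MCMC once the burn-in has been discarded.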
(note that we will focus on the linear regression model but we can perform Bayesian inference for any GLM!):
two parameters
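$$Y = \beta_\mu + X_a\beta_a + X_d\beta_d + \epsilon, \quad \epsilon \sim N(0, \sigma^2_\epsilon)$$
$$\mathbf{y} = \mathbf{x}\beta + \epsilon, \quad \epsilon \sim multiN(0, \mathbf{I}\sigma^2_\epsilon)$$
poses of mapping, we ar s
$$H_0: \beta_a = 0 \cap \beta_d = 0$$
$$H_A: \beta_a \neq 0 \cup \beta_d \neq 0$$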
distribution for the prior
normal (!!):
$$\Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon) = \Pr(\beta_\mu)\Pr(\beta_a)\Pr(\beta_d)\Pr(\sigma^2_\epsilon)$$
$$\Pr(\beta_\mu) = \Pr(\beta_a) = \Pr(\beta_d) = c, \quad \Pr(\sigma^2_\epsilon) = c$$
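$$\Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon \mid \mathbf{y}) \propto \Pr(\mathbf{y} \mid \beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon)$$
$$\Pr(\theta \mid \mathbf{y}) \propto (\sigma^2_\epsilon)^{-\frac{n}{2}} e^{-\frac{(\mathbf{y} - \mathbf{x}\beta)^T(\mathbf{y} - \mathbf{x}\beta)}{2\sigma^2_\epsilon}}$$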
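To make the posterior under constant priors concrete, here is a small sketch that evaluates its logarithm (up to an additive constant) for simulated data. The simulated genotypes, the effect sizes, and the function name log_posterior are assumptions made for illustration.

```python
import numpy as np

def log_posterior(beta, sigma2, y, X):
    """Log of the unnormalized posterior when all priors are constant."""
    n = y.shape[0]
    resid = y - X @ beta
    return -0.5 * n * np.log(sigma2) - (resid @ resid) / (2.0 * sigma2)

# Simulated (hypothetical) data using the X_a and X_d codings of the lecture.
rng = np.random.default_rng(1)
n = 200
genotype = rng.integers(0, 3, size=n)        # 0 = A1A1, 1 = A1A2, 2 = A2A2
Xa = genotype - 1.0                          # -1, 0, 1
Xd = np.where(genotype == 1, 1.0, -1.0)      # -1, 1, -1
X = np.column_stack([np.ones(n), Xa, Xd])
true_beta = np.array([1.0, 1.0, 0.5])
y = X @ true_beta + rng.normal(0.0, 1.0, size=n)

# The posterior should favor the generating values over the null values.
lp_true = log_posterior(true_beta, 1.0, y, X)
lp_null = log_posterior(np.array([1.0, 0.0, 0.0]), 1.0, y, X)
```

Working on the log scale avoids numerical underflow when n is large.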
interested in is:
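$$\mathbf{y} = \mathbf{x}\beta + \epsilon, \quad \epsilon \sim multiN(0, \mathbf{I}\sigma^2_\epsilon)$$
$$\Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon \mid \mathbf{y}) \propto \Pr(\mathbf{y} \mid \beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon)\Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon)$$
$$\Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon \mid \mathbf{y}) \propto \Pr(\mathbf{y} \mid \beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon)$$
$$\Pr(\beta_a, \beta_d \mid \mathbf{y}) = \int_{0}^{\infty}\int_{-\infty}^{\infty} \Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon \mid \mathbf{y})\, d\beta_\mu\, d\sigma^2_\epsilon$$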
interval for our genetic null hypothesis and test a marker for a phenotype association and we can perform a GWAS by doing this for each marker (!!)
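$$\Pr(\beta_a, \beta_d \mid \mathbf{y}) = \int_{0}^{\infty}\int_{-\infty}^{\infty} \Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon \mid \mathbf{y})\, d\beta_\mu\, d\sigma^2_\epsilon \sim \text{multi-t-distribution}$$
$$\mathrm{mean}(\Pr(\beta_a, \beta_d \mid \mathbf{y})) = [\hat{\beta}_a, \hat{\beta}_d]^T = C^{-1}[X_a, X_d]^T\mathbf{y}$$
$$\mathrm{cov} = \frac{(\mathbf{y} - [X_a, X_d][\hat{\beta}_a, \hat{\beta}_d]^T)^T(\mathbf{y} - [X_a, X_d][\hat{\beta}_a, \hat{\beta}_d]^T)}{n - 6}\, C^{-1}$$
$$C = \begin{bmatrix} X_a^T X_a & X_a^T X_d \\ X_d^T X_a & X_d^T X_d \end{bmatrix}$$
$$d.f.(\text{multi-t}) = n - 4$$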
[Figure: two panels of the marginal posterior Pr(βa, βd|y) plotted over βa and βd, each with a 0.95 credible interval; one panel is labeled "Cannot reject H0!" and the other "Reject H0!"]
simple closed form of the overall posterior
to put together more complex priors with our likelihood or consider a more complicated likelihood equation (e.g. for a logistic regression!)
still need to determine the credible interval from the posterior (or marginal) probability distribution so we need to determine the form
Markov chain Monte Carlo (MCMC) algorithm for this purpose
to consider models from another branch of probability (remember, probability is a field much larger than the components that we use for statistics / inference!): Stochastic processes
vectors (variables) with defined conditional relationships, often indexed by an ordered set t
this probability sub-field: Markov processes (or more specifically Markov chains)
accurately, a set of random vectors), which we will index with t:
property:
chain are in the same class of random variables (e.g. Bernoulli, normal, etc.) we allow the parameters of these random variables to be different, e.g. at time t and t+1
$$X_t, X_{t+1}, X_{t+2}, \ldots, X_{t+k}$$
$$X_t, X_{t-1}, X_{t-2}, \ldots, X_{t-k}$$
$$\Pr(X_t \mid X_{t-1}, X_{t-2}, \ldots, X_{t-k}) = \Pr(X_t \mid X_{t-1})$$
variable in the chain has a Bernoulli distribution:
(since it is just a random vector with a probability distribution!):
distributions are the same (=they do not change over time)
such stationary distributions
1,0,...,1,1 0,1,...,1,1 0,0,...,0,0 0,1,...,0,0
$$X_1, X_2, \ldots, X_{1001}, X_{1002}$$
$$X_1 \sim Bern(0.2),\ X_2 \sim Bern(0.45),\ \ldots,\ X_{1001} \sim Bern(0.4),\ X_{1002} \sim Bern(0.4)$$
0.21
prove that the chain will evolve (more accurately converge!) to a unique (!!) stationary distribution and will not leave this stationary distribution (where it is often possible to determine the parameters for the stationary distribution!)
may be very large, e.g. infinite), we will reach a point where each following random variable is in the unique stationary distribution:
chain that evolves to a unique stationary distribution that is exactly the posterior probability distribution that we are interested in (!!!)
reach this stationary distribution and then we will take a sample from this chain to determine (or more accurately approximate) our posterior
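$$\Pr(X_{t+k}) = \Pr(X_{t+k+1}) = \ldots$$
$$MCMC = X_{t+k}, X_{t+k+1}, X_{t+k+2}, \ldots, X_{t+k+m}$$
$$Sample = 0.1, -0.08, -1.4, \ldots, 0.5$$
$$\Pr(\beta_\mu \mid \mathbf{y})$$
$$\hat{\theta} = \mathrm{median}(\Pr(\theta \mid \mathbf{y})) \approx \mathrm{median}(\theta^{[t_{ab}]}, \ldots, \theta^{[t_{ab}+k]})$$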
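Convergence to a stationary distribution can be seen in a toy example. The sketch below assumes a two-state (Bernoulli) chain with a made-up transition matrix P, chosen so that the chain starts at Bern(0.2) and settles at Bern(0.4); it tracks the marginal distribution of each variable rather than sampling.

```python
import numpy as np

# Hypothetical two-state chain: row i gives Pr(X_{t+1} = j | X_t = i).
P = np.array([[0.70, 0.30],
              [0.45, 0.55]])

# Start from X_1 ~ Bern(0.2): Pr(X_1 = 0) = 0.8, Pr(X_1 = 1) = 0.2.
dist = np.array([0.8, 0.2])
history = [dist[1]]
for _ in range(50):
    dist = dist @ P              # marginal distribution of the next variable
    history.append(dist[1])      # Bernoulli parameter of that variable

# For a two-state chain the stationary Pr(X = 1) is P[0, 1] / (P[0, 1] + P[1, 0]).
stationary_p = P[0, 1] / (P[0, 1] + P[1, 0])
```

After a few dozen steps every entry of history is numerically equal to stationary_p, which is the sense in which the chain does not leave its stationary distribution.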
t = 0 or any subsequent iteration.
Pr(θ[t]|y)J(θ∗|θ[t]).
θ[t]) = 1 − min(r, 1).
the ‘burn-in’ phase, where the realizations of θ[t] start to behave as though they are sampled from the stationary distribution of the Metropolis-Hastings Markov chain (we will discuss how many iterations are necessary for a burn-in below).
the posterior distribution and perform Bayesian inference.
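The Metropolis-Hastings steps can be sketched as follows. This is a minimal illustration, assuming a symmetric normal jumping distribution J (so the J terms cancel in the ratio r), a normal likelihood with known variance, and a constant prior; the step size, chain length, and burn-in length are arbitrary choices for the example.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_iter, step, rng):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal J."""
    chain = np.empty(n_iter)
    theta = theta0
    for t in range(n_iter):
        proposal = theta + rng.normal(0.0, step)     # theta* from J(theta* | theta[t])
        # With a symmetric J the proposal terms cancel in the ratio r.
        log_r = log_post(proposal) - log_post(theta)
        if np.log(rng.uniform()) < min(log_r, 0.0):  # accept with probability min(r, 1)
            theta = proposal
        chain[t] = theta                             # on rejection theta[t+1] = theta[t]
    return chain

rng = np.random.default_rng(2)
y = rng.normal(1.5, 1.0, size=100)   # hypothetical sample with known variance 1

def log_post(mu):
    # Normal log likelihood (up to a constant) with a constant prior on mu.
    return -0.5 * np.sum((y - mu) ** 2)

chain = metropolis_hastings(log_post, theta0=0.0, n_iter=20_000, step=0.3, rng=rng)
burn_in = 2_000                      # discard the burn-in phase
posterior_sample = chain[burn_in:]
estimate = np.median(posterior_sample)
```

For this model the posterior of the mean is centered on the sample mean, which gives a simple check on the retained sample.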
likelihood) and our prior (which we provide!), and our goal is then to construct an MCMC with a stationary distribution (which we will sample to get the posterior “histogram”):
distribution
the Gibbs sampler (requires no rejections!), which samples each parameter from the conditional posterior distributions (which requires you derive these relationships = not always possible!)
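$$\Pr(\beta_\mu \mid \beta_a, \beta_d, \sigma^2_\epsilon, \mathbf{y})$$
$$\Pr(\beta_a \mid \beta_\mu, \beta_d, \sigma^2_\epsilon, \mathbf{y})$$
$$\Pr(\beta_d \mid \beta_\mu, \beta_a, \sigma^2_\epsilon, \mathbf{y})$$
$$\Pr(\sigma^2_\epsilon \mid \beta_\mu, \beta_a, \beta_d, \mathbf{y})$$
$$\theta^{[t]} = [\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon]^{[t]}$$
$$\theta^{[t+1]} = [\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon]^{[t+1]}$$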
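A Gibbs sampler for the regression model might look like the sketch below. It assumes constant priors on every parameter, in which case each β has a normal conditional posterior and the error variance has an inverse-gamma conditional posterior; these conditional forms are standard results supplied for the example rather than derived in the lecture, and the simulated data are hypothetical.

```python
import numpy as np

def gibbs_regression(y, X, n_iter, rng):
    """Gibbs sampler for y = X beta + e with constant priors on beta and sigma2."""
    n, p = X.shape
    beta = np.zeros(p)
    sigma2 = 1.0
    xtx_diag = np.sum(X ** 2, axis=0)
    samples = np.empty((n_iter, p + 1))
    for t in range(n_iter):
        for j in range(p):
            # Residual with the contribution of column j added back in.
            partial = y - X @ beta + X[:, j] * beta[j]
            mean_j = X[:, j] @ partial / xtx_diag[j]
            # beta_j | rest, y is normal with variance sigma2 / (x_j' x_j).
            beta[j] = rng.normal(mean_j, np.sqrt(sigma2 / xtx_diag[j]))
        resid = y - X @ beta
        # sigma2 | beta, y is inverse-gamma(n/2 - 1, SSE/2) when Pr(sigma2) = c.
        sigma2 = (resid @ resid / 2.0) / rng.gamma(n / 2.0 - 1.0)
        samples[t, :p] = beta
        samples[t, p] = sigma2
    return samples

# Simulated (hypothetical) genotypes and phenotypes.
rng = np.random.default_rng(3)
n = 200
genotype = rng.integers(0, 3, size=n)
Xa = genotype - 1.0
Xd = np.where(genotype == 1, 1.0, -1.0)
X = np.column_stack([np.ones(n), Xa, Xd])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(0.0, 1.0, size=n)

samples = gibbs_regression(y, X, n_iter=3000, rng=rng)
kept = samples[500:]                       # drop the burn-in iterations
posterior_mean = kept.mean(axis=0)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Every draw is accepted, so no rejection step appears; under constant priors the posterior means of the β parameters should sit close to the least-squares estimates.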
practical
Bayesian data analysis is when computers increased in speed
MCMC approaches in genetic analysis has steadily increased
algorithms can be inefficient (they take a long time to converge, they do not sample modes of a complex posterior efficiently, etc.)
genetic inference, e.g. variational Bayes
GWAS covers a large number of possible relationships between genotypes and phenotypes, there are a large number
based tests, Kruskal-Wallis tests, other rank based tests, chi-square, Fisher’s exact, Cochran-Armitage, etc. (see PLINK for a somewhat comprehensive list of tests used in GWAS)
GWAS
review of possible tests (keep in mind, every time you learn a new test in a statistics class, there is a good chance you could apply it in a GWAS!)
applied
perform N alternative tests for each marker-phenotype combination, where for each case, we are testing the following hypotheses with different (implicit) codings of X (!!):
$$H_0: Cov(Y, X) = 0$$
$$H_A: Cov(Y, X) \neq 0$$
test (which has deep connections to our logistic regression test under certain assumptions but it has slightly different properties!)
phenotype combinations (left) and calculate the expected numbers in each cell (right):
in this case is χ² with d.f. = 2 as the sample size tends to infinity, where d.f. = (#columns-1)(#rows-1) = 2; we can therefore calculate the statistic:
$$LRT = -2\ln\Lambda = -2\sum_{i=1}^{3}\sum_{j=1}^{2} n_{ij}\ln\left(\frac{n_{i.}\,n_{.j}}{n\,n_{ij}}\right)$$
Observed:
          Case          Control       Total
A1A1      n11           n12           n1.
A1A2      n21           n22           n2.
A2A2      n31           n32           n3.
Total     n.1           n.2           n

Expected:
          Case          Control       Total
A1A1      (n.1n1.)/n    (n.2n1.)/n    n1.
A1A2      (n.1n2.)/n    (n.2n2.)/n    n2.
A2A2      (n.1n3.)/n    (n.2n3.)/n    n3.
Total     n.1           n.2           n
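The observed and expected tables and the likelihood ratio statistic can be computed as in this sketch. The counts are made up, the function name contingency_lrt is hypothetical, and the p-value uses the fact that the chi-square survival function with d.f. = 2 is exactly exp(-x/2).

```python
import math
import numpy as np

def contingency_lrt(counts):
    """LRT statistic, p-value and expected counts for a 3 x 2 genotype table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    # Expected count in cell (i, j) is (row total i) * (column total j) / n.
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
    nonzero = counts > 0                 # empty cells contribute 0 to the sum
    lrt = -2.0 * np.sum(counts[nonzero] * np.log(expected[nonzero] / counts[nonzero]))
    # For d.f. = 2 the chi-square survival function is exactly exp(-x / 2).
    p_value = math.exp(-lrt / 2.0)
    return lrt, p_value, expected

observed = [[30, 10],    # A1A1: cases, controls (made-up counts)
            [40, 40],    # A1A2
            [10, 30]]    # A2A2
lrt, p_value, expected = contingency_lrt(observed)
```

Because the reference distribution is asymptotic, tables with very small expected counts are usually handled with an exact test instead.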
Yes we should - the reason is different tests have different performance depending on the (unknown) conditions of the system and experiment, i.e. some may perform better than others
will be best suited) we should run a number of tests and compare results
conditional rules) but a reasonable approach is to treat each test as a distinct GWAS analysis and compare the hits across analyses using the following rules:
good evidence that there is a causal polymorphism
are not) we should attempt to determine the reason for the discrepancy (this requires that we understand the tests and experience)
following (now what!?):
correction regardless of the testing approach used, i.e. the result is robust to testing approach.
understand why test results could be different:
true? Are they handling missing data in different ways?
this be?
handle additional phenotypes
genotypes defined using a different approach
testing framework
information but in some cases, testing using haplotypes is more effective than testing one genetic marker at a time
each other but low LD with other markers, we use this as a guide for defining the haplotype block
[Figure: association results across a region (positions 211.5 to 212.0) containing the genes RCOR3, TRAF5, C1orf97, RD3, SLC30A1, NEK2 and LPGAT1; legend: Single marker test, Conditional test, VBAY, Lasso, Adaptive Lasso, 2D-MCP, LOG, NEG; markers flagged as Independent hit or Non-independent hit]
are inherited together
“function” that takes a set of alleles at several loci A, B, C, D, etc. and outputs a haplotype allele:
following alleles (A,G), (A,T),(G,C),(G,C) we could define the following haplotype alleles:
$$h = f(A_i, B_j, \ldots)$$
$$h_1 = (A, A, C, C)$$
$$h_2 = (G, T, G, G)$$
haplotypes for given systems and the algorithms used for this purpose (so we will just briefly mention the main concepts here)
genotype markers, decide on the number of genotype markers to put together into a haplotype block, and decide how many haplotype alleles to consider
(system dependent!)
proceed with a GWAS using our framework (just substitute haplotype alleles and genotypes for genetic marker alleles and genotypes!)
haplotype alleles, we can code our independent variables for our regression model as follows:
effect on our interpretation of where the causal polymorphism is located?)
$$X_a(h_1h_1) = -1,\ X_a(h_1h_2) = 0,\ X_a(h_2h_2) = 1$$
$$X_d(h_1h_1) = -1,\ X_d(h_1h_2) = 1,\ X_d(h_2h_2) = -1$$
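The haplotype coding can be written as a small function. This sketch assumes exactly two haplotype alleles, represented here as tuples of marker alleles; the names H1, H2 and code_haplotype_genotype are hypothetical.

```python
# Hypothetical haplotype alleles for a four-marker block.
H1 = ("A", "A", "C", "C")
H2 = ("G", "T", "G", "G")

def code_haplotype_genotype(hap_pair):
    """Return (X_a, X_d) for a diploid genotype given as two haplotype alleles."""
    if len(hap_pair) != 2 or any(h not in (H1, H2) for h in hap_pair):
        raise ValueError("expected a pair of known haplotype alleles")
    n_h2 = sum(1 for h in hap_pair if h == H2)   # copies of h2: 0, 1 or 2
    x_a = n_h2 - 1                               # -1, 0, 1
    x_d = 1 if n_h2 == 1 else -1                 # only the heterozygote gets 1
    return x_a, x_d
```

For example, code_haplotype_genotype((H1, H2)) gives (0, 1), after which the regression proceeds exactly as for a single marker.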
haplotype is a better “tag” of the causal polymorphism than any of the surrounding markers
therefore has a higher probability of correctly rejecting the null hypothesis
are performing fewer total tests in our GWAS (in what sense is this an advantage!?)
it also may not (!!), i.e. sometimes (in fact often!) individual genetic markers are better tags of the causal polymorphism
cannot resolve the position of the causal polymorphism to a position smaller than the range of the haplotype alleles, i.e. large haplotypes can have lower resolution
data, should we use haplotype testing (i.e. in the future, the importance of haplotype testing may decrease)
(always!) as well as a haplotype test (optional)
GWAS (as with any statistical analysis!) so it doesn’t hurt us to explore our dataset with as many techniques as we want to apply
analyzing GWAS with as many methods as you find useful
if we get conflicting results, which one do we interpret as “correct”!?
statistical models between one genetic marker and the phenotype
and the one that you should apply as a first step when analyzing GWAS data (always!)
each of the statistical models we consider
interactions among genetic markers (or more specifically, between the causal polymorphisms that they are tagging)
regression model, we can plot the phenotypes associated with each of the nine classes:
we have not seen before:
alters the effect of substituting an allele at another locus B1->B2
value) for a genotype at one locus conditional on the state of a locus at another marker
substitution (from one genotype to another) for one locus depends on the genotype at the other locus, then at least one allelic substitution of the other locus will be dependent as well
loci BOTH will be causal polymorphisms for the phenotype (!!!)
know this information
relationships can exist among three (three-way), four (four-way), etc.
question it is ubiquitous!!)
and says nothing about mechanism (although people have misappropriated the term in this way)
dependence of genetic effects at a locus on the state of another locus - this is just epistasis (!!)
the entire “genetic background” (i.e. all the state in the rest of the genome!) - this is also epistasis (!!)
framework (!!)
considered so far perfectly models any case where there is no epistasis
constructing additional dummy variables and adding additional parameters (so that we have 9 total in our GLM)
dummy variables in our GLM (and add additional parameters) to account for epistasis
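$$Y = \gamma^{-1}(\beta_\mu + X_{a,1}\beta_{a,1} + X_{d,1}\beta_{d,1} + X_{a,2}\beta_{a,2} + X_{d,2}\beta_{d,2} + X_{a,1}X_{a,2}\beta_{a,a} + X_{a,1}X_{d,2}\beta_{a,d} + X_{d,1}X_{a,2}\beta_{d,a} + X_{d,1}X_{d,2}\beta_{d,d})$$
$$X_{a,1} = \begin{cases} -1 & \text{for } A_1A_1 \\ 0 & \text{for } A_1A_2 \\ 1 & \text{for } A_2A_2 \end{cases} \qquad X_{d,1} = \begin{cases} -1 & \text{for } A_1A_1 \\ 1 & \text{for } A_1A_2 \\ -1 & \text{for } A_2A_2 \end{cases}$$
$$X_{a,2} = \begin{cases} -1 & \text{for } B_1B_1 \\ 0 & \text{for } B_1B_2 \\ 1 & \text{for } B_2B_2 \end{cases} \qquad X_{d,2} = \begin{cases} -1 & \text{for } B_1B_1 \\ 1 & \text{for } B_1B_2 \\ -1 & \text{for } B_2B_2 \end{cases}$$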
framework and statistical framework that we have been considering
are assuming are in LD with causal polymorphisms that could have an epistatic relationship (so we are indirectly inferring that there is epistasis from the marker genotypes)
the same approach as before (!!), i.e. for a linear model:
$$X = [1, X_{a,1}, X_{d,1}, X_{a,2}, X_{d,2}, X_{a,a}, X_{a,d}, X_{d,a}, X_{d,d}]$$
$$\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}$$
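For the linear case, the nine-column design matrix and the estimator can be sketched as below. The genotype simulation, the effect sizes, and the function name epistasis_design are assumptions made for illustration; genotypes are coded 0, 1, 2 before being mapped to the dummy variables.

```python
import numpy as np

def epistasis_design(g1, g2):
    """Nine-column design matrix from genotypes coded 0, 1, 2 at two loci."""
    g1 = np.asarray(g1)
    g2 = np.asarray(g2)
    xa1, xd1 = g1 - 1.0, np.where(g1 == 1, 1.0, -1.0)
    xa2, xd2 = g2 - 1.0, np.where(g2 == 1, 1.0, -1.0)
    # Interaction dummy variables are products of the single-locus codings.
    return np.column_stack([np.ones(len(g1)), xa1, xd1, xa2, xd2,
                            xa1 * xa2, xa1 * xd2, xd1 * xa2, xd1 * xd2])

# Simulated (hypothetical) data with an additive-by-additive interaction.
rng = np.random.default_rng(5)
n = 500
g1 = rng.integers(0, 3, size=n)
g2 = rng.integers(0, 3, size=n)
X = epistasis_design(g1, g2)
true_beta = np.array([1.0, 0.5, 0.0, 0.3, 0.0, 0.8, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(0.0, 0.5, size=n)

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

The estimator is only defined when all nine two-locus genotype classes appear in the sample, which can fail for rare alleles.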
same way as before (!!)
the parameters under the null and alternative model and substitute these into the likelihood equations that have the same form as before (with some additional dummy variables and parameters)
consider = number of parameters in the alternative model - the number of parameters in the null model
hypothesis that we have been considering for a single marker:
causal polymorphism:
an interaction effect in an ANOVA!):
case!?):
H0 : βa,1 = 0\βd,1 = 0\βa,2 = 0\βd,2 = 0\βa,a = 0\βa,d = 0\βd,a = 0\βd,d = 0 ( HA : βa,1 6= 0[βd,1 6= 0[βa,2 6= 0[βd,2 6= 0[βa,a 6= 0[βa,d 6= 0[βd,a 6= 0[βd,d 6= 0 (
H0 : βa,a = 0 \ βa,d = 0 \ βd,a = 0 \ βd,d = 0 HA : βa,a 6= 0 [ βa,d 6= 0 [ βd,a 6= 0 [ βd,d 6= 0
H0 : βa,1 = 0 \ βd,1 = 0 \ βa,2 = 0 \ βd,2 = 0 HA : βa,1 6= 0 [ βd,1 6= 0 [ βa,2 6= 0 [ βd,2 6= 0
H0 : βa,1 = 0 \ βd,1 = 0
HA : βa,1 6= 0 [ βd,1 6= 0
single phenotype and many genotypes, the latter collected by genomics technologies
phenotypes (e.g., genome-wide gene expression, proteomics, etc.)
genotypes and many phenotypes
i.e., the first step in these analyses is still testing pairs of variables at a time
expression or proteomic data for a tissue of a mouse experiment where there are only two conditions: “wild type" and “mutant”:
expression measurement) on the condition (e.g., coded 0 / 1)
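$$Data = \begin{bmatrix} z_{11} & \ldots & z_{1k} & y_{11} & \ldots & y_{1m} & x_{11} & \ldots & x_{1N} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\ z_{n1} & \ldots & z_{nk} & y_{n1} & \ldots & y_{nm} & x_{n1} & \ldots & x_{nN} \end{bmatrix}$$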
analysis: your QQ plots need not conform to the rules of GWAS QQ plots (please take note of this!!)
are considering the impact on many phenotypes, it is possible the treatment / genotype impacts many phenotypes (and therefore produces many significant tests!)
phenotype on many genotypes) the correct statistical model fitting cannot produce many highly significant tests while an analysis of many phenotypes on one genotype can produce many significant test results (and be the appropriate test result)
is many causal genotypes each contributing to variation in the one phenotype, such that if there are many, each of their effects is relatively small (!!)
(or genotype) may separately impact many of the phenotypes (!!)
multiple regression model (i.e., a single Y with many X’s):
single genotype) the correct model is a multivariate regression (i.e., many Y’s with a single X)
$$Data = \begin{bmatrix} z_{11} & \ldots & z_{1k} & y_{11} & \ldots & y_{1m} & x_{11} & \ldots & x_{1N} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\ z_{n1} & \ldots & z_{nk} & y_{n1} & \ldots & y_{nm} & x_{n1} & \ldots & x_{nN} \end{bmatrix}$$
variables is testing pairs of variables at a time (e.g., one phenotype - one genotype) could we construct statistical models that consider more genotypes or more phenotypes at the same time?
done multiple regressions already!)
Y’s and one treatment
aspects get more complicated and in practice you often get the same information as fitting one Y and X pair at a time
fitting problem, requiring other techniques (e.g., penalized or regularized regressions) and in practice you often get the same information as fitting one Y and X pair
designs (let’s take a quick look at this concept first)
more X’s to capture “interactions” (=epistasis)
experimental exchange of one allele for another produces a change in expression on average under specified conditions:
polymorphism of the eQTL
measured expression of ERAP2
[Figure: ERAP2 expression (axis range 3.5 to 6.0) plotted against rs27290 genotype (A/A, A/G, G/G)]
X A1 → A2 ⇒ ∆Y |Z
family pedigrees stretch back ~100 years, i.e. before genetic markers (!!)
Mendelian diseases (= phenotype determined by a single locus where genotype is highly predictive of phenotype) tend to run in families
analyzing a family pedigree
pedigree analysis was the main tool of medical genetics
these to identify positions in the genome that may have the causal polymorphisms responsible for the Mendelian disease
was the first step in identifying the causal polymorphisms for many Mendelian diseases, i.e. they could identify the general position in a chromosome, which could be investigated further with additional markers, etc.
polymorphisms were found using such techniques
literature (where now this field is wrapped into the more diffuse field of quantitative genomics!)
(disease) is consistent with a Mendelian disease given a pedigree (no genetic data!)
individuals (or more) share alleles because they inherited them from a common ancestor (note: such analyses can be performed without markers but more recently, markers have allowed finer ibd inference and ibd inference without a pedigree!)
position of causal polymorphisms affecting a phenotype (which may be Mendelian or complex)
pedigrees to map the position of causal polymorphisms (again Mendelian
the illustrate the last two
having high-coverage marker data makes many of the pedigree analyses unnecessary
few markers because we could use the pedigree to infer the states of unseen markers
polymorphisms without a pedigree (and we now do this all the time)
polymorphisms to complex phenotypes are turning out to have produced more (=not useful) inferences (!!)
understanding the literature in quantitative genetics and for derived pedigree methods that are still used
positions in the genome where there are causal polymorphisms using genetic markers
marker, but we would prefer to test the causal marker (if we could!)
polymorphism Xcp and observed genetic marker X, we could use this information:
$$\Pr(Y \mid X)$$
$$\Pr(Y \mid X_{cp})\Pr(X_{cp} \mid X)$$
are many ways to model penetrance!) and the second term is modeled based on the structure of an observed pedigree, which allows us to infer the conditional relationship of the causal polymorphism and observed genetic marker by inferring a recombination probability parameter r (confusingly, this is often symbolized as θ in the literature!):
but our models will be a little more complex and we will be inferring not only parameters that relate the genotype and phenotype (e.g. regression β’s) but also the parameter r (!!)
analyses), the causal polymorphism perfectly describes the phenotype so we do not need to be concerned with the penetrance model:
$$\Pr(Y \mid X_{cp})\Pr(X_{cp} \mid X, r_{(X_{cp},X)})$$
$$\Pr(X_{cp} \mid X, r_{(X_{cp},X)})$$
genotype involving both of these polymorphisms) so we may re-write this equation as the probability of a vector of a sample of n of these genotypes:
we can write out the genotypes of the n individuals in the sample
relating parents (father = gf, mother = gm) to their offspring (where individuals without parents in the pedigree are called founders):
could occur for these n individuals (=classic pedigree equation):
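$$\Pr(g_1, \ldots, g_n \mid r) = \prod_{i}^{f}\Pr(g_i)\prod_{j=f+1}^{n}\Pr(g_j \mid g_{j,f}, g_{j,m}, r)$$
$$\sum_{\Theta_g}\prod_{i}^{f}\Pr(g_i)\prod_{j=f+1}^{n}\Pr(g_j \mid g_{j,f}, g_{j,m}, r)$$
$$\Pr(X_{cp} \mid X, r) = \Pr(g \mid r)$$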
when you have a Mendelian phenotype and low marker coverage
better just to test each marker one at a time, since the additional model complexities in linkage analysis tend to reduce the efficacy of the inference
coverage is they have high LD (why?) so resolution is low
(particularly important if the disease is rare) and by considering individuals in a pedigree, this provides some control of genetic background (e.g. epistasis) and other issues!
individuals in the sample have a known relationship that is a consequence of controlled breeding
individuals have the same grandparents) or are known within a set of rules (e.g. the individuals were produced by brother-sister breeding for k generations)
(= a sample of individuals for which we have information
genetics (actually, both inbred lines and pedigrees have been important)
have been producing inbred lines throughout history and (more recently) for the explicit purposes of genetic analysis
historical role, leading to the identification of some of the first causal polymorphisms for complex (non-Mendelian!) phenotypes
plants we eat are inbred!) and in genetics
control the genetic background (e.g. epistasis!) and, once we know causal polymorphisms, we can integrate the section of genome containing the causal polymorphism through inbreeding designs or now through “exact” approaches like CRISPR (or TALEN) (!!)
was when we had access to many fewer genetic markers, inbreeding designs allowed “strong” inference for the markers in between
literature (e.g. the specialized mapping methods applied to these lines) we will consider several specialized designs and how we analyze them
= do a GWAS analysis one marker at a time (!!) (maybe use a mixed model to account for inbred line structure…)
the unobserved markers (with low error!) even with very few markers
the resulting lines (although they may be homozygous for different genotype!)
random sampling (=genetic drift) results in lines that are homozygous for one of the genotypes of the parents
back to one or both parents
to produce the mapping population
repeated backcrossing to one of the parent populations, followed by inbreeding
population are inbred
the major concepts in the literature
[Figure: inbreeding design diagram. Labels: Inbred line A (homozygous); Inbred line B (homozygous); F1 (cross these to each; F2; “introgressed” region]
designs but with many possible recombination events, so we could map to a smaller region with a pedigree analysis approach
we could also use a Bayesian approach!):
assumptions!?) to infer the state of unmeasured polymorphism “Q” that is in the proximity of markers we have measured:
model, where we will consider an example of one type of inbreeding design (F2) to show the structure of the second
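$$\Pr(Y \mid X_{cp=Q})\Pr(X_{cp} \mid X, r_{(X_{cp=Q},X)}) = \sum_{\Theta_g}\prod_{i}^{f}\Pr(y_i \mid g_i)\Pr(g_i)\prod_{j=f+1}^{n}\Pr(y_j \mid g_j)\Pr(g_j \mid g_{j,f}, g_{j,m}, r)$$
$$\Pr(Y \mid X_{cp=Q})\Pr(X_{cp} \mid X, r_{(X_{cp=Q},X)}) = \prod_{i}^{n}\sum_{\Theta_g}\Pr(y_i \mid g_{i,Q})\Pr(g_{i,Q} \mid g_{i,A}, g_{i,B}, r)$$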
main equation and calculate the likelihood over possible values of r
polymorphism for an alternative where there is a causal polymorphism in the marker defined region, where if we reject, we consider there to be a causal polymorphism in the region
base 10!), which is just LRT times a constant (!!)
the position within the interval by finding the position where a given value of r maximizes the likelihood, i.e. hence “interval mapping”
recombination map (another complex subject!)
$$\Pr(Y \mid X_{cp=Q})\Pr(X_{cp} \mid X, r_{(X_{cp=Q},X)}) = \prod_{i}^{n}\sum_{\Theta_g}\Pr(y_i \mid g_{i,Q})\Pr(g_{i,Q} \mid g_{i,A}, g_{i,B}, r)$$
polymorphisms for complex (non-Mendelian) phenotypes, in practice, interval mapping turns out to be not very useful
phenotypes) that fitting a complex model does not provide very exact inferences
control of genetic background, etc.) but the best approach for analyzing these data is to test one marker at a time, i.e. just like in a GWAS!
we would get the same result as the ideal interval mapping result (!!)
longer used but understanding this technique is important for interpreting the literature (!!)
before we knew about DNA (!!) and therefore before genetic markers
was geneticists used the observation of the similarity between relatives to determine how much they could explain about underlying genetics (they could infer quite a bit!)
they observed in populations, how phenotypes evolved (=how the mean of a phenotype in a population changed over time), to guide plant and animal breeding to produce desired changes in phenotypes, etc.
important and continue to re-appear in quantitative genomics
using our glm framework (!!)
but the concepts generalize
genetics is understanding narrow sense heritability (often just referred to as heritability), which is a property of a phenotype we measure:
numerator and phenotypic variance (VP) in the denominator
are several derivations of heritability!) using path analysis, a type of probabilistic graphical model called a structural equation model
$$h^2 = \frac{V_A}{V_P}$$
had gone on for ~30 years (with one paper!!) showing that a single genetic model could explain both patterns of inheritance
natural selection was not only possible but occurred under extremely plausible conditions (“Fisher’s fundamental theorem”):
changes under selection or genetic drift:
relative offspring phenotype values from breeding two individuals (= breeding values)
have non-zero heritability (!!), implying at least one causal polymorphism affects every phenotype (what else does it imply!?)
$$\Delta\bar{w} = h^2_w V_P$$
$$\Delta\bar{Y} = h^2 s$$
$$V_{\bar{P},t+1} = \frac{h^2_t V_{P,t}}{N_e}$$
can calculate for the entire population as follows (or estimate using a sample):
which can be calculated for any phenotype (regardless of the complexity
causal polymorphism for the phenotype
VA is the following where the parameter is from our linear regression term only fitting the “additive” term (not dominance term!!):
$$h^2 = \frac{V_A}{V_P}$$
$$V_P = \frac{1}{n}\sum_{i}^{n}(Y_i - \bar{Y})^2$$
$$V_A = 2MAF(1 - MAF)\beta_\alpha^2 = 2p(1 - p)\beta_\alpha^2$$
assume we are fitting this model for the actual causal polymorphism, not a marker in LD!), we had two dummy variables and two parameters:
(even if there is dominance in the system!):
in the error term (!!) just as for the case with un-modeled covariates
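$$X_a(A_1A_1) = -1,\ X_a(A_1A_2) = 0,\ X_a(A_2A_2) = 1$$
$$X_d(A_1A_1) = -1,\ X_d(A_1A_2) = 1,\ X_d(A_2A_2) = -1$$
$$Y = \beta_\mu + X_a\beta_a + X_d\beta_d + \epsilon$$
$$X_\alpha(A_1A_1) = -1,\ X_\alpha(A_1A_2) = 0,\ X_\alpha(A_2A_2) = 1$$
$$Y = \beta_\mu + X_\alpha\beta_\alpha + \epsilon$$
$$V_A = 2p(1 - p)\beta_\alpha^2$$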
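The additive-only regression and the resulting heritability estimate can be sketched as follows. The simulated population (allele frequency, effect size, error variance) is hypothetical, and genotypes are drawn under Hardy-Weinberg proportions as a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5000, 0.3                                 # p: frequency of allele A2 (hypothetical)
x_alpha = rng.binomial(2, p, size=n) - 1.0       # -1, 0, 1 under Hardy-Weinberg
beta_alpha_true = 0.5
y = 2.0 + beta_alpha_true * x_alpha + rng.normal(0.0, 1.0, size=n)

# Fit Y = beta_mu + X_alpha * beta_alpha + error (additive term only).
X = np.column_stack([np.ones(n), x_alpha])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Estimate the allele frequency from the genotype codings.
p_hat = (x_alpha + 1.0).mean() / 2.0

V_A = 2.0 * p_hat * (1.0 - p_hat) * beta_hat[1] ** 2   # additive genetic variance
V_P = np.mean((y - y.mean()) ** 2)                     # phenotypic variance
h2 = V_A / V_P
```

With these simulation settings the expected value is roughly 0.105 / 1.105, so h2 should land near 0.1.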
following equations and making appropriate substitutions:
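$$V_A = 2p(1 - p)\left(a\left(1 + d(p_1 - p_2)\right)\right)^2$$
$$0 = \beta_\mu - \beta_a - \beta_d, \quad a + d = \beta_\mu + \beta_d, \quad 2a = \beta_\mu + \beta_a - \beta_d$$
$$\beta_\alpha = a\left(1 + \frac{d}{2}(p_1 - p_2)\right)$$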
parameters in our regression model
semester!) the true values of the parameters are the same regardless of the allele frequency (MAF) of the causal polymorphism
this parameter depends on the allele frequency (MAF) of the causal polymorphism
changes in allele frequencies (!!)
regression parameter, there would be a different correct answer depending on the allele frequency in the population (!!) a, d
α
structure of variation in populations, etc.
to map the locations of causal polymorphisms (why is this?)
single marker to provide a quantification of effects (note that we use different concepts such as relative risks and related concepts when dealing with case / control data):
tools to understand heritability in terms of regressions (!!) and this will provide a framework for understanding related concepts
$$h^2_m = \frac{2p_i(1 - p_i)\beta^2_{\alpha,i}}{V_P}$$
[Figure: overview diagram of the course framework. Labels: Genetic System; Does A1 -> A2 affect Y?; Sample or experimental pop; Measured individuals (genotype, phenotype); Pr(Y|X); Model params; Reject / DNR; Regression model; F-test; Experiment; Probability Model; Estimator; Hypothesis Test]
build your understanding by hooking it into your passion
means you are missing a critical component that you have not learned = put it down and come back to it at a later date (you’ll be surprised how you’ll learn something later that suddenly makes it clear…)
keep adding to this, and learn it over time by hooking it into your passion
confused (knowing math someone else has developed does not mean you’re smart…)
the class so I’m not smart enough to learn it, etc. = NOT TRUE
you can always come back and keep learning and you WILL LEARN IT (trust me)