BTRY 4830/6830: Quantitative Genomics and Genetics
Lecture 17: Logistic Regression III and Haplotypes
Jason Mezey jgm45@cornell.edu
Oct. 23, 2014 (Th) 8:40-9:55
Announcements: MY OFFICE HOURS TODAY ARE CANCELLED (This week only!!)
Homework will be due this coming Thurs. (11:59PM)
In this lecture we will:
- Continue expanding our analysis abilities by introducing the logistic regression model that allows us to analyze case / control data
- Continue by discussing (in brief) Generalized Linear Models (GLMs), the broader class of models including linear and logistic regression
- Introduce the concept of haplotypes and haplotype testing in a GWAS framework
[Overview diagram: the genetic system (does A1 -> A2 affect Y?), a sample or experimental population of measured individuals (genotype, phenotype), the probability model Pr(Y|X) with model parameters, a regression model fit to the sample, and a test (e.g. an F-test) leading to a reject / do-not-reject (DNR) decision.]
Review: in linear regression, the error term accounted for the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution. In logistic regression, the error term instead makes up the difference needed to bring the value to either 0 or 1. To see how the relationship between a phenotype $Y_i$ and a genotype $X_i$ breaks down in this equation, we need to consider the expected (predicted!) part and the error term:

$$Y_i = E(Y_i | X_i) + \epsilon_i$$

$$Y_i = \gamma^{-1}(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d) + \epsilon_i$$

$$Y_i = \frac{e^{\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d}}{1 + e^{\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d}} + \epsilon_i$$
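To make the model concrete, here is a minimal simulation sketch in Python/NumPy. The parameter values, sample size, and genotype frequencies are made up for illustration (they are not from the lecture): we draw genotypes, compute the logistic expectation, and sample 0/1 phenotypes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values, for illustration only
beta_mu, beta_a, beta_d = -0.5, 1.0, 0.25

def gamma_inv(z):
    """Logistic (inverse link) function: e^z / (1 + e^z)."""
    return 1.0 / (1.0 + np.exp(-z))

n = 1000
genotypes = rng.integers(0, 3, size=n)        # 0, 1, or 2 copies of A2
xa = genotypes - 1.0                          # additive coding: -1, 0, 1
xd = np.where(genotypes == 1, 1.0, -1.0)      # dominance coding: -1, 1, -1

p = gamma_inv(beta_mu + xa * beta_a + xd * beta_d)  # E(Y|X) for each individual
y = rng.binomial(1, p)                              # case (1) / control (0) phenotypes
print(y[:10])
```

The error term here is implicit: it is whatever makes up the difference between each sampled 0/1 value and its expectation p.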
These equations give the distribution of a case / control phenotype when there is a causal polymorphism. Under the null hypothesis there is not a causal polymorphism (or more accurately, the genetic marker we are testing is not in LD with a causal polymorphism!), and the model reduces to $\beta_\mu = c$, $\beta_a = 0$, $\beta_d = 0$. To test this we will construct a hypothesis test for logistic regression, i.e. we will construct a LRT = likelihood ratio test (recall that an F-test is an LRT, and although we will not construct an F-test for logistic regression hypothesis testing, we will construct an LRT!):

$$H_0: \beta_a = 0 \cap \beta_d = 0$$
In a GWAS, we test each genetic marker for association with the phenotype individually, producing a p-value for every marker. For logistic regression this requires two ingredients: the MLE of our regression parameters (which we will obtain with an algorithm!) and the LRT. Recall that an algorithm is a procedure that takes the sample as an input and outputs the estimate of the parameters!
Unlike linear regression, where there is a simple closed-form equation that allows us to plug in the Y's and X's and returns the beta values that maximize the log-likelihood, there is no such simple equation for a logistic regression. We therefore need an algorithm to find

$$MLE(\hat{\beta}) = MLE(\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d)$$

i.e. the values that maximize the log-likelihood for logistic regression, which has the following form (sample size n):

$$l(\beta) = \sum_{i=1}^{n} \left[ y_i \ln\left(\gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)\right) + (1 - y_i)\ln\left(1 - \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)\right) \right]$$
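This log-likelihood transcribes directly into code. A sketch in Python/NumPy (the toy data are invented; a small epsilon guards against taking ln(0)):

```python
import numpy as np

def gamma_inv(z):
    """Logistic (inverse link) function: e^z / (1 + e^z)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, xa, xd, y):
    """l(beta) = sum_i [ y_i ln g(z_i) + (1 - y_i) ln(1 - g(z_i)) ],
    where z_i = beta_mu + x_ia beta_a + x_id beta_d and g is the logistic."""
    beta_mu, beta_a, beta_d = beta
    p = gamma_inv(beta_mu + xa * beta_a + xd * beta_d)
    eps = 1e-12                      # guard against ln(0)
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy check with made-up data: at beta = 0 every fitted p is 0.5
xa = np.array([-1., 0., 1., 0.])
xd = np.array([-1., 1., -1., 1.])
y  = np.array([0, 1, 1, 0])
print(log_likelihood((0.0, 0.0, 0.0), xa, xd, y))   # = 4 * ln(0.5)
```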
To find this MLE, we use an algorithm that takes the sample as an input, producing an output that converges to the parameter estimates: the Iterative Re-weighted Least Squares (IRLS) algorithm, which has the following structure:

1. Choose starting values for the parameters; we assign these numbers and call the resulting vector $\beta^{[0]}$.
2. Using these values, calculate an updated vector $\beta^{[t+1]}$ from $\beta^{[t]}$.
3. At each step, compare $\beta^{[t+1]}$ and $\beta^{[t]}$ using an appropriate function. If the value is below a defined threshold, stop. If not, repeat steps 2, 3.

Note that for algorithms in general there are usually some restrictions and / or certain starting values that are "better" than others, in the sense that the algorithm will converge faster, find a more "optimal" solution, etc.
For logistic regression, we begin with a starting vector

$$\beta^{[0]} = \left[\beta_\mu^{[0]}, \beta_a^{[0]}, \beta_d^{[0]}\right]^T$$

For step 2, at iteration t we update $\beta^{[t]}$ using the following equation (then do this again and again until we stop!):

$$\beta^{[t+1]} = \beta^{[t]} + \left[x^T W x\right]^{-1} x^T \left(y - \gamma^{-1}(x\beta^{[t]})\right)$$

where

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad x = \begin{bmatrix} 1 & x_{1,a} & x_{1,d} \\ 1 & x_{2,a} & x_{2,d} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,a} & x_{n,d} \end{bmatrix}, \quad \beta^{[t]} = \begin{bmatrix} \beta_\mu^{[t]} \\ \beta_a^{[t]} \\ \beta_d^{[t]} \end{bmatrix}$$

and the matrix W is an n by n diagonal matrix ($W_{ij} = 0$ for $i \neq j$):
$$\gamma^{-1}\left(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}\right) = \frac{e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}{1 + e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}, \qquad \gamma^{-1}(x\beta^{[t]}) = \frac{e^{x\beta^{[t]}}}{1 + e^{x\beta^{[t]}}}$$

$$W_{ii} = \gamma^{-1}\left(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}\right)\left(1 - \gamma^{-1}\left(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}\right)\right) = \frac{e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}{1 + e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}\left(1 - \frac{e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}{1 + e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}\right)$$
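The IRLS update can be sketched in a few lines of Python/NumPy. The toy sample is made up for illustration, and a fixed iteration cap stands in for a formal stopping rule:

```python
import numpy as np

def gamma_inv(z):
    # logistic (inverse link) function: e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

def irls_step(beta, x, y):
    # one update: beta[t+1] = beta[t] + (x^T W x)^{-1} x^T (y - gamma_inv(x beta[t]))
    p = gamma_inv(x @ beta)
    W = np.diag(p * (1.0 - p))          # W_ii = gamma_inv(.)(1 - gamma_inv(.))
    return beta + np.linalg.solve(x.T @ W @ x, x.T @ (y - p))

# Toy sample (invented): columns of x are [1, x_a, x_d]
x = np.array([[1., -1., -1.], [1., -1., -1.], [1., -1., -1.],
              [1.,  0.,  1.], [1.,  0.,  1.], [1.,  0.,  1.],
              [1.,  1., -1.], [1.,  1., -1.], [1.,  1., -1.]])
y = np.array([0., 0., 1., 1., 0., 1., 1., 1., 0.])

beta = np.zeros(3)                      # starting vector beta[0]
for t in range(25):                     # fixed cap stands in for a stopping rule
    beta = irls_step(beta, x, y)
print(beta)
```

With one parameter per genotype class, the converged fit reproduces the observed case proportion within each class (1/3, 2/3, 2/3 here), which is a useful sanity check.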
For step 3, if we decide not to stop, we go back to step 2. We want to stop when the algorithm has reached the MLE (it may not be exactly the true MLE, but we will assume that it is close if we do not stop the algorithm too early!). We can use the deviance D, a function of the log-likelihood, to construct a rule (note the issue with $\ln(0)$!!):

$$\Delta D = \left| D^{[t+1]} - D^{[t]} \right|$$

and stop when $\Delta D$ is small, e.g. when $\Delta D < 10^{-6}$, since then $\beta^{[t+1]} \approx \beta^{[t]}$.
$$D^{[t]} = -2\sum_{i=1}^{n}\left[ y_i \ln\left(\gamma^{-1}\left(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}\right)\right) + (1 - y_i)\ln\left(1 - \gamma^{-1}\left(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}\right)\right) \right]$$

with $D^{[t+1]}$ defined the same way using $\beta^{[t+1]}$, where as before

$$\gamma^{-1}\left(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}\right) = \frac{e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}{1 + e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}$$
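The deviance and the stopping rule can be sketched as follows (Python/NumPy; the tolerance and toy data are illustrative, and the epsilon clip handles the ln(0) issue):

```python
import numpy as np

def gamma_inv(z):
    return 1.0 / (1.0 + np.exp(-z))

def deviance(beta, x, y, eps=1e-12):
    """D = -2 sum_i [ y_i ln g(z_i) + (1 - y_i) ln(1 - g(z_i)) ], with a ln(0) guard."""
    p = np.clip(gamma_inv(x @ beta), eps, 1 - eps)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def converged(beta_t, beta_t1, x, y, tol=1e-6):
    """Stop when Delta D = |D[t+1] - D[t]| falls below the threshold."""
    return abs(deviance(beta_t1, x, y) - deviance(beta_t, x, y)) < tol

# Toy data (invented)
x = np.array([[1., -1., -1.], [1., 0., 1.], [1., 1., -1.]])
y = np.array([0., 1., 1.])
print(converged(np.zeros(3), np.zeros(3), x, y))  # identical betas -> True
```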
Once the algorithm stops, we use the final parameter values to construct the LRT for:

$$H_0: \beta_a = 0 \cap \beta_d = 0$$
$$H_A: \beta_a \neq 0 \cup \beta_d \neq 0$$

$$LRT = -2\ln\Lambda = -2\ln\frac{L(\hat{\theta}_0|y)}{L(\hat{\theta}_1|y)} = 2l(\hat{\theta}_1|y) - 2l(\hat{\theta}_0|y)$$

$$l(\hat{\theta}_1|y) = l(\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d|y), \qquad l(\hat{\theta}_0|y) = l(\hat{\beta}_{\mu,0}, 0, 0|y)$$

For the alternative, we take the logistic regression parameters we get from our IRLS algorithm and plug these into the log-likelihood equation. For the null, we plug the estimate $\hat{\beta}_{\mu,0}$ into this same equation, where we obtain this estimate by running the algorithm EXACTLY the same EXCEPT we set $\hat{\beta}_a = 0$, $\hat{\beta}_d = 0$ and we do not update these!

$$l(\hat{\theta}_1|y) = \sum_{i=1}^{n}\left[ y_i\ln\left(\gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d)\right) + (1-y_i)\ln\left(1 - \gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d)\right) \right]$$

$$l(\hat{\theta}_0|y) = \sum_{i=1}^{n}\left[ y_i\ln\left(\gamma^{-1}(\hat{\beta}_{\mu,0} + x_{i,a}\cdot 0 + x_{i,d}\cdot 0)\right) + (1-y_i)\ln\left(1 - \gamma^{-1}(\hat{\beta}_{\mu,0} + x_{i,a}\cdot 0 + x_{i,d}\cdot 0)\right) \right]$$

where as before

$$\gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d) = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}$$
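Putting the pieces together, a sketch of the LRT computation in Python/NumPy (toy data invented; as a shortcut, for the intercept-only null model the MLE satisfies $\gamma^{-1}(\hat{\beta}_{\mu,0}) = \bar{y}$, so we can plug in the sample mean directly instead of re-running the restricted IRLS):

```python
import numpy as np

def gamma_inv(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_irls(x, y, iters=25):
    beta = np.zeros(x.shape[1])          # beta[0] = 0
    for _ in range(iters):
        p = gamma_inv(x @ beta)
        W = np.diag(p * (1.0 - p))
        beta = beta + np.linalg.solve(x.T @ W @ x, x.T @ (y - p))
    return beta

# Toy sample (invented): columns of x are [1, x_a, x_d]
x = np.array([[1., -1., -1.], [1., -1., -1.], [1., -1., -1.],
              [1.,  0.,  1.], [1.,  0.,  1.], [1.,  0.,  1.],
              [1.,  1., -1.], [1.,  1., -1.], [1.,  1., -1.]])
y = np.array([0., 0., 1., 1., 0., 1., 1., 1., 0.])

beta_hat = fit_irls(x, y)                          # alternative: all betas free
l1 = log_lik(gamma_inv(x @ beta_hat), y)
l0 = log_lik(np.full(len(y), y.mean()), y)         # null: fitted p = mean(y)
LRT = 2 * l1 - 2 * l0
print(LRT)
```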
To calculate a p-value, we need the distribution of our LRT statistic under the null hypothesis. We do not know this distribution exactly for a given n (contrast with F-statistics!!), but we know that as n goes to infinity, under our (not all!!) null, this LRT is (approximately!) chi-square distributed with 2 degrees of freedom (d.f.), assuming n is not too small:

$$LRT = 2l(\hat{\theta}_1|y) - 2l(\hat{\theta}_0|y) \sim \chi^2_{2 \ d.f.}$$
We can now perform a GWAS with a case / control phenotype! For a sample of n individuals, where for each we have measured a case / control phenotype and N genotypes, we perform N hypothesis tests: we run the IRLS algorithm for EACH marker to get the MLE of the parameters under the alternative (= no restrictions on the betas!), use these to calculate our LRT test statistic for each marker, and calculate each p-value by comparing the statistic to a chi-square distribution with 2 d.f. (how do we do this in R?).
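In R, the p-value is `pchisq(LRT, df = 2, lower.tail = FALSE)`; in Python, `scipy.stats.chi2.sf(LRT, df = 2)` does the same. As a self-contained sketch, note that for 2 d.f. the chi-square survival function has the closed form $e^{-x/2}$:

```python
import math

def lrt_pvalue_2df(lrt_stat):
    """P(chi^2 with 2 d.f. > LRT): for 2 d.f. the survival function is exp(-x/2)."""
    return math.exp(-lrt_stat / 2.0)

# Equivalent to R's pchisq(lrt, df = 2, lower.tail = FALSE)
print(lrt_pvalue_2df(5.99))   # ~0.05, the familiar 5% cutoff near 5.99
```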
We use logistic (and linear) regression for GWAS analysis because these are the most versatile framework for performing a GWAS (there are many less versatile alternatives!). Both handle an independent variable X (genetic coding) where we have discrete categories (although they can also handle X that can take on a continuous set of values!), and they have normal (linear) and Bernoulli (logistic) error, respectively. But what if we have phenotypes with other error distributions? Linear and logistic regression models are members of a broader class called Generalized Linear Models (GLMs), which covers a wider set of error distributions. To describe how linear and logistic models fit into this framework, we present them using three properties (models that have these properties are considered GLMs):

1. The distribution of the response variable Y conditional on the independent variable X is in the exponential family of distributions: $Pr(Y|X) \sim \text{exp. family}$
2. A linear predictor of the independent variables: $\eta = X\beta$
3. A link function $\gamma$ relating the linear predictor to the expected value of the response variable (where we often use the inverse!!): $\gamma(E(Y|X)) = X\beta$, i.e. $E(Y|X) = \gamma^{-1}(X\beta)$
The exponential family of distributions can be expressed in the following form:

$$Pr(Y) \sim e^{\frac{Y\theta - b(\theta)}{\phi} + c(Y, \phi)}$$

This family includes many distributions (normal, binomial, Poisson, lognormal, multinomial, several categorical distributions, exponential, gamma, beta, chi-square) but not all (can you think of examples that are not?), and since we can model response variables with these distributions, we can model phenotypes with these distributions in a GWAS using a GLM (!!). For a case / control phenotype the relevant member is the Bernoulli, or more accurately the Binomial (logistic). For example, for the normal distribution we have:

$$\theta = \mu, \quad \phi = \sigma^2, \quad b(\theta) = \frac{\theta^2}{2}, \quad c(Y, \phi) = -\frac{1}{2}\left(\frac{Y^2}{\phi} + \log(2\pi\phi)\right)$$
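A quick numerical check (in Python, with arbitrary values) that this parameterization really does recover the normal density:

```python
import math

def normal_pdf(y, mu, sigma2):
    """Standard N(mu, sigma^2) density."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def exp_family_normal(y, mu, sigma2):
    """Normal density written in exponential-family form
    e^{(Y theta - b(theta))/phi + c(Y, phi)} with theta = mu, phi = sigma^2,
    b(theta) = theta^2 / 2, c(Y, phi) = -(Y^2/phi + log(2 pi phi)) / 2."""
    theta, phi = mu, sigma2
    b = theta ** 2 / 2
    c = -(y ** 2 / phi + math.log(2 * math.pi * phi)) / 2
    return math.exp((y * theta - b) / phi + c)

# The two parameterizations agree at arbitrary test values
print(normal_pdf(1.3, 0.5, 2.0), exp_family_normal(1.3, 0.5, 2.0))
```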
The link function relates $X\beta$ and the expected value of Y given X. There is flexibility in the choice of link function for a GLM, although there are some general restrictions on the form of this function; the most important is that it needs to be monotonic, such that we can define an inverse: for $Y = f(X)$, the inverse is $f^{-1}(Y) = X$. For logistic regression we use the logit function (a "canonical link"), where the inverse is the logistic function (but note that others are also used for binomial response variables):

$$E(Y|X) = \gamma^{-1}(X\beta) = \frac{e^{X\beta}}{1 + e^{X\beta}}$$

$$\gamma(E(Y|X)) = \ln\left(\frac{\frac{e^{X\beta}}{1+e^{X\beta}}}{1 - \frac{e^{X\beta}}{1+e^{X\beta}}}\right) = X\beta$$
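A quick check in Python that the logit link and the logistic inverse undo each other, as the monotonicity requirement guarantees:

```python
import math

def logistic(z):
    """Inverse link gamma^{-1}: e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Link function gamma: ln(p / (1 - p)); monotonic, so the inverse exists."""
    return math.log(p / (1 - p))

# logit(logistic(z)) recovers z = X beta, as required of a link function
for z in (-2.0, 0.0, 1.5):
    assert abs(logit(logistic(z)) - z) < 1e-12
print("logit and logistic are inverses")
```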
Note that in both linear and logistic regression, the expected value of Y depends on ONLY the independent variable and beta parameter vector; the models differ in their error terms. For linear regression (the variance of the error is constant!!):

$$\epsilon \sim N(0, \sigma^2_\epsilon), \quad Var(\epsilon) = \sigma^2_\epsilon$$

For logistic regression (the variance of the error changes depending on the value of X!!):

$$Var(\epsilon) = \gamma^{-1}(X\beta)\left(1 - \gamma^{-1}(X\beta)\right), \quad Var(\epsilon_i) = \gamma^{-1}(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d)\left(1 - \gamma^{-1}(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d)\right)$$
For inference with a GLM we use the same approach, i.e. we find the MLE of the beta parameters using an IRLS algorithm (just substitute the appropriate link function in the equations, etc.) and test hypotheses with an LRT (where the sampling distribution, as the sample size goes to infinity, is chi-square). Between linear regression, logistic regression, and the broader GLM family, these cover the types of regression modeling you will likely need to apply (!!). GLMs let us handle additional phenotypes; we now turn to genotypes defined using a different approach, haplotypes, which fit in the same testing framework.
Individual genetic markers contain much of the available information, but in some cases, testing using haplotypes is more effective than testing one genetic marker at a time. A haplotype is a set of alleles at several loci on the same chromosome that are inherited together. We can think of defining a haplotype as applying a "function" that takes a set of alleles at several loci A, B, C, D, etc. and outputs a haplotype allele:

$$h = f(A_i, B_j, \ldots)$$

For example, for four loci with the following alleles (A,G), (A,T), (G,C), (G,C), we could define the haplotype alleles $h_1 = (A, A, C, C)$ and $h_2 = (G, T, G, G)$.
In general, we define a haplotype for a set of genetic markers (loci) that are physically linked and whose alleles frequently occur together in a population, so that the haplotype alleles match observed patterns of LD. In principle, many haplotype alleles could be defined; in practice, we define haplotype alleles that have appreciable frequency in the population, i.e. relatively "high" allele frequencies (say > 0.05 = arbitrary!).
For example, suppose that for our four loci (A,G), (A,T), (G,C), (G,C) we have lots of LD (!!), such that there are only 4 alleles in the population (i.e. all other combinations have frequency of zero!):

$$h^*_1 = (A, A, C, C), \quad h^*_2 = (G, T, G, G), \quad h^*_3 = (A, A, G, C), \quad h^*_4 = (G, T, C, G)$$

If the frequencies of $h^*_3$ and $h^*_4$ in the population are < 0.01, we can group the alleles as follows (where * means "any" genetic marker allele):

$$h_1 = (A, A, *, C), \ \text{i.e.} \ h_1 = h^*_1 \cup h^*_3$$
$$h_2 = (G, T, *, G), \ \text{i.e.} \ h_2 = h^*_2 \cup h^*_4$$
We can then proceed with a GWAS using our framework (just substitute haplotype alleles and genotypes for genetic marker alleles and genotypes!). For example, with two haplotype alleles $h_1$ and $h_2$, we can code our independent variables for our regression model as follows (what is the effect on our interpretation of where the causal polymorphism is located?):

$$X_a(h_1h_1) = -1, \quad X_a(h_1h_2) = 0, \quad X_a(h_2h_2) = 1$$
$$X_d(h_1h_1) = -1, \quad X_d(h_1h_2) = 1, \quad X_d(h_2h_2) = -1$$
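A small sketch of this coding in Python (the helper name and the string labels "h1"/"h2" are invented for illustration; any representation of the two haplotype alleles would do):

```python
def haplotype_coding(genotype):
    """Map a pair of haplotype alleles, e.g. ("h1", "h2"), to the (Xa, Xd)
    codings given above: Xa = -1/0/1 by count of h2, Xd = 1 for heterozygotes."""
    n_h2 = sum(1 for allele in genotype if allele == "h2")   # copies of h2
    xa = {0: -1, 1: 0, 2: 1}[n_h2]
    xd = 1 if n_h2 == 1 else -1
    return xa, xd

sample = [("h1", "h1"), ("h1", "h2"), ("h2", "h2")]
print([haplotype_coding(g) for g in sample])   # [(-1, -1), (0, 1), (1, -1)]
```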
Haplotype testing is one of the types of analysis you may perform in your GWAS analysis (!!); as with single-marker tests, accounting for confounding remains important (population structure)!