Lecture 14: Logistic regression I: GWAS for case / control phenotypes
Jason Mezey jgm45@cornell.edu April 5, 2016 (T) 8:40-9:55
Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01
Announcements: your midterm will be returned tomorrow (i.e., during the first study period).
Review: so far, we have discussed how to analyze data for the “ideal” GWAS for phenotypes that can be modeled with a linear regression model. Today, we will discuss how to analyze case / control phenotypes using a logistic regression model.
Recall the GWAS framework we have developed: a genetic system (does A1 -> A2 affect Y?), a sample or experimental population, measured individuals (genotype and phenotype), a probability model Pr(Y|X) with model parameters, a regression model and an F-test, and a reject / do-not-reject (DNR) decision for each test. We measure N genotypes THROUGHOUT the genome for n independent individuals.
Remember that when we reject the null hypothesis for a genetic marker, we can conclude, because of linkage disequilibrium, that we have located a position in the genome that contains a causal polymorphism (not the causal polymorphism itself!); identifying the causal polymorphism requires additional data and analysis. For each marker, we test the null hypothesis of no association:

H0: Cov(X, Y) = 0
Recall the linear regression model for the relationship between genotype and phenotype (where this implicitly assumes a normal error provides a reasonable approximation of the phenotype distribution given the genotype):

Y = βµ + Xa βa + Xd βd + ε,   ε ~ N(0, σ²_ε)
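To make the model concrete, here is a minimal Python sketch of the expected phenotype under this linear model, using the standard (Xa, Xd) genotype codings that appear in the worked examples later in this lecture; the parameter values are hypothetical illustrations, not estimates from any data:

```python
# Hypothetical parameter values chosen for illustration only
beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2

# Standard (Xa, Xd) codings for the three genotypes
codings = {"A1A1": (-1, -1), "A1A2": (0, 1), "A2A2": (1, -1)}

for genotype, (xa, xd) in codings.items():
    # Under the linear model, E(Y|X) = beta_mu + Xa*beta_a + Xd*beta_d
    expected_y = beta_mu + xa * beta_a + xd * beta_d
    print(genotype, round(expected_y, 1))
```

Note that this expected value is unbounded, which is exactly why the linear model is a poor fit when Y can only be 0 or 1.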
While this model works for many phenotypes, we are commonly interested in analyzing phenotypes where this is NOT a good model. For example, we are often interested in identifying causal polymorphisms (loci) that contribute to the risk of developing a disease, e.g. heart disease, diabetes, etc. In such cases, each individual is scored as “has disease” or “does not have disease”, or more precisely “case” or “control”. These states are properties of individuals and therefore elements of a sample space, such that we can define a random variable such as Y(case) = 1 and Y(control) = 0.
Compare the data we could analyze with a linear regression model versus case / control data: for the latter, instead of a regression “line” we need a regression function of a different “shape”. Logistic regression is considered (along with linear regression) within a broader class of models, the generalized linear models, which we will discuss next lecture. Although logistic regression uses a different regression function, we will see that the structure and our approach to inference is the same with this model. For the moment, leave the “logistic” function as yet undefined.
The overall model for Y has the same structure as we have seen before in a regression (where the X matrix has the same form as we have been considering!), except the regression line is replaced by the logistic function:

Y = logistic(βµ + Xa βa + Xd βd) + εl

E(Yi | Xi) = logistic(βµ + X_i,a βa + X_i,d βd),   or compactly E(Y|X) = logistic(Xβ)
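To make the shape concrete, here is a minimal Python sketch of the logistic function itself (nothing here is specific to genetics; it is just the function):

```python
import math

def logistic(z):
    # logistic(z) = e^z / (1 + e^z), equivalently 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(0.0))   # 0.5: the function crosses one half at z = 0
print(logistic(5.0))   # close to 1 for large positive z
print(logistic(-5.0))  # close to 0 for large negative z
```

Because the output is always strictly between 0 and 1, it can serve as the expected value of a 0/1 phenotype, unlike a regression line, which is unbounded.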
For a linear regression, the error is the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution. For a logistic regression, the error plays the same role, but it must make up the difference between the expected value and either 0 or 1 (what distribution is this?). That is, the error takes one of two values depending on the expected value given the genotype: for Y = 0, εi = −E(Yi|Xi); for Y = 1, εi = 1 − E(Yi|Xi). The distribution of Yi is therefore the Bernoulli distribution with the following parameter:

p = E(Yi|Xi) = logistic(βµ + X_i,a βa + X_i,d βd),   Pr(Z) ~ Bern(p)
The intuition is relatively simple: when the regression function is near zero, the distribution of the error term is set up to make the probability of Y being zero greater than the probability of it being one (and vice versa when the regression function is near one!):

p = logistic(βµ + Xa βa + Xd βd),   Pr(Z) ~ Bern(p)
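A short simulation sketch makes this concrete: each individual's phenotype is a Bernoulli draw with success probability given by the logistic regression function. The parameter values below are the hypothetical "true" values used in the worked examples that follow, and the seed is arbitrary:

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical true parameter values (as in the worked examples below)
beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2

def simulate_phenotype(xa, xd, rng):
    # Y ~ Bern(p) with p = logistic(beta_mu + Xa*beta_a + Xd*beta_d)
    p = logistic(beta_mu + xa * beta_a + xd * beta_d)
    return 1 if rng.random() < p else 0

rng = random.Random(2016)
# For Xa = -1, Xd = -1 the regression function is near 0.1,
# so most simulated individuals are controls (Y = 0)
draws = [simulate_phenotype(-1, -1, rng) for _ in range(10000)]
print(sum(draws) / len(draws))  # should be close to 0.1
```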
Next, consider the regression function of a logistic regression (remember, below we are plotting just versus Xa, but this really is a plot versus Xa AND Xd!!). The logistic function has a specific form (but remember that it is just a function!!):

E(Yi | Xi) = logistic(βµ + X_i,a βa + X_i,d βd) = e^(βµ + X_i,a βa + X_i,d βd) / (1 + e^(βµ + X_i,a βa + X_i,d βd))

In vector form, writing γ^(-1) for this (inverse link) function applied element-wise:

E(y|x) = γ^(-1)(xβ) = [ e^(βµ + x_1,a βa + x_1,d βd) / (1 + e^(βµ + x_1,a βa + x_1,d βd)), ..., e^(βµ + x_n,a βa + x_n,d βd) / (1 + e^(βµ + x_n,a βa + x_n,d βd)) ]^T
Each phenotype Yi is described by the following equation:

Yi = E(Yi | Xi) + εi = γ^(-1)(βµ + x_i,a βa + x_i,d βd) + εi = e^(βµ + x_i,a βa + x_i,d βd) / (1 + e^(βµ + x_i,a βa + x_i,d βd)) + εi

To see how a phenotype Yi and a genotype Xi break down in this equation, we need to consider the expected (predicted!) part and the error term separately.
For example, for an individual i with genotype X_i,a = −1, X_i,d = −1 and phenotype Yi = 0, where the true parameter values (which we will not know in practice! We need to infer them!) are βµ = 0.2, βa = 2.2, βd = 0.2:

0 = e^(0.2 + (−1)2.2 + (−1)0.2) / (1 + e^(0.2 + (−1)2.2 + (−1)0.2)) + εi,   i.e. 0 = 0.1 − 0.1, so εi = −0.1

For the same genotype but phenotype Yi = 1:

1 = e^(0.2 + (−1)2.2 + (−1)0.2) / (1 + e^(0.2 + (−1)2.2 + (−1)0.2)) + εi,   i.e. 1 = 0.1 + 0.9, so εi = 0.9

For an individual with X_i,a = 0, X_i,d = 1 and phenotype Yi = 0:

0 = e^(0.2 + (0)2.2 + (1)0.2) / (1 + e^(0.2 + (0)2.2 + (1)0.2)) + εi,   i.e. 0 = 0.6 − 0.6, so εi = −0.6

For an individual with X_i,a = 1, X_i,d = −1 and phenotype Yi = 0:

0 = e^(0.2 + (1)2.2 + (−1)0.2) / (1 + e^(0.2 + (1)2.2 + (−1)0.2)) + εi,   i.e. 0 = 0.9 − 0.9, so εi = −0.9
The error term is therefore −E(Yi | Xi) when Yi is zero and 1 − E(Yi | Xi) when Yi is one:

εi | (Yi = 0) = −E(Yi|Xi)
εi | (Yi = 1) = 1 − E(Yi|Xi)

so the error is a shifted Bernoulli random variable: εi ~ Bern(p) − E(Y|X), where p = E(Y|X).

For example: when p = 0.1, εi = −0.1 or εi = 0.9; when p = 0.6, εi = −0.6 or εi = 0.4; when p = 0.9, εi = −0.9 or εi = 0.1.
Remember that the true plot of the regression function is versus BOTH Xa and Xd (harder to see what is going on). In matrix notation, the entire regression function can be written compactly as

E(Y|X) = γ^(-1)(Xβ) = e^(Xβ) / (1 + e^(Xβ)) = 1 / (1 + e^(−Xβ))
Consider the distribution of a case / control phenotype when there is a causal polymorphism versus when there is not a causal polymorphism (or more accurately, when the genetic marker we are testing is not in LD with a causal polymorphism!).

Hypothesis testing proceeds just as for linear regression, i.e. we will construct a LRT = likelihood ratio test (recall that an F-test is an LRT, and although we will not construct an F-test for logistic regression hypothesis testing, we will construct an LRT!). Under the null hypothesis, the parameters of the logistic regression are

βµ = c, βa = 0, βd = 0,   i.e.  H0: βa = 0 ∩ βd = 0
Unlike linear regression, there is no closed-form solution for the MLE of the logistic regression parameters, so we use an algorithm that iterates toward the MLE, producing an output at each step. Specifically, we use the Iterative Re-weighted Least Squares (IRLS) algorithm, which has the following structure:

1. Choose starting values for the betas; we assign these numbers and call the resulting vector β[0].
2. Using the current estimate β[t], calculate an updated estimate β[t+1].
3. Assess convergence with an appropriate function. If the value is below a defined threshold, stop. If not, repeat steps 2 and 3.

For logistic regression, the starting values are not critical because the likelihood is “convex” (although for other algorithms, we need to be careful in how we assign starting values!). Each update is given by the following equation:

β[t+1] = [x^T W x]^(-1) x^T W z,   where z = xβ[t] + W^(-1)(y − γ^(-1)(xβ[t]))

and W is an n × n diagonal matrix (W_ij = 0 for i ≠ j) with

W_ii = γ^(-1)(βµ[t] + x_i,a βa[t] + x_i,d βd[t]) (1 − γ^(-1)(βµ[t] + x_i,a βa[t] + x_i,d βd[t]))

Equivalently, the update can be written as

β[t+1] = β[t] + [x^T W x]^(-1) x^T (y − γ^(-1)(xβ[t]))

We stop when β[t+1] ≈ β[t], i.e. when ΔD = |D[t+1] − D[t]| is small, e.g. ΔD < 10^(-6), where D is the deviance.
The deviance evaluated at β[t] (or β[t+1]) is

D = −2 Σ_{i=1}^{n} [ y_i ln( γ^(-1)(βµ + x_i,a βa + x_i,d βd) ) + (1 − y_i) ln( 1 − γ^(-1)(βµ + x_i,a βa + x_i,d βd) ) ]

with the betas set to β[t] (or β[t+1]).
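As a sanity check on the algorithm, here is a compact pure-Python IRLS sketch. This is not the course implementation: the variable names, the simulated data, and the small Gaussian-elimination helper (standing in for a linear algebra library) are all my own. It simulates case / control data under hypothetical "true" parameter values and recovers estimates close to them:

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def deviance(y, p):
    eps = 1e-12  # guard against log(0)
    return -2.0 * sum(
        yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1.0 - pi, eps))
        for yi, pi in zip(y, p)
    )

def irls(X, y, tol=1e-6, max_iter=100):
    """IRLS update: beta[t+1] = beta[t] + (X'WX)^-1 X'(y - p)."""
    k = len(X[0])
    beta = [0.0] * k  # starting values: all zeros
    d_old = float("inf")
    d_new = d_old
    for _ in range(max_iter):
        p = [logistic(sum(b * xj for b, xj in zip(beta, xi))) for xi in X]
        w = [pi * (1.0 - pi) for pi in p]  # diagonal of W
        XtWX = [[sum(wi * xi[r] * xi[c] for wi, xi in zip(w, X))
                 for c in range(k)] for r in range(k)]
        Xtres = [sum(xi[r] * (yi - pi) for xi, yi, pi in zip(X, y, p))
                 for r in range(k)]
        step = solve(XtWX, Xtres)
        beta = [b + s for b, s in zip(beta, step)]
        p = [logistic(sum(b * xj for b, xj in zip(beta, xi))) for xi in X]
        d_new = deviance(y, p)
        if abs(d_old - d_new) < tol:  # stop when the deviance change is small
            break
        d_old = d_new
    return beta, d_new

# Simulate n individuals under hypothetical true values, then re-estimate them
rng = random.Random(4830)
true_beta = (0.2, 2.2, 0.2)
genotypes = [(-1, -1), (0, 1), (1, -1)]
X, y = [], []
for _ in range(2000):
    xa, xd = rng.choice(genotypes)
    p = logistic(true_beta[0] + xa * true_beta[1] + xd * true_beta[2])
    X.append([1.0, xa, xd])
    y.append(1 if rng.random() < p else 0)

beta_hat, dev = irls(X, y)
print([round(b, 2) for b in beta_hat])  # should be near [0.2, 2.2, 0.2]
```

The update line implements the second (equivalent) form of the IRLS step shown above, which avoids forming the working response z explicitly.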
For the null and alternative hypotheses

H0: βa = 0 ∩ βd = 0
HA: βa ≠ 0 ∪ βd ≠ 0

the test statistic is

LRT = −2 ln Λ = −2 ln[ L(θ̂0 | y) / L(θ̂1 | y) ] = 2 l(θ̂1 | y) − 2 l(θ̂0 | y)

While the LRT does not have an exact distribution in this case, as n → ∞ it converges in distribution, LRT → χ²_df, to a chi-square distribution with degrees of freedom (df) that depends on the number of parameters restricted under the null; here two parameters are restricted, so the LRT is asymptotically chi-square distributed with 2 degrees of freedom (d.f.).
We are now ready to perform a GWAS with a case / control phenotype! Given a sample of n individuals, where for each we have measured a case / control phenotype and N genotypes, we perform N hypothesis tests. We run the IRLS algorithm for EACH marker to get the MLE of the parameters under the alternative (= no restrictions on the betas!) and use these to calculate our LRT test statistic for each marker. We then determine a p-value for each marker using a chi-square distribution (how do we do this in R?).
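In R, this p-value is pchisq(lrt, df = 2, lower.tail = FALSE). As a small Python sketch of the same computation, note that for 2 d.f. the chi-square upper-tail probability has the closed form exp(−x/2):

```python
import math

def lrt_pvalue_2df(lrt):
    # For a chi-square distribution with 2 d.f., the survival function
    # (upper-tail probability) is exactly exp(-x / 2)
    return math.exp(-lrt / 2.0)

print(lrt_pvalue_2df(0.0))                   # 1.0: an LRT of 0 gives no evidence against H0
print(lrt_pvalue_2df(2.0 * math.log(1000)))  # approximately 0.001
```

With N markers, these N p-values are then assessed against a multiple-testing-adjusted threshold, as in the linear regression GWAS.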
There are functions in R that perform a logistic regression (!!); we will review this next lecture (computer lab this week!). That's it: instead of a regression “line” we are using a regression function that has a different shape, and instead of a normal error we are using an error that takes one of two states (a Bernoulli error!). We will estimate parameters using the same approach (MLE!) and we will perform hypothesis testing using the same approach (a likelihood ratio test), just as for linear regression!