Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01
Lecture 14: Logistic regression I: GWAS for case / control phenotypes


SLIDE 1

Lecture 14: Logistic regression I: GWAS for case / control phenotypes

Jason Mezey jgm45@cornell.edu April 5, 2016 (T) 8:40-9:55

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01

SLIDE 2

Announcements

  • Your midterm will be returned next Tues.
  • Homework #6 (last homework!) will be available tomorrow
  • Project available April 14 (more details to come!)
  • Scheduling the final (take home - same format as midterm)
  • Option 1: Available Tues. May 10, due Fri. May 13 (= during first study period)
  • Option 2: During first week of exams May 16-19
  • I will send an email about these options - please email or talk to me about concerns / constraints ASAP(!!)
SLIDE 3

Summary of lecture 14

  • In previous lectures, we completed our introduction to how to analyze data for the “ideal” GWAS for phenotypes that can be modeled with a linear regression model
  • Going forward, we will continue to add layers, where today we will discuss how to analyze case / control phenotypes using a logistic regression model

SLIDE 4

Conceptual Overview

[Diagram: Genetic system (does A1 -> A2 affect Y?) -> sample or experimental population -> measured individuals (genotype, phenotype) -> Pr(Y|X) modeled with a regression model -> model parameters -> F-test -> reject / DNR]

SLIDE 5

Review: GWAS basics

  • In an “ideal” GWAS experiment, we measure the phenotype and N genotypes THROUGHOUT the genome for n independent individuals
  • To analyze a GWAS, we perform N independent hypothesis tests of the following form:

H0 : Cov(X, Y) = 0

  • When we reject the null hypothesis, we assume that, because of linkage disequilibrium, we have located a position in the genome that contains a causal polymorphism (not the causal polymorphism!)
  • This is as far as we can go with a GWAS (!!) such that (often) identifying the causal polymorphism requires additional data and/or follow-up experiments, i.e. GWAS is a starting point

SLIDE 6

Review: linear regression

  • So far, we have considered a linear regression as a reasonable model for the relationship between genotype and phenotype (where this implicitly assumes a normal error provides a reasonable approximation of the phenotype distribution given the genotype):

Y = βμ + Xaβa + Xdβd + ε

ε ∼ N(0, σ²ε)
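As a concrete illustration, here is a minimal Python sketch that simulates a phenotype under this linear model. The function name and the parameter values in the example are hypothetical choices for illustration, not values from the slides:

```python
import random

def simulate_linear_phenotype(x_a, x_d, beta_mu, beta_a, beta_d, sigma_eps):
    """Simulate Y = beta_mu + X_a*beta_a + X_d*beta_d + epsilon,
    with epsilon ~ N(0, sigma_eps^2)."""
    return beta_mu + x_a * beta_a + x_d * beta_d + random.gauss(0.0, sigma_eps)

# Hypothetical parameters; an A1A2 heterozygote is coded Xa = 0, Xd = 1,
# so E(Y|X) = 1.0 + 0 * 0.5 + 1 * 0.2 = 1.2 and simulated values scatter around it
random.seed(1)
y = simulate_linear_phenotype(0, 1, beta_mu=1.0, beta_a=0.5, beta_d=0.2, sigma_eps=1.0)
```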

SLIDE 7

Case / Control Phenotypes I

  • While a linear regression may provide a reasonable model for many phenotypes, we are commonly interested in analyzing phenotypes where this is NOT a good model
  • As an example, we are often in situations where we are interested in identifying causal polymorphisms (loci) that contribute to the risk for developing a disease, e.g. heart disease, diabetes, etc.
  • In this case, the phenotype we are measuring is often “has disease” or “does not have disease” or more precisely “case” or “control”
  • Recall that such phenotypes are properties of measured individuals and therefore elements of a sample space, such that we can define a random variable such as Y(case) = 1 and Y(control) = 0

SLIDE 8

Case / Control Phenotypes II

  • To contrast the situations, let’s compare data we might model with a linear regression model versus case / control data:

SLIDE 9

Case / Control Phenotypes II

  • To contrast the situations, let’s compare data we might model with a linear regression model versus case / control data:

SLIDE 10

Logistic regression I

  • Instead, we’re going to consider a logistic regression model
SLIDE 11

Logistic regression II

  • It may not be immediately obvious why we choose a regression “line” function of this “shape”
  • The reason is mathematical convenience, i.e. this function can be considered (along with linear regression) within a broader class of models called Generalized Linear Models (GLM), which we will discuss next lecture
  • However, beyond a few differences (the error term and the regression function) we will see that the structure and our approach to inference is the same with this model

SLIDE 12

Logistic regression III

  • To begin, let’s consider the structure of a regression model:

Y = logistic(βμ + Xaβa + Xdβd) + εl

  • We code the “X’s” the same (!!) although a major difference here is the “logistic” function, as yet undefined
  • However, the expected value of Y has the same structure as we have seen before in a regression:

E(Yi|Xi) = logistic(βμ + Xi,aβa + Xi,dβd)

  • We can similarly write for a population using matrix notation (where the X matrix has the same form as we have been considering!):

E(Y|X) = logistic(Xβ)

  • In fact, the two major differences are in the form of the error and the logistic function

SLIDE 13

Logistic regression: error term I

  • Recall that for a linear regression, the error term accounted for the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution; for a logistic regression, we have the same setup, but the error must make up the difference to a value of either 0 or 1 (what distribution is this?):

[Plots of Y versus Xa for the two cases]

SLIDE 14

Logistic regression: error term II

  • For the error on an individual i, we therefore have to construct an error whose value depends on whether Yi is “1” or “0” and on the expected value given the genotype
  • For Yi = 0:

εi = −E(Yi|Xi) = −E(Y|AiAj) = −logistic(βμ + Xi,aβa + Xi,dβd)

  • For Yi = 1:

εi = 1 − E(Yi|Xi) = 1 − E(Y|AiAj) = 1 − logistic(βμ + Xi,aβa + Xi,dβd)

  • For a distribution that takes two such values, a reasonable distribution is therefore the Bernoulli distribution with the following parameter:

p = logistic(βμ + Xaβa + Xdβd)

εi = Z − E(Yi|Xi),  Pr(Z) ∼ Bern(p)
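The shifted-Bernoulli construction above can be sketched in Python. This is a minimal illustration; the function names are my own, not from the slides:

```python
import math
import random

def logistic(z):
    """logistic(z) = 1 / (1 + e^(-z)), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sample_phenotype(x_a, x_d, beta_mu, beta_a, beta_d):
    """Draw Z ~ Bern(p) with p = logistic(beta_mu + X_a*beta_a + X_d*beta_d);
    the phenotype is Y = Z and the error is epsilon = Z - E(Y|X),
    which equals -p (when Y = 0) or 1 - p (when Y = 1)."""
    p = logistic(beta_mu + x_a * beta_a + x_d * beta_d)
    z = 1 if random.random() < p else 0
    epsilon = z - p
    return z, epsilon
```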

SLIDE 15

Logistic regression: error term III

  • This may look complicated at first glance but the intuition is relatively simple
  • If the logistic regression line is near zero, the probability distribution of the error term is set up to make the probability of Y being zero greater than being one (and vice versa for the regression line near one!):

[Plot of Y versus Xa]

p = logistic(βμ + Xaβa + Xdβd)

εi = Z − E(Yi|Xi),  Pr(Z) ∼ Bern(p)

SLIDE 16

Logistic regression: link function I

  • Next, we have to consider the function for the regression line of a logistic regression (remember below we are plotting just versus Xa but this really is a plot versus Xa AND Xd!!):

[Plot of Y versus Xa]

E(Yi|Xi) = logistic(βμ + Xi,aβa + Xi,dβd)

E(Yi|Xi) = e^(βμ+Xi,aβa+Xi,dβd) / (1 + e^(βμ+Xi,aβa+Xi,dβd))

SLIDE 17

Logistic regression: link function II

  • We will write this function using a somewhat strange notation (but remember that it is just a function!!):

E(Yi|Xi) = γ⁻¹(βμ + Xi,aβa + Xi,dβd) = e^(βμ+Xi,aβa+Xi,dβd) / (1 + e^(βμ+Xi,aβa+Xi,dβd))

  • In matrix notation, this is the following:

E(y|x) = γ⁻¹(xβ) = [ e^(βμ+x1,aβa+x1,dβd) / (1 + e^(βμ+x1,aβa+x1,dβd)), … , e^(βμ+xn,aβa+xn,dβd) / (1 + e^(βμ+xn,aβa+xn,dβd)) ]ᵀ
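Written as code, γ⁻¹ is just the logistic function. A small sketch (function names are mine) showing that the form e^z / (1 + e^z) and the algebraically equivalent 1 / (1 + e^(−z)) agree:

```python
import math

def gamma_inv(z):
    """gamma^-1(z) = e^z / (1 + e^z), the logistic (inverse link) function."""
    return math.exp(z) / (1.0 + math.exp(z))

def gamma_inv_alt(z):
    """Algebraically identical form: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# For any linear predictor z = beta_mu + x_a*beta_a + x_d*beta_d the two agree
```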

SLIDE 18

Calculating the components of an individual I

  • Recall that an individual with phenotype Yi is described by the following equation:

Yi = E(Yi|Xi) + εi

Yi = γ⁻¹(βμ + xi,aβa + xi,dβd) + εi

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

  • To understand how an individual with a phenotype Yi and a genotype Xi breaks down in this equation, we need to consider the expected (predicted!) part and the error term (we will do this separately)

SLIDE 19

Calculating the components of an individual II

  • For example, say we have an individual i that has genotype A1A1 and phenotype Yi = 0
  • We know Xa = -1 and Xd = -1
  • Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are:

βμ = 0.2, βa = 2.2, βd = 0.2

  • We can then calculate the E(Yi|Xi) and the error term for i:

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

0 = e^(0.2+(−1)2.2+(−1)0.2) / (1 + e^(0.2+(−1)2.2+(−1)0.2)) + εi

0 = 0.1 − 0.1

SLIDE 20

Calculating the components of an individual III

  • For example, say we have an individual i that has genotype A1A1 and phenotype Yi = 1
  • We know Xa = -1 and Xd = -1
  • Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are:

βμ = 0.2, βa = 2.2, βd = 0.2

  • We can then calculate the E(Yi|Xi) and the error term for i:

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

1 = e^(0.2+(−1)2.2+(−1)0.2) / (1 + e^(0.2+(−1)2.2+(−1)0.2)) + εi

1 = 0.1 + 0.9

SLIDE 21

Calculating the components of an individual IV

  • For example, say we have an individual i that has genotype A1A2 and phenotype Yi = 0
  • We know Xa = 0 and Xd = 1
  • Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are:

βμ = 0.2, βa = 2.2, βd = 0.2

  • We can then calculate the E(Yi|Xi) and the error term for i:

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

0 = e^(0.2+(0)2.2+(1)0.2) / (1 + e^(0.2+(0)2.2+(1)0.2)) + εi

0 = 0.6 − 0.6

SLIDE 22

Calculating the components of an individual V

  • For example, say we have an individual i that has genotype A2A2 and phenotype Yi = 0
  • We know Xa = 1 and Xd = -1
  • Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are:

βμ = 0.2, βa = 2.2, βd = 0.2

  • We can then calculate the E(Yi|Xi) and the error term for i:

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

0 = e^(0.2+(1)2.2+(−1)0.2) / (1 + e^(0.2+(1)2.2+(−1)0.2)) + εi

0 = 0.9 − 0.9
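The worked examples on the last few slides can be checked numerically. A short sketch using the slides' true parameters βμ = 0.2, βa = 2.2, βd = 0.2 and the genotype codings A1A1 → (Xa, Xd) = (−1, −1), A1A2 → (0, 1), A2A2 → (1, −1):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2   # true parameters from the slides

genotypes = {"A1A1": (-1, -1), "A1A2": (0, 1), "A2A2": (1, -1)}
for name, (x_a, x_d) in genotypes.items():
    e_y = logistic(beta_mu + x_a * beta_a + x_d * beta_d)
    print(name, round(e_y, 1))   # A1A1 -> 0.1, A1A2 -> 0.6, A2A2 -> 0.9
```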

SLIDE 23

The error term I

  • Recall that the error term is either the negative of E(Yi|Xi) when Yi is zero and 1 − E(Yi|Xi) when Yi is one:

εi|(Yi = 0) = −E(Yi|Xi)

εi|(Yi = 1) = 1 − E(Yi|Xi)

  • For the entire distribution of the population, recall that p = E(Y|X):

Pr(εi) ∼ Bern(p|X) − E(Y|X)

For example: p = 0.1, so εi = −0.1 (Yi = 0) or εi = 0.9 (Yi = 1)

SLIDE 24

The error term II

  • Recall that the error term is either the negative of E(Yi|Xi) when Yi is zero and 1 − E(Yi|Xi) when Yi is one:

εi|(Yi = 0) = −E(Yi|Xi)

εi|(Yi = 1) = 1 − E(Yi|Xi)

  • For the entire distribution of the population, recall that p = E(Y|X):

Pr(εi) ∼ Bern(p|X) − E(Y|X)

For example: p = 0.6, so εi = −0.6 (Yi = 0) or εi = 0.4 (Yi = 1)

SLIDE 25

The error term III

  • Recall that the error term is either the negative of E(Yi|Xi) when Yi is zero and 1 − E(Yi|Xi) when Yi is one:

εi|(Yi = 0) = −E(Yi|Xi)

εi|(Yi = 1) = 1 − E(Yi|Xi)

  • For the entire distribution of the population, recall that p = E(Y|X):

Pr(εi) ∼ Bern(p|X) − E(Y|X)

For example: p = 0.9, so εi = −0.9 (Yi = 0) or εi = 0.1 (Yi = 1)

SLIDE 26

Comments

  • Remember that while we are plotting this versus just Xa, the true plot is versus BOTH Xa and Xd (harder to see what is going on)
  • For an entire sample, we can use matrix notation as follows:

E(y|x) = γ⁻¹(xβ) = [ e^(βμ+x1,aβa+x1,dβd) / (1 + e^(βμ+x1,aβa+x1,dβd)), … , e^(βμ+xn,aβa+xn,dβd) / (1 + e^(βμ+xn,aβa+xn,dβd)) ]ᵀ

E(Y|X) = γ⁻¹(Xβ) = e^(Xβ) / (1 + e^(Xβ)) = 1 / (1 + e^(−Xβ))
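A minimal sketch of this matrix-notation version, applying γ⁻¹ row by row of the design matrix (pure-Python lists; the function name is mine):

```python
import math

def expected_vector(x, beta):
    """E(y|x) = gamma^-1(x beta): for each row x_i of the design matrix
    (x_i = [1, x_ia, x_id]), compute 1 / (1 + e^(-x_i . beta))."""
    return [1.0 / (1.0 + math.exp(-sum(xij * bj for xij, bj in zip(xi, beta))))
            for xi in x]
```

With the slides' parameters, `expected_vector([[1, -1, -1], [1, 0, 1]], [0.2, 2.2, 0.2])` recovers the expected phenotypes of an A1A1 and an A1A2 individual.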

SLIDE 27

Inference in logistic regression

  • Recall that our goal with using logistic regression was to model the probability distribution of a case / control phenotype when there is a causal polymorphism
  • To use this for a GWAS, we need to test the null hypothesis that a genotype is not a causal polymorphism (or more accurately, that the genetic marker we are testing is not in LD with a causal polymorphism!):

H0 : βa = 0 ∩ βd = 0 (i.e. under the null, βμ = c, βa = 0, βd = 0)

  • To assess this null hypothesis, we will use the same approach as in linear regression, i.e. we will construct a LRT = likelihood ratio test (recall that an F-test is an LRT, and although we will not construct an F-test for logistic regression hypothesis testing, we will construct an LRT!)
  • Just as with linear regression, to construct a LRT we need the MLE of the (beta) parameters of the logistic regression

SLIDE 28

IRLS algorithm I

  • An algorithm is a sequence of instructions for taking an input and producing an output
  • For logistic regression (and GLMs in general!) we will construct an Iterative Re-weighted Least Squares (IRLS) algorithm, which has the following structure:
  • 1. Choose starting values for the β’s. Since we have a vector of three β’s in our case, we assign these numbers and call the resulting vector β[0].
  • 2. Using the re-weighting equation (described next slide), update the β[t] vector.
  • 3. At each step t > 0, check if β[t+1] ≈ β[t] (i.e. if these are approximately equal) using an appropriate function. If the value is below a defined threshold, stop. If not, repeat steps 2, 3.

SLIDE 29

IRLS algorithm II

  • For step (1), we can assign any starting values, since the algorithm is “convex” (although for other algorithms, we need to be careful in how we assign our starting values!)
  • For step (2), we will update this vector at each “iteration” using the following equation:

β[t+1] = [xᵀWx]⁻¹ xᵀWz

z = xβ[t] + W⁻¹(y − γ⁻¹(xβ[t]))

where W is an n by n diagonal matrix (Wij = 0 for i ≠ j) with diagonal elements

Wii = γ⁻¹(β[t]μ + xi,aβ[t]a + xi,dβ[t]d)(1 − γ⁻¹(β[t]μ + xi,aβ[t]a + xi,dβ[t]d))

  • Note an alternative way of writing these equations:

β[t+1] = β[t] + [xᵀWx]⁻¹ xᵀ(y − γ⁻¹(xβ[t]))

  • For step (3), we decide when to stop the algorithm using the “deviance criterion”: β[t+1] ≈ β[t] when ΔD = |D[t+1] − D[t]| is small, e.g. ΔD < 10⁻⁶, where

D = 2 Σi=1..n [ yi ln( yi / γ⁻¹(β[t or t+1]μ + xi,aβ[t or t+1]a + xi,dβ[t or t+1]d) ) + (1 − yi) ln( (1 − yi) / (1 − γ⁻¹(β[t or t+1]μ + xi,aβ[t or t+1]a + xi,dβ[t or t+1]d)) ) ]
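The three steps above can be sketched as a pure-Python IRLS using the "alternative way" (Newton-style) update. This is a minimal sketch, not the course's reference implementation: the helper names and toy data are mine, and for simplicity it stops when the step size is tiny rather than using the deviance criterion:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve3(a, b):
    """Solve the 3x3 linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [m[r][k] - f * m[col][k] for k in range(4)]
    return [m[i][3] / m[i][i] for i in range(3)]

def irls(x, y, tol=1e-8, max_iter=100):
    """IRLS for beta = (beta_mu, beta_a, beta_d); rows of x are [1, x_a, x_d].
    Update: beta[t+1] = beta[t] + (x^T W x)^-1 x^T (y - gamma^-1(x beta[t])),
    with W diagonal, W_ii = p_i (1 - p_i)."""
    n = len(y)
    beta = [0.0, 0.0, 0.0]                      # step 1: starting values
    for _ in range(max_iter):
        p = [logistic(sum(xij * bj for xij, bj in zip(xi, beta))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # step 2: build x^T W x and x^T (y - p), then take one update step
        xtwx = [[sum(w[i] * x[i][r] * x[i][c] for i in range(n))
                 for c in range(3)] for r in range(3)]
        xtr = [sum(x[i][r] * (y[i] - p[i]) for i in range(n)) for r in range(3)]
        step = solve3(xtwx, xtr)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:     # step 3: convergence check
            break
    return beta

# Toy (hypothetical) data: 5 A1A1, 6 A1A2, 5 A2A2 individuals with mixed outcomes
x = [[1, -1, -1]] * 5 + [[1, 0, 1]] * 6 + [[1, 1, -1]] * 5
y = [0, 0, 0, 0, 1] + [0, 0, 0, 1, 1, 1] + [0, 1, 1, 1, 1]
beta_hat = irls(x, y)
```

With three parameters and three distinct genotype classes, the fitted probabilities should match the observed class frequencies (1/5, 3/6, 4/5), which gives a quick sanity check on the implementation.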

SLIDE 30

Hypothesis testing: LRT I

  • Recall that our null and alternative hypotheses are:

H0 : βa = 0 ∩ βd = 0

HA : βa ≠ 0 ∪ βd ≠ 0

  • We will use the LRT for the null (0) and alternative (1):

LRT = −2lnΛ = −2ln[ L(θ̂0|y) / L(θ̂1|y) ]

LRT = −2lnΛ = 2l(θ̂1|y) − 2l(θ̂0|y)

  • Under the null, this LRT is (approximately!) a chi-square distribution with 2 degrees of freedom (d.f.) or more accurately: as n → ∞, LRT → χ²df (we do not have an exact distribution, only this asymptotic result, with d.f. determined by the difference in the number of free parameters between the alternative and null, here d.f. = 2)

SLIDE 31

Performing a GWAS

  • Now we have all the critical components for performing a GWAS with a case / control phenotype!
  • The procedure (and goals!) are the same as before: for a sample of n individuals, where for each we have measured a case / control phenotype and N genotypes, we perform N hypothesis tests
  • To perform these hypothesis tests, we need to run our IRLS algorithm for EACH marker to get the MLE of the parameters under the alternative (= no restrictions on the beta’s!) and use these to calculate our LRT test statistic for each marker
  • We then use these N LRT statistics to calculate N p-values by using a chi-square distribution (how do we do this in R?)
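In R this is pchisq(LRT, df = 2, lower.tail = FALSE). As a cross-check, for 2 d.f. the chi-square survival function has the closed form e^(−x/2), so a minimal Python sketch (function name is mine) is:

```python
import math

def lrt_pvalue_df2(lrt):
    """P-value for an LRT statistic against a chi-square null with 2 d.f.
    For df = 2, P(chi2_2 > x) = e^(-x/2) exactly."""
    return math.exp(-lrt / 2.0)

# LRT = 2*l(theta_hat_1 | y) - 2*l(theta_hat_0 | y) from the fitted models
print(lrt_pvalue_df2(5.99))   # ~0.05, the usual chi-square(2) critical value
```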

SLIDE 32

Logistic regression: looking forward

  • Don’t get overly concerned with the (apparent) complexity of a logistic regression (!!) - we will review this next lecture (computer lab this week!)
  • Remember that it is just like a linear regression, but instead of a “line” we are using a regression function that has a different shape, and instead of a normal error we are using an error that takes one of two states (Bernoulli error!)
  • Otherwise, the rest is the same: the X’s are coded the same, we will estimate parameters using the same approach (MLE!) and we will perform hypothesis testing using the same approach (Likelihood Ratio Test)

SLIDE 33

That’s it for today

  • Next lecture: we will continue our discussion of logistic regression!