

SLIDE 1

BTRY 4830/6830: Quantitative Genomics and Genetics

Lecture17: Logistic Regression III and Haplotypes

Jason Mezey jgm45@cornell.edu

  • Oct. 23, 2014 (Th) 8:40-9:55
SLIDE 2

Announcements

  • MY OFFICE HOURS TODAY ARE CANCELLED (This week only!!)

  • Homework #5 will be available tomorrow (Fri.) and will be due this coming Thurs. (11:59PM)

SLIDE 3

Summary of lecture 17

  • In the previous two lectures, we expanded our GWAS analysis abilities by introducing the logistic regression model, which allows us to analyze case / control data

  • Today, we will complete our discussion of logistic regression by discussing (in brief) Generalized Linear Models (GLMs), the broader class of models that includes linear and logistic regression

  • We will also provide an (extremely) brief introduction to the concept of haplotypes and haplotype testing in a GWAS framework

SLIDE 4

Conceptual Overview

  [Flow diagram: Genetic System ("Does A1 → A2 affect Y?") → Sample or experimental pop → Measured individuals (genotype, phenotype) → Regression model Pr(Y|X) → Model params → F-test → Reject / DNR]

SLIDE 5

Review: linear vs. logistic

  • Recall that for a linear regression, the regression function was a line, and the error term accounted for the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution

  • For a logistic regression, we use the logistic function, and the error term makes up the difference such that Y takes the value 0 or 1

  [Figure: two panels, linear regression (Y vs. Xa, regression line) and logistic regression (Y vs. Xa, logistic curve)]

SLIDE 6

Review: the expected value for logistic regression

  • Recall that an individual with phenotype Yi is described by the following equation:

  • To understand how an individual with a phenotype Yi and a genotype Xi breaks down in this equation, we need to consider the expected (predicted!) part and the error term

  Y_i = E(Y_i|X_i) + \epsilon_i

  Y_i = \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d) + \epsilon_i = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}} + \epsilon_i

SLIDE 7

Review: Inference in logistic regression

  • Recall that our goal with using logistic regression was to model the probability distribution of a case / control phenotype when there is a causal polymorphism

  • To use this for a GWAS, we need to test the null hypothesis that a genotype is not a causal polymorphism (or more accurately, that the genetic marker we are testing is not in LD with a causal polymorphism!):

  H_0: \beta_a = 0 \cap \beta_d = 0

  • To assess this null hypothesis, we will use the same approach as in linear regression, i.e. we will construct a likelihood ratio test (LRT) (recall that an F-test is an LRT, and although we will not construct an F-test for logistic regression hypothesis testing, we will construct an LRT!)

SLIDE 8

Review: logistic inference for GWAS

  • Our goal is to do a hypothesis test for each marker with the phenotype individually

  • We therefore have to construct an LRT and calculate a p-value for every marker

  • To calculate the LRT (and p-value) we need to calculate the MLE of our regression parameters (which we will do with an algorithm!)

  • We therefore need equations for: 1. the algorithm, 2. the LRT

SLIDE 9

Review: logistic MLE

  • Recall that an MLE is simply a statistic (a function that takes the sample as an input and outputs the estimate of the parameters)!

  • In this case, we want to construct the following MLE:

  MLE(\hat{\beta}) = MLE(\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d)

  • To do this, we need to maximize the log-likelihood function for the logistic regression, which has the following form (sample size n):

  l(\beta) = \sum_{i=1}^{n} \left[ y_i \ln\left(\gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)\right) + (1 - y_i)\ln\left(1 - \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)\right) \right]

  • Unlike the case of linear regression, where we had a “closed-form” equation that allows us to plug in the Y’s and X’s and returns the beta values that maximize the log-likelihood, there is no such simple equation for a logistic regression

  • We will therefore need an algorithm to calculate the MLE
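Since the following slides turn this log-likelihood into an algorithm, it can help to see it written as code. Below is a minimal Python sketch (Python is a choice made here, not part of the lecture; the function names are illustrative):

```python
import numpy as np

def logistic(z):
    # inverse of the logit link: gamma^{-1}(z) = e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta_mu, beta_a, beta_d, xa, xd, y):
    # l(beta) = sum_i [ y_i ln(gamma^{-1}(.)) + (1 - y_i) ln(1 - gamma^{-1}(.)) ]
    p = logistic(beta_mu + xa * beta_a + xd * beta_d)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

With all three betas at zero, every individual has p_i = 0.5, so the log-likelihood is n ln(0.5) regardless of the data, which makes a handy sanity check for an implementation.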

SLIDE 10

Review: IRLS algorithm I

  • algorithm - a sequence of instructions for taking an input and producing an output

  • For logistic regression (and GLMs in general!) we will construct an Iterative Re-weighted Least Squares (IRLS) algorithm, which has the following structure:

  • 1. Choose starting values for the β’s. Since we have a vector of three β’s in our case, we assign these numbers and call the resulting vector β[0].

  • 2. Using the re-weighting equation (described next slide), update the β[t] vector.

  • 3. At each step t > 0, check whether β[t+1] ≈ β[t] (i.e. whether these are approximately equal) using an appropriate function. If the value is below a defined threshold, stop. If not, repeat steps 2, 3.

SLIDE 11

Step 1: IRLS algorithm

  • These are simply values of the vector that we assign (!!)

  • In one sense, these can be anything we want (!!) although for algorithms in general there are usually some restrictions and / or certain starting values that are “better” than others, in the sense that the algorithm will converge faster, find a more “optimal” solution, etc.

  • In our case, we can assign our starting values as follows (e.g. all zeros):

  \beta^{[0]} = \begin{bmatrix} \beta_\mu^{[0]} \\ \beta_a^{[0]} \\ \beta_d^{[0]} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

SLIDE 12

Step 2: IRLS algorithm

  • At step 2, we will update β[t] (= produce a new value of the vector) using the following equation (then do this again and again until we stop!):

  \beta^{[t+1]} = \beta^{[t]} + \left[\mathbf{x}^T \mathbf{W} \mathbf{x}\right]^{-1} \mathbf{x}^T \left(\mathbf{y} - \gamma^{-1}(\mathbf{x}\beta^{[t]})\right)

  where

  \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \beta^{[t]} = \begin{bmatrix} \beta_\mu^{[t]} \\ \beta_a^{[t]} \\ \beta_d^{[t]} \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 1 & x_{1,a} & x_{1,d} \\ 1 & x_{2,a} & x_{2,d} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,a} & x_{n,d} \end{bmatrix}

  \gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}) = \frac{e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}{1 + e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}, \quad \gamma^{-1}(\mathbf{x}\beta^{[t]}) = \frac{e^{\mathbf{x}\beta^{[t]}}}{1 + e^{\mathbf{x}\beta^{[t]}}}

  and the matrix W is an n by n diagonal matrix (W_{ij} = 0 for i ≠ j) with diagonal elements

  W_{ii} = \gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]})\left(1 - \gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]})\right)
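Written out in code, one pass of this re-weighting update is only a few lines. A Python sketch (illustrative, assuming the design matrix x has rows [1, x_{i,a}, x_{i,d}]):

```python
import numpy as np

def irls_update(beta, x, y):
    # One IRLS step: beta[t+1] = beta[t] + (x^T W x)^{-1} x^T (y - gamma^{-1}(x beta[t]))
    p = np.exp(x @ beta) / (1.0 + np.exp(x @ beta))   # gamma^{-1}(x beta[t]) per individual
    W = np.diag(p * (1.0 - p))                        # n x n diagonal weight matrix W_ii
    return beta + np.linalg.solve(x.T @ W @ x, x.T @ (y - p))
```

Using `np.linalg.solve` rather than explicitly inverting x^T W x is the standard numerically safer way to apply the [x^T W x]^{-1} term.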

SLIDE 13

Step 3: IRLS algorithm

  • At step 3, we “check” to see if we should stop the algorithm and, if we decide not to stop, we go back to step 2

  • If we decide to stop, we will assume the final values of the vector are the MLE (it may not be exactly the true MLE, but we will assume that it is close if we do not stop the algorithm too early!), i.e. \hat{\beta} \approx \beta^{[t+1]}

  • There are many stopping rules; using the change in the deviance D is one way to construct a rule (note the issue with ln(0)!!):

  \Delta D = |D^{[t+1]} - D^{[t]}|

  where we stop when \Delta D is small, e.g. \Delta D < 10^{-6}, at which point \beta^{[t+1]} \approx \beta^{[t]}, with

  D = 2\sum_{i=1}^{n} \left[ y_i \ln\left(\frac{y_i}{\gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]})}\right) + (1 - y_i)\ln\left(\frac{1 - y_i}{1 - \gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]})}\right) \right]

  (evaluated at \beta^{[t]} or \beta^{[t+1]} as appropriate, where \gamma^{-1} is the logistic function defined on the previous slide)
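One way to handle the ln(0) issue flagged above is the standard convention 0 · ln(0) = 0, which applies here because every y_i is exactly 0 or 1. A Python sketch of the deviance with that convention (illustrative names, not from the lecture):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def deviance(beta, x, y):
    # D = 2 * sum_i [ y_i ln(y_i / p_i) + (1 - y_i) ln((1 - y_i) / (1 - p_i)) ]
    # using the convention 0 * ln(0) = 0, which sidesteps ln(0) when y_i is 0 or 1
    p = logistic(x @ beta)
    with np.errstate(divide="ignore", invalid="ignore"):
        term1 = np.where(y > 0, y * np.log(y / p), 0.0)
        term0 = np.where(y < 1, (1 - y) * np.log((1 - y) / (1 - p)), 0.0)
    return 2.0 * np.sum(term1 + term0)
```

The stopping rule is then to iterate until ΔD = |D[t+1] − D[t]| falls below a small threshold such as 1e-6.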

SLIDE 14

Review: Logistic hypothesis testing I

  • Recall that our null and alternative hypotheses are:

  H_0: \beta_a = 0 \cap \beta_d = 0 \qquad H_A: \beta_a \neq 0 \cup \beta_d \neq 0

  • We will use the LRT for the null (0) and alternative (1):

  LRT = -2\ln\Lambda = -2\ln\frac{L(\hat{\theta}_0|\mathbf{y})}{L(\hat{\theta}_1|\mathbf{y})} = 2l(\hat{\theta}_1|\mathbf{y}) - 2l(\hat{\theta}_0|\mathbf{y})

  • For our case, we need the following:

  l(\hat{\theta}_1|\mathbf{y}) = l(\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d|\mathbf{y}) \qquad l(\hat{\theta}_0|\mathbf{y}) = l(\hat{\beta}_\mu, 0, 0|\mathbf{y})

SLIDE 15
Review: Logistic hypothesis testing II

  • For the alternative, we use the MLE estimates of our logistic regression parameters we get from our IRLS algorithm and plug these into the log-likelihood equation:

  l(\hat{\theta}_1|\mathbf{y}) = \sum_{i=1}^{n} \left[ y_i \ln\left(\gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d)\right) + (1 - y_i)\ln\left(1 - \gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d)\right) \right]

  • For the null, we plug in the following parameter estimates into this same equation:

  l(\hat{\theta}_0|\mathbf{y}) = \sum_{i=1}^{n} \left[ y_i \ln\left(\gamma^{-1}(\hat{\beta}_{\mu,0} + x_{i,a} \cdot 0 + x_{i,d} \cdot 0)\right) + (1 - y_i)\ln\left(1 - \gamma^{-1}(\hat{\beta}_{\mu,0} + x_{i,a} \cdot 0 + x_{i,d} \cdot 0)\right) \right]

  • where we use the same IRLS algorithm to provide the estimate of \hat{\beta}_{\mu,0} by running the algorithm EXACTLY the same EXCEPT we set \hat{\beta}_a = 0, \hat{\beta}_d = 0 and we do not update these!

  and where \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d) = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}

SLIDE 16

Logistic hypothesis testing III

  • To calculate our p-value, we need to know the distribution of our LRT statistic under the null hypothesis

  • There is no simple form for this distribution for any given n (contrast with F-statistics!!), but we know that as n goes to infinity, the distribution of the LRT under the null approaches a chi-square distribution:

  LRT = -2\ln\Lambda = 2l(\hat{\theta}_1|\mathbf{y}) - 2l(\hat{\theta}_0|\mathbf{y})

  • What’s more, it is a reasonably good assumption that under our (not all!!) null, this LRT is (approximately!) a chi-square distribution with 2 degrees of freedom (d.f.), assuming n is not too small!
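A handy fact for this particular case: a chi-square distribution with 2 d.f. has survival function e^{-x/2}, so the asymptotic p-value has a closed form and needs no special library. A Python sketch (illustrative names, not from the lecture):

```python
import math

def lrt_statistic(loglik_alt, loglik_null):
    # LRT = -2 ln(Lambda) = 2 l(theta_hat_1 | y) - 2 l(theta_hat_0 | y)
    return 2.0 * loglik_alt - 2.0 * loglik_null

def lrt_pvalue_2df(lrt):
    # Pr(chi-square_2 > lrt) = e^{-lrt/2}; this shortcut is valid ONLY for 2 d.f.
    return math.exp(-lrt / 2.0)
```

For other degrees of freedom there is no such simple expression, and a chi-square CDF routine is needed instead.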

SLIDE 17

Review: logistic GWAS

  • Now we have all the critical components for performing a GWAS with a case / control phenotype!

  • The procedure (and goals!) are the same as before: for a sample of n individuals, where for each we have measured a case / control phenotype and N genotypes, we perform N hypothesis tests

  • To perform these hypothesis tests, we need to run our IRLS algorithm for EACH marker to get the MLE of the parameters under the alternative (= no restrictions on the beta’s!) and use these to calculate our LRT test statistic for each marker

  • We then use these N LRT statistics to calculate N p-values by using a chi-square distribution (how do we do this in R?)
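Putting the pieces together, a toy end-to-end version of this procedure might look as follows. This is a Python sketch under simplifying assumptions (a fixed IRLS iteration count rather than a deviance-based stopping rule, and the closed-form fact that the intercept-only logistic MLE gives fitted probability mean(y) for every individual), not the lecture's own code:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(p, y):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_irls(x, y, n_iter=25):
    # IRLS for the logistic MLE (fixed iteration count for simplicity)
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        p = logistic(x @ beta)
        W = np.diag(p * (1.0 - p))
        beta = beta + np.linalg.solve(x.T @ W @ x, x.T @ (y - p))
    return beta

def gwas_logistic(y, Xa, Xd):
    # Xa, Xd: n x N matrices of Xa / Xd genotype codings; returns one p-value per marker
    n, N = Xa.shape
    ll_null = log_lik(np.full(n, y.mean()), y)  # intercept-only fit: p_i = mean(y)
    pvals = np.empty(N)
    for j in range(N):
        x = np.column_stack([np.ones(n), Xa[:, j], Xd[:, j]])
        beta = fit_irls(x, y)
        lrt = 2.0 * (log_lik(logistic(x @ beta), y) - ll_null)
        pvals[j] = np.exp(-lrt / 2.0)  # chi-square (2 d.f.) survival function
    return pvals
```

In R, the corresponding p-value step is `pchisq(lrt, df = 2, lower.tail = FALSE)`.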

SLIDE 18

Introduction to Generalized Linear Models (GLMs) I

  • We have introduced linear and logistic regression models for GWAS analysis because these are the most versatile framework for performing a GWAS (there are many less versatile alternatives!)

  • These two models can handle our genetic coding (in fact, any genetic coding) where we have discrete categories (although they can also handle an X that can take on a continuous set of values!)

  • They can also handle (the sampling distribution of) phenotypes that have normal (linear) and Bernoulli (logistic) error

  • How about phenotypes with different error (sampling) distributions? Linear and logistic regression models are members of a broader class called Generalized Linear Models (GLMs), where other models in this class can handle additional phenotypes (error distributions)

SLIDE 19

Introduction to Generalized Linear Models (GLMs) II

  • To introduce GLMs, we will introduce the overall structure first, and second describe how linear and logistic models fit into this framework

  • There is some variation in presenting the properties of a GLM, but we will present them using three (models that have these properties are considered GLMs):

  • 1. The probability distribution of the response variable Y conditional on the independent variable X is in the exponential family of distributions:

  \Pr(Y|X) \sim \text{exponential family}

  • 2. A link function \gamma relating the independent variables and parameters to the expected value of the response variable (where we often use the inverse!!):

  \gamma: E(Y|X) \rightarrow X\beta, \qquad \gamma(E(Y|X)) = X\beta, \qquad E(Y|X) = \gamma^{-1}(X\beta)

  • 3. The error random variable \epsilon has a variance which is a function of ONLY X\beta:

  Var(\epsilon) = f(X\beta)

SLIDE 20

Exponential family

  • The exponential family includes a broad set of probability distributions that can be expressed in the following form:

  \Pr(Y) \sim e^{\frac{Y\theta - b(\theta)}{\phi} + c(Y, \phi)}

  • As an example, for the normal distribution, we have the following:

  \theta = \mu, \quad \phi = \sigma^2, \quad b(\theta) = \frac{\theta^2}{2}, \quad c(Y, \phi) = -\frac{1}{2}\left(\frac{Y^2}{\phi} + \log(2\pi\phi)\right)

  • Note that many continuous and discrete distributions are in this family (normal, binomial, Poisson, lognormal, multinomial, several categorical distributions, exponential, gamma distribution, beta distribution, chi-square) but not all (examples that are not!?), and since we can model response variables with these distributions, we can model phenotypes with these distributions in a GWAS using a GLM (!!)

  • Note that the normal distribution is in this family (linear) as is the Bernoulli, or more accurately the Binomial (logistic)
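To make the logistic case concrete, the Bernoulli distribution can be rewritten in this form as well (a worked check added here, not from the slide):

```latex
\Pr(Y) = p^{Y}(1-p)^{1-Y} = e^{\, Y \ln\frac{p}{1-p} + \ln(1-p)}
\quad\Rightarrow\quad
\theta = \ln\frac{p}{1-p}, \quad \phi = 1, \quad b(\theta) = \ln(1 + e^{\theta}) = -\ln(1-p), \quad c(Y,\phi) = 0
```

Note that \theta here is exactly the logit of p, which is one way to see why the logit is the “canonical link” for the Bernoulli / Binomial case.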

SLIDE 21

GLM link function

  • A “link” function \gamma is just a function (!!) that acts on the expected value of Y given X: \gamma(E(Y|X))

  • This function is defined in such a way that it has a useful form for a GLM, although there are some general restrictions on the form of this function; the most important is that it needs to be monotonic, such that we can define an inverse: for Y = f(X), the inverse is f^{-1}(Y) = X

  • For the logistic regression, we have selected the following link function, which is a logit function (a “canonical link”) whose inverse is the logistic function (but note that others are also used for binomial response variables):

  \gamma(E(Y|X)) = \ln\left(\frac{\frac{e^{X\beta}}{1 + e^{X\beta}}}{1 - \frac{e^{X\beta}}{1 + e^{X\beta}}}\right) = X\beta, \qquad E(Y|X) = \gamma^{-1}(X\beta) = \frac{e^{X\beta}}{1 + e^{X\beta}}

  • What is the link function for a normal distribution?
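A quick numeric check that the logit and the logistic function invert each other, as the monotonicity requirement above guarantees (a Python sketch, illustrative):

```python
import math

def logistic(z):
    # inverse link: e^z / (1 + e^z)
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    # link function: ln(p / (1 - p))
    return math.log(p / (1.0 - p))
```

For any z, logit(logistic(z)) recovers z (up to floating-point error), mirroring \gamma(\gamma^{-1}(X\beta)) = X\beta.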

SLIDE 22

GLM error function

  • The variance of the error term in a GLM must be a function of ONLY the independent variable and beta parameter vector:

  Var(\epsilon) = f(X\beta)

  • This is the case for a linear regression (note the variance of the error is constant!!):

  \epsilon \sim N(0, \sigma_\epsilon^2), \qquad Var(\epsilon) = f(X\beta) = \sigma_\epsilon^2

  • As an example, this is also the case for the logistic regression (note the error variance changes depending on the value of X!!):

  Var(\epsilon) = \gamma^{-1}(X\beta)\left(1 - \gamma^{-1}(X\beta)\right), \qquad Var(\epsilon_i) = \gamma^{-1}(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d)\left(1 - \gamma^{-1}(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d)\right)

SLIDE 23

Inference with GLMs

  • We perform inference in a GLM framework using the same approach, i.e. MLE of the beta parameters using an IRLS algorithm (just substitute the appropriate link function in the equations, etc.)

  • We can also perform a hypothesis test using an LRT (where the sampling distribution as the sample size goes to infinity is chi-square)

  • In short, what you have learned can be applied for most types of regression modeling you will likely need to apply (!!)

SLIDE 24

Introduction to haplotype testing I

  • We have just extended our GWAS framework to handle additional phenotypes

  • We can also extend our GWAS framework to handle genotypes defined using a different approach

  • In this case, let’s consider using haplotype alleles in our testing framework

  • Note that a haplotype collapses genetic marker information, but in some cases, testing using haplotypes is more effective than testing one genetic marker at a time

SLIDE 25

Introduction to haplotype testing II

  • Haplotype - a series of ordered, linked alleles that are inherited together

  • For the moment, let’s consider a haplotype to define a “function” that takes a set of alleles at several loci A, B, C, D, etc. and outputs a haplotype allele:

  h = f(A_i, B_j, ...)

  • For example, if these loci are each a SNP with the following alleles (A,G), (A,T), (G,C), (G,C), we could define the following haplotype alleles:

  h_1 = (A, A, C, C), \qquad h_2 = (G, T, G, G)
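The “function” view of a haplotype above can be sketched as a simple lookup table in Python (the code and names are illustrative; the labels h1 and h2 follow the slide):

```python
def haplotype_allele(alleles, table):
    # h = f(A_i, B_j, ...): map an ordered tuple of SNP alleles at linked loci
    # to a haplotype allele label (None if the combination is not defined)
    return table.get(tuple(alleles))

# The two haplotype alleles defined on the slide, for four SNPs with
# alleles (A,G), (A,T), (G,C), (G,C):
table = {("A", "A", "C", "C"): "h1", ("G", "T", "G", "G"): "h2"}
```

Any combination of marker alleles not listed in the table simply has no haplotype allele assigned.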

SLIDE 26

Introduction to haplotype testing III

  • Note that how we define haplotype alleles is somewhat arbitrary, but in general, we define a haplotype for a set of genetic markers (loci) that are physically linked and that frequently occur together in a population

  • How many markers is somewhat arbitrary, e.g. we often define sets that match observed patterns of LD

  • How many haplotype alleles we define is also somewhat arbitrary, where we define haplotype alleles that have appreciable frequency in the population

  • For example, for the four loci with alleles (A,G), (A,T), (G,C), (G,C), how many haplotype alleles could we define?

  • However, it could be that only the following two combinations have relatively “high” allele frequencies (say > 0.05 = arbitrary!):

  h_1 = (A, A, C, C), \qquad h_2 = (G, T, G, G)

  • In such a case, we can collapse the many alleles into just a few!

SLIDE 27

Introduction to haplotype testing IV

  • As an example of haplotype allele collapsing, say for our case of four loci (A,G), (A,T), (G,C), (G,C), we have lots of LD (!!) such that there are only 4 alleles in the population (i.e. all other combinations have frequency of zero!):

  h^*_1 = (A, A, C, C), \quad h^*_2 = (G, T, G, G), \quad h^*_3 = (A, A, G, C), \quad h^*_4 = (G, T, C, G)

  • Let’s also say that the frequencies of the third and fourth of these in the population are < 0.01

  • In this case, we can define just two haplotype alleles that collapse the other alleles as follows (where * means “any” genetic marker allele):

  h_1 = (A, A, *, C) = h^*_1 \cup h^*_3, \qquad h_2 = (G, T, *, G) = h^*_2 \cup h^*_4

  • NOTE: we are therefore losing information using this approach!! It is often the case that only a few haplotypes are at appreciable frequency.

SLIDE 28

GWAS with haplotypes I

  • Once we have defined haplotype alleles, we can proceed with a GWAS using our framework (just substitute haplotype alleles and genotypes for genetic marker alleles and genotypes!)

  • For example, in a case where we only have two haplotype alleles, we can code our independent variables for our regression model as follows:

  X_a(h_1h_1) = -1, \quad X_a(h_1h_2) = 0, \quad X_a(h_2h_2) = 1

  X_d(h_1h_1) = -1, \quad X_d(h_1h_2) = 1, \quad X_d(h_2h_2) = -1

  • All other aspects remain the same (although what is the effect on our interpretation of where the causal polymorphism is located?)
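This coding is mechanical enough to write down directly. A Python sketch (the function name and the string labels h1/h2 are illustrative):

```python
def code_haplotype_genotype(genotype):
    # genotype: an unordered pair of haplotype alleles, e.g. ("h1", "h2").
    # Returns (Xa, Xd) following the coding on the slide.
    n_h2 = list(genotype).count("h2")
    xa = n_h2 - 1                   # h1h1 -> -1, h1h2 -> 0, h2h2 -> 1
    xd = 1 if n_h2 == 1 else -1     # heterozygote -> 1, homozygotes -> -1
    return xa, xd
```

These (Xa, Xd) columns then feed straight into the logistic (or linear) regression exactly as single-marker codings do.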

SLIDE 29

That’s it for today

  • Next lecture: complete our discussion of haplotypes

  • We will also discuss covariates, including one of the most important (population structure)!

  • We will also discuss a “minimum checklist” that you should perform in your GWAS analysis (!!)