
BTRY 4830/6830: Quantitative Genomics and Genetics, Lecture 17: Logistic Regression III and Haplotypes



  1. BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 17: Logistic Regression III and Haplotypes Jason Mezey jgm45@cornell.edu Oct. 23, 2014 (Th) 8:40-9:55

  2. Announcements • MY OFFICE HOURS TODAY ARE CANCELLED (This week only!!) • Homework #5 will be available tomorrow (Fri.) and will be due this coming Thurs. (11:59PM)

  3. Summary of lecture 17 • In the previous two lectures, we expanded our GWAS analysis abilities by introducing the logistic regression model, which allows us to analyze case / control data • Today, we will complete our discussion of logistic regression by discussing (in brief) Generalized Linear Models (GLMs), the broader class of models that includes both linear and logistic regression • We will also provide an (extremely) brief introduction to the concept of haplotypes and haplotype testing in a GWAS framework

  4. Conceptual Overview [diagram: a genetic system and a sample or experimental population of measured individuals (genotypes, phenotypes) feed into a regression model for Pr(Y|X); the model parameters are estimated and a hypothesis test (reject / do not reject, e.g. an F-test) addresses whether A1 -> A2 affects Y]

  5. Review: linear vs. logistic • Recall that for a linear regression, the regression function was a line and the error term accounted for the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution • For a logistic regression, we use the logistic function and the error term makes up the difference to either 0 or 1 [plots: Y versus Xa for the linear and the logistic regression functions]

  6. Review: the expected value for logistic regression • Recall that an individual with phenotype Yi is described by the following equation: $Y_i = E(Y_i | X_i) + \epsilon_i$, i.e. $Y_i = \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d) + \epsilon_i$, i.e. $Y_i = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}} + \epsilon_i$ • To understand how an individual with a phenotype Yi and a genotype Xi breaks down in this equation, we need to consider the expected (predicted!) part and the error term
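As a concrete illustration, here is a minimal Python/NumPy sketch of the inverse logistic link $\gamma^{-1}(\cdot)$ and the expected (predicted) part of this equation; the names inv_logit and expected_phenotype are illustrative choices, not part of the lecture materials.

```python
import numpy as np

def inv_logit(z):
    # Inverse of the logit link: gamma^{-1}(z) = e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

def expected_phenotype(beta, x_a, x_d):
    # E(Y_i | X_i) for one individual, with beta = (beta_mu, beta_a, beta_d)
    # and x_a, x_d the additive and dominance genotype codings
    beta_mu, beta_a, beta_d = beta
    return inv_logit(beta_mu + x_a * beta_a + x_d * beta_d)
```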

  7. Review: Inference in logistic regression • Recall that our goal with using logistic regression was to model the probability distribution of a case / control phenotype when there is a causal polymorphism • To use this for a GWAS, we need to test the null hypothesis that a genotype is not a causal polymorphism (or more accurately that the genetic marker we are testing is not in LD with a causal polymorphism!): under the null, $\beta_\mu = c$, $\beta_a = 0$, $\beta_d = 0$, i.e. $H_0: \beta_a = 0 \cap \beta_d = 0$ • To assess this null hypothesis, we will use the same approach as in linear regression, i.e. we will construct a LRT = likelihood ratio test (recall that an F-test is an LRT and although we will not construct an F-test for logistic regression hypothesis testing, we will construct an LRT!)

  8. Review: logistic inference for GWAS • Our goal is to do a hypothesis test for each marker with the phenotype individually • We therefore have to construct a LRT and calculate a p-value for every marker • To calculate the LRT (and p-value) we need to calculate the MLE of our regression parameters (which we will do with an algorithm!) • We therefore need equations for: 1. the algorithm, 2. the LRT

  9. Review: logistic MLE • Recall that an MLE is simply a statistic (a function that takes the sample as an input and outputs the estimate of the parameters)! • In this case, we want to construct the following MLE: $MLE(\hat{\beta}) = MLE(\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d)$ • To do this, we need to maximize the log-likelihood function for the logistic regression, which has the following form (sample size n): $l(\beta) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)) + (1 - y_i) \ln(1 - \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)) \right]$ • Unlike the case of linear regression, where we had a "closed-form" equation that allows us to plug in the Y's and X's and returns the beta values that maximize the log-likelihood, there is no such simple equation for a logistic regression • We will therefore need an algorithm to calculate the MLE
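A sketch of this log-likelihood, continuing the NumPy snippet above; it assumes the design matrix X has columns (1, x_{i,a}, x_{i,d}) and y holds the 0/1 phenotypes (these variable names are assumptions, not from the slides).

```python
def log_likelihood(beta, X, y):
    # l(beta) = sum_i [ y_i ln(gamma^{-1}(x_i beta)) + (1 - y_i) ln(1 - gamma^{-1}(x_i beta)) ]
    p = inv_logit(X @ beta)   # gamma^{-1}(beta_mu + x_{i,a} beta_a + x_{i,d} beta_d) for each i
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```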

  10. Review: IRLS algorithm I • algorithm - a sequence of instructions for taking an input and producing an output • For logistic regression (and GLMs in general!) we will construct an Iterative Re-weighted Least Squares (IRLS) algorithm, which has the following structure: 1. Choose starting values for the $\beta$'s. Since we have a vector of three $\beta$'s in our case, we assign these numbers and call the resulting vector $\beta^{[0]}$. 2. Using the re-weighting equation (described on the next slide), update the $\beta^{[t]}$ vector. 3. At each step t > 0 check if $\beta^{[t+1]} \approx \beta^{[t]}$ (i.e. if these are approximately equal) using an appropriate function. If the value is below a defined threshold, stop. If not, repeat steps 2, 3.
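One way to translate these three steps into code is the loop below; it is only a sketch, and it relies on a re-weighting update and a deviance-based stopping rule that are filled in with the later slides (irls_update, deviance, tol, and max_iter are illustrative names, not from the lecture).

```python
def irls(X, y, tol=1e-6, max_iter=100):
    beta = np.zeros(X.shape[1])          # step 1: starting values beta^[0]
    D_old = deviance(beta, X, y)
    for _ in range(max_iter):
        beta = irls_update(beta, X, y)   # step 2: re-weighting update
        D_new = deviance(beta, X, y)
        if abs(D_new - D_old) < tol:     # step 3: stop when the change in deviance is small
            break
        D_old = D_new
    return beta
```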

  11. Step 1: IRLS algorithm • These are simply values of the vector that we assign (!!) • In one sense, these can be anything we want (!!) although for algorithms in general there are usually some restrictions and / or certain starting values that are "better" than others in the sense that the algorithm will converge faster, find a more "optimal" solution, etc. • In our case, we can assign our starting values as follows: $\beta^{[0]} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$

  12. Step 2: IRLS algorithm • At step 2, we will update (= produce a new value of the vector) using the following equation (then do this again and again until we stop!): $\beta^{[t+1]} = \beta^{[t]} + [\mathbf{x}^T \mathbf{W} \mathbf{x}]^{-1} \mathbf{x}^T (\mathbf{y} - \gamma^{-1}(\mathbf{x}\beta^{[t]}))$ where $\gamma^{-1}(\mathbf{x}\beta^{[t]}) = \frac{e^{\mathbf{x}\beta^{[t]}}}{1 + e^{\mathbf{x}\beta^{[t]}}}$, i.e. for each individual $\gamma^{-1}(\beta^{[t]}_\mu + x_{i,a}\beta^{[t]}_a + x_{i,d}\beta^{[t]}_d) = \frac{e^{\beta^{[t]}_\mu + x_{i,a}\beta^{[t]}_a + x_{i,d}\beta^{[t]}_d}}{1 + e^{\beta^{[t]}_\mu + x_{i,a}\beta^{[t]}_a + x_{i,d}\beta^{[t]}_d}}$, with $\mathbf{x} = \begin{bmatrix} 1 & x_{1,a} & x_{1,d} \\ 1 & x_{2,a} & x_{2,d} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,a} & x_{n,d} \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$, $\beta^{[t]} = \begin{bmatrix} \beta^{[t]}_\mu \\ \beta^{[t]}_a \\ \beta^{[t]}_d \end{bmatrix}$, and $\mathbf{W}$ an n by n diagonal matrix ($W_{ij} = 0$ for $i \neq j$) with diagonal elements $W_{ii} = \gamma^{-1}(\beta^{[t]}_\mu + x_{i,a}\beta^{[t]}_a + x_{i,d}\beta^{[t]}_d)(1 - \gamma^{-1}(\beta^{[t]}_\mu + x_{i,a}\beta^{[t]}_a + x_{i,d}\beta^{[t]}_d))$
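The update equation above maps to a few lines of NumPy; this is a sketch of a single re-weighting step under the same definitions of x, y, and W, continuing the snippets above.

```python
def irls_update(beta, X, y):
    # beta^[t+1] = beta^[t] + (x^T W x)^{-1} x^T (y - gamma^{-1}(x beta^[t]))
    p = inv_logit(X @ beta)        # gamma^{-1}(x beta^[t]) for every individual
    W = np.diag(p * (1.0 - p))     # n x n diagonal weight matrix, W_ij = 0 for i != j
    return beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
```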

  13. Step 3: IRLS algorithm • At step 3, we "check" to see if we should stop the algorithm and, if we decide not to stop, we go back to step 2 • If we decide to stop, we will assume the final values of the vector are the MLE (it may not be exactly the true MLE, but we will assume that it is close if we do not stop the algorithm too early!), e.g. when $\beta^{[t+1]} \approx \beta^{[t]}$ • There are many stopping rules; using the change in the Deviance is one way to construct a rule (note the issue with ln(0)!!): stop when $\Delta D = |D^{[t+1]} - D^{[t]}|$ is small, e.g. $\Delta D < 10^{-6}$, where $D = 2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{y_i}{\gamma^{-1}(\beta^{[t] \text{ or } [t+1]}_\mu + x_{i,a}\beta^{[t] \text{ or } [t+1]}_a + x_{i,d}\beta^{[t] \text{ or } [t+1]}_d)}\right) + (1 - y_i) \ln\left(\frac{1 - y_i}{1 - \gamma^{-1}(\beta^{[t] \text{ or } [t+1]}_\mu + x_{i,a}\beta^{[t] \text{ or } [t+1]}_a + x_{i,d}\beta^{[t] \text{ or } [t+1]}_d)}\right) \right]$ and $\gamma^{-1}(\cdot)$ is the logistic function as before
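A sketch of this deviance; because the phenotypes are 0/1, the terms with $y_i = 0$ or $1 - y_i = 0$ are taken to contribute nothing, which sidesteps the ln(0) issue flagged on the slide. This is the deviance() used by the irls() loop sketched earlier.

```python
def deviance(beta, X, y):
    # D = 2 * sum_i [ y_i ln(y_i / p_i) + (1 - y_i) ln((1 - y_i) / (1 - p_i)) ]
    # For 0/1 phenotypes this reduces to -ln(p_i) for cases and -ln(1 - p_i) for controls
    p = inv_logit(X @ beta)
    cases, controls = (y == 1), (y == 0)
    return 2.0 * (np.sum(-np.log(p[cases])) + np.sum(-np.log(1.0 - p[controls])))
```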

  14. Review: Logistic hypothesis testing I • Recall that our null and alternative hypotheses are: $H_0: \beta_a = 0 \cap \beta_d = 0$ and $H_A: \beta_a \neq 0 \cup \beta_d \neq 0$ • We will use the LRT for the null (0) and alternative (1): $LRT = -2\ln\Lambda = -2\ln\frac{L(\hat{\theta}_0|\mathbf{y})}{L(\hat{\theta}_1|\mathbf{y})} = 2 l(\hat{\theta}_1|\mathbf{y}) - 2 l(\hat{\theta}_0|\mathbf{y})$ • For our case, we need the following: $l(\hat{\theta}_1|\mathbf{y}) = l(\hat{\beta}_\mu, \hat{\beta}_a, \hat{\beta}_d|\mathbf{y})$ and $l(\hat{\theta}_0|\mathbf{y}) = l(\hat{\beta}_\mu, 0, 0|\mathbf{y})$
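Continuing the sketch, the LRT and a p-value can be computed from the two log-likelihoods; comparing the statistic to a chi-square with 2 degrees of freedom (one for each constrained parameter) is the standard asymptotic choice, though this slide does not spell out the reference distribution.

```python
from scipy import stats

def lrt_pvalue(beta_hat_1, beta_hat_0, X, y):
    # LRT = -2 ln Lambda = 2 l(theta_hat_1 | y) - 2 l(theta_hat_0 | y)
    lrt = 2.0 * (log_likelihood(beta_hat_1, X, y) - log_likelihood(beta_hat_0, X, y))
    return lrt, stats.chi2.sf(lrt, df=2)   # upper-tail chi-square probability = p-value
```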

  15. Review: Logistic hypothesis testing II • For the alternative, we use the MLE estimates of our logistic regression parameters that we get from our IRLS algorithm and plug these into the log-likelihood equation: $l(\hat{\theta}_1|\mathbf{y}) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d)) + (1 - y_i) \ln(1 - \gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d)) \right]$ where $\gamma^{-1}(\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d) = \frac{e^{\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d}}{1 + e^{\hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d}}$ • For the null, we plug the following parameter estimates into this same equation: $l(\hat{\theta}_0|\mathbf{y}) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat{\beta}_{\mu,0} + x_{i,a} \cdot 0 + x_{i,d} \cdot 0)) + (1 - y_i) \ln(1 - \gamma^{-1}(\hat{\beta}_{\mu,0} + x_{i,a} \cdot 0 + x_{i,d} \cdot 0)) \right]$ • where we use the same IRLS algorithm to provide the estimate of $\hat{\beta}_{\mu,0}$ by running the algorithm EXACTLY the same EXCEPT we set $\hat{\beta}_a = 0$, $\hat{\beta}_d = 0$ and we do not update these!
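A sketch of this null fit; holding $\beta_a$ and $\beta_d$ at zero and updating only $\beta_\mu$ amounts to running the same IRLS updates on the intercept column alone (irls_null is an illustrative name, and the equivalence to an intercept-only fit is an observation, not something stated on the slide).

```python
def irls_null(X, y, tol=1e-6, max_iter=100):
    # Same IRLS, but beta_a = beta_d = 0 are fixed and never updated,
    # so only the intercept column enters the re-weighting updates
    beta_mu_0 = irls(X[:, [0]], y, tol=tol, max_iter=max_iter)
    return np.array([beta_mu_0[0], 0.0, 0.0])   # theta_hat_0 = (beta_mu_0_hat, 0, 0)
```

Putting the pieces together for one marker in a GWAS: fit beta_hat_1 = irls(X, y) and beta_hat_0 = irls_null(X, y), then compute lrt, p = lrt_pvalue(beta_hat_1, beta_hat_0, X, y), and repeat for every marker.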
