Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01
Lecture 14: Logistic regression I: GWAS for case / control phenotypes


SLIDE 1

Lecture 14: Logistic regression I: GWAS for case / control phenotypes

Jason Mezey jgm45@cornell.edu April 5, 2016 (T) 8:40-9:55

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01

SLIDE 2

Announcements

  • Your midterm will be returned next Tues.
  • Homework #6 (last homework!) will be available tomorrow
  • Project available April 14 (more details to come!)
  • Scheduling the final (take home - same format as midterm)
  • Option 1: Available Tues. May 10, due Fri. May 13 (= during first study period)
  • Option 2: During first week of exams May 16-19
  • I will send an email about these options - please email or talk to me about concerns / constraints ASAP(!!)
SLIDE 3

Summary of lecture 14

  • In previous lectures, we completed our introduction to how to analyze data for the “ideal” GWAS for phenotypes that can be modeled with a linear regression model
  • Going forward, we will continue to add layers, where today we will discuss how to analyze case / control phenotypes using a logistic regression model

SLIDE 4

Conceptual Overview

[Diagram: Genetic system (does A1 -> A2 affect Y?) -> sample or experimental population -> measured individuals (genotype, phenotype) -> Pr(Y|X) modeled with a regression model -> model parameters -> F-test -> reject / DNR]

SLIDE 5

Review: GWAS basics

  • In an “ideal” GWAS experiment, we measure the phenotype and N genotypes THROUGHOUT the genome for n independent individuals
  • To analyze a GWAS, we perform N independent hypothesis tests of the following form:

H0 : Cov(X, Y) = 0

  • When we reject the null hypothesis, we assume that, because of linkage disequilibrium, we have located a position in the genome that contains a causal polymorphism (not the causal polymorphism!)
  • This is as far as we can go with a GWAS (!!) such that (often) identifying the causal polymorphism requires additional data and/or follow-up experiments, i.e. GWAS is a starting point

SLIDE 6

Review: linear regression

  • So far, we have considered a linear regression as a reasonable model for the relationship between genotype and phenotype (where this implicitly assumes a normal error provides a reasonable approximation of the phenotype distribution given the genotype):

Y = βμ + Xaβa + Xdβd + ε

ε ∼ N(0, σ²ε)
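As a concrete illustration, here is a minimal Python sketch that simulates a phenotype under this linear model. The function name and the parameter values in the example are hypothetical choices for illustration, not values from the slides:

```python
import random

def simulate_linear_phenotype(x_a, x_d, beta_mu, beta_a, beta_d, sigma_eps):
    """Simulate Y = beta_mu + X_a*beta_a + X_d*beta_d + epsilon,
    with epsilon ~ N(0, sigma_eps^2)."""
    return beta_mu + x_a * beta_a + x_d * beta_d + random.gauss(0.0, sigma_eps)

# Hypothetical parameters; an A1A2 heterozygote is coded Xa = 0, Xd = 1,
# so E(Y|X) = 1.0 + 0 * 0.5 + 1 * 0.2 = 1.2 and simulated values scatter around it
random.seed(1)
y = simulate_linear_phenotype(0, 1, beta_mu=1.0, beta_a=0.5, beta_d=0.2, sigma_eps=1.0)
```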

SLIDE 7

Case / Control Phenotypes I

  • While a linear regression may provide a reasonable model for many phenotypes, we are commonly interested in analyzing phenotypes where this is NOT a good model
  • As an example, we are often in situations where we are interested in identifying causal polymorphisms (loci) that contribute to the risk for developing a disease, e.g. heart disease, diabetes, etc.
  • In this case, the phenotype we are measuring is often “has disease” or “does not have disease” or more precisely “case” or “control”
  • Recall that such phenotypes are properties of measured individuals and therefore elements of a sample space, such that we can define a random variable such as Y(case) = 1 and Y(control) = 0

SLIDE 8

Case / Control Phenotypes II

  • To contrast the situations, let’s compare data we might model with a linear regression model versus case / control data:

SLIDE 9

Case / Control Phenotypes II

  • To contrast the situations, let’s compare data we might model with a linear regression model versus case / control data:

SLIDE 10

Logistic regression I

  • Instead, we’re going to consider a logistic regression model
SLIDE 11

Logistic regression II

  • It may not be immediately obvious why we choose a regression “line” function of this “shape”
  • The reason is mathematical convenience, i.e. this function can be considered (along with linear regression) within a broader class of models called Generalized Linear Models (GLM), which we will discuss next lecture
  • However, beyond a few differences (the error term and the regression function) we will see that the structure and our approach to inference is the same with this model

SLIDE 12

Logistic regression III

  • To begin, let’s consider the structure of a regression model:

Y = logistic(βμ + Xaβa + Xdβd) + εl

  • We code the “X’s” the same (!!) although a major difference here is the “logistic” function, as yet undefined
  • However, the expected value of Y has the same structure as we have seen before in a regression:

E(Yi|Xi) = logistic(βμ + Xi,aβa + Xi,dβd)

  • We can similarly write for a population using matrix notation (where the X matrix has the same form as we have been considering!):

E(Y|X) = logistic(Xβ)

  • In fact, the two major differences are in the form of the error and the logistic function

SLIDE 13

Logistic regression: error term I

  • Recall that for a linear regression, the error term accounted for the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution; for a logistic regression, we have the same setup, but the error must make up the difference to a value of either 0 or 1 (what distribution is this?):

[Plots of Y versus Xa for the two cases]

SLIDE 14

Logistic regression: error term II

  • For the error on an individual i, we therefore have to construct an error whose value depends on whether Yi is “1” or “0” and on the expected value given the genotype
  • For Yi = 0:

εi = −E(Yi|Xi) = −E(Y|AiAj) = −logistic(βμ + Xi,aβa + Xi,dβd)

  • For Yi = 1:

εi = 1 − E(Yi|Xi) = 1 − E(Y|AiAj) = 1 − logistic(βμ + Xi,aβa + Xi,dβd)

  • For a distribution that takes two such values, a reasonable distribution is therefore the Bernoulli distribution with the following parameter:

p = logistic(βμ + Xaβa + Xdβd)

εi = Z − E(Yi|Xi),  Pr(Z) ∼ Bern(p)
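The shifted-Bernoulli construction above can be sketched in Python. This is a minimal illustration; the function names are my own, not from the slides:

```python
import math
import random

def logistic(z):
    """logistic(z) = 1 / (1 + e^(-z)), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sample_phenotype(x_a, x_d, beta_mu, beta_a, beta_d):
    """Draw Z ~ Bern(p) with p = logistic(beta_mu + X_a*beta_a + X_d*beta_d);
    the phenotype is Y = Z and the error is epsilon = Z - E(Y|X),
    which equals -p (when Y = 0) or 1 - p (when Y = 1)."""
    p = logistic(beta_mu + x_a * beta_a + x_d * beta_d)
    z = 1 if random.random() < p else 0
    epsilon = z - p
    return z, epsilon
```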

SLIDE 15

Logistic regression: error term III

  • This may look complicated at first glance but the intuition is relatively simple
  • If the logistic regression line is near zero, the probability distribution of the error term is set up to make the probability of Y being zero greater than being one (and vice versa for the regression line near one!):

[Plot of Y versus Xa]

p = logistic(βμ + Xaβa + Xdβd)

εi = Z − E(Yi|Xi),  Pr(Z) ∼ Bern(p)

SLIDE 16

Logistic regression: link function I

  • Next, we have to consider the function for the regression line of a logistic regression (remember below we are plotting just versus Xa but this really is a plot versus Xa AND Xd!!):

[Plot of Y versus Xa]

E(Yi|Xi) = logistic(βμ + Xi,aβa + Xi,dβd)

E(Yi|Xi) = e^(βμ+Xi,aβa+Xi,dβd) / (1 + e^(βμ+Xi,aβa+Xi,dβd))

SLIDE 17

Logistic regression: link function II

  • We will write this function using a somewhat strange notation (but remember that it is just a function!!):

E(Yi|Xi) = γ⁻¹(βμ + Xi,aβa + Xi,dβd) = e^(βμ+Xi,aβa+Xi,dβd) / (1 + e^(βμ+Xi,aβa+Xi,dβd))

  • In matrix notation, this is the following:

E(y|x) = γ⁻¹(xβ) = [ e^(βμ+x1,aβa+x1,dβd) / (1 + e^(βμ+x1,aβa+x1,dβd)), … , e^(βμ+xn,aβa+xn,dβd) / (1 + e^(βμ+xn,aβa+xn,dβd)) ]ᵀ
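Written as code, γ⁻¹ is just the logistic function. A small sketch (function names are mine) showing that the form e^z / (1 + e^z) and the algebraically equivalent 1 / (1 + e^(−z)) agree:

```python
import math

def gamma_inv(z):
    """gamma^-1(z) = e^z / (1 + e^z), the logistic (inverse link) function."""
    return math.exp(z) / (1.0 + math.exp(z))

def gamma_inv_alt(z):
    """Algebraically identical form: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# For any linear predictor z = beta_mu + x_a*beta_a + x_d*beta_d the two agree
```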

SLIDE 18

Calculating the components of an individual I

  • Recall that an individual with phenotype Yi is described by the following equation:

Yi = E(Yi|Xi) + εi

Yi = γ⁻¹(βμ + xi,aβa + xi,dβd) + εi

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

  • To understand how an individual with a phenotype Yi and a genotype Xi breaks down in this equation, we need to consider the expected (predicted!) part and the error term (we will do this separately)

SLIDE 19

Calculating the components of an individual II

  • For example, say we have an individual i that has genotype A1A1 and phenotype Yi = 0
  • We know Xa = -1 and Xd = -1
  • Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are:

βμ = 0.2, βa = 2.2, βd = 0.2

  • We can then calculate the E(Yi|Xi) and the error term for i:

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

0 = e^(0.2+(−1)2.2+(−1)0.2) / (1 + e^(0.2+(−1)2.2+(−1)0.2)) + εi

0 = 0.1 − 0.1

SLIDE 20

Calculating the components of an individual III

  • For example, say we have an individual i that has genotype A1A1 and phenotype Yi = 1
  • We know Xa = -1 and Xd = -1
  • Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are:

βμ = 0.2, βa = 2.2, βd = 0.2

  • We can then calculate the E(Yi|Xi) and the error term for i:

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

1 = e^(0.2+(−1)2.2+(−1)0.2) / (1 + e^(0.2+(−1)2.2+(−1)0.2)) + εi

1 = 0.1 + 0.9

SLIDE 21

Calculating the components of an individual IV

  • For example, say we have an individual i that has genotype A1A2 and phenotype Yi = 0
  • We know Xa = 0 and Xd = 1
  • Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are:

βμ = 0.2, βa = 2.2, βd = 0.2

  • We can then calculate the E(Yi|Xi) and the error term for i:

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

0 = e^(0.2+(0)2.2+(1)0.2) / (1 + e^(0.2+(0)2.2+(1)0.2)) + εi

0 = 0.6 − 0.6

SLIDE 22

Calculating the components of an individual V

  • For example, say we have an individual i that has genotype A2A2 and phenotype Yi = 0
  • We know Xa = 1 and Xd = -1
  • Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are:

βμ = 0.2, βa = 2.2, βd = 0.2

  • We can then calculate the E(Yi|Xi) and the error term for i:

Yi = e^(βμ+xi,aβa+xi,dβd) / (1 + e^(βμ+xi,aβa+xi,dβd)) + εi

0 = e^(0.2+(1)2.2+(−1)0.2) / (1 + e^(0.2+(1)2.2+(−1)0.2)) + εi

0 = 0.9 − 0.9
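The worked examples on the last few slides can be checked numerically. A short sketch using the slides' true parameters βμ = 0.2, βa = 2.2, βd = 0.2 and the genotype codings A1A1 → (Xa, Xd) = (−1, −1), A1A2 → (0, 1), A2A2 → (1, −1):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2   # true parameters from the slides

genotypes = {"A1A1": (-1, -1), "A1A2": (0, 1), "A2A2": (1, -1)}
for name, (x_a, x_d) in genotypes.items():
    e_y = logistic(beta_mu + x_a * beta_a + x_d * beta_d)
    print(name, round(e_y, 1))   # A1A1 -> 0.1, A1A2 -> 0.6, A2A2 -> 0.9
```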

SLIDE 23

The error term I

  • Recall that the error term is either the negative of E(Yi|Xi) when Yi is zero and 1 − E(Yi|Xi) when Yi is one:

εi|(Yi = 0) = −E(Yi|Xi)

εi|(Yi = 1) = 1 − E(Yi|Xi)

  • For the entire distribution of the population, recall that p = E(Y|X):

Pr(εi) ∼ Bern(p|X) − E(Y|X)

For example: p = 0.1, so εi = −0.1 (Yi = 0) or εi = 0.9 (Yi = 1)

SLIDE 24

The error term II

  • Recall that the error term is either the negative of E(Yi|Xi) when Yi is zero and 1 − E(Yi|Xi) when Yi is one:

εi|(Yi = 0) = −E(Yi|Xi)

εi|(Yi = 1) = 1 − E(Yi|Xi)

  • For the entire distribution of the population, recall that p = E(Y|X):

Pr(εi) ∼ Bern(p|X) − E(Y|X)

For example: p = 0.6, so εi = −0.6 (Yi = 0) or εi = 0.4 (Yi = 1)

SLIDE 25

The error term III

  • Recall that the error term is either the negative of E(Yi|Xi) when Yi is zero and 1 − E(Yi|Xi) when Yi is one:

εi|(Yi = 0) = −E(Yi|Xi)

εi|(Yi = 1) = 1 − E(Yi|Xi)

  • For the entire distribution of the population, recall that p = E(Y|X):

Pr(εi) ∼ Bern(p|X) − E(Y|X)

For example: p = 0.9, so εi = −0.9 (Yi = 0) or εi = 0.1 (Yi = 1)

SLIDE 26

Comments

  • Remember that while we are plotting this versus just Xa, the true plot is versus BOTH Xa and Xd (harder to see what is going on)
  • For an entire sample, we can use matrix notation as follows:

E(y|x) = γ⁻¹(xβ) = [ e^(βμ+x1,aβa+x1,dβd) / (1 + e^(βμ+x1,aβa+x1,dβd)), … , e^(βμ+xn,aβa+xn,dβd) / (1 + e^(βμ+xn,aβa+xn,dβd)) ]ᵀ

E(Y|X) = γ⁻¹(Xβ) = e^(Xβ) / (1 + e^(Xβ)) = 1 / (1 + e^(−Xβ))
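A minimal sketch of this matrix-notation version, applying γ⁻¹ row by row of the design matrix (pure-Python lists; the function name is mine):

```python
import math

def expected_vector(x, beta):
    """E(y|x) = gamma^-1(x beta): for each row x_i of the design matrix
    (x_i = [1, x_ia, x_id]), compute 1 / (1 + e^(-x_i . beta))."""
    return [1.0 / (1.0 + math.exp(-sum(xij * bj for xij, bj in zip(xi, beta))))
            for xi in x]
```

With the slides' parameters, `expected_vector([[1, -1, -1], [1, 0, 1]], [0.2, 2.2, 0.2])` recovers the expected phenotypes of an A1A1 and an A1A2 individual.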

SLIDE 27

Inference in logistic regression

  • Recall that our goal with using logistic regression was to model the probability distribution of a case / control phenotype when there is a causal polymorphism
  • To use this for a GWAS, we need to test the null hypothesis that a genotype is not a causal polymorphism (or more accurately, that the genetic marker we are testing is not in LD with a causal polymorphism!):

H0 : βa = 0 ∩ βd = 0 (i.e. under the null, βμ = c, βa = 0, βd = 0)

  • To assess this null hypothesis, we will use the same approach as in linear regression, i.e. we will construct a LRT = likelihood ratio test (recall that an F-test is an LRT, and although we will not construct an F-test for logistic regression hypothesis testing, we will construct an LRT!)
  • Just as with linear regression, to construct a LRT we need the MLE of the (beta) parameters of the logistic regression

SLIDE 28

IRLS algorithm I

  • An algorithm is a sequence of instructions for taking an input and producing an output
  • For logistic regression (and GLMs in general!) we will construct an Iterative Re-weighted Least Squares (IRLS) algorithm, which has the following structure:
  • 1. Choose starting values for the β’s. Since we have a vector of three β’s in our case, we assign these numbers and call the resulting vector β[0].
  • 2. Using the re-weighting equation (described next slide), update the β[t] vector.
  • 3. At each step t > 0, check if β[t+1] ≈ β[t] (i.e. if these are approximately equal) using an appropriate function. If the value is below a defined threshold, stop. If not, repeat steps 2, 3.

SLIDE 29

IRLS algorithm II

  • For step (1), we can assign any starting values, since the algorithm is “convex” (although for other algorithms, we need to be careful in how we assign our starting values!)
  • For step (2), we will update this vector at each “iteration” using the following equation:

β[t+1] = [xᵀWx]⁻¹ xᵀWz

z = xβ[t] + W⁻¹(y − γ⁻¹(xβ[t]))

where W is an n by n diagonal matrix (Wij = 0 for i ≠ j) with diagonal elements

Wii = γ⁻¹(β[t]μ + xi,aβ[t]a + xi,dβ[t]d)(1 − γ⁻¹(β[t]μ + xi,aβ[t]a + xi,dβ[t]d))

  • Note an alternative way of writing these equations:

β[t+1] = β[t] + [xᵀWx]⁻¹ xᵀ(y − γ⁻¹(xβ[t]))

  • For step (3), we decide when to stop the algorithm using the “deviance criterion”: β[t+1] ≈ β[t] when ΔD = |D[t+1] − D[t]| is small, e.g. ΔD < 10⁻⁶, where

D = 2 Σi=1..n [ yi ln( yi / γ⁻¹(β[t or t+1]μ + xi,aβ[t or t+1]a + xi,dβ[t or t+1]d) ) + (1 − yi) ln( (1 − yi) / (1 − γ⁻¹(β[t or t+1]μ + xi,aβ[t or t+1]a + xi,dβ[t or t+1]d)) ) ]
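The three steps above can be sketched as a pure-Python IRLS using the "alternative way" (Newton-style) update. This is a minimal sketch, not the course's reference implementation: the helper names and toy data are mine, and for simplicity it stops when the step size is tiny rather than using the deviance criterion:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve3(a, b):
    """Solve the 3x3 linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [m[r][k] - f * m[col][k] for k in range(4)]
    return [m[i][3] / m[i][i] for i in range(3)]

def irls(x, y, tol=1e-8, max_iter=100):
    """IRLS for beta = (beta_mu, beta_a, beta_d); rows of x are [1, x_a, x_d].
    Update: beta[t+1] = beta[t] + (x^T W x)^-1 x^T (y - gamma^-1(x beta[t])),
    with W diagonal, W_ii = p_i (1 - p_i)."""
    n = len(y)
    beta = [0.0, 0.0, 0.0]                      # step 1: starting values
    for _ in range(max_iter):
        p = [logistic(sum(xij * bj for xij, bj in zip(xi, beta))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # step 2: build x^T W x and x^T (y - p), then take one update step
        xtwx = [[sum(w[i] * x[i][r] * x[i][c] for i in range(n))
                 for c in range(3)] for r in range(3)]
        xtr = [sum(x[i][r] * (y[i] - p[i]) for i in range(n)) for r in range(3)]
        step = solve3(xtwx, xtr)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:     # step 3: convergence check
            break
    return beta

# Toy (hypothetical) data: 5 A1A1, 6 A1A2, 5 A2A2 individuals with mixed outcomes
x = [[1, -1, -1]] * 5 + [[1, 0, 1]] * 6 + [[1, 1, -1]] * 5
y = [0, 0, 0, 0, 1] + [0, 0, 0, 1, 1, 1] + [0, 1, 1, 1, 1]
beta_hat = irls(x, y)
```

With three parameters and three distinct genotype classes, the fitted probabilities should match the observed class frequencies (1/5, 3/6, 4/5), which gives a quick sanity check on the implementation.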

SLIDE 30

Hypothesis testing: LRT I

  • Recall that our null and alternative hypotheses are:

H0 : βa = 0 ∩ βd = 0

HA : βa ≠ 0 ∪ βd ≠ 0

  • We will use the LRT for the null (0) and alternative (1):

LRT = −2lnΛ = −2ln[ L(θ̂0|y) / L(θ̂1|y) ]

LRT = −2lnΛ = 2l(θ̂1|y) − 2l(θ̂0|y)

  • Under the null, this LRT is (approximately!) a chi-square distribution with 2 degrees of freedom (d.f.) or more accurately: as n → ∞, LRT → χ²df (we do not have an exact distribution, only this asymptotic result, with d.f. determined by the difference in the number of free parameters between the alternative and null, here d.f. = 2)

SLIDE 31

Performing a GWAS

  • Now we have all the critical components for performing a GWAS with a case / control phenotype!
  • The procedure (and goals!) are the same as before: for a sample of n individuals, where for each we have measured a case / control phenotype and N genotypes, we perform N hypothesis tests
  • To perform these hypothesis tests, we need to run our IRLS algorithm for EACH marker to get the MLE of the parameters under the alternative (= no restrictions on the beta’s!) and use these to calculate our LRT test statistic for each marker
  • We then use these N LRT statistics to calculate N p-values by using a chi-square distribution (how do we do this in R?)
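In R this is pchisq(LRT, df = 2, lower.tail = FALSE). As a cross-check, for 2 d.f. the chi-square survival function has the closed form e^(−x/2), so a minimal Python sketch (function name is mine) is:

```python
import math

def lrt_pvalue_df2(lrt):
    """P-value for an LRT statistic against a chi-square null with 2 d.f.
    For df = 2, P(chi2_2 > x) = e^(-x/2) exactly."""
    return math.exp(-lrt / 2.0)

# LRT = 2*l(theta_hat_1 | y) - 2*l(theta_hat_0 | y) from the fitted models
print(lrt_pvalue_df2(5.99))   # ~0.05, the usual chi-square(2) critical value
```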

SLIDE 32

Logistic regression: looking forward

  • Don’t get overly concerned with the (apparent) complexity of a logistic regression (!!) - we will review this next lecture (computer lab this week!)
  • Remember that it is just like a linear regression, but instead of a “line” we are using a regression function that has a different shape, and instead of a normal error we are using an error that takes one of two states (Bernoulli error!)
  • Otherwise, the rest is the same: the X’s are coded the same, we will estimate parameters using the same approach (MLE!) and we will perform hypothesis testing using the same approach (Likelihood Ratio Test)

SLIDE 33

That’s it for today

  • Next lecture: we will continue our discussion of logistic regression!