Quantitative Genomics and Genetics
BTRY 4830/6830; PBSB.5201.01
Lecture 21: Multiple genotypes and phenotypes
Jason Mezey jgm45@cornell.edu
April 23, 2019 (T) 10:10-11:25
Announcements
- THE PROJECT IS NOW DUE AT: 11:59PM, Sat., May 4 (!!!)
- The Final:
- Available Sun. (May 5) evening (Time TBD) and due 11:59PM,
May 7 (last day of class)
- Take-home, open book, no discussion with anyone (same as
the midterm!)
- Cumulative (including the last lecture!)
Summary of lecture 21
- Review and last mixed model subjects
- Multiple regression GWAS (epistasis)
- Multivariate GWAS analysis (and eQTLs)
Review: introduction to mixed models I
- A mixed model describes a class of models that played an important role in early quantitative genetic (and other types of) statistical analysis before genomics (if you are interested, look up variance component estimation)
- These models are now used extensively in GWAS analysis as a tool for modeling covariates (often population structure!)
- These models consider effects as either "fixed" (the types of regression coefficients we have discussed in this class) or "random" (which just indicates a different model assumption), where the appropriateness of modeling covariates as fixed or random depends on the context (fuzzy rules!)
- These models have logistic forms but we will introduce mixed
models using linear mixed models (“simpler”)
- Recall our linear regression model has the following structure:
- For example, for n=2:
- What if we introduced a correlation?
y_i = β_µ + X_{i,a}β_a + X_{i,d}β_d + ε_i,    ε_i ∼ N(0, σ²_ε)

For n = 2 (independent errors):

y_1 = β_µ + X_{1,a}β_a + X_{1,d}β_d + ε_1
y_2 = β_µ + X_{2,a}β_a + X_{2,d}β_d + ε_2

With a correlation introduced (replacing the independent ε_1, ε_2 with correlated random effects a_1, a_2):

y_1 = β_µ + X_{1,a}β_a + X_{1,d}β_d + a_1
y_2 = β_µ + X_{2,a}β_a + X_{2,d}β_d + a_2
Review: Intro to mixed models II
- The formal structure of a mixed model is as follows:
- Note that X is called the "design" matrix (as with a GLM), Z is called the "incidence" matrix, a is the vector of random effects, and the A matrix determines the correlation among the a_i values, where the structure of A is provided from external information (!!)
y = Xβ + Za + ε

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{1,a} & X_{1,d} \\ 1 & X_{2,a} & X_{2,d} \\ 1 & X_{3,a} & X_{3,d} \\ \vdots & \vdots & \vdots \\ 1 & X_{n,a} & X_{n,d} \end{bmatrix} \begin{bmatrix} \beta_\mu \\ \beta_a \\ \beta_d \end{bmatrix} + \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

where ε ∼ multiN(0, Iσ²_ε) and a ∼ multiN(0, Aσ²_a), with A an n×n covariance matrix (see class for a discussion)
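As a concrete (hypothetical) illustration of this structure, the following Python sketch simulates data from a linear mixed model y = Xβ + Za + ε with Z = I; the parameter values and the toy form of A are made up for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Fixed-effect design: intercept plus X_a, X_d for one marker
Xa = rng.choice([-1, 0, 1], size=n, p=[0.25, 0.5, 0.25]).astype(float)
Xd = 1.0 - 2.0 * np.abs(Xa)              # +1 for heterozygotes, -1 otherwise
X = np.column_stack([np.ones(n), Xa, Xd])
beta = np.array([1.0, 0.5, 0.2])         # assumed beta_mu, beta_a, beta_d

# Toy A matrix (any valid n x n covariance matrix works here)
A = 0.5 * np.eye(n) + 0.5 * np.ones((n, n)) / n
sigma2_a, sigma2_e = 1.0, 0.5

# a ~ multiN(0, sigma2_a * A); here the incidence matrix Z is the identity
L = np.linalg.cholesky(sigma2_a * A)
a = L @ rng.standard_normal(n)
eps = rng.normal(0.0, np.sqrt(sigma2_e), size=n)
y = X @ beta + a + eps                   # y = X beta + Z a + eps with Z = I
```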
Review: Intro to mixed models III
- We perform inference (estimation and hypothesis testing)
for the mixed model just as we would for a GLM (!!)
- Note that in some applications, people might be interested in estimating the variance components σ²_a and σ²_ε, but for GWAS, we are generally interested in the regression parameters β_a, β_d for our genotype (as before!)
- For a GWAS, we will therefore determine the MLE of the genotype association parameters and use a LRT for the hypothesis test, where we will compare a null and alternative model (what is the difference between these models?)
Review: Intro to mixed models IV
- To estimate parameters, we will use the MLE, so we are
concerned with the form of the likelihood equation
- Unfortunately, there is no closed form for the MLE since they
have the following form:
MLE(β̂) = (XᵀV̂⁻¹X)⁻¹XᵀV̂⁻¹y

MLE(V̂) = f(X, V̂, y, A),    where V = σ²_a A + σ²_ε I

l(β, σ²_a, σ²_ε | y) ∝ −(n/2) ln σ²_ε − (n/2) ln σ²_a − (1/(2σ²_ε))[y − Xβ − Za]ᵀ[y − Xβ − Za] − (1/(2σ²_a)) aᵀA⁻¹a    (18)

L(β, σ²_a, σ²_ε | y) = ∫_{−∞}^{∞} Pr(y | β, a, σ²_ε) Pr(a | A, σ²_a) da

L(β, σ²_a, σ²_ε | y) = |Iσ²_ε|^{−1/2} e^{−(1/(2σ²_ε))[y−Xβ−Za]ᵀ[y−Xβ−Za]} |Aσ²_a|^{−1/2} e^{−(1/(2σ²_a)) aᵀA⁻¹a}
Review: inference I
- 1. At step [t] for t = 0, assign values to the parameters: β^[0] = [β_µ^[0], β_a^[0], β_d^[0]], σ²_a^[0], σ²_ε^[0]. These need to be selected such that they are possible values of the parameters (e.g. no negative values for the variance parameters).
- 2. Calculate the expectation step for [t]:

a^[t] = (ZᵀZ + A⁻¹(σ²_ε^[t−1] / σ²_a^[t−1]))⁻¹ Zᵀ(y − Xβ^[t−1])    (21)

V_a^[t] = (ZᵀZ + A⁻¹(σ²_ε^[t−1] / σ²_a^[t−1]))⁻¹ σ²_ε^[t−1]    (22)

- 3. Calculate the maximization step for [t]:

β^[t] = (XᵀX)⁻¹Xᵀ(y − Za^[t])    (23)

σ²_a^[t] = (1/n)[a^[t]ᵀA⁻¹a^[t] + tr(A⁻¹V_a^[t])]    (24)

σ²_ε^[t] = (1/n)[(y − Xβ^[t] − Za^[t])ᵀ(y − Xβ^[t] − Za^[t]) + tr(ZᵀZ V_a^[t])]    (25)

where tr is the trace function, which is equal to the sum of the diagonal elements of a matrix.
- 4. Iterate steps 2, 3 until (β^[t], σ²_a^[t], σ²_ε^[t]) ≈ (β^[t+1], σ²_a^[t+1], σ²_ε^[t+1]) (or alternatively ln L^[t] ≈ ln L^[t+1]).
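The EM updates above can be sketched directly in code. This is a minimal, illustrative implementation (the function name em_lmm is my own; it assumes Z and A are supplied, uses a fixed iteration count, and skips the convergence checks and numerical safeguards a production mixed-model package would include):

```python
import numpy as np

def em_lmm(y, X, Z, A, n_iter=50):
    """EM updates for y = X beta + Z a + eps with a ~ N(0, sigma2_a * A),
    eps ~ N(0, sigma2_e * I), following equations (21)-(25) above."""
    n = len(y)
    Ainv = np.linalg.inv(A)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # step 1: start from OLS
    s2a, s2e = 1.0, 1.0                           # step 1: positive starting values
    for _ in range(n_iter):
        # Step 2 (E-step): posterior mean / covariance of the random effects
        M = np.linalg.inv(Z.T @ Z + Ainv * (s2e / s2a))
        a = M @ Z.T @ (y - X @ beta)              # eq. (21)
        Va = M * s2e                              # eq. (22)
        # Step 3 (M-step): update fixed effects and variance components
        beta = np.linalg.solve(X.T @ X, X.T @ (y - Z @ a))     # eq. (23)
        s2a = (a @ Ainv @ a + np.trace(Ainv @ Va)) / n         # eq. (24)
        r = y - X @ beta - Z @ a
        s2e = (r @ r + np.trace(Z.T @ Z @ Va)) / n             # eq. (25)
    return beta, s2a, s2e                         # step 4: iterate to ~convergence
```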
Review: inference II
- For hypothesis testing, we will calculate a LRT:
- To do this, run the EM algorithm twice, once for the null
hypothesis (again what is this?) and once for the alternative (i.e. all parameters unrestricted) and then substitute the parameter values into the log-likelihood equations and calculate the LRT
- The LRT is then distributed (asymptotically) as a Chi-
Square distribution with two degrees of freedom (as before!)
LRT = −2 ln Λ = 2l(θ̂₁ | y) − 2l(θ̂₀ | y)
Review: inference III
- The matrix A is an nxn covariance matrix (what is the form of
a covariance matrix?)
- Where does A come from? This depends on the modeling
application...
- In GWAS, the random effect is usually used to account for
population structure OR relatedness among individuals
- For relatedness, we use estimates of identity by descent,
which can be estimated from a pedigree or genotype data
- For population structure, a matrix is constructed from the
covariance (or similarity) among individuals based on their genotypes
Construction of A matrix I
- Calculate the nxn (n=sample size) covariance matrix for the
individuals in your sample across all genotypes - this is a reasonable A matrix!
- There is software for calculating A and for performing a mixed model analysis (e.g. R package: lrgpr, EMMAX, FAST-LMM, TASSEL, etc.)
- Mastering mixed models will take more time than we have to
devote to the subject in this class, but what we have covered provides a foundation for understanding the topic
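As a sketch of the A-matrix calculation described above, the covariance of individuals across standardized genotypes can be computed in a few lines (the function name and the 0/1/2 genotype coding are my assumptions):

```python
import numpy as np

def genetic_relatedness_matrix(G):
    """Sketch: n x n covariance of individuals across all genotypes.
    G is an n (sample size) x N (markers) matrix coded 0/1/2."""
    # Standardize each marker; the small constant guards against monomorphic markers
    Gs = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)
    return Gs @ Gs.T / G.shape[1]    # a reasonable A matrix
```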
Construction of A matrix II
Data = $$\begin{bmatrix} z_{11} & \dots & z_{1k} & y_{11} & \dots & y_{1m} & x_{11} & \dots & x_{1N} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\ z_{n1} & \dots & z_{nk} & y_{n1} & \dots & y_{nm} & x_{n1} & \dots & x_{nN} \end{bmatrix}$$
- Recall our linear regression model has the following structure:
- For example, for n=2:
- What if we introduced a correlation?
y_i = β_µ + X_{i,a}β_a + X_{i,d}β_d + ε_i,    ε_i ∼ N(0, σ²_ε)

For n = 2 (independent errors):

y_1 = β_µ + X_{1,a}β_a + X_{1,d}β_d + ε_1
y_2 = β_µ + X_{2,a}β_a + X_{2,d}β_d + ε_2

With a correlation introduced (replacing the independent ε_1, ε_2 with correlated random effects a_1, a_2):

y_1 = β_µ + X_{1,a}β_a + X_{1,d}β_d + a_1
y_2 = β_µ + X_{2,a}β_a + X_{2,d}β_d + a_2
What is the A modeling?
- So far, we have applied a GWAS analysis by considering
statistical models between one genetic marker and the phenotype
- This is the standard approach applied in all GWAS analyses
and the one that you should apply as a first step when analyzing GWAS data (always!)
- However, we could start considering more than one marker in
each of the statistical models we consider
- One reason we might want to do this is to test for statistical
interactions among genetic markers (or more specifically, between the causal polymorphisms that they are tagging)
Introduction to epistasis I
- If we wanted to consider two markers at a time, our current statistical framework extends easily (note that an index AFTER a comma indicates a different marker):
- However, this equation only has four regression parameters and with two
markers, we have more than four classes of genotypes
- To make this explicit, recall that we define the genotypic value of the
phenotype as the expected value of the phenotype Y given a genotype:
- For the case of two markers, we therefore have nine classes of genotypes
and therefore nine possible genotypic values, i.e. we need nine parameters to model this system (why are there nine?):
Introduction to epistasis II
G_{A_kA_l B_kB_l} = E(Y | g = A_kA_l B_kB_l)

Y = γ⁻¹(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2}) + ε

        B1B1            B1B2            B2B2
A1A1    G_{A1A1B1B1}    G_{A1A1B1B2}    G_{A1A1B2B2}
A1A2    G_{A1A2B1B1}    G_{A1A2B1B2}    G_{A1A2B2B2}
A2A2    G_{A2A2B1B1}    G_{A2A2B1B2}    G_{A2A2B2B2}
- As an example, for a sample that we can appropriately model with a linear
regression model, we can plot the phenotypes associated with each of the nine classes:
- In this case, both marginal loci are additive
Introduction to epistasis III
- With nine classes, we also get the possibility of conditional relationships
we have not seen before:
- This is an example of epistasis
Introduction to epistasis IV
- epistasis - a case where the effect of an allele substitution at one locus (A1 -> A2) alters the effect of substituting an allele at another locus (B1 -> B2)
- This may be equivalently phrased as a change in the expected phenotype (genotypic
value) for a genotype at one locus conditional on the state of a locus at another marker
- Note that there is a symmetry in epistasis such that if the effect of at least one allelic
substitution (from one genotype to another) for one locus depends on the genotype at the other locus, then at least one allelic substitution of the other locus will be dependent as well
- A consequence of this symmetry is if there is an epistatic relationship between two
loci BOTH will be causal polymorphisms for the phenotype (!!!)
- If there is an epistatic effect (=relationship) between loci, we would therefore like to
know this information
- Note that we need not consider such relationships for a pair of loci, but such
relationships can exist among three (three-way), four (four-way), etc.
- The amount of epistasis among loci for any given phenotype is unknown (but without
question it is ubiquitous!!)
Notes about epistasis I
- Note that the definition of epistasis is entirely statistical (!!)
and says nothing about mechanism (although people have mis- appropriated the term in this way)
- The term epistasis was coined by Fisher in the 1920’s
- Epistasis is sometimes called genotype by genotype, G by G, or G x G
- Geneticists often use the term “modifiers” to describe the
dependence of genetic effects at a locus on the state of another locus - this is just epistasis (!!)
- We can also consider the effects of a locus when considering
the entire “genetic background” (i.e. all the state in the rest of the genome!) - this is also epistasis (!!)
Notes about epistasis II
- To model epistasis, we are going to use our same GLM
framework (!!)
- The parameterization (using Xa and Xd) that we have
considered so far perfectly models any case where there is no epistasis
- We will account for the possibility of epistasis by
constructing additional dummy variables and adding additional parameters (so that we have 9 total in our GLM)
Modeling epistasis I
- Recall the dummy variables we have constructed so far:
- We will use these dummy variables to construct additional
dummy variables in our GLM (and add additional parameters) to account for epistasis
Modeling epistasis II
Y = γ⁻¹(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d}) + ε

X_{a,1} = −1 for A1A1, 0 for A1A2, 1 for A2A2
X_{d,1} = −1 for A1A1, 1 for A1A2, −1 for A2A2
X_{a,2} = −1 for B1B1, 0 for B1B2, 1 for B2B2
X_{d,2} = −1 for B1B1, 1 for B1B2, −1 for B2B2
- To provide some intuition concerning what each of these
are capturing, consider the values that each of the genotypes would take for dummy variable Xa,1:
Y = γ⁻¹(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d}) + ε

Values of X_{a,1} for each two-locus genotype:

        B1B1   B1B2   B2B2
A1A1     −1     −1     −1
A1A2      0      0      0
A2A2      1      1      1
Modeling epistasis III
- To provide some intuition concerning what each of these
are capturing, consider the values that each of the genotypes would take for dummy variable Xd,1:
Y = γ⁻¹(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d}) + ε

Values of X_{d,1} for each two-locus genotype:

        B1B1   B1B2   B2B2
A1A1     −1     −1     −1
A1A2      1      1      1
A2A2     −1     −1     −1
Modeling epistasis IV
- To provide some intuition concerning what each of these are capturing, consider the values that each of the genotypes would take for dummy variable X_{a,1}X_{a,2}:

Y = γ⁻¹(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d}) + ε

        B1B1   B1B2   B2B2
A1A1      1      0     −1
A1A2      0      0      0
A2A2     −1      0      1
Modeling epistasis V
- To provide some intuition concerning what each of these
are capturing, consider the values that each of the genotypes would take for dummy variable Xa,1Xd,2 (similarly for Xa,2Xd,1):
Y = γ⁻¹(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d}) + ε

        B1B1   B1B2   B2B2
A1A1      1     −1      1
A1A2      0      0      0
A2A2     −1      1     −1
Modeling epistasis VI
- To provide some intuition concerning what each of these are capturing, consider the values that each of the genotypes would take for dummy variable X_{d,1}X_{d,2}:

Y = γ⁻¹(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d}) + ε

        B1B1   B1B2   B2B2
A1A1      1     −1      1
A1A2     −1      1     −1
A2A2      1     −1      1
Modeling epistasis VII
- To infer epistatic relationships we will use the exact same genetic
framework and statistical framework that we have been considering
- For the genetic framework, we are still testing markers that we
are assuming are in LD with causal polymorphisms that could have an epistatic relationship (so we are indirectly inferring that there is epistasis from the marker genotypes)
- For inference, we are going to estimate epistatic parameters using the same approach as before (!!), i.e. for a linear model:
Inference for epistasis I
X = [1, X_{a,1}, X_{d,1}, X_{a,2}, X_{d,2}, X_{a,a}, X_{a,d}, X_{d,a}, X_{d,d}]

β = [β_µ, β_{a,1}, β_{d,1}, β_{a,2}, β_{d,2}, β_{a,a}, β_{a,d}, β_{d,a}, β_{d,d}]ᵀ

β̂ = (XᵀX)⁻¹Xᵀy
- For hypothesis testing, we will just use an LRT calculated the
same way as before (!!)
- For an F-statistic for a linear regression (or for logistic regression), estimate the parameters under the null and alternative models and substitute these into likelihood equations that have the same form as before (with some additional dummy variables and parameters)
- The only difference is the degrees of freedom for a given test, which equals the number of parameters in the alternative model minus the number of parameters in the null model
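A minimal sketch of this estimation step for a linear model, assuming the −1/0/1 dummy coding defined earlier (both function names are hypothetical):

```python
import numpy as np

def epistasis_design(Xa1, Xd1, Xa2, Xd2):
    """Nine-column design matrix for the two-marker epistasis model:
    [1, Xa1, Xd1, Xa2, Xd2, Xa1*Xa2, Xa1*Xd2, Xd1*Xa2, Xd1*Xd2]."""
    n = len(Xa1)
    return np.column_stack([np.ones(n), Xa1, Xd1, Xa2, Xd2,
                            Xa1 * Xa2, Xa1 * Xd2, Xd1 * Xa2, Xd1 * Xd2])

def fit_epistasis(y, X):
    """MLE for the linear model: beta_hat = (X^T X)^-1 X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```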
Inference for epistasis II
- For example, we could use the entire model to test the same
hypothesis that we have been considering for a single marker:
- We could also test whether either marker has evidence of being a
causal polymorphism:
- We can also test just for epistasis (note this is equivalent to testing
an interaction effect in an ANOVA!):
- We can also test the entire model (what is the interpretation in this
case!?):
Single marker (e.g. marker 1):
H0: β_{a,1} = 0 ∩ β_{d,1} = 0
HA: β_{a,1} ≠ 0 ∪ β_{d,1} ≠ 0

Either marker:
H0: β_{a,1} = 0 ∩ β_{d,1} = 0 ∩ β_{a,2} = 0 ∩ β_{d,2} = 0
HA: β_{a,1} ≠ 0 ∪ β_{d,1} ≠ 0 ∪ β_{a,2} ≠ 0 ∪ β_{d,2} ≠ 0

Epistasis only:
H0: β_{a,a} = 0 ∩ β_{a,d} = 0 ∩ β_{d,a} = 0 ∩ β_{d,d} = 0
HA: β_{a,a} ≠ 0 ∪ β_{a,d} ≠ 0 ∪ β_{d,a} ≠ 0 ∪ β_{d,d} ≠ 0

Entire model:
H0: β_{a,1} = 0 ∩ β_{d,1} = 0 ∩ β_{a,2} = 0 ∩ β_{d,2} = 0 ∩ β_{a,a} = 0 ∩ β_{a,d} = 0 ∩ β_{d,a} = 0 ∩ β_{d,d} = 0
HA: β_{a,1} ≠ 0 ∪ β_{d,1} ≠ 0 ∪ β_{a,2} ≠ 0 ∪ β_{d,2} ≠ 0 ∪ β_{a,a} ≠ 0 ∪ β_{a,d} ≠ 0 ∪ β_{d,a} ≠ 0 ∪ β_{d,d} ≠ 0
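For a linear model, the epistasis-only test can be sketched as follows (a hedged illustration: it uses the normal-model identity −2 ln Λ = n ln(SSE_null / SSE_alt) and compares the full nine-parameter model to the five-parameter null, so the degrees of freedom are 9 − 5 = 4; the function name is my own):

```python
import numpy as np

def lrt_epistasis(y, X_full):
    """LRT for H0: beta_aa = beta_ad = beta_da = beta_dd = 0 in a linear
    model, where X_full has the 9 columns [1, Xa1, Xd1, Xa2, Xd2, plus the
    four interaction columns].  Under normal errors -2 ln(Lambda) reduces
    to n * ln(SSE_null / SSE_alt); compare to a chi-square with 4 df."""
    def sse(X):
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ b
        return r @ r
    X_null = X_full[:, :5]                  # null model: drop interaction columns
    return len(y) * np.log(sse(X_null) / sse(X_full))
```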
Inference for epistasis III
- So far, we have considered a GWAS analysis where we have a
single phenotype and many genotypes, the latter collected by genomics technologies
- Genomics technologies can also be used to measure many
phenotypes (e.g., genome-wide gene expression, proteomics, etc.)
- We also often have a situation where we have both many
genotypes and many phenotypes
- The framework you have learned in this class still applies (!!),
i.e., the first step in these analyses is still testing pairs of variables at a time
Analysis with more phenotypes
- Consider a case where you have collected genome-wide gene
expression or proteomic data for a tissue of a mouse experiment where there are only two conditions: “wild type" and “mutant”:
- To analyze these data, regress each phenotype (e.g., a gene expression measurement) on the condition (e.g., coded 0 / 1), one phenotype variable at a time (just like a GWAS!!)
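A sketch of this per-phenotype analysis for a linear model (the helper name and the t-statistic summary are my own choices; a real analysis would convert these to p-values and apply a multiple test correction):

```python
import numpy as np

def condition_t_stats(Y, x):
    """Regress each of m phenotypes (columns of Y) on a single 0/1
    condition x, one phenotype at a time; returns the t-statistic of
    the condition effect for each phenotype."""
    n, m = Y.shape
    X = np.column_stack([np.ones(n), x])
    B = np.linalg.lstsq(X, Y, rcond=None)[0]         # 2 x m coefficient matrix
    resid = Y - X @ B
    s2 = (resid ** 2).sum(axis=0) / (n - 2)          # residual variance per phenotype
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])  # SE of the condition effect
    return B[1] / se
```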
Many phenotypes and one experimental condition I
Data = $$\begin{bmatrix} z_{11} & \dots & z_{1k} & y_{11} & \dots & y_{1m} & x_{11} & \dots & x_{1N} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\ z_{n1} & \dots & z_{nk} & y_{n1} & \dots & y_{nm} & x_{n1} & \dots & x_{nN} \end{bmatrix}$$
- There is one important diagnostic difference in the many phenotype
analysis: your QQ plots need not conform to the rules of GWAS QQ plots (please take note of this!!)
- That is, when you have a single treatment (or genotype) where you
are considering the impact on many phenotypes, it is possible the treatment / genotype impacts many phenotypes (and therefore produces many significant tests!)
Many phenotypes and one experimental condition II
- Why is this?
- That is, why is it that when analyzing GWAS data (= regressing one phenotype on many genotypes) a correct statistical model fit cannot produce many highly significant tests, while an analysis of many phenotypes on one genotype can produce many significant test results (and these can be appropriate test results)?
- The reason is in a GWAS, we are assuming the underlying true case
is many causal genotypes each contributing to variation in the one phenotype, such that if there are many, each of their effects is relatively small (!!)
- In a many phenotypes with one treatment situation, the treatment (or genotype) may separately impact many of the phenotypes (!!)
Many phenotypes and one experimental condition III
- From the statistical modeling point of view, we can view a GWAS as a
multiple regression model (i.e., a single Y with many X’s):
- While for a case with many phenotypes and a single treatment (e.g., a
single genotype) the correct model is a multivariate regression (i.e., many Y’s with a single X)
- We could also have many phenotypes and many genotypes (e.g., eQTL)
Many phenotypes and one experimental condition IV
Data = $$\begin{bmatrix} z_{11} & \dots & z_{1k} & y_{11} & \dots & y_{1m} & x_{11} & \dots & x_{1N} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\ z_{n1} & \dots & z_{nk} & y_{n1} & \dots & y_{nm} & x_{n1} & \dots & x_{nN} \end{bmatrix}$$
- While the right first analysis step when dealing with many
variables is testing pairs of variables at a time (e.g., one phenotype - one genotype) could we construct statistical models that consider more genotypes or more phenotypes at the same time?
- Yes!
- We could fit multiple regressions with many genotypes (you’ve
done multiple regressions already!)
- We could fit multivariate regressions with many
Y’s and one treatment
- We could even fit a multivariate-multiple regression model (!!)
Multiple and multivariate models I
- The problem with the multivariate regression approach is that many aspects get more complicated and, in practice, you often get the same information as fitting one Y and X pair at a time
- The problem with multiple regressions with many X's is the over-fitting problem, requiring other techniques (e.g., penalized or regularized regressions), and in practice you often get the same information as fitting one Y and X pair at a time
- Same for multivariate-multiple regression situations like eQTL
designs (let’s take a quick look at this concept first)
- For multiple regressions, we sometimes like to consider a few
more X’s to capture “interactions” (=epistasis)
Multiple and multivariate models II
- expression Quantitative Trait Locus (eQTL) - a polymorphic locus where an
experimental exchange of one allele for another produces a change in expression on average under specified conditions:
- The allelic states defined by the original mutation event define the causal
polymorphism of the eQTL
- Intuitive example: if rs27290 was a causal allele, changing A -> G would change the
measured expression of ERAP2
Introduction to eQTL
[Figure: ERAP2 expression (y-axis, ~3.5 to 6.0) plotted against rs27290 genotype (A/A, A/G, G/G), showing an eQTL]

X: A1 → A2 ⇒ ΔY | Z
Detecting eQTL from the analysis of genome-wide data
- Since eQTL reflect a case where different allelic combinations
(genotypes) lead to different levels of gene expression, we could in theory discover an eQTL by testing for an association between measured genotypes and gene expression levels
- Most eQTL are “discovered” using this type of approach
- A typical (human) eQTL experiment includes m (= ~10-30K) expression
variables and N (= ~0.1-10mil) genotypes measured in n individuals sampled from a population
- A typical (most!) analysis of such data proceeds by performing independent statistical tests of (a subset of) genotype-expression pairs, where tests that are significant after a multiple test correction (e.g. Bonferroni) are assumed to indicate an eQTL
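A minimal sketch of such a pairwise scan (all names are hypothetical; this uses a simple correlation t-statistic per expression-genotype pair and leaves the Bonferroni threshold comparison as a comment):

```python
import numpy as np

def eqtl_scan_t_stats(E, G):
    """Correlation t-statistic for every (expression, genotype) pair.
    E: n x m expression matrix; G: n x N genotype matrix (0/1/2).
    Returns an m x N matrix; a Bonferroni-corrected analysis would
    compare each two-sided p-value to alpha / (m * N)."""
    n = E.shape[0]
    Es = (E - E.mean(axis=0)) / (E.std(axis=0) + 1e-12)
    Gs = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)
    R = Es.T @ Gs / n                                  # m x N pairwise correlations
    return R * np.sqrt((n - 2) / (1.0 - R ** 2 + 1e-12))
```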
Genome-wide scan for eQTL: typical outcome
[Figure: left panel, eQTL (p < 10⁻³⁰): ERAP2 expression by rs27290 genotype (A/A, A/G, G/G); right panel, no eQTL (n.s.): ERAP2 expression by rs1908530 genotype (T/T, T/C, C/C)]
Considering cis- vs trans- eQTL I
- This is a “cis-”eQTL because the significant genotypes are in the same
location as the expressed gene (otherwise, it would be a “trans-”eQTL)
- Most eQTL are “cis-”, which makes biological sense
Typical outcome: zooming in and “cis-” v “trans-”
Genome-wide identification of eQTL
- one gene, all SNPs
- one gene, multiple SNPs
- one gene, one SNP
- all genes, all SNPs…
Advanced Topic: population and hidden factors
[Figure: p-value heatmaps showing artifact patterns caused by population structure and a hidden factor]
- Population structure and hidden factors can cause false positive
associations = correlations that don’t represent true genetic effects
- We can sometimes remove these artifacts by including appropriate
covariates in our analysis in a mixed model or by using a hidden factor analysis
- What you have learned (i.e., analyzing GWAS data) is an excellent
introduction to data science
- Obviously, there is a lot more to learn to be a great data scientist (just like being great at GWAS / biological big data analysis!) but the approaches to problems and the foundational analyses you use absolutely apply to problems in general
- As a quick introduction, I’m going to provide an abstract (but
correct!) way of thinking about methods and tasks in data science, what works and what doesn’t and why (in broad terms), and how what you’ve learned here should be how you start on a problem
Foundational Introduction to: Data Science I
Foundational Introduction to: Data Science II
Computational Statistics Machine Learning Discovery Problems Learning Problems Discipline Data Science Task
- In a learning problem the goal is to use big data to perform a well defined task
correctly (e.g., winning at chess, winning at jeopardy, predicting what movies someone wants to watch, what advertising will work based on their social profile, translating language, face recognition, driving a car without crashing….)
- These are where you will hear about "deep learning" (and TensorFlow…) and other seemingly complex methods (they are not complex btw)
- These methods are ALL modeling the conditional probability relationships
inherent in the system to predict and not (necessarily) to understand underlying causes or even an interpretable model
- These methods only work when two conditions apply:
- The conditionals are constant / robust!
- You have a very large number of SAMPLES to analyze (not features!)
- This is not the case for MANY systems where these methods are being applied and claimed to work (for example, in biology…), so make sure the system you are trying to "learn" is "learnable" (!!)
Foundational Introduction to: Data Science III: Learning Problems
- In discovery problems, the goal is to learn about the “cause” of a particular
phenomenon you are observing
- These are applications where you will hear about multivariate statistical
methods / complex statistical methods (e.g., hierarchical models)
- These methods are ALL attempting to identify conditional probability
relationships from data THAT MAP ON TO a causal relationship
- These methods only work when three conditions apply:
- The causal relationship has a big effect
- The causal relationship is constant / robust
- You have measured variables that can map on to the causal relationship
- This is not the case for MANY systems where these methods are being applied and claimed to work (for example, in biology…), so make sure the system you are trying to "discover" is "discoverable" (!!)
Foundational Introduction to: Data Science IV: Discovery Problems
- Because you know how to analyze GWAS data, you know a lot about how to
approach a learning and / or discovery problem in any discipline:
- Make sure you understand the system, the data, and the goals (!!)
- Apply the simplest learning approaches first: PCA and multiple regression (!!)
- Analysis is not “one and done” - keep trying many possible models and
approaches and think about them and what they are telling you when applied to data (!!)
- Think of all of the possibilities that could lead to the result (i.e., try to talk
yourself out of results) (!!)
- Think deeply about your interpretation (e.g., are you measuring the causal factor directly or maybe something correlated with it?) (!!)
- There is much more to learn BUT
YOU CAN LEARN IT (!!)
Foundational Introduction to: Data Science V: GWAS as a guide
- When describing your previous research:
- Make sure you can talk intuitively and clearly about the work you did (!!)
- Make sure you can talk intuitively and clearly about the analyses you
applied and why AND make sure you understand intuitively what analysis methods are doing (!!)
- Make sure you can talk intuitively and clearly about what you learned (!!)
- Don’t be intimidated about what you don’t know and be honest about
what you don’t know / your desire to learn (!!)
- Know what you are interested in / passionate about AND make sure you
can connect your passion to what the job requires
Foundational Introduction to: Data Science VI: Tips on Getting a Job
That’s it for today
- See you on Thurs.!