[PPT] - Statistical Analysis of Genetic and Phenotypic Data for Breeders: PowerPoint Presentation

SLIDE 1

Statistical Analysis of Genetic and Phenotypic Data for Breeders: Hands on Practical Sessions (GBLUP-RR)

Paulino Pérez 1 José Crossa 2

1ColPos-México 2CIMMyT-México

June, 2015.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 1/34

SLIDE 2

General comments

Remember,

1

A simple model used frequently in plant breeding stands that the phenotypic value of an individual (P) is expressed as the summation of the genetic value (G) and the residual environmental effect (E): P = G + E, (1) where G includes additive, dominance and epistatic effects.

2

A model that includes solely additive effects (A) can be easily derived from (1), and can be expressed as follows, P = A + E′ (2) where E′ includes effects that are non additive.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 3/34

SLIDE 4

General comments

Continue...

The breeding value (BV) for an individual can be computed based on narrow sense heritability (h2), BVi = µ + h2(yi − µ), where µ is mean phenotypic value of a population and yi is the phenotypic value for individual i. Obviously it is necessary to have information of parents and offsprings to compute this.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 4/34

SLIDE 5

General comments

Continue...

In Genomic Selection (GS), genetic values are approximated using linear regression (Meuwissen et al., 2001), that is: yi = gi + ei = µ +

p

j=1

xijβj + ei (3) Relationships between marker genotypes (x1i: 0 and 1) and phenotypes (yi) of the individuals (open circles) in a training population. If the marker genotype is correlated with the phenotype, segregation is modelled using the bold line (taken from Nakaya and Isobe, 2012).

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 5/34

SLIDE 6

General comments

Continue...

In GS it is possible to obtain Genomic Estimated Breeding Values (GEBVs for short). This can be done simply by adding up marker effects (according to its marker genotypes) obtained from a training population, that is: GEBVi =

p

j=1

xij ˆ βj (4) Next we show how to obtain the predictions ˆ yi (and in some cases ˆ βj) using several models.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 6/34

SLIDE 7

General comments

Continue...

Figure 1: Graphical representation of parametric and non-parametric methods used commonly in whole-genomic prediction.

In this presentation we will focus in Ridge Regression.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 7/34

SLIDE 8

General comments

Continue...

−6 −4 −2 2 4 6 0.0 0.2 0.4 0.6 0.8 βj p(βj) Gaussian Double Exponential Scaled−t (5df) BayesC (π=0.25)

Figure 2: Prior densities of regression coefficients with Markers.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 8/34

SLIDE 9

GBLUP-Ridge Regression

GBLUP-RR

This is the most basic model used in GS. Let yi = gi + ei = µ +

p

j=1

xijβj + ei marker effects are obtained by solving the following optimization problem, min β, λ

(y −
Xjβj)′(y −
Xjβj) + λ
β2

j

,

(5) where λ > 0 is a regularization parameter. Notes:

1

λ is unknown and can be selected by using cross-validation

2

we need to minimize a “penalized sum of squares”.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 9/34

SLIDE 10

GBLUP-Ridge Regression

Continue...

The optimization problem has a closed solution, ˆ β = (X ′X + λI)−1X ′˜ y, where ˜ y = y − µ1. Unfortunately, we need to know the value of λ to use this solution. The problem can be solved easily using the Bayesian framework. Let β ∼ N(0, σ2

βI) and e ∼ N(0, σ2 eI), and u = Xβ, then model (3) can be

written as: y = µ1 + u + e (6) Model (6) is know as GBLUP . Note that u ∼ N(0, σ2

βXX ′)

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 10/34

SLIDE 11

GBLUP-Ridge Regression

Training and testing sets

Note also that the covariance matrix for u involves the product XX ′, which is proportional to the Genomic Relationship Matrix proposed by VanRaden (2008). We will assume that u ∼ N(0, σ2

uG) with G = XX ′/k. The mix-model

equations for (6) are as follows:

1′1σ−2

e

1′σ−2

e

1′σ−2

e

Iσ−2

e σ−2 u

+ Gσ−2

u

ˆ µ ˆ u

=
1′y

y

(7)

u and µ are obtained solving the mix-model equations, assuming that the variance components σ2

e and σ2 u are known.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 11/34

SLIDE 12

GBLUP-Ridge Regression

Continue...

If we have individuals for training and testing, we can partition G and u as follows, G = G11 G12 G21 G22

, u =

u1 u2

, y =

y1 y2

, 1 =

11 12

1=individuals in the training set, 2=individuals in the testing set. ˆ

µ and ˆ u1 are

btained as the solution of the mix-model equations,
1′

111σ−2 e

1′

1σ−2 e

1′

1σ−2 e

I11σ−2

e σ−2 u

+ G11σ−2

u

ˆ µ ˆ u1

=
1′

1y1

y1

The predictions for individuals in the testing set are given by

ˆ y2 = ˆ µ12 + G21G−1

11 ˆ

u1

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 12/34

SLIDE 13

Application examples

Wheat dataset

Data for n = 599 wheat lines evaluated in 4 environments, wheat improvement program, CIMMyT. The dataset includes p = 1279 molecular markers (xij, i = 1, ..., n, j = 1, ..., p) (coded as 0,1). The pedigree information is also available. Lets load the dataset in R,

1

Load R

2

Install BGLR package (if not yet installed)

3

Load the package

4

Load the data

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 13/34

SLIDE 14

Application examples

Continue...

Figure 3: Loading the BGLR package and the wheat dataset.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 14/34

SLIDE 15

Application examples

Continue...

You can explore the MM matrix, pedigree matrix within R, fix(wheat.X) fix(wheat.A)

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 15/34

SLIDE 16

Application examples

Continue...

Lets assume that we want to predict the grain yield for environment 1 using ridge regression or equivalently the GBLUP . We do not know the value for σ2

e

and λ, so we can obtain estimates using the data. We will use the function BGLR. R code below fit the RR model using Bayesian approach with non informative priors for σ2

e, σ2 β, rm(list=ls()) library(BGLR) data(wheat) X=wheat.X Y=wheat.Y #Linear predictor ETA=list(list(X=X,model="BRR")) #Or #ETA=list(Markers=list(X=X,model="BRR")) fmR<-BGLR(y=Y[,1],ETA=ETA,nIter=10000,burnIn=5000,thin=10) plot(fmR$yHat,Y[,1])

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 16/34

SLIDE 17

Application examples

Continue...

●
●
−2.0

−1.5 −1.0 −0.5 0.0 0.5 1.0 −2 −1 1 2 3 y ^ y

Figure shows observed vs predicted grain yield. Predictions ˆ y = ˆ µ + X ˆ β, and estimates for σ2

e, σ2 β can be obtained easily in R > fmR$yHat > fmR$varE [1] 0.5481523 > fmR$ETA[[1]]$varB [1,] 0.002721897 >

From the output above, ˆ λ = ˆ σ2

e/ ˆ

σ2

β = 0.5482/0.0027 = 201.38

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 17/34

SLIDE 18

Application examples

A complete list of the objects attached to fmR can be obtained by typing

> names(fmR) [1] "y" "whichNa" "saveAt" "nIter" "burnIn" [8] "verbose" "response_type" "df0" "S0" "yHat" [15] "SD.mu" "varE" "SD.varE" "fit" "ETA"

A complete description of the output can be found in the BGLR reference manual. The GEBVs can be obtained easily in R,

#GEVBs #option 1 X%*%fmR$ETA[[1]]$b #option 2 fmR$yHat-fmR$mu #Or fmR$ETA$Markers$b

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 18/34

SLIDE 19

Application examples

Training and testing set

Lets assume that we want to predict the grain yield for some wheat lines. Assume that we have only the genotypic information for those lines,

#Training and testing set sets<-wheat.sets y<-Y[,1] yNa=y whichNa=(sets==2) yNa[whichNa]=NA fmR<-BGLR(y=yNa,ETA=ETA,nIter=10000, burnIn=5000,thin=10) plot(fmR$yHat,y,xlab="Phenotype", ylab="Pred. Gen. Value" ,cex=.8,bty="L") points(x=y[whichNa],y=fmR$yHat[whichNa],col=2,cex=.8,pch=19) legend("topleft", legend=c("training","testing"),bty="n", pch=c(1,19),col=c("black","red"))

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 19/34

SLIDE 20

Application examples

Continue...

●
−2.0

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5 −2 −1 1 2 3 Phenotype

Pred. Gen. Value
training

testing

> MSE.tst<-mean((fmR$yHat[whichNa]

y[whichNa])^2)

> MSE.tst [1] 0.8110028 > MSE.trn<-mean((fmR$yHat[-whichNa]

y[-whichNa])^2)

> MSE.trn [1] 0.4364856 > COR.tst<-cor(fmR$yHat[whichNa], y[whichNa]) > COR.tst [1] 0.4338218 > COR.trn<-cor(fmR$yHat[-whichNa], y[-whichNa]) > COR.trn [1] 0.7839615

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 20/34

SLIDE 21

Application examples

Questions?

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 21/34

SLIDE 22

Application examples

Exercise

Suppose that we want to predict the grain Yields for individuals in set 2 and environment 4. Write an R program to solve the problem described above Obtain the correlations in the training set Obtain the correlations in the testing set Write the predictions to a csv (comma separated values) so that you can read the file in Excel.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 22/34

SLIDE 23

Application examples

Continue...

Suppose that we are interested in studying the predictive power of GBLUP . We can perform a simulation study to that end. In this exercise you will perform 10 fold cross validation. Write an R program to perform a 10 fold cross-validation, use the use

bject sets to allocate observations to folds.

Report your results.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 23/34

SLIDE 24

Biplot from marker effects

Biplots (The theory...)

A biplot is a two-dimensional representation of a data matrix C showing a point for each of the n observation vectors (rows of C) along with a point for each of the p variables (columns of C), see Gabriel, 1971. Perform the SVD of C, that is C = UDV ′ Let Up×q = [α1, ..., αq] and V q×q = [γ1, ..., γq]. plot α1 vs α2 Draw arrows, the coordinates of the end of the arrow are given in γ1 and γ2.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 24/34

SLIDE 25

Biplot from marker effects

Continue...

●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−0.1

0.0 0.1 0.2 −0.1 0.0 0.1 0.2 Comp.1 (50.17%) Comp.2 (24.27%) wPt.4600 wPt.3533 wPt.2644 wPt.3462 wPt.3697 wPt.4988 wPt.6047 wPt.3922 wPt.9256 wPt.3393 wPt.3904 wPt.4706 c.345897 c.373879 c.380591 c.381717 c.408424

E1

E2 E3 E4 −20 −10 10 20 30 40 −20 −10 10 20 30 40

Figure 4: An example of a Biplot derived from markers.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 25/34

SLIDE 26

Biplot from marker effects

Continue...

Why is the biplot useful?

1

Points in the biplot are the marker effects projected in the first two components.

2

The “environmental effects” are displayed as vectors whose coordinates are given by γ1 and γ2.

3

The length of the vectors approximates the variance accounted for by the specific molecular marker and “environmental effect”.

4

The cosine of the angle between two environments, approximates the correlation of the two environments with an angle of zero indicating a correlation of +1, an angle of 90◦ (or −90◦) a correlation of 0, and an angle of 180◦ a correlation of −1.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 26/34

SLIDE 27

Biplot from marker effects

R Code

rm(list=ls()) #Set the working directory setwd("C:/Users/P.P.RODRIGUEZ/Desktop/Slides Paulino/3. GBLUP-RR/examples") #Function for biplots source("biplot.R") #Import the data data=read.csv("mean_betas_PMBL.csv",header=TRUE) data=data[,2:5] #Principal component analysis pca.betas= princomp(data,cor=TRUE) #Default biplot in R biplot(pca.betas) #Modified function my.biplot.princomp(pca.betas)

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 27/34

SLIDE 28

Biplot from marker effects

Exercise

1

Use the wheat dataset described in the previous slides and fit a GBLUP-BRR model and save the marker effects for the 4 environments to an Excel file (csv).

2

Use the marker effects obtained in Step 1, and create a biplot.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 28/34

SLIDE 29

Extension of BRR to include infinitesimal effect

Extension of basic model to include infinitesimal effect

de los Campos et al. (2009) extended the basic BRR model to include an infinitesimal effect, that is: yi = µ +

p

j=1

xijβj + ui + ei, (8) where u ∼ N(0, σ2

uA) and A is the pedigree matrix.

The model can be fitted using the standard linear mixed model theory or using Bayesian methods.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 29/34

SLIDE 30

Extension of BRR to include infinitesimal effect

Example

rm(list=ls()) library(BGLR) data(wheat) X=wheat.X Y=wheat.Y A=wheat.A #Linear predictor ETA=list(list(X=X,model="BRR"), list(K=A,model="RKHS")) fmR<-BGLR(y=Y[,1],ETA=ETA,nIter=10000,burnIn=5000,thin=10) plot(fmR$yHat,Y[,1])

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 30/34

SLIDE 31

Extension of BRR to include infinitesimal effect

Exercise

Suppose that we want to predict the grain Yields for individuals in set 2 and environment 1 using the marker and pedigree information jointly. Write an R program to solve the problem described above Obtain the correlations in the training set Obtain the correlations in the testing set Write the predictions to a csv (comma separated values) so that you can read the file in Excel.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 31/34

SLIDE 32

Extension of BRR to include infinitesimal effect

Marker based vs Maker + Pedigree based model

How can we design a simulation to study prediction ability of a model with markers only and a model that includes Markers + Pedigree?

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 32/34

SLIDE 33

Extension of BRR to include infinitesimal effect

Concluding remarks

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 33/34

SLIDE 34

Extension of BRR to include infinitesimal effect

References

de los Campos G., J. Hickey, R. Pong-Wong, H. D. Daetwyler and M.P .L. Calus,. 2012. Whole Genome Regression and Prediction Methods Applied to Plant and Animal Breeding, Genetics. Gabriel, R. K. 1971. The Biplot Graphical Display of Matrices with Application to Principal Component Analysis. Biometrika, 58, 453-467. Endelman, J. B., 2011. Ridge Regression and Other Kernels for Genomic Selection with R package rrBLUP , The Plant Genome 4(3): 250-255.

CIMMYT, México-SAGPDB Statistical Analysis of Genetic and Phenotypic Data for Breeders:Hands on Practical 34/34

Statistical Analysis of Genetic and Phenotypic Data for Breeders: Hands on Practical Sessions (GBLUP-RR)

Paulino Pérez 1 José Crossa 2

June, 2015.

Contents

General comments

GBLUP-Ridge Regression

Application examples

Biplot from marker effects

Extension of BRR to include infinitesimal effect

General comments

Remember,

A simple model used frequently in plant breeding stands that the phenotypic value of an individual (P) is expressed as the summation of the genetic value (G) and the residual environmental effect (E): P = G + E, (1) where G includes additive, dominance and epistatic effects.

A model that includes solely additive effects (A) can be easily derived from (1), and can be expressed as follows, P = A + E′ (2) where E′ includes effects that are non additive.

Continue...

Continue...

In Genomic Selection (GS), genetic values are approximated using linear regression (Meuwissen et al., 2001), that is: yi = gi + ei = µ +

xijβj + ei (3) Relationships between marker genotypes (x1i: 0 and 1) and phenotypes (yi) of the individuals (open circles) in a training population. If the marker genotype is correlated with the phenotype, segregation is modelled using the bold line (taken from Nakaya and Isobe, 2012).

Continue...

In GS it is possible to obtain Genomic Estimated Breeding Values (GEBVs for short). This can be done simply by adding up marker effects (according to its marker genotypes) obtained from a training population, that is: GEBVi =

xij ˆ βj (4) Next we show how to obtain the predictions ˆ yi (and in some cases ˆ βj) using several models.

Continue...

Figure 1: Graphical representation of parametric and non-parametric methods used commonly in whole-genomic prediction.

In this presentation we will focus in Ridge Regression.

Continue...

Figure 2: Prior densities of regression coefficients with Markers.

GBLUP-RR

This is the most basic model used in GS. Let yi = gi + ei = µ +

xijβj + ei marker effects are obtained by solving the following optimization problem, min β, λ

(5) where λ > 0 is a regularization parameter. Notes:

λ is unknown and can be selected by using cross-validation

we need to minimize a “penalized sum of squares”.

Continue...

The optimization problem has a closed solution, ˆ β = (X ′X + λI)−1X ′˜ y, where ˜ y = y − µ1. Unfortunately, we need to know the value of λ to use this solution. The problem can be solved easily using the Bayesian framework. Let β ∼ N(0, σ2

written as: y = µ1 + u + e (6) Model (6) is know as GBLUP . Note that u ∼ N(0, σ2

Training and testing sets

Note also that the covariance matrix for u involves the product XX ′, which is proportional to the Genomic Relationship Matrix proposed by VanRaden (2008). We will assume that u ∼ N(0, σ2

equations for (6) are as follows:

1′σ−2

1′σ−2

Iσ−2

+ Gσ−2

ˆ µ ˆ u

y

u and µ are obtained solving the mix-model equations, assuming that the variance components σ2

Continue...

If we have individuals for training and testing, we can partition G and u as follows, G = G11 G12 G21 G22

u1 u2

y1 y2

11 12

µ and ˆ u1 are

1′

1′

I11σ−2

+ G11σ−2

ˆ µ ˆ u1

y1

ˆ y2 = ˆ µ12 + G21G−1

u1

Wheat dataset

Data for n = 599 wheat lines evaluated in 4 environments, wheat improvement program, CIMMyT. The dataset includes p = 1279 molecular markers (xij, i = 1, ..., n, j = 1, ..., p) (coded as 0,1). The pedigree information is also available. Lets load the dataset in R,

Load R

Install BGLR package (if not yet installed)

Load the package

Load the data

Continue...

Figure 3: Loading the BGLR package and the wheat dataset.

Continue...

You can explore the MM matrix, pedigree matrix within R, fix(wheat.X) fix(wheat.A)

Continue...

Lets assume that we want to predict the grain yield for environment 1 using ridge regression or equivalently the GBLUP . We do not know the value for σ2

and λ, so we can obtain estimates using the data. We will use the function BGLR. R code below fit the RR model using Bayesian approach with non informative priors for σ2

Continue...

Figure shows observed vs predicted grain yield. Predictions ˆ y = ˆ µ + X ˆ β, and estimates for σ2

From the output above, ˆ λ = ˆ σ2

σ2

A complete list of the objects attached to fmR can be obtained by typing

A complete description of the output can be found in the BGLR reference manual. The GEBVs can be obtained easily in R,

Training and testing set

Lets assume that we want to predict the grain yield for some wheat lines. Assume that we have only the genotypic information for those lines,

Continue...