[PPT] - GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (GBLUP-RR) PowerPoint Presentation

SLIDE 1

GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (GBLUP-RR)

Paulino Pérez 1 José Crossa 2

1ColPos-México 2CIMMyT-México

September, 2014.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 1/35

SLIDE 2

General comments

Remember,

1

A simple model used frequently in plant breeding stands that the phenotypic value of an individual (P) is expressed as the summation of the genetic value (G) and the residual environmental effect (E): P = G + E, (1) where G includes additive, dominance and epistatic effects.

2

A model that includes solely additive effects (A) can be easily derived from (1), and can be expressed as follows, P = A + E′ (2) where E′ includes effects that are non additive.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 3/35

SLIDE 4

General comments

Continue...

The breeding value (BV) for an individual can be computed based on narrow sense heritability (h2), BVi = µ + h2(yi − µ), where µ is mean phenotypic value of a population and yi is the phenotypic value for individual i. Obviously it is necessary to have information of parents and offsprings to compute this.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 4/35

SLIDE 5

General comments

Continue...

In Genomic Selection (GS), genetic values are approximated using linear regression (Meuwissen et al., 2001), that is: yi = gi + ei = µ +

p

j=1

xijβj + ei (3) Relationships between marker genotypes (x1i: 0 and 1) and phenotypes (yi) of the individuals (open circles) in a training population. If the marker genotype is correlated with the phenotype, segregation is modelled using the bold line (taken from Nakaya and Isobe, 2012).

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 5/35

SLIDE 6

General comments

Continue...

In GS it is possible to obtain Genomic Estimated Breeding Values (GEBVs for short). This can be done simply by adding up marker effects (according to its marker genotypes) obtained from a training population, that is: GEBVi =

p

j=1

xij ˆ βj (4) Next we show how to obtain the predictions ˆ yi (and in some cases ˆ βj) using several models.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 6/35

SLIDE 7

General comments

Continue...

Figure 1: Graphical representation of parametric and non-parametric methods used commonly in whole-genomic prediction.

In this presentation we will focus in Ridge Regression.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 7/35

SLIDE 8

General comments

Continue...

−6 −4 −2 2 4 6 0.0 0.2 0.4 0.6 0.8 βj p(βj) Gaussian Double Exponential Scaled−t (5df) BayesC (π=0.25)

Figure 2: Prior densities of regression coefficients with Markers.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 8/35

SLIDE 9

GBLUP-Ridge Regression

GBLUP-RR

This is the most basic model used in GS. Let yi = gi + ei = µ +

p

j=1

xijβj + ei marker effects are obtained by solving the following optimization problem, min β, λ

(y −
Xjβj)′(y −
Xjβj) + λ
β2

j

,

(5) where λ > 0 is a regularization parameter. Notes:

1

λ is unknown and can be selected by using cross-validation

2

we need to minimize a “penalized sum of squares”.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 9/35

SLIDE 10

GBLUP-Ridge Regression

Continue...

The optimization problem has a closed solution, ˆ β = (X ′X + λI)−1X ′˜ y, where ˜ y = y − µ1. Unfortunately, we need to know the value of λ to use this solution. The problem can be solved easily using the Bayesian framework. Let β ∼ N(0, σ2

βI) and e ∼ N(0, σ2 eI), and u = Xβ, then model (3) can be

written as: y = µ1 + u + e (6) Model (6) is know as GBLUP . Note that u ∼ N(0, σ2

βXX ′)

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 10/35

SLIDE 11

GBLUP-Ridge Regression

Training and testing sets

Note also that the covariance matrix for u involves the product XX ′, which is proportional to the Genomic Relationship Matrix proposed by VanRaden (2008). We will assume that u ∼ N(0, σ2

uG) with G = XX ′/k. The mix-model

equations for (6) are as follows:

1′1σ−2

e

1′σ−2

e

1′σ−2

e

Iσ−2

e σ−2 u

+ Gσ−2

u

ˆ µ ˆ u

=
1′y

y

(7)

u and µ are obtained solving the mix-model equations, assuming that the variance components σ2

e and σ2 u are known.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 11/35

SLIDE 12

GBLUP-Ridge Regression

Continue...

If we have individuals for training and testing, we can partition G and u as follows, G = G11 G12 G21 G22

, u =

u1 u2

, y =

y1 y2

, 1 =

11 12

1=individuals in the training set, 2=individuals in the testing set. ˆ

µ and ˆ u1 are

btained as the solution of the mix-model equations,
1′

111σ−2 e

1′

1σ−2 e

1′

1σ−2 e

I11σ−2

e σ−2 u

+ G11σ−2

u

ˆ µ ˆ u1

=
1′

1y1

y1

The predictions for individuals in the testing set are given by

ˆ y2 = ˆ µ12 + G21G−1

11 ˆ

u1

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 12/35

SLIDE 13

Application examples

Wheat dataset

Data for n = 599 wheat lines evaluated in 4 environments, wheat improvement program, CIMMyT. The dataset includes p = 1279 molecular markers (xij, i = 1, ..., n, j = 1, ..., p) (coded as 0,1). The pedigree information is also available. Lets load the dataset in R,

1

Load R

2

Install BGLR package (if not yet installed)

3

Load the package

4

Load the data

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 13/35

SLIDE 14

Application examples

Continue...

Figure 3: Loading the BGLR package and the wheat dataset.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 14/35

SLIDE 15

Application examples

Continue...

You can explore the MM matrix, pedigree matrix within R, fix(wheat.X) fix(wheat.A)

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 15/35

SLIDE 16

Application examples

Continue...

Lets assume that we want to predict the grain yield for environment 1 using ridge regression or equivalently the GBLUP . We do not know the value for σ2

e

and λ, so we can obtain estimates using the data. We will use the function BGLR. R code below fit the RR model using Bayesian approach with non informative priors for σ2

e, σ2 β, rm(list=ls()) library(BGLR) data(wheat) X=wheat.X Y=wheat.Y setwd(’/tmp/’) #Linear predictor ETA=list(list(X=X,model="BRR")) fmR<-BGLR(y=Y[,1],ETA=ETA,nIter=10000,burnIn=5000,thin=10) plot(fmR$yHat,Y[,1])

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 16/35

SLIDE 17

Application examples

Continue...

●
●
−2.0

−1.5 −1.0 −0.5 0.0 0.5 1.0 −2 −1 1 2 3 y ^ y

Figure shows observed vs predicted grain yield. Predictions ˆ y = ˆ µ + X ˆ β, and estimates for σ2

e, σ2 β can be obtained easily in R > fmR$yHat > fmR$varE [1] 0.5481523 > fmR$ETA[[1]]$varB [1,] 0.002721897 >

From the output above, ˆ λ = ˆ σ2

e/ ˆ

σ2

β = 0.5482/0.0027 = 201.38

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 17/35

SLIDE 18

Application examples

A complete list of the objects attached to fmR can be obtained by typing

> names(fmR) [1] "y" "whichNa" "saveAt" "nIter" "burnIn" [8] "verbose" "response_type" "df0" "S0" "yHat" [15] "SD.mu" "varE" "SD.varE" "fit" "ETA"

A complete description of the output can be found in the BGLR reference manual. The GEBVs can be obtained easily in R,

#GEVBs #option 1 X%*%fmR$ETA[[1]]$b #option 2 fmR$yHat-fmR$mu

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 18/35

SLIDE 19

Application examples

Training and testing set

Lets assume that we want to predict the grain yield for some wheat lines. Assume that we have only the genotypic information for those lines,

#Training and testing set sets<-wheat.sets y<-Y[,1] yNa=y whichNa=(sets==2) yNa[whichNa]=NA fmR<-BGLR(y=yNa,ETA=ETA,nIter=10000, burnIn=5000,thin=10) plot(fmR$yHat,y,xlab="Phenotype", ylab="Pred. Gen. Value" ,cex=.8,bty="L") points(x=y[whichNa],y=fmR$yHat[whichNa],col=2,cex=.8,pch=19) legend("topleft", legend=c("training","testing"),bty="n", pch=c(1,19),col=c("black","red"))

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 19/35

SLIDE 20

Application examples

Continue...

●
−2.0

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5 −2 −1 1 2 3 Phenotype

Pred. Gen. Value
training

testing

> MSE.tst<-mean((fmR$yHat[whichNa]

y[whichNa])^2)

> MSE.tst [1] 0.8110028 > MSE.trn<-mean((fmR$yHat[-whichNa]

y[-whichNa])^2)

> MSE.trn [1] 0.4364856 > COR.tst<-cor(fmR$yHat[whichNa], y[whichNa]) > COR.tst [1] 0.4338218 > COR.trn<-cor(fmR$yHat[-whichNa], y[-whichNa]) > COR.trn [1] 0.7839615

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 20/35

SLIDE 21

Application examples

Questions?

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 21/35

SLIDE 22

Application examples

Excercise

Suppose that we want to predict the grain Yields for individuals in set 2 and environment 4. Write an R program to solve the problem described above Obtain the correlations in the training set Obtain the correlations in the testing set Write the predictions to a csv (comma separated values) so that you can read the file in Excel.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 22/35

SLIDE 23

Application examples

Continue...

Suppose that we are interested in studying the predictive power of GBLUP . We can perform a simulation study to that end. In this exercise you will perform 10 fold cross validation. Write an R program to perform a 10 fold cross-validation, use the use

bject sets to allocate observations to folds.

Report your results.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 23/35

SLIDE 24

Biplot from marker effects

Biplots (The theory...)

A biplot is a two-dimensional representation of a data matrix C showing a point for each of the n observation vectors (rows of C) along with a point for each of the p variables (columns of C), see Gabriel, 1971. Perform the SVD of C, that is C = UDV ′ Let Up×q = [α1, ..., αq] and V q×q = [γ1, ..., γq]. plot α1 vs α2 Draw arrows, the coordinates of the end of the arrow are given in γ1 and γ2.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 24/35

SLIDE 25

Biplot from marker effects

Continue...

Figure 4: An example of a Biplot derived from markers.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 25/35

SLIDE 26

Biplot from marker effects

Continue...

Why is the biplot useful?

1

Points in the biplot are the marker effects projected in the first two components.

2

The “environmental effects” are displayed as vectors whose coordinates are given by γ1 and γ2.

3

The length of the vectors approximates the variance accounted for by the specific molecular marker and “environmental effect”.

4

The cosine of the angle between two environments, approximates the correlation of the two environments with an angle of zero indicating a correlation of +1, an angle of 90◦ (or −90◦) a correlation of 0, and an angle of 180◦ a correlation of −1.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 26/35

SLIDE 27

Biplot from marker effects

R Code

rm(list=ls()) #Importar the data setwd("~/Examples/Biplot/") datos=read.csv("mean_betas_PMBL.csv",header=TRUE) load("wheat.RData") pca.betas= princomp(datos[2:5],cor=TRUE) #Do the Biplot by yourself lambda=pca.betas$sdev[1:2]*sqrt(pca.betas$n.obs) scores=pca.betas$scores[ ,c(1,2)]/lambda variables=pca.betas$loadings[,c(1,2)]*lambda plot(scores, type="p", xlim=c(-0.15, 0.25), ylim=c(-0.15, 0.25), font.lab=2,xlab="Comp.1 (50.17%)",ylab="Comp.2 (24.27%)") abline(v=0, h=0, lty=3) index=c(22,52,91,128,158,249,304,455,522,532,549, 550,933,1115,1247,1259,1277) text(scores[index,],labels=colnames(marker)[index],pos=4) points(scores[index,],pch=19)

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 27/35

SLIDE 28

Biplot from marker effects

Continue...

#identify(scores,labels=colnames(marker)) par(new=TRUE) plot(variables, type="n", xaxt="n", yaxt="n", xlim=c(-24, 40), ylim=c(-24, 40),xlab="",ylab="") arrows(0, 0, 0.6*variables[,1],0.6*variables[,2], len=0.1, lwd=2,col="black") text(0.65*variables, rownames(variables), col="black", xpd=TRUE) axis(3); axis(4);

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 28/35

SLIDE 29

Biplot from marker effects

Exercise

1

Use the wheat dataset described in the previous slides and fit a GBLUP-BRR model and save the marker effects for the 4 environments to an Excel file (csv).

2

Use the marker effects obtained in Step 1, and create a biplot.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 29/35

SLIDE 30

Extension of BRR to include infinitesimal effect

Extension of basic model to include infinitesimal effect

de los Campos et al. (2009) extended the basic BRR model to include an infinitesimal effect, that is: yi = µ +

p

j=1

xijβj + ui + ei, (8) where u ∼ N(0, σ2

uA) and A is the pedigree matrix.

The model can be fitted using the standard linear mixed model theory or using Bayesian methods.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 30/35

SLIDE 31

Extension of BRR to include infinitesimal effect

Example

rm(list=ls()) library(BGLR) data(wheat) X=wheat.X Y=wheat.Y A=wheat.A setwd(’/tmp/’) #Linear predictor ETA=list(list(X=X,model="BRR"), list(K=A,model="RKHS")) fmR<-BGLR(y=Y[,1],ETA=ETA,nIter=10000,burnIn=5000,thin=10) plot(fmR$yHat,Y[,1])

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 31/35

SLIDE 32

Extension of BRR to include infinitesimal effect

Excercise

Suppose that we want to predict the grain Yields for individuals in set 2 and environment 1 using the marker and pedigree information jointly. Write an R program to solve the problem described above Obtain the correlations in the training set Obtain the correlations in the testing set Write the predictions to a csv (comma separated values) so that you can read the file in Excel.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 32/35

SLIDE 33

Extension of BRR to include infinitesimal effect

Marker based vs Maker + Pedigree based model

How can we design a simulation to study prediction ability of a model with markers only and a model that includes Markers + Pedigree?

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 33/35

SLIDE 34

Extension of BRR to include infinitesimal effect

Concluding remarks

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 34/35

SLIDE 35

Extension of BRR to include infinitesimal effect

References

de los Campos G., J. Hickey, R. Pong-Wong, H. D. Daetwyler and M.P .L. Calus,. 2012. Whole Genome Regression and Prediction Methods Applied to Plant and Animal Breeding, Genetics. Gabriel, R. K. 1971. The Biplot Graphical Display of Matrices with Application to Principal Component Analysis. Biometrika, 58, 453-467. Endelman, J. B., 2011. Ridge Regression and Other Kernels for Genomic Selection with R package rrBLUP , The Plant Genome 4(3): 250-255.

SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions (GBLUP-RR) 35/35

GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (GBLUP-RR)

Paulino Pérez 1 José Crossa 2

September, 2014.

Contents

General comments

GBLUP-Ridge Regression

Application examples

Biplot from marker effects

Extension of BRR to include infinitesimal effect

General comments

Remember,

A simple model used frequently in plant breeding stands that the phenotypic value of an individual (P) is expressed as the summation of the genetic value (G) and the residual environmental effect (E): P = G + E, (1) where G includes additive, dominance and epistatic effects.

A model that includes solely additive effects (A) can be easily derived from (1), and can be expressed as follows, P = A + E′ (2) where E′ includes effects that are non additive.

Continue...

Continue...

In Genomic Selection (GS), genetic values are approximated using linear regression (Meuwissen et al., 2001), that is: yi = gi + ei = µ +

xijβj + ei (3) Relationships between marker genotypes (x1i: 0 and 1) and phenotypes (yi) of the individuals (open circles) in a training population. If the marker genotype is correlated with the phenotype, segregation is modelled using the bold line (taken from Nakaya and Isobe, 2012).

Continue...

In GS it is possible to obtain Genomic Estimated Breeding Values (GEBVs for short). This can be done simply by adding up marker effects (according to its marker genotypes) obtained from a training population, that is: GEBVi =

xij ˆ βj (4) Next we show how to obtain the predictions ˆ yi (and in some cases ˆ βj) using several models.

Continue...

Figure 1: Graphical representation of parametric and non-parametric methods used commonly in whole-genomic prediction.

In this presentation we will focus in Ridge Regression.

Continue...

Figure 2: Prior densities of regression coefficients with Markers.

GBLUP-RR

This is the most basic model used in GS. Let yi = gi + ei = µ +

xijβj + ei marker effects are obtained by solving the following optimization problem, min β, λ

(5) where λ > 0 is a regularization parameter. Notes:

λ is unknown and can be selected by using cross-validation

we need to minimize a “penalized sum of squares”.

Continue...

The optimization problem has a closed solution, ˆ β = (X ′X + λI)−1X ′˜ y, where ˜ y = y − µ1. Unfortunately, we need to know the value of λ to use this solution. The problem can be solved easily using the Bayesian framework. Let β ∼ N(0, σ2

written as: y = µ1 + u + e (6) Model (6) is know as GBLUP . Note that u ∼ N(0, σ2

Training and testing sets

Note also that the covariance matrix for u involves the product XX ′, which is proportional to the Genomic Relationship Matrix proposed by VanRaden (2008). We will assume that u ∼ N(0, σ2

equations for (6) are as follows:

1′σ−2

1′σ−2

Iσ−2

+ Gσ−2

ˆ µ ˆ u

y

u and µ are obtained solving the mix-model equations, assuming that the variance components σ2

Continue...

If we have individuals for training and testing, we can partition G and u as follows, G = G11 G12 G21 G22

u1 u2

y1 y2

11 12

µ and ˆ u1 are

1′

1′

I11σ−2

+ G11σ−2

ˆ µ ˆ u1

y1

ˆ y2 = ˆ µ12 + G21G−1

u1

Wheat dataset

Data for n = 599 wheat lines evaluated in 4 environments, wheat improvement program, CIMMyT. The dataset includes p = 1279 molecular markers (xij, i = 1, ..., n, j = 1, ..., p) (coded as 0,1). The pedigree information is also available. Lets load the dataset in R,

Load R

Install BGLR package (if not yet installed)

Load the package

Load the data

Continue...

Figure 3: Loading the BGLR package and the wheat dataset.

Continue...

You can explore the MM matrix, pedigree matrix within R, fix(wheat.X) fix(wheat.A)

Continue...

Lets assume that we want to predict the grain yield for environment 1 using ridge regression or equivalently the GBLUP . We do not know the value for σ2

and λ, so we can obtain estimates using the data. We will use the function BGLR. R code below fit the RR model using Bayesian approach with non informative priors for σ2

Continue...

Figure shows observed vs predicted grain yield. Predictions ˆ y = ˆ µ + X ˆ β, and estimates for σ2

From the output above, ˆ λ = ˆ σ2

σ2

A complete list of the objects attached to fmR can be obtained by typing

A complete description of the output can be found in the BGLR reference manual. The GEBVs can be obtained easily in R,

Training and testing set

Lets assume that we want to predict the grain yield for some wheat lines. Assume that we have only the genotypic information for those lines,

Continue...