J. Anim. Breed. Genet. ISSN 0931-2668

ORIGINAL ARTICLE

Using the genomic relationship matrix to predict the accuracy of genomic selection

M.E. Goddard1,2, B.J. Hayes2 & T.H.E. Meuwissen3

1 Department of Agriculture and Food Systems, University of Melbourne, Melbourne, Vic., Australia
2 Biosciences Research Division, Victorian Department of Primary Industries, Bundoora, Vic., Australia
3 Norwegian University of Life Sciences, Ås, Norway

© 2011 Blackwell Verlag GmbH • J. Anim. Breed. Genet. 128 (2011) 409–421 doi:10.1111/j.1439-0388.2011.00964.x

Keywords
Genomic selection; relationship matrix.

Correspondence
M. Goddard, Biosciences Research Division, Victorian Department of Agriculture, 1 Park Drive, Bundoora, Vic. 3083, Australia. Tel: +61 39032 7091; Fax: +61 39032 7158; E-mail: mike.goddard@dpi.vic.gov.au

Received: 6 February 2011; accepted: 18 August 2011

Summary

Estimated breeding values (EBVs) using data from genetic markers can be predicted using a genomic relationship matrix, derived from the animals' genotypes, and best linear unbiased prediction. However, if the accuracy of the EBVs is calculated in the usual manner (from the inverse element of the coefficient matrix), it is likely to be overestimated owing to sampling errors in elements of the genomic relationship matrix. We show here that the correct accuracy can be obtained by regressing the relationship matrix towards the pedigree relationship matrix so that it is an unbiased estimate of the relationships at the QTL controlling the trait. This method shows how the accuracy increases as the number of markers used increases because the regression coefficient (of genomic relationship towards pedigree relationship) increases. We also present a deterministic method for predicting the accuracy of such genomic EBVs before data on individual animals are collected. This method estimates the proportion of genetic variance explained by the markers, which is equal to the regression coefficient described above, and the accuracy with which marker effects are estimated. The latter depends on the variance in relationship between pairs of animals, which equals the mean linkage disequilibrium over all pairs of loci. The theory was validated using simulated data and data on fat concentration in the milk of Holstein cattle.

Introduction

The matrix of relationships among a group of individuals can be used to predict their breeding values, to manage inbreeding and in genetic conservation. This relationship matrix can be calculated from the pedigree, but it is also possible to calculate the relationship matrix from genotypes at genetic markers such as single-nucleotide polymorphisms (SNPs). Elements of the genomic relationship matrix are estimates of the realized proportion of the genome that two individuals share, whereas the pedigree-derived relationship matrix is the expectation of this proportion. This genomic relationship matrix can be used in genomic selection to estimate breeding values.

Genomic selection refers to the use of a large number of genetic markers, such as SNPs, covering the whole genome to predict the genetic value of individuals (Meuwissen et al. 2001). The individuals might be people whose genetic risk of developing a complex disease is being predicted, or they might be domestic animals or plants in which estimates of


their breeding value will be used to select parents to breed the next generation. In cattle, the availability of high-throughput, high-density genotyping with SNP chips has led to the widespread adoption of genomic selection in dairy cattle breeding programmes, where it is predicted to double the rate of genetic improvement (Schaeffer 2006; Dalton 2009).

Traditionally, livestock have been selected on the basis of estimated breeding values (EBVs) calculated from data on phenotype and pedigree using a statistical technique called best linear unbiased prediction (BLUP) (Henderson 1984). A desirable feature of this method is that the accuracy of the EBVs can be calculated as part of the statistical analysis. This is not the case with many methods used for genomic selection. Currently, the most trusted method for assessing the accuracy of genomic EBVs is an empirical test in which a sample of animals have genomic EBVs calculated, and then additional phenotypic data are collected to assess how accurately the EBVs predict these new data. This is time-consuming, fails to predict the accuracy of individual EBVs and is wasteful in that the new data are used only to estimate accuracy and not to improve the prediction of breeding value. Other cross-validation techniques can also be used but they are also time-consuming and do not yield the accuracy of the final prediction, or individual accuracies. In practice, some authors have used the inverse of the mixed model or BLUP equations (e.g. VanRaden 2008; Hayes et al. 2009a,b) but, as we show in this paper, this can overestimate the accuracy.

It would be very useful to be able to predict the accuracy of EBVs calculated using genomic selection in two situations. Firstly, after the data have been collected and are being analysed, it would be useful to calculate accuracies of EBVs as part of the statistical analysis, as is done for traditional EBVs, including for individuals without their own phenotypes. In this situation, we are interested in calculating the accuracy of the EBVs of individual animals. Secondly, when planning a selection programme using genomic selection, it would be useful to be able to predict the accuracy of alternative designs so that the best one could be implemented. In this situation, we wish to predict the accuracy of EBVs of classes of animals, which we might then use in deterministic simulations of alternative breeding programmes. This paper presents methods for both of these situations.

After the data have been collected and are being analysed, it should be possible to predict the accuracy from the properties of the statistical method (as in VanRaden 2008 and Harris & Johnson 2010). However, this requires that the statistical model matches the true situation. For instance, if the model makes assumptions about the distribution of the effects of genes affecting the trait (QTL), this should match the real distribution. If we assume that there are a very large number of QTL whose effects follow a normal distribution with constant variance, the analysis (called BLUP by Meuwissen et al. 2001) is robust to departures from this assumption and the accuracy is little affected even if the distribution of QTL effects does not follow a normal distribution. However, empirical tests of the accuracy derived from the inverse of the BLUP equations often find that it is overestimated by this method, dramatically so if it is used to predict breeding values in one breed based on data from another breed (Hayes et al. 2009a). An anomaly of the method is that it does not predict increasing accuracy as the number of markers is increased, and this explains why it overestimates accuracy, as shown below. In this paper, we describe how to calculate the accuracy of genomic EBVs after the data have been collected, taking into account the number of markers used.

The accuracy of genomic EBVs expected before data are collected has been considered by Goddard (2009) for unrelated animals and by Hayes et al. (2009b) for simple family structures such as groups of full-sibs and half-sibs. Their deterministic method treats the genome as if it were a series of small chromosomal segments, each of which is inherited independently. Here, we treat chromosomes as continuous and show that a similar prediction results.

The objectives of this paper are twofold: (i) to derive a method for calculating the accuracy of EBVs calculated using a known genomic relationship matrix; and (ii) to derive a method to predict this accuracy before the data on individual animals are collected. In the Materials and Methods section, we first develop the theory to predict accuracy and then describe the simulation and real data in which it is tested. We only consider the genomic selection method called BLUP by Meuwissen et al. (2001).

Materials and methods

Theory

Calculation of accuracy after data are collected

Consider a group of T animals whose breeding values are controlled by Q QTL. At the jth QTL, the genotypes (00, 01, 11) have frequencies $(1-p_j)^2$, $2p_j(1-p_j)$ and $p_j^2$, respectively, and are described for the ith animal by $w_{ij} = -2p_j$, $1-2p_j$ and $2-2p_j$, respectively, for the three genotypes. This coding causes the mean of $w_{ij}$ over animals to be zero. Let

$$y = \text{fixed effects} + g + e$$
$$g = Wu$$

where y is a T × 1 vector of phenotypic values, g is a T × 1 vector of breeding values, u is a Q × 1 vector of effects of QTL assumed $N(0, I\sigma_u^2)$, W is a T × Q matrix describing the genotype of each animal i at each QTL j by $w_{ij}$, e is a T × 1 vector of environmental effects, and

$$V(g) = WW'\sigma_u^2$$
$$V(e) = I\sigma_e^2$$
$$V(y) = WW'\sigma_u^2 + I\sigma_e^2$$
$$h^2 = \sigma_g^2/(\sigma_g^2 + \sigma_e^2)$$

Thus, ignoring fixed effects, the model $y = Wu + e$ is equivalent to the conventional model $y = g + e$ if

$$V(g) = G\sigma_g^2 = WW'\sigma_u^2$$

That is, the genomic model is equivalent to a conventional animal model with the relationship matrix calculated from the QTL genotypes. This equivalence has been pointed out before but usually in terms of marker genotypes and effects rather than QTL genotypes and effects (e.g. Nejati-Javaremi et al. 1997; Villanueva et al. 2005; Fernando 1998; Habier et al. 2007; VanRaden et al. 2009; Goddard 2009; Strandén & Garrick 2009). In practice, we will use the markers to estimate G, but this formulation makes it clear that it is the relationship matrix based on the QTL that we need to estimate.

Let G = A + D, where A is the relationship matrix based on pedigree and D represents deviations from A owing to the observed segregation of alleles at the QTL. In a conventional BLUP, we use A instead of G.

This still leads to unbiased estimates of g and to appropriate estimates of accuracy because E(G|A) = A. That is, A is an unbiased estimate of G in the same way that BLUP EBVs are an unbiased estimate of true breeding value. We need an estimate of G based on the marker genotypes that also has this property.

Let $G_m^* = W_m W_m'/M^*$, where $W_m$ is a matrix defined in the same way as W above but recording the genotypes at markers instead of QTL, and $M^* = \sum_j 2p_j(1-p_j)$. Then

$$G_m^* = A + D + E$$

where the errors (E) occur because the markers used are a sample of genomic positions and so $G_m^*$ includes some sampling errors. $G_m^*$ can be improved by taking a weighted average of the relationship estimated from each marker, where the weights are the inverse of the prediction error variance (PEV) (Powell et al. 2010; Yang et al. 2010). This can be achieved by defining $x_{ij}$ as $w_{ij}/\sqrt{2p_j(1-p_j)}$ and calculating $G_m$ as

$$G_m = XX'/M$$

where M is the number of markers. However, $G_m$ is still not unbiased in the sense we require, i.e. $E(G|G_m)$ is not $G_m$. Instead, we use

$$\hat{G} = A + b(G_m - A) \qquad (1)$$

where b is the regression of elements of G − A on elements of $G_m - A$:

$$b = V(D)/[V(D) + V(E)] \qquad (2)$$
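As an illustration of equations (1) and (2), the following minimal sketch builds $G_m = XX'/M$ from 0/1/2 allele counts and regresses it towards a pedigree matrix. It is not the authors' code: the function names, the simulated genotypes, the use of known allele frequencies, and the choice A = I with b = 0.78 are all assumptions made for the example.

```python
import numpy as np

def standardized_genotypes(geno, p):
    """Return X with x_ij = w_ij / sqrt(2 p_j (1 - p_j)), where w_ij = geno_ij - 2 p_j.

    geno: (T, M) array of allele counts coded 0, 1, 2.
    p:    (M,) array of allele frequencies (assumed known here).
    """
    w = geno - 2.0 * p
    return w / np.sqrt(2.0 * p * (1.0 - p))

def genomic_relationship(geno, p):
    """G_m = X X' / M, with M the number of markers."""
    x = standardized_genotypes(geno, p)
    return x @ x.T / x.shape[1]

def regress_towards_pedigree(gm, a, b):
    """Equation (1): G_hat = A + b (G_m - A)."""
    return a + b * (gm - a)

# Hypothetical example: 50 unrelated animals (A = I), 2000 markers.
rng = np.random.default_rng(1)
T, M = 50, 2000
p = rng.uniform(0.1, 0.9, size=M)
geno = rng.binomial(2, p, size=(T, M)).astype(float)

gm = genomic_relationship(geno, p)
a = np.eye(T)
g_hat = regress_towards_pedigree(gm, a, b=0.78)  # b chosen for illustration only
```

With b = 1 the function returns $G_m$ unchanged and with b = 0 it returns A, which matches the interpretation of b as the weight given to the marker information.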

In equation (2), V(D) is the variance of $D_{ij}$ across all possible pairs of relationship, i.e. it is a scalar, and V(E) is similarly defined. To indicate that V(D) is a variance across the elements of D, D is not written in bold type. The V(D) could be predicted from theory but it may be better to estimate it from the data. Assuming QTL have the same properties as the markers, we can assess how well the markers predict the relationship based on QTL by how well they predict the relationship based on other markers. If we randomly split the markers into two non-overlapping sets and calculate $G_{m1}$ and $G_{m2}$ for the two sets, then $c = \mathrm{cov}(G_{m1} - A, G_{m2} - A)$ estimates V(D) and $V(G_{m1} - G_{m2})$ estimates 2V(E), where V(E) is the variance of the errors in $G_{m1}$ (or $G_{m2}$). [We exclude the diagonal elements of G and A from these calculations because they have slightly different properties to the non-diagonal elements and it is the latter that are important.] Yang et al. (2010) show that V(E) = 1/M, where M is the number of markers. Then, b can be estimated as

$$b = 1 - \frac{1/M}{c + 1/M} = 1 - \frac{1}{cM + 1} \qquad (2a)$$

The regression coefficient b may vary among subsets of the data. For instance, relationships between animals from different breeds may have a smaller V(D) and hence a smaller value of b than relationships within a breed. It would be desirable to calculate b separately for categories of relationship that differ greatly in V(D), but we will not pursue that possibility in this paper.

If the QTL differ in a systematic way from the markers, then V(E) will be greater than expected from the finite number of markers. An alternative derivation of $\hat{G}$ shows how to estimate b in this case. Let

$$y = q + a + e$$

where a is a vector of polygenic effects not captured by the markers, $N(0, A\sigma_a^2)$; q = Xm is a vector of breeding values explained by markers; and m is a vector of standardized marker effects, $N(0, I\sigma_m^2)$. Then

$$V(q) = G_m\sigma_q^2 \quad \text{and} \quad G_m = XX'/M$$
$$g = q + a$$
$$\sigma_g^2 = \sigma_q^2 + \sigma_a^2$$
$$V(g) = G\sigma_g^2 = G_m\sigma_q^2 + A\sigma_a^2$$

The variance components in this linear model can be estimated by REML, for example, and then $b = \sigma_q^2/\sigma_g^2$.
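The marker-based route to b in equation (2a), which splits the markers into two sets, can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the two relationship matrices are synthetic (a shared deviation D with variance 0.0035 plus independent errors with variance 0.001, values borrowed from the simulation results reported later) rather than computed from real genotypes.

```python
import numpy as np

def offdiag(mat):
    """Upper off-diagonal elements; self-relationships are excluded as in the text."""
    i, j = np.triu_indices_from(mat, k=1)
    return mat[i, j]

def estimate_b(gm1, gm2, a, m):
    """Split-marker estimate of b, equation (2a), with m markers per set."""
    d1 = offdiag(gm1 - a)
    d2 = offdiag(gm2 - a)
    c = np.cov(d1, d2)[0, 1]            # estimates V(D)
    pev = np.var(d1 - d2, ddof=1) / 2   # estimates V(E)
    b = 1.0 - 1.0 / (c * m + 1.0)       # uses V(E) = 1/m (Yang et al. 2010)
    return c, pev, b

def sym_noise(rng, t, sd):
    """Symmetric matrix with zero diagonal and N(0, sd^2) off-diagonals."""
    upper = np.triu(rng.normal(0.0, sd, size=(t, t)), k=1)
    return upper + upper.T

# Synthetic illustration (assumed values, not data from the paper).
rng = np.random.default_rng(0)
t, m = 200, 1000
a = np.eye(t)
d = sym_noise(rng, t, np.sqrt(0.0035))
gm1 = a + d + sym_noise(rng, t, np.sqrt(0.001))
gm2 = a + d + sym_noise(rng, t, np.sqrt(0.001))

c, pev, b = estimate_b(gm1, gm2, a, m)
print(f"c = {c:.4f}, PEV = {pev:.4f}, b = {b:.2f}")
```

The returned `pev` is not needed for (2a) itself, but comparing it with 1/m gives a quick check of whether the sampling error behaves as expected.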

Then

$$\hat{G} = G_m\sigma_q^2 + A\sigma_a^2 = [A + b(G_m - A)]\sigma_g^2 \qquad (1a)$$

is the same as defined above in (1), apart from the scalar $\sigma_g^2$, but this time with the regression coefficient estimated from the phenotype data rather than predicted from the marker data. We can now use $\hat{G}$ in place of A in the normal mixed model equations (MME) and calculate EBVs for individual animals and the accuracy of these EBVs in the usual manner.

Prediction of accuracy before data are collected

In practice, the methodology described above for calculating EBVs and their accuracy from genetic markers would be implemented using Henderson's MME. If we ignore fixed effects, we can use the equivalent selection index approach. We assume that the training data consist of T unrelated animals (A = I) with phenotypes and marker genotypes. We wish to predict the accuracy of the EBVs for additional, unrelated test animals with marker genotypes but without phenotypes.

We will use the model y = g + e and g = q + a introduced above. As $V(a) = A\sigma_a^2 = I\sigma_a^2$, the only information about the EBV of the test animals comes from marker information, which allows us to estimate q. Therefore, the reliability (accuracy squared) of the EBV is

$$R^2 = V(\hat{q})/V(g) = [V(\hat{q})/V(q)]\,[V(q)/V(g)] = [V(\hat{q})/V(q)]\,b \qquad (3)$$

Therefore, we have to predict two quantities: the proportion of the genetic variance explained by the markers (b) and the accuracy with which the combined marker effects (q) are estimated (Dekkers 2007; Goddard 2009).

b = V(q)/V(g). We will only consider the case where the QTL are not systematically different from the markers and therefore this can be predicted from the properties of the markers, which in turn can be predicted from the theory of linkage disequilibrium. Equation (2) shows that b depends on the true variation in relationship between pairs of animals, V(D), compared to the sampling error caused by a finite number of markers, V(E). Appendix 1 shows that V(D) = mean value of $r^2$ over all pairs of loci, where $r^2$ is a standard measure of LD (Hill & Robertson 1968). The appendix derives the expected value of this mean, $V(D) = 1/M_e$, where

$$M_e = 2N_eLk/\log(N_eL) \qquad (4)$$

where $N_e$ is the effective population size, L is the average length of a chromosome in Morgans, k is the number of chromosomes and $M_e$ is called the effective number of chromosome segments segregating in the population (Goddard 2009; Hayes et al. 2009b). As V(E) = 1/M,

$$b = [1/M_e]/[1/M_e + 1/M] = M/(M_e + M) \qquad (2b)$$

$V(\hat{q})/V(q)$. The selection index equation for the EBV of an animal without phenotype is

$$\hat{q} = g_2'V^{-1}y$$

where $g_2$ is the vector of covariances between the target animal and the training animals, $V = G_m\sigma_q^2 + I(\sigma_e^2 + \sigma_a^2)$, y is the vector of phenotypic values of animals in the training set, and

$$V(\hat{q}) = g_2'V^{-1}g_2.$$

A convenient way to calculate $g_2'V^{-1}g_2$ for many different animals is to expand the matrix V to include the target animal:

$$V^* = \begin{bmatrix} \sigma_q^2 + \sigma_e^2 + \sigma_a^2 & g_2' \\ g_2 & V \end{bmatrix}$$

Then

$$g_2'V^{-1}g_2 = V^*_{11} - 1/V^{*11} \qquad (5)$$

where $V^*_{11}$ is the first diagonal element of V* and $V^{*11}$ is the first diagonal element of the inverse of V*. By treating each animal in turn as the target animal, it is possible to calculate the accuracy of the EBVs for T + 1 possible target animals (i.e. each animal in turn is regarded as the target animal that does not have a phenotype recorded).

In Appendix 2, we describe a heuristic approximation, $[V^{*11}]^{-1} \approx \sigma_e^2 + \sigma_a^2 + \sigma_q^2/(1 + \theta)$, where $\theta = Tbh^2/M_e$, so that

$$V(\hat{q})/V(q) = \theta/(1 + \theta) \qquad (6)$$

This formula (6) is very similar to those of Goddard (2009) and Daetwyler et al. (2008), who derived it by considering the accuracy of estimating a single marker effect. It does not account for the reduction in the error variance when all markers are fitted simultaneously. Daetwyler et al. (2008) proposed the following correction: if the reliability (accuracy squared) calculated by (3) is called $R^2_{w/o}$, then the proposed estimate of the reliability ($R^2_D$) is

$$R^2_D = R^2_{w/o}\left(1 + \frac{R^4_{w/o}h^2}{2\theta}\right) \qquad (7)$$

Simulation

Populations

Twenty replicate populations were simulated as described in Meuwissen and Goddard (2010). Briefly, Fisher–Wright idealized populations were simulated with $N_e$ = 1000 for 10 000 generations in order to achieve a mutation–drift balance and linkage disequilibrium between the created SNPs. $N_e$ was assumed high here, which makes the probability that the training set contains close relatives of the predicted individual quite small, i.e. the predicted accuracies can be interpreted as accuracies of unrelated individuals. The genome consisted of one chromosome of 1 Morgan. Mutations were simulated according to the infinite sites mutation model (Kimura 1969) at a rate of $10^{-8}$ per base-pair per meiosis. The recombination rate was also $10^{-8}$ per base-pair per meiosis. This resulted on average in 33 066 segregating mutations (SNPs). At random, 1000 SNPs were sampled (without replacement) to enter SNP-SET1, 1000 other SNPs were sampled for SNP-SET2, and 30 SNPs were sampled to act as QTL (QTL-SET). QTL effects were sampled from the Normal distribution. Thus, SNP-SET1, SNP-SET2 and QTL-SET are disjoint sets, and there are no systematic differences between the SNPs that enter each of the sets.

After the 10 000 generations, the 20 populations were simulated for 10 more generations following the same population model, but now the sampled pedigree was recorded and used to set up a relationship matrix A of the last generation of animals. Both SNP-SET1 and SNP-SET2 were used to set up genomic relationship matrices $G_{m1}$ and $G_{m2}$, respectively, calculated as XX′/M with X as defined above. This is very similar to Yang et al. (2010), except that Yang et al. calculated the diagonals of the $G_{mx}$ matrices in a different manner in order to reduce their sampling error. The latter, however, resulted in negative eigenvalues of the $G_{mx}$ matrix, whereas XX′ is positive semi-definite. Although Yang et al. (2010) found that such negative eigenvalues did not cause any problems, we have been cautious and used XX′.

Prediction of accuracy using $G_m$

When using the GBLUP approach, the genomic relationship matrix, $G_m$, can be used to predict the accuracy of individual genomic estimated breeding values (GEBV) following Henderson's (1984) mixed model theory. Equivalently, this prediction can be based on selection index theory. We will use selection index theory here and assume (arbitrarily) that genetic variance is 1, in which case the variance of the selection index equals its reliability. We assume a phenotyped and genotyped training population of size T and use selection index theory to calculate the accuracy of the EBV for an additional animal that has marker genotypes but no phenotype. As above, we do this by expanding the matrix V = V(y) to V* by including the animal whose EBV reliability is required as the first animal. Then, the reliability calculated from the selection index, or equivalent MME, is $(V^*_{11} - 1/V^{*11})$. By treating each animal in turn as the test animal, whose EBV reliability is required, we can calculate the accuracy for all T + 1 animals with only one matrix inversion. By defining $\sigma_a^2$ as zero, we can calculate the accuracy predicted by the MME when $G_m$ is used instead of $\hat{G}$.
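The single-inversion calculation based on equation (5) can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function name, the synthetic $G_m$ and the variance components (genetic variance 1, split as 0.78 and 0.22) are assumptions. Each diagonal element of V* is taken from the expanded matrix itself, which leaves equation (5) valid for every animal treated in turn as the target.

```python
import numpy as np

def reliabilities_one_inversion(gm, var_q, var_a, var_e):
    """Reliability of q_hat for every animal treated in turn as the unphenotyped target.

    gm holds genomic relationships among all T + 1 animals. For animal i,
    V(q_hat_i) = V*_ii - 1 / (V*^-1)_ii, i.e. equation (5) applied to the ith
    diagonal; reliability is V(q_hat) / V(g) with V(g) = var_q + var_a.
    """
    n = gm.shape[0]
    v_star = gm * var_q + np.eye(n) * (var_a + var_e)
    v_inv = np.linalg.inv(v_star)
    var_q_hat = np.diag(v_star) - 1.0 / np.diag(v_inv)
    return var_q_hat / (var_q + var_a)

# Synthetic example (assumed values): 30 animals, 500 standardized markers.
rng = np.random.default_rng(2)
n_animals, n_markers = 30, 500
x = rng.normal(size=(n_animals, n_markers))
gm = x @ x.T / n_markers

rel = reliabilities_one_inversion(gm, var_q=0.78, var_a=0.22, var_e=1.0)

# Direct selection-index calculation for the first animal, for comparison.
v = gm * 0.78 + np.eye(n_animals) * 1.22
g2 = v[1:, 0]
direct = g2 @ np.linalg.solve(v[1:, 1:], g2)
print(np.isclose(rel[0], direct))
```

Setting `var_a=0.0` and `var_q` equal to the full genetic variance corresponds to the case described above in which $G_m$ is used instead of $\hat{G}$.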


Estimation of true accuracy

True genetic values of all animals were calculated as $TBV_i = \sum_{j=1}^{30} w_{ij}u_j$, where summation is over the QTL in QTL-SET, $w_{ij}$ is the genotype of QTL j with 0, 1 and 2 denoting '0 0', '1 0' and '1 1', respectively, and $u_j$ is the normally distributed effect of the QTL. Within each data set, the variance of the $TBV_i$ was standardized to 1. To obtain phenotypic records, an environmental effect was added to the TBV, which was sampled from $N(0, \sigma_e^2)$, where $\sigma_e^2$ was 4, 1 or 0.111 to obtain heritabilities of 0.2, 0.5 and 0.9, respectively.

To obtain estimates of breeding values, the phenotypic data were analysed by the GBLUP model:

$$y = \mu + Zg + e$$

where g is the genetic value of 500 training and 500 evaluation animals; Z is an incidence matrix indicating which animals have records; and $V(g) = G_{m1}$ or $\hat{G}$. Estimation of variance components pertaining to g and e and estimation of $\hat{g}$ were by ASREML (Gilmour et al. 2002). True accuracy of the GEBV was calculated as the correlation between the $\hat{g}$ of the evaluation animals and their TBV.

Real data

The data set consisted of 1200 Australian Holstein bulls. The phenotype used for each bull was the mean of his daughters' fat percentage (fat%) in their milk. We chose this phenotype because it is known to be affected by a gene of large effect (diglyceride acyltransferase or DGAT) and we wished to test the theory under these conditions. To obtain this phenotype, we de-regressed the Australian Breeding Values (ABVs) to remove the contribution from relatives other than daughters (e.g. Pryce et al. 2010) while retaining the correction for non-genetic effects such as herd. All bulls with de-regressed EBVs had at least 80 daughters.

The bulls were genotyped using the Illumina Bovine50K array, which includes 54 001 single-nucleotide polymorphism (SNP) markers (Matukumalli et al. 2009). The following criteria and checks were applied to the bulls' genotypes. Mendelian consistency checks revealed a small number of sons who were discordant with their sires at many (>1000) SNPs or sires with many discordant sons. These animals (17) were removed from the data set. We omitted bulls if they had more than 20% of missing genotypes. In total, 1181 bulls passed these criteria.

Criteria for selecting SNPs were <5% pedigree discordants (e.g. cases where a sire was homozygous for one allele and progeny were homozygous for the other allele), 90% call rate, minor allele frequencies (MAF) >2% and Hardy–Weinberg p > 0.00001. All of these criteria were met by 40 077 SNPs. A small number of these were not assigned to any chromosome on Bovine Genome Build 4.0 and were omitted from the final data set, as were SNPs on the X chromosome. Parentage checking was then performed again, and any genotypes incompatible with pedigree were set to missing.

To impute missing genotypes, the SNPs were ordered by chromosome position. All SNPs that could not be mapped or were on the X chromosome were excluded from the final data set, leaving 39 048 SNPs. The genotype calls and missing genotype information were then submitted to fastPHASE chromosome by chromosome (Scheet & Stephens 2006). The genotypes were taken as those filled in by fastPHASE.

The matrix $G_m$ among the 1181 bulls was constructed as $G_m = XX'/M$. The matrix A was constructed from the pedigree of the bulls, which had ancestors back to 1940. Then, $\hat{G}$ was calculated as described above in equation (1a), as

$$\hat{G} = [A + \hat{b}(G_m - A)]\hat{\sigma}_g^2$$

where $\hat{\sigma}_q^2$ and $\hat{\sigma}_g^2$ were estimated using ASREML (Gilmour et al. 2002) and $\hat{b} = \hat{\sigma}_q^2/\hat{\sigma}_g^2$.

The discovery data set consisted of bulls progeny tested before 2004 (n = 756). The bulls in the validation data set were progeny tested during or after 2004 (n = 400). GEBV for the validation set bulls were predicted using the normal BLUP equations with the A matrix replaced by $\hat{G}$. Only phenotypes of the reference (discovery) set bulls were used in this prediction. Note that $\hat{b}$ was also derived using phenotypes from only the reference set bulls.

The accuracy of the GEBV was calculated in two ways. The $\hat{G}$ realized accuracy was the correlation of the GEBV for the validation set bulls with their phenotypes divided by the accuracy of the phenotypes (0.9, from ADHIS). The $\hat{G}$ theoretical accuracy was calculated from the diagonal elements of the inverse of the coefficient matrix in the usual way. To assess the value of correcting the G matrix for the proportion of variance explained by the markers, we also calculated the realized and theoretical accuracies with $G_m$ in place of $\hat{G}$. Calculations were performed with 10–10 000 markers used to construct the genomic relationship matrices.
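The realized accuracy described above (correlation of GEBV with phenotype, divided by the accuracy of the phenotype) amounts to a one-line calculation. The sketch below uses simulated values to show why the division recovers the correlation with the true breeding value; the function name and all simulated quantities are assumptions, and only the divisor 0.9 comes from the text.

```python
import numpy as np

def realized_accuracy(gebv, phenotype, phenotype_accuracy=0.9):
    """Correlation of GEBV with phenotype, scaled by the accuracy of the phenotype."""
    r = np.corrcoef(gebv, phenotype)[0, 1]
    return r / phenotype_accuracy

# Simulated illustration (assumed values, not the Holstein data).
rng = np.random.default_rng(3)
n = 20000
true_bv = rng.normal(size=n)
# Phenotype with accuracy 0.9: error variance = 1/0.81 - 1.
phenotype = true_bv + rng.normal(scale=np.sqrt(1.0 / 0.81 - 1.0), size=n)
# GEBV with accuracy 1/sqrt(2), with errors independent of the phenotype errors.
gebv = true_bv + rng.normal(size=n)

acc = realized_accuracy(gebv, phenotype)
print(round(acc, 2), round(np.corrcoef(gebv, true_bv)[0, 1], 2))
```

The scaling relies on the errors in the GEBV and in the phenotype being independent, which is why the validation phenotypes must not contribute to the prediction.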


Results

Simulated data – after the data are collected

Table 1 summarizes the properties of the relationship matrix ($G_m$) calculated from the simulated marker data. Two separate sets of markers (1 and 2) are used so that the PEV in calculating elements of $G_m$ and the true variance among elements of $G_m$ can be assessed. In the simulated data, the PEV of elements of $G_m$ is 0.001 (Table 1). This is as predicted from our theory that PEV = 1/M. As expected, the observed variance of $G_m$ (0.0045) is the sum of the true variance (0.0035, estimated as cov($G_{m1}$, $G_{m2}$)) and the PEV. Therefore, the regression coefficient b used in the calculation of $\hat{G}$ [equation (2)] is 0.0035/0.0045 = 0.78. Appendix 1 derives the result that the true V($G_m$) is equal to the mean LD $r^2$ over all pairs of markers. To confirm this, we calculated the mean $r^2$ over the 499 500 pairs of SNPs and found it to be 0.00355.

Table 1 Properties of the estimated genomic relationships (1)

Cov(Gm1 − A, Gm2 − A) (2)    PEV (3)    V(Gm1 − A) (2)
0.0035                       0.001      0.0045

(1) Excluding self-relationships.
(2) The A, Gm1 and Gm2 matrices are calculated as in the main text.
(3) Prediction error variance calculated as V(Gm1 − Gm2)/2.

Table 2 compares the accuracy predicted using genomic relationship matrices in conventional MME with the true accuracy of the EBVs, which we can calculate from the correlation between EBV and true BV, which is known because this is simulated data. When $G_m$ is used in traditional MME to calculate EBVs and the accuracies of those EBVs are calculated from the MME in the normal manner, the accuracies of the EBVs are overestimated (Table 2). However, when $\hat{G}$ is used in the MME, the accuracies of the EBVs agree with the true accuracies calculated as the correlation of true BV with EBV, at least to within the standard error.

Table 2 Comparison of predicted accuracy using genomic relationship matrices and true accuracy from correlation between true BV and EBV (1)

h²     Predicted accuracy     True accuracy
       Gm        Ĝ
0.2    0.52      0.42         0.43
0.5    0.68      0.60         0.59
0.9    0.87      0.70         0.73

(1) Number of training records is T = 500.

Simulated data – before the data are collected

The aim in this section is to test the method of predicting the accuracy of genomic EBVs using information about the population (effective population size $N_e$), the genome length (L), the trait (heritability $h^2$) and the size of the discovery population (T). The prediction method presented in the Methods section is in two parts: the proportion of variance explained by the markers [b = V(q)/V(g)] and the accuracy in estimating the marker effects [$V(\hat{q})/V(q)$]. We consider these two separately and then the combined prediction. We compare the predicted accuracies against the accuracy calculated using the MME because we have shown above that this predicts the true accuracy.

V(q)/V(g). For $N_e$ = 1000 and L = 1, equation (4) gives the effective number of chromosome segments as $M_e = 2N_eL/\log(N_eL) = 290$. Therefore, we predict that $V(D) = 1/M_e = 0.00345$, close to the observed figure of 0.00351 (Table 1). Therefore, b = 0.78 from equation (2b).

$V(\hat{q})/V(q)$. Using this value of $M_e$ in formula (6), we predicted the accuracy of EBVs and compared that to the value expected from the MME using $G_m$ (Table 3). [In calculating $\theta$, we set b = 1 because use of $G_m$ in the MME implies that the markers explain all the variance.] Table 3 also contains the predicted accuracy before and after the correction of Daetwyler et al. (2008) using equation (7). Predicted accuracy increases with T and $h^2$ and is approximately in agreement with the accuracy from the MME. As expected, the correction of Daetwyler et al. (2008) has little effect when reliabilities are low but a more marked effect as accuracies approach 1.0, where this correction improves the agreement with the accuracies calculated from the MME equations.

$V(\hat{q})/V(g)$. The results in Table 3 ignore the errors in the $G_m$ matrix both in predicting reliabilities and in calculating them in the MME. Table 2 shows that if the $G_m$ matrix is regressed to $\hat{G}$, MME correctly predicts the accuracy obtained by simulation. In Table 4, we combine the two aspects of predicting the accuracy by using equation (3) where b = 0.78. Note that b is also needed in the formula for $\theta$. Table 4 shows that the predicted accuracies using equations (3) and (7) agree reasonably well with the accuracies calculated from the MME. The accuracies in Table 4 are lower than the comparable figures in Table 3 because in Table 3 we assumed that $G_m$ correctly described the genetic relationship matrix, while in Table 4 we used $\hat{G}$, which incorporates a residual polygenic variance not explained by the markers.

In summary, the simulation study shows that the proportion of genetic variance explained by the markers (b) can be predicted by equation (2b), the accuracy of estimating marker effects can be predicted by equation (6), these two estimates can be combined using equations (3) and (7) to predict the accuracy of EBVs, and these accuracies agree with those obtained from the MME and with the true accuracies.

Real data

Figure 1 shows the accuracy calculated from the MME and realized for milk fat% in Holstein bulls. When $G_m$ is used in the MME, the calculated accuracy is greater than that actually realized, but when $\hat{G}$ is used, the calculated accuracy is close to that realized. As the number of markers increases, the accuracy predicted by $\hat{G}$ and realized increases, whereas the accuracy calculated using $G_m$ decreases. This occurs because V($G_m$) decreases as the number of markers is increased. In other words, when the number of markers is small, $G_m$ is subject to large sampling errors, which cause the calculated accuracy to appear high but the real accuracy is reduced. As the number of markers increases, the sampling errors become small and $G_m$ and $\hat{G}$ converge.

Although BLUP methods of genomic selection assume normally distributed QTL effects, they are not very sensitive to departures from this assumption. This is illustrated here by the use of fat concentration in milk, a trait that is known to be influenced by a QTL with a large effect (DGAT). Despite the existence of this QTL, the theory correctly predicts the accuracy of EBVs calculated using BLUP methods.

Discussion

Equivalent models

The equivalence of a model based on marker effects and a conventional animal model with the relationship matrix estimated from the markers has been pointed out by several authors (Nejati-Javaremi et al. 1997; Fernando 1998; Villanueva et al. 2005; Habier et al. 2007; Goddard 2009; Strandén & Garrick 2009; VanRaden et al. 2009). The equality of the mean LD and the variance of relationships (shown in Appendix 1) is another aspect of this same equivalence. Both LD and relationships are caused by the inheritance, without recombination, of segments of chromosomes from a common ancestor. If the genome comprised an infinite number of loci all inherited independently (i.e. no linkage), there would be no LD or variation in relationship except that caused by variation in pedigree relationship. Linkage causes

Table 3 Predicted accuracy of EBVs before data are collected and from the mixed model equations (MME) after data collection using relationship matrix Gm1

h2     T [2]    Predicted [1]         MME
                Rw/o       RD
0.2    62       0.202      0.202      0.226
       125      0.281      0.284      0.305
       250      0.383      0.389      0.406
       500      0.506      0.517      0.522
       1000     0.640      0.654      0.641
0.5    62       0.316      0.318      0.336
       125      0.422      0.437      0.451
       250      0.549      0.577      0.580
       500      0.680      0.721      0.705
       1000     0.800      0.840      0.809
0.9    62       0.402      0.425      0.451
       125      0.529      0.574      0.596
       250      0.661      0.731      0.748
       500      0.780      0.860      0.869
       1000     0.870      0.939      0.940

[1] Prediction with and without the Daetwyler et al. (2008) correction.
[2] T = the number of animals in the training population.
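As a rough consistency check on Table 3, the uncorrected prediction R²w/o = b h/(h + 1) with h = Tbh2/Me can be inverted to see what Me each table entry implies. The sketch below assumes b = 1 (Gm taken as correct, as stated for Table 3) and assumes that the first value in each row of the flattened table is Rw/o; both the function name and that column reading are assumptions, not something this section confirms.

```python
def implied_me(accuracy, n_records, h2, b=1.0):
    """Back-solve Me from an uncorrected predicted accuracy.

    Inverts R^2 = b*h/(h + 1) to h = R^2/(b - R^2), then uses h = T*b*h2/Me.
    Here h is the ratio the text calls h, not the heritability h2.
    """
    reliability = accuracy ** 2
    h_ratio = reliability / (b - reliability)
    return n_records * b * h2 / h_ratio


# (h2, T, Rw/o) rows copied from Table 3; the column assignment is an assumption.
TABLE3_UNCORRECTED = [
    (0.2, 62, 0.202), (0.2, 125, 0.281), (0.2, 250, 0.383),
    (0.2, 500, 0.506), (0.2, 1000, 0.640),
    (0.5, 250, 0.549), (0.5, 1000, 0.800),
    (0.9, 62, 0.402), (0.9, 1000, 0.870),
]

implied = [implied_me(acc, t, h2) for h2, t, acc in TABLE3_UNCORRECTED]

if __name__ == "__main__":
    for (h2, t, acc), me in zip(TABLE3_UNCORRECTED, implied):
        print(f"h2={h2}, T={t}, R={acc}: implied Me = {me:.0f}")
```

If the column reading is right, the implied Me values should be roughly constant across rows, which is what the formula requires; their level can then be compared with 1/0.0035 from the covariance in Table 1.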

Table 4 Reliabilities (accuracy squared) predicted and from mixed model equations (MME) using Ĝ

h2     T       Predicted (Daetwyler)    MME
0.9    62      0.329                    0.351
       125     0.446                    0.469
       250     0.576                    0.592
       500     0.696                    0.699
       1000    0.784                    0.785


points close together on a chromosome to have the same coalescence tree. As a consequence of this, there is a correlation between the relationship at one locus and that at neighbouring loci. This, in turn, causes variation in relationship in excess of that caused by variation in pedigree relationship. In the absence of LD, markers would not predict the genotypes at QTL, and the relationship at markers would not predict the relationship at QTL. If the relationship between all pairs of individuals were the same, then all individuals with no phenotype would receive the same EBV. This emphasizes the importance of variation in genomic relationship in driving the accuracy of genomic selection. Thus, it is meaningless to ask whether genomic selection works because it utilizes relationships over and above those due to pedigree or because it utilizes LD: the two explanations are equivalent.

The equivalence between a model based on QTL effects and a conventional animal model would be invalidated if LD between QTL systematically increased or decreased total genetic variance. For instance, the Bulmer effect occurs when selection results in negative covariance between QTL because chromosomes tend to carry a mix of positive and negative QTL alleles. In this case, the total genetic variance will be less than that expected from the sum of the QTL variances (V(g) = WW′σ²u). However, we usually define the genetic variance in a base population where there is assumed to be no Bulmer effect, and this genetic variance will agree with that calculated from the sum of the QTL variances.

If the effective population size (Ne) is large, common ancestors tend to be in the distant past and so recombination will have broken up chromosomes into many small pieces that coalesce independently. Consequently, as Ne increases, the variation in relationship decreases because the relationship between two individuals is an average over many independent chromosome segments. The derivation in Appendix 1 shows that the relationship is effectively an average over Me segments where Me = 2NeLk/log(NeL) [formula (4)].

In this paper, we point out another equivalence: that between a model using an unbiased estimate of the relationship matrix at the QTL (Ĝ) and that using a residual polygenic effect as well as a random effect described by a relationship matrix at the markers (Gm). This equivalence explains why the number of markers used is important. If the number of markers (M) is too small, Gm estimates G too imprecisely. The extent to which the markers track relationships at the QTL depends on M/(M + Me). This formula also describes the extent of LD between markers because M/(M + Me) = 1/(2 + 4Nec/log(NeL)) where c, the average distance between markers, is Lk/M. This is almost the same as the expectation of LD between neighbouring markers, r2 = 1/(2 + 4Nec). The difference between the two formulae (log(NeL)) can be thought of as being due to the LD between all other markers and a target marker, not just the nearest marker.

Knowledge of the variation in relationship led us to an approximation for the inverse of the matrix V, where V = V(y), and this in turn led to an approximation for the accuracy of EBVs calculated from marker genotypes.
This approximation, derived from a consideration of variation in relationships, is the same as that derived by Goddard (2009) and Daetwyler et al. (2008) from a consideration of the accuracy of estimating the effect of a single marker

Figure 1 The accuracy of estimated breeding values for milk fat concentration in Holstein bulls. [Plot for fat (%): accuracy (y-axis, 0.1 to 0.9) against number of SNPs (x-axis, 1000 to 10 000); series Gm_theoretical, Gm_realised, Ghat_theoretical and Ghat_realised.]

or, more correctly, an effective chromosome segment. Because the formulae are the same, it was easy for us to include the correction of Daetwyler et al. (2008), which accounts for the increased accuracy in estimating the effect of one marker when all other markers have been fitted and have hence reduced the residual variance.

The accuracy of genomic selection depends on the proportion of the genetic variance explained by the SNPs and the accuracy with which the SNP effects are estimated (Dekkers 2007; Goddard 2009). These two components of the accuracy are also used in this paper. The proportion of the genetic variance explained by the markers is b = V(q)/V(g), and the accuracy of estimating marker effects is V(q̂)/V(q).

The proportion of the genetic variance explained by the markers

To estimate Ĝ from Gm, we regress Gm back towards A and the regression coefficient (b) is the proportion of genetic variance explained by the markers. VanRaden (2008) proposed the same regression equation, but without a thorough derivation of its coefficient b. If QTL are not systematically different to markers, b = M/(M + Me), as shown in Table 4. However, if the QTL are systematically different to the markers, b must be estimated from data on phenotypes. Research on human height (Yang et al. 2010) found that only half the genetic variance was explained by the SNPs owing to imperfect LD between the SNPs and the QTL. Of the remaining half, 10% was because of the finite number of SNPs used (300 000) and 40% was because of systematic differences between QTL and SNPs. For instance, the QTL could have lower MAF than the SNPs.

Within a breed of cattle such as Holsteins, recent Ne is very small (100) compared with 10 000 in humans. Consequently, the variation in relationships in cattle is large or, equivalently, LD is extensive, and so far fewer SNPs are necessary to explain most of the variance in relationship or, equivalently, most of the variation in QTL. Figure 1 shows that the accuracy of EBVs has reached an asymptote by 10 000 markers, but in humans this has not occurred even after 300 000 markers.

While within a breed (of cattle) the accuracy may reach an asymptote after 10 000 markers, for between-breed prediction, the accuracy is likely to reach an asymptote at a much higher number of markers. For example, Hayes et al. (2009a) found that the theoretical accuracy in a combined Holstein–Jersey population substantially overpredicted the actual accuracy when using Gm, even with 40 000 markers. This is because between breeds, the variation in relationship will be very small so that large numbers of markers are required to predict these relationships accurately (and capture the limited LD that exists between breeds). So for multi-breed prediction of GEBV, it is important to use Ĝ rather than Gm when calculating reliabilities and to calculate the regression coefficient b separately for within-breed and between-breed relationships. This parallels the finding of De Roos et al. (2008) that the phase of LD was not conserved between breeds when 50 000 markers were used. In other words, 50 000 SNPs are not enough to accurately detect relationships between cattle from different breeds such as Holstein and Jersey.

‘Unified approaches’ to utilize phenotypic, full pedigree, and genomic information for genetic evaluation, for both genotyped and un-genotyped individuals, have been proposed, which use a single relationship matrix in the BLUP equations (Aguilar et al. 2010). This matrix has both G matrix and A matrix components (e.g. sub-matrices based on relationships derived from genotypes and sub-matrices derived from pedigree relationships). The Ĝ matrix proposed here would be the most suitable for describing (genomic) relationships among genotyped individuals in such an approach because, if Gm was used, the accuracies might be overestimated. However, the relationships in G and A must be expressed to the same base before the matrices are combined (Meuwissen et al. 2011).

Further developments

Further developments of the methods presented here are desirable. For instance, how would the accuracy of genomic selection be predicted when there are pedigree relationships among the animals as well as relationships estimated from the markers? By analogy with the method used to combine the reliabilities from other sources of data, we suggest that h = R2/(1 − R2) should be additive when independent sources of data are combined, but we have not investigated this suggestion in the present paper.

Conclusions

When the BLUP method of genomic selection is used, the results of this paper can be summarized as a series of recommendations for calculating the accuracy of genomic EBVs:

After the data have been collected: Fit an animal model but with the relationship matrix calculated as


Ĝ. The regression coefficients (b) would ideally be calculated by estimating the variance components associated with SNPs and with the residual polygenic variance. However, if QTL are assumed to have properties similar to SNPs, we can estimate b as M/(M + Me). Me in turn can be estimated from observed variation in relationships [V(G − A) minus the PEV] or from LD (mean r2) or from NeL using equation (4). If Ĝ is used in the MME, the predicted accuracy of GEBV will also capture linkage and family information that is present among the reference population and selection candidates.

Before the data have been collected: Calculate the proportion of genetic variance explained by the SNPs (b) as above. Calculate the accuracy of estimating SNP effects as h/(h + 1) where h = Tbh2/Me. Calculate the reliability (accuracy squared) as R²w/o = b h/(h + 1). Apply the Daetwyler correction to R²w/o to obtain R²D.
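The recommendations above can be sketched as a few small functions. This is an illustrative sketch rather than the authors' software: the function names are hypothetical, log is taken to be the natural logarithm, and the example values L = 1 Morgan and k = 30 chromosomes are assumptions chosen only to make the example run (Ne = 100 and M = 10 000 are the cattle figures mentioned in the text). The Daetwyler correction is deliberately left out because its formula is not given in this section.

```python
import math


def effective_segments(ne, chrom_len, n_chrom):
    """Me = 2*Ne*L*k / log(Ne*L), the large-NL form quoted as formula (4)."""
    return 2.0 * ne * chrom_len * n_chrom / math.log(ne * chrom_len)


def mean_r2_exact(ne, chrom_len):
    """Closed-form mean r2 over one chromosome (Appendix 1),
    assuming E(r2) = 1/(2 + 4*N*c) between loci c Morgans apart."""
    x = 4.0 * ne * chrom_len
    numerator = ((2.0 + x) * math.log(2.0 + x) - x
                 - x * math.log(2.0) - 2.0 * math.log(2.0))
    return numerator / (8.0 * ne ** 2 * chrom_len ** 2)


def proportion_explained(n_markers, me):
    """b = M/(M + Me), valid if QTL have properties similar to the SNPs."""
    return n_markers / (n_markers + me)


def reliability_without_correction(n_records, h2, b, me):
    """R^2 (w/o correction) = b*h/(h + 1), with h = T*b*h2/Me.

    h_ratio is the quantity the text calls h, not the heritability h2.
    """
    h_ratio = n_records * b * h2 / me
    return b * h_ratio / (h_ratio + 1.0)


if __name__ == "__main__":
    ne, chrom_len, n_chrom = 100, 1.0, 30      # L and k are hypothetical values
    me = effective_segments(ne, chrom_len, n_chrom)
    me_from_ld = n_chrom / mean_r2_exact(ne, chrom_len)  # Me as 1 / (mean r2 / k)
    b = proportion_explained(10_000, me)
    r2 = reliability_without_correction(500, 0.2, b, me)
    print(f"Me (formula 4) = {me:.0f}, Me (exact mean r2) = {me_from_ld:.0f}")
    print(f"b = {b:.3f}, uncorrected reliability = {r2:.3f}")
```

Comparing `effective_segments` with `n_chrom / mean_r2_exact(...)` gives a feel for how much the large-NL approximation in Appendix 1 matters for a given Ne and L.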

References

Aguilar I., Misztal I., Johnson D.L., Legarra A., Tsuruta S., Lawlor T.J. (2010) A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J. Dairy Sci., 93, 743–752.
Daetwyler H.D., Villanueva B., Woolliams J.A., Weedon M.N. (2008) Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE, 3, e3395. doi:10.1371/journal.pone.0003395. PMID:18852893.
Dalton R. (2009) No bull: genes for better milk. Nature, 457, 369.
De Roos A.P.W., Hayes B.J., Spelman R., Goddard M.E. (2008) Linkage disequilibrium and persistence of phase in Holstein Friesian, Jersey and Angus cattle. Genetics, 179, 1503–1512.
Dekkers J.C. (2007) Prediction of response to marker-assisted and genomic selection using selection index theory. J. Anim. Breed. Genet., 124, 331–341.
Fernando R.L. (1998) Some true aspects of finite locus models. In: Proceedings of the 6th World Congress of Genetics Applied to Livestock Production, 11–16 January 1998. University of New England, Armidale, Australia, 26, pp. 329–336.
Gilmour A.R., Gogel B.J., Cullis B.R., Welham S.J., Thompson R. (2002) ASReml User Guide Release 1.0. VSN International Ltd., Hemel Hempstead, UK.
Goddard M.E. (2009) Genomic selection: prediction of accuracy and maximisation of long term response. Genetica, 136, 245–257.
Habier D., Fernando R.L., Dekkers J.C. (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics, 177, 2389–2397.
Harris B.L., Johnson D.L. (2010) Genomic predictions for New Zealand dairy bulls and integration with national genetic evaluation. J. Dairy Sci., 93, 1243–1252.
Hayes B.J., Bowman P.J., Chamberlain A.C., Verbyla K., Goddard M.E. (2009a) Accuracy of genomic breeding values in multi-breed populations. Genet. Sel. Evol., 41, 51.
Hayes B.J., Visscher P.M., Goddard M.E. (2009b) Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res., 91, 47–60.
Henderson C.R. (1984) Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario.
Hill W.G., Robertson A. (1968) Linkage disequilibrium in finite populations. Theor. Appl. Genet., 38, 226–231.
Kimura M. (1969) The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61, 893–903.
Matukumalli L.K., Lawley C.T., Schnabel R.D., Taylor J.F., Allan M.F., Heaton M.P., O'Connell J., Moore S.S., Smith T.P., Sonstegard T.S., Van Tassell C.P. (2009) Development and characterization of a high density SNP genotyping assay for cattle. PLoS ONE, 4, e5350.
Meuwissen T.H.E., Hayes B.J., Goddard M.E. (2001) Prediction of total genetic value using genome wide dense marker maps. Genetics, 157, 1819–1829.
Meuwissen T.H.E., Luan T., Woolliams J.A. (2011) The unified approach to the use of genomic and pedigree information in genomic evaluations revisited. J. Anim. Breed. Genet., 128, 429–439.
Nejati-Javaremi A., Smith C., Gibson J. (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. J. Anim. Sci., 75, 1738–1745.
Powell J.E., Visscher P.M., Goddard M.E. (2010) Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet., 11, 800–805.
Pryce J.E., Bolormaa S., Chamberlain A.J., Bowman P.J., Savin K., Goddard M.E., Hayes B.J. (2010) A validated genome-wide association study in two dairy cattle breeds for milk production and fertility traits using variable length haplotypes. J. Dairy Sci., 93, 3331–3345.
Schaeffer L.R. (2006) Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet., 123, 218–223.
Scheet P., Stephens M.A. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78, 629–644.
Strandén I., Garrick D.J. (2009) Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J. Dairy Sci., 92, 2971–2975.
Sved J.A. (1971) Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor. Popul. Biol., 2, 125–141.

Tenesa A., Navarro P., Hayes B.J., Duffy D.L., Clarke G.M., Goddard M.E., Visscher P.M. (2007) Recent human effective population size estimated from linkage disequilibrium. Genome Res., 17, 520–526.
VanRaden P.M. (2008) Efficient methods to compute genomic predictions. J. Dairy Sci., 91, 4414–4423.
VanRaden P.M., Van Tassell C.P., Wiggans G.R., Sonstegard T.S., Schnabel R.D. et al. (2009) Invited Review: Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci., 92, 16–24.
Villanueva B., Pong-Wong R., Fernandez J., Toro M.A. (2005) Benefits from marker-assisted selection under an additive polygenic genetic model. J. Anim. Sci., 83, 1747–1752.
Yang J., Beben B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.F., Heath A.C., Martin N.G., Montgomery G.W., Goddard M.E., Visscher P.M. (2010) Missing heritability of human height explained by genomic relationships. Nat. Genet., 42, 565–569.

Appendix 1

The properties of the genetic covariance matrix $WW'$

We use the same model as in the main text, that is $g = Wu$, and assume $u \sim N(0, I)$, so that $V(g) = WW'$ and the $ij$ element of this matrix is the covariance of breeding values between animals $i$ and $j$. The elements of $W$ ($w_{ik}$) describe the genotype of animal $i$ at marker $k$. Then the $i$th diagonal element of $WW'$ is $w_i'w_i$, and

\[ E(w_i'w_i) = \sum_j 2p_j(1 - p_j) = \sigma_g^2 . \]

The off-diagonal element $ij$ of $WW'$ is $w_i'w_j = \sum_k w_{ik}w_{jk}$, with

\[ E(w_i'w_j) = 0 \quad \text{because the animals are unrelated,} \]

\[
\begin{aligned}
V(w_i'w_j) &= E\Big[\Big(\sum_k w_{ik}w_{jk}\Big)\Big(\sum_k w_{ik}w_{jk}\Big)\Big] \\
&= E\Big[\sum_k\sum_l (w_{ik}w_{jk})(w_{il}w_{jl})\Big] \\
&= \sum_k\sum_l E\big[(w_{ik}w_{jk})(w_{il}w_{jl})\big] \\
&= \sum_k\sum_l E(w_{ik}w_{il})\,E(w_{jk}w_{jl}) \\
&= \sum_k\sum_l \mathrm{Cov}(w_k, w_l)^2 \\
&= \sum_k\sum_l r_{kl}^2\, 2p_k(1 - p_k)\, 2p_l(1 - p_l) \\
&= \sum_k\sum_l 2p_k(1 - p_k)\, 2p_l(1 - p_l) \sum_k\sum_l r_{kl}^2 \big/ [Q(1 - Q)] \quad \text{where } Q \text{ is the number of QTL} \\
&= \sigma_g^4\, \mathrm{mean}(r^2) \quad \text{where } r^2 \text{ is the usual measure of LD.}
\end{aligned}
\]

Thus, $G = WW'/\sigma_g^2$ is a relationship matrix with diagonal elements averaging 1. The off-diagonal elements of $G$ have mean 0, and their variance is the mean of the $r^2$ measure of linkage disequilibrium over all pairs of loci. If we consider the QTL to be $M$ unlinked loci, then $r_{kk}^2 = 1$ and $r_{kl}^2 = 0$, so the mean of $r^2 = 1/M$. However, if we assume that QTL are spread all along the chromosome, we can evaluate the mean by integrating: $\mathrm{mean}(r^2) = [\int\!\!\int r_{kl}^2\, dk\, dl]/L^2$, where the limits of integration are 0 and $L$, the length of the chromosome. Assuming $E(r^2) = 1/(2 + 4Nc)$ (Tenesa et al. 2007), where $N$ is the effective population size and $c$ is the distance between loci in Morgan,

\[
\begin{aligned}
\mathrm{mean}(r^2) &= \Big[\int\!\!\int r_{kl}^2\, dk\, dl\Big] \Big/ L^2 \\
&= \int\!\!\int \frac{1}{2 + 4N(l - k)}\, dk\, dl \Big/ L^2 \\
&= \frac{(2 + 4NL)\log(2 + 4NL) - 4NL - 4NL\log 2 - 2\log 2}{8N^2L^2} \\
&\approx \frac{\log(NL)}{2NL} \quad \text{for large } NL .
\end{aligned}
\]

If the genome is made up of $k$ chromosomes, each of length $L$,

\[ \mathrm{mean}(r^2) \approx \frac{\log(NL)}{2NLk} \quad \text{for large } NL . \]

Thus, the number of effective QTL ($M$) is $2NLk/\log(NL)$ even if the number of actual QTL is infinite, because linkage generates LD, which is a correlation between loci. If one prefers to use $E(r^2) = 1/(1 + 4Nc)$ (Sved 1971), because the LD is driven by inbreeding without new mutation, then $\mathrm{mean}(r^2) \approx \log(2NL)/(2NLk)$ for large $NL$.

Appendix 2

Heuristic approximation for $V^{*-1}$

From the main text

\[ V = G_m\sigma_q^2 + I(\sigma_e^2 + \sigma_a^2) \]

where $G_m = XX'/M$.

$G_m$ can be written as $I + D$ where

\[ E(D) = 0, \qquad V(D) = 1/M_e . \]

Then

\[ (I + D)^{-1} = I - D + D^2 \dots \approx I + D^2 \]

\[ E\big[(I + D)^{-1}\big] \approx I(1 + T/M_e) \]

where $T$ is the number of animals in the matrix $D$.

The inverse of the diagonal elements of $(I + D)^{-1}$ are the PEV of predicting the breeding value of one animal from the breeding values of all the other animals assuming the relationship matrix $G_m$, that is, $\mathrm{PEV} = \sigma_q^2/(1 + T/M_e)$. This formula can be compared to the formula for the PEV based on known relatives such as offspring, $\sigma_g^2/(1 + T/3)$.

However, breeding values are not predicted from the breeding value of relatives but from their phenotypes. The formula for PEV based on offspring's phenotype is $\sigma_g^2/(1 + Tk)$ where $k = h^2/(4 - h^2)$ so, assuming $bh^2$ is small, we modify the PEV using the $G_m$ matrix to $\mathrm{PEV} = \sigma_q^2/(1 + bh^2T/M_e)$.

The inverse of the diagonal elements of $V^{*-1}$ are the PEV for predicting the phenotype of one animal from the phenotype of the others. This PEV must include all of the polygenic and environmental variance, because these cannot be predicted from other, unrelated animals, as well as the PEV for the breeding value determined by the genetic markers, which we have just calculated. Therefore

\[ V^{11} \approx \frac{1}{\sigma_e^2 + \sigma_a^2 + \sigma_q^2/(1 + h)} \]

where $h = Tbh^2/M_e$ as used in the main text.

The restriction that $bh^2$ is small is relaxed elsewhere in the methodology using the correction of Daetwyler et al. (2008).
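The step from I + D² to I(1 + T/Me) in Appendix 2 rests on the diagonal of D² averaging about T/Me. The simulation below is a minimal sketch of that one step under assumptions that go beyond the text: D is symmetric with a zero diagonal, and its off-diagonal elements are independent normal deviates with variance 1/Me. Real relationship matrices have more structure, so this only illustrates the heuristic; the function name and the example sizes are hypothetical.

```python
import random


def mean_diag_of_d_squared(n_animals, me, seed=0):
    """Average diagonal element of D @ D for a simulated symmetric D.

    Assumptions (illustrative only): zero diagonal, independent normal
    off-diagonal elements with variance 1/Me. For symmetric D the i-th
    diagonal element of D @ D is the sum over j of D_ij squared.
    """
    rng = random.Random(seed)
    sd = (1.0 / me) ** 0.5
    diag = [0.0] * n_animals
    for i in range(n_animals):
        for j in range(i + 1, n_animals):
            squared = rng.gauss(0.0, sd) ** 2
            diag[i] += squared   # contribution of D_ij to element (i, i)
            diag[j] += squared   # contribution of D_ji to element (j, j)
    return sum(diag) / n_animals


if __name__ == "__main__":
    t, me = 200, 1000.0
    simulated = mean_diag_of_d_squared(t, me)
    print(f"mean diag(D^2) = {simulated:.3f}, T/Me = {t / me:.3f}")
```

With T animals there are T − 1 off-diagonal terms per row, so the simulated average should sit near (T − 1)/Me, which is close to T/Me once T is reasonably large.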
