Computations in Animal Breeding
Ignacy Misztal and Romdhane Rekaya, University of Georgia
Animal Breeding
- Selection of more productive animals as parents
- Effect: the next generation is more productive
- Tools: artificial insemination + embryo transfer
– A dairy sire can have >100,000 daughters!
– Selection of “good” sires is important!
Information for selection
- DNA (active research)
- Data collected on large populations of farm animals
– Records contain a combination of genetic, environmental, and systematic effects
– Need statistical methodology to obtain the best prediction of genetic effects
Dairy
- Mostly family farms (50-1000 cows)
- Mostly Holsteins
- Recording on:
– Production (milk, fat, protein yields)
– Conformation (size, legs, udder, …)
– Secondary traits (somatic cell count in milk, calving ease, reproduction)
- Information on >20 million Holsteins (kept by USDA and breed associations)
- Semen market is global
Poultry
- Integrated system
- Large companies
- Hierarchical breeding structure
- Final product - crossbreds
- Useful records on up to 200k birds
- Recording on:
– Number and size of eggs
– Growth (weights at certain ages)
– Fertility
– …
Swine
- Becoming more integrated
- Hierarchical breeding structure
- Final product - crossbreds
- Recording on:
– Growth
– Litter size
– Meat quality
– …
- Populations > 200k animals
Beef cattle
- Mostly family farms (5 to 50,000 animals)
- Many breeds
- Final product – purebreds and crossbreds
- Recording on:
– Growth
– Fertility
– Meat quality
- Data size up to 2 million animals
Results of genetic selection
- Milk yield 2 times higher
- Chicken
– Time to maturity over 2 times shorter
– Feed efficiency 2 times higher
- Swine
- Beef
- Fish
Genetic value of individual
On a population level: g = a + “rest”
Var(a) = A·σₐ², where A is the matrix of additive relationships among animals and σₐ² is the additive genetic variance
A is dense; A⁻¹ is sparse and easy to set up (see the sketch below)
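A⁻¹ can be set up directly from the pedigree with Henderson's rules, never forming A itself. A minimal Python sketch (dense storage for clarity, inbreeding ignored, pedigree layout invented for illustration):

```python
import numpy as np

def a_inverse(pedigree):
    """Henderson's rules for A^{-1}, ignoring inbreeding (minimal sketch).
    pedigree[i] = (sire, dam) with None for unknown parents; animals are
    coded 0..n-1 with parents appearing before their progeny."""
    n = len(pedigree)
    Ainv = np.zeros((n, n))            # real programs store this sparse
    for i, (sire, dam) in enumerate(pedigree):
        known = [p for p in (sire, dam) if p is not None]
        alpha = {2: 2.0, 1: 4.0 / 3.0, 0: 1.0}[len(known)]
        Ainv[i, i] += alpha
        for p in known:
            Ainv[i, p] -= alpha / 2.0
            Ainv[p, i] -= alpha / 2.0
        for p in known:
            for r in known:
                Ainv[p, r] += alpha / 4.0
    return Ainv

# animals 0, 1: unrelated founders; animal 2: offspring of (0, 1)
print(a_inverse([(None, None), (None, None), (0, 1)]))
```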
Example of a model
- Litter weight =
  – contemporary group +
  – age class +
  – genetic group +
  – animal +
  – e
- Var(e) = I·σₑ², Var(animal) = A·σₐ²
- σₑ², σₐ²: variance components
Mixed model
y = Xβ + Zu + e
- y: vector of records
- β: vector of fixed effects
- u: vector of random effects
- e: vector of residuals
- X, Z: design matrices
- Fixed effects: usually few levels, lots of information
- Random effects: usually many levels, little information
Mixed model equations
X′R⁻¹X β̂ + X′R⁻¹Z û = X′R⁻¹y
Z′R⁻¹X β̂ + (Z′R⁻¹Z + G⁻¹) û = Z′R⁻¹y
- R: block diagonal with small blocks
- G: block diagonal with small or large blocks
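For the single-trait case with R = Iσₑ² and G = Aσₐ², the equations simplify to the familiar σₑ²-scaled form. A toy sketch with invented numbers (the A⁻¹ is the 3-animal example above):

```python
import numpy as np

# Toy single-trait MME, scaled by se2:
#   [ X'X        X'Z         ] [b]   [X'y]
#   [ Z'X   Z'Z + Ainv * k   ] [u] = [Z'y],   k = se2 / sa2
X = np.array([[1., 0.], [1., 0.], [0., 1.]])    # 2 contemporary groups
Z = np.eye(3)                                   # 3 animals, 1 record each
y = np.array([4.5, 3.8, 5.1])
Ainv = np.array([[1.5, .5, -1.], [.5, 1.5, -1.], [-1., -1., 2.]])
k = 2.0

lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + Ainv * k]])
rhs = np.concatenate([X.T @ y, Z.T @ y])
print(np.linalg.solve(lhs, rhs))                # [BLUE of b, BLUP of u]
```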
Matrices in Mixed Model
- Symmetric
- Semi-positive definite
- Sparse
– 3-200 nonzeros per row (on average)
- Can be constructed as a sum of outer products (see the sketch below):
  Σᵢ Wᵢ′ Qᵢ Wᵢ
  – Wᵢ: 1-20 rows × 20k-300 million columns, <100 nonzeros
  – Qᵢ: small square matrix (less than 20 × 20)
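A sketch of that accumulation with made-up record structures: each Wᵢ touches only a few global equations, so its contribution is a small dense block scattered into the large matrix (kept dense here for brevity):

```python
import numpy as np

def accumulate_lhs(n_eq, records):
    """Builds the coefficient matrix as sum_i W_i' Q_i W_i. Each record
    supplies idx (the few global equations it touches), W (a small dense
    t x len(idx) block) and Q (t x t). Structures are invented; a real
    program would accumulate into sparse storage."""
    A = np.zeros((n_eq, n_eq))
    for idx, W, Q in records:
        A[np.ix_(idx, idx)] += W.T @ Q @ W
    return A

# one 2-trait record touching equations 0, 3 and 4 of a 6-equation system
rec = ([0, 3, 4],
       np.array([[1., 1., 0.], [1., 0., 1.]]),
       np.array([[2., -1.], [-1., 2.]]))
print(accumulate_lhs(6, [rec]))
```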
Models
- Sire model: y = cg + … + sire + e
  – 1000k animals ≈ 50k equations
- Animal model: y = cg + … + animal + e
  – 1000k animals ≈ 1200k equations
- Multiple trait model:
  y₁ = cg₁ + … + animal₁ + … + e₁
  …
  yₙ = cgₙ + … + animalₙ + … + eₙ
  – 1000k animals: >3000k equations
- Random regression (longitudinal) model:
  y = cg + … + Σᵢ fᵢ(x)·animalᵢ + … + e
  – 1000k animals: >6000k equations
Tasks
- Estimate variance components
– Usually sample of 5-50k animals
- Solve mixed model equations
– Populations up to 20 million animals
– Up to 60 unknowns per animal
Data structures for sparse matrices
- Linked list
- Triples (i, j, value) in a hash table
- IJA (row pointers to column indices and values; see the sketch below)
- Matrix not stored (matrix-free)
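For illustration, the IJA layout and a matrix-vector product computed straight from it (a minimal sketch; array contents invented):

```python
import numpy as np

# IJA (compressed row) layout of the matrix
#   [[4, 0, 1],
#    [0, 3, 0],
#    [1, 0, 5]]
ia = np.array([0, 2, 3, 5])            # row i occupies ja[ia[i]:ia[i+1]]
ja = np.array([0, 2, 1, 0, 2])         # column indices
a = np.array([4., 1., 3., 1., 5.])     # values

def matvec(ia, ja, a, x):
    """y = A x computed straight from the three IJA arrays."""
    y = np.zeros(len(ia) - 1)
    for i in range(len(y)):
        for k in range(ia[i], ia[i + 1]):
            y[i] += a[k] * x[ja[k]]
    return y

print(matvec(ia, ja, a, np.ones(3)))   # [5. 3. 6.]
```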
Solving strategies
- Sparse factorization
- Iteration
– Gauss-Seidel
– Gauss-Seidel + second-order Jacobi
– PCG
- Preconditioners from diagonal to incomplete factorization
Gauss-Seidel Iteration & SOR
Ax = b
- Simple
- Stable and self-correcting
- Converges for semi-positive definite A
- For balanced cross-classified models converges in one round
- Small memory requirements
- Hard to implement matrix-free
- Slow convergence for complicated models
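A minimal dense sketch of Gauss-Seidel/SOR (relax = 1 gives plain Gauss-Seidel; production codes sweep the sparse structure, or the data itself, rather than a stored matrix):

```python
import numpy as np

def gauss_seidel(A, b, x=None, relax=1.0, rounds=100, tol=1e-10):
    """Gauss-Seidel / SOR iteration for Ax = b on a dense matrix."""
    n = len(b)
    x = np.zeros(n) if x is None else x.copy()
    for _ in range(rounds):
        x_old = x.copy()
        for i in range(n):
            r = b[i] - A[i] @ x + A[i, i] * x[i]   # residual excluding x_i
            x[i] = (1 - relax) * x[i] + relax * r / A[i, i]
        if np.linalg.norm(x - x_old) < tol * (np.linalg.norm(x) + 1):
            break
    return x

A = np.array([[4., 1.], [1., 3.]])
print(gauss_seidel(A, np.array([1., 2.])))   # close to np.linalg.solve
```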
Preconditioned Conjugate Gradient
- Large memory requirements
- Tricky implementation
- Easy to implement matrix-free
- Usually converges a few times faster than SOR, even with a diagonal preconditioner!
Matrix-free iteration
- Let A = Σᵢ Wᵢ′Wᵢ
- nonzeros(W) << nonzeros(A)
- W easily generated from data
- Ax = Σᵢ Wᵢ′(Wᵢx)
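A sketch of matrix-free PCG under these assumptions: apply_A accumulates Ax = Σᵢ Wᵢ′(Wᵢx) without ever storing A, and a diagonal preconditioner is used (all data invented):

```python
import numpy as np

def pcg_matrix_free(apply_A, b, M_diag, rounds=1000, tol=1e-12):
    """PCG where A is never stored: apply_A(x) returns A x, e.g. summed
    as W_i'(W_i x) in one pass through the data; M_diag preconditions."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    z = r / M_diag
    p = z.copy()
    rz = r @ z
    for _ in range(rounds):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = r / M_diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# toy A = W'W built from "data" rows; W'W itself is never formed
W = np.array([[1., 1., 0.], [0., 1., 1.], [1., 0., 2.]])
apply_A = lambda x: W.T @ (W @ x)
b = np.array([1., 2., 3.])
print(pcg_matrix_free(apply_A, b, M_diag=(W * W).sum(axis=0)))
```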
Methodologies to estimate variance components
- Restricted Maximum Likelihood (REML)
- Markov chain Monte Carlo (MCMC)
REML
- Φ: variance components (in R and G)
- C*: LHS converted to full rank
- Maximization strategies:
  – Derivative-free (uses sparse factorization)
  – First derivative (expectation-maximization; uses sparse inverse)
  – Second derivative: D² and E(D²) hard to compute, but [D² + E(D²)]/2 simpler (average information REML)
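For the derivative-free route, the objective can be written down directly. A dense sketch of −2 × the REML log-likelihood for y = Xb + Zu + e (function name and interface are illustrative; real programs evaluate the determinants through sparse factorization):

```python
import numpy as np

def reml_neg2_loglik(su2, se2, y, X, Z, A):
    """-2 x REML log-likelihood (up to a constant) of y = Xb + Zu + e,
    u ~ N(0, A*su2), e ~ N(0, I*se2). A derivative-free REML program
    minimizes this over (su2, se2)."""
    n = len(y)
    V = su2 * (Z @ A @ Z.T) + se2 * np.eye(n)
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    # P = Vi - Vi X (X'Vi X)^{-1} X'Vi projects out the fixed effects
    P = Vi - Vi @ X @ np.linalg.solve(XtViX, X.T @ Vi)
    return (np.linalg.slogdet(V)[1]
            + np.linalg.slogdet(XtViX)[1]
            + y @ P @ y)
```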
Sparse matrix inversion
- Takahashi method (a toy version of the recurrence follows below):
  – Can obtain inverse elements only for the elements where Lᵢⱼ ≠ 0
  – Inverses obtained for sparse matrices as large as 1000k × 1000k
  – Cost ≈ 2 × sparse factorization
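A dense toy version of the Takahashi recurrence Z = D⁻¹L⁻¹ + (I − L′)Z on A = LDL′ (a sparse implementation would compute and keep only the entries of Z on the nonzero pattern of L):

```python
import numpy as np

def takahashi_inverse(A):
    """Z = A^{-1} from the LDL' factorization via the backward
    Takahashi recurrence; dense storage for illustration only."""
    n = A.shape[0]
    R = np.linalg.cholesky(A)      # A = R R', R lower triangular
    d = np.diag(R) ** 2            # D of the LDL' factorization
    L = R / np.diag(R)             # unit lower triangular L
    Z = np.zeros((n, n))
    for i in range(n - 1, -1, -1):         # backward over rows
        for j in range(n - 1, i - 1, -1):  # columns j >= i, descending
            s = 1.0 / d[i] if i == j else 0.0
            for k in range(i + 1, n):      # below-diagonal entries of L
                s -= L[k, i] * Z[k, j]     # Z[k, j] already available
            Z[i, j] = Z[j, i] = s          # store symmetrically
    return Z

A = np.array([[4., 1., 0.], [1., 5., 2.], [0., 2., 6.]])
print(np.allclose(takahashi_inverse(A), np.linalg.inv(A)))  # True
```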
REML, properties
- Derivative-free methods reliable only for simple problems
- Derivative methods
  – Difficult formulas; nearly impossible for nonstandard models
  – High computing cost (≈ quadratic)
  – Easy determination of termination
Bayesian Methods and MCMC
- Samples approximate the marginal posteriors p(σₑ²|y), p(σₚ²|y), p(σₐ²|y)
MCMC, properties
- Much simpler formulas
- Can accommodate:
  – Large models
  – Complicated models
- Can take months if not optimized
- Details important; deciding when to stop is hard
- (a minimal Gibbs sampler is sketched below)
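To make the sampling concrete, a minimal Gibbs sampler for a toy sire model y_ij = μ + s_i + e_ij, with simulated data and flat priors on the variances assumed (real programs add fixed effects, pedigrees, and multiple traits):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated toy sire model: y_ij = mu + s_i + e_ij
q, m = 200, 10                                   # sires, records per sire
s_true = rng.normal(0.0, np.sqrt(2.0), q)        # true sire variance = 2
y = 10.0 + s_true[:, None] + rng.normal(0.0, np.sqrt(4.0), (q, m))

mu, s = 0.0, np.zeros(q)
ss2, se2 = 1.0, 1.0                              # starting values
keep = []
for it in range(5000):
    # mu | rest: normal around the data mean corrected for sire effects
    mu = rng.normal((y - s[:, None]).mean(), np.sqrt(se2 / y.size))
    # s_i | rest: normal, shrunk toward 0 by the variance ratio
    prec = m / se2 + 1.0 / ss2
    s = rng.normal((y - mu).sum(axis=1) / se2 / prec, np.sqrt(1.0 / prec))
    # variances | rest: scaled inverse chi-square draws
    ss2 = (s @ s) / rng.chisquare(q)
    e = y - mu - s[:, None]
    se2 = (e * e).sum() / rng.chisquare(y.size)
    if it >= 1000:                               # discard burn-in
        keep.append((ss2, se2))

print(np.mean(keep, axis=0))                     # approx. (2, 4)
```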
Optimization in sampling methods
- 10k-1 million samples
- Slow if equations regenerated each round
- If the equations can be represented as:
  [R ⊗ X₁ + G ⊗ X₂] x = R ⊗ y
  – R, G: estimated small matrices; X₁, X₂, y: constant
- X₁, X₂ and y can be created and stored once (see the sketch below)
- Requires tricks if models differ by trait or if traits are missing
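A sketch of the bookkeeping this enables, taking the slide's Kronecker shorthand at face value (shapes and names here are illustrative assumptions): X₁, X₂ are built from the data once, and each sampling round only refreshes the small R and G:

```python
import numpy as np

def rebuild_lhs(R, G, X1, X2):
    """Refresh the LHS for newly sampled small R and G; the large constant
    blocks X1, X2 (and the RHS) were assembled from the data once."""
    return np.kron(R, X1) + np.kron(G, X2)
```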
Software
- SAS (fixed models or small mixed models)
- Custom
- Packages
– PEST, VCE (Groeneveld et al.)
– ASREML (Gilmour)
– DMU (Jensen et al.)
– MATVEC (Wang et al.)
– Blupf90 etc. (Misztal et al.)
Larger data, more complicated models, simpler computing?
Computing platforms
- Past
  – Mainframes
  – Supercomputers
  – Sparse computations vectorize!
- Current
  – PCs + workstations
  – Windows, Linux/Unix
  – Parallel and vector processing not important
Random regression model on a parallel processor
(Madsen et al., 1999; Lidauer et al., 1999)
Goal: compute Σᵢ Wᵢ′Wᵢx, where the Wᵢ are large sparse vectors
- Approaches:
  a) Distribute the accumulation of Σᵢ (Wᵢ′Wᵢ)x to separate processors
  b) Optimize the scalar algorithm first, to Σᵢ Wᵢ′(Wᵢx)
- If Wᵢ has 30 nonzeros:
  a) 900 multiplications
  b) 60 multiplications
- Scalar optimization more important than brute-force parallelization (see the check below)
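The 900-vs-60 count is just two orderings of the same product; a quick check with an invented row of 3 nonzeros:

```python
import numpy as np

w = np.zeros(1000)
w[[3, 40, 777]] = [1., 2., 1.]         # a row with 3 nonzeros
x = np.random.default_rng(0).normal(size=1000)

y_a = np.outer(w, w) @ x   # a) forms w'w first: ~nnz^2 multiplications
y_b = w * (w @ x)          # b) two dot products: ~2*nnz multiplications
print(np.allclose(y_a, y_b))           # True: same result, far less work
```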
Other Models
- Censored
- Survival
- Threshold
- …
Issues
- 0.99 problem
- Sophistication of statistics vs. understanding of the problem vs. data editing
- Undesired responses to selection
  – Less fitness, …
  – Aggressiveness (swine, poultry)
- Challenges of molecular genetics
Molecular Genetics
- Attempts to identify effects of genes on individual traits
- Simple statistical methodologies
- Methodology for joint analyses with phenotypic and DNA data is difficult
- Active research area
Conclusions
- Animal breeding is compute-intensive
  – Large systems of equations
  – Sparse matrices
- Research has large economic value