Computations in Animal Breeding Ignacy Misztal and Romdhane Rekaya - - PowerPoint PPT Presentation



slide-1
SLIDE 1
slide-2
SLIDE 2

Computations in Animal Breeding

Ignacy Misztal and Romdhane Rekaya University of Georgia

slide-3
SLIDE 3

Animal Breeding

  • Selection of more productive animals as parents
  • Effect=> Next generation is more productive
  • Tools: artificial insemination + embryo transfer

– A dairy sire can have >100,000 daughters!
– Selection of “good” sires important!

slide-4
SLIDE 4

Information for selection

  • DNA (active research)
  • Data collected on large populations of farm animals

– Records contain combination of genetics, environment, and systematic effects
– Need statistical methodology to obtain best prediction of genetic effects

slide-5
SLIDE 5
slide-6
SLIDE 6

Dairy

  • Mostly family farms (50-1000 cows)
  • Mostly Holsteins
  • Recording on:

– Production (milk, fat, protein yields)
– Conformation (size, legs, udder, ...)
– Secondary traits (somatic cell count in milk, calving ease, reproduction)

  • Information on > 20 million Holsteins (kept by USDA and breed associations)

  • Semen market global
slide-7
SLIDE 7

Molecular genetics

slide-8
SLIDE 8
slide-9
SLIDE 9

Poultry

  • Integrated system
  • Large companies
  • Hierarchical breeding structure
  • Final product - crossbreds
  • Useful records on up to 200k birds
  • Recording on:

– Number and size of eggs
– Growth (weights at certain ages)
– Fertility
– …

slide-10
SLIDE 10
slide-11
SLIDE 11

Swine

  • Becoming more integrated
  • Hierarchical breeding structure
  • Final product - crossbreds
  • Recording on:

– Growth
– Litter size
– Meat quality
– …

  • Populations > 200k animals
slide-12
SLIDE 12
slide-13
SLIDE 13

Beef cattle

  • Mostly family farms (5 - 50,000)
  • Many breeds
  • Final product – purebreds and crossbreds
  • Recording on:

– Growth
– Fertility
– Meat quality

  • Data size up to 2 million animals
slide-14
SLIDE 14

Results of genetic selection

  • Milk yield 2 times higher
  • Chicken

– Time to maturity over 2 times shorter
– Food efficiency 2 times higher

  • Swine
  • Beef
  • Fish
slide-15
SLIDE 15

Genetic value of individual

On a population level: g = a + “rest”

Var(a) = A σ_a²

A – matrix of relationships among animals
σ_a² – additive variance

A dense; A⁻¹ sparse and easy to set up
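The remark that the inverse of A is sparse and easy to set up can be illustrated with Henderson's rules, which build the inverse directly from a pedigree without ever forming A. The sketch below is not taken from the slides: it assumes animals are numbered 0..n-1, parents are given as indices (or None when unknown), and inbreeding is ignored. The function name `a_inverse` and the small pedigree are hypothetical.

```python
from collections import defaultdict


def a_inverse(pedigree):
    """Build the inverse relationship matrix as {(row, col): value}.

    pedigree is a list of (animal, sire, dam) with None for unknown parents.
    Inbreeding is ignored (an assumption of this sketch).
    """
    ainv = defaultdict(float)
    for animal, sire, dam in pedigree:
        parents = [p for p in (sire, dam) if p is not None]
        # Inverse of the Mendelian sampling variance:
        # 1, 4/3 or 2 for 0, 1 or 2 known parents.
        d = 4.0 / (4 - len(parents))
        # The animal enters with weight 1, each known parent with -1/2.
        weights = [(animal, 1.0)] + [(p, -0.5) for p in parents]
        for i, wi in weights:
            for j, wj in weights:
                ainv[(i, j)] += d * wi * wj
    return dict(ainv)


# Hypothetical pedigree: two founders, one full offspring, one with unknown dam.
pedigree = [(0, None, None), (1, None, None), (2, 0, 1), (3, 0, None)]
ainv = a_inverse(pedigree)
for key in sorted(ainv):
    print(key, round(ainv[key], 4))
```

Each animal adds at most nine coefficients, so the number of nonzeros grows linearly with the number of animals, which is why the inverse stays sparse even when A itself is dense.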

slide-16
SLIDE 16

Example of a model

  • Litter weight =

– contemporary group +
– age class +
– genetic group +
– animal +
– e

Var(e) = I σ_e², Var(animal) = A σ_a²
σ_e², σ_a² – variance components

slide-17
SLIDE 17

Mixed model

y = Xβ + Zu + e

y – vector of records
β – vector of fixed effects
u – vector of random effects
e – vector of residuals
X, Z – design matrices

Fixed effects – usually few levels, lots of information
Random effects – usually many levels, little information

slide-18
SLIDE 18

Mixed model equations

R block diagonal with small blocks
G block diagonal with small or large blocks
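For reference, the standard (Henderson) form of the mixed model equations is written out below. It is the textbook form rather than a transcription of the slide, and it assumes Var(u) = G and Var(e) = R, with y, β, u, X and Z as defined for the mixed model.

```latex
\[
\begin{bmatrix}
X'R^{-1}X & X'R^{-1}Z \\
Z'R^{-1}X & Z'R^{-1}Z + G^{-1}
\end{bmatrix}
\begin{bmatrix}
\hat{\beta} \\
\hat{u}
\end{bmatrix}
=
\begin{bmatrix}
X'R^{-1}y \\
Z'R^{-1}y
\end{bmatrix}
\]
```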
slide-19
SLIDE 19

Matrices in Mixed Model

  • Symmetric
  • Positive semi-definite
  • Sparse

– 3-200 nonzeros per row (on average)

  • Can be constructed as a sum of outer products

Σ Wi’ Qi Wi

Wi – 1-20 rows x 20k-300 million columns, < 100 nonzeros
Qi – small square matrix (less than 20 x 20)
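The construction of the coefficient matrix as a sum of small contributions can be sketched as follows. This is an illustrative assumption, not the authors' code: each W is stored as a list of sparse rows ({column: value}), Q is a small dense list of lists, and the matrix is accumulated in a dictionary keyed by (row, column). The name `add_contribution` and the toy records are hypothetical.

```python
from collections import defaultdict


def add_contribution(lhs, w_rows, q):
    """Add W' Q W to lhs.

    w_rows[r] is row r of W as {column: value}; q is a small square matrix
    (list of lists) with one row per row of W; lhs maps (row, col) -> value.
    """
    for r, row_r in enumerate(w_rows):
        for s, row_s in enumerate(w_rows):
            q_rs = q[r][s]
            if q_rs == 0.0:
                continue
            # (W' Q W)[i, j] = sum over r, s of W[r, i] * Q[r, s] * W[s, j]
            for i, wi in row_r.items():
                for j, wj in row_s.items():
                    lhs[(i, j)] += wi * q_rs * wj


# Toy single-trait example: equation 0 is a contemporary group,
# equations 1 and 2 are animals; each record touches two equations.
lhs = defaultdict(float)
records = [[{0: 1.0, 1: 1.0}], [{0: 1.0, 2: 1.0}]]
for w_rows in records:
    add_contribution(lhs, w_rows, [[1.0]])
print(sorted(lhs.items()))
```

Because every contribution touches only the few columns where W is nonzero, the cost of assembly is proportional to the number of records, not to the size of the full matrix.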

slide-20
SLIDE 20

Models

  • Sire model

y = cg + … + sire + e

1000k animals ≈ 50k equations

  • Animal model

y = cg + … + animal + e

1000k animals ≈ 1200k equations

  • Multiple trait model

y1 = cg1 + … + animal1 + … + e1

….

yn = cgn + … + animaln + … + en

1000k animals > 3000k equations

  • Random regression model (longitudinal)

y = cg + … + Σ fi(x) animali + … + e

1000k animals > 6000k equations

slide-21
SLIDE 21

Tasks

  • Estimate variance components

– Usually sample of 5-50k animals

  • Solve mixed model equations

– Populations up to 20 million animals
– Up to 60 unknowns per animal

slide-22
SLIDE 22

Data structures for sparse matrices

  • Linked list
  • Triples as Hash
  • IJA

– (row pointer to columns and values)

  • Matrix not stored
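Two of the structures listed above can be sketched together: a hash of triples, which is convenient while the matrix is being accumulated, and the IJA layout (commonly called compressed sparse row), which is compact and fast for products. The code is an illustration under assumed names (`triples_to_ija`, `ija_matvec`), not the layout of any particular package.

```python
def triples_to_ija(triples, n_rows):
    """Convert {(row, col): value} into IJA arrays.

    Entries of row r are stored at positions ia[r] .. ia[r + 1] - 1
    of ja (column indices) and a (values).
    """
    ia = [0] * (n_rows + 1)
    for (r, _c) in triples:
        ia[r + 1] += 1                  # count nonzeros per row
    for r in range(n_rows):
        ia[r + 1] += ia[r]              # running sum gives row pointers
    ja = [0] * len(triples)
    a = [0.0] * len(triples)
    next_free = ia[:-1]                 # slicing copies: next slot per row
    for (r, c), value in sorted(triples.items()):
        k = next_free[r]
        ja[k] = c
        a[k] = value
        next_free[r] += 1
    return ia, ja, a


def ija_matvec(ia, ja, a, x):
    """Return A x for a matrix stored in IJA form."""
    return [
        sum(a[k] * x[ja[k]] for k in range(ia[r], ia[r + 1]))
        for r in range(len(ia) - 1)
    ]


triples = {(0, 0): 2.0, (0, 1): 1.0, (0, 2): 1.0,
           (1, 0): 1.0, (1, 1): 1.0, (2, 0): 1.0, (2, 2): 1.0}
ia, ja, a = triples_to_ija(triples, 3)
print(ia, ja, a)
print(ija_matvec(ia, ja, a, [1.0, 2.0, 3.0]))
```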
slide-23
SLIDE 23

Data structures

slide-24
SLIDE 24

Solving strategies

  • Sparse factorization
  • Iteration

– Gauss-Seidel
– Gauss-Seidel + 2nd order Jacobi
– PCG (preconditioned conjugate gradient)

  • Preconditioners from diagonal to incomplete factorization

slide-25
SLIDE 25

Gauss-Seidel Iteration & SOR

Ax = b

  • Simple
  • Stable and self correcting
  • Converges for positive semi-definite A
  • For balanced cross-classified models converges in one round
  • Small memory requirements
  • Hard to implement matrix-free
  • Slow convergence for complicated models
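A minimal sketch of Gauss-Seidel with optional over-relaxation (SOR) is given below. It assumes the matrix is available row by row, with the off-diagonal nonzeros of each row stored as {column: value} and the diagonal stored separately; the function name `sor`, the stopping rule and the test system are assumptions made for illustration.

```python
def sor(rows, diag, b, omega=1.0, tol=1e-10, max_rounds=10_000):
    """Solve A x = b by successive over-relaxation.

    rows[i] holds the off-diagonal nonzeros of row i as {column: value},
    diag[i] is the diagonal element. omega = 1.0 gives plain Gauss-Seidel.
    Returns the solution and the number of rounds used.
    """
    n = len(b)
    x = [0.0] * n
    for rounds in range(1, max_rounds + 1):
        change = 0.0
        for i in range(n):
            # Uses already-updated values of x within the same round.
            residual = b[i] - sum(aij * x[j] for j, aij in rows[i].items())
            new_value = (1.0 - omega) * x[i] + omega * residual / diag[i]
            change += (new_value - x[i]) ** 2
            x[i] = new_value
        if change ** 0.5 < tol:
            return x, rounds
    return x, max_rounds


# Small symmetric positive definite system with known solution [1, 2, 3].
rows = [{1: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0}]
diag = [4.0, 3.0, 2.0]
b = [6.0, 10.0, 8.0]
solution, rounds = sor(rows, diag, b)
print(solution, rounds)
```

Only the current solution vector is kept in memory, which matches the small memory requirement noted above; the need to visit the matrix one row at a time is what makes a matrix-free version awkward.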
slide-26
SLIDE 26

Preconditioned Conjugate Gradient

  • Large memory requirements
  • Tricky implementation
  • Easy to implement matrix-free
  • Usually converges a few times faster than SOR even with diagonal preconditioner!

Matrix-free iteration

Let A = Σ (Wi’ Wi)
nonzeros(W) << nonzeros(A)
W easily generated from data
Ax = Σ (Wi’ Wi x)
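The matrix-free idea can be sketched as a conjugate gradient solver with a diagonal preconditioner in which A is never stored: the product A x is accumulated record by record from the sparse rows W. This is an illustrative sketch only. It assumes one sparse row per record, that every unknown appears in at least one record, and that the resulting A is positive definite; the names `matvec` and `pcg` are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(records, x):
    """Return A x = sum of w' (w x) without forming A; w is {column: value}."""
    out = [0.0] * len(x)
    for w in records:
        wx = sum(value * x[j] for j, value in w.items())
        for j, value in w.items():
            out[j] += value * wx
    return out


def pcg(records, b, tol=1e-12, max_iter=1000):
    """Preconditioned conjugate gradient with a diagonal preconditioner."""
    n = len(b)
    diag = [0.0] * n
    for w in records:
        for j, value in w.items():
            diag[j] += value * value        # diagonal of sum of w' w
    x = [0.0] * n
    r = list(b)                             # residual for x = 0
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = dot(r, z)
    bb = dot(b, b)
    for _ in range(max_iter):
        if dot(r, r) <= tol * bb:
            break
        ap = matvec(records, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Three records on two unknowns; the right-hand side is W'y for y = [1, 3, 2].
records = [{0: 1.0}, {0: 1.0, 1: 1.0}, {1: 1.0}]
print(pcg(records, [4.0, 5.0]))
```

The extra vectors r, z, p and ap are the source of the larger memory requirement compared with Gauss-Seidel.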

slide-27
SLIDE 27

Methodologies to estimate variance components

  • Restricted Maximum Likelihood (REML)
  • Markov chain Monte Carlo (MCMC)
slide-28
SLIDE 28

REML

Φ – variance components (in R and G)
C* – LHS converted to full rank

Maximizations:

– Derivative free (use sparse factorization)
– First derivative (expectation maximization; use sparse inverse)
– Second derivative: D2 and E(D2) hard to compute, but [D2 + E(D2)]/2 simpler (Average Information REML)

slide-29
SLIDE 29

Sparse matrix inversion

  • Takahashi method
  • Can obtain inverse elements only for elements where L ≠ 0
  • Inverses obtained for sparse matrices as large as 1000k x 1000k
  • Cost ≈ 2 x sparse factorization
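The recursion behind the Takahashi method can be sketched on a small dense matrix. With A = L D L' (L unit lower triangular), the inverse Z satisfies z_ij = δ_ij / d_i − Σ_{k>i} l_ki z_kj for j ≥ i, evaluated from the last row upward. The code below is a dense illustration under that assumption; a sparse implementation would evaluate only the positions where L is nonzero, which is what keeps the cost close to that of the factorization. The function names are hypothetical.

```python
def ldl(a):
    """Factor a symmetric positive definite matrix as L D L' (dense sketch)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = a[j][j] - sum(low[j][k] ** 2 * d[k] for k in range(j))
        low[j][j] = 1.0
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k] * d[k] for k in range(j))
            low[i][j] = (a[i][j] - s) / d[j]
    return low, d


def takahashi_inverse(low, d):
    """Compute the inverse from L and D, starting at the bottom-right corner."""
    n = len(d)
    z = [[0.0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, i - 1, -1):
            s = sum(low[k][i] * z[k][j] for k in range(i + 1, n))
            z[i][j] = (1.0 / d[i] if i == j else 0.0) - s
            z[j][i] = z[i][j]               # inverse is symmetric
    return z


a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
low, d = ldl(a)
z = takahashi_inverse(low, d)
for row in z:
    print([round(v, 4) for v in row])
```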
slide-30
SLIDE 30

REML, properties

  • Derivative free methods reliable only for simple problems
  • Derivative methods

– Difficult formulas (nearly impossible for nonstandard models)
– High computing cost ≈ quadratic
– Easy determination of termination

slide-31
SLIDE 31

Bayesian Methods and MCMC

slide-32
SLIDE 32

Samples

slide-33
SLIDE 33

Approximate p(σ_e² | y)

slide-34
SLIDE 34

Approximate p(σ_p² | y)

slide-35
SLIDE 35

Approximate p(σ_a² | y)

slide-36
SLIDE 36

MCMC, properties

– Much simpler formulas
– Can accommodate large models and complicated models
– Can take months if not optimized
– Details important; hard to decide when to stop

slide-37
SLIDE 37

Optimization in sampling methods

  • 10k-1 million samples
  • Slow if equations regenerated each round
  • If equations can be represented as:

[R⊗X1 + G⊗X2] = R⊗y

R, G – estimated small matrices; X1, X2, y – constant

X1, X2 and y can be created and stored once. Requires tricks if models differ by trait or if traits are missing.

slide-38
SLIDE 38

Software

  • SAS (fixed models or small mixed models)
  • Custom
  • Packages

– PEST, VCE (Groeneveld et al.)
– ASREML (Gilmour)
– DMU (Jensen et al.)
– MATVEC (Wang et al.)
– BLUPF90 etc. (Misztal et al.)

slide-39
SLIDE 39

Larger data, more complicated models, simpler computing?

slide-40
SLIDE 40

Computing platforms

  • Past

– Mainframe
– Supercomputers

  • Sparse computations vectorize!
  • Current

– PCs + workstations
– Windows/Linux, Unix

  • Parallel and vector processing not important
slide-41
SLIDE 41

Random regression model on a parallel processor

(Madsen et al., 1999; Lidauer et al., 1999)

Goal: compute Σ (Wi’Wi x); Wi – large sparse vectors

  • Approaches:

a) Distribute collection of Σ (Wi’Wi) x to separate processors
b) Optimize scalar algorithm first to Σ (Wi’(Wi x))

If Wi has 30 nonzeros:

a) 900 multiplications
b) 60 multiplications

Scalar optimization more important than brute force parallelization
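The 900 versus 60 comparison can be checked with a short counting sketch. It assumes one sparse vector w with 30 nonzeros and counts only the multiplications involving x; for approach (a) the cost of forming the 900 coefficients of the outer product is not counted, which is an assumption about how the figure was obtained. The function names are hypothetical.

```python
def apply_outer_product(w, x):
    """Approach (a): expand w'w into its coefficients, then multiply by x."""
    coefficients = {(i, j): wi * wj
                    for i, wi in w.items() for j, wj in w.items()}
    out = {i: 0.0 for i in w}
    multiplications = 0
    for (i, j), c in coefficients.items():
        out[i] += c * x[j]
        multiplications += 1
    return out, multiplications


def apply_factored(w, x):
    """Approach (b): compute the scalar w x first, then scale w by it."""
    multiplications = 0
    wx = 0.0
    for j, wj in w.items():
        wx += wj * x[j]
        multiplications += 1
    out = {}
    for i, wi in w.items():
        out[i] = wi * wx
        multiplications += 1
    return out, multiplications


# One sparse vector with 30 nonzeros scattered over 300 columns.
w = {7 * k: 1.0 + 0.1 * k for k in range(30)}
x = [0.01 * j for j in range(300)]
result_a, count_a = apply_outer_product(w, x)
result_b, count_b = apply_factored(w, x)
print(count_a, count_b)
```

Both functions return the same contribution to A x; only the grouping of the operations differs.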

slide-42
SLIDE 42

Other Models

  • Censored
  • Survival
  • Threshold
  • …..
slide-43
SLIDE 43

Issues

  • 0.99 problem
  • Sophistication of statistics vs. understanding of problem vs. data editing
  • Undesired responses of selection

– Less fitness, …
– Aggressiveness (swine, poultry)

  • Challenges of molecular genetics
slide-44
SLIDE 44

Molecular Genetics

  • Attempts to identify effects of genes on individual traits
  • Simple statistical methodologies
  • Methodology for joint analyses with phenotypic and DNA data difficult

  • Active research area
slide-45
SLIDE 45

Conclusions

  • Animal breeding is compute intensive

– Large systems of equations
– Matrices sparse

  • Research has high economic value