Algebraic multigrid methods for mechanical engineering applications


SLIDE 1

Algebraic multigrid methods for mechanical engineering applications

Mark F. Adams

17th International Conference on Domain Decomposition Methods
St. Wolfgang/Strobl, Austria, 3 July 2006

SLIDE 2

Outline

  • Algebraic multigrid (AMG)
  • Coarse grid spaces
  • Smoothers: Add. (Cheb.) and Mult. (G-S)
  • Industrial applications
  • Micro-FE bone modeling
  • Scalability/performance studies
  • Weak and strong (scaled/unscaled) speedup
  • Multigrid algorithms for KKT system
  • New AMG framework for KKT systems
SLIDE 3

Multigrid smoothing and coarse grid correction (projection)

[Figure: the multigrid V-cycle: smoothing on the finest grid, restriction (R) to a smaller first coarse grid, and prolongation (P = R^T) back up.]

SLIDE 4

Multigrid components

  • Smoother S^ν(f, u0): ν iterations of a simple preconditioner (Schwarz)
  • Multiplicative: great theoretical properties, but problematic in parallel
  • Additive: requires damping (eg, Chebyshev polynomials)
  • Prolongation (interpolation) operator P
  • Restriction operator R (R = P^T): maps residuals from the fine to the coarse grid
  • Columns of P: discrete coarse grid functions on the fine grid
  • Algebraic (Galerkin) coarse grid: A_H = R A_h P
  • An AMG method is defined by its S and P operators (sketched below)
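
A minimal NumPy sketch of these components; the 1D Laplacian and the piecewise-constant P are illustrative assumptions, not the talk's operators:

    import numpy as np

    # Fine grid operator: a 1D Laplacian (illustrative assumption).
    n = 8
    A_h = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    # Prolongation P: columns are discrete coarse grid functions on the
    # fine grid; here piecewise constant over pairs of fine points.
    P = np.zeros((n, n // 2))
    for j in range(n // 2):
        P[2 * j:2 * j + 2, j] = 1.0

    R = P.T                 # restriction, R = P^T
    A_H = R @ A_h @ P       # algebraic (Galerkin) coarse grid: A_H = R A_h P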
SLIDE 5

Outline

  • Algebraic multigrid (AMG)
  • Coarse grid spaces
  • Smoothers: Add. (Cheb.) and Mult. (G-S)
  • Industrial applications
  • Micro-FE bone modeling
  • Scalability/performance studies
  • Weak and strong (scaled/unscaled) speedup
  • Multigrid algorithms for KKT system
  • New AMG framework for KKT systems
SLIDE 6

Smoothed Aggregation

  • Piecewise constant functions: "plain" aggregation (P0)
  • Start with the kernel vectors B of the operator (eg, the 6 rigid body modes in elasticity)
  • Nodal aggregation maps B into the tentative prolongator P0
  • "Smoothed" aggregation lowers the energy of the coarse functions with one Jacobi iteration: P ← ( I - ω D^-1 A ) P0, as in the sketch below
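
A small NumPy sketch of plain vs. smoothed aggregation; a scalar 1D Laplacian with a constant kernel vector stands in for elasticity and its 6 rigid body modes:

    import numpy as np

    n = 9
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # model operator

    # Kernel vector B: the constant function for this scalar problem
    # (elasticity would use the 6 rigid body modes).
    B = np.ones((n, 1))

    # Nodal aggregation: 3 aggregates of 3 nodes each -> tentative
    # prolongator P0 with piecewise constant columns ("plain" aggregation).
    aggregates = [range(0, 3), range(3, 6), range(6, 9)]
    P0 = np.zeros((n, len(aggregates)))
    for j, agg in enumerate(aggregates):
        P0[list(agg), j] = B[list(agg), 0]

    # "Smoothed" aggregation: one Jacobi iteration lowers the energy of
    # the coarse functions, P = (I - omega D^-1 A) P0.
    omega = 2.0 / 3.0
    Dinv = np.diag(1.0 / np.diag(A))
    P = (np.eye(n) - omega * Dinv @ A) @ P0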

SLIDE 7

Outline

  • Algebraic multigrid (AMG)
  • Coarse grid spaces
  • Smoothers: Add. (Cheb.) and Mult. (G-S)
  • Industrial applications
  • Micro-FE bone modeling
  • Scalability/performance studies
  • Weak and strong (scaled/unscaled) speedup
  • Multigrid algorithms for KKT system
  • New AMG framework for KKT systems
SLIDE 8

Smoothers

  • CG/Jacobi: additive
  • Essentially damped by CG (Adams, SC1999)
  • Dot products; non-stationary
  • Gauss-Seidel: multiplicative (the optimal MG smoother)
  • Complex communication and computation (Adams, SC2001)
  • Polynomial smoothers: additive
  • Chebyshev is ideal for MG (Adams et al., JCP 2003)
  • Chebyshev chooses p(y) such that |1 - p(y) y| = min over the interval [λ*, λmax]
  • An estimate of λmax is easy
  • Use λ* = λmax / C (no need for the lowest eigenvalue)
  • C is related to the rate of grid coarsening (see the sketch below)
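
A sketch of such a Chebyshev smoother in NumPy, using the standard three-term recurrence on [λmax/C, λmax]; the model matrix and the choice C = 10 are illustrative assumptions:

    import numpy as np

    def chebyshev_smooth(A, b, x, steps, lam_max, C=10.0):
        """Chebyshev smoothing targeting the interval [lam_max/C, lam_max];
        only lam_max is estimated, and C reflects the coarsening rate."""
        lam_min = lam_max / C
        theta = 0.5 * (lam_max + lam_min)
        delta = 0.5 * (lam_max - lam_min)
        sigma = theta / delta
        rho = 1.0 / sigma
        d = (b - A @ x) / theta          # first (degree-1) correction
        x = x + d
        for _ in range(steps - 1):
            rho_new = 1.0 / (2.0 * sigma - rho)
            d = rho_new * rho * d + (2.0 * rho_new / delta) * (b - A @ x)
            x = x + d
            rho = rho_new
        return x

    n = 64
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    lam_max = np.linalg.eigvalsh(A)[-1]  # in practice: a few power/Lanczos steps
    x = chebyshev_smooth(A, np.ones(n), np.zeros(n), steps=3, lam_max=lam_max)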
SLIDE 9

Parallel Gauss-Seidel

  • Multiplicative smoothers
  • (+) Powerful
  • (+) Great for MG
  • (-) Difficult to parallelize
  • Ideas: use processor partitions, and use 'internal' work to hide communication
  • Symmetric!

Example: 2D, 4 processors

SLIDE 10

Cray T3E - 24 Processors – About 30,000 dof Per Processor

SLIDE 11

Iteration counts (80K to 76M equations)

SLIDE 12

Outline

  • Algebraic multigrid (AMG)
  • Coarse grid spaces
  • Smoothers: Add. (Cheb.) and Mult. (G-S)
  • Industrial applications
  • Micro-FE bone modeling
  • Scalability/performance studies
  • Weak and strong (scaled/unscaled) speedup
  • Multigrid algorithms for KKT system
  • New AMG framework for KKT systems
SLIDE 13

Aircraft carrier

  • 315,444 vertices
  • Shell and beam elements (6 DOF per node)
  • Linear dynamics: transient (time domain)
  • About 1 min. per solve (rtol = 10^-6)
  • 2.4 GHz Pentium 4/Xeon processors
  • Matrix-vector product runs at 254 Mflops
SLIDE 14

SLIDE 15

SLIDE 16

“BR” tire

SLIDE 17

Math does matter!

SLIDE 18

Outline

  • Algebraic multigrid (AMG)
  • Coarse grid spaces
  • Smoothers: Add. (Cheb.) and Mult. (G-S)
  • Industrial applications
  • Micro-FE bone modeling
  • Scalability/performance studies
  • Weak and strong (scaled/unscaled) speedup
  • Multigrid algorithms for KKT system
  • New AMG framework for KKT systems
SLIDE 19

Trabecular Bone

[Figure: cortical and trabecular bone; a 5-mm cube specimen and FE mesh generation.]

SLIDE 20

Outline

  • Algebraic multigrid (AMG)
  • Coarse grid spaces
  • Smoothers: Add. (Cheb.) and Mult. (G-S)
  • Industrial applications
  • Micro-FE bone modeling
  • Scalability/performance studies
  • Weak and strong (scaled/unscaled) speedup
  • Multigrid algorithms for KKT system
  • New AMG framework for KKT systems
SLIDE 21

Computational Architecture: Olympus

  • Athena: parallel FE code; reads the µFE mesh input file and partitions it to SMPs (via ParMetis), writing an FE input file per partition
  • ParMetis: parallel mesh partitioner (University of Minnesota)
  • FEAP: serial general-purpose FE application (University of California); one instance per partition, driven by a material card
  • Prometheus: multigrid solver, coupled to FEAP through pFEAP
  • PETSc: parallel numerical libraries (Argonne National Labs)
  • Visualization: per-partition Silo DBs, viewed with VisIt

[Figure: data flow of the Olympus architecture, from the µFE mesh input file through Athena/ParMetis partitioning, per-partition FEAP instances with METIS, the pFEAP/Prometheus/PETSc solver stack, and Silo/VisIt visualization.]

SLIDE 22

Viz:

  • Geometric & material nonlinear
  • 2.25% strain
  • 8 procs.
  • DataStar (SP4 at UCSD)

SLIDE 23

Scalability: Vertebral Body (80 µm w/ shell)

  • Large deformation elasticity
  • 6 load steps (3% strain)
  • Scaled speedup: ~131K dof/processor
  • 7 to 537 million dof
  • 4 to 292 nodes
  • IBM SP Power3, 14 of 16 procs/node used
  • Double/Single Colony switch
SLIDE 24

Scalability (80 µm w/o shell)

  • Inexact Newton
  • CG linear solver with variable tolerance
  • Smoothed aggregation AMG preconditioner
  • (Vertex block) diagonal smoothers:
  • 2nd order Chebyshev (additive)
  • Gauss-Seidel (multiplicative)

SLIDE 25

Computational phases

  • Mesh setup (per mesh):
  • Coarse grid construction (aggregation)
  • Graph processing
  • Matrix setup (per matrix):
  • Coarse grid operator construction
  • Sparse matrix triple product RAP (expensive for S.A.)
  • Subdomain factorizations
  • Solve (per RHS):
  • Matrix vector products (residuals, grid transfer)
  • Smoothers (Matrix vector products)
SLIDE 26

[Figure: Flops/sec/processor at 131K dof/proc; 0.47 Teraflop/s on 4088 processors.]

SLIDE 27

Sources of inefficiencies: Linear solver iterations

CG iterations per Newton step (Newton 1, 2, ...):

Load   Small (7.5M dof)    Large (537M dof)
1      5 14 20 21 18       5 11 35 25 70 2
2      5 14 20 20 20       5 11 36 26 70 2
3      5 14 20 22 19       5 11 36 26 70 2
4      5 14 20 22 19       5 11 36 26 70 2
5      5 14 20 22 19       5 11 36 26 70 2
6      5 14 20 22 19       5 11 36 26 70 2

SLIDE 28

Sources of scale inefficiencies in solve phase

                        7.5M dof   537M dof
#iterations             450        897
#nnz/row                50         68
#elems/proc             19.3K      33.0K
Flop rate               76         74
Measured run time       1.00       2.61
Model                   1.00       2.78

SLIDE 29

Strong speedup: 7.5M dof (1 to 128 nodes)

SLIDE 30

Nodal Performance of IBM SP Power3 and Power4

  • IBM Power3, 16 processors per node
  • 375 MHz, 4 flops per cycle
  • 16 GB/sec bus (~7.9 GB/sec with the STREAM benchmark)
  • Implies a memory-bandwidth peak of ~1.5 Gflops/s for mat-vec
  • We get ~1.2 Gflops/s (15 procs x 0.08 Gflops)
  • IBM Power4, 32 processors per node
  • 1.3 GHz, 4 flops per cycle
  • Complex memory architecture
SLIDE 31

Speedup

SLIDE 32

Outline

  • Algebraic multigrid (AMG)
  • Coarse grid spaces
  • Smoothers: Add. (Cheb.) and Mult. (G-S)
  • Industrial applications
  • Micro-FE bone modeling
  • Scalability/performance studies
  • Weak and strong (scaled/unscaled) speedup
  • Multigrid algorithms for KKT system
  • New AMG framework for KKT systems
SLIDE 33

Constrained Linear Systems With Lagrange Multipliers

  • Solid mechanics: contact, tied meshes, RBE3s, ...
  • Mixed methods: incompressible flow
  • Optimization problems, ...

    [ K   C^T ] [ u ]   [ f ]
    [ C   0   ] [ λ ] = [ g ]
SLIDE 34

AMG for Constrained Systems (KKT-AMG)

1) Use the identity: coarsen the KKT operator in block (Galerkin) form, with primal transfers R_i, P_i and constraint transfers R_i^C,

    [ R_i  0     ] [ K_i  C_i^T ] [ P_i  0         ]   [ R_i K_i P_i     (R_i^C C_i P_i)^T ]
    [ 0    R_i^C ] [ C_i  0     ] [ 0    (R_i^C)^T ] = [ R_i^C C_i P_i   0                 ]

(the off-diagonal blocks remain transposes of each other because R_i = P_i^T).

2) Constraint coarsening: A_{i+1} = R A_i P, ie, the coarse primal operator is K_{i+1} = R_i K_i P_i and the coarse constraint matrix is C_{i+1} = R_i^C C_i P_i. A sketch of this coarsening step follows below.
SLIDE 35

Coarse Grid Space

  • Start simple: plain aggregation
  • Standard AMG graph problem: aggregate strongly connected domains
  • But what graph T? (ie, what symmetric matrix)
  • T = C C^T
  • T = C Y C^T
  • Y = P P^T: T = C P P^T C^T = (CP)(CP)^T (sketch below)
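
A small sketch of forming T = (CP)(CP)^T; the toy C (one unit entry per row) and the two-aggregate P are assumptions for illustration:

    import numpy as np

    # Toy setup: 6 primal dof, constraints on nodes 1, 2 and 4 (one unit
    # entry per row of C), primal prolongator P with two aggregates of 3.
    C = np.zeros((3, 6)); C[0, 1] = C[1, 2] = C[2, 4] = 1.0
    P = np.zeros((6, 2)); P[:3, 0] = 1.0; P[3:, 1] = 1.0

    CP = C @ P
    T = CP @ CP.T   # T = (CP)(CP)^T: constraints sharing a primal aggregate couple
    print(T)        # [[1,1,0],[1,1,0],[0,0,1]] -> aggregate constraints {0,1}, {2}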

SLIDE 36

Motivation for T = C P P^T C^T (with piecewise constant L.M. spaces)

Recall C_{l+1} = P_C^T C_l P, where P_C is the piecewise constant aggregation of the Lagrange multipliers. Consider

    C_{l+1} C_{l+1}^T = P_C^T ( C_l P P^T C_l^T ) P_C = P_C^T T P_C

so, entrywise over constraint aggregates I and J,

    ( C_{l+1} C_{l+1}^T )[I,J] = Σ_{i∈I, j∈J} T_ij ∝ cos(θ_IJ) · max_{i∈I, j∈J} T_ij

Thus T measures the angles between the (smoothed) constraint rows, which is exactly the coupling the constraint aggregation should follow.
SLIDE 37

Smoothers for Constrained Systems

  • Constraint centric Schwarz (Vanka)
  • Multiplicative
  • Additive
  • Segregated, PC Uzawa (Braess,…)
  • Additive
  • ILU (Shultz, Taskflow)
  • Level fill (1)
SLIDE 38

Segregated Smoothers for Constrained Systems

  • Primal smoother M^-1 for K^-1; dual smoother Q^-1 for S^-1
  • u ← u0 + M^-1 ( f - C^T p0 - K u0 )
  • p ← p0 + Q^-1 ( C u - g )
  • Preconditioned Uzawa
  • Symmetrize: u ← u + M^-1 C^T ( p0 - p )
  • Q = C D^-1 C^T, processor block Jacobi
The segregated view comes from the block factorization of the KKT matrix:

    [ K   C^T ]   [ I        0 ] [ K   C^T ]
    [ C   0   ] = [ C K^-1   I ] [ 0   -S  ],    S = C K^-1 C^T

so a primal smoother for K and a dual smoother for the Schur complement S together smooth the full system; a sketch follows below.
SLIDE 39

Constraint Centric Schwarz Smoothers

  • Use domain decomposition (Schwarz) ideas (overlapping)
  • Aggregate constraint equations into non-overlapping subdomains
  • Generalized Vanka (additive)
  • For each constraint domain (processor): add the primal equations in the support of its constraints
  • Well-posed KKT matrix (exact solve)
SLIDE 40

Constraint Centric Domain Decomposition Smoothers

  • Multiplicative Schwarz
  • Additive constraint part: B_d = B_1 + B_2 + ... + B_k
  • B_i = R_i^T ( R_i A R_i^T )^-1 R_i, exact subdomain solves
  • B_p: multiplicative smoother for the primal eqs.
  • B_p = R^T M^-1 R, R = [ I 0 ], smoother M^-1 for the primal eqs.
  • M^-1: parallel Gauss-Seidel
  • x ← x0 + ( B_p + B_d - B_d A B_p )( b - A x0 )
  • Additive Schwarz (B_p: additive smoother)
  • x ← x0 + ( B_p + B_d )( b - A x0 ) (see the sketch below)
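
A dense NumPy sketch of both combinations; the subdomain index sets and the primal smoother are passed in, and exact subdomain solves realize B_i = R_i^T (R_i A R_i^T)^-1 R_i:

    import numpy as np

    def schwarz_smooth(A, b, x, subdomains, Bp_apply, multiplicative=True):
        """Constraint-centric Schwarz: B_d = sum_i R_i^T (R_i A R_i^T)^-1 R_i
        over non-overlapping constraint subdomains (plus their primal support),
        combined with a primal smoother B_p additively or multiplicatively."""
        def Bd_apply(r):
            z = np.zeros_like(r)
            for idx in subdomains:                    # index set of one domain
                z[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
            return z

        r = b - A @ x
        if multiplicative:    # x <- x0 + (B_p + B_d - B_d A B_p)(b - A x0)
            y = Bp_apply(r)
            return x + y + Bd_apply(r - A @ y)
        return x + Bp_apply(r) + Bd_apply(r)          # additive variant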
SLIDE 41

Numerical results

  • Krylov solvers, preconditioned with one multigrid V-cycle (primal preconditioner)
  • Primal smoother: nodal diagonal PC
  • 1st order Chebyshev (additive)
  • 1 iteration of symmetric Gauss-Seidel (multiplicative)

1) GMRES / KKT-AMG
  • Constraint centric DD, additive and multiplicative
  • Symmetric preconditioned Uzawa (segregated) iteration
  • Processor block Jacobi solver for C D^-1 C^T
  • ILU, level fill (1)

2) Uzawa outer iterations / CG inner iterations
  • Highly optimized production code
SLIDE 42

Adagio “Circles” Problem

  • ~7200 dof
  • 2 contact surfaces
  • 2 layers of 1st order hexahedra (displacement FE)
  • 10^2 ratio in elastic modulus
  • Contact w/ friction; results in a tied mesh
  • 480 constraint eqs.
SLIDE 43

Adagio “Circles” Problem

SLIDE 44

Independence of smoothers from (processor) domain size

Iteration counts vs. # domains (processors):

Smoother            1     2     4     8
Uzawa             100   122   127   106
ILU                30    33    35    32
CCS (mult.)        35    34    37    31
CCS (add.)         47    43    47    42
Segregated         51    57    52    47

SLIDE 45

Adagio forging problem

  • ~128K dof
  • 64 Lagrange multipliers
  • 5 “time” steps
  • 1st contact in 3rd step
  • 16 processors
  • SGI 2000
SLIDE 46

Solve Times (sec)

Smoother       end-to-end   setup   solve (# of solves)
Uzawa                1057     238   355 (304)
ILU                  2096     607   1025 (304)
CCS (mult.)           910     257   182 (306)
CCS (add.)           1400     255   654 (313)
Segregated           1060     259   293 (309)

SLIDE 47

Mesh independence

Iterations (1st solve):

Dof (approx., primal / L.M.)   Levels   Uzawa   CCS (add.)   CCS (mult.)   Segregated
54 / 2                            4       28        10            4            8
284 / 4                           3       29         8            5            9
8016 / 17                         2       29         7            4            8

SLIDE 48

Thank You

"Ultrascalable implicit finite element analyses in solid mechanics with over a half a billion degrees of freedom," M.F. Adams, H.H. Bayraktar, T.M. Keaveny, P. Papadopoulos, ACM/IEEE Proceedings of SC2004: High Performance Networking and Computing.

SLIDE 49

164K dof/proc

SLIDE 50

End to end times and (in)efficiency components

SLIDE 51

First try: Flop rates (265K dof/processor)

  • 265K dof per proc.
  • IBM switch bug: bisection bandwidth plateau at 64-128 nodes
  • Solution: use more processors
  • Fewer dof per proc. means less pressure on the switch bisection bandwidth
SLIDE 52

Multigrid V(ν1,ν2)-cycle

  • Given smoother S and coarse grid space (P)
  • Columns of the "prolongation" operator P: a discrete representation of the coarse grid space

    function u = MG-V(A, f)
        if A is small
            u ← A^-1 f
        else
            u ← S^ν1(f, u)                 # ν1 steps of smoother (pre)
            r_H ← P^T ( f - A u )
            u_H ← MG-V( P^T A P, r_H )     # recursion (Galerkin)
            u ← u + P u_H
            u ← S^ν2(f, u)                 # ν2 steps of smoother (post)

  • Iteration matrix: T = S ( I - P (R A P)^-1 R A ) S (multiplicative); a runnable transcription follows below
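
A runnable NumPy transcription of the pseudocode above; damped Jacobi stands in for the smoother S, and the pairwise-constant P is an illustrative assumption:

    import numpy as np

    def jacobi(A, f, u, nu, omega=2.0 / 3.0):
        Dinv = 1.0 / np.diag(A)
        for _ in range(nu):
            u = u + omega * Dinv * (f - A @ u)
        return u

    def mg_v(A, f, Ps, smooth=jacobi, nu1=1, nu2=1):
        """MG-V(A, f) as in the pseudocode; Ps holds one prolongator per level."""
        if not Ps:                                   # A is small: direct solve
            return np.linalg.solve(A, f)
        P = Ps[0]
        u = smooth(A, f, np.zeros_like(f), nu1)      # pre-smoothing
        rH = P.T @ (f - A @ u)                       # restrict residual
        uH = mg_v(P.T @ A @ P, rH, Ps[1:], smooth, nu1, nu2)  # Galerkin recursion
        u = u + P @ uH                               # coarse grid correction
        return smooth(A, f, u, nu2)                  # post-smoothing

    # Two-level demo on a 1D Laplacian with pairwise aggregation.
    n = 8
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    P = np.zeros((n, n // 2))
    for j in range(n // 2):
        P[2 * j:2 * j + 2, j] = 1.0
    u = mg_v(A, np.ones(n), [P])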
SLIDE 53

Polynomial Smoothers - Additive preconditioners

  • Additive preconditioners (M) parallelize well
  • Polynomial methods are additive: x^(m+1) = x^(m) + p(MA) ( Mb - MA x^(m) )
  • Chebyshev is ideal for multigrid smoothers
  • Chebyshev chooses p(y) such that |1 - p(y) y| = min over the interval [λ*, λmax]
  • No need for the lowest eigenvalue; an estimate of λmax is straightforward
  • Use λ* = λmax / C, with C related to the rate of grid coarsening
SLIDE 54

Iterations on a 2D Laplacian:

Smoother                  Order   Iterations
damped Jacobi/Cheb.         1         53
damped Jacobi               2         27
Chebyshev                   2         19
Chebyshev                   3         13
lex. Gauss-Seidel           1         28
lex. Gauss-Seidel           2         16
lex. Gauss-Seidel           3         13
red-black Gauss-Seidel      1         20
red-black Gauss-Seidel      2         11
red-black Gauss-Seidel      3         10

SLIDE 55

Methods: µFE modeling

[Figure: micro-computed tomography (µCT @ 22 µm resolution) gives a 3D image of a 2.5 mm cube; 44 µm elements form the µFE mesh; mechanical testing provides E, ε_yield, σ_ult, etc.]

SLIDE 56

Motivation

  • Calibrate material models for continuum elements (eg, explicit computation of a yield surface)
  • Highly accurate modeling capability for low-order model validation
  • Investigation of effects that are not accessible with lower-order models:
  • role of the cortical shell in load carrying of the vertebra
  • effects of drug treatment on continuum properties
SLIDE 57

ParMetis partitions

SLIDE 58

Common Solution Methods for KKT systems

  • Uzawa (augmented Lagrange)
  • Richardson iteration on the Schur complement: ( C K^-1 C^T ) λ = C K^-1 f - g
  • Constraint (Schur) reduction
  • Static condensation: eliminate a 'slave' variable in each constraint
  • Projection methods:
  • Krylov method with P^T K P u' = fp
  • P = I - C^T ( C Q C^T )^-1 C, fp = P ( f - K u0 ), u0 = C^T ( C Q C^T )^-1 g, u = u0 + u'
  • Precondition with anything (ie, constraint oblivious); a sketch follows below
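
A minimal sketch of the projection method with Q = I (an illustrative choice), using a dense least-squares solve in place of the Krylov method:

    import numpy as np

    def kkt_projection_solve(K, C, f, g):
        """Projection method with Q = I: solve P^T K P u' = fp in null(C),
        then u = u0 + u' (dense lstsq stands in for a Krylov method)."""
        n = K.shape[0]
        CCt = C @ C.T                                  # C Q C^T with Q = I
        P = np.eye(n) - C.T @ np.linalg.solve(CCt, C)  # projector onto null(C)
        u0 = C.T @ np.linalg.solve(CCt, g)             # particular soln, C u0 = g
        fp = P @ (f - K @ u0)
        u_prime, *_ = np.linalg.lstsq(P.T @ K @ P, fp, rcond=None)
        return u0 + P @ u_prime                        # u satisfies C u = g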

SLIDE 59

Motivation for Y = P P^T

2D model problem, non-aggressive vs. aggressive primal coarsening. With one Lagrange multiplier per fine-grid node, C has a single unit entry per row, so

    C C^T = I

which carries no connectivity: plain aggregation of the constraints has no graph to work with. Using the primal prolongator instead,

    T = C P P^T C^T

couples exactly those constraints whose nodes land in the same primal aggregate (with P aggregating six nodes into one coarse function, the corresponding 6x6 block of T is all ones), so the constraint aggregation follows the primal coarsening.