[PPT] - Parallel Numerical Algorithms Chapter 3 Dense Linear Systems PowerPoint Presentation

SLIDE 1

LU Factorization Parallel Algorithms for LU Partial Pivoting

Parallel Numerical Algorithms

Chapter 3 – Dense Linear Systems Section 3.2 – LU Factorization Michael T. Heath and Edgar Solomonik

Department of Computer Science University of Illinois at Urbana-Champaign

CS 554 / CSE 512

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 1 / 44

SLIDE 2

LU Factorization Parallel Algorithms for LU Partial Pivoting

Outline

1

LU Factorization Motivation Gaussian Elimination

2

Parallel Algorithms for LU Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

3

Partial Pivoting

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 2 / 44

SLIDE 3

LU Factorization Parallel Algorithms for LU Partial Pivoting Motivation Gaussian Elimination

LU Factorization

System of linear algebraic equations has form Ax = b where A is given n × n matrix, b is given n-vector, and x is unknown solution n-vector to be computed Direct method for solving general linear system is by computing LU factorization A = LU where L is unit lower triangular and U is upper triangular

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 3 / 44

SLIDE 4

LU Factorization Parallel Algorithms for LU Partial Pivoting Motivation Gaussian Elimination

LU Factorization

System Ax = b then becomes LUx = b Solve lower triangular system Ly = b by forward-substitution to obtain vector y Finally, solve upper triangular system Ux = y by back-substitution to obtain solution x to original system

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 4 / 44

SLIDE 5

LU Factorization Parallel Algorithms for LU Partial Pivoting Motivation Gaussian Elimination

Gaussian Elimination Algorithm

LU factorization can be computed by Gaussian elimination as follows, where U overwrites A for k = 1 to n − 1 for i = k + 1 to n ℓik = aik/akk end for j = k + 1 to n for i = k + 1 to n aij = aij − ℓikakj end end end { loop over columns } { compute multipliers for current column } { apply transformation to remaining submatrix }

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 5 / 44

SLIDE 6

LU Factorization Parallel Algorithms for LU Partial Pivoting Motivation Gaussian Elimination

Gaussian Elimination Algorithm

In general, row interchanges (pivoting) may be required to ensure existence of LU factorization and numerical stability

f Gaussian elimination algorithm, but for simplicity we

temporarily ignore this issue Gaussian elimination requires about n3/3 paired additions and multiplications, so model serial time as T1 = γ n3/3 where γ is time required for multiply-add operation About n2/2 divisions also required, but we ignore this lower-order term

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 6 / 44

SLIDE 7

LU Factorization Parallel Algorithms for LU Partial Pivoting Motivation Gaussian Elimination

Loop Orderings for Gaussian Elimination

Gaussian elimination has general form of triple-nested loop in which entries of L and U overwrite those of A for for for aij = aij − (aik/akk) akj end end end Indices i, j, and k of for loops can be taken in any order, for total of 3! = 6 different ways of arranging loops

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 7 / 44

SLIDE 8

LU Factorization Parallel Algorithms for LU Partial Pivoting Motivation Gaussian Elimination

Loop Orderings for Gaussian Elimination

Different loop orders have different memory access patterns, which may cause their performance to vary widely Right-looking orderings (loop over k is outermost) perform updates to the trailing matrix (update all aij for i, j ≥ k) eagerly Left-looking orderings (loop over k is innermost) update the trailing matrix lazily (updates to aij done only when all entries ai′j′ with min(i′, j′) < min(i, j) have been updated) Right-looking ordering achieve better read-locality (the same divisor and outer-product vectors are reused) Left-looking ordering achieve better write-locality (entries

f A may be changed in memory only once)

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 8 / 44

SLIDE 9

LU Factorization Parallel Algorithms for LU Partial Pivoting Motivation Gaussian Elimination

Gaussian Elimination Algorithm

Right-looking form of Gaussian elimination for k = 1 to n − 1 for i = k + 1 to n ℓik = aik/akk end for j = k + 1 to n for i = k + 1 to n aij = aij − ℓik akj end end end Multipliers ℓik computed outside inner loop for greater efficiency

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 9 / 44

SLIDE 10

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Parallel Algorithm

Partition For i, j = 1, . . . , n, fine-grain task (i, j) stores aij and computes and stores uij, if i ≤ j ℓij, if i > j yielding 2-D array of n2 fine-grain tasks Communicate Broadcast entries of A vertically to tasks below Broadcast entries of L horizontally to tasks to right

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 10 / 44

SLIDE 11

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Fine-Grain Tasks and Communication

a11 u11 a12 u12 a21 ℓ21 a22 u22 a13 u13 a14 u14 a23 u23 a24 u24 a31 ℓ31 a32 ℓ32 a41 ℓ41 a42 ℓ42 a33 u33 a34 u34 a43 ℓ43 a44 u44 a51 ℓ51 a52 ℓ52 a61 ℓ61 a62 ℓ62 a53 ℓ53 a54 ℓ54 a63 ℓ63 a64 ℓ64 a15 u15 a16 u16 a25 u25 a26 u26 a35 u35 a36 u36 a45 u45 a46 u46 a55 u55 a56 u56 a65 ℓ65 a66 u66

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 11 / 44

SLIDE 12

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Fine-Grain Parallel Algorithm

for k = 1 to min(i, j) − 1 recv broadcast of akj from task (k, j) recv broadcast of ℓik from task (i, k) aij = aij − ℓik akj end if i ≤ j then broadcast aij to tasks (k, j), k = i + 1, . . . , n else recv broadcast of ajj from task (j, j) ℓij = aij/ajj broadcast ℓij to tasks (i, k), k = j + 1, . . . , n end { vert bcast } { horiz bcast } { update entry } { vert bcast } { vert bcast } { multiplier } { horiz bcast }

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 12 / 44

SLIDE 13

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Agglomeration

Agglomerate With n × n array of fine-grain tasks, natural strategies are 2-D: combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)2 coarse-grain tasks 1-D column: combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks 1-D row: combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 13 / 44

SLIDE 14

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

2-D Agglomeration

a11 u11 a12 u12 a21 ℓ21 a22 u22 a13 u13 a14 u14 a23 u23 a24 u24 a31 ℓ31 a32 ℓ32 a41 ℓ41 a42 ℓ42 a33 u33 a34 u34 a43 ℓ43 a44 u44 a51 ℓ51 a52 ℓ52 a61 ℓ61 a62 ℓ62 a53 ℓ53 a54 ℓ54 a63 ℓ63 a64 ℓ64 a15 u15 a16 u16 a25 u25 a26 u26 a35 u35 a36 u36 a45 u45 a46 u46 a55 u55 a56 u56 a65 ℓ65 a66 u66

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 14 / 44

SLIDE 15

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Blocked LU factorization

A

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 15 / 44

SLIDE 16

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Blocked LU factorization

L₀₀ U₀₀

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 16 / 44

SLIDE 17

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Blocked LU factorization

L U

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 17 / 44

SLIDE 18

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Blocked LU factorization

L U S=A-LU

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 18 / 44

SLIDE 19

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Coarse-Grain 2-D Parallel Algorithm

for k = 1 to n − 1 broadcast {akj : j ∈ mycols, j ≥ k} in processor column if k ∈ mycols then for i ∈ myrows, i > k ℓik = aik/akk { multipliers } end end broadcast {ℓik : i ∈ myrows, i > k} in processor row for j ∈ mycols, j > k for i ∈ myrows, i > k, aij = aij − ℓik akj { update } end end end

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 19 / 44

SLIDE 20

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

1-D Column Agglomeration

a11 u11 a12 u12 a21 ℓ21 a22 u22 a13 u13 a14 u14 a23 u23 a24 u24 a31 ℓ31 a32 ℓ32 a41 ℓ41 a42 ℓ42 a33 u33 a34 u34 a43 ℓ43 a44 u44 a51 ℓ51 a52 ℓ52 a61 ℓ61 a62 ℓ62 a53 ℓ53 a54 ℓ54 a63 ℓ63 a64 ℓ64 a15 u15 a16 u16 a25 u25 a26 u26 a35 u35 a36 u36 a45 u45 a46 u46 a55 u55 a56 u56 a65 ℓ65 a66 u66

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 20 / 44

SLIDE 21

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

1-D Row Agglomeration

a11 u11 a12 u12 a21 ℓ21 a22 u22 a13 u13 a14 u14 a23 u23 a24 u24 a31 ℓ31 a32 ℓ32 a41 ℓ41 a42 ℓ42 a33 u33 a34 u34 a43 ℓ43 a44 u44 a51 ℓ51 a52 ℓ52 a61 ℓ61 a62 ℓ62 a53 ℓ53 a54 ℓ54 a63 ℓ63 a64 ℓ64 a15 u15 a16 u16 a25 u25 a26 u26 a35 u35 a36 u36 a45 u45 a46 u46 a55 u55 a56 u56 a65 ℓ65 a66 u66

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 21 / 44

SLIDE 22

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Mapping

Map 2-D: assign (n/k)2/p coarse-grain tasks to each of p processors treating target network as 2-D mesh, using

blocked mapping (aggregating into larger blocks) cyclic mapping of blocks, yielding block-cyclic layout

1-D: assign n/p coarse-grain tasks to each of p processors treating target network as 1-D mesh, using

blocked mapping (aggregating into panels) cyclic mapping of rows/cols, yielding row-cyclic or column-cyclic layout

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 22 / 44

SLIDE 23

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

1-D Column Agglomeration with Cyclic Mapping

a11 u11 a21 ℓ21 a31 ℓ31 a41 ℓ41 a51 ℓ51 a61 ℓ61 a12 u12 a22 u22 a32 ℓ32 a42 ℓ42 a52 ℓ52 a62 ℓ62 a13 u13 a23 u23 a33 u33 a43 ℓ43 a53 ℓ53 a63 ℓ63 a14 u14 a24 u24 a34 u34 a44 u44 a54 ℓ54 a64 ℓ64 a15 u15 a25 u25 a35 u35 a45 u45 a55 u55 a65 ℓ65 a16 u16 a26 u26 a36 u36 a46 u46 a56 u56 a66 u66

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 23 / 44

SLIDE 24

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

1-D Column Agglomeration

Matrix rows need not be broadcast vertically, since any given column is contained entirely in only one process But there is no parallelism in computing multipliers or updating any given column Horizontal broadcasts still required to communicate multipliers for updating

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 24 / 44

SLIDE 25

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Coarse-Grain 1-D Column Parallel Algorithm

for k = 1 to n − 1 if k ∈ mycols then for i = k + 1 to n ℓik = aik/akk end end broadcast {ℓik : k < i ≤ n} for j ∈ mycols, j > k for i = k + 1 to n aij = aij − ℓik akj end end end { multipliers } { broadcast } { update }

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 25 / 44

SLIDE 26

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

1-D Row Agglomeration with Cyclic Mapping

a11 u11 a12 u12 a13 u13 a14 u14 a15 u15 a16 u16 a21 ℓ21 a22 u22 a23 u23 a24 u24 a25 u25 a26 u26 a31 ℓ31 a32 ℓ32 a33 u33 a34 u34 a35 u35 a36 u36 a41 ℓ41 a42 ℓ42 a43 ℓ43 a44 u44 a45 u45 a46 u46 a51 ℓ51 a52 ℓ52 a53 ℓ53 a54 ℓ54 a55 u55 a56 u56 a61 ℓ61 a62 ℓ62 a63 ℓ63 a64 ℓ64 a65 ℓ65 a66 u66

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 26 / 44

SLIDE 27

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

1-D Row Agglomeration

Multipliers need not be broadcast horizontally, since any given matrix row is contained entirely in only one process But there is no parallelism in updating any given row Vertical broadcasts still required to communicate each row

f matrix to processors below it for updating

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 27 / 44

SLIDE 28

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Coarse-Grain 1-D Row Parallel Algorithm

for k = 1 to n − 1 broadcast {akj : k ≤ j ≤ n} for i ∈ myrows, i > k, ℓik = aik/akk end for j = k + 1 to n for i ∈ myrows, i > k, aij = aij − ℓik akj end end end { broadcast } { multipliers } { update }

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 28 / 44

SLIDE 29

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Block-Cyclic LU Factorization

8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 29 / 44

SLIDE 30

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Block-Cyclic LU Factorization

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 30 / 44

SLIDE 31

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Block-Cyclic LU Factorization

L U

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 31 / 44

SLIDE 32

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Block-Cyclic LU Factorization

L U S=A-LU

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 32 / 44

SLIDE 33

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Performance Enhancements

Each processor becomes idle as soon as its last row and column are completed With block mapping, in which each processor holds contiguous block of rows and columns, some processors become idle long before overall computation is complete Block mapping also yields unbalanced load, as computing multipliers and updates requires successively less work with increasing row and column numbers Cyclic or reflection mapping improves both concurrency and load balance

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 33 / 44

SLIDE 34

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Performance Enhancements

Performance can also be enhanced by overlapping communication and computation At step k, each processor completes updating its portion of remaining unreduced submatrix before moving on to step k + 1 Broadcast of each segment of row k + 1, and computation and broadcast of each segment of multipliers for step k + 1, could be initiated as soon as relevant segments of row k + 1 and column k + 1 have been updated by their owners, before completing remainder of their updating for step k This look-ahead strategy enables other processors to start working on next step earlier than they otherwise could

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 34 / 44

SLIDE 35

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Execution Time for 1-D Agglomeration

With 1-D column agglomeration, each processor factorizes panels of b columns, then broadcasts them to perform the trailing matrix update While work-efficient Wp = Θ(n3), the concurrency in computational cost is constrained by panel factorization Fp(n, b) = Θ((n/b)nb2 + n3/p) so we need b < n/p to maintain Fp(n, b) = Θ(n3/p) The overall execution time is given by Tp(n, b) = Θ

(n/b)T bcast

p

(nb) + γFp(n, b)

It is generally minimized by picking b = Θ(n/p)

Tp(n, b) = Θ(αp log p + βn2 + γn3/p)

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 35 / 44

SLIDE 36

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Execution Time for 2-D Agglomeration

With 2-D agglomeration and block-cyclic mapping, a processor factorizes a b × b diagonal block, broadcasts it to a column and row of processors, which update the panels and broadcast them to perform the trailing matrix updates The computational cost is constrained by lack of concurrency in the diagonal Fp(n, b) = O(n3/p + nb2 + n2b/√p) The overall execution time is given by Tp(n, b) = Θ

(n/b)(T bcast

√p

(b2)+T bcast

√p

(nb/√p))+γFp(n, b)

It is generally minimized by picking b = n/√p

Tp(n) = Tp(n, n/√p) = Θ(α√p log p + βn2/√p + γn3/p)

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 36 / 44

SLIDE 37

LU Factorization Parallel Algorithms for LU Partial Pivoting Fine-Grain Algorithm Agglomeration Schemes Mapping Schemes Scalability

Scalability for 2-D Agglomeration

Cannon’s algorithm for matrix multiplication (2-D agglomeration), could achieve strong scaling speed-up ps = O((γ/α)n2) and unconditional weak scaling The SUMMA algorithm, which was based on broadcasts, achieved slightly inferior scaling due to a Θ(log(p)) term on the latency cost The execution time of 2-D agglomeration for LU is the same as of SUMMA, so the efficiency and scaling characteristics are the same On the other hand, it is not possible to achieve strong scaling to O((γ/α)n3/ log(n)) processors as the depth of the usual LU algorithm is D = n, meaning the maximum speed-up is ps = Θ(maxp Sp) = O(Q1/D) = O(n2)

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 37 / 44

SLIDE 38

LU Factorization Parallel Algorithms for LU Partial Pivoting

Partial Pivoting

Row ordering of A is irrelevant in system of linear equations Partial pivoting takes rows in order of largest entry in magnitude of leading column of remaining unreduced matrix This choice ensures that multipliers do not exceed 1 in magnitude, which reduces amplification of rounding errors In general, partial pivoting is required to ensure existence and numerical stability of LU factorization

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 38 / 44

SLIDE 39

LU Factorization Parallel Algorithms for LU Partial Pivoting

Partial Pivoting

Partial pivoting yields factorization of form P A = LU where P is permutation matrix If P A = LU, then system Ax = b becomes P Ax = LUx = P b which can be solved by forward-substitution in lower triangular system Ly = P b, followed by back-substitution in upper triangular system Ux = y

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 39 / 44

SLIDE 40

LU Factorization Parallel Algorithms for LU Partial Pivoting

Parallel Partial Pivoting

Partial pivoting complicates parallel implementation of Gaussian elimination and significantly affects potential performance With 2-D algorithm, pivot search is parallel but requires communication within processor column (S = Ω(n log(p))) and inhibits overlap With 1-D column algorithm, pivot search requires no communication but is purely serial Once pivot is found, index of pivot row must be communicated to other processors, and rows must be explicitly or implicitly interchanged in each process

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 40 / 44

SLIDE 41

LU Factorization Parallel Algorithms for LU Partial Pivoting

Alternatives to Partial Pivoting

Because of negative effects of partial pivoting on parallel performance, various alternatives have been proposed that limit pivot search

tournament pivoting (perform tree of partial pivoting on different subsets of matrix rows, selecting b at a time) threshold pivoting (use local rows as pivots if the diagonal entries are within threshold of column norm) pairwise pivoting (eliminate n(n − 1)/2 entries by as many 2-by-2 transformations LiPi, where Li is unit-lower triangular and Pi is a permutation matrix, applied to appropriate row pairs)

Stability generally slightly worse in theory and for particularly hard test-cases Better stability without worrying about pivoting may be achieved via QR factorization

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 41 / 44

SLIDE 42

LU Factorization Parallel Algorithms for LU Partial Pivoting

Communication vs. Memory Tradeoff

If explicit replication of storage is allowed, then lower communication volume is possible As with matrix multiplication, algorithms that leverage all available memory to reduce communication cost to the maximum extent possible If sufficient memory is avaiable, then these algorithms can achieve provably optimal communication

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 42 / 44

SLIDE 43

LU Factorization Parallel Algorithms for LU Partial Pivoting

References

J. W. Demmel, M. T. Heath, and H. A. van der Vorst,

Parallel numerical linear algebra, Acta Numerica 2:111-197, 1993

G. A. Geist and C. H. Romine, LU factorization algorithms
n distributed-memory multiprocessor architectures, SIAM
J. Sci. Stat. Comput. 9:639-649, 1988
L. Grigori, J. Demmel, and H. Xiang, CALU: A

communication optimal LU factorization algorithm, SIAM J. Matrix Anal. Appl. 32:1317-1350, 2011

B. A. Hendrickson and D. E. Womble, The torus-wrap

mapping for dense matrix calculations on massively parallel computers, SIAM J. Sci. Stat. Comput. 15:1201-1226, 1994

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 43 / 44

SLIDE 44

LU Factorization Parallel Algorithms for LU Partial Pivoting

References

J. M. Ortega, Introduction to Parallel and Vector Solution of

Linear Systems, Plenum Press, 1988

J. M. Ortega and C. H. Romine, The ijk forms of

factorization methods II: parallel systems, Parallel Comput. 7:149-162, 1988

Y. Robert, The Impact of Vector and Parallel Architectures
n the Gaussian Elimination Algorithm, Wiley, 1990
S. A. Vavasis, Gaussian elimination with pivoting is

P-complete, SIAM J. Disc. Math. 2:413-423, 1989

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 44 / 44