SLIDE 1


Parallel Numerical Algorithms

Chapter 3 – Dense Linear Systems
Section 3.1 – Vector and Matrix Products

Michael T. Heath and Edgar Solomonik

Department of Computer Science University of Illinois at Urbana-Champaign

CS 554 / CSE 512

SLIDE 2

Outline

1. BLAS
2. Inner Product
3. Outer Product
4. Matrix-Vector Product
5. Matrix-Matrix Product

SLIDE 3

Basic Linear Algebra Subprograms

Basic Linear Algebra Subprograms (BLAS) are building blocks for many other matrix computations
BLAS encapsulate basic operations on vectors and matrices so they can be optimized for a particular computer architecture while the high-level routines that call them remain portable
BLAS offer good opportunities for optimizing utilization of the memory hierarchy
Generic BLAS are available from netlib, and many computer vendors provide custom versions optimized for their particular systems

SLIDE 4

Examples of BLAS

Level  Work   Examples  Function
1      O(n)   daxpy     Scalar × vector + vector
               ddot      Inner product
               dnrm2     Euclidean vector norm
2      O(n²)  dgemv     Matrix-vector product
               dtrsv     Triangular solve
               dger      Outer product
3      O(n³)  dgemm     Matrix-matrix product
               dtrsm     Multiple triangular solves
               dsyrk     Symmetric rank-k update

γ1 > γ2 ≫ γ3, where γi denotes the effective sec/flop achieved by BLAS level i
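For illustration, these routines can be invoked from Python through SciPy's BLAS wrappers (a minimal sketch, assuming NumPy and SciPy are installed; the dimension n = 1000 is arbitrary):

import numpy as np
from scipy.linalg import blas

n = 1000
x, y = np.random.rand(n), np.random.rand(n)
A, B = np.random.rand(n, n), np.random.rand(n, n)

z = blas.daxpy(x, y, a=2.0)   # level 1: 2.0*x + y, O(n) work
s = blas.ddot(x, y)           # level 1: inner product
r = blas.dnrm2(x)             # level 1: Euclidean norm
v = blas.dgemv(1.0, A, x)     # level 2: A @ x, O(n^2) work
C = blas.dgemm(1.0, A, B)     # level 3: A @ B, O(n^3) work

The γ1 > γ2 ≫ γ3 ordering reflects memory-hierarchy reuse: dgemm performs Θ(n³) flops on Θ(n²) data, so operands can be reused many times from cache, whereas level-1 routines touch each operand only once.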

SLIDE 5

Inner Product

Inner product of two n-vectors x and y given by

    xᵀy = Σᵢ₌₁ⁿ xᵢ yᵢ

Computation of inner product requires n multiplications and n − 1 additions

    M1 = Θ(n), Q1 = Θ(n), T1 = Θ(γn)

Effectively as hard as scalar reduction; can be done via binary or binomial tree summation
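As a serial illustration of the tree-summation idea (a sketch, not the course's reference code), the n products can be combined pairwise in ⌈log₂ n⌉ rounds, the same combining pattern a parallel reduction uses across tasks:

import numpy as np

def tree_sum(v):
    # Pairwise (binary-tree) summation in ceil(log2(n)) rounds:
    # the round with stride s adds element i+s into element i.
    v = list(v)
    stride = 1
    while stride < len(v):
        for i in range(0, len(v) - stride, 2 * stride):
            v[i] += v[i + stride]
        stride *= 2
    return v[0]

x, y = np.random.rand(9), np.random.rand(9)
assert np.isclose(tree_sum(x * y), np.dot(x, y))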

SLIDE 6

Parallel Algorithm

Partition
  For i = 1, . . . , n, fine-grain task i stores xi and yi, and computes their product xi yi
Communicate
  Sum reduction over n fine-grain tasks

[Diagram: n = 9 fine-grain tasks, task i holding xi and yi]

SLIDE 7

Fine-Grain Parallel Algorithm

zi = xi yi { local scalar product }
reduce zi across all tasks i = 1, . . . , n { sum reduction }

SLIDE 8

Agglomeration and Mapping

Agglomerate
  Combine k components of both x and y to form each coarse-grain task, which computes inner product of these subvectors
  Communication becomes sum reduction over n/k coarse-grain tasks
Map
  Assign (n/k)/p coarse-grain tasks to each of p processors, for total of n/p components of x and y per processor

[Diagram: nine fine-grain tasks agglomerated into three coarse-grain tasks, each summing its local products]

SLIDE 9

Coarse-Grain Parallel Algorithm

zi = x[i]ᵀ y[i] { local inner product }
reduce zi across all processors i = 1, . . . , p { sum reduction }

  • x[i] – subvector of x assigned to processor i
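A minimal mpi4py rendering of this coarse-grain algorithm (a sketch assuming p divides n and mpi4py is installed; the root scatters x and y only to set up the demo, whereas in practice the subvectors would already be distributed):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
n = 1024  # assume p divides n

x = np.random.rand(n) if rank == 0 else None
y = np.random.rand(n) if rank == 0 else None
xloc, yloc = np.empty(n // p), np.empty(n // p)
comm.Scatter(x, xloc, root=0)   # distribute n/p components each
comm.Scatter(y, yloc, root=0)

z = np.dot(xloc, yloc)               # local inner product
s = comm.allreduce(z, op=MPI.SUM)    # sum reduction across processors

Run with, e.g., mpiexec -n 4 python inner.py (the file name is hypothetical); MPI reductions are typically tree-based, matching the Θ(α log p) term in the analysis below.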

SLIDE 10

Performance

The parallel costs (Lp, Wp, Fp) for the inner product are given by
Computational cost Fp = Θ(n/p) regardless of network
The latency and bandwidth costs depend on network:

    1-D mesh: Lp, Wp = Θ(p)
    2-D mesh: Lp, Wp = Θ(√p)
    hypercube: Lp, Wp = Θ(log p)

For a hypercube or fully-connected network, time is

    Tp = αLp + βWp + γFp = Θ(α log(p) + γn/p)

Efficiency and scaling are the same as for binary tree sum

SLIDE 11

Inner product on 1-D Mesh

For 1-D mesh, total time is Tp = Θ(γn/p + αp)

To determine strong scalability, we set constant efficiency and solve for ps:

    const = Eps = T1/(ps·Tps) = Θ(γn/(γn + αps²)) = Θ(1/(1 + (α/γ)ps²/n))

which yields ps = Θ(√((γ/α)n))

1-D mesh weakly scalable to pw = Θ((γ/α)n) processors:

    Epw(pw·n) = Θ(1/(1 + (α/γ)pw²/(pw·n))) = Θ(1/(1 + (α/γ)pw/n))

SLIDE 12

Inner product on 2-D Mesh

For 2-D mesh, total time is Tp = Θ(γn/p + α√p)

To determine strong scalability, we set constant efficiency and solve for ps:

    const = Eps = T1/(ps·Tps) = Θ(γn/(γn + αps^{3/2})) = Θ(1/(1 + (α/γ)ps^{3/2}/n))

which yields ps = Θ((γ/α)^{2/3}n^{2/3})

2-D mesh weakly scalable to pw = Θ((γ/α)²n²), since

    Epw(pw·n) = Θ(1/(1 + (α/γ)pw^{3/2}/(pw·n))) = Θ(1/(1 + (α/γ)√pw/n))

SLIDE 13

Outer Product

Outer product of two n-vectors x and y is the n × n matrix Z = xyᵀ whose (i, j) entry is zij = xi yj

For example,

    [x1]                 [x1y1  x1y2  x1y3]
    [x2] [y1 y2 y3]  =   [x2y1  x2y2  x2y3]
    [x3]                 [x3y1  x3y2  x3y3]

Computation of outer product requires n² multiplications, M1 = Θ(n²), Q1 = Θ(n²), T1 = Θ(γn²) (in this case, we should treat M1 as the output size, or define the problem as in the BLAS: Z = Zinput + xyᵀ)

SLIDE 14

Parallel Algorithm

Partition
  For i, j = 1, . . . , n, fine-grain task (i, j) computes and stores zij = xi yj, yielding 2-D array of n² fine-grain tasks
  Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (i, j) stores xi and task (j, i) stores yi, or
    task (i, i) stores both xi and yi, i = 1, . . . , n
Communicate
  For i = 1, . . . , n, task that stores xi broadcasts it to all other tasks in ith task row
  For j = 1, . . . , n, task that stores yj broadcasts it to all other tasks in jth task column

SLIDE 15

Fine-Grain Tasks and Communication

[Diagram: 6 × 6 grid of fine-grain tasks, task (i, j) computing xi yj]

SLIDE 16

Fine-Grain Parallel Algorithm

broadcast xi to tasks (i, k), k = 1, . . . , n { horizontal broadcast }
broadcast yj to tasks (k, j), k = 1, . . . , n { vertical broadcast }
zij = xi yj { local scalar product }

SLIDE 17

Agglomeration

Agglomerate
  With n × n array of fine-grain tasks, natural strategies are
    2-D: Combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)² coarse-grain tasks
    1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks
    1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks

SLIDE 18

2-D Agglomeration

Each task that stores portion of x must broadcast its subvector to all other tasks in its task row
Each task that stores portion of y must broadcast its subvector to all other tasks in its task column

SLIDE 19

2-D Agglomeration

[Diagram: 2-D agglomeration of the outer-product task grid]

SLIDE 20

1-D Agglomeration

If either x or y stored in one task, then broadcast required to communicate needed values to all other tasks
If either x or y distributed across tasks, then multinode broadcast required to communicate needed values to other tasks

SLIDE 21

1-D Column Agglomeration

[Diagram: 1-D column agglomeration of the outer-product task grid]

SLIDE 22

1-D Row Agglomeration

[Diagram: 1-D row agglomeration of the outer-product task grid]

SLIDE 23

Mapping

Map
  2-D: Assign (n/k)²/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh
  1-D: Assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh

SLIDE 24

2-D Agglomeration with Block Mapping

[Diagram: 2-D agglomeration with block mapping onto a processor grid]

SLIDE 25

1-D Column Agglomeration with Block Mapping

[Diagram: 1-D column agglomeration with block mapping]

SLIDE 26

1-D Row Agglomeration with Block Mapping

[Diagram: 1-D row agglomeration with block mapping]

SLIDE 27

Coarse-Grain Parallel Algorithm

broadcast x[i] to ith process row { horizontal broadcast }
broadcast y[j] to jth process column { vertical broadcast }
Z[i][j] = x[i] y[j]ᵀ { local outer product }

  • Z[i][j] means submatrix of Z assigned to process (i, j) by mapping
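A sketch of this algorithm in mpi4py on a √p × √p process grid (assumptions: p is a perfect square, √p divides n, and the subvectors start on the diagonal process (i, i); comm.Split builds the row and column communicators for the two broadcasts):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
q = int(round(np.sqrt(p)))        # q x q process grid
i, j = divmod(rank, q)
row = comm.Split(color=i, key=j)  # all processes in process row i
col = comm.Split(color=j, key=i)  # all processes in process column j

n = 1024
b = n // q                        # each process owns a b x b block of Z
xi = np.random.rand(b) if i == j else np.empty(b)
yj = np.random.rand(b) if i == j else np.empty(b)
row.Bcast(xi, root=i)             # horizontal broadcast of x[i] from (i, i)
col.Bcast(yj, root=j)             # vertical broadcast of y[j] from (j, j)
Z = np.outer(xi, yj)              # local outer product Z[i][j]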

SLIDE 28

Performance

The parallel costs (Lp, Wp, Fp) for the outer product are
Computational cost Fp = Θ(n²/p) regardless of network
The latency and bandwidth costs can be derived from the cost of broadcast/allgather:

    1-D agglomeration: Lp = Θ(log p), Wp = Θ(n)
    2-D agglomeration: Lp = Θ(log p), Wp = Θ(n/√p)

For 1-D agglomeration, execution time is

    T_p^{1-D} = T_p^{allgather}(n) + Θ(γn²/p) = Θ(α log(p) + βn + γn²/p)

For 2-D agglomeration, execution time is

    T_p^{2-D} = 2 T_{√p}^{bcast}(n/√p) + Θ(γn²/p) = Θ(α log(p) + βn/√p + γn²/p)

SLIDE 29

Outer Product Strong Scaling

1-D agglomeration is strongly scalable to ps = Θ(min((γ/α)n²/log((γ/α)n²), (γ/β)n)) processors, since

    E_{ps}^{1-D} = Θ(1/(1 + (α/γ)log(ps)ps/n² + (β/γ)ps/n))

2-D agglomeration is strongly scalable to ps = Θ(min((γ/α)n²/log((γ/α)n²), (γ/β)²n²)) processors, since

    E_{ps}^{2-D} = Θ(1/(1 + (α/γ)log(ps)ps/n² + (β/γ)√ps/n))

SLIDE 30

Outer Product Weak Scaling

1-D agglomeration is weakly scalable to pw = Θ(min(2^{(γ/α)n²}, (γ/β)²n²)) processors, since

    E_{pw}^{1-D}(√pw·n) = Θ(1/(1 + (α/γ)log(pw)/n² + (β/γ)√pw/n))

2-D agglomeration is weakly scalable to pw = Θ(2^{(γ/α)n²}) processors, since

    E_{pw}^{2-D}(√pw·n) = Θ(1/(1 + (α/γ)log(pw)/n² + (β/γ)/n))

SLIDE 31

Memory Requirements

The memory requirements are dominated by storing Z, which in practice means the outer product is a poor primitive (local flop-to-byte ratio of 1)
If possible, Z should be operated on as it is computed, e.g. if we really need vi = Σⱼ f(xi yj) for some scalar function f
If Z does not need to be stored, vector storage dominates:
  Without further modification, 1-D algorithm requires Mp = Θ(np) overall storage for vector
  Without further modification, 2-D algorithm requires Mp = Θ(n√p) overall storage for vector
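For instance, a serial sketch of this pattern, computing vi = Σⱼ f(xi yj) one row of Z at a time so only O(n) extra storage is used (f = np.sin is an arbitrary placeholder):

import numpy as np

def outer_apply_sum(x, y, f):
    # v[i] = sum_j f(x[i]*y[j]); each iteration forms one length-n
    # row of Z = x y^T and immediately reduces it.
    v = np.empty_like(x)
    for i, xi in enumerate(x):
        v[i] = f(xi * y).sum()
    return v

x, y = np.random.rand(2048), np.random.rand(2048)
v = outer_apply_sum(x, y, np.sin)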

SLIDE 32

Matrix-Vector Product

Consider matrix-vector product y = Ax, where A is n × n matrix and x and y are n-vectors
Components of vector y are given by inner products

    yi = Σⱼ₌₁ⁿ aij xj, i = 1, . . . , n

The sequential memory, work, and time are M1 = Θ(n²), Q1 = Θ(n²), T1 = Θ(γn²)

SLIDE 33

Parallel Algorithm

Partition
  For i, j = 1, . . . , n, fine-grain task (i, j) stores aij and computes aij xj, yielding 2-D array of n² fine-grain tasks
  Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
    for some j, task (j, i) stores xi and task (i, j) stores yi, or
    task (i, i) stores both xi and yi, i = 1, . . . , n
Communicate
  For j = 1, . . . , n, task that stores xj broadcasts it to all other tasks in jth task column
  For i = 1, . . . , n, sum reduction over ith task row gives yi

SLIDE 34

Fine-Grain Tasks and Communication

[Diagram: 6 × 6 grid of fine-grain tasks, task (i, j) computing aij xj, with yi accumulated on the diagonal]

SLIDE 35

Fine-Grain Parallel Algorithm

broadcast xj to tasks (k, j), k = 1, . . . , n { vertical broadcast }
yi = aij xj { local scalar product }
reduce yi across tasks (i, k), k = 1, . . . , n { horizontal sum reduction }

SLIDE 36

Agglomeration

Agglomerate
  With n × n array of fine-grain tasks, natural strategies are
    2-D: Combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)² coarse-grain tasks
    1-D column: Combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks
    1-D row: Combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks

SLIDE 37

2-D Agglomeration

Subvector of x broadcast along each task column
Each task computes local matrix-vector product of submatrix of A with subvector of x
Sum reduction along each task row produces subvector of result y

SLIDE 38

2-D Agglomeration

[Diagram: 2-D agglomeration of the matrix-vector task grid]

SLIDE 39

1-D Agglomeration

1-D column agglomeration
  Each task computes product of its component of x times its column of matrix, with no communication required
  Sum reduction across tasks then produces y
1-D row agglomeration
  If x stored in one task, then broadcast required to communicate needed values to all other tasks
  If x distributed across tasks, then multinode broadcast required to communicate needed values to other tasks
  Each task computes inner product of its row of A with entire vector x to produce its component of y

SLIDE 40

1-D Column Agglomeration

[Diagram: 1-D column agglomeration of the matrix-vector task grid]

SLIDE 41

1-D Row Agglomeration

[Diagram: 1-D row agglomeration of the matrix-vector task grid]

SLIDE 42

1-D Agglomeration

Column and row algorithms are dual to each other
Column algorithm begins with communication-free local vector scaling (daxpy) computations, combined across processors by a reduction
Row algorithm begins with broadcast, followed by communication-free local inner-product (ddot) computations
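The duality is easy to check in NumPy: the column algorithm accumulates scaled columns (daxpy form), while the row algorithm takes one inner product per row (ddot form); both produce the same y (a small serial sketch):

import numpy as np

n = 6
A, x = np.random.rand(n, n), np.random.rand(n)

y_col = np.zeros(n)
for j in range(n):               # column form: y += x_j * A[:, j]
    y_col += x[j] * A[:, j]

y_row = np.array([np.dot(A[i, :], x) for i in range(n)])  # row form

assert np.allclose(y_col, y_row)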

SLIDE 43

Mapping

Map
  2-D: Assign (n/k)²/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh
  1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh

SLIDE 44

2-D Agglomeration with Block Mapping

[Diagram: 2-D agglomeration with block mapping onto a processor grid]

SLIDE 45

1-D Column Agglomeration with Block Mapping

[Diagram: 1-D column agglomeration with block mapping]

SLIDE 46

1-D Row Agglomeration with Block Mapping

[Diagram: 1-D row agglomeration with block mapping]

SLIDE 47

Coarse-Grain Parallel Algorithm

broadcast x[j] to jth process column { vertical broadcast }
y[i] = A[i][j] x[j] { local matrix-vector product }
reduce y[i] across ith process row { horizontal sum reduction }
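An mpi4py sketch of this 2-D algorithm (same assumptions as the outer-product sketch: √p × √p grid, √p divides n, and the subvectors live on diagonal processes):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
q = int(round(np.sqrt(p)))
i, j = divmod(rank, q)
row = comm.Split(color=i, key=j)
col = comm.Split(color=j, key=i)

n = 1024
b = n // q
Aij = np.random.rand(b, b)                  # local block A[i][j]
xj = np.random.rand(b) if i == j else np.empty(b)
col.Bcast(xj, root=j)                       # vertical broadcast of x[j]
yloc = Aij @ xj                             # local matrix-vector product
yi = np.empty(b) if j == i else None
row.Reduce(yloc, yi, op=MPI.SUM, root=i)    # horizontal sum reduction to (i, i)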

SLIDE 48

Performance

The parallel costs (Lp, Wp, Fp) for the matrix-vector product are
Computational cost Fp = Θ(n²/p) regardless of network
Communication costs can be derived from the cost of collectives:

    1-D agglomeration: Lp = Θ(log p), Wp = Θ(n)
    2-D agglomeration: Lp = Θ(log p), Wp = Θ(n/√p)

For 1-D row agglomeration, perform allgather:

    T_p^{1-D} = T_p^{allgather}(n) + Θ(γn²/p) = Θ(α log(p) + βn + γn²/p)

For 2-D agglomeration, perform broadcast and reduction:

    T_p^{2-D} = T_{√p}^{bcast}(n/√p) + T_{√p}^{reduce}(n/√p) + Θ(γn²/p) = Θ(α log(p) + βn/√p + γn²/p)

SLIDE 49

Matrix-Matrix Product

Consider matrix-matrix product C = AB, where A, B, and result C are n × n matrices
Entries of matrix C are given by

    cij = Σₖ₌₁ⁿ aik bkj, i, j = 1, . . . , n

Each of n² entries of C requires n multiply-add operations, so model serial time as T1 = γn³
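Spelled out, those n multiply-adds per entry are the innermost loop below (a pure-Python illustration checked against NumPy; n = 64 keeps the O(n³) loop fast):

import numpy as np

n = 64
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))
for i in range(n):              # n^2 entries of C...
    for j in range(n):
        for k in range(n):      # ...each a length-n multiply-add chain
            C[i, j] += A[i, k] * B[k, j]
assert np.allclose(C, A @ B)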

SLIDE 50

Matrix-Matrix Product

Matrix-matrix product can be viewed as

  n² inner products, or
  sum of n outer products, or
  n matrix-vector products

and each viewpoint yields a different algorithm
One way to derive parallel algorithms for matrix-matrix product is to apply parallel algorithms already developed for inner product, outer product, or matrix-vector product
However, considering the problem as a whole yields the best algorithms

SLIDE 51

Parallel Algorithm

Partition
  For i, j, k = 1, . . . , n, fine-grain task (i, j, k) computes product aik bkj, yielding 3-D array of n³ fine-grain tasks
  Assuming no replication of data, at most 3n² fine-grain tasks store entries of A, B, or C, say task (i, j, j) stores aij, task (i, j, i) stores bij, and task (i, j, k) stores cij for i, j = 1, . . . , n and some fixed k
We refer to subsets of tasks along i, j, and k dimensions as rows, columns, and layers, respectively, so kth column of A and kth row of B are stored in kth layer of tasks

SLIDE 52

Parallel Algorithm

Communicate
  Broadcast entries of jth column of A horizontally along each task row in jth layer
  Broadcast entries of ith row of B vertically along each task column in ith layer
  For i, j = 1, . . . , n, result cij is given by sum reduction over tasks (i, j, k), k = 1, . . . , n

SLIDE 53

Fine-Grain Algorithm

broadcast aik to tasks (i, q, k), q = 1, . . . , n { horizontal broadcast }
broadcast bkj to tasks (q, j, k), q = 1, . . . , n { vertical broadcast }
cij = aik bkj { local scalar product }
reduce cij across tasks (i, j, q), q = 1, . . . , n { lateral sum reduction }

SLIDE 54

Agglomeration

Agglomerate
  With n × n × n array of fine-grain tasks, natural strategies are
    3-D: Combine q × q × q subarray of fine-grain tasks
    2-D: Combine q × q × n subarray of fine-grain tasks, eliminating sum reductions
    1-D column: Combine n × 1 × n subarray of fine-grain tasks, eliminating vertical broadcasts and sum reductions
    1-D row: Combine 1 × n × n subarray of fine-grain tasks, eliminating horizontal broadcasts and sum reductions

SLIDE 55

Mapping

Map
  Corresponding mapping strategies are
    3-D: Assign (n/q)³/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 3-D mesh
    2-D: Assign (n/q)²/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh
    1-D: Assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh

SLIDE 56

Agglomeration with Block Mapping

[Diagram: block mappings produced by 1-D row, 1-D column, 2-D, and 3-D agglomeration]

SLIDE 57

Coarse-Grain 3-D Parallel Algorithm

broadcast A[i][k] to ith processor row { horizontal broadcast }
broadcast B[k][j] to jth processor column { vertical broadcast }
C[i][j] = A[i][k] B[k][j] { local matrix product }
reduce C[i][j] across processor layers { lateral sum reduction }

SLIDE 58

Agglomeration with Block Mapping

For a 2 × 2 blocking,

    [A11 A12] [B11 B12]   [A11 B11 + A12 B21   A11 B12 + A12 B22]
    [A21 A22] [B21 B22] = [A21 B11 + A22 B21   A21 B12 + A22 B22]

[Diagram: which blocks of A, B, and C each processor owns under 1-D column, 1-D row, and 2-D agglomeration]

SLIDE 59

Coarse-Grain 2-D Parallel Algorithm

allgather A[i][j] in ith processor row { horizontal broadcast }
allgather B[i][j] in jth processor column { vertical broadcast }
C[i][j] = 0
for k = 1, . . . , √p
    C[i][j] = C[i][j] + A[i][k] B[k][j] { sum local products }
end

SLIDE 60

SUMMA Algorithm

Algorithm just described requires excessive memory, since each process accumulates √p blocks of both A and B
One way to reduce memory requirements is to
  broadcast blocks of A successively across processor rows, and
  broadcast blocks of B successively across processor columns

C[i][j] = 0
for k = 1, . . . , √p
    broadcast A[i][k] in ith processor row { horizontal broadcast }
    broadcast B[k][j] in jth processor column { vertical broadcast }
    C[i][j] = C[i][j] + A[i][k] B[k][j] { sum local products }
end
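A compact mpi4py sketch of this SUMMA loop (assuming a √p × √p grid with √p dividing n; at step k the processes in column k and row k supply the panels, so only one spare block each of A and B is live at a time):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
q = int(round(np.sqrt(p)))
i, j = divmod(rank, q)
row = comm.Split(color=i, key=j)
col = comm.Split(color=j, key=i)

n = 1024
b = n // q
Aij = np.random.rand(b, b)        # local block A[i][j]
Bij = np.random.rand(b, b)        # local block B[i][j]
Cij = np.zeros((b, b))
for k in range(q):
    Ak = Aij.copy() if j == k else np.empty((b, b))
    Bk = Bij.copy() if i == k else np.empty((b, b))
    row.Bcast(Ak, root=k)         # horizontal broadcast of A[i][k]
    col.Bcast(Bk, root=k)         # vertical broadcast of B[k][j]
    Cij += Ak @ Bk                # sum local products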

SLIDE 61

SUMMA Algorithm

[Diagram: SUMMA broadcasts of A and B blocks among 16 CPUs arranged in a 4 × 4 grid]

SLIDE 62

Cannon Algorithm

Another approach, due to Cannon (1969), is to circulate blocks of B vertically and blocks of A horizontally in ring fashion
Blocks of both matrices must be initially aligned using circular shifts so that correct blocks meet as needed
Requires less memory than SUMMA and replaces line broadcasts with shifts, lowering the number of messages
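A sketch of Cannon's algorithm using mpi4py's Cartesian topology (same √p × √p and divisibility assumptions as above; Sendrecv_replace implements the circular shifts):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(np.sqrt(p)))
cart = comm.Create_cart(dims=[q, q], periods=[True, True])
i, j = cart.Get_coords(cart.Get_rank())

n = 1024
b = n // q
A, B = np.random.rand(b, b), np.random.rand(b, b)
C = np.zeros((b, b))

# Initial alignment: shift row i of A left by i, column j of B up by j.
src, dst = cart.Shift(1, -i)                      # dim 1: horizontal
cart.Sendrecv_replace(A, dest=dst, source=src)
src, dst = cart.Shift(0, -j)                      # dim 0: vertical
cart.Sendrecv_replace(B, dest=dst, source=src)

for _ in range(q):
    C += A @ B                                    # multiply aligned blocks
    src, dst = cart.Shift(1, -1)
    cart.Sendrecv_replace(A, dest=dst, source=src)  # shift A left by one
    src, dst = cart.Shift(0, -1)
    cart.Sendrecv_replace(B, dest=dst, source=src)  # shift B up by one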

SLIDE 63

Cannon Algorithm

[Diagram: circular block shifts in Cannon's algorithm]

SLIDE 64

Fox Algorithm

It is possible to mix techniques from SUMMA and Cannon’s algorithm:

  circulate blocks of B vertically along processor columns in ring fashion, step by step, so that each block of B meets the appropriate block of A, broadcast at that same step

This algorithm is due to Fox et al.
Shifts, especially in Cannon's algorithm, are harder to generalize to nonsquare processor grids than the collectives in algorithms like SUMMA

SLIDE 65

Execution Time for 3-D Agglomeration

For 3-D agglomeration, computing each of p blocks C[i][j] requires matrix-matrix product of two (n/p^{1/3}) × (n/p^{1/3}) blocks, so Fp = (n/p^{1/3})³ = n³/p
On 3-D mesh, each broadcast or reduction takes time

    T_{p^{1/3}}^{bcast}((n/p^{1/3})²) = O(α log p + βn²/p^{2/3})

Total time is therefore

    Tp = O(α log p + βn²/p^{2/3} + γn³/p)

However, overall memory footprint is Mp = Θ(p·(n/p^{1/3})²) = Θ(p^{1/3}n²)

SLIDE 66

Strong Scalability of 3-D Agglomeration

The 3-D agglomeration efficiency is given by

    Ep(n) = T1(n)/(p·Tp(n)) = O(1/(1 + (α/γ)p log(p)/n³ + (β/γ)p^{1/3}/n))

For strong scaling to ps processors we need Eps(n) = Θ(1), so 3-D agglomeration strong scales to ps = O(min((γ/α)n³/log(n), (γ/β)n³)) processors

SLIDE 67

Weak Scalability of 3-D Agglomeration

For weak scaling (with constant input size per processor) to pw processors, we need Epw(n√pw) = Θ(1), which holds
However, note that the algorithm is not memory-efficient, as Mp = Θ(p^{1/3}n²); if keeping the memory footprint per processor constant, we must grow n with p^{1/3}
Scaling with constant memory footprint per processor (Mp/p = const) is possible to pm processors, where Epm(n·pm^{1/3}) = Θ(1), which holds for pm = Θ(2^{(γ/α)n³}) processors
The isoefficiency function is Q̃(p) = Θ(p log(p))

SLIDE 68

Execution Time for 2-D Agglomeration

For 2-D agglomeration (SUMMA), computation of each block C[i][j] requires √p matrix-matrix products of (n/√p) × (n/√p) blocks, so Fp = √p·(n/√p)³ = n³/p
For broadcasts among rows and columns of the processor grid, communication time is

    2√p·T_{√p}^{bcast}(n²/p) = Θ(α√p log(p) + βn²/√p)

Total time is therefore

    Tp = O(α√p log(p) + βn²/√p + γn³/p)

The algorithm is memory-efficient, Mp = Θ(n²)

SLIDE 69

Strong Scalability of 2-D Agglomeration

The 2-D agglomeration efficiency is given by

    Ep(n) = T1(n)/(p·Tp(n)) = O(1/(1 + (α/γ)p^{3/2} log(p)/n³ + (β/γ)√p/n))

For strong scaling to ps processors we need Eps(n) = Θ(1), so 2-D agglomeration strong scales to ps = O(min((γ/α)^{2/3}n²/log(n)^{2/3}, (γ/β)²n²)) processors
For weak scaling to pw processors with n²/p matrix elements per processor, we need Epw(√pw·n) = Θ(1), so 2-D agglomeration (SUMMA) weak scales to pw = O(2^{(γ/α)n³}) processors
Cannon's algorithm achieves unconditional weak scalability

SLIDE 70

Scalability for 1-D Agglomeration

For 1-D agglomeration on 1-D mesh, total time is

    Tp = O(α log(p) + βn² + γn³/p)

The corresponding efficiency is

    Ep = O(1/(1 + (α/γ)p log(p)/n³ + (β/γ)p/n))

Strong scalability is possible to at most ps = O((γ/β)n) processors
Weak scalability is possible to at most pw = O(√((γ/β)n)) processors

SLIDE 71

Rectangular Matrix Multiplication

If C is m × n, A is m × k, and B is k × n, choosing a 3-D grid that optimizes volume-to-surface-area ratio yields bandwidth cost

    Wp(m, n, k) = O( min_{p1p2p3=p} ( mk/(p1p2) + kn/(p1p3) + mn/(p2p3) ) )
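Since the minimization runs over a finite set of grid shapes, it can be evaluated directly; a small Python sketch (exhaustive search, fine for modest p; the example sizes are arbitrary):

def best_grid(m, n, k, p):
    # Search p1*p2*p3 = p for the grid minimizing the bandwidth term
    # mk/(p1*p2) + kn/(p1*p3) + mn/(p2*p3).
    best = (float("inf"), None)
    for p1 in range(1, p + 1):
        if p % p1:
            continue
        for p2 in range(1, p // p1 + 1):
            if (p // p1) % p2:
                continue
            p3 = p // (p1 * p2)
            w = m*k/(p1*p2) + k*n/(p1*p3) + m*n/(p2*p3)
            best = min(best, (w, (p1, p2, p3)))
    return best

# A very rectangular product favors an asymmetric grid:
print(best_grid(m=4096, n=4096, k=64, p=64))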

SLIDE 72

Communication vs. Memory Tradeoff

Communication cost for 2-D algorithms for matrix-matrix product is optimal, assuming no replication of storage
If explicit replication of storage is allowed, then lower communication volume is possible via 3-D algorithm
Generally, we assign n/p1 × n/p2 × n/p3 computation subvolume to each processor
The largest face of the subvolume gives communication cost, the smallest face gives minimal memory usage:
  we can keep the smallest face local while successively computing slices of the subvolume

SLIDE 73

Leveraging Additional Memory in Matrix Multiplication

Provided M̄ total available memory, 2-D and 3-D algorithms achieve bandwidth cost

    Wp(n, M̄) =  ∞            : M̄ < n²
                 n²/√p        : M̄ = Θ(n²)
                 n²/p^{2/3}   : M̄ = Θ(n²p^{1/3})

For general M̄, possible to pick processor grid to achieve

    Wp(n, M̄) = O(n³/(√p·M̄^{1/2}) + n²/p^{2/3})

and an overall execution time of

    Tp(n, M̄) = O(α(log p + n³√p/M̄^{3/2}) + β·Wp(n, M̄) + γn³/p)

SLIDE 74

Strong Scaling using All Available Memory

The efficiency of the algorithm for a given amount of total memory M̄p is

    Ep(n, M̄p) = O(1/(1 + (α/γ)(p log(p)/n³ + p^{3/2}/M̄p^{3/2}) + (β/γ)(√p/M̄p^{1/2} + p^{1/3}/n)))

For strong scaling assuming M̄p = p·M̄1, we need

    Eps(n, ps·M̄1) = T1(n, M̄1)/(ps·Tps(n, ps·M̄1)) = Θ(1)

In this case, we obtain ps = Θ(min((γ/α)n³/log(n), (γ/β)n³)) processors, as good as the 3-D algorithm, but more memory-efficient

SLIDE 75

References

  • R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, A three-dimensional approach to parallel matrix multiplication, IBM J. Res. Dev. 39:575-582, 1995
  • J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Comput. 12:335-342, 1989
  • J. Demmel, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, Communication-optimal parallel recursive rectangular matrix multiplication, IPDPS, 2013
  • J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2:111-197, 1993

SLIDE 76

References

  • R. Dias da Cunha, A benchmark study based on the parallel computation of the vector outer-product A = uvᵀ operation, Concurrency: Practice and Experience 9:803-819, 1997
  • G. C. Fox, S. W. Otto, and A. J. G. Hey, Matrix algorithms on a hypercube I: matrix multiplication, Parallel Comput. 4:17-31, 1987
  • D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distrib. Comput. 64:1017-1026, 2004
  • S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distrib. Comput. 4(2):133-172, 1987

SLIDE 77

References

  • S. L. Johnsson, Minimizing the communication time for matrix multiplication on multiprocessors, Parallel Comput. 19:1235-1257, 1993
  • B. Lipshitz, Communication-avoiding parallel recursive algorithms for matrix multiplication, Tech. Rept. UCB/EECS-2013-100, University of California at Berkeley, May 2013
  • O. McBryan and E. F. Van de Velde, Matrix and vector operations on hypercube parallel processors, Parallel Comput. 5:117-126, 1987
  • R. A. Van De Geijn and J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience 9(4):255-274, 1997
