Communication Lower Bounds for Matrix-Matrix Multiplication


Dagstuhl Seminar #15281, July 6-9, 2015. Julien Langou.


slide-1
SLIDE 1

Communication Lower Bounds for Matrix-Matrix Multiplication

Dagstuhl Seminar #15281 July 6-9, 2015

Julien Langou

slide-2
SLIDE 2

MOTIVATIONS

  • Motivations
  • Communication Lower Bound for Sequential Matrix-Matrix Multiplication
  • Application to Parallel Distributed

slide-4
SLIDE 4

Getting up to speed: The Future of Supercomputing, Eds. Susan L. Graham, Marc Snir, and Cynthia A. Patterson, National Research Council, 227 pages, 2004.

Annual improvement:

  • Time per flop: 59%
  • Network: bandwidth 26%, latency 15%
  • DRAM: bandwidth 23%, latency 5%

FIGURE 5.3 Arithmetic performance (Mflops), memory bandwidth, and DRAM chip bandwidth per calendar year. [Plot, Jan 88 to Jan 02, logarithmic scale; series: Memory BW (Mword/sec), Mflops, DRAM Chip BW (Mword/sec).]

FIGURE 5.4 Decrease in memory latency (in nanoseconds) per calendar year. [Plot, Jan 88 to Jan 02; series: DRAM Row Access Time (nsec) with an exponential fit.]

slide-5
SLIDE 5

http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

slide-6
SLIDE 6

Data Movement Cost: Energy Trends

Source: Jim Demmel, John Shalf

[Chart: energy cost in picojoules, logarithmic scale from 1 to 10000, comparing 45 nm with 11 nm (2018).]

FLOPs are almost free; data movement cost is dominant. Minimizing the amount of data movement is increasingly critical.

slide-9
SLIDE 9

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION

  • Motivations
  • Communication Lower Bound for Sequential Matrix-Matrix Multiplication
  • Application to Parallel Distributed

slide-10
SLIDE 10

One core of an Intel Xeon Processor E5520 (Nehalem)

$\beta^{-1} = 580 \cdot 10^{6}$ words/sec, $\gamma^{-1} = 10.12 \cdot 10^{9}$ flops/sec, $M = 10^{6}$ words

[Plot: DGEMM on one core Intel Xeon Processor E5520 (Nehalem); GFlops/sec (1 to 10) versus matrix order (500 to 5000).]
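To get a feel for these parameters, the sketch below computes how many flops the core can perform in the time it takes to move one word, and the time the $2n^3$ flops of a product would take at peak rate. The function names and the idea of charging time as (count) times (inverse rate) are illustrative assumptions, not something stated on the slide.

```python
# Machine parameters quoted on the slide (one core, Intel Xeon E5520).
BETA_INV = 580e6      # words per second between slow and fast memory
GAMMA_INV = 10.12e9   # flops per second
M = 1e6               # fast memory size, in words


def flop_time(n):
    """Seconds for the 2 n^3 flops of an n-by-n product at the quoted peak rate."""
    return 2 * n**3 / GAMMA_INV


def transfer_time(words):
    """Seconds to move `words` words at the quoted bandwidth (assumed cost model)."""
    return words / BETA_INV


def flops_per_word_balance():
    """Flops the core can perform while one word is being moved."""
    return GAMMA_INV / BETA_INV


if __name__ == "__main__":
    print(f"balance: {flops_per_word_balance():.1f} flops per word moved")
    print(f"flop time for n = 4000: {flop_time(4000):.2f} s")
```

A balance well above 1 is the reason the number of words moved per flop matters for DGEMM performance.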

slide-14
SLIDE 14

Mission Statement

We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model.

dense matrix-matrix multiplication: $C = AB$

input: A an n-by-n matrix, B an n-by-n matrix

output: C an n-by-n matrix

    % starting from C = 0
    for i = 1 : n,
      for j = 1 : n,
        for k = 1 : n,
          cijk = a(i,k) * b(k,j);
          c(i,j) = c(i,j) + cijk;
        end
      end
    end

  » $2n^3$ operations
  » any order of creation of the $c_{ijk}$ results in the correct answer
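The claim that any order of creation of the $c_{ijk}$ gives the correct answer can be checked with a short sketch. The function name and the use of a seeded shuffle are illustrative choices; integer entries are used on purpose, because floating-point addition is not associative and a reordering could change the rounding.

```python
import itertools
import random


def matmul_any_order(A, B, seed=0):
    """Accumulate c_ij += a_ik * b_kj over all (i, j, k) triples in a shuffled order."""
    n = len(A)
    triples = list(itertools.product(range(n), repeat=3))
    random.Random(seed).shuffle(triples)   # an arbitrary order of the n^3 products
    C = [[0] * n for _ in range(n)]
    for i, j, k in triples:
        cijk = A[i][k] * B[k][j]           # one of the n^3 scalar multiplications
        C[i][j] += cijk                    # one of the n^3 additions
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_any_order(A, B, seed=1))
```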

slide-15
SLIDE 15

sequential: two levels of memory

  » sequential = not parallel!
  » fast memory of size M
  » slow memory
  » computation happens in fast memory

[Diagram: Intel Xeon Processor E5520 (Nehalem): CPU (10.12 GFLOP/sec/core), cache (8 MB), main memory (16 GB), 25.6 GB/s.]

slide-17
SLIDE 17

communication cost: time, energy, etc.

ordinary: we compute all (there are $n^3$ of them) the $c_{ijk} = a_{ik} \cdot b_{kj}$ (consequence: Strassen-like matrix-matrix multiplication algorithms are not allowed.)

slide-18
SLIDE 18

Sequential Lower Bounds for Matrix-Matrix Multiplication

Consider any ordinary dense matrix-matrix multiplication algorithm for multiplying an n-by-n matrix with an n-by-n matrix, and consider a computer with fast memory of size M. Then:

Upper bound :: square tile matrix-matrix multiplication

The number of words transferred between slow and fast memory is at most

$$3.46\, \frac{n^3}{\sqrt{M}}.$$

Lower Bound :: Irony, Toledo, and Tiskin, 2004

The number of words transferred between slow and fast memory is at least

$$0.35\, \frac{n^3}{\sqrt{M}} - M.$$

Note: $3.46 \approx 2\sqrt{3}$

Note: $0.35 \approx (2\sqrt{2})^{-1}$
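As a quick numerical illustration, the two bounds above can be evaluated for a given matrix order and fast memory size. The function names are hypothetical; the formulas are the ones on the slide, with the exact constants $2\sqrt{3}$ and $(2\sqrt{2})^{-1}$.

```python
import math


def tiled_upper_bound(n, M):
    """Words moved by square-tile multiplication: 2 sqrt(3) n^3 / sqrt(M)."""
    return 2 * math.sqrt(3) * n**3 / math.sqrt(M)


def itt_lower_bound(n, M):
    """Irony, Toledo, Tiskin (2004) bound: n^3 / (2 sqrt(2) sqrt(M)) - M."""
    return n**3 / (2 * math.sqrt(2) * math.sqrt(M)) - M


if __name__ == "__main__":
    n, M = 4000, 10**6
    lo, hi = itt_lower_bound(n, M), tiled_upper_bound(n, M)
    print(f"lower bound: {lo:.3e} words, upper bound: {hi:.3e} words")
    print(f"gap between the leading constants: {2 * math.sqrt(3) * 2 * math.sqrt(2):.2f}x")
```

The gap between the two leading constants (roughly a factor of ten) is what the rest of the section works on closing.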

slide-19
SLIDE 19

What is an algorithm?

A sequence of the following instructions defines an algorithm:

  » Read an element from slow memory to fast memory.
  » Create an element in fast memory.
  » Write an element from fast memory to slow memory.
  » Delete an element from fast memory.
  » Perform a floating-point operation in fast memory.

Example:

    Read a11
    Read b11
    Create c111 = a11 b11
    Read a12
    Read b21
    Create c112 = a12 b21
    Write c11
    Delete c11, a11, b11
    . . .

Split the instructions into segments so exactly M reads and writes occur in each segment.
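The segmentation step can be sketched as follows. The instruction encoding (pairs of an operation name and an operand) and the convention that a segment is closed right after its M-th read or write are assumptions made for this illustration; the slides do not fix where the free instructions between two segments belong.

```python
def split_into_segments(trace, M):
    """Split a list of (op, operand) instructions into segments.

    A segment is closed as soon as it contains exactly M reads and writes.
    The last segment may contain fewer than M transfers.
    """
    segments, current, transfers = [], [], 0
    for op, operand in trace:
        current.append((op, operand))
        if op in ("read", "write"):        # only transfers count toward M
            transfers += 1
            if transfers == M:
                segments.append(current)
                current, transfers = [], 0
    if current:
        segments.append(current)
    return segments


# The instruction sequence shown on the slide, in the assumed encoding.
trace = [
    ("read", "a11"), ("read", "b11"), ("create", "c111"),
    ("read", "a12"), ("read", "b21"), ("create", "c112"),
    ("write", "c11"), ("delete", "c11"), ("delete", "a11"), ("delete", "b11"),
]

if __name__ == "__main__":
    for number, segment in enumerate(split_into_segments(trace, M=2), start=1):
        print(number, segment)
```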

slide-21
SLIDE 21

What do we want to compute?

We want to compute the $n^3$ products $c_{ijk} = a_{ik} b_{kj}$. The computation of $c_{ijk}$ requires $a_{ik}$, $b_{kj}$, and a $c_{ij*}$ to be in cache. In order to compute $c_{ijk}$ during a segment,

  » either $a_{ik}$ is in cache at the start of the segment ($M_a$), or we have to read $a_{ik}$ ($R_a$) from slow memory to cache during the segment;
  » either $b_{kj}$ is in cache at the start of the segment ($M_b$), or we have to read $b_{kj}$ ($R_b$) from slow memory to cache during the segment;
  » either a $c_{ij*}$ is in cache at the end of the segment ($N_c$), or we have to write it back ($W_c$) during the segment.

[Diagram: the multiplication (i, j, k), $c_{ij} = c_{ij} + a_{ik} b_{kj}$, drawn as a point in the i-j-k cube together with $a_{ik}$, $b_{kj}$, and $c_{ij}$.]

slide-29
SLIDE 29

Split the instructions into segments so exactly M reads and writes occur in each segment.

Segment (M reads and writes):

    Read a11
    Read b11
    Create c111 = a11 b11
    Read a12
    Read b21
    Create c112 = a12 b21
    Write c11
    Delete c11, a11, b11
    . . .

  » $R_a$ = number of reads for A.
  » $W_a$ = number of writes for A.
  » Similar for B and C.

Maximize the number of multiplications in a segment.

Deletes are free.

$M_a$ = number of A elements in fast memory at the start.

$N_a$ = number of A elements in fast memory at the end.

slide-35
SLIDE 35

$$
\begin{aligned}
\max \quad & (\text{number of scalar multiplications}) \\
\text{subject to} \quad & R_a + R_b + R_c + W_a + W_b + W_c = M, \\
& M_a + M_b + M_c \le M, \\
& N_a + N_b + N_c \le M, \\
& M_a \ge 0,\ N_a \ge 0,\ R_a \ge 0,\ W_a \ge 0, \\
& M_b \ge 0,\ N_b \ge 0,\ R_b \ge 0,\ W_b \ge 0, \\
& M_c \ge 0,\ N_c \ge 0,\ R_c \ge 0,\ W_c \ge 0.
\end{aligned}
$$

  » Constraint 1: Total number of reads and writes.
  » Constraint 2: Total number of elements at start of segment.
  » Constraint 3: Total number of elements at end of segment.
  » Constraint 4: Nonnegative.

Note that we do not prevent a memory usage greater than M during a segment. We only control the end points of a segment.

slide-38
SLIDE 38

Lemma (Loomis-Whitney Inequality)

Let $V \subset \mathbb{Z}^3$ be a finite set, and let $V_x$, $V_y$, and $V_z$ be the orthogonal projections of $V$ onto the coordinate planes. The cardinality of $V$, $|V|$, satisfies

$$|V| \le \sqrt{|V_x| \cdot |V_y| \cdot |V_z|}.$$

[Diagram: a set of points in the cube with its three projections labeled A, B, and C.]
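The inequality is easy to test numerically on small point sets. The sketch below (function names are illustrative) compares $|V|^2$ with the product of the three projection sizes, which avoids the square root and keeps the check in exact integer arithmetic.

```python
import itertools
import random


def projections(V):
    """Orthogonal projections of a finite set V of integer triples onto the coordinate planes."""
    Vx = {(y, z) for (x, y, z) in V}   # drop the x coordinate
    Vy = {(x, z) for (x, y, z) in V}   # drop the y coordinate
    Vz = {(x, y) for (x, y, z) in V}   # drop the z coordinate
    return Vx, Vy, Vz


def loomis_whitney_holds(V):
    """Check |V|^2 <= |Vx| * |Vy| * |Vz|, the squared form of the inequality."""
    Vx, Vy, Vz = projections(V)
    return len(V) ** 2 <= len(Vx) * len(Vy) * len(Vz)


if __name__ == "__main__":
    cube = set(itertools.product(range(3), repeat=3))   # equality case: 27^2 = 9 * 9 * 9
    print(loomis_whitney_holds(cube))
    rng = random.Random(0)
    points = {(rng.randrange(5), rng.randrange(5), rng.randrange(5)) for _ in range(40)}
    print(loomis_whitney_holds(points))
```

A full cube attains equality, which is why cube-shaped sets of multiplications (square tiles) are the natural candidates for communication-efficient algorithms.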

slide-39
SLIDE 39

To perform $c_{ijk} = a_{ik} b_{kj}$ we need $c_{ij}$ (or $c_{ijk}$), $a_{ik}$, and $b_{kj}$ in fast memory.

[Diagram: the three faces of the cube labeled A, B, and C.]

Given $V_A$ elements of A in fast memory, $V_B$ elements of B in fast memory, and $V_C$ elements of C in fast memory, the Loomis-Whitney inequality gives an upper bound on the number of scalar multiplications $c_{ijk} = a_{ik} b_{kj}$ that we can perform:

$$(\text{number of scalar multiplications}) \le \sqrt{|V_A| \cdot |V_B| \cdot |V_C|}.$$
slide-40
SLIDE 40

In a segment of length M, every element of A ($a_{ik}$) present in fast memory during the segment either was there at the start of the segment or has been read during the segment:

$$V_A \le M_a + R_a,$$

ditto for B,

$$V_B \le M_b + R_b.$$

Every element of C ($c_{ij}$) contributed to during the segment either is left in fast memory at the end of the segment or has been written during the segment:

$$V_C \le N_c + W_c.$$

slide-44
SLIDE 44

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(N_c + W_c)} \\
\text{subject to} \quad & R_a + R_b + R_c + W_a + W_b + W_c = M, \\
& M_a + M_b + M_c \le M, \\
& N_a + N_b + N_c \le M, \\
& M_a \ge 0,\ N_a \ge 0,\ R_a \ge 0,\ W_a \ge 0, \\
& M_b \ge 0,\ N_b \ge 0,\ R_b \ge 0,\ W_b \ge 0, \\
& M_c \ge 0,\ N_c \ge 0,\ R_c \ge 0,\ W_c \ge 0.
\end{aligned}
$$

  » Objective: Loomis-Whitney inequality.
  » $M_a + R_a$: Maximum number of elements of A in fast memory.
  » $M_b + R_b$: Maximum number of elements of B in fast memory.
  » $N_c + W_c$: Maximum number of elements of C in fast memory.

slide-45
SLIDE 45

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(N_c + W_c)} \\
\text{subject to} \quad & R_a + R_b + R_c + W_a + W_b + W_c = M, \\
& M_a + M_b + M_c \le M, \\
& N_a + N_b + N_c \le M.
\end{aligned}
$$

$R_c$, $W_a$, $W_b$, $M_c$, $N_a$, and $N_b$ do not appear in the objective function, and $N_c \le M$.

slide-48
SLIDE 48

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(M + W_c)} \\
\text{subject to} \quad & R_a + R_b + W_c = M, \\
& M_a + M_b \le M.
\end{aligned}
$$

$R_c$, $W_a$, $W_b$, $M_c$, $N_a$, and $N_b$ do not appear in the objective function, and $N_c \le M$.

Set each of them to zero, since nonzero values will only reduce the objective function. Set $N_c$ to $M$.

Each variable is bounded by $M$. Therefore,

$$\sqrt{(M_a + R_a)(M_b + R_b)(M + W_c)} \le 2\sqrt{2}\, M^{3/2}.$$

This means that an upper bound for the number of multiplications in a segment of size $M$ is $2\sqrt{2}\, M^{3/2}$:

$$(\text{maximum \# multiplications in a segment of size } M) \le 2\sqrt{2}\, M^{3/2}.$$
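The intermediate step is not spelled out on the slide. One way to obtain the constant, using only the fact that each variable is at most $M$, is to bound the three factors separately:

```latex
\begin{aligned}
M_a + R_a &\le M + M = 2M, \\
M_b + R_b &\le M + M = 2M, \\
M + W_c   &\le M + M = 2M, \\
\sqrt{(M_a + R_a)(M_b + R_b)(M + W_c)} &\le \sqrt{(2M)^3} = 2\sqrt{2}\, M^{3/2}.
\end{aligned}
```

This majorization is deliberately crude: since $R_a + R_b + W_c = M$, the three factors cannot all reach $2M$ at the same time, which leaves room for a sharper bound.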

slide-52
SLIDE 52

We got that

$$(\text{maximum \# multiplications in a segment of size } M) \le 2\sqrt{2}\, M^{3/2}.$$

Since we need to perform $n^3$ multiplications in all,

$$\left\lfloor \frac{n^3}{2\sqrt{2}\, M^{3/2}} \right\rfloor \le (\text{minimum \# of segments of size } M),$$

this gives a lower bound for the number of words transferred as

$$\left\lfloor \frac{n^3}{2\sqrt{2}\, M^{3/2}} \right\rfloor M \le (\text{minimum volume of communication}),$$

and so we lower bound with (no one seems to like the floor function)

$$\frac{1}{2\sqrt{2}} \frac{n^3}{\sqrt{M}} - M \le (\text{minimum volume of communication}).$$
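The last step, which removes the floor, is just the elementary inequality $\lfloor x \rfloor \ge x - 1$ applied to the number of complete segments:

```latex
\left\lfloor \frac{n^3}{2\sqrt{2}\, M^{3/2}} \right\rfloor M
  \;\ge\; \left( \frac{n^3}{2\sqrt{2}\, M^{3/2}} - 1 \right) M
  \;=\; \frac{1}{2\sqrt{2}} \frac{n^3}{\sqrt{M}} - M .
```

The subtracted $M$ accounts for the last, possibly incomplete, segment.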

slide-53
SLIDE 53

Ways to improve lower bound

  • 1. sequential case, parallel distributed 2D case
      1.1 Change the formulation of the maximization problem.
      1.2 Improve the majorization and solve exactly.
      1.3 Change length of a segment.
  • 2. parallel distributed 3D case
      2.1 (We cannot change the formulation of the maximization problem.)
      2.2 Improve the majorization and solve exactly.
      2.3 Change length of a segment.
      2.4 Consider multiple segments that follow one another.

slide-54
SLIDE 54

  • 1. Change length of a segment.

Instead of defining a segment to be of length $M$, we define a segment to be of length $\alpha M$, where $\alpha$ is a parameter.

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(M + W_c)} \\
\text{subject to} \quad & R_a + R_b + W_c = \alpha M, \\
& M_a + M_b \le M.
\end{aligned}
$$

slide-56
SLIDE 56

  • 1. Change length of a segment (continued).

An upper bound on the number of scalar multiplications in one segment of length $\alpha M$ is

$$(1 + \alpha)^{3/2} M^{3/2}.$$

The minimum number of reads and writes is

$$\left\lfloor \frac{mnp}{(1 + \alpha)^{3/2} M^{3/2}} \right\rfloor (\alpha M) \ge \frac{\alpha}{(1 + \alpha)^{3/2}} \frac{mnp}{\sqrt{M}} - \alpha M,$$

where $mnp$ is the total number of scalar multiplications ($mnp = n^3$ in the square case considered so far).

$\alpha = 2$ maximizes the constant. A lower bound for the volume of words transferred is

$$\text{Volume} \ge \frac{2}{3\sqrt{3}} \frac{mnp}{\sqrt{M}} - 2M.$$

Increased constant from about 0.35 to about 0.38. Yeah!
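The choice $\alpha = 2$ can be confirmed with a small grid search over the leading constant $\alpha / (1 + \alpha)^{3/2}$. The function names and the grid resolution are arbitrary choices for this sketch.

```python
import math


def constant(alpha):
    """Leading constant alpha / (1 + alpha)^(3/2) of the lower bound."""
    return alpha / (1 + alpha) ** 1.5


def best_alpha(num_points=10000, upper=10.0):
    """Grid search for the alpha in (0, upper] that maximizes the constant."""
    grid = [upper * i / num_points for i in range(1, num_points + 1)]
    return max(grid, key=constant)


if __name__ == "__main__":
    alpha = best_alpha()
    print(f"best alpha on the grid: {alpha:.3f}")
    print(f"constant at alpha = 1: {constant(1):.4f}")
    print(f"constant at alpha = 2: {constant(2):.4f}")
```

Setting the derivative of the constant to zero gives the same answer analytically: $(1 + \alpha) = \tfrac{3}{2}\alpha$, so $\alpha = 2$.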

slide-58
SLIDE 58

  • 2. Improve majorization or solve exactly.

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(M + W_c)} \\
\text{subject to} \quad & R_a + R_b + W_c = M, \\
& M_a + M_b \le M.
\end{aligned}
$$

Solving exactly, the maximum number of scalar multiplications in one segment is $M^{3/2}$.

A lower bound for the volume of words transferred is

$$\text{Volume} \ge \frac{n^3}{\sqrt{M}} - M.$$

Increased constant from about 0.38 to about 1. Yeah!

slide-60
SLIDE 60

Solving exactly for arbitrary segment length, $\alpha M$.

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(M + W_c)} \\
\text{subject to} \quad & R_a + R_b + W_c = \alpha M, \\
& M_a + M_b \le M.
\end{aligned}
$$

A lower bound for the number of words transferred is

$$\left\lfloor \frac{mnp}{\left(\frac{2 + \alpha}{3}\right)^{3/2} M^{3/2}} \right\rfloor (\alpha M).$$

$\alpha = 4$ maximizes the constant. A lower bound for the volume of words transferred is

$$\text{Volume} \ge \sqrt{2}\, \frac{mnp}{\sqrt{M}} - 4M.$$

Increased constant from about 1 to about 1.41. Yeah!
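The slide gives the result without a derivation. One plausible route, offered here as a sketch and not necessarily the author's argument, is the arithmetic-geometric mean inequality applied to the three factors of the objective. The bound it gives is attained for $\alpha \ge 1$, where the optimal $W_c = (\alpha - 1) M / 3$ is nonnegative.

```latex
\begin{aligned}
(M_a + R_a) + (M_b + R_b) + (M + W_c)
  &= (M_a + M_b) + (R_a + R_b + W_c) + M \;\le\; (2 + \alpha) M, \\
\sqrt{(M_a + R_a)(M_b + R_b)(M + W_c)}
  &\le \left( \frac{2 + \alpha}{3} \right)^{3/2} M^{3/2}, \\
\left\lfloor \frac{mnp}{\left(\frac{2 + \alpha}{3}\right)^{3/2} M^{3/2}} \right\rfloor (\alpha M)
  &\ge \alpha \left( \frac{3}{2 + \alpha} \right)^{3/2} \frac{mnp}{\sqrt{M}} - \alpha M, \\
\frac{d}{d\alpha} \left[ \alpha\, (2 + \alpha)^{-3/2} \right] = 0
  &\iff 2 + \alpha = \tfrac{3}{2} \alpha \iff \alpha = 4,
  \qquad 4 \left( \tfrac{3}{6} \right)^{3/2} = \sqrt{2}.
\end{aligned}
```

With $\alpha = 1$ the second line reduces to $M^{3/2}$, which agrees with the value quoted for segments of length $M$.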

slide-61
SLIDE 61

An optimal solution for one segment of length $M$.

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(M + W_c)} \\
\text{subject to} \quad & R_a + R_b + W_c = \alpha M \quad (\text{here } \alpha = 1), \\
& M_a + M_b \le M.
\end{aligned}
$$

Fast memory is half A and half B at the start. Read $M/2$ elements of A and $M/2$ elements of B, and write no element of C. Fast memory is full of C at the end.

  » Other solution: $M_a = \tfrac{1}{3} M$, $R_a = \tfrac{2}{3} M$, $M_b = \tfrac{2}{3} M$, $R_b = \tfrac{1}{3} M$, $W_c = 0$.
  » Other solution: $M_a = M$, $R_a = 0$, $M_b = 0$, $R_b = M$, $W_c = 0$.
  » General solution: $M_a = cM$, $R_a = (1 - c) M$, $M_b = (1 - c) M$, $R_b = cM$, $W_c = 0$.

slide-62
SLIDE 62

Ways to improve lower bound

  • 1. Change length of a segment.
  • 2. Improve majorization or solve exactly.
  • 3. Change the optimization problem.
slide-65
SLIDE 65

  • 3. Change the optimization problem.

[2] Irony, D., Toledo, S., and Tiskin, A. (2004). “Communication lower bounds for distributed-memory matrix multiplication.” J. Parallel Distrib. Comput., 64(9):1017-1026.

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(N_c + W_c)} \\
\text{subject to} \quad & R_a + R_b + W_c = M, \\
& M_a + M_b \le M, \\
& N_c \le M.
\end{aligned}
$$

[3] Dongarra, Pineau, Robert, Shi, and Vivien. (2007). “Revisiting Matrix Product on Master-Worker Platforms.” IEEE International Parallel and Distributed Processing Symposium.

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(M_c + R_c)} \\
\text{subject to} \quad & R_a + R_b + R_c = M, \\
& M_a + M_b + M_c \le M.
\end{aligned}
$$

If you solve the new problem exactly, you get $\frac{3\sqrt{3}}{2\sqrt{2}} \approx 1.83$.

slide-69
SLIDE 69

  • 3. Change the optimization problem.

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(M_c + R_c)} \\
\text{subject to} \quad & R_a + R_b + R_c = \alpha M, \\
& M_a + M_b + M_c \le M.
\end{aligned}
$$

Solving exactly, the maximum number of scalar multiplications in one segment is

$$\frac{1}{3\sqrt{3}} (1 + \alpha)^{3/2} M^{3/2}.$$

A lower bound for the volume of words transferred is

$$\text{Volume} \ge \left\lfloor 3\sqrt{3}\, \frac{1}{(1 + \alpha)^{3/2}} \frac{n^3}{M^{3/2}} \right\rfloor (\alpha M).$$

Take $\alpha = 1$, get $\frac{3\sqrt{3}}{2\sqrt{2}} \approx 1.83$.

Take $\alpha = 2$, get 2. The lower bound on the volume of messages transferred is

$$\text{Volume} \ge 2\, \frac{n^3}{\sqrt{M}} - 2M.$$
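The closed-form maximum quoted above can be sanity-checked by brute force on a tiny instance. The sketch below enumerates all nonnegative integer choices of the six variables; the function names and the restriction to integer $\alpha$ and small $M$ are assumptions made to keep the enumeration cheap.

```python
import math
from itertools import product


def compositions(total, exact):
    """Yield triples of nonnegative integers summing to `total` (or to at most `total`)."""
    for triple in product(range(total + 1), repeat=3):
        s = sum(triple)
        if s == total or (not exact and s < total):
            yield triple


def brute_force_max(M, alpha):
    """Maximum of sqrt((Ma+Ra)(Mb+Rb)(Mc+Rc)) over integer feasible points."""
    best = 0
    for Ma, Mb, Mc in compositions(M, exact=False):          # Ma + Mb + Mc <= M
        for Ra, Rb, Rc in compositions(alpha * M, exact=True):  # Ra + Rb + Rc = alpha M
            best = max(best, (Ma + Ra) * (Mb + Rb) * (Mc + Rc))
    return math.sqrt(best)


def closed_form(M, alpha):
    """(1 + alpha)^(3/2) M^(3/2) / (3 sqrt(3)), the value quoted on the slide."""
    return (1 + alpha) ** 1.5 * M ** 1.5 / (3 * math.sqrt(3))


if __name__ == "__main__":
    print(brute_force_max(6, 2), closed_form(6, 2))
```

The two values agree whenever $(1 + \alpha) M$ is divisible by 3, so that the three factors can be made exactly equal with integers.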

slide-70
SLIDE 70

  • 3. Change the optimization problem.

$$
\begin{aligned}
\max \quad & \sqrt{(M_a + R_a)(M_b + R_b)(M_c + R_c)} \\
\text{subject to} \quad & R_a + R_b + R_c = M, \\
& M_a + M_b + M_c \le M.
\end{aligned}
$$

Changing to this optimization problem is justified

  » in the sequential model (see Lowery and Langou),
  » in parallel distributed, for example in a 2D distribution. The condition in parallel distributed is that if a processor owns $c_{ij}$, all multiplications of this $c_{ij}$ are done on this processor. “The owner computes.”

Changing to this optimization problem is not justified when replication of the $c_{ij}$ is allowed, e.g. in the 3D algorithm.

slide-71
SLIDE 71

  • 4. Increase number of segments.

(Two segment problem formulation)

$$
\begin{aligned}
\max \quad & \sqrt{(M_a^{(0)} + R_a^{(1)})(M_b^{(0)} + R_b^{(1)})(M_c^{(1)} + W_c^{(1)})}
  + \sqrt{(M_a^{(1)} + R_a^{(2)})(M_b^{(1)} + R_b^{(2)})(M_c^{(2)} + W_c^{(2)})} \\
\text{subject to} \quad & M_a^{(0)} + M_b^{(0)} + M_c^{(0)} \le M, \\
& R_a^{(1)} + R_b^{(1)} + R_c^{(1)} + W_a^{(1)} + W_b^{(1)} + W_c^{(1)} = \alpha M, \\
& M_a^{(1)} + M_b^{(1)} + M_c^{(1)} \le M, \\
& R_a^{(2)} + R_b^{(2)} + R_c^{(2)} + W_a^{(2)} + W_b^{(2)} + W_c^{(2)} = \alpha M, \\
& M_a^{(2)} + M_b^{(2)} + M_c^{(2)} \le M.
\end{aligned}
$$

slide-72
SLIDE 72

Two segment solution

Used the GAMS global optimization solver lindoglobal.

  » Returns a global solution.
  » Branch-and-cut global optimization procedure.

Maximized the lower bound when (s = 2, α = 2). The lower bound on the volume of messages transferred is

$$\text{Volume} \ge 1.57\, \frac{mnp}{\sqrt{M}} - 4M.$$

(s = 3, α = 2) gives 1.65. (s = 4, α = 2) gives 1.73.

slide-73
SLIDE 73

Sequential Lower Bounds for Matrix-Matrix Multiplication

Consider any ordinary dense matrix-matrix multiplication algorithm for multiplying an n-by-n matrix with an n-by-n matrix, and consider a computer with fast memory of size M. Then:

Upper bound :: square tile matrix-matrix multiplication

The number of words transferred between slow and fast memory is at most

$$3.46\, \frac{n^3}{\sqrt{M}}.$$

Lower Bound :: Lowery and Langou, 2014

The number of words transferred between slow and fast memory is at least

$$2\, \frac{n^3}{\sqrt{M}} - 2M.$$

Note: $3.46 \approx 2\sqrt{3}$

slide-74
SLIDE 74

Block matrix-matrix multiplication

See: the blocked matrix-multiply algorithm of S. Toledo. A survey of out-of-core algorithms in numerical linear algebra. In External Memory Algorithms and Visualization, pages 161–180. American Mathematical Society Press, 1999.

slide-79
SLIDE 79

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION 39

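Block matrix-matrix multiplication

[Figure: C = A × B with blocks Cij, Aik, Bkj; the b-by-b block Cij is updated as Cij += Aik × Bkj.]

Three square blocks fit in fast memory: \( 3b^2 = M \).

Good bandwidth: \( \text{Volume} = 2\sqrt{3} \, \frac{mnp}{\sqrt{M}} \)

Good latency: \( \#\,\text{Messages} = 3\sqrt{3} \, \frac{mnp}{M^{3/2}} \)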
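The volume and message counts above can be checked by counting transfers in the tile loop. The sketch below is a counting model only, under stated assumptions: C is m-by-n, A is m-by-p, B is p-by-n, the tile size b divides all three dimensions, each tile product fetches Aik and Bkj together as one message, and traffic for the Cij tiles is ignored as lower order. The function name and sizes are hypothetical.

```python
import math

def square_tile_traffic(m, n, p, b):
    """Count words and messages for the square-tile algorithm with b-by-b tiles."""
    words = 0
    messages = 0
    for i in range(m // b):
        for j in range(n // b):
            # Cij is kept in fast memory across the k loop (its traffic is ignored).
            for k in range(p // b):
                # Fetch Aik and Bkj (b*b words each), counted as one message.
                words += 2 * b * b
                messages += 1
    return words, messages

b = 8
M = 3 * b * b          # three square tiles fill fast memory
m = n = p = 64         # illustrative sizes
words, messages = square_tile_traffic(m, n, p, b)
predicted_words = 2 * math.sqrt(3) * m * n * p / math.sqrt(M)
predicted_messages = 3 * math.sqrt(3) * m * n * p / M**1.5
print(words, predicted_words)
print(messages, predicted_messages)
```

Under these assumptions the counted values agree with the two closed-form expressions up to floating-point rounding.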

slide-80
SLIDE 80

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION 40

Block matrix-matrix multiplication

See: the maximum re-use matrix-multiply algorithm of Dongarra, Pineau, Robert, Shi, and Vivien. “Revisiting Matrix Product on Master-Worker Platforms.” IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007.

See also the PUMMA / SUMMA parallel distributed algorithms.

slide-85
SLIDE 85

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION 41

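Block matrix-matrix multiplication

[Figure: C = A × B with blocks Cij, Aik, Bkj; the b-by-b block Cij is updated as Cij += Aik × Bkj.]

Block Cij fits in fast memory: \( b^2 \approx M \).

Better bandwidth: \( \text{Volume} = 2 \, \frac{mnp}{\sqrt{M}} \)

Horrible latency: \( \#\,\text{Messages} = \sqrt{M} \, \frac{mnp}{M^{3/2}} \)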
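The same kind of counting model reproduces the figures above for the maximum re-use scheme. The sketch assumes that C is m-by-n, A is m-by-p, B is p-by-n, that b divides m and n, that a b-by-b tile Cij stays resident while one column piece of A and one row piece of B (b words each) are streamed in per inner index as a single message, and that the small buffer for these pieces is ignored so that b² = M. Names and sizes are hypothetical.

```python
import math

def max_reuse_traffic(m, n, p, b):
    """Count words and messages when a b-by-b tile of C stays in fast memory."""
    words = 0
    messages = 0
    for i in range(m // b):
        for j in range(n // b):
            # Cij (b*b words) stays in fast memory for the whole inner loop.
            for k in range(p):
                # One column piece of A and one row piece of B, b words each.
                words += 2 * b
                messages += 1
    return words, messages

b = 8
M = b * b              # idealization: the C tile fills fast memory
m = n = p = 64         # illustrative sizes
words, messages = max_reuse_traffic(m, n, p, b)
predicted_words = 2 * m * n * p / math.sqrt(M)
predicted_messages = math.sqrt(M) * m * n * p / M**1.5
print(words, predicted_words)
print(messages, predicted_messages)
```

The volume drops by a factor of √3 relative to holding three square tiles, while the message count grows because each message carries only about 2√M words.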

slide-86
SLIDE 86

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION 42

Sequential Lower Bounds for Matrix-Matrix Multiplication

Consider any ordinary dense matrix-matrix multiplication algorithm for multiplying an n–by–n matrix by an n–by–n matrix on a computer with fast memory of size M. Then:

Upper bound :: square tile matrix-matrix multiplication

The number of words transferred between slow and fast memory is at most \( 2 \, \frac{n^3}{\sqrt{M}} \).

Lower Bound :: Lowery and Langou, 2014

The number of words transferred between slow and fast memory is at least \( 2 \, \frac{n^3}{\sqrt{M}} - 2M \).
slide-87
SLIDE 87

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION 43

\( \beta^{-1} = 10^{8} \) words/sec, \( \gamma^{-1} = 10^{10} \) flops/sec, \( M = 10^{6} \) words

[Figure: performance (GFlop/sec) versus matrix size (n); sequential case, ordinary dense matrix-matrix multiplication, square matrices; \( M_{\text{fast}} = 10^{6} \), \( \beta^{-1} = 10^{8} \), \( \gamma^{-1} = 10^{10} \).]

slide-88
SLIDE 88

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION 44

One core Intel Xeon Processor E5520 (nehalem)

\( \beta^{-1} = 580 \cdot 10^{6} \) words/sec, \( \gamma^{-1} = 10.12 \cdot 10^{9} \) flops/sec, \( M = 10^{6} \) words

[Figure: DGEMM on one core Intel Xeon Processor E5520 (nehalem); GFlops/sec versus matrix order.]

slide-89
SLIDE 89

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION 45

The time of an OD (ordinary dense) matrix-matrix multiplication is

\[ 2 \, \frac{\beta}{\sqrt{M}} \, n^3 + 2 \gamma n^3 \]

(1) assuming no overlap between communication and computation;
(2) with β being the time to move one unit of data (inverse of bandwidth) and γ being the time to perform one floating-point operation.
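The time expression above translates directly into a predicted flop rate. The Python sketch below evaluates it for the two parameter sets quoted on the preceding slides, counting 2n³ flops for the multiplication. The function names and the matrix sizes are illustrative assumptions, and the model is the simplified no-overlap expression only.

```python
import math

def od_matmul_time(n, M, beta, gamma):
    """Model time in seconds: 2*(beta/sqrt(M))*n^3 + 2*gamma*n^3."""
    return 2.0 * beta / math.sqrt(M) * n**3 + 2.0 * gamma * n**3

def od_matmul_gflops(n, M, beta, gamma):
    """Predicted rate in GFlop/sec, counting 2*n^3 flops."""
    return 2.0 * n**3 / od_matmul_time(n, M, beta, gamma) / 1e9

# beta and gamma are times, i.e. inverses of the quoted rates.
model = od_matmul_gflops(n=10_000, M=1e6, beta=1 / 1e8, gamma=1 / 1e10)
nehalem = od_matmul_gflops(n=5_000, M=1e6, beta=1 / 580e6, gamma=1 / 10.12e9)
print(f"model machine: {model:.2f} GFlop/sec")
print(f"Xeon E5520:    {nehalem:.2f} GFlop/sec")
```

In this simplified expression n³ cancels, so the predicted rate does not depend on n; any dependence on matrix size would have to come from lower-order terms such as the −2M in the bound.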

slide-90
SLIDE 90

APPLICATION TO PARALLEL DISTRIBUTED 46

Motivations
Communication Lower Bound for Sequential Matrix-Matrix multiplication
Application to Parallel Distributed

slide-91
SLIDE 91

APPLICATION TO PARALLEL DISTRIBUTED 47

Tianhe-2, China
Top 500 #1 (6/13, 11/13, 6/14 and 11/14)

P = 16000; ( number of nodes )
peakpernode = 3.431e+12; ( in flops/sec, per node )
mempernode = 64e+9; ( in bytes, per node )
nodebandwidth = 16e9 * 2; ( in bits / sec, from/to node )

16,000 nodes
one node = 2 Intel Xeon Ivy Bridge processors and 3 Xeon Phi co-processors
one node = 3.431 Tflop/sec/node
peak = 54.90 PFlop/sec
Bandwidth from/to nodes 16 Gb/sec bidirectional
HPL = 33.86 PFlop/sec, Top 500 #1

[Figure: Tianhe-2, P = 16,000 (nodes); performance (PFlop/sec) versus n (×10^6); curves: overlapping comm/comp, no overlapping comm/comp.]
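The machine parameters above can be cross-checked with a few lines of arithmetic. This Python sketch recomputes the aggregate peak from the per-node figure, the fraction of peak reached by HPL, and the node bandwidth in words per second. The 64-bit word size is an assumption made here for the conversion, and the variable names are illustrative.

```python
P = 16000                        # number of nodes
peak_per_node = 3.431e12         # flops/sec per node
hpl = 33.86e15                   # HPL result in flops/sec
node_bandwidth_bits = 16e9 * 2   # bits/sec from/to node (bidirectional)
WORD_BITS = 64                   # assumption: one word is a 64-bit double

peak = P * peak_per_node
hpl_fraction = hpl / peak
words_per_sec = node_bandwidth_bits / WORD_BITS

print(f"peak          = {peak / 1e15:.2f} PFlop/sec")
print(f"HPL / peak    = {hpl_fraction:.3f}")
print(f"node bandwidth = {words_per_sec:.1e} words/sec")
```

The recomputed peak matches the 54.90 PFlop/sec quoted above, and HPL reaches a little over 60% of it.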

slide-92
SLIDE 92

APPLICATION TO PARALLEL DISTRIBUTED 48

Tianhe-2, China
Top 500 #1 (6/13, 11/13, 6/14 and 11/14)

P = 16000; ( number of nodes )
peakpernode = 3.431e+12; ( in flops/sec, per node )
mempernode = 64e+9; ( in bytes, per node )
nodebandwidth = 16e9 * 2; ( in bits / sec, from/to node )

16,000 nodes
one node = 2 Intel Xeon Ivy Bridge processors and 3 Xeon Phi co-processors
one node = 3.431 Tflop/sec/node
peak = 54.90 PFlop/sec
Bandwidth from/to nodes 16 Gb/sec bidirectional
HPL = 33.86 PFlop/sec, Top 500 #1

looking locally on a node

[Figure: Tianhe-2, performance per node; performance (TFlop/sec/node) versus nloc (×10^4); curves: overlapping comm/comp, no overlapping comm/comp.]

slide-93
SLIDE 93

APPLICATION TO PARALLEL DISTRIBUTED 49

Tianhe-2, China
Top 500 #1 (6/13, 11/13, 6/14 and 11/14)

P = 16000; ( number of nodes )
peakpernode = 3.431e+12 3.009e+12; ( in flops/sec, per node )
mempernode = 64e+9; ( in bytes, per node )
nodebandwidth = 16e9 * 2; 8e9 ( in bits / sec, from/to node )

16,000 nodes
one node = 2 Intel Xeon Ivy Bridge processors and 3 Xeon Phi co-processors
one node = 3.431 Tflop/sec/node
peak = 54.90 PFlop/sec
Bandwidth from/to nodes 16 Gb/sec bidirectional
HPL = 33.86 PFlop/sec, Top 500 #1

looking locally on a node

[Figure: Tianhe-2, P = 16,000 (nodes); performance (PFlop/sec) versus n (×10^6).]

slide-94
SLIDE 94

APPLICATION TO PARALLEL DISTRIBUTED 50

K Computer, Japan
Top 500 #1 (6/11, 11/11), #2 (6/12), #3 (11/12), #4 (6/13, 11/13, 6/14, 11/14)

P = 88128; ( number of nodes )
peakpernode = 128e+9; ( in flops/sec, per node )
mempernode = 16e+9; ( in bytes, per node )
nodebandwidth = 5e+9 * 2 * 6; ( in bits / sec, from/to node )

[Figure: K computer, P = 88,128 (nodes); performance (PFlop/sec) versus n (×10^6); curves: overlapping comm/comp, no overlapping comm/comp.]

slide-95
SLIDE 95

APPLICATION TO PARALLEL DISTRIBUTED 51

Motivations
Communication Lower Bound for Sequential Matrix-Matrix multiplication
Application to Parallel Distributed

\[ 2 \, \frac{\beta}{\sqrt{M}} \, n^3 + 2 \gamma n^3 \]