Communication Lower Bounds for Matrix-Matrix Multiplication
Dagstuhl Seminar #15281 July 6-9, 2015
Julien Langou
MOTIVATIONS 2
Getting up to speed: The Future of Supercomputing, Eds. Susan L. Graham, Marc Snir, and Cynthia A. Patterson, National Research Council, 227 pages, 2004.
FIGURE 5.3 Arithmetic performance (Mflops), memory bandwidth (Mword/sec), and DRAM chip bandwidth (Mword/sec) per calendar year, Jan 88 to Jan 02, log scale.
FIGURE 5.4 Decrease in memory latency (DRAM row access time, in nanoseconds) per calendar year, Jan 88 to Jan 02, log scale.
http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
Source: Jim Demmel, John Shalf
[Figure: log-scale chart (1 to 10000) comparing 45 nm and 11 nm (2018) technology; one entry is labeled "No Change".]
FLOPs almost free; data movement cost is dominant.
Minimizing the amount of data movement is increasingly critical.
COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION 8
$\beta^{-1} = 580 \cdot 10^{6}$ words/sec, $\gamma^{-1} = 10.12 \cdot 10^{9}$ flops/sec, $M = 10^{6}$ words
[Figure: DGEMM on one core, Intel Xeon Processor E5520 (Nehalem); GFlops/sec (1 to 10) versus matrix order (500 to 5000).]
dense matrix-matrix multiplication
sequential: two levels of memory
communication cost: time, energy, etc.
[Figure: Intel Xeon Processor E5520 (Nehalem) as a two-level memory: CPU at 10.12 GFLOP/sec/core, cache (8 MB), main memory (16 GB), 25.6 GB/sec.]
A sequence of the following instructions defines an algorithm:
Read $a_{11}$
Read $b_{11}$
Create $c_{111} = a_{11} b_{11}$
Read $a_{12}$
Read $b_{21}$
Create $c_{112} = a_{12} b_{21}$
Write $c_{11}$
Delete $c_{11}$, $a_{11}$, $b_{11}$
. . .
Split the instructions into segments so exactly M reads and writes occur in each segment.
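The segmentation step can be sketched in a few lines. This is an illustration only: the `(op, operand)` tuple encoding and the function name `split_into_segments` are assumptions made here, not notation from the talk.

```python
def split_into_segments(instructions, M):
    """Cut an instruction list so each full segment has exactly M reads/writes.

    Each instruction is an (op, operand) pair with op in
    {"read", "write", "create", "delete"}. Only reads and writes count;
    a segment closes right after its M-th read or write.
    """
    segments, current, io_count = [], [], 0
    for op, operand in instructions:
        current.append((op, operand))
        if op in ("read", "write"):
            io_count += 1
            if io_count == M:
                segments.append(current)
                current, io_count = [], 0
    if current:
        segments.append(current)  # last segment may hold fewer than M reads/writes
    return segments


program = [
    ("read", "a11"), ("read", "b11"), ("create", "c111"),
    ("read", "a12"), ("read", "b21"), ("create", "c112"),
    ("write", "c11"), ("delete", "c11"),
]
for segment in split_into_segments(program, M=2):
    print(segment)
```

Creates and deletes ride along with whichever segment they fall in; only the reads and writes determine where a segment ends.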
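We want to compute the $n^3$ products $c_{ijk} = a_{ik} b_{kj}$. The computation of $c_{ijk}$ requires $a_{ik}$ and $b_{kj}$ to be in cache.
[Figure: cube with axes i, j, k; multiplication (i, j, k) performs $c_{ij} = c_{ij} + a_{ik} b_{kj}$ and is labeled with $a_{ik}$, $b_{kj}$, $c_{ij}$.]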
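we either have to have $a_{ik}$ in cache at the start of the segment ($M_a$) or we have to read it during the segment ($R_a$)
we either have to have $b_{kj}$ in cache at the start of the segment ($M_b$) or we have to read it during the segment ($R_b$)
we either have to have a $c_{ij}^{*}$ in cache at the end of the segment ($N_c$) or we have to write it during the segment ($W_c$)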
Split the instructions into segments so exactly M reads and writes occur in each segment.
Maximize the number of scalar multiplications in one segment.
Deletes are free.
$M_a$ = number of A elements in cache at the start of the segment
$N_a$ = number of A elements in cache at the end of the segment
Constraint 1: Total number of reads and writes.
Constraint 2: Total number of elements at start of segment.
Constraint 3: Total number of elements at end of segment.
Constraint 4: Nonnegative.
Note that we do not prevent a memory usage greater than M during a segment.
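The four constraints can be written out as a sketch. The notation is an assumption: $R_x$ and $W_x$ are taken to be the number of elements of matrix $x$ read and written during the segment, and $M_x$ and $N_x$ the number held in fast memory at its start and end, for $x \in \{a, b, c\}$.

```latex
\begin{align*}
R_a + R_b + R_c + W_a + W_b + W_c &= M   && \text{(Constraint 1)} \\
M_a + M_b + M_c &\le M                    && \text{(Constraint 2)} \\
N_a + N_b + N_c &\le M                    && \text{(Constraint 3)} \\
M_x,\ N_x,\ R_x,\ W_x &\ge 0, \quad x \in \{a, b, c\} && \text{(Constraint 4)}
\end{align*}
```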
Loomis-Whitney inequality.
$M_a + R_a$: Maximum number of elements of A in fast memory.
$M_b + R_b$: Maximum number of elements of B in fast memory.
$N_c + W_c$: Maximum number of elements of C in fast memory.
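As a sanity check on how the inequality is used here, the sketch below verifies on random index sets that the squared number of multiplications (i, j, k) is at most the product of the sizes of the three projections. The function name `loomis_whitney_gap` and the test sizes are illustrative assumptions, not part of the talk.

```python
import itertools
import random


def loomis_whitney_gap(points):
    """Return (|V|^2, |A| * |B| * |C|) for a set V of (i, j, k) triples.

    A, B, C are the projections onto the (i, k), (k, j), and (i, j) index
    pairs, i.e. the entries of A, B, and C touched by the multiplications.
    The discrete Loomis-Whitney inequality states |V|^2 <= |A| * |B| * |C|.
    """
    proj_a = {(i, k) for i, j, k in points}
    proj_b = {(k, j) for i, j, k in points}
    proj_c = {(i, j) for i, j, k in points}
    return len(points) ** 2, len(proj_a) * len(proj_b) * len(proj_c)


random.seed(0)
cube = list(itertools.product(range(4), repeat=3))
for size in (1, 5, 20, 64):
    lhs, rhs = loomis_whitney_gap(set(random.sample(cube, size)))
    assert lhs <= rhs  # integer comparison, no rounding involved
```

A full cube of multiplications attains equality, which is why cube-shaped blocks of work are the best case for a segment.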
$R_c$, $W_a$, $W_b$, $M_c$, $N_a$, and $N_b$ do not appear in the objective function.
Set each to zero since nonzero values will only reduce the objective.
Each variable is bounded by M. Therefore,
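A sketch of the step that plausibly follows. It assumes the objective is the Loomis-Whitney bound on the number of multiplications in one segment, written $F$ here (a symbol introduced for illustration), and uses that each of $M_a$, $R_a$, $M_b$, $R_b$, $N_c$, $W_c$ is at most $M$:

```latex
\[
F \;\le\; \sqrt{(M_a + R_a)(M_b + R_b)(N_c + W_c)}
  \;\le\; \sqrt{(2M)(2M)(2M)}
  \;=\; 2\sqrt{2}\, M^{3/2}.
\]
```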
We got that
since we need to perform $n^3$ multiplications in all,
this gives a lower bound for the number of words transferred as
and so we lower bound with (no one seems to like the floor function)
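The chain of bounds can be sketched as follows, under the assumption that one segment performs at most $2\sqrt{2}\,M^{3/2}$ multiplications. That assumption is consistent with the constant of about 0.35 quoted later, since $1/(2\sqrt{2}) \approx 0.354$. Each complete segment moves exactly $M$ words.

```latex
\begin{align*}
\#\text{words}
  &\ge M \left\lfloor \frac{n^3}{2\sqrt{2}\, M^{3/2}} \right\rfloor \\
  &\ge M \left( \frac{n^3}{2\sqrt{2}\, M^{3/2}} - 1 \right)
   = \frac{1}{2\sqrt{2}} \, \frac{n^3}{\sqrt{M}} - M .
\end{align*}
```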
An upper bound on the number of scalar multiplications in one segment is
The minimum number of reads and writes is
α = 2 maximizes the constant.
A lower bound for the volume of words transferred is
Increased constant from about 0.35 to about 0.38. Yeah!
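The value α = 2 and the constant 0.38 can be reproduced numerically under one assumption about the formulas missing from this slide: a segment has $\alpha M$ reads and writes, so each matrix has at most $(1 + \alpha) M$ elements available and the constant in front of $n^3 / \sqrt{M}$ is $\alpha / (1 + \alpha)^{3/2}$. The function name `constant_per_variable_bound` is hypothetical.

```python
def constant_per_variable_bound(alpha):
    """Constant c in words >= c * n^3 / sqrt(M), ignoring lower-order terms.

    Assumed model: a segment has alpha * M reads and writes, each of the
    three Loomis-Whitney factors is at most (1 + alpha) * M, so a segment
    performs at most ((1 + alpha) * M) ** 1.5 multiplications.
    """
    return alpha / (1.0 + alpha) ** 1.5


# Coarse grid search over alpha in (0, 10]; the grid step is 0.01.
grid = [k / 100 for k in range(1, 1001)]
best_alpha = max(grid, key=constant_per_variable_bound)

print(best_alpha, round(constant_per_variable_bound(best_alpha), 2))
print(round(constant_per_variable_bound(1.0), 2))  # alpha = 1 is the earlier setting
```

Under this model, longer segments amortize the elements already in fast memory, but past α = 2 the growth of the per-segment multiplication bound wins.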
Solving exactly, the maximum number of scalar multiplications in one segment is
A lower bound for the volume of words transferred is
Increased constant from about 0.38 to about 1. Yeah!
A lower bound for the number of words transferred is
α = 4 maximizes the constant.
A lower bound for the volume of words transferred is
Increased constant from about 1 to about 1.41. Yeah!
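Both the maximizer and the constant can be recovered under an assumed form of the bound: segments of $\alpha M$ reads and writes, with the three Loomis-Whitney factors summing to at most $(2 + \alpha) M$, so that a segment performs at most $\left((2+\alpha)M/3\right)^{3/2}$ multiplications. The function $c(\alpha)$ is a name introduced here for the resulting constant.

```latex
\begin{align*}
c(\alpha) &= \alpha \left( \frac{3}{2 + \alpha} \right)^{3/2}, \\
\frac{d}{d\alpha} \ln c(\alpha)
  &= \frac{1}{\alpha} - \frac{3}{2}\,\frac{1}{2 + \alpha} = 0
  \quad\Longrightarrow\quad \alpha = 4, \\
c(4) &= 4 \left( \frac{1}{2} \right)^{3/2} = \sqrt{2} \approx 1.41 .
\end{align*}
```

With $\alpha = 1$ the same expression gives $c(1) = 1$, matching the previous constant.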
Fast memory is half A and half B at start. Read 1/2 A and 1/2 B, and write 0 C. Fast memory is full with C at end.
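This allocation can be checked by brute force on a small instance. The sketch assumes the simplified problem: maximize $(M_a + R_a)(M_b + R_b)(N_c + W_c)$ subject to $M_a + M_b \le M$, $R_a + R_b + W_c = M$, and $N_c \le M$, with the remaining variables set to zero. The function name and the choice M = 12 are illustrative.

```python
def best_segment_product(M):
    """Exhaustively maximize (Ma+Ra)(Mb+Rb)(Nc+Wc) over integer allocations."""
    best, argbest = -1, None
    for Ma in range(M + 1):
        for Mb in range(M + 1 - Ma):              # Ma + Mb <= M
            for Ra in range(M + 1):
                for Rb in range(M + 1 - Ra):
                    Wc = M - Ra - Rb              # exactly M reads and writes
                    for Nc in range(M + 1):       # Nc <= M
                        value = (Ma + Ra) * (Mb + Rb) * (Nc + Wc)
                        if value > best:
                            best, argbest = value, (Ma, Mb, Ra, Rb, Wc, Nc)
    return best, argbest


M = 12
best, _ = best_segment_product(M)
half = M // 2
# Allocation from the slide: half A and half B at start, read half A and
# half B, write no C, fast memory full of C at the end.
slide_value = (half + half) * (half + half) * (M + 0)
assert best == slide_value == M ** 3
```

The maximizer is not unique (other splits between resident and read elements reach the same product), but none does better than $M^3$, whose square root is the $M^{3/2}$ multiplications per segment.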
[2] Irony, D., Toledo, S., and Tiskin, A. (2004). “Communication lower bounds for distributed-memory matrix multiplication.” J. Parallel Distrib. Comput., 64(9):1017-1026.
[3] Dongarra, Pineau, Robert, Shi, and Vivien. (2007). “Revisiting Matrix Product on Master-Worker Platforms.” IEEE International Parallel and Distributed Processing Symposium.
$\frac{3\sqrt{3}}{2\sqrt{2}} \sim 1.83$
Solving exactly, the maximum number of scalar multiplications in one segment is
A lower bound for the volume of words transferred is
$\frac{3\sqrt{3}}{2\sqrt{2}} \sim 1.83$.
the sequential model (see Lowery and L.) in parallel distributed for example in 2D distribution. The condition in
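[Equations: optimization problem over several consecutive segments; the surviving fragments show products involving $M_b^{(0)}$ and $M_c^{(1)}$ for the first segment and $M_b^{(1)}$ and $M_c^{(2)}$ for the second, with subscripts a, b, c repeated for further segments.]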
Used GAMS global optimization solver lindoglobal.
Maximized the lower bound when (s = 2, α = 2). Lower bound on volume of message transferred is
(s = 3, α = 2) gives 1.65. (s = 4, α = 2) gives 1.73.
[Figure: blocked matrices Cij, Aik, Bkj; a b-by-b block Cij is updated with Aik × Bkj.]
Three square blocks fit in fast memory: $3b^2 = M$.
Good bandwidth: Volume = 2
Good latency: # Messages = 3
[Figure: matrices Cij, Aik, Bkj, with a b-by-b block Cij.]
Block Cij fits in fast memory: $b^2 \approx M$.
Better bandwidth: Volume = 2mnp
Horrible latency: # Messages =
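The trade-off between the two blockings can be made concrete by counting traffic in a loop. The schedule below is an assumption, since the volume and message formulas are cut off in the slides: in the three-block scheme A and B arrive as b-by-b blocks, while in the resident-C scheme they arrive as slivers of length b, one index k at a time. The function name `count_traffic` and the sizes are illustrative.

```python
def count_traffic(m, n, p, b, panel):
    """Count words and messages for C (m x n) = A (m x p) * B (p x n), tiled by b.

    panel = b : A and B arrive as b x b blocks (three blocks in fast memory).
    panel = 1 : A and B arrive as slivers of length b (C block stays resident).
    Assumes b divides m and n, and panel divides p.
    """
    words = messages = 0
    for i in range(0, m, b):
        for j in range(0, n, b):
            for k in range(0, p, panel):
                words += 2 * b * panel      # one piece of A and one piece of B
                messages += 2
            words += b * b                  # write the finished C block once
            messages += 1
    return words, messages


m = n = p = 48
M = 48                                      # fast memory size in words
three_block = count_traffic(m, n, p, b=4, panel=4)   # 3 * 4**2 == M
resident_c = count_traffic(m, n, p, b=6, panel=1)    # 6**2 + 2 * 6 == M
print("three blocks:", three_block)
print("resident C  :", resident_c)
```

In both cases the reads total 2mnp/b; the resident-C scheme wins on volume only because it affords a larger b, and it pays for that with many more, much smaller messages.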
[Figure: sequential case, ordinary dense matrix-matrix multiplication, square matrices; performance (GFlop/sec, 1 to 10) versus matrix size n (5000 to 35000), with $M_{fast} = 10^{6}$, $\beta^{-1} = 10^{8}$, $\gamma^{-1} = 10^{10}$.]
$\beta^{-1} = 580 \cdot 10^{6}$ words/sec, $\gamma^{-1} = 10.12 \cdot 10^{9}$ flops/sec, $M = 10^{6}$ words
[Figure: DGEMM on one core, Intel Xeon Processor E5520 (Nehalem); GFlops/sec (1 to 10) versus matrix order (500 to 5000).]
(1) assuming no overlap between communication and computations; (2) with β being the time to move one unit of data (inverse of bandwidth) and γ being the time to perform one floating-point operation.
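These two assumptions amount to adding a computation time and a communication time. A minimal sketch of such a model follows; the function name `predicted_gflops` is hypothetical, and the volume plugged in for `words` (here $2 n^3 / \sqrt{M}$) is an assumption chosen for illustration, not necessarily the bound used for the plots.

```python
def predicted_gflops(flops, words, beta, gamma):
    """Performance (GFlop/sec) when communication and computation do not overlap.

    beta  : seconds to move one word (inverse bandwidth)
    gamma : seconds per floating-point operation
    """
    seconds = gamma * flops + beta * words
    return flops / seconds / 1e9


# Machine parameters quoted in the slides for one Nehalem core.
beta = 1.0 / 580e6      # seconds per word
gamma = 1.0 / 10.12e9   # seconds per flop
M = 10**6               # fast memory size in words

n = 4000
flops = 2 * n**3
words = 2 * n**3 / M**0.5   # ASSUMPTION: volume 2 n^3 / sqrt(M); swap in any bound
print(predicted_gflops(flops, words, beta, gamma))
```

Any other volume estimate can be passed as `words`; with `words = 0` the model returns the peak rate $\gamma^{-1}$.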
APPLICATION TO PARALLEL DISTRIBUTED 46
P = 16000;                  ( number of nodes )
peakpernode = 3.431e+12;    ( in flops/sec, per node )
mempernode = 64e+9;         ( in bytes, per node )
nodebandwidth = 16e9 * 2;   ( in bits / sec, from/to node )
16,000 nodes
and 3 Xeon Phi co-processors
peak = 54.90 PFlop/sec
Bandwidth from/to nodes 16 Gb/sec bidirectional
HPL = 33.86 PFlop/sec
Top 500 #1
[Figure: Tianhe-2, P = 16,000 nodes; performance (PFlop/sec, 5 to 50) versus n (×10^6, 1 to 12).]
no overlapping comm/comp
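The machine parameters can be cross-checked against the quoted aggregate figures with a few lines of arithmetic. The variable names mirror the parameter listing; the 64-bit word size is an assumption made here to convert bits/sec into words/sec.

```python
P = 16000                   # number of nodes
peak_per_node = 3.431e12    # flops/sec, per node
mem_per_node = 64e9         # bytes, per node
node_bandwidth = 16e9 * 2   # bits/sec, from/to node

BITS_PER_WORD = 64          # ASSUMPTION: one word is a double-precision number

peak_pflops = P * peak_per_node / 1e15
words_per_sec = node_bandwidth / BITS_PER_WORD
hpl_efficiency = 33.86 / peak_pflops        # HPL result quoted in the slides

print(f"aggregate peak: {peak_pflops:.2f} PFlop/sec")
print(f"node bandwidth: {words_per_sec:.3g} words/sec")
print(f"HPL efficiency: {hpl_efficiency:.1%}")
```

The product of node count and per-node peak reproduces the quoted 54.90 PFlop/sec.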
looking locally on a node
[Figure: Tianhe-2, performance per node (TFlop/sec/node, 0.5 to 3) versus nloc (×10^4, 1 to 9).]
no overlapping comm/comp
peakpernode changed from 3.431e+12 to 3.009e+12; ( in flops/sec, per node )
nodebandwidth changed from 16e9 * 2 to 8e9; ( in bits / sec, from/to node )
[Figure: Tianhe-2, P = 16,000 nodes; performance (PFlop/sec, 5 to 50) versus n (×10^6, 1 to 12).]
P = 88128;                      ( number of nodes )
peakpernode = 128e+9;           ( in flops/sec, per node )
mempernode = 16e+9;             ( in bytes, per node )
nodebandwidth = 5e+9 * 2 * 6;   ( in bits / sec, from/to node )
[Figure: K computer, P = 88,128 nodes; performance (PFlop/sec, 2 to 10) versus n (×10^6, 1 to 14).]
no overlapping comm/comp