NUMA-aware Matrix-Matrix-Multiplication
Max Reimann, Philipp Otto
About this talk
Objective: Show how to improve the performance of algorithms on a NUMA system, using matrix-matrix multiplication (MMM) as an example.
Code was written in C with numa.h and pthread.h.
Image source: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png
Runtime in seconds, dl980 on one core:
N      Naive    MKL
512    0.38     0.02
1024   11.79    0.13
2048   98.14    1.02
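The "Naive" series refers to the plain triple-loop multiplication. A minimal sketch of such a baseline, assuming square row-major matrices of `float` (the element type, the layout, and the name `mmm_naive` are assumptions, not taken from the talk):

```c
#include <assert.h>
#include <stddef.h>

/* Naive triple loop: c = a * b for square n x n row-major matrices. */
void mmm_naive(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j]; /* b is read with stride n */
            c[i * n + j] = sum;
        }
    }
}
```

The strided access to `b` in the inner loop is what the later tiling and transposition steps try to avoid.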
– Accept the overhead for remote memory access, or
– Copy the input/output matrices to the other nodes (preprocessing)
Runtime in seconds, dl980 on 128 cores:
N      Naive Sequential   Naive Parallel   MKL Parallel
512    0.38               0.05             0.19
1024   11.79              0.26             0.27
2048   98.14              2.54             0.28
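One common way to parallelize the naive algorithm with pthread.h is to give each thread a disjoint block of result rows. The sketch below assumes that scheme; the talk does not say how its "Naive Parallel" variant splits the work, and the names `mmm_task`, `mmm_worker`, `mmm_parallel`, and `MMM_MAX_THREADS` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { MMM_MAX_THREADS = 128 };  /* hypothetical upper bound */

/* Work description for one thread: rows [row_begin, row_end) of c. */
struct mmm_task {
    const float *a, *b;
    float *c;
    size_t n, row_begin, row_end;
};

static void *mmm_worker(void *arg)
{
    const struct mmm_task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++)
        for (size_t j = 0; j < t->n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < t->n; k++)
                sum += t->a[i * t->n + k] * t->b[k * t->n + j];
            t->c[i * t->n + j] = sum;  /* row blocks are disjoint: no locking */
        }
    return NULL;
}

/* c = a * b using nthreads threads. Returns 0 on success, -1 otherwise. */
int mmm_parallel(const float *a, const float *b, float *c,
                 size_t n, size_t nthreads)
{
    assert(nthreads >= 1 && nthreads <= MMM_MAX_THREADS);
    pthread_t tid[MMM_MAX_THREADS];
    struct mmm_task task[MMM_MAX_THREADS];
    size_t started = 0;
    int rc = 0;

    for (size_t t = 0; t < nthreads; t++) {
        task[t] = (struct mmm_task){
            .a = a, .b = b, .c = c, .n = n,
            .row_begin = n * t / nthreads,
            .row_end = n * (t + 1) / nthreads,
        };
        if (pthread_create(&tid[t], NULL, mmm_worker, &task[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (size_t t = 0; t < started; t++)  /* join only threads that started */
        pthread_join(tid[t], NULL);
    return rc;
}
```

Because every thread writes only its own rows of `c`, no synchronization beyond the final join is needed.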
– For computing the i-th summand, only the i-th row of matrixA / column of matrixB is needed
– This makes it possible to copy only the needed parts to the other nodes
– matrixB has to be transposed so that the memory can be partitioned (preprocessing)
– Locking or merging of matrixC is needed
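The points above can be sketched as follows: the summation index is split into ranges, each worker accumulates its range into a private partial result, and the partial results are merged at the end. This is an illustration under assumptions: it uses the usual row-major convention, in which summand k combines column k of `a` with row k of `b` (the row/column wording and the transposition step in the talk may reflect a different layout), it runs the workers sequentially instead of on separate nodes, and the names `partial_sum` and `mmm_parallel_sum` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* partial += sum over k in [k_begin, k_end) of (column k of a) x (row k of b). */
static void partial_sum(const float *a, const float *b, float *partial,
                        size_t n, size_t k_begin, size_t k_end)
{
    for (size_t k = k_begin; k < k_end; k++)
        for (size_t i = 0; i < n; i++) {
            float a_ik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                partial[i * n + j] += a_ik * b[k * n + j];
        }
}

/* c = a * b as a sum of partial results. Returns 0 on success, -1 on
 * allocation failure. */
int mmm_parallel_sum(const float *a, const float *b, float *c,
                     size_t n, size_t workers)
{
    assert(n >= 1 && workers >= 1);
    float *partial = calloc(workers * n * n, sizeof *partial);
    if (!partial)
        return -1;

    /* Each iteration is independent and stands in for one worker; it only
     * touches its own slice of the summation index and its own buffer. */
    for (size_t w = 0; w < workers; w++)
        partial_sum(a, b, partial + w * n * n, n,
                    n * w / workers, n * (w + 1) / workers);

    /* Merge step: add up the private partial results into c. */
    for (size_t x = 0; x < n * n; x++) {
        float sum = 0.0f;
        for (size_t w = 0; w < workers; w++)
            sum += partial[w * n * n + x];
        c[x] = sum;
    }
    free(partial);
    return 0;
}
```

Merging private buffers avoids locking, at the cost of one extra n x n buffer per worker.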
Runtime in seconds, dl980 on 128 cores:
N      Parallel sum   Naive Parallel   MKL Parallel
512    1.59           0.27             0.19
1024   2.81           1.41             0.27
2048   3.34           2.94             0.28
4096   14.91          17.24            0.43
8192   218.84         186.39           2.41
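\begin{align*}
N_1 &:= (B_{1,1} + B_{2,2}) \cdot (C_{1,1} + C_{2,2}) \\
N_2 &:= (B_{2,1} + B_{2,2}) \cdot C_{1,1} \\
N_3 &:= B_{1,1} \cdot (C_{1,2} - C_{2,2}) \\
N_4 &:= B_{2,2} \cdot (C_{2,1} - C_{1,1}) \\
N_5 &:= (B_{1,1} + B_{1,2}) \cdot C_{2,2} \\
N_6 &:= (B_{2,1} - B_{1,1}) \cdot (C_{1,1} + C_{1,2}) \\
N_7 &:= (B_{1,2} - B_{2,2}) \cdot (C_{2,1} + C_{2,2})
\end{align*}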
Only 7 multiplications!
Substituting the N_i by their terms gives back the original formula:
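For reference, the standard Strassen recombination in the notation used here, with the result matrix called A (that name is an assumption; B and C follow the definitions of N_1 to N_7):

\begin{align*}
A_{1,1} &= N_1 + N_4 - N_5 + N_7 \\
A_{1,2} &= N_3 + N_5 \\
A_{2,1} &= N_2 + N_4 \\
A_{2,2} &= N_1 - N_2 + N_3 + N_6
\end{align*}

As a spot check, expanding the second line gives N_3 + N_5 = B_{1,1} C_{1,2} - B_{1,1} C_{2,2} + B_{1,1} C_{2,2} + B_{1,2} C_{2,2} = B_{1,1} C_{1,2} + B_{1,2} C_{2,2}, which is the usual entry of the product B \cdot C.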
Runtime in seconds by N-dimension, dl980 on 1 core (strassen: BREAK = 64):
N      Naive    Strassen   MKL
32     0.00     0.00       0.00
64     0.00     0.00       0.00
128    0.01     0.00       0.00
256    0.05     0.02       0.00
512    0.38     0.12       0.02
1024   11.79    0.87       0.13
2048   98.14    6.12       1.02
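The role of BREAK can be illustrated with a recursive sketch: above the cutoff the matrix is split into quadrants and the seven products N_1 to N_7 are computed recursively, at or below it the naive loop takes over. This is a minimal illustration, not the talk's implementation. It assumes `float` elements, row-major storage, and n being a power of two; the result name `a` and the helpers `naive_mul`, `get_quad`, and `add_scaled` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifndef BREAK
#define BREAK 64  /* at or below this size, fall back to the naive loop */
#endif

static void naive_mul(const float *b, const float *c, float *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += b[i * n + k] * c[k * n + j];
            a[i * n + j] = sum;
        }
}

/* Copy quadrant (qr, qc) of the n x n matrix m into the h x h buffer out. */
static void get_quad(const float *m, size_t n, size_t qr, size_t qc, float *out)
{
    size_t h = n / 2;
    for (size_t i = 0; i < h; i++)
        for (size_t j = 0; j < h; j++)
            out[i * h + j] = m[(qr * h + i) * n + qc * h + j];
}

/* out = x + sign * y, element-wise over len entries. */
static void add_scaled(const float *x, const float *y, float sign,
                       float *out, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[i] = x[i] + sign * y[i];
}

/* a = b * c. Returns 0 on success, nonzero on allocation failure. */
int strassen(const float *b, const float *c, float *a, size_t n)
{
    if (n <= BREAK) {
        naive_mul(b, c, a, n);
        return 0;
    }
    assert(n % 2 == 0);
    size_t h = n / 2, len = h * h;
    float *buf = malloc(17 * len * sizeof *buf);  /* 8 quadrants, 7 N_i, 2 temps */
    if (!buf)
        return -1;
    float *p[17];
    for (size_t i = 0; i < 17; i++)
        p[i] = buf + i * len;
    float *b11 = p[0], *b12 = p[1], *b21 = p[2], *b22 = p[3];
    float *c11 = p[4], *c12 = p[5], *c21 = p[6], *c22 = p[7];
    float *n1 = p[8], *n2 = p[9], *n3 = p[10], *n4 = p[11];
    float *n5 = p[12], *n6 = p[13], *n7 = p[14], *t1 = p[15], *t2 = p[16];

    get_quad(b, n, 0, 0, b11); get_quad(b, n, 0, 1, b12);
    get_quad(b, n, 1, 0, b21); get_quad(b, n, 1, 1, b22);
    get_quad(c, n, 0, 0, c11); get_quad(c, n, 0, 1, c12);
    get_quad(c, n, 1, 0, c21); get_quad(c, n, 1, 1, c22);

    int rc = 0;  /* the seven recursive multiplications */
    add_scaled(b11, b22, 1, t1, len); add_scaled(c11, c22, 1, t2, len);
    rc |= strassen(t1, t2, n1, h);
    add_scaled(b21, b22, 1, t1, len);
    rc |= strassen(t1, c11, n2, h);
    add_scaled(c12, c22, -1, t2, len);
    rc |= strassen(b11, t2, n3, h);
    add_scaled(c21, c11, -1, t2, len);
    rc |= strassen(b22, t2, n4, h);
    add_scaled(b11, b12, 1, t1, len);
    rc |= strassen(t1, c22, n5, h);
    add_scaled(b21, b11, -1, t1, len); add_scaled(c11, c12, 1, t2, len);
    rc |= strassen(t1, t2, n6, h);
    add_scaled(b12, b22, -1, t1, len); add_scaled(c21, c22, 1, t2, len);
    rc |= strassen(t1, t2, n7, h);

    for (size_t i = 0; i < h; i++)      /* recombine into the four quadrants */
        for (size_t j = 0; j < h; j++) {
            size_t x = i * h + j;
            a[i * n + j]           = n1[x] + n4[x] - n5[x] + n7[x];
            a[i * n + j + h]       = n3[x] + n5[x];
            a[(i + h) * n + j]     = n2[x] + n4[x];
            a[(i + h) * n + j + h] = n1[x] - n2[x] + n3[x] + n6[x];
        }
    free(buf);
    return rc;
}
```

Copying quadrants into contiguous buffers keeps the sketch short; an implementation tuned for speed would more likely work on strided views to avoid the copies.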
Runtime in seconds by N-dimension, dl980 on 49 cores:
N      Naive     Strassen   MKL
512    0.05      0.05       0.19
1024   0.26      0.14       0.27
2048   2.54      0.49       0.28
4096   27.61     2.06       0.44
8192   228.57    13.53      1.84
Parallel naive on ubuntu-numa0101 on 24 cores:
N      Distributed memory and threads   Neither distributed   Distributed threads   Distributed memory
1024   0.34                             0.35                  0.35                  0.37
2048   11.39                            18.34                 21.96                 14.33
4096   101.12                           182.45                204.85                143.44
[Figure: stacked bars for memory copy, multiplication, and result combination; categories 6, 7, 8, and distributed; data labels 22.147083, 19.611477, 21.316545, 14.671332. Dimension: 16384, dl980 on 128 cores.]
perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048
Time in s, dl980 on 128 cores:
Not Tiled, not Transposed   97
Not Tiled, Transposed       39
Tiled, not Transposed       13
Tiled, Transposed           12
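A minimal sketch of the "Tiled, Transposed" combination: the second matrix is transposed once so that both operands are read contiguously, and the loops walk over square tiles so that a tile's data can stay in cache while it is reused. The tile size `TILE`, the `float` element type, and the names `transpose` and `mmm_tiled_transposed` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 32 };  /* hypothetical tile size; tune to the cache */

/* bt = transpose of b (one-off preprocessing). */
void transpose(const float *b, float *bt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* c = a * b, where bt holds b transposed. Works tile by tile. */
void mmm_tiled_transposed(const float *a, const float *bt, float *c, size_t n)
{
    assert(a && bt && c);
    for (size_t x = 0; x < n * n; x++)
        c[x] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t kk = 0; kk < n; kk += TILE) {
                size_t i_end = min_size(ii + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t j = jj; j < j_end; j++) {
                        float sum = c[i * n + j];
                        /* both a and bt are read with stride 1 here */
                        for (size_t k = kk; k < k_end; k++)
                            sum += a[i * n + k] * bt[j * n + k];
                        c[i * n + j] = sum;
                    }
            }
}
```

The `min_size` bounds let the sketch handle sizes that are not a multiple of `TILE`.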
Example Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Contiguous memory can be loaded in one instruction. Non-contiguous memory (marked X) can't be loaded in one instr.
[Figure: each of A11, A12, A13, A14 is combined with a row of four elements (C11 C12 C13 C14). Add up results.]
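One way to read this scheme: a single element of the left matrix is broadcast into all four lanes of an SSE register, multiplied with four contiguous elements of a row of the right matrix, and the products are added up over k. The sketch below assumes that reading, `float` elements, row-major storage, and n being a multiple of 4. It uses intrinsics from `xmmintrin.h` and falls back to scalar code when SSE is not available; the name `mmm_sse` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* c = a * b for n x n row-major float matrices; n must be a multiple of 4. */
void mmm_sse(const float *a, const float *b, float *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j += 4) {
#if defined(__SSE__)
            __m128 acc = _mm_setzero_ps();
            for (size_t k = 0; k < n; k++) {
                __m128 a_ik = _mm_set1_ps(a[i * n + k]);      /* broadcast scalar */
                __m128 b_row = _mm_loadu_ps(&b[k * n + j]);   /* 4 contiguous floats */
                acc = _mm_add_ps(acc, _mm_mul_ps(a_ik, b_row)); /* add up results */
            }
            _mm_storeu_ps(&c[i * n + j], acc);
#else
            /* Scalar fallback with the same access pattern. */
            float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
            for (size_t k = 0; k < n; k++)
                for (size_t l = 0; l < 4; l++)
                    acc[l] += a[i * n + k] * b[k * n + j + l];
            for (size_t l = 0; l < 4; l++)
                c[i * n + j + l] = acc[l];
#endif
        }
}
```

The unaligned load and store (`_mm_loadu_ps`, `_mm_storeu_ps`) keep the sketch independent of how the matrices were allocated; with 16-byte aligned data the aligned variants could be used instead.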
Runtime in seconds by N-dimension, dl980 on 1 core:
            1024    2048    4096
naiveSSE    0.27    2       20
tiledSSE    0.48    5       41
tiled       2       24      213
naive       11      97      879
[Figure: L1 cache misses and dTLB misses for naiveSSE and tiledSSE; axis from 1,000,000,000 to 6,000,000,000.]
[Figure: runtime in seconds (0.00 to 0.25) for N = 64, 128, 256, 512; series naiveSSE, tiled, strassen, MKL; dl980 on 128 cores.]
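[Figure: runtime in seconds (axis 1 to 10) for N = 1024, 2048, 4096, 8192; series naiveSSE, tiled, strassenSSE, MKL; dl980 on 128 cores. Data labels: 0.79, 7.29, 3.90, 0.17, 0.34, 1.20, 5.09, 0.20, 0.39, 0.53, 1.94; one bar exceeds the axis and is labeled 28.3.]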