NUMA-aware Matrix-Matrix-Multiplication - Max Reimann, Philipp Otto - PowerPoint PPT Presentation



SLIDE 1

NUMA-aware Matrix-Matrix-Multiplication

Max Reimann, Philipp Otto

SLIDE 2

About this talk

  • Objective: show how to improve the performance of algorithms on a NUMA system, using MMM as an example

  • Code was written in C with numa.h, pthread.h
  • Tested on FSOC

– ubuntu-0101

  • 2 Nodes, 24 Cores

– dl980

  • 8 Nodes, 128 Cores
  • Compiled with gcc -O3

SLIDE 3

Naïve Matrix-Matrix-Multiplication

  • We will examine MMM for large n x n matrices
  • Runtime complexity: O(n³)

Image source: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png

SLIDE 4

Naïve MMM implementation
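The code screenshot on this slide did not survive extraction. A minimal sketch of the naïve triple-loop implementation in C (row-major flat arrays; the function name is ours):

```c
#include <stdlib.h>

/* Naive O(n^3) matrix-matrix multiplication: C = A * B.
 * Matrices are stored row-major as flat arrays of doubles. */
void mmm_naive(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```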

SLIDE 5

Performance of Naive vs. MKL

[Chart] Execution time in seconds, dl980 on one core:

  n      Naive    MKL
  512     0.38    0.02
  1024   11.79    0.13
  2048   98.14    1.02

SLIDE 6

Intel Math Kernel Library (MKL)

  • BLAS: Basic Linear Algebra Subprograms

– Standard for Linear Algebra

  • MKL:

– Implements BLAS for Intel hardware
– Vectorized and threaded for highest performance

SLIDE 7

Analysis of Naïve MMM

  • Test setup:

– Use the ubuntu-numa machine
– No thread or memory pinning
– Use numatop/pcm

  • Performance tools show:

– Unused cores (obvious)
– QPI cannot be fully loaded with one thread

SLIDE 8

Parallelization I

  • How can the work be divided?

– 1. Partition computation of matrixC by rows or columns

  • Problem: All threads need matrixA and matrixB
  • Solution:

– Accept the overhead of remote memory accesses, or
– Copy the input/output matrices to the other nodes (preprocessing)

[Diagram: matrixA * matrixB = matrixC]

SLIDE 9

Parallelization – Partition by rows
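The code on this slide was lost in extraction. A hedged pthread sketch of the row-partitioning scheme described on the previous slide — each thread computes a contiguous block of rows of matrixC, while matrixA and matrixB stay shared (remote accesses are simply accepted here); the task struct and function names are ours:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    const double *A, *B;           /* shared input matrices, row-major n x n */
    double *C;                     /* shared output matrix */
    size_t n, row_begin, row_end;  /* this thread's block of rows of C */
} mmm_task;

static void *mmm_rows(void *arg)
{
    mmm_task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++)
        for (size_t j = 0; j < t->n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < t->n; k++)
                sum += t->A[i * t->n + k] * t->B[k * t->n + j];
            t->C[i * t->n + j] = sum;
        }
    return NULL;
}

void mmm_parallel(const double *A, const double *B, double *C,
                  size_t n, size_t nthreads)
{
    pthread_t tid[nthreads];
    mmm_task task[nthreads];
    size_t chunk = (n + nthreads - 1) / nthreads;  /* rows per thread */
    for (size_t t = 0; t < nthreads; t++) {
        size_t end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        task[t] = (mmm_task){ A, B, C, n, t * chunk, end };
        pthread_create(&tid[t], NULL, mmm_rows, &task[t]);
    }
    for (size_t t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

Since each thread writes a disjoint block of rows of matrixC, no locking is needed.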

SLIDE 10

Parallelization – Partition by rows

[Chart] Execution time in seconds, dl980 on 128 cores:

  n      Naive Sequential   Naive Parallel   MKL Parallel
  512     0.38               0.05             0.19
  1024   11.79               0.26             0.27
  2048   98.14               2.54             0.28

SLIDE 11

Parallelization II

  • How can the work be divided?

– 2. Partition computation of matrixC by summands

  • Benefit:

– For computing the i-th summand, only the i-th row of matrixA / column of matrixB is needed
– This allows copying only the needed parts to the other nodes

  • Disadvantage:

– matrixB has to be transposed to be able to partition the memory (preprocessing)
– Locking or merging of matrixC is needed

SLIDE 12

Parallelization II
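The implementation on this slide was lost in extraction. One way to read the summand partitioning is to write matrixC as a sum of outer products — C = Σ over k of column_k(A) · row_k(B) — and give each thread a range of k. This hedged sketch (all names are ours) has each thread accumulate into a private partial matrix, so the partial results are merged afterwards instead of locking matrixC:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const double *A, *B;   /* row-major n x n inputs */
    double *Cpart;         /* this thread's private partial result, zeroed */
    size_t n, k_begin, k_end;
} sum_task;

static void *sum_worker(void *arg)
{
    sum_task *t = arg;
    /* Accumulate the outer products for this thread's k-range. */
    for (size_t k = t->k_begin; k < t->k_end; k++)
        for (size_t i = 0; i < t->n; i++)
            for (size_t j = 0; j < t->n; j++)
                t->Cpart[i * t->n + j] += t->A[i * t->n + k] * t->B[k * t->n + j];
    return NULL;
}

void mmm_parallel_sum(const double *A, const double *B, double *C,
                      size_t n, size_t nthreads)
{
    pthread_t tid[nthreads];
    sum_task task[nthreads];
    double *bufs = calloc(nthreads * n * n, sizeof *bufs);
    if (!bufs) return;
    size_t chunk = (n + nthreads - 1) / nthreads;  /* summands per thread */
    for (size_t t = 0; t < nthreads; t++) {
        size_t end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        task[t] = (sum_task){ A, B, bufs + t * n * n, n, t * chunk, end };
        pthread_create(&tid[t], NULL, sum_worker, &task[t]);
    }
    memset(C, 0, n * n * sizeof *C);
    for (size_t t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        for (size_t x = 0; x < n * n; x++)   /* merge partial results */
            C[x] += bufs[t * n * n + x];
    }
    free(bufs);
}
```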

SLIDE 13

Performance of „Parallel Sum“ Method

[Chart] Execution time in seconds, dl980 on 128 cores:

  n      Parallel sum   Naive Parallel   MKL Parallel
  512      1.59            0.27            0.19
  1024     2.81            1.41            0.27
  2048     3.34            2.94            0.28
  4096    14.91           17.24            0.43
  8192   218.84          186.39            2.41

SLIDE 14

Strassen

  • Runtime complexity:

– Naive algorithm: O(n³)

  • Can we do better?

– Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity
– Runtime O(n^(log2 7)) ≈ O(n^2.8)
– Uses only 7 multiplications instead of 8 per recursion step

  • Algorithms today reach about O(n^2.37), but are not practical

SLIDE 15

Matrix definition

For matrices A, B, C with dimension n = 4l, l ∈ ℕ, A, B, C can be viewed as 2x2 block matrices:

A = [A1,1 A1,2; A2,1 A2,2],  B = [B1,1 B1,2; B2,1 B2,2],  C = [C1,1 C1,2; C2,1 C2,2]

The conventional algorithm uses 8 (expensive) multiplications:

C1,1 = A1,1 · B1,1 + A1,2 · B2,1
C1,2 = A1,1 · B1,2 + A1,2 · B2,2
C2,1 = A2,1 · B1,1 + A2,2 · B2,1
C2,2 = A2,1 · B1,2 + A2,2 · B2,2

SLIDE 16

Strassen’s algorithm

Define temporary matrices:

M1 := (A1,1 + A2,2) · (B1,1 + B2,2)
M2 := (A2,1 + A2,2) · B1,1
M3 := A1,1 · (B1,2 − B2,2)
M4 := A2,2 · (B2,1 − B1,1)
M5 := (A1,1 + A1,2) · B2,2
M6 := (A2,1 − A1,1) · (B1,1 + B1,2)
M7 := (A1,2 − A2,2) · (B2,1 + B2,2)

Compose the final matrix:

C1,1 = M1 + M4 − M5 + M7
C1,2 = M3 + M5
C2,1 = M2 + M4
C2,2 = M1 − M2 + M3 + M6

Only 7 multiplications!

SLIDE 17

Strassen - Example

Substituting the Mj by their terms gives back the original formula, e.g.:

C1,2 = M3 + M5
     = A1,1 · (B1,2 − B2,2) + (A1,1 + A1,2) · B2,2
     = A1,1·B1,2 − A1,1·B2,2 + A1,1·B2,2 + A1,2·B2,2
     = A1,1·B1,2 + A1,2·B2,2

which is the (1,2) block of [A1,1 A1,2; A2,1 A2,2] · [B1,1 B1,2; B2,1 B2,2] = [C1,1 C1,2; C2,1 C2,2].

SLIDE 18

Strassen - Analysis

  • Cost: 7 multiplications and 18 additions

– versus 8 multiplications and 4 additions for the naïve block algorithm

  • Only practical for large matrices n > 1000

– Although our results indicate otherwise (later)

  • Define cutoff point for recursion

– If n is sufficiently small, do naïve multiplication

SLIDE 19

Strassen - Implementation
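The implementation screenshot did not survive extraction. A hedged sketch of recursive Strassen in C, using the M1..M7 formulas from slide 16 and the BREAK = 64 cutoff mentioned on slide 20; it assumes n is a power of two, and allocates its temporaries per recursion step for clarity rather than performance (all names are ours):

```c
#include <stdlib.h>

#define BREAK 64  /* recursion cutoff: fall back to naive multiplication */

/* Z = X + sign*Y for h x h blocks; X, Y may live inside larger matrices
 * (leading dimensions ldx, ldy), Z is compact with leading dimension h. */
static void add(const double *X, const double *Y, double *Z,
                size_t h, size_t ldx, size_t ldy, int sign)
{
    for (size_t i = 0; i < h; i++)
        for (size_t j = 0; j < h; j++)
            Z[i * h + j] = X[i * ldx + j] + sign * Y[i * ldy + j];
}

static void naive(const double *A, const double *B, double *C,
                  size_t n, size_t lda, size_t ldb, size_t ldc)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = s;
        }
}

void strassen(const double *A, const double *B, double *C,
              size_t n, size_t lda, size_t ldb, size_t ldc)
{
    if (n <= BREAK) { naive(A, B, C, n, lda, ldb, ldc); return; }
    size_t h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h * lda, *A22 = A21 + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ldb, *B22 = B21 + h;
    double *M = malloc(9 * h * h * sizeof *M);  /* M1..M7 plus two scratch */
    double *T1 = M + 7 * h * h, *T2 = T1 + h * h;
#define Mi(k) (M + ((k) - 1) * h * h)
    add(A11, A22, T1, h, lda, lda, +1); add(B11, B22, T2, h, ldb, ldb, +1);
    strassen(T1, T2, Mi(1), h, h, h, h);          /* M1=(A11+A22)(B11+B22) */
    add(A21, A22, T1, h, lda, lda, +1);
    strassen(T1, B11, Mi(2), h, h, ldb, h);       /* M2=(A21+A22)B11 */
    add(B12, B22, T2, h, ldb, ldb, -1);
    strassen(A11, T2, Mi(3), h, lda, h, h);       /* M3=A11(B12-B22) */
    add(B21, B11, T2, h, ldb, ldb, -1);
    strassen(A22, T2, Mi(4), h, lda, h, h);       /* M4=A22(B21-B11) */
    add(A11, A12, T1, h, lda, lda, +1);
    strassen(T1, B22, Mi(5), h, h, ldb, h);       /* M5=(A11+A12)B22 */
    add(A21, A11, T1, h, lda, lda, -1); add(B11, B12, T2, h, ldb, ldb, +1);
    strassen(T1, T2, Mi(6), h, h, h, h);          /* M6=(A21-A11)(B11+B12) */
    add(A12, A22, T1, h, lda, lda, -1); add(B21, B22, T2, h, ldb, ldb, +1);
    strassen(T1, T2, Mi(7), h, h, h, h);          /* M7=(A12-A22)(B21+B22) */
    for (size_t i = 0; i < h; i++)                /* compose C from M1..M7 */
        for (size_t j = 0; j < h; j++) {
            size_t x = i * h + j;
            C[i * ldc + j]           = Mi(1)[x] + Mi(4)[x] - Mi(5)[x] + Mi(7)[x];
            C[i * ldc + j + h]       = Mi(3)[x] + Mi(5)[x];
            C[(i + h) * ldc + j]     = Mi(2)[x] + Mi(4)[x];
            C[(i + h) * ldc + j + h] = Mi(1)[x] - Mi(2)[x] + Mi(3)[x] + Mi(6)[x];
        }
#undef Mi
    free(M);
}
```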

SLIDE 20

Execution Time: Single-threaded

[Chart] Execution time in seconds over matrix dimension n, dl980 on 1 core (Strassen with BREAK = 64):

  n      Naive    Strassen   MKL
  32      0.00     0.00      0.00
  64      0.00     0.00      0.00
  128     0.01     0.00      0.00
  256     0.05     0.02      0.00
  512     0.38     0.12      0.02
  1024   11.79     0.87      0.13
  2048   98.14     6.12      1.02

SLIDE 21

Parallelization of Strassen I

  • Data dependencies:

– The additions inside each Mj have to be done before its multiplication

  • e.g. M1 = (A1,1 + A2,2) · (B1,1 + B2,2)

– All Mj have to be calculated before C can be composed

  • e.g. C1,2 = M3 + M5
  • Easiest solution:

– Calculate the Mj in parallel
– Then calculate the Cj,k in parallel

SLIDE 22

Parallelization of Strassen II

  • Level 1 can be scheduled to 7 threads
  • Level n can be scheduled to 7^n threads

– Most systems have a power-of-two number of processors

  • We used manual parallelization

– 49 distinct functions for the Ms and 16 for the Cs
– Code bloat and not scalable, BUT:

  • Automatic parallelization is hard

– Thread load becomes very unbalanced
– Every level needs 7 temporary matrices

  • Exponentially rising memory requirements

SLIDE 23

Execution Time – 49 Threads

[Chart] Execution time in seconds over matrix dimension n, dl980 on 49 cores:

  n      Naive    Strassen   MKL
  512     0.05     0.05      0.19
  1024    0.26     0.14      0.27
  2048    2.54     0.49      0.28
  4096   27.61     2.06      0.44
  8192  228.57    13.53      1.84

SLIDE 24

NUMA-Optimizations

  • Try to keep as much memory as possible local, to avoid remote memory access

– Remote access is slower by a factor of ~1.4

  • Partition data and work depending on #nodes and #cores
  • Pin threads to the nodes holding the memory they need
  • (Topology matters for other algorithms)

SLIDE 25

Distributing memory and threads

[Chart] Execution time in seconds of parallel naive MMM on ubuntu-numa0101 (24 cores):

  n      Distributed memory and threads   Neither distributed   Distributed threads   Distributed memory
  1024     0.34                             0.35                  0.35                  0.37
  2048    11.39                            18.34                 21.96                 14.33
  4096   101.12                           182.45                204.85                143.44

SLIDE 26

DEMO

SLIDE 27

Application of NUMA-Optimizations

  • Copy all data to every node:

– Duration of preprocessing: 11.11s for an 8192x8192 matrix to 8 nodes

  • Partition the data and move it to the corresponding nodes:

– Duration of preprocessing: 1.03s for an 8192x8192 matrix to 8 nodes

  • Pin threads to nodes:

– int numa_run_on_node(int node);
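The deck pins threads with libnuma's numa_run_on_node(node). As a hedged, portable stand-in (libnuma may not be available everywhere), this sketch pins a thread to the CPUs of one node via pthread_setaffinity_np and shows one way to partition rows per node; the node count, node-to-CPU mapping, and all names are assumptions for illustration:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stddef.h>

enum { NODES = 8, CPUS_PER_NODE = 16 };   /* dl980-like: 8 nodes, 128 cores */

/* Which rows [begin, end) of an n x n matrix does this node work on? */
void partition_rows(size_t n, int node, size_t *begin, size_t *end)
{
    size_t chunk = (n + NODES - 1) / NODES;
    size_t hi = (size_t)(node + 1) * chunk;
    *begin = (size_t)node * chunk;
    *end = hi < n ? hi : n;
}

/* Pin a thread to all CPUs of one node (stand-in for numa_run_on_node). */
int pin_to_node(pthread_t t, int node)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < CPUS_PER_NODE; c++)
        CPU_SET(node * CPUS_PER_NODE + c, &set);
    return pthread_setaffinity_np(t, sizeof set, &set);
}
```

Combined with allocating each node's partition locally (libnuma offers numa_alloc_onnode for that), the worker threads then touch mostly node-local memory.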

SLIDE 28

Parallelization – Partition by rows

Copying memory to different nodes

SLIDE 29

Strassen Memory Distribution Effects

[Chart] Strassen with memory distribution, n = 16384, dl980 on 128 cores. Total time in seconds (stacked: memory copy, multiplication, result combination):

  6 nodes: 22.15   7 nodes: 19.61   8 nodes: 21.32   distributed: 14.67

SLIDE 30

Other optimization techniques

  • Tiling
  • Vectorization
  • Scalar replacement
  • Precomputation of constants
  • (unrolling)

SLIDE 31

Tiling

  • Divide the computational work into tiles to leverage the cache
  • Tile size depends on the cache size
  • gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
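The bullets above can be sketched as a blocked loop nest; CLS is meant to come from the gcc -DCLS flag shown above (defaulted here so the sketch is self-contained), and using one cache line of doubles as the tile edge is an illustrative choice, not the deck's tuned value:

```c
#include <stddef.h>

#ifndef CLS
#define CLS 64                        /* cache line size in bytes, via -DCLS */
#endif
#define TILE (CLS / sizeof(double))   /* doubles per cache line */

/* Tiled (blocked) MMM: iterate over TILE x TILE blocks so the working set
 * of the inner loops stays cache-resident. */
void mmm_tiled(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];   /* scalar replacement */
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```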

SLIDE 32

Performance of Tiling

perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048

[Chart] Cache misses (log scale) and execution time on dl980 (128 cores) for four variants. Execution times: Not Tiled/not Transposed 97s, Not Tiled/Transposed 39s, Tiled/not Transposed 13s, Tiled/Transposed 12s.

SLIDE 33

Vectorization

  • SIMD: Single Instruction, Multiple Data
  • All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)
  • An instruction is applied to multiple floats simultaneously
  • Can only operate efficiently on aligned data (16-byte aligned)
  • SSE operates on 128-bit registers

– Newer Intel processors have Advanced Vector Extensions (AVX) with 256-bit registers
– The dl980 machine only supports 128-bit operations

SLIDE 34

Auto-Vectorization

  • Can this be done automatically?

– gcc -O3 tries to auto-vectorize

  • Only possible for simple statements
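A loop simple enough for gcc -O3 to auto-vectorize might look like this sketch (our example, not from the deck): an element-wise update over contiguous arrays with no loop-carried dependences; the restrict qualifiers tell the compiler the arrays do not alias, which is often what makes auto-vectorization possible.

```c
#include <stddef.h>

/* y[i] += a * x[i] over contiguous, non-aliasing arrays: a candidate
 * for gcc's auto-vectorizer under -O3. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```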

SLIDE 35

Assembler

SLIDE 36

Aligned Malloc

Example:

  • numa_alloc returns addr 0x1232, which is not 16-byte aligned
  • We add 15, so addr = 0x1241, or 0b1001001000001
  • Now we clear the last 4 bits by ANDing with ~0x0f (= 0xfff0)
  • The result 0x1240 is now 16-byte aligned
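The steps above can be sketched as a small aligned-malloc wrapper (our names; the deck applies the same trick to numa_alloc'd memory). The original pointer is stashed just before the aligned block so it can be freed later:

```c
#include <stdint.h>
#include <stdlib.h>

/* Over-allocate, round up, and clear the low 4 bits of the address to get
 * a 16-byte aligned pointer, as in the slide's example. */
void *malloc_aligned16(size_t size)
{
    void *raw = malloc(size + 15 + sizeof(void *));
    if (!raw)
        return NULL;
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    addr = (addr + 15) & ~(uintptr_t)0x0f;   /* round up to multiple of 16 */
    ((void **)addr)[-1] = raw;               /* remember original pointer */
    return (void *)addr;
}

void free_aligned16(void *p)
{
    if (p)
        free(((void **)p)[-1]);
}
```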
SLIDE 37

Intrinsics


Example Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

SLIDE 38

Use Parallelism for MMM

  • We try to construct a 4x4 matrix multiplication
  • How to process rows ?

[Diagram] Rows lie in continuous memory; a column of B cannot be loaded in one instruction.

SLIDE 39

Use parallelism for MMM

  • We try to construct a 4x4 matrix multiplication
  • How to process rows ?
  • Idea: process all elements of row of B in parallel

[Diagram] A1,1 is broadcast and multiplied with the first row of B in one instruction, giving partial results for C1,1 C1,2 C1,3 C1,4; the same is done for A1,2, A1,3, A1,4, and the results are added up.

SLIDE 40

4x4 Kernel
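The kernel code on this slide was lost in extraction. A hedged SSE2 sketch of the idea from the previous slide (our names): broadcast one element A[i][k], multiply it with the whole k-th row of B using 128-bit operations (two doubles per operation, so two operations per row of four), and accumulate into row i of C. All arrays must be 16-byte aligned for _mm_load_pd/_mm_store_pd:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* C = A * B for 4x4 row-major double matrices, processing rows of B in
 * 128-bit chunks as described on the previous slide. */
void kernel4x4(const double A[16], const double B[16], double C[16])
{
    for (int i = 0; i < 4; i++) {
        __m128d c0 = _mm_setzero_pd();      /* accumulates C[i][0..1] */
        __m128d c1 = _mm_setzero_pd();      /* accumulates C[i][2..3] */
        for (int k = 0; k < 4; k++) {
            __m128d a = _mm_set1_pd(A[4 * i + k]);    /* broadcast A[i][k] */
            c0 = _mm_add_pd(c0, _mm_mul_pd(a, _mm_load_pd(&B[4 * k])));
            c1 = _mm_add_pd(c1, _mm_mul_pd(a, _mm_load_pd(&B[4 * k + 2])));
        }
        _mm_store_pd(&C[4 * i], c0);
        _mm_store_pd(&C[4 * i + 2], c1);
    }
}
```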

SLIDE 41

SSE – Single Threaded

[Chart] Execution time in seconds over matrix dimension n, dl980 on 1 core:

  n      naiveSSE   tiledSSE   tiled   naive
  1024    0.27       0.48        2      11
  2048    2          5          24      97
  4096   20         41         213     879

SLIDE 42

Cache Misses of SSE Variants

[Chart] L1 cache misses and dTLB misses of naiveSSE vs. tiledSSE (y-axis up to 6,000,000,000).

SLIDE 43

Performance for Small Matrices

[Chart] Execution time in seconds for n = 64 to 512 (y-axis 0.00 to 0.25s): naiveSSE, tiled, strassen, MKL; dl980 on 128 cores.

SLIDE 44

Performance for Large Matrices

[Chart] Execution time in seconds for n = 1024 to 8192: naiveSSE, tiled, strassenSSE, MKL; dl980 on 128 cores. One bar is off-scale at 28.3s.

SLIDE 45

Summary

  • Analyze the algorithm for bottlenecks

– I/O optimization
– Hardware-specific optimization

  • Cache size
  • NUMA architecture
  • Specific instructions (SSE)

  • Try to minimize remote memory access
  • Visualisations can facilitate understanding