NUMA-aware Matrix-Matrix-Multiplication - Max Reimann, Philipp Otto - PowerPoint PPT Presentation



SLIDE 1

NUMA-aware Matrix-Matrix-Multiplication

Max Reimann, Philipp Otto

SLIDE 2

About this talk

  • Objective: show how to improve the performance of algorithms on a NUMA system, using MMM as an example

  • Code was written in C with numa.h, pthread.h
  • Tested on FSOC

– ubuntu-0101

  • 2 Nodes, 24 Cores

– dl980

  • 8 Nodes, 128 Cores
  • Compiled with gcc -O3

SLIDE 3

Naïve Matrix-Matrix-Multiplication

  • We will examine MMM for large n x n matrices
  • Runtime complexity: O(n³)

Image source: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png

SLIDE 4

Naïve MMM implementation
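The code screenshot on this slide did not survive extraction. A minimal sketch of the naïve triple-loop implementation in C (row-major flat arrays; the function name is ours):

```c
#include <stdlib.h>

/* Naive O(n^3) matrix-matrix multiplication: C = A * B.
 * Matrices are stored row-major as flat arrays of doubles. */
void mmm_naive(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```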

SLIDE 5

Performance of Naive vs. MKL

[Chart] Execution time in seconds, dl980 on one core:

  n      Naive    MKL
  512     0.38    0.02
  1024   11.79    0.13
  2048   98.14    1.02

SLIDE 6

Intel Math Kernel Library (MKL)

  • BLAS: Basic Linear Algebra Subprograms

– Standard for Linear Algebra

  • MKL:

– Implements BLAS for Intel hardware
– Vectorized and threaded for highest performance

SLIDE 7

Analysis of Naïve MMM

  • Test setup:

– Use the ubuntu-numa machine
– No thread or memory pinning
– Use numatop/pcm

  • Performance tools show:

– Unused cores (obvious)
– QPI cannot be fully loaded with one thread

SLIDE 8

Parallelization I

  • How can the work be divided?

– 1. Partition computation of matrixC by rows or columns

  • Problem: All threads need matrixA and matrixB
  • Solution:

– Accept the overhead of remote memory accesses, or
– Copy the input/output matrices to the other nodes (preprocessing)

[Diagram: matrixA * matrixB = matrixC]

SLIDE 9

Parallelization – Partition by rows
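The code on this slide was lost in extraction. A hedged pthread sketch of the row-partitioning scheme described on the previous slide — each thread computes a contiguous block of rows of matrixC, while matrixA and matrixB stay shared (remote accesses are simply accepted here); the task struct and function names are ours:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    const double *A, *B;           /* shared input matrices, row-major n x n */
    double *C;                     /* shared output matrix */
    size_t n, row_begin, row_end;  /* this thread's block of rows of C */
} mmm_task;

static void *mmm_rows(void *arg)
{
    mmm_task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++)
        for (size_t j = 0; j < t->n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < t->n; k++)
                sum += t->A[i * t->n + k] * t->B[k * t->n + j];
            t->C[i * t->n + j] = sum;
        }
    return NULL;
}

void mmm_parallel(const double *A, const double *B, double *C,
                  size_t n, size_t nthreads)
{
    pthread_t tid[nthreads];
    mmm_task task[nthreads];
    size_t chunk = (n + nthreads - 1) / nthreads;  /* rows per thread */
    for (size_t t = 0; t < nthreads; t++) {
        size_t end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        task[t] = (mmm_task){ A, B, C, n, t * chunk, end };
        pthread_create(&tid[t], NULL, mmm_rows, &task[t]);
    }
    for (size_t t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

Since each thread writes a disjoint block of rows of matrixC, no locking is needed.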

SLIDE 10

Parallelization – Partition by rows

[Chart] Execution time in seconds, dl980 on 128 cores:

  n      Naive Sequential   Naive Parallel   MKL Parallel
  512     0.38               0.05             0.19
  1024   11.79               0.26             0.27
  2048   98.14               2.54             0.28

SLIDE 11

Parallelization II

  • How can the work be divided?

– 2. Partition computation of matrixC by summands

  • Benefit:

– For computing the i-th summand, only the i-th row of matrixA / column of matrixB is needed
– This allows copying only the needed parts to the other nodes

  • Disadvantage:

– matrixB has to be transposed to be able to partition the memory (preprocessing)
– Locking or merging of matrixC is needed

SLIDE 12

Parallelization II
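The implementation on this slide was lost in extraction. One way to read the summand partitioning is to write matrixC as a sum of outer products — C = Σ over k of column_k(A) · row_k(B) — and give each thread a range of k. This hedged sketch (all names are ours) has each thread accumulate into a private partial matrix, so the partial results are merged afterwards instead of locking matrixC:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const double *A, *B;   /* row-major n x n inputs */
    double *Cpart;         /* this thread's private partial result, zeroed */
    size_t n, k_begin, k_end;
} sum_task;

static void *sum_worker(void *arg)
{
    sum_task *t = arg;
    /* Accumulate the outer products for this thread's k-range. */
    for (size_t k = t->k_begin; k < t->k_end; k++)
        for (size_t i = 0; i < t->n; i++)
            for (size_t j = 0; j < t->n; j++)
                t->Cpart[i * t->n + j] += t->A[i * t->n + k] * t->B[k * t->n + j];
    return NULL;
}

void mmm_parallel_sum(const double *A, const double *B, double *C,
                      size_t n, size_t nthreads)
{
    pthread_t tid[nthreads];
    sum_task task[nthreads];
    double *bufs = calloc(nthreads * n * n, sizeof *bufs);
    if (!bufs) return;
    size_t chunk = (n + nthreads - 1) / nthreads;  /* summands per thread */
    for (size_t t = 0; t < nthreads; t++) {
        size_t end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        task[t] = (sum_task){ A, B, bufs + t * n * n, n, t * chunk, end };
        pthread_create(&tid[t], NULL, sum_worker, &task[t]);
    }
    memset(C, 0, n * n * sizeof *C);
    for (size_t t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        for (size_t x = 0; x < n * n; x++)   /* merge partial results */
            C[x] += bufs[t * n * n + x];
    }
    free(bufs);
}
```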

SLIDE 13

Performance of „Parallel Sum“ Method

[Chart] Execution time in seconds, dl980 on 128 cores:

  n      Parallel sum   Naive Parallel   MKL Parallel
  512      1.59            0.27            0.19
  1024     2.81            1.41            0.27
  2048     3.34            2.94            0.28
  4096    14.91           17.24            0.43
  8192   218.84          186.39            2.41

SLIDE 14

Strassen

  • Runtime complexity:

– Naive algorithm: O(n³)

  • Can we do better?

– Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity
– Runtime O(n^(log2 7)) ≈ O(n^2.8)
– Uses only 7 multiplications instead of 8 per recursion step

  • Algorithms today reach about O(n^2.37), but are not practical

SLIDE 15

Matrix definition

For matrices A, B, C with dimension n = 4l, l ∈ ℕ, A, B, C can be viewed as 2x2 block matrices:

A = [A1,1 A1,2; A2,1 A2,2],  B = [B1,1 B1,2; B2,1 B2,2],  C = [C1,1 C1,2; C2,1 C2,2]

The conventional algorithm uses 8 (expensive) multiplications:

C1,1 = A1,1 · B1,1 + A1,2 · B2,1
C1,2 = A1,1 · B1,2 + A1,2 · B2,2
C2,1 = A2,1 · B1,1 + A2,2 · B2,1
C2,2 = A2,1 · B1,2 + A2,2 · B2,2

SLIDE 16

Strassen’s algorithm

Define temporary matrices:

M1 := (A1,1 + A2,2) · (B1,1 + B2,2)
M2 := (A2,1 + A2,2) · B1,1
M3 := A1,1 · (B1,2 − B2,2)
M4 := A2,2 · (B2,1 − B1,1)
M5 := (A1,1 + A1,2) · B2,2
M6 := (A2,1 − A1,1) · (B1,1 + B1,2)
M7 := (A1,2 − A2,2) · (B2,1 + B2,2)

Compose the final matrix:

C1,1 = M1 + M4 − M5 + M7
C1,2 = M3 + M5
C2,1 = M2 + M4
C2,2 = M1 − M2 + M3 + M6

Only 7 multiplications!

SLIDE 17

Strassen - Example

Substituting the Mj by their terms gives back the original formula, e.g.:

C1,2 = M3 + M5
     = A1,1 · (B1,2 − B2,2) + (A1,1 + A1,2) · B2,2
     = A1,1·B1,2 − A1,1·B2,2 + A1,1·B2,2 + A1,2·B2,2
     = A1,1·B1,2 + A1,2·B2,2

which is the (1,2) block of [A1,1 A1,2; A2,1 A2,2] · [B1,1 B1,2; B2,1 B2,2] = [C1,1 C1,2; C2,1 C2,2].

SLIDE 18

Strassen - Analysis

  • Cost: 7 multiplications and 18 additions

– versus 8 multiplications and 4 additions for the naïve block algorithm

  • Only practical for large matrices n > 1000

– Although our results indicate otherwise (later)

  • Define cutoff point for recursion

– If n is sufficiently small, do naïve multiplication

SLIDE 19

Strassen - Implementation
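The implementation screenshot did not survive extraction. A hedged sketch of recursive Strassen in C, using the M1..M7 formulas from slide 16 and the BREAK = 64 cutoff mentioned on slide 20; it assumes n is a power of two, and allocates its temporaries per recursion step for clarity rather than performance (all names are ours):

```c
#include <stdlib.h>

#define BREAK 64  /* recursion cutoff: fall back to naive multiplication */

/* Z = X + sign*Y for h x h blocks; X, Y may live inside larger matrices
 * (leading dimensions ldx, ldy), Z is compact with leading dimension h. */
static void add(const double *X, const double *Y, double *Z,
                size_t h, size_t ldx, size_t ldy, int sign)
{
    for (size_t i = 0; i < h; i++)
        for (size_t j = 0; j < h; j++)
            Z[i * h + j] = X[i * ldx + j] + sign * Y[i * ldy + j];
}

static void naive(const double *A, const double *B, double *C,
                  size_t n, size_t lda, size_t ldb, size_t ldc)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = s;
        }
}

void strassen(const double *A, const double *B, double *C,
              size_t n, size_t lda, size_t ldb, size_t ldc)
{
    if (n <= BREAK) { naive(A, B, C, n, lda, ldb, ldc); return; }
    size_t h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h * lda, *A22 = A21 + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ldb, *B22 = B21 + h;
    double *M = malloc(9 * h * h * sizeof *M);  /* M1..M7 plus two scratch */
    double *T1 = M + 7 * h * h, *T2 = T1 + h * h;
#define Mi(k) (M + ((k) - 1) * h * h)
    add(A11, A22, T1, h, lda, lda, +1); add(B11, B22, T2, h, ldb, ldb, +1);
    strassen(T1, T2, Mi(1), h, h, h, h);          /* M1=(A11+A22)(B11+B22) */
    add(A21, A22, T1, h, lda, lda, +1);
    strassen(T1, B11, Mi(2), h, h, ldb, h);       /* M2=(A21+A22)B11 */
    add(B12, B22, T2, h, ldb, ldb, -1);
    strassen(A11, T2, Mi(3), h, lda, h, h);       /* M3=A11(B12-B22) */
    add(B21, B11, T2, h, ldb, ldb, -1);
    strassen(A22, T2, Mi(4), h, lda, h, h);       /* M4=A22(B21-B11) */
    add(A11, A12, T1, h, lda, lda, +1);
    strassen(T1, B22, Mi(5), h, h, ldb, h);       /* M5=(A11+A12)B22 */
    add(A21, A11, T1, h, lda, lda, -1); add(B11, B12, T2, h, ldb, ldb, +1);
    strassen(T1, T2, Mi(6), h, h, h, h);          /* M6=(A21-A11)(B11+B12) */
    add(A12, A22, T1, h, lda, lda, -1); add(B21, B22, T2, h, ldb, ldb, +1);
    strassen(T1, T2, Mi(7), h, h, h, h);          /* M7=(A12-A22)(B21+B22) */
    for (size_t i = 0; i < h; i++)                /* compose C from M1..M7 */
        for (size_t j = 0; j < h; j++) {
            size_t x = i * h + j;
            C[i * ldc + j]           = Mi(1)[x] + Mi(4)[x] - Mi(5)[x] + Mi(7)[x];
            C[i * ldc + j + h]       = Mi(3)[x] + Mi(5)[x];
            C[(i + h) * ldc + j]     = Mi(2)[x] + Mi(4)[x];
            C[(i + h) * ldc + j + h] = Mi(1)[x] - Mi(2)[x] + Mi(3)[x] + Mi(6)[x];
        }
#undef Mi
    free(M);
}
```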

SLIDE 20

Execution Time: Single-threaded

[Chart] Execution time in seconds over matrix dimension n, dl980 on 1 core (Strassen with BREAK = 64):

  n      Naive    Strassen   MKL
  32      0.00     0.00      0.00
  64      0.00     0.00      0.00
  128     0.01     0.00      0.00
  256     0.05     0.02      0.00
  512     0.38     0.12      0.02
  1024   11.79     0.87      0.13
  2048   98.14     6.12      1.02

SLIDE 21

Parallelization of Strassen I

  • Data dependencies:

– The additions inside each Mj have to be done before its multiplication

  • e.g. M1 = (A1,1 + A2,2) · (B1,1 + B2,2)

– All Mj have to be calculated before C can be composed

  • e.g. C1,2 = M3 + M5
  • Easiest solution:

– Calculate the Mj in parallel
– Then calculate the Cj,k in parallel

SLIDE 22

Parallelization of Strassen II

  • Level 1 can be scheduled to 7 threads
  • Level n can be scheduled to 7^n threads

– Most systems have a power-of-two number of processors

  • We used manual parallelization

– 49 distinct functions for the Ms and 16 for the Cs
– Code bloat and not scalable, BUT:

  • Automatic parallelization is hard

– Thread load becomes very unbalanced
– Every level needs 7 temporary matrices

  • Exponentially rising memory requirements

SLIDE 23

Execution Time – 49 Threads

[Chart] Execution time in seconds over matrix dimension n, dl980 on 49 cores:

  n      Naive    Strassen   MKL
  512     0.05     0.05      0.19
  1024    0.26     0.14      0.27
  2048    2.54     0.49      0.28
  4096   27.61     2.06      0.44
  8192  228.57    13.53      1.84

SLIDE 24

NUMA-Optimizations

  • Try to keep as much memory as possible local, to avoid remote memory access

– Remote access is slower by a factor of ~1.4

  • Partition data and work depending on #nodes and #cores
  • Pin threads to the nodes holding the memory they need
  • (Topology matters for other algorithms)

SLIDE 25

Distributing memory and threads

[Chart] Execution time in seconds of parallel naive MMM on ubuntu-numa0101 (24 cores):

  n      Distributed memory and threads   Neither distributed   Distributed threads   Distributed memory
  1024     0.34                             0.35                  0.35                  0.37
  2048    11.39                            18.34                 21.96                 14.33
  4096   101.12                           182.45                204.85                143.44

SLIDE 26

DEMO

SLIDE 27

Application of NUMA-Optimizations

  • Copy all data to every node:

– Duration of preprocessing: 11.11s for an 8192x8192 matrix to 8 nodes

  • Partition the data and move it to the corresponding nodes:

– Duration of preprocessing: 1.03s for an 8192x8192 matrix to 8 nodes

  • Pin threads to nodes:

– int numa_run_on_node(int node);
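The deck pins threads with libnuma's numa_run_on_node(node). As a hedged, portable stand-in (libnuma may not be available everywhere), this sketch pins a thread to the CPUs of one node via pthread_setaffinity_np and shows one way to partition rows per node; the node count, node-to-CPU mapping, and all names are assumptions for illustration:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stddef.h>

enum { NODES = 8, CPUS_PER_NODE = 16 };   /* dl980-like: 8 nodes, 128 cores */

/* Which rows [begin, end) of an n x n matrix does this node work on? */
void partition_rows(size_t n, int node, size_t *begin, size_t *end)
{
    size_t chunk = (n + NODES - 1) / NODES;
    size_t hi = (size_t)(node + 1) * chunk;
    *begin = (size_t)node * chunk;
    *end = hi < n ? hi : n;
}

/* Pin a thread to all CPUs of one node (stand-in for numa_run_on_node). */
int pin_to_node(pthread_t t, int node)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < CPUS_PER_NODE; c++)
        CPU_SET(node * CPUS_PER_NODE + c, &set);
    return pthread_setaffinity_np(t, sizeof set, &set);
}
```

Combined with allocating each node's partition locally (libnuma offers numa_alloc_onnode for that), the worker threads then touch mostly node-local memory.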

SLIDE 28

Parallelization – Partition by rows

Copying memory to different nodes

SLIDE 29

Strassen Memory Distribution Effects

[Chart] Strassen with memory distribution, n = 16384, dl980 on 128 cores. Total time in seconds (stacked: memory copy, multiplication, result combination):

  6 nodes: 22.15   7 nodes: 19.61   8 nodes: 21.32   distributed: 14.67

SLIDE 30

Other optimization techniques

  • Tiling
  • Vectorization
  • Scalar replacement
  • Precomputation of constants
  • (unrolling)

SLIDE 31

Tiling

  • Divide the computational work into tiles to leverage the cache
  • Tile size depends on the cache size
  • gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
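The bullets above can be sketched as a blocked loop nest; CLS is meant to come from the gcc -DCLS flag shown above (defaulted here so the sketch is self-contained), and using one cache line of doubles as the tile edge is an illustrative choice, not the deck's tuned value:

```c
#include <stddef.h>

#ifndef CLS
#define CLS 64                        /* cache line size in bytes, via -DCLS */
#endif
#define TILE (CLS / sizeof(double))   /* doubles per cache line */

/* Tiled (blocked) MMM: iterate over TILE x TILE blocks so the working set
 * of the inner loops stays cache-resident. */
void mmm_tiled(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];   /* scalar replacement */
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```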

SLIDE 32

Performance of Tiling

perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048

[Chart] Cache misses (log scale) and execution time on dl980 (128 cores) for four variants. Execution times: Not Tiled/not Transposed 97s, Not Tiled/Transposed 39s, Tiled/not Transposed 13s, Tiled/Transposed 12s.

SLIDE 33

Vectorization

  • SIMD: Single Instruction, Multiple Data
  • All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)
  • An instruction is applied to multiple floats simultaneously
  • Can only operate efficiently on aligned data (16-byte aligned)
  • SSE operates on 128-bit registers

– Newer Intel processors have Advanced Vector Extensions (AVX) with 256-bit registers
– The dl980 machine only supports 128-bit operations

SLIDE 34

Auto-Vectorization

  • Can this be done automatically?

– gcc -O3 tries to auto-vectorize

  • Only possible for simple statements
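A loop simple enough for gcc -O3 to auto-vectorize might look like this sketch (our example, not from the deck): an element-wise update over contiguous arrays with no loop-carried dependences; the restrict qualifiers tell the compiler the arrays do not alias, which is often what makes auto-vectorization possible.

```c
#include <stddef.h>

/* y[i] += a * x[i] over contiguous, non-aliasing arrays: a candidate
 * for gcc's auto-vectorizer under -O3. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```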

SLIDE 35

Assembler

SLIDE 36

Aligned Malloc

Example:

  • numa_alloc returns addr 0x1232, which is not 16-byte aligned
  • We add 15, so addr = 0x1241, or 0b1001001000001
  • Now we clear the last 4 bits by ANDing with ~0x0f (= 0xfff0)
  • The result 0x1240 is now 16-byte aligned
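The steps above can be sketched as a small aligned-malloc wrapper (our names; the deck applies the same trick to numa_alloc'd memory). The original pointer is stashed just before the aligned block so it can be freed later:

```c
#include <stdint.h>
#include <stdlib.h>

/* Over-allocate, round up, and clear the low 4 bits of the address to get
 * a 16-byte aligned pointer, as in the slide's example. */
void *malloc_aligned16(size_t size)
{
    void *raw = malloc(size + 15 + sizeof(void *));
    if (!raw)
        return NULL;
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    addr = (addr + 15) & ~(uintptr_t)0x0f;   /* round up to multiple of 16 */
    ((void **)addr)[-1] = raw;               /* remember original pointer */
    return (void *)addr;
}

void free_aligned16(void *p)
{
    if (p)
        free(((void **)p)[-1]);
}
```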
SLIDE 37

Intrinsics


Example Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

SLIDE 38

Use Parallelism for MMM

  • We try to construct a 4x4 matrix multiplication
  • How to process rows ?

[Diagram] Rows lie in continuous memory; a column of B cannot be loaded in one instruction.

SLIDE 39

Use parallelism for MMM

  • We try to construct a 4x4 matrix multiplication
  • How to process rows ?
  • Idea: process all elements of row of B in parallel

[Diagram] A1,1 is broadcast and multiplied with the first row of B in one instruction, giving partial results for C1,1 C1,2 C1,3 C1,4; the same is done for A1,2, A1,3, A1,4, and the results are added up.

SLIDE 40

4x4 Kernel
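The kernel code on this slide was lost in extraction. A hedged SSE2 sketch of the idea from the previous slide (our names): broadcast one element A[i][k], multiply it with the whole k-th row of B using 128-bit operations (two doubles per operation, so two operations per row of four), and accumulate into row i of C. All arrays must be 16-byte aligned for _mm_load_pd/_mm_store_pd:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* C = A * B for 4x4 row-major double matrices, processing rows of B in
 * 128-bit chunks as described on the previous slide. */
void kernel4x4(const double A[16], const double B[16], double C[16])
{
    for (int i = 0; i < 4; i++) {
        __m128d c0 = _mm_setzero_pd();      /* accumulates C[i][0..1] */
        __m128d c1 = _mm_setzero_pd();      /* accumulates C[i][2..3] */
        for (int k = 0; k < 4; k++) {
            __m128d a = _mm_set1_pd(A[4 * i + k]);    /* broadcast A[i][k] */
            c0 = _mm_add_pd(c0, _mm_mul_pd(a, _mm_load_pd(&B[4 * k])));
            c1 = _mm_add_pd(c1, _mm_mul_pd(a, _mm_load_pd(&B[4 * k + 2])));
        }
        _mm_store_pd(&C[4 * i], c0);
        _mm_store_pd(&C[4 * i + 2], c1);
    }
}
```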

SLIDE 41

SSE – Single Threaded

[Chart] Execution time in seconds over matrix dimension n, dl980 on 1 core:

  n      naiveSSE   tiledSSE   tiled   naive
  1024    0.27       0.48        2      11
  2048    2          5          24      97
  4096   20         41         213     879

SLIDE 42

Cache Misses of SSE Variants

[Chart] L1 cache misses and dTLB misses of naiveSSE vs. tiledSSE (y-axis up to 6,000,000,000).

SLIDE 43

Performance for Small Matrices

[Chart] Execution time in seconds for n = 64 to 512 (y-axis 0.00 to 0.25s): naiveSSE, tiled, strassen, MKL; dl980 on 128 cores.

SLIDE 44

Performance for Large Matrices

[Chart] Execution time in seconds for n = 1024 to 8192: naiveSSE, tiled, strassenSSE, MKL; dl980 on 128 cores. One bar is off-scale at 28.3s.

SLIDE 45

Summary

  • Analyze the algorithm for bottlenecks

– I/O optimization
– Hardware-specific optimization

  • Cache size
  • NUMA architecture
  • Specific instructions (SSE)

  • Try to minimize remote memory access
  • Visualisations can facilitate understanding