Scalable Dense Matrix Multiplication on Multi-Socket Many-Core Systems with Fast Shared Memory


SLIDE 1

Scalable Dense Matrix Multiplication on Multi-Socket Many-Core Systems with Fast Shared Memory

Ricardo Magana, Natalia Vassilieva

SLIDE 2

Acknowledgment

Ricardo Magaña (magania@gmail.com). And also many thanks to Prof. Robert van de Geijn, Field Van Zee and Tyler Smith!


SLIDE 3

Outline

– Motivation and The Machine pitch
– NUMA-aware extension of BLIS for multi-socket systems
– Experimental results


SLIDE 4


The Machine

SLIDE 5-9

[Figures of The Machine hardware; labels: I/O, copper interconnect]

SLIDE 10

Processor-centric computing vs. Memory-Driven Computing

[Diagram: in processor-centric computing each CPU has its own memory; in Memory-Driven Computing a shared memory pool serves CPUs, GPUs, ASICs, RISC-V (open architecture) and quantum devices]

SLIDE 11

The Machine in context

– Shared nothing: SoCs with local DRAM and local NVM, connected over a network
– Shared everything: physical servers joined by a coherent interconnect

SLIDE 12

The Machine in context

– Shared nothing: SoCs with local DRAM and local NVM, connected over a network
– Shared something (The Machine): SoCs with local DRAM, attached through a communications and memory fabric to a shared NVM memory pool
– Shared everything: physical servers joined by a coherent interconnect

SLIDE 13

Our goal: efficient linear algebra library for The Machine

– Fast GEMM is crucial for fast machine learning (deep learning in particular)
– BLAS is essential for many problems in scientific computing, pattern recognition and optimization
– The ratio of compute to bandwidth on The Machine enables efficient scaling of GEMM for matrices of moderate sizes (up to ~100,000,000 elements; see the estimate below)
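The scaling claim can be motivated with a rough estimate (mine, not from the slides): an n x n SGEMM performs 2n^3 flops while only about 3n^2 matrix elements need to cross the fabric, so the flop-per-byte ratio grows with n and is already large for matrices of ~10^8 elements.

#include <stdio.h>

/* Rough estimate (not from the slides): arithmetic intensity of an
 * n x n SGEMM, assuming A, B and C each cross the fabric once. */
int main(void)
{
    double n = 10000.0;                  /* ~1e8 elements per matrix */
    double flops = 2.0 * n * n * n;      /* multiply-adds in C = A*B */
    double bytes = 3.0 * n * n * 4.0;    /* three float matrices     */
    printf("%.1f Tflop over %.2f GB -> %.0f flop/byte\n",
           flops / 1e12, bytes / 1e9, flops / bytes);
    return 0;
}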


SLIDE 14

Linear algebra on The Machine: aspiration


Typical sizes of matrices for deep learning

What we need to be true:

– High-performing single-node multi-core GEMM for small matrices
– Scalable multi-node GEMM

SLIDE 15

Existing BLAS libraries

Open Source
– ATLAS
– OpenBLAS
– BLIS
– Armadillo
– Eigen
– ScaLAPACK
– PLAPACK
– PLASMA
– DPLASMA
– Elemental

Proprietary
– Intel MKL
– AMD ACML
– IBM ESSL and PESSL
– NVIDIA cuBLAS and NVBLAS

SLIDE 16

Existing BLAS libraries

Open Source: ATLAS, OpenBLAS, BLIS, Armadillo, Eigen, ScaLAPACK, PLAPACK, PLASMA, DPLASMA, Elemental

Proprietary: Intel MKL, AMD ACML, IBM ESSL and PESSL, NVIDIA cuBLAS and NVBLAS

Single-node
• Access shared coherent memory
• Threads don't exchange data, only synchronization messages

Multi-node
• Distributed memory
• Different processes transfer data and synchronization messages

Multi-socket with shared memory
• In The Machine, different processes can access shared memory (sketched below)
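The slides show no code for this model, but the idea can be sketched with POSIX shared memory as an illustrative stand-in (the segment name and size below are hypothetical; on The Machine the shared region would live in fabric-attached memory): one process creates and maps a panel, and processes on other sockets map the same region and read the data directly instead of receiving it in messages.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative sketch only: a panel of floats placed in a POSIX shared
 * memory segment so that processes on different sockets can map and read
 * it directly. Segment name and size are hypothetical.
 * Build with: cc shm_panel.c -lrt */
int main(void)
{
    const size_t bytes = 4096 * 256 * sizeof(float);   /* one panel */
    int fd = shm_open("/gemm_panel_a", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, bytes) != 0) { perror("ftruncate"); return 1; }

    float *panel = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (panel == MAP_FAILED) { perror("mmap"); return 1; }

    panel[0] = 1.0f;   /* visible to every process that maps "/gemm_panel_a" */

    munmap(panel, bytes);
    close(fd);
    return 0;
}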

SLIDE 17

Existing BLAS libraries

Open Source: ATLAS, OpenBLAS, BLIS, Armadillo, Eigen, ScaLAPACK, PLAPACK, PLASMA, DPLASMA, Elemental

Proprietary: Intel MKL, AMD ACML, IBM ESSL and PESSL, NVIDIA cuBLAS and NVBLAS

Why we chose BLIS:
– Open source
– Different ways of parallelization
– Easier to optimize for a new CPU

SLIDE 18

Multi-socket systems today: NUMA

The ones we used

Superdome X
– 16 sockets
– 18 Haswell cores per socket (288 cores total)
– Theoretical peak: ~20 TFLOPS (see the check below)
– [Diagram: NUMA nodes (CPU + memory) connected through a crossbar fabric]

DL580
– 4 sockets
– 15 Ivy Bridge/Haswell cores per socket (60 cores total)
– Theoretical peak: ~2.6/5.2 TFLOPS
– [Diagram: 4 NUMA nodes (CPU + memory) connected by QPI links at 32 GB/s]
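The quoted peaks follow from cores x single-precision flops per cycle x clock. A quick sanity check (the clock rates below are my assumptions, not stated on the slide):

#include <stdio.h>

/* Sanity check of the quoted theoretical peaks: cores x SP flops/cycle x
 * clock. The clock rates are assumed values, not taken from the slides. */
int main(void)
{
    /* Superdome X: 288 Haswell cores, AVX2 FMA = 32 SP flops/cycle */
    printf("Superdome X: %4.1f TFLOPS\n", 288 * 32 * 2.2e9 / 1e12);
    /* DL580, Ivy Bridge: 60 cores, AVX (no FMA) = 16 SP flops/cycle */
    printf("DL580 (IVB): %4.1f TFLOPS\n",  60 * 16 * 2.7e9 / 1e12);
    /* DL580, Haswell: 60 cores, AVX2 FMA = 32 SP flops/cycle */
    printf("DL580 (HSW): %4.1f TFLOPS\n",  60 * 32 * 2.7e9 / 1e12);
    return 0;
}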

SLIDE 19

NUMA-aware extension of BLIS (1)

Cannon-like

[Diagram: C = A × B, with the panel products assigned to SoC 1-3 (Node 1-3)]

• Matrix A is composed of horizontal panels
• Matrix B is composed of vertical panels
• Panels are distributed in SoC memory
• Each SoC owns one panel of A and one of B
• GEMM is distributed: each SoC computes 3 blocks, each block obtained as a panel-times-panel product (see the sketch below)
• At every step, one read from one remote SoC
• The resulting matrix has the "A" format
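A minimal reconstruction of this schedule (mine, not the actual NUMA-BLIS code): with P SoCs, SoC p keeps its panel of A local and, at step s, reads the panel of B owned by SoC (p + s) mod P, producing one block of its row of C per step. The local_gemm stand-in represents a single-socket BLIS sgemm call.

#include <stddef.h>

/* Stand-in for a single-socket BLIS sgemm: C (m x m) += A (m x k) * B (k x m). */
static void local_gemm(float *C, const float *A, const float *B, int m, int k)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            for (int l = 0; l < k; l++)
                C[(size_t)i * m + j] += A[(size_t)i * k + l] * B[(size_t)l * m + j];
}

/* Schedule executed by SoC p out of P: at step s it reads the B panel owned
 * by SoC (p + s) % P -- exactly one remote SoC per step -- and updates one
 * m x m block of its row of C. A barrier between steps (omitted here) keeps
 * the SoCs in lock-step; the resulting C is laid out in horizontal panels,
 * i.e. the same format as A. */
void cannon_like_gemm(int p, int P, const float *A_panel,
                      const float *const B_panels[], float *const C_blocks[],
                      int m, int k)
{
    for (int s = 0; s < P; s++) {
        int src = (p + s) % P;
        local_gemm(C_blocks[src], A_panel, B_panels[src], m, k);
        /* inter-SoC barrier here in the real implementation */
    }
}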
SLIDE 20

NUMA-aware extension of BLIS (2)

Blocks

[Diagram: C = A × B, with the work split across SoC 1-3 (Node 1-3)]

• A and B have the same format
• As before, every SoC reads from only one other SoC at a time
• Unlike before, the SoC being read from is switched after each block

SLIDE 21

Other tricks

– Support for different memory pools (for different panels)
  – The entry point (bli_gemm) receives an array of obj_t that represent the panels of the matrix
– MCS barrier instead of a linear barrier
– Support for multiple thread entry points
  – To avoid spawning a new set of threads at every iteration (in every bli_gemm call)
– Affinity of threads
  – We pre-launch the threads, pin them to particular CPU cores using a #pragma omp (outside of BLIS), and then use the multiple thread entry points (sketched below)
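A rough illustration of the pre-launch-and-pin pattern (mine; the real extension enters BLIS through its multiple thread entry points, stubbed out here as do_my_share_of_gemm): the OpenMP parallel region is opened once, each thread pins itself to a core, and the same threads are reused for every GEMM iteration instead of being respawned.

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>

/* Stub for the per-thread entry into the distributed GEMM; the real code
 * would call one of the BLIS thread entry points here. */
static void do_my_share_of_gemm(int tid, int iter) { (void)tid; (void)iter; }

int main(void)                            /* build with: cc -fopenmp ...       */
{
    const int iters = 100;                /* e.g. GEMM calls in a training loop */

    #pragma omp parallel                  /* threads are spawned exactly once  */
    {
        int tid = omp_get_thread_num();

        /* Pin this thread to one core so its packed panels stay NUMA-local. */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(tid, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        for (int it = 0; it < iters; it++) {
            do_my_share_of_gemm(tid, it); /* same threads reused on every call */
            #pragma omp barrier           /* synchronize between iterations    */
        }
    }
    return 0;
}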

SLIDE 22

SGEMM performance on Superdome X, comparison with a GPU system (2 NVIDIA Tesla K80)

[Plot: distributed SGEMM performance (GFLOPS) vs. matrix dimension (M=N=K); series: Intel ScaLAPACK, PLASMA+OpenBLAS, Custom+BLIS, cuBLAS (1 GPU, no copy), cuBLAS (2 GPUs), cuBLAS (4 GPUs), NUMA-BLIS v1]

SLIDE 23

SGEMM performance on Superdome X

[Plot: distributed SGEMM performance (GFLOPS) vs. matrix dimension (M=N=K); series: nvBLAS (4 GPUs), nvBLAS (2 GPUs), nvBLAS (1 GPU, no copy), nvBLAS (1 GPU), Custom+BLIS, NUMA-BLIS v1]

SLIDE 24

Improved usability and performance for small matrices (v2)

[Plot: distributed SGEMM on Superdome X; series: NUMA-BLIS v1, NUMA-BLIS v2]

SLIDE 25

Conclusion

– Done (almost): extended BLIS (GEMM so far…) for multi-socket systems with shared memory
  – Matrix data is accessed directly
  – Synchronization via barriers
  – NUMA-aware
– In progress: extended BLIS for The Machine
  – Matrix data is accessed directly
  – Matrix data is in NVM
  – Synchronization via MPI/RVMA

SLIDE 26

Thank you!

nvassilieva@hpe.com