Scalable Dense Matrix Multiplication on Multi-Socket Many-Core Systems with Fast Shared Memory


SLIDE 1

Scalable Dense Matrix Multiplication on Multi-Socket Many-Core Systems with Fast Shared Memory

Ricardo Magana, Natalia Vassilieva

SLIDE 2

Acknowledgment

Ricardo Magaña (magania@gmail.com). And also many thanks to Prof. Robert van de Geijn, Field Van Zee and Tyler Smith!


SLIDE 3

Outline

– Motivation and The Machine pitch
– NUMA-aware extension of BLIS for multi-socket systems
– Experimental results


SLIDE 4


The Machine

SLIDE 5-9

[Figures of The Machine hardware; labels: I/O, copper interconnect]

SLIDE 10

Processor-centric computing vs. Memory-Driven Computing

[Diagram: in processor-centric computing each CPU has its own memory; in Memory-Driven Computing a shared memory pool serves CPUs, GPUs, ASICs, RISC-V (open architecture) and quantum devices]

SLIDE 11

The Machine in context

– Shared nothing: SoCs with local DRAM and local NVM, connected over a network
– Shared everything: physical servers joined by a coherent interconnect

SLIDE 12

The Machine in context

– Shared nothing: SoCs with local DRAM and local NVM, connected over a network
– Shared something (The Machine): SoCs with local DRAM, attached through a communications and memory fabric to a shared NVM memory pool
– Shared everything: physical servers joined by a coherent interconnect

SLIDE 13

Our goal: efficient linear algebra library for The Machine

– Fast GEMM is crucial for fast machine learning (deep learning in particular)
– BLAS is essential for many problems in scientific computing, pattern recognition and optimization
– The ratio of compute to bandwidth on The Machine enables efficient scaling of GEMM for matrices of moderate sizes (up to ~100,000,000 elements; see the estimate below)
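The scaling claim can be motivated with a rough estimate (mine, not from the slides): an n x n SGEMM performs 2n^3 flops while only about 3n^2 matrix elements need to cross the fabric, so the flop-per-byte ratio grows with n and is already large for matrices of ~10^8 elements.

#include <stdio.h>

/* Rough estimate (not from the slides): arithmetic intensity of an
 * n x n SGEMM, assuming A, B and C each cross the fabric once. */
int main(void)
{
    double n = 10000.0;                  /* ~1e8 elements per matrix */
    double flops = 2.0 * n * n * n;      /* multiply-adds in C = A*B */
    double bytes = 3.0 * n * n * 4.0;    /* three float matrices     */
    printf("%.1f Tflop over %.2f GB -> %.0f flop/byte\n",
           flops / 1e12, bytes / 1e9, flops / bytes);
    return 0;
}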


SLIDE 14

Linear algebra on The Machine: aspiration


Typical sizes of matrices for deep learning

What we need to be true:

– High-performing single-node multi-core GEMM for small matrices
– Scalable multi-node GEMM

SLIDE 15

Existing BLAS libraries

Open Source
– ATLAS
– OpenBLAS
– BLIS
– Armadillo
– Eigen
– ScaLAPACK
– PLAPACK
– PLASMA
– DPLASMA
– Elemental

Proprietary
– Intel MKL
– AMD ACML
– IBM ESSL and PESSL
– NVIDIA cuBLAS and NVBLAS

SLIDE 16

Existing BLAS libraries

Open Source: ATLAS, OpenBLAS, BLIS, Armadillo, Eigen, ScaLAPACK, PLAPACK, PLASMA, DPLASMA, Elemental

Proprietary: Intel MKL, AMD ACML, IBM ESSL and PESSL, NVIDIA cuBLAS and NVBLAS

Single-node
• Access shared coherent memory
• Threads don't exchange data, only synchronization messages

Multi-node
• Distributed memory
• Different processes transfer data and synchronization messages

Multi-socket with shared memory
• In The Machine, different processes can access shared memory (sketched below)
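The slides show no code for this model, but the idea can be sketched with POSIX shared memory as an illustrative stand-in (the segment name and size below are hypothetical; on The Machine the shared region would live in fabric-attached memory): one process creates and maps a panel, and processes on other sockets map the same region and read the data directly instead of receiving it in messages.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative sketch only: a panel of floats placed in a POSIX shared
 * memory segment so that processes on different sockets can map and read
 * it directly. Segment name and size are hypothetical.
 * Build with: cc shm_panel.c -lrt */
int main(void)
{
    const size_t bytes = 4096 * 256 * sizeof(float);   /* one panel */
    int fd = shm_open("/gemm_panel_a", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, bytes) != 0) { perror("ftruncate"); return 1; }

    float *panel = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (panel == MAP_FAILED) { perror("mmap"); return 1; }

    panel[0] = 1.0f;   /* visible to every process that maps "/gemm_panel_a" */

    munmap(panel, bytes);
    close(fd);
    return 0;
}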

SLIDE 17

Existing BLAS libraries

Open Source: ATLAS, OpenBLAS, BLIS, Armadillo, Eigen, ScaLAPACK, PLAPACK, PLASMA, DPLASMA, Elemental

Proprietary: Intel MKL, AMD ACML, IBM ESSL and PESSL, NVIDIA cuBLAS and NVBLAS

Why we chose BLIS:
– Open source
– Different ways of parallelization
– Easier to optimize for a new CPU

SLIDE 18

Multi-socket systems today: NUMA

The ones we used

Superdome X
– 16 sockets
– 18 Haswell cores per socket (288 cores total)
– Theoretical peak: ~20 TFLOPS (see the check below)
– [Diagram: NUMA nodes (CPU + memory) connected through a crossbar fabric]

DL580
– 4 sockets
– 15 Ivy Bridge/Haswell cores per socket (60 cores total)
– Theoretical peak: ~2.6/5.2 TFLOPS
– [Diagram: 4 NUMA nodes (CPU + memory) connected by QPI links at 32 GB/s]
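The quoted peaks follow from cores x single-precision flops per cycle x clock. A quick sanity check (the clock rates below are my assumptions, not stated on the slide):

#include <stdio.h>

/* Sanity check of the quoted theoretical peaks: cores x SP flops/cycle x
 * clock. The clock rates are assumed values, not taken from the slides. */
int main(void)
{
    /* Superdome X: 288 Haswell cores, AVX2 FMA = 32 SP flops/cycle */
    printf("Superdome X: %4.1f TFLOPS\n", 288 * 32 * 2.2e9 / 1e12);
    /* DL580, Ivy Bridge: 60 cores, AVX (no FMA) = 16 SP flops/cycle */
    printf("DL580 (IVB): %4.1f TFLOPS\n",  60 * 16 * 2.7e9 / 1e12);
    /* DL580, Haswell: 60 cores, AVX2 FMA = 32 SP flops/cycle */
    printf("DL580 (HSW): %4.1f TFLOPS\n",  60 * 32 * 2.7e9 / 1e12);
    return 0;
}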

SLIDE 19

NUMA-aware extension of BLIS (1)

Cannon-like

[Diagram: C = A × B, with the panel products assigned to SoC 1-3 (Node 1-3)]

• Matrix A is composed of horizontal panels
• Matrix B is composed of vertical panels
• Panels are distributed in SoC memory
• Each SoC owns one panel of A and one of B
• GEMM is distributed: each SoC computes 3 blocks, each block obtained as a panel-times-panel product (see the sketch below)
• At every step, one read from one remote SoC
• The resulting matrix has the "A" format
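A minimal reconstruction of this schedule (mine, not the actual NUMA-BLIS code): with P SoCs, SoC p keeps its panel of A local and, at step s, reads the panel of B owned by SoC (p + s) mod P, producing one block of its row of C per step. The local_gemm stand-in represents a single-socket BLIS sgemm call.

#include <stddef.h>

/* Stand-in for a single-socket BLIS sgemm: C (m x m) += A (m x k) * B (k x m). */
static void local_gemm(float *C, const float *A, const float *B, int m, int k)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            for (int l = 0; l < k; l++)
                C[(size_t)i * m + j] += A[(size_t)i * k + l] * B[(size_t)l * m + j];
}

/* Schedule executed by SoC p out of P: at step s it reads the B panel owned
 * by SoC (p + s) % P -- exactly one remote SoC per step -- and updates one
 * m x m block of its row of C. A barrier between steps (omitted here) keeps
 * the SoCs in lock-step; the resulting C is laid out in horizontal panels,
 * i.e. the same format as A. */
void cannon_like_gemm(int p, int P, const float *A_panel,
                      const float *const B_panels[], float *const C_blocks[],
                      int m, int k)
{
    for (int s = 0; s < P; s++) {
        int src = (p + s) % P;
        local_gemm(C_blocks[src], A_panel, B_panels[src], m, k);
        /* inter-SoC barrier here in the real implementation */
    }
}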
SLIDE 20

NUMA-aware extension of BLIS (2)

Blocks

[Diagram: C = A × B, with the work split across SoC 1-3 (Node 1-3)]

• A and B have the same format
• As before, every SoC reads from only one other SoC at a time
• Unlike before, the SoC being read from is switched after each block

SLIDE 21

Other tricks

– Support for different memory pools (for different panels)
  – The entry point (bli_gemm) receives an array of obj_t that represent the panels of the matrix
– MCS barrier instead of a linear barrier
– Support for multiple thread entry points
  – To avoid spawning a new set of threads at every iteration (in every bli_gemm call)
– Affinity of threads
  – We pre-launch the threads, pin them to particular CPU cores using a #pragma omp (outside of BLIS), and then use the multiple thread entry points (sketched below)
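A rough illustration of the pre-launch-and-pin pattern (mine; the real extension enters BLIS through its multiple thread entry points, stubbed out here as do_my_share_of_gemm): the OpenMP parallel region is opened once, each thread pins itself to a core, and the same threads are reused for every GEMM iteration instead of being respawned.

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>

/* Stub for the per-thread entry into the distributed GEMM; the real code
 * would call one of the BLIS thread entry points here. */
static void do_my_share_of_gemm(int tid, int iter) { (void)tid; (void)iter; }

int main(void)                            /* build with: cc -fopenmp ...       */
{
    const int iters = 100;                /* e.g. GEMM calls in a training loop */

    #pragma omp parallel                  /* threads are spawned exactly once  */
    {
        int tid = omp_get_thread_num();

        /* Pin this thread to one core so its packed panels stay NUMA-local. */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(tid, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        for (int it = 0; it < iters; it++) {
            do_my_share_of_gemm(tid, it); /* same threads reused on every call */
            #pragma omp barrier           /* synchronize between iterations    */
        }
    }
    return 0;
}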

SLIDE 22

SGEMM performance on Superdome X, comparison with a GPU system (2 NVIDIA Tesla K80)

[Plot: distributed SGEMM performance (GFLOPS) vs. matrix dimension (M=N=K); series: Intel ScaLAPACK, PLASMA+OpenBLAS, Custom+BLIS, cuBLAS (1 GPU, no copy), cuBLAS (2 GPUs), cuBLAS (4 GPUs), NUMA-BLIS v1]

SLIDE 23

SGEMM performance on Superdome X

[Plot: distributed SGEMM performance (GFLOPS) vs. matrix dimension (M=N=K); series: nvBLAS (4 GPUs), nvBLAS (2 GPUs), nvBLAS (1 GPU, no copy), nvBLAS (1 GPU), Custom+BLIS, NUMA-BLIS v1]

SLIDE 24

Improved usability and performance for small matrices (v2)

[Plot: distributed SGEMM on Superdome X; series: NUMA-BLIS v1, NUMA-BLIS v2]

SLIDE 25

Conclusion

– Done (almost): extended BLIS (GEMM so far…) for multi-socket systems with shared memory
  – Matrix data is accessed directly
  – Synchronization via barriers
  – NUMA-aware
– In progress: extended BLIS for The Machine
  – Matrix data is accessed directly
  – Matrix data is in NVM
  – Synchronization via MPI/RVMA

SLIDE 26

Thank you!

nvassilieva@hpe.com