SLIDE 1

Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi

Erik Saule¹, Kamer Kaya¹ and Ümit V. Çatalyürek¹,²

esaule@uncc.edu, {kamer,umit}@bmi.osu.edu

¹Department of Biomedical Informatics, ²Department of Electrical and Computer Engineering

The Ohio State University

PPAM 2013 Monday Sept 9

SLIDE 2

Outline

1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

SLIDE 3

What is Intel MIC?

Intel Many Integrated Core (MIC) is Intel's response to GPUs becoming popular in High Performance Computing.

What GPUs do well:

- Get a lot of GFlop/s by using hundreds of cores
- Each core has large SIMD-like abilities
- Hide memory latency with a 1-cycle context switch

What GPUs do not do well:

- Alien to program
- Poor support for legacy applications
- Inter-thread communication
- Branching

Goal of Intel MIC: do all of it well!

SLIDE 4

Overall Architecture

[Diagram: cores, memory controllers, and the PCI-e controller connected by a ring bus]

- 8 memory controllers
  - GDDR5, 2 channels (32-bit), 5.5 GT/s
  - 352 GB/s aggregated peak (twice a GPU's, but you typically get half)
- Ring bus at 220 GB/s
- 50+ cores
  - 32 KB of L1 cache
  - 512 KB of L2 cache, LRU, 8-way associative
- 1 PCI-e controller
  - to the host (2 GB/s guaranteed to memory)

SLIDE 5

Core Architecture

[Core diagram; source: Intel]

- 64-bit
- 4 hardware threads
  - no context switching; consecutive instructions come from different threads
- A vector unit
  - 32 512-bit wide registers
  - sqrt, rsqrt, log, exp
  - mul, div, add, sub, fma
  - permutation, swizzling, masking
- Two instruction pipes:
  - 2 ALU ops
  - ALU + MEM ops
  - ALU + VFP ops
  - VFP + MEM ops
- In-order execution

SLIDE 6

When Should I Use Intel MIC?

Key points of SE10P

- Large memory bandwidth (peak: 220 GB/s)
- 61 cores with mandatory use of hardware threading
- 512-bit wide SIMD registers:
  - FMA: up to 2x16 SP Flop/c (2x8 DP Flop/c)
  - Otherwise: up to 16 SP Flop/c (8 DP Flop/c)

On a 61-core configuration at 1.05 GHz:

- FMA: 2x16x61x1.05 GHz = 2.048 TFlop/s SP (1.024 TFlop/s DP)
- Otherwise: 16x61x1.05 GHz = 1.024 TFlop/s SP (512 GFlop/s DP)

Lots of bandwidth? Fused Multiply Add? Large vector registers? Sounds like the perfect system for Sparse Matrix operations!
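The peak numbers above are simply lanes x cores x clock; a minimal C++ sketch that reproduces them (values taken from this slide, rounding aside):

```cpp
// Sketch: theoretical single-precision peak of a 61-core SE10P at 1.05 GHz.
#include <cstdio>

int main() {
    const double cores = 61, ghz = 1.05, sp_lanes = 16; // 512-bit vectors hold 16 floats
    double peak_fma_sp = 2 * sp_lanes * cores * ghz;    // FMA counts as 2 flops per lane
    double peak_sp     =     sp_lanes * cores * ghz;    // without FMA
    std::printf("FMA peak: %.1f GFlop/s SP, no-FMA peak: %.1f GFlop/s SP\n",
                peak_fma_sp, peak_sp);                  // roughly 2 TFlop/s and 1 TFlop/s
    return 0;
}
```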

SLIDE 7

Getting actual bandwidth

[Bar charts: achieved read bandwidth (loop-char, loop-int, vect, vect+pref) and write bandwidth (store, store-NR, store-NRNGO), in GB/s, against the ring bandwidth]

Using the appropriate vector instructions gives significant improvements. Read peak: 183 GB/s. Write peak: 160 GB/s.
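For reference, the simplest form of such a measurement is a timed streaming pass over a large array. The variants above additionally rely on Phi-specific vector loads, prefetching, and the NR/NRNGO store forms, so this plain C++/OpenMP loop is only an illustrative sketch:

```cpp
// Sketch: read-bandwidth microbenchmark (stream once through a large array).
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1ull << 28;               // 2^28 doubles, about 2 GiB
    std::vector<double> a(n, 1.0);
    double sum = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < (long long)n; ++i)
        sum += a[i];                            // every byte of a is read exactly once
    auto t1 = std::chrono::steady_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("read bandwidth: %.1f GB/s (checksum %.1f)\n",
                n * sizeof(double) / s / 1e9, sum);
    return 0;
}
```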

SLIDE 8

Outline

1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

SLIDE 9

SpMV

Compressed Storage by Row (CRS)

Test instances from the UFL Sparse Matrix Collection:

 #  name              #row       #nonzero   density   nnz/row  max nnz/r  max nnz/c
 1  shallow_water1       81,920     204,800  3.05e-05     2.50        4          4
 2  2cubes_sphere       101,492     874,378  8.48e-05     8.61       24         29
 3  scircuit            170,998     958,936  3.27e-05     5.60      353        353
 4  mac_econ            206,500   1,273,389  2.98e-05     6.16       44         47
 5  cop20k_A            121,192   1,362,087  9.27e-05    11.23       24         75
 6  cant                 62,451   2,034,917  5.21e-04    32.58       40         40
 7  pdb1HYS              36,417   2,190,591  1.65e-03    60.15      184        162
 8  webbase-1M        1,000,005   3,105,536  3.10e-06     3.10     4700      28685
 9  hood                220,542   5,057,982  1.03e-04    22.93       51         77
10  bmw3_2              227,362   5,757,996  1.11e-04    25.32      204        327
11  pre2                659,033   5,834,044  1.34e-05     8.85      627        745
12  pwtk                217,918   5,871,175  1.23e-04    26.94      180         90
13  crankseg_2           63,838   7,106,348  1.74e-03   111.31      297       3423
14  torso1              116,158   8,516,500  6.31e-04    73.31     3263       1224
15  atmosmodd         1,270,432   8,814,880  5.46e-06     6.93        7          7
16  msdoor              415,863   9,794,513  5.66e-05    23.55       57         77
17  F1                  343,791  13,590,452  1.14e-04    39.53      306        378
18  nd24k                72,000  14,393,817  2.77e-03   199.91      481        483
19  inline_1            503,712  18,659,941  7.35e-05    37.04      843        333
20  mesh_2048         4,194,304  20,963,328  1.19e-06     4.99        5          5
21  ldoor               952,203  21,723,010  2.39e-05    22.81       49         77
22  cage14            1,505,785  27,130,349  1.19e-05    18.01       41         41
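A minimal sketch of the CRS SpMV kernel these experiments are built on (the array names, OpenMP scheduling, and data types are illustrative assumptions, not the paper's tuned code):

```cpp
// Minimal CRS (CSR) SpMV sketch: y = A*x, one row per loop iteration.
#include <vector>

void spmv_crs(int nrows,
              const std::vector<int>& rowPtr,   // size nrows+1
              const std::vector<int>& colIdx,   // size nnz
              const std::vector<double>& val,   // size nnz
              const std::vector<double>& x,
              std::vector<double>& y) {
    #pragma omp parallel for schedule(dynamic, 32)
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
            sum += val[j] * x[colIdx[j]];       // 2 flops per nonzero
        y[i] = sum;
    }
}
```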

SLIDE 10

Optimization levels

[Plot: SpMV performance (GFlop/s) per matrix, No Vect. vs. Comp. Vect.]

Variable performance; variable impact of vectorization.

vgatherd

v[0:7] = x[adj[0:7]]. Takes one cycle per cache line spanned by x[adj[0:7]].

SLIDE 11

Optimization levels


[Plot: SpMV performance (GFlop/s) vs. useful cache line density, No Vect. vs. Comp. Vect.]

Useful Cacheline Density

Fraction of the accessed cache lines of x that is useful for computing y[i].
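One possible way to compute this metric directly from the definition above, assuming 64-byte cache lines and 8-byte values (the paper's exact accounting may differ):

```cpp
// Sketch: useful cache line density of the x accesses of a CRS matrix.
// For each row, every nonzero uses 8 bytes of x, but whole 64-byte lines are
// fetched; the density is total useful bytes over total fetched bytes.
#include <set>
#include <vector>

double useful_cacheline_density(int nrows,
                                const std::vector<int>& rowPtr,
                                const std::vector<int>& colIdx) {
    const int doubles_per_line = 64 / 8;        // 8 doubles per cache line
    long long useful = 0, fetched = 0;
    for (int i = 0; i < nrows; ++i) {
        std::set<int> lines;                    // distinct lines of x touched by row i
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
            lines.insert(colIdx[j] / doubles_per_line);
        useful  += (long long)(rowPtr[i + 1] - rowPtr[i]) * 8;  // bytes actually used
        fetched += (long long)lines.size() * 64;                // bytes brought in
    }
    return fetched ? (double)useful / (double)fetched : 0.0;
}
```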

SLIDE 12

A Bandwidth point of view

[Plot: effective bandwidth (GB/s) per matrix under the Naive, Application, Hardware infinite cache, and Hardware 512KB cache models]

Naive: the matrix is transferred once.

Application: the matrix and the vectors are transferred once.

Hardware infinite cache: a core that accesses an entry of x brings in the whole cache line.

Hardware 512KB cache: a cache line might be transferred multiple times to a core if the cache is full.
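To make the first two models concrete, a sketch of the byte counts for a CRS matrix with 8-byte values and 4-byte indices (the matrix sizes come from the test-instance table; the kernel time is a hypothetical placeholder you would replace with a measurement):

```cpp
// Sketch: "Naive" and "Application" effective-bandwidth estimates for one SpMV.
#include <cstdio>

int main() {
    long long nrows = 63838, nnz = 7106348;            // crankseg_2, from the table
    double matrix_bytes = nnz * (8.0 + 4.0)            // values + column indices
                        + (nrows + 1) * 4.0;           // row pointers
    double vector_bytes = 2.0 * nrows * 8.0;           // x read once, y written once
    double naive       = matrix_bytes;                 // "Naive": matrix only
    double application = matrix_bytes + vector_bytes;  // "Application": matrix + vectors
    double t = 1.0e-3;                                 // hypothetical measured SpMV time (s)
    std::printf("Naive: %.1f GB/s, Application: %.1f GB/s\n",
                naive / t / 1e9, application / t / 1e9);
    return 0;
}
```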

SLIDE 13

A Bandwidth point of view


Provided the peak bandwidth is between 160 GB/s and 180 GB/s, that's close to optimal for some matrices.


SLIDE 14

So bandwidth constraint?

[Plot: bandwidth (GB/s) vs. number of cores for crankseg_2, with 1 to 4 threads per core, against the max sustained read and write bandwidth]

crankseg_2

There is contention inside the cores.

More threads do not help.

There is a hint at contention on the Xeon Phi.

Scaling is similar to max bandwidth.

Bandwidth constraint?

SLIDE 15

So bandwidth constraint?

[Plots: bandwidth (GB/s) vs. number of cores for crankseg_2 and pre2, with 1 to 4 threads per core, against the max sustained read and write bandwidth]


pre2

No contention inside the cores.

More threads help.

No global contention.

Linear scaling when adding cores.

Latency constraint?

SLIDE 16

Outline

1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

SLIDE 17

SpMM

SpMV gets low GFlop/s because the flop-to-byte ratio is only 2 flops / 12 bytes per nonzero.

SpMM

Put k SpMVs together. The ratio becomes 2k flops / 12 bytes per nonzero.

We experiment with k = 16.
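A minimal sketch of SpMM as k fused SpMVs over a CRS matrix, which is where the 2k/12 ratio comes from: each matrix entry is reused k times. The names, row-major layout, and the k <= 16 assumption are illustrative, not the paper's manually vectorized kernel:

```cpp
// Minimal SpMM sketch on a square n x n CRS matrix: Y = A * X,
// where X and Y are n x k row-major dense blocks (k = 16 in the experiments).
#include <vector>

void spmm_crs(int n, int k,
              const std::vector<int>& rowPtr,
              const std::vector<int>& colIdx,
              const std::vector<double>& val,
              const std::vector<double>& X,    // n x k, row-major
              std::vector<double>& Y) {        // n x k, row-major
    #pragma omp parallel for schedule(dynamic, 32)
    for (int i = 0; i < n; ++i) {
        double acc[16] = {0.0};                 // assumes k <= 16 for this sketch
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
            double a = val[j];
            const double* xrow = &X[(size_t)colIdx[j] * k];
            for (int c = 0; c < k; ++c)
                acc[c] += a * xrow[c];          // the matrix entry is reused k times
        }
        for (int c = 0; c < k; ++c)
            Y[(size_t)i * k + c] = acc[c];
    }
}
```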

Applications

- PDE eigensolving
- graph-based recommendations

SLIDE 18

SpMM Performance

[Plot: SpMM performance (GFlop/s) per matrix for Comp. Vect., Manual Vect., and Manual Vect. + NRNGO]

Variants

Comp. Vect.: basic C++. Manual Vect.: 8 columns at a time with FMA. Manual Vect. + NRNGO: adds the proper store operation.

SLIDE 19

Bandwidth

[Plot: SpMM effective bandwidth (GB/s) per matrix under the Application, Hardware infinite cache, and Hardware 512KB cache models]

Bandwidth analysis

Where x goes is much more important. Cache is still large enough.

SLIDE 20

Outline

1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

SLIDE 21

Comparison with other architectures

[Plot: performance (GFlop/s) per matrix on SE10P (Xeon Phi), C2050, K20, dual X5680, and dual E5-2670]

SLIDE 22

Conclusion

Xeon Phi gets good performance on Sparse Matrix kernels. Vectorization is paramount.

Register blocking could improve usage, but matrices are too sparse.

Because of vgatherd, useful cacheline density matters. Test on plenty of sparse matrices or risk a bias.

Most still use only a few.

Locality is key (but ordering has little effect). Observing performance at different numbers of cores and threads per core hints at what is happening. Still low utilization on many matrices.

Blocked ELLPACK has been developed.

SLIDE 23

Thank you

Support

Intel for providing Xeon Phi cards. NVIDIA for providing C2050 and K20 cards. OSC for providing computation infrastructure.

More information

Contact: esaule@uncc.edu. Visit: http://webpages.uncc.edu/~esaule/ or http://bmi.osu.edu/hpc/
