
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi - PowerPoint PPT Presentation



  1. Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
  Erik Saule 1, Kamer Kaya 1 and Ümit V. Çatalyürek 1,2
  esaule@uncc.edu, {kamer,umit}@bmi.osu.edu
  1 Department of Biomedical Informatics, 2 Department of Electrical and Computer Engineering, The Ohio State University
  PPAM 2013, Monday Sept 9

  2. Outline
  1. The Intel MIC Architecture
  2. SpMV
  3. SpMM
  4. Conclusion

  3. What is Intel MIC?
  Intel Many Integrated Core (MIC) is Intel's response to GPUs becoming popular in High Performance Computing.
  What do GPUs do well?
  - Get a lot of GFlops by using hundreds of cores
  - Each core has large SIMD-like abilities
  - Hide memory latency with 1-cycle context switches
  What do GPUs not do well?
  - Alien to program
  - Poor support for legacy applications
  - Inter-thread communication
  - Branching
  Goal of Intel MIC: do all of it well!

  4. Overall Architecture
  - 8 GDDR5 memory controllers, 2 channels (32-bit) each, at 5.5 GT/s: 352 GB/s aggregated peak (twice a GPU's, but you typically get half)
  - Ring bus at 220 GB/s
  - 50+ cores, each with 32 KB of L1 cache and 512 KB of L2 cache (LRU, 8-way associative)
  - 1 PCI-e controller to the host (2 GB/s guaranteed to memory)

  5. Core Architecture
  - 64-bit, 4 hardware threads, no context switching: consecutive instructions come from different threads
  - A vector unit with 32 512-bit wide registers: sqrt, rsqrt, log, exp; mul, div, add, sub, fma; permutation, swizzling, masking
  - Two instruction pipes: 2 ALU ops, ALU + MEM ops, ALU + VFP ops, VFP + MEM ops
  - In-order execution
  (source: Intel)
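To make the 512-bit vector unit concrete, here is a minimal sketch of a fused multiply-add over 16 single-precision lanes, written with AVX-512-style _mm512_* intrinsics. The Knights Corner generation actually exposes the earlier IMCI intrinsic set, and the function name fma16 is purely illustrative, so treat this as an illustration of the register width rather than the exact API used on this card:

    #include <immintrin.h>

    // y[0:16] += a[0:16] * x[0:16] in one fma on a 512-bit register.
    // Pointers are assumed 64-byte aligned.
    void fma16(const float* a, const float* x, float* y) {
        __m512 va = _mm512_load_ps(a);    // 16 single-precision lanes
        __m512 vx = _mm512_load_ps(x);
        __m512 vy = _mm512_load_ps(y);
        vy = _mm512_fmadd_ps(va, vx, vy); // one instruction, 32 flops
        _mm512_store_ps(y, vy);
    }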

  6. When Should I Use Intel MIC?
  Key points of the SE10P:
  - Large memory bandwidth (peak: 220 GB/s)
  - 61 cores with mandatory use of hardware threading
  - 512-bit wide SIMD registers: with FMA, up to 2x16 SP Flop/c (2x8 DP Flop/c); otherwise up to 16 SP Flop/c (8 DP Flop/c)
  On a 61-core configuration at 1.05 GHz:
  - FMA: 2 x 16 x 61 x 1.05 GHz = 2.048 TFlop/s SP (1.024 TFlop/s DP)
  - otherwise: 16 x 61 x 1.05 GHz = 1.024 TFlop/s SP (512 GFlop/s DP)
  Lots of bandwidth? Fused multiply-add? Large vector registers? Sounds like the perfect system for sparse matrix operations!
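As a sanity check on those peak figures, the arithmetic can be written out directly (assuming the 61-core, 1.05 GHz SE10P configuration named on the slide; the exact products come out slightly above the rounded numbers shown):

    #include <cstdio>

    int main() {
        const double cores = 61, ghz = 1.05;
        const double sp_lanes = 16, dp_lanes = 8, fma = 2;  // an fma counts as 2 flops
        std::printf("SP FMA peak: %.1f GFlop/s\n", fma * sp_lanes * cores * ghz); // ~2049.6
        std::printf("DP FMA peak: %.1f GFlop/s\n", fma * dp_lanes * cores * ghz); // ~1024.8
        return 0;
    }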

  7. Getting actual bandwidth
  [Chart: sustained read and write bandwidth (in GB/s) for the variants loop-char, loop-int, vect, vect+pref, store, store-NR, and store-NRNGO, compared against the ring bandwidth.]
  Using the appropriate vectorial instructions gives significant improvements. Read peak: 183 GB/s. Write peak: 160 GB/s.
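The slide does not include the benchmark source. A minimal sketch of the kind of read-bandwidth loop being measured could look like the following (OpenMP across threads, with a reduction so the compiler cannot discard the loads; the vect and vect+pref variants would add explicit vector loads and software prefetching on top of this):

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t(1) << 28;   // 2 GiB of doubles
        std::vector<double> a(n, 1.0);
        double t0 = omp_get_wtime();
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i];                              // streaming read
        double t1 = omp_get_wtime();
        double gbs = n * sizeof(double) / (t1 - t0) / 1e9;
        std::printf("read bandwidth: %.1f GB/s (sum=%g)\n", gbs, sum);
        return 0;
    }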

  8. Outline
  1. The Intel MIC Architecture
  2. SpMV
  3. SpMM
  4. Conclusion

  9. SpMV
  Test instances from the UFL Sparse Matrix Collection, stored in Compressed Storage by Row (CSR):
  #   name            #row       #nonzero    density   nnz/row  max nnz/r  max nnz/c
  1   shallow_water1  81,920     204,800     3.05e-05  2.50     4          4
  2   2cubes_sphere   101,492    874,378     8.48e-05  8.61     24         29
  3   scircuit        170,998    958,936     3.27e-05  5.60     353        353
  4   mac_econ        206,500    1,273,389   2.98e-05  6.16     44         47
  5   cop20k_A        121,192    1,362,087   9.27e-05  11.23    24         75
  6   cant            62,451     2,034,917   5.21e-04  32.58    40         40
  7   pdb1HYS         36,417     2,190,591   1.65e-03  60.15    184        162
  8   webbase-1M      1,000,005  3,105,536   3.10e-06  3.10     4700       28685
  9   hood            220,542    5,057,982   1.03e-04  22.93    51         77
  10  bmw3_2          227,362    5,757,996   1.11e-04  25.32    204        327
  11  pre2            659,033    5,834,044   1.34e-05  8.85     627        745
  12  pwtk            217,918    5,871,175   1.23e-04  26.94    180        90
  13  crankseg_2      63,838     7,106,348   1.74e-03  111.31   297        3423
  14  torso1          116,158    8,516,500   6.31e-04  73.31    3263       1224
  15  atmosmodd       1,270,432  8,814,880   5.46e-06  6.93     7          7
  16  msdoor          415,863    9,794,513   5.66e-05  23.55    57         77
  17  F1              343,791    13,590,452  1.14e-04  39.53    306        378
  18  nd24k           72,000     14,393,817  2.77e-03  199.91   481        483
  19  inline_1        503,712    18,659,941  7.35e-05  37.04    843        333
  20  mesh_2048       4,194,304  20,963,328  1.19e-06  4.99     5          5
  21  ldoor           952,203    21,723,010  2.39e-05  22.81    49         77
  22  cage14          1,505,785  27,130,349  1.19e-05  18.01    41         41
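For reference, a minimal CSR SpMV kernel of the kind being benchmarked (a plain OpenMP sketch, not the authors' tuned Xeon Phi code; row_ptr, col_idx, and val are the usual CSR arrays):

    // y = A * x, with A in CSR format (nrows rows).
    // row_ptr has nrows+1 entries; col_idx and val have one entry per nonzero.
    void spmv_csr(int nrows, const int* row_ptr, const int* col_idx,
                  const double* val, const double* x, double* y) {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
                sum += val[j] * x[col_idx[j]];   // 2 flops per nonzero
            y[i] = sum;
        }
    }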

  10. Optimization levels
  [Chart: SpMV performance (in GFlop/s) of the non-vectorized and compiler-vectorized kernels on the 22 test matrices.]
  - Variable performance across matrices; variable impact of vectorization.
  - vgatherd: v[0:7] = x[adj[0:7]]. It takes one cycle per cache line spanned by x[adj[0:7]].

  11. Optimization levels
  [Charts: the performance data from the previous slide, plus the performance (in GFlop/s) of the non-vectorized and compiler-vectorized kernels plotted against useful cache line density (0.1 to 0.8).]
  Useful cache line density: the fraction of the accessed cache lines of x that is useful for computing y[i].
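The slide only defines the metric. A sketch of how a per-row useful cache line density could be computed for a CSR matrix follows; it assumes 64-byte cache lines holding 8 doubles of x, and it is my reading of the definition rather than the authors' code:

    #include <set>

    // Fraction of the x entries in the cache lines touched by row i
    // that row i actually uses (64-byte lines, i.e. 8 doubles per line).
    double useful_line_density(int i, const int* row_ptr, const int* col_idx) {
        std::set<int> lines;
        int nnz_in_row = row_ptr[i + 1] - row_ptr[i];
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            lines.insert(col_idx[j] / 8);        // cache line holding x[col_idx[j]]
        if (lines.empty()) return 0.0;
        return double(nnz_in_row) / (8.0 * lines.size());
    }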

  12. A Bandwidth point of view
  [Chart: effective bandwidth (in GB/s) of SpMV on the 22 test matrices under the naive, application, hardware infinite cache, and hardware 512 KB cache transfer models.]
  - Naive: the matrix is transferred once.
  - Application: the matrix and the vectors are transferred once.
  - Hardware, infinite cache: a core that accesses an entry of x brings the whole cache line in.
  - Hardware, 512 KB cache: a cache line might be transferred multiple times to a core if the cache is full.

  13. A Bandwidth point of view
  [Chart: the same effective-bandwidth data and transfer models as the previous slide.]
  Given that the sustained peak bandwidth is between 160 GB/s and 180 GB/s, this is close to optimal for some matrices.
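A sketch of how the two simplest models can be estimated from matrix statistics and a measured SpMV time (assuming double values and 32-bit column indices, i.e. 12 bytes per nonzero plus the row pointers; the two hardware models additionally require a cache-line-level simulation, which is omitted here):

    #include <cstdio>

    // Naive: the matrix is transferred once.
    // Application: the matrix plus the x and y vectors are transferred once.
    void bandwidth_models(long nrows, long ncols, long nnz, double spmv_seconds) {
        double matrix_bytes = nnz * (8.0 + 4.0) + (nrows + 1) * 4.0; // val + col_idx + row_ptr
        double vector_bytes = (nrows + ncols) * 8.0;                 // y and x
        std::printf("naive:       %.1f GB/s\n", matrix_bytes / spmv_seconds / 1e9);
        std::printf("application: %.1f GB/s\n", (matrix_bytes + vector_bytes) / spmv_seconds / 1e9);
    }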

  14. So bandwidth constraint?
  [Chart: bandwidth (in GB/s) of SpMV on crankseg_2 as a function of the number of cores, for 1 to 4 threads per core, against the maximum sustained read and write bandwidths.]
  crankseg_2:
  - There is contention inside the cores: more threads do not help.
  - There is a hint of contention on the Xeon Phi.
  - Scaling is similar to max bandwidth. Bandwidth constrained?

  15. So bandwidth constraint?
  [Charts: bandwidth (in GB/s) of SpMV on crankseg_2 and pre2 as a function of the number of cores, for 1 to 4 threads per core, against the maximum sustained read and write bandwidths.]
  crankseg_2:
  - Contention inside the cores: more threads do not help.
  - A hint of contention on the Xeon Phi.
  - Scaling is similar to max bandwidth. Bandwidth constrained?
  pre2:
  - No contention inside the cores: more threads help.
  - No global contention.
  - Linear scaling when adding cores. Latency constrained?

  16. Outline
  1. The Intel MIC Architecture
  2. SpMV
  3. SpMM
  4. Conclusion

  17. SpMM
  SpMV gets low GFlop/s because its flop-to-byte ratio is only 2 flops per 12 bytes of matrix data per nonzero.
  SpMM: put k SpMVs together; the ratio becomes 2k/12. We experiment with k = 16.
  Applications: PDEs, eigensolving, graph-based recommendations.
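A minimal sketch of an SpMM kernel in this spirit (CSR matrix times k dense right-hand sides stored row-major, so the k values of a column of the dense block sit contiguously; plain OpenMP, not the authors' vectorized kernel):

    #include <cstddef>

    // Y = A * X, where A is CSR (nrows x ncols), X is ncols x k, Y is nrows x k,
    // both dense blocks stored row-major. Each nonzero now feeds 2*k flops.
    void spmm_csr(int nrows, int k, const int* row_ptr, const int* col_idx,
                  const double* val, const double* X, double* Y) {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < nrows; ++i) {
            double* y = &Y[std::size_t(i) * k];
            for (int c = 0; c < k; ++c) y[c] = 0.0;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
                const double a = val[j];
                const double* x = &X[std::size_t(col_idx[j]) * k];
                for (int c = 0; c < k; ++c)
                    y[c] += a * x[c];            // maps naturally onto fma
            }
        }
    }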

  18. SpMM Performance
  [Chart: SpMM performance (in GFlop/s) on the 22 test matrices for the compiler-vectorized, manually vectorized, and manually vectorized + NRNGO variants.]
  Variants:
  - Basic C++.
  - 8 columns at a time with fma.
  - Using the proper store operation.
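To illustrate the "8 columns at a time with fma" idea, here is a sketch of one row's inner loop with 512-bit intrinsics, assuming k is a multiple of 8 and the row-major blocks are 64-byte aligned. It uses AVX-512-style _mm512_* names; the Knights Corner card exposes the earlier IMCI intrinsics, and the NRNGO variant would additionally replace the final store with its streaming-store intrinsic, so this is only an approximation of the manually vectorized kernel:

    #include <immintrin.h>
    #include <cstddef>

    // One CSR row of Y = A * X, processing 8 double-precision columns per fma.
    // begin/end delimit the row's nonzeros; X rows and Y_row are 64-byte aligned.
    void spmm_row_fma8(int k, int begin, int end, const int* col_idx,
                       const double* val, const double* X, double* Y_row) {
        for (int c = 0; c < k; c += 8) {
            __m512d acc = _mm512_setzero_pd();
            for (int j = begin; j < end; ++j) {
                __m512d a = _mm512_set1_pd(val[j]);                       // broadcast A value
                __m512d x = _mm512_load_pd(&X[std::size_t(col_idx[j]) * k + c]);
                acc = _mm512_fmadd_pd(a, x, acc);                         // 8 columns per fma
            }
            _mm512_store_pd(&Y_row[c], acc);
        }
    }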

  19. Bandwidth
  [Chart: effective SpMM bandwidth (in GB/s) on the 22 test matrices under the application, hardware infinite cache, and hardware 512 KB cache models.]
  Bandwidth analysis: where x goes is much more important; the cache is still large enough.

  20. Outline
  1. The Intel MIC Architecture
  2. SpMV
  3. SpMM
  4. Conclusion
