SLIDE 1

Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi

Erik Saule¹, Kamer Kaya¹ and Ümit V. Çatalyürek¹,²

esaule@uncc.edu, {kamer,umit}@bmi.osu.edu

¹Department of Biomedical Informatics, ²Department of Electrical and Computer Engineering

The Ohio State University

PPAM 2013 Monday Sept 9

SLIDE 2

Outline

1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

SLIDE 3

What is Intel MIC?

Intel Many Integrated Core (MIC) is Intel's response to GPUs becoming popular in High Performance Computing.

What GPUs do well:

- Get a lot of GFlop/s by using hundreds of cores
- Each core has large SIMD-like abilities
- Hide memory latency with a 1-cycle context switch

What GPUs do not do well:

- Alien to program
- Poor support for legacy applications
- Inter-thread communication
- Branching

Goal of Intel MIC: do all of it well!

SLIDE 4

Overall Architecture

[Diagram: cores, memory controllers, and the PCI-e controller connected by a ring bus]

- 8 memory controllers
  - GDDR5, 2 channels (32-bit), 5.5 GT/s
  - 352 GB/s aggregated peak (twice a GPU's, but you typically get half)
- Ring bus at 220 GB/s
- 50+ cores
  - 32 KB of L1 cache
  - 512 KB of L2 cache, LRU, 8-way associative
- 1 PCI-e controller
  - to the host (2 GB/s guaranteed to memory)

SLIDE 5

Core Architecture

[Core diagram; source: Intel]

- 64-bit
- 4 hardware threads
  - no context switching; consecutive instructions come from different threads
- A vector unit
  - 32 512-bit wide registers
  - sqrt, rsqrt, log, exp
  - mul, div, add, sub, fma
  - permutation, swizzling, masking
- Two instruction pipes:
  - 2 ALU ops
  - ALU + MEM ops
  - ALU + VFP ops
  - VFP + MEM ops
- In-order execution

SLIDE 6

When Should I Use Intel MIC?

Key points of SE10P

- Large memory bandwidth (peak: 220 GB/s)
- 61 cores with mandatory use of hardware threading
- 512-bit wide SIMD registers:
  - FMA: up to 2x16 SP Flop/c (2x8 DP Flop/c)
  - Otherwise: up to 16 SP Flop/c (8 DP Flop/c)

On a 61-core configuration at 1.05 GHz:

- FMA: 2x16x61x1.05 GHz = 2.048 TFlop/s SP (1.024 TFlop/s DP)
- Otherwise: 16x61x1.05 GHz = 1.024 TFlop/s SP (512 GFlop/s DP)

Lots of bandwidth? Fused Multiply Add? Large vector registers? Sounds like the perfect system for Sparse Matrix operations!
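The peak numbers above are simply lanes x cores x clock; a minimal C++ sketch that reproduces them (values taken from this slide, rounding aside):

```cpp
// Sketch: theoretical single-precision peak of a 61-core SE10P at 1.05 GHz.
#include <cstdio>

int main() {
    const double cores = 61, ghz = 1.05, sp_lanes = 16; // 512-bit vectors hold 16 floats
    double peak_fma_sp = 2 * sp_lanes * cores * ghz;    // FMA counts as 2 flops per lane
    double peak_sp     =     sp_lanes * cores * ghz;    // without FMA
    std::printf("FMA peak: %.1f GFlop/s SP, no-FMA peak: %.1f GFlop/s SP\n",
                peak_fma_sp, peak_sp);                  // roughly 2 TFlop/s and 1 TFlop/s
    return 0;
}
```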

SLIDE 7

Getting actual bandwidth

[Bar charts: achieved read bandwidth (loop-char, loop-int, vect, vect+pref) and write bandwidth (store, store-NR, store-NRNGO), in GB/s, against the ring bandwidth]

Using the appropriate vector instructions gives significant improvements. Read peak: 183 GB/s. Write peak: 160 GB/s.
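For reference, the simplest form of such a measurement is a timed streaming pass over a large array. The variants above additionally rely on Phi-specific vector loads, prefetching, and the NR/NRNGO store forms, so this plain C++/OpenMP loop is only an illustrative sketch:

```cpp
// Sketch: read-bandwidth microbenchmark (stream once through a large array).
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1ull << 28;               // 2^28 doubles, about 2 GiB
    std::vector<double> a(n, 1.0);
    double sum = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < (long long)n; ++i)
        sum += a[i];                            // every byte of a is read exactly once
    auto t1 = std::chrono::steady_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("read bandwidth: %.1f GB/s (checksum %.1f)\n",
                n * sizeof(double) / s / 1e9, sum);
    return 0;
}
```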

SLIDE 8

Outline

1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

SLIDE 9

SpMV

Compressed Storage by Row (CRS)

Test instances from the UFL Sparse Matrix Collection:

 #  name              #row       #nonzero   density   nnz/row  max nnz/r  max nnz/c
 1  shallow_water1       81,920     204,800  3.05e-05     2.50        4          4
 2  2cubes_sphere       101,492     874,378  8.48e-05     8.61       24         29
 3  scircuit            170,998     958,936  3.27e-05     5.60      353        353
 4  mac_econ            206,500   1,273,389  2.98e-05     6.16       44         47
 5  cop20k_A            121,192   1,362,087  9.27e-05    11.23       24         75
 6  cant                 62,451   2,034,917  5.21e-04    32.58       40         40
 7  pdb1HYS              36,417   2,190,591  1.65e-03    60.15      184        162
 8  webbase-1M        1,000,005   3,105,536  3.10e-06     3.10     4700      28685
 9  hood                220,542   5,057,982  1.03e-04    22.93       51         77
10  bmw3_2              227,362   5,757,996  1.11e-04    25.32      204        327
11  pre2                659,033   5,834,044  1.34e-05     8.85      627        745
12  pwtk                217,918   5,871,175  1.23e-04    26.94      180         90
13  crankseg_2           63,838   7,106,348  1.74e-03   111.31      297       3423
14  torso1              116,158   8,516,500  6.31e-04    73.31     3263       1224
15  atmosmodd         1,270,432   8,814,880  5.46e-06     6.93        7          7
16  msdoor              415,863   9,794,513  5.66e-05    23.55       57         77
17  F1                  343,791  13,590,452  1.14e-04    39.53      306        378
18  nd24k                72,000  14,393,817  2.77e-03   199.91      481        483
19  inline_1            503,712  18,659,941  7.35e-05    37.04      843        333
20  mesh_2048         4,194,304  20,963,328  1.19e-06     4.99        5          5
21  ldoor               952,203  21,723,010  2.39e-05    22.81       49         77
22  cage14            1,505,785  27,130,349  1.19e-05    18.01       41         41
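A minimal sketch of the CRS SpMV kernel these experiments are built on (the array names, OpenMP scheduling, and data types are illustrative assumptions, not the paper's tuned code):

```cpp
// Minimal CRS (CSR) SpMV sketch: y = A*x, one row per loop iteration.
#include <vector>

void spmv_crs(int nrows,
              const std::vector<int>& rowPtr,   // size nrows+1
              const std::vector<int>& colIdx,   // size nnz
              const std::vector<double>& val,   // size nnz
              const std::vector<double>& x,
              std::vector<double>& y) {
    #pragma omp parallel for schedule(dynamic, 32)
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
            sum += val[j] * x[colIdx[j]];       // 2 flops per nonzero
        y[i] = sum;
    }
}
```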

SLIDE 10

Optimization levels

[Plot: SpMV performance (GFlop/s) per matrix, No Vect. vs. Comp. Vect.]

Variable performance; variable impact of vectorization.

vgatherd

v[0:7] = x[adj[0:7]]. Takes one cycle per cache line spanned by x[adj[0:7]].

SLIDE 11

Optimization levels


[Plot: SpMV performance (GFlop/s) vs. useful cache line density, No Vect. vs. Comp. Vect.]

Useful Cacheline Density

Fraction of the accessed cache lines of x that is useful for computing y[i].
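One possible way to compute this metric directly from the definition above, assuming 64-byte cache lines and 8-byte values (the paper's exact accounting may differ):

```cpp
// Sketch: useful cache line density of the x accesses of a CRS matrix.
// For each row, every nonzero uses 8 bytes of x, but whole 64-byte lines are
// fetched; the density is total useful bytes over total fetched bytes.
#include <set>
#include <vector>

double useful_cacheline_density(int nrows,
                                const std::vector<int>& rowPtr,
                                const std::vector<int>& colIdx) {
    const int doubles_per_line = 64 / 8;        // 8 doubles per cache line
    long long useful = 0, fetched = 0;
    for (int i = 0; i < nrows; ++i) {
        std::set<int> lines;                    // distinct lines of x touched by row i
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
            lines.insert(colIdx[j] / doubles_per_line);
        useful  += (long long)(rowPtr[i + 1] - rowPtr[i]) * 8;  // bytes actually used
        fetched += (long long)lines.size() * 64;                // bytes brought in
    }
    return fetched ? (double)useful / (double)fetched : 0.0;
}
```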

SLIDE 12

A Bandwidth point of view

[Plot: effective bandwidth (GB/s) per matrix under the Naive, Application, Hardware infinite cache, and Hardware 512KB cache models]

Naive: the matrix is transferred once.

Application: the matrix and the vectors are transferred once.

Hardware infinite cache: a core that accesses an entry of x brings in the whole cache line.

Hardware 512KB cache: a cache line might be transferred multiple times to a core if the cache is full.
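To make the first two models concrete, a sketch of the byte counts for a CRS matrix with 8-byte values and 4-byte indices (the matrix sizes come from the test-instance table; the kernel time is a hypothetical placeholder you would replace with a measurement):

```cpp
// Sketch: "Naive" and "Application" effective-bandwidth estimates for one SpMV.
#include <cstdio>

int main() {
    long long nrows = 63838, nnz = 7106348;            // crankseg_2, from the table
    double matrix_bytes = nnz * (8.0 + 4.0)            // values + column indices
                        + (nrows + 1) * 4.0;           // row pointers
    double vector_bytes = 2.0 * nrows * 8.0;           // x read once, y written once
    double naive       = matrix_bytes;                 // "Naive": matrix only
    double application = matrix_bytes + vector_bytes;  // "Application": matrix + vectors
    double t = 1.0e-3;                                 // hypothetical measured SpMV time (s)
    std::printf("Naive: %.1f GB/s, Application: %.1f GB/s\n",
                naive / t / 1e9, application / t / 1e9);
    return 0;
}
```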

SLIDE 13

A Bandwidth point of view


Provided the peak bandwidth is between 160 GB/s and 180 GB/s, that's close to optimal for some matrices.


SLIDE 14

So bandwidth constraint?

[Plot: bandwidth (GB/s) vs. number of cores for crankseg_2, with 1 to 4 threads per core, against the max sustained read and write bandwidth]

crankseg_2

There is contention inside the cores.

More threads do not help.

There is a hint at contention on the Xeon Phi.

Scaling is similar to max bandwidth.

Bandwidth constraint?

SLIDE 15

So bandwidth constraint?

[Plots: bandwidth (GB/s) vs. number of cores for crankseg_2 and pre2, with 1 to 4 threads per core, against the max sustained read and write bandwidth]


pre2

No contention inside the cores.

More threads help.

No global contention.

Linear scaling when adding cores.

Latency constraint?

SLIDE 16

Outline

1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

SLIDE 17

SpMM

SpMV gets low GFlop/s because the flop-to-byte ratio is only 2 flops / 12 bytes per nonzero.

SpMM

Put k SpMVs together. The ratio becomes 2k flops / 12 bytes per nonzero.

We experiment with k = 16.
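A minimal sketch of SpMM as k fused SpMVs over a CRS matrix, which is where the 2k/12 ratio comes from: each matrix entry is reused k times. The names, row-major layout, and the k <= 16 assumption are illustrative, not the paper's manually vectorized kernel:

```cpp
// Minimal SpMM sketch on a square n x n CRS matrix: Y = A * X,
// where X and Y are n x k row-major dense blocks (k = 16 in the experiments).
#include <vector>

void spmm_crs(int n, int k,
              const std::vector<int>& rowPtr,
              const std::vector<int>& colIdx,
              const std::vector<double>& val,
              const std::vector<double>& X,    // n x k, row-major
              std::vector<double>& Y) {        // n x k, row-major
    #pragma omp parallel for schedule(dynamic, 32)
    for (int i = 0; i < n; ++i) {
        double acc[16] = {0.0};                 // assumes k <= 16 for this sketch
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
            double a = val[j];
            const double* xrow = &X[(size_t)colIdx[j] * k];
            for (int c = 0; c < k; ++c)
                acc[c] += a * xrow[c];          // the matrix entry is reused k times
        }
        for (int c = 0; c < k; ++c)
            Y[(size_t)i * k + c] = acc[c];
    }
}
```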

Applications

- PDE eigensolving
- graph-based recommendations

SLIDE 18

SpMM Performance

[Plot: SpMM performance (GFlop/s) per matrix for Comp. Vect., Manual Vect., and Manual Vect. + NRNGO]

Variants

Comp. Vect.: basic C++. Manual Vect.: 8 columns at a time with FMA. Manual Vect. + NRNGO: adds the proper store operation.

SLIDE 19

Bandwidth

[Plot: SpMM effective bandwidth (GB/s) per matrix under the Application, Hardware infinite cache, and Hardware 512KB cache models]

Bandwidth analysis

Where x goes is much more important. Cache is still large enough.

SLIDE 20

Outline

1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

SLIDE 21

Comparison with other architectures

[Plot: performance (GFlop/s) per matrix on SE10P (Xeon Phi), C2050, K20, dual X5680, and dual E5-2670]

SLIDE 22

Conclusion

Xeon Phi gets good performance on Sparse Matrix kernels. Vectorization is paramount.

Register blocking could improve usage, but matrices are too sparse.

Because of vgatherd, useful cacheline density matters. Test on plenty of sparse matrices or risk a bias.

Most still use only a few.

Locality is key (but ordering has little effect). Observing performance at different numbers of cores and threads per core hints at what is happening. Still low utilization on many matrices.

Blocked ELLPACK has been developed.

SLIDE 23

Thank you

Support

Intel for providing Xeon Phi cards. NVIDIA for providing C2050 and K20 cards. OSC for providing computation infrastructure.

More information

Contact: esaule@uncc.edu. Visit: http://webpages.uncc.edu/~esaule/ or http://bmi.osu.edu/hpc/
