Tuning Sparse Matrix Vector Multiplication for multi-core SMPs


SLIDE 1

Tuning Sparse Matrix Vector Multiplication for multi-core SMPs

Samuel Williams [1,2], Richard Vuduc [3], Leonid Oliker [1,2], John Shalf [2], Katherine Yelick [1,2], James Demmel [1,2]

[1] University of California, Berkeley   [2] Lawrence Berkeley National Laboratory   [3] Georgia Institute of Technology

samw@cs.berkeley.edu

SLIDE 2

Overview

Multicore is the de facto performance solution for the next decade

Examined Sparse Matrix Vector Multiplication (SpMV) kernel

  • Important HPC kernel
  • Memory intensive
  • Challenging for multicore

Present two autotuned threaded implementations:

  • Pthread, cache-based implementation
  • Cell local store-based implementation

Benchmarked performance across 4 diverse multicore architectures

  • Intel Xeon (Clovertown)
  • AMD Opteron
  • Sun Niagara2
  • IBM Cell Broadband Engine

Compare with the leading MPI implementation (PETSc) using an autotuned serial kernel (OSKI)

SLIDE 3

Sparse Matrix Vector Multiplication

 Sparse Matrix

  • Most entries are 0.0
  • Performance advantage in only storing/operating on the nonzeros
  • Requires significant metadata

 Evaluate y=Ax

  • A is a sparse matrix
  • x & y are dense vectors

 Challenges

  • Difficult to exploit ILP (bad for superscalar)
  • Difficult to exploit DLP (bad for SIMD)
  • Irregular memory access to the source vector
  • Difficult to load balance
  • Very low computational intensity (often >6 bytes/flop)

[Figure: y = A·x with sparse A and dense vectors x, y]
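For reference, a minimal sketch of the naïve CSR kernel described above; the array names (rowptr, colidx, vals) are illustrative, not taken from the slides:

    /* y = A*x with A in compressed sparse row (CSR) format */
    void spmv_csr(int nrows,
                  const int    *rowptr,   /* nrows+1 row start offsets    */
                  const int    *colidx,   /* column index of each nonzero */
                  const double *vals,     /* value of each nonzero        */
                  const double *x,        /* dense source vector          */
                  double       *y)        /* dense destination vector     */
    {
        for (int r = 0; r < nrows; r++) {
            double sum = 0.0;
            for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
                sum += vals[k] * x[colidx[k]];   /* irregular access to x */
            y[r] = sum;
        }
    }

Note the two flops per nonzero against roughly 12 bytes of matrix data (one double plus one 32-bit index), which is where the >6 bytes/flop figure above comes from.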

SLIDE 4

Dataset (Matrices) Multicore SMPs

Test Suite

SLIDE 5

Matrices Used

 Pruned the original SPARSITY suite down to 14 matrices
 None should fit in cache
 Subdivided them into 4 categories
 Rank ranges from 2K to 1M

  • Dense: 2K x 2K dense matrix stored in sparse format
  • Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
  • Poorly structured hodgepodge: Economics, Epidemiology, FEM/Accelerator, Circuit, webbase
  • Extreme aspect ratio (linear programming): LP

SLIDE 6

Multicore SMP Systems

[System diagrams]

  • Intel Clovertown: 8 Core2 cores in 4 pairs, each pair sharing a 4MB L2; two front-side buses (10.6 GB/s each) into a chipset with 4x64b fully buffered DIMM controllers (21.3 GB/s read, 10.6 GB/s write)
  • AMD Opteron: 2 sockets x 2 cores, 1MB victim cache per core; per-socket memory controller / HyperTransport to DDR2 DRAM (10.6 GB/s per socket, 4 GB/s HT each direction)
  • Sun Niagara2: 8 multithreaded UltraSparc cores (8K D$ + FPU each) behind a crossbar switch and a 4MB shared 16-way L2 (179 GB/s fill, 90 GB/s write-through); 4x128b FBDIMM controllers (42.7 GB/s read, 21.3 GB/s write)
  • IBM Cell Blade: 2 Cell chips, each a PPE (512K L2) plus 8 SPEs (256K local store + MFC each) on the EIB ring network; XDR DRAM at 25.6 GB/s per chip; BIF link between the two chips (<20 GB/s each direction)

SLIDE 7

Multicore SMP Systems

(memory hierarchy)

[Same four system diagrams as slide 6]

Conventional cache-based memory hierarchy: Clovertown, Opteron, Niagara2
Disjoint local-store memory hierarchy: Cell

SLIDE 8

Multicore SMP Systems

(cache)

[Same four system diagrams as slide 6]

Aggregate on-chip memory: Clovertown 16MB (the vectors fit), Opteron 4MB, Niagara2 4MB, Cell 4MB (local store)

SLIDE 9

Multicore SMP Systems

(peak flops)

[Same four system diagrams as slide 6]

Peak double-precision performance: Clovertown 75 Gflop/s (w/SIMD), Opteron 17 Gflop/s, Niagara2 11 Gflop/s, Cell 29 Gflop/s (w/SIMD)

SLIDE 10

Multicore SMP Systems

(peak read bandwidth)

[Same four system diagrams as slide 6]

Peak DRAM read bandwidth: Clovertown 21 GB/s, Opteron 21 GB/s, Niagara2 43 GB/s, Cell 51 GB/s

SLIDE 11

Multicore SMP Systems

(NUMA)

[Same four system diagrams as slide 6]

Uniform memory access: Clovertown, Niagara2.  Non-uniform memory access: Opteron, Cell blade.

SLIDE 12

Naïve Implementation

 For cache-based machines
 Includes a median performance number

SLIDE 13

Vanilla C Performance

[Charts: naïve serial performance on Intel Clovertown, AMD Opteron, Sun Niagara2]

 Vanilla C implementation
 Matrix stored in CSR (compressed sparse row)
 Explored compiler options; only the best is presented here

SLIDE 14

Pthread Implementation

 Optimized for multicore/threading
 A variety of shared-memory programming models are acceptable (not just Pthreads)
 More colors = more optimizations = more work (in the charts that follow)

SLIDE 15

Parallelization

 Matrix partitioned by rows and balanced by the number of nonzeros (a partitioning sketch follows below)
 SPMD-like approach
 A barrier() is called before and after the SpMV kernel
 Each sub-matrix stored separately in CSR
 Load balancing can be challenging
 Number of threads explored in powers of 2 (in the paper)

[Figure: A, x, and y partitioned by rows across threads]
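A sketch of that nonzero-balanced row partitioning; row_start[] and the other names are illustrative:

    /* Split rows among nthreads so each thread gets roughly an equal
     * share of the nonzeros rather than an equal share of the rows.
     * rowptr is the CSR row-pointer array; rowptr[nrows] == total nnz. */
    void partition_by_nnz(int nrows, const int *rowptr,
                          int nthreads, int *row_start /* nthreads+1 */)
    {
        long nnz = rowptr[nrows];
        int  t   = 0;
        row_start[0] = 0;
        for (int r = 0; r <= nrows; r++) {
            /* hand the boundary to the next thread once its quota is met */
            while (t < nthreads &&
                   (long)rowptr[r] * nthreads >= (long)(t + 1) * nnz)
                row_start[++t] = r;
        }
    }

Thread t then runs the CSR kernel over rows [row_start[t], row_start[t+1]) into its own slice of y, with barriers before and after.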

SLIDE 16

Naïve Parallel Performance

[Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 — Naïve Pthreads vs. Naïve Single Thread]

SLIDE 17

Naïve Parallel Performance

[Same charts as slide 16]

Clovertown: 8x cores = 1.9x performance
Opteron: 4x cores = 1.5x performance
Niagara2: 64x threads = 41x performance

SLIDE 18

Naïve Parallel Performance

[Same charts as slide 16]

Clovertown: 1.4% of peak flops, 29% of bandwidth
Opteron: 4% of peak flops, 20% of bandwidth
Niagara2: 25% of peak flops, 39% of bandwidth

SLIDE 19

Case for Autotuning

 How do we deliver good performance across all these architectures, and across all matrices, without exhaustively optimizing every combination?

 Autotuning

  • Write a Perl script that generates all possible optimizations
  • Heuristically or exhaustively search the optimizations
  • Existing SpMV solution: OSKI (developed at UCB)

 This work:

  • Optimizations geared for multicore/multithreading
  • Generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.
  • A “prototype for parallel OSKI”
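The search half of such an autotuner can be as simple as timing every generated variant and keeping the fastest; a hedged sketch (csr_t, now_seconds, and the variant table are illustrative names, not part of OSKI or the talk's code):

    #include <time.h>

    typedef struct { int nrows; int *rowptr, *colidx; double *vals; } csr_t;
    typedef void (*spmv_fn)(const csr_t *A, const double *x, double *y);

    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Benchmark each generated kernel variant on the target matrix and
     * return the index of the fastest one. */
    int pick_best_variant(spmv_fn variants[], int nvariants,
                          const csr_t *A, const double *x, double *y)
    {
        int    best = 0;
        double best_time = 1e30;
        for (int v = 0; v < nvariants; v++) {
            double t0 = now_seconds();
            for (int trial = 0; trial < 10; trial++)   /* amortize timer noise */
                variants[v](A, x, y);
            double t = (now_seconds() - t0) / 10.0;
            if (t < best_time) { best_time = t; best = v; }
        }
        return best;
    }

A heuristic search simply prunes the variant list (for example, to the block sizes the compression heuristic on the later slides considers promising) before this loop runs.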
SLIDE 20

Exploiting NUMA, Affinity

 Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data
 Bind each sub-matrix and the thread that processes it together
 Explored libnuma, Linux, and Solaris routines
 Adjacent blocks bound to adjacent cores

[Figure: dual-socket Opteron data placement — single thread; multiple threads on one memory controller; multiple threads across both memory controllers]
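A hedged sketch of the libnuma/affinity idea on Linux (the helper and its parameters are illustrative; the talk also explored Solaris routines):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <pthread.h>
    #include <sched.h>
    #include <numa.h>      /* link with -lnuma -pthread */

    /* Pin the calling thread to core `tid` and allocate its sub-matrix on
     * that core's NUMA node, so the thread streams from local DRAM. */
    void *alloc_submatrix_local(size_t bytes, int tid, int cores_per_node)
    {
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(tid, &cpus);
        pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);

        return numa_alloc_onnode(bytes, tid / cores_per_node);
    }

Because adjacent thread IDs map to adjacent cores here, adjacent sub-matrices naturally land on adjacent sockets' DRAM.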

SLIDE 21

Performance (+NUMA)

[Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 — Naïve Single Thread, Naïve Pthreads, +NUMA/Affinity]

SLIDE 22

Performance (+SW Prefetching)

[Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 — adds +Software Prefetching]

SLIDE 23

Matrix Compression

 For memory-bound kernels, minimizing memory traffic should maximize performance

 Compress the metadata

  • Exploit structure to eliminate metadata

 Heuristic: select the compression that minimizes the matrix size (a sizing sketch follows below):

  • power-of-2 register blocking
  • CSR/COO format
  • 16b/32b indices
  • etc.

 Side effect: the matrix may be compressed to the point where it fits entirely in cache
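A sketch of the sizing arithmetic behind that heuristic (bcsr_bytes and its arguments are illustrative; nblocks would come from a quick scan counting r x c blocks after zero fill):

    #include <stddef.h>
    #include <stdint.h>

    /* Footprint of an r x c register-blocked CSR variant that stores one
     * column index of idx_bytes bytes per block. */
    size_t bcsr_bytes(int nrows, long nblocks, int r, int c, int idx_bytes)
    {
        size_t values  = (size_t)nblocks * r * c * sizeof(double); /* incl. fill */
        size_t indices = (size_t)nblocks * idx_bytes;
        size_t rowptrs = ((size_t)nrows / r + 1) * sizeof(uint32_t);
        return values + indices + rowptrs;
    }

The tuner would evaluate this for power-of-two r and c, 16- and 32-bit indices (16-bit only when the column range fits), and the CSR/COO alternatives, then build whichever candidate is smallest.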

SLIDE 24

Performance (+matrix compression)

[Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 — adds +Matrix Compression]

SLIDE 25

Cache and TLB Blocking

 Accesses to the matrix and destination vector are streaming
 But access to the source vector can be random
 Reorganize the matrix (and thus the access pattern) to maximize reuse
 Applies equally to TLB blocking (caching PTEs)
 Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched stays below the cache (or TLB) capacity; apply all previous optimizations individually to each cache block (see the sketch below)
 Search: neither, cache, cache & TLB
 Better locality comes at the expense of confusing the hardware prefetchers

[Figure: cache-blocked access pattern — each block of A touches only a window of x]
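A sketch of that column-growing heuristic (col_has_nnz[], the 8-doubles-per-line assumption, and the names are illustrative):

    /* Grow one cache block of columns, one source-vector cache line
     * (8 doubles of x) at a time, stopping once the block would touch
     * more distinct x cache lines than the budget allows.
     * col_has_nnz[c] is 1 if column c is touched by the current rows. */
    int choose_block_width(int first_col, int ncols,
                           const char *col_has_nnz, long lines_budget)
    {
        long lines = 0;
        int  col   = first_col;
        while (col < ncols) {
            int line_end = col + 8 < ncols ? col + 8 : ncols;
            int touched  = 0;
            for (int c = col; c < line_end; c++)
                touched |= col_has_nnz[c];
            if (touched && ++lines > lines_budget)
                break;
            col = line_end;
        }
        return col - first_col;   /* width of this cache block in columns */
    }

For TLB blocking the same loop would run over pages instead of cache lines, with the TLB entry count as the budget.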

SLIDE 26

Performance (+cache blocking)

[Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 — adds +Cache/TLB Blocking]

SLIDE 27

Banks, Ranks, and DIMMs

 In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory
 Most memory controllers have a finite capability to reorder requests (DMA can avoid or minimize this)
 Addressing/bank conflicts become increasingly likely
 Adding more DIMMs and configuring the ranks can help
 The Clovertown system was already fully populated

SLIDE 28

Performance (more DIMMs, …)

[Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 — adds +More DIMMs, Rank configuration, etc.]

SLIDE 29

Performance (more DIMMs, …)

[Same charts as slide 28]

Clovertown: 4% of peak flops, 52% of bandwidth
Opteron: 20% of peak flops, 66% of bandwidth
Niagara2: 52% of peak flops, 54% of bandwidth

SLIDE 30

Performance (more DIMMs, …)

[Same charts as slide 28]

Clovertown: 3 essential optimizations
Opteron: 4 essential optimizations
Niagara2: 2 essential optimizations

SLIDE 31

Comments Performance

Cell Implementation

SLIDE 32

Cell Implementation

 No vanilla C implementation (aside from the PPE)

 Even SIMDized double precision is extremely weak

  • Scalar double precision is unbearable
  • Minimum register blocking is 2x1 (SIMDizable); a sketch follows below
  • Can increase memory traffic by 66%

 The cache blocking optimization is transformed into local store blocking

  • Spatial and temporal locality is captured by software when the matrix is optimized
  • In essence, the high bits of the column indices are grouped into DMA lists

 No branch prediction

  • Replace branches with conditional operations

 In some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell

 Despite the performance, Cell is still handicapped by double precision
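For illustration, a plain-C sketch of the 2x1 register-blocked (BCSR) inner loop the SPE code starts from; the names are illustrative, and each block's two values map onto one 128-bit SIMD register on Cell:

    /* Each block holds two vertically adjacent entries sharing one column
     * index; where only one of the two rows has a nonzero, an explicit 0.0
     * is stored (the fill behind the "up to 66% more traffic" point above). */
    void spmv_bcsr_2x1(int nblockrows,
                       const int *blockptr, const int *blockcol,
                       const double *blockvals,   /* 2 values per block */
                       const double *x, double *y)
    {
        for (int br = 0; br < nblockrows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = blockptr[br]; k < blockptr[br + 1]; k++) {
                double xv = x[blockcol[k]];
                y0 += blockvals[2 * k + 0] * xv;   /* even row of the pair */
                y1 += blockvals[2 * k + 1] * xv;   /* odd row of the pair  */
            }
            y[2 * br + 0] = y0;
            y[2 * br + 1] = y1;
        }
    }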

SLIDE 33

Performance

[Charts: Intel Clovertown, AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine]

SLIDE 34

Performance

[Same charts as slide 33]

Cell: 39% of peak flops, 89% of bandwidth

SLIDE 35

Multicore MPI Implementation

This is the default approach to programming multicore

SLIDE 36

Multicore MPI Implementation

 Used PETSc with shared-memory MPICH
 Used OSKI (developed at UCB) to optimize each thread
 = A highly optimized MPI implementation

[Charts: Intel Clovertown, AMD Opteron — MPI (autotuned) vs. Pthreads (autotuned) vs. Naïve Single Thread]

SLIDE 37

Summary

SLIDE 38

Median Performance & Efficiency

Used a digital power meter to measure sustained system power

  • FBDIMM drives up Clovertown and Niagara2 power
  • Right chart: power efficiency = sustained MFlop/s / sustained Watts

The default approach (MPI) achieves very low performance and efficiency

SLIDE 39

Summary

 Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance

 Most machines achieved less than 50-60% of DRAM bandwidth

 Niagara2 delivered both very good performance and productivity

 Cell delivered very good performance and efficiency

  • 90% of memory bandwidth
  • High power efficiency
  • Easily understood performance
  • Extra traffic = lower performance (future work can address this)

 The multicore-specific autotuned implementation significantly outperformed a state-of-the-art MPI implementation

  • Matrix compression geared towards multicore
  • NUMA
  • Prefetching

SLIDE 40

Acknowledgments

 UC Berkeley

  • RADLab Cluster (Opterons)
  • PSI cluster (Clovertowns)

 Sun Microsystems

  • Niagara2

 Forschungszentrum Jülich

  • Cell blade cluster
SLIDE 41

Questions?