SPARSITY: Optimization Framework For Sparse Matrix Kernels




SLIDE 1

SPARSITY: Optimization Framework For Sparse Matrix Kernels

Eun-Jin Im, Katherine Yelick, Richard Vuduc

International Journal of High Performance Computing Applications 2004 18: 135
The online version of this article can be found at: http://hpc.sagepub.com/content/18/1/135
Published by: http://www.sagepublications.com

SLIDE 2

One Operation

A ⋅ x = y

Matrix plotted with MATLAB, file from http://www.cise.ufl.edu/research/sparse/matrices/Simon/venkat01.html

SLIDE 3

Motivation

Image sources:
http://3.bp.blogspot.com/-jwj51xaDhsk/Thk3KtjWwsI/AAAAAAAAAOA/P8eNt0_MJUQ/s1600/Challenger2.gif
http://www.erneuerbareenergiequellen.com/pictures/other/oil_some_questions/oil_rig.jpg
http://eu.art.com/products/p14342284-sa-i2886553/posters.htm?ui=BFBAB751660645AA8C02F859E5BAD142
http://www.aspsys.com/userfiles/image/fluent3.jpg
http://www.bloodhoundssc.com/_db/_images/airliner_resized.jpg
http://www.fft.be/images/documents/219.jpg
http://www.onu.edu/files/images/alumni/Flow_around_object.jpg
http://t0.gstatic.com/images?q=tbn:ANd9GcQDP4JEXQNigtR04rNdj2gBvI8QpO1Sf1k2hcOMF9yXWqP_PCQb

SLIDE 4

Machines

Processor                 Clock (MHz)  Data cache sizes                   DGEMV (MFLOPS)  DGEMM (MFLOPS)
Sun UltraSPARC IIi        333          L1: 16 KB, L2: 2 MB                58              425
Intel Pentium III-Mobile  800          L1: 16 KB, L2: 256 KB              147             590
IBM Power 4               1300         L1: 64 KB, L2: 1.5 MB, L3: 32 MB   915             3500
Intel Itanium 2           900          L1: 16 KB, L2: 256 KB, L3: 3 MB    1330            3500

SLIDE 5

CSR: Compressed Sparse Row Format

Non-zeros of the example matrix: 3 5 1 7 2 4

Values: 3 5 1 7 2 4
Column Index, Row start Index: 3 1 2 2 4 2 3 5 6
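The three CSR arrays and the matrix-vector product they support can be sketched in Python as follows. This is a minimal illustration, not code from SPARSITY: the 4×4 placement of the six non-zeros is an assumption (the positions on the slide are not recoverable), indices are 0-based, and the function names `dense_to_csr` and `csr_matvec` are hypothetical.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to 0-based CSR arrays."""
    values, col_idx, row_start = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)      # non-zero value
                col_idx.append(j)     # its column
        row_start.append(len(values))  # one past the last entry of this row
    return values, col_idx, row_start


def csr_matvec(values, col_idx, row_start, x):
    """Compute y = A * x for a matrix A given in CSR form."""
    y = [0] * (len(row_start) - 1)
    for i in range(len(y)):
        for k in range(row_start[i], row_start[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


# Assumed placement of the non-zeros 3 5 1 7 2 4 (hypothetical layout).
A = [[3, 0, 5, 0],
     [0, 1, 0, 7],
     [0, 2, 0, 0],
     [0, 4, 0, 0]]
values, col_idx, row_start = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_start, [1, 1, 1, 1])
```

Each non-zero costs one indirect load of `x[col_idx[k]]`, which is the overhead the blocking optimizations on the following slides try to reduce.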

SLIDE 6

Register-Blocking

Non-zeros of the example matrix: 3 5 1 7 2 4

Values, Column Index, Row start Index: 3 1 5 7 2 4 2 2 2 3
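A register-blocked (block CSR) layout with fixed r×c blocks might look like the Python sketch below. It assumes matrix dimensions divisible by the block size and a hypothetical placement of the six non-zeros; the names `dense_to_bcsr` and `bcsr_matvec` are illustrative and not part of SPARSITY.

```python
def dense_to_bcsr(dense, r, c):
    """Store every r x c block that holds at least one non-zero, zeros included."""
    n_rows, n_cols = len(dense), len(dense[0])
    assert n_rows % r == 0 and n_cols % c == 0, "sketch assumes divisible dimensions"
    block_values, block_col, block_row_start = [], [], [0]
    for bi in range(0, n_rows, r):
        for bj in range(0, n_cols, c):
            block = [dense[bi + i][bj + j] for i in range(r) for j in range(c)]
            if any(v != 0 for v in block):
                block_values.extend(block)   # row-major inside the block
                block_col.append(bj // c)    # one column index per block
        block_row_start.append(len(block_col))
    return block_values, block_col, block_row_start


def bcsr_matvec(block_values, block_col, block_row_start, r, c, x):
    """Compute y = A * x, processing one r x c block at a time."""
    y = [0] * ((len(block_row_start) - 1) * r)
    for bi in range(len(block_row_start) - 1):
        for b in range(block_row_start[bi], block_row_start[bi + 1]):
            base = b * r * c          # start of this block's values
            x0 = block_col[b] * c     # first column covered by the block
            for i in range(r):
                acc = y[bi * r + i]
                for j in range(c):
                    acc += block_values[base + i * c + j] * x[x0 + j]
                y[bi * r + i] = acc
    return y


# Assumed placement of the non-zeros 3 5 1 7 2 4 (hypothetical layout).
A = [[3, 0, 5, 0],
     [0, 1, 0, 7],
     [0, 2, 0, 0],
     [0, 4, 0, 0]]
bv, bc, brs = dense_to_bcsr(A, 2, 2)
y = bcsr_matvec(bv, bc, brs, 2, 2, [1, 2, 3, 4])
```

Only one column index is stored per block, and in a compiled kernel the inner loops over `i` and `j` would be fully unrolled so that the block's slice of `x` and `y` can stay in registers. The price is the explicitly stored zeros.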

SLIDE 7

Example for Register-Blocking

SLIDE 8

Example Results

SLIDE 9

Performance Model: Machine Profile

SLIDE 10

Performance Model: Fill-Overhead

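Non-zeros of the example matrix: 3 5 1 7 2 4

Fill overhead (stored values, including explicit zeros, divided by original non-zeros): $\frac{12}{6} = 2$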
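The fill overhead for a given block size can be computed from the non-zero coordinates alone. The Python sketch below assumes a hypothetical placement of the six non-zeros that reproduces the ratio 12 / 6 = 2 for 2×2 blocks; `fill_overhead` is an illustrative name, and an exact count like this is only practical for small matrices.

```python
def fill_overhead(coords, r, c):
    """Stored entries after r x c blocking divided by the true non-zero count."""
    # Every distinct (block row, block column) pair becomes one stored r x c block.
    blocks = {(i // r, j // c) for i, j in coords}
    return len(blocks) * r * c / len(coords)


# Assumed (row, column) positions of the non-zeros 3 5 1 7 2 4 (hypothetical).
coords = [(0, 0), (0, 2), (1, 1), (1, 3), (2, 1), (3, 1)]
ratio_2x2 = fill_overhead(coords, 2, 2)   # 3 blocks * 4 entries / 6 non-zeros
ratio_1x1 = fill_overhead(coords, 1, 1)   # no blocking, no fill
```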

SLIDE 11

Performance Model

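Example on Intel Itanium 2 with 2×2 block-size:

Non-zeros of the example matrix: 3 5 1 7 2 4

Fill overhead: $\frac{12}{6} = 2$

Machine-profile value divided by fill overhead: $\frac{2.54}{2} = 1.27$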
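How such an estimate could drive the choice of block size is sketched below in Python. Several things here are assumptions: the reading of 2.54 as the machine-profile entry for 2×2 blocks relative to the unblocked code, the 1×1 baseline of 1.0, and the function name `choose_block_size`. Only the numbers 2.54, 2 and 1.27 come from the slide.

```python
def choose_block_size(profile, fill):
    """Pick the block size (r, c) that maximizes profile[(r, c)] / fill[(r, c)]."""
    estimates = {rc: profile[rc] / fill[rc] for rc in profile}
    best = max(estimates, key=estimates.get)
    return best, estimates


# Machine profile: performance on a dense matrix stored in blocked sparse format,
# relative to 1x1 (2.54 is from the slide, the 1x1 baseline is an assumption).
profile = {(1, 1): 1.0, (2, 2): 2.54}
# Fill overhead of the example matrix for each candidate block size.
fill = {(1, 1): 1.0, (2, 2): 2.0}

best, estimates = choose_block_size(profile, fill)
```

The profile depends only on the machine and can be measured once, while the fill overhead depends only on the matrix, so the two can be combined cheaply for every candidate block size.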

SLIDE 12

Register-Blocking Speedup: Intel Pentium III-M

SLIDE 13

Register-Blocking Speedup: Intel Itanium 2

SLIDE 14

Cache-Blocking

Non-zeros of the example matrix: 3 5 1 7 2 4

Values, Column Index, Block row start, Block start Index: 3 1 5 7 2 4 1 3 2 2 3 1 2 3 4 5 6 4 7
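One way to picture cache blocking is to cut the matrix into rectangular tiles, store each non-empty tile as a small CSR matrix, and multiply tile by tile so that only a slice of the source and destination vectors is touched at a time. The Python sketch below follows that idea under stated assumptions: the tile-list data structure, the function names, and the placement of the six non-zeros are hypothetical and do not reproduce the exact arrays on the slide.

```python
def cache_block_csr(dense, block_rows, block_cols):
    """Split the matrix into tiles and store each non-empty tile as its own CSR."""
    tiles = []
    for r0 in range(0, len(dense), block_rows):
        for c0 in range(0, len(dense[0]), block_cols):
            values, col_idx, row_start = [], [], [0]
            for row in dense[r0:r0 + block_rows]:
                for j, a in enumerate(row[c0:c0 + block_cols]):
                    if a != 0:
                        values.append(a)
                        col_idx.append(j)   # column index local to the tile
                row_start.append(len(values))
            if values:
                tiles.append((r0, c0, values, col_idx, row_start))
    return tiles


def cache_blocked_matvec(tiles, n_rows, x):
    """Compute y = A * x one tile at a time."""
    y = [0] * n_rows
    for r0, c0, values, col_idx, row_start in tiles:
        # Within a tile only x[c0 : c0 + block_cols] and
        # y[r0 : r0 + block_rows] are accessed.
        for i in range(len(row_start) - 1):
            for k in range(row_start[i], row_start[i + 1]):
                y[r0 + i] += values[k] * x[c0 + col_idx[k]]
    return y


# Assumed placement of the non-zeros 3 5 1 7 2 4 (hypothetical layout).
A = [[3, 0, 5, 0],
     [0, 1, 0, 7],
     [0, 2, 0, 0],
     [0, 4, 0, 0]]
tiles = cache_block_csr(A, 2, 2)
y = cache_blocked_matvec(tiles, len(A), [1, 2, 3, 4])
```

Unlike register blocking, no zeros are stored. Realistic tile sizes are in the thousands of rows and columns, chosen so that the active vector slices fit in cache; the 2×2 tiles here only keep the example small.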

SLIDE 15

Cache-Blocking

SLIDE 16

Benchmark Cache-Blocking

SLIDE 17

Cache-Blocking Speedup

SLIDE 18

Multiple Vectors

Y = A ⋅ [u v]

u = (u0, u1, u2, u3), v = (v0, v1, v2, v3)
Y = (y00 y01; y10 y11; y20 y21; y30 y31)
Non-zeros of A: 3 5 1 7 2 4

One vector after the other:

(1) 3⋅u0 + 0⋅u1 = y00
(2) 0⋅u0 + 1⋅u1 = y10
⋯
(nz+1) 3⋅v0 + 0⋅v1 = y01
(nz+2) 0⋅v0 + 1⋅v1 = y11

Both vectors together:

(1) 3⋅u0 + 0⋅u1 = y00
(2) 0⋅u0 + 1⋅u1 = y10
(3) 3⋅v0 + 0⋅v1 = y01
(4) 0⋅v0 + 1⋅v1 = y11

nz = number of non-zero elements in A
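The reordering amounts to loading each matrix entry once and using it for every vector before moving on. A minimal Python sketch on plain CSR, assuming a hypothetical placement of the six non-zeros and an illustrative function name `csr_matmat`:

```python
def csr_matmat(values, col_idx, row_start, X):
    """Compute Y = A * X, where row j of X holds entry j of every vector."""
    n_vec = len(X[0])
    Y = [[0] * n_vec for _ in range(len(row_start) - 1)]
    for i in range(len(Y)):
        for k in range(row_start[i], row_start[i + 1]):
            a = values[k]            # matrix entry is read once ...
            x_row = X[col_idx[k]]
            for v in range(n_vec):   # ... and reused for each vector
                Y[i][v] += a * x_row[v]
    return Y


# CSR arrays for an assumed 4x4 placement of the non-zeros 3 5 1 7 2 4.
values = [3, 5, 1, 7, 2, 4]
col_idx = [0, 2, 1, 3, 1, 1]
row_start = [0, 2, 4, 5, 6]

# Two vectors stored side by side: u = (1, 2, 3, 4), v = (1, 0, 1, 0).
X = [[1, 1], [2, 0], [3, 1], [4, 0]]
Y = csr_matmat(values, col_idx, row_start, X)
```

Multiplying one vector after the other would stream `values` and `col_idx` from memory once per vector; here they are streamed once in total, which is why the technique combines well with register blocking.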

SLIDE 19

Multiple Vectors Speedup: Intel Pentium III-M

SLIDE 20

Multiple Vectors Speedup: Intel Itanium 2

SLIDE 21

SPARSITY System

Graph: from the paper

SLIDE 22

Conclusion

4x improvement for register-blocking

2x for cache-blocking

10x for register-blocking combined with multiple vectors

Many publications reference SPARSITY