SLIDE 1

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard Vuduc, James Demmel, Katherine Yelick, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee
Wednesday, November 20, 2002

Berkeley Benchmarking and OPtimization (BeBOP) Project
www.cs.berkeley.edu/~richie/bebop

Computer Science Division, U.C. Berkeley, Berkeley, California, USA
SC 2002: Session on Sparse Linear Algebra


SLIDE 2

Context: Performance Tuning in the Sparse Case

Application performance is dominated by a few computational kernels.

Performance tuning today:
  • Vendor-tuned libraries (e.g., BLAS), or the user hand-tunes
  • Automatic tuning (e.g., PHiPAC/ATLAS, FFTW/SPIRAL/UHFFT)

Tuning sparse linear algebra kernels is hard. Sparse code has:
  • high bandwidth requirements (extra storage)
  • poor locality (indirect, irregular memory access)
  • poor instruction mix (data structure manipulation)

Sparse matrix-vector multiply (SpM×V) performance: less than 10% of machine peak.
Performance depends on the kernel, the architecture, and the matrix.

SLIDE 3

Example: Matrix olafu

[Spy plot of 03-olafu.rua: 1,015,156 non-zeros]

N = 16146, nnz = 1.0M, kernel = SpM×V

SLIDE 4

Example: Matrix olafu

[Spy plot, zoom-in on 03-olafu.rua: 1188 non-zeros shown]

N = 16146, nnz = 1.0M, kernel = SpM×V

A natural choice: 6×6 blocked compressed sparse row (BCSR).

Experiment: measure the performance of all r×c block sizes for r, c ∈ {1, 2, 3, 6}.
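For reference, a minimal C sketch of the r×c BCSR kernel y ← y + Ax follows. The struct fields and function name are illustrative, not SPARSITY's actual data structure; in practice SPARSITY generates a fully unrolled mini-multiply for each fixed (r, c), so that the r destination values stay in registers.

    #include <stddef.h>

    /* Illustrative r-by-c BCSR storage: blocks are r-by-c, row-major,
       stored contiguously, including any explicitly filled-in zeros.
       Assumes r divides the number of matrix rows (pad otherwise). */
    typedef struct {
        int r, c;          /* register block size */
        int n_brows;       /* number of block rows */
        int *brow_ptr;     /* start of each block row in bcol_ind; n_brows+1 entries */
        int *bcol_ind;     /* leftmost matrix column of each block */
        double *bvalues;   /* block entries, r*c per block */
    } bcsr_t;

    /* y <- y + A*x for a BCSR matrix A (sketch, unoptimized). */
    void bcsr_spmv(const bcsr_t *A, const double *x, double *y)
    {
        const double *v = A->bvalues;
        for (int I = 0; I < A->n_brows; I++) {            /* block row I */
            double *yb = &y[(size_t)I * A->r];
            for (int k = A->brow_ptr[I]; k < A->brow_ptr[I+1]; k++) {
                const double *xb = &x[A->bcol_ind[k]];    /* x slice under block */
                for (int i = 0; i < A->r; i++)            /* r-by-c mini-multiply */
                    for (int j = 0; j < A->c; j++)
                        yb[i] += v[i * A->c + j] * xb[j];
                v += A->r * A->c;
            }
        }
    }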


SLIDE 5

Speedups on Itanium: The Need for Search

[Figure: r×c register blocking performance (Mflop/s) for 03-olafu.rua on Itanium (itanium-linux-ecc); peak machine speed: 3.2 Gflop/s]

Speedup over the 1×1 (CSR) code at each block size:

          c=1   c=2   c=3   c=6
    r=1  1.00  0.86  0.99  0.93
    r=2  1.21  1.27  0.97  0.78
    r=3  1.44  0.99  1.07  0.64
    r=6  1.20  0.81  0.75  0.95

The best block size (3×1, a 1.44x speedup) is not the natural 6×6 choice, which actually slows the code down (0.95x).


SLIDE 10

Key Questions and Conclusions

How do we choose the best data structure automatically?
  • New heuristic for choosing optimal (or near-optimal) block sizes

What are the limits on the performance of blocked SpM×V?
  • Derive performance upper bounds for blocking
  • Often within 20% of the upper bound, placing limits on improvement from more "low-level" tuning
  • Performance is memory-bound: reducing data structure size is critical

Where are the new opportunities (kernels, techniques) for achieving higher performance?
  • Identify cases in which blocking does and does not work
  • Identify new kernels and opportunities for reducing memory traffic

SLIDE 11

Related Work

Automatic tuning systems and code generation:
  • PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
  • FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
  • MPI collective ops (Vadhiyar, et al. [VFD01])
  • Sparse compilers (Bik [BW99], Bernoulli [Sto97])
  • Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], ...)
  • FLAME [GGHvdG01]

Sparse performance modeling and tuning:
  • Temam and Jalby [TJ92]
  • Toledo [Tol97], White and Sadayappan [WS97], Pinar [PH99]
  • Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]
  • Gropp, et al. [GKKS99], Geus [GR99]

Sparse kernel interfaces:
  • Sparse BLAS Standard [BCD+01]
  • NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00]
  • PETSc, hypre, ...


SLIDE 13

Approach to Automatic Tuning

For each kernel:
  • Identify and generate a space of implementations
  • Search to find the fastest (using models, experiments)

The SPARSITY system for SpM×V [Im & Yelick ’99]:

Interface:
  • Input: your sparse matrix (CSR)
  • Output: data structure + routine tuned to your matrix & machine

Implementation space:
  • register-level blocking (r×c), cache blocking, multiple vectors, ...

Search:
  • Off-line: benchmarking (once per architecture)
  • Run-time: estimate matrix properties ("search") and predict the best data structure parameters



SLIDE 17

Register-Level Blocking (SPARSITY): 3x3 Example

[Figure: 3×3 register blocking example; (688 true non-zeros) + (383 explicit zeros) = 1071 stored nz]

BCSR imposes a uniform, aligned grid of blocks; fill-in zeros trade extra flops for better storage and locality efficiency.

In this example, 50% fill led to a 1.5x speedup on a Pentium III.
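The fill ratio this trade-off hinges on can be computed exactly in one pass over the matrix. Below is a C sketch (not SPARSITY's code, which instead estimates the ratio by examining only a fraction of the block rows):

    #include <stdlib.h>

    /* Exact fill ratio of r-by-c BCSR for an m-by-n CSR matrix:
       fill(r,c) = (stored values, incl. explicit zeros) / (true non-zeros).
       Blocks are aligned to a uniform grid; runs in O(nnz) per (r,c). */
    double bcsr_fill_ratio(int m, int n, const int *row_ptr,
                           const int *col_ind, int r, int c)
    {
        int n_bcols = (n + c - 1) / c;
        long nnz = row_ptr[m], n_blocks = 0;
        /* last[J] = block row that most recently touched block column J */
        int *last = malloc(n_bcols * sizeof *last);
        for (int J = 0; J < n_bcols; J++) last[J] = -1;

        for (int i = 0; i < m; i++) {            /* rows in ascending order */
            int I = i / r;                       /* enclosing block row */
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++) {
                int J = col_ind[k] / c;          /* enclosing block column */
                if (last[J] != I) { last[J] = I; n_blocks++; }
            }
        }
        free(last);
        return (double)(n_blocks * r * c) / (double)nnz;
    }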


SLIDE 18

Search: Choosing the Block Size

Off-line benchmarking (once per architecture):
  • Measure Dense Performance(r,c): the Mflop/s of a dense matrix stored in sparse r×c blocked format

At run-time, when the matrix is known:
  • Estimate Fill Ratio(r,c) for all r, c:
    Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
  • Choose the r, c that maximize

    Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)

(This replaces the previous SPARSITY heuristic.)
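A minimal sketch of the selection loop, assuming the dense profile is stored as a table dense_mflops[r][c] from the off-line benchmark and reusing the bcsr_fill_ratio sketch above (names are illustrative, not SPARSITY's API):

    /* Pick the (r,c), up to 12x12 as in the profiles below, that
       maximizes dense_mflops[r][c] / fill_ratio(r,c). */
    enum { RMAX = 12, CMAX = 12 };

    void choose_block_size(int m, int n, const int *row_ptr, const int *col_ind,
                           const double dense_mflops[RMAX+1][CMAX+1],
                           int *r_best, int *c_best)
    {
        double best = 0.0;
        *r_best = *c_best = 1;
        for (int r = 1; r <= RMAX; r++)
            for (int c = 1; c <= CMAX; c++) {
                double fill = bcsr_fill_ratio(m, n, row_ptr, col_ind, r, c);
                double est  = dense_mflops[r][c] / fill;  /* estimated Mflop/s */
                if (est > best) { best = est; *r_best = r; *c_best = c; }
            }
    }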


SLIDE 19

Off-line Benchmarking [Intel Pentium III]

[Figure: register blocking performance (Mflop/s) for a dense matrix (n=1000) in sparse r×c format, r, c = 1..12, on Pentium III (pentium3-linux-icc)]

Top 10 codes labeled by speedup over the unblocked code: 2.54, 2.49, 2.45, 2.44, 2.41, 2.39, 2.38, 2.39, 2.37, 2.36. Max speedup = 2.54 (at 2×10).

SLIDE 20

[Figure: dense-matrix register blocking profiles (Mflop/s, r, c = 1..12) on four platforms; each panel labeled with its maximum speedup over the unblocked code]

  • 333 MHz Sun Ultra 2i (2.03): dense n=1000, ultra-solaris
  • 500 MHz Intel Pentium III (2.54): dense n=1000, pentium3-linux-icc
  • 375 MHz IBM Power3 (1.22): dense n=2000, power3-aix
  • 800 MHz Intel Itanium (1.55): dense n=1000, itanium-linux-ecc

SLIDE 21

Performance Bounds for Register Blocking

How close are we to the speed limit of blocking?

Upper bound on performance = (flops) / (lower bound on time):
  • Flops = 2 × (number of true non-zeros)
  • Lower bound on time rests on two key assumptions: consider only memory operations, and count only compulsory misses (i.e., ignore conflicts)
  • Other features of the model: accounts for the cache line size; the bound is a function of the matrix and of (r, c)

Model of execution time (see paper): charge the full latency α_i for hits at each cache level i, e.g., with two cache levels:

    T = (L1 hits)·α₁ + (L2 hits)·α₂ + (memory accesses)·α_mem
      = (loads)·α₁ + (L1 misses)·(α₂ − α₁) + (L2 misses)·(α_mem − α₂)
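Turning this model into an upper bound on Mflop/s is mechanical. Here is a hedged C sketch for a two-level cache, with latencies in cycles; the load count and the lower-bound miss counts come from the compulsory-miss model, and the names are ours, not the paper's:

    /* Upper bound on SpMxV performance implied by the latency model:
       T >= loads*a1 + m1*(a2 - a1) + m2*(a_mem - a2) cycles. */
    double mflops_upper_bound(long true_nnz, double loads,
                              double m1 /* L1 miss lower bound */,
                              double m2 /* L2 miss lower bound */,
                              double alpha1, double alpha2, double alpha_mem,
                              double clock_mhz)
    {
        double t_cycles = loads * alpha1
                        + m1 * (alpha2 - alpha1)
                        + m2 * (alpha_mem - alpha2);
        double flops  = 2.0 * (double)true_nnz; /* one multiply + one add per nz */
        double t_usec = t_cycles / clock_mhz;   /* cycles / (cycles per usec) */
        return flops / t_usec;                  /* flops per usec = Mflop/s */
    }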


SLIDE 22

Overview of Performance Results

Experimental setup:
  • Four machines: Pentium III, Ultra 2i, Power3, Itanium
  • 44 matrices: dense, finite element (FEM), mixed, linear programming (LP)
  • Misses and cycles measured using PAPI [BDG+00]
  • Reference: CSR (i.e., 1×1, "unblocked")

Main observations:
  • SPARSITY vs. reference: up to 2.5x faster, especially on FEM matrices
  • Block size selection: chooses within 10% of the best
  • SPARSITY performance is typically within 20% of the upper bound
  • SPARSITY is least effective on the Power3


SLIDE 26

Performance Results: Intel Pentium III

[Figure: per-matrix performance summary (Mflop/s) on Pentium III (pentium3-linux-icc); matrices grouped as dense (D), FEM, FEM (variable block), mixed, and LP; series: analytic upper bound, upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), reference]

DGEMV (n=1000): 96 Mflop/s


SLIDE 27

Performance Results: Sun Ultra 2i

[Figure: per-matrix performance summary (Mflop/s) on Ultra 2i (ultra-solaris), same series and matrix groups as above]

DGEMV (n=1000): 58 Mflop/s


SLIDE 28

Performance Results: Intel Itanium

[Figure: per-matrix performance summary (Mflop/s) on Itanium (itanium-linux-ecc), same series and matrix groups as above]

DGEMV (n=1000): 310 Mflop/s


SLIDE 29

Performance Results: IBM Power3

[Figure: per-matrix performance summary (Mflop/s) on Power3 (power3-aix); matrix groups dense (D), FEM, FEM (variable block), and LP; same series as above]

DGEMV (n=2000): 260 Mflop/s


SLIDE 30

Conclusions

Tuning can be difficult, even when the matrix structure is known: performance is a complicated function of the architecture and the matrix.

The new heuristic for choosing the block size selects an optimal or near-optimal implementation (performance within 5–10% of the best).

The limits of low-level tuning for blocking are near: performance is often within 20% of the upper bound, particularly for FEM matrices.

Unresolved: closing the gap on the Power3.

SLIDE 31

BeBOP: Current and Future Work (1/2)

Further performance improvements:
  • symmetry (1.5–2x speedups)
  • diagonals, block diagonals, and bands (1.2–2x)
  • splitting for variable block structure (1.3–1.7x)
  • reordering to create dense structure (1.7x)
  • cache blocking (1.5–4x)
  • multiple vectors (2–7x), and combinations ...
  • How to choose the optimizations & tuning parameters?

Sparse triangular solve (ICS'02 POHLL workshop paper):
  • hybrid sparse/dense structure (1.2–1.8x)

Higher-level kernels that permit reuse (see the sketch after this list):
  • AAᵀx, AᵀAx (1.5–3x)
  • Ax and Aᵀy simultaneously, Aᵏx, RARᵀ, ... (future work)
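The reuse win in these kernels comes from touching each row of A only once. For example, since AᵀAx = Σᵢ aᵢ(aᵢᵀx) over the rows aᵢ of A, both uses of a row happen while it is still in registers or cache, instead of streaming A through memory twice for two separate SpM×Vs. A hedged C sketch of that idea (illustrative, not BeBOP's code):

    /* y <- A^T * A * x for an m-by-n CSR matrix A; x and y have length n.
       Each row is read from memory once and used twice: t = a_i . x,
       then y += t * a_i. Caller must zero y. */
    void spmv_ata(int m, const int *row_ptr, const int *col_ind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            int start = row_ptr[i], end = row_ptr[i+1];
            double t = 0.0;
            for (int k = start; k < end; k++)   /* t = a_i . x */
                t += val[k] * x[col_ind[k]];
            for (int k = start; k < end; k++)   /* y += t * a_i */
                y[col_ind[k]] += t * val[k];
        }
    }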


SLIDE 32

BeBOP: Current and Future Work (2/2)

An automatically tuned sparse matrix library:
  • Code generation via sparse compilers (Bernoulli; Bik)
  • Plan to extend the new Sparse BLAS standard by one routine to support tuning

Architectural issues:
  • Improvements for the Power3?
  • Latency vs. bandwidth (see paper)
  • Using models to explore the architectural design space

SLIDE 33

EXTRA SLIDES


SLIDE 34

Example: No Big Surprises on Sun Ultra 2i

[Figure: r×c register blocking performance (Mflop/s) for 03-olafu.rua on Ultra 2i (ultra-solaris)]

Speedup over the 1×1 (CSR) code at each block size:

          c=1   c=2   c=3   c=6
    r=1  1.00  1.09  1.19  1.21
    r=2  1.17  1.25  1.30  1.30
    r=3  1.31  1.34  1.41  1.39
    r=6  1.40  1.39  1.42  1.53

Here the natural 6×6 block size is also the best (1.53x).

SLIDE 35

Where in Memory is the Time Spent?

[Figure: modeled fraction of cycles spent at each memory level (L1, L2, L3, memory), using model hit/miss counts, for the exhaustive best implementation averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3)]

SLIDE 36

Cache Miss Bound Verification: Sun Ultra 2i (L1)

[Figure: L1 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; ultra-solaris]

SLIDE 37

Cache Miss Bound Verification: Sun Ultra 2i (L2)

[Figure: L2 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; ultra-solaris]

SLIDE 38

Cache Miss Bound Verification: Intel Pentium III (L1)

[Figure: L1 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; pentium3-linux-icc]

SLIDE 39

Cache Miss Bound Verification: Intel Pentium III (L2)

[Figure: L2 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; pentium3-linux-icc]

SLIDE 40

Cache Miss Bound Verification: IBM Power3 (L1)

[Figure: L1 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; power3-aix]

SLIDE 41

Cache Miss Bound Verification: IBM Power3 (L2)

[Figure: L2 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; power3-aix]

SLIDE 42

Cache Miss Bound Verification: Intel Itanium (L2)

[Figure: L2 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; itanium-linux-ecc]

SLIDE 43

Cache Miss Bound Verification: Intel Itanium (L3)

[Figure: L3 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; itanium-linux-ecc]

SLIDE 44

Latency vs. Bandwidth

[Figure: fraction of peak main memory bandwidth (sustainable per STREAM) achieved on Ultra 2i, Pentium III, Power3, and Itanium by a range of kernels (a[i]=b[i]; a[i]=α*b[i]; a[i]=b[i]+c[i]; a[i]=b[i]+α*c[i]; s+=a[i]; s+=a[i]*b[i]; s+=a[i], k+=ind[i]; s+=a[i]*x[ind[i]] with x on-chip; s+=a[i]*x[ind[i]] with x off-chip; DGEMV; SpM×V on a dense matrix, 1×1 and best block size), compared against the full-latency model]

SLIDE 45

Related Work (2/2)

Compilers (analysis and models); run-time selection:
  • CROPS (UCSD/Carter, Ferrante, et al.)
  • TUNE (Chatterjee, et al.)
  • Iterative compilation (O'Boyle, et al., 1998)
  • Broadway (Guyer and Lin, ’99)
  • Brewer (’95); ADAPT (Voss, 2000)

Interfaces: Sparse BLAS; PSBLAS; PETSc

Sparse triangular solve:
  • SuperLU / MUMPS / SPOOLES / UMFPACK / PSPASES ...
  • Approximation: Alvarado (’93); Raghavan (’98)
  • Scalability: Rothberg (’92; ’95); Gupta (’95); Li, Coleman (’88)

SLIDE 46

What is the Cost of Search?

[Figure: block-size selection overhead on Itanium (itanium-linux-ecc), per matrix, in units of reference SpM×V operations (up to roughly 45), split into the heuristic and the data structure rebuild; per-matrix labels range from .71 to .86]

SLIDE 47

Where in Memory is the Time Spent? (PAPI Data)

[Figure: fraction of cycles spent at each memory level (L1, L2, L3, memory), using PAPI hit/miss counts, for the exhaustive best implementation averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3)]

SLIDE 48

Fill: Some Surprises!

Sometimes it is faster to fill in many zeros.

    Matrix  Dense speedup  Fill ratio  Size(BCSR)/Size(CSR)  Perf(BCSR)/Perf(CSR)  Platform
      11        1.09          1.70            1.26                  1.55           Itanium
      13        1.50          1.52            1.07                  2.30           Pentium 3
      17        1.04          1.59            1.23                  1.54           Itanium
      27        1.16          1.94            1.47                  1.54           Itanium
      27        1.10          1.53            1.25                  1.23           Ultra 2i
      29        1.00          1.98            1.44                  1.89           Pentium 3

SLIDE 49

References

[BACD97] J. Bilmes, K. Asanović, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARC. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BCD+01] S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. Wolff von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. www.netlib.org/blast.

[BDG+00] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576–1587, 1999.

[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527–550, December 2000.

[FDZ99] Basilio B. Fraguela, Ramón Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Stephen Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.

SLIDE 50

[GKKS99] William D. Gropp, D. K. Kaushik, David E. Keyes, and Barry F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241–248, 1999.

[GR99] Roman Geus and S. Röllin. Towards a fast parallel sparse matrix-vector multiplication. In E. H. D'Hollander, J. R. Joubert, F. J. Peters, and H. Sips, editors, Proceedings of the International Conference on Parallel Computing (ParCo), pages 308–315. Imperial College Press, 1999.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201–210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215–224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra. Master's thesis, Technische Universität Chemnitz, 1998.

[NGLPJ96] J. J. Navarro, E. García, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301–308, Philadelphia, PA, USA, May 1996.

[PH99] Ali Pinar and Michael Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of Supercomputing, 1999.

[PSVM01] Markus Püschel, Bryan Singer, Manuela Veloso, and José M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97–106, San Francisco, CA, May 2001. Springer.

SLIDE 51

[RP96] K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.

[Saa94] Yousef Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing ’92, 1992.

[Tol97] Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41–50, San Francisco, CA, May 2001. Springer.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3–25, 2001.

[WS97] James B. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the International Conference on High-Performance Computing, 1997.