

SLIDE 1

Automatic Performance Tuning and Analysis of Sparse Triangular Solve

Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James W. Demmel, Katherine A. Yelick
June 22, 2002
Berkeley Benchmarking and OPtimization (BeBOP) Project

www.cs.berkeley.edu/~richie/bebop

Computer Science Division, U.C. Berkeley

Automatic Performance Tuning and Analysis of Sparse Triangular Solve – p.1/31

SLIDE 2

Context: High-Performance Libraries

Application performance dominated by a few computational kernels

  • Solving PDEs (linear algebra ops)
  • Google (sparse matrix-vector multiply)
  • Multimedia (signal processing)

Performance tuning today

  • Vendor-tuned standardized libraries (e.g., BLAS)
  • User tunes by hand

Automated tuning for dense linear algebra, FFTs, ...

  • PHiPAC/ATLAS (dense linear algebra)
  • FFTW/SPIRAL/UHFFT (signal processing)

SLIDE 3

Problem Area: Sparse Matrix Kernels

Performance issues in sparse linear algebra

  • High bandwidth requirements and poor instruction mix
  • Depends on architecture, kernel, and matrix
  • How to select data structures, algorithms? At run-time?

Approach to automatic tuning: for each kernel,

  • Identify and generate a space of implementations
  • Search (models, experiments) to find the fastest one

Early success: SPARSITY (Im & Yelick ’99) for sparse matrix-vector multiply (SpM×V)

This talk: Sparse triangular solve (SpTS), arising in sparse Cholesky and LU factorization (uniprocessor)
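For concreteness, the baseline SpTS kernel is forward substitution over a compressed sparse row (CSR) matrix. The following is a minimal pure-Python sketch, not the tuned implementation discussed in this talk; it assumes each row stores its diagonal entry as the last nonzero of that row.

```python
def csr_lower_solve(val, col, rowptr, b):
    """Solve L x = b where L is lower triangular with nonzero diagonal,
    in CSR form (val/col/rowptr), diagonal stored last in each row."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        # Accumulate off-diagonal contributions L[i][j] * x[j], j < i.
        for k in range(rowptr[i], rowptr[i + 1] - 1):
            s -= val[k] * x[col[k]]
        # Divide by the diagonal entry L[i][i].
        x[i] = s / val[rowptr[i + 1] - 1]
    return x
```

The irregular inner loop and the value/index streams are what make this kernel memory-bound, motivating the tuning techniques below.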


SLIDE 4

Sparse Triangular Matrix Example

  • raefsky4 (structural problem) + SuperLU 2.0 + colmmd
  • Dimension: 19779
  • No. of non-zeros: 12.6 M
  • Dense trailing triangle: dim = 2268, 20% of total nnz

SLIDE 5

Idea: Sparse/Dense Partitioning

Partition the matrix into sparse ($L_1$, $L_2$) and dense ($L_D$) parts:

$$\begin{bmatrix} L_1 & \\ L_2 & L_D \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$

Leads to 1 SpTS, 1 SpM×V, and 1 Dense TS:

$$L_1 x_1 = b_1 \qquad (1)$$

$$\hat{b}_2 = b_2 - L_2 x_1 \qquad (2)$$

$$L_D x_2 = \hat{b}_2 \qquad (3)$$

SPARSITY optimizations for (1)–(2); tuned BLAS for (3).
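The three-step solve can be sketched directly. This illustration uses a dense list-of-lists as a stand-in for the sparse structure, and plain loops where a tuned implementation would call the blocked SpTS/SpM×V kernels and a BLAS triangular solve (TRSV); `s` is the switch point (hypothetical parameter name).

```python
def partitioned_solve(L, b, s):
    """Solve L x = b by splitting at row/column s:
       (1) L1 x1 = b1         (sparse triangular solve)
       (2) b2' = b2 - L2 x1   (sparse matrix-vector multiply)
       (3) LD x2 = b2'        (dense triangular solve, e.g. BLAS TRSV)"""
    n = len(b)
    x = [0.0] * n
    # (1) Forward solve on the leading s-by-s triangle L1.
    for i in range(s):
        t = b[i] - sum(L[i][j] * x[j] for j in range(i))
        x[i] = t / L[i][i]
    # (2) Update the trailing right-hand side: b2' = b2 - L2 x1.
    b2 = [b[i] - sum(L[i][j] * x[j] for j in range(s)) for i in range(s, n)]
    # (3) Forward solve on the dense trailing triangle LD.
    for i in range(s, n):
        t = b2[i - s] - sum(L[i][j] * x[j] for j in range(s, i))
        x[i] = t / L[i][i]
    return x
```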


SLIDE 6

Register Blocking (SPARSITY)

[Figure: 4×3 register blocking example; 50×50 spy plot, nz = 598]

  • Store r×c dense blocks; multiply/solve block-by-block
  • Fill in explicit zeros
  • 1.3x–2.5x speedup on FEM matrices (SpM×V)
  • Reduced storage overhead over, e.g., CSR
  • Block ops are fully unrolled – improves register reuse
  • Trade off extra computation for efficiency
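The cost of filling in explicit zeros is captured by the fill ratio: stored entries (after padding each nonempty r×c block) divided by true nonzeros. A sketch over a 0/1 sparsity pattern, using a dense list-of-lists as a stand-in:

```python
def fill_ratio(A, r, c):
    """Fill ratio of an r-by-c register blocking of pattern A:
    (entries stored after padding blocks with explicit zeros) / (true nnz)."""
    n, m = len(A), len(A[0])
    nnz = sum(1 for row in A for v in row if v != 0)
    blocks = 0
    for bi in range(0, n, r):
        for bj in range(0, m, c):
            # A block is stored in full (zeros included) iff it has any nonzero.
            if any(A[i][j] != 0
                   for i in range(bi, min(bi + r, n))
                   for j in range(bj, min(bj + c, m))):
                blocks += 1
    return blocks * r * c / nnz
```

A diagonal pattern blocked 2×2, for instance, doubles the stored entries, which is exactly the trade-off between wasted flops and unrolled, register-resident block operations.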


SLIDE 7

Tuning Parameter Selection

Parameters: the switch point $s$, and the register block size $r \times c$

Off-line profiling

  • Benchmark routines on synthetic data
  • Only needed once per architecture

At run-time (when matrix is known)

  • Determine or estimate matrix properties (e.g., fill ratio, size of trailing triangle)
  • Combine with data collected off-line
  • Convert to new data structure

In practice, total run-time cost to select and reorg: e.g., 10–30 naïve solves on Itanium

SLIDE 8

Performance Bounds

Upper bounds on performance (Mflop/s)?

  • Flops: 2 × (number of non-zeros) − (dimension)
  • Full latency cost model of execution time:

$$T = \sum_{i=1}^{\kappa} h_i \alpha_i + m_\kappa \alpha_{\mathrm{mem}} \qquad (4)$$

where $h_i$ and $\alpha_i$ are the number of hits and the access latency at cache level $i$, and $m_\kappa$ is the number of misses at the lowest cache level $\kappa$.

Lower bound on misses: ignore conflict misses on vectors

$$M_i^{\mathrm{lower}}(r,c) = \frac{1}{l_i}\left[\, 8\, f_{rc}\, k \;+\; 4\,\frac{f_{rc}\, k}{rc} \;+\; 4\left\lceil\frac{n}{r}\right\rceil \;+\; 8 \cdot 2n \,\right] \qquad (5)$$

where $l_i$ is the line size in bytes at level $i$, $k$ the number of true non-zeros, and $f_{rc}$ the $r{\times}c$ fill ratio; the terms count the blocked values (8-byte doubles), block column indices (4-byte integers), block row pointers, and one compulsory pass over the two length-$n$ vectors.
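The lower bound amounts to tallying compulsory data traffic and dividing by the line size. A sketch, assuming 8-byte values, 4-byte indices, and one pass over the two length-n vectors (these byte sizes are illustrative assumptions, and conflict misses on the vectors are ignored as on this slide):

```python
import math

def miss_lower_bound(nnz, n, r, c, fill, line_bytes):
    """Lower bound on cache misses for r-by-c blocked SpTS:
    total compulsory bytes / cache line size."""
    stored = fill * nnz                      # nonzeros incl. explicit zeros
    total_bytes = (8 * stored                # matrix values (doubles)
                   + 4 * stored / (r * c)    # one column index per block
                   + 4 * math.ceil(n / r)    # block row pointers
                   + 8 * 2 * n)              # x and b vectors, one pass each
    return total_bytes / line_bytes
```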


SLIDE 9

Performance Results: Intel Itanium

[Figure: Sparse Triangular Solve performance summary (Mflop/s) on Intel Itanium (itanium-linux-ecc) for the matrices dense, memplus, wang4, ex11, raefsky4, goodwin, and lhr10. Series: Reference; Reg. Blocking (RB); Switch-to-Dense (S2D); RB + S2D; Analytic upper bound; Analytic lower bound; PAPI upper bound.]

SLIDE 10

Conclusions and Directions

Limits of “low-level” tuning are near

  • Can we approach bandwidth limits?
  • Other kernels? e.g., $A^T A x$, $A^k x$, sparse triple product
  • Other structures? multiple vectors, symmetry, reordering

Interface from/to libraries and applications?

  • Leverage existing generators (e.g., Bernoulli)
  • Hybrid on-line, off-line optimizations

SpTS-specific future work

  • Symbolic structure; other fill-reducing orderings
  • Refinements to switch point selection
  • Incomplete Cholesky and LU preconditioners

SLIDE 11

Related Work (1/2)

Automatic tuning systems

  • PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
  • FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
  • MPI collective ops (Vadhiyar, et al. [VFD01])

Code generation

  • FLAME [GGHvdG01]
  • Sparse compilers (Bik [BW99], Bernoulli [Sto97])
  • Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], ...)

Sparse performance modeling

  • Temam and Jalby [TJ92]
  • White and Sadayappan [WS97]
  • Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]

SLIDE 12

Related Work (2/2)

Compilers (analysis and models); run-time selection

  • CROPS (UCSD/Carter, Ferrante, et al.)
  • TUNE (Chatterjee, et al.)
  • Iterative compilation (O’Boyle, et al., 1998)
  • Broadway (Guyer and Lin, ’99)
  • Brewer (’95); ADAPT (Voss, 2000)

Interfaces: Sparse BLAS; PSBLAS; PETSc

Sparse triangular solve

  • SuperLU / MUMPS / SPOOLES / UMFPACK / PSPASES, ...
  • Approximation: Alvarado (’93); Raghavan (’98)
  • Scalability: Rothberg (’92; ’95); Gupta (’95); Li, Coleman (’88)

SLIDE 13

—End—


SLIDE 14

Tuning Parameter Selection

First, select the switch point $s$; at run-time:

  • Assume matrix in CSR format on input
  • Scan bottom row from the diagonal until two consecutive zeros are found
  • Fill vs. efficiency trade-off

Then, select the register block size $r \times c$:

  • Maximize, over all $r$ and $c$,

$$\widehat{\mathrm{Mflops}}(r,c) = \frac{\mathrm{Mflops}_{\mathrm{dense}}(r,c)}{\mathrm{fill}(r,c)} \qquad (6)$$

where $\mathrm{Mflops}_{\mathrm{dense}}(r,c)$ is the off-line dense register profile and $\mathrm{fill}(r,c)$ is the estimated fill ratio on the given matrix.

Total cost to select and reorg.: e.g., 10–30 naïve solves on Itanium
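The block-size heuristic (pick the r×c maximizing the off-line dense profile divided by the run-time fill estimate) reduces to a one-line search. The dictionaries and the numbers in the test are illustrative stand-ins, not measured data:

```python
def choose_block_size(dense_profile, fill):
    """Pick the register block size (r, c) maximizing
    dense_profile[(r, c)] / fill[(r, c)].
    dense_profile: off-line Mflop/s on a dense matrix in sparse format.
    fill: run-time fill-ratio estimates for the actual matrix."""
    return max(dense_profile, key=lambda rc: dense_profile[rc] / fill[rc])
```

This is why the run-time cost stays low: the expensive benchmarking is done once per architecture, and only the fill estimation and this maximization happen per matrix.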


SLIDE 15

Matrix Benchmark Suite

                                                   Dense Trailing Triangle
Name      Application Area      Dim.    Nnz in L   Dim.   Density   % Total Nnz
dense     Dense matrix          1000    500k       1000   100.0%    100.0%
memplus   Circuit simulation    17758   2.0M       1978   97.7%     96.8%
wang4     Device simulation     26068   15.1M      2810   95.0%     24.8%
ex11      Fluid flow            16614   9.8M       2207   88.0%     22.0%
raefsky4  Structural mechanics  19779   12.6M      2268   100.0%    20.4%
goodwin   Fluid mechanics       7320    1.0M       456    65.9%     6.97%
lhr10     Chemical processes    10672   369k       104    96.3%     1.43%

SLIDE 16

Register Profile (Intel Itanium)

[Figure: Register blocking performance (Mflop/s) on a dense n=1000 matrix, itanium-linux-ecc, over row block size r and column block size c (1–12 each); the ten best block sizes are annotated with speedups of 1.55 down to 1.40.]

SLIDE 17

Register Profile (IBM Power3)

[Figure: Register blocking performance (Mflop/s) on a dense n=1000 matrix, power3-aix, over row block size r and column block size c (1–12 each); the ten best block sizes are annotated with speedups of 1.59 down to 1.47.]

SLIDE 18

Register Profile (Sun Ultra 2i)

[Figure: Register blocking performance (Mflop/s) on a dense n=1000 matrix, ultra-solaris, over row block size r and column block size c (1–12 each); the ten best block sizes are annotated with speedups of 2.03 down to 1.94.]

SLIDE 19

Register Profile (Intel Pentium III)

[Figure: Register blocking performance (Mflop/s) on a dense n=1000 matrix, pentium3-linux-icc, over row block size r and column block size c (1–12 each); the ten best block sizes are annotated with speedups of 2.54 down to 2.36.]

SLIDE 20

Miss Model Validation: Intel Itanium

[Figure: L2 and L3 miss counts (millions) on Intel Itanium (itanium-linux-ecc) for each matrix; series: analytic upper bound, PAPI measurement, analytic lower bound.]

SLIDE 21

Miss Model Validation: Sun Ultra 2i

[Figure: L1 and L2 miss counts (millions) on Sun Ultra 2i (ultra-solaris) for each matrix; series: analytic upper bound, PAPI measurement, analytic lower bound.]

SLIDE 22

Miss Model Validation: IBM Power3

[Figure: L1 and L2 miss counts (millions) on IBM Power3 (power3-aix) for dense, memplus, wang4, ex11, and raefsky4; series: analytic upper bound, PAPI measurement, analytic lower bound.]

SLIDE 23

Dense Triangle Density: Dense Matrix

[Figure: density of the trailing submatrix vs. normalized column number for dense-L, with the heuristic switch point marked.]

SLIDE 24

Dense Triangle Density: memplus

[Figure: density of the trailing submatrix vs. normalized column number for memplus-L, with the heuristic switch point marked.]

SLIDE 25

Dense Triangle Density: wang4

[Figure: density of the trailing submatrix vs. normalized column number for wang4-L, with the heuristic switch point marked.]

SLIDE 26

Dense Triangle Density: ex11

[Figure: density of the trailing submatrix vs. normalized column number for ex11-L, with the heuristic switch point marked.]

SLIDE 27

Dense Triangle Density: raefsky4

[Figure: density of the trailing submatrix vs. normalized column number for raefsky4-L, with the heuristic switch point marked.]

SLIDE 28

Dense Triangle Density: goodwin

[Figure: density of the trailing submatrix vs. normalized column number for goodwin-L, with the heuristic switch point marked.]

SLIDE 29

Dense Triangle Density: lhr10

[Figure: density of the trailing submatrix vs. normalized column number for lhr10-L, with the heuristic switch point marked.]

SLIDE 30

Performance Results: Sun Ultra 2i

[Figure: Sparse Triangular Solve performance summary (Mflop/s) on Sun Ultra 2i (ultra-solaris) for the matrices dense, memplus, wang4, ex11, raefsky4, goodwin, and lhr10. Series: Reference; Reg. Blocking (RB); Switch-to-Dense (S2D); RB + S2D; Analytic upper bound; Analytic lower bound; PAPI upper bound.]

SLIDE 31

Performance Results: IBM Power3

[Figure: Sparse Triangular Solve performance summary (Mflop/s) on IBM Power3 (power3-aix) for the matrices dense, memplus, wang4, ex11, and raefsky4. Series: Reference; Reg. Blocking (RB); Switch-to-Dense (S2D); RB + S2D; Analytic upper bound; Analytic lower bound; PAPI upper bound.]

SLIDE 32

References

[BACD97] J. Bilmes, K. Asanović, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARC. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576–1587, 1999.

[FDZ99] Basilio B. Fraguela, Ramón Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Steven G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201–210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215–224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra (Application of generative programming techniques, using matrix algebra as an example). Master's thesis, Technische Universität Chemnitz, 1998.


SLIDE 33

[NGLPJ96] J. J. Navarro, E. García, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301–308, Philadelphia, PA, USA, May 1996.

[PSVM01] Markus Püschel, Bryan Singer, Manuela Veloso, and José M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97–106, San Francisco, CA, May 2001. Springer.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing ’92, 1992.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41–50, San Francisco, CA, May 2001. Springer.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3–25, 2001.

[WS97] James B. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the International Conference on High-Performance Computing, 1997.
