

SLIDE 1

Automatic Performance Tuning and Analysis of Sparse Triangular Solve

Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James W. Demmel, Katherine A. Yelick
June 22, 2002
Berkeley Benchmarking and OPtimization (BeBOP) Project

www.cs.berkeley.edu/~richie/bebop

Computer Science Division, U.C. Berkeley

Automatic Performance Tuning and Analysis of Sparse Triangular Solve – p.1/31

SLIDE 2

Context: High-Performance Libraries

Application performance dominated by a few computational kernels

  • Solving PDEs (linear algebra ops)
  • Google (sparse matrix-vector multiply)
  • Multimedia (signal processing)

Performance tuning today

  • Vendor-tuned standardized libraries (e.g., BLAS)
  • User tunes by hand

Automated tuning for dense linear algebra, FFTs, ...

  • PHiPAC/ATLAS (dense linear algebra)
  • FFTW/SPIRAL/UHFFT (signal processing)

SLIDE 3

Problem Area: Sparse Matrix Kernels

Performance issues in sparse linear algebra

  • High bandwidth requirements and poor instruction mix
  • Depends on architecture, kernel, and matrix
  • How to select data structures, algorithms? At run-time?

Approach to automatic tuning: for each kernel,

  • Identify and generate a space of implementations
  • Search (models, experiments) to find the fastest one

Early success: SPARSITY (Im & Yelick ’99) for sparse matrix-vector multiply (SpM×V)

This talk: Sparse triangular solve (SpTS), arising in sparse Cholesky and LU factorization (uniprocessor)
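For concreteness, the baseline SpTS kernel is forward substitution over a compressed sparse row (CSR) matrix. The following is a minimal pure-Python sketch, not the tuned implementation discussed in this talk; it assumes each row stores its diagonal entry as the last nonzero of that row.

```python
def csr_lower_solve(val, col, rowptr, b):
    """Solve L x = b where L is lower triangular with nonzero diagonal,
    in CSR form (val/col/rowptr), diagonal stored last in each row."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        # Accumulate off-diagonal contributions L[i][j] * x[j], j < i.
        for k in range(rowptr[i], rowptr[i + 1] - 1):
            s -= val[k] * x[col[k]]
        # Divide by the diagonal entry L[i][i].
        x[i] = s / val[rowptr[i + 1] - 1]
    return x
```

The irregular inner loop and the value/index streams are what make this kernel memory-bound, motivating the tuning techniques below.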


SLIDE 4

Sparse Triangular Matrix Example

  • raefsky4 (structural problem) + SuperLU 2.0 + colmmd
  • Dimension: 19779
  • No. of non-zeros: 12.6 M
  • Dense trailing triangle: dim = 2268, 20% of total nnz

SLIDE 5

Idea: Sparse/Dense Partitioning

Partition the matrix into sparse ($L_1$, $L_2$) and dense ($L_D$) parts:

$$\begin{bmatrix} L_1 & \\ L_2 & L_D \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$

Leads to 1 SpTS, 1 SpM×V, and 1 Dense TS:

$$L_1 x_1 = b_1 \qquad (1)$$

$$\hat{b}_2 = b_2 - L_2 x_1 \qquad (2)$$

$$L_D x_2 = \hat{b}_2 \qquad (3)$$

SPARSITY optimizations for (1)–(2); tuned BLAS for (3).
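The three-step solve can be sketched directly. This illustration uses a dense list-of-lists as a stand-in for the sparse structure, and plain loops where a tuned implementation would call the blocked SpTS/SpM×V kernels and a BLAS triangular solve (TRSV); `s` is the switch point (hypothetical parameter name).

```python
def partitioned_solve(L, b, s):
    """Solve L x = b by splitting at row/column s:
       (1) L1 x1 = b1         (sparse triangular solve)
       (2) b2' = b2 - L2 x1   (sparse matrix-vector multiply)
       (3) LD x2 = b2'        (dense triangular solve, e.g. BLAS TRSV)"""
    n = len(b)
    x = [0.0] * n
    # (1) Forward solve on the leading s-by-s triangle L1.
    for i in range(s):
        t = b[i] - sum(L[i][j] * x[j] for j in range(i))
        x[i] = t / L[i][i]
    # (2) Update the trailing right-hand side: b2' = b2 - L2 x1.
    b2 = [b[i] - sum(L[i][j] * x[j] for j in range(s)) for i in range(s, n)]
    # (3) Forward solve on the dense trailing triangle LD.
    for i in range(s, n):
        t = b2[i - s] - sum(L[i][j] * x[j] for j in range(s, i))
        x[i] = t / L[i][i]
    return x
```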


SLIDE 6

Register Blocking (SPARSITY)

[Figure: 4×3 register blocking example; 50×50 spy plot, nz = 598]

  • Store r×c dense blocks; multiply/solve block-by-block
  • Fill in explicit zeros
  • 1.3x–2.5x speedup on FEM matrices (SpM×V)
  • Reduced storage overhead over, e.g., CSR
  • Block ops are fully unrolled – improves register reuse
  • Trade off extra computation for efficiency
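The cost of filling in explicit zeros is captured by the fill ratio: stored entries (after padding each nonempty r×c block) divided by true nonzeros. A sketch over a 0/1 sparsity pattern, using a dense list-of-lists as a stand-in:

```python
def fill_ratio(A, r, c):
    """Fill ratio of an r-by-c register blocking of pattern A:
    (entries stored after padding blocks with explicit zeros) / (true nnz)."""
    n, m = len(A), len(A[0])
    nnz = sum(1 for row in A for v in row if v != 0)
    blocks = 0
    for bi in range(0, n, r):
        for bj in range(0, m, c):
            # A block is stored in full (zeros included) iff it has any nonzero.
            if any(A[i][j] != 0
                   for i in range(bi, min(bi + r, n))
                   for j in range(bj, min(bj + c, m))):
                blocks += 1
    return blocks * r * c / nnz
```

A diagonal pattern blocked 2×2, for instance, doubles the stored entries, which is exactly the trade-off between wasted flops and unrolled, register-resident block operations.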


SLIDE 7

Tuning Parameter Selection

Parameters: the switch point $s$, and the register block size $r \times c$

Off-line profiling

  • Benchmark routines on synthetic data
  • Only needed once per architecture

At run-time (when matrix is known)

  • Determine or estimate matrix properties (e.g., fill ratio, size of trailing triangle)
  • Combine with data collected off-line
  • Convert to new data structure

In practice, total run-time cost to select and reorg: e.g., 10–30 naïve solves on Itanium

SLIDE 8

Performance Bounds

Upper bounds on performance (Mflop/s)?

  • Flops: 2 × (number of non-zeros) − (dimension)
  • Full latency cost model of execution time:

$$T = \sum_{i=1}^{\kappa} h_i \alpha_i + m_\kappa \alpha_{\mathrm{mem}} \qquad (4)$$

where $h_i$ and $\alpha_i$ are the number of hits and the access latency at cache level $i$, and $m_\kappa$ is the number of misses at the lowest cache level $\kappa$.

Lower bound on misses: ignore conflict misses on vectors

$$M_i^{\mathrm{lower}}(r,c) = \frac{1}{l_i}\left[\, 8\, f_{rc}\, k \;+\; 4\,\frac{f_{rc}\, k}{rc} \;+\; 4\left\lceil\frac{n}{r}\right\rceil \;+\; 8 \cdot 2n \,\right] \qquad (5)$$

where $l_i$ is the line size in bytes at level $i$, $k$ the number of true non-zeros, and $f_{rc}$ the $r{\times}c$ fill ratio; the terms count the blocked values (8-byte doubles), block column indices (4-byte integers), block row pointers, and one compulsory pass over the two length-$n$ vectors.
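The lower bound amounts to tallying compulsory data traffic and dividing by the line size. A sketch, assuming 8-byte values, 4-byte indices, and one pass over the two length-n vectors (these byte sizes are illustrative assumptions, and conflict misses on the vectors are ignored as on this slide):

```python
import math

def miss_lower_bound(nnz, n, r, c, fill, line_bytes):
    """Lower bound on cache misses for r-by-c blocked SpTS:
    total compulsory bytes / cache line size."""
    stored = fill * nnz                      # nonzeros incl. explicit zeros
    total_bytes = (8 * stored                # matrix values (doubles)
                   + 4 * stored / (r * c)    # one column index per block
                   + 4 * math.ceil(n / r)    # block row pointers
                   + 8 * 2 * n)              # x and b vectors, one pass each
    return total_bytes / line_bytes
```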


SLIDE 9

Performance Results: Intel Itanium

[Figure: Sparse Triangular Solve performance summary (Mflop/s) on Intel Itanium (itanium-linux-ecc) for the matrices dense, memplus, wang4, ex11, raefsky4, goodwin, and lhr10. Series: Reference; Reg. Blocking (RB); Switch-to-Dense (S2D); RB + S2D; Analytic upper bound; Analytic lower bound; PAPI upper bound.]

SLIDE 10

Conclusions and Directions

Limits of “low-level” tuning are near

  • Can we approach bandwidth limits?
  • Other kernels? e.g., $A^T A x$, $A^k x$, sparse triple product
  • Other structures? multiple vectors, symmetry, reordering

Interface from/to libraries and applications?

  • Leverage existing generators (e.g., Bernoulli)
  • Hybrid on-line, off-line optimizations

SpTS-specific future work

  • Symbolic structure; other fill-reducing orderings
  • Refinements to switch point selection
  • Incomplete Cholesky and LU preconditioners

SLIDE 11

Related Work (1/2)

Automatic tuning systems

  • PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
  • FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
  • MPI collective ops (Vadhiyar, et al. [VFD01])

Code generation

  • FLAME [GGHvdG01]
  • Sparse compilers (Bik [BW99], Bernoulli [Sto97])
  • Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], ...)

Sparse performance modeling

  • Temam and Jalby [TJ92]
  • White and Sadayappan [WS97]
  • Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]

SLIDE 12

Related Work (2/2)

Compilers (analysis and models); run-time selection

  • CROPS (UCSD/Carter, Ferrante, et al.)
  • TUNE (Chatterjee, et al.)
  • Iterative compilation (O’Boyle, et al., 1998)
  • Broadway (Guyer and Lin, ’99)
  • Brewer (’95); ADAPT (Voss, 2000)

Interfaces: Sparse BLAS; PSBLAS; PETSc

Sparse triangular solve

  • SuperLU / MUMPS / SPOOLES / UMFPACK / PSPASES, ...
  • Approximation: Alvarado (’93); Raghavan (’98)
  • Scalability: Rothberg (’92; ’95); Gupta (’95); Li, Coleman (’88)

SLIDE 13

—End—


SLIDE 14

Tuning Parameter Selection

First, select the switch point $s$; at run-time:

  • Assume matrix in CSR format on input
  • Scan bottom row from the diagonal until two consecutive zeros are found
  • Fill vs. efficiency trade-off

Then, select the register block size $r \times c$:

  • Maximize, over all $r$ and $c$,

$$\widehat{\mathrm{Mflops}}(r,c) = \frac{\mathrm{Mflops}_{\mathrm{dense}}(r,c)}{\mathrm{fill}(r,c)} \qquad (6)$$

where $\mathrm{Mflops}_{\mathrm{dense}}(r,c)$ is the off-line dense register profile and $\mathrm{fill}(r,c)$ is the estimated fill ratio on the given matrix.

Total cost to select and reorg.: e.g., 10–30 naïve solves on Itanium
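The block-size heuristic (pick the r×c maximizing the off-line dense profile divided by the run-time fill estimate) reduces to a one-line search. The dictionaries and the numbers in the test are illustrative stand-ins, not measured data:

```python
def choose_block_size(dense_profile, fill):
    """Pick the register block size (r, c) maximizing
    dense_profile[(r, c)] / fill[(r, c)].
    dense_profile: off-line Mflop/s on a dense matrix in sparse format.
    fill: run-time fill-ratio estimates for the actual matrix."""
    return max(dense_profile, key=lambda rc: dense_profile[rc] / fill[rc])
```

This is why the run-time cost stays low: the expensive benchmarking is done once per architecture, and only the fill estimation and this maximization happen per matrix.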


SLIDE 15

Matrix Benchmark Suite

                                                   Dense Trailing Triangle
Name      Application Area      Dim.    Nnz in L   Dim.   Density   % Total Nnz
dense     Dense matrix          1000    500k       1000   100.0%    100.0%
memplus   Circuit simulation    17758   2.0M       1978   97.7%     96.8%
wang4     Device simulation     26068   15.1M      2810   95.0%     24.8%
ex11      Fluid flow            16614   9.8M       2207   88.0%     22.0%
raefsky4  Structural mechanics  19779   12.6M      2268   100.0%    20.4%
goodwin   Fluid mechanics       7320    1.0M       456    65.9%     6.97%
lhr10     Chemical processes    10672   369k       104    96.3%     1.43%

SLIDE 16

Register Profile (Intel Itanium)

[Figure: Register blocking performance (Mflop/s) on a dense n=1000 matrix, itanium-linux-ecc, over row block size r and column block size c (1–12 each); the ten best block sizes are annotated with speedups of 1.55 down to 1.40.]

SLIDE 17

Register Profile (IBM Power3)

[Figure: Register blocking performance (Mflop/s) on a dense n=1000 matrix, power3-aix, over row block size r and column block size c (1–12 each); the ten best block sizes are annotated with speedups of 1.59 down to 1.47.]

SLIDE 18

Register Profile (Sun Ultra 2i)

[Figure: Register blocking performance (Mflop/s) on a dense n=1000 matrix, ultra-solaris, over row block size r and column block size c (1–12 each); the ten best block sizes are annotated with speedups of 2.03 down to 1.94.]

SLIDE 19

Register Profile (Intel Pentium III)

[Figure: Register blocking performance (Mflop/s) on a dense n=1000 matrix, pentium3-linux-icc, over row block size r and column block size c (1–12 each); the ten best block sizes are annotated with speedups of 2.54 down to 2.36.]

SLIDE 20

Miss Model Validation: Intel Itanium

[Figure: L2 and L3 miss counts (millions) on Intel Itanium (itanium-linux-ecc) for each matrix; series: analytic upper bound, PAPI measurement, analytic lower bound.]

SLIDE 21

Miss Model Validation: Sun Ultra 2i

[Figure: L1 and L2 miss counts (millions) on Sun Ultra 2i (ultra-solaris) for each matrix; series: analytic upper bound, PAPI measurement, analytic lower bound.]

SLIDE 22

Miss Model Validation: IBM Power3

[Figure: L1 and L2 miss counts (millions) on IBM Power3 (power3-aix) for dense, memplus, wang4, ex11, and raefsky4; series: analytic upper bound, PAPI measurement, analytic lower bound.]

SLIDE 23

Dense Triangle Density: Dense Matrix

[Figure: density of the trailing submatrix vs. normalized column number for dense-L, with the heuristic switch point marked.]

SLIDE 24

Dense Triangle Density: memplus

[Figure: density of the trailing submatrix vs. normalized column number for memplus-L, with the heuristic switch point marked.]

SLIDE 25

Dense Triangle Density: wang4

[Figure: density of the trailing submatrix vs. normalized column number for wang4-L, with the heuristic switch point marked.]

SLIDE 26

Dense Triangle Density: ex11

[Figure: density of the trailing submatrix vs. normalized column number for ex11-L, with the heuristic switch point marked.]

SLIDE 27

Dense Triangle Density: raefsky4

[Figure: density of the trailing submatrix vs. normalized column number for raefsky4-L, with the heuristic switch point marked.]

SLIDE 28

Dense Triangle Density: goodwin

[Figure: density of the trailing submatrix vs. normalized column number for goodwin-L, with the heuristic switch point marked.]

SLIDE 29

Dense Triangle Density: lhr10

[Figure: density of the trailing submatrix vs. normalized column number for lhr10-L, with the heuristic switch point marked.]

SLIDE 30

Performance Results: Sun Ultra 2i

[Figure: Sparse Triangular Solve performance summary (Mflop/s) on Sun Ultra 2i (ultra-solaris) for the matrices dense, memplus, wang4, ex11, raefsky4, goodwin, and lhr10. Series: Reference; Reg. Blocking (RB); Switch-to-Dense (S2D); RB + S2D; Analytic upper bound; Analytic lower bound; PAPI upper bound.]

SLIDE 31

Performance Results: IBM Power3

[Figure: Sparse Triangular Solve performance summary (Mflop/s) on IBM Power3 (power3-aix) for the matrices dense, memplus, wang4, ex11, and raefsky4. Series: Reference; Reg. Blocking (RB); Switch-to-Dense (S2D); RB + S2D; Analytic upper bound; Analytic lower bound; PAPI upper bound.]

SLIDE 32

References

[BACD97] J. Bilmes, K. Asanović, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARC. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576–1587, 1999.

[FDZ99] Basilio B. Fraguela, Ramón Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Steven G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201–210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215–224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra (Application of generative programming techniques, using matrix algebra as an example). Master's thesis, Technische Universität Chemnitz, 1998.


SLIDE 33

[NGLPJ96] J. J. Navarro, E. García, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301–308, Philadelphia, PA, USA, May 1996.

[PSVM01] Markus Püschel, Bryan Singer, Manuela Veloso, and José M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97–106, San Francisco, CA, May 2001. Springer.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing ’92, 1992.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41–50, San Francisco, CA, May 2001. Springer.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3–25, 2001.

[WS97] James B. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the International Conference on High-Performance Computing, 1997.
