A Hybrid Multithreaded Direct Sparse Triangular Solver
Andrew M. Bradley
Thanks: E. Boman, C. Dohrmann, S. Hammond, W. Held, M. Heroux, M. Hoemmen, K. Kim, S. Olivier, A. Prokopenko, S. Rajamanickam
SIAM CSC16
SAND2016-10150 C
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Problem Statement
• Solve R P T Q x = b
  • Upper or lower sparse triangular matrix T
  • Row scaling R
  • Permutations P, Q
  • Solution and RHS x, b
• (Everything that is needed for LDL, LU, incomplete factorizations, etc.)
• Efficient to absorb user data
• For many sequential RHS with
  • Same T or
  • Same nonzero pattern pat(T)
• On a multi/many-core node
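As a point of reference for the core operation, here is a minimal serial sketch of solving T x = b for a lower-triangular T in CSR format by forward substitution. It is not the HTS implementation: the `CsrMatrix` struct and `lower_trisolve` function are hypothetical names introduced for illustration, the row scaling R and permutations P, Q are left out, and the sketch assumes each row stores its diagonal entry last.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR container (illustrative, not the HTS data structure).
struct CsrMatrix {
  std::vector<int> row_ptr;  // size n + 1
  std::vector<int> col_idx;  // size nnz
  std::vector<double> val;   // size nnz
};

// Solve T x = b for lower-triangular T by forward substitution.
// Assumption: the diagonal entry is the last stored entry of each row.
std::vector<double> lower_trisolve(const CsrMatrix& T, const std::vector<double>& b) {
  const std::size_t n = b.size();
  assert(T.row_ptr.size() == n + 1);
  std::vector<double> x(n);
  for (std::size_t i = 0; i < n; ++i) {
    const int begin = T.row_ptr[i];
    const int diag = T.row_ptr[i + 1] - 1;
    assert(diag >= begin && T.col_idx[diag] == static_cast<int>(i));
    double sum = b[i];
    // Subtract contributions of already-computed unknowns x[j], j < i.
    for (int k = begin; k < diag; ++k) sum -= T.val[k] * x[T.col_idx[k]];
    x[i] = sum / T.val[diag];
  }
  return x;
}
```

Row i can only be computed after every row j that appears in its off-diagonal pattern, which is the dependency a parallel solver has to respect.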

Solution Approach
• Symbolic phase
  • Find parallelism in pat(T), the graph of T
• Numeric phase
  • Load data structures with numbers
• Solve phase
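One common way to find parallelism in pat(T) is to assign each row a level: a row with no off-diagonal entries is at level 0, and any other row sits one level above the deepest row it depends on. The sketch below computes these levels for a lower-triangular CSR pattern. The function name `compute_levels` is hypothetical, and this shows only the basic level-set idea, not the full symbolic phase of HTS.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Level of row i in a lower-triangular pattern:
//   level[i] = 1 + max(level[j]) over off-diagonal columns j < i in row i,
//   or 0 if row i has no off-diagonal entries.
// Only the nonzero pattern is needed, not the numerical values.
std::vector<int> compute_levels(const std::vector<int>& row_ptr,
                                const std::vector<int>& col_idx) {
  assert(!row_ptr.empty());
  const std::size_t n = row_ptr.size() - 1;
  std::vector<int> level(n, 0);
  for (std::size_t i = 0; i < n; ++i) {
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
      const std::size_t j = static_cast<std::size_t>(col_idx[k]);
      if (j < i) level[i] = std::max(level[i], level[j] + 1);
    }
  }
  return level;
}
```

Rows that share a level have no dependencies on each other, so they can be solved concurrently. Because the result depends only on the pattern, it can be reused for every matrix with the same pat(T).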

Motivation: Level Scheduling
[Figure: log10(#Rows) in each level versus level index; label: Reorder.]

Motivation: Level Scheduling
[Figure, three panels versus level index: log2(#Rows); cumulative fraction of #Rows; cumulative fraction of NNZ.]

Motivation: Hybrid
[Figure: log2(#Rows) in each level versus log10(level index); label: Reorder.]

Motivation: Hybrid
Solve phase on Knights Corner. Elastic cube, bilinear hexes, 86490 unknowns, L from LDL, NodeND.
[Figure: speedup w.r.t. MKL trisolver (mkl_cspblas_dcsrtrsv) versus # threads (1, 4, 8, 16, 28, 57, 114), KMP_AFFINITY=balanced. Series: hybrid solver; level scheduling only; recursive blocking only.]

Software: HTS
• Trilinos/packages/shylu/hts
• C++ and OpenMP
• Templated on row pointer, column index, and scalar types
• CSR, CSC, forward, transpose, conjugate inputs
• Effective nonzero pattern reuse
• Will be an option in Ifpack2::LocalSparseTriangularSolver
  • Interface will support nonzero pattern reuse

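Algorithms: Switching Method
• Want robustness to downward and upward spikes in $N_i$.
• Use levels 1 to $k$, with $n_\text{good} \sim 10$ and $f_\text{bad} \sim 1\%$:

$$N_i \equiv \text{size of level set } i$$

$$C_i \equiv \sum_{j \le i} N_j$$

$$C_i^\text{bad} \equiv \sum_{j \le i \,\cap\, N_j < n_\text{good}} N_j$$

$$k \equiv \arg\max_{N_k \ge n_\text{good} \,\cap\, C_k^\text{bad} \le f_\text{bad} C_k} k$$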
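The switching rule can be sketched in a few lines: walk the level sizes once, keep running totals of all rows and of rows in "bad" (too small) levels, and remember the last level that is itself large enough while the bad fraction stays within budget. The function name `switching_level` and the convention of returning 0 when no level qualifies are assumptions made for this illustration, not part of the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// N[i] is the size of level set i + 1 (levels are 1-based on the slide).
// Returns the largest 1-based level k with N_k >= n_good and
// C_k^bad <= f_bad * C_k, or 0 if no level qualifies (assumed convention).
std::size_t switching_level(const std::vector<std::size_t>& N,
                            std::size_t n_good, double f_bad) {
  assert(f_bad >= 0.0);
  std::size_t C = 0;      // cumulative rows in levels 1..i
  std::size_t C_bad = 0;  // cumulative rows in levels with fewer than n_good rows
  std::size_t k = 0;
  for (std::size_t i = 0; i < N.size(); ++i) {
    C += N[i];
    if (N[i] < n_good) C_bad += N[i];
    const bool good_level = N[i] >= n_good;
    const bool within_budget =
        static_cast<double>(C_bad) <= f_bad * static_cast<double>(C);
    if (good_level && within_budget) k = i + 1;
  }
  return k;
}
```

With level sizes {1000, 500, 2, 400, 3, 1, 1}, n_good = 10, and f_bad = 0.01, the isolated small level 3 does not stop the scan, since 2 bad rows out of 1902 is within the 1% budget at level 4, while the trailing run of tiny levels is excluded.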

Algorithms: Level Scheduling
[Figure; label: Reorder.]
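A basic level-scheduled solve can be sketched as a serial loop over levels with a parallel loop over the rows inside each level. The version below is a minimal illustration with barrier synchronization between levels, not the HTS implementation. The `LevelSchedule` struct and `level_scheduled_solve` function are hypothetical names, T is assumed to be lower triangular in CSR with the diagonal stored last in each row, and the rows are assumed to be already grouped by level. Without OpenMP enabled, the pragma is ignored and the code runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rows grouped by level (hypothetical layout): the rows of level l are
// level_rows[level_ptr[l]] .. level_rows[level_ptr[l + 1] - 1].
struct LevelSchedule {
  std::vector<int> level_ptr;
  std::vector<int> level_rows;
};

// Solve T x = b for lower-triangular CSR T (diagonal stored last in each row).
// Rows within one level are mutually independent, so they can run in parallel;
// the implicit barrier after each parallel loop separates consecutive levels.
std::vector<double> level_scheduled_solve(const std::vector<int>& row_ptr,
                                          const std::vector<int>& col_idx,
                                          const std::vector<double>& val,
                                          const LevelSchedule& s,
                                          const std::vector<double>& b) {
  const std::size_t n = b.size();
  assert(row_ptr.size() == n + 1 && s.level_rows.size() == n);
  std::vector<double> x(n, 0.0);
  for (std::size_t l = 0; l + 1 < s.level_ptr.size(); ++l) {
    #pragma omp parallel for
    for (int p = s.level_ptr[l]; p < s.level_ptr[l + 1]; ++p) {
      const int i = s.level_rows[p];
      const int diag = row_ptr[i + 1] - 1;
      double sum = b[i];
      // Every x[col_idx[k]] read here belongs to an earlier level.
      for (int k = row_ptr[i]; k < diag; ++k) sum -= val[k] * x[col_idx[k]];
      x[i] = sum / val[diag];
    }
  }
  return x;
}
```

Each level costs one synchronization, so a long tail of levels with only a few rows each yields little parallel work per barrier.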

Algorithms: Pruned Point-to-Point
[Figure: two dependency diagrams, each with tasks 0 to 5 assigned to Thread 0 and Thread 1 across Levels 1 to 3.]
Park, J., M. Smelyanskiy, N. Sundaram, and P. Dubey, "Sparsifying synchronization for high-performance shared-memory sparse triangular solver." In Supercomputing, pp. 124-140. Springer International Publishing, 2014.

Algorithms: Recursive Blocking
[Figure: block diagrams of the recursively blocked solve. Labels: serial trisolve; serial mvp; inverse; parallel mvp; sparse or dense; parallel or serial.]
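The trisolve / mvp / trisolve structure named in the figure labels can be illustrated with a 2x2 block split of a lower-triangular T: solve with the top-left block, subtract the off-diagonal block times that partial solution (the matrix-vector product, "mvp"), then solve with the bottom-right block, recursing on each diagonal block. The sketch below uses a dense matrix and plain forward substitution at the leaves purely for brevity. The names `Dense` and `recursive_block_solve` are hypothetical, and the inverse-based leaves and sparse storage mentioned on the slide are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Dense = std::vector<std::vector<double>>;  // row-major, illustrative only

// Solve with the diagonal block T[lo:hi, lo:hi] in place.
// On entry x[lo:hi] holds the right-hand side with contributions from
// columns < lo already subtracted; on exit it holds the solution.
void recursive_block_solve(const Dense& T, std::vector<double>& x,
                           std::size_t lo, std::size_t hi,
                           std::size_t leaf_size) {
  assert(lo <= hi && hi <= x.size() && leaf_size >= 1);
  if (hi - lo <= leaf_size) {
    // Leaf: serial forward substitution on the small diagonal block.
    for (std::size_t i = lo; i < hi; ++i) {
      for (std::size_t j = lo; j < i; ++j) x[i] -= T[i][j] * x[j];
      x[i] /= T[i][i];
    }
    return;
  }
  const std::size_t mid = lo + (hi - lo) / 2;
  recursive_block_solve(T, x, lo, mid, leaf_size);  // x1 = T11^{-1} b1
  // b2 -= T21 * x1: the mvp step; the rows i are independent of each other.
  for (std::size_t i = mid; i < hi; ++i)
    for (std::size_t j = lo; j < mid; ++j) x[i] -= T[i][j] * x[j];
  recursive_block_solve(T, x, mid, hi, leaf_size);  // x2 = T22^{-1} b2
}
```

The appeal of this layout is that the off-diagonal update is a matrix-vector product, which parallelizes over rows without any ordering constraints.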

Results: UMFPACK LU on IB and KNC
[Figure: per-matrix solve phase speedup w.r.t. MKL trisolver for the hybrid, recursive blocking, and level scheduling methods. Panels: UMFPACK LU, Ivy Bridge, 20 threads (OMP_PROC_BIND=spread OMP_PLACES=cores); UMFPACK LU, Knights Corner, 240 threads (KMP_AFFINITY=compact); straightforward reference serial trisolver speedup w.r.t. MKL trisolver on Knights Corner and Ivy Bridge. Matrices: copter2, gas_sensor, matrix-new_3, av41092, xenon2, c-71, shipsec1, xenon1, g7jac160, g7jac140sc, mark3jac120, mark3jac100sc, ct20stif, vanbody, ncvxbqp1, dawson5, venkat50, c-59, 2D_54019_highK, gridgena, epb3, torso2, finan512, twotone, torsion1, jan99jac120, boyd1, c-73b, hvdc2, rajat16, hcircuit.]

Results: UMFPACK LU on IB and KNC
[Figure, two columns of three panels, per test matrix. Columns: UMFPACK LU, Ivy Bridge, OMP_PROC_BIND=spread OMP_PLACES=cores (10, 20, 40 threads); UMFPACK LU, Knights Corner, KMP_AFFINITY=compact (60, 120, 240 threads). Rows: solve phase speedup w.r.t. MKL trisolver; (numeric phase time) / (parallel solve time); (symbolic phase time) / (serial solve time).]

[Figure, two panels: solve phase speedup w.r.t. MKL trisolver versus N (10^3 to 10^6), with the median for ≥ N. Panels: UMFPACK LU, Ivy Bridge, 20 threads, 822 UF matrices, OMP_PROC_BIND=spread OMP_PLACES=cores; UMFPACK LU, Knights Corner, 240 threads, 824 UF matrices, KMP_AFFINITY=compact.]

Future Work
• Point-to-point level scheduling
  • Group rows into tasks to minimize #dependencies
  • Size tasks to reflect level of synchronization
• Hybrid
  • Switching method(s)
  • Does not have to be 3 blocks; alternate
• HTS
  • Improve formatting of recursively blocked part to take further advantage of dense sub-blocks
• Direct sparse methods on GPU?
