

SLIDE 1

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

A Hybrid Multithreaded Direct Sparse Triangular Solver

Andrew M. Bradley

Thanks: E. Boman, C. Dohrmann, S. Hammond, W. Held, M. Heroux, M. Hoemmen, K. Kim, S. Olivier, A. Prokopenko, S. Rajamanickam

SAND2016-10150 C SIAM CSC16

SLIDE 2

Problem Statement


  • Solve R P T Q x = b
      • Upper or lower sparse triangular matrix T
      • Row scaling R
      • Permutations P, Q
      • Solution and RHS x, b
      • (Everything that is needed for LDL, LU, incomplete factorizations, etc.)
  • Efficient to absorb user data
  • For many sequential RHS with
      • Same T or
      • Same nonzero pattern pat(T)
  • On a multi/many-core node
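
As a serial reference for the problem statement above, the following is a minimal sketch of solving R P T Q x = b for a lower triangular T. The storage layout and the permutation conventions ((P y)[i] = y[p[i]], (Q x)[i] = x[q[i]]) are assumptions made for illustration; the slides do not specify them, and this is not the HTS implementation.

```python
def apply_scaled_permuted(r, p, q, rows, x):
    """Compute b = R P T Q x under the assumed conventions (for checking)."""
    n = len(x)
    w = [x[q[i]] for i in range(n)]                        # w = Q x
    tw = [sum(v * w[j] for j, v in rows[i]) for i in range(n)]  # T w
    return [r[i] * tw[p[i]] for i in range(n)]             # R P (T w)


def solve_scaled_permuted(r, p, q, rows, b):
    """Solve R P T Q x = b for lower triangular T.

    Assumed (hypothetical) layout: R = diag(r); p and q are index lists;
    rows[i] holds (column, value) pairs of row i of T, diagonal entry last.
    """
    n = len(b)
    # Undo R and P: r[i] * y[p[i]] = b[i].
    y = [0.0] * n
    for i in range(n):
        y[p[i]] = b[i] / r[i]
    # Forward substitution for T z = y.
    z = [0.0] * n
    for i in range(n):
        *off_diag, (d_col, d_val) = rows[i]
        assert d_col == i, "diagonal entry must be stored last"
        z[i] = (y[i] - sum(v * z[j] for j, v in off_diag)) / d_val
    # Undo Q: x[q[i]] = z[i].
    x = [0.0] * n
    for i in range(n):
        x[q[i]] = z[i]
    return x
```

Only the middle loop depends on T; the scaling and permutation steps are cheap vector operations, which is why they can be folded into the solver rather than left to the caller.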
SLIDE 3

Solution Approach


  • Symbolic phase
      • Find parallelism in pat(T), the graph of T
  • Numeric phase
      • Load data structures with numbers
  • Solve phase
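
One common way to find parallelism in pat(T) is to compute level sets of the dependency graph. The sketch below assumes a lower triangular pattern given as per-row lists of off-diagonal column indices (a hypothetical layout, not the HTS data structure). It uses only the pattern, so its result can be reused whenever the numbers change but pat(T) does not.

```python
def level_sets(pattern):
    """Group rows of a lower triangular pattern into levels.

    pattern[i] lists the off-diagonal columns j < i of row i.
    A row with no dependencies is in level 0; otherwise its level is
    1 + the largest level among the rows it depends on.
    """
    n = len(pattern)
    level = [0] * n
    for i in range(n):
        for j in pattern[i]:
            level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lev in enumerate(level):
        levels[lev].append(i)
    return levels


# Rows 0 and 1 are independent; rows 2 and 4 need level 0; row 3 needs row 2.
print(level_sets([[], [], [0], [1, 2], [0]]))  # prints [[0, 1], [2, 4], [3]]
```

Rows within one level do not depend on each other, so each level can be processed in parallel once the previous level is complete.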
SLIDE 4

Motivation: Level Scheduling

[Figure: log10(#rows) per level versus level index. Annotation: Reorder.]

SLIDE 5

Motivation: Level Scheduling

[Figure: three panels versus level index (axis ticks from 200 to 1600): log2(#rows) per level, cumulative fraction of #rows, and cumulative fraction of NNZ.]

SLIDE 6

Motivation: Hybrid

[Figure: log2(#rows) per level versus log10(level index). Annotation: Reorder.]

SLIDE 7

Motivation: Hybrid

[Figure: solve-phase speedup w.r.t. the MKL trisolver versus number of threads (1, 4, 8, 16, 28, 57, 114; KMP_AFFINITY=balanced) on Knights Corner. Problem: elastic cube, bilinear hexes, 86490 unknowns, L from LDL, NodeND. Series: hybrid solver, level scheduling only, recursive blocking only, mkl_cspblas_dcsrtrsv. Annotation: Reorder.]

SLIDE 8

Software: HTS


  • Trilinos/packages/shylu/hts
  • C++ and OpenMP
  • Templated on row pointer, column index, and scalar types
  • CSR, CSC, forward, transpose, conjugate inputs
  • Effective nonzero pattern reuse
  • Will be an option in Ifpack2::LocalSparseTriangularSolver
      • Interface will support nonzero pattern reuse
SLIDE 9

Algorithms: Switching Method


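$n_{\text{good}} \sim 10$, $f_{\text{bad}} \sim 1\%$

$N_i \equiv$ size of level set $i$

$C_i \equiv \sum_{j \le i} N_j$

$C_i^{\text{bad}} \equiv \sum_{j \le i,\ N_j < n_{\text{good}}} N_j$

$k \equiv \arg\max_k \{\, N_k \ge n_{\text{good}} \ \text{and}\ C_k^{\text{bad}} \le f_{\text{bad}}\, C_k \,\}$

  • Want robustness to downward and upward spikes in $N_i$.
  • Use levels 1 to $k$.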
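
The selection rule on this slide can be sketched in a few lines. This is a direct transcription of the definitions, with the default values taken from the approximate figures given (n_good about 10, f_bad about 1%). The function name and the convention of returning 0 when no level qualifies are assumptions for illustration.

```python
def choose_switch_level(level_sizes, n_good=10, f_bad=0.01):
    """Return the largest 1-based level index k such that
    N_k >= n_good and C_k^bad <= f_bad * C_k, or 0 if none exists.

    level_sizes[i - 1] is N_i, the number of rows in level set i.
    """
    c = 0        # C_i: rows in levels 1..i
    c_bad = 0    # C_i^bad: rows in small levels (N_j < n_good) among 1..i
    k = 0
    for i, n in enumerate(level_sizes, start=1):
        c += n
        if n < n_good:
            c_bad += n
        if n >= n_good and c_bad <= f_bad * c:
            k = i
    return k


# A small level (3 rows) in the middle does not stop the search,
# because it accounts for less than 1% of the rows seen so far.
print(choose_switch_level([1000, 500, 200, 3, 100, 5, 2, 1]))  # prints 5
```

Because the test uses cumulative counts rather than the first level that falls below n_good, a single narrow level early on (a downward spike) or a single wide level late (an upward spike) does not by itself decide k.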
SLIDE 10

Algorithms: Level Scheduling


Reorder
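
A minimal sketch of a level-scheduled solve, assuming the level sets have already been computed and that T is lower triangular with off-diagonal entries stored per row (both hypothetical layouts chosen for illustration). The per-level loop is written serially here; in a threaded code, the rows of one level would be divided among threads with a barrier between levels.

```python
def level_scheduled_solve(levels, rows, diag, b):
    """Solve T x = b one level at a time.

    levels: list of lists of row indices, in dependency order.
    rows[i]: (column, value) pairs for the off-diagonal entries of row i.
    diag[i]: the diagonal entry T[i][i].
    """
    x = [0.0] * len(b)
    for level in levels:
        # Each row in this level reads only x entries from earlier levels,
        # so these updates are independent of one another.
        updates = [
            (b[i] - sum(v * x[j] for j, v in rows[i])) / diag[i]
            for i in level
        ]
        # A barrier would go here in a parallel implementation.
        for i, xi in zip(level, updates):
            x[i] = xi
    return x
```

Reordering the rows so that each level is contiguous in memory turns the inner work into a sequence of independent row blocks.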

SLIDE 11

Algorithms: Pruned Point-to-Point

[Figure: two task-graph diagrams with nodes 1 to 5 assigned to Thread 0 and Thread 1 across Level 1, Level 2, and Level 3.]

Park, J., M. Smelyanskiy, N. Sundaram, and P. Dubey, "Sparsifying synchronization for high-performance shared-memory sparse triangular solver." In Supercomputing, pp. 124-140. Springer International Publishing, 2014.
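
To illustrate the idea of pruning point-to-point synchronization, the sketch below applies one simple rule: a task never waits on a task from its own thread (program order covers that), and for each other thread it waits only on the latest dependency in that thread's execution order. This rule is an assumption chosen for illustration; the sparsification in the cited paper and in HTS may differ. All names and the example task graph are hypothetical.

```python
def pruned_waits(deps, owner, position):
    """Return, for each task, the tasks it must explicitly wait on.

    deps[t]: tasks that task t depends on.
    owner[t]: thread that executes task t.
    position[t]: index of task t in its thread's sequential order.
    """
    waits = {}
    for t, ds in deps.items():
        latest = {}  # other thread -> latest dependency it owns
        for d in ds:
            th = owner[d]
            if th == owner[t]:
                continue  # same thread: already ordered, no wait needed
            if th not in latest or position[d] > position[latest[th]]:
                latest[th] = d
        waits[t] = sorted(latest.values())
    return waits


# Hypothetical 5-task example on two threads.
owner = {1: 0, 3: 0, 5: 0, 2: 1, 4: 1}
position = {1: 0, 3: 1, 5: 2, 2: 0, 4: 1}
deps = {1: [], 2: [], 3: [1, 2], 4: [2], 5: [2, 3, 4]}
print(pruned_waits(deps, owner, position))
# prints {1: [], 2: [], 3: [2], 4: [], 5: [4]}
```

In this example the seven original dependency edges reduce to two explicit waits, and no global barrier between levels is needed.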
SLIDE 12

Algorithms: Recursive Blocking

[Figure: recursively blocked triangular matrix. Blocks are sparse or dense and are processed in parallel or in serial. Block labels: parallel mvp, inverse, serial mvp, serial trisolve.]
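
A minimal sketch of recursive blocking on a dense lower triangular matrix, assuming the standard 2-by-2 split T = [[T11, 0], [A21, T22]]: solve with T11, subtract A21 times the partial solution from the remaining right-hand side (the matrix-vector product, mvp), then solve with T22. The leaf size and the dense list-of-lists storage are illustrative assumptions. The sketch uses a serial trisolve at the leaves and does not form the explicit inverses mentioned in the figure labels.

```python
def recursive_block_solve(T, b, min_size=2):
    """Solve T x = b for dense lower triangular T by recursive blocking."""
    x = list(b)  # overwritten in place with the solution

    def solve(lo, hi):
        if hi - lo <= min_size:
            # Leaf: serial forward substitution on the diagonal block.
            for i in range(lo, hi):
                s = x[i] - sum(T[i][j] * x[j] for j in range(lo, i))
                x[i] = s / T[i][i]
            return
        mid = (lo + hi) // 2
        solve(lo, mid)  # x1 = T11^{-1} b1
        # b2 -= A21 x1; the rows of this mvp are independent of each other.
        for i in range(mid, hi):
            x[i] -= sum(T[i][j] * x[j] for j in range(lo, mid))
        solve(mid, hi)  # x2 = T22^{-1} (b2 - A21 x1)

    solve(0, len(x))
    return x
```

The off-diagonal update is where the parallelism lies: it is a matrix-vector product with no dependencies among its rows, unlike the triangular solves on the diagonal blocks.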

SLIDE 13

Results: UMFPACK LU on IB and KNC

[Figure: solve-phase speedup w.r.t. the MKL trisolver for level scheduling, recursive blocking, and hybrid. Panel 1: UMFPACK LU, Ivy Bridge, 20 threads, OMP_PROC_BIND=spread OMP_PLACES=cores. Panel 2: UMFPACK LU, Knights Corner, 240 threads, KMP_AFFINITY=compact. Matrices: copter2, gas_sensor, matrix-new_3, av41092, xenon2, c-71, shipsec1, xenon1, g7jac160, g7jac140sc, mark3jac120, mark3jac100sc, ct20stif, vanbody, ncvxbqp1, dawson5, venkat50, c-59, 2D_54019_highK, gridgena, epb3, torso2, finan512, twotone, torsion1, jan99jac120, boyd1, c-73b, hvdc2, rajat16, hcircuit.]

[Figure: speedup of a straightforward reference serial trisolver w.r.t. the MKL trisolver on Ivy Bridge and Knights Corner. Axis labels shown: gas_sensor, xenon2, shipsec1, xenon1, ct20stif, vanbody, dawson5, epb3, boyd1, hvdc2, hcircuit.]

SLIDE 14

Results: UMFPACK LU on IB and KNC

[Figure: per-matrix results for UMFPACK LU factors on two machines. Ivy Bridge (OMP_PROC_BIND=spread OMP_PLACES=cores; 10, 20, and 40 threads) and Knights Corner (KMP_AFFINITY=compact; 60, 120, and 240 threads). Panels for each machine: solve-phase speedup w.r.t. the MKL trisolver; (symbolic phase time) / (serial solve time); (numeric phase time) / (parallel solve time). Matrices: copter2, gas_sensor, matrix-new_3, av41092, xenon2, c-71, shipsec1, xenon1, g7jac160, g7jac140sc, mark3jac120, mark3jac100sc, ct20stif, vanbody, ncvxbqp1, dawson5, venkat50, c-59, 2D_54019_highK, gridgena, epb3, torso2, finan512, twotone, torsion1, jan99jac120, boyd1, c-73b, hvdc2, rajat16, hcircuit.]

SLIDE 15

[Figure: solve-phase speedup w.r.t. the MKL trisolver versus N (10^3 to 10^6), with the median for ≥ N. Panel 1: UMFPACK LU, Ivy Bridge, 20 threads, 822 UF matrices, OMP_PROC_BIND=spread OMP_PLACES=cores. Panel 2: UMFPACK LU, Knights Corner, 240 threads, 824 UF matrices, KMP_AFFINITY=compact.]

SLIDE 16

Future Work


  • Point-to-point level scheduling
      • Group rows into tasks to minimize #dependencies
      • Size tasks to reflect level of synchronization
  • Hybrid
      • Switching method(s)
      • Does not have to be 3 blocks; alternate
  • HTS
      • Improve formatting of recursively blocked part to take further advantage of dense sub-blocks
  • Direct sparse methods on GPU?