 
Reducing Communication in Sparse Matrix Operations
2018 Blue Waters Symposium
Luke Olson, Department of Computer Science, University of Illinois at Urbana-Champaign
Collaborators on this allocation:
• Amanda Bienz, University of Illinois at Urbana-Champaign
• Bill Gropp, University of Illinois at Urbana-Champaign
• Andrew Reisner, University of Illinois at Urbana-Champaign
• Lukas Spies, University of Illinois at Urbana-Champaign
Sparse Matrix Operations
• Application areas: time stepping, linear systems, PCA / clustering
• Operations: $C \leftarrow A * B$, $w \leftarrow A^{-1} v$, eigen analysis, $C \leftarrow R * A * R^T$
• Common kernel: $w \leftarrow A * v$, sparse matrix-vector multiplication (SpMV)
Figures: XPACC @ Illinois, MD Anderson, Fischer @ Illinois, QMCpack
What is this talk about? (Why it matters)
Iterative method for solving $Ax = b$ (the iteration shown is preconditioned conjugate gradient):
while ...
    $\alpha \leftarrow \langle r, z \rangle / \langle Ap, p \rangle$    (CA algorithms, see Eller/Gropp)
    $x \leftarrow x + \alpha p$
    $r_+ \leftarrow r - \alpha A p$    (SpMV)
    $z_+ \leftarrow \mathrm{precond}(r_+)$    (2…10…100 SpMVs)
    $\beta \leftarrow \langle r_+, z_+ \rangle / \langle r, z \rangle$
    $p \leftarrow z_+ + \beta p$
• 10s, 100s, 1000s, … of SpMVs in a computation
• SpMV is a major kernel, but has limited efficiency and limited scalability
• Use machine layout (nodes) on Blue Waters to reduce communication
• Use consistent timings on Blue Waters to develop accurate performance models
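The iteration above can be sketched in a few lines of Python to show where the SpMVs occur. This is a minimal serial sketch, not the code used in the talk: the list-of-rows matrix storage, the `spmv` and `pcg` helper names, the 1D Poisson test matrix, and the Jacobi preconditioner are all assumptions chosen for illustration.

```python
def spmv(rows, v):
    """w <- A * v, with A stored as one list of (column, value) pairs per row."""
    return [sum(a * v[j] for j, a in row) for row in rows]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg(matvec, b, precond, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient for a symmetric positive definite A.

    Returns the approximate solution and the number of SpMVs performed.
    """
    x = [0.0] * len(b)
    r = list(b)                      # r = b - A x with x = 0
    z = precond(r)
    p = list(z)
    rz = dot(r, z)
    spmvs = 0
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        Ap = matvec(p)               # the one SpMV per iteration
        spmvs += 1
        alpha = rz / dot(Ap, p)      # inner products: all-reduces in parallel
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = precond(r)               # may itself apply many SpMVs (e.g. AMG)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, spmvs


# Hypothetical test problem: 1D Poisson matrix (2 on the diagonal, -1 off it).
n = 8
A = [[(j, 2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n]
     for i in range(n)]
b = [1.0] * n
jacobi = lambda r: [ri / 2.0 for ri in r]   # divide by the diagonal of A
x, spmvs = pcg(lambda v: spmv(A, v), b, jacobi)
```

Each pass through the loop costs one SpMV plus whatever the preconditioner applies, which is why the SpMV count grows quickly when the preconditioner is itself built from sparse matrix operations.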
Anatomy of a Sparse Matrix-Vector (SpMV) product
$w \leftarrow A * v$
• Solid blocks: on-process portion
• Patterned blocks: off-process portion (requires communication of the input vector)
Figure: two panels, "Data layout" and "Where data is sent", showing $w$, $A$, and $v$ with process labels p = 0, …, 5 and P0, …, P3.
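The on-process / off-process split can be illustrated with a small serial simulation. This is a sketch under stated assumptions: the contiguous block-row layout, the `owner` and `split_local_rows` helpers, and the tridiagonal test matrix are hypothetical, and the "communication" is simulated by copying entries of the global vector rather than by MPI calls.

```python
def owner(j, block):
    """Rank that owns global index j under a contiguous block layout (assumed)."""
    return j // block


def split_local_rows(local_rows, rank, block):
    """Split each local row into on-process and off-process entries.

    Also returns recv_from: {owning rank: set of vector indices needed from it}.
    """
    on_proc, off_proc, recv_from = [], [], {}
    for row in local_rows:
        on_proc.append([(j, a) for j, a in row if owner(j, block) == rank])
        off = [(j, a) for j, a in row if owner(j, block) != rank]
        off_proc.append(off)
        for j, _ in off:
            recv_from.setdefault(owner(j, block), set()).add(j)
    return on_proc, off_proc, recv_from


# Hypothetical setup: 8 rows over 4 processes, tridiagonal matrix.
n, nprocs = 8, 4
block = n // nprocs
A = [[(j, 2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n]
     for i in range(n)]
v = [float(i + 1) for i in range(n)]

w, recv_maps = [], []
for rank in range(nprocs):
    local_rows = A[rank * block:(rank + 1) * block]
    on_proc, off_proc, recv_from = split_local_rows(local_rows, rank, block)
    recv_maps.append(recv_from)
    # Simulated communication: fetch the off-process entries of v.
    ghost = {j: v[j] for needed in recv_from.values() for j in needed}
    for on_row, off_row in zip(on_proc, off_proc):
        local = sum(a * v[j] for j, a in on_row)        # no communication
        remote = sum(a * ghost[j] for j, a in off_row)  # needs received data
        w.append(local + remote)
```

In this layout, rank 1 owns rows 2 and 3 and needs one entry of `v` from rank 0 and one from rank 2; the on-process sum can be computed while those entries are in flight.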
Cost of a Sparse Matrix-Vector (SpMV) product
• Basic SpMV: rows-per-process layout
• Modeling difficult (more later)
Figure (nlpkkt240): % of time in communication (0 to 100) vs. non-zeros per core (500K, 100K, 50K); per-process panels (x-axis: process ID) labeled All-reduce and SpMV.
Case Study: Preconditioning (Algebraic Multigrid)
• AMG: Algebraic Multigrid iteratively whittles away at the error
• Series or hierarchy of successively smaller (and more dense) sparse matrices
• SpMV dominated
Figure: AMG hierarchy, with relaxation $x \leftarrow x + \omega A_\ell r$ applied on each level $\ell$:
• Level 0: $A_0$, $\mathrm{nnz} / n_{\text{rows}} = 26$
• Level 1: $A_1$, $\mathrm{nnz} / n_{\text{rows}} = 30$
• Level 2: $A_2$, $\mathrm{nnz} / n_{\text{rows}} = 64$
• Level 3: $A_3$, $\mathrm{nnz} / n_{\text{rows}} = 66$
• …
Case Study: Preconditioning (Algebraic Multigrid)
• Smaller matrices == more communication
• MFEM discretization
• Linear elasticity
• 8192 cores, 512 nodes, 10k dof / core
Figure: time (seconds, $10^{-4}$ to $10^{-2}$) vs. level in AMG hierarchy (0 to 25).
Observation 1: message volume between procs
1. High volume of messages, high number of messages
2. Diminishing returns with higher communicating cores
3. Off node > on node > on socket
Figure: two panels vs. AMG level (0 to 20): maximum number of messages, and maximum size of messages (bytes).
Observation 2: limits of communication
1. High volume of messages, high number of messages
2. Diminishing returns with higher communicating cores
3. Off node > on node > on socket
$T = \alpha + \dfrac{ppn \cdot s}{\min(R_N,\ ppn \cdot R_B)}$
• $\alpha$: latency
• $s$: message size
• $ppn$: processes per node
• $R_N$: node injection bandwidth
• $R_B$: bandwidth between two processes
Modeling MPI Communication Performance on SMP Nodes: Is it Time to Retire the Ping Pong Test, Gropp, Olson, Samfass, EuroMPI 2016.
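A short evaluation of the model makes the "diminishing returns" point concrete. The function name `max_rate_time` and all parameter values below are hypothetical, chosen only so that the injection limit is reached at 4 processes per node; they are not measurements from Blue Waters.

```python
def max_rate_time(s, ppn, alpha, R_N, R_B):
    """T = alpha + (ppn * s) / min(R_N, ppn * R_B).

    s     : message size in bytes (each of the ppn processes sends s bytes)
    ppn   : processes per node
    alpha : latency in seconds
    R_N   : node injection bandwidth in bytes/second
    R_B   : bandwidth between two processes in bytes/second
    """
    return alpha + (ppn * s) / min(R_N, ppn * R_B)


# Assumed, illustrative parameters (not measured values).
alpha = 1.0e-6   # 1 microsecond latency
R_B = 1.0e9      # 1 GB/s between two processes
R_N = 4.0e9      # 4 GB/s node injection limit
s = 1.0e6        # 1 MB per process

times = {ppn: max_rate_time(s, ppn, alpha, R_N, R_B) for ppn in (1, 2, 4, 8, 16)}
```

With these values the time stays flat up to `ppn = 4`, where `ppn * R_B` reaches `R_N`; beyond that the node injection bandwidth is the bottleneck and the time grows linearly with `ppn`.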
Observation 3: node locality
1. High volume of messages, high number of messages
2. Diminishing returns with higher communicating cores
3. Off node > on node > on socket
• Split into short, eager, rendezvous
• Partition into on-socket, on-node, and off-node
Figure: time (seconds) vs. number of bytes communicated ($10^0$ to $10^6$); curves: Network (PPN ≥ 4), Network (PPN < 4), On-Node, On-Socket.
Anatomy of a node level SpMV product
• Six processes distributed across three nodes
• Linear system distributed across the processes
Figure: $w$, $A$, and $v$ partitioned over processes P0 to P5, which are placed on nodes N0, N1, N2.
Standard Communication
Figure (two frames): cores p and q shown on nodes n and m.
New Algorithm: On-Node Communication
Figure: process p on node n.
New Algorithm: Off-Node Communication
Figure: processes p and q on nodes n and m.
Node-Aware Parallel (NAP) Matrix Operation
1.) Redistribute initial values
2.) Inter-node communication
3.) Redistribute received values
4.) On-node communication
5.) Local computation with on-process, on-node, and off-node portions of Matrix
Note: step 4 and portions of step 5 can overlap with steps 1, 2, and 3
Figure: one panel per communication step (1 to 4), each showing processes p and q on nodes n and m.
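The effect of steps 1 to 3 on the number of inter-node messages can be sketched by counting. This is an illustration under assumptions, not the implementation from the talk: ranks are assumed to be placed on nodes in contiguous blocks of `ppn`, and the node-aware scheme is assumed to aggregate all data between a pair of nodes into a single inter-node message. The names `node_of` and `count_inter_node_messages` are hypothetical.

```python
def node_of(rank, ppn):
    """Node that hosts a rank, assuming contiguous placement of ppn ranks per node."""
    return rank // ppn


def count_inter_node_messages(pairs, ppn):
    """Count inter-node messages for a set of (source rank, destination rank) pairs.

    standard   : one message per communicating pair of ranks on different nodes
    node_aware : one aggregated message per communicating pair of nodes
                 (values are redistributed on-node before and after the send)
    """
    off_node = {(s, d) for s, d in pairs if node_of(s, ppn) != node_of(d, ppn)}
    node_pairs = {(node_of(s, ppn), node_of(d, ppn)) for s, d in off_node}
    return len(off_node), len(node_pairs)


# Hypothetical pattern: every rank on node 0 sends to every rank on node 1,
# plus one pair that stays on node 0.
ppn = 4
pairs = [(s, d) for s in range(4) for d in range(4, 8)] + [(0, 1)]
standard, node_aware = count_inter_node_messages(pairs, ppn)
```

In this pattern the 16 off-node messages of the standard approach collapse to one inter-node message; the price is additional on-node traffic in steps 1 and 3, which is cheaper per the on-socket / on-node / off-node ordering noted earlier.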
Case Study: Preconditioning (Algebraic Multigrid)
Maximum number of messages sent from any process on 16,384 processes
Figure: two panels, Off-node and On-node; max number of messages vs. AMG level (0 to 20); series: ref. SpMV and TAPSpMV.
Case Study: Preconditioning (Algebraic Multigrid)
Maximum size of messages sent from any process on 16,384 processes
Figure: two panels, Off-node and On-node; max message size (bytes) vs. AMG level (0 to 20); series: ref. SpMV and TAPSpMV.
Case Study: Preconditioning (Algebraic Multigrid)
Figure: two panels titled Total Time and Strong Scaling; y-axis: time (seconds); x-axes: AMG level (0 to 20) and number of processes (0 to 18000); series: ref. SpMV and TAPSpMV.
Node aware sparse matrix-vector multiplication, Bienz, Gropp, Olson, in review JPDC, 2018. arXiv
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
1. Model MPI queueing times
2. Model network contention
• MPI_Irecv message queue costly
• Identified a quadratic cost
Figure: time (seconds) vs. number of messages communicated ($10^0$ to $10^4$), for message sizes of 16, 64, 256, 1024, 4096, 16384, 65536, and 262144 bytes.
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
1. Model MPI queueing times
2. Model network contention
• Network contention is costly
• Identified a hop model
Figure: time (seconds) vs. number of messages communicated ($10^0$ to $10^4$), for message sizes of 16 to 262144 bytes, with a second panel for 256 and 16384 bytes; diagram labels G0, G1, G2, G3.
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
1. Model MPI queueing times
2. Model network contention
Figure: time (seconds, 0.000 to 0.008) vs. level in AMG hierarchy (0 to 6); series: Measured, Queue Search, Max-Rate, Contention.
Improving Performance Models for Irregular Point-to-Point Communication, Bienz, Gropp, Olson, in review EuroMPI, 2018.
Summary and Ongoing Work
• Drop-in replacement for a range of sparse matrix operations (SpMV, SpMM, MIS(k), assembly operations, etc.)
• Blue Waters instrumental in testing at scale, reproducible outcomes, and accurate performance analysis.
• (this) Code base: https://github.com/lukeolson/raptor
• Structured code base: https://github.com/cedar-framework/cedar
This research is part of the Blue Waters sustained petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.