SLIDE 1

Reducing Communication in Sparse Matrix Operations 2018 Blue Waters Symposium

Luke Olson Department of Computer Science, University of Illinois at Urbana-Champaign

Collaborators on this allocation:
  • Amanda Bienz, University of Illinois at Urbana-Champaign
  • Bill Gropp, University of Illinois at Urbana-Champaign
  • Andrew Reisner, University of Illinois at Urbana-Champaign
  • Lukas Spies, University of Illinois at Urbana-Champaign

SLIDE 2

Sparse Matrix Operations

  • Time Stepping (Figure: XPACC @ Illinois)
  • PCA / Clustering (Figure: MD Anderson)
  • Linear Systems (Figure: Fischer @ Illinois)
  • Eigen analysis (Figure: QMCPACK)

w ← A ∗ v    C ← A ∗ B    C ← R ∗ A ∗ Rᵀ    w ← A⁻¹v

Sparse Matrix-Vector multiplication (SpMV)

SLIDE 3

What is this talk about? (Why it matters)

  • 10s, 100s, 1000s, … of SpMVs in a computation
  • SpMV is a major kernel, but it has limited efficiency and limited scalability
  • Use machine layout (nodes) on Blue Waters to reduce communication
  • Use consistent timings on Blue Waters to develop accurate performance models

Iterative method for solving Ax = b

while not converged:
    α ← ⟨r, z⟩ / ⟨Ap, p⟩
    x ← x + αp
    r⁺ ← r − αAp
    z⁺ ← precond(r⁺)
    β ← ⟨r⁺, z⁺⟩ / ⟨r, z⟩
    p ← z⁺ + βp

  • 2…10…100 SpMVs per solve
  • For communication-avoiding (CA) algorithms, see Eller/Gropp
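Each pass through this loop costs one SpMV (the product A ∗ p), which is where communication accrues. Below is a minimal NumPy sketch of the loop above; `precond` is a placeholder (identity by default), standing in for whatever preconditioner is used in practice (e.g., an AMG cycle).

```python
import numpy as np

def pcg(A, b, precond=lambda r: r, tol=1e-8, maxiter=1000):
    """Preconditioned conjugate gradients; one SpMV (A @ p) per iteration."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                    # initial residual (one SpMV)
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p                   # the SpMV performed every iteration
        alpha = rz / (Ap @ p)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = precond(r)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x
```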

SLIDE 4

[Figure: sparse matrix partitioned by rows across processes p = 0, …, 5]

  • Solid blocks: on-process portion
  • Patterned blocks: off-process portion (requires communication of the input vector)

Anatomy of a Sparse Matrix-Vector (SpMV) product

w ← A ∗ v

[Figure: rows of w, A, and v distributed across processes P0–P3; left panel shows the data layout, right panel shows where data is sent]
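The on-process / off-process split above can be written down directly. The following is a minimal mpi4py + SciPy sketch; the matrix is random, the sizes are arbitrary, and the off-process entries of v are gathered with an Allgather for brevity, whereas a real kernel exchanges only the entries each process needs.

```python
# Minimal row-distributed SpMV sketch (illustrative sizes and sparsity).
from mpi4py import MPI
import numpy as np
from scipy import sparse

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n_local = 1000                       # rows (and vector entries) owned by this rank
n_global = n_local * nprocs
first_row = rank * n_local

# Local block of rows of the global matrix A (random pattern for illustration)
A_rows = sparse.random(n_local, n_global, density=5.0 / n_global,
                       format="csr", random_state=rank)

# Split the local rows into the on-process block (solid blocks in the figure)
# and the off-process block (patterned blocks, needing entries of v owned
# by other processes).
cols = np.arange(n_global)
on_cols = np.where((cols >= first_row) & (cols < first_row + n_local))[0]
off_cols = np.where((cols < first_row) | (cols >= first_row + n_local))[0]
A_on, A_off = A_rows[:, on_cols], A_rows[:, off_cols]

v_local = np.random.default_rng(rank).random(n_local)

# Communication phase: collect the off-process entries of v
# (simplified to an Allgather here)
v_global = np.empty(n_global)
comm.Allgather(v_local, v_global)

# Computation phase: on-process + off-process contributions
w_local = A_on @ v_local + A_off @ v_global[off_cols]
```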

SLIDE 5
  • Modeling difficult (more later)
  • Basic SpMV: rows-per-process layout

Cost of a Sparse Matrix-Vector (SpMV) product

[Figure: percent of time spent in communication vs. process ID for the nlpkkt240 matrix at 500K, 100K, and 50K non-zeros per core; panels show SpMV and all-reduce operations]

SLIDE 6

Case Study: Preconditioning (Algebraic Multigrid)

  • AMG: Algebraic Multigrid iteratively whittles away at the error
  • Series or hierarchy of successively smaller (and more dense) sparse matrices
  • SpMV dominated

[Figure: AMG V-cycle with relaxation sweeps x ← x + ωAᵢr applied down and back up the hierarchy A₀, A₁, A₂, A₃; nnz / n rows = 30, 64, 66, and 26 on Levels 0–3]
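As an aside, a hierarchy like this can be inspected with a few lines of PyAMG (a separate package, not the RAPtor code base used on this allocation); the sketch below builds a hierarchy for a model Poisson problem and prints the average non-zeros per row on each level. The numbers will differ from the example above.

```python
import pyamg

# 2D Poisson model problem; the hierarchy on the slide comes from a different
# discretization, so these values are only illustrative.
A = pyamg.gallery.poisson((200, 200), format='csr')
ml = pyamg.smoothed_aggregation_solver(A)

for lvl, level in enumerate(ml.levels):
    Al = level.A
    print(f"level {lvl}: n = {Al.shape[0]:7d}, nnz/n_rows = {Al.nnz / Al.shape[0]:.1f}")
```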

SLIDE 7

Case Study: Preconditioning (Algebraic Multigrid)

  • MFEM discretization
  • Linear elasticity
  • 8192 cores, 512 nodes, 10k dof / core

[Figure: time (seconds, roughly 10⁻⁴ to 10⁻²) spent on each level of a 25-level AMG hierarchy]

Smaller matrices ⇒ a larger fraction of time spent in communication

SLIDE 8

Observation 1: message volume between processes

[Figure: maximum number of messages (roughly 10²–10³) and maximum message size (roughly 10⁴–10⁵ bytes) across AMG levels]

1. High message volume and a high number of messages
2. Diminishing returns as more cores communicate
3. Cost: off-node > on-node > on-socket
SLIDE 9

Observation 2: limits of communication

T = α + (ppn · s) / min(R_N, ppn · R_B)

α: latency    s: message size    R_B: bandwidth between two processes    R_N: node injection bandwidth
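A small sketch that evaluates the model; the latency and bandwidth values are placeholders, not measured Blue Waters parameters.

```python
def max_rate_time(s, ppn, alpha=1e-6, R_B=6e9, R_N=15e9):
    """Max-rate model: time for each of ppn processes on a node to send s bytes.
    alpha: latency (s); R_B: bandwidth between two processes (B/s);
    R_N: node injection bandwidth (B/s). Values here are placeholders."""
    return alpha + (ppn * s) / min(R_N, ppn * R_B)

# With few processes per node the per-pair bandwidth R_B limits; with many,
# the node injection bandwidth R_N becomes the bottleneck.
for ppn in (1, 4, 16):
    print(ppn, max_rate_time(65536, ppn))
```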

Modeling MPI Communication Performance on SMP Nodes: Is it Time to Retire the Ping Pong Test, Gropp, Olson, Samfass, EuroMPI 2016.

1. High message volume and a high number of messages
2. Diminishing returns as more cores communicate
3. Cost: off-node > on-node > on-socket
SLIDE 10

Observation 3: node locality

[Figure: time (seconds) vs. number of bytes communicated, split into network (PPN ≥ 4), network (PPN < 4), on-node, and on-socket messages]

  • Messages split into short, eager, and rendezvous protocols
  • Messages partitioned into on-socket, on-node, and off-node (a classification sketch follows at the end of this slide)

1. High message volume and a high number of messages
2. Diminishing returns as more cores communicate
3. Cost: off-node > on-node > on-socket
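To make the two partitions above concrete, here is a hypothetical helper that classifies a message by protocol (short, eager, rendezvous) and by locality (on-socket, on-node, off-node). The byte thresholds and the node/socket layout are illustrative assumptions, not Blue Waters or Cray MPI values.

```python
def classify(size_bytes, src_rank, dst_rank, ppn=16, cores_per_socket=8):
    """Classify a message by protocol and locality (all thresholds hypothetical)."""
    if size_bytes <= 512:
        protocol = "short"
    elif size_bytes <= 8192:
        protocol = "eager"
    else:
        protocol = "rendezvous"

    same_node = (src_rank // ppn) == (dst_rank // ppn)
    same_socket = same_node and (
        (src_rank % ppn) // cores_per_socket == (dst_rank % ppn) // cores_per_socket)

    if same_socket:
        locality = "on-socket"
    elif same_node:
        locality = "on-node"
    else:
        locality = "off-node"
    return protocol, locality
```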
SLIDE 11

Anatomy of a node-level SpMV product

Six processes (P0–P5) distributed across three nodes (N0–N2); the linear system w = A ∗ v is distributed row-wise across the processes.

[Figure: node layout and the distribution of w, A, and v across P0–P5]

SLIDE 12

Standard Communication

[Figure: standard communication; processes n, m, and q shown across two nodes and their cores]

SLIDE 13

Standard Communication

[Figure: standard communication; processes n, m, and p shown across two nodes and their cores]

SLIDE 14

New Algorithm: On-Node Communication

[Figure: on-node communication between processes n and p]

SLIDE 15

New Algorithm: Off-Node Communication

[Figure: off-node communication among processes n, m, p, and q]

SLIDE 19

Node-Aware Parallel (NAP) Matrix Operation

1. Redistribute initial values
2. Inter-node communication
3. Redistribute received values
4. On-node communication
5. Local computation with on-process, on-node, and off-node portions

Note: step 4 and portions of step 5 can overlap with steps 1, 2, and 3

[Figure: the five steps illustrated for processes n, m, p, and q across nodes]

A simplified sketch of this pattern is given below.
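Below is a simplified mpi4py sketch of the node-aware pattern. It is not the RAPtor implementation: here every off-node value is funneled through rank 0 of each node and broadcast back, whereas the actual algorithm routes each value to the specific on-node process that needs it. The point is only to show the shape of steps 1 through 4.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Split into per-node communicators (by shared memory), plus a communicator
# containing one "leader" process per node.
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
node_rank = node_comm.Get_rank()
leader_comm = comm.Split(color=0 if node_rank == 0 else MPI.UNDEFINED, key=rank)

payload = {"src": rank}              # stand-in for the values this rank contributes

# Step 1: redistribute initial values -> gather values headed off-node onto
# the node leader.
gathered = node_comm.gather(payload, root=0)

# Step 2: inter-node communication between leaders only.
if node_rank == 0:
    from_all_nodes = leader_comm.allgather(gathered)
else:
    from_all_nodes = None

# Step 3: redistribute the received values to the processes on this node.
from_all_nodes = node_comm.bcast(from_all_nodes, root=0)

# Step 4: on-node communication (direct and cheap).
on_node_values = node_comm.allgather(payload)

# Step 5: local computation would now combine the on-process, on-node, and
# off-node portions (omitted in this sketch).
```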

SLIDE 20

Case Study: Preconditioning (Algebraic Multigrid)

Maximum number of messages sent from any process, on 16,384 processes

[Figure: maximum number of on-node messages (left) and off-node messages (right) per AMG level, reference SpMV vs. TAPSpMV]

SLIDE 21

Case Study: Preconditioning (Algebraic Multigrid)

Maximum size of messages sent from any process, on 16,384 processes

[Figure: maximum on-node message size (left) and off-node message size (right) in bytes per AMG level, reference SpMV vs. TAPSpMV]

SLIDE 22

Case Study: Preconditioning (Algebraic Multigrid)

[Figure: total time per AMG level (left) and strong scaling over 2,000 to 18,000 processes (right), reference SpMV vs. TAPSpMV]

Node aware sparse matrix-vector multiplication, Bienz, Gropp, Olson, in review at JPDC, 2018 (arXiv).

SLIDE 23

Cost analysis on Blue Waters

  • Blue Waters provided a unique setting for two aspects:
  • 1. Model MPI queueing times
  • 2. Model network contention

[Figure: time (seconds) vs. number of messages communicated, for message sizes from 16 bytes to 262,144 bytes]

  • The MPI_Irecv message queue is costly
  • Identified a quadratic cost (illustrated below)
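The quadratic behavior can be pictured with a toy cost term: if matching each newly arriving message requires scanning the receives already in the queue, the total search work grows roughly quadratically with the number of queued messages. This is an illustration only; the coefficient is a placeholder and the measured model appears in the EuroMPI paper cited on a later slide.

```python
def queue_search_time(num_messages, t_match=2e-8):
    """Hypothetical total time spent searching the posted-receive queue when
    num_messages messages arrive and each arrival scans the queued receives:
    roughly t_match * (1 + 2 + ... + num_messages)."""
    return t_match * num_messages * (num_messages + 1) / 2.0

for q in (10, 100, 1000):
    print(q, queue_search_time(q))   # grows roughly quadratically with q
```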
SLIDE 24

Cost analysis on Blue Waters

  • Blue Waters provided a unique setting for two aspects:
  • 1. Model MPI queueing times
  • 2. Model network contention
  • Network contention is costly
  • Identified a hop model

[Figure: time (seconds) vs. number of messages communicated for message sizes from 16 bytes to 262,144 bytes, grouped as G0–G3; a second panel highlights 256-byte and 16,384-byte messages]

SLIDE 25

Cost analysis on Blue Waters

  • Blue Waters provided a unique setting for two aspects:
  • 1. Model MPI queueing times
  • 2. Model network contention

[Figure: measured time vs. modeled time (max-rate, queue search, contention) across levels 1–6 of the AMG hierarchy]

Improving Performance Models for Irregular Point-to-Point Communication, Bienz, Gropp, Olson, in review at EuroMPI, 2018.

SLIDE 26

Summary and Ongoing Work

  • Drop-in replacement for a range of sparse matrix operations (SpMV, SpMM, MIS(k), assembly operations, etc.)

  • Blue Waters instrumental in testing at scale, reproducible outcomes, and accurate performance analysis
  • (this) Code base: https://github.com/lukeolson/raptor
  • Structured code base: https://github.com/cedar-framework/cedar

This research is part of the Blue Waters sustained petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.