Understanding and Optimizing Communication Performance on HPC Networks - PowerPoint PPT Presentation


SLIDE 1

Understanding and Optimizing Communication Performance on HPC Networks

Contributors: Nikhil Jain, Abhinav Bhatele, Todd Gamblin, Xiang Ni, Michael Robson, Bilge Acun, Laxmikant Kale
University of Illinois at Urbana-Champaign
http://charm.cs.illinois.edu/~nikhil/

SLIDE 2

Communication in HPC

  • A necessity, but can be viewed as an overhead
  • Can consume half the execution time

[Figure: Time spent in communication (%) vs. number of cores (17,500 to 70,000) for EpiSimdemics, ClothSim, OpenAtom, NAMD, PF3D, and MILC]


SLIDE 4

Communication in HPC

  • Complex interplay of several components: hardware, configurable network properties, interaction patterns, algorithms…
  • As a user: limited control over environment and interference
  • As an admin: how to best use the system while keeping users happy
  • Diverse apps (e.g., MILC, OpenAtom) on many systems (e.g., Torus, Dragonfly)


SLIDE 6

Use Case: OpenAtom

Topology Aware Mapping

  • Profile applications for their communication graphs and map them
  • Extremely important for Torus-based systems; ongoing work on other topologies

[Figure: Time per step (s) vs. number of nodes (each node is 64 threads) for OpenAtom on Vulcan. Left: scaling for MOF at 256-2048 nodes, Default vs. Topo-aware. Right: Min-Def, Min-Topo, BOMD-Def, BOMD-Topo at 256-1024 nodes]


SLIDE 8

[Diagram: map(app, network) - application ranks mapped onto a 3D torus]

Rubik - Python-based tool to create maps

[Figure: Time spent in MPI calls on 4,096 nodes under different mappings (Default, RR, Node, Tile1-Tile4, Tilt). pF3D: Recv, Barrier, Send, Alltoall (40-160 s). MILC: Irecv, Isend, Allreduce, Wait (100-500 s)]
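The idea behind tiled mappings can be sketched in a few lines (a hypothetical standalone function, not Rubik's actual API): fill the torus one small sub-block at a time, so that consecutive ranks, which often communicate most, land on nearby nodes.

```python
# Minimal sketch (hypothetical, not Rubik's actual API): map consecutive MPI
# ranks into small sub-blocks ("tiles") of a 3D torus so that ranks that
# communicate with nearby ranks also land on nearby nodes.

def tile_map(torus_dims, tile_dims):
    """Return rank -> (x, y, z) assignments that fill the torus tile by tile."""
    TX, TY, TZ = torus_dims
    tx, ty, tz = tile_dims
    assert TX % tx == 0 and TY % ty == 0 and TZ % tz == 0
    mapping = {}
    rank = 0
    # Iterate over tile origins, then over coordinates within each tile.
    for ox in range(0, TX, tx):
        for oy in range(0, TY, ty):
            for oz in range(0, TZ, tz):
                for dx in range(tx):
                    for dy in range(ty):
                        for dz in range(tz):
                            mapping[rank] = (ox + dx, oy + dy, oz + dz)
                            rank += 1
    return mapping

m = tile_map((4, 4, 4), (2, 2, 2))
print(m[0], m[1], m[7])  # ranks 0-7 share one 2x2x2 tile
```

Changing the tile shape (e.g., Tile1 through Tile4 in the plots) trades off dilation along different torus dimensions for a given communication graph.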


SLIDE 12

Understanding Networks

  • What determines communication performance? How can we predict it?
  • Quantification of metrics
  • What is the relation between performance and the entities quantified above?
    • Linear, higher polynomial, or indeterminate?
    • Is statistical data related to performance?
  • Method 1: Supervised Learning
    • More on this in Abhinav's talk


SLIDE 15

Method 2: Packet-level Simulation

  • Detailed study of what-if scenarios
  • Comparison of similar systems
  • BigSim was among the earliest accurate packet-level HPC network simulators (circa 2004)
  • Reviving emulation and simulation capabilities of BigSim
  • BigSim + CODES + ROSS = TraceR
    • More on this in Bilge's talk

SLIDE 16

Method 3: Modeling via Damselfly

  • Intermediate methods sufficient to answer certain types of questions
  • Q1: What is the best combination of routing strategies and job placement policies for single jobs?
  • Q2: What is the best combination for parallel job workloads?
  • Q3: Should the routing policy be job-specific or system-wide?
slide-17
SLIDE 17

Dragonfly Topology

  • Level 1: Dense connectivity among routers to form groups
  • Level 2: Dense connectivity among groups as virtual routers
  • Examples: IBM PERCS, Cray Aries/XC30
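The two-level structure can be illustrated with a toy construction (illustrative parameters and wiring, not the actual PERCS or Aries layout): all-to-all links inside each group, plus one global link per pair of groups.

```python
# Toy two-level dragonfly-like graph (illustrative parameters, not PERCS/Aries):
# routers within a group are fully connected (level 1), and each pair of
# groups is joined by one global link between designated routers (level 2).

from itertools import combinations

def dragonfly_links(num_groups, routers_per_group):
    links = set()
    # Level 1: all-to-all among routers of the same group.
    for g in range(num_groups):
        base = g * routers_per_group
        for a, b in combinations(range(base, base + routers_per_group), 2):
            links.add((a, b))
    # Level 2: one global link per group pair; the endpoint router inside
    # each group is chosen round-robin so global links are spread out.
    for g1, g2 in combinations(range(num_groups), 2):
        r1 = g1 * routers_per_group + (g2 % routers_per_group)
        r2 = g2 * routers_per_group + (g1 % routers_per_group)
        links.add((min(r1, r2), max(r1, r2)))
    return links

links = dragonfly_links(num_groups=4, routers_per_group=3)
# 4 groups * C(3,2) = 12 local links + C(4,2) = 6 global links = 18 links
print(len(links))
```

Real systems add more global links per group pair and attach compute nodes to each router; the point here is only the dense local plus sparse global split that the routing strategies on the following slides have to navigate.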

slide-19
SLIDE 19

What needs to be evaluated?

Job Placement             | Routing                | Comm Kernel
--------------------------|------------------------|--------------------------
Random Nodes (RDN)        | Static Direct (SD)     | Unstructured Mesh
Random Routers (RDR)      | Static Indirect (SI)   | 2D Stencil
Random Chassis (RDC)      | Adaptive Direct (AD)   | 4D Stencil
Random Group (RDG)        | Adaptive Indirect (AI) | Many-to-many
Round Robin Nodes (RRN)   | Adaptive Hybrid (AH)   | Spread
Round Robin Routers (RRR) | Job-specific (JS)      | Parallel Workloads (4)

Total cases ~ 360, for 8.8 million cores with 92,160 routers

slide-20
SLIDE 20

Model for link utilization

  • Input to the model:
    1. Network graph of Dragonfly routers
    2. Application communication graph for a communication step
    3. Job placement
    4. Routing strategy
  • Output: the steady-state traffic distribution on all network links, which is representative of the network throughput
  • Implemented as a scalable parallel MPI program executed on Blue Gene/Q
    • Maximum runtime of 2 hours on 8,192 cores for a prediction on 8.8 million cores


slide-30
SLIDE 30

  • Initialize two copies of the network graph N:
    • NAlloc: stores total and per-message allocated bandwidth (= 0)
    • NRemain: stores bandwidth available for allocation (= capacity; start with 10 GB/s per link)
  • Iteratively solve for the representative state NAlloc; while any message is allocated additional bandwidth:
    • for each message m, obtain the list of paths P(m)
    • using P(m) of all messages, find the request count for each link
    • for each path p in P(m), compute its availability using NRemain
    • using availability, allocate more bandwidth to the messages
    • update NAlloc and NRemain to reflect the new allocations
  • Use NAlloc to compute the distribution of bytes on the given links

[Figure: example with source S and destination D connected by three paths P1, P2, P3; per-link request counts 1, 2, 2, 1, 3; resulting path availabilities 10, 5, and 3.33 GB/s]
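The iterative solve above can be sketched in a few lines of Python (a simplified, hypothetical reimplementation: one fixed path per message and uniform 10 GB/s links; the real Damselfly model handles multiple candidate paths per message and adaptive routing):

```python
# Simplified sketch of the slide's iterative bandwidth-allocation loop.
# messages: dict name -> list of link ids (the message's single path).

def steady_state_allocation(messages, capacity=10.0, rounds=100, eps=1e-9):
    links = {l for path in messages.values() for l in path}
    remain = {l: capacity for l in links}   # NRemain: bandwidth left per link
    alloc = {m: 0.0 for m in messages}      # NAlloc: bandwidth per message
    active = set(messages)
    for _ in range(rounds):
        if not active:
            break
        # Request count for each link over all active messages' paths.
        requests = {l: 0 for l in links}
        for m in active:
            for l in messages[m]:
                requests[l] += 1
        # Availability of each path: bottleneck fair share along it,
        # computed from a snapshot of NRemain so message order does not matter.
        shares = {m: min(remain[l] / requests[l] for l in messages[m])
                  for m in active}
        # Allocate, then update NAlloc and NRemain with the new allocations.
        for m, s in shares.items():
            if s < eps:
                active.discard(m)   # saturated path: no further bandwidth
                continue
            alloc[m] += s
            for l in messages[m]:
                remain[l] -= s
    return alloc

# m2, m3, m4 contend on link "b"; m1 shares link "a" with m2 only.
msgs = {"m1": ["a"], "m2": ["a", "b"], "m3": ["b"], "m4": ["b"]}
alloc = steady_state_allocation(msgs)
print({m: round(v, 2) for m, v in alloc.items()})
# m1 ~ 6.67; m2, m3, m4 ~ 3.33 (max-min fair shares)
```

The example needs several rounds: once link "b" saturates and caps m2 at 10/3 GB/s, later rounds hand m1 the bandwidth left over on link "a", which is exactly the "while a message is allocated additional bandwidth" termination condition on the slide.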

slide-31
SLIDE 31

How to read the plots?

[Example box plot: Link Usage (MB, log scale 1 to 1E2) for job placements P1 and P2, grouped by routing; markers show minimum, 1st quartile, median, average, 3rd quartile, and maximum; callouts note where the minimum and 1st quartile coincide, and the lowest maximum]

  • Maximum traffic on any link: indicates a network hotspot
  • Average traffic on all links: indicates relative merit
  • Median traffic: valuable for estimating the distribution by comparing with the average
  • Ideal: a distribution with close values for all data points; the lower, the better
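The statistics plotted for each (placement, routing) pair can be reproduced from raw per-link traffic; a small sketch with made-up numbers (hypothetical data, standard library only):

```python
# Sketch: the summary statistics behind each box in the plots, computed from
# a list of per-link traffic values (numbers are made up for illustration).

from statistics import mean, median, quantiles

link_usage_mb = [1, 2, 2, 3, 3, 3, 4, 5, 8, 40]   # hypothetical per-link MB

q1, q2, q3 = quantiles(link_usage_mb, n=4)  # quartiles
stats = {
    "minimum": min(link_usage_mb),
    "1st quartile": q1,
    "median": q2,
    "average": mean(link_usage_mb),   # average >> median hints at hotspots
    "3rd quartile": q3,
    "maximum": max(link_usage_mb),    # the network hotspot
}
print(stats)
```

Here the average (7.1) sits well above the median (3.0) because one link carries 40 MB, which is exactly the skewed, hotspot-heavy distribution the slides flag as undesirable.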

slide-36
SLIDE 36

Single job: Unstructured Mesh, 6-20 partners with 512 KB messages

  • Job placement: blocking reduces the maximum (up to 90% drop) and the average (up to 92% drop)
  • Indirect routing: increases the average, but reduces the maximum by 50% in the best case
  • Adaptivity: similar distribution as static, but with a lower maximum; AI leads to a 50% reduction in maximum traffic; hybrid does worse than AI

slide-39
SLIDE 39

Single job: Random Neighbors, 6-20 partners with 512 KB messages

  • Job placement: negligible impact!
  • Indirect routing: shifts the graph upwards and increases all quartiles; 100% increase in maximum and average
  • Adaptivity: minor gains, 10% reduction in maximum; hybrid does better than AI

slide-42
SLIDE 42

Parallel Workloads: % Core Distribution

Comm Pattern      | Workload 1 | Workload 2 | Workload 3 | Workload 4
------------------|------------|------------|------------|-----------
Unstructured Mesh |     20     |     10     |     20     |     40
2D Stencil        |     10     |     10     |     40     |     10
4D Stencil        |     40     |     20     |     10     |     20
Many-to-many      |     20     |     40     |     10     |     20
Random neighbors  |     10     |     20     |     20     |     10

slide-43
SLIDE 43

Workloads

[Figure: Link Usage (MB, log scale 1 to 1E5) for placements RDN, RDR, RDC, RDG, RRN, RRR under Adaptive Direct, Adaptive Indirect, and Adaptive Hybrid routing; (a) Workload 1 (all links) and (b) Workload 2 (all links); markers show median, average, and lowest maximum]

  • Adaptivity reduces the maximum traffic by 35%
  • Hybrid with RDN/RDR shows the lowest data points

slide-44
SLIDE 44

Job-specific Routing

[Figure: Link Usage (MB, log scale 1 to 1E4) for placements RDN, RDR, RDC, RDG, RRN, RRR under Workload 2 and Workload 4; markers show median, average, and the lowest maximum for each workload]


slide-48
SLIDE 48

Summary

  • Fast analytical model enables studies with a large number of scenarios
  • Adaptivity results in significantly lower values for maximum and average traffic (up to 50% reduction)
  • Q1. What is the best combination for single job runs? Depends on the job being run!
    • Patterns with communication among nearby MPI ranks benefit from blocking
    • Indirect routing is better when the communication pattern is not sufficiently spread by the application or job placement
    • Hybrid routing provides a similar distribution as Adaptive Indirect, but its data points are shifted depending on the communication pattern


slide-51
SLIDE 51

Summary

  • Q2. What is the best combination for parallel workloads?
    • Similar distributions are observed irrespective of the job proportions in the workloads!
    • Adaptive Hybrid combines the best of both worlds
    • Randomized placement with node/router-based blocking is good
  • Q3. Is it beneficial to use job-specific routing?
    • Yes; it provides a similar distribution as the best routing while reducing the values of data points such as the maximum

slide-52
SLIDE 52

Relevant publications

  • Predicting Application Performance Using Supervised Learning on Communication Features. SC 2013.
  • Mapping to Irregular Torus Topologies and Other Techniques for Petascale Biomolecular Simulation. SC 2014.
  • Maximizing Network Throughput on the Dragonfly Interconnect. SC 2014.
  • Improving Application Performance via Task Mapping on IBM Blue Gene/Q. HiPC 2014.
  • Identifying the Culprits behind Network Congestion. IPDPS 2015.