SLIDE 1

ON NETWORK LOCALITY IN MPI-BASED HPC APPLICATIONS

Felix Zahn, Holger Fröning
49th International Conference on Parallel Processing (ICPP)
August 20, 2020, Edmonton, AB, Canada

SLIDE 2

MOTIVATION

“Locality is efficiency, efficiency is power, power is performance, performance is king”

  • Bill Dally*

*According to Wikipedia; cited from Johnson, Matt (2011). An Analysis of Linux Scalability to Many Cores. p. 4

SLIDE 3

MPI TRAFFIC PATTERNS

Currently:

  • Message/injection rate, message size distribution, idle times, …
  • Heat maps
  • Good “human readability”

=> How to compare workloads objectively?

SLIDE 4

NEW METRICS

Application level:

  • Directly derived from MPI traces
  • No particular network information
  • Metric: Locality
  • Metric: Selectivity

Network level:

  • Model for torus, fat-tree, and dragonfly
  • Studies include multi-core analyses, network locality, and network utilization

[Figure: traffic between all sources and all destinations. Locality: distance; Selectivity: traffic]

SLIDE 5

APPLICATIONS

DoE exascale mini-applications

  • Repository with dumpi traces
  • Maintained by Sandia National Laboratories

Limitations

  • Missing header (no information about MPI derived datatypes, DDTs)
  • Custom communicators (MPI_Cart_create, MPI_Cart_sub)

Sources

  • B. Klenk et al., "Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors," 2017 IPDPS
  • https://portal.nersc.gov/project/CAL/doe-miniapps.htm

SLIDE 6

RANK LOCALITY

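  • “Distance” between rank pairs
  • 90% threshold
  • Hints at the communication pattern

$$\mathrm{dist} = \left| \mathrm{rank}_{\mathrm{source}} - \mathrm{rank}_{\mathrm{dest}} \right|$$

$$\mathrm{locality} = \frac{1}{\mathrm{dist}}$$

Examples:

  • Rank_0 -> Rank_2: distance=2, locality=50%
  • Rank_4 -> Rank_1: distance=3, locality=33%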
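The distance and locality definitions on this slide translate into a few lines of Python. This is an illustrative sketch, not the authors' tooling: the function names and the `(source, dest, nbytes)` message tuples are hypothetical, and reading the 90% threshold as "the smallest rank distance within which 90% of the traffic volume falls" is an assumption.

```python
def rank_distance(source, dest):
    """dist = |rank_source - rank_dest|, as defined on the slide."""
    return abs(source - dest)


def locality(source, dest):
    """locality = 1 / dist; undefined for a rank talking to itself."""
    dist = rank_distance(source, dest)
    if dist == 0:
        raise ValueError("locality is undefined for dist == 0")
    return 1.0 / dist


def distance_at_threshold(messages, threshold=0.9):
    """Smallest rank distance covering `threshold` of the traffic volume.

    ASSUMPTION: this interpretation of the slide's "90% threshold" is ours.
    `messages` is an iterable of (source, dest, nbytes) tuples.
    """
    volume_per_distance = {}
    for source, dest, nbytes in messages:
        dist = rank_distance(source, dest)
        volume_per_distance[dist] = volume_per_distance.get(dist, 0) + nbytes

    total = sum(volume_per_distance.values())
    if total == 0:
        raise ValueError("no traffic in trace")

    covered = 0
    for dist in sorted(volume_per_distance):
        covered += volume_per_distance[dist]
        if covered >= threshold * total:
            return dist
    return max(volume_per_distance)


if __name__ == "__main__":
    print(locality(0, 2))            # Rank_0 -> Rank_2
    print(round(locality(4, 1), 2))  # Rank_4 -> Rank_1
```

With the slide's two examples, `locality(0, 2)` gives 0.5 and `locality(4, 1)` gives roughly 0.33, matching the 50% and 33% values stated above.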

SLIDE 7

RANK LOCALITY

Example nearest neighbor:

  • Max. distance: 2 => locality > 50%
  • Dimensionality

Rank Locality

Workload            Ranks    1D     2D     3D
AMG                   216    3%    17%   100%
                     1728    1%     8%   100%
Boxlib CNS large       64    3%    13%    21%
                      256    1%     8%    13%
                     1024    0%     3%     7%
EXMATEX LULESH         64    6%    24%   100%
                      512    2%     6%   100%
MultiGrid C           125    2%     6%    17%
                     1000    0%     3%     9%
PARTISN               168    7%   100%    22%

[Figure panels: 1D, 2D]

SLIDE 8

SELECTIVITY

To how many communication partners does one rank send 90% of its traffic volume?

  • Only point-to-point
  • Independent of distances

Peers:

  • Total number of communication partners
  • For comparison (difference equals 10% of total traffic volume)

[Figure: relative share of total network traffic [%] per destination rank (example ranks 1, 4, 16, 5, 17, 20), with the "90% of total traffic" mark]
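Selectivity as described on this slide can be sketched as a small counting routine. The sketch is hedged: the function names and the `bytes_to_peer` dictionary are hypothetical, and sorting partners by descending volume before accumulating to 90% is an assumption about how the count is obtained.

```python
def selectivity(bytes_to_peer, threshold=0.9):
    """Number of partners that receive `threshold` of one rank's traffic.

    `bytes_to_peer` maps destination rank -> bytes sent (point-to-point only).
    ASSUMPTION: partners are taken in order of descending volume.
    """
    volumes = sorted((v for v in bytes_to_peer.values() if v > 0), reverse=True)
    total = sum(volumes)
    if total == 0:
        return 0

    covered = 0
    for count, volume in enumerate(volumes, start=1):
        covered += volume
        if covered >= threshold * total:
            return count
    return len(volumes)


def peers(bytes_to_peer):
    """Total number of communication partners (any non-zero traffic)."""
    return sum(1 for v in bytes_to_peer.values() if v > 0)


if __name__ == "__main__":
    # Made-up volumes for the ranks shown in the slide's figure.
    sent = {1: 40, 4: 30, 16: 15, 5: 8, 17: 4, 20: 3}
    print(selectivity(sent), peers(sent))
```

In this made-up example the rank has six peers, but four of them already account for 90% of its volume; the remaining two peers carry the last 10%, which is the difference between the two values that the slide points out.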

SLIDE 9

SELECTIVITY

[Figure: relative share of total network traffic [%] (20 to 100) over ranks (2 to 14). Workloads: MOCFE (64), MiniFE (18), Lulesh (64), Multigrid_C (125), Nekbone (64), Boxlib CNS (64), AMR (64), Partisn (168), AMG (27), Boxlib MultiGrid (64), Fillboundary (125), SNAP (168)]

SLIDE 10

SELECTIVITY

[Figure: share of peers and selectivity (0.0 to 1.0); legend: Selectivity, Peers, Ranks; one entry per workload and rank count, from AMG (8) to AMR_Miniapp (1728)]

SLIDE 11

NETWORK LAYER

Model to provide static analyses

  • No simulation needed, non-temporal character
  • Includes: routing, number of links, bandwidth (BW), packets

Topologies

  • 3D Torus
  • Fat Tree
  • Dragonfly

Shortest-path routing

  • Deterministic
  • No traffic flows; emphasis on topology characteristics

Incremental mapping
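For the 3D torus, a static hop count under shortest-path routing can be computed directly from node coordinates, without simulation. The sketch below is illustrative only: it assumes that incremental mapping places rank i on node i with the x coordinate varying fastest and one rank per node, and the torus dimensions are hypothetical. Fat-tree and dragonfly would need their own hop functions and are not covered here.

```python
def rank_to_coords(rank, dims):
    """Map a rank to (x, y, z) torus coordinates.

    ASSUMPTION: incremental mapping, x varies fastest, one rank per node.
    """
    x_dim, y_dim, _z_dim = dims
    return (rank % x_dim, (rank // x_dim) % y_dim, rank // (x_dim * y_dim))


def torus_hops(source, dest, dims):
    """Shortest-path hop count between two ranks on a 3D torus.

    Each dimension wraps around, so the per-dimension distance is the
    shorter of the direct path and the path across the wrap-around link.
    """
    a = rank_to_coords(source, dims)
    b = rank_to_coords(dest, dims)
    return sum(
        min(abs(p - q), size - abs(p - q))
        for p, q, size in zip(a, b, dims)
    )


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical 64-node torus
    print(torus_hops(0, 3, dims))   # neighbors via the wrap-around link
    print(torus_hops(0, 42, dims))  # maximum distance in a 4x4x4 torus
```

Because the routing is deterministic, the hop count of every message in a trace is fixed by the mapping alone, which is what makes a purely static analysis possible.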


SLIDE 12

MULTI-CORE EFFECTS

How does network traffic change with the number of cores/socket?

  • One rank/core
  • Incremental mapping
  • Includes p2p & collectives

[Figure: relative share of total network traffic [%] (20 to 100) over cores/die (10 to 40). Workloads: MiniFE (1024), Boxlib Multigrid_C (1024), Fillboundary (1000), AMG (1728), BigFFT (1024), Lulesh (512), Multigrid_C (1000), Nekbone (1024), Boxlib_CNS (1024), Crystal Router (1000), AMR (1728)]
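The multi-core question can be illustrated with a short sketch that computes which share of the traffic volume still crosses the network for a given number of cores per socket. It assumes one rank per core and an incremental mapping in which rank r lands on socket `r // cores_per_socket`; the function name and the `(source, dest, nbytes)` tuples are hypothetical, and collectives would first have to be expanded into such point-to-point tuples, which is not shown.

```python
def network_traffic_share(messages, cores_per_socket):
    """Fraction of the traffic volume that leaves the socket.

    ASSUMPTION: incremental mapping, i.e. rank r runs on socket
    r // cores_per_socket, and intra-socket traffic never hits the network.
    `messages` is an iterable of (source, dest, nbytes) tuples.
    """
    total = 0
    remote = 0
    for source, dest, nbytes in messages:
        total += nbytes
        if source // cores_per_socket != dest // cores_per_socket:
            remote += nbytes
    return remote / total if total else 0.0


if __name__ == "__main__":
    # Made-up trace: three messages between neighboring ranks.
    trace = [(0, 1, 100), (1, 2, 100), (3, 4, 200)]
    for cores in (1, 2, 8):
        print(cores, network_traffic_share(trace, cores))
```

Sweeping `cores_per_socket` over a range and plotting the result per workload would give curves of the kind summarized in the figure placeholder above.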

SLIDE 13

HOPS & UTILIZATION

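Topological locality:

  • Routing information required

$$\mathrm{packethops} = \sum_{p \in \mathrm{packets}} \#\mathrm{hops}(p)$$

$$\mathrm{hops} = \frac{\mathrm{packethops}}{\#\mathrm{packets}}$$

Network utilization:

  • Depending on distances and number of links

$$\mathrm{datavolume} = \sum_{p \in \mathrm{packets}} \#\mathrm{hops}(p) \cdot \mathrm{size}(p)$$

$$\mathrm{utilization} = \frac{\mathrm{datavolume}}{\mathrm{BW} \cdot t_{\mathrm{execution}} \cdot \#\mathrm{links}}$$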
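The two formulas on this slide can be evaluated with a few lines of Python once the hop count and size of every packet are known. The sketch is illustrative: the `(hops, size_bytes)` packet tuples, the function names, and the choice of bytes and seconds as units are assumptions, and the numbers in the example are made up.

```python
def average_hops(packets):
    """hops = packethops / #packets, with packethops = sum of #hops(p)."""
    if not packets:
        raise ValueError("no packets")
    packet_hops = sum(hops for hops, _size in packets)
    return packet_hops / len(packets)


def utilization(packets, bandwidth, t_execution, num_links):
    """utilization = datavolume / (BW * t_execution * #links).

    datavolume counts each packet once per link it traverses.
    ASSUMPTION: bandwidth is per link in bytes/s, t_execution in seconds.
    """
    data_volume = sum(hops * size for hops, size in packets)
    return data_volume / (bandwidth * t_execution * num_links)


if __name__ == "__main__":
    # Made-up packets as (hops, size_bytes).
    packets = [(1, 1000), (3, 2000), (2, 1000)]
    print(average_hops(packets))
    print(utilization(packets, bandwidth=1000, t_execution=3, num_links=6))
```

The result of `utilization` is a fraction of the aggregate link capacity over the execution time; multiplying by 100 gives a percentage.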

SLIDE 14

HOPS & UTILIZATION

[Figure: avg. number of hops (2 to 8) for 3D Torus, Fat-Tree, and Dragonfly; one entry per workload and rank count, from AMG (8) to AMR_Miniapp (1728)]

SLIDE 15

HOPS & UTILIZATION

Network utilization [%]

Workload                    Ranks   3D Torus   Fat-tree   Dragonfly
AMG                             8     0.0052     0.0303      0.0116
                               27     0.0012     0.0034      0.0034
                              216     0.0008     0.0032      0.0021
                             1728     0.0001     0.0004      0.0002
AMR Miniapp                    64     0.0034     0.0058      0.0048
                             1728     0.0278     0.0229      0.0119
BigFFT (Medium)                 9     0.6721     3.0725      1.2943
                              100     7.4849    10.5544      7.6985
                             1024    47.2317    38.4346     22.1491
Boxlib CNS large (*)           64     0.0002     0.0003      0.0003
                              256     0.0004     0.0005      0.0004
                              256     0.0005     0.0006      0.0004
                             1024     0.0012     0.0010      0.0006
Boxlib MultiGrid C             64     0.0011     0.0020      0.0017
                              256     0.0035     0.0045      0.0032
                              256     0.0036     0.0046      0.0033
                             1024     0.0106     0.0092      0.0054
CESAR MOCFE (*)                64     0.0498     0.0769      0.0605
                              256     0.1216     0.1368      0.0895
                             1024     0.4495     0.3656      0.2108
CESAR Nekbone (*)              64     0.0027     0.0090      0.0081
                              256     0.3447     0.3882      0.2541
                             1024     0.0029     0.0057      0.0035
Crystal Router                 10     0.0469     0.1938      0.0882
                              100     0.0408     0.0637      0.0490
                             1000     0.1475     0.1531      0.0959
EXMATEX CMC 2D Multinode       64    2.0E-05    3.0E-05     2.4E-05
                              256     0.0001     0.0001      0.0001
                             1024     0.0008     0.0007      0.0004
EXMATEX LULESH                 64     0.0004     0.0013      0.0011
                               64     0.0004     0.0016      0.0013
                              512     0.0005     0.0020      0.0012
FillBoundary                  125     0.0319     0.0466      0.0351
                             1000     0.0245     0.0248      0.0160
MiniFE                         18     0.0008     0.0031      0.0015
                              144     0.0017     0.0025      0.0017
                             1152     0.0039     0.0037      0.0022
MultiGrid C                   125     0.0038     0.0056      0.0041
                             1000     0.0013     0.0013      0.0008
PARTISN (*)                   168    7.4E-08    1.6E-07     1.2E-07
SNAP (*)                      168    4.2E-07    6.2E-07     4.0E-07

SLIDE 16

CONCLUSION

Two new metrics to quantify communication patterns

  • Rank Locality: deriving types of communication and dimensionality
  • Selectivity: small subsets of ranks cause most of the communication
  • => Advanced mapping can reduce network traffic

Network level analyses

  • Multi-core: traffic plateaus at 8 cores/socket -> no further reduction of traffic volume
  • Hops: 3D torus is the best fit for <256 ranks; fat-tree takes over for larger configurations
  • Overall low network utilization of less than 1% for 14/15 applications

SLIDE 17

OUTLOOK

Analyzing more traffic patterns for locality and selectivity

  • Goal: Predict communication pattern from two values

Further quantitative description of MPI communication

  • Substitute complex network simulation with accurate models
  • Deeper understanding benefits network design

Mapping

  • Identifying subsets of heavily communicating traces to map them together
  • Reduce network load and save energy by provisioning the network appropriately

SLIDE 18

THANK YOU! SPONSORS
