Automating Topology Aware Mapping for Supercomputers Abhinav - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Automating Topology Aware Mapping for Supercomputers

Abhinav Bhatele, Gagan Gupta, Laxmikant V. Kale

slide-2
SLIDE 2

Application Topologies

[Figure: application communication topology; labels: Patch, Compute, Proxy]

slide-3
SLIDE 3

Interconnect Topologies

  • Three dimensional meshes
      • 3D Torus: Blue Gene/L, Blue Gene/P, Cray XT4/5
  • Trees
      • Fat-trees (InfiniBand) and Clos networks (Federation)
  • Dense Graphs
      • Kautz Graph (SiCortex), Hypercubes
  • Future Topologies?
      • Blue Waters, Blue Gene/Q

slide-4
SLIDE 4

The Mapping Problem

  • Applications have a communication topology and processors have an interconnect topology
  • Definition: Given a set of communicating parallel “entities”, map them on to physical processors to optimize communication
  • Goals:
      • Balance computational load
      • Minimize communication traffic and hence contention

slide-5
SLIDE 5

Scope of this work

  • Currently we are focused on 3D mesh/torus machines
  • For certain classes of applications

[Figure: application classes; labels: Communication bound, Computation bound, Latency tolerant, Latency sensitive]

slide-7
SLIDE 7

Application specific mapping

[Figure: NAMD mapping illustration; labels: Inner Brick, Outer Brick, Patch 1, Patch 2]

[Plot: OpenAtom, time per step (s) vs. number of cores (512 to 8192); series: Default, Topology]

[Plot: NAMD, time per step (ms) vs. number of cores (512 to 16384); series: Topology Oblivious, TopoAware Patches, TopoAware LDBs]

  • A. Bhatele, E. Bohm, and L. V. Kale. A Case Study of Communication Optimizations on 3D Mesh Interconnects. In Euro-Par, LNCS 5704, pages 1015–1028, 2009. Distinguished Paper Award.
  • A. Bhatele, L. V. Kale, and S. Kumar. Dynamic Topology Aware Load Balancing Algorithms for Molecular Dynamics Applications. In 23rd ACM International Conference on Supercomputing (ICS), 2009.

slide-8
SLIDE 8

Automatic Mapping

  • Obtaining the processor topology and the application communication graph
  • Pattern matching to identify regular patterns
      • 2D/3D near-neighbor communication
  • A suite of heuristics: the right strategy is invoked depending on the communication scenario:
      • Regular communication
      • Irregular communication

slide-9
SLIDE 9

Topology Discovery

  • Topology Manager API: for 3D interconnects (Blue Gene, XT)
  • Information required for mapping:
      • Physical dimensions of the allocated job partition
      • Mapping of ranks to physical coordinates and vice versa
  • On Blue Gene machines such information is available and the API is a wrapper
  • On Cray XT machines, we jump through several hoops to get this information and make it available through the same API

http://charm.cs.uiuc.edu/~bhatele/phd/TopoMgrAPI.tar.gz
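A minimal sketch of the two queries listed above (partition dimensions, rank to coordinates and back), plus the hop distance they make possible on a torus. The class name `TorusPartition`, its methods, and the rank layout (x varying fastest) are assumptions for illustration; this is not the Topology Manager API itself, and real machines may order ranks differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TorusPartition:
    """Hypothetical stand-in for a topology query layer on a 3D torus partition."""
    dim_x: int
    dim_y: int
    dim_z: int

    def rank_to_coords(self, rank):
        # Assumption: ranks are laid out with x varying fastest, then y, then z.
        x = rank % self.dim_x
        y = (rank // self.dim_x) % self.dim_y
        z = rank // (self.dim_x * self.dim_y)
        return (x, y, z)

    def coords_to_rank(self, x, y, z):
        # Inverse of rank_to_coords under the same layout assumption.
        return x + self.dim_x * (y + self.dim_y * z)

    def hops(self, rank_a, rank_b):
        # Shortest path length on a torus: each dimension may wrap around.
        dims = (self.dim_x, self.dim_y, self.dim_z)
        total = 0
        for a, b, dim in zip(self.rank_to_coords(rank_a),
                             self.rank_to_coords(rank_b), dims):
            delta = abs(a - b)
            total += min(delta, dim - delta)
        return total


partition = TorusPartition(8, 8, 16)
far_corner = partition.coords_to_rank(7, 0, 0)
print(partition.hops(0, far_corner))  # wraparound link makes this a single hop
```

With only these three calls a mapping heuristic can stay machine independent: it asks for distances and never needs to know how the coordinates were obtained.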

slide-10
SLIDE 10

Application communication graph

  • Several ways to obtain the graph
  • MPI applications:
      • Profiling tools (IBM’s HPCT tools)
      • Graph obtained from a run can only be used in a subsequent run
  • Charm++ applications:
      • Instrumentation at runtime
      • Enables dynamic mapping for changing communication graphs
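Whichever way the data is collected, the end product is a weighted graph. A small sketch, assuming the profiler or runtime hands back `(source, destination, bytes)` records; that record format and the function name `build_comm_graph` are hypothetical, not the output format of any particular tool.

```python
from collections import defaultdict


def build_comm_graph(records):
    """Fold per-message records into undirected edge weights (total bytes)."""
    graph = defaultdict(int)
    for src, dst, nbytes in records:
        if src == dst:
            continue  # self-messages never touch the network
        edge = (min(src, dst), max(src, dst))  # treat a->b and b->a as one edge
        graph[edge] += nbytes
    return dict(graph)


records = [(0, 1, 100), (1, 0, 50), (2, 2, 10), (1, 2, 8)]
print(build_comm_graph(records))
```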

slide-11
SLIDE 11

Pattern Matching

  • Pattern matching to identify simple communication patterns such as 2D/3D near-neighbor graphs

[Figure: communication pattern; axis label: Processors]
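One way such a check could look for the 2D case: try every factorization of the node count and test whether each node talks to exactly its grid neighbors. This is an illustrative sketch, not the matcher used in this work; it assumes row-major numbering of nodes, no wraparound, and an adjacency given as a dict of sets. The function names are made up for the example.

```python
def grid_neighbors(index, rows, cols):
    """Neighbors of `index` in a rows x cols grid numbered in row-major order."""
    r, c = divmod(index, cols)
    result = set()
    if r > 0:
        result.add(index - cols)
    if r < rows - 1:
        result.add(index + cols)
    if c > 0:
        result.add(index - 1)
    if c < cols - 1:
        result.add(index + 1)
    return result


def detect_2d_stencil(neighbors):
    """Return (rows, cols) if the graph is a 2D near-neighbor grid, else None."""
    n = len(neighbors)
    for rows in range(2, n // 2 + 1):
        if n % rows:
            continue
        cols = n // rows
        if all(neighbors[i] == grid_neighbors(i, rows, cols) for i in range(n)):
            return (rows, cols)
    return None


stencil = {i: grid_neighbors(i, 3, 4) for i in range(12)}
print(detect_2d_stencil(stencil))
```

A graph that fails this test (and its 3D analogue) would be handed to the irregular-graph heuristics instead.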

slide-12
SLIDE 12

Communication Graphs

  • Regular communication:
      • POP (Parallel Ocean Program): 2D stencil-like computation
      • WRF (Weather Research and Forecasting model): 2D stencil
      • MILC (MIMD Lattice Computation): 4D near-neighbor
  • Irregular communication:
      • Unstructured mesh computations: FLASH, CPSD code
      • Many other classes of applications

slide-13
SLIDE 13

Mapping Regular Graphs

  • Maximum Overlap (MXOVLP)
  • Expand from Corner (EXCO)
  • Affine Mapping (AFFN)

[Figure: object graph (7 x 4) mapped to processor graph (4 x 7)]

slide-22
SLIDE 22

Example Mapping

[Figure: object graph (6 x 11) mapped to processor graph (11 x 6)]

  • R. Aleliunas and A. L. Rosenberg. On Embedding Rectangular Grids in Square Grids. IEEE Trans. Comput., 31(9):907–913, 1982.

slide-31
SLIDE 31

Different mapping solutions

[Figure: object graph of 14 x 6 mapped to processor graph of 7 x 12]

Algorithms in order: MXOVLP, MXOV+AL, EXCO, COCE, AFFN, STEP

slide-32
SLIDE 32

Evaluation Metric: Hop-bytes

  • Weighted sum of message sizes where the weights are the number of links traversed by each message
  • Indicator of the communication traffic and hence contention on the network
  • Previously used metric: maximum dilation

$HB = \sum_{i=1}^{n} d_i \times b_i$, where $d_i$ = distance (links traversed by message $i$), $b_i$ = bytes in message $i$, $n$ = no. of messages
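The metric is a one-line sum once a placement is fixed. A sketch under stated assumptions: messages are `(source, destination, bytes)` tuples, the placement maps each object to mesh coordinates, and distance defaults to Manhattan distance on a mesh without wraparound. All names here are illustrative.

```python
def manhattan_hops(a, b):
    """Links traversed between two coordinates on a mesh (no wraparound)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hop_bytes(messages, placement, hops=manhattan_hops):
    """Sum over all messages of (links traversed) x (bytes sent)."""
    return sum(hops(placement[src], placement[dst]) * nbytes
               for src, dst, nbytes in messages)


placement = {"a": (0, 0), "b": (0, 1), "c": (2, 2)}
messages = [("a", "b", 100), ("a", "c", 10)]
print(hop_bytes(messages, placement))  # 1*100 + 4*10 = 140
```

Passing a torus-aware distance function as `hops` gives the same metric for machines with wraparound links.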

slide-33
SLIDE 33

Evaluation

[Plot: hops per processor for different mapping configurations (14x6 to 7x12, 16x16 to 8x32, 27x35 to 45x21); series: MXOVLP, MXOV+AL, EXCO, COCE, AFFN, STEP, Lower Bound]

slide-34
SLIDE 34

Results: WRF

  • Performance improvement negligible on 256 and 512 cores
  • On 1024 cores:
      • Hops reduce by 64%
      • Time for communication reduces by 45%
      • Performance improves by 17%

[Plot: average hops per byte per core vs. number of nodes (256 to 2048); series: Default, Topology, Lower Bound]

slide-35
SLIDE 35

Mapping Irregular Graphs

  • Object graph: 90 nodes
  • Processor mesh: 10 x 9

[Figure: irregular object graph and processor mesh]

slide-36
SLIDE 36

Two different scenarios

  • There is no spatial information associated with the node
      • Option 1: Work without it
      • Option 2: If we know that the simulation has a geometric configuration, try to guess the structure of the graph
  • We have geometric coordinate information for each node
      • Use coordinate information to avoid crossing of edges and for other optimizations

slide-37
SLIDE 37

No coordinate information

  • Breadth first traversal (BFT)
      • Start with a random node and one end of the processor mesh
      • Map nodes as you encounter them around the centroid of their mapped neighbors
  • Max heap traversal (MHT)
      • Start with a random node and one end/center of the mesh
      • Put neighbors of a mapped node into the heap (node at the top is the one with maximum mapped neighbors)
      • Map elements in the heap one by one around the centroid of their mapped neighbors
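A compact sketch of the BFT idea on a 2D mesh, under several assumptions that are not from the slides: the graph is connected and undirected, it has no more nodes than the mesh has processors, "around the centroid" means the nearest free processor by Manhattan distance, and the start node is passed in rather than drawn at random so the result is reproducible. MHT would differ mainly in visiting nodes from a max-heap keyed on the number of mapped neighbors instead of a FIFO queue.

```python
from collections import deque


def nearest_free(free, target):
    """Free processor closest to `target`; ties broken by coordinate order."""
    return min(free, key=lambda p: (abs(p[0] - target[0]) + abs(p[1] - target[1]), p))


def bft_map(adjacency, rows, cols, start):
    """Breadth-first placement of graph nodes onto a rows x cols mesh."""
    free = {(r, c) for r in range(rows) for c in range(cols)}
    placement = {start: (0, 0)}  # begin at one end (corner) of the mesh
    free.remove((0, 0))
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in sorted(adjacency[node]):
            if nbr in placement:
                continue
            # Centroid of the neighbors of `nbr` that are already placed.
            mapped = [placement[m] for m in adjacency[nbr] if m in placement]
            centroid = (sum(p[0] for p in mapped) / len(mapped),
                        sum(p[1] for p in mapped) / len(mapped))
            spot = nearest_free(free, centroid)
            placement[nbr] = spot
            free.remove(spot)
            queue.append(nbr)
    return placement


square = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
print(bft_map(square, 2, 2, start=0))
```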

slide-38
SLIDE 38

Mapping visualization

[Figure: mapping visualizations for BFT and MHT]

slide-39
SLIDE 39

With coordinate information

  • Affine Mapping (AFFN)
      • Stretch/shrink the object graph (based on coordinates of nodes) to map it on to the processor grid
      • In case of conflicts for the same processor, spiral around that processor
  • Corners to Center (COCE)
      • Use four corners of the object graph based on coordinates
      • Start mapping simultaneously from all sides
          • Either a simple BFT-type scheme
          • Or an MHT-style heuristic
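The AFFN steps can be sketched as a linear rescale followed by a local search. The details below are assumptions for illustration only: coordinates are 2D, scaling is min-max per axis with rounding to the nearest processor, and "spiral around" is approximated by scanning rings of growing radius around the target until a free processor is found.

```python
def ring(center, radius, rows, cols):
    """In-mesh cells at Chebyshev distance `radius` from `center`, in scan order."""
    cr, cc = center
    cells = []
    for r in range(cr - radius, cr + radius + 1):
        for c in range(cc - radius, cc + radius + 1):
            on_ring = max(abs(r - cr), abs(c - cc)) == radius
            if on_ring and 0 <= r < rows and 0 <= c < cols:
                cells.append((r, c))
    return cells


def affine_map(coords, rows, cols):
    """Stretch/shrink object coordinates onto a rows x cols processor grid."""
    xs = [x for x, _ in coords.values()]
    ys = [y for _, y in coords.values()]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1  # avoid dividing by zero on flat inputs
    span_y = (max(ys) - min_y) or 1
    taken, placement = set(), {}
    for obj in sorted(coords):
        x, y = coords[obj]
        target = (round((x - min_x) / span_x * (rows - 1)),
                  round((y - min_y) / span_y * (cols - 1)))
        # On a conflict, widen the search ring by ring around the target.
        for radius in range(max(rows, cols)):
            spot = next((p for p in ring(target, radius, rows, cols)
                         if p not in taken), None)
            if spot is not None:
                break
        else:
            raise ValueError("more objects than processors")
        placement[obj] = spot
        taken.add(spot)
    return placement


print(affine_map({"a": (0, 0), "b": (0, 0), "c": (5, 5)}, 2, 2))
```

Objects "a" and "b" compete for the same processor here, so "b" is pushed to an adjacent free one while "c" keeps its scaled position.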

slide-40
SLIDE 40

Mapping visualization

[Figure: mapping visualizations for AFFN and COCE]

slide-41
SLIDE 41

Results: simple2D

[Plot: hop-bytes for 90, 256 and 1024 nodes; series: Default, BFT, MHT, AFFN, COCE, Lower bound]

slide-42
SLIDE 42

Completely Distributed Mapping

  • Problem (in the context of Charm++):
      • n objects to be placed on p processors (n much greater than p)
      • Computational loads of objects are distributed
      • Each object should make its decision by itself
  • Start with simple cases:
      • 1D ring communication
      • 2D stencil communication

slide-43
SLIDE 43

Distributed strategies

  • 1D ring to a line:
      • Perform a parallel prefix sum between chares and send total load to all objects (chares)
      • Each chare now decides which processor it should be on
  • 2D stencil to a 2D mesh:
      • Linearize using Hilbert ordering
          • Perform 1D parallel prefix
      • Or perform a parallel prefix in 2D (on all rows and columns)
          • Gives (x, y) coordinates for processor on which the node should go
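The 1D case can be emulated sequentially to show what each chare computes. In this sketch the prefix sum is done with `itertools.accumulate` rather than in parallel, and the decision rule (place a chare where the midpoint of its load interval falls) is an assumption chosen for illustration; the slides do not specify the rule. Loads are assumed positive.

```python
from itertools import accumulate


def ring_to_line(loads, num_procs):
    """Processor index for each chare, given only its prefix sum and the total."""
    total = sum(loads)                   # broadcast to every chare
    inclusive = list(accumulate(loads))  # what the parallel prefix would deliver
    share = total / num_procs            # ideal load per processor
    assignment = []
    for load, end in zip(loads, inclusive):
        midpoint = end - load / 2        # center of this chare's load interval
        assignment.append(min(int(midpoint / share), num_procs - 1))
    return assignment


print(ring_to_line([1, 1, 1, 1, 1, 1, 1, 1], 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Because each decision depends only on a chare's own prefix value and the total, no central coordinator is needed, and neighboring chares on the ring land on the same or adjacent processors.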

slide-44
SLIDE 44

Summary and Future Work

  • Developing an automatic mapping framework
  • Topology discovery: Topology Manager API
  • Pattern matching
  • Regular graphs
  • Irregular graphs
  • Suite of heuristics for mapping
  • Completely distributed strategies
  • Topology aware hierarchical load balancers (NAMD)
