Automating Topology Aware Mapping for Supercomputers


  1. Automating Topology Aware Mapping for Supercomputers. Abhinav Bhatele, Gagan Gupta, Laxmikant V. Kale

  2. Application Topologies • [Figure: communication structure among Patch, Compute and Proxy objects]

  3. Interconnect Topologies • Three-dimensional meshes • 3D Torus: Blue Gene/L, Blue Gene/P, Cray XT4/5 • Trees • Fat-trees (InfiniBand) and Clos networks (Federation) • Dense graphs • Kautz graph (SiCortex), hypercubes • Future topologies? • Blue Waters, Blue Gene/Q

  4. The Mapping Problem • Applications have a communication topology and processors have an interconnect topology • Definition: given a set of communicating parallel “entities”, map them onto physical processors to optimize communication • Goals: • Balance computational load • Minimize communication traffic and hence contention

  5. Scope of this work • Currently we are focused on 3D mesh/torus machines • For certain classes of applications • [Figure: applications classified along two axes: computation-bound vs. communication-bound, and latency-tolerant vs. latency-sensitive]

  6. Application-specific mapping: OpenAtom • [Figure: time per step (s) vs. number of cores (512 to 8192) for default and topology-aware mapping] • A. Bhatele, E. Bohm, and L. V. Kale. A Case Study of Communication Optimizations on 3D Mesh Interconnects. In Euro-Par, LNCS 5704, pages 1015–1028, 2009. Distinguished Paper Award. • A. Bhatele, L. V. Kale and S. Kumar. Dynamic Topology Aware Load Balancing Algorithms for Molecular Dynamics Applications. In 23rd ACM International Conference on Supercomputing (ICS), 2009.

  7. Application-specific mapping: OpenAtom and NAMD • [Figure, left: OpenAtom time per step (s) vs. number of cores (512 to 8192) for default and topology-aware mapping] • [Figure, right: NAMD time per step (ms) vs. number of cores (512 to 16384) for Topology Oblivious, TopoAware Patches and TopoAware LDBs, with a placement diagram showing Patch 1, Patch 2, an inner brick and an outer brick] • A. Bhatele, E. Bohm, and L. V. Kale. A Case Study of Communication Optimizations on 3D Mesh Interconnects. In Euro-Par, LNCS 5704, pages 1015–1028, 2009. Distinguished Paper Award. • A. Bhatele, L. V. Kale and S. Kumar. Dynamic Topology Aware Load Balancing Algorithms for Molecular Dynamics Applications. In 23rd ACM International Conference on Supercomputing (ICS), 2009.

  8. Automatic Mapping • Obtaining the processor topology and the application communication graph • Pattern matching to identify regular patterns • 2D/3D near-neighbor communication • A suite of heuristics: the right strategy is invoked depending on the communication scenario: • Regular communication • Irregular communication • A rough sketch of this selection step follows below
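To make the strategy-selection step concrete, here is a minimal C++ sketch of dispatching to a regular or an irregular mapping routine based on whether the communication graph was recognized as a regular pattern. All names (CommGraph, RegularMapper, chooseStrategy, ...) are hypothetical; this is not the framework's actual code, only an illustration of the idea.

```cpp
// Hypothetical sketch of strategy selection in an automatic mapping framework.
// None of these names come from the actual system described in the slides.
#include <cstdio>
#include <memory>
#include <vector>

// A communication graph: g[i][j] = bytes sent from object i to object j.
using CommGraph = std::vector<std::vector<long>>;

struct MappingStrategy {
    virtual ~MappingStrategy() = default;
    // Returns a placement: object i goes on processor map[i].
    virtual std::vector<int> map(const CommGraph& g) = 0;
};

struct RegularMapper : MappingStrategy {      // e.g. the MXOVLP / EXCO / AFFN family
    std::vector<int> map(const CommGraph& g) override {
        std::vector<int> m(g.size());
        for (size_t i = 0; i < m.size(); ++i) m[i] = (int)i;  // placeholder identity map
        return m;
    }
};

struct IrregularMapper : MappingStrategy {    // e.g. a graph-partitioning-based strategy
    std::vector<int> map(const CommGraph& g) override {
        std::vector<int> m(g.size());
        for (size_t i = 0; i < m.size(); ++i) m[i] = (int)i;  // placeholder identity map
        return m;
    }
};

// Stub: a real system would run pattern matching here (see the Pattern Matching
// sketch further below).
bool looksLikeRegularStencil(const CommGraph& /*g*/) { return true; }

std::unique_ptr<MappingStrategy> chooseStrategy(const CommGraph& g) {
    if (looksLikeRegularStencil(g))
        return std::make_unique<RegularMapper>();
    return std::make_unique<IrregularMapper>();
}

int main() {
    CommGraph g(4, std::vector<long>(4, 0));
    auto mapping = chooseStrategy(g)->map(g);
    std::printf("object 0 -> processor %d\n", mapping[0]);
}
```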

  9. Topology Discovery • Topology Manager API: for 3D interconnects (Blue Gene, XT) • Information required for mapping: • Physical dimensions of the allocated job partition • Mapping of ranks to physical coordinates and vice versa • On Blue Gene machines such information is available and the API is a wrapper • On Cray XT machines, one has to jump through several hoops to get this information and make it available through the same API • http://charm.cs.uiuc.edu/~bhatele/phd/TopoMgrAPI.tar.gz • A simplified sketch of such an interface follows below
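The slide refers to the Topology Manager API; the sketch below is only a simplified, hypothetical stand-in for that kind of interface, with made-up method names and a hard-coded 2 x 2 x 2 partition, to show what "partition dimensions plus rank-to-coordinate mapping" looks like in code.

```cpp
// Hypothetical, simplified stand-in for a topology-discovery API such as the
// TopoManager referenced above. Method names and the fixed 2x2x2 partition are
// assumptions for illustration only.
#include <cstdio>

class SimpleTopoInfo {
public:
    // Physical dimensions of the allocated 3D partition.
    int dimX() const { return 2; }
    int dimY() const { return 2; }
    int dimZ() const { return 2; }

    // Rank -> physical (x, y, z) coordinates, assuming ranks are laid out
    // in row-major order over the partition.
    void rankToCoords(int rank, int& x, int& y, int& z) const {
        z = rank % dimZ();
        y = (rank / dimZ()) % dimY();
        x = rank / (dimZ() * dimY());
    }

    // Physical coordinates -> rank (inverse of the above).
    int coordsToRank(int x, int y, int z) const {
        return (x * dimY() + y) * dimZ() + z;
    }
};

int main() {
    SimpleTopoInfo topo;
    int x, y, z;
    topo.rankToCoords(5, x, y, z);
    std::printf("rank 5 sits at (%d, %d, %d)\n", x, y, z);
    std::printf("round-trip rank: %d\n", topo.coordsToRank(x, y, z));
}
```

On a real machine the dimensions and coordinates would of course come from the system, not from constants; the point is only the shape of the queries a mapping strategy needs.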

  10. Application communication graph • Several ways to obtain the graph • MPI applications: • A graph obtained from one run can only be used in a subsequent run • Profiling tools (IBM’s HPCT tools) • Charm++ applications: • Instrumentation at runtime • Enables dynamic mapping for changing communication graphs • A minimal instrumentation sketch follows below
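As an illustration of building a communication graph from runtime instrumentation, the sketch below accumulates per-pair byte counts from a list of message records. The MessageRecord layout and function name are assumptions, not the actual Charm++ or HPCT instrumentation interface.

```cpp
// Minimal sketch: build a communication graph (bytes exchanged between every
// pair of objects) from per-message records gathered at runtime.
#include <cstdio>
#include <vector>

struct MessageRecord {
    int src;      // sending object id
    int dst;      // receiving object id
    long bytes;   // message size
};

std::vector<std::vector<long>> buildCommGraph(int numObjects,
                                              const std::vector<MessageRecord>& log) {
    std::vector<std::vector<long>> graph(numObjects, std::vector<long>(numObjects, 0));
    for (const auto& m : log)
        graph[m.src][m.dst] += m.bytes;   // accumulate traffic per (src, dst) pair
    return graph;
}

int main() {
    std::vector<MessageRecord> log = {{0, 1, 1024}, {1, 0, 1024}, {0, 1, 512}};
    auto g = buildCommGraph(2, log);
    std::printf("bytes 0 -> 1: %ld\n", g[0][1]);   // prints 1536
}
```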

  11. Pattern Matching • We want to identify simple communication patterns, such as 2D/3D near-neighbor graphs • [Figure: communication matrix over processors 0 to 31 illustrating a near-neighbor pattern] • An illustrative check of this kind follows below
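A hedged sketch of the kind of test pattern matching could perform: given a communication graph and a candidate rows x cols decomposition, check that every object communicates only with its four grid neighbors (a 5-point 2D stencil). This is an illustrative simplification, not the actual heuristic used in the work.

```cpp
// Sketch: test whether a communication graph looks like a 2D near-neighbor
// (5-point stencil) pattern for a given rows x cols decomposition.
#include <cstdio>
#include <cstdlib>
#include <vector>

bool is2DNearNeighbor(const std::vector<std::vector<long>>& g, int rows, int cols) {
    int n = rows * cols;
    if ((int)g.size() != n) return false;
    for (int i = 0; i < n; ++i) {
        int r = i / cols, c = i % cols;
        for (int j = 0; j < n; ++j) {
            if (i == j || g[i][j] == 0) continue;
            int dr = std::abs(r - j / cols), dc = std::abs(c - j % cols);
            if (dr + dc != 1) return false;   // traffic to a non-neighbor: not a 2D stencil
        }
    }
    return true;
}

int main() {
    // 2 x 2 grid where each object talks only to its horizontal/vertical neighbors.
    std::vector<std::vector<long>> g = {
        {0, 1, 1, 0},
        {1, 0, 0, 1},
        {1, 0, 0, 1},
        {0, 1, 1, 0}};
    std::printf("matches 2x2 stencil: %s\n", is2DNearNeighbor(g, 2, 2) ? "yes" : "no");
}
```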

  12. Communication Graphs • Regular communication: • POP (Parallel Ocean Program): 2D stencil-like computation • WRF (Weather Research and Forecasting model): 2D stencil • MILC (MIMD Lattice Computation): 4D near-neighbor • Irregular communication: • Unstructured mesh computations: FLASH, CPSD code • Many other classes of applications

  13. Mapping Regular Graphs (example: object graph of 7 x 4 onto a processor graph of 4 x 7) • Maximum Overlap (MXOVLP) • Expand from Corner (EXCO) • Affine Mapping (AFFN) • An illustrative AFFN sketch follows below

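Of the heuristics above, Affine Mapping (AFFN) is the easiest to illustrate: scale each object coordinate by the ratio of the processor-grid to object-grid dimensions. The sketch below is an assumed, simplified version of that idea, not the paper's exact algorithm; in particular it ignores collision resolution (two objects can land on the same processor), which a real strategy has to handle.

```cpp
// Illustrative affine mapping (AFFN-style): place object (ox, oy) from an
// objX x objY grid onto processor (px, py) of a procX x procY grid by scaling
// each coordinate. Collisions are deliberately not resolved in this sketch.
#include <cstdio>

void affineMap(int ox, int oy, int objX, int objY, int procX, int procY,
               int& px, int& py) {
    px = ox * procX / objX;   // scale x coordinate to the processor grid
    py = oy * procY / objY;   // scale y coordinate to the processor grid
}

int main() {
    // Object graph 7 x 4 onto processor graph 4 x 7 (the case on the slide above).
    for (int ox = 0; ox < 7; ++ox)
        for (int oy = 0; oy < 4; ++oy) {
            int px, py;
            affineMap(ox, oy, 7, 4, 4, 7, px, py);
            std::printf("object (%d,%d) -> processor (%d,%d)\n", ox, oy, px, py);
        }
}
```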

  22. Example Mapping • Object graph: 6 x 11, processor graph: 11 x 6 • Aleliunas, R. and Rosenberg, A. L. On Embedding Rectangular Grids in Square Grids. IEEE Trans. Comput., 31(9):907–913, 1982


  31. Different mapping solutions • Object graph of 14 x 6 mapped to a processor graph of 7 x 12 • Algorithms in order: MXOVLP, MXOV+AL, EXCO, COCE, AFFN, STEP

  32. Evaluation Metric: Hop-bytes • Weighted sum of message sizes, where the weights are the number of links traversed by each message: hop-bytes = Σ (i = 1 to n) d_i × b_i, where d_i = number of links traversed by message i, b_i = size of message i in bytes, n = number of messages • Indicator of the communication traffic and hence contention on the network • Previously used metric: maximum dilation
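The hop-bytes metric is simple to evaluate once a mapping and the network dimensions are known. The sketch below computes Σ d_i × b_i on a 3D torus, assuming d_i is the shortest link distance (with wrap-around) between the endpoints of message i; the torus size and message list are illustrative only.

```cpp
// Sketch: compute hop-bytes for a set of messages on an X x Y x Z 3D torus,
// where hop-bytes = sum over messages of (link distance * bytes).
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Coord { int x, y, z; };

// Shortest distance along one torus dimension (may wrap around).
int torusDist(int a, int b, int dim) {
    int d = std::abs(a - b);
    return std::min(d, dim - d);
}

int hops(const Coord& a, const Coord& b, int X, int Y, int Z) {
    return torusDist(a.x, b.x, X) + torusDist(a.y, b.y, Y) + torusDist(a.z, b.z, Z);
}

struct Message { Coord src, dst; long bytes; };

long hopBytes(const std::vector<Message>& msgs, int X, int Y, int Z) {
    long total = 0;
    for (const auto& m : msgs)
        total += (long)hops(m.src, m.dst, X, Y, Z) * m.bytes;   // d_i * b_i
    return total;
}

int main() {
    std::vector<Message> msgs = {
        {{0, 0, 0}, {1, 0, 0}, 1024},   // 1 hop
        {{0, 0, 0}, {7, 0, 0}, 1024}};  // 1 hop via wrap-around on a ring of length 8
    std::printf("hop-bytes: %ld\n", hopBytes(msgs, 8, 8, 8));   // prints 2048
}
```

On a mesh without wrap-around links, torusDist would reduce to plain |a - b|.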

  33. Evaluation • [Figure: hops per processor for the MXOVLP, MXOV+AL, EXCO, COCE, AFFN and STEP strategies and the lower bound, for three mapping configurations: 14x6 to 7x12, 16x16 to 8x32, 27x35 to 45x21]

  34. Results: WRF • [Figure: average hops per byte per core for Default, Topology and Lower Bound on 256 to 2048 nodes] • Performance improvement negligible on 256 and 512 cores • On 1024 cores: • Hops reduce by 64% • Time for communication reduces by 45% • Performance improves by 17%
