
Hierarchy Aware Blocking and Nonblocking Collective Communications


  1. Hierarchy Aware Blocking and Nonblocking Collective Communications: The Effects of Shared Memory Communications in the Cray XT Environment. Richard L. Graham, Joshua S. Ladd, Manjunath Venkata. Managed by UT-Battelle for the Department of Energy. Graham_CAC_2010

  2. Acknowledgements • US Department of Energy FASTOS program

  3. Outline • Statement of the problem • Design Overview • Results • Next steps

  4. Problems being addressed • Optimization of collective operations • Implementation of extensible optimized collective operations • Implementation of nonblocking collective operations

  5. Why Optimize Collective Communications • Collective operations limit application scalability • Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved) • Optimized collectives involve a communicator-wide data-dependent communication pattern • Data needs to be manipulated at intermediate stages of a collective operation • Collective operations magnify the effects of system noise

  6. Scalability of Collective Operations [Figure: two panels, "Ideal Algorithm" and "Impact of System Noise"; axes: Process Rank vs. Time; legend: Work, Reduction, Communication, Use Result, Noise]
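The contrast between the "Ideal Algorithm" and "Impact of System Noise" panels can be illustrated with a toy model. This is a minimal sketch, not taken from the slides: it assumes a fully synchronizing collective whose completion is gated by the slowest rank, and the function name `collective_finish_time` and all numeric values are hypothetical.

```python
def collective_finish_time(work, noise):
    """Time at which a synchronizing collective can complete.

    Assumption: every rank must arrive before any rank can leave, so
    the finish time is set by the slowest rank (its work plus any
    noise that interrupted it).
    """
    return max(w + n for w, n in zip(work, noise))


# Four ranks with identical work (arbitrary time units, hypothetical).
work = [10.0, 10.0, 10.0, 10.0]

# Ideal case: no rank is interrupted.
ideal = collective_finish_time(work, [0.0, 0.0, 0.0, 0.0])

# Noisy case: a single rank is delayed by 3 units (hypothetical value).
noisy = collective_finish_time(work, [0.0, 3.0, 0.0, 0.0])

print(ideal, noisy)
```

In this model a delay on one rank delays all ranks, which is one way to read the claim that collective operations magnify the effects of system noise.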

  7. Scalability of Collective Operations - II [Figure: two panels, "Offloaded Algorithm" and "Nonblocking Algorithm"; axes: Process Rank vs. Time; legend: Work, Reduction, Communication, Use Result, Delegation Agent]

  8. Mapping the collectives onto the system • Consider communication hierarchies • Schedule the network

  9. Example – 4 Process Recursive Doubling [Diagram: processes 1 and 2 on Host 1, processes 3 and 4 on Host 2; exchanges shown for Step 1 and Step 2, with inter-host communication labeled]
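The recursive doubling pattern named on this slide can be sketched as a small simulation. This is an illustrative sketch rather than the authors' implementation: it assumes the textbook pairing rule (partner = rank XOR distance), a power-of-two process count, and 0-based ranks, whereas the slide numbers processes 1 to 4. The function names are hypothetical.

```python
def recursive_doubling_schedule(nprocs):
    """Return, for each step, the list of (rank, partner) exchange pairs.

    Assumes nprocs is a power of two. At each step a rank pairs with
    the rank whose index differs in one bit (partner = rank XOR distance),
    and the distance doubles from one step to the next.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    steps = []
    distance = 1
    while distance < nprocs:
        # List each pair once, from the lower rank's point of view.
        pairs = [(r, r ^ distance) for r in range(nprocs) if r < (r ^ distance)]
        steps.append(pairs)
        distance *= 2
    return steps


def allreduce_sum(values):
    """Simulate a recursive-doubling allreduce (sum) over per-rank values."""
    vals = list(values)
    for pairs in recursive_doubling_schedule(len(vals)):
        nxt = list(vals)
        for a, b in pairs:
            # Both partners exchange data and end up with the combined value.
            nxt[a] = nxt[b] = vals[a] + vals[b]
        vals = nxt
    return vals


print(recursive_doubling_schedule(4))
print(allreduce_sum([1, 2, 3, 4]))
```

With four processes the schedule has two steps. If ranks 0 and 1 share one host and ranks 2 and 3 share the other (an assumed placement), the first step stays on-host and the second step crosses hosts.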

  10. Example – 4 Process Recursive Doubling – On host optimization [Diagram: processes 1 and 2 on Host 1, processes 3 and 4 on Host 2; exchanges shown for Step 1, Step 2, and Step 3, with inter-host communication labeled]
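A three-step, host-aware variant can be sketched as follows. The slide does not spell out what each step does, so this sketch assumes a common arrangement: an on-host reduction to one leader per host, an exchange among the leaders only, then an on-host distribution of the result. The leader exchange is modeled as a plain all-to-all of partial sums, and all function names are hypothetical.

```python
def hierarchical_allreduce_sum(values, hosts):
    """Simulate a three-step, host-aware allreduce (sum).

    values[i] is rank i's contribution and hosts[i] is its host id.
    Returns (per-rank results, number of inter-host messages).
    """
    # Step 1 (assumed): on-host reduction to one leader per host.
    members = {}
    for rank, host in enumerate(hosts):
        members.setdefault(host, []).append(rank)
    partial = {h: sum(values[r] for r in ranks) for h, ranks in members.items()}

    # Step 2 (assumed): only leaders talk across hosts. Modeled as each
    # leader sending its partial sum to every other leader.
    nhosts = len(members)
    total = sum(partial.values())
    inter_host_messages = nhosts * (nhosts - 1)

    # Step 3 (assumed): each leader distributes the result on its host.
    return [total] * len(values), inter_host_messages


def flat_inter_host_messages(hosts):
    """Count host-crossing messages in flat recursive doubling.

    Assumes len(hosts) is a power of two and partner = rank XOR distance.
    """
    n = len(hosts)
    count, distance = 0, 1
    while distance < n:
        count += sum(1 for r in range(n) if hosts[r] != hosts[r ^ distance])
        distance *= 2
    return count


hosts = [0, 0, 1, 1]  # two processes per host, as on the slide
results, hier_msgs = hierarchical_allreduce_sum([1, 2, 3, 4], hosts)
print(results, hier_msgs, flat_inter_host_messages(hosts))
```

Under these assumptions the hierarchical version sends fewer messages across hosts than the flat one, at the cost of an extra on-host step.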

  11. Design strategy • Decouple – Hierarchy detection – Network specific collective algorithm implementation (“single” level) – Full collective function implementation (hierarchical) – Basic building blocks from MPI level functions • Share resources between levels without breaking the abstraction between layers
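The "hierarchy detection" item can be illustrated with a small grouping routine that is independent of any collective algorithm. This is a hypothetical sketch, not the implementation described in the talk: it assumes each rank's placement is already known as a (host, socket) pair and that the lowest rank in a group acts as its leader. The name `build_subgroups` is invented for illustration.

```python
from collections import defaultdict


def build_subgroups(placement):
    """Derive subgroups from rank placement.

    placement[rank] is a (host, socket) tuple. Returns three levels:
    ranks per socket, socket leaders per host, and host leaders.
    Assumption: the lowest rank in each group serves as its leader.
    """
    # Level 1: ranks that share a socket.
    socket_groups = defaultdict(list)
    for rank, (host, socket) in enumerate(placement):
        socket_groups[(host, socket)].append(rank)

    # Level 2: one leader per socket, grouped by host.
    host_groups = defaultdict(list)
    for (host, _socket), ranks in sorted(socket_groups.items()):
        host_groups[host].append(min(ranks))

    # Level 3: one leader per host, communicating over the network.
    network_group = sorted(min(leaders) for leaders in host_groups.values())
    return dict(socket_groups), dict(host_groups), network_group


# Two hosts, two sockets each, two ranks per socket (hypothetical layout).
placement = [("h1", 0), ("h1", 0), ("h1", 1), ("h1", 1),
             ("h2", 0), ("h2", 0), ("h2", 1), ("h2", 1)]
sockets, per_host, network = build_subgroups(placement)
print(sockets)
print(per_host)
print(network)
```

Keeping the grouping separate from the algorithms means a single-level collective can be run on any of the three subgroup lists without knowing how they were detected.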

  12. Collectives – Software Layers [Diagram: OMPI Modular Component Architecture with three frameworks: Collective Framework, Basic Collectives (bcol) Framework, and Subgroup Framework. Component labels shown: ML – Hierarchical Collectives Comp., Tuned (pt2pt) Collectives Comp., SM, NUMA, MUMA, IBNET, Pt2Pt, IB OFFLOAD, MLNX OFED]

  13. Benchmarks

  14. System setup • Jaguar – 2.6 GHz Istanbul processor – Dual socket – Hex-core • Smoky – 2.0 GHz Opteron – Quad socket – Quad core

  15. Barrier as a function of process count – Jaguar – 2-level hierarchy [Plot: latency of the barrier (usecs) vs. number of processes, 2 to 12; series: Shared Memory, pt-2-pt]

  16. Barrier as a function of process count – Smoky – 2-level hierarchy [Plot: latency of the barrier (usecs) vs. number of processes, 2 to 16; series: Shared Memory, pt-2-pt]

  17. Barrier as a function of number of sockets – Jaguar [Plot: latency of the barrier (usecs) for 2 and 4 processes; series: Processes on Same Socket, Processes on Different Sockets]

  18. Barrier as a function of number of sockets (1,2) – Smoky [Plot: latency of the barrier (usecs) for 2 and 4 processes; series: Processes on Same Socket, Processes on Different Sockets]

  19. Barrier as a function of number of sockets (1,4) – Smoky [Plot: latency of the barrier (usecs) for 4 processes; series: Message Traffic within Socket, Message Traffic between Sockets]

  20. Summary • Added hardware support for offloading collective operations • Developed MPI-level support for asynchronous collectives • Good barrier performance • Good overlap capabilities • Work is continuing
