

  1. Topology-Aware GPU Selection on Multi-GPU Nodes. Iman Faraji, Seyed H. Mirsadeghi, and Ahmad Afsahi. Department of Electrical and Computer Engineering, Parallel Processing Research Laboratory, Queen's University, Canada. The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 23, 2016

  2. Outline • Introduction • Background and Motivation • Design • Results • Conclusion • Future Work

  3. Outline • Introduction • Background and Motivation • Design • Results • Conclusion • Future Work

  4. Introduction • GPU accelerators have successfully established themselves in modern HPC clusters – High performance – Energy efficiency • Demand for higher GPU computational power and memory – Multi-GPU nodes in state-of-the-art HPC clusters

  5. Introduction • Clusters with multi-GPU nodes provide: – Higher computational power – More memory to hold larger datasets • However, this brings up a challenge… More GPUs → potentially higher GPU-to-GPU communications, the "Achilles heel" of GPU-accelerated application performance!

  6. Introduction • To address the GPU communication bottleneck:
     – Increase GPU utilization at the application level
       ◦ Reducing the share of GPU communications in application runtime
       ◦ Not all applications can highly utilize the GPUs in a node
     – Asynchronously progress inter-process GPU communications and GPU computation
       ◦ Overlapping GPU communication with computation
       ◦ Highly overlapping GPU communication and computation is not always feasible
     – Leverage GPU hardware features (such as IPC)
       ◦ Improving GPU-to-GPU communication performance
       ◦ Only possible for specific GPU pairs within a node
       ◦ Communication performance still limited by the latency and bandwidth capacity
     – HOWEVER…
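
A minimal sketch of the IPC feature mentioned above (illustrative, not taken from the slides): two MPI ranks on the same node, one GPU each; rank 0 exports a device buffer with cudaIpcGetMemHandle, rank 1 opens it with cudaIpcOpenMemHandle and copies from it directly on the device, avoiding a staging copy through host memory. Buffer size and device selection are assumptions, and error checking is omitted.

    /* ipc_sketch.cu: run with exactly two MPI ranks on one multi-GPU node. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(rank);                     /* assumes ranks 0/1 use GPUs 0/1 */

        const size_t nbytes = 1 << 20;           /* 1 MiB test buffer */
        if (rank == 0) {
            void *src;
            cudaMalloc(&src, nbytes);
            cudaMemset(src, 7, nbytes);
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, src);   /* export the device allocation */
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);         /* keep src alive until rank 1 is done */
            cudaFree(src);
        } else {
            cudaIpcMemHandle_t handle;
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            void *remote, *dst;
            cudaIpcOpenMemHandle(&remote, handle, cudaIpcMemLazyEnablePeerAccess);
            cudaMalloc(&dst, nbytes);
            cudaMemcpy(dst, remote, nbytes, cudaMemcpyDeviceToDevice); /* direct GPU-to-GPU copy */
            cudaIpcCloseMemHandle(remote);
            cudaFree(dst);
            MPI_Barrier(MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

Whether this copy stays on the PCIe fabric (peer-to-peer) or bounces through host memory depends on where the two GPUs sit in the node topology, which is exactly the concern of the following slides.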

  7. Introduction • Smartly designed applications will continue to use these features • GPU communications can still become a weak point in different applications and GPU nodes → Conduct GPU communications as efficiently as possible. HOW?

  8. Outline • Introduction • Background and Motivation • Design • Results • Conclusion • Future Work

  9. Background and Motivation • Multi-GPU node architecture (PCIe tree of 16 GPUs, numbered 0 to 15, on the Helios-K80 cluster at Université Laval's computing centre)
     – Level 0: Path between GPU pairs traverses a PCIe internal switch
     – Level 1: Path between GPU pairs traverses multiple internal switches
     – Level 2: Path between GPU pairs traverses a PCIe host bridge
     – Level 3: Path traverses a socket-level link (e.g., QPI)
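
As an illustration of the level hierarchy above (an addition, not part of the slides), the NVML library can report the topological common ancestor of each GPU pair; mapping NVML's ancestor levels onto Levels 0 to 3 as in the comment below is an assumption that fits this kind of PCIe tree.

    /* topo_sketch.c: print the common-ancestor level for every GPU pair (link with -lnvidia-ml). */
    #include <nvml.h>
    #include <stdio.h>

    int main(void) {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        unsigned int count = 0;
        nvmlDeviceGetCount(&count);
        for (unsigned int i = 0; i < count; ++i) {
            for (unsigned int j = i + 1; j < count; ++j) {
                nvmlDevice_t a, b;
                nvmlDeviceGetHandleByIndex(i, &a);
                nvmlDeviceGetHandleByIndex(j, &b);
                nvmlGpuTopologyLevel_t level;
                nvmlDeviceGetTopologyCommonAncestor(a, b, &level);
                /* Roughly: INTERNAL/SINGLE ~ Level 0, MULTIPLE ~ Level 1,
                   HOSTBRIDGE ~ Level 2, NODE/SYSTEM ~ Level 3 (e.g., QPI). */
                printf("GPU %u <-> GPU %u : NVML ancestor level %d\n", i, j, (int)level);
            }
        }
        nvmlShutdown();
        return 0;
    }

The nvidia-smi topo -m command prints a similar pairwise matrix without any code.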

  10. Background and Motivation • Multi-GPU node bandwidth

  11. Background and Motivation • Multi-GPU node latency

  12. Outline • Introduction • Background and Motivation • Design • Results • Conclusion • Future Work

  13. Design • What we know: – Intranode GPU-to-GPU communications may traverse different paths – Different paths can have different latency and bandwidth • Ultimate goal: – Efficient utilization of GPU communication channels • Intensive communications carried over stronger channels • Our proposal: – Topology-aware GPU selection • Intelligent assignment of intranode GPUs to MPI processes so as to maximize communication performance
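
A minimal sketch of what the proposal means for an MPI process at startup, assuming a precomputed topology-aware mapping table indexed by the intranode (local) rank; the table contents below are placeholders, not a result from the paper.

    /* select_gpu_sketch.c: pick a GPU from a mapping table instead of "local rank == GPU index". */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Intranode (local) rank via a shared-memory sub-communicator (MPI-3). */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int local_rank;
        MPI_Comm_rank(node_comm, &local_rank);

        /* Hypothetical topology-aware table: local rank -> GPU index. */
        const int gpu_map[] = {0, 1, 4, 5, 2, 3, 6, 7};
        const int map_len = (int)(sizeof(gpu_map) / sizeof(gpu_map[0]));
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);
        int gpu = gpu_map[local_rank % map_len] % ngpus;  /* clamp for smaller nodes */
        cudaSetDevice(gpu);
        printf("local rank %d -> GPU %d\n", local_rank, gpu);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }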

  14. Design • Our approach:
     1. Extracting the GPU communication pattern
     2. Extracting the physical characteristics of the node
     3. Modeling topology-aware GPU selection as a graph mapping problem (mapping the GPU virtual topology onto the GPU physical topology)
     4. Solving the problem using a mapping algorithm, which produces a GPU mapping table
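
One common way to write step 3 as an objective (our notation, not taken from the slides): let w_ij be the communication volume between processes i and j, let c(g, h) be the cost of the channel between GPUs g and h (derived from the measured per-level latency and bandwidth), and let sigma assign processes to GPUs. The mapping step then looks for

    \min_{\sigma} \sum_{i<j} w_{ij} \, c\bigl(\sigma(i), \sigma(j)\bigr)

so that the heaviest-communicating process pairs land on the lowest-cost (Level 0) GPU pairs.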

  15. Design • Our approach (annotated):
     1. Extracting the GPU communication pattern: the Open MPI library is instrumented to collect inter-process GPU communication
     2. Extracting the physical characteristics of the node: metrics are latency, bandwidth, and distance
     3. Modeling topology-aware GPU selection as a graph mapping problem: the GPU virtual and physical topologies are expressed through the SCOTCH graph API
     4. Solving the problem using a mapping algorithm: the SCOTCH mapping algorithm produces the GPU mapping table
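
The slides solve the mapping problem with SCOTCH; as a stand-in that shows the shape of the computation (not the authors' algorithm and not the SCOTCH API), a simple greedy heuristic places the heaviest-communicating process pairs on the cheapest GPU pairs first. The traffic and cost matrices below are illustrative.

    /* greedy_map_sketch.c: toy stand-in for the mapping step. */
    #include <stdio.h>

    #define N 4  /* processes == GPUs in this toy example */

    int main(void) {
        /* traffic[i][j] = volume exchanged by processes i and j;
           cost[g][h] = relative cost of the channel between GPUs g and h
           (e.g., 1 behind a shared PCIe switch, 4 across the host bridge or QPI). */
        int traffic[N][N] = {{0, 9, 1, 1}, {9, 0, 1, 1}, {1, 1, 0, 8}, {1, 1, 8, 0}};
        int cost[N][N]    = {{0, 1, 4, 4}, {1, 0, 4, 4}, {4, 4, 0, 1}, {4, 4, 1, 0}};

        int map[N];       /* process -> GPU, -1 while unassigned */
        int gpu_used[N];
        for (int i = 0; i < N; ++i) { map[i] = -1; gpu_used[i] = 0; }

        for (;;) {
            /* Pick the heaviest edge between two still-unmapped processes. */
            int bi = -1, bj = -1;
            for (int i = 0; i < N; ++i)
                for (int j = i + 1; j < N; ++j)
                    if (map[i] < 0 && map[j] < 0 &&
                        (bi < 0 || traffic[i][j] > traffic[bi][bj])) { bi = i; bj = j; }
            if (bi < 0) break;  /* fewer than two unmapped processes left */

            /* Place that pair on the cheapest pair of free GPUs. */
            int bg = -1, bh = -1;
            for (int g = 0; g < N; ++g)
                for (int h = 0; h < N; ++h)
                    if (g != h && !gpu_used[g] && !gpu_used[h] &&
                        (bg < 0 || cost[g][h] < cost[bg][bh])) { bg = g; bh = h; }
            map[bi] = bg; map[bj] = bh;
            gpu_used[bg] = gpu_used[bh] = 1;
        }
        /* Any leftover single process takes any free GPU. */
        for (int i = 0; i < N; ++i)
            if (map[i] < 0)
                for (int g = 0; g < N; ++g)
                    if (!gpu_used[g]) { map[i] = g; gpu_used[g] = 1; break; }

        for (int i = 0; i < N; ++i)
            printf("process %d -> GPU %d\n", i, map[i]);
        return 0;
    }

A real mapper such as SCOTCH searches this space far more carefully, but the inputs (communication pattern and channel costs) and the output (a process-to-GPU table) are the same.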

  16. Outline • Introduction • Background and Motivation • Design • Results • Conclusion • Future Work

  17. Results: Setup • One node of the Helios cluster from Calcul Quebec – 16 GPUs (K80) – Two 12-core Intel Xeon 2.7 GHz CPUs • 4 micro-benchmarks – 5-point 2D stencil – 5-point 2D torus – 7-point 3D torus – 5-point 4D hypercube • One application: HOOMD-Blue (new result)

  18. Results: Micro-benchmarks • Runtime improvement of topology-aware mappings over the default mapping on non-weighted micro-benchmarks

  19. Results: Micro-benchmarks • Runtime improvement of topology-aware mappings over the default mapping on weighted micro-benchmarks

  20. Results: Application (NEW!) • Runtime of the HOOMD-Blue application with the LJ-512K particle size using the default and topology-aware mappings • 12.8% and 15.7% improvement over the default mapping

  21. Conclusion • Discussed the GPU inter-process communication bottleneck – Reviewed some potential solutions to mitigate its effect • Showed an example of a multi-GPU node and its communication channels • Showed the different levels of bandwidth and latency in a multi-GPU node • Proposed a topology-aware GPU selection approach – More efficient utilization of GPU-to-GPU communication channels – Performance improvement by mapping intensive communications onto stronger channels

  22. Conclusion Topology awareness matters for GPU communications and can provide considerable performance improvements.

  23. Future Work • Evaluation on different multi-GPU nodes with different node architectures and GPUs. • Impact on different applications. • Extension towards multiple nodes across the cluster.

  24. Acknowledgments

  25. Thank you for your attention! Contacts: • Iman Faraji : i.faraji@queensu.ca • Seyed H. Mirsadeghi : s.mirsadeghi@queensu.ca • Ahmad Afsahi : ahmad.afsahi@queensu.ca Questions?

  26. Backup: Motivation – Helios-K20 and Helios-K80
