 
Topology-Aware GPU Selection on Multi-GPU Nodes
Iman Faraji, Seyed H. Mirsadeghi, and Ahmad Afsahi
Department of Electrical and Computer Engineering, Parallel Processing Research Laboratory, Queen’s University, Canada
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 23, 2016
Outline
• Introduction
• Background and Motivation
• Design
• Results
• Conclusion
• Future Work
Parallel Processing Research Laboratory (PPRL), AsHES 2016
Introduction
• GPU accelerators have successfully established themselves in modern HPC clusters
  – High performance
  – Energy efficiency
• Demand for higher GPU computational power and memory
  – Multi-GPU nodes in state-of-the-art HPC clusters
Introduction
• Clusters with multi-GPU nodes provide:
  – Higher computational power
  – More memory to hold larger datasets
• However, this brings up a challenge…
  – More GPUs → potentially more GPU-to-GPU communication
  – The “Achilles heel” of GPU-accelerated application performance!
Introduction
• To address the GPU communication bottleneck:
  – Increase GPU utilization at the application level
    · Reducing the share of GPU communications in application runtime
    · Not all applications can highly utilize the GPUs in a node
  – Asynchronously progress inter-process GPU communications and GPU computation
    · Overlapping GPU communication with computation
    · Highly overlapping GPU communication and computation is not always feasible
  – Leverage GPU hardware features (such as IPC)
    · Improving GPU-to-GPU communication performance
    · Only possible for specific GPU pairs within a node
    · Communication performance still limited by the latency and bandwidth capacity
• However…
Introduction
• Smartly designed applications will continue to use these features
• GPU communications can still become a weak point in different applications and GPU nodes
• Conduct GPU communications as efficiently as possible. How?
Background and Motivation
• Multi-GPU node architecture
[Figure: tree of the 16 GPUs (GPU 0 to GPU 15) in a node of the Helios-K80 cluster at Université Laval's computing centre, with eight Level 0 labels, four Level 1 labels, two Level 2 labels, and one Level 3 label]
  · Level 0: Path between GPU pairs traverses a PCIe internal switch
  · Level 1: Path between GPU pairs traverses multiple internal switches
  · Level 2: Path between GPU pairs traverses a PCIe host bridge
  · Level 3: Path traverses a socket-level link (e.g., QPI)
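The four path levels can be captured in a small lookup. The sketch below assumes that GPU IDs are numbered left to right along the tree in the figure, so that pairs such as (0, 1) share a Level 0 switch, groups of four meet at Level 1, groups of eight at Level 2, and the two halves of the node at Level 3. The name `path_level` is hypothetical, and a real node should be checked against its actual PCIe topology (for example with `nvidia-smi topo -m`).

```python
# Assumed layout: a balanced binary tree over GPU IDs 0..15, numbered left to right.
LEVEL_DESCRIPTION = {
    0: "single PCIe internal switch",
    1: "multiple PCIe internal switches",
    2: "PCIe host bridge",
    3: "socket-level link (e.g., QPI)",
}


def path_level(gpu_a: int, gpu_b: int) -> int:
    """Return the highest level crossed on the path between two distinct GPUs."""
    if gpu_a == gpu_b:
        raise ValueError("path level is only defined for two distinct GPUs")
    # In a balanced binary tree, the highest differing bit of the two IDs
    # identifies the level of their lowest common ancestor.
    return (gpu_a ^ gpu_b).bit_length() - 1


if __name__ == "__main__":
    for a, b in [(0, 1), (0, 2), (0, 4), (0, 8)]:
        level = path_level(a, b)
        print(f"GPU {a} <-> GPU {b}: Level {level} ({LEVEL_DESCRIPTION[level]})")
```

Under this assumed numbering, GPU pairs with nearby IDs tend to sit on lower levels, which is why a naive assignment of neighbouring ranks to neighbouring GPU IDs is not automatically a good fit for every communication pattern.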
Background and Motivation • Multi-GPU node bandwidth
Background and Motivation • Multi-GPU node latency
Design
• What we know:
  – Intranode GPU-to-GPU communications may traverse different paths
  – Different paths can have different latency and bandwidth
• Ultimate goal:
  – Efficient utilization of GPU communication channels
    · Intensive communications carried over stronger channels
• Our proposal:
  – Topology-aware GPU selection
    · Intelligent assignment of intranode GPUs to MPI processes so as to maximize communication performance
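The goal of carrying intensive communications over stronger channels can be phrased as a cost to minimize. The sketch below is one possible formulation and not necessarily the authors' model: it multiplies each process pair's traffic by a channel cost taken as the path level plus one. The binary-tree level function, the traffic volumes, and all names are assumptions made for illustration.

```python
def path_level(gpu_a: int, gpu_b: int) -> int:
    """Assumed binary-tree topology: level of the lowest common ancestor."""
    return (gpu_a ^ gpu_b).bit_length() - 1


def mapping_cost(traffic: dict, gpu_of: list) -> int:
    """Sum of traffic volume times channel cost over all communicating pairs.

    traffic: {(rank_i, rank_j): volume}, hypothetical communication volumes
    gpu_of:  list where gpu_of[rank] is the GPU assigned to that MPI rank
    """
    return sum(
        volume * (path_level(gpu_of[i], gpu_of[j]) + 1)
        for (i, j), volume in traffic.items()
    )


# Hypothetical pattern: ranks 0-1 and 2-3 exchange a lot, 0-2 and 1-3 very little.
traffic = {(0, 1): 100, (0, 2): 1, (1, 3): 1, (2, 3): 100}

far_mapping = [0, 8, 1, 9]   # heavy pairs cross the socket-level link
near_mapping = [0, 1, 8, 9]  # heavy pairs share a PCIe internal switch

if __name__ == "__main__":
    print("heavy pairs far apart :", mapping_cost(traffic, far_mapping))
    print("heavy pairs close     :", mapping_cost(traffic, near_mapping))
```

Both assignments use the same four GPUs; only which process lands on which GPU changes, and that alone separates the two costs.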
Design
Our approach:
1. Extracting the GPU communication pattern
2. Extracting the physical characteristics of the node
3. Modeling topology-aware GPU selection as a graph mapping problem
4. Solving the problem using a mapping algorithm
[Diagram: the GPU communication pattern gives the GPU virtual topology, and the node physical characteristics give the GPU physical topology; the mapping algorithm takes both topologies and produces the GPU mapping table]
Design
Our approach:
1. Extracting the GPU communication pattern
  – Instrumenting the Open MPI library to collect GPU inter-process communication
2. Extracting the physical characteristics of the node
  – Metrics: latency, bandwidth, distance
3. Modeling topology-aware GPU selection as a graph mapping problem
  – SCOTCH graph API for the GPU virtual topology and the GPU physical topology
4. Solving the problem using a mapping algorithm
  – SCOTCH mapping algorithm
[Diagram: the GPU communication pattern gives the GPU virtual topology, and the node physical characteristics give the GPU physical topology; the mapping algorithm takes both topologies and produces the GPU mapping table]
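To make the graph mapping step concrete, here is a minimal sketch that maps a small communication graph onto GPUs. It does not use SCOTCH; it swaps in a simple pairwise-swap local search as a stand-in for the mapping algorithm. The binary-tree path levels, the weight matrix, and every function name are assumptions for illustration only.

```python
from itertools import combinations


def path_level(gpu_a: int, gpu_b: int) -> int:
    """Assumed physical topology: balanced binary tree over GPU IDs."""
    return (gpu_a ^ gpu_b).bit_length() - 1


def cost(weights, gpu_of):
    """Communication weight times channel cost (level + 1), over all rank pairs."""
    n = len(gpu_of)
    return sum(
        weights[i][j] * (path_level(gpu_of[i], gpu_of[j]) + 1)
        for i in range(n)
        for j in range(i + 1, n)
    )


def improve_mapping(weights, gpu_of):
    """Swap the GPUs of two ranks whenever that strictly lowers the cost."""
    gpu_of = list(gpu_of)
    best = cost(weights, gpu_of)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(gpu_of)), 2):
            gpu_of[i], gpu_of[j] = gpu_of[j], gpu_of[i]
            candidate = cost(weights, gpu_of)
            if candidate < best:
                best = candidate
                improved = True
            else:
                # Undo a swap that does not help.
                gpu_of[i], gpu_of[j] = gpu_of[j], gpu_of[i]
    return gpu_of, best


# Hypothetical virtual topology: ranks 0-2 and 1-3 communicate heavily.
weights = [
    [0, 1, 10, 0],
    [1, 0, 0, 10],
    [10, 0, 0, 1],
    [0, 10, 1, 0],
]
default_mapping = [0, 1, 2, 3]  # assumed default: rank i uses GPU i

if __name__ == "__main__":
    mapping_table, final_cost = improve_mapping(weights, default_mapping)
    print("default cost:", cost(weights, default_mapping))
    print("mapping table (rank -> GPU):", mapping_table, "cost:", final_cost)
```

A local search like this can stop in a local minimum on larger graphs, which is one reason to rely on a dedicated mapping library for the real problem.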
Results: Setup
• One node of the Helios cluster from Calcul Quebec
  – 16 GPUs (K80)
  – Two 12-core Intel Xeon 2.7 GHz processors
• 4 micro-benchmarks
  – 5-point 2D stencil
  – 5-point 2D torus
  – 7-point 3D torus
  – 5-point 4D hypercube
• One application (new)
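The micro-benchmark patterns can be written down as communication graphs. The sketch below builds the neighbour edges of a Cartesian process grid, with and without wrap-around. The 4x4 grid for 16 processes and the row-major rank numbering are assumptions, since the slides do not state the grid dimensions, and `grid_edges` is a hypothetical helper.

```python
from itertools import product


def rank_of(coord, dims):
    """Row-major rank of a grid coordinate."""
    rank = 0
    for c, size in zip(coord, dims):
        rank = rank * size + c
    return rank


def grid_edges(dims, periodic):
    """Undirected neighbour edges of a Cartesian grid (a torus if periodic)."""
    edges = set()
    for coord in product(*(range(size) for size in dims)):
        for axis, size in enumerate(dims):
            nxt = coord[axis] + 1
            if nxt == size:
                # Wrap around only on a torus; for sizes below 3 the
                # wrap-around neighbour is already covered or is the cell itself.
                if not periodic or size < 3:
                    continue
                nxt = 0
            neighbor = coord[:axis] + (nxt,) + coord[axis + 1:]
            a, b = rank_of(coord, dims), rank_of(neighbor, dims)
            edges.add((min(a, b), max(a, b)))
    return edges


stencil_2d = grid_edges((4, 4), periodic=False)  # assumed 4x4 process grid
torus_2d = grid_edges((4, 4), periodic=True)

if __name__ == "__main__":
    print("2D stencil edges:", len(stencil_2d))
    print("2D torus edges:  ", len(torus_2d))
```

Edge sets like these, optionally with per-edge weights, are the kind of input a mapping algorithm needs as the virtual topology.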
Results: Micro-benchmarks
[Figure: Runtime improvement of topology-aware mappings over default mapping on non-weighted micro-benchmarks]
Results: Micro-benchmarks
[Figure: Runtime improvement of topology-aware mappings over default mapping on weighted micro-benchmarks]
Results: Application (new)
[Figure: Runtime of the HOOMD-Blue application with LJ-512K particle size using default and topology-aware mappings; annotated improvements: 12.8% and 15.7%]
Conclusion
• Discussed the GPU inter-process communication bottleneck
  – Reviewed some potential solutions to mitigate its effect
• Showed an example of a multi-GPU node and its communication channels
• Showed different levels of bandwidth and latency in a multi-GPU node
• Proposed a topology-aware GPU selection approach
  – More efficient utilization of GPU-to-GPU communication channels
  – Performance improvement by mapping intensive communications onto stronger channels
Conclusion Topology awareness matters for GPU communications and can provide considerable performance improvements.
Future Work
• Evaluation on different multi-GPU nodes with different node architectures and GPUs.
• Impact on different applications.
• Extension towards multiple nodes across the cluster.
Thank you for your attention!
Contacts:
• Iman Faraji: i.faraji@queensu.ca
• Seyed H. Mirsadeghi: s.mirsadeghi@queensu.ca
• Ahmad Afsahi: ahmad.afsahi@queensu.ca
Questions?
Backup: Motivation
[Figures: Helios-K20 and Helios-K80]