SLIDE 1

Topology-Aware GPU Selection on Multi-GPU Nodes

Iman Faraji, Seyed H. Mirsadeghi, and Ahmad Afsahi Department of Electrical and Computer Engineering Parallel Processing Research Laboratory Queen’s University Canada

The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) May 23, 2016

SLIDE 2

AsHES 2016

Outline

  • Introduction
  • Background and Motivation
  • Design
  • Results
  • Conclusion
  • Future Work

Parallel Processing Research Laboratory (PPRL) 2


SLIDE 4

Introduction

  • GPU accelerators have successfully established themselves in modern HPC clusters
    – High performance
    – Energy efficiency
  • Demand for higher GPU computational power and memory
    – Multi-GPU nodes in state-of-the-art HPC clusters

SLIDE 5

Introduction

  • Clusters with multi-GPU nodes provide:
    – Higher computational power
    – More memory to hold larger datasets
  • However, this brings up a challenge:
    – More GPUs → potentially more GPU-to-GPU communication
    – The "Achilles heel" of GPU-accelerated application performance!

SLIDE 6

Introduction

  • To address the GPU communication bottleneck:
    – Increase GPU utilization at the application level
      • Reduces the share of GPU communication in application runtime
      • But not all applications can highly utilize the GPUs in a node
    – Asynchronously progress inter-process GPU communication and GPU computation
      • Overlaps GPU communication with computation
      • But a high degree of overlap is not always feasible
    – Leverage GPU hardware features (such as IPC)
      • Improves GPU-to-GPU communication performance
      • But only possible for specific GPU pairs within a node
      • Communication performance is still limited by latency and bandwidth capacity

HOWEVER…

SLIDE 7

Introduction

  • Smartly designed applications will continue to use these features
  • GPU communications can still become a soft point in different applications and GPU nodes

HOW? Conduct GPU communications as efficiently as possible.


SLIDE 9

Background and Motivation

  • Multi-GPU node architecture
    – Helios-K80 cluster at Université Laval's computing centre

[Figure: topology of a 16-GPU node, showing GPUs 0–15 and the link level between each pair]

  • Level 0: path between GPU pairs traverses a PCIe internal switch
  • Level 1: path between GPU pairs traverses multiple internal switches
  • Level 2: path between GPU pairs traverses a PCIe host bridge
  • Level 3: path traverses a socket-level link (e.g., QPI)
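The four link levels above can be captured in a small distance function. The sketch below assumes a hypothetical 16-GPU layout in which adjacent GPU pairs share an internal switch, quads share a switch group, and octets share a host bridge; the real Helios-K80 wiring may group GPUs differently.

```python
def comm_level(a, b):
    """Topology level of the path between GPUs a and b on a hypothetical
    16-GPU node (modeled loosely on the Helios-K80 layout):
      level 0: same PCIe internal switch       (pairs 0-1, 2-3, ...)
      level 1: same group of internal switches (quads 0-3, 4-7, ...)
      level 2: same PCIe host bridge / socket  (octets 0-7, 8-15)
      level 3: across the socket-level link (e.g., QPI)
    """
    if a == b:
        return -1  # same GPU: no inter-GPU path needed
    if a // 2 == b // 2:
        return 0
    if a // 4 == b // 4:
        return 1
    if a // 8 == b // 8:
        return 2
    return 3
```

Lower levels generally mean higher bandwidth and lower latency, which is what the bandwidth and latency measurements on the next two slides illustrate.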

SLIDE 10

Background and Motivation

  • Multi-GPU node bandwidth


SLIDE 11

Background and Motivation

  • Multi-GPU node latency



SLIDE 13

Design

  • What we know:
    – Intranode GPU-to-GPU communications may traverse different paths
    – Different paths can have different latency and bandwidth
  • Ultimate goal:
    – Efficient utilization of GPU communication channels
      • Intensive communications carried over stronger channels
  • Our proposal:
    – Topology-aware GPU selection
      • Intelligent assignment of intranode GPUs to MPI processes so as to maximize communication performance
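The goal of carrying intensive communications over stronger channels can be stated as minimizing a weighted cost over all process pairs. A minimal sketch, with hypothetical `comm` (communication volume between process pairs) and `dist` (topology level between GPU pairs) matrices standing in for the measured quantities:

```python
def mapping_cost(comm, dist, mapping):
    """Cost of assigning MPI process i to GPU mapping[i]: the sum over all
    process pairs of (communication volume) x (distance of the chosen GPU
    pair). Topology-aware GPU selection seeks a mapping minimizing this sum,
    so the heaviest-communicating processes land on the closest GPU pairs."""
    n = len(comm)
    return sum(comm[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(i + 1, n))
```

For example, two processes exchanging a volume of 4 cost nothing when placed on a level-0 GPU pair, but 4 × 2 when placed across a level-2 host bridge.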

SLIDE 14

Design

Our Approach:

  1. Extract the GPU communication pattern
  2. Extract the physical characteristics of the node
  3. Model topology-aware GPU selection as a graph mapping problem
  4. Solve the problem using a mapping algorithm

[Figure: workflow — the GPU communication pattern yields a GPU virtual topology, the node's physical characteristics yield a GPU physical topology, and a mapping algorithm combines the two into a GPU mapping table]

SLIDE 15

Design

Our Approach:

  1. Extract the GPU communication pattern
     – Instrument the Open MPI library to collect GPU inter-process communication
  2. Extract the physical characteristics of the node
     – Metrics: latency, bandwidth, and distance
  3. Model topology-aware GPU selection as a graph mapping problem
     – SCOTCH graph API
  4. Solve the problem using a mapping algorithm
     – SCOTCH mapping algorithm

[Figure: the workflow from the previous slide, annotated with these tools]
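The paper solves step 4 with SCOTCH's mapper. As an illustration only, the sketch below substitutes a much simpler greedy heuristic that places the heaviest-communicating process pairs on the closest free GPU pairs; it assumes one process per GPU and symmetric, hypothetical `comm`/`dist` matrices, and it makes no claim to SCOTCH's quality.

```python
import itertools

def greedy_map(comm, dist):
    """Greedy stand-in for a topology-aware mapping algorithm.

    comm[i][j]: communication volume between processes i and j (virtual topology)
    dist[g][h]: distance level between GPUs g and h (physical topology)
    Returns mapping[process] = gpu, assuming len(comm) processes and GPUs.
    """
    n = len(comm)
    # Visit process pairs from heaviest to lightest communication volume.
    pairs = sorted(itertools.combinations(range(n), 2),
                   key=lambda p: -comm[p[0]][p[1]])
    mapping, used = {}, set()
    for i, j in pairs:
        if i in mapping and j in mapping:
            continue
        if i not in mapping and j not in mapping:
            # Place both endpoints on the closest free GPU pair.
            g, h = min(((g, h) for g in range(n) for h in range(n)
                        if g != h and g not in used and h not in used),
                       key=lambda p: dist[p[0]][p[1]])
            mapping[i], mapping[j] = g, h
            used.update((g, h))
        else:
            # One endpoint is placed: pick the closest free GPU to it.
            placed, free = (i, j) if i in mapping else (j, i)
            g = min((g for g in range(n) if g not in used),
                    key=lambda g: dist[mapping[placed]][g])
            mapping[free] = g
            used.add(g)
    return mapping
```

On a toy node with two level-0 GPU pairs, two pairs of heavily-communicating processes each end up on a close pair rather than straddling the host bridge.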


SLIDE 17

Results: Setup

  • One node of the Helios cluster from Calcul Québec
    – 16 GPUs (K80)
    – Two 12-core Intel Xeon 2.7 GHz CPUs
  • 4 micro-benchmarks
    – 5-point 2D stencil
    – 5-point 2D torus
    – 7-point 3D torus
    – 5-point 4D hypercube
  • One application (new): HOOMD-Blue
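The virtual topology of a benchmark like the 5-point 2D stencil is simply its neighbor graph. A small sketch with unit edge weights (the weighted variants of the benchmarks would assign different volumes per edge; grid dimensions here are illustrative):

```python
def stencil_2d(rows, cols):
    """Adjacency matrix of a 5-point 2D stencil: each process exchanges
    data with its north/south/east/west neighbors on a rows x cols grid.
    Returns comm[i][j] = 1 for neighboring processes, 0 otherwise."""
    n = rows * cols
    comm = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    comm[i][nr * cols + nc] = 1
    return comm
```

A 4 × 4 grid matches the 16-GPU node: corner processes have two neighbors, interior processes four, and a topology-aware mapper can then try to keep each neighbor pair within a low-level channel.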

SLIDE 18

Results

Micro-benchmark

[Figure: runtime improvement of topology-aware mappings over the default mapping on non-weighted micro-benchmarks]

SLIDE 19

Results

Micro-benchmark

[Figure: runtime improvement of topology-aware mappings over the default mapping on weighted micro-benchmarks]

SLIDE 20

Results

Application (new result)

[Figure: runtime of the HOOMD-Blue application with LJ-512K particle size using default and topology-aware mappings, showing 12.8% and 15.7% improvements for the topology-aware mappings]

SLIDE 21

Conclusion

  • Discussed the GPU inter-process communication bottleneck
    – Overviewed some potential solutions to mitigate its effect
  • Showed an example of a multi-GPU node and its communication channels
  • Showed the different levels of bandwidth and latency in a multi-GPU node
  • Proposed a topology-aware GPU selection approach
    – More efficient utilization of GPU-to-GPU communication channels
    – Performance improvement by mapping intensive communications onto stronger channels

SLIDE 22

Conclusion


Topology awareness matters for GPU communications and can provide considerable performance improvements.

SLIDE 23

Future Work

  • Evaluation on different multi-GPU nodes with different node architectures and GPUs
  • Impact on different applications
  • Extension towards multiple nodes across the cluster

SLIDE 24

Acknowledgments


SLIDE 25

Thank you for your attention!

Contacts:

  • Iman Faraji: i.faraji@queensu.ca
  • Seyed H. Mirsadeghi: s.mirsadeghi@queensu.ca
  • Ahmad Afsahi: ahmad.afsahi@queensu.ca

Questions?

SLIDE 26

Backup

[Backup slides: motivation results on Helios-K20 and Helios-K80]