

SLIDE 1

Mapping MPI+X Applications to Multi-GPU Architectures

A Performance-Portable Approach

Edgar A. León, Computer Scientist
GPU Technology Conference, San Jose, CA, March 28, 2018

LLNL-PRES-746812. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

SLIDE 2

Application developers face greater complexity

§ Hardware architectures
— SMT
— GPUs
— FPGAs
— NVRAM
— NUMA, multi-rail

§ Programming abstractions
— MPI
— OpenMP, POSIX threads
— CUDA, OpenMP 4.5, OpenACC
— Kokkos, RAJA

§ Applications
— Need the hardware topology to run efficiently
— Need to run on more than one architecture
— Need multiple programming abstractions (hybrid applications)

SLIDE 3

§ Compute power is not the bottleneck
§ Data movement dominates energy consumption
§ HPC applications are dominated by the memory system
— Latency and bandwidth
— Capacity tradeoffs (multi-level memories)
§ Leverage local resources
— Avoid remote accesses

How do we map hybrid applications to increasingly complex hardware?

More than compute resources, it is about the memory system!

SLIDE 4

The Sierra system that will replace Sequoia features a GPU-accelerated architecture

Components
— Mellanox interconnect: single-plane EDR InfiniBand, 2-to-1 tapered fat tree
— IBM POWER9: Gen2 NVLink
— NVIDIA Volta: 7 TFLOP/s, HBM2, Gen2 NVLink

Compute Node
— 2 IBM POWER9 CPUs
— 4 NVIDIA Volta GPUs
— NVMe-compatible PCIe 1.6 TB SSD
— 256 GiB DDR4
— 16 GiB globally addressable HBM2 associated with each GPU
— Coherent shared memory

Compute Rack
— Standard 19”
— Warm-water cooling

Compute System
— 4320 nodes, 240 compute racks
— 1.29 PB memory
— 125 PFLOPS
— ~12 MW

GPFS File System
— 154 PB usable storage
— 1.54 TB/s R/W bandwidth

SLIDE 5

[hwloc topology diagram: machine with 256 GB total across 2 NUMA nodes of 128 GB; 10 cores per package, each core with private L1d (64 KB), L2 (512 KB), and L3 (8192 KB) and 8 PUs (SMT-8); per-socket PCI devices include a Mellanox InfiniBand NIC (mlx5), NVMe storage, Ethernet NICs, and 2 NVIDIA GPUs.]

2017 CORAL EA system: IBM Power8+ SL822LC with NVIDIA Pascal Tesla P100
— 10 cores per socket, SMT-8
— Private L1, L2, L3 per core
— 2 GPUs per socket
— 1 IB NIC per socket
— 2 Ethernet NICs
— NVMe SSD

Figure generated with hwloc

SLIDE 6

Existing approaches and their limitations

§ MPI/RM approaches
— By thread
— By core
— By socket
— Latency (IBM Spectrum MPI)
— Bandwidth (IBM Spectrum MPI)

§ OpenMP approaches (see the sketch below)
— Policies: spread, close, master
— Predefined places: threads, cores, sockets
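
These OpenMP policies and places correspond to the standard OMP_PROC_BIND and OMP_PLACES environment variables (OpenMP 4.0 and later). A minimal sketch in Python of setting them for a run; "./app" is a hypothetical OpenMP binary:

import os
import subprocess

# OpenMP affinity controls matching the policies and places above.
env = dict(os.environ)
env["OMP_NUM_THREADS"] = "10"    # illustrative thread count
env["OMP_PROC_BIND"] = "spread"  # policy: spread | close | master
env["OMP_PLACES"] = "cores"      # places: threads | cores | sockets

# "./app" stands in for an OpenMP application binary.
subprocess.run(["./app"], env=env)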

§ Limitations
— The memory system is not the primary concern
— No coherent mapping across programming abstractions
— No support for heterogeneous devices

SLIDE 7

A portable algorithm for multi-GPU architectures: mpibind

§ 2 IBM Power8+ processors
§ Per socket:
— 10 SMT-8 cores
— 1 InfiniBand NIC
— 2 Pascal GPUs
§ NVMe SSD
§ Private L1, L2, L3 per core

[Node schematic: two sockets, each with local memory and two GPUs; hwloc detail of one core showing private L1d (64 KB), L2 (512 KB), L3 (8192 KB), and PUs P#0 to P#7.]

SLIDE 8

mpibind’s primary consideration: The memory hierarchy

[Memory tree of the node: a root ("Start") with 2 NUMA vertices; NIC-0, GPU-0, GPU-1, and NVM-0 attach to the first NUMA vertex, NIC-1, GPU-2, and GPU-3 to the second; below each NUMA vertex hang 10 cores (c0 to c19), each with private L3/L2/L1 and PUs p0 to p159. Tree levels are labeled l0 to l5.]

Worked example (sketched in code below): Workers = 8; Vertices(l2) = 20
— 8 workers / 2 NUMA domains: 4 workers per domain
— 4 workers / 2 GPUs: 2 workers per GPU
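
A minimal Python sketch of this division of workers, using the tree shape from the figure (the even, contiguous split is an illustrative assumption):

def split(items, n):
    # Split items into n nearly equal, contiguous parts.
    base, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts

workers = list(range(8))                       # Workers = 8
for numa, ws in enumerate(split(workers, 2)):  # 8 workers / 2 NUMA
    for gpu, wg in enumerate(split(ws, 2)):    # 4 workers / 2 GPUs
        print(f"NUMA {numa}, GPU {2 * numa + gpu}: workers {wg}")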

SLIDE 9

The algorithm

§ Get the hardware topology
§ Devise a memory tree G
— Assign devices to memory vertices
§ Calculate the number of workers w (all processes and threads)
§ Traverse the tree to determine the level k with at least w vertices
§ Traverse the subtrees selecting compute resources for each vertex:
m': vertices(k) → PU
§ Map workers to vertices respecting NUMA boundaries:
m: workers → vertices(k) → PU

[Tree schematic: Start → 2 NUMA vertices (GPU0/NIC0 on one, GPU1/NIC1 on the other) → cores (Core0, Core1, ...) → L2 → L1 → PUs (p4 to p7 shown); levels labeled l0 to l4.]
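
A runnable Python sketch of these steps on a toy model of the CORAL EA node (all data structures here are illustrative; the real mpibind derives the tree from the hwloc topology):

# Toy memory tree: root, 2 NUMA domains, 20 SMT-8 cores (10 per domain).
levels = [["root"], ["numa0", "numa1"], [f"core{i}" for i in range(20)]]
numa_of = {f"core{i}": i // 10 for i in range(20)}  # vertex -> NUMA domain
pus = {f"core{i}": list(range(8 * i, 8 * i + 8)) for i in range(20)}

def split(items, n):
    # Split items into n nearly equal, contiguous parts.
    base, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts

def select_level(levels, w):
    # Shallowest tree level with at least w vertices.
    for k, verts in enumerate(levels):
        if len(verts) >= w:
            return k
    return len(levels) - 1

def mpibind(w):
    # Map w workers to PUs without straddling NUMA boundaries.
    k = select_level(levels, w)
    domains = {}
    for v in levels[k]:                       # group vertices(k) by NUMA
        domains.setdefault(numa_of[v], []).append(v)
    mapping = {}
    for dom_workers, dom_verts in zip(split(list(range(w)), len(domains)),
                                      domains.values()):
        # m: workers -> vertices(k), then m': vertices(k) -> PUs
        for ws, vs in zip(dom_workers, split(dom_verts, len(dom_workers))):
            mapping[ws] = [pu for v in vs for pu in pus[v]]
    return mapping

for worker, cpus in mpibind(8).items():
    print(f"worker {worker}: PUs {cpus}")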

SLIDE 10

Example mapping: One task per GPU

[Node schematic: two sockets, each with local memory and two GPUs; PUs colored by owning task under each policy.]

PUs assigned to each task under each affinity policy:

Task  default  latency  bandwidth  mpibind
0     0-79     0-7      0-7        0,8,16,24,32
1     80-159   8-15     8-15       40,48,56,64,72
2     0-79     16-23    80-87      80,88,96,104,112
3     80-159   24-31    88-95      120,128,136,144,152
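
For contrast, the sketch below shows the usual hand-rolled route to one task per GPU, which mpibind automates; the local-rank environment variable names are common launcher conventions, not guaranteed on every system:

import os

# Bind this process to one GPU based on its node-local rank.
local_rank = int(os.environ.get("SLURM_LOCALID",
                 os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")))

ngpus = 4  # GPUs per node on this system
# Must be set before the CUDA runtime initializes in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank % ngpus)
print(f"local rank {local_rank} -> GPU {local_rank % ngpus}")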

SLIDE 11

Evaluation: Synchronous collectives, GPU compute and bandwidth, app benchmark

Machine: CORAL EA system
Affinity: Spectrum MPI default, Spectrum MPI latency, Spectrum MPI bandwidth, mpibind
Benchmarks: MPI_Barrier, MPI_Allreduce, Bytes&Flops compute, Bytes&Flops bandwidth, SW4lite
Number of nodes: 1, 2, 4, 8, 16
Processes (tasks) per node: 4, 8, 20

SLIDE 12

Enabled uniform access to GPU resources

§ Execute multiple instances concurrently
— 4 and 8 PPN
§ Measure GPU FLOPS
§ Processes time-share GPUs by default
§ Performance without mpibind is severely limited by the GPU mapping

Compute micro-benchmark*
*kokkos/benchmarks/bytes_and_flops

[Bar chart: mean GFLOP/s per process under the bandwidth, default, latency, and mpibind policies at target PPN 4 and 8; bars annotated with the GPU used and the achieved PPN. Higher is better.]
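
One way to observe the default time-sharing is to list the compute processes resident on each GPU; a sketch using the NVIDIA Management Library through the pynvml bindings (assuming they are installed):

import pynvml

# With the default affinity, all ranks typically land on GPU 0;
# with mpibind, each GPU hosts only its assigned ranks.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print(f"GPU {i}: {len(procs)} compute processes "
          f"(pids {[p.pid for p in procs]})")
pynvml.nvmlShutdown()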

SLIDE 13

Enabled access to the memory of all GPUs without user intervention

§ Execute multiple instances concurrently
— 4 or 8 PPN
§ Measure GPU global memory bandwidth
§ Processes time-share GPUs by default
§ Without mpibind, some processes fail by running out of memory

Memory bandwidth micro-benchmark*
*kokkos/benchmarks/bytes_and_flops

[Bar chart: mean GB/s per process under the bandwidth, default, latency, and mpibind policies at target PPN 4 and 8; bars annotated with the GPU used and the achieved PPN. Higher is better.]

SLIDE 14

Impact on SW4lite, an earthquake ground motion simulation

§ Simplified version of SW4
— Layer over half space (LOH.2)
— 17 million grid points (h100)
§ Multiple runs, mean reported
— 6 runs for GPU
— 10 runs for CPU
§ Performance speedups
— CPU: mpibind over default: 3.7x
— GPU: mpibind over bandwidth: 2.2x
— GPU over CPU: 9.7x

[Stacked bar chart: mean execution time (s), broken into Forcing, Supergrid, Scheme, BC_phys, and BC_comm phases, under each affinity policy, for the CPU (MPI + OpenMP) and GPU (MPI + RAJA) versions. Lower is better.]

Configuration per policy (CPN = CPP x PPN; the node has 20 cores, so CPN above or below 20 means over- or undersubscription):

Policy     TPP  CPP  PPN  CPN  Subscription
bandwidth  8    1    4    4    under
default    80   10   4    40   over
latency    8    1    4    4    under
mpibind    5    5    4    20

TPP: threads per process, CPP: cores per process, PPN: processes per node, CPN: cores per node

SLIDE 15

mpibind: A memory-driven mapping algorithm for multi-GPU architectures

§ Focuses on the hierarchical nature of the memory system
§ Provides portability and user transparency
— Same algorithm on GPU-based, KNL-based, and commodity-based systems
§ Encompasses
— Hybrid programming abstractions
— Heterogeneous devices
§ Outperforms existing approaches without user intervention
— Reduces runtime variability
— Competitive performance on collective operations
— Enables uniform access to all GPU resources

SLIDE 16

Bibliography and related GTC talks

Bibliography:

§ mpibind: A memory-centric affinity algorithm for hybrid applications. MEMSYS 2017.
§ System noise revisited: Enabling application scalability and reproducibility with SMT. IPDPS 2016.
§ 3D ground motion simulation in basins. Final report to the Pacific Earthquake Engineering Research Center, 2005.
§ SW4lite, Kokkos, and RAJA:
github.com/geodynamics/sw4lite
github.com/kokkos/kokkos
github.com/LLNL/RAJA

Related GTC talks:

§ S8270 – Acceleration of an LLNL production Fortran application on the Sierra supercomputer
§ S8470 – Using RAJA for accelerating LLNL production applications on the Sierra supercomputer
§ S8489 – Scaling molecular dynamics across 25,000 GPUs on Sierra & Summit