Mapping MPI+X Applications to Multi-GPU Architectures: A Performance-Portable Approach

  1. Mapping MPI+X Applications to Multi-GPU Architectures: A Performance-Portable Approach. Edgar A. León, Computer Scientist. GPU Technology Conference, San Jose, CA, March 28, 2018. LLNL-PRES-746812. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

  2. Application developers face greater complexity
  § Hardware architectures
     - SMT
     - GPUs
     - FPGAs
     - NVRAM
     - NUMA, multi-rail
  § Programming abstractions
     - MPI
     - OpenMP, POSIX threads
     - CUDA, OpenMP 4.5, OpenACC
     - Kokkos, RAJA
  § Applications
     - Need the hardware topology to run efficiently
     - Need to run on more than one architecture
     - Need multiple programming abstractions
  Result: hybrid applications.
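  As a concrete reference point for the MPI+X ("hybrid") model the talk targets, here is a minimal MPI+OpenMP sketch (not code from the talk): each rank spawns OpenMP threads, and each thread reports the hardware thread (PU) it is currently running on, which is exactly the placement that the mapping tools discussed below control. sched_getcpu() is Linux/glibc-specific.

      /* Minimal hybrid MPI+OpenMP sketch: each worker reports its current PU. */
      #define _GNU_SOURCE
      #include <mpi.h>
      #include <omp.h>
      #include <sched.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, provided;
          /* FUNNELED is enough: only the master thread calls MPI here. */
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      #pragma omp parallel
          {
              /* The PU printed here is what the binding policy determines. */
              printf("rank %d, thread %d, running on PU %d\n",
                     rank, omp_get_thread_num(), sched_getcpu());
          }

          MPI_Finalize();
          return 0;
      }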

  3. How do we map hybrid applications to increasingly complex hardware?
  § Compute power is not the bottleneck
  § Data movement dominates energy consumption
  § HPC applications are dominated by the memory system
     - Latency and bandwidth
     - Capacity tradeoffs (multi-level memories)
  § Leverage local resources
     - Avoid remote accesses
  More than compute resources, it is about the memory system!

  4. The Sierra system that will replace Sequoia features a GPU-accelerated architecture
  § Compute System
     - 4320 nodes
     - 1.29 PB memory
     - 240 compute racks
     - 125 PFLOPS
     - ~12 MW
  § Compute Rack
     - Standard 19"
     - Warm water cooling
  § Compute Node
     - 2 IBM POWER9 CPUs
     - 4 NVIDIA Volta GPUs
     - NVMe-compatible PCIe 1.6 TB SSD
     - 256 GiB DDR4
     - 16 GiB globally addressable HBM2 associated with each GPU
     - Coherent shared memory
  § Components
     - IBM POWER9: Gen2 NVLink
     - NVIDIA Volta: 7 TFlop/s, HBM2, Gen2 NVLink
     - Mellanox interconnect: single-plane EDR InfiniBand, 2-to-1 tapered fat tree
     - GPFS file system: 154 PB usable storage, 1.54 TB/s R/W bandwidth

  5. CORAL EA 2017 machine (256 GB total)
  [Figure: hwloc topology of a CORAL EA node. Two NUMA nodes (128 GB each), each an IBM Power8+ package (S822LC) with 10 SMT-8 cores (8 PUs per core) and private L1, L2, and L3 caches per core; 1 InfiniBand NIC and 2 NVIDIA Pascal Tesla P100 GPUs per socket; an NVMe SSD; and 2 Ethernet NICs. Figure generated with hwloc.]
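  The figure above was produced with hwloc's lstopo tool. For readers who want to inspect a node themselves, here is a small sketch using hwloc's C API (standard hwloc calls; the expected counts for this node, 2 NUMA nodes, 20 cores, 160 PUs, are taken from the figure):

      /* Sketch: count NUMA nodes, cores, and PUs with hwloc. */
      #include <hwloc.h>
      #include <stdio.h>

      int main(void)
      {
          hwloc_topology_t topo;
          hwloc_topology_init(&topo);
          hwloc_topology_load(topo);

          printf("NUMA nodes: %d\n",
                 hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE));
          printf("Cores:      %d\n",
                 hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
          printf("PUs:        %d\n",
                 hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

          hwloc_topology_destroy(topo);
          return 0;
      }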

  6. Existing approaches and their limitations
  § MPI/RM approaches
     - By thread
     - By core
     - By socket
     - Latency (IBM Spectrum MPI)
     - Bandwidth (IBM Spectrum MPI)
  § OpenMP approaches
     - Policies: spread, close, master
     - Predefined places: threads, cores, sockets
  § Limitations
     - The memory system is not a primary concern
     - No coherent mapping across programming abstractions
     - No support for heterogeneous devices

  7. A portable algorithm for multi-GPU architectures: mpibind
  § 2 IBM Power8+ processors
  § Per socket
     - 10 SMT-8 cores
     - 1 InfiniBand NIC
     - 2 Pascal GPUs
  § NVMe SSD
  § Private L1, L2, L3 per core
  [Figure: node schematic showing memory, the two sockets, per-core L1d/L2/L3 caches, SMT-8 PUs, and the GPUs attached to each socket.]

  8. mpibind's primary consideration: the memory hierarchy
  [Figure: the memory tree of the CORAL EA node. Level l0 is the machine root, with NIC-0, NIC-1, GPU-0 through GPU-3, and NVM-0 attached; l1 holds the 2 NUMA domains; l2 the 20 L3 caches; l3 the L2 caches; l4 the L1 caches; l5 the cores c0 through c19 with PUs p0-7 through p152-159.]
  Example: Workers = 8; Vertices(l2) = 20. 8 workers over 2 NUMA domains gives 4 workers per NUMA domain, sharing that domain's 2 GPUs.

  9. The algorithm
  [Figure: an example memory tree. Level l0 is the root (Start); the NUMA vertices at l1 have GPU0/GPU1 and NIC0/NIC1 attached; l2 holds the L2 caches, l3 the L1 caches and cores, and l4 the PUs (e.g., p4-p7).]
  § Get the hardware topology
  § Devise the memory tree
     - Assign devices to memory vertices
  § Calculate the number of workers w (all processes and threads)
  § Traverse the tree to determine the level k with at least w vertices
  § Traverse subtrees selecting compute resources for each vertex: m': vertices(k) -> PU
  § Map workers to vertices, respecting NUMA boundaries: m: workers -> vertices(k) -> PU
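  To make the level-selection and mapping steps concrete, here is a minimal sketch (an illustration only, not the actual mpibind implementation): given per-level vertex counts for the memory tree of the CORAL EA node (values taken from slide 8), it finds the shallowest level with at least w vertices and hands each worker a contiguous block of those vertices, which in this example keeps every worker inside a single NUMA subtree.

      /* Sketch of mpibind's level selection and worker-to-vertex assignment. */
      #include <stdio.h>

      int main(void)
      {
          int workers   = 8;                       /* total processes x threads  */
          int nverts[]  = {1, 2, 20, 20, 20, 20};  /* vertices per level l0..l5  */
          int nlevels   = 6;

          /* Step 1: shallowest level k with at least `workers` vertices. */
          int k = nlevels - 1;
          for (int l = 0; l < nlevels; l++) {
              if (nverts[l] >= workers) { k = l; break; }
          }
          printf("selected level k = %d with %d vertices\n", k, nverts[k]);

          /* Step 2: spread workers evenly over the vertices of level k;
           * contiguous blocks keep each worker within one NUMA subtree here. */
          for (int w = 0; w < workers; w++) {
              int first = w * nverts[k] / workers;
              int last  = (w + 1) * nverts[k] / workers - 1;
              printf("worker %d -> vertices %d..%d of level %d\n",
                     w, first, last, k);
          }
          return 0;
      }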

  10. Example mapping: one task per GPU
  PU assignments for 4 tasks under each affinity policy:

     Task | default | latency | bandwidth | mpibind
     -----+---------+---------+-----------+---------------------
      0   | 0-79    | 0-7     | 0-7       | 0,8,16,24,32
      1   | 80-159  | 8-15    | 8-15      | 40,48,56,64,72
      2   | 0-79    | 16-23   | 80-87     | 80,88,96,104,112
      3   | 80-159  | 24-31   | 88-95     | 120,128,136,144,152

  With mpibind, each task gets one PU on each of 5 cores, and tasks are split evenly across the two sockets (PUs 0-79 are on socket 0, PUs 80-159 on socket 1), so each task sits next to one GPU.
  [Figure: node diagram showing the socket and GPU each task is placed next to under each policy.]
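  The mpibind column above gives each task CPU resources next to one GPU. Inside the application, a common companion idiom (shown here as a generic sketch, not code from the talk) is to have each rank select a device based on its node-local rank, so the "one task per GPU" layout holds on every node:

      /* Sketch: one MPI task per GPU via the node-local rank. */
      #include <mpi.h>
      #include <cuda_runtime.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);

          /* Node-local rank: split the world communicator by shared memory. */
          MPI_Comm node;
          MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                              MPI_INFO_NULL, &node);
          int local_rank;
          MPI_Comm_rank(node, &local_rank);

          int ndev;
          cudaGetDeviceCount(&ndev);          /* GPUs visible to this process */
          cudaSetDevice(local_rank % ndev);   /* e.g., task 0 -> GPU 0, ...   */

          printf("local rank %d using GPU %d of %d\n",
                 local_rank, local_rank % ndev, ndev);

          MPI_Comm_free(&node);
          MPI_Finalize();
          return 0;
      }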

  11. Evaluation: synchronous collectives, GPU compute and bandwidth, application benchmark
  § Machine: CORAL EA system
  § Affinity: Spectrum-MPI default, Spectrum-MPI latency, Spectrum-MPI bandwidth, mpibind
  § Benchmarks: MPI Barrier, MPI Allreduce, Bytes&Flops compute, Bytes&Flops bandwidth, SW4lite
  § Number of nodes: 1, 2, 4, 8, 16
  § Processes (tasks) per node: 4, 8, 20

  12. Enabled uniform access to GPU resources
  Compute micro-benchmark*
  § Execute multiple instances concurrently
     - 4 and 8 processes per node (PPN)
  § Measure GPU FLOPS
  § Processes time-share GPUs by default
  § Performance without mpibind is severely limited because of GPU mapping
  [Chart: mean GFLOPS per process (higher is better) under the bandwidth, default, latency, and mpibind policies at target PPN 4 and 8.]
  *kokkos/benchmarks/bytes_and_flops

  13. Enabled access to the memory of all GPUs without user intervention
  Memory bandwidth micro-benchmark*
  § Execute multiple instances concurrently
     - 4 or 8 processes per node (PPN)
  § Measure GPU global memory bandwidth
  § Processes time-share GPUs by default
  § Without mpibind, some processes fail by running out of memory
  [Chart: mean GB/s per process (higher is better) under the bandwidth, default, latency, and mpibind policies at target PPN 4 and 8.]
  *kokkos/benchmarks/bytes_and_flops

  14. Impact on SW4lite: earthquake ground motion simulation
  § Simplified version of SW4
     - CPU: MPI + OpenMP; GPU: MPI + RAJA
     - Layer over half space (LOH.2)
     - 17 million grid points (h100)
  § Multiple runs, calculate the mean
     - 6 runs for GPU
     - 10 runs for CPU
  § Performance speedup
     - CPU: mpibind over default: 3.7x
     - GPU: mpibind over bandwidth: 2.2x
     - GPU over CPU: 9.7x
  Configuration per affinity policy (TPP: threads per process, CPP: cores per process, PPN: processes per node, CPN: cores per node):
     - bandwidth: TPP 8, CPP 1, PPN 4, CPN 4 (under)
     - default: TPP 80, CPP 10, PPN 4, CPN 40 (over)
     - latency: TPP 8, CPP 1, PPN 4, CPN 4 (under)
     - mpibind: TPP 5, CPP 5, PPN 4, CPN 20
  [Chart: mean execution time in seconds (lower is better), broken down into Forcing, Supergrid, Scheme, BC_phys, and BC_comm phases, for each affinity policy on CPU and GPU.]
