1. Bigger GPUs and Bigger Nodes
Carl Pearson (pearson@illinois.edu), PhD Candidate, advised by Professor Wen-Mei Hwu

2. Outline
▪ Experiences from working with domain experts to develop GPU codes on Blue Waters
▪ Kepler and Volta GPUs
▪ HPC Kepler to Volta Speedup
▪ Blue Waters, Summit, Sierra
▪ Intra-Node Communication Performance

3. GPU Architecture Bird's Eye View (not to scale)
[Block diagram: a CPU with its cores, system DRAM, hard drives, and network attaches over an interconnect to an accelerator with 10-100 SMs; each SM has registers and L1$ / shared memory, backed by a shared L2$ and a DRAM / HBM memory subsystem.]

4. Kepler vs. Volta
▪ K20X (Kepler): 15 SMs, 16 maximum blocks / SM, 48 KB shared memory / SM, 64 K registers / SM, 3.94 TFLOPS single precision, 250 GB/s global memory bandwidth
▪ V100 (Volta): 80 SMs, 32 maximum blocks / SM, 96 KB shared memory / SM, 64 K registers / SM, 15 TFLOPS single precision, 900 GB/s global memory bandwidth
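These parameters can also be read off the installed GPU at runtime. A minimal sketch using cudaGetDeviceProperties; the peak-bandwidth arithmetic (which assumes a double-data-rate memory clock) is our addition, not from the slides:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0

        printf("%s\n", prop.name);
        printf("SMs:                %d\n", prop.multiProcessorCount);
        printf("Shared memory / SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
        printf("Registers / SM:     %d K\n", prop.regsPerMultiprocessor / 1024);
        // peak bandwidth = 2 (DDR) * memory clock * bus width in bytes
        double gbps = 2.0 * prop.memoryClockRate * 1e3     // kHz -> Hz
                    * (prop.memoryBusWidth / 8.0) / 1e9;   // bits -> bytes -> GB/s
        printf("Peak global BW:     %.0f GB/s\n", gbps);
        return 0;
    }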

5. K20x to V100: Architectural Parameters

6. HPC Case Studies
▪ AWP-ODC (Tom Jordan, Yifeng Cui; Southern California Earthquake Center, University of Southern California): Anelastic Wave Propagation, solves a velocity-stress formulation of the 3D wave equation
▪ ChaNGa (Tom Quinn; University of Washington): Charm N-body Gravity Solver, collisionless N-body simulations

7. AWP and ChaNGa V100 Speedup
▪ ChaNGa: 3.28x vs. P100, 4.73x vs. K20x (Blue Waters)
▪ AWP: 1.71x vs. P100, 5.19x vs. K20x (Blue Waters)

8. AWP Detail
Single-precision speedup: 1.711x over P100, 5.188x over K20X
▪ K20x (latency-limited): Kernel 1: 72.4% of GPU time, 145.7 GB/s memory bandwidth; Kernel 2: 27.5%, 136.1 GB/s
▪ V100 (bandwidth-limited): Kernel 1: 70.1% of GPU time, 726.7 GB/s memory bandwidth; Kernel 2: 29.3%, 600.2 GB/s

9. AWP Optimizations
▪ Large blocks to capture reuse: reuse in fast memory; blocks / SM limited by registers and SMs (see the occupancy sketch below)
▪ Uneven architectural change: many more SMs, more memory per SM, same registers per SM
▪ Unclear tradeoff: fine-grained parallelism means more work for the GPU, but less reuse
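To make the resident-blocks-per-SM limit concrete, here is a small sketch using the CUDA occupancy API. stencil_kernel, its shared-memory tile, and the block size are stand-ins for illustration, not the actual AWP kernels:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void stencil_kernel(float *out, const float *in) {
        __shared__ float tile[32 * 32];                   // 4 KB static shared memory per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = tile[threadIdx.x];
    }

    int main() {
        int blockSize = 1024;                             // "large blocks", as tuned for Kepler
        int blocksPerSM = 0;
        // How many blocks of this kernel fit on one SM, given its register
        // and shared-memory footprint at this block size.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, stencil_kernel, blockSize, 0 /* dynamic shared memory */);
        printf("Resident blocks per SM at blockDim=%d: %d\n", blockSize, blocksPerSM);
        return 0;
    }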

10. Takeaways
▪ Laissez-faire approach: 3-5x kernel speedup over optimized Kepler; 3-5x interconnect speedup over optimized Kepler; larger problem needed to fill the GPU
▪ Redesign/rewrite approach: finer-grained parallelism to fill the GPU; harder to capture reuse (key to performance)

11. Nodes are Getting Bigger
▪ Blue Waters (XK node): CPU: 1x AMD64, 32 threads, 16 FP; GPU: 1x K20X, 6 GB, 4 TF; accelerator interconnect: PCIe 2.0 x16, 8 GB/s; memory: 32 GB
▪ Summit [1] (ORNL): CPU: 2x POWER9, 88 threads and 22 FP each; GPU: 6x V100, 16 GB and 15 TF each; accelerator interconnect: NVLink 2.0 x2, 50 GB/s (unidirectional); memory: 512 GB
[1] https://www.olcf.ornl.gov/for-users/system-user-guides/summit/system-overview/

12. Blue Waters XK and Summit Intra-Node Interconnects
[Diagram: on a Blue Waters XK node, one AMD64 CPU connects to a single K20x over PCIe 2.0 x16; on a Summit node, each of the two POWER9 CPUs connects to three V100s over NVLink 2.0 x2.]

13. System Performance Research
▪ CUDA Microbench: https://github.com/rai-project/microbench
▪ Neural networks, MLModelScope: http://ml-arc-minsky.netlify.com/
Future directions:
▪ Quick application-driven architecture design
▪ Performance modeling of neural networks

14. Faster Interconnects
▪ NVLink 2.0 x3 (1.5x Summit): 75 GB/s
▪ PCIe 3.0 x16 (2x Blue Waters): 15.8 GB/s
github.com/rai-project/microbench
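Numbers like these come from timed transfers. A minimal host-to-device bandwidth sketch in the spirit of the microbench suite; the 1 GiB pinned buffer is an illustrative choice, not the suite's actual configuration:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1ull << 30;                   // 1 GiB
        void *host, *dev;
        cudaHostAlloc(&host, bytes, cudaHostAllocDefault); // pinned host memory
        cudaMalloc(&dev, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Host-to-device: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);
        return 0;
    }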

15. Unified Memory
▪ Allocations accessible from CPU and GPU
▪ Implicit data transfer (no cudaMemcpy)
Example sequence across GPU 0, GPU 1, and the CPU:
▪ cudaSetDevice(0); cudaMallocManaged(&a, ...);
▪ a[page0] = 0; // on gpu0
▪ a[page1] = 1; // on gpu1: page fault and migration
▪ a[page2] = 2; // on cpu: page fault and migration
▪ cudaMemAdvise(a, ..., cudaMemAdviseSetPreferredLocation, gpu1); then a[page1] = 1; // on cpu: write served over NVLink
▪ cudaMemPrefetchAsync(a, ..., gpu1); // bulk page migration
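A runnable version of the sequence above, a minimal sketch with assumptions filled in: page0/page1/page2 become offsets chosen to land on different pages, and gpu0/gpu1 are device ordinals. All accesses here are from the CPU; a kernel on gpu0 or gpu1 touching the same pages would fault and migrate them in the same way:

    #include <cuda_runtime.h>

    int main() {
        const size_t page = 4096, n = 3 * page;  // at least three distinct pages (assumed page size)
        const int gpu0 = 0, gpu1 = 1;             // device ordinals (assumption)
        char *a;

        cudaSetDevice(gpu0);
        cudaMallocManaged(&a, n);                 // one allocation, visible to CPUs and GPUs

        a[0 * page] = 0;                          // "page0"
        a[1 * page] = 1;                          // "page1"
        a[2 * page] = 2;                          // "page2"

        // Prefer to keep page1 resident on gpu1; on a coherent CPU-GPU link
        // (e.g. POWER9 + NVLink) a later CPU write can be served remotely
        // instead of migrating the page back.
        cudaMemAdvise(a + page, page, cudaMemAdviseSetPreferredLocation, gpu1);
        a[1 * page] = 1;

        cudaMemPrefetchAsync(a, n, gpu1, 0);      // bulk page migration to gpu1
        cudaDeviceSynchronize();

        cudaFree(a);
        return 0;
    }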

16. P9 Unified Memory Performance
▪ Limited by 1 CPU thread
▪ Coherence: 30% of explicit management
▪ Prefetch: 50-80% of explicit
github.com/rai-project/microbench

17. AMD64 Unified Memory Performance
▪ Coherence: 30-70% of explicit management
▪ Prefetch: 50-95% of explicit
github.com/rai-project/microbench
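The "percent of explicit" figures on these two slides compare unified-memory transfer rates against plain cudaMemcpy. A minimal sketch of the prefetch side of that comparison (divide the result by the explicit bandwidth from the cudaMemcpy sketch earlier); the 1 GiB buffer size is illustrative:

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1ull << 30;
        char *a;
        cudaMallocManaged(&a, bytes);
        memset(a, 0, bytes);                      // populate the pages on the CPU first

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemPrefetchAsync(a, bytes, 0, 0);     // bulk-migrate all pages to GPU 0
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Prefetch to GPU: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

        cudaFree(a);
        return 0;
    }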

18. Device Affinity
▪ Data placement on big nodes can have a dramatic communication impact
github.com/rai-project/microbench
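A first step in reasoning about placement on a big node is asking which device pairs have a direct peer path. A minimal sketch (not taken from the microbench repository) using the CUDA peer-access query:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int src = 0; src < n; ++src) {
            for (int dst = 0; dst < n; ++dst) {
                if (src == dst) continue;
                int ok = 0;
                cudaDeviceCanAccessPeer(&ok, src, dst);   // direct path src -> dst?
                printf("GPU %d -> GPU %d: peer access %s\n", src, dst, ok ? "yes" : "no");
            }
        }
        return 0;
    }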

19. MLModelScope: Neural Network Performance Data
http://ml-arc-minsky.netlify.com
(model -- machine -- framework) triples:
▪ (AlexNet -- Jetson TX-1 -- Tensorflow)
▪ (VGG19 -- AWS P2 X-large -- MxNet)
Neural-network performance primitive benchmarks

20. Thank You
https://cwpearson.github.io
pearson@illinois.edu
Special thanks to:
▪ Professor Wen-Mei Hwu
▪ John Larson, Simon Garcia de Gonzalo, Zaid Qureshi, Mert Hidayetoglu, Abdul Dakkak, and Cheng Li (University of Illinois)
▪ Isaac Gelado (NVIDIA)
▪ Jinjun Xiong and I-Hsin Chung (IBM)
▪ The IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM Cognitive Horizon Network
