

SLIDE 1

Bigger GPUs and Bigger Nodes

Carl Pearson (pearson@illinois.edu)

PhD Candidate, advised by Professor Wen-Mei Hwu


SLIDE 2

Outline

Experiences from working with domain experts to develop GPU codes on Blue Waters
▪ Kepler and Volta GPUs
▪ HPC Kepler-to-Volta speedup
▪ Blue Waters, Summit, Sierra
▪ Intra-node communication performance

SLIDE 3

GPU Architecture Bird’s Eye View (not to scale)

[Diagram: a CPU with system DRAM, hard drives, and network attaches over an accelerator interconnect to a GPU of 10-100 SMs; each SM has cores, registers, and L1$/shared memory, backed by a memory subsystem of shared L2$ and DRAM/HBM.]
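
These architectural parameters can also be read at runtime. A minimal sketch using the standard CUDA runtime API; querying device 0 is an illustrative choice:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // query GPU 0
  printf("%s\n", prop.name);
  printf("SMs:             %d\n", prop.multiProcessorCount);
  printf("shared mem / SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
  printf("registers / SM:  %d K\n", prop.regsPerMultiprocessor / 1024);
  printf("L2$:             %d KB\n", prop.l2CacheSize / 1024);
  return 0;
}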


SLIDE 4

                          K20X (Kepler)   V100 (Volta)
Number of SMs             15              80
Maximum blocks / SM       16              32
Shared memory / SM        48 KB           96 KB
Registers / SM            64 K            64 K
Single-precision rate     3.94 TFLOPS     15 TFLOPS
Global memory bandwidth   250 GB/s        900 GB/s

SLIDE 5

K20x to V100: Architectural Parameters


SLIDE 6

HPC Case Studies

AWP-ODC: Tom Jordan, Yifeng Cui (Southern California Earthquake Center, University of Southern California). Anelastic Wave Propagation: solves a velocity-stress formulation of the 3D wave equation.

ChaNGa: Tom Quinn (University of Washington). Charm N-body Gravity Solver: collisionless N-body simulations.

SLIDE 7

AWP and ChaNGa V100 Speedup

          vs. P100   vs. K20x (Blue Waters)
ChaNGa    3.28       4.73
AWP       1.71       5.19

SLIDE 8

AWP Detail

Single-precision speedup: 1.711x over P100, 5.188x over K20X

            K20x                        V100
            Kernel 1      Kernel 2      Kernel 1      Kernel 2
GPU time    72.4 %        27.5 %        70.1 %        29.3 %
Mem BW      145.7 GB/s    136.1 GB/s    726.7 GB/s    600.2 GB/s
            latency-limited             bandwidth-limited

SLIDE 9

AWP Optimizations

Uneven architectural change
▪ Many more SMs
▪ More memory per SM
▪ Same registers per SM
Large blocks to capture reuse
▪ Reuse in fast memory
▪ Blocks / SM limited by registers and SMs (see the occupancy sketch below)
Unclear tradeoff
▪ Fine-grained parallelism: more work for the GPU, less reuse
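
The register limit on resident blocks per SM can be checked directly with the CUDA occupancy API. A minimal sketch; the stencil kernel here is a hypothetical stand-in for an AWP kernel, not its actual code:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a register-hungry AWP stencil kernel.
__global__ void stencil(float *out, const float *in) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = 0.5f * in[i];
}

int main() {
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, stencil);
  printf("registers / thread: %d\n", attr.numRegs);

  int blocksPerSM;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocksPerSM, stencil, 256 /* threads per block */, 0 /* dyn. smem */);
  printf("resident blocks / SM at 256 threads: %d\n", blocksPerSM);
  return 0;
}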


SLIDE 10

Takeaways

Laissez-faire approach:
▪ 3-5x kernel speedup over optimized Kepler
▪ 3-5x interconnect speedup over optimized Kepler
▪ Larger problem to fill the GPU
Redesign/rewrite approach:
▪ Finer-grained parallelism to fill the GPU
▪ Harder to capture reuse (the key to performance)

SLIDE 11

Nodes are Getting Bigger

                          Blue Waters                     Summit [1] (ORNL)
CPU                       1x AMD64 (32 threads, 16 FP)    2x POWER9 (88 threads, 22 FP each)
GPU                       1x K20X (6 GB, 4 TF)            6x V100 (16 GB, 15 TF each)
Accelerator interconnect
(unidirectional)          PCIe 2.0 x16 (8 GB/s)           NVLink 2.0 x2 (50 GB/s)
Memory                    32 GB                           512 GB

[1] https://www.olcf.ornl.gov/for-users/system-user-guides/summit/system-overview/

SLIDE 12

Blue Waters XK and Summit Intra-Node Interconnects

[Diagram: a Blue Waters XK node pairs one AMD64 CPU with a single K20x over PCIe 2.0 x16; a Summit node pairs two POWER9 CPUs, each linked to three V100s, over NVLink 2.0 x2.]

SLIDE 13

System Performance Research

▪ CUDA: Microbench: https://github.com/rai-project/microbench
▪ Neural networks: MLModelScope: http://ml-arc-minsky.netlify.com/
▪ Future directions: quick application-driven architecture design; performance modeling of neural networks

SLIDE 14

Faster Interconnects

PCIe 3.0 x16 (2x Blue Waters):   15.8 GB/s
NVLink 2.0 x3 (1.5x Summit):     75 GB/s

github.com/rai-project/microbench
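
A host-to-device bandwidth measurement behind numbers like these is typically structured as below. This is a minimal sketch, not the microbench repo's actual harness; the transfer size is an illustrative assumption:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 1ull << 30;  // 1 GiB transfer (illustrative)
  void *host, *dev;
  cudaMallocHost(&host, bytes);     // pinned memory: needed for peak interconnect bandwidth
  cudaMalloc(&dev, bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // warm-up copy
  cudaEventRecord(start);
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // timed copy
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms;
  cudaEventElapsedTime(&ms, start, stop);
  printf("H2D bandwidth: %.1f GB/s\n", bytes / (ms / 1000.0) / 1e9);

  cudaFree(dev);
  cudaFreeHost(host);
  return 0;
}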


SLIDE 15

Unified Memory

cudaSetDevice(0);
cudaMallocManaged(&a, ...);
a[page0] = 0;  // touched on GPU 0
a[page1] = 1;  // touched on GPU 1: page fault and migration
a[page2] = 2;  // touched on CPU: page fault and migration

cudaMemAdvise(a, size, cudaMemAdviseSetPreferredLocation, gpu1);
a[page1] = 1;  // touched on CPU: write served over NVLink, no migration

cudaMemPrefetchAsync(a, size, gpu1);  // bulk page migration
Allocations are accessible from both the CPU and the GPUs; data transfer is implicit (no cudaMemcpy).
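
A self-contained sketch of the same pattern, runnable on a single GPU; the array size and the choice of device 0 are illustrative assumptions:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int *a, int v) { a[threadIdx.x] = v; }

int main() {
  const size_t count = 1 << 20;
  int *a;
  cudaMallocManaged(&a, count * sizeof(int));  // visible to CPU and all GPUs

  cudaSetDevice(0);
  touch<<<1, 256>>>(a, 0);  // GPU touch: pages fault and migrate to GPU 0
  cudaDeviceSynchronize();

  a[0] = 2;                 // CPU touch: page faults and migrates back

  // Prefer GPU 0 for these pages, so later CPU writes are served
  // remotely instead of migrating the pages back to the host.
  cudaMemAdvise(a, count * sizeof(int), cudaMemAdviseSetPreferredLocation, 0);
  cudaMemPrefetchAsync(a, count * sizeof(int), 0);  // bulk migration to GPU 0
  cudaDeviceSynchronize();

  printf("a[0] = %d\n", a[0]);
  cudaFree(a);
  return 0;
}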


SLIDE 16

P9 Unified Memory Performance

Coherence: ~30% of explicitly managed (cudaMemcpy) bandwidth
Prefetch: 50-80% of explicitly managed bandwidth

Limited by 1 CPU thread

github.com/rai-project/microbench


SLIDE 17

AMD64 Unified Memory Performance

Coherence: 30-70% of explicitly managed (cudaMemcpy) bandwidth
Prefetch: 50-95% of explicitly managed bandwidth

github.com/rai-project/microbench


SLIDE 18

Device Affinity

Data placement on big nodes can have a dramatic impact on communication performance.

github.com/rai-project/microbench
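
Device affinity can be queried at runtime with the CUDA peer-access API; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n;
  cudaGetDeviceCount(&n);
  for (int src = 0; src < n; ++src) {
    for (int dst = 0; dst < n; ++dst) {
      if (src == dst) continue;
      int access, rank;
      cudaDeviceCanAccessPeer(&access, src, dst);
      // Lower rank means a better-performing link, e.g. direct NVLink
      // versus crossing the inter-socket bus on a Summit-style node.
      cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, src, dst);
      printf("GPU %d -> GPU %d: peer access %s, performance rank %d\n",
             src, dst, access ? "yes" : "no", rank);
    }
  }
  return 0;
}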


SLIDE 19

MLModelScope: Neural Network Performance Data

http://ml-arc-minsky.netlify.com
(model -- machine -- framework) triples:
▪ (AlexNet -- Jetson TX-1 -- TensorFlow)
▪ (VGG19 -- AWS P2 X-large -- MXNet)
Neural-network performance primitive benchmarks

SLIDE 20

Thank You

https://cwpearson.github.io
pearson@illinois.edu

Special thanks to
▪ Professor Wen-Mei Hwu
▪ John Larson, Simon Garcia de Gonzalo, Zaid Qureshi, Mert Hidayetoglu, Abdul Dakkak, and Cheng Li (University of Illinois)
▪ Isaac Gelado (NVIDIA)
▪ Jinjun Xiong and I-Hsin Chung (IBM)
▪ The IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM Cognitive Horizon Network