Advanced CUDA: Overview of GPU Hardware


  1. Advanced CUDA: Overview of GPU Hardware
     John E. Stone
     Theoretical and Computational Biophysics Group
     Beckman Institute for Advanced Science and Technology
     University of Illinois at Urbana-Champaign
     http://www.ks.uiuc.edu/Research/gpu/
     GPGPU2: Advanced Methods for Computing with CUDA, University of Cape Town, April 2014
     NIH BTRC for Macromolecular Modeling and Bioinformatics, Beckman Institute, U. Illinois at Urbana-Champaign, http://www.ks.uiuc.edu/

  2. GPU Computing
     • Commodity devices, omnipresent in modern computers (over a million sold per week)
     • Massively parallel hardware, hundreds of processing units, throughput-oriented architecture
     • Standard integer and floating point types supported
     • Programming tools allow software to be written in dialects of familiar C/C++ and integrated into legacy software
     • GPU algorithms are often multicore friendly due to attention paid to data locality and data-parallel work decomposition

  3. Benefits of GPUs vs. Other Parallel Computing Approaches
     • Increased compute power per unit volume
     • Increased FLOPS/watt power efficiency
     • Desktop/laptop computers easily incorporate GPUs; no need to teach non-technical users how to use a remote cluster or supercomputer
     • GPUs can be upgraded without new OS license fees; low-cost hardware

  4. What Runs on a GPU?
     • GPUs run data-parallel programs called “kernels”
     • GPUs are managed by a host CPU thread (a minimal host-code sketch follows below):
       – Create a CUDA context
       – Allocate/deallocate GPU memory
       – Copy data between host and GPU memory
       – Launch GPU kernels
       – Query GPU status
       – Handle runtime errors
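The host-side steps listed on slide 4 map directly onto the CUDA runtime API. The program below is a minimal sketch, not code from the presentation: the kernel name myKernel and the problem size are illustrative assumptions, and with the runtime API the CUDA context is created implicitly on first use.

     #include <cuda_runtime.h>
     #include <stdio.h>
     #include <stdlib.h>

     // Hypothetical data-parallel kernel: doubles each element.
     __global__ void myKernel(float *data, int n) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
         data[i] *= 2.0f;
     }

     int main() {
       const int n = 1 << 20;
       size_t bytes = n * sizeof(float);
       float *h_data = (float *) malloc(bytes);
       for (int i = 0; i < n; i++) h_data[i] = 1.0f;

       float *d_data = NULL;
       cudaMalloc((void **) &d_data, bytes);                        // allocate GPU memory
       cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // copy host -> GPU

       myKernel<<<(n + 255) / 256, 256>>>(d_data, n);               // launch the kernel
       cudaError_t err = cudaGetLastError();                        // handle runtime errors
       if (err != cudaSuccess)
         printf("kernel launch failed: %s\n", cudaGetErrorString(err));

       cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // copy GPU -> host
       cudaFree(d_data);                                            // deallocate GPU memory
       free(h_data);
       return 0;
     }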

  5. What Speedups Can GPUs Achieve?
     • Single-GPU speedups of 3x to 10x vs. one multi-core CPU are very common
     • Best speedups can reach 25x or more, attained on codes dominated by floating point arithmetic, especially native GPU machine instructions, e.g. expf(), rsqrtf(), … (see the intrinsics sketch below)
     • Amdahl’s Law can prevent legacy codes from achieving peak speedups with shallow GPU acceleration efforts
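To illustrate the “native GPU machine instructions” point on slide 5: rsqrtf() and the __expf() intrinsic compile to fast special-function hardware instructions. The kernel below is a sketch; its name and arithmetic are illustrative, not taken from the presentation.

     __global__ void gaussianWeight(const float *dist2, float *w, float expo, int n) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
         float r2 = dist2[i];
         float invr = rsqrtf(r2);            // native reciprocal square root instruction
         w[i] = __expf(-expo * r2) * invr;   // fast hardware exponential intrinsic
       }
     }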

  6. CUDA GPU-Accelerated Trajectory Analysis and Visualization in VMD

     VMD GPU-Accelerated Feature or Kernel     Typical speedup vs. multi-core CPU (e.g. 4-core CPU)
     Molecular orbital display                  30x
     Radial distribution function               23x
     Molecular surface display                  15x
     Electrostatic field calculation            11x
     Ray tracing w/ shadows, AO lighting         7x
     Ion placement                               6x
     MDFF density map synthesis                  6x
     Implicit ligand sampling                    6x
     Root mean squared fluctuation               6x
     Radius of gyration                          5x
     Close contact determination                 5x
     Dipole moment calculation                   4x

  7. GPU Solution: Computing C60 Molecular Orbitals
     3-D orbital lattice: millions of points

     Device             CPUs/GPUs   Runtime (s)   Speedup
     Intel X5550-SSE        1          30.64        0.14
     Intel X5550-SSE        8           4.13        1.0
     GeForce GTX 480        1           0.255      16
     GeForce GTX 480        4           0.081      51

     Lattice slices are computed on multiple GPUs; on each GPU, CUDA thread blocks form a 2-D CUDA grid and each CUDA thread computes one lattice point.
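Slide 7’s decomposition (one CUDA thread per lattice point, thread blocks arranged in a 2-D grid per GPU) corresponds to a launch configuration like the sketch below. The names nx, ny, orbitalgrid, and evaluate_mo are illustrative assumptions, not the actual VMD code.

     // Host side: one thread per lattice point of an nx-by-ny slice.
     dim3 block(16, 16);                          // 256 threads per block
     dim3 grid((nx + block.x - 1) / block.x,
               (ny + block.y - 1) / block.y);     // 2-D grid of thread blocks covers the slice
     // Device side: each thread derives its own lattice point.
     //   int x = blockIdx.x * blockDim.x + threadIdx.x;
     //   int y = blockIdx.y * blockDim.y + threadIdx.y;
     //   if (x < nx && y < ny) orbitalgrid[y*nx + x] = evaluate_mo(x, y);   // hypothetical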

  8. Molecular Orbital Inner Loop, Hand-Coded x86 SSE
     Hard to Read, Isn’t It? (And this is the “pretty” version!)
     Writing SSE kernels for CPUs requires assembly language, compiler intrinsics, various libraries, or a really smart autovectorizing compiler and lots of luck...

     for (shell=0; shell < maxshell; shell++) {
       __m128 Cgto = _mm_setzero_ps();
       for (prim=0; prim<num_prim_per_shell[shell_counter]; prim++) {
         float exponent       = -basis_array[prim_counter    ];
         float contract_coeff =  basis_array[prim_counter + 1];
         __m128 expval = _mm_mul_ps(_mm_load_ps1(&exponent), dist2);
         __m128 ctmp = _mm_mul_ps(_mm_load_ps1(&contract_coeff), exp_ps(expval));
         Cgto = _mm_add_ps(Cgto, ctmp);
         prim_counter += 2;
       }
       __m128 tshell = _mm_setzero_ps();
       switch (shell_types[shell_counter]) {
         case S_SHELL:
           value = _mm_add_ps(value, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), Cgto));
           break;
         case P_SHELL:
           tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), xdist));
           tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), ydist));
           tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), zdist));
           value = _mm_add_ps(value, _mm_mul_ps(tshell, Cgto));
           break;

  9. Molecular Orbital Inner Loop in CUDA
     Aaaaahhhh…. The data-parallel CUDA kernel looks like normal C code for the most part….

     for (shell=0; shell < maxshell; shell++) {
       float contracted_gto = 0.0f;
       for (prim=0; prim<num_prim_per_shell[shell_counter]; prim++) {
         float exponent       = const_basis_array[prim_counter    ];
         float contract_coeff = const_basis_array[prim_counter + 1];
         contracted_gto += contract_coeff * exp2f(-exponent*dist2);
         prim_counter += 2;
       }
       float tmpshell=0;
       switch (const_shell_symmetry[shell_counter]) {
         case S_SHELL:
           value += const_wave_f[ifunc++] * contracted_gto;
           break;
         case P_SHELL:
           tmpshell += const_wave_f[ifunc++] * xdist;
           tmpshell += const_wave_f[ifunc++] * ydist;
           tmpshell += const_wave_f[ifunc++] * zdist;
           value += tmpshell * contracted_gto;
           break;

  10. HIV-1 Parallel HD Movie Rendering on Blue Waters Cray XE6/XK7
     New “TachyonL-OptiX” on XK7 vs. Tachyon on XE6: K20X GPUs yield up to eight times geometry + ray tracing speedup.
     Cray XE6: 2x Opteron 62xx CPUs (32 cores); Cray XK7: 1x Opteron 62xx CPU (16 cores) + NVIDIA Tesla K20X

     Node Type and Count          Script Load   State Load   Geometry + Ray Tracing   Total Time
     256 XE6 CPU nodes               7 s          160 s          1,374 s               1,541 s
     512 XE6 CPU nodes              13 s          211 s            808 s               1,032 s
     64 XK7 Tesla K20X GPUs          2 s           38 s            655 s                 695 s
     128 XK7 Tesla K20X GPUs         4 s           74 s            331 s                 410 s
     256 XK7 Tesla K20X GPUs         7 s          110 s            171 s                 288 s

     GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms. Stone et al. In UltraVis’13: Eighth Workshop on Ultrascale Visualization Proceedings, 2013.

  11. Peak Arithmetic Performance Trend

  12. Peak Memory Bandwidth Trend

  13. GPU PCI-Express DMA
     Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Parallel Computing, 2014 (in press). http://dx.doi.org/10.1016/j.parco.2014.03.009
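For context on the DMA topic of slide 13: PCI-Express DMA copies can overlap with kernel execution when host buffers are page-locked (pinned) and transfers are issued asynchronously in a CUDA stream. The fragment below is a generic sketch, not code from the cited paper; d_buf, bytes, grid, block, and someKernel are assumed to exist.

     float *h_buf;
     cudaMallocHost((void **) &h_buf, bytes);        // pinned host memory is DMA-capable
     cudaStream_t stream;
     cudaStreamCreate(&stream);
     cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);  // asynchronous DMA copy
     // someKernel<<<grid, block, 0, stream>>>(d_buf);  // hypothetical kernel issued in the same stream
     cudaStreamSynchronize(stream);                  // wait for the copy (and kernel) to finish
     cudaFreeHost(h_buf);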

  14. GPU: Throughput-Oriented Hardware Architecture
     • GPUs have very small on-chip caches
     • Main memory latency (several hundred clock cycles!) is tolerated through hardware multithreading – overlap memory transfer latency with execution of other work
     • When a GPU thread stalls on a memory operation, the hardware immediately switches context to a ready thread
     • Effective latency hiding requires saturating the GPU with lots of work – tens of thousands of independent work items (see the launch-configuration sketch below)
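The “tens of thousands of independent work items” guideline from slide 14 translates directly into the launch configuration. A minimal sketch, assuming a hypothetical kernel processItems and problem size N:

     int N = 1000000;                                             // e.g. one million independent items
     int threadsPerBlock = 256;
     int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // ~3,900 blocks keep the SMs supplied with ready warps
     // processItems<<<blocks, threadsPerBlock>>>(d_in, d_out, N);   // hypothetical kernel launch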

  15. Comparison of CPU and GPU Hardware Architecture
     CPU: cache heavy, focused on individual thread performance
     GPU: ALU heavy, massively parallel, throughput oriented

  16. NVIDIA GT200 Streaming Processor Array
     • Streaming Processor Array: 10 Texture Processor Clusters (TPCs), plus a 64 kB read-only constant cache
     • Texture Processor Cluster: a texture unit (1/2/3-D interpolation, 8 kB spatial cache, read-only) shared by three Streaming Multiprocessors (SMs)
     • Streaming Multiprocessor: instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 Streaming Processors (SPs: ADD, SUB, MAD, etc.), 2 Special Function Units (SFUs: SIN, EXP, RSQRT, etc.), and an FP64 unit (double precision)
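The per-device resources sketched on slide 16 (SM count, shared memory, constant memory) can be inspected at runtime with cudaGetDeviceProperties(). The short program below is a generic sketch, not part of the presentation.

     #include <cuda_runtime.h>
     #include <stdio.h>

     int main() {
       cudaDeviceProp prop;
       cudaGetDeviceProperties(&prop, 0);                         // query device 0
       printf("SMs:                  %d\n",  prop.multiProcessorCount);
       printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
       printf("Constant memory:      %zu bytes\n", prop.totalConstMem);
       printf("Warp size:            %d\n",  prop.warpSize);
       return 0;
     }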
