April 4-7, 2016 | Silicon Valley
CUDA 8 AND BEYOND
Mark Harris, April 5, 2016
INTRODUCING CUDA 8
Pascal Support: new architecture, stacked memory, NVLINK
Unified Memory: simple parallel programming with large virtual memory
Libraries: nvGRAPH, a library for accelerating graph analytics apps; FP16 computation to boost Deep Learning workloads
Developer Tools: Critical Path Analysis to speed overall app tuning; OpenACC profiling to optimize directive performance; single-GPU debugging on Pascal
Tesla P100
Pascal Architecture: highest compute performance
NVLink: GPU interconnect for maximum scalability
HBM2 Stacked Memory: unifying compute & memory in a single package
Page Migration Engine: simple parallel programming with 512 TB of virtual memory
Unified Memory in CUDA 6+: Allocate Up to GPU Memory Size
Simpler Programming & Memory Model: single allocation, single pointer, accessible anywhere; eliminates the need for explicit copies; greatly simplifies code porting
Performance Through Data Locality: migrate data to the accessing processor; guarantee global coherence; still allows explicit hand tuning
CPU Code:

void sortfile(FILE *fp, int N) {
  char *data;
  data = (char *)malloc(N);

  fread(data, 1, N, fp);

  qsort(data, N, 1, compare);

  use_data(data);

  free(data);
}

CUDA 6 Code with Unified Memory:

void sortfile(FILE *fp, int N) {
  char *data;
  cudaMallocManaged(&data, N);

  fread(data, 1, N, fp);

  qsort<<<...>>>(data, N, 1, compare);
  cudaDeviceSynchronize();

  use_data(data);

  cudaFree(data);
}
RAJA uses Unified Memory for heterogeneous array allocations, and parallel forall loops run on the device. "Excellent performance considering this is a 'generic' version of LULESH with no architecture-specific tuning."
[Chart: LULESH throughput in million elements per second vs. mesh size (45^3, 100^3, 150^3), showing GPU speedups of 1.5x, 1.9x, and 2.0x. GPU: NVIDIA Tesla K40; CPU: Intel Haswell E5-2650 v3 @ 2.30GHz, single socket, 10 cores]
Unified Memory in CUDA 8 with Pascal: Allocate Beyond GPU Memory Size
Enable large data models: oversubscribe GPU memory; allocate up to system memory size
Tune Unified Memory performance: usage hints via the cudaMemAdvise API; explicit prefetching API
Simpler data access: CPU/GPU data coherence; Unified Memory atomic operations
__global__ void setValue(char *ptr, int index, char val) {
  ptr[index] = val;
}

void foo(int size) {
  char *data;
  cudaMallocManaged(&data, size);      // Unified Memory allocation

  memset(data, 0, size);               // Access all values on the CPU

  setValue<<<...>>>(data, size/2, 5);  // Access one value on the GPU
  cudaDeviceSynchronize();

  useData(data);

  cudaFree(data);
}
[Diagram: GPU memory mapping and CPU memory mapping connected by the interconnect; touching 'array' from either processor triggers a page fault that migrates the page to the accessing processor]

CPU Code:

cudaMallocManaged(&array, size);
memset(array, 0, size);

GPU Code:

__global__ void setValue(char *ptr, int index, char val) {
  ptr[index] = val;
}

setValue<<<...>>>(array, size/2, 5);
Large Data Set
Performance relative to the GPU directly accessing host memory (zero-copy). Baseline: migrate on first touch. Optimized: best placement in memory.
void foo() {
  // Assume the GPU has 16 GB of memory; allocate 32 GB
  char *data;
  size_t size = 32ULL*1024*1024*1024;
  cudaMallocManaged(&data, size);
}

A 32 GB allocation: Pascal supports allocations where only a subset of the pages resides on the GPU, and pages can be migrated to the GPU when they become "hot". The same allocation fails on Kepler/Maxwell.
Many domains would benefit from GPU memory oversubscription:
Combustion: many species to solve for
Quantum chemistry: larger systems
Ray tracing: larger scenes to render
[Chart: Tesla K40 (12 GB) vs. Tesla P100 (16 GB)]
*Tesla P100 performance figures are very early modeling results
__global__ void mykernel(char *data) {
  data[1] = 'g';
}

void foo() {
  char *data;
  cudaMallocManaged(&data, 2);

  mykernel<<<...>>>(data);
  // No synchronize here
  data[0] = 'c';

  cudaFree(data);
}

OK on Pascal: the concurrent CPU access is just a page fault. On previous GPUs, concurrent CPU access to 'data' caused a fatal segmentation fault.
__global__ void mykernel(int *addr) {
  atomicAdd(addr, 10);             // GPU atomic
}

void foo() {
  int *addr;
  cudaMallocManaged(&addr, 4);
  *addr = 0;

  mykernel<<<...>>>(addr);

  __sync_fetch_and_add(addr, 10);  // concurrent CPU atomic
}

System-wide atomics are not available on Kepler/Maxwell; Pascal enables system-wide atomic operations.
Advise the runtime on known memory access behaviors with cudaMemAdvise():
cudaMemAdviseSetReadMostly: specify read duplication
cudaMemAdviseSetPreferredLocation: suggest the best location
cudaMemAdviseSetAccessedBy: initialize a mapping
Explicit prefetching with cudaMemPrefetchAsync(ptr, length, destDevice, stream):
A Unified Memory alternative to cudaMemcpyAsync
An asynchronous operation that follows CUDA stream semantics
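A minimal sketch (not from the original slides) of how these hints and prefetches might be combined; scale_kernel, the launch configuration, and the buffer names are illustrative placeholders, and both buffers are assumed to come from cudaMallocManaged:

__global__ void scale_kernel(const float *in, float *out, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * in[i];
}

void tuned_scale(const float *in, float *out, size_t n) {
  int device = 0;
  cudaGetDevice(&device);

  // 'in' is only read by CPU and GPU: ask for read duplication
  cudaMemAdvise(in, n * sizeof(float), cudaMemAdviseSetReadMostly, device);

  // Prefer to keep 'out' resident on the GPU, where it is produced
  cudaMemAdvise(out, n * sizeof(float), cudaMemAdviseSetPreferredLocation, device);

  // Prefetch the input to the GPU ahead of the kernel to avoid page faults
  cudaMemPrefetchAsync(in, n * sizeof(float), device, 0);
  scale_kernel<<<(n + 255) / 256, 256>>>(in, out, n);

  // Prefetch the output back before the CPU reads the results
  cudaMemPrefetchAsync(out, n * sizeof(float), cudaCpuDeviceId, 0);
  cudaDeviceSynchronize();
}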
To Learn More: S6216 “The Future of Unified Memory” by Nikolay Sakharnykh Tuesday, 4pm
Social Network Analysis, Cyber Security / Network Analytics, Genomics
… and much more: Parallel Computing, Recommender Systems, Fraud Detection, Voice Recognition, Text Understanding, Search
nvGRAPH: process graphs with up to 2.5 billion edges on a single GPU (24 GB M40)
Accelerates a wide range of applications:
PageRank: Search, Recommendation Engines, Social Ad Placement
Single Source Shortest Path: Robotic Path Planning, Power Network Planning, Logistics & Supply Chain Planning
Single Source Widest Path: IP Routing, Chip Design / EDA, Traffic-sensitive Routing
[Chart: PageRank on a Wikipedia dataset with 84 M links, iterations per second: 48-core Xeon E5 vs. nvGRAPH on Tesla K40]
developer.nvidia.com/nvgraph
The longest-running kernel is not always the most critical optimization target.
[Timeline diagram: CPU functions A and B with waits, GPU Kernel X (5%) and Kernel Y (40%); "Optimize Here" marks the kernel on the critical path]
[Visual Profiler screenshot: Unguided Analysis running Dependency Analysis ("Generating critical path") and listing the functions on the critical path]
APIs and GPU activities that are not on the critical path are greyed out
More profiling features in CUDA 8: NVLink topology and bandwidth profiling, Unified Memory profiling, CPU profiling, OpenACC profiling
[Chart: speedup over CUDA 7.5, 0.0x to 2.5x, for open-source benchmarks (SHOC, Thrust examples, Rodinia, QUDA: 1.54x) and internal benchmarks (cuDNN, cuSparse, cuFFT, cuBLAS, cuRand, math)]
Performance may vary based on OS, software versions, and motherboard configuration
Experimental feature in CUDA 8.
`nvcc --expt-extended-lambda`
template <typename F, typename T>
__global__ void apply(F function, T *ptr) {
  // Call the lambda from device code
  *ptr = function(*ptr);
}

int main(void) {
  float *x;
  cudaMallocManaged(&x, 2 * sizeof(float));

  // __host__ __device__ lambda
  auto square = [=] __host__ __device__ (float x) { return x * x; };

  // Pass the lambda to a CUDA kernel ...
  apply<<<1, 1>>>(square, &x[0]);
  cudaDeviceSynchronize();

  // ... or call it from host code
  x[1] = square(x[1]);

  cudaFree(x);
  return 0;
}
void saxpy(float *x, float *y, float a, int N) {
  using namespace thrust;
  auto r = counting_iterator<int>(0);

  // __host__ __device__ lambda
  auto lambda = [=] __host__ __device__ (int i) {
    y[i] = a * x[i] + y[i];
  };

  // Use the lambda in thrust::for_each on either the device or the host
  if (N > gpuThreshold)
    for_each(device, r, r + N, lambda);
  else
    for_each(host, r, r + N, lambda);
}
Removes CUDA-specific allocator restrictions
Data movement is handled transparently
Requires operating system support

CUDA 8 Code with System Allocator:

void sortfile(FILE *fp, int N) {
  char *data;
  // Allocate memory using any standard allocator
  data = (char *)malloc(N * sizeof(char));

  fread(data, 1, N, fp);

  qsort<<<...>>>(data, N, 1, compare);

  use_data(data);

  // Free the allocated memory
  free(data);
}
Support clean composition across software boundaries (e.g. libraries)
Optimize for the hardware fast path using safe, flexible synchronization
A programming model that can scale from Kepler to future platforms
Thread groups are explicit objects in the program
Collectives, such as barriers, operate on thread groups
New groups are constructed by partitioning existing groups

thread_group group = this_thread_block();
sync(group);

thread_group tiled_partition(thread_group base, int size);
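A hedged sketch, using only the names shown above (this_thread_block, tiled_partition, sync), of how a kernel might partition its thread block into 32-thread tiles and synchronize each tile independently; the API was still evolving at the time of this talk:

__global__ void tiled_example(int *data) {
  // The group of all threads in this thread block
  thread_group block = this_thread_block();

  // Partition the block into tiles of 32 threads each
  thread_group tile = tiled_partition(block, 32);

  // ... cooperative work on 'data' within the tile ...

  sync(tile);   // barrier across the 32 threads of this tile only
  sync(block);  // barrier across the whole thread block
}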
__device__ int warp_reduce(int val) {
  extern __shared__ int smem[];
  const int tid = threadIdx.x;

  #pragma unroll
  for (int i = warpSize/2; i > 0; i /= 2) {
    smem[tid] = val;
    __syncthreads();
    val += smem[tid ^ i];
    __syncthreads();
  }
  return val;
}

__syncthreads() is too expensive when sharing is only within warps
__device__ int warp_reduce(int val) {
  extern __shared__ int smem[];
  const int tid = threadIdx.x;

  #pragma unroll
  for (int i = warpSize/2; i > 0; i /= 2) {
    smem[tid] = val;
    val += smem[tid ^ i];
  }
  return val;
}

Barriers separating the steps have been removed. UNSAFE!
Approximately equal performance to unsafe warp programming
__device__ int warp_reduce(int val) {
  extern __shared__ int smem[];
  const int tid = threadIdx.x;

  #pragma unroll
  for (int i = warpSize/2; i > 0; i /= 2) {
    smem[tid] = val;
    sync(this_warp());
    val += smem[tid ^ i];
    sync(this_warp());
  }
  return val;
}

Safe and fast!
Provide a new launch mechanism for multi-block groups
Cooperative Groups collective operations such as sync(group) work across all threads in the group
Save bandwidth and latency compared to the multi-kernel approach required on Kepler GPUs
[Chart: normal __syncthreads() vs. multi-block sync]
April 4-7, 2016 | Silicon Valley
mharris@nvidia.com @harrism