ACCELERATING APPLICATIONS WITH CUDA C/C++
Pedro Mario Cruz e Silva, Solutions Architect Manager
ELEVEN YEARS OF GPU COMPUTING

2006 - CUDA Launched
2008 - World's First GPU Top500 System
2010 - Fermi: World's First HPC GPU
2012 - AlexNet beats expert code by huge margin using GPUs; Oak Ridge Deploys World's Fastest Supercomputer w/ GPUs; Discovered How H1N1 Mutates to Resist Drugs; World's First Atomic Model of HIV Capsid
2014 - Stanford Builds AI Machine using GPUs; World's First 3-D Mapping of Human Genome; Google Outperforms Humans in ImageNet
2017 - GPU-Trained AI Machine Beats World Champion in Go; Top 13 Greenest Supercomputers Powered by NVIDIA GPUs
“SCALABILITY OF CPU AND GPU SOLUTIONS OF THE PRIME ELLIPTIC CURVE DISCRETE LOGARITHM PROBLEM”
Visit speed (10^6) by platform:
- STI PS3: 25.99
- K40 + CUDA 8.0: 29.77
- P100 + CUDA 8.0: 77.84
- V100 + CUDA 9.0: 197.33
Jairo Panetta (ITA), Paulo Souza (ITA), Luiz Laranjeira (UnB), Carlos Teixeira Jr (UnB)
GPU PROGRAMMING
HOW GPU ACCELERATION WORKS
Application code is split between the two processors: the few compute-intensive functions (often on the order of 5% of the code) run on the GPU, while the rest of the sequential code runs on the CPU.
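The split can be sketched with a SAXPY loop (a hypothetical example, not from the slides): only the compute-intensive function moves to the GPU, and the rest of the program remains ordinary sequential CPU code.

```cuda
// CPU version: one thread walks the whole array sequentially.
void saxpy_cpu(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// GPU version: the loop body becomes a kernel; each of the many GPU
// threads handles one element in parallel.
__global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard: the grid may be larger than n
        y[i] = a * x[i] + y[i];
}
```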
3 WAYS TO ACCELERATE APPLICATIONS
Three paths, in increasing order of effort:
- Libraries: "drop-in" acceleration
- OpenACC Directives: easily accelerate applications
- Programming Languages: maximum flexibility
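As a sketch of the "drop-in library" path, a CPU BLAS call can be swapped for its cuBLAS equivalent; the application writes no kernel code. This example is an assumption (cuBLAS is not shown in the slides), but the calls used are the standard cuBLAS v2 API.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// y = a*x + y computed on the GPU by the cuBLAS library.
void saxpy_with_cublas(int n, float a, const float *x, float *y) {
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);  // the accelerated call
    cublasDestroy(handle);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}
```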
THE BASICS
Heterogeneous Computing
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
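Because host and device have separate memories, data must be allocated on the device and copied across explicitly. A minimal sketch using the CUDA runtime API:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    // Host (CPU) memory
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Device (GPU) memory: a separate address space
    float *d_data;
    cudaMalloc(&d_data, bytes);

    // Explicit copies move data between the two memories
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that operate on d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```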
ACCELERATING C/C++ CODE WITH CUDA ON GPUS
V100 ARCHITECTURE
TESLA V100
The Fastest and Most Productive GPU for AI and HPC
- Volta Architecture: most productive GPU
- Tensor Core: 125 programmable TFLOPS for deep learning
- Improved SIMT Model: new algorithms
- Volta MPS: improved inference utilization
- Improved NVLink & HBM2: efficient bandwidth
THREAD HIERARCHY
Grid, Block & Threads
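Inside a kernel, a thread's position in the hierarchy (grid of blocks, blocks of threads) combines into a unique global index; a minimal sketch:

```cuda
// Each thread computes its unique global index from the built-in
// variables: which block it is in, how big a block is, and which
// thread it is within that block.
__global__ void whoami(int *out, int n) {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n)
        out[global_id] = global_id;   // one element per thread
}

// Launch: a grid of 4 blocks x 256 threads = 1024 threads total.
// whoami<<<4, 256>>>(d_out, 1024);
```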
TESLA V100
- 21B transistors, 815 mm² die
- 80 SMs* with 5120 CUDA Cores and 640 Tensor Cores
- 16 GB HBM2 at 900 GB/s
- 300 GB/s NVLink
*full GV100 chip contains 84 SMs
VOLTA GV100 SM
Per GV100 SM:
- FP32 units: 64
- FP64 units: 32
- INT32 units: 64
- Tensor Cores: 8
- Register file: 256 KB
- Unified L1/shared memory: 128 KB
- Active threads: 2048
NEW TENSOR CORE
- New CUDA TensorOp instructions & data formats
- 4x4 matrix processing array
- D[FP32] = A[FP16] * B[FP16] + C[FP32]
- Optimized for deep learning
(Figure: activation inputs and weight inputs feed the array, producing output results.)
TENSOR CORE
4x4x4 matrix multiply and accumulate
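In CUDA 9, Tensor Cores are exposed through the warp-level WMMA API, which operates on 16x16x16 tiles (the hardware performs the 4x4x4 multiply-accumulates underneath). A sketch, assuming 16x16 row-major inputs with leading dimension 16:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes a 16x16x16 tile of D = A*B + C,
// with FP16 inputs and FP32 accumulation, on the Tensor Cores.
__global__ void wmma_tile(const half *a, const half *b,
                          const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                     // FP16 inputs
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major); // FP32 addend

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);            // D = A*B + C

    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
```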
CONCEPTS
- __global__ - tells the CUDA compiler that the function is to be compiled for the GPU, and is callable from both the host and the GPU itself. For CUDA C/C++, the nvcc compiler handles compiling this code.
- blockIdx.x - a read-only built-in variable. Used within a GPU kernel to determine the ID of the block that is currently executing. Since many blocks run in parallel, this ID determines which chunk of data a particular block works on.
- threadIdx.x - a read-only built-in variable. Used within a GPU kernel to determine the ID of the thread currently executing within the active block.
- blockDim.x - a read-only built-in variable. Returns the number of threads per block. Remember that all blocks scheduled to execute on the GPU are identical, except for their blockIdx.x value.
- myKernel<<<number_of_blocks, threads_per_block>>>(...) - the syntax used to launch a kernel on the GPU. Inside the triple angle brackets we set two values: the first is the total number of blocks to run on the GPU, and the second is the number of threads per block. It is possible, and in fact recommended, to schedule more blocks than the GPU can actively run in parallel; in that case the system simply continues executing blocks until all have run.
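The terms above come together in a complete vector-add program; a minimal sketch (unified memory via cudaMallocManaged is used here only to keep the example short):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// __global__: compiled for the GPU by nvcc, launched from the host.
__global__ void add_vectors(const float *a, const float *b, float *c, int n) {
    // blockIdx.x, blockDim.x, and threadIdx.x combine into a unique
    // global index, so each thread works on its own element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // more threads may exist than elements
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // <<<number_of_blocks, threads_per_block>>> launch configuration;
    // round the block count up so every element gets a thread.
    int threads_per_block = 256;
    int number_of_blocks = (n + threads_per_block - 1) / threads_per_block;
    add_vectors<<<number_of_blocks, threads_per_block>>>(a, b, c, n);

    cudaDeviceSynchronize();        // wait for the kernel to finish
    printf("c[0] = %f\n", c[0]);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```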
NVIDIA DEEP LEARNING INSTITUTE
Hands-on self-paced and instructor-led training in deep learning and accelerated computing for developers.
- Request onsite instructor-led workshops at your organization: www.nvidia.com/requestdli
- Take self-paced labs online: www.nvidia.com/dlilabs
- Download the course catalog, view upcoming workshops, and learn about the University Ambassador Program: www.nvidia.com/dli
Course areas: Deep Learning Fundamentals, Game Development & Digital Content, Finance, Intelligent Video Analytics, Medical Image Analysis, Autonomous Vehicles, Accelerated Computing Fundamentals, Genomics - more industry-specific training coming soon.
developer.nvidia.com
NVIDIA HW GRANT PROGRAM
- Titan V (Volta): Scientific Computing, HPC, Deep Learning
- Jetson TX2 (Dev Kit): Robotics, Autonomous Machines
- Quadro P6000: Scientific Visualization, Virtual Reality

Apply at: https://developer.nvidia.com/academic_gpu_seeding
INCEPTION PROGRAM
http://www.nvidia.com/object/inception-program.html