ACCELERATING APPLICATIONS WITH CUDA C/C++ - Pedro Mario Cruz e Silva, Solutions Architect Manager - PowerPoint PPT Presentation



SLIDE 1

ACCELERATING APPLICATIONS WITH CUDA C/C++

Pedro Mario Cruz e Silva, Solutions Architect Manager

SLIDE 2

ELEVEN YEARS OF GPU COMPUTING

Milestones from 2006 to 2017:

  • CUDA Launched
  • World’s First GPU Top500 System
  • Fermi: World’s First HPC GPU
  • Discovered How H1N1 Mutates to Resist Drugs
  • Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs
  • AlexNet beats expert code by huge margin using GPUs
  • Stanford Builds AI Machine using GPUs
  • World’s First Atomic Model of HIV Capsid
  • World’s First 3-D Mapping of Human Genome
  • Google Outperforms Humans in ImageNet
  • GPU-Trained AI Machine Beats World Champion in Go
  • Top 13 Greenest Supercomputers Powered by NVIDIA GPUs

SLIDE 3

“SCALABILITY OF CPU AND GPU SOLUTIONS OF THE PRIME ELLIPTIC CURVE DISCRETE LOGARITHM PROBLEM”

Jairo Panetta (ITA), Paulo Souza (ITA), Luiz Laranjeira (UnB), Carlos Teixeira Jr (UnB)

Visit speed (×10⁶), by platform:

  • 1 STI PS3: 25.99
  • K40 + CUDA 8.0: 29.77
  • P100 + CUDA 8.0: 77.84
  • V100 + CUDA 9.0: 197.33

SLIDE 4

GPU PROGRAMMING

SLIDE 5

HOW GPU ACCELERATION WORKS

Application code runs on both processors: the compute-intensive functions (roughly 5% of the code) are offloaded to the GPU, while the rest of the sequential code runs on the CPU.

SLIDE 6

3 WAYS TO ACCELERATE APPLICATIONS

  • Libraries: “Drop-in” Acceleration
  • OpenACC Directives: Easily Accelerate Applications
  • Programming Languages: Maximum Flexibility

SLIDE 7

THE BASICS

Heterogeneous Computing

  • Host: The CPU and its memory (host memory)
  • Device: The GPU and its memory (device memory)
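The host/device split shows up directly in the memory API: host memory is allocated with malloc, device memory with cudaMalloc, and data moves explicitly between the two. A minimal sketch using the standard CUDA runtime calls; the array names are illustrative, not from the slides:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 1024;
    size_t bytes = N * sizeof(float);

    // Host (CPU) memory
    float *h_a = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) h_a[i] = (float)i;

    // Device (GPU) memory
    float *d_a = nullptr;
    cudaMalloc(&d_a, bytes);

    // Copy host -> device, then device -> host
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    printf("h_a[42] = %.1f\n", h_a[42]);

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```

Compile with nvcc; the h_/d_ prefix is a common convention for telling host and device pointers apart.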

SLIDE 8

ACCELERATING C/C++ CODE WITH CUDA ON GPUS

SLIDE 9

SLIDE 10

V100 ARCHITECTURE

SLIDE 11

TESLA V100

The Fastest and Most Productive GPU for AI and HPC

  • Volta Architecture: Most Productive GPU
  • Tensor Core: 125 Programmable TFLOPS for Deep Learning
  • Improved SIMT Model: New Algorithms
  • Volta MPS: Inference Utilization
  • Improved NVLink & HBM2: Efficient Bandwidth

SLIDE 12

THREAD HIERARCHY

Grid, Block & Threads
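A grid is made of blocks, and each block is made of threads; both sizes are set at kernel launch, in up to three dimensions via dim3. A minimal sketch (the kernel name and dimensions are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void kernel() {
    // Each thread sees its own coordinates via
    // blockIdx, threadIdx, blockDim, and gridDim.
}

int main() {
    dim3 grid(4, 2);    // grid of 4x2 blocks
    dim3 block(8, 8);   // each block holds 8x8 threads
    kernel<<<grid, block>>>();  // 8 blocks * 64 threads = 512 threads total
    cudaDeviceSynchronize();    // wait for the GPU to finish
    return 0;
}
```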

SLIDE 13

TESLA V100

  • 21B transistors, 815 mm²
  • 80 SMs*, 5120 CUDA Cores, 640 Tensor Cores
  • 16 GB HBM2, 900 GB/s
  • 300 GB/s NVLink

*full GV100 chip contains 84 SMs

SLIDE 14

VOLTA GV100 SM

Per SM (GV100):

  • FP32 units: 64
  • FP64 units: 32
  • INT32 units: 64
  • Tensor Cores: 8
  • Register File: 256 KB
  • Unified L1/Shared memory: 128 KB
  • Active Threads: 2048

SLIDE 15

NEW TENSOR CORE

  • New CUDA TensorOp instructions & data formats
  • 4x4 matrix processing array: D[FP32] = A[FP16] * B[FP16] + C[FP32]
  • Optimized for deep learning

Inputs are activations and weights; output results are accumulated in FP32.

SLIDE 16

TENSOR CORE

4x4x4 matrix multiply and accumulate
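The operation D = A*B + C on 4x4 matrices can be written out as a scalar reference. This is only a sketch of the arithmetic a Tensor Core performs in a single operation; on hardware A and B are FP16 and the accumulation is FP32, while plain floats stand in for both here:

```cuda
// Scalar reference for the Tensor Core op D = A*B + C (4x4x4 MMA).
void mma_4x4x4(const float A[4][4], const float B[4][4],
               const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];           // FP32 accumulator
            for (int k = 0; k < 4; ++k)    // 4-deep dot product
                acc += A[i][k] * B[k][j];  // FP16 multiply on hardware
            D[i][j] = acc;
        }
}
```

In real CUDA code this operation is reached through libraries such as cuBLAS or the warp-level WMMA API rather than written by hand.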

SLIDE 17

CONCEPTS

  • __global__ - keyword telling the CUDA compiler that the function is to be compiled for the GPU and is callable from both the host and the GPU itself. For CUDA C/C++, the nvcc compiler handles compiling this code.
  • blockIdx.x - read-only variable that is defined for you. Used within a GPU kernel to determine the ID of the block currently executing. Since many blocks run in parallel, this ID determines which chunk of data that particular block works on.
  • threadIdx.x - read-only variable that is defined for you. Used within a GPU kernel to determine the ID of the thread currently executing in the active block.
  • blockDim.x - read-only variable that is defined for you. Returns the number of threads per block. Remember that all blocks scheduled to execute on the GPU are identical, except for their blockIdx.x value.
  • myKernel <<< number_of_blocks, threads_per_block >>> (...) - syntax used to launch a kernel on the GPU. Inside the triple angle brackets we set two values: the first is the total number of blocks to run on the GPU, and the second is the number of threads per block. It is possible, and in fact recommended, to schedule more blocks than the GPU can actively run in parallel; in that case the system simply continues executing blocks until all have run.
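The pieces above fit together as in this minimal sketch; the kernel name and data are illustrative, and managed memory (cudaMallocManaged) is used here so the host can read the result without an explicit copy:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread computes its global index and doubles one element.
__global__ void doubleElements(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)          // guard: the grid may cover more threads than elements
        a[i] *= 2;
}

int main() {
    const int N = 1000;
    int *a;
    cudaMallocManaged(&a, N * sizeof(int));  // visible to host and device
    for (int i = 0; i < N; ++i) a[i] = i;

    int threads_per_block = 256;
    // Round up so every element gets a thread
    int number_of_blocks = (N + threads_per_block - 1) / threads_per_block;
    doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
    cudaDeviceSynchronize();  // wait for the GPU before reading results

    printf("a[10] = %d\n", a[10]);  // 20
    cudaFree(a);
    return 0;
}
```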

SLIDE 18

NVIDIA DEEP LEARNING INSTITUTE

Hands-on self-paced and instructor-led training in deep learning and accelerated computing for developers.

  • Request onsite instructor-led workshops at your organization: www.nvidia.com/requestdli
  • Take self-paced labs online: www.nvidia.com/dlilabs
  • Download the course catalog, view upcoming workshops, and learn about the University Ambassador Program: www.nvidia.com/dli

Training areas: Deep Learning Fundamentals, Accelerated Computing Fundamentals, Game Development & Digital Content, Finance, Intelligent Video Analytics, Medical Image Analysis, Autonomous Vehicles, Genomics; more industry-specific training coming soon…

SLIDE 19

developer.nvidia.com

SLIDE 20

developer.nvidia.com

SLIDE 21

NVIDIA HW GRANT PROGRAM

  • Titan V (Volta): Scientific Computing, HPC, Deep Learning
  • Jetson TX2 (Dev Kit): Robotics, Autonomous Machines
  • Quadro P6000: Scientific Visualization, Virtual Reality

https://developer.nvidia.com/academic_gpu_seeding

SLIDE 22

INCEPTION PROGRAM

http://www.nvidia.com/object/inception-program.html

SLIDE 23