Mark Harris, NVIDIA
EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA
@harrism #NVSC14
EXTENDING THE REACH OF CUDA
1 Machine Learning
2 Higher Performance
3 New Platforms
4 New Languages

GPUS: THE HOT MACHINE LEARNING PLATFORM
1.2M training images • 1000 object categories
Example object categories: person, car, helmet, motorcycle, bird, frog, dog, chair, hammer, flower pot, power drill

GPU Entries and Classification Error Rates by year:

Year          2010   2011   2012   2013   2014
Error rate     28%    26%    16%    12%     7%
GPU entries      -      -      4     60    110
Machine learning frameworks: Caffe, Torch7, Theano

Baseline Caffe compared to Caffe accelerated by cuDNN on K40:
Caffe (CPU*)    1x
Caffe (GPU)    11x
Caffe (cuDNN)  14x
*CPU: 24-core E5-2697v2 @ 2.4 GHz, Intel MKL 11.1.3
NVLink roadmap:
2014: KEPLER GPU connects to POWER, x86, or ARM64 CPUs over PCIe.
2016: PASCAL GPU adds NVLink to the CPU.
GPUs Interconnected with NVLink: Over 2x Application Performance Speedup
When Next-Gen GPUs Connect via NVLink Versus PCIe

Speedup vs PCIe-based server (1.00x to 2.25x range shown) across: ANSYS Fluent, Multi-GPU Sort, LQCD (QUDA), AMBER, 3D FFT.
NVLink between TESLA GPUs: 5x faster than PCIe Gen3 x16 (CPU attached via PCIe switch).

3D FFT, ANSYS: 2-GPU configuration; all other apps: 4-GPU configuration. AMBER Cellulose (256x128x128), FFT problem size 256^3.
Unified Memory

NVLink: 80 GB/s, 5x faster than PCIe Gen3 x16, connecting TESLA GPUs and the CPU (replacing the PCIe switch).
Share data structures at CPU memory speeds, not PCIe speeds. Eliminate multi-GPU scaling bottlenecks.
CUDA platform software solutions:
- GPU Accelerators: GPU Boost, …
- Interconnect: GPU Direct, NVLink, …
- System Management: NVML, …
- Compiler Solutions: LLVM, …
- Profile and Debug: CUDA Debugging API, …
- Libraries: cuBLAS, …

Spanning: Development Tools, Programming Languages, Infrastructure Management, Communication, System Solutions.
Platform diagram: AmgX and cuBLAS libraries on x86.
for_each, sort, reduce, scan, etc.
Accepted as an official technical specification working draft.
N3960 Technical Specification Working Draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype:
https://github.com/n3554/n3554
std::vector<int> vec = ...

// previous standard sequential loop
std::for_each(vec.begin(), vec.end(), f);

// explicitly sequential loop
std::for_each(std::seq, vec.begin(), vec.end(), f);

// permitting parallel execution
std::for_each(std::par, vec.begin(), vec.end(), f);
OpenACC in GCC: effort by Mentor Embedded.
Free to all Linux users.
GCC: the most widely used HPC compiler.

"Incorporating OpenACC into GCC is an excellent example of open source and … accessible to all Linux developers."
— Oscar Hernandez, Oak Ridge National Laboratory
Python interfaces to CUDA libraries
@cuda.jit("void(float32[:], float32, float32[:], float32[:])")
def saxpy(out, a, x, y):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a * x[i] + y[i]

# Launch saxpy kernel
saxpy[100, 512](out, a, x, y)
Accelerate Pure Java
Accelerate Java SE Libraries with CUDA: java.util.Arrays.sort(int[] a)
CUDA C++ Programming Via Java APIs
IBM Developer Kits for Java: ibm.com/java/jdk
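For the "pure Java" path, the entry point is the stock JDK API; nothing CUDA-specific appears in user code. A minimal sketch (class name and data are illustrative):

```java
import java.util.Arrays;

public class SortDemo {
    public static void main(String[] args) {
        int[] a = {5, 3, 1, 4, 2};
        // The API named on the slide: a GPU-enabled JVM (such as IBM's
        // developer kit) may offload this sort for large arrays, while the
        // call site stays unchanged on any standard JVM.
        Arrays.sort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5]
    }
}
```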
void add(int[] a, int[] b, int[] c) throws CudaException, IOException {
    CudaDevice device = new CudaDevice(0);
    CudaModule module = new CudaModule(device,
        getClass().getResourceAsStream("ArrayAdder"));
    CudaKernel kernel = new CudaKernel(module, "Cuda_cuda4j_samples_adder");
    CudaGrid grid = new CudaGrid(512, 512);
    try (CudaBuffer aBuffer = new CudaBuffer(device, a.length * 4);
         CudaBuffer bBuffer = new CudaBuffer(device, b.length * 4)) {
        aBuffer.copyFrom(a, 0, a.length);
        bBuffer.copyFrom(b, 0, b.length);
        kernel.launch(grid, new CudaKernel.Parameters(aBuffer, bBuffer, a.length));
        aBuffer.copyTo(c, 0, a.length);  // result is read back from aBuffer
    }
}

- Manage CUDA devices and kernels
- Easily transfer data between the Java heap and a CUDA device
- Simple, flexible kernel launch
IntStream.range(0, N).parallel().forEach(i -> c[i] = a[i] + b[i]);
- Standard Java idioms, so no code changes required
- No knowledge of GPU programming model required
- No low-level device manipulation – the Java implementation has the controls
- Future JIT smarts do not require application code changes
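A self-contained sketch of the same idiom (class and method names are illustrative): the parallel stream expresses the data-parallel loop that a GPU-aware JIT could offload, and it runs as-is on CPU cores on any standard JVM.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class VectorAdd {
    // Element-wise c = a + b using the slide's parallel-stream idiom.
    static int[] add(int[] a, int[] b) {
        int[] c = new int[a.length];
        IntStream.range(0, a.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
        return c;
    }

    public static void main(String[] args) {
        int[] c = add(new int[]{1, 2, 3}, new int[]{10, 20, 30});
        System.out.println(Arrays.toString(c)); // [11, 22, 33]
    }
}
```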
Speed-up factor when run on a GPU-enabled host: IBM POWER8 with NVIDIA K40m GPU.
public void multiply() {
    IntStream.range(0, COLS * COLS).parallel().forEach(id -> {
        int i = id / COLS;
        int j = id % COLS;
        int sum = 0;
        for (int k = 0; k < COLS; k++) {
            sum += left[i*COLS + k] * right[k*COLS + j];
        }
        result[i*COLS + j] = sum;  // write-back (the slide truncates here; result array name assumed)
    });
}
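A self-contained version of the kernel above for reference: the left, right, and COLS fields become parameters, and the result array with its final store is an assumption, since the slide truncates before the write-back.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class MatMul {
    // Row-major cols x cols multiply, parallelized over output cells,
    // mirroring the slide's IntStream kernel.
    static int[] multiply(int[] left, int[] right, int cols) {
        int[] result = new int[cols * cols];
        IntStream.range(0, cols * cols).parallel().forEach(id -> {
            int i = id / cols;   // output row
            int j = id % cols;   // output column
            int sum = 0;
            for (int k = 0; k < cols; k++) {
                sum += left[i * cols + k] * right[k * cols + j];
            }
            result[i * cols + j] = sum;  // assumed write-back
        });
        return result;
    }

    public static void main(String[] args) {
        // [1 2; 3 4] x [5 6; 7 8] = [19 22; 43 50]
        System.out.println(Arrays.toString(
            multiply(new int[]{1, 2, 3, 4}, new int[]{5, 6, 7, 8}, 2)));
    }
}
```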