High-Performance GPU Programming for Deep Learning
7 April 2016 Scott Gray
Nervana Systems
High-Performance GPU kernels for deep learning
Fast matrix multiply for small minibatches
Direct convolution
Proprietary and confidential. Do not distribute.
[Figure: outer product with contiguous loads vs. outer product with strided loads. Labels: threads, memory load, single tile, batched GEMM]
[Figure: batched GEMM tiles 32 x 32, GEMM tile 32 x 64, GEMM tile 32 x 32. Labels: threads, shared memory load]
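The tiling idea in the figures above can be sketched in plain Python. This is only an illustration, assuming each output tile is owned by one thread block and is accumulated as a sum of outer products along the shared dimension. It is not the Nervana kernel, and the names `gemm_tiled` and `gemm_reference` are hypothetical.

```python
def gemm_tiled(A, B, tile=32):
    """Compute C = A x B one output tile at a time.

    Each (tile x tile) block of C is built from outer products of a
    column strip of A and a row strip of B, one per step along k.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # one "thread block" per output tile
        for j0 in range(0, n, tile):
            i1, j1 = min(i0 + tile, m), min(j0 + tile, n)
            for p in range(k):              # walk the shared dimension
                a_col = [A[i][p] for i in range(i0, i1)]   # strip of A
                b_row = B[p][j0:j1]                        # strip of B
                for di, a in enumerate(a_col):
                    row = C[i0 + di]
                    for dj, b in enumerate(b_row):
                        row[j0 + dj] += a * b              # outer-product update
    return C


def gemm_reference(A, B):
    """Plain row-by-column matrix multiply, used only as a check."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
```

With a small minibatch the output has few rows, so a smaller tile such as 32 x 32 leaves fewer idle lanes than a 128 x 64 tile. The sketch shows only the decomposition, not the memory layout or performance behaviour.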
[Chart: Nx3072x3072 NN op. GFLOPS (axis 1500 to 6000) vs. batch size N (32, 64, 96, 128). Series: Nervana 32x32, cuBLAS 128x64]
[Chart: Nx3072x3072 TN op. GFLOPS (axis 1500 to 6000) vs. batch size N (32, 64, 96, 128). Series: Nervana 32x32, cuBLAS 128x64]
Input Feature Map 4x4 stride 2
product of 1D transforms
simplified to remove zeros
different coefficients
independently
Output Feature Map
space to obtain 2x2 output tile
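The transform steps named on these slides (a 4x4 input tile taken at stride 2, 2D transforms built as a product of 1D transforms, and a 2x2 output tile) can be illustrated with the standard Winograd F(2x2, 3x3) matrices. The 3x3 filter size is an assumption, since the slides do not state it. The function names are hypothetical. This is a plain-Python sketch for checking the arithmetic, not the GPU implementation.

```python
# Standard F(2x2, 3x3) transform matrices (assumed; not given on the slides).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(r) for r in zip(*X)]


def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter. Returns the 2x2 output tile."""
    U = matmul(matmul(G, g), transpose(G))      # filter transform, 4x4
    V = matmul(matmul(BT, d), transpose(BT))    # input transform, 4x4
    M = [[u * v for u, v in zip(ur, vr)] for ur, vr in zip(U, V)]  # elementwise
    return matmul(matmul(AT, M), transpose(AT)) # output transform, 2x2


def direct_2x2_3x3(d, g):
    """Direct 3x3 correlation over the same 4x4 tile, used as a check."""
    return [[sum(d[i + r][j + s] * g[r][s] for r in range(3) for s in range(3))
             for j in range(2)] for i in range(2)]


def winograd_conv(image, g):
    """Slide overlapping 4x4 tiles at stride 2 over an image.

    Assumes the output height and width (H - 2, W - 2) are even.
    """
    H, W = len(image), len(image[0])
    out = [[0.0] * (W - 2) for _ in range(H - 2)]
    for i in range(0, H - 3, 2):
        for j in range(0, W - 3, 2):
            tile = [row[j:j + 4] for row in image[i:i + 4]]
            y = winograd_2x2_3x3(tile, g)
            for a in range(2):
                for b in range(2):
                    out[i + a][j + b] = y[a][b]
    return out
```

The elementwise product uses 16 multiplies per 2x2 output tile, where the direct method uses 36. This is where the algorithmic speedup reported later in the deck comes from.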
[Chart: VGG fp32, totals by operation. Algorithmic speedup (axis 0.5 to 2) vs. batch size (64, 32, 16, 8, 4, 2, 1). Series: Winograd fp32 fprop, Winograd fp32 bprop, Winograd fp32 update, cuDNN fp32 fprop, cuDNN fp32 bprop, cuDNN fp32 update]
[Chart: AlexNet totals. Algorithmic speedup (axis 0.5 to 2) vs. batch size (128, 64, 32, 16, 8, 4). Series: Nervana fp16, Nervana fp32, cuBLAS fp16, cuBLAS fp32]
[Chart: VGG. Algorithmic speedup (axis 0.5 to 2) vs. batch size (64, 32, 16, 8, 4, 2, 1). Series: Winograd fp16, Winograd fp32, cuDNN fp16, cuDNN fp32]
[Chart: GoogLeNetv2 totals. Algorithmic speedup (axis 0.5 to 2) vs. batch size (64, 32, 16, 8, 4, 2, 1). Series: Winograd fp16, Winograd fp32, cuDNN fp16, cuDNN fp32]
[Chart: MSRA totals. Algorithmic speedup (axis 0.5 to 2) vs. batch size (64, 32, 16, 8, 4, 2, 1). Series: Winograd fp16, Winograd fp32, cuDNN fp16, cuDNN fp32]
in terms of matrix multiplies
2x2 and 4x4 output tile size
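One common way to express Winograd convolution in terms of matrix multiplies is sketched below for the 2x2 output tile case only. After transforming filters and input tiles, each of the 16 positions in the 4x4 transformed space becomes an independent (K x C) by (C x P) matrix multiply, where K is the number of filters, C the number of channels and P the number of tiles. The 3x3 filter size, the data layout, and all function names here are assumptions for illustration. The 4x4 output tile case needs larger transform matrices and is not shown.

```python
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def sandwich(T, X):
    """Return T X T^T, the 2D transform as a product of 1D transforms."""
    return matmul(matmul(T, X), [list(r) for r in zip(*T)])


def winograd_layer(tiles, filters):
    """tiles[p][c] is a 4x4 input tile, filters[k][c] is a 3x3 filter.

    Returns out[k][p], the 2x2 output tile for filter k and tile p,
    summed over channels.
    """
    P, C, K = len(tiles), len(tiles[0]), len(filters)
    U = [[sandwich(G, filters[k][c]) for c in range(C)] for k in range(K)]
    V = [[sandwich(BT, tiles[p][c]) for p in range(P)] for c in range(C)]
    M = [[[[0.0] * 4 for _ in range(4)] for _ in range(P)] for _ in range(K)]
    for xi in range(4):
        for nu in range(4):
            # One batched-GEMM member per transform position (xi, nu).
            U_xn = [[U[k][c][xi][nu] for c in range(C)] for k in range(K)]  # K x C
            V_xn = [[V[c][p][xi][nu] for p in range(P)] for c in range(C)]  # C x P
            M_xn = matmul(U_xn, V_xn)                                       # K x P
            for k in range(K):
                for p in range(P):
                    M[k][p][xi][nu] = M_xn[k][p]
    return [[sandwich(AT, M[k][p]) for p in range(P)] for k in range(K)]


def direct_layer(tiles, filters):
    """Direct 3x3 correlation summed over channels, used as a check."""
    P, C, K = len(tiles), len(tiles[0]), len(filters)
    return [[[[sum(tiles[p][c][i + r][j + s] * filters[k][c][r][s]
                   for c in range(C) for r in range(3) for s in range(3))
               for j in range(2)] for i in range(2)]
             for p in range(P)] for k in range(K)]
```

Because the channel sum is folded into the matrix multiply, the 16 small GEMMs can be run as one batched GEMM. This links the approach back to the batched GEMM tiles earlier in the deck.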