NVIDIA GIE: HIGH-PERFORMANCE GPU INFERENCE ENGINE
Michael Andersch, 7th April 2016
April 4-7, 2016 | Silicon Valley

WHAT IS INFERENCE, ANYWAYS?
Building a deep neural network based application:
Step 1: Use data to train the neural network - training
Step 2: Use the neural network to process unseen data - inference
Inference differs from training in three important ways:
1. No backpropagation / static weights
   Enables graph optimizations and simplifies memory management
2. Tendency towards smaller batch sizes
   Harder to amortize weight loading and to achieve high GPU utilization
3. Reduced precision requirements
   Provides opportunity for bandwidth savings and accelerated arithmetic (see the sketch below)
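To make point 3 concrete: a minimal CUDA sketch of one common bandwidth saving, storing static weights in half precision. This is illustrative only (the kernel name and the assumption that FP16 storage is accurate enough for the layer are mine, not the talk's):

```cuda
#include <cuda_fp16.h>

// One-time offline conversion of FP32 weights to FP16 - safe because
// inference weights are static (point 1). At inference time the GPU then
// reads half as many bytes per weight (point 3).
__global__ void weights_to_fp16(const float* w32, __half* w16, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        w16[i] = __float2half(w32[i]);
}
```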
What's running on the GPU: cuDNN optimizations
Support for standard tensor layouts and major frameworks
Available automatically and "for free"

How you use it: framework optimizations
Every last bit of performance matters
Challenging due to framework structure
Changes to one framework don't propagate to others
Optimal convolution algorithm depends on convolution layer dimensions
Meta-parameters (data layouts, texture memory) afford higher performance
Using texture memory for convolutions: 13% inference speedup (GoogLeNet, batch size 1)
Winograd speedup over GEMM-based convolution (VGG-E layers, N=1):
conv 1.1: 0.73x    conv 1.2: 1.84x    conv 2.1: 1.83x
conv 2.2: 2.03x    conv 3.1: 2.07x    conv 3.2: 2.26x
conv 4.1: 1.92x    conv 4.2: 1.98x    conv 5.0: 1.25x
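This per-layer algorithm choice is visible in cuDNN's forward-algorithm query. A minimal sketch, assuming cuDNN 6/7-era signatures (cudnnGetConvolutionForwardAlgorithm was removed in later versions) and made-up layer dimensions:

```cuda
#include <cudnn.h>
#include <cstdio>

int main()
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // Batch 1 (inference), 64 channels of 56x56 input, 128 3x3 filters.
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 64, 56, 56);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               128, 64, 3, 3);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, h, w);

    // cuDNN picks a GEMM-, FFT-, or Winograd-based algorithm for exactly
    // these dimensions - the choice behind the speedups tabulated above.
    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                        0 /* memory limit, ignored here */, &algo);
    printf("cuDNN chose forward algorithm %d\n", (int)algo);

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```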
[Diagram: GoogLeNet inception module - the input feeds a 1x1 convolution, 3x3 and 5x5 convolutions behind 1x1 reductions, and a max pool with a 1x1 projection; the branch outputs are joined by a tensor concat]

[Diagram: the same module as naively executed - every convolution is followed by separate bias and ReLU nodes, with concat layers between modules]

[Diagram: after vertical fusion - each convolution + bias + ReLU chain collapses into a single "CBR" node]

[Diagram: after horizontal fusion - the 1x1 CBR nodes reading the same input are merged into one wider 1x1 CBR]

[Diagram: after concat elimination - the concat node is removed and each CBR writes directly into the next layer's input buffer]
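The payoff of vertical fusion is fewer kernel launches and fewer trips through DRAM. GIE folds bias and ReLU into the convolution kernel itself; the standalone CUDA sketch below (kernel names mine) only illustrates the principle on the pointwise part:

```cuda
#include <cuda_runtime.h>

// Unfused: bias and ReLU as separate kernels means the activation tensor
// is read and written twice.
__global__ void add_bias(float* x, const float* bias, int c, int hw, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += bias[(i / hw) % c];   // NCHW: channel = (i / (H*W)) % C
}

__global__ void relu(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

// Fused "CBR"-style epilogue: one pass, half the pointwise memory traffic.
__global__ void bias_relu(float* x, const float* bias, int c, int hw, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i] + bias[(i / hw) % c], 0.0f);
}
```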
Run GEMV instead of GEMM
Small batch sizes degrade the GEMM N dimension - the B matrix becomes narrow
Pre-transpose weight matrices
Allows using NN/NT GEMM, where performance ranks NT > NN > TN
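For a fully connected layer y = W·x with W of shape K×C, the batch size is the GEMM N dimension, so at batch 1 a matrix-vector product fits better. A minimal cuBLAS sketch (function name and layout assumptions mine; cuBLAS is column-major, which is where the weight pre-transpose comes in):

```cuda
#include <cublas_v2.h>

// Fully connected layer at batch size 1: y[K] = W[KxC] * x[C].
// W is stored column-major (pre-transposed on the host if it started
// row-major), so CUBLAS_OP_N reads it with unit stride - the NN/NT-over-TN
// preference from the slide.
void fc_forward_batch1(cublasHandle_t handle,
                       const float* dW, const float* dx, float* dy,
                       int K, int C)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, K, C,
                &alpha, dW, K, dx, 1, &beta, dy, 1);
}
```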
Inference products will support a new dedicated vector math instruction:
Multi-element dot product, 8-bit integer inputs, 32-bit accumulator
4x the rate of equivalent FP32 operations
Full-speed FP32 processing for any layers that require higher precision
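This instruction shipped with Pascal-class inference GPUs and is exposed in CUDA 8.0 as the __dp4a intrinsic (sm_61 and newer). A minimal sketch; the grid-stride reduction pattern is illustrative:

```cuda
#include <cuda_runtime.h>

// Each int packs four signed 8-bit values. __dp4a computes a 4-way INT8
// dot product and adds it to a 32-bit accumulator, so long reductions do
// not overflow at 8-bit precision. Requires -arch=sm_61 or newer.
__global__ void dot_product_int8(const int* a, const int* b, int n4, int* out)
{
    int acc = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n4; i += gridDim.x * blockDim.x)
        acc = __dp4a(a[i], b[i], acc);
    atomicAdd(out, acc);
}
```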
[Diagram: GIE two-phase design - an optimization engine analyzes the trained network and produces an execution strategy; an execution engine runs that strategy at deployment time]

[Diagram: full workflow - networks trained with DIGITS or other training tools are fed to the GIE optimization engine, and the resulting strategy runs on the GIE execution engine in production]
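In code, the two-phase split looks roughly like the sketch below. GIE was later productized as TensorRT; the class and function names here follow the early TensorRT C++ API and are an assumption about the interface, not a listing from the talk:

```cuda
#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <cstdio>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Minimal logger the builder and runtime require.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override { printf("%s\n", msg); }
} gLogger;

void buildAndRun(void* gpuBuffers[2])
{
    // Optimization engine: import the trained model, then build a plan
    // optimized for this GPU, batch size, and workspace budget.
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    ICaffeParser* parser = createCaffeParser();
    auto blobs = parser->parse("deploy.prototxt", "weights.caffemodel",
                               *network, DataType::kFLOAT);
    network->markOutput(*blobs->find("prob"));
    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(16 << 20);
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    // Execution engine: run the pre-built strategy at deployment time,
    // with input and output already resident in the supplied GPU buffers.
    IExecutionContext* context = engine->createExecutionContext();
    context->execute(1 /* batch size */, gpuBuffers);
}
```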
GPUs are a great platform for inference
Efficiency: great performance per watt
Scalability: from 3W to 300W
GPU-based inference affords the same performance in a much tighter power envelope, freeing up the CPU to do other work
Questions: mandersch@nvidia.com, or find me after the talk!
Tesla M4 Hyperscale Accelerator
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join