SLIDE 1

April 4-7, 2016 | Silicon Valley

Michael Andersch, 7th April 2016

NVIDIA GIE: HIGH-PERFORMANCE GPU INFERENCE ENGINE

SLIDE 2

WHAT IS INFERENCE, ANYWAYS?

Building a deep neural network based application:

Step 1: Use data to train the neural network (training)
Step 2: Use the neural network to process unseen data (inference)

SLIDE 3

INFERENCE VS TRAINING

How is inference different from training?

1. No backpropagation / static weights

enables graph optimizations and simplifies memory management

2. Tendency towards smaller batch sizes

makes it harder to amortize weight loading and to achieve high GPU utilization

3. Reduced precision requirements

provide an opportunity for bandwidth savings and accelerated arithmetic

SLIDE 4

OPTIMIZING SOFTWARE FOR INFERENCE

Extracting every bit of performance

What’s running on the GPU: cuDNN optimizations
Support for standard tensor layouts and major frameworks
Available automatically and “for free”

How you use it: Framework optimizations
Every last bit of performance matters
Challenging due to framework structure
Changes to one framework don’t propagate to others

SLIDE 5

OPTIMIZING SOFTWARE FOR INFERENCE

Challenge: Efficient small batch convolutions

Optimal convolution algorithm depends on convolution layer dimensions (see the cuDNN sketch after the chart below)
Meta-parameters (data layouts, texture memory) afford higher performance
Using texture memory for convolutions: 13% inference speedup (GoogLeNet, batch size 1)

Winograd speedup over GEMM-based convolution (VGG-E layers, N=1):

conv 1.1: 0.73x    conv 1.2: 1.84x    conv 2.1: 1.83x
conv 2.2: 2.03x    conv 3.1: 2.07x    conv 3.2: 2.26x
conv 4.1: 1.92x    conv 4.2: 1.98x    conv 5.0: 1.25x
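Since the best algorithm varies with the layer shape, an inference stack typically asks cuDNN at build time which algorithm to run for each layer’s exact dimensions. A minimal sketch against the v5-era cuDNN API; the layer shape below is illustrative, not taken from the talk:

    // Ask cuDNN to pick the fastest forward convolution algorithm
    // for one specific layer shape (all dimensions illustrative).
    #include <cudnn.h>
    #include <cstdio>

    int main() {
        cudnnHandle_t handle;
        cudnnCreate(&handle);

        cudnnTensorDescriptor_t xDesc, yDesc;
        cudnnFilterDescriptor_t wDesc;
        cudnnConvolutionDescriptor_t convDesc;
        cudnnCreateTensorDescriptor(&xDesc);
        cudnnCreateTensorDescriptor(&yDesc);
        cudnnCreateFilterDescriptor(&wDesc);
        cudnnCreateConvolutionDescriptor(&convDesc);

        // Batch size 1 (inference), 64 channels, 56x56 image, 3x3 filters.
        cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   1, 64, 56, 56);
        cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                   64, 64, 3, 3);
        cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                        CUDNN_CROSS_CORRELATION);
        cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   1, 64, 56, 56);

        // The returned algorithm (GEMM-based, Winograd, FFT, ...) depends
        // on the layer dimensions: there is no single best choice.
        cudnnConvolutionFwdAlgo_t algo;
        cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc,
                                            yDesc,
                                            CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                            /*memoryLimit=*/0, &algo);
        printf("chosen algorithm: %d\n", (int)algo);

        cudnnDestroyConvolutionDescriptor(convDesc);
        cudnnDestroyFilterDescriptor(wDesc);
        cudnnDestroyTensorDescriptor(yDesc);
        cudnnDestroyTensorDescriptor(xDesc);
        cudnnDestroy(handle);
        return 0;
    }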

SLIDE 6

OPTIMIZING SOFTWARE FOR INFERENCE

Challenge: Graph optimization

[Figure: GoogLeNet Inception module: the input feeds parallel 1x1, 3x3, and 5x5 convolution branches plus a max pool branch, with 1x1 convolutions as reductions/projections, all merged by a tensor concat]

SLIDE 7

OPTIMIZING SOFTWARE FOR INFERENCE

Challenge: Graph optimization

[Figure: the same module as naively executed: every convolution is followed by separate bias and ReLU layers, with concat layers joining the branches before the next input]

SLIDE 8

OPTIMIZING SOFTWARE FOR INFERENCE

Graph optimization: Vertical fusion

[Figure: after vertical fusion, each convolution + bias + ReLU triple collapses into a single CBR node (1x1 CBR, 3x3 CBR, 5x5 CBR), alongside the max pool, feeding the concat]
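What a fused CBR node buys, sketched as a naive CUDA kernel for the 1x1 case: bias and ReLU run in the convolution’s epilogue, so the pre-activation tensor never makes extra round trips through global memory between three separate kernel launches. A conceptual illustration, not GIE’s implementation:

    // Fused 1x1 convolution + bias + ReLU over NCHW data, batch size 1.
    __global__ void conv1x1_bias_relu(const float* __restrict__ in,   // [C, H*W]
                                      const float* __restrict__ w,    // [K, C]
                                      const float* __restrict__ bias, // [K]
                                      float* __restrict__ out,        // [K, H*W]
                                      int C, int K, int HW)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;  // pixel index
        int k = blockIdx.y;                             // output channel
        if (p >= HW || k >= K) return;

        float acc = 0.f;
        for (int c = 0; c < C; ++c)                     // the convolution
            acc += w[k * C + c] * in[c * HW + p];

        acc += bias[k];                                 // fused bias
        out[k * HW + p] = fmaxf(acc, 0.f);              // fused ReLU
    }
    // Launch: conv1x1_bias_relu<<<dim3((HW + 255) / 256, K), 256>>>(...);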

SLIDE 9

OPTIMIZING SOFTWARE FOR INFERENCE

Graph optimization: Horizontal fusion

[Figure: after horizontal fusion, the three 1x1 CBR layers that read the same input merge into one wider 1x1 CBR; the 3x3 CBR, 5x5 CBR, max pool, and the pool branch’s 1x1 CBR remain]
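For the 1x1 layers the fusion is easy to picture: a 1x1 convolution over [C, HW] activations is a GEMM, so the filter banks of branches reading the same input can be stacked and computed in one larger GEMM. A hedged sketch (the layout conventions are assumptions, not from the talk); both outputs land back to back in one buffer, which also sets up the concat elision two slides ahead:

    #include <cublas_v2.h>

    // All matrices here are row-major; cuBLAS is column-major, so we
    // compute out^T = in^T * w^T, which needs no explicit transposes.
    void fused_1x1_branches(cublasHandle_t h,
                            const float* d_w,   // [K1+K2, C] stacked weights
                            const float* d_in,  // [C, HW] shared input
                            float* d_out,       // [K1+K2, HW] both outputs
                            int K1, int K2, int C, int HW)
    {
        const float alpha = 1.f, beta = 0.f;
        const int K = K1 + K2;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    HW, K, C,
                    &alpha,
                    d_in, HW,   // in^T  : HW x C column-major
                    d_w, C,     // w^T   : C  x K column-major
                    &beta,
                    d_out, HW); // out^T : HW x K column-major
    }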

SLIDE 10

OPTIMIZING SOFTWARE FOR INFERENCE

Graph optimization: Concat elision

[Figure: with concat elision, the concat layers disappear: each branch writes directly into its slice of the next layer’s input buffer]
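In code, elision just means the engine allocates the concatenated tensor once and hands each branch a pointer offset by its channel range, so no copy kernel ever runs. The branch launchers and channel counts below are placeholders, not GIE internals:

    #include <cstddef>

    // d_out holds the already-concatenated output: [K_3x3+K_5x5+K_pool, HW].
    void run_block_without_concat(float* d_out, int HW,
                                  int K_3x3, int K_5x5, int K_pool)
    {
        float* out_3x3  = d_out;                          // channels [0, K_3x3)
        float* out_5x5  = out_3x3 + (size_t)K_3x3 * HW;   // next K_5x5 channels
        float* out_pool = out_5x5 + (size_t)K_5x5 * HW;   // next K_pool channels

        // Each branch kernel writes its channel slice in place; the
        // concat layer vanishes from the executed graph.
        // launch_branch_3x3(/* ... */, out_3x3);   // placeholder launcher
        // launch_branch_5x5(/* ... */, out_5x5);   // placeholder launcher
        // launch_branch_pool(/* ... */, out_pool); // placeholder launcher
    }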

SLIDE 11

OPTIMIZING SOFTWARE FOR INFERENCE

Graph optimization: Concurrency

[Figure: the same fused graph, with independent branches issued concurrently on the GPU]
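Once fused, the remaining branches have no data dependencies on each other, so the execution strategy can issue them into separate CUDA streams and let them overlap on the GPU. A minimal sketch; the kernels are placeholders for whatever each branch computes:

    #include <cuda_runtime.h>

    void run_branches_concurrently()
    {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // branch_3x3_cbr<<<grid0, block0, 0, s0>>>(/* ... */);  // placeholder
        // branch_pool_1x1<<<grid1, block1, 0, s1>>>(/* ... */); // placeholder

        // The next layer reads the shared output buffer, so both
        // streams must finish before it launches.
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }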

SLIDE 12

OPTIMIZING SOFTWARE FOR INFERENCE

Challenge: Effective use of cuBLAS intrinsics

Run GEMV instead of GEMM: small batch sizes degrade the N dimension, so the B matrix becomes narrow
Pre-transpose weight matrices: allows using NN/NT GEMM, where NT > NN > TN in performance
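A sketch of both tricks using standard cuBLAS calls, everything in cuBLAS’s native column-major convention (the layout choices are mine; the NT > NN > TN ordering is the slide’s):

    #include <cublas_v2.h>

    // Weights W are [K, C] column-major. Because inference weights are
    // static (slide 3), the engine can transpose them once at build time
    // so every inference call hits the fast GEMM variants instead of TN.

    // Batch size 1: the B matrix has collapsed to a single column,
    // so skip GEMM entirely and call GEMV.
    void fc_batch1(cublasHandle_t h, const float* d_w /* [K, C] */,
                   const float* d_x /* [C] */, float* d_y /* [K] */,
                   int K, int C)
    {
        const float alpha = 1.f, beta = 0.f;
        cublasSgemv(h, CUBLAS_OP_N, K, C, &alpha, d_w, K,
                    d_x, 1, &beta, d_y, 1);
    }

    // Small batch N > 1: with weights pre-transposed into [K, C] layout,
    // the GEMM runs in NN mode rather than the slow TN mode.
    void fc_small_batch(cublasHandle_t h, const float* d_w /* [K, C] */,
                        const float* d_x /* [C, N] */, float* d_y /* [K, N] */,
                        int K, int C, int N)
    {
        const float alpha = 1.f, beta = 0.f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, K, N, C,
                    &alpha, d_w, K, d_x, C, &beta, d_y, K);
    }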

SLIDE 13

ACCELERATED INFERENCE ON PASCAL

Support for fast mixed precision arithmetic

Inference products will support a new dedicated vector math instruction
Multi-element dot product: 8-bit integer inputs, 32-bit accumulator
4x the rate of equivalent FP32 operations
Full-speed FP32 processing for any layers that require higher precision
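The instruction described here shipped as the DP4A operation, exposed in CUDA C++ as the __dp4a intrinsic (CUDA 8, compute capability 6.1+). A sketch over data assumed already quantized to int8 and packed four values per 32-bit word; the quantization step is omitted:

    #include <cuda_runtime.h>

    // Each thread forms a partial dot product over its packed word; a
    // real layer would reduce these partials per output neuron.
    __global__ void int8_partial_dot(const int* __restrict__ a,  // int8x4 packed
                                     const int* __restrict__ b,  // int8x4 packed
                                     int* __restrict__ partial, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            // __dp4a: 4 int8 multiplies summed into a 32-bit accumulator,
            // i.e. 4x the work of one FP32 FMA per instruction.
            partial[i] = __dp4a(a[i], b[i], 0);
        }
    }
    // Compile with -arch=sm_61 or newer.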

SLIDE 14

BUT WHO WILL IMPLEMENT IT?

Introducing NVIDIA GIE: GPU Inference Engine

[Figure: GIE consists of an optimization engine that produces an execution strategy and an execution engine that runs it]

SLIDE 15

GPU INFERENCE ENGINE WORKFLOW

[Figure: GIE workflow: a network trained in DIGITS or other training tools goes through the optimization engine, which emits a strategy that the execution engine runs at deployment]
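To make the workflow concrete, a hedged sketch of the build step, loosely following the early GIE C++ API as it appeared in GIE-era samples; exact names and signatures may differ, and the output blob name is illustrative:

    #include "NvInfer.h"
    #include "NvCaffeParser.h"
    using namespace nvinfer1;
    using namespace nvcaffeparser1;

    // Offline build: parse a trained Caffe model, let the optimization
    // engine pick a strategy for the target GPU, and return an engine
    // that the execution engine runs at deployment time.
    ICudaEngine* buildEngine(ILogger& logger,
                             const char* deployFile,  // Caffe prototxt
                             const char* modelFile)   // trained weights
    {
        IBuilder* builder = createInferBuilder(logger);
        INetworkDefinition* network = builder->createNetwork();

        ICaffeParser* parser = createCaffeParser();
        const IBlobNameToTensor* blobs =
            parser->parse(deployFile, modelFile, *network, DataType::kFLOAT);
        network->markOutput(*blobs->find("prob"));   // "prob" is illustrative

        builder->setMaxBatchSize(1);                 // small-batch inference
        builder->setMaxWorkspaceSize(16 << 20);      // scratch memory budget

        // Here the optimization engine applies the fusions, concat
        // elision, and per-layer algorithm choices described earlier.
        ICudaEngine* engine = builder->buildCudaEngine(*network);

        network->destroy();
        parser->destroy();
        builder->destroy();
        return engine;  // deploy: engine->createExecutionContext()
    }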

SLIDE 16

SUMMARY

Inference on the GPU

GPUs are a great platform for inference
Efficiency: great performance/watt
Scalability: from 3W to 300W

GPU-based inference affords…
…the same performance in a much tighter power envelope
…freeing up the CPU to do other work

Questions: mandersch@nvidia.com, or find me after the talk!

[Image: Tesla M4 Hyperscale Accelerator]

SLIDE 17

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join