GPU INFERENCE
IN THE DATACENTER
Drew Farris, Chief Technologist @ Booz | Allen | Hamilton
Nvidia GPU Technology Conference, Washington DC
NOVEMBER 2017
Eglin AFB, FL BOOZ ALLEN HAMILTON
MICROPROCESSORS NO LONGER SCALE AT THE LEVEL OF PERFORMANCE THEY USED TO — THE END OF WHAT YOU WOULD CALL MOORE’S LAW, SEMICONDUCTOR PHYSICS PREVENTS US FROM TAKING DENNARD SCALING ANY FURTHER.
THE DAYS OF EASY PERFORMANCE GAINS ARE GONE
We need alternatives to general purpose CPU computation
GRAPHIC PROCESSING UNITS PROVIDE AN ALTERNATIVE
HOW CAN WE LEVERAGE OUR EXISTING INVESTMENTS?
INTRODUCTION
WE HAVE A PROBLEM
How to apply complex algorithms as a part of our ingest process?
How to accommodate this within our existing compute fabric?
NO SUPERCOMPUTERS, NO MODEL TRAINING
THE REALITY
DATACENTER ARCHITECTURE
SINGLE NODE
SINGLE RACK
DATACENTER
DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS
As a part of the data ingest pipeline, this system must extract and analyze data in a wide variety of formats and perform normalization in order to prepare for indexing.
The heterogeneous nature of this data was a problem: complex data and analysis would disrupt latency across all data types.
DATA EXTRACTION PIPELINE
DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS
Some of these tasks are straightforward to accelerate using GPUs, so we decided to start with the following:
GPU ACCELERATED DATA EXTRACTION PIPELINE
CPU, MEMORY AND THROUGHPUT
In order to scale linearly as we add more resources, our system must have the following characteristics
DATA EXTRACTION REQUIREMENTS
What plays well with Java? Or: how the heck do we get it to talk to the CUDA libraries?
INTEGRATION OPTIONS
[Diagram: existing single node. A Java VM (threads, JVM heap) runs on the node's CPU and memory with local storage. Integration paths shown: JNI lib, wrapped lib, Java lib, forked exe.]
[Diagram: single node with a GPU and GPU memory added. Candidate integration paths: JNI lib? Wrapped lib? Forked exe? Java lib? Candidate GPU components: TensorRT, CUDA lib, OpenCV lib, Caffe.]
What do we want to be able to do?
NOTIONAL INTEGRATION
So, what components make up the solution?
SOLUTION
“ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS”
NVIDIA TESLA P4 INFERENCE ACCELERATOR
IMAGE CLASSIFICATION WITH ALEXNET USING CAFFE
We used CaffeNet, a pre-trained AlexNet model based on the ILSVRC 2012 dataset.
difference
CAFFE
CUDA ACCELERATED COMPUTER VISION LIBRARY
Images are resized on the GPU instead of the CPU, so the resized image data does not need to be copied to the network's input layer.
OPENCV
HIGH PERFORMANCE DEEP LEARNING INFERENCE OPTIMIZER
TensorRT can load Caffe or TensorFlow models and optimize them for inference
classification task.
NVIDIA TENSORRT
MALCONV: MALWARE DETECTION WITH DEEP LEARNING
A convolutional neural network digests entire binaries for malware identification
PYTORCH
DEEP LEARNING INFERENCE VIA REST
The GPU REST Engine (GRE) provides memory and process isolation, plus native libraries for hardware access
NVIDIA GPU REST ENGINE
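As a rough illustration of inference via REST, the sketch below builds an HTTP POST that carries raw image bytes to an inference service and reads back the reply. The host, port, `/api/classify` path, and content type are assumptions for illustration, not details taken from this deployment, and the pipeline described here issued its calls from Java threads rather than Python.

```python
import urllib.request

# Hypothetical endpoint: the real host, port, and path depend on the deployment.
DEFAULT_URL = "http://127.0.0.1:8000/api/classify"


def build_classify_request(image_bytes, url=DEFAULT_URL):
    """Build a POST request whose body is the raw image bytes."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )


def classify(image_bytes, url=DEFAULT_URL, timeout=5.0):
    """Send the image to the service and return the response body as text."""
    request = build_classify_request(image_bytes, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping the client this thin is the point of the design: the caller needs only an HTTP library, while the GPU libraries stay inside the service process.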
SIMPLIFIED PACKAGING AND DEPLOYMENT VIA CONTAINERS
Packaging is performed in one environment, and the result is rapidly deployed to a large number of nodes.
NVIDIA DOCKER
We collected telemetry during evaluation with a suite of components we use for tracking system performance on production systems
INSTRUMENTATION
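The talk does not name the telemetry components, so the following is only a sketch of one possible way to sample GPU utilization, memory use, and power draw: poll `nvidia-smi` and parse its CSV output. The choice of `nvidia-smi`, the three queried fields, and the function names are assumptions, not the suite used in the evaluation.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi (an assumed selection, not the talk's).
QUERY_FIELDS = ["utilization.gpu", "memory.used", "power.draw"]


def _to_number(cell):
    """Convert a CSV cell to float; return None for values such as '[N/A]'."""
    try:
        return float(cell.strip())
    except ValueError:
        return None


def parse_gpu_samples(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        samples.append(dict(zip(QUERY_FIELDS, (_to_number(c) for c in row))))
    return samples


def sample_gpus():
    """Run nvidia-smi once; return an empty list if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_samples(result.stdout)
```

Calling `sample_gpus()` on a timer during a test run and keeping the maximum of each field would yield figures of the same kind as the "GPU Max Utilization (%)" rows in the results.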
[Diagram: single node running NVIDIA Docker. Threads in the Java VM (JVM heap, CPU, memory, local storage) call the GPU REST Engine over HTTP. The GPU REST Engine fronts TensorRT, the Caffe library, and the OpenCV library, which use the P4 GPU and its GPU memory.]
FINAL INTEGRATION
What did we evaluate and observe?
EXPERIMENTS AND RESULTS
What effect does concurrency have on the ability to classify images? How quickly can we classify images using only the CPU?
We processed 9,000 images through the ETL framework, the GRE, and Caffe in CPU-only mode.
BASELINE CONCURRENCY TESTS WITH CAFFE CPU
Java Thread Count                  10        24        32
Total Elapsed Time (Seconds)       271.65    175.59    416
Minimum Processing Time (Msec)     239       465.8     619.2
Mean (Msec)                        300       100.49    149.87
Max (Msec)                         483       880       1066
CPU Max User (%)                   83.0      99.8      100.00
GPU Max Utilization (%)            -         -         -
[Chart: Milliseconds per Image vs. Count, by Threads (10, 24, 32)]
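The measurements in these concurrency tests (total elapsed time plus minimum, mean, and maximum per-image time at a fixed thread count) can be gathered with a small harness like the sketch below. It is not the ETL framework used in the talk: `run_concurrency_test` and `classify_fn` are hypothetical names, and the classifier call is passed in so that any client can be plugged in.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrency_test(items, classify_fn, thread_count):
    """Classify every item on a fixed-size thread pool and summarize latency."""
    items = list(items)
    if not items:
        raise ValueError("at least one item is required")

    def timed_call(item):
        # Time a single classification call, in milliseconds.
        start = time.perf_counter()
        classify_fn(item)
        return (time.perf_counter() - start) * 1000.0

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        latencies_ms = list(pool.map(timed_call, items))
    elapsed_s = time.perf_counter() - wall_start

    return {
        "threads": thread_count,
        "count": len(latencies_ms),
        "total_elapsed_s": elapsed_s,
        "min_ms": min(latencies_ms),
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "max_ms": max(latencies_ms),
    }
```

For a dry run without a GPU, a stand-in workload such as `run_concurrency_test(range(200), lambda _: time.sleep(0.002), 10)` exercises the same code path; repeating it with 24 and 32 threads mirrors the thread counts used in the tests.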
What effect does concurrency have on the ability to classify images? Can the CPU provide enough work to keep the GPUs busy?
CONCURRENCY TESTS WITH CAFFE GPU
Java Thread Count                  10        24        32
Total Elapsed Time (Seconds)       37.425    38.451    38.153
Minimum Processing Time (Msec)     7         8         10
Mean (Msec)                        39.66     100.49    149.87
Max (Msec)                         163       251       415
CPU Max User (%)                   56        45        55
GPU Max Utilization (%)            82        79        81
[Chart: Milliseconds per Image vs. Count, by Threads (10, 24, 32)]
How does TensorRT performance differ from Caffe CPU and Caffe GPU?
CONCURRENCY TESTS WITH TENSORRT
Java Thread Count                  10        24        32
Total Elapsed Time (Seconds)       33.327    33.260    33.399
Minimum Processing Time (Msec)     5         5         8
Mean (Msec)                        35.01     86.30     116.08
Max (Msec)                         188       258       416
CPU Max User (%)                   47        54        57
GPU Max Utilization (%)            85        83        84
[Chart: Milliseconds per Image vs. Count, by Threads (10, 24, 32)]
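One way to read the Caffe GPU and TensorRT tables: total elapsed time stays roughly flat while mean per-image time grows with thread count, which is what a saturated GPU would look like. The sketch below assumes each thread keeps exactly one request in flight, so sustained throughput is roughly the thread count divided by the mean latency. That assumption and the function name are illustrative and do not come from the talk.

```python
def estimated_throughput(threads, mean_latency_ms):
    """Images per second if every thread always has one request in flight."""
    return threads / (mean_latency_ms / 1000.0)


# (thread count, mean latency in ms), copied from the tables above.
caffe_gpu = [(10, 39.66), (24, 100.49), (32, 149.87)]
tensorrt = [(10, 35.01), (24, 86.30), (32, 116.08)]

for name, rows in (("Caffe GPU", caffe_gpu), ("TensorRT", tensorrt)):
    for threads, mean_ms in rows:
        rate = estimated_throughput(threads, mean_ms)
        print(f"{name:9s} {threads:2d} threads: ~{rate:.0f} images/s")
```

With these inputs the estimate stays between roughly 210 and 290 images per second at every thread count, so adding threads mostly adds queueing time per image rather than throughput.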
How does performance compare between TensorRT and Caffe?
TENSORRT VS CAFFE
Framework / Thread Count           Caffe GPU, 10 Threads    TensorRT, 10 Threads
Total Elapsed Time (Seconds)       37.425                   33.327
Minimum Processing Time (Msec)     7                        5
Mean (Msec)                        39.66                    35.01
Max (Msec)                         163                      188
CPU Max User (%)                   56                       47
GPU Max Utilization (%)            82                       85
GPU Memory Utilization (MB)        1339                     895
[Chart: Milliseconds per Image vs. Count, by Framework (Caffe GPU, TensorRT)]
How does performance compare between TensorRT and Caffe?
TENSORRT VS CAFFE
Framework / Thread Count           Caffe GPU, 32 Threads    TensorRT, 32 Threads
Total Elapsed Time (Seconds)       38.153                   33.399
Minimum Processing Time (Msec)     10                       8
Mean (Msec)                        39.66                    35.01
Max (Msec)                         149.9                    116
CPU Max User (%)                   58                       64
GPU Max Utilization (%)            81                       84
GPU Memory Utilization (MB)        1339                     895
[Chart: Milliseconds per Image vs. Count, by Framework (Caffe GPU, TensorRT)]
How does performance compare between TensorRT and Caffe?
TENSORRT VS CAFFE
Framework / Thread Count           Caffe CPU, 10 Threads    TensorRT, 10 Threads
Total Elapsed Time (Seconds)       271.659                  33.327
Minimum Processing Time (Msec)     239                      5
Mean (Msec)                        280                      35.01
Max (Msec)                         296                      188
CPU Max User (%)                   83                       47
GPU Max Utilization (%)            -                        85
[Chart: Milliseconds per Image vs. Count, by Framework (Caffe GPU, TensorRT, Caffe CPU)]
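To put the 10-thread elapsed times on a common scale, the short calculation below converts them to throughput and to speedup over the CPU baseline. The 9,000-image count is stated for the CPU baseline only; using it for the GPU runs as well is an assumption, and the helper names are illustrative. The speedup ratios do not depend on that assumption.

```python
# Image count stated for the CPU baseline; assumed equal for the GPU runs.
IMAGES = 9000

# Total elapsed time in seconds at 10 threads, from the comparison tables.
elapsed_s = {
    "Caffe CPU": 271.659,
    "Caffe GPU": 37.425,
    "TensorRT": 33.327,
}


def throughput(total_elapsed_s, images=IMAGES):
    """Images classified per second over the whole run."""
    return images / total_elapsed_s


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate finished than the baseline."""
    return baseline_s / candidate_s


for name, seconds in elapsed_s.items():
    print(
        f"{name}: {throughput(seconds):.1f} images/s, "
        f"{speedup(elapsed_s['Caffe CPU'], seconds):.2f}x vs Caffe CPU"
    )
```

On these numbers, TensorRT at 10 threads finishes about 8 times faster than Caffe on the CPU and about 12% faster than Caffe on the GPU.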
How does power utilization compare between TensorRT and Caffe?
POWER UTILIZATION
Framework                     Caffe                         TensorRT
Thread Count                  10       24       32          10       24       32
[row label missing]           24.26    24.55    25.13       24.36    23.78    24.36
[row label missing]           44.02    46.3     45.88       45.32    45.01    45.36
CPU Max User (%)              56       45       55          47       54       57
GPU Utilization (%)           82       79       81          85       83       84
GPU Max Memory Used (MB)      1339     1339     1341        895      895      895
GPU Max Memory Used (%)       ~16%                          ~10%
There’s much more to explore. What are some of the things we should tackle next?
WHAT’S NEXT?
Ken Singer & Jake Gingrich @ HP Enterprise, SGI Federal Systems
Rob Zuppert, Larry Brown and Brad Rees @ NVidia
Felix Abecassis & the GPU REST Engine Team @ NVidia
Edward Raff, Jared Sylvester & other MalConv Researchers @ UMD LPS
Sterling Foster & others @ US Department of Defense
Steven Mills, Data Solutions & Machine Intelligence Team @ Booz Allen Hamilton
THANK YOU
Find me at:
QUESTIONS?