

SLIDE 1

GPU INFERENCE

IN THE DATACENTER

Drew Farris, Chief Technologist @ Booz Allen Hamilton
NVIDIA GPU Technology Conference, Washington DC

NOVEMBER 2017

Eglin AFB, FL

SLIDE 2

MICROPROCESSORS NO LONGER SCALE AT THE LEVEL OF PERFORMANCE THEY USED TO — THE END OF WHAT YOU WOULD CALL MOORE’S LAW, SEMICONDUCTOR PHYSICS PREVENTS US FROM TAKING DENNARD SCALING ANY FURTHER.

  • Jen-Hsun Huang, CEO, NVIDIA

1

Booz Allen Hamilton

SLIDE 3

THE DAYS OF EASY PERFORMANCE GAINS ARE GONE

We need alternatives to general-purpose CPU computation

GRAPHICS PROCESSING UNITS PROVIDE AN ALTERNATIVE

  • Algebraic Strengths of GPUs
  • Enable New Algorithms
  • Adaptable to a Variety of Tasks
  • Readily Available Libraries (CUDA, CV/DL Frameworks)

HOW CAN WE LEVERAGE OUR EXISTING INVESTMENTS?

  • HPC vs. Commodity Hardware in the Datacenter
  • Evolution vs. Revolution
  • Scaling Out vs. Scaling Up

INTRODUCTION

SLIDE 4

WE HAVE A PROBLEM

How to apply complex algorithms as a part of our ingest process?

  • Computationally Expensive Algorithms
  • Heterogeneous Dataflow
  • Horizontal / Linear Scalability at datacenter level.

How to accommodate this within our existing compute fabric?

  • Hadoop Clusters: HDFS, YARN, etc.
  • Commodity Nodes
  • Small Numbers of Special Purpose Nodes
  • 10G Interconnects
  • Cost, Power, Space and Cooling

NO SUPERCOMPUTERS, NO MODEL TRAINING

  • Power-Efficient GPUs (50-75 W)
  • We could focus on model inference
  • Application of models to new data, e.g., classification

THE REALITY

SLIDE 5

DATACENTER ARCHITECTURE


SINGLE NODE

  • 128 GB RAM
  • 12x Drives
  • 24 Cores / 48 HT
  • PCI Express Slot?

SINGLE RACK

  • 40 Nodes
  • 10G ToR Switch
  • 15-16 KW

DATACENTER

  • Many Racks
  • Interconnect
  • Rows?
SLIDE 6

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS

As a part of the data ingest pipeline, this system must extract and analyze data in a wide variety of formats and perform normalization in order to prepare for indexing.

  • Unpacking, Uncompressing Archives
    - Zip, Tar, etc.
  • Converting Binary Formats to Text
    - Word, PDF
  • Extracting Metadata
    - EXIF data from images
  • Classifying Images / Segmenting Images / Detecting Objects
    - Search Images using Text
  • Optical Character Recognition
    - Extract Text from Scanned Documents
  • Detecting Malware
    - Executables, PDFs, RTF

The heterogeneous nature of this data was a problem: computationally expensive analysis of complex data would disrupt latency across all datatypes.

DATA EXTRACTION PIPELINE

SLIDE 7

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS

Some of these tasks are straightforward to accelerate using GPUs. So we decided to start with the following:

  • Unpacking, Uncompressing Archives
    - Zip, Tar, etc.
  • Converting Binary Formats to Text
    - Word, PDF
  • Extracting Metadata
    - EXIF data from images
  • Classifying Images / Segmenting Images / Detecting Objects
    - Search Images using Text
  • Optical Character Recognition
    - Extract Text from Scanned Documents
  • Detecting Malware
    - Executables, PDFs, RTF

GPU ACCELERATED DATA EXTRACTION PIPELINE

SLIDE 8

CPU, MEMORY AND THROUGHPUT

In order to scale linearly as we add more resources, our system must have the following characteristics:

  • Shared Nothing vs. Share Little
  • Stateless vs. Minimally Stateful
  • CPU Bound (No Disk IO, Network IO, Bus or Memory Bottlenecks)
  • Uniformly Fast: Individual Document Processing in 10-100 ms
  • RAM Frugal: No Large Models in Memory
  • Something That Plays Well with Java
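These requirements describe a familiar pattern: a fixed pool of stateless workers, each owning a document end to end with no shared mutable state, so throughput scales with thread (and node) count. A minimal sketch of that shape; the processDocument stand-in is hypothetical, not the real pipeline:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stateless worker-pool sketch: each task is a pure function of its
// input document, so there is nothing to share and nothing to lock.
public class StatelessPool {
    // Stand-in for per-document work (unpack, convert, classify, ...).
    static String processDocument(String doc) {
        return doc.toUpperCase();
    }

    static List<String> processAll(List<String> docs, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> futures = new ArrayList<>();
        for (String d : docs)
            futures.add(pool.submit(() -> processDocument(d)));
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures)
            results.add(f.get());   // preserves submission order
        pool.shutdown();
        return results;
    }
}
```

Because each task touches only its own input, adding threads or nodes raises throughput until some shared resource (disk, network, bus) becomes the bottleneck.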

DATA EXTRACTION REQUIREMENTS

SLIDE 9

What plays well with Java? Or: how the heck do we get it to talk to the CUDA libraries?

  • Pure Java
    - No GPU acceleration
  • Java Native Interface (or some derivative)
    - Hand-wrapped API calls (JNI, JNA)
    - JavaCPP (Java and native C++ bridge)
    - Deeplearning4j cuDNN integration (as of 0.9.1)
    - TensorFlow Java API
  • External Processes
    - Forked Executable
    - Shared Memory
    - Sockets (TCP, UDP, Raw, etc.)
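Of these options, the forked-executable route is the simplest to sketch: the JVM launches an external process and reads its stdout. A minimal, hypothetical example using ProcessBuilder, with `echo` standing in for a native classifier binary:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// "Forked executable" integration sketch: launch an external process,
// capture its combined output, and return it to the calling thread.
public class ForkedExeSketch {
    static String runExternal(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd)
                .redirectErrorStream(true)  // merge stderr into stdout
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null)
                out.append(line);
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // In a real deployment cmd would point at a classifier binary.
        System.out.println(runExternal("echo", "label=cat score=0.93"));
    }
}
```

The per-call fork cost is why this option loses to a long-lived external process behind sockets or HTTP once request rates climb.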

INTEGRATION OPTIONS


[Diagram: a single node's JVM, with heap and worker threads on the CPU and memory, reaching native code via JNI libraries, wrapped libraries, Java libraries, or forked executables, backed by local storage]

SLIDE 10

[Diagram: the same single node extended with a GPU and GPU memory; candidate integration points (JNI lib?, wrapped lib?, forked exe?, Java lib?) connect the JVM threads to TensorRT, CUDA, OpenCV and Caffe libraries]

What do we want to be able to do?

  • Multiple Library or Framework Support
    - Caffe / TensorFlow / Torch / Others
  • CUDA-Accelerated OpenCV
  • TensorRT
  • Other CUDA Libraries

NOTIONAL INTEGRATION

SLIDE 11

So, what components make up the solution?

SOLUTION

SLIDE 12

“ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS”

  • NVIDIA Pascal Architecture
  • 5.5 TFLOPS Single-Precision (FP32) Performance
  • 22 TOPS INT8 Performance
  • 8 GB GPU Memory
  • 192 GB/s GPU Memory Bandwidth
  • Low-Profile PCI Express
  • 50 W / 75 W Max Power
  • http://www.nvidia.com/object/accelerate-inference.html
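One observation worth making explicit from these specs (my arithmetic, not a line from the deck): the INT8 path offers four times the op rate of FP32, which is what makes TensorRT's FP32-to-INT8 quantization worthwhile for inference.

```java
// Quoted P4 throughput: 22 TOPS at INT8 versus 5.5 TFLOPS at FP32,
// a 4x op-rate advantage for quantized inference.
public class P4Ratio {
    public static void main(String[] args) {
        double int8Tops = 22.0;
        double fp32Tflops = 5.5;
        System.out.println(int8Tops / fp32Tflops);
    }
}
```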

NVIDIA TESLA P4 INFERENCE ACCELERATOR

SLIDE 13

IMAGE CLASSIFICATION WITH ALEXNET USING CAFFE

We used CaffeNet, a pre-trained AlexNet model based on the ILSVRC 2012 dataset.

  • A good stand-in for more complex image models
  • Evaluated both CPU-only and GPU variants of Caffe to characterize performance

Key differences:

  • One Image Per Batch
  • Lightly modified to properly handle multithreading and CUDA streams

CAFFE

SLIDE 14

CUDA ACCELERATED COMPUTER VISION LIBRARY

Images were resized using GPU resources instead of CPU resources, and as a result it is not necessary to copy the resized image data to the input layer.

  • AlexNet input layer size is 224 × 224 px
  • Produces a GpuMat object for image data allocated from GPU memory
  • GpuMat wrapped to use as input layer for network, avoiding the need for an extra copy
  • Custom GpuMat allocators introduced in OpenCV 3.2.0

OPEN CV

SLIDE 15

HIGH PERFORMANCE DEEP LEARNING INFERENCE OPTIMIZER

TensorRT can load and optimize Caffe or TensorFlow models for high-performance inference. In this case, we used it to host the same Caffe model used for the image classification task.

  • FP32 to INT8 while minimizing accuracy loss
  • Better GPU Utilization
  • Kernel Autotuning
  • Improved Memory Footprint
  • Multi-stream Execution
  • Used unchanged Caffe model for Image classification

NVIDIA TENSORRT

SLIDE 16

MALCONV: MALWARE DETECTION WITH DEEP LEARNING

A convolutional neural network digests entire binaries for malware identification

  • A Custom Malware Identification Model
    - The current ingest framework leverages an un-accelerated predecessor
    - We can't use MalConv there because it's too computationally intense
  • Integration with PyTorch Will Require Some Work
    - No great inference layer is available for PyTorch
    - Model Translation with ONNX to Caffe2 (or Other?)
  • Currently a Work in Progress
    - How do the ergonomics differ from the image classification task?

PYTORCH

SLIDE 17

DEEP LEARNING INFERENCE VIA REST

The GRE provides memory and process isolation, along with native libraries for hardware access

  • Multi-threaded HTTP server in Golang
  • RESTful interface
  • Multithreaded Caffe
  • TensorRT – NVidia’s inference engine
  • CUDA-Accelerated OpenCV
  • Containerized in Docker
  • Framework for other inference engines
  • https://developer.nvidia.com/gre
  • https://github.com/NVIDIA/gpu-rest-engine

NVIDIA GPU REST ENGINE

SLIDE 18

SIMPLIFIED PACKAGING AND DEPLOYMENT VIA CONTAINERS

Packaging is performed in one environment and rapidly deployed to a large number of nodes.

  • Docker image building / testing on Amazon Elastic Compute Cloud
  • Test Environment on an Isolated Network
  • Install Docker, CUDA Libraries / Drivers and NVidia Docker and go.
  • Portability across Centos 7 Nodes
  • Supported Laptop Development of new analytics.
  • Worked out-of-the box for Caffe Models in Caffe / TensorRT
  • https://github.com/NVIDIA/nvidia-docker

NVIDIA DOCKER

SLIDE 19

We collected telemetry during evaluation with a suite of components we use for tracking system performance on production systems

  • CollectD / StatsD API with various plugins for CPU, Disk, Memory, IO
  • nvidia-smi for GPU information
  • Timely for metric storage, analysis
  • Grafana for visualization, dashboarding, analysis.
  • NVidia Data Center GPU Manager
  • Active Health Monitoring, Early Fault Detection (SMART for GPUs)
  • Power Management
  • Configuration & Reporting

INSTRUMENTATION

SLIDE 20

[Diagram: a single node running NVIDIA Docker; JVM worker threads make HTTP calls to the GPU REST Engine container, which drives TensorRT, Caffe and CUDA OpenCV libraries against the Tesla P4 and its GPU memory]

FINAL INTEGRATION

  • REST Calls from Java to GRE
  • Golang Coordinator
  • Copy Image to GpuMat
  • Resize GpuMat in OpenCV
  • Resized GpuMat Becomes the Input Layer
  • Calls to Framework for Inference
  • Caffe Reference Model Hosted in Caffe or TensorRT
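The Java-to-GRE hop in this flow reduces to a plain HTTP POST of raw image bytes. A minimal client sketch; the /api/classify path mirrors the GRE example server and the endpoint URL is an assumption here, not a documented contract:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the Java side of the final integration: POST image bytes
// to the GPU REST Engine and return its classification response body.
public class GreClient {
    public static String classify(String endpoint, byte[] imageBytes)
            throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(imageBytes);       // raw image, no multipart framing
        }
        try (InputStream is = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = is.read(chunk)) > 0)
                buf.write(chunk, 0, n);
            return buf.toString("UTF-8");
        }
    }
}
```

Keeping the client this thin is the point of the design: all GPU, CUDA and framework complexity stays inside the GRE container, and the JVM never loads a native library.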

SLIDE 21

What did we evaluate and observe?

EXPERIMENTS AND RESULTS

SLIDE 22

What effect does concurrency have on the ability to classify images? How quickly can we classify images using only the CPU? We processed 9,000 images through the ETL framework and GRE using CPU-only Caffe.

BASELINE CONCURRENCY TESTS WITH CAFFE CPU


Metric                          10 Threads   24 Threads   32 Threads
Total Elapsed Time (s)          271.65       175.59       416
Minimum Processing Time (ms)    239          465.8        619.2
Mean (ms)                       300          100.49       149.87
Max (ms)                        483          880          1066
CPU Max User (%)                83.0         99.8         100.0
GPU Max Utilization (%)         n/a          n/a          n/a

[Histogram: milliseconds per image by thread count (10, 24, 32)]

SLIDE 23

What effect does concurrency have on the ability to classify images? Can the CPU provide enough work to keep the GPUs busy?

CONCURRENCY TESTS WITH CAFFE GPU


Metric                          10 Threads   24 Threads   32 Threads
Total Elapsed Time (s)          37.425       38.451       38.153
Minimum Processing Time (ms)    7            8            10
Mean (ms)                       39.66        100.49       149.87
Max (ms)                        163          251          415
CPU Max User (%)                56           45           55
GPU Max Utilization (%)         82           79           81

[Histogram: milliseconds per image by thread count (10, 24, 32)]

SLIDE 24

How does TensorRT Performance Differ from Caffe CPU / Caffe GPU?

CONCURRENCY TESTS WITH TENSORRT


Metric                          10 Threads   24 Threads   32 Threads
Total Elapsed Time (s)          33.327       33.260       33.399
Minimum Processing Time (ms)    5            5            8
Mean (ms)                       35.01        86.30        116.08
Max (ms)                        188          258          416
CPU Max User (%)                47           54           57
GPU Max Utilization (%)         85           83           84

[Histogram: milliseconds per image by thread count (10, 24, 32)]

SLIDE 25

How does performance compare between TensorRT and Caffe?

  • TensorRT is generally faster and uses more of the GPU than Caffe.
  • The model loaded into TensorRT consumes 33% less GPU Memory.

TENSORRT VS CAFFE


Metric                          Caffe GPU, 10 Threads   TensorRT, 10 Threads
Total Elapsed Time (s)          37.425                  33.327
Minimum Processing Time (ms)    7                       5
Mean (ms)                       39.66                   35.01
Max (ms)                        163                     188
CPU Max User (%)                56                      47
GPU Max Utilization (%)         82                      85
GPU Memory Utilization (MB)     1339                    895

[Histogram: milliseconds per image by framework (Caffe GPU, TensorRT)]
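The memory claim checks out against the table; a one-line verification using the reported figures:

```java
// "33% less GPU memory": Caffe's resident model uses 1339 MB,
// TensorRT's optimized version of the same model uses 895 MB.
public class MemSavings {
    public static void main(String[] args) {
        double caffeMb = 1339;
        double trtMb = 895;
        double savings = (caffeMb - trtMb) / caffeMb;  // ~0.33
        System.out.printf("%.0f%%%n", savings * 100);
    }
}
```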

SLIDE 26

How does performance compare between TensorRT and Caffe?

  • TensorRT seems to handle concurrency better than Caffe

TENSORRT VS CAFFE


Metric                          Caffe GPU, 32 Threads   TensorRT, 32 Threads
Total Elapsed Time (s)          38.153                  33.399
Minimum Processing Time (ms)    10                      8
Mean (ms)                       149.87                  116.08
Max (ms)                        415                     416
CPU Max User (%)                58                      64
GPU Max Utilization (%)         81                      84
GPU Memory Utilization (MB)     1339                    895

[Histogram: milliseconds per image by framework (Caffe GPU, TensorRT)]

SLIDE 27

How does performance compare between TensorRT and Caffe?

  • Both TensorRT and Caffe GPU are considerably more performant than Caffe CPU

TENSORRT VS CAFFE


Metric                          Caffe CPU, 10 Threads   TensorRT, 10 Threads
Total Elapsed Time (s)          271.659                 33.327
Minimum Processing Time (ms)    239                     5
Mean (ms)                       280                     35.01
Max (ms)                        296                     188
CPU Max User (%)                83                      47
GPU Max Utilization (%)         n/a                     85

[Histogram: milliseconds per image by framework (Caffe CPU, Caffe GPU, TensorRT)]
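Converting the elapsed times into throughput makes the gap concrete. This assumes the GPU runs processed the same 9,000-image corpus as the CPU baseline, which the deck implies but does not state outright:

```java
// Throughput and speedup from the reported elapsed times.
public class Throughput {
    public static void main(String[] args) {
        double images = 9000;
        double cpuSecs = 271.659;  // Caffe CPU, 10 threads
        double trtSecs = 33.327;   // TensorRT, 10 threads
        System.out.printf("Caffe CPU: %.0f img/s%n", images / cpuSecs);
        System.out.printf("TensorRT:  %.0f img/s%n", images / trtSecs);
        System.out.printf("Speedup:   %.1fx%n", cpuSecs / trtSecs);
    }
}
```

That works out to roughly 33 images/s on the CPU versus roughly 270 images/s through TensorRT, about an 8x speedup from one low-profile 50 W card.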

SLIDE 28

How does power utilization compare across both TensorRT and Caffe?

  • They are roughly the same, with the Tesla P4 consuming 24 W at near idle
  • 45 W at 80% GPU utilization for the 50 W model
  • Low GPU memory utilization
  • An additional 1.8 kW per rack: increased power consumption of ~10%

POWER UTILIZATION


Framework                       Caffe                     TensorRT
Thread Count                    10      24      32        10      24      32
Min. Power Use (W)              24.26   24.55   25.13     24.36   23.78   24.36
Max. Power Use (W)              44.02   46.3    45.88     45.32   45.01   45.36
CPU Max User (%)                56      45      55        47      54      57
GPU Utilization (%)             82      79      81        85      83      84
GPU Max Memory Used (MB)        1339    1339    1341      895     895     895
GPU Max Memory Used (%)         ~16%                      ~10%
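The rack-level figure follows directly from the earlier architecture numbers: 40 nodes per rack, 15-16 kW per rack, and the observed ~45 W peak per card. A 16 kW budget is assumed in the sketch below:

```java
// Rack-level arithmetic behind the ~10% power-increase claim.
public class RackPower {
    public static void main(String[] args) {
        int nodes = 40;              // nodes per rack
        double wattsPerGpu = 45.0;   // observed peak draw per P4
        double rackKw = 16.0;        // assumed rack power budget
        double addedKw = nodes * wattsPerGpu / 1000.0;
        System.out.printf("%.1f kW, %.0f%% of rack%n",
                addedKw, addedKw / rackKw * 100);  // 1.8 kW, 11% of rack
    }
}
```

At the 15 kW end of the stated range the increase is 12%, so the slide's ~10% is a reasonable round figure.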

SLIDE 29

There’s much more to explore, what are some of the things we should tackle next?

  • Develop a better understanding of the drivers of GPU usage
  • What’s required to exceed 80% utilization?
  • Investigate relatively constant elapsed time
  • Where is there waste / overhead in this architecture?
  • Explore additional use-cases and models
  • Complete integration path for Torch-based Malware model
  • Evaluate multiple models in memory - how much GPU memory is needed?
  • Scale it out
  • Operational management of GPUs at scale: NVIDIA Data Center GPU Manager

WHAT’S NEXT?


SLIDE 30

  • Ken Singer & Jake Gingrich @ HP Enterprise, SGI Federal Systems
  • Rob Zuppert, Larry Brown and Brad Rees @ NVidia
  • Felix Abecassis & the GPU REST Engine Team @ NVidia
  • Edward Raff, Jared Sylvester & other MalConv Researchers @ UMD LPS
  • Sterling Foster & others @ US Department of Defense
  • Steven Mills, Data Solutions & Machine Intelligence Team @ Booz Allen Hamilton

THANK YOU


SLIDE 31

Find me at:

  • LinkedIn: https://www.linkedin.com/in/drewfarris/
  • Twitter: @drewfarris
  • Web: https://www.boozallen.com/expertise/analytics.html

QUESTIONS?
