

SLIDE 1

GPU INFERENCE

IN THE DATACENTER

Drew Farris, Chief Technologist @ Booz Allen Hamilton
NVIDIA GPU Technology Conference, Washington DC

NOVEMBER 2017

Eglin AFB, FL

SLIDE 2

MICROPROCESSORS NO LONGER SCALE AT THE LEVEL OF PERFORMANCE THEY USED TO — THE END OF WHAT YOU WOULD CALL MOORE’S LAW, SEMICONDUCTOR PHYSICS PREVENTS US FROM TAKING DENNARD SCALING ANY FURTHER.

  • Jen-Hsun Huang, CEO, NVIDIA

1

Booz Allen Hamilton

SLIDE 3

THE DAYS OF EASY PERFORMANCE GAINS ARE GONE

We need alternatives to general-purpose CPU computation

GRAPHICS PROCESSING UNITS PROVIDE AN ALTERNATIVE

  • Algebraic Strengths of GPUs
  • Enable New Algorithms
  • Adaptable to a Variety of Tasks
  • Readily Available Libraries (CUDA, CV/DL Frameworks)

HOW CAN WE LEVERAGE OUR EXISTING INVESTMENTS?

  • HPC vs. Commodity Hardware in the Datacenter
  • Evolution vs. Revolution
  • Scaling Out vs. Scaling Up

INTRODUCTION

SLIDE 4

WE HAVE A PROBLEM

How to apply complex algorithms as a part of our ingest process?

  • Computationally Expensive Algorithms
  • Heterogeneous Dataflow
  • Horizontal / Linear Scalability at datacenter level.

How to accommodate this within our existing compute fabric?

  • Hadoop Clusters: HDFS, YARN, etc.
  • Commodity Nodes
  • Small Numbers of Special Purpose Nodes
  • 10G Interconnects
  • Cost, Power, Space and Cooling

NO SUPERCOMPUTERS, NO MODEL TRAINING

  • Power-Efficient GPUs (50-75 W)
  • We could focus on model inference
  • Application of models to new data, e.g., classification

THE REALITY

SLIDE 5

DATACENTER ARCHITECTURE


SINGLE NODE

  • 128 GB RAM
  • 12x Drives
  • 24 Cores / 48 HT
  • PCI Express Slot?

SINGLE RACK

  • 40 Nodes
  • 10G ToR Switch
  • 15-16 KW

DATACENTER

  • Many Racks
  • Interconnect
  • Rows?
SLIDE 6

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS

As a part of the data ingest pipeline, this system must extract and analyze data in a wide variety of formats and perform normalization in order to prepare for indexing.

  • Unpacking, Uncompressing Archives
    - Zip, Tar, etc.
  • Converting Binary Formats to Text
    - Word, PDF
  • Extracting Metadata
    - EXIF data from images
  • Classifying Images / Segmenting Images / Detecting Objects
    - Search Images using Text
  • Optical Character Recognition
    - Extract Text from Scanned Documents
  • Detecting Malware
    - Executables, PDFs, RTF

The heterogeneous nature of this data was a problem: computationally expensive analysis of complex data would disrupt latency across all datatypes.

DATA EXTRACTION PIPELINE

SLIDE 7

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS

Some of these tasks are straightforward to accelerate using GPUs. So we decided to start with the following:

  • Unpacking, Uncompressing Archives
    - Zip, Tar, etc.
  • Converting Binary Formats to Text
    - Word, PDF
  • Extracting Metadata
    - EXIF data from images
  • Classifying Images / Segmenting Images / Detecting Objects
    - Search Images using Text
  • Optical Character Recognition
    - Extract Text from Scanned Documents
  • Detecting Malware
    - Executables, PDFs, RTF

GPU ACCELERATED DATA EXTRACTION PIPELINE

SLIDE 8

CPU, MEMORY AND THROUGHPUT

In order to scale linearly as we add more resources, our system must have the following characteristics:

  • Shared Nothing vs. Share Little
  • Stateless vs. Minimally Stateful
  • CPU Bound (No Disk IO, Network IO, Bus or Memory Bottlenecks)
  • Uniformly Fast: Individual Document Processing in 10-100 ms
  • RAM Frugal: No Large Models in Memory
  • Something That Plays Well with Java
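These requirements describe a familiar pattern: a fixed pool of stateless workers, each owning a document end to end with no shared mutable state, so throughput scales with thread (and node) count. A minimal sketch of that shape; the processDocument stand-in is hypothetical, not the real pipeline:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stateless worker-pool sketch: each task is a pure function of its
// input document, so there is nothing to share and nothing to lock.
public class StatelessPool {
    // Stand-in for per-document work (unpack, convert, classify, ...).
    static String processDocument(String doc) {
        return doc.toUpperCase();
    }

    static List<String> processAll(List<String> docs, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> futures = new ArrayList<>();
        for (String d : docs)
            futures.add(pool.submit(() -> processDocument(d)));
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures)
            results.add(f.get());   // preserves submission order
        pool.shutdown();
        return results;
    }
}
```

Because each task touches only its own input, adding threads or nodes raises throughput until some shared resource (disk, network, bus) becomes the bottleneck.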

DATA EXTRACTION REQUIREMENTS

SLIDE 9

What plays well with Java? Or: how the heck do we get it to talk to the CUDA libraries?

  • Pure Java
    - No GPU acceleration
  • Java Native Interface (or some derivative)
    - Hand-wrapped API calls (JNI, JNA)
    - JavaCPP (Java and native C++ bridge)
    - Deeplearning4j cuDNN integration (as of 0.9.1)
    - TensorFlow Java API
  • External Processes
    - Forked Executable
    - Shared Memory
    - Sockets (TCP, UDP, Raw, etc.)
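Of these options, the forked-executable route is the simplest to sketch: the JVM launches an external process and reads its stdout. A minimal, hypothetical example using ProcessBuilder, with `echo` standing in for a native classifier binary:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// "Forked executable" integration sketch: launch an external process,
// capture its combined output, and return it to the calling thread.
public class ForkedExeSketch {
    static String runExternal(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd)
                .redirectErrorStream(true)  // merge stderr into stdout
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null)
                out.append(line);
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // In a real deployment cmd would point at a classifier binary.
        System.out.println(runExternal("echo", "label=cat score=0.93"));
    }
}
```

The per-call fork cost is why this option loses to a long-lived external process behind sockets or HTTP once request rates climb.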

INTEGRATION OPTIONS


[Diagram: a single node's JVM, with heap and worker threads on the CPU and memory, reaching native code via JNI libraries, wrapped libraries, Java libraries, or forked executables, backed by local storage]

SLIDE 10

[Diagram: the same single node extended with a GPU and GPU memory; candidate integration points (JNI lib?, wrapped lib?, forked exe?, Java lib?) connect the JVM threads to TensorRT, CUDA, OpenCV and Caffe libraries]

What do we want to be able to do?

  • Multiple Library or Framework Support
    - Caffe / TensorFlow / Torch / Others
  • CUDA-Accelerated OpenCV
  • TensorRT
  • Other CUDA Libraries

NOTIONAL INTEGRATION

SLIDE 11

So, what components make up the solution?

SOLUTION

SLIDE 12

“ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS”

  • NVIDIA Pascal Architecture
  • 5.5 TFLOPS Single-Precision (FP32) Performance
  • 22 TOPS INT8 Performance
  • 8 GB GPU Memory
  • 192 GB/s GPU Memory Bandwidth
  • Low-Profile PCI Express
  • 50 W / 75 W Max Power
  • http://www.nvidia.com/object/accelerate-inference.html
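One observation worth making explicit from these specs (my arithmetic, not a line from the deck): the INT8 path offers four times the op rate of FP32, which is what makes TensorRT's FP32-to-INT8 quantization worthwhile for inference.

```java
// Quoted P4 throughput: 22 TOPS at INT8 versus 5.5 TFLOPS at FP32,
// a 4x op-rate advantage for quantized inference.
public class P4Ratio {
    public static void main(String[] args) {
        double int8Tops = 22.0;
        double fp32Tflops = 5.5;
        System.out.println(int8Tops / fp32Tflops);
    }
}
```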

NVIDIA TESLA P4 INFERENCE ACCELERATOR

SLIDE 13

IMAGE CLASSIFICATION WITH ALEXNET USING CAFFE

We used CaffeNet, a pre-trained AlexNet model based on the ILSVRC 2012 dataset.

  • A good stand-in for more complex image models
  • Evaluated both CPU-only and GPU variants of Caffe to characterize performance

Key differences:

  • One Image Per Batch
  • Lightly modified to properly handle multithreading and CUDA streams

CAFFE

SLIDE 14

CUDA ACCELERATED COMPUTER VISION LIBRARY

Images were resized using GPU resources instead of CPU resources, and as a result it is not necessary to copy the resized image data to the input layer.

  • AlexNet input layer size is 224 × 224 px
  • Produces a GpuMat object for image data allocated from GPU memory
  • GpuMat wrapped to use as input layer for network, avoiding the need for an extra copy
  • Custom GpuMat allocators introduced in OpenCV 3.2.0

OPEN CV

SLIDE 15

HIGH PERFORMANCE DEEP LEARNING INFERENCE OPTIMIZER

TensorRT can load and optimize Caffe or TensorFlow models for high-performance inference. In this case, we used it to host the same Caffe model used for the image classification task.

  • FP32 to INT8 while minimizing accuracy loss
  • Better GPU Utilization
  • Kernel Autotuning
  • Improved Memory Footprint
  • Multi-stream Execution
  • Used unchanged Caffe model for Image classification

NVIDIA TENSORRT

SLIDE 16

MALCONV: MALWARE DETECTION WITH DEEP LEARNING

A convolutional neural network digests entire binaries for malware identification

  • A Custom Malware Identification Model
    - The current ingest framework leverages an un-accelerated predecessor
    - We can't use MalConv there because it's too computationally intense
  • Integration with PyTorch Will Require Some Work
    - No great inference layer is available for PyTorch
    - Model Translation with ONNX to Caffe2 (or Other?)
  • Currently a Work in Progress
    - How do the ergonomics differ from the image classification task?

PYTORCH

SLIDE 17

DEEP LEARNING INFERENCE VIA REST

The GRE provides memory and process isolation, along with native libraries for hardware access

  • Multi-threaded HTTP server in Golang
  • RESTful interface
  • Multithreaded Caffe
  • TensorRT – NVidia’s inference engine
  • CUDA-Accelerated OpenCV
  • Containerized in Docker
  • Framework for other inference engines
  • https://developer.nvidia.com/gre
  • https://github.com/NVIDIA/gpu-rest-engine

NVIDIA GPU REST ENGINE

SLIDE 18

SIMPLIFIED PACKAGING AND DEPLOYMENT VIA CONTAINERS

Packaging is performed in one environment and rapidly deployed to a large number of nodes.

  • Docker image building / testing on Amazon Elastic Compute Cloud
  • Test Environment on an Isolated Network
  • Install Docker, CUDA Libraries / Drivers and NVidia Docker and go.
  • Portability across Centos 7 Nodes
  • Supported Laptop Development of new analytics.
  • Worked out-of-the box for Caffe Models in Caffe / TensorRT
  • https://github.com/NVIDIA/nvidia-docker

NVIDIA DOCKER

SLIDE 19

We collected telemetry during evaluation with a suite of components we use for tracking system performance on production systems

  • CollectD / StatsD API with various plugins for CPU, Disk, Memory, IO
  • nvidia-smi for GPU information
  • Timely for metric storage, analysis
  • Grafana for visualization, dashboarding, analysis.
  • NVidia Data Center GPU Manager
  • Active Health Monitoring, Early Fault Detection (SMART for GPUs)
  • Power Management
  • Configuration & Reporting

INSTRUMENTATION

SLIDE 20

[Diagram: a single node running NVIDIA Docker; JVM worker threads make HTTP calls to the GPU REST Engine container, which drives TensorRT, Caffe and CUDA OpenCV libraries against the Tesla P4 and its GPU memory]

FINAL INTEGRATION

  • REST Calls from Java to GRE
  • Golang Coordinator
  • Copy Image to GpuMat
  • Resize GpuMat in OpenCV
  • Resized GpuMat Becomes the Input Layer
  • Calls to Framework for Inference
  • Caffe Reference Model Hosted in Caffe or TensorRT
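The Java-to-GRE hop in this flow reduces to a plain HTTP POST of raw image bytes. A minimal client sketch; the /api/classify path mirrors the GRE example server and the endpoint URL is an assumption here, not a documented contract:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the Java side of the final integration: POST image bytes
// to the GPU REST Engine and return its classification response body.
public class GreClient {
    public static String classify(String endpoint, byte[] imageBytes)
            throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(imageBytes);       // raw image, no multipart framing
        }
        try (InputStream is = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = is.read(chunk)) > 0)
                buf.write(chunk, 0, n);
            return buf.toString("UTF-8");
        }
    }
}
```

Keeping the client this thin is the point of the design: all GPU, CUDA and framework complexity stays inside the GRE container, and the JVM never loads a native library.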

SLIDE 21

What did we evaluate and observe?

EXPERIMENTS AND RESULTS

SLIDE 22

What effect does concurrency have on the ability to classify images? How quickly can we classify images using only the CPU? We processed 9,000 images through the ETL framework and GRE using CPU-only Caffe.

BASELINE CONCURRENCY TESTS WITH CAFFE CPU


Metric                          10 Threads   24 Threads   32 Threads
Total Elapsed Time (s)          271.65       175.59       416
Minimum Processing Time (ms)    239          465.8        619.2
Mean (ms)                       300          100.49       149.87
Max (ms)                        483          880          1066
CPU Max User (%)                83.0         99.8         100.0
GPU Max Utilization (%)         n/a          n/a          n/a

[Histogram: milliseconds per image by thread count (10, 24, 32)]

SLIDE 23

What effect does concurrency have on the ability to classify images? Can the CPU provide enough work to keep the GPUs busy?

CONCURRENCY TESTS WITH CAFFE GPU


Metric                          10 Threads   24 Threads   32 Threads
Total Elapsed Time (s)          37.425       38.451       38.153
Minimum Processing Time (ms)    7            8            10
Mean (ms)                       39.66        100.49       149.87
Max (ms)                        163          251          415
CPU Max User (%)                56           45           55
GPU Max Utilization (%)         82           79           81

[Histogram: milliseconds per image by thread count (10, 24, 32)]

SLIDE 24

How does TensorRT Performance Differ from Caffe CPU / Caffe GPU?

CONCURRENCY TESTS WITH TENSORRT


Metric                          10 Threads   24 Threads   32 Threads
Total Elapsed Time (s)          33.327       33.260       33.399
Minimum Processing Time (ms)    5            5            8
Mean (ms)                       35.01        86.30        116.08
Max (ms)                        188          258          416
CPU Max User (%)                47           54           57
GPU Max Utilization (%)         85           83           84

[Histogram: milliseconds per image by thread count (10, 24, 32)]

SLIDE 25

How does performance compare between TensorRT and Caffe?

  • TensorRT is generally faster and uses more of the GPU than Caffe.
  • The model loaded into TensorRT consumes 33% less GPU Memory.

TENSORRT VS CAFFE


Metric                          Caffe GPU, 10 Threads   TensorRT, 10 Threads
Total Elapsed Time (s)          37.425                  33.327
Minimum Processing Time (ms)    7                       5
Mean (ms)                       39.66                   35.01
Max (ms)                        163                     188
CPU Max User (%)                56                      47
GPU Max Utilization (%)         82                      85
GPU Memory Utilization (MB)     1339                    895

[Histogram: milliseconds per image by framework (Caffe GPU, TensorRT)]
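The memory claim checks out against the table; a one-line verification using the reported figures:

```java
// "33% less GPU memory": Caffe's resident model uses 1339 MB,
// TensorRT's optimized version of the same model uses 895 MB.
public class MemSavings {
    public static void main(String[] args) {
        double caffeMb = 1339;
        double trtMb = 895;
        double savings = (caffeMb - trtMb) / caffeMb;  // ~0.33
        System.out.printf("%.0f%%%n", savings * 100);
    }
}
```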

SLIDE 26

How does performance compare between TensorRT and Caffe?

  • TensorRT seems to handle concurrency better than Caffe

TENSORRT VS CAFFE


Metric                          Caffe GPU, 32 Threads   TensorRT, 32 Threads
Total Elapsed Time (s)          38.153                  33.399
Minimum Processing Time (ms)    10                      8
Mean (ms)                       149.87                  116.08
Max (ms)                        415                     416
CPU Max User (%)                58                      64
GPU Max Utilization (%)         81                      84
GPU Memory Utilization (MB)     1339                    895

[Histogram: milliseconds per image by framework (Caffe GPU, TensorRT)]

SLIDE 27

How does performance compare between TensorRT and Caffe?

  • Both TensorRT and Caffe GPU are considerably more performant than Caffe CPU

TENSORRT VS CAFFE


Metric                          Caffe CPU, 10 Threads   TensorRT, 10 Threads
Total Elapsed Time (s)          271.659                 33.327
Minimum Processing Time (ms)    239                     5
Mean (ms)                       280                     35.01
Max (ms)                        296                     188
CPU Max User (%)                83                      47
GPU Max Utilization (%)         n/a                     85

[Histogram: milliseconds per image by framework (Caffe CPU, Caffe GPU, TensorRT)]
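Converting the elapsed times into throughput makes the gap concrete. This assumes the GPU runs processed the same 9,000-image corpus as the CPU baseline, which the deck implies but does not state outright:

```java
// Throughput and speedup from the reported elapsed times.
public class Throughput {
    public static void main(String[] args) {
        double images = 9000;
        double cpuSecs = 271.659;  // Caffe CPU, 10 threads
        double trtSecs = 33.327;   // TensorRT, 10 threads
        System.out.printf("Caffe CPU: %.0f img/s%n", images / cpuSecs);
        System.out.printf("TensorRT:  %.0f img/s%n", images / trtSecs);
        System.out.printf("Speedup:   %.1fx%n", cpuSecs / trtSecs);
    }
}
```

That works out to roughly 33 images/s on the CPU versus roughly 270 images/s through TensorRT, about an 8x speedup from one low-profile 50 W card.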

SLIDE 28

How does power utilization compare across both TensorRT and Caffe?

  • They are roughly the same, with the Tesla P4 consuming 24 W at near idle
  • 45 W at 80% GPU utilization for the 50 W model
  • Low GPU memory utilization
  • An additional 1.8 kW per rack: increased power consumption of ~10%

POWER UTILIZATION


Framework                       Caffe                     TensorRT
Thread Count                    10      24      32        10      24      32
Min. Power Use (W)              24.26   24.55   25.13     24.36   23.78   24.36
Max. Power Use (W)              44.02   46.3    45.88     45.32   45.01   45.36
CPU Max User (%)                56      45      55        47      54      57
GPU Utilization (%)             82      79      81        85      83      84
GPU Max Memory Used (MB)        1339    1339    1341      895     895     895
GPU Max Memory Used (%)         ~16%                      ~10%
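The rack-level figure follows directly from the earlier architecture numbers: 40 nodes per rack, 15-16 kW per rack, and the observed ~45 W peak per card. A 16 kW budget is assumed in the sketch below:

```java
// Rack-level arithmetic behind the ~10% power-increase claim.
public class RackPower {
    public static void main(String[] args) {
        int nodes = 40;              // nodes per rack
        double wattsPerGpu = 45.0;   // observed peak draw per P4
        double rackKw = 16.0;        // assumed rack power budget
        double addedKw = nodes * wattsPerGpu / 1000.0;
        System.out.printf("%.1f kW, %.0f%% of rack%n",
                addedKw, addedKw / rackKw * 100);  // 1.8 kW, 11% of rack
    }
}
```

At the 15 kW end of the stated range the increase is 12%, so the slide's ~10% is a reasonable round figure.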

SLIDE 29

There’s much more to explore, what are some of the things we should tackle next?

  • Develop a better understanding of the drivers of GPU usage
  • What’s required to exceed 80% utilization?
  • Investigate relatively constant elapsed time
  • Where is there waste / overhead in this architecture?
  • Explore additional use-cases and models
  • Complete integration path for Torch-based Malware model
  • Evaluate multiple models in memory - how much GPU memory is needed?
  • Scale it out
  • Operational management of GPUs at scale: NVIDIA Data Center GPU Manager

WHAT’S NEXT?


SLIDE 30

  • Ken Singer & Jake Gingrich @ HP Enterprise, SGI Federal Systems
  • Rob Zuppert, Larry Brown and Brad Rees @ NVidia
  • Felix Abecassis & the GPU REST Engine Team @ NVidia
  • Edward Raff, Jared Sylvester & other MalConv Researchers @ UMD LPS
  • Sterling Foster & others @ US Department of Defense
  • Steven Mills, Data Solutions & Machine Intelligence Team @ Booz Allen Hamilton

THANK YOU


SLIDE 31

Find me at:

  • LinkedIn: https://www.linkedin.com/in/drewfarris/
  • Twitter: @drewfarris
  • Web: https://www.boozallen.com/expertise/analytics.html

QUESTIONS?
