SLIDE 1

Watson Cloud Platform Strategic Customer Success

Deep Learning Inferencing on IBM Cloud with NVIDIA TensorRT

Khoa Huynh – Senior Technical Staff Member (STSM), IBM
Larry Brown – Senior Software Engineer, IBM

SLIDE 2

Agenda

§ Introduction
§ Inferencing with PyCaffe
§ TensorRT Overview
§ TensorRT Implementation
§ Performance Results
§ Conclusions
§ Q & A

SLIDE 3

Introduction

§ AI, especially deep learning, has seen rapid advancement in recent years
§ The initial focus was on image processing, and has expanded to natural language, different neural network models, recurrent networks, and DL frameworks
§ Much attention has gone to developing networks and training models
§ Training is very compute-intensive – GPUs are nearly a necessity
§ As DL becomes mainstream, the focus is shifting to inferencing (use of the trained network)
§ An inferencing cloud service could handle requests from multiple users (see the batching sketch below)
– One request at a time, or collect a batch of requests to inference at once
– But it can wait only a short time to fill a batch, or latency suffers
– Or one user might submit a large number of images to classify at once
§ An inferencing cloud service needs
– quick response – the latency seen by the user
– to handle large volume – overall throughput
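To make the batching tradeoff concrete, here is a minimal sketch of a micro-batching loop an inferencing service might use; the queue, batch size, and timeout values are illustrative assumptions, not part of the talk:

import queue
import time

MAX_BATCH = 32        # assumed largest batch the engine accepts
MAX_WAIT_SEC = 0.005  # assumed longest time we will hold the first request

def gather_batch(request_queue):
    """Collect requests until the batch is full or the wait deadline expires."""
    batch = [request_queue.get()]          # block for at least one request
    deadline = time.time() + MAX_WAIT_SEC
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.time()
        if remaining <= 0:
            break                          # deadline hit: run a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break                          # queue drained before deadline
    return batch                           # hand this batch to the GPU at once

Raising MAX_WAIT_SEC fills batches more often (better throughput) but adds directly to the latency seen by the first request in each batch.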

SLIDE 4

IBM Cloud GPU Offerings

§ Bare Metal Servers
– Nvidia M60 GPUs (monthly & hourly)
– Nvidia K80 GPU PCIe cards (monthly & hourly)
– Nvidia P100 GPUs (monthly)
– Nvidia V100 GPUs (monthly)
§ Virtual Servers
– Nvidia P100 GPUs (monthly & hourly)
– Nvidia V100 (coming soon – monthly & hourly)
§ Deep Learning as a Service (DLaaS)
– Part of Watson Machine Learning (WML)
– Focused on deep learning training
– Allows users to run training jobs on a cluster of GPU-enabled machines using various frameworks
§ PowerAI
– Available in 2Q2018 with PowerAI R5
– Delivered through the IBM Cloud Catalog & supported by IBM Trusted Partner Nimbix
– On-demand cloud provisioning
– Containerized
– Native Distributed Deep Learning (DDL) and Large Model Support (LMS)

SLIDE 5

Inferencing with PyCaffe

§ Given a trained model, we want to use it to classify images.
§ Study the performance of various GPUs, FPGAs, etc.
§ We could have used C++, but we were more familiar with Python, so that was the language of choice.
§ Unlike training, a single GPU is used. Use multiple threads, processes, or services to take advantage of more GPUs if more volume is needed.

root@V100:~/infer_caffe# python infer_caffe.py -h
usage: infer_caffe.py [-h] -m MODEL -w WEIGHTS -l LMDB [-b BATCH]
                      [-i ITERATIONS] [-c CAFFEROOT] [--blobName BLOBNAME]
                      [--labels LABELS] [--meanImage MEANIMAGE] [--debug]
                      [--gpu] [--csvFile CSVFILE] [--quiet]

Use a trained Caffe model to classify images from a LMDB database.

SLIDE 6

infer_caffe Sample Output

root@V100:~/infer_caffe# python infer_caffe.py -m ~/model_zoo/caffe/vgg16/pretrained/VGG_ILSVRC_16_layers_deploy.prototxt -w ~/model_zoo/caffe/vgg16/pretrained/VGG_ILSVRC_16_layers.caffemodel -l /datasets/x86_LMDB/LMDB/ilsvrc12_val_lmdb/ -c /opt/nvidia/caffe-0.16/caffe -b 1 -i 5 --gpu --csvFile ./pycaffe.csv --quiet

Final Stats (times in seconds)
Date: 03/08/2018  Time: 09:05:54  Host: V100
Iterations: 5  Batch size: 1  Data type: NA
Total run time: 3.3347

Stats for all iterations
Total predictions: 5  Correct top 1 predictions: 5  Correct top 5 predictions: 5
Top 1 accuracy: 100.00%  Top 5 accuracy: 100.00%
Inference time -- Total: 0.1645  Mean: 0.0329  Min: 0.0076  Max: 0.0726  Range: 0.0649  STD: 0.0308  Median: 0.0079
Inference time/prediction: 0.0329
Images/sec: 30.40

SLIDE 7

Program Flow

Parse command line
...
# Create the neural network.
net = caffe.Net(model_def,      # defines the structure of the model
                model_weights,  # contains the trained weights
                caffe.TEST)     # use test mode (e.g., don't perform dropout)
...
for each iteration
    read a batch of images from LMDB
    # Call the network.
    out = net.forward()  # Time only this step.
output statistics

SLIDE 8

TensorRT Overview

§ Speeds up inferencing by
– Merging layers and tensors to reduce the size of the network, so fused operations execute in a single kernel.
– Selecting the best specialized kernel for the target hardware based on layer parameters and measured performance.
§ Stages
– Build: optimize the network (layers, weights, labels) to produce a runtime plan, or engine.
  • Optimization can take some time, so the resulting engine can be serialized to a file (see the sketch below).
– Deploy: run the engine with given input data to get the resulting predictions.
§ Supports Python and C++.
§ TensorRT Lite is a simplified interface for Python (not used here).
§ You can create the TRT network yourself, or use a TRT utility to import and convert a framework model into TRT form.
– Supported: Caffe, and UFF (Universal Framework Format) compatible frameworks such as TensorFlow.

SLIDE 9

Reduced Precision Inferencing

§ The model is trained in FLOAT (FP32).
§ TensorRT inferencing can use FLOAT, HALF, or INT8 (as supported by the GPU).
§ Reduced precision increases speed with little or no loss of accuracy.
– HALF may show some reduction in accuracy, but it is generally not noticeable.
– INT8 shows a small reduction in accuracy.
§ INT8 requires calibration files (see the toy sketch below).
– Calibration uses sample runs of data through the net to determine the range of FLOAT values encountered.
– It then maps that range to INT8's smaller range.
– A Caffe patch is available to easily generate these calibration files during a short training run when the environment variable TENSORRT_INT8_BATCH_DIRECTORY is set.
– NVIDIA suggests "For ImageNet networks, around 500 calibration images is adequate".
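As a toy illustration of the range mapping only: the sketch below uses simple symmetric max-scaling, whereas TensorRT's entropy calibrator chooses the clipping threshold to minimize information loss rather than just taking the observed maximum:

import numpy as np

def quantize_int8(x, calib_max):
    """Map FLOAT values in [-calib_max, calib_max] onto INT8's [-127, 127]."""
    scale = 127.0 / calib_max
    return np.clip(np.round(x * scale), -127, 127).astype(np.int8)

# "Calibration": observe the range of FP32 activations on sample data.
activations = (np.random.randn(1000) * 3.0).astype(np.float32)
calib_max = np.abs(activations).max()

q = quantize_int8(activations, calib_max)
restored = q.astype(np.float32) / (127.0 / calib_max)  # dequantize
print('max abs quantization error: %.4f' % np.abs(activations - restored).max())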

SLIDE 10

TensorRT Implementation

§ Re-implementation of infer_caffe.py using TRT instead of PyCaffe.
§ Shares code for getting images, collecting stats, and the overall flow.

root@V100:~/infer_caffe# python infer_caffe_trt.py -h
usage: infer_caffe_trt.py [-h] -m MODEL -w WEIGHTS -l LMDB [-b BATCH]
                          [-i ITERATIONS] [-c CAFFEROOT]
                          [--imageShape IMAGESHAPE] [--max_batch MAX_BATCH]
                          [--outputLayer OUTPUTLAYER] [--outputSize OUTPUTSIZE]
                          [--dtype {FLOAT,HALF,INT8}] [--labels LABELS]
                          [--meanImage MEANIMAGE] [--csvFile CSVFILE]
                          [--calBatchDir CALBATCHDIR]
                          [--firstCalBatch FIRSTCALBATCH]
                          [--numCalBatches NUMCALBATCHES] [--debug] [--quiet]

Uses NVidia TensorRT to optimize and run inference on a trained Caffe model performing an image recognition task and prints performance and accuracy results.
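As a usage example, an INT8 run combines the --dtype and calibration options from the usage text above; the paths and counts here are illustrative only:

root@V100:~/infer_caffe# python infer_caffe_trt.py \
    -m ~/model_zoo/caffe/vgg16/pretrained/VGG_ILSVRC_16_layers_deploy.prototxt \
    -w ~/model_zoo/caffe/vgg16/pretrained/VGG_ILSVRC_16_layers.caffemodel \
    -l /datasets/x86_LMDB/LMDB/ilsvrc12_val_lmdb/ \
    -b 25 -i 100 --dtype INT8 --calBatchDir ./cal_batches \
    --numCalBatches 20 --csvFile ./trt_int8.csv --quiet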

SLIDE 11

TRT Program Flow

# Imports assumed by this and the following slide (not shown in the deck).
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

# Create the engine using the TRT utilities for Caffe.
# Use the caffe model converter utility in tensorrt.utils.
# We provide it a logger, a path to the model prototxt, the model file, the max batch size,
# the max workspace size, the output layer(s), and the data type of the weights.
engine_dtype = trt.infer.DataType[dtype]
calibrator = None
if engine_dtype == trt.infer.DataType.INT8:
    calibrator = infer_utils.Calibrator.Calibrator(cal_batch_dir, first_cal_batch,
                                                   num_cal_batches, debug)
engine = trt.utils.caffe_to_trt_engine(trt_logger, model, weights, max_batch,
                                       1 << 25, [output_layer], engine_dtype,
                                       calibrator=calibrator)
...
# Allocate memory on the GPU with PyCUDA and register it with the engine.
# The size of the allocations is the size of the input and expected output * the batch size.
d_input = cuda.mem_alloc(batch_size * image_shape[0] * image_shape[1] * 3 *
                         np.dtype(np.float32).itemsize)
d_output = cuda.mem_alloc(batch_size * output.size * output.dtype.itemsize)
# The engine needs bindings provided as pointers to the GPU memory.
# PyCUDA lets us do this for memory allocations by casting those allocations to ints.
bindings = [int(d_input), int(d_output)]
# Create a CUDA stream to run inference in.
stream = cuda.Stream()

SLIDE 12

TRT Program Flow (continued)

# Time moving the data to the GPU, running the network, and getting the results
# back to the host as part of the inference operation for this iteration.
stats.begin_iteration()
# Transfer input data to the GPU.
cuda.memcpy_htod_async(d_input, batchin, stream)
# Execute the model.
context.enqueue(batch_size, bindings, stream.handle, None)
# Transfer predictions back.
cuda.memcpy_dtoh_async(output, d_output, stream)
# Synchronize threads.
stream.synchronize()
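One piece not shown on these slides is where context comes from; with this era's API it would have been created once from the built engine, presumably along these lines:

# Assumed setup step (not shown in the deck): create the execution context
# that enqueue() is called on, once, right after building the engine.
context = engine.create_execution_context()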

SLIDE 13

[Chart: V100 Caffe VGGNet16 TensorRT 3.0.1 Inference Latency – Average Latency/Batch (sec) vs. Batch Size (1 to 175); series: PyCaffe, FLOAT, HALF, INT8]

SLIDE 14

[Chart: V100 Caffe VGGNet16 TensorRT 3.0.1 Inference Throughput – Images/Second vs. Batch Size (1 to 175); series: PyCaffe, FLOAT, HALF, INT8]

SLIDE 15

V100 Caffe VGGNet16 TensorRT 3.0.1 Inference Accuracy (per cent)

           Top 1    Top 5
PyCaffe    64.65    85.77
FLOAT      64.65    85.77
HALF       64.65    85.77
INT8       64.37    85.49

SLIDE 16

[Chart: V100 Caffe VGGNet16 TensorRT 3.0.1 INT8 Inference Accuracy – Accuracy (per cent) vs. Number of Calibration Images (1 to 1000); series: Top 1, Top 5]

SLIDE 17

[Chart: Deep-Learning Model Inferencing – Image Classification with VGG-16 on Caffe (Single Precision). X-axis: Number of Images Processed Per Second (Inferencing) – higher is better. Series: PyCaffe, TensorRT, TensorRT (HALF2), TensorRT (INT8). Systems compared: 1 x Nvidia K80 GPU (SL Bare-Metal, Dual Xeon E5-2690v4); 1 x Nvidia P4 GPU (Intel Bare-Metal, Dual Xeon E5-2650v3); 1 x Nvidia P100 GPU (SL Bare-Metal, Dual Xeon E5-2690v4); 1 x Nvidia V100 GPU (SL Bare-Metal, Dual Xeon E5-2690v4); 1 x Intel DLIA (Intel Bare-Metal, Dual Xeon E5-2690v4); 1 x Nvidia K80 GPU (AWS p2.16xlarge, Dual Xeon E5-2686v4)]

Notes:

• Nvidia TensorRT with half-precision support improves DL inferencing performance by 5X on a V100 GPU
• The Nvidia P4 GPU is comparable to many FPGAs in terms of power consumption (30-70W)
SLIDE 18

[Chart: Deep Learning Model Inferencing – Image Classification. VGG-16 Neural Net on Caffe Framework with TensorRT (Single Precision Except Where Noted Otherwise). Latency Per Iteration (ms) vs. Batch Size – lower is better. Series: K80 GPU, P100 GPU, P100 GPU w/ HALF2, V100 GPU, V100 GPU w/ HALF2]

[Chart: Deep-Learning Model Inferencing (Image Classification) – Nvidia P4 GPU vs. FPGA for DL. VGG-16 Neural Net on Caffe Framework (Single Precision). Classification Power Efficiency (Images/Second/Watt) – higher is better. Series: Intel DLIA, Nvidia P4 GPU]

Notes:

• The P100 and V100 GPUs deliver much better latencies and handle much larger batch sizes than the K80 GPU
• The Nvidia P4 GPU is comparable to many FPGAs in terms of power consumption (30-70W)
• The Nvidia P100 GPU delivers 5X the inferencing performance of the previous-generation K80 GPU and can handle much larger batch sizes
• The Nvidia P4 GPU can deliver up to 50% of the inferencing performance at less than 25% of the power consumption of a P100 GPU, making the P4 a very cost-effective DL inferencing engine for a cloud platform

SLIDE 19

TensorRT Impressions

§ The Python interface was very important
§ The utility to convert a Caffe model to a TRT network saved much work
– Building the network with native TRT calls is much more involved
– But it allows flexibility and customization for those who need it
§ INT8 calibration was challenging
– The documentation was not complete
– We found a good blog that referenced classes not in the doc
  • https://devblogs.nvidia.com/parallelforall/int8-inference-autonomous-vehicles-tensorrt/
– You need to write a Calibrator implementation by extending a class (see the sketch below)
  • class Calibrator(trt.infer.Int8EntropyCalibrator): # from the doc – did not work
  • class Calibrator(trt.infer.EntropyCalibrator): # from the blog – did work
§ Overall, for Caffe, our experience was good. Other frameworks might require more use of lower-level TRT calls.
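For reference, a minimal sketch of what such a calibrator might look like, assuming the TensorRT 3-era Python interface described in the blog; the method names, signatures, and binding handling below are our assumptions about that interface, not the actual infer_utils.Calibrator code used in this work:

import glob
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

class Calibrator(trt.infer.EntropyCalibrator):  # the base class that worked
    """Feeds pre-saved calibration batches (e.g., files dumped by the Caffe
    patch via TENSORRT_INT8_BATCH_DIRECTORY) to TensorRT during calibration."""

    def __init__(self, batch_dir, batch_size):
        trt.infer.EntropyCalibrator.__init__(self)
        self.files = sorted(glob.glob(batch_dir + '/batch*'))  # assumed naming
        self.batch_size = batch_size
        self.index = 0
        self.d_input = None

    def get_batch_size(self):
        return self.batch_size  # must match how the batches were generated

    def get_batch(self, bindings, names):
        if self.index >= len(self.files):
            return None  # no more batches: calibration is finished
        data = np.fromfile(self.files[self.index], dtype=np.float32)
        self.index += 1
        if self.d_input is None:
            self.d_input = cuda.mem_alloc(data.nbytes)
        cuda.memcpy_htod(self.d_input, data)  # copy the batch to the GPU
        bindings[0] = int(self.d_input)       # hand TensorRT the device pointer
        return bindings

    def read_calibration_cache(self, length):
        return None  # always recalibrate in this sketch

    def write_calibration_cache(self, ptr, size):
        pass  # a real implementation could cache the calibration table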

SLIDE 20

Conclusions

§ The V100 GPU is faster than anything else we tested
§ Nvidia TensorRT with half-precision support improves DL inferencing performance by 5X on a V100 GPU
§ TensorRT is much faster than PyCaffe
§ HALF and INT8 are significantly faster than FLOAT with little or no loss of accuracy
§ INT8 is slightly better than HALF at batch size 1
– Latency: 1.8 vs. 2.0 msec
– Throughput: 558 vs. 505 images/sec
§ HALF has better throughput and latency at larger batch sizes
§ It is surprising that INT8 accuracy held up so well when calibration used only 1 image
– We still suggest following the guidance of using more images, as calibration is not that slow and need only be done once
§ Experiment with your own network and data, as your results could vary

SLIDE 21

Thank You

§ Khoa Huynh

– Senior Technical Staff Member (STSM), IBM
– khoa@us.ibm.com

§ Larry Brown

– Senior Software Engineer, IBM
– ltbrown@us.ibm.com
