

slide-1
SLIDE 1

DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU

Xu Tianhao, Deep Learning Solution Architect, NVIDIA

slide-2
SLIDE 2

2

AGENDA

  • TensorRT Hyperscale Inference Platform overview
  • TensorRT Inference Server
  • Overview and Deep Dive: Key features
  • Deployment possibilities: Generic deployment ecosystem
  • Hands-on
  • NVIDIA BERT Overview
  • FasterTransformer and TRT optimized BERT inference
  • Deploy BERT TensorFlow model with custom op
  • Deploy BERT TensorRT model with plugins
  • Benchmarking
  • Open Discussion

DEEP INTO TRTIS

slide-3
SLIDE 3

3

TENSORRT HYPERSCALE INFERENCE PLATFORM

  • World’s most advanced scale-out GPU
  • Integrated into TensorFlow & ONNX support
  • TensorRT Inference Server

slide-4
SLIDE 4

4

ANNOUNCING TESLA T4

WORLD’S MOST ADVANCED INFERENCE GPU

Universal Inference Acceleration
  • 320 Turing Tensor Cores
  • 2,560 CUDA cores
  • 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
  • 16 GB | 320 GB/s

slide-5
SLIDE 5

5

NEW TURING TENSOR CORE

MULTI-PRECISION FOR AI INFERENCE 65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4

slide-6
SLIDE 6

6

WORLD’S MOST PERFORMANT INFERENCE PLATFORM

Up To 36X Faster Than CPUs | Accelerates All AI Workloads

[Charts: speedup vs. CPU server, CPU vs. Tesla P4 vs. Tesla T4]
  • Natural Language Processing Inference (GNMT): CPU 1.0x | P4 10x | T4 36x
  • Speech Inference (DeepSpeech 2): CPU 1.0x | P4 4x | T4 21x
  • Video Inference (ResNet-50, 7 ms latency limit): CPU 1.0x | P4 10x | T4 27x
  • Peak Performance (TFLOPS / TOPS): P4 5.5 Float, 22 INT8 | T4 65 Float, 130 INT8, 260 INT4

slide-7
SLIDE 7

7

NVIDIA TENSORRT OVERVIEW

From Every Framework, Optimized For Each Target Platform

TensorRT target platforms: Tesla V100, DRIVE PX 2, NVIDIA T4, Jetson TX2, NVIDIA DLA

slide-8
SLIDE 8

8

NVIDIA TENSORRT OVERVIEW

From Every Framework, Optimized For Each Target Platform

  • Quantized INT8 (precision optimization): significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss
  • Layer Fusion (graph optimization): improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution
  • Kernel Auto-Tuning: optimizes execution time by choosing the best data layout and best parallel algorithms for the target Jetson, Tesla, or DRIVE PX GPU platform
  • Dynamic Tensor Memory (memory optimization): reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage
  • Multi-Stream Execution: scales to multiple input streams by processing them in parallel using the same model and weights
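As a rough illustration of how these optimizations are requested in practice, the sketch below builds an FP16 engine from an ONNX model with the TensorRT Python API of this era; the model path and batch size are placeholders, not values from the slides:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Build an FP16 engine from an ONNX model; layer fusion, kernel auto-tuning
# and memory planning are applied automatically during build_cuda_engine().
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:            # placeholder model file
    parser.parse(f.read())

builder.max_batch_size = 8                     # placeholder batch size
builder.max_workspace_size = 1 << 30           # scratch space for kernel auto-tuning
builder.fp16_mode = True                       # reduced precision (INT8 additionally needs calibration)

engine = builder.build_cuda_engine(network)
with open("model.plan", "wb") as f:
    f.write(engine.serialize())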

slide-9
SLIDE 9

9

GRAPH OPTIMIZATION

Un-optimized network
[Diagram: Inception-style block - input, 1x1/3x3/5x5 conv + bias + ReLU branches, max pool, concat, next input]

  • Vertical Fusion
  • Horizontal Fusion
  • Layer Elimination

Network        Layers before   Layers after
VGG19          43              27
Inception V3   309             113
ResNet-152     670             159

slide-10
SLIDE 10

10

GRAPH OPTIMIZATION

TensorRT-optimized network
[Diagram: the same block after fusion - 1x1/3x3/5x5 CBR (conv + bias + ReLU) fused nodes, max pool, concat, next input]

  • Vertical Fusion
  • Horizontal Fusion
  • Layer Elimination

Network        Layers before   Layers after
VGG19          43              27
Inception V3   309             113
ResNet-152     670             159

slide-11
SLIDE 11

11

TENSORRT PERFORMANCE

40x Faster CNNs on V100 vs. CPU-Only Under 7ms Latency (ResNet50)
140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

ResNet50 inference (throughput @ latency):
  CPU-Only: 140 images/sec @ 14 ms | V100 + TensorFlow: 305 images/sec @ 6.67 ms | V100 + TensorRT: 5,700 images/sec @ 6.83 ms

  Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, same CPU. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512.

OpenNMT 692M inference (throughput @ latency):
  CPU-Only + Torch: 4 sentences/sec @ 280 ms | V100 + Torch: 25 sentences/sec @ 153 ms | V100 + TensorRT: 550 sentences/sec @ 117 ms

  Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, same GPU/CPU. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on.

developer.nvidia.com/tensorrt

slide-12
SLIDE 12

12

AGENDA

  • TensorRT Hyperscale Inference Platform overview
  • TensorRT Inference Server
  • Overview and Deep Dive: Key features
  • Deployment possibilities: Generic deployment ecosystem
  • Hands-on
  • NVIDIA BERT Overview
  • FasterTransformer and TRT optimized BERT inference
  • Deploy BERT TensorFlow model with custom op
  • Deploy BERT TensorRT model with plugins
  • Benchmarking
  • Open Discussion

DEEP INTO TRTIS

slide-13
SLIDE 13

13

INEFFICIENCY LIMITS INNOVATION

Difficulties with Deploying Data Center Inference

  • Single Model Only: some systems are overused while others are underutilized
  • Single Framework Only: solutions can only support models from one framework
  • Custom Development: developers need to reinvent the plumbing for every application (ASR, NLP, recommender, ...)

slide-14
SLIDE 14

14

NVIDIA TENSORRT INFERENCE SERVER

Architected for Maximum Datacenter Utilization

  • Maximize real-time inference performance of GPUs
  • Quickly deploy and manage multiple models per GPU per node
  • Easily scale to heterogeneous GPUs and multi-GPU nodes
  • Integrates with orchestration systems and auto-scalers via latency and health metrics
  • Now open source for thorough customization and integration

[Diagram: TensorRT Inference Server running on nodes with NVIDIA T4, Tesla V100, and Tesla P4 GPUs]

slide-15
SLIDE 15

15

FEATURES

Utilization | Usability | Performance | Customization

Dynamic Batching
  Inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA

Concurrent Model Execution
  Multiple models (or multiple instances of the same model) may execute on the GPU simultaneously

CPU Model Inference Execution
  Framework-native models can execute inference requests on the CPU

Multiple Model Format Support
  PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow+TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT Plans, Caffe2 NetDef (ONNX import path)

Metrics
  Utilization, count, memory, and latency

Model Control API
  Explicitly load/unload models into and out of TRTIS based on changes made in the model-control configuration

System/CUDA Shared Memory
  Inputs/outputs that need to be passed to/from TRTIS can be placed in system/CUDA shared memory; reduces HTTP/gRPC overhead

Library Version
  Link against libtrtserver.so to include all the inference server functionality directly in your application

Custom Backend
  A custom backend gives the user more flexibility by providing their own implementation of an execution engine through the use of a shared library

Model Ensemble
  Pipeline of one or more models and the connection of input and output tensors between those models (can be used with custom backends)

Streaming API
  Built-in support for audio streaming input, e.g. for speech recognition

slide-16
SLIDE 16

16

INFERENCE SERVER ARCHITECTURE

Models supported

  • TensorFlow GraphDef/SavedModel
  • TensorFlow and TensorRT GraphDef
  • TensorRT Plans
  • Caffe2 NetDef (ONNX import)
  • ONNX graph
  • PyTorch JIT (.pt)

Multi-GPU support | Concurrent model execution | HTTP REST API / gRPC | Python/C++ client libraries

Available with Monthly Updates
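For a quick sanity check against a running server, the HTTP endpoints of this TRTIS generation can be probed directly; a sketch assuming the default HTTP port 8000:

# Liveness / readiness, as used by orchestrators and load balancers
curl -s localhost:8000/api/health/live
curl -s localhost:8000/api/health/ready

# Server status, including the state of each loaded model
curl -s localhost:8000/api/status
curl -s localhost:8000/api/status/<model name>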

slide-17
SLIDE 17

17

COMMON WAYS TO FULLY UTILIZE GPU

1. Increase computation intensity: increase batch size
2. Execute multiple tasks concurrently with multiple streams or MPS (Multi-Process Service)

slide-18
SLIDE 18

18

DYNAMIC BATCHING SCHEDULER

[Diagram: Batch-1 and Batch-4 requests flow into the TensorRT Inference Server dynamic batcher, then to runtime execution contexts in the framework backend]

slide-19
SLIDE 19

19

DYNAMIC BATCHING SCHEDULER

[Diagram: ModelY backend with dynamic batcher and runtime execution contexts in the TensorRT Inference Server]

Preferred batch size and wait time are configuration options. Assume a preferred batch size of 4 gives the best utilization in this example.

Grouping requests into a single “batch” increases overall GPU throughput.

slide-20
SLIDE 20

20

DYNAMIC BATCHING

TensorRT Inference Server groups inference requests based on customer-defined metrics for optimal performance. The customer defines 1) batch size (required) and 2) latency requirements (optional). Example: no dynamic batching (batch size 1 & 8) vs. dynamic batching (see the configuration sketch below).

2.5X Faster Inferences/Second at a 50ms End-to-End Server Latency Threshold
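In a model's config.pbtxt, those two knobs map onto the dynamic_batching stanza; the same values reappear on the model-configuration slide later:

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}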

slide-21
SLIDE 21

21

MPS VS CUDA STREAMS IN TRTIS

TRTIS CUDA streams are 1-4% slower than MPS but provide some usability advantages and other methods to maximize performance over MPS limitations.

MPS
  • Multiple processes on a single GPU (no interconnect/intercommunication between processes)
  • Shares GPU memory between multiple processes; if one process oversubscribes the memory, the others are starved - harder to coordinate memory usage
  • Experimental in nv-docker

CUDA Streams
  • One process on a single GPU with multiple streams/execution contexts
  • More holistic view of memory - easier to coordinate memory usage
  • Maximize GPU utilization by using batching vs. having several processes execute at batch size 1

slide-22
SLIDE 22

22

Concurrent Model Execution

max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]

Create one execution context for each instance of a group of a certain model

slide-23
SLIDE 23

23

CONCURRENT MODEL EXECUTION - RESNET 50

Inference Requests

TensorRT Inference Server

ResNet 50

Request Queue

[Diagram: V100 16GB GPU timeline with RN50 instances 1-12, each on its own CUDA stream]

4x Better Performance and Improved GPU Utilization Through Multiple Model Concurrency

14 concurrent requests

Common Scenario

One API using multiple copies of the same model on a GPU

Example: 12 instances of TRT FP16 ResNet50 (each model instance takes 1.33 GB of GPU memory) are loaded onto the GPU and can run concurrently on a 16 GB V100 GPU. When 14 concurrent inference requests arrive, each model instance fulfills one request simultaneously and 2 are queued in the per-model scheduler queues in TensorRT Inference Server to execute after the 12 requests finish. With this configuration, 2,832 inferences per second at 33.94 ms with batch size 8 on each inference server instance is achieved.


slide-26
SLIDE 26

26

Concurrent Model Execution

max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]

Create one execution context for each instance of a group of a certain model
  • Scheduling threads
  • Multiple streams
  • Priority: MAX, DEFAULT, MIN

slide-27
SLIDE 27

27

Model Control and Model Configuration

Performing an HTTP POST to /api/modelcontrol/<load|unload>/<model name> loads or unloads a model from the inference server (see the example requests below).

Model Control Modes

1) NONE
  • Server attempts to load all models at startup
  • Changes to the model repo will be ignored
  • Model control API requests will have no effect

2) POLL
  • Server attempts to load all models at startup
  • Changes to the model repo will be detected, and the server will attempt to load and unload models based on those changes
  • Model control API requests will have no effect

3) EXPLICIT
  • Server does not load any models from the model repo at startup
  • All model loading and unloading must be initiated using the Model Control API

Local model repository
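For example, in EXPLICIT mode and with the default HTTP port 8000 (the port published in the hands-on docker command later), loading and unloading the "mymodel" model from the next slide could be done with:

curl -X POST localhost:8000/api/modelcontrol/load/mymodel
curl -X POST localhost:8000/api/modelcontrol/unload/mymodel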

slide-28
SLIDE 28

28

Model Control and Model Configuration

name: "mymodel"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
version_policy: { all { } }
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
optimization {
  graph { level: 1 },
  cuda { graphs: 1 },
  priority: PRIORITY_MAX
}

  • dims: -1 for dynamic
  • Reshape for model-accepted dims
  • Supports multiple backends (platform)
  • Version control: serve selected versions
  • Instances for concurrent execution
  • Select multiple GPUs
  • Select CPU or GPU for execution
  • There can be multiple groups
  • Preferred batch size is configurable
  • Set max queue delay for SLA control
  • Multiple optimizations
  • Set graph level to 1 to trigger XLA for TF
  • Set cuda graphs to 1 to use CUDA graphs for small-batch-size inference
  • Set priority to MAX to set scheduler thread priority and CUDA stream priority (only for TRT now)
  • ExecutionAccelerators: enable onnx-tensorrt or tensorflow-tensorrt to automatically benefit from TensorRT integration

slide-29
SLIDE 29

29

MODEL ENSEMBLING

  • Pipeline of one or more models and the connection of input and output tensors between those models
  • Use for model stitching or data flow of multiple models, such as data preprocessing → inference → data post-processing
  • Collects the output tensors in each step and provides them as input tensors for other steps according to the specification
  • Ensemble models will inherit the characteristics of the models involved, so the meta-data in the request header must comply with the models within the ensemble

ensemble_scheduling {
  step [
    {
      model_name: "image_preprocess_model"
      model_version: -1
      input_map { key: "RAW_IMAGE" value: "IMAGE" }
      output_map { key: "PREPROCESSED_OUTPUT" value: "preprocessed_image" }
    },
    {
      model_name: "classification_model"
      model_version: -1
      input_map { key: "FORMATTED_IMAGE" value: "preprocessed_image" }
      output_map { key: "CLASSIFICATION_OUTPUT" value: "CLASSIFICATION" }
    },
    {
      model_name: "segmentation_model"
      model_version: -1
      input_map { key: "FORMATTED_IMAGE" value: "preprocessed_image" }
      output_map { key: "SEGMENTATION_OUTPUT" value: "SEGMENTATION" }
    }
  ]
}

slide-30
SLIDE 30

30

CUSTOM BACKEND

Integrate custom, non-framework code into TRTIS

It is not uncommon for a model to have some non-ML-model parts (BERT: tokenizer, feature extractor). A custom backend allows these parts to be integrated into TRTIS:

  • Implement the code as a shared library using a backwards-compatible C API
  • Benefit from the full TRTIS feature set (same as framework backends): dynamic batcher, sequence batcher, concurrent execution, multi-GPU, etc.
  • Provides deployment flexibility; TRTIS provides a standard, consistent interface protocol between models and custom components

slide-31
SLIDE 31

31

STREAMING INFERENCE REQUESTS

New Streaming API

Based on the correlation ID, the audio requests are sent to the appropriate batch slot in the sequence batcher*

[Diagram: inference requests tagged Corr 1/2/3 flow through per-model request queues into the DeepSpeech2 and Wav2Letter sequence batchers of the framework inference backend]

*Correct order of requests is assumed at entry into the endpoint. Note: Corr = correlation ID
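On the model side, the sequence batcher that backs this API is enabled in config.pbtxt. A minimal sketch is shown below; the field names are assumed from the model-configuration schema of this TRTIS generation and the values are placeholders:

sequence_batching {
  max_sequence_idle_microseconds: 5000000
  control_input [
    {
      name: "START"
      control [ { kind: CONTROL_SEQUENCE_START fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "READY"
      control [ { kind: CONTROL_SEQUENCE_READY fp32_false_true: [ 0, 1 ] } ]
    }
  ]
}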

slide-32
SLIDE 32

32

Streaming API

FSM maintained in StreamInferContext; streaming is bidirectional

[State diagram: states include Initialized, Request Done, Response Done, and Finished, with Next/Reset transitions; a non-streaming request completes after a single request/response; on an unexpected message or finish, the server starts to write all remaining data back and calls CompleteExecution() to write the result back]

slide-33
SLIDE 33

33

TRTIS LIBRARY VERSION

Tightly couple TRTIS functionality into control application via shared library

  • Smaller binary: plug the TRTIS library into an existing application
  • Removes the existing REST and gRPC endpoints
  • Still leverages GPU optimizations like dynamic batching and model concurrency
  • Very low communication overhead (same system and CUDA memory address space)
  • Backward-compatible C interface

slide-34
SLIDE 34

34

AVAILABLE METRICS

Category: GPU Utilization (per GPU, per second)
  • Power usage - proxy for load on the GPU
  • Power limit - maximum GPU power limit
  • GPU utilization - GPU utilization rate [0.0 - 1.0)

Category: GPU Memory (per GPU, per second)
  • GPU total memory - total GPU memory, in bytes
  • GPU used memory - used GPU memory, in bytes

Category: Count, GPU & CPU (per model, per request)
  • Request count - number of inference requests
  • Execution count - number of model inference executions (request count / execution count = avg dynamic request batching)
  • Inference count - number of inferences performed (one request counts as “batch size” inferences)

Category: Latency, GPU & CPU (per model, per request)
  • Request time - end-to-end inference request handling time
  • Compute time - time a request spends executing the inference model (in the appropriate framework)
  • Queue time - time a request spends waiting in the queue before being executed
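These metrics are exported in Prometheus text format. Assuming the default metrics port 8002 (add -p8002:8002 to the docker run command if you want it published), they can be inspected with:

curl -s localhost:8002/metrics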

slide-35
SLIDE 35

35

PERF_CLIENT TOOL

  • Measures throughput (inf/s) and latency under varying client loads
  • perf_client modes:
    1. Specify how many concurrent outstanding requests to maintain, and it will find a stable latency and throughput for that level
    2. Generate a throughput vs. latency curve by increasing the request concurrency until a specified latency or concurrency limit is reached
  • Generates a file containing CSV output of the results (see the example invocation below)
  • Easy steps to help visualize the throughput vs. latency tradeoffs
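As a sketch of mode 2, the flags shown on the hands-on slide can be combined with an output file; the CSV file name here is a placeholder, and the exact flag set should be checked against perf_client --help for the release in use:

./install/bin/perf_client -m bert_trt -i grpc -u localhost:8001 \
    -b1 -p2000 -d -c8 -l200 -f results.csv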

slide-36
SLIDE 36

36

GENERIC INFERENCE SERVER DEPLOYMENT ARCHITECTURE

[Diagram: clients for multiple workloads (REC, IMG, ASR APIs) → load balancer → containerized inference service running TRTIS on a cluster of CPU/GPU nodes (TensorRT, TensorFlow, C2/ONNX models), with pre-processing and post-processing, a metrics service driving an auto-scaler, and a model repository on a persistent volume (network storage location) fed by your training/pruning/validation flow (dump model). Legend distinguishes already-existing components from new NVIDIA components.]

slide-37
SLIDE 37

37

For a more detailed explanation and step-by-step guidance for this collaboration, refer to this GitHub repo.

TENSORRT INFERENCE SERVER COLLABORATION WITH KUBEFLOW

What is Kubeflow?

  • Open-source project to make ML workflows on Kubernetes simple, portable, and scalable
  • Customizable scripts and configuration files to deploy containers on their chosen environment

Problems it solves

  • Easily set up an ML stack/pipeline that can fit into the majority of enterprise datacenter and multi-cloud environments

How it helps TensorRT Inference Server

  • TensorRT Inference Server is deployed as a component inside of a production workflow to:
    • Optimize GPU performance
    • Enable auto-scaling, traffic load balancing, and redundancy/failover via metrics

slide-38
SLIDE 38

38

TRTIS Helm Chart

Helm is the most used “package manager” for Kubernetes. We built a simple chart (“package”) for the TensorRT Inference Server. You can use it to easily deploy an instance of the server, and it can also be easily configured to point to a different image, model store, etc.

https://github.com/NVIDIA/tensorrt-inference-server/tree/b6b45ead074d57e3d18703b7c0273672c5e92893/deploy/single_server

Simple Helm chart for installing a single instance of the NVIDIA TensorRT Inference Server (see the example install below)
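With the Helm 2 CLI current at the time, installing an instance from that chart might look like the sketch below; the release name is a placeholder, and image or model-store settings can be overridden with --set or a custom values.yaml:

git clone https://github.com/NVIDIA/tensorrt-inference-server.git
cd tensorrt-inference-server/deploy/single_server
helm install --name trtis .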

slide-39
SLIDE 39

39

AGENDA

  • TensorRT Hyperscale Inference Platform overview
  • TensorRT Inference Server
  • Overview and Deep Dive: Key features
  • Deployment possibilities: Generic deployment ecosystem
  • Hands-on
  • NVIDIA BERT Overview
  • FasterTransformer and TRT optimized BERT inference
  • Deploy BERT TensorFlow model with custom op
  • Deploy BERT TensorRT model with plugins
  • Benchmarking
  • Open Discussion

DEEP INTO TRTIS

slide-40
SLIDE 40

40

WHAT IS BERT?

BERT: Bidirectional Encoder Representations from Transformers. Widely used in multiple NLP tasks due to its high accuracy.

slide-41
SLIDE 41

41

WHAT IS BERT

Transformer Encoder Part

slide-42
SLIDE 42

42

TENSORFLOW INFERENCE

Previous TF inference is not efficient:

  • 1. TF ops are very small, so kernel launches take much of the time; e.g., GELU/LayerNorm consist of several small ops
  • 2. Multi-head self-attention lacks an efficient GPU implementation
  • 3. TF scheduling is not good
slide-43
SLIDE 43

43

NVIDIA’S INFERENCE

Optimization ideas:

  • 1. Optimize the calculations with CUDA, and integrate the implementation into TF with a custom op (see the sketch below)
  • 2. Optimize the inference with TensorRT
  • 3. Algorithm-level acceleration
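For idea 1, the custom op is loaded into TensorFlow at runtime. A minimal sketch; the .so path matches the library built in the hands-on section, and the exact op names exposed by the module are not listed here:

import tensorflow as tf

# Load the FasterTransformer custom op library built from the project.
ft_module = tf.load_op_library("./lib/libtf_fastertransformer.so")

# ft_module now exposes the fused transformer op(s); the sample scripts in
# sample/tensorflow_bert wire them into the BERT graph in place of the stock TF layers.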
slide-44
SLIDE 44

44

NVIDIA’S INFERENCE

CUDA Optimization - Performance

Performance over different seq_len on P4 and T4:

<batch_size, layers, seq_len, head_num, size_per_head>   P4 FP32 (ms)   T4 FP32 (ms)   T4 FP16 (ms)
(1, 12, 32, 12, 64)                                       3.43           2.74           1.56
(1, 12, 64, 12, 64)                                       4.04           3.64           1.77
(1, 12, 128, 12, 64)                                      6.22           5.93           2.23

slide-45
SLIDE 45

45

NVIDIA’S INFERENCE

CUDA Optimization - Resources

Where you can find it: FasterTransformer project (open-sourced): https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer

slide-46
SLIDE 46

46

NVIDIA’S INFERENCE

TRT Optimization

slide-47
SLIDE 47

47

NVIDIA’S INFERENCE

TRT Optimization

Before After

slide-48
SLIDE 48

48

NVIDIA’S INFERENCE

TRT Optimization - Resources

Where you can find it:
  • BERT TRT demo (open-sourced): https://github.com/NVIDIA/TensorRT/tree/release/6.0/demo/BERT (to be re-located to DeepLearningExamples)
  • Blog: https://devblogs.nvidia.com/nlu-with-tensorrt-bert/

slide-49
SLIDE 49

49

HANDS-ON

Deploy BERT TensorFlow model with custom op

1. Follow the FasterTransformer README and generate the custom op lib: libtf_fastertransformer.so
2. Prepare gemm_config.in for the best GEMM algorithms by running the built binaries
3. Modify sample/tensorflow_bert/profile_bert_inference.py to create the SQuAD model, and use the saved_model API to export the model (a sketch of the export step is shown below)
4. Arrange the exported model in a tree structure: ./bert_ft/1/model.savedmodel/xxxexported_files
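A rough sketch of the export in step 3 using the TF 1.x SavedModel API; the session and the input/output tensors are placeholders that would come from the modified profile_bert_inference.py, named to match the config.pbtxt shown later:

import tensorflow as tf

# sess holds the SQuAD model built with the FasterTransformer custom op;
# input_ids, input_mask, segment_ids and prediction are the graph tensors.
export_dir = "./bert_ft/1/model.savedmodel"
tf.saved_model.simple_save(
    sess,
    export_dir,
    inputs={"input_ids": input_ids,
            "input_mask": input_mask,
            "segment_ids": segment_ids},
    outputs={"prediction": prediction})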

slide-50
SLIDE 50

50

HANDS-ON

Deploy BERT TensorRT model with plugins

1. Follow the TensorRT/demo/BERT README and generate the plugin libs: libbert_plugins.so and libcommon.so
2. Follow the README and run sample_bert with the additional arg ‘--saveEngine=model.plan’
3. Arrange the model dir in a tree structure: bert_trt/1/model.plan

slide-51
SLIDE 51

51

HANDS-ON

Prepare model_repository, run trtserver and perf_client

1. Prepare the model_repository (directory layout and the two config.pbtxt files are shown below)

model_repository/
|-- bert_fastertransformer
|   |-- 1
|   |   `-- model.savedmodel
|   |       |-- saved_model.pb
|   |       `-- variables
|   |           |-- variables.data-00000-of-00001
|   |           `-- variables.index
|   `-- config.pbtxt
`-- bert_trt
    |-- 1
    |   `-- model.plan
    `-- config.pbtxt

config.pbtxt for bert_fastertransformer:

name: "bert_fastertransformer"
platform: "tensorflow_savedmodel"
input [
  { name: "input_ids"   data_type: TYPE_INT32 dims: [ 1, 128 ] },
  { name: "input_mask"  data_type: TYPE_INT32 dims: [ 1, 128 ] },
  { name: "segment_ids" data_type: TYPE_INT32 dims: [ 1, 128 ] }
]
output [
  { name: "prediction" data_type: TYPE_FP32 dims: [ 2, 1, 128 ] }
]
instance_group { kind: KIND_GPU count: 1 }
version_policy: { specific { versions: [1] } }

config.pbtxt for bert_trt:

name: "bert_trt"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  { name: "segment_ids" data_type: TYPE_INT32 dims: 128 },
  { name: "input_ids"   data_type: TYPE_INT32 dims: 128 },
  { name: "input_mask"  data_type: TYPE_INT32 dims: 128 }
]
output [
  { name: "cls_squad_logits" data_type: TYPE_FP32 dims: [ 128, 2, 1, 1 ] }
]
instance_group { kind: KIND_GPU count: 1 }
version_policy: { specific { versions: [32] } }

slide-52
SLIDE 52

52

HANDS-ON

Prepare model_repository, run trtserver and perf_client

2. Launch trtserver over HTTP/gRPC

   1. NV_GPU=x nvidia-docker run --rm -it --name=trtis_bert -p8000:8000 -p8001:8001 -v /path/to/model_repository:/models nvcr.io/nvidia/tensorrtserver:19.11-py3
   2. export LD_PRELOAD="/path/to/libcommon.so /path/to/libbert_plugins.so /path/to/libtf_fastertransformer.so"
   3. trtserver --model-store=/models --log-verbose=1 --strict-model-config=True

slide-53
SLIDE 53

53

HANDS-ON

Prepare model_repository, run trtserver and perf_client

3. Run perf_client to infer over gRPC

   1. Launch the client container: docker run --net=host --rm -it nvcr.io/nvidia/tensorrtserver-clients
   2. ./install/bin/perf_client -m bert_trt -d -c8 -l200 -p2000 -b1 -i grpc -u localhost:8001 -t1 --max-threads=8

Result reported by perf_client:

Request concurrency: 1
  Client:
    Request count: 59
    Throughput: 944 infer/sec
    Avg latency: 34422 usec (standard deviation 288 usec)
    p50 latency: 34457 usec
    p90 latency: 34667 usec
    p95 latency: 34877 usec
    p99 latency: 35130 usec
    Avg gRPC time: 34452 usec ((un)marshal request/response 27 usec + response wait 34425 usec)
  Server:
    Request count: 70
    Avg request latency: 33473 usec (overhead 26 usec + queue 58 usec + compute 33389 usec)

slide-54
SLIDE 54

54

HANDS-ON

Benchmarking - FasterTransformer

SQuAD task inference (FasterTransformer): batch size = 1, TensorFlow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor   Precision   QPS     AvgL(ms)   TP99(ms)   Concurrent
CPU         FP32        7.7     131        203        1
CPU         FP32        10.4    289        339        4
GPU         FP32        104.5   9.5        11.8       1
GPU         FP32        137     21.9       23.7       4
GPU         FP16        267.5   3.7        3.9        1
GPU         FP16        461.5   8.7        10.3       4

[Chart: CPU -> multi-thread CPU -> GPU FP32 -> concurrent GPU FP32 -> GPU FP16 -> concurrent GPU FP16; QPS 7.7 -> 104.5 -> 461.5, AvgL 131 ms -> 9.5 ms -> 8.7 ms]

Virtual GPU feature in TensorFlow to enable multi-stream:

  --tf-add-vgpu="0;4;3000"
slide-55
SLIDE 55

55

HANDS-ON

Benchmarking - FasterTransformer

SQuAD task inference (FasterTransformer): batch size = 32, TensorFlow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor   Precision   QPS    AvgL(ms)   TP99(ms)   Concurrent
CPU         FP32        0.4    2491       2810       1
GPU         FP32        5.5    182        184        1
GPU         FP32        5      186        186        4
GPU         FP16        21.8   46         48.8       1
GPU         FP16        21.6   46.1       48.1       4

[Chart: CPU -> GPU FP32 -> GPU FP16; QPS 0.4 -> 5.5 -> 21.8, AvgL 2491 ms -> 182 ms -> 46 ms]

slide-56
SLIDE 56

56

HANDS-ON

Benchmarking - TensorRT

SQuAD task inference (TensorRT): batch size = 1, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor   Precision   QPS     AvgL(ms)   TP99(ms)   Concurrent
GPU         FP32        163     12.3       12.4       1
GPU         FP32        156     12.8       14         4
GPU         FP16        438.5   4.6        4.6        1
GPU         FP16        473.5   4.2        5.1        4

[Chart: GPU FP32 -> GPU FP16; QPS 163 -> 473.5, AvgL 12.3 ms -> 4.2 ms]

slide-57
SLIDE 57

57

HANDS-ON

Benchmarking - TensorRT

SQuAD task inference (TensorRT): batch size = 32, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor   Precision   QPS    AvgL(ms)   TP99(ms)   Concurrent
GPU         FP32        6.5    157        159        1
GPU         FP32        6.5    316        356        4
GPU         FP16        29.5   34.2       34.8       1
GPU         FP16        30.5   134        151        4

[Chart: GPU FP32 -> GPU FP16; QPS 6.5 -> 30.5, AvgL 157 ms -> 134 ms]

slide-58
SLIDE 58

58

Learn more here:

https://nvidia.com/data-center-inference
https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/quickstart.html

Get the ready-to-deploy container with monthly updates from the NGC container registry:

https://ngc.nvidia.com/catalog/containers/nvidia%2Ftensorrtserver

Open source GitHub repository:

https://github.com/NVIDIA/tensorrt-inference-server

LEARN MORE AND DOWNLOAD TO USE

slide-59
SLIDE 59

59

Engineering developer blog (benchmarks, model concurrency, etc.):

https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/

Kubeflow guest blog:

https://www.kubeflow.org/blog/nvidia_tensorrt/

Open source announcement:

https://news.developer.nvidia.com/nvidia-tensorrt-inference-server-now-open-source

More:

  • Data center inference page & TensorRT page
  • DevTalk Forum for support
  • TensorRT Hyperscale Inference Platform infographic
  • NVIDIA AI Inference Platform technical overview
  • NVIDIA TensorRT Inference Server and Kubeflow
  • NVIDIA TensorRT Inference Server Now Available

ADDITIONAL RESOURCES

slide-60
SLIDE 60

60

NVIDIA Deep Learning Institute (DLI)

  • www.nvidia.cn/dli
slide-61
SLIDE 61

DLI Full-Day Deep Learning Training @ GTC CHINA 2019

Global developer training certificates | Fully equipped GPU lab environments | 5 new courses debuting | Once-a-year 40%-off offer

View courses and register | Training inquiries

  • Training Neural Networks with Multiple GPUs: tackle the algorithmic and engineering challenges of large-scale neural network training
  • CUDA Python: easily accelerate Python applications on the GPU
  • Computer Vision: start from zero with deep learning methods and practice
  • Natural Language Processing: essential NLP theory and applied skills
  • Multiple Data Types: advanced applications fusing machine vision and NLP techniques
  • Perception Systems for Autonomous Vehicles (new 2019 edition): learn to build autonomous vehicles with NVIDIA DRIVE AGX
  • Industrial Inspection: apply deep learning to build automated industrial inspection models
  • Developing AI Applications with Jetson Nano: an introduction to robotics; get your Jetson Nano kit

slide-62
SLIDE 62