DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU
Xu Tianhao, Deep Learning Solution Architect, NVIDIA
2
3
World's most advanced scale-out GPU
Integrated into TensorFlow & ONNX support
TensorRT Inference Server
4
WORLD'S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration
320 Turing Tensor Cores, 2,560 CUDA cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16GB | 320GB/s
5
MULTI-PRECISION FOR AI INFERENCE 65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4
6
[Chart] Speedup vs. CPU server (Tesla P4 / Tesla T4):
Natural Language Processing inference (GNMT): 10x / 36x
Speech inference (DeepSpeech 2): 4x / 21x
Video inference (ResNet-50, 7ms latency limit): 10x / 27x
[Chart] Peak performance (TFLOPS / TOPS): P4: 5.5 Float, 22 INT8; T4: 65 Float (FP16), 130 INT8, 260 INT4
7
TESLA V100 DRIVE PX 2 NVIDIA T4 JETSON TX2 NVIDIA DLA
8
Quantized INT8 (Precision Optimization): significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss.
Layer Fusion (Graph Optimization): improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution.
Kernel Auto-Tuning: optimizes execution time by choosing the best data layout and best parallel algorithms for the target Jetson, Tesla or DrivePX GPU platform.
Dynamic Tensor Memory (Memory Optimization): reduces memory footprint and improves memory reuse by allocating memory for each tensor only for the duration of its usage.
Multi-Stream Execution: scales to multiple input streams by processing them in parallel using the same model and weights.
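A minimal sketch of how these optimizations are requested when building an engine with the TensorRT Python API (recent TensorRT releases); the ONNX file name and workspace size are placeholders, and INT8 additionally needs a calibration step:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_engine(onnx_path="model.onnx"):  # placeholder model file
    builder = trt.Builder(TRT_LOGGER)
    # Explicit-batch network definition, as required by the ONNX parser
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("ONNX parse failed")

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30       # scratch memory for kernel auto-tuning
    config.set_flag(trt.BuilderFlag.FP16)     # allow FP16 kernels
    # For INT8, also set trt.BuilderFlag.INT8 and attach a calibrator.

    # Layer fusion, kernel auto-tuning and memory optimization happen here.
    return builder.build_engine(network, config)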
9
[Figure] Inception module before layer fusion: input feeding 1x1/3x3/5x5 convolutions and max pool, each followed by bias and ReLU, then concat to the next input.

Network | Layers before | Layers after
VGG19 | 43 | 27
Inception V3 | 309 | 113
ResNet-152 | 670 | 159
10
[Figure] The same module after layer fusion: conv, bias and ReLU combined into single CBR nodes (1x1 CBR, 3x3 CBR, 5x5 CBR) plus max pool, feeding the next input.
11
[Chart] ResNet-50 inference throughput and latency: CPU-Only 140 images/sec @ 14 ms; V100 + TensorFlow 305 images/sec @ 6.67 ms; V100 + TensorRT 5,700 images/sec @ 6.83 ms.
Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.
[Chart] OpenNMT inference throughput and latency: CPU-Only + Torch 4 sentences/sec @ 280 ms; V100 + Torch 25 sentences/sec @ 153 ms; V100 + TensorRT 550 sentences/sec @ 117 ms.
Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On.
developer.nvidia.com/tensorrt
12
13
Single Framework Only: solutions can only support models from one framework
Single Model Only: some systems are overused while others are under-utilized
Custom Development: developers need to reinvent the plumbing for every application
(Example workloads: ASR, NLP, recommenders)
14
Maximize real-time inference performance of GPUs
Quickly deploy and manage multiple models per GPU per node
Easily scale to heterogeneous GPUs and multi-GPU nodes
Integrates with orchestration systems and auto-scalers via latency and health metrics
Now open source for thorough customization and integration

[Diagram] TensorRT Inference Server instances running on heterogeneous nodes with NVIDIA T4, Tesla V100, and Tesla P4 GPUs.
15
Utilization | Usability | Performance | Customization

Dynamic Batching: inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA
Concurrent Model Execution: multiple models (or multiple instances of the same model) may execute on the GPU simultaneously
CPU Model Inference Execution: framework-native models can execute inference requests on the CPU
Multiple Model Format Support: PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow+TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT Plans, Caffe2 NetDef (ONNX import path)
Metrics: utilization, count, memory, and latency
Model Control API: explicitly load/unload models into and out of TRTIS based on changes made in the model-control configuration
System/CUDA Shared Memory: inputs/outputs that need to be passed to/from TRTIS are stored in system/CUDA shared memory, reducing HTTP/gRPC overhead
Library Version: link against libtrtserver.so so that you can include all the inference server functionality directly in your application
Custom Backend: gives the user more flexibility by letting them provide their own execution engine through the custom backend API
Model Ensemble: pipeline of one or more models and the connection of input and output tensors between them (can be used with custom backends)
Streaming API: built-in support for audio streaming input, e.g. for speech recognition
16
Python/C++ Client Library
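A rough sketch of issuing an inference from Python, based on the tensorrtserver client library shipped with the 19.xx client releases; the model, tensor names and shapes are placeholders (here they mirror the example "mymodel" config shown later), so consult the client examples in the repository for the exact API of your release:

import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# Hypothetical model: adjust model name, input/output names and shapes
# to match the model's config.pbtxt.
protocol = ProtocolType.from_str("grpc")
ctx = InferContext("localhost:8001", protocol, "mymodel", -1)

# One 16-element FP32 vector per batch element, as in the example config.
batch = [np.zeros(16, dtype=np.float32)]
result = ctx.run({"input0": batch},
                 {"output0": InferContext.ResultFormat.RAW},
                 batch_size=1)
print(result["output0"][0])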
17
1. Increase computation intensity: increase batch size
2. Execute multiple tasks concurrently with multi-streams or MPS (Multi-Process Service)
18
[Diagram] Batch-1 and batch-4 requests arrive at the TensorRT Inference Server; the per-model dynamic batcher groups them and dispatches batches to the framework backend's runtime contexts.
19
[Diagram] ModelY backend: the dynamic batcher feeds the runtime contexts. Preferred batch size and wait time are configuration options; assume a preferred batch size of 4 gives best utilization in this example.
20
21
TRTIS CUDA Streams are 1-4% slower than MPS but provide some usability advantages and other ways to maximize performance over MPS limitations.
MPS:
- overhead from interconnect/intercommunication between processes
- if one process over-subscribes the memory, the others are starved; harder to coordinate memory usage
CUDA Streams:
- multiple streams/execution contexts within a single server process
- easier to coordinate memory usage
- replaces several processes executing at batch size 1
22
max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]
Creates one execution context for each instance in each group of a given model.
23
[Diagram] 14 concurrent inference requests arrive at the TensorRT Inference Server request queue for ResNet 50; 12 RN50 instances, each with its own CUDA stream, execute concurrently on a V100 16GB GPU.
Common Scenario
One API using multiple copies of the same model on a GPU
Example: 12 instances of TRT FP16 ResNet50 (each model takes 1.33GB GPU memory) are loaded onto the GPU and can run concurrently on a 16GB V100 GPU. When 14 concurrent inference requests arrive, each model instance fulfills one request simultaneously and 2 are queued in the per-model scheduler queues in TensorRT Inference Server to execute after the 12 requests finish. With this configuration, 2,832 inferences per second at 33.94 ms with batch size 8 are achieved.
26
max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]
Creates one execution context for each instance in each group of a given model: scheduling threads, multiple streams. Priority: MAX, DEFAULT, MIN.
27
Model Control API
Perform an HTTP POST to /api/modelcontrol/<load|unload>/<model name> to load or unload a model from the inference server.
Model control modes:
1) NONE: all models in the local model repository are loaded at startup
2) POLL: the server attempts to load and unload models based on changes in the local model repository
3) EXPLICIT: models are loaded and unloaded only on request at runtime
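A minimal sketch of driving the model control endpoint above from Python, assuming a local server on the default HTTP port 8000 and a model-control mode that allows explicit load/unload:

import requests

TRTIS = "http://localhost:8000"

def load_model(name):
    # POST to the endpoint shown on this slide to load a model
    requests.post(f"{TRTIS}/api/modelcontrol/load/{name}").raise_for_status()

def unload_model(name):
    # POST to the endpoint shown on this slide to unload a model
    requests.post(f"{TRTIS}/api/modelcontrol/unload/{name}").raise_for_status()

load_model("bert_trt")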
28
name: "mymodel" platform: "tensorrt_plan" max_batch_size: 8 input [ { name: "input0" data_type: TYPE_FP32 dims: [ 16 ] reshape: { shape: [ ] } } ]
{ name: "output0" data_type: TYPE_FP32 dims: [ 16 ] } ] version_policy: { all { }} instance_group [ { count: 2 kind: KIND_GPU gpus: [0, 1] } ] dynamic_batching { preferred_batch_size: [ 4, 8 ] max_queue_delay_microseconds: 100 }
optimization {
  graph { level: 1 }
  cuda { graphs: 1 }
  priority: PRIORITY_MAX
}
- CUDA graphs: helps small batch size inference
- priority: model priority and CUDA stream priority (only for TRT now)
- graph level: graph optimization level; can benefit from TensorRT integration
29
A model ensemble is a pipeline of one or more models and the connection of input and output tensors between those models.
It encapsulates a flow of multiple models such as data preprocessing → inference → data post-processing.
The ensemble scheduler collects the output tensors of each step and provides them as input tensors for other steps according to the specification.
The ensemble inherits the characteristics of the models involved, so the meta-data in the request header must comply with the models within the ensemble.
ensemble_scheduling {
  step [
    {
      model_name: "image_preprocess_model"
      model_version: -1
      input_map { key: "RAW_IMAGE" value: "IMAGE" }
      output_map { key: "PREPROCESSED_OUTPUT" value: "preprocessed_image" }
    },
    {
      model_name: "classification_model"
      model_version: -1
      input_map { key: "FORMATTED_IMAGE" value: "preprocessed_image" }
      output_map { key: "CLASSIFICATION_OUTPUT" value: "CLASSIFICATION" }
    },
    {
      model_name: "segmentation_model"
      model_version: -1
      input_map { key: "FORMATTED_IMAGE" value: "preprocessed_image" }
      output_map { key: "SEGMENTATION_OUTPUT" value: "SEGMENTATION" }
    }
  ]
}
Custom Backend
30
Provides deployment flexibility; TRTIS provides standard, consistent interface protocol between models and custom components
31
[Diagram] Streaming API: per-model request queues feed the DeepSpeech2 and Wav2Letter sequence batchers in the framework inference backend. Based on the correlation ID, the audio requests are sent to the appropriate batch slot in the sequence batcher*; new correlation IDs (NEW) are assigned to free batch slots.
*Correct order of requests is assumed at entry into the endpoint. Note: Corr = Correlation ID
32
[Diagram] gRPC request lifecycle for non-streaming and streaming (bidirectional) requests. On finish or an unexpected state, the server starts to write all remaining data back and calls CompleteExecution() to write the result back, then resets.
33
Smaller binary: plug the TRTIS library into an existing application
Removes existing REST and gRPC endpoints
Still leverages GPU optimizations like dynamic batching and model concurrency
Very low communication overhead (same system and CUDA memory address space)
Backward-compatible C interface
34
Category | Name | Use Case | Granularity | Frequency
GPU Utilization | Power usage | Proxy for load on the GPU | Per GPU | Per second
GPU Utilization | Power limit | Maximum GPU power limit | Per GPU | Per second
GPU Utilization | GPU utilization | GPU utilization rate [0.0 - 1.0) | Per GPU | Per second
GPU Memory | GPU total memory | Total GPU memory, in bytes | Per GPU | Per second
GPU Memory | GPU used memory | Used GPU memory, in bytes | Per GPU | Per second
Count (GPU & CPU) | Request count | Number of inference requests | Per model | Per request
Count (GPU & CPU) | Execution count | Number of model inference executions (request count / execution count = avg dynamic request batching) | Per model | Per request
Count (GPU & CPU) | Inference count | Number of inferences performed (one request counts as "batch size" inferences) | Per model | Per request
Latency (GPU & CPU) | Request time | End-to-end inference request handling time | Per model | Per request
Latency (GPU & CPU) | Compute time | Time a request spends executing the inference model (in the appropriate framework) | Per model | Per request
Latency (GPU & CPU) | Queue time | Time a request spends waiting in the queue before being executed | Per model | Per request
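These metrics are exported in Prometheus text format, by default on port 8002. A minimal sketch of scraping them from Python (the nv_inference prefix is illustrative; check your server's /metrics output for the exact names):

import requests

# Fetch the Prometheus-format metrics page exposed by the server.
text = requests.get("http://localhost:8002/metrics").text
for line in text.splitlines():
    if line.startswith("nv_inference"):   # e.g. per-model request/queue/compute counters
        print(line)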
35
perf_client measures throughput and latency under varying client loads.
1. Specify how many concurrent outstanding inference requests to issue; perf_client will find a stable latency and throughput for that level.
2. Generate a throughput vs. latency curve by increasing the request concurrency until a specific latency or concurrency limit is reached.
Use the curve to analyze throughput vs. latency tradeoffs.
36
[Diagram] Example production deployment: clients (ASR, REC, IMG APIs) → load balancer → containerized inference services (pre-processing, TRTIS on CPU/GPU, post-processing) running on a GPU cluster with a metrics service and auto scaler. Your training/pruning/validation flow dumps models into a model repository on a persistent volume (network storage location). Backends: TensorRT, TensorFlow, Caffe2/ONNX; supports multiple workloads. The legend distinguishes already-existing components from new NVIDIA components.
37
For a more detailed explanation and step-by-step guidance for this collaboration, refer to this GitHub repo.
What is Kubeflow?
A toolkit for making deployments of machine learning workflows on Kubernetes simple, portable, and scalable in any environment.
Problems it solves:
Deploying and managing ML workloads across datacenter and multi-cloud environments.
How it helps TensorRT Inference Server:
Provides a deployment workflow to serve and scale TRTIS on Kubernetes, driven by its latency and utilization metrics.
38
https://github.com/NVIDIA/tensorrt-inference-server/tree/b6b45ead074d57e3d18703b7c0273672c5e92893/deploy/single_server
Simple helm chart for installing a single instance of the NVIDIA TensorRT Inference Server
39
40
BERT: Bidirectional Encoder Representations from Transformers. Widely used in multiple NLP tasks due to its high accuracy.
41
42
Previous TensorFlow inference is not efficient: the model is executed as many small ops, so kernel launch overhead dominates.
43
44
45
46
47
[Figure] BERT graph before vs. after op fusion.
48
49
1. Follow the FasterTransformer README and generate the custom op lib: libtf_fastertransformer.so
2. Prepare gemm_config.in for the best GEMM algorithms by running the built binaries.
3. Modify sample/tensorflow_bert/profile_bert_inference.py to create the SQuAD model, and use the saved_model API to export the model (a sketch follows below).
4. Arrange the exported model in a tree structure: ./bert_ft/1/model.savedmodel/xxxexported_files
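A minimal sketch of step 3's export using the TF 1.x saved_model API, with a stand-in graph that only mirrors the tensor names and shapes expected by the config.pbtxt shown later; in practice, the real SQuAD model built in profile_bert_inference.py replaces the stand-in:

import tensorflow as tf

# Stand-in graph; replace with the SQuAD model using the FasterTransformer op.
graph = tf.Graph()
with graph.as_default():
    input_ids = tf.placeholder(tf.int32, [1, 128], name="input_ids")
    input_mask = tf.placeholder(tf.int32, [1, 128], name="input_mask")
    segment_ids = tf.placeholder(tf.int32, [1, 128], name="segment_ids")
    # Placeholder for the real BERT/SQuAD computation:
    prediction = tf.zeros([2, 1, 128], dtype=tf.float32, name="prediction")

with tf.Session(graph=graph) as sess:
    tf.saved_model.simple_save(
        sess,
        "./bert_ft/1/model.savedmodel",  # version subdirectory expected by TRTIS
        inputs={"input_ids": input_ids,
                "input_mask": input_mask,
                "segment_ids": segment_ids},
        outputs={"prediction": prediction})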
50
1. Follow the TensorRT/demo/BERT README and generate the plugin libs: libbert_plugins.so and libcommon.so
2. Follow the README and run sample_bert with the additional arg '--saveEngine=model.plan' (a quick sanity-check sketch follows below).
3. Arrange the model dir in a tree structure: bert_trt/1/model.plan
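As an optional sanity check for step 2, a sketch that deserializes the saved plan with the BERT plugin libraries preloaded; paths are placeholders, and the plan must be loaded with the same plugins and TensorRT version used to build it:

import ctypes
import tensorrt as trt

# Load the plugin libraries before deserializing the engine.
ctypes.CDLL("./libcommon.so")
ctypes.CDLL("./libbert_plugins.so")

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

with open("bert_trt/1/model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    # Print the binding names so they can be checked against config.pbtxt.
    print([engine.get_binding_name(i) for i in range(engine.num_bindings)])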
51
1. Prepare model_repository
Model directory:
model_repository/
|-- bert_fastertransformer
|   |-- 1
|   |   `-- model.savedmodel
|   |       |-- saved_model.pb
|   |       `-- variables
|   |           |-- variables.data-00000-of-00001
|   |           `-- variables.index
|   `-- config.pbtxt
`-- bert_trt
    |-- 1
    |   `-- model.plan
    `-- config.pbtxt

config.pbtxt (bert_fastertransformer):
name: "bert_fastertransformer"
platform: "tensorflow_savedmodel"
input [
  { name: "input_ids" data_type: TYPE_INT32 dims: [ 1, 128 ] },
  { name: "input_mask" data_type: TYPE_INT32 dims: [ 1, 128 ] },
  { name: "segment_ids" data_type: TYPE_INT32 dims: [ 1, 128 ] }
]
output [
  { name: "prediction" data_type: TYPE_FP32 dims: [ 2, 1, 128 ] }
]
instance_group { kind: KIND_GPU count: 1 }
version_policy: { specific { versions: [1] } }

config.pbtxt (bert_trt):
name: "bert_trt"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  { name: "segment_ids" data_type: TYPE_INT32 dims: [ 128 ] },
  { name: "input_ids" data_type: TYPE_INT32 dims: [ 128 ] },
  { name: "input_mask" data_type: TYPE_INT32 dims: [ 128 ] }
]
output [
  { name: "cls_squad_logits" data_type: TYPE_FP32 dims: [ 128, 2, 1, 1 ] }
]
instance_group { kind: KIND_GPU count: 1 }
version_policy: { specific { versions: [32] } }
52
1. Launch trtserver over http/grpc
1. NV_GPU=x nvidia-docker run --rm -it --name=trtis_bert -p8000:8000 -p8001:8001 -v/path/to/model_repository:/models nvcr.io/nvidia/tensorrtserver:19.11-py3
2. export LD_PRELOAD=/path/to/{libcommon.so; libbert_plugins.so; libtf_fastertransformer.so}
3. trtserver --model-store=/models --log-verbose=1 --strict-model-config=True
53
1. Run perf_client to infer over grpc
1. Launch docker: docker run --net=host --rm -it nvcr.io/nvidia/tensorrtserver-clients
2. ./install/bin/perf_client -m bert_trt -d -c8 -l200 -p2000 -b1 -i grpc -u localhost:8001 -t1 --max-threads=8

Request concurrency: 1
Client:
  Request count: 59
  Throughput: 944 infer/sec
  Avg latency: 34422 usec (standard deviation 288 usec)
  p50 latency: 34457 usec
  p90 latency: 34667 usec
  p95 latency: 34877 usec
  p99 latency: 35130 usec
  Avg gRPC time: 34452 usec ((un)marshal request/response 27 usec + response wait 34425 usec)
Server:
  Request count: 70
  Avg request latency: 33473 usec (overhead 26 usec + queue 58 usec + compute 33389 usec)
Result reported by perf_client
54
SQuAD task inference (FasterTransformer): batch size = 1, TensorFlow backend, max QPS. Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS | AvgL(ms) | TP99(ms) | Concurrent
CPU | FP32 | 7.7 | 131 | 203 | 1
CPU | FP32 | 10.4 | 289 | 339 | 4
GPU | FP32 | 104.5 | 9.5 | 11.8 | 1
GPU | FP32 | 137 | 21.9 | 23.7 | 4
GPU | FP16 | 267.5 | 3.7 | 3.9 | 1
GPU | FP16 | 461.5 | 8.7 | 10.3 | 4

CPU → multi-thread CPU → GPU FP32 → concurrent GPU FP32 → GPU FP16 → concurrent GPU FP16: QPS 7.7 → 104.5 → 461.5; latency 131 ms → 9.5 ms → 8.7 ms
Virtual GPU feature in Tensorflow to enable multi-stream:
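A sketch of the underlying TensorFlow virtual device API that this refers to, with arbitrary placeholder memory limits; TRTIS exposes the same feature through a server option, so check trtserver --help for the exact flag in your release:

import tensorflow as tf

# Split one physical GPU into several logical (virtual) GPUs so that
# independent execution contexts can run concurrently on their own streams.
gpus = tf.config.experimental.list_physical_devices("GPU")
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3072)
     for _ in range(4)])
print(tf.config.experimental.list_logical_devices("GPU"))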
55
SQuAD task inference (FasterTransformer): batch size = 32, TensorFlow backend, max QPS. Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS | AvgL(ms) | TP99(ms) | Concurrent
CPU | FP32 | 0.4 | 2491 | 2810 | 1
GPU | FP32 | 5.5 | 182 | 184 | 1
GPU | FP32 | 5 | 186 | 186 | 4
GPU | FP16 | 21.8 | 46 | 48.8 | 1
GPU | FP16 | 21.6 | 46.1 | 48.1 | 4

CPU → GPU FP32 → GPU FP16: QPS 0.4 → 5.5 → 21.8; latency 2491 ms → 182 ms → 46 ms
56
SQuAD task inference (TensorRT): batch size = 1, TensorRT backend, max QPS. Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS | AvgL(ms) | TP99(ms) | Concurrent
GPU | FP32 | 163 | 12.3 | 12.4 | 1
GPU | FP32 | 156 | 12.8 | 14 | 4
GPU | FP16 | 438.5 | 4.6 | 4.6 | 1
GPU | FP16 | 473.5 | 4.2 | 5.1 | 4

GPU FP32 → GPU FP16: QPS 163 → 473.5; latency 12.3 ms → 4.2 ms
57
SQuAD task inference (TensorRT): batch size = 32, TensorRT backend, max QPS. Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS | AvgL(ms) | TP99(ms) | Concurrent
GPU | FP32 | 6.5 | 157 | 159 | 1
GPU | FP32 | 6.5 | 316 | 356 | 4
GPU | FP16 | 29.5 | 34.2 | 34.8 | 1
GPU | FP16 | 30.5 | 134 | 151 | 4

GPU FP32 → GPU FP16: QPS 6.5 → 30.5; latency 157 ms → 134 ms
58
Learn more here:
https://nvidia.com/data-center-inference https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/quickstart.html
Get the ready-to-deploy container with monthly updates from the NGC container registry:
https://ngc.nvidia.com/catalog/containers/nvidia%2Ftensorrtserver
Open source GitHub repository:
https://github.com/NVIDIA/tensorrt-inference-server
59
Engineering developer blog (benchmarks, model concurrency, etc.):
https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/
Kubeflow guest blog:
https://www.kubeflow.org/blog/nvidia_tensorrt/
Open source announcement:
https://news.developer.nvidia.com/nvidia-tensorrt-inference-server-now-
More:
Data center inference page & TensorRT page DevTalk Forum for Support TensorRT Hyperscale Inference Platform infographic NVIDIA AI Inference Platform technical overview NVIDIA TensorRT Inference Server and Kubeflow NVIDIA TensorRT Inference Server Now Available
60 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
View courses and register; training consultation available.
Training Neural Networks with Multiple GPUs: tackle the algorithmic and engineering challenges of large-scale neural network training
CUDA Python: easily accelerate Python applications on GPUs
Computer Vision: an introduction from the ground up to deep learning methods and practice
Natural Language Processing (NLP): essential theory and applied skills
Multiple Data Types: advanced applications fusing machine vision and NLP techniques
Perception Systems for Autonomous Vehicles (2019 edition): learn to build autonomous vehicles with NVIDIA DRIVE AGX
Industrial Inspection: apply deep learning to build automated industrial inspection models
Developing AI Applications with Jetson Nano: an introduction to robotics; get your Jetson Nano kit