TensorRT

Alessandro Biondi

TensorRT: C++ API

  • 1. Initializations and import of a Caffe network model in TensorRT
  • 2. Setup of the TensorRT inference engine
  • 3. I/O data structures and launch of the inference
  • 4. Parsing the results and clean-up of the environment


TensorRT: C++ API (1)

IHostMemory* trtModelStream{nullptr};        // Empty TensorRT model
std::vector<std::string> net_outputs = {     // Defines output blobs
    <name of output blob #1>,
    <name of output blob #2>,
    …
};
caffeToTRTModel(            // Converts a Caffe model to a TensorRT model
    <.prototxt filename>,
    <.caffemodel filename>,
    net_outputs,
    <batch size>,           // Number of runs in a batch
    trtModelStream );
<continue…>

TensorRT: Parsing Caffe Models

  • 1. Initializations and creation of a TensorRT network
  • 2. Convert a Caffe network into a TensorRT network
  • 3. Mark the network outputs
  • 4. Launch the TensorRT Optimization Engine

TensorRT: Parsing Caffe Models (1)

void caffeToTRTModel(
    const char* deployFile,
    const char* modelFile,
    const std::vector<std::string>& outputs,
    unsigned int batchSize,
    IHostMemory*& trtModelStream)
{
    IBuilder* builder = createInferBuilder(gLogger);          // TensorRT object to build a network
    INetworkDefinition* network = builder->createNetwork();   // Network initialization
    <continue…>


TensorRT: Parsing Caffe Models (2)

void caffeToTRTModel(
    const char* deployFile,
    const char* modelFile,
    const std::vector<std::string>& outputs,
    unsigned int batchSize,
    IHostMemory*& trtModelStream)
{
    <…>
    ICaffeParser* parser = createCaffeParser();               // TensorRT object to parse Caffe networks
    const IBlobNameToTensor* blobNameToTensor = parser->parse(
        deployFile, modelFile, *network, DataType::kFLOAT);   // Parse the network
    <continue…>

TensorRT: Parsing Caffe Models (3)

void caffeToTRTModel(
    const char* deployFile,
    const char* modelFile,
    const std::vector<std::string>& outputs,
    unsigned int batchSize,
    IHostMemory*& trtModelStream)
{
    <…>
    for (auto& s : outputs)                                   // Tells TensorRT which are the output blobs
        network->markOutput( *blobNameToTensor->find( s.c_str() ));
    builder->setMaxBatchSize( batchSize );                    // Sets the maximum batch size
    <continue…>

TensorRT: Parsing Caffe Models (4)

void caffeToTRTModel(
    const char* deployFile,
    const char* modelFile,
    const std::vector<std::string>& outputs,
    unsigned int batchSize,
    IHostMemory*& trtModelStream)
{
    <…>
    ICudaEngine* engine = builder->buildCudaEngine(*network); // Launches the TensorRT optimization engine
    trtModelStream = engine->serialize();                     // The optimized network is serialized for portability
    network->destroy();
    parser->destroy();
    engine->destroy();
    builder->destroy();
}
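Putting the four steps together, the complete conversion function pieced together from the fragments above looks roughly as follows (a minimal sketch: gLogger is the application-defined logger used throughout these slides, and error checking is omitted):

#include <string>
#include <vector>
#include "NvInfer.h"
#include "NvCaffeParser.h"

using namespace nvinfer1;
using namespace nvcaffeparser1;

// gLogger is assumed to be an application-defined ILogger instance.
void caffeToTRTModel(const char* deployFile,
                     const char* modelFile,
                     const std::vector<std::string>& outputs,
                     unsigned int batchSize,
                     IHostMemory*& trtModelStream)
{
    // 1. Initializations and creation of a TensorRT network
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // 2. Convert the Caffe network into a TensorRT network
    ICaffeParser* parser = createCaffeParser();
    const IBlobNameToTensor* blobNameToTensor =
        parser->parse(deployFile, modelFile, *network, DataType::kFLOAT);

    // 3. Mark the network outputs and set the maximum batch size
    for (auto& s : outputs)
        network->markOutput(*blobNameToTensor->find(s.c_str()));
    builder->setMaxBatchSize(batchSize);

    // 4. Launch the TensorRT optimization engine and serialize the result
    ICudaEngine* engine = builder->buildCudaEngine(*network);
    trtModelStream = engine->serialize();

    // Clean-up
    network->destroy();
    parser->destroy();
    engine->destroy();
    builder->destroy();
}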

TensorRT: C++ API

  • 1. Initializations and import of a Caffe network model in

TensorRT

  • 2. Setup of the TensorRT inference engine

10

  • 3. I/O data structures and launch of the inference
  • 4. Parsing the results and clean-up of the environment
  • 2. Setup of the TensorRT inference engine

TensorRT: C++ API (2)

<…>
IRuntime* runtime = createInferRuntime(gLogger);       // Creates TensorRT runtime engine
ICudaEngine* engine = runtime->deserializeCudaEngine(  // Retrieves the optimized network
    trtModelStream->data(), trtModelStream->size(), NULL );
IExecutionContext* context = engine->createExecutionContext();  // Initializes the execution context for TensorRT
<continue…>


TensorRT: C++ API (3)

<…>
float inputData[<batch size> * <input size>];   // Allocates the input buffer
struct OUTPUT_RESULT outputData;                // Allocates the output buffers
fillImageData(inputData);                       // Fills the input buffer
doInference( *context, <batch size>, inputData, outputData );   // Launches the inference!
<continue…>

TensorRT: Launching inference

void doInference(…) {
    const ICudaEngine& engine = context.getEngine();         // Retrieves the network
    void* buffers[<num input blobs> + <num output blobs>];   // Pointers to GPU input and output buffers

    // for each input blob
    int inputIndex = engine.getBindingIndex(<name of input blob>);    // Get id of input buffer
    cudaMalloc(&buffers[inputIndex], <input size>);                   // Allocate input buffer on GPU memory

    // for each output blob
    int outputIndex = engine.getBindingIndex(<name of output blob>);  // Get id of output buffer
    cudaMalloc(&buffers[outputIndex], <output size>);                 // Allocate output buffer on GPU memory
    <continue…>

TensorRT: Launching inference

void doInference(
    IExecutionContext& context,
    int batchSize,
    float* input,
    struct OUTPUT_RESULT& output)
{
    <…>
    // for each input
    cudaMemcpy(buffers[inputIndex], input, inputSize,
               cudaMemcpyHostToDevice);            // Copy input buffer to the GPU input buffer

    context.execute(batchSize, buffers);           // Launch the inference! (blocking call)

    // for each output
    cudaMemcpy(output.<buffer>, buffers[outputIndex], <output size>,
               cudaMemcpyDeviceToHost);            // Copy GPU output buffer to output buffer

    <free buffers>

TensorRT: C++ API

  • 1. Initializations and import of a Caffe network model in

TensorRT

  • 2. Setup of the TensorRT inference engine

16

  • 3. I/O data structures and launch of the inference
  • 4. Parsing the results and clean-up of the environment
  • 2. Setup of the TensorRT inference engine

TensorRT: C++ API (4)

<…>
parseResult(outputData);        // Extract the output data (details will follow)

// Cleaning up the environment
context->destroy();
engine->destroy();
runtime->destroy();
<delete allocated dyn memory>
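For reference, the four steps can be strung together in a single main() as sketched below (assembled from the fragments above, using the ResNet18 example introduced next; fillImageData, parseResult, OUTPUT_RESULT and the file names are the placeholders used in these slides, and error checking is omitted):

int main()
{
    // 1. Initializations and import of the Caffe network model in TensorRT
    IHostMemory* trtModelStream{nullptr};
    std::vector<std::string> net_outputs = { "Layer11_bbox", "Layer11_cov" };
    caffeToTRTModel("resnet18.prototxt", "resnet18.caffemodel",
                    net_outputs, 1, trtModelStream);

    // 2. Setup of the TensorRT inference engine
    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine = runtime->deserializeCudaEngine(
        trtModelStream->data(), trtModelStream->size(), NULL);
    IExecutionContext* context = engine->createExecutionContext();

    // 3. I/O data structures and launch of the inference
    // (in a real application this buffer would be heap-allocated)
    float inputData[1 * 3 * 368 * 640];
    struct OUTPUT_RESULT outputData;
    fillImageData(inputData);
    doInference(*context, 1, inputData, outputData);

    // 4. Parsing the results and clean-up of the environment
    parseResult(outputData);
    context->destroy();
    engine->destroy();
    runtime->destroy();
    trtModelStream->destroy();
    return 0;
}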

ResNet18

  • DNN for object classification and detection
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition”
  • Available as a Caffe model
  • We use a pre-trained version that detects 3 classes (people, two-wheelers, and cars)


ResNet18: input

Input size = 1 * 3 * 368 * 640 (batch * channels * height * width: one 3-channel image, 368 pixels high and 640 pixels wide)


ResNet18: outputs

  • The outputs are organized over a grid of X columns and Y rows; the numbers of columns and rows are fixed parameters of the DNN
  • For each class c, each cell is assigned a confidence in [0, 1] with which an item of c is detected

ResNet18: outputs

  • For each class, one bbox for each cell
  • Each bbox is characterized by the (x, y) of its top-left edge and the (x, y) of its bottom-right edge

ResNet18: outputs

  • Confidence values are meant to be compared with thresholds

(Figure: a grid of example per-cell confidence values; with a threshold set to 0.3, a bbox whose confidence is 0.2 is discarded)

ResNet18: outputs

  • Final bboxes are built with post-processing



ResNet18: outputs

  • ResNet18 detects C classes of items
  • For each class:
    – For each cell of the grid:
      • Matching confidence (to be compared with a threshold)
      • Coordinates of the top-left edge of a bbox
      • Coordinates of the bottom-right edge of a bbox
  • Total number of outputs:

C * <# rows> * <# cols> * (1 + (2 + 2))
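For the test-case ResNet18 described below (C = 3 classes on a 23 × 40 grid), this amounts to 3 * 23 * 40 * (1 + 4) = 13,800 output values: 2,760 confidences plus 11,040 bbox coordinates.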

ResNet18: outputs

  • ResNet18 has two separate output blobs for the confidence values and the bboxes: Layer11_bbox and Layer11_cov
  • Our test-case ResNet18 is trained for C = 3 classes
  • Sizes for the grid: 40 columns, 23 rows
  • Layer11_bbox has 12 = 3 classes * (2 + 2) channels per cell: the outputs for the bbox coordinates
  • Layer11_cov has 3 channels per cell: one match confidence per class
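The slides do not show the post-processing code itself; purely as an illustration, a thresholding pass over the two blobs could look like the sketch below. It assumes class-major, row-major (CHW-like) layout, with the four coordinates per cell stored as four consecutive 23 × 40 planes per class; this layout is an assumption, not something stated in the slides, and DetectedBox is a hypothetical helper type.

#include <vector>

// Hypothetical bbox record produced by the thresholding pass.
struct DetectedBox { int cls; float x1, y1, x2, y2, confidence; };

// cov:  3 * 23 * 40 confidences        (Layer11_cov)
// bbox: 3 * 4 * 23 * 40 coordinates    (Layer11_bbox)
std::vector<DetectedBox> thresholdBoxes(const float* cov, const float* bbox,
                                        float threshold)
{
    const int C = 3, ROWS = 23, COLS = 40;
    const int cell = ROWS * COLS;
    std::vector<DetectedBox> out;
    for (int c = 0; c < C; c++)
        for (int r = 0; r < ROWS; r++)
            for (int col = 0; col < COLS; col++) {
                float conf = cov[c * cell + r * COLS + col];
                if (conf < threshold)
                    continue;          // e.g., with threshold 0.3, confidence 0.2 is discarded
                const float* b = bbox + c * 4 * cell;   // 4 coordinate planes per class
                int idx = r * COLS + col;
                out.push_back({ c,
                                b[0 * cell + idx],      // x of top-left edge
                                b[1 * cell + idx],      // y of top-left edge
                                b[2 * cell + idx],      // x of bottom-right edge
                                b[3 * cell + idx],      // y of bottom-right edge
                                conf });
            }
    return out;
}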


TensorRT: C++ API (1) - Example

IHostMemory* trtModelStream{nullptr};
std::vector<std::string> net_outputs = {
    "Layer11_bbox",      // Coordinates of the bboxes
    "Layer11_cov"        // Match confidence for each pair (cell, class)
};
caffeToTRTModel(
    "resnet18.prototxt",     // Model file
    "resnet18.caffemodel",   // Weights file
    net_outputs,
    1,                       // batch size
    trtModelStream );

TensorRT: C++ API (3) - Example

void doInference(…) {
    <…>
    void* buffers[ 1 + 2 ];

    // Input image
    int inputIndex = engine.getBindingIndex("data");
    cudaMalloc(&buffers[inputIndex], 3*640*368 * sizeof(float));

    // Coordinates of the bboxes
    int output1Index = engine.getBindingIndex("Layer11_bbox");
    cudaMalloc(&buffers[output1Index], 3*4*40*23 * sizeof(float));

    // Match confidence for each pair (cell, class)
    int output2Index = engine.getBindingIndex("Layer11_cov");
    cudaMalloc(&buffers[output2Index], 3*40*23 * sizeof(float));
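Combining the generic doInference() fragments with the ResNet18 binding names and sizes above gives the following minimal synchronous sketch (OUTPUT_RESULT and its member names are illustrative assumptions, as is sizing the buffers in sizeof(float) units; error checking is omitted):

#include <cuda_runtime.h>
#include "NvInfer.h"

using namespace nvinfer1;

// Hypothetical container for the two ResNet18 output blobs.
struct OUTPUT_RESULT {
    float bbox[3 * 4 * 40 * 23];   // Layer11_bbox
    float cov [3 * 40 * 23];       // Layer11_cov
};

void doInference(IExecutionContext& context, int batchSize,
                 float* input, OUTPUT_RESULT& output)
{
    const ICudaEngine& engine = context.getEngine();   // Retrieves the network
    void* buffers[1 + 2];                              // GPU input and output buffers

    const size_t inputSize = 3 * 640 * 368 * sizeof(float);
    const size_t bboxSize  = 3 * 4 * 40 * 23 * sizeof(float);
    const size_t covSize   = 3 * 40 * 23 * sizeof(float);

    int inputIndex = engine.getBindingIndex("data");
    int bboxIndex  = engine.getBindingIndex("Layer11_bbox");
    int covIndex   = engine.getBindingIndex("Layer11_cov");
    cudaMalloc(&buffers[inputIndex], inputSize);
    cudaMalloc(&buffers[bboxIndex],  bboxSize);
    cudaMalloc(&buffers[covIndex],   covSize);

    // Copy input to the GPU, run the inference (blocking), copy results back
    cudaMemcpy(buffers[inputIndex], input, inputSize, cudaMemcpyHostToDevice);
    context.execute(batchSize, buffers);
    cudaMemcpy(output.bbox, buffers[bboxIndex], bboxSize, cudaMemcpyDeviceToHost);
    cudaMemcpy(output.cov,  buffers[covIndex],  covSize,  cudaMemcpyDeviceToHost);

    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[bboxIndex]);
    cudaFree(buffers[covIndex]);
}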

Profiling with TensorRT

struct Profiler : public IProfiler {     // IProfiler is a standard class of TensorRT
    typedef std::pair<std::string, float> Record;
    std::vector<Record> mProfile;

    // Method called by the TensorRT runtime engine to report on execution times of layers
    virtual void reportLayerTime(const char* layerName, float ms)
    {
        auto record = std::find_if(mProfile.begin(), mProfile.end(),
            [&](const Record& r){ return r.first == layerName; });
        if (record == mProfile.end())
            mProfile.push_back(std::make_pair(layerName, ms));
        else
            record->second += ms;
    }
    <continue…>


Profiling with TensorRT

struct Profiler : public IProfiler {
    <…>
    void printLayerTimes()
    {
        float totalTime = 0;
        for (size_t i = 0; i < mProfile.size(); i++)
        {
            printf("%-40.40s %4.3fms\n", mProfile[i].first.c_str(), mProfile[i].second);
            totalTime += mProfile[i].second;
        }
        printf("Time over all layers: %4.3f\n", totalTime);
    }
} gProfiler;

TensorRT: C++ API (2) – with profiler

IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(
    trtModelStream->data(), trtModelStream->size(), NULL );
IExecutionContext* context = engine->createExecutionContext();
context->setProfiler(&gProfiler);

The profiler can be attached to a TensorRT execution context
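As a small usage sketch (the iteration count is arbitrary and doInference() is the function defined earlier in these slides): once the profiler is attached, per-layer times accumulate across inferences and can then be printed with the printLayerTimes() method defined above.

// Run a few inferences so gProfiler accumulates per-layer times,
// then print the report.
for (int i = 0; i < 10; i++)
    doInference(*context, 1, inputData, outputData);
gProfiler.printLayerTimes();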

Profiling with TensorRT

Layer1 + Layer1_relu                         3.159ms
Layer2                                       0.652ms
Layer3_block_0 + Layer3_block_0_relu         1.638ms
Layer3_block_1 + Layer3 + Layer3_relu        1.731ms
Layer4_block_0 + Layer4_block_0_relu         1.660ms
Layer4_block_1 + Layer4 + Layer4_relu        1.822ms
Layer5_block_0 + Layer5_block_0_relu         1.248ms
Layer5_proj_block                            0.241ms
Layer5_block_1 + Layer5 + Layer5_relu        1.576ms
Layer6_block_0 + Layer6_block_0_relu         1.564ms
Layer6_block_1 + Layer6 + Layer6_relu        1.652ms
Layer7_block_0 + Layer7_block_0_relu         1.266ms
Layer7_proj_block                            0.194ms
Layer7_block_1 + Layer7 + Layer7_relu        1.831ms
Layer8_block_0 + Layer8_block_0_relu         2.732ms
Layer8_block_1 + Layer8 + Layer8_relu        2.999ms
Layer9_block_0 + Layer9_block_0_relu         3.515ms
Layer9_proj_block                            0.636ms
Layer9_block_1 + Layer9 + Layer9_relu        9.899ms
Layer10_block_0 + Layer10_block_0_relu       6.807ms
Layer10_block_1 + Layer10 + Layer10_relu     6.754ms
Layer11_bbox                                 0.139ms
<…>

Note: the entries combining several layers (e.g., "Layer1 + Layer1_relu") are layers merged by TensorRT.

Compiling a TensorRT Application

  • A typical C++ application developed with the TensorRT API must be linked against the following dynamic libraries:
    – nvparsers, nvinfer, cudart, pthread
  • The Nvidia libraries are available within the CUDA and TensorRT folders, e.g.,
    – /usr/local/cuda/lib/ and /tensorrt/TensorRT-4.0.1.7/lib64/ on a standard configuration of the Jetson TX2
  • CUDA and TensorRT headers are also available within their folders, e.g.,
    – /usr/local/cuda/include/ and /tensorrt/TensorRT-4.0.1.7/include/ on a standard configuration of the Jetson TX2

Example Makefile

APP := my_app
CC := g++
CUDA_INSTALL_PATH ?= /usr/local/cuda
TRT_INSTALL_PATH ?= /tensorrt/TensorRT-4.0.1.7

SRCS := my_app.cpp
OBJS := $(SRCS:.cpp=.o)

CPPFLAGS := -std=c++11 -I"$(TRT_INSTALL_PATH)/include" \
            -I"$(CUDA_INSTALL_PATH)/include"
LDFLAGS  := -lnvparsers -lnvinfer -lcudart -pthread \
            -L"$(TRT_INSTALL_PATH)/lib" \
            -L"$(CUDA_INSTALL_PATH)/lib64"

all: $(APP)

%.o: %.cpp
	@ $(CC) $(CPPFLAGS) -c $<

$(APP): $(OBJS)
	@ $(CC) -o $@ $(OBJS) $(CPPFLAGS) $(LDFLAGS)

clean:
	rm -rf $(APP) $(OBJS)

Off-line Profiling Tools

  • Nvidia offers two interesting tools that enable fine-grained profiling of computational activities on GPUs, and are hence also particularly useful to profile the execution of DNNs:
    – Visual Profiler: a graphical profiling tool that displays a timeline of your application's CPU and GPU activity, and includes an automated analysis engine to identify optimization opportunities
    – Nvprof: a run-time tool that allows collecting and viewing profiling data from the command line
  • Typical usage:
    • 1. Collect profiling data at run-time with nvprof
    • 2. Export the profiling data
    • 3. Import the data in the Visual Profiler for a graphical representation

Warning: these two tools will be integrated into a new tool named Nvidia Nsight Compute


Visual Profiler

  • First, run nvprof from the command line on your application:

nvprof --export-profile timeline.prof your_app

  • Then, import timeline.prof in the Visual Profiler


Visual Profiler

Note: other flags for nvprof may be required (e.g., --analysis-metrics)
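For instance, a command line along these lines could be used (the flag combination is an illustrative example, not a prescription):

nvprof --analysis-metrics --export-profile analysis.prof your_app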


CUDA Execution Models

TensorRT offers two execution models to launch inference:

  • 1. Synchronous
  • 2. Asynchronous

(Figure: CPU and GPU timelines for the two models. In the synchronous model the CPU issues the memcpy operations and blocks while the GPU runs the inference; in the asynchronous model the CPU keeps running while the memcpy operations and the inference execute on the GPU.)

CUDA Streams

  • A CUDA stream is a sequence of operations that execute on the GPU in the order they are issued
    – Think of a stream as a “container” of operations to be performed on the GPU
  • CUDA operations in different streams may run concurrently
    – They can be effectively executed in parallel on different sets of GPU cores (streaming multiprocessors)
    – CUDA operations from different streams may be interleaved on the same set of GPU cores
  • Only limited details on how CUDA operations are scheduled are publicly available
    – A sort of round-robin scheduling among CUDA streams with the same priority can be experimentally observed

CUDA Streams: Priorities

  • CUDA streams may be assigned priorities when they are created
  • The operations of a high-priority stream can preempt those of a low-priority stream
  • Only limited information on how this preemption is implemented is publicly available
  • Note that devices support only a limited number of priorities
    – For instance, the Jetson TX2 supports just two priorities: -1 (high priority) and 0 (low priority)


Working with CUDA Streams

  • The range of priorities supported by the device can be obtained as follows:

int min_prio;
int max_prio;
cudaDeviceGetStreamPriorityRange(&min_prio, &max_prio);

  • A CUDA stream can be created as follows:

cudaStream_t stream;
cudaStreamCreateWithPriority(&stream, <flag>, <priority>);

    – <flag> can be either 0x0 (default) or 0x1 (cudaStreamNonBlocking: work running in the created stream may run concurrently with work in stream 0, and the created stream performs no implicit synchronization with stream 0)
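A minimal sketch putting the two calls together (variable names are illustrative; note that the second value returned by cudaDeviceGetStreamPriorityRange is the numerically smallest one, i.e., the highest priority, e.g., -1 on the Jetson TX2):

#include <cuda_runtime.h>

int main()
{
    // Query the priority range supported by the device:
    // 'least' is the lowest priority (largest value, e.g., 0 on the Jetson TX2),
    // 'greatest' is the highest priority (smallest value, e.g., -1 on the Jetson TX2).
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    // Create a non-blocking stream with the highest available priority.
    cudaStream_t high_prio_stream;
    cudaStreamCreateWithPriority(&high_prio_stream,
                                 cudaStreamNonBlocking, greatest);

    // ... enqueue work on high_prio_stream ...

    cudaStreamDestroy(high_prio_stream);
    return 0;
}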

TensorRT: Launching inference (synchronous)

void doInference(
    IExecutionContext& context,
    int batchSize,
    float* input,
    struct OUTPUT_RESULT& output)
{
    <…>
    // for each input
    cudaMemcpy(buffers[inputIndex], input, inputSize,
               cudaMemcpyHostToDevice);            // Copy input buffer to the GPU input buffer

    context.execute(batchSize, buffers);           // Launch the inference! (blocking call)

    // for each output
    cudaMemcpy(output.<buffer>, buffers[outputIndex], <output size>,
               cudaMemcpyDeviceToHost);            // Copy GPU output buffer to output buffer

    <free buffers>

TensorRT: Launching inference (asynchronous)

void doInference(
    IExecutionContext& context,
    int batchSize,
    float* input,
    struct OUTPUT_RESULT& output)
{
    <…>
    // for each input
    cudaMemcpy(...);

    context.enqueue(batchSize, buffers, stream, nullptr);   // Launch the inference! (non-blocking call)

    <do other work on CPU>

    cudaStreamSynchronize(stream);    // Wait for completion of the inference (blocking call)

    // for each output
    cudaMemcpy(...);

    <free buffers>
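As a concrete sketch for the ResNet18 example (an illustration, not the slides' exact code), the copies can also be enqueued on the same stream with cudaMemcpyAsync; for the copies to be truly asynchronous, the host buffers should be page-locked (e.g., allocated with cudaHostAlloc). OUTPUT_RESULT is the illustrative struct introduced earlier, and the stream is assumed to have been created as shown above.

// Asynchronous variant of the ResNet18 doInference() sketch.
void doInferenceAsync(IExecutionContext& context, int batchSize,
                      float* input, OUTPUT_RESULT& output, cudaStream_t stream)
{
    const ICudaEngine& engine = context.getEngine();
    void* buffers[1 + 2];

    const size_t inputSize = 3 * 640 * 368 * sizeof(float);
    const size_t bboxSize  = 3 * 4 * 40 * 23 * sizeof(float);
    const size_t covSize   = 3 * 40 * 23 * sizeof(float);

    int inputIndex = engine.getBindingIndex("data");
    int bboxIndex  = engine.getBindingIndex("Layer11_bbox");
    int covIndex   = engine.getBindingIndex("Layer11_cov");
    cudaMalloc(&buffers[inputIndex], inputSize);
    cudaMalloc(&buffers[bboxIndex],  bboxSize);
    cudaMalloc(&buffers[covIndex],   covSize);

    // Enqueue the input copy and the inference on the stream (non-blocking calls)
    cudaMemcpyAsync(buffers[inputIndex], input, inputSize,
                    cudaMemcpyHostToDevice, stream);
    context.enqueue(batchSize, buffers, stream, nullptr);

    // Enqueue the output copies on the same stream
    cudaMemcpyAsync(output.bbox, buffers[bboxIndex], bboxSize,
                    cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(output.cov,  buffers[covIndex],  covSize,
                    cudaMemcpyDeviceToHost, stream);

    // <do other work on CPU>

    // Wait for completion of all enqueued operations (blocking call)
    cudaStreamSynchronize(stream);

    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[bboxIndex]);
    cudaFree(buffers[covIndex]);
}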

LIVE DEMOS

Nvidia Repository

  • Nvidia is continuously updating a GitHub repository with several open-source examples and demos based on TensorRT
  • Check it out!

https://github.com/dusty-nv/jetson-inference

Credits

[1] http://gpucomputing.shef.ac.uk/static/slides/2018-07-19-dl-cv/deployment.pdf
[2] http://on-demand.gputechconf.com/gtcdc/2017/presentation/dc7172-shashank-prasanna-deep-learning-deployment-with-nvidia-tensorrt.pdf
[3] http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
[4] Venieris et al., “Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions”, https://arxiv.org/pdf/1803.05900.pdf
[5] https://devblogs.nvidia.com/jetson-tx2-delivers-twice-intelligence-edge/