TensorRT

Alessandro Biondi

TensorRT: C++ API

  • 1. Initializations and import of a Caffe network model in TensorRT
  • 2. Setup of the TensorRT inference engine
  • 3. I/O data structures and launch of the inference
  • 4. Parsing the results and clean-up of the environment


TensorRT: C++ API (1)

IHostMemory* trtModelStream{nullptr};        // Empty TensorRT model
std::vector<std::string> net_outputs = {     // Defines output blobs
    <name of output blob #1>,
    <name of output blob #2>,
    …
};
caffeToTRTModel(            // Converts a Caffe model to a TensorRT model
    <.prototxt filename>,
    <.caffemodel filename>,
    net_outputs,
    <batch size>,           // Number of runs in a batch
    trtModelStream );
<continue…>

TensorRT: Parsing Caffe Models

  • 1. Initializations and creation of a TensorRT network
  • 2. Convert a Caffe network into a TensorRT network
  • 3. Mark the network outputs
  • 4. Launch the TensorRT Optimization Engine

TensorRT: Parsing Caffe Models (1)

void caffeToTRTModel(
    const char* deployFile,
    const char* modelFile,
    const std::vector<std::string>& outputs,
    unsigned int batchSize,
    IHostMemory*& trtModelStream)
{
    IBuilder* builder = createInferBuilder(gLogger);          // TensorRT object to build a network
    INetworkDefinition* network = builder->createNetwork();   // Network initialization
    <continue…>


TensorRT: Parsing Caffe Models (2)

void caffeToTRTModel(
    const char* deployFile,
    const char* modelFile,
    const std::vector<std::string>& outputs,
    unsigned int batchSize,
    IHostMemory*& trtModelStream)
{
    <…>
    ICaffeParser* parser = createCaffeParser();               // TensorRT object to parse Caffe networks
    const IBlobNameToTensor* blobNameToTensor = parser->parse(
        deployFile, modelFile, *network, DataType::kFLOAT);   // Parse the network
    <continue…>

TensorRT: Parsing Caffe Models (3)

void caffeToTRTModel(
    const char* deployFile,
    const char* modelFile,
    const std::vector<std::string>& outputs,
    unsigned int batchSize,
    IHostMemory*& trtModelStream)
{
    <…>
    for (auto& s : outputs)                                   // Tells TensorRT which are the output blobs
        network->markOutput( *blobNameToTensor->find( s.c_str() ));
    builder->setMaxBatchSize( batchSize );                    // Sets the maximum batch size
    <continue…>

TensorRT: Parsing Caffe Models (4)

void caffeToTRTModel(
    const char* deployFile,
    const char* modelFile,
    const std::vector<std::string>& outputs,
    unsigned int batchSize,
    IHostMemory*& trtModelStream)
{
    <…>
    ICudaEngine* engine = builder->buildCudaEngine(*network); // Launches the TensorRT optimization engine
    trtModelStream = engine->serialize();                     // The optimized network is serialized for portability
    network->destroy();
    parser->destroy();
    engine->destroy();
    builder->destroy();
}
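Putting the four steps together, the complete conversion function pieced together from the fragments above looks roughly as follows (a minimal sketch: gLogger is the application-defined logger used throughout these slides, and error checking is omitted):

#include <string>
#include <vector>
#include "NvInfer.h"
#include "NvCaffeParser.h"

using namespace nvinfer1;
using namespace nvcaffeparser1;

// gLogger is assumed to be an application-defined ILogger instance.
void caffeToTRTModel(const char* deployFile,
                     const char* modelFile,
                     const std::vector<std::string>& outputs,
                     unsigned int batchSize,
                     IHostMemory*& trtModelStream)
{
    // 1. Initializations and creation of a TensorRT network
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // 2. Convert the Caffe network into a TensorRT network
    ICaffeParser* parser = createCaffeParser();
    const IBlobNameToTensor* blobNameToTensor =
        parser->parse(deployFile, modelFile, *network, DataType::kFLOAT);

    // 3. Mark the network outputs and set the maximum batch size
    for (auto& s : outputs)
        network->markOutput(*blobNameToTensor->find(s.c_str()));
    builder->setMaxBatchSize(batchSize);

    // 4. Launch the TensorRT optimization engine and serialize the result
    ICudaEngine* engine = builder->buildCudaEngine(*network);
    trtModelStream = engine->serialize();

    // Clean-up
    network->destroy();
    parser->destroy();
    engine->destroy();
    builder->destroy();
}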

TensorRT: C++ API

  • 1. Initializations and import of a Caffe network model in

TensorRT

  • 2. Setup of the TensorRT inference engine

10

  • 3. I/O data structures and launch of the inference
  • 4. Parsing the results and clean-up of the environment
  • 2. Setup of the TensorRT inference engine

TensorRT: C++ API (2)

<…>
IRuntime* runtime = createInferRuntime(gLogger);       // Creates TensorRT runtime engine
ICudaEngine* engine = runtime->deserializeCudaEngine(  // Retrieves the optimized network
    trtModelStream->data(), trtModelStream->size(), NULL );
IExecutionContext* context = engine->createExecutionContext();  // Initializes the execution context for TensorRT
<continue…>


TensorRT: C++ API (3)

<…>
float inputData[<batch size> * <input size>];   // Allocates the input buffer
struct OUTPUT_RESULT outputData;                // Allocates the output buffers
fillImageData(inputData);                       // Fills the input buffer
doInference( *context, <batch size>, inputData, outputData );   // Launches the inference!
<continue…>

TensorRT: Launching inference

void doInference(…) {
    const ICudaEngine& engine = context.getEngine();         // Retrieves the network
    void* buffers[<num input blobs> + <num output blobs>];   // Pointers to GPU input and output buffers

    // for each input blob
    int inputIndex = engine.getBindingIndex(<name of input blob>);    // Get id of input buffer
    cudaMalloc(&buffers[inputIndex], <input size>);                   // Allocate input buffer on GPU memory

    // for each output blob
    int outputIndex = engine.getBindingIndex(<name of output blob>);  // Get id of output buffer
    cudaMalloc(&buffers[outputIndex], <output size>);                 // Allocate output buffer on GPU memory
    <continue…>

TensorRT: Launching inference

void doInference(
    IExecutionContext& context,
    int batchSize,
    float* input,
    struct OUTPUT_RESULT& output)
{
    <…>
    // for each input
    cudaMemcpy(buffers[inputIndex], input, inputSize,
               cudaMemcpyHostToDevice);            // Copy input buffer to the GPU input buffer

    context.execute(batchSize, buffers);           // Launch the inference! (blocking call)

    // for each output
    cudaMemcpy(output.<buffer>, buffers[outputIndex], <output size>,
               cudaMemcpyDeviceToHost);            // Copy GPU output buffer to output buffer

    <free buffers>

TensorRT: C++ API

  • 1. Initializations and import of a Caffe network model in

TensorRT

  • 2. Setup of the TensorRT inference engine

16

  • 3. I/O data structures and launch of the inference
  • 4. Parsing the results and clean-up of the environment
  • 2. Setup of the TensorRT inference engine

TensorRT: C++ API (4)

<…>
parseResult(outputData);        // Extract the output data (details will follow)

// Cleaning up the environment
context->destroy();
engine->destroy();
runtime->destroy();
<delete allocated dyn memory>
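For reference, the four steps can be strung together in a single main() as sketched below (assembled from the fragments above, using the ResNet18 example introduced next; fillImageData, parseResult, OUTPUT_RESULT and the file names are the placeholders used in these slides, and error checking is omitted):

int main()
{
    // 1. Initializations and import of the Caffe network model in TensorRT
    IHostMemory* trtModelStream{nullptr};
    std::vector<std::string> net_outputs = { "Layer11_bbox", "Layer11_cov" };
    caffeToTRTModel("resnet18.prototxt", "resnet18.caffemodel",
                    net_outputs, 1, trtModelStream);

    // 2. Setup of the TensorRT inference engine
    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine = runtime->deserializeCudaEngine(
        trtModelStream->data(), trtModelStream->size(), NULL);
    IExecutionContext* context = engine->createExecutionContext();

    // 3. I/O data structures and launch of the inference
    // (in a real application this buffer would be heap-allocated)
    float inputData[1 * 3 * 368 * 640];
    struct OUTPUT_RESULT outputData;
    fillImageData(inputData);
    doInference(*context, 1, inputData, outputData);

    // 4. Parsing the results and clean-up of the environment
    parseResult(outputData);
    context->destroy();
    engine->destroy();
    runtime->destroy();
    trtModelStream->destroy();
    return 0;
}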

ResNet18

  • DNN for object classification and detection
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition”
  • Available as a Caffe model
  • We use a pre-trained version that detects 3 classes (people, two-wheelers, and cars)


ResNet18: input

Input size = 1 * 3 * 368 * 640 (batch * channels * height * width: one 3-channel image, 368 pixels high and 640 pixels wide)


ResNet18: outputs

  • The outputs are organized over a grid of X columns and Y rows; the numbers of columns and rows are fixed parameters of the DNN
  • For each class c, each cell is assigned a confidence in [0, 1] with which an item of c is detected

ResNet18: outputs

  • For each class, one bbox for each cell
  • Each bbox is characterized by the (x, y) of its top-left edge and the (x, y) of its bottom-right edge

ResNet18: outputs

  • Confidence values are meant to be compared with thresholds

(Figure: a grid of example per-cell confidence values; with a threshold set to 0.3, a bbox whose confidence is 0.2 is discarded)

ResNet18: outputs

  • Final bboxes are built with post-processing



ResNet18: outputs

  • ResNet18 detects C classes of items
  • For each class:
    – For each cell of the grid:
      • Matching confidence (to be compared with a threshold)
      • Coordinates of the top-left edge of a bbox
      • Coordinates of the bottom-right edge of a bbox
  • Total number of outputs:

C * <# rows> * <# cols> * (1 + (2 + 2))
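For the test-case ResNet18 described below (C = 3 classes on a 23 × 40 grid), this amounts to 3 * 23 * 40 * (1 + 4) = 13,800 output values: 2,760 confidences plus 11,040 bbox coordinates.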

ResNet18: outputs

  • ResNet18 has two separate output blobs for the confidence values and the bboxes: Layer11_bbox and Layer11_cov
  • Our test-case ResNet18 is trained for C = 3 classes
  • Sizes for the grid: 40 columns, 23 rows
  • Layer11_bbox has 12 = 3 classes * (2 + 2) channels per cell: the outputs for the bbox coordinates
  • Layer11_cov has 3 channels per cell: one match confidence per class
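The slides do not show the post-processing code itself; purely as an illustration, a thresholding pass over the two blobs could look like the sketch below. It assumes class-major, row-major (CHW-like) layout, with the four coordinates per cell stored as four consecutive 23 × 40 planes per class; this layout is an assumption, not something stated in the slides, and DetectedBox is a hypothetical helper type.

#include <vector>

// Hypothetical bbox record produced by the thresholding pass.
struct DetectedBox { int cls; float x1, y1, x2, y2, confidence; };

// cov:  3 * 23 * 40 confidences        (Layer11_cov)
// bbox: 3 * 4 * 23 * 40 coordinates    (Layer11_bbox)
std::vector<DetectedBox> thresholdBoxes(const float* cov, const float* bbox,
                                        float threshold)
{
    const int C = 3, ROWS = 23, COLS = 40;
    const int cell = ROWS * COLS;
    std::vector<DetectedBox> out;
    for (int c = 0; c < C; c++)
        for (int r = 0; r < ROWS; r++)
            for (int col = 0; col < COLS; col++) {
                float conf = cov[c * cell + r * COLS + col];
                if (conf < threshold)
                    continue;          // e.g., with threshold 0.3, confidence 0.2 is discarded
                const float* b = bbox + c * 4 * cell;   // 4 coordinate planes per class
                int idx = r * COLS + col;
                out.push_back({ c,
                                b[0 * cell + idx],      // x of top-left edge
                                b[1 * cell + idx],      // y of top-left edge
                                b[2 * cell + idx],      // x of bottom-right edge
                                b[3 * cell + idx],      // y of bottom-right edge
                                conf });
            }
    return out;
}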


TensorRT: C++ API (1) - Example

IHostMemory* trtModelStream{nullptr};
std::vector<std::string> net_outputs = {
    "Layer11_bbox",      // Coordinates of the bboxes
    "Layer11_cov"        // Match confidence for each pair (cell, class)
};
caffeToTRTModel(
    "resnet18.prototxt",     // Model file
    "resnet18.caffemodel",   // Weights file
    net_outputs,
    1,                       // batch size
    trtModelStream );

TensorRT: C++ API (3) - Example

void doInference(…) {
    <…>
    void* buffers[ 1 + 2 ];

    // Input image
    int inputIndex = engine.getBindingIndex("data");
    cudaMalloc(&buffers[inputIndex], 3*640*368 * sizeof(float));

    // Coordinates of the bboxes
    int output1Index = engine.getBindingIndex("Layer11_bbox");
    cudaMalloc(&buffers[output1Index], 3*4*40*23 * sizeof(float));

    // Match confidence for each pair (cell, class)
    int output2Index = engine.getBindingIndex("Layer11_cov");
    cudaMalloc(&buffers[output2Index], 3*40*23 * sizeof(float));
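Combining the generic doInference() fragments with the ResNet18 binding names and sizes above gives the following minimal synchronous sketch (OUTPUT_RESULT and its member names are illustrative assumptions, as is sizing the buffers in sizeof(float) units; error checking is omitted):

#include <cuda_runtime.h>
#include "NvInfer.h"

using namespace nvinfer1;

// Hypothetical container for the two ResNet18 output blobs.
struct OUTPUT_RESULT {
    float bbox[3 * 4 * 40 * 23];   // Layer11_bbox
    float cov [3 * 40 * 23];       // Layer11_cov
};

void doInference(IExecutionContext& context, int batchSize,
                 float* input, OUTPUT_RESULT& output)
{
    const ICudaEngine& engine = context.getEngine();   // Retrieves the network
    void* buffers[1 + 2];                              // GPU input and output buffers

    const size_t inputSize = 3 * 640 * 368 * sizeof(float);
    const size_t bboxSize  = 3 * 4 * 40 * 23 * sizeof(float);
    const size_t covSize   = 3 * 40 * 23 * sizeof(float);

    int inputIndex = engine.getBindingIndex("data");
    int bboxIndex  = engine.getBindingIndex("Layer11_bbox");
    int covIndex   = engine.getBindingIndex("Layer11_cov");
    cudaMalloc(&buffers[inputIndex], inputSize);
    cudaMalloc(&buffers[bboxIndex],  bboxSize);
    cudaMalloc(&buffers[covIndex],   covSize);

    // Copy input to the GPU, run the inference (blocking), copy results back
    cudaMemcpy(buffers[inputIndex], input, inputSize, cudaMemcpyHostToDevice);
    context.execute(batchSize, buffers);
    cudaMemcpy(output.bbox, buffers[bboxIndex], bboxSize, cudaMemcpyDeviceToHost);
    cudaMemcpy(output.cov,  buffers[covIndex],  covSize,  cudaMemcpyDeviceToHost);

    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[bboxIndex]);
    cudaFree(buffers[covIndex]);
}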

Profiling with TensorRT

struct Profiler : public IProfiler {     // IProfiler is a standard class of TensorRT
    typedef std::pair<std::string, float> Record;
    std::vector<Record> mProfile;

    // Method called by the TensorRT runtime engine to report on execution times of layers
    virtual void reportLayerTime(const char* layerName, float ms)
    {
        auto record = std::find_if(mProfile.begin(), mProfile.end(),
            [&](const Record& r){ return r.first == layerName; });
        if (record == mProfile.end())
            mProfile.push_back(std::make_pair(layerName, ms));
        else
            record->second += ms;
    }
    <continue…>


Profiling with TensorRT

struct Profiler : public IProfiler {
    <…>
    void printLayerTimes()
    {
        float totalTime = 0;
        for (size_t i = 0; i < mProfile.size(); i++)
        {
            printf("%-40.40s %4.3fms\n", mProfile[i].first.c_str(), mProfile[i].second);
            totalTime += mProfile[i].second;
        }
        printf("Time over all layers: %4.3f\n", totalTime);
    }
} gProfiler;

TensorRT: C++ API (2) – with profiler

IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(
    trtModelStream->data(), trtModelStream->size(), NULL );
IExecutionContext* context = engine->createExecutionContext();
context->setProfiler(&gProfiler);

The profiler can be attached to a TensorRT execution context
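As a small usage sketch (the iteration count is arbitrary and doInference() is the function defined earlier in these slides): once the profiler is attached, per-layer times accumulate across inferences and can then be printed with the printLayerTimes() method defined above.

// Run a few inferences so gProfiler accumulates per-layer times,
// then print the report.
for (int i = 0; i < 10; i++)
    doInference(*context, 1, inputData, outputData);
gProfiler.printLayerTimes();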

Profiling with TensorRT

Layer1 + Layer1_relu                         3.159ms
Layer2                                       0.652ms
Layer3_block_0 + Layer3_block_0_relu         1.638ms
Layer3_block_1 + Layer3 + Layer3_relu        1.731ms
Layer4_block_0 + Layer4_block_0_relu         1.660ms
Layer4_block_1 + Layer4 + Layer4_relu        1.822ms
Layer5_block_0 + Layer5_block_0_relu         1.248ms
Layer5_proj_block                            0.241ms
Layer5_block_1 + Layer5 + Layer5_relu        1.576ms
Layer6_block_0 + Layer6_block_0_relu         1.564ms
Layer6_block_1 + Layer6 + Layer6_relu        1.652ms
Layer7_block_0 + Layer7_block_0_relu         1.266ms
Layer7_proj_block                            0.194ms
Layer7_block_1 + Layer7 + Layer7_relu        1.831ms
Layer8_block_0 + Layer8_block_0_relu         2.732ms
Layer8_block_1 + Layer8 + Layer8_relu        2.999ms
Layer9_block_0 + Layer9_block_0_relu         3.515ms
Layer9_proj_block                            0.636ms
Layer9_block_1 + Layer9 + Layer9_relu        9.899ms
Layer10_block_0 + Layer10_block_0_relu       6.807ms
Layer10_block_1 + Layer10 + Layer10_relu     6.754ms
Layer11_bbox                                 0.139ms
<…>

Note: the entries combining several layers (e.g., "Layer1 + Layer1_relu") are layers merged by TensorRT.

Compiling a TensorRT Application

  • A typical C++ application developed with the TensorRT API must be linked against the following dynamic libraries:
    – nvparsers, nvinfer, cudart, pthread
  • The Nvidia libraries are available within the CUDA and TensorRT folders, e.g.,
    – /usr/local/cuda/lib/ and /tensorrt/TensorRT-4.0.1.7/lib64/ on a standard configuration of the Jetson TX2
  • CUDA and TensorRT headers are also available within their folders, e.g.,
    – /usr/local/cuda/include/ and /tensorrt/TensorRT-4.0.1.7/include/ on a standard configuration of the Jetson TX2

Example Makefile

APP := my_app
CC := g++
CUDA_INSTALL_PATH ?= /usr/local/cuda
TRT_INSTALL_PATH ?= /tensorrt/TensorRT-4.0.1.7

SRCS := my_app.cpp
OBJS := $(SRCS:.cpp=.o)

CPPFLAGS := -std=c++11 -I"$(TRT_INSTALL_PATH)/include" \
            -I"$(CUDA_INSTALL_PATH)/include"
LDFLAGS  := -lnvparsers -lnvinfer -lcudart -pthread \
            -L"$(TRT_INSTALL_PATH)/lib" \
            -L"$(CUDA_INSTALL_PATH)/lib64"

all: $(APP)

%.o: %.cpp
	@ $(CC) $(CPPFLAGS) -c $<

$(APP): $(OBJS)
	@ $(CC) -o $@ $(OBJS) $(CPPFLAGS) $(LDFLAGS)

clean:
	rm -rf $(APP) $(OBJS)

Off-line Profiling Tools

  • Nvidia offers two interesting tools that enable fine-grained profiling of computational activities on GPUs, and are hence also particularly useful to profile the execution of DNNs:
    – Visual Profiler: a graphical profiling tool that displays a timeline of your application's CPU and GPU activity, and includes an automated analysis engine to identify optimization opportunities
    – Nvprof: a run-time tool that allows collecting and viewing profiling data from the command line
  • Typical usage:
    • 1. Collect profiling data at run-time with nvprof
    • 2. Export the profiling data
    • 3. Import the data in the Visual Profiler for a graphical representation

Warning: these two tools will be integrated into a new tool named Nvidia Nsight Compute


Visual Profiler

  • First, run nvprof from the command line on your application:

nvprof --export-profile timeline.prof your_app

  • Then, import timeline.prof in the Visual Profiler


Visual Profiler

Note: other flags for nvprof may be required (e.g., --analysis-metrics)
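For instance, a command line along these lines could be used (the flag combination is an illustrative example, not a prescription):

nvprof --analysis-metrics --export-profile analysis.prof your_app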


CUDA Execution Models

TensorRT offers two execution models to launch inference:

  • 1. Synchronous
  • 2. Asynchronous

(Figure: CPU and GPU timelines for the two models. In the synchronous model the CPU issues the memcpy operations and blocks while the GPU runs the inference; in the asynchronous model the CPU keeps running while the memcpy operations and the inference execute on the GPU.)

CUDA Streams

  • A CUDA stream is a sequence of operations that execute on the GPU in the order they are issued
    – Think of a stream as a “container” of operations to be performed on the GPU
  • CUDA operations in different streams may run concurrently
    – They can be effectively executed in parallel on different sets of GPU cores (streaming multiprocessors)
    – CUDA operations from different streams may be interleaved on the same set of GPU cores
  • Only limited details on how CUDA operations are scheduled are publicly available
    – A sort of round-robin scheduling among CUDA streams with the same priority can be experimentally observed

CUDA Streams: Priorities

  • CUDA streams may be assigned priorities when they are created
  • The operations of a high-priority stream can preempt those of a low-priority stream
  • Only limited information on how this preemption is implemented is publicly available
  • Note that devices support only a limited number of priorities
    – For instance, the Jetson TX2 supports just two priorities: -1 (high priority) and 0 (low priority)


Working with CUDA Streams

  • The range of priorities supported by the device can be obtained as follows:

int min_prio;
int max_prio;
cudaDeviceGetStreamPriorityRange(&min_prio, &max_prio);

  • A CUDA stream can be created as follows:

cudaStream_t stream;
cudaStreamCreateWithPriority(&stream, <flag>, <priority>);

    – <flag> can be either 0x0 (default) or 0x1 (cudaStreamNonBlocking: work running in the created stream may run concurrently with work in stream 0, and the created stream performs no implicit synchronization with stream 0)
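A minimal sketch putting the two calls together (variable names are illustrative; note that the second value returned by cudaDeviceGetStreamPriorityRange is the numerically smallest one, i.e., the highest priority, e.g., -1 on the Jetson TX2):

#include <cuda_runtime.h>

int main()
{
    // Query the priority range supported by the device:
    // 'least' is the lowest priority (largest value, e.g., 0 on the Jetson TX2),
    // 'greatest' is the highest priority (smallest value, e.g., -1 on the Jetson TX2).
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    // Create a non-blocking stream with the highest available priority.
    cudaStream_t high_prio_stream;
    cudaStreamCreateWithPriority(&high_prio_stream,
                                 cudaStreamNonBlocking, greatest);

    // ... enqueue work on high_prio_stream ...

    cudaStreamDestroy(high_prio_stream);
    return 0;
}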

TensorRT: Launching inference (synchronous)

void doInference(
    IExecutionContext& context,
    int batchSize,
    float* input,
    struct OUTPUT_RESULT& output)
{
    <…>
    // for each input
    cudaMemcpy(buffers[inputIndex], input, inputSize,
               cudaMemcpyHostToDevice);            // Copy input buffer to the GPU input buffer

    context.execute(batchSize, buffers);           // Launch the inference! (blocking call)

    // for each output
    cudaMemcpy(output.<buffer>, buffers[outputIndex], <output size>,
               cudaMemcpyDeviceToHost);            // Copy GPU output buffer to output buffer

    <free buffers>

TensorRT: Launching inference (asynchronous)

void doInference(
    IExecutionContext& context,
    int batchSize,
    float* input,
    struct OUTPUT_RESULT& output)
{
    <…>
    // for each input
    cudaMemcpy(...);

    context.enqueue(batchSize, buffers, stream, nullptr);   // Launch the inference! (non-blocking call)

    <do other work on CPU>

    cudaStreamSynchronize(stream);    // Wait for completion of the inference (blocking call)

    // for each output
    cudaMemcpy(...);

    <free buffers>
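As a concrete sketch for the ResNet18 example (an illustration, not the slides' exact code), the copies can also be enqueued on the same stream with cudaMemcpyAsync; for the copies to be truly asynchronous, the host buffers should be page-locked (e.g., allocated with cudaHostAlloc). OUTPUT_RESULT is the illustrative struct introduced earlier, and the stream is assumed to have been created as shown above.

// Asynchronous variant of the ResNet18 doInference() sketch.
void doInferenceAsync(IExecutionContext& context, int batchSize,
                      float* input, OUTPUT_RESULT& output, cudaStream_t stream)
{
    const ICudaEngine& engine = context.getEngine();
    void* buffers[1 + 2];

    const size_t inputSize = 3 * 640 * 368 * sizeof(float);
    const size_t bboxSize  = 3 * 4 * 40 * 23 * sizeof(float);
    const size_t covSize   = 3 * 40 * 23 * sizeof(float);

    int inputIndex = engine.getBindingIndex("data");
    int bboxIndex  = engine.getBindingIndex("Layer11_bbox");
    int covIndex   = engine.getBindingIndex("Layer11_cov");
    cudaMalloc(&buffers[inputIndex], inputSize);
    cudaMalloc(&buffers[bboxIndex],  bboxSize);
    cudaMalloc(&buffers[covIndex],   covSize);

    // Enqueue the input copy and the inference on the stream (non-blocking calls)
    cudaMemcpyAsync(buffers[inputIndex], input, inputSize,
                    cudaMemcpyHostToDevice, stream);
    context.enqueue(batchSize, buffers, stream, nullptr);

    // Enqueue the output copies on the same stream
    cudaMemcpyAsync(output.bbox, buffers[bboxIndex], bboxSize,
                    cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(output.cov,  buffers[covIndex],  covSize,
                    cudaMemcpyDeviceToHost, stream);

    // <do other work on CPU>

    // Wait for completion of all enqueued operations (blocking call)
    cudaStreamSynchronize(stream);

    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[bboxIndex]);
    cudaFree(buffers[covIndex]);
}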

LIVE DEMOS

Nvidia Repository

  • Nvidia is continuously updating a GitHub repository with several open-source examples and demos based on TensorRT
  • Check it out!

https://github.com/dusty-nv/jetson-inference

Credits

[1] http://gpucomputing.shef.ac.uk/static/slides/2018-07-19-dl-cv/deployment.pdf
[2] http://on-demand.gputechconf.com/gtcdc/2017/presentation/dc7172-shashank-prasanna-deep-learning-deployment-with-nvidia-tensorrt.pdf
[3] http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
[4] Venieris et al., “Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions”, https://arxiv.org/pdf/1803.05900.pdf
[5] https://devblogs.nvidia.com/jetson-tx2-delivers-twice-intelligence-edge/