 
S9545 - USING THE DEEPSTREAM SDK FOR AI-BASED VIDEO ANALYTICS
Anudeep Nallamothu - NVIDIA Solutions Architect
Andrew Bull - NVIDIA Solutions Architect
AGENDA
• Realtime Streaming Video Analytics
• Framework for Analyzing Video
• Understand the Basics: DeepStream SDK 3.0
• Hardware Platforms
• An Overview of TensorRT 5.0
• Transfer Learning Toolkit
• Build with DeepStream: Example Applications
• Getting Started Resources
REALTIME STREAMING VIDEO ANALYTICS 3
REALTIME STREAMING VIDEO ANALYTICS FROM EDGE TO CLOUD
• Parking Management • Traffic Engineering • Access Control • Managing Operations • Managing Logistics • Retail Analytics • Optical Inspection • Content Filtering
FRAMEWORK FOR ANALYZING VIDEO 5
FRAMEWORK FOR ANALYZING VIDEO MULTIMEDIA APIs COMPOSITE STREAM &BATCH MULTIMEDIA APIs CUDA TENSORRT, CUDA PROCESSING LOCAL DISPLAY TRACK, METADATA Metadata DECODE PRE-PROCESS DETECT, PROCESSING CLASSIFY REMOTE DISPLAY Data Analytics Perception 6
DEEPSTREAM FOR AI APPLICATION PERFORMANCE AND SCALE
Roadmap: v1.0, v2.0, v3.0, NEXT (DeepStream Next - POR can change), moving from Perception to Perception and Analytics
• Multi-GPU containerized applications
• 360D cameras
• Solution framework
• Perception - edge to cloud
• Dynamic stream management
• Optical flow
• Scalability
• IOT services
• Unified APIs across platforms
• Remote display
• Perception
• Multi-streams / multi-DNNs
• Multi-GPU dynamic orchestration
• Platform specific APIs
• Custom graphs
• Indexed video storage and retrieval
• Streams: Multi (Tesla), single (Jetson)
• Workflow templates for full solutions
DEEPSTREAM 3.0 8
DEEP LEARNING FOR IVA End-to-end workflow Accelerate building and deploying heterogeneous applications for IVA use cases with TLT & DeepStream 3.0 9
DEEPSTREAM SDK 10
NVIDIA IVA PLATFORM Deploy from the edge to the cloud EDGE / ON-PREMISE CORE/CLOUD Inference Training and Inference NVR Camera Data center NVR / APPLIANCE SERVER DEEPSTREAM  TENSORRT  JETPACK QUADRO / TESLA TESLA / DGX JETSON 11
WHAT’S NEW IN DEEPSTREAM 3.0
• Latest GPUs (Tesla T4, Jetson Xavier): TensorRT 5, CUDA 10
• Dynamic stream management: add, remove, modify streams on the fly
• New plugins (plugin, low level lib, GPU): increased capability and throughput
• Easy to scale and manage: deploy in Docker containers
• High efficiency and throughput with TLT: TLT model files are plug-n-play
• Connect edge to cloud: stream and batch analytics on metadata
DEEPSTREAM STREAMING ARCHITECTURE
Pipeline stages: RTSP/RAW (camera capture) -> DECODE/ISP (decode, process) -> IMAGE PROCESSING (scale, dewarp, crop) -> BATCHING (stream mgmt) -> DNN(s) (detect & classify) -> TRACKING -> VISUALIZATION (on screen display) -> DISPLAY/STORAGE (output)
Hardware blocks shown under the stages: GigE, NVDEC, ISP, GPU, CPU, VIC, VPA, TC, DLA, HDMI, SATA
DEEPSTREAM BUILDING BLOCK
• A plugin model based pipeline architecture
• Graph-based pipeline interface to allow high-level component interconnect
• Heterogeneous processing on GPU and CPU
• Hides parallelization and synchronization under the hood
• Inherently multi-threaded
Diagram: Input + [Metadata] -> PLUGIN -> Output + Metadata; PLUGIN -> Low Level API -> LOW LEVEL LIB -> Hardware (GPU)
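The building block on this slide (input plus optional metadata in, output plus metadata out) can be sketched in plain Python. This is only an illustration of the idea and not DeepStream or GStreamer code: `Buffer`, `detector`, `tracker`, and `run_pipeline` are hypothetical names, and the detection result is hard-coded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Buffer:
    """A frame plus the metadata that upstream plugins attached to it."""
    frame: str  # stand-in for image data
    metadata: Dict[str, object] = field(default_factory=dict)


# A "plugin" here is just a function from Buffer to Buffer (assumption).
Plugin = Callable[[Buffer], Buffer]


def detector(buf: Buffer) -> Buffer:
    # Hypothetical detector: attaches one fixed bounding box as metadata.
    buf.metadata["objects"] = [{"label": "car", "box": (10, 20, 110, 80)}]
    return buf


def tracker(buf: Buffer) -> Buffer:
    # Hypothetical tracker: reads the detector's metadata and adds track IDs.
    for track_id, obj in enumerate(buf.metadata.get("objects", [])):
        obj["track_id"] = track_id
    return buf


def run_pipeline(plugins: List[Plugin], buf: Buffer) -> Buffer:
    # Each plugin receives the previous plugin's output and metadata.
    for plugin in plugins:
        buf = plugin(buf)
    return buf


result = run_pipeline([detector, tracker], Buffer(frame="frame-0"))
```

The frame itself passes through unchanged while the metadata grows at each stage, which is the property the diagram emphasizes.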
NVIDIA-ACCELERATED PLUGINS
Plugin Name          Functionality
gst-nvvideocodecs    Accelerated video decoders
gst-nvstreammux      Stream aggregator - muxer and batching
gst-nvinfer          TensorRT based inference for detection & classification
gst-nvtracker        Reference KLT tracker implementation
gst-nvosd            On-Screen Display API to draw boxes and text overlay
gst-tiler            Renders frames from multi-source into 2D grid array
gst-eglglessink      Accelerated X11 / EGL based renderer plugin
gst-nvvidconv        Scaling, format conversion, rotation
Gst-nvdewarp         Dewarping for 360 Degree camera input
Gst-nvmsgconv        Meta data generation
Gst-nvmsgbroker      Messaging to Cloud
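To make the "stream aggregator - muxer and batching" role concrete, here is a minimal sketch of forming batches from several input streams. It is an assumption-laden illustration, not the gst-nvstreammux implementation: the round-robin policy, the `batch_round_robin` function, and the stream names are all hypothetical.

```python
from typing import Dict, List, Tuple

Frame = Tuple[str, int]  # (stream id, frame index)


def batch_round_robin(streams: Dict[str, List[int]], batch_size: int) -> List[List[Frame]]:
    """Interleave frames from all streams, then cut them into fixed-size batches."""
    interleaved: List[Frame] = []
    longest = max((len(frames) for frames in streams.values()), default=0)
    for i in range(longest):
        # Visit streams in insertion order; skip streams that have run out.
        for stream_id, frames in streams.items():
            if i < len(frames):
                interleaved.append((stream_id, frames[i]))
    # The last batch may be shorter than batch_size.
    return [interleaved[i:i + batch_size] for i in range(0, len(interleaved), batch_size)]


batches = batch_round_robin({"cam0": [0, 1], "cam1": [0, 1], "cam2": [0]}, batch_size=2)
```

Batching frames from different cameras lets a single inference call serve several streams, which is the motivation for placing the muxer ahead of the inference plugin.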
SCALE WITH DEEPSTREAM IN DOCKER Discover GPU-Accelerated Containers Innovate in Minutes, Not Weeks Stay Up to Date https://www.nvidia.com/en-us/gpu-cloud/ 16
DEEPSTREAM IOT 17
DEEPSTREAM WITH AZURE IOT EDGE APPLIANCE Azure CLOUD DeepStream container IoT DPS IoT Edge Runtime IoT Edge Agent IoT Edge Hub IoT Hub Web Client Storage and Indexer Service Docker CUDA DRIVER IoT Edge Daemon Search & Query NVIDIA GPU HSM 18
HARDWARE PLATFORMS 19
NVIDIA T4 UNIVERSAL INFERENCE ACCELERATOR
320 Turing Tensor Cores | 2,560 CUDA Cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16GB | 320GB/s | 70 W
[Charts: H.264 decode throughput (streams) and H.265 decode throughput (streams) at 720p30, 1080p30, and 4K30, P4 vs. T4]
THE JETSON FAMILY
From AI at the Edge to Autonomous Machines
Module               Power      Compute                              Size            Price
JETSON NANO          5 - 10W    0.5 TFLOPS (FP16)                    45mm x 70mm     $129 (available in Q2)
JETSON TX2           7 - 15W    1.3 TFLOPS (FP16)                    50mm x 87mm     $299 - $749
JETSON AGX XAVIER    10 - 30W   10 TFLOPS (FP16) | 32 TOPS (INT8)    100mm x 87mm    $1099
Range: AI at the edge to fully autonomous machines
Multiple devices - Same software
JETSON NANO | JETSON TX2 | JETSON AGX XAVIER
GPU: Nano: 128 Core Maxwell, 0.5 TFLOPs (FP16) | TX2: 256 Core Pascal, 1.3 TFLOPS (FP16) | AGX Xavier: NVIDIA Volta architecture with 512 NVIDIA CUDA cores and 64 Tensor cores
CPU: Nano: 4 core ARM A57 @ 1.43 GHz | TX2: 6 core Denver and A57 @ 2GHz | AGX Xavier: 8-core ARM v8.2 64-bit CPU, 8 MB L2 + 4 MB L3
Memory: Nano: 4 GB 64 bit LPDDR4, 25.6 GB/s | TX2: 4 GB 128 bit LPDDR4, 51 GB/s, or 8 GB 128 bit LPDDR4, 58 GB/s | AGX Xavier: 16 GB 256-bit LPDDR4x
Storage: 16 GB eMMC | 16 GB eMMC | 32 GB eMMC | 32 GB eMMC 5.1
Video Encode: Nano: 4K @ 30 | 4x 1080p @ 30 | 8x 720p @ 30 (H.264/H.265); TX2: 2x 4K @ 60 | 4x 4K @ 30 | 14x 1080p @ 30 (H.264/H.265); AGX Xavier: 2x1000MP/sec | 4x 4K @ 60 (HEVC) | 8x 4K @ 30 (HEVC) | 16x 1080p @ 60 (HEVC) | 32x 1080p @ 30 (HEVC)
Video Decode: Nano: 4K @ 60 | 2x 4K @ 30 | 8x 1080p @ 30 | 16x 720p @ 30 (H.264/H.265); TX2: 2x 4K @ 60 | 4x 4K @ 30 | 14x 1080p @ 30 (H.264/H.265); AGX Xavier: 2x1500MP/sec | 2x 8K @ 30 (HEVC) | 6x 4K @ 60 (HEVC) | 12x 4K @ 30 (HEVC) | 26x 1080p @ 60 (HEVC) | 52x 1080p @ 30 (HEVC)
Camera: Nano: 12 (3x4 or 4x2) MIPI CSI-2 DPHY 1.1 lanes (1.5 Gbps); TX2: 12 (3x4 or 6x2) MIPI CSI-2 D-PHY 1.2 lanes (30 Gbps); AGX Xavier: 16 lanes MIPI CSI-2 | 8 SLVS-EC, D-PHY 1.2 (2.5Gb/s per pair, total up to 40 Gbps), C-PHY 1.1 (2.5Gsym/s per trio, total up to 109 Gbps)
WiFi/BT: Requires external chip | Requires external chip | Onboard | Requires external chip
Display: Nano: HDMI 2.0 or DP 1.2 | eDP 1.4 | DSI (1 x2), 2 simultaneous; TX2: HDMI 2.0 or DP 1.2 | eDP 1.4 | DSI (2 x4), 3 simultaneous; AGX Xavier: three multi-mode DP 1.2a/eDP 1.4/HDMI 2.0 a/b, no DSI support
UPHY: Nano: 1 x1/2/4 PCIE | 1 USB 3.0; TX2: 1+ 1 x4 or 1+1+1 x1/x2 PCIe or 3xUSB 3.0; AGX Xavier: 16 lanes PCIe Gen 4, 1x8 + 1x4 + 1x2 + 2x1
SATA: None | 1x SATA through PCIe x1 Bridge
Power mode: 5W|10W | 7.5W|15W | 7.5W|15W | 10W|20W
USB OTG: Not supported | Not supported | Not supported | Not supported
Mechanical: Nano: 69.6mm x 45mm, 260 pin edge connector, no TTP; TX2: 87mm x 50mm, 400 pin connector, integrated TTP; AGX Xavier: 100mm x 87mm, 699-pin connector
JETSON NANO RUNS MODERN AI
[Chart: inference throughput (img/sec) for Resnet50, Inception v4, VGG-19, SSD Mobilenet-v2 (300x300), SSD Mobilenet-v2 (960x544), SSD Mobilenet-v2 (1920x1080), Tiny Yolo, Unet, Super resolution, and OpenPose. Legend: Coral dev board (Edge TPU), Raspberry Pi 3 + Intel Neural Compute Stick 2, Jetson Nano, Not supported/DNR]
TENSORRT 26
NVIDIA TensorRT From Every Framework, Optimized For Each Target Platform TESLA T4 JETSON Xavier TensorRT TensorRT DRIVE AGX NVIDIA DLA TESLA V100 27
TENSORRT OVERVIEW High-performance Deep Learning Inference Engine for Production Deployment ONNX ONNX ONNX We Are Here 28
NVIDIA TensorRT 5
Inference Optimizer and Runtime
Data center, embedded & automotive
• In-framework support for TensorFlow
• Support for all other frameworks and ONNX
• TensorRT inference server microservice with Docker and Kubernetes integration
• New layers and APIs
• New OS support for Windows and CentOS
*New in TRT5
Diagram: FRAMEWORKS -> TensorRT (Optimizer, Runtime) -> GPU PLATFORMS (TESLA T4, DRIVE PX 2, NVIDIA DLA, TESLA V100)
MODEL IMPORTING AI Researchers  Data Scientists  Example: Importing a TensorFlow model Other Frameworks Python/C++ API Python/C++ API Network Model Importer Definition API Runtime inference C++ or Python API 30 developer.nvidia.com/tensorrt
FP16, INT8 PRECISION CALIBRATION
Precision    Dynamic Range                  Note
FP32         -3.4x10^38 ~ +3.4x10^38        Training precision
FP16         -65504 ~ +65504                No calibration required
INT8         -128 ~ +127                    Requires calibration

Top 1 accuracy    FP32      INT8      Difference
Googlenet         68.87%    68.49%    0.38%
VGG               68.56%    68.45%    0.11%
Resnet-50         73.11%    72.54%    0.57%
Resnet-152        75.18%    74.56%    0.61%

Precision calibration for INT8 inference:
• Minimizes information loss between FP32 and INT8 inference on a calibration dataset
• Completely automatic
[Chart: Reduced Precision Inference Performance (ResNet50), Images/Second for CPU-Only, P4, and V100; bars labeled FP32, INT8, and Tensor Core]
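As a rough illustration of why INT8 needs calibration data while FP16 does not, the sketch below maps floating-point values into the -128 to +127 range using a scale derived from a small calibration set. It deliberately uses simple max-absolute-value scaling, which is an assumption made for brevity and not the calibration algorithm TensorRT applies; the function names and sample values are hypothetical.

```python
from typing import List


def int8_scale(calibration: List[float]) -> float:
    """Scale that maps the largest calibration magnitude to 127."""
    max_abs = max(abs(x) for x in calibration)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(x: float, scale: float) -> int:
    # Round to the nearest integer step, then clamp to the INT8 range.
    q = round(x / scale)
    return max(-128, min(127, q))


def dequantize(q: int, scale: float) -> float:
    return q * scale


# Hypothetical calibration samples standing in for real activations.
calibration = [-2.0, -0.5, 0.0, 0.75, 1.27]
scale = int8_scale(calibration)
roundtrip = [dequantize(quantize(x, scale), scale) for x in calibration]
max_error = max(abs(a - b) for a, b in zip(calibration, roundtrip))
```

Values inside the calibrated range come back with an error of at most half a quantization step, while values far outside it saturate at the range limits. That trade-off is what a calibration dataset is used to balance.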