S9391 GstCUDA: Easy GStreamer and CUDA Integration


SLIDE 1

S9391 GstCUDA: Easy GStreamer and CUDA Integration

  • Eng. Daniel Garbanzo
  • MSc. Michael Grüner

GTC March 2019

SLIDE 2

Agenda

  • About RidgeRun
  • GStreamer Overview
  • CUDA Overview
  • GstCUDA Introduction
  • Application Examples
  • Performance Statistics
  • GstCUDA Demo on TX2
  • Q&A

SLIDE 3
  • US Company - R&D Lab in Costa Rica
  • 15 years of experience
  • Embedded Linux and GStreamer experts
  • Custom multimedia solutions
  • Digital signal/image processing
  • AI and Machine Learning solutions
  • System optimization: CUDA, GStreamer, OpenCL, OpenGL, OpenVX, Vulkan
  • Support for embedded and resource constrained systems
  • Professional services, dedicated teams and specialized tools

About Us

SLIDE 4
  • Complex multimedia applications require a lot of processing resources
  • GStreamer offers a flexible way for creating multimedia applications
  • CUDA offers high performance accelerated processing capabilities

Medical Industry, Automotive Industry, Smart Devices, Computer Vision

SLIDE 5
  • Open source framework for audio and video applications
  • Based on a pipeline architecture
  • Extensible design based on plugins (more than 1000 freely available)
  • Automatic format and synchronization handling
  • Tools for easy prototyping

Modularity, Flexibility, Portability

SLIDE 6
  • Each plugin represents a different processing module
  • The plugins are linked and arranged in a pipeline
  • Freedom to build arbitrary pipelines for different applications

Basic MP4 player GStreamer Pipeline

SLIDE 7

Modular design lets you change your application easily!

  • Easily change your application end use
  • Easily change from SW to HW accelerated processing

SLIDE 8

Code equivalent: gst-launch-1.0 v4l2src ! videoconvert ! omxh265enc ! mpegtsmux ! udpsink

Code equivalent: gst-launch-1.0 v4l2src ! videoconvert ! x265enc ! mpegtsmux ! filesink

Modular design lets you change your application easily!
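
The two pipelines above differ only in the encoder and the sink. A minimal C++ sketch of that idea follows: it just assembles the gst-launch style description string, so it runs without GStreamer installed. The helper names `build_pipeline` and `capture_pipeline` are hypothetical, and `videoconvert` is the stock GStreamer element name assumed for the colorspace conversion step.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Join element names with GStreamer's " ! " link operator to form a
// gst-launch style pipeline description.
std::string build_pipeline(const std::vector<std::string>& elements) {
    assert(!elements.empty());
    std::string description;
    for (std::size_t i = 0; i < elements.size(); ++i) {
        if (i > 0) description += " ! ";
        description += elements[i];
    }
    return description;
}

// Hypothetical helper: the same capture pipeline with a swappable
// encoder (HW or SW) and a swappable sink (network or file).
std::string capture_pipeline(const std::string& encoder,
                             const std::string& sink) {
    return build_pipeline(
        {"v4l2src", "videoconvert", encoder, "mpegtsmux", sink});
}
```

Calling `capture_pipeline("omxh265enc", "udpsink")` and `capture_pipeline("x265enc", "filesink")` yields the two descriptions; in a real application the string would be handed to GStreamer's pipeline parser.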

SLIDE 9

SLIDE 10

GstCUDA

SLIDE 11

GstCUDA

SLIDE 12

What Does GstCUDA Solve?

SLIDE 13
  • Integration Complexities

SLIDE 14

Development Time

Without GstCUDA
  • Create GStreamer plugin with CUDA support: 3 months
  • Generate CUDA algorithm: 10 days
  • Integrate CUDA algorithm: 5 days
  • Total = 3.5 months

With GstCUDA
  • Generate CUDA algorithm: 10 days
  • Integrate CUDA algorithm: 0.1 day
  • Total = 10.1 days

  • Reduce development time
  • Focus on the CUDA logic
  • Minimize time to market

SLIDE 15
  • Memcpy

Performance Bottleneck

SLIDE 16

Performance Bottleneck

Without GstCUDA
  • Data transfer bottlenecks cause poor performance
  • Limited framerate at high resolutions

With GstCUDA
  • Efficient memory handling improves performance
  • Up to 2x 4K@60fps
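
To see why extra copies hurt at these resolutions, the raw data rate can be estimated. The sketch below assumes that 4K means 3840x2160 and that frames are I420 (12 bits per pixel); neither is stated on this slide, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one planar YUV 4:2:0 (I420) frame: a full-resolution luma plane
// plus two chroma planes subsampled by 2 in each direction, i.e. 1.5 bytes
// per pixel. Assumes even width and height.
std::uint64_t i420_frame_bytes(std::uint64_t width, std::uint64_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height * 3 / 2;
}

// Raw payload that one stream moves per second at the given framerate.
std::uint64_t stream_bytes_per_second(std::uint64_t width,
                                      std::uint64_t height,
                                      std::uint64_t fps) {
    return i420_frame_bytes(width, height) * fps;
}
```

Under these assumptions one 4K frame is 12,441,600 bytes, so a single 4K@60fps stream carries about 746 MB/s and two streams about 1.49 GB/s. Every additional memcpy per frame reads and writes that amount again, which is the cost that shared buffers avoid.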

SLIDE 17

Supported Platforms

  • Focused on NVIDIA embedded platforms

Jetson TX1, TX2, TX2i and Nano
Jetson AGX Xavier

SLIDE 18

GstCUDA Key Features

SLIDE 19

GstCUDA Key Features

SLIDE 20

Framework Overview

SLIDE 21

Quick Prototyping Elements

SLIDE 22

location = median_filter.so

Cudafilter Element

SLIDE 23

location = thermal_overlay.so

Cudamux Element

IR

SLIDE 24

CUDA Algorithm Interface

  • Make your CUDA algorithm compatible by implementing these interfaces

Cudafilter Interface

bool open();
bool close();
bool process (const GstCudaData &inbuf, GstCudaData &outbuf);
bool process_ip (const GstCudaData &inbuf, GstCudaData &outbuf);

Cudamux Interface

bool open();
bool close();
bool process (vector<GstCudaData> &inbufs, GstCudaData &outbuf);
bool process_ip (vector<GstCudaData> &inbufs, GstCudaData &outbuf);
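
A minimal host-side sketch of an algorithm exposing the four Cudafilter entry points might look like the following. The `GstCudaData` struct here is a stand-in: its real layout is not shown on the slide, so the pointer-plus-size fields are an assumption, as are the class name `InvertFilter` and the idea that the in-place call receives the same memory for input and output. In an actual GstCUDA algorithm the loop would be a CUDA kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for GstCudaData. The real type's layout is not shown on the
// slide; a raw pointer plus a byte count is an assumption for this sketch.
struct GstCudaData {
    std::uint8_t* data;
    std::size_t size;
};

// Hypothetical algorithm with the four Cudafilter entry points.
class InvertFilter {
public:
    bool open() { opened_ = true; return true; }    // acquire resources
    bool close() { opened_ = false; return true; }  // release resources

    // Not in place: read inbuf, write the result to a separate outbuf.
    bool process(const GstCudaData& inbuf, GstCudaData& outbuf) {
        if (!opened_ || inbuf.size != outbuf.size) return false;
        invert(inbuf.data, outbuf.data, inbuf.size);
        return true;
    }

    // In place: assumed here to receive the same memory as input and output.
    bool process_ip(const GstCudaData& inbuf, GstCudaData& outbuf) {
        if (!opened_ || inbuf.data != outbuf.data ||
            inbuf.size != outbuf.size) return false;
        invert(outbuf.data, outbuf.data, outbuf.size);
        return true;
    }

private:
    // Host loop standing in for a CUDA kernel.
    static void invert(const std::uint8_t* src, std::uint8_t* dst,
                       std::size_t n) {
        assert(src != nullptr && dst != nullptr);
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = static_cast<std::uint8_t>(255 - src[i]);
    }

    bool opened_ = false;
};
```

The not-in-place path leaves the input untouched at the cost of a second buffer; the in-place path overwrites the frame and needs no extra allocation.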

SLIDE 25

Buffer Processing Methods

process_ip (In place) process (Not in place)

SLIDE 26

Create Your Custom Element

  • Some applications may require specialized elements
  • GstCUDA provides base classes to simplify development

SLIDE 27

GstCUDA Framework Usage Example

SLIDE 28

GstCUDA Framework Summary

  • The framework includes:

GstCUDA API
  • Utils to handle memory interfaces
  • GStreamer Unified Memory allocators
  • Parent classes for different topologies

Quick prototyping elements
  • Generic elements to evaluate custom algorithms
  • Runtime loading of CUDA algorithms

Set of examples
  • Complete GstCUDA element boilerplate
  • CUDA algorithms for the prototyping elements

SLIDE 29

GstCUDA Application Areas Examples Video

SLIDE 30

Industrial Applications: Border Enhancement

SLIDE 31

Automation Applications: Hough Transform

SLIDE 32

Security Applications: Motion Detection/Estimation

SLIDE 33

Performance Statistics

SLIDE 34

Varying Algorithm / Fixed Image Size

  • Image convolution algorithm
  • Stressing compute capabilities
  • Variable convolution kernel size
  • 1080p@240fps / 1080p@60fps stream input

  • Cudafilter element
  • Unified Memory allocator
  • Jetson TX2 platform
  • Not In-place

Test Conditions

location = convolution.so
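
For reference, the kind of work this test scales can be sketched as a plain CPU convolution. This is not the contents of convolution.so, which the slides do not show; it is a host-side reference under assumed conventions (single-channel float image, row-major storage, odd square kernel, edge pixels replicated at the borders), with a hypothetical function name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Convolve a single-channel, row-major image with a square k x k kernel
// (k odd). Samples outside the image are taken from the nearest edge pixel.
std::vector<float> convolve2d(const std::vector<float>& image,
                              int width, int height,
                              const std::vector<float>& kernel, int k) {
    assert(k > 0 && k % 2 == 1);
    assert(image.size() == static_cast<std::size_t>(width) * height);
    assert(kernel.size() == static_cast<std::size_t>(k) * k);
    const int r = k / 2;
    std::vector<float> out(image.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int ky = -r; ky <= r; ++ky) {
                for (int kx = -r; kx <= r; ++kx) {
                    const int sy = std::min(std::max(y + ky, 0), height - 1);
                    const int sx = std::min(std::max(x + kx, 0), width - 1);
                    // The kernel index is mirrored, which makes this a true
                    // convolution rather than a correlation.
                    acc += image[sy * width + sx] *
                           kernel[(r - ky) * k + (r - kx)];
                }
            }
            out[y * width + x] = acc;
        }
    }
    return out;
}
```

Each output pixel costs k*k multiply-adds, so growing the kernel raises the compute load quadratically while the amount of data transferred per frame stays fixed. That is what makes a variable kernel size a useful way to stress compute rather than memory.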

SLIDE 35

Varying Algorithm / Fixed Image Size

Framerate Stats

SLIDE 36

Varying Algorithm / Fixed Image Size

Processing Time Stats

SLIDE 37

Varying Algorithm / Fixed Image Size

CPU Load Stats GPU Load Stats

*baseline = simple capture pipeline (without GstCUDA)

SLIDE 38

Fixed Algorithm / Varying Image Size

  • Memory copy algorithm
  • Stressing data transfer
  • Variable input resolution
  • Cudafilter element
  • Unified Memory allocator
  • Jetson TX2 platform
  • In-place vs. not in-place

Test Conditions

location = memcpy.so

SLIDE 39

Fixed Algorithm / Varying Image Size

Framerate Stats

Note: Maximum Framerate limited to 245 fps by the video source

SLIDE 40

Fixed Algorithm / Varying Image Size

Processing Time Stats

SLIDE 41

Fixed Algorithm / Varying Image Size

CPU Load Stats GPU Load Stats

*baseline = simple capture pipeline (without GstCUDA)

SLIDE 42

Fixed Algorithm / Varying Image Size

  • Simple image mixing algorithm
  • Stressing data transfer
  • Variable input resolution
  • Cudamux element
  • Unified Memory allocator
  • In-place=True
  • Jetson TX2 platform

Test Conditions

location = mixer.so
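
A simple two-input mix in the shape of the Cudamux interface could be sketched as below. The contents of mixer.so are not shown in the slides, so this is only an illustration: `GstCudaData` is a stand-in struct with an assumed layout, `AverageMixer` is a hypothetical name, the 50/50 average is an assumed mixing rule, and the in-place variant assumes the output may share memory with the first input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for GstCudaData; the pointer-plus-size layout is an assumption.
struct GstCudaData {
    std::uint8_t* data;
    std::size_t size;
};

// Hypothetical two-input mixer following the Cudamux method signatures.
class AverageMixer {
public:
    bool open() { return true; }
    bool close() { return true; }

    // Writes the rounded 50/50 average of the two inputs into outbuf.
    bool process(std::vector<GstCudaData>& inbufs, GstCudaData& outbuf) {
        if (inbufs.size() != 2) return false;
        const GstCudaData& a = inbufs[0];
        const GstCudaData& b = inbufs[1];
        if (a.size != b.size || a.size != outbuf.size) return false;
        assert(a.data != nullptr && b.data != nullptr && outbuf.data != nullptr);
        for (std::size_t i = 0; i < outbuf.size; ++i) {
            // Both inputs are read before the output element is written,
            // so outbuf may safely alias either input.
            const unsigned sum = static_cast<unsigned>(a.data[i]) + b.data[i];
            outbuf.data[i] = static_cast<std::uint8_t>((sum + 1) / 2);
        }
        return true;
    }

    // In place: same arithmetic, with outbuf assumed to share memory
    // with the first input.
    bool process_ip(std::vector<GstCudaData>& inbufs, GstCudaData& outbuf) {
        return process(inbufs, outbuf);
    }
};
```

Because the per-pixel arithmetic is trivial, a test built on a mix like this mostly measures how fast frames move through memory, which matches the "stressing data transfer" condition above.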

SLIDE 43

Fixed Algorithm / Varying Image Size

Framerate Stats

Note: Maximum Framerate limited to 240fps by the video source

SLIDE 44

Fixed Algorithm / Varying Image Size

CPU Load Stats GPU Load Stats

*baseline = simple capture pipeline (without GstCUDA)

SLIDE 45

GstCUDA Live Demo on Jetson TX2 Sobel Filter 1080p60fps

Code equivalent: gst-launch-1.0 nvcamerasrc sensor-id=2 fpsRange=60,60 ! "video/x-raw(memory:NVMM),width=1920,height=1080,framerate=60/1,format=I420" ! nvvidconv ! "video/x-raw" ! queue ! cudafilter in-place=false location=/borders.so ! queue ! nvoverlaysink
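
The demo loads its Sobel filter from borders.so, whose source is not shown. As an illustration of the operation itself, here is a CPU reference for a 3x3 Sobel edge detector on a grayscale plane (for an I420 frame that would be the luma plane). The function name, the |gx| + |gy| magnitude approximation and the zeroed border are choices made for this sketch, not details taken from the demo.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// 3x3 Sobel edge magnitude on a row-major 8-bit grayscale image.
// Border pixels have no full neighbourhood and are left at 0.
std::vector<std::uint8_t> sobel(const std::vector<std::uint8_t>& gray,
                                int width, int height) {
    assert(gray.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out(gray.size(), 0);
    for (int y = 1; y + 1 < height; ++y) {
        for (int x = 1; x + 1 < width; ++x) {
            // Neighbour lookup relative to the current pixel.
            auto p = [&](int dx, int dy) {
                return static_cast<int>(gray[(y + dy) * width + (x + dx)]);
            };
            // Horizontal and vertical gradient estimates.
            const int gx = -p(-1, -1) + p(1, -1)
                           - 2 * p(-1, 0) + 2 * p(1, 0)
                           - p(-1, 1) + p(1, 1);
            const int gy = -p(-1, -1) - 2 * p(0, -1) - p(1, -1)
                           + p(-1, 1) + 2 * p(0, 1) + p(1, 1);
            // |gx| + |gy| approximates the gradient magnitude; clamp to 8 bits.
            const int mag = std::abs(gx) + std::abs(gy);
            out[y * width + x] = static_cast<std::uint8_t>(std::min(mag, 255));
        }
    }
    return out;
}
```

Each output pixel depends only on its own 3x3 neighbourhood, so the two loops map naturally onto one GPU thread per pixel.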

SLIDE 46
  • GstCUDA wiki page:

○ gstcuda.ridgerun.com

  • RidgeRun Website:

○ ridgerun.com

  • RidgeRun Contact:

○ ridgerun.com/contact

Resources
