NVIDIA VIDEO TECHNOLOGIES Abhijit Patait, 3/20/2019 NVIDIA Video - - PowerPoint PPT Presentation

SLIDE 1

Abhijit Patait, 3/20/2019

NVIDIA VIDEO TECHNOLOGIES

SLIDE 2

AGENDA

  • NVIDIA Video Technologies Overview
  • Turing Video Enhancements
  • Video Codec SDK Updates
  • Benchmarks
  • Roadmap

SLIDE 3

NVIDIA VIDEO TECHNOLOGIES

SLIDE 4

NVIDIA GPU VIDEO CAPABILITIES

[Block diagram: CPU, buffer, CUDA cores, NVDEC (decode HW*), NVENC (encode HW*)]

NVENC (Encode HW*)

Formats:

  • H.264
  • H.265
  • Lossless

Bit depth:

  • 8 bit
  • 10 bit

Color**

  • YUV 4:4:4
  • YUV 4:2:0

Resolution

  • Up to 8K***

NVDEC (Decode HW*)

Formats:

  • MPEG-2
  • VC1
  • VP8
  • VP9
  • H.264
  • H.265
  • Lossless

Bit depth:

  • 8/10/12 bit

Color**

  • YUV 4:2:0
  • YUV 4:4:4

Resolution

  • Up to 8K***

* See support diagram for previous NVIDIA HW generations
** 4:4:4 is supported only on HEVC for Turing; 4:2:2 is not natively supported on HW
*** Support is codec dependent

SLIDE 5

VIDEO CODEC SDK

  • A comprehensive set of APIs for GPU-accelerated video encode and decode
  • NVENCODE API for video encode acceleration
  • NVDECODE API for video & JPEG decode acceleration (formerly called NVCUVID API)
  • Independent of CUDA/3D cores on GPU for pre-/post-processing

Use cases: video archiving, intelligent video analytics, remote desktop streaming, GameStream, video transcoding, video editing

SLIDE 6

NVIDIA VIDEO TECHNOLOGIES

SOFTWARE HARDWARE

Video Encode and Decode for Windows and Linux CUDA, DirectX, OpenGL interoperability

VIDEO CODEC, OPTICAL FLOW SDK

Video decode

NVDEC NVIDIA DRIVER NVENC

Video encode

CUDA TOOLKIT

Easy access to GPU video acceleration

APIs, libraries, tools, samples

DeepStream SDK, cuDNN, TensorRT, cuBLAS, cuSPARSE

CUDA

High-performance computing on GPU

DALI

SLIDE 7

VIDEO CODEC SDK UPDATE

SLIDE 8

VIDEO CODEC SDK UPDATE

2016: SDK 7.x

  • Pascal
  • 10-bit encode
  • FFmpeg
  • ME-only for VR
  • Quality++

2017: SDK 8.0

  • 10-bit transcode
  • 10/12-bit decode
  • OpenGL
  • Dec. optimizations
  • WP, AQ, Enc. Quality

Q1 2018: SDK 8.1

  • B-as-ref
  • QP/emphasis map
  • 4K60 HEVC encode
  • Reusable classes & new sample apps

Q3 2018: SDK 8.2

  • Decode + inference optimizations

2019: SDK 9.0

  • Turing
  • Multi-NVDEC
  • HEVC 4:4:4 decode
  • Encode quality++
  • HEVC B frames

SLIDE 9

VIDEO CODEC SDK 9.0

Feature: who it benefits

  • HEVC B-frames, higher encode quality: higher video encode quality for cloud gaming, game broadcasting (e.g. Twitch), video transcoding (e.g. YouTube, Facebook), OTT/M&E
  • HEVC 4:4:4 decode: end-to-end high-quality remote desktop
  • Multiple NVDECs: higher decode + inference throughput
  • Direct output to vidmem: higher perf with post-processing
  • Power 9 + Tesla V100 SXM2: Video SDK for IBM platforms

SLIDE 10

TURING UPDATES - NVDEC

SLIDE 11

MULTIPLE NVDECS IN TURING

GPU                                Number of NVDECs per GPU
Volta, Pascal & earlier            1
Turing – GeForce (RTX)             1
Turing – Quadro & Tesla (TU106)    3
Turing – Quadro & Tesla (TU104)    2
Turing – others                    1

➢ Quadro & Tesla feature
➢ Auto-load-balanced by driver

SLIDE 12

PASCAL & EARLIER

Single NVDEC

[Diagram: four input bitstreams feed a single NVDEC for high-res decode (1080p, 720p). Each decoded stream is scaled and passed to low-res inference (e.g. 300 × 200). The single NVDEC is the bottleneck.]

SLIDE 13

TURING

Multiple NVDECs

[Diagram: input bitstreams are distributed across NVDEC 0 … NVDEC N for high-res decode (1080p, 720p). Each decoded stream is scaled and passed to low-res inference (e.g. 300 × 200).]

SLIDE 14

END-TO-END 4:4:4 IN TURING

➢ Preserves chroma: text and thin lines
➢ Valuable in desktop streaming

[Image comparison: 4:2:0 vs. 4:4:4]
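To make the chroma difference concrete, the following minimal sketch counts samples per frame for the two formats named on the slide. The function name and the 1920 × 1080 example are illustrative assumptions; the subsampling factors are the standard definitions of 4:2:0 (chroma halved in both dimensions) and 4:4:4 (chroma at full resolution).

```python
def samples_per_frame(width, height, chroma_format):
    """Return (luma_samples, chroma_samples) for one frame."""
    # Horizontal and vertical chroma subsampling divisors per format.
    divisors = {"4:4:4": (1, 1), "4:2:0": (2, 2)}
    div_w, div_h = divisors[chroma_format]
    luma = width * height
    # Two chroma planes (U and V), each subsampled by the divisors.
    chroma = 2 * (width // div_w) * (height // div_h)
    return luma, chroma


for fmt in ("4:2:0", "4:4:4"):
    luma, chroma = samples_per_frame(1920, 1080, fmt)
    print(fmt, "luma:", luma, "chroma:", chroma)
```

At 1080p, 4:4:4 carries four times as many chroma samples as 4:2:0, which is why coloured text and thin lines survive better in desktop streaming.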

SLIDE 15

END-TO-END 4:4:4 IN TURING

HEVC 4:4:4 HW encode & 4:4:4 HW decode

[Diagram: Desktop capture → HW encode → stream over network → decode → render. Pascal & earlier: CPU decode. Turing: HW decode.]

SLIDE 16

TURING NVENC ENHANCEMENTS

SLIDE 17

NVENC - ENCODING QUALITY

Focus for Turing NVENC

Enhancement: how to use

  • Rate distortion optimization (RDO): Turing only, always ON
  • Multiple reference frames: preset-dependent
  • HEVC B-frames: NVENCODE API
  • Others

➢ Higher throughput at same quality as Pascal
➢ Turing GPUs have single NVENC engine with higher quality

SLIDE 18

TURING NVENC QUALITY

➢ Focus on quality – RDO, multi-ref, HEVC B-frames, …
➢ Quality vs performance trade-off
➢ Quality is content dependent
➢ 600+ videos of 10-20 secs each: natural, animation, gaming, video conference, movies
➢ 720p, 1080p, 4K, 8K
➢ Quality: PSNR, SSIM, VMAF, subjective
➢ Perf: fps, number of 1080p streams per GPU

SLIDE 19

H.264 ENCODE BENCHMARK

Non latency critical – Turing vs Pascal vs x264

“iso” quality = x264 medium

H.264 – non latency critical (chart values, listed in legend order)

Encoder        Bitrate ratio @ iso quality    #1080p30 streams
T4 medium      0.98                           10.60
T4 fast        1.08                           19.41
P4 slow        1.16                           17.73
P4 medium      1.17                           18.73
x264 slow      0.93                           2.95
x264 medium    1.00                           5.72
x264 fast      1.05                           6.28

Lower bitrate ratio means higher bitrate savings; more streams means higher perf.
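As a worked reading of the chart, the sketch below converts a bitrate ratio and a stream count into savings and speedup relative to the x264 medium reference. The numbers are transcribed from the slide under the assumption that chart values follow the legend order; the function name is hypothetical.

```python
# Values transcribed from the H.264 charts, assuming they follow legend order.
bitrate_ratio = {"T4 medium": 0.98, "T4 fast": 1.08, "x264 medium": 1.00}
streams_1080p30 = {"T4 medium": 10.60, "T4 fast": 19.41, "x264 medium": 5.72}


def relative_to_reference(name, reference="x264 medium"):
    """Return (bitrate savings in percent, stream-count speedup) vs. the reference."""
    savings_pct = (1.0 - bitrate_ratio[name] / bitrate_ratio[reference]) * 100.0
    speedup = streams_1080p30[name] / streams_1080p30[reference]
    return savings_pct, speedup


for name in ("T4 medium", "T4 fast"):
    savings, speedup = relative_to_reference(name)
    print(f"{name}: {savings:+.1f}% bitrate savings, {speedup:.2f}x streams")
```

A negative savings value means the encoder needs more bitrate than the reference to reach the same quality.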

SLIDE 20

H.264 ENCODE BENCHMARK

Non latency critical – FFmpeg commands

NVENC slow
  -preset slow -bufsize BITRATE*2 -maxrate BITRATE*1.5 -profile:v high -bf 3 -b_ref_mode 2 -temporal-aq 1 -rc-lookahead 20 -vsync 0

x264 slow
  -preset slow -tune psnr -vsync 0 -threads 4 -vsync 0

NVENC medium
  -preset medium -rc vbr -profile:v high -bf 3 -b_ref_mode 2 -temporal-aq 1 -rc-lookahead 20 -vsync 0

x264 medium
  -preset medium -tune psnr -threads 4 -vsync 0

NVENC fast
  -preset fast -rc vbr -profile:v high -bf 3 -b_ref_mode 2 -temporal-aq 1 -rc-lookahead 20 -vsync 0

x264 fast
  -preset fast -tune psnr -vsync 0 -threads 4 -vsync 0
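The NVENC slow options use BITRATE as a placeholder (BITRATE*2, BITRATE*1.5). A minimal sketch of expanding that template into a full argument list follows. The encoder selection (`-c:v h264_nvenc`), the `-b:v` target and the file names are assumptions that the slide does not show; the remaining options are copied from the slide.

```python
def nvenc_slow_args(bitrate_bps):
    """Return the slide's NVENC 'slow' options with BITRATE expressions expanded."""
    return [
        "-preset", "slow",
        "-bufsize", str(bitrate_bps * 2),          # BITRATE*2
        "-maxrate", str(int(bitrate_bps * 1.5)),   # BITRATE*1.5
        "-profile:v", "high",
        "-bf", "3",
        "-b_ref_mode", "2",
        "-temporal-aq", "1",
        "-rc-lookahead", "20",
        "-vsync", "0",
    ]


def ffmpeg_command(input_path, output_path, bitrate_bps):
    """Assemble an argv list; encoder name and -b:v are assumptions, not from the slide."""
    head = ["ffmpeg", "-i", input_path, "-c:v", "h264_nvenc", "-b:v", str(bitrate_bps)]
    return head + nvenc_slow_args(bitrate_bps) + [output_path]


# Hypothetical file names; the list can be passed to subprocess.run().
print(" ".join(ffmpeg_command("input.mp4", "output.mp4", 4_000_000)))
```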

SLIDE 21

HEVC ENCODE BENCHMARK

Non latency critical – Turing vs Pascal vs x265

“iso” quality = x265 medium

HEVC – non latency critical (chart values, listed in legend order)

Encoder        Bitrate ratio @ iso quality    #1080p30 streams
T4 medium      1.10                           4.29
T4 fast        1.21                           11.80
P4 medium      1.35                           10.71
x265 slow      0.92                           0.85
x265 medium    1.00                           1.91
x265 fast      1.10                           2.98

Lower bitrate ratio means higher bitrate savings; more streams means higher perf.

SLIDE 22

HEVC ENCODE BENCHMARK

Non latency critical – FFmpeg commands

NVENC slow
  -preset slow -rc vbr_hq -b:v BITRATE -profile:v 4 -bf 2 -rc-lookahead 20 -g 250 -vsync 0

x265 slow
  -preset slow -b:v BITRATE -bf 2 -tune psnr -threads 4 -vsync 0

NVENC medium
  -preset medium -rc vbr_hq -b:v BITRATE -profile:v 4 -bf 2 -rc-lookahead 20 -g 250 -vsync 0

x265 medium
  -preset medium -b:v BITRATE -bf 2 -tune psnr -threads 4 -vsync 0

NVENC fast
  -preset fast -rc vbr_hq -b:v BITRATE -profile:v 4 -bf 2 -temporal-aq 1 -rc-lookahead 20 -g 250 -vsync 0

x265 fast
  -preset fast -b:v BITRATE -bf 2 -tune psnr -threads 4 -vsync 0

SLIDE 23

SOFTWARE UPDATES

SLIDE 24

RECONFIGURE DECODER

Can be reconfigured:

  ✓ Input resolution
  ✓ Scaling resolution
  ✓ Cropping rectangle

Cannot be reconfigured:

  • Codecs
  • Bit-depth and chroma format
  • Deinterlace mode
  • Input resolution beyond max width or max height

Video Codec SDK 8.2

No init time, reuse context, lowers memory fragmentation
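The reconfigure rules above can be expressed as a small decision helper. This is an illustrative sketch only: `StreamFormat` and `can_reconfigure` are hypothetical names, not part of the NVDECODE API, and the logic simply encodes the slide's two lists.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StreamFormat:
    """Hypothetical description of a stream as seen by the decoder."""
    codec: str
    bit_depth: int
    chroma_format: str
    deinterlace_mode: str
    width: int
    height: int


def can_reconfigure(current, new, max_width, max_height):
    """True if only reconfigurable properties changed, per the slide's lists."""
    fixed_unchanged = (
        current.codec == new.codec
        and current.bit_depth == new.bit_depth
        and current.chroma_format == new.chroma_format
        and current.deinterlace_mode == new.deinterlace_mode
    )
    # The new input resolution must stay within the limits the decoder was created with.
    fits = new.width <= max_width and new.height <= max_height
    return fixed_unchanged and fits


current = StreamFormat("hevc", 8, "4:2:0", "weave", 1920, 1080)
smaller = replace(current, width=1280, height=720)
deeper = replace(current, bit_depth=10)
print(can_reconfigure(current, smaller, max_width=3840, max_height=2160))  # True
print(can_reconfigure(current, deeper, max_width=3840, max_height=2160))   # False
```

When the helper returns False, the decoder would have to be destroyed and recreated, which is the init cost that reconfiguration avoids.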

SLIDE 25

DIRECT OUTPUT TO VIDMEM

Video Codec SDK 9.0

[Diagram: CUDA pre-process → NVENC → post-process. SDK 8.2 & earlier: NVENC output goes to host/system memory over PCIe for CPU processing. SDK 9.0: NVENC output can stay in video memory for CUDA post-processing.]

SLIDE 26

OTHER UPDATES

➢ Video Codec SDK now supported on Power 9 + Tesla V100 SXM2
➢ High-level NVDEC error status

SLIDE 27

OPTICAL FLOW

New HW Functionality

➢ 4 × 4 optical flow vector, up to 4K × 4K
➢ Close to true motion
➢ Robust to intensity changes
➢ 10x faster than CPU; same quality
➢ New Optical Flow SDK
➢ Action recognition, object tracking, video inter/extrapolation, frame-rate upconversion
➢ Legacy ME-only mode support

More information: http://developer.nvidia.com/opticalflow-sdk

SLIDE 28

TIPS FOR NVENC OPTIMIZATION

SLIDE 29

OPTIMIZATION STRATEGIES

General Guidelines

➢ Minimize PCIe transfers
    ➢ Eliminate, if possible
    ➢ Use CUDA for video pre-/post-processing
➢ Multiple threads/processes to balance enc/dec utilization
    ➢ Monitor using nvidia-smi: nvidia-smi dmon -s uc -i <GPU_index>
    ➢ Analyze using GPUView on Windows
➢ Minimize disk I/O
➢ Optimize encoder settings for quality/perf balance
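For the monitoring tip, a minimal sketch of summarising encoder and decoder utilization from captured `nvidia-smi dmon -s uc` text is shown here. The sample text is made up for illustration, and the column layout is an assumption that can vary across driver versions, which is why the parser reads column names from the header line instead of hard-coding positions.

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon` text into a list of {column: value} dicts."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first comment line names the columns; later ones (units, repeats) are skipped.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        values = line.split()
        if columns and len(values) == len(columns):
            rows.append(dict(zip(columns, values)))
    return rows


def mean_utilization(rows, column):
    """Average a numeric column, ignoring placeholder values such as '-'."""
    values = [int(r[column]) for r in rows if r[column].isdigit()]
    return sum(values) / len(values) if values else None


# Illustrative sample, not real measurements.
SAMPLE = """\
# gpu    sm   mem   enc   dec  mclk  pclk
# Idx     %     %     %     %   MHz   MHz
    0    35    12    90    40  5000  1590
    0    33    11    88    45  5000  1590
"""

rows = parse_dmon(SAMPLE)
print("enc:", mean_utilization(rows, "enc"), "dec:", mean_utilization(rows, "dec"))
```

If one engine sits near 100% while the other is mostly idle, adding threads or processes that exercise the idle engine is the balancing step the slide refers to.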

SLIDE 30

FFMPEG VIDEO TRANSCODING

Tips

➢ Look at the FFmpeg users’ guide in the NVIDIA Video Codec SDK package
➢ Use the -hwaccel keyword to keep the entire transcode pipeline on the GPU
➢ Run multiple 1:N transcode sessions to achieve M:N transcode at high perf
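A 1:N transcode that stays on the GPU might be assembled as in the sketch below. The specific flags (`-hwaccel cuda`, `-hwaccel_output_format cuda`, the `scale_cuda` filter and `h264_nvenc`) are assumptions about a reasonably recent FFmpeg build with NVIDIA support and are not taken from the slide; older builds may need different hwaccel names. File names and bitrates are hypothetical.

```python
def one_to_n_transcode(input_path, outputs):
    """Build an ffmpeg argv list for one decode feeding several GPU encodes.

    outputs: list of (path, width, height, bitrate) tuples.
    """
    cmd = [
        "ffmpeg", "-y", "-vsync", "0",
        # Decode on the GPU and keep decoded frames in video memory (assumed flags).
        "-hwaccel", "cuda", "-hwaccel_output_format", "cuda",
        "-i", input_path,
    ]
    for path, width, height, bitrate in outputs:
        cmd += [
            "-vf", f"scale_cuda={width}:{height}",  # GPU-side scaling
            "-c:a", "copy",
            "-c:v", "h264_nvenc",
            "-b:v", bitrate,
            path,
        ]
    return cmd


cmd = one_to_n_transcode(
    "input.mp4",
    [("out_720p.mp4", 1280, 720, "5M"), ("out_360p.mp4", 640, 360, "1M")],
)
print(" ".join(cmd))
```

Running several such commands in parallel, one per input, gives the M:N arrangement mentioned in the last tip.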

SLIDE 31

LOW LATENCY STREAMING (1/3)

Optimization tips

➢ Low latency ≠ Low encoding time
➢ Latency determined by
    ➢ B-frames
    ➢ Look-ahead
    ➢ VBV buffer size & available bandwidth

SLIDE 32

LOW LATENCY STREAMING (2/3)

Optimization tips

➢ For 1-2 frame latency (e.g. cloud gaming), use
    ➢ RC_CBR_LOWDELAY_HQ & Low VBV buffer size
        ➢ Minimizes frame-to-frame variations
    ➢ Any preset (Default, HQ, HP preferred)
        ➢ LL presets have resolution-dependent behavior
    ➢ No look-ahead
    ➢ No B-frames
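One common way to interpret "low VBV buffer size" is to size the buffer for roughly one or two average frames at the target bitrate. The slide does not give a formula, so the sketch below is an assumption-based illustration, with hypothetical function names.

```python
def vbv_buffer_bits(bitrate_bps, fps, frames=1):
    """Bits needed to hold `frames` average-sized frames at the target bitrate."""
    return int(bitrate_bps * frames / fps)


def vbv_delay_ms(vbv_bits, bitrate_bps):
    """Time to drain a full VBV buffer over a channel running at bitrate_bps."""
    return 1000.0 * vbv_bits / bitrate_bps


bitrate, fps = 10_000_000, 60  # example values: 10 Mbps at 60 fps
for frames in (1, 2):
    bits = vbv_buffer_bits(bitrate, fps, frames)
    print(frames, "frame(s):", bits, "bits,", round(vbv_delay_ms(bits, bitrate), 2), "ms")
```

A smaller buffer bounds how far any single frame can exceed the average size, which is what keeps frame-to-frame variation and buffering delay low.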

SLIDE 33

LOW LATENCY STREAMING (3/3)

Optimization tips

➢ Similar to HQ (non latency critical) encoding
➢ For higher (8-10 frames) latency (e.g. OTT, broadcast), use
    ➢ Any RC mode
    ➢ Any preset (default, HQ, HP preferred)
    ➢ VBV buffer size as per channel bandwidth constraints
    ➢ Look-ahead depth < tolerable latency
    ➢ B-frames as needed

SLIDE 34

VIDEO DL TRAINING

Typical Workflow

[Diagram: Loader (NVDEC, decoded frames) → Augment (color space, resize, crop, reorder) → Training]

With DALI

Instantiate operator:

    self.input = ops.VideoReader(device="gpu", filenames=data, sequence_length=len)

Use it in the DALI graph:

    frames = self.input(name="Reader")
    output_frames = self.Crop(frames)
    return output_frames

➢ FFmpeg
➢ NVDECODE API
➢ CUDA post-processing

SLIDE 35

ROADMAP

SLIDE 36

ROADMAP

Video Codec SDK 9.1

➢ Q3 2019
➢ Error handling – Retrieve last error
➢ Perf/quality tuning
➢ Support for CUStream

SLIDE 37

RESOURCES

Video Codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk
FFmpeg GIT: https://git.ffmpeg.org/ffmpeg.git
FFmpeg builds with hardware acceleration: http://ffmpeg.zeranoe.com/builds/
Video SDK support: video-devtech-support@nvidia.com
Video SDK forums: https://devtalk.nvidia.com/default/board/175/video-technologies/
Connect with Experts (CE9103): Wednesday, March 20, 2019, 3:00 pm
