NVIDIA VIDEO TECHNOLOGIES, Abhijit Patait, 5/8/2017

slide-1
SLIDE 1

Abhijit Patait, 5/8/2017

NVIDIA VIDEO TECHNOLOGIES

slide-2
SLIDE 2 2

AGENDA

NVIDIA Video Technologies New SDK Release Major Focus Areas Video SDK Features Software Flow FFmpeg Performance and Benchmarking Tips Benchmarks

slide-3
SLIDE 3 3

NVIDIA VIDEO TECHNOLOGIES

slide-4
SLIDE 4 4

VIDEO CODEC SDK

A comprehensive set of APIs for GPU-accelerated Video Encode and Decode

NVIDIA Video Codec SDK technology is used to stream video with NVIDIA ShadowPlay running on NVIDIA GPUs

The SDK consists of two hardware acceleration interfaces:

  • NVENCODE API for video encode acceleration
  • NVDECODE API for video decode acceleration (formerly called NVCUVID API)

Independent of CUDA/3D cores on GPU

slide-5
SLIDE 5 5

NVENC

Independent Hardware Encoder Function

NVDEC

Independent Hardware Decoder Function

NVIDIA VIDEO TECHNOLOGIES

Easy access to NVIDIA GPU hardware acceleration

FFMPEG & LIBAV

SOFTWARE HARDWARE

A comprehensive set of APIs for GPU-accelerated Video Encode and Decode for Windows and Linux

CUDA, DirectX, OpenGL interoperability

VIDEO CODEC SDK NVIDIA DRIVER

slide-6
SLIDE 6 6

NVIDIA VIDEO TECHNOLOGIES

[Diagram: CPU, NVDEC, NVENC, CUDA Cores, Buffer]

Encode HW* (NVENC)

Formats:

  • H.264
  • H.265
  • Lossless

Bit depth:

  • 8 bit
  • 10 bit

Color**

  • YUV 4:4:4
  • YUV 4:2:0

Resolution

  • Up to 8K***

Decode HW* (NVDEC)

Formats:

  • MPEG-2
  • VC1
  • VP8
  • VP9
  • H.264
  • H.265
  • Lossless

Bit depth:

  • 8 bit
  • 10 bit

Color**

  • YUV 4:2:0

Resolution

  • Up to 8K***

* See support diagram for previous NVIDIA HW generations
** 4:2:2 is not natively supported on HW
*** Support is codec dependent
slide-7
SLIDE 7 7

VIDEO SDK EVOLUTION

Video SDK 8.0

  • 2014, SDK 4.0: Maxwell 1, H.264 4:4:4, lossless
  • 2015, SDK 5.0: Maxwell 2, HEVC, Perf++
  • 2015, SDK 6.0: ARGB, Quality+, Dec+Enc, ME-only
  • 2016, SDK 7.x: Pascal, 10-bit encode, FFmpeg, ME-only for VR, Quality++
  • 2017, SDK 8.0: 10-bit transcode, 10/12-bit decode, OpenGL, Dec. optimizations, WP, AQ, Enc. Quality

slide-8
SLIDE 8 8

MAJOR FOCUS AREAS

slide-9
SLIDE 9 9

VIDEO TRANSCODING

➢ Content variety
➢ Codecs, resolutions, quality, bitrate
➢ Live, VOD, ultra-low-latency, broadcast, archives
➢ Pre-encoded or encoded-on-demand
➢ Performance/Watt

Performance/Watt

slide-10
SLIDE 10 10

GAME/APP STREAMING

Stream
➢ Interactive, single frame latency
➢ Capture: NvFBC, Encode: NVENC, Decode: NVDEC
➢ 4K, HDR

Record, Broadcast
➢ Quality

Ultra-low-latency

slide-11
SLIDE 11 11

GPU VIRTUALIZATION

➢ Capture + encode
➢ Low-latency
➢ H.264, HEVC
➢ 4:2:0, 4:4:4, lossless
➢ Multiple displays

Quality & reliability

slide-12
SLIDE 12 12

MOTION-ESTIMATION ONLY MODE

➢ Video frame interpolation
➢ Camera stitching (mono to stereo)
➢ Camera stabilization
➢ Computer vision

Accuracy

[Diagram: frames #N, #(N+1), #(N+2), and interpolated frame #(N+1.5)]

Frame #(N+1.5) is interpolated based on motion vectors between frame #N and frame #(N+1)

slide-13
SLIDE 13 13

VIDEO SDK FEATURES

slide-14
SLIDE 14 14

ENCODE FEATURES (1/2)

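Feature by codec, with use-case:

  • Profiles. H.264: Base, Main, High. HEVC: Main, Main10. Use-case: baseline standards
  • Bit depth. H.264: 8-bit. HEVC: 8-bit, 10-bit. Use-case: 10-bit for HDR
  • B-frames. H.264: B-frames. HEVC: no B-frames. Use-case: higher compression & quality
  • Resolution. H.264: up to 4096 × 4096. HEVC: up to 8192 × 8192. Use-case: high-res
  • YUV 4:2:0, 4:4:4. Use-case: subsampled or full-res chroma (e.g. wireframes)
  • Lossless. Use-case: high-quality archiving
  • Error resiliency: intra refresh, LTR, ref-pic invalidation. Use-case: handle streaming bit errors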

slide-15
SLIDE 15 15

ENCODE FEATURES (2/2)

Feature (H.264 and HEVC), with use-case:

  • Rate control modes: 1-pass, 2-pass. Use-case: quality vs performance
  • Look-ahead. Use-case: efficient bit distribution across GOP; higher quality
  • Adaptive quantization, ∆QP. Use-case: finer quality control
  • Weighted prediction (SDK 8.0). Use-case: fade-in/fade-out, explosion
  • RGB inputs. Use-case: direct NVFBC interoperability
  • ME-only mode, MV-hints (SDK 8.0). Use-case: motion stabilization, optical flow for VR stereo stitching, frame interpolation
  • 1-3 NVENCs per chip. Use-case: high throughput
  • CUDA, DX, OGL (Linux) (SDK 8.0). Use-case: easy integration

slide-16
SLIDE 16 16

DECODE FEATURES

Feature, with use-case:

  • MPEG2, VC-1, MPEG-4, H.264, HEVC, VP8, VP9. Use-case: baseline standards
  • 8-bit (all codecs), 10/12-bit (HEVC, VP9) (SDK 8.0). Use-case: HDR decoding
  • Up to 8192 × 8192 for HEVC, 4096 × 4096 for H.264. Use-case: high-res
  • Error resiliency and concealment. Use-case: Internet streaming

slide-17
SLIDE 17 17

VIDEO SDK – CONTENTS (1/2)

➢ Header, documentation, sample applications
➢ Binaries (.dll, .so) in NVIDIA display driver
➢ Unified API for Windows & Linux
➢ NVIDIA developer zone
➢ Encode limitations
    ➢ Unconstrained: Tesla, GRID, Quadro ≥ X2000 (X = K, M, P)
    ➢ 2 sessions/system: GeForce, Quadro < X2000
➢ No decode limitations

slide-18
SLIDE 18 18

VIDEO SDK – CONTENTS (2/2)

Sample Applications

➢ Decode: DX9, DX11, CUDA, OpenGL
➢ Encode: Basic functionality, features (NvEncoder)
➢ Encode: Performance (NvEncoderPerf)
➢ Encode: CUDA interop, D3D interop, OGL interop
➢ Encode: Low-latency (NvEncoderLowLatency)
➢ Transcode (NvTranscoder)
➢ Coming soon: Reusable classes

slide-19
SLIDE 19 19

FFMPEG/LIBAV

➢ Major SW focus area for past 6 months
➢ Feature parity with Video SDK 7.1, SDK 8.0 post GTC
➢ End-to-end FFmpeg transcoding @ best possible quality & perf

slide-20
SLIDE 20 20

SOFTWARE FLOW

slide-21
SLIDE 21 21

ENCODE APP FLOW

[Flow diagram: Client application (Initialize, Configure, Encode) → NVENC API → NVENC Driver (Configure HW) → NVENC firmware + hardware (HW Encode) → Encoded bitstream. Input paths: CUDA, DirectX, OpenGL, with NVENC-CUDA interop and OpenGL-CUDA interop]

slide-22
SLIDE 22 22

ENCODE APP FLOW

API functions and structures are defined in nvEncodeAPI.h.

  • Open encode session: NvEncOpenEncodeSessionEx. Device type: CUDA, DirectX, OpenGL
  • Query capabilities: NvEncGetEncodeCaps, NvEncGetInputFormats, NvEncGetEncodePresetGUIDs. Codec, presets, features
  • Initialize encoder: NvEncInitializeEncoder. Structures: NV_ENC_INITIALIZE_PARAMS, NV_ENC_CONFIG_H264/HEVC, NV_ENC_RC_PARAMS (W/H, framerate, preset, RC, codec-specific params)
  • Allocate buffers: NvEncRegisterResource. Structure: NV_ENC_REGISTER_RESOURCE. Internal/external: DIRECTX, CUDADEVICEPTR, OPENGL_TEX
  • Encode picture: NvEncEncodePicture. Structure: NV_ENC_PIC_PARAMS (picture-level config parameters). Synchronous (Win/Linux), async (Win)
  • Retrieve bitstream: NvEncLockBitstream, NvEncUnlockBitstream
  • Clean-up: NvEncUnregisterResource. Buffers, session, device
slide-23
SLIDE 23 23

DECODE APP FLOW

[Flow diagram. Components: Source, Demux, Parser, Client application, NVDECODE API, NVDEC Driver, NVDEC. Bitstream and callbacks pass between the parser and the client application. Legend: data flow, decode API calls]

Video frames

  • YUV
  • RGB
  • DX
  • CUDA

slide-24
SLIDE 24 24

DECODE APP FLOW

API functions and structures are defined in dynlink_nvcuvid.h, dynlink_cuviddec.h.

  • Query capabilities: cuvidGetDecoderCaps(). Structure: CUVIDDECODECAPS. Codecs, resolutions supported
  • Create decoder: cuvidCreateDecoder(). Structure: CUVIDDECODECREATEINFO. W/H, scaling, bit-depth
  • Decode picture: cuvidDecodePicture(). Structure: CUVIDPICPARAMS. Picture parameters from bitstream parser
  • Post-processing: CUDA kernels. Scaling, CSC, etc.
  • Clean-up: cuvidDestroyDecoder()
slide-25
SLIDE 25 25

FFMPEG APP FLOW

➢ Chain of filters
➢ -hwaccel cuvid: Use end-to-end NVIDIA hardware acceleration
➢ h264_cuvid: Use NVCUVID/NVDECODE
➢ h264_nvenc: Use NVENCODE
➢ scale_npp: high-perf CUDA scaling

ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:a copy -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output.mp4

[Flow diagram: Input → Decode (h264_cuvid) → Post-processing: Scale (scale_npp=x:y) → Encode (h264_nvenc) → Output]
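To make the stage boundaries in the command line above easier to see, here is a minimal Python sketch that splits it at `-i`. It only parses the string and never runs ffmpeg; the helper name `split_stages` and the dictionary keys are illustrative, not part of FFmpeg.

```python
import shlex

# Command line from the slide, written with plain ASCII hyphens.
CMD = ("ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 "
       "-c:a copy -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output.mp4")


def split_stages(cmd):
    """Split an ffmpeg command with one input and one output into stages.

    Options before -i are global or input (decode-side) options; options
    after the input file apply to the output (filter and encode side).
    """
    args = shlex.split(cmd)
    i = args.index("-i")
    return {
        "global_and_input_options": args[1:i],
        "input": args[i + 1],
        "output_options": args[i + 2:-1],
        "output": args[-1],
    }


stages = split_stages(CMD)
for name, value in stages.items():
    print(f"{name}: {value}")
```

The decoder choice (`-hwaccel cuvid -c:v h264_cuvid`) lands on the input side, while the scaler and encoder (`scale_npp`, `h264_nvenc`) land on the output side, matching the decode, scale, encode order in the flow.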

slide-26
SLIDE 26 26

HARDWARE ACCELERATED TRANSCODE USING FFMPEG

slide-27
SLIDE 27 27

PERFORMANCE CONSIDERATIONS - FFMPEG

➢ Minimize memory (PCIe) transfers
➢ Saturate on-chip encoder/decoder
➢ Efficient M:N command line
➢ Minimize I/O
➢ Encode settings
➢ GPU Clocks

slide-28
SLIDE 28 28

SW TRANSCODE

ffmpeg -c:v h264 -i input.mp4 -c:a copy -c:v h264 -b:v 5M output.mp4

[Diagram: Bitstream → SW Decode → YUV → SW Encode → Bitstream, all in system memory]

32 fps*

* 1:2 transcode, fps per session, 4 GHz Intel i7-6700K
slide-29
SLIDE 29 29

SW TRANSCODE + SCALE

ffmpeg -c:v h264 -i input.mp4 -vf scale=1280:720 -c:a copy -c:v h264 -b:v 5M output.mp4

[Diagram: Bitstream → SW Decode → YUV → Preprocess (e.g. scaling) → YUV → SW Encode → Bitstream, all in system memory]

29 fps*

* 1:2 transcode, fps per session, 4 GHz Intel i7-6700K
slide-30
SLIDE 30 30

GPU UNOPTIMIZED TRANSCODE

ffmpeg -y -vsync 0 -c:v h264_cuvid -i input.mp4 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4

[Diagram: NVDEC Decode and NVENC Encode work in GPU memory; bitstreams and YUV frames pass through system memory, with a PCIe transfer in each direction]

288 fps*

* 1:2 transcode, fps per session, GP104 GPU

slide-31
SLIDE 31 31

GPU UNOPTIMIZED TRANSCODE + CPU SCALE

ffmpeg -y -vsync 0 -c:v h264_cuvid -i input.mp4 -c:a copy -vf scale=1280:720 -c:v h264_nvenc -b:v 5M output.mp4

[Diagram: NVDEC Decode and NVENC Encode work in GPU memory; YUV frames are transferred over PCIe to system memory for Preprocess (e.g. scaling), then transferred back over PCIe for encoding]

76 fps*

* 1:2 transcode, fps per session, GP104 GPU

slide-32
SLIDE 32 32

HIGH-PERF GPU OPTIMIZED TRANSCODE

ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:a copy -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output.mp4

[Diagram: bitstreams are in system memory; NVDEC Decode, Preprocess (scaling in CUDA), and NVENC Encode all operate on YUV frames in GPU memory]

472 fps*

* 1:2 transcode, fps per session, GP104 GPU
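The per-session figures quoted on the five transcode slides can be compared directly. The following is a small arithmetic sketch in Python using only those numbers; the labels are shorthand for the slide titles, and note that the CPU figures come from an i7-6700K while the GPU figures come from a GP104, so the ratios compare whole configurations rather than isolated components.

```python
# Per-session fps figures quoted on the preceding slides.
FPS = {
    "SW transcode": 32,
    "SW transcode + scale": 29,
    "GPU unoptimized transcode": 288,
    "GPU unoptimized transcode + CPU scale": 76,
    "GPU optimized transcode + scale_npp": 472,
}


def speedup(fast, slow):
    """Return how many times faster configuration `fast` is than `slow`."""
    return FPS[fast] / FPS[slow]


comparisons = [
    ("GPU unoptimized transcode", "SW transcode"),
    ("GPU optimized transcode + scale_npp", "SW transcode + scale"),
    ("GPU optimized transcode + scale_npp",
     "GPU unoptimized transcode + CPU scale"),
]
for fast, slow in comparisons:
    print(f"{fast} vs {slow}: {speedup(fast, slow):.1f}x")
```

The last ratio isolates the cost of scaling on the CPU: keeping frames in GPU memory and scaling with `scale_npp` is roughly six times faster than round-tripping them over PCIe for a CPU scale.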

slide-33
SLIDE 33 33

PERFORMANCE CONSIDERATIONS

Saturating encoder/decoder

➢ Pipelining
➢ Input/output buffers
➢ Tools: nvidia-smi, Microsoft GPUView

slide-34
SLIDE 34 34

ANALYZING PERFORMANCE BOTTLENECKS

Microsoft GPUView (Windows only)

 

slide-35
SLIDE 35 35

ANALYZING PERFORMANCE BOTTLENECKS

nvidia-smi (Windows & Linux)
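One way to check whether the encoder or decoder is saturated is to sample utilization with `nvidia-smi dmon -s u` and average the `enc` and `dec` columns. The Python sketch below parses text in that style. The column layout in `SAMPLE` is an assumption (it varies across driver versions) and the numbers in it are made up for illustration, not measured output.

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon`-style text into a list of row dicts.

    The first '#' line is taken as the column names; later '#' lines
    (units) are skipped. Non-numeric cells such as '-' become None.
    """
    columns = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue
        rows.append({name: (int(v) if v.isdigit() else None)
                     for name, v in zip(columns, values)})
    return rows


def average(rows, column):
    """Average a column over all rows that have a numeric value for it."""
    samples = [r[column] for r in rows if r.get(column) is not None]
    return sum(samples) / len(samples) if samples else None


# Hypothetical sample in the assumed layout; values are illustrative only.
SAMPLE = """\
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0    12     8    97    35
    0    11     8    99    36
"""

rows = parse_dmon(SAMPLE)
print("avg enc %:", average(rows, "enc"))
print("avg dec %:", average(rows, "dec"))
```

In this made-up sample the encoder is close to saturation while the decoder has headroom, which is the encoder-limited pattern the deck associates with 1:N transcodes.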

 

slide-36
SLIDE 36 36

PARALLEL TRANSCODES (1:N)

ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 \
  -vf scale_npp=1920:1080 -c:a copy -c:v h264_nvenc -b:v 8M output_1080p.mp4 \
  -vf scale_npp=1280:720 -c:a copy -c:v h264_nvenc -b:v 5M output_720p.mp4 \
  -vf scale_npp=640:480 -c:a copy -c:v h264_nvenc -b:v 3M output_480p.mp4 \
  -vf scale_npp=320:240 -c:a copy -c:v h264_nvenc -b:v 2M output_240p.mp4 \
  -vf scale_npp=160:128 -c:a copy -c:v h264_nvenc -b:v 1M output_128p.mp4

Single command line
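Because the single 1:N command line is long and repetitive, it can be generated from a table of output renditions. The Python sketch below assembles the argument list for the five outputs on this slide and prints it without running ffmpeg; the names `LADDER` and `build_one_to_n` are illustrative.

```python
import shlex

# Output renditions from the slide: (width, height, bitrate, output file).
LADDER = [
    (1920, 1080, "8M", "output_1080p.mp4"),
    (1280, 720, "5M", "output_720p.mp4"),
    (640, 480, "3M", "output_480p.mp4"),
    (320, 240, "2M", "output_240p.mp4"),
    (160, 128, "1M", "output_128p.mp4"),
]


def build_one_to_n(input_file, ladder):
    """Build one ffmpeg argument list that decodes once and encodes N times."""
    args = ["ffmpeg", "-y", "-vsync", "0", "-hwaccel", "cuvid",
            "-c:v", "h264_cuvid", "-i", input_file]
    for width, height, bitrate, output in ladder:
        # Each output gets its own scaler, encoder, and bitrate options.
        args += ["-vf", f"scale_npp={width}:{height}", "-c:a", "copy",
                 "-c:v", "h264_nvenc", "-b:v", bitrate, output]
    return args


args = build_one_to_n("input.mp4", LADDER)
print(" ".join(shlex.quote(a) for a in args))
```

The input is opened and decoded once (a single `-i`), and each rendition adds only its own output options, which is what amortizes the init time and the decode across all outputs.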

slide-37
SLIDE 37 37

PARALLEL TRANSCODES (1:N)

Multiple command lines

ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -vf scale_npp=1920:1080 -c:a copy -c:v h264_nvenc -b:v 5M output1.mp4

ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -vf scale_npp=1280:720 -c:a copy -c:v h264_nvenc -b:v 5M output2.mp4

ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -vf scale_npp=640:480 -c:a copy -c:v h264_nvenc -b:v 5M output3.mp4

ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -vf scale_npp=320:240 -c:a copy -c:v h264_nvenc -b:v 5M output4.mp4

ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -vf scale_npp=160:128 -c:a copy -c:v h264_nvenc -b:v 5M output5.mp4

slide-38
SLIDE 38 38

PARALLEL TRANSCODES (1:N)

Single command line

PROS

  • Low init time per transcode (amortized)
  • Minimize memory transfers
  • Leverage high encoder perf
  • Low memory overhead

CONS

  • Complex command line
  • 1:N use-case only
  • Unsuitable for 1:1 VOD
  • Typically encoder-limited

slide-39
SLIDE 39 39

PARALLEL TRANSCODES (1:N)

Multiple command lines

PROS

  • Simple command line
  • Easy scripting
  • Use-case: 1:1 VOD

CONS

  • High init time per transcode
  • High memory overhead
  • Process-level scheduling optimizations
  • Typically decoder-limited
  • Multiple disk I/O for input

slide-40
SLIDE 40 40

PARALLEL TRANSCODES (M:N)

Hybrid approach

➢ Most flexible approach
➢ Balance memory utilization/complexity/perf
➢ Maximum utilization of encode/decode capacity
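The slide does not spell out how the hybrid approach is structured. One possible reading, sketched below in Python under that assumption, is to run several ffmpeg processes, each of which is a single-command-line transcode with a capped number of outputs. The function names, the cap parameter, and the output naming scheme are all hypothetical; the sketch only builds argument lists and does not launch ffmpeg.

```python
# Renditions reused from the 1:N slide: (width, height, bitrate).
RENDITIONS = [
    (1920, 1080, "8M"),
    (1280, 720, "5M"),
    (640, 480, "3M"),
    (320, 240, "2M"),
    (160, 128, "1M"),
]


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_m_to_n(inputs, renditions, max_outputs_per_process):
    """Return one ffmpeg argument list per process for an M:N transcode."""
    commands = []
    for input_file in inputs:
        stem = input_file.rsplit(".", 1)[0]
        for group in chunk(renditions, max_outputs_per_process):
            args = ["ffmpeg", "-y", "-vsync", "0", "-hwaccel", "cuvid",
                    "-c:v", "h264_cuvid", "-i", input_file]
            for width, height, bitrate in group:
                args += ["-vf", f"scale_npp={width}:{height}",
                         "-c:a", "copy", "-c:v", "h264_nvenc",
                         "-b:v", bitrate, f"{stem}_{height}p.mp4"]
            commands.append(args)
    return commands


commands = plan_m_to_n(["a.mp4", "b.mp4"], RENDITIONS,
                       max_outputs_per_process=3)
for args in commands:
    print(" ".join(args))
```

Raising the cap moves the plan toward the single-command-line end (fewer decodes, more encoder pressure per process); lowering it moves toward one process per output.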

slide-41
SLIDE 41 41

ENCODE SETTINGS

Settings are compared across three targets: Highest Quality, Minimum Delay, Highest Performance.

Use-case
  • Highest Quality: Transcoding, archiving, broadcast streaming (w/ latency), surveillance
  • Minimum Delay: Game & app streaming, surveillance
  • Highest Performance: All, w/ high performance requirement

NVENC API preset to use
  • Highest Quality: High quality (HQ) presets
  • Minimum Delay: Low-latency (low delay) presets
  • Highest Performance: High performance (HP) presets

Latency (set by the application via VBV buffer size)
  • Highest Quality: Depends on what application sets; typically > 8-10 frames
  • Minimum Delay: Depends on what application sets; typically 1 frame
  • Highest Performance: Depends on what application sets

PSNR delta (0 = High quality)*
  • Highest Quality: 0 dB (reference)
  • Minimum Delay: Approx. -0.5 dB
  • Highest Performance: Approx. -0.5-2 dB

Advanced features typically used
  • Highest Quality: Look-ahead, B-frames (H.264 only), adaptive B-frames (H.264 only), AQ (Adaptive Quantization)
  • Minimum Delay: Strict frame-size compliance, low VBV (Video Buffering Verifier), AQ (Adaptive Quantization)

Motion search and mode
  • Highest Quality: High, 2-pass
  • Minimum Delay: Medium, 1-pass/2-pass
  • Highest Performance: Low, 1-pass

Modes
  • Highest Quality: All high quality modes
  • Minimum Delay: Most modes
  • Highest Performance: Most modes

Entropy coding
  • Highest Quality: CABAC (H.264)
  • Minimum Delay: CABAC (H.264)
  • Highest Performance: CAVLC (H.264)
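The latency row ties delay to the VBV buffer size chosen by the application. As a rough worked example, assuming the usual constant-bitrate relation (buffer size = bitrate × delay), the Python sketch below converts a delay expressed in frames into a buffer size in bits. The function name is illustrative, and the 5 Mbit/s and 30 fps inputs are borrowed from the deck's command lines and benchmark titles rather than from this slide.

```python
def vbv_buffer_bits(bitrate_bps, fps, delay_frames):
    """VBV buffer size in bits for a target delay given in frames.

    Assumes a constant bitrate, so the buffer holds
    bitrate * (delay_frames / fps) bits.
    """
    return round(bitrate_bps * delay_frames / fps)


# Minimum-delay style setting: about one frame of buffering.
one_frame = vbv_buffer_bits(5_000_000, 30, 1)
# Highest-quality style setting: about ten frames of buffering.
ten_frames = vbv_buffer_bits(5_000_000, 30, 10)

print("1 frame  :", one_frame, "bits")
print("10 frames:", ten_frames, "bits")
```

With ffmpeg, a value like this would typically be passed through the generic `-bufsize` option alongside `-b:v`; how closely a given encoder honors it depends on its rate-control mode.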
slide-42
SLIDE 42 42

BENCHMARKS

slide-43
SLIDE 43 43

ENCODE PERFORMANCE

Performance represents an approximation of max performance and may vary based on GPU clock speed, OS, software versions, and motherboard configuration

H.264 1080p (1920x1080) 4:2:0 8bit 30fps (SINGLE NVENC)

[Bar chart: Number of Streams / NVENC, Highest Quality vs Highest Performance, for Pascal, Maxwell 2nd Gen, Maxwell 1st Gen, and Kepler. Bar labels in slide order: 21, 13, 12, 7, 9, 6, 4, 2]

GPUs (grouped on the slide by #NVENC):

  • Kepler Quadro K2000/K2000D/K4000/K4200/K5000/K5200/K6000
  • Kepler Tesla K20X/K40
  • Maxwell Quadro K2200 (1st Gen)/M2000 (2nd Gen)
  • Maxwell (2nd Gen) Tesla M4
  • Pascal Quadro P2000/P4000
  • Kepler Tesla K10/K80
  • Kepler GRID K2/K520
  • Maxwell (2nd Gen) Quadro M4000/M5000/M6000
  • Maxwell (2nd Gen) Tesla M6/M40
  • Pascal Quadro P5000/P6000
  • Pascal Tesla P4/P40
  • Pascal Quadro GP100
  • Pascal Tesla P100
  • Kepler GRID K1/K340
  • Maxwell (2nd Gen) Tesla M60

Note: All GPUs not featured above are limited to 2 simultaneous sessions
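The charts report streams per single NVENC, while the GPU list is grouped by the number of NVENC engines. A rough way to estimate whole-GPU capacity, assuming throughput scales linearly with engine count (which the slides do not state), is sketched below in Python. The function name and the example inputs are hypothetical and are not taken from the charts.

```python
def gpu_capacity(streams_per_nvenc, nvenc_count, fps_per_stream=30):
    """Estimate total streams and aggregate fps for a GPU.

    Assumes linear scaling across NVENC engines; real throughput may be
    lower if another stage (decode, PCIe, disk I/O) becomes the bottleneck.
    """
    total_streams = streams_per_nvenc * nvenc_count
    return total_streams, total_streams * fps_per_stream


# Hypothetical example: 10 streams per engine on a GPU with 2 engines.
streams, aggregate_fps = gpu_capacity(streams_per_nvenc=10, nvenc_count=2)
print(streams, "streams,", aggregate_fps, "fps aggregate")
```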

slide-44
SLIDE 44 44

ENCODE PERFORMANCE

Performance represents an approximation of max performance and may vary based on GPU clock speed, OS, software versions, and motherboard configuration

HEVC 4K (3840x2160) 4:2:0 8bit 30fps (SINGLE NVENC)

[Bar chart: Number of Streams / NVENC, Highest Quality vs Highest Performance, for Pascal and Maxwell (2nd Gen). Bar labels in slide order: 13, 7, 5, 3]

GPUs (grouped on the slide by #NVENC):

  • Maxwell Quadro M2000
  • Maxwell Tesla M4
  • Pascal Quadro P2000/P4000
  • Maxwell Quadro M4000/M5000/M6000
  • Maxwell Tesla M6/M40
  • Pascal Quadro P5000/P6000
  • Pascal Tesla P4/P40
  • Pascal Quadro GP100
  • Pascal Tesla P100
  • Maxwell Tesla M60

Note: All GPUs not featured above are limited to 2 simultaneous sessions

slide-45
SLIDE 45 45

ENCODE PERFORMANCE

Performance represents an approximation of max performance and may vary based on GPU clock speed, OS, software versions, and motherboard configuration

H.264 1080p (1920x1080) 4:4:4 8bit 30fps (SINGLE NVENC)

[Bar chart: Number of Streams / NVENC, Highest Quality vs Highest Performance, for Pascal, Maxwell 2nd Gen, and Maxwell 1st Gen. Bar labels in slide order: 13, 9, 7, 7, 5, 4]

GPUs (grouped on the slide by #NVENC):

  • Maxwell Quadro K2200 (1st Gen)/M2000 (2nd Gen)
  • Maxwell (2nd Gen) Tesla M4
  • Pascal Quadro P2000/P4000
  • Maxwell (2nd Gen) Quadro M4000/M5000/M6000
  • Maxwell (2nd Gen) Tesla M6/M40
  • Pascal Quadro P5000/P6000
  • Pascal Tesla P4/P40
  • Pascal Quadro GP100
  • Pascal Tesla P100
  • Maxwell (2nd Gen) Tesla M60

Note: All GPUs not featured above are limited to 2 simultaneous sessions

slide-46
SLIDE 46 46

ENCODE PERFORMANCE

Performance represents an approximation of max performance and may vary based on GPU clock speed, OS, software versions, and motherboard configuration

Pascal HEVC 10bit 30fps (SINGLE NVENC)

[Bar chart: Number of Streams / NVENC, Highest Quality vs Highest Performance, for 1080p 4:2:0, 1080p 4:4:4, 4K 4:2:0, and 4K 4:4:4. Bar labels in slide order: 12, 9, 3, 2, 5, 4, 1, 1]

GPUs (grouped on the slide by #NVENC):

  • Pascal Quadro P2000/P4000
  • Pascal Quadro P5000/P6000
  • Pascal Tesla P4/P40
  • Pascal Quadro GP100
  • Pascal Tesla P100

Note: All GPUs not featured above are limited to 2 simultaneous sessions

slide-47
SLIDE 47 47

DECODE PERFORMANCE

NVDEC H.264 YUV 4:2:0

[Bar chart: Number of 30fps Streams / NVDEC for Tesla P40 and Tesla M60 at 4096 x 4096, 3840 x 2160, and 2560 x 1440. Numbers in slide order (bar labels and axis ticks): 11, 8, 5, 4, 2, 4, 6, 8, 10, 12]

Performance represents an approximation of max performance and may vary based on GPU clock speed, OS, software versions, and motherboard configuration
slide-48
SLIDE 48 48

ENCODE PERF/QUALITY

Encode quality, latest results (slow/medium: within ±0.4 dB of x264)

[Plot: BD-PSNR vs FPS RATIO. Series: NVENC vs libx264 HQ/medium and NVENC vs libx264 HQ/slow, each at 720p, 1080p, and 2160p. BD-PSNR axis ticks: -2.0, -1.0, 0.0. FPS ratio axis ticks: 4x, 6x, 8x, 10x, 12x]
slide-49
SLIDE 49 49

MOTION VECTOR QUALITY

➢ KITTI Vision Benchmark Suite for Optical Flow
➢ Measures distortion of motion vectors compared to “true” motion
➢ Average distortion ≈ 7%, improves by 1-2% with motion hints

slide-50
SLIDE 50 50

ME-ONLY MODE

Frame 0

Source: http://www.cvlibs.net/datasets/kitti/, under Creative Commons License
slide-51
SLIDE 51 51

ME-ONLY MODE

Frame 1

Source: http://www.cvlibs.net/datasets/kitti/, under Creative Commons License
slide-52
SLIDE 52 52

ME-ONLY MODE

Motion Vector Distortion

[Images: “True” motion vs NVENC estimated motion]

Distortion score = 2%

slide-53
SLIDE 53 53

RESOURCES

Video Codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk
FFmpeg GIT: https://git.ffmpeg.org/ffmpeg.git
Libav GIT: https://git.Libav.org/libav.git
FFmpeg builds with hardware acceleration: http://ffmpeg.zeranoe.com/builds/
Video SDK support: video-devtech-support@nvidia.com
Video SDK forums: https://devtalk.nvidia.com/default/board/175/video-technologies/

slide-54
SLIDE 54