 
              NVIDIA VIDEO TECHNOLOGIES Abhijit Patait, 5/8/2017
AGENDA
➢ NVIDIA Video Technologies
➢ New SDK Release
➢ Major Focus Areas
➢ Video SDK Features
➢ Software Flow
➢ FFmpeg Performance and Benchmarking Tips
➢ Benchmarks
NVIDIA VIDEO TECHNOLOGIES
VIDEO CODEC SDK
A comprehensive set of APIs for GPU-accelerated video encode and decode.
The SDK consists of two hardware acceleration interfaces:
➢ NVENCODE API for video encode acceleration
➢ NVDECODE API for video decode acceleration (formerly called NVCUVID API)
Independent of CUDA/3D cores on GPU.
NVIDIA Video Codec SDK technology is used to stream video with NVIDIA ShadowPlay running on NVIDIA GPUs.
NVIDIA VIDEO TECHNOLOGIES
SOFTWARE
➢ FFMPEG & LIBAV: Easy access to NVIDIA GPU hardware acceleration
➢ VIDEO CODEC SDK: A comprehensive set of APIs for GPU-accelerated video encode and decode for Windows and Linux; CUDA, DirectX, OpenGL interoperability
➢ NVIDIA DRIVER
HARDWARE
➢ NVENC: Independent hardware encoder function
➢ NVDEC: Independent hardware decoder function
NVIDIA VIDEO TECHNOLOGIES
Decode HW* (NVDEC)
• Formats: MPEG-2, VC1, VP8, VP9, H.264, H.265, Lossless
• Bit depth: 8 bit, 10 bit
• Color**: YUV 4:2:0
• Resolution: Up to 8K***
Encode HW* (NVENC)
• Formats: H.264, H.265, Lossless
• Bit depth: 8 bit, 10 bit
• Color**: YUV 4:4:4, YUV 4:2:0
• Resolution: Up to 8K***
Other blocks in the diagram: CPU, Buffer, CUDA Cores
* See support diagram for previous NVIDIA HW generations
** 4:2:2 is not natively supported on HW
*** Support is codec dependent
VIDEO SDK EVOLUTION
➢ 2014, SDK 4.0: Maxwell 1, H.264, 4:4:4, lossless
➢ 2015, SDK 5.0: Maxwell 2, HEVC
➢ 2015, SDK 6.0: Dec+Enc, ARGB, Quality+, ME-only
➢ 2016, SDK 7.x: Pascal, 10-bit encode, FFmpeg, ME-only for VR, Quality++, Perf++
➢ 2017, SDK 8.0: 10-bit transcode, 10/12-bit decode, OpenGL, Dec. optimizations, WP, AQ, Enc. quality
MAJOR FOCUS AREAS
VIDEO TRANSCODING
Performance/Watt
➢ Content variety
➢ Codecs, resolutions, quality, bitrate
➢ Live, VOD, ultra-low-latency, broadcast, archives
➢ Pre-encoded or encoded-on-demand
➢ Performance/Watt
GAME/APP STREAMING
Ultra-low-latency
Stream
➢ Interactive, single frame latency
➢ Capture: NvFBC, Encode: NvENC, Decode: NvDEC
➢ 4K, HDR
Record, Broadcast
➢ Quality
GPU VIRTUALIZATION
Quality & reliability
➢ Capture + encode
➢ Low-latency
➢ H.264, HEVC
➢ 4:2:0, 4:4:4, lossless
➢ Multiple-displays
MOTION-ESTIMATION ONLY MODE
Accuracy
➢ Video frame interpolation
➢ Camera stitching (mono to stereo)
➢ Camera stabilization
➢ Computer vision
Diagram: frames #N, #(N+1), #(N+1.5), #(N+2). Frame #(N+1.5) is interpolated based on motion vectors between frame #N and frame #(N+1).
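The interpolation idea on this slide can be sketched in a few lines of Python. This is only an illustration of how motion vectors can place a block at a fractional frame time, not the SDK's ME-only API. It assumes each block has one motion vector giving its displacement from frame #N to frame #(N+1) and that the motion is linear in time; the name `project_block` and the tuple layout are hypothetical.

```python
from typing import Tuple

Vec = Tuple[float, float]


def project_block(pos: Vec, mv: Vec, t: float) -> Vec:
    """Return the block position at time N + t.

    pos: block position (x, y) in frame N.
    mv:  displacement (dx, dy) of the block from frame N to frame N+1.
    t:   time offset in frames, measured from frame N.
    Assumes the block keeps moving linearly along its motion vector.
    """
    return (pos[0] + t * mv[0], pos[1] + t * mv[1])


block_in_n = (64.0, 32.0)   # hypothetical block position in frame N
mv_n_to_n1 = (8.0, -4.0)    # hypothetical motion vector N -> N+1

print(project_block(block_in_n, mv_n_to_n1, 1.0))  # (72.0, 28.0), frame N+1
print(project_block(block_in_n, mv_n_to_n1, 1.5))  # (76.0, 26.0), frame N+1.5
```

A real interpolator would also have to fetch and blend pixels and deal with occlusions; the sketch covers only the vector scaling step.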
VIDEO SDK FEATURES
ENCODE FEATURES (1/2)
H.264 | HEVC | Use-case
Base, Main, High | Main, Main10 | Baseline standards
8-bit | 8-bit, 10-bit | 10-bit for HDR
B-frames | No B-frames | Higher compression & quality
Up to 4096 × 4096 | Up to 8192 × 8192 | High-res
YUV 4:2:0, 4:4:4 | YUV 4:2:0, 4:4:4 | Subsampled or full-res chroma (e.g. wireframes)
Lossless | Lossless | High-quality archiving
Error resiliency: intra refresh, LTR, ref-pic invalidation | Same as H.264 | Handle streaming bit errors
ENCODE FEATURES (2/2)
Feature (H.264 and HEVC) | Use-case
Rate control modes: 1-pass, 2-pass | Quality vs performance
Look-ahead | Efficient bit distribution across GOP; higher quality
Adaptive quantization, ∆QP | Finer quality control
Weighted prediction (SDK 8.0) | Fade-in/fade-out, explosion
RGB inputs | Direct NVFBC interoperability
ME-only mode, MV-hints (SDK 8.0) | Motion stabilization, optical flow for VR stereo stitching, frame interpolation
1-3 NVENCs per chip | High throughput
CUDA, DX, OGL (Linux) (SDK 8.0) | Easy integration
DECODE FEATURES
Feature | Use-case
MPEG2, VC-1, MPEG-4, H.264, HEVC, VP8, VP9 | Baseline standards
8-bit (all codecs), 10/12 bit (HEVC, VP9) (SDK 8.0) | HDR decoding
Up to 8192 × 8192 for HEVC, 4096 × 4096 for H.264 | High-res
Error resiliency and concealment | Internet streaming
VIDEO SDK – CONTENTS (1/2)
➢ Header, documentation, sample applications
➢ Binaries (.dll, .so) in NVIDIA display driver
➢ Unified API for Windows & Linux
➢ NVIDIA developer zone
➢ Encode limitations
  ➢ Unconstrained: Tesla, GRID, Quadro ≥ X2000 (X = K, M, P)
  ➢ 2 sessions/system: GeForce, Quadro < X2000
➢ No decode limitations
VIDEO SDK – CONTENTS (2/2)
Sample Applications
➢ Decode: DX9, DX11, CUDA, OpenGL
➢ Encode: Basic functionality, features (NvEncoder)
➢ Encode: Performance (NvEncoderPerf)
➢ Encode: CUDA interop, D3D interop, OGL interop
➢ Encode: Low-latency (NvEncoderLowLatency)
➢ Transcode (NvTranscoder)
➢ Coming soon: Reusable classes
FFMPEG/LIBAV
➢ Major SW focus area for past 6 months
➢ Feature parity with Video SDK 7.1, SDK 8.0 post GTC
➢ End-to-end FFmpeg transcoding @ best possible quality & perf
SOFTWARE FLOW
ENCODE APP FLOW
➢ Client application: Initialize, Configure, Encode through the NVENC API; receives the encoded bitstream
➢ Driver (NVENC, OpenGL, DirectX, CUDA): Configure HW; OpenGL-CUDA interop, NVENC-CUDA interop
➢ NVENC firmware + hardware: HW Encode
ENCODE APP FLOW
API functions and structures are defined in nvEncodeAPI.h.
➢ Open encode session: NvEncOpenEncodeSessionEx. Device type: CUDA, DirectX, OpenGL
➢ Query capabilities: NvEncGetEncodeCaps, NvEncGetInputFormats, NvEncGetEncodePresetGUIDs. Codec, presets, features
➢ Initialize encoder: NvEncInitializeEncoder with NV_ENC_INITIALIZE_PARAMS, NV_ENC_CONFIG_H264/HEVC, NV_ENC_RC_PARAMS. W/H, framerate, preset, RC, codec-specific params
➢ Allocate buffers: NvEncRegisterResource with NV_ENC_REGISTER_RESOURCE. Internal/external; DIRECTX, CUDADEVICEPTR, OPENGL_TEX
➢ Encode picture: NvEncEncodePicture with NV_ENC_PIC_PARAMS. Picture-level config parameters
➢ Retrieve bitstream: NvEncLockBitstream, NvEncUnlockBitstream. Synchronous (Win/Linux), Async (Win)
➢ Clean-up: NvEncUnregisterResource. Buffers, session, device
DECODE APP FLOW
NV DECODE API
Diagram blocks: Client application (Video Source, Demux, Parser), Driver, NVDEC, DX, CUDA. Input: bitstream. Output: YUV or RGB frames. Legend: callbacks, data flow, decode API calls.
DECODE APP FLOW
API functions and structures are defined in dynlink_nvcuvid.h, dynlink_cuviddec.h.
➢ Query capabilities: cuvidGetDecoderCaps() with CUVIDDECODECAPS. Codecs, resolutions supported
➢ Create decoder: cuvidCreateDecoder() with CUVIDDECODECREATEINFO. W/H, scaling, bit-depth
➢ Decode picture: cuvidDecodePicture() with CUVIDPICPARAMS. Picture parameters from bitstream parser
➢ Post-processing: CUDA kernels. Scaling, CSC, etc.
➢ Clean-up: cuvidDestroyDecoder()
FFMPEG APP FLOW
ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:a copy -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output.mp4
➢ Chain of filters: Input → Decode (h264_cuvid) → Post-processing: Scale (scale_npp=x:y) → Encode (h264_nvenc) → Output
➢ -hwaccel cuvid : Use end-to-end NVIDIA hardware acceleration
➢ h264_cuvid : Use NVCUVID/NVDECODE
➢ h264_nvenc : Use NVENCODE
➢ scale_npp : high-perf CUDA scaling
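When the command on this slide has to be generated from a script, it can help to assemble it as an argument list so each flag stays next to its value. The sketch below is one possible way to do that in Python; the function name `build_transcode_cmd` and its parameters are hypothetical, and the flags are exactly the ones shown on the slide. It only builds and prints the command; actually running it requires an FFmpeg build with the cuvid, nvenc and scale_npp components enabled.

```python
import shlex
from typing import List


def build_transcode_cmd(src: str, dst: str, width: int, height: int,
                        bitrate: str = "5M") -> List[str]:
    """Build the GPU transcode command from the slide as an argv list."""
    return [
        "ffmpeg", "-y", "-vsync", "0",
        "-hwaccel", "cuvid",                      # end-to-end hardware acceleration
        "-c:v", "h264_cuvid",                     # decode with NVCUVID/NVDECODE
        "-i", src,
        "-c:a", "copy",                           # pass audio through untouched
        "-vf", f"scale_npp={width}:{height}",     # CUDA scaling
        "-c:v", "h264_nvenc",                     # encode with NVENCODE
        "-b:v", bitrate,
        dst,
    ]


cmd = build_transcode_cmd("input.mp4", "output.mp4", 1280, 720)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.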
HARDWARE ACCELERATED TRANSCODE USING FFMPEG
PERFORMANCE CONSIDERATIONS - FFMPEG
➢ Minimize memory (PCIe) transfers
➢ Saturate on-chip encoder/decoder
➢ Efficient M:N command line
➢ Minimize I/O
➢ Encode settings
➢ GPU Clocks
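One reading of "Efficient M:N command line" is to decode an input once and feed several encodes from the same FFmpeg process instead of launching one process per output. The Python sketch below builds such a 1:N command under that assumption. The helper `build_one_to_n_cmd`, the `Output` tuple, and the output names, resolutions and bitrates are all hypothetical; the flags themselves are the ones used elsewhere in this deck.

```python
from typing import List, NamedTuple


class Output(NamedTuple):
    path: str
    width: int
    height: int
    bitrate: str


def build_one_to_n_cmd(src: str, outputs: List[Output]) -> List[str]:
    """Build one ffmpeg invocation that decodes src once and encodes N outputs."""
    cmd = ["ffmpeg", "-y", "-vsync", "0",
           "-hwaccel", "cuvid", "-c:v", "h264_cuvid", "-i", src]
    for out in outputs:
        # Per-output options must precede the output file they apply to.
        cmd += ["-c:a", "copy",
                "-vf", f"scale_npp={out.width}:{out.height}",
                "-c:v", "h264_nvenc", "-b:v", out.bitrate,
                out.path]
    return cmd


cmd = build_one_to_n_cmd("input.mp4", [
    Output("out_720p.mp4", 1280, 720, "5M"),   # hypothetical output ladder
    Output("out_360p.mp4", 640, 360, "1M"),
])
print(" ".join(cmd))
```

With this layout the input file is read and decoded a single time, which also serves the "Minimize I/O" point in the list above.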
SW TRANSCODE
ffmpeg -c:v h264 -i input.mp4 -c:a copy -c:v h264 -b:v 5M output.mp4
➢ Diagram: bitstream, SW decode, YUV frames, SW encode and output bitstream, all in system memory
➢ Result: 32 fps*
* 1:2 transcode, fps per session, 4 GHz Intel i7-6700K
SW TRANSCODE + SCALE
ffmpeg -c:v h264 -i input.mp4 -vf scale=1280:720 -c:a copy -c:v h264 -b:v 5M output.mp4
➢ Diagram: bitstream, SW decode, preprocessing (e.g. scaling), SW encode and output bitstream, with YUV frames passed between stages, all in system memory
➢ Result: 29 fps*
* 1:2 transcode, fps per session, 4 GHz Intel i7-6700K
GPU UNOPTIMIZED TRANSCODE
ffmpeg -y -vsync 0 -c:v h264_cuvid -i input.mp4 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4
➢ Diagram: bitstreams in system memory; NVDEC decode and NVENC encode with YUV frames in GPU memory; two PCIe transfers between system memory and the GPU
➢ Result: 288 fps*
* 1:2 transcode, fps per session, GP104 GPU
GPU UNOPTIMIZED TRANSCODE + CPU SCALE
ffmpeg -y -vsync 0 -c:v h264_cuvid -i input.mp4 -c:a copy -vf scale=1280:720 -c:v h264_nvenc -b:v 5M output.mp4
➢ Diagram: bitstreams in system memory; preprocessing (e.g. scaling) in system memory; NVDEC decode and NVENC encode with YUV frames in GPU memory; two PCIe transfers between system memory and the GPU
➢ Result: 76 fps*
* 1:2 transcode, fps per session, GP104 GPU
HIGH-PERF GPU OPTIMIZED TRANSCODE
ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:a copy -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output.mp4
➢ Diagram: bitstreams in system memory; NVDEC decode, preprocessing (scaling in CUDA) and NVENC encode with YUV frames kept in GPU memory
➢ Result: 472 fps*
* 1:2 transcode, fps per session, GP104 GPU
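To put the five benchmark slides side by side, the short Python sketch below computes the ratios between the per-session fps figures quoted above. The numbers are taken from the slides; the dictionary keys and the `speedup` helper are just illustrative names. The CPU and GPU figures come from different hardware (a 4 GHz Intel i7-6700K and a GP104 GPU), so the ratios compare those specific setups only.

```python
# Per-session fps as reported on the benchmark slides (1:2 transcode).
FPS = {
    "sw": 32,                    # SW transcode
    "sw_scale": 29,              # SW transcode + scale
    "gpu_unopt": 288,            # GPU unoptimized transcode
    "gpu_unopt_cpu_scale": 76,   # GPU unoptimized transcode + CPU scale
    "gpu_opt_scale": 472,        # GPU optimized transcode with scale_npp
}


def speedup(fast: str, slow: str) -> float:
    """Ratio of two reported throughputs."""
    return FPS[fast] / FPS[slow]


print(f"GPU vs SW, no scaling:          {speedup('gpu_unopt', 'sw'):.1f}x")                    # 9.0x
print(f"GPU optimized vs SW, scaling:   {speedup('gpu_opt_scale', 'sw_scale'):.1f}x")          # 16.3x
print(f"scale_npp vs CPU scale on GPU:  {speedup('gpu_opt_scale', 'gpu_unopt_cpu_scale'):.1f}x")  # 6.2x
```

The last ratio isolates the effect of keeping scaling on the GPU: the decoder and encoder are the same in both runs, and only the scaling path differs.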
PERFORMANCE CONSIDERATIONS
Saturating encoder/decoder
➢ Pipelining
➢ Input/output buffers
➢ Tools: nvidia-smi, Microsoft GPUView
ANALYZING PERFORMANCE BOTTLENECKS
Microsoft GPUView (Windows only)
ANALYZING PERFORMANCE BOTTLENECKS
nvidia-smi (Windows & Linux)
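One way to check whether the encoder and decoder are saturated is to log `nvidia-smi dmon -s u`, which prints per-GPU utilization columns including `enc` and `dec`, and average those columns over a run. The Python sketch below parses such a log. The `SAMPLE` text is made-up illustrative input, not a measurement, and the helper names `parse_dmon` and `mean_utilization` are hypothetical. The parser reads the column names from the header line rather than assuming fixed positions, since the exact column set can vary between driver versions.

```python
from typing import Dict, List, Optional

# Illustrative text in the layout printed by `nvidia-smi dmon -s u`
# (made-up values, not a real measurement).
SAMPLE = """\
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0    12     5    97    41
    0    10     4    99    43
"""


def parse_dmon(text: str) -> List[Dict[str, Optional[int]]]:
    """Parse dmon output into one dict per sample row."""
    columns: List[str] = []
    rows: List[Dict[str, Optional[int]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # The header line starting with "gpu" names the columns;
            # the units line and repeated headers are skipped.
            if not columns and fields and fields[0] == "gpu":
                columns = fields
            continue
        values = line.split()
        if len(values) != len(columns):
            continue  # ignore rows that do not match the header
        # Non-numeric cells (such as "-") are stored as None.
        rows.append({name: int(v) if v.isdigit() else None
                     for name, v in zip(columns, values)})
    return rows


def mean_utilization(rows: List[Dict[str, Optional[int]]],
                     column: str) -> Optional[float]:
    """Average one utilization column, ignoring missing values."""
    samples = [r[column] for r in rows if r.get(column) is not None]
    return sum(samples) / len(samples) if samples else None


rows = parse_dmon(SAMPLE)
print("enc %:", mean_utilization(rows, "enc"))  # 98.0 for the sample text
print("dec %:", mean_utilization(rows, "dec"))  # 42.0 for the sample text
```

In the sample, the encoder is close to saturated while the decoder has headroom, which would suggest the encode side is the limiting engine for that hypothetical run.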