High Quality Video Transcoding in Data Center, Jensen Zhang, Sep. 2019



SLIDE 1

Company Proprietary and Confidential

High Quality Video Transcoding in Data Center

Jensen Zhang

Sep. 2019
SLIDE 2

High Quality Video Transcoding in Data Center

▲What is the current status of the data center for video?

►Explosive growth of different kinds of video streams
►Compute requirements are skyrocketing

◼More complex video codec formats, higher video resolutions
◼CPUs are too slow for software video transcoding, especially for live video

►Huge demand for better economics

SLIDE 3

Video Acceleration Overview

►Today the market is dominated by high-powered x86 servers for video processing; these servers struggle with video applications, new codecs and high resolutions
►Huge video growth pushes alternative architectures forward, but they still do not save enough

◼NVIDIA NVENC/NVDEC: hardware-based codec engine
◼Intel Hardened (QSV): consumer GPU with a hardened video engine to achieve higher density; Intel VCA2 PCIe card
◼Xilinx VU9P PCIe card: FPGA integrating H.264/H.265/VP9 codecs

►Giant SNS companies such as FB require an ASIC to save much more cost!

SLIDE 4

Huge demand requires an ASIC solution

▲Huge demand requires an ASIC solution

►Strong requirements from internet companies, which cannot wait
►Server companies, including chip designers and OEMs
►FPGA companies, AI companies, etc.

▲VeriSilicon has built a video transcoding solution to address these problems

►Excellent codec IPs that work for both data center and edge server
►Total solution with BOTH HW and SW

SLIDE 5

VeriSilicon leading video transcoding IP & customized ASIC

[Figure: CPU vs. video transcoding ASIC]

  • 6X HEVC 4K processing
  • Power consumption: 1 13
  • Much smaller size

SLIDE 6

World Leading Video Product

SLIDE 7

Hantro Video IP Track Record

▲Multiple generations of Hantro encoders and decoders

►More than 100 licensees
►Billions of shipped devices

▲Market leader with success in multiple market segments

SLIDE 8

VeriSilicon Technology in Edge Device, Edge Server and Cloud

[Figure: VeriSilicon technology across market segments. Segment labels: Cloud / Data Center, Edge Server, Edge Device, Automotive.]

  • Application labels: Video Transcoding; Pixel Compression; High Performance Computing; Surveillance; Smart Home, Vision, Voice; AR/VR; Wearables
  • Automotive labels: Instrument Cluster; Infotainment; Telematics; V2X; Cameras; Driver and Passenger Mobile Devices; ADAS; Body and Powertrain ECUs; Audio Amplifier; Rear Seat Entertainment

SLIDE 9

Strengths of the Solution

SLIDE 10

Easy Integration as a Whole Solution

[Block diagram: a decoder cluster (VC8000D, DEC400, L2) and an encoder cluster (VC8000E, DEC400, optional CU Tree), each connected to the system bus fabric through an AXI master and an APB slave, plus an optional AXI master. A decoder cluster driver and an encoder cluster driver sit below GStreamer, OMX-IL, LibVA, V4L2 and FFmpeg.]

Integrated Decoder Cluster

Ready software and hardware integration and configuration

  • VC8000D: VeriSilicon multi-format decoder IP: H.264, H.265, VP9, AVS2, JPEG and legacy formats
  • DEC400: VeriSilicon system-adaptive frame compression IP
  • L2: Data cache and burst shaper for DRAM efficiency

Integrated Encoder Cluster

Ready software and hardware integration and configuration

  • VC8000E: VeriSilicon multi-format encoder IP: H.264, H.265 and JPEG
  • DEC400: VeriSilicon system-adaptive frame compression IP
  • CU Tree: Optional hardware for 2-pass encoding analysis

Transcoding Slice

Decoder cluster + encoder cluster optimized for transcoding

  • Optimized transcoding data paths
  • Optimized transcoding operations
  • FFmpeg and GStreamer ready solution
SLIDE 11

Ready Software library support

▲Native encoder/decoder APIs are provided to fully exploit the HW features
▲Small CPU load, since the algorithm runs fully in HW
▲Portable to different CPUs: ARM, MIPS, PowerPC, C51
▲Optimized according to the HW flow
▲Multi-core supported
▲Multi-instance support, with interleaved processing of different formats or resolutions
▲OMX-IL or VAAPI (libva/libdrm) components provide a standard interface for easy media framework integration
▲All software is provided as source code

[Software stack diagram. Layer labels: Application/Media Framework; OMX-IL/VAAPI; Other Encapsulations; Hantro Encoder/Decoder API; Encoder/Decoder Wrapper Layer; HW Driver; Codec Hardware. Legend: HW, Hantro SW, Customer SW.]

SLIDE 12

Power & Area Efficient ASIC Solution

▲100% ASIC design in the high-performance decoding and encoding video IP products
▲Low area cost
▲Low power consumption

4K60 10-bit H.264 & H.265 configuration area at 16 nm (mm2)

  Cluster            Block      Area (mm2)
  Decoder Cluster    VC8000D    1.05
                     DEC400     0.16
                     L2         0.23
                     Total      1.44
  Encoder Cluster    VC8000E    3.39
                     DEC400     0.12
                     Total      3.51
  Transcoder         Total      4.95

4K60 10-bit H.264 & H.265 configuration power consumption at 16 nm (mW)

  Cluster            Block      Power (mW)
  Decoder Cluster    VC8000D    230
                     DEC400     12
                     L2         22
                     Total      264
  Encoder Cluster    VC8000E    532
                     DEC400     11
                     Total      543
  Transcoder         Total      807
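
As a quick consistency check on the figures above, the per-block numbers can be summed per cluster and for the whole transcoder. This is only an illustrative sketch: the dictionary layout and the `cluster_totals` helper are made up for this example, and the values are copied from the slide.

```python
# Per-block figures copied from the slide (4K60 10-bit, 16 nm).
AREA_MM2 = {
    "decoder": {"VC8000D": 1.05, "DEC400": 0.16, "L2": 0.23},
    "encoder": {"VC8000E": 3.39, "DEC400": 0.12},
}
POWER_MW = {
    "decoder": {"VC8000D": 230, "DEC400": 12, "L2": 22},
    "encoder": {"VC8000E": 532, "DEC400": 11},
}


def cluster_totals(table):
    """Sum each cluster, then add a 'transcoder' entry for decoder + encoder."""
    totals = {name: round(sum(blocks.values()), 2) for name, blocks in table.items()}
    totals["transcoder"] = round(sum(totals.values()), 2)
    return totals


area = cluster_totals(AREA_MM2)
power = cluster_totals(POWER_MW)
print("area (mm2):", area)
print("power (mW):", power)
```

The sums reproduce the cluster and transcoder totals quoted on the slide.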

SLIDE 13

Low DRAM Bandwidth Requirements

[Diagram: decoder and encoder sharing a DRAM buffer that holds the decoder reference picture, the decoder post processed picture and the encoder reference picture. Labels: read-only cache; line buffer; compressed reference frame; compressed post processed frame; read reference frame; read resized frame; crop, scaling, …; crop, blending, …]

Bandwidth saving technology applied everywhere

Decoder Cluster

  • All frames are compressed: saving 45~55%
  • >90% of bursts are aligned: NO overhead
  • Configurable L2 cache size for reference frame, saving 0.8 ~ 1.6 GB/s
  • Typical bandwidth: 2.2 GB/s
  • Ultra saving bandwidth: 1.4 GB/s

Encoder Cluster

  • All frames are compressed: saving 45~55%
  • >90% of bursts are aligned: NO overhead
  • Configurable line buffer for reference frame, saving 1.2 ~ 2.4 GB/s
  • Typical bandwidth: 3.49 GB/s
  • Ultra saving bandwidth: 2.4 GB/s

Transcoder

  • Encoder directly reads the decoder reference frame
  • Cropped and down scaled output from decoder
  • Blending in encoder input
  • Typical bandwidth: 5.69 GB/s
  • Ultra saving bandwidth: 3.8 GB/s
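
The transcoder figures appear to be the sum of the decoder and encoder cluster figures. A small sketch, assuming that reading, checks the sums and derives the relative saving of the "ultra saving" mode; the helper names are hypothetical and the values are copied from the slide.

```python
# DRAM bandwidth figures quoted on the slide, in GB/s.
TYPICAL = {"decoder": 2.2, "encoder": 3.49}
ULTRA = {"decoder": 1.4, "encoder": 2.4}


def transcoder_bandwidth(per_cluster):
    """Assumed model: transcoder bandwidth = decoder + encoder bandwidth."""
    return round(sum(per_cluster.values()), 2)


def saving_ratio(typical, ultra):
    """Fraction of the typical bandwidth removed by the ultra saving mode."""
    return (typical - ultra) / typical


typical_total = transcoder_bandwidth(TYPICAL)
ultra_total = transcoder_bandwidth(ULTRA)
print(f"typical: {typical_total} GB/s, ultra saving: {ultra_total} GB/s")
print(f"relative saving: {saving_ratio(typical_total, ultra_total):.0%}")
```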

SLIDE 14

High BUS Latency Tolerance

▲Provides sufficient performance even in SoCs with high bus latency (up to 700 cycles)

Cycles/MB budget at 500 MHz:

  • 4096x2160@60fps: 242 cycles/MB
  • 3840x2160@60fps: 258 cycles/MB
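
The cycle budget can be approximated as the clock rate divided by the number of macroblocks processed per second. The sketch below assumes 16x16 macroblocks and no other overhead; both are assumptions, since the slide does not say how its figures were derived.

```python
def cycles_per_mb(width, height, fps, clock_hz, mb_size=16):
    """Clock cycles available per macroblock for real-time processing."""
    mbs_per_row = -(-width // mb_size)   # ceiling division
    mb_rows = -(-height // mb_size)
    mbs_per_second = mbs_per_row * mb_rows * fps
    return clock_hz / mbs_per_second


budget_4096 = cycles_per_mb(4096, 2160, 60, 500_000_000)
budget_3840 = cycles_per_mb(3840, 2160, 60, 500_000_000)
print(f"4096x2160@60fps: {budget_4096:.1f} cycles/MB")
print(f"3840x2160@60fps: {budget_3840:.1f} cycles/MB")
```

Under these assumptions the results come out just under the 242 and 258 cycles/MB figures quoted above, with the larger frame getting the smaller budget.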

SLIDE 15

Low DRAM Footprint

▲Use packed storage in DRAM for 10-bit data

►Our solution: 64 MB DRAM size for one 8K 10-bit picture
►Unpacked 16-bit: 102 MB DRAM size for one 8K 10-bit picture

▲Allocate frame buffers on demand
▲Directly reading the decoder reference frame buffer eliminates up to 10 frames of buffer for extra decoder output
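
The packed versus unpacked ratio follows directly from the sample width: 16 bits instead of 10 bits per sample is a factor of 1.6, in line with the 102 MB versus 64 MB quoted above. The sketch below assumes an 8192x4320 frame with 4:2:0 chroma subsampling; neither is stated on the slide, so the absolute numbers are only indicative.

```python
def frame_bytes(width, height, bits_per_sample, samples_per_pixel=1.5):
    """Bytes for one frame; 1.5 samples per pixel corresponds to 4:2:0."""
    samples = width * height * samples_per_pixel
    return samples * bits_per_sample / 8


W, H = 8192, 4320  # assumed 8K frame size; the slide does not state it
MIB = 1024 * 1024

packed = frame_bytes(W, H, 10)    # 10-bit samples stored back to back
unpacked = frame_bytes(W, H, 16)  # each 10-bit sample padded to 16 bits

print(f"packed:   {packed / MIB:.1f} MiB")
print(f"unpacked: {unpacked / MIB:.1f} MiB")
print(f"ratio:    {unpacked / packed:.2f}")
```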

SLIDE 16

Robust Decoding and Encoding

▲Silicon-proven video IP
▲Rich test pattern database including multiple commercial test streams, streams from customers, compatibility streams, and self-generated random error streams
▲Strong error handling

►Stream error detection in decoder
►Bus error detection
►Frame compression error concealment

▲Complex transcoding runs stably over hundreds of hours of real product testing

SLIDE 17

Flexible Controllability by FLEXA API Video

▲FLEXA API Video is a software and hardware interface that enables VC8000E and VC8000D to cooperate with an AI engine
▲FLEXA API Video examples

[Diagram: VC8000E, VC8000D and an AI engine, each connected through FLEXA]

VC8000E

  • Various GOP structure settings: hierarchical B, IDR, long term, etc.
  • Rate control settings: frame level and coding block level
  • ROI map: coding control down to 8x8 blocks, such as QP and coding mode
  • Special coding areas: Intra area, ROI area, IPCM area
  • RDO level: trade-off between quality and performance
  • Other controls: Global MV, GDR, CIR, etc.
  • Coding information output to DRAM
  • PSNR and SSIM report

VC8000D

  • Coding information output to DRAM
  • Multiple down scaled frames
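
To illustrate what an ROI map at 8x8 granularity could look like, the sketch below builds a per-block QP offset grid for one rectangular region of interest. It is not the FLEXA API, whose actual interface is not described in these slides; the function name, the offset convention and the parameters are all hypothetical.

```python
def build_qp_delta_map(width, height, roi, roi_delta, block=8):
    """Return rows of per-block QP offsets.

    roi is (x, y, w, h) in pixels. Every block that overlaps the ROI gets
    roi_delta (negative means finer quantization); all other blocks get 0.
    """
    cols = -(-width // block)   # ceiling division
    rows = -(-height // block)
    x0, y0, w, h = roi
    qp_map = [[0] * cols for _ in range(rows)]
    for by in range(rows):
        for bx in range(cols):
            px, py = bx * block, by * block
            overlaps = (px < x0 + w and px + block > x0 and
                        py < y0 + h and py + block > y0)
            if overlaps:
                qp_map[by][bx] = roi_delta
    return qp_map


# Tiny 64x32 picture so the whole map can be printed.
qp_map = build_qp_delta_map(64, 32, roi=(16, 8, 24, 16), roi_delta=-4)
for row in qp_map:
    print(row)
```

An AI engine that detects faces or text could, in principle, produce such a map per picture and hand it to the encoder as a low-level control input.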
SLIDE 18

High Quality Video Encoding

▲HEVC encoding quality is similar to that of x265 (preset=veryslow)

▲PSNR compared against x265-2.6+49
▲Quality tuning based on JCTVC streams

▲H.264 encoding quality is similar to that of x264 medium

[Figure: four PSNR comparison plots, for the sequences crowd_run, FourPeople, Johnny and Vidyo1. Each plot compares x265-2.6+49 veryslow with the VC8000E HEVC C-Model (CL207156).]
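
For reference, PSNR is computed from the mean squared error between the original and the decoded samples: PSNR = 10 * log10(peak^2 / MSE). The sketch below is a plain-Python illustration on flat sample lists and is not the measurement tool used for the plots; the sample values are made up.

```python
import math


def psnr(reference, test, bit_depth=8):
    """PSNR in dB between two equal-length sample sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    peak = (1 << bit_depth) - 1
    return 10 * math.log10(peak * peak / mse)


ref = [100, 120, 140, 160]   # made-up luma samples
enc = [101, 119, 142, 158]   # same samples after a lossy round trip
print(f"{psnr(ref, enc):.2f} dB")
```

In a codec comparison this value is measured at several bitrates for each encoder, and the resulting curves are compared.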

SLIDE 19

Video Transcoding

SLIDE 20

Basic Transcoding Flow

[Flow diagram: Decoding → DPB and PPB (inline post processing: crop, down/up scale, color conversion) → inline preprocessing (crop, rotation, color conversion, blending) → encoding, with low-level control applied to the encoder]

  • DPB: decoded picture buffer (reference picture buffer)
  • PPB: post processed picture buffer
  • Low-level control: encoding control within a picture, including QP map, ROI map, IPCM

SLIDE 21

Resize, Blending, and Multicast

[Diagram: one decoding stage feeding four encoding stages. Input: 4K stream. Outputs: 4K video, 1080p video, 720p video, 360p video.]

  • The decoder supports up to 4 resized outputs (down scaling and up scaling)
  • The encoder supports blending between 1 video layer and 1 other layer, sparsely, with up to 8 regions
  • One input stream can produce multiple output streams, resized and with or without blending
  • Specific operations between decoding and encoding, such as de-watermarking, can be discussed

SLIDE 22

Multi-stream transcoding

▲Scalable multi-stream transcoding

►Proportional CPU load increase
►Proportional performance scaling

▲Concurrent video and JPEG transcoding is available through standalone JPEG-only hardware
▲Jobs are switched at picture level and flexibly scheduled by the software driver

►Maximize the overall throughput
►Ensure latency by priority management
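
Picture-level job switching with priority management can be pictured as a priority queue of per-picture jobs. The sketch below is a hypothetical illustration of that idea and is not the VeriSilicon driver; the class name, stream names and priority values are invented.

```python
import heapq
import itertools


class PictureScheduler:
    """Picks the next picture job: lowest priority number first, FIFO within a level."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps submission order

    def submit(self, stream_id, picture_index, priority):
        heapq.heappush(self._heap, (priority, next(self._order), stream_id, picture_index))

    def next_job(self):
        """Return (stream_id, picture_index) of the next job, or None if idle."""
        if not self._heap:
            return None
        _, _, stream_id, picture_index = heapq.heappop(self._heap)
        return stream_id, picture_index


sched = PictureScheduler()
sched.submit("vod-a", 0, priority=5)
sched.submit("live-b", 0, priority=1)   # latency-sensitive stream
sched.submit("vod-a", 1, priority=5)
order = [sched.next_job() for _ in range(3)]
print(order)
```

Because the unit of scheduling is a single picture, a latency-sensitive stream waits at most one picture time before the hardware picks it up.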

SLIDE 23

Transcoding latency control

▲Transcoding latency is typically less than 100 ms
▲When the application requires it, ultra-low-latency transcoding of several ms is possible

►Sub-picture-level synchronization
►On-chip SRAM for data transfer minimizes DDR traffic
►Low-latency encoding or transcoding

[Diagram: PCIe → Decoding → data transfer cache or DRAM buffer → Encoding. Notes: low-delay GOP; no tile column.]

SLIDE 24

Hardware accelerated 2-pass Encoding

[Flow diagram: Decoding → decoded pictures (up to 41 frames) → ¼ sized pictures → 1st pass encoding → 1st pass meta data (up to 40 frames) → Analysis → 2nd pass meta data → 2nd pass encoding]

  • The whole look-ahead 2-pass encoding process is hardware accelerated
  • Supports up to 40 frames of look-ahead
  • The 2nd pass analysis hardware is a configurable module
  • Adds 0.8 GB/s of bandwidth for one 4K60 2-pass encoding
  • Saves 4200 MHz of CPU meta data processing (18-frame look-ahead)
  • The first pass encoding is performed on a ¼ sized picture. For example, when the input is 4K, the 1st pass encoding picture size is 1080p
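
The look-ahead idea can be sketched in a few lines: each picture is held back until up to N later pictures are available for analysis, and the first pass works on a quarter-area version of the picture. The code below only illustrates the buffering and sizing logic with hypothetical helper names; it does not model the hardware.

```python
from collections import deque


def first_pass_size(width, height):
    """Quarter-area picture: halve each dimension."""
    return width // 2, height // 2


def lookahead_windows(frames, depth):
    """Yield (frame_to_encode, upcoming_frames) with up to `depth` frames of look-ahead."""
    window = deque()
    for frame in frames:
        window.append(frame)
        if len(window) > depth:
            current = window.popleft()
            yield current, list(window)
    while window:  # drain at end of stream with a shrinking window
        current = window.popleft()
        yield current, list(window)


print(first_pass_size(3840, 2160))
windows = list(lookahead_windows(range(5), depth=2))
for current, upcoming in windows:
    print(current, upcoming)
```

In this sketch the buffer briefly holds `depth + 1` pictures (the one about to be encoded plus its look-ahead), which is one possible reading of the 41 decoded pictures shown for a 40-frame look-ahead.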
SLIDE 25

Build Comprehensive Transcoding System by FLEXA API Video

[Diagram: a possible comprehensive transcoding system enabled by the VeriSilicon transcoding solution. Blocks: Decoding; DPB; PPB; meta data; inline post processing (crop, down/up scale, color conversion); extended processing; offline encode analysis; quality assessment; non-encoding purpose AI/ML (detection, classification result); inline preprocessing (crop, rotation, color conversion, blending); encoding; decoder firmware/driver; encoder firmware/driver; transcoding software framework with low-level and high-level control.]

  • The software interface (API) in FLEXA API Video provides the ability to flexibly configure and control encoding and decoding
  • AI and 3rd party computing processors cooperate with the encoder and decoder through the hardware/buffer interface in FLEXA API Video

SLIDE 26

Complete Software Stack – Embedded to your framework easily.

[Software stack diagram, layer labels as far as they can be recovered:]

  • Multi-media Framework: FFmpeg, GStreamer, Stagefright
  • Multi-media API: OMX-IL, VA API, V4L2 API
  • VSI Control Software: VSI Video API; Multi-Format Decoder Control; Multi-Format Encoder Control; HAL; OSAL
  • Kernel Drivers: HW Driver, V4L2 (Embedded RTOS, Linux)
  • VSI Hardware: VC8000D, VC8000E

SLIDE 27

Seamless Integration with the Industry-Standard FFmpeg Framework

[Diagram: the typical transcoding process (Decoding → Filtering → Encoding) mapped to FFmpeg components and VSI-provided plug-ins, which sit on top of the VSI Control Software (VSI Video API, Multi-Format Decoder Control, Multi-Format Encoder Control).]

  • Decoding: libavcodec; plug-ins vsi_h264_dec, …, vsi_hevc_dec
  • Filtering (split, scale): libavfilter; plug-in vsi_splitter*
  • Encoding: libavcodec; plug-ins vsi_h264_enc, …, vsi_hevc_enc

vsi_splitter*: The VSI hardware decoder contains post-processors that are capable of scaling. The splitter filter mechanism simply "copies" out these scaled videos.
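
With plug-ins registered as FFmpeg codecs, a transcode would normally be selected through FFmpeg's standard `-c:v` option, once before `-i` for the decoder and once after it for the encoder. The sketch below only builds such a command line and does not run it. The codec names are taken from the slide; whether a given FFmpeg build contains them, and which extra options they accept, is an assumption, as are the file names and the bitrate.

```python
import shlex


def build_transcode_cmd(src, dst, decoder="vsi_h264_dec", encoder="vsi_hevc_enc", bitrate="4M"):
    """Build (not run) an ffmpeg command line that selects decoder and encoder by name."""
    return [
        "ffmpeg",
        "-c:v", decoder,   # input option: placed before -i, selects the decoder
        "-i", src,
        "-c:v", encoder,   # output option: selects the encoder
        "-b:v", bitrate,   # target video bitrate (illustrative value)
        "-an",             # drop audio to keep the example video-only
        dst,
    ]


cmd = build_transcode_cmd("input.mp4", "output.mp4")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a system where such an FFmpeg build is installed.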

SLIDE 28