Efficient Video Decoding on GPUs

Efficient Video Decoding on GPUs by Point Based Rendering
Bo Han, Bingfeng Zhou
Peking University

Outline
• Motivation and Goal
• Previous work
• Review of video decoding
• Point based decoding framework
• Results
• Discussion

Motivation
• Diverse video applications
  • Range from HDTV to mobile devices
  • Multiple video standards coexist
  • Main concern: video playback
• Computation and bandwidth
• A successful decoding system needs:
  • High performance and programming flexibility
  • CPU + additional hardware

Motivation
• GPUs are powerful and flexible
  • Attractive coprocessors for GPGPU
  • Spreading everywhere
• History of offloading video decoding tasks
  • Overlay surface for YUV to RGB
  • Dedicated hardware for DVD (DXVA)
  • Programmable video engines (PureVideo, AVIVO)
  • What's next? (Shader based?)

Our Goals
• Video decoding framework
  • Built on the graphics pipeline and shader programs
  • Hardware performance + software flexibility
• Additional advantages
  • Independent of hardware and platform
    • Graphics API and shader languages
  • Saves hardware resources
  • GPU growth rate exceeds Moore's law

Previous work
• Video/image decoding process
  • Motion compensation on GPUs [Shen et al. 2005]
  • DCT/IDCT on GPUs [NVIDIA 2005] [Fang et al. 2005]
  • Fast interpolation for ME [Kelly et al. 2004]
  • H.263 decoder on GPUs [Hirvonen et al. 2005]
• Limitations and weaknesses
  • A single textured quad for the whole picture
  • Ignores the features of video data
  • Performance and flexibility are not satisfactory

Our Contributions
• Generic video decoding framework
• Flexible point-based representation
  • Easily exploits the parallelism of the decoding process
  • Maps efficiently to graphics tasks
• Both performance and flexibility

Outline
• Motivation and Goals
• Previous work
• Review of video decoding
• Point based decoding framework
• Results
• Discussion

Review of video decoding
[Figure: a 4:2:0 macroblock with its Y, U and V blocks]
• DCT-MCP hybrid coding
  • DCT & motion compensation and prediction
• Block based structure
  • Block and macroblock (basic processing units)

Review of video decoding
[Figure: decoding pipeline split between CPU and GPU. Bitstream → Variable Length Decoding → coefficient blocks → Inverse Quantize → Inverse DCT → Residual. Macroblocks and Reference Frames → Motion Compensation → Prediction. Residual + Prediction → Reconstructed Frame → Color Space Conversion (YUV to RGB) → Display. Frame buffers sit between the stages.]
• VLD is a sequential bit-wise operation
• The other stages show parallelism and streaming
  • For each macroblock: perform MC
  • For each coefficient block: perform IQ and IDCT

Inverse Quantize (IQ)
[Figure: the zigzag-ordered coefficient list [15, 6, 4, 1, -2, 1, -1, …] is rearranged into an 8 x 8 block, which is then multiplied element-wise by the quant matrix and by the quant parameter]
• Inverse zigzag scan: reconstruct the block
• IQ: $X_{IQ}(u,v) = X_Q(u,v) \times QM(u,v) \times qp$
• Characteristics:
  • Sparse, with coefficient-level parallelism
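The two steps on this slide can be sketched in a few lines. This is a minimal illustration that applies only the formula shown above; a real MPEG-2 decoder adds scaling, saturation and mismatch control that the slide does not cover. The function names, the `block[row][col]` layout and the flat quant matrix in the usage lines are assumptions made for the example.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Odd diagonals are walked with increasing row, even ones with decreasing row.
        order.extend((r, s - r) for r in (rows if s % 2 else reversed(rows)))
    return order


def inverse_zigzag(coeffs, n=8):
    """Place a zigzag-ordered coefficient list back into an n x n block."""
    block = [[0] * n for _ in range(n)]
    for value, (r, c) in zip(coeffs, zigzag_order(n)):
        block[r][c] = value
    return block


def inverse_quantize(block, quant_matrix, qp):
    """Apply X_IQ = X_Q * QM * qp element-wise (simplified, as on the slide)."""
    return [[x * qm * qp for x, qm in zip(block_row, qm_row)]
            for block_row, qm_row in zip(block, quant_matrix)]


# Usage with the coefficient list from the slide and a hypothetical flat quant matrix.
block = inverse_zigzag([15, 6, 4, 1, -2, 1, -1])
flat_qm = [[16] * 8 for _ in range(8)]
dequantized = inverse_quantize(block, flat_qm, qp=2)
```

Every coefficient is processed independently, which is the coefficient-level parallelism the slide refers to.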

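Inverse DCT
• IDCT is typically computation intensive
• Many fast algorithms exist, but they are not designed for GPUs
• Each coefficient has its own basis image
• Parallel and stream processing
$x = T^{T} X T = \sum_{u=0}^{7} \sum_{v=0}^{7} X(u,v)\,[T(u)^{T}\,T(v)]$
$x = X(0,0) \times B_{0,0} + X(1,0) \times B_{1,0} + \dots + X(7,7) \times B_{7,7}$, where $B_{u,v} = T(u)^{T}\,T(v)$ is the basis image of coefficient $(u,v)$ and $T(u)$ is row $u$ of the DCT matrix $T$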
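The basis-image view of the IDCT can be checked numerically. The sketch below assumes the orthonormal DCT-II matrix for T (the slide does not state its normalization) and uses hypothetical function names. It rebuilds a block as a weighted sum of basis images and skips zero coefficients, which is where sparsity pays off.

```python
import math

N = 8


def dct_matrix(n=N):
    """Orthonormal DCT-II matrix T (assumed normalization), so that x = T^T X T."""
    return [[math.sqrt((1 if u == 0 else 2) / n)
             * math.cos((2 * i + 1) * u * math.pi / (2 * n))
             for i in range(n)] for u in range(n)]


def basis_image(T, u, v):
    """Outer product T(u)^T T(v): the spatial pattern of coefficient (u, v)."""
    n = len(T)
    return [[T[u][i] * T[v][j] for j in range(n)] for i in range(n)]


def idct_by_basis_images(X):
    """Reconstruct pixels as the sum of X(u, v) times its basis image."""
    n = len(X)
    T = dct_matrix(n)
    pixels = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if X[u][v] == 0:
                continue  # zero coefficients contribute nothing
            image = basis_image(T, u, v)
            for i in range(n):
                for j in range(n):
                    pixels[i][j] += X[u][v] * image[i][j]
    return pixels


# Usage: a block holding only a DC coefficient decodes to a flat patch.
dc_only = [[0.0] * N for _ in range(N)]
dc_only[0][0] = 8.0
flat_patch = idct_by_basis_images(dc_only)
```

Each term of the sum is independent of the others, so the terms can be computed in any order and accumulated.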

Motion Compensation
[Figure: Reconstructed = Prediction + Residual; an I, B, B, P frame sequence with forward, backward and bidirectional prediction]
• Memory and computation intensive
• Block translation according to motion vectors
• Per-pixel arithmetic operations
• Fits well with the texture sampling scheme
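As a rough CPU-side sketch of what texture sampling does for motion compensation, the code below predicts one block from a reference frame and adds the residual. It assumes half-pel motion vectors, clamp-to-edge addressing and round-half-up, and covers forward prediction from a single reference only. All names are hypothetical.

```python
import math


def sample_bilinear(frame, x, y):
    """Sample a frame at a fractional position with clamp-to-edge addressing."""
    height, width = len(frame), len(frame[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def texel(tx, ty):
        return frame[min(max(ty, 0), height - 1)][min(max(tx, 0), width - 1)]

    top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


def motion_compensate(reference, residual, block_x, block_y, mv, size=16):
    """Reconstructed = prediction + residual for one block; mv is in half-pel units."""
    out = []
    for j in range(size):
        row = []
        for i in range(size):
            pred = sample_bilinear(reference,
                                   block_x + i + mv[0] / 2,
                                   block_y + j + mv[1] / 2)
            value = math.floor(pred + 0.5) + residual[j][i]  # round half up, then add
            row.append(min(max(value, 0), 255))              # saturate to 8 bits
        out.append(row)
    return out


# Usage: a 4 x 4 horizontal ramp, a 2 x 2 block at (1, 1), half a pixel to the right.
reference = [[10 * x for x in range(4)] for _ in range(4)]
zero_residual = [[0, 0], [0, 0]]
recon = motion_compensate(reference, zero_residual, 1, 1, mv=(1, 0), size=2)
```

The rounding rule chosen here is an assumption; a different rule changes the result slightly for sub-pixel vectors.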

Overview of our framework
[Figure: framework overview. CPU: Bitstream → Variable Length Decoding, which produces coefficient point sets and macroblock point sets (with MV data). GPU: coefficient point sets and basis images → Inverse Quantize and Inverse DCT → IDCT buffer (residual). Macroblock point sets, reference frames and the residual → Motion Compensation → frame buffers → Color Space Conversion → Display.]
• Convey block-wise information in point attributes
• Batch points into vertex arrays
• Render points with the active shader programs

Map Video blocks to Graphics points
Element            Size      Attributes
Point primitive    Variable  position, normal, color, texcoords0-7…
Macroblock         16x16     position, motion vectors, MB type, DCT coding type
Coefficient block  8x8       position, quant parameter, sparse coefficients
• Natural for vertex processing
• Rasterized to fragment blocks (flexible size)
• Fragment processing
  • Point sprite extension and WPOS semantics

Batch points to feed GPUs
• Challenge
  • Various video block prediction and coding types
  • Irregular distribution and number of coefficients
  • GPUs need highly regular, well-batched input
  • Branching is expensive on GPUs
• Solution
  • Divide and conquer
  • Use the CPU to classify points into different sets
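The classification step can be pictured as a simple grouping pass on the CPU. This is only a sketch: the dictionary fields (`type`, `x`, `y`, `mv`) and the function name are hypothetical, and a real decoder would write each batch into a vertex array.

```python
from collections import defaultdict


def batch_macroblock_points(macroblocks):
    """Group macroblock points by prediction type so each batch needs no branching."""
    batches = defaultdict(list)
    for mb in macroblocks:
        # One point per macroblock: position plus motion vector as attributes.
        batches[mb["type"]].append((mb["x"], mb["y"], mb.get("mv", (0, 0))))
    return dict(batches)


# Usage: three macroblocks fall into two batches, one per shader program.
decoded = [
    {"type": "intra", "x": 0, "y": 0},
    {"type": "forward", "x": 16, "y": 0, "mv": (3, -2)},
    {"type": "intra", "x": 32, "y": 0},
]
batches = batch_macroblock_points(decoded)
```

Each batch can then be drawn with the shader written for that type, which keeps type checks out of the shader itself.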

Coefficient Points
[Figure: coefficient point sets and basis images feed Inverse Quantize and Inverse DCT, which write to the IDCT buffer; a 32 x 32 label appears in the diagram. The zigzag list [15, 6, 4, 1, -2, 1, -1, …] is cut into Slot 0 to Slot 15 of four coefficients each: Slot 0 = (15, 6, 4, 1) becomes Point 0 and Slot 1 = (-2, 1, -1, 0) becomes Point 1.]
• Apply a regular pattern to generate points
  • Solves the irregular distribution of coefficients
  • Only convey non-zero 4D vectors and their slot index
  • Balance visual quality and computational complexity
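The slot pattern above can be sketched as follows: the zigzag list is cut into groups of four, and only groups with at least one non-zero value become points. The function name and the `(slot, vector)` tuple layout are assumptions for illustration.

```python
def coefficient_points(zigzag_coeffs, slot_width=4):
    """Pack zigzag-ordered coefficients into (slot index, 4D vector) points."""
    padding = -len(zigzag_coeffs) % slot_width
    padded = list(zigzag_coeffs) + [0] * padding
    points = []
    for slot in range(len(padded) // slot_width):
        vector = tuple(padded[slot * slot_width:(slot + 1) * slot_width])
        if any(vector):  # all-zero slots generate no point
            points.append((slot, vector))
    return points


# Usage with the coefficient list from the slide.
points = coefficient_points([15, 6, 4, 1, -2, 1, -1])
```

A full 8 x 8 block yields at most 16 points, and a typical sparse block yields only a few.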

Render coefficient points (IQ)
[Figure: graphics pipeline: Vertex Processors → Rasterizer → Fragment Processors → Blending Units]
• Single pass to perform both IQ and IDCT
• Vertex processors:
  • Perform IQ: $X_{IQ}(u,v) = X_Q(u,v) \times QM(u,v) \times qp$
  • Quant matrix as uniform parameters
  • Quant parameter and slot index in the point's attributes
  • Locate the coordinates of the basis image

Render coefficient points (IDCT)
[Figure: graphics pipeline: Vertex Processors → Rasterizer → Fragment Processors → Blending Units]
• IDCT: $x = X(0,0) \times B_{0,0} + X(1,0) \times B_{1,0} + \dots + X(7,7) \times B_{7,7}$, where $B_{u,v}$ is the basis image of coefficient $(u,v)$
• Rasterizer: scalar-matrix multiplication → per-fragment operation
• Fragment processors: sample texels; dot product
• Blending units: set the blend function to Add
  • Accumulate the results from multiple points
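The single-pass IQ and IDCT rendering can be imitated on the CPU to see how the stages divide the work. This is a sketch under several assumptions: orthonormal DCT basis images, four coefficients per point in zigzag order, and hypothetical function names. It is not the authors' shader code.

```python
import math

N = 8


def zigzag_order(n=N):
    """(row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        order.extend((r, s - r) for r in (rows if s % 2 else reversed(rows)))
    return order


def basis_texel(u, v, i, j):
    """Value of basis image (u, v) at pixel (i, j); stands in for a texture fetch."""
    cu = math.sqrt((1 if u == 0 else 2) / N)
    cv = math.sqrt((1 if v == 0 else 2) / N)
    return (cu * math.cos((2 * i + 1) * u * math.pi / (2 * N))
            * cv * math.cos((2 * j + 1) * v * math.pi / (2 * N)))


def render_coefficient_points(points, quant_matrix, qp):
    """Simulate one block: vertex stage does IQ, fragment stage a dot product,
    and the blending unit adds each point's output into the target."""
    order = zigzag_order()
    target = [[0.0] * N for _ in range(N)]
    for slot, vector in points:
        positions = order[4 * slot:4 * slot + 4]
        # Vertex stage: inverse quantize the 4D coefficient vector.
        dequantized = [x * quant_matrix[u][v] * qp
                       for x, (u, v) in zip(vector, positions)]
        # Fragment stage: one dot product per covered pixel.
        for i in range(N):
            for j in range(N):
                texels = [basis_texel(u, v, i, j) for u, v in positions]
                target[i][j] += sum(c * t for c, t in zip(dequantized, texels))
    return target


# Usage: a single point holding only a DC coefficient, with a unit quant matrix.
unit_qm = [[1] * N for _ in range(N)]
flat = render_coefficient_points([(0, (8, 0, 0, 0))], unit_qm, qp=1)
```

Because blending is additive, the order in which points are drawn does not change the result beyond floating-point rounding.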

Macroblock points
[Figure: macroblock point sets, reference frames and the IDCT residual buffer feed Motion Compensation, which writes the prediction to the frame buffers]
• Arrange MB points into different sets
  • According to MB type (intra, forward, bidir…)
  • Convey MVs in the point's attributes
• Set the texture access mode
  • Bilinear filter for sub-pixel MVs
  • Clamp address mode for unrestricted MVs

Render MB points (MC)
[Figure: Vertex Processors → Rasterizer → Fragment Processors; the reference and the residual are summed]
• Vertex processors
  • Output position and size
  • Preprocess MVs:
    • Set proper decimal parts
    • Field prediction; field DCT
• Fragment processors:
  • Offset WPOS with MVs
  • Sample textures
  • Sum and saturate

Evaluation Results
• Our experimental environment
  • 2.8 GHz Pentium 4 with an NVIDIA GeForce 6800 GT
  • MPEG-2 decoder with OpenGL and Cg 1.4
• Five different implementations
  • CPU-only
  • CPU-noCSC
  • GPU-Texture
  • GPU-Vertex
  • GPU-Point
• Five test clips
  • lor: 480p, 4.6 Mbps
  • shuttle: 720p, 15.5 Mbps
  • australia: 1080i, 12.3 Mbps
  • 007: 1080p, 10.9 Mbps
  • crawford: 1080i, 30.0 Mbps

Performance
• Overall decoding frame rates
• Significantly outperforms the other implementations
[Figure: bar chart of frame rate (fps, axis 0 to 350) per clip (lor, shuttle, australia, 007, crawford) for CPU-Only, CPU-noCSC, GPU-Texture, GPU-Vertex and GPU-Point]

Performance
• Time costs of the decoding stages
• Statistics on the clip "australia" (1440x1080)
[Figure: bar chart of time cost (ms, axis 0 to 18) per stage (VLD & Others, IDCT, MC, CSC & Disp) for CPU-Only, CPU-noCSC, GPU-Texture, GPU-Vertex and GPU-Point]

Picture Quality
• Almost no quality degradation
  • MPEG test sequences (CIF), GOP = 15, 2.0 Mbps
  • No drift-error accumulation observed
• Slight degradation: different rounding control for sub-pixel interpolation (P and B frames)
Sequence    Average Y-PSNR (dB)   Degradation (dB): I     P       B
stefan      31.722                0.006                   0.008   0.021
mobilecal   31.134                0.003                   0.010   0.030
foreman     37.245                -0.011                  0.027   0.055
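For reference, Y-PSNR compares the luma plane of a decoded frame against a reference frame. The sketch below uses the standard 8-bit definition, PSNR = 10 log10(255² / MSE). How the slide's averages were computed is not stated, so the helper name and the per-frame usage are assumptions.

```python
import math


def y_psnr(reference, decoded, peak=255):
    """PSNR in dB between two luma planes given as lists of rows."""
    squared_errors = [(a - b) ** 2
                      for ref_row, dec_row in zip(reference, decoded)
                      for a, b in zip(ref_row, dec_row)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


# Usage: every pixel is off by one level, so the MSE is exactly 1.
ref_plane = [[100] * 4 for _ in range(4)]
dec_plane = [[101] * 4 for _ in range(4)]
psnr_db = y_psnr(ref_plane, dec_plane)
```

A degradation figure like those in the table would then be the difference between two such PSNR values measured against the same source.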

Discussion
• Strengths and advantages
  • Saves bandwidth and computation
  • Fully utilizes the graphics pipeline
  • Neat and flexible framework
• Weaknesses
  • Needs a high pixel fill rate for performance
  • Needs floating-point blending for precision
  • Point shape is constrained to a square
  • Non-bilinear interpolation benefits less
