Efficient Video Decoding on GPUs

SLIDE 1

Peking University

Efficient Video Decoding on GPUs by Point Based Rendering

Bo Han, Bingfeng Zhou

SLIDE 2

Outline

  • Motivation and Goals
  • Previous work
  • Review of video decoding
  • Point based decoding framework
  • Results
  • Discussion

SLIDE 3

Motivation

  • Diverse video applications
  • Ranging from HDTV to mobile devices
  • Multiple video standards coexist
  • Main concern: video playback
  • Computation and bandwidth
  • A successful decoding system needs:
  • High performance and programming flexibility
  • CPU + additional hardware

SLIDE 4

Motivation

  • GPUs are powerful and flexible
  • Attractive coprocessors for GPGPU
  • Becoming ubiquitous
  • History of offloading video decoding tasks
  • Overlay surface for YUV to RGB
  • Dedicated hardware for DVD (DXVA)
  • Programmable Video Engine (PureVideo, AVIVO)
  • What’s next? (Shader based?)

SLIDE 5

Our Goals

  • Video decoding framework
  • Built on the graphics pipeline and shader programs
  • Hardware performance + software flexibility
  • Additional advantages
  • Independent of hardware and platform
  • Graphics API and shader languages
  • Save hardware resources
  • Remarkable growth rate, exceeding Moore’s law

SLIDE 6

Previous work

  • Video/image decoding process
  • Motion compensation on GPUs [Shen et al. 2005]
  • DCT/IDCT on GPUs [NVIDIA 2005] [Fang et al. 2005]
  • Fast interpolation for ME [Kelly et al. 2004]
  • H.263 decoder on GPUs [Hirvonen et al. 2005]
  • Limitations and weaknesses
  • Single quad-texture for the whole picture
  • Ignore the features of video data
  • Performance and flexibility not satisfactory

SLIDE 7

Our Contributions

  • Generic video decoding framework
  • Flexible point-based representation
  • Easily exploits the parallelism of the decoding process
  • Maps efficiently to graphics tasks
  • Both performance and flexibility

SLIDE 8

Outline

  • Motivation and Goals
  • Previous work
  • Review of video decoding
  • Point based decoding framework
  • Results
  • Discussion

SLIDE 9

Review of video decoding

  • DCT-MCP hybrid coding
  • DCT & Motion compensation and prediction
  • Block based structure
  • Block and macroblock (basic processing units)

[Figure: a 4:2:0 macroblock and its Y, U and V blocks]
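
As a rough illustration of the block structure, the sketch below enumerates the 8x8 blocks of one macroblock. It assumes the usual MPEG-2 4:2:0 layout (a 16x16 luma area split into four 8x8 Y blocks, plus one 8x8 U and one 8x8 V block at half resolution); the function name `macroblock_blocks` is hypothetical and not taken from the slides.

```python
def macroblock_blocks(mb_x, mb_y):
    """Return (plane, x, y) origins of the 8x8 blocks of macroblock (mb_x, mb_y).

    Assumes 4:2:0 sampling: each chroma plane is half the luma size in both
    directions, so a 16x16 luma macroblock maps to one 8x8 block per chroma plane.
    """
    blocks = []
    # Four 8x8 luma blocks tile the 16x16 luma area.
    for dy in (0, 8):
        for dx in (0, 8):
            blocks.append(("Y", mb_x * 16 + dx, mb_y * 16 + dy))
    # One 8x8 block in each chroma plane (coordinates in chroma samples).
    blocks.append(("U", mb_x * 8, mb_y * 8))
    blocks.append(("V", mb_x * 8, mb_y * 8))
    return blocks
```

Under this assumption every macroblock yields six coefficient blocks, which is the unit the IQ and IDCT stages iterate over.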

SLIDE 10

Review of video decoding

  • VLD is sequential bit-wise operation
  • Others show parallelism and streaming

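[Figure: block diagram of the decoder, split into a CPU part and a GPU part. The bitstream goes through Variable Length Decoding, which produces coefficient blocks and macroblocks. Coefficient blocks pass through Inverse Quantize and Inverse DCT to give the residual. Motion Compensation reads reference frames from the frame buffers to give the prediction. Residual + prediction form the reconstructed frame, which Color Space Conversion (YUV to RGB) sends to the display.]

For each coefficient block: perform IQ and IDCT
For each macroblock: perform MC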

SLIDE 11

Inverse Quantize (IQ)

  • Inverse Zigzag scan: reconstruct block
  • IQ:
  • Characteristics:
  • Sparse and Coefficient-level parallelism

[Figure: the zigzag-ordered coefficients [15, 6, 4, 1, -2, 1, -1, …] are scanned back into an 8 x 8 block, which is then multiplied element-wise by the quant matrix (entries shown: 8, 16, 16, 19, 16, 19, 22, 22, 22, 22, …) and by the quant parameter.]

$$X_{IQ}(u,v) = X_Q(u,v) \times QM(u,v) \times qp$$
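
A minimal sketch of the inverse zigzag scan followed by IQ is given below. It follows the simplified formula on the slide rather than the exact MPEG-2 arithmetic, and the helper names `zigzag_order` and `inverse_quantize` are illustrative assumptions.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):           # s = row + col indexes one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def inverse_quantize(coeffs, quant_matrix, qp):
    """Scan zigzag-ordered coeffs into an 8x8 block and apply X_Q * QM * qp."""
    block = [[0] * 8 for _ in range(8)]
    for value, (u, v) in zip(coeffs, zigzag_order()):
        if value != 0:                    # the block is sparse: skip zeros
            block[u][v] = value * quant_matrix[u][v] * qp
    return block
```

Each coefficient is scaled independently of the others, which is the coefficient-level parallelism the slide refers to.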

SLIDE 12

Inverse DCT

  • IDCT is typically computation intensive
  • Many fast algorithms, but not for GPU
  • Coefficient and its basis image
  • Parallel and stream processing

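$$x = T^T X T = \sum_{u} \sum_{v} X(u,v)\,[T(u)^T T(v)]$$

with u and v each ranging over the 8 rows and columns of the block.

$$x = X(0,0) \times B(0,0) + X(1,0) \times B(1,0) + \dots + X(7,7) \times B(7,7)$$

where B(u,v) = T(u)^T T(v) is the basis image of coefficient (u,v), drawn as a picture on the slide.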
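
The basis-image view of the IDCT can be sketched as follows. The code assumes an orthonormal 8x8 DCT-II matrix for T, which the slides do not spell out, and all function names are illustrative.

```python
import math

N = 8


def dct_matrix():
    """Orthonormal 8x8 DCT-II matrix T; row u holds the basis vector T(u)."""
    T = []
    for u in range(N):
        scale = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
        T.append([scale * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                  for i in range(N)])
    return T


def basis_image(T, u, v):
    """Outer product T(u)^T T(v): the 8x8 basis image of coefficient (u, v)."""
    return [[T[u][i] * T[v][j] for j in range(N)] for i in range(N)]


def idct_by_basis_images(X):
    """Accumulate X(u,v) times its basis image over all non-zero coefficients."""
    T = dct_matrix()
    x = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            if X[u][v] == 0:
                continue                  # sparse blocks skip most of the work
            B = basis_image(T, u, v)
            for i in range(N):
                for j in range(N):
                    x[i][j] += X[u][v] * B[i][j]
    return x
```

Because each coefficient contributes an independent weighted image, the work scales with the number of non-zero coefficients and the contributions can be accumulated in any order.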

SLIDE 13

Motion Compensation

  • Memory and Computation intensive
  • Block translation according to motion vectors
  • Per-pixel arithmetic operations
  • Fit well with texture sampling scheme

[Figure: Reconstructed = Prediction + Residual, shown for a frame sequence I, B, P, B with forward, backward and bidirectional prediction.]
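
The relation Reconstructed = Prediction + Residual can be sketched for integer motion vectors as below. The rounded average used for bidirectional prediction and the zero prediction for intra blocks are assumptions of this sketch, and the function names are hypothetical.

```python
def clamp8(value):
    """Saturate to the 8-bit sample range [0, 255]."""
    return max(0, min(255, value))


def fetch_block(ref, x, y, mv, size):
    """Copy a size x size block from ref, displaced by integer motion vector mv."""
    mvx, mvy = mv
    return [[ref[y + mvy + j][x + mvx + i] for i in range(size)]
            for j in range(size)]


def reconstruct_block(residual, forward=None, backward=None):
    """Reconstructed = Prediction + Residual, saturated to 8 bits.

    With both predictions (bidirectional) the prediction is their rounded
    average; with neither (intra) the prediction is zero.
    """
    out = []
    for j, res_row in enumerate(residual):
        row = []
        for i, res in enumerate(res_row):
            preds = [p[j][i] for p in (forward, backward) if p is not None]
            if len(preds) == 2:
                pred = (preds[0] + preds[1] + 1) // 2
            elif preds:
                pred = preds[0]
            else:
                pred = 0
            row.append(clamp8(pred + res))
        out.append(row)
    return out
```

Every output pixel depends only on its own prediction and residual samples, which is why the operation maps naturally onto per-pixel texture sampling.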

SLIDE 14

Outline

  • Motivation and Goals
  • Previous work
  • Review of video decoding
  • Point based decoding framework
  • Results
  • Discussion

SLIDE 15

Overview of our framework

  • Convey block-wise information with point’s attributes
  • Batch points into vertex arrays
  • Render points to active shader programs

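[Figure: framework overview, split into a CPU part and a GPU part. Variable Length Decoding reads the bitstream and produces coefficient point sets and macroblock point sets (with MV data). Inverse Quantize and Inverse DCT render the coefficient points with the basis images into the IDCT buffer (residual). Motion Compensation renders the macroblock points against the reference frames (prediction) into the frame buffers. Color Space Conversion sends the frames to the display.]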

SLIDE 16

Map Video blocks to Graphics points

  • Natural for vertex processing
  • Rasterized to fragment blocks (flexible size)
  • Fragment processing
  • Point sprite extension and WPOS semantics

                   Size      Attributes
Point primitive    Variable  position, normal, color, texcoords0-7…
Macroblock         16x16     position, motion vectors, MB type, DCT coding type
Coefficient block  8x8       position, quant parameter, sparse coefficients

SLIDE 17

Batch points to feed GPUs

  • Challenge
  • Various video block prediction or coding types
  • Irregular distribution and number of coefficients
  • GPUs need highly regular, well-batched input
  • Branching is expensive on GPUs
  • Solution
  • Divide and conquer
  • Use CPU to classify points into different sets
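
The CPU-side classification step might look like the sketch below. The `MacroblockPoint` fields and the type names are illustrative assumptions, not the authors' data layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MacroblockPoint:
    """One macroblock conveyed as a point; field names are illustrative."""
    x: int                     # top-left pixel position of the macroblock
    y: int
    mb_type: str               # e.g. "intra", "forward", "backward", "bidir"
    motion_vectors: tuple = ()


def classify_points(points):
    """Group points by macroblock type so each set can use one shader without branches."""
    sets = defaultdict(list)
    for p in points:
        sets[p.mb_type].append(p)
    return dict(sets)
```

Each resulting set could then be submitted as one vertex array with the shader for that type bound, so the GPU never has to branch on the block type.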

SLIDE 18

Coefficient Points

  • Apply a regular pattern to generate points
  • Solve irregular distribution of coefficients
  • Only convey non-zero 4D Vector and its index
  • Balance visual quality and computation complexity

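[Figure: the zigzag sequence [15, 6, 4, 1, -2, 1, -1, …] is cut into slots of four coefficients (Slot 0 to Slot 15). Slot 0 = (15, 6, 4, 1) becomes Point 0 and Slot 1 = (-2, 1, -1, 0) becomes Point 1. The coefficient point sets feed Inverse Quantize and Inverse DCT, which use the basis images and write to the IDCT buffer. The figure also carries the label "32 x 32".]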
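
A sketch of the regular slot pattern follows: a zigzag-ordered block is cut into 16 slots of four coefficients, and only slots with a non-zero entry become points. The dictionary keys and the function name `coefficient_points` are hypothetical.

```python
def coefficient_points(coeffs, block_x, block_y, qp):
    """Cut a zigzag-ordered block into 16 slots of 4 and keep the non-zero slots.

    Each returned dict stands for one point: block position, quant parameter,
    slot index and the slot's four coefficients.
    """
    padded = list(coeffs) + [0] * (64 - len(coeffs))
    points = []
    for slot in range(16):
        vec = tuple(padded[4 * slot: 4 * slot + 4])
        if any(vec):                      # all-zero slots produce no point
            points.append({"x": block_x, "y": block_y, "qp": qp,
                           "slot": slot, "coeffs": vec})
    return points
```

For the sequence in the figure this yields two points, one for Slot 0 and one for Slot 1, so the number of points tracks how many coefficients the block actually carries.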

SLIDE 19

Render coefficient points (IQ)

  • Single pass to perform both IQ and IDCT
  • Vertex processors:
  • Perform IQ
  • Quant matrix as uniform parameters
  • Quant parameter and slot index in point’s attributes
  • Locate coordinates of the basis image

$$X_{IQ}(u,v) = X_Q(u,v) \times QM(u,v) \times qp$$

[Figure: GPU pipeline stages: Vertex Processors, Rasterizer, Fragment Processors (dot product), Blending Units.]

SLIDE 20

Render coefficient points (IDCT)

  • IDCT:
  • Rasterizer: scalar-matrix per-fragment
  • Fragment processors: sample texels ; dot product
  • Blending units: set function to Add
  • Accumulate the results from multiple points

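$$x = X(0,0) \times B(0,0) + X(1,0) \times B(1,0) + \dots + X(7,7) \times B(7,7)$$

where B(u,v) is the basis image of coefficient (u,v).

[Figure: GPU pipeline stages: Vertex Processors, Rasterizer, Fragment Processors (dot product), Blending Units.]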
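
The fragment and blending steps can be emulated on the CPU as in the sketch below. It assumes the four coefficients of a slot arrive already dequantized, that the basis images come from an orthonormal DCT-II, and that a texel holds the four basis values of one slot; `sample_basis` and `render_point` are hypothetical names standing in for the texture fetch and the draw call.

```python
import math


def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def dct_row(u, n=8):
    """Row u of an orthonormal DCT-II matrix."""
    scale = math.sqrt((1.0 if u == 0 else 2.0) / n)
    return [scale * math.cos((2 * i + 1) * u * math.pi / (2 * n))
            for i in range(n)]


ZIGZAG = zigzag_order()
ROWS = [dct_row(u) for u in range(8)]


def sample_basis(slot, i, j):
    """Stand-in for a texture fetch: 4 basis-image values at pixel (i, j)."""
    return [ROWS[u][i] * ROWS[v][j] for (u, v) in ZIGZAG[4 * slot: 4 * slot + 4]]


def render_point(idct_buffer, slot, coeffs):
    """Cover the point's 8x8 block and blend each fragment into the buffer with Add."""
    for i in range(8):
        for j in range(8):
            texel = sample_basis(slot, i, j)
            fragment = sum(c * t for c, t in zip(coeffs, texel))  # 4D dot product
            idct_buffer[i][j] += fragment                         # additive blending
```

Rendering every point of a block into the same 8x8 region leaves the full inverse transform in the buffer, since additive blending sums the per-slot contributions.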

SLIDE 21

Macroblock points

  • Arrange MB-points into different sets
  • According to MB type (intra, forward, bidir, …)

  • Convey MVs in point’s attributes
  • Set texture access mode
  • Bilinear filter for sub-pixel MVs
  • Clamp address mode for unrestricted MVs

[Figure: Motion Compensation takes the macroblock point sets, the reference frames (prediction) and the IDCT buffer (residual), and writes to the frame buffers.]
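
The two texture access modes can be imitated on the CPU as in the sketch below: a fetch with clamp addressing and a bilinear filter at a fractional position. This is plain floating-point filtering and does not reproduce MPEG-2's integer rounding rules; the function names are illustrative.

```python
import math


def texel(frame, x, y):
    """Fetch with clamp addressing: out-of-frame coordinates use the nearest edge."""
    h, w = len(frame), len(frame[0])
    return frame[max(0, min(h - 1, y))][max(0, min(w - 1, x))]


def sample_bilinear(frame, x, y):
    """Bilinear sample at fractional position (x, y), measured in pixels."""
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * texel(frame, x0, y0) + fx * texel(frame, x0 + 1, y0)
    bottom = (1 - fx) * texel(frame, x0, y0 + 1) + fx * texel(frame, x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom
```

A half-pixel motion vector then amounts to sampling at an offset of 0.5, and a vector pointing outside the reference frame reads the replicated border pixels.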

SLIDE 22

Render MB points (MC)

  • Vertex processors
  • Output position and size
  • Preprocess MVs:
  • Set the proper fractional parts
  • Field prediction; field DCT
  • Fragment processors:
  • Offset WPOS with MVs
  • Sample textures
  • Sum and saturate

[Figure: Vertex Processors, Rasterizer, Fragment Processors; the fragment stage adds the reference sample and the residual.]

SLIDE 23

Outline

  • Motivation and Goals
  • Previous work
  • Review of video decoding
  • Point based decoding framework
  • Results
  • Discussion

SLIDE 24

Evaluation Results

  • Our experimental environment
  • 2.8 GHz Pentium 4 with an NVIDIA GeForce 6800 GT
  • MPEG-2 decoder with OpenGL and Cg 1.4
  • Five different implementations and test clips
  • CPU-only
  • CPU-noCSC
  • GPU-Texture
  • GPU-Vertex
  • GPU-Point

Clip       Format  Bit rate
lor        480p    4.6 Mbps
shuttle    720p    15.5 Mbps
australia  1080i   12.3 Mbps
007        1080p   10.9 Mbps
crawford   1080i   30.0 Mbps

SLIDE 25

Performance

  • Overall decoding frame rates
  • The point-based decoder significantly outperforms the other implementations

[Figure: bar chart of overall frame rate (fps, axis ticks from 50 to 350) for the clips lor, shuttle, australia, 007 and crawford, comparing CPU-Only, CPU-noCSC, GPU-Texture, GPU-Vertex and GPU-Point.]

SLIDE 26

Performance

  • Time costs of decoding stages
  • Statistics for the clip “australia” (1440x1080)

[Figure: bar chart of time cost (ms, axis ticks from 2 to 18) per decoding stage (VLD & Others, IDCT, MC, CSC & Disp), comparing CPU-Only, CPU-noCSC, GPU-Texture, GPU-Vertex and GPU-Point.]

SLIDE 27

Picture Quality

  • Almost no quality degradation
  • MPEG test sequences (CIF), GOP = 15, 2.0 Mbps
  • No drift-error accumulation observed
  • Slight degradation: different rounding control for sub-pixel interpolation (P and B frames)

                              Y-PSNR degradation (dB)
Sequence   Average PSNR (dB)  I       P      B
stefan     31.722             0.006   0.008  0.021
mobilecal  31.134             0.003   0.010  0.030
foreman    37.245             -0.011  0.027  0.055
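
For reference, PSNR between two luma planes is conventionally computed as in the sketch below. The slides do not show how the figures in the table were measured, so the function `psnr` and its 8-bit peak value are assumptions.

```python
import math


def psnr(reference, decoded, peak=255):
    """PSNR in dB between two equally sized 2D sample arrays."""
    total, count = 0, 0
    for ref_row, dec_row in zip(reference, decoded):
        for r, d in zip(ref_row, dec_row):
            total += (r - d) ** 2
            count += 1
    if total == 0:
        return math.inf                   # identical images
    mse = total / count
    return 10 * math.log10(peak * peak / mse)
```

A degradation figure would then be the difference between the PSNR of a reference decoder's output and the PSNR of the GPU decoder's output, both measured against the same source frame.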

SLIDE 28

Discussion

  • Strength and advantages
  • Save bandwidth and computation
  • Fully utilize the graphics pipeline
  • Neat and flexible framework
  • Weaknesses
  • Needs a high pixel fill rate for performance
  • Needs floating-point blending for precision
  • Point shape is constrained to a square
  • Non-bilinear interpolation benefits less

SLIDE 29

Conclusion

  • An efficient decoding framework on GPU
  • Analyze parallelism and features of decoding
  • Flexible point-based representation for video block
  • Efficient IQ, IDCT and MC by rendering points
  • Results demonstrate efficiency and flexibility
  • Future work
  • Apply to more standards, even HDR video
  • Video encoding and transcoding

SLIDE 30

Questions

  • Thanks for your attention… Questions?

  • Contact:
  • hanbo@icst.pku.edu.cn
  • zbf@pku.edu.cn