Design Challenge of a QuadHDTV Video Decoder, Youn-Long Lin



SLIDE 1

Design Challenge of a QuadHDTV Video Decoder

Youn-Long Lin Department of Computer Science National Tsing Hua University

MPSOC2007, Japan

SLIDE 2

YLLIN NTHU-CS 2

More Pixels

SLIDE 3

NHK Proposes UHD TV Broadcast

  • Super HiVision: 7680x4320 pixels at 60 fps (16x HDTV)

  • Baseband signal is 24 Gbps. Using 16 MPEG-2 encoding chips, the signal was compressed to 250 Mbps for transmission.

  • HDTV signals at present are 1.5 Gbps for baseband and 20 Mbps for compressed signals.

  • High-performance compression/decompression and transmission/storage are needed for 24 Gbps baseband and ~300 Mbps compressed signals.
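
As a rough check of the figures on this slide, the sketch below recomputes the raw (baseband) bit rate and the two compression ratios. The 12 bits per pixel (8-bit samples with 4:2:0 chroma subsampling) is an assumption that the slide does not state, and the function name is illustrative.

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel=12):
    """Uncompressed video bit rate in bits per second.

    bits_per_pixel=12 assumes 8-bit 4:2:0 sampling (an assumption,
    not a value given on the slide).
    """
    return width * height * fps * bits_per_pixel


# Super HiVision: 7680x4320 at 60 fps
uhd_bps = raw_bitrate_bps(7680, 4320, 60)
print(f"UHD baseband: {uhd_bps / 1e9:.1f} Gbps")

# Compression ratios implied by the quoted baseband/compressed rates
uhd_ratio = 24_000_000_000 / 250_000_000   # 24 Gbps -> 250 Mbps
hd_ratio = 1_500_000_000 / 20_000_000      # 1.5 Gbps -> 20 Mbps
print(f"UHD ratio: {uhd_ratio:.0f}:1, HDTV ratio: {hd_ratio:.0f}:1")
```

Under that assumption the computed UHD rate comes out just under the 24 Gbps quoted on the slide.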

SLIDE 4

[Figure: frame-size comparison]

SDTV
1920x1080 – HDTV
3840x2160 – QFHD TV
7680x4320 – UHD TV

SLIDE 5

Video Coding Technology Trend

[Figure: trend chart; surviving labels: H.264, 50%, 69%]

SLIDE 6

Features of Video Coding Standards

Standard            H.264                                   MPEG-4          MPEG-2         MPEG-1
MB size             16*16                                   16*16           16*16 (frame)  16*16
Block size          16*16, 16*8, 8*16, 8*8, 8*4, 4*8, 4*4   16*16, 8*8      8*8            8*8
Transform           4*4 int transform                       DCT/Wavelet     DCT            DCT
Entropy coding      VLC, CAVLC and CABAC                    VLC             VLC            VLC
ME, MC              41 MVs per MB                           Yes             Yes            Yes
Pixel accuracy      ¼ pel                                   ¼ pel           ½ pel          ½ pel
Reference frames    Multiple (5) frames                     One frame       One frame      One frame
Picture type        I, P, B                                 I, P, B         I, P, B        I, P, B
Transmission rate   64kbps ~ 150Mbps                        64kbps ~ 2Mbps  2-15 Mbps      Up to 1.5 Mbps

SLIDE 7

Not all H.264/AVC systems are equal

Relative Computational Complexity

#Ref Frames    Search Range 32    Search Range 16    Search Range 8
1              8.87               2.54               1
5              55.7               24.6               16.9

Source: Video Coding with H.264/AVC: Tools, Performance and Complexity, J. Ostermann et al., IEEE CAS Mag., Q1 2004.

SLIDE 8

Quality vs Bit-rate vs Decoding Throughput

Decoding Capability of a 600MHz CPU

QP    Bit Rate (Kbps)    Fps
26    307                65
21    704                55
16    1723               44

Source: H.264/AVC Baseline Profile Decoder Complexity Analysis, M. Horowitz, IEEE T-CSVT, July 2003

SLIDE 9

Our Target

‧Single-Chip Decoder for QFHD (3840x2160) H.264/AVC High Profile Video

–CABAD
–8x8 Transform
–Commodity DDR External Memory
–Platform-Based Design

SLIDE 10

Performance

Resolution               Relative size    Clock Frequency    Application
QFHD (3840 x 2160)       675.0            249 MHz            Digital signage, Medical video, Satellite image, Space exploration
1080HD (1920 x 1088)     170.0            62 MHz
720HD (1280 x 720)       75.0             30 MHz             Home theater
D2 (720 x 480)           28.1             10 MHz             Car TV, Surveillance
CIF (352 x 288)          8.3              3 MHz              Mobile TV
QCIF (176 x 144)         2.0              0.8 MHz
SQCIF (128 x 96)         1.0              0.4 MHz            Video phone
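
The relative figures on this slide (675.0 down to 1.0) appear to be frame sizes normalized to SQCIF. The sketch below recomputes them from the pixel counts; the dictionary and function names are illustrative. The listed 75.0 for 720HD matches 1280 x 720 (921,600 pixels), which is the value used here.

```python
# Frame dimensions in pixels (width, height) for each format on the slide
RESOLUTIONS = {
    "QFHD": (3840, 2160),
    "1080HD": (1920, 1088),
    "720HD": (1280, 720),
    "D2": (720, 480),
    "CIF": (352, 288),
    "QCIF": (176, 144),
    "SQCIF": (128, 96),
}


def relative_size(name, base="SQCIF"):
    """Pixel count of `name` divided by the pixel count of `base`."""
    width, height = RESOLUTIONS[name]
    base_width, base_height = RESOLUTIONS[base]
    return (width * height) / (base_width * base_height)


for name in RESOLUTIONS:
    print(f"{name:7s} {relative_size(name):8.3f}")
```

The computed ratios agree with the slide to within about 0.1, and the required clock frequency scales roughly in proportion (249 MHz / 675 is about 0.37 MHz per SQCIF-sized unit).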

SLIDE 11

Essential Issues

‧Memory

–Tradeoff Between the Size of Internal Memory and Bandwidth of External Access

‧Massive Parallelism

‧Macroblock Decoding Scheduling

SLIDE 12

NTHU H.264 Decoder Architecture

[Figure: block diagram.
H.264 Video Decoder blocks: Parser, CAVLD/CABAD, IQ & IT, MVG, IPRED, INTERP, BSG, DF, MAU & AMBA Interface Translator
System-level labels: CPU, Display, Memory Controller, Ethernet, AHB
Signal labels: para & predinfo, recon, bs, residual, mv & ridx, coeff, mvdinfo]

SLIDE 13

Memory

SLIDE 14

size vs. b/w in ME

[Figure: memory bandwidth (MB/s) plotted against memory size (bytes) for Levels A to D]

Full HD 30fps, # of rf = 1, SRV = SRH = 64

Level A: 240 Bytes, 19658 MB/s
Level B: 1200 Bytes, 1516 MB/s
Level C: 4977 Bytes, 317 MB/s
Level D: 124,929 Bytes, 62 MB/s
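
To make the size/bandwidth tradeoff concrete, the sketch below picks the level with the smallest internal memory that still fits a given external-bandwidth budget, using the four data points from this slide. The function name and the example budgets are hypothetical.

```python
# (level, internal memory in bytes, external bandwidth in MB/s), from the slide
LEVELS = [
    ("A", 240, 19658),
    ("B", 1200, 1516),
    ("C", 4977, 317),
    ("D", 124929, 62),
]


def smallest_level_within(bandwidth_budget_mb_s):
    """Return the level needing the least memory whose bandwidth fits the budget.

    Returns None when no level meets the budget.
    """
    candidates = [lv for lv in LEVELS if lv[2] <= bandwidth_budget_mb_s]
    if not candidates:
        return None
    return min(candidates, key=lambda lv: lv[1])


print(smallest_level_within(400))    # hypothetical 400 MB/s budget
print(smallest_level_within(2000))   # hypothetical 2000 MB/s budget

# Moving from Level A to Level D trades memory for bandwidth
print(f"bandwidth reduced {19658 / 62:.0f}x, memory increased {124929 / 240:.0f}x")
```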

SLIDE 15

IME block diagram

[Figure: block diagram. Labels: CB mem, rf0 mem, rf1 mem, CMB reg (x4), CB AG, rf AG, rf reg array, rf router, comparator (x4), MVGen rf0 (x8), MV AG, MV mem]

SLIDE 16

size vs. b/w in ME

[Figure: the chart of memory bandwidth (MB/s) against memory size (bytes) for Levels A to D, repeated with one additional point whose label survives only as "urs"]

SLIDE 17

Reference-data Pre-fetch System

  • No redundant fetching

– Collect the motion vectors of several MBs and read each shared reference location in a single operation

  • Minimize the number of burst initiations

– On average, 2 burst initiations per MB (1 for luma, 1 for chroma); a burst is a group of sequential reads (burst read)
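
The "no redundant fetching" idea can be illustrated in software by merging the reference regions of several macroblocks before reading. The sketch below is not the hardware design: it assumes integer-pel motion vectors, luma only, and a 21x21 reference window per 16x16 MB (the extra 5 rows and columns are what a 6-tap interpolation filter would need). All names are illustrative.

```python
MB_SIZE = 16   # macroblock width/height in luma pixels
FILTER_PAD = 5 # extra rows/columns for 6-tap interpolation (assumed)


def ref_region(mb_x, mb_y, mv_x, mv_y):
    """Set of (x, y) reference-frame pixels one MB needs.

    mb_x/mb_y are MB indices; mv_x/mv_y are integer-pel motion vectors.
    """
    x0 = mb_x * MB_SIZE + mv_x - 2
    y0 = mb_y * MB_SIZE + mv_y - 2
    size = MB_SIZE + FILTER_PAD
    return {(x, y) for y in range(y0, y0 + size) for x in range(x0, x0 + size)}


def fetch_cost(mbs):
    """Return (pixels read MB by MB, pixels read after merging regions)."""
    regions = [ref_region(*mb) for mb in mbs]
    naive = sum(len(region) for region in regions)
    merged = len(set().union(*regions))
    return naive, merged


# Two horizontally adjacent MBs with zero motion share a 5-column strip
naive, merged = fetch_cost([(0, 0, 0, 0), (1, 0, 0, 0)])
print(naive, merged)
```

Collecting more MBs before fetching gives more overlap to eliminate, at the cost of buffering their motion vectors and region information.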

SLIDE 18

Reference-data Pre-fetch System (Cont)

[Figure: block diagram of the pre-fetch system.
Blocks: CABAC, Motion Vector Generator, Translator, Reference Region & Index Register, Region Analyzer / Searcher, OES manager, Buffer, Interp, MAU Interface
Queue entries: MB0 to MB10; regions R0 to R7
Data labels: R0/R1 Data, R2 Data from SDRAM, R2 Information, MB7 Information, MB7 MV, MB7 Region Information]

SLIDE 19

Massive Parallelism

SLIDE 20

RLD/IQ/IDCT Timing Diagram

[Figure: timing diagram over time t.
Rows: coeflag_mem read, coeff_mem read, IQ stage 1, IQ stage 2, IDCT stage 1, IDCT stage 2, residual_mem write
Blocks processed: dc, luma ac_0_1 through luma ac_14_15, chroma ac_0_1 through chroma ac_6_7
Surviving time marks: 122, 140, 144, 161, 195, 212, 219]

SLIDE 21

DF Timing Diagram

SLIDE 22

Dual Pipelined Edge Filter

[Figure: five-stage pipeline (Stage 1 to Stage 5) from Read Pixels to Write Pixels, operating on pixel blocks L00 to L33, M00 to M33 and R00 to R33.
Operations shown: Strong filter (Bs=4), Left delta calculation, Right delta calculation, Left Weak filter (Bs<4), Right Weak filter (Bs<4)]
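
For reference, the delta calculation and weak (Bs<4) filtering named in the diagram correspond to the normal-mode luma edge filter of H.264/AVC. The sketch below covers only the p0/q0 update. It assumes the clipping threshold tc has already been derived (the standard obtains it from Bs and the quantizer through lookup tables not reproduced here) and that the edge has already passed the alpha/beta activity checks. The function names are illustrative, and this is plain software, not the pipelined datapath shown on the slide.

```python
def clip3(low, high, value):
    """Clamp value into the inclusive range [low, high]."""
    return max(low, min(high, value))


def weak_filter_p0_q0(p1, p0, q0, q1, tc, bit_depth=8):
    """Filter the two pixels nearest a block edge (p0 | q0).

    p1, p0 lie on one side of the edge and q0, q1 on the other.
    tc is the precomputed clipping threshold (assumed given).
    Returns the filtered (p0, q0).
    """
    # Delta calculation: edge step weighted by 4, corrected by the outer pixels
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    max_value = (1 << bit_depth) - 1
    new_p0 = clip3(0, max_value, p0 + delta)
    new_q0 = clip3(0, max_value, q0 - delta)
    return new_p0, new_q0


# A step of 8 across the edge, with the correction limited by tc
print(weak_filter_p0_q0(60, 62, 70, 72, tc=2))
print(weak_filter_p0_q0(60, 62, 70, 72, tc=4))
```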

SLIDE 23

System-Level Optimization

Cyclic-Queue-Based IP Interface
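
The slides do not spell out the interface itself. As a rough software analogy, a cyclic queue between two decoder blocks can be sketched as a fixed-size ring buffer: the producing block stalls when the queue is full and the consuming block stalls when it is empty. The class and method names below are hypothetical.

```python
class CyclicQueue:
    """Fixed-capacity ring buffer linking a producer block to a consumer block."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0    # index of the next slot to read
        self._count = 0   # number of occupied slots

    def full(self):
        return self._count == len(self._slots)

    def empty(self):
        return self._count == 0

    def push(self, item):
        """Store item; return False if the producer has to stall."""
        if self.full():
            return False
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = item
        self._count += 1
        return True

    def pop(self):
        """Return the oldest item, or None if the consumer has to stall."""
        if self.empty():
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item


queue = CyclicQueue(capacity=2)
queue.push("MB0 residual")
queue.push("MB1 residual")
print(queue.push("MB2 residual"))  # queue is full, so the producer stalls
print(queue.pop())                 # the consumer takes the oldest entry
```

A deeper queue lets a fast block run further ahead of a slow one, which costs more SRAM.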

SLIDE 24

Sequential Decoder Timing Diagram (I Frame)

[Figure: timing diagram with one row per block (PARSER, CABAD, IQ/IT, IPRED, BSG, DF) against time. Events shown: Header information decode; Initial context table and condition offset; MB 0 decode, MB 1 decode, MB 2 decode]

SLIDE 25

Elastic Pipeline Decoder Timing Diagram (I Frame)

[Figure: timing diagram with one row per block (PARSER, CABAD, IQ/IT, IPRED, BSG, DF) against time. Events shown: Header information decode; Initial context table and condition offset; MB 0 decode through MB 6 decode]

SLIDE 26

ASAP Decode with Cyclic Queue Timing Diagram (I frame)

[Figure: timing diagram with one row per block (PARSER, CABAD, IQ/IT, IPRED, BSG, DF) against time. Events shown: Header information decode; Initial context table and condition offset; MB 0 decode through MB 8 decode]

SLIDE 27

Comparison of Different Scheduling Methods

Method               SRAM Usage (KB)    Turnaround Cycle (Cycles/MB)    Processing Cycle (Cycles/MB)
Sequential           2.62               486                             486
Elastic Pipeline     5.6                620                             161
ASAP Ping-Pong       5.6                644                             159
ASAP Cyclic-queue    8.3                540                             140

Test Pattern: “pedestrian” Resolution: 720*480 QP: 28 GOP: III… Frame #: 30
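
Reading the chart values as processing cycles per MB of 486, 161, 159 and 140 for the four methods (an interpretation of the extracted numbers, in the order the methods are listed), the throughput gain over sequential decoding can be computed as below. The variable names are illustrative.

```python
# Processing cycles per macroblock, as read from the chart on this slide
processing_cycles_per_mb = {
    "Sequential": 486,
    "Elastic Pipeline": 161,
    "ASAP Ping-Pong": 159,
    "ASAP Cyclic-queue": 140,
}

baseline = processing_cycles_per_mb["Sequential"]

# Speedup = sequential cycles divided by the method's cycles
speedup = {
    method: baseline / cycles
    for method, cycles in processing_cycles_per_mb.items()
}

for method, factor in speedup.items():
    print(f"{method:18s} {factor:.2f}x")
```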

SLIDE 28

Verification Environment

[Figure: directory tree of the verification environment.
Labels, first group: mfu, parser, cabad, idct, ipred, interp, df, bsg, main_ctrl, top, amba_wrap, mvg, def, rtl, syn, vn, nlint, gate_sim, rtl_sim, filelist, tbench, Sub IP, hd_amba
Labels, second group: H264, filelist, fpga_lib, gate_sim, asic_lib, syn, jm11.0, mem, netlist, rtl_sim, tbench, lm_wrap, nlint, vn
Memory libraries: xilinx_mem, altera_mem, artisan_mem]

Easy Bug Tracing

SLIDE 29

A Multimedia SOC Platform

[Figure: platform block diagram.
Labels: CPU, Accelerator (FPGA), USB (PHY), Daughter Board, ROM/Flash Memory, SRAM, SDRAM, VIC, USB 2.0, Static memory, SDRAM Controller (4-CH), JPEG Codec, DMA, SRAM, PWM, WDT, TIMER, APB Bridge, Capture, Display Controller, DAI, SSI, SD, SM, UART, GPIO, I2C, Audio Codec, I2S, Flash memory with SSI, Flash Card, Button, LED, Video-In CCIR601, TV/LCD
Legend: High-Speed Bus, Peripheral Bus, FPGA]

SLIDE 30

Summary

‧Super High Definition Video Capturing, Delivery and Display are on the Horizon
‧Massive Parallelism is Essential for Making Consumer Applications Possible
‧Tradeoff Among Memory Usage, Bandwidth and Logic Has Profound Impact on the Overall System Performance
‧System Design Should Be Adaptable to Content, Quality Variation