Research on Ultra Low Power SoC for Media Processing

Satoshi Goto, Waseda University. ISLPED2011, August 1, 2011.


slide-1
SLIDE 1

Research on Ultra Low Power SoC for Media Processing

August 1, 2011
Satoshi Goto
Waseda University

ISLPED2011

slide-2
SLIDE 2

Multimedia Applications in Ambient Society

Keywords: anytime, anywhere, safe, comfortable

Application areas: TV conference, Surveillance/medical care, Automotive, Mobile/Portable, Home entertainment

Core technologies: Video compression + Network, Security, Recognition

Requirements listed for the application areas:

  • Security/privacy, recognition, record
  • Scene recognition / Monitoring / Control
  • Ultra-low power, wireless
  • Bi-directions
  • Small frame delay
  • Information (figure, text) display
  • Noise and vibration
  • Wide range capture (fish eye camera)
  • HDTV or Super HDTV (High quality)
  • Recording (High compression)
  • A battery drive (low power)
  • Wireless (error free)
  • Various contents (sports, news, movies, animation, etc.)
  • New function (fast forwarding, etc.)

slide-3
SLIDE 3

Media Processing SoC and Research Area

Input and output: Media / Environmental Information

Blocks on the signal path:

  • Sensor
  • RF
  • Pre/Post processing
  • Recognition/Synthesis
  • Media encoding/decoding
  • Encryption
  • Error correction
  • Network protocol
  • RF (wire/wireless)
  • Display/Actuator

Digital signal processing topics:

  • interpolation, digital filter, coordinate translation, CG
  • object tracking / recognition, character recognition, image analysis
  • H.264, JPEG2000, non-standard codec
  • public key, private key, stream cipher
  • LDPC, turbo, RS
  • UWB, 11a/b/g/n, ad hoc, 10G baseT, 4G

Challenge: How do we realize an ultra-low-power chip?

slide-4
SLIDE 4

Opportunities for power reduction

Abstraction levels: System Level, Algorithm Level, Register Transfer Level, Gate Level, Transistor Level, Silicon Level

Values as listed, from the highest abstraction level downward:

Precision Error:  >50%, 25-50%, 15-40%, 10-20%, 5-10%
Power Reduction:  >70%, 50-70%, 15-50%, 5-15%, 3-5%

Possible power reduction and observation accuracy of power consumption achieved at each abstraction level. From "Power Consumption Management in ASIC/IC" (by Synopsys).

Optimizations at System, Algorithm and Register-transfer levels are important for Ultra-low power SoC

slide-5
SLIDE 5

  • 1. What is a low power design?
  • 2. Power reduction at System level
  • 3. Power reduction at Algorithm level
  • 4. Power reduction at RTL level
  • 5. Power reduction examples by Chip
  • 6. Product oriented activity
  • 7. Conclusions
slide-6
SLIDE 6

Power Consumption

$P = \alpha C V_{DD}^{2} f + I_{leak} V_{DD}$

The first term is the dynamic power and the second term is the static power.

  • $\alpha$ : Operational rate
  • $C$ : Capacitance
  • $f$ : Clock frequency, $f \propto (V_{DD} - V_{TH})$
  • $V_{DD}$ : Power voltage
  • $V_{TH}$ : Threshold voltage
  • $I_{leak}$ : Leak current
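The power equation on this slide can be evaluated directly. The sketch below is only an illustration: the function name `total_power` and every operating-point value are assumptions chosen for the example, not figures from the talk.

```python
def total_power(alpha, c, vdd, f, i_leak):
    """Evaluate P = alpha*C*VDD^2*f + Ileak*VDD.

    Returns (dynamic, static, total) in watts.
    """
    dynamic = alpha * c * vdd ** 2 * f   # switching (dynamic) power
    static = i_leak * vdd                # leakage (static) power
    return dynamic, static, dynamic + static


# Hypothetical operating point, for illustration only.
dynamic, static, total = total_power(alpha=0.1, c=1e-9, vdd=1.0, f=200e6, i_leak=5e-3)
print(f"dynamic={dynamic:.4f} W, static={static:.4f} W, total={total:.4f} W")

# Lowering VDD from 1.0 V to 0.8 V at the same frequency scales the
# dynamic term by 0.8^2 = 0.64 and the static term by 0.8.
low_dynamic, low_static, _ = total_power(alpha=0.1, c=1e-9, vdd=0.8, f=200e6, i_leak=5e-3)
print(f"dynamic ratio={low_dynamic / dynamic:.2f}, static ratio={low_static / static:.2f}")
```

The quadratic dependence on the power voltage is why lowering it appears first among the reduction measures on the following slides.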

slide-7
SLIDE 7

Reducing Power Consumption

$P = \sum \alpha C V_{DD}^{2} f + \sum I_{leak} V_{DD}$

Reducing the static power term:

  • Lower power voltage
  • Less leakage current
  • Turn off power

Techniques: fine process, device and low voltage; power gating

slide-8
SLIDE 8

Reducing Power Consumption

$P = \sum \alpha C V_{DD}^{2} f + \sum I_{leak} V_{DD}$

Reducing the dynamic power term:

  • (1) Lower power voltage
  • (2) Reduced operational rate
  • (3) Lower clock frequency
  • (4) Turn off clock

Techniques:

  ・Parallel process, pipeline
  ・Reduce the number of logic circuits
  ・Reduce the number of computations
  ・Clock gating
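One common reading of why parallel processing appears among the dynamic-power techniques is that two units at half the clock frequency keep the throughput while leaving timing slack for a lower power voltage. The sketch below works through that arithmetic. All numbers are hypothetical, and the assumption that halving the frequency permits 0.9 V instead of 1.2 V is made up for the example.

```python
def dynamic_power(alpha, c, vdd, f):
    """Dynamic term of the power equation: alpha * C * VDD^2 * f."""
    return alpha * c * vdd ** 2 * f


# Hypothetical baseline: one unit at 400 MHz and 1.2 V.
baseline = dynamic_power(alpha=0.2, c=2e-9, vdd=1.2, f=400e6)

# Hypothetical parallel version: two units (capacitance doubled), each at
# 200 MHz, which is ASSUMED here to allow a 0.9 V power voltage.
parallel = dynamic_power(alpha=0.2, c=2 * 2e-9, vdd=0.9, f=200e6)

ratio = parallel / baseline
print(f"parallel / baseline dynamic power = {ratio:.4f}")
```

With these numbers the capacitance and frequency changes cancel, so the ratio reduces to (0.9 / 1.2)^2.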

slide-9
SLIDE 9

Low Power Design

Application algorithms: Signal Process, Recognition, Encode/Decode, Cypher (Encryption), ECC, NW Protocol

  • Low Power Algorithm
  • Implementation Platform: Hetero Multi-core Execution Control (Task1, Task2, Task3, Task4), Programmable HW, Low Power IP Core, ASIP, Multi Processor
  • Low Power Design basic Technologies: Clock GT, Power GT, Floor Plan, High Level Synth, RTL

slide-10
SLIDE 10

Low Power Design Goal

Our Goal: 1/100 = 1/5 x 1/10 x 1/2

Application algorithms: Signal Process, Recognition, Encode/Decode, Cypher (Encryption), ECC, NW Protocol

  • Low Power Algorithm: 1/5
  • Implementation platform (Hetero Multi-core Execution Control with Task1 to Task4, Programmable HW, Low Power IP Core, ASIP, Multi Processor): 1/10
  • Low Power Design basic Technologies (Clock GT, Power GT, Floor Plan, High Level Synth): 1/2

slide-11
SLIDE 11

  • 1. What is a low power design?
  • 2. Power reduction at System level
  • 3. Power reduction at Algorithm level
  • 4. Power reduction at RTL level
  • 5. Power reduction examples by Chip
  • 6. Product oriented activity
  • 7. Conclusions
slide-12
SLIDE 12

Power Reduction at System Level

Classification and Integration

Classification
– Classify media data into "Important part" and "Non-Important part"

Integration
– Error-Correcting Coding & Video Processing
– Encryption & Video Processing

→ Reduce complexity by 25% to 75%

slide-13
SLIDE 13

Power Reduction at System Level - Classification

General Media Information
  • Text Information (1000 chars, 16 Kbits)
  • Image Information (10 still pictures, 240 Mbits)

Video Compression (ex. H.264)
  • Header, quantization matrix, etc.
  • DCT coeff.
  • Motion vector
  • Others

Important Information
  • Text: data leaked → meaning leaked; error occurred → information lost
  • Image: data leaked → image information leaked; error occurred → whole image lost
  • Encryption: high cipher strength (2000-bit RSA, AES, etc.)
  • Encoding: high error-correcting capability (10000-bit LDPC code, etc.)

Non-Important Information
  • Text: part of data leaked ≠ whole information leaked; partial error occurred ≠ whole information lost
  • Image: part of data leaked ≠ whole image information leaked; partial error occurred ≠ whole image information lost
  • Encryption: decrease cipher strength depending on the importance
  • Encoding: decrease error-correcting capability depending on the importance

slide-14
SLIDE 14

Power Reduction at System Level
- Integration of Video Encoding and Error-Correcting Coding -

  • Experiment for the Integration of Image Processing and Error-Correcting Coding

Classification of H.264 Video Data    Important    Non-important
Coding Ratio                          Low          High
#Repetitions of LDPC Code             Large        Small
LDPC Code Length                      Long         Short
Computing time                        Large        Small

slide-15
SLIDE 15

Classified Data (bit)

Sequence          Partition A (Important)    Partition B (Unimportant)
foreman_qcif      21659                      34657
football_qcif     22312                      77581
salesman_qcif     13091                      33167
container_qcif    6698                       25036
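The table gives the bit counts per partition. As a quick illustration of how much of each stream would receive the stronger protection, the sketch below computes the share of Partition A bits per sequence. The helper name `important_share` is an assumption made for this example; the bit counts are the ones from the table.

```python
# Bit counts from the table: (Partition A important, Partition B unimportant).
classified_bits = {
    "foreman_qcif": (21659, 34657),
    "football_qcif": (22312, 77581),
    "salesman_qcif": (13091, 33167),
    "container_qcif": (6698, 25036),
}


def important_share(partition_a, partition_b):
    """Fraction of all classified bits that fall in the important partition."""
    return partition_a / (partition_a + partition_b)


for name, (a, b) in classified_bits.items():
    print(f"{name}: {important_share(a, b):.1%} important")
```

In all four sequences the important partition is the smaller one, so the costly long, low-rate LDPC code applies to a minority of the bits.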

slide-16
SLIDE 16

Power Reduction at System Level
- Integration of Video Encoding and Error-Correcting Coding -

Experiment for Unequal ECC in Video Encoder

[Figure: plots for foreman and container comparing NOUEP, UEP1, UEP2 and UEP3 (proposed)]

Power reduction, Integrated vs. Independent:

Foreman    Football    Container
25.5%      25.4%       56.6%

slide-17
SLIDE 17

Results for UEEC

SASIMI 2009, ITC-CSCC 2010

                        EQ      UEEC
Video Quality (PSNR)    52.5    38.0
Computation time        3   :   1

slide-18
SLIDE 18

  • 1. What is a low power design?
  • 2. Power reduction at System level
  • 3. Power reduction at Algorithm level
  • 4. Power reduction at RTL level
  • 5. Power reduction examples by Chip
  • 6. Product oriented activity
  • 7. Conclusions
slide-19
SLIDE 19

Power Reduction at Algorithm Level

Classification

Region of Interest (ROI) based complexity reduction is introduced
Classification by inside ROI or outside ROI

►video conference systems, etc

slide-20
SLIDE 20

ROI based complexity reduction

  • Region-of-interest (ROI): the video content that should be encoded with higher priority
      – Attracts human attention
      – Important for overall performance
  • ROI based video coding: flexibly assign encoding resources
      – Resources: bit budget, encoding power consumption
      – Advantages: higher subjective quality, higher overall performance

[Figure: normal encoding vs. ROI encoding (ROI ≡ human face)]

slide-21
SLIDE 21

ROI Detection Algorithm

[Figure: estimation step, result w/o continuity check, and final results (ROI ≡ human face)]

slide-22
SLIDE 22

ROI based Complexity Reduction Scheme

Experimental results: encoding time is reduced by 77%

IEICE Trans. on Electronics, Apr. 2011.

BDPSNR: Average PSNR difference in dB over the whole range of bit-rates.
BDBR: Average bit-rate difference in % over the whole range of PSNR.

slide-23
SLIDE 23

Power Reduction at Algorithm Level

Computation Reduction Algorithm at Video Encoding

Difference Detection Algorithm is introduced
Particularly effective for Surveillance System

slide-24
SLIDE 24

Difference Detection

  • Detect the macroblock with difference (caused by motion, lighting and so on) to start encoding
      – Regardless of foreground or background
      – Regardless of human or not
  • Let part of the encoder circuit be on standby when the content is not changed

[Figure: frames n and n+1, marking the area with difference and the area w/o difference]

slide-25
SLIDE 25

System Architecture

  • The proposed method is applied prior to mode decision and motion estimation
  • The content similarity of the input macroblock is analyzed by comparing with the collocated MB in the previous frame using
      – Norm distance of chrominance components
      – Coarse motion information

[Figure: block diagrams of the original encoder (Early Skip Mode Decision) and the proposed encoder (Similarity Detector). Blocks: Video Frame, MB, Mode Decision, Coded MBs, Skip-mode MBs, Bit Stream Writer, Compressed Bit Stream]
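The similarity check can be sketched as follows. This is not the encoder's actual detector: it covers only the chrominance norm distance and leaves out the coarse motion information. The L1 norm, the threshold of 64, the flat-list block layout and all function names are assumptions made for the example.

```python
def chroma_distance(block_cur, block_prev):
    """L1 norm distance between two chroma blocks given as flat sample lists.
    (The choice of the L1 norm is an assumption for this sketch.)"""
    return sum(abs(a - b) for a, b in zip(block_cur, block_prev))


def is_unchanged(cur_cb, cur_cr, prev_cb, prev_cr, threshold=64):
    """True when the macroblock is similar enough to the collocated one.
    The threshold value is hypothetical."""
    distance = chroma_distance(cur_cb, prev_cb) + chroma_distance(cur_cr, prev_cr)
    return distance <= threshold


def select_macroblocks(cur_frame, prev_frame, threshold=64):
    """Return indices of macroblocks that need full encoding.

    Each frame is a list of (cb, cr) tuples, one per macroblock.
    """
    changed = []
    for index, ((ccb, ccr), (pcb, pcr)) in enumerate(zip(cur_frame, prev_frame)):
        if not is_unchanged(ccb, ccr, pcb, pcr, threshold):
            changed.append(index)
    return changed


# Three macroblocks with 8x8 chroma blocks (64 samples per component).
flat = [128] * 64
prev_frame = [(flat, flat), (flat, flat), (flat, flat)]
cur_frame = [
    (flat, flat),                  # identical content
    ([130] + [128] * 63, flat),    # small noise, below the threshold
    ([140] * 64, flat),            # clear change
]
print(select_macroblocks(cur_frame, prev_frame))
```

Only the macroblocks returned by `select_macroblocks` would go on to mode decision and motion estimation; the rest could be written as skipped.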

slide-26
SLIDE 26

Low Power Video Coding System on Multicore Processor

[Figure: system diagram. The input video is divided into slices (Slice 0 onward) that are processed on Core 0 to Core N-1. Each core runs an H.264 encoder with Difference Detection, Mode Decision and a Bit Stream Writer. Workload Estimation and Frequency Mapping, under the performance constraint fs, drive the DVFS CPU. The per-core outputs are joined into the compressed bit stream. Signals labeled on the diagram: fs, fn, Ns, Ne]

slide-27
SLIDE 27

Workload Estimation

  • Hilbert transform based workload estimation
  • Evaluations on generally used workload estimation algorithms by the Performance Constraint Deviation (PCD)

[Equation: PCD D, defined over fS, N, Ne and k]

      – For surveillance video encoding applications, fS generally equals 30
  • Even if the encoding energy is reduced, the encoding speed must stay >= 30 fps to maintain QoS

slide-28
SLIDE 28

Multicore processor based power reduction

  • Hardware platforms
      – Renesas RP1 evaluation board with 2 SuperH SH-4A processors
  • 4 cores, frequency adjustable {600,300,150,75} MHz
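With a frequency set of {600, 300, 150, 75} MHz, dynamic frequency selection under a frame-rate constraint can be sketched as picking the lowest frequency that still finishes a frame in time. The sketch below is only an illustration of that idea. The cycle counts are hypothetical, and the function `pick_frequency` is an assumption rather than the method used on the RP1 board.

```python
FREQS_MHZ = (75, 150, 300, 600)  # frequency steps listed for the platform


def pick_frequency(cycles_per_frame, fps=30, freqs_mhz=FREQS_MHZ):
    """Return the lowest frequency (MHz) that sustains `fps` frames per second
    for an estimated workload of `cycles_per_frame` cycles.
    Falls back to the highest frequency when none is fast enough."""
    required_hz = cycles_per_frame * fps
    for freq in sorted(freqs_mhz):
        if freq * 1e6 >= required_hz:
            return freq
    return max(freqs_mhz)


# Hypothetical workload estimates (cycles per frame) for three frames.
for cycles in (2_000_000, 4_000_000, 30_000_000):
    print(cycles, "cycles ->", pick_frequency(cycles), "MHz")
```

A mostly static surveillance scene, where difference detection skips most macroblocks, would yield low cycle estimates and therefore low frequencies.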

slide-29
SLIDE 29

Multicore processor based power reduction

System for experiment and Results (ICIP2010 & ICME2010)

Real Demo is shown at ULP Exhibition

  ・Dynamic processor frequency assignment (DFS): dynamically select the frequency from 600, 300, 150, 75 MHz
  ・Load balancing using multi-core processor: parallel execution on 4 cores by dividing the screen into 4 parts

Video data: Street (QCIF)

Coding Schemes             Normal    Proposed    Reduction
CPU Frequency (MHz)        600       300         50%
Total Coding Time (s)      3775.2    68.4        98.2%
Power Consumption (W)      2.8       2.2         23.4%
Energy Consumption (kJ)    10.63     0.15        98.6%

slide-30
SLIDE 30

  • 1. What is a low power design?
  • 2. Power reduction at System level
  • 3. Power reduction at Algorithm level
  • 4. Power reduction at RTL level
  • 5. Power reduction examples by Chip
  • 6. Product oriented activity
  • 7. Conclusions
slide-31
SLIDE 31

Power Reduction at Register Transfer Level

Power reduction by switching off unused circuitry
►Fine Grain Power Gating (PG) Using Controlling Value

Power reduction by temporarily disabling clock signals
►Optimum Sharing of Clock Gating (CG) Control Logic

Power reduction using Network on Chip
►Synthesis for Low Power Application-Specific Network-on-Chip

Power reduction optimizing trade-off between power and performance
►Dynamic frequency control under performance constraint

slide-32
SLIDE 32

Fine Grain Power Gating (PG) Using Controlling Value
- switching off unused circuitry -

IEICE Trans. on Fundamentals, Dec. 2009.

Ultra-fine grain PG using controlling value is proposed
  • Controlling value of an input of a gate can power-off other blocks
  • 20% power reduction can be found for ISCAS 85 benchmarks

Pseudo Power Gating
  • Control input is manipulated as another input of each controlled gate
  • Commercial synthesis & layout tools used and 10% reduction was found

Multi-stage Power Gating
  • Fine grain PG is applied inside power gating blocks: 10% extra reduction

[Figure: original circuit (glitch 101 exists), power gating using controlling value (controlling value can cut off the power of other blocks), pseudo power gating (glitch reduction), multi-stage power gating]
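The notion of a controlling value can be illustrated in a few lines: when one input of a gate carries its controlling value (0 for AND/NAND, 1 for OR/NOR), the output no longer depends on the other inputs, so the logic driving them is a candidate for being powered off. The sketch below shows only this logic-level observation. The function name and list-based interface are assumptions, and it does not model the gating circuitry proposed in the paper.

```python
# Controlling values of common gates (standard logic facts).
CONTROLLING_VALUE = {"AND": 0, "NAND": 0, "OR": 1, "NOR": 1}


def gated_inputs(gate_type, inputs):
    """Return indices of inputs whose driving logic could be switched off.

    If some input carries the controlling value, every other input is
    irrelevant to the output; the first controlling input is kept alive.
    Returns an empty list when no input is controlling.
    """
    controlling = CONTROLLING_VALUE[gate_type]
    holders = [i for i, value in enumerate(inputs) if value == controlling]
    if not holders:
        return []
    keep = holders[0]
    return [i for i in range(len(inputs)) if i != keep]


print(gated_inputs("AND", [0, 1, 1]))  # input 0 is controlling
print(gated_inputs("AND", [1, 1]))     # no controlling input
print(gated_inputs("OR", [0, 1]))      # input 1 is controlling
```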

slide-33
SLIDE 33

Optimum Sharing of Clock Gating (CG) Control Logic
- temporarily disabling clock signals -

IEICE Trans. on Fundamentals, Dec. 2010

Optimum CG Sharing in Commercial tools
  • BDD based method is applied for gate level circuits generated with logic synthesis
  • LDPC decoder gains 9% reduction compared with structural method, 38% for serial parallel interface with our single stage optimization

Multi-stage Clock Gating Synthesis
  • Gated clock is applied to other CG logic
  • About 9% reduction w.r.t. single stage method for binary counters and interface circuits

[Figure: multi-stage CG signal extraction]

slide-34
SLIDE 34

Fixed-outline Floorplanning

Greatly improves the success rate, wirelength and runtime compared with a floorplanner from Michigan Univ. and a floorplanner from NTU, respectively:

  • Wirelength: 12% and 7% reduction
  • Runtime: 2.7 to 9.1X speedup

[Figure: wirelength and time plotted against number of blocks and aspect ratio]

slide-35
SLIDE 35

Multilevel Framework based Floorplanning

IEICE Trans. on Fundamentals, April 2009

Compared with the flat framework:
  • 9.4% reduction of wirelength
  • 15.6% reduction of time

                    Floorplan (Flat)         Floorplan (multilevel)
Circuit   Blocks    wire         time(sec)   wire         improve    time(sec)   improve
n100      100       207,989      3.2         203,624      -2.1%      2.9         -10.0%
n200      200       371,345      10.4        365,577      -1.6%      8.2         -21.4%
n300      300       521,365      18.4        491,517      -5.7%      16.0        -13.0%
Prmy1     752       94,264       59.5        83,974       -10.9%     45.5        -23.5%
CB37      1810      4,788,047    115.6       3,770,716    -21.2%     97.2        -15.9%
Prmy2     2907      473,282      202.5       403,707      -14.7%     182.4       -9.9%
Average                                                   -9.4%                  -15.6%

slide-36
SLIDE 36

Floorplanning for a real design: LDPC Decoder

Compared with results obtained by commercial tools only, the reductions are area: 25%, wire: 10%, wire-delay: 8%.

Circuit: Nets: 49497, Cells: 44531
Partitioning: 300 blocks + 24 macro blocks
Floorplan: Determine the position of macro blocks

Synopsys Design Flow

  • Placement
  • Clock Optimization
  • Routing
  • Post Route

Results of DC
  ・TSMC 0.18u CMOS
  ・418Mbps@200MHz
  ・Memory: 24 memory blocks (Area: 1,695,501)
  ・Total Area: 8,012,999
  ・Power: 712.38mW

               Design (without FP)    Proposed Design    Ratio
Area           16,319,256             11,923,480         -25%
Delay          6.208                  5.713              -8%
Wire Length    18,651,412             16,842,454         -10%

2011-8-4, Yoshimura Lab

slide-37
SLIDE 37

  • 1. What is a low power design?
  • 2. Power reduction at System level
  • 3. Power reduction at Algorithm level
  • 4. Power reduction at RTL level
  • 5. Power reduction examples by Chip
  • 6. Product oriented activity
  • 7. Conclusions
slide-38
SLIDE 38

Power Reduction at RTL/Architecture Level

  Video Encoder Chip
  Video Decoder Chip
  AES Encryption Chip
  LDPC Chip

slide-39
SLIDE 39

Power Reduction by Implementation

Low-power H.264 encoder LSI: 50% power reduction
  1080P H.264 Encoder LSI (ISSCC2007, IEEE JSSC2009)

Power reduction in AES encryption: 50% power reduction
  2.4Gbps AES Encryption (ICSEC2008)
  Tamper-resistant AES: 3rd place in ISLPED2010 Design contest

Low-power baseband LSI: 30% power reduction
  189mW@820Mb/s OFDM/UWB Baseband (A-SSCC2009)

slide-40
SLIDE 40

Power Reduction by Implementation

Low-power multi-format decoder LSI: 37% power reduction
  1080P H.264/MPEG/AVS decoder LSI (VLSI Symp2009)

Low-power ultra-HDTV (4kx2k) Decoder LSI: 59% power reduction
  A 530Mpixels/s 4096x2160@60fps H.264/AVC High Profile decoder LSI (VLSI Symp2010)
  3rd place in ISLPED2010 Design contest
  Best Student paper Award at VLSI Symp.2011

Low-power LDPC decoder LSI: 90% power reduction
  Ultra Low Power QC-LDPC Decoder with High Parallelism (IEEE SOCC2011)

[Figure: chip micrograph with blocks labeled DDR PHY Data 32b, DDR CMD, DLL, PLL]

slide-41
SLIDE 41

A 530Mpixels/s 4096x2160@60fps H.264/AVC High Profile Video Decoder Chip

slide-42
SLIDE 42

DRAM Bandwidth Issue

  • Performance Bottleneck
      – >8 GB/s average BW for 4Kx2K@60fps
      – Can't be handled even with the fastest DDR2 DRAM
  • Fabrication Cost
      – DRAM pins
      – Wafer/package cost
  • Power Consumption
      – DRAM power is a multiple of the core power

[Figure: Decoder Core connected to DRAM Chips, with the DRAM bandwidth between them]

slide-43
SLIDE 43

DRAM Bandwidth Optimization

  • Partial MB Reordering
      – Frame Rd. ↓, Line Buf. Wr./Rd. ↓
  • VCR Lossless Frame Recompression
      – Frame Rd. ↓, Frame Wr. ↓

Typical BW Portion w/ Caching:

Frame Rd. (w/Caching)    46%
Frame Wr.                32%
Line Buf. Wr./Rd.        16%
Rest                     6%

slide-44
SLIDE 44

Bandwidth Summary

Bandwidth in M Words (128-bit); each total also includes the unlabeled "Rest" share.

Scheme             Frame Wr.   Frame Rd.   Line Buf. W/R   Length W/R   Total    Reduction   DRAM power
VLSI'09 [1]        46.7        68.7        23.3                         147.4                1317mW (DDR2/667)
PMBR               46.7        54.1        5.8                          115.3    -22%        1090mW (DDR2/533)
PMBR + VCR-LFRC    23.1        30.9                        3.2          71.7     -38%        552mW (LPDDR/333)

slide-45
SLIDE 45

4k x 2k Chip Specification

Technology     SMIC 90nm G CMOS
Voltage        1.0V core, 1.8V/2.5V IO
Die size       4x4 mm²
Package        176-pin LQFP
Gates/SRAM     662K/59.6KB
Ext. memory    64b LPDDR
Core power     189mW@175MHz (4096x2160@60fps)
               176mW@166MHz (3840x2160@60fps)
               48mW@36MHz (1920x1080@60fps)

[Figure: chip micrograph showing the H.264/AVC Video Decoder Core, two DDR PHY Data 32b blocks and DDR CMD]

3rd place in ISLPED2010 Design contest
Best Student paper Award at VLSI Symp.2011

slide-46
SLIDE 46

Test Platform

[Figure: test board with the Video Decoder Chip, DRAM, FPGA, JTAG and UART]

slide-47
SLIDE 47

Comparison

VLSI Symp.2010

                             This Work       VLSI'09 [1]                 JSSC'07 [2]
Video format(s)              H.264 HP        H.264 HP, MPEG-1/2, AVS     H.264 MP
Max. throughput              4096x2160@60    1920x1080@60                1920x1080@30
Gates/SRAM                   662K/59.6KB     367K/11.0KB                 160K/4.5KB
DRAM config.                 64b DDR         32b DDR                     32b DDR + 32b SDR
Technology                   90nm/1.0V       0.13μm/1.2V                 0.18μm/1.8V
Core power                   189mW           257mW                       320mW
Scaled & norm. core power    0.36nJ/pixel    1.0nJ/pixel*                0.79nJ/pixel**
Norm. DRAM power             1.11nJ/pixel    2.65nJ/pixel

*  Power90 = Power130/(V130/V90)^2/(C130/C90) = Power130/2.08
** Power90 = Power180/(V180/V90)^2/(C180/C90) = Power180/6.48
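The normalized core-power row can be reproduced from the other rows: divide the core power by the pixel rate, then apply the technology scaling from the footnotes. The sketch below does this under the assumption that the capacitance ratio C130/C90 equals the feature-size ratio (130/90, and likewise 180/90), which is what yields the 2.08 and 6.48 factors. The function names are made up for the example. Computed this way, the quotients come out in nanojoules per pixel.

```python
def pixel_rate(width, height, fps):
    """Pixels decoded per second at the maximum throughput."""
    return width * height * fps


def scale_factor(v_from, v_to, node_from_nm, node_to_nm):
    """(V_from/V_to)^2 * (C_from/C_to), ASSUMING capacitance scales
    with the feature size."""
    return (v_from / v_to) ** 2 * (node_from_nm / node_to_nm)


def nj_per_pixel(power_mw, rate, scale=1.0):
    """Core energy per pixel in nanojoules, after technology scaling."""
    return power_mw * 1e-3 / rate / scale * 1e9


scale_130 = scale_factor(1.2, 1.0, 130, 90)   # about 2.08
scale_180 = scale_factor(1.8, 1.0, 180, 90)   # about 6.48

this_work = nj_per_pixel(189, pixel_rate(4096, 2160, 60))
vlsi09 = nj_per_pixel(257, pixel_rate(1920, 1080, 60), scale_130)
jssc07 = nj_per_pixel(320, pixel_rate(1920, 1080, 30), scale_180)

print(f"pixel rate of this work: {pixel_rate(4096, 2160, 60) / 1e6:.1f} Mpixels/s")
print(f"this work: {this_work:.2f}, VLSI'09: {vlsi09:.2f}, JSSC'07: {jssc07:.2f} nJ/pixel")
```

The first printed line also recovers the 530 Mpixels/s figure quoted for the 4096x2160@60fps decoder.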

slide-48
SLIDE 48

H.264 Full Encoder Chip

1920x1080, 30 frames/s

TSMC 0.18um CMOS 1P6M
5.44mm×4.98mm (= 27.1 mm²)
Clock: 200MHz
Power: 1409mW (DRAM is included)
Logic Gates: 1140K gates
SRAM: 108KB
64Mb System-in-Silicon DRAM

Presented at 2009 VLSI Symposium

slide-49
SLIDE 49

Power Dissipation

[Figure: bar chart of power dissipation (axis from 0.5 to 3.5), split into Memory and Processor, for NTU ASIC (720P), NTU ASIC (1080P) and our design SoC (1080P). Annotation: ½ power reduction]

slide-50
SLIDE 50

Real experiment for LDPC Chip

  Clock gating technique
  Power gating technique
  Floor Plan Design

slide-51
SLIDE 51

Implementation

[Figure: layout of proposed decoder]

slide-52
SLIDE 52

Comparison

IEEE SOCC2011

                                      This work    VLSIC2010[19]    JSSC2008[21]    VLSIC2007[22]
Technology                            65nm         0.13µm           90nm            0.13µm
Supply voltage                        1.2V         1.2V             1.0V            1.2V
Code length                           576~2304
Code rate                             1/2, 2/3A, 2/3B, 3/4A, 3/4B, 5/6                              1/2
Cycle#/iteration                      24~48        48~54            ~160            ~350
Logic gate count                      597k         470k             380k            420k
Memory bits                           56,448       72,522           89,856          76,800
Eq. gate count                        968k         946k             970k            924k
Frequency                             110MHz       214MHz           150MHz          83.3MHz
Iteration number                      10           10               20              2~8
Throughput (Mbps)                     1056         955              105             111
Power (mW)                            115          397              264             52
Normalized power                      230          397              484             52
Nor. Power Eff. (pJ/bit/iteration)    21.8         42               230             216
slide-53
SLIDE 53

  • 1. What is a low power design?
  • 2. Power reduction at System level
  • 3. Power reduction at Algorithm level
  • 4. Power reduction at RTL level
  • 5. Power reduction examples by Chip
  • 6. Product oriented activity
  • 7. Conclusions
slide-54
SLIDE 54

Human detection algorithm on a low power platform SoC with programmable hardware

This project is supported by CREST to transfer the developed technology to real-world products as quickly as possible.

slide-55
SLIDE 55

Human detection algorithm on a low power platform SoC with programmable hardware

  • Human Detection Algorithm (Waseda University)
  • Low Power Platform SoC (Renesas)
  • Task based runtime system
      ・Extract parallelism
      ・Running at lower clock speed

Less than 1/40 of the energy consumption of a desktop PC

slide-56
SLIDE 56

Human Detection Algorithm

  • The human detection algorithm is based on the HOT (Histogram of Template) feature developed by Waseda University.

[Figure: Input Image → Set of Detection Windows → Human Detection Core Algorithm (based on the HOT feature) → Human / Non-Human → Result]

slide-57
SLIDE 57

Implementation of the Algorithm

  • The algorithm is decomposed into several tasks to compose a task graph.
  • Each task is processed by the CPU and/or STP, controlled by the runtime system for the task graph.

[Figure: the Human Detection Core Algorithm (Gray Scale, Mean Shift, Down Scale, Gradient, Status Table, Hist., SVM) is decomposed into primitive tasks that form a task graph (N=26). The runtime system (Task Graph Manager, Multi Task Scheduler) dispatches the tasks to the CPU and STP on the XBridge system]

STP: Stream Transpose Processor

slide-58
SLIDE 58

Energy Consumption Measuring Method

  • Measured currents are integrated numerically to compute energy consumption.
  • Energy consumption in both the processing part (CPU, STP) and the system part (Chipset, MC, Memory, etc.) is measured independently.

Desktop PC (Core2Duo@3GHz), desktop PC motherboard
  • Measurement interval = processing time for 1 image = 551ms
  • Measured current [mA] (plot axis up to 3600) on the rails 12V: CPU; 12V: Chipset, Misc; 5V: Chipset, Memory, Misc; 3.3V: Chipset, Misc
  • GPU is excluded

XBridge (MIPS 4KEc@200MHz, STP@44.4MHz), XBridge evaluation board
  • Measurement interval = processing time for 1 image = 486ms
  • Measured current [mA] (plot axis up to 900) on the rails 1.0V: CPU (CPU, system bus, etc.); 1.0V: STP; 1.8V: XBridge (except CPU and STP); 1.8V: Memory
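Integrating a measured current numerically amounts to summing V·I·Δt over the measurement interval. The sketch below uses the trapezoidal rule on one supply rail. The choice of the trapezoidal rule, the function name and the sample values are all assumptions for illustration and are not the measurement data of the talk.

```python
def energy_joules(times_s, currents_a, voltage_v):
    """Energy on one supply rail: trapezoidal integration of V * I over time.

    times_s    : sample timestamps in seconds (ascending)
    currents_a : measured current in amperes at each timestamp
    voltage_v  : rail voltage in volts (treated as constant)
    """
    energy = 0.0
    for k in range(1, len(times_s)):
        dt = times_s[k] - times_s[k - 1]
        mean_current = 0.5 * (currents_a[k] + currents_a[k - 1])
        energy += voltage_v * mean_current * dt
    return energy


# Hypothetical samples: a 12 V rail drawing a constant 2 A for 0.5 s.
constant = energy_joules([0.0, 0.25, 0.5], [2.0, 2.0, 2.0], 12.0)

# Hypothetical samples: a 1 V rail ramping linearly from 0 A to 1 A in 1 s.
ramp = energy_joules([0.0, 0.5, 1.0], [0.0, 0.5, 1.0], 1.0)

print(constant, ramp)
```

The total for one image would be the sum of `energy_joules` over all rails, evaluated over that platform's measurement interval.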

slide-59
SLIDE 59

Benchmark Results

Real Demo is shown at ULP Exhibition

  • Comparison with desktop PC (Core2Duo@3GHz)
      – Energy consumption: [Overall] 2.4% (1/42), [Processor part] 3.5% (1/28)
      – Processing speed: 118%

Platform                                       Desktop PC        XBridge              XBridge
Processor                                      Core2Duo @3GHz    MIPS 4KEc @266MHz    4KEc@200MHz + STP@44.4MHz
Processing Time / Image [sec]                  0.582             17.311               0.494
Relative Performance                           100.0%            3.4%                 117.9%
Energy / Image [J], Processor Part: CPU        13.036            10.968               0.251
Energy / Image [J], Processor Part: STP                                               0.208
Energy / Image [J], Processor Part: SubTotal   13.036            10.968               0.460
Energy / Image [J], System Part                14.925            5.746                0.203
Energy / Image [J], Total                      27.961            16.714               0.663
Relative Energy, Processor Part                100.0%            84.1%                3.5% (1/28)
Relative Energy, Total                         100.0%            59.8%                2.4% (1/42)

[Figure: Energy Consumption Comparison between desktop PC (Core2Duo@3GHz) and XBridge (4KEc@200MHz, STP@44.4MHz), energy consumption [J] split into Processor Part and System Part]
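The headline ratios follow directly from the per-image figures in the benchmark table. The sketch below recomputes them. The dictionary layout and variable names are assumptions for the example; the numbers are the desktop PC and the XBridge (4KEc@200MHz, STP@44.4MHz) values from the table.

```python
# Per-image figures from the benchmark table.
desktop = {"time_s": 0.582, "processor_j": 13.036, "system_j": 14.925}
xbridge = {"time_s": 0.494, "processor_j": 0.460, "system_j": 0.203}


def total_energy(platform):
    """Processor part plus system part, in joules per image."""
    return platform["processor_j"] + platform["system_j"]


overall_ratio = total_energy(xbridge) / total_energy(desktop)
processor_ratio = xbridge["processor_j"] / desktop["processor_j"]
relative_speed = desktop["time_s"] / xbridge["time_s"]

print(f"overall energy:   {overall_ratio:.1%} (about 1/{1 / overall_ratio:.0f})")
print(f"processor energy: {processor_ratio:.1%} (about 1/{1 / processor_ratio:.0f})")
print(f"processing speed: {relative_speed:.0%}")
```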

slide-60
SLIDE 60

  • 1. What is a low power design?
  • 2. Power reduction at System level
  • 3. Power reduction at Algorithm level
  • 4. Power reduction at RTL level
  • 5. Power reduction examples by Chip
  • 6. Product oriented activity
  • 7. Conclusions
slide-61
SLIDE 61

Low Power Design

Application algorithms: Signal Process, Recognition, Encode/Decode, Cypher (Encryption), ECC, NW Protocol

  • Low Power Algorithm
  • Hetero Multi-core Execution Control (Task1, Task2, Task3, Task4), Programmable HW, Low Power IP Core, ASIP, Multi Processor
  • Low Power Design basic Technologies: Clock GT, Power GT, Floor Plan, High Level Synth

Results marked on the diagram: 1/3 (2mW), 1/5 (40mW), 1/10 (50mW), 1/5 (100mW), 1/5 (100mW), 1/40 (0.6J)

slide-62
SLIDE 62

Low Power Design

Application algorithms: Signal Process, Recognition, Encode/Decode, Cypher (Encryption), ECC, NW Protocol

                                                              Goal     Achieved
Low Power Algorithm                                           1/5      1/3~1/5
Hetero Multi-core Execution Control, Programmable HW,
  Low Power IP Core, ASIP, Multi Processor                    1/10     1/3~1/10
Low Power Design basic Technologies (Clock GT, Power GT,
  Floor Plan, High Level Synth)                               1/2      2/3~1/2
Total                                                         1/100    2/27~1/100

Results marked on the diagram: 1/3 (2mW), 1/5 (40mW), 1/10 (50mW), 1/5 (100mW), 1/5 (100mW), 1/40 (0.6J)
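The overall figures are the products of the per-layer factors, which can be checked with exact fractions. The variable names below are assumptions for the example; the factors are the goal and achieved values from this slide.

```python
from fractions import Fraction

# Per-layer reduction factors: algorithm, platform, basic technologies.
goal = Fraction(1, 5) * Fraction(1, 10) * Fraction(1, 2)

# Achieved ranges, taking the weakest and the strongest factor of each layer.
achieved_weakest = Fraction(1, 3) * Fraction(1, 3) * Fraction(2, 3)
achieved_strongest = Fraction(1, 5) * Fraction(1, 10) * Fraction(1, 2)

print("goal:", goal)
print("achieved:", achieved_weakest, "to", achieved_strongest)
```

The strongest achieved combination meets the 1/100 goal, while the weakest one corresponds to roughly a 13.5-fold reduction.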

slide-63
SLIDE 63

Many thanks to my colleagues at Waseda University and Renesas Electronics for carrying out the project: