
SLIDE 1

Innovative Power Control for Ultra Low-Power and High-Performance System LSIs

Hiroshi Nakamura (Univ. of Tokyo)
Hideharu Amano (Keio Univ.)
Masaaki Kondo (Univ. of Electro-Communications)
Mitaro Namiki (Tokyo Univ. of Agriculture and Tech.)
Kimiyoshi Usami (Shibaura Inst. of Tech.)

JST-CREST ULP Workshop (H.Nakamura)


SLIDE 2

Objective and Strategy

- Objective: drastic power reduction of high-performance system LSIs
- Strategy: innovative power control through tight Co-Optimization / Co-Design of system software, architecture, and circuit design
- Principle:
  - Performance: limited by a bottleneck
  - Power: summation of the whole system
  - Low power and slow operation for unhurried / idle parts

[Figure: design stack of System Software, Compiler, Architecture, and Circuit Technology, joined by Co-Optimization/Co-Design]

SLIDE 3

Role of Design Hierarchy for Low Power

[Figure: design hierarchy of OS (When?), Architecture (Where?), Circuit (How?), and Device. The circuit layer holds the throttle levers of power/performance: Clock Gating, Dual Vth, DVFS, Power Gating, Back-bias]

- Circuit level: provide levers to throttle performance / power
  - DVFS, Power Gating, Back-bias, ...
- Architecture, OS level: find a chance to set the levers (when and where?)
  - Architecture: intra-task/process optimization
  - OS: inter-task/process optimization

SLIDE 4

Preferable Throttle Lever

- Effectiveness of power reduction
- Low overhead in area, performance, and power
  - Controlling the throttle lever itself takes time and consumes power
- Fine control granularity in both space and time
  - Locations of busy / idle parts are small and change frequently

[Figure: System LSI containing Reconfig, Processor (int, fp, cache), Memory, and Network blocks, with busy and idle parts changing over time]

SLIDE 5

Example of Throttle Levers

- For dynamic power: Clock Gating, DVFS
  - both effective, DVFS in particular (Power ∝ Vdd^2)
  - Clock Gating: very fine-grained control with little overhead
    - easily utilized within circuit-level design
  - DVFS: tens of μs to change Vdd through the regulator
    - moderate granularity
- For leakage power: Power Gating, Body Biasing
  - both effective, but large overhead in power and performance
  - Body biasing: spatial granularity
    - statically defined regions
    - not easy for fine-grained control

[Figure: Power Gating. A circuit block sits between Vdd and VGND; a sleep transistor driven by a sleep signal connects VGND to GND]
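The Vdd^2 relation on this slide can be illustrated with the textbook CMOS dynamic-power model P = a * C * Vdd^2 * f. The slide states only the proportionality; the full formula, the function names, and the capacitance, voltage, and frequency values below are assumptions chosen for illustration.

```python
def dynamic_power(c_eff: float, vdd: float, freq_hz: float, activity: float = 1.0) -> float:
    """Dynamic power in watts under the textbook model P = a * C * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq_hz


def dynamic_energy(c_eff: float, vdd: float, cycles: int, activity: float = 1.0) -> float:
    """Dynamic energy in joules for a fixed cycle count (independent of frequency)."""
    return activity * c_eff * vdd ** 2 * cycles


if __name__ == "__main__":
    C_EFF = 1e-9  # hypothetical effective switched capacitance in farads

    full = dynamic_power(C_EFF, vdd=1.2, freq_hz=200e6)
    scaled = dynamic_power(C_EFF, vdd=0.6, freq_hz=100e6)
    print(f"power ratio (scaled / full):  {scaled / full:.3f}")

    e_full = dynamic_energy(C_EFF, vdd=1.2, cycles=1_000_000)
    e_scaled = dynamic_energy(C_EFF, vdd=0.6, cycles=1_000_000)
    print(f"energy ratio (scaled / full): {e_scaled / e_full:.3f}")
```

In this model, halving both Vdd and f cuts power to one eighth, and the energy for a fixed amount of work to one quarter, which is why DVFS is attractive for unhurried parts.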

SLIDE 6

Role of Design Hierarchy for Low Power: The Ideal

[Figure: design hierarchy (System, OS, Architecture, Circuit, Device) annotated with When?, Where?, and How?]

- Spatial and temporal granularity is important
- Co-Design of Circuit, Architecture, and OS for power
- Co-Optimization of throttle lever control:
  - especially, Co-Optimization of spatial and temporal granularity
  - e.g. activity localization by architecture/OS to make full use of the throttle levers' characteristics

SLIDE 7

Team Formation of our Research Project

Sub-theme (leader):

- System Software
  - Co-operative System Software with Arch. (Prof. Namiki)
- Architecture / Compiler
  - Ultra Low-Power Reconf. Architecture (Prof. Amano)
  - Data Resident Architecture (Prof. Nakamura)
  - Data Resident Compiler (Prof. Kondo)
- Circuit Design
  - Ultra Low-Power Circuit Design (Prof. Usami)

Cross-layer links: Co-Optimization of System Software and Architecture; Co-Optimization of Architecture and Circuit Design.

[Figure: System LSI with Reconfig, Network, Memory, and Processor (int, fp, cache) blocks, and a logic block supplied by VddH / VddL]

SLIDE 8

(Project 1) Geyser: Low Power Processor through Fine-grained Runtime Power Gating

- Target: leakage power
- Background: leakage reduction techniques so far
  - Standby time: power gating (coarse grain)
  - Runtime: Cache-decay, Drowsy-cache (coarse grain in time)
- Leakage for logic parts (ALU, multiplier, etc.) gets serious
  - Fast but leaky transistors are used
  - The active ratio of those parts is not necessarily high, but the active parts change frequently, that is, cycle by cycle

Objective: reduce runtime leakage power of logic parts
Challenge: how to optimize the granularity of power gating

SLIDE 9

Instruction Pipeline with Power-Gating

- Geyser: MIPS-compatible processor with a 5-stage pipeline
- Straightforward PG (power gating)
  - Turn EX-units into active mode only if necessary
  - An EX-unit gets active when an instruction that uses it enters the IF stage
  - The activated EX-unit returns to sleep mode after execution

[Figure: MIPS R3000 pipeline (IF, ID, EX, MEM, WB) with EX-units ALU, Shift, Mult, and Div. For a SHIFT instruction, control logic detects which unit will be used and sends a wake-up signal to the shifter]
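The "detects which unit will be used" step can be sketched as a small decoder over the instruction word. This is an illustrative software model, not Geyser's actual control logic: the function names and unit labels are hypothetical, only R-type instructions (opcode field 0) are handled, and the funct codes are those of the MIPS I instruction set.

```python
from typing import Optional

# MIPS I R-type "funct" codes (opcode field = 0), grouped by execution unit.
SHIFT = {0x00, 0x02, 0x03, 0x04, 0x06, 0x07}  # SLL, SRL, SRA, SLLV, SRLV, SRAV
MULT = {0x18, 0x19}                           # MULT, MULTU
DIV = {0x1A, 0x1B}                            # DIV, DIVU
ALU = {0x20, 0x21, 0x22, 0x23,                # ADD, ADDU, SUB, SUBU
       0x24, 0x25, 0x26, 0x27,                # AND, OR, XOR, NOR
       0x2A, 0x2B}                            # SLT, SLTU


def encode_rtype(rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack an R-type instruction word (opcode field is 0)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def unit_to_wake(instr: int) -> Optional[str]:
    """Return the execution unit an R-type instruction needs, or None."""
    opcode = (instr >> 26) & 0x3F
    if opcode != 0:
        return None  # I-type and J-type instructions are not modeled in this sketch
    funct = instr & 0x3F
    if funct in SHIFT:
        return "shift"
    if funct in MULT:
        return "mult"
    if funct in DIV:
        return "div"
    if funct in ALU:
        return "alu"
    return None


if __name__ == "__main__":
    sll = encode_rtype(rs=0, rt=9, rd=8, shamt=2, funct=0x00)  # sll $t0, $t1, 2
    print(unit_to_wake(sll))
```

Note that the all-zero word (the MIPS NOP) is encoded as an SLL, so this sketch would report the shifter for it; a real decoder would likely special-case it.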

SLIDE 10

Challenges for Run-Time Power-Gating: Energy Overhead

[Figure: power vs. time across a sleep and wake-up period, relative to normal leakage. Regions (1) + (3): energy overhead; (2): part of the leakage saving, with (2) = (1) + (3) at the Break-Even Time (BET); (4): net energy saving]

- The sleep period should be longer than the BET
  - Otherwise, total energy consumption increases
- The BET tells the smallest granularity for power gating
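The break-even condition can be written down as a short calculation. The sketch below assumes a simplified linear model in which a sleeping unit saves a constant leakage power and each sleep/wake-up pair costs a fixed energy overhead; the slide's figure shows a more gradual transition. All names and numbers are hypothetical.

```python
def break_even_cycles(overhead_j: float, leak_saved_w: float, freq_hz: float) -> float:
    """Cycles of sleep needed before the leakage saved equals the overhead energy."""
    if leak_saved_w <= 0 or freq_hz <= 0:
        raise ValueError("leakage saving and frequency must be positive")
    return overhead_j / leak_saved_w * freq_hz


def net_saving_j(sleep_cycles: int, overhead_j: float,
                 leak_saved_w: float, freq_hz: float) -> float:
    """Net energy saved by one sleep period; negative means power gating lost energy."""
    return leak_saved_w * sleep_cycles / freq_hz - overhead_j


if __name__ == "__main__":
    OVERHEAD_J = 1e-10   # hypothetical energy cost of one sleep/wake-up pair
    LEAK_SAVED_W = 1e-3  # hypothetical leakage power removed while asleep
    FREQ_HZ = 200e6      # clock frequency used only to convert time to cycles

    bet = break_even_cycles(OVERHEAD_J, LEAK_SAVED_W, FREQ_HZ)
    print(f"BET = {bet:.1f} cycles")
    for n in (10, 40):
        print(n, net_saving_j(n, OVERHEAD_J, LEAK_SAVED_W, FREQ_HZ))
```

With these assumed values, a 10-cycle sleep loses energy while a 40-cycle sleep saves it, which is the sense in which the BET sets the smallest useful power-gating granularity.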

SLIDE 11

Break Even Time of Each Functional Unit

[Figure: bar chart of BET in cycles @ 200 MHz for ALU, Shift, Mult, Div, and CP0 at 25℃, 65℃, 100℃, and 125℃, in 90 nm technology; plotted values range from 2 to 114 cycles]

- BET is shortened when the chip temperature climbs
  - Leakage current depends heavily on temperature
- We need novel PG strategies that take BET into account

SLIDE 12

Power Gating Strategies

Requirement: power off EX-units for longer than the BET

- Static strategy
  - straightforward: EX-units always go to sleep after execution
  - ideal compiler (ideal compiler-directed): the exact average idle time of EX-units after each instruction is known (for reference only)
- Dynamic strategy
  - L1 miss: EX-units fall asleep only when encountering L1 cache misses
    - L1 miss penalty = 15 cycles
  - L2 miss: EX-units fall asleep only when encountering L2 cache misses
    - L2 miss penalty = 200 cycles
- Both static and dynamic strategies
  - ideal compiler + L2 cache miss
- ideal (God): ideal dynamic strategy
  - the exact idle time of EX-units is known at any time; upper limit of PG (for reference only)
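To make the comparison between strategies concrete, the following sketch scores some of them on a toy trace of idle intervals. It is not the evaluation setup of the project: the cost model (an awake idle cycle costs one unit of leakage, a sleep period costs the BET in the same units), the trace, and all names are assumptions, and the ideal-compiler strategy is omitted because it needs per-instruction statistics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Idle:
    cycles: int                  # length of the idle interval of an EX-unit
    cause: Optional[str] = None  # None, "L1" (L1 miss), or "L2" (L2 miss)


Policy = Callable[[Idle, int], bool]  # returns True if the unit should sleep

STRATEGIES: Dict[str, Policy] = {
    "no_pg": lambda idle, bet: False,
    "straightforward": lambda idle, bet: True,
    "l1_miss": lambda idle, bet: idle.cause in ("L1", "L2"),
    "l2_miss": lambda idle, bet: idle.cause == "L2",
    "ideal": lambda idle, bet: idle.cycles > bet,
}


def leakage_cost(trace: List[Idle], bet: int, should_sleep: Policy) -> int:
    """Total idle-time leakage cost, in units of one awake cycle of leakage."""
    total = 0
    for idle in trace:
        # Sleeping costs the overhead (equal to BET cycles of leakage);
        # staying awake costs one unit per idle cycle.
        total += bet if should_sleep(idle, bet) else idle.cycles
    return total


if __name__ == "__main__":
    trace = [Idle(3), Idle(15, "L1"), Idle(200, "L2"), Idle(5)]
    for bet in (10, 50):
        costs = {name: leakage_cost(trace, bet, p) for name, p in STRATEGIES.items()}
        print(f"BET={bet}: {costs}")
```

On this toy trace the L1-miss policy matches the ideal one at a BET of 10 cycles but wastes energy at a BET of 50, while the L2-miss policy is the better choice at the longer BET.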

SLIDE 13

Result for Frequently Used Execution Unit

[Figure: FPADD for MGRID. Relative energy (compared to non-PG) vs. BET (cycles) for straightforward, ideal compiler, L1, L2, ideal comp. + L2, and ideal (God)]

- straightforward: BET is longer than the sleep time -> waste of energy for longer BET
- ideal compiler: less chance, but the compiler is effective for shorter BET
- L1: resulting sleep time is about 15 cycles -> ideal for BET < 15, but waste of energy for longer BET
- L2: resulting sleep time is 200 cycles -> ideal for longer BET

SLIDE 14

Collaboration with Compiler / OS

Suggested Power Gating Strategy

- Co-optimization of the control granularity of the PG lever
  - the compiler gives its direction assuming a short BET, because compiler-directed PG is effective for shorter BET
  - for shorter BET (high temperature), the compiler direction is put into use: take the (compiler + L2-miss) strategy
  - for longer BET (low temperature), take the L2-miss strategy and ignore the compiler direction
  - the OS is expected to switch between strategies by observing changes in BET
- Power gating in collaboration with compiler / OS
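The switching rule suggested on this slide can be sketched as a small OS-side policy. The threshold separating "shorter" from "longer" BET is not given in the slides, so it is a required, hypothetical parameter here; the function names and strategy labels are also illustrative.

```python
from typing import List, Sequence, Tuple


def select_pg_strategy(bet_cycles: float, short_bet_threshold: float) -> str:
    """Pick a PG strategy from the currently observed BET (in cycles)."""
    if bet_cycles <= short_bet_threshold:
        # Short BET (high temperature): honor compiler direction and L2 misses.
        return "compiler+L2-miss"
    # Long BET (low temperature): sleep on L2 misses only, ignore the compiler.
    return "L2-miss"


def switch_points(bet_samples: Sequence[float],
                  short_bet_threshold: float) -> List[Tuple[int, str]]:
    """Return (sample index, strategy) each time the selected strategy changes."""
    switches: List[Tuple[int, str]] = []
    current = None
    for i, bet in enumerate(bet_samples):
        strategy = select_pg_strategy(bet, short_bet_threshold)
        if strategy != current:
            switches.append((i, strategy))
            current = strategy
    return switches


if __name__ == "__main__":
    # Hypothetical BET readings as the chip heats up and then cools down.
    print(switch_points([40, 30, 12, 10, 35], short_bet_threshold=20))
```

A production policy would probably add hysteresis around the threshold so that the OS does not toggle strategies on every small BET fluctuation.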

SLIDE 15

Leakage Monitor

[Koyama et al., ITC-CSCC '08] [Usami et al., ISLPED 2011 (poster 15)]

- BET depends on the dynamic environment, such as temperature and process variation
- On-chip leakage monitoring circuit
  - More leakage results in faster charging of VGND
  - Estimate leakage by measuring the rise time of VGND to VREF
- The OS can select the best PG strategy by observing this monitor

[Figure: VGND voltage (V) vs. sleep time (s) after the sleep transistor switches from ON to OFF. With more leakage, VGND rises to the reference (VREF) sooner than with less leakage; the monitor output is '1' or '0']
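The relation between rise time and leakage can be sketched with a first-order model. This assumes that the VGND node behaves like a fixed capacitance charged by a constant leakage current, so that I = C * VREF / t_rise; the actual monitor circuit in the cited papers may work differently, and the capacitance, reference voltage, and clock values are hypothetical.

```python
def rise_time_s(counted_cycles: int, clk_hz: float) -> float:
    """Convert a cycle count measured by an on-chip counter into seconds."""
    if clk_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return counted_cycles / clk_hz


def leakage_current_a(c_vgnd_f: float, vref_v: float, t_rise_s: float) -> float:
    """Estimate leakage current from the time VGND takes to reach VREF.

    Assumes constant-current charging of a fixed capacitance: I = C * V / t.
    """
    if t_rise_s <= 0:
        raise ValueError("rise time must be positive")
    return c_vgnd_f * vref_v / t_rise_s


if __name__ == "__main__":
    C_VGND = 1e-12  # hypothetical VGND node capacitance in farads
    VREF = 0.3      # hypothetical reference voltage in volts
    CLK_HZ = 200e6  # hypothetical counter clock

    for cycles in (50, 200):  # a shorter rise time means more leakage
        t = rise_time_s(cycles, CLK_HZ)
        print(cycles, leakage_current_a(C_VGND, VREF, t))
```

Because BET shrinks as leakage grows, a shorter measured rise time is a hint to the OS that strategies tuned for short BET are worth enabling.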

SLIDE 16

Co-Optimization of Throttle Lever Control in Fine-grained Runtime Power Gating

[Figure: OS (PG strategy), Architecture (PG control through activity localization), and Circuit (PG lever). The lever is controlled in 10~100 cycles; the best granularity changes dynamically (e.g. with temperature)]

- Who should be responsible for PG control?
  - depends on the granularity of control
  - PG control granularity (BET): 10 ~ 100 cycles
  - the best granularity of control changes every msec

SLIDE 17

Prototype CPU : Geyser-1 [Ikebuchi et al., ASSCC '09]

- MIPS R3000
- Fujitsu e-shuttle 65nm
- Vdd = 1.2V
- successfully in operation
- the first successful cycle-by-cycle power gating

[Figure: chip photo, 2.1 mm x 4.2 mm, showing the Shifter, DIV, MULT, ALU, and leakage monitor]

SLIDE 18

Prototype CPU : Geyser-2

- Geyser-2: 2nd prototype
- with caches and TLBs on-chip
- max working frequency: 210MHz (wakeup latency is less than 5ns)
- Demonstration @ ISLPED2011 booth ④

[Figure: leakage power [mW] vs. temperature [C]]

SLIDE 19

(Project 2) Cool Mega Array

- Reconfigurable accelerator: not for performance but for power efficiency
- PE array consists of only combinational logic
  - Power consumption of registers and clock distribution is reduced
- Low-voltage and low-power PE array operation balanced with the data bandwidth of memory
- Localization of operations
  - Operation / Reg. access
  - Performance / Power

[Figure: Architecture of CMA. A PE array (combinational circuit, DVS region) with SE, DME, and DMEM blocks]

SLIDE 20

Prototype : CMA-1

- Fujitsu 65nm
- 8x8 PE array
- 12KB data memory
- control part: 1.2V
- Maximum power efficiency: 223.2 [MOPS/mW]
- Demonstration @ ISLPED2011 booth ④

[Figure: power efficiency [MOPS/mW] vs. PE array voltage [V]]
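A power-efficiency figure in MOPS/mW can be restated as energy per operation, which is sometimes easier to compare across designs. The conversion below is plain unit arithmetic (1 MOPS/mW = 10^9 operations per joule); the function name is illustrative.

```python
def picojoules_per_op(mops_per_mw: float) -> float:
    """Convert power efficiency in MOPS/mW to energy per operation in pJ."""
    if mops_per_mw <= 0:
        raise ValueError("efficiency must be positive")
    ops_per_joule = mops_per_mw * 1e6 / 1e-3  # (ops/s) / W = ops per joule
    return 1e12 / ops_per_joule               # joules per op, expressed in pJ


if __name__ == "__main__":
    # Maximum efficiency reported for CMA-1 on this slide.
    print(f"{picojoules_per_op(223.2):.2f} pJ per operation")  # about 4.48
```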

SLIDE 21

Summary and Future Direction

- Geyser: Run-time Power Gating Processor
  - first cycle-by-cycle power gating processor
- Cool Mega Array: Power Efficiency Accelerator
- Other projects
  - Fine Grain Power Gating NoCs [Matsutani et al., NOCS 2010] [Matsutani et al., IEEE Trans. on CAD, 4/2011]
  - Linux-based Evaluation Platform (Demonstration @ ISLPED2011 booth ④)
- Towards Integrated System LSIs
  - Evaluation through real integration via 3D wireless NoCs

[Figure: integrated system combining CMA accelerators, Geyser CPUs, L2 caches, and main memory]

SLIDE 22

Selected Publications

1. N. Seki, et al., "A Fine Grain Dynamic Sleep Control Scheme in MIPS R3000", Proc. of ICCD-2008, pp. 612-617, 2008

2. K. Usami, et al., "Design and Implementation of Fine-grain Power Gating with Ground Bounce Suppression", Proc. of VLSI Design 2009, pp. 381-386, 2009

3. N. Takagi, et al., "Cooperative Shared Resource Access Control for Low Power Chip Multiprocessors", ISLPED-2009, pp. 177-182, 2009

4. S. Saito, et al., "MuCCRA-Cube: A 3D Dynamically Reconfigurable Processor with Inductive Coupling link", Proc. of FPL09, pp. 6-11, 2009

5. D. Ikebuchi, et al., "Geyser-1: A MIPS R3000 CPU core with fine grain runtime power gating", Proc. of IEEE ASSCC-2009, pp. 281-284, 2009

6. H. Matsutani, et al., "Ultra Fine-Grained Run-Time Power Gating of On-Chip Routers for CMPs", Proc. of NOCS'10, pp. 61-68, 2010

7. H. Matsutani, et al., "Performance, Area, and Power Evaluations of Ultrafine-Grained Run-Time Power-Gating Routers for CMPs", IEEE Trans. on CAD (TCAD), Vol. 30, No. 4, pp. 520-533, Apr. 2011

8. K. Usami, et al., "On-chip Detection Methodology for Break-Even Time of Power Gated Function Units", Proc. of ISLPED-2011 (to appear)