Low Power Design, Dr. Z. Wang and Prof. Dr. J. Henkel: PowerPoint PPT Presentation



SLIDE 1

Hardware Power 1

Low Power Design

  • Dr. Z. Wang and Prof. Dr. J. Henkel

CES - Chair for Embedded Systems, Karlsruhe Institute of Technology, Germany

  • 3. Hardware power optimization and estimation

  • Z. Wang and J. Henkel, KIT, SS12

http://ces.itec.kit.edu

SLIDE 2

Overview

Components consuming power

  • hardware
  • memory
  • interconnect
  • software
  • OS

Levels of abstraction

  • system
  • RTL
  • gate
  • transistor

Tasks

  • Optimize (i.e. minimize for low power)
  • Design / co-design (synthesize, compile, …)
  • Estimate and simulate

Battery issues

SLIDE 3

Generic HW synthesis flow

(Src: [Anand98])

SLIDE 4

Low power HW design flow

Energy/power needs to be analyzed and optimized at each level of abstraction. Therefore, appropriate power models for each level are necessary.

Shown in Fig.:

  • a) design flow w/o energy/power
  • b) design flow with energy/power

(Src: [Anand98])

SLIDE 5

Power consumption in HW

A more detailed version than in the intro …

$$P_{avg} = P_{sw.cap} + P_{short\text{-}circuit} + P_{leakage} + P_{static}$$

(Src: [Anand98])

In general, four components:

  • Switching capacity power
  • Short-circuit power
  • Leakage power
  • Static power

SLIDE 6

Switching capacity power

Caused by parasitic capacitors during switching. Fig. shows C_L, which is the effective capacitance of all parasitic capacitances. (Src: [Anand98], CMOS inverter)

Per transition:

$$P_{sw.cap} = \frac{1}{2} \cdot C_L \cdot V_{DD}^2 \cdot N \cdot f$$

Means: a) reduce operating frequency, b) reduce C_L, c) reduce voltage, d) reduce switching activity. Most common: reduce voltage.

  • => Problem: delay of gate t_d increases too!

$$t_d = \frac{k \cdot C_L \cdot V_{DD}}{(V_{DD} - V_{th})^2}$$
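A minimal sketch of the trade-off stated on this slide: lowering V_DD reduces switching power quadratically while the gate delay grows. The numeric values (load, activity, clock, threshold voltage, k = 1) are illustrative assumptions, not values from the lecture.

```python
def switching_power(c_l, v_dd, n, f):
    """P_sw.cap = 1/2 * C_L * V_DD^2 * N * f (SI units)."""
    return 0.5 * c_l * v_dd ** 2 * n * f


def gate_delay(k, c_l, v_dd, v_th):
    """t_d = k * C_L * V_DD / (V_DD - V_th)^2; only defined for V_DD > V_th."""
    if v_dd <= v_th:
        raise ValueError("V_DD must exceed V_th")
    return k * c_l * v_dd / (v_dd - v_th) ** 2


# Assumed example values: 50 fF load, activity 0.2, 100 MHz clock, V_th = 0.7 V.
C_L, N, F, K, V_TH = 50e-15, 0.2, 100e6, 1.0, 0.7

p_high = switching_power(C_L, 3.3, N, F)
p_low = switching_power(C_L, 1.65, N, F)
d_high = gate_delay(K, C_L, 3.3, V_TH)
d_low = gate_delay(K, C_L, 1.65, V_TH)

print(f"power ratio low/high: {p_low / p_high:.2f}")
print(f"delay ratio low/high: {d_low / d_high:.2f}")
```

Halving the supply voltage quarters the switching power in this model, but the delay ratio shows why voltage scaling cannot be applied blindly to operations on the critical path.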

SLIDE 7

Short circuit power

Explanation:

Caused by a direct supply-to-ground path. When the CMOS inverter in Fig. changes from 1->0, there is a short time frame within which both the nMOS and the pMOS transistors are conducting => short circuit current is drawn from the power supply

(Src: [Anand98])

SLIDE 8

Leakage power

Leakage can be divided into two components:

  • First component: I_diode, refers to the diodes that are formed between diffusion regions and substrate
  • Second component: ‘off’ transistors still conduct some current. K, S: technology parameters; W_eff: effective transistor channel width

(Src: [Anand98])

NOTE: leakage power is predicted to be dominant in future silicon technologies

SLIDE 9

Static power

  • Not relevant in CMOS circuits
  • Note: in some literature leakage power is denoted as “static power”
  • Static power: only relevant in some nMOS circuits where there is a constant path supply-to-ground

SLIDE 10

Power consumption in HW: breakdown

Leakage power will dominate in future (<100nm) silicon technologies

  • One means to reduce leakage power is to deploy dielectrics with a high k-value

[Figure: dynamic power, subthreshold leakage and gate-oxide leakage over time (logarithmic power axis), including the trajectory if high-k dielectrics reach production] (Src: [Heer04])

SLIDE 11

Hardware synthesis for low power

Considered here: high-level synthesis (HLS), e.g.:

  • Operator scheduling
  • Module selection
  • Glitch power reduction
  • State transition reduction
  • …

SLIDE 12

Operator scheduling for low power

What is scheduling in the context of high-level synthesis?

  • Scheduling assigns operations in the behavioral description to control steps or controller states. Scheduling determines cycle-by-cycle behavior, i.e. the sequence in which operations are performed
  • Some repetition from ESI: multicycling (clock period is rather short); chaining (clock period rather long); finding the right clock cycle time is an optimization task itself
  • Scheduling determines the sequence in which the various operations of the behavioral description are performed, and also dictates which operations and variables can share the same functional units and registers. Thus, scheduling can be used to enable resource sharing for low power by ensuring that correlated variables and operations with correlated operands are appropriately sequenced so that they can share the same resources

(Src: [Anand98])

SLIDE 13

Operator scheduling for low power (cont’d)

  • Scheduling can be performed so as to enable maximum resource sharing between operations that belong to instances of the same computational pattern, resulting in maximal exploitation of regularity during resource sharing
  • Scheduling can be used to distribute the slacks or mobilities of various operations in the DFG appropriately so that some operations may be performed using slower, more energy-efficient functional units. Thus, scheduling has an impact on the power trade-offs through module selection
  • Scheduling determines the distribution of operations over time, and hence affects the profile of the power consumption in the implementation over time (control steps or clock cycles). Reducing peak power is important due to packaging, cooling, and reliability considerations. The effect of scheduling on peak power will be illustrated later.

(Src: [Anand98])

=> these tasks will be discussed in the following (some in the context of module selection)

SLIDE 14

Operator scheduling for LP (cont’d)

$$t_d = \frac{C_L \cdot V_{DD}}{(V_{DD} - V_{th})^2}$$

Shown: normalized dependency t_d = f(V_DD) (src:[Saraff95])

Basic idea: use slack in a data flow graph (dependent upon timing constraints) and:

  • a) Vary V_dd of the ALU where operator is to be executed, or
  • b) Assign operator(s) to a different ALU with a lower/higher (fixed) V_dd

SLIDE 15

Operator scheduling for LP (cont’d)

Problem: Obtain a mapping $\tau: V \rightarrow S$ of a data flow graph G=(V,E), where $S = \{V_{c1}, V_{c2}, \ldots, V_{ci}\}$ is the set of available supply voltages, given a base execution time t_c (or V_dd) and a timing constraint k * t_c. Minimize

$$\sum_{\upsilon_i \in V} \tau(\upsilon_i)^2$$

such that the critical path length of the DFG is <= k * t_c

(src:[Saraff95])
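To make the problem statement concrete, here is a brute-force sketch for a tiny DFG. It is not the [Saraff95] algorithm: it simply enumerates every voltage assignment. The graph, the voltage set, the threshold voltage and the interpretation of t_c as the critical-path length with all nodes at the highest voltage are assumptions for illustration.

```python
from itertools import product

V_TH = 0.8  # assumed threshold voltage in volts


def delay(v_dd, c_l=1.0):
    """Delay model t_d = C_L * V_DD / (V_DD - V_th)^2 in arbitrary units."""
    return c_l * v_dd / (v_dd - V_TH) ** 2


def critical_path(preds, order, tau):
    """Longest path through the DFG; `order` must be a topological order."""
    finish = {}
    for node in order:
        start = max((finish[p] for p in preds[node]), default=0.0)
        finish[node] = start + delay(tau[node])
    return max(finish.values())


def assign_voltages(preds, order, voltages, k):
    """Exhaustively find tau: V -> S minimizing sum(tau(v)^2) with path <= k * t_c."""
    t_c = critical_path(preds, order, {n: max(voltages) for n in order})
    best = None
    for combo in product(voltages, repeat=len(order)):
        tau = dict(zip(order, combo))
        if critical_path(preds, order, tau) <= k * t_c + 1e-12:
            cost = sum(v * v for v in combo)
            if best is None or cost < best[0]:
                best = (cost, tau)
    return best


# Hypothetical DFG: a -> b -> d is the critical chain, c -> d is off-critical.
preds = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"]}
order = ["a", "b", "c", "d"]
cost, tau = assign_voltages(preds, order, voltages=(5.0, 3.3), k=1.0)
print(tau, round(cost, 2))
```

With k = 1 only the off-critical node can absorb the longer delay of the lower voltage. Exhaustive search grows exponentially with the number of nodes, which is why an iterative slack-based heuristic is used in practice.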

SLIDE 16

Operator scheduling for LP (cont’d)

- algorithm -

Step 1: initialization

Step 2: computing slack

l(v): longest path of the graph that goes through node v

(src:[Saraff95])
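A possible way to compute l(v) for every node is one forward and one backward pass over a topologically ordered DFG. The slack definition used here (time budget minus l(v)) and the example graph are assumptions, since the slide's formulas are not part of this transcript.

```python
def longest_path_through(preds, order, delay):
    """Return l(v): length of the longest input-to-output path through each node."""
    succs = {n: [] for n in order}
    for n in order:
        for p in preds[n]:
            succs[p].append(n)

    head = {}  # longest path that ends at n (including n)
    for n in order:
        head[n] = delay[n] + max((head[p] for p in preds[n]), default=0.0)

    tail = {}  # longest path that starts at n (including n)
    for n in reversed(order):
        tail[n] = delay[n] + max((tail[s] for s in succs[n]), default=0.0)

    # n is counted in both head and tail, so subtract its delay once.
    return {n: head[n] + tail[n] - delay[n] for n in order}


# Hypothetical DFG with unit delays of 10 time units per operation.
preds = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"]}
order = ["a", "b", "c", "d"]
delay = {n: 10.0 for n in order}

l = longest_path_through(preds, order, delay)
budget = 30.0  # assumed timing constraint k * t_c
slack = {n: budget - l[n] for n in order}
print(l, slack)
```

Only node c ends up with positive slack in this example, so it is the only candidate for a lower supply voltage or a slower module.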

SLIDE 17

Operator scheduling for LP (cont’d)

- algorithm -

Step 3: compute max slack value

Step 4: compute dual graph

(src:[Saraff95])

SLIDE 18

Operator scheduling for LP (cont’d)

- algorithm -

Step 5: weight assignment

Step 6: compute longest weighted path

Step 7: reassign voltages to node in longest path

(src:[Saraff95])

SLIDE 19

Operator scheduling for LP (cont’d)

- algorithm -

Step 8: go back to step 2

Conclusion

Power consumption can be reduced, depending on constraints, by up to around 25%

SLIDE 20

Module selection for LP

What is module selection?

The process of mapping operations from the CDFG to component templates of the RTL library

Observation: with each operation a functional unit template, but not a specific instance, is associated (that would be mapping)

Example: a “+” operation may be implemented using a

  • A) ripple-carry adder
  • B) carry-lookahead adder
  • C) …
  • Ripple-carry adder is slow but more efficient in switched capacitance, carry-lookahead adder is faster but less efficient in switched capacitance
  • Similar tradeoffs exist in other operations

Idea: Tradeoffs can be exploited to fulfill power constraints through module selection

SLIDE 21

Module selection for LP (cont’d)

(Src: [Anand98])

  • Each operation in the DFG (middle) has been mapped to a fast component in order to meet performance constraints (in that case constraint: 85ns)
  • But is that really necessary? => no, not all ops need necessarily be mapped to the fastest module. Focus (for timing constraint) should rather be on the critical path
  • Idea: slack in off-critical path ops may be used to select slower functional units that may have a better efficiency in switched capacitance (see right DFG). There, mult op uses less power (but not less energy)
  • Important: to have a large module library with distinct switching capacity efficiencies and performance characteristics
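The idea of spending off-critical slack on slower, lower-capacitance modules can be sketched with a simple greedy pass. The library entries, delays, capacitance figures and the DFG are hypothetical; only the 85ns constraint is taken from the slide, and the greedy strategy is an illustrative assumption rather than the method of [Anand98].

```python
# Hypothetical library: op type -> module name -> (delay in ns, relative switched capacitance).
LIBRARY = {
    "+": {"carry_lookahead": (20.0, 2.0), "ripple_carry": (40.0, 1.0)},
    "*": {"fast_mult": (45.0, 6.0), "slow_mult": (60.0, 4.0)},
}


def path_length(preds, order, delay):
    """Critical-path length of the DFG; `order` must be a topological order."""
    finish = {}
    for n in order:
        finish[n] = delay[n] + max((finish[p] for p in preds[n]), default=0.0)
    return max(finish.values())


def select_modules(ops, preds, order, constraint):
    """Start with the fastest module per op, then greedily swap in the
    lowest-capacitance module that still meets the timing constraint."""
    choice = {n: min(LIBRARY[ops[n]], key=lambda m: LIBRARY[ops[n]][m][0]) for n in order}
    for n in order:
        by_capacitance = sorted(LIBRARY[ops[n]], key=lambda m: LIBRARY[ops[n]][m][1])
        for module in by_capacitance:
            trial = dict(choice, **{n: module})
            delay = {x: LIBRARY[ops[x]][trial[x]][0] for x in order}
            if path_length(preds, order, delay) <= constraint:
                choice = trial
                break
    return choice


# Hypothetical DFG: m1 -> a1 -> a2 is critical, a3 -> a2 is off-critical.
ops = {"m1": "*", "a1": "+", "a3": "+", "a2": "+"}
preds = {"m1": [], "a1": ["m1"], "a3": [], "a2": ["a1", "a3"]}
order = ["m1", "a1", "a3", "a2"]
selection = select_modules(ops, preds, order, constraint=85.0)
print(selection)
```

Operations on the critical path keep their fast modules, while the off-critical adder a3 is mapped to the ripple-carry adder. A greedy pass depends on the visiting order and is not guaranteed to be optimal.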

SLIDE 22

Module selection for LP (cont’d)

- using multiple Vdd -

  • Multiple supply voltages have also been used at logic level
  • Idea: obtain low power at rather small performance overhead (remember the equation for power through switched capacitances)

How to do in HL synthesis?

  • Need RTL component library that contains multiple versions of each component, each referring to a different Vdd
  • Need to extend module selection process to assign each op to a lib component template plus a supply voltage
  • The insertion of a necessary level converter needs to be accomplished
  • The delay models (that evaluate the module selection) need to be sensitive to the dependence of delay on supply voltage
  • Note: resource sharing for LP (see later) needs to be constrained: must not allow sharing of functional units that are assigned to different supply voltages

(Src: [Anand98])

SLIDE 23

Module selection for LP (cont’d)

- using multiple Vdd -

Ex:

  • Off-critical-path op ‘*3’ has been assigned to a lower Vdd (3.3V)
  • A level converter has been inserted: it converts the result of ‘*3’ to the higher output level
  • Various ways to implement the level converter: a) a separate level converter circuit like DCVS (differential cascode voltage switch), b) may be integrated into a register to reduce overhead

(Src: [Anand98])

SLIDE 24

Peak power

Example: two possible schedules of a given graph (assumption: same power for each operation)

  • ASAP schedule results in high peak power
  • A possible slack may be used to reduce peak power without sacrificing any performance

(Src: [Anand98])
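A small sketch of how a schedule determines the power profile over control steps, under the slide's assumption that every operation draws the same power. The two schedules are hypothetical and not the ones from the figure.

```python
from collections import Counter


def power_profile(schedule, power_per_op=1.0):
    """Power per control step for a schedule {operation: control step}."""
    counts = Counter(schedule.values())
    last_step = max(counts)
    return [counts.get(step, 0) * power_per_op for step in range(last_step + 1)]


# Hypothetical graph: o3 has one step of slack before its consumer o5.
asap = {"o1": 0, "o2": 0, "o3": 0, "o4": 1, "o5": 2}
balanced = {"o1": 0, "o2": 0, "o3": 1, "o4": 1, "o5": 2}

asap_profile = power_profile(asap)
balanced_profile = power_profile(balanced)
print(asap_profile, max(asap_profile))
print(balanced_profile, max(balanced_profile))
```

Both schedules finish in three control steps, but moving o3 into its slack lowers the peak from three to two simultaneous operations.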

SLIDE 25

Peak power (cont’d)

Example graph, schedule and module lib

  • Assumption: ‘1’, ‘2’ and ‘0’, ‘3’ may pair-wise share a module. The corresponding times are t_A, t_B:

(Src: [Knight])

  • The average power consumption for the two schedules is:
  • Ex1: t_A = 1,300ns, t_B = 660ns, P_A = 28.6mW, P_B = 56.4mW
  • Note: average power has considerably improved but energy is approx. the same!
  • => trade-off between power and performance has been exploited to protect circuit
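The note that energy stays roughly constant can be checked from the Ex1 numbers, using energy = average power × execution time. The helper name is an assumption; the figures are the ones given on the slide.

```python
def energy_nj(power_mw, time_ns):
    """Energy in nJ from average power (mW) and execution time (ns)."""
    return power_mw * time_ns / 1000.0  # mW * ns = pJ, divided by 1000 gives nJ


e_a = energy_nj(28.6, 1300)  # schedule A: slower, lower average power
e_b = energy_nj(56.4, 660)   # schedule B: faster, higher average power
print(f"E_A = {e_a:.2f} nJ, E_B = {e_b:.2f} nJ")
```

Average power differs by about a factor of two between the schedules, while the two energies differ by well under one percent.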
SLIDE 26

Peak power (cont’d)

(Src: [Knight])

  • Min delay: refers to when the operations are actually done, whereas the other case always assumes a 60ns window
  • Window is 60ns
  • ‘C’ uses multicycling and no parallelism, has lowest peak power and is slowest
  • ‘H’ is fastest, has highest peak power but comparable average power (according to the window)
  • Note: optimizing for peak power or for average power are completely distinct tasks!

SLIDE 27

DFG restructuring for LP

A typical DSP operation: constant multiplication and addition: Y = A * X (e.g. as part of an IIR filter)

Assumptions:

  • m-bit data value X multiplied by m-bit constant A

Experiment:

  • Applying random values of X for varying values of A
  • X and A are represented as two’s complement
  • A normalized to 0.0 … 1.0
  • Observing average switching activity per bit at multiplier output
  • Result: next slide
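The experiment can be approximated in a few lines. This is a simplified sketch: A is modeled as a floating-point factor rather than an m-bit fixed-point constant, the product is truncated to m bits, and the word width, sample count and seed are assumptions.

```python
import random


def toggles(a, b, bits):
    """Number of differing bits between two values in `bits`-wide two's complement."""
    mask = (1 << bits) - 1
    return bin((a ^ b) & mask).count("1")


def activity_per_bit(values, bits):
    """Average number of bit flips per bit between consecutive values."""
    flips = sum(toggles(p, q, bits) for p, q in zip(values, values[1:]))
    return flips / (bits * (len(values) - 1))


def output_activity(a_const, xs, bits=8):
    """Switching activity per bit at the output of Y = A * X (truncated)."""
    ys = [int(a_const * x) for x in xs]
    return activity_per_bit(ys, bits)


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.randint(-128, 127) for _ in range(5000)]
sweep = {a / 4: output_activity(a / 4, xs) for a in range(5)}
for a_const, act in sweep.items():
    print(f"A = {a_const:.2f}: activity per bit = {act:.3f}")
```

For A = 0 the output never changes, and for A = 1.0 the output activity equals the input activity, which matches the two end points discussed for this experiment.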

SLIDE 28

DFG restructuring for LP (cont’d)

Observation:

  • When constant value A is ‘0’ => no switching activity (as expected)
  • When A is 1.0, output switching activity is equal to switching activity of X
  • In between: it is monotonically increasing

[Figure: output switching activity per bit over the constant value (i.e. A)] (Src: [Anand98])

Example

An adder: two m-bit data values X1, X2. It can be shown that:

(Src: [Ragh99])

SLIDE 29

DFG restructuring for LP (cont’d)

Example: linear time-invariant signal processing system (e.g. IIR filter)

(Src: [Anand98])

Left figure:

Right figure:

SLIDE 30

Resource sharing for LP

What is resource sharing?

  • Mapping ops and variables in the CDFG to FUs, registers etc. and defining the interconnection between them to form an RTL implementation
  • It is the mapping from function to structure
  • It directly impacts power consumption by determining switching activity at various signals, buses, wires, macro blocks etc.

Observation:

  • As a result of resource sharing, variables’ values are time-multiplexed onto registers
  • Values that appear as input operands of ops are time-multiplexed to appear at inputs of FUs
  • Values that are transferred between FUs are sequenced to appear on interconnect units (buses and multiplexers)
  • Word-level temporal correlations of values on data path signals are determined by the correlations among variables and input operands of operations that are grouped together during resource sharing
  • Word-level correlations, in turn, determine bit-level switching activity

Idea: can signal correlations be exploited to reduce switched capacitances?

SLIDE 31

Resource sharing for LP (cont’d)

- exploiting signal correlation -

(Src: [Anand98])

Analysis of above scenario:

  • Focus on ‘+1’ and ‘+2’ operations
  • Two consecutive iterations of the DFG are shown. The sequence is: “+1_1”, “+2_1”, “+1_2”, “+2_2”
  • Values seen at adder input are: (a1, b1), (c1, d1), (a2, b2), (c2, d2)

What is the switching activity at the adder inputs determined by?
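One way to look at this question: the bit toggles at an adder input are the Hamming distances between consecutive operand values on that input. The sketch compares one shared adder with two dedicated adders for the left operand only. The operand values are hypothetical and chosen so that a1/a2 and c1/c2 are strongly correlated across iterations.

```python
def input_toggles(sequence, bits=8):
    """Total bit flips seen on one FU input for a sequence of operand values."""
    mask = (1 << bits) - 1
    return sum(bin((p ^ q) & mask).count("1") for p, q in zip(sequence, sequence[1:]))


a = [100, 101]  # left operand of '+1' in iterations 1 and 2 (assumed values)
c = [7, 8]      # left operand of '+2' in iterations 1 and 2 (assumed values)

# Shared adder: its left input sees a1, c1, a2, c2 in that order.
shared = input_toggles([a[0], c[0], a[1], c[1]])

# Dedicated adders: each input only sees its own variable across iterations.
dedicated = input_toggles(a) + input_toggles(c)

print(f"shared adder: {shared} toggles, dedicated adders: {dedicated} toggles")
```

When the interleaved operands are uncorrelated, sharing destroys the iteration-to-iteration correlation and increases switching, so sharing decisions should group operations whose operands are correlated.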

SLIDE 32

Resource sharing for LP (cont’d)

- exploiting signal correlation -

… switching activity is determined by:

first iteration,

Idea:

Exploit correlations between variables in behavioral description to minimize switched capacitance at RTL level

SLIDE 33

Resource sharing for LP (cont’d)

- exploiting signal correlation -

Scenarios of exploiting correlations:

S1:

S2:

SLIDE 34

Resource sharing for LP (cont’d)

- exploiting signal correlation -

Scenarios of exploiting correlations (cont’d):

S3:

(Src: [Anand98])

SLIDE 35

Resource sharing for LP (cont’d)

- exploiting signal regularity -

What is regularity?

Regularity refers to the repeated occurrence of computational patterns within an algorithm

Idea:

Regularity can be exploited to reduce interconnect power by detecting instances of repetitive patterns in the computation and sharing resources such that the same interconnect structure in the data path is reused for as many instances of computation patterns as possible

SLIDE 36

Resource sharing for LP (cont’d)

- exploiting signal regularity -

Ex 1:

  • Scenario b): given on the left-hand side a data flow graph; on the right an implementation
  • Problem: for A2 -> M1, A2 -> M1, A1 -> S1, multiplexers are needed
  • Scenario a): same graph but mapping of DFG to data path does not require multiplexers since A1 -> M1, for example, can be reused
  • Conclusion: scenario b) does not preserve regularity, a) does

So, where does overhead power consumption come from?

  • 1. the more fan-outs, the larger are the interconnects and therefore the switched capacitances
  • 2. the overhead of multiplexers (and probably buffers) leads to more switching activity and therefore higher power consumption

(Src: [Mehra])

SLIDE 37

Resource sharing for LP (cont’d)

- exploiting signal regularity -

Idea: defining

  • E-instance: a pair of nodes connected by an edge
  • E-template: type of an E-instance, classified by type of input/output port. Ex: (add->add.right) means a template with an adder whose output maps to the right input of another adder
  • E-coverage: # of instances of that type divided by total # of edges in graph
  • Task here: using E-templates in synthesis so as to minimize power, with E-coverage as quality measure, through regular assignment

Ex 2: a fourth-order cascade filter

Disadvantages?

  • may require more hardware units since sufficient E-templates need to be provided -> under circumstances power savings come at cost of more hardware

(Src: [Mehra])
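E-coverage as defined here is a simple ratio and can be computed directly from an edge list. In this sketch each E-instance is represented by its template as a pair (source operator type, destination port); the representation and the example edges are assumptions for illustration.

```python
from collections import Counter


def e_coverage(edges):
    """Map each E-template to (# of its instances) / (total # of edges)."""
    counts = Counter(edges)
    total = len(edges)
    return {template: n / total for template, n in counts.items()}


# Hypothetical DFG edges, each already classified by its E-template.
edges = [
    ("add", "add.right"),
    ("add", "add.right"),
    ("add", "mul.left"),
    ("mul", "add.left"),
]
coverage = e_coverage(edges)
for template, share in coverage.items():
    print(f"{template[0]}->{template[1]}: {share:.2f}")
```

A template with high coverage, such as add->add.right in this example, is a good candidate for a dedicated interconnect that many instances can reuse.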

SLIDE 38

Resource sharing for LP (cont’d)

- exploiting signal regularity -

Example: E-template-based scheduling

(Src: [Mehra])

SLIDE 39

Glitch power reduction

What is “glitch power”? Power consumption that is related to hazards, i.e. temporary values at the input/output of gates that cannot be explained when considering a truth table only. It is due to different propagation delays in combinatorial paths

Shown: # transitions / # transitions w/o glitches (Src: [Ragh99])

Analysis of the above example unveiled:

  • A rising transition on signal x1 was frequently accompanied by a falling transition on c11. Thus, the rising transition on x1 and the falling transition on c11 are highly correlated.
  • Transitions on signal x1 arrive earlier than transitions on signal c11 due to: a) non-balanced paths, b) wiring delays.

SLIDE 40

Glitch power reduction (cont’d)

In general, glitches are generated at the control signals due to the simultaneous presence of the following two conditions:

  • 1) Functional: Correlation between rising and falling transitions at two or more signals that feed a gate.
  • 2) Temporal: The controlling to non-controlling transition arrives earlier at the gate’s input (see example last slide)

Note: a glitchy signal propagates (and therefore consumes even more power at other gates in the circuitry) => try to eliminate the glitch as close to the location of first occurrence (i.e. where it is generated) as possible

Glitches may also be generated by data path blocks: examples

  • Assumption: input signals are considered to be glitch-free and to arrive simultaneously => glitches reported are generated by the respective block
  • Note: comparator ‘=’ is glitch-free -> seems to indicate that the logic inside is well-balanced

(Src: [Ragh99])
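The two conditions can be illustrated with a tiny event-driven model of a single 2-input AND gate (controlling value ‘0’). The zero gate delay, the event format and the arrival times are simplifying assumptions; events with identical time stamps are applied one after another, so truly simultaneous arrivals are not modeled.

```python
def output_transitions(initial, events, gate=lambda a, b: a & b):
    """Count output transitions of a 2-input gate for timed input events.

    initial: starting values of the inputs, e.g. {"a": 0, "b": 1}
    events:  list of (time, input name, new value)
    """
    state = dict(initial)
    out = gate(state["a"], state["b"])
    count = 0
    for _, name, value in sorted(events):
        state[name] = value
        new_out = gate(state["a"], state["b"])
        if new_out != out:
            count += 1
            out = new_out
    return count


start = {"a": 0, "b": 1}  # input a will rise, input b will fall (correlated)

# Rising (controlling -> non-controlling) transition arrives first: glitch.
early_rise = output_transitions(start, [(1, "a", 1), (3, "b", 0)])

# Falling transition arrives first: the output stays at 0.
late_rise = output_transitions(start, [(3, "a", 1), (1, "b", 0)])

print(early_rise, late_rise)
```

The output is functionally ‘0’ before and after in both cases, yet the early rising input produces two extra transitions, which is exactly the power that glitch reduction tries to avoid.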

SLIDE 41

Glitch power reduction (cont’d)

Example: multiplexer of 2 8-bit words is controlled by a comparator “<”

  • Assumption: comparator generates glitches; A,B are glitch-free

Shown: a bit slice of words A and B

In table: glitchy and non-glitchy signals; # of transitions w/o glitches, total # of transitions

Discussion (denotation: <A_i,B_i>)

  • <0,0> : glitch cannot propagate
  • <0,1>, <1,0> : glitch propagates at G1 for <1,0> and at G2 for <0,1>
  • <1,1> : glitch always propagates

(Src: [Ragh99])

How to prevent glitches?

  • For example: use spatial correlations

SLIDE 42

Glitch power reduction (cont’d)

Spatial correlation: observation: value of S is irrelevant in case <1,1> (anyway ‘1’ at OUT_i) => insert gate G_c => propagation of glitch on S is prohibited

Result:

(Src: [Ragh99])

Q: why can’t just a buffer be inserted between S and input of G2 (idea here: the glitches compensate and therefore eliminate each other)?

  • These and other techniques need to be built into synthesis tools and component libraries in order to prevent glitches in the first place
  • More techniques to prevent glitches during synthesis can be found in [Ragh99]

Problem with glitch reduction techniques:

  • Power overhead of additional gates (might reduce gains obtained through reduced number of glitches)
  • Increased power of existing gates due to more inputs (example above: G_3 has 3 instead of 2 inputs)
  • …

SLIDE 43

Clock gating

What is clock gating? Idea: suppress or disable transitions from propagating to parts of the clock network under specific conditions that are determined by the clock gating circuitry

How can clock gating result in power savings?

Reduced capacitive switching in the clock network like

  • clock buffers
  • interconnect of the clock network
  • latches/registers that are fed by the clock signal

Also:

  • may prevent storage elements from loading unnecessary new values and thus saving power

(Src: [Anand98])

SLIDE 44

Clock gating (cont’d)

(Src: [Anand98])

  • Register re-loads previous value when comparator output is ‘0’ => transition at the clock input to register can be suppressed and transitions can be spared
  • Scheme 1: register clock input would be forced to ‘0’ when comp is ‘0’ (desired)
  • Scheme 1: does not work since comp output is not stable before clock edge rises
  • Scheme 2: register clock input is forced to ‘1’ when comparator output evaluates to ‘0’ (desired)
  • Scheme 2: OK (as long as gating condition stabilizes before clock does ’0’ -> ‘1’)

SLIDE 45

Clock gating (cont’d)

Typically: a) existing signals in the circuitry may be used for gating parts of the clock network or, b) signals from the previous clock cycle may be used (in that case those signal values need to be stored in latches)

Example:

  • Decode stage of a micro-processor pipeline can be used to clock-gate later stages

In other cases:

  • Additional circuitry needs to be added

Pitfalls and overheads:

  • Introducing additional gates in clock tree may lead to an increase in clock delay and clock skew
  • Ensure that gating clocks does not introduce glitches, otherwise: malfunction due to spurious loading of registers
  • Circuits with gated clock introduce additional complexity to synthesis and analysis tools

SLIDE 46

Clock gating (cont’d)

  • Consideration: identify conditions when next state and primary output conditions do not change
  • Gating the clock at its root (e.g. for whole circuitry) => eliminates clock skew problem
  • Added circuitry may incur additional power etc. => try to detect subset of idle conditions at low overhead

(Src: [Anand98])

Idea: automated gated-clock synthesis (architecture, see above)

  • Synthesizing an activation function F_a: goes to ‘1’ when clock needs to be stopped
  • Latch L ensures that glitches are not propagated to clock signal
  • AND gate eventually suppresses the clock for the whole circuitry

SLIDE 47

Clock gating (cont’d)

- synthesizing F_a -

Given: a Moore FSM

  • set of inputs
  • set of outputs
  • set of states
  • initial states
  • next-state function
  • output function

Note: for a Moore FSM the output is a function only of the current state and not of input variables. A self-loop in the state transition graph (STG) corresponds to an idle condition => condition where clock to FSM register can be suppressed

  • State s_i with self-loop: a function captures the set of input conditions under which the self-loop of state s_i is traversed
  • x_i: decoded state variable corresponding to s_i; x_i = 1 iff FSM is in state s_i

(Src: [Anand98])

Activation function:

Note: activation function might be complex
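A sketch of the self-loop idea: an activation function is derived from a next-state table by collecting all (state, input) pairs whose transition is a self-loop, and the state register is only clocked when that function is false. The table-based representation and the example FSM are assumptions; the slide's symbolic formula for F_a is not part of this transcript.

```python
def activation_function(next_state):
    """Build F_a from a table {(state, input): next state}.

    F_a(state, input) is True exactly when the STG takes a self-loop,
    i.e. when the clock of the state register may be stopped.
    """
    idle = {(s, i) for (s, i), nxt in next_state.items() if nxt == s}
    return lambda state, inp: (state, inp) in idle


# Hypothetical 2-state Moore FSM with a 1-bit input.
NEXT = {
    ("WAIT", 0): "WAIT",  # self-loop: idle condition
    ("WAIT", 1): "RUN",
    ("RUN", 0): "WAIT",
    ("RUN", 1): "RUN",    # self-loop: idle condition
}
f_a = activation_function(NEXT)


def simulate(inputs, state="WAIT"):
    """Run the FSM and count clock edges that actually reach the state register."""
    clock_edges = 0
    for inp in inputs:
        if not f_a(state, inp):
            clock_edges += 1
            state = NEXT[(state, inp)]
    return state, clock_edges


final_state, edges_used = simulate([0, 0, 1, 1, 1, 0])
print(final_state, edges_used)
```

In this run only two of six clock cycles reach the register. A real implementation has to weigh that saving against the power of the logic that computes F_a, which may be complex.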

SLIDE 48

Clock gating for data paths

  • Often in data paths: output of register is fed back as one of the data inputs through, for example, a multiplexer network
  • Task: find condition under which this is the case by traversing the path through the multiplexer network

(Src: [Anand98])

The gating condition for the clock input is:

$$contr[0] \wedge contr[1]$$

  • Note: select signals are already present in network => only invert (if necessary) and conjunction need to be provided for gating condition
  • Caution: strategy does not guarantee timing requirements (i.e. gating condition should stabilize before clk goes ‘1’->’0’)
  • In order to avoid slower clocking: derive reduced gating condition

SLIDE 49

Clock gating for data paths (cont’d)

Shown: a clock tree that has gated clock conditions at various points (levels)

Tradeoff:

  • Disabling clock at higher level in the tree => a larger capacitance (sum of all smaller ones) is prevented from switching
  • But: clock transition at certain level can only be suppressed if _all_ registers of the certain sub-tree can retain their old values ⇒ gating condition is satisfied fewer times => reduction of # of transitions saved
  • On the other side: when doing at lower level, more transitions could have been saved but that costs more logic (that itself consumes power …)

[Figure: clock tree with signals CLOCK, IDLE CONDITION, GATED CLOCK] (Src: [Anand98])

slide-50
SLIDE 50

Hardware Power 50

Clock gating - clock tree construction -

Observation: the way the clock tree is constructed has an impact on gating the clock tree (see example) (Src: [Anand98]):

Left: a clk tree with four registers R1, R2, R3, R4 and the conditions under which the clk tree can be disabled. x1,…,x4 are decoded controller state variables which are mutually exclusive (no two of them can be ‘1’ simultaneously).

Observation: R1, R2 are grouped under one clock sub-tree even though the conjunction of their idle conditions can never be true => not possible to gate the clock at point “A”, for example (similarly for R3, R4).

Right: gating condition for the sub-tree under “A”:

$GC_A = x_1 \cdot (x_1 + x_3) \rightarrow x_1$

(“B” may be grouped similarly)

Advantage: more suited to gated clocks, since groups of registers with similar or overlapping idle conditions are closer together => sub-trees can be shut down more efficiently
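The effect of the grouping can be checked by enumerating the mutually exclusive controller states. The sketch below is illustrative only: the idle conditions assumed for the left tree (R1 idle on x1, R2 idle on x2) are hypothetical, since the figure is not reproduced here; the right-tree expression follows the gating condition for point “A” given above.

```python
def one_hot_states(n=4):
    """All assignments of x1..xn in which exactly one variable is '1'."""
    return [tuple(int(i == k) for i in range(n)) for k in range(n)]


def gc_left(x):
    """Point "A", left tree. Assumed idle conditions: R1 on x1, R2 on x2."""
    x1, x2, x3, x4 = x
    return x1 & x2


def gc_right(x):
    """Point "A", right tree: GC_A = x1 * (x1 + x3)."""
    x1, x2, x3, x4 = x
    return x1 & (x1 | x3)


states = one_hot_states()
print([gc_left(x) for x in states])    # never 1: the sub-tree cannot be gated
print([gc_right(x) for x in states])   # equals x1 in every state
```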

slide-51
SLIDE 51

Hardware Power 51

Clock gating - multiple clocks -

Observation: some components of a circuit may follow simple regular patterns. In particular, a component may be idle and active in alternating clock cycles or so

=> clock gating circuitry need not necessarily be data dependent

Example: A circuit with all registers fed by a single clock; the whole capacitance is C and the frequency is f

Assume: the design is partitioned into two parts, each fed by a clock signal with f/2, with capacitances C1, C2. Power savings can be achieved if:

$C_1 \cdot \frac{f}{2} + C_2 \cdot \frac{f}{2} < C \cdot f$

$C_1 + C_2 < 2 \cdot C$

So the circuit needs to be partitioned carefully, which should often be possible to achieve

Note:

  • savings here are for the clock tree only
  • the f/2 does not result in performance penalties in this example
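The inequality can be checked numerically with a short Python sketch. It assumes the usual switched-capacitance model P = C · Vdd² · f with an activity factor of 1 for the clock net; the supply voltage and all capacitance and frequency values are hypothetical.

```python
def clock_power(capacitance, frequency, vdd=1.0):
    """Switched-capacitance clock power, P = C * Vdd^2 * f (activity factor 1)."""
    return capacitance * vdd ** 2 * frequency


def two_clock_saving(c, c1, c2, f):
    """Power saved by splitting a clock net C at f into C1 and C2 at f/2 each."""
    single = clock_power(c, f)
    split = clock_power(c1, f / 2) + clock_power(c2, f / 2)
    return single - split


# Hypothetical numbers: 10 pF at 100 MHz.
print(two_clock_saving(10e-12, 6e-12, 7e-12, 100e6) > 0)    # C1 + C2 < 2C
print(two_clock_saving(10e-12, 12e-12, 9e-12, 100e6) > 0)   # C1 + C2 > 2C
```

The first partitioning saves power even though C1 + C2 exceeds C (extra routing), because the sum stays below 2·C; the second one does not.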

slide-52
SLIDE 52

Hardware Power 52

Clock gating - multiple clocks (cont’d) -

Idea: using multiple non-overlapping clocks. Example:

Scheduled DFG (left fig.): the clock cycles of the schedule s1,…,s5 have been assigned to two non-overlapping clock domains, CLOCK1 and CLOCK2, in an alternating way

Right fig.: shows a single-clock RTL circuit that implements the given DFG using minimal resources but does not implement the clock partitioning shown in the left fig.

(Src: [Anand98])

slide-53
SLIDE 53

Hardware Power 53

Clock gating - multiple clocks (cont’d) -

Shown: RTL circuit that has been implemented with two clocks (Src: [Anand98])

Restrictions:

a) an op scheduled in CLOCK1 cannot share an FU with an op scheduled in CLOCK2

b) a variable generated in CLOCK1 cannot share a register with a variable generated in CLOCK2

Why? =>

  • 1. ensures that each register can be clocked by either CLOCK1 or CLOCK2
  • 2. the data path can be partitioned into two domains such that there is switching activity only in their respective active clock cycles
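The two sharing restrictions can be expressed as a simple check over a binding. The following Python sketch assumes the alternating assignment of control steps to CLOCK1 and CLOCK2 described for the scheduled DFG; the operation names, resource names and bindings are hypothetical and do not reproduce the figure.

```python
def clock_domain(step: int) -> str:
    """Alternating assignment: s1, s3, s5 -> CLOCK1; s2, s4 -> CLOCK2."""
    return "CLOCK1" if step % 2 == 1 else "CLOCK2"


def sharing_violations(binding):
    """Return the resources used by items from both clock domains.

    binding maps an operation (or variable) name to (control step, resource),
    where the resource is the FU (or register) it is bound to.
    """
    domains = {}
    for step, resource in binding.values():
        domains.setdefault(resource, set()).add(clock_domain(step))
    return sorted(r for r, d in domains.items() if len(d) > 1)


# Hypothetical bindings, not the DFG from the figure.
ok = {"+1": (1, "ADD1"), "+2": (3, "ADD1"), "*1": (2, "MUL1")}
bad = {"+1": (1, "ADD1"), "+2": (2, "ADD1"), "*1": (4, "MUL1")}
print(sharing_violations(ok), sharing_violations(bad))
```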

slide-54
SLIDE 54

Hardware Power 54

Clock gating - some disadvantages -

  • Inserting additional gates into the clock tree can lead to an increase in the clock delay and clock skew
  • Circuit malfunction due to spurious loading of registers if it is not taken into consideration that the gating logic might introduce glitches
  • Increased complexity for synthesis and analysis tools

slide-55
SLIDE 55

Hardware Power 55

Power savings through pre-computation

Basic idea: Pre-compute (i.e. predict) the output of a circuit one cycle ahead with additional logic and then switch off the original logic

What is the rationale behind it? For a majority of input values, the output might be computed with very simple logic, but in order to cover all inputs, complex logic is necessary

Figure: original circuit; circuit with pre-computation logic (Src: [Devadas])

Assumptions: ‘A’ represents combinational logic; it has one output

Architecture: introduce new functions g1, g2 as follows:

$g_1 = 1 \Rightarrow f = 1$

$g_2 = 1 \Rightarrow f = 0$

Tasks of g1, g2: a) pre-compute, b) switch off the original circuit in certain cases when prediction is possible

Note:

a) g1, g2 only cover a subset of all x1,…,xn (desirable: a large coverage)

b) imposes overhead in the form of power, area and probably performance

slide-56
SLIDE 56

Hardware Power 56

Power savings through pre-computation

Example: a comparator of two n-bit values C and D that results in ‘1’ if C > D (Src: [Devadas])

In that case, g1 can be defined to test the MSB:

$g_1 = C_{n-1} \cdot \overline{D_{n-1}}$

Accordingly, g2 is defined as:

$g_2 = \overline{C_{n-1}} \cdot D_{n-1}$

If g1=1 => C>D; g2=1 => D>C

So, an XNOR of the two MSBs needs to be computed for the pre-computation logic

Note: assuming a uniform probability for the inputs, the probability that the XNOR results in ‘1’ (MSBs equal, no prediction possible) is 50%, so the output can be predicted in the other 50% of the cases! If the bit MSB-1 is also tested, the probability of a prediction is 75%!

=> With little effort the output of the circuit can be predicted and power can be saved by switching off (gating) the original circuit

For large n, the power dissipation of the additional XNOR gate can be neglected
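The comparator example can be verified exhaustively for a small word width. The Python sketch below is a minimal illustration, assuming the predictor functions test only the most significant bits as described above; the function names and the choice n = 4 are assumptions made for the example.

```python
from itertools import product


def msb(value: int, n: int) -> int:
    """Most significant bit of an n-bit value."""
    return (value >> (n - 1)) & 1


def g1(c: int, d: int, n: int) -> int:
    """Predicts f = 1 (C > D): MSB of C is 1 and MSB of D is 0."""
    return msb(c, n) & (1 - msb(d, n))


def g2(c: int, d: int, n: int) -> int:
    """Predicts f = 0 (D > C): MSB of C is 0 and MSB of D is 1."""
    return (1 - msb(c, n)) & msb(d, n)


def predictions_are_sound(n: int) -> bool:
    """Check g1 = 1 => C > D and g2 = 1 => D > C for all n-bit pairs."""
    for c, d in product(range(2 ** n), repeat=2):
        if g1(c, d, n) and not c > d:
            return False
        if g2(c, d, n) and not d > c:
            return False
    return True


def coverage(n: int, k: int) -> float:
    """Fraction of input pairs decided by comparing only the top k bits."""
    shift = n - k
    pairs = list(product(range(2 ** n), repeat=2))
    decided = sum(1 for c, d in pairs if (c >> shift) != (d >> shift))
    return decided / len(pairs)


print(predictions_are_sound(4), coverage(4, 1), coverage(4, 2))
```

For uniformly distributed inputs the coverage is 1 - 2^(-k) when the top k bits are tested, which matches the 50% and 75% figures stated above.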

slide-57
SLIDE 57

Hardware Power 57

Managing power through scheduling

Recall: “pre-computation” is a shut-off technique that works on a clock cycle basis and is limited to the given structure of the logic

Can power be managed at a higher level?

Observation: it is common for performance optimization to compute all outcomes of a conditional operation in parallel with the condition evaluation itself

The appropriate result will be chosen and the other one(s) will be discarded => this is ineffective in terms of power consumption

Idea: power effective: during scheduling, enforce the control dependencies between the operations in the CDFG and the conditional operation they depend on

May be accomplished by meeting performance constraints first and then optimizing power consumption

slide-58
SLIDE 58

Hardware Power 58

Managing power through scheduling (cont’d)

Figure: panels a), b), c) (Src: [Anand98])

Example: expression |a-b|

Assumption: each op ‘-’ and ‘>’ takes one clock cycle and the ‘sel’ operation may be chained with any other operation

Constraint: a schedule within 2 cycles => two possible schedules b) and c)

Schedule b): ignores control dependencies; a-b and b-a are executed independently of ‘>’; there is flexibility in scheduling ‘>’

Problem: from a power point of view b) is inefficient: both a-b and b-a are always executed

Schedule c): a-b or b-a are activated exclusively, depending on the outcome of ‘>’. a-b and b-a may be assigned to the same or to two different subtractors (latter case: one needs to be shut down)
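The difference between the two schedules can be illustrated by counting executed operations. The Python sketch below only models the number of operations per evaluation of |a-b|, not cycle timing or actual power; the function names and operand pairs are hypothetical.

```python
def abs_diff_schedule_b(a, b, counts):
    """Schedule b): both subtractions run regardless of the comparison."""
    counts["sub"] += 2
    counts["cmp"] += 1
    d1, d2 = a - b, b - a
    return d1 if a > b else d2      # 'sel' chained with the last operation


def abs_diff_schedule_c(a, b, counts):
    """Schedule c): '>' runs in cycle 1, exactly one subtraction in cycle 2."""
    counts["cmp"] += 1
    counts["sub"] += 1
    return a - b if a > b else b - a


def run(schedule, inputs):
    """Evaluate all operand pairs and return (results, operation counts)."""
    counts = {"sub": 0, "cmp": 0}
    results = [schedule(a, b, counts) for a, b in inputs]
    return results, counts


inputs = [(7, 3), (2, 9), (5, 5)]   # hypothetical operand pairs
print(run(abs_diff_schedule_b, inputs))
print(run(abs_diff_schedule_c, inputs))
```

Both schedules produce the same results, but schedule c) executes half as many subtractions.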

slide-59
SLIDE 59

Hardware Power 59

Power savings through operand isolation

Note: previous techniques (pre-computation, gated clocks) are only applicable to blocks of combinational logic that are fed by registers

Here: applicability to circuit blocks that are embedded within combinational logic

Idea here:

  • Disable transitions of variables at the inputs
  • Insert transparent latches at all inputs of the embedded block
  • If the block does not perform any useful operation: a) the transparent latches at the inputs are disabled, b) they retain the previous cycle’s values => avoids unnecessary power consumption

(Src: [Anand98])
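The effect of the transparent latches can be sketched by counting bit transitions at one input of the embedded block. The Python model below is a simplification under stated assumptions: one operand value per cycle, an ideal latch that holds its previous value while disabled, and a hypothetical operand trace.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def input_toggles(values, useful, isolate: bool) -> int:
    """Count bit transitions seen at one input of the embedded block.

    values[i] is the operand offered in cycle i, useful[i] tells whether the
    block performs a useful operation in that cycle. With isolate=True a
    transparent latch holds the previous value whenever useful[i] is False.
    """
    toggles, seen = 0, values[0]
    for value, active in zip(values[1:], useful[1:]):
        if isolate and not active:
            continue                 # latch disabled: the input stays frozen
        toggles += hamming(seen, value)
        seen = value
    return toggles


values = [0b0000, 0b1111, 0b0000, 0b1010, 0b1010]   # hypothetical operand trace
useful = [True, False, False, True, True]
print(input_toggles(values, useful, isolate=False))
print(input_toggles(values, useful, isolate=True))
```

Without isolation the block sees every transition of the operand; with isolation only the transitions in useful cycles reach it.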

slide-60
SLIDE 60

Hardware Power 60

Power savings through operand isolation: - guarded evaluation -

Figure: pure guarded evaluation (Src: [Anand98])

o – signal in a combinational circuit

F – logic that is computing o

I – set of inputs to F

ODC_o – observability don’t care set with respect to o, i.e. the set of primary input assignments to the entire circuit such that the value of o has no influence on the values at the primary outputs

LE – an arbitrary signal of the existing circuit such that LE => ODC_o, i.e.

$\overline{LE} + ODC_o \equiv 1$

Thus, when LE = 1, the value on o is not needed to compute the primary outputs.

$t_e(I)_{LE=1}$ – the earliest time at which any of the inputs in I can change its value when LE=1

$t_l(LE)_{LE=1}$ – the latest time at which LE can stabilize to logic value ‘1’

=> LE can be used to control the guard logic. The transparent latches need to be disabled in time, i.e. early enough to cut off transitions on any of the inputs in I (and thus save power):

$t_l(LE)_{LE=1} < t_e(I)_{LE=1}$

(Src: [Tivari])
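The condition LE => ODC_o can be checked by exhaustive enumeration on a small example. The Python sketch below uses a hypothetical circuit, a 2:1 multiplexer that forwards o to the primary output only when sel = 1; the circuit and the candidate guard signals are assumptions made for illustration and are not taken from [Tivari].

```python
from itertools import product


def primary_output(sel, b, o):
    """Hypothetical circuit: a 2:1 mux that forwards o only when sel = 1."""
    return o if sel else b


def is_valid_guard(le):
    """Check LE => ODC_o by exhaustive enumeration.

    For every primary input assignment with LE = 1, forcing o to 0 or to 1
    must leave the primary output unchanged.
    """
    for sel, b in product((0, 1), repeat=2):
        if le(sel, b) and primary_output(sel, b, 0) != primary_output(sel, b, 1):
            return False
    return True


print(is_valid_guard(lambda sel, b: 1 - sel))   # LE = not sel
print(is_valid_guard(lambda sel, b: b))         # LE = b
```

The inverted select signal is a valid guard because o is unobservable whenever sel = 0; the signal b is not, since o can still reach the output while b = 1.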

slide-61
SLIDE 61

Hardware Power 61

Power savings through operand isolation: - relaxed guarded evaluation (cont’d) -

Idea: use guarded evaluation, but with a relaxed condition such that it becomes easier to find the shut-off condition (remember: the signal LE must be available somewhere in the existing circuit)

Timing condition: same as before for “pure guarded evaluation”

Two cases for LE=1 (note: for LE=0 the circuit is functioning correctly anyway):

  • 1. o is not needed to compute the primary outputs (OK)
  • 2. o is needed to compute a primary output. The circuit may operate incorrectly. An ‘OR’ gate is needed (see figure)

Comparing: pre-computation and guarded evaluation (GE)

  • Pre-computation needs additional circuitry; GE is derived from within the circuit
  • Pre-computation may require re-synthesis to efficiently derive additional circuits, whereas GE leaves the circuit as is (especially important in hand-optimized circuitry)

(Src: [Anand98]) (Src: [Tivari])

slide-62
SLIDE 62

Hardware Power 62

Power savings through operand isolation: - application to high-level synthesis -

Idea: apply operand isolation from logic level to high-level synthesis

In fact: the conditions under which a resource (e.g. a functional unit FU) is not used are readily available from the scheduling and resource sharing!

=> idle cycles can be derived from the circuits (see next slide)

Figure: scheduled example DFG (Src: [Anand98])
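Deriving the idle cycles from scheduling and binding information can be sketched in a few lines of Python. The schedule below is hypothetical and does not reproduce the example DFG from the figure; the data layout (operation name mapped to control step and functional unit) is likewise an assumption.

```python
def idle_cycles(schedule, num_steps):
    """Return, per functional unit, the control steps in which it is idle.

    schedule maps an operation name to (control step, functional unit).
    """
    busy = {}
    for step, fu in schedule.values():
        busy.setdefault(fu, set()).add(step)
    all_steps = set(range(1, num_steps + 1))
    return {fu: sorted(all_steps - steps) for fu, steps in busy.items()}


# Hypothetical schedule and binding, not the DFG from the figure.
schedule = {
    "+1": (1, "ADD1"), "+2": (3, "ADD1"),
    "-1": (2, "SUB1"),
    ">1": (3, "CMP1"),
}
print(idle_cycles(schedule, num_steps=3))
```

The resulting idle steps are the cycles in which the latches at the inputs of the respective functional unit would be disabled.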

slide-63
SLIDE 63

Hardware Power 63

Power savings through operand isolation: - application to high-level synthesis (cont’d) -

Figure: idle cycles of functional units (Src: [Anand98])

Insert transparent latches at the FU’s inputs to perform operand isolation

Q: Why are there no latches at the inputs of “CMP1”?

A: The values that feed its inputs do not change in the cycles in which it is idle

slide-64
SLIDE 64

Hardware Power 64

Power savings through operand isolation: - application to high-level synthesis (cont’d) -

Possible disadvantages: The isolation technique attempts to eliminate spurious activity at the inputs of embedded resources (e.g. functional units) by inserting transparent latches into the RTL implementation.

  • This incurs power and area overheads due to the addition of extra circuitry
  • Operand isolation also requires some delay constraints (the disabling transition at the transparent latch enable input should arrive before its data input can change).

=> Satisfaction of the delay constraints may require the addition of extra circuit delay in the critical path, which may not be acceptable for high-performance designs.

slide-65
SLIDE 65

Hardware Power 65

Power savings through constrained register sharing

Idea: rather than applying transparent latches to an already scheduled DFG, can’t the scheduling, mapping etc. already take power shut-down techniques into consideration?

Impact of variable assignment on power consumption:

Two candidate assignments, Assignment 1 and Assignment 2, are shown in the Table (next slide). The architectures obtained using these assignments were subjected to logic synthesis optimizations, and placed and routed using a cell library. The transistor-level netlists extracted from the layouts were simulated using a switch-level simulator with typical input traces to measure power.

For the circuit Design 1, synthesized from Assignment 1, the power consumption was 30.71 mW, and for the circuit Design 2, synthesized from Assignment 2, the power consumption was 18.96 mW (about 38% less)!

slide-66
SLIDE 66

Hardware Power 66

Power savings through constrained register sharing (cont’d)

Example: a DFG and two possible register assignments that differ significantly in power consumption

slide-67
SLIDE 67

Hardware Power 67

Power savings through constrained register sharing (cont’d)

Two distinct schedules.

Shown: FU (in box), input variables left and right of the operation; grey-shaded: variables at the input of the respective operation change value; spurious input transitions, i.e. those that do not correspond to a DFG operation, are marked with an ‘X’.

Observation: A functional unit whose inputs do not change does not perform a spurious operation

Conclusion: Constrained register sharing can save significant energy/power:

  • Upper case: 7 operations that do not correspond to a DFG operation
  • Lower case: only one such operation

Note:

a) Number of control steps is not increased

b) Number of HW resources (FUs) is still the same
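Counting spurious operations for a given register assignment can be sketched as follows. The Python model is an assumption-laden simplification: an FU is taken to perform an operation whenever the variables visible at its inputs change, and both input traces are hypothetical rather than the two assignments from the figure.

```python
def spurious_operations(fu_inputs, scheduled_steps):
    """Count control steps in which an FU computes without a DFG operation.

    fu_inputs[s] is the (left, right) pair of variables visible at the FU
    inputs in step index s. The FU performs an operation whenever its inputs
    change; the operation is spurious if no DFG operation is scheduled on the
    FU in that step (scheduled_steps holds the step indices with a DFG op).
    """
    count = 0
    for step in range(1, len(fu_inputs)):
        changed = fu_inputs[step] != fu_inputs[step - 1]
        if changed and step not in scheduled_steps:
            count += 1
    return count


# Hypothetical input traces for one FU with a DFG operation in step index 3.
shared = [("v1", "v2"), ("v5", "v2"), ("v5", "v6"), ("v3", "v4")]
constrained = [("v1", "v2"), ("v1", "v2"), ("v1", "v2"), ("v3", "v4")]
print(spurious_operations(shared, {3}), spurious_operations(constrained, {3}))
```

In the first trace the registers feeding the FU are reused for other variables in between, which triggers two spurious operations; in the second trace the inputs stay stable until the scheduled operation.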

slide-68
SLIDE 68

Hardware Power 68

Power savings through dynamic variable rebinding

Problem in previous example: There is still a spurious operation in control step s1 of each iteration:

  • The MUX selects R5 (to which v12 is assigned) from control step s3 of each iteration to control step s1 of the next iteration
  • v12 acquires a new value at step s1

Idea: Combine dynamic variable rebinding with variable assignment to completely eliminate the spurious operation

How:

  • Need to preserve the old (previous iteration) value of v12 at the input of SUB1 until the new value of v3 in the current iteration is generated
  • Then, the spurious operation can be eliminated
  • Achieved by swapping the variables assigned to registers R5 and R6 in alternate iterations
  • Result: see figure on the right (Src: [Anand98])

slide-69
SLIDE 69

Hardware Power 69

Managing power through the controller

Main idea: Finding a way to reduce power without having to spend large overhead in terms of hardware like transparent latches etc.

  • Rather: re-designing the existing control logic in order to reconfigure the multiplexer networks and functional units in the data path
  • Might not completely eliminate activity but is low-cost
  • Best suited to control flow intensive designs

(Src: [Anand98])

slide-70
SLIDE 70

Hardware Power 70

Managing power through the controller (cont’d)

Example: X.25 protocol (Src: [Anand98])

slide-71
SLIDE 71

Hardware Power 71

Managing power through the controller: X.25 example 1

Shown:

  • left: a part of the data path
  • middle/right: a) logic expressions for the control signals (x_i=1 => controller is in state s_i); b) activity graphs of the ALU (state transitions with the actions involved; ex: sel(0), sel(1), sel(2), SelectFunc(0) are 1, 0, 1, 0, respectively => “byte-byteCount” is performed)

Observation:

  • some states are actually idle states (gray shaded). This can be found out through scheduling info from HL synthesis.

Idea:

  • re-specify idle states such that switching is minimized.
  • Example: state transition s6->s4: the same signals are applied to the muxes such that the same operands stay stable. Since they do not change => no switching => no unnecessary power consumption

Conclusion: re-specifying control signals can lead to power savings (without changing anything else)

(Src: [Anand98])

slide-72
SLIDE 72

Hardware Power 72

Managing power through the controller: X.25 example 2

Shown:

  • Different part of the X.25 design: the register that stores variable I and the muxes that feed it through signals sel(18), M(18), sel(19)
  • Same convention as before applies (gray shaded states are inactive states as far as that respective part of the design is concerned)

Idea:

  • Can reduced activity in the mux tree lead to reduced power consumption?

Idea:

  • Consider s7->s1: “count+byteCount” -> “bytes-byteCount”. Since the operation changes, operand c28 also changes (c28 is the output of the ALU and feeds the mux tree)

Solution:

  • re-specifying the signals prevents this variable from propagating into the shown mux tree (see right state diagram)

slide-73
SLIDE 73

Hardware Power 73

Managing power through the controller (cont’d)

Re-labeling activity graphs: How to label an idle vertex in an activity graph:

  • Different incoming and outgoing transitions into/out of the idle state have different execution probabilities
  • The values of the data operands fed to the mux trees may themselves change => only selecting the same operand does not ensure that switching activity is minimized

slide-74
SLIDE 74

Hardware Power 74

Managing power through the controller: formalizing the problem

Shown:

  • A part of a data path and a part of the activity graph for the “<” operator
  • s3 is an idle state
  • shown are all incoming and outgoing arcs to/from s3

Goal:

  • assign to s3 one of the labels L1, L2, L3 such that the activity at the input of the “<” is minimized

Some conventions:

  • P(si -> sj) - probability of the controller state transition from si to sj
  • AM si->sj - activity matrix; stores the cost (average bit transitions) for the respective state transition. It only has an entry in a row/column if the respective transition is actually in the state transition graph
  • Nodes in the state transition graph are to be re-labeled while minimizing the labeling costs

Goal: find an L* such that the cost function is minimized => best re-labeling
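The search for L* can be sketched in Python. Note that the cost function itself is not reproduced in the text above, so the sketch assumes a probability-weighted sum of activity-matrix entries over all arcs entering and leaving the idle state; this cost model, the function names and all numbers are assumptions for illustration only.

```python
def labeling_cost(label, incoming, outgoing, activity):
    """Expected bit transitions caused by giving the idle state this label.

    incoming: list of (P(si -> idle), label of si)
    outgoing: list of (P(idle -> sj), label of sj)
    activity[(a, b)]: average bit transitions when the label changes from a to b
    Assumption: the cost is the probability-weighted sum over all arcs.
    """
    cost = sum(p * activity[(src, label)] for p, src in incoming)
    cost += sum(p * activity[(label, dst)] for p, dst in outgoing)
    return cost


def best_label(labels, incoming, outgoing, activity):
    """Return L*, the candidate label with minimal expected cost."""
    return min(labels, key=lambda l: labeling_cost(l, incoming, outgoing, activity))


# Hypothetical activity matrix (symmetric, zero cost for identical labels).
labels = ("L1", "L2", "L3")
pairs = {("L1", "L2"): 4.0, ("L1", "L3"): 2.0, ("L2", "L3"): 6.0}
activity = {(a, a): 0.0 for a in labels}
for (a, b), cost in pairs.items():
    activity[(a, b)] = activity[(b, a)] = cost

incoming = [(0.7, "L1"), (0.3, "L2")]   # hypothetical arcs into s3
outgoing = [(0.6, "L3"), (0.4, "L1")]   # hypothetical arcs out of s3
print(best_label(labels, incoming, outgoing, activity))
```

With these numbers L1 is chosen because the most probable incoming arc already carries L1, so the inputs of the “<” operator stay stable on that transition.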

slide-75
SLIDE 75

Hardware Power 75

Conclusion: Hardware power

Make sure what is to be reduced: peak power, average power (different strategies)

HW power sources:

  • Data path
  • Control path
  • Clock tree

Optimization strategies:

  • Operator scheduling for low power
  • Hardware power management (clock gating)
  • Re-labeling of controller

Note:

Very often there is a tradeoff: reduced power may lead to: more logic, more complex design, reduced performance, …

slide-76
SLIDE 76

Hardware Power 76

References and sources

  • [Heer04] Ch. Heer, U. Schlichtmann, “Ultra-Low-Power Design: Device and logic design approaches”, pp. 1-20, in “Ultra Low-Power Electronics and Design”, Kluwer, 2004.
  • [Anand98] A. Raghunathan, N.K. Jha, S. Dey, “High-level power analysis and optimization”, Kluwer Academic Publishers, 1998.
  • [Sarraf95] S. Raje, M. Sarrafzadeh, “Variable voltage scheduling”, IEEE/ACM ISLPED 1995, pp. 9-14, 1995.
  • [Knight] R.S. Martin, J.P. Knight, “Power-Profiler: Optimizing ASIC’s Power Consumption at the behavioral level”, Proc. of IEEE/ACM Design Automation Conf. (DAC’95), pp. 42-47, 1995.
  • [Macii04] E. Macii (Ed.), “Ultra Low-Power Electronics and Design”, Kluwer Academic Publishers, 2004.
  • [Devadas] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, M. Papaefthymiou, “Precomputation-based Sequential Logic Optimization For Low Power”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 6-10, 1994, pp. 74-81.
  • [Ragh99] A. Raghunathan, S. Dey, N.K. Jha, “Register transfer level power optimization with emphasis on glitch analysis and reduction”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 18, Issue 8, pp. 1114-1131, Aug. 1999.
  • [Tivari] V. Tiwari, S. Malik, P. Ashar, “Guarded evaluation: pushing power management to logic synthesis/design”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, Issue 10, pp. 1051-1060, Oct. 1998.
  • [Mehra] R. Mehra, J. Rabaey, “Exploiting Regularity for Low Power Design”, IEEE/ACM Intl’ Conference on Computer Aided Design (ICCAD96), pp. 166-172, 1996.