Low Power Design - Thomas Ebi and Prof. Dr. J. Henkel, CES - Chair for Embedded Systems



SLIDE 1
  • T. Ebi and J. Henkel, KIT, SS13

http://ces.itec.kit.edu Hardware Power 1

Low Power Design

Thomas Ebi and Prof. Dr. J. Henkel CES - Chair for Embedded Systems Karlsruhe Institute of Technology, Germany

3. Hardware power optimization and estimation

SLIDE 2

Overview

• Levels of abstraction
  • system
  • RTL
  • gate
  • transistor

• Tasks
  • Optimize (i.e. minimize for low power)
  • Design / co-design (synthesize, compile, …)
  • Estimate and simulate

[Figure: components consuming power: battery issues, software, OS, interconnect, hardware, memory]

SLIDE 3

Generic HW synthesis flow

(Src: [Anand98])

SLIDE 4

Low power HW design flow

• Energy/power needs to be analyzed and optimized at each level of abstraction
• Therefore, appropriate power models for each level are necessary
• Shown in Fig.:
  • a) design flow w/o energy/power
  • b) design flow with energy/power

(Src: [Anand98])

SLIDE 5

Power consumption in HW

• A more detailed version than in the intro …
• In general, four components:
  • Switching capacity power
  • Short-circuit power
  • Leakage power
  • Static power

$$P_{avg} = P_{sw.cap} + P_{short\,circuit} + P_{leakage} + P_{static}$$

(Src: [Anand98])

SLIDE 6

Switching capacity power

• Caused by parasitic capacitors during switching: the Fig. (CMOS inverter) shows C_L, which is the effective capacitance of all parasitic capacitances
• Power due to transitions (N: switching activity, f: operating frequency):

$$P_{sw.cap} = \frac{1}{2} C_L V_{DD}^2 N f$$

• Means: a) reduce operating frequency, b) reduce C_L, c) reduce voltage, d) reduce switching activity
• Most common: reduce voltage
• => Problem: delay of gate t_d increases too!

$$t_d = \frac{k \, C_L \, V_{DD}}{(V_{DD} - V_{th})^2}$$

(Src: [Anand98])
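The trade-off between the two formulas on this slide can be made concrete with a small calculation. The sketch below is only illustrative: the load capacitance, threshold voltage, activity and frequency are assumed values, and the constant k is set to 1 because only relative delays are compared.

```python
def switching_power(c_load, v_dd, activity, freq):
    """P_sw.cap = 1/2 * C_L * V_DD^2 * N * f (in watts)."""
    return 0.5 * c_load * v_dd ** 2 * activity * freq


def gate_delay(c_load, v_dd, v_th, k=1.0):
    """t_d = k * C_L * V_DD / (V_DD - V_th)^2; the model needs V_DD > V_th."""
    if v_dd <= v_th:
        raise ValueError("model requires V_DD > V_th")
    return k * c_load * v_dd / (v_dd - v_th) ** 2


# Assumed example values: 50 fF load, 0.7 V threshold, N = 0.5, f = 100 MHz.
C_L, V_TH, N, F = 50e-15, 0.7, 0.5, 100e6
REFERENCE_DELAY = gate_delay(C_L, 5.0, V_TH)

for v_dd in (5.0, 3.3, 2.5):
    power_uw = switching_power(C_L, v_dd, N, F) * 1e6
    rel_delay = gate_delay(C_L, v_dd, V_TH) / REFERENCE_DELAY
    print(f"V_DD={v_dd:.1f} V  P={power_uw:.2f} uW  relative t_d={rel_delay:.2f}")
```

Halving V_DD quarters the switching power in this model, while the delay grows faster than linearly as V_DD approaches V_th.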

SLIDE 7

Short circuit power

• Explanation:
  • Caused by a direct supply-to-ground path
  • When the CMOS inverter in the Fig. changes from 1->0, there is a short time frame within which both the nMOS and the pMOS transistor are conducting => short-circuit current is drawn from the power supply

(Src: [Anand98])

SLIDE 8

Leakage power

• Leakage can be divided into three components:

$$P_{leakage} = (I_{diode} + I_{oxide} + I_{subthreshold}) \cdot V_{dd}$$

  • I_diode – refers to the diodes that are formed between diffusion regions and substrate. Very small compared to the other two
  • I_oxide – electrons tunneling through the gate oxide. Drops off exponentially with gate length
  • I_subthreshold – ‘off’ transistors still conduct some current. K, S: technology parameters; W_eff: effective transistor channel width
• NOTE: leakage power is predicted to be dominant in future silicon technologies

(Src: [Anand98])

SLIDE 9

Static power

• Not relevant in CMOS circuits
• Note: in some literature leakage power is denoted as “static power”
• Static power: only relevant in some nMOS circuits where there is a constant supply-to-ground path

SLIDE 10

Power consumption in HW: breakdown

[Figure: power over time for dynamic power, subthreshold leakage and gate-oxide leakage, including the trajectory if high-k dielectrics reach production (Src: [Heer04])]

• Leakage power will dominate in future (<100nm) silicon technologies
• One means to reduce leakage power is to deploy dielectrics with a high k-value

SLIDE 11

Hardware synthesis for low power

• Considered here: high-level synthesis (HLS), e.g.:
  • Operator scheduling
  • Module selection
  • Glitch power reduction
  • State transition reduction
  • …

SLIDE 12

Operator scheduling for low power

• What is scheduling in the context of high-level synthesis?
  • Scheduling assigns operations in the behavioral description to control steps or controller states. Scheduling determines cycle-by-cycle behavior, i.e. the sequence in which operations are performed
• Some repetition from ESI:
  • multicycling (clock period is rather short)
  • chaining (clock period is rather long)
  • finding the right clock cycle time is an optimization task itself
• Scheduling determines the sequence in which the various operations of the behavioral description are performed, and also dictates which operations and variables can share the same functional units and registers. Thus, scheduling can be used to enable resource sharing for low power by ensuring that correlated variables and operations with correlated operands are appropriately sequenced so that they can share the same resources

(Src: [Anand98])

SLIDE 13

Operator scheduling for low power (cont’d)

• Scheduling can be performed so as to enable maximum resource sharing between operations that belong to instances of the same computational pattern, resulting in maximal exploitation of regularity during resource sharing
• Scheduling can be used to distribute the slacks or mobilities of various operations in the DFG appropriately so that some operations may be performed using slower, more energy-efficient functional units. Thus, scheduling has an impact on the power trade-offs through module selection
• Scheduling determines the distribution of operations over time, and hence affects the profile of the power consumption in the implementation over time (control steps or clock cycles). Reducing peak power is important due to packaging, cooling, and reliability considerations. The effect of scheduling on peak power will be illustrated later
• => these tasks will be discussed in the following (some in the context of module selection)

(Src: [Anand98])

SLIDE 14

Operator scheduling for LP (cont’d)

Basic idea: use slack in a data flow graph (dependent upon timing constraints) and:
a) Vary V_dd of the ALU where the operator is to be executed, or
b) Assign operator(s) to a different ALU with a lower/higher (fixed) V_dd

$$t_d \propto \frac{C_L \, V_{DD}}{(V_{DD} - V_{th})^2}$$

[Figure: normalized dependency t_d = f(V_DD) (src: [Saraff95])]

SLIDE 15

Operator scheduling for LP (cont’d)

Problem: Given a data flow graph G = (V, E), a base execution time t_c (or V_dd), a set of supply voltages

$$S = \{V_{c1}, V_{c2}, \ldots, V_{ci}\}$$

and a timing constraint k * t_c, obtain a mapping $V \rightarrow S$ that minimizes

$$\sum_i (V_i)^2$$

such that the critical path length of the DFG is <= k * t_c

(src: [Saraff95])
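To make the problem statement tangible, the following sketch solves a tiny instance by exhaustive search. This is not the algorithm of [Saraff95]; it only illustrates the objective and the constraint. It assumes that every operation takes t_c at the base voltage, that delay scales with the t_d formula from the previous slide, and that the graph, voltage set and threshold voltage are made-up example values.

```python
from itertools import product


def delay_factor(v, v_th):
    """Relative gate delay, proportional to V / (V - V_th)^2."""
    return v / (v - v_th) ** 2


def critical_path(nodes, edges, delay):
    """Longest path length of a DAG whose nodes are listed in topological order."""
    finish = {}
    for n in nodes:
        start = max((finish[p] for p, s in edges if s == n), default=0.0)
        finish[n] = start + delay[n]
    return max(finish.values())


def assign_voltages(nodes, edges, voltages, t_c, k, v_base, v_th):
    """Try every mapping V -> S; keep the feasible one with minimal sum of V_i^2."""
    best = None
    for choice in product(voltages, repeat=len(nodes)):
        delay = {n: t_c * delay_factor(v, v_th) / delay_factor(v_base, v_th)
                 for n, v in zip(nodes, choice)}
        if critical_path(nodes, edges, delay) <= k * t_c + 1e-12:
            cost = sum(v * v for v in choice)
            if best is None or cost < best[0]:
                best = (cost, dict(zip(nodes, choice)))
    return best


# Hypothetical DFG: chain a -> b -> d is critical, c -> d has slack.
NODES = ["a", "c", "b", "d"]
EDGES = [("a", "b"), ("b", "d"), ("c", "d")]
best_cost, best_map = assign_voltages(NODES, EDGES, [5.0, 3.3, 2.4],
                                      t_c=10.0, k=3, v_base=5.0, v_th=0.7)
print(best_map)
```

In this instance only the off-critical operation c can be moved to a lower voltage. Exhaustive search grows exponentially with the number of operations, which is why a dedicated algorithm is needed for realistic graphs.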

SLIDE 16

Operator scheduling for LP (cont’d)
- algorithm -

• Step 1: initialization
• Step 2: computing slack
  • l(v) – longest path of the graph that goes through node v

(src: [Saraff95])
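Step 2 can be sketched as follows. The definition of l(v) is taken from the slide; defining the slack of a node as the deadline minus l(v) is an assumption made here for illustration, as are the example graph and the unit delays.

```python
def longest_path_through(nodes, edges, delay):
    """Return l(v) for every node of a DAG given in topological order."""
    preds = {n: [p for p, s in edges if s == n] for n in nodes}
    succs = {n: [s for p, s in edges if p == n] for n in nodes}

    into = {}      # longest path ending at v (including v)
    for n in nodes:
        into[n] = delay[n] + max((into[p] for p in preds[n]), default=0)

    out_of = {}    # longest path starting at v (including v)
    for n in reversed(nodes):
        out_of[n] = delay[n] + max((out_of[s] for s in succs[n]), default=0)

    # v is counted in both halves, so subtract its delay once.
    return {n: into[n] + out_of[n] - delay[n] for n in nodes}


def slack(nodes, edges, delay, deadline):
    """Assumed definition: slack(v) = deadline - l(v)."""
    l = longest_path_through(nodes, edges, delay)
    return {n: deadline - l[n] for n in nodes}


# Hypothetical DFG with unit delays and a deadline of 3 time units.
NODES = ["a", "c", "b", "d"]
EDGES = [("a", "b"), ("b", "d"), ("c", "d")]
DELAY = {n: 1 for n in NODES}
print(slack(NODES, EDGES, DELAY, deadline=3))
```

Nodes on the critical path get zero slack; only node c has slack that a later step could trade for a lower voltage.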

SLIDE 17

Operator scheduling for LP (cont’d)
- algorithm -

• Step 3: compute max slack value
• Step 4: compute dual graph

(src: [Saraff95])

SLIDE 18

Operator scheduling for LP (cont’d)
- algorithm -

• Step 5: weight assignment
• Step 6: compute longest weighted path
• Step 7: reassign voltages to nodes in longest path

(src: [Saraff95])

SLIDE 19

Operator scheduling for LP (cont’d)
- algorithm -

• Step 8: go back to step 2
• Conclusion: depending on the constraints, power consumption can be reduced by up to around 25%

SLIDE 20

Module selection for LP

• What is module selection?
  • The process of mapping operations from the CDFG to component templates of the RTL library
• Observation: with each operation, a functional unit template but not a specific instance is associated (that would be mapping)
• Example: a “+” operation may be implemented using a
  • A) ripple-carry adder
  • B) carry-lookahead adder
  • C) …
• Ripple-carry adder is slow but more efficient in switched capacitance; carry-lookahead adder is faster but less efficient in switched capacitance
• Similar tradeoffs exist in other operations
• Idea: tradeoffs can be exploited to fulfill power constraints through module selection

SLIDE 21

Module selection for LP (cont’d)

• Each operation in the DFG (middle) has been mapped to a fast component in order to meet performance constraints (in that case constraint: 85ns)
• But is that really necessary? => No, not all ops necessarily need to be mapped to the fastest module. The focus (for the timing constraint) should rather be on the critical path
• Idea: slack in off-critical-path ops may be used to select slower functional units that may have a better efficiency in switched capacitance (see right DFG). There, the mult op uses less power (but not less energy)
• Important: to have a large module library with distinct switching capacity efficiencies and performance characteristics

(Src: [Anand98])

SLIDE 22

Module selection for LP (cont’d)
- using multiple Vdd -

• Multiple supply voltages have also been used at logic level
• Idea: obtain low power at a rather small performance overhead (remember the equation for power through switched capacitances)
• How to do this in HL synthesis?
  • Need an RTL component library that contains multiple versions of each component, each referring to a different Vdd
  • Need to extend the module selection process to assign each op to a lib component template plus a supply voltage
  • The necessary level converters need to be inserted
  • The delay models (that evaluate the module selection) need to be sensitive to the dependence of delay on supply voltage
• Note: resource sharing for LP (see later) needs to be constrained: must not allow sharing of functional units that are assigned to different supply voltages

(Src: [Anand98])

SLIDE 23

Module selection for LP (cont’d)
- using multiple Vdd -

• Ex:
  • Off-critical-path op ‘*3’ has been assigned to a lower Vdd (3.3V)
  • A level converter has been inserted: it converts the result of ‘*3’ to the higher output level
  • Various ways to implement the level converter:
    a) a separate level converter circuit like DCVS (differential cascode voltage switch)
    b) it may be integrated into a register to reduce overhead

(Src: [Anand98])

SLIDE 24

Peak power

• Example: two possible schedules of a given graph (assumption: same power for each operation)
• ASAP schedule results in high peak power
• A possible slack may be used to reduce peak power without sacrificing any performance

(Src: [Anand98])
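The effect can be reproduced with a few lines of code. The graph below is a made-up example (not the one in the figure): operations a, b, c have no predecessors, d depends on a, and e depends on b, c and d. As on the slide, every operation is assumed to draw the same power.

```python
from collections import Counter


def power_profile(schedule, op_power=1.0):
    """schedule maps operation -> control step; returns power per control step."""
    counts = Counter(schedule.values())
    return {step: counts[step] * op_power for step in sorted(counts)}


def peak_power(schedule, op_power=1.0):
    return max(power_profile(schedule, op_power).values())


# ASAP: everything starts as early as possible.
ASAP = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3}
# Slack of c is used to move it into step 2; the schedule length stays 3.
BALANCED = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}

print(power_profile(ASAP), peak_power(ASAP))
print(power_profile(BALANCED), peak_power(BALANCED))
```

Both schedules finish in three control steps and consume the same total energy; only the distribution over time differs.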

SLIDE 25

Peak power (cont’d)

• Assumption: ‘1’, ‘2’ and ‘0’, ‘3’ may pair-wise share a module. The according times are t_A, t_B
• The average power consumption for the two schedules is:
• Ex1: t_A = 1,300ns, t_B = 660ns, P_A = 28.6mW, P_B = 56.4mW
• Note: average power has considerably improved but energy is approx. the same!
• => the trade-off between power and performance has been exploited to protect the circuit

[Figure: example graph, schedule and module lib (Src: [Knight])]
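The note on energy can be checked directly from the numbers of Ex1, since energy is average power times execution time. The helper name below is only for illustration.

```python
def energy_nj(power_mw, time_ns):
    """Energy in nJ from average power in mW and duration in ns (mW * ns = pJ)."""
    return power_mw * time_ns * 1e-3


E_A = energy_nj(28.6, 1300)   # schedule A: slower, lower average power
E_B = energy_nj(56.4, 660)    # schedule B: faster, higher average power

print(f"E_A = {E_A:.2f} nJ, E_B = {E_B:.2f} nJ")
print(f"power ratio P_B / P_A = {56.4 / 28.6:.2f}")
```

The average power differs by roughly a factor of two, while the two energies differ by well under one percent.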

SLIDE 26

Peak power (cont’d)

• Window is 60ns
• ‘C’ uses multicycling and no parallelism, has the lowest peak power and is slowest
• ‘H’ is fastest, has the highest peak power but comparable average power (according to the window)
• Note: optimizing for peak power and optimizing for average power are completely distinct tasks!
• Min delay: refers to when the operations are actually done, whereas the other case always assumes the 60ns window

(Src: [Knight])

SLIDE 27

Resource sharing for LP

• What is resource sharing?
  • Mapping ops and variables in the CDFG to FUs, registers etc. and defining the interconnection between them to form an RTL implementation
  • It is the mapping from function to structure
  • It directly impacts power consumption by determining the switching activity at various signals, buses, wires, macro blocks etc.
• Observation:
  • As a result of resource sharing, the values of variables are time-multiplexed onto registers
  • Values that appear as input operands of ops are time-multiplexed to appear at the inputs of FUs
  • Values that are transferred between FUs are sequenced to appear on interconnect units (buses and multiplexers)
  • Word-level temporal correlations of values on data path signals are determined by the correlations among variables and input operands of operations that are grouped together during resource sharing
  • Word-level correlations, in turn, determine bit-level switching activity
• Idea: can signal correlations be exploited to reduce switched capacitances?

SLIDE 28

Resource sharing for LP (cont’d)
- exploiting signal correlation -

• Analysis of above scenario:
  • Focus on ‘+1’ and ‘+2’ operations
  • Two consecutive iterations of the DFG are shown. The sequence of operations is: “+1_1”, “+2_1”, “+1_2”, “+2_2”
  • Values seen at the adder inputs are: (a1, b1), (c1, d1), (a2, b2), (c2, d2)
• What is the switching activity at the adder inputs determined by?

(Src: [Anand98])

SLIDE 29

Resource sharing for LP (cont’d)
- exploiting signal correlation -

• … switching activity is determined by:

first iteration,

• Idea:
  • Exploit correlations between variables in the behavioral description to minimize switched capacitance at RTL level

SLIDE 30

Resource sharing for LP (cont’d)
- exploiting signal correlation -

• Scenarios of exploiting correlations:
  • S1:
  • S2:

SLIDE 31

Resource sharing for LP (cont’d)
- exploiting signal correlation -

• Scenarios of exploiting correlations (cont’d):
  • S3:

(Src: [Anand98])

SLIDE 32

Resource sharing for LP (cont’d)
- exploiting signal regularity -

• What is regularity?
  • Regularity refers to the repeated occurrence of computational patterns within an algorithm
• Idea:
  • Regularity can be exploited to reduce interconnect power by detecting instances of repetitive patterns in the computation and sharing resources such that the same interconnect structure in the data path is reused for as many instances of the computation patterns as possible

SLIDE 33

Resource sharing for LP (cont’d)
- exploiting signal regularity -

• Ex 1:
  • Scenario b): on the left-hand side a data flow graph is given; on the right an implementation. Problem: for A2 -> M1, A2 -> M1, A1 -> S1, multiplexers are needed
  • Scenario a): same graph, but the mapping of the DFG to the data path does not require multiplexers since A1 -> M1, for example, can be reused
  • Conclusion: scenario b) does not preserve regularity, a) does
• So, where does the overhead power consumption come from?
  1. the more fan-outs, the larger are the interconnects and therefore the switched capacitances
  2. the overhead of multiplexers (and probably buffers) leads to more switching activity and therefore higher power consumption

(Src: [Mehra])

SLIDE 34

Resource sharing for LP (cont’d)
- exploiting signal regularity -

• Idea: defining
  • E-instance: a pair of nodes connected by an edge
  • E-template: type of an E-instance, classified by type of input/output port. Ex: (add->add.right) means a template with an adder whose output maps to the right input of another adder
  • E-coverage: # of instances of that type divided by total # of edges in the graph
  • Task here: use E-templates in synthesis so as to minimize power, with E-coverage as quality measure, through regular assignment
• Ex 2: a fourth-order cascade filter
• Disadvantages?
  • may require more hardware units since sufficient E-templates need to be provided -> under some circumstances power savings come at the cost of more hardware

(Src: [Mehra])
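The E-coverage definition above translates almost directly into code. The sketch assumes a graph representation chosen here for illustration: a dictionary from node name to operation type and a list of (source, destination, destination port) edges. The small graph is hypothetical and unrelated to the filter example.

```python
from collections import Counter


def e_coverage(node_types, edges):
    """Return {E-template: coverage} for a DFG.

    node_types: node name -> operation type, e.g. "add" or "mul"
    edges: list of (src, dst, dst_port) triples, i.e. the E-instances
    """
    templates = Counter(
        (node_types[src], node_types[dst], port) for src, dst, port in edges
    )
    total = len(edges)
    return {f"{a}->{b}.{port}": count / total
            for (a, b, port), count in templates.items()}


NODE_TYPES = {"a1": "add", "a2": "add", "a3": "add", "m1": "mul"}
EDGES = [
    ("a1", "a2", "right"),
    ("a2", "a3", "right"),
    ("a1", "m1", "left"),
    ("m1", "a3", "left"),
]
COVERAGE = e_coverage(NODE_TYPES, EDGES)
print(COVERAGE)
```

Here the template add->add.right covers half of all edges, so a data path that provides this connection can reuse it for both instances.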

SLIDE 35

Resource sharing for LP (cont’d)
- exploiting signal regularity -

• Example: E-template-based scheduling

(Src: [Mehra])

SLIDE 36

DFG restructuring for LP

• A typical DSP operation: constant multiplication and addition: Y = A * X (e.g. as part of an IIR filter)
• Assumptions:
  • m-bit data value X multiplied by m-bit constant A
• Experiment:
  • Applying random values of X for varying values of A
  • X and A are represented in two’s complement
  • A is normalized to 0.0 … 1.0
  • Observing the average switching activity per bit at the multiplier output
  • Result: next slide
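A rough software version of this experiment might look as follows. It is a sketch under several assumptions that are not stated on the slide: the constant is stored as a fixed-point fraction with m-1 fractional bits, the product is truncated back to m bits by an arithmetic shift, and A = 1.0 is allowed even though it does not fit an m-bit two's-complement fraction.

```python
import random


def toggles(a, b, width):
    """Number of differing bits between two width-bit two's-complement words."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def activity_per_bit(words, width):
    """Average number of bit toggles per word transition, divided by the width."""
    pairs = list(zip(words, words[1:]))
    return sum(toggles(a, b, width) for a, b in pairs) / (len(pairs) * width)


def multiplier_activity(a_norm, m=8, samples=2000, seed=0):
    """Return (output activity, input activity) of Y = A * X for random X."""
    rng = random.Random(seed)
    frac_bits = m - 1
    a_fixed = round(a_norm * (1 << frac_bits))       # A in [0.0, 1.0]
    xs = [rng.randrange(-(1 << (m - 1)), 1 << (m - 1)) for _ in range(samples)]
    ys = [(a_fixed * x) >> frac_bits for x in xs]     # truncate to m bits
    return activity_per_bit(ys, m), activity_per_bit(xs, m)


for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    out_act, in_act = multiplier_activity(a)
    print(f"A={a:.2f}  output activity={out_act:.3f}  input activity={in_act:.3f}")
```

Only the two end points are guaranteed by construction: A = 0 gives no output activity, and A = 1.0 reproduces the activity of X. The behavior in between depends on the number representation and should be read off the printed values rather than assumed.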

SLIDE 37

DFG restructuring for LP (cont’d)

• Observation:
  • When constant value A is ‘0’ => no switching activity (as expected)
  • When A is 1.0, the output switching activity is equal to the switching activity of X
  • In between: it is monotonically increasing

[Figure: output switching activity per bit as a function of the constant value (i.e. A) (Src: [Anand98]) (Src: [Ragh99])]

• Example
  • An adder: two m-bit data values X1, X2. It can be shown that:

SLIDE 38

DFG restructuring for LP (cont’d)

• Example: linear time-invariant signal processing system (e.g. IIR filter)

(Src: [Anand98])

• Left figure:
• Right figure:

SLIDE 39

Glitch power reduction

• What is “glitch power”?
  • Power consumption that is related to hazards, i.e. temporary values at the input/output of gates that cannot be explained when considering a truth table only. It is due to different propagation delays in combinational paths
• Analysis of the above example unveiled: a rising transition on signal x1 was frequently accompanied by a falling transition on c11. Thus, the rising transition on x1 and the falling transition on c11 are highly correlated. Transitions on signal x1 arrive earlier than transitions on signal c11 due to: a) non-balanced paths, b) wiring delays

[Figure: glitch; shown: # transitions / # transitions w/o glitches (Src: [Ragh99])]

SLIDE 40

Glitch power reduction (cont’d)

• In general, glitches are generated at the control signals due to the simultaneous presence of the following two conditions:
  1) Functional: correlation between rising and falling transitions at two or more signals that feed a gate
  2) Temporal: the controlling to non-controlling transition arrives earlier at the gate’s input (see example on last slide)
• Note: a glitchy signal propagates (and therefore consumes even more power at other gates in the circuitry) => try to eliminate the glitch as close to the location of first occurrence (i.e. where it is generated) as possible
• Glitches may also be generated by data path blocks: examples
  • Assumption: input signals are considered to be glitch-free and to arrive simultaneously => glitches reported are generated by the respective block
  • Note: comparator ‘=’ is glitch-free -> seems to indicate that the logic inside is well-balanced

(Src: [Ragh99])

SLIDE 41

Glitch power reduction (cont’d)

• Example: a multiplexer of two 8-bit words is controlled by a comparator “<”
  • Assumption: comparator generates glitches; A, B are glitch-free
  • Shown: a bit slice of words A and B
  • In table: # of transitions w/o glitches, total # of transitions
• Discussion (denotation: <A_i,B_i>)
  • <0,0>: glitch cannot propagate
  • <0,1>, <1,0>: glitch propagates at G1 for <1,0> and at G2 for <0,1>
  • <1,1>: glitch always propagates
• How to prevent glitches?
  • For example: use spatial correlations

[Figure: glitchy and non-glitchy signals (Src: [Ragh99])]

SLIDE 42

Glitch power reduction (cont’d)

• Spatial correlation: observation: the value of S is irrelevant in case <1,1> (OUT_i is ‘1’ anyway) => insert gate G_c => propagation of the glitch on S is prohibited
• Result:
• Q: why can’t just a buffer be inserted between S and the input of G2 (idea here: the glitches compensate and therefore eliminate each other)?
• These and other techniques need to be built into synthesis tools and component libraries in order to prevent glitches in the first place
• More techniques to prevent glitches during synthesis can be found in [Ragh99]
• Problem with glitch reduction techniques:
  • Power overhead of additional gates (might reduce gains obtained through reduced number of glitches)
  • Increased power of existing gates due to more inputs (example above: G_3 has 3 instead of 2 inputs)
  • …

(Src: [Ragh99])
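The effect of the extra gate can be illustrated with a small unit-delay simulation of one multiplexer bit slice. The gate-level structure is an assumption, since the figure is not available here: G1 = A_i AND S, G2 = B_i AND (NOT S), G_c = A_i AND B_i, and G_3 is the OR gate that drives OUT_i. Every gate, including the inverter, is assumed to have a delay of one time step.

```python
def simulate_mux_slice(a, b, s_wave, use_gc=False):
    """Unit-delay simulation of OUT = (A & S) | (B & ~S) [| (A & B)].

    s_wave is the value of the select signal S at each time step.
    """
    # Start in the steady state that belongs to the first value of S.
    n_s = 1 - s_wave[0]
    g1, g2, gc = a & s_wave[0], b & n_s, a & b
    out = [g1 | g2 | (gc if use_gc else 0)]
    for s_prev in s_wave[:-1]:
        # Each gate output at time t is computed from its inputs at time t-1.
        out.append(g1 | g2 | (gc if use_gc else 0))
        g1, g2, n_s = a & s_prev, b & n_s, 1 - s_prev
    return out


def transitions(wave):
    return sum(x != y for x, y in zip(wave, wave[1:]))


# S should stay at 1 but carries a glitch (a short 0 pulse).
GLITCHY_S = [1, 1, 1, 0, 1, 1, 1, 1, 1]

for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    plain = transitions(simulate_mux_slice(a, b, GLITCHY_S))
    gated = transitions(simulate_mux_slice(a, b, GLITCHY_S, use_gc=True))
    print(f"<{a},{b}>: output transitions without G_c = {plain}, with G_c = {gated}")
```

In this model the glitch is blocked for <0,0>, passes through one AND gate for <1,0> and <0,1>, and is removed by G_c only in the <1,1> case, where the output does not depend on S.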

SLIDE 43

Clock gating

• What is clock gating?
  • Idea: suppress or disable transitions from propagating to parts of the clock network under specific conditions that are determined by the clock gating circuitry
• How can clock gating result in power savings?
  • Reduced capacitive switching in the clock network, like
    • clock buffers
    • interconnect of the clock network
    • latches/registers that are fed by the clock signal
  • Also: may prevent storage elements from loading unnecessary new values, thus saving power

(Src: [Anand98])

SLIDE 44

Clock gating (cont’d)

• Register re-loads its previous value when the comparator output is ‘0’ => the transition at the clock input to the register can be suppressed and transitions can be spared
• Scheme 1: register clock input would be forced to ‘0’ when comp is ‘0’ (desired)
  • Does not work since the comp output is not stable before the clock edge rises
• Scheme 2: register clock input is forced to ‘1’ when the comparator output evaluates to ‘0’ (desired)
  • OK (as long as the gating condition stabilizes before the clock goes ‘0’ -> ‘1’)

(Src: [Anand98])

SLIDE 45

Clock gating (cont’d)

• Typically: a) existing signals in the circuitry may be used for gating parts of the clock network, or b) signals from the previous clock cycle may be used (in that case those signal values need to be stored in latches)
• Example:
  • The decode stage of a microprocessor pipeline can be used to clock-gate later stages
• In other cases:
  • Additional circuitry needs to be added
• Pitfalls and overheads:
  • Introducing additional gates in the clock tree may lead to an increase in clock delay and clock skew
  • Ensure that gating clocks does not introduce glitches, otherwise: malfunction due to spurious loading of registers
  • Circuits with gated clocks introduce additional complexity to synthesis and analysis tools

SLIDE 46

Clock gating (cont’d)

• Idea: automated gated-clock synthesis (architecture, see above)
  • Synthesizing an activation function F_a:
    • goes to ‘1’ when the clock needs to be stopped
  • Latch L ensures that glitches are not propagated to the clock signal
  • The AND gate eventually suppresses the clock for the whole circuitry
• Consideration: identify conditions when next state and primary outputs do not change
• Gating the clock at its root (e.g. for the whole circuitry)
  • => eliminates the clock skew problem
• Added circuitry may incur additional power etc.
  • => try to detect a subset of the idle conditions at low overhead

(Src: [Anand98])

SLIDE 47

Clock gating (cont’d)
- synthesizing F_a -

• Given: a Moore FSM
  • set of inputs
  • set of outputs
  • set of states
  • initial states
  • next-state function
  • output function
• Note: for a Moore FSM, the output is a function only of the current state and not of the input variables
• A self-loop in the state transition graph (STG) corresponds to an idle condition => condition where the clock to the FSM register can be suppressed
• For a state s_i with a self-loop: a function captures the set of input conditions under which the self-loop of state s_i is traversed
• x_i – decoded state variable corresponding to s_i; x_i = 1 iff the FSM is in state s_i
• Activation function:
• Note: activation function might be complex

(Src: [Anand98])
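One way to read the construction above is sketched below: for every state, collect the inputs under which the self-loop is taken, and let F_a be the OR over all states of "FSM is in s_i AND the self-loop condition of s_i holds". This reading, the table-based FSM representation and the three-state example are assumptions for illustration, not taken from the slide.

```python
def self_loop_conditions(states, inputs, next_state):
    """For each state s_i: the set of inputs under which its self-loop is taken."""
    return {s: {x for x in inputs if next_state[(s, x)] == s} for s in states}


def f_a(current_state, x, self_loops):
    """Activation function: 1 iff the FSM would stay in its current state."""
    return int(any(current_state == s and x in cond
                   for s, cond in self_loops.items()))


# Hypothetical Moore FSM with a single binary input.
STATES = ["IDLE", "RUN", "DONE"]
INPUTS = [0, 1]
NEXT_STATE = {
    ("IDLE", 0): "IDLE", ("IDLE", 1): "RUN",
    ("RUN", 0): "DONE",  ("RUN", 1): "DONE",
    ("DONE", 0): "IDLE", ("DONE", 1): "DONE",
}

SELF_LOOPS = self_loop_conditions(STATES, INPUTS, NEXT_STATE)
for s in STATES:
    for x in INPUTS:
        print(s, x, "-> clock can be stopped" if f_a(s, x, SELF_LOOPS) else "-> clock needed")
```

Because a Moore output depends only on the state, stopping the clock in these cases changes neither the state nor the outputs. For large FSMs the full F_a can become expensive, which is why a subset of the idle conditions may be preferable.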

SLIDE 48

Clock gating for data paths

• Often in data paths: the output of a register is fed back as one of its data inputs through, for example, a multiplexer network
• Task: find the condition under which this is the case by traversing the path through the multiplexer network
• The gating condition for the clock input is a conjunction of the (possibly inverted) select signals contr[0] and contr[1]
• Note: select signals are already present in the network => only inverters (if necessary) and a conjunction need to be provided for the gating condition
• Caution: this strategy does not guarantee timing requirements (i.e. the gating condition should stabilize before clk goes ‘1’->‘0’)
• In order to avoid slower clocking: derive a reduced gating condition

(Src: [Anand98])

SLIDE 49

Clock gating for data paths (cont’d)

• Shown: a clock tree that has gated clock conditions at various points (levels)
• Tradeoff:
  • Disabling the clock at a higher level in the tree => a larger capacitance (sum of all smaller ones) is prevented from switching
  • But: the clock transition at a certain level can only be suppressed if _all_ registers of that sub-tree can retain their old values => the gating condition is satisfied fewer times => reduction of the # of transitions saved
  • On the other side: when gating at a lower level, more transitions could be saved, but that costs more logic (that itself consumes power …)

[Figure: clock tree with CLOCK, IDLE CONDITION and GATED CLOCK signals (Src: [Anand98])]

SLIDE 50

Clock gating
- clock tree construction -

• Observation: the way the clock tree is constructed has an impact on gating the clock tree (see example):
  • Left: a clk tree with four registers R1, R2, R3, R4 and the conditions under which the clk tree can be disabled. x1,…,x4 are decoded controller state variables which are mutually exclusive (no two of them can be ‘1’ simultaneously)
  • Observation: R1, R2 are grouped under a clock tree even though the conjunction of their conditions can never be true => not possible to gate the clock at point “A”, for example (similarly for R3, R4)
  • Right: gating condition for the sub-tree under “A”:

$$GC_A = x_1 \cdot (x_1 + x_3) = x_1$$

  (“B” may be grouped similarly)
  • Advantage: better suited to clock gating since groups of registers with similar or overlapping idle conditions are closer together => sub-trees can be shut down more efficiently

(Src: [Anand98])
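The influence of the grouping can be checked by enumerating the mutually exclusive state variables. In the sketch below, the idle conditions of R1 (x1) and R3 (x1 + x3) follow from the gating condition for “A”; the conditions for R2 and R4 are hypothetical, chosen symmetrically because the figure is not available here.

```python
# Idle condition of each register as a function of the decoded state variables.
IDLE = {
    "R1": lambda x: x[1],
    "R2": lambda x: x[2],             # hypothetical
    "R3": lambda x: x[1] or x[3],
    "R4": lambda x: x[2] or x[4],     # hypothetical
}


def gateable_states(group, n=4):
    """States (index of the one variable that is '1') in which the whole
    sub-tree can be gated, i.e. every register of the group is idle."""
    result = []
    for hot in range(1, n + 1):
        x = {i: int(i == hot) for i in range(1, n + 1)}
        if all(IDLE[r](x) for r in group):
            result.append(hot)
    return result


print("left tree : A =", gateable_states(["R1", "R2"]), " B =", gateable_states(["R3", "R4"]))
print("right tree: A =", gateable_states(["R1", "R3"]), " B =", gateable_states(["R2", "R4"]))
```

With the left grouping neither sub-tree can ever be gated; with the right grouping sub-tree “A” can be gated whenever x1 = 1, which matches GC_A = x1.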

SLIDE 51

Clock gating
- multiple clocks -

• Observation: some components of a circuit may follow some simple regular patterns. In particular, a component may be idle and active in alternating clock cycles or so
  • => clock gating circuitry need not necessarily be data dependent
• Example:
  • A circuit with all registers fed by a single clock; the whole capacitance is C and the frequency is f
  • Assume: the design is partitioned into two parts, each fed by a clock signal with f/2, with capacitances C1, C2. Power savings can be achieved if:

$$C_1 \frac{f}{2} + C_2 \frac{f}{2} < C f$$

$$\Rightarrow \; C_1 + C_2 < 2 C$$

  • So, the circuit needs to be partitioned carefully, which should often be possible to achieve
• Note:
  - savings here are for the clock tree only
  - the f/2 does not result in performance penalties in this example
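A short numeric check of the condition, using the capacitance-times-frequency product as a stand-in for clock power (supply voltage and constant factors cancel in the comparison). The capacitance and frequency values are made-up examples.

```python
def clock_cost_single(c, f):
    """Switched capacitance per second with one clock of frequency f."""
    return c * f


def clock_cost_split(c1, c2, f):
    """Two clock domains, each running at f/2."""
    return c1 * f / 2 + c2 * f / 2


def saves_power(c, c1, c2, f):
    return clock_cost_split(c1, c2, f) < clock_cost_single(c, f)


F = 100e6                      # assumed clock frequency
print(saves_power(c=10e-12, c1=6e-12, c2=6e-12, f=F))    # C1 + C2 = 12 pF < 20 pF
print(saves_power(c=10e-12, c1=11e-12, c2=11e-12, f=F))  # C1 + C2 = 22 pF > 20 pF
```

The split helps as long as the added clock wiring keeps C1 + C2 below 2C, independent of f.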

SLIDE 52

Clock gating
- multiple clocks (cont’d) -

• Idea: using multiple non-overlapping clocks. Example:
  • Scheduled DFG (left fig.): clock cycles of the schedule s1,…,s5 have been assigned to two non-overlapping clock domains, CLOCK1 and CLOCK2, in an alternating way
  • Right fig.: shows a single-clock RTL circuit that implements the given DFG using minimal resources but does not implement the clock partitioning shown in the left fig.

(Src: [Anand98])

SLIDE 53

Clock gating
- multiple clocks (cont’d) -

• Shown: RTL circuit that has been implemented with two clocks
• Restrictions:
  a) an op scheduled in CLOCK1 cannot share an FU with an op in CLOCK2
  b) a variable generated in CLOCK1 cannot share a register with a variable generated in CLOCK2
• Why?
  1. ensures that each register can be clocked by either CLOCK1 or CLOCK2
  2. the data path can be partitioned into two domains such that there is only switching activity in their respective active clock cycles

(Src: [Anand98])

SLIDE 54

Clock gating
- some disadvantages -

• Inserting additional gates into the clock tree can lead to an increase in the clock delay and clock skew
• Circuit malfunction due to spurious loading of registers when not taking into consideration that gating logic might introduce glitches
• Increased complexity for synthesis and analysis tools

SLIDE 55

Power savings through pre-computation

• Basic idea: pre-compute (i.e. predict) the output of a circuit one cycle ahead with additional logic and then switch off the original logic
• What is the rationale behind it? For a majority of input values, the output might be computed with very simple logic, but in order to cover all inputs, complex logic is necessary

[Figure: original circuit; circuit with pre-computation logic]

• Assumptions: ‘A’ represents combinational logic and has one output (f)
• Architecture: introduce new functions g1, g2 as follows:

$$g_1 = 1 \Rightarrow f = 1$$

$$g_2 = 1 \Rightarrow f = 0$$

• Tasks of g1, g2: a) pre-compute, b) switch off the original circuit in certain cases when prediction is possible
• Note: a) g1, g2 only cover a subset of all x1,…,xn (desirable: a large coverage), b) imposes overhead in the form of power, area and probably performance

(Src: [Devadas])

slide-56
SLIDE 56

Power savings through pre-computation

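  • Example: a comparator of two n-bit values C and D that outputs ‘1’ if C > D.
  • In that case, g1 can be defined to test the MSB: $g_1 = C_{n-1} \cdot \overline{D_{n-1}}$
  • Accordingly, g2 is defined as: $g_2 = \overline{C_{n-1}} \cdot D_{n-1}$
  • If g1=1 => C>D; if g2=1 => D>C
  • So only the XNOR of the two MSBs needs to be computed by the pre-computation logic: if it is ‘1’ (MSBs equal), no prediction is possible and the original circuit has to evaluate.
  • Note: assuming a uniform probability for the inputs, the probability that the XNOR evaluates to ‘1’ is 50%, so the original circuit can be switched off in 50% of all cases! If the bit MSB-1 is also tested, the shut-off probability is 75%!
  • => With little effort the output of the circuit can be predicted, and power can be saved by switching off (gating) the original circuit.
  • For large n, the power dissipation of the additional XNOR gate can be neglected.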
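The 50% and 75% figures can be reproduced by exhaustive enumeration. The sketch below is illustrative only: the function names and the parameter k (number of tested most significant bits) are assumptions, and it models the prediction behaviourally rather than at gate level.

```python
from itertools import product

def precompute(c: int, d: int, n: int, k: int = 1):
    """Predict (C > D) from the k most significant bits only.

    Returns 1 or 0 when a prediction is possible, None when the tested
    bits are equal and the full comparator has to evaluate.
    """
    shift = n - k
    c_top, d_top = c >> shift, d >> shift
    if c_top > d_top:      # for k = 1: g1 = C[n-1] AND NOT D[n-1]
        return 1
    if c_top < d_top:      # for k = 1: g2 = NOT C[n-1] AND D[n-1]
        return 0
    return None            # tested bits equal: no prediction

def shutoff_fraction(n: int, k: int) -> float:
    """Fraction of all (C, D) pairs for which the comparator can be gated."""
    total = predicted = 0
    for c, d in product(range(2 ** n), repeat=2):
        guess = precompute(c, d, n, k)
        total += 1
        if guess is not None:
            assert guess == int(c > d)   # a prediction must never be wrong
            predicted += 1
    return predicted / total

print(shutoff_fraction(4, 1))  # 0.5
print(shutoff_fraction(4, 2))  # 0.75
```

The result does not depend on n as long as the inputs are uniformly distributed, which is the assumption stated on the slide.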

(Src: [Devadas])

slide-57
SLIDE 57

Managing power through scheduling

  • Recall: “pre-computation” is a shut-off technique that works on a clock-cycle basis and is limited to the given structure of the logic.
  • Can power be managed at a higher level?
  • Observation: for performance optimization it is common to compute all outcomes of a conditional operation in parallel with the evaluation of the condition itself. The appropriate result is then chosen and the other one(s) are discarded.
  • => This is inefficient in terms of power consumption.
  • Idea (power efficient): during scheduling, enforce control dependencies between the conditional operation and the operations in the CDFG that depend on it.
  • This may be accomplished by meeting the performance constraints first and then optimizing power consumption.

slide-58
SLIDE 58

Managing power through scheduling (cont’d)

Example: expression |a-b|

Assumption: each of the ops ‘-’ and ‘>’ takes one clock cycle, and the ‘sel’ operation may be chained with any other operation.

Constraint: a schedule within 2 cycles => two possible schedules, b) and c) (figure panels a, b, c).

Schedule b): ignores control dependencies; a-b and b-a are executed independently of ‘>’, so there is flexibility in scheduling ‘>’.
Problem: from a power point of view b) is inefficient, since both a-b and b-a are always executed.

Schedule c): a-b or b-a is activated exclusively, depending on the outcome of ‘>’. a-b and b-a may be assigned to the same subtractor or to two different subtractors (in the latter case, one of them needs to be shut down).
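The difference between the two schedules can be made concrete by counting executed subtractions. This is a behavioural sketch with hypothetical function names and input pairs; it does not model cycles or hardware, only which operations are activated.

```python
def abs_diff_schedule_b(a: int, b: int):
    """Schedule b): both subtractions run, 'sel' picks one afterwards."""
    executed = ["a-b", "b-a", ">"]          # all three ops always execute
    d1, d2 = a - b, b - a
    return (d1 if a > b else d2), executed

def abs_diff_schedule_c(a: int, b: int):
    """Schedule c): '>' runs first, then exactly one subtraction."""
    executed = [">"]
    if a > b:
        executed.append("a-b")
        result = a - b
    else:
        executed.append("b-a")
        result = b - a
    return result, executed

pairs = [(7, 3), (2, 9), (5, 5)]  # hypothetical input values
subs_b = sum(len(abs_diff_schedule_b(a, b)[1]) - 1 for a, b in pairs)
subs_c = sum(len(abs_diff_schedule_c(a, b)[1]) - 1 for a, b in pairs)
print(subs_b, subs_c)  # 6 3
```

Both variants return the same value; schedule c) simply executes half as many subtractions.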

(Src: [Anand98])

slide-59
SLIDE 59

Power savings through operand isolation

  • Note: the previous techniques (pre-computation, gated clocks) are only applicable to blocks of combinational logic that are fed by registers.
  • Here: applicable to circuit blocks that are embedded within combinational logic.
  • Idea here:
      • Disable transitions of the variables at the inputs of the embedded block.
      • Insert transparent latches at all inputs of the embedded block.
      • If the block does not perform any useful operation, a) the transparent latches at its inputs are disabled and b) they retain the previous cycle’s values => avoids unnecessary power consumption.
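The effect of the latches can be illustrated with a small behavioural model. The sketch below assumes a hypothetical per-cycle operand trace and a flag telling whether the embedded block's result is used; it counts bit toggles at the block input with and without isolation.

```python
class TransparentLatch:
    """Level-sensitive latch: passes its input while enabled, else holds."""

    def __init__(self, initial: int = 0):
        self.q = initial

    def update(self, d: int, enable: bool) -> int:
        if enable:
            self.q = d
        return self.q

def bit_toggles(values):
    """Total Hamming distance between consecutive values."""
    return sum(bin(x ^ y).count("1") for x, y in zip(values, values[1:]))

# Hypothetical trace: operand value per cycle, and whether the embedded
# block performs a useful operation in that cycle.
operand = [0b0000, 0b1111, 0b0101, 0b1010, 0b1010, 0b0011]
useful  = [True,   False,  False,  True,   True,   False]

latch = TransparentLatch(initial=operand[0])
isolated = [latch.update(d, en) for d, en in zip(operand, useful)]

print(bit_toggles(operand))   # toggles seen by the block without isolation
print(bit_toggles(isolated))  # toggles seen by the block with isolation
```

The model ignores the power drawn by the latches themselves, which is part of the overhead of the technique.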

(Src: [Anand98])

slide-60
SLIDE 60

Power savings through operand isolation: guarded evaluation

$$LE = 1 \Rightarrow ODC_o$$

  • o: signal in a combinational circuit
  • F: logic that computes o
  • I: set of inputs to F
  • ODC_o: observability don’t-care set with respect to o, i.e. the set of primary input assignments to the entire circuit for which the value of o has no influence on the values at the primary outputs
  • LE: an arbitrary signal of the existing circuit such that LE => ODC_o. Thus, when LE = 1, the value of o is not needed to compute the primary outputs.
  • $t_e^{LE=1}(I)$: earliest time at which any of the inputs in I can change its value when LE=1
  • $t_l^{LE=1}(LE)$: latest time at which LE can stabilize to logic value ‘1’

Pure guarded evaluation:

$$t_e^{LE=1}(I) > t_l^{LE=1}(LE)$$

=> LE can be used to control the guard logic. The transparent latches need to be disabled in time, i.e. early enough to cut off transitions on any of the inputs in I (and thus save power).

(Src: [Anand98]) (Src: [Tivari])
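Both conditions can be checked on a toy example. The following sketch assumes a hypothetical 4-input circuit (out = a AND b if s, else c) in which o = a AND b and the candidate guard is LE = NOT s; the circuit and the timing numbers are made up for illustration.

```python
from itertools import product

def primary_output(a, b, c, s, o_override=None):
    """out = o if s else c, where o = a AND b unless it is forced."""
    o = (a & b) if o_override is None else o_override
    return o if s else c

def in_odc_of_o(a, b, c, s) -> bool:
    """True if the value of o cannot be observed at the primary output."""
    return primary_output(a, b, c, s, 0) == primary_output(a, b, c, s, 1)

def le(a, b, c, s) -> int:
    """Candidate guard signal taken from the existing circuit: NOT s."""
    return 1 - s

def le_implies_odc() -> bool:
    """Check LE = 1 => ODC_o over all primary input assignments."""
    return all(in_odc_of_o(*x) for x in product((0, 1), repeat=4) if le(*x))

def pure_guarding_ok(t_e_inputs: float, t_l_le: float) -> bool:
    """Timing condition: inputs of F change only after LE has settled."""
    return t_e_inputs > t_l_le

print(le_implies_odc())            # True
print(pure_guarding_ok(2.0, 1.5))  # True
print(pure_guarding_ok(1.0, 1.5))  # False
```

Exhaustive enumeration only works for tiny circuits; it is used here purely to make the definition of ODC_o tangible.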

slide-61
SLIDE 61

Power savings through operand isolation: relaxed guarded evaluation (cont’d)

Idea: use guarded evaluation, but with a relaxed condition so that it becomes easier to find the shut-off condition (remember: the signal LE must already be available in the existing circuit).

Timing condition: same as for “pure guarded evaluation”.

Two cases for LE=1 (note: for LE=0 the circuit functions correctly anyway):
  1. o is not needed to compute the primary outputs (OK).
  2. o is needed to compute a primary output. The circuit may operate incorrectly, so an ‘OR’ gate is needed (see figure).

Comparing pre-computation and guarded evaluation (GE):
  • Pre-computation needs additional circuitry; GE is derived from within the circuit.
  • Pre-computation may require re-synthesis to derive the additional circuits efficiently, whereas GE leaves the circuit as is (especially important for hand-optimized circuitry).

(Src: [Anand98]) (Src: [Tivari])

slide-62
SLIDE 62

Power savings through operand isolation: application to high-level synthesis

  • Idea: apply operand isolation, known from the logic level, to high-level synthesis.
  • In fact, the conditions under which a resource (e.g. a functional unit, FU) is not used are readily available from the scheduling and resource sharing!
  • => Idle cycles can be derived from the circuits (see next slide).

(Src: [Anand98])

Scheduled example DFG

slide-63
SLIDE 63

Power savings through operand isolation: application to high-level synthesis (cont’d)

Idle cycles of functional units: (shown in figure)

Insert transparent latches at the FUs’ inputs to perform operand isolation: (shown in figure)

Q: Why are there no latches at the inputs of “CMP1”?
A: The values that feed its inputs do not change in the cycles in which it is idle.
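Deriving idle cycles from a schedule, and deciding which FUs actually need latches, can be sketched as follows. The FU name CMP1 comes from the slide; ADD1, SUB1, the 5-step schedule and all step numbers are hypothetical.

```python
def idle_cycles(binding, num_steps):
    """binding: {fu_name: set of control steps in which the FU executes an op}."""
    all_steps = set(range(1, num_steps + 1))
    return {fu: sorted(all_steps - busy) for fu, busy in binding.items()}

def needs_latches(idle, input_change_steps):
    """An FU needs isolation only if its inputs change in one of its idle cycles."""
    return {fu: bool(set(idle[fu]) & input_change_steps.get(fu, set()))
            for fu in idle}

# Hypothetical schedule and input-change trace.
binding = {"ADD1": {1, 2, 4}, "SUB1": {3, 5}, "CMP1": {2}}
input_changes = {"ADD1": {1, 2, 3, 4}, "SUB1": {1, 3, 5}, "CMP1": {2}}

idle = idle_cycles(binding, 5)
print(idle)
print(needs_latches(idle, input_changes))
```

In this made-up trace CMP1 needs no latches because its inputs only change in the step in which it is busy, which mirrors the Q/A above.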

(Src: [Anand98])

slide-64
SLIDE 64

Power savings through operand isolation: application to high-level synthesis (cont’d)

Possible disadvantages: the isolation technique attempts to eliminate spurious activity at the inputs of embedded resources (e.g. functional units) by inserting transparent latches into the RTL implementation.
  • It incurs power and area overheads due to the addition of extra circuitry.
  • Operand isolation also imposes delay constraints: the disabling transition at the enable input of the transparent latch should arrive before its data input can change.
  • => Satisfying the delay constraints may require adding extra circuit delay to the critical path, which may not be acceptable for high-performance designs.

slide-65
SLIDE 65

Power savings through constrained register sharing

  • Idea: rather than applying transparent latches to an already scheduled DFG, can’t the scheduling, mapping etc. already take power shut-down techniques into consideration?
  • Impact of variable assignment on power consumption:
      • Two candidate assignments, Assignment 1 and Assignment 2, are shown in the table (next slide).
      • The architectures obtained using these assignments were subjected to logic synthesis optimizations and were placed and routed using a cell library. The transistor-level netlists extracted from the layouts were simulated using a switch-level simulator with typical input traces to measure power.
      • For the circuit Design 1, synthesized from Assignment 1, the power consumption was 30.71 mW; for the circuit Design 2, synthesized from Assignment 2, it was 18.96 mW!
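For scale, the relative saving implied by the two measurements quoted above works out as follows (plain arithmetic on the given numbers, nothing else assumed):

$$\frac{30.71\,\text{mW} - 18.96\,\text{mW}}{30.71\,\text{mW}} = \frac{11.75}{30.71} \approx 0.383$$

That is, Design 2 consumes roughly 38% less power than Design 1.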

slide-66
SLIDE 66

Power savings through constrained register sharing (cont’d)

Example: a DFG and two possible register assignments that differ significantly in power consumption

slide-67
SLIDE 67

Power savings through constrained register sharing (cont’d)

Two distinct schedules. Shown: the FU (in a box) with its input variables to the left and right of the operation. Grey-shaded: the variables at the input of the respective operation change value. Spurious input transitions, i.e. those that do not correspond to a DFG operation, are marked with an ‘X’.

Observation: a functional unit whose inputs do not change does not perform a spurious operation.

Conclusion: constrained register sharing can save significant energy/power:
  • Upper case: 7 operations that do not correspond to a DFG operation
  • Lower case: only one such operation

Note:
  a) The number of control steps is not increased.
  b) The number of HW resources (FUs) is still the same.
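Counting the ‘X’ marks amounts to comparing, per FU, the steps in which its inputs change with the steps in which it has a DFG operation. The sketch below uses hypothetical FU names and traces, not the data from the figure.

```python
def spurious_operations(input_change_steps, scheduled_steps):
    """Steps in which an FU sees new input values without a DFG operation.

    Both arguments map an FU name to a set of control steps.
    """
    return {fu: sorted(input_change_steps[fu] - scheduled_steps.get(fu, set()))
            for fu in input_change_steps}

# Hypothetical traces for two register assignments of the same schedule.
scheduled = {"ADD1": {1, 3}, "SUB1": {2, 4}}
assignment_1 = {"ADD1": {1, 2, 3, 4}, "SUB1": {1, 2, 3, 4}}
assignment_2 = {"ADD1": {1, 3},       "SUB1": {1, 2, 4}}

for name, changes in (("Assignment 1", assignment_1), ("Assignment 2", assignment_2)):
    spurious = spurious_operations(changes, scheduled)
    print(name, spurious, sum(len(s) for s in spurious.values()))
```

The schedule and the set of FUs are identical in both cases; only the register assignment, and with it the input-change trace, differs.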

slide-68
SLIDE 68

Power savings through dynamic variable rebinding

Problem in the previous example: there is still a spurious operation in control step s1 of each iteration:
  • The MUX selects R5 (to which v12 is assigned) from control step s3 of each iteration to control step s1 of the next iteration.
  • v12 acquires a new value at step s1.

Idea: combine dynamic variable rebinding with variable assignment to completely eliminate the spurious operation.

How:
  • The old (previous iteration) value of v12 needs to be preserved at the input of SUB1 until the new value of v3 in the current iteration is generated.
  • Then the spurious operation can be eliminated.
  • This is done by swapping the variables assigned to registers R5 and R6 in alternate iterations.
  • Result: see figure on the right.

(Src: [Anand98])

slide-69
SLIDE 69

Managing power through the controller

Main idea: find a way to reduce power without having to spend a large hardware overhead such as transparent latches etc.
  • Rather: re-design the existing control logic in order to reconfigure the multiplexer networks and functional units in the data path.
  • This might not completely eliminate unnecessary activity, but it is low-cost.
  • It is best suited to control-flow intensive designs.

(Src: [Anand98])

slide-70
SLIDE 70

Managing power through the controller (cont’d)

  • Example: X.25 protocol

(Src: [Anand98])

slide-71
SLIDE 71

Managing power through the controller: X.25 example 1

Shown:
  • left: a part of the data path
  • middle/right: a) logic expressions for the control signals (x_i=1 => controller is in state s_i); b) activity graphs of the ALU (state transitions with the actions involved; ex: sel(0), sel(1), sel(2), SelectFunc(0) are 1, 0, 1, 0, respectively => “byte-byteCount” is performed)

Observation:
  • Some states are actually idle states (gray shaded). This can be found out through scheduling info from HL synthesis.

Idea:
  • Re-specify the idle states such that switching is minimized.
  • Example: state transition s6->s4: the same signals are applied to the muxes, so that the same operands are selected (they stay stable). Since they do not change => no switching => no unnecessary power consumption.

Conclusion: re-specifying the control signals can lead to power savings (without changing anything else).

(Src: [Anand98])

slide-72
SLIDE 72

Managing power through the controller: X.25 example 2

Shown:
  • A different part of the X.25 design: the register that stores variable I and the muxes that feed it through the signals sel(18), M(18), sel(19)
  • The same conventions as before apply (gray-shaded states are inactive states as far as the respective part of the design is concerned)

Idea:
  • Can reduced activity in the mux tree lead to reduced power consumption?

Example:
  • Consider s7->s1: “count+byteCount” -> “bytes-byteCount”. Since the operation changes, operand c28 also changes (c28 is the output of the ALU and feeds the mux tree).

Solution:
  • Re-specifying the signals prevents this variable from propagating into the shown mux tree (see right state diagram).

slide-73
SLIDE 73

Managing power through the controller (cont’d)

  • Re-labeling activity graphs: how should an idle vertex in an activity graph be labeled?
      • The different incoming and outgoing transitions of the idle state have different execution probabilities.
      • The values of the data operands fed to the mux trees may themselves change => merely selecting the same operand does not ensure that switching activity is minimized.

slide-74
SLIDE 74

Managing power through the controller: formalizing the problem

Shown:
  • A part of a data path and a part of the activity graph for the “<” operator
  • s3 is an idle state
  • Shown are all incoming and outgoing arcs to/from s3

Goal:
  • Label s3 with one of the labels L1, L2, L3 such that the activity at the input of the “<” is minimized.

Some conventions:
  • P(si -> sj): probability of the controller state transition from si to sj
  • AM si->sj: activity matrix; stores the cost (average number of bit transitions) of the respective state transition. It only has an entry in a row/column if the respective transition actually exists in the state transition graph.
  • Nodes in the state transition graph are to be re-labeled while minimizing the labeling cost.

Goal: find an L* such that the cost function is minimized => best re-labeling
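The cost function itself is not reproduced in the text, so the following sketch assumes one plausible form: the probability-weighted sum of activity-matrix entries over all arcs into and out of the idle state. The function names, the label-indexed matrix and all numbers are hypothetical.

```python
def labeling_cost(label, incoming, outgoing, am):
    """Assumed cost: expected bit transitions when the idle state gets `label`.

    incoming: list of (probability, label_of_predecessor_state)
    outgoing: list of (probability, label_of_successor_state)
    am[(x, y)]: average bit transitions when the label changes from x to y
    """
    cost = sum(p * am[(src, label)] for p, src in incoming)
    cost += sum(p * am[(label, dst)] for p, dst in outgoing)
    return cost

def best_label(candidates, incoming, outgoing, am):
    """Exhaustively pick the label L* with the lowest assumed cost."""
    return min(candidates, key=lambda lab: labeling_cost(lab, incoming, outgoing, am))

# Hypothetical data: predecessors labeled L1 and L2, one successor labeled L3.
labels = ["L1", "L2", "L3"]
am = {(x, y): 0.0 if x == y else 4.0 for x in labels for y in labels}
am[("L1", "L3")] = am[("L3", "L1")] = 1.0   # L1 and L3 select similar operands
incoming = [(0.6, "L1"), (0.4, "L2")]
outgoing = [(1.0, "L3")]

for lab in labels:
    print(lab, labeling_cost(lab, incoming, outgoing, am))
print(best_label(labels, incoming, outgoing, am))  # L3
```

With a single idle state an exhaustive search over the candidate labels suffices; relabeling several idle states at once would need a search over label combinations.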

slide-75
SLIDE 75

Conclusion: Hardware power

  • Make sure what is to be reduced: peak power or average power (different strategies)
  • HW power sources:
      • Data path
      • Control path
      • Clock tree
  • Optimization strategies:
      • Operator scheduling for low power
      • Hardware power management (clock gating)
      • Re-labeling of controller
  • Note:
      • Very often there is a tradeoff: reduced power may lead to more logic, a more complex design, reduced performance, …

slide-76
SLIDE 76

References and sources

  • [Heer04] Ch. Heer, U. Schlichtmann, “Ultra-Low-Power Design: Device and logic design approaches”, pp. 1-20, in “Ultra Low-Power Electronics and Design”, Kluwer, 2004.
  • [Anand98] A. Raghunathan, N.K. Jha, S. Dey, “High-level power analysis and optimization”, Kluwer Academic Publishers, 1998.
  • [Sarraf95] S. Raje, M. Sarrafzadeh, “Variable voltage scheduling”, IEEE/ACM ISLPED 1995, pp. 9-14, 1995.
  • [Knight] R.S. Martin, J.P. Knight, “Power-Profiler: Optimizing ASIC’s Power Consumption at the behavioral level”, Proc. of IEEE/ACM Design Automation Conf. (DAC’95), pp. 42-47, 1995.
  • [Macii04] E. Macii (Ed.), “Ultra Low-Power Electronics and Design”, Kluwer Academic Publishers, 2004.
  • [Devadas] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, M. Papaefthymiou, “Precomputation-based Sequential Logic Optimization For Low Power”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 6-10, 1994, pp. 74-81.
  • [Ragh99] A. Raghunathan, S. Dey, N.K. Jha, “Register transfer level power optimization with emphasis on glitch analysis and reduction”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 18, Issue 8, pp. 1114-1131, Aug. 1999.
  • [Tivari] V. Tiwari, S. Malik, P. Ashar, “Guarded evaluation: pushing power management to logic synthesis/design”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 17, Issue 10, pp. 1051-1060, Oct. 1998.
  • [Mehra] R. Mehra, J. Rabaey, “Exploiting Regularity for Low Power Design”, IEEE/ACM Intl. Conference on Computer Aided Design (ICCAD96), pp. 166-172, 1996.