- T. Ebi and J. Henkel, KIT, SS13
http://ces.itec.kit.edu Hardware Power 1
Low Power Design
Thomas Ebi and Prof. Dr. J. Henkel CES - Chair for Embedded Systems Karlsruhe Institute of Technology, Germany
3. Hardware power optimization and estimation
Levels of abstraction
Tasks
- Optimize (i.e. minimize for low power)
- Design / co-design (synthesize, compile, …)
- Estimate and simulate
Components consuming power: software, OS, interconnect, hardware, memory; battery issues
(Src: [Anand98])
Energy/power needs to be analyzed and […] abstraction. Therefore, appropriate power models for each level are necessary.
Shown in Fig.: a) design flow without energy/power, b) design flow with energy/power
(Src: [Anand98])
A more detailed version than in the intro …
In general, four components:
- Switched capacitance power
- Short-circuit power
- Leakage power
- Static power
(Src: [Anand98])
$P_{avg} = P_{\mathrm{sw\,cap}} + P_{\mathrm{short\,circuit}} + P_{\mathrm{leakage}} + P_{\mathrm{static}}$
Caused by parasitic capacitances that are charged and discharged during switching. The figure shows C_L, the effective capacitance of all parasitic capacitances.
(Src: [Anand98]) CMOS inverter
Switched capacitance power (N: switching activity, f: operating frequency):
$P_{\mathrm{sw\,cap}} = \frac{1}{2} \cdot C_L \cdot V_{DD}^2 \cdot N \cdot f$
Means to reduce it: a) reduce operating frequency, b) reduce C_L, c) reduce voltage, d) reduce switching activity.
Most common: reduce voltage => Problem: the gate delay t_d increases too!
$t_d = \frac{k \cdot C_L \cdot V_{DD}}{(V_{DD} - V_{th})^2}$
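The trade-off between switched capacitance power and gate delay can be tried out numerically. The following is a minimal sketch, assuming the two relations P = 1/2 * C_L * V_DD^2 * N * f and t_d = k * C_L * V_DD / (V_DD - V_th)^2; all numeric values (C_L, N, f, k, V_th and the voltage steps) are illustrative assumptions and are not taken from the lecture.

```python
def switching_power(c_load, v_dd, activity, freq):
    """P_sw_cap = 1/2 * C_L * V_DD^2 * N * f."""
    return 0.5 * c_load * v_dd ** 2 * activity * freq


def gate_delay(k, c_load, v_dd, v_th):
    """t_d = k * C_L * V_DD / (V_DD - V_th)^2, only meaningful for V_DD > V_th."""
    if v_dd <= v_th:
        raise ValueError("V_DD must exceed V_th")
    return k * c_load * v_dd / (v_dd - v_th) ** 2


# Illustrative values only (assumptions, not taken from the lecture).
C_L = 50e-15   # effective load capacitance in farads
N = 0.5        # switching activity (transitions per clock cycle)
F = 100e6      # operating frequency in hertz
K = 1.0        # technology-dependent constant (arbitrary unit)
V_TH = 0.7     # threshold voltage in volts

reference_delay = gate_delay(K, C_L, 5.0, V_TH)
for v_dd in (5.0, 3.3, 2.5):
    power = switching_power(C_L, v_dd, N, F)
    delay = gate_delay(K, C_L, v_dd, V_TH) / reference_delay
    print(f"V_DD={v_dd:.1f} V  P={power * 1e6:.2f} uW  relative t_d={delay:.2f}")
```

Power falls with the square of V_DD while the delay grows as V_DD approaches V_th, which is the problem named above.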
Explanation:
Caused by a direct supply-to-ground path: when the CMOS inverter in the figure switches from 1 -> 0, there is a short time frame within which both the nMOS and the pMOS transistor are conducting => a short-circuit current is drawn from the power supply
(Src: [Anand98])
Leakage can be divided into three components:
- I_diode: refers to the diodes that are formed between diffusion regions and substrate. Very small compared to the other two.
- I_oxide: electrons tunneling through the gate oxide. Drops off exponentially with gate length.
- I_subthreshold: 'off' transistors still conduct some current. K, S: technology parameters; W_eff: effective transistor channel width.
NOTE: leakage power is predicted to be dominant in future silicon technologies
(Src: [Anand98])
Static power: only relevant in some nMOS circuits where there is a constant supply-to-ground path. Not relevant in CMOS circuits.
Note: in some literature leakage power is denoted as “static power”.
Figure: dynamic power, subthreshold leakage and gate-oxide leakage over time, including the trajectory if high-k dielectrics reach production (Src: [Heer04])
Leakage power will dominate in future (<100nm) silicon technologies. One means to reduce leakage power is to deploy dielectrics with a high k-value.
Considered here: high-level synthesis (HLS), e.g.:
- Operator scheduling
- Module selection
- Glitch power reduction
- State transition reduction
- …
What is scheduling in the context of high-level synthesis?
- Scheduling assigns operations in the behavioral description to control steps or controller states.
- Scheduling determines the cycle-by-cycle behavior, i.e. the sequence in which operations are performed.
- Some repetition from ESI: multicycling (clock period is rather short), chaining (clock period is rather long); finding the right clock cycle time is an optimization task in itself.
- Scheduling determines the sequence in which the various […] dictates which operations and variables can share the same functional units and registers. Thus, scheduling can be used to enable resource sharing for low power by ensuring that correlated variables and operations with correlated operands are appropriately sequenced so that they can share the same resources.
(Src: [Anand98])
- Scheduling can be performed so as to enable maximum resource sharing between operations that belong to instances of the same computational pattern, resulting in maximal exploitation of regularity during resource sharing.
- Scheduling can be used to distribute the slacks or mobilities of various operations in the DFG appropriately so that some […] functional units. Thus, scheduling has an impact on the power trade-offs through module selection.
- Scheduling determines the distribution of operations over time, and hence affects the profile of the power consumption in the implementation over time (control steps or clock cycles). Reducing peak power is important due to packaging, cooling, and reliability […] illustrated later.
=> These tasks will be discussed in the following (some in the context of module selection).
(Src: [Anand98])
Basic idea: use slack in a data flow graph (dependent upon timing constraints) and:
a) vary V_dd of the ALU where the operator is to be executed, or
b) assign operator(s) to a different ALU with a lower/higher (fixed) V_dd
$t_d \propto \frac{C_L \cdot V_{DD}}{(V_{DD} - V_{th})^2}$
Shown: normalized dependency t_d = f(V_DD) (src: [Saraff95])
Problem: obtain a mapping $V \to S$ for a data flow graph G=(V,E), given a base execution time t_c (or V_dd) and a timing constraint k * t_c, that minimizes
$\sum_i V_i^2$ (V_i: supply voltage of node i)
such that the critical path length of the DFG is <= k * t_c
(src: [Saraff95])
Step 1: initialization
Step 2: computing slack; l(v) is the longest path of the graph that goes through node v
Step 3: compute max slack value
Step 4: compute dual graph
Step 5: weight assignment
Step 6: compute longest weighted path
Step 7: reassign voltages to nodes in longest path
Step 8: go back to step 2
(src: [Saraff95])
Conclusion: power consumption can be reduced, depending on constraints, by up to around 25%
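Step 2 can be sketched in a few lines. The following is only an illustration of computing l(v), the longest path through node v, and a slack against a deadline k * t_c; it is not the algorithm of [Saraff95] itself. The function names, the graph and the unit delays are assumptions made for this example, and slack is taken here as deadline minus l(v).

```python
from graphlib import TopologicalSorter


def longest_path_through(delay, edges):
    """Return l(v): length of the longest path that passes through node v."""
    preds = {v: set() for v in delay}
    succs = {v: set() for v in delay}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)
    order = list(TopologicalSorter(preds).static_order())
    # Longest path ending at v (including v's own delay).
    to_v = {}
    for v in order:
        to_v[v] = delay[v] + max((to_v[u] for u in preds[v]), default=0)
    # Longest path starting at v (including v's own delay).
    from_v = {}
    for v in reversed(order):
        from_v[v] = delay[v] + max((from_v[w] for w in succs[v]), default=0)
    # v's delay is counted in both halves, so subtract it once.
    return {v: to_v[v] + from_v[v] - delay[v] for v in delay}


def slack(delay, edges, deadline):
    """Slack of each node: deadline minus the longest path through it."""
    through = longest_path_through(delay, edges)
    return {v: deadline - through[v] for v in delay}


# Hypothetical DFG: a -> b -> d and c -> d, unit delays, deadline k * t_c = 4.
DELAY = {"a": 1, "b": 1, "c": 1, "d": 1}
EDGES = [("a", "b"), ("b", "d"), ("c", "d")]
print(slack(DELAY, EDGES, deadline=4))
```

Nodes with the largest slack (here "c") are the candidates for a lower supply voltage.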
What is module selection?
The process of mapping operations from the CDFG to component templates of the RTL library.
Observation: with each operation a functional unit template, but not a specific instance, is associated (that would be mapping).
Example: a “+” operation may be implemented using a
A) ripple-carry adder
B) carry-lookahead adder
C) …
The ripple-carry adder is slow but more efficient in switched capacitance; the carry-lookahead adder is faster but less efficient in switched capacitance. Similar tradeoffs exist for other operations.
Idea: tradeoffs can be exploited to fulfill power constraints through module selection.
Each operation in the DFG (middle) has been mapped to a fast component in […]
But is that really necessary? => No, not all ops necessarily need to be mapped to the fastest module. The focus (for the timing constraint) should rather be on the critical path.
Idea: slack in off-critical-path ops may be used to select slower functional units that may have a better efficiency in switched capacitance (see right DFG). There, the mult op uses less power (but not less energy).
Important: to have a large module library with distinct switched capacitance efficiencies and performance characteristics.
(Src: [Anand98])
Multiple supply voltages have also been used at logic level.
Idea: obtain low power at rather small performance overhead (remember the equation for power through switched capacitances).
How to do this in HL synthesis?
- Need an RTL component library that contains multiple versions of each component, each referring to a different Vdd
- Need to extend the module selection process to assign each op to a lib component template plus a supply voltage
- The insertion of necessary level converters needs to be accomplished
- The delay models (that evaluate the module selection) need to be sensitive to the dependence of delay on supply voltage
Note: resource sharing for LP (see later) needs to be constrained: must not allow sharing of functional units that are assigned to different supply voltages
(Src: [Anand98])
Ex:
- Off-critical-path op ‘*3’ has been assigned to a lower Vdd (3.3V)
- A level converter has been inserted: it converts the result of ‘*3’ to the higher output level
- Various ways to realize the level converter: a) a separate level converter circuit like DCVS (differential cascode voltage switch), b) it may be integrated into a register to reduce overhead
(Src: [Anand98])
Example: two possible schedules of a given graph (assumption: same power for each operation).
The ASAP schedule results in high peak power.
A possible slack may be used to reduce peak power without sacrificing any performance.
(Src: [Anand98])
Assumption: ‘1’, ‘2’ and ‘0’, ‘3’ may pair-wise share a module. The corresponding times are t_A and t_B.
The average power consumption for the two schedules is:
Ex1: t_A = 1,300ns, t_B = 660ns, P_A = 28.6mW, P_B = 56.4mW
Note: average power has considerably improved but energy is approx. the same! => The trade-off between power and performance has been exploited to protect the circuit.
(Src: [Knight]) Example graph, schedule and module lib
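The note on energy can be checked directly from the numbers in Ex1, since energy is average power times execution time. A small sketch (the helper name `energy_nj` is made up for this example):

```python
def energy_nj(power_mw, time_ns):
    """Energy in nanojoules from average power (mW) and execution time (ns)."""
    # 1 mW * 1 ns = 1e-3 W * 1e-9 s = 1e-12 J = 1e-3 nJ
    return power_mw * time_ns * 1e-3


e_a = energy_nj(28.6, 1300)  # schedule A: slower, lower average power
e_b = energy_nj(56.4, 660)   # schedule B: faster, higher average power
print(f"E_A = {e_a:.2f} nJ, E_B = {e_b:.2f} nJ")
```

Schedule A roughly halves the average power, but both schedules need about 37 nJ.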
Window is 60ns.
‘C’ uses multicycling and no parallelism, has the lowest peak power and is slowest.
‘H’ is fastest, has the highest peak power but comparable average power (according to the window).
Note: optimizing for peak power and optimizing for average power are completely distinct tasks!
(Src: [Knight]) Min delay: refers to the time at which the operations are actually done, whereas the other case always assumes the 60ns window
What is resource sharing?
- Mapping ops and variables in the CDFG to FUs, registers etc. and defining the interconnection between them to form the RTL implementation
- It is the mapping from function to structure
- It directly impacts power consumption by determining the switching activity at various signals, buses, wires, macro blocks etc.
Observation:
- As a result of resource sharing, variables’ values are time-multiplexed onto registers
- Values that appear as input operands of ops are time-multiplexed to appear at the inputs of FUs
- Values that are transferred between FUs are sequenced to appear on interconnect units (buses and multiplexers)
- Word-level temporal correlations of values on data path signals are determined by the correlations among variables and input operands of […]
- Word-level correlations, in turn, determine bit-level switching activity
Idea: can signal correlations be exploited to reduce switched capacitances?
Analysis of above scenario:
- Focus on the ‘+1’ and ‘+2’ operations
- Two consecutive iterations of the DFG are shown. The adder executes, in order: “+1_1”, “+2_1”, “+1_2”, “+2_2”
- Values seen at the adder inputs are: (a1, b1), (c1, d1), (a2, b2), (c2, d2)
What is the switching activity at the adder inputs determined by?
(Src: [Anand98])
… switching activity is determined by:
first iteration,
Idea:
Exploit correlations between variables in behavioral description to minimize switched capacitance at RTL level
Scenarios of exploiting correlations:
S1: S2:
Scenarios of exploiting correlations (cont’d):
S3:
(Src: [Anand98])
What is regularity?
Regularity refers to the repeated occurrence of computational patterns within an algorithm
Idea:
Regularity can be exploited to reduce interconnect power: instances of repetitive patterns in the computation are detected, and resources are shared such that the same interconnect structure in the data path is reused for as many instances of a computation pattern as possible
Ex 1:
Scenario b): on the left-hand side a data flow graph, on the right an implementation.
Problem: for A2 -> M1, A2 -> M1, A1 -> S1, multiplexers are needed.
Scenario a): same graph, but the mapping of the DFG to the data path does not require multiplexers since A1 -> M1, for example, can be reused.
Conclusion: scenario b) does not preserve regularity, a) does.
So, where does the overhead power consumption come from?
[…] the interconnects and therefore the switched capacitances
[…] probably buffers) leads to more switching activity and therefore higher power consumption (Src: [Mehra])
Idea: defining
- E-instance: a pair of nodes connected by an edge
- E-template: type of an E-instance, classified by the type of input/output port. Ex: (add->add.right) means a template with an adder where […] adder
- E-coverage: # of instances of that type divided by the total # of edges in the graph
Task here: using E-templates in synthesis so as to minimize power, with E-coverage as the quality measure, through regular assignment
Ex 2: a fourth-order cascade filter
Disadvantages?
- May require more hardware units since sufficient E-templates need to be provided -> under certain circumstances power savings come at the cost of more hardware (Src: [Mehra])
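The E-coverage measure can be computed by simple counting. A minimal sketch, assuming each edge is given as (source node, destination node, destination port) and each E-template is identified by (source op type, destination op type, destination port); this encoding, the function name and the small graph are assumptions for illustration and are not taken from [Mehra].

```python
from collections import Counter


def e_coverage(op_type, edges):
    """Map each E-template to (# of its E-instances) / (total # of edges)."""
    templates = Counter(
        (op_type[src], op_type[dst], port) for src, dst, port in edges
    )
    total = sum(templates.values())
    return {template: count / total for template, count in templates.items()}


# Hypothetical graph: two adders (A1, A2) and two multipliers (M1, M2).
OP_TYPE = {"A1": "add", "A2": "add", "M1": "mul", "M2": "mul"}
EDGES = [
    ("A1", "M1", "left"),
    ("A2", "M2", "left"),
    ("M1", "A2", "right"),
    ("A1", "A2", "left"),
]

for template, coverage in e_coverage(OP_TYPE, EDGES).items():
    print(template, coverage)
```

In this sketch, a template with high coverage, such as (add->mul.left), marks an interconnect pattern that occurs often and is therefore a candidate for reuse.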
Example: E-template-based scheduling
(Src: [Mehra])
A typical DSP operation: constant multiplication and addition: Y = A * X (e.g. as part of an IIR filter)
Assumptions:
- m-bit data value X multiplied by m-bit constant A
Experiment:
- Applying random values of X for varying values of A
- X and A are represented in two’s complement
- A is normalized to 0.0 … 1.0
- Observing the average switching activity per bit at the multiplier output
Result: next slide
Observation:
- When the constant value A is ‘0’ => no switching activity (as expected)
- When A is 1.0, the output switching activity is equal to the switching activity of X
- In between: it is monotonically increasing
Figure: output switching activity per bit as a function of the constant value (i.e. A)
(Src: [Anand98]) (Src: [Ragh99])
Example
An adder: two m-bit data values X1, X2. It can be shown that: […]
Example: linear time-invariant signal processing system (e.g. IIR filter)
(Src: [Anand98])
What is “glitch power”?
Power consumption that is related to hazards, i.e. temporary values at the input/output of gates that cannot be explained when considering a truth table only. It is due to different propagation delays in combinational paths.
Analysis of the above example unveiled:
- A rising transition on signal x1 was frequently accompanied by a falling transition on c11. Thus, the rising transition on x1 and the falling transition on c11 are highly correlated.
- Transitions on signal x1 arrive earlier than transitions on signal c11 due to: a) non-balanced paths, b) wiring delays.
(Src: [Ragh99]) Shown: # transitions / # transitions w/o glitches
In general, glitches are generated at the control signals due to the simultaneous presence of the following two conditions:
1) Functional: correlation between rising and falling transitions at two or more signals that feed a gate.
2) Temporal: the controlling to non-controlling transition arrives earlier at the gate’s input (see example on last slide).
Note: a glitchy signal propagates (and therefore consumes even more power at […]
Glitches may also be generated by data path blocks: examples
Assumption: input signals are considered to be glitch-free and to arrive simultaneously => glitches reported are generated by the respective block
Note: comparator ‘=’ is glitch-free -> seems to indicate that the logic inside is well-balanced
(Src: [Ragh99])
Example: a multiplexer of two 8-bit words is controlled by a comparator “<”
Assumption: comparator generates glitches; A,B are glitch-free
Shown: a bit slice of words A and B
In table: # of transitions w/o glitches, total # of transitions
Discussion (denotation: <A_i,B_i>)
- <0,0>: glitch cannot propagate
- <0,1>, <1,0>: glitch propagates at G1 for <1,0> and at G2 for <0,1>
- <1,1>: glitch always propagates
How to prevent glitches ?
For example: use spatial correlations
(Src: [Ragh99]) Glitchy and non-glitchy signals
Spatial correlation: observation: the value of S is irrelevant in case <1,1> (OUT_i is ‘1’ anyway) => insert gate G_c => propagation of the glitch on S is prohibited
Result:
Q: why can’t just a buffer be inserted between S and the input of G2 (idea here: the glitches compensate and therefore eliminate each other)?
These and other techniques need to be built into synthesis tools and component libraries in order to prevent glitches in the first place. More techniques to prevent glitches during synthesis can be found in [Ragh99].
Problem with glitch reduction techniques:
- Power overhead of additional gates (might reduce the gains obtained through the reduced number of glitches)
- Increased power of existing gates due to more inputs (example above: G_3 has 3 instead of 2 inputs)
- …
(Src: [Ragh99])
What is clock gating? How can clock gating result in power savings?
Reduced capacitive switching in the clock network like
Also:
thus saving power
Idea: suppress or disable transitions from propagating to parts of the clock network under specific conditions that are determined by the clock gating circuitry
(Src: [Anand98])
Register re-loads previous value when comparator […] transition at the clock input to the register can be suppressed and transitions can be spared
Scheme 1: register clock input would be forced to ‘0’ when comp is ‘0’ (desired)
Scheme 2: register clock input is forced to ‘1’ when comparator […] ‘0’ (desired)
Scheme 1: does not work since the comp output is not stable before the rising clock edge
Scheme 2: OK (as long as the gating condition stabilizes before the clock goes ‘0’ -> ‘1’)
(Src: [Anand98])
Typically:
a) existing signals in the circuitry may be used for gating parts of the clock network, or
b) signals from the previous clock cycle may be used (in that case those signal values need to be stored in latches)
Example:
- The decode stage of a microprocessor pipeline can be used to clock-gate later stages
In other cases:
- Additional circuitry needs to be added
Pitfalls and overheads:
- Introducing additional gates in the clock tree may lead to an increase in clock delay and clock skew
- Ensure that gating clocks does not introduce glitches, otherwise: malfunction due to spurious loading of registers
- Circuits with gated clocks introduce additional complexity to synthesis and analysis tools
Idea: automated gated-clock synthesis (architecture, see above)
- Synthesizing an activation function F_a: goes to ‘1’ when the clock needs to be stopped
- Latch L ensures that glitches are not propagated to the clock signal
- The AND gate eventually suppresses the clock for the whole circuitry
Consideration:
- Identify conditions when next state and primary output conditions do not change
- Gating the clock at its root (e.g. for the whole circuitry) => eliminates the clock skew problem
- Added circuitry may incur additional power etc. => try to detect a subset of the idle conditions at low […]
(Src: [Anand98])
Given: a Moore FSM
- set of inputs
- set of outputs
- set of states
- initial states
- next-state function
- output function
Note: for a Moore FSM the output is a function of the state variables.
A self-loop in the state transition graph (STG) corresponds to an idle condition => condition where the clock to the FSM register can be suppressed
State s_i with self-loop: function […] such that […] iff […]
x_i: decoded state variable corresponding to s_i; x_i = 1 iff the FSM is in state s_i
[…] captures the set of input conditions under which the self-loop […]
Activation function:
(Src: [Anand98])
Note: activation function might be complex
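The idle conditions can be read off a state transition table by collecting the self-loops. The following is a rough sketch, assuming the activation function is ‘1’ exactly when the FSM is in some state s_i and the current input lies in the self-loop set of s_i; the three-state FSM, its transition table and the function names are hypothetical and only serve as an illustration.

```python
def self_loop_conditions(states, inputs, next_state):
    """For each state s_i, collect the inputs under which the FSM stays in s_i."""
    return {
        s: {x for x in inputs if next_state[(s, x)] == s}
        for s in states
    }


def activation(self_loops, state, x):
    """F_a = 1 iff the FSM is in a state whose self-loop set contains input x."""
    return x in self_loops[state]


# Hypothetical 3-state Moore FSM with a 1-bit input (not from the lecture).
STATES = ("IDLE", "RUN", "DONE")
INPUTS = (0, 1)
NEXT_STATE = {
    ("IDLE", 0): "IDLE", ("IDLE", 1): "RUN",
    ("RUN", 0): "DONE", ("RUN", 1): "DONE",
    ("DONE", 0): "DONE", ("DONE", 1): "IDLE",
}

loops = self_loop_conditions(STATES, INPUTS, NEXT_STATE)
for s in STATES:
    print(s, "clock can be stopped for inputs", sorted(loops[s]))
```

Since the outputs of a Moore FSM depend on the state only, stopping the clock on a self-loop changes neither the state nor the outputs.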
Often in data paths: the output of a register is fed back as one of the data inputs through, for example, a multiplexer network.
Task: find the condition under which this is the case by traversing the path through the multiplexer network.
The gating condition for the clock input is: […] (an expression over the select signals contr[0] and contr[1])
Note: select signals are already present in the network => only inversion (if necessary) and conjunction need to be provided for the gating condition
Caution: the strategy does not guarantee timing requirements (i.e. the gating condition should stabilize before clk goes ‘1’->’0’)
In order to avoid slower clocking: derive a reduced gating condition
(Src: [Anand98])
Shown: a clock tree that has gated clock conditions at various points (levels)
Tradeoff:
[…] tree => a larger capacitance (sum of all smaller […]
[…] the certain sub-tree can retain their old values
- the gating condition is satisfied fewer times => reduction of the # of transitions saved
- On the other side: when gating at a lower level, more transitions could have been saved, but that costs more logic (that itself consumes power …)
Figure: clock tree; labels: CLOCK, IDLE CONDITION, GATED CLOCK
(Src: [Anand98])
Observation: the way the clock tree is constructed has an impact on gating the clock tree (see example):
Left: a clk tree with four registers R1, R2, R3, R4 and the conditions under which the clk tree can be disabled. x1,…,x4 are decoded controller state variables which are mutually exclusive (no two of them can be ‘1’ simultaneously).
Observation: R1, R2 are grouped under one clock sub-tree even though the conjunction of their conditions can never be true => not possible to gate the clock at point “A”, for example (similar for R3, R4).
Right: gating condition for the sub-tree under “A”: […] (“B” may be grouped similarly)
Advantage: more suited to gated clocks since groups of registers with similar or overlapping idle conditions are placed closer together => trees can be shut down more efficiently
(Src: [Anand98])
Observation: some components of a circuit may follow some simple regular patterns. In particular, a component may be idle and active in alternating clock cycles or so
=> clock gating circuitry does not necessarily need to be data dependent
Example: a circuit with all registers fed by a single clock; the whole capacitance is C and the frequency is f
Assume: the design is partitioned into two parts, each fed by a clock signal with f/2, with capacitances C1, C2. Power savings can be achieved if:
$C_1 \cdot \frac{f}{2} + C_2 \cdot \frac{f}{2} < C \cdot f$, i.e. $C_1 + C_2 < 2C$
So the circuit needs to be partitioned carefully, which should often be possible to achieve.
Note: savings here are for the clock tree only
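The partitioning condition can be checked for candidate partitions with a one-line ratio. A minimal sketch, assuming clock-tree power is proportional to capacitance times frequency at a fixed supply voltage; the function name and the capacitance values are illustrative assumptions.

```python
def clock_power_ratio(c_total, c1, c2):
    """Clock-tree power of the two-clock design relative to the single clock.

    Single clock: proportional to C * f.
    Two clocks at f/2: proportional to C1 * f/2 + C2 * f/2.
    """
    return (c1 + c2) / (2.0 * c_total)


# Hypothetical capacitances in arbitrary units.
print(clock_power_ratio(10.0, 6.0, 6.0))    # below 1: partitioning saves power
print(clock_power_ratio(10.0, 12.0, 12.0))  # above 1: partitioning costs power
```

A ratio below 1 is equivalent to C1 + C2 < 2C.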
Idea: using multiple non-overlapping clocks. Example:
Scheduled DFG (left fig.): clock cycles of the schedule s1,…,s5 have been assigned to two non-overlapping clock domains, CLOCK1, CLOCK2 in alternating way Right fig: shows single-clock RTL circuit that implements given DFG using minimal resources but does not implement the clock partitioning shown in left fig.
(Src: [Anand98])
Shown: RTL circuit that has been implemented with two clocks
Restrictions:
a) an op scheduled in CLOCK1 cannot share an FU with an op scheduled in CLOCK2
b) a variable generated in CLOCK1 cannot share a register with a variable generated in CLOCK2
Why? =>
1. ensure that each register can be clocked by either CLOCK1 or CLOCK2
[…] switching activity in their respective active clock cycles
(Src: [Anand98])
- Inserting additional gates into the clock tree can lead to an increase in the clock delay and clock skew
- Circuit malfunction due to spurious loading of registers when not taking into consideration that gating logic might introduce glitches
- Increase of complexity for synthesis and analysis tools
Basic idea: pre-compute (i.e. predict) the output of a circuit one cycle ahead with additional logic and then switch off the original logic.
What is the rationale behind it? For a majority of input values, the output might be computed with very simple logic, but in order to cover all inputs, complex logic is necessary.
Circuit with pre-computation logic
Assumptions: ‘A’ represents combinational logic; has […]
Architecture: introduce new functions g1, g2 as follows:
$g_1 = 1 \Rightarrow f = 1$, $g_2 = 1 \Rightarrow f = 0$
Tasks of g1, g2: a) pre-compute, b) switch off the original circuit in certain cases when prediction is possible
Note: a) g1, g2 only cover a subset of all x1,…,xn (desirable: a large coverage); b) imposes overhead in the form of power, area and probably performance
(Src: [Devadas])
Example: a comparator of two n-bit values C and D that results in ‘1’ if C > D.
In that case, g1 can be defined to test the MSB: $g_1 = C_{n-1} \cdot \overline{D_{n-1}}$
Accordingly, g2 is defined as: $g_2 = \overline{C_{n-1}} \cdot D_{n-1}$
If g1=1 => C>D; g2=1 => D>C
So, the XNOR of the two MSBs needs to be computed for the pre-computation logic.
Note: assuming a uniform probability for the inputs, the probability that the XNOR results in ‘1’ is 50%! If the bit MSB-1 is also tested, the probability that the output can be predicted is 75%!
=> With little effort the circuit output can be predicted and power can be saved by switching off (gating) the original circuit. For large n, the power dissipation of the additional XNOR gate can be neglected.
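The 50% and 75% figures can be reproduced by enumerating all input pairs. The sketch below assumes unsigned n-bit operands and uniformly distributed inputs; it counts the pairs for which the tested top bits of C and D already differ, so that C > D is decided without looking at the remaining bits. The function name and the choice n = 4 are assumptions for illustration.

```python
from itertools import product


def precompute_hit_rate(n_bits, tested_bits):
    """Fraction of (C, D) pairs decided by the top `tested_bits` bits alone."""
    shift = n_bits - tested_bits
    hits = total = 0
    for c, d in product(range(1 << n_bits), repeat=2):
        total += 1
        if (c >> shift) != (d >> shift):
            # Top bits differ: the comparison result is already determined.
            assert (c > d) == ((c >> shift) > (d >> shift))
            hits += 1
    return hits / total


print(precompute_hit_rate(4, 1))  # MSB only
print(precompute_hit_rate(4, 2))  # MSB and MSB-1
```

In the remaining cases the tested bits are equal and the original comparator has to evaluate the lower bits.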
(Src: [Devadas])
Recall: “pre-computation” is a shut-off technique that works on a clock cycle basis and is limited to the given structure of the logic.
Can power be managed at a higher level?
Observation: it is common for performance optimization to compute all […] evaluation itself. The appropriate result will be chosen and the other one(s) will be discarded => this is ineffective in terms of power consumption.
Idea: power-effective: enforce control dependencies between […] during scheduling. This may be accomplished by meeting performance constraints first and then optimizing power consumption.
Example: expression |a-b|
Assumption: each op ‘-’ and ‘>’ takes one clock cycle and ‘sel’ […]
Constraint: a schedule within 2 cycles => two possible schedules b) and c)
Schedule b): ignores control dependencies; a-b and b-a are executed independently of ‘>’. There is flexibility in scheduling ‘>’.
Problem: from a power point of view b) is inefficient: both a-b and b-a are always executed.
Schedule c): a-b or b-a is activated exclusively, depending on the outcome of ‘>’. a-b, b-a may be assigned to the same or to two different subtractors (latter case: one needs to be shut down).
(Src: [Anand98])
Note: previous techniques (pre-computation, gated clocks) are only applicable to blocks of combinational logic that are fed by registers.
Here: applicability to circuit blocks that are embedded within combinational logic.
Idea here: disable transitions at inputs
- Insert transparent latches at all inputs of the embedded block
- If the block does not perform any useful operation: a) the transparent latches at the inputs are disabled, b) they retain the previous cycle’s values => avoids unnecessary power consumption
(Src: [Anand98])
F: logic that is computing o
I: set of inputs to F
ODC_o: observability don’t-care set with respect to o, i.e. the set of primary input assignments to the entire circuit such that the value of o has no influence on the values at the primary outputs
LE: an arbitrary signal of the existing circuit such that LE => ODC_o. Thus, when LE = 1, the value on o is not needed to compute the primary outputs.
$t_e(LE{=}1)$: earliest time at which any of the inputs in I can change its value when LE = 1
$t_l(LE{=}1)$: the latest time at which LE can stabilize to logic value ‘1’
Pure guarded evaluation: if $t_e(LE{=}1) > t_l(LE{=}1)$ => LE can be used to control the guard logic. The transparent latches need to be disabled in time, i.e. early enough to cut off transitions on any of the inputs in I (and thus save power).
(Src: [Anand98]) (Src: [Tivari])
Idea: use guarded evaluation, but with a relaxed condition such that it becomes easier to find the shut-off condition (remember: signal LE must be available anywhere in the existing circuit).
Timing condition: same as before for “pure guarded evaluation”
Two cases for LE=1 (note: for LE=0 the circuit is functioning correctly anyway):
1. […] primary outputs (OK)
2. […] primary output. Circuit may […] gate is needed (see figure)
Comparing pre-computation and guarded evaluation:
[…] circuits, whereas GE leaves the circuit as is (especially important in hand-optimized circuitry)
(Src: [Anand98]) (Src: [Tivari])
Idea: apply operand isolation from the logic level to high-level synthesis.
In fact: the conditions under which a resource (e.g. a functional unit FU) is not used are readily available from scheduling and resource sharing! => idle cycles can be derived from the circuits (see next slide)
(Src: [Anand98])
Scheduled example DFG
Idle cycles of functional units:
Insert transparent latches at the FUs’ inputs to perform operand isolation:
Q: Why no latches at the inputs of “CMP1”?
A: The values that feed its inputs do not change in the cycles in which it is idle
(Src: [Anand98])
Possible disadvantages: the isolation technique attempts to eliminate spurious activity at the inputs of embedded resources (e.g. functional units) by inserting transparent latches into the RTL implementation.
[…] transparent latch enable input should arrive before its data input can change).
[…] path, which may not be acceptable for high-performance designs.
Idea: rather than applying transparent latches to an already scheduled DFG, can’t scheduling, mapping etc. already take power shut-down techniques into consideration?
Impact of variable assignment on power consumption:
Two candidate assignments, Assignment 1 and Assignment 2, are shown in the table (next slide). The architectures obtained using these assignments were subjected to logic synthesis optimizations and were placed and routed using a cell library. The transistor-level netlists extracted from the layouts were simulated using a switch-level simulator with typical input traces to measure power. For the circuit Design 1, synthesized from Assignment 1, the power consumption was 30.71mW, and for the circuit Design 2, synthesized from Assignment 2, the power consumption was 18.96mW!
Example: a DFG and two possible register assignments that differ significantly in power consumption
Two distinct schedules. Shown: FU (in box), input variables left and right of the operation; grey-shaded: variables at the input of the respective operation change value; spurious input transitions, i.e. those that do not correspond to a DFG operation, are marked with an ‘X’.
Observation: a functional unit whose inputs do not change does not perform a spurious operation.
Conclusion: constrained register sharing can save significant energy/power:
- Upper case: 7 operations that do not correspond to a DFG operation
- Lower case: only one such
Note: a) the number of control steps is not increased, b) the number of HW resources (FUs) is still the same
Problem in previous example: There is still a spurious operation is control step s1 of each
control step s3 of each iteration to control step s1 of the next iteration
Idea: Combine dynamic variable rebinding with variable assignment to completely eliminate the spurious operation. How:
input of SUB1 until new value of v3 in current iteration is generated
in alternate iterations
(Src: [Anand98])
Main idea: reduce power without incurring a large hardware overhead (such as transparent latches).
multiplexer networks and functional units in the data path
(Src: [Anand98])
Example: X.25 protocol
(Src: [Anand98])
Shown:
expression for control signals (x_i = 1 => controller is in state s_i); b) activity graphs of the ALU (state transitions with the actions involved; e.g. sel(0), sel(1), sel(2), SelectFunc(0) are 1, 0, 1, 0, respectively => “byte-byteCount” is performed)
Observation:
states (gray shaded). This can be determined from the scheduling information of high-level synthesis.
Idea:
such that same operands (stay stable). Since they do not change => no switching => no unnecessary power consumption
Conclusion: re-specifying control signals can lead to power savings (without changing anything else)
(Src: [Anand98])
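The effect of re-specifying control signals in idle states can be illustrated with a small sketch. It assumes a single select signal whose value is a don't-care (`None`) while the unit is idle, and compares driving a default 0 in idle states with holding the last specified value. The trace and both helper functions are hypothetical illustrations, not taken from [Anand98].

```python
from typing import List, Optional


def hold_in_idle(trace: List[Optional[int]]) -> List[int]:
    """Replace don't-care entries (None) with the most recent specified value."""
    held: List[int] = []
    last = 0  # assumed reset value before the first specified entry
    for value in trace:
        if value is not None:
            last = value
        held.append(last)
    return held


def toggles(trace: List[int]) -> int:
    """Number of value changes between consecutive control steps."""
    return sum(a != b for a, b in zip(trace, trace[1:]))


# Hypothetical select signal; None marks states in which the unit is idle.
select = [1, None, None, 1, 0, None, 0]

default_zero = [0 if v is None else v for v in select]
held = hold_in_idle(select)

print(toggles(default_zero), toggles(held))  # 3 1
```

Holding the value keeps the mux selection, and hence the operands seen by the unit, stable across idle states, which is the source of the saving.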
Shown:
register that stores variable I and the muxes that feed it through signals sel(18), M(18), sel(19)
shaded states are inactive states as far as that respective part of the design is concerned)
Idea:
lead to reduced power consumption?
Idea:
“bytes-byteCount”). Since the operation changes, operand c28 also changes (c28 is the output of the ALU and feeds the mux tree)
Solution:
propagating this variable into the shown mux tree (see right state diagram)
Re-labeling activity graphs: how to label an idle vertex in an activity graph:
Different incoming and outgoing transitions of the idle state have different execution probabilities
The values of data operands fed to the mux trees may themselves change => Only selecting the same
minimized
Shown:
Goal:
L1, L2, L3 such that the activity at the input of the “<” is minimized
Some conventions:
Has an entry in a row/column only if the respective transition actually exists in the state transition graph
Goal: find an L* such that the cost function is minimized => best re-labeling
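The search for a best re-labeling can be sketched as an exhaustive search over the labels of the idle vertices. The cost function used here is an assumption for illustration: each transition contributes its execution probability times an activity value that is 0 if source and target carry the same label and 1 otherwise. The state names, probabilities and labels `A`/`B` are hypothetical, and the model ignores that operand values may change by themselves.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Edge = Tuple[str, str, float]  # (source state, target state, probability)


def best_relabeling(
    edges: List[Edge],
    fixed: Dict[str, str],
    idle: Sequence[str],
    choices: Sequence[str],
    activity: Callable[[str, str], float],
) -> Tuple[Dict[str, str], float]:
    """Try every labeling of the idle states and keep the cheapest one."""
    best_cost = float("inf")
    best_labels: Dict[str, str] = {}
    for combo in product(choices, repeat=len(idle)):
        candidate = dict(zip(idle, combo))
        labels = {**fixed, **candidate}
        # Expected activity: sum over transitions that exist in the graph.
        cost = sum(p * activity(labels[u], labels[v]) for u, v, p in edges)
        if cost < best_cost:
            best_cost, best_labels = cost, candidate
    return best_labels, best_cost


def unit_activity(a: str, b: str) -> float:
    """Assumed activity model: switching only when the label changes."""
    return 0.0 if a == b else 1.0


# Hypothetical activity graph: s0 and s2 are active, s1 and s3 are idle.
edges = [
    ("s0", "s1", 1.0), ("s1", "s2", 0.3), ("s1", "s0", 0.7),
    ("s2", "s3", 1.0), ("s3", "s0", 0.4), ("s3", "s2", 0.6),
]
labels, cost = best_relabeling(
    edges, fixed={"s0": "A", "s2": "B"}, idle=["s1", "s3"],
    choices=["A", "B"], activity=unit_activity,
)
print(labels, round(cost, 2))
```

Exhaustive search is only practical for a handful of idle vertices, since the number of candidates grows exponentially with their count.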
Be clear about what is to be reduced: peak power or average power (they require different strategies)
HW power sources:
Data path Control path Clock tree
Optimization strategies:
Operator scheduling for low power Hardware power management (clock gating) Re-labeling of controller
Note:
Very often there is a tradeoff: reduced power may lead to more logic, a more complex design, reduced performance, …
[Heer04] Ch. Heer, U. Schlichtmann, “Ultra-Low-Power Design: Device and logic design approaches”,
[Anand98] A. Raghunathan, N.K. Jha, S. Dey, “High-level power analysis and optimization”, Kluwer Academic Publishers, 1998.
[Sarraf95] S. Raje, M. Sarrafzadeh, “Variable voltage scheduling”, IEEE/ACM ISLPED 1995, pp. 9-14, 1995.
[Knight] R.S. Martin, J.P. Knight, “Power-Profiler: Optimizing ASIC’s Power Consumption at the behavioral level”, Proc. of IEEE/ACM Design Automation Conf. (DAC’95), pp. 42-47, 1995.
[Macii04] E. Macii (Ed.), “Ultra Low-Power Electronics and Design”, Kluwer Academic Publishers, 2004.
[Devadas] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, M. Papaefthymiou, “Precomputation-based Sequential Logic Optimization For Low Power”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 74-81, November 6-10, 1994.
[Ragh99] A. Raghunathan, S. Dey, N.K. Jha, “Register transfer level power optimization with emphasis on glitch analysis and reduction”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 18, Issue 8, pp. 1114-1131, Aug. 1999.
[Tivari] V. Tiwari, S. Malik, P. Ashar, “Guarded evaluation: pushing power management to logic synthesis/design”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, Issue 10, pp. 1051-1060, Oct. 1998.
[Mehra] R. Mehra, J. Rabaey, “Exploiting Regularity for Low Power Design”, IEEE/ACM Intl. Conference on Computer Aided Design (ICCAD96), pp. 166-172, 1996.