
NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS
GRADUATE DEPARTMENT OF INFORMATICS
ADVANCED COMPUTER ARCHITECTURE
ESSAY: POWER MANAGEMENT

Supervising professor: Prof. C. Halatsis. By: Kosti Eleni (M487), Raptopoulou Kleoniki (M515), Psarra Tzanetina (M510). ATHENS 2002-2003


Power Management

1. Introduction

Over the last three decades, a significant improvement in microprocessors' performance has taken place. This evolution is due partly to semiconductor technology scaling and partly to innovations in computer architecture and in the accompanying software. In the first case, semiconductor technology scaling has resulted in larger numbers of smaller and faster transistors, whereas in the second, microprocessors achieve performance greater than what would have been possible by technology scaling alone. In this essay we present some basic definitions relevant to power consumption. Our emphasis is on the problem of dynamic and static power dissipation. We will also discuss some of the current power reduction techniques.

2. Architecture Importance

Before shedding light on the sources of power dissipation (static and dynamic), we would like to stress the importance of architectural decisions in this field. There are two main reasons why architecture is instrumental in boosting performance beyond technology scaling. First, technology scaling is often non-uniform. For example, the technologies used to build processors are optimized for speed, while technologies used to build main memories are mostly optimized for density. Without the help of new architectural techniques, a smaller and faster processor would simply spend most of its time waiting for the relatively slower memory. Second, technology evolution makes it easier to reach higher integration by allowing us to pack more transistors on a chip of the same size. On the one hand, processor generations deliver higher performance due to the use of more transistors and higher frequencies. On the other hand, there is a simultaneous increase in power requirements and density, which imposes stringent constraints on modern microprocessors.


It is thus clear that micro-architectural mechanisms will have to be (re)designed in order to focus on power concerns. To understand the opportunities for power optimizations at the micro-architectural level, we first have to understand the sources of power dissipation in modern microprocessors.

3. Power Consumption

Power consumption has become an important factor in microprocessor design. The situation is aggravated in multiprocessor systems such as servers, where multiple processors are close to each other. An increase in power dissipation beyond current levels will result in disproportionate increases in cost as current power delivery and heat removal systems reach their limits. Moreover, power constraints exist for mobile and embedded microprocessors: maximization of battery life and heat removal are two problems that should be taken into consideration.

The fact that power dissipation poses a significant performance limit has led to the consideration of power in the early stages of the design process. Architects become more and more involved with this subject, as the ability of circuit techniques to control dissipation has proven insufficient. As mentioned above, we are going to deal with two types of power consumption: dynamic power and static power. Dynamic power dissipation occurs whenever a transistor or wire changes voltage. As we will see, dynamic power dissipation is proportional to the product of the number of devices changing value, the speed of these changes (i.e., the operating frequency), and the square of the voltage change. Reducing power dissipation is possible by reducing each of these factors. Power-aware micro-architectural techniques address the number of devices and their switching speed, while taking performance into consideration.

P_dyn = C · V_cc² · f     (dynamic power equation)

On the other hand, the term "leakage power" refers to the power dissipated even when devices do not change values, owing to the imperfect nature of semiconductor-based transistors. In existing designs, leakage power is relatively small. Unfortunately, as we move towards smaller transistors and lower voltages, there is a rapid increase in leakage power. Power-aware efforts in this area aim at cutting off power to devices while they are not being used. This is a challenging task, as powering devices on and off requires some time and hence can severely impact performance.

P_static = V_cc · I_leak     (static power equation)
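As a minimal numerical sketch of the two equations above (all component values below are hypothetical, chosen only for illustration):

```python
def dynamic_power(c, v_cc, f):
    """Dynamic power equation: P_dyn = C * Vcc^2 * f (watts)."""
    return c * v_cc**2 * f

def static_power(v_cc, i_leak):
    """Static power equation: P_static = Vcc * I_leak (watts)."""
    return v_cc * i_leak

# Hypothetical chip: 1 nF switched capacitance, 1.2 V supply,
# 1 GHz clock, 50 mA total leakage current.
p_dyn = dynamic_power(1e-9, 1.2, 1e9)   # 1.44 W
p_static = static_power(1.2, 0.05)      # 0.06 W
```

The example also illustrates the trend discussed below: dynamic power dominates here, but leakage grows as voltages and transistor sizes shrink.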

The following figure (Figure 1) shows the increases in static and dynamic power for Intel's past few technologies. Projecting these trends forward, static power dissipation will equal dynamic power dissipation within a few generations. Higher-order effects unimportant today, together with aggressive dynamic power optimizations, could cause the static and dynamic power contributions to become equal in as little as two generations. It is thus important for architects to be aware of how they may control static power dissipation in future technologies.

Figure 1. Trends in dynamic and static power dissipation, showing the increasing contribution of static power.


4. Dynamic Power Management (DPM)

Unlike bipolar technologies, where the majority of power dissipation is static, the bulk of power dissipation in properly-designed CMOS circuits is the dynamic charging and discharging of capacitances. Thus, a majority of the low-power design methodology is dedicated to reducing this predominant factor of power dissipation. The following example underlines the factors that affect dynamic power dissipation. For the simple inverter gate shown in Figure 2, it can be shown that a low-to-high output transition draws C_eff · V_cc² joules (energy) from the power supply, V_cc. The high-to-low output transition dissipates the energy stored on the capacitor into the NMOS device. Given a frequency f of low-to-high output transitions, the power drawn from the supply is f · C_eff · V_cc². This simple equation holds for more complex gates, and other logic styles as well, given a periodic input.

Figure 2. Dynamic Switching Power Dissipation

Accurate calculations for C_eff can be done as shown below. The basic capacitor elements are shown in Figure 2. The net loading capacitance, C_eff, consists of the gate capacitance of subsequent gate inputs attached to the inverter output, the interconnect capacitance, and the diffusion capacitance on the drains of the inverter transistors. Test chips have shown that for 1.2 µm ICs, the total capacitance is split roughly equally between these three types. As the minimum gate length scales down, though, interconnect capacitance will become dominant.


Usually, the value of f is a difficult number to quantify, as it is most likely not periodic and is correlated with the input test vectors into the circuit. Without doing a switch-level circuit simulation, the best way to calculate f is to perform statistical analysis on the circuit to determine a mean value. Since dynamic switching power is the major component of overall power dissipation, the low-power design methodology concentrates on minimizing total capacitance, supply voltage, and frequency of transitions.

Dynamic Power Management (DPM) is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components. DPM encompasses a set of techniques that achieve energy-efficient computation by selectively turning off (or reducing the performance of) system components when they are idle (or partially unexploited). The following model is used for dynamic power consumption:

P_dyn = a · f · C · V_CC²

where a is the activity factor of node Sx, f is the clock frequency, C is the load capacitance of node Sx, and V_CC² is the supply voltage (squared).

Based on the previous equation, an effective capacitance, C_eff, can be defined as

C_eff = a · C

which combines the physical capacitance being switched, C, and the activity factor a. The effective capacitance can be found from simulation and measurements as:

C_eff = P_dyn / (f · V_CC²)

For the DPM strategies, the input is the length of an upcoming idle period and the decision to be made is whether to transition to a lower power dissipation state while the system is idle. Analytical solutions to online problems are best described in terms of a competitive ratio that compares the cost of an online algorithm to the optimal offline solution, which knows the input in advance.
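The effective-capacitance relations above can be checked numerically. A small sketch (the measured power and physical capacitance below are hypothetical figures):

```python
def effective_capacitance(p_dyn, f, v_cc):
    """C_eff = P_dyn / (f * Vcc^2), recovered from measured dynamic power."""
    return p_dyn / (f * v_cc**2)

def activity_factor(c_eff, c_physical):
    """a = C_eff / C, since C_eff = a * C."""
    return c_eff / c_physical

# Hypothetical measurement: 0.72 W of dynamic power at 1 GHz and 1.2 V,
# on a node with 2 nF of physical switched capacitance.
c_eff = effective_capacitance(0.72, 1e9, 1.2)   # 0.5 nF
a = activity_factor(c_eff, 2e-9)                # activity factor 0.25
```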

4.1. Online Algorithms for DPM

Before analyzing the deterministic and the probability-based algorithms, let us first define some useful notation:

The number of states will be k+1: state k is the active state and state 0 is the completely powered down (sleep) state.

a_i is the power dissipation rate for state i.

β_i is the total energy dissipated in moving from state i back to state k.

We will assume that the states are ordered so that a_i ≥ a_(i-1) for all 1 ≤ i ≤ k. Once an idle interval starts, the algorithm can choose to be in any of the k+1 states at any point of time. When a request for service arrives, the algorithm must immediately transition back to state k. The online algorithm does not know the length of the idle interval until it ends. It costs nothing for the algorithm to drop to a lower power dissipation state. This can be assumed because the power-down energy can be incorporated into the start-up energy without loss of generality.

4.1.1. Deterministic Algorithm

To get the optimal cost, plot each line c = β_i + a_i · t. This is the cost of spending the entire interval in state i as a function of t, the length of the interval. We call the lower envelope of all these lines LE(t). The optimal cost for an interval of length t is

LE(t) = min_i { a_i · t + β_i }.

The algorithm will remain in the state which realizes the minimum in LE and will transition at the discontinuities of the curve. This algorithm does not take input patterns into account. Even so, in the worst case, the energy cost resulting from the online decisions can be no worse than 2 times the energy cost of the optimal offline strategy, which knows the input sequence in advance. Depending on request arrival patterns, this worst-case bound may not actually be reached, and the empirical ratio of the online to offline costs may be much lower.
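The lower-envelope rule can be sketched as follows (the three power states below are hypothetical, purely for illustration):

```python
def optimal_cost(states, t):
    """LE(t) = min_i { a_i * t + beta_i }: cheapest total cost for an
    idle interval of known length t. states = [(a_i, beta_i), ...]."""
    return min(a * t + beta for a, beta in states)

def best_state(states, t):
    """Index of the state realizing the minimum in LE(t)."""
    return min(range(len(states)),
               key=lambda i: states[i][0] * t + states[i][1])

# Hypothetical machine: state 0 = sleep (no power, costly wake-up),
# state 1 = standby, state 2 = active (full power, free wake-up).
states = [(0.0, 5.0), (1.0, 2.0), (3.0, 0.0)]
short = best_state(states, 0.5)    # short idle interval: stay active
long_ = best_state(states, 10.0)   # long idle interval: sleep pays off
```

The online algorithm tracks this envelope in time: it drops to the next state exactly at each breakpoint where a cheaper line takes over.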

4.1.2. Probability-based Algorithm

In this case we assume that the length of the idle interval is generated by a fixed, known distribution whose density function is π. If the request interarrival time probability distribution is known beforehand, the worst-case competitive ratio can be improved by 21% with respect to the deterministic case. Knowing that the cost (expected energy consumption) of the optimal offline algorithm is:

Σ_{i=1..k} ∫_{t_i}^{t_{i+1}} π(t) · [a_i · t + β_i] dt

we must determine k thresholds in order to determine the online algorithm. The threshold τ_i is the time at which the online algorithm will transition from state i to state i-1. In the spirit of the deterministic online algorithm for the multi-state case, we will let τ_i be the same as the threshold which would be chosen if i and i-1 were the only two states.
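Under the deterministic rule for two states, the threshold is the time at which the two cost lines cross, i.e. where a_i · τ + β_i = a_(i-1) · τ + β_(i-1). A sketch (the state parameters below are hypothetical):

```python
def pairwise_threshold(a_hi, beta_hi, a_lo, beta_lo):
    """Crossover time of the two cost lines a*t + beta for adjacent
    states: before tau the higher-power state is cheaper, after tau
    the lower-power state wins."""
    return (beta_lo - beta_hi) / (a_hi - a_lo)

# Hypothetical: active state (a=3, beta=0) vs. standby (a=1, beta=2):
# transition to standby after one time unit of idleness.
tau = pairwise_threshold(3.0, 0.0, 1.0, 2.0)   # 1.0
```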
5. Static Power Dissipation

As we have already mentioned, static power dissipation due to transistor leakage constitutes an increasing fraction of the total power in modern semiconductor technologies. Given that the majority of the factors affecting static power are decided at the circuit level, an abstract model should be proposed to help architects consider static power consumption in their design decisions. The proposed simplified formula is the following:

P_static = V_CC · N · k_design · Î_leak     (static power model)

Parameter: V_CC (power supply voltage). Scaling behavior: decreases by 30% per process generation. Ways to reduce: use multiple supply voltage domains; increase IPC to allow a lower clock frequency (allowing V_CC reduction) at the same performance.

Parameter: N (number of transistors in the design). Scaling behavior: increases by 100% per process generation. Ways to reduce: reduce functionality (e.g., remove special-purpose circuitry); use a circuit style requiring fewer transistors for the same functionality.

Parameter: k_design (empirically determined parameter representing the characteristics of an average device). Scaling behavior: approximately constant. Ways to reduce: use an efficient circuit style; reduce the clock frequency to allow more complex (high fan-in) logic.

Parameter: Î_leak (technology parameter describing the per-device subthreshold leakage). Scaling behavior: highly dependent on the aggressiveness of V_T (threshold voltage) scaling. Ways to reduce: partition the design into frequency domains, allowing the use of less aggressive (lower-leakage) devices in some domains.

Table 1

The level of abstraction in the model is appropriate for its application by architects. Each of the parameters is amenable to estimation at the architectural level (either based on the design or the expected target technology). A more detailed model would require accuracy in technology and design parameters that would not be

available at an early stage in the design process. Furthermore, absolute accuracy is not as important as relative accuracy when making design tradeoffs. Finally, the model suggests different means of addressing static power early in the design process. Some may claim that architects have no control over static power because of its strong dependence on technology and circuit optimization (which does not typically involve architects). While lower-level optimizations more directly affect the final static power dissipation, awareness of the issue during the architectural definition can result in an architecture better suited to later optimization.
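The abstract model and the scaling behaviors of Table 1 lend themselves to back-of-the-envelope estimates. A sketch with purely hypothetical parameter values:

```python
def static_power_model(v_cc, n, k_design, i_leak):
    """Abstract static power model: P_static = Vcc * N * k_design * I_leak,
    where i_leak is the per-device subthreshold leakage current."""
    return v_cc * n * k_design * i_leak

# Hypothetical design: 100M transistors, k_design = 0.1,
# 1 nA leakage per device, 1.2 V supply.
p = static_power_model(1.2, 100e6, 0.1, 1e-9)            # 0.012 W

# Next generation per Table 1: N doubles, Vcc drops by 30%
# (holding k_design and per-device leakage fixed for simplicity).
p_next = static_power_model(1.2 * 0.7, 200e6, 0.1, 1e-9)
```

Even with constant per-device leakage, the doubling of N outpaces the voltage reduction, so static power grows generation over generation, consistent with the trend in Figure 1.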

6. Different Ways to Reduce Static Power

The model for static power presented in the previous section suggests different ways in which static power may be controlled: reducing any factor in the equation will reduce the power requirement. Therefore, static power may be lowered by reducing the supply voltage (lower V_CC), using fewer devices (lower N), using a more power-efficient design style (lower k_design), or using slower devices (lower Î_leak). Depending on the method employed, any of these options (whose architectural applications will be discussed in this section) may require performance to be sacrificed to realize power savings.

6.1. Reducing the Supply Voltage

The supply voltage (V_CC) is not typically considered an architecturally controllable parameter. However, the nature of the architecture influences the supply voltage optimization which occurs at the end of the design cycle. Architects can enable lower supply voltages by making performance less sensitive to latency. Circuits with less strict latency requirements can operate at a lower clock frequency and supply voltage. By partitioning the circuit into several domains operating at different supply voltages, both static and dynamic power savings are possible. Modern microprocessors already use this technique to allow a higher voltage for off-chip communication than is used in the core. Level-shifter circuits are required for communication between voltage domains. The partitioning should take into account the extra delay incurred in crossing domain boundaries.


To reduce the supply voltage for the entire chip without partitioning, the global clock frequency must be reduced. However, lowering clock speed without changing the supply voltage will not necessarily result in any energy savings. By running at fast clock rates and correspondingly high supply voltages, an out-of-order architecture wastes energy, especially given the energy costs of architectural features such as branch prediction, large instruction windows, and multiple functional units. For applications that do not need high performance, we want to run slower with smaller voltages to save energy. When clock speed and supply voltage are reduced together, there is a quadratic reduction in energy consumption.

Architectures which emphasize high IPC (Instructions Per Cycle) over high clock frequencies to achieve performance are superior in power characteristics, provided the added complexity does not erase the gains through increased device count. The point at which an architecture falls on the frequency-IPC scale directly influences the domain in which the supply voltage may be adjusted.
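The quadratic gain from scaling frequency and voltage together can be checked directly against the dynamic power equation (the values below are hypothetical). Halving f alone halves power but leaves energy per task unchanged, because the task takes twice as long; halving f and V together cuts energy per task by 4x:

```python
def dynamic_power(c, v, f):
    """P_dyn = C * V^2 * f."""
    return c * v**2 * f

def energy_per_task(c, v, f, cycles):
    """Energy = power * runtime, where runtime = cycles / f."""
    return dynamic_power(c, v, f) * (cycles / f)

C, CYCLES = 1e-9, 1e9
e_fast = energy_per_task(C, 1.2, 1e9, CYCLES)      # full speed
e_slow_f = energy_per_task(C, 1.2, 0.5e9, CYCLES)  # half f only: same energy
e_dvfs = energy_per_task(C, 0.6, 0.5e9, CYCLES)    # half f and V: energy / 4
```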

6.2. Reducing the Number of Devices


Another technique that may be employed to reduce static power is to reduce the total number of devices (N). Finding opportunities to reduce the device count enough to impact power dissipation without decreasing performance or functionality is difficult, however. Normal design practices eliminate obvious redundancy. Furthermore, a large number of devices must be removed to have a noticeable impact. Thus, units with replication make obvious targets. Cache size, number of functional units, and issue/retire bandwidth may all be reduced with varying degrees of difficulty and performance impact. If power optimization is a goal from the beginning, effort spent balancing the processor's resources reduces unnecessary replication by allocating fewer overall devices only where they are most needed. Another beneficial task for architects would be to equalize utilization: bursty operation requires a high maximum throughput to attain a given performance level. Equalizing resource requirements over time results in a lower total resource requirement for a given performance. Each of these approaches is appropriate for study at the architectural level.

An additional method to reduce N without actually removing devices is to turn them off when they are unused. Power gating is analogous to clock gating: the supply voltage (rather than the clock) of some functional unit is switched on only when the unit is required. Additional circuitry is added to determine the need for the unit. This circuitry may monitor inputs to the switched unit or use other available signals (Figure 3).



Figure 3. Power gating: gated logic receives power only when the PMOS switching device is active.

The gated circuitry will not dissipate any power when turned off. However, this must be balanced against the power dissipated by the gating circuitry and the power switching device itself. The power switching device must be large enough (W) to handle the average supply current of the circuit while in operation. If the device has a high enough threshold voltage, its leakage power can be lower than that of the gated circuit (which may use lower thresholds to be fast during operation). However, the addition of a gating device can result in reduced performance and noise margins.

The major problem with power gating is the latency between when the signal to turn a unit on arrives and when the unit is ready to operate. Due to the huge capacitance on the power supply nodes in a unit, several clock cycles will be needed to allow the power supply to reach its operating level. There are two alternatives regarding this latency. If the functional unit is required very rarely or is not on the critical computation path, it may not significantly impact performance to stall until the unit is ready. Alternatively, the requirement for a unit may be predicted far enough in advance for the unit to be ready when it is required.

Predicting the need for a functional unit raises the question of what kinds of microarchitectural events can be predicted accurately in advance. One obvious choice is the use of floating-point functionality. Some operating systems already track the use of floating-point hardware by applications to avoid saving the floating-point registers on context switches when unnecessary. Thus, the floating-point hardware may be switched at the same granularity as context switches. Portions of the cache may also be turned off, provided the working set of the application fits in a subset of the cache. Other opportunities include decode logic for rare or privileged instructions, interrupt logic (a timer interrupt, usually the most frequent interrupt, at 100 Hz occurs only every 10 million clock cycles at 1 GHz), or logic to handle certain rare exceptions. Architectural study is ideal for determining the impact of increased startup latencies and the feasibility of prediction.
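The trade-off between leakage saved and wake-up cost reduces to a break-even test. A sketch (the energy figures below are hypothetical, in arbitrary units):

```python
def worth_gating(idle_cycles, leak_energy_per_cycle, wakeup_energy):
    """Power-gate a unit only if the leakage energy saved over the idle
    period exceeds the energy needed to restore the supply afterwards."""
    return idle_cycles * leak_energy_per_cycle > wakeup_energy

gate_long = worth_gating(10_000, 0.001, 5.0)   # long idle period: gate it
gate_short = worth_gating(100, 0.001, 5.0)     # short idle period: leave on
```

This is exactly why accurate idle-length prediction matters: gating on a mispredicted short interval costs more energy than it saves, on top of the restart latency.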

6.3. Using More Static Power Efficient Circuits


The design factors comprising k_design offer few opportunities for static power reduction directly. Architects may not think directly about the distribution of device geometries or stacking factors; however, the requirements of the microarchitecture ultimately determine the type of circuitry that can be used for its implementation. For example, targeting higher IPC at a lower clock frequency allows for more logic between pipeline latches; power savings are realized by allowing the use of more complex gates with larger average stacking factors. The k_design values in Table 2 suggest some additional ways of employing power-efficient circuits. Wide multiplexors should be avoided, as they have a cost which grows super-linearly with the number of inputs. A tri-state bus with multiple drivers can accomplish the same function with lower total leakage (tri-state drivers have stacked devices where pass-gate multiplexors do not). Associative arrays are approximately three times leakier (including the larger number of transistors) than simple random-access memories. Implementing pseudo-associativity using hashing may be appropriate, depending on the exact requirements of the microarchitecture.

6.4. Using Multiple Threshold Voltages

Technologies which provide multiple threshold voltages allow for an even better trade-off between static power and performance. By using slower transistors, the leakage current may be reduced significantly. Note that it is not sufficient to simply clock a regular device more slowly, since this does not affect the subthreshold leakage; the transistor must actually be slower.

Different transistor speeds may be used in different ways. One method would be to employ fast devices only along critical timing paths. Although algorithms have been proposed to perform this task automatically, a concern is that automated modification of path delays could result in races. A second technique involves determining which functional units require the lowest latencies and allocating the budget of fast, leaky devices to these units only. To reduce dynamic power consumption, at least one announced product divides core logic into clock domains of different frequencies. Limited partitioning has occurred ever since core frequencies exceeded bus frequencies.


Partitioning enables one to use a device speed appropriate to the particular clock domain in which the device is to be located. Architects are best suited to determine which functionality belongs in which clock domain and what particular method of inter-domain communication should be used. This partitioning allows for optimization of both static and dynamic power consumption.

Threshold voltage may also be adjusted by applying a voltage to the body node of a transistor to reverse-bias the source-body junction. By raising the threshold voltage, this technique also results in slower devices. The ideal use of such a technique would be to apply the body bias only when the circuitry is unused and return to normal conditions when the circuit is required. The very high resistance of transistor body nodes results in a similar problem as in power gating, but of a much higher magnitude: establishing or removing a body bias will require a long time due to the high resistance of the body nodes of MOSFETs. Therefore, functional units that have long idle periods and startups that can be accurately predicted from architectural state are most appropriate for these techniques.

6.5. Power Reduction with Speculation

Speculation can be an important tool for architects when designing power-efficient architectures. Specifically, it provides a means of using slower devices without proportionally impacting performance. The performance-critical speculation circuitry employs fast devices, while the slower devices are used to verify the speculative results. The additional latency is incurred only when the speculation is incorrect. In some cases, the circuitry to perform the speculation is simple and very few of the power-hungry fast devices are required. The verification circuitry may use higher-threshold devices, use a lower supply voltage, run at a lower clock frequency, or some combination, resulting in both static and dynamic power savings over a fast, non-speculative solution at little performance cost. An architecture such as DIVA, in which a slow checker augments a fast, highly speculative core, could directly benefit from intelligent partitioning based on device speed requirements.


As a more specific example, consider data speculation on L1 cache accesses. Such speculation is already implemented on Intel's Willamette for performance reasons. L1 cache accesses are on the critical execution path for load instructions. Recognizing that the majority of such accesses hit in the cache, it is reasonable to speculatively assume that any data retrieved from a direct-mapped cache is correct prior to checking the tags. The cache tags and tag-match logic may then be implemented with slower, more efficient circuitry. Mis-speculation detection suffers from an increased latency implied by the slower circuitry, but performance is only impacted in the event of an L1 cache miss. Without speculation, the tags and matching logic would have to be fast to avoid a significant performance penalty. The potential power savings depend on the exact cache behaviour, the amount of logic moved off of the critical path, and the amount of additional logic required to recover from mis-speculation.

Another application of speculation was referred to briefly in the section titled 'Reducing the Number of Devices', in the context of predicting when certain circuitry will be needed. It may be hard to determine when certain functional units are required and when they may be shut off to save power. Instead of choosing to leave these units on constantly, it may be more appropriate to speculatively power down such functional units. Provided the speculation accuracy is reasonable, a large decrease in power consumption would incur only a small performance penalty. Mis-speculation would be visible as increased latency of the functional unit. In architectures which are power-limited (where peak performance is limited by power considerations), such techniques could actually allow for higher performance.
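Whether speculative power-down pays off depends on the predictor's accuracy. A sketch of the expected trade-off (all rates and latencies below are hypothetical):

```python
def expected_extra_latency(miss_rate, wakeup_latency_cycles):
    """Average latency added per use of the unit: incurred only when
    the power-down predictor guessed wrong and the unit must be woken
    on demand."""
    return miss_rate * wakeup_latency_cycles

def expected_power_saving(idle_fraction, predictor_coverage, leak_power):
    """Leakage avoided: the fraction of time the unit is both idle and
    correctly predicted idle, times its leakage power."""
    return idle_fraction * predictor_coverage * leak_power

# Hypothetical FP unit: 5% misprediction, 10-cycle wake-up,
# idle 80% of the time, predictor covers 90% of idle time, 2 W leakage.
lat = expected_extra_latency(0.05, 10)       # 0.5 cycles per use
sav = expected_power_saving(0.8, 0.9, 2.0)   # 1.44 W saved
```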

7. Current Power Reduction Techniques

Low power consumption is one of the most important design constraints for modern computer systems. One promising technique for lowering the power consumption of microprocessors has been described in the previous section in the context of reducing static power. Because modern systems run varying workloads, there may be applications which require high performance at any cost. Our approach in this section is the presentation and analysis of a number of additional techniques to attain power reduction.


Firstly, we present techniques which are characterized by the variety of circuit usage within and across applications. These are called circuit techniques, and here we examine Clock Gating and the Input Vector Determination technique. Secondly, we present some architectural techniques. In particular, we introduce a hardware mechanism named Pipeline Gating, which controls runaway speculation in the pipeline. It must be noted that no techniques considered to be architectural have been published for static power.

7.1. Clock Gating

The sequential circuits in a system are considered major contributors to the power dissipation, since one input of sequential circuits is the clock, which is the only signal that switches all the time. In addition, the clock signal tends to be highly loaded. To distribute the clock and control the clock skew, one needs to construct a clock network (often a clock tree) with clock buffers. In general, the power consumption of a clock network is contributed by three factors: modules, clock edges, and control signals. All of this adds to the capacitance of the clock net. Recent studies indicate that the clock signals in digital computers consume a large percentage (15%-45%) of the system power. Thus, the circuit power can be greatly reduced by reducing the clock power dissipation. A practical and effective way to do this is to partition the clock network and allow only those portions that are needed on each cycle to toggle. This is achieved through clock gating.

Clock gating is a logic design method in which the clock is disabled to parts of the circuit during periods when they are not required to execute. These parts are said to be in standby mode, also called sleep mode or idle mode. The power supplies to these parts are not turned off, because of the performance and noise penalties that would result if this were done. Thus, whenever a circuit is put in standby, the latches or flip-flops inside it maintain the last state they were in. As a result, the circuit dissipates leakage power during standby corresponding directly to the logic state in which it was left. Clock gating is implemented by qualifying the different clock partitions with special "enable" signals. It is well suited for CPUs, since it can often be easily integrated into existing clock networks, as illustrated in the figure below. A regular clock buffer can be changed into a qualifying gate at low area and performance overhead.
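Behaviorally, a gated clock is the logical AND of the clock with an enable signal, and the clock power delivered to a unit scales with the fraction of cycles on which the enable is asserted. A sketch of this model (the workload pattern is hypothetical):

```python
def delivered_clock_edges(n_cycles, enable):
    """Count cycles on which the gated clock actually toggles: the
    qualifying gate passes the clock only while enable(cycle) is true."""
    return sum(1 for cycle in range(n_cycles) if enable(cycle))

# Hypothetical unit that is busy one cycle in four: it sees 75% fewer
# clock edges, and its clock power scales down roughly in proportion.
edges = delivered_clock_edges(1000, lambda c: c % 4 == 0)   # 250
```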


Power Management 17 Figure 3. Clock Gating and Clock However, clock gating cannot be used indiscriminately since there are some issues that need to be considered. An important concern is that the disabled block may not power up in time, or that modified clocks may generate glitches. This imposes strict timing constraints on the enable signals and calls for careful timing verification. Functional validation is becoming a greater challenge with each new CPU generation, and gated clocks make the problem even more difficult. In addition to this, at clock frequencies of 100MHz and above, clock skew (= the difference in clock arrival time between two poins) becomes critical and every extra gate used to qualify the clock can potentially introduce timing critical skews. Thus, the granularity at which clock gating can be applied becomes a tradeoff against overall clock network design time and complexity. Another concern with clock gating is the impact on current variations when large blocks of logic are switched on and off. A CPU may be at peak current levels for some cycles, when few blocks can be clock gated. But it may rapidly transition to low values of current if something like a stall of pipeline flush causes a number of units to be powered off. This increases the variation in transient power (= quantifier of the variability in power consumption.). Switching between a normal operation mode and a standby mode, in which most of the internal clocks are turned off, also causes the same problem. In many cases switching of the clock causes a lot of unnecessary gate activity. For that reason, circuits are being developed with controllable clocks. This means that


from the master clock other clocks are derived which, based on certain conditions, can be slowed down or stopped completely with respect to the master clock. It is also best to place gated clocks at a high level of the clock tree, not at the individual flip-flops: clock gating at individual flip-flops can introduce unwanted clock skew and complicate the design. Seventy-five to eighty percent of clock power usage is due to routing. By gating the clock network for a large group of flip-flops you reduce the capacitance on the clock tree. You also reduce the internal power in the affected registers and reduce the need for data-recirculation muxes (multiplexers that feed a register's output back to its input so that it holds its value on cycles when no new data is loaded). The amount of power reduction will depend on the number of registers that are gated and the percentage of time that the gated clock is enabled. Clock gating is widely used because:

  • It is conceptually simple and has small overhead in terms of additional circuits (the clock can be restarted simply by deasserting the clock-freezing signal).
  • It often has zero performance overhead, because the component can transition from an idle to an active state in one (or a few) cycles.

Consequently we would say that:

  • Clock gating is one of the most effective techniques used to reduce power: whenever a circuit is idle, it can be powered down with a clock-gating feature.

  • A gated clock requires an enable signal to enable/disable the clock.
  • Don't use clock gating unless you thoroughly understand the proper way to implement it and the consequences for testing and verification.
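The dependence of the savings on the gated fraction of the clock tree and on the enable duty cycle can be put into a back-of-the-envelope estimate; the function and the numbers below are illustrative assumptions, not measurements from the text:

```python
# Rough estimate of clock power with gating, relative to the ungated
# power p_clock: ungated portions toggle every cycle, gated portions
# toggle only while their enable is asserted. All inputs are assumed.

def gated_clock_power(p_clock, gated_fraction, enable_duty):
    """Relative clock power when `gated_fraction` of the clock load is
    gated and its enable is asserted `enable_duty` of the time."""
    ungated = p_clock * (1.0 - gated_fraction)       # always toggling
    gated = p_clock * gated_fraction * enable_duty   # toggles only when enabled
    return ungated + gated

# Example: gate 60% of the clock load with a 30% enable duty cycle.
p = gated_clock_power(p_clock=1.0, gated_fraction=0.6, enable_duty=0.3)
print(f"relative clock power: {p:.2f}")   # 0.58, i.e. 42% of clock power saved
```

The model makes the tradeoff in the text concrete: gating more registers, or enabling them less often, both shrink the gated term.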

7.2. Caching Strategies

Recent work has shown that transistor structures can be devised which limit static leakage power by banking the cache and providing "sleep" transistors, which dramatically reduce leakage current by gating off the Vdd. A high hit ratio cache significantly decreases the off-chip memory communications. On the other hand, a cache itself consumes quite a lot of power and chip area, as shown in the figure below. Caches account for a large (if not dominant) component of leakage energy dissipation in recent designs, and will continue to do so in the future. Recent energy estimates for 0.13µm processes indicate that leakage energy accounts for 30% of L1 cache energy and as much as 80% of L2 cache energy. For embedded processors it is usual to provide on-chip RAM and ROM, thus limiting the need for a cache. At least two types of caches are commonly present in current microprocessors: one for instructions and one for data. Some processors include special caches. The instruction set has a great influence on cache size, since it influences the compactness of the code. The Hobbit microprocessor, for instance, includes a 3KB, 3-way set-associative on-chip cache with a hit rate roughly equivalent to that of an 8KB direct-mapped RISC cache, owing to the high code density of the Hobbit instruction set.

Figure 4

Using a simulator, it was shown for a two-level cache hierarchy that special buffers alongside the first-level cache can reduce the power consumed by the memory subsystem by as much as 13% for certain data cache configurations and as much as 23% for certain instruction cache configurations. These buffers are small caches, of 8 to 16 entries, located between the first- and second-level caches, which may be used to hold specific data (e.g. non-temporal or speculative data) or general data (e.g. "victim" data, the data which was recently evicted from the level-one cache). It is critical to notice the importance of a simulator and a good knowledge of the instruction set when designing the cache architecture. For example, in the case of our cache decay policy we are trying to determine when to turn a cache line off. The longer we wait, the higher the leakage energy

dissipated. On the other hand, if we prematurely turn off a line that may still have hits, then we inject extra misses, which incur dynamic power for L2 cache accesses. Competitive algorithms point us towards a solution: we could leave each cache line turned on until the static energy it has dissipated since its last access is precisely equal to the dynamic energy that would be dissipated if turning the line off induced an extra miss. With such a policy, we could guarantee that the energy used would be within a factor of two of that used by the optimal offline policy. A novel instruction cache design, the Dynamically Resizable instruction cache (DRI i-cache), dynamically resizes itself to the size required at any point during application execution and virtually turns off the supply voltage to the cache's unused sections to eliminate leakage. At the architectural level, a DRI i-cache relies on simple techniques to exploit variability in i-cache usage and reduce the i-cache size dynamically to capture the application's primary instruction working set; unused sections would otherwise continue to leak current and dissipate energy. A DRI i-cache's novelty is that it dynamically estimates and adapts to the required i-cache size, and uses a novel circuit-level technique, gated-Vdd, to turn off the supply voltage to the cache's unused SRAM cells. In this section, we describe the anatomy of a DRI i-cache. In the next section, we present the circuit technique to gate a memory cell's supply voltage. Because a DRI i-cache's miss rate impacts both energy and performance, the cache uses its key parameters to achieve tight control over its miss rate. There are two sources of increase in the miss rate when resizing:

  • First, resizing may require remapping of data into the cache and incur a large number of (compulsory) misses at the beginning of a sense-interval (a DRI i-cache divides an application's execution time into fixed-length intervals, the sense-intervals, measured in the number of dynamic instructions).

  • Second, downsizing may be suboptimal and result in a significant increase in miss rate when the required cache size is slightly below a given size. The impact on the miss rate is highest at small cache sizes, when the cache begins to thrash. A DRI i-cache uses the size-bound to guarantee a minimum size, preventing the cache from thrashing.
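The competitive decay policy described above (keep a line powered until the leakage it has accumulated since its last access equals the dynamic energy of the extra L2 access that a premature turn-off could induce) can be sketched as follows; the energy constants are assumed values, not figures from the text:

```python
# Sketch of the competitive cache-decay policy: a line decays (its Vdd
# is gated off) once its idle-time leakage equals one L2-miss energy,
# which bounds total energy within 2x of the optimal offline policy.

LEAK_PER_CYCLE = 1.0      # static energy a live line leaks per cycle (assumed)
L2_MISS_ENERGY = 100.0    # dynamic energy of one extra L2 access (assumed)

def decay_interval():
    """Idle cycles to wait before turning a line off (break-even point)."""
    return int(L2_MISS_ENERGY / LEAK_PER_CYCLE)

class DecayLine:
    def __init__(self):
        self.idle = 0
        self.powered = True

    def tick(self):
        if self.powered:
            self.idle += 1
            if self.idle >= decay_interval():   # leakage == one miss energy
                self.powered = False            # gate off Vdd for this line

    def access(self):
        extra_miss = not self.powered           # hitting a decayed line costs a miss
        self.powered, self.idle = True, 0       # re-power and reset the idle timer
        return extra_miss

line = DecayLine()
for _ in range(decay_interval()):
    line.tick()
print(line.powered)      # False: the line decayed after 100 idle cycles
print(line.access())     # True: the next access pays one extra L2 miss
```

Waiting exactly until break-even is what yields the factor-of-two guarantee: whichever way the future goes, the policy never spends more than twice what the omniscient policy would.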

7.3. Input Vector Determination Technique

Accurate power estimation is essential for low-power digital CMOS circuit design. Power dissipation is input-pattern dependent, and leakage current also depends on the input vector combination. To obtain an accurate power estimate, a large input vector set must be used, which leads to very long simulation times. One solution is to generate a compact vector set that is representative of the original input vector set and can be simulated in a reasonable time. Both the input and the output vectors are sampled through register arrays. The simulation is performed in an iterative fashion: in each iteration a vector sequence of fixed length, called a sample, is simulated. The simulation results are monitored to calculate the mean value and variance of the samples, and the iteration terminates when some stopping criterion is met. An algorithm to find such a vector can be built on a process of random sampling: randomly chosen vectors are applied to the circuit, the leakage due to each is monitored, and the vector which gives the least observed leakage value is reported. Clearly, the number of vectors applied determines the quality of the resulting solution.
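The random-sampling search just described can be sketched as follows; the `leakage` cost model here is a stand-in for a circuit-level simulation, not a real leakage estimator:

```python
# Sketch of input-vector determination by random sampling: apply randomly
# chosen input vectors and keep the one with the least observed leakage.
# The cost model below is an assumed stand-in for a SPICE-level simulation.

import random

def leakage(vector):
    # Stand-in cost model: pretend leakage grows with the number of 1s.
    return sum(vector)

def min_leakage_vector(n_inputs, n_samples, rng=None):
    rng = rng or random.Random(0)       # fixed seed for reproducibility
    best_vec, best_leak = None, float("inf")
    for _ in range(n_samples):          # solution quality grows with samples
        vec = [rng.randint(0, 1) for _ in range(n_inputs)]
        lk = leakage(vec)
        if lk < best_leak:
            best_vec, best_leak = vec, lk
    return best_vec, best_leak

vec, leak = min_leakage_vector(n_inputs=16, n_samples=500)
print(vec, leak)
```

As the text notes, the number of sampled vectors is the knob: more samples give a better standby vector at the cost of longer simulation time.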

7.4. Pipeline Gating

Pipeline gating is an innovative and inexpensive method for power reduction which, unlike previous work that sacrificed flexibility or performance, reduces power in high-performance microprocessors without impacting performance. In particular, it is a hardware mechanism used to control rampant speculation in the pipeline and reduce energy consumption. State-of-the-art microprocessors have a high degree of control complexity and a large amount of area dedicated to structures that are essential for high-performance, speculative, out-of-order execution, such as branch prediction units, branch target buffers, TLBs, instruction decoders, integer and floating-point queues, register renaming tables, and load-store queues. The fetch and decode stages, along with the components necessary to perform dynamic scheduling and out-of-order execution, account for a significant portion of the power budget. Therefore, pipeline activity is a dominant portion of the overall power dissipation for complex microprocessors.

Figure 5. Power Consumption for Pentium Pro chip, broken down by individual processor components

The energy consumed by a processor is a function of the amount of work the processor performs to accomplish a given task. In a non-speculative processor all work performed is necessary. In a speculative, multi-issue, dynamically scheduled processor, a large amount of extra work is performed without realizing any performance benefits. The goal of pipeline gating is to reduce the amount of extra work performed to complete a task without affecting the overall performance of the system, since performance drives the market. The schematic of the processor pipeline shown in the figure below is used to describe pipeline gating.


Figure 6. Pipeline with two fetch and decode cycles, showing the additional hardware required for pipeline gating.

The low-confidence branch counter records the number of unresolved branches that were reported as low confidence. The counter value is compared against a threshold value ("N"), and the processor ceases instruction fetch if there are more than N unresolved low-confidence branches in the pipeline. The sample pipeline above uses two fetch and decode cycles to allow the clock rate to be increased. We assume the fetch stage has a small instruction buffer to allow instruction fetch to run ahead of decode. Branch prediction occurs when instructions are fetched, to reduce the misfetch penalty; the actual instruction type may not be known until the end of decode. Conditional branches are resolved in the execution stage, and the branch prediction logic is updated in the commit stage. Since the processor uses out-of-order execution, instructions may sit in the issue window for many cycles, and there may be several unresolved branches in the processor. For assessing the quality of each branch prediction we use a confidence estimator. A "high confidence" estimate means we believe the branch predictor is likely to be correct; a "low confidence" estimate means we believe the branch predictor has incorrectly predicted the branch. We use these confidence estimates to decide when the processor is likely to be executing instructions that will not commit; once that decision has been reached, we "gate" the pipeline, stalling specific pipeline stages. The decision to gate can occur in the fetch, decode or issue stages. Equally important are the decisions about what to gate and for how long; gating the fetch or decode stages would appear to make the most sense. We use the number of unresolved low-confidence branches to determine when and how long to gate. For example, if the instruction window includes one low-confidence branch, and another low-confidence branch exits the fetch (or, alternatively, decode or issue) stage, gating would be engaged until one or the other low-confidence branch resolves. The figure above illustrates this process for a specific configuration: a counter is incremented whenever decode encounters a low-confidence branch and decremented when a low-confidence branch resolves. If the counter exceeds a threshold, the fetch stage is gated: instructions in the fetch buffer continue to be decoded and issued, but no new instructions are fetched. It has been found that gating typically stalls the processor for a very short duration.
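The counter mechanism of Figure 6 can be sketched as follows; the class and signal names are illustrative, and the call sequence stands in for decode/execute traffic:

```python
# Sketch of pipeline gating: a counter tracks unresolved low-confidence
# branches; fetch is gated while the count exceeds a threshold N, while
# decode and issue keep draining the fetch buffer.

class PipelineGate:
    def __init__(self, threshold):
        self.threshold = threshold   # "N" in Figure 6
        self.low_conf = 0            # unresolved low-confidence branches

    def decode_branch(self, low_confidence):
        if low_confidence:
            self.low_conf += 1       # decode saw a low-confidence branch

    def resolve_branch(self, low_confidence):
        if low_confidence:
            self.low_conf -= 1       # branch resolved in the execute stage

    @property
    def fetch_gated(self):
        # Stall fetch only; buffered instructions still decode and issue.
        return self.low_conf > self.threshold

gate = PipelineGate(threshold=1)
gate.decode_branch(low_confidence=True)
print(gate.fetch_gated)      # False: one unresolved low-confidence branch, N=1
gate.decode_branch(low_confidence=True)
print(gate.fetch_gated)      # True: two unresolved > N, fetch stalls
gate.resolve_branch(low_confidence=True)
print(gate.fetch_gated)      # False: a branch resolved, fetch resumes
```

This matches the behavior described in the text: gating engages only while more than N low-confidence branches are in flight, so stalls are typically short.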


The decision to gate and the actual gating are performed during the first fetch cycle. Experiments have shown that most of the extra work in the pipeline occurs at the fetch and decode stages, so gating at the fetch stage will have the largest impact. The number of unresolved, low-confidence branches is measured at decode. This ensures some "slip" between the fetch and decode stages if an incorrect gating decision is made; it increases the extra work (EW) of the stages beyond fetch, but also reduces the performance loss by providing the issue stage with a few correct-path instructions while the pipeline catches up from an incorrect gating decision. Several confidence estimation methods have been implemented and their performance for pipeline gating compared. There are two important metrics characterizing the performance of confidence estimators used by pipeline gating: the specificity (SPEC) and the predictive value of a negative test (PVN). The specificity is the fraction of all mispredicted branches actually detected by the confidence estimator as being low confidence. The PVN is the probability of a low-confidence branch being incorrectly predicted. A larger SPEC means that more mispredicted branches are marked as "low confidence"; a larger PVN means that a given low-confidence branch is more likely to be mispredicted. A confidence estimator could have perfect specificity by marking all branches as low confidence, but the PVN would then be no more than the branch misprediction rate. In practice, a confidence estimator must balance SPEC vs. PVN to provide a good-quality confidence estimate for many branches. Confidence estimation is a diagnostic test that attempts to classify each branch prediction as having "high confidence", meaning that the branch was likely predicted correctly, or "low confidence", meaning the branch was likely mispredicted. The SPEC and PVN metrics are used to classify the confidence estimators discussed below:

Perfect Confidence Estimation: Although a perfect confidence estimator is

unattainable in practice, we used precise information from the pipeline state to evaluate the potential of pipeline gating, and to determine how much of that potential performance was exploited by other configurations.

Static Confidence Estimation: Static confidence estimation associates a

confidence estimate with each conditional branch instruction. The confidence


is determined by running the program through a branch prediction simulator and recording the branch misprediction rate of individual branch sites.

JRS Confidence Estimation: Jacobsen proposed a confidence estimator

that paralleled the structure of the shared branch predictor. This estimator uses a table of miss distance counters (MDC) to keep track of branch prediction correctness. Each MDC entry is a “saturating resetting counter”. Correctly predicted branches increment the corresponding MDC, while incorrectly predicted branches set the MDC to zero.

Saturating Counters: Most branch predictors use some form of saturating

counters to predict the likely branch outcome. Smith mentioned that it may be possible to use these counters as branch confidence estimators.

  • Distance: It has been found that branch mispredictions are clustered and that this clustering can be used to build an inexpensive confidence estimator. The conditional probability of a misprediction for branches that issue d branches after a mispredicted branch resolves is higher for smaller values of d. Varying the distance d affects the SPEC and PVN: smaller values increase the PVN but reduce the SPEC.
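A sketch of the JRS miss-distance counter and of the SPEC/PVN metrics defined above; the table size, saturation value, confidence threshold, and outcome stream are illustrative assumptions, not parameters from the text:

```python
# Sketch of the JRS estimator: each miss distance counter (MDC) is a
# saturating, resetting counter. Correct predictions increment it, a
# misprediction resets it to zero, and a branch is marked "high
# confidence" only once its counter has saturated.

MAX_MDC = 15          # saturation value (assumed)
THRESHOLD = 15        # counter must saturate for "high confidence" (assumed)

table = [0] * 16      # MDC table, indexed by a hash of the branch PC

def confidence(pc):
    return "high" if table[pc % len(table)] >= THRESHOLD else "low"

def update(pc, predicted_correctly):
    i = pc % len(table)
    table[i] = min(table[i] + 1, MAX_MDC) if predicted_correctly else 0

def spec_pvn(events):
    """events: (confidence_estimate, predicted_correctly) pairs."""
    mispred = [conf for conf, ok in events if not ok]
    low_ok = [ok for conf, ok in events if conf == "low"]
    spec = mispred.count("low") / len(mispred)     # mispredictions caught as low
    pvn = sum(1 for ok in low_ok if not ok) / len(low_ok)  # low-conf actually wrong
    return spec, pvn

# One hot branch: a misprediction early on, then a long correct streak.
events = []
for ok in [True] * 5 + [False] + [True] * 20:
    events.append((confidence(pc=4), ok))
    update(pc=4, predicted_correctly=ok)

print(spec_pvn(events))   # SPEC 1.0: the lone misprediction was flagged low-confidence
```

The run also illustrates the tension described in the text: the estimator only grants "high confidence" after a long correct streak, so its PVN stays low because many correctly predicted branches are also marked "low".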

8. Conclusion

We have presented a variety of microarchitectural techniques that have facilitated the dramatic improvements in microprocessor performance. Current technology trends indicate that the contribution of static power will increase rapidly, reaching one half of total power dissipation within three process generations. Developing power-efficient products will require consideration of static power in the earliest phases of design, including architecture and microarchitecture definition. With the scaling of technology and the need for higher performance and more functionality, power dissipation is becoming a major bottleneck for microprocessor designs. A number of recent research efforts are focusing on power-aware microarchitectural techniques: techniques that facilitate dynamic power versus performance tradeoffs or offer competitive performance with reduced power demands.

Power-aware efforts in this area aim at cutting off power to devices while they are not being used. This is a challenging task, as powering devices on and off requires some time and, hence, can severely impact performance. For example, it is possible


to reduce leakage power in caches with a negligible impact on hit rate and performance, or to use clock-gating methods without causing unnecessary gate activity. Other microarchitectural techniques that trade fast (and leaky) transistors for more, but slower, transistors are also likely to be effective in addressing leakage power. Additional novel microarchitectural techniques will, no doubt, be invented as problems arise and there is a need to address them in an efficient and transparent manner.



Concepts

MOSFET: The newly developed technology makes use of metal-oxide-semiconductor field-effect transistor (MOSFET) threshold voltages to generate inherent identification numbers (IDs) in each chip. The same dedicated circuit is integrated into all the chips at once for batch ID embedding. According to the firm, "MOSFET threshold voltages have minute variations due to differences in the transistors used, and the number of atoms. A circuit with multiple MOSFETs is created on the chip, and the differences in threshold voltage are read as analogue signals. Since each signal has a different value, the digitised data can be used as a randomised chip ID."

Branch prediction: The ability to predict at compile time the likelihood of a

particular branch being taken provides valuable information for several optimisations, including global instruction scheduling, code layout, function inlining, interprocedural register allocation and many high-level optimisations. For most optimisations, branch predictions are merely an extra piece of useful information, but for global instruction scheduling and instruction cache optimisations the accuracy of the branch predictions can make or break the optimisation.



Bibliography

  • A. Moshovos and G. S. Sohi. Micro-Architectural Innovations: Boosting Microprocessor Performance Beyond Semiconductor Technology Scaling.
  • S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation Control for Energy Reduction.
  • J. P. Halter and F. N. Najm. A Gate-Level Leakage Power Reduction Method for Ultra-Low-Power CMOS Circuits. In Proc. IEEE Custom Integrated Circuits Conference, pages 475-478, May 1997.
  • J. A. Butts and G. S. Sohi. A Static Power Model for Architects. In Proc. 33rd Annual International Symposium on Microarchitecture, pages 248-258, Dec. 2000.
  • S.-H. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. N. Vijaykumar. An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches.
  • V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez. Reducing Power in High-Performance Microprocessors.
  • Q. Wu and M. Pedram (Department of Electrical Engineering-Systems, University of Southern California) and X. Wu (Department of Electronic Engineering, Hangzhou University). Clock-Gating and Its Application to Low Power Design of Sequential Circuits.
  • S. Kaxiras (Circuits and Systems Research Lab), Z. Hu, and M. Martonosi (Department of Electrical Engineering). Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power.
  • T. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proceedings of the 32nd International Symposium on Microarchitecture, November 1999, pages 196-207.
  • P. Glaskowsky. Pentium 4 (Partially) Previewed. Microprocessor Report, August 28, 2000.
  • K. Krewell. Quicktake: Willamette Revealed. Microprocessor Report, February 2000, page 19.
  • H. Massalin and C. Pu. Threads and Input/Output in the Synthesis Kernel. In Proceedings of the 12th Symposium on Operating Systems Principles, December 1989, pages 191-201.
  • M. Powell, S. Yang, B. Falsafi, K. Roy, and T. Vijaykumar. Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories. In Proceedings of the International Symposium on Low Power Electronics and Design, July 2000, pages 90-95.
  • K. Roy. Leakage Power Reduction in Low-Voltage CMOS Design. In Proceedings of the IEEE International Conference on Circuits and Systems, 1998, pages 167-173.
  • R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions. IEEE Journal of Solid-State Circuits, 9(5), October 1974, pages 256-268.
  • V. Sundararajan and K. Parhi. Low Power Synthesis of Dual Threshold Voltage CMOS VLSI Circuits. In Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pages 139-144.
  • Q. Wang and S. Vrudhula. Static Power Optimization of Deep Submicron CMOS Circuits for Dual VT Technology. In Proceedings of the International Conference on Computer-Aided Design, 1998, pages 490-496.
  • B. R. Childers, H. Tang, and R. Melhem. Adapting Processor Supply Voltage to Instruction-Level Parallelism. Koolchips Workshop, during MICRO-33, Monterey, California, December 2001.



TABLE OF CONTENTS

1. Introduction
2. Architecture Importance
3. Power Consumption
4. Dynamic Power Management (DPM)
   4.1. Online Algorithms for DPM
        4.1.1. Deterministic Algorithm
        4.1.2. Probability-based Algorithm
5. Static Power Dissipation
6. Different Ways For Lower Static Power (Reducing Static Power)
   6.1. Reducing the Supply Voltage
   6.2. Reducing the Number of Devices
   6.3. Using More Static Power Efficient Circuits
   6.4. Using Multiple Threshold Voltages
   6.5. Power Reduction with Speculation
7. Current Power Reduction Techniques
   7.1. Clock Gating
   7.2. Caching Strategies
   7.3. Input Vector Determination Technique
   7.4. Pipeline Gating
8. Conclusion
Concepts
Bibliography