Designing for Low Power



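SLIDE 1

Advanced Digital IC-Design

Designing for Low Power

Content: Reduce Power at all Levels

All abstraction levels are important

Large savings on system level; large savings on technology level

Abstraction levels: System & Algorithm, Architecture & Arithmetic, Logic, Circuit, Device

[Figure: examples at the different levels: an Alamouti encoder fed by an information source, $[u_1 \; u_2] \rightarrow \begin{bmatrix} c_1 & c_2 \\ -c_2^* & c_1^* \end{bmatrix}$ (system and algorithm), ADD/SUB units (architecture and arithmetic), VDD (circuit and device)]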

Change in Digital Research

Traditional focus on IC-Design
  • Delay reduction for high speed
  • Area reduction

Recent Goal
  • Add more functions
  • Low power to reduce heat: heat sinks and fans are expensive
  • Mobile computing: low power to increase the time between charging

[Figure: new design space spanned by speed, area, complexity, power and flexibility]

Why is Power Important?

Impact on product lifetime and mean time between failures

Chip failure rate doubles for every 10-20°C increase in temperature

SLIDE 2

Power Consumption

[Figure: MIPS/mW on a logarithmic axis from 0.01 to 100 for Pentium, StrongARM, TI-DSP and ASIC; the spread is 4 orders of magnitude]

Source: R. Brodersen, Berkeley

Power Dissipation

Two measures are important
  • Peak power (sets wire dimensions)
  • Average power (battery and cooling)

$P_{peak} = i_{DD,max} \times V_{DD}$

$P_{av} = \frac{V_{DD}}{T} \int_0^T i_{DD}(t) \, dt$

Dynamic Power Consumption

Energy charged in a capacitor

$E_C = C V^2 / 2 = C_L V_{DD}^2 / 2$

Energy $E_C$ is also discharged, i.e.

$E_{tot} = C_L V_{DD}^2$

Power consumption

$P = f C_L V_{DD}^2$

[Figure: load capacitance $C_L$ charged from $V_{DD}$ and then discharged]

Static Power Consumption

$I_{leakage}$ increases with decreasing $V_T$

$P_{stat} = I_{leakage} \times V_{DD}$

[Figure: leakage current from $V_{DD}$ through a transistor that is off; the leakage is sub-threshold in this case]

SLIDE 3

CMOS Power Consumption

$P_{tot} = P_{dyn} + P_{stat} = \alpha f C_L V_{DD}^2 + I_{leakage} V_{DD}$

$\alpha$ = probability for switching
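The total power expression can be evaluated directly. The sketch below is only an illustration: the function name `cmos_power` and the operating point are hypothetical, and the leakage current is treated as independent of VDD, which is a simplification.

```python
def cmos_power(alpha, f_hz, c_load_f, vdd_v, i_leak_a):
    """Return (dynamic, static, total) power in watts.

    dynamic = alpha * f * C_L * VDD^2, static = I_leakage * VDD
    """
    p_dyn = alpha * f_hz * c_load_f * vdd_v ** 2
    p_stat = i_leak_a * vdd_v
    return p_dyn, p_stat, p_dyn + p_stat


# Hypothetical operating point, chosen only for illustration.
p_dyn, p_stat, p_tot = cmos_power(alpha=0.1, f_hz=100e6, c_load_f=1e-9,
                                  vdd_v=1.2, i_leak_a=1e-3)
print(f"dynamic {p_dyn * 1e3:.2f} mW, static {p_stat * 1e3:.2f} mW")

# Halving VDD quarters the dynamic term but only halves the static term
# (with the leakage current held constant, an assumption of this sketch).
p_dyn_half, p_stat_half, _ = cmos_power(0.1, 100e6, 1e-9, 0.6, 1e-3)
print(p_dyn_half / p_dyn, p_stat_half / p_stat)
```

The quadratic dependence on VDD is the reason why the later slides focus on lowering the supply voltage.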

An Example: Dynamic vs. Static

In the hearing aid area, devices in the mW range are produced

About 25% of the power is static power consumption in a 130 nm low power technology

In the 90 nm node, the static power will exceed the dynamic power, and in the 65 nm node the static power will dominate totally

Source: Oticon

Optimum VDD vs. VT

$P_{Switching} = a \cdot f_{CLK} \cdot C_L \cdot V_{DD}^2$

$P_{Leakage} = I_{off} \cdot V_{DD}$

[Figure: power P versus supply voltage VDD (V) and VT at constant $f_{CLK}$, showing $P_{Switching}$, $P_{Leakage}$ and $P_{TOTAL}$]

Performance Defines the Operating Point

Flat Minimum

Operating Point moved towards Pswitch

[Figure: the same curves; $P_{TOTAL}$ has a flat minimum and the operating point is moved towards $P_{Switching}$]

SLIDE 4

Reduce Power at all Levels

  • System: Partitioning, Power-down
  • Algorithm: Complexity, Bit-optimization
  • Architecture: Parallelism, Pipelining
  • Circuit/Logic: Sizing, Logic Styles, Logic Design
  • Technology: Threshold Voltage

Large Savings on top and bottom

Off-Chip Connections have High Capacitive Load

System-on-Chip

  • System integration will reduce off-chip data transfer
  • Ideally a single chip solution
  • Reduced power consumption

Partitioning for Low Power

Traditional design (processor core and main memory)
  • High flexibility
  • Low throughput
  • High power (in both processor and memory)

Partitioned design (processor core, core memory, accelerator and ASIC structure with distributed memories)
  • Partly flexible
  • Lower power
  • Dedicated hardware (less overhead): higher throughput and lower power
  • Local busses and less data transfer: lower power
  • Smaller memories, lower power

Clock Gating and Power Down

[Figure: modules A, B and C share the clock CLK; each module has its own enable signal (Enable A, Enable B, Enable C)]

Only active modules should be clocked!

Clock gating is no solution for leakage!

SLIDE 5

Clock Gating or Input Disabling

Combinational logic can be selectively turned off by the input disabling logic (IL), which consists of transparent latches with an enable signal EN. When units are executing useful calculations, EN makes the latches transparent, thus permitting normal operations. If there is no useful calculation, the latches retain their previous state and no transitions propagate through the inactive units. This method is called guarded evaluation, for which both a theoretical framework and algorithms that automatically decide when logic units performing useless calculations should be shut down are provided. Compared with the clock-gating technique, this method is less power effective because the power in the clock line is not saved. However, when two functions share the same register but never work simultaneously, the register should remain active and the clock-gating methodology cannot be exploited. In this case, by disabling the inputs to either of the two functions, it is still possible to reduce the power. Also, an input-disabling strategy is safer than clock gating in terms of timing issues. Note that neither method can avoid leakage power, as leakage does not depend on signal transitions.

Power Modes

  • Active mode
  • Idle mode, e.g. ready with task or clock gated
  • Sleep or standby mode for quick recovery. The memory content is often kept
  • Power Down or Hibernate mode for maximum power savings

Sleep Transistors to Control Leakage

Modules are turned off in standby or when not needed

Built on stacking effect

[Figure: modules A, B and C connected to VDD; sleep transistors controlled by "Turn off B" and "Turn off C" disconnect modules B and C]

Sleep Control: Example

Pacemaker in a Noisy Environment

[Figure: input x(n) and thresholded output T(x(n)) versus time in ms, 0 to 10000]

Low noise case: Advanced filtering not needed

Approach: Flexible filtering (dual mode)

SLIDE 6

Flexible Wavelet Based Filter Structure

Dual mode filtering

Normal mode (most of the time):
  • Patient is not subjected to noise sources (e.g. resting)
  • Low complexity filtering

Alert mode:
  • Patient is subjected to noise sources (e.g. physically active)

[Figure: block diagram with the input from the pacemaker, noise detector, wavelet filter, threshold function, generalized likelihood ratio test and pulse generator; two blocks are marked "Add-On"]

A Dual-Mode Detector in UMC 0.13 μm

Core power consumption reduced by 68 % (simulated) at nominal VDD = 1.2 V

[Figure: power in nW (0 to 120) in alert and sleep modes, split into dynamic, leakage, short-circuit and total]

Turned off in normal mode (sleep transistors)

Leakage is dominating

Example: ATMEL PicoPower Processor

  • 340 uA in Active mode
  • 150 uA in Idle mode
  • 0.65 uA in Sleep mode
  • 0.1 uA in Power Down mode

Intelligent Energy Management

Run task as slow as possible by lowering VDD

Saves Energy

[Figure: two plots of performance versus time]

Source: ARM

SLIDE 7

Dynamic Voltage Scaling

Decreased VDD
  • Quadratic reduction of power
  • Linear increase of delay

Goal
  • Just in time processing

Low-Energy SoC Design Platform

  • Pre-characterized for worst case conditions
  • Frequency-voltage relationship is stored in a LUT
  • Adapts to environmental and process variations

Example: Dynamic Voltage Scaling

Scenario:
  • A system needs to run $10^9$ cycles within 25 seconds
  • Minimize the power consumption

VDD [V]                 5.0   4.0   2.5
Energy per cycle [nJ]   40    25    10
fmax [MHz]              50    40    25

Dynamic Voltage Scaling

[Figures: $V_{DD}^2$ in V² (levels 2.5², 4², 5²) versus time t in s (0 to 25) for three schedules]

  • 50 MHz only: Ea = 40 [J]
  • 25 MHz and 50 MHz combined: Eb = 32.5 [J]
  • 40 MHz only: Ec = 25 [J]
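The three energies follow from the table by multiplying cycles with energy per cycle. The sketch below reproduces them; the split of 0.75 × 10^9 cycles at 50 MHz for schedule (b) is not stated on the slide but is inferred here from the requirement that the 25 s deadline is met exactly. The names `POINTS` and `run` are illustrative only.

```python
# Operating points from the table: VDD [V] -> (energy per cycle [nJ], fmax [MHz])
POINTS = {5.0: (40.0, 50.0), 4.0: (25.0, 40.0), 2.5: (10.0, 25.0)}

CYCLES = 1e9      # work to be done
DEADLINE = 25.0   # seconds


def run(schedule):
    """schedule: list of (vdd, cycles). Returns (energy [J], time [s])."""
    energy = 0.0
    time = 0.0
    for vdd, cycles in schedule:
        nj_per_cycle, fmax_mhz = POINTS[vdd]
        energy += cycles * nj_per_cycle * 1e-9
        time += cycles / (fmax_mhz * 1e6)
    return energy, time


# (a) everything at 5 V / 50 MHz: finishes after 20 s
ea, ta = run([(5.0, CYCLES)])
# (b) 5 V and 2.5 V combined; the split is inferred from the deadline
fast_cycles = 0.75e9
eb, tb = run([(5.0, fast_cycles), (2.5, CYCLES - fast_cycles)])
# (c) everything at 4 V / 40 MHz: finishes exactly at the deadline
ec, tc = run([(4.0, CYCLES)])

for name, e, t in (("a", ea, ta), ("b", eb, tb), ("c", ec, tc)):
    print(f"schedule {name}: {e:.1f} J in {t:.1f} s")
```

Running at one constant, just-fast-enough voltage (c) costs less energy than racing and idling (a) or mixing a high and a low voltage (b).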

SLIDE 8

Repeaters to Reduce the Bus Delay

Consider a 20 mm long wire, 1 um wide

R = 2 kΩ, C = 20 pF

Buffer data

Req = 200 Ω, Cout = 0.1 pF, Cin = 0.1 pF

Repeaters: Buffers Neglected

No repeater: $0.69 \times RC = 0.69 \times 2000 \times 20 \times 10^{-12} = 27.6$ ns

One repeater (two segments R/2, C/2): $0.69 \times 2 \times \frac{R}{2} \frac{C}{2} = 0.69 \times 2 \times 1000 \times 10 \times 10^{-12} = 13.8$ ns

Two repeaters (three segments R/3, C/3): $0.69 \times 3 \times \frac{R}{3} \frac{C}{3} = 9.2$ ns

Repeaters: Buffers included

$t_p = 0.69 \times (n+1) \left( (n+1) R_{eq} + \frac{R}{n+1} \right) \left( \frac{C}{n+1} + (n+1)(C_{out} + C_{in}) \right)$

n = number of repeaters

Example: n = 2

$t_p = 0.69 \times 3 \times \left( 3 R_{eq} + \frac{R}{3} \right) \left( \frac{C}{3} + 3 (C_{out} + C_{in}) \right) = 0.69 \times 3 \times \left( 3 \times 200 + \frac{2000}{3} \right) \left( \frac{20}{3} + 3 \times 0.2 \right) \times 10^{-12} = 19.1$ ns

Repeaters: Minimum delay

[Figure: tp (ns) versus number of repeaters (1 to 5) with buffers excluded and buffers included; with buffers included the minimum delay is reached with two repeaters]
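The repeater delays can be tabulated with a few lines of code. This is a sketch that uses the wire and buffer data of the example and the delay expression as it is evaluated for n = 2 on the slide; the function names are illustrative.

```python
R, C = 2000.0, 20e-12                      # wire: 2 kOhm, 20 pF
REQ, COUT, CIN = 200.0, 0.1e-12, 0.1e-12   # buffer data


def delay_wire_only(n):
    """Delay in seconds with n repeaters, buffers neglected."""
    k = n + 1                              # number of wire segments
    return 0.69 * k * (R / k) * (C / k)


def delay_with_buffers(n):
    """Delay in seconds with n repeaters, buffer R and C included."""
    k = n + 1
    return 0.69 * k * (k * REQ + R / k) * (C / k + k * (COUT + CIN))


for n in range(6):
    print(n, round(delay_wire_only(n) * 1e9, 1),
          round(delay_with_buffers(n) * 1e9, 1))

# Number of repeaters that minimizes the delay when buffers are included
best = min(range(6), key=delay_with_buffers)
print("best n:", best)
```

Without buffer parasitics the delay keeps falling as repeaters are added, while with buffers included the added buffer resistance and capacitance eventually dominate, so an optimum appears.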

SLIDE 9

Reduce Power at all Levels

  • System: Partitioning, Power-down
  • Algorithm: Complexity, Bit-optimization
  • Architecture: Parallelism, Pipelining
  • Circuit/Logic: Sizing, Logic Styles, Logic Design
  • Technology: Threshold Voltage …

Large Savings on top and bottom

Algorithm Optimization

Complexity
  • Trade-off bit-error rate and complexity to gain power

Bit Optimization (Word and Coefficient)
  • Power consumption decreases linearly
  • Often quadratic in arrays

Hardware - Algorithm Co-Optimization

Trade-off BER towards Power Consumption

Example

| | Φ − = Λ σ γ


(Correlation and energy estimation in OFDM Synchronization)

Saves 50% of the operations but costs much less than a dB under normal conditions

Bit Optimization

Example: 8k point FFT

25% of the memory is saved

Lower power

[Figure: wordlength in bits per stage of the pipelined FFT; Stage 12, Stage 11, …, Stage 1 with FIFOs of length 4096, 2048, …, 2, 1]

SLIDE 10

Efficient Algorithms

Inefficient

    for (k=0; k<=m; k++)
      for (j=0; j<=n; j++)
        p[j][k] = …

Efficient

    for (j=0; j<=n; j++)
      for (k=0; k<=m; k++)
        p[j][k] = …

Reduced cache changes
  • less switching on address bus

Lower computing time
  • ability to run at lower VDD

[Figure: memory access order; the inefficient loop steps through j=0, j=1, j=2, … for each k, the efficient loop steps through k=0, k=1, … for each j]
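The effect of the loop order on cache behaviour can be illustrated with a toy model. The sketch below assumes row-major storage (as in C), a hypothetical 16 × 16 array and a hypothetical cache line of 4 words, and simply counts how often two consecutive accesses fall in different cache lines.

```python
def line_changes(order, cols, line_words):
    """Count consecutive accesses that land in different cache lines."""
    changes = 0
    previous = None
    for j, k in order:
        line = (j * cols + k) // line_words   # row-major address -> line
        if previous is not None and line != previous:
            changes += 1
        previous = line
    return changes


ROWS = COLS = 16      # hypothetical array size
LINE_WORDS = 4        # hypothetical cache line length in words

# Inefficient: k in the outer loop, j in the inner loop
k_outer = [(j, k) for k in range(COLS) for j in range(ROWS)]
# Efficient: j in the outer loop, k in the inner loop
j_outer = [(j, k) for j in range(ROWS) for k in range(COLS)]

print("k outer:", line_changes(k_outer, COLS, LINE_WORDS))
print("j outer:", line_changes(j_outer, COLS, LINE_WORDS))
```

With j in the outer loop the accesses walk through memory sequentially, so a new line is touched only once every four accesses in this model.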

Efficient Algorithms: Loop Unrolling

Traditional loop

    for (j=0; j<=n; j++)
      p[j] = …

Unrolled loop

    for (j=0; j<=n; j+=2) {
      p[j] = …
      p[j+1] = …
    }

Executed in Parallel

Lower total computing time
  • ability to run at lower VDD

Reduce Power on all Levels

  • System: Partitioning, Power-down
  • Algorithm: Complexity, Bit-optimization
  • Architecture: Parallelism, Pipelining
  • Circuit/Logic: Sizing, Logic Styles, Logic Design
  • Technology: Threshold Voltage …

Architecture

Operate at lowest supply voltage at lowest possible speed!

Use architecture optimization to reduce supply voltage and speed, e.g. Parallelism, Pipelining

Trade off: Low Power ⇔ Area

SLIDE 11

Parallel Processing, Pipelining

$P = f C V^2$

C = the total switched capacitance

[Figure: inputs a, b and c are registered at clock frequency f and fed to a compare unit]

Parallel Processing

$P_{par} = 0.5 f \times 2.15 C \times (0.58 V)^2 = 0.36 P$

[Figure: two copies of the registers and the compare unit, each clocked at f/2]

However, increased leakage!

Pipelining

$P_{pipe} = f \times 1.1 C \times (0.58 V)^2 = 0.37 P$

[Figure: the compare unit with additional pipeline registers, clocked at f]

Parallel Processing and Pipelining

$P_{par,pipe} = 0.5 f \times 2.35 C \times (0.4 V)^2 = 0.19 P$

[Figure: two pipelined copies of the datapath, each clocked at f/2]
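The three power ratios follow from scaling f, C and V in $P = f C V^2$. A small sketch that reproduces them, using the scale factors given on the slides (the function name `relative_power` is illustrative):

```python
def relative_power(f_scale, c_scale, v_scale):
    """Power relative to the reference design, P = f * C * V^2."""
    return f_scale * c_scale * v_scale ** 2


p_par = relative_power(0.5, 2.15, 0.58)       # parallel processing
p_pipe = relative_power(1.0, 1.1, 0.58)       # pipelining
p_par_pipe = relative_power(0.5, 2.35, 0.4)   # both combined

print(round(p_par, 2), round(p_pipe, 2), round(p_par_pipe, 2))
```

In all three cases the saving comes from the reduced supply voltage; the capacitance increases, and for the parallel versions the halved clock frequency compensates for the duplicated hardware.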

SLIDE 12

Architectures for Low Power

Bit-Serial

[Figure: magnitude response A [dB] versus f/fs for a 3rd order and a 6th order filter]

[Figure: bit-serial filter structure with coefficients α0, α1 and α2 between IN and OUT; 9 cycle path]

Two Concurrent Samples

Double processing capacity at the same time

[Figure: even and odd samples processed in two branches; 9 cycle path]

Eight Concurrent Samples

[Figure: eight branches; 12 cycle path / 8 samples]

Cost: Silicon Area

Result: Architecture Optimization

$P_8 = 0.11 f \times 8 C \times (0.26 V)^2 = 0.06 P$

Theoretical Gain = 16, Measured Gain = 15

  • P. Åström, P. Nilsson, and M. Torkelson

SLIDE 13

Transformations

Example: Complex Multiplication

$(a + jb)(c + jd)$

$Re = ac - bd = a(c - d) + d(a - b)$

$Im = ad + bc = b(c + d) + d(a - b)$

Coefficients (pre-calculated)

Reduces C but increases VDD

[Figure: data flow from the inputs a, b and the coefficients c, d to Re and Im, before and after the transformation]
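The transformed expressions share the product d(a − b), so three real multiplications are enough instead of four. A minimal sketch to check the identity (the function name `complex_mul_3` is illustrative; c − d and c + d stand for the pre-calculated coefficients):

```python
def complex_mul_3(a, b, c, d):
    """(a + jb)(c + jd) using three real multiplications.

    c - d, c + d and d play the role of the pre-calculated coefficients.
    """
    shared = d * (a - b)            # used by both the real and imaginary part
    re = a * (c - d) + shared       # = ac - bd
    im = b * (c + d) + shared       # = ad + bc
    return re, im


re, im = complex_mul_3(1, 2, 3, 4)
print(re, im)                       # compare with (1 + 2j) * (3 + 4j)
reference = complex(1, 2) * complex(3, 4)
print(reference.real, reference.imag)
```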

Reduce the Data Influence

Time shared architecture: I0, Q0, I1, Q1, I2, Q2 (period T)

Parallel architecture: I0, I1, I2 and Q0, Q1, Q2 (period 2T)

Time shared architecture increases switching activity

[Figure: signal values between −1.0 and 1.0 versus sample index, for the time shared stream (100 samples) and for the separate I and Q streams (50 samples)]

Balancing Operations

Example: Addition

[Figure: the inputs A to H are summed to S in a chain of adders and in a tree of adders]

Normalized # of Transitions

Inputs   Chain   Tree
4        1.45    1
8        2.5     1

Reduced logic depth gives reduced switching activity

SLIDE 14

FIR Filter

[Figure: FIR filter from xn to yn with a chain of adders and with a tree adder]

Tree adder gives shorter critical path
  • Lower VDD

Multiple Voltages

Example: Scheduling

Resources: one adder and two multipliers

[Figure: data flow graph with inputs a to f and output Y, scheduled on the multipliers M1 and M2 and the adder A over clock cycles 1 to 5; one module is marked as a low voltage module]

Advanced Digital IC-Design

Designing for Low Power Cont.

Arithmetic & Power

Parallel arithmetic

[Figure: n full adders (FA) with inputs a0, b0 … an-1, bn-1 and sums s0 … sn-1]

Power consumption

$P_{bit-par} = f \times n \cdot C_{unit} \times V_{DD}^2 + n \cdot I_{unit-leak} \times V_{DD}$

n = Number of adder cells

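SLIDE 15

Arithmetic & Power

Serial arithmetic

[Figure: one full adder (FA) with a carry register c clocked by clk; the bits a0, a1 … an-1 and b0, b1 … bn-1 are fed in serially and the sum bits s0, s1 … sn-1 come out serially]

Power consumption

$P_{bit-ser} = n \cdot f \times C_{unit} \times V_{DD}^2 + I_{unit-leak} \times V_{DD}$

n = Number of clock cycles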

Dynamic Power Consumption

$P_{bit-par} = f \times n \cdot C_{unit} \times V_{DD}^2 + n \cdot I_{unit-leak} \times V_{DD}$

In bit-parallel, n capacitances are switched once

$P_{bit-ser} = n \cdot f \times C_{unit} \times V_{DD}^2 + I_{unit-leak} \times V_{DD}$

In bit-serial, 1 capacitance is switched n times

That is, the same dynamic power consumption

Static Power Consumption

$P_{bit-par} = f \times n \cdot C_{unit} \times V_{DD}^2 + n \cdot I_{unit-leak} \times V_{DD}$

In bit-parallel, n units are leaking

$P_{bit-ser} = n \cdot f \times C_{unit} \times V_{DD}^2 + I_{unit-leak} \times V_{DD}$

In bit-serial, 1 unit is leaking

That is, bit-parallel leaks n times more!

Example: A Third Order Half-Band Filter

What is the relative static power reduction?

[Figure: filter from xn to yn with three registers R and the coefficient 0.5]

  • Three registers
  • Four adders
  • One multiplier (a shift)

SLIDE 16

A Third Order Half-Band Filter

Number of cells   Parallel design   Serial design
Register          42                46
Adder             56                4

12-bit wordlength is assumed (14-bit internal to avoid overflow)

About the same number of register leaf-cells

Only 4 adder leaf-cells in the serial design

Low Power (LP) Cell Library

Number of trans.   Parallel design   Serial design
Register           672               736
Adder              1568              112
Total              2240              848

Transistor count according to: Rabaey et al., "Digital Integrated Circuits"

Minimum sized transistors are used

The static power is proportional to the number of transistors

The power is reduced to 848/2240 = 38 %

High Performance (HP) Cell Library

Total width   Parallel design   Serial design
Register      1142              1251
Adder         5522              394
Total         6664              1646

Transistor sizing according to: Rabaey et al., "Digital Integrated Circuits"

Transistors sized for uniform delay are used

The static power is proportional to the total transistor width (W)

The power is reduced to 1646/6664 = 25 %
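Both static power ratios can be recomputed from the register and adder entries of the LP and HP tables. This is a sketch with illustrative names; note that summing the serial HP entries gives 1645, one less than the total printed on the slide, presumably due to rounding of the individual widths.

```python
# Register and adder figures for the parallel and serial designs.
LP = {"parallel": {"register": 672, "adder": 1568},     # transistor counts
      "serial": {"register": 736, "adder": 112}}
HP = {"parallel": {"register": 1142, "adder": 5522},    # total widths
      "serial": {"register": 1251, "adder": 394}}


def static_ratio(library):
    """Static power of the serial design relative to the parallel design."""
    return sum(library["serial"].values()) / sum(library["parallel"].values())


print(f"LP library: {static_ratio(LP):.0%}")
print(f"HP library: {static_ratio(HP):.0%}")
```

The saving is larger for the HP library because the adders, which dominate the parallel design, use wider transistors than the registers.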

Other Wordlengths

Reduction down to 33 % using a LP library

Reduction down to 20 % using a HP library

[Figure: power ratio (%) versus wordlength (3 to 30 bits) for the LP library and the HP library]

SLIDE 17

Adder Simulations – 12-bit Case

[Figure: Power 12-bit Additions; P (µW) from 0.01 to 1000 versus throughput (M Words/s) from 0.0005 to 500 for bit-parallel and bit-serial, with the static power levels marked]

Bit-serial shows 1.95 times higher dynamic power

Bit-serial shows 6.6 times lower static power

Adder Simulations – 24-bit Case

[Figure: Power 24-bit Additions; same axes as for the 12-bit case]

Bit-serial shows 1.91 times higher dynamic power

Bit-serial shows 13.4 times lower static power

Discussion – Dynamic Power

The 1.91 times higher dynamic power is mainly due to the extra register in bit-serial

However, the bit-parallel simulation is done without glitching

[Figure: Power 24-bit Additions, and a full adder (FA) with inputs ai, bi, ci and output si]

Discussion – Static Power

A power reduction of n = 24 times was expected

However, the register leaks as well, which gives a 13.4 times reduction

SLIDE 18

Discussion – The Cross-Over (1)

  • Simulations in a typical 130 nm technology
  • Denser technologies will have more and more leakage
  • The crossover will thus move up in throughput

Discussion – The Cross-Over (2)

  • Simulations done with maximum switching activity
  • The switching activity is usually much lower, down to 10% or less
  • The crossover will thus move up in throughput

[Figure: Power 24-bit Additions; P (µW) versus throughput (M Words/s) for bit-parallel and bit-serial]

Reduce Power at all Levels

  • System: Partitioning, Power-down
  • Algorithm: Complexity, Bit-optimization
  • Architecture: Parallelism, Pipelining
  • Circuit/Logic: Sizing, Logic Styles, Logic Design
  • Technology: Threshold Voltage …

Number Representation

Two's Complement vs. Sign Magnitude

Consider a bus where the input toggles between +1 and −1 (i.e., a small noise input)

Sign magnitude will toggle fewer bits!

Code   Signed Magnitude   Two's Complement
000    0                  0
001    1                  1
010    2                  2
011    3                  3
100    −0                 −4
101    −1                 −3
110    −2                 −2
111    −3                 −1
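The number of bus lines that toggle between +1 and −1 can be counted directly. A minimal sketch with illustrative helper names:

```python
def twos_complement(value, bits):
    """Bit pattern of value in two's complement with the given width."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits):
    """Bit pattern of value in sign-magnitude with the given width."""
    sign = (1 << (bits - 1)) if value < 0 else 0
    return sign | abs(value)


def toggles(x, y):
    """Number of bit positions that differ between two patterns."""
    return bin(x ^ y).count("1")


for bits in (3, 16):
    tc = toggles(twos_complement(1, bits), twos_complement(-1, bits))
    sm = toggles(sign_magnitude(1, bits), sign_magnitude(-1, bits))
    print(f"{bits} bits: two's complement {tc} toggles, sign magnitude {sm}")
```

In sign magnitude only the sign bit changes, independent of the wordlength, while in two's complement all bits except the least significant one change.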
SLIDE 19

Logic Style

[Figure: power-delay product (pJ) versus delay (ns) for 8-bit adders in 2 um technology with decreasing VDD, for DCVSL, static CMOS and CPL]

CPL has a low PD product due to small internal nodes

DCVSL suffers from its differential nature

Switching Activity

[Figure: gate with inputs A and B and output X, with its truth table]

If equal input probability

$P_{X=0} = 3/4$

$P_{X=1} = 1/4$

Switching Activity

Transition probability = P(current state) × P(next state)

$P_{0 \to 0} = P_{X=0} \times P_{X=0} = 3/4 \times 3/4 = 9/16$

$P_{0 \to 1} = P_{X=0} \times P_{X=1} = 3/4 \times 1/4 = 3/16$

$P_{1 \to 0} = P_{X=1} \times P_{X=0} = 1/4 \times 3/4 = 3/16$

$P_{1 \to 1} = P_{X=1} \times P_{X=1} = 1/4 \times 1/4 = 1/16$

Transition, i.e. power is consumed

Switching Activity

$P_{transition} = P_{X=0} \times P_{X=1} = (1 - P_{X=1}) \times P_{X=1} = 3/4 \times 1/4 = 3/16$

More complicated when input probability is not evenly distributed, which is the case in internal nodes!

SLIDE 20

Switching Activity

Uneven probabilities

[Figure: gate with inputs A and B and output X, with its truth table]

$P_{X=1} = P_{A=1} P_{B=1}$

$P_{X=0} = (1 - P_{A=1})(1 - P_{B=1}) + (1 - P_{A=1}) P_{B=1} + P_{A=1} (1 - P_{B=1}) = 1 - P_{A=1} P_{B=1}$

$P_{0 \to 1} = P_{X=0} \times P_{X=1} = (1 - P_{A=1} P_{B=1}) \times P_{A=1} P_{B=1}$

Ex: $P_{A=1} = 1/4$; $P_{B=1} = 1/4$; $P_{0 \to 1} = 1/4 \times 1/4 \times (1 - 1/4 \times 1/4) = 15/256 \approx 0.06$

Switching Activity

[Figure: X is formed from A and B, Y from C and D, and Z from X and Y, with the truth table]

$P_{X=1} = P_{Y=1} = 3/4$; $P_{X,0 \to 1} = P_{Y,0 \to 1} = 1/4 \times 3/4 = 3/16$

$P_{Z=1} = P_{X=1} \times P_{Y=1} = 3/4 \times 3/4 = 9/16$

$P_{Z,0 \to 1} = P_{Z=0} \times P_{Z=1} = (1 - 9/16) \times 9/16 = 63/256 \approx 0.25$

Switching Activity

[Figure: X is formed from A and B, Y from B and C, and Z from X and Y, with the truth table]

$P_{X=1} = P_{Y=1} = 3/4$; $P_{X,0 \to 1} = P_{Y,0 \to 1} = 1/4 \times 3/4 = 3/16$

X and Y correlated

$P_{Z=1} = 5/8$

$P_{Z,0 \to 1} = P_{Z=0} \times P_{Z=1} = (1 - 5/8) \times 5/8 = 15/64 \approx 0.23$
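When signals are correlated, the output probability has to be found from the full truth table instead of by multiplying the probabilities of the internal nodes. The sketch below enumerates all input combinations for the two networks. The gate types are an assumption made for this illustration (X and Y as OR gates, Z as an AND gate); they are chosen because they reproduce the output probabilities 9/16 and 5/8 of the two examples.

```python
from fractions import Fraction
from itertools import product


def p_one(func, n_inputs):
    """P(output = 1) for independent inputs that are 1 with probability 1/2."""
    ones = sum(func(*bits) for bits in product((0, 1), repeat=n_inputs))
    return Fraction(ones, 2 ** n_inputs)


def p_rise(p1):
    """Probability of a 0 -> 1 transition: P(output = 0) * P(output = 1)."""
    return (1 - p1) * p1


# Independent X and Y: Z = (A or B) and (C or D)
p_independent = p_one(lambda a, b, c, d: (a | b) & (c | d), 4)
# Correlated X and Y (shared input B): Z = (A or B) and (B or C)
p_correlated = p_one(lambda a, b, c: (a | b) & (b | c), 3)

print(p_independent, p_rise(p_independent))
print(p_correlated, p_rise(p_correlated))
```

Multiplying 3/4 × 3/4 for the correlated network would give 9/16 instead of the correct 5/8, which is why shared inputs need the truth table.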

Switching Activity due to Glitching

[Figure: NOR gates with inputs a, b = 0 and c, internal node x and output z, with the NOR truth table and the waveforms of a, c, x and z]

Extra transition due to race

Dissipates energy

SLIDE 21

[Figure: gate network with input a, internal node x, input b and output z, with the truth table of xz versus ab]

  • Without glitches: $P_{z,0 \to 1} = 3/4 \times 1/4 = 3/16 = 0.1875$
  • Unwanted switch when X and B change but not Z: two cases, each with probability $1/4 \times 1/4 = 1/16$
  • The total probability for switching is thus $3/16 + 2 \times 1/16 = 5/16$

The switching and power increased by 67 %
Switching Activity: α > 1?

[Figure: adder chain Add_i, Add_i+1, Add_i+2, Add_i+3 with carries C_i+1 to C_i+4 and sums S_i to S_i+3]

Transitions due to carry propagation

Multipliers even worse

Glitch Probability: Adder Chain

αglitch can be larger than one

[Figure: adder chain starting at Add0 with carries C1 to C6; αglitch = 0.25, 0.5, 0.75, 1.0, 1.25 along the chain]

Let's look at a simple function: f = ab + bc + cd. This can be factored in two equivalent ways:

f = b(a + c) + cd
f = ab + c(b + d)

With no additional information, these two factorings seem equivalent. But if switching activity information is available, the lowest-power solution can be determined. If b were the high activity signal, then the first factoring would be better since the high activity only propagates through two of the gates. The second factoring would have the high activity on signal b propagated through all four gates (figure 3). This is definitely not an operation to be performed manually, but is an ideal candidate for logic optimization.

See more in PowerManage.pdf
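The equivalence of the two factorings, and the number of gates that the high activity signal b reaches in each, can be checked mechanically. This is a sketch: the gate names g1 to g4 and the netlist dictionaries are hypothetical, assuming one two-input gate per operation.

```python
from itertools import product


def f_first(a, b, c, d):        # f = b(a + c) + cd
    return (b & (a | c)) | (c & d)


def f_second(a, b, c, d):       # f = ab + c(b + d)
    return (a & b) | (c & (b | d))


# Netlists as gate -> inputs, one two-input gate per operation.
FIRST = {"g1": ("a", "c"), "g2": ("b", "g1"),
         "g3": ("c", "d"), "g4": ("g2", "g3")}
SECOND = {"g1": ("a", "b"), "g2": ("b", "d"),
          "g3": ("c", "g2"), "g4": ("g1", "g3")}


def gates_reached(netlist, signal):
    """Set of gates whose output can be affected by the given signal."""
    reached = set()
    changed = True
    while changed:
        changed = False
        for gate, inputs in netlist.items():
            if gate not in reached and any(
                    i == signal or i in reached for i in inputs):
                reached.add(gate)
                changed = True
    return reached


equivalent = all(f_first(*v) == f_second(*v)
                 for v in product((0, 1), repeat=4))
print(equivalent, len(gates_reached(FIRST, "b")),
      len(gates_reached(SECOND, "b")))
```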

SLIDE 22

Long Transistors to Reduce Leakage

  • All paths with timing slack use long transistors
  • Long transistors are 10% slower, but have 3 times lower leakage

Long transistor = Normal + 10 %

[Figure: a normal transistor and a long transistor]

Source: Intel

Stack Forcing to Reduce Leakage

  • Same input load
  • Low (standby) leakage at input logic 0

[Figure: a single transistor of width W compared with a stack of two transistors of width W/2]

Source: Narendra

Reduce Power on all Levels

  • System: Partitioning, Power-down
  • Algorithm: Complexity, Bit-optimization
  • Architecture: Parallelism, Pipelining
  • Circuit/Logic: Sizing, Logic Styles, Logic Design
  • Technology: Threshold Voltage …

SLIDE 23

Advanced Digital IC-Design

Leakage Tolerant Circuits

VDD and VT Scaling

Scaling VDD
  • Lower power
  • Larger delay

Scaling VT
  • Lower delay
  • Larger leakage

[Figure: VDD, VT and the gate overdrive VDD − VT in V (0 to 5) versus technology (1.4, 1.0, 0.8, 0.6, 0.35, 0.25, 0.18 μm)]

Voltage scaling thus reduces active power, but increases the standby or leakage power

VT Scaling: VT and ID Trade-off

$I_D \sim (V_{GS} - V_T)^2$

Scaling down VT
  • Higher ID
  • Increased performance

[Figure: ID versus VGS for VT and for a lower VT; the lower VT gives a higher ID]

SLIDE 24

VT Scaling: VT and Ioff Trade-off

Sub-threshold current

$I_{off} \propto I_{subth} \propto e^{(V_{GS} - V_T)/v_{th}}$

Scaling down VT
  • Higher Ioff
  • More static power

[Figure: ID versus VGS for VT and for a lower VT; the lower VT gives a higher Ioff]

Threshold Voltage Variations

$I_{on} \propto (V_{GS} - V_T)^2$

$I_{off} \propto e^{q (V_{GS} - V_T) / (nKT)}$

Off current is sensitive to threshold variations

Source: Schroder

Device Variability – a big Problem

  • Threshold voltage variations in 90 nm
  • Leakage changes exponentially with the threshold
  • The problem increases with denser technologies

SLIDE 25

Device Variability – a big Problem

  • IOFF spread > 100X, ION spread > 2X
  • Device parameters are thus hard to determine

[Figure: normalized ION (0.4 to 1.4) versus normalized IOFF (0.01 to 100) for NMOS and PMOS at 150 nm, 110°C; the spread is 2X in ION and 100X in IOFF]

Threshold and Length Variations

Length variations have less influence than threshold variations

Sub-threshold Leakage

Sub-threshold leakage current will increase exponentially with temperature

[Figure: Ioff (nA/um) from 1 to 10000 versus temperature (30 to 130 °C) in a 0.25 um technology]

Low-VDD Low-VT Design

Leakage control techniques
  • Stacked CMOS
  • Dual-threshold CMOS

SLIDE 26

Leakage Control

Mostly used in standby

Exploit input dependence
  • turn off stacks of transistors
  • intrinsic self-reverse biasing

Dual/Multiple VT useful
  • High VT: suppress sub-threshold leakage
  • Low VT: achieve high performance

Leakage: Example

Model: $I_{off} \approx I_{subth} \propto K \times e^{(V_{GS} - V_T)/v_{th}}$

Single off transistor: VDS = 1.5 V, VGS = 0 V

$V_{GS} = 0 \Rightarrow I_{off} = 10$ nA

Leakage: Example

Two stacked off transistors: the intermediate node is at 0.056 V, so the lower transistor has VDS = 0.056 V and VGS = 0 V, and the upper transistor has VDS = 1.444 V and VGS = −0.056 V (reverse biased)

$V_{GS} = -0.056 \Rightarrow I_{off} = 1$ nA, i.e. 10%
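The two data points of the example fix the slope of the exponential model. The sketch below derives that slope (a value that is not given on the slide) and evaluates the model for a few gate-source voltages; the names are illustrative.

```python
import math

I0_NA = 10.0            # I_off in nA at V_GS = 0 V (from the example)
V_REVERSE = -0.056      # V_GS of the reverse biased transistor in V

# Slope voltage implied by "10 nA -> 1 nA at V_GS = -56 mV";
# derived here from the two data points, not stated on the slide.
V_SLOPE = -V_REVERSE / math.log(10.0)


def i_off_na(v_gs):
    """Sub-threshold leakage in nA for the exponential model."""
    return I0_NA * math.exp(v_gs / V_SLOPE)


for v_gs in (0.0, -0.056, -0.112):
    print(f"V_GS = {v_gs:+.3f} V -> I_off = {i_off_na(v_gs):.2f} nA")
```

In this model every additional 56 mV of reverse bias reduces the leakage by another factor of ten.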

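"Stacking Effect"

Most of the voltage is over the upper transistor

[Figure: stack of four off transistors between 1.5 V and 0 V with all gates at 0 V; VDS = 1.411 V, 0.055 V, 0.020 V and 0.014 V from top to bottom, with the internal nodes at 0.089 V, 0.034 V and 0.014 V]

Source: Kauchic Roy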

SLIDE 27

Input dependence of leakage

Consider 4 input NAND (VDD = 1.5 V, VT = 0.25 V)

Inputs at 0 V   Inputs at 1.5 V   IDDQ [nA]
1               3                 9.96
2               2                 1.71
3               1                 0.98
4               0                 0.72

SLIDE 28

Leakage Control

Static power is reduced to 7% if we use leakage control in standby

[Figure: 4 input NAND with one input at 0 V (IDDQ = 9.96 nA) and with all four inputs at 0 V (IDDQ = 0.72 nA)]

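Input Vector Control – Gate Leakage

[Figure: stack of the transistors M1 and M2 for the input vectors '00' and '10'. For '00' the intermediate node is at VM > 0 and the gate currents are Igdo_M1(Vdd), Igso_M1(VM) and Igdo_M2(VM). For '10' the intermediate node is at VINT = Vdd − Vth_M1 and the gate currents are Igc_M1(Vth), Igso_M1(Vth_M1) and Igdo_M2(VINT)]

Gate current with '10' is lower than '00'

Ig(Vdd) > Ig(Vdd − Vth_M1)

Rate of change of gate current increases with an increase in Vox (exponential)

Source: K Roy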

Input Vector Control – Total Leakage

Leakage difference between '10' and '00':

$\Delta I_{leakage} = I_{'10'} - I_{'00'} = (I_{sub-10} - I_{sub-00}) + (I_{gdo2-10} - I_{gdo1-00}) + (I_{btbt-10} - I_{btbt-00})$

$I_{sub-10} > I_{sub-00}$

$I_{gdo2-10} < I_{gdo1-00}$

$I_{btbt-10} \geq I_{btbt-00}$

Dominant leakage   Best vector
Subthreshold       '00'
Gate leakage       '10'

Source: K Roy

Leakage vs. # Transistors off

[Figure: leakage [nA] (0 to 10) versus number of transistors off in stack (1 to 4)]

SLIDE 29

Model when stacking transistors

In a transistor stack
  • ignore transistors which are ON
  • determine VDS of each transistor
  • use the Isubth equation to calculate leakage

$I_{subth} \propto K \times e^{(V_{GS} - V_T)/v_{th}}$

[Figure: stack of four transistors below 1.5 V with the gate voltages VG1 to VG4 and the drain-source voltages VDS1 to VDS4]

Results

Iddq [nA]

Circuit            Input vector      Model    HSPICE
4 input NAND       ABCD=0000         0.72     0.60     Best
                   ABCD=1111         23.2     24.1     Worst
3 input NOR        ABC=111           0.13     0.13     Best
                   ABC=000           29.9     29.5     Worst
Full Adder         ABC=111           7.5      7.8      Best
                   ABC=001           56.0     62.3     Worst
4 bit ripple Add   A=B=0000, C=0     102.6    91.3     Best
                   A=B=1111, C=1     102.6    94.0     Best
                   A=B=0101, C=1     258.9    282.9    Worst

Reduced to 12%

Dual Threshold CMOS

  • Low-VT in critical path for high performance
  • High-VT in non-critical paths to reduce leakage

[Figure: circuit with the critical path (low VT) and the non-critical paths (high VT)]

Dual Threshold Partitioning

[Figure: critical path delay for all high VT, all low VT and dual VT]

SLIDE 30

Dual Threshold

  • Margin is needed due to the threshold variations
  • The two VT:s might go in different directions

Dual Threshold CMOS

Due to the complexity of a circuit
  • Not all transistors in non-critical paths can be assigned a high VT
  • New critical paths may be found

How to assign dual VT is a difficult problem

Dual Threshold CMOS

  • Sleep control transistors are turned off in standby to reduce leakage
  • Sleep control only on non-critical parts to avoid extra delay

[Figure: gates A to E with a high speed part (low VT devices) and a non-critical part (high VT devices); sleep control on the non-critical part]

Dual Threshold CMOS

Only one sleep control transistor is needed

[Figure: the same circuit with a single shared sleep control transistor]

SLIDE 31

Sleep Transistors

Two main approaches:
  • Header switch
  • Footer switch

NMOS is often preferred since it is smaller

[Figure: header switch between VDD and the logic and footer switch between the logic and ground, controlled by Sleep]

Sleep Transistors

  • Fine grain
  • Coarse grain
  • Ring style
  • Grid style

Sleep Transistor Sizing (Full Adder)

  • Wide transistors cost area
  • Small transistors cost performance due to increased resistance to ground
  • Wider is often preferred

Sleep transistors in Intel processors are > 200 um wide

SLIDE 32

Dual VT Mixed in CMOS Gates

  • Transistors in the same pull up/down are not mixed
  • Stacked transistors are not mixed

[Figure: gate with inputs A and B; the transistor in the critical path is a low VT device, the other transistors are high VT devices]

Dual Threshold CMOS

Advantages:
  • Efficient for standby leakage reduction
  • Sleep control is easily implemented based on existing circuits

Disadvantages:
  • Increased area and delay
  • Threshold variations

Variable Threshold CMOS

In active mode:
  • Zero or slightly forward body bias for high speed

In standby mode:
  • Deep reverse body bias for low leakage

$V_T = V_{T0} + \gamma \left( \sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|} \right)$

[Figure: body bias in active mode (same VDD/GND) and in standby mode (high VDD, low GND)]

Double well technology is required

Example PMOS Transistor

$V_{T0} = -0.5$ V, $\gamma = -0.4$, $2\Phi_F = 0.6$ V, $V_{DD} = 2.5$ V, $V_{n-well} = 3.0$ V

$V_T = V_{T0} + \gamma \left( \sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|} \right) = -0.5 - 0.4 \left( \sqrt{|-0.6 - 0.5|} - \sqrt{|-0.6|} \right) = -0.5 - 0.4 (0.2742) = -0.61$ V
SLIDE 33

Example PMOS Transistor

$I_{off} = I_{subth} = K \times e^{(V_{GS} - V_T)/v_{th}}$

A change from VT = −0.5 V to VT = −0.61 V decreases Ioff by two orders of magnitude

[Figure: body bias in active mode (same VDD/GND) and in standby mode (high VDD, low GND)]

Adaptive Body Biasing

ABB (Adaptive body biasing)

Vt Variation in Progress

Conclusions: Multiple VT

Low VT leads to
  • Higher leakage
  • Better performance (shorter propagation delay)

Low VT in critical path

Use multiple VT's either by
  • multiple VT processes
  • control of the body effect, VBS

Power Optimization and Accuracy

  • Highest savings at highest abstraction level
  • Most IC's are designed on this level
  • Time consuming and expensive

Tools: SPICE, PowerMill, Power Compiler, Prime Power, Prime Time, ????????

SLIDE 34

Power Estimation - RTL or Gate Level

RTL:
  • Fast
  • Lower accuracy
  • Good to compare architectures
  • Good to find modules that need to be optimized

    library ieee;
    use ieee.std_logic_1164.all;

    entity adder is
      port (
        a, b, ci : in  std_logic;
        s, co    : out std_logic);
    end adder;

    architecture behavioral of adder is
    begin  -- behavioral
      s  <= a xor b xor ci;
      co <= (a and b) or ((a or b) and ci);
    end behavioral;

Gate level:
  • High accuracy
  • Time consuming