Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy



slide-1
SLIDE 1

Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy

Seongmoo Heo, Ronny Krashinsky, Krste Asanović
MIT Laboratory for Computer Science
http://www.cag.lcs.mit.edu/scale

ARVLSI March 15, 2001

slide-2
SLIDE 2

Flip-Flop and Latch

(collectively timing elements)

  • Critical timing elements (TEs) in modern synchronous VLSI systems
  • Significant impact on cycle time
  • Large portion of energy consumption

[Figure: pie chart of the energy breakdown of a MIPS 5-stage pipeline datapath for SPECint95 programs [Heo, MS Thesis, ’00]. Categories: EqualCheck, Buffer, Shifter, Adder, ALU, RegFile, Mux, Latch, Flip-flop. Slice values as extracted (pairing with categories not shown): 3%, 1%, 2%, 7%, 8%, 23%, 10%, 23%, 23%.]

slide-3
SLIDE 3

Motivation

  • Previous work tried to find the most energy-efficient and fastest TEs, assuming a single TE design used uniformly throughout a circuit and using a very limited set of data patterns and an un-gated clock signal.
  • Two important observations:
  • There is wide variation in clock and data activity across different TEs.
  • Many TEs are not on the critical path, and thus have ample timing slack.

slide-4
SLIDE 4

Basic Idea

  • Selection from a heterogeneous library of designs, each tuned to a different operating regime
  • Operating regimes:
  • Different input and clock signal activities
  • Different speed requirements
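The selection idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' tool: the `Design` fields, the two-term energy model, and every number in `LIBRARY` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str          # hypothetical design label
    delay_ps: float    # characterized D-Q delay
    e_clock_fj: float  # assumed energy per clocked cycle with idle data
    e_data_fj: float   # assumed extra energy per data transition


def energy_per_cycle(d: Design, clock_activity: float, data_activity: float) -> float:
    """Expected energy per cycle under a simple two-term activity model."""
    return clock_activity * d.e_clock_fj + data_activity * d.e_data_fj


def select(library, max_delay_ps, clock_activity, data_activity) -> Design:
    """Pick the lowest-energy design that still meets the delay budget."""
    feasible = [d for d in library if d.delay_ps <= max_delay_ps]
    if not feasible:
        raise ValueError("no design meets the delay budget")
    return min(feasible, key=lambda d: energy_per_cycle(d, clock_activity, data_activity))


# Made-up library entries, for illustration only.
LIBRARY = [
    Design("fast", delay_ps=200.0, e_clock_fj=60.0, e_data_fj=40.0),
    Design("low_clock_energy", delay_ps=500.0, e_clock_fj=10.0, e_data_fj=90.0),
    Design("low_data_energy", delay_ps=450.0, e_clock_fj=50.0, e_data_fj=15.0),
]

# A critical-path TE, a clocked TE with idle data, and a gated TE with busy data.
critical = select(LIBRARY, max_delay_ps=250.0, clock_activity=1.0, data_activity=0.5)
idle_data = select(LIBRARY, max_delay_ps=600.0, clock_activity=1.0, data_activity=0.02)
busy_data = select(LIBRARY, max_delay_ps=600.0, clock_activity=0.1, data_activity=0.5)
```

With these made-up numbers, each of the three operating regimes ends up with a different design, which is the point of keeping a heterogeneous library.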
slide-5
SLIDE 5

Related Work

  • The use of timing slack for reduced energy
  • Examples:
  • Traditional transistor sizing
  • Cluster voltage scaling [Usami and Horowitz ’95]
  • Multiple threshold voltages or series transistors for reducing leakage current [McPherson et al. ’00, Yamashita et al. ’00, Johnson et al. ’99]

slide-6
SLIDE 6

Our Contribution

  • Detailed energy characterization of a wide range of TEs as a function of signal activities
  • Detailed measurement of TE signal activities for a microprocessor running complete programs
  • Exploiting signal activity to reduce TE energy by using different TE structures

slide-7
SLIDE 7

Overview

  • Flip-Flop and Latch Designs
  • Test Bench and Simulation Setup
  • Delay and Energy Characterization
  • Energy Analysis with Test Waveforms
  • Evaluation with Processor
  • Conclusion
slide-8
SLIDE 8

Latch Designs

Transistor sizes optimized for two extremes: Highest speed vs. Lowest power

slide-9
SLIDE 9

Flip-Flop Designs

Transistor sizes optimized for two extremes: Highest speed vs. Lowest power

slide-10
SLIDE 10

Test Bench

  • Used a fixed, realistic input driver
  • Determined an appropriate output load
  • Previous work used output loads as large as 200 fF.
  • We used 7.2 fF (4 minimum-inverter capacitances) because 60% of output loads in the VP microprocessor datapath are smaller than 14.4 fF.
  • Further work on load-sensitive analysis at the upcoming WVLSI
  • Sized the clock buffer to give equal rise and fall times

slide-11
SLIDE 11

Simulation Setup

  • Custom layout in a 0.25 µm TSMC CMOS process with the Magic layout program
  • Layout extraction with the SPACE 2D extractor
  • Circuit simulation with Hspice under nominal conditions of Vdd = 2.5 V and T = 25°C
  • Hspice .MEASURE statements used to measure delay and energy
slide-12
SLIDE 12

Delay Characterization

  • Flip-flop: minimum D-Q delay [Stojanovic et al. ’99]
  • Latch: D-Q delay

[Figure: delay (ps) of the lowest-power and highest-speed sizing of each design. (a) Flip-flops: PPCFF, SSAFF, SAFF, MSAFF, HLFF, HLSFF, SSAPL, SSASPL, CCPPCFF. (b) Latches: PPCLA, PTLA, SSALA, SSA2LA, CPNLA.]

slide-13
SLIDE 13

Energy Characterization

  • Total energy = input energy + internal energy + clock energy - output energy

  • Accurate energy characterization
  • State-transition technique based on [Zyuban and Kogge ’99]
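A state-transition energy table can be combined with transition counts roughly as follows. This is a minimal sketch, not the procedure from [Zyuban and Kogge ’99]: the three-bit state strings, their bit order, and all energy values below are made-up placeholders.

```python
def total_energy_fj(transition_counts, energy_table_fj):
    """Sum count * energy over every observed state transition."""
    missing = set(transition_counts) - set(energy_table_fj)
    if missing:
        raise KeyError(f"no characterized energy for transitions: {sorted(missing)}")
    return sum(n * energy_table_fj[t] for t, n in transition_counts.items())


# Hypothetical table: (state_before, state_after) -> energy in fJ.
ENERGY_TABLE_FJ = {
    ("000", "010"): 7.0,
    ("010", "000"): 6.0,
    ("100", "110"): 50.0,
}

# Hypothetical transition counts gathered from a simulation run.
COUNTS = {
    ("000", "010"): 1000,
    ("010", "000"): 1000,
    ("100", "110"): 10,
}

total = total_energy_fj(COUNTS, ENERGY_TABLE_FJ)
```

Each table entry would come from circuit simulation of that one transition, so the total only needs transition counts rather than a full circuit simulation of the workload.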

[Figure: two timing-element diagrams with terminals D, C, and Q, numbered 1 to 3]

slide-14
SLIDE 14

Energy Tables

(a) Flip-flops (b) Latches

slide-15
SLIDE 15

Energy Tables

(a) Flip-flops

Low-Power Flip-Flop

PPCFF
Energy values (as extracted): 51.2, 6.9, 6.9, 6.9, 49.7, 49.7, 68.1, 68.0, 19.4, 19.4, 19.2, 68.1, 49.1, 46.8, 91.5, 101, 46.3, 46.0, 47.6, 89.2, 89.0, 95.5, 95.4, 48.4
State transitions (as extracted): 011 → 001, 111 → 101, 110 → 100, 010 → 000, 001 → 011, 101 → 111, 100 → 110, 000 → 010, 111 → 011, 101 → 001, 110 → 010, 100 → 000, 011 → 111, 010 → 111, 001 → 100, 000 → 100

(b) Latches

slide-16
SLIDE 16

Test Waveforms

  • Tests 1 and 2: high clock activity, no data or output activity
  • Tests 3 and 4: high data activity, no clock or output activity
  • Tests 5, 6, and 7: high clock, data, and output activity (traditional)
  • Test 8: high clock and data activity, no output activity
slide-17
SLIDE 17

Energy Analysis

[Figure: energy per cycle (fJ/cycle) of the low-power flip-flops and latches under Test 1, Test 3, and Test 5. (a) Flip-flops: PPCFF, SSAFF, SAFF, MSAFF, HLFF, HLSFF, SSAPL, SSASPL, CCPPCFF. (b) Latches: PPCLA, PTLA, SSALA, SSA2LA, CPNLA. One off-scale value is annotated as 1131.]

slide-18
SLIDE 18

Processor Design and Simulation

  • Evaluation on a microprocessor datapath
  • Vanilla Pekoe Processor
  • A classic 32-bit MIPS RISC 5-stage pipeline with caches and system coprocessor registers (R3000-compatible)
  • Aggressive clock gating to save energy
  • 22 multi-bit flip-flops and latches, totaling 675 individual bits
  • Simulation with five SPECint95 benchmark programs
  • A fast cycle-accurate simulator [Krashinsky, Heo, Zhang, and Asanović ’00] that can count TE state transitions
  • 1.71 billion instructions and 2.69 billion cycles
  • Some constraints
  • Cannot track the exact timing of signals
  • Cannot model glitches
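Counting TE state transitions in a cycle-level simulation might look like the sketch below. It is an assumption-laden illustration rather than the authors' simulator: the state strings and the `trace` are hypothetical, and each element stands for one sampled state of a single TE bit.

```python
from collections import Counter


def count_transitions(states):
    """Count (before, after) pairs between consecutive samples of one TE bit."""
    return Counter(zip(states, states[1:]))


# Hypothetical sampled states of one TE bit, in simulation order.
trace = ["000", "010", "000", "010", "000", "100", "110"]
counts = count_transitions(trace)
```

Because only sampled states are compared, anything that happens between two samples is invisible, which matches the constraints listed above: no exact signal timing and no glitches.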
slide-19
SLIDE 19

Flip-Flops and Latches in Processor

slide-20
SLIDE 20

Flip-Flops and Latches in Processor

slide-21
SLIDE 21

Flip-Flops and Latches in Processor

slide-22
SLIDE 22

Energy Breakdown

Flip-flops (unit: mJ)

Name        HLFF-hs   Lowest-energy design   Lowest energy
f_recovpc   25.1      SSAFF-lp               3.57
d_inst      31.2      SSAFF-lp               6.52
d_epc       20.5      SSAFF-lp               2.74
x_epc       20.3      SSAFF-lp               2.62
m_epc       20.2      SSAFF-lp               2.55
x_sd        2.6       SAFF-lp                1.06
x_addr      8.0       SAFF-lp                2.57
m_exe       24.6      SSAFF-lp               4.76
cp0_count   42.6      SSAFF-lp               4.80
cp0_comp    0.1       HLFF-lp                0.03
cp0_baddr   0.3       HLFF-lp                0.18
cp0_epc     0.1       HLFF-lp                0.05

Latches (unit: mJ)

Name        PPCLA-hs  Lowest-energy design   Lowest energy
p_pc        3.22      SSALA-lp               2.25
f_pc        2.95      SSALA-lp               1.72
d_rsalu     3.27      SSALA-lp               3.16
d_rtalu     2.81      SSALA-lp               2.28
d_rsshmd    0.75      PPCLA-lp               0.70
d_rtshmd    0.65      PPCLA-lp               0.63
d_aluctrl   1.26      SSALA-lp               0.97
m_exe       3.88      SSALA-lp               3.65
x_sdalign   0.30      SSA2LA-lp              0.27
w_result    2.74      SSALA-lp               2.42

  • 32-bit MIPS 5 stage pipeline datapath
  • SPECint95 benchmarks: perl(test, primes), ijpeg(test), m88ksim(test), go(20,9), and lzw(medtest)
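The per-register energies on this slide can be totaled with a short script. This is a sketch under one assumption: each flattened entry is read as (lowest energy, lowest-energy design, energy with the uniform -hs design, register name). The totals ignore delay constraints, so they are not the same quantity as the savings reported for designs that keep the best performance.

```python
# (name, energy with uniform -hs design in mJ, lowest-energy design, its energy in mJ)
FLIP_FLOPS = [
    ("f_recovpc", 25.1, "SSAFF-lp", 3.57), ("d_inst", 31.2, "SSAFF-lp", 6.52),
    ("d_epc", 20.5, "SSAFF-lp", 2.74), ("x_epc", 20.3, "SSAFF-lp", 2.62),
    ("m_epc", 20.2, "SSAFF-lp", 2.55), ("x_sd", 2.6, "SAFF-lp", 1.06),
    ("x_addr", 8.0, "SAFF-lp", 2.57), ("m_exe", 24.6, "SSAFF-lp", 4.76),
    ("cp0_count", 42.6, "SSAFF-lp", 4.80), ("cp0_comp", 0.1, "HLFF-lp", 0.03),
    ("cp0_baddr", 0.3, "HLFF-lp", 0.18), ("cp0_epc", 0.1, "HLFF-lp", 0.05),
]
LATCHES = [
    ("p_pc", 3.22, "SSALA-lp", 2.25), ("f_pc", 2.95, "SSALA-lp", 1.72),
    ("d_rsalu", 3.27, "SSALA-lp", 3.16), ("d_rtalu", 2.81, "SSALA-lp", 2.28),
    ("d_rsshmd", 0.75, "PPCLA-lp", 0.70), ("d_rtshmd", 0.65, "PPCLA-lp", 0.63),
    ("d_aluctrl", 1.26, "SSALA-lp", 0.97), ("m_exe", 3.88, "SSALA-lp", 3.65),
    ("x_sdalign", 0.30, "SSA2LA-lp", 0.27), ("w_result", 2.74, "SSALA-lp", 2.42),
]


def totals(rows):
    """Return (uniform total, per-register minimum total, fractional saving)."""
    uniform = sum(r[1] for r in rows)
    lowest = sum(r[3] for r in rows)
    return uniform, lowest, 1.0 - lowest / uniform


ff_uniform, ff_lowest, ff_saving = totals(FLIP_FLOPS)
la_uniform, la_lowest, la_saving = totals(LATCHES)

print(f"flip-flops: {ff_uniform:.2f} mJ -> {ff_lowest:.2f} mJ ({ff_saving:.1%} lower)")
print(f"latches:    {la_uniform:.2f} mJ -> {la_lowest:.2f} mJ ({la_saving:.1%} lower)")
```

The 12 flip-flop rows and 10 latch rows add up to the 22 multi-bit timing elements mentioned for the processor, which is a useful sanity check on the reading of the table.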

slide-23
SLIDE 23

Processor Energy Results - Flip-Flop

[Figure: total flip-flop energy (J) versus flip-flop delay (ps). Series: Uniform, HLFF-Sizing, HLFF-AS, SSASPL-Sizing, SSASPL-AS. Labeled points: SSASPL-hs, HLFF-hs, SSAFF-lp, HLFF-lp, SSAFF-hs, SSASPL-lp.]

  • Reference: total datapath energy minus total TE energy is around 0.21 J
  • hs: highest speed; lp: lowest power
  • Uniform: a single design used uniformly throughout a circuit

slide-24
SLIDE 24

Processor Energy Results - Flip-Flop

[Figure: total flip-flop energy (J) versus flip-flop delay (ps). Series: Uniform, HLFF-Sizing, HLFF-AS, SSASPL-Sizing, SSASPL-AS. Annotations: HLFF-hs, 34% energy saving.]

  • 34% energy saving with conventional transistor sizing

slide-25
SLIDE 25

Processor Energy Results - Flip-Flop

[Figure: total flip-flop energy (J) versus flip-flop delay (ps). Series: Uniform, HLFF-Sizing, HLFF-HSLE, SSASPL-Sizing, SSASPL-AS. Annotations: HLFF-hs, 69% energy saving, 52% energy saving.]

  • 52% energy saving over transistor sizing alone, with the best performance (HLFF-hs)
  • HSLE: activity-sensitive selection
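The percentages on the flip-flop slides compose multiplicatively if each saving is measured against the result of the previous step. The sketch below assumes exactly that reading; `combined_saving` is a hypothetical helper, not something from the talk.

```python
def combined_saving(*stage_savings):
    """Overall fractional saving when each stage saves a fraction of what remains."""
    remaining = 1.0
    for s in stage_savings:
        remaining *= 1.0 - s
    return 1.0 - remaining


# 34% from transistor sizing, then 52% more from activity-sensitive selection.
overall = combined_saving(0.34, 0.52)
print(f"{overall:.1%}")
```

The result is about 68%, close to the 69% quoted against uniform HLFF-hs; the small gap is presumably rounding in the quoted percentages.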

slide-26
SLIDE 26

Processor Energy Results - Latch

[Figure: total latch energy (J) versus latch delay (ps). Series: Uniform, PPCLA-Sizing, PPCLA-HSLE. Labeled points: SSA2LA-lp, PPCLA-hs. Markers 1 and 2 indicate the savings listed below.]

  • 6.1% energy saving over transistor sizing alone (1)
  • 8.3% energy saving compared to a homogeneous design with PPCLA-hs (2)
  • PPCLA is the fastest and also very energy-efficient.

slide-27
SLIDE 27

Summary of Energy Results

  • 63% TE energy saving compared to a homogeneous design with HLFF-hs and PPCLA-hs
  • 46% TE energy saving compared to a design with conventional transistor sizing, while keeping the best performance

slide-28
SLIDE 28

Conclusion

  • We showed that activity patterns differ considerably across the TEs in a circuit.
  • We found wide variation in the optimal TE design for different operating regimes.
  • We provided complete energy and delay characterization.
  • We applied our technique to a real processor, simulated for 2.7 billion cycles of program execution, and showed over 63% TE energy reduction without losing any performance.

Difficulty of using a heterogeneous mix of TEs?

  • Designers already perform verification for each local clock, so the added complexity is minimal.
  • Timing verification for non-critical TEs is simple.