SLIDE 1

1 April 10th, 2008 ASYNC 2008

Heterogeneous Latch-based Asynchronous Pipelines

Girish Venkataramani Tiberiu Chelcea Seth C. Goldstein Presenter: Tobias Bjerregaard

SLIDE 2

Outline

  • Introduction
  • Latch Selection Algorithm
  • Experimental Results
  • Conclusions

SLIDE 3

Motivation

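[Figure: bundled-data pipeline stage with a normally open latch: computation (+), matched delay, C-element handshake (H/S) controller; signals Req, Ack, Data]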

  • Normally open latches are attractive for bundled-data designs, e.g., Mousetrap [Singh, ICCD 01]
    + High performance: short critical path to open the latch
    – Power hungry: data glitches spill over to downstream stages

SLIDE 4

Motivation

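[Figure: bundled-data stage with a normally open D-latch: computation (+), matched delay, C-element handshake (H/S) controller; signals Req, Ack, Data]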

  • Self-Resetting (SR) latches address the glitching problem [Chelcea, DAC 07]
    – D-latch is closed during active computation (filters glitches)
    – D-latch is opened just before the stage computation stabilizes
    + 2x improvement in energy-delay*
    – 10% performance slowdown*

* Mediabench suite [Lee, Micro 97]

[Figure: the same stage with an SR-latch and its handshake (H/S) controller; signals Data, Req, Ack]

SLIDE 5

Contributions

  • Build heterogeneous pipelines
    – Use D-latches for timing-critical stages
    – Use SR-latches for the rest

  • Module selection problem
    – What is timing-critical?
    – When is an SR-latch warranted?

  • Automatic latch selection algorithm
    – Experimental results: heterogeneous pipelines match the performance of D-latch pipelines and are more energy-efficient than either homogeneous D-latch or homogeneous SR-latch pipelines

SLIDE 6

Outline

  • Introduction
  • Latch Selection Algorithm
  • Experimental Results
  • Conclusions

SLIDE 7

Latch Selection Algorithm

  • Objectives: get the best of both worlds
    – Performance of D-latches
    – Energy efficiency of SR-latches

  • Approach: balance the use of SR-latches
    – Too many: bigger and slower designs
    – Too few: high energy consumption

  • Algorithm properties: three heuristics track timing criticality and estimate the effect of datapath glitches

SLIDE 8

Power Heuristics

  • Data glitches are proportional to the datapath fanout
    – Use SR-latches if fanout >= 2

  • Protect computation-intensive stages
    – Assign SR-latches to their inputs
    – Bit-operations on the datapath are used to estimate computation intensiveness

[Figure: example datapath; legend marks glitches, stages, and SR-latches (sr)]
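
The two power heuristics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the pipeline is a dictionary `succs` mapping each stage to its downstream stages, fanout is taken as the number of successors, `bit_ops` is a hypothetical per-stage bit-operation estimate, and `bo_max` stands for the BOmax threshold that marks a stage as compute-intensive.

```python
def power_heuristics(succs, bit_ops, bo_max):
    """Return the stages that receive SR-latches under the two power heuristics.

    succs:   stage -> list of downstream stages (assumed representation)
    bit_ops: stage -> estimated bit operations (hypothetical metric)
    bo_max:  threshold at or above which a stage counts as compute-intensive
    """
    v_sr = set()
    for v, outs in succs.items():
        # Heuristic 1: a high-fanout stage spreads its glitches to many consumers.
        if len(outs) >= 2:
            v_sr.add(v)
        # Heuristic 2: latch the inputs of compute-intensive downstream stages.
        if any(bit_ops[u] >= bo_max for u in outs):
            v_sr.add(v)
    return v_sr


if __name__ == "__main__":
    # Hypothetical pipeline: "d" is compute-intensive, "a" has fanout 2.
    succs = {"e": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    bit_ops = {"e": 2, "a": 8, "b": 4, "c": 4, "d": 64}
    print(sorted(power_heuristics(succs, bit_ops, bo_max=32)))  # prints ['a', 'b', 'c']
```

Stage a is selected for its fanout; b and c are selected because they feed the compute-intensive stage d.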

SLIDE 9

Timing Criticality

  • SR-latch controllers introduce delay when opening latches
    – Use a D-latch if the stage is timing-critical

  • Determine the system’s critical stages using the Global Critical Path (GCP) [Venkataramani, DAC 07]

SLIDE 10

Timing Analysis

  • Analysis produces steady-state event firing times
    – Events are handshake signal transitions
    – Behaviors are dependence relations between events

  • Cycle time: time between successive occurrences of the same event

  • Alternative representation of cycle time
    – Set of slack values: time difference between each input event and the last-arriving input of the same behavior
    – Global Critical Path (GCP): longest zero-slack path
    – Global slack: timing budget of a stage, i.e., how much the stage can be slowed without changing the GCP
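
As a small illustration of the cycle-time definition, the sketch below takes the firing times of one event from a hypothetical simulation trace (not output of the paper's analysis, which computes steady-state times directly) and returns the spacing between successive occurrences. The `warmup` parameter is an assumption used to skip the start-up transient.

```python
def cycle_time(firing_times, warmup=0):
    """Average spacing between successive occurrences of the same event.

    firing_times: increasing times at which one handshake event fired
    warmup:       number of initial occurrences to skip (start-up transient)
    """
    ts = firing_times[warmup:]
    if len(ts) < 2:
        raise ValueError("need at least two steady-state occurrences")
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # First two occurrences are transient; afterwards the event recurs every 10 time units.
    print(cycle_time([0, 7, 15, 25, 35, 45], warmup=2))  # prints 10.0
```

In a true steady state every gap is equal, so any single gap gives the cycle time; averaging only smooths small variations in a measured trace.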

SLIDE 11

Event Slack

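  • Use the concept of Time Separation of Events (TSEs) to compute slack, the GCP, and global slack

Example: behavior b has input events e1 and e2 and output event e3. On the time axis, e2 fires at 5 and e1 fires at 9, so e1 is the last-arrival input and e3 fires after it. Hence Slack(e1, b) = 0 and Slack(e2, b) = 9 - 5 = 4.

The 0-slack input is locally critical [Fields, ISCA 01].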
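
The slack computation in this example can be sketched as follows, assuming a behavior fires only after all of its inputs have arrived (so the last-arrival input sets its firing time). The dictionary of arrival times is an assumed input format; in the paper these times come from the TSE-based timing analysis.

```python
def event_slack(arrivals):
    """Slack of each input event e of one behavior b.

    arrivals: input event -> steady-state firing time
    Slack(e, b) = (firing time of the last-arrival input) - (firing time of e)
    """
    last = max(arrivals.values())
    return {e: last - t for e, t in arrivals.items()}


def locally_critical(arrivals):
    """Inputs with zero slack, i.e., the last-arrival inputs of the behavior."""
    return [e for e, s in event_slack(arrivals).items() if s == 0]


if __name__ == "__main__":
    arrivals = {"e1": 9, "e2": 5}       # firing times from the slide's example
    print(event_slack(arrivals))        # prints {'e1': 0, 'e2': 4}
    print(locally_critical(arrivals))   # prints ['e1']
```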

SLIDE 12

Global Critical Path (GCP)

  • The GCP is the longest path of zero-slack input events [Venkataramani, DAC 07]
    – Equivalent to the critical cycle

  • Bottom-up computation of cycle time:
    – Length of the GCP cycle = cycle time

The GCP is the sequential critical path of the system. It represents the primary performance bottleneck.
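
A minimal sketch of how the GCP can be traced once slack is known: starting from any event, repeatedly step back to its zero-slack (last-arriving) predecessor until an event repeats; the loop that closes is the critical cycle. The inputs `crit_pred` and `delay` are assumed to come from a prior timing analysis, the event names are hypothetical, and equating the cycle's total delay with the cycle time assumes the cycle spans exactly one iteration of the pipeline.

```python
def global_critical_path(crit_pred, delay, start):
    """Trace zero-slack predecessors backward from `start` until an event repeats.

    crit_pred: event -> its zero-slack (last-arriving) input event
    delay:     (pred, event) -> delay of that dependence
    Returns (cycle events in forward order, total delay around the cycle).
    """
    seen = {}            # event -> position in `path`
    path = []            # events in backward order
    x = start
    while x not in seen:
        seen[x] = len(path)
        path.append(x)
        x = crit_pred[x]
    cycle = path[seen[x]:]   # backward order around the cycle
    cycle.reverse()          # forward (causal) order
    length = sum(delay[(crit_pred[v], v)] for v in cycle)
    return cycle, length


if __name__ == "__main__":
    # Hypothetical events: a -> b -> c -> d -> a is the critical cycle; s hangs off b.
    crit_pred = {"s": "b", "a": "d", "b": "a", "c": "b", "d": "c"}
    delay = {("d", "a"): 2, ("a", "b"): 3, ("b", "c"): 4, ("c", "d"): 1, ("b", "s"): 5}
    print(global_critical_path(crit_pred, delay, "s"))  # prints (['c', 'd', 'a', 'b'], 10)
```

Starting from the off-cycle event s still lands on the critical cycle, which is why any event can seed the backward walk.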

SLIDE 13

Global Slack

  • Global slack: minimum cumulative slack along a path to the GCP
    – If the behavior is on the GCP:
      • GSlack(b4) = GSlack(b5) = 0
    – Otherwise:
      • GSlack(b2) = 1
      • GSlack(b3) = 4
      • GSlack(b1) = min(0 + 4, 2 + 1) = 3

[Figure: behaviors b1, b2, b3 feeding the GCP, which contains b4 and b5; edges annotated with local slack values]

Global slack is a measure of how much a behavior can be delayed without affecting global performance.
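
The recurrence on this slide can be sketched as a memoized recursion. Assumptions (not from the paper): each behavior lists its successors together with the local slack of its output at that successor, the off-GCP behaviors form an acyclic graph that eventually reaches the GCP, and the small graph in the usage example is hypothetical, chosen only so that it reproduces the slide's numbers.

```python
def global_slack(succs, on_gcp):
    """Global slack of every behavior.

    succs:  behavior -> list of (successor, local slack of this behavior's output there)
    on_gcp: behaviors on the Global Critical Path (their global slack is 0)
    GSlack(b) = min over successors s of (local slack + GSlack(s))
    """
    memo = {}

    def gs(b):
        if b in on_gcp:
            return 0
        if b not in memo:
            memo[b] = min(slack + gs(s) for s, slack in succs[b])
        return memo[b]

    return {b: gs(b) for b in succs}


if __name__ == "__main__":
    # Hypothetical graph reproducing the slide: b4 and b5 lie on the GCP.
    succs = {
        "b1": [("b3", 0), ("b2", 2)],
        "b2": [("b4", 1)],
        "b3": [("b5", 4)],
        "b4": [("b5", 0)],
        "b5": [("b4", 0)],
    }
    print(global_slack(succs, on_gcp={"b4", "b5"}))
    # prints {'b1': 3, 'b2': 1, 'b3': 4, 'b4': 0, 'b5': 0}
```

GSlack(b1) = min(0 + 4, 2 + 1) = 3, matching the slide.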

SLIDE 14

Timing Criticality Heuristic

  • Let ∆sr be the delay overhead introduced by SR-latches

  • Iterative algorithm: assign an SR-latch when global slack is larger than ∆sr
    – Update timing
    – Repeat; look for more opportunities
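
One way to write a single selection step compactly, with Vsr the set of stages already assigned SR-latches and GSlack recomputed after every assignment (a restatement in our own notation, not an equation taken from the paper):

```latex
v_{\mathrm{best}} = \operatorname*{arg\,max}_{v \in V \setminus V_{sr}} \mathrm{GSlack}(v),
\qquad
V_{sr} \leftarrow V_{sr} \cup \{ v_{\mathrm{best}} \}
\quad \text{if } \mathrm{GSlack}(v_{\mathrm{best}}) > \Delta_{sr}
```

The iteration stops as soon as the best remaining candidate has global slack of at most ∆sr.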

SLIDE 15

Latch Selection Algorithm Overview

Input: G = (V, E). Output: Vsr, the set of stages that receive SR-latches.

  • Add all v in V to Vsr if Fanout(v) >= 2
  • Add all v in V to Vsr if (v, u) is in E and BitOps(u) >= BOmax
  • Loop:
    – vbest = stage in (V – Vsr) with the most GSlack
    – Is GSlack(vbest) > ∆sr? Yes: add vbest to Vsr, update GSlack, and repeat
    – No: stop; the result is Vsr

Complexity: O(|V||E| + |V|²)
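
The flow on this slide can be sketched end to end in Python. This is a hedged sketch, not the authors' code: it assumes the two power heuristics run before the slack-driven loop, represents G as a successor dictionary, and hides the timing update behind a hypothetical callback `compute_gslack(v_sr)` that returns the global slack of every stage for the current SR-latch assignment (in the paper, this role is played by the GCP-based timing analysis). The toy timing model in the usage example is invented purely to exercise the loop.

```python
def select_latches(succs, bit_ops, bo_max, delta_sr, compute_gslack):
    """Return Vsr, the set of stages that receive SR-latches.

    succs:          stage -> list of downstream stages (edges E of G = (V, E))
    bit_ops:        stage -> estimated bit operations (hypothetical metric)
    bo_max:         compute-intensity threshold (BOmax)
    delta_sr:       delay overhead of an SR-latch (∆sr)
    compute_gslack: hypothetical callback, v_sr -> {stage: global slack}
    """
    v_sr = set()
    # Power heuristics: high fanout, or feeding a compute-intensive stage.
    for v, outs in succs.items():
        if len(outs) >= 2 or any(bit_ops[u] >= bo_max for u in outs):
            v_sr.add(v)

    # Timing heuristic: greedily add the remaining stage with the most global slack.
    while True:
        candidates = [v for v in succs if v not in v_sr]
        if not candidates:
            break
        gslack = compute_gslack(v_sr)           # "Update GSlack"
        v_best = max(candidates, key=lambda v: gslack[v])
        if gslack[v_best] <= delta_sr:          # "No" branch: stop
            break
        v_sr.add(v_best)                        # "Yes" branch: assign an SR-latch
    return v_sr


if __name__ == "__main__":
    succs = {"a": ["b"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
    bit_ops = {"a": 2, "b": 4, "c": 4, "d": 4, "e": 8}
    base = {"a": 4, "b": 9, "c": 10, "d": 6, "e": 0}
    delta_sr = 2

    def toy_gslack(v_sr):
        # Toy model (an assumption): each SR-latch uses up delta_sr of every stage's slack.
        return {v: max(0, base[v] - delta_sr * len(v_sr)) for v in base}

    print(sorted(select_latches(succs, bit_ops, 32, delta_sr, toy_gslack)))  # prints ['b', 'c']
```

Stage b is chosen by the fanout heuristic and c by the slack loop; the loop then stops because the best remaining candidate, d, has global slack exactly equal to ∆sr, which fails the strict comparison.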

SLIDE 16

Outline

  • Introduction
  • Latch Selection Algorithm
  • Experimental Results
  • Conclusions

SLIDE 17

Experimental Setup

  • Implemented the latch selection algorithm within CASH, a compiler that synthesizes 4-phase bundled-data circuits from C [IWLS 04]
  • Applied to 15 Mediabench kernels [Lee, Micro 97]
  • Circuits mapped to a 180 nm / 2 V STMicro standard-cell library
  • Synopsys DC used to estimate energy; ModelSim used for gate-level timing estimation

SLIDE 18

Impact of Heuristics

[Chart: % contribution per kernel (K1.adpcm_d to K15.pgp_e); series: D-Latch Ops, GSlack, Compute-intensive, and Fanout, the last three making up the SR-latch stages]

The combined effect of the heuristics contributes to energy efficiency.

SLIDE 19

Energy-Delay

[Chart: energy-delay as a ratio to the D-latch pipeline, for SR-latch and heterogeneous pipelines, across kernels K1.adpcm_d to K15.pgp_e and their geometric mean (GM)]

SLIDE 20

End-to-End Execution Time

[Chart: end-to-end execution time as a ratio to the D-latch pipeline, for SR-latch and heterogeneous pipelines, across kernels K1.adpcm_d to K15.pgp_e and their geometric mean (GM)]

SLIDE 21

Outline

  • Introduction
  • Latch Selection Algorithm
  • Experimental Results
  • Conclusions

SLIDE 22

Conclusions

  • D-latches are power-hungry and SR-latches are slow for bundled-data pipelines

  • Heterogeneous latch selection algorithm
    – Global slack guides timing-critical selection
    – Simple heuristics guide power-critical selection

  • Heterogeneous latch pipelines are more energy-efficient than either homogeneous D-latch or homogeneous SR-latch pipelines

SLIDE 23

Thank You! Questions?

SLIDE 24

Self-Resetting (SR) Latches

[Chelcea, DAC 07]

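[Figure: SR-latch: a D-latch (Din, Dout, En) controlled by an SR controller (SR cntrl) containing a C-element; control signals trigger and Done]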

  • Overhead: timing, area, and control-path power
  • Benefit: datapath power savings

SLIDE 25

SR-latch behavior

[Figure: STG specification of the SR-latch controller [Chelcea, DAC 07], with transitions EnSR+, Done+, EnSR-, Done-, En+, En-; annotated phases: data ready, open the latches to pass data, close the latches once data is latched]

  • Eliminate glitches:
    – Open only after data is ready
    – Close as soon as data is latched

  • Eliminate overheads:
    – Open before the handshake starts