Long wires and asynchronous control (R. Ho, J. Gainsley, R. Drost)



SLIDE 1

Long wires and asynchronous control

  • R. Ho, J. Gainsley, R. Drost

Sun Microsystems Laboratories

SML2004-0323, Public Information

Funded by DARPA contract NBCH30390002

SLIDE 2

  • There are really two kinds of on-chip wires
  • Scaled-length wires: span a block of constant complexity
  • Constant-length wires: span a fixed distance

How do on-chip wires scale?

Are they really as bad as “they” say?

[Figure: two log-scale plots of wire delay (FO4/mm, 1 to 100) across technology generations from 180 down to 13, one per wire type.]

Projections from R. Ho 2003

Scaled-length wires keep up with gates

Fixed-length wires cannot keep up

SLIDE 3

  • Build what the VLSI constraints (wires) demand
  • Global network ties all the blocks together
  • How can we get high bandwidth and low latency?

What this means for designers

Build modular machines

Computation block

  • Lots of transistors
  • Local memory
  • Local communication
  • Locally synchronous?

Global network

  • Explicit (expensive) communication
  • Lots of long wires
  • Globally asynchronous?

SLIDE 4

  • Speeding up global wires: asynchronous control improves performance
  • Optimizing wire latency: well-known circuit models lead to analysis
  • Optimizing wire bandwidth: dual-path control reduces transactional penalty
  • What about power?
  • Conclusion

Outline

SLIDE 5

  • Flow-through repeaters help latency (at a cost in power)
  • But they do not improve bandwidth
  • Unless we wave-pipeline them
  • Scary with {device,wire} {static,dynamic} variations

Speeding up global wires

Flow-through repeaters

[Figure: number of gate delays (y-axis, 10 to 50) versus wire length in mm (x-axis, 1 to 15).]

SLIDE 6

  • Latched repeaters improve latency and bandwidth
  • Latency a little worse due to internal delays
  • The problem: they need a fast strobe (~5 FO4s)
  • Can’t use CPU clock (no faster than ~15 FO4/cycle)
  • Local fast clock generation adds complexity

Speeding up global wires

Latched repeaters

[Figure label: strobe]

SLIDE 7

  • So control the latched repeaters asynchronously
  • Better latency, better bandwidth, don’t need clock
  • Allows for GALS: asynchronous compute modules
  • Treat global wires as flow-through FIFOs
  • So: how do we optimize latency and bandwidth?

Speeding up global wires

Asynchronous latched repeaters

[Figure: three ctrl blocks along the repeated wire, each labeled with a handshake.]

SLIDE 8

  • Leverage well-known circuit analysis techniques
  • Use dominant time constant (Elmore) models
  • Not specific to asynchronous circuits
  • But assume source-limited data patterns
  • Turn repeater and wire into component Rs and Cs
  • Parameterize by driver width (w), wire length (L)
  • Latch design sets delay, p/n ratios (β), stepup (s)

Optimizing wire latency

Analytic models
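
To make the modeling step concrete, a textbook Elmore expression for one repeated segment might look like the following. This is a generic sketch, not the authors' model: the symbols R_d (resistance of a unit-width driver), C_p and C_g (driver parasitic and gate capacitance per unit width), and r and c (wire resistance and capacitance per unit length) are assumptions introduced here, and the latch delay, β, and s terms named on the slide are omitted.

```latex
T_{\mathrm{seg}}(w, L) = 0.69\,\frac{R_d}{w}\bigl(C_p w + cL + C_g w\bigr)
                       + rL\bigl(0.38\,cL + 0.69\,C_g w\bigr)
```

The 0.69 and 0.38 factors are the usual 50%-delay coefficients for lumped and distributed RC stages, respectively.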

SLIDE 9

  • Formulate RC delay and optimize
  • Partial derivative w.r.t. driver width (w) = 0
  • Partial derivative w.r.t. segment length (L) = 0
  • Example: latch with tristate-able output
  • For minimal delay:
  • In a TSMC 180nm logic process, using M5 wires
  • Delay-minimal L = 3.8mm, w = 20µm

Optimizing wire latency

Analytical formulation leads to optimization
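
The two stationarity conditions on this slide can be sketched numerically with a textbook repeater model. This is an illustration under stated assumptions, not the authors' formulation: it ignores latch delay, β, and stepup, and every parameter value (`R_D`, `C_G`, `C_P`, `R_W`, `C_W`) is a hypothetical placeholder rather than TSMC 180nm data, so the numbers it yields are not expected to reproduce L = 3.8mm or w = 20µm.

```python
import math

# Hypothetical placeholder parameters (NOT taken from the slides or any PDK).
R_D = 10e3     # ohm*um: resistance of a 1 um wide driver
C_G = 2e-15    # F/um:   gate capacitance per um of driver width
C_P = 2e-15    # F/um:   driver parasitic capacitance per um of width
R_W = 40.0     # ohm/mm: wire resistance per mm
C_W = 0.3e-12  # F/mm:   wire capacitance per mm


def delay_per_mm(w_um, length_mm):
    """Elmore delay of one repeated segment divided by its length (s/mm)."""
    # Driver resistance charging its own parasitics, the next gate, and the wire.
    driver = 0.69 * (R_D / w_um) * ((C_P + C_G) * w_um + C_W * length_mm)
    # Distributed wire RC plus wire resistance charging the next gate.
    wire = R_W * length_mm * (0.38 * C_W * length_mm + 0.69 * C_G * w_um)
    return (driver + wire) / length_mm


def optimum():
    """Closed-form point where both partial derivatives of delay_per_mm are zero."""
    l_opt = math.sqrt(0.69 * R_D * (C_P + C_G) / (0.38 * R_W * C_W))
    w_opt = math.sqrt(R_D * C_W / (R_W * C_G))
    return w_opt, l_opt
```

In this simplified model the delay per unit length separates into a part that depends only on L and a part that depends only on w, so each partial derivative can be solved on its own. Calling `optimum()` and then `delay_per_mm(w_opt, l_opt)` gives the minimum for whatever parameters are plugged in.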

SLIDE 10

  • What about sensitivities to L and w?
  • Normalize to their delay-optimal values
  • So for datapaths, best latency is ~ 3mm to 4.6mm
  • What about bandwidth?

Optimizing wire latency

Sensitivities

[Figure: 2% delay contours over w/wopt (0.6 to 2.2) and L/Lopt (0.6 to 1.6).]

Very flat contours!
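
One way to see why such contours are flat: in a textbook repeater model (an assumption here, not necessarily the authors' exact model) the delay per unit length has the form A/L + B·L + C/w + D·w. Around the optimum each variable then contributes a factor (x + 1/x)/2, which has zero slope at x = 1. The sketch below evaluates that penalty; `l_share`, the fraction of the optimal delay attributable to the length terms, is a hypothetical knob.

```python
def delay_penalty(l_ratio, w_ratio, l_share=0.5):
    """Delay relative to the optimum for L = l_ratio*Lopt and w = w_ratio*wopt.

    Assumes delay/length = A/L + B*L + C/w + D*w (textbook form).
    l_share is the (hypothetical) fraction of optimal delay due to A/L + B*L.
    """
    if l_ratio <= 0 or w_ratio <= 0:
        raise ValueError("ratios must be positive")
    length_term = (l_ratio + 1.0 / l_ratio) / 2.0  # equals 1 at the optimum
    width_term = (w_ratio + 1.0 / w_ratio) / 2.0   # equals 1 at the optimum
    return l_share * length_term + (1.0 - l_share) * width_term
```

Under this assumption, moving L about 20% away from its optimum in either direction costs only a few percent of delay per unit length, which is in line with the slide's 3mm to 4.6mm range around a 3.8mm optimum.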

SLIDE 11

  • Asynchronous circuits are transactional
  • Each cycle requires a request and a response
  • During the request, data flows
  • During the response, no data flows
  • Control circuit families reflect this imbalance
  • In GasP, ACKs (2 gate delays) are faster than REQs (4 gate delays)
  • ACK delay would be zero, except for hold times

Optimizing wire bandwidth

Transactional nature of controls

SLIDE 12

  • Long wires exacerbate transaction delays
  • Both REQ and ACK require wire RC delay
  • REQ delay matches data delay: useful
  • ACK delay is dead time for datapath: useless
  • Can wire engineering help?
  • Fatten ACK wire
  • Lower its RC delay
  • Get 2.5x speedup easily
  • Much more is too costly

Optimizing wire bandwidth

Implications for wires

[Figure: speedup for a 4mm wire (y-axis, 1 to 3.5) versus wire width factor (x-axis, 5 to 30).]

SLIDE 13

  • Level-sensitive control (RZ) is a poor choice
  • Uses four phases: two wire transitions per token
  • Has twice the transactional penalty
  • Transition-encoded control (NRZ) is better
  • Uses two phases: average one transition per token
  • Still has transactional bandwidth limitation
  • Pulse-encoded control (GasP) also okay
  • Has same energy as NRZ, same bandwidth penalty
  • Has the advantage that we’re familiar with GasP

Optimizing wire bandwidth

Control protocol implications for long wires
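
The difference in wire activity between the two signaling styles can be sketched by counting transitions on a single control wire. This is a toy model of the encodings only, with hypothetical function names; it says nothing about gate-level timing.

```python
def rz_wire_levels(n_tokens):
    """Four-phase (return-to-zero): the wire rises, then returns low, per token."""
    levels = [0]
    for _ in range(n_tokens):
        levels += [1, 0]
    return levels


def nrz_wire_levels(n_tokens):
    """Two-phase (non-return-to-zero): each token toggles the wire once."""
    levels = [0]
    for _ in range(n_tokens):
        levels.append(1 - levels[-1])
    return levels


def count_transitions(levels):
    """Number of level changes along a sequence of wire levels."""
    return sum(a != b for a, b in zip(levels, levels[1:]))
```

For five tokens, `count_transitions(rz_wire_levels(5))` gives 10 and `count_transitions(nrz_wire_levels(5))` gives 5. On a long wire every transition pays the wire's RC delay, so the return-to-zero style spends twice as many wire delays per token.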

SLIDE 14

  • By the way, GasP control of long wires isn’t trivial
  • Control wires are bidirectional, data wires are not
  • Capacitance asymmetry between control, data
  • Requires a bit more timing margin
  • Pushing pulses on a moderately long wire is hard
  • Must overcome the “wet noodle” effect
  • Logical effort theory can help CAD sizing
  • But for now, size things manually via SPICE

Optimizing wire bandwidth

Pulse-encoded control challenges

SLIDE 15

  • A simplification of GasP
  • High = full, or “token present”
  • Low = empty, or “no token present”
  • If (pred==high && succ==low) then
  • Flip the clk, and reset both pred and succ

Optimizing wire bandwidth

Modified GasP for long wires

[Figure: GasP stage between the pred and succ wires, driving clk; labels: reset low, reset high.]
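
The firing rule in the bullets above can be sketched as a discrete-step simulation of a chain of stages. This is a behavioral toy with no timing or pulse widths. It assumes that "reset" drives pred low (now empty) and succ high (now full), which is one reading of the figure labels. The function name `step` and the list-of-booleans state are hypothetical.

```python
def step(wires):
    """Advance a chain of simplified GasP stages by one firing round.

    wires[i] is the state wire feeding stage i: True = high = token present.
    Stage i sits between wires[i] (pred) and wires[i + 1] (succ).
    Returns the new wire states and the indices of stages that pulsed clk.
    """
    nxt = list(wires)
    fired = []
    for i in range(len(wires) - 1):
        # Fire only when a token is waiting and the next slot is empty.
        if wires[i] and not wires[i + 1]:
            nxt[i] = False      # pred reset low: token consumed
            nxt[i + 1] = True   # succ reset high: token handed on
            fired.append(i)     # stands in for the clk pulse to the latch
    return nxt, fired
```

Starting from `[True, False, False, False]`, three calls to `step` move the token to the last wire, after which no stage fires. Two adjacent stages can never fire in the same round, because the wire they share cannot be both high and low.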

SLIDE 16

  • Tweak GasP to prevent pulses from disappearing
  • As wires lengthen, RC delays increase
  • …transitions on wires take longer
  • …drive pulses must widen to allow full transitions
  • We can delay the reset of PRED and SUCC lines

Optimizing wire bandwidth

Modified GasP for long wires

[Figure: two GasP stage schematics with pred, succ, and clk; the second adds delay elements; labels: delay, Vdd.]

SLIDE 17

  • Simulate long wires under GasP control
  • Use M5 wires on a TSMC 180nm logic process
  • Clearly see quadratic effects of long wires
  • Steps: added delays for extended drive pulses
  • Slow signaling rate
  • At 3.8mm, Tc = 1.6ns
  • Transactional control penalty damages BW

Optimizing wire bandwidth

Simulations of GasP

[Figure: cycle time in ns (y-axis, 0.5 to 3) versus wire length in mm (x-axis, 1 to 6); annotation: extended drive pulses.]

SLIDE 18

  • We can eliminate the ACK’s dead time
  • Key notion: Let datapath do work during the ACK
  • If we keep datapath busy, we double the bandwidth
  • Control drawn with two wires for simplicity
  • GasP uses a single wire driven by both ends

Optimizing wire bandwidth

Dual-path control GasP

[Figure: two latches on the data wire, with req and ack control wires; callout: Inputs.]

SLIDE 19


Optimizing wire bandwidth

Dual-path control GasP

[Figure: two latches on the data wire, with req and ack control wires; callout: Outputs.]

SLIDE 20


Optimizing wire bandwidth

Dual-path control GasP

[Figure: two latches on the data wire, with req and ack control wires; callout: Outputs.]

Outputs fire iff all inputs arrive

SLIDE 21

  • Dual, alternating control paths (top and bot)
  • When top is ACK-ing, bot is REQ-ing, & vice versa
  • But what does the bottom control path drive?

Optimizing wire bandwidth

Dual-path control GasP

[Figure: two latches on the data wire, with a top control path (req_top, ack_top) and a bottom control path (req_bot, ack_bot).]

SLIDE 22

  • Answer: we double the datapath latches
  • Latches are muxed so use a tristate output
  • Latch inputs are unconditionally latched by REQ

Optimizing wire bandwidth

Dual-path control GasP

[Figure: doubled datapath latches between the top (req_top, ack_top) and bottom (req_bot, ack_bot) control paths; latches carry clk and en inputs; labels: tristate output, unconditional latch.]

SLIDE 23

  • Not quite right: two paths must truly alternate
  • Otherwise one path’s data can clobber the other’s
  • So insert an alternation token between paths
  • Alternation path delay should match data delay

Optimizing wire bandwidth

Dual-path control GasP

[Figure: dual control paths (req_top, ack_top, req_bot, ack_bot) with doubled data latches and additional latches for the alternation token.]
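
The role of strict alternation can be sketched with a purely behavioral model: the sender deals tokens to the two paths in turn, and the receiver uses a one-bit alternation state to decide which path to read next. This is only an ordering model under that assumption, with hypothetical function names; it does not model GasP timing, latches, or the matched delay of the alternation path.

```python
from collections import deque


def send_dual_path(tokens):
    """Deal tokens alternately onto the top and bottom paths."""
    top, bot = deque(), deque()
    for i, tok in enumerate(tokens):
        (top if i % 2 == 0 else bot).append(tok)
    return top, bot


def receive_dual_path(top, bot):
    """Merge the two paths, reading strictly top, bot, top, bot, ..."""
    out = []
    use_top = True  # the alternation token: which path owns the next transfer
    while True:
        path = top if use_top else bot
        if not path:
            break  # the path whose turn it is has nothing more to deliver
        out.append(path.popleft())
        use_top = not use_top
    return out
```

With strict alternation, `receive_dual_path(*send_dual_path([1, 2, 3, 4, 5]))` returns the tokens in their original order. A receiver that took whichever path had data first could reorder tokens, which is the behavioral analogue of one path's data clobbering the other's.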

SLIDE 24

  • Recall we used an unconditional latch
  • Causes a critical path in the control
  • Data must flow through latch before control reaches the GasP stage

  • To fix this, delay the reset of the GasP stage
  • Same tweak we did earlier to drive long wires

Optimizing wire bandwidth

It’s slower for short wires

[Figure labels: latch, en]

SLIDE 25

  • Dual-path control has ~2x bandwidth gain
  • No extra delay at 1.6mm: already added for latches
  • Dual-path control hides wire effects up to ~4mm
  • Datapath best ~3.8mm; control best < 4mm

Optimizing wire bandwidth

Simulations of dual-path control GasP

[Figure: cycle time in ns (y-axis, 0.5 to 3) versus wire length in mm (x-axis, 1 to 6), for single control and dual-path control.]

SLIDE 26

  • Looks promising, but beware the NFL* principle
  • Area and power overhead
  • Added extra control wires and GasP module
  • Minor compared to a 64-bit datapath
  • Added extra latch for every datapath bit
  • Not a big deal if the wires span 4mm lengths
  • Reliability (noise) concern, dual path or not
  • Long pulse-encoded control wires are scary
  • We can always trade off area for reliability

*NFL = “no free lunch”

Optimizing wire bandwidth

What’s the cost?

SLIDE 27

  • Performance overhead
  • Only for very short wires
  • Bandwidth improvement for typical length wires
  • Complexity overhead
  • Very large: lots of manual SPICE simulation
  • Lack of CAD tools and flows complicates design
  • Restricts usage to homogeneous, regular wires
  • Design-once, use-many

Optimizing wire bandwidth

Other costs

SLIDE 28

  • Datapath wires consume lots of power
  • Minimizing transitions by coding: helps a little
  • One-hot encoding trades more area for less power
  • E.g., 8-bit bus goes from 4 to 1 avg transitions
  • Minimizing voltage swing: helps a lot
  • 1.8V to 0.1V saves 10x in power
  • Not 20x due to need for differential signaling
  • A lot more savings if we have a reduced supply

What about power?

Wire optimizations
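
The arithmetic behind these bullets can be checked with a short first-order sketch. Two assumptions are made here and are not stated on the slide: bus data is uniformly random, and a low-swing wire still draws its charge from the full supply, so energy per transition scales as C·Vdd·Vswing. The slide's "1 avg transition" for one-hot coding would follow if exactly one wire toggles per transfer (transition signaling), which is also an assumed reading. Function names are hypothetical.

```python
def avg_binary_transitions(bits=8):
    """Average number of wires that toggle between two uniformly random words."""
    n = 1 << bits
    total = sum(bin(a ^ b).count("1") for a in range(n) for b in range(n))
    return total / (n * n)


def swing_energy_ratio(vdd=1.8, vswing=0.1, wires_per_bit=2):
    """Full-swing energy per bit divided by low-swing energy per bit.

    Assumes equal wire capacitance and energy ~ C * Vdd * Vswing per wire;
    wires_per_bit=2 models differential signaling.
    """
    full_swing = vdd * vdd
    low_swing = wires_per_bit * vdd * vswing
    return full_swing / low_swing
```

`avg_binary_transitions(8)` evaluates to 4.0, matching the slide's figure for a binary 8-bit bus. `swing_energy_ratio()` comes out at about 9, and about 18 with `wires_per_bit=1`, which corresponds to the slide's rounded 10x and 20x. If the supply itself were reduced to the swing voltage, energy would scale as C·Vswing² instead, which is the larger saving mentioned in the last bullet.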

SLIDE 29

  • Reduced-swing data wires mandate data-bundling
  • Data must be amplified back to full logic levels
  • Amplification must be triggered
  • Flow-through amplifiers are inefficient
  • Lots of literature for on-chip low-swing signaling
  • Not much on doing it asynchronously
  • An active area of work for us

What about power?

Data-bundled protocols

SLIDE 30

  • Long wires are forcing us to rethink design issues
  • Motivate exploration of asynchronous repeaters
  • Latency: use well-known repeater analytic models
  • Provide lowest-latency datapath design
  • Bandwidth: dual-path control is promising
  • Reduces handshaking transactional penalty
  • Power: Perhaps the most important parameter?

Conclusions

SLIDE 31

Many thanks to:

  • Anonymous reviewers
  • Bill Coates
  • Jo Ebergen
  • Bob Proebsting
  • Ivan Sutherland
  • DARPA, HPCS contract NBCH30390002

Acknowledgments