CALTECH CS137 Winter2006 -- DeHon 1

CS137: Electronic Design Automation

Day 9: January 30, 2006

Parallel Prefix


Today

  • Bit-Level
    – Addition
    – LUT Cascades
  • For Sums
    – Applications
  • FSMs
  • SATADD
  • Data Forwarding
  • Pointer Jumping
    – Applications


Introduction / Reminder

Addition in Log Time


Ripple Carry Addition

  • Simple “definition” of addition
  • Serially resolve carry at each bit


CLA

  • Think about each adder bit as computing a function on the carry in
    – c[i]=g(c[i-1])
    – Particular function g will depend on a[i], b[i]
    – g=f(a[i],b[i])


Functions

  • What functions can g(c[i-1]) be?
    – g(x)=1 (Generate)
      • a[i]=b[i]=1
    – g(x)=x (Propagate)
      • a[i] xor b[i]=1
    – g(x)=0 (Squash)
      • a[i]=b[i]=0


Combining

  • Want to combine functions
    – Compute c[i]=g[i](g[i-1](c[i-2]))
    – Compute the composition of two functions
  • What functions will the composition of two of these functions be?
    – Same as before
      • Propagate, generate, squash


Compose Rules (LSB MSB)

Compose   Result
SS        S
SP        S
SG        G
PS        S
PP        P
PG        G
GS        S
GP        G
GG        G
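The compose rules can be checked mechanically. The following is a minimal sketch, not the lecture's own code: the names `classify`, `compose`, `apply_fn`, and `add_via_gps` are assumptions made here for illustration. It derives each bit's function from a[i] and b[i], composes LSB to MSB, and recovers the carries from the prefix compositions.

```python
def classify(a_bit, b_bit):
    """Per-bit carry function: G (generate), P (propagate) or S (squash)."""
    if a_bit and b_bit:
        return "G"
    if a_bit != b_bit:
        return "P"
    return "S"

def compose(lsb, msb):
    """Compose two adjacent carry functions; the MSB wins unless it propagates."""
    return lsb if msb == "P" else msb

def apply_fn(fn, carry_in):
    """Apply a G/P/S function to a carry-in bit."""
    return {"G": 1, "S": 0}.get(fn, carry_in)

def add_via_gps(a, b, width):
    """Add two unsigned ints using prefix compositions of G/P/S functions."""
    fns = [classify((a >> i) & 1, (b >> i) & 1) for i in range(width)]
    prefix, acc = [], "P"            # "P" acts as the identity function
    for fn in fns:                   # serial here; a tree yields the same prefixes
        acc = compose(acc, fn)
        prefix.append(acc)
    carries = [0] + [apply_fn(p, 0) for p in prefix]  # carries[i] feeds bit i
    total = carries[width] << width
    for i in range(width):
        total |= (((a >> i) ^ (b >> i) ^ carries[i]) & 1) << i
    return total

# Rebuild the compose table: key is LSB then MSB, value is the result.
TABLE = {x + y: compose(x, y) for x in "SPG" for y in "SPG"}
```

The loop in `add_via_gps` is serial only for brevity; because `compose` is associative, the same prefixes can be produced by a log-depth tree.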


Combining

  • Do it again…
  • Combine g[i-3,i-2] and g[i-1,i]
  • What do we get?


Reduce Tree


Associative Reduce → Prefix

  • Shows us how to compute the Nth value in O(log(N)) time
  • Can actually produce all intermediate values in this time
    – w/ only a constant factor more hardware


Prefix Tree


Parallel Prefix

  • Important Pattern
  • Applicable any time operation is associative
  • Function Composition is always associative
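As a rough illustration of the pattern, here is a sketch of a tree-structured prefix computation for an arbitrary associative operator. The function name `parallel_prefix` and the recursive formulation are assumptions chosen here; the recursion combines adjacent pairs on the way up and fills in the even positions on the way down, using O(N) applications of `op` over O(log(N)) levels.

```python
def parallel_prefix(xs, op):
    """Inclusive prefix of xs under an associative (not necessarily
    commutative) binary operator op, computed in tree fashion."""
    xs = list(xs)
    if len(xs) <= 1:
        return xs
    # Up the tree: combine adjacent pairs (independent, so parallel in hardware).
    pairs = [op(xs[k], xs[k + 1]) for k in range(0, len(xs) - 1, 2)]
    sub = parallel_prefix(pairs, op)     # sub[j] covers xs[0 .. 2j+1]
    # Down the tree: odd positions are already done, even ones need one more op.
    out = [xs[0]]
    for k in range(1, len(xs)):
        if k % 2 == 1:
            out.append(sub[k // 2])
        else:
            out.append(op(sub[k // 2 - 1], xs[k]))
    return out

sums = parallel_prefix([1, 2, 3, 4, 5], lambda x, y: x + y)
words = parallel_prefix(list("abcde"), lambda x, y: x + y)
```

String concatenation is used in the second call because it is associative but not commutative, which exercises the left-to-right ordering.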


Generalizing

LUT Cascade


Cascaded LUT Delay Model

  • Tcascade = T(3LUT) + T(mux)
  • Don’t pay
    – General interconnect
    – Full 4-LUT delay


Parallel Prefix LUT Cascade?

  • Can we do better than N×Tmux?
  • Can we compute LUT cascade in O(log(N)) time?
  • Can we compute mux cascade using parallel prefix?
  • Can we make mux cascade associative?

Parallel Prefix Mux cascade

  • How can mux transform S → mux-out?
    – A=0, B=0 → mux-out=0     Stop = S
    – A=1, B=1 → mux-out=1     Generate = G
    – A=0, B=1 → mux-out=S     Buffer = B
    – A=1, B=0 → mux-out=/S    Invert = I

CALTECH CS137 Winter2006 -- DeHon 21

Parallel Prefix Mux cascade

  • How can 2 muxes transform input?
  • Can I compute 2-mux transforms from 1-mux transforms?

CALTECH CS137 Winter2006 -- DeHon 22

Two-mux transforms

  • SS → S
  • SG → G
  • SB → S
  • SI → G
  • GS → S
  • GG → G
  • GB → G
  • GI → S
  • BS → S
  • BG → G
  • BB → B
  • BI → I
  • IS → S
  • IG → G
  • IB → I
  • II → B
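The table above can be reproduced by treating each one-mux transform as a function on the select bit and composing. This is a sketch under assumptions made here: the mux outputs A when the select is 0 and B when it is 1, and the names `FUNCS`, `name_of`, `compose`, and `classify_mux` are illustrative only.

```python
# Each one-mux transform as a function of the incoming select bit s.
FUNCS = {
    "S": lambda s: 0,        # Stop
    "G": lambda s: 1,        # Generate
    "B": lambda s: s,        # Buffer
    "I": lambda s: 1 - s,    # Invert
}

def name_of(fn):
    """Identify a one-bit function by its truth table (fn(0), fn(1))."""
    return {(0, 0): "S", (1, 1): "G", (0, 1): "B", (1, 0): "I"}[(fn(0), fn(1))]

def compose(first, second):
    """Transform of two cascaded muxes: first feeds the select of second."""
    return name_of(lambda s: FUNCS[second](FUNCS[first](s)))

def classify_mux(a, b):
    """Transform of a single mux with data inputs A (s=0) and B (s=1)."""
    return name_of(lambda s: b if s else a)

# Key is first mux then second mux, value is the combined transform.
TWO_MUX = {x + y: compose(x, y) for x in "SGBI" for y in "SGBI"}
```

Since the four transforms are closed under `compose` and function composition is associative, a cascade of N muxes can be reduced with the same tree structure used for carries.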


Generalizing mux-cascade

  • How can N muxes transform the input?
  • Is mux transform composition associative?


Associative Reduce Mux-Cascade

Can be hardwired, no general interconnect


For Sums


Prefix Sum

  • Common Operation:
    – Want B[x] such that B[x]=A[0]+A[1]+…+A[x]
    – For I=0 to x
      • B[I]=B[I-1]+A[I] (with B[-1]=0)


Prefix Sum

  • Compute in tree fashion
    – A[I]+A[I+1]
    – A[I]+A[I+1]+A[I+2]+A[I+3]
    – …
  • Combine partial sums back down tree
    – S(0:7)+S(8:9)+S(10)=S(0:10)
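The S(0:7)+S(8:9)+S(10) example follows from the binary representation of the prefix length. The sketch below is illustrative only; `tree_blocks` and `prefix_from_blocks` are names assumed here. It lists the aligned power-of-two tree nodes whose partial sums combine into S(0:x).

```python
def tree_blocks(x):
    """Aligned power-of-two blocks (lo, hi) whose sums combine into S(0:x)."""
    blocks, start, n = [], 0, x + 1          # n = number of terms in the prefix
    for bit in reversed(range(n.bit_length())):
        size = 1 << bit
        if n & size:                         # one tree node per set bit of n
            blocks.append((start, start + size - 1))
            start += size
    return blocks

def prefix_from_blocks(a, x):
    """Prefix sum S(0:x) assembled from the tree's partial sums."""
    return sum(sum(a[lo:hi + 1]) for lo, hi in tree_blocks(x))

example = tree_blocks(10)   # 11 terms = 8 + 2 + 1
```

Each prefix needs at most one block per tree level, which is why combining back down the tree adds only O(log(N)) depth.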


Other simple operators

  • Prefix-OR
  • Prefix-AND
  • Prefix-MAX
  • Prefix-MIN


Find-First One

  • Useful for arbitration
    – Finds first (highest-priority) requester
    – Also magnitude finding in numbers
  • How:
    – Prefix-OR
    – Locally compute X[I-1]^X[I]
    – Flags the first one
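A small sketch of the two steps, assuming X is the prefix-OR result and that X[-1] is taken to be 0. `itertools.accumulate` is a serial stand-in for the log-depth prefix network, and the function name `find_first_one` is an assumption made here.

```python
from itertools import accumulate
from operator import or_

def find_first_one(bits):
    """One-hot vector flagging the first 1 in bits (all zeros if none)."""
    x = list(accumulate(bits, or_))      # prefix-OR: 1 from the first 1 onward
    prev = [0] + x[:-1]                  # X[I-1], with X[-1] taken as 0
    return [p ^ c for p, c in zip(prev, x)]

flags = find_first_one([0, 0, 1, 0, 1, 1])
```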


Arbitration

  • Often want to find first M requesters
    – E.g. Assign unique memory ports to first M processors requesting
  • Prefix-sum across all potential requesters
  • Counts requesters, giving unique number to each
  • Know if one of first M
    – Perhaps which resource assigned
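The scheme can be sketched as follows, assuming requests are 0/1 flags in priority order and resources are numbered from 0. The function name `arbitrate` is hypothetical, and `itertools.accumulate` again stands in for the parallel prefix-sum.

```python
from itertools import accumulate

def arbitrate(requests, m):
    """Give resource numbers 0..m-1 to the first m requesters, None to the rest."""
    counts = list(accumulate(requests))          # inclusive prefix-sum of flags
    grants = []
    for req, count in zip(requests, counts):
        # A requester's count is its 1-based position among all requesters.
        grants.append(count - 1 if req and count <= m else None)
    return grants

grants = arbitrate([0, 1, 1, 0, 1, 1], 2)
```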


Partitioning

  • Use something to order
    – E.g. spectral linear ordering
    – …or 1D cellular swap to produce linear order
  • Parallel prefix on area of units
    – If not all same area
  • Know where the midpoint is


Channel Width

  • Prefix sum on delta wires at each node
    – To compute net channel widths at all points along channel
    – E.g. 1D ordered
      • Maybe use with cellular placement scheme


Rank Finding

  • Looking for I’th ordered element
  • Do a prefix-sum on high-bit only
    – Know m=number of things > 01111111…
  • High-low search on result
    – I.e. if number > I, recurse on half with leading zero
    – If number < I, search for (I-m)’th element in half with high-bit true
  • Find median in log²(N) time
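One consistent way to read the search is sketched below. The conventions are assumptions made here, not taken from the slide: I is 0-indexed from the smallest element, values are unsigned and fit in `width` bits, and the count used at each step is the number of remaining candidates whose current bit is 0. The name `select` is also hypothetical.

```python
def select(values, i, width):
    """Return the i-th smallest (0-indexed) of unsigned ints below 2**width."""
    candidates = list(values)
    for bit in reversed(range(width)):           # high bit first
        flags = [(v >> bit) & 1 for v in candidates]
        # In hardware this count would come from a prefix-sum over the flags.
        zeros = len(candidates) - sum(flags)
        if i < zeros:                            # answer has this bit clear
            candidates = [v for v, f in zip(candidates, flags) if not f]
        else:                                    # answer has this bit set
            i -= zeros
            candidates = [v for v, f in zip(candidates, flags) if f]
    return candidates[0]                         # all survivors are equal

median = select([5, 12, 7, 3, 9, 12, 0, 8], 4, 4)
```

Each of the `width` rounds costs one O(log(N)) prefix-sum, so with width on the order of log(N) the total is O(log²(N)).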


FA/FSM Evaluation

(regular expression recognition)


Finite Automata

  • Machine has finite state: S
  • On each cycle
    – Input I
    – Compute output and new state
      • Based on inputs and current state
      • O_i, S_{i+1} = f(S_i, I_i)
  • Intuitively, a sequential process
    – Must know previous state to compute next
    – Must know state to compute output


Function Specialization

  • But, this is just functions
    – …and function composition is associative
  • Given that we know input sequence:
    – I_0, I_1, I_2, …
  • Can compute specialized functions:
    – f_i(s) = f(s, I_i)
  • What is f_i(s)?
    – Worst-case, a translation table:
      • S=0 → NS_0, S=1 → NS_1, …

Function Composition

  • Now: O_{i+m}, S_{i+m+1} = f_{i+m}(f_{i+m-1}(f_{i+m-2}(…f_i(S_i))))
  • Can we compute the function composition?
    – f_{i+1,i}(s) = f_{i+1}(f_i(s))
    – What is f_{i+1,i}(s)?
      • A translation table just like f_i(s) and f_{i+1}(s)
      • Table of size |S|, can fill in in O(|S|) time


Recursive Function Composition

  • Now: O_{i+m}, S_{i+m+1} = f_{i+m}(f_{i+m-1}(f_{i+m-2}(…f_i(S_i))))
  • We can compute the composition
    – f_{i+1,i}(s) = f_{i+1}(f_i(s))
  • Repeat to compute
    – f_{i+3,i}(s) = f_{i+3,i+2}(f_{i+1,i}(s))
    – Etc. until have computed f_{i+m,i}(s) in O(log(m)) steps
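The recursive composition can be sketched with translation tables stored as tuples. Everything named here is an assumption for illustration: `specialize`, `compose`, `reduce_tree`, `run_parallel`, and the example machine `DELTA`, which tracks the value mod 3 of a binary number read MSB first. The input sequence is assumed to be non-empty.

```python
def specialize(delta, symbol):
    """Translation table f_i: next state for every current state, given I_i."""
    return tuple(row[symbol] for row in delta)

def compose(first, second):
    """Table for applying first, then second; filled in O(|S|) time."""
    return tuple(second[s] for s in first)

def reduce_tree(tables):
    """Compose a non-empty list of tables pairwise; returns (table, levels)."""
    level, steps = list(tables), 0
    while len(level) > 1:
        nxt = [compose(level[k], level[k + 1])
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])        # odd element passes up unchanged
        level, steps = nxt, steps + 1
    return level[0], steps

def run_parallel(delta, inputs, start):
    """Final state after all inputs, plus the number of tree levels used."""
    table, steps = reduce_tree([specialize(delta, sym) for sym in inputs])
    return table[start], steps

# Example machine: state = (binary number read so far) mod 3.
DELTA = [[(2 * s + b) % 3 for b in (0, 1)] for s in range(3)]
final_state, levels = run_parallel(DELTA, [1, 0, 1, 1, 0], 0)   # 0b10110 = 22
```

Only the reduction to the final state is shown; producing every intermediate state would use the same operator in a prefix tree.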


Implications

  • If can get input stream,
    – Any FA can be evaluated in O(log(N)) time
    – Regular Expression recognition in O(log(N))
  • Any streaming operator with finite state
    – Where the input stream is independent of the output stream
    – Can be run arbitrarily fast by using parallel-prefix on FSM evaluation


Saturated Addition

  • S_{i+1} = max(min(I_i+S_i, maxval), minval)
  • Could model as FSM with:
    – |S|=maxval-minval
  • So, in theory, FSM result applies
  • …but |S| might be 2^16, 2^24


SATADD Composition

  • Can compute composition efficiently

[Papadantonakis et al. FPT2005]
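One way to see that the composition stays compact is to represent each saturating-add step as a triple (add, lo, hi) meaning s → max(min(s+add, hi), lo). The sketch below is an assumption-laden illustration and is not claimed to be the formulation used in the cited paper; the names `clamp`, `apply_fn`, and `compose` are hypothetical.

```python
def clamp(x, lo, hi):
    """Saturate x into [lo, hi]; assumes lo <= hi."""
    return max(min(x, hi), lo)

def apply_fn(fn, s):
    """Apply a saturating-add function (add, lo, hi) to state s."""
    add, lo, hi = fn
    return clamp(s + add, lo, hi)

def compose(first, second):
    """Single triple equivalent to applying first, then second."""
    a1, l1, h1 = first
    a2, l2, h2 = second
    # Shift first's bounds by second's addend, then saturate them again.
    return (a1 + a2, clamp(l1 + a2, l2, h2), clamp(h1 + a2, l2, h2))

# Two steps with minval=0, maxval=5: add 4, then add -3.
both = compose((4, 0, 5), (-3, 0, 5))
```

The composed function needs three numbers regardless of how many steps it covers, rather than a table with one entry per state.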


SATADD Reduce Tree


Data Forwarding

UltraScalar, from Henry, Kuszmaul, et al. (ARVLSI’99, SPAA’99, ISCA’00)


Consider Machine

  • Each FU has a full RF
    – FU=Functional Unit
    – RF=Register File
  • Build network between FUs
    – use network to connect producers and consumers
    – use register names to configure interconnect
  • Signal data ready along network


Ultrascalar: concept model


Ultrascalar Concept

  • Linear delay
  • O(1) register cost / FU
  • Complete renaming at each FU
    – different set of registers
    – so when say complete RF at each FU, that’s only the logical registers


Ultrascalar: cyclic prefix


Parallel Prefix

  • Basic idea is one we saw with adders
  • An FU will either
    – produce a register (generate)
    – or transmit a register (propagate)
    – can do tree combining
      • pair of FUs will either both propagate or will generate
      • compute function by pair in one stage
      • recurse to next stage
      • get log-depth tree network connecting producer and consumer
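For a single register, the combining rule can be sketched as below. This is a simplified model assumed here, not the Ultrascalar design itself: each FU in program order either writes a value or writes nothing (`None`, meaning propagate), and the names `combine` and `values_seen` are hypothetical.

```python
def combine(earlier, later):
    """The later FU's write wins; None means propagate whatever arrived."""
    return earlier if later is None else later

def values_seen(initial, writes):
    """Value of one register at the input of each FU, in program order."""
    seen, current = [], initial
    for w in writes:             # serial stand-in for the log-depth network
        seen.append(current)
        current = combine(current, w)
    return seen

seen = values_seen(10, [None, 7, None, None, 3, None])
```

`combine` has the same shape as the adder's compose rule (the later element wins unless it propagates), so it is associative and fits the same tree.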


Ultrascalar: cyclic prefix


Pointer Jumping


Pointer Jumping Motivation

  • Have a tree
    – E.g. is-a relationship tree in NETL
  • Want to know if a node is of a particular type (is-a mammal)
  • How long to find out?
    – Naïve: O(distance)
      • Spread one level per timestep


Following Pointer Chain

  • Naïve: spread/color from target node
    – On each step push down to children
  • Most nodes idle
    – Only active on the step something arrives
  • Can the idle nodes do something to accelerate?


Jumping Intermediates

  • Add notion of transitive parent
  • Initially: transitive-parent=parent
  • On each step:
    – If my transitive-parent marked
      • Mark self
    – else
      • Transitive-parent = transitive-parent(transitive-parent)


How Much Jumping?

  • On each step:
    – If my transitive-parent marked
      • Mark self
    – else
      • Transitive-parent = transitive-parent(transitive-parent)
  • How many such steps?
    – O(log(distance))
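The step rule can be simulated directly. The sketch below assumes a tree given as a `parent` list in which the root points to itself, all nodes updating synchronously from the previous step's values, and the target node starting out marked; the function name `mark_descendants` is an assumption made here.

```python
def mark_descendants(parent, target):
    """Mark every node under target by pointer jumping.
    Returns (marked flags, number of steps that changed anything)."""
    n = len(parent)
    tp = list(parent)                        # transitive-parent = parent
    marked = [v == target for v in range(n)]
    steps = 0
    while True:
        new_marked, new_tp = list(marked), list(tp)
        for v in range(n):                   # all nodes act in parallel
            if marked[v]:
                continue
            if marked[tp[v]]:
                new_marked[v] = True         # transitive-parent marked: mark self
            else:
                new_tp[v] = tp[tp[v]]        # otherwise jump one pointer further
        if new_marked == marked and new_tp == tp:
            return marked, steps
        marked, tp = new_marked, new_tp
        steps += 1

# A chain 0 <- 1 <- ... <- 8 rooted at 0: the farthest node is 8 levels down.
chain_marks, chain_steps = mark_descendants([0, 0, 1, 2, 3, 4, 5, 6, 7], 0)
```

On this chain the naïve spread needs 8 steps to reach the last node, while the jumping version finishes in 4.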


Pointer Jumping

  • Same basic idea as data forwarding
  • Can find length of a list in O(log(length)) time
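A sketch of the list-length case using the standard pointer-jumping formulation of list ranking: each node keeps a count of links to the end and a pointer that doubles its reach on every step. The representation (a `nxt` list with `None` at the tail) and the name `list_ranks` are assumptions made here.

```python
def list_ranks(nxt):
    """Number of links from each node to the tail, via pointer jumping.
    Returns (ranks, number of parallel steps)."""
    n = len(nxt)
    nxt = list(nxt)
    rank = [0 if nxt[i] is None else 1 for i in range(n)]
    steps = 0
    while any(p is not None for p in nxt):
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):                   # all nodes act in parallel
            if nxt[i] is not None:
                new_rank[i] = rank[i] + rank[nxt[i]]   # absorb successor's count
                new_nxt[i] = nxt[nxt[i]]               # jump over the successor
        rank, nxt = new_rank, new_nxt
        steps += 1
    return rank, steps

# List 0 -> 1 -> ... -> 7; its length is the head's rank plus one.
ranks, jumps = list_ranks([1, 2, 3, 4, 5, 6, 7, None])
length = ranks[0] + 1
```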


Variations


Segmented Parallel Prefix

  • f_i() can ignore its input
    – …or the function can let special I’s tell it to reset the state
  • E.g. build huge/hardwired carry chain hardware and configurably break into separate adders (LUT cascades)
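A common way to express segmentation is to carry a start-of-segment flag alongside each value, so the operator itself decides when to ignore what came before. The sketch below uses addition as the underlying operation; the names `seg_op` and `segmented_prefix_sum` are assumptions made here, and the loop is a serial stand-in for the prefix tree.

```python
def seg_op(left, right):
    """Combine (starts_segment, value) pairs; a start flag on the right
    discards everything accumulated on the left."""
    lflag, lval = left
    rflag, rval = right
    return (lflag or rflag, rval if rflag else lval + rval)

def segmented_prefix_sum(values, starts):
    """Running sums that reset wherever starts[i] is True."""
    out, acc = [], None
    for pair in zip(starts, values):
        acc = pair if acc is None else seg_op(acc, pair)
        out.append(acc[1])
    return out

sums = segmented_prefix_sum(
    [1, 2, 3, 4, 5, 6],
    [True, False, False, True, False, True],
)
```

Because `seg_op` is still associative, one hardwired tree can serve any configuration of segment boundaries.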


Cyclic Segmented Parallel Prefix

  • Wrap output back to input
  • Configurable segmentation defines the starting/stopping point
  • E.g.
    – In Ultrascalar data forwarding
      • Leave data in place and use FUs in FIFO fashion, redefining the “head” at each cycle
    – Priority allocation scheme
      • Mark priority item as start of segment
        – Perhaps choose randomly (e.g. hardware router)


Admin

  • Class Wed.
  • Baseline due Friday

Big Ideas

  • Any associative operation can be made parallel
    – Performed in log(N) time with O(N) hardware
  • Any Finite Automata computation can be accelerated with parallelism
    – (FA evaluation ⊂ NC)
  • Function composition is associative
    – all functional operations can be associative