Compositional Dataflow Circuits. Stephen A. Edwards, Richard Townsend. (Slide transcript.)



SLIDE 1

Compositional Dataflow Circuits

Stephen A. Edwards Richard Townsend Martha A. Kim

Columbia University

MEMOCODE, Vienna, Austria, October 1, 2017

SLIDE 2

gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
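The recursive definition can be transcribed directly as software. A minimal Python sketch of the same subtraction-only algorithm the talk compiles to hardware:

```python
def gcd(a: int, b: int) -> int:
    """Subtraction-only GCD, mirroring the slide's recursive definition."""
    if a == b:
        return a
    elif a < b:
        return gcd(a, b - a)      # reduce the larger operand, b
    else:
        return gcd(a - b, b)      # reduce the larger operand, a

print(gcd(100, 2))  # -> 2
```

The rest of the talk turns exactly this control and data flow into a circuit of dataflow blocks.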

SLIDE 3

gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)

[Diagram: input streams a and b]

SLIDE 4

gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)

[Diagram: a and b each enter through a mux (ports 1/0); an initial token on each mux select admits the first inputs; an = block compares the values]
SLIDE 5

gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)

[Diagram: a fork duplicates the compared values; the = result steers a pair of demuxes (ports 1/0): on equality one value leaves as gcd(a, b) and the other is discarded]
SLIDE 6

gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)

[Diagram: a < block is added to compare a and b on the not-equal path]
SLIDE 7

gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)

[Diagram: the < result steers a second pair of demuxes (ports 1/0) to select which operand will be reduced]
SLIDE 8

gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)

[Diagram: a pair of muxes (ports 1/0) routes the demux outputs back toward the input muxes, closing the loops]
SLIDE 9

gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)

Townsend et al., CC 2017

[Diagram: the complete gcd circuit: input muxes with initial tokens, fork, = and < comparators, demuxes, discard, two subtractors computing b − a and a − b on the feedback paths, and the gcd(a, b) output]
SLIDE 10

Patience Through Handshaking

Want patient blocks to handle delays from:

  • Memory systems
  • Data-dependent computations
  • Full buffers
  • Shared resources
  • Busy computational units

SLIDE 11

Patience Through Handshaking

Want patient blocks to handle delays from:

  • Memory systems
  • Data-dependent computations
  • Full buffers
  • Shared resources
  • Busy computational units

[Diagram: data and valid flow from the upstream block to the downstream block; ready flows back upstream]

valid  ready  Meaning
  1      1    Token transferred
  1      0    Token valid; held
  0      -    No token to transfer

Related: latency-insensitive design (Carloni et al.), elastic circuits (Cortadella et al.), FIFOs with backpressure
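The table's three cases can be captured in a few lines. A behavioral sketch (not from the paper; the function name is mine):

```python
def handshake(valid: bool, ready: bool) -> str:
    """Classify one cycle of a valid/ready handshake, per the table above."""
    if valid and ready:
        return "token transferred"
    if valid:
        return "token valid; held"    # upstream must hold the data stable
    return "no token to transfer"     # ready is a don't-care here
```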

SLIDE 12

Combinational Function Block

Strict/unit rate: all input tokens are required to produce an output

[Diagram: block computing f, with inputs in0 and in1 and output out]

Datapath: the combinational function ignores flow control

SLIDE 13

Combinational Function Block

Strict/unit rate: all input tokens are required to produce an output

[Diagram: block computing f, with inputs in0 and in1 and output out]

Valid network: output valid if both inputs are valid

SLIDE 14

Combinational Function Block

Strict/unit rate: all input tokens are required to produce an output

[Diagram: block computing f, with inputs in0 and in1 and output out]

Ready network: input tokens are consumed if the output token is consumed (output is valid and ready)
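Combining the three slides, the flow control of a two-input unit-rate block reduces to two equations. A Python sketch (signal names are mine):

```python
def unit_rate_handshake(in0_valid, in1_valid, out_ready):
    """Flow control for a strict two-input combinational block.

    Valid network: the output is valid only when both inputs are.
    Ready network: inputs are consumed only when the output token is
    consumed, i.e., the output is both valid and ready.
    """
    out_valid = in0_valid and in1_valid
    consumed = out_valid and out_ready
    return out_valid, consumed, consumed   # (out_valid, in0_ready, in1_ready)
```

Note that out_valid never looks at out_ready, a discipline the talk makes explicit later.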

SLIDE 15

Multiplexer Block

[Diagram: mux with data inputs in0, in1, in2, a select input, and output out; a decoder on select drives the per-input handshakes]

SLIDE 16

Demultiplexer Block

[Diagram: demux with input in, a select input, and outputs out0, out1, out2; a decoder on select steers the token to one output]
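A behavioral sketch of the mux's handshaking (my own Python model, not the paper's circuit; the demux is the dual, steering its single input to out[sel]):

```python
def mux_handshake(sel_valid, sel, ins_valid, out_ready):
    """Dataflow mux: a valid select token admits a token from input in[sel].

    The decoder asserts ready only on the selected input, and only when
    the output token is actually transferred; the select token is
    consumed in the same cycle.
    """
    out_valid = sel_valid and ins_valid[sel]
    fire = out_valid and out_ready                        # token moves now
    ins_ready = [fire and i == sel for i in range(len(ins_valid))]
    sel_ready = fire
    return out_valid, ins_ready, sel_ready
```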

SLIDE 17

Buffering a Linear Pipeline (Point 1/4)

Combinational block

SLIDE 18

Buffering a Linear Pipeline (Point 1/4)

Long Combinational Path (Data + Valid)

SLIDE 19

Buffering a Linear Pipeline (Point 1/4)

Data buffer: Pipeline register with valid, enable


SLIDE 20

Buffering a Linear Pipeline (Point 1/4)


Long Combinational Path (Ready)

SLIDE 21

Buffering a Linear Pipeline (Point 1/4)

Control buffer: a register diverts the in-flight token when downstream suddenly stops

Cao et al., MEMOCODE 2015. Inspired by Carloni's latency-insensitive design (e.g., MEMOCODE 2007)
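Behaviorally, a data buffer plus a control buffer acts like a two-entry FIFO whose input ready depends only on registered occupancy, so neither the forward (data + valid) nor the backward (ready) combinational path crosses it. A cycle-level Python sketch of that behavior (my own model, not the paper's RTL):

```python
class BufferPair:
    """Two-entry elastic buffer: breaks both the valid and ready paths."""

    def __init__(self):
        self.slots = []                        # at most two buffered tokens

    def in_ready(self):
        return len(self.slots) < 2             # function of registered state only

    def cycle(self, in_valid, in_data, out_ready):
        """Advance one clock cycle; returns (out_valid, out_data)."""
        out_valid = bool(self.slots)
        out_data = self.slots[0] if out_valid else None
        accept = in_valid and self.in_ready()  # sampled at cycle start
        if out_valid and out_ready:
            self.slots.pop(0)                  # downstream took a token
        if accept:
            self.slots.append(in_data)         # latch the incoming token
        return out_valid, out_data
```

The second slot plays the control buffer's role: it catches the token already in flight when downstream suddenly stops.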

SLIDE 22

The Problem with Fork

Combinational block: inputs ready when both valid and output ready
SLIDE 23

The Problem with Fork

Combinational block: inputs ready when both valid and output ready
SLIDE 24

The Problem with Fork

Fork: outputs valid only when all are ready
SLIDE 25

The Problem with Fork

Fork: outputs valid only when all are ready
SLIDE 26

The Problem with Fork

Fork: outputs valid only when all are ready. Oops: combinational cycle! This is not compositional.
SLIDE 27

The Solution to Combinational Loops (Point 2/4)

[Diagram: blocks connected by valid and ready handshake wires]
SLIDE 28

The Solution to Combinational Loops (Point 2/4)

[Diagram: blocks connected by valid and ready handshake wires]
SLIDE 29

The Solution to Combinational Loops (Point 2/4)

Allowed: combinational paths from valid to ready
SLIDE 30

The Solution to Combinational Loops (Point 2/4)

Allowed: combinational paths from valid to ready

Prohibited: combinational paths from ready to valid
SLIDE 31

The Solution to Fork: A Little State (Point 3/4)

[Diagram: fork with input in and outputs out0, out1, out2]

Valid out ignores ready of other outputs
SLIDE 32

The Solution to Fork: A Little State (Point 3/4)

[Diagram: fork with input in and outputs out0, out1, out2]

Valid out ignores ready of other outputs

Flip-flop, set after a token is sent, suppresses duplicates
SLIDE 33

The Solution to Fork: A Little State (Point 3/4)

[Diagram: fork with input in and outputs out0, out1, out2]

Valid out ignores ready of other outputs

Flip-flop, set after a token is sent, suppresses duplicates

Input consumed once one token has been sent on every output
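The three slides together describe a fork whose outputs fire independently. A Python sketch of that state machine (my own model; one "sent" flip-flop per output):

```python
class Fork:
    """Fork whose outputs race ahead: each output's valid ignores the
    other outputs' ready, and a per-output flip-flop suppresses
    duplicate tokens until every consumer has taken its copy."""

    def __init__(self, n_outputs):
        self.sent = [False] * n_outputs        # flip-flop per output

    def cycle(self, in_valid, outs_ready):
        """Returns (outs_valid, in_ready) for this cycle."""
        outs_valid = [in_valid and not s for s in self.sent]
        fired = [v and r for v, r in zip(outs_valid, outs_ready)]
        done = [s or f for s, f in zip(self.sent, fired)]
        in_ready = all(done)                   # consume input once all sent
        self.sent = [False] * len(done) if in_ready else done
        return outs_valid, in_ready
```

No output's valid depends on any ready, so composing this fork with combinational blocks cannot create the cycle of the previous slides.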

SLIDE 34

Nondeterministic Merge (Point 4/4)

Share one f with a merge/demux pair

[Diagram: three instances of f collapse to a single f; a merge funnels requests into f, and its select output steers a demux that returns each result to its requester]
SLIDE 35

Two-Way Nondeterministic Merge Block w/ Select

[Diagram: data inputs in0 and in1, data output out, 1-bit select output sel, arbiter]

“Two-way fork with multiplexed output selected by an arbiter”
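A behavioral sketch of this merge (my own model): it uses a fixed-priority arbiter in place of whatever arbitration the hardware uses, and for simplicity requires both outputs ready in the same cycle rather than the fork-style independent firing the quote describes.

```python
def merge_handshake(ins_valid, ins_data, out_ready, sel_ready):
    """Two-way nondeterministic merge with a select output.

    The arbiter (here: fixed priority, input 0 first) picks one valid
    input, forwards its token on out, and reports the winner on sel so
    a downstream demux can route the result back to its requester.
    """
    winner = next((i for i, v in enumerate(ins_valid) if v), None)
    out_valid = winner is not None
    fire = out_valid and out_ready and sel_ready    # both outputs accepted
    ins_ready = [fire and i == winner for i in range(len(ins_valid))]
    out_data = ins_data[winner] if out_valid else None
    return out_valid, out_data, winner, ins_ready
```

As with the other blocks, out_valid depends only on input valids, never on any ready.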

SLIDE 36

Experiments: Random Buffer Placement

[Plots: completion time (µs) vs. number of buffer pairs (2 to 10) for GCD(100,2) (7 buffers), 21-way Conveyor (80 buffers), and BSN (96 buffers)]

SLIDE 37

Best Buffering for GCD (Manually Obtained)

Each loop has one of each buffer type: a data buffer and a control buffer

SLIDE 38

Summary

Compositional dataflow networks as an IR: patient dataflow blocks with valid/ready handshaking

  • 1. Break downstream and upstream paths with two buffer types
  • 2. Avoid combinational cycles: prohibit ready-to-valid paths
  • 3. Add one state bit per output so forks may “race ahead”
  • 4. Tame nondeterministic merge with a select output

Random buffer placement experiments show it works