SLIDE 1
Compositional Dataflow Circuits
Stephen A. Edwards, Richard Townsend, Martha A. Kim
Columbia University
MEMOCODE, Vienna, Austria, October 1, 2017
gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
SLIDE 2
SLIDE 3
gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
Figure: circuit inputs a, b
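As a software reference point for the circuit, the recursive definition above can be sketched as a loop. This is only an illustrative model, assuming positive integer inputs; the name `gcd_subtract` is not from the slides.

```python
def gcd_subtract(a, b):
    """Subtraction-based gcd, mirroring the three cases of the definition.

    Assumes a and b are positive integers (with 0 the loop would not terminate).
    """
    while a != b:          # case "a = b" ends the recursion and returns a
        if a < b:
            b = b - a      # case gcd(a, b - a)
        else:
            a = a - b      # case gcd(a - b, b)
    return a
```

Each loop iteration corresponds to one trip of the (a, b) token pair around the feedback loop in the circuit.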
SLIDE 4
gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
Figure: gcd dataflow circuit, partially built. Labels: a, b, =, mux (ports 1 and 0), initial token (1)
SLIDE 5
gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
Figure: gcd dataflow circuit, partially built. Labels: a, b, =, fork, demux (ports 1 and 0), output gcd(a,b), discard, 1
SLIDE 6
gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
Figure: gcd dataflow circuit, partially built. Labels: a, b, =, <, ports 1 and 0, output gcd(a,b), discard, 1
SLIDE 7
gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
Figure: gcd dataflow circuit, partially built. Labels: a, b, =, <, −, ports 1 and 0, output gcd(a,b), discard, 1
SLIDE 8
gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
Figure: gcd dataflow circuit, partially built. Labels: a, b, =, <, −, ports 1 and 0, output gcd(a,b), discard, 1
SLIDE 9
gcd(a, b) = if a = b then a else if a < b then gcd(a, b − a) else gcd(a − b, b)
Townsend et al., CC 2017
Figure: gcd dataflow circuit. Labels: a, b, =, <, − (two subtractors), ports 1 and 0, output gcd(a,b), discard, 1
SLIDE 10
Patience Through Handshaking
Want patient blocks to handle delays from
- Memory systems
- Data-dependent computations
- Full buffers
- Shared resources
- Busy computational units
SLIDE 11
Patience Through Handshaking
Want patient blocks to handle delays from
- Memory systems
- Data-dependent computations
- Full buffers
- Shared resources
- Busy computational units
Figure: upstream block connected to downstream block by data, valid, and ready signals
valid  ready  Meaning
1      1      Token transferred
1      0      Token valid; held
0      −      No token to transfer
Related approaches:
- Latency-insensitive Design (Carloni et al.)
- Elastic Circuits (Cortadella et al.)
- FIFOs with backpressure
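The handshake table can be exercised with a small cycle-by-cycle model. This is a sketch only: `handshake` and `run_channel` are hypothetical names, and the model assumes the upstream block asserts valid whenever it still has a token to send.

```python
def handshake(valid, ready):
    """Classify one clock cycle on a channel, following the table above."""
    if valid and ready:
        return "Token transferred"
    if valid:
        return "Token valid; held"
    return "No token to transfer"   # ready is a don't-care here


def run_channel(tokens, ready_pattern):
    """Send tokens across one channel; ready_pattern is downstream's ready per cycle."""
    pending = list(tokens)
    received = []
    for ready in ready_pattern:
        valid = bool(pending)            # upstream offers a token if it has one
        if valid and ready:              # transfer only when both are high
            received.append(pending.pop(0))
    return received
```

A token that is valid but not accepted simply stays at the head of `pending`, which is the "patience" the slide asks for.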
SLIDE 12
Combinational Function Block
Strict/Unit Rate: All input tokens required to produce an output
Figure: block f with inputs in0, in1 and output out
Datapath: Combinational function ignores flow control
SLIDE 13
Combinational Function Block
Strict/Unit Rate: All input tokens required to produce an output
Figure: block f with inputs in0, in1 and output out
Valid network: Output valid if both inputs are valid
SLIDE 14
Combinational Function Block
Strict/Unit Rate: All input tokens required to produce an output
Figure: block f with inputs in0, in1 and output out
Ready network: Input tokens consumed if output token is consumed (output is valid and ready)
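The datapath, valid network, and ready network of the last three slides can be combined into one combinational model. This is a minimal sketch; `function_block` and its argument layout are assumptions made for illustration, not the authors' implementation.

```python
def function_block(f, in_valid, in_data, out_ready):
    """One combinational evaluation of a strict function block.

    in_valid: list of bools, one per input
    in_data:  list of input values (entries may be None when not valid)
    Returns (out_valid, out_data, in_ready).
    """
    # Valid network: output valid only if every input is valid.
    out_valid = all(in_valid)
    # Datapath: hardware would compute f regardless of flow control;
    # the guard here only avoids calling f on missing Python values.
    out_data = f(*in_data) if out_valid else None
    # Ready network: inputs are consumed exactly when the output token is.
    consumed = out_valid and out_ready
    in_ready = [consumed] * len(in_valid)
    return out_valid, out_data, in_ready
```

Note that `in_ready` depends on `in_valid`: this is a valid-to-ready path inside the block.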
SLIDE 15
Multiplexer Block
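Figure: multiplexer symbol with inputs in0, in1, in2, select and output out
Figure: implementation with inputs in0, in1, in2, select, a decoder, and output out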
SLIDE 16
Demultiplexer Block
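Figure: demultiplexer symbol with inputs in, select and outputs out0, out1, out2
Figure: implementation with inputs select, in, a decoder, and outputs out2, out1, out0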
SLIDE 17
Buffering a Linear Pipeline (Point 1/4)
Combinational block
SLIDE 18
Buffering a Linear Pipeline (Point 1/4)
Long Combinational Path (Data + Valid)
SLIDE 19
Buffering a Linear Pipeline (Point 1/4)
Data buffer: Pipeline register with valid, enable
SLIDE 20
Buffering a Linear Pipeline (Point 1/4)
Long Combinational Path (Ready)
SLIDE 21
Buffering a Linear Pipeline (Point 1/4)
Control Buffer: Register diverts token when downstream suddenly stops
Cao et al., MEMOCODE 2015
Inspired by Carloni’s Latency Insensitive Design (e.g., MEMOCODE 2007)
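The two buffer types can be modeled behaviorally as follows. This is a sketch under assumptions: the slides only say "pipeline register with valid, enable" and "register diverts token when downstream suddenly stops", so the class names, method names, and exact update rules here are illustrative guesses rather than the published circuits.

```python
class DataBuffer:
    """Breaks the data/valid path: a pipeline register with valid and enable."""

    def __init__(self):
        self.valid = False
        self.data = None

    def in_ready(self, out_ready):
        # Ready still passes through combinationally: accept if empty or draining.
        return (not self.valid) or out_ready

    def tick(self, in_valid, in_data, out_ready):
        # The register is enabled whenever the stage can accept a new value.
        if self.in_ready(out_ready):
            self.valid = in_valid
            self.data = in_data if in_valid else None


class ControlBuffer:
    """Breaks the ready path: upstream sees a registered ready signal."""

    def __init__(self):
        self.full = False   # True while a diverted token is being held
        self.data = None

    def in_ready(self):
        # Depends only on state, not on this cycle's downstream ready.
        return not self.full

    def out(self, in_valid, in_data):
        # A diverted token goes out first; otherwise the input passes through.
        return (True, self.data) if self.full else (in_valid, in_data)

    def tick(self, in_valid, in_data, out_ready):
        if self.full:
            if out_ready:                      # held token finally accepted
                self.full, self.data = False, None
        elif in_valid and not out_ready:       # downstream stopped: divert token
            self.full, self.data = True, in_data
```

In this model the data buffer cuts the long data and valid path while leaving ready combinational, and the control buffer does the opposite, which is why a pipeline needs both.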
SLIDE 22
The Problem with Fork
Combinational Block: inputs ready when both valid & output ready
SLIDE 23
The Problem with Fork
Combinational Block: inputs ready when both valid & output ready
SLIDE 24
The Problem with Fork
Fork: outputs valid only when all are ready
SLIDE 25
The Problem with Fork
Fork: outputs valid only when all are ready
SLIDE 26
The Problem with Fork
Fork: outputs valid only when all are ready
Oops: Combinational Cycle
This is not compositional
SLIDE 27
The Solution to Combinational Loops (Point 2/4)
valid ready
SLIDE 28
The Solution to Combinational Loops (Point 2/4)
valid ready
SLIDE 29
The Solution to Combinational Loops (Point 2/4)
valid ready
Allowed: Combinational paths from valid to ready
SLIDE 30
The Solution to Combinational Loops (Point 2/4)
valid ready
Allowed: Combinational paths from valid to ready
Prohibited: Combinational paths from ready to valid
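The effect of the rule can be checked on a tiny example: a fork whose two outputs feed both inputs of a function block f. The sketch below lists which signals each signal depends on combinationally and searches for a cycle. The signal names and the `has_cycle` helper are hypothetical; the dependency lists are read off the slide text ("inputs ready when both valid & output ready", "outputs valid only when all are ready").

```python
def has_cycle(deps):
    """True if the graph {signal: [signals it depends on]} contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY                      # node is on the current DFS path
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:                   # back edge: combinational cycle
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(deps))


# Function block f: input ready depends on both input valids (valid-to-ready).
block_f = {
    "f.in0_ready": ["fork.out0_valid", "fork.out1_valid", "f.out_ready"],
    "f.in1_ready": ["fork.out0_valid", "fork.out1_valid", "f.out_ready"],
}
# Fork with outputs valid only when all are ready (ready-to-valid: prohibited).
naive_fork = {
    "fork.out0_valid": ["fork.in_valid", "f.in0_ready", "f.in1_ready"],
    "fork.out1_valid": ["fork.in_valid", "f.in0_ready", "f.in1_ready"],
}
# Fork obeying the rule: output valid never depends on any ready.
rule_fork = {
    "fork.out0_valid": ["fork.in_valid"],
    "fork.out1_valid": ["fork.in_valid"],
}

print(has_cycle({**block_f, **naive_fork}))   # prints True
print(has_cycle({**block_f, **rule_fork}))    # prints False
```

With no ready-to-valid edges anywhere, every path runs from valids to readys and cannot loop back, whatever blocks are connected.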
SLIDE 31
The Solution to Fork: A Little State (Point 3/4)
Figure: fork with input in and outputs out2, out1, out0
Valid out ignores ready of other outputs
SLIDE 32
The Solution to Fork: A Little State (Point 3/4)
Figure: fork with input in and outputs out2, out1, out0
Valid out ignores ready of other outputs
Flip-flop set after token sent suppresses duplicates
SLIDE 33
The Solution to Fork: A Little State (Point 3/4)
Figure: fork with input in and outputs out2, out1, out0
Valid out ignores ready of other outputs
Flip-flop set after token sent suppresses duplicates
Input consumed once one token sent on every output
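The three statements above can be turned into a small behavioral model with one state bit per output. This is a sketch, not the authors' circuit: the `Fork` class, the `sent` list, and the `step` method are names assumed for illustration.

```python
class Fork:
    """Fork with one 'already sent' flip-flop per output."""

    def __init__(self, n_outputs):
        self.sent = [False] * n_outputs

    def step(self, in_valid, out_ready):
        """Evaluate one clock cycle; returns (out_valid, in_ready)."""
        # Valid out ignores ready: it depends only on in_valid and local state.
        out_valid = [in_valid and not s for s in self.sent]
        # An output is done if it was sent earlier or is transferred this cycle.
        done = [s or (v and r) for s, v, r in zip(self.sent, out_valid, out_ready)]
        # Input consumed once one token has been sent on every output.
        in_ready = in_valid and all(done)
        # Flip-flops: clear for the next token, or remember who was served.
        self.sent = [False] * len(done) if in_ready else done
        return out_valid, in_ready
```

For example, with two outputs where only the first is ready, the first receives the token immediately and its valid then drops, so it is not sent a duplicate while the second output catches up.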
SLIDE 34
Nondeterministic Merge (Point 4/4)
Figure: three separate f blocks
Share with merge/demux
Figure: merge, a single f, and demux, with a select signal
SLIDE 35
Two-Way Nondeterministic Merge Block w/ Select
Figure: block with inputs in0, in1, outputs out and sel, and an Arbiter
“Two-way fork with multiplexed output selected by an arbiter”
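A behavioral view of the merge with its select output can be sketched as below. The arbitration policy is not given on the slide; this sketch assumes round-robin via a hypothetical `last_sel` argument, and it assumes both `out` and `sel` are accepted in the same cycle, so the fork-like state is omitted.

```python
def merge2(in0, in1, last_sel=1):
    """Two-way nondeterministic merge with a select output.

    in0, in1: a token, or None when that input is not valid.
    Returns (out, sel), or None when neither input has a token.
    """
    if in0 is not None and in1 is not None:
        sel = 1 - last_sel        # assumed policy: alternate when both compete
    elif in0 is not None:
        sel = 0
    elif in1 is not None:
        sel = 1
    else:
        return None
    return ((in0, in1)[sel], sel)
```

The `sel` token records which input won, so a downstream demux can route the shared block's result back to the right consumer.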
SLIDE 36
Experiments: Random Buffer Placement
Figure: three plots of Completion Time (µs) versus Number of buffer pairs (2 to 10): GCD(100,2) (7 buffers), 21-way Conveyor (80 buffers), BSN (96 buffers)
SLIDE 37
Best Buffering for GCD (Manually Obtained)
Each loop has one of each buffer
Figure legend: Data Buffer, Control Buffer
SLIDE 38
Summary
Compositional Dataflow Networks as an IR
Patient dataflow blocks with valid/ready handshaking
1. Break downstream, upstream paths w/ two buffer types
2. Avoid comb. cycles: prohibit ready-to-valid paths
3. Add one state bit per output so forks may “race ahead”
4. Tame nondeterministic merge with a select output