2014/5/13 Advanced Processor Technology Group The School of Computer Science
A Wormhole Router Design: progress report
Wei Song, 30/07/2009
Content
- Motive and Plans
- Wormhole router
– Channel Slicing, motivation
– Lookahead, critical cycle
– Implementation
- XY/Stochastic routing scheme
Why a wormhole router now
- Easy
- Smallest cycle period
- Early performance estimation
- Proof of channel slicing
[Figure: progression from the wormhole router to an SDM router (larger crossbar / switch network, larger route scheduler, larger input/output buffer controller), then to QoS and fault-tolerance.]
Plan for Router Design (1)
- Wormhole router
– Speed estimation, basic design flow
– Channel slicing, lookahead pipeline
- Spatial Division Multiplexing (SDM) router
– Uses channel slicing (provides virtual circuits)
– M sub-channels per port, crossbar × M
– Benes and Clos switch networks (as used in ATM)
– Route scheduling in the multi-stage switch
Plan for Router Design (2)
- Dynamic Link Allocation
– Allocate idle sub-channels to active virtual circuits to reduce frame latency
– Arbitration planning, crossbar reconfiguration and buffer planning
- Fault-tolerance
– Error detection, deadlock recovery, route scheduling algorithm
Plan for Router Design (3)
- QoS
– Virtual circuits guarantee latency and bandwidth (the guarantee is weaker if dynamic link allocation is used)
– Best-effort traffic is a problem
– Priorities for virtual-circuit setup (reduce circuit setup time for high-priority services)
ChSlice: motive
[Figure: channel built from 2-bit sub-channels with C-elements (C) and completion detection (CD); signals d_i, d_o, ack_i, ack_o; labels: 16, 8, 4, "16-bit ack of sub-channels".]
Advantage: data on all sub-channels is synchronized, which eases time-division techniques such as virtual channels and TDMA (time-division multiple access).
Drawback: low speed (66% on completion detection, CD).
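The completion-detection cost can be illustrated with a minimal Python sketch. It assumes that a 32-bit word is sent as 16 1-of-4 (2-bit) sub-channels and that the 16 per-sub-channel completion signals are merged into one ack by a pairwise tree of Muller C-elements. The function and class names are hypothetical, and the tree shape is an assumption, not the circuit on the slide.

```python
from typing import List


def encode_1of4(word: int, width: int = 32) -> List[int]:
    """Split `word` into 2-bit symbols (LSB first) and one-hot encode each
    symbol on four rails; returns one 4-bit rail pattern per sub-channel."""
    if width % 2:
        raise ValueError("width must be even")
    return [1 << ((word >> (2 * i)) & 0b11) for i in range(width // 2)]


def decode_1of4(rails: List[int]) -> int:
    """Inverse of encode_1of4; rejects patterns that are not 1-of-4 codewords."""
    word = 0
    for i, r in enumerate(rails):
        if r not in (0b0001, 0b0010, 0b0100, 0b1000):
            raise ValueError(f"sub-channel {i}: invalid 1-of-4 pattern {r:04b}")
        word |= (r.bit_length() - 1) << (2 * i)
    return word


def sub_channel_done(rails: int) -> int:
    """Per-sub-channel completion: OR of the four rails (0 = spacer, 1 = valid)."""
    return int(rails != 0)


class CElement:
    """Muller C-element: the output rises when all inputs are 1, falls when
    all inputs are 0, and otherwise holds its previous value."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, *inputs: int) -> int:
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


class CTree:
    """Pairwise tree of C-elements merging n completion signals into one ack
    (for n = 16: 8, 4, 2 and 1 C-elements per level)."""

    def __init__(self, n: int) -> None:
        self.levels: List[List[CElement]] = []
        while n > 1:
            n = (n + 1) // 2
            self.levels.append([CElement() for _ in range(n)])

    def update(self, signals: List[int]) -> int:
        for level in self.levels:
            # each C-element combines two neighbouring signals of the level below
            signals = [c.update(*signals[2 * i:2 * i + 2]) for i, c in enumerate(level)]
        return signals[0]
```

For example, `CTree(16).update([sub_channel_done(r) for r in encode_1of4(0xDEADBEEF)])` returns 1 only once all 16 sub-channels carry a valid codeword. Because this single ack waits for the slowest sub-channel and then crosses every level of the tree, the merged completion detection sits inside every handshake cycle; channel slicing instead gives each sub-channel its own ack.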
ChSlice: implementation
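[Figure: flits on the sub-channels over time (axes: sub-channels, time); head (H) flits carry the routing information, followed by data (D) and tail (T) symbols.]
[Figure: sliced channel of 16 independent 2-bit sub-channels, d_i0/ack_i0 ... d_i15/ack_i15 to d_o0/ack_o0 ... d_o15/ack_o15, each with its own C-element (C) control.]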
ChSlice: conclusion
- Advantage
– fast
- Overhead
– extra controllers
– larger wire count
- No support for TDMA techniques, but SDM is easy.
Lookahead: pipeline style
[Figure: two pipeline variants, each with three stages (N, N+1, N+2) and completion detection (CD); signals Di/Ai, DN/AN, DN+1/AN+1, Do/Ao; data tokens shown between the stages.]
Lookahead: conclusion
- Advantage
– fast
- Disadvantage
– not QDI (quasi-delay-insensitive)
– a small area overhead
Lookahead: critical cycle
[Figure: critical cycle through the router, shown in two versions: input buffer, pipeline stages, crossbar (with data from other inputs), arbiter (with acks from other outputs) and output buffer, including the long line between routers (d_i, ack_i, d_o, ack_o).]
Lookahead: implementation
[Figure: pipeline stages N, N+1 and N+2 with completion detection (CD), and the crossbar.]
Router: structure
[Figure: router with 5 input ports and 5 output ports, arbiters and controllers (ctl); each port has 80 data wires and 16 ack wires (d_i_0/ack_i_0 ... d_i_4/ack_i_4, d_o_0/ack_o_0 ... d_o_4/ack_o_4).]
Router: data path
[Figure: data path input buffer → crossbar → output buffer; signals ip_d, ip_a, ib_d, ib_a, ib_pa, ic_d, ic_a, ic_da, oc_d, oc_a, ob_d, ob_a, ob_pa, op_d, op_a, rt_err, acken, gnt, eof, acki; data wires labelled eof, 3, 2, 1.]
[Figure: signal transition graph of the data path over the rising (+) and falling (-) transitions of these signals.]
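The data-path signals are named in pairs such as ip_d/ip_a and ib_d/ib_a. Assuming each such pair is a 4-phase (return-to-zero) request/acknowledge handshake (an assumption, not stated on the slide), a small Python checker can validate a simulation trace against that protocol. The function name and the trace format (one "signal+" or "signal-" token per transition) are hypothetical.

```python
from typing import Iterable


def follows_four_phase(trace: Iterable[str], req: str, ack: str) -> bool:
    """Check that the transitions of one req/ack pair in `trace` follow the
    4-phase order req+, ack+, req-, ack-, repeated.
    Transitions on other signals are ignored; an unfinished last handshake
    (a valid prefix) is accepted."""
    order = [f"{req}+", f"{ack}+", f"{req}-", f"{ack}-"]
    step = 0
    for event in trace:
        if event not in order:
            continue  # transition on some other signal
        if event != order[step % 4]:
            return False  # protocol violation on this channel
        step += 1
    return True
```

For a hypothetical trace `"ip_d+ ib_d+ ic_d+ ip_a+ ip_d- ib_a+ ib_d- ip_a-".split()`, the ip_d/ip_a pair passes, while `["ip_d+", "ip_d-"]` fails because the request is withdrawn before it is acknowledged.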
Router: layout
- Faraday 130 nm technology
- 32-bit, 5 ports, XY routing algorithm
- 0.3 × 0.3 mm (14.3K gates, 0.057 mm²)
- Typical corner (25 °C, 1.2 V)
- Cycle period 1.7 ns (2.35 GByte/s per port)
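The per-port bandwidth follows from the data width and the cycle period. Assuming one 32-bit word (4 bytes) is transferred per handshake cycle, 4 bytes / 1.7 ns ≈ 2.35 GByte/s; the hypothetical helper below makes that arithmetic explicit.

```python
def port_bandwidth_gbyte_s(width_bits: int, cycle_ns: float) -> float:
    """Bytes transferred per cycle divided by the cycle period in ns.
    One byte per nanosecond is 1 GByte/s (decimal giga)."""
    return (width_bits / 8) / cycle_ns


bw = port_bandwidth_gbyte_s(32, 1.7)
print(f"{bw:.2f} GByte/s per port")  # prints "2.35 GByte/s per port"
```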
ChSlice and Lookahead
- Speed gain: ChSlice 24.1%, lookahead (LH) 17.2%
- Area overhead: ChSlice 23.0%, lookahead (LH) 5.3%
Compare to other routers
- MANGO: 1.26 ns; 0.12 µm; bundled data
- ANoC: 4 ns; 0.13 µm; 1-of-4
- QNoC: 4.8 ns; 0.18 µm; bundled data
- ASPIN: 0.88 ns; 90 nm; dual-rail & bundled data
- Ours: 1.7 ns; 0.13 µm; 1-of-4 & lookahead
Speed vs. data width
[Figure: speed vs. data width for QNoC and the wormhole router.]
XY/Stochastic
- Motive
– Supporting two routing algorithms is complicated
– The deadlock problem
– The involvement of network interfaces
– Keep the router simple
- Solution
– Router: XY
– Network interface: generate, consume, or forward (random)
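The division of labour in this solution can be sketched in Python: routers only implement XY routing, and the randomness lives in the network interfaces (NIs). This is a minimal sketch under stated assumptions, not the author's implementation; the mesh coordinates, the port names (E/W/N/S/L), the uniform random choice of the next NI and the `can_serve` predicate are all hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

Node = Tuple[int, int]


def xy_route(cur: Node, dst: Node) -> str:
    """Dimension-order (XY) routing: resolve the X offset first, then Y.
    Port names and 'y grows northwards' are illustrative conventions."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "L"  # local port: deliver to this node's network interface


STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_path(src: Node, dst: Node) -> List[Node]:
    """Routers visited when a packet is routed XY from src to dst."""
    path = [src]
    while (port := xy_route(path[-1], dst)) != "L":
        sx, sy = STEP[port]
        path.append((path[-1][0] + sx, path[-1][1] + sy))
    return path


def stochastic_search(src: Node, can_serve: Callable[[Node], bool],
                      width: int, height: int, rng: random.Random,
                      max_jumps: int = 1000) -> Optional[List[Node]]:
    """Hypothetical NI behaviour: the source NI generates a request and sends
    it (via XY) to a random node; that node's NI consumes it if it can serve
    it, otherwise forwards it to another random node. Returns the NIs visited,
    or None if the search gives up."""
    visited = [src]
    for _ in range(max_jumps):
        nxt = (rng.randrange(width), rng.randrange(height))
        if nxt == visited[-1]:
            continue  # a jump to itself would not leave the NI
        visited.append(nxt)
        if can_serve(nxt):
            return visited  # consumed here
    return None
```

For example, `xy_path((0, 0), (2, 1))` visits (0, 0), (1, 0), (2, 0), (2, 1): X is corrected before Y. Because every jump between NIs is an ordinary XY route, the routers keep a simple, deadlock-free routing function, and the price is paid in the number of random jumps, consistent with the longer search time listed for XY/Stochastic routing.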
XY/Stochastic (request path)
[Figure: request path between two nodes labelled M; panels: XY/Stochastic, Router Only.]
XY/Stochastic (ack path)
[Figure: acknowledge path between two nodes labelled M; panels: Same Path, Single Jump.]
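One possible reading of the two ack-path options, sketched in Python: with "Same Path" the ack retraces the request route, through the intermediate NIs, in reverse; with "Single Jump" the consuming NI sends the ack straight back to the source with plain XY routing. This interpretation, the coordinate system and all function names are assumptions, not taken from the slides.

```python
from typing import List, Tuple

Node = Tuple[int, int]


def xy_path(src: Node, dst: Node) -> List[Node]:
    """Routers visited under XY routing: move along X first, then along Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def request_route(jumps: List[Node]) -> List[Node]:
    """Router-level route of a request that visited the NIs in `jumps`."""
    route = [jumps[0]]
    for a, b in zip(jumps, jumps[1:]):
        route += xy_path(a, b)[1:]  # drop the repeated start node
    return route


def ack_same_path(jumps: List[Node]) -> List[Node]:
    """Ack retraces the request route in reverse, through every intermediate NI."""
    return request_route(jumps)[::-1]


def ack_single_jump(jumps: List[Node]) -> List[Node]:
    """Ack goes directly from the consuming NI back to the source via XY."""
    return xy_path(jumps[-1], jumps[0])
```

For a request that visited the NIs (0, 0), (3, 0) and (1, 1), the same-path ack takes 6 hops, while the single-jump ack takes 2.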
Compare
- Rely on routers
– A larger router (special router design)
– Longer routing overhead
– Deadlocks
– Shorter search time
- XY/Stochastic routing
– Smaller router (normal router design)
– Shorter routing time
– Only deadlocks caused by errors
– Longer search time (mitigated by higher priority through QoS)
Conclusion
- Router design plan
– Wormhole router is the first step
- Wormhole router
– Channel slicing
– SDM is better than TDMA for asynchronous routers
- XY/Stochastic routing