 
A Wormhole Router Design: progress report. Wei Song, 30/07/2009. Advanced Processor Technology Group, The School of Computer Science, 2014/5/13
Content • Motive and Plans • Wormhole router – Channel Slicing, motivation – Lookahead, critical cycle – Implementation • XY/Stochastic routing scheme
Why a wormhole router now • Wormhole routing is easy to implement • Smallest cycle period • Early performance estimation • Proof of channel slicing [Figure: later extensions built on the wormhole router: SDM (larger crossbar / switch network, larger route scheduler, larger input/output buffers), QoS, fault-tolerance and their controllers]
Plan for Router Design (1) • Wormhole router – Speed estimation, basic design flow – Channel slicing, lookahead pipeline • Spatial Division Multiplexing (SDM) router – Utilizing channel slicing (provides virtual circuits) – M sub-channels on a port, crossbar × M – Benes and Clos switch networks (as used in ATM) – Route scheduling in the multi-stage switch
Plan for Router Design (2) • Dynamic Link Allocation – Allocate idle sub-channels to active virtual circuits to reduce frame latency – Arbitration planning, crossbar reconfiguration and buffer planning • Fault-tolerance – Error detection, deadlock recovery, route scheduling algorithm
Plan for Router Design (3) • QoS – Virtual circuits guarantee latency and bandwidth (the guarantee weakens if dynamic link allocation is used) – Best-effort traffic remains a problem – Priorities for virtual circuit setup (reduce circuit setup time for high-priority services)
ChSlice: motive [Figure: a 16-bit channel built from 2-bit sub-channels, whose individual acks are merged by C-elements (C) and completion detectors (CD) into a single ack_i/ack_o] Advantages: data on all sub-channels are synchronized, which eases time-division techniques such as virtual channels and TDMA. Drawbacks: low speed (66% on completion detection, CD)
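To make the speed drawback concrete, the sketch below estimates how deep the C-element tree of a completion detector (CD) grows when it merges the acks of many sub-channels into one. It assumes a balanced tree of `fan_in`-input C-elements with a uniform per-level delay; `fan_in`, `gate_delay_ns` and the function names are hypothetical parameters for illustration, not values from this design.

```python
def cd_tree_depth(n_inputs: int, fan_in: int = 2) -> int:
    """Levels of fan_in-input C-elements needed to merge n_inputs acks into one."""
    if n_inputs < 1 or fan_in < 2:
        raise ValueError("need at least one input and a fan-in of at least 2")
    depth = 0
    while n_inputs > 1:
        # each level groups up to fan_in signals into one C-element output
        n_inputs = -(-n_inputs // fan_in)  # ceiling division
        depth += 1
    return depth


def cd_delay(n_inputs: int, fan_in: int = 2, gate_delay_ns: float = 0.1) -> float:
    """CD delay, assuming every C-element level costs gate_delay_ns (hypothetical)."""
    return cd_tree_depth(n_inputs, fan_in) * gate_delay_ns


print(cd_tree_depth(16, 2))  # 4 levels: 16 -> 8 -> 4 -> 2 -> 1
print(cd_tree_depth(16, 4))  # 2 levels: 16 -> 4 -> 1
```

The depth grows logarithmically with the number of merged sub-channels, and this delay sits in every handshake cycle of a fully synchronized channel.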
ChSlice: implementation [Figure: 16 sub-channels (0 to 15), each 2 bits wide with its own d_i/d_o data, ack_i/ack_o acknowledge and C-elements (C); over time every sub-channel transmits its own frame H D D D D D D T, where the head flit H carries the routing information, followed by data flits D and a tail flit T]
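Each sub-channel carries 2 bits. Assuming the 1-of-4 delay-insensitive code listed for this router in the comparison with other routers, and a four-phase protocol with an all-zero spacer, slicing a 32-bit word across 16 sub-channels can be sketched as below. The function names and the bit ordering (sub-channel 0 holds the least significant bits) are assumptions for illustration.

```python
from __future__ import annotations


def slice_word(word: int, width: int = 32, digit_bits: int = 2) -> list[int]:
    """Split a word into per-sub-channel digits; sub-channel 0 gets the LSBs (assumed)."""
    mask = (1 << digit_bits) - 1
    return [(word >> (i * digit_bits)) & mask for i in range(width // digit_bits)]


def one_of_four(digit: int) -> tuple[int, ...]:
    """1-of-4 code: exactly one of the four wires goes high for a 2-bit value."""
    if not 0 <= digit < 4:
        raise ValueError("a 1-of-4 digit must be in 0..3")
    return tuple(1 if wire == digit else 0 for wire in range(4))


def decode(wires: tuple[int, ...]) -> int | None:
    """Return the digit, or None while the sub-channel holds the all-zero spacer."""
    if sum(wires) == 0:
        return None  # return-to-zero phase of the assumed four-phase handshake
    if sum(wires) != 1:
        raise ValueError("illegal 1-of-4 code word")
    return wires.index(1)


def unslice(digits: list[int], digit_bits: int = 2) -> int:
    """Reassemble the word from the digits of all sub-channels."""
    return sum(d << (i * digit_bits) for i, d in enumerate(digits))


word = 0xDEADBEEF
encoded = [one_of_four(d) for d in slice_word(word)]  # 16 independent sub-channels
assert unslice([decode(w) for w in encoded]) == word
```

Because each 1-of-4 digit is self-timed, every sub-channel can complete its own handshake without waiting for the others, which is what removes the wide completion detector.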
ChSlice: conclusion • Advantage – fast • Overhead – extra controllers – larger wire count • TDMA techniques are not supported, but SDM becomes easy.
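The wire-count overhead can be estimated with simple arithmetic. The sketch assumes 1-of-4 data encoding (four wires per 2-bit digit) and one ack wire per synchronized group; it counts only data and ack wires and ignores any end-of-frame or control wires, so it is a rough lower bound rather than the router's actual port width. The function name is hypothetical.

```python
from __future__ import annotations


def channel_wires(width_bits: int, sliced: bool, digit_bits: int = 2) -> dict[str, int]:
    """Data and ack wires of one channel direction under 1-of-(2**digit_bits) encoding."""
    digits = width_bits // digit_bits
    data = digits * (1 << digit_bits)  # 1-of-4: four wires per 2-bit digit
    ack = digits if sliced else 1      # sliced: one ack per sub-channel
    return {"data": data, "ack": ack, "total": data + ack}


print(channel_wires(32, sliced=False))  # {'data': 64, 'ack': 1, 'total': 65}
print(channel_wires(32, sliced=True))   # {'data': 64, 'ack': 16, 'total': 80}
```

Under these assumptions the data wires are unchanged and the extra cost is one ack wire per sub-channel, plus the per-sub-channel controllers that this count does not include.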
Lookahead: pipeline style [Figure: pipeline stages N, N+1 and N+2 with completion detectors (CD), and waveforms of the data (Di, D_N, D_N+1, Do) and acknowledge (Ai, A_N, A_N+1, Ao) signals of each stage]
Lookahead: conclusion • Advantage – fast • Disadvantage – not QDI – a small area overhead
Lookahead: critical cycle [Figure: two adjacent routers, each with pipelined input buffer, crossbar and output buffer; the crossbar merges data from other inputs and acks from other outputs under arbiter control, and the routers are connected by long lines between routers (d_o to d_i, ack_i to ack_o)]
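The cycle period of an asynchronous pipeline is set by its slowest handshake loop. As a first-order sketch, assume a four-phase return-to-zero protocol in which each loop is traversed twice per cycle (set phase and reset phase), so a loop's period is roughly twice the sum of its forward (data plus completion detection) and backward (ack) delays. The loop names and every delay value below are hypothetical placeholders, not measurements from this router.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Loop:
    name: str
    forward_ns: float   # data propagation + completion detection
    backward_ns: float  # acknowledge propagation

    def period_ns(self) -> float:
        # four-phase handshake: the loop is traversed in both set and reset phases
        return 2.0 * (self.forward_ns + self.backward_ns)


def critical_cycle(loops: list[Loop]) -> Loop:
    """The loop with the longest period bounds the cycle period of the whole pipeline."""
    return max(loops, key=Loop.period_ns)


# hypothetical delays, for illustration only
loops = [
    Loop("input buffer stage to stage", 0.30, 0.20),
    Loop("input buffer through crossbar to output buffer", 0.40, 0.35),
    Loop("output buffer over long line to next input buffer", 0.50, 0.30),
]
worst = critical_cycle(loops)
print(worst.name, round(worst.period_ns(), 2))
```

In this model, shortening any delay that lies on the worst loop, such as the long line between routers, directly shortens the cycle period.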
Lookahead: implementation [Figure: lookahead pipeline stages N, N+1 and N+2 with completion detectors (CD) around the crossbar]
Router: structure [Figure: 5 input ports and 5 output ports (0 to 4); each port has 80 data wires (d_i_x, d_o_x), 16 ack wires (ack_i_x, ack_o_x), an arbiter and a controller (ctl)]
Router: data path [Figure: the data path of one sub-channel through the input buffer, crossbar and output buffer, with data wires 0 to 3 plus eof; data signals ip_d, ib_d, ic_d, oc_d, ob_d, op_d; acknowledge signals ip_a, ib_a, ib_pa, ic_a, ic_da, oc_a, ob_pa, ob_a, op_a; control signals gnt, acki, acken and rt_err; waveforms show the rising (+) and falling (-) transitions of each signal]
Router: layout • Faraday 130 nm technology • 32-bit, 5 ports, XY routing algorithm • 0.3 × 0.3 mm (14.3K gates, 0.057 mm²) • Typical corner (25 °C, 1.2 V) • Cycle period 1.7 ns (2.35 GByte/s per port)
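The per-port bandwidth follows from moving one 32-bit word per cycle: 4 bytes / 1.7 ns ≈ 2.35 GB/s. The sketch below reproduces that arithmetic, assuming 1 GB = 10^9 bytes (the slide does not state the unit convention); the function name is hypothetical.

```python
def port_throughput_gbytes(data_bits: int, cycle_ns: float) -> float:
    """Throughput in GB/s (1 GB = 1e9 bytes, assumed) when one word moves per cycle."""
    return (data_bits / 8) / cycle_ns  # bytes per ns is numerically GB/s


print(round(port_throughput_gbytes(32, 1.7), 2))  # 2.35
```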
ChSlice and Lookahead • Speed improvement: ChSlice 24.1%, Lookahead (LH) 17.2% • Area overhead: ChSlice 23.0%, LH 5.3%
Comparison with other routers • MANGO: 1.26 ns; 0.12 µm; bundled data • ANoC: 4 ns; 0.13 µm; 1-of-4 • QNoC: 4.8 ns; 0.18 µm; bundled data • ASPIN: 0.88 ns; 90 nm; dual-rail and bundled data • Ours: 1.7 ns; 0.13 µm; 1-of-4 and lookahead
Speed vs. data width [Figure: speed versus data width for the wormhole router and QNoC]
XY/Stochastic • Motive – Supporting two routing algorithms is complicated – The deadlock problem – The involvement of network interfaces – Keep the router simple • Solution – Router: XY routing – Network interface: generate, consume, or forward (at random)
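XY (dimension-order) routing moves a packet along the X dimension until the column matches, then along Y, then delivers it locally. A minimal sketch, assuming a 2D mesh with (x, y) coordinates where y grows northward; the port names LOCAL/EAST/WEST/NORTH/SOUTH and the helper names are labels chosen here, since the slides do not name the five ports.

```python
from __future__ import annotations
from enum import Enum


class Port(Enum):
    LOCAL = "local"
    EAST = "east"
    WEST = "west"
    NORTH = "north"
    SOUTH = "south"


# coordinate step taken when leaving through each mesh port (y grows northward)
STEP = {Port.EAST: (1, 0), Port.WEST: (-1, 0), Port.NORTH: (0, 1), Port.SOUTH: (0, -1)}


def xy_route(cur: tuple[int, int], dst: tuple[int, int]) -> Port:
    """Dimension-order routing: correct X first, then Y, then deliver locally."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return Port.EAST if dx > cx else Port.WEST
    if dy != cy:
        return Port.NORTH if dy > cy else Port.SOUTH
    return Port.LOCAL


def xy_path(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Routers visited from src to dst when every router applies xy_route."""
    path, cur = [src], src
    while (port := xy_route(cur, dst)) is not Port.LOCAL:
        sx, sy = STEP[port]
        cur = (cur[0] + sx, cur[1] + sy)
        path.append(cur)
    return path


print(xy_path((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because a packet never turns from the Y dimension back to X, XY routing is deadlock-free on a mesh, which is what lets the router stay simple while the network interfaces handle the stochastic part.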
XY/Stochastic (request path) [Figure: request path under router-only routing compared with XY/Stochastic routing; nodes marked M]
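One reading of the request path is that routers only ever apply XY routing, while each network interface (NI) that receives a request either consumes it (if it can serve it) or forwards it to a randomly chosen node. The sketch below models that reading; the mesh size, the `can_serve` predicate, the uniform random choice and the `max_forwards` give-up rule are all hypothetical. Since an XY route is minimal, each leg's hop count is simply the Manhattan distance.

```python
from __future__ import annotations
import random


def xy_hops(a: tuple[int, int], b: tuple[int, int]) -> int:
    # an XY route is minimal, so its hop count is the Manhattan distance
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def stochastic_request(src, can_serve, mesh=(4, 4), max_forwards=64, rng=None):
    """Return (visited NIs, total router hops), or (None, hops) if no NI accepted."""
    rng = rng or random.Random()
    nodes = [(x, y) for x in range(mesh[0]) for y in range(mesh[1])]
    visits, hops, cur = [src], 0, src
    for _ in range(max_forwards):
        # the NI picks the next target at random (uniform choice is an assumption)
        nxt = rng.choice([n for n in nodes if n != cur])
        hops += xy_hops(cur, nxt)  # routers carry each leg with plain XY routing
        visits.append(nxt)
        cur = nxt
        if can_serve(cur):
            return visits, hops  # this NI consumes the request
    return None, hops  # hypothetical give-up rule after max_forwards legs


rng = random.Random(2009)
visits, hops = stochastic_request((0, 0), lambda n: n[0] == 3, rng=rng)
print(visits, hops)
```

The random legs usually add up to more hops than a direct route, which matches the longer search time listed for XY/Stochastic routing in the comparison.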
XY/Stochastic (ack path) [Figure: two options for the ack path, returning along the same path or in a single jump; nodes marked M]
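The two ack options can be compared by hop count. Assuming the request visited a list of NIs (the example path below is hypothetical) and every leg is an XY route, a "same path" ack retraces every leg in reverse, while a "single jump" ack takes one XY route from the serving NI back to the source. By the triangle inequality the single jump is never longer.

```python
from __future__ import annotations


def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    # hop count of a minimal XY route between two mesh nodes
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def ack_hops_same_path(visits: list[tuple[int, int]]) -> int:
    """Ack retraces every NI-to-NI leg of the request in reverse order."""
    return sum(manhattan(a, b) for a, b in zip(visits, visits[1:]))


def ack_hops_single_jump(visits: list[tuple[int, int]]) -> int:
    """Serving NI replies directly to the source with one XY route."""
    return manhattan(visits[-1], visits[0])


visits = [(0, 0), (3, 1), (1, 3), (2, 2)]  # hypothetical request path
print(ack_hops_same_path(visits))    # 10 (4 + 4 + 2)
print(ack_hops_single_jump(visits))  # 4
```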
Compare • Relying on routers – A larger router (special router design) – Longer routing overhead – Deadlocks – Shorter search time • XY/Stochastic routing – Smaller router (normal router design) – Shorter routing time – Only deadlocks caused by errors – Longer search time (can be offset by a higher QoS priority)
Conclusion • Router design plan – The wormhole router is the first step • Wormhole router – Channel slicing – SDM is better than TDMA for asynchronous routers • XY/Stochastic routing
Question?