Channel Slicing: a Way to Build Fast Routers for Asynchronous NoCs

Wei Song and Doug Edwards, The University of Manchester, 15/09/2009. Advanced Processor Technology Group, The School of Computer Science, 2014/5/13.


  1. Channel Slicing: a Way to Build Fast Routers for Asynchronous NoCs
     Wei Song and Doug Edwards, The University of Manchester, 15/09/2009
     Advanced Processor Technology Group, The School of Computer Science, 2014/5/13

  2. Content
     • Asynchronous NoCs
     • Channel Slicing
       – Motivation
       – Sliced sub-channels
       – Flow control
     • An asynchronous wormhole router
       – Implementation details
       – Performance

  3. Network-on-Chip (NoC)
     [Figure: a 4×4 mesh of nodes labelled (0,0) to (3,3), with PE, NI and RT blocks marked]
     PE: Processor Element
     NI: Network Interface
     RT: Router

  4. Synchronous/Asynchronous
     • Synchronous
       – Fast
         • Intel 80-tile 4GHz 65nm
         • DSPIN 408MHz 130nm
       – Small
         • DSPIN 0.161mm²
       – Power consuming
         • 10.39mW (250MHz)
       – Sensitive to variation
       – Complex clock tree
     • Asynchronous
       – Slow !!
         • ASPIN 714MHz 90nm
         • ANoC 220MHz 130nm
       – Large
         • ANoC 0.211mm²
       – Power efficient
         • 3.69mW (160MHz)
       – Tolerant to variation
       – No clock tree

  5. Content
     • Asynchronous NoCs
     • Channel Slicing
       – Motivation
       – Sliced sub-channels
       – Flow control
     • An asynchronous wormhole router
       – Implementation details
       – Performance

  6. Asynchronous Pipelines
     • CHAIN (Bainbridge’02)
       – 4-phase 1-of-4 pipelines
     • QoS NoC (Felicijan’04)
       – 8-bit, four 4-phase 1-of-4 pipelines
     • ANoC (Beigne’05)
       – 32-bit, 16 4-phase 1-of-4 pipelines
     • SpiNNaker (Plana’07)
       – Several 1-of-4/2-of-7 pipelines
     • ASPIN (Sheibanyrad’08)
       – 32-bit, 16 dual-rail pipelines / bundled-data
     • MANGO (Bjerregaard’05) & QNoC (Dobkin’09)
       – Bundled-data

  7. Completion Detection
     [Figure: a 16-bit channel (d_i to d_o) built from 2-bit sub-channel stages labelled C; completion detectors (CD) combine the acks of the sub-channels into the shared ack_i and ack_o signals]
     Advantages: data on all sub-channels are synchronized, which eases time-division techniques such as virtual channels and time division multiple access (TDMA).
     Drawbacks: low speed (66% on CD)
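The synchronized channel on this slide can be illustrated with a small behavioural sketch. It assumes the usual meaning of the terms on the slide: a 1-of-4 group carries 2 bits as a one-hot code on 4 wires, a group is valid when any of its wires is high, and a C-element joins the per-group validity signals into one acknowledge. The names `encode_1of4`, `encode_word`, `CElement` and `completion_detect` are hypothetical and do not come from the presentation.

```python
# Behavioural sketch only; names and structure are illustrative assumptions.
SPACER = (0, 0, 0, 0)  # all wires low: the return-to-zero phase of a 4-phase handshake


def encode_1of4(value):
    """Encode a 2-bit value as a one-hot code on 4 wires."""
    if not 0 <= value < 4:
        raise ValueError("a 1-of-4 group carries exactly 2 bits")
    return tuple(int(i == value) for i in range(4))


def encode_word(word, width_bits):
    """Split a word into 2-bit groups (least significant first) and encode each."""
    return [encode_1of4((word >> (2 * k)) & 0b11) for k in range(width_bits // 2)]


class CElement:
    """Output rises when all inputs are 1, falls when all are 0, otherwise holds."""

    def __init__(self):
        self.out = 0

    def update(self, inputs):
        inputs = list(inputs)
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


def completion_detect(cd, groups):
    """Group validity is the OR of its 4 wires; the C-element joins all groups."""
    return cd.update(any(g) for g in groups)


if __name__ == "__main__":
    cd = CElement()
    groups = encode_word(0x1B, 8)
    print(groups)
    print(completion_detect(cd, groups[:3] + [SPACER]))  # one group has not arrived yet
    print(completion_detect(cd, groups))                 # every group now holds data
```

The acknowledge can only change once every sub-channel has arrived (or returned to the spacer), which is the synchronization the slide lists as an advantage and also the reason the detector sits on the critical cycle.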

  8. ChSlice: implementation
     [Figure: first, a 16-bit channel (d_i to d_o) of 2-bit stages labelled C whose completion detectors (CD) drive a single ack_i/ack_o pair; then the same channel sliced into 16 2-bit sub-channels, d_i 0/d_o 0 to d_i 15/d_o 15, each with its own ack_i/ack_o pair (ack_i 0/ack_o 0 to ack_i 15/ack_o 15)]
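A toy timing model can show why per-sub-channel acknowledges may help. This is a sketch under strong assumptions and not the authors' analysis: each token takes some delay on each sub-channel, a synchronized channel advances only when the slowest sub-channel is done and then pays a hypothetical fixed `cd_delay` for the completion detector, while sliced sub-channels advance independently. Buffer depth and the re-synchronization inside the router are ignored.

```python
# Toy model; the delay values and cd_delay are made-up illustration parameters.
import random


def synchronized_finish(delays, cd_delay=0.0):
    """delays[k][s] is the delay of token k on sub-channel s.

    Every token waits for the slowest sub-channel plus the completion detector.
    """
    return sum(max(row) + cd_delay for row in delays)


def sliced_finish(delays):
    """Each sub-channel handshakes on its own; the frame ends with the slowest one."""
    n_sub = len(delays[0])
    return max(sum(row[s] for row in delays) for s in range(n_sub))


if __name__ == "__main__":
    rng = random.Random(0)
    delays = [[rng.uniform(0.8, 1.2) for _ in range(16)] for _ in range(8)]
    print("synchronized:", synchronized_finish(delays, cd_delay=0.5))
    print("sliced:      ", sliced_finish(delays))
```

In this model the sliced finish time can never exceed the synchronized one, because the maximum of per-sub-channel sums is at most the sum of per-token maxima.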

  9. How to do it in a router?
     [Figure: two router diagrams, each showing arbiters, crossbars connected to other ports, the data-path and the ack path]

  10. Flow control
      [Figure: two timing diagrams of a frame H D D D D D D T (H: head, D: data). The first shows one frame over time with the head, routing and data phases marked; the second shows the frame on each of the sub-channels, with head, routing and data marked along the time axis]

  11. Content
      • Asynchronous NoCs
      • Channel Slicing
        – Motivation
        – Sliced sub-channels
        – Flow control
      • An asynchronous wormhole router
        – Implementation details
        – Performance

  12. Router: structure
      [Figure: 5 input ports and 5 output ports, numbered 0 to 4. Input port n has d_i_n (width 80) and ack_i_n (width 16); output port n has d_o_n (width 80) and ack_o_n (width 16); arbiter and ctl blocks are shown with the ports]

  13. Router: data path
      [Figure: input buffer, crossbar and output buffer in sequence, each with lanes numbered 0 to 3. Data signals: ip_d, ib_d, ic_d, oc_d, ob_d, op_d. Acknowledge signals: ip_a, ib_a, ib_pa, ic_a, oc_a, ob_a, op_a. Other signals: eof, gnt, rt_err, acken]

  14. Re-Synchronization
      [Figure: the data path of the previous slide (input buffer, crossbar, output buffer; signals ip_d, ib_d, ic_d, oc_d, ob_d, op_d, ip_a, ib_a, ib_pa, ic_a, oc_a, ob_a, op_a, eof, gnt, rt_err, acken) together with a signal transition diagram over rt_dec, rt_err, acken, ch_fin, eof and ic_a (transitions such as rt_dec+, rt_err+, acken-/1, acken-/2, ch_fin+/1, ch_fin+/2, eof+/1, eof-/2), with separate branches for a normal frame and a faulty frame]

  15. Routing Decision
      [Figure: routing decision circuit. Inputs ib_d 0 [0..3] to ib_d 3 [0..3] with acknowledges ib_a 0 to ib_a 3 supply target_x and target_y; two 4-bit (1-of-4) comparators with outputs >, < and = compare them with local_x and local_y. Labelled signals: ir_n, ir_e, ir_w, ir_l, or_s, or_w, or_n, or_l, rt_en, rt_dec, rt_err, ch_fin 0 to ch_fin 15, ch_fin_a. An east arbiter built from ME elements receives gnts from other ports and produces gnt_w, gnt_s, gnt_n and gnt_l. A signal transition diagram over rt_dec, rt_err, rt_en and ch_fin_a shows a normal frame and a faulty frame]
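The comparator outputs on this slide (>, <, = for x and y) are what a dimension-order XY decision needs. A minimal sketch follows, under assumptions the slides do not state: x is resolved before y, x grows towards the east and y grows towards the south. The function names `xy_route` and `xy_path` are hypothetical.

```python
# Sketch of XY (dimension-order) routing; the coordinate orientation is an assumption.
def xy_route(local, target):
    """Return the output direction for a frame at `local` heading to `target`."""
    local_x, local_y = local
    target_x, target_y = target
    if target_x > local_x:
        return "east"
    if target_x < local_x:
        return "west"
    if target_y > local_y:
        return "south"
    if target_y < local_y:
        return "north"
    return "local"  # both comparators report '=': deliver to the local port


STEP = {"east": (1, 0), "west": (-1, 0), "south": (0, 1), "north": (0, -1)}


def xy_path(source, target):
    """List the directions taken hop by hop until the frame is delivered."""
    position, hops = source, []
    while True:
        direction = xy_route(position, target)
        hops.append(direction)
        if direction == "local":
            return hops
        dx, dy = STEP[direction]
        position = (position[0] + dx, position[1] + dy)


if __name__ == "__main__":
    print(xy_path((0, 0), (2, 1)))
```

Because the x offset is always cleared first, a frame never turns from the y dimension back into the x dimension, which is the property that makes XY routing deadlock-free on a mesh.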

  16. Router: layout
      • Faraday 130nm Technology
      • 32-bit, 5 ports, XY routing algorithm
      • 0.3x0.3mm (12.6K gates, 0.050mm²)
      • Typical corner (25°C, 1.2V)
      • Cycle period 2.2 ns (1.82GByte/s per port)
      • Equivalent to 450MHz
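The per-port bandwidth and the equivalent frequency quoted above follow from the data width and the cycle period. A short check, assuming one 32-bit word is transferred per cycle (the helper names are hypothetical):

```python
# Arithmetic check of the slide's figures; assumes one data word per cycle period.
def port_bandwidth_gbyte_per_s(width_bits, period_ns):
    """Bytes per nanosecond is numerically equal to GByte/s."""
    return (width_bits / 8) / period_ns


def equivalent_frequency_mhz(period_ns):
    """One cycle per period: 1 / period, expressed in MHz."""
    return 1e3 / period_ns


if __name__ == "__main__":
    print(round(port_bandwidth_gbyte_per_s(32, 2.2), 2))  # 1.82
    print(round(equivalent_frequency_mhz(2.2), 1))        # 454.5
```

The exact reciprocal of 2.2 ns is about 454.5 MHz, which the slide rounds to 450MHz.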

  17. Comparison with other routers
      Router                 Tech (nm)  Period (ns)  Frequency (Hz)  Pipeline style            Other
      Sliced Wormhole        130        2.2          450M            4-phase 1-of-4            Standard cell
      Synchronized Wormhole  130        2.8          360M            4-phase 1-of-4            Standard cell
      ANoC                   130        4.0          250M            4-phase 1-of-4            Customized cell lib
      ASPIN                  90         0.88         1.13G           Dual-rail / bundled-data  Customized FIFO
      QNoC                   180        4.8          208M            Bundled-data              Delay line
      MANGO                  120        1.26         790M            Bundled-data              Delay line
      DSPIN                  130        2.45         408M            Synchronous circuit
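To read the table as relative speeds, the periods can be compared directly. The sketch below uses only the period values listed on this slide; the dictionary and function names are hypothetical, and the ratios ignore the difference in technology nodes, so they are not a like-for-like comparison outside the 130 nm rows.

```python
# Cycle periods in ns, copied from the comparison slide.
PERIOD_NS = {
    "Sliced Wormhole": 2.2,
    "Synchronized Wormhole": 2.8,
    "ANoC": 4.0,
    "ASPIN": 0.88,
    "QNoC": 4.8,
    "MANGO": 1.26,
    "DSPIN": 2.45,
}


def relative_period(router, baseline="Sliced Wormhole"):
    """How many times longer the router's cycle is than the baseline's."""
    return PERIOD_NS[router] / PERIOD_NS[baseline]


if __name__ == "__main__":
    # Sort from shortest to longest cycle period.
    for name in sorted(PERIOD_NS, key=PERIOD_NS.get):
        print(f"{name:22s} {PERIOD_NS[name]:5.2f} ns  x{relative_period(name):.2f}")
```

For the two routers built by the authors, the synchronized version has a cycle about 1.27 times as long as the sliced one (2.8 ns against 2.2 ns).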

  18. Speed vs. Data Width
      [Figure: chart comparing Sliced Wormhole and QNoC]

  19. Speed and Area

  20. Question?
