Channel Slicing: a Way to Build Fast Routers for Asynchronous NoCs


SLIDE 1

2014/5/13 Advanced Processor Technology Group The School of Computer Science

Channel Slicing: a Way to Build Fast Routers for Asynchronous NoCs

Wei Song and Doug Edwards The University of Manchester 15/09/2009

SLIDE 2

Content

  • Asynchronous NoCs
  • Channel Slicing
      – Motivation
      – Sliced sub-channels
      – Flow control
  • An asynchronous wormhole router
      – Implementation details
      – Performance

SLIDE 3

Network-on-Chip (NoC)

[Figure: 4×4 mesh NoC with nodes (0,0) to (3,3); each node comprises a router (RT), a network interface (NI) and a processor element (PE).]

SLIDE 4

Synchronous/Asynchronous

  • Synchronous
      – Fast
          • Intel 80-tile: 4 GHz, 65 nm
          • DSPIN: 408 MHz, 130 nm
      – Small
          • DSPIN: 0.161 mm²
      – Power consuming
          • 10.39 mW (250 MHz)
      – Sensitive to variation
      – Complex clock tree
  • Asynchronous
      – Slow!
          • ASPIN: 714 MHz, 90 nm
          • ANoC: 220 MHz, 130 nm
      – Large
          • ANoC: 0.211 mm²
      – Power efficient
          • 3.69 mW (160 MHz)
      – Tolerant to variation
      – No clock tree

SLIDE 5

Content

  • Asynchronous NoCs
  • Channel Slicing
      – Motivation
      – Sliced sub-channels
      – Flow control
  • An asynchronous wormhole router
      – Implementation details
      – Performance

SLIDE 6

Asynchronous Pipelines

  • CHAIN (Bainbridge’02)
      – 4-phase 1-of-4 pipelines
  • QoS NoC (Felicijan’04)
      – 8-bit, four 4-phase 1-of-4 pipelines
  • ANoC (Beigne’05)
      – 32-bit, sixteen 4-phase 1-of-4 pipelines
  • SpiNNaker (Plana’07)
      – Several 1-of-4 / 2-of-7 pipelines
  • ASPIN (Sheibanyrad’08)
      – 32-bit, sixteen dual-rail pipelines / bundled-data
  • MANGO (Bjerregaard’05) & QNoC (Dobkin’09)
      – Bundled-data
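
A rough illustration of why a 32-bit channel maps onto sixteen 1-of-4 pipelines: a 1-of-4 code carries two bits on four wires with exactly one wire high, so a 32-bit word splits into 16 two-bit symbols. The sketch below assumes least-significant symbol first and a one-hot wire index equal to the symbol value; the function names are illustrative and not taken from the slides.

```python
def encode_1of4(word, width=32):
    """Split a word into 2-bit symbols and one-hot encode each on 4 wires."""
    groups = []
    for k in range(width // 2):
        symbol = (word >> (2 * k)) & 0b11          # 2-bit slice number k
        groups.append([1 if i == symbol else 0 for i in range(4)])
    return groups


def decode_1of4(groups):
    """Rebuild the word; every group must have exactly one wire high."""
    word = 0
    for k, wires in enumerate(groups):
        if sum(wires) != 1:
            raise ValueError(f"group {k} is not a valid 1-of-4 code word")
        word |= wires.index(1) << (2 * k)
    return word


groups = encode_1of4(0xDEADBEEF)
print(len(groups), "groups of 4 wires")
```

An all-zero group is the spacer (return-to-zero) state of the 4-phase protocol, which is why `decode_1of4` rejects it as data.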

SLIDE 7

Completion Detection

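[Figure: synchronized pipeline stage with 16 two-bit sub-channels. C-elements latch each sub-channel, and completion detectors (CD) merge the 16 sub-channel acks (16, 8, 4, ...) into a single ack. Signals: d_i, d_o, ack_i, ack_o.]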

Advantages: data on all sub-channels are synchronized, which eases time-division multiplexing techniques such as virtual channels and time-division multiple access (TDMA).

Drawbacks: low speed (66% on CD).
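
A minimal behavioural sketch of completion detection, assuming a binary tree of Muller C-elements over per-sub-channel validity signals (a 1-of-4 group counts as valid when any wire is high). The class names and the tree shape are assumptions for illustration; the slides do not give the gate-level design.

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class CompletionDetector:
    """Tree of C-elements; n_sub must be a power of two (16 gives 8+4+2+1)."""

    def __init__(self, n_sub=16):
        self.levels = []
        n = n_sub
        while n > 1:
            n //= 2
            self.levels.append([CElement() for _ in range(n)])

    def update(self, groups):
        # One validity bit per 1-of-4 group, then reduce pairwise level by level.
        signals = [int(any(wires)) for wires in groups]
        for level in self.levels:
            signals = [c.update(signals[2 * i], signals[2 * i + 1])
                       for i, c in enumerate(level)]
        return signals[0]


cd = CompletionDetector()
empty, valid = [0, 0, 0, 0], [0, 1, 0, 0]
partial = cd.update([valid] * 15 + [empty])   # one sub-channel still empty
full = cd.update([valid] * 16)                # every sub-channel has data
held = cd.update([empty] + [valid] * 15)      # spacer arriving: output holds
cleared = cd.update([empty] * 16)             # all sub-channels back to zero
```

The single output only changes once all 16 sub-channels agree, which is what keeps them synchronized and also what puts the whole tree on the critical path of every cycle.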

SLIDE 8

ChSlice: implementation

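[Figure: channel-sliced pipeline stage. Each of the 16 two-bit sub-channels has its own data and ack wires (d_i0/ack_i0 ... d_i15/ack_i15 in, d_o0/ack_o0 ... d_o15/ack_o15 out) and its own C-elements, shown next to the synchronized stage with shared CD (d_i, d_o, ack_i, ack_o).]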
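
A toy timing model of the difference between a shared ack and per-sub-channel acks. All delays are hypothetical numbers chosen for illustration, and `cd_delay` is an assumed per-item cost for the completion detection tree; none of these values come from the slides.

```python
import random


def synchronized_time(delays, cd_delay=0.0):
    """Shared ack: every item waits for its slowest sub-channel plus the CD tree.

    delays[t][k] is the handshake delay of sub-channel k for item t.
    """
    return sum(max(step) + cd_delay for step in delays)


def sliced_time(delays):
    """Per-sub-channel ack: each sub-channel runs through its items on its own."""
    n_sub = len(delays[0])
    return max(sum(step[k] for step in delays) for k in range(n_sub))


small = [[1, 3], [3, 1]]                      # two items, two sub-channels
random.seed(0)
big = [[random.uniform(1.0, 2.0) for _ in range(16)] for _ in range(100)]
print(synchronized_time(small), sliced_time(small))
```

In this model the sliced channel can never be slower, because a sum of per-item maxima is at least the maximum of per-sub-channel sums. The model leaves out the re-synchronization that a router still needs at frame boundaries.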

SLIDE 9

How to do it in a router?

[Figure: two router diagrams, each with arbiters (with connections to other ports), crossbars, the data path and the ack path.]

SLIDE 10

Flow control

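[Figure: flow control, sub-channels against time. Sliced channel: six sub-channels are drawn, each carrying its own head (H), six data (D) and tail (T) symbols, with phases labelled head, routing and data. For comparison, a single frame H D D D D D D T with the same phase labels.]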
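
The flow-control slide shows every sub-channel carrying its own head (H), data (D) and tail (T) symbols. A minimal sketch of that idea, assuming 32-bit flits cut into 16 two-bit slices and explicit "H"/"D"/"T" tags; the tags and function names are a simplification for illustration, not the router's actual signalling.

```python
def slice_frame(flits, n_sub=16):
    """Return one symbol stream per sub-channel for a frame of 32-bit flits.

    flits[0] is treated as the head and flits[-1] as the tail.
    """
    streams = [[] for _ in range(n_sub)]
    last = len(flits) - 1
    for t, flit in enumerate(flits):
        kind = "H" if t == 0 else ("T" if t == last else "D")
        for k in range(n_sub):
            streams[k].append((kind, (flit >> (2 * k)) & 0b11))
    return streams


def merge_streams(streams):
    """Reassemble the flits once every sub-channel has delivered its stream."""
    flits = []
    for t in range(len(streams[0])):
        flit = 0
        for k, stream in enumerate(streams):
            flit |= stream[t][1] << (2 * k)
        flits.append(flit)
    return flits


frame = [0x00000012, 1, 2, 3, 4, 5, 6, 0xFFFFFFFF]   # head, six data, tail
streams = slice_frame(frame)
kinds = "".join(kind for kind, _ in streams[0])
```

Because each stream is self-delimiting, a sub-channel can be forwarded as soon as its own head slice has been routed, without waiting for the slices on the other sub-channels.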

SLIDE 11

Content

  • Asynchronous NoCs
  • Channel Slicing
      – Motivation
      – Sliced sub-channels
      – Flow control
  • An asynchronous wormhole router
      – Implementation details
      – Performance

SLIDE 12

Router: structure

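[Figure: router structure with 5 input ports and 5 output ports, arbiters and control (ctl) blocks. Each port has a data bus labelled 80 and an ack bus labelled 16: d_i_0/ack_i_0 ... d_i_4/ack_i_4 on the inputs, d_o_0/ack_o_0 ... d_o_4/ack_o_4 on the outputs.]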

SLIDE 13

Router: data path

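[Figure: router data path through input buffer, crossbar and output buffer. Data signals: ip_d, ib_d, ic_d, oc_d, ob_d, op_d. Ack signals: ip_a, ib_a, ib_pa, ic_a, oc_a, ob_a, op_a. Control signals: rt_err, acken, gnt, eof. Buffer stages are labelled eof, 3, 2, 1.]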

SLIDE 14

Re-Synchronization

[Figure: the router data path (input buffer, crossbar, output buffer, with signals ip_d, ib_d, ic_d, oc_d, ob_d, op_d, ip_a, ib_a, ib_pa, ic_a, oc_a, ob_a, op_a, rt_err, acken, gnt, eof) together with a signal transition graph for re-synchronization.]

Signals: eof, acken, ch_fin, ic_a, rt_err, rt_dec

Transitions shown, for a normal frame and a faulty frame: rt_dec+ eof+/1 acken+/1 eof-/1 ch_fin-/1 ic_a+ ic_a- acken-/1 ch_fin+/1 rt_dec+ rt_err+ acken-/2 eof+/2 acken+/2 eof-/2 ch_fin+/2 rt_err- ch_fin-/2

SLIDE 15

Routing Decision

[Figure: signal transition graph and circuit for the routing decision.]

Signals: rt_dec, ch_fin0 ... ch_fin15, ch_fin_a, rt_en, rt_err

Transitions shown, for a normal frame and a faulty frame: rt_dec+ rt_en-/1 ch_fin_a+/1 ch_fin_a- rt_en+ rt_err+ rt_en-/2 ch_fin_a+/2 rt_dec- rt_err-

Circuit: the address slices ib_d0[0..3] ... ib_d3[0..3] (with acks ib_a0 ... ib_a3) feed two 4-bit (1-of-4) comparators that compare target_x with local_x and target_y with local_y (>, <, =), together with ch_fin_a, rt_dec and rt_err. Request labels: ir_n, ir_e, ir_w, ir_l and or_s, or_w, or_n, or_l. The east arbiter is built from ME elements, is enabled by rt_en, produces gnt_s, gnt_l, gnt_n and gnt_w, and receives gnts from other ports.
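
A minimal sketch of the decision the two comparators make, using XY (dimension-ordered) routing as named on the router layout slide. The direction convention is an assumption for illustration: x is taken to grow towards east and y towards south. The port letters follow the n/e/w/s/l suffixes used in the signal names.

```python
def xy_route(target_x, target_y, local_x, local_y):
    """Pick an output port: resolve the X offset first, then the Y offset."""
    if target_x > local_x:
        return "e"          # assumed: larger x lies to the east
    if target_x < local_x:
        return "w"
    if target_y > local_y:
        return "s"          # assumed: larger y lies to the south
    if target_y < local_y:
        return "n"
    return "l"              # both coordinates match: deliver to the local port


for target in [(2, 1), (0, 3), (1, 3), (1, 0), (1, 1)]:
    print(target, "->", xy_route(*target, 1, 1))
```

Only the three comparison outcomes per dimension (>, <, =) are needed, which matches the comparator outputs labelled in the figure.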

SLIDE 16

Router: layout

  • Faraday 130 nm technology
  • 32-bit, 5 ports, XY routing algorithm
  • 0.3 mm × 0.3 mm (12.6K gates, 0.050 mm²)
  • Typical corner (25 °C, 1.2 V)
  • Cycle period 2.2 ns (1.82 GByte/s per port)
  • Equivalent to 450 MHz
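
The throughput and equivalent frequency follow directly from the cycle period and data width. A quick check, assuming one 32-bit word is transferred per cycle on each port:

```python
CYCLE_NS = 2.2      # cycle period from the slide
DATA_BITS = 32      # data width from the slide

bytes_per_cycle = DATA_BITS / 8
throughput_gbyte_s = bytes_per_cycle / CYCLE_NS   # bytes per ns equals GByte/s
equiv_freq_mhz = 1e3 / CYCLE_NS                   # 1 / period, in MHz

print(f"{throughput_gbyte_s:.2f} GByte/s per port, {equiv_freq_mhz:.0f} MHz")
```

This gives about 1.82 GByte/s and roughly 455 MHz, which the slide rounds to 450 MHz.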

SLIDE 17

Compare with other routers

Router                 Tech (nm)  Period (ns)  Frequency (Hz)  Pipeline style            Other
Sliced Wormhole        130        2.2          450M            4-phase 1-of-4            Standard cell
Synchronized Wormhole  130        2.8          360M            4-phase 1-of-4            Standard cell
ANoC                   130        4.0          250M            4-phase 1-of-4            Customized cell lib
ASPIN                  90         0.88         1.13G           Dual-rail / bundled-data  Customized FIFO
QNoC                   180        4.8          208M            Bundled-data              Delay line
MANGO                  120        1.26         790M            Bundled-data              Delay line
DSPIN                  130        2.45         408M            Synchronous circuit       -

SLIDE 18

Speed vs. Data Width

[Figure: speed vs. data width for QNoC and the sliced wormhole router.]

SLIDE 19

Speed and Area

SLIDE 20

Questions?