
From Channel Slicing to Spatial Division Multiplexing: the asynchronous router design. Wei Song, 03/12/2009, Advanced Processor Technology Group


SLIDE 1

2009-12-2 Advanced Processor Technology Group The School of Computer Science

From Channel Slicing to Spatial Division Multiplexing

  • the asynchronous router design

Wei Song 03/12/2009

SLIDE 2

Index

  • Channel Slicing

– Asynchronous NoCs and routers
– Channel Slicing
– A wormhole router design

  • Spatial Division Multiplexing (SDM)

– Motives
– Switching networks
– 2-stage Clos network
– The distributed scheduler
– Implementation results

SLIDE 3

Asynchronous NoCs

  • GALS
  • Fully asynchronous communication fabric
  • QDI pipelines
  • Low dynamic power
  • Tolerance to variation
  • Fast prototype

SLIDE 4

Synchronised QDI Pipelines

[Figure: synchronised QDI pipelines; Nangate cell library, 65 nm, 1-of-4 encoding]

SLIDE 5

Channel Slicing (1)

  • Remove the C-element tree
  • Sub-channels run independently

SLIDE 6

Channel Slicing (2)

SLIDE 7

Channel Slicing (3)

sub-channels

SLIDE 8

The Wormhole Router

[Figure: wormhole router block diagram with 5 input ports, 5 output ports, control (ctl) blocks, and arbiters; signals d_i_0/ack_i_0 … d_i_4/ack_i_4 and d_o_0/ack_o_0 … d_o_4/ack_o_4; bus labels 80 and 16; pipeline stages N, N+1, N+2]

  • Faraday 130 nm
  • 5 32-bit ports
  • 3 routers:

– Synchronised
– Channel Sliced
– Plus lookahead

SLIDE 9

Area Results

  • Channel Slicing: 23%

– extra controllers in input buffer
– increased wire count in crossbar

  • Lookahead: 5.3%

– extra AND gates and C2P elements on critical path

SLIDE 10

Speed Results

  • Synchronised: 345 MHz
  • Channel Slicing: 450 MHz
  • ChSlice+LH: 590 MHz
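The three reported frequencies can be turned into cycle periods and relative speed-ups with simple arithmetic. The sketch below is illustrative only; the function and dictionary names are assumptions, not part of the slides.

```python
# Frequencies reported on the Speed Results slide, in MHz.
FREQ_MHZ = {
    "Synchronised": 345.0,
    "Channel Slicing": 450.0,
    "ChSlice+LH": 590.0,
}


def cycle_period_ns(freq_mhz: float) -> float:
    """Cycle period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def speedup(freq_mhz: float, baseline_mhz: float) -> float:
    """Ratio of a frequency to a baseline frequency."""
    return freq_mhz / baseline_mhz


if __name__ == "__main__":
    base = FREQ_MHZ["Synchronised"]
    for name, freq in FREQ_MHZ.items():
        print(f"{name}: {cycle_period_ns(freq):.2f} ns, "
              f"{speedup(freq, base):.2f}x vs Synchronised")
```

Under these figures, channel slicing alone is roughly 1.30 times the synchronised frequency, and channel slicing with lookahead roughly 1.71 times.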

SLIDE 11

Compare with Other Routers

  • Asynchronous cell library: constrains the adaptation to other projects (ANoC, ASPIN)
  • Bundled-data: less tolerant to variation (MANGO, QNoC, ASPIN)
  • Customized design: design complexity (ASPIN)

SLIDE 12

Data Width Effect

[Figure: cycle period (ns) versus data width of ports (bit) for the Synchronised, Channel Slicing, and ChSlice + LH routers]

SLIDE 13

Index

  • Channel Slicing

– Asynchronous NoCs and routers
– Channel Slicing
– A wormhole router design

  • Spatial Division Multiplexing (SDM)

– Motives
– Switching networks
– 2-stage Clos network
– The distributed scheduler
– Implementation results

SLIDE 14

SDM: Motivation (1)

  • The problems that the wormhole router cannot handle:

– QoS, delay and throughput guaranteed services
– Fault-tolerance
– Network efficiency

SLIDE 15

Motivation (2)

[Figure: three router architectures compared: Wormhole, Virtual Channel, and SDM. Labels: Input Buffer; Input Port 0 … Input Port P-1; Output Port 0 … Output Port P-1; Switch Allocator with P×P Crossbar and width W; Switch Scheduler with MP×MP Switching Network, M, and width W/M]

SLIDE 16

Motivation (3) – Problems of VC

  • Pipelines are synchronised
  • Area overhead
  • QoS (complicated arbiters)
  • TDMA (time slot definition)
  • Fault-tolerance (partially faulty link)

[Figure: virtual channel router: Input Buffer, VC Allocator, Switch Allocator, P×P Crossbar; Input Port 0 … Input Port P-1; Output Port 0 … Output Port P-1; labels M and W]

SLIDE 17

Motivation (4) – Benefits of SDM

  • Delay and throughput guarantee
  • Fault-tolerance
  • Speed (Channel slicing)
  • Area
  • Link efficiency

– interrupts

[Figure: SDM router: Input Buffer, Switch Scheduler, MP×MP Switching Network; Input Port 0 … Input Port P-1; Output Port 0 … Output Port P-1; labels M and W/M]

SLIDE 18

Motivation (4) – Problems of SDM

  • Area overhead
  • Scheduling Algorithm

– Wormhole (P to 1)
– SDM (MP to M)

$C_{CB} = P^2 \times W$

$C_{SDM} = M \times P^2 \times W$
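The two switch costs on this slide, $C_{CB} = P^2 \times W$ for the wormhole crossbar and $C_{SDM} = M \times P^2 \times W$ for the SDM switch, can be checked numerically. The sketch below assumes the SDM switch is an MP×MP crossbar carrying W/M-bit circuits, which is one reading of the diagrams in the motivation slides; the function names and the unit of cost are illustrative.

```python
def crossbar_cost(ports: int, width: int) -> int:
    """Wormhole crossbar cost, C_CB = P^2 * W."""
    return ports * ports * width


def sdm_cost(ports: int, width: int, m: int) -> int:
    """SDM switch cost for an (M*P) x (M*P) crossbar of W/M-bit circuits.

    (M*P)^2 * (W/M) simplifies to M * P^2 * W, so C_SDM = M * C_CB.
    """
    if width % m != 0:
        raise ValueError("width must be divisible by m")
    return (m * ports) ** 2 * (width // m)


if __name__ == "__main__":
    # Example values: 5 ports, 64-bit ports, 4 virtual circuits per port.
    print(crossbar_cost(5, 64))   # 1600
    print(sdm_cost(5, 64, 4))     # 6400
```

With these example values the SDM switch costs M = 4 times the wormhole crossbar, which is the area overhead the slide lists as a problem.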

SLIDE 19

Index

  • Channel Slicing

– Asynchronous NoCs and routers
– Channel Slicing
– A wormhole router design

  • Spatial Division Multiplexing (SDM)

– Motives
– Switching networks
– 2-stage Clos network
– The distributed scheduler
– Implementation results

SLIDE 20

SDM: Switching Networks

  • Strict Non-Blocking (SNB)

– An input port and an output port are always connectable

  • Rearrangeable Non-Blocking (RNB)

– An input port and an output port are connectable, possibly after changing existing connections

  • Blocking

– In certain cases, some input ports and output ports cannot be connected

SLIDE 21

Crossbar

  • SNB

$C_{CB} = N^2 \times W$

SLIDE 22

Clos Network

  • SNB/RNB
  • $C(m, n, k)$, $N = nk$
  • SNB: $m \ge 2n - 1$
  • RNB: $m = n$
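The conditions on this slide can be expressed as a small classifier. This is a sketch under stated assumptions: $C(m, n, k)$ is read as m middle-stage switches, n inputs per input-stage switch, and k input-stage switches (the slide does not spell this out), and the rearrangeable threshold is taken as the standard $m \ge n$, of which the slide's $m = n$ is the minimal case. The function names are hypothetical.

```python
def clos_ports(n: int, k: int) -> int:
    """Total number of ports, N = n * k."""
    return n * k


def classify_clos(m: int, n: int) -> str:
    """Classify a Clos network C(m, n, k) by its blocking behaviour.

    SNB (strictly non-blocking) when m >= 2n - 1,
    RNB (rearrangeably non-blocking) when n <= m < 2n - 1,
    otherwise blocking.
    """
    if m >= 2 * n - 1:
        return "SNB"
    if m >= n:
        return "RNB"
    return "blocking"


if __name__ == "__main__":
    for m, n, k in [(3, 2, 4), (2, 2, 4), (1, 2, 4)]:
        print(f"C({m},{n},{k}): N = {clos_ports(n, k)}, {classify_clos(m, n)}")
```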

SLIDE 23

Benes Network

  • Multi-stage Clos
  • $C(2,2,4) + 2\,C(2,2,2)$
  • RNB

SLIDE 24

Area of Switching Networks

SLIDE 25

Problems of all Switching Networks

  • Crossbar

– Area ~ $N^2$
– Easy to schedule

  • Clos

– Area ~ $N^{1.5}$
– Difficult but possible to schedule by hardware
– Optimal area is reached when

  • Benes

– Area ~ $N \log N$
– Impossible to schedule by hardware (microprocessor)
– Optimal area is reached when $N = 2^n$
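The area trends listed here can be illustrated by counting crosspoints. The sketch below uses textbook crosspoint counts rather than the author's area model: a symmetric three-stage Clos with k input switches (n×m), m middle switches (k×k), and k output switches (m×n), and a Benes network built from 2×2 switches with 4 crosspoints each. All function names are hypothetical, and the port width W is left out because it scales every count equally.

```python
import math


def crossbar_crosspoints(n_ports: int) -> int:
    """Crossbar: N^2 crosspoints."""
    return n_ports * n_ports


def clos_crosspoints(m: int, n: int, k: int) -> int:
    """Symmetric 3-stage Clos C(m, n, k): 2*k*n*m + m*k^2 crosspoints."""
    return 2 * k * n * m + m * k * k


def benes_crosspoints(n_ports: int) -> int:
    """Benes network of 2x2 switches; requires N to be a power of two."""
    if n_ports < 2 or n_ports & (n_ports - 1):
        raise ValueError("Benes network needs N = 2^n ports")
    log_n = n_ports.bit_length() - 1
    stages = 2 * log_n - 1
    return stages * (n_ports // 2) * 4


if __name__ == "__main__":
    for n_ports in (16, 64, 256):
        n = math.isqrt(n_ports)      # inputs per input-stage switch
        k = n_ports // n             # number of input-stage switches
        print(n_ports,
              crossbar_crosspoints(n_ports),
              clos_crosspoints(n, n, k),   # rearrangeable case, m = n
              benes_crosspoints(n_ports))
```

For small N the multi-stage networks save little or nothing over a crossbar; the gap only opens as N grows.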

SLIDE 26

Index

  • Channel Slicing, the wormhole router
  • Spatial Division Multiplexing (SDM)

– Motives
– Switching networks
– 2-stage Clos network
– The distributed scheduler
– Implementation results

SLIDE 27

SDM: 2-stage Clos Network

[Figure: 2-stage Clos network. Input modules SIM, WIM, NIM, EIM, and LIM (each labelled M, M) connect to central modules CM(0) … CM(r) … CM(M-1) (each labelled 5, 5), which connect to outputs SOM, WOM, NOM, EOM, and LOM]
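To make the structure concrete, here is a minimal link-occupancy model of a 2-stage Clos switch. It assumes each input module (IM) has one link to each of the M central modules (CM), and each CM has one output lane to every output port; this is a reading of the figure, not a description of the author's scheduler. The class name, method names, and the greedy first-fit policy are all hypothetical.

```python
from typing import List, Optional


class TwoStageClos:
    """Link-occupancy model of a 2-stage Clos switch (illustrative only)."""

    def __init__(self, ports: int = 5, m: int = 4) -> None:
        self.ports = ports
        self.m = m
        # im_link[i][r] is True when the link from IM i to CM r is in use.
        self.im_link: List[List[bool]] = [[False] * m for _ in range(ports)]
        # cm_out[r][j] is True when CM r's lane to output port j is in use.
        self.cm_out: List[List[bool]] = [[False] * ports for _ in range(m)]

    def connect(self, src: int, dst: int) -> Optional[int]:
        """Reserve the first CM whose two links are free; None if blocked."""
        for r in range(self.m):
            if not self.im_link[src][r] and not self.cm_out[r][dst]:
                self.im_link[src][r] = True
                self.cm_out[r][dst] = True
                return r
        return None

    def release(self, src: int, dst: int, r: int) -> None:
        """Free the path src -> CM r -> dst."""
        self.im_link[src][r] = False
        self.cm_out[r][dst] = False


if __name__ == "__main__":
    sw = TwoStageClos(ports=5, m=2)
    for src, dst in [(0, 1), (1, 2), (2, 1), (2, 2)]:
        print(f"{src} -> {dst}: CM {sw.connect(src, dst)}")
```

With M = 2, the last request in the demo is blocked even though input port 2 and output port 2 each have a free lane. Releasing the 1 -> 2 connection and re-establishing it after 2 -> 2 lets all four connections coexist, which is the sense in which such a network is rearrangeable rather than strictly non-blocking.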

SLIDE 28

Area Comparison

SLIDE 29

Benefits of the 2-stage Clos Network

  • Minimal area when M <= 16
  • Only 2 stages, so latency is reduced
  • Latency is bounded
  • Scheduling algorithm is also simplified
  • The CMs could be further reduced
  • It is an RNB network; an SNB network requires 3 stages

SLIDE 30

Index

  • Channel Slicing, the wormhole router
  • Spatial Division Multiplexing (SDM)

– Motives
– Switching networks
– 2-stage Clos network
– The distributed scheduler
– Implementation results

SLIDE 31

SDM: Scheduling Algorithms

  • Optimized algorithms

– Always reach the optimal configuration, in which every possible connection is configured
– Time complexity $O(N^2)$
– Normally software-based ([Leroy 2008]: microprocessor, 64 ports, 50 µs)

  • Heuristic algorithms

– Capable of configuring part of the possible connections with less time and area
– Time complexity $O(N)$ ~ $O(\log N)$
– Normally hardware implementable, distributed, and scalable

SLIDE 32

Synchronous Dispatch Algs.

SLIDE 33

Synchronous Dispatch Algs.

SLIDE 34

Synchronous Dispatch Algs.

SLIDE 35

Synchronous Dispatch Algs.

SLIDE 36

Synchronous Dispatch Algs.

SLIDE 37

Problems of Sync Algs.

  • Iterations are synchronised.
  • The requests from IMs are blind and greedy.
  • CMs are blind and greedy too.
  • Multiple requests are sent out by IMs.

SLIDE 38

Asynchronous Scheduling Alg.

SLIDE 39

Asynchronous Scheduling Alg.

SLIDE 40

Asynchronous Scheduling Alg.

SLIDE 41

Asynchronous Scheduling Alg.

SLIDE 42

Asynchronous Scheduling Alg.

  • IM scheduler and CM schedulers are independent
  • The scheduling algorithm can support an arbitrary number of CMs
  • Lower transition rate than synchronous schedulers

SLIDE 43

IM scheduler (1)

[Figure: IM scheduler circuit; signals IMr[0][0] … IMr[VCN-1][SN-1] and IMrb[0][0] … IMrb[VCN-1][SN-1]]

SLIDE 44

IM scheduler (2)

[Figure: IM scheduler circuit; signals h[i][j][k], IMr[j][i], IMrb[j][i], IMa[j], CMr[k][i], CMa[k][i], CMs[k][i], CMsb[i][k], CMrKeep[i][k][j], CMrMx[i][k][j], CMrME[k][i], cfgMx[j][k][i], cfg[j][k], with indices ranging over 0 … VCN-1, 0 … SN-1, and 0 … CMN-1]

SLIDE 45

CM scheduler

SLIDE 46

Index

  • Channel Slicing, the wormhole router
  • Spatial Division Multiplexing (SDM)

– Motives
– Switching networks
– 2-stage Clos network
– The distributed scheduler
– Implementation results

SLIDE 47

SDM: implementation (1)

  • Faraday 130 nm
  • Wormhole, SDM crossbar, and SDM Clos
  • 64-bit ports, 4 virtual circuits/port
  • Synthesized with Design Compiler
  • SystemVerilog testbench
  • Switches are back-annotated with latency from synthesis

SLIDE 48

SDM: implementation (2)

SLIDE 49

Network Performance (1)

SLIDE 50

Network Performance (2)

SLIDE 51

Conclusion of Results

  • SDM outperforms Wormhole with short frames and local traffic
  • The connection loss from SNB to RNB is significant
  • SDM is good at GT traffic; this work is the first step towards a QoS router
  • How to configure the SDM to settle GT paths is the next problem

SLIDE 52

References

  • Channel Slicing:

– ASP-DAC 2010.
– UK Async Forum, 2009.
– International Symposium on SOC, 2009.

  • SDM

– In submission to ASYNC 2010.

SLIDE 53

Questions?