 
From Channel Slicing to Spatial Division Multiplexing
-- the asynchronous router design --
Wei Song
03/12/2009
Advanced Processor Technology Group 2009-12-2 The School of Computer Science

Index
• Channel Slicing
– Asynchronous NoCs and routers
– Channel Slicing
– A wormhole router design
• Spatial Division Multiplexing (SDM)
– Motives
– Switching networks
– 2-stage Clos network
– The distributed scheduler
– Implementation results

Asynchronous NoCs
• GALS
• Fully asynchronous communication fabric
• QDI pipelines
• Low dynamic power
• Tolerance to variation
• Fast prototyping

Synchronised QDI Pipelines
[Figure: synchronised QDI pipeline; labels: 8, 4, 1-of-4, Nangate Cell Lib 65 nm]

Channel Slicing (1)
• Remove the C-element tree
• Sub-channels run independently

Channel Slicing (2)

Channel Slicing (3)
[Figure label: sub-channels]

The Wormhole Router
• Faraday 130 nm
• 5 32-bit ports
• 3 routers:
– Synchronised
– Channel Sliced
– Plus lookahead
[Figure: router with 5 input ports and 5 output ports (0 to 4); data buses d_i_0 ... d_i_4 and d_o_0 ... d_o_4 are 80 wires wide, acknowledge buses ack_i_0 ... ack_i_4 and ack_o_0 ... ack_o_4 are 16 wires wide; arbiter and ctl blocks; labels N, N+1, N+2]

Area Results
• Channel Slicing: 23%
– extra controllers in input buffer
– increased wire count in crossbar
• Lookahead: 5.3%
– extra AND gates and C2P elements on critical path

Speed Results
• Synchronised: 345 MHz
• Channel Slicing: 450 MHz
• ChSlice+LH: 590 MHz

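The three frequencies above can be turned into cycle periods and relative speed-ups with a few lines of arithmetic. The sketch below is only an illustration; the dictionary and function names are hypothetical, and the only data used are the three figures on the slide.

```python
# Frequencies (MHz) as reported on the Speed Results slide.
FREQ_MHZ = {
    "Synchronised": 345,
    "Channel Slicing": 450,
    "ChSlice+LH": 590,
}


def cycle_period_ns(freq_mhz):
    """Cycle period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def speedup(name, baseline="Synchronised"):
    """Frequency ratio of a router relative to the baseline router."""
    return FREQ_MHZ[name] / FREQ_MHZ[baseline]


if __name__ == "__main__":
    for name, freq in FREQ_MHZ.items():
        print(f"{name}: {cycle_period_ns(freq):.2f} ns, "
              f"{speedup(name):.2f}x vs Synchronised")
```

On these numbers, channel slicing alone is roughly 1.3 times faster than the synchronised router, and adding lookahead brings it to roughly 1.7 times.
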
Compare with Other Routers
• Asynchronous cell library: constrains the adaptation to other projects (ANoC, ASPIN)
• Bundled-data: less tolerant to variation (MANGO, QNoC, ASPIN)
• Customized design: design complexity (ASPIN)

Data Width Effect
[Figure: cycle period (ns) versus data width of ports (bit, 0 to 140) for the Synchronised, Channel Slicing, and ChSlice + LH routers]

Index (current section: Spatial Division Multiplexing (SDM))

SDM: Motivation (1)
• The problems that the wormhole router cannot handle:
– QoS, delay and throughput guaranteed services
– Fault-tolerance
– Network efficiency

Motivation (2)
[Figure: two router block diagrams. Wormhole (virtual channel): input ports 0 to P-1 of width W with input buffers, a switch allocator, and a PxP crossbar feeding output ports 0 to P-1. SDM: input ports 0 to P-1 with input buffers split into M sub-channels of width W/M, a switch scheduler, and an MPxMP switching network feeding output ports 0 to P-1]

Motivation (3) – Problems of VC
• Pipelines are synchronised
• Area overhead
• QoS (complicated arbiters)
• TDMA (time slot definition)
• Fault-tolerance (partial faulty link)
[Figure: VC router with input buffers (M virtual channels per port, width W), VC allocator, switch allocator, and PxP crossbar between input ports 0 to P-1 and output ports 0 to P-1]

Motivation (4) – Benefits of SDM
• Delay and throughput guarantee
• Fault-tolerance
• Speed (Channel slicing)
• Area
• Link efficiency
– interrupts
[Figure: SDM router with input buffers (M sub-channels of width W/M per port), switch scheduler, and MPxMP switching network between input ports 0 to P-1 and output ports 0 to P-1]

Motivation (4) – Problems of SDM
• Area overhead
– Crossbar: $C_{CB} = P^2 \times W$
– SDM: $C_{SDM} = M \times P^2 \times W$
• Scheduling Algorithm
– Wormhole (P to 1)
– SDM (MP to M)

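A quick numeric check of the area overhead may help here. The sketch below assumes the cost model reads C_CB = P^2 x W for the wormhole crossbar and C_SDM = M x P^2 x W for the SDM switch, which is the same as an MPxMP switch built from sub-channels of width W/M. The function names and the choice M = 4 are hypothetical; P = 5 and W = 32 follow the 5-port, 32-bit router described earlier in the talk.

```python
def crossbar_cost(p, w):
    """Cost of a PxP crossbar with W-bit ports: P^2 * W."""
    return p * p * w


def sdm_cost(m, p, w):
    """Cost of an SDM switch: M * P^2 * W.

    Equivalent to an (M*P) x (M*P) switch whose ports are W/M bits wide:
    (M*P)^2 * (W/M) = M * P^2 * W.
    """
    return m * p * p * w


if __name__ == "__main__":
    p, w, m = 5, 32, 4  # m = 4 is an illustrative value, not from the slides
    cb = crossbar_cost(p, w)
    sdm = sdm_cost(m, p, w)
    print(f"crossbar: {cb}, SDM: {sdm}, ratio: {sdm / cb:.0f}x")
```

Under this model the switch cost grows linearly with the number of sub-channels M, which is why the later slides look for switching networks cheaper than a full crossbar.
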
Index (current section: SDM, Switching networks)

SDM: Switching Networks
• Strict Non-Blocking (SNB)
– An input port and an output port are always connectable
• Rearrangeable Non-Blocking (RNB)
– An input port and an output port are connectable, possibly after changing existing connections
• Blocking
– Some input ports and output ports cannot be connected in certain cases

Crossbar
• SNB
• $C_{CB} = N^2 \times W$

Clos Network
• SNB/RNB
• C(m, n, k), N = nk
• SNB: m >= 2n-1
• RNB: m = n

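The non-blocking conditions above can be checked mechanically. The sketch below assumes the standard symmetric 3-stage Clos layout for C(m, n, k): k input modules of size n x m, m central modules of size k x k, and k output modules of size m x n. It also uses the general rearrangeability condition m >= n, of which the slide's m = n is the minimal case. The function names are hypothetical, and the crosspoint count is per bit of port width.

```python
def clos_class(m, n):
    """Classify a Clos network C(m, n, k) by its blocking behaviour."""
    if m >= 2 * n - 1:
        return "SNB"       # strictly non-blocking
    if m >= n:
        return "RNB"       # rearrangeably non-blocking
    return "blocking"


def clos_crosspoints(m, n, k):
    """Crosspoints of a symmetric 3-stage Clos network with N = n * k ports."""
    input_stage = k * (n * m)    # k modules, each n x m
    central_stage = m * (k * k)  # m modules, each k x k
    output_stage = k * (m * n)   # k modules, each m x n
    return input_stage + central_stage + output_stage


if __name__ == "__main__":
    n, k = 4, 4  # N = 16 ports, illustrative values
    for m in (4, 7):
        print(m, clos_class(m, n), clos_crosspoints(m, n, k))
    print("crossbar", (n * k) ** 2)
```

For N = 16 the rearrangeable configuration (m = 4) needs fewer crosspoints than a 16 x 16 crossbar, while the strictly non-blocking one (m = 7) needs more, so the area benefit at small N depends on accepting rearrangement.
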
Benes Network
• Multi-stage Clos
• C(2,2,4) + 2C(2,2,2)
• RNB (m = n = 2 meets the rearrangeable condition, not the strict one)

Area of Switching Networks

Problems of all Switching Networks
• Crossbar
– Area ~ $N^2$
– Easy to schedule
• Clos
– Area ~ $N^{1.5}$
– Difficult but possible to schedule by hardware
– Optimal area is reached when …
• Benes
– Area ~ $N \log N$
– Impossible to schedule by hardware (requires a microprocessor)
– Optimal area is reached when $N = 2^n$

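To get a feel for how far apart these growth rates are, the following sketch evaluates the three asymptotic terms for a few port counts. It is only an order-of-magnitude illustration: constant factors and port width are ignored, the logarithm is taken base 2 as an assumption, and the function name is hypothetical.

```python
import math


def area_estimates(n_ports):
    """Asymptotic area terms (constants dropped) for an N-port network."""
    return {
        "crossbar": n_ports ** 2,                 # ~ N^2
        "clos": n_ports ** 1.5,                   # ~ N^1.5
        "benes": n_ports * math.log2(n_ports),    # ~ N log N (base 2 assumed)
    }


if __name__ == "__main__":
    for n in (16, 64, 256):
        est = area_estimates(n)
        print(n, {name: round(value) for name, value in est.items()})
```

The gap between the crossbar and the other two widens quickly with N, while the Clos and Benes terms stay within a small factor of each other at these sizes, which is consistent with choosing the network that is easier to schedule.
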
Index (current section: SDM, 2-stage Clos network)

SDM: 2-stage Clos Network
[Figure: 2-stage Clos switch with input modules SIM, WIM, NIM, EIM, LIM, central modules CM(0) ... CM(r) ... CM(M-1), and output modules SOM, WOM, NOM, EOM, LOM; link bundles are labelled M and 5]

Area Comparison

Benefits of the 2-stage Clos Network
• Minimal area when M <= 16
• Only two stages, so latency is reduced
• Latency is bounded
• The scheduling algorithm is also simplified
• The CMs could be further reduced
• It is an RNB network; an SNB network requires 3 stages

Index (current section: SDM, The distributed scheduler)

SDM: Scheduling Algorithms
• Optimized algorithms
– Always reach the optimal configuration, in which every possible connection is configured
– Time complexity $O(N^2)$
– Normally software based ([Leroy 2008]: microprocessor, 64 ports, 50 µs)
• Heuristic algorithms
– Capable of configuring part of the possible connections with less time and area
– Time complexity $O(N)$ ~ $O(\log N)$
– Normally hardware implementable, distributed, and scalable

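As a rough illustration of the heuristic category, the sketch below grants requests in a single pass: each request asks for any free sub-channel on one output port, and it is blocked if all M sub-channels of that port are taken. This is a generic greedy scheme written for illustration, not the distributed scheduler presented in this talk. It ignores central-module selection inside a Clos network, and all names (`greedy_schedule`, the request labels) are hypothetical.

```python
def greedy_schedule(requests, num_ports, m):
    """Single-pass greedy allocation of output sub-channels.

    requests: list of (input_id, output_port) pairs, served in order.
    num_ports: number of output ports (P).
    m: sub-channels per output port (M).
    Returns (grants, blocked): grants maps input_id to
    (output_port, sub_channel); blocked lists the refused input_ids.
    """
    free = [list(range(m)) for _ in range(num_ports)]
    grants = {}
    blocked = []
    for input_id, out_port in requests:
        if free[out_port]:
            # Take the lowest-numbered free sub-channel on that port.
            grants[input_id] = (out_port, free[out_port].pop(0))
        else:
            blocked.append(input_id)
    return grants, blocked


if __name__ == "__main__":
    reqs = [("N0", 2), ("E0", 2), ("S0", 2), ("W1", 0)]
    grants, blocked = greedy_schedule(reqs, num_ports=5, m=2)
    print(grants)
    print(blocked)
```

Each request is examined once, so the running time is linear in the number of requests. The price is that a request may be blocked even when a different assignment order would have served it, which is the trade-off the slide describes for heuristic algorithms.
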
Synchronous Dispatch Algs.
