

SLIDE 1

Architecture Design Principles for the Integration of Synchronization Interfaces into Network-on-Chip Switches

Daniele Ludovici – Alessandro Strano‡ – Davide Bertozzi‡
Computer Engineering Lab – TUDelft – NL

‡ MPSoC research group – UNIFE – Italy

daniele@ce.et.tudelft.nl

SLIDE 2

OUTLINE

• GALS Network-on-Chip design paradigm
• Different synchronization models
• Methodology towards synchronizer integration
• Tightly coupled mesochronous synchronizer
• Tightly coupled dual-clock FIFO
• Results: performance, area overhead, power consumption
• Conclusions

SLIDE 3

MOTIVATION

• There is little doubt today that a high-performance and cost-effective NoC in 45nm and beyond can be designed under a relaxed synchronization assumption
  • interconnect delay, process variation, etc.
• A possible solution: a GALS NoC
  • Processing blocks are separated and clocked independently
  • No global clock distribution => simplified timing closure
  • No rigid timing constraints between local clock domains

SLIDE 4

We chose one GALS implementation variant where the NoC is an independent clock domain

• Conscious use of area/power-expensive dual-clock FIFOs for the throughput-sensitive links to IP cores (used only at the network boundary)
• More compact mesochronous synchronizers are used inside the network
• Hierarchical clock tree synthesis to relieve clock phase offset constraints

GALS implementation

SLIDE 5

Mesochronous Synchronization

[Figure: hierarchical clock tree — a top tree (5% skew) feeding the bottom trees of domains 1…N (30–40% skew)]

A hierarchical clock tree with relaxed skew constraints might significantly decrease clock tree power and make the chip-wide NoC domain feasible

Challenge: implementing cost-effective mesochronous synchronization

[Source: MIRO-PANADES08]

SLIDE 6

SYNCHRONIZATION MODELS

• Single-transaction handshake design style
  • Acknowledgment for each data word
  • Latency for each data transfer and lower throughput
  • Requires good asynchronous design knowledge
  • Low maturity of EDA tools
• Source synchronous design style (our choice!)
  • The clock is routed along with the data it is going to strobe
  • Good for high data rates
  • Requires only an incremental effort with current EDA tool flows
  • Potentially area/power-hungry; reliability concerns

[Source: LATTARD07]

SLIDE 7

• With conventional design techniques, source synchronous interfaces are blocks external to the modules they synchronize => synchronization latency, area and power overhead are fully exposed
• Mitigate the synchronization overhead by co-designing the interface with the NoC submodules => taken to the limit: full merging

A STEP FORWARD

[Figure: loose coupling — a separate synchronization block between the incoming data+clock and the switch; the switch retains buffering and flow control]

SLIDE 8


A STEP FORWARD

[Figure: tight coupling — data+clock enter the switch directly; the input stage merges buffering, synchronization and flow control]

Major savings are achieved thanks to the sharing of the expensive buffers

SLIDE 9

Tightly coupled mesochronous synchronizer with the switch architecture

SLIDE 10

Underlying principle: Information can safely settle in the front-end latches before being sampled by the target domain clock

[Figure: proposed synchronizer — latches L_0…L_2 written under Clock_TX by a front-end counter; a 3x1 mux driven by a back-end counter and flip-flop FF_0 under Clock_RX produce Data out; data and forward flow control travel with Clock_TX]

Front-end:
• Clock_TX is used as a strobe signal for the data and flow control wires, thus avoiding the timing problems associated with the phase offset of the clock signals
• Sampling occurs through a number of latches used in rotating fashion, driven by a counter

Proposed synchronizer

SLIDE 11

[Figure: synchronizer datapath, as on the previous slide]

Back-end:
• Leverages the local clock of the RX domain
• Samples data from one of the front-end latches through multiplexing logic driven by a counter

Proposed synchronizer


SLIDE 12

[Figure: synchronizer datapath as before, with the Reset_RX signal added]

• 3 input latch banks ensure the timing constraints are safely met
  • the data stability window at the latch outputs is long enough to tolerate a wide range of clock phase offsets
  • a phase detector can be avoided: a unique bootstrap configuration deals with all phase skew scenarios
• Main challenge:
  • enforce timing margins for the NoC domain
  • study the implications of synchronizer integration into a NoC (e.g., flow control)

Proposed synchronizer
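The rotating-latch scheme on the last three slides can be sketched behaviorally. This is Python used as pseudocode for the RTL, not the actual design: the class and constant names are mine, and the counter bootstrap (the phase offset between the two counters) is abstracted into simply letting the front-end run ahead of the back-end.

```python
NUM_LATCHES = 3  # three latch banks, as in the figure

class MesochronousSync:
    """Behavioral sketch of the rotating-latch mesochronous synchronizer."""

    def __init__(self):
        self.latches = [None] * NUM_LATCHES
        self.wr = 0  # front-end counter, advanced by Clock_TX
        self.rd = 0  # back-end counter, advanced by Clock_RX

    def tx_edge(self, data):
        """Front-end: Clock_TX strobes data into the next latch in rotation."""
        self.latches[self.wr] = data
        self.wr = (self.wr + 1) % NUM_LATCHES

    def rx_edge(self):
        """Back-end: the 3x1 mux, driven by its own counter under Clock_RX,
        selects a latch whose content has had time to settle."""
        data = self.latches[self.rd]
        self.rd = (self.rd + 1) % NUM_LATCHES
        return data

# The bootstrap configuration guarantees the back-end never reads the
# latch currently being written; here the write side simply leads.
sync = MesochronousSync()
sync.tx_edge("flit0")
sync.tx_edge("flit1")
assert sync.rx_edge() == "flit0"  # reads the settled latch first
```

With three banks, the latch being read is always at least one full TX period old, which is why a wide range of phase offsets can be tolerated without a phase detector.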

SLIDE 13
Flow control implications considered

• xpipes comes with stall/go flow control: a 2-stage buffer at each switch input
• Optimization: the back-end flip-flop IS the switch input buffer
• At least a 4-slot buffer is needed to keep using stall/go
• A small single-bit synchronizer is needed to synchronize the backward flow control signal

Flow control
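The 4-slot figure follows from how stall/go works: the sender keeps emitting one flit per cycle until the stall notification reaches it, so the receiver must absorb every flit in flight during that round trip. A back-of-the-envelope sketch (the cycle counts below are illustrative assumptions, not numbers from the slides):

```python
def min_buffer_slots(forward_cycles, backward_cycles):
    """Slots needed under stall/go: one per cycle of stall round-trip
    delay, since the sender emits a flit each cycle until it sees the
    stall. Both arguments are in sender-clock cycles."""
    return forward_cycles + backward_cycles

# Assumed example: 2 cycles for data to cross the synchronizer forward,
# 2 cycles for the single-bit stall synchronizer backward -> 4 slots,
# consistent with the "at least 4 slots" requirement above.
assert min_buffer_slots(2, 2) == 4
```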

SLIDE 14

Optimization

[Figure: the synchronizer (latches L_0…L_2, counters, mux, FF_0, Reset_RX) feeding the SWITCH receiver — arbiter, crossbar and output buffers — with the FLOW CONTROL path highlighted]

• Why not bring flow control to the synchronizer latches as well, so that data can be stalled there without extra buffering in the switch?
• Why not use the synchronizer IN PLACE OF the switch input buffer altogether?

A multi-purpose switch input buffer (buffering, synchronization and flow control) might lead to large area/power savings and lower latency while preserving modularity

SLIDE 15

Optimization

[Figure: the synchronizer latches L_0…L_2 and 3x1 mux merged into the switch receiver, in front of the arbiter, crossbar and output buffers]


SLIDE 16

[Figure: tightly-coupled synchronizer inside the switch — the latch bank and 3x1 mux replace the input buffer in front of the arbiter, crossbar and output buffers]

Tightly-coupled synchronizer (in the switch architecture)

SLIDE 17

[Figure: DATA SYNCHRONIZER — latches Latch_0…Latch_2, counters and mux clocked by CLK_sender/CLK_receiver, acting as the switch input buffer toward the switch logic; CONTROL SYNCHRONIZER — counters driving the mux enable from the stall/go issued by the switch arbiter, plus flow control back to the switch sender]

Tightly-coupled synchronizer (in the switch architecture)

SLIDE 18

[Figure: waveforms of the tightly coupled synchronizer for TSKEW = 0 and TSKEW ≠ 0 — Clock_sender, Data_in, Latched_data_0…2, Clock_receiver, Data_out_Switch, Data_out_Mesochronous; the mux output feeds the switch output buffer]

SLIDE 19

SKEW TOLERANCE

• Setup time: from the beginning of the mux window to the rising edge of the sampling element.
• Hold time: from the rising edge of the sampling element to the end of the mux window.
• For the tightly coupled synchronizer these metrics are taken at the output buffer: Tarb + Txbar reduces the "setup time".
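The two margins above can be written down directly. This is an illustrative calculation only: all time values are made-up placeholders in clock-period units, not 65nm library numbers.

```python
def margins(window_start, window_end, sample_edge, t_arb=0.0, t_xbar=0.0):
    """Setup: start of mux window -> sampling edge, minus any switch logic
    delay (arbiter + crossbar) the data must also traverse in the tightly
    coupled case. Hold: sampling edge -> end of mux window."""
    setup = (sample_edge - window_start) - (t_arb + t_xbar)
    hold = window_end - sample_edge
    return setup, hold

# Loosely coupled: margins measured right at the synchronizer mux.
s_loose, h_loose = margins(0.0, 1.0, 0.6)
# Tightly coupled: same clocks, but data must also cross arbiter and
# crossbar before the output buffer samples it (placeholder delays).
s_tight, h_tight = margins(0.0, 1.0, 0.6, t_arb=0.1, t_xbar=0.15)
assert s_tight < s_loose   # Tarb + Txbar eat into the setup margin
assert h_tight == h_loose  # the hold margin is unaffected
```

This is exactly why the next two slides show the tightly coupled setup-time curve starting lower while the hold-time curve stays put.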

SLIDE 20

Loosely Coupled Skew Tolerance

• Positive and negative skew are expressed as a percentage of the clock period.
• Setup and hold times are compared with those of a flip-flop in a 65nm library.
• Hold time is stable and has a solid margin.
• Setup time decreases when the latch outputs end up switching inside the mux window, BUT there is still a safe margin!

SLIDE 21

• Hold time is stable and has a solid margin
• Tarb + Txbar lower the starting point of the setup time curve
• Setup time becomes even more critical at high negative skew
• The tightly coupled synchronizer cannot work beyond -95% skew!

Tightly Coupled Skew Tolerance

SLIDE 22

Tightly coupled dual-clock FIFO synchronizer with the switch architecture

SLIDE 23

Dual-Clock FIFO Architecture

• Data is enqueued when it is valid and the buffer is not full; it is dequeued in the presence of a go signal (no stall) when the buffer is not empty
• Clear separation between the sender and receiver interfaces: token-ring counters generate the write and read pointers, indicating where the operation occurs in the buffer


SLIDE 24

Dual-Clock FIFO Architecture

• Full and empty detectors capture the status of the FIFO buffer by performing an asynchronous comparison between the write and read signals
• Assertion of the empty_tmp (full_tmp) signal is synchronous with the RX (TX) domain
• Deassertion of empty_tmp (full_tmp) happens when the write (read) pointer has advanced
• The ultimate consequence is that empty_tmp and full_tmp need to be synchronized by means of brute-force synchronizers
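The enqueue/dequeue protocol of the two slides above can be sketched as a single-threaded behavioral model. Assumptions to flag: the class name and 6-slot depth are mine, and the shared `count` stands in for the full/empty detectors — the real design has no shared counter, it compares the two pointers asynchronously and passes full_tmp/empty_tmp through brute-force (multi-flop) synchronizers.

```python
DEPTH = 6  # illustrative; matches the 6-slot integrated FIFO mentioned later

class DualClockFifo:
    """Behavioral sketch of the dual-clock FIFO (synchronizers abstracted)."""

    def __init__(self, depth=DEPTH):
        self.buf = [None] * depth
        self.depth = depth
        self.wr = 0  # write pointer, advanced in the TX clock domain
        self.rd = 0  # read pointer, advanced in the RX clock domain
        self.count = 0  # abstraction of the asynchronous full/empty detectors

    def full(self):
        return self.count == self.depth

    def empty(self):
        return self.count == 0

    def tx_edge(self, data, valid):
        """Enqueue when data is valid and the buffer is not full."""
        if valid and not self.full():
            self.buf[self.wr] = data
            self.wr = (self.wr + 1) % self.depth  # token-ring advance
            self.count += 1

    def rx_edge(self, go):
        """Dequeue on a go signal (no stall) if the buffer is not empty."""
        if go and not self.empty():
            data = self.buf[self.rd]
            self.rd = (self.rd + 1) % self.depth
            self.count -= 1
            return data
        return None

fifo = DualClockFifo()
fifo.tx_edge("flit0", valid=True)
fifo.tx_edge("flit1", valid=True)
assert fifo.rx_edge(go=True) == "flit0"
assert fifo.rx_edge(go=False) is None  # stalled: flit1 stays queued
```

The clean split between `tx_edge` and `rx_edge` mirrors the slide's point: each interface touches only its own pointer, so only the status flags ever cross clock domains.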

SLIDE 25

Tight Integration in the Switch

• Seamless integration, as for the mesochronous synchronizer
• xpipesLite is natively output buffered (2 in – 6 out), but nothing prevents resizing the output buffer to 2 slots and integrating a 6-slot FIFO => no buffering overhead
• Performance evaluation at the system level is ongoing work

SLIDE 26

Minimum latency: ∆Trx + 1 Clock_receiver cycle
• ∆Trx to open the mux window
• 1 Clock_receiver cycle to read the data

LATENCY ANALYSIS

• The latency of the dual-clock FIFO depends on the relation between the sender and receiver clocks: ∆Trx + Ω
• 0 < ∆Trx < 1 is the skew between clk_sender and clk_receiver
• Ω is the number of clock cycles required by the read pointer to reach the location pointed to by the writer

SLIDE 27

Empty deassertion: ∆Trx + 2 Clock_receiver cycles
• ∆Trx + 1 Clock_receiver cycle to clear the emptiness
• a further cycle is needed to enable the data at the mux output

LATENCY ANALYSIS


SLIDE 28

Maximum latency: ∆Trx + Clock_receiver × (BufferDepth – 2)

LATENCY ANALYSIS

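Plugging the ∆Trx + Ω model above into numbers (the 0.5-cycle skew and 6-slot depth are example values, not measurements from the deck):

```python
def fifo_latency(delta_trx, omega):
    """Dual-clock FIFO latency in receiver cycles: the fractional skew
    between the clocks plus the whole cycles the read pointer needs to
    reach the slot the writer filled."""
    assert 0 < delta_trx < 1  # skew is a fraction of one clock period
    return delta_trx + omega

depth = 6
best = fifo_latency(0.5, 1)           # minimum: one cycle to read the data
worst = fifo_latency(0.5, depth - 2)  # maximum pointer separation
assert best == 1.5 and worst == 4.5
```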

SLIDE 29

The loosely coupled solution requires up to 43% more area with respect to the vanilla switch!

AREA OVERHEAD

[Charts: mesochronous synchronizer (left) and dual-clock FIFO interface (right)]

• Breakdown of the total switch area
• 65nm UMC technology library, target frequency 1GHz
• Both tightly coupled architectures have an area footprint comparable to their respective vanilla switches

SLIDE 30

POWER ANALYSIS

[Charts: mesochronous synchronizer (left) and dual-clock FIFO interface (right)]

• Post-layout simulations carried out at 800MHz
• The area overhead comes with a power penalty!
• Tightly coupled mesochronous power figures reflect those of the vanilla switch (as for area)
• The tightly coupled dual-clock FIFO inherently clock-gates the input buffer when data is not valid (a feature the vanilla switch lacks)

SLIDE 31

Summing up

A loosely coupled synchronizer in front of the switch fabric
• implies large buffering at the switch input
• fully exposes its area and power overhead

We advocate a tightly coupled synchronizer within the switch architecture:
• a multi-purpose input buffer in charge of synchronization, buffering and flow control
• major savings thanks to the sharing of expensive buffering
• marginal area/power/timing overhead with respect to a fully synchronous switch
