Automated Generation of Round-robin Arbitration and Crossbar - - PowerPoint PPT Presentation



SLIDE 1

Automated Generation of Round-robin Arbitration and Crossbar Switch Logic

Eung S. Shin

Advisor: Professor Vincent J Mooney III

School of Electrical and Computer Engineering, Georgia Institute of Technology

SLIDE 2

2/12/2004 2

Overview

[Figure: multiprocessor System-on-a-Chip (SoC) containing DSPs, GPPs, custom logic, a peripheral core, and four memory modules. Labeled blocks: on-chip network & arbiter; crossbar (Xbar); round-robin arbiter.]

SLIDE 3

Arbiter Problems

  • An SoC needs a fast and powerful arbiter.
  • Terabit switching speeds need a fast arbiter.
  • Designing an arbiter by hand is a tedious and error-prone task.

[Figure: 16x16 network switch. The virtual output queues (VOQs) of each input port feed a (16x16)x16 crossbar switch fabric; sixteen 16x16 arbiters receive req(m, n) signals and return grant(m, n) signals.]

SLIDE 4

Xbar Problems

  • A multiprocessor SoC demands multiple communication channels.
  • Challenge: reducing the productivity gap
  • Productivity gap reduction techniques:
      • Enhancing IP core reusability
      • Developing a CAD tool

SLIDE 5

Objective

  • To design a fast round-robin arbiter and automate the generation of its logic for a bus or a network switch
      • The generated arbiter is used as crossbar (Xbar) switch arbitration logic.
  • To automate Xbar generation, providing multiple communication paths among masters
      • The generated Xbar is customized according to user specifications.

SLIDE 6

Outline

  • Terminology
  • Origin and history of problems:
      • Arbiter design: PPE and PPA
      • Crossbar switch design: “Smart” Memory
  • Arbiter design
  • Arbiter experiments
  • RAG: Round-robin Arbiter Generator
  • X-Gt: Xbar Generator
  • Xbar experiments
  • Conclusion

SLIDE 7

Terminology

  • MxN switch: M-input by N-output switch
      • Example: a 32x32 switch is a 32-input by 32-output switch with 1024 (32^2) possible connections between input ports and output ports.
  • Virtual Output Queues (VOQs): queues that remove Head-of-Line (HOL) blocking caused by output port contention
      • VOQ(m, n): m is the input port index; n is the output port index.
      • Example: VOQ(1, 0)

[Figure: 32x32 network switch. Input ports 0 to 31 hold VOQ(0,0) ... VOQ(31,31), which feed a (32x32)x32 crossbar switch fabric with output ports 0 to 31; thirty-two 32x32 arbiters take req(0,0) ... req(31,31) and drive grant(0, 0-31) ... grant(31, 0-31).]

SLIDE 8

Terminology (Continued)

  • (MxV)xN switch:
      • M: the number of input ports of an MxN switch
      • V: the number of VOQs per input port
      • N: the number of output ports of an MxN switch
  • Typically, V = N.
  • The total number of VOQs in an MxN switch is M∗N.

[Figure: 32x32 network switch with VOQ(0,0) ... VOQ(31,31), a (32x32)x32 crossbar switch fabric, and thirty-two 32x32 arbiters.]

SLIDE 9

Terminology (Continued)

  • (MxV)xN crossbar switch fabric:
      • Connections between (MxV) inputs and N outputs
  • MxM Switch Arbiter (SA):
      • Controls M specific transmission gates between M VOQs and a particular output port
      • An MxN switch contains N MxM SAs.

[Figure: thirty-two 32x32 SAs (SA_0 ... SA_31) on a (32x32)x32 crossbar switch fabric. SA_n drives grant(0, n) ... grant(31, n), which gate VOQ(0, n) ... VOQ(31, n) onto output port n.]

SLIDE 10

Terminology (Continued)

  • MxM distributed SA (MxM hierarchical SA):
      • Equivalent to an MxM SA
      • Consists of smaller switch arbiters arranged in a hierarchical tree structure
  • Bus Arbiter (BA): resolves bus conflicts

[Figure: 4x4 BA. A ring counter (D-FFs; inputs clock, ack, reset) produces token[0..3]; each token bit drives the EN input of one of Priority Logic 0 to 3. The priority logic blocks take req[0..3] on in[0..3], and their output[0..3] signals form grant[0..3].]

[Figure: 8x8 hierarchical SA. Two 4x4 ack-req SAs (SA 0 with req0[0..3] and grant0[0..3]; SA 1 with req1[0..3] and grant1[0..3]), each with a counter and a D flip-flop, connect through req0, req1, ack0[0], and ack0[1] to a 2x2 root SA; all blocks share clock.]

SLIDE 11

Requirements for a Terabit Switch Arbiter

  • Starvation free
  • Fast arbitration
  • Simple to implement
  • Low power:
      • The power budget of a single-rack router is about 10 kW.


SLIDE 13

History: Arbiter in PPE

  • Centralized switch arbiters:
      • Programmable Priority Encoder (PPE) implementing the iterative round-robin algorithm (iSLIP)
  • P. Gupta and N. McKeown, “Designing and Implementing a Fast Crossbar Scheduler,” IEEE Micro, 1999, pp. 20-28.
  • N. McKeown, P. Varaiya, and J. Walrand, “The iSLIP Scheduling Algorithm for Input-Queued Switches,” IEEE/ACM Transactions on Networking, 1999, pp. 188-201.

[Figure: PPE block diagram. Signals and blocks: n-bit Req; log2(n)-bit pointer P_enc; "to thermo" block producing P_thermo; n-bit new_Req; Priority Encoder producing Gnt_PE; Priority Encoder_thermo producing Gnt_PE_thermo and any_Gnt_PE_thermo; n-bit Gnt.]

SLIDE 14

History: Arbiter in PPA

  • Distributed switch arbiter:
      • Ping Pong Arbiter (PPA)
  • H. J. Chao, C. H. Lam, and X. Guo, “A Fast Arbitration Scheme for Terabit Packet Switches,” Proceedings of IEEE Global Telecommunications Conference, 1999, pp. 1236-1243.
  • Comparison: our generated SA is 2.3X faster than PPE and 1.8X faster than PPA.

[Figure: 16-input PPA built as a four-layer tree of 2x2 PPAs (leaf, intermediate, and root PPAs) with external grant signals. Detail of the 2x2 PPA: inputs r0, r1, Fi, Gg0, Gg1; outputs g0, g1, Fo; a clocked D flip-flop.]

SLIDE 15

Why do we need an arbiter for an SoC?

  • All buses require arbitration.
  • Our arbiter is applicable anywhere arbitration is required.
  • The generated arbiter is used in our Xbar.

SLIDE 16

History: Crossbar Switch

  • Smart memory:
      • Reconfigurable crossbar between bus masters (2 integer clusters and 1 floating-point cluster) and slaves (SRAMs)
  • K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally and M. Horowitz, “Smart Memories: A Modular Reconfigurable Architecture,” Proceedings of International Symposium on Computer Architecture (ISCA), June 2000, pp. 161-171.


SLIDE 18

Bus Arbiter Design

  • Implemented with a ring counter that holds the token, plus “priority logic”
  • Priority logic for 4 inputs:
      output[0] = EN•in[0]
      output[1] = EN•in[0]'•in[1]
      output[2] = EN•in[0]'•in[1]'•in[2]
      output[3] = EN•in[0]'•in[1]'•in[2]'•in[3]

Truth table (X = don't care):

  EN  in[0]  in[1]  in[2]  in[3]  |  output[0]  output[1]  output[2]  output[3]
  0   X      X      X      X      |  0          0          0          0
  1   1      X      X      X      |  1          0          0          0
  1   0      1      X      X      |  0          1          0          0
  1   0      0      1      X      |  0          0          1          0
  1   0      0      0      1      |  0          0          0          1
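The four priority logic equations above amount to "grant the lowest-indexed asserted input, but only when EN is high". A minimal Python sketch of that behavior follows; the function name `priority_logic` and the use of Python booleans are illustrative choices, not part of the original design.

```python
def priority_logic(en, inputs):
    """Model of the 4-input priority logic equations.

    output[i] = EN AND in[i] AND (no lower-indexed input is asserted).
    The function name and list-of-bools interface are illustrative only.
    """
    outputs = [False] * len(inputs)
    if not en:
        return outputs          # EN low: every output stays low
    for i, bit in enumerate(inputs):
        if bit:
            outputs[i] = True   # first asserted input wins
            break
    return outputs


if __name__ == "__main__":
    # in[0]=0, in[1]=0, in[2]=1, in[3]=1 -> only output[2] is asserted
    print(priority_logic(True, [False, False, True, True]))
```

At most one output is ever high, which is what lets the outputs of several such blocks be combined into one-hot grant signals.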

SLIDE 19

Previous Bus Arbiter Approach

  • FIFO arbiter
  • Round-robin arbiter with fixed priorities
  • Round-robin arbiter with priority rotation

[Figure: two block diagrams. One uses M-input priority logic between M requests and M grants; the other uses a rotation block, an M-input priority encoder, and a decoder between M requests and M grants.]

SLIDE 20

Example: Our Bus Arbiter

  • Condition:
      • Token = 4’b0100, so Processor 2 has the highest priority.
      • Processors 0 and 1 are requesting the bus.
  • Result:
      • Only Priority Logic 2 is enabled.
      • Processor 0 is granted because the request signals of the higher-priority parties (Processor 2 and Processor 3) are negated.
      • The token rotates to 4’b1000 after the ring counter receives the ack signal.

[Figure: Processors 0 to 3 share a memory through a 4x4 BA. req[0] and req[1] are asserted; the ring counter asserts token[2], enabling PL 2, whose output[2] drives grant[0]; ack returns to the ring counter.]
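The example above can be replayed with a small behavioral model. This is a sketch under assumptions: it takes the enabled priority logic block to scan requests starting at the token position and wrapping around, which matches the stated result (Processor 0 wins when the token is at 2) but is not spelled out in the slides. The class and method names are hypothetical.

```python
class RoundRobinBusArbiter:
    """Behavioral sketch of the token-based bus arbiter (names are illustrative)."""

    def __init__(self, n=4, token=0):
        self.n = n
        self.token = token  # index of the single '1' bit in the one-hot token

    def token_bits(self):
        # One-hot token as a binary string, MSB first (e.g. '0100' for token index 2)
        return format(1 << self.token, "0{}b".format(self.n))

    def grant(self, req):
        # Assumption: the enabled priority logic sees the requests rotated so
        # that the token holder has the highest priority.
        for offset in range(self.n):
            idx = (self.token + offset) % self.n
            if req[idx]:
                return idx
        return None  # nobody is requesting

    def ack(self):
        # The ring counter rotates the token by one position on ack.
        self.token = (self.token + 1) % self.n


if __name__ == "__main__":
    arb = RoundRobinBusArbiter(n=4, token=2)       # token = 4'b0100
    print(arb.token_bits())                        # 0100
    print(arb.grant([True, True, False, False]))   # 0 (Processor 0 granted)
    arb.ack()
    print(arb.token_bits())                        # 1000
```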

SLIDE 21

Example: Our Bus Arbiter (Continued)

[Figure: the 4x4 BA schematic (ring counter with D-FF, clock, ack, and reset; token[0..3]; Priority Logic 0 to 3 with EN, in[0..3], and output[0..3]; req[0..3]; grant[0..3]) with the active path highlighted: token[2] enables Priority Logic 2, and grant[0] is asserted.]

SLIDE 22

Hierarchical SA Design

  • A hierarchical SA consists of small switch arbiter blocks.
  • Six types of switch arbiter blocks:
      • 2x2 ack-req SA
      • 3x3 ack-req SA
      • 4x4 ack-req SA
      • 2x2 root SA
      • 3x3 root SA
      • 4x4 root SA
  • A root SA is placed at the top of the hierarchy.

SLIDE 23

Switch Arbiter Blocks

[Figure: ack-req SA blocks (2x2, 3x3, 4x4). Each contains a bus arbiter of the same size, with signals req0[i], grant0[i], req0, ack, clock, and reset.]

[Figure: root SA blocks (2x2, 3x3, 4x4). Each contains a BA of the same size without the D flip-flop, plus a ring counter that supplies the token (token[1:0] in the 2x2 case), with signals req0 ..., ack0 ..., clock, and reset.]

SLIDE 24

Example: 32x32 hierarchical SA

[Figure: 32x32 hierarchical SA. Level 0: eight 4x4 ack-req SAs (l0.sa0 ... l0.sa7) with req0[0..3] ... req7[0..3] and grant0[0..3] ... grant7[0..3]. Level 1: two 4x4 ack-req SAs (l1.sa0, l1.sa1) with req0 ... req7, ack0[0..3], and ack1[0..3]. Top: a 2x2 root SA with up_req[0], up_req[1], up_ack0, and up_ack1. All blocks share clock. Token values shown: 2’b10, 4’b0010, 4’b0010.]

SLIDE 25

Example: 32x32 hierarchical SA

[Figure: the 32x32 hierarchical SA with request inputs asserted and one path highlighted: req5[1] into l0.sa5, req5 into l1.sa1, up_req[1] into the 2x2 root SA, then up_ack1, ack1[1], and grant5[1] on the way back. Detail: the 4x4 BA, counter, and D flip-flop inside l0.sa5 and l1.sa1, with l0.sa5.output[1] and l1.sa1.output[1].]

  • 32x32 SA critical path:
      • Travels through two 4-input OR gates in series
      • Then through a 2x2 root SA
      • Finally through two 2-input AND gates in series
      • Results in a 0.94 ns delay using a TSMC 0.25µ standard cell library from LEDA Systems
  • The ack signals look like a feedback path through the same logic block. In fact, they do not feed back into the same logic gates.

SLIDE 26

Hierarchical BA

  • Extra ‘ack’ input: indicates the completion of a single use of the bus
  • Root token rotation:
      • By the ‘ack’ signal for a hierarchical BA
      • By the clock signal for a hierarchical SA
  • A master can possess the bus for multiple cycles.

[Figure: 8x8 hierarchical BA. Two 4x4 ack-req BAs (BA 0 and BA 1, each with a counter and D latches) under a 2x2 root SA, with signals req0[0..3], req1[0..3], grant0[0..3], grant1[0..3], req0, req1, ack0[0], ack0[1], ack, and clock. Detail of the 4x4 ack-req BA: a 4x4 bus arbiter with D latches triggered on the positive edge of ack.]


SLIDE 28

Comparisons with PPE and PPA

  • Using the TSMC 0.25µ standard cell library from LEDA Systems
  • Synthesized by Design Compiler from Synopsys

[Plot: area of arbiter in the number of inverter equivalents versus MxM arbiter size, for SA, PPE, and PPA.]

[Plot: delay in arbiter with TSMC .25um versus MxM arbiter size, for SA, PPE, and PPA. Speedup annotations: 1.81X, 1.85X, 1.94X, 2.31X, 2.03X, 2.38X.]

SLIDE 29

Comparisons (Continued)

  • The shortest delay results from:
      • Compared with PPE: limiting the size of switch arbiter blocks to 2x2, 3x3, and 4x4, which reduces the critical path delay that grows with the expansion of priority logic blocks
      • Compared with PPA: reducing the number of levels in the hierarchy by preferring 4x4 switch arbiter blocks

SLIDE 30

Key Insights

  • Our hierarchical SA vs. PPE:
      • The logic equations of the PPE are much larger.
      • Our SA is more constrained to a specific structure.
      • A distributed “state” is kept by the tokens.

[Figure: 16x16 hierarchical SA (16 requests into four 4x4 blocks, whose 4 outputs feed a fifth 4x4 block; 16 grants; blocks labeled "token") beside a 16x16 PPE (16 requests into one 16-input PE; 16 grants; labeled "No token").]

SLIDE 31

Key Insights (Continued)

  • Our hierarchical SA vs. PPA:
      • The PPA uses a token-passing approach.
      • An MxM PPA is implemented with 2-input basic arbiters.
      • Our hierarchical SA prefers 4-input switch arbiter blocks.

[Figure: 16x16 hierarchical SA (five 4x4 blocks; 16 requests, 16 grants) beside a 16x16 PPA (fifteen 2x2 blocks; 16 requests, 16 grants). Detail of the 2x2 PPA: r0, r1, g0, g1, Fi, Fo, Gg0, Gg1, and a clocked D flip-flop.]

SLIDE 32

Power Comparison

  • For dynamic power estimation, a Switching Activity Information File (SAIF) is extracted from Synopsys VCS.
  • A congested network is assumed: all input requests are asserted.

[Figure: power estimation flow. Labels: RTL Verilog; RTL simulation using Synopsys VCS; SAIF file; back annotation; Design Compiler; technology library; Power Compiler; power estimation.]

SLIDE 33

Power Comparison

  • Using the TSMC 0.25µ standard cell library from LEDA Systems
  • Estimated by Power Compiler from Synopsys

[Plot: static power dissipation with TSMC 0.25um, in pW, versus MxM arbiter size, for SA, PPA, and PPE.]

[Plot: dynamic power dissipation with TSMC 0.25um, in mW, versus MxM arbiter size, for SA, PPA, and PPE.]

SLIDE 34

Power Comparison

[Plot: total power dissipation, in mW, versus number of requests, for SA, PPA, and PPE.]


SLIDE 36

RAG

  • User specification: the arbiter type
  • User specification: the number of masters (M) to be arbitrated
  • Output of RAG: synthesizable Verilog code for a Bus Arbiter or a Switch Arbiter at the RTL

[Figure: RAG tool flow.
  • User input: 1. Type of the arbiter; 3. Number of masters
  • Library: 2x2, 3x3, and 4x4 ack-req SA; 2x2, 3x3, and 4x4 ack-req BA; 2x2, 3x3, and 4x4 root SA
  • Interpret_sa(): calculate the number of levels and the number of basic arbiter blocks for each level
  • SA branch: integ_arb() integrates the MxM hierarchical switch arbiter
  • BA branch: integ_bus_arb() integrates the MxM hierarchical bus arbiter]

[Flowchart: level and block calculation. Boxes, in the order extracted (the Yes/No labels cannot be matched to edges in this transcript):
  • number_of_levels ← 0; dividend ← number_of_masters; remainder ← (dividend modulo 4)
  • dividend = 0?
  • dividend ← floor(dividend/4); number_of_levels++
  • dividend = 1 and remainder = 0?
  • dividend ← number_of_masters; n ← 0
  • dividend > 3?
  • remainder ← dividend modulo 4; dividend ← (dividend/4); 4by4_in_level[n] ← dividend
  • dividend modulo 3 = 0?
  • dividend modulo 4 = 0?
  • remainder ← dividend modulo 3; dividend ← (dividend/3); 3by3_in_level[n] ← dividend
  • remainder ← dividend modulo 4; dividend ← floor(dividend/4); 4by4_in_level[n] ← dividend
  • remainder = 3?
  • 2by2_in_level[n]++
  • 3by3_in_level[n]++
  • dividend > 2?
  • dividend ← (dividend/3); 3by3_in_level[n] ← dividend
  • dividend ← floor(dividend/2); 2by2_in_level[n] ← dividend
  • dividend = 4by4_in_level[n] + 3by3_in_level[n] + 2by2_in_level[n]
  • n = number_of_levels - 1?
  • Generate Hierarchical Arbiter]

SLIDE 37

RAG (Continued)

[Flowchart: simplified level and block calculation using 4x4 and 2x2 blocks only. Boxes, in the order extracted:
  • num_level ← 0; dividend ← num_masters; remainder ← 0
  • dividend = 0?
  • dividend ← (integer)(dividend/4); n ← num_level; num_level++
  • dividend = 0 and remainder = 0?
  • dividend ← num_masters; n ← 0
  • dividend > 2?
  • remainder ← dividend mod 4; dividend ← (integer)(dividend/4); num_4by4_level[n] ← dividend
  • remainder ← dividend mod 2; dividend ← (integer)(dividend/2); num_2by2_level[n] ← dividend
  • remainder = 0?
  • remainder > 2?
  • num_4by4_level[n]++
  • num_2by2_level[n]++
  • n++
  • dividend ← num_4by4_level[n] + num_2by2_level[n]
  • n < num_level?]

Worked example: hierarchical SA, dividend = 32

  Level counting:
    dividend = 8, n = 0, num_level = 1
    dividend = 2, n = 1, num_level = 2
    dividend = 0, n = 2, num_level = 3

  Block counting:
    n = 0: dividend = 32, remainder = 0, dividend = 8, num_4by4(0) = 8 (eight 4x4 SAs)
    n = 1: dividend = 8, remainder = 0, dividend = 2, num_4by4(1) = 2 (two 4x4 SAs)
    n = 2: dividend = 2, remainder = 0, dividend = 1, num_2by2(2) = 1 (one 2x2 root SA)
    n = 3: dividend = 1
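The block-counting walk-through for 32 masters can be reproduced with a short Python sketch. It follows one reading of the simplified flowchart (4x4 and 2x2 blocks only, no 3x3 blocks), and it replaces the separate level-counting pass with a simpler stopping rule: stop once a level contains a single block, which then acts as the root. Both the stopping rule and the name `plan_levels` are assumptions made for illustration, not RAG's actual code.

```python
def plan_levels(num_masters):
    """Return [(num_4by4, num_2by2), ...] from the leaf level up to the root.

    Sketch of the simplified RAG flowchart: prefer 4x4 blocks, then cover
    any leftover inputs with one extra 4x4 block (3 left over) or one extra
    2x2 block (1 or 2 left over). Hypothetical helper, not the RAG source.
    """
    if num_masters < 2:
        raise ValueError("need at least two masters to arbitrate")
    levels = []
    dividend = num_masters
    while True:
        if dividend > 2:
            remainder = dividend % 4
            n4, n2 = dividend // 4, 0
        else:
            remainder = dividend % 2
            n4, n2 = 0, dividend // 2
        if remainder > 2:
            n4 += 1              # three leftover inputs: one more 4x4 block
        elif remainder > 0:
            n2 += 1              # one or two leftover inputs: one more 2x2 block
        levels.append((n4, n2))
        dividend = n4 + n2       # each block sends one request to the next level
        if dividend == 1:        # a single block is the root of the hierarchy
            return levels


if __name__ == "__main__":
    print(plan_levels(32))
```

For 32 masters this gives eight 4x4 blocks, then two 4x4 blocks, then one 2x2 root, matching the worked example.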

SLIDE 38

RAG (Continued)

[Figure: RAG tool flow (user input, library, Interpret_sa(), integ_arb(), integ_bus_arb()) with the integ_arb() step highlighted: the planned 4x4 SA blocks and the 2x2 root SA are wired into the 32x32 hierarchical SA, with signals req0[0..3] ... req7[0..3], grant0[0..3] ... grant7[0..3], req0 ... req7, ack0[0..3], ack1[0..3], up_req[0], up_req[1], up_ack0, up_ack1, and clock.]

SLIDE 39

Speedup

  • Throughput comparison for a 64-bit switching 32x32 network switch
      • The longest delay of the 32x32 switch is 0.63 ns in .25µ TSMC, so the maximum throughput of the switch is determined by the arbitration delay.
  • Experimental setup:
      • The SA in the 32x32 network switch is replaced in turn with our hierarchical SA, PPE, and PPA.

[Figure: floorplan of the 64-bit 32x32 switch fabric, the 32^2 64-bit VOQs, the VOQ controllers, and the 32 32x32 SAs. Area: 125.64 mm^2.]

SLIDE 40

Speedups (Continued)

  • Speedup in .25µ TSMC:
      • Our hierarchical 32x32 SA: 64 bits @ 0.94 ns delay → 2.18 Tbps
      • 32x32 PPA: 64 bits @ 1.70 ns delay → 1.20 Tbps
      • 32x32 PPE: 64 bits @ 2.17 ns delay → 0.94 Tbps
  • Result: the throughput achieved by our SA is more than 1.8X that of PPA and more than 2.3X that of PPE.

[Figure: 32x32 network switch with VOQ(0,0) ... VOQ(31,31), a (32x32)x32 crossbar switch fabric, and thirty-two 32x32 arbiters.]
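The Tbps figures can be checked with a few lines of arithmetic. The sketch below assumes that all 32 ports each transfer 64 bits once per arbitration delay; the slides do not state the formula, but this reading reproduces the quoted numbers. The function name `throughput_tbps` is illustrative.

```python
PORTS = 32   # 32x32 switch
BITS = 64    # 64-bit datapath per port


def throughput_tbps(delay_ns, ports=PORTS, bits=BITS):
    """Aggregate throughput if every port moves `bits` per arbitration delay.

    bits per nanosecond equals Gbps, so divide by 1000 to get Tbps.
    """
    return ports * bits / delay_ns / 1000.0


delays_ns = {"SA": 0.94, "PPA": 1.70, "PPE": 2.17}
tput = {name: throughput_tbps(d) for name, d in delays_ns.items()}

if __name__ == "__main__":
    for name, value in tput.items():
        print("{}: {:.2f} Tbps".format(name, value))
    print("SA vs PPA: {:.2f}X".format(tput["SA"] / tput["PPA"]))
    print("SA vs PPE: {:.2f}X".format(tput["SA"] / tput["PPE"]))
```

Because the port count and width are the same in all three cases, the speedup ratios reduce to ratios of the arbitration delays.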

SLIDE 41

Perfectly Fair?

[Figure: the 4x4 BA schematic stepped through the four token positions (Priority Logic 0, 1, 2, and 3 enabled in turn), with a running table of "# of grants" per "req #" at each step.]

  • How about PPE and PPA?

SLIDE 42

Fairness Simulation

  • Joint work with Dr. Riley and Tom Cheng
  • Simulation with the 32x32 hierarchical SA
  • Several input patterns applied for uniform traffic simulation
  • Bursty and TCP traffic applied
  • Traffic intensity, ρ: a measure of the demand on the switch for bursty and TCP traffic

SLIDE 43

Simulation Results

Uniform traffic: the metric is the number of grants.

  4x4 ack-req input | Run 1: all asserted | Run 2: 0, 1 asserted | Run 3: 0, 1, 2 asserted
  0                 | 250000              | 749000               | 499999
  1                 | 250000              | 250000               | 250000
  2                 | 250000              | 0                    | 250000
  3                 | 249999              | 0                    | 0
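The shape of the uniform-traffic results can be reproduced with a toy model of a single 4x4 round-robin arbiter. This sketch assumes the token advances by one position after every arbitration regardless of who won, and that priority runs from the token position upward with wrap-around; it is not the simulator used for the reported runs, and the name `grant_counts` is hypothetical.

```python
def grant_counts(requesting, cycles=1000, n=4):
    """Count grants per input when the inputs in `requesting` always request.

    Assumption: the token moves one position per arbitration, whoever wins.
    """
    counts = [0] * n
    token = 0
    for _ in range(cycles):
        for offset in range(n):
            idx = (token + offset) % n
            if idx in requesting:
                counts[idx] += 1   # highest-priority requester from the token on
                break
        token = (token + 1) % n
    return counts


if __name__ == "__main__":
    print(grant_counts({0, 1, 2, 3}))  # all inputs asserted
    print(grant_counts({0, 1}))        # inputs 0 and 1 asserted
    print(grant_counts({0, 1, 2}))     # inputs 0, 1, and 2 asserted
```

With only inputs 0 and 1 requesting, input 0 also wins whenever the token sits on the idle inputs 2 or 3, so it collects about three quarters of the grants. That is the same imbalance visible in Run 2 and Run 3 of the table.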

SLIDE 44

Simulation Results (Continued)

Bursty traffic: the metric is average delay.

  4x4 ack-req input | ρ = 0.1 | ρ = 0.5 | ρ = 0.9 | ρ = 1.0 | ρ = 2.0
  0                 | 1.05    | 1.98    | 4.46    | 16.65   | 31.43
  1                 | 1.10    | 1.93    | 4.60    | 15.59   | 31.23
  2                 | 1.01    | 1.82    | 4.61    | 17.14   | 30.95
  3                 | 1.10    | 2.00    | 4.63    | 16.44   | 31.11

SLIDE 45

Simulation Results (Continued)

TCP traffic: the metric is average delay.

  4x4 ack-req input | ρ = 0.1 | ρ = 0.5 | ρ = 0.9 | ρ = 1.0 | ρ = 2.0
  0                 | 1.10    | 1.95    | 11.68   | 16.75   | 18.12
  1                 | 1.10    | 1.95    | 11.29   | 19.48   | 19.91
  2                 | 1.10    | 1.93    | 10.69   | 19.71   | 20.44
  3                 | 1.10    | 1.94    | 12.13   | 17.54   | 19.22


SLIDE 47

Why Xbar and X-Gt?

  • Xbar:
      • Demand for multiple communication channels
      • Concurrent accesses to resources
  • X-Gt:
      • Generates a customized Xbar on the fly
      • The generated Xbar is in RTL Verilog

SLIDE 48

Xbar Switch

  • One configuration for the 4x4 case
  • Each 4x1 switch compares the physical addresses from the PEs and judges whether each address belongs to the address space of the attached memory block.
  • A maximum of 4 concurrent memory accesses in one cycle

[Figure: 4x4 Xbar. PE0 to PE3 connect to 4x1switch0 ... 4x1switch3, which attach to mem0 ... mem3.]

SLIDE 49

Xbar (Continued)

  • An MxN switch consists of N Mx1 switches, where M = # of PEs and N = # of memory blocks.
  • The assertion of ‘mem_req’ inside an Mx1 switch depends on the ‘address’ input from a PE.
  • Grant: given by an arbiter through the assertion of ‘mem_on’ signals
  • Arbitration in round-robin order handles M requests from M processors.
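One arbitration step of an Mx1 switch, as described above, can be sketched in Python. The signal names `pe_req`, `mem_req`, and `mem_on` come from the slides; the `MemBlock` class, the base/size address map, and the token-based scan are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemBlock:
    """Address space of the memory block attached to one Mx1 switch (hypothetical)."""
    base: int
    size: int

    def contains(self, addr):
        return self.base <= addr < self.base + self.size


def mx1_switch_step(block, pe_req, pe_addr, token):
    """Return (mem_req, mem_on) for one arbitration step.

    mem_req[i]: PE i is requesting and its address falls in this block.
    mem_on[i]:  PE i is granted; at most one entry is True.
    Assumption: round-robin priority starts at index `token` and wraps.
    """
    m = len(pe_req)
    mem_req = [bool(pe_req[i]) and block.contains(pe_addr[i]) for i in range(m)]
    mem_on = [False] * m
    for offset in range(m):
        i = (token + offset) % m
        if mem_req[i]:
            mem_on[i] = True
            break
    return mem_req, mem_on


if __name__ == "__main__":
    block = MemBlock(base=0x0000, size=0x1000)
    # PE 0 and PE 3 both target this block; the token currently favors PE 3.
    req, on = mx1_switch_step(block, [True, False, False, True],
                              [0x0010, 0x0000, 0x0000, 0x0020], token=3)
    print(req)  # which PEs raised mem_req
    print(on)   # which PE got mem_on
```

In a full MxN Xbar there would be N such switches, one per memory block, each with its own arbiter, so PEs that target different blocks in the same cycle do not contend.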

SLIDE 50

4x1 Switch Example

  • Scenario:
      • Requests from PE 0 and PE 3
      • PE 3 granted

[Figure: 4x1 switch internals. Blocks: comp (comparator), arbiter, addr bus switch, data bus switch, two wire switches, and a wire_ta switch. PE-side signals: pe_req[0..3], pe_addr0..3, pe_data0..3, pe_re0..3, pe_we0..3, pe_ta0..3. Internal signals: mem_req[0..3], mem_on[0..3]. Memory-side signals: mem_addr, mem_data, mem_re, mem_we, mem_ta.]

SLIDE 51

Example of Xbar Configuration

  • 4x8 Xbar:
      • Supports 4 PEs and 8 memory modules
      • A 4x1 switch connects the signals of a particular PE to a specific memory module according to the physical address from the PE.

[Figure: 4x8 Xbar with 4x1switch0 ... 4x1switch7 attached to mem0 ... mem7.]

SLIDE 52

X-Gt (Xbar Generator)

  • Generates a customized Xbar in Verilog at the RTL:
      • An arbiter generated by RAG
      • Parameterizable switch blocks in an Mx1 switch
      • All submodules connected by wire names

SLIDE 53

X-Gt (Continued)

  • Parameters:
      • The number of PEs, which determines M in an MxN Xbar
      • The number of memory blocks, which determines N in an MxN Xbar
      • The total global memory size, which determines the (pe_)address bus width
      • The data bus width of each PE, determined by the PE type
      • The (mem_)address bus width, determined by the size of a memory block
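The relationship between memory sizes and address bus widths can be illustrated with a small helper. The slides only say that the sizes "determine" the widths; the sketch below assumes the usual rule of ceil(log2(size)) address bits, with sizes counted in addressable units. The function name and the example sizes are hypothetical.

```python
def addr_bus_width(size_in_units):
    """Smallest number of address bits that can index `size_in_units` locations.

    Assumption: width = ceil(log2(size)), with a minimum of 1 bit.
    """
    if size_in_units < 1:
        raise ValueError("size must be positive")
    return max(1, (size_in_units - 1).bit_length())


if __name__ == "__main__":
    num_mem_blocks = 4           # N in an MxN Xbar (example value)
    block_size = 64 * 1024       # addressable units per memory block (example value)
    total_size = num_mem_blocks * block_size

    print("pe_ address bus width:", addr_bus_width(total_size))   # global address
    print("mem_ address bus width:", addr_bus_width(block_size))  # per-block address
```

Under this rule, the difference between the two widths is the number of upper address bits that remain for selecting among the N memory blocks.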

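SLIDE 54

Flowchart of X-Gt

[Flowchart: X-Gt. Starting from the parameters:
  • Loop while m < M (m++): gen_proc_wire(M) emits the per-PE wires (pe_req[m], ...).
  • Loop while n < N (n++): gen_mem_wire(M) emits the per-memory wires (mem_addrn, ...).
  • Loop over the N memory blocks: RAG generates an arbiter; gen_comp(M), gen_addr_bus_switch(M), gen_data_bus_switch(M), gen_wire_switch(M), and gen_wre_ta_switch(M) generate the switch blocks; gen_Mx1(parameters) assembles each Mx1 switch.
  • Result: MxN Xbar]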

SLIDE 55

[Flowchart: X-Gt walked through for M = 4, N = 4.
  • m = 0 ... 3: pe_req[m], pe_addrm, pe_datam, pe_readm, pe_writem, and pe_tam are emitted for each PE.
  • n = 0 ... 3: mem_addrn, mem_datan, mem_readn, mem_writen, and mem_tan are emitted for each memory block.
  • Third loop: 4x1switch0, 4x1switch1, 4x1switch2, and 4x1switch3 are generated in turn.
  • Result: a 4x4 Xbar as Verilog in RTL]


SLIDE 57

Xbar Synthesis: Mx1 Gate Area

  • Using the TSMC 0.25µ standard cell library from Artisan Components
  • Estimated by Design Compiler from Synopsys

[Plot: Mx1 switch area in the number of inverter equivalents with TSMC .25um versus number of processors.]

SLIDE 58

Xbar Synthesis: MxM Area

  • Using the TSMC 0.25µ standard cell library from Artisan Components
  • Gate area estimated by Design Compiler from Synopsys
  • Gate + wire area estimated by Silicon Ensemble from Cadence

[Plot: MxM Xbar area in square mm with TSMC .25um versus number of processors and number of memory blocks. Series: Gate Area; Gate + Wire Area.]

SLIDE 59

Back Annotated Timing

  • The RTL Verilog (output of X-Gt) is synthesized by DC with the TSMC .25µ library from Artisan Components.
  • After place and route, the SDF and set_load files are back-annotated to the logic synthesis stage, using Synopsys Design Compiler and Cadence Silicon Ensemble.

[Figure: timing flow. Labels: X-Gt; RTL Verilog; Logic Synthesis (Design Compiler); Place & Route (Silicon Ensemble); Extract RC values; Back annotate SDF and set load files; report timing.]

SLIDE 60

MxM Xbar Delay

[Plot: MxM Xbar delay in ns with TSMC .25um versus number of processors and number of memory blocks. Series: w/o back annotation; w/ back annotation.]

SLIDE 61

Conclusion

  • Showed how we design the hierarchical SA and the hierarchical BA.
  • Demonstrated ≥ 1.8X speedups for our hierarchical SA over PPE and PPA.
  • Illustrated how RAG generates a hierarchical SA.
  • Demonstrated the Xbar switch and showed how X-Gt generates an Xbar according to user specifications.
  • Illustrated the gap between delays from the logic synthesis stage and delays from the physical synthesis stage for MxM Xbars.

SLIDE 62

Publications

As (co-) First Author:

  • K. Ryu, E. S. Shin, and V. Mooney, “A Comparison of Five Different Multiprocessor SoC Bus Architectures,” Proceedings of the 2001 Euromicro Symposium on Digital Systems Design (DSD’01), September 2001.
  • E. S. Shin, V. Mooney, and G. Riley, “Round-robin Arbiter Design and Generation,” Proceedings of 15th International Symposium on System Synthesis (ISSS’02), October 2002.
  • E. S. Shin, V. Mooney, and G. Riley, “Round-robin Arbiter Design and Generation,” Georgia Institute of Technology Technical Report, GIT-CC-02-38, Available HTTP: http://www.cc.gatech.edu/tech_reports/index.02.html
  • M. Shalan, E. S. Shin and V. Mooney, “DX-Gt: Memory Management and Crossbar Switch Generator for Multiprocessor System-on-a-Chip,” Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI’03), April 2003.
  • E. S. Shin, V. Mooney, and G. Riley, “Round-robin Arbiter Design and Generation,” submitted to IEEE Transactions on CAD.

SLIDE 63

Publications (Continued)

Other publications:

  • P. Cheng, Eung S. Shin, G. R. Riley and V. J. Mooney III, “SASim: Switch Arbiter Simulator,” Georgia Institute of Technology Technical Report, GIT-CC-03-38, Available HTTP: http://www.cc.gatech.edu/tech_reports/index.03.html.
  • A. Talpasanu, J. A. Davis, E. S. Shin and V. J. Mooney, “Crossbar Switch Interconnect Delay Calculation,” Georgia Institute of Technology Technical Report, GIT-CC-03-37, Available HTTP: http://www.cc.gatech.edu/tech_reports/index.03.html.

Provisional Patent:

  • E. S. Shin and V. Mooney, “Fast Distributed Switch Arbiter Design and Generation.”