Round-robin Arbiter Design and Generation



SLIDE 1

Round-robin Arbiter Design and Generation

Eung S. Shin

  • Prof. Vincent J. Mooney III
  • Prof. George F. Riley

Electrical and Computer Engineering Georgia Institute of Technology

SLIDE 2

Outline

  • Introduction
  • Terminology
  • Related Work
  • Bus Arbiter (BA) Design
  • Switch Arbiter (SA) Design
  • Round-robin Arbiter Generator (RAG)
  • Comparison with other Switch Arbiters
  • Conclusion

SLIDE 3

Introduction

As the number of bus masters on a single chip increases, fast and powerful arbiters become more important. A fast arbiter is one of the dominant factors in achieving terabit switching speeds. Designing an arbiter for both high performance and fairness is a tedious and error-prone task. Our goal is to provide a fast and fair arbiter design, together with a tool for automatic generation.

[Figure: Network Switch (16x16). The VOQs of each input port feed a (16x16)x16 crossbar switch fabric, which sixteen 16x16 arbiters control through req and grant signals.]

SLIDE 4

Terminology

  • MxN Switch: M-input by N-output switch.
    • Example: A 32x32 switch is a 32-input by 32-output switch with 1024 (32^2) possible connections between input ports and output ports.
  • Virtual Output Queues (VOQs): queues placed in a switch to remove Head of Line (HOL) blocking caused by output port contention.
  • VOQ (m, n): m is the input port index and n is the output port index.
    • Example: VOQ (1, 0) is the VOQ of input port 1 and queues packets destined to output port 0.

SLIDE 5

HOL Blocking Example

[Figure: input ports 0 and 1 and output ports 0 and 1, without VOQs.]

SLIDE 6

HOL Blocking Example

[Figure: input ports 0 and 1 and output ports 0 and 1, with VOQs: VOQ (0, 0) and VOQ (0, 1) at input port 0, VOQ (1, 0) and VOQ (1, 1) at input port 1.]
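A minimal sketch of the two cases, assuming a traffic pattern the slides do not specify: input port 0 holds one packet for output port 0, and input port 1 holds a packet for output port 0 followed by a packet for output port 1. The helper names `serve_fifo` and `serve_voq` and the fixed "lowest input index wins" tie-break are illustrative assumptions, not the round-robin arbiter described later.

```python
from collections import deque


def serve_fifo(fifos, num_outputs):
    """One time slot with a single FIFO per input port (no VOQs)."""
    # Only the packet at the head of each FIFO may compete for an output.
    heads = [q[0] if q else None for q in fifos]
    grants = {}
    for out in range(num_outputs):
        contenders = [i for i, dest in enumerate(heads) if dest == out]
        if contenders:
            winner = min(contenders)  # assumed tie-break: lowest index wins
            fifos[winner].popleft()
            grants[out] = winner
    return grants


def serve_voq(voqs, num_outputs):
    """One time slot with VOQs; voqs[m][n] counts the packets in VOQ (m, n)."""
    grants, busy = {}, set()
    for out in range(num_outputs):
        contenders = [m for m, row in enumerate(voqs)
                      if row[out] > 0 and m not in busy]
        if contenders:
            winner = min(contenders)  # same assumed tie-break
            voqs[winner][out] -= 1
            busy.add(winner)  # an input sends at most one packet per slot
            grants[out] = winner
    return grants


# Maps each served output port to the input port that was granted.
without_voqs = serve_fifo([deque([0]), deque([0, 1])], 2)
with_voqs = serve_voq([[1, 0], [1, 1]], 2)
print(without_voqs)
print(with_voqs)
```

Under these assumptions, output port 1 stays idle without VOQs even though input port 1 holds a packet for it, while with VOQs both output ports are served in the same slot.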

SLIDE 7

Terminology (Continued)

(MxV)xN Switch:

  • M is the number of input ports of an MxN switch.
  • V is the number of VOQs per input port.
  • N is the number of output ports of an MxN switch.
  • Typically, V is equal to N.
  • The total number of VOQs in an MxN switch is M∗N.

[Figure: Network Switch (32x32). VOQ(0,0) to VOQ(0,31) at input port 0 through VOQ(31,0) to VOQ(31,31) at input port 31 feed a (32x32)x32 crossbar switch fabric with output ports 0 to 31. Thirty-two 32x32 arbiters receive req(0, 0) to req(31, 31) and drive grant(0, 0-31) to grant(31, 0-31).]

SLIDE 8

Terminology (Continued)

  • (MxV)xN crossbar switch fabric:
    • There are connections between (MxV) inputs (from VOQ (0, 0) to VOQ (M-1, V-1)) and N outputs, where N is the number of output ports of the switch fabric.
  • MxM Switch Arbiter (SA):
    • An MxM SA controls M specific transmission gates between M VOQs and a particular output port.
    • There are N MxM SAs in an MxN switch.

[Figure: (32x32)x32 crossbar switch fabric with thirty-two 32x32 SAs. SA_0 drives grant (0, 0) to grant (31, 0) for VOQ (0, 0) to VOQ (31, 0) toward output port 0; SA_31 drives grant (0, 31) to grant (31, 31) for VOQ (0, 31) to VOQ (31, 31) toward output port 31.]

SLIDE 9

Terminology (Continued)

  • MxM distributed SA (MxM hierarchical SA): plays the same role as an MxM SA.
    • Consists of smaller switch arbiters in the form of a hierarchical tree structure.
  • Bus Arbiter (BA): resolves bus conflicts when multiple bus masters request a bus in the same cycle.

[Figure: 8x8 hierarchical SA. Two 4x4 ack-req SAs (SA 0 and SA 1) take req0[0..3] and req1[0..3], send req0 and req1 to a 2x2 root SA, receive ack0[0] and ack0[1], and produce grant0[0..3] and grant1[0..3].]

[Figure: 4x4 BA. A ring counter (D flip-flops with clock, ack, and reset) holds token [0] to token [3]. Each token bit drives the EN input of one priority logic block (Priority Logic 0 to 3), which maps in[0..3], taken from req[0..3], to output[0..3]; the outputs form grant[0..3].]

SLIDE 10

Related Work

  • Centralized Switch Arbiters:
    • Dual Round-Robin Matching algorithm (DRRM)
      – H. J. Chao and J. S. Park, “Centralized Contention Resolution Schemes for a Larger-capacity Optical ATM Switch,” Proceedings of IEEE ATM Workshop, 1998, pp. 11-16.
    • Programmable Priority Encoder (PPE) implementing iterative round-robin algorithm (iSLIP)
      – P. Gupta and N. McKeown, “Designing and Implementing a Fast Crossbar Scheduler,” IEEE Micro, 1999, pp. 20-28.
      – N. McKeown, P. Varaiya, and J. Warland, “The iSLIP Scheduling Algorithm for Input-Queued Switch,” IEEE Transaction on Networks, 1999, pp. 188-201.
  • Distributed Switch Arbiter:
    • Ping Pong Arbiter (PPA)
      – H. J. Chao, C. H. Lam, and X. Guo, “A Fast Arbitration Scheme for Terabit Packet Switches,” Proceedings of IEEE Global Telecommunications Conference, 1999, pp. 1236-1243.
  • We will show how our generated SA achieves throughput 2.4X higher than PPE and 1.9X higher than PPA (and thus at least 1.9X higher than DRRM, since PPA outperforms DRRM).

SLIDE 11

Bus Arbiter Design

The bus arbiter is built from a ring counter, which holds the token, and “priority logic” blocks. Priority logic for 4 inputs:

  • output[0] = EN•in[0]
  • output[1] = EN•in[0]'•in[1]
  • output[2] = EN•in[0]'•in[1]'•in[2]
  • output[3] = EN•in[0]'•in[1]'•in[2]'•in[3]

Truth table (X = don't care):

  EN  in[0]  in[1]  in[2]  in[3]  |  output[0]  output[1]  output[2]  output[3]
  0   X      X      X      X      |  0          0          0          0
  1   1      X      X      X      |  1          0          0          0
  1   0      1      X      X      |  0          1          0          0
  1   0      0      1      X      |  0          0          1          0
  1   0      0      0      1      |  0          0          0          1
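The four equations above can be sketched as a small behavioral model. The function name `priority_logic` and the use of Python lists for the `in` and `output` buses are illustrative assumptions; the generated hardware is Verilog, not Python.

```python
def priority_logic(en, inputs):
    """Behavioral model of the priority logic block.

    output[i] = EN AND (NOT in[0]) AND ... AND (NOT in[i-1]) AND in[i]
    """
    outputs = []
    blocked = not en  # a disabled block never asserts an output
    for bit in inputs:
        outputs.append(bool(bit) and not blocked)
        blocked = blocked or bool(bit)  # lower indices mask higher ones
    return outputs


# in[1] and in[2] are both set; in[1] wins because it has the lower index.
print(priority_logic(True, [0, 1, 1, 0]))
```

At most one output is asserted for any input pattern, which is what lets the outputs be used directly as grant signals.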

SLIDE 12

Example: Bus Arbiter

  • Condition:
    • Token = 4’b0100 → Processor 2 has the highest priority.
    • Processor 0 and processor 1 request the bus.
  • Result:
    • Only Priority Logic 2 is enabled.
    • Processor 0 is granted because the higher-priority parties (processor 2 and processor 3) do not request the bus.
    • The token is rotated to 4’b1000 after the ring counter receives the ack signal.
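The example can be reproduced with a short behavioral sketch. It assumes that Priority Logic i sees the requests rotated so that its in[0] is req[i], which is inferred from this example rather than stated on the slides. The names `bus_arbiter` and `rotate_token` are hypothetical, and lists are indexed so that `token[2] = 1` corresponds to 4’b0100.

```python
def priority_logic(en, inputs):
    """output[i] = EN AND NOT in[0] ... AND NOT in[i-1] AND in[i]."""
    outputs, blocked = [], not en
    for bit in inputs:
        outputs.append(bool(bit) and not blocked)
        blocked = blocked or bool(bit)
    return outputs


def bus_arbiter(token, req):
    """Return grant[] for a one-hot token[] and request vector req[]."""
    n = len(req)
    grant = [False] * n
    for i in range(n):
        # Assumed wiring: Priority Logic i starts its priority order at req[i].
        rotated = [req[(i + k) % n] for k in range(n)]
        for k, out in enumerate(priority_logic(bool(token[i]), rotated)):
            if out:
                grant[(i + k) % n] = True
    return grant


def rotate_token(token):
    """Ring counter step on ack: token[i] moves to token[i + 1]."""
    return [token[-1]] + token[:-1]


token = [0, 0, 1, 0]  # token[2] = 1, i.e. 4'b0100
req = [1, 1, 0, 0]    # processors 0 and 1 request the bus
grant = bus_arbiter(token, req)
next_token = rotate_token(token)
print(grant, next_token)
```

With these inputs only `grant[0]` is asserted, and the token moves to `token[3]`, matching the result listed above.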

[Figure: Processors 0 to 3 share a memory through a 4x4 BA. With token[2] set by the ring counter, Priority Logic 2 (PL 2) sees req[0] and req[1] and asserts grant[0] through its output[2]; ack returns to the ring counter.]

SLIDE 13

Example: Bus Arbiter (Continued)

[Figure: the 4x4 BA for this example. Of the four priority logic blocks, only Priority Logic 2 is enabled (EN driven by token [2]), and it asserts grant[0].]

SLIDE 14

Switch Arbiter Design

  • A hierarchical SA consists of small switch arbiter blocks.
  • There are four types of switch arbiter blocks:
    • 2x2 ack-req SA.
    • 4x4 ack-req SA.
    • 2x2 root SA.
    • 4x4 root SA.
  • A root SA is placed at the top of a hierarchy.

[Figure: the four block types. A 4x4 ack-req SA wraps a 4x4 Bus Arbiter (inputs req0[0..3], ack, clock, reset; outputs grant0[0..3] and the combined request req0). A 2x2 ack-req SA wraps a 2x2 Bus Arbiter in the same way. A 4x4 root SA is a 4x4 BA without D flip-flop plus a ring counter (inputs req0 to req3, clock, reset; outputs ack0 to ack3). A 2x2 root SA is a 2x2 BA without D flip-flop plus a ring counter (inputs req0 and req1, clock, reset; outputs ack0 and ack1).]

SLIDE 15

Key Insight

  • With the TSMC .25µ std. cell library from LEDA Systems, 4x4 is the “sweet spot” for high performance. This is analogous to std. cell design, where building with 4-input gates is faster than building with, say, only 2-input gates or only 8-input gates.
  • Use as many 4x4 blocks as possible.
  • Use 2x2 blocks if needed.

Arbiter delay by size (blocks used in parentheses):

  Size    Our SA from RAG               PPA                      PPE
  2x2     .24 ns                        .40 ns                   .45 ns
  4x4     .34 ns                        .65 ns (three 2x2)       .61 ns
  8x8     .53 ns (two 4x4, one 2x2)     .85 ns (seven 2x2)       1.12 ns
  16x16   .76 ns (five 4x4)             1.45 ns (fifteen 2x2)    1.55 ns

SLIDE 16

Example: 32x32 hierarchical SA

[Figure: 32x32 hierarchical SA. Eight 4x4 ack-req SAs at level 0 (l0.sa0 to l0.sa7) take req0[0..3] to req7[0..3] and produce grant0[0..3] to grant7[0..3]. They send req0 to req7 to two 4x4 ack-req SAs at level 1 (l1.sa0 and l1.sa1), which return ack0[0..3] and ack1[0..3]. Level 1 sends up_req[0] and up_req[1] to a 2x2 root SA, which returns up_ack0 and up_ack1.]

  • 32x32 SA critical path (from req5[1] to grant5[1]):
    • Travels through two 4-input OR gates in series (req5[1] → req5 → up_req[1]).
    • Then through a 2x2 root SA (up_req[1] → up_ack1).
    • Finally through two 2-input AND gates in series (up_ack1 → ack1[1] → grant5[1]).
    • Results in 0.94ns delay using a TSMC 0.25µ standard cell library from LEDA Systems.
  • The ack signals look like a feedback path through the same logic block. In fact, they are not inputs to the same logic gates.
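A behavioral sketch of the request and acknowledge flow in this 32x32 hierarchy, under stated assumptions: requests are ORed on the way up, the root decides, and each acknowledged block picks a child by round-robin priority on the way down. The class name `Arbiter`, the flat numbering of request lines (req5[1] becomes line 4*5 + 1 = 21), and the rule that each block's ring counter rotates by one position whenever the block is acknowledged are illustrative choices, not details confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass
class Arbiter:
    children: list   # Arbiter blocks, or ints naming request lines
    token: int = 0   # index of the child that currently has top priority

    def _active(self, child, req):
        return child.request(req) if isinstance(child, Arbiter) else req[child]

    def request(self, req):
        """ORed request that this block sends to its parent."""
        return any(self._active(c, req) for c in self.children)

    def grant(self, req):
        """Call when this block is acknowledged; returns the granted line."""
        n = len(self.children)
        for k in range(n):
            child = self.children[(self.token + k) % n]
            if self._active(child, req):
                self.token = (self.token + 1) % n  # assumed ring counter step
                return child.grant(req) if isinstance(child, Arbiter) else child
        return None  # no request pending below this block


def build_32x32():
    level0 = [Arbiter(list(range(4 * i, 4 * i + 4))) for i in range(8)]
    level1 = [Arbiter(level0[0:4]), Arbiter(level0[4:8])]
    return Arbiter(level1)  # 2x2 root SA


# Only req5[1] (line 21) is asserted, as on the critical path above.
single = [False] * 32
single[21] = True
lone_grant = build_32x32().grant(single)

# All 32 lines request continuously for 32 arbitration cycles.
root = build_32x32()
grants = [root.grant([True] * 32) for _ in range(32)]
print(lone_grant, grants)
```

Under full load this model grants every line exactly once in 32 cycles, which illustrates the fairness goal; it says nothing about gate delays.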

SLIDE 17

Comparison w/32x32 PPE and PPA

  • PPE Critical Path:
    • output[31] = in[0]’•in[1]’•…•in[30]’•in[31], plus output encoding.
    • Involves eight 4-input AND gates and 31 inverters.
    • Results in 2.17ns delay using a TSMC 0.25µ std. cell library from LEDA Systems.
  • PPA Critical Path:
    • Only 2x2 arbiters are used.
    • 2x2 PPA: 0.4ns, while our 2x2 SA: 0.24ns.
    • 5 levels in a binary tree structure.
    • Involves 4 serially connected 2-input OR gates for the ORed request.
    • Involves 2 acknowledgements from two higher levels → three 3-input AND gates.
    • Results in 1.7ns delay using a TSMC 0.25µ std. cell library from LEDA Systems.
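The level counts behind this comparison can be checked with a few lines of arithmetic. The helper `tree_levels` is a hypothetical name, and it assumes every block merges up to `max_fan_in` signals into one.

```python
def tree_levels(num_inputs, max_fan_in):
    """Count the levels of an arbiter tree whose blocks merge up to
    max_fan_in signals each (illustrative helper, not part of RAG)."""
    levels = 0
    signals = num_inputs
    while signals > 1:
        signals = -(-signals // max_fan_in)  # ceiling division
        levels += 1
    return levels


ppa_levels = tree_levels(32, 2)  # binary tree of 2x2 arbiters
sa_levels = tree_levels(32, 4)   # 4x4 blocks where possible
print(ppa_levels, sa_levels)
```

For 32 inputs this gives 5 levels for a binary tree, as listed above, against 3 levels when 4x4 blocks are preferred (the top level then only needs a 2x2 block).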

[Figure: PPA critical path from req[31]: req (ORed) going up the tree and ack (ANDed) coming back down.]

SLIDE 18

Round-robin Arbiter Generator (RAG)

  • RAG prefers to employ as many 4x4 SAs as possible to reduce the number of levels in a hierarchy.
  • A hierarchical 4x4 SA has a longer delay (0.46ns) than a 4x4 ack-req SA (0.34ns) in the .25µ std. cell library from LEDA Systems.

[Figure: hierarchical 4x4 SA. Two 2x2 ack-req SAs (leaves) take req0[0..1] and req1[0..1], send req0 and req1 to a 2x2 root SA (root), receive ack0[0] and ack0[1], and produce grant0[0..1] and grant1[0..1].]

SLIDE 19

RAG (Continued)

  • A user specifies an arbiter type, either a Bus Arbiter or a Switch Arbiter.
  • A user specifies the number of masters (M) to be arbitrated.
  • RAG generates synthesizable Verilog code for a Bus Arbiter or a Switch Arbiter at the RTL level.
  • RAG is most efficient when M is a power of two.

[Figure: RAG flow. User input: 1. type of the arbiter, 2. number of masters. For a Bus Arbiter, gen_arb() generates an M x M bus arbiter. For a Switch Arbiter, integ_arb() integrates an M x M hierarchical switch arbiter from a library of 2x2 ack-req SA, 4x4 ack-req SA, 2x2 root SA, and 4x4 root SA blocks.]

[Figure: flowchart for the hierarchical SA.
  1. num_level ← 0, dividend ← num_masters, remainder ← 0.
  2. Until dividend = 0: dividend ← (integer) (dividend/4), n ← num_level, num_4by4_level(n) ← 0, num_2by2_level(n) ← 0, num_level++.
  3. dividend ← num_masters, n ← 0.
  4. If dividend > 2: remainder ← dividend mod 4, dividend ← (integer) (dividend/4), num_4by4_level(n) ← dividend. Otherwise: remainder ← dividend mod 2, dividend ← (integer) (dividend/2), num_2by2_level(n) ← dividend.
  5. If remainder ≠ 0: num_4by4_level(n)++ when remainder > 2, otherwise num_2by2_level(n)++.
  6. dividend ← num_4by4_level(n) + num_2by2_level(n), n++; repeat from step 4 while n < num_level.]

SLIDE 20

RAG (Continued)

Worked example of the flowchart for a hierarchical SA with num_masters = 32:

  • Level count: dividend = 32 → 8 (num_level = 1) → 2 (num_level = 2) → 0 (num_level = 3), with num_4by4_level(n) = 0 and num_2by2_level(n) = 0 for n = 0, 1, 2.
  • n = 0: dividend = 32, remainder = 0, dividend = 8, num_4by4(0) = 8 → eight 4x4 SAs.
  • n = 1: dividend = 8, remainder = 0, dividend = 2, num_4by4(1) = 2 → two 4x4 SAs.
  • n = 2: dividend = 2, remainder = 0, dividend = 1, num_2by2(2) = 1 → one 2x2 root SA.
  • n = 3: dividend = 1.
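A compact sketch of the block-count calculation, written as one plausible reading of the flowchart rather than the tool's actual code. The function name `plan_levels` is hypothetical, and the stopping rule (stop once a level contains a single block, the root) is a simplification assumed here in place of the flowchart's separate level-counting pass.

```python
def plan_levels(num_masters):
    """Return [(num_4by4, num_2by2), ...] from the leaves up to the root."""
    if num_masters < 2:
        raise ValueError("an arbiter needs at least two masters")
    levels = []
    signals = num_masters  # request lines entering the current level
    while True:
        if signals > 2:
            num_4by4, remainder = divmod(signals, 4)
            num_2by2 = 0
            if remainder > 2:
                num_4by4 += 1  # three leftover lines share one more 4x4 block
            elif remainder > 0:
                num_2by2 += 1  # one or two leftover lines share a 2x2 block
        else:
            num_4by4, num_2by2 = 0, 1
        levels.append((num_4by4, num_2by2))
        signals = num_4by4 + num_2by2  # one request line per block goes up
        if signals == 1:
            return levels  # the single remaining block is the root


plan_32 = plan_levels(32)
print(plan_32)
```

For 32 masters the result is eight 4x4 blocks, then two 4x4 blocks, then one 2x2 root, in agreement with the worked example.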

SLIDE 21

RAG (Continued)

[Figure: integ_arb() for a Switch Arbiter. It calculates the number of levels, calculates the SA blocks for each level (4x4 SAs under a 2x2 root SA), and integrates the M x M hierarchical switch arbiter from the library blocks. The result is illustrated by the 32x32 hierarchical SA: eight level-0 4x4 ack-req SAs (l0.sa0 to l0.sa7), two level-1 4x4 ack-req SAs (l1.sa0 and l1.sa1), and a 2x2 root SA.]

SLIDE 22

Comparisons with PPE and PPA

Using TSMC 0.25µ std. cell library from LEDA Systems

[Figure: two plots comparing SA, PPE, and PPA against MxM arbiter size. Left: area of arbiter in the number of inverter equivalents. Right: delay in arbiter with TSMC .25um.]

SLIDE 23

Comparisons (Continued)

The shortest delay results from

  • Limiting the size of switch arbiter blocks to 2x2 and 4x4, which avoids the critical path growth caused by expanding priority logic blocks in the Programmable Priority Encoder (PPE), a centralized arbiter.
  • Reducing the number of levels in a hierarchy by preferring 4x4 switch arbiter blocks, in contrast to the Ping-Pong Arbiter (PPA).

SLIDE 24

Speedup for a Terabit Switch

  • Assumptions for comparison
    • The speed of switching is wholly determined by the arbitration cycles.
  • Speedup
    • Our hierarchical 128x128 SA: 6.16Tbps.
    • 128x128 PPA: 3.18Tbps.
    • 128x128 PPE: 2.59Tbps.
    • Our SA achieves throughput 1.9X higher than PPA and 2.4X higher than PPE.
  • Commercial Switches
    • Mindspeed claims up to .45Tbps for a 144x144 switch using multiple chips.
    • PetaSwitch claims up to 10.24Tbps for a 256x256 switch using multiple chips.
    • No details about the logic design or process technology used.
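The quoted speedups follow directly from the three throughput figures. A quick check, with the dictionary and function names chosen only for illustration:

```python
# Throughput figures from the slide, in Tbps, for 128x128 arbiters.
throughput_tbps = {"SA": 6.16, "PPA": 3.18, "PPE": 2.59}


def speedup_over(baseline):
    """Throughput of the hierarchical SA relative to a baseline arbiter."""
    return throughput_tbps["SA"] / throughput_tbps[baseline]


speedup_ppa = round(speedup_over("PPA"), 1)
speedup_ppe = round(speedup_over("PPE"), 1)
print(speedup_ppa, speedup_ppe)
```

Rounded to one decimal place, the ratios are 1.9 and 2.4.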

[Figure: Network Switch (128x128). VOQ(0,0) to VOQ(0,127) at input port 0 through VOQ(127,0) to VOQ(127,127) at input port 127 feed a (128x128)x128 crossbar switch fabric with output ports 0 to 127. 128 arbiters of size 128x128 receive req(0, 0) to req(127, 127) and drive grant(0, 0-127) to grant(127, 0-127).]

SLIDE 25

Conclusion

  • We showed how BA logic, in the form of 2x2 and 4x4 BAs, is applied to 2x2 and 4x4 switch arbiter blocks.
  • We demonstrated how RAG generates synthesizable Verilog code for a BA and an SA, with the example of a 32x32 hierarchical SA.
  • We compared areas and delays with other SAs.
  • We demonstrated how our generated 128x128 hierarchical SA could achieve throughput 1.9X higher than PPA and 2.4X higher than PPE.