

SLIDE 1

A Four-Terabit Single-Stage Packet Switch with Large Round-Trip Time Support

  • F. Abel, C. Minkenberg, R. Luijten, M. Gusat, and I. Iliadis

IBM Research, Zurich Research Laboratory, CH-8803 Rüschlikon, Switzerland

SLIDE 2

Motivation

  • Merchant switch market
  • Achieve coverage of a wide application spectrum: MAN/WAN/SAN
  • Can a versatile switch architecture be designed to achieve this? It requires:
  • High performance for different protocols and QoS requirements
  • Making very few assumptions about traffic properties

SLIDE 3

Outline

  • Current single-stage switch architectures
  • Preferred architecture
  • Physical implementation of a 4 Tb/s switch
  • Simulated performance: 256 x 256 system
  • Conclusions

SLIDE 4

Current Single-Stage Switch Architectures

[Figure: taxonomy of VOQ-based single-stage switch architectures — centralized scheduling: IQ (no speedup) and CIOQ with a central scheduler; distributed shared buffer (full internal speedup): shared-buffer switch; distributed scheduling (limited external speedup): crosspoint queueing.]

SLIDE 5

Selection of the Preferred Architecture

  • Initial focus is on high-level architecture issues, but equally significant aspects arise when actually building the system:
  • Physical system size
    Multi-rack packaging, interconnection, clocking and synchronization are required
  • Power has become a tremendous challenge and a major design factor
    Typically required: 2 kW per rack, 150 W per card, 25 W per chip
  • The switch-fabric (SF) internal round trip (RT) has increased significantly
    Spans the switch core (SC), line cards (LC) and VLSI chip packaging
  • Significant consequences for system cost, power and practical implementation

SLIDE 6

Size of a Terabit-Class System

[Figure: physical size of a terabit-class system — a 2.5 Tb/s switch core rack (active + backup) surrounded by line-card racks holding line cards 0-63, 64-127, 128-191 and 192-255; each 19" rack has 4 shelves with 16 x OC-192 line cards per shelf, and the installation spans on the order of 30 m / 100 feet.]

SLIDE 7

Switch-Fabric-Internal Round Trip (RT)

RT = number of cells in flight:

  RTtotal = RTcable + RTlogic

  RTcable = cells in flight over backplanes and/or cables
  RTlogic = cells pipelined in arbiter and SerDes (Serializer/Deserializer) logic

RT has become an important SF-internal issue because of:

  • Increased physical system size
  • Increased link speed rates
  • SerDes circuits are now widely used to implement high-speed I/Os

[Figure: the RT is accumulated over the distance d between a line card and the switch core, through the SerDes-Tx and SerDes-Rx stages at both ends of the link.]

Evolution of RT:

  Line rate              OC-12      OC-48      OC-192    OC-768
  Interconnect distance  1 m        1 m        6 m       30 m
  Interconnect type      backplane  backplane  cable     fiber
  Packet duration        512 ns     128 ns     32 ns     8 ns
  Round trip             << 1 cell  ~ 1 cell   16 cells  64 cells
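As a concrete illustration of how the last column scales, here is a minimal Python sketch (ours, not from the slides) that computes RTcable — the cells in flight on the interconnect alone — using the 250 Mm/s propagation speed over the dielectric quoted later on slide 13; RTlogic comes on top of this.

```python
def rt_cable(distance_m, port_rate_bps, cell_bits, signal_speed_mps=250e6):
    """Cells in flight on the cable alone: round-trip propagation time
    divided by the duration of one cell at the internal port rate."""
    propagation_rt_s = 2.0 * distance_m / signal_speed_mps
    cell_duration_s = cell_bits / port_rate_bps
    return propagation_rt_s / cell_duration_s

# OC-768 line card 30 m from the switch core, 64 Gb/s internal port rate,
# 64-byte (512-bit) cells -- the configuration assumed on slide 13.
print(rt_cable(30, 64e9, 512))   # ~30 cells; RTlogic (~30 more) is added on top
```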

SLIDE 8

Preferred Architecture (1/2)

Combined input- and crosspoint-queued (CICQ) architecture

  • Decoupling of the arrival and departure processes
  • Distributed contention resolution over both inputs and outputs
  • Close to ideal performance is achieved without speedup of the SC
  • Memories are operated at the line rate

[Figure: CICQ switch fabric (SF) — each ingress adapter iFI i holds VOQ i,1 ... VOQ i,N served by an input scheduler IQS i; the buffered crossbar (BC) inside the switch core (SC) provides one crosspoint buffer XP i,j per input/output pair, drained by crosspoint schedulers XQS; each egress adapter eFI j holds an output queue OQ j.]
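As a structural summary of the figure, the following Python skeleton (class and field names are ours, purely illustrative) lists the three places where a cell can wait in a CICQ fabric:

```python
from collections import deque

class CICQFabric:
    """Illustrative skeleton of the CICQ queueing structure: per-input VOQs,
    one crosspoint buffer XP[i][j] per input/output pair inside the buffered
    crossbar, and one output queue per egress adapter."""
    def __init__(self, n_ports, xp_depth_cells):
        self.n = n_ports
        self.xp_depth = xp_depth_cells   # capacity of each crosspoint buffer, in cells
        # VOQ[i][j]: cells waiting at ingress adapter i for output j
        self.voq = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
        # XP[i][j]: crosspoint buffer in the switch core (admission is governed
        # by credit-based flow control, which is not modelled here)
        self.xp = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
        # OQ[j]: output queue at egress adapter j
        self.oq = [deque() for _ in range(n_ports)]

fabric = CICQFabric(n_ports=64, xp_depth_cells=64)   # slide 13 sizes each XP at 64 cells
```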

SLIDE 9

Preferred Architecture: CICQ (2/2)

Advantages:

  • Performance and robust QoS of OQ switches
  • A buffered crossbar is inherently free of buffer hogging
  • A buffered SC enables hop-by-hop FC instead of end-to-end FC
  • Reduced latency at low utilization
  • Distribution of the OQs exhibits some of the fair-queueing properties:
  • Fair bandwidth allocation (e.g. with a simple Round-Robin, sketched below)
  • Protection and isolation of the sources from each other

[Figure: the same CICQ switch fabric as on the previous slide — VOQs and IQS per ingress adapter, crosspoint buffers XP i,j in the buffered crossbar (BC), and output queues per egress adapter.]
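To make the Round-Robin bullet concrete, here is a minimal sketch (ours, not the chip's scheduler) of a round-robin pointer choosing the next non-empty crosspoint buffer that feeds a given output:

```python
def round_robin_select(xp_column, last_served):
    """Pick the next non-empty crosspoint buffer XP[i][j] feeding one output j.

    xp_column   -- per-input crosspoint buffers (e.g. lists of cells) for output j
    last_served -- input index served in the previous cell slot
    Returns the input index to serve now, or None if all buffers are empty.
    Resuming the search one past the last served input gives every backlogged
    input an equal share of the output's bandwidth.
    """
    n = len(xp_column)
    for offset in range(1, n + 1):
        i = (last_served + offset) % n
        if xp_column[i]:              # non-empty crosspoint buffer
            return i
    return None
```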

SLIDE 10

CICQ and CoS Support

  • Selective queueing at each queueing point (iFI, SC, eFI)
  • Service scheduling in addition to contention resolution (IQS, XQS)
  • Additional scheduler at the egress (EQS)

[Figure: CoS-aware CICQ — every VOQ, every crosspoint buffer XP i,j and every egress queue is partitioned into eight class queues CoS[0] ... CoS[7]; the input schedulers IQS, crosspoint schedulers XQS and egress schedulers EQS arbitrate among the classes.]

SLIDE 11

CICQ and Parallel Sliced Switching

  Capacity             4 Tb/s   2 Tb/s   1 Tb/s
  No. of master chips  x1       x1       x1
  No. of slave chips   x30      x15      x8
  No. of cards         x8       x4       x2

[Figure: parallel sliced switching — each cell leaving iFI i (header Hd plus data slices D) is spread at R/k per slice over ingress/egress slice expanders (iSPEX/eSPEX) onto one master (M) chip and multiple slave (S) chips that together form a 64 x 64 switch core at 64 Gb/s per port, interconnected by 128 Gb/s links; the slices are recombined on the egress side.]
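The slicing idea itself — one cell is cut into k slices that travel in parallel at R/k each and are reassembled at the egress — can be sketched as follows; the slice count and the contiguous byte mapping are our own illustrative choices, not the chip's actual format:

```python
def slice_cell(cell: bytes, k: int):
    """Cut one cell into k equal slices for k parallel data paths at R/k each."""
    assert len(cell) % k == 0, "cell size must be a multiple of the slice count"
    width = len(cell) // k
    return [cell[i * width:(i + 1) * width] for i in range(k)]

def reassemble(slices):
    """Inverse operation on the egress side of the sliced switch core."""
    return b"".join(slices)

cell = bytes(range(64))              # one 64-byte cell
slices = slice_cell(cell, 8)         # e.g. 8 slices of 8 bytes each (illustrative k)
assert reassemble(slices) == cell
```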

SLIDE 12

Crosspoint Buffer Dimensioning (1/2)

Bandwidth (on the links) is becoming the scarce resource

Hence

  • Utilization must be maximized
  • Link speedup should be avoided as much as possible

Assuming credit-based FC and a communication channel with a round trip of RT cells:

RT credits are required to keep the link busy

Do we also need RT credits per XP?

Traffic agnostic principle:

The bandwidth of each flow can vary on an instantaneous basis

Link utilization principle:

Full utilization of the link bandwidth must be achieved in the absence of other flows

To provide 100% throughput under any traffic condition:

A minimum of RT cells is required per XP to ensure that any input can transmit to any output at any instant and at full rate

(e.g. in the case of fully unbalanced traffic or in the absence of output contention)
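Since the argument rests on credit-based flow control, a minimal sketch of a per-XP credit counter may help; this is our illustration of the generic mechanism, not the chip's actual protocol:

```python
class CreditedLink:
    """Credit-based flow control toward one crosspoint buffer (XP).

    The sender starts with one credit per cell of downstream buffer space,
    spends a credit for every cell sent, and regains it roughly one round
    trip later, when the receiver signals that the cell has left the buffer.
    With fewer than RT credits the sender can stall even while cells are
    still draining downstream, which is why each XP needs at least RT cells."""
    def __init__(self, initial_credits):
        self.credits = initial_credits

    def can_send(self):
        return self.credits > 0

    def on_cell_sent(self):
        assert self.can_send()
        self.credits -= 1            # one credit per cell in flight or buffered

    def on_credit_returned(self):
        self.credits += 1            # downstream buffer freed one cell slot

link = CreditedLink(initial_credits=64)   # RT = 64 cells per XP (slide 13)
```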

SLIDE 13

Crosspoint Buffer Dimensioning (2/2)

RT evaluation

RTcable = 2 · d · R / (Slight · Csize) ≈ 30 cells

(with R = 64 Gb/s, d = 30 m, Slight = 250 Mm/s (over the dielectric), Csize = 512 bits)

RTlogic ≈ 30 cells (estimated by design); RTtotal ≈ 60 cells

Buffer requirement (assuming RTtotal = 64 cells, Csize = 64 B)

  • Per logical XP: XPMsize = RTtotal · Csize = 64 cells × 64 B = 4 kB
  • Total for the switch core: N² · XPMsize = 64² × 4 kB = 16 MB

XPMsize = RTtotal = 64 cells provides:

  • 100% throughput under contentionless traffic and d = 30 m / 100 feet
  • 100% throughput under uniform traffic and d ≈ 3 km / 10,000 feet
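The two memory figures follow directly from RTtotal and the cell size; a few lines of Python (our arithmetic check, not from the slides) reproduce them:

```python
# Crosspoint memory sizing with the parameters stated on this slide
rt_total_cells = 64                      # RTcable (~30) + RTlogic (~30), rounded up
cell_bytes = 64                          # Csize
n_ports = 64                             # switch-core ports N

xpm_size = rt_total_cells * cell_bytes   # 4096 B = 4 kB per logical XP
core_memory = n_ports ** 2 * xpm_size    # 64 x 64 crosspoints -> 16 MB total

print(xpm_size, core_memory)             # 4096 16777216
```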

SLIDE 14

VLSI Implementation

  • CMOS 0.11-µm, standard-cell design, 2.5 Gb/s SerDes
  • Slave chip (64 x 64 @ 2 Gb/s/port): 200 mm², 20 W, 750 SIOs
  • Split master chip (48 x 48 @ 4 Gb/s/port): 225 mm², 28 W, 825 SIOs

SLIDE 15

Simulated Performance

Parameters:

256 x 256 CICQ switch fabric

  • 64 x 64 SC with 4 external ports (OC-192) per SC port (OC-768)
  • 64/128 cells per XP, partitioned into 4 areas of 16/32 cells
  • Ingress and egress link RT = 64 cells (at the OC-768 level)
  • Line-card egress buffer = 4 x 256 cells

CoS

  • 8 classes of service (C0 is the highest, C7 the lowest priority)
  • Uniform distribution, i.e. 12.5% of the offered traffic per class
  • Strict-priority scheduling throughout the system (iFI, SC, eFI)

SLIDE 16

Non-Uniform Traffic

[Plot: simulation results under non-uniform traffic for XPM = 68/72/80/… cells.]

Non-uniform traffic: we adopt the distribution used by Rojas-Cessa et al. (HPSR 2001), where:

  λi,j = λ · (w + (1 − w)/N)   if i = j
  λi,j = λ · (1 − w)/N         otherwise

  • N is the number of ports (256), λi,j is the traffic intensity from input i to output j
  • λ is the aggregate offered load (100%), w is the non-uniformity factor
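This definition translates directly into a traffic matrix; the snippet below (our illustration) builds it and checks that every input offers exactly the aggregate load λ:

```python
def unbalanced_traffic_matrix(n_ports, load, w):
    """Traffic intensity lambda[i][j] per Rojas-Cessa et al. (HPSR 2001):
    a fraction w of each input's load targets its 'paired' output j = i,
    the remainder is spread uniformly over all N outputs.
    w = 0 gives uniform traffic, w = 1 fully unbalanced traffic."""
    lam = [[load * (1.0 - w) / n_ports for _ in range(n_ports)]
           for _ in range(n_ports)]
    for i in range(n_ports):
        lam[i][i] += load * w
    return lam

m = unbalanced_traffic_matrix(256, 1.0, 0.5)
assert abs(sum(m[0]) - 1.0) < 1e-9       # each input offers the full aggregate load
```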
SLIDE 17

Uniform Traffic

[Plot: results for XPM = 64 cells, bursts of 30 cells.]

Uniform traffic:

  • Bursts uniformly distributed over all 256 destinations
  • Geometrically distributed burst lengths (see the generator sketch below)

[Plot: results for XPM = 128 cells, bursts of 30 cells.]
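A minimal generator for this traffic model — uniformly chosen destinations with geometrically distributed burst lengths of mean 30 cells — could look like the following sketch (ours, not the authors' simulator):

```python
import random

def bursts(n_ports, mean_burst_len, n_bursts, seed=0):
    """Yield (destination, burst_length) pairs: destinations are uniform over
    all ports, burst lengths are geometric with the given mean (>= 1 cell)."""
    rng = random.Random(seed)
    p = 1.0 / mean_burst_len             # geometric parameter, mean length = 1/p
    for _ in range(n_bursts):
        dest = rng.randrange(n_ports)
        length = 1
        while rng.random() > p:          # continue the burst with probability 1 - p
            length += 1
        yield dest, length

for dest, length in bursts(256, 30, 3):
    print(dest, length)
```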

SLIDE 18

Conclusions

System design and implementation are as important as performance considerations

Impact of power, packaging, links, RT

Traffic-agnosticism requirement in the OEM market

CoS support

CICQ architecture is a viable solution

Scalable

Demonstrated sizing

VLSI implementation of a single-stage 4 Tb/s switch
Excellent performance

SLIDE 19

Contacts

IBM Prizma research team

http://www.zurich.ibm.com/cs/powerprs.html

IBM PowerPRS™: Switch fabric products

http://www-3.ibm.com/chips/products/wired/products/switch_fabric.html