A Four-Terabit Single-Stage Packet Switch with Large Round-Trip Time Support


  1. A Four-Terabit Single-Stage Packet Switch with Large Round-Trip Time Support. F. Abel, C. Minkenberg, R. Luijten, M. Gusat, and I. Iliadis. IBM Research, Zurich Research Laboratory, CH-8803 Rüschlikon, Switzerland. www.zurich.ibm.com

  2. Motivation. Merchant switch market: achieve coverage of a wide application spectrum (MAN/WAN/SAN). Can a versatile switch architecture be designed to achieve this? It requires high performance for different protocols and QoS requirements, while making very few assumptions about traffic properties.

  3. Outline. Current single-stage switch architectures; preferred architecture; physical implementation of a 4 Tb/s switch; simulated performance of a 256 x 256 system; conclusions.

  4. Current Single-Stage Switch Architectures. (Figure: taxonomy of VOQ-based switch architectures. Centralized scheduling: IQ (no speedup) and CIOQ (limited external speedup). Distributed scheduling: shared buffer (full internal speedup) and crosspoint queueing, i.e. a distributed shared buffer with a distributed scheduler.)

  5. Selection of the Preferred Architecture. The initial focus is on high-level architecture issues, but equally significant aspects arise when actually building the system. Physical system size: multi-rack packaging, interconnection, clocking and synchronization are required. Power has become a tremendous challenge and a major design factor; typical budgets are 2 kW per rack, 150 W per card, 25 W per chip. The switch-fabric (SF) internal round trip (RT) has significantly increased, driven by the switch core (SC), line cards (LC) and VLSI chip packaging, with significant consequences for system cost, power and practical implementation.

  6. Size of a Terabit-Class System. (Figure: multi-rack system spanning 30 m / 100 feet; 19" racks with 4 shelves each and 16 x OC-192 ports per shelf; four line-card racks covering ports 0-63, 64-127, 128-191 and 192-255; a 2.5 Tb/s switch core rack with active and backup cores.)

  7. Switch-Fabric-Internal Round Trip (RT). RT = number of cells in flight between line card and switch core: RT_total = RT_cable + RT_logic, where RT_cable is the cells in flight over backplanes and/or cables, and RT_logic is the cells pipelined in the arbiter and SerDes (Serializer/Deserializer) logic. RT has become an important SF-internal issue because of the increased physical system size, the increased link speed rates, and the fact that SerDes circuits are now widely used to implement high-speed I/Os.

  Evolution of RT:
  Line rate              OC-12      OC-48      OC-192    OC-768
  Interconnect distance  1 m        1 m        6 m       30 m
  Interconnect type      backplane  backplane  cable     fiber
  Packet duration        512 ns     128 ns     32 ns     8 ns
  Round trip             << 1 cell  ~1 cell    16 cells  64 cells
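
  As a back-of-the-envelope sketch (ours, not from the slides), the cable component of the round trip can be computed from the RT_cable formula given on slide 13; RT_logic, the cells pipelined in the arbiter and SerDes, comes on top. The helper name is illustrative only.

    # Sketch: cable component of the SF-internal round trip,
    # RT_cable = 2 * d * R / (S_light * C_size), as defined on slide 13.
    # RT_logic must be added to obtain RT_total.
    def rt_cable_cells(distance_m, rate_bps, cell_bits, signal_speed_mps=250e6):
        """Cells in flight on the cable for a round trip of 2 * distance_m."""
        return 2.0 * distance_m * rate_bps / (signal_speed_mps * cell_bits)

    # Numbers from slide 13: R = 64 Gb/s internal port rate, d = 30 m,
    # 64-byte (512-bit) cells, ~250 Mm/s propagation over the dielectric.
    print(rt_cable_cells(30, 64e9, 512))   # -> 30.0 cells (RT_logic comes on top)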

  8. Preferred Architecture (1/2). Combined input- and crosspoint-queued (CICQ) architecture: decoupling of the arrival and departure processes; distributed contention resolution over both inputs and outputs; close to ideal performance is achieved without speedup of the SC; memories are operated at the line rate. (Figure: switch fabric (SF) with ingress/egress frame interfaces iFI/eFI, per-input VOQs and input schedulers IQS, a buffered crossbar (BC) with crosspoint buffers XP 1,1 .. XP N,N and output schedulers XQS inside the switch core (SC), and output queues OQ.)
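
  A minimal sketch, in Python, of the CICQ data structures as we read this slide: per-input VOQs, one crosspoint buffer per input/output pair, and independent round-robin schedulers at the inputs (IQS) and outputs (XQS). Class and method names are illustrative; this is a simplified model, not the authors' implementation.

    # Simplified CICQ model: N VOQs per input, one crosspoint (XP) buffer per
    # input/output pair, round-robin IQS at each input and XQS at each output.
    from collections import deque

    class CICQ:
        def __init__(self, n_ports, xp_size):
            self.n = n_ports
            self.xp_size = xp_size                  # crosspoint depth in cells
            self.voq = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
            self.xp = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
            self.iqs_ptr = [0] * n_ports            # round-robin pointers
            self.xqs_ptr = [0] * n_ports

        def enqueue(self, inp, out, cell):
            self.voq[inp][out].append(cell)

        def cell_time(self):
            """One cell slot: inputs and outputs resolve contention independently."""
            delivered = []
            for i in range(self.n):                 # IQS: pick a VOQ whose XP has space
                for k in range(self.n):
                    j = (self.iqs_ptr[i] + k) % self.n
                    if self.voq[i][j] and len(self.xp[i][j]) < self.xp_size:
                        self.xp[i][j].append(self.voq[i][j].popleft())
                        self.iqs_ptr[i] = (j + 1) % self.n
                        break
            for j in range(self.n):                 # XQS: pick a non-empty XP in its column
                for k in range(self.n):
                    i = (self.xqs_ptr[j] + k) % self.n
                    if self.xp[i][j]:
                        delivered.append((j, self.xp[i][j].popleft()))
                        self.xqs_ptr[j] = (i + 1) % self.n
                        break
            return delivered

    sw = CICQ(n_ports=4, xp_size=2)
    sw.enqueue(0, 3, "cell-A")
    print(sw.cell_time())   # -> [(3, 'cell-A')]; in this simplified model the cell crosses in one slot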

  9. Preferred Architecture: CICQ (2/2). Advantages: performance and robust QoS of OQ switches; a buffered crossbar is inherently free of buffer hogging; a buffered SC enables hop-by-hop FC instead of end-to-end; reduced latency at low utilization; the distribution of OQs exhibits some of the fair-queueing properties, namely fair bandwidth allocation (e.g. with a simple round-robin) and protection and isolation of the sources from each other. (Figure: same CICQ switch fabric diagram as on the previous slide.)

  10. CICQ and CoS Support. Selective queueing at each queueing point (iFI, SC, eFI); service scheduling in addition to contention resolution (IQS, XQS); additional scheduler at the egress (EQS). (Figure: each VOQ in the iFI, each crosspoint memory XPM in the SC, and each eFI holds eight per-class queues CoS[0]..CoS[7], served by the IQS, XQS and EQS schedulers respectively.)
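
  The per-queueing-point CoS handling can be pictured as eight FIFOs selected by a scheduler; the sketch below uses strict priority, which is what the simulations on slide 15 apply. The function is illustrative only, not the product's scheduler.

    # Sketch of strict-priority selection over 8 CoS FIFOs (C0 highest, C7 lowest),
    # as applied at each queueing point (iFI, SC, eFI) in the simulated system.
    from collections import deque

    def strict_priority_pick(cos_queues):
        """Return the next cell from the highest-priority non-empty class, or None."""
        for q in cos_queues:            # index 0 = C0 = highest priority
            if q:
                return q.popleft()
        return None

    queues = [deque() for _ in range(8)]
    queues[3].append("cell-C3")
    queues[7].append("cell-C7")
    print(strict_priority_pick(queues))   # -> "cell-C3"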

  11. CICQ and Parallel Sliced Switching. 64 x 64 switch core at 64 Gb/s per port. (Figure: each ingress frame interface iFI[i] with its per-CoS VOQs and IQS, connected through iSPEX/eSPEX speed-expansion chips at R/k per slice over 128 Gb/s links to the master and slave switch chips.)

  Scaling of the sliced implementation:
  Configuration     4 Tb/s   2 Tb/s   1 Tb/s
  No. master chips  x1       x1       x1
  No. slave chips   x30      x15      x8
  No. cards         x8       x4       x2

  12. Crosspoint Buffer Dimensioning (1/2). Bandwidth (on the links) is becoming the scarce resource; hence utilization must be maximized and link speedup should be avoided as much as possible. Assuming credit-based FC and a communication channel with a round trip of RT cells, RT credits are required to keep the link busy. Do we also need RT credits per XP? Traffic-agnostic principle: the bandwidth of each flow can vary on an instantaneous basis. Link-utilization principle: full utilization of the link bandwidth must be achieved in the absence of other flows. To provide 100% throughput under any traffic condition, a minimum of RT cells is required per XP, so that any input can transmit to any output at any instant and at full rate (e.g. in the case of fully unbalanced traffic or in the absence of output contention). The credit mechanism is sketched below.
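
  A minimal sketch of the credit-based flow-control argument: with a round trip of RT cells on the channel, the sender needs RT credits outstanding to keep the link busy, which is why each crosspoint is given RT cells of buffering. Names and structure here are assumptions for illustration, not the actual protocol.

    # Sketch of per-crosspoint credit-based flow control: the sender may have
    # at most `credits` cells in flight, with credits = RT (in cells) so the
    # link never goes idle while the crosspoint has space.
    class CreditedLink:
        def __init__(self, rt_cells):
            self.credits = rt_cells          # one credit per cell of round trip

        def can_send(self):
            return self.credits > 0

        def on_send(self):                   # consume a credit when a cell leaves
            self.credits -= 1

        def on_credit_return(self):          # credit comes back after ~RT
            self.credits += 1

    link = CreditedLink(rt_cells=64)
    sent = 0
    while link.can_send():                   # with 64 credits the sender covers
        link.on_send()                       # a full round trip without stalling
        sent += 1
    print(sent)                              # -> 64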

  13. Crosspoint Buffer Dimensioning (2/2). RT evaluation: RT_cable = 2 d R / (S_light C_size) ≈ 30 cells (with R = 64 Gb/s, d = 30 m, S_light = 250 Mm/s over the dielectric, C_size = 512 bits); RT_logic ≈ 30 cells (estimated by design); RT_total ≈ 60 cells. Buffer requirement (assuming RT_total = 64 cells, C_size = 64 B): per logical XP, XPM size = 64 cells x 64 B = 4 kB; total for the switch core, N^2 x 4 kB = 16 MB. An XPM size of 64 cells provides 100% throughput under contentionless traffic and d = 30 m / 100 feet, and 100% throughput under uniform traffic and d ≈ 3 km / 10,000 feet.
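
  The same sizing written out as a worked calculation, in LaTeX notation (the formula and numbers are those of this slide):

    \[
    RT_{\text{cable}} = \frac{2\,d\,R}{S_{\text{light}}\,C_{\text{size}}}
                      = \frac{2 \cdot 30\,\text{m} \cdot 64\,\text{Gb/s}}
                             {250\,\text{Mm/s} \cdot 512\,\text{bit}}
                      \approx 30\ \text{cells},
    \qquad
    RT_{\text{total}} = RT_{\text{cable}} + RT_{\text{logic}} \approx 60\ \text{cells}
    \]
    \[
    \text{XPM}_{\text{size}} = RT_{\text{total}} \cdot C_{\text{size}}
                             = 64 \cdot 64\,\text{B} = 4\,\text{kB},
    \qquad
    \text{SC memory} = N^{2} \cdot \text{XPM}_{\text{size}}
                     = 64^{2} \cdot 4\,\text{kB} = 16\,\text{MB}
    \]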

  14. VLSI Implementation. CMOS 0.11 µm, standard-cell design, 2.5 Gb/s SerDes. Slave chip (64 x 64 @ 2 Gb/s/port): 200 mm², 20 W, 750 SIOs. Split master chip (48 x 48 @ 4 Gb/s/port): 225 mm², 28 W, 825 SIOs.

  15. Simulated Performance. Parameters: 256 x 256 CICQ switch fabric; 64 x 64 SC with 4 external ports (OC-192) per SC port (OC-768); 64/128 cells per XP, partitioned into 4 areas of 16/32 cells; ingress and egress link RT = 64 cells (at the OC-768 level); line-card egress buffer = 4 x 256 cells. CoS: 8 classes of service (C0 is the highest, C7 the lowest priority); uniform distribution, i.e. 12.5% of the offered traffic per class; strict priority scheduling throughout the system (iFI, SC, eFI).

  16. Non-Uniform Traffic. We adopt the distribution used by Rojas-Cessa et al. (HPSR 2001), where λ_{i,j} = λ (w + (1-w)/N) if i = j, and λ_{i,j} = λ (1-w)/N otherwise. Here N is the number of ports (256), λ_{i,j} is the traffic intensity from input i to output j, λ is the aggregate offered load (100%), and w is the non-uniformity factor. (Figure: simulation results for XPM = 68/72/80/ cells.)
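
  A small sketch (ours, assuming numpy) that builds the Rojas-Cessa traffic matrix defined above; it only restates the formula in code, and each row sums to the aggregate offered load λ.

    # Non-uniform traffic matrix of Rojas-Cessa et al. (HPSR 2001):
    # lambda_{i,j} = load*(w + (1-w)/N) if i == j else load*(1-w)/N.
    import numpy as np

    def traffic_matrix(n_ports=256, load=1.0, w=0.5):
        lam = np.full((n_ports, n_ports), load * (1.0 - w) / n_ports)
        np.fill_diagonal(lam, load * (w + (1.0 - w) / n_ports))
        return lam

    lam = traffic_matrix()
    print(lam.sum(axis=1)[:3])   # every input offers the full load -> [1. 1. 1.]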

  17. Uniform Traffic. Uniform traffic: bursts uniformly distributed over all 256 destinations, with geometrically distributed burst lengths. (Figures: results for XPM = 64 cells with bursts of 30 cells, and XPM = 128 cells with bursts of 30 cells.)
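
  A minimal sketch of the bursty source as we read this slide: geometrically distributed burst lengths with a mean of 30 cells, each burst aimed at a uniformly chosen destination. The authors' exact on/off source model is not given here, so the details below are assumptions.

    # Sketch of a bursty source: geometric burst lengths with a chosen mean,
    # destinations uniform over all outputs. Illustrative only.
    import random

    def bursts(n_ports=256, mean_burst=30, n_bursts=5, seed=1):
        rng = random.Random(seed)
        p = 1.0 / mean_burst                      # geometric with mean 1/p
        for _ in range(n_bursts):
            dest = rng.randrange(n_ports)         # uniform destination per burst
            length = 1
            while rng.random() > p:               # geometric number of cells
                length += 1
            yield dest, length

    for dest, length in bursts():
        print(f"burst of {length} cells to output {dest}")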

  18. Conclusions. System design and implementation are as important as performance considerations: impact of power, packaging, links and RT; traffic-agnosticness requirement of the OEM market; CoS support. The CICQ architecture is a viable solution: scalable; demonstrated by the sizing and VLSI implementation of a single-stage 4 Tb/s switch; excellent performance.

  19. Contacts. IBM Prizma research team: http://www.zurich.ibm.com/cs/powerprs.html. IBM PowerPRS™ switch fabric products: http://www-3.ibm.com/chips/products/wired/products/switch_fabric.html
