

SLIDE 1

A Four-Terabit Single-Stage Packet Switch with Large Round-Trip Time Support

  • F. Abel, C. Minkenberg, R. Luijten, M. Gusat, and I. Iliadis

IBM Research, Zurich Research Laboratory, CH-8803 Rüschlikon, Switzerland

SLIDE 2

Motivation

  • Merchant switch market
  • Achieve coverage of a wide application spectrum: MAN/WAN/SAN
  • Can a versatile switch architecture be designed to achieve this? It requires:
  • High performance for different protocols and QoS requirements
  • Making very few assumptions about traffic properties

SLIDE 3

Outline

  • Current single-stage switch architectures
  • Preferred architecture
  • Physical implementation of a 4 Tb/s switch
  • Simulated performance: 256 x 256 system
  • Conclusions

SLIDE 4

Current Single-Stage Switch Architectures

[Figure: taxonomy of VOQ-based single-stage switch architectures — centralized scheduling: IQ (no speedup) and CIOQ with a central scheduler; distributed shared buffer (full internal speedup): shared-buffer switch; distributed scheduling (limited external speedup): crosspoint queueing.]

SLIDE 5

Selection of the Preferred Architecture

  • Initial focus is on high-level architecture issues, but equally significant aspects arise when actually building the system:
  • Physical system size
    Multi-rack packaging, interconnection, clocking and synchronization are required
  • Power has become a tremendous challenge and a major design factor
    Typically required: 2 kW per rack, 150 W per card, 25 W per chip
  • The switch-fabric (SF) internal round trip (RT) has increased significantly
    Spans the switch core (SC), line cards (LC) and VLSI chip packaging
  • Significant consequences for system cost, power and practical implementation

SLIDE 6

Size of a Terabit-Class System

[Figure: physical size of a terabit-class system — a 2.5 Tb/s switch core rack (active + backup) surrounded by line-card racks holding line cards 0-63, 64-127, 128-191 and 192-255; each 19" rack has 4 shelves with 16 x OC-192 line cards per shelf, and the installation spans on the order of 30 m / 100 feet.]

SLIDE 7

Switch-Fabric-Internal Round Trip (RT)

RT = number of cells in flight:

  RTtotal = RTcable + RTlogic

  RTcable = cells in flight over backplanes and/or cables
  RTlogic = cells pipelined in arbiter and SerDes (Serializer/Deserializer) logic

RT has become an important SF-internal issue because of:

  • Increased physical system size
  • Increased link speed rates
  • SerDes circuits are now widely used to implement high-speed I/Os

[Figure: the RT is accumulated over the distance d between a line card and the switch core, through the SerDes-Tx and SerDes-Rx stages at both ends of the link.]

Evolution of RT:

  Line rate              OC-12      OC-48      OC-192    OC-768
  Interconnect distance  1 m        1 m        6 m       30 m
  Interconnect type      backplane  backplane  cable     fiber
  Packet duration        512 ns     128 ns     32 ns     8 ns
  Round trip             << 1 cell  ~ 1 cell   16 cells  64 cells
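As a concrete illustration of how the last column scales, here is a minimal Python sketch (ours, not from the slides) that computes RTcable — the cells in flight on the interconnect alone — using the 250 Mm/s propagation speed over the dielectric quoted later on slide 13; RTlogic comes on top of this.

```python
def rt_cable(distance_m, port_rate_bps, cell_bits, signal_speed_mps=250e6):
    """Cells in flight on the cable alone: round-trip propagation time
    divided by the duration of one cell at the internal port rate."""
    propagation_rt_s = 2.0 * distance_m / signal_speed_mps
    cell_duration_s = cell_bits / port_rate_bps
    return propagation_rt_s / cell_duration_s

# OC-768 line card 30 m from the switch core, 64 Gb/s internal port rate,
# 64-byte (512-bit) cells -- the configuration assumed on slide 13.
print(rt_cable(30, 64e9, 512))   # ~30 cells; RTlogic (~30 more) is added on top
```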

SLIDE 8

Preferred Architecture (1/2)

Combined input- and crosspoint-queued (CICQ) architecture

  • Decoupling of the arrival and departure processes
  • Distributed contention resolution over both inputs and outputs
  • Close to ideal performance is achieved without speedup of the SC
  • Memories are operated at the line rate

[Figure: CICQ switch fabric (SF) — each ingress adapter iFI i holds VOQ i,1 ... VOQ i,N served by an input scheduler IQS i; the buffered crossbar (BC) inside the switch core (SC) provides one crosspoint buffer XP i,j per input/output pair, drained by crosspoint schedulers XQS; each egress adapter eFI j holds an output queue OQ j.]
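As a structural summary of the figure, the following Python skeleton (class and field names are ours, purely illustrative) lists the three places where a cell can wait in a CICQ fabric:

```python
from collections import deque

class CICQFabric:
    """Illustrative skeleton of the CICQ queueing structure: per-input VOQs,
    one crosspoint buffer XP[i][j] per input/output pair inside the buffered
    crossbar, and one output queue per egress adapter."""
    def __init__(self, n_ports, xp_depth_cells):
        self.n = n_ports
        self.xp_depth = xp_depth_cells   # capacity of each crosspoint buffer, in cells
        # VOQ[i][j]: cells waiting at ingress adapter i for output j
        self.voq = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
        # XP[i][j]: crosspoint buffer in the switch core (admission is governed
        # by credit-based flow control, which is not modelled here)
        self.xp = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
        # OQ[j]: output queue at egress adapter j
        self.oq = [deque() for _ in range(n_ports)]

fabric = CICQFabric(n_ports=64, xp_depth_cells=64)   # slide 13 sizes each XP at 64 cells
```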

SLIDE 9

Preferred Architecture: CICQ (2/2)

Advantages:

  • Performance and robust QoS of OQ switches
  • A buffered crossbar is inherently free of buffer hogging
  • A buffered SC enables hop-by-hop FC instead of end-to-end FC
  • Reduced latency at low utilization
  • Distribution of the OQs exhibits some of the fair-queueing properties:
  • Fair bandwidth allocation (e.g. with a simple Round-Robin, sketched below)
  • Protection and isolation of the sources from each other

[Figure: the same CICQ switch fabric as on the previous slide — VOQs and IQS per ingress adapter, crosspoint buffers XP i,j in the buffered crossbar (BC), and output queues per egress adapter.]
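To make the Round-Robin bullet concrete, here is a minimal sketch (ours, not the chip's scheduler) of a round-robin pointer choosing the next non-empty crosspoint buffer that feeds a given output:

```python
def round_robin_select(xp_column, last_served):
    """Pick the next non-empty crosspoint buffer XP[i][j] feeding one output j.

    xp_column   -- per-input crosspoint buffers (e.g. lists of cells) for output j
    last_served -- input index served in the previous cell slot
    Returns the input index to serve now, or None if all buffers are empty.
    Resuming the search one past the last served input gives every backlogged
    input an equal share of the output's bandwidth.
    """
    n = len(xp_column)
    for offset in range(1, n + 1):
        i = (last_served + offset) % n
        if xp_column[i]:              # non-empty crosspoint buffer
            return i
    return None
```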

SLIDE 10

CICQ and CoS Support

  • Selective queueing at each queueing point (iFI, SC, eFI)
  • Service scheduling in addition to contention resolution (IQS, XQS)
  • Additional scheduler at the egress (EQS)

[Figure: CoS-aware CICQ — every VOQ, every crosspoint buffer XP i,j and every egress queue is partitioned into eight class queues CoS[0] ... CoS[7]; the input schedulers IQS, crosspoint schedulers XQS and egress schedulers EQS arbitrate among the classes.]

SLIDE 11

CICQ and Parallel Sliced Switching

  Capacity             4 Tb/s   2 Tb/s   1 Tb/s
  No. of master chips  x1       x1       x1
  No. of slave chips   x30      x15      x8
  No. of cards         x8       x4       x2

[Figure: parallel sliced switching — each cell leaving iFI i (header Hd plus data slices D) is spread at R/k per slice over ingress/egress slice expanders (iSPEX/eSPEX) onto one master (M) chip and multiple slave (S) chips that together form a 64 x 64 switch core at 64 Gb/s per port, interconnected by 128 Gb/s links; the slices are recombined on the egress side.]
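The slicing idea itself — one cell is cut into k slices that travel in parallel at R/k each and are reassembled at the egress — can be sketched as follows; the slice count and the contiguous byte mapping are our own illustrative choices, not the chip's actual format:

```python
def slice_cell(cell: bytes, k: int):
    """Cut one cell into k equal slices for k parallel data paths at R/k each."""
    assert len(cell) % k == 0, "cell size must be a multiple of the slice count"
    width = len(cell) // k
    return [cell[i * width:(i + 1) * width] for i in range(k)]

def reassemble(slices):
    """Inverse operation on the egress side of the sliced switch core."""
    return b"".join(slices)

cell = bytes(range(64))              # one 64-byte cell
slices = slice_cell(cell, 8)         # e.g. 8 slices of 8 bytes each (illustrative k)
assert reassemble(slices) == cell
```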

SLIDE 12

Crosspoint Buffer Dimensioning (1/2)

Bandwidth (on the links) is becoming the scarce resource

Hence

  • Utilization must be maximized
  • Link speedup should be avoided as much as possible

Assuming credit-based FC and a communication channel with a round trip of RT cells:

RT credits are required to keep the link busy

Do we also need RT credits per XP?

Traffic agnostic principle:

The bandwidth of each flow can vary on an instantaneous basis

Link utilization principle:

Full utilization of the link bandwidth must be achieved in the absence of other flows

To provide 100% throughput under any traffic condition:

A minimum of RT cells is required per XP to ensure that any input can transmit to any output at any instant and at full rate

(e.g. in the case of fully unbalanced traffic or in the absence of output contention)
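Since the argument rests on credit-based flow control, a minimal sketch of a per-XP credit counter may help; this is our illustration of the generic mechanism, not the chip's actual protocol:

```python
class CreditedLink:
    """Credit-based flow control toward one crosspoint buffer (XP).

    The sender starts with one credit per cell of downstream buffer space,
    spends a credit for every cell sent, and regains it roughly one round
    trip later, when the receiver signals that the cell has left the buffer.
    With fewer than RT credits the sender can stall even while cells are
    still draining downstream, which is why each XP needs at least RT cells."""
    def __init__(self, initial_credits):
        self.credits = initial_credits

    def can_send(self):
        return self.credits > 0

    def on_cell_sent(self):
        assert self.can_send()
        self.credits -= 1            # one credit per cell in flight or buffered

    def on_credit_returned(self):
        self.credits += 1            # downstream buffer freed one cell slot

link = CreditedLink(initial_credits=64)   # RT = 64 cells per XP (slide 13)
```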

SLIDE 13

Crosspoint Buffer Dimensioning (2/2)

RT evaluation

RTcable = 2 · d · R / (Slight · Csize) ≈ 30 cells

(with R = 64 Gb/s, d = 30 m, Slight = 250 Mm/s (over the dielectric), Csize = 512 bits)

RTlogic ≈ 30 cells (estimated by design); RTtotal ≈ 60 cells

Buffer requirement (assuming RTtotal = 64 cells, Csize = 64 B)

  • Per logical XP: XPMsize = RTtotal · Csize = 64 cells × 64 B = 4 kB
  • Total for the switch core: N² · XPMsize = 64² × 4 kB = 16 MB

XPMsize = RTtotal = 64 cells provides:

  • 100% throughput under contentionless traffic and d = 30 m / 100 feet
  • 100% throughput under uniform traffic and d ≈ 3 km / 10,000 feet
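The two memory figures follow directly from RTtotal and the cell size; a few lines of Python (our arithmetic check, not from the slides) reproduce them:

```python
# Crosspoint memory sizing with the parameters stated on this slide
rt_total_cells = 64                      # RTcable (~30) + RTlogic (~30), rounded up
cell_bytes = 64                          # Csize
n_ports = 64                             # switch-core ports N

xpm_size = rt_total_cells * cell_bytes   # 4096 B = 4 kB per logical XP
core_memory = n_ports ** 2 * xpm_size    # 64 x 64 crosspoints -> 16 MB total

print(xpm_size, core_memory)             # 4096 16777216
```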

SLIDE 14

VLSI Implementation

  • CMOS 0.11-µm, standard-cell design, 2.5 Gb/s SerDes
  • Slave chip (64 x 64 @ 2 Gb/s/port): 200 mm², 20 W, 750 SIOs
  • Split master chip (48 x 48 @ 4 Gb/s/port): 225 mm², 28 W, 825 SIOs

SLIDE 15

Simulated Performance

Parameters:

256 x 256 CICQ switch fabric

  • 64 x 64 SC with 4 external ports (OC-192) per SC port (OC-768)
  • 64/128 cells per XP, partitioned into 4 areas of 16/32 cells
  • Ingress and egress link RT = 64 cells (at the OC-768 level)
  • Line-card egress buffer = 4 x 256 cells

CoS

  • 8 classes of service (C0 is the highest, C7 the lowest priority)
  • Uniform distribution, i.e. 12.5% of the offered traffic per class
  • Strict-priority scheduling throughout the system (iFI, SC, eFI)

SLIDE 16

Non-Uniform Traffic

[Plot: simulation results under non-uniform traffic for XPM = 68/72/80/… cells.]

Non-uniform traffic: we adopt the distribution used by Rojas-Cessa et al. (HPSR 2001), where:

  λi,j = λ · (w + (1 − w)/N)   if i = j
  λi,j = λ · (1 − w)/N         otherwise

  • N is the number of ports (256), λi,j is the traffic intensity from input i to output j
  • λ is the aggregate offered load (100%), w is the non-uniformity factor
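This definition translates directly into a traffic matrix; the snippet below (our illustration) builds it and checks that every input offers exactly the aggregate load λ:

```python
def unbalanced_traffic_matrix(n_ports, load, w):
    """Traffic intensity lambda[i][j] per Rojas-Cessa et al. (HPSR 2001):
    a fraction w of each input's load targets its 'paired' output j = i,
    the remainder is spread uniformly over all N outputs.
    w = 0 gives uniform traffic, w = 1 fully unbalanced traffic."""
    lam = [[load * (1.0 - w) / n_ports for _ in range(n_ports)]
           for _ in range(n_ports)]
    for i in range(n_ports):
        lam[i][i] += load * w
    return lam

m = unbalanced_traffic_matrix(256, 1.0, 0.5)
assert abs(sum(m[0]) - 1.0) < 1e-9       # each input offers the full aggregate load
```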
SLIDE 17

Uniform Traffic

[Plot: results for XPM = 64 cells, bursts of 30 cells.]

Uniform traffic:

  • Bursts uniformly distributed over all 256 destinations
  • Geometrically distributed burst lengths (see the generator sketch below)

[Plot: results for XPM = 128 cells, bursts of 30 cells.]
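A minimal generator for this traffic model — uniformly chosen destinations with geometrically distributed burst lengths of mean 30 cells — could look like the following sketch (ours, not the authors' simulator):

```python
import random

def bursts(n_ports, mean_burst_len, n_bursts, seed=0):
    """Yield (destination, burst_length) pairs: destinations are uniform over
    all ports, burst lengths are geometric with the given mean (>= 1 cell)."""
    rng = random.Random(seed)
    p = 1.0 / mean_burst_len             # geometric parameter, mean length = 1/p
    for _ in range(n_bursts):
        dest = rng.randrange(n_ports)
        length = 1
        while rng.random() > p:          # continue the burst with probability 1 - p
            length += 1
        yield dest, length

for dest, length in bursts(256, 30, 3):
    print(dest, length)
```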

SLIDE 18

Conclusions

System design and implementation are as important as performance considerations

Impact of power, packaging, links, RT

Traffic-agnosticism requirement in the OEM market

CoS support

CICQ architecture is a viable solution

Scalable

Demonstrated sizing

VLSI implementation of a single-stage 4 Tb/s switch
Excellent performance

SLIDE 19

Contacts

IBM Prizma research team

http://www.zurich.ibm.com/cs/powerprs.html

IBM PowerPRS™: Switch fabric products

http://www-3.ibm.com/chips/products/wired/products/switch_fabric.html