SLIDE 1

Data Center Switch Architecture in the Age of Merchant Silicon

Nathan Farrington, Erik Rubow, Amin Vahdat

Hot Interconnects, August 27, 2009

SLIDE 2

The Network is a Bottleneck

  • HTTP request amplification
    – Web search (e.g. Google)
    – Small object retrieval (e.g. Facebook)
    – Web services (e.g. Amazon.com)
  • MapReduce-style parallel computation
    – Inverted search index
    – Data analytics
  • Need high-performance interconnects

SLIDE 3

The Network is Expensive

[Figure: Racks 1 through N, each with 40 1U servers behind a 48xGbE TOR switch, with 8xGbE uplinks into a 10GbE network]

SLIDE 4

What we really need: One Big Switch

  • Commodity
  • Plug-and-play
  • Potentially no oversubscription

[Figure: Racks 1 through N attached to one big switch]

SLIDE 5

Why not just use a fat tree of commodity TOR switches?

  • M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM ’08.

[Figure: k=4, n=3 fat tree]

SLIDE 6

10 Tons of Cable

  • 55,296 Cat-6 cables
  • 1,128 separate cable bundles

The “Yellow Wall”

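A back-of-the-envelope check (my arithmetic, not from the slides), assuming the k=48 fat tree of the Al-Fares et al. paper cited on slide 5: the inter-switch link count reproduces the 55,296 cables, and one bundle per pair of pods (an assumption) reproduces the 1,128 bundles.

```python
# Rough reconstruction of the "Yellow Wall" numbers, assuming the k=48
# fat tree from the SIGCOMM '08 paper cited on slide 5 (my assumption).
k = 48                                  # ports per commodity TOR switch
edge_to_agg = k * (k // 2) * (k // 2)   # k pods x (k/2 edge switches) x (k/2 uplinks)
agg_to_core = (k // 2) ** 2 * k         # (k/2)^2 core switches x k ports each
print(edge_to_agg + agg_to_core)        # 55296 inter-switch Cat-6 cables

# 1,128 bundles is consistent with one bundle per pair of the 48 pods:
print(k * (k - 1) // 2)                 # 1128
```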

SLIDE 7

Merchant Silicon gives us Commodity Switches

Maker     Broadcom   Fulcrum      Fujitsu
Model     BCM56820   FM4224       MB86C69RBC
Ports     24         24           26
Cost      NDA        NDA          $410
Power     NDA        20 W         22 W
Latency   < 1 μs     300 ns       300 ns
Area      NDA        40 x 40 mm   35 x 35 mm
SRAM      NDA        2 MB         2.9 MB
Process   65 nm      130 nm       90 nm

SLIDE 8

Eliminate Redundancy

  • Networks of packet switches contain many redundant components
    – Chassis, power conditioning circuits, cooling
    – CPUs, DRAM
  • Repackage these discrete switches to lower the cost and power consumption

[Figure: block diagram of a discrete 8-port switch: CPU, ASIC, PHY, SFP+ cages, four fans, PSU]

SLIDE 9

Our Architecture, in a Nutshell

  • Fat tree of merchant silicon switch ASICs
  • Hiding cabling complexity with PCB traces and optics
  • Partition into multiple pod switches + a single core switch array
  • Custom EEP ASIC to further reduce cost and power
  • Scales to 65,536 ports when 64-port ASICs become available (late 2009)

SLIDE 10

3 Different Designs

  • 24-ary 3-tree
  • 720 switch ASICs
  • 3,456 ports of 10GbE
  • No oversubscription

[Figure: the same 24-ary 3-tree topology realized as three physical designs, labeled 1, 2, and 3]
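The ASIC and port counts above follow from the standard 3-tier fat-tree construction. A minimal sizing sketch (mine, using the formulas from the Al-Fares et al. paper cited on slide 5, not anything shown in the talk):

```python
# Sizing a 3-tier fat tree built from k-port switch ASICs
# (formulas from the Al-Fares et al. SIGCOMM '08 construction).
def fat_tree_size(k: int):
    """Return (host_ports, total_ASICs) for a 3-tier fat tree of k-port ASICs."""
    hosts = k ** 3 // 4            # k pods x (k/2 edge switches) x (k/2 host ports)
    edge = agg = k * (k // 2)      # k pods x k/2 switches in each of two tiers
    core = (k // 2) ** 2           # core layer
    return hosts, edge + agg + core

print(fat_tree_size(24))   # (3456, 720)   -- this design
print(fat_tree_size(64))   # (65536, 5120) -- the 64-port-ASIC scale-out from slide 9
```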

SLIDE 11

Network 1: No Engineering Required


Cost of Parts        $4.88M
Power                52.7 kW
Cabling Complexity   3,456
Footprint            720 RU
NRE                  $0

  • 720 discrete packet switches, connected with optical fiber

Cabling complexity (noun): the number of long cables in a data center network.

SLIDE 12

Network 2: Custom Boards and Chassis


Cost of Parts        $3.07M
Power                41.0 kW
Cabling Complexity   96
Footprint            192 RU
NRE                  $3M (est.)

  • 24 “pod” switches, one core switch array, 96 cables

This design is shown in more detail later.

SLIDE 13

Switch at 10G, but Transmit at 40G

             SFP       SFP+      QSFP
Rate         1 Gb/s    10 Gb/s   40 Gb/s
Cost/Gb/s    $35*      $25*      $15*
Power/Gb/s   500 mW    150 mW    60 mW

* 2008-2009 prices
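To see why transmitting at 40G pays off, here is a sketch of the economics using the per-Gb/s figures above; the 17,280 Gb/s aggregate is a hypothetical example of mine, not a figure from the talk.

```python
# Cost and power of moving a fixed aggregate bandwidth through different
# transceiver types, using this slide's per-Gb/s figures (2008-2009 prices).
TRANSCEIVERS = {          # (cost per Gb/s in $, power per Gb/s in W)
    "SFP":  (35, 0.500),  # 1 Gb/s
    "SFP+": (25, 0.150),  # 10 Gb/s
    "QSFP": (15, 0.060),  # 40 Gb/s
}

def budget(kind: str, gbps: float):
    cost_per_g, watts_per_g = TRANSCEIVERS[kind]
    return cost_per_g * gbps, watts_per_g * gbps

agg = 17280  # Gb/s of uplink bandwidth; a hypothetical aggregate for illustration
print(budget("SFP+", agg))  # (432000, 2592.0)
print(budget("QSFP", agg))  # (259200, 1036.8) -- 40% cheaper, 60% less power
```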

SLIDE 14

Network 3: Network 2 + Custom ASIC


Cost of Parts        $2.33M
Power                36.4 kW
Cabling Complexity   96
Footprint            114 RU
NRE                  $8M (est.)

  • Uses 40GbE between the pod switches and the core switch array; everything else is the same as Network 2

This simple ASIC provides tremendous cost and power savings.
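Tabulating the three designs (numbers taken from slides 11, 12, and 14) makes the trade-off explicit; the savings percentages are my arithmetic:

```python
# The three designs side by side, with Network 3's savings over Network 1.
NETWORKS = {  # cost of parts ($M), power (kW), long cables, footprint (RU)
    "Network 1": dict(cost=4.88, power=52.7, cables=3456, footprint=720),
    "Network 2": dict(cost=3.07, power=41.0, cables=96,   footprint=192),
    "Network 3": dict(cost=2.33, power=36.4, cables=96,   footprint=114),
}

base, best = NETWORKS["Network 1"], NETWORKS["Network 3"]
for metric in ("cost", "power", "cables", "footprint"):
    print(f"{metric}: {1 - best[metric] / base[metric]:.0%} lower")
# cost: 52% lower, power: 31% lower, cables: 97% lower, footprint: 84% lower
```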

SLIDE 15

Cost of Parts

[Bar chart: cost of parts (millions) — Network 1: $4.88M, Network 2: $3.07M, Network 3: $2.33M]

SLIDE 16

Power Consumption

[Bar chart: power consumption (kW) — Network 1: 52.7, Network 2: 41.0, Network 3: 36.4]

SLIDE 17

Cabling Complexity

[Bar chart: cabling complexity (long cables) — Network 1: 3,456, Network 2: 96, Network 3: 96]

SLIDE 18

Footprint

[Bar chart: footprint (rack units) — Network 1: 720, Network 2: 192, Network 3: 114]

SLIDE 19

Partially Deployed Switch

SLIDE 20

Fully Deployed Switch

SLIDE 21

Pod Switch

SLIDE 22

Logical Topology

SLIDE 23

Pod Switch Line Card

SLIDE 24

Pod Switch Uplink Card

SLIDE 25

Core Switch Array Card

SLIDE 26

Why an Ethernet Extension Protocol?

  • Optical transceivers are 80% of the cost
  • EEP allows the use of fewer, faster optical transceivers

[Figure: two EEP chips connected by a 40GbE link, terminating eight 10GbE ports between them]

SLIDE 27

How does EEP work?

  • Ethernet frames are split up into EEP frames (sketched below)
  • Most EEP frames are 65 bytes
    – Header is 1 byte; payload is 64 bytes
  • Header encodes ingress/egress port

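A minimal sketch of the segmentation step (my Python model, not the authors' hardware); the flag semantics follow the frame format on slide 32:

```python
# Split one Ethernet frame into EEP frames of at most 64 B of payload each.
def segment(ethernet_frame: bytes, port: int):
    """Yield (SOF, EOF, LEN, port, payload) for each EEP frame."""
    chunks = [ethernet_frame[i:i + 64] for i in range(0, len(ethernet_frame), 64)]
    for i, payload in enumerate(chunks):
        yield (i == 0,                  # SOF: start of Ethernet frame
               i == len(chunks) - 1,    # EOF: end of Ethernet frame
               len(payload) < 64,       # LEN: short (final) payload
               port, payload)

# A 1,500 B Ethernet frame becomes 23 full EEP frames plus one 28 B remainder:
frames = list(segment(bytes(1500), port=3))
print(len(frames), len(frames[-1][4]))  # 24 28
```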

SLIDE 28

How does EEP work?

  • Round-robin arbiter multiplexes EEP frames from the port queues (sketched below)
  • EEP frames are transmitted as one large Ethernet frame
  • 40GbE overclocked by 1.6% (65 B on the wire per 64 B of payload: 65/64 ≈ 1.016)

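A toy model of the arbiter (mine; it assumes simple per-port FIFOs, which the slides don't specify): visit each ingress queue in order and emit one EEP frame per visit, producing interleavings like the "1 2 3 1 ..." sequence pictured on the following slides.

```python
from collections import deque

def round_robin(queues: list[deque]):
    """Interleave EEP frames from per-port queues, one frame per port per pass."""
    while any(queues):
        for port, q in enumerate(queues):
            if q:
                yield port, q.popleft()

# Ports 1-3 holding 3, 1, and 2 EEP frames respectively:
qs = [deque(["a1", "a2", "a3"]), deque(["b1"]), deque(["c1", "c2"])]
print([port + 1 for port, _ in round_robin(qs)])  # [1, 2, 3, 1, 3, 1]
```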

SLIDE 29

[Figure: per-port Ethernet frames arriving at the EEP chips]

SLIDE 30

[Figure: the segmented EEP frames interleaved by ingress port on the 40GbE link: 1 2 3 1 1 2 1 3 2]

SLIDE 31

[Figure: the same interleaved EEP frame sequence shown entering and leaving the 40GbE link]

SLIDE 32

EEP Frame Format

  • SOF: start of Ethernet frame
  • EOF: end of Ethernet frame
  • LEN: set if the EEP frame contains less than 64 B of payload
  • Virtual Link ID: corresponds to port number (0-15)
  • Payload Length: 0-63 B

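A pack/unpack sketch of the 1-byte header. The fields are from this slide, but the bit positions are my assumption, as is carrying the 6-bit payload length in a following byte when LEN is set (it cannot fit alongside the flags and the 4-bit Virtual Link ID in one octet).

```python
# EEP header as one byte: SOF | EOF | LEN | (reserved) | 4-bit Virtual Link ID.
# Bit layout is assumed; the slide names the fields but not their positions.
def pack_header(sof: bool, eof: bool, len_flag: bool, vlid: int) -> int:
    assert 0 <= vlid <= 15          # Virtual Link ID = port number (0-15)
    return (sof << 7) | (eof << 6) | (len_flag << 5) | vlid

def unpack_header(b: int):
    return bool(b & 0x80), bool(b & 0x40), bool(b & 0x20), b & 0x0F

hdr = pack_header(sof=True, eof=False, len_flag=False, vlid=3)
print(hex(hdr), unpack_header(hdr))  # 0x83 (True, False, False, 3)
```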

SLIDE 33

Why not use VLANs?

  • VLAN tagging adds latency and requires more SRAM
  • FPGA implementation of both approaches for comparison:
    – VLAN tagging
    – EEP

SLIDE 34

Latency Measurements

SLIDE 35

Related Work

  • M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM ’08.
    – Fat trees of commodity switches, Layer 3 routing, flow scheduling
  • R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In SIGCOMM ’09.
    – Layer 2 routing, plug-and-play configuration, fault tolerance, switch software modifications
  • A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM ’09.
    – Layer 2 routing, end-host modifications

SLIDE 36

Conclusion

  • General architecture
    – Fat tree of merchant silicon switch ASICs
    – Hiding cabling complexity
    – Pods + core
    – Custom EEP ASIC
    – Scales to 65,536 ports with 64-port ASICs
  • Design of a 3,456-port 10GbE switch
  • Design of the EEP ASIC
