Ch 6b: Data-center networking
Holger Karl
Future Internet, Computer Networks Group, Universität Paderborn
SS 19, v 0.9
Outline
- Evolution of data centres
- Topologies
- Networking issues
- Case study: Jupiter rising
Evolution of data centres
- Scale
- Workloads: north-south traffic to east-west traffic
- Data-parallel applications, map-reduce frameworks
- Requires different optimization: bisection bandwidth
- Latency!
- Virtualization
- Many virtual machines – scale
- Moving virtual machines – reassign MAC addresses?
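The bisection bandwidth mentioned above can be made concrete with a tiny brute-force computation: cut the network into two equal halves, sum the capacity of links crossing the cut, and take the minimum over all such cuts. A toy sketch (hypothetical four-ToR topologies, all links assumed 10 Gbit/s):

```python
import itertools

# Bisection bandwidth by brute force: minimum, over all equal-sized
# cuts, of the total capacity crossing the cut.
def bisection_bw(nodes, links, capacity_gbps=10):
    half = len(nodes) // 2
    best = None
    for left in itertools.combinations(nodes, half):
        left = set(left)
        crossing = sum(1 for a, b in links if (a in left) != (b in left))
        bw = crossing * capacity_gbps
        best = bw if best is None else min(best, bw)
    return best

nodes = ["tor1", "tor2", "tor3", "tor4"]
line = [("tor1", "tor2"), ("tor2", "tor3"), ("tor3", "tor4")]
mesh = list(itertools.combinations(nodes, 2))
print(bisection_bw(nodes, line))  # 10: the weakest cut crosses one link
print(bisection_bw(nodes, mesh))  # 40: every equal cut crosses four links
```

East-west-heavy workloads care about exactly this number: the mesh, not the line, determines how much server-to-server traffic the fabric sustains.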
Evolution: Scale
Example: CERN LHC
- 24 gigabytes/s produced
- 30 petabytes produced per year
- > 300 petabytes online disk storage
- > 1 petabyte per day processed
- > 550,000 cores
- > 2 million jobs/day
Sources: https://home.cern/about/computing, www.computerworld.com/article/2960642/cloud-storage/cerns-data-stores-soar-to-530m-gigabytes.html
Evolution: Workloads
- Conventional: Mostly north-south traffic
- From individual machine to gateway
- Typical: Webserver farm
- Modern: East-west traffic
- From server to server
- Typical: data-parallel applications like map/reduce
Programming model – Rough idea
[Figure: Map, Shuffle, Reduce, Collect across Servers #1-#4. Each server runs a Process step over its local data (Map); intermediate results are regrouped so related results end up on the same server (Shuffle); each server processes its group (Reduce); the final results are collected.]
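The map/shuffle/reduce pipeline sketched above can be written as a toy word count (illustrative only, not any particular framework's API):

```python
from collections import defaultdict

# Toy map/shuffle/reduce pipeline for word counting.
def map_phase(chunk):
    # Each "server" emits (key, value) pairs from its local data chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(all_pairs):
    # Related intermediate results are grouped by key; in a real cluster
    # this step moves data server-to-server (east-west traffic).
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer combines the values for its keys into a final result.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["lorem ipsum dolor", "duis dolor vel", "duis aute vel"]
pairs = [p for c in chunks for p in map_phase(c)]
result = reduce_phase(shuffle(pairs))
print(result["dolor"])  # 2
```

Note that the shuffle step is all-to-all communication between servers; this is why map-reduce workloads stress bisection bandwidth rather than the gateway link.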
Evolution: Virtualization
- Virtualize machines!
- Many more MAC addresses to handle
- Easily: hundreds of thousands of VMs
- Scaling problem for switches
- Give hierarchical MAC addresses? Eases routing
- ARP!
- More problematic: Moving a VM from one physical machine to another
- Must not change IP address – one L2 domain!!
- ARP? Caching?
- Keep MAC address? Makes hierarchical MACs infeasible
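One way to reconcile hierarchical addressing with VM mobility, taken by PortLand (mentioned later), is to keep the VM's real MAC but have the fabric rewrite it to a location-encoding pseudo-MAC on which switches can route by prefix. A sketch of such an encoding, using PortLand's pod:position:port:vmid field layout (the concrete values are made up):

```python
# Sketch: encode a host's location into a hierarchical 48-bit pseudo-MAC
# with fields pod (16 bit), position (8), port (8), vmid (16), as in
# PortLand's PMAC scheme.
def encode_pmac(pod, position, port, vmid):
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{(value >> shift) & 0xff:02x}" for shift in range(40, -8, -8))

def decode_pmac(pmac):
    # A switch can recover the location fields, i.e. route on prefixes.
    value = int(pmac.replace(":", ""), 16)
    return ((value >> 32) & 0xffff, (value >> 24) & 0xff,
            (value >> 16) & 0xff, value & 0xffff)

pmac = encode_pmac(pod=5, position=2, port=17, vmid=3)
print(pmac)  # 00:05:02:11:00:03
assert decode_pmac(pmac) == (5, 2, 17, 3)
```

ARP is then answered by a fabric proxy that hands out the pseudo-MAC; when a VM moves, only the mapping changes, not the VM's own address.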
Topologies in data centres
- Basic physical setups
- 19’’ racks
- Often: 42 units high (one unit: 1.75’’)
- Server: 1U – 4U
- Two processors per 1U @ 32 cores each: up to 2688 cores per rack (as of 2019); one core easily deals with 10 VMs
- Blade enclosure: 10 U
- Networking inside a rack: Top of rack (ToR) switch
- 48 ports, 1G or 10G typical
- 2-4 uplinks, often 10G, evolving to 40G, perhaps 100G in the future
- Some (small) number of gateways to outside world
- Core question: how to connect ToRs?
- To support N/S and E/W traffic
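The port counts above imply oversubscription at the ToR: the hosts below can inject more traffic than the uplinks carry. A quick check with the numbers from this slide:

```python
# Oversubscription ratio of a ToR switch:
# total downlink capacity / total uplink capacity.
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    return (down_ports * down_gbps) / (up_ports * up_gbps)

print(oversubscription(48, 1, 4, 10))   # 1.2 -> 1.2:1, mildly oversubscribed
print(oversubscription(48, 10, 4, 40))  # 3.0 -> 3:1 with 10G hosts, 40G uplinks
```

Anything above 1:1 means east-west traffic leaving the rack can be bottlenecked at the uplinks, which is exactly why the choice of inter-ToR topology matters.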
Topologies: requirements
- High throughput / bisection bandwidth
- Fault-tolerant setup: typically, 2-connected
- Means: multiple paths between any two end hosts in operation!
- Not just spanning tree!
- But: Loop freedom
- VM migration support
- One L2 domain!
Topology: Example
- Example: Cisco standard recommendation
Clos Network
Image: https://upload.wikimedia.org/wikipedia/en/9/9a/Closnetwork.png
3-stage Clos
IEEE ANTS 2012 Tutorial
5-stage Clos
- Idea: Build an n×n crossbar switch out of smaller k×k crossbar switches
- Nonblocking
Fat-Tree Topology: Special case of Clos
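The standard k-ary fat tree builds the whole fabric from identical k-port switches: k pods of k switches each, (k/2)² core switches, and k³/4 hosts at full bisection bandwidth. A small sketch of these textbook formulas (k even):

```python
# Sizing a standard k-ary fat tree built from identical k-port switches.
def fat_tree(k):
    assert k % 2 == 0
    hosts = k**3 // 4            # k pods * (k/2) edge switches * (k/2) hosts
    core = (k // 2) ** 2         # core switches
    switches = core + k * k      # core plus k pods of k switches each
    return {"hosts": hosts, "core_switches": core, "total_switches": switches}

print(fat_tree(48))  # 48-port switches: 27648 hosts, 576 core, 2880 switches
```

The appeal: enormous host counts and full bisection bandwidth from cheap commodity switches, at the price of many switches and a lot of cabling.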
IEEE ANTS 2012 Tutorial
Questions to answer
- Which path to use?
- To exploit entire bisection bandwidth, without overload
- Options
- Central point
- Valiant load balancing
- Equal Cost Multi-Pathing (ECMP)
- Choose path by hashing
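Path choice by hashing typically means: hash the flow's 5-tuple and take it modulo the number of equal-cost next hops. All packets of a flow hash identically, so a flow is never reordered across paths, while different flows spread over the fabric. A minimal sketch (spine names and addresses are made up):

```python
import hashlib

# Sketch of ECMP next-hop selection: hash the flow 5-tuple, index into
# the list of equal-cost next hops.
def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

uplinks = ["spine-1", "spine-2", "spine-3", "spine-4"]
hop = ecmp_next_hop("10.0.0.1", "10.0.1.9", "tcp", 49152, 80, uplinks)
# Deterministic: the same flow always maps to the same uplink.
assert hop == ecmp_next_hop("10.0.0.1", "10.0.1.9", "tcp", 49152, 80, uplinks)
```

The known weakness: a few long-lived "elephant" flows can hash onto the same uplink and overload it, which is what centralized schedulers like Hedera (next slide) address.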
Papers to know
- If we had time, we would now talk about:
- Portland
- VL2
- Helios
- Hedera
Networking issues
- How to make sure forwarding works in a huge L2 domain?
- With multi-pathing, so no spanning tree solution plausible
- One approach: IETF TRILL (Transparent Interconnection of Lots of Links)
- Idea: Start from a plain Ethernet with bridges that do spanning tree
- But replace (subset of) bridges with RoutingBridges (Rbridge)
- Operating on L2
- Still looks like a giant Ethernet domain to IP
- Other buzzwords: Bridging protocols (802.1Q), Provider Bridging (802.1ad), Provider Backbone Bridging (802.1ah), Shortest Path Bridging (IEEE 802.1aq), Data Center Bridging (802.1Qaz, 802.1Qbb, 802.1Qau)
TRILL operations
- Rbridges find each other using a link-state protocol
- Do routing on these link states
- Along those computed paths, do tunneling over the Rbridges
- Needs an extra header, first Rbridge encapsulates packet
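The encapsulation step can be pictured as nested headers. A simplified sketch (not the exact TRILL on-wire format; names and field set are illustrative):

```python
# Simplified sketch of TRILL-style encapsulation: the ingress RBridge
# wraps the original Ethernet frame in an extra header naming the
# ingress/egress RBridges plus a hop count.
def encapsulate(frame, ingress_rb, egress_rb, hop_count=16):
    return {"trill": {"ingress": ingress_rb, "egress": egress_rb,
                      "hops": hop_count},
            "inner": frame}

def forward(packet):
    # Each transit RBridge decrements the hop count (loop protection,
    # analogous to an IP TTL) and forwards along the link-state path.
    packet["trill"]["hops"] -= 1
    if packet["trill"]["hops"] == 0:
        raise RuntimeError("hop count exhausted: drop")
    return packet

frame = {"src": "aa:aa:aa:aa:aa:aa", "dst": "bb:bb:bb:bb:bb:bb", "payload": b"hi"}
pkt = encapsulate(frame, ingress_rb="RB1", egress_rb="RB7")
pkt = forward(pkt)
assert pkt["trill"]["hops"] == 15
```

The hop count is what lets TRILL use multipath routes without a spanning tree: even if routing transiently loops, frames die out instead of circulating forever.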
Case study: Jupiter rising
- Google SIGCOMM paper, 2015
- https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.
- Figures, tables taken from that paper
- Describes the evolution of Google's internal data-center networks
- Starting point: ToRs connected to a ring of routers
Figure 9: A 128x10G port Watchtower chassis (top left). The internal non-blocking topology over eight linecards (bottom left). Four chassis housed in two racks cabled with fiber (right).
Jupiter rising
Jupiter rising: Challenges
Table 1: High-level summary of challenges faced and the approach to address them. (Challenge → Approach, section discussed in)

- Introducing the network to production → Initially deploy as bag-on-the-side with a fail-safe big-red button (3.2)
- High availability from cheaper components → Redundancy in fabric, diversity in deployment, robust software, necessary protocols only, reliable out-of-band control plane (3.2, 3.3, 5.1)
- High fiber count for deployment → Cable bundling to optimize and expedite deployment (3.3)
- Individual racks can leverage full uplink capacity to external clusters → Introduce Cluster Border Routers to aggregate external bandwidth shared by all server racks (4.1)
- Incremental deployment → Depopulate switches and optics (3.3)
- Routing scalability → Scalable in-house IGP, centralized topology view and route control (5.2)
- Interoperate with external vendor gear → Use standard BGP between Cluster Border Routers and vendor gear (5.2.5)
- Small on-chip buffers → Congestion window bounding on servers, ECN, dynamic buffer sharing of chip buffers, QoS (6.1)
- Routing with massive multipath → Granular control over ECMP tables with proprietary IGP (5.1)
- Operating at scale → Leverage existing server installation and monitoring software; tools build and operate fabric as a whole; move beyond individual chassis-centric network view; single cluster-wide configuration (5.3)
- Inter-cluster networking → Portable software, modular hardware in other applications in the network hierarchy (4.2)
Jupiter rising: Generations
Table 2: Multiple generations of datacenter networks. (B) indicates blocking, (NB) indicates nonblocking.

| Generation    | First Deployed | Merchant Silicon   | ToR Config           | Aggregation Block | Spine Block  | Fabric Speed | Host Speed  | Bisection BW |
| Four-Post CRs | 2004           | vendor             | 48x1G                | -                 | -            | 10G          | 1G          | 2T           |
| Firehose 1.0  | 2005           | 8x10G, 4x10G (ToR) | 2x10G up, 24x1G down | 2x32x10G (B)      | 32x10G (NB)  | 10G          | 1G          | 10T          |
| Firehose 1.1  | 2006           | 8x10G              | 4x10G up, 48x1G down | 64x10G (B)        | 32x10G (NB)  | 10G          | 1G          | 10T          |
| Watchtower    | 2008           | 16x10G             | 4x10G up, 48x1G down | 4x128x10G (NB)    | 128x10G (NB) | 10G          | nx1G        | 82T          |
| Saturn        | 2009           | 24x10G             | 24x10G               | 4x288x10G (NB)    | 288x10G (NB) | 10G          | nx10G       | 207T         |
| Jupiter       | 2012           | 16x40G             | 16x40G               | 8x128x40G (B)     | 128x40G (NB) | 10/40G       | nx10G/nx40G | 1.3P         |
Jupiter rising: Firehose 1.0
Figure 5: Firehose 1.0 topology. Top right shows a sample 8x10G port fabric board in Firehose 1.0, which formed Stages 2, 3 or 4 of the topology.
Jupiter rising: Saturn
Figure 12: Components of a Saturn fabric. A 24x10G Pluto ToR switch and a 12-linecard 288x10G Saturn chassis (including logical topology) built from the same switch chip. Four Saturn chassis housed in two racks cabled with fiber (right).
Jupiter rising: Jupiter
Figure 13: Building blocks used in the Jupiter topology.
Figure 14: Jupiter Middle blocks housed in racks.
Figure 15: Four options to connect to the external network layer.
Figure 16: Two-stage fabrics used for inter-cluster and intra-campus connectivity.