CompSci 514: Computer Networks, Lecture 17: Datacenter Network Architectures (PowerPoint presentation)




SLIDE 1

CompSci 514: Computer Networks Lecture 17: Datacenter Network Architectures

Xiaowei Yang

SLIDE 2

Overview

  • Motivation
  • Challenges
  • The FatTree architecture
SLIDE 3
SLIDE 4

Two design choices

  • Specialized hardware and communication protocols
    – Examples: InfiniBand, Myrinet
    – Cons: expensive, may not support TCP/IP
  • Commodity Ethernet switches and routers
    – Cons: aggregate cluster bandwidth scales poorly with cluster size
    – High bandwidth incurs non-linear cost

SLIDE 5
SLIDE 6
SLIDE 7

FatTree Design Goals

  • Scalable interconnection bandwidth: it should be possible for an arbitrary host in the data center to communicate with any other host in the network at the full bandwidth of its local network interface.
  • Economies of scale: just as commodity personal computers became the basis for large-scale computing environments, we hope to leverage the same economies of scale to make cheap off-the-shelf Ethernet switches the basis for large-scale data center networks.
  • Backward compatibility: the entire system should be backward compatible with hosts running Ethernet and IP. That is, existing data centers, which almost universally leverage commodity Ethernet and run IP, should be able to take advantage of the new interconnect architecture with no modifications.

SLIDE 8

Components

  • GigE switches

  Year   Hierarchical design (10 GigE)       Fat-tree (GigE)
         Switch     Hosts    Cost/GigE       Switch    Hosts     Cost/GigE
  2002   28-port    4,480    $25.3K          28-port   5,488     $4.5K
  2004   32-port    7,680    $4.4K           48-port   27,648    $1.6K
  2006   64-port    10,240   $2.1K           48-port   27,648    $1.2K
  2008   128-port   20,480   $1.8K           48-port   27,648    $0.3K

Table 1: The maximum possible cluster size with an oversubscription ratio of 1:1 for different years.
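As a quick sanity check on the Fat-tree column: a fat tree built from k-port switches has k pods, each containing (k/2)² hosts, for k³/4 hosts in total. A minimal sketch (the helper name is ours):

```python
# Host count of a fat tree built from k-port switches:
# k pods x (k/2) edge switches per pod x (k/2) hosts each = k^3 / 4.
def fat_tree_hosts(k):
    return k ** 3 // 4

print(fat_tree_hosts(28))  # 5488  -> matches the 2002 Fat-tree row
print(fat_tree_hosts(48))  # 27648 -> matches the 2004-2008 rows
```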

SLIDE 9
SLIDE 10
SLIDE 11
SLIDE 12
SLIDE 13
SLIDE 14
SLIDE 15
SLIDE 16

Addressing

  • Switches are given addresses 10.pod.switch.1
    – Pod in [0, k-1]
    – Switch in [0, k-1]
  • Core switches: 10.k.j.i
    – j, i are coordinates in the core switch grid, each in [1, k/2]
  • Hosts: 10.pod.switch.ID
    – ID in [2, k/2 + 1]
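The addressing scheme above can be sketched directly; the function names here are our own, not from the paper:

```python
# Sketch of the fat-tree addressing scheme for a k-ary fat tree
# (k even); all addresses live in the private 10.0.0.0/8 block.

def pod_switch_addr(pod, switch):
    # Pod (edge/aggregation) switches: 10.pod.switch.1,
    # with pod and switch each in [0, k-1].
    return f"10.{pod}.{switch}.1"

def core_switch_addr(k, j, i):
    # Core switches: 10.k.j.i, where (j, i) are the switch's
    # coordinates in the (k/2) x (k/2) core grid, each in [1, k/2].
    return f"10.{k}.{j}.{i}"

def host_addr(pod, switch, host_id):
    # Hosts: 10.pod.switch.ID, with ID in [2, k/2 + 1]
    # (.1 is taken by the edge switch itself).
    return f"10.{pod}.{switch}.{host_id}"

k = 4
print(pod_switch_addr(0, 1))      # 10.0.1.1
print(core_switch_addr(k, 1, 2))  # 10.4.1.2
print(host_addr(2, 3, 2))         # 10.2.3.2
```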

SLIDE 17
SLIDE 18
SLIDE 19
SLIDE 20
SLIDE 21
SLIDE 22
SLIDE 23

Flow Classification

  • Recognize subsequent packets of the same flow, and forward them on the same outgoing port.
    – Avoids reordering
  • Periodically reassign a minimal number of flow output ports to minimize any disparity between the aggregate flow capacity of different ports.
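A minimal sketch of this behavior (class and member names are ours): hash a flow's endpoints to an output port on first sight, cache the choice so later packets of the flow stay on the same port, and periodically move a flow off the most loaded port.

```python
# Sketch of per-flow (not per-packet) load balancing: packets of
# the same flow always leave on the same port, avoiding reordering.
import zlib

class FlowClassifier:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.flow_table = {}              # (src, dst) -> output port
        self.port_load = [0] * num_ports  # bytes sent per port

    def classify(self, src, dst, nbytes):
        flow = (src, dst)
        if flow not in self.flow_table:
            # First packet of the flow: hash it to a port.
            h = zlib.crc32(f"{src}->{dst}".encode())
            self.flow_table[flow] = h % self.num_ports
        port = self.flow_table[flow]
        self.port_load[port] += nbytes
        return port

    def rebalance(self):
        # Periodic pass: reassign one flow from the most loaded port
        # to the least loaded one, shrinking the disparity.
        hi = max(range(self.num_ports), key=lambda p: self.port_load[p])
        lo = min(range(self.num_ports), key=lambda p: self.port_load[p])
        for flow, port in self.flow_table.items():
            if port == hi:
                self.flow_table[flow] = lo
                break
```

Note the trade-off this sketch makes explicit: hashing alone can pile several large flows onto one port, which is why the periodic rebalancing pass exists.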

SLIDE 24

Flow Scheduling

  • Edge switches locally assign a new flow to the least-loaded port initially.
  • Edge switches additionally detect elephant flows and periodically send notifications to a central scheduler.
  • A central scheduler, possibly replicated, tracks all active large flows and tries to assign them non-conflicting paths if possible.
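A hedged sketch of the central scheduler's core idea (greedy assignment; all names are ours): pin each reported elephant flow to a core switch no other tracked flow is using, since distinct cores give pod-to-pod paths that do not share core links.

```python
# Sketch of centralized flow scheduling: assign each elephant flow
# its own core switch when one is free, so the tracked flows take
# non-conflicting paths through the core.
class CentralScheduler:
    def __init__(self, num_cores):
        self.cores = list(range(num_cores))
        self.assignment = {}   # flow id -> core switch index

    def schedule(self, flow_id):
        in_use = set(self.assignment.values())
        # Greedily pick a core no tracked elephant flow is using.
        for core in self.cores:
            if core not in in_use:
                self.assignment[flow_id] = core
                return core
        # All cores busy: fall back to the core carrying the
        # fewest tracked flows.
        counts = {c: 0 for c in self.cores}
        for c in self.assignment.values():
            counts[c] += 1
        best = min(self.cores, key=lambda c: counts[c])
        self.assignment[flow_id] = best
        return best

    def finish(self, flow_id):
        # A flow ended; free its core for future assignments.
        self.assignment.pop(flow_id, None)
```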

SLIDE 25
SLIDE 26
SLIDE 27
  • Figure 8: Proposed packaging solution. The only external cables are between the pods and the core nodes.

SLIDE 28
SLIDE 29
  • [Figure: total power (kW) and total heat dissipation (kBTU/hr), hierarchical design vs. fat-tree.]

SLIDE 30
SLIDE 31

Comments

  • Each pod switch connects to only half of the core switches
    – May be hard to wire
  • A pod is not loop free
    – A pod is usually the boundary between L2 and L3. Within a pod, run L2 and use the aggregation switch as the default gateway for that pod. Beyond a pod, run L3.
    – Must run spanning tree inside a pod