CompSci 514: Computer Networks L18: Datacenter Network Architectures II (PowerPoint PPT Presentation)


SLIDE 1

CompSci 514: Computer Networks L18: Datacenter Network Architectures II

Xiaowei Yang

1

SLIDE 2

Outline

  • Design and evaluation of VL2
  • Discussion

– FatTree vs VL2

  • What common challenges did each address?
  • What methods did each use to address those challenges?

2

SLIDE 3

Virtual Layer 2: A Scalable and Flexible Data-Center Network

Work with Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta

Microsoft Research

Changhoon Kim

SLIDE 4

Tenets of Cloud-Service Data Center

  • Agility: Assign any servers to any services

– Boosts cloud utilization

  • Scaling out: Use large pools of commodities

– Achieves reliability, performance, low cost

4

Statistical Multiplexing Gain

Economies of Scale
SLIDE 5

What is VL2?

  • Why is agility important?

– Today’s DC network inhibits the deployment of other technical advances toward agility

  • With VL2, cloud DCs can enjoy agility in full

5

The first DC network that enables agility in a scaled-out fashion

SLIDE 6

Status Quo: Conventional DC Network

Reference – “Data Center: Load balancing Data Center Services”, Cisco 2004

[Diagram: conventional DC topology. The Internet connects to Core Routers (CR) and Access Routers (AR) in the DC Layer-3 domain; below them, Ethernet switches (S) and racks of application servers (A) form the DC Layer-2 domains.]

Key

  • CR = Core Router (L3)
  • AR = Access Router (L3)
  • S = Ethernet Switch (L2)
  • A = Rack of app. servers

~ 1,000 servers/pod == IP subnet

6

SLIDE 7

Conventional DC Network Problems

7

[Diagram: the same conventional topology, annotated with oversubscription ratios of roughly 5:1, 40:1, and 200:1 at successively higher layers.]

  • Dependence on high-cost proprietary routers
  • Extremely limited server-to-server capacity
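The ratios above are just total server-facing bandwidth divided by uplink bandwidth at each layer. The sketch below illustrates that arithmetic with hypothetical port counts and speeds of my own choosing; they are not the measured values behind the ~5:1 / ~40:1 / ~200:1 annotations.

```python
# Illustration only: hypothetical port counts and speeds.
def oversubscription(server_facing_gbps, uplink_gbps):
    """Ratio of bandwidth entering a layer from below to bandwidth leaving it upward."""
    return server_facing_gbps / uplink_gbps

print(oversubscription(20 * 1.0, 4.0))    # a ToR: 20 x 1G servers over 4G of uplinks -> 5.0
print(oversubscription(800.0, 20.0))      # an aggregation-layer example -> 40.0
```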
SLIDE 8

And More Problems …

8

[Diagram: the conventional topology partitioned into IP subnets (VLAN #1 and VLAN #2), with ~200:1 oversubscription toward the core.]

  • Resource fragmentation, significantly lowering cloud utilization (and cost-efficiency)

SLIDE 9

And More Problems …

9

[Diagram: the conventional topology partitioned into IP subnets (VLAN #1 and VLAN #2), with ~200:1 oversubscription toward the core.]

  • Resource fragmentation, significantly lowering cloud utilization (and cost-efficiency)
  • Complicated manual L2/L3 re-configuration

SLIDE 10

And More Problems …

10

[Diagram: the conventional topology with servers fragmented across subnets; one service is starved of capacity (revenue lost) while another holds idle servers it cannot lend out (expense wasted).]

  • Resource fragmentation, significantly lowering cloud utilization (and cost-efficiency)

SLIDE 11

Designing VL2

  • Measure to learn the characteristics of datacenter networks
  • Design routing schemes that work well with the observed traffic patterns

  • Q: limitations of this design approach?

11

SLIDE 12

Measuring Traffic

  • Instrumented a large cluster used for data mining and identified distinctive traffic patterns

– A highly utilized 1,500-node cluster in a data center that supports data mining on petabytes of data
– The servers are distributed roughly evenly across 75 ToR switches
– Collected socket-level event logs from all machines over two months

12

SLIDE 13

Traffic analysis

  • 1. The ratio of traffic volume between servers in our data centers to traffic entering/leaving our data centers is currently around 4:1 (excluding CDN applications).

  • 2. Datacenter computation is focused where high-speed access to data on memory or disk is fast and cheap. Although data is distributed across multiple data centers, intense computation and communication on data does not straddle data centers due to the cost of long-haul links.

  • 3. The demand for bandwidth between servers inside a data center is growing faster than the demand for bandwidth to external hosts.

  • 4. The network is a bottleneck to computation. We frequently see ToR switches whose uplinks are above 80% utilization.

13

SLIDE 14

Flow Distribution Analysis

14

[Plot: PDF and CDF of flow size (bytes), both by flow count and weighted by total bytes.]

Figure: Mice are numerous; 99% of flows are smaller than 100 MB. However, more than 90% of bytes are in flows between 100 MB and 1 GB.

SLIDE 15

Number of Concurrent Flows

15

[Plot: PDF and CDF of the number of concurrent flows in/out of each machine, by fraction of time.]

Figure: The number of concurrent connections has two modes: (1) 10 flows per node more than 50% of the time and (2) 80 flows per node for at least 5% of the time.

SLIDE 16

Implications

  • The distributions of flow size and number of concurrent flows both imply that VLB will perform well on this traffic. Since even big flows are only 100 MB (about 1 s of transmit time at 1 Gbps), randomizing at flow granularity (rather than packet granularity) will not cause perpetual congestion if there is unlucky placement of a few flows.
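The parenthetical "1 s of transmit time at 1 Gbps" is simply the flow size divided by the line rate:

```latex
t = \frac{100\,\mathrm{MB} \times 8\ \mathrm{bits/byte}}{1\ \mathrm{Gb/s}}
  = \frac{8 \times 10^{8}\ \mathrm{bits}}{10^{9}\ \mathrm{bits/s}}
  = 0.8\ \mathrm{s} \approx 1\ \mathrm{s}
```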

  • Moreover, adaptive routing schemes may be difficult to implement in the data center, since any reactive traffic engineering will need to run at least once a second if it wants to react to individual flows.

16

SLIDE 17

Traffic Matrix Analysis

  • Q: Is there regularity in the traffic that might be exploited through careful measurement and traffic engineering?
  • Method

– Compute the ToR-to-ToR TM: the entry TM(t)[i,j] is the number of bytes sent from servers in ToR i to servers in ToR j during the 100 s beginning at time t. We compute one TM for every 100 s interval, and servers outside the cluster are treated as belonging to a single "ToR".
– Cluster similar TMs and choose one representative TM per cluster
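A minimal sketch of this TM construction, assuming hypothetical inputs (flow_records as (timestamp, src, dst, bytes) tuples from the socket-level logs, and tor_of mapping each in-cluster server to its ToR); this is not the paper's actual tooling.

```python
from collections import defaultdict

INTERVAL = 100          # seconds per TM
OUTSIDE = "external"    # servers outside the cluster collapse into one pseudo-ToR

def build_tms(flow_records, tor_of):
    """Return {interval_start: {(ToR_i, ToR_j): bytes}} aggregated over 100 s buckets."""
    tms = defaultdict(lambda: defaultdict(float))
    for ts, src, dst, nbytes in flow_records:
        t = int(ts // INTERVAL) * INTERVAL        # start of the 100 s interval
        i = tor_of.get(src, OUTSIDE)
        j = tor_of.get(dst, OUTSIDE)
        tms[t][(i, j)] += nbytes
    return tms
```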

17

SLIDE 18

Results

  • No representative TMs emerge
  • On a time series of 864 TMs, approximating with 50-60 clusters still leaves a high fitting error (60%), and the error decreases only moderately beyond that point
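A sketch of this representativeness test, assuming scikit-learn is available and that each TM has been flattened into a vector; the 60% figure is the paper's measured result, not something this snippet reproduces on its own.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: scikit-learn is installed

def fitting_error(tms, k):
    """Relative error when each TM is replaced by its cluster's representative."""
    X = np.array([tm.flatten() for tm in tms])        # e.g., 864 x (n_tor * n_tor)
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    reps = km.cluster_centers_[km.labels_]            # representative TM for each sample
    return np.linalg.norm(X - reps) / np.linalg.norm(X)

# tms = [...]                      # e.g., matrices built from build_tms(); hypothetical here
# print(fitting_error(tms, 60))    # stays high (~0.6) on the paper's data
```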

18

SLIDE 19

Instability of Traffic Patterns

19

[Plots: (a) index of the containing cluster over time (100 s intervals); (b) run-length frequency; (c) frequency of log(time to repeat).]

Figure : Lack of short-term predictability: Tie cluster to which a trafc matrix belongs, i.e., the type of trafc mix in the TM, changes quickly and randomly.

SLIDE 20

Failure Characteristics

  • Most failures are small in size

– 50% of network device failures involve fewer than 4 devices
– 95% of network device failures involve fewer than 20 devices, while large correlated failures are rare (e.g., the largest correlated failure involved 217 switches)
– Downtimes can be significant: 95% of failures are resolved in 10 min, 98% in < 1 hr, 99.6% in < 1 day, but 0.09% last > 10 days

20

SLIDE 21

Questions to ponder

  • What design choices might change if different traffic patterns were observed?

21

SLIDE 22

22

Know Your Cloud DC: Challenges

  • Instrumented a large cluster used for data mining and identified distinctive traffic patterns

  • Traffic patterns are highly volatile

– A large number of distinctive patterns even in a day

  • Traffic patterns are unpredictable

– Correlation between patterns very weak

Optimization should be done frequently and rapidly

SLIDE 23

Know Your Cloud DC: Opportunities

  • DC controller knows everything about hosts
  • Host OS’s are easily customizable
  • Probabilistic flow distribution would work well enough, because …

– Flows are numerous and not huge (no elephants!)
– Commodity switch-to-switch links are substantially thicker (~10x) than the maximum thickness of a flow

23

DC network can be made simple

SLIDE 24

24

All We Need is Just a Huge L2 Switch, or an Abstraction of One

[Diagram: the racks of servers (A) from the conventional topology, all attached to a single giant virtual Layer-2 switch.]

SLIDE 25

25

All We Need is Just a Huge L2 Switch, or an Abstraction of One

  • 1. L2 semantics
  • 2. Uniform high capacity
  • 3. Performance isolation

[Diagram: servers (A) attached to one giant virtual Layer-2 switch, annotated with the three properties above.]

SLIDE 26

Specific Objectives and Solutions

26

Objective → Approach → Solution

  • 1. Layer-2 semantics

– Approach: Employ flat addressing
– Solution: Name-location separation & resolution service

  • 2. Uniform high capacity between servers

– Approach: Guarantee bandwidth for hose-model traffic
– Solution: Flow-based random traffic indirection (Valiant LB)

  • 3. Performance isolation

– Approach: Enforce hose model using existing mechanisms only
– Solution: TCP

SLIDE 27

Hose model

27

Figure 1: A VPN based on the customer-pipe model. A mesh of customer-pipes is needed, each extending from one customer endpoint to another. A customer endpoint must maintain a logical interface for each of its customer-pipes.

Figure 2: A VPN based on the hose model. A customer endpoint maintains just one logical interface, a hose, to the provider access router. In the figure, we show the implementation of one hose (based at A) using provider-pipes.

Service level agreements following the characterization phase might be based on the current traffic load, with provisions made for expected gradual growth as well as expected drastic traffic changes that the customer might foresee (or protect against). Both the customer and the provider may play a role in testing whether the SLAs are met. The provider may police (and possibly shape) the incoming traffic to a hose from the customer's access link to ensure that it stays within the specified profile. Similarly, traffic leaving the VPN at a hose egress (i.e., traffic potentially generated from multiple sources that has traversed the network) may have to be monitored and measured at the hose egress point, to ensure that such traffic stays within the specified profile and that the provider has met the SLA. The customer might also be required to specify a policy for actions to be taken should egress traffic exceed the specified egress hose capacity.

2.1 Capacity Management

From a provider's perspective, it is potentially more challenging to support the hose model, due to the need to meet the SLAs with a very weak specification of the traffic matrix. To manage resources so as to deal with this increased uncertainty, we consider two basic mechanisms:

Statistical multiplexing: As a single QoS assurance applies to a hose, the provider can consider multiplexing all the traffic of a given hose together. Similarly, the set of hoses making up the VPN have a common QoS assurance, and the provider can consider multiplexing all the traffic of a given VPN together. These techniques can be applied on both access links and network internal links.

Resizing: In order to provide tight QoS assurances, the provider may use (aggregate) network resource reservation mechanisms that allocate capacity on a set of links for a given hose or VPN. A provider can take the approach of allocating this capacity statically, taking into account worst-case demands. Alternatively, a provider can make an initial allocation and then resize that allocation based on online measurements. Again, such techniques can be applied on both access and network internal links. Resizing is allowed only within the envelope defined by the SLA, and would occur at a substantially finer time scale than the time scale over which SLAs might be renegotiated. These two resource management mechanisms can be used separately or in combination.

Some more remarks are in order on resizing. Provisioning decisions normally have an impact over fairly long timescales. Within the context of a VPN framework, measurements of actual usage can be used on much shorter timescales to enable efficient capacity management. Underlying this is an assumption that, within the network, boundaries will exist between resources that might be used by different classes of traffic, to ensure that performance guarantees are met. For example, traffic from different VPNs might be isolated from each other, and from other classes of traffic. In the context of this paper, resources available for VPN traffic cannot be used by other traffic requiring performance guarantees. We assume that this perspective holds whether the boundaries reflect reservation of resources, as in the case of Intserv, or some allocation in a bandwidth broker in a Diffserv environment. If the measurements of actual usage can be used to resize the boundary for a given VPN's traffic, more bandwidth is made available to other traffic and better use is made of available capacity. In reality, measurements of current usage would be used to make a prediction about near-term future usage, and this prediction would be used to resize the share of resources allocated.

In the hose model, this approach can be realized by allowing customers to resize the respective hose capacities of a VPN. Presumably there will be some cost incentive for customers to resize their hose capacities. While this mechanism is envisaged mainly as a way to track actual usage, exposing this interface to the customer would also enable the customer to resize its hose capacities based on local policy decisions. How frequently hoses may be resized will depend on the implementation and the overheads for resizing and measurement. More important, however, is whether frequent resizing is beneficial and whether it is possible to make predictions with sufficient accuracy. Finally, short-timescale resizing is not a replacement for provisioning and admission control, and the appropriate relationship between these resource management approaches is important.
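For VL2, the useful content of the hose model is that a service specifies only per-endpoint aggregate ingress and egress rates rather than a full endpoint-to-endpoint matrix. A minimal sketch of that constraint check, with made-up endpoints and capacities:

```python
# Hose model: each endpoint promises only aggregate egress/ingress rates.
def fits_hose(tm, egress_cap, ingress_cap):
    """True if traffic matrix `tm` (dict of (src, dst) -> rate) respects every hose."""
    endpoints = egress_cap.keys()
    for e in endpoints:
        sent = sum(tm.get((e, d), 0.0) for d in endpoints)
        recv = sum(tm.get((s, e), 0.0) for s in endpoints)
        if sent > egress_cap[e] or recv > ingress_cap[e]:
            return False
    return True

caps = {"A": 10.0, "B": 10.0, "C": 10.0}                  # Gbps per endpoint (hypothetical)
tm = {("A", "B"): 4.0, ("A", "C"): 5.0, ("B", "C"): 4.0}
print(fits_hose(tm, caps, caps))                           # True: every endpoint within its hose
```

The point for VL2 is that the network is sized so that any traffic matrix passing a check like this can be carried, which is what frees services from worrying about where their servers are placed.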

SLIDE 28

28

Addressing and Routing: Name-Location Separation

[Diagram: servers x, y, z attached to ToR1–ToR4; packets carry the destination ToR in an outer header (e.g., "payload | ToR3 | y", "payload | ToR4 | z"). A Directory Service maps names to locations (x → ToR2, y → ToR3, z → ToR4) and answers lookups; when z moves to ToR3, only its entry changes (z → ToR3).]

  • Servers use flat names
  • Switches run link-state routing and maintain only switch-level topology
  • Cope with host churn with very little overhead

SLIDE 29

29

Addressing and Routing: Name-Location Separation

[Diagram: the same name-location separation example as the previous slide.]

  • Servers use flat names
  • Switches run link-state routing and maintain only switch-level topology
  • Cope with host churn with very little overhead

  • Allows the use of low-cost switches
  • Protects network and hosts from host-state churn
  • Obviates host and switch reconfiguration
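A toy sketch of this name-location separation (simplified semantics of my own; real VL2 uses a kernel-level agent on each host and a replicated directory system): applications address servers by a flat application address (AA), and the sender's shim resolves the AA to the destination's current ToR locator (LA) and encapsulates.

```python
# Directory maps flat application addresses (AA) to ToR locators (LA), as on the slide.
directory = {"x": "ToR2", "y": "ToR3", "z": "ToR4"}

def send(dst_aa, payload):
    dst_tor = directory[dst_aa]                # lookup (cached by the host agent in practice)
    return {"outer_dst": dst_tor,              # routed on the switch-level topology
            "inner_dst": dst_aa,               # delivered unchanged to the server
            "payload": payload}

# When z migrates to ToR3, only the directory entry changes; server addresses and
# switch routing state stay the same.
directory["z"] = "ToR3"
print(send("z", "hello"))   # {'outer_dst': 'ToR3', 'inner_dst': 'z', 'payload': 'hello'}
```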
SLIDE 30

Example Topology: Clos Network

30

[Diagram: a Clos network. Each ToR connects 20 servers and uplinks into aggregation (Aggr) switches, which in turn connect to intermediate (Int) switches. With K aggregation switches of D ports each, the network supports 20*(DK/4) servers.]

Offers huge aggregate capacity and multiple paths at modest cost

SLIDE 31

Example Topology: Clos Network

31

[Diagram: the same Clos network as the previous slide, offering huge aggregate capacity and multiple paths at modest cost.]

D (# of 10G ports)   Max DC size (# of servers)
48                   11,520
96                   46,080
144                  103,680
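The table follows from the 20*(DK/4) expression, with the additional assumption (mine, to match the numbers) that K = D, i.e., aggregation and intermediate switches all have D ports:

```python
# Servers supported by the Clos topology on the slide: 20 per ToR, K aggregation
# switches with D ports each; the table assumes K = D.
def max_servers(d_ports, k_aggr=None):
    k = d_ports if k_aggr is None else k_aggr
    return 20 * (d_ports * k) // 4

for d in (48, 96, 144):
    print(d, max_servers(d))   # 48 -> 11520, 96 -> 46080, 144 -> 103680
```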

SLIDE 32

32

Traffic Forwarding: Random Indirection

[Diagram: ToRs T1–T6 under intermediate switches that all share the anycast address IANY. A packet from x to y is encapsulated as "payload | T3 | y" (and from x to z as "payload | T5 | z") and bounced off a randomly chosen intermediate switch; up-links carry it to the intermediate, down-links carry it to the destination ToR.]

Cope with arbitrary TMs with very little overhead

SLIDE 33

33

Traffic Forwarding: Random Indirection

[Diagram: the same random-indirection example as the previous slide.]

Cope with arbitrary TMs with very little overhead

[ ECMP + IP Anycast ]

  • Harness huge bisection bandwidth
  • Obviate esoteric traffic engineering or optimization
  • Ensure robustness to failures
  • Work with switch mechanisms available today
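A rough sketch of the forwarding decision, reusing names from the slide (T1–T6, IANY) and hypothetical otherwise; real VL2 realizes the random choice with ECMP hashing toward a single anycast locator shared by all intermediate switches, not host-side selection.

```python
INTERMEDIATES = ["I1", "I2", "I3"]      # in VL2 these all advertise the anycast address IANY
directory = {"y": "T3", "z": "T5"}      # flat name -> destination ToR, as on the slide

def path_for(flow_id, dst_name):
    # Hashing the flow id keeps every packet of a flow on one path (no reordering),
    # while different flows spread roughly uniformly across the intermediates.
    bounce = INTERMEDIATES[hash(flow_id) % len(INTERMEDIATES)]
    return [bounce, directory[dst_name], dst_name]   # up to a random intermediate, then down

print(path_for(("x", "y", 5534, 80, "tcp"), "y"))    # e.g. ['I2', 'T3', 'y']
```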
SLIDE 34

Does VL2 Ensure Uniform High Capacity?

  • How “high” and “uniform” can it get?

– Performed all-to-all data shuffle tests, then measured aggregate and per-flow goodput

  • The cost for flow-based random spreading

34

[Plot: fairness of the Aggr-to-Int links' utilization over time; Jain's fairness index§ stays close to 1.0 for the whole run.]

Goodput efficiency: 94%. Fairness between flows: 0.995.

§ Jain's fairness index, defined as (∑ xi)² / (n · ∑ xi²)
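For reference, the fairness metric in the footnote is easy to compute directly; a quick sketch:

```python
# Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2); 1.0 means perfectly even.
def jain_fairness(xs):
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))

print(jain_fairness([1.0, 1.0, 1.0, 1.0]))   # 1.0: all links equally utilized
print(jain_fairness([1.0, 0.0, 0.0, 0.0]))   # 0.25: one link carries everything
```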

SLIDE 35

35

[Plot: aggregate goodput (Gbps) and number of active flows over time during the shuffle.]

Figure: Aggregate goodput during a 2.7 TB shuffle among 75 servers.

SLIDE 36

Good performance isolation

36

[Plot: aggregate goodput (Gbps) of Service 1 and Service 2 over time.]

Figure: Aggregate goodput of two services with servers intermingled on the ToRs. Service one's goodput is unaffected as service two ramps traffic up and down.

SLIDE 37

37

[Plot: aggregate goodput (Gbps) of service one and the number of mice started by service two, over time.]

Figure: Aggregate goodput of service one as service two creates bursts containing successively more short TCP connections (mice).

SLIDE 38

38

  • Figure : Aggregate goodput as all links to switches Interme-

diate and Intermediate are unplugged in succession and then reconnected in succession. Approximate times of link manipu- lation marked with vertical lines. Network re-converges in < 1s afer each failure and demonstrates graceful degradation.

SLIDE 39

VL2 Conclusion

  • VL2 achieves agility at scale via
  • 1. L2 semantics
  • 2. Uniform high capacity between servers
  • 3. Performance isolation between services

39

Lessons

  • Randomization can tame volatility
  • Add functionality where you have control
  • There’s no need to wait!