

SLIDE 1

ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers

Christina Delimitrou1, Sriram Sankar2, Aman Kansal3, Christos Kozyrakis1

1Stanford University 2Microsoft 3Microsoft Research

IISWC – November 5th 2012

SLIDE 2

Motivation

Network performance and efficiency → critical for DC operation

- Scalable topologies
  - Dragonfly, Fat tree, Clos, etc.
  - Hotspot detection & elimination
- Flow control
  - Load balancing
  - Speculative flow control: Hedera, etc.
- Network switch design
  - Low-latency RPCs: RAMCloud, etc.
- Software-defined DC networks
  - OpenFlow, Nicira, etc.

SLIDE 3

Challenge

Where can we find representative traffic patterns?

SLIDE 4

Executive Summary

- Network workload model: a scheme that accurately and concisely captures the traffic of a DC workload
  - User patterns only emerge at large scale → scalability
  - Different level of detail per application → modularity/configurability
- Prior work on network modeling → mostly single-node, temporal behavior
  - No spatial patterns, scalability, or modularity
- ECHO addresses the limitations of previous schemes:
  - System-wide network modeling: not confined to a single node
  - Locality-aware: accounts for spatial network traffic patterns
  - Hierarchical: adjusts the level of granularity to the needs of each app/study
  - Scalable: scales to DCs with ~30,000 servers
  - Lightweight: low, upper-bounded modeling overheads
  - Validated: against real traces from applications in production DCs

SLIDE 5

Outline

- Simple Temporal Model
- DC Network Traffic Characterization
- ECHO Design
- Model Validation

SLIDE 6

Distribution Fitting Model

- Most well-known modeling approach for networks
- Single-node, as opposed to system-wide!
- Capture temporal patterns in per-server network traffic
- Identify known distributions (e.g., Gaussian, Poisson, Zipf) in network activity traces
- Represent server network activity as a superposition of the identified distributions
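As a minimal sketch of this idea (hypothetical Python, not the authors' tooling; the toy trace and the component mix are made up for illustration), fitting a Gaussian to a per-server bandwidth trace and replaying it synthetically might look like:

```python
import random
import statistics

def fit_gaussian(trace):
    # Maximum-likelihood Gaussian fit: sample mean and standard deviation.
    return statistics.fmean(trace), statistics.pstdev(trace)

def synthesize(mu, sigma, n, seed=0):
    # Draw a synthetic trace from the fitted distribution (bandwidth >= 0).
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mu, sigma)) for _ in range(n)]

# Toy per-server trace: a Gaussian baseline plus an exponential burst
# component, echoing the "superposition of distributions" idea above.
rng = random.Random(42)
trace = [rng.gauss(100.0, 10.0) + rng.expovariate(1 / 5.0) for _ in range(10_000)]

mu, sigma = fit_gaussian(trace)
synthetic = synthesize(mu, sigma, len(trace))
```

A full model would fit each candidate distribution (Poisson, Zipf, etc.) and keep the best-scoring superposition rather than a single Gaussian.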

SLIDE 7

Distribution Fitting Model

- Capture temporal patterns in per-server network traffic
- Identify known distributions (e.g., Gaussian, Poisson, Zipf) in network activity traces
- Represent server network activity as a superposition of identified distributions
- Model = Gaussian_1 + Exponential_2 + Gaussian_3 + Gaussian_4 + Constant_5

Validation: Deviation between original and synthetic is 4.9% on average

SLIDE 8

Distribution Fitting Model

Positive:
+ Simple, accurate, and concise
+ Captures temporal patterns in network activity
+ Facilitates traffic characterization (traffic is expressed as well-studied distributions)

Negative:
× Does not track spatial patterns
× Bursts in network activity are not easily emulated by known distributions → they would complicate the model
× Non-modular design

SLIDE 9

Outline

- Simple Temporal Model
- DC Network Traffic Characterization
- ECHO Design
- Model Validation

SLIDE 10

Methodology

- Workloads:
  - Entire Websearch application
  - Combine → Websearch query results aggregator
  - Render → Websearch query results display
- Experimental systems are production DCs with:
  - 30,000 servers running Websearch
  - 360 servers running Combine
  - 1350 servers running Render
- We collect per-server bandwidth traces of data sent and received over a period of 5 months (at 5 ms granularity)

SLIDE 11

Understanding Network-wide Behavior

- Temporal variations of network traffic
  - Fluctuation over time
  - Differences between workloads
- Average spatial patterns in network activity
  - Locality in network traffic
  - Impact of application functionality on locality
- Temporal variations in spatial patterns
  - Changes over different time scales
  - Changes for different types of workloads

SLIDE 12

Temporal Variations in Network Traffic

- Most servers are greatly underutilized → significant overprovisioning for latency-critical apps
- Some servers have higher utilization → mostly well load-balanced
- Similar network activity patterns over time
- The model should capture the fluctuation and remove information redundancy

SLIDE 13

Temporal Variations in Network Traffic

- Clearer diurnal patterns → 31 dark and 31 light vertical bands

SLIDE 14

Temporal Variations in Network Traffic

- Clearer diurnal patterns → 31 dark and 31 light vertical bands
- Higher utilization → not as much overprovisioning for servers that aggregate query results

SLIDE 15

Temporal Variations in Network Traffic

- Clearer diurnal patterns → 31 dark and 31 light vertical bands
- Higher utilization → not as much overprovisioning for servers that aggregate query results
- Not equally load-balanced → impact of the queries serviced by each server

SLIDE 16

Spatial Patterns in Network Activity

- High spatial locality → most accesses are confined within the same rack
- The model should preserve this spatial locality (within racks & hotspots)

SLIDE 17

Spatial Patterns in Network Activity

- High spatial locality → most accesses are confined within the same rack
- The model should preserve this spatial locality (within racks & hotspots)
- A few servers communicate with most of the machines → cluster scheduler, aggregators, monitoring servers

SLIDE 18

Spatial Patterns in Network Activity

- In contrast, Combine has less spatial locality → most servers talk to many machines
- Consistent with its functionality → query aggregation

SLIDE 19

Fluctuations in Spatial Patterns

- At first glance, spatial locality is very similar across months

SLIDE 20

Fluctuations in Spatial Patterns

- At first glance, spatial locality is very similar across months
- However, at finer granularity there are differences

SLIDE 21

Fluctuations in Spatial Patterns

- At first glance, spatial locality is very similar across months
- However, at finer granularity there are differences:
  - Software updates
  - Changes in traffic due to user load
  - Background processes (e.g., garbage collection, logging)

SLIDE 22

Fluctuations in Spatial Patterns

- At first glance, spatial locality is very similar across months
- However, at finer granularity there are differences:
  - Software updates
  - Changes in traffic due to user load
  - Background processes (e.g., garbage collection, logging)
- Fine-grain patterns are important for studies focused on specific hours of the day

SLIDE 23

Outline

- Simple Temporal Model
- DC Network Traffic Characterization
- ECHO Design
- Model Validation

SLIDE 24

Model Requirements

Don’t just model a node. Model the whole DC!

Requirements:
1. Average activity over time and space
2. Per-server activity fluctuation over time
3. Spatial patterns in network traffic
4. Individual server-to-server communication

SLIDE 25

Model Design – Spatial Aspects

- Hierarchical Markov chain: groups of racks → racks → individual servers
- Configurable granularity based on app/study requirements
- Captures spatial patterns in network traffic: fine-grain transitions are explored within each coarse state → most locality is confined within a rack
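A minimal sketch of such a two-level walk (hypothetical Python; `rack_T` and `server_dist` are illustrative stand-ins for the model's learned transition matrices, not ECHO's actual data structures):

```python
import random

def sample_walk(rack_T, server_dist, start_rack, steps, seed=0):
    """Two-level Markov walk: a coarse step picks the destination rack,
    then a fine step picks a server inside that rack. Fine-grain states
    are only expanded within the chosen coarse state."""
    rng = random.Random(seed)
    rack, walk = start_rack, []
    for _ in range(steps):
        rack = rng.choices(range(len(rack_T)), weights=rack_T[rack])[0]
        server = rng.choices(range(len(server_dist[rack])),
                             weights=server_dist[rack])[0]
        walk.append((rack, server))
    return walk

# Two racks; traffic is mostly rack-local (0.9 self-transition probability).
rack_T = [[0.9, 0.1], [0.1, 0.9]]
server_dist = [[0.5, 0.5], [0.8, 0.2]]
walk = sample_walk(rack_T, server_dist, start_rack=0, steps=2000)
locality = sum(a[0] == b[0] for a, b in zip(walk, walk[1:])) / (len(walk) - 1)
```

With a 0.9 self-transition probability per rack, roughly 90% of consecutive requests stay in the same rack, mirroring the rack-level locality described above.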

SLIDE 26

Model Design – Temporal Aspects

- Captures temporal patterns in network traffic → multiple models used over time
- The number of models is a function of the workload's activity fluctuations
- Switching between models allows compression in replay → fast experimentation

SLIDE 27

Hierarchical vs. Flat Model

- Hierarchical: explore fine-grain transitions within coarse states
- Flat: explore all fine-grain states → exponential increase in transition count
- Even for problems with a few hundred servers, the flat model becomes intractable
- No loss in accuracy with the hierarchical model, since locality is mostly confined within racks
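Back-of-envelope arithmetic on the transition counts (the rack size here is my illustrative assumption, not a figure from the talk):

```python
def flat_transitions(n_servers):
    # A flat model tracks a transition probability between every server pair.
    return n_servers ** 2

def hierarchical_transitions(n_racks, servers_per_rack):
    # Coarse rack-to-rack transitions, plus per-rack server-to-server ones.
    return n_racks ** 2 + n_racks * servers_per_rack ** 2

# 30,000 servers (the Websearch deployment), assuming 40 servers per rack.
flat = flat_transitions(30_000)            # 900 million transitions
hier = hierarchical_transitions(750, 40)   # ~1.8 million transitions
```

Under these assumptions the two-level model shrinks the transition count by about 500x, which is why the flat model becomes intractable long before DC scale.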


SLIDE 28

Model Construction

- Collect system-wide network activity traces
- Cluster network requests based on:
  - Sender/receiver server IDs
  - Type (rd/wr) and size of request (MB)
  - Inter-arrival time between requests (ms)
- Compute transition probabilities between states (e.g., S1 → S2: 90% 8KB read requests, 10 ms inter-arrival time)

p12 = 90%, 8KB, rd, 10 ms
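A sketch of that construction step (hypothetical Python; the field names and toy trace are mine), counting sender→receiver transitions per request cluster and normalizing into probabilities:

```python
from collections import Counter, defaultdict

def build_transition_probs(requests):
    """Count observed sender->receiver transitions, clustered by request
    type and size, then normalize the counts into probabilities."""
    counts = defaultdict(Counter)
    for src, dst, rtype, size_kb in requests:
        counts[(src, rtype, size_kb)][dst] += 1
    probs = {}
    for key, dsts in counts.items():
        total = sum(dsts.values())
        probs[key] = {dst: c / total for dst, c in dsts.items()}
    return probs

# Toy trace: server 1 sends 8KB reads, 90% to server 2 and 10% to server 3,
# echoing the slide's example transition (p12 = 90%, 8KB, rd).
trace = [(1, 2, "rd", 8)] * 9 + [(1, 3, "rd", 8)]
probs = build_transition_probs(trace)
print(probs[(1, "rd", 8)])  # {2: 0.9, 3: 0.1}
```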

SLIDE 29

Cloud Node: Modeling Server Subsets

- Focus on specific interesting activity patterns
  - Validate the model on server subsets (a few hundred servers)
- Network activity is not necessarily self-contained in those server subsets
- Cloud Node: emulate all network activity to and from servers external to the studied server subset
- Maintains the accuracy of per-server load while enabling more fine-grain validation
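One way to realize the Cloud Node idea (a sketch under my own naming; `cloud_id` is an arbitrary sentinel, not something specified in the talk):

```python
def remap_to_cloud_node(events, subset, cloud_id=-1):
    """Replace any endpoint outside the studied server subset with a single
    synthetic 'cloud node', so per-server load inside the subset is preserved
    while all external traffic is aggregated into one endpoint."""
    out = []
    for src, dst in events:
        src2 = src if src in subset else cloud_id
        dst2 = dst if dst in subset else cloud_id
        out.append((src2, dst2))
    return out

subset = {1, 2, 3}
events = [(1, 2), (1, 9), (7, 3)]
print(remap_to_cloud_node(events, subset))  # [(1, 2), (1, -1), (-1, 3)]
```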

SLIDE 30

Outline

- Simple Temporal Model
- DC Network Traffic Characterization
- ECHO Design
- Model Validation

SLIDE 31

Validation

1. Temporal variations of network activity
2. Spatial patterns of network activity
3. Individual server interactions (one-to-one communication patterns)

SLIDE 32

Validation – Temporal Patterns

- Less than 8% deviation between original and synthetic workload, on average across server subsets
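The deviation metric can be sketched as a mean absolute percentage difference between traces (my formulation of the comparison; the paper's exact metric may differ, and the sample values are made up):

```python
def mean_abs_deviation_pct(original, synthetic):
    """Average absolute per-sample deviation of the synthetic trace from
    the original, as a percentage of the original value."""
    assert len(original) == len(synthetic)
    devs = [abs(o - s) / o for o, s in zip(original, synthetic) if o > 0]
    return 100.0 * sum(devs) / len(devs)

original  = [100.0, 110.0, 95.0, 105.0]   # measured bandwidth samples
synthetic = [ 92.0, 115.0, 99.0, 100.0]   # model-generated samples
deviation = mean_abs_deviation_pct(original, synthetic)  # ~5.4%
```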

(Figure: original vs. synthetic bandwidth over time for three server subsets.)

SLIDE 33

Validation – Spatial Patterns

- Less than 10% deviation between original and synthetic workload, on average across server subsets

(Figure: original vs. synthetic spatial traffic patterns for three server subsets.)

SLIDE 34

Validation – Indiv. Server Interactions

- 12% deviation between original and synthetic for a weekday
- 9% deviation between original and synthetic for a weekend day

SLIDE 35

Validation – Benefits of Hierarchy

1 level:  28% deviation
2 levels: 9.1% deviation
3 levels: 4.4% deviation

SLIDE 36

Motivation: Revisited

- Scalable topologies ✓
  - Dragonfly, Fat tree, Clos, hotspot detection & elimination
- Flow control ✓
  - Load balancing, speculative flow control, Hedera, etc.
- Network switch design ✓
  - High port count designs, low-latency RPCs, RAMCloud, etc.
- Software-defined DC networks ✓
  - OpenFlow, Nicira, etc.
- Security attacks ✓
  - Real-time deviation from modeled behavior
- Retraining for major software updates and major system configuration changes
  - Low-overhead process

SLIDE 37

Conclusions

- ECHO leverages validated analytical models to capture the temporal and spatial access patterns in DC network activity
- It preserves the intensity and characteristics of DC network traffic
- It adjusts the granularity of representation to the app/study demands
- It is scalable and lightweight
- It decouples network system studies from access to large-scale applications

Future work:
- Use ECHO for network system studies without requiring full application deployment
- Expand similar concepts to other subsystems

SLIDE 38

Thank you! Questions?