ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers
Christina Delimitrou1, Sriram Sankar2, Aman Kansal3, Christos Kozyrakis1
1Stanford University 2Microsoft 3Microsoft Research
IISWC – November 5th 2012
2
Network Performance and Efficiency critical for DC operation
Scalable Topologies
Dragonfly, Fat tree, Clos, etc.; hotspot detection & elimination
Flow Control
Load balancing, speculative flow control, Hedera, etc.
Network Switches Design
Low latency RPCs, RAMCloud, etc.
Software-defined DC networks
OpenFlow, Nicira, etc.
3
4
Network Workload Model: A scheme that accurately and concisely captures the traffic of a DC workload
User patterns only emerge at large scale → scalability
Different level of detail per application → modularity/configurability
Prior work on network modeling: mostly single-node, temporal behavior
No spatial patterns, scalability, or modularity
ECHO addresses limitations of previous schemes:
System-wide network modeling: not confined to a single node
Locality-aware: accounts for spatial network traffic patterns
Hierarchical: adjusts the level of granularity to the needs of each app/study
Scalable: scales to DCs with ~30,000 servers
Lightweight: low, upper-bounded modeling overheads
Validated: ECHO is validated against real traces from applications in production DCs
5
Simple Temporal Model
DC Network Traffic Characterization
ECHO Design
Model Validation
6
Most well-known modeling approach for network traffic
Single-node as opposed to system-wide!
Capture temporal patterns in per-server network traffic
Identify known distributions (e.g., Gaussian, Poisson, Zipf, etc.) in network activity traces
Represent server network activity as a superposition of identified distributions
7
Capture temporal patterns in per-server network traffic
Identify known distributions (e.g., Gaussian, Poisson, Zipf, etc.) in network activity traces
Represent server network activity as a superposition of identified distributions
Model = Gaussian + Exponential + Gaussian + Gaussian + Constant (components 1-5)
Validation: Deviation between original and synthetic is 4.9% on average
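The superposition idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not give component parameters or define the deviation metric, so every amplitude, center, and rate below is hypothetical, each component is treated as a shape over time, and deviation is taken to be the mean relative error per sample.

```python
import math

def gaussian(t, amp, mu, sigma):
    # Bell-shaped burst of activity centered at time mu.
    return amp * math.exp(-0.5 * ((t - mu) / sigma) ** 2)

def exponential(t, amp, start, rate):
    # Activity that begins at `start` and decays exponentially.
    return amp * math.exp(-rate * (t - start)) if t >= start else 0.0

def constant(t, level):
    # Background traffic that is always present.
    return level

# Hypothetical parameters; mirrors "Gaussian + Exponential + Gaussian + Gaussian + Constant".
COMPONENTS = [
    (gaussian, (40.0, 10.0, 3.0)),
    (exponential, (25.0, 20.0, 0.2)),
    (gaussian, (30.0, 45.0, 5.0)),
    (gaussian, (20.0, 70.0, 4.0)),
    (constant, (5.0,)),
]

def synthetic_trace(n_samples):
    # Superposition: the modeled traffic at time t is the sum of all components.
    return [sum(f(t, *params) for f, params in COMPONENTS) for t in range(n_samples)]

def mean_relative_deviation(original, synthetic):
    # Assumed metric: average of |original - synthetic| / |original|, skipping zero samples.
    pairs = [(o, s) for o, s in zip(original, synthetic) if o != 0]
    return sum(abs(o - s) / abs(o) for o, s in pairs) / len(pairs)
```

Calling `mean_relative_deviation(measured, synthetic_trace(len(measured)))` on a measured per-server trace would give a figure comparable in spirit to the deviation quoted above, though the metric actually used in the study may differ.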
8
Simple, accurate and concise
Captures temporal patterns in network activity
Facilitates traffic characterization (traffic is expressed as well-studied distributions)
× Does not track spatial patterns
× Bursts in network activity are not easily emulated by known distributions and would complicate the model
× Non-modular design
9
Simple Temporal Model
DC Network Traffic Characterization
ECHO Design
Model Validation
10
Workloads:
Entire Websearch application
Combine: Websearch query results aggregator
Render: Websearch query results display
Experimental systems are production DCs with:
30,000 servers running Websearch
360 servers running Combine
1,350 servers running Render
We collect per-server bandwidth traces of data sent and
11
Temporal variations of network traffic
  Fluctuation over time
  Differences between workloads
Average spatial patterns in network activity
  Locality in network traffic
  Impact of application functionality on locality
Temporal variations in spatial patterns
  Changes over different time scales
  Changes for different types of workloads
12
Most servers are greatly underutilized → significant overprovisioning for latency-critical apps
Some servers have higher utilization, but load is mostly well balanced
Similarity in network activity patterns over time
Model should: capture fluctuation, remove information redundancy
13
Clearer diurnal patterns → 31 dark and 31 light vertical bands
14
Clearer diurnal patterns → 31 dark and 31 light vertical bands
Higher utilization → not as much overprovisioning for servers that aggregate query results
15
Clearer diurnal patterns → 31 dark and 31 light vertical bands
Higher utilization → not as much overprovisioning for servers that aggregate query results
Not equally load-balanced → impact of queries serviced by each server
16
High spatial locality → most accesses are confined within the same rack
The model should preserve the spatial locality (within racks & hotspots)
17
High spatial locality → most accesses are confined within the same rack
The model should preserve the spatial locality (within racks & hotspots)
A few servers communicate with most of the machines → cluster scheduler, aggregators, monitoring servers
18
In contrast, Combine has less spatial locality → most servers talk to many machines
Consistent with its functionality → query aggregation
19
At first glance spatial locality is very similar across months
20
At first glance spatial locality is very similar across months
However, at finer granularity there are differences
21
At first glance spatial locality is very similar across months
However, at finer granularity there are differences
Software updates
Changes in traffic due to user load
Background processes (e.g., garbage collection, logging, etc.)
22
At first glance spatial locality is very similar across months
However, at finer granularity there are differences
Software updates
Changes in traffic due to user load
Background processes (e.g., garbage collection, logging, etc.)
Fine-grain patterns are important for studies focused on specific hours of the day
23
Simple Temporal Model
DC Network Traffic Characterization
ECHO Design
Model Validation
24
25
Hierarchical Markov Chain: groups of racks → racks → individual servers
Configurable granularity based on app/study requirements
Captures spatial patterns in network traffic: fine-grain transitions are explored within each coarse state → most locality confined within a rack
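A minimal two-level sketch of this hierarchy in Python, assuming racks are the coarse states and servers the fine states. All rack names, server names, and probabilities are hypothetical, and the fine-level choice is simplified to depend only on the chosen rack; the actual ECHO model is not specified at this level of detail in the slides.

```python
import random

# Hypothetical coarse level: probability that the next request targets each rack.
RACK_TRANSITIONS = {
    "rackA": {"rackA": 0.9, "rackB": 0.1},
    "rackB": {"rackA": 0.2, "rackB": 0.8},
}

# Hypothetical fine level: which server is hit once the rack is known.
SERVER_TRANSITIONS = {
    "rackA": {"a1": 0.5, "a2": 0.5},
    "rackB": {"b1": 0.7, "b2": 0.3},
}

def sample(dist, rng):
    # Draw one state from a {state: probability} mapping.
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def next_destination(current_rack, rng):
    # Coarse transition first, then a fine transition inside the chosen rack.
    rack = sample(RACK_TRANSITIONS[current_rack], rng)
    server = sample(SERVER_TRANSITIONS[rack], rng)
    return rack, server

def walk(start_rack, steps, seed=0):
    # Generate a synthetic sequence of (rack, server) destinations.
    rng = random.Random(seed)
    rack, path = start_rack, []
    for _ in range(steps):
        rack, server = next_destination(rack, rng)
        path.append((rack, server))
    return path
```

Because the diagonal entries of the coarse matrix are large, most generated requests stay within the current rack, which is the locality property the model is meant to preserve. A third level (groups of racks) would add one more lookup before the rack step.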
26
Captures temporal patterns in network traffic → multiple models used
Number of models is a function of the workload’s activity fluctuations
Switching between models allows compression in replay → fast experimentation
27
Hierarchical: explore fine-grain transitions within coarse states
Flat: explore all fine-grain states → exponential increase in transition count
Even for problems with a few hundred servers the model becomes intractable
No loss in accuracy with the hierarchical model since locality is mostly confined within racks
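To get a feel for the size difference, the following Python sketch counts transition-matrix entries for the two designs. It assumes fully connected chains at every level and a hypothetical layout of 25 groups × 30 racks × 40 servers (30,000 servers in total); neither the layout nor the counting scheme comes from the slides.

```python
def flat_transitions(n_servers):
    # One state per server, every server may transition to every other.
    return n_servers ** 2

def hierarchical_transitions(n_groups, racks_per_group, servers_per_rack):
    # Fine-grain transitions are only explored inside their coarse state.
    group_level = n_groups ** 2
    rack_level = n_groups * racks_per_group ** 2
    server_level = n_groups * racks_per_group * servers_per_rack ** 2
    return group_level + rack_level + server_level

flat = flat_transitions(25 * 30 * 40)
hier = hierarchical_transitions(25, 30, 40)
```

Under these assumptions the flat chain has 900,000,000 entries while the hierarchical one has 1,223,125, a reduction of more than two orders of magnitude.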
28
Collect system-wide network activity traces
Cluster network requests based on:
  Sender/receiver server IDs
  Type (rd/wr) and size of request (MB)
  Inter-arrival time between requests (ms)
Compute transition probabilities between states (e.g., S1 → S2: 90%, 8KB read requests, 10msec inter-arrival time)
p12 = 90%; 8KB, rd, 10msec
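The training steps above can be sketched as a counting pass over the trace. The record layout `(src, dst, kind, size_kb, gap_ms)` and the choice to summarize each transition by its most common request type/size and its mean inter-arrival time are assumptions made for illustration, not details confirmed by the slides.

```python
from collections import Counter, defaultdict

def fit_transitions(requests):
    """Estimate per-transition probability and typical request features.

    requests: iterable of (src, dst, kind, size_kb, gap_ms) tuples (hypothetical layout).
    """
    counts = defaultdict(Counter)    # src -> Counter of dst
    features = defaultdict(Counter)  # (src, dst) -> Counter of (kind, size_kb)
    gaps = defaultdict(list)         # (src, dst) -> inter-arrival times
    for src, dst, kind, size_kb, gap_ms in requests:
        counts[src][dst] += 1
        features[(src, dst)][(kind, size_kb)] += 1
        gaps[(src, dst)].append(gap_ms)

    model = {}
    for src, row in counts.items():
        total = sum(row.values())
        for dst, n in row.items():
            kind, size_kb = features[(src, dst)].most_common(1)[0][0]
            model[(src, dst)] = {
                "prob": n / total,                     # transition probability
                "kind": kind,                          # dominant request type
                "size_kb": size_kb,                    # dominant request size
                "gap_ms": sum(gaps[(src, dst)]) / n,   # mean inter-arrival time
            }
    return model
```

For a trace in which nine of ten requests from S1 go to S2 as 8KB reads spaced 10 ms apart, `fit_transitions` yields a probability of 0.9 for the S1 → S2 entry, matching the example on the slide.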
29
Focus on specific interesting activity patterns
Validating the model in server subsets (a few hundred servers)
Network activity is not necessarily self-contained in those server subsets
Cloud Node: Emulate all network activity to and from servers external to the studied server subset
Maintains accuracy of per-server load while enabling more fine-grain validation
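One way to picture the Cloud Node is as a rewrite of the trace in which every endpoint outside the studied subset is replaced by a single placeholder. The sketch below assumes flows are `(src, dst, nbytes)` tuples and uses a hypothetical node name `"cloud"`; it is an illustration of the idea rather than ECHO's implementation.

```python
CLOUD = "cloud"  # hypothetical name for the aggregate external node

def collapse_to_cloud(flows, subset):
    # Keep flows that touch the subset; map external endpoints to the cloud node.
    collapsed = []
    for src, dst, nbytes in flows:
        src_in, dst_in = src in subset, dst in subset
        if not (src_in or dst_in):
            continue  # traffic entirely outside the studied subset is dropped
        collapsed.append((src if src_in else CLOUD, dst if dst_in else CLOUD, nbytes))
    return collapsed

def per_server_sent(flows):
    # Total bytes sent by each source node.
    totals = {}
    for src, _dst, nbytes in flows:
        totals[src] = totals.get(src, 0) + nbytes
    return totals
```

After collapsing, the bytes sent by each server inside the subset are unchanged, which is the sense in which per-server load stays accurate while the model only has to cover a few hundred servers plus one extra node.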
30
Simple Temporal Model
DC Network Traffic Characterization
ECHO Design
Model Validation
31
32
Less than 8% deviation between original and synthetic workload, on average across server subsets
[Figure: original vs. model network activity, panels 1-3]
33
Less than 10% deviation between original and synthetic workload, on average across server subsets
[Figure: original vs. model network activity, panels 1-3]
34
12% deviation between original and synthetic for a weekday
9% deviation between original and synthetic for a weekend day
35
1 Level: 28% deviation
2 Levels: 9.1% deviation
3 Levels: 4.4% deviation
36
Scalable Topologies
Dragonfly, Fat tree, Clos, hotspot detection & elimination
Flow Control
Load balancing, speculative flow control, Hedera, etc.
Network Switches Design
High port count designs, low latency RPCs, RAMCloud, etc.
Software-defined DC networks
OpenFlow, Nicira, etc.
Security attacks
Real-time deviation from modeled behavior
Retraining for major sw updates, major system configuration changes
Low overhead process
37
ECHO leverages validated analytical models to capture the temporal and spatial access patterns in DC network activity
It preserves the intensity and characteristics of DC network traffic
It adjusts the granularity of representation to the app/study demands
It is scalable and lightweight
It decouples network system studies from access to large-scale applications
Use ECHO for network system studies without the requirement for full application deployment
Expand similar concepts to other subsystems
38