 
ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers
Christina Delimitrou (1), Sriram Sankar (2), Aman Kansal (3), Christos Kozyrakis (1)
(1) Stanford University, (2) Microsoft, (3) Microsoft Research
IISWC, November 5th, 2012
Motivation
Network Performance and Efficiency → critical for DC operation
• Scalable Topologies
• Dragonfly, Fat tree, Clos, etc.
• Hotspot detection & elimination
• Flow Control
• Load balancing
• Speculative flow control
• Hedera, etc.
• Network Switches Design
• Low latency RPCs
• RAMCloud, etc.
• Software-defined DC networks
• OpenFlow
• Nicira, etc.
Challenge
Where to find representative traffic patterns?
Executive Summary
• Network Workload Model: a scheme that accurately and concisely captures the traffic of a DC workload
• User patterns only emerge at large scale → scalability
• Different level of detail per application → modularity/configurability
• Prior work on network modeling → mostly single-node, temporal behavior
• No spatial patterns, scalability, or modularity
• ECHO addresses the limitations of previous schemes:
  • System-wide network modeling: not confined to a single node
  • Locality-aware: accounts for spatial network traffic patterns
  • Hierarchical: adjusts the level of granularity to the needs of each app/study
  • Scalable: scales to DCs with ~30,000 servers
  • Lightweight: low, upper-bounded modeling overheads
  • Validated: ECHO is validated against real traces from applications in production DCs
Outline
• Simple Temporal Model
• DC Network Traffic Characterization
• ECHO Design
• Model Validation
Distribution Fitting Model
• Most well-known approach to network modeling
• Single-node as opposed to system-wide!
• Capture temporal patterns in per-server network traffic
• Identify known distributions (e.g., Gaussian, Poisson, Zipf) in network activity traces
• Represent server network activity as a superposition of the identified distributions
Distribution Fitting Model (example)
• Model = Gaussian + Exponential + Gaussian + Gaussian + Constant
• Validation: deviation between original and synthetic is 4.9% on average
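A minimal sketch of the distribution-fitting idea, for illustration only. It reads "superposition" as a per-time-step sum of independently sampled components, which is an assumption. The component parameters, the function names, and the deviation metric (relative difference of mean bandwidth) are also hypothetical and are not taken from the talk.

```python
import random

# Hypothetical component parameters (bandwidth in MB/s); none come from the talk.
COMPONENTS = [
    ("gaussian", {"mu": 40.0, "sigma": 5.0}),
    ("exponential", {"mean": 10.0}),
    ("constant", {"value": 5.0}),
]


def sample_component(kind, params, rng):
    """Draw one bandwidth sample from a single fitted distribution."""
    if kind == "gaussian":
        # Clamp at zero: bandwidth cannot be negative.
        return max(0.0, rng.gauss(params["mu"], params["sigma"]))
    if kind == "exponential":
        return rng.expovariate(1.0 / params["mean"])
    if kind == "constant":
        return params["value"]
    raise ValueError(f"unknown distribution: {kind}")


def synthesize(components, n_samples, seed=0):
    """Build a synthetic trace by summing one sample per component at each step."""
    rng = random.Random(seed)
    return [
        sum(sample_component(kind, params, rng) for kind, params in components)
        for _ in range(n_samples)
    ]


def mean_deviation(original, synthetic):
    """Relative difference between the mean bandwidth of two traces (assumed metric)."""
    mean_orig = sum(original) / len(original)
    mean_syn = sum(synthetic) / len(synthetic)
    return abs(mean_syn - mean_orig) / mean_orig


trace = synthesize(COMPONENTS, n_samples=10_000)
```

In practice the component list would come from fitting a measured per-server trace, and `mean_deviation` would compare that trace against the synthetic one.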
Distribution Fitting Model
Positive:
• Simple, accurate and concise
• Captures temporal patterns in network activity
• Facilitates traffic characterization (traffic is expressed as well-studied distributions)
Negative:
× Does not track spatial patterns
× Bursts in network activity are not easily emulated by known distributions → would complicate the model
× Non-modular design
Methodology
• Workloads:
  • Entire Websearch application
  • Combine → Websearch query results aggregator
  • Render → Websearch query results display
• Experimental systems are production DCs with:
  • 30,000 servers running Websearch
  • 360 servers running Combine
  • 1350 servers running Render
• We collect per-server bandwidth traces of data sent and received over a period of 5 months (at 5 ms granularity)
Understanding Network-wide Behavior
• Temporal variations of network traffic
  • Fluctuation over time
  • Differences between workloads
• Average spatial patterns in network activity
  • Locality in network traffic
  • Impact of application functionality on locality
• Temporal variations in spatial patterns
  • Changes over different time scales
  • Changes for different types of workloads
Temporal Variations in Network Traffic
• Most servers are greatly underutilized → significant overprovisioning for latency-critical apps
• Some servers have higher utilization → mostly well load-balanced
• Similarity in network activity patterns over time
• Model should: capture fluctuation, remove information redundancy
Temporal Variations in Network Traffic
• Clearer diurnal patterns → 31 dark and 31 light vertical bands
• Higher utilization → not as much overprovisioning for servers that aggregate query results
• Not equally load-balanced → impact of queries serviced by each server
Spatial Patterns in Network Activity
• High spatial locality
• Most accesses are confined within the same rack
• The model should preserve the spatial locality (within racks & hotspots)
• A few servers communicate with most of the machines → cluster scheduler, aggregators, monitoring servers
Spatial Patterns in Network Activity
• In contrast, Combine has less spatial locality → most servers talk to many machines
• Consistent with its functionality → query aggregation
Fluctuations in Spatial Patterns
• At first glance, spatial locality is very similar across months
• However, at finer granularity there are differences:
  • Software updates
  • Changes in traffic due to user load
  • Background processes (e.g., garbage collection, logging)
• Fine-grain patterns are important for studies focused on specific hours of the day
Model Requirements
Don’t just model a node. Model the whole DC!
Requirements:
1. Average activity over time and space
2. Per-server activity fluctuation over time
3. Spatial patterns in network traffic
4. Individual server-to-server communication
Model Design – Spatial Aspects
• Hierarchical Markov Chain: groups of racks → racks → individual servers
• Configurable granularity based on app/study requirements
• Captures spatial patterns in network traffic: fine-grain transitions are explored within each coarse state → most locality is confined within a rack
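A minimal sketch of a two-level hierarchical Markov chain, assuming only racks and servers (the talk's model also has groups of racks). The rack and server names, all probabilities, and the simplification that the server choice depends only on the chosen rack are hypothetical.

```python
import random

# Hypothetical coarse level: rack-to-rack transition probabilities.
RACK_TRANSITIONS = {
    "rackA": {"rackA": 0.9, "rackB": 0.1},
    "rackB": {"rackA": 0.2, "rackB": 0.8},
}

# Hypothetical fine level: per-rack distribution over destination servers.
SERVER_TRANSITIONS = {
    "rackA": {"a1": 0.5, "a2": 0.5},
    "rackB": {"b1": 0.7, "b2": 0.3},
}


def pick(weights, rng):
    """Sample one key from a {name: probability} mapping."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def next_destination(current_rack, rng):
    """Choose the coarse state first, then a fine state inside it."""
    rack = pick(RACK_TRANSITIONS[current_rack], rng)
    server = pick(SERVER_TRANSITIONS[rack], rng)
    return rack, server


def walk(start_rack, steps, seed=0):
    """Generate a sequence of (rack, server) destinations."""
    rng = random.Random(seed)
    rack, path = start_rack, []
    for _ in range(steps):
        rack, server = next_destination(rack, rng)
        path.append((rack, server))
    return path
```

Because fine-grain states are only sampled inside the chosen coarse state, the tables stay small: one rack-level table plus one server-level table per rack.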
Model Design – Temporal Aspects
• Captures temporal patterns in network traffic → multiple models used over time
• Number of models is a function of the workload’s activity fluctuations
• Switching between models allows compression in replay → fast experimentation
Hierarchical vs. Flat Model
• Hierarchical: explore fine-grain transitions within coarse states
• Flat: explore all fine-grain states → exponential increase in transition count
• Even for problems with a few hundred servers the flat model becomes intractable
• No loss in accuracy with the hierarchical model, since locality is mostly confined within racks
Model Construction
(Figure label: p12 = 90%; 8KB, rd, 10msec)
• Collect system-wide network activity traces
• Cluster network requests based on:
  • Sender/receiver server IDs
  • Type (rd/wr) and size of request (MB)
  • Inter-arrival time between requests (ms)
• Compute transition probabilities between states (e.g., S1 → S2: 90%, 8KB read requests, 10msec inter-arrival time)
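The last step, computing transition probabilities, can be sketched as follows. This assumes the clustering step has already assigned each request to a state; the trace records, state names, and function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical, already-clustered trace records:
# (state, request type, size in KB, inter-arrival time in ms).
TRACE = [
    ("S1", "rd", 8, 10), ("S2", "rd", 8, 10), ("S1", "rd", 8, 10),
    ("S2", "rd", 8, 10), ("S1", "wr", 64, 50), ("S3", "rd", 8, 10),
]


def transition_probabilities(states):
    """Estimate P(next state | current state) from consecutive observations."""
    counts = defaultdict(Counter)
    for src, dst in zip(states, states[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


probs = transition_probabilities([record[0] for record in TRACE])
```

For this toy trace, S1 is followed by S2 twice and by S3 once, so the estimate for S1 → S2 is 2/3.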
Cloud Node: Modeling Server Subsets
• Focus on specific interesting activity patterns
• Validating the model in server subsets (a few hundred servers)
• Network activity is not necessarily self-contained in those server subsets
• Cloud Node: emulates all network activity to and from servers external to the studied server subset
• Maintains accuracy of per-server load while enabling more fine-grain validation
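One way to picture the Cloud Node is as a single placeholder endpoint that absorbs every server outside the studied subset. The sketch below is an illustration under that assumption; the flow record layout `(src, dst, bytes)`, the `CLOUD` label, and the function name are hypothetical.

```python
from collections import defaultdict

CLOUD = "cloud"  # hypothetical label for the single external endpoint


def collapse_external(flows, subset):
    """Aggregate (src, dst, bytes) flows, mapping servers outside `subset` to CLOUD."""
    aggregated = defaultdict(int)
    for src, dst, nbytes in flows:
        s = src if src in subset else CLOUD
        d = dst if dst in subset else CLOUD
        if s == CLOUD and d == CLOUD:
            continue  # traffic entirely outside the subset is not modeled
        aggregated[(s, d)] += nbytes
    return dict(aggregated)


flows = [("a", "b", 10), ("a", "x", 5), ("y", "a", 7), ("x", "y", 100), ("a", "z", 3)]
collapsed = collapse_external(flows, subset={"a", "b"})
```

Server `a` still sends 18 bytes in total after collapsing, so the per-server load inside the subset is unchanged.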
Validation
1. Temporal variations of network activity
2. Spatial patterns of network activity
3. Individual server interactions (one-to-one communication patterns)
Validation – Temporal Patterns
(Figure: three numbered panels, each comparing Original and Model)
• Less than 8% deviation between original and synthetic workload, on average across server subsets
Validation – Spatial Patterns
(Figure: three numbered panels, each comparing Original and Model)
• Less than 10% deviation between original and synthetic workload, on average across server subsets
Validation – Individual Server Interactions
• 12% deviation between original and synthetic for a weekday
• 9% deviation between original and synthetic for a weekend day