
ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers. Christina Delimitrou (1), Sriram Sankar (2), Aman Kansal (3), Christos Kozyrakis (1). (1) Stanford University, (2) Microsoft, (3) Microsoft Research. IISWC, November 5th


  1. ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers
     Christina Delimitrou (1), Sriram Sankar (2), Aman Kansal (3), Christos Kozyrakis (1)
     (1) Stanford University, (2) Microsoft, (3) Microsoft Research
     IISWC, November 5th, 2012

  2. Motivation
     - Network performance and efficiency → critical for DC operation
     - Scalable topologies
       - Dragonfly, Fat tree, Clos, etc.
     - Hotspot detection & elimination
     - Flow control
       - Load balancing
       - Speculative flow control
       - Hedera, etc.
     - Network switch design
     - Low-latency RPCs
       - RAMCloud, etc.
     - Software-defined DC networks
       - OpenFlow
       - Nicira, etc.

  3. Challenge
     Where can we find representative traffic patterns?

  4. Executive Summary
     - Network workload model: a scheme that accurately and concisely captures the traffic of a DC workload
       - User patterns only emerge at large scale → scalability
       - Different level of detail per application → modularity/configurability
     - Prior work on network modeling → mostly single-node, temporal behavior
       - No spatial patterns, scalability, or modularity
     - ECHO addresses the limitations of previous schemes:
       - System-wide network modeling: not confined to a single node
       - Locality-aware: accounts for spatial network traffic patterns
       - Hierarchical: adjusts the level of granularity to the needs of each app/study
       - Scalable: scales to DCs with ~30,000 servers
       - Lightweight: low, upper-bounded modeling overheads
       - Validated: ECHO is validated against real traces from applications in production DCs

  5. Outline
     - Simple Temporal Model
     - DC Network Traffic Characterization
     - ECHO Design
     - Model Validation

  6. Distribution Fitting Model
     - The best-known modeling approach for network traffic
     - Single-node, as opposed to system-wide!
     - Capture temporal patterns in per-server network traffic
     - Identify known distributions (e.g., Gaussian, Poisson, Zipf, etc.) in network activity traces
     - Represent server network activity as a superposition of the identified distributions

  7. Distribution Fitting Model
     - Capture temporal patterns in per-server network traffic
     - Identify known distributions (e.g., Gaussian, Poisson, Zipf, etc.) in network activity traces
     - Represent server network activity as a superposition of the identified distributions
     - Model = Gaussian + Exponential + Gaussian + Gaussian + Constant
     [Figure: network activity trace with fitted components labeled 1 to 5]
     Validation: deviation between original and synthetic is 4.9% on average
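The five-part model on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each fitted distribution covers one consecutive segment of the trace, and the function name `synthesize_trace`, the segment lengths, and all parameter values are hypothetical.

```python
import random

def synthesize_trace(segments, seed=0):
    """Generate a synthetic per-server bandwidth trace.

    segments: list of (kind, params, n_samples) tuples, replayed in order.
    The supported kinds and all parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    trace = []
    for kind, params, n_samples in segments:
        for _ in range(n_samples):
            if kind == "gaussian":
                value = rng.gauss(params["mean"], params["std"])
            elif kind == "exponential":
                value = rng.expovariate(1.0 / params["mean"])
            elif kind == "constant":
                value = params["value"]
            else:
                raise ValueError(f"unknown distribution: {kind}")
            trace.append(max(0.0, value))  # bandwidth cannot be negative
    return trace

# Hypothetical parameters that mirror the slide's structure:
# Gaussian + Exponential + Gaussian + Gaussian + Constant.
SEGMENTS = [
    ("gaussian", {"mean": 40.0, "std": 5.0}, 100),
    ("exponential", {"mean": 15.0}, 100),
    ("gaussian", {"mean": 60.0, "std": 8.0}, 100),
    ("gaussian", {"mean": 30.0, "std": 4.0}, 100),
    ("constant", {"value": 10.0}, 100),
]

trace = synthesize_trace(SEGMENTS)
```

Fixing the seed makes the synthetic trace reproducible, which helps when comparing it against the original trace.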

  8. Distribution Fitting Model
     Positive:
     - Simple, accurate and concise
     - Captures temporal patterns in network activity
     - Facilitates traffic characterization (traffic is expressed as well-studied distributions)
     Negative:
     × Does not track spatial patterns
     × Bursts in network activity are not easily emulated by known distributions → would complicate the model
     × Non-modular design

  9. Outline
     - Simple Temporal Model
     - DC Network Traffic Characterization
     - ECHO Design
     - Model Validation

  10. Methodology
     - Workloads:
       - Entire Websearch application
       - Combine → Websearch query results aggregator
       - Render → Websearch query results display
     - Experimental systems are production DCs with:
       - 30,000 servers running Websearch
       - 360 servers running Combine
       - 1350 servers running Render
     - We collect per-server bandwidth traces of data sent and received over a period of 5 months (at 5 msec granularity)

  11. Understanding Network-wide Behavior
     - Temporal variations of network traffic
       - Fluctuation over time
       - Differences between workloads
     - Average spatial patterns in network activity
       - Locality in network traffic
       - Impact of application functionality on locality
     - Temporal variations in spatial patterns
       - Changes over different time scales
       - Changes for different types of workloads

  12. Temporal Variations in Network Traffic
     - Most servers are greatly underutilized → significant overprovisioning for latency-critical apps
     - Some servers have higher utilization → mostly well load-balanced
     - Similarity in network activity patterns over time
     - Model should: capture fluctuation, remove information redundancy

  13. Temporal Variations in Network Traffic
     - Clearer diurnal patterns → 31 dark and 31 light vertical bands

  14. Temporal Variations in Network Traffic
     - Clearer diurnal patterns → 31 dark and 31 light vertical bands
     - Higher utilization → not as much overprovisioning for servers that aggregate query results

  15. Temporal Variations in Network Traffic
     - Clearer diurnal patterns → 31 dark and 31 light vertical bands
     - Higher utilization → not as much overprovisioning for servers that aggregate query results
     - Not equally load-balanced → impact of queries serviced by each server

  16. Spatial Patterns in Network Activity
     - High spatial locality → most accesses are confined within the same rack
     - The model should preserve the spatial locality (within racks & hotspots)

  17. Spatial Patterns in Network Activity
     - High spatial locality → most accesses are confined within the same rack
     - The model should preserve the spatial locality (within racks & hotspots)
     - A few servers communicate with most of the machines → cluster scheduler, aggregators, monitoring servers

  18. Spatial Patterns in Network Activity
     - In contrast, Combine has less spatial locality → most servers talk to many machines
     - Consistent with its functionality → query aggregation

  19. Fluctuations in Spatial Patterns
     - At first glance, spatial locality is very similar across months

  20. Fluctuations in Spatial Patterns
     - At first glance, spatial locality is very similar across months
     - However, at finer granularity there are differences

  21. Fluctuations in Spatial Patterns
     - At first glance, spatial locality is very similar across months
     - However, at finer granularity there are differences
       - Software updates
       - Changes in traffic due to user load
       - Background processes (e.g., garbage collection, logging, etc.)

  22. Fluctuations in Spatial Patterns
     - At first glance, spatial locality is very similar across months
     - However, at finer granularity there are differences
       - Software updates
       - Changes in traffic due to user load
       - Background processes (e.g., garbage collection, logging, etc.)
     - Fine-grain patterns are important for studies focused on specific hours of the day

  23. Outline
     - Simple Temporal Model
     - DC Network Traffic Characterization
     - ECHO Design
     - Model Validation

  24. Model Requirements
     Don't just model a node. Model the whole DC!
     Requirements:
     1. Average activity over time and space
     2. Per-server activity fluctuation over time
     3. Spatial patterns in network traffic
     4. Individual server-to-server communication

  25. Model Design – Spatial Aspects
     - Hierarchical Markov Chain: groups of racks → racks → individual servers
     - Configurable granularity based on app/study requirements
     - Captures spatial patterns in network traffic: fine-grain transitions are explored within each coarse state → most locality is confined within a rack
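A hierarchical Markov chain of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not ECHO's actual data structures: it uses two levels (racks, then servers) instead of the three on the slide, and every rack name, server name, and probability is made up.

```python
import random

# Hypothetical coarse level: rack-to-rack transition probabilities.
# High self-transition values stand in for rack-level locality.
RACK_TRANSITIONS = {
    "rack0": {"rack0": 0.9, "rack1": 0.1},
    "rack1": {"rack0": 0.2, "rack1": 0.8},
}

# Hypothetical fine level: which server is contacted inside a rack.
SERVERS_IN_RACK = {
    "rack0": {"s0": 0.5, "s1": 0.5},
    "rack1": {"s2": 0.7, "s3": 0.3},
}

def pick(weights, rng):
    """Draw one key from a {name: probability} mapping."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def next_destination(current_rack, rng):
    """Choose the coarse state first, then a fine state within it."""
    rack = pick(RACK_TRANSITIONS[current_rack], rng)
    server = pick(SERVERS_IN_RACK[rack], rng)
    return rack, server

def walk(start_rack, steps, seed=0):
    """Replay the chain for a number of steps and return (rack, server) pairs."""
    rng = random.Random(seed)
    rack, path = start_rack, []
    for _ in range(steps):
        rack, server = next_destination(rack, rng)
        path.append((rack, server))
    return path

path = walk("rack0", 50)
```

Fine-grain transitions are only consulted for the rack that was selected, which is the property the slide relies on to keep the model small.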

  26. Model Design – Temporal Aspects
     - Captures temporal patterns in network traffic → multiple models used over time
     - Number of models is a function of the workload's activity fluctuations
     - Switching between models allows compression in replay → fast experimentation
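One way to picture "multiple models used over time" is to assign each time window to a model and reuse a model whenever activity returns to a level already seen. The sketch below is an assumption made for illustration only: the slides do not say how ECHO decides when to switch models, and `assign_models`, the relative threshold, and the sample values are hypothetical.

```python
def assign_models(window_means, threshold):
    """Map each time window to a model index.

    A window reuses an existing model when its mean activity is within
    `threshold` (relative) of that model's reference mean; otherwise a
    new model is created. This switching rule is an assumption.
    """
    reference_means = []  # one reference mean per model
    assignment = []
    for mean in window_means:
        for index, reference in enumerate(reference_means):
            if abs(mean - reference) <= threshold * reference:
                assignment.append(index)
                break
        else:
            reference_means.append(mean)
            assignment.append(len(reference_means) - 1)
    return assignment, reference_means

# Hypothetical per-window mean bandwidth values.
assignment, reference_means = assign_models([10, 10.5, 20, 10.2, 21], 0.1)
```

With this rule, a workload with larger activity swings ends up with more models, while repeated windows collapse onto the same model.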

  27. Hierarchical vs. Flat Model
     - Hierarchical: explore fine-grain transitions within coarse states
     - Flat: explore all fine-grain states → exponential increase in transition count
     - Even for problems with a few hundred servers, the flat model becomes intractable
     - No loss in accuracy with the hierarchical model, since locality is mostly confined within racks

  28. Model Construction
     [Figure: example transition labeled p12 = 90%, 8KB, rd, 10msec]
     - Collect system-wide network activity traces
     - Cluster network requests based on:
       - Sender/receiver server IDs
       - Type (rd/wr) and size of request (MB)
       - Inter-arrival time between requests (ms)
     - Compute transition probabilities between states (e.g., S1 → S2: 90%, 8KB read requests, 10msec inter-arrival time)
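The last step, computing transition probabilities between states, amounts to counting consecutive state pairs in the trace and normalizing per source state. The sketch below assumes the clustering step has already mapped each request to a state label; the labels and the observed sequence are made up so that S1 → S2 comes out at 90%, as in the slide's example.

```python
from collections import Counter, defaultdict

def transition_probabilities(state_sequence):
    """Estimate P(next state | current state) from an ordered state sequence."""
    counts = defaultdict(Counter)
    for src, dst in zip(state_sequence, state_sequence[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

# Hypothetical sequence of already-clustered states:
# S1 is followed by S2 nine times and by S3 once.
OBSERVED = ["S1", "S2"] * 9 + ["S1", "S3"]

probabilities = transition_probabilities(OBSERVED)
```

In a fuller model each state would also carry the request type, size, and inter-arrival time of its cluster, so that replaying a transition emits a concrete request.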

  29. Cloud Node: Modeling Server Subsets
     - Focus on specific interesting activity patterns
     - Validating the model in server subsets (a few hundred servers)
     - Network activity is not necessarily self-contained in those server subsets
     - Cloud Node: emulate all network activity to and from servers external to the studied server subset
     - Maintains accuracy of per-server load while enabling more fine-grain validation
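The cloud node idea can be illustrated by collapsing every endpoint outside the studied subset into a single placeholder node. This is a sketch under assumptions: the flow tuple layout `(src, dst, bytes)`, the function name `collapse_external`, and the choice to drop flows with both endpoints outside the subset are all hypothetical.

```python
def collapse_external(flows, subset, cloud="cloud"):
    """Aggregate traffic so that all servers outside `subset` appear as one node.

    flows: iterable of (src, dst, nbytes) tuples.
    Returns {(src, dst): total_bytes} with external endpoints renamed to `cloud`.
    Flows that never touch the subset are dropped (an assumption).
    """
    collapsed = {}
    for src, dst, nbytes in flows:
        src_inside, dst_inside = src in subset, dst in subset
        if not src_inside and not dst_inside:
            continue
        key = (src if src_inside else cloud, dst if dst_inside else cloud)
        collapsed[key] = collapsed.get(key, 0) + nbytes
    return collapsed

# Hypothetical flows: servers a and b are studied, x and y are external.
FLOWS = [
    ("a", "b", 100),
    ("a", "x", 40),
    ("y", "a", 25),
    ("x", "a", 15),
    ("x", "y", 999),
]

collapsed = collapse_external(FLOWS, subset={"a", "b"})
```

Each server inside the subset still sees the same total bytes sent and received, which matches the slide's point about preserving per-server load.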

  30. Outline
     - Simple Temporal Model
     - DC Network Traffic Characterization
     - ECHO Design
     - Model Validation

  31. Validation
     1. Temporal variations of network activity
     2. Spatial patterns of network activity
     3. Individual server interactions (one-to-one communication patterns)

  32. Validation – Temporal Patterns
     [Figure: original vs. model traces, panels 1 to 3]
     - Less than 8% deviation between original and synthetic workload, on average across server subsets
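The slides report deviation between original and synthetic workloads without defining the metric. One plausible reading, shown here purely as an assumption, is the mean relative deviation between aligned samples of the two traces; the function name and sample values are hypothetical.

```python
def mean_relative_deviation(original, synthetic):
    """Average of |original - synthetic| / |original| over aligned samples.

    Samples where the original value is zero are skipped to avoid
    division by zero. This metric is an assumption, not the deck's definition.
    """
    if len(original) != len(synthetic):
        raise ValueError("traces must have the same length")
    pairs = [(o, s) for o, s in zip(original, synthetic) if o != 0]
    if not pairs:
        raise ValueError("no non-zero samples to compare")
    return sum(abs(o - s) / abs(o) for o, s in pairs) / len(pairs)

# Hypothetical bandwidth samples for one server.
deviation = mean_relative_deviation([100, 200, 50], [110, 190, 50])
```

Averaging this value over several server subsets would give a single figure comparable in form to the percentages quoted on the validation slides.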

  33. Validation – Spatial Patterns
     [Figure: original vs. model traffic maps, panels 1 to 3]
     - Less than 10% deviation between original and synthetic workload, on average across server subsets

  34. Validation – Individual Server Interactions
     - 12% deviation between original and synthetic for a weekday
     - 9% deviation between original and synthetic for a weekend day
