Traffic Engineering with Forward Fault Correction



  1. Traffic Engineering with Forward Fault Correction. Harry Liu, Microsoft Research, 06/02/2016. Joint work with Ratul Mahajan, Srikanth Kandula, Ming Zhang and David Gelernter.

  2. Cloud services require large network capacity. Cloud applications: growing traffic. Cloud networks: expensive (e.g. cost of WAN: $100M/year).

  3. TE is critical to effectively utilizing networks. Traffic engineering (centralized & SDN-based):
     WAN network: • Microsoft SWAN (SIGCOMM’13) • Google B4 (SIGCOMM’13) • ……
     Datacenter network: • DevoFlow (SIGCOMM’11) • MicroTE (CoNEXT’11) • ……

  4. Centralized TE is the key to network efficiency. Two decisions: 1) how much traffic to admit, 2) how to route.
     [Figure: four-switch example (nodes 1 to 4), link capacity 10, demands of 10, requirement: path length ≤ 2 hops.]
     • Sub-optimal resource allocation based on local view & control. Total throughput: 15
     • Optimal resource allocation based on global view & control (TE controller). Total throughput: 20

  5. But centralized TE is also vulnerable to faults.
     [Figure: control loop between the TE controller and the network. Network view (e.g. topology, capacity, traffic); network configuration (e.g. routes, rate limits); frequent updates for high utilization (e.g. every 5 min). The loop is exposed to control-plane faults and data-plane faults.]

  6. Data plane faults: link and switch failures. Rescaling: sending traffic proportionally to residual paths.
     [Figure: switches s1 to s4, link capacity 10, flows s2→s4 (10) and s3→s4 (10) with labeled allocations of 7 and 3; after a link failure, rescaling leads to congestion.]
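The rescaling rule on slide 6 can be illustrated with a minimal sketch. The function name `rescale`, the path names, and the 7/3 split are assumptions made for illustration (one reading of the figure's labels), not something the slides specify.

```python
def rescale(allocation, failed_paths):
    """Move a flow's traffic onto its surviving paths, proportionally to their original shares."""
    total = sum(allocation.values())  # the flow keeps sending at its full rate
    residual = {p: a for p, a in allocation.items() if p not in failed_paths}
    residual_total = sum(residual.values())
    if residual_total == 0:
        return {}  # no surviving path: the flow is cut off
    return {p: total * a / residual_total for p, a in residual.items()}


# Hypothetical flow of 10 units, split 7/3 over a direct path and a detour via s1.
before = {"s2-s4": 7.0, "s2-s1-s4": 3.0}
after = rescale(before, failed_paths={"s2-s4"})
print(after)  # the detour now carries the whole flow
```

If another flow already uses the detour's links, the extra traffic can push a link past its capacity, which is the congestion the slide points at.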

  7. Control plane faults: failures or long delays to configure a network device.
     [Figure: TE controller sending TE configurations to a switch.]
     • RPC failure • Firmware bugs • Overloaded CPU • Memory shortage
     Control plane faults can also result in congestion.

  8. TE controllability is undermined by faults.
     [Figure: the TE control loop (TE controller, network view, network configuration, network), annotated with incompleteness and inaccuracy caused by control-plane and data-plane faults.]

  9. Control and data plane faults in practice. In a production WAN (200+ routers, 6000+ links):
     • Faults are common.
     • Faults cause severe congestion.
     • Data plane: fault rate = 25% per 5 minutes.
     • Control plane: fault rate = 0.1% to 1% per TE update.

  10. State of the art for handling faults
     • Heavy over-provisioning: big loss in throughput
     • Reactive handling of faults:
       • Control plane faults: retry
       • Data plane faults: re-compute TE and update networks
     Drawbacks of reactive handling: cannot prevent congestion; slow (seconds to minutes); blocked by control plane faults.

  11. How about handling faults proactively?
     [Figure: Network: not robust enough. TE algorithm: making it robust.]

  12. Forward fault correction (FFC) in TE
     • [Bad News] Individual faults are unpredictable.
     • [Good News] The number of simultaneous faults is small.
     • FEC (packet loss): guarantees no information loss under up to k arbitrary packet drops, with careful data encoding.
     • FFC (network faults): guarantees no congestion under up to k arbitrary faults, with careful traffic distribution.

  13. Example: FFC for link failures
     [Figure: switches s1 to s4, link capacity 10, flows s2→s4 (9) and s3→s4 (9) with labeled allocations of 8 and 1; K=1 (FFC). Failure cases: single link failures.]
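The guarantee in this example can be checked by brute force: enumerate every set of up to k failed links, rescale each flow onto its surviving paths, and confirm that no link exceeds its capacity. The sketch below is only an illustration. The topology, the 7/3 and 8/1 splits, and all names (`LINKS`, `link_loads`, `is_ffc`) are assumptions based on one reading of the figures on slides 6 and 13.

```python
from itertools import combinations

CAPACITY = 10.0
# Assumed topology: s2 and s3 each reach s4 directly or via s1.
LINKS = ["s1-s4", "s2-s1", "s2-s4", "s3-s1", "s3-s4"]


def link_loads(flows, failed_links):
    """Per-link load after each flow rescales onto its surviving paths."""
    loads = {link: 0.0 for link in LINKS if link not in failed_links}
    for paths in flows.values():
        demand = sum(rate for _, rate in paths)
        alive = [(links, rate) for links, rate in paths if not set(links) & failed_links]
        alive_total = sum(rate for _, rate in alive)
        if alive_total == 0:
            continue  # flow is disconnected: its traffic is dropped, not congesting
        for links, rate in alive:
            for link in links:
                loads[link] += demand * rate / alive_total
    return loads


def is_ffc(flows, k):
    """True if no link is overloaded under any set of at most k link failures."""
    for n_failed in range(k + 1):
        for failed in combinations(LINKS, n_failed):
            loads = link_loads(flows, set(failed))
            if any(load > CAPACITY + 1e-9 for load in loads.values()):
                return False
    return True


def two_flows(direct, detour):
    """Both flows use the same split: `direct` on the direct link, `detour` via s1."""
    return {
        "s2->s4": [(("s2-s4",), direct), (("s2-s1", "s1-s4"), detour)],
        "s3->s4": [(("s3-s4",), direct), (("s3-s1", "s1-s4"), detour)],
    }


non_ffc = two_flows(7.0, 3.0)  # 10 units per flow
ffc = two_flows(8.0, 1.0)      # 9 units per flow
print(is_ffc(non_ffc, 1), is_ffc(ffc, 1))
```

Under these assumptions the 7/3 split is fine without failures but overloads s1-s4 after one direct link fails, while the 8/1 split stays within capacity for every single link failure.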

  14. Trade-off: network efficiency vs. robustness
     • There exists a trade-off between throughput and robustness.
     • Achieving the optimal throughput with FFC guarantee.
     • FFC does not always sacrifice efficiency for robustness.
     [Figure: three allocations on switches s1 to s4. Non-FFC (throughput: 30), shown before and after a link failure; FFC (k=1) (throughput: 18); Non-FFC (throughput: 18), shown before and after a link failure.]

  15. Systematically realizing FFC in TE
     • Formulation: How to merge FFC into the existing TE framework?
     • Computation: How to find FFC-TE efficiently?

  16. Basic TE linear programming formulations
     TE decisions: $b_f$ (sizes of flows), $l_{f,t}$ (traffic on paths)
     TE objective (maximizing throughput): $\max \sum_{\forall f} b_f$
     Basic TE constraints:
     • Deliver all granted flows: $\forall f: \sum_{\forall t} l_{f,t} \ge b_f$
     • No overloaded link: $\forall e: \sum_{\forall f} \sum_{\forall t \ni e} l_{f,t} \le c_e$
     • ……
     FFC constraints: no overloaded link under up to $k_c$ control plane faults, $k_e$ link failures and $k_v$ switch failures.
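The two basic TE constraints can be made concrete with a small feasibility checker. This is a sketch under assumptions: it validates a given allocation rather than solving the LP, and the function name, data layout, and example instance are all hypothetical.

```python
def basic_te_violations(granted, alloc, tunnel_links, capacity, eps=1e-9):
    """List every basic TE constraint that an allocation violates (empty list = feasible)."""
    violations = []
    # Constraint 1: each flow's tunnels must carry at least its granted size.
    for flow, size in granted.items():
        carried = sum(alloc.get(flow, {}).values())
        if carried + eps < size:
            violations.append(f"flow {flow} carries {carried} < granted {size}")
    # Constraint 2: traffic summed over all tunnels crossing a link must fit its capacity.
    load = {link: 0.0 for link in capacity}
    for tunnels in alloc.values():
        for tunnel, rate in tunnels.items():
            for link in tunnel_links[tunnel]:
                load[link] += rate
    for link, used in load.items():
        if used > capacity[link] + eps:
            violations.append(f"link {link} carries {used} > capacity {capacity[link]}")
    return violations


# Hypothetical instance: one flow with a direct tunnel and a two-hop detour.
capacity = {"s1-s4": 10.0, "s2-s1": 10.0, "s2-s4": 10.0}
tunnel_links = {"direct": ["s2-s4"], "detour": ["s2-s1", "s1-s4"]}
granted = {"s2->s4": 12.0}
feasible = {"s2->s4": {"direct": 9.0, "detour": 3.0}}
overloaded = {"s2->s4": {"direct": 12.0, "detour": 0.0}}

print(basic_te_violations(granted, feasible, tunnel_links, capacity))
print(basic_te_violations(granted, overloaded, tunnel_links, capacity))
print(sum(granted.values()))  # the objective value (throughput) for this instance
```

An LP solver would search over the granted sizes and tunnel rates to maximize the last quantity subject to these checks; the FFC constraints add further conditions on top.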

  17. Formulating data-plane FFC
     [Figure: source S and destination D connected by path-1, path-2 and path-3. Paths are link-disjoint.]
     Flow size: $s_f$; bw allocation: $a_1$, $a_2$, $a_3$.
     FFC with k=1:
     • Fault on path-1: $s_f \le a_2 + a_3$
     • Fault on path-2: $s_f \le a_1 + a_3$
     • Fault on path-3: $s_f \le a_1 + a_2$
     Lemma: FFC is achieved when path-$i$'s weight is $a_i / (a_1 + a_2 + a_3)$.
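A short sketch of the condition on this slide, generalized from one fault to k faults on link-disjoint paths. The function names and numbers are illustrative assumptions. The second function shows one reading of the lemma: with proportional weights, the rescaled load on each surviving path stays within its bandwidth allocation whenever the residual allocations cover the flow size.

```python
from itertools import combinations


def tolerates_path_faults(flow_size, alloc, k):
    """True if the allocations on any surviving paths still cover the flow after k path faults."""
    k = min(k, len(alloc))
    # Allocations are non-negative, so checking exactly k faults also covers fewer faults.
    for failed in combinations(range(len(alloc)), k):
        residual = sum(a for i, a in enumerate(alloc) if i not in failed)
        if residual < flow_size:
            return False
    return True


def rescaled_loads(flow_size, alloc, failed):
    """Per-path load after rescaling with weights proportional to the allocations."""
    residual = sum(a for i, a in enumerate(alloc) if i not in failed)
    if residual == 0:
        return {}
    return {i: flow_size * a / residual for i, a in enumerate(alloc) if i not in failed}


alloc = [5.0, 5.0, 5.0]  # hypothetical allocations on path-1, path-2, path-3
print(tolerates_path_faults(10.0, alloc, k=1))
print(rescaled_loads(10.0, alloc, failed={0}))
print(tolerates_path_faults(10.0, [8.0, 1.0, 1.0], k=1))
```

With allocations of 5 on each path, a flow of 10 survives any single path fault and each surviving path carries exactly its allocation. With allocations 8, 1, 1 the same flow does not survive a fault on the first path.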

  18. An efficient and precise solution to FFC
     k-sum linear constraint group (k-sum group): given n paths and $A = \{a_1, a_2, \dots, a_n\}$, FFC requires that the sum of arbitrary n-k elements in $A$ is ≥ flow size. Written out directly, a k-sum group has $O\big(\binom{n}{k}\big)$ constraints (too many).
     Lossless compression of a k-sum group, from $O\big(\binom{n}{k}\big)$ constraints to:
     • O(kn): bubble sorting network (SIGCOMM 2014)
     • O(n): strong duality (MSR TR 2016), http://www.hongqiangliu.com/publications.html
     FFC-TE LP formulation: TE objective, basic TE constraints, and FFC constraints (k-sum group-1, …, k-sum group-N).
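The reason a k-sum group can be compressed is that only the worst case matters: if the n-k smallest elements sum to at least the flow size, every other choice of n-k elements does too. The sketch below checks that equivalence numerically for fixed values. It is not the sorting-network or duality encoding named on the slide, since inside an LP the allocations are variables and cannot be sorted in advance. Function names are assumptions.

```python
import random
from itertools import combinations
from math import comb


def k_sum_ok_bruteforce(values, k, flow_size):
    """Check every subset of n-k elements, one constraint per subset."""
    keep = len(values) - k
    return all(sum(subset) >= flow_size for subset in combinations(values, keep))


def k_sum_ok_sorted(values, k, flow_size):
    """Check only the worst case: the n-k smallest elements (the k largest have failed)."""
    keep = len(values) - k
    return sum(sorted(values)[:keep]) >= flow_size


# Randomized agreement check on small integer instances.
random.seed(0)
for _ in range(200):
    n = random.randint(1, 7)
    k = random.randint(0, n)
    values = [random.randint(0, 10) for _ in range(n)]
    flow_size = random.randint(0, 30)
    assert k_sum_ok_bruteforce(values, k, flow_size) == k_sum_ok_sorted(values, k, flow_size)

print(comb(20, 3))  # 1140 constraints for n=20, k=3 when written out one per subset
```

The count printed at the end shows how quickly the uncompressed group grows, which is what motivates the O(kn) and O(n) encodings.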

  19. FFC extensions
     • Differential protection for different traffic priorities
     • Minimizing congestion risks without rate limiters
     • Control plane faults on rate limiters
     • Uncertainty in current TE
     • Different TE objectives (e.g. max-min fairness)
     • …

  20. Implementation & evaluation highlights
     • Testbed experiment (8 switches & 30 servers)
       • FFC can be implemented in commodity switches
       • FFC has no data loss due to congestion under faults
     • Large-scale simulation
       • A WAN with O(100) switches and O(1000) links
       • One-week traffic trace
       • Fault injection according to real failure trace
     • Results: with negligible throughput loss, FFC can reduce
       • data loss by a factor of 7 to 130 in well-provisioned networks
       • data loss of high priority traffic to almost zero in well-utilized networks

  21. Conclusion and future work
     [Figure: SDN controller, network view, network configuration, and network, with FFC linking network faults to network properties.]
     Network faults: • Data-plane • Control-plane • Misconfigurations • Attacks • Traffic spikes • ……
     Network properties: • High throughput • No congestion • Security • Availability • Connectivity • ……

  22. Q&A
