
Forward Fault Correction (FFC) Hongqiang Harry Liu , Srikanth - PowerPoint PPT Presentation

Traffic Engineering with Forward Fault Correction (FFC). Hongqiang Harry Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University).


  1. Traffic Engineering with Forward Fault Correction (FFC). Hongqiang “Harry” Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University)

  2. Cloud services require large network capacity. Cloud services: growing traffic. Cloud networks: expensive (e.g. cost of WAN: $100M/year).

  3. TE is critical to effectively utilizing networks. Traffic engineering systems for WAN and datacenter networks: • Devoflow • Microsoft SWAN • MicroTE • Google B4 • ……

  4. But TE is also vulnerable to faults. [Figure: the TE controller reads the network view and pushes the network configuration, with frequent updates for high utilization. Control-plane faults and data-plane faults both affect the network.]

  5. Control plane faults: failures or long delays to configure a network device. Causes when the TE controller sends TE configurations to a switch: RPC failure, firmware bugs, overloaded CPU, memory shortage.

  6. Congestion due to control plane faults. [Figure: four-switch topology (s1, s2, s3, s4) with link capacity 10. Flows (traffic demands), including new flows: s1→s2 (10), s1→s3 (10), s1→s4 (10), s2→s4 (10), s3→s4 (10). After a configuration failure, one link is congested.]

  7. Data plane faults: link and switch failures. Rescaling: sending traffic proportionally to residual paths. [Figure: flows s2→s4 (10) and s3→s4 (10) on the four-switch topology with link capacity 10. After a link failure, rescaling causes congestion.]
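
The rescaling rule on slide 7 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the path names, the 7/3 split, and the background traffic are hypothetical values, since the figure's exact numbers are not recoverable from the transcript.

```python
def rescale(split, failed):
    """Redistribute a flow's traffic over its surviving paths,
    proportionally to the share each surviving path had before the failure."""
    total = sum(split.values())
    residual = {p: v for p, v in split.items() if p not in failed}
    residual_sum = sum(residual.values())
    if residual_sum == 0:
        return {}  # no residual path left: the flow is cut off
    return {p: total * v / residual_sum for p, v in residual.items()}


# Hypothetical flow of size 10, split 7/3 over two paths.
split = {"direct": 7.0, "detour": 3.0}
after = rescale(split, failed={"direct"})

# Hypothetical traffic from another flow that already uses the detour link.
background = 7.0
capacity = 10.0
congested = after["detour"] + background > capacity
print(after, "congested:", congested)
```

After the failure the detour path carries the whole flow, so a link that was within capacity before the fault can be overloaded afterwards even though no new traffic entered the network.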

  8. Control and data plane faults in practice. In production networks: • Faults are common. • Faults cause severe congestion. Control plane: fault rate = 0.1% -- 1% per TE update. Data plane: fault rate = 25% per 5 minutes.

  9. State of the art for handling faults. • Heavy over-provisioning: big loss in throughput. • Reactive handling of faults (control plane faults: retry; data plane faults: re-compute TE and update networks): cannot prevent congestion, slow (seconds -- minutes), and blocked by control plane faults.

  10. How about handling congestion proactively?

  11. Forward fault correction (FFC) in TE. • [Bad News] Individual faults are unpredictable. • [Good News] The number of simultaneous faults is small. Analogy: for packet loss, FEC guarantees no information loss under up to k arbitrary packet drops, with careful data encoding. For network faults, FFC guarantees no congestion under up to k arbitrary faults, with careful traffic distribution.

  12. Example: FFC for control plane faults. [Figure: non-FFC configuration on the four-switch topology (s1, s2, s3, s4), link capacity 10. After a configuration failure, one link is congested.]

  13. Example: FFC for control plane faults. [Figure: control plane FFC (k=1) configuration on the four-switch topology (s1, s2, s3, s4), link capacity 10, shown under two different single configuration failures.]

  14. Trade-off: network utilization vs. robustness. [Figure: three configurations of the four-switch topology. Non-FFC: throughput 50. Control plane FFC with K=1: throughput 47. Control plane FFC with K=2: throughput 44.]

  15. Systematically realizing FFC in TE. Formulation: how to merge FFC into the existing TE framework? Computation: how to find FFC-TE efficiently?

  16. Basic TE linear programming formulations. TE decisions: $b_f$ (sizes of flows) and $l_{f,t}$ (traffic on paths). TE objective, maximizing throughput: $\max \sum_{\forall f} b_f$. Basic TE constraints: $\forall f: \sum_{\forall t} l_{f,t} \ge b_f$ (deliver all granted flows); $\forall e: \sum_{\forall f} \sum_{\forall t \ni e} l_{f,t} \le c_e$ (no overloaded link); … FFC constraints: no overloaded link with up to $k_c$ control plane faults, $k_e$ link failures, and $k_v$ switch failures.
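
To make the two basic TE constraints on slide 16 concrete, the sketch below checks a candidate solution against them instead of solving the LP. The function name, the dictionary layout, and the small example topology are assumptions made for illustration only.

```python
def check_basic_te(granted, traffic, tunnel_links, capacity, eps=1e-9):
    """Return (throughput, violations) for a candidate TE solution.

    granted[f]       -- size granted to flow f
    traffic[(f, t)]  -- traffic of flow f placed on tunnel (path) t
    tunnel_links[t]  -- set of links that tunnel t traverses
    capacity[e]      -- capacity of link e
    """
    violations = []
    # Constraint 1: each flow's tunnels together carry at least its granted size.
    for f, size in granted.items():
        delivered = sum(v for (g, _), v in traffic.items() if g == f)
        if delivered + eps < size:
            violations.append(("flow", f))
    # Constraint 2: the load summed over all tunnels crossing a link fits its capacity.
    for e, cap in capacity.items():
        load = sum(v for (_, t), v in traffic.items() if e in tunnel_links[t])
        if load > cap + eps:
            violations.append(("link", e))
    return sum(granted.values()), violations


# Hypothetical example: two flows, three tunnels, three links of capacity 10.
tunnel_links = {"t1": {"s1-s4"}, "t2": {"s1-s2", "s2-s4"}, "t3": {"s2-s4"}}
capacity = {"s1-s4": 10, "s1-s2": 10, "s2-s4": 10}
granted = {"f1": 10, "f2": 7}
traffic = {("f1", "t1"): 7, ("f1", "t2"): 3, ("f2", "t3"): 7}
print(check_basic_te(granted, traffic, tunnel_links, capacity))
```

An LP solver would search over the granted sizes and per-tunnel traffic to maximize the first returned value while keeping the violation list empty.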

  17. Formulating control plane FFC. [Figure: flows $f_1$, $f_2$, $f_3$ share a link between s1 and s2.] Total load under faults? A flow with a fault keeps its load in the old TE; the others use their load in the new TE. Fault on $f_1$: $l_1^{old} + l_2^{new} + l_3^{new} \le$ link cap. Fault on $f_2$: $l_1^{new} + l_2^{old} + l_3^{new} \le$ link cap. Fault on $f_3$: $l_1^{new} + l_2^{new} + l_3^{old} \le$ link cap. Challenge: too many constraints. With n flows and FFC protection k: #constraints = $\binom{n}{1} + \dots + \binom{n}{k}$ for each link.
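
The enumeration described on slide 17 can be sketched directly: every set of up to k faulty flows yields one capacity constraint on the link, and the number of such sets grows combinatorially. The flow names and load values below are hypothetical.

```python
from itertools import combinations
from math import comb


def num_constraints(n, k):
    """Number of fault sets of size 1..k among n flows (one constraint each, per link)."""
    return sum(comb(n, j) for j in range(1, k + 1))


def worst_case_load(old, new, k):
    """Largest link load over all cases where up to k flows keep their old load
    (configuration failed) while the remaining flows use their new load."""
    flows = list(new)
    worst = sum(new.values())  # the no-fault case
    for j in range(1, k + 1):
        for faulty in combinations(flows, j):
            load = sum(old[f] if f in faulty else new[f] for f in flows)
            worst = max(worst, load)
    return worst


# Hypothetical per-flow loads on one link of capacity 10.
old = {"f1": 7, "f2": 0, "f3": 3}
new = {"f1": 3, "f2": 4, "f3": 3}
print(num_constraints(3, 1), worst_case_load(old, new, k=1))
```

In this example the new configuration fits the link exactly when every update succeeds, but a single failed update on f1 pushes the load to 14, so the configuration is not robust for k=1.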

  18. An efficient and precise solution to FFC. Our approach: a lossless compression from O($n^k$) constraints to O(kn) constraints. The requirement "total load under faults ≤ link capacity" is restated as "total additional load due to faults ≤ link spare capacity". $x_i$: additional load due to fault $i$. Given $X = \{x_1, x_2, \dots, x_n\}$, FFC requires that the sum of arbitrary k elements in $X$ is ≤ link spare capacity, which takes O($n^k$) constraints. Define $y_m$ as the $m$th largest element in $X$: then $\sum_{m=1}^{k} y_m \le$ link spare capacity is a single constraint, O(1). Expressing $y_m$ with $X$?
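
The compression on slide 18 rests on one observation: every choice of k elements stays within the spare capacity exactly when the k largest elements do. The sketch below checks that equivalence numerically on hypothetical values; it illustrates the observation and is not the LP encoding itself.

```python
from heapq import nlargest
from itertools import combinations


def ok_by_enumeration(extra_load, k, spare):
    """Check every k-element subset separately (one constraint per subset)."""
    return all(sum(subset) <= spare for subset in combinations(extra_load, k))


def ok_by_top_k(extra_load, k, spare):
    """Check only the sum of the k largest elements (a single constraint)."""
    return sum(nlargest(k, extra_load)) <= spare


# Hypothetical additional loads caused by four individual faults.
extra_load = [4, 1, 3, 2]
for spare in (6, 7):
    print(spare, ok_by_enumeration(extra_load, 2, spare), ok_by_top_k(extra_load, 2, spare))
```

Both checks agree for any input, because the subset with the largest sum is always the one made of the k largest elements.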

  19. Sorting network. [Figure: a sorting network with inputs $x_1, \dots, x_4$, intermediate values $z_1, \dots, z_8$, and outputs $y_1$ (1st largest) and $y_2$ (2nd largest), computed in a 1st round and a 2nd round.] A comparison takes $x_1$ and $x_2$ and outputs $z_1 = \max\{x_1, x_2\}$ and $z_2 = \min\{x_1, x_2\}$. Resulting FFC constraint: $y_1 + y_2 \le$ link spare capacity. • Complexity: O(kn) additional variables and constraints. • Throughput: optimal in control-plane and data plane if paths are disjoint.
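
A partial sorting network of the kind slide 19 describes can be sketched with max/min comparators arranged in k rounds. The bubble-style wiring below is an assumption, since the exact wiring in the slide's figure is not recoverable from the transcript, and the sketch evaluates the comparators numerically rather than encoding them as LP constraints.

```python
def top_k_by_comparators(values, k):
    """Extract the k largest values using only two-input max/min comparators.

    Each round bubbles the largest remaining value to the output, so the
    total number of comparators is at most k * n.
    """
    remaining = list(values)
    top = []
    comparisons = 0
    for _ in range(min(k, len(remaining))):
        carry = remaining[0]
        rest = []
        for value in remaining[1:]:
            # One comparator: two inputs, a max output and a min output.
            hi, lo = max(carry, value), min(carry, value)
            comparisons += 1
            carry = hi
            rest.append(lo)
        top.append(carry)      # largest value of this round
        remaining = rest       # min outputs feed the next round
    return top, comparisons


# Hypothetical additional loads for four faults, protection level k = 2.
top, comparisons = top_k_by_comparators([3, 1, 4, 2], k=2)
print(top, comparisons, "sum of top-2:", sum(top))
```

With n inputs and k rounds the comparator count is (n-1) + (n-2) + … + (n-k), which stays within the O(kn) bound stated on the slide.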

  20. FFC extensions • Differential protection for different traffic priorities • Minimizing congestion risks without rate limiters • Control plane faults on rate limiters • Uncertainty in current TE • Different TE objectives (e.g. max-min fairness) • …

  21. Evaluation overview. • Testbed experiment: FFC can be implemented in commodity switches; FFC has no data loss due to congestion under faults. • Large-scale simulation: a WAN network with O(100) switches and O(1000) links; faults injected based on real failure reports. Two scenarios: single priority traffic in a well-provisioned network, and multiple priority traffic in a well-utilized network.

  22. FFC prevents congestion with negligible throughput loss. [Figure: bar chart of Ratio (%) for two metrics, FFC data-loss / non-FFC data-loss and FFC throughput / optimal throughput. Series: single priority; high priority (high FFC protection); medium priority (low FFC protection); low priority (no FFC protection). Labeled values: 160% and <0.01%.]

  23. Conclusions • Centralized TE is critical to high network utilization but is vulnerable to control and data plane faults. • FFC proactively handles these faults. • Guarantee: no congestion when #faults ≤ k. • Efficiently computable with low throughput overhead in practice. [Figure: FFC placed between heavy network over-provisioning and high risk of congestion.]
