Lancet: Better Network Resilience by Designing for Pruned Failure Sets

Lancet: Better Network Resilience by Designing for Pruned Failure Sets
Yiyang Chang*✝, Chuan Jiang*, Ashish Chandra*, Sanjay Rao*, Mohit Tawarmalani*
*Purdue University, ✝Bytedance
This work was done when Yiyang Chang was at Purdue University
ACM Sigmetrics 2020

Challenges in Network Design
• Failures are important in designing wide-area networks
  • Inevitable [1, 2] and costly
• Network users desire high service level objectives (SLOs)
  • 99.99% or even 99.999%
[1] Gill et al., Understanding network failures in data centers: Measurement, analysis, and implications. Sigcomm 2011.
[2] Potharaju and Jain, When the network crumbles: An empirical study of cloud network failures and their impact on services. SoCC 2013.

State-of-the-art in Network Design
• Key problem: how to design networks for such stringent requirements?
• State-of-the-art: Design for worst-case failure
  • Robust to all possible combinations of f or fewer failures
  • A weak point: if a single f-failure scenario cannot be tackled, the design is forced to cover f-1 failures only
  • Examples: R3 (Wang et al., Sigcomm 2010), FFC (Liu et al., Sigcomm 2014)

Lancet - Beyond Worst-case
• Designing for worst-case may be conservative
• Can we design for most f-failure scenarios when designing for all is not possible?
[Figure: the same network (nodes s, a, b, c, t) drawn twice. Each link has 1 unit of capacity; demand is 2 units from s to t. One panel shows a good 2-failure scenario, the other a bad 2-failure scenario.]

Lancet - Contributions
• New approach to designing protection routing
  • For most failure scenarios, when designing for all is not possible
• Key components
  • Novel divide-and-conquer algorithm that efficiently identifies failure scenarios which a network can intrinsically handle
    • Provides a compact representation of these scenarios
  • Linear program (LP) approach to designing protection routing that exploits this compact representation
    • Cuts design time from > 18 hours to 10 seconds for a real-world topology
• Validations on real-world network topologies show Lancet's promise

Determine Which Scenarios to Design for
• How to determine which scenarios to design for?
  • Observation: No routing scheme can perform better than an ideal scheme, which routes using multi-commodity flow
  • Exclude all scenarios that are bad even under the ideal scheme
  • Design for the rest of the failure scenarios
• How to find which scenarios can be handled by the ideal scheme?
  • A divide-and-conquer algorithm to classify which failures can and cannot be handled
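As a rough illustration of the "ideal scheme" test, the sketch below brute-forces every 2-failure scenario of the small example network from the earlier slide. The topology (three two-hop paths s-a-t, s-b-t, s-c-t) is an assumption read off the figure residue, and the names `LINKS`, `max_flow`, and `is_good` are hypothetical. With a single demand, multi-commodity flow reduces to ordinary max flow, so a plain Edmonds-Karp routine stands in for the ideal scheme here.

```python
from collections import deque
from itertools import combinations

# Assumed topology: s reaches t through a, b, or c; every link has capacity 1.
LINKS = {("s", "a"): 1, ("a", "t"): 1,
         ("s", "b"): 1, ("b", "t"): 1,
         ("s", "c"): 1, ("c", "t"): 1}
DEMAND = 2  # units from s to t, as stated on the slide


def max_flow(edges, s, t):
    """Edmonds-Karp max flow over directed edges given as {(u, v): capacity}."""
    residual = {s: {}, t: {}}
    for (u, v), cap in edges.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + cap
        residual[v].setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def is_good(failed):
    """A scenario is good if the surviving links can still carry the demand."""
    remaining = {e: c for e, c in LINKS.items() if e not in failed}
    return max_flow(remaining, "s", "t") >= DEMAND


scenarios = list(combinations(LINKS, 2))
good = [sc for sc in scenarios if is_good(sc)]
print(f"{len(good)} of {len(scenarios)} two-failure scenarios are good")
```

Under these assumptions only the scenarios where both failures hit the same path remain good. Enumeration like this does not scale, which is the motivation for the classification algorithm on the following slides.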

Lancet Classification Algorithm
[Flowchart: the classification algorithm takes a set of failure scenarios as input.]
• Do all certify? Yes: the subset is acceptable/good
• Do all certify? No. Do all violate? Yes: the subset is violating/bad
• Do all certify? No. Do all violate? No: needs further partitioning
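The flowchart above can be sketched as a short loop. This is a toy under stated assumptions, not the authors' implementation: the two checks are done here by brute-force enumeration, whereas the slides describe a conservative bound and a feasibility LP, and the names `scenarios_in`, `first_free_link`, `classify`, and the parallel-link example are all hypothetical. A subset is described by a dictionary that fixes some links to 1 (fails) or 0 (does not fail) and leaves the rest free.

```python
from itertools import combinations


def scenarios_in(fixed, links, f):
    """Yield every scenario (set of failed links) with at most f failures
    that is consistent with the fixed link states."""
    must_fail = {l for l, state in fixed.items() if state == 1}
    free = [l for l in links if l not in fixed]
    for k in range(0, f - len(must_fail) + 1):
        for extra in combinations(free, k):
            yield frozenset(must_fail | set(extra))


def first_free_link(fixed, links):
    """Placeholder partitioning rule: split on the first link not yet fixed."""
    return next(l for l in links if l not in fixed)


def classify(links, f, is_good, pick_link=first_free_link):
    """Split all scenarios with at most f failures into good and bad subsets."""
    good_sets, bad_sets = [], []
    stack = [{}]  # start with nothing fixed, i.e. the full scenario set
    while stack:
        fixed = stack.pop()
        outcomes = [is_good(sc) for sc in scenarios_in(fixed, links, f)]
        if not outcomes:
            continue                       # empty subset, nothing to decide
        if all(outcomes):
            good_sets.append(fixed)        # "Do all certify? Yes" -> prune
        elif not any(outcomes):
            bad_sets.append(fixed)         # "Do all violate? Yes" -> prune
        else:
            link = pick_link(fixed, links)  # mixed -> partition on one link
            stack.append({**fixed, link: 0})
            stack.append({**fixed, link: 1})
    return good_sets, bad_sets


# Toy example: three parallel unit links, demand of 1 unit, up to 3 failures.
LINKS = [0, 1, 2]
F = 3
good_sets, bad_sets = classify(LINKS, F, lambda failed: len(failed) < len(LINKS))
covered = sum(len(list(scenarios_in(fx, LINKS, F))) for fx in good_sets)
print(len(good_sets), "good sets cover", covered, "scenarios")
```

In this toy, three good sets cover the seven good scenarios and one bad set holds the single scenario where every link fails, which hints at why a set-based description can be smaller than a scenario list.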

Classification Algorithm in Operation
[Animation: the algorithm first takes the scenario sets for f = 0, f = 1, and f = 2. In each case "Do all certify?" answers Yes and the set is pruned. For f = 3, not all scenarios certify and not all violate, so the set is partitioned on a link variable x with branches 0 and 1. Each resulting subset is tested again: subsets that all certify or all violate are pruned, and mixed subsets are partitioned further, until every subset is decided and the algorithm reports "Done."]
Key procedures
• DoAllCertify()
• DoAllViolate()
• Partitioning strategy

Keys for Tractable Classification
• DoAllCertify(A)
  • We show it is NP-complete
  • Instead, get a conservative bound
  • Doesn't affect correctness
• DoAllViolate(A)
  • Simple feasibility LP to test if there is a good failure scenario
• Partitioning strategy
  • Heuristic to choose a link l that fails in many bad scenarios
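One way to read the partitioning heuristic is as a frequency count. The sketch below assumes that a sample of known bad scenarios is already available; the slides do not say how Lancet obtains or counts them, so the function `pick_partition_link` and its inputs are hypothetical.

```python
from collections import Counter


def pick_partition_link(bad_scenarios, fixed):
    """Return the not-yet-fixed link that fails in the most bad scenarios,
    or None if no such link exists. Ties are broken by link name."""
    counts = Counter(link
                     for scenario in bad_scenarios
                     for link in scenario
                     if link not in fixed)
    if not counts:
        return None
    return max(sorted(counts), key=counts.__getitem__)


# Hypothetical sample of bad scenarios, each a set of failed links.
bad = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}]
print(pick_partition_link(bad, fixed={}))          # link "a" appears in all three
print(pick_partition_link(bad, fixed={"a": 1}))    # "a" is already fixed
```

Splitting on such a link tends to separate a mostly bad subset from a mostly good one, so each side is more likely to be decided without further partitioning.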

Compact Representation of Failure Sets
• Two ways to represent failure scenarios
  • A1, A2, and A3 as 3 sets
  • 161k+ separate failure scenarios
• The classification algorithm naturally generates the first representation
• Next we will see why the first representation is better
[Figure: decision tree branching on x1, x2, and x3 (branches 0 and 1) with leaves A1, A2, A3, and Y. Caption: sets of 3-failure scenarios of a 100-link network.]
• A1, A2, and A3 certify
• Y is undecided
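The "161k+" figure is consistent with counting the ways to choose 3 failed links out of 100. The short check below also shows that fixing link states splits that total into a handful of subsets; the particular split is a hypothetical example and is not claimed to match A1, A2, A3, and Y in the figure.

```python
from math import comb


def count_scenarios(n_links, f, fixed_failed=0, fixed_ok=0):
    """Number of scenarios with exactly f failed links once some links are
    fixed as failed and some as not failed."""
    return comb(n_links - fixed_failed - fixed_ok, f - fixed_failed)


total = count_scenarios(100, 3)
print(total)  # 161700

# Hypothetical split on three links x1, x2, x3.
parts = [
    count_scenarios(100, 3, fixed_failed=0, fixed_ok=1),  # x1 = 0
    count_scenarios(100, 3, fixed_failed=1, fixed_ok=1),  # x1 = 1, x2 = 0
    count_scenarios(100, 3, fixed_failed=2, fixed_ok=1),  # x1 = x2 = 1, x3 = 0
    count_scenarios(100, 3, fixed_failed=3, fixed_ok=0),  # x1 = x2 = x3 = 1
]
print(parts, sum(parts))
```

Four set descriptions account for all 161,700 scenarios in this split, which is the kind of compression the slide contrasts with listing each scenario separately.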

Protection Routing Design
• Link-based protection routing
  • Provisions bypass paths to protect against each failure scenario
  • Achieved using (H), generalizing a state-of-the-art scheme [1]
[Figure: nodes s, i, j, m, t. r: normal traffic (no failures). p: extra traffic when <i, j> fails.]
• Issues with existing protection routing schemes
  • Only work if X is all f failures
  • If worst-case U > 1, we are forced to design for f - 1 failures
• What we want: Design for most f failures if not all
[1] Wang et al., R3: Resilient routing reconfiguration, Sigcomm 2010

Protection Routing Design with Excluded Scenarios
• Two ways to implement the capacity constraints (circled in red)
  1. Enumerate constraints, one for each failure scenario x
  2. Impose the constraint for a union of failure sets, each one represented using LP duality
• Our approach (2nd above) is more compact
  • Since the number of sets can be exponentially smaller than the number of failure scenarios
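To make the duality step concrete, here is a sketch for the simplest case, the set of all scenarios with at most f failures, in the style of R3. The symbols are assumptions introduced for illustration and do not come from the slides: c_e is the capacity of link e, r_e the normal-traffic load on e, p_{l,e} the extra load placed on e when link l fails, and x_l the failure indicator of link l.

```latex
% One capacity constraint per link e, covering every scenario in X_f at once:
r_e + \max_{x \in X_f} \sum_{l} p_{l,e}\, x_l \;\le\; c_e,
\qquad
X_f = \Big\{ x \in [0,1]^{L} : \sum_{l} x_l \le f \Big\}.

% The inner maximization is an LP, so by LP duality the constraint holds
% if and only if multipliers \lambda_e and \sigma_{l,e} exist with:
\lambda_e \ge 0, \quad \sigma_{l,e} \ge 0, \quad
\lambda_e + \sigma_{l,e} \ge p_{l,e} \;\; \forall l,
\qquad
r_e + f\,\lambda_e + \sum_{l} \sigma_{l,e} \;\le\; c_e.
```

The dual form needs a number of variables and constraints that grows with the number of links rather than with the number of scenarios. A failure set that fixes some x_l to 0 or 1 only changes the polytope being dualized, so a union of a few such sets plausibly keeps the LP small, in line with the compactness claim on the slide.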

Summarizing Design with Lancet
• Step 1: Reformulate (H) to an LP to handle arbitrary sets of failure scenarios
• Step 2: Determine which failure scenarios (represented in failure sets) to include with the classification algorithm
• Step 3: Leveraging the LP in Step 1, design a protection routing scheme for failure sets discovered in Step 2

Evaluations
• Real topologies

  Network    # of Nodes   # of Edges   # of sub-links
  Abilene    11           14           2
  GEANT      32           50           2
  Deltacom   103          151          2
  ION        114          135          2

• Partial failure model
  • Each link comprises 2 sub-links
• Synthetic traffic matrix: Gravity model [1]
• Environment: single-threaded on a 3.00GHz Intel Xeon CPU
• Implemented in Python and Gurobi 8.0
[1] Yin Zhang et al., Network anomography. IMC 2005.
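The slides only name the gravity model, so the sketch below uses its common simple form, in which the demand from i to j is proportional to the traffic leaving i times the traffic entering j. The node names and totals are made up, and `gravity_matrix` is a hypothetical helper, not part of the authors' code.

```python
def gravity_matrix(out_totals, in_totals):
    """Simple gravity model: T[i, j] = out_i * in_j / total traffic."""
    total = sum(in_totals.values())
    return {(i, j): out_totals[i] * in_totals[j] / total
            for i in out_totals
            for j in in_totals}


# Made-up per-node totals; outgoing and incoming traffic both sum to 60.
out_totals = {"A": 30.0, "B": 20.0, "C": 10.0}
in_totals = {"A": 10.0, "B": 20.0, "C": 30.0}

T = gravity_matrix(out_totals, in_totals)
row_sum_A = sum(T["A", j] for j in in_totals)
print(T["A", "B"], row_sum_A)
```

When the outgoing and incoming totals agree, each row of the matrix sums back to that node's outgoing total, which is a quick sanity check on a synthetic matrix.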

Design with Lancet
[Figure: bar chart of the % of certified 2-failure scenarios on GEANT for Ideal, Lancet, Gen-R3(1), and Gen-R3(2). Ideal and Lancet both read 99.8; the two Gen-R3 bars read 86.7 and 11.8.]
• The ideal scheme handles 99.8% of the 2-failure scenarios for GEANT
• Gen-R3 (f): the protection routing design obtained by optimizing worst-case f-failure scenarios
• Takeaway: Large performance gaps exist between Gen-R3 schemes and the ideal scheme
• Lancet: protection routing designed with Lancet by excluding bad failure scenarios
• Takeaway: Lancet bridges the performance gap, reaching optimal for GEANT

Design with Lancet on Larger Networks
[Figure: bar chart of the % of certified scenarios for Ideal, Lancet, and Gen-R3 (best) on GEANT f = 2, ION f = 2, Deltacom f = 2, and Deltacom f = 3.]
• Gen-R3 (best): the Gen-R3 (f) that gives the best result
• Takeaway: Lancet achieves much better performance than Gen-R3 (best) and is close to optimal

Compactness of Failure Set Representation

  Topology (# of failures)   # of sets   # of scenarios
  GEANT (2)                  3           1,272
  ION (2)                    5           9,172
  Deltacom (2)               6           11,465
  Deltacom (3)               3           466,486

• Lancet represents a large number of failure scenarios in a small number of failure sets
• Enables tractable designs of protection routing

Design Time with Lancet and Enumeration
• For a moderate-sized network GEANT
  • Lancet reduces design time from > 18 hours to 10 seconds
• Makes it possible to handle large topologies in less than 2 hours
[Figure: bar chart of design time (s) for Enumeration (GEANT f=2) and Lancet (GEANT f=2, ION f=2, Deltacom f=2, Deltacom f=3). Enumeration on GEANT exceeds 18 hours; Lancet on GEANT takes 10 s; the remaining Lancet bars read 7,041, 3,724, and 2,749 s.]

Extensions and Other Results
• Generalizations and extensions
  • Richer failure models
    • E.g., shared-risk link group (SRLG)
  • Design to meet probability requirements
  • Multiple traffic demands
• Other results
  • Design with multiple traffic classes
  • Validations on SDN testbed
