Survivable Network Design
Dr. János Tapolcai
tapolcai@tmit.bme.hu

The final goal
We prefer not to see service outages in telecommunication networks.
[Figure: telecommunication network hierarchy – High Speed Backbone, Metro, Business and Mobile access; service providers: PSTN, Internet, Video. Source: http://www.icn.co]
IP (Internet Protocol) – addressing, routing
ATM (Asynchronous Transfer Mode) – traffic engineering
SDH/SONET (Synchronous Digital Hierarchy) – transport and protection
WDM (Wavelength Division Multiplexing) – high bandwidth
[Figure: evolution of the protocol stack]
– 1999: IP / ATM / SONET / Optics (Layers 3, 2, 1)
– 2003: IP / MPLS / Thin SONET / Optics
– 201x: Packet (IP/Ethernet) over Smart Optics, with GMPLS packet–optical interworking (Layers 2/3 over 0/1)
Recovery times: BGP-4: 15–30 minutes; OSPF: 10 seconds to minutes; SONET: 50 milliseconds
– Hop-by-hop routing: packets are forwarded based on forwarding tables
– Shortest-path routing
– IS-IS (Intermediate System to Intermediate System)
– From a technical point of view, not very popular
– Centralized control
– Exact knowledge of the physical topology
– Source and destination node pairs, bandwidth
[Figure: optical network with wavelength crossconnects (nodes A–E), lightpaths, and IP routers]
– Typical failures: wear-out
– Cooling fans, hard disks, power supplies
– Natural phenomena mostly influence and damage these devices (e.g. high humidity, high temperature, earthquakes)
– Compiler detects most of these failures
– Misconfiguration
– Misconfigured addresses or prefixes, interface identifiers, link metrics, timers and queues (DiffServ)
– Policers, classifiers, markers, shapers
– Blocking legacy traffic
– Other operational faults:
… errors:
– Weak processors in routers
– High BER (bit error rate) on long cables
– Topology is not meshed enough (not enough redundancy for protection path selection)
– Incompatibility between different vendors and versions
– Incompatibility between service providers or ASs (Autonomous Systems)
Updates and patches Misconfiguration Device upgrade Maintenance Data mirroring or recovery Monitoring and testing Teach users Other
– Physical devices
– Against nodes
– DoS (denial-of-service) attacks (e.g. on the Internet)
– In 1996, computers could be frozen by receiving oversized packets (the "Ping of Death")
– Short term
– Long term
– Road construction (‘Universal Cable Locator’) – Rodent bites
– New skyscraper (e.g. CN Tower) – Clouds, fog, smog, etc. – Birds, planes
– Electromagnetic noise (e.g. solar flares)
– Air-conditioner fault
– Fires, floods, terrorist attacks, lightning, earthquakes, etc.
Maintenance, Power Outage, Fiber Cut/Circuit/Carrier Problem, Hardware Problem, Routing Problems, Interface Down, Congestion/Sluggish, Malicious Attack, Software Problem
Failure causes by type: Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%

Cause                               Type            #     [%]
Maintenance                         Operator        272   16.2
Power Outage                        Environmental   273   16.0
Fiber Cut/Circuit/Carrier Problem   Environmental   261   15.3
Unreachable                         Operator        215   12.6
Hardware Problem                    Hardware        154    9.0
Interface Down                      Hardware        105    6.2
Routing Problems                    Operator        104    6.1
Miscellaneous                       Unknown          86    5.9
Unknown/Undetermined/No problem     Unknown          32    5.6
Congestion/Sluggish                 User             65    4.6
Malicious Attack                    Malice           26    1.5
Software Problem                    Software         23    1.3
Definition, Techniques, and Case Studies”, UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002,
– Simple solutions needed – sometimes up to 90% of all failures
– Running at night – sometimes up to 20% of all failures
– It will get worse in the future – 10 million lines of source code
– Anything that makes a point-to-point connection fail (not only cable cuts)
Failure – the termination of the ability of a network element to perform a required function; hence, a network failure happens at one particular moment t_f
Reliability – continuous operation of a system or service; the probability that the system is adequately operational (i.e. failure-free) for the intended period of time [0, t], in the presence of network failures
– Defined as 1 − F(t), where F(t) is the cumulative distribution function (cdf) of the time to failure
– Simple model: exponentially distributed variables
– R(t) is non-increasing
– R(0) = 1, lim_{t→∞} R(t) = 0

    R(t) = 1 − F(t) = 1 − (1 − e^(−λt)) = e^(−λt)

[Plot: R(t) starting at 1 and decreasing; R(a) marked at t = a]
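As a quick sanity check of the exponential reliability model above, a minimal Python sketch (the failure rate λ is an assumed example value, one failure per 10,000 hours):

```python
import math

def reliability(t, lam):
    """R(t) = 1 - F(t) = e^(-lambda * t) for exponentially distributed failure times."""
    return math.exp(-lam * t)

lam = 1e-4  # assumed example failure rate [1/h]
print(reliability(0, lam))      # R(0) = 1
print(reliability(10000, lam))  # R(MTTF) = e^-1, about 0.368
```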
[Figure: the device alternates between UP and DOWN states over time t; a Failure transition moves it to DOWN, a repair moves it back to UP. In the DOWN state the network element is failed and a repair action is in progress.]
– Availability, A(t): the probability that the device is in the UP (operational) state at time t
– Unavailability, U(t): the probability that the device is in the faulty state at some time t in the future
– MTTR – Mean Time To Repair
– MTTF – Mean Time To Failure
– MTBF – Mean Time Between Failures (MTBF = MTTF + MTTR)
– MUT – Mean Up Time
– MDT – Mean Down Time
– MCT – Mean Cycle Time (MCT = MUT + MDT)
Availability   Nines                         Outage/year   Outage/month   Outage/week
90%            1 nine                        36.52 day     73.04 hour     16.80 hour
95%                                          18.26 day     36.52 hour      8.40 hour
98%                                           7.30 day     14.60 hour      3.36 hour
99%            2 nines (maintained)           3.65 day      7.30 hour      1.68 hour
99.5%                                         1.83 day      3.65 hour     50.40 min
99.8%                                        17.53 hour    87.66 min      20.16 min
99.9%          3 nines (well maintained)      8.77 hour    43.83 min      10.08 min
99.95%                                        4.38 hour    21.91 min       5.04 min
99.99%         4 nines                       52.59 min      4.38 min       1.01 min
99.999%        5 nines (failure protected)    5.26 min     25.9 sec        6.05 sec
99.9999%       6 nines (high reliability)    31.56 sec      2.62 sec       0.61 sec
99.99999%      7 nines                        3.16 sec      0.26 sec       0.06 sec
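The table rows above follow directly from the unavailability; a short Python sketch reproducing a few entries (assuming an average year of 365.25 days):

```python
MIN_PER_YEAR = 365.25 * 24 * 60  # minutes in an average year

def outage_per_year_minutes(availability):
    """Expected downtime per year for a given steady-state availability."""
    return (1.0 - availability) * MIN_PER_YEAR

print(round(outage_per_year_minutes(0.9999), 1))   # 4 nines: about 52.6 min/year
print(round(outage_per_year_minutes(0.99999), 2))  # 5 nines: about 5.26 min/year
```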
Failure process:
– independent and identically distributed (iid) variables following an exponential distribution
– sometimes a Weibull distribution is used (harder to handle)
– λ > 0 failure rate (time-independent!)
Repair process:
– iid exponential variables; sometimes a Weibull distribution is used (harder to handle)
– µ > 0 repair rate (time-independent!)
If both are exponentially distributed, we have a simple model:
– Continuous-Time Markov Chain

    F(t) = 1 − e^(−λt)
[Figure: two-state Markov chain; states UP and DN, transition probabilities λ (UP→DN) and µ (DN→UP), self-loops 1−λ and 1−µ]
– Transition matrix P (stochastic matrix)
– The transition matrix after k steps: P^k
– The stationary distribution is a row vector π with πP = π
– π exists (and in this case it is unique)
– Mean of an exponentially distributed variable with rate λ: 1/λ
Transition matrix:

    P = ( 1−λ    λ  )
        (  µ    1−µ )

Stationary distribution Π = (π_UP, π_DN) = (A, U):

    (A  U) · P = (A  U),   U = 1 − A
    ⇒ A·(1−λ) + U·µ = A  ⇒  U = λ/(λ+µ),  A = µ/(λ+µ)
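The closed-form stationary distribution (A, U) = (µ/(λ+µ), λ/(λ+µ)) can be verified numerically by iterating the two-state transition matrix; the per-step probabilities below are assumed example values:

```python
# Assumed example per-step failure and repair probabilities
lam, mu = 0.01, 0.2

# Two-state transition matrix P for states (UP, DN)
P = [[1 - lam, lam],
     [mu, 1 - mu]]

pi = [1.0, 0.0]  # start in the UP state
for _ in range(10000):
    # one step of pi <- pi * P
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]

A, U = mu / (lam + mu), lam / (lam + mu)
print(pi)  # converges to [A, U]
```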
Transient availability:

    A(t) = µ/(λ+µ) + λ/(λ+µ) · e^(−(λ+µ)t)

– Steady state: A_ss = lim_{t→∞} A(t) = µ/(λ+µ)
– Without repair (µ = 0): A(t) = e^(−λt) = R(t)

[Plot: A(t) decaying from 1 towards A_ss]
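A short Python sketch of the transient availability formula (the rates are assumed example values, roughly MTTF = 10^5 h and MTTR = 4 h):

```python
import math

def availability(t, lam, mu):
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) * t), starting in the UP state."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

lam, mu = 1e-5, 0.25  # assumed example failure and repair rates [1/h]
print(availability(0.0, lam, mu))     # A(0) = 1
print(availability(1e6, lam, mu))     # approaches the steady state mu/(lam+mu)
print(availability(100.0, lam, 0.0))  # no repair: equals R(100) = e^(-lam*100)
```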
(… Prediction of Electronic Equipment)
– Match curves to the observed failure data to get the predicted failure rate λ_p

    R(t) = e^(−λ_p · t)
– On-spot measured data – data tested in laboratory
– Since then called the Telcordia standard (1998) – France Telecom (CNET93) and British Telecom (HRD5) improved the method
IP router (simplified model, example configuration):
– HW common parts, SW library
– 1 × 4-port OC3/STM1 POS line card
– 2 × 1-port Gigabit Ethernet module
– 4 × 1-port OC48/STM16 POS line card
– 8 slots available (one slot not used)
– housing, conditioning

Component reliability data:
– IP router, interface card: MTBF[h] = 8.5·10^4, MTTR[h] = 4
– IP router, SW: MTBF[h] = 3·10^4; MTTR[h] = 0.0004 (SW restart), 0.02 (SW reload), 0.25 (no automatic restart)
– IP router, route processor: MTBF[h] = 2·10^5, MTTR[h] = 4
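From MTBF and MTTR the steady-state availability of each component follows as A = MTBF / (MTBF + MTTR); a minimal sketch using the router figures quoted above:

```python
def avail(mtbf_h, mttr_h):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_h / (mtbf_h + mttr_h)

print(avail(8.5e4, 4))      # interface card
print(avail(2e5, 4))        # route processor
print(avail(3e4, 0.0004))   # SW with automatic restart
```

Note how strongly the short software-restart MTTR pushes the SW availability towards 1 despite its low MTBF.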
[Figure: SDH DXC/ADM – trunk transponder, tributary transponder, control unit]
– SDH DXC/ADM: MTBF[h] = 1·10^6, MTTR[h] = 4
– A DXC has more ports than an IP router
Abbreviations: SDH – Synchronous Digital Hierarchy; SONET – Synchronous Optical NETworking; DXC – digital cross-connect; ADM – add-drop multiplexer; OEO – optical-electrical-optical conversion
[Figure: WDM line system – OXC, transponders, amplifiers, cable/fibre]
– Transponder: MTBF[h] = 400·10^3, MTTR[h] = 6
– WDM line system: MTBF[h] = 250·10^3, MTTR[h] = 6
– Amplifier: MTBF[h] = 160·10^3, MTTR[h] = 6
– WDM OXC (OEO) or OADM: MTBF[h] = 1·10^5, MTTR[h] = 6
– Aerial cable: MTBF[km] = 1.75·10^5, MTTR[h] = 6
– Buried cable: MTBF[km] = 2.6·10^5, MTTR[h] = 12
– Submarine cable: MTBF[km] = 4.64·10^6, MTTR[h] = 540
Abbreviations: WDM – wavelength division multiplexing; OXC – optical cross-connect; OADM – optical add-drop multiplexer
[Figure: connection s–d traversing OXC – transponder – WDM line system (amplifiers) – cable – transponder – OXC]
– Transponder: MTBF = 4·10^5, MTTR = 6
– WDM line system: MTBF = 2.5·10^5, MTTR = 6
– Amplifier: MTBF = 1.6·10^5, MTTR = 6
– WDM OXC: MTBF = 1·10^5, MTTR = 6
– Ground cable (200 km): MTBF[km] = 2.63·10^5, MTTR = 12

Series rule: A = ∏_{i=1}^{m} A_i

A_s-d = A_OXC · A_tr · A_MUX · A_cable · A_amp · A_MUX · A_tr · A_OXC
      = 0.99994 · 0.999985 · 0.9999625 · 0.99087 · 0.999976 · 0.9999625 · 0.999985 · 0.99994
      = 0.99994 · 0.99074 · 0.99994 ≈ 0.99062
(unavailability ≈ 0.94%, i.e. about 3.4 days/year)
Parallel rule: A = 1 − ∏_{i=1}^{m} (1 − A_i)

A_s-d = A_OXC · [1 − (1−A_path1)·(1−A_path2)] · A_OXC
      = 0.99994 · [1 − (1−0.99074)·(1−0.99074)] · 0.99994
      ≈ 0.99979
(unavailability ≈ 0.021%, i.e. about 110 min/year)
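The series and parallel rules are easy to check numerically; this sketch reproduces the unprotected and 1+1 protected end-to-end availabilities computed above:

```python
from functools import reduce

def series(avails):
    """Series rule: A = product of the A_i."""
    return reduce(lambda acc, a: acc * a, avails, 1.0)

def parallel(avails):
    """Parallel rule: A = 1 - product of (1 - A_i)."""
    return 1.0 - reduce(lambda acc, a: acc * (1.0 - a), avails, 1.0)

# Unprotected path (OXC, transponder, MUX, cable, amplifier, MUX, transponder, OXC)
A_path = series([0.99994, 0.999985, 0.9999625, 0.99087, 0.999976,
                 0.9999625, 0.999985, 0.99994])
print(round(A_path, 5))  # about 0.99062

# 1+1 protected: two disjoint paths in parallel between the end OXCs
A_prot = series([0.99994, parallel([0.99074, 0.99074]), 0.99994])
print(round(A_prot, 5))  # about 0.99979
```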
– Efficiency vs. complexity
[Scale: recovery schemes ordered from simple to complex]
– 1 working + 1 protection path is allocated; the two paths are disjoint
[Figure: working and protection paths with per-link capacities; two demands share a link on their protection routes]
– The reserved capacity along the common link is A + B
– PRO: instantaneous recovery (no action is needed)
– The spare capacity along their protection routes can be shared
– At most one of them is activated after a single failure
[Figure: two disjoint working paths whose protection routes share a link]
– The spare capacity along the common link is max{A, B}
– CON: actions (signaling) are needed after the failure
Failure management phases (depending on the architecture):
– Fault detection and localization (isolation) (t_l)
– Fault notification (t_n)
– Path selection (t_p)
– Device configuration (t_d)
[Timeline of recovery: failure → hold-off time → failure detected by the nearest node (fault detection time) → sending fault notification (fault notification time) → recovery operation/switching time (the protection path is deployed) → data flow arrives at the destination node; the service is down from the failure until recovery]
Example (shared protection): t_l = 10 ms, t_n = 20–30 ms, t_c = 20–30 ms, t_p = 0–30 ms, t_d = 50 ms, t_R = 100–150 ms
[Charts: percentage of traffic recovered vs. time (0–150 ms) for dedicated protection, shared protection (pre-planned), and dynamic restoration]
– Protection: the restoration process (e.g. the protection paths) is planned at connection setup
– Dynamic restoration: the restoration process is computed on-the-fly, after the failure
[Figures: 9-node example network with a link fault, illustrating the three recovery scopes along the working path]
– Link protection: local, loop-back
– Segment protection: a good compromise
– Path protection: global, efficient
[Classification tree of recovery schemes]
– pre-planned (protection): 100% recovery, fast | after the failure event occurs (restoration): no guarantee, slower
– scope: link / path / segment
– capacity: dedicated / shared
– failure dependent / failure independent (the failed element is unknown)
A concrete approach is named by reading the tree from bottom to top (e.g. Dedicated Path Protection or Failure-Dependent Shared Link Protection)
– The traffic is sent simultaneously along the working path and along the protection path
– The destination node switches to the protection path
– CON: 100% capacity redundancy
[Figure: S → D, switching at the destination]
– The source and the destination node switch to the protection path
– In failure-free operation the protection path can be used for best-effort traffic
– This is called "preemption"
[Figure: S → D, switching at both the source and the destination]
– Switching at the source and destination nodes
– PRO: better capacity efficiency
– CON: slightly lower availability
– Dedicated protection (A_w, A_p): A = 1 − (1−A_w)(1−A_p) = A_w + A_p − A_w·A_p
– Shared protection (A_w1, A_w2, A_p): A = A_w1·A_w2 + (1−A_w1)·A_w2·A_p + A_w1·(1−A_w2)·A_p
[Figure: S → D]
– Protects against a single failure
– For n = 2, the protection signal is the bitwise XOR of the two working paths
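The two availability formulas can be compared directly; a minimal sketch with assumed example values A_w = A_p = 0.99, showing that sharing the protection path costs a little availability:

```python
def dedicated(Aw, Ap):
    """Working and dedicated protection path in parallel."""
    return Aw + Ap - Aw * Ap

def shared(Aw1, Aw2, Ap):
    """One protection path shared by two working paths: up if both workings
    are up, or exactly one fails and the protection path covers it."""
    return Aw1 * Aw2 + (1 - Aw1) * Aw2 * Ap + Aw1 * (1 - Aw2) * Ap

print(dedicated(0.99, 0.99))       # 0.9999
print(shared(0.99, 0.99, 0.99))    # slightly lower than dedicated
```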
[Figure: dedicated protection in a ring – traffic A→B is sent on both Path 1 and Path 2; after a failure, the receiving node switches to the surviving path]
[Figure: shared ring protection – traffic A→B travels on the working ring; after a failure, the nodes adjacent to the failure switch the traffic onto the protection ring (two switch actions)]
[Figure: pan-European example network – Amsterdam, London, Brussels, Paris, Zurich, Milan, Berlin, Vienna, Prague, Munich, Rome, Hamburg, Lyon, Frankfurt, Strasbourg, Zagreb]
– Protection cycles are defined in advance in the spare capacity of the network
– They can protect both on-cycle and straddling links
– On-cycle links
– Straddling links (both endpoints are on the cycle, but the link itself is not part of it)
– Protects one unit of working bandwidth if the working path is routed along the cycle
– Protects two units of working bandwidth if the working path traverses a straddling link
– No spare capacity reservation along straddling links
– There can be many straddling links
– Efficient bandwidth usage
– Only two switching actions are needed at recovery
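The counting rule above (one protected unit per on-cycle link, two per straddling link) can be sketched in a few lines of Python; the 5-node cycle and the two straddling links below are hypothetical examples:

```python
# Hypothetical example: a unit-capacity protection cycle A-B-C-D-E-A
cycle = ["A", "B", "C", "D", "E"]
cycle_links = {frozenset(p) for p in zip(cycle, cycle[1:] + cycle[:1])}

# All links of the example graph: the cycle plus two straddling links
graph_links = cycle_links | {frozenset({"A", "C"}), frozenset({"B", "E"})}

protected = 0
for link in graph_links:
    if link in cycle_links:
        protected += 1   # on-cycle link: one protected unit
    elif link <= set(cycle):
        protected += 2   # straddling link: two protected units

print(protected)  # 5 on-cycle links + 2 straddling links * 2 units = 9
```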
– They are built up in the optical control plane, but the switches are not yet configured
[Figure: a cycle with unit spare capacity on each of its links]
[Figure: capacity partitioning on link j – working capacity, spare capacity (split into a part shareable with working path W and a non-shareable part, s_j^W), and free capacity]
[Example network: links annotated with spare, working, and free capacities]
– Single-link SRLGs are considered!
– For each link, we record which SRLGs (Shared Risk Link Groups) it is involved in
s_j^l = the non-shareable spare capacity along link j, if the working path is in SRLG l

[Table: rows l = SRLGs (the working edge involved); columns j = protection edges (all edges of the network); entry (l, j) stores s_j^l]
[Example: SRLG l fails – s_j^l is the sum of the protection bandwidth switched onto link j]
– Keep track of the network state after each possible failure
– At most one SRLG is assumed to be failed at a moment
[Example: failure of link l – resulting spare capacity demands on link j]
– How much spare capacity is needed on link j?
– How much of the spare capacity along link j is shareable if the working path W is known?
– Both reduce to finding the maximum demand:

    v_j = max_{l ∈ SRLG} s_j^l                (spare capacity needed on link j)

    s_j^W = max_{l ∈ SRLG(W)} s_j^l           (maximum over the SRLGs traversed by W)
Admission of a new demand with bandwidth b whose protection path uses link j (v_j = spare capacity, f_j = free capacity on link j):
– b < v_j − s_j^W: the demand fits into the shareable spare capacity, nothing new needs to be reserved
– v_j − s_j^W ≤ b < v_j − s_j^W + f_j: ADMIT, the shortfall is reserved from the free capacity
– b ≥ v_j − s_j^W + f_j: BLOCK
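The admission logic above fits in a few lines; a minimal sketch (the numeric capacities in the usage example are assumed values, and boundary cases follow the inequalities as written above):

```python
def admit(b, v_j, s_j_W, f_j):
    """Shareability check for a new demand of bandwidth b on protection link j.
    v_j = current spare capacity, f_j = free capacity,
    s_j_W = non-shareable spare w.r.t. the demand's working path W."""
    shareable = v_j - s_j_W          # spare capacity this demand may reuse
    if b < shareable:
        return "ADMIT (fully shared, no new capacity)"
    if b < shareable + f_j:
        return "ADMIT (extra spare reserved from free capacity)"
    return "BLOCK"

print(admit(b=5, v_j=20, s_j_W=10, f_j=5))   # fits in the shareable spare
print(admit(b=12, v_j=20, s_j_W=10, f_j=5))  # needs 2 units of free capacity
print(admit(b=20, v_j=20, s_j_W=10, f_j=5))  # exceeds spare + free: blocked
```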
References:
– "… network reliability"
– "… protection techniques in WDM networks"
– J.-P. Vasseur, M. Pickavet, P. Demeester: "Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS", Morgan Kaufmann Publishers, 2004.
– J. Kurose, K. Ross: "Computer Networking: A Top-Down Approach Featuring the Internet", 3rd edition, Addison-Wesley, July 2004.
– "… Approach to Connection Unavailability Estimation in Shared Backup Path Protection"
– "… Ethernet"
– "… Fast Reroute"
– "… Network"