
  1. Survivable Network Design – Dr. János Tapolcai, tapolcai@tmit.bme.hu

  2. The final goal
     • We prefer not to see:

  3. Telecommunication Networks
     [Figure, labels: Video, PSTN, Internet, Business, Mobile access, Metro Backbone, High Speed Backbone, Service providers]

  4. Telecommunication Networks – http://www.icn.co

  5. Traditional network architecture in backbone networks
     • Addressing, routing: IP (Internet Protocol)
     • Traffic engineering: ATM (Asynchronous Transfer Mode)
     • Transport and protection: SDH/SONET (Synchronous Digital Hierarchy)
     • High bandwidth: WDM (Wavelength Division Multiplexing)

  6. Evolution of network layers
     • Typical recovery times: BGP-4: 15–30 minutes; OSPF: 10 seconds to minutes; SONET: 50 milliseconds
     [Figure: layer stacks in 1999 (IP over ATM over SONET over optics), 2003 (IP/MPLS over thin SONET over smart optics) and 201x (packet IP/Ethernet interworking with GMPLS-controlled optics)]

  7. IP – Internet Protocol
     • Packet switched
       – Hop-by-hop routing
       – Packets are forwarded based on forwarding tables
     • Distributed control
       – Shortest path routing via link-state protocols: OSPF (Open Shortest Path First), IS-IS (Intermediate System to Intermediate System); a minimal shortest-path sketch follows below
     • Routing on a logical topology
     • Widespread and its role is straightforward, though from a technical point of view it is not very popular
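As a minimal sketch of the shortest-path computation that link-state protocols such as OSPF and IS-IS run over their link-state database (the topology, node names and metrics below are assumptions for illustration, not an OSPF implementation):

```python
# Minimal Dijkstra sketch over an assumed link-state database.
import heapq

def shortest_paths(graph, source):
    """Dijkstra over a dict-of-dicts adjacency map {node: {neighbor: cost}}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical four-router topology with OSPF-style additive link metrics
lsdb = {
    "A": {"B": 10, "C": 1},
    "B": {"A": 10, "D": 1},
    "C": {"A": 1, "D": 100},
    "D": {"B": 1, "C": 100},
}
print(shortest_paths(lsdb, "A"))  # {'A': 0, 'B': 10, 'C': 1, 'D': 11}
```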

  8. [image-only slide]

  9. Optical backbone
     • Circuit switched
       – Centralized control
       – Exact knowledge of the physical topology
     • Logical links are lightpaths
       – Given by source and destination node pairs and bandwidth (see the sketch below)
     [Figure: nodes A–E, each with an IP router attached to a wavelength crossconnect; lightpaths between the crossconnects form the logical links]
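To illustrate how lightpaths act as logical links, here is a small sketch under an assumed data model; the Lightpath class, the A–E node names and the bandwidth values are illustrative, not taken from any particular control plane.

```python
# Illustrative sketch: each lightpath in the optical layer shows up as a
# point-to-point logical link in the IP layer, identified by its endpoints
# and provisioned bandwidth. Data model and values are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Lightpath:
    src: str               # node where the lightpath is added
    dst: str               # node where the lightpath is dropped
    bandwidth_gbps: float  # provisioned capacity of the logical link

def logical_topology(lightpaths):
    """Return the IP-layer adjacency implied by a set of lightpaths."""
    adj = {}
    for lp in lightpaths:
        adj.setdefault(lp.src, set()).add(lp.dst)
        adj.setdefault(lp.dst, set()).add(lp.src)
    return adj

# Hypothetical lightpaths over the slide's A-E nodes (bandwidths are made up)
lps = [Lightpath("A", "C", 10), Lightpath("A", "D", 10), Lightpath("C", "E", 40)]
print(logical_topology(lps))  # e.g. {'A': {'C', 'D'}, 'C': {'A', 'E'}, 'D': {'A'}, 'E': {'C'}}
```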

  10. Optical Backbone Networks

  11. Motivation Behind Survivable Network Design

  12. FAILURE SOURCES

  13. Failure Sources – HW Failures
      • Network element failures
        – Type failures
          • Manufacturing or design faults
          • Usually turn out during the testing phase
        – Wear-out
          • Processor, memory, main board, interface cards
          • Components with moving parts: cooling fans, hard disks, power supplies
            – Natural phenomena (e.g. high humidity, high temperature, earthquakes) mostly affect and damage these components
          • Circuit breakers, transistors, etc.

  14. Failure Sources – SW Failures
      • Design errors
      • High complexity and compound failures
      • Faulty implementations
      • Typos in variable names
        – The compiler detects most of these failures
      • Failed memory read/write operations

  15. Failure Sources – Operator Errors (1)
      • Unplanned maintenance
        – Misconfiguration
          • Routing and addressing: misconfigured addresses or prefixes, interface identifiers, link metrics, timers and queues (DiffServ)
          • Traffic conditioners: policers, classifiers, markers, shapers
          • Wrong security settings: blocking legitimate traffic
        – Other operation faults:
          • Accidental errors (unplug, reset)
          • Access denial (forgotten password)
      • Planned maintenance
        – Upgrade takes longer than planned

  16. Failure Sources – Operator Errors (2)
      • Topology / dimensioning / implementation design errors
        – Weak processors in routers
        – High BER (bit error rate) on long cables
        – Topology is not meshed enough (not enough redundancy for protection path selection)
      • Compatibility errors
        – Between different vendors and versions
        – Between service providers or ASes (Autonomous Systems)
          • Different routing settings and admission control between two ASes

  17. Failure Sources – Operator Errors (3)
      • Operation and maintenance errors
        [Chart, categories: updates and patches, misconfiguration, device upgrade, maintenance, data mirroring or recovery, monitoring and testing, teaching users, other]

  18. Failure Sources – User Errors
      • Failures caused by malicious users
        – Against physical devices
          • Robbery, damaging the device
        – Against nodes
          • Viruses
          • DoS (denial-of-service) attacks (e.g. used on the Internet)
            – Routers get overloaded
            – Launched from many addresses at once
            – IP address spoofing
            – Example: Ping of Death – the maximum size of a ping packet is 65,535 bytes; in 1996, computers could be frozen by receiving larger packets
      • Unexpected user behavior
        – Short term
          • Extreme events (mass calling)
          • Mobility of users (e.g. after a football match the given cell is congested)
        – Long term
          • New popular sites and killer applications

  19. Failure Sources – Environmental Causes
      • Cable cuts
        – Road construction ('Universal Cable Locator')
        – Rodent bites
      • Fading of radio waves
        – New skyscrapers (e.g. CN Tower)
        – Clouds, fog, smog, etc.
        – Birds, planes
      • Electromagnetic interference
        – Electromagnetic noise, solar flares
      • Power outages
      • Humidity and temperature
        – Air-conditioner faults
      • Natural disasters
        – Fires, floods, terrorist attacks, lightning, earthquakes, etc.

  20. Operating Routers During Hurricane Sandy

  21. Michnet ISP Backbone (1998)
      • Which failures were the most probable ones?
      • Categories: Hardware Problem, Maintenance, Software Problem, Power Outage, Fiber Cut/Circuit/Carrier Problem, Interface Down, Malicious Attack, Congestion/Sluggish, Routing Problems

  22. Michnet ISP Backbone (1998)
      Cause                              Type           #     [%]
      Maintenance                        Operator       272   16.2
      Power Outage                       Environmental  273   16.0
      Fiber Cut/Circuit/Carrier Problem  Environmental  261   15.3
      Unreachable                        Operator       215   12.6
      Hardware Problem                   Hardware       154    9.0
      Interface Down                     Hardware       105    6.2
      Routing Problems                   Operator       104    6.1
      Miscellaneous                      Unknown         86    5.9
      Unknown/Undetermined/No problem    Unknown         32    5.6
      Congestion/Sluggish                User            65    4.6
      Malicious Attack                   Malice          26    1.5
      Software Problem                   Software        23    1.3
      [Pie chart by failure type: Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%]

  23. Case study – 2002
      • D. Patterson et al.: "Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies", UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002

  24. Failure Sources – Summary
      • Operator errors (misconfiguration)
        – Simple solutions needed
        – Sometimes reach 90% of all failures
      • Planned maintenance
        – Typically run at night
        – Sometimes reaches 20% of all failures
      • DoS attacks
        – Expected to get worse in the future
      • Software failures
        – Source code of 10 million lines
      • Link failures
        – Anything that causes a point-to-point connection to fail (not only cable cuts)

  25. Reliability
      • Failure – the termination of the ability of a network element to perform a required function; hence, a network failure happens at one particular moment t_f
      • Reliability, R(t)
        – Continuous operation of a system or service
        – Refers to the probability of the system being adequately operational (i.e. failure-free operation) over the intended period [0, t], in the presence of network failures

  26. Reliability (2)
      • Reliability, R(t)
        – Defined as 1 − F(t), where F(t) is the cumulative distribution function (cdf) of the time to failure
        – Simple model: exponentially distributed variables
      • Properties:
        – Non-increasing
        – R(0) = 1
        – lim_{t→∞} R(t) = 0
        – For the exponential model: R(t) = 1 − F(t) = 1 − (1 − e^{−λt}) = e^{−λt}
      [Figure: R(t) decaying from 1 towards 0, with R(a) marked at t = a]
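A quick numeric sketch of the exponential model above; the failure rate value is an assumption chosen only to make the numbers readable.

```python
# Numeric check of the slide's exponential reliability model,
# R(t) = 1 - F(t) = exp(-lambda * t). The failure rate is an assumed value.
import math

FAILURE_RATE = 1e-4  # lambda, failures per hour (assumption)

def reliability(t_hours, lam=FAILURE_RATE):
    """Probability of failure-free operation over [0, t]."""
    return math.exp(-lam * t_hours)

for t in (0, 1_000, 10_000):
    print(f"R({t} h) = {reliability(t):.4f}")
# R(0 h) = 1.0000, R(1000 h) = 0.9048, R(10000 h) = 0.3679
```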

  27. Network with Repairable Subsystems
      • Measures to characterize a repairable system:
        – Availability, A(t)
          • The probability of a repairable system being found in the operational state at some time t in the future
          • A(t) = P(at time t, system = UP)
        – Unavailability, U(t)
          • The probability of a repairable system being found in the faulty state at some time t in the future
          • U(t) = P(at time t, system = DOWN)
        – A(t) + U(t) = 1 at any time t
      [Figure: timeline alternating between UP (device is operational) and DOWN (the network element has failed, repair action is in progress), with failures marking the transitions]
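The UP/DOWN alternation in the figure can be mimicked with a short Monte Carlo sketch; the MTTF and MTTR values below are assumptions, and the long-run fraction of UP time estimates the availability.

```python
# Monte Carlo sketch of a repairable element that alternates between UP
# (exponential time to failure) and DOWN (exponential repair time).
# The long-run fraction of time spent UP estimates the availability A.
# MTTF and MTTR below are assumed values.
import random

def simulate_availability(mttf=1000.0, mttr=10.0, cycles=100_000, seed=1):
    random.seed(seed)
    up_time = down_time = 0.0
    for _ in range(cycles):
        up_time += random.expovariate(1.0 / mttf)    # time until the next failure
        down_time += random.expovariate(1.0 / mttr)  # duration of the repair
    return up_time / (up_time + down_time)

print(simulate_availability())  # ~0.99, close to MTTF / (MTTF + MTTR)
```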

  28. Element Availability Assignment
      • The mainly used measures are (see the sketch below):
        – MTTR – Mean Time To Repair
        – MTTF – Mean Time To Failure
          • MTTR << MTTF
        – MTBF – Mean Time Between Failures
          • MTBF = MTTF + MTTR
          • If the repair is fast, MTBF is approximately the same as MTTF
          • Sometimes given in FITs (Failures In Time): MTBF[h] = 10^9 / FIT
      • Another notation
        – MUT – Mean Up Time (like MTTF)
        – MDT – Mean Down Time (like MTTR)
        – MCT – Mean Cycle Time: MCT = MUT + MDT
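The relations on this slide can be checked with a few lines; the FIT and MTTR figures below are assumptions for illustration only.

```python
# Helpers for the slide's relations: MTBF = MTTF + MTTR, MTBF[h] = 10^9 / FIT,
# and the steady-state availability A = MTTF / MTBF.
# The FIT and MTTR values are assumptions, not figures from the slides.

def mtbf_from_fit(fit):
    """FIT counts failures per 10^9 device-hours."""
    return 1e9 / fit

fit = 5000.0   # assumed component failure rate in FIT
mttr_h = 4.0   # assumed mean time to repair, in hours

mtbf_h = mtbf_from_fit(fit)   # 200,000 h
mttf_h = mtbf_h - mttr_h      # MTTF = MTBF - MTTR
availability = mttf_h / mtbf_h
print(f"MTBF = {mtbf_h:.0f} h, A = {availability:.6f}")  # MTBF = 200000 h, A = 0.999980
```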

  29. Availability in Hours
      Availability  Nines                        Outage/year  Outage/month  Outage/week
      90%           1 nine                       36.52 day    73.04 hour    16.80 hour
      95%           –                            18.26 day    36.52 hour     8.40 hour
      98%           –                             7.30 day    14.60 hour     3.36 hour
      99%           2 nines (maintained)          3.65 day     7.30 hour     1.68 hour
      99.5%         –                             1.83 day     3.65 hour    50.40 min
      99.8%         –                            17.53 hour   87.66 min     20.16 min
      99.9%         3 nines (well maintained)     8.77 hour   43.83 min     10.08 min
      99.95%        –                             4.38 hour   21.91 min      5.04 min
      99.99%        4 nines                      52.59 min     4.38 min      1.01 min
      99.999%       5 nines (failure protected)   5.26 min    25.9 sec       6.05 sec
      99.9999%      6 nines (high reliability)   31.56 sec     2.62 sec      0.61 sec
      99.99999%     7 nines                       3.16 sec     0.26 sec      0.06 sec
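The table entries follow directly from the availability value; the sketch below recomputes a few rows, assuming the 365.25-day year (8766 h) that most of the table's figures imply.

```python
# Recompute a few rows of the availability table from the availability value.
HOURS_PER_YEAR = 365.25 * 24           # 8766 h (assumed calendar convention)
HOURS_PER_MONTH = HOURS_PER_YEAR / 12
HOURS_PER_WEEK = 7 * 24

def outage_hours(availability, period_hours):
    """Expected outage time within the given period."""
    return (1.0 - availability) * period_hours

for a in (0.99, 0.999, 0.99999):
    print(f"{a:.3%}: {outage_hours(a, HOURS_PER_YEAR):6.2f} h/year, "
          f"{outage_hours(a, HOURS_PER_MONTH) * 60:6.2f} min/month, "
          f"{outage_hours(a, HOURS_PER_WEEK) * 60:6.2f} min/week")
# 99.000%:  87.66 h/year, 438.30 min/month, 100.80 min/week
# 99.900%:   8.77 h/year,  43.83 min/month,  10.08 min/week
# 99.999%:   0.09 h/year,   0.44 min/month,   0.10 min/week
```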

  30. Availability Evaluation – Assumptions
      • Failure arrival times
        – Independent and identically distributed (iid) variables following an exponential distribution
        – Sometimes a Weibull distribution is used (hard): F(t) = 1 − e^{−(λt)^α}
        – λ > 0 failure rate (time independent!)
      • Repair times
        – iid exponential variables
        – Sometimes a Weibull distribution is used (hard)
        – µ > 0 repair rate (time independent!)
      • If both failure arrival times and repair times are exponentially distributed, we have a simple model: a Continuous Time Markov Chain (see the sketch below)
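For the two-state (UP/DOWN) continuous-time Markov chain these assumptions lead to, the steady-state availability has a simple closed form; the MTTF and MTTR values below are assumed for illustration.

```python
# Two-state continuous-time Markov chain: UP -> DOWN with constant failure
# rate lambda, DOWN -> UP with constant repair rate mu. Its steady-state
# availability is A = mu / (lambda + mu), i.e. MTTF / (MTTF + MTTR).
# The MTTF and MTTR values are assumptions.
MTTF_H = 4000.0   # assumed mean time to failure, hours
MTTR_H = 8.0      # assumed mean time to repair, hours

lam = 1.0 / MTTF_H   # failure rate
mu = 1.0 / MTTR_H    # repair rate

availability = mu / (lam + mu)
print(f"steady-state A = {availability:.6f}")  # 0.998004
```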
