  1. Availability models
     Dr. János Tapolcai
     tapolcai@tmit.bme.hu
     http://opti.tmit.bme.hu/~tapolcai/

  2. Failure sources – HW failures
     • Network element failures
       – Type failures
         • Manufacturing or design failures
         • Usually caught in the testing phase
       – Wear-out
         • Processor, memory, main board, interface cards
         • Components with moving parts: cooling fans, hard disks, power supplies
           – These devices are the ones most affected and damaged by natural phenomena (e.g. high humidity, high temperature, earthquakes)
         • Circuit breakers, transistors, etc.

  3. Failure sources – SW failures
     • Design errors
     • High complexity and compound failures
     • Faulty implementations
       – Typos in variable names (the compiler detects most of these failures)
       – Failed memory read/write operations

  4. Failure sources – Operator errors (1)
     • Unplanned maintenance
       – Misconfiguration
         • Routing and addressing: misconfigured addresses or prefixes, interface identifiers, link metrics, timers and queues (DiffServ)
         • Traffic conditioners: policers, classifiers, markers, shapers
       – Wrong security settings
         • Blocking legitimate traffic
       – Other operational faults:
         • Accidental errors (unplugging, resets)
         • Access denial (forgotten passwords)
     • Planned maintenance
       – Upgrades take longer than planned

  5. Failure sources – Operator errors (2)
     • Topology/dimensioning/implementation design errors
       – Underpowered processors in routers
       – High BER on long cables
       – Topology not meshed enough (not enough redundancy for protection-path selection)
     • Compatibility errors
       – Between different vendors and versions
       – Between service providers or ASes (Autonomous Systems)
         • Different routing settings and admission control between two ASes

  6. Failure sources – Operator errors (3)
     • Operation and maintenance errors
       – Updates and patches
       – Misconfiguration
       – Device upgrades
       – Maintenance
       – Data mirroring or recovery
       – Monitoring and testing
       – User training
       – Other

  7. Failure sources – User errors
     • Failures caused by malicious users
       – Physical devices
         • Theft, damage to the device
       – Against nodes
         • Viruses
         • DoS (denial-of-service) attacks (widely used on the Internet)
           – Routers become overloaded
           – Launched from many addresses at once
           – IP address spoofing
           – Example: Ping of Death – the maximal size of a ping packet is 65,535 bytes; in 1996 computers could be frozen by receiving larger packets
     • Unexpected user behavior
       – Short term
         • Extreme events (mass calling)
         • Mobility of users (e.g. after a football match the given cell is congested)
       – Long term
         • New popular sites and killer applications

  8. Failure sources – Environmental causes
     • Cable cuts
       – Road construction ('Universal Cable Locator')
       – Rodent bites
     • Fading of radio waves
       – New skyscrapers (e.g. CN Tower)
       – Clouds, fog, smog, etc.
       – Birds, planes
     • Electromagnetic interference
       – Electromagnetic noise, solar flares
     • Power outages
     • Humidity and temperature
       – Air-conditioner faults
     • Natural disasters
       – Fires, floods, terrorist attacks, lightning, earthquakes, etc.

  9. Michnet ISP Backbone 11/97 – 11/98
     • Which failures are the most probable ones?
       – Hardware Problem
       – Maintenance
       – Software Problem
       – Power Outage
       – Fiber Cut/Circuit/Carrier Problem
       – Interface Down
       – Malicious Attack
       – Congestion/Sluggish
       – Routing Problems

  10. Michnet ISP Backbone 11/97 – 11/98

      Cause                               Type           #     [%]
      Maintenance                         Operator       272   16.2
      Power Outage                        Environmental  273   16.0
      Fiber Cut/Circuit/Carrier Problem   Environmental  261   15.3
      Unreachable                         Operator       215   12.6
      Hardware Problem                    Hardware       154    9.0
      Interface Down                      Hardware       105    6.2
      Routing Problems                    Operator       104    6.1
      Miscellaneous                       Unknown         86    5.9
      Unknown/Undetermined/No problem     Unknown         32    5.6
      Congestion/Sluggish                 User            65    4.6
      Malicious Attack                    Malice          26    1.5
      Software Problem                    Software        23    1.3

      [Pie chart: share by type – Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%]
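The per-type shares in the pie chart can be re-derived from the table above. A minimal sketch in Python (the counts are copied from the table; small differences from the pie percentages come from rounding in the original slide):

```python
# Aggregate the Michnet failure log by cause type, using the counts
# from the table on slide 10.
from collections import defaultdict

rows = [  # (cause, type, count)
    ("Maintenance", "Operator", 272),
    ("Power Outage", "Environmental", 273),
    ("Fiber Cut/Circuit/Carrier Problem", "Environmental", 261),
    ("Unreachable", "Operator", 215),
    ("Hardware Problem", "Hardware", 154),
    ("Interface Down", "Hardware", 105),
    ("Routing Problems", "Operator", 104),
    ("Miscellaneous", "Unknown", 86),
    ("Unknown/Undetermined/No problem", "Unknown", 32),
    ("Congestion/Sluggish", "User", 65),
    ("Malicious Attack", "Malice", 26),
    ("Software Problem", "Software", 23),
]

by_type = defaultdict(int)
for _, cause_type, count in rows:
    by_type[cause_type] += count

total = sum(by_type.values())
for cause_type, count in sorted(by_type.items(), key=lambda kv: -kv[1]):
    print(f"{cause_type:14s} {100 * count / total:5.1f} %")
```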

  11. Case study – 2002
      • D. Patterson et al.: "Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies", UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002

  12. Failure sources – Summary
      • Operator errors (misconfiguration)
        – Simple solutions needed
        – Can account for up to 90% of all failures
      • Planned maintenance
        – Run at night
        – Can account for up to 20% of all failures
      • DoS attacks
        – Expected to get worse in the future
      • Software failures
        – Source code of 10 million lines
      • Link failures
        – Anything that makes a point-to-point connection fail (not only cable cuts)

  13. Motivation behind survivable network design

  14. Reliability
      • Failure
        – The termination of the ability of a network element to perform a required function. Hence, a network failure happens at one particular moment t_f.
      • Reliability, R(t)
        – Continuous operation of a system or service
        – Refers to the probability that the system is adequately operational (i.e. failure-free operation) for the intended period of time [0, t] in the presence of network failures

  15. Reliability (2)
      • Reliability, R(t)
        – Defined as 1 − F(t), where F(t) is the cumulative distribution function (cdf) of the time to failure
        – Simple model: exponentially distributed variables

          R(t) = 1 − F(t) = 1 − (1 − e^(−λt)) = e^(−λt)

      • Properties:
        – Non-increasing: R(a) ≥ R(t) for a ≤ t
        – R(0) = 1
        – lim_{t→∞} R(t) = 0

      [Plot: R(t) starts at 1 and decays towards 0 as t grows]
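The exponential reliability model above is a one-liner in code. A minimal sketch in Python; the failure rate used below is an assumed illustrative value, not taken from the slides:

```python
import math

def reliability(t: float, failure_rate: float) -> float:
    """R(t) = 1 - F(t) = exp(-lambda * t) for exponentially distributed lifetimes."""
    return math.exp(-failure_rate * t)

# Illustrative failure rate (an assumption, not from the slides):
lam = 1e-4  # failures per hour, i.e. MTTF = 10,000 h

print(reliability(0.0, lam))     # R(0) = 1
print(reliability(10_000, lam))  # R(MTTF) = e^-1 ≈ 0.3679
```

Note that R(MTTF) = e^(−1): with exponential lifetimes there is only about a 37% chance the component survives its own mean time to failure.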

  16. Network with repairable subsystems
      • Measures to characterize a repairable system:
        – Availability, A(t)
          • The probability that a repairable system is found in the operational state at some time t in the future
          • A(t) = P(system is UP at time t)
        – Unavailability, U(t)
          • The probability that a repairable system is found in the faulty state at some time t in the future
          • U(t) = P(system is DOWN at time t)
        – A(t) + U(t) = 1 at any time t

      [Timeline: the device alternates between UP periods (device is operational) and DOWN periods (the network element has failed and a repair action is in progress), separated by failure events]

  17. Element Availability Assignment
      • The most commonly used measures:
        – MTTR – Mean Time To Repair
        – MTTF – Mean Time To Failure
          • MTTR << MTTF
        – MTBF – Mean Time Between Failures
          • MTBF = MTTF + MTTR
          • If the repair is fast, MTBF is approximately the same as MTTF
          • Sometimes given in FITs (Failures In Time): MTBF[h] = 10^9 / FIT
      • Alternative notation:
        – MUT – Mean Up Time (like MTTF)
        – MDT – Mean Down Time (like MTTR)
        – MCT – Mean Cycle Time
          • MCT = MUT + MDT
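The FIT conversion and the standard availability formula A = MTTF / (MTTF + MTTR) (derived later from the two-state Markov model) can be sketched as follows; the 500 FIT and 4 h figures are assumed illustrative values, not taken from the slides:

```python
def mtbf_hours_from_fit(fit: float) -> float:
    """FIT counts failures per 10^9 device-hours, so MTBF[h] = 10^9 / FIT."""
    return 1e9 / fit

def steady_state_availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR) = MUT / MCT."""
    return mttf / (mttf + mttr)

# Illustrative numbers (assumptions, not from the slides):
mtbf = mtbf_hours_from_fit(500)  # 500 FIT -> 2,000,000 h between failures
mttr = 4.0                       # 4 h mean repair time
mttf = mtbf - mttr               # MTBF = MTTF + MTTR

print(steady_state_availability(mttf, mttr))  # very close to 1: repair is fast
```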

  18. Availability in hours

      Availability  Nines                        Outage/year  Outage/month  Outage/week
      90%           1 nine                       36.52 day    73.04 hour    16.80 hour
      95%           –                            18.26 day    36.52 hour     8.40 hour
      98%           –                             7.30 day    14.60 hour     3.36 hour
      99%           2 nines (maintained)          3.65 day     7.30 hour     1.68 hour
      99.5%         –                             1.83 day     3.65 hour    50.40 min
      99.8%         –                            17.53 hour   87.66 min     20.16 min
      99.9%         3 nines (well maintained)     8.77 hour   43.83 min     10.08 min
      99.95%        –                             4.38 hour   21.91 min      5.04 min
      99.99%        4 nines                      52.59 min     4.38 min      1.01 min
      99.999%       5 nines (failure protected)   5.26 min    25.9 sec       6.05 sec
      99.9999%      6 nines (high reliability)   31.56 sec     2.62 sec      0.61 sec
      99.99999%     7 nines                       3.16 sec     0.26 sec      0.06 sec
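Any row of the table can be re-derived directly from the availability figure. A minimal sketch (using a 365.25-day year, which matches the table's values):

```python
# Convert an availability figure into allowed outage time per year.
HOURS_PER_YEAR = 365.25 * 24  # 8766 h, matching the table above

def outage_per_year_hours(availability: float) -> float:
    """Unavailability (1 - A) times the total hours in a year."""
    return (1.0 - availability) * HOURS_PER_YEAR

print(f"{outage_per_year_hours(0.999):.2f} h")           # 3 nines: 8.77 h/year
print(f"{outage_per_year_hours(0.99999) * 60:.2f} min")  # 5 nines: 5.26 min/year
```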

  19. Availability evaluation – assumptions (1)
      • Deployment (early life)
        – Availability increases (unavailability decreases)
        – Performance is optimized
      • Steady state
        – The availability remains the same for a long period (time independent)
      • Wear-out (component aging)
        – Availability decreases (unavailability increases)
        – e.g. impairments in the fiber

      [Plot: the 'bathtub curve' – U(t) starts high, drops to a flat steady-state level, and rises again during wear-out]

  20. Availability evaluation – assumptions (2)
      • Failure arrival times
        – Independent and identically distributed (iid) variables following an exponential distribution:

          F(t) = 1 − e^(−λt)

        – Sometimes a Weibull distribution is used (harder to handle)
        – λ > 0 is the failure rate (time independent!)
      • Repair times
        – iid exponential variables
        – Sometimes a Weibull distribution is used (harder to handle)
        – μ > 0 is the repair rate (time independent!)
      • If both failure arrival times and repair times are exponentially distributed, we have a simple model:
        – Continuous Time Markov Chain
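The exponential failure/repair model above can be sanity-checked with a small Monte Carlo simulation: alternate exponentially distributed up-times and down-times and measure the fraction of time the system is UP. The rates below are assumed illustrative values, not taken from the slides:

```python
import random

def simulated_availability(lam: float, mu: float,
                           cycles: int = 100_000, seed: int = 1) -> float:
    """Simulate alternating exp(lam) up-times and exp(mu) down-times
    and return the fraction of total time spent in the UP state."""
    rng = random.Random(seed)
    up = down = 0.0
    for _ in range(cycles):
        up += rng.expovariate(lam)   # time to failure, mean 1/lam = MTTF
        down += rng.expovariate(mu)  # repair time, mean 1/mu = MTTR
    return up / (up + down)

# Assumed rates: MTTF = 1000 h, MTTR = 10 h
lam, mu = 0.001, 0.1
print(simulated_availability(lam, mu))  # ≈ mu / (lam + mu) ≈ 0.9901
```

The simulated value agrees with the closed-form steady-state availability derived from the Markov model on the following slides.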

  21. Two-state Markov model – Steady-state analysis (1)
      • Two states: UP and DN, with transition probability λ from UP to DN and μ from DN to UP (and 1 − λ, 1 − μ for staying in place)
      • Means of the exponentially distributed variables:

          1/λ = MTTF,   1/μ = MTTR

      • Transition probability distribution in matrix form
        – Transition matrix P (stochastic matrix)
      • Time-homogeneous Markov chain
        – The transition matrix after k steps: P^k
        – The stationary distribution is a row vector π for which π·P = π
        – π exists (and in this case it is unique)
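The convergence of P^k to the stationary distribution can be observed numerically with plain power iteration. A minimal sketch; lam and mu are assumed per-step transition probabilities with illustrative values:

```python
# Power-iteration check of the two-state UP/DN Markov chain.
lam, mu = 0.01, 0.2     # assumed illustrative transition probabilities
P = [[1 - lam, lam],    # UP -> {UP, DN}
     [mu, 1 - mu]]      # DN -> {UP, DN}

def step(dist, P):
    """One step of the chain: row vector times the transition matrix."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]        # start deterministically in the UP state
for _ in range(10_000):  # repeated multiplication approximates P^k, k -> inf
    dist = step(dist, P)

print(dist)  # approaches the stationary pi = (mu, lam) / (lam + mu)
```

Regardless of the starting distribution, the iteration converges to the same π, which is exactly the (A, U) pair derived in closed form on the next slide.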

  22. Two-state Markov model – Steady-state analysis (2)

      Transition matrix:

          P = [ 1 − λ    λ   ]
              [   μ    1 − μ ]

      Stationary distribution: π = (π_UP, π_DOWN) = (A, U)

          (A, U) · P = (A, U)

      From the first component:

          A(1 − λ) + Uμ = A   ⟹   Aλ = Uμ

      Together with A + U = 1 (we have seen U = 1 − A):

          A = μ / (λ + μ),   U = λ / (λ + μ)
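The closed-form result can be sketched in a few lines; the rates below are assumed illustrative values (MTTF = 1000 h, MTTR = 8 h), not taken from the slides:

```python
def availability(lam: float, mu: float) -> float:
    """Steady-state availability of the two-state model: A = mu / (lam + mu)."""
    return mu / (lam + mu)

def unavailability(lam: float, mu: float) -> float:
    """U = 1 - A = lam / (lam + mu)."""
    return lam / (lam + mu)

# With MTTF = 1/lam and MTTR = 1/mu this is MTTF / (MTTF + MTTR):
lam, mu = 1 / 1000, 1 / 8  # assumed: MTTF = 1000 h, MTTR = 8 h
print(availability(lam, mu))  # = 1000 / 1008 ≈ 0.99206
```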
