Availability models
- Dr. János Tapolcai
tapolcai@tmit.bme.hu
http://opti.tmit.bme.hu/~tapolcai/
1
Failure sources: HW failures
– Network element failures
– Type failures
– Manufacturing or design failures
– These turn up during testing
2
– Cooling fans, hard disk, power supply
– Natural phenomena most often affect and damage these devices (e.g. high humidity, high temperature, earthquakes)
3
– Compiler detects most of these failures
4
– Misconfigured addresses or prefixes, interface identifiers, link metrics, timers, and queues (DiffServ)
– Policers, classifiers, markers, shapers
– Block legacy traffic
5
ASs
6
– Updates and patches
– Misconfiguration
– Device upgrade
– Maintenance
– Data mirroring or recovery
– Monitoring and testing
– Teach users
– Other
7
– Physical devices
– Against nodes
– DoS (denial-of-service) attack (e.g. used on the Internet)
In 1996, computers could be frozen by receiving overly large packets.
– Short term
– Long term
– Road construction (‘Universal Cable Locator’)
– Rodent bites
– New skyscraper (e.g. CN Tower)
– Clouds, fog, smog, etc.
– Birds, planes
– Electromagnetic noise
– Solar flares
– Air-conditioner fault
– Fires, floods, terrorist attacks, lightning, earthquakes, etc.
9
– Maintenance
– Power Outage
– Fiber Cut/Circuit/Carrier Problem
– Hardware Problem
– Routing Problems
– Interface Down
– Congestion/Sluggish
– Malicious Attack
– Software Problem
10
Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%
Cause                              Type           #     [%]
Maintenance                        Operator       272   16.2
Power Outage                       Environmental  273   16.0
Fiber Cut/Circuit/Carrier Problem  Environmental  261   15.3
Unreachable                        Operator       215   12.6
Hardware Problem                   Hardware       154   9.0
Interface Down                     Hardware       105   6.2
Routing Problems                   Operator       104   6.1
Miscellaneous                      Unknown        86    5.9
Unknown/Undetermined/No problem    Unknown        32    5.6
Congestion/Sluggish                User           65    4.6
Malicious Attack                   Malice         26    1.5
Software Problem                   Software       23    1.3
11
Definition, Techniques, and Case Studies”, UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002,
12
– Simple solutions needed
– Sometimes reach 90% of all failures
– Running at night
– Sometimes reach 20% of all failures
– It will get worse in the future
– Source code of 10 million lines
– Anything from which a point-to-point connection fails (not only cable cuts)
13
14
15
– Defined as $R(t) = 1 - F(t)$, where $F(t)$ is the cumulative distribution function (cdf)
– Simple model: exponentially distributed variables, $R(t) = e^{-\lambda t}$ with failure rate $\lambda$
– non-increasing
– $R(0) = 1$
– $\lim_{t \to \infty} R(t) = 0$
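The exponential reliability model can be sketched in a few lines. This is a minimal illustration, assuming a constant failure rate; the function name `reliability` and the example rate of one failure per 10^5 hours are hypothetical and not taken from the slides.

```python
import math

def reliability(t_hours: float, failure_rate: float) -> float:
    """R(t) = 1 - F(t) = exp(-lambda * t) for an exponentially distributed lifetime."""
    if t_hours < 0 or failure_rate <= 0:
        raise ValueError("t must be >= 0 and the failure rate must be > 0")
    return math.exp(-failure_rate * t_hours)

# Hypothetical device: one failure per 1e5 hours on average.
lam = 1 / 1e5
for t in (0, 1_000, 100_000, 1_000_000):
    print(t, reliability(t, lam))
```

The loop shows the properties listed above: the value starts at 1 for t = 0, never increases, and tends to 0 as t grows.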
16
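[Figure: device state timeline with failure events]
The network element has failed; repair action is in progress.
– Availability, A(t): the probability that the device is in the operational state at some time t in the future
– Unavailability, U(t): the probability that the device is in the faulty state at some time t in the future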
17
– MTTR - Mean Time To Repair
– MTTF - Mean Time To Failure
– MTBF - Mean Time Between Failures
– MUT - Mean Up Time
– MDT - Mean Down Time
– MCT - Mean Cycle Time
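These quantities combine into a steady-state availability figure. The sketch below assumes the common convention MTBF = MTTF + MTTR, so that A = MTTF / MTBF = 1 - MTTR / MTBF; this convention reproduces the OXC value 0.99994 used in the WDM example of this deck (MTBF = 1·10^5 h, MTTR = 6 h). The function name `availability_from_mtbf` is hypothetical.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming MTBF = MTTF + MTTR.

    A = MTTF / MTBF = (MTBF - MTTR) / MTBF = 1 - MTTR / MTBF
    """
    if mtbf_hours <= 0 or not 0 <= mttr_hours <= mtbf_hours:
        raise ValueError("need MTBF > 0 and 0 <= MTTR <= MTBF")
    return 1.0 - mttr_hours / mtbf_hours

# WDM OXC figures from the deck: MTBF = 1e5 h, MTTR = 6 h.
a_oxc = availability_from_mtbf(1e5, 6)
print(a_oxc)
```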
18
Availability  Nines                        Outage time/year  Outage time/month  Outage time/week
90%           1 nine                       36.52 day         73.04 hour         16.80 hour
95%           -                            -                 36.52 hour         8.40 hour
98%           -                            -                 14.60 hour         3.36 hour
99%           2 nines (maintained)         3.65 day          7.30 hour          1.68 hour
99.5%         -                            -                 3.65 hour          50.40 min
99.8%         -                            -                 87.66 min          20.16 min
99.9%         3 nines (well maintained)    8.77 hour         43.83 min          10.08 min
99.95%        -                            -                 21.91 min          5.04 min
99.99%        4 nines                      52.59 min         4.38 min           1.01 min
99.999%       5 nines (failure protected)  5.26 min          25.9 sec           6.05 sec
99.9999%      6 nines (high reliability)   31.56 sec         2.62 sec           0.61 sec
99.99999%     7 nines                      3.16 sec          0.26 sec           0.06 sec
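The outage times in such a table follow directly from the unavailability multiplied by the length of the period. A minimal sketch, assuming a year of 365.25 days and a month of one twelfth of a year (which is what most of the figures above correspond to); the helper name `outage_hours` is hypothetical.

```python
HOURS_PER_YEAR = 365.25 * 24          # 8766 h
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730.5 h
HOURS_PER_WEEK = 7 * 24                # 168 h

def outage_hours(availability: float, period_hours: float) -> float:
    """Expected outage time in a period: (1 - A) * period length."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours

for a in (0.99, 0.999, 0.99999):
    print(
        f"{a:.5%}",
        f"{outage_hours(a, HOURS_PER_YEAR):.2f} h/year",
        f"{outage_hours(a, HOURS_PER_MONTH) * 60:.2f} min/month",
        f"{outage_hours(a, HOURS_PER_WEEK) * 60:.2f} min/week",
    )
```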
19
– availability increases (unavailability decreases)
– performance is optimized
– the availability remains the same for a long period (time independent)
– availability decreases (unavailability increases)
– e.g. impairments in the fiber
Steady state
Bathtub curve
20
– independent and identically distributed (iid) variables following an exponential distribution
– sometimes the Weibull distribution is used (hard)
– λ > 0 failure rate (time independent!)
– iid exponential variables
– sometimes the Weibull distribution is used (hard)
– μ > 0 repair rate (time independent!)
– Continuous-Time Markov Chain
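The model above, with exponentially distributed up times (rate λ) and repair times (rate μ), can be checked with a small simulation. This is only a sketch: the function name `simulate_availability`, the seed, and the example rates are assumptions made for illustration, not values from the slides.

```python
import random

def simulate_availability(failure_rate: float, repair_rate: float,
                          horizon_hours: float, seed: int = 0) -> float:
    """Fraction of time spent in the UP state of a two-state up/down process."""
    rng = random.Random(seed)
    t = 0.0
    up_time = 0.0
    up = True  # assume the device starts in the UP state
    while t < horizon_hours:
        rate = failure_rate if up else repair_rate
        # Exponential dwell time in the current state, clipped at the horizon.
        dwell = min(rng.expovariate(rate), horizon_hours - t)
        if up:
            up_time += dwell
        t += dwell
        up = not up
    return up_time / horizon_hours

# Hypothetical rates: one failure per 1000 h, mean repair time 10 h.
lam, mu = 1 / 1000, 1 / 10
estimate = simulate_availability(lam, mu, horizon_hours=1e7)
print(estimate, mu / (lam + mu))
```

Over a long horizon the estimate should approach μ / (λ + μ), the stationary probability of the UP state in the corresponding continuous-time Markov chain.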
21
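[Figure: two-state chain with states UP and DN]
– Transition matrix P (stochastic matrix)
– The transition matrix after k steps: $P^k$
– Stationary distribution is a row vector π for which $\pi = \pi P$
– π exists (and in this case it is unique)
Mean of exponentially distributed variables: $\mathrm{MTTF} = 1/\lambda$, $\mathrm{MTTR} = 1/\mu$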
22
UP 1 DN
Transition matrix: Stationary distribution:
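The slide's own matrix did not survive extraction, so the following is only an illustrative sketch. It assumes a hypothetical discrete-time two-state chain in which the device fails with probability `a` per step and is repaired with probability `b` per step, and finds the stationary distribution by repeatedly applying π ← πP.

```python
def stationary(P, steps=1000):
    """Approximate the stationary row vector of a stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0] + [0.0] * (n - 1)  # start in the first state (UP)
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical per-step probabilities (not from the slides).
a, b = 0.01, 0.2
P = [
    [1 - a, a],  # UP -> UP, UP -> DN
    [b, 1 - b],  # DN -> UP, DN -> DN
]
pi = stationary(P)
print(pi, [b / (a + b), a / (a + b)])
```

For this two-state chain the closed form is π = (b / (a + b), a / (a + b)), which the iteration reproduces.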
23
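Steady-state availability: $A_{ss} = \frac{\mu}{\lambda + \mu}$
Time-dependent availability (device up at $t = 0$): $A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t}$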
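For the two-state continuous-time model with failure rate λ and repair rate μ, the standard closed-form availability can be evaluated directly. This sketch assumes the device is up at t = 0; the function name `availability_at` and the example rates are hypothetical.

```python
import math

def availability_at(t_hours: float, failure_rate: float, repair_rate: float) -> float:
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) * t), device up at t = 0."""
    total = failure_rate + repair_rate
    steady_state = repair_rate / total
    transient = (failure_rate / total) * math.exp(-total * t_hours)
    return steady_state + transient

# Hypothetical rates: one failure per 1000 h, mean repair time 10 h.
lam, mu = 1 / 1000, 1 / 10
for t in (0, 1, 10, 100, 1000):
    print(t, availability_at(t, lam, mu))
```

The transient term decays with rate λ + μ, so A(t) settles at the steady-state value μ / (λ + μ) after a few mean repair times.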
24
[Figure: plot of p against t]
25
26
IP Router (simplified model, configuration example)
HW common parts
SW library
1 × 4-port OC3/STM1 POS line card
2 × 1-port Gigabit Ethernet module
4 × 1-port OC48/STM16 POS line card
8 slots available
housing, conditioning
Not used
IP router: interface card: MTBF[h] = 8.5·10^4, MTTR[h] = 4
IP router: SW: MTBF[h] = 3·10^4, MTTR[h] = 0.0004 (SW restart), MTTR[h] = 0.02 (SW reload), MTTR[h] = 0.25 (no automatic restart)
IP router: route processor: MTBF[h] = 2·10^5, MTTR[h] = 4
27
Trunk Transponder Tributary Transponder
Control
SDH DXC/ADM: MTBF[h] = 1·10^6, MTTR[h] = 4
DXC has more ports than IP routers
SDH – Synchronous Digital Hierarchy
SONET – Synchronous Optical NETworking
DXC – digital cross connect
ADM – add-drop multiplexer
OEO – optical-electrical-optical conversion
28
OXC, Transponder, WDM line system, Cable/Fibre, Amplifier
Aerial cable: MTBF[km] = 1.75·10^5, MTTR = 6
Buried cable: MTBF[km] = 2.6·10^5, MTTR = 12
Submarine cables: MTBF[km] = 4.64·10^6, MTTR = 540
Transponder: MTBF = 400·10^3, MTTR = 6
Amplifier: MTBF = 250·10^3, MTTR = 6
WDM line system: MTBF = 160·10^3, MTTR = 6
WDM OXC (OEO) or OADM: MTBF = 1·10^5, MTTR = 6
OXC redundant (1+1 protected): MTBF = 6·10^6, MTTR = 4
WDM – wavelength division multiplexing
OXC – optical cross connect
OADM – optical add-drop multiplexer
29
OXC, Transponder, WDM line system, Amplifier
Transponder: MTBF = 4·10^5, MTTR = 6
Amplifier: MTBF = 2.5·10^5, MTTR = 6
WDM line system (MUX): MTBF = 1.6·10^5, MTTR = 6
WDM OXC: MTBF = 1·10^5, MTTR = 6
Ground cable (200 km): MTBF[km] = 2.63·10^5, MTTR = 12
$A_{s-d} = A_{OXC} \cdot A_{tr} \cdot A_{MUX} \cdot A_{cable} \cdot A_{amp} \cdot A_{MUX} \cdot A_{tr} \cdot A_{OXC}$
$= 0.99994 \cdot 0.999985 \cdot 0.9999625 \cdot 0.99087 \cdot 0.999976 \cdot 0.9999625 \cdot 0.999985 \cdot 0.99994$
$= 0.99994 \cdot 0.99074 \cdot 0.99994 = 0.99062$
For m elements in series: $A = \prod_{i=1}^{m} A_i$
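The series calculation can be reproduced from the MTBF and MTTR figures. This is a sketch under two assumptions that are inferred from the numbers in the deck rather than stated in it: each element's availability is 1 - MTTR / MTBF, and the per-kilometre cable MTBF is divided by the cable length (200 km) to obtain the MTBF of the whole cable. The helper names are hypothetical.

```python
from math import prod

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Element availability, assuming A = 1 - MTTR / MTBF."""
    return 1.0 - mttr_hours / mtbf_hours

def series(availabilities) -> float:
    """All elements must work: the availabilities multiply."""
    return prod(availabilities)

a_oxc = availability(1e5, 6)
a_tr = availability(4e5, 6)
a_mux = availability(1.6e5, 6)
a_amp = availability(2.5e5, 6)
# Assumption: MTBF[km] scales inversely with cable length.
a_cable = availability(2.63e5 / 200, 12)

a_path = series([a_tr, a_mux, a_cable, a_amp, a_mux, a_tr])
a_sd = series([a_oxc, a_path, a_oxc])
print(a_path, a_sd)
```

The results agree with the slide's 0.99074 and 0.99062 to about four decimal places; the small difference comes from the slide rounding intermediate values.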
30
For m elements in parallel: $A = 1 - \prod_{i=1}^{m} (1 - A_i)$
$A_{s-d} = A_{OXC} \cdot [1 - (1 - A_{path1}) \cdot (1 - A_{path2})] \cdot A_{OXC}$
$= 0.99994 \cdot [1 - (1 - 0.99074) \cdot (1 - 0.99074)] \cdot 0.99994 = 0.99979$
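The protected case can be sketched the same way. The code below takes the slide's rounded values (0.99074 per path, 0.99994 per OXC) as inputs and assumes the two paths fail independently; the function name `parallel` is hypothetical.

```python
def parallel(availabilities) -> float:
    """The group fails only if every element fails (independent failures assumed)."""
    unavailability = 1.0
    for a in availabilities:
        unavailability *= 1.0 - a
    return 1.0 - unavailability

a_oxc = 0.99994
a_path = 0.99074  # each of the two disjoint paths, value from the slide

a_sd_protected = a_oxc * parallel([a_path, a_path]) * a_oxc
print(a_sd_protected)
```

With two paths in parallel, the end-to-end availability is limited mainly by the two unprotected OXCs at the ends rather than by the cable.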
31
network reliability”
protection techniques in WDM networks”
Network recovery: Protection and Restoration. Morgan Kaufmann Publishers, 2004.
Computer Networking: A Top Down Approach Featuring the Internet, 3rd edition. Jim Kurose, Keith Ross. Addison-Wesley, July 2004.