Availability models – Dr. János Tapolcai (tapolcai@tmit.bme.hu) – PowerPoint PPT Presentation
Availability models

  • Dr. János Tapolcai

tapolcai@tmit.bme.hu http://opti.tmit.bme.hu/~tapolcai/


Failure sources – HW failures

  • Network element failures

– Type failures

  • Manufacturing or design failures
  • Revealed during the testing phase

– Wear out

  • Processor, memory, main board, interface cards
  • Components with moving parts:

– Cooling fans, hard disks, power supplies
– These devices are mostly influenced and damaged by natural phenomena (e.g. high humidity, high temperature, earthquakes)

  • Circuit breakers, transistors, etc.

Failure sources – SW failures

  • Design errors
  • High complexity and compound failures
  • Faulty implementations
  • Typos in variable names

– Compiler detects most of these failures

  • Failed memory reading/writing operation

Failure sources – Operator errors (1)

  • Unplanned maintenance

– Misconfiguration

  • Routing and addressing

– misconfigured addresses or prefixes, interface identifiers, link metrics, timers and queues (DiffServ)

  • Traffic Conditioners

– Policers, classifiers, markers, shapers

  • Wrong security settings

– Block legacy traffic

– Other operation faults:

  • Accidental errors (unplug, reset)
  • Access denial (forgotten password)
  • Planned maintenance
  • Upgrade is longer than planned

Failure sources – Operator errors (2)

  • Topology/Dimensioning/Implementation design errors

– Weak processors in routers
– High BER in long cables
– Topology is not meshed enough (not enough redundancy for protection path selection)

  • Compatibility errors

– Between different vendors and versions
– Between service providers or ASs (Autonomous Systems)

  • Different routing settings and Admission Control between two ASs


Failure sources – Operator errors (3)

  • Operation and maintenance errors

– Updates and patches
– Misconfiguration
– Device upgrades
– Maintenance
– Data mirroring or recovery
– Monitoring and testing
– Training users
– Other


Failure sources – User errors

  • Failures from malicious users

– Physical devices

  • Robbery, damage the device

– Against nodes

  • Viruses

– DoS (denial-of-service) attacks (e.g. on the Internet)

  • Routers are overloaded
  • At once from many addresses
  • IP address spoofing
  • Example: Ping of Death – the maximal size of a ping packet is 65,535 bytes. In 1996, computers could be frozen by receiving larger packets.

  • Unexpected user behavior

– Short term

  • Extreme events (mass calling)
  • Mobility of users (e.g. after a football match the given cell is congested)

– Long term

  • New popular sites and killer applications

Failure sources – Environmental causes

  • Cable cuts

– Road construction (‘Universal Cable Locator’)
– Rodent bites

  • Fading of radio waves

– New skyscrapers (e.g. CN Tower)
– Clouds, fog, smog, etc.
– Birds, planes

  • Electro-magnetic interference

– Electro-magnetic noise
– Solar flares

  • Power outage
  • Humidity and temperature

– Air-conditioner fault

  • Natural disasters

– Fires, floods, terrorist attacks, lightnings, earthquakes, etc.


Michnet ISP Backbone 11/97 – 11/98

– Maintenance
– Power Outage
– Fiber Cut/Circuit/Carrier Problem
– Hardware Problem
– Routing Problems
– Interface Down
– Congestion/Sluggish
– Malicious Attack
– Software Problem

  • Which failures are the most probable ones?

Michnet ISP Backbone 11/97 – 11/98

Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%

Cause                              Type           #    [%]
Maintenance                        Operator       272  16.2
Power Outage                       Environmental  273  16.0
Fiber Cut/Circuit/Carrier Problem  Environmental  261  15.3
Unreachable                        Operator       215  12.6
Hardware Problem                   Hardware       154   9.0
Interface Down                     Hardware       105   6.2
Routing Problems                   Operator       104   6.1
Miscellaneous                      Unknown         86   5.9
Unknown/Undetermined/No problem    Unknown         32   5.6
Congestion/Sluggish                User            65   4.6
Malicious Attack                   Malice          26   1.5
Software Problem                   Software        23   1.3


Case study - 2002

  • D. Patterson et al.: “Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies”, UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002


Failure sources - Summary

  • Operator errors (misconfiguration)

– Simple solutions needed
– Sometimes reaches 90% of all failures

  • Planned maintenance

– Run at night
– Sometimes reaches 20% of all failures

  • DoS attack

– It will be worse in the future

  • Software failures

– Source code of 10 million lines

  • Link failures

– Anything from which a point-to-point connection fails (not only cable cuts)


Motivation behind survivable network design


Reliability

  • Failure

– is the termination of the ability of a network element to perform a required function. Hence, a network failure happens at one particular moment t_f

  • Reliability, R(t)

– continuous operation of a system or service
– refers to the probability of the system being adequately operational (i.e. failure-free operation) for the intended period of time [0, t] in the presence of network failures


Reliability (2)

  • Reliability, R(t)

– Defined as 1 − F(t) (F is the cumulative distribution function, cdf)
– Simple model: exponentially distributed variables

R(t) = 1 - F(t) = 1 - (1 - e^{-\lambda t}) = e^{-\lambda t}

  • Properties:

– non-increasing
– R(0) = 1
– R(∞) = lim_{t→∞} R(t) = 0

[Figure: R(t) decays from R(0) = 1; the value R(a) is marked at t = a]
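As a quick numeric sketch of the exponential model above (the failure rate λ = 0.5 here is an arbitrary illustrative value, not from the slides):

```python
import math

# Reliability of the simple exponential model: R(t) = e^(-lam * t)
def reliability(lam, t):
    return math.exp(-lam * t)

lam = 0.5                      # illustrative failure rate
print(reliability(lam, 0))     # 1.0, since R(0) = 1
print(reliability(lam, 2))     # e^-1 ~ 0.368
# R is non-increasing: reliability(lam, a) >= reliability(lam, b) for a <= b
```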


Network with repairable subsystems

[Figure: timeline alternating between UP intervals (device is operational) and DOWN intervals (the network element has failed and a repair action is in progress); each UP→DOWN transition is a failure]

  • Measures to characterize a repairable system are:

– Availability, A(t)

  • refers to the probability of a repairable system being found in the operational state at some time t in the future
  • A(t) = P(time = t, system = UP)

– Unavailability, U(t)

  • refers to the probability of a repairable system being found in the faulty state at some time t in the future
  • U(t) = P(time = t, system = DOWN)
  • A(t) + U(t) = 1 at any time t


Element Availability Assignment

  • The most commonly used measures are

– MTTR - Mean Time To Repair – MTTF - Mean Time to Failure

  • MTTR << MTTF

– MTBF - Mean Time Between Failures

  • MTBF=MTTF+MTTR
  • if the repair is fast, MTBF is approximately the same as MTTF
  • Sometimes given in FITs (Failures In Time): MTBF[h] = 10^9 / FIT
  • Another notation

– MUT - Mean Up Time

  • Like MTTF

– MDT - Mean Down Time

  • Like MTTR

– MCT - Mean Cycle Time

  • MCT=MUT+MDT
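The relations on this slide can be sketched in a few lines of Python; the 10 000 FIT line-card rating below is a hypothetical figure for illustration only:

```python
# FIT = failures per 10^9 device-hours, so MTBF[h] = 10^9 / FIT
def mtbf_from_fit(fit):
    return 1e9 / fit

# MTBF = MTTF + MTTR; when repair is fast (MTTR << MTTF), MTBF ~ MTTF
def mtbf(mttf, mttr):
    return mttf + mttr

card_fit = 10_000                    # hypothetical line card rated at 10 000 FIT
card_mtbf = mtbf_from_fit(card_fit)
print(card_mtbf)                     # 100000.0 hours

print(mtbf(mttf=card_mtbf, mttr=4))  # 100004.0 -- barely differs from MTTF
```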

Availability in hours

Availability  Nines                       Outage/year  Outage/month  Outage/week
90%           1 nine                      36.52 day    73.04 hour    16.80 hour
95%                                       18.26 day    36.52 hour     8.40 hour
98%                                        7.30 day    14.60 hour     3.36 hour
99%           2 nines (maintained)         3.65 day     7.30 hour     1.68 hour
99.5%                                      1.83 day     3.65 hour    50.40 min
99.8%                                     17.53 hour   87.66 min     20.16 min
99.9%         3 nines (well maintained)    8.77 hour   43.83 min     10.08 min
99.95%                                     4.38 hour   21.91 min      5.04 min
99.99%        4 nines                     52.59 min     4.38 min      1.01 min
99.999%       5 nines (failure protected)  5.26 min    25.9 sec       6.05 sec
99.9999%      6 nines (high reliability)  31.56 sec     2.62 sec      0.61 sec
99.99999%     7 nines                      3.16 sec     0.26 sec      0.06 sec
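Rows of the table can be reproduced directly from the availability figure; this small sketch assumes a 365.25-day year, which matches the table's values:

```python
HOURS_PER_YEAR = 365.25 * 24

# Expected outage for a given steady-state availability
def outage_hours_per_year(availability):
    return (1 - availability) * HOURS_PER_YEAR

def outage_minutes_per_week(availability):
    return (1 - availability) * 7 * 24 * 60

# "3 nines" row of the table:
print(round(outage_hours_per_year(0.999), 2))    # 8.77 hours/year
print(round(outage_minutes_per_week(0.999), 2))  # 10.08 minutes/week
```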


Availability evaluation – assumptions (1)

  • Deployment

– availability increases (unavailability decreases)
– performance is optimized

  • Steady state

– the availability remains the same for a long period (time independent)

  • Wear out (component aging)

– availability decreases (unavailability increases)
– e.g. impairments in the fiber

[Figure: bathtub curve of U(t) over time – high at deployment, flat in steady state, rising again at wear out]


Availability evaluation – assumptions (2)

  • Failure arrival times

– independent and identically distributed (iid) variables following an exponential distribution:

F(t) = 1 - e^{-\lambda t}

– sometimes a Weibull distribution is used (hard)
– λ > 0 failure rate (time independent!)

  • Repair times

– iid exponential variables
– sometimes a Weibull distribution is used (hard)
– μ > 0 repair rate (time independent!)

  • If both failure arrival times and repair times are exponentially distributed we have a simple model

– Continuous Time Markov Chain


Two-state Markov model – Steady state analysis (1)

[State diagram: states UP and DN; transition probability λ from UP to DN, μ from DN to UP; self-loops 1 − λ and 1 − μ]

Means of the exponentially distributed variables: 1/λ = MTTF, 1/μ = MTTR

  • Transition probability distribution in matrix form

– Transition matrix P (stochastic matrix)

  • Time-homogeneous Markov chain

– The transition matrix after k steps: P^k
– The stationary distribution is a row vector π for which πP = π
– π exists (and in this case it is unique)
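The claim that P^k converges to the stationary distribution can be checked numerically. This sketch uses illustrative per-step probabilities λ and μ (not values from the slides):

```python
# Multiply two 2x2 matrices (row-major lists of lists)
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

lam, mu = 0.01, 0.2                       # illustrative failure/repair step probs
P = [[1 - lam, lam], [mu, 1 - mu]]        # two-state transition matrix

Pk = P
for _ in range(200):                      # compute a high power of P
    Pk = matmul(Pk, P)

pi = (mu / (lam + mu), lam / (lam + mu))  # closed-form stationary vector
print(Pk[0])                              # both rows of P^k approach pi
print(pi)
```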

Two-state Markov model – Steady state analysis (2)


[State diagram: states UP and DN; transition probability λ from UP to DN, μ from DN to UP; self-loops 1 − λ and 1 − μ]

Transition matrix:

P = ( 1−λ   λ  )
    (  μ   1−μ )

Stationary distribution: π = (π_UP, π_DOWN) = (A, U)

From πP = π and A + U = 1 we have:

A(1−λ) + Uμ = A  ⟹  Aλ = (1−A)μ  ⟹  A = μ/(λ+μ), U = λ/(λ+μ)
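The steady-state result can be verified numerically: π = (A, U) with A = μ/(λ+μ) satisfies πP = π. The rates below are illustrative:

```python
lam, mu = 0.01, 0.2                  # illustrative failure/repair step probs

A = mu / (lam + mu)                  # steady-state availability
U = lam / (lam + mu)                 # steady-state unavailability

P = [[1 - lam, lam], [mu, 1 - mu]]

# Check pi P = pi, component by component:
up_next = A * P[0][0] + U * P[1][0]  # A(1-lam) + U*mu
dn_next = A * P[0][1] + U * P[1][1]  # A*lam + U(1-mu)
print(up_next, A)                    # equal: pi is stationary
print(dn_next, U)                    # equal
```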


Two-state Markov model – Summary

A_{ss} = \frac{\mu}{\lambda+\mu} = \frac{1/\lambda}{1/\lambda + 1/\mu} = \frac{MTTF}{MTTF + MTTR}

[Figure: A(t) starts at 1 and decays towards the steady-state value A_ss]

  • Without the assumption of repairable subsystems (μ = 0), availability is the same as reliability:

A(t)\big|_{\mu=0} = \left[ \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} e^{-(\lambda+\mu)t} \right]_{\mu=0} = e^{-\lambda t} = R(t)
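The transient availability formula above can be sketched directly; the MTTF/MTTR values below are illustrative, not from the slides:

```python
import math

# A(t) = mu/(lam+mu) + lam/(lam+mu) * e^(-(lam+mu)t):
# starts at 1 and decays to the steady state A_ss = MTTF/(MTTF+MTTR)
def availability(lam, mu, t):
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

lam, mu = 1 / 1000, 1 / 4           # MTTF = 1000 h, MTTR = 4 h (illustrative)
print(availability(lam, mu, 0))     # ~1.0: a fresh system is up
print(availability(lam, mu, 1e6))   # ~A_ss = 1000/1004 ~ 0.99602
print(availability(lam, 0, 10))     # with mu = 0 this equals R(10) = e^(-10*lam)
```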


Estimate the failure rate λ – Military Handbook

  • First for electric devices
  • MIL-HDBK-217 (Military Handbook, Reliability Prediction of Electronic Equipment)

  • Microelectronic circuits
  • Semiconductors
  • Passive elements

– Match curves on the observations to get λ_p

  • where λ_p is the failure rate of the element

R(t) = e^{-\lambda_p t}


Estimate the failure rate λ – Telcordia standard

  • The operation environment is considered in the estimation

– On-spot measured data
– Data tested in laboratory

  • AT&T Bell Labs.

– Since then called the Telcordia standard
– France Telecom (CNET93) and British Telecom (HRD5) improved the method


Equipment availability - IP router

IP Router (simplified model, configuration example)

  • HW common parts, SW library
  • 1 × 4-port OC3/STM1 POS line card
  • 2 × 1-port Gigabit Ethernet module
  • 4 × 1-port OC48/STM16 POS line card
  • 8 slots available (the rest not used)
  • Power supply, housing, conditioning

IP router: interface card    MTBF[h] = 8.5·10^4,  MTTR[h] = 4
IP router: SW                MTBF[h] = 3·10^4,    MTTR[h] = 0.0004 (SW restart), 0.02 (SW reload), 0.25 (no automatic restart)
IP router: route processor   MTBF[h] = 2·10^5,    MTTR[h] = 4
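From these figures the steady-state availability of each part follows from A = MTTF/(MTTF + MTTR); this sketch approximates MTTF by the quoted MTBF, which is valid here since MTTR << MTBF:

```python
# Steady-state availability from MTBF (~MTTF) and MTTR, both in hours
def availability(mtbf_h, mttr_h):
    return mtbf_h / (mtbf_h + mttr_h)

print(round(availability(8.5e4, 4), 6))     # interface card: ~0.999953
print(round(availability(2e5, 4), 6))       # route processor: ~0.99998
print(round(availability(3e4, 0.0004), 9))  # SW with automatic restart
```

Note how strongly the repair time drives SW availability: the same MTBF with MTTR = 0.25 h (no automatic restart) gives a noticeably lower figure.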


Equipment availability – DXC in SDH/SONET

[Figure: DXC block diagram – trunk transponders and tributary transponders (OEO) plus a control unit]

SDH DXC/ADM: MTBF[h] = 1·10^6, MTTR[h] = 4. A DXC has more ports than an IP router.

SDH – Synchronous Digital Hierarchy
SONET – Synchronous Optical NETworking
DXC – digital cross-connect
ADM – add-drop multiplexer
OEO – optical-electrical-optical conversion


Equipment availability – WDM system

[Figure: WDM line system – OXC, transponders, amplifiers, cable/fibre]

  • Transponder: MTBF = 400·10^3, MTTR = 6
  • Amplifier: MTBF = 250·10^3, MTTR = 6
  • WDM line system: MTBF = 160·10^3, MTTR = 6
  • WDM OXC (OEO) or OADM: MTBF = 1·10^5, MTTR = 6
  • OXC redundant, 1+1 protected: MTBF = 6·10^6, MTTR = 4
  • Buried cable: MTBF[km] = 2.6·10^5, MTTR = 12
  • Aerial cable: MTBF[km] = 1.75·10^5, MTTR = 6
  • Submarine cables: MTBF[km] = 4.64·10^6, MTTR = 540

WDM – wavelength division multiplexing
OXC – optical cross-connect
OADM – optical add-drop multiplexer


Single WDM lightpath

[Figure: lightpath chain – OXC, transponder, WDM line system, amplifier, cable, WDM line system, transponder, OXC]

  • Transponder: MTBF = 4·10^5, MTTR = 6
  • Amplifier: MTBF = 2.5·10^5, MTTR = 6
  • WDM line system: MTBF = 1.6·10^5, MTTR = 6
  • WDM OXC: MTBF = 1·10^5, MTTR = 6
  • Ground cable (200 km): MTBF[km] = 2.63·10^5, MTTR = 12

Series rule: A = \prod_{i=1}^{m} A_i

A_s-d = A_OXC · A_tr · A_MUX · A_cable · A_amp · A_MUX · A_tr · A_OXC
      = 0.99994 · 0.999985 · 0.9999625 · 0.99087 · 0.999976 · 0.9999625 · 0.999985 · 0.99994
      = 0.99994 · 0.99074 · 0.99994 = 0.99062

≈ 3.65 day/year outage
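The series computation can be reproduced from the MTBF/MTTR figures; this sketch uses the slide's A ≈ 1 − MTTR/MTBF approximation for each element:

```python
# Per-element availability, approximated as 1 - MTTR/MTBF (MTTR << MTBF)
def avail(mtbf_h, mttr_h):
    return 1 - mttr_h / mtbf_h

a_oxc   = avail(1e5, 6)             # 0.99994
a_tr    = avail(4e5, 6)             # 0.999985
a_mux   = avail(1.6e5, 6)           # 0.9999625 (WDM line system)
a_amp   = avail(2.5e5, 6)           # 0.999976
a_cable = avail(2.63e5 / 200, 12)   # 200 km ground cable: ~0.99087

# Series rule: multiply the availabilities along the chain
chain = [a_oxc, a_tr, a_mux, a_cable, a_amp, a_mux, a_tr, a_oxc]
a_sd = 1.0
for a in chain:
    a_sd *= a
print(round(a_sd, 5))               # ~0.99063 (slide quotes 0.99062)
```

The cable dominates: its unavailability (~9·10⁻³) is more than an order of magnitude larger than all the equipment terms combined.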


1+1 Protection (disjoint pair of paths)

  • 200 km lightpath: A = 0.99074

Parallel rule: A = 1 - \prod_{i=1}^{m} (1 - A_i)

A_s-d = A_OXC · [1 − (1 − A_path1)·(1 − A_path2)] · A_OXC
      = 0.99994 · [1 − (1 − 0.99074)·(1 − 0.99074)] · 0.99994 = 0.99979

≈ 53 min/year outage
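The parallel rule computation can be sketched the same way, with the two disjoint lightpaths in parallel and the end OXCs still in series:

```python
# Parallel rule: the group fails only if every parallel element fails
def parallel(*avails):
    u = 1.0
    for a in avails:
        u *= (1 - a)
    return 1 - u

a_oxc  = 0.99994
a_path = 0.99074                 # one 200 km lightpath (previous slide)

a_sd = a_oxc * parallel(a_path, a_path) * a_oxc
print(round(a_sd, 5))            # 0.99979
```

Protection turns the dominant path unavailability (~9·10⁻³) into its square (~8.6·10⁻⁵); the end OXCs, which are not protected here, now contribute a comparable share of the remaining downtime.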


References

  • Dr. Chidung Lac, “Telecommunication network reliability”
  • D. Arci et al., “Availability models for protection techniques in WDM networks”
  • J. Kurose and K. Ross, Computer Networking: A Top-Down Approach Featuring the Internet, 3rd edition, Addison-Wesley, July 2004.
  • J.-P. Vasseur, M. Pickavet, and P. Demeester, Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS, Morgan Kaufmann Publishers, 2004.