Why do systems fail?
Review studies from 1993-98
Large commercial enterprises
Look to the future
Lisa Spainhower, eServer Technology

Comparative Study 1993
GENRE            EXAMPLE                     DATA SOURCES
Campus-wide LAN  Heterogeneous               Industry Surveys
IBM Mainframe    IBM S/390 9021              Customer Logs; Customer Report; Vendor Claims/IB specs
Unix SMP         HP 9000                     Technical Report; Industry Research
HA Mainframe     IBM S/390 XRF               Customer Data
HA Unix          HP SwitchOver/UX            Above Unix Sources; Vendor Claims; Customer Logs
Fault Tolerant   Tandem Himalaya             Technical Report
????             IBM S/390 Parallel Sysplex  IBM Specs & Models

Comparative Study 1993
                 ANNUAL DOWNTIME  % BY CAUSE
GENRE            (unplanned)      HW     SW     other
Campus-wide LAN  453.6 HRS        na     na     na
Mainframe        18.0             4.4    68.3   27.2
Unix SMP         76.1-136.9       34.3   40.7   25.0
HA Mainframe     5.8              0.9    63.7   35.4
HA Unix          21.5             14.9   70.2   14.9
Fault Tolerant   8.9              18.8   74.1   7.1
FT Mainframe     8.4 MIN          14.3   57.1   28.6

1995 Downtime in a poorly-managed S/390 LPAR
                Impact events  Events
ATTRIBUTION     # OUTAGES      # OUTAGES     IMPACT RATIO
Control Center  70             24            2.9
Environment     18             5             3.6
Hardware        10             1             10
Software        118*           52            2.3
Total           216            82            2.6
                Impact events  Events
ATTRIBUTION     OUTAGE (min)   OUTAGE (min)  IMPACT RATIO
Control Center  5202           1949          2.7
Environment     1275           454           2.8
Hardware        875            88            10
Software        6209           3062          2.0
Total           13561          5553          2.4
*TM-56%, Apps-16%, DBA-14%, OS-6%, other-
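The impact ratios in the table can be reproduced with a short sketch. It assumes, since the slide does not define the term, that the ratio is simply the "Impact events" column divided by the "Events" column. The dictionary and function names are illustrative and not from the presentation.

```python
# Figures copied from the 1995 S/390 LPAR slide.
# Each value is (impact events column, events column).
# ASSUMPTION: impact ratio = impact column / events column.
OUTAGE_COUNTS = {
    "Control Center": (70, 24),
    "Environment": (18, 5),
    "Hardware": (10, 1),
    "Software": (118, 52),
}
OUTAGE_MINUTES = {
    "Control Center": (5202, 1949),
    "Environment": (1275, 454),
    "Hardware": (875, 88),
    "Software": (6209, 3062),
}


def impact_ratios(table):
    """Return {attribution: ratio} for each row, plus a 'Total' row."""
    ratios = {name: impact / events for name, (impact, events) in table.items()}
    total_impact = sum(impact for impact, _ in table.values())
    total_events = sum(events for _, events in table.values())
    ratios["Total"] = total_impact / total_events
    return ratios


if __name__ == "__main__":
    for title, table in (("# outages", OUTAGE_COUNTS), ("minutes", OUTAGE_MINUTES)):
        print(title)
        for name, ratio in impact_ratios(table).items():
            print(f"  {name:<15} {ratio:.1f}")
```

Under this assumption every ratio agrees with the slide after rounding, except that 875/88 for hardware minutes comes out slightly under the 10 shown on the slide.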

1995 Downtime in a poorly-managed S/390 LPAR
Total outage per log: 226 hours
Per one outage/event: 93 hours
#1 contributor is software (product & process), in minutes:
  1818 process
  453 product
  791 uncertain
  3062 total (51 hours)
Assume all CC outages are process (1949 min.; 32 hours)
68%-82% of all unscheduled outages are process (63-76 hours)
Technology (HW/SW): 10-24% (9-22 hours)
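The process and technology ranges on this slide can be rederived from the per-event minutes. The sketch below is one reading of the arithmetic, not the author's stated method. It assumes that the low end of each range excludes the 791 "uncertain" software minutes and the high end includes them, and that the technology share is the 88 hardware minutes plus the software product minutes. Variable names are illustrative.

```python
MIN_PER_HOUR = 60

# Per-event outage minutes from the 1995 S/390 LPAR slides.
TOTAL_MIN = 5553           # all events
CONTROL_CENTER_MIN = 1949  # assumed to be entirely process, as the slide says
HARDWARE_MIN = 88
SOFTWARE_MIN = {"process": 1818, "product": 453, "uncertain": 791}

software_total = sum(SOFTWARE_MIN.values())

# ASSUMPTION: low bound leaves out the "uncertain" minutes, high bound adds them.
process_low = SOFTWARE_MIN["process"] + CONTROL_CENTER_MIN
process_high = process_low + SOFTWARE_MIN["uncertain"]
tech_low = HARDWARE_MIN + SOFTWARE_MIN["product"]
tech_high = tech_low + SOFTWARE_MIN["uncertain"]


def share(minutes):
    """Percentage of total per-event outage minutes."""
    return 100 * minutes / TOTAL_MIN


def hours(minutes):
    return minutes / MIN_PER_HOUR


if __name__ == "__main__":
    print(f"software total: {software_total} min ({hours(software_total):.0f} h)")
    print(f"process: {share(process_low):.0f}%-{share(process_high):.0f}% "
          f"({hours(process_low):.0f}-{hours(process_high):.0f} h)")
    print(f"technology: {share(tech_low):.0f}%-{share(tech_high):.0f}% "
          f"({hours(tech_low):.0f}-{hours(tech_high):.0f} h)")
```

Under these assumptions the rounded results match the slide's 68%-82% and 10-24%. The 454 environment minutes fall into neither bucket, which is why the two ranges do not sum to 100%.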

Aggregated UNIX server data
Downtime Cause %  UNIX Standalone  UNIX Cluster
Hardware          42               46
Software          34               36
Other             24               18
Data from a very large well-managed Unix customer
% Unplanned downtime
  HW             43.8
  OS             7.8
  App            7.3
  Com link       18.8
  DB             2.0
  Environment    0.7
  Supplier       0
  Op tools       1.8
  Process        6.2
  Org/structure  0.8
  Human error    5.0
  Other          5.8
% Planned downtime
  HW maint       26.5
  OS install     1.6
  App Release    34
  Com link       2.5
  DB admin, BU   31.5
  Dis Rec Test   0
  Pwr Test       0.2
  Other          3.7

1998 Unix Investigation
GG claims Unix HA/FT clusters limited to 99.9% availability until 2000.
HP has 99.95% guarantee; announced plan for 5 9s in 2000.
Objective: Determine achievable availability for Unix for 98-00
Major Limitation: Very limited, inconsistent data available
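To put the availability percentages on this slide in concrete terms, the following sketch converts an availability figure into allowed downtime per year. It assumes a 365-day year and continuous (24x7) operation. The function and constant names are illustrative.

```python
# ASSUMPTION: a 365-day year of continuous operation.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent):
    """Yearly downtime (minutes) allowed by a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Availability levels named on the slide: 99.9%, 99.95%, and "5 9s".
    for pct in (99.9, 99.95, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% available: about {minutes:.1f} min/yr "
              f"({minutes / 60:.2f} h/yr)")
```

By this measure, 99.9% allows roughly 8.8 hours of downtime a year, 99.95% roughly 4.4 hours, and five 9s a little over 5 minutes.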

1998 Unix Investigation
                   Customer  DH Brown
# clusters/nodes   11/63     15/38
unsched. avail.    99.6%     99.99%
failover/yr/node   2.4       1.4
downtime/failover  >1 hr     15 min
node failover: 4-20 min
With improved HW: 8.6-76 min/yr/2 nodes
Software retry will also improve
Measured as HP specifies, 4 9s is feasible with 5 9s for OPS and Web server

Lessons from the 90s
Management discipline is critical to HA
Fault tolerant servers make a difference
Clusters are difficult to implement

Making the Front Page
Source: Gartner Group
Charles Schwab & Co.
  24 February 1999 through 21 April 1999: Four outages of at least four hours
  Upgrades/Operator Errors
  Cost: ???; Announced that it had made $70 million in new infrastructure investment.
eBay
  12 June 1999 outage: 22 hours
  Operating System Failure
  Cost: $3 million to $5 million revenue hit; 26% decline in stock price
AT&T
  13 April 1998 outage: Six to 26 hours
  Software Upgrade
  Cost: $40 million in rebates
  Forced to file SLAs with the FCC (frame relay)
Dev. Bank of Singapore
  1 July 1999 to August 1999: Processing Errors
  Incorrect debiting of POS due to a system overload
  Cost: Embarrassment/loss of integrity; interest charges
E*Trade
  3 February 1999 through 3 March 1999: Four outages of at least five hours
  System Upgrades
  Cost: ???? 22 percent stock price hit on 5 February 1999
America Online
  6 August 1996 outage: 24 hours
  Maintenance/Human Error
  Cost: $3 million in rebates
  Investment: ???
[Chart: Causes of Unplanned Application Downtime. Segments labeled Technology Failures, Operator Errors, and Application Failures; values shown are 20%, 40%, and 40%.]

Managing Exploding e-business Infrastructure
[Chart: Price over Time. Labels: Skills Shortage, Complexity, New Workloads, Services and Software Costs.]

Challenges for the 00s
Increased importance of firmware
Circuit failure mechanisms
State encapsulation
On-the-fly change
Dynamic resource allocation
Configuration validation
