
Recovery Oriented Computing - Dave Patterson, University of California at Berkeley (September 2001)



Slide 1: Recovery Oriented Computing
Dave Patterson
University of California at Berkeley
Patterson@cs.berkeley.edu
http://roc.CS.Berkeley.EDU/
September 2001

Slide 2: Outline
• What have we been doing
• Motivation for a new Challenge: making things work (including endorsements)
• What have we learned
• New Challenge: Recovery-Oriented Computing
• Examples: benchmarks, prototypes

Slide 3: Goals, Assumptions of last 15 years
• Goal #1: Improve performance
• Goal #2: Improve performance
• Goal #3: Improve cost-performance
• Assumptions
 – Humans are perfect (they don't make mistakes during installation, wiring, upgrade, maintenance or repair)
 – Software will eventually be bug free (good programmers write bug-free code)
 – Hardware MTBF is already very large (~100 years between failures), and will continue to increase

Slide 4: After 15 years of improving Performance
• Availability is now a vital metric for servers!
 – near-100% availability is becoming mandatory
  » for e-commerce, enterprise apps, online services, ISPs
 – but, service outages are frequent
  » 65% of IT managers report that their websites were unavailable to customers over a 6-month period
  » 25%: 3 or more outages
 – outage costs are high
  » social effects: negative press, loss of customers who "click over" to competitor
Source: InternetWeek 4/3/2000
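The availability targets on Slide 4 and the ~100-year MTBF assumption on Slide 3 are easier to compare once converted to the same units. The short Python sketch below (not part of the original deck) does that arithmetic; the list of availability levels and the 1,000-component cluster size are illustrative assumptions.

```python
# Back-of-the-envelope arithmetic behind "near-100% availability" (Slide 4)
# and "~100 years between failures" (Slide 3). Illustrative sketch only.

HOURS_PER_YEAR = 24 * 365

def downtime_per_year(availability: float) -> float:
    """Hours of allowed downtime per year for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR

for nines, avail in [("two nines", 0.99), ("three nines", 0.999),
                     ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{nines:>11} ({avail:.5f}): {downtime_per_year(avail):7.2f} hours/year")

# A single component with MTBF ~ 100 years still fails about 1% of the time
# in any given year; a site with 1,000 such components (assumed size) sees
# roughly 10 failures a year, which is why per-component reliability alone
# does not deliver near-100% service availability.
mtbf_years = 100
components = 1000
expected_failures_per_year = components / mtbf_years
print(f"Expected failures/year across {components} components: "
      f"{expected_failures_per_year:.0f}")
```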

Slide 5: Jim Gray: Trouble-Free Systems
• Manager
 – Sets goals
 – Sets policy
 – Sets budget
 – System does the rest.
• Everyone is a CIO (Chief Information Officer)
• Build a system
 – used by millions of people each day
 – Administered and managed by a ½ time person.
  » On hardware fault, order replacement part
  » On overload, order additional equipment
  » Upgrade hardware and software automatically.
"What Next? A dozen remaining IT problems," Turing Award Lecture, FCRC, May 1999. Jim Gray, Microsoft

Slide 6: Downtime Costs (per Hour)
• Brokerage operations: $6,450,000
• Credit card authorization: $2,600,000
• Ebay (1 outage, 22 hours): $225,000
• Amazon.com: $180,000
• Package shipping services: $150,000
• Home shopping channel: $113,000
• Catalog sales center: $90,000
• Airline reservation center: $89,000
• Cellular service activation: $41,000
• On-line network fees: $25,000
• ATM service fees: $14,000
Source: InternetWeek 4/3/2000 + Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p. 8: "...based on a survey done by Contingency Planning Research."

Slide 7: Lampson: Systems Challenges
• Systems that work
 – Meeting their specs
 – Always available
 – Adapting to changing environment
 – Evolving while they run
 – Made from unreliable components
 – Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
 – Understanding when it doesn't matter
"Computer Systems Research - Past and Future," Keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft

Slide 8: Hennessy: What Should the "New World" Focus Be?
• Availability
 – Both appliance & service
• Maintainability
 – Two functions:
  » Enhancing availability by preventing failure
  » Ease of SW and HW upgrades
• Scalability
 – Especially of service
• Cost
 – per device and per service transaction
• Performance
 – Remains important, but it's not SPECint
"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?" Keynote address, FCRC, May 1999. John Hennessy, Stanford
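Slide 6's per-hour figures turn into total outage costs by simple multiplication. A minimal Python sketch of that arithmetic follows; the hourly costs are taken from the slide, while any outage duration other than eBay's quoted 22 hours is an assumed example.

```python
# Rough downtime-cost arithmetic using the per-hour figures on Slide 6
# (InternetWeek 4/3/2000). Illustrative only; a constant hourly cost is assumed.

cost_per_hour = {
    "Brokerage operations": 6_450_000,
    "Credit card authorization": 2_600_000,
    "Ebay": 225_000,
    "Amazon.com": 180_000,
}

def outage_cost(service: str, hours: float) -> float:
    """Estimated cost of an outage at a constant per-hour rate."""
    return cost_per_hour[service] * hours

# The 22-hour eBay outage cited on the slide works out to roughly $5M.
print(f"Ebay, 22 hours: ${outage_cost('Ebay', 22):,.0f}")

# Even a single hour is expensive for high-value services (assumed duration).
print(f"Brokerage, 1 hour: ${outage_cost('Brokerage operations', 1):,.0f}")
```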

Slide 9: The real scalability problems: AME
• Availability
 – systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
 – systems should require only minimal ongoing human administration, regardless of scale or complexity; today, cost of maintenance = 10X cost of purchase
• Evolutionary Growth
 – systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today's scales, and will only get worse as systems grow

Slide 10: Total Cost of Ownership (IBM)
[Pie chart: Purchase 20%, Downtime 20%, Backup/Restore 30%, Administration 13%, Environmental 14%, HW management 3%]
• Administration: all people time
• Backup Restore: devices, media, and people time
• Environmental: floor space, power, air conditioning

Slide 11: Lessons learned from Past Projects which might help AME
• Know how to improve performance (and cost)
 – Run system against workload, measure, innovate, repeat
 – Benchmarks standardize workloads, lead to competition, evaluate alternatives; turns debates into numbers
• Major improvements in Hardware Reliability
 – 1990 disks: 50,000-hour MTBF, to 1,200,000 in 2000
 – PC motherboards from 100,000 to 1,000,000 hours
• Yet everything has an error rate
 – Well designed and manufactured HW: >1% fail/year
 – Well designed and tested SW: >1 bug / 1000 lines
 – Well trained people doing routine tasks: 1%-2%
 – Well run collocation site (e.g., Exodus): 1 power failure per year, 1 network outage per year

Slide 12: Lessons learned from Past Projects for AME
• Maintenance of machines (with state) expensive
 – ~5X to 10X cost of HW
 – Stateless machines can be trivial to maintain (Hotmail)
• System admin primarily keeps system available
 – System + clever human working during failure = uptime
 – Also plan for growth, software upgrades, configuration, fix performance bugs, do backup
• Software upgrades necessary, dangerous
 – SW bugs fixed, new features added, but stability?
 – Admins try to skip upgrades, be the last to use one
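Slide 11's point that "everything has an error rate" is really a statement about scale: a 1% annual failure rate per machine is negligible for one machine and near-certain trouble for a thousand. The sketch below makes that concrete; the 1% rate comes from the slide, while the cluster sizes and the independence assumption are illustrative.

```python
# Why small per-component error rates dominate at scale (Slide 11).
# Assumes independent failures; cluster sizes are example values.

p_fail_per_machine = 0.01  # "well designed and manufactured HW: >1% fail/year"

def p_at_least_one_failure(n_machines: int, p: float = p_fail_per_machine) -> float:
    """Probability that at least one of n independent machines fails in a year."""
    return 1.0 - (1.0 - p) ** n_machines

for n in (1, 10, 100, 1000):
    print(f"{n:5d} machines: P(>=1 failure/year) = {p_at_least_one_failure(n):.3f}")
```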

Slide 13: Lessons learned from Past Projects for AME
Cause of System Crashes (% of crashes):
                                            1985   1993   2001 (est.)
  Other: app, power, network failure         15%    18%    21%
  System management: actions + N/problem     15%    53%    69%
  Operating System failure                   50%    18%     5%
  Hardware failure                           20%    10%     5%
• Failures due to people up, hard to measure
 – VAX crashes '85, '93 [Murp95]; extrap. to '01
 – HW/OS 70% in '85 to 28% in '93. In '01, 10%?
 – How get administrator to admit mistake? (Heisenberg?)

Slide 14: Lessons learned from Internet
• Realities of Internet service environment:
 – hardware and software failures are inevitable
  » hardware reliability still imperfect
  » software reliability thwarted by rapid evolution
  » Internet system scale exposes second-order failure modes
 – system failure modes cannot be modeled or predicted
  » commodity components do not fail cleanly
  » black-box system design thwarts models
  » unanticipated failures are normal
 – human operators are imperfect
  » human error accounts for ~50% of all system failures
Sources: Gray86, Hamilton99, Menn99, Murphy95, Perrow99, Pope86

Slides 15-16: Learning from other fields: PSTN
• "Sources of Failure in the Public Switched Telephone Network," Kuhn
 – FCC Records 1992-1994; IEEE Computer, 30:4 (Apr 97)
 – [Charts: Number of Outages and Minutes of Failure, by cause: Human-company, Human-external, HW failures, Act of Nature, SW failure, Vandalism]
 – Overload (not sufficient switching to lower costs): another 6% of outages, 44% of minutes
• FCC-collected data on outages in the US public-switched telephone network
 – metric: breakdown of customer calls blocked by system outages (excluding natural disasters), Jan-June 2001
 – Human error accounts for 56% of all blocked calls
 – [Pie chart: blocked calls by cause: Human-company, Human-external, Hardware Failure, Software Failure, Overload, Vandalism]
 – comparison with 1992-4 data shows that human error is the only factor that is not improving over time
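Slide 13's claim that the combined HW/OS share of crashes fell from 70% in 1985 to 28% in 1993 (and an estimated 10% in 2001) can be cross-checked against the per-category percentages in the chart above. The sketch below does the addition; treat the per-category values as approximate, since they were read off a partially garbled chart, and note that the 2001 column is the slide's own estimate rather than measured data.

```python
# Cross-check of the trend quoted on Slide 13 ("HW/OS 70% in '85 to 28% in
# '93; in '01, 10%?") against the crash-cause breakdown reconstructed from
# the chart. Values are approximate, read off the chart residue.

crash_causes = {
    #                         1985  1993  2001 (est.)
    "Hardware failure":      (20,   10,    5),
    "Operating System":      (50,   18,    5),
    "System management":     (15,   53,   69),
    "Other (app/power/net)": (15,   18,   21),
}

for i, year in enumerate(("1985", "1993", "2001 (est.)")):
    hw_os = crash_causes["Hardware failure"][i] + crash_causes["Operating System"][i]
    people = crash_causes["System management"][i]
    print(f"{year:>11}: HW+OS = {hw_os:2d}%, system management = {people:2d}%")
```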
