Develop Quantitative Reliability Roadmap to Meet Market’s Expectations
Xuemei Zhang, Alcatel-Lucent
April 27, 2007
2 | Reliability Roadmap | April 2007
Introduction
Gaps between a product’s target and current-release availability can arise:
- in early releases of new products
- when a product is deployed in a new scenario, such as a traditionally IP data-only product supporting VoIP or IPTV
- when significant software features or hardware/architecture changes are made
Reliability roadmapping is the best practice for managing closure of an availability gap
Product management owns product roadmaps; reliability roadmaps are a key input to overall product roadmaps
This presentation details what a reliability roadmap is, how to construct one, and how to use that roadmap to manage closure of an availability gap
Outline
The Business Problem and Solution
- New Product Reliability Risk
- New Deployment Scenario Reliability Risk
- New Feature Reliability Risk
- Reliability Roadmap as a Solution
Reliability Roadmap Elements
Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability
Recommendations for Product Managers
Business Problem – New Product Reliability Risk
Market expects 99.999% availability for most of Lucent’s products
- Best practice for assessing market’s availability expectation is given in a companion presentation
Significant risk in achieving 99.999% availability in initial (or early) releases of most products because:
1. Some availability features may have been deferred from initial product release(s) in favor of higher-priority features
2. High availability system configurations (e.g., N+K, duplex controllers) may not be supported in initial release(s) (note: high availability configurations may be required in RFx’s, but not actually be purchased and hence not reflected in business cases)
3. Software may not be sufficiently mature to have a low enough failure rate
4. Software may not be sufficiently mature to have sufficiently effective and efficient automatic failure detection, isolation, alarming and recovery mechanisms
Business Problem – New Deployment Scenario Reliability Risk
As existing products are deployed in new scenarios, they may encounter different availability expectations, thus exposing a gap; for example:
- Network element availability expectations for VoIP and IPTV may be higher than for data-only deployments
- Basestation availability expectations for wireless local-loop may be higher than for typical mobility deployments
Business Problem – New Feature Reliability Risk
As existing products evolve, large, availability-impacting features may be added, such as:
- Adding VoIP or other major capability
- Expanding architecture/configuration (e.g., adding duplex controllers)
- Changing blades or major hardware elements
Significant changes to existing products increase the risk that they will:
1. “degrow” software reliability (increase failure rate), or
2. reduce the system’s ability to effectively detect and isolate failures (lower coverage factor), or
3. add latency to recovery/restart times
- Any of these adds software downtime
- Note: hardware downtime for a particular element typically changes little from release to release, so release-by-release roadmapping of hardware elements is less common
Business Solution: Reliability Roadmap
The risk in purchasing a release of a system that doesn’t currently meet a customer’s availability expectations can be reduced by providing a credible, concrete plan for closing the availability gap in an upcoming release, a.k.a. a “reliability roadmap”
Key elements of a reliability roadmap
1. ‘Ultimate’ quantitative system availability goal(s) and definition
2. Availability estimate of current release and system configuration
3. A target release and system configuration to meet a specific availability level
4. Per-release availability budgets to plausibly close the gap between current release performance and specific availability goal in target release
5. By-release enumeration of features and/or factors that will support this availability growth
Outline
The Business Problem and Solution
Reliability Roadmap Elements
1. Ultimate Availability Goal
2. Estimate Availability of Current Release
3. Specific Release Identified to Meet Goal
4. Per-Release Availability Improvement Targets
5. Per-Release Availability Improvement Features
- Graphical Example
Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability
Recommendations for Product Managers
Roadmap Element 1:
Set “Ultimate” Availability Goal
Availability goals are typically set for annualized minutes of unplanned, supplier-attributable “total” system unavailability (meaning greater than 90% capacity lost)
- Includes both hardware and software downtime, but may exclude planned/scheduled downtime for upgrades, updates, growth, etc.
- Market expectation for most telecom products is about 5.26 down-minutes per year (99.999% availability)
Partial-capacity-loss events are quite common, and thus sophisticated customers may have availability expectations for pro-rated partial-capacity-loss availability
- TL-9000 defines partial capacity loss as greater than 10% capacity loss, but less than 90% capacity loss
Planned unavailability includes system downtime for upgrades, updates, reconfiguration, growth, degrowth, and so on. Sophisticated customers may have clear planned downtime expectations
- Some sophisticated customers (e.g., Nextel) explicitly define their 99.999% availability requirement to include planned events, as well as unplanned events
Quantitatively define exactly what the ‘ultimate’ objective is
- Example: Availability goal for Product A is 99.999% unplanned, supplier-attributable, (partial) pro-rated availability
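The relationship between an availability percentage and annual down-minutes can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and it assumes a 365-day year (some organizations use 365.25 days, which shifts the result slightly).

```python
# Assumption: a 365-day year; this constant is not taken from the slides.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def annual_downtime_minutes(availability: float) -> float:
    """Annual down-minutes implied by an availability fraction such as 0.99999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(down_minutes_per_year: float) -> float:
    """Availability fraction implied by a number of annual down-minutes."""
    if not 0.0 <= down_minutes_per_year <= MINUTES_PER_YEAR:
        raise ValueError("downtime must be between 0 and one year of minutes")
    return 1.0 - down_minutes_per_year / MINUTES_PER_YEAR


# Print the downtime budget implied by three, four, and five nines.
for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.5f} -> {annual_downtime_minutes(target):.2f} min/yr")
```

Expressing the goal in down-minutes rather than "nines" makes it easier to split the goal into per-release budgets later in the roadmap.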
Roadmap Element 2:
Estimate Availability of Current Release
Estimating the availability of the current release of a product provides the baseline availability and helps identify the gap with the market’s availability expectation
The availability of a baseline release can be estimated from:
- Field data, if the release is out in the field and reliable data exists
- Lab data, via system reliability modeling
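As a rough illustration of the field-data route, the sketch below normalizes observed outage minutes by the number of system-years in service. The function name and the sample figures are hypothetical, and a real estimate would also need rules for attributing and pro-rating outages, which are not modeled here.

```python
# Assumption: a 365-day year; sample figures below are invented for illustration.
MINUTES_PER_YEAR = 365 * 24 * 60


def field_availability(outage_minutes, systems_in_service, observation_years):
    """Return (down-minutes per system-year, availability) from field outage data."""
    system_years = systems_in_service * observation_years
    if system_years <= 0:
        raise ValueError("need a positive number of system-years")
    downtime_per_system_year = sum(outage_minutes) / system_years
    availability = 1.0 - downtime_per_system_year / MINUTES_PER_YEAR
    return downtime_per_system_year, availability


# Hypothetical data: four outages observed across 20 systems over half a year.
downtime, availability = field_availability([12, 45, 3, 30], 20, 0.5)
print(f"{downtime:.1f} min per system-year, availability {availability:.6f}")
```

With only a handful of system-years the estimate is statistically weak, which is one reason lab data and modeling are used for recent releases.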
Roadmap Element 3:
Set Specific Release to Meet Target
As with any business objective, explicitly setting a clear scheduled completion goal is essential
Since products are typically planned and managed on a release basis (rather than a calendar basis), we recommend setting a target release
Roadmap Element 4:
Set By-Release Improvement Targets
Based on the availability of the baseline release and the release planned to meet the market expectation, by-release reliability improvement targets can be set to plan the reliability growth.
[Figure: Product A Reliability Roadmap, Linear Growth. Annual downtime (min/yr) plotted by release for RX (actual Release X downtime), R(X+1), R(X+2), R(X+3), and R(X+4).]
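A linear-growth plan like the one charted for Product A can be sketched by interpolating evenly between the current downtime and the goal. The numbers below are hypothetical, since the chart's actual values are not available here, and the function name is an assumption.

```python
def linear_downtime_targets(current_downtime, goal_downtime, releases_to_goal):
    """Per-release downtime targets that shrink by an equal step each release."""
    if releases_to_goal < 1:
        raise ValueError("need at least one release between baseline and goal")
    step = (current_downtime - goal_downtime) / releases_to_goal
    return [current_downtime - i * step for i in range(releases_to_goal + 1)]


# Hypothetical baseline of 25 min/yr at RX, closing to 5 min/yr at R(X+4).
releases = ["RX", "R(X+1)", "R(X+2)", "R(X+3)", "R(X+4)"]
targets = linear_downtime_targets(25.0, 5.0, 4)
for release, target in zip(releases, targets):
    print(f"{release}: {target:.1f} min/yr")
```

Equal steps are only one option; a plan that removes a fixed percentage of downtime per release would front-load the improvement instead.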
Roadmap Element 5:
Set By-Release Feature Investments
Investing in reliability-improving features is often required to achieve high availability in a timely manner.
[Figure: Example Product A reliability roadmap showing release-by-release reliability feature sets for RX, R(X+1), R(X+2), R(X+3), and R(X+4).]
Roadmap Example
[Figure: Product A Reliability Roadmap. Annual downtime (min/yr) by release for RX through R(X+4), annotated with the five roadmap elements:]
Element 1: Set ultimate availability goal
Element 2: Estimate current availability
Element 3: Pick a release to achieve availability goal
Element 4: Set rough per-release targets
Element 5: Set per-release feature investments to achieve availability goal
Availability Improving Features
Product availability improves in 3 general ways
- Maturation of software and support (both service provider and Lucent) reduces software failure rates, shortens outage durations for manually recovered events, and improves reliability of manual maintenance activities
  - This growth is fairly slow, often not keeping pace with reliability degrowth from addition of new features
- Investment in reliability/availability improving features. Broadly, these features address one or more of the following:
  1. Reduce failure rates
  2. Reduce impact of failures
  3. Improve efficiency of failure detection, isolation, alarming and recovery
  4. Shorten recovery latency
  5. Improve Design-for-Serviceability (DfS)
  6. Reduce planned downtime
  7. Policy and other items
- Technology change – products can undergo significant changes in architecture, configuration, hardware or software which can significantly affect availability.
  - Often managed via product’s feature roadmap
Availability Improvement via Reducing Failure Rates
System downtime is typically a linear function of hardware and software failure rates.
Reducing hardware and software failure rates is an efficient way to reduce system downtime
General feature categories
Hardware
- Use lower failure rate components
- Enhance thermal environment
  – Better cooling and lower temperatures mean lower hardware failure rates
Software
- More/better testing
- More mature development processes
- More/better static and dynamic analysis (Purify, lint, “clean” compilations/builds, etc.)
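The linear relationship between failure rates and downtime can be made concrete with a simple sum of rate times outage duration. This is a sketch under simplifying assumptions (independent failure modes, a fixed average outage duration per mode); the class, function, and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    failures_per_year: float    # assumed average failure rate
    minutes_per_failure: float  # assumed average outage duration


def annual_downtime(modes):
    """Downtime (min/yr) as a sum of failure rate times outage duration."""
    return sum(m.failures_per_year * m.minutes_per_failure for m in modes)


# Hypothetical baseline: one hardware and one software failure mode.
baseline = [FailureMode("hardware", 0.5, 20.0), FailureMode("software", 4.0, 3.0)]
# Same system after halving the software failure rate.
improved = [FailureMode("hardware", 0.5, 20.0), FailureMode("software", 2.0, 3.0)]

print(annual_downtime(baseline), annual_downtime(improved))
```

Because each term is a product of rate and duration, halving a failure rate halves that mode's contribution, which is why rate reduction maps so directly onto downtime.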
Availability Improvement via Reducing Impact of Failures
Better product architectures reduce failure group sizes so that a failure impacts the smallest feasible share of system capacity
General approaches
- Redundant hardware elements – so service can be rapidly restored after hardware failure
- Clustering – put resources in a pool and run them in load-sharing mode. The load of a failed unit can be redistributed to other healthy units.
- Partitioning – ‘intelligently’ distribute hardware and software failure rates to minimize pro-rated downtime. For example:
  - Moving high failure rate software modules from the top controller (where a failure would have a large failure footprint) to subordinate linecards (where a failure will have a smaller failure footprint)
  - Separating OA&M software from service-related software so failures in OA&M software don’t cause service downtime
- Application-specific architectures/mechanisms
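The effect of shrinking a failure footprint can be illustrated by weighting each outage by the fraction of capacity it affects. The sketch assumes that relocating the module leaves its aggregate failure rate and recovery time unchanged, which is a simplification; the function name and all figures are hypothetical.

```python
def prorated_downtime(failures_per_year, minutes_per_failure, capacity_fraction_lost):
    """Pro-rated downtime (min/yr): outage minutes weighted by capacity affected."""
    if not 0.0 <= capacity_fraction_lost <= 1.0:
        raise ValueError("capacity_fraction_lost must be between 0 and 1")
    return failures_per_year * minutes_per_failure * capacity_fraction_lost


# Hypothetical module: fails once a year, 10 minutes to recover.
# On the top controller a failure takes out the whole system.
on_controller = prorated_downtime(1.0, 10.0, 1.0)
# Spread across 8 linecards, each failure affects one eighth of capacity.
on_linecards = prorated_downtime(1.0, 10.0, 1.0 / 8)

print(on_controller, on_linecards)
```

If the module were instead replicated on every linecard with its full failure rate, the aggregate rate would rise and the benefit would shrink, so the rate assumption matters as much as the footprint.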
Availability Improvement via Improving Failure Detection… Efficiency
Typically, detected failures can be recovered by fast automatic failovers or restarts
Failures that are not properly automatically detected, isolated and alarmed must be addressed manually, thereby significantly prolonging outage duration
Slowly detected failures also increase downtime
General approaches:
- Timeouts, watchdogs, heartbeats, audits of data integrity, etc.
- Leveraging platform- and OS-provided monitoring and recovery facilities
- Fault insertion testing to validate system’s automatic failure detection, isolation, alarming, and recovery capabilities
- Higher-layer failure detection/integrity monitoring applications
- Application- and protocol-specific techniques, such as Event Correlation Service (RECS) or Reliability Integrity Monitoring (RIM)
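The downtime cost of undetected failures can be sketched with a coverage factor: the fraction of failures that are detected and recovered automatically. The formula below is a common simplification rather than a model taken from the slides, and the parameter names and values are hypothetical.

```python
def downtime_with_coverage(failures_per_year, coverage, auto_minutes, manual_minutes):
    """Annual downtime when a fraction `coverage` of failures recovers automatically.

    Covered failures cost `auto_minutes` each; uncovered failures need manual
    action and cost `manual_minutes` each.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    per_failure = coverage * auto_minutes + (1.0 - coverage) * manual_minutes
    return failures_per_year * per_failure


# Hypothetical system: 4 failures/yr, 0.5 min automatic recovery, 30 min manual.
at_90_percent = downtime_with_coverage(4.0, 0.90, 0.5, 30.0)
at_99_percent = downtime_with_coverage(4.0, 0.99, 0.5, 30.0)
print(f"{at_90_percent:.2f} vs {at_99_percent:.2f} min/yr")
```

When manual recovery is much slower than automatic recovery, the uncovered fraction dominates the total, so modest coverage gains can remove most of the downtime.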
Availability Improvement via Shortening Recovery Latency
Shortening recovery latency for both automatically detected and recovered failures and manual restarts is an efficient approach to reduce system downtime.
Although detected failures typically contribute less system downtime than uncovered failures, faster detection and recovery mechanisms also improve availability, especially under the TL9000 outage discounting rules.
- Failures that cause less than 30 seconds of service disruption are not TL9000 reportable outages, and can thus be excluded.
General approaches:
- Switchover and restart times are often shortened via a combination of optimizations in high-availability middleware and application-specific mechanisms
- Faster hardware can also reduce recovery times
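The effect of a reporting threshold on counted downtime can be illustrated as follows. This is a sketch only: the 30-second default mirrors the figure quoted above, the function name and sample durations are hypothetical, and the authoritative counting rules should be taken from the TL 9000 measurements handbook.

```python
def reportable_seconds(outage_durations, threshold_seconds=30.0):
    """Sum the outage durations that meet or exceed the reporting threshold."""
    return sum(d for d in outage_durations if d >= threshold_seconds)


# Hypothetical outage durations in seconds for one year.
before = [12.0, 45.0, 300.0, 20.0]
# Same events after a failover optimization cuts the 45-second outage to 25 seconds.
after = [12.0, 25.0, 300.0, 20.0]

print(reportable_seconds(before), reportable_seconds(after))
```

Pushing a recovery time under the threshold removes the whole event from the count, not just the seconds saved, which is why recovery latency work can pay off disproportionately.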
Availability Improvement via Design for Serviceability
General approach:
- Follow DfS requirements and guidelines during architecture and design
- LWS-Serviceability Engineering team performs DfS assessment and identifies gaps
- Invest to close serviceability gaps
Availability Improvement via Reducing Planned Downtime
Planned events (such as retrofits, upgrades and updates) often happen more frequently than unplanned downtime events; thus reducing downtime associated with planned events increases system availability
- Acceptability of planned downtime often varies by product category; some markets accept planned downtime if it occurs in a scheduled maintenance window, other markets won’t
General approach is to drive planned events to be less than 15 seconds
- TL9000 guidelines suggest planned service disruptions of 15 seconds or less can be excluded from availability calculations
Various operating-system-, middleware- and application-specific mechanisms are often used to minimize planned downtime for updates, upgrades, retrofits and other planned maintenance events
Availability Improvement via Policy and Other Items
Many items beyond traditional network element design can improve system availability, including:
- Sparing strategies – maintaining an adequate supply of spare FRUs close to network elements can shorten repair times
  - Sparing entire network elements (e.g., “cold standby”) is sometimes appropriate
- Support agreements – having service providers purchase support agreements from key equipment suppliers can shorten outage resolution times
- Training of support engineers – appropriate training of service providers’ maintenance staff can shorten outage resolution times and improve reliability of both planned and on-demand maintenance actions
Reliability Roadmapping Recipe
Initial Effort
1. Product management sets availability target
2. Estimate availability of current product release
3. Based on significant features and changes planned for future releases, estimate likely availability of those future releases
4. Set target release to close (expected) availability gap
5. Identify candidate availability-improving features
- Detailed on next slide
6. Estimate availability benefit of candidate features
7. Product management selects suitable set of candidate availability-improving features to invest in and slots them into specific releases
- Goal is to select the ‘right’ combination of functional, non-functional, and availability-improving features that cost-effectively meets the market’s needs
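Steps 6 and 7 of the recipe can be sketched as a simple ranking exercise. The example below greedily picks candidates by estimated downtime saved per unit of cost until the gap is closed. It is an illustration rather than the method used by the author: it assumes benefits are additive and independent, and the feature names, savings, and costs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    minutes_saved: float  # estimated annual downtime reduction (min/yr)
    cost: float           # relative development cost, arbitrary units


def select_features(candidates, gap_minutes):
    """Pick candidates by benefit per cost until the availability gap is closed.

    Returns the chosen candidates and the downtime gap still remaining.
    """
    chosen = []
    remaining = gap_minutes
    ranked = sorted(candidates, key=lambda c: c.minutes_saved / c.cost, reverse=True)
    for candidate in ranked:
        if remaining <= 0:
            break
        chosen.append(candidate)
        remaining -= candidate.minutes_saved
    return chosen, max(remaining, 0.0)


# Hypothetical candidate list and a 9 min/yr gap to close.
candidates = [
    Candidate("redundant controller", 8.0, 16.0),
    Candidate("watchdog audit", 6.0, 2.0),
    Candidate("faster failover", 4.0, 4.0),
]
chosen, remaining = select_features(candidates, gap_minutes=9.0)
print([c.name for c in chosen], remaining)
```

In practice the chosen set would then be slotted into specific releases and weighed against functional features competing for the same budget.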
Identifying Candidate Features
Candidate reliability/availability-improving features can be identified in several ways
1. Input from System Architects, Developers, Systems Engineers, and Technical Support Engineers; they know many of the weaknesses and areas for improvement in their products
2. Analysis of field outages and lab data
3. High availability techniques assessment tool
4. Targeted downtime reduction analysis, e.g., if there is too much downtime from HW or SW on a particular FRU or module, then focus on ways to minimize that downtime
Refreshing Reliability Roadmap
Recommend refreshing the reliability roadmap for every major release:
1. Gather and analyze latest field data to estimate latest field availability performance and estimate availability parameters
2. Use lab data to estimate availability of most recent product release (since field data probably isn’t available)
3. Re-estimate availability of future releases based on latest observed field and lab data, and latest feature plans
4. If a significant gap appears between updated estimated future availability and baselined by-release availability targets, then:
- Increase investment in availability-improving features or decrease investment in reliability-degrading features and/or
- Revise baselined by-release availability targets (e.g., postpone the release targeted to meet the availability goal, adopt more aggressive availability-improvement plan for future releases) and/or
- Reset availability expectations with customers
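Step 4 of the refresh amounts to comparing re-estimated downtime against the baselined targets release by release. A minimal sketch follows; the function name, the tolerance parameter, and the figures are hypothetical, and what counts as a "significant" gap is a business judgment not defined here.

```python
def find_gaps(targets, estimates, tolerance_minutes=0.0):
    """Return releases whose re-estimated downtime exceeds the baselined target.

    Both arguments map release name to annual downtime (min/yr). The result
    maps each offending release to its shortfall in minutes.
    """
    return {
        release: estimates[release] - target
        for release, target in targets.items()
        if release in estimates and estimates[release] - target > tolerance_minutes
    }


# Hypothetical baselined targets and refreshed estimates.
targets = {"R(X+1)": 20.0, "R(X+2)": 15.0, "R(X+3)": 10.0}
estimates = {"R(X+1)": 19.0, "R(X+2)": 18.0, "R(X+3)": 14.0}

gaps = find_gaps(targets, estimates, tolerance_minutes=1.0)
print(gaps)
```

Each flagged release then calls for one of the three responses listed above: more investment, revised targets, or reset customer expectations.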
Outline
The Business Problem and Solution
Reliability Roadmap Elements
Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability
- Roadmapping Solutions
- Elements of a Solution Availability Roadmap
- Availability-improving Features for Solutions
Recommendations for Product Managers
Roadmapping End-to-End Solutions
New solutions or expanded deployment of existing solutions may have availability gaps compared to incumbent or alternative solutions, such as
- IPTV over DSL v. cable TV
- Wireless local loop v. DSL or cable
- 3G v. 2G; eventually 4G v. 3G
For solution-level roadmapping, it is essential to
- Select one (or more) specific solution configurations to model and analyze
- Select and precisely define the correct solution-level availability metrics
- While “99.999% service availability” may be fairly easy to understand and define for a single network element, it is much harder to precisely characterize “availability” for IMS, 3G or IPTV solutions
Solution Availability Roadmap Elements
Key elements of a solution availability roadmap
1. Precisely define solution availability metrics
2. Define solution deployments/applications to be analyzed
3. Specify ‘ultimate’ quantitative system availability goal(s)
4. Construct appropriate mathematical availability model of solution(s)
5. Insert availability estimates (or actuals) for all elements in solution, and compute resulting solution-level availability
6. Create downtime budget for all network elements that achieves desired availability and is consistent with business considerations
7. Commit availability-improving feature plans to close gaps between network element availabilities and targets required to achieve solution availability
- Note: these features are likely to affect solution architecture/configuration, Lucent-developed products, and partner/OEM/ODM products
8. Construct a by-solution-release view of when availability-improving features will be phased into solution
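Elements 4 and 5 can be illustrated with the simplest possible availability model: elements in series, optionally with redundant copies of an element. The sketch assumes independent failures and perfect failover, which real solutions rarely achieve; the element names and availabilities are hypothetical.

```python
from math import prod


def series_availability(element_availabilities):
    """Availability of a chain in which every element must be up."""
    return prod(element_availabilities)


def redundant_availability(element_availability, copies=2):
    """Availability of `copies` independent elements where any one suffices."""
    return 1.0 - (1.0 - element_availability) ** copies


# Hypothetical three-element solution.
elements = {"access node": 0.9999, "core router": 0.99999, "app server": 0.9995}
baseline = series_availability(elements.values())

# Same solution with the weakest element deployed as a redundant pair.
improved = series_availability([
    elements["access node"],
    elements["core router"],
    redundant_availability(elements["app server"], copies=2),
])
print(f"baseline {baseline:.6f}, improved {improved:.6f}")
```

A series chain is always less available than its weakest element, so the downtime budget in element 6 usually has to ask more of each network element than the solution-level goal suggests.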
Solution Availability-Improving Features
Solution-level availability is improved via the following general techniques:
1. Reconfigure elements in the solution (e.g., add redundant elements or interconnects, make network elements geographically redundant)
2. Increase robustness of end-to-end applications software (e.g., protocol enhancements, reliable/dependable transactions/services)
3. Improve availability of individual network elements
4. More/better network-level testing
5. Replace network element with alternative product (perhaps from alternate supplier)
6. Adopt alternate solution architecture/configuration/protocol (e.g., support distributed elements/protocols rather than standalone elements)
New Challenges - Availability for Solutions that Provide Blended Services
Different perspectives of solution availability:
- End user view
- Service providers’ view
Solution availability metrics:
- Downtime oriented: downtime min/yr, 5 9’s availability, etc.
- Defects oriented: ineffective attempts, cut-off calls, etc.
- Service oriented: service reliability, etc.
- Risk oriented: security related availability, etc.
Analysis complexities:
- Access solutions vs. core
- Control plane vs. traffic plane
- Call processing vs. management visibility
- Application variations
Recommendations for Product Managers
- PdM should own and drive the reliability roadmap process
1. Product management should assign an owner for reliability roadmap (product manager, delegated to SAE, or managed jointly)
2. Product management should set quantitative availability goal and release target to achieve that goal
- E.g., MR/ECO SRD
3. Owner for reliability roadmap (with support from cross-functional team of Architecture, Development, Reliability Team, Systems Engineering or other) should analyze and propose availability-improving features
4. Product management selects the ‘right’ mix of availability-improving features per release and gets those features committed
5. Revisit reliability roadmap for every major release, and make revisions as appropriate