SLIDE 1

Develop Quantitative Reliability Roadmap to Meet Market’s Expectations

Xuemei Zhang Alcatel-Lucent April 27, 2007

SLIDE 2

2 | Reliability Roadmap | April 2007

Introduction

Gaps between a product’s target and current-release availability can arise:

in early releases of new products

when a product is deployed in a new scenario, such as supporting VoIP or IPTV on a traditionally IP data-only product

when significant software features or hardware/architecture changes are made

Reliability roadmapping is the best practice for managing closure of an availability gap

Product management owns product roadmaps; reliability roadmaps are a key input to overall product roadmaps

This presentation details what a reliability roadmap is, how to construct one, and how to use that roadmap to manage closure of an availability gap

SLIDE 3

Outline

The Business Problem and Solution

New Product Reliability Risk
New Deployment Scenario Reliability Risk
New Feature Reliability Risk
Reliability Roadmap as a Solution

Reliability Roadmap Elements
Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability
Recommendations for Product Managers

SLIDE 4

Business Problem – New Product Reliability Risk

Market expects 99.999% availability for most of Lucent’s products

Best practice for assessing the market’s availability expectation is given in a companion presentation

There is significant risk in achieving 99.999% availability in initial (or early) releases of most products because:

1. Some availability features may have been deferred from initial product release(s) in favor of higher-priority features
2. High-availability system configurations (e.g., N+K, duplex controllers) may not be supported in initial release(s) (note: high-availability configurations may be required in RFx’s, but not actually be purchased and hence not reflected in business cases)
3. Software may not be sufficiently mature to have a low enough failure rate
4. Software may not be sufficiently mature to have sufficiently effective and efficient automatic failure detection, isolation, alarming and recovery mechanisms

SLIDE 5

Business Problem – New Deployment Scenario Reliability Risk

As existing products are deployed in new scenarios, they may encounter different availability expectations, thus exposing a gap; for example

Network element availability expectations for VoIP and IPTV may be higher than for data-only deployments

Basestation availability expectations for wireless local-loop may be higher than for typical mobility deployments

SLIDE 6

Business Problem – New Feature Reliability Risk

As existing products evolve, large, availability-impacting features may be added, such as:

Adding VoIP or other major capability
Expanding architecture/configuration (e.g., adding duplex controllers)
Changing blades or major hardware elements

Significant changes to existing products risk:

1. “degrowing” software reliability (increasing the failure rate), or
2. reducing the system’s ability to effectively detect and isolate failures (lowering the coverage factor), or
3. adding latency to recovery/restart times

thus adding software downtime

Note: hardware downtime for a particular element typically changes little from release to release, so release-by-release roadmapping of hardware elements is less common

SLIDE 7

Business Solution: Reliability Roadmap

The risk in purchasing a release of a system that doesn’t currently meet a customer’s availability expectations can be reduced by providing a credible, concrete plan for closing the availability gap in an upcoming release, a.k.a. a “reliability roadmap”

Key elements of a reliability roadmap

1. ‘Ultimate’ quantitative system availability goal(s) and definition
2. Availability estimate of current release and system configuration
3. A target release and system configuration to meet a specific availability level
4. Per-release availability budgets to plausibly close the gap between current release performance and the specific availability goal in the target release
5. By-release enumeration of features and/or factors that will support this availability growth

SLIDE 8

Outline

The Business Problem and Solution
Reliability Roadmap Elements

1. Ultimate Availability Goal
2. Estimate Availability of Current Release
3. Specific Release Identified to Meet Goal
4. Per-Release Availability Improvement Targets
5. Per-Release Availability Improvement Features

Graphical Example

Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability
Recommendations for Product Managers

SLIDE 9

Roadmap Element 1: Set “Ultimate” Availability Goal

Availability goals are typically set for annualized minutes of unplanned, supplier-attributable “total” system unavailability (meaning greater than 90% capacity lost)

Includes both hardware and software downtime, but may exclude planned/scheduled downtime for upgrades, updates, growth, etc.

Market expectation for most telecom products is about 5.26 down-minutes per year (99.999% availability)

Partial-capacity-loss events are quite common, and thus sophisticated customers may have availability expectations for pro-rated partial-capacity-loss availability

TL 9000 defines partial capacity loss to be greater than 10% capacity loss, but less than 90% capacity loss

Planned unavailability includes system downtime for upgrades, updates, reconfiguration, growth, degrowth, and so on. Sophisticated customers may have clear planned downtime expectations

Some sophisticated customers (e.g., Nextel) explicitly define their 99.999% availability requirement to include planned events, as well as unplanned events

Quantitatively define exactly what the ‘ultimate’ objective is

Example: Availability goal for Product A is 99.999% unplanned, supplier-attributable, (partial) pro-rated availability
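
As a quick check on the five-nines figure, the conversion between an availability percentage and annual down-minutes can be sketched as follows. This is a minimal illustration, not part of the original deck: the function names are hypothetical, a 365-day year is assumed, and pro-rating an outage by the fraction of capacity lost is one common convention rather than a rule taken from the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; a 365-day year is an assumption


def annual_downtime_minutes(availability: float) -> float:
    """Convert an availability fraction (e.g. 0.99999) to down-minutes per year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(downtime_minutes: float) -> float:
    """Convert annual down-minutes back to an availability fraction."""
    return 1.0 - downtime_minutes / MINUTES_PER_YEAR


def prorated_outage_minutes(duration_minutes: float, capacity_lost: float) -> float:
    """Pro-rate an outage by the fraction of capacity lost (0.0 to 1.0).

    Assumed convention: a 20-minute outage affecting 25% of capacity
    counts as 5 pro-rated down-minutes.
    """
    if not 0.0 <= capacity_lost <= 1.0:
        raise ValueError("capacity_lost must be between 0 and 1")
    return duration_minutes * capacity_lost


if __name__ == "__main__":
    print(f"99.999% -> {annual_downtime_minutes(0.99999):.3f} min/yr")
    print(f"20 min at 25% capacity loss -> {prorated_outage_minutes(20.0, 0.25)} min")
```

Stating the goal in down-minutes rather than "nines" makes it easier to see how much of the annual budget a single partial outage consumes.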

SLIDE 10

Roadmap Element 2: Estimate Availability of Current Release

Estimating the availability of the current release of a product provides the baseline availability and helps identify the gap with the market’s availability expectation

The availability of a baseline release can be estimated from:

Field data, if the release is out in the field and reliable data exists
Lab data, via system reliability modeling
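
A field-data baseline can be sketched roughly as below. The record layout, the names, and the sample numbers are all hypothetical; the sketch assumes outages are pro-rated by the fraction of capacity lost and normalized by the system-years of service accumulated across the deployed base.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


@dataclass
class Outage:
    duration_minutes: float
    capacity_lost: float  # fraction of system capacity affected, 0.0 to 1.0


def field_downtime_per_system_year(outages, system_years):
    """Pro-rated down-minutes per system per year across the deployed base."""
    if system_years <= 0:
        raise ValueError("system_years must be positive")
    total = sum(o.duration_minutes * o.capacity_lost for o in outages)
    return total / system_years


def availability(downtime_min_per_year):
    """Availability fraction implied by an annual downtime figure."""
    return 1.0 - downtime_min_per_year / MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical field data: three outages observed over 10 system-years.
    observed = [Outage(30.0, 1.0), Outage(60.0, 0.5), Outage(10.0, 1.0)]
    downtime = field_downtime_per_system_year(observed, system_years=10.0)
    print(f"{downtime:.1f} min/yr, availability {availability(downtime):.6f}")
```

With few system-years of exposure the estimate is statistically weak, which is one reason lab data and modeling are used for recent releases.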

SLIDE 11

Roadmap Element 3: Set Specific Release to Meet Target

As with any business objective, explicitly setting a clear scheduled completion goal is essential

Since products are typically planned and managed on a release basis (rather than a calendar basis), recommend setting a target release

SLIDE 12

Roadmap Element 4: Set By-Release Improvement Targets

Based on the availability of the baseline release and the release planned to meet the market expectation, by-release reliability improvement targets can be set to plan the reliability growth.

[Figure: Product A Reliability Roadmap, Linear Growth. x-axis: Release, RX through R(X+4); y-axis: Annual Downtime (min/yr); starting point: Actual Release X Downtime]
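
A linear-growth budget of the kind plotted for Product A can be sketched as a straight-line interpolation between the current downtime and the goal. The starting value of 25 min/yr is a made-up illustration, not a figure from the deck, and the function name is hypothetical.

```python
def linear_downtime_targets(current, goal, releases_to_goal):
    """Per-release annual downtime targets (min/yr), from the current release
    down to the goal, reduced by an equal amount in each release."""
    if releases_to_goal < 1:
        raise ValueError("releases_to_goal must be at least 1")
    step = (current - goal) / releases_to_goal
    return [current - step * i for i in range(releases_to_goal + 1)]


if __name__ == "__main__":
    # Hypothetical: release X measured at 25 min/yr, five nines due in R(X+4).
    targets = linear_downtime_targets(current=25.0, goal=5.256, releases_to_goal=4)
    labels = ["RX"] + [f"R(X+{i})" for i in range(1, 5)]
    for label, minutes in zip(labels, targets):
        print(f"{label}: {minutes:.2f} min/yr")
```

A straight line is only one possible shape; a front-loaded or back-loaded budget may fit the feature plan better.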

SLIDE 13

Roadmap Element 5: Set By-Release Feature Investments

Investing in reliability-improving features is often required to achieve high availability in a timely manner.

Example: Product A reliability roadmap

[Figure: Release-by-Release Reliability Feature Sets, for releases RX through R(X+4)]

SLIDE 14

Roadmap Example

[Figure: Product A Reliability Roadmap. x-axis: Release, RX through R(X+4); y-axis: Annual Downtime (min/yr). Callouts:
Element 1: Set ultimate availability goal
Element 2: Estimate current availability
Element 3: Pick a release to achieve availability goal
Element 4: Set rough per-release targets
Element 5: Set per-release feature investments to achieve availability goal]

SLIDE 15

Outline

The Business Problem and Solution
Reliability Roadmap Elements
Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability
Recommendations for Product Managers

SLIDE 16

Availability Improving Features

Product availability improves in 3 general ways

Maturation of software and support (both service provider and Lucent) reduces software failure rates, shortens outage durations for manually-recovered events, and improves reliability of manual maintenance activities

This growth is fairly slow, often not keeping pace with reliability degrowth from addition of new features

Investment in reliability/availability-improving features. Broadly, these features address one or more of the following:

1. Reduce failure rates
2. Reduce impact of failures
3. Improve efficiency of failure detection, isolation, alarming and recovery
4. Shorten recovery latency
5. Improve Design-for-Serviceability (DfS)
6. Reduce planned downtime
7. Policy and other items

Technology change – products can undergo significant changes in architecture, configuration, hardware or software which can significantly affect availability.

Often managed via product’s feature roadmap

SLIDE 17

Availability Improvement via Reducing Failure Rates

System downtime is typically a linear function of hardware and software failure rates.

Reducing hardware and software failure rates is an efficient way to reduce system downtime

General feature categories

Hardware

Use lower failure rate components
Enhance thermal environment
– Better cooling and lower temperatures mean lower hardware failure rates

Software

More/better testing
More mature development processes
More/better static and dynamic analysis (Purify, lint, “clean” compilations/builds, etc.)
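
The linear relationship between failure rates and downtime can be illustrated with a deliberately simple model: annual downtime is the sum, over failure modes, of failure rate times mean outage duration. The model form, the function name, and all numbers below are assumptions for illustration only.

```python
def annual_downtime(failure_modes):
    """Sum of (failures per year) x (minutes of outage per failure).

    failure_modes: iterable of (failures_per_year, minutes_per_failure) pairs.
    """
    return sum(rate * minutes for rate, minutes in failure_modes)


if __name__ == "__main__":
    # Hypothetical rates: hardware fails rarely but takes long to repair,
    # software fails more often but restarts quickly.
    hardware = (0.1, 20.0)
    software = (2.0, 3.0)
    baseline = annual_downtime([hardware, software])
    improved = annual_downtime([hardware, (software[0] / 2, software[1])])
    print(f"baseline: {baseline:.1f} min/yr, software rate halved: {improved:.1f} min/yr")
```

Under this model, halving a failure rate halves that mode's contribution to downtime, which is why failure-rate reduction shows up directly in the budget.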

SLIDE 18

Availability Improvement via Reducing Impact of Failures

Better product architectures will reduce failure group sizes so failures impact the smallest feasible system capacity

General approaches

Redundant hardware elements – so service can be rapidly restored after hardware failure

Clustering – put resources in a pool and run in load-sharing mode. The load of a failed unit can be redistributed to other healthy units.

Partitioning – ‘intelligently’ distribute hardware and software failure rates to minimize pro-rated downtime. For example:

Moving high failure rate software modules from the top controller (where a failure would have a large failure footprint) to subordinate linecards (where a failure will have a smaller failure footprint)

Separating OA&M software from service-related software so failures in OA&M software don’t cause service downtime

Application-specific architectures/mechanisms

SLIDE 19

Availability Improvement via Improving Failure Detection… Efficiency

Typically, detected failures can be recovered by fast automatic failovers or restarts

Failures that are not properly automatically detected, isolated and alarmed must be addressed manually, thereby significantly prolonging outage duration

Slowly detected failures also increase downtime

General approaches:

Timeouts, watchdogs, heartbeats, audits of data integrity, etc.
Leveraging platform- and OS-provided monitoring and recovery facilities
Fault insertion testing to validate system’s automatic failure detection, isolation, alarming, and recovery capabilities
Higher-layer failure detection/integrity monitoring applications
Application- and protocol-specific techniques, such as Event Correlation Service (RECS) or Reliability Integrity Monitoring (RIM)
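
The effect of detection coverage on downtime can be sketched with a two-branch model: a covered failure is recovered automatically and quickly, an uncovered one waits for manual recovery. The formula is a common simplification, not taken from the slides, and the rates and durations are hypothetical.

```python
def downtime_with_coverage(failures_per_year, coverage,
                           auto_recovery_min, manual_recovery_min):
    """Annual downtime (min/yr) when a fraction `coverage` of failures is
    detected and recovered automatically and the rest is handled manually."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    covered = coverage * auto_recovery_min
    uncovered = (1.0 - coverage) * manual_recovery_min
    return failures_per_year * (covered + uncovered)


if __name__ == "__main__":
    # Hypothetical: 2 failures/yr, 0.5 min automatic failover, 30 min manual recovery.
    for c in (0.90, 0.99):
        minutes = downtime_with_coverage(2.0, c, 0.5, 30.0)
        print(f"coverage {c:.0%}: {minutes:.2f} min/yr")
```

Because manual recovery is so much slower, the uncovered branch dominates even when only a small share of failures goes undetected.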

SLIDE 20

Availability Improvement via Shortening Recovery Latency

Shortening recovery latency for both automatically detected and recovered failures and manual restarts is an efficient approach to reducing system downtime.

Although detected failures typically contribute less system downtime than uncovered failures, faster detection and recovery mechanisms also improve availability, especially under the TL 9000 outage discounting rules.

Failures that cause less than 30 seconds of service disruption are not TL 9000 reportable outages, and can thus be excluded.

General approaches:

Switchover and restart times are often shortened via a combination of optimizations in high-availability middleware and application-specific mechanisms

Faster hardware can also reduce recovery times
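
The benefit of pushing recovery times under the reporting threshold can be sketched as below. The 30-second threshold is the one stated on this slide; the function name and the sample durations are hypothetical, and the actual counting rules should be taken from the TL 9000 measurements documentation rather than from this sketch.

```python
REPORTABLE_THRESHOLD_SECONDS = 30.0  # threshold as stated on the slide


def reportable_downtime_seconds(disruption_durations):
    """Total duration of disruptions that reach the reporting threshold.

    Disruptions shorter than the threshold are excluded entirely.
    """
    return sum(d for d in disruption_durations
               if d >= REPORTABLE_THRESHOLD_SECONDS)


if __name__ == "__main__":
    # Hypothetical disruptions in seconds, before and after a switchover optimization
    # that shortens the 45-second event to 20 seconds.
    before = [12, 45, 28, 300]
    after = [12, 20, 28, 300]
    print(reportable_downtime_seconds(before), reportable_downtime_seconds(after))
```

Shortening an event from just above the threshold to just below it removes the whole event from the reported total, not only the seconds saved.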

SLIDE 21

Availability Improvement via Design for Serviceability

General approach:

Follow DfS requirements and guidelines during architecture and design
LWS-Serviceability Engineering team performs DfS assessment and identifies gaps
Invest to close serviceability gaps

SLIDE 22

Availability Improvement via Reducing Planned Downtime

Planned events, such as retrofits, upgrades and updates, often happen more frequently than unplanned downtime events; thus reducing downtime associated with planned events increases system availability

Acceptability of planned downtime often varies by product category; some markets accept planned downtime if it occurs in a scheduled maintenance window, other markets won’t

General approach is to drive planned service disruptions to 15 seconds or less

TL 9000 guidelines suggest planned service disruptions of 15 seconds or less can be excluded from availability calculations

Various operating-system-, middleware- and application-specific mechanisms are often used to minimize planned downtime for updates, upgrades, retrofits and other planned maintenance events

SLIDE 23

Availability Improvement via Policy and Other Items

Many items beyond traditional network element design can improve system availability, including:

Sparing strategies – maintaining an adequate supply of spare FRUs close to network elements can shorten repair times

Sparing entire network elements (e.g., “cold standby”) is sometimes appropriate

Support agreements – having service providers purchase support agreements from key equipment suppliers can shorten outage resolution times

Training of support engineers – appropriate training of service providers’ maintenance staff can shorten outage resolution times and improve reliability of both planned and on-demand maintenance actions

SLIDE 24

Outline

The Business Problem and Solution
Reliability Roadmap Elements
Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability
Recommendations for Product Managers

SLIDE 25

Reliability Roadmapping Recipe

Initial Effort

1. Product management sets availability target
2. Estimate availability of current product release
3. Based on significant features and changes planned for future releases, estimate likely availability of those future releases
4. Set target release to close (expected) availability gap
5. Identify candidate availability-improving features (detailed on next slide)
6. Estimate availability benefit of candidate features
7. Product management selects a suitable set of candidate availability-improving features to invest in and slots them into specific releases

Goal is to select the ‘right’ combination of functional, non-functional, and availability-improving features that cost-effectively meets the market’s needs
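
Steps 6 and 7 of the recipe can be illustrated with a simple ranking heuristic: order candidates by estimated downtime saved per unit of cost and pick until the gap is closed. This greedy rule is an assumption for illustration, not the selection method of the original deck; it also assumes feature benefits are additive and every cost is positive. The feature names and numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    downtime_saved: float  # estimated min/yr removed by the feature
    cost: float            # development cost in any consistent unit, must be > 0


def select_features(candidates, gap_minutes):
    """Greedy pick by downtime saved per unit cost until the gap is closed.

    Returns the chosen candidates and any gap (min/yr) still remaining.
    """
    chosen = []
    remaining = gap_minutes
    ranked = sorted(candidates, key=lambda c: c.downtime_saved / c.cost, reverse=True)
    for candidate in ranked:
        if remaining <= 0:
            break
        chosen.append(candidate)
        remaining -= candidate.downtime_saved
    return chosen, max(remaining, 0.0)


if __name__ == "__main__":
    # Hypothetical candidates and a hypothetical 10 min/yr gap.
    pool = [
        Candidate("faster failover", 6.0, 3.0),
        Candidate("watchdog audit", 3.0, 1.0),
        Candidate("in-service upgrade", 4.0, 4.0),
        Candidate("thermal redesign", 2.0, 4.0),
    ]
    picked, left = select_features(pool, gap_minutes=10.0)
    print([c.name for c in picked], f"remaining gap: {left:.1f} min/yr")
```

In practice the selection also weighs functional and non-functional features competing for the same release capacity, which a single ratio does not capture.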

SLIDE 26

Identifying Candidate Features

Candidate reliability/availability-improving features can be identified in several ways

1. Input from System Architects, Developers, Systems Engineers, and Technical Support Engineers… they know many of the weaknesses and areas-for-improvement in their products
2. Analysis of field outages and lab data
3. High availability techniques assessment tool

Targeted downtime reduction analysis, e.g., if there is too much downtime from HW or SW on a particular FRU or module, then focus on ways to minimize that downtime

SLIDE 27

Refreshing Reliability Roadmap

Recommend refreshing the reliability roadmap for every major release by…

1. Gathering and analyzing the latest field data to estimate the latest field availability performance and availability parameters
2. Using lab data to estimate availability of the most recent product release (since field data probably isn’t available)
3. Re-estimating availability of future releases based on the latest observed field and lab data, and latest feature plans
4. If a significant gap appears between updated estimated future availability and baselined by-release availability targets, then…

Increase investment in availability-improving features or decrease investment in reliability-degrading features and/or

Revise baselined by-release availability targets (e.g., postpone the target release for meeting the availability goal, adopt a more aggressive availability-improvement plan for future releases) and/or

Reset availability expectations with customers
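
The gap check in step 4 can be sketched as a comparison of re-estimated downtime against the baselined per-release targets. The tolerance parameter, the function name, and all numbers are hypothetical; what counts as a "significant" gap is a business decision the slides leave open.

```python
def roadmap_gaps(targets, estimates, tolerance=0.0):
    """Releases whose re-estimated downtime exceeds the baselined target.

    targets, estimates: dicts mapping release name to annual downtime (min/yr).
    Returns {release: excess minutes} where the excess is above `tolerance`.
    """
    return {
        release: estimates[release] - target
        for release, target in targets.items()
        if release in estimates and estimates[release] - target > tolerance
    }


if __name__ == "__main__":
    # Hypothetical baselined targets and refreshed estimates (min/yr).
    baselined = {"R(X+1)": 20.0, "R(X+2)": 15.0, "R(X+3)": 10.0}
    refreshed = {"R(X+1)": 19.0, "R(X+2)": 18.0, "R(X+3)": 14.0}
    print(roadmap_gaps(baselined, refreshed, tolerance=1.0))
```

Any release flagged here is a prompt for one of the three responses listed above: more investment, revised targets, or reset expectations.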

SLIDE 28

Outline

The Business Problem and Solution
Reliability Roadmap Elements
Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability

Roadmapping Solutions
Elements of a Solution Availability Roadmap
Availability-improving Features for Solutions

Recommendations for Product Managers

SLIDE 29

Roadmapping End-to-End Solutions

New solutions or expanded deployment of existing solutions may have availability gaps compared to incumbent or alternative solutions, such as

IPTV over DSL v. cable TV
Wireless local loop v. DSL or cable
3G v. 2G; eventually 4G v. 3G

For solution-level roadmapping, it is essential to

Select one (or more) specific solution configurations to model and analyze
Select and precisely define the correct solution-level availability metrics

While “99.999% service availability” may be fairly easy to understand and define for a single network element, it is much harder to precisely characterize “availability” for IMS, 3G or IPTV solutions

SLIDE 30

Solution Availability Roadmap Elements

Key elements of a solution availability roadmap

1. Precisely define solution availability metrics
2. Define solution deployments/applications to be analyzed
3. Specify ‘ultimate’ quantitative system availability goal(s)
4. Construct appropriate mathematical availability model of solution(s)
5. Insert availability estimates (or actuals) for all elements in solution, and compute resulting solution-level availability
6. Create downtime budget for all network elements that achieves desired availability and is consistent with business considerations
7. Commit availability-improving feature plans to close gaps between network element availabilities and targets required to achieve solution availability

Note: these features are likely to affect solution architecture/configuration, Lucent-developed products and partner/OEM/ODM products

8. Construct a by-solution-release view of when availability-improving features will be phased into solution
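
Steps 4 through 6 can be illustrated with the simplest possible solution model: a series chain in which every element must be up for the service to be up, with independent failures. Real solutions usually need a richer model; the series assumption, the even budget split, the function names, and the element availabilities below are all hypothetical.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def series_availability(element_availabilities):
    """Solution availability when every element is required (series model)."""
    return prod(element_availabilities)


def equal_element_target(solution_goal, element_count):
    """Per-element availability needed if the budget is split evenly."""
    if element_count < 1:
        raise ValueError("element_count must be at least 1")
    return solution_goal ** (1.0 / element_count)


if __name__ == "__main__":
    # Hypothetical three-element solution.
    elements = {"access": 0.99999, "core": 0.99999, "application server": 0.99995}
    solution = series_availability(elements.values())
    print(f"solution availability: {solution:.6f}")
    print(f"solution downtime: {(1 - solution) * MINUTES_PER_YEAR:.1f} min/yr")
    print(f"even per-element target for 99.999%: {equal_element_target(0.99999, 3):.7f}")
```

In a series chain the solution is always less available than its weakest element, so a five-nines solution requires each element to do better than five nines.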

SLIDE 31

Solution Availability-Improving Features

Solution-level availability is improved via the following general techniques:

1. Reconfigure elements in the solution (e.g., add redundant elements or interconnects, make network elements geographically redundant)
2. Increase robustness of end-to-end applications software (e.g., protocol enhancements, reliable/dependable transactions/services)
3. Improve availability of individual network elements
4. More/better network-level testing
5. Replace network element with alternative product (perhaps from alternate supplier)
6. Adopt alternate solution architecture/configuration/protocol (e.g., support distributed elements/protocols rather than standalone elements)
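
The gain from the first technique, adding redundant elements, can be sketched with the textbook parallel-availability formula. It assumes independent failures and instantaneous, perfect failover, both of which are optimistic; the function name and the 99.9% example are hypothetical.

```python
def redundant_availability(unit_availability, units):
    """Availability of `units` identical elements in parallel, where service
    is up as long as at least one unit is up (independent failures assumed)."""
    if units < 1:
        raise ValueError("units must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    # Hypothetical: a single element at 99.9%, then a redundant pair.
    for n in (1, 2):
        print(f"{n} unit(s): {redundant_availability(0.999, n):.6f}")
```

Imperfect failure detection and nonzero switchover time reduce the real benefit, which is why redundancy is usually paired with the coverage and recovery-latency work described earlier.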

SLIDE 32

New Challenges - Availability for Solutions that Provide Blended Services

Different perspectives of solution availability:

End user view
Service providers’ view

Solution availability metrics

Downtime oriented: downtime min/yr, 5 9’s availability, etc.
Defects oriented: ineffective attempts, cut-off calls, etc.
Service oriented: service reliability, etc.
Risk oriented: security related availability, etc.

Analysis complexities:

Access solutions vs. core
Control plane vs. traffic plane
Call processing vs. management visibility
Application variations

SLIDE 33

Outline

The Business Problem and Solution
Reliability Roadmap Elements
Availability Improving Features
Connecting-the-Dots Roadmapping
End-to-End Solution Availability
Recommendations for Product Managers

SLIDE 34

Recommendations for Product Managers

PdM should own and drive the reliability roadmap process

1. Product Management should assign an owner for reliability roadmap (product manager, delegated to SAE, or managed jointly)
2. Product management should set quantitative availability goal and release target to achieve that goal

E.g., MR/ECO SRD

3. Owner for reliability roadmap (with support from cross-functional team of Architecture, Development, Reliability Team, Systems Engineering or other) should analyze and propose availability-improving features
4. Product management selects the ‘right’ mix of availability-improving features per release and gets those features committed
5. Revisit reliability roadmap for every major release, and make revisions as appropriate
