IBM zSeries Fault Tolerant Design, Lisa Spainhower (PowerPoint presentation)



SLIDE 1

IBM zSeries Fault Tolerant Design

Lisa Spainhower Technology September 20, 2001

SLIDE 2

Power/Cooling Fault Tolerance

[Diagram. Labels: AC input (x2); AC to DC Converter with Battery (x2); 350 volt; DC to DC Converter (x3); Load (x2); N+1 (x2); Fan/Compressor (x2); Fan/Compressor Control (x2)]

SLIDE 3

I/O ED (Error Detection) and Recovery

[Diagram. Components: PROCESSORS; CACHES; MAIN MEMORY; HUB; I/O ADAPTER (x2); BRIDGE; CHANNEL; I/O SUBSYSTEM; STORAGE (x2); NETWORK; ETHERNET. Interconnects: Memory Bus; GX, S/390 L2; S/390 STI; RIO; IBT, ESCON; FICON, FC; PCI, PCIx; SCSI, FCAL]

NO ED standards for PCI

  • RS/AIX custom design to circumvent

IBT is channel-based

  • Like S/390
  • Defined errors
  • Defined robust checking & isolation

LEVEL THE PLAYING FIELD: Unix, S/390

SLIDE 4

Memory Hierarchy Fault Tolerance

Level 1 Cache

  • Parity protected
  • Store-through to L2
  • ECC'd store buffer on uP
  • Line delete/sparing

Level 2 Cache

  • (72, 64) SEC/DED ECC
  • Line/directory deletes
  • Line sparing

Memory

  • (72, 64) SEC/DED ECC
  • One bit per chip
  • Background scrubbing
  • Dynamic chip sparing

[Diagram. Labels: rows of six uP/L1 pairs; L2; Memory; I/O]
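To make the (72, 64) SEC/DED idea concrete, here is a minimal Python sketch. It assumes a textbook extended Hamming construction (7 Hamming check bits plus one overall parity bit over 64 data bits); the slides do not give the actual zSeries code matrix, so the bit layout and the names `encode` and `decode` are illustrative only.

```python
# Illustrative (72, 64) SEC/DED code: extended Hamming, NOT the actual zSeries code matrix.
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]            # Hamming check bits
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]  # 64 data bits


def encode(data):
    """Encode a 64-bit integer as a list of 72 bits (index 0 is the overall parity bit)."""
    if not 0 <= data < (1 << 64):
        raise ValueError("data must fit in 64 bits")
    word = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        # Check bit p covers every other position whose index has bit p set.
        word[p] = sum(word[pos] for pos in range(1, 72) if pos & p and pos != p) % 2
    word[0] = sum(word[1:]) % 2                          # overall parity enables double-error detection
    return word


def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 72):
        if word[pos]:
            syndrome ^= pos                              # XOR of the indices of all set bits
    odd_overall = sum(word) % 2 == 1
    if syndrome == 0 and not odd_overall:
        status = "ok"
    elif odd_overall and syndrome <= 71:
        word[syndrome] ^= 1                              # single error: syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None                     # even number of flips with nonzero syndrome
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, data
```

Flipping any one of the 72 bits yields `"corrected"` with the original data, and flipping any two yields `"uncorrectable"`. With one bit of each codeword stored per chip, as the slide lists, a failed chip would show up as a single-bit error in each word it touches.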

SLIDE 5

CP Error Detection & Recovery

Duplicated:

  • Complex controls
  • Arithmetic dataflow

Shared:

  • Cache controls
  • Cache data/address flow
  • R-Unit

Check all state updates

Preserve known good state

If error

  • 1. Stop state updates
  • 2. Refresh from saved state
  • 3. Restart CPU

If error persists

  • 1. Extract saved state (SE)
  • 2. Load into spare CPU
  • 3. Start spare CPU

[Diagram, marked CFW 3/30/00. Units: I-Unit (unchecked) and I-Unit (mirror); E-Unit (unchecked) and E-Unit (mirror); Cache (parity); R-Unit (ECC on saved state). Flows: Address; Cache data; Instructions; Results / state updates; Saved state data]
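The recovery sequence on this slide can be sketched as a small Python simulation. This is a toy model under stated assumptions, not the actual hardware or firmware logic: the `CPU` class, the `run` function, the `fault` callback and the `max_retries` parameter are all hypothetical names introduced for illustration. Each operation is computed twice (standing in for the duplicated units), compared, and only committed to the saved state when the two copies agree.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CPU:
    name: str
    # Hypothetical fault injector: fault(op_index, attempt) -> True corrupts one copy.
    fault: Optional[Callable[[int, int], bool]] = None
    state: int = 0
    saved: int = 0   # stands in for the ECC-protected saved state in the R-Unit

    def execute(self, op, index, attempt):
        """Run op on two copies, compare, and commit only if they match."""
        primary = op(self.state)
        mirror = op(self.state)
        if self.fault is not None and self.fault(index, attempt):
            primary ^= 1                      # corrupt one copy's result
        if primary != mirror:
            return False                      # mismatch: state update is not committed
        self.state = self.saved = primary     # checked update becomes the new known good state
        return True


def run(ops, cpu, spare=None, max_retries=1):
    """Execute ops with retry from saved state, then sparing if the error persists."""
    log = []
    for index, op in enumerate(ops):
        attempt = 0
        while not cpu.execute(op, index, attempt):
            cpu.state = cpu.saved             # refresh from saved state
            attempt += 1
            if attempt <= max_retries:
                log.append(f"retry on {cpu.name}")   # restart the same CPU
                continue
            if spare is None:
                raise RuntimeError("error persists and no spare CPU is available")
            spare.state = spare.saved = cpu.saved    # extract saved state, load into spare
            log.append(f"spared {cpu.name} -> {spare.name}")
            cpu, spare, attempt = spare, None, 0     # start the spare CPU
    return cpu, log
```

A transient fault (one that fires only on the first attempt of an operation) is absorbed by a single retry, while a fault that fires on every attempt triggers the move to the spare. In both cases the final state matches a fault-free run, because nothing unchecked is ever committed.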

SLIDE 6

2Q01 zSeries Full Field Data

MTT Hardware Repair = 8 months

  • 81-83% of repairs are concurrent
  • 13-15% of repairs are deferrable
  • 2-6% of repairs are app loss: MTTAL (mean time to application loss) = 24 years

TYPICAL REPAIR SCENARIO

[Diagram. Labels: 100% RESOURCES UP; Hard, Soft; CPU, Channel; Detect Error; HW Checkpoint; Retry Op (~1 second); HW Failure; Restart Op (~1 minute); Single Element Offline; System up; Repair/restore (hours+)]
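One way to relate the figures on this slide is sketched below in Python. It assumes that application loss occurs only as a fixed fraction of hardware repair events, so MTTAL is the mean time to hardware repair divided by that fraction. The slide does not state how MTTAL was derived, so this is a consistency check rather than the author's method, and the function names are illustrative.

```python
def implied_app_loss_fraction(mtt_repair_months, mttal_years):
    """Fraction of repairs that would have to be app loss for the two figures to agree."""
    return mtt_repair_months / (mttal_years * 12)


def mttal_years(mtt_repair_months, app_loss_fraction):
    """MTTAL in years if a fixed fraction of repairs causes application loss."""
    return mtt_repair_months / app_loss_fraction / 12


# Slide figures: repair every 8 months, MTTAL of 24 years.
fraction = implied_app_loss_fraction(8, 24)   # 8 / 288 = 1/36, about 2.8%
low_end = mttal_years(8, 0.06)                # about 11.1 years at 6% app loss
high_end = mttal_years(8, 0.02)               # about 33.3 years at 2% app loss
```

Under this assumption the implied fraction of about 2.8% falls inside the quoted 2-6% range, and 24 years falls inside the roughly 11 to 33 year span that the range would produce.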

SLIDE 7

zSeries Error Reporting

Recovery data "called home" at ~2 week intervals

Suppose CP hard logic (not array) failures caused app loss:

MTTAL from 24 yrs to 11 yrs

Suppose array (L1, L2, BHT) failures also caused app loss:

MTTAL from 11 yrs to 5 yrs
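The two what-if figures above can be turned into implied event rates with a short Python calculation. It assumes the failure sources are independent with constant rates, so that application-loss rates simply add and the overall rate is 1/MTTAL. That model is an assumption made here for illustration; the slide does not state one.

```python
def added_rate_per_year(mttal_before_years, mttal_after_years):
    """Extra app-loss events per year implied by a drop in MTTAL, assuming rates add."""
    return 1 / mttal_after_years - 1 / mttal_before_years


# CP hard logic failures that are recovered today: 24 yrs -> 11 yrs
logic_rate = added_rate_per_year(24, 11)    # 13/264, about 0.049 per year
# Array (L1, L2, BHT) failures that are recovered today: 11 yrs -> 5 yrs
array_rate = added_rate_per_year(11, 5)     # 6/55, about 0.109 per year

# Equivalent mean time between such recovered events.
logic_interval_years = 1 / logic_rate       # about 20.3 years
array_interval_years = 1 / array_rate       # about 9.2 years
```

Read this way, the recovery mechanisms would be absorbing a logic failure roughly every 20 years and an array failure roughly every 9 years that would otherwise have surfaced as application loss.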

SLIDE 8

S/390 Evolution

[Diagram. Labels: Circuit-level detection; Soft error recovery; Hard error recovery; Instruction retry; uArch checkpoint; CPU Sparing; PAF]

S/390 uses the same technology building blocks for soft and hard error recovery

  • Enhanced over past 35 years

IT'S NOT THE ONLY OPTION

  • Beginning afresh, might land elsewhere
  • Need to be driven by current conditions: Technology, Workload

IT'S EFFICIENT & EFFECTIVE FOR S/390

SLIDE 9

Challenges for the 00s

  • Increased importance of firmware
  • Circuit failure mechanisms
  • State encapsulation
  • On-the-fly change
  • Dynamic resource allocation
  • Configuration validation