SLIDE 1

Storage and reliability

Computer Architecture

  • J. Daniel García Sánchez (coordinator)
  • David Expósito Singh
  • Francisco Javier García Blas

ARCOS Group
Computer Science and Engineering Department
University Carlos III of Madrid

– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 1/41

SLIDE 2

1. Storage
2. Reliability and availability
3. RAID
4. Conclusion

SLIDE 3

Magnetic disks

  • High storage capacity (hundreds of GB).
  • Platters spin at constant angular velocity.
  • Access time for a data stream: T = track seek time + rotational latency.
  • The access time depends on the access sequence of the stream.
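
The access-time formula above can be sketched in a few lines of Python. The drive parameters (8 ms average seek, 7200 RPM) are hypothetical, and the sketch assumes the usual convention that the average rotational latency is half a revolution; transfer time is ignored, as in the slide.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in ms, assumed to be half a revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def access_time_ms(seek_ms: float, rpm: float) -> float:
    """T = track seek + rotational latency (transfer time not included)."""
    return seek_ms + rotational_latency_ms(rpm)


if __name__ == "__main__":
    # Hypothetical drive: 8 ms average seek, 7200 RPM.
    print(f"{access_time_ms(8.0, 7200):.2f} ms")
```

Because the seek term depends on how far the head must move, a sequential stream pays it rarely while a random stream pays it on almost every access.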

SLIDE 4

Density

  • Bits per inch stored along a track (BPI).
  • Tracks per inch on a surface (TPI).
  • Disk design trends toward increasing the density of bits stored per unit area (areal density).
  • Areal density = BPI × TPI

Year    Density
1973          2
1979          8
1989         63
1997      3,090
2000     17,100
2006    130,000
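
As a rough illustration of the trend in the table, the following sketch computes the areal density from BPI and TPI and the compound annual growth between two table entries. The function names are hypothetical, and the table values are used as given, in whatever unit the original table assumes.

```python
# Density values copied from the table above (unit as in the original table).
DENSITY_BY_YEAR = {1973: 2, 1979: 8, 1989: 63, 1997: 3_090, 2000: 17_100, 2006: 130_000}


def areal_density(bpi: float, tpi: float) -> float:
    """Areal density = BPI x TPI."""
    return bpi * tpi


def annual_growth(first_year: int, last_year: int) -> float:
    """Compound annual growth rate of density between two table years."""
    ratio = DENSITY_BY_YEAR[last_year] / DENSITY_BY_YEAR[first_year]
    return ratio ** (1.0 / (last_year - first_year)) - 1.0


if __name__ == "__main__":
    print(f"1973-2006: {annual_growth(1973, 2006):.1%} per year")
```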

SLIDE 5

Historical perspective

  • 1956 IBM RAMAC → early 1970s Winchester.
      • Developed for mainframes.
      • Proprietary interfaces.
      • Steady reduction in size: from 27 to 14 inches.
  • 1970s.
      • 5.25 inches.
      • An industry of standard storage interfaces emerges.
  • Early 1980s: personal computers (PCs) and the first generations of desktop computers.

SLIDE 6

Historical perspective

  • Mid 1980s: client/server computing.
      • Centralized storage in file servers.
      • Miniaturization increases: 8 inches to 5.25 inches.
      • Mass production of disk units for the market.
      • Standards: SCSI, IPI, IDE.
      • 5.25 inches to 3.5 inches for PCs.
  • 1990s: laptops ⇒ 2.5 inches.
  • 2000s: new devices lead to new form factors:
      • 1.8 inches: iPods, MP3 players.
      • 1 inch: IBM Microdrive.
      • 0.85 inches (Toshiba): mobile phones.

SLIDE 7

Illiac IV

University of Illinois (1974)

  • $30,000,000.
  • Solid-state memory.
  • Laser memory.
  • Fastest in the world until 1981.
  • Numeric computing for NASA.

SLIDE 8

Disk capacity and performance

  • Continuous increase in capacity (60%/year) and bandwidth (40%/year).
  • Slow increase in rotational speed (8%/year).
  • Consequence: the time to read the whole disk keeps growing.

Year           Sequentially    Randomly (1 sector/seek)
1990           4 min           6 hours
2000           12 min          1 week
2006 (SCSI)    56 min          3 weeks
2006 (SATA)    171 min         7 weeks
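
The gap between the two columns can be reproduced with a simple model. This is only a sketch: the drive below (500 GB, 100 MB/s, 512-byte sectors, 8 ms seek, 7200 RPM) is hypothetical and does not correspond to any row of the table, and the random case assumes one seek plus half a revolution per sector.

```python
def sequential_read_s(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream the whole disk at full bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


def random_read_s(capacity_bytes: float, sector_bytes: float, seek_s: float,
                  rpm: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read every sector with one seek per sector."""
    sectors = capacity_bytes / sector_bytes
    rotational_latency_s = 0.5 * 60.0 / rpm          # half a revolution on average
    transfer_s = sector_bytes / bandwidth_bytes_per_s
    return sectors * (seek_s + rotational_latency_s + transfer_s)


if __name__ == "__main__":
    seq = sequential_read_s(500e9, 100e6)
    rnd = random_read_s(500e9, 512, 0.008, 7200, 100e6)
    print(f"sequential: {seq / 60:.0f} min, random: {rnd / 86_400:.0f} days")
```

Capacity grows faster than bandwidth, so the sequential time grows; seek and rotation barely improve, so the random time grows much faster.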

SLIDE 9

2. Reliability and availability

SLIDE 10

2. Reliability and availability
  • Reliability
  • Availability

SLIDE 11

Reliability

  • The lifetime of a system is represented as a random variable X.
  • System reliability is defined as the function R(t):

\[ R(t) = P(X > t), \qquad R(0) = 1 \text{ and } R(\infty) = 0 \tag{1} \]

SLIDE 12

Reliability and failures

Reliability is obtained from the study of component failures.

http://www.jmcprl.net/ntps/@datos/ntp_418.htm.

SLIDE 13

Reliability distributions

Examples of distributions used for reliability:

http://www.relexsoftware.com/resources/art/art_distrib.asp

  • Exponential: if the error rate is constant (generally true for electronic components), reliability follows an exponential distribution.
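
A minimal sketch of the exponential model, assuming the standard form R(t) = exp(-λt) with constant error (failure) rate λ. The symbol λ, the function names and the sample rate are not taken from the slides.

```python
import math


def reliability_exponential(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def mean_time_to_failure(failure_rate: float) -> float:
    """For the exponential model, the mean lifetime is 1 / lambda."""
    return 1.0 / failure_rate


if __name__ == "__main__":
    rate = 1e-5  # hypothetical: failures per hour
    for hours in (0, 10_000, 100_000):
        print(hours, round(reliability_exponential(hours, rate), 4))
```

The model satisfies the boundary conditions of the reliability function: it is 1 at t = 0 and tends to 0 as t grows.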

SLIDE 14

Reliability distributions

Weibull:

  • Characteristic life η (the time by which 63.2% of the population has failed) and shape factor β.
  • β is associated with the error rate: β = 1 → constant error rate.
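
The two Weibull parameters can be checked numerically. The sketch below assumes the common two-parameter form R(t) = exp(-(t/η)^β); the function name and the sample values are illustrative only.

```python
import math


def reliability_weibull(t: float, eta: float, beta: float) -> float:
    """Two-parameter Weibull reliability: R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))


if __name__ == "__main__":
    eta = 1000.0  # hypothetical characteristic life, in hours
    for beta in (0.5, 1.0, 3.0):
        failed = 1.0 - reliability_weibull(eta, eta, beta)
        print(f"beta={beta}: fraction failed at t=eta is {failed:.3f}")
```

At t = η the failed fraction is 1 − e⁻¹ ≈ 63.2% for any β, and with β = 1 the expression reduces to the exponential model with rate 1/η.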

SLIDE 15

Serial systems

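  • Let R_i(t) be the reliability of component i.
  • The system fails when any component fails.

[Diagram: R1(t), R2(t), R3(t) and R4(t) connected in series]

  • If failures are independent, then:

\[ R(t) = \prod_{i=1}^{N} R_i(t) \]

  • System reliability is never higher than that of any single component: R(t) ≤ R_i(t) ∀i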
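
The product rule for a serial system is easy to sketch, assuming independent failures as stated above. The function name and the component values are illustrative.

```python
import math
from typing import Iterable


def series_reliability(component_reliabilities: Iterable[float]) -> float:
    """Serial system: every component must work, so reliabilities multiply."""
    return math.prod(component_reliabilities)


if __name__ == "__main__":
    # Three hypothetical components, each with R_i(t) = 0.9 at some time t.
    print(round(series_reliability([0.9, 0.9, 0.9]), 3))
```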

SLIDE 16

Parallel systems

  • The system fails only when all components fail:

\[ R(t) = 1 - \prod_{i=1}^{N} Q_i(t), \qquad Q_i(t) = 1 - R_i(t) \]

[Diagram: R1(t), R2(t) and R3(t) connected in parallel]
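
The parallel case can be sketched the same way, again assuming independent failures. The function name and the component values are illustrative.

```python
import math
from typing import Iterable


def parallel_reliability(component_reliabilities: Iterable[float]) -> float:
    """Parallel system: it fails only if every component fails."""
    all_fail = math.prod(1.0 - r for r in component_reliabilities)
    return 1.0 - all_fail


if __name__ == "__main__":
    # Three hypothetical components, each with R_i(t) = 0.9 at some time t.
    print(round(parallel_reliability([0.9, 0.9, 0.9]), 3))
```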

SLIDE 17

Example

For t = 100, R_i(t) = 0.9 for every component.

  • Three components in series: R(t) = 0.9 · 0.9 · 0.9 = 0.729
  • Three components in parallel: R(t) = 1 − (1 − 0.9)^3 = 0.999

SLIDE 18

2. Reliability and availability
  • Reliability
  • Availability

SLIDE 19

Availability

  • In many cases, availability is the more useful measure.
  • The availability of a system, A(t), is defined as the probability that the system is working correctly at instant t.
      • Reliability considers the whole interval [0, t].
      • Availability considers a specific instant in time.
  • A system can be modelled with the following state diagram:

[State diagram: Working → (failure) → Not working → (repair) → Working]

SLIDE 20

Availability measurement

  • Let MTTF be the mean time to failure.
  • Let MTTR be the mean time to repair.
  • System availability A is defined as:

\[ A = \frac{MTTF}{MTTF + MTTR} \]

  • What does an availability of 99% mean?
      • In 365 days, the system works correctly for 99 · 365 / 100 = 361.35 days.
      • It is out of service for 3.65 days.
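
A small sketch of the availability formula. The MTTF and MTTR figures are hypothetical and were chosen only so that the result is 99%.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical system: fails every 990 hours on average, 10 hours to repair.
    a = availability(990.0, 10.0)
    print(f"A = {a:.2%}, working {a * 365:.2f} days per year")
```

Availability improves either by failing less often (larger MTTF) or by repairing faster (smaller MTTR).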

SLIDE 21

Annual time without service

Availability (%)    Time without service per year
98%                 7.3 days
99%                 3.65 days
99.8%               17 hours and 30 minutes
99.9%               8 hours and 45 minutes
99.99%              52 minutes and 30 seconds
99.999%             5 minutes and 15 seconds
99.9999%            31.5 seconds
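
The rows of the table follow from a single multiplication. The sketch below assumes a 365-day year, as the table does; the function name is illustrative, and the printed values are unrounded, so they can differ slightly from the rounded table entries.

```python
from datetime import timedelta


def annual_downtime(availability_pct: float, days_per_year: int = 365) -> timedelta:
    """Expected time without service in one year for a given availability."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return timedelta(days=days_per_year * unavailable_fraction)


if __name__ == "__main__":
    for pct in (98, 99, 99.8, 99.9, 99.99, 99.999, 99.9999):
        print(f"{pct}% -> {annual_downtime(pct)}")
```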

SLIDE 22

Computing availability

Availability of the elements:

  • HW: 99.99%
  • Disk: 99.9%
  • OS: 99.99%
  • Application: 99.9%
  • Communications: 99.9%

System availability is the product of the availabilities of its elements:

\[ A(t) = \prod_{i=1}^{N} A_i(t) = 99.6804\% \;\Rightarrow\; 1.17 \text{ days without service per year} \]
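
The figure above can be verified with a short calculation. The dictionary name is illustrative; the five availabilities are the ones listed on this slide, and the product assumes every element is required (a serial arrangement).

```python
import math

# Element availabilities from the slide, expressed as fractions.
ELEMENT_AVAILABILITY = {
    "HW": 0.9999,
    "Disk": 0.999,
    "OS": 0.9999,
    "Application": 0.999,
    "Communications": 0.999,
}


def system_availability(availabilities) -> float:
    """All elements are needed, so the availabilities multiply."""
    return math.prod(availabilities)


if __name__ == "__main__":
    a = system_availability(ELEMENT_AVAILABILITY.values())
    print(f"A = {a:.4%}, {(1 - a) * 365:.2f} days without service per year")
```

The combined availability is lower than that of the weakest element, which is why every layer of the stack matters.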

SLIDE 23

Sectors with most service interruptions

Sector                                                  Percentage
Banking and finance                                     26%
Government, public administrations and institutions     19.1%
Education                                               11.3%
Industry                                                10.9%
Services                                                9.5%
Communications                                          8.2%

SLIDE 24

Cost of one hour of downtime

Cost                         Percentage
Up to $50,000                46%
$50,000 – $100,000           15%
$100,000 – $250,000          13%
$250,000 – $500,000          9%
$500,000 – $1,000,000        9%
$1,000,000 – $5,000,000      4%
More than $5,000,000         4%

SLIDE 25

3. RAID

SLIDE 26

What to do with failures?

Problems in disks:

  • Failure of the disk itself.
  • Failure of the disk controller.
  • Failure of a block (damaged sectors).
  • Transient failures.

Using a redundant storage system:

  • Redundant Array of Inexpensive/Independent Disks.
  • First proposed in 1988 by David A. Patterson, Garth A. Gibson and Randy H. Katz: "A Case for Redundant Arrays of Inexpensive Disks (RAID)".

SLIDE 27

RAID Disks

Several types of RAID.

Basic levels:

  • RAID 0: block distribution (striping) without fault tolerance.
  • RAID 1: disk mirroring.
  • RAID 2: bit-level interleaving with Hamming code.
  • RAID 3: bit-level interleaving with redundant information (parity).
  • RAID 4: block distribution with a parity disk.
  • RAID 5: block distribution with distributed parity.

Combinations:

  • RAID 10: striping and mirroring (RAID 0 and 1).
  • RAID 51: combination of RAID 5 and RAID 1.
  • ...

SLIDE 28

RAID 0 (striping)

Fault tolerance:

Does not offer fault tolerance.

Performance:

Higher throughput in read/write operations.

Capacity:

Sum of the capacities of all disks.

SLIDE 29

RAID 1 (mirroring)

Fault tolerance:

1 failure.

Performance:

Higher throughput in read operations.

Capacity:

50% of total.

SLIDE 30

RAID 2

  • Failure detection.
  • Uses a Hamming code.
  • Bit-level striping.
  • Very costly implementation.
  • Not used in practice.

SLIDE 31

RAID 3

RAID 3 (striping with dedicated parity, bit level).

  • Byte-level striping.
  • Parity of the written bytes.
  • Tolerates 1 failure.
  • Uses byte-level redundancy.
  • Improves throughput: parallel access to a block.
  • The parity disk is a bottleneck.

SLIDE 32

RAID 4

RAID 4 (striping with dedicated parity).

  • Block-level striping.
  • Fault tolerance: 1 failure.
  • Performance:
      • Costly writes (parity).
      • The parity disk is a bottleneck.
  • Capacity, for n disks:

\[ \frac{100 \cdot (n-1)}{n}\,\% \]

SLIDE 33

RAID 3 versus RAID 4

  • RAID 3: the striping unit is a byte (each byte goes to one disk).
  • RAID 4: the striping unit is a block (each block goes to one disk).

SLIDE 34

RAID 5

RAID 5 (striping with distributed parity).

  • Block-level striping.
  • Parity striping.
  • Parity is not on the same disk as its associated blocks.
  • Fault tolerance: 1 failure.
  • There is no bottleneck in access to parity.
  • Capacity, for n disks:

\[ \frac{100 \cdot (n-1)}{n}\,\% \]
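
The capacity figures given for RAID 0, 1, 4 and 5 can be collected in one helper. This is a sketch with a hypothetical function name; it covers only the levels whose capacity is stated in these slides and treats RAID 1 as the 50% case quoted there.

```python
def usable_capacity_pct(level: int, n_disks: int) -> float:
    """Percentage of raw capacity available for data, per the slides."""
    if n_disks < 2:
        raise ValueError("a RAID array needs at least two disks")
    if level == 0:
        return 100.0                            # capacities simply add up
    if level == 1:
        return 50.0                             # mirroring: 50% of total
    if level in (4, 5):
        return 100.0 * (n_disks - 1) / n_disks  # one disk's worth of parity
    raise ValueError(f"capacity for RAID {level} is not given in the slides")


if __name__ == "__main__":
    for n in (3, 4, 8):
        print(f"RAID 5 with {n} disks: {usable_capacity_pct(5, n):.1f}%")
```

With parity-based levels the overhead shrinks as disks are added, whereas mirroring always costs half of the raw capacity.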

SLIDE 35

RAID 6

RAID 6 (striping with distributed redundant parity).

  • Block-level striping.
  • Parity striping.
  • Redundancy is stored twice per stripe (two independent parity blocks).
  • Parity is not on the same disk as its associated blocks.
  • Fault tolerance: 2 failures.
  • There is no bottleneck in access to parity.

SLIDE 36

Reads in RAID 4-5

If the disk works:

  • The corresponding disk is read.

If the disk does not work:

  • The blocks on the other disks and the parity block are read to reconstruct the missing block.
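
The degraded read can be sketched with XOR parity, which is the parity used by RAID 4 and 5. Blocks are modelled as equal-length `bytes` objects, and the function names are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_block(surviving_blocks: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the block of the failed disk from the other blocks and parity."""
    return xor_blocks([*surviving_blocks, parity])


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # three data blocks
    parity = xor_blocks(stripe)
    # Pretend the disk holding stripe[1] has failed.
    rebuilt = reconstruct_block([stripe[0], stripe[2]], parity)
    print(rebuilt == stripe[1])
```

This works because XOR-ing the parity with all surviving blocks cancels them out and leaves only the missing block.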

SLIDE 37

Writes in RAID 4-5

If the disk works:

  • Write the block and the new parity:
      1. Read the old block OB and the old parity block OP.
      2. Compute the new parity: NP = (OB ⊕ NB) ⊕ OP.
      3. Write the new block NB and the new parity block NP.

If the disk does not work:

  • Update the block and the parity on the working disks.
  • When the failed disk is replaced, its information is reconstructed.
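
The three write steps can be sketched as follows, with blocks modelled as equal-length `bytes` objects. The function names are hypothetical; the check at the end compares the incremental update against recomputing the parity over the whole stripe.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_block: bytes, new_block: bytes, old_parity: bytes) -> bytes:
    """NP = (OB xor NB) xor OP, using only the old block and the old parity."""
    return xor(xor(old_block, new_block), old_parity)


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    old_parity = xor(xor(stripe[0], stripe[1]), stripe[2])
    new_block = b"\xaa\x55"                       # overwrite stripe[1]
    np_incremental = updated_parity(stripe[1], new_block, old_parity)
    np_full = xor(xor(stripe[0], new_block), stripe[2])
    print(np_incremental == np_full)
```

The incremental form needs two reads and two writes regardless of the number of disks, which is why small writes are costly but do not require reading the whole stripe.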

SLIDE 38

4. Conclusion

SLIDE 39

Summary

  • Reliability models the lifetime of a system.
  • Parallel systems improve system reliability, while serial systems worsen it.
  • Availability models the probability that the system is working correctly at a given instant in time.
  • RAID systems improve both the performance and the reliability of storage systems.

SLIDE 40

References

Computer Architecture. A Quantitative Approach 5th Ed. Hennessy and Patterson. Sections D.1, D.2, D.3, D.4.
