SLIDE 1

Disks and RAID

(Chapter 12, 14.2)

CS 4410 Operating Systems

[R. Agarwal, L. Alvisi, A. Bracy, E. Sirer, R. Van Renesse]

SLIDE 2

Storage Devices

  • Magnetic disks
    • Storage that rarely becomes corrupted
    • Large capacity at low cost
    • Block-level random access
    • Slow performance for random access
    • Better performance for streaming access
  • Flash memory
    • Storage that rarely becomes corrupted
    • Capacity at intermediate cost (50x disk)
    • Block-level random access
    • Good performance for reads; worse for random writes

SLIDE 3

Magnetic Disks are 60 years old!

THAT WAS THEN
  • 13th September 1956
  • The IBM RAMAC 350
  • Total storage = 5 million characters (just under 5 MB)

THIS IS NOW
  • 2.5-3.5” hard drive
  • Example: 500 GB Western Digital Scorpio Blue hard drive
  • easily up to 1 TB

http://royal.pingdom.com/2008/04/08/the-history-of-computer-data-storage-in-pictures/

SLIDE 4

RAM (Memory) vs. HDD (Disk), 2018

                            RAM          HDD
  Typical Size              8 GB         1 TB
  Cost                      $10 per GB   $0.05 per GB
  Power                     3 W          2.5 W
  Read Latency              15 ns        15 ms
  Read Speed (Sequential)   8000 MB/s    175 MB/s
  Write Speed (Sequential)  10000 MB/s   150 MB/s
  Read/Write Granularity    word         sector
  Power Reliance            volatile     non-volatile

[C. Tan, buildcomputers.net, codecapsule.com, crucial.com, wikipedia]

SLIDE 5

Reading from disk

[Diagram: platter, surfaces, spindle, motor, arm assembly, arm, head, track, sector]

Must specify:
  • cylinder # (distance from spindle)
  • surface #
  • sector #
  • transfer size
  • memory address
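As a concrete illustration, here is a minimal sketch of the information a disk read request has to carry, straight from the list above (the type and field names are illustrative, not a real driver API):

```python
# A toy disk read request carrying the five pieces of information from the slide.
from dataclasses import dataclass

@dataclass
class DiskReadRequest:
    cylinder: int        # which track, measured as distance from the spindle
    surface: int         # which platter surface (selects the head)
    sector: int          # which sector within that track
    transfer_size: int   # how many bytes to read
    memory_address: int  # where in RAM to place the data (DMA target)
```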

SLIDE 6

Disk Tracks

[Diagram: track, head, arm, spindle; not to scale — the head is actually much bigger than a track]

A track is ~1 micron wide (1000 nm)
  • Wavelength of light is ~0.5 micron
  • Resolution of the human eye: 50 microns
  • 100K tracks on a typical 2.5” disk

Track length varies across the disk
  • Outside: more sectors per track, higher bandwidth
  • Most of the disk area is in the outer regions

SLIDE 7

Disk overheads

Disk Latency = Seek Time + Rotation Time + Transfer Time
  • Seek: move the head to the track (5-15 milliseconds (ms))
  • Rotational latency: wait for the sector to rotate under the head (4-8 milliseconds (ms))
    (on average, only need to wait half a rotation)
  • Transfer: get the bits off the disk (25-50 microseconds (µs))

[Diagram: track, sector, seek time, rotational latency]
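A quick back-of-the-envelope check of the formula, using assumed values inside the slide's ranges (10 ms seek, a 7200 RPM spindle, a 4 KB transfer at the ~175 MB/s sequential rate from slide 4):

```python
# Rough disk-latency estimate; the specific numbers are illustrative assumptions.
def avg_disk_latency_ms(seek_ms=10.0, rpm=7200, xfer_bytes=4096, seq_bw_bytes_s=175e6):
    rotation_ms = (60_000 / rpm) / 2                    # half a rotation: ~4.17 ms at 7200 RPM
    transfer_ms = xfer_bytes / seq_bw_bytes_s * 1e3     # ~0.02 ms for 4 KB
    return seek_ms + rotation_ms + transfer_ms

print(f"{avg_disk_latency_ms():.2f} ms per random 4 KB read")  # ~14.19 ms, dominated by seek + rotation
```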

SLIDE 8

Disk Scheduling

  • Objective: minimize seek time
  • Context: a queue of cylinder numbers (#0-199)
  • Metric: how many cylinders traversed?

Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67

SLIDE 9

Disk Scheduling: FIFO

  • Schedule disk operations in the order they arrive
  • Downsides?

FIFO schedule? Total head movement?

Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
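One way to answer the question: a short sketch that replays the queue in arrival order and sums the absolute head jumps (640 cylinders for this example):

```python
# FIFO: service requests in arrival order; total movement = sum of head jumps.
def fifo_movement(head, queue):
    total = 0
    for cyl in queue:
        total += abs(cyl - head)
        head = cyl
    return total

print(fifo_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 640 cylinders
```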

SLIDE 10

Disk Scheduling: Shortest Seek Time First

  • Select the request with minimum seek time from the current head position
  • A form of Shortest Job First (SJF) scheduling
  • Not optimal: suppose a cluster of requests at the far end of the disk ➜ starvation!

SSTF schedule? Total head movement?

Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
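A matching sketch for SSTF, greedily picking the closest pending cylinder each time (236 cylinders for this example, with service order 65, 67, 37, 14, 98, 122, 124, 183):

```python
# SSTF: always service the pending request nearest the current head position.
def sstf_movement(head, queue):
    pending, total = list(queue), 0
    while pending:
        nxt = min(pending, key=lambda c: abs(c - head))  # closest request
        total += abs(nxt - head)
        head = nxt
        pending.remove(nxt)
    return total

print(sstf_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 236 cylinders
```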

SLIDE 11

Disk Scheduling: SCAN

Elevator algorithm:
  • arm starts at one end of the disk
  • moves to the other end, servicing requests along the way
  • movement reverses at the end of the disk
  • repeat

SCAN schedule? Total head movement?

Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
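A sketch of SCAN for the same queue, under the assumption that the head is initially sweeping toward cylinder 0 and goes all the way to the end before reversing (236 cylinders for this example):

```python
# SCAN (elevator): sweep down to cylinder 0 servicing requests, then sweep back up.
def scan_movement(head, queue):
    lower = [c for c in queue if c < head]    # serviced on the downward sweep
    upper = [c for c in queue if c >= head]   # serviced after reversing at 0
    if lower:
        return head + (max(upper) if upper else 0)   # head -> 0, then 0 -> highest request
    return (max(upper) - head) if upper else 0

print(scan_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 236 cylinders
```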

SLIDE 12

Disk Scheduling: C-SCAN

Circular list treatment:
  • head moves from one end to the other, servicing requests as it goes
  • when it reaches the end, it returns to the beginning
  • no requests are serviced on the return trip

+ More uniform wait time than SCAN

C-SCAN schedule? Total head movement?

Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
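And a sketch of C-SCAN for the same queue, assuming the 0-199 cylinder range from slide 8 and counting the rewind from 199 back to 0 as head movement (some treatments do not):

```python
# C-SCAN: sweep upward servicing requests, jump back to 0 without servicing, continue upward.
def cscan_movement(head, queue, max_cyl=199):
    upper = [c for c in queue if c >= head]   # serviced on this sweep
    lower = [c for c in queue if c < head]    # serviced after the wrap-around
    total = (max_cyl - head) if (upper or lower) else 0   # sweep to the top
    if lower:
        total += max_cyl                      # rewind 199 -> 0 (counted here by assumption)
        total += max(lower)                   # continue up to the last low request
    return total

print(cscan_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 382 cylinders
```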

SLIDE 13

RAM vs. HDD vs. Flash, 2018

                            RAM          HDD            Flash
  Typical Size              8 GB         1 TB           250 GB
  Cost                      $10 per GB   $0.05 per GB   $0.32 per GB
  Power                     3 W          2.5 W          1.5 W
  Read Latency              15 ns        15 ms          30 µs
  Read Speed (Seq.)         8000 MB/s    175 MB/s       550 MB/s
  Write Speed (Seq.)        10000 MB/s   150 MB/s       500 MB/s
  Read/Write Granularity    word         sector         page*
  Power Reliance            volatile     non-volatile   non-volatile
  Write Endurance           *            **             100 TB

[C. Tan, buildcomputers.net, codecapsule.com, crucial.com, wikipedia]

SLIDE 14

Solid State Drives (Flash)

Most SSDs are based on NAND flash
  • retains its state for months to years without power

[Diagram: Metal Oxide Semiconductor Field Effect Transistor (MOSFET) vs. Floating Gate MOSFET (FGMOS)]

https://flashdba.com/2015/01/09/understanding-flash-floating-gates-and-wear/

SLIDE 15

NAND Flash

Charge is stored in the Floating Gate
(cells can be Single-Level or Multi-Level)

[Diagram: Floating Gate MOSFET (FGMOS)]

https://flashdba.com/2015/01/09/understanding-flash-floating-gates-and-wear/

slide-16
SLIDE 16
  • Erase block: sets each cell to “1”
  • erase granularity = “erasure block” = 128-512 KB
  • time: several ms
  • Write page: can only write erased pages
  • write granularity = 1 page = 2-4KBytes
  • time: 10s of ms
  • Read page:
  • read granularity = 1 page = 2-4KBytes
  • time: 10s of ms

Flash Operations

16

SLIDE 17

Flash Limitations

  • can’t write 1 byte/word (must write whole blocks)
  • limited # of erase cycles per block (memory wear)
    • 10³-10⁶ erases and the cell wears out
  • reads can “disturb” nearby words and overwrite them with garbage
  • Lots of techniques to compensate:
    • error correcting codes
    • bad page / erasure block management
    • wear leveling: trying to distribute erasures across the entire drive

SLIDE 18

Flash Translation Layer

Flash device firmware maps each logical page # to a physical location:
  • Garbage-collect an erasure block by copying its live pages to a new location, then erasing it
  • More efficient if blocks stored at the same time are deleted at the same time (e.g., keep the blocks of a file together)
  • Wear leveling: only write each physical page a limited number of times
  • Remap pages that no longer work (sector sparing)

Transparent to the device user
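The mapping above can be illustrated with a toy sketch (the class, field names, and sizes are made-up assumptions; real firmware is far more involved): writes go out-of-place to a fresh physical page, the logical-to-physical map is updated, and garbage collection relocates live pages before erasing a block.

```python
# Toy flash translation layer: out-of-place writes plus block garbage collection.
PAGES_PER_BLOCK = 4                  # assumption; real erasure blocks are far larger

class ToyFTL:
    def __init__(self, num_blocks):
        self.map = {}                # logical page # -> physical page #
        self.live = set()            # physical pages currently holding live data
        self.free = list(range(num_blocks * PAGES_PER_BLOCK))

    def write(self, logical):
        phys = self.free.pop(0)      # out-of-place write to an erased page
        old = self.map.get(logical)
        if old is not None:
            self.live.discard(old)   # the old copy becomes garbage
        self.map[logical] = phys
        self.live.add(phys)          # (data would be programmed into `phys` here)

    def gc_block(self, block):
        """Relocate live pages out of `block`, then erase it so its pages are reusable."""
        pages = set(range(block * PAGES_PER_BLOCK, (block + 1) * PAGES_PER_BLOCK))
        self.free = [p for p in self.free if p not in pages]   # don't allocate from it now
        for phys in sorted(pages & self.live):
            logical = next(l for l, p in self.map.items() if p == phys)
            self.write(logical)                                # copy the live page out
        self.free.extend(sorted(pages))                        # block erased, pages free again

ftl = ToyFTL(num_blocks=2)
ftl.write(0); ftl.write(0)           # second write lands on a new physical page
ftl.gc_block(0)                      # relocates the live copy, then reclaims block 0
print(ftl.map)                       # logical page 0 now maps into block 1
```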

SLIDE 19

What do we want from storage?

  • Fast: data is there when you want it
  • Reliable: data fetched is what you stored
  • Affordable: won’t break the bank

Enter: Redundant Array of Inexpensive Disks (RAID)
  • In industry, “I” is for “Independent”
  • The alternative is SLED: a Single Large Expensive Disk
  • RAID + RAID controller looks just like a SLED to the computer (yay, abstraction!)

SLIDE 20

RAID-0

Files striped across disks
+ Fast
+ Cheap
– Unreliable

Disk 0: stripe 0, stripe 2, stripe 4, stripe 6, stripe 8, stripe 10, stripe 12, stripe 14, …
Disk 1: stripe 1, stripe 3, stripe 5, stripe 7, stripe 9, stripe 11, stripe 13, stripe 15, …
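The striping rule is simple enough to write down; a minimal sketch matching the two-disk layout above (stripe i goes to disk i mod N, at row i div N):

```python
# RAID-0 address mapping: round-robin stripes across the disks.
def raid0_location(stripe, num_disks=2):
    return stripe % num_disks, stripe // num_disks   # (disk, row on that disk)

for s in range(4):
    disk, row = raid0_location(s)
    print(f"stripe {s} -> disk {disk}, row {row}")
```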

SLIDE 21

Failure Cases

(1) Isolated disk sectors (1+ sectors down, the rest OK)
  • Permanent: physical malfunction (magnetic coating, scratches, contaminants)
  • Transient: data corrupted, but new data can be successfully written to / read from the sector

(2) Entire device failure
  • Damage to the disk head, electronic failure, wear-out
  • Detected by the device driver; accesses return error codes
  • Annual failure rates or Mean Time To Failure (MTTF)

SLIDE 22

Striping and Reliability

Striping reduces reliability
  • More disks ➜ higher probability of some disk failing
  • N disks: 1/Nth the mean time between failures of 1 disk

What can we do to improve disk reliability?
Hint #1: When CPUs stopped being reliable, we also did this…
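Working the 1/N rule through with an assumed per-disk MTTF of one million hours (an illustrative figure, not from the slides):

```python
# With N independent disks, the array's mean time to first failure is MTTF_disk / N.
mttf_disk_hours = 1_000_000                   # assumed per-disk MTTF (~114 years)
for n in (1, 2, 10, 100):
    years = mttf_disk_hours / n / 8760        # 8760 hours per year
    print(f"{n:>3} disks -> array MTTF ~ {years:.1f} years")
```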

SLIDE 23

RAID-1

Disks mirrored: data written in 2 places
+ Reliable
+ Fast
– Expensive

Example: the Google File System replicates data across multiple disks

Disk 0: data 0, data 1, data 2, data 3, data 4, data 5, data 6, data 7, …
Disk 1: data 0, data 1, data 2, data 3, data 4, data 5, data 6, data 7, …

SLIDE 24

RAID-2

Bit-level striping with ECC codes
  • 7 disk arms synchronized, move in unison
  • Complicated controller (➜ very unpopular)
  • Detect & correct 1 error with no performance degradation
+ Reliable
– Expensive

parity 1 = 3⊕5⊕7 (all disks whose # has a 1 in the LSB, xx1)
parity 2 = 3⊕6⊕7 (all disks whose # has a 1 in the 2nd bit, x1x)
parity 4 = 5⊕6⊕7 (all disks whose # has a 1 in the MSB, 1xx)

Disk 1 (001): parity 1, parity 4, parity 7, parity 10
Disk 2 (010): parity 2, parity 5, parity 8, parity 11
Disk 3 (011): bit 1, bit 5, bit 9, bit 13
Disk 4 (100): parity 3, parity 6, parity 9, parity 12
Disk 5 (101): bit 2, bit 6, bit 10, bit 14
Disk 6 (110): bit 3, bit 7, bit 11, bit 15
Disk 7 (111): bit 4, bit 8, bit 12, bit 16

…do we really need to detect?

SLIDE 25

RAID-2: Generating Parity

Data bits: a = 1 on Disk 3, b = 1 on Disk 5, c = 0 on Disk 6, d = 1 on Disk 7

parity 1 = 3⊕5⊕7 (all disks whose # has a 1 in the LSB, xx1) = a⊕b⊕d = 1⊕1⊕1 = 1
parity 2 = 3⊕6⊕7 (all disks whose # has a 1 in the 2nd bit, x1x) = a⊕c⊕d = 1⊕0⊕1 = 0
parity 4 = 5⊕6⊕7 (all disks whose # has a 1 in the MSB, 1xx) = b⊕c⊕d = 1⊕0⊕1 = 0

Disk 1 (001): parity 1 = 1
Disk 2 (010): parity 2 = 0
Disk 3 (011): a = 1
Disk 4 (100): parity 3 = 0
Disk 5 (101): b = 1
Disk 6 (110): c = 0
Disk 7 (111): d = 1

SLIDE 26

RAID-2: Detect and Correct

I flipped a bit. Which one? (d on Disk 7 now reads 0)

parity 1 = 3⊕5⊕7 (all disks whose # has a 1 in the LSB, xx1) = a⊕b⊕d = 1⊕1⊕0 = 0 ⟵ problem
parity 2 = 3⊕6⊕7 (all disks whose # has a 1 in the 2nd bit, x1x) = a⊕c⊕d = 1⊕0⊕0 = 1 ⟵ problem
parity 4 = 5⊕6⊕7 (all disks whose # has a 1 in the MSB, 1xx) = b⊕c⊕d = 1⊕0⊕0 = 1 ⟵ problem

Disk 1 (001): parity 1 = 1
Disk 2 (010): parity 2 = 0
Disk 3 (011): a = 1
Disk 4 (100): parity 3 = 0
Disk 5 (101): b = 1
Disk 6 (110): c = 0
Disk 7 (111): d = 0 (flipped)

Problem @ xx1, x1x, 1xx ➜ 111: d was flipped
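The detect-and-correct step on this slide is a Hamming-style syndrome computation: recompute the three parities, compare with the stored ones, and the pattern of mismatches spells out the disk number of the flipped bit. A minimal sketch using the same a, b, c, d and stored parities as the example:

```python
# RAID-2 error location: the mismatching parity groups name the bad disk in binary.
def locate_flipped_bit(a, b, c, d, stored_p1, stored_p2, stored_p4):
    syndrome = 0
    if (a ^ b ^ d) != stored_p1: syndrome |= 0b001   # parity 1 covers disks xx1
    if (a ^ c ^ d) != stored_p2: syndrome |= 0b010   # parity 2 covers disks x1x
    if (b ^ c ^ d) != stored_p4: syndrome |= 0b100   # parity 4 covers disks 1xx
    return syndrome                                  # 0 = no error, else the disk #

# Slide example: a=b=d=1, c=0 gives stored parities 1, 0, 0; then d flips to 0.
print(locate_flipped_bit(1, 1, 0, 0, 1, 0, 0))       # 7 -> disk 7 (d) was flipped
```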

SLIDE 27

2 more rarely-used RAIDs

RAID-3: byte-level striping + parity disk
  • read accesses all data disks
  • write accesses all data disks + the parity disk
  • On disk failure: read the parity disk, compute the missing data

RAID-4: block-level striping + parity disk
+ better spatial locality for disk access
+ Cheap
– Slow writes
– Unreliable: the parity disk is a write bottleneck and wears out faster

Disk 1: data 1, data 5, data 9, data 13
Disk 2: data 2, data 6, data 10, data 14
Disk 3: data 3, data 7, data 11, data 15
Disk 4: data 4, data 8, data 12, data 16
Disk 5: parity 1, parity 2, parity 3, parity 4
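The reconstruction step ("read the parity disk, compute the missing data") is just XOR: the parity block is the XOR of the data blocks, so a lost block is the XOR of the survivors plus parity. A minimal sketch with made-up 4-byte blocks:

```python
# Parity generation and single-disk reconstruction via XOR (RAID-3/4/5 style).
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]           # blocks on disks 1-4
parity = xor_blocks(data)                              # block on the parity disk
# Disk 3 fails: recover its block from the surviving data blocks plus parity.
recovered = xor_blocks([data[0], data[1], data[3], parity])
print(recovered == data[2])                            # True
```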

SLIDE 28

A word about Granularity

Bit-level ➜ byte-level ➜ block-level

Fine-grained: stripe a file across all disks
+ high throughput for the file
– wasted disk seek time
– limits transfers to 1 file at a time

Coarse-grained: stripe a file over a few disks
– limits throughput for 1 file
+ better use of spatial locality (for disk seek)
+ allows more parallel file access

SLIDE 29

RAID-5: Rotating Parity with Striping

+ Reliable
+ Fast
+ Affordable

Disk 0: parity 0-3,  data 4,      data 8,       data 12,       data 16
Disk 1: data 0,      parity 4-7,  data 9,       data 13,       data 17
Disk 2: data 1,      data 5,      parity 8-11,  data 14,       data 18
Disk 3: data 2,      data 6,      data 10,      parity 12-15,  data 19
Disk 4: data 3,      data 7,      data 11,      data 15,       parity 16-19
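A minimal sketch of the rotating-parity placement shown above (assuming 5 disks, parity for stripe row r on disk r mod 5, and the row's data blocks filling the remaining disks left to right, which matches this slide's layout):

```python
# RAID-5 placement: parity rotates across disks; data skips over the parity disk.
def raid5_location(block, num_disks=5):
    row = block // (num_disks - 1)        # num_disks - 1 data blocks per stripe row
    parity_disk = row % num_disks         # parity rotates one disk per row
    disk = block % (num_disks - 1)
    if disk >= parity_disk:               # skip over the parity disk in this row
        disk += 1
    return disk, row, parity_disk

for b in (0, 4, 9, 14, 19):
    d, r, p = raid5_location(b)
    print(f"data {b}: disk {d}, row {r} (parity for row {r} on disk {p})")
```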