Practical Scrubbing: Getting to the bad sector at the right time


SLIDE 1

Practical Scrubbing

Getting to the bad sector at the right time

!"#$%#&'()*+$,)-).)/0$/1$234

George Amvrosiadis, Bianca Schroeder

University of Toronto

Alina Oprea

RSA Laboratories, The Security Division of EMC

Tuesday, June 26, 2012

SLIDE 2

Hard disk errors & Scrubbing

SLIDE 3

What could go wrong?

SLIDE 4

What could go wrong?

  • Spindle failure
  • Head crashed/stuck, lube build-up
  • Electrical failure
  • Firmware bug
  • PCB failure
  • Surface abnormalities

SLIDE 5

What could go wrong?

Latent Sector Error (LSE): a sector whose contents can no longer be read back correctly, and which goes unnoticed until the sector is next accessed.

  • L. Bairavasundaram et al., “An analysis of latent sector errors in disk drives”, ACM SIGMETRICS 2007.

SLIDE 6

What could go wrong?

[Figure: expected number of LSEs encountered during reconstruction vs. single-parity RAID (3+P) array capacity, 1.0 to 64.0 TB, assuming one unrecoverable read error per 1.0E+14 bits]

Steven R. Hetzler, “System Impacts of Storage Trends: Hard Errors and Testability”, USENIX ;login:, vol. 36, no. 3.

No Redundancy + LSE = Data Loss

2012 (4 × 4TB array): expected 0.96 LSEs per reconstruction

~2015 (4 × 12TB array): expected 2.88 LSEs per reconstruction
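The arithmetic behind these numbers is worth spelling out (a sketch, assuming the 1.0E+14 figure is the drive's rated bits read per unrecoverable error, and that reconstructing one failed disk in a 3+P array reads the three survivors in full):

\[
E[\text{LSEs}] = \frac{\text{bits read}}{10^{14}}
= \frac{3 \times 4 \times 10^{12}\,\text{B} \times 8\,\text{bits/B}}{10^{14}}
= \frac{9.6 \times 10^{13}}{10^{14}} = 0.96
\]

With 12TB disks the numerator grows to \(2.88 \times 10^{14}\) bits, giving the expected 2.88 LSEs: more than two bad sectors expected during every rebuild.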


SLIDE 7

Scrubbing

  • Goal: detect LSEs in a timely manner to enable recovery
  • How: a background process that verifies sector contents
  • Detection speed: verify sectors at high frequency
  • Verification cost: avoid delaying workload requests
  • Previous work: focus on detecting LSEs fast
  • What about the cost of scrubbing? Practical questions raised:

How do I implement a scrubber?
How do I configure it to find LSEs fast?
When should I scrub, to minimize impact on the system?

SLIDE 8

Implementation & Configuration

SLIDE 9

Data scrubbing

  • Option 1: Use READs to verify data integrity
  • Overhead: data transfer cost, cache pollution
  • Correctness: might not check the medium surface
  • Option 2: Use the VERIFY firmware command
  • Caveat: VERIFYs are treated as scheduling barriers
  • Solution: disguise the scrubber’s VERIFYs as READs

[Diagram: a queue of workload READs and WRITEs with a scrubber VERIFY among them; because the VERIFY acts as a scheduling barrier, later workload requests stall behind it]
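To make Option 2 concrete, here is a minimal user-space sketch of issuing a SCSI VERIFY(10) through the Linux SG_IO ioctl. The device path and request size are illustrative assumptions; the paper's scrubber lives in the block layer and additionally disguises these commands as READs to sidestep the barrier behavior.

    /* verify10.c: issue a SCSI VERIFY(10) via the Linux SG_IO ioctl.
     * Device path, LBA, and length are illustrative assumptions. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <scsi/sg.h>

    static int scsi_verify10(int fd, uint32_t lba, uint16_t nblocks)
    {
        unsigned char cdb[10] = {0};
        unsigned char sense[32];
        struct sg_io_hdr io;

        cdb[0] = 0x2F;                  /* VERIFY(10) opcode */
        cdb[2] = (lba >> 24) & 0xFF;    /* logical block address, big-endian */
        cdb[3] = (lba >> 16) & 0xFF;
        cdb[4] = (lba >>  8) & 0xFF;
        cdb[5] =  lba        & 0xFF;
        cdb[7] = (nblocks >> 8) & 0xFF; /* verification length in blocks */
        cdb[8] =  nblocks       & 0xFF;

        memset(&io, 0, sizeof(io));
        io.interface_id    = 'S';
        io.cmd_len         = sizeof(cdb);
        io.cmdp            = cdb;
        io.dxfer_direction = SG_DXFER_NONE;  /* VERIFY transfers no data */
        io.sbp             = sense;
        io.mx_sb_len       = sizeof(sense);
        io.timeout         = 20000;          /* ms */

        if (ioctl(fd, SG_IO, &io) < 0 || io.status != 0)
            return -1;                       /* medium error: LSE found */
        return 0;
    }

    int main(void)
    {
        int fd = open("/dev/sg0", O_RDWR);   /* hypothetical device */
        if (fd < 0) { perror("open"); return 1; }
        if (scsi_verify10(fd, 0, 128) != 0)
            fprintf(stderr, "VERIFY failed: possible latent sector error\n");
        return 0;
    }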

SLIDE 10

(Animation build of SLIDE 9: the workload READs proceed while the VERIFY is held back.)

SLIDE 11

System Overview

[Diagram: the scrubbing framework sits alongside the filesystem layer, generic block layer, and I/O scheduler, issuing a stream of VERIFY requests toward the hard disk and its on-disk cache; how to size and schedule those VERIFYs is still open at this point]

SLIDE 12

Finding the sweet spot between scrub throughput and service delay

Parameter 1: VERIFY request size

[Figure: service time (msec) vs. VERIFY command size, 16KB to 16MB, for three drives: Fujitsu SCSI 36GB/10K RPM, Fujitsu SAS 73GB/15K RPM, Hitachi SAS 300GB/15K RPM]

  • Requests ≤ 64KB: seeking cost dominates, so scrub throughput suffers
  • Requests ≥ 4MB: only marginal throughput gains, at the price of higher service delay for workload requests
  • Tension: larger requests favor throughput, smaller requests favor workload latency
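A back-of-the-envelope model explains the sweet spot. Treat each VERIFY as one positioning operation plus a media transfer; the 5 ms overhead and 100 MB/s media rate below are illustrative assumptions, not measurements from the paper.

    /* throughput_model.c: effective scrub throughput vs. VERIFY size.
     * effective = size / (positioning overhead + size / bandwidth).
     * Overhead and bandwidth values are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double seek_s = 0.005;   /* positioning overhead: 5 ms */
        const double bw = 100e6;       /* media bandwidth: 100 MB/s */
        for (double sz = 16 << 10; sz <= 16 << 20; sz *= 2) {
            double service = seek_s + sz / bw;   /* time per VERIFY */
            printf("%8.0f KB -> %6.2f ms service, %5.1f MB/s effective\n",
                   sz / 1024, service * 1e3, sz / service / 1e6);
        }
        return 0;
    }

Under these assumptions a 64KB VERIFY achieves only about 12 MB/s because the positioning cost dwarfs the transfer, while past 4MB the transfer term dominates: doubling the request size buys only a few extra MB/s, yet each VERIFY occupies the disk for tens of milliseconds.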


SLIDE 13

Parameter 2: Scrubbing Order

  • Sequential scrubbing (used in the field today): verify sectors in logical block address order
  • Staggered scrubbing (fast LSE detection) [Oprea, Juels FAST ’10]: divide the LBA space into regions and verify the first segment of every region before moving on to the second, as sketched below

[Diagram: sectors 1-12 of the logical block address space grouped into regions, contrasting sequential order with staggered order]

Staggered scrubbing guarantees fast LSE detection, but its seeking overhead has never been evaluated in an implementation.

Can we afford the seeking? HDD capacities grow at roughly +100% CAGR, while data transfer speeds grow at only +40% CAGR, so full-disk passes keep getting longer.
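A minimal sketch of the staggered order (the region count and segments per region are illustrative; the paper's framework parameterizes both):

    /* staggered.c: emit the staggered scrub order. The disk is split
     * into R regions of S segments each; pass s touches the s-th
     * segment of every region, so every region is sampled early and
     * often within each disk pass. R and S here are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        const int regions = 4;          /* R */
        const int segs_per_region = 3;  /* S */

        for (int s = 0; s < segs_per_region; s++)   /* stagger step */
            for (int r = 0; r < regions; r++)       /* hop across regions */
                printf("verify segment %d\n", r * segs_per_region + s);
        return 0;
    }

Each inner loop hops across the whole disk (one seek per region per step), which is exactly the overhead the next parameter tries to contain.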


SLIDE 14

Parameter 3: Number of Regions

Limit seeking between regions, but retain frequent disk passes.

[Figure: scrubbing throughput (MB/sec) vs. number of regions, 1 to 512, for Hitachi SAS 300GB/15K RPM and Fujitsu SAS 73GB/15K RPM drives, staggered vs. sequential]

  • More, smaller regions: lower throughput (more seeking)
  • Fewer, larger regions: slower LSE detection (longer disk passes)
  • With the right region count, the staggered scrubber matches the sequential scrubber’s throughput

SLIDE 15

System Overview

[Diagram: system overview revisited; the scrubbing framework now issues VERIFYs sized between 64KB and 4MB over at least 128 regions]

Configuration so far: VERIFY size in [64KB, 4MB]; number of regions ≥ 128.

SLIDE 16

Minimizing Impact

SLIDE 17

System Overview

[Diagram: same system overview; with request size and region count fixed, the remaining question is when the framework should issue its VERIFYs]

SLIDE 18

Background Scheduling

  • Fire VERIFY requests only when the disk is otherwise idle
  • Previous work: focus on unobtrusive READs/WRITEs [Lumb ’02, Bachmat ’02]
  • Avoid collisions with workload requests
  • Start time: when should we start firing VERIFYs?
  • Stop time: when do we stop, to avoid a collision?
  • Approach: statistical analysis of idleness
  • I/O traces: 2 systems, 77 disks, diverse workloads [SNIA IOTTA repository]

[Timeline: busy intervals of READs/WRITEs separated by idle intervals; VERIFYs are fired during idle time, and a collision delays the workload when a VERIFY is still in flight as a request arrives]
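The start/stop questions can be phrased as a small predictor interface; the names below are hypothetical, purely to frame the two predictors that follow.

    /* Hypothetical framing of the two scheduling decisions; the
     * waiting and autoregression predictors are two implementations. */
    #include <stdbool.h>

    struct idle_predictor {
        /* Start time: idle_ms of idleness observed so far; begin firing? */
        bool (*should_start)(double idle_ms);
        /* Stop time: stop proactively, or keep firing until a collision? */
        bool (*should_stop)(double idle_ms);
    };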


SLIDE 19

Idleness & Long Tails

Property: long tail. The majority of idle time is concentrated in a few large idle intervals [Riska ’09].

[Figure: fraction of total idle time vs. fraction of largest idle intervals considered; a handful of the largest intervals (overnight, lunch breaks) account for most idle time, while the many short gaps between requests contribute little]

SLIDE 20

Idleness & Long Tails

Predictor: waiting. Start firing VERIFYs once the disk has stayed idle past a waiting threshold (Tw).

[Timeline: after each busy interval the scrubber waits Tw; if the disk is still idle, VERIFYs are fired for the remainder of the interval]

SLIDE 21

Idleness & Long Tails

Predictor: waiting. Fire past the threshold, and stop only on collision.

[Figure: expected idle time remaining (s) vs. amount of idle time already passed (s), on log-log axes; the longer an interval has lasted, the longer it is expected to keep lasting, which justifies firing past a threshold and not stopping proactively]
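A minimal trace-driven sketch of the waiting predictor, assuming idle intervals with known start and end times (a real scrubber only learns the end when the next request arrives, which is exactly the collision):

    /* waiting.c: fire VERIFYs once T_w of idleness has passed, and keep
     * firing until a workload request arrives (stop only on collision).
     * The trace and timing constants are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        /* illustrative idle intervals: [start, end) in seconds */
        double start[] = {0.0, 5.0, 9.0};
        double end[]   = {2.0, 5.1, 20.0};
        const double t_w = 0.5;         /* waiting threshold T_w */
        const double verify_s = 0.032;  /* service time of one VERIFY */

        for (int i = 0; i < 3; i++) {
            double t = start[i] + t_w;  /* wait out the threshold */
            int fired = 0;
            while (t < end[i]) {        /* fire until a request arrives */
                t += verify_s;
                fired++;
            }
            printf("interval %d: %d VERIFYs, %s\n", i, fired,
                   fired && t > end[i] ? "last one collided" : "no collision");
        }
        return 0;
    }

The threshold filters out the many short gaps (interval 1 above never fires), while the long-tail property says the intervals that do clear the threshold are likely to run long, so the single collision at the end is a price worth paying.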


SLIDE 22

Idleness & Periodicity

Property: periodicity. Disk traffic shows repeating patterns.

[Figure: number of requests (millions, log scale) per trace hour over 168 hours; daily peaks recur across the week]

Predictor: autoregression. Predict the idle time that will follow the current request from traffic observed at the same offset in earlier periods (hours or days ago); fire VERIFYs if the prediction exceeds a prediction threshold (Tp), and don’t stop proactively.


SLIDE 23

Predictor evaluation

[Figure: fraction of total idle time utilized vs. fraction of idle intervals picked by the predictor; the optimal predictor sits top-left, utilizing most idle time while touching few intervals]

SLIDE 24

Predictor evaluation

[Figure: same axes, now comparing three predictors: Oracle, Autoregression, and Waiting]

Oracle: always picks X% largest intervals
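A sketch of how such an oracle curve is computed from a trace of idle-interval lengths (the interval values are illustrative, chosen long-tailed as in the traces):

    /* oracle.c: sort idle intervals by length and measure what fraction
     * of total idle time the largest X% cover. */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_desc(const void *a, const void *b)
    {
        double d = *(const double *)b - *(const double *)a;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        double idle[] = {3600, 1800, 5, 2, 1, 1, 0.5, 0.2, 0.1, 0.1};
        const int n = sizeof(idle) / sizeof(idle[0]);
        double total = 0, covered = 0;
        for (int i = 0; i < n; i++) total += idle[i];

        qsort(idle, n, sizeof(double), cmp_desc);
        int picked = n / 10;    /* oracle picks the top 10% */
        for (int i = 0; i < picked; i++) covered += idle[i];
        printf("top 10%% of intervals cover %.0f%% of idle time\n",
               100 * covered / total);
        return 0;
    }

On this toy trace the single largest interval (10% of intervals) covers about two thirds of all idle time, which is the long-tail property the oracle exploits.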


SLIDE 25

Predictor evaluation

[Figure: same axes; each point along the Autoregression and Waiting curves corresponds to a threshold setting, sweeping the prediction threshold (Tp) from 8ms to 2048ms and the waiting threshold (Tw) from 16ms to 2048ms]

SLIDE 26

Fine-tuning the wait

[Figure: average delay per workload I/O request (ms) vs. scrubbing throughput (MB/s), sweeping VERIFY request size and waiting threshold; the highlighted sweet spot is (512KB, 32ms)]

SLIDE 27

Fine-tuning the wait

[Figure: same axes, with curves for VERIFY sizes from 64KB to 4096KB and waiting thresholds from 1ms to 4096ms; larger thresholds move operating points away from the worst possible and toward best effort. The CFQ I/O scheduler’s operating point is marked for comparison]

Completely Fair Queueing (CFQ):

  • Default Linux I/O scheduler
  • Always waits 10ms before declaring the disk idle

SLIDE 28

System Overview

[Diagram: the complete picture; the scrubbing framework issues VERIFYs of 64KB to 4MB across at least 128 regions, and the waiting threshold schedules them into idle intervals behind the workload’s READs and WRITEs]

Final configuration: VERIFY size in [64KB, 4MB]; regions ≥ 128; waiting-threshold scheduling.

SLIDE 29

Also in the paper

  • Scrubbing framework implementation: an open-sourced* framework that simplifies the development of scrubbing algorithms
  • Why VERIFY is implemented incorrectly in ATA drives
  • Evaluation of scrubbing impact in the implementation, using synthetic and realistic workloads
  • Detailed statistical analysis:
    • Detection of longer tails than reported in previous work
    • Characterization of periods in the traces
    • TPC-C benchmark: idle time distribution unrepresentative of OLTP workloads

* Kernel- and user-level sources available at http://www.cs.toronto.edu/~gamvrosi