Practical Scrubbing
Getting to the bad sector at the right time
George Amvrosiadis Bianca Schroeder
University of Toronto
Alina Oprea
RSA Laboratories
Tuesday, June 26, 2012
Spindle failure
Head crashed/stuck, lube build-up
Electrical failure
Firmware bug
PCB failure
Surface abnormalities
Latent Sector Error (LSE)
[Figure: single-parity RAID (3+P); x-axis: array capacity (TB), 1.0 to 64.0; y-axis ticks from 0.01 to 10; label: 1.0E+14]
Steven R. Hetzler, “System Impacts of Storage Trends: Hard Errors and Testability”. In USENIX ;login:, v.36/3
2012 (4x4TB): expected 0.96 LSEs
~2015 (4x12TB): expected 2.88 LSEs
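The two expected values are consistent with a simple back-of-the-envelope model, sketched here in Python. It rests on an interpretation that the slide does not state: that 1.0E+14 is the number of bits read per unrecoverable error, and that rebuilding one failed disk in a 3+P array means reading the three surviving disks in full. The function name `expected_lses` is hypothetical.

```python
def expected_lses(disks_read: int, disk_tb: float, bits_per_error: float = 1.0e14) -> float:
    """Expected number of LSEs encountered while reading `disks_read` full disks.

    Assumption (not from the slides): one unrecoverable error per
    `bits_per_error` bits read, and decimal terabytes (1 TB = 1e12 bytes).
    """
    bits_read = disks_read * disk_tb * 1e12 * 8  # TB -> bytes -> bits
    return bits_read / bits_per_error


print(round(expected_lses(3, 4), 2))   # 2012: three surviving 4TB disks
print(round(expected_lses(3, 12), 2))  # ~2015: three surviving 12TB disks
```

Under these assumptions the expected count scales linearly with disk capacity, which is why tripling the disk size triples the figure.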
How do I implement a scrubber?
How do I configure it to find LSEs fast?
When should I scrub, to minimize impact on the system?
[Diagram: READ and WRITE requests alongside a VERIFY request]
[Diagram: Filesystem Layer, Scrubbing Framework, Generic Block Layer, I/O Scheduler, and hard disk with on-disk cache, with VERIFY requests]
Finding the sweet spot between scrub throughput and service delay
Parameter 1:
[Figure: service time (msec) vs. VERIFY command size (bytes), 16K to 16M, for three drives: Fujitsu SCSI, 36GB, 10K RPM; Fujitsu SAS, 73GB, 15K RPM; Hitachi SAS, 300GB, 15K RPM]
Smaller requests? Lower throughput: seeking cost is dominant for requests ≤ 64KB.
Larger requests? Higher service delay: only a marginal increase in throughput for requests ≥ 4MB.
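The trade-off can be illustrated with a toy cost model, sketched below in Python. The fixed positioning cost (5 ms) and transfer rate (100 MB/s) are made-up values chosen for illustration, not measurements from the drives on the slide, and `verify_cost` is a hypothetical helper.

```python
def verify_cost(size_kb: float, positioning_ms: float = 5.0, transfer_mb_s: float = 100.0):
    """Return (service_time_ms, throughput_mb_s) for one VERIFY of `size_kb`.

    Toy model: every request pays a fixed positioning cost and then
    transfers at a constant rate. Both defaults are illustrative assumptions.
    """
    size_mb = size_kb / 1024
    service_ms = positioning_ms + size_mb / transfer_mb_s * 1000
    throughput_mb_s = size_mb / (service_ms / 1000)
    return service_ms, throughput_mb_s


for size_kb in (16, 64, 512, 4096, 16384):
    ms, mbs = verify_cost(size_kb)
    print(f"{size_kb:>6} KB: {ms:7.1f} ms per request, {mbs:5.1f} MB/s")
```

In this model small requests are dominated by the positioning cost, while very large requests add service time almost linearly for little extra throughput, which matches the shape of the argument on the slide.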
Staggered scrubbing guarantees fast LSE detection, but: seeking overhead never evaluated in implementation
Parameter 2:
[Diagram: sectors along the Logical Block Address axis, grouped into regions]
Can we afford this?
HDD capacities: CAGR +100%
Data transfer speed: CAGR +40%
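A short calculation shows why these two growth rates matter for scrubbing. The sketch below assumes that the time for one full pass over a disk is proportional to capacity divided by transfer speed, and that both growth rates stay constant; the function name is hypothetical.

```python
CAPACITY_GROWTH = 2.0  # +100% per year, from the slide
TRANSFER_GROWTH = 1.4  # +40% per year, from the slide


def relative_pass_time(years: int) -> float:
    """Full-disk pass time after `years`, relative to today.

    Assumes pass time = capacity / transfer speed and constant growth rates.
    """
    return (CAPACITY_GROWTH / TRANSFER_GROWTH) ** years


for years in (1, 3, 5):
    print(f"after {years} year(s): {relative_pass_time(years):.2f}x")
```

Under these assumptions a full pass takes roughly 43% longer each year, so any extra seeking overhead compounds an already growing cost.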
Limit seeking between regions, but retain frequent disk passes
[Figure: scrubbing throughput (MB/sec) vs. number of regions (1 to 512), for the staggered and sequential scrubbers on the Hitachi SAS, 300GB, 15K RPM and the Fujitsu SAS, 73GB, 15K RPM]
Lower throughput (more seeking)
Slower LSE detection (longer disk passes)
Performance of two approaches equated
Parameter 3:
more, smaller regions! fewer, larger regions?!
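For reference, the visiting order of a staggered scrubber can be sketched as follows. This follows the usual description of staggered scrubbing (the first segment of every region, then the second segment of every region, and so on); the function name and the segment-index representation are assumptions made for illustration, not the authors' implementation.

```python
from typing import Iterator


def staggered_order(num_segments: int, num_regions: int) -> Iterator[int]:
    """Yield segment indices in staggered order.

    The disk is viewed as `num_segments` equal segments in LBA order, grouped
    into `num_regions` contiguous regions. Pass k visits the k-th segment of
    every region, so each pass touches the whole LBA range.
    """
    per_region = -(-num_segments // num_regions)  # ceiling division
    for offset in range(per_region):              # one pass over the disk
        for region in range(num_regions):         # jump from region to region
            segment = region * per_region + offset
            if segment < num_segments:
                yield segment


print(list(staggered_order(12, 4)))
```

With a single region the order degenerates to a sequential scan, which makes the number of regions the knob that trades seeking against the length of each pass.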
[Diagram: Filesystem Layer, Scrubbing Framework, Generic Block Layer, I/O Scheduler, and hard disk with on-disk cache; VERIFY requests; disk regions with segments numbered 1 to 12]
VERIFY size: [64KB, 4MB]
Regions: ≥128
[Timeline diagram: READ and WRITE requests (busy interval), VERIFY requests (idle interval), and the collision delay when a VERIFY collides with the next busy interval]
Property: Long Tail - Majority of idle time in few idle intervals [Riska ’09]
[Figure: fraction of total idle time vs. fraction of largest idle intervals; labels: Home, Lunch Break, Processing b/w requests]
Predictor: Waiting - Fire past threshold, stop only on collision
[Timeline diagram: READ and WRITE requests, with VERIFY requests issued once the Waiting Threshold (Tw) has passed]
[Figure: expected idle time remaining (s) vs. amount of idle time passed (s), logarithmic axes]
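The Waiting predictor can be simulated over a list of idle interval lengths, as in the sketch below. It is a simplification: it ignores VERIFY request granularity and collision delay, and it treats an interval as fully usable from the moment the threshold passes until the interval ends. The sample data and the function name are hypothetical.

```python
def waiting_predictor(idle_intervals_ms, tw_ms):
    """Simulate the Waiting predictor over a list of idle interval lengths.

    The scrubber waits `tw_ms` into every idle interval; if the interval is
    still running it fires and keeps scrubbing until the interval ends
    (the collision). Returns (fraction of intervals picked,
    fraction of total idle time utilized).
    """
    picked = [length for length in idle_intervals_ms if length > tw_ms]
    utilized = sum(length - tw_ms for length in picked)
    total = sum(idle_intervals_ms)
    return len(picked) / len(idle_intervals_ms), utilized / total


# Hypothetical long-tailed sample: many short gaps, a few very long ones.
sample = [2] * 90 + [50] * 8 + [5000, 20000]
for tw in (1, 32, 512):
    picked, utilized = waiting_predictor(sample, tw)
    print(f"Tw={tw:>3}ms: picks {picked:.0%} of intervals, uses {utilized:.1%} of idle time")
```

On a long-tailed sample like this one, raising the threshold sharply cuts the number of intervals picked (and so the number of possible collisions) while giving up only a small share of the idle time.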
Property: Periodicity - Repeating patterns in disk traffic
Predictor: Autoregression - Fire if prediction > threshold, don’t stop
[Figure: number of requests (millions) vs. trace hour (24 to 168)]
[Timeline diagram: a prediction of X ms, based on earlier observations (hours ago), is compared against the Prediction Threshold (Tp) before VERIFY requests are issued]
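A minimal sketch of the firing decision is shown below. It makes two assumptions that the slides do not confirm: the predicted quantity is the length of the next idle interval, and a first-order model (AR(1), fitted by ordinary least squares) stands in for whatever model order the authors used. All names and the sample history are hypothetical.

```python
def fit_ar1(history):
    """Least-squares fit of x[t] = a + b * x[t-1] over `history`."""
    xs, ys = history[:-1], history[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = cov_xy / var_x if var_x else 0.0
    a = mean_y - b * mean_x
    return a, b


def should_fire(history_ms, tp_ms):
    """Fire the scrubber if the predicted next idle interval exceeds Tp."""
    a, b = fit_ar1(history_ms)
    prediction = a + b * history_ms[-1]
    return prediction > tp_ms, prediction


# Hypothetical history of idle interval lengths, in milliseconds.
history = [100, 60, 40, 30, 25]
for tp in (16, 32):
    fire, prediction = should_fire(history, tp)
    print(f"Tp={tp}ms: predicted {prediction:.1f}ms, fire={fire}")
```

Unlike the Waiting predictor, this decision is made at the start of the idle interval, so no idle time is spent waiting, but a wrong prediction cannot be corrected by observing the interval first.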
[Figure: fraction of idle time utilized by predictor vs. fraction of idle intervals picked by predictor, for the Oracle, Autoregression, and Waiting predictors; points labeled with threshold values from 8ms to 2048ms (prediction threshold Tp, waiting threshold Tw)]
Oracle: always picks X% largest intervals
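The Oracle baseline is easy to state in code: given full knowledge of the idle intervals, pick the largest X% and report how much of the total idle time they cover. The sketch below uses a hypothetical long-tailed sample, not trace data, and a hypothetical function name.

```python
def oracle_utilization(idle_intervals_ms, fraction_picked):
    """Share of total idle time captured by picking the largest intervals."""
    count = round(len(idle_intervals_ms) * fraction_picked)
    largest = sorted(idle_intervals_ms, reverse=True)[:count]
    return sum(largest) / sum(idle_intervals_ms)


# Hypothetical long-tailed sample: many short gaps, a few very long ones.
sample = [2] * 90 + [50] * 8 + [5000, 20000]
for fraction in (0.01, 0.02, 0.10):
    print(f"top {fraction:.0%}: {oracle_utilization(sample, fraction):.1%} of idle time")
```

Because the Oracle needs the interval lengths in advance, it is only an upper bound against which realizable predictors can be compared.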
[Figure: scrubbing throughput (MB/s) and average delay per workload I/O request (ms), varying the VERIFY request size (64KB, 512KB, 4096KB) and the waiting threshold (1ms to 4096ms); annotations: larger thresholds, best effort, worst possible, CFQ I/O scheduler; highlighted point: (512KB, 32ms)]
Completely Fair Queuing (CFQ):
declaring disk idle
[Diagram: Filesystem Layer, Scrubbing Framework, Generic Block Layer, I/O Scheduler, and hard disk with on-disk cache; VERIFY size: [64KB, 4MB]; Regions: ≥128; timeline of READ, WRITE, and VERIFY requests with the waiting threshold; labels: VERIFY request size, waiting threshold]
scrubbing algorithms
VERIFY is implemented incorrectly in ATA drives
synthetic/realistic workloads
unrepresentative of OLTP workloads
* Kernel/User-level sources available at http://www.cs.toronto.edu/~gamvrosi