The “Coolness” of Reliability and other tales …
Ali R. Butt
Disk Storage Requirements
Persistence
– Data is not lost between power-cycles
Integrity
– Data is not corrupted, “what I stored is what I retrieve”
Availability
– Data can be accessed at any time
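A minimal sketch of the integrity requirement, assuming a checksum-on-write / verify-on-read discipline (the function names and the SHA-256 choice are illustrative):

import hashlib

def store(path: str, data: bytes) -> str:
    """Write data and return its checksum for later verification."""
    with open(path, "wb") as f:
        f.write(data)
    return hashlib.sha256(data).hexdigest()

def retrieve(path: str, expected_checksum: str) -> bytes:
    """Read data back and verify "what I stored is what I retrieve"."""
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != expected_checksum:
        raise IOError("integrity violation: stored and retrieved data differ")
    return data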
Systems employ 100s of disks (1000s not that far off)
– Direct connected
– Network connected
Disk failures have a significant effect
Failure mitigation is critical
Annualized Failure Rates
(Failure Trends in a Large Disk Drive Population, Pinheiro et al., FAST’07)
Idle Read After Write (IRAW)*
– Check reads are done when the disk is idle
[Figure: write, retain in memory, then read back and compare during idle; attempt recovery on mismatch]
* Idle Read After Write, Riska and Riedel, ATC’08
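A minimal sketch of the IRAW idea, assuming recently written blocks are retained in memory and verified during idle periods; the toy disk model and names are illustrative, not the ATC’08 implementation:

from collections import deque

class IRAWDisk:
    """Toy disk that retains recent writes and verifies them when idle."""

    def __init__(self, retain_limit: int = 64):
        self.blocks = {}                 # block_id -> bytes actually "on disk"
        self.pending = deque()           # (block_id, retained copy) awaiting check
        self.retain_limit = retain_limit

    def write(self, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = data     # in a real disk this write may go bad
        self.pending.append((block_id, data))
        if len(self.pending) > self.retain_limit:
            self.pending.popleft()       # oldest copy evicted unverified

    def on_idle(self) -> None:
        """During an idle period, read back and compare; rewrite on mismatch."""
        while self.pending:
            block_id, retained = self.pending.popleft()
            if self.blocks.get(block_id) != retained:
                self.blocks[block_id] = retained   # attempt recovery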
Disk Scrubbing*
– Scrub during idle periods
[Figure: periodically read and verify disk blocks and parity; attempt recovery on detected errors]
* Disk scrubbing in large archival storage systems, Schwarz et al., MASCOTS’04
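A minimal sketch of a scrubbing pass, assuming per-block checksums and a replica to repair from (illustrative, not the MASCOTS’04 design):

import hashlib

def scrub(blocks, checksums, replica):
    """Scan every block, detect latent errors, repair from a replica.

    blocks, replica: dict block_id -> bytes; checksums: dict block_id -> hex digest.
    Returns the ids of blocks that were repaired.
    """
    repaired = []
    for block_id, data in blocks.items():
        if hashlib.sha256(data).hexdigest() != checksums[block_id]:
            blocks[block_id] = replica[block_id]   # recover the latent error
            repaired.append(block_id)
    return repaired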
Spin-down disks during idle periods
Reliability vs. Energy Savings
– Energy-delay product (EDP): a flexible metric that finds a balance between saving energy and improving performance
– Do scrubbing/IRAW in idle periods
– Spin-down disks in idle periods
– Reconcile?
* On the Impact of Disk Scrubbing on Energy Savings, Wang, Butt, Gniady, HotPower’08
ERP = Energy Savings * Reliability Improvement
– Want good energy savings
– Want to improve reliability
[Figure: disk timeline alternating between busy periods serving I/O requests and idle periods]
Mean Time To Data Loss (MTTDL)
– Higher MTTDL means better reliability
Scrubbing Period
– Definition: time between two scrubbing cycles
– Shorter scrubbing period yields higher MTTDL [Iliadis2008, Dholakia2008]
– ERP = Energy Savings * Increase in MTTDL
– The increase in MTTDL is inversely proportional to the scrubbing period, so:
  ERP ∝ Energy Savings * (1 / Scrubbing Period)
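A small worked example of using ERP to compare configurations, assuming the 1/period proxy above (all numbers are illustrative):

def erp(energy_savings: float, scrubbing_period_hours: float) -> float:
    """ERP proxy: energy savings * (1 / scrubbing period)."""
    return energy_savings * (1.0 / scrubbing_period_hours)

# Shorter periods raise reliability but typically cost energy savings;
# ERP picks the configuration that balances the two.
candidates = [(0.30, 24.0), (0.20, 12.0), (0.10, 6.0)]  # (savings, period)
best = max(candidates, key=lambda c: erp(*c))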
Evaluated scrubbing and disk spinning-down with desktop application workloads
– Mozilla, mplayer, writer, calc, impress, xemacs
Use part of each idle period for scrubbing, the rest for spinning-down
– Disk is not spun-down during short idle periods
– Optimization: use entire short periods for scrubbing (see the sketch below)
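A minimal simulation of this splitting policy, assuming fixed idle/scrub power draws and a break-even threshold for short periods (all constants are illustrative):

def evaluate(idle_periods, frac, p_idle=5.0, p_scrub=7.0, breakeven=10.0):
    """Spend `frac` of each idle period scrubbing, spin down for the rest.

    idle_periods: idle durations in seconds. Returns (energy_saved_joules,
    seconds_scrubbed); an ERP-style comparison can be built on these.
    """
    energy_saved, scrubbed = 0.0, 0.0
    for t in idle_periods:
        scrub_t = frac * t
        rest = t - scrub_t
        if t < breakeven:
            scrub_t, rest = t, 0.0   # optimization: short periods go to scrubbing
        scrubbed += scrub_t
        energy_saved += rest * p_idle                  # spun down instead of idling
        energy_saved -= scrub_t * (p_scrub - p_idle)   # scrubbing costs extra power
    return energy_saved, scrubbed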
[Figure: normalized energy savings and ERP vs. fraction of each idle period used for scrubbing]
ERP captures a good trade-off point between energy savings and reliability improvements
Idle periods are hard to exploit:
– Duration unknown
– Spin-down/up overheads
Need online schemes to choose between scrubbing or spinning-down
– We evaluate three such schemes
[Figure: normalized energy savings, reliability, and ERP (up to 180%) for each workload: mozilla, mplayer, impress, writer, calc, xemacs]
– Penalty if another access comes right after spin-down
– Timeout periods before spin-down are wasted
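The penalty can be made concrete with the classic break-even calculation: spinning down only pays off if the idle period is long enough to amortize the transition cost. The power and energy values below are illustrative, not from the talk:

def breakeven_seconds(p_idle=5.0, e_spindown=5.0, e_spinup=15.0):
    """Idle duration above which spinning down saves energy.

    p_idle: idle power (W); e_spindown/e_spinup: transition energies (J).
    """
    return (e_spindown + e_spinup) / p_idle

# An access arriving before this threshold makes spin-down a net loss,
# and any timeout spent waiting before spin-down is pure waste.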
Small contributions to reliability make this approach impractical
Summary: studied the combined effect of disk scrubbing and spinning-down for saving energy, with approaches mixing scrubbing and spinning-down
Future work:
– Develop a reliability model for IRAW
– Validate ERP with other workloads
– Extend our model with multi-speed disks
– Costly, especially for high-speed scratch storage systems
– Mired with acquisition issues, red-tape
– Adds software complexity
Offload result-data to intermediate and end-user resources
– Offloading errors affect supercomputer serviceability
– Upshot: timely offloading can help improve center performance and job resubmission rates (NSF06-573, …)
Not an ideal solution for data-offloading
Goal: timely offloading of result-data from the center to end users
* Timely Offloading of Result-Data in HPC Centers, Monti, Butt, Vazhkudai, ICS’08
– Transfer limited by end-user available bandwidth
– Delayed transfer and storage failures may result in loss of data!
Addresses many of the problems of point-to-point transfers
1. Discovering intermediate nodes
2. Providing incentives to participate
3. Addressing insufficient participants
4. Adapting to dynamic network behavior
5. Ensuring data reliability and availability
6. Meeting SLAs during the offload process
Structured overlay networks
Identifier space: 0 to 2^128 − 1
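A minimal sketch of mapping nodes and data into this identifier space, assuming a Chord/Pastry-style structured overlay (the MD5 hash and closest-id routing rule are illustrative):

import hashlib

ID_BITS = 128

def node_id(name: str) -> int:
    """Map a node name into the [0, 2^128 - 1] identifier space."""
    return int.from_bytes(hashlib.md5(name.encode()).digest(), "big")

def responsible_node(key: str, nodes: list[str]) -> str:
    """Route a data key to the node whose id is numerically closest on the ring."""
    k = node_id(key)
    span = 1 << ID_BITS
    return min(nodes, key=lambda n: min((node_id(n) - k) % span,
                                        (k - node_id(n)) % span))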
Offload Process
– “Virtual Organizations”: sets of geographically distributed users from different sites
– Jobs in TeraGrid usually come from such organizations
– Members help with each other’s offloads over time
Adapting to Dynamic Network Behavior
[Figure: example offload paths with node bandwidths of 10 Mb/s, 5 Mb/s, 4 Mb/s, and 1 Mb/s, and a location failure]
– Use direct transfer if it can meet the SLA
– Otherwise, utilize the decentralized/staged offload approach
SLA: Toffload < min(Dpurge, JSLA)
– Specifies destination, intermediate nodes, and deadline
#PBS -N myjob
#PBS -l nodes=128, walltime=12:00
mpirun -np 128 ~/MyComputation
#Stageout Output DestinationSite
#InterNode node1.Site1:49665:50GB
...
#InterNode nodeN.SiteN:49665:30GB
#Deadline 1/14/2007:12:00
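A sketch of how such a script could be interpreted, assuming the #Stageout / #InterNode / #Deadline directive format shown above; the helper names and SLA check are illustrative, not the paper’s implementation:

def parse_offload_spec(script: str) -> dict:
    """Pull destination, intermediate nodes, and deadline out of a job script."""
    spec = {"internodes": []}
    for line in script.splitlines():
        if line.startswith("#Stageout"):
            _, output, dest = line.split()
            spec["output"], spec["destination"] = output, dest
        elif line.startswith("#InterNode"):
            host, port, capacity = line.split()[1].split(":")
            spec["internodes"].append((host, int(port), capacity))
        elif line.startswith("#Deadline"):
            spec["deadline"] = line.split(None, 1)[1]
    return spec

def meets_sla(t_offload: float, d_purge: float, j_sla: float) -> bool:
    """Offload must finish before both the purge deadline and the job SLA."""
    return t_offload < min(d_purge, j_sla)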
Adapting BitTorrent Functionality to Data Offloading
– Peers with less storage than the result-data size can be utilized
– Use NWS bandwidth measurements
– Use knowledge of node capacity from PBS scripts
– Choose the appropriate nodes with storage capacity (see the sketch below)
– They may simply pass data onward
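A minimal sketch of the node-selection step, assuming NWS-style bandwidth estimates and capacities taken from the job script (the ranking heuristic is illustrative):

def choose_nodes(nodes, data_size_gb, want=8):
    """Pick intermediate nodes by measured bandwidth, honoring capacity.

    nodes: list of (name, bandwidth_mbps, capacity_gb). Peers with less
    storage than the full result-data can still be used for partial chunks.
    """
    ranked = sorted(nodes, key=lambda n: n[1], reverse=True)
    chosen, covered = [], 0.0
    for name, bw, cap in ranked:
        chosen.append(name)
        covered += min(cap, data_size_gb)
        if len(chosen) >= want and covered >= data_size_gb:
            break
    return chosen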
[Figure: architecture: the center-side Offload Manager (Erasure Coding, SLA Compliance, NWS Query, Transfer Module) sends result-data chunks to Node Managers on overlay nodes, guided by NWS measurements and the center SLA]
1. Compare with direct transfer and BitTorrent
2. Observe how the system reacts to failures and bandwidth fluctuations:
   a. How are SLAs enforced?
   b. How is fault tolerance achieved?
3. Validate our method as a viable alternative to other approaches
center + end user + 20 intermediate nodes
Compare the proposed method with direct transfer, push/pull alternatives, and standard BitTorrent
Results: Data Transfer Times with Respect to Direct Transfer
Times are in seconds

File Size   100 MB   240 MB   500 MB   2.1 GB
Direct         286      727     1443     5834
Offload         38       95      169      570
Push            82      179      349     1123
Pull            29       93      202      562
A staged offload is capable of significantly improving offload times
Results: Data Transfer Times with Respect to Standard BitTorrent
Times are in seconds; transferring a 2.1 GB file

Phase                                    BitTorrent   Our Method
Send one copy from center (Offload)            1172          570
Send to all intermediate nodes (Push)          1593         1123
Submission site download (Pull)                 571          562
Monitoring-based offload is capable of outperforming standard BitTorrent
Results: Adapting to Dynamic Network Behavior
SLA is 600 seconds; transferring a 2.1 GB file
[Figure: available bandwidth at each node (MB/s) over time (s); at 10 s the direct bandwidth is reduced to 1/10, at 150 s a node’s bandwidth drops to 1 MB/s, at 250 s a node fails]
A staged offload is capable of adapting to bandwidth changes or failures
[Figure: available data (%) vs. number of failed nodes (1-10), comparing erasure encoding with 2 copies, encoding with 1 copy, no encoding with 2 copies, and no encoding with 1 copy]
A staged offload can protect data even when many nodes fail
– Transferred a 2.1 GB file
– Randomly failed 10 nodes during the transfer
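A small model of this experiment, assuming chunks are placed round-robin over the nodes and an (n, k) erasure code tolerates n − k lost chunks per stripe (the parameters are illustrative, not the paper’s configuration):

import random

def surviving_fraction(num_nodes=20, chunks_per_node=10, failed=10,
                       n=14, k=10, trials=1000):
    """Monte Carlo estimate of data still recoverable after node failures.

    With an (n, k) code a stripe survives if at least k of its n chunks
    remain; without coding (n == k) every chunk must survive.
    """
    ok = 0
    stripes = (num_nodes * chunks_per_node) // n
    for _ in range(trials):
        alive = set(random.sample(range(num_nodes), num_nodes - failed))
        for s in range(stripes):
            # chunk i of stripe s lives on node (s * n + i) % num_nodes
            survivors = sum(1 for i in range(n)
                            if (s * n + i) % num_nodes in alive)
            ok += survivors >= k
    return ok / (stripes * trials)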
Conclusions:
– Decentralized approach
– Monitoring-based adaptation
– Outperformed direct transfer and standard BitTorrent in our experiments
Future work: exploit heterogeneous end resources (GPUs, PS3, …)