SLIDE 1

The “Coolness” of Reliability and other tales …

Ali R. Butt

SLIDE 2

Disk Storage Requirements

  • Persistence
    – Data is not lost between power-cycles
  • Integrity
    – Data is not corrupted: “what I stored is what I retrieve”
  • Availability
    – Data can be accessed at any time
  • Performance: Sustain high data transfer rates
  • Efficiency: Reduce resource (energy, space) wastage

SLIDE 3

Modern Storage Systems Characteristics

  • Employ 10s to 100s of disks (1000s not that far off)
  • Package disks into storage units (appliances)
    – Direct connected
    – Network connected
  • Support simultaneous access for performance
  • Use redundancy to protect against disk failures

SLIDE 4

With a Large Number of Disks, Failures are Common

  + Aging does not have a significant effect
  – Disks can fail in batches
  → Failure mitigation is critical

[Figure: Annualized Failure Rates]
(Failure Trends in a Large Disk Drive Population, Pinheiro et al., FAST’07)

SLIDE 5

Tolerating Disk Failures using RAID

[Figure: RAID array with a parity (P) block; a failed disk is rebuilt during recovery]
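
As a concrete illustration of the recovery step (a generic RAID-4/5-style sketch, not code from the talk): the parity block is the XOR of the data blocks, so any single failed block can be rebuilt from the survivors.

```python
# Minimal sketch of parity-based recovery (RAID-4/5 style), not from the talk.
# The parity block is the XOR of all data blocks, so any one lost block can be
# rebuilt by XOR-ing the parity with the surviving blocks.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]           # blocks on three data disks
parity = xor_blocks(data)                    # block on the parity disk

# Disk 1 fails; rebuild its block from the parity and the surviving disks.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```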

SLIDE 6

Growing Disk Density


SLIDE 7

How Do Latent Sector Errors Occur?

  • OS writes data to disk and perceives the write to be successful
  • Data is corrupted due to bit flips, media failures, etc.
  • Errors remain undiscovered (hidden)
  • Later, the OS is unable to read the data → ERROR

SLIDE 8

Effect of Latent Sector Errors

[Figure: a latent sector error encountered while rebuilding from parity (P) defeats the recovery attempt and causes data loss]

SLIDE 9

Protecting Against Latent Errors: Idle Read After Write (IRAW*)

  • IRAW can improve data reliability
    → Check reads are done when the disk is idle

[Figure: IRAW operation: Write → Retain in memory → Read back when idle → Compare → Recovery on mismatch]

*Idle Read After Write, Riska and Riedel, ATC’08
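
A rough, hypothetical rendering of the IRAW cycle sketched above; the real mechanism lives in the drive firmware, and the class and method names here are made up for illustration.

```python
# Hypothetical illustration of Idle Read After Write (IRAW): recently written
# data is retained in memory and read back during idle time; a mismatch means a
# latent error was caught while recovery (a rewrite) is still possible.

class IRAWDisk:
    def __init__(self):
        self.media = {}        # sector -> bytes actually on the platter
        self.retained = {}     # sector -> bytes kept in memory after a write

    def write(self, sector, data):
        self.media[sector] = data
        self.retained[sector] = data

    def corrupt(self, sector, data):       # simulate a silent media error
        self.media[sector] = data

    def idle_verify(self):
        for sector, expected in list(self.retained.items()):
            if self.media.get(sector) != expected:
                self.media[sector] = expected   # recover by rewriting
            del self.retained[sector]           # verified; free the memory

d = IRAWDisk()
d.write(7, b"result")
d.corrupt(7, b"res\x00lt")
d.idle_verify()
assert d.media[7] == b"result"
```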

SLIDE 10

Protecting Against Latent Errors: Disk Scrubbing*

  • Scrubbing improves data reliability
    → Scrub during idle periods

[Figure: scrubbing detects a latent error and triggers recovery from parity (P)]

* Disk scrubbing in large archival storage systems, Schwarz et al., MASCOTS’04
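
Scrubbing can be pictured as a background sweep that re-reads sectors during idle time and verifies them against stored checksums, repairing anything that fails from redundancy. A minimal sketch, with assumed helper names (`repair_from_redundancy`, `idle`) that are not from the paper:

```python
# Toy sketch of disk scrubbing (not the paper's implementation): sweep sectors
# during idle periods, verify each against a stored checksum, and repair latent
# errors from redundancy before a real read hits them.
import zlib

def scrub(sectors, checksums, repair_from_redundancy, idle):
    """sectors/checksums are dicts keyed by sector number; 'idle' reports
    whether the disk is still idle so the scrub can pause and resume later."""
    for s, data in sectors.items():
        if not idle():
            return False                    # resume in the next idle period
        if zlib.crc32(data) != checksums[s]:
            sectors[s] = repair_from_redundancy(s)   # latent error found
    return True                             # full scrubbing cycle finished

# Tiny demo: sector 1 carries a latent error and gets repaired.
sectors = {0: b"good", 1: b"bad?"}
checksums = {0: zlib.crc32(b"good"), 1: zlib.crc32(b"good")}
scrub(sectors, checksums, repair_from_redundancy=lambda s: b"good", idle=lambda: True)
assert sectors[1] == b"good"
```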

SLIDE 11

A Large Number of Disks can Consume Significant Energy


  • Spinning down disks saves energy
    → Spin down disks during idle periods

SLIDE 12

Reliability or Energy Savings? Or Both?

[Figure: Reliability vs. Energy Savings]

SLIDE 13

Reliability Vs. Energy Savings: Which Way To Go? *

  • Similar trade-offs arise in the energy-performance optimization domain
    – Energy-delay product (EDP): a flexible metric that balances saving energy against improving performance

[Figure: idle periods can be used for scrubbing/IRAW (reliability improvement) or for spinning down disks (energy savings); can the two be reconciled?]

* On the Impact of Disk Scrubbing on Energy Savings, Wang, Butt, Gniady, HotPower’08

SLIDE 14

Energy-Reliability Product (ERP)

  • A new metric that considers both energy and reliability

ERP = Energy Savings * Reliability Improvement

  • Can ERP help us reconcile energy & reliability?
    – Want good energy savings
    – Want to improve reliability
  • Goal: Maximize ERP

SLIDE 15

Background: Anatomy of a Disk Idle Period

[Figure: a disk activity timeline; bursts of I/O requests keep the disk busy, and the gaps between them are disk idle periods]

SLIDE 16

Measuring Reliability

  • A common metric: Mean Time to Data Loss (MTTDL)
    – Higher MTTDL → better reliability
  • For scrubbing, MTTDL can be expressed in terms of the scrubbing period
    – Definition: time between two scrubbing cycles
    – Shorter scrubbing period → higher MTTDL
  • Detailed models of MTTDL for scrubbing have been developed [Iliadis2008, Dholakia2008]

SLIDE 17

Determining ERP

  • ERP = Energy Savings ∗ Reliability Improvement
  • ERP can be expressed in terms of MTTDL:
    – ERP = Energy Savings ∗ Increase in MTTDL
  • For scrubbing, MTTDL is inversely proportional to the scrubbing period
    → ERP ∝ Energy Savings ∗ 1/Scrubbing Period

SLIDE 18

Validation of ERP

  • Employ trace-driven simulation of scrubbing and disk spin-down
  • Use traces of typical desktop applications:
    – Mozilla, mplayer, writer, calc, impress, xemacs

SLIDE 19

Time-Share Allocation

  • A preset fraction of each idle period is used for scrubbing, the rest for spinning down (a toy sketch follows)
    – The disk is not spun down during short idle periods
    – Optimization: use entire short periods for scrubbing
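
A toy model of the time-share allocation (my simplifying assumptions, not the paper's trace-driven simulator): energy savings are taken as proportional to the spin-down share of each idle period and reliability improvement as proportional to the scrubbing share, so sweeping the scrubbing fraction shows how ERP peaks between the two extremes, which is what slides 20-21 report for real traces.

```python
# Toy sweep of the scrubbing fraction f in each idle period (assumed model, not
# the paper's simulator): energy savings ~ (1 - f), reliability improvement ~ f
# (more scrubbing -> shorter scrubbing period -> higher MTTDL).

def erp(energy_savings, reliability_improvement):
    return energy_savings * reliability_improvement

for f in [i / 10 for i in range(11)]:
    es, ri = 1.0 - f, f
    print(f"scrub fraction {f:.1f}: energy {es:.2f}, reliability {ri:.2f}, ERP {erp(es, ri):.2f}")
# ERP is zero at either extreme and peaks in between: the trade-off point.
```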

SLIDE 20

Time-Share Allocation for Mozilla

[Figure: normalized reliability improvement, energy savings, and ERP vs. the fraction of each idle period used for scrubbing (Mozilla trace)]

SLIDE 21

Time-Share Allocation in Xemacs

ERP captures a good trade-off point between energy savings and reliability improvement

[Figure: normalized reliability improvement, energy savings, and ERP vs. the fraction of each idle period used for scrubbing (xemacs trace)]

SLIDE 22

Applying ERP

  • Dividing each idle period between the two tasks is impractical
    – Idle-period duration is not known in advance
    – Spin-down/up overheads
  • Instead, use each idle period for only one task, scrubbing or spinning down
    – We evaluate three such schemes (a sketch of the third follows):
      • Two-phase allocation
      • Scrub only in small idle periods
      • Alternate allocation
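
Of the three schemes, alternate allocation is the simplest to sketch: whole idle periods are assigned alternately to scrubbing or to spinning down. A hypothetical rendering, not the evaluated implementation:

```python
# Hypothetical sketch of alternate allocation: each idle period is used
# entirely for one task, alternating between scrubbing and spinning down.

def alternate_allocation(idle_periods):
    """idle_periods: list of idle durations in seconds."""
    scrub_time = spun_down_time = 0.0
    scrub_next = True
    for idle in idle_periods:
        if scrub_next:
            scrub_time += idle
        else:
            spun_down_time += idle
        scrub_next = not scrub_next
    return scrub_time, spun_down_time

print(alternate_allocation([3.0, 10.0, 1.5, 40.0, 7.0]))
```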

SLIDE 23

Result: Alternate Allocation

[Figure: energy savings, reliability, and ERP under alternate allocation for mozilla, mplayer, impress, writer, calc, and xemacs]

SLIDE 24

ERP in Timeout-based Approach

  • Information about future I/Os is not known a priori
  • Use a timeout-based approach
    – Penalty if another access arrives right after spin-down
    – Timeout periods before spin-down are otherwise wasted
  • The timeout window can be used for scrubbing
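
A hedged sketch of how the timeout window can be reused: the disk must stay spinning until the timeout expires anyway, so that time is given to scrubbing, and only the remainder of long idle periods is spent spun down.

```python
# Hypothetical sketch: during the spin-down timeout window the disk stays
# spinning anyway, so that time is donated to scrubbing; only idle periods that
# outlast the timeout yield any spun-down (energy-saving) time.

def timeout_allocation(idle_periods, timeout):
    scrub_time = spun_down_time = 0.0
    for idle in idle_periods:
        scrub_time += min(idle, timeout)
        if idle > timeout:
            spun_down_time += idle - timeout
    return scrub_time, spun_down_time

print(timeout_allocation([3.0, 10.0, 1.5, 40.0, 7.0], timeout=5.0))
```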

SLIDE 25

Timeout-based Allocation

The small contribution to reliability makes this approach impractical

SLIDE 26

Thoughts on ERP

  • ERP is an intuitive metric for capturing the combined effect of disk scrubbing and spinning down disks to save energy
  • ERP can be successfully applied to compare approaches that mix scrubbing and spinning down
  • Future Work
    – Develop a reliability model for IRAW
    – Validate ERP with other workloads
    – Extend our model with multi-speed disks

SLIDE 27

Role of Storage Errors in HPC Centers

  • Problem: Large storage systems are error prone
  • Solution 1: Improve redundancy, add/replace disks
    – Costly, especially for high-speed scratch storage systems
    – Mired in acquisition issues and red tape
  • Solution 2: Reduce the duration of usage
    – Adds software complexity
  • We opt for reducing the duration of HPC scratch usage

SLIDE 28

HPC Center Data Offload Problem

  • Offloading entails moving large data between the center and end-user resources
  • Failure prone: end-resource unavailability, transfer errors
    → Offloading errors affect supercomputer serviceability
  • Delayed offloading is highly undesirable
    – From a center standpoint:
      • Wastes scratch space
      • Renders result data vulnerable to purging
    – From a user job standpoint:
      • Increased turnaround time if part of the job workflow depends on offloaded data
      • Potential resubmits due to purging

Upshot: Timely offloading can help improve center performance

  • HPC acquisition solicitations are asking for stringent uptime and resubmission rates (NSF06-573, …)

SLIDE 29

Current Methods to Offload Data

  • Home-grown solutions
    – Every center has its own
  • Utilize point-to-point (direct) transfer tools:
    – GridFTP
    – HSI
    – scp

SLIDE 30

Limitations of Direct Transfers

  • Require end resources to be available
  • Do not exploit orthogonal bandwidth
  • Do not consider SLAs or purge deadlines

Not an ideal solution for data-offloading


SLIDE 31

A Decentralized Data-Offloading Service*

  • Utilizes an army of intermediate storage locations
  • Exploits nearby nodes for moving data
  • Supports multi-hop data migration to end users
  • Decouples offloading from end-user availability
  • Integrates with real-world tools
    – Portable Batch System (PBS)
    – BitTorrent
  • Provides multiple fault-tolerant data flow paths from the center to end users

* Timely Offloading of Result-Data in HPC Centers, Monti, Butt, Vazhkudai, ICS’08

SLIDE 32

[Figure] Transfer limited by end-user available bandwidth; delayed transfer & storage failures may result in loss of data!

SLIDE 33

Addresses many of the problems of point-to-point transfers

SLIDE 34

Challenges Faced in Our Approach

1. Discovering intermediate nodes
2. Providing incentives to participate
3. Addressing insufficient participants
4. Adapting to dynamic network behavior
5. Ensuring data reliability and availability
6. Meeting SLAs during the offload process

SLIDE 35

1. Intermediate Node Discovery

  • Utilize the DHT abstraction provided by structured p2p networks
  • Nodes advertise their availability to others
  • Receiving nodes discover the advertiser
  • Discovered nodes are utilized as necessary

[Figure: nodes arranged on the p2p identifier space, 0 to 2^128 - 1]
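
One way to picture the discovery step (a hypothetical sketch; the real service sits on a structured p2p overlay, which is stubbed out here with an in-memory put/get table): an available node publishes an advertisement under a well-known key in the identifier space, and the center looks those advertisements up when it needs intermediaries.

```python
# Hypothetical illustration of DHT-based discovery; a real deployment would use
# a structured p2p overlay, stubbed out here by an in-memory dict with put/get.
import hashlib

class ToyDHT:
    def __init__(self):
        self.store = {}                      # 128-bit key -> list of values

    def key(self, name):
        # hash the name onto the 128-bit identifier space
        return int.from_bytes(hashlib.md5(name.encode()).digest(), "big")

    def put(self, name, value):
        self.store.setdefault(self.key(name), []).append(value)

    def get(self, name):
        return self.store.get(self.key(name), [])

dht = ToyDHT()
# Intermediate nodes advertise their availability and spare capacity.
dht.put("offload/available", {"node": "node1.Site1", "free_gb": 50})
dht.put("offload/available", {"node": "nodeN.SiteN", "free_gb": 30})
# The center discovers the advertisers and uses them as needed.
print(dht.get("offload/available"))
```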

SLIDE 36

2. Incentives to Participate in the Offload Process

  • Modern HPC jobs are often collaborative
    – “Virtual Organizations”: sets of geographically distributed users from different sites
    – Jobs in TeraGrid usually come from such organizations
  • Resource bartering among participants to facilitate each other’s offloads over time

  • Nodes specified and trusted by the user
SLIDE 37

3. Addressing Insufficient Participants

  • Problem: Sufficient participants are not available
  • Solution: Use Landmark Nodes
    – Nodes that are stable, available, and willing to store data
    – Leverage out-of-band agreements
      • Other researchers who are also interested in the data
      • Data warehouses
    – A cheaper option than storing at the HPC center
  • Note: Landmark Nodes are used as a safety net!

SLIDE 38

4. Adapting Data Distribution to Dynamic Network Behavior

  • Available bandwidth can change
    – Distributing data randomly may not be effective
  • Utilize network monitoring: Network Weather Service (NWS)
    – Provides bandwidth measurements
    – Predicts future bandwidth
  • Choose dynamically changing data paths (see the sketch below)
    – Select enough nodes to satisfy a given SLA
    – Monitor and update the selected nodes

[Figure: candidate paths with measured bandwidths of 10, 5, 4, and 1 Mb/s]
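
A hedged sketch of the selection step: given NWS-style bandwidth predictions, choose just enough intermediate nodes to absorb the result-data before the SLA deadline, and re-run the selection whenever the predictions change. The function and parameter names are assumptions, not the paper's API.

```python
# Hypothetical sketch: choose enough intermediate nodes, by predicted bandwidth,
# to push `data_mb` off the center within `sla_seconds`. The bandwidth numbers
# would come from Network Weather Service (NWS) measurements/predictions.

def select_nodes(predicted_mbps, data_mb, sla_seconds):
    """predicted_mbps: dict of node -> predicted bandwidth in MB/s."""
    chosen, capacity_mb = [], 0.0
    for node, bw in sorted(predicted_mbps.items(), key=lambda kv: -kv[1]):
        chosen.append(node)
        capacity_mb += bw * sla_seconds     # data this node can absorb in time
        if capacity_mb >= data_mb:
            return chosen
    return None                             # cannot meet the SLA; fall back

print(select_nodes({"A": 10.0, "B": 5.0, "C": 4.0, "D": 1.0},
                   data_mb=2100, sla_seconds=600))
```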

SLIDE 39

5. Protecting Data from Intermediate Storage Location Failure

  • Problem: Node failure may cause data loss
  • Solution:
    1. Use data replication
       – Achieved through multiple data flow paths
    2. Employ erasure coding
       – Can be done at the center or at intermediates
       – End user may pay for coding at the center
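
A small sketch contrasting the two protections under assumed parameters: with plain replication the data survives as long as one full copy avoids the failed nodes, while with (k, n) erasure coding any k of the n fragments suffice.

```python
# Hypothetical sketch of the two protection options: data survives replication
# if at least one full copy avoids the failed nodes, and survives (k, n)
# erasure coding if at least k of the n fragments remain.

def replication_survives(replica_nodes, failed):
    # replica_nodes: list of node sets, one set per full copy of the data
    return any(not (nodes & failed) for nodes in replica_nodes)

def erasure_survives(fragment_nodes, k, failed):
    # fragment_nodes: one node per coded fragment; any k fragments reconstruct
    return sum(1 for n in fragment_nodes if n not in failed) >= k

failed = {"n2", "n5"}
print(replication_survives([{"n1", "n2"}, {"n3", "n4"}], failed))                   # True
print(erasure_survives(["n1", "n2", "n3", "n4", "n5", "n6"], k=4, failed=failed))   # True
```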

SLIDE 40

6. Managing SLAs during Offload

  • Use NWS to measure available bandwidths
    – Use direct transfer if it can meet the SLA
    – Otherwise, perform a decentralized/staged offload
  • In case the end host fails or cannot meet the SLA
    – Utilize the decentralized offload approach

T_offload < min(D_purge, J_SLA)
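
The condition above reads as a simple decision rule; a hypothetical sketch (the estimate names are mine, not the system's):

```python
# Hypothetical decision rule for T_offload < min(D_purge, J_SLA): prefer the
# direct transfer when its estimated time fits, otherwise fall back to the
# decentralized/staged offload path.

def choose_offload(t_direct_est, t_staged_est, d_purge, j_sla):
    deadline = min(d_purge, j_sla)
    if t_direct_est < deadline:
        return "direct"
    if t_staged_est < deadline:
        return "staged"
    return "cannot meet SLA"

print(choose_offload(t_direct_est=5834, t_staged_est=570, d_purge=3600, j_sla=7200))
```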

SLIDE 41

Integrating Staged Offload with PBS

  • Provide new PBS directives
    – Specify destination, intermediate nodes, and deadline

    #PBS -N myjob
    #PBS -l nodes=128, walltime=12:00
    mpirun -np 128 ~/MyComputation
    #Stageout Output DestinationSite
    #InterNode node1.Site1:49665:50GB
    ...
    #InterNode nodeN.SiteN:49665:30GB
    #Deadline 1/14/2007:12:00

SLIDE 42

Adapting BitTorrent Functionality to Data Offloading

  • Tailor BitTorrent to meet the needs of offloading
  • Restrict the amount of result-data sent to a peer
    – Peers with less storage than the result-data size can still be utilized
  • Incorporate global information into peer selection (see the sketch below)
    – Use NWS bandwidth measurements
    – Use knowledge of node capacity from PBS scripts
    – Choose the appropriate nodes with storage capacity
  • Recipients are not necessarily end hosts
    – They may simply pass data onward
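
A hypothetical sketch of the tailored peer-selection step: rank candidate peers by NWS-measured bandwidth, skip peers whose declared storage (from the PBS script) is exhausted, and cap how much result-data any one peer holds.

```python
# Hypothetical sketch of the tailored peer-selection step: rank peers by
# NWS-measured bandwidth, skip peers with no remaining declared storage, and
# limit how many chunks any single peer is asked to hold.

def pick_peer(peers, chunk_mb):
    """peers: list of dicts like {"name": ..., "mbps": ..., "free_mb": ...}."""
    usable = [p for p in peers if p["free_mb"] >= chunk_mb]
    if not usable:
        return None
    best = max(usable, key=lambda p: p["mbps"])
    best["free_mb"] -= chunk_mb             # a peer need not hold everything
    return best["name"]

peers = [{"name": "node1", "mbps": 10, "free_mb": 64},
         {"name": "node2", "mbps": 5,  "free_mb": 512}]
print([pick_peer(peers, chunk_mb=64) for _ in range(3)])   # node1, node2, node2
```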

SLIDE 43

Putting it all Together

[Figure: system architecture; components include a Node Manager, Offload Manager, Erasure Coding, SLA Compliance, NWS Query, and Transfer Module; result-data is split into chunks and sent from the center to nodes from the overlay, subject to the SLA]

SLIDE 44

Evaluation Objectives

1. Compare with direct transfer and BitTorrent
2. Observe how the system reacts to failures and bandwidth fluctuations:
   a. How are SLAs enforced?
   b. How is fault tolerance achieved?
3. Validate our method as a viable alternative to other offloading methods
SLIDE 45

Evaluation: Experimental Setup

  • PlanetLab test bed
  • 22 PlanetLab nodes: center + end user + 20 intermediate nodes
  • Experiments:
    – Compare the proposed method with
      • Point-to-point transfer (scp)
      • Standard BitTorrent
    – Observe the effect of bandwidth changes

SLIDE 46

Results: Data Transfer Times with Respect to Direct Transfer

Times are in seconds

  File Size   100 MB   240 MB   500 MB   2.1 GB
  Direct         286      727     1443     5834
  Offload         38       95      169      570
  Push            82      179      349     1123
  Pull            29       93      202      562

A staged offload is capable of significantly improving offload times

SLIDE 47

Results: Data Transfer Times with Respect to Standard BitTorrent

Times are in seconds. Transferring a 2.1 GB file.

  Phase                                     BitTorrent   Our Method
  Send one copy from center (Offload)             1172          570
  Send to all intermediate nodes (Push)           1593         1123
  Submission site download (Pull)                  571          562

Monitoring-based offload is capable of outperforming standard BitTorrent

SLIDE 48

Results: Adapting to Dynamic Network Behavior

SLA is 600 seconds. Transferring a 2.1 GB file.

[Figure: available bandwidth at each node (MB/s) vs. time (s); at 10 s the direct bandwidth is reduced to 1/10, at 150 s one node’s bandwidth drops to 1 MB/s, and at 250 s a node fails]

A staged offload is capable of adapting to bandwidth changes or failures

SLIDE 49

Results: Replication vs. Erasure Coding

[Figure: available data (%) vs. number of failed nodes, for erasure coding with 2 copies, erasure coding with 1 copy, no encoding with 2 copies, and no encoding with 1 copy; a 2.1 GB file was transferred, with 10 nodes failed at random during the transfer]

A staged offload can protect data even when many nodes fail

SLIDE 50

Thoughts on Eager Offloading

  • A fresh look at offloading
    – Decentralized approach
    – Monitoring-based adaptation
  • Considers SLAs and purge policies
  • Integrated with real-world tools
  • Provides high reliability for data
  • Outperforms direct transfer by 90.2% in our experiments

SLIDE 51

Some projects we are involved in

  • Enabling high-performance I/O for asymmetric multi-core systems (GPUs, PS3, …)

  • Simulation/capacity planning tools for cloud computing
  • Advanced data caching and prefetching
  • Just-in-time Data Staging
  • Managing HPC center scratch space as a hierarchical cache
  • I/O shaping for HPC applications
  • Hybrid disk modeling
  • Real-time data processing for advanced nano-bionics
  • On the web: http://research.cs.vt.edu/dssl
  • Contact email: butta@cs.vt.edu
