SLIDE 1

Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery

Zhe Zhang, Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Gregory G. Pike, John W. Cobb, Frank Mueller NC State University & Oak Ridge National Laboratory

SLIDE 2

Problem Space: Petascale Storage Challenge

  • Unique storage challenges in scaling to PF scale

− 1000s of I/O nodes; 100K–1M disks; failure is the norm, not the exception!
− Data availability affects HPC center serviceability

  • Storage failures: a significant contributor to system downtime

− Macroscopic view
− Microscopic view (from both commercial and HPC centers)

  • In a year:

− 3% to 7% of disks fail; 3% to 16% of controllers; up to 12% of SAN switches
− 8.5% of a million disks have latent sector faults

  • 10 times the expected rates specified by disk vendors

System          # CPUs   MTBF/I           Outage Source
ASCI Q          8,192    6.5 hrs          Storage, CPU
ASCI White      8,192    40 hrs           Storage, CPU
NLCF (Jaguar)   23,452   37.5 hrs         Storage, mem
Google          15,000   20 reboots/day   Storage, mem

SLIDE 3

Data Availability Issues in Users' Workflow

  • Supercomputer service availability also affected by data staging and offloading errors

  • With existing job workflows

− Manual staging

  • Error-prone
  • Early staging and late offloading waste scratch space
  • Delayed offloading renders result data vulnerable

− Scripted staging

  • Compute time wasted on staging at beginning of job
  • Expensive
  • Observations

− Supercomputer storage systems host transient job data
− Currently, data operations not coordinated with job scheduling

SLIDES 4–7

Solution

  • Novel ways to manage how transient job data is:

− Scheduled and recovered

  • Coordinating data staging with job scheduling

− Enhanced PBS script and Moab scheduling system

  • On-demand, transparent data reconstruction to address transient job input data availability

− Extended Lustre parallel file system

  • Results:

− From the center's standpoint:

  • Optimized global resource usage
  • Increased data and service availability

− From a user job's standpoint:

  • Reduced job turnaround time
  • Scripted staging without charges
SLIDE 8

Coordination of Data Operations and Computation

  • Treat data transfers as “data jobs”

− Scheduling and management

  • Set up a zero-charge data queue

− Ability to account and charge if necessary

  • Decomposition of stage-in, stage-out and compute jobs
  • Planning

− Dependency setup and submission

[Diagram: the Planner on the head node parses the job script into three dependent jobs (1. stage data; 2. compute job, 2 after 1; 3. offload data, 3 after 2), submitted to the zero-charge data queue and the regular job queue and targeting the I/O nodes and compute nodes]
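
The dependency chain "2 after 1, 3 after 2" can be expressed with standard PBS job dependencies. Below is a minimal sketch of the planner's submission step, assuming the decomposed scripts stagein.pbs, compute.pbs and stageout.pbs shown on the next slides; the queue names dataq and batch are illustrative assumptions:

#!/bin/sh
# Sketch of the planner's submission step: three decomposed jobs,
# chained with PBS dependencies (2 after 1, 3 after 2).
# Queue names "dataq" (zero-charge) and "batch" are assumptions.
STAGEIN_ID=$(qsub -q dataq stagein.pbs)                                  # 1. stage data
COMPUTE_ID=$(qsub -q batch -W depend=afterok:$STAGEIN_ID compute.pbs)    # 2. compute job, after 1
qsub -q dataq -W depend=afterany:$COMPUTE_ID stageout.pbs                # 3. offload data, after 2

Using afterany for the stage-out lets result offloading proceed regardless of the compute job's exit status; afterok would also be a defensible choice.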

SLIDES 9–12

Instrumenting the Job Script

  • Example of an enhanced PBS job script, with the new #STAGEIN and #STAGEOUT directives:

#PBS -N myjob
#PBS -l nodes=128,walltime=12:00
#STAGEIN any parameters here
#STAGEIN -retry 2
#STAGEIN hpss://host.gov/input_file /scratch/dest_file
mpirun -np 128 ~/programs/myapp
#STAGEOUT any parameters here
#STAGEOUT scp /scratch/user/output/ user@destination

  • The planner splits the script into three pieces: the #STAGEIN directives become stagein.pbs, the compute body becomes compute.pbs, and the #STAGEOUT directives become stageout.pbs (see the sketch below)
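
For concreteness, a sketch of the three scripts the planner might emit, assuming a straightforward split of the directives above:

stagein.pbs:
#STAGEIN any parameters here
#STAGEIN -retry 2
#STAGEIN hpss://host.gov/input_file /scratch/dest_file

compute.pbs:
#PBS -N myjob
#PBS -l nodes=128,walltime=12:00
mpirun -np 128 ~/programs/myapp

stageout.pbs:
#STAGEOUT any parameters here
#STAGEOUT scp /scratch/user/output/ user@destination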

SLIDES 13–15

On-demand, Transparent Data Recovery

  • Ensuring availability of automatically staged data

− Against storage failures between staging and job dispatch
− Standard availability techniques (RAID) not enough

  • Recovery from staging sources

− Job input data is transient on the supercomputer, with an immutable primary copy elsewhere

  • Natural data redundancy for staged data

− Network costs dropping drastically each year
− Better bulk transfer tools with support for partial data fetches

  • Novel mechanisms to address "transient data availability"

− Augmenting FS metadata with "recovery info"

  • Again, automatically extracted from the job script

− Periodic file availability checking for queued jobs (see the sketch below)
− On-the-fly data reconstruction from the staging source
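
A minimal sketch of the periodic availability check, under stated assumptions: the helpers queued_jobs_staged_files, has_failed_stripes and reconstruct_file are hypothetical stand-ins for the scheduler hook and the Lustre-side mechanisms described on the following slides:

#!/bin/sh
# Sketch: periodically verify that the staged input of every queued
# job is still readable, and trigger on-the-fly reconstruction when
# a stripe has been lost. All helper commands are placeholders.
while true; do
    for f in $(queued_jobs_staged_files); do   # staged inputs of queued jobs
        if has_failed_stripes "$f"; then       # parallel check of the storage targets
            reconstruct_file "$f" &            # patch from the staging source
        fi
    done
    sleep 300                                  # check interval: an assumption
done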

SLIDE 16

Augmenting File System Metadata

  • Metadata extracted from job script

− “source” and “sink” URIs recorded with staged files

  • Implementation: Lustre parallel file system

− Utilizing the file extended attribute (EA) mechanism
− New "recov" EA at the metadata server

  • Less than 64 bytes per file
  • Minimal communication costs

− Additional Lustre commands

  • lfs setrecov
  • lfs getrecov
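
A sketch of how the two commands might be invoked; the slides name the commands but not their argument syntax, so the forms below (file path plus source URI) are assumptions:

# Assumed invocation syntax for the new Lustre commands.
lfs setrecov /scratch/dest_file hpss://host.gov/input_file   # record the source URI in the "recov" EA
lfs getrecov /scratch/dest_file                              # read back the recovery info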
SLIDES 17–21

Failure Detection & File Reconstruction

  • Periodic failure detection

− Parallel checking of the storage units upon which a dataset is striped

  • Reconstruction of a file striped over st1, st2, st3 (with st2 failed):

− 1. Head node retrieves the recovery info (hpss://host.gov/foo) from the MDS
− 2. MDS substitutes a spare storage target (OST6) for the failed one in the stripe metadata
− 3. Head node fetches the missing byte ranges, (1M~2M), (4M~5M) and (7M~8M), from the remote source
− 4. The patched data is written to the replacement target, OST6

(The ranges correspond to the extents of the failed stripe for a 1 MB stripe size over three storage targets; a sketch of the patching step follows.)
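
A minimal sketch of the patching step, under stated assumptions: 1 MB stripe size over three targets, failed stripe index 1, a 9 MB file, and a fetch_range helper standing in for whatever partial-fetch transfer tool the staging source supports:

#!/bin/sh
# Sketch: patch the extents that lived on the failed target back into
# the file. Stripe geometry and file size are assumptions; fetch_range
# is a placeholder for a partial-fetch transfer tool (HSI, GridFTP, ...).
SRC=hpss://host.gov/foo              # recovery info, as stored in the "recov" EA
DEST=/scratch/foo
STRIPE_SIZE=$((1024 * 1024))         # 1 MB
STRIPE_COUNT=3
FAILED_IDX=1                         # the stripe column that was on st2
FILE_SIZE=$((9 * STRIPE_SIZE))

fetch_range() {                      # placeholder: emit $3 bytes of $1 at offset $2
    echo "would fetch $1 offset=$2 len=$3" >&2
}

off=$((FAILED_IDX * STRIPE_SIZE))
while [ "$off" -lt "$FILE_SIZE" ]; do
    # Writing through the file system lands the data on the
    # replacement target named in the updated stripe metadata.
    fetch_range "$SRC" "$off" "$STRIPE_SIZE" |
        dd of="$DEST" bs=1 seek="$off" conv=notrunc 2>/dev/null
    off=$((off + STRIPE_COUNT * STRIPE_SIZE))    # 1M, 4M, 7M
done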

SLIDE 22

Putting it all together…

SLIDE 23

Performance - Overview

  • Part I: Cost of reconstruction with our method

− Real systems
− Running our prototype on a real cluster and data sources
− Testing the cost of each step of our reconstruction
− Using different system configurations and tasks

  • Part II:

− Trace-driven simulations
− Taking the results of Part I as parameters
− Using real system failure and job submission traces
− Simulating real HPC centers
− Considering both average performance and fairness

SLIDE 24

Reconstruction Testbed

  • A cluster with 40 nodes at ORNL

− 2.0 GHz Intel P4 CPU
− 768 MB memory
− 10/100 Mb Ethernet
− FC4 Linux, 2.6.12.6 kernel
− 32 data servers, 1 metadata server, 1 client (also the head node)

  • Data sources

− NFS server at ORNL (Local NFS)
− NFS server at NCSU (Remote NFS)
− GridFTP server with a PVFS file system at ORNL (GridFTP)

[Diagram: the testbed reaches the ORNL NFS and PVFS servers over the intranet, and the NCSU NFS server over the Internet]

SLIDE 25

Performance - Reconstruction

  • Finding the failed server

[Chart: time to find the failed server]

SLIDES 26–28

Performance - Reconstruction

  • Patching the lost data

[Charts: patching time from each data source in turn: Local NFS, Remote NFS, and GridFTP]

SLIDE 29

Simulation Setup

  • Operational data from Los Alamos National Laboratory (http://institutes.lanl.gov/data/fdata)

− System 20, with 512 nodes, 4 CPUs/node

  • Node failure trace

− 2,049 failure records over 1,349 days

  • Job submission trace

− 489,376 job submission and completion records over 1,073 days

  • Coupling failure & job traces

− Calculated failure rates and repair times, and generated I/O node failure events

  • Obtained scratch logs and file statistics from ORNL NLCF to create input files and staging operations

SLIDE 30

Performance – Scheduling Simulation

[Charts: mean wait time of jobs; standard deviation of job wait time]

  • Performance degrades with larger stripe counts without reconstruction
  • Performance with reconstruction is close to the "no failure" case

SLIDE 31

Related Work: Coordination

  • Coordinating data and job scheduling

− Stork, Condor and DAGMan: used to schedule data and computation together in Grid environments
− Condor and SRM: used to schedule jobs where data is available
− Simulation studies in Grids suggest data-aware scheduling improves job response time
− Framed as part of an application workflow rather than as a set of HPC-center-integrated services

  • BAD-FS

− A "file system" for I/O-intensive batch jobs on remote clusters
− Exposes distributed file system decisions to an external, workload-aware scheduler

  • IBP and Kangaroo:

− Address the scratch space purging problem by timely offloading of results
− Do not address the scheduling or coupling of this activity alongside computation

  • Moab has similar goals and allows staging specification

− However, it is not fault-tolerant
− Does not support offloading and is not cheap!

SLIDE 32

Related Work: Storage System Availability

  • Standard data availability techniques are designed with persistent data in mind

− Multiple disk failures within a RAID group can be crippling
− I/O node failover not always possible (thousands of nodes)
− Replication consumes extra scratch space, an expensive commodity

  • We address the availability of transient job input data!
SLIDE 33

Conclusion and Future Work

  • In summary

− Novel ways to schedule and recover transient data
− Coordination between data movement and computation

  • Modification of a production job scheduler (deployed @ ORNL)

− On-demand recovery techniques for the data availability issue

  • Extension of Lustre: transparent replacement of failed OSTs

  • Next steps

− Online recovery
− Result data offloading

SLIDE 34

Questions?

This work is sponsored by:

  • U.S. Department of Energy Contracts

− DE-AC05-00OR22725
− DE-FG02-05ER25685

  • NSF Contract

− CCF-0621470

Project websites:

NCSU: http://research.csc.ncsu.edu/palm/
ORNL: http://www.csm.ornl.gov/~vazhkuda/Storage.html