Automatic and Coordinated Job Recovery for High Performance Computing
Wei Tang (1), Zhiling Lan (1), Narayan Desai (2), and Daniel Buettner (2)
(1) Illinois Institute of Technology and (2) Argonne National Laboratory
Nov 15, 2010

Outline
- Motivation
- System Design
- Implementation
- Evaluations

System Failure and Fault Tolerance
- System failures are increasingly common as the scale of supercomputers grows.
- Fault tolerance schemes have been proposed continuously:
  - Redundancy and replication
  - Checkpoint/restart
  - Failure prediction + process migration
  - Failure prediction + fault-aware job scheduling
- Most existing fault tolerance schemes focus on pre-failure avoidance, though post-failure handling is equally important.

Resource management system
- Functionality:
  - Manages the processing load
  - Prevents jobs from competing with each other for limited compute resources
- Two parts:
  - Resource manager: maintains resources, e.g., job queues, computing nodes, etc.
  - Job scheduler: makes scheduling decisions, i.e., when and where to run a job.
- Examples: PBS (Altair), Moab (Adaptive Computing), LSF (Platform), LoadLeveler (IBM), Cobalt (ANL)

Motivation
- Fault-tolerance aspect:
  - Precautionary fault avoidance does not suffice, because failures are inevitable.
  - Post-failure recovery is important, but little existing work addresses it.
- Resource management aspect:
  - Resource managers assume that jobs will run to completion, so they offer little support for post-failure handling.
  - Because resources are limited, failed jobs should be treated differently according to their importance or priority.

Our approach
- AuCoRe: Automatic and Coordinated job Recovery
- Extends the resource management system to support post-failure handling
- AuCoRe automatically resubmits failed jobs in a systematic manner by
  - treating failed jobs with different recovery priorities
  - coordinating failed-job recovery with the queuing of regular jobs.

Design diagram
Figure: Diagram of AuCoRe.
- Users are allowed to specify their job recovery options in job submission scripts or commands.
- Jobs are maintained in three groups, namely the waiting job queue, the running job list, and the failed job queue.
- A recovery manager enables automatic and coordinated job recovery and supports an incentive management mechanism.

Recovery Options
- Users specify a recovery option in the job submission script.
- Suggested options:
  - Option A: notify only
  - Option B: resubmit to the rear of the queue
  - Option C: restart the job on its original nodes once they are repaired
  - Option D: insert the job in the middle of the queue
  - Option E: resubmit to the head of the queue

Coordinated recovery
Figure: Treatments for failed jobs with different recovery options. Option-A jobs are set aside to await manual resubmission; option-C jobs are suspended until their computing nodes are repaired; jobs with options B, D, and E are resubmitted to different parts of the waiting job queue.
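The treatments in the figure caption can be sketched as a small dispatch routine. This is a minimal illustration and not AuCoRe's or Cobalt's actual code: the function name `place_failed_job`, the plain Python lists standing in for the queues, and the choice of `len(waiting) // 2` as "the middle of the queue" are all assumptions made for the example.

```python
def place_failed_job(job, option, waiting, suspended, notified):
    """Route a failed job according to its recovery option (A-E).

    waiting   -- list used as the waiting job queue (index 0 is the head)
    suspended -- option-C jobs held until their original nodes are repaired
    notified  -- option-A jobs left for the user to resubmit manually
    All three containers are hypothetical stand-ins for the real queues.
    """
    if option == "A":
        notified.append(job)                    # notify only
    elif option == "B":
        waiting.append(job)                     # rear of the queue
    elif option == "C":
        suspended.append(job)                   # wait for node repair
    elif option == "D":
        waiting.insert(len(waiting) // 2, job)  # middle (assumed definition)
    elif option == "E":
        waiting.insert(0, job)                  # head of the queue
    else:
        raise ValueError(f"unknown recovery option: {option!r}")


if __name__ == "__main__":
    waiting = ["w1", "w2", "w3", "w4"]
    suspended, notified = [], []
    for job, option in [("fA", "A"), ("fB", "B"), ("fC", "C"),
                        ("fD", "D"), ("fE", "E")]:
        place_failed_job(job, option, waiting, suspended, notified)
    print(waiting, suspended, notified)
```

A real scheduler would also re-evaluate the suspended list whenever nodes return to service; that step is omitted here.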

Incentive management
- User behavior is hard to manage:
  - Ignoring the recovery option
  - Gaming the system by always specifying high-priority options
- Incentive mechanism:
  - Users pay for each recovery option with (virtual) credits at job submission.
  - A higher recovery priority costs more credits.
  - Credits are prepaid and are not returned even if no failure occurs (like insurance).
  - The lowest option is the default if none is specified.

Incentive mechanism
- Pricing: $C = \alpha_i \times T \times N$
  - $C$: the cost for a job with recovery option $i$
  - $\alpha_i$: the unit price of recovery option $i$
  - $T$: the job's running time (in hours)
  - $N$: the number of computing nodes used by the job
- User recovery account: $S = \beta \times T \times N$
  - $S$: the credits assigned to a user each time the user submits a job
  - $\beta$: a parameter set by the system owner, usually the median unit price ($P_m$)
- Charging: $B = (\alpha_i - \beta) \times T \times N$
  - $B$: the actual charge for a job
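The three formulas can be checked with a short calculation. The sketch below is illustrative only: the unit prices in `UNIT_PRICE` and the value of `BETA` are made-up numbers (the slides do not give concrete prices), and the function names are hypothetical.

```python
# Hypothetical unit prices (credits per node-hour) for options A-E.
# These values are assumptions for illustration, not taken from the slides.
UNIT_PRICE = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
BETA = 3  # assumed median unit price P_m of the table above


def recovery_cost(alpha_i, hours, nodes):
    """C = alpha_i * T * N: credits paid for the chosen recovery option."""
    return alpha_i * hours * nodes


def recovery_allowance(beta, hours, nodes):
    """S = beta * T * N: credits granted to the user at job submission."""
    return beta * hours * nodes


def actual_charge(alpha_i, beta, hours, nodes):
    """B = (alpha_i - beta) * T * N, which equals C - S."""
    return (alpha_i - beta) * hours * nodes


if __name__ == "__main__":
    hours, nodes = 2, 512
    for option, alpha in UNIT_PRICE.items():
        c = recovery_cost(alpha, hours, nodes)
        s = recovery_allowance(BETA, hours, nodes)
        b = actual_charge(alpha, BETA, hours, nodes)
        print(f"option {option}: C={c} S={s} B={b}")
```

With these assumed prices, an option priced above the median yields a positive charge B and one priced below it yields a negative B, meaning the user's account gains credits on that job.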

Implementation
Figure: AuCoRe implementation with Cobalt, a production resource management system developed by Argonne National Laboratory.

Evaluation
- Event-driven simulation using Qsim, a job scheduling simulator that works with the Cobalt resource manager
- Uses a real job trace from the Blue Gene/P system at Argonne National Laboratory
- Uses synthetic failure events that follow a Weibull distribution
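One simple way to produce synthetic failure events of this kind is to draw inter-failure times from a Weibull distribution and accumulate them. The sketch below is not Qsim's generator; the function name and the scale, shape, and horizon values are assumptions chosen for illustration, since the slides do not state the parameters used.

```python
import random


def synthetic_failure_times(scale_hours, shape, horizon_hours, seed=0):
    """Return failure timestamps (in hours) within [0, horizon_hours).

    Inter-failure gaps are drawn from a Weibull distribution using the
    standard-library call random.weibullvariate(alpha, beta), where alpha
    is the scale parameter and beta is the shape parameter.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    times = []
    t = rng.weibullvariate(scale_hours, shape)
    while t < horizon_hours:
        times.append(t)
        t += rng.weibullvariate(scale_hours, shape)
    return times


if __name__ == "__main__":
    # Assumed values: scale of 24 hours, shape 0.7, one simulated week.
    events = synthetic_failure_times(24.0, 0.7, 7 * 24.0)
    print(len(events), "failures in the simulated week")
```

A shape parameter below 1 produces a decreasing hazard rate, so failures tend to cluster; whether that matches the evaluated system is not stated in the slides.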

Simulation cases
Cases                     Notation   Description
W/O AuCoRe                FF         failure-free
W/O AuCoRe                MR         failure-present, manual resubmit
W/ AuCoRe (multi-opt)     Even       option proportion is 1:1:1:1:1
W/ AuCoRe (multi-opt)     Normal     option proportion is 1:2:4:2:1
W/ AuCoRe (single-opt)    All-B      all with option B
W/ AuCoRe (single-opt)    All-C      all with option C
W/ AuCoRe (single-opt)    All-D      all with option D
W/ AuCoRe (single-opt)    All-E      all with option E

Evaluation metrics
- Response time (RESP)
  - A job's response time is the time from the job's submission to its completion.
  - Averaged over all jobs.
- Failure slowdown (FSD)
  - The ratio of the time delay caused by failure to the failure-free job execution time.
  - Averaged over failed jobs.
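The two metrics can be written down directly from their definitions. The sketch below assumes a hypothetical per-job record (`JobRecord`) in which the delay caused by failure has already been measured; how that delay is obtained from the simulator is not specified in the slides and is left out here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """Hypothetical per-job record; field names are assumptions."""
    submit: float               # submission time
    end: float                  # completion time
    runtime: float              # failure-free execution time
    failed: bool = False        # True if the job was hit by a failure
    failure_delay: float = 0.0  # extra time caused by the failure


def average_response_time(jobs):
    """RESP: mean of (completion - submission) over all jobs."""
    return mean(j.end - j.submit for j in jobs)


def average_failure_slowdown(jobs):
    """FSD: mean of (failure delay / failure-free runtime) over failed jobs.

    Returns 0.0 when no job failed (a convention chosen for this sketch).
    """
    ratios = [j.failure_delay / j.runtime for j in jobs if j.failed]
    return mean(ratios) if ratios else 0.0


if __name__ == "__main__":
    jobs = [
        JobRecord(submit=0.0, end=10.0, runtime=5.0),
        JobRecord(submit=0.0, end=30.0, runtime=10.0,
                  failed=True, failure_delay=20.0),
    ]
    print(average_response_time(jobs), average_failure_slowdown(jobs))
```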

Baseline simulations
Figure: Baseline.

Comparison
Figure: Comparing multi-option cases with single-option ones. The X-axis represents the job groups categorized by their recovery options.

Multi-option vs single-option
Figure: Comparing multi-option cases with single-option ones. The X-axis represents the job groups categorized by their recovery options.

Performance under different MTTR
Figure: Performance under different MTTR.

Performance under different system MTBF
Figure: Performance under different system MTBF.

Performance under different job arrival rates
Figure: Performance under different job arrival rates.

Results Summary
- AuCoRe can significantly improve the performance of failed jobs and the overall system performance.
- In the multi-option cases, higher-priority recovery options yield larger performance gains than lower-priority options, especially on FSD. That is, offering a range of recovery options benefits the jobs that users consider most important.
- The recovery performance is sensitive to MTTR. Therefore, MTTR should be considered when setting the relative unit price of option C.
- AuCoRe is effective under different system failure rates and job arrival rates.
