A nice little scheduling problem

Framework Sequential jobs Parallel jobs Results No prediction
A nice little scheduling problem
Yves Robert
Ecole Normale Supérieure de Lyon & Institut Universitaire de France
CCGSC'2010 Asheville
Yves.Robert@ens-lyon.fr

A few nice little scheduling problems
I made it to the 10 CCGSC workshops!
- I talked about a nice little scheduling problem in 1992
- I talked about a nice little scheduling problem in 1994
- I talked about a nice little scheduling problem in 1996
- I talked about a nice little scheduling problem in 1998
- I talked about a nice little scheduling problem in 2000
- I talked about a nice little scheduling problem in 2002
- I talked about a nice little scheduling problem in 2004
- I talked about a nice little scheduling problem in 2006
- I talked about a nice little scheduling problem in 2008
At last, a fundamental problem in exascale computing!!

Checkpointing versus Migration for Post-Petascale Machines
Franck Cappello, INRIA-Illinois Joint Laboratory for Petascale Computing
Henri Casanova, University of Hawai‘i
Yves Robert, Ecole Normale Supérieure de Lyon & Institut Universitaire de France
CCGSC'2010 Asheville
Checkpointing. Or not.

Dealing with failures
Fault-tolerant computing becomes unavoidable
- Caveat: same story told for a very long time!
- Coming for real on future machines, e.g. Blue Waters
INRIA-Illinois Joint Laboratory for Petascale Computing
Techniques: failure avoidance (as opposed to failure tolerance)
- checkpointing
- migration

Outline
1 Framework
2 Sequential jobs
3 Parallel jobs
4 Numerical results
5 To predict or not to predict

Relying on failure prediction
Applications will face resource faults during execution
Failure prediction available (e.g. alarm when a disk or CPU becomes unusually hot)
Application must dynamically prepare for, and recover from, expected failures
Compare two well-known strategies:
- Preventive Checkpointing: purely local, but can be very costly
- Preventive Migration: requires availability of a spare resource
Remember, we assume accurate failure prediction

Preventive checkpointing
[Figure: timeline alternating downtime D and execution interval µ between faults; R and C fall inside each available interval.]
D: length of downtime intervals
µ: (average) length of execution intervals, a.k.a. MTTF
R: recovery time (beginning of interval)
C: checkpoint time (end of interval, just before failure)

Preventive migration
[Figure: timeline alternating downtime D and execution interval µ between faults; M falls at the end of each available interval.]
D: length of downtime intervals
µ: (average) length of execution intervals
M: migration time (end of interval, just before failure)
Need spare node

Notations
C: checkpoint save time (in minutes)
R: checkpoint recovery time (in minutes)
D: down/reboot time (in minutes)
M: migration time (in minutes)
µ: mean time to failure (e.g., 1/λ if failures are exponentially distributed)
N: total number of cluster nodes
n: number of spares (migration)

Caveat
Checkpointing/migration comparison makes sense only if M < C + D + R; otherwise, better use the faulty machine as its own spare
Live migration without any disk access, thereby dramatically reducing migration time

Checkpointing
[Figure: checkpointing timeline, as on the "Preventive checkpointing" slide.]
Probability of node being active:
$u_c = \max\left(0, \frac{\mu - R - C}{\mu + D}\right)$
Global throughput:
$\rho_c = u_c \times N = \max\left(0, \frac{\mu - R - C}{\mu + D}\right) \times N$
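As a rough numerical sketch of the two checkpointing quantities on this slide, the snippet below evaluates u_c = max(0, (µ − R − C)/(µ + D)) and ρ_c = u_c × N. The function names and the sample values (in minutes) are illustrative assumptions, not figures from the talk.

```python
def checkpoint_utilization(mu, R, C, D):
    """Probability that a node is active under preventive checkpointing.

    mu: mean execution interval (MTTF), R: recovery time,
    C: checkpoint time, D: downtime. All in the same unit (e.g. minutes).
    """
    return max(0.0, (mu - R - C) / (mu + D))


def checkpoint_throughput(mu, R, C, D, N):
    """Global throughput rho_c = u_c * N for a cluster of N nodes."""
    return checkpoint_utilization(mu, R, C, D) * N


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to exercise the formulas.
    mu, R, C, D, N = 1440.0, 10.0, 10.0, 1.0, 1024
    print("u_c   =", checkpoint_utilization(mu, R, C, D))
    print("rho_c =", checkpoint_throughput(mu, R, C, D, N))
```

The `max(0, ...)` clamp covers the case where R + C exceeds µ, in which a node never does useful work.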

Migration (1/2)
[Figure: migration timeline, as on the "Preventive migration" slide.]
Probability of node being active:
$u_m = \max\left(0, \frac{\mu - M}{\mu + D}\right)$
Global throughput:
$\rho_m = u_m \times (N - n) = \max\left(0, \frac{\mu - M}{\mu + D}\right) \times (N - n)$
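The migration counterpart can be sketched the same way: u_m = max(0, (µ − M)/(µ + D)) and ρ_m = u_m × (N − n), where the n spares do no useful work. Function names and sample values are again assumptions made for illustration.

```python
def migration_utilization(mu, M, D):
    """Probability that a node is active under preventive migration.

    mu: mean execution interval, M: migration time, D: downtime.
    """
    return max(0.0, (mu - M) / (mu + D))


def migration_throughput(mu, M, D, N, n):
    """Global throughput rho_m = u_m * (N - n); n nodes are kept as spares."""
    return migration_utilization(mu, M, D) * (N - n)


if __name__ == "__main__":
    # Hypothetical parameters (minutes, node counts).
    mu, M, D, N, n = 1440.0, 2.0, 1.0, 1024, 16
    print("u_m   =", migration_utilization(mu, M, D))
    print("rho_m =", migration_throughput(mu, M, D, N, n))
```

Compared with checkpointing, the per-node utilization loses only M instead of R + C, but the throughput is scaled by N − n rather than N.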

Migration (2/2)
[Figure: migration timeline, as on the "Preventive migration" slide.]
No shortage of spare nodes?
$\mathrm{success}(n) = \sum_{k=0}^{n} \binom{N}{k} u_m^{N-k} (1 - u_m)^k$
Find $n = \alpha(\varepsilon, N)$ that "guarantees" a successful execution with probability at least $1 - \varepsilon$
Solve numerically
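One possible way to "solve numerically" is a linear scan over n, accumulating the binomial sum success(n) = Σ_{k=0..n} C(N, k) · u_m^(N−k) · (1 − u_m)^k until it reaches 1 − ε. This is only a sketch under assumptions: the function names are hypothetical, the talk does not say which numerical method was used, and the direct evaluation with `math.comb` (Python 3.8+) overflows floating point for very large N.

```python
from math import comb


def success(n, N, u_m):
    """Probability that at most n of the N nodes are inactive."""
    return sum(
        comb(N, k) * u_m ** (N - k) * (1.0 - u_m) ** k
        for k in range(n + 1)
    )


def min_spares(eps, N, u_m):
    """Smallest n such that success(n) >= 1 - eps, found by linear scan."""
    total = 0.0
    for n in range(N + 1):
        # Add the k = n term of the binomial sum to the running total.
        total += comb(N, n) * u_m ** (N - n) * (1.0 - u_m) ** n
        if total >= 1.0 - eps:
            return n
    return N  # Reached only if rounding keeps the total just below 1 - eps.


if __name__ == "__main__":
    # Hypothetical values: 256 nodes, each active with probability 0.99.
    print(min_spares(eps=1e-3, N=256, u_m=0.99))
```

For large platforms a log-domain evaluation or a library binomial CDF would be a safer choice than the direct sum.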

Distribution (1/3)
Number of processors required by typical jobs: two-stage log-uniform distribution biased to powers of two (says Dr. Feitelson)
Let $N = 2^Z$ for simplicity
Probability that a job is sequential: $\alpha_0 = p_1 \approx 0.25$
Otherwise, the job is parallel, and uses $2^j$ processors with identical probability $\alpha_j = \alpha = (1 - p_1) \times \frac{1}{Z}$ for $1 \le j \le Z = \log_2 N$
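The job-size distribution on this slide can be tabulated directly: a job is sequential with probability p_1, and otherwise uses 2^j processors with probability (1 − p_1)/Z for each j from 1 to Z = log_2 N. The sketch below builds that table; the function name and the choice N = 16 are assumptions for illustration.

```python
def job_size_distribution(N, p1=0.25):
    """Map each job size (number of processors) to its probability.

    N must be a power of two with N >= 2, so that Z = log2(N) >= 1.
    """
    Z = N.bit_length() - 1
    if N < 2 or (1 << Z) != N:
        raise ValueError("N must be a power of two, at least 2")
    alpha = (1.0 - p1) / Z          # identical weight for every parallel size
    dist = {1: p1}                  # sequential jobs: alpha_0 = p1
    for j in range(1, Z + 1):
        dist[2 ** j] = alpha        # parallel jobs on 2^j processors
    return dist


if __name__ == "__main__":
    for size, prob in job_size_distribution(16).items():
        print(size, prob)
```

The probabilities sum to p_1 + Z × (1 − p_1)/Z = 1, which is a quick sanity check on the reconstruction.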

Distribution (2/3)
Steady-state utilization of whole platform:
- all processors always active
- constant proportion of jobs using any processor number
Expectation of the number of jobs:
- $K$: total number of jobs running
- $\beta_j$: jobs that use $2^j$ processors exactly
