
A nice little scheduling problem Yves Robert Ecole Normale Sup - PowerPoint PPT Presentation

Sections: Framework, Sequential jobs, Parallel jobs, Results, No prediction. A nice little scheduling problem. Yves Robert, Ecole Normale Supérieure de Lyon & Institut Universitaire de France. CCGSC'2010, Asheville. Yves.Robert@ens-lyon.fr


  1. A nice little scheduling problem. Yves Robert, Ecole Normale Supérieure de Lyon & Institut Universitaire de France. CCGSC'2010, Asheville. Yves.Robert@ens-lyon.fr

  3. A few nice little scheduling problems
     I made it to the 10 CCGSC workshops!
     - I talked about a nice little scheduling problem in 1992
     - I talked about a nice little scheduling problem in 1994
     - I talked about a nice little scheduling problem in 1996
     - I talked about a nice little scheduling problem in 1998
     - I talked about a nice little scheduling problem in 2000
     - I talked about a nice little scheduling problem in 2002
     - I talked about a nice little scheduling problem in 2004
     - I talked about a nice little scheduling problem in 2006
     - I talked about a nice little scheduling problem in 2008
     At last a fundamental problem in exascale computing!!

  4. Checkpointing versus Migration for Post-Petascale Machines
     - Franck Cappello, INRIA-Illinois Joint Laboratory for Petascale Computing
     - Henri Casanova, University of Hawai‘i
     - Yves Robert, Ecole Normale Supérieure de Lyon & Institut Universitaire de France
     CCGSC'2010, Asheville. Running title: Checkpointing. Or not.

  5. Dealing with failures
     - Fault-tolerant computing becomes unavoidable
     - Caveat: same story told for a very long time!
     - Coming for real on future machines, e.g. Blue Waters
     - INRIA-Illinois Joint Laboratory for Petascale Computing
     - Techniques: failure avoidance (as opposed to failure tolerance); checkpointing, migration

  8. Outline: (1) Framework, (2) Sequential jobs, (3) Parallel jobs, (4) Numerical results, (5) To predict or not to predict

  11. Relying on failure prediction
     - Applications will face resource faults during execution
     - Failure prediction is available (e.g. an alarm when a disk or CPU becomes unusually hot)
     - The application must dynamically prepare for, and recover from, expected failures
     - Compare two well-known strategies:
       - Preventive checkpointing: purely local, but can be very costly
       - Preventive migration: requires availability of a spare resource
     - Remember, we assume accurate failure prediction

  12. Preventive checkpointing
     [Timeline figure: each fault is followed by a downtime D; the node is then available for an execution interval of length µ that begins with a recovery R and ends with a checkpoint C]
     - D: length of downtime intervals
     - µ: (average) length of execution intervals, a.k.a. MTTF
     - R: recovery time (beginning of interval)
     - C: checkpoint time (end of interval, just before failure)

  13. Preventive migration
     [Timeline figure: each fault is followed by a downtime D; the node is then available for an execution interval of length µ that ends with a migration M]
     - D: length of downtime intervals
     - µ: (average) length of execution intervals
     - M: migration time (end of interval, just before failure)
     - Need spare node

  14. Notations
     - C: checkpoint save time (in minutes)
     - R: checkpoint recovery time (in minutes)
     - D: down/reboot time (in minutes)
     - M: migration time (in minutes)
     - µ: mean time to failure (e.g., $1/\lambda$ if failures are exponentially distributed)
     - N: total number of cluster nodes
     - n: number of spares (migration)

  15. Caveat
     - The checkpointing/migration comparison makes sense only if $M < C + D + R$; otherwise it is better to use the faulty machine as its own spare
     - Live migration without any disk access, thereby dramatically reducing migration time

  16. Outline: (1) Framework, (2) Sequential jobs, (3) Parallel jobs, (4) Numerical results, (5) To predict or not to predict

  17. Checkpointing
     [Timeline figure: each fault is followed by a downtime D, then an execution interval of length µ that begins with a recovery R and ends with a checkpoint C]
     - Probability of a node being active: $u_c = \max\left(0, \frac{\mu - R - C}{\mu + D}\right)$
     - Global throughput: $\rho_c = u_c \times N = \max\left(0, \frac{\mu - R - C}{\mu + D}\right) \times N$
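
A minimal Python sketch of the two checkpointing formulas on this slide. The function names and the sample values (one-day MTTF, ten-minute checkpoint and recovery, one-minute downtime, 1024 nodes) are illustrative assumptions, not figures from the talk.

```python
def checkpoint_utilization(mu, R, C, D):
    """Probability of a node being active: u_c = max(0, (mu - R - C) / (mu + D))."""
    return max(0.0, (mu - R - C) / (mu + D))


def checkpoint_throughput(mu, R, C, D, N):
    """Global throughput: rho_c = u_c * N."""
    return checkpoint_utilization(mu, R, C, D) * N


# Hypothetical parameters, all times in minutes (assumed for illustration only).
mu, R, C, D, N = 1440.0, 10.0, 10.0, 1.0, 1024

u_c = checkpoint_utilization(mu, R, C, D)
rho_c = checkpoint_throughput(mu, R, C, D, N)
print(f"u_c = {u_c:.4f}, rho_c = {rho_c:.1f} node-equivalents")
```

The `max(0, ...)` clamp covers the case where recovery plus checkpoint take longer than the execution interval, so no useful work gets done.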

  18. Migration (1/2)
     [Timeline figure: each fault is followed by a downtime D, then an execution interval of length µ that ends with a migration M]
     - Probability of a node being active: $u_m = \max\left(0, \frac{\mu - M}{\mu + D}\right)$
     - Global throughput: $\rho_m = u_m \times (N - n) = \max\left(0, \frac{\mu - M}{\mu + D}\right) \times (N - n)$
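
The migration formulas can be sketched the same way and compared with the checkpointing throughput $\rho_c = \max(0, (\mu - R - C)/(\mu + D)) \times N$. All names and numbers below are assumptions chosen for illustration; in particular, the spare count `n = 8` is arbitrary here rather than derived.

```python
def migration_utilization(mu, M, D):
    """Probability of a node being active: u_m = max(0, (mu - M) / (mu + D))."""
    return max(0.0, (mu - M) / (mu + D))


def migration_throughput(mu, M, D, N, n):
    """Global throughput: rho_m = u_m * (N - n), since n nodes are kept as spares."""
    return migration_utilization(mu, M, D) * (N - n)


def checkpoint_throughput(mu, R, C, D, N):
    """Checkpointing throughput rho_c, repeated here so the block is self-contained."""
    return max(0.0, (mu - R - C) / (mu + D)) * N


# Hypothetical parameters, all times in minutes (assumed for illustration only).
mu, D, N = 1440.0, 1.0, 1024
M, n = 1.0, 8
R = C = 10.0

rho_m = migration_throughput(mu, M, D, N, n)
rho_c = checkpoint_throughput(mu, R, C, D, N)
print("migration" if rho_m > rho_c else "checkpointing", "has the higher throughput")
```

The trade-off is visible in the two factors: migration loses less time per interval but gives up `n` nodes, while checkpointing keeps all `N` nodes but pays `R + C` per interval.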

  19. Migration (2/2)
     [Timeline figure: each fault is followed by a downtime D, then an execution interval of length µ that ends with a migration M]
     - No shortage of spare nodes? $\mathrm{success}(n) = \sum_{k=0}^{n} \binom{N}{k} u_m^{N-k} (1 - u_m)^k$
     - Find $n = \alpha(\varepsilon, N)$ that “guarantees” a successful execution with probability at least $1 - \varepsilon$
     - Solve numerically
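
One simple way to "solve numerically" is a linear search over n, sketched below. It evaluates the binomial sum exactly as written on the slide and returns the smallest n whose success probability reaches 1 − ε. The function names, the search strategy and the sample values are assumptions; the talk does not say which numerical method was used.

```python
from math import comb


def success(n, u_m, N):
    """success(n) = sum_{k=0}^{n} C(N, k) * u_m^(N-k) * (1 - u_m)^k."""
    return sum(comb(N, k) * u_m ** (N - k) * (1.0 - u_m) ** k
               for k in range(n + 1))


def min_spares(eps, N, u_m):
    """Smallest n with success(n) >= 1 - eps (linear search; falls back to N)."""
    for n in range(N + 1):
        if success(n, u_m, N) >= 1.0 - eps:
            return n
    return N


# Hypothetical values: u_m from mu = 1440, M = 1, D = 1 (minutes), 1024 nodes.
u_m = (1440.0 - 1.0) / (1440.0 + 1.0)
print("spares needed:", min_spares(1e-6, 1024, u_m))
```

For very large N the direct product of `comb(N, k)` with floating-point powers can overflow or underflow, so a log-domain or incremental evaluation would be safer there.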

  20. Outline: (1) Framework, (2) Sequential jobs, (3) Parallel jobs, (4) Numerical results, (5) To predict or not to predict

  22. Distribution (1/3)
     - Number of processors required by typical jobs: two-stage log-uniform distribution biased to powers of two (says Dr. Feitelson)
     - Let $N = 2^Z$ for simplicity
     - Probability that a job is sequential: $\alpha_0 = p_1 \approx 0.25$
     - Otherwise, the job is parallel, and uses $2^j$ processors with identical probability $\alpha_j = \alpha = (1 - p_1) \times \frac{1}{Z}$ for $1 \le j \le Z = \log_2 N$
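
The job-size distribution on this slide can be tabulated and sampled in a few lines. This is a sketch under the slide's simplification that N is a power of two; the function name, the dictionary layout and the sample platform size are assumptions.

```python
import random


def job_size_distribution(N, p1=0.25):
    """Map job size (in processors) to probability.

    Size 1 (sequential) has probability alpha_0 = p1; each size 2^j,
    1 <= j <= Z = log2(N), has probability alpha = (1 - p1) / Z.
    """
    Z = N.bit_length() - 1
    if N < 2 or (1 << Z) != N:
        raise ValueError("N must be a power of two, at least 2")
    alpha = (1.0 - p1) / Z
    dist = {1: p1}
    for j in range(1, Z + 1):
        dist[2 ** j] = alpha
    return dist


dist = job_size_distribution(1024)  # hypothetical platform size
sizes = list(dist)
weights = list(dist.values())
# Draw a few job sizes according to the distribution.
print(random.choices(sizes, weights=weights, k=5))
```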

  23. Distribution (2/3)
     - Steady-state utilization of the whole platform:
       - all processors always active
       - constant proportion of jobs using any processor number
     - Expectation of the number of jobs:
       - $K$: total number of jobs running
       - $\beta_j$: jobs that use exactly $2^j$ processors
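
This excerpt stops before the slide gives any equations for K and β_j, so the sketch below rests on one plausible reading of the two steady-state assumptions. "Constant proportion" is taken to mean β_j = α_j × K, and "all processors always active" to mean Σ_j β_j 2^j = N, which together give K = N / Σ_j α_j 2^j. Both equations and all names are assumptions, not the author's stated derivation.

```python
def steady_state_job_counts(N, p1=0.25):
    """Return (K, betas) under the assumed reading:
    beta_j = alpha_j * K and sum_j beta_j * 2^j = N.
    """
    Z = N.bit_length() - 1
    if N < 2 or (1 << Z) != N:
        raise ValueError("N must be a power of two, at least 2")
    # alpha_0 = p1 for sequential jobs, alpha_j = (1 - p1) / Z for 1 <= j <= Z.
    alphas = [p1] + [(1.0 - p1) / Z] * Z
    # Average number of processors used by one job.
    procs_per_job = sum(a * 2 ** j for j, a in enumerate(alphas))
    K = N / procs_per_job
    betas = [a * K for a in alphas]
    return K, betas


K, betas = steady_state_job_counts(1024)  # hypothetical platform size
print(f"expected number of running jobs K = {K:.2f}")
```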
