SLIDE 1

Learning-based Approaches to Estimate Job Wait Time in HTC Datacenters

Luc Gombert and Frédéric Suter

IN2P3 Computing Center / CNRS Villeurbanne, France

HEPiX Fall Workshop October 13, 2020

  • F. Suter – HEPiX Fall 2020 Workshop

1/15

SLIDE 2

Previously in HEPiX series . . .

◮ A first study of the workload processed at CC-IN2P3
◮ Focus on fairness for Local users
◮ Simulation of queue reconfiguration

SLIDE 3

Acknowledgment

◮ Original motivation for this work came from a talk by Wataru Takase (KEK) at the FJPPL — Japan-France workshop on computing technologies

SLIDE 4

Motivations and Objectives

◮ Fair-share scheduling ⇒ no estimation of job start time returned to the user!
◮ Distribution of Local job wait time
  ◮ Over 23 weeks from June 25, 2018 to December 2, 2018
  ◮ 5,748,922 jobs on 35,000 cores

[Figure: density of job wait time (x-axis: job wait time, from 10s to 1mo; y-axis: density, from 0.00 to 0.25), annotated with the shares 26.9 %, 29.3 %, 33.6 %, and 10.2 %]


  • 1. Can we explain why a job waits more than another?
  • 2. Can we train some Machine Learning algorithms?
  • 3. Can we get a good estimation of job wait time in the orange and red zones?
SLIDE 6

Outline

Introduction
Some Intuitive Causes of Job Wait Time
  Who Submits the Job?
  What is the Job Requesting?
  When and Where is the Job Submitted?
Learning-Based Job Wait Time Estimators
  Objectives and Performance Metrics
  ML Algorithm Selection
  Experimental Evaluation
Conclusion and Future Work

SLIDE 7

Who Submits the Job?

Job Features

◮ Owner: more than 2,500 individual accounts at CC-IN2P3
◮ Group: about 80 scientific collaborations

Resource Allocation Principle

  • 1. Groups express pledges every year (as a computing power in HS06)
  • 2. The sum of all pledges defines what CC-IN2P3 has to deliver
  • 3. Each group gets a proportional share of this

◮ Defines a consumption objective
◮ Used by the job scheduler as a basis of its Fair-Share policy
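
The allocation principle above comes down to proportional arithmetic. A minimal sketch, assuming hypothetical group names and pledge values (the actual pledges are not given in the slides):

```python
def proportional_shares(pledges_hs06):
    """Return each group's fraction of the total pledged capacity.

    pledges_hs06 maps a group name to its yearly pledge in HS06.
    """
    total = sum(pledges_hs06.values())
    if total <= 0:
        raise ValueError("total pledge must be positive")
    return {group: pledge / total for group, pledge in pledges_hs06.items()}


# Hypothetical pledges, for illustration only.
pledges = {"group_a": 60_000, "group_b": 30_000, "group_c": 10_000}
shares = proportional_shares(pledges)
for group, share in shares.items():
    print(f"{group}: {share:.0%} of the delivered capacity")
```

A group whose actual consumption exceeds its computed share would be the one expected to see its priority lowered by the Fair-Share policy.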


Intuitive Causes

  • 1. Small groups get fewer resources ⇒ wait more!
  • 2. Overconsumption of share ⇒ lower priority ⇒ wait more!
  • 3. Job owners can be manually blocked by operators ⇒ wait more!
SLIDE 9

What is the Job Requesting?

Job Features

◮ Time: either walltime or CPU time
  ◮ hard or soft limits – default values if none provided
◮ Memory: either resident or virtual
  ◮ hard or soft limits – default values if none provided
◮ Slots: almost always one for Local jobs
◮ Access to special resources: subject to quotas


Intuitive Causes

  • 1. HTC is not HPC! ⇒ low impact of time, memory, and slot requests
  • 2. Lots of (stringent) quotas ⇒ wait more if reached!

[Figure: job id (ticks 500 to 2000) against hours since submission time (ticks 24 to 144)]

SLIDE 11

When and Where is the Job Submitted?

Job and System features

◮ Submission time
◮ Current queue status: number of pending jobs
◮ Current platform status: number of running jobs

Intuitive Causes

[Figures: (a) average number of jobs submitted per hour by time of day (1 to 23); (b) number of requested slots; (c) number of jobs waiting for less than 1 hour, between 1 and 12 hours, and more than 12 hours, from Sat. 09/08 to Thu. 09/13. Legend entries: week night, week day, week evening, weekend night, weekend day; business days, weekend; Local jobs in 'long' queue, all jobs.]
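
The job and system features listed on this slide can be derived at submission time. The sketch below shows one way to do it; the period names follow the figure legend, but the hour boundaries (8h, 18h, 22h) and the function names are assumptions made for illustration, not values taken from the slides.

```python
from datetime import datetime


def submission_period(ts, day_start=8, evening_start=18, night_start=22):
    """Label a submission timestamp with one of five periods.

    The hour boundaries are assumed; adjust them to the site's habits.
    """
    weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat. and Sun.
    hour = ts.hour
    if weekend:
        if day_start <= hour < night_start:
            return "weekend day"
        return "weekend night"
    if day_start <= hour < evening_start:
        return "week day"
    if evening_start <= hour < night_start:
        return "week evening"
    return "week night"


def submission_features(ts, pending_jobs, running_jobs):
    """Bundle the time and system-status features for one submission."""
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),
        "period": submission_period(ts),
        "pending_jobs": pending_jobs,   # current queue status
        "running_jobs": running_jobs,   # current platform status
    }


# Hypothetical submission on Saturday, September 8, 2018 at 10:00.
features = submission_features(datetime(2018, 9, 8, 10, 0), 12_000, 30_000)
print(features)
```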

SLIDE 13

Objectives and Performance Metrics

Objectives

◮ Regression problem: Estimate the time a job will wait when submitted
  ◮ Users may not really need that level of precision
◮ Classification problem: Determine in which time range a job will fall

Class  Wait Time Range
1      Less than 30 minutes
2      30 minutes to 2 hours
3      2 hours to 4 hours
4      4 hours to 6 hours
5      6 hours to 9 hours
6      9 hours to 12 hours
7      12 hours to 24 hours
8      More than 24 hours
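
The class table maps onto a sorted list of upper bounds, so assigning a class is a binary search. A minimal sketch, assuming half-open ranges (lower bound included, upper bound excluded) as suggested by the `[30mn − 2h[` labels used in the confusion matrices of this deck; the function name is illustrative.

```python
from bisect import bisect_right

# Upper bounds, in hours, of classes 1 to 7; class 8 is everything beyond.
CLASS_BOUNDS_HOURS = [0.5, 2, 4, 6, 9, 12, 24]


def wait_time_class(wait_hours):
    """Return the class (1 to 8) of a wait time expressed in hours."""
    if wait_hours < 0:
        raise ValueError("wait time cannot be negative")
    # bisect_right counts the bounds that are <= wait_hours, so a job that
    # waits exactly 30 minutes falls in class 2, not class 1.
    return bisect_right(CLASS_BOUNDS_HOURS, wait_hours) + 1


for hours in (0.1, 0.5, 3, 10, 24, 72):
    print(f"{hours:>5} h -> class {wait_time_class(hours)}")
```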

Performance metrics

◮ Learning and Prediction times: Has to be usable in production!
◮ Wait time estimation: Error distribution
◮ Wait time range classification: Confusion matrix

SLIDE 14

ML Algorithm Selection

Common Properties

◮ Rely on scikit-learn implementations
◮ Favor fast algorithms

Regression

◮ Linear Regression
◮ Decision Tree Regressor
◮ Ensemble Methods
  ◮ AdaBoost and Bagging
  ◮ Depth-9 DT as weak learner
  ◮ 50 subsets

Classification

◮ Naive Bayes
◮ Decision Tree Classifier
◮ Ensemble Methods
  ◮ AdaBoost and Bagging
  ◮ Depth-1 DT as weak classifier
  ◮ 50 subsets

Additional Approach

◮ Two-step Classification: solve regression and then classify
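
The regression setup and the two-step approach can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the data is synthetic, "50 subsets" is interpreted as `n_estimators=50`, and the class bounds are the ones from the classification objective.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the job and system features (assumption).
rng = np.random.default_rng(0)
X = rng.random((400, 5))
wait_hours = 48.0 * X[:, 0] * X[:, 1] + rng.random(400)

X_train, X_test = X[:300], X[300:]
y_train, y_test = wait_hours[:300], wait_hours[300:]

# Depth-9 trees as weak learners; 50 estimators is our reading of "50 subsets".
regressors = {
    "Linear Regression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(max_depth=9, random_state=0),
    "AdaBoost": AdaBoostRegressor(
        DecisionTreeRegressor(max_depth=9), n_estimators=50, random_state=0
    ),
    "Bagging": BaggingRegressor(
        DecisionTreeRegressor(max_depth=9), n_estimators=50, random_state=0
    ),
}

# Upper bounds (hours) of classes 1 to 7; class 8 is 24 hours and beyond.
bounds = np.array([0.5, 2, 4, 6, 9, 12, 24])
true_classes = np.digitize(y_test, bounds) + 1

results = {}
for name, model in regressors.items():
    model.fit(X_train, y_train)
    predicted_hours = model.predict(X_test)
    # Two-step classification: regress first, then bin the estimate.
    predicted_classes = np.digitize(predicted_hours, bounds) + 1
    median_abs_error = float(np.median(np.abs(predicted_hours - y_test)))
    accuracy = float(np.mean(predicted_classes == true_classes))
    results[name] = (median_abs_error, accuracy)
    print(f"{name}: median abs. error {median_abs_error:.2f} h, "
          f"right class {accuracy:.0%}")
```

The weak learner is passed positionally because the keyword for it was renamed across scikit-learn releases. The numbers printed here only reflect the synthetic data and say nothing about the results reported in the slides.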

SLIDE 15

Accuracy of the Job Wait Time Estimation

[Figure: median absolute error (in hours, ticks 1 to 30) versus tree depth (4 to 16) for AdaBoost, Bagging, DecisionTree, and Linear Regression]

◮ AdaBoost is bad
◮ Bagging ≈ DT
◮ Less than 1h error for 50% of the jobs
◮ Satisfying!


◮ Split by "zone"
◮ Better for early starters
◮ Degradation for others
◮ Not satisfying :-/

[Figure: median absolute error (in hours, ticks 0.3 to 30.0) versus tree depth (4 to 16) for AdaBoost, Bagging, and DecisionTree, broken down by zone: 0 − 1mn, 1mn − 30mn, 30mn − 9h, > 9h]
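
Splitting the error by zone can be sketched as below. The zone limits come from the figure legend; whether a bound belongs to the lower or upper zone is an assumption, and the sample values are hypothetical.

```python
from statistics import median

# (label, lower bound in hours, upper bound in hours); lower bound included.
ZONES = [
    ("0 - 1mn", 0.0, 1 / 60),
    ("1mn - 30mn", 1 / 60, 0.5),
    ("30mn - 9h", 0.5, 9.0),
    ("> 9h", 9.0, float("inf")),
]


def median_abs_error_by_zone(actual_hours, predicted_hours):
    """Group jobs by the zone of their actual wait time and return the
    median absolute error of each non-empty zone."""
    errors = {label: [] for label, _, _ in ZONES}
    for actual, predicted in zip(actual_hours, predicted_hours):
        for label, low, high in ZONES:
            if low <= actual < high:
                errors[label].append(abs(actual - predicted))
                break
    return {label: median(errs) for label, errs in errors.items() if errs}


# Hypothetical actual and estimated wait times, in hours.
actual = [0.005, 0.2, 0.3, 3.0, 5.0, 20.0]
estimated = [0.1, 0.25, 0.5, 2.0, 8.0, 12.0]
by_zone = median_abs_error_by_zone(actual, estimated)
for label, error in by_zone.items():
    print(f"{label}: {error:.3f} h")
```

A single global median can hide large errors in the long-wait zones, which is why the per-zone view is the more demanding check.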

SLIDE 17

Accuracy of the Time Range Classification

∼ 43 % of jobs in the right class
∼ 73 % of jobs in the right or adjacent class
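
Both figures are simple to compute from true and predicted class numbers. A minimal sketch with hypothetical labels (classes 1 to 8 as in the classification objective):

```python
def class_accuracy(true_classes, predicted_classes):
    """Return (fraction in the right class,
               fraction in the right or an adjacent class)."""
    if not true_classes or len(true_classes) != len(predicted_classes):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(true_classes)
    pairs = list(zip(true_classes, predicted_classes))
    exact = sum(t == p for t, p in pairs) / n
    within_one = sum(abs(t - p) <= 1 for t, p in pairs) / n
    return exact, within_one


# Hypothetical labels, for illustration only.
true_labels = [1, 1, 2, 3, 5, 8, 7, 4]
predicted_labels = [1, 2, 2, 1, 6, 8, 5, 4]
exact, within_one = class_accuracy(true_labels, predicted_labels)
print(f"right class: {exact:.0%}, right or adjacent: {within_one:.0%}")
```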

SLIDE 18

Accuracy of the Time Range Classification

Bagging Classifier

Confusion matrix (values in %, one row and one column per time range; each column sums to about 100 %)

               < 30mn  [30mn − 2h[  [2h − 4h[  [4h − 6h[  [6h − 9h[  [9h − 12h[  [12h − 24h[  >= 24h
< 30mn          88.88        69.18      48.51      31.54      31.59       36.09        29.61   46.06
[30mn − 2h[       7.3        16.51      17.65      15.01      14.33       13.44        15.68   15.58
[2h − 4h[        1.05         4.15       4.21       0.16       0.45        0.62         0.66       0
[4h − 6h[        0.39         1.12       3.18       1.56       0.83         2.8         0.99       0
[6h − 9h[        0.46         1.68       1.81       1.63       4.61           0          0.6       0
[9h − 12h[       0.28          0.8       1.76        1.4          0        2.46            0       0
[12h − 24h[      1.59         6.55      22.84       48.7      47.98       43.84        52.33   37.05
>= 24h           0.05         0.01       0.04       0.01       0.21        0.77         0.13    1.32

SLIDE 19

Accuracy of the Time Range Classification

Decision Tree Regressor + Classification

Confusion matrix (values in %, one row and one column per time range; each column sums to about 100 %)

               < 30mn  [30mn − 2h[  [2h − 4h[  [4h − 6h[  [6h − 9h[  [9h − 12h[  [12h − 24h[  >= 24h
< 30mn          54.85        22.54       8.85       4.66        6.4        3.26         2.06    1.69
[30mn − 2h[     34.57        44.79      22.69       7.59       7.11       13.86         9.62    8.11
[2h − 4h[        6.45        15.55      22.07      11.11      12.84       13.45         6.56       6
[4h − 6h[         0.9         3.11      13.52      28.53        3.3        0.07         0.12    5.52
[6h − 9h[        1.76         6.59      22.14      28.04      29.38       12.53        40.92   15.82
[9h − 12h[       0.36         0.52       1.65       2.31       5.25        7.62        10.76   30.23
[12h − 24h[      0.96         5.95       6.71       9.79      24.91       29.73        17.43     0.8
>= 24h           0.16         0.94       2.38       7.98      10.81       19.48        12.54   31.82
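
The percentages in these matrices add up to about 100 % per column, which is consistent with a column-normalized confusion matrix. The sketch below shows that normalization on hypothetical counts; the slides do not state how the matrices were produced, so this is an assumption.

```python
def column_percentages(counts):
    """Convert a matrix of counts (list of rows) into per-column percentages.

    Columns whose total is zero are left at 0.0.
    """
    n_cols = len(counts[0])
    totals = [sum(row[c] for row in counts) for c in range(n_cols)]
    return [
        [100.0 * row[c] / totals[c] if totals[c] else 0.0 for c in range(n_cols)]
        for row in counts
    ]


# Hypothetical 3-class counts, for illustration only.
counts = [
    [80, 10, 5],
    [15, 30, 5],
    [5, 10, 10],
]
percentages = column_percentages(counts)
for row in percentages:
    print("  ".join(f"{value:6.2f}" for value in row))
```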

SLIDE 20

Conclusion and Future Work

Conclusion

◮ Analyzed 23 weeks of job submissions to an HTC center
◮ Identified some intuitive causes of job wait time
◮ Learned on 15 job and system features to predict job wait time
◮ Early results for Regression and Classification problems
  ◮ Assessing the performance of multiple ML algorithms
  ◮ Some biases have to be solved

Future Work

◮ Improve our predictions
  ◮ Take early starter jobs into account
◮ Investigate the use of Deep Learning algorithms
◮ Automate and transfer the procedure to the User Support team at CC-IN2P3
◮ Integrate this work into the newly deployed CC-IN2P3 user portal

SLIDE 21

Learning-based Approaches to Estimate Job Wait Time in HTC Datacenters

QUESTIONS?

Luc Gombert and Frédéric Suter

IN2P3 Computing Center / CNRS
