The Effect of System Utilization on Application Performance - - PowerPoint PPT Presentation


The Effect of System Utilization on Application Performance Variability Boyang Li*, Sudheer Chunduri+ , Kevin Harms+ , Yuping Fan* , Zhiling Lan* Illinois Institute of Technology* Argonne National Laboratory+ Outline Motivation Related Work


slide-1
SLIDE 1

The Effect of System Utilization on Application Performance Variability

Boyang Li*, Sudheer Chunduri+ , Kevin Harms+ , Yuping Fan* , Zhiling Lan* Illinois Institute of Technology* Argonne National Laboratory+

slide-2
SLIDE 2

Outline

  • Motivation
  • Related Work
  • Project Contributions
  • Summary

slide-3
SLIDE 3

Motivation

Dragonfly topology has become popular:

  • High-radix
  • Low-diameter

Theta at Argonne:

  • 4,392 nodes
  • Peak performance of 11.69 petaflops
  • 2D-Dragonfly topology

Figure: Dragonfly topology

Performance variability due to network sharing!

slide-4
SLIDE 4

Related Work

  • Job placement
  • Routing policy
  • Task mapping

Communication interference due to network contention is a dominant cause of performance variability.

Existing studies of exploiting job scheduling to mitigate communication interference:

[1] Nikhil Jain, Abhinav Bhatele, Xiang Ni, Nicholas J. Wright, and Laxmikant V. Kale. 2014. Maximizing throughput on a dragonfly network. SC'14.
[2] Xu Yang, John Jenkins, Misbah Mubarak, Robert B. Ross, and Zhiling Lan. 2016. Watch out for the bully! Job interference study on dragonfly network. SC'16.
[3] Xin Wang, Misbah Mubarak, Xu Yang, Robert B. Ross, and Zhiling Lan. 2018. Trade-Off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System. IPDPS'18.

slide-5
SLIDE 5

Overview

Distinct from previous studies, we investigate how system utilization influences application runtime variability.

  • Empirical analysis:
      • Log analysis
      • Application experiments (over 4000 tests)
  • New scheduling design:
      • CEIL (Cut-off Extreme hIgh utiLization) design
slide-6
SLIDE 6

Empirical Study - Log Analysis

  • Records are attributed to the same application when all of the Aprun log fields listed above match.
  • Fifteen applications that have multiple executions are identified.
  • The top five applications with high repetition frequency for various job sizes are presented.

Table: Theta Aprun log field names and description
Table: Logs of Theta at ALCF
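
The record-matching rule on this slide can be pictured as a grouping step. The following is only a minimal sketch, not the authors' tooling: the field names `user`, `project`, `command`, and `nodes` are hypothetical placeholders for whichever Aprun log fields the table lists, and `min_runs` is an illustrative parameter.

```python
from collections import defaultdict

# Hypothetical field names; the real Aprun log fields are those in the table.
MATCH_FIELDS = ("user", "project", "command", "nodes")


def group_by_application(records, fields=MATCH_FIELDS):
    """Group log records whose matching fields are all identical."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in fields)
        groups[key].append(rec)
    return dict(groups)


def repeated_applications(records, min_runs=2, fields=MATCH_FIELDS):
    """Keep only the groups that were executed at least `min_runs` times."""
    groups = group_by_application(records, fields)
    return {key: runs for key, runs in groups.items() if len(runs) >= min_runs}
```

Sorting the groups returned by `repeated_applications` by their length would give a "top five by repetition frequency" style selection.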

slide-7
SLIDE 7

Empirical Study - Log Analysis

Figure: Application runtimes (Jan-March of 2018 on Theta) under different system utilization rates.

  • Positive correlation between high system utilization and application performance degradation (up to 21%).
  • Maximum runtime always occurred during high utilization periods.
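
A relationship like the one reported on this slide could be quantified roughly as follows. This is a sketch under stated assumptions: the slide does not say how degradation was computed, so measuring it as the slowest run relative to the fastest run is an assumption, and Pearson correlation is just one possible choice of statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def degradation(runtimes):
    """Assumed definition: slowdown of the slowest run relative to the fastest."""
    best = min(runtimes)
    return (max(runtimes) - best) / best
```

For one application, `pearson(utilizations, runtimes)` takes the system utilization and runtime of each run, and `degradation(runtimes)` returns a fraction such as 0.21 for a 21% slowdown.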

slide-8
SLIDE 8

Empirical Study - Application Experiments

  • Four applications: MILC, Reordered MILC, Nek5000, NEKBONE
  • Over 4000 application tests in total on different days and times
  • Cobalt logs are used to derive the average system utilization during these application runs.

Table: Experiment description
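
One way to derive the average system utilization during a run from scheduler records is a time-weighted average of busy nodes. The sketch below is an assumption about the method, not the authors' script: it does not parse the real Cobalt log format, and the `(start, end, nodes)` tuple layout is hypothetical.

```python
def average_utilization(jobs, window_start, window_end, total_nodes):
    """Time-weighted average fraction of busy nodes in [window_start, window_end).

    jobs: iterable of (start, end, nodes) tuples, a hypothetical layout
    extracted beforehand from the scheduler log.
    """
    span = window_end - window_start
    if span <= 0:
        raise ValueError("window_end must be later than window_start")
    busy_node_seconds = 0.0
    for start, end, nodes in jobs:
        # Only the part of the job that overlaps the window counts.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            busy_node_seconds += overlap * nodes
    return busy_node_seconds / (span * total_nodes)
```

Here `window_start` and `window_end` would be the start and end times of the application run of interest, and `total_nodes` the system size (4,392 for Theta).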

slide-9
SLIDE 9

Empirical Study - Application Experiments

Same observation as from log analysis!

slide-10
SLIDE 10

Rethink HPC Scheduling Design

Q: Shall we solely target high system utilization on a Dragonfly system for scheduling?
slide-11
SLIDE 11

Illustrative Example

Scheduling for utilization vs. for productivity

High system utilization does not necessarily mean high system productivity.

  • Nine 9-node jobs and nine 1-node jobs, each having a runtime estimate of 5 hours
  • Assume each application’s runtime will be increased by 20% (thus becoming 6 hours) due to network sharing when system utilization is greater than a threshold (e.g., 95%).
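
The arithmetic behind this example can be worked through in a few lines. The sketch below assumes a 10-node system, which the slide text does not state (`SYSTEM_NODES` is a hypothetical value chosen so that a 9-node job plus a 1-node job fills the machine), and it assumes jobs run in rounds that start together.

```python
BASE_HOURS = 5.0    # runtime estimate of every job (from the slide)
SLOWDOWN = 1.20     # 20% runtime increase above the threshold (from the slide)
THRESHOLD = 0.95    # utilization threshold (from the slide)
SYSTEM_NODES = 10   # ASSUMPTION: system size is not given in the slide text


def makespan(rounds):
    """Total hours for a schedule given as rounds of concurrently running job sizes."""
    total = 0.0
    for sizes in rounds:
        used = sum(sizes)
        if used > SYSTEM_NODES:
            raise ValueError("round does not fit on the system")
        utilization = used / SYSTEM_NODES
        slowed = utilization > THRESHOLD
        total += BASE_HOURS * SLOWDOWN if slowed else BASE_HOURS
    return total


# Schedule A: pair each 9-node job with a 1-node job (100% utilization).
pack_full = [[9, 1]] * 9
# Schedule B: run the 9-node jobs alone, then all 1-node jobs together (90%).
cap_utilization = [[9]] * 9 + [[1] * 9]

print(makespan(pack_full))        # 9 rounds x 6 h = 54 hours
print(makespan(cap_utilization))  # 10 rounds x 5 h = 50 hours
```

Under these assumptions, the schedule that keeps utilization at 90% finishes all eighteen jobs four hours earlier than the schedule that fills the machine.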

slide-12
SLIDE 12

CEIL: Scheduling Design

Two assumptions:

  • Resource utilization exhibits a fluctuating pattern throughout a day.
  • Not all users are in a hurry for job completion.

Figure: System utilization (0.0 to 1.0) over Day 1, Day 2, and Day 3, comparing actual system utilization with system utilization under CEIL; the 95% utilization level is marked.

slide-13
SLIDE 13

CEIL: Scheduling Design

CEIL (Cut-off Extreme hIgh utiLization) scheduling design:

  • There is an additional Postpone Queue besides the traditional Waiting Queue.
  • Only the jobs in the Waiting Queue can be scheduled for execution.
  • Jobs move from the Postpone Queue to the Waiting Queue when one of the following conditions is satisfied:
      • Empty Waiting Queue
      • Low utilization
      • Approaching the user’s expected job completion time

Figure: Flowchart of CEIL design
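
The three promotion conditions on this slide might be sketched as follows. This is an illustration rather than the CEIL implementation: the `Job` fields, the `low_util` value of 0.80, the `slack` parameter, and the reading of "approaching" as "the job can no longer wait and still finish by its expected end" are all assumptions. Checking queue emptiness once per pass is also a design choice made here.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    runtime_estimate: float  # seconds
    expected_end: float      # user's expected completion time (timestamp)


def should_promote(job, now, waiting_empty, utilization, low_util=0.80, slack=0.0):
    """Return True if any of the three promotion conditions holds."""
    if waiting_empty:                 # condition 1: empty Waiting Queue
        return True
    if utilization < low_util:        # condition 2: low utilization
        return True
    # Condition 3 (assumed reading): the latest start time that still meets
    # the user's expected completion time has been reached.
    latest_start = job.expected_end - job.runtime_estimate - slack
    return now >= latest_start


def promote_jobs(postpone_queue, waiting_queue, now, utilization, **kwargs):
    """Move eligible jobs from the Postpone Queue to the Waiting Queue in place."""
    waiting_empty = not waiting_queue  # evaluated once per scheduling pass
    remaining = []
    for job in postpone_queue:
        if should_promote(job, now, waiting_empty, utilization, **kwargs):
            waiting_queue.append(job)
        else:
            remaining.append(job)
    postpone_queue[:] = remaining
```

A scheduler loop would call `promote_jobs` before each scheduling pass and then pick jobs only from `waiting_queue`, which matches the rule that postponed jobs are never scheduled directly.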

slide-14
SLIDE 14

Scheduling Evaluation

Table: Workload traces from Theta at ALCF
Table: Workloads with various postponed rates

  • Theta workload logs
  • Synthetic logs
  • Trace-based scheduling simulator: CQSim

CQSim GitHub link: https://github.com/SPEAR-IIT/CQSim

slide-15
SLIDE 15

Evaluation Metrics

System-centric metrics:

  • Makespan (e.g., to evaluate scheduling throughput)
      • Total length of the schedule to complete all the jobs.
  • Percentage of high utilization periods
      • Proportion of the time when the system utilization is higher than 95% in this study.

User-centric metrics:

  • User wait time
      • Time period between a job’s expected end time and its actual end time.
  • Job bounded slowdown
      • Ratio of job response time (user wait time plus job runtime) to the job runtime.
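
The four metrics defined on this slide could be computed along these lines. This is a sketch with hypothetical record layouts, not CQSim code. Two points are assumptions: makespan is measured from the earliest submission, and the slowdown uses the conventional lower bound on runtime (`bound`, 10 seconds here), which the slide's definition does not spell out.

```python
def makespan(jobs):
    """Schedule length, assumed to run from the earliest submit to the latest end."""
    return max(j["end"] for j in jobs) - min(j["submit"] for j in jobs)


def high_utilization_fraction(samples, threshold=0.95):
    """Fraction of time with utilization above the threshold.

    samples: list of (duration, utilization) pairs covering the schedule.
    """
    total = sum(duration for duration, _ in samples)
    high = sum(duration for duration, util in samples if util > threshold)
    return high / total


def user_wait_time(expected_end, actual_end):
    """Per the slide: time between a job's expected end and its actual end.

    A negative value means the job finished earlier than expected.
    """
    return actual_end - expected_end


def bounded_slowdown(wait, runtime, bound=10.0):
    """Response time over runtime, with runtime floored at `bound` (an assumption)."""
    return max(1.0, (wait + runtime) / max(runtime, bound))
```

Flooring the runtime keeps very short jobs from dominating the average slowdown. Setting `bound=0` recovers the plain ratio given on the slide for any job with a positive runtime.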

slide-16
SLIDE 16

System Centric Results

Table: Comparison of system-level scheduling metrics

CEIL can significantly reduce the percentage of high utilization periods. CEIL does not impact system throughput.

  • We compare CEIL with WFP (the original scheduling policy deployed on Theta).
  • EASY backfilling is used to mitigate resource fragmentation.
slide-17
SLIDE 17

User Centric Results

Comparison of CEIL and WFP

CEIL can effectively reduce average user wait time by 12.5%-35.3%. Job bounded slowdown is reduced by 7.4%-20.2%.

slide-18
SLIDE 18

Summary

Our contributions in this work are summarized below:

  • There is a strong correlation between application runtime and system utilization.
  • We have investigated a scheduling strategy, CEIL, to proactively avoid job allocation under high system utilization. This is a proof-of-concept study.

Limitations:

  • The selection of 95% as the high utilization threshold is specific to the Theta workload.
  • Not suitable for systems that are always heavily utilized.

slide-19
SLIDE 19

Acknowledgement

slide-20
SLIDE 20

Thank you!

Questions