SLIDE 1

ROSS 2017

The Effect of Asymmetric Performance on Asynchronous Task Based Runtimes

Debashis Ganguly and John R. Lange

SLIDE 2

Changing Face of HPC Environments

Goal: Can asynchronous task-based runtimes handle asymmetric performance?

[Diagram: "Traditional: Dedicated Resources" (a supercomputer runs the simulation; separate storage and processing clusters handle visualization) vs. "Future: Collocated Workloads" (simulation and visualization share the supercomputer)]
  • Task-based Runtimes: Potential solution
SLIDE 3

Task-Based Runtimes

  • Experiencing a renewal of interest in the systems community
  • Assumed to better address performance variability
  • Adopt an (over-)decomposed task-based model
  • Allow fine-grained scheduling decisions
  • Able to adapt to asymmetric/variable performance
  • But…
  • Originally designed for application-induced load imbalances, e.g., adaptive mesh refinement (AMR) based applications
  • Performance asymmetry can be of finer granularity, e.g., variable CPU time in time-shared environments
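The scheduling advantage of over-decomposition can be sketched with a toy makespan model (illustrative only; the chunk counts, core speeds, and work total below are made up, and this is not Charm++ or HPX code):

```python
import heapq

def makespan(num_chunks, speeds, total_work=1200.0):
    """Greedy list scheduling: each chunk goes to the earliest-free core."""
    chunk = total_work / num_chunks
    finish = [(0.0, s) for s in speeds]  # (finish_time, core_speed)
    heapq.heapify(finish)
    for _ in range(num_chunks):
        t, s = heapq.heappop(finish)
        heapq.heappush(finish, (t + chunk / s, s))
    return max(t for t, _ in finish)

speeds = [1.0] * 11 + [0.5]        # 12 cores, one running at half speed
coarse = makespan(12, speeds)      # one big task per core: slow core = 200.0
fine = makespan(240, speeds)       # 20x over-decomposed: near the ideal ~104
```

With one chunk per core the slow core drags the makespan to double; with many small chunks the slow core simply completes fewer of them, which is exactly the fine-grained adaptation the bullet describes.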

SLIDE 4

Basic Experimental Evaluation

  • Synthetic situation
  • Emulate performance asymmetry in a time-shared configuration
  • Static and predictable setting
  • Benchmark on 12 cores; share one core with a background workload
  • Vary the percentage of CPU time of the competing workload
  • Environment: 12-core dual-socket compute node, hyperthreading disabled
  • Used cpulimit to control the percentage of CPU time
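cpulimit caps a process by duty-cycling it with SIGSTOP/SIGCONT. A minimal sketch of that mechanism (the fixed period and cycle count are our simplification; the real tool adapts its intervals, and this is POSIX-only):

```python
import os
import signal
import time

def throttle(pid, fraction, period=0.1, cycles=50):
    """Let `pid` run for fraction*period, then stop it for the remainder."""
    run = period * fraction
    for _ in range(cycles):
        os.kill(pid, signal.SIGCONT)   # resume the target process
        time.sleep(run)
        os.kill(pid, signal.SIGSTOP)   # suspend it for the rest of the period
        time.sleep(period - run)
    os.kill(pid, signal.SIGCONT)       # leave the process running afterwards
```

Averaged over many periods, the target process sees roughly `fraction` of one core, which is how the experiments dial in a precise CPU share for the competing workload.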

SLIDE 5

Workload Configuration

[Diagram: placement of the benchmark, the competing workload, and idle cores across Node 0 and Node 1 in the 11-core and 12-core settings]

SLIDE 6

Experimental Setup

  • Evaluated two different runtimes:
  • Charm++: LeanMD
  • HPX-5: LULESH, HPCG, LibPXGL
  • Competing workloads:
  • Prime number generator: entirely CPU-bound, with a minimal memory footprint
  • Kernel compilation: stresses internal OS features such as the I/O and memory subsystems
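A stand-in for the prime-number-generator workload; trial division keeps it entirely CPU-bound with a negligible memory footprint (the bound is arbitrary, and this is our sketch, not the generator used in the paper):

```python
def count_primes(n):
    """Count primes below n by trial division: pure CPU work, tiny state."""
    count = 0
    for k in range(2, n):
        if all(k % d for d in range(2, int(k ** 0.5) + 1)):
            count += 1
    return count
```

Run under a throttle like cpulimit, a loop like this consumes exactly the CPU share it is given without touching the memory or I/O subsystems, which is what makes it a clean source of performance asymmetry.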

SLIDE 7

Charm++


  • Iterative over-decomposed applications
  • Object-based programming model
  • Tasks implemented as C++ objects
  • Objects can migrate across intra- and inter-node boundaries

SLIDE 8

Charm++


  • A separate centralized load balancer component
  • Preempts application progress
  • Actively migrates objects based on current state
  • Causes computation to block across the other cores
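The centralized, migrate-to-balance behavior can be sketched as a toy refinement balancer (in the spirit of RefineLB, but not the Charm++ API; the tolerance and per-object cost model are invented):

```python
def refine(core_loads, tolerance=1.05):
    """core_loads: one list of per-object costs per core.
    Returns the migrations (obj_cost, src_core, dst_core) performed."""
    migrations = []
    loads = [sum(c) for c in core_loads]
    avg = sum(loads) / len(loads)
    while max(loads) > avg * tolerance:
        src = loads.index(max(loads))        # most loaded core
        dst = loads.index(min(loads))        # least loaded core
        obj = min(core_loads[src])           # cheapest object to move
        if loads[dst] + obj >= loads[src]:
            break                            # move would not reduce imbalance
        core_loads[src].remove(obj)
        core_loads[dst].append(obj)
        loads[src] -= obj
        loads[dst] += obj
        migrations.append((obj, src, dst))
    return migrations
```

Even this toy version shows why the strategy is costly: it needs a global view of current loads, and applying its migrations means pausing the objects being moved, which is the blocking behavior the slide describes.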

SLIDE 9

Choice of Load Balancer Matters

  • Comparing performance of different load-balancing strategies and of running without any load balancer

[Chart: runtime (s) and percentage performance degradation vs. percentage of CPU utilized by the background prime-number-generator workload on the 12th core, for GreedyLB, RotateLB, RandCentLB, RefineLB, RefineSwapLB, and no load balancer; 198% divergence]

We selected RefineSwapLB for the rest of the experiments.

SLIDE 10

Invocation Frequency Matters

  • MetaLB:
  • Invoke load balancer less frequently based on heuristics
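The MetaLB idea, reduced to a sketch: skip load balancing while the measured imbalance stays small, amortizing the balancing overhead (the max/average threshold heuristic here is ours, not MetaLB's actual one):

```python
def should_balance(loads, threshold=1.10):
    """Trigger the load balancer only when the maximum per-core load
    drifts past threshold * average; otherwise let the step run as-is."""
    avg = sum(loads) / len(loads)
    return max(loads) > threshold * avg
```

Gating each potential invocation this way trades a little residual imbalance for far fewer preemptions, which is why the overhead curves with MetaLB sit below those without it.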

[Chart: total runtime and load-balancing overhead of RefineSwapLB with and without MetaLB (time in seconds) vs. percentage of CPU utilized by the background prime-number-generator workload on the 12th core]

We enabled MetaLB for our experiments.

SLIDE 11

Charm++: LeanMD

  • 12 cores are worse than 11 cores
  • …unless the runtime gets at least 75% of the 12th core’s capacity.
  • If the application cannot get more than 75% of the core’s capacity, it is better off ignoring the core completely.
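One way to read this crossover (our model, not a formula from the paper): with perfect rebalancing, a 12th core delivering fraction f of full speed should shrink the 11-core runtime by a factor of 11/(11+f), so any f > 0 ought to help. That the measured crossover sits near f = 0.75 shows how much capacity runtime overheads consume:

```python
def ideal_runtime(t11, f):
    """Expected runtime on 11 full cores plus one core at fraction f,
    assuming work rebalances perfectly and overheads are zero."""
    return t11 * 11 / (11 + f)

t11 = 110.0   # hypothetical 11-core runtime in seconds
# ideal_runtime falls monotonically as f grows: under the zero-overhead
# model, even a heavily throttled 12th core is worth keeping.
```

The gap between this ideal curve and the "theoretical expectation" line the measurements are compared against is the useful capacity lost to migration, balancing, and scheduling overhead.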

[Chart: sensitivity to the percentage of CPU utilized by the background prime-number-generator workload on the 12th core; runtime (s) and percentage performance degradation for 11 threads on 11 cores, 12 threads on 12 cores, and the 12-core theoretical expectation; crossover when 25% goes to the background workload; 53% divergence]

SLIDE 12

Charm++: LeanMD

More variable, but consistent mean performance.

[Chart: sensitivity to the percentage of CPU utilized by the background kernel-compilation workload on the 12th core; runtime (s) and percentage performance degradation for 12 threads on 12 cores and 11 threads on 11 cores; 25% to the background workload]

SLIDE 13

HPX-5


  • Parcel:
  • Contains a computational task and a reference to the data the task operates on
  • Follows the Work-First principle of Cilk-5
  • Every scheduling entity processes parcels from the top of its own scheduling queue


SLIDE 14

HPX-5


  • Implemented using random work stealing
  • No centralized decision-making process
  • Overhead of work stealing is assumed by the stealer
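The stealing discipline can be sketched as a toy sequential scheduler (not the HPX-5 API; queue contents and the round-robin loop are our simplification): each worker pops from the top of its own queue and, when idle, steals from the bottom of a random victim, so the thief bears the cost of the steal.

```python
import random
from collections import deque

def run(queues, rng=None):
    """Drain all per-worker queues; returns tasks executed per worker."""
    rng = rng or random.Random(0)
    executed = [0] * len(queues)
    while any(queues):
        for w, q in enumerate(queues):
            if q:
                q.pop()                  # own work: LIFO, from the top
                executed[w] += 1
            else:
                victims = [v for v in range(len(queues)) if queues[v]]
                if victims:
                    victim = rng.choice(victims)
                    # steal from the bottom of the victim's queue
                    q.appendleft(queues[victim].popleft())
    return executed
```

Because only idle workers pay for stealing, a core that slows down simply contributes less, with no centralized component pausing the other cores, in contrast to the Charm++ balancer above.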

SLIDE 15

OpenMP: LULESH

  • Overall application performance determined by the slowest rank
  • Vulnerable to asymmetries in performance
  • Relies on collective-based communication
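The slowest-rank effect can be put in numbers with a toy model (ours, for illustration; work units are arbitrary): in a BSP step every rank must finish before the next step begins, so one degraded core sets the pace for all twelve.

```python
def bsp_time(work, f, cores=12):
    """Each rank gets work/cores; the step waits for the slowest rank,
    a core running at fraction f of full speed."""
    return (work / cores) / f

def rebalanced_time(work, f, cores=12):
    """Ideal task-based runtime: work spreads over total capacity."""
    return work / (cores - 1 + f)
```

At f = 0.5 the BSP step time doubles, while the ideally rebalanced time barely moves; that gap is why the OpenMP version diverges so much further than the task-based runtimes in these experiments.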

[Chart: sensitivity to the percentage of CPU utilized by the background prime-number-generator workload on the 12th core; runtime (s) and percentage performance degradation for 12 threads on 12 cores, 11 threads on 11 cores, and the 12-core theoretical expectation; 185% divergence]

SLIDE 16

HPX-5: LULESH

  • No cross-over point
  • 12 cores are consistently worse than 11 cores

  • A traditional BSP application implemented using task-based programming

[Chart: sensitivity to the percentage of CPU utilized by the background prime-number-generator workload on the 12th core; runtime (s) and percentage performance degradation for 12 threads on 12 cores, 11 threads on 11 cores, and the 12-core theoretical expectation; 42% divergence]

SLIDE 17

HPX-5: HPCG

  • Another BSP application implemented in the task-based model
  • Better than the theoretical expectation
  • 12 cores are consistently worse than 11 cores

[Chart: sensitivity to the percentage of CPU utilized by the background prime-number-generator workload on the 12th core; runtime (s) and percentage performance degradation for 12 threads on 12 cores, 11 threads on 11 cores, and the 12-core theoretical expectation; 10% to the background workload; 5% divergence]

SLIDE 18

HPX-5: LibPXGL

  • No cross-over point
  • 12 cores are consistently worse than 11 cores

  • An asynchronous graph processing library
  • A more natural fit

[Chart: sensitivity to the percentage of CPU utilized by the background prime-number-generator workload on the 12th core; runtime (s) and percentage performance degradation for 12 threads on 12 cores, 11 threads on 11 cores, and the 12-core theoretical expectation; 22% to the background workload; 5% divergence]

SLIDE 19

HPX-5: Kernel Compilation

  • More immediate, instead of gradual, decline

[Charts for LULESH, HPCG, and LibPXGL: runtime (s) and percentage performance degradation vs. percentage of CPU speed consumed by the background kernel-compilation workload on the 12th core; 12 threads on 12 cores vs. 11 threads on 11 cores]

SLIDE 20

Conclusion

  • Performance asymmetry is still challenging
  • Preliminary evaluation:
  • Tightly controlled time-shared CPUs
  • Static and consistent configuration
  • Better than BSP, but…
  • On average, a CPU loses its utility to a task-based runtime as soon as its performance diverges by only 25%.

SLIDE 21

Thank You

  • Debashis Ganguly
  • Ph.D. Student, Computer Science Department, University of Pittsburgh

  • debashis@cs.pitt.edu
  • https://people.cs.pitt.edu/~debashis/
  • The Prognostic Lab
  • http://www.prognosticlab.org
