ROSS 2017
The Effect of Asymmetric Performance on Asynchronous Task-Based Runtimes
Debashis Ganguly and John R. Lange
Changing Face of HPC Environments
Goal: Can asynchronous task-based runtimes handle asymmetric performance?
- Traditional: Dedicated Resources (simulation runs on the supercomputer; visualization runs on separate storage and processing clusters)
- Future: Collocated Workloads (simulation and visualization share the same supercomputer)
- Task-based Runtimes: Potential solution
Task-based Runtimes
- Experiencing a renewal of interest in the systems community
- Assumed to better address performance variability
- Adopt (Over-)Decomposed task-based model
- Allow fine-grained scheduling decisions
- Able to adapt to asymmetric/variable performance
- But…
- Originally designed for application-induced load imbalances, e.g., adaptive mesh refinement (AMR) applications
- Performance asymmetry can be of finer granularity, e.g., variable CPU time in time-shared environments
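Why over-decomposition helps with asymmetric performance can be shown with a minimal scheduling sketch (not either runtime's API; the function names and numbers below are invented for illustration). Static even partitioning is gated by the slowest core, while greedy scheduling of many small tasks lets a slow core simply take fewer of them:

```python
import heapq

def makespan_static(task_cost, n_tasks, speeds):
    """Static partitioning: tasks are divided evenly up front, so the
    finish time is gated by the slowest worker's share."""
    per_worker = n_tasks // len(speeds)
    return max(per_worker * task_cost / s for s in speeds)

def makespan_overdecomposed(task_cost, n_tasks, speeds):
    """Greedy list scheduling over many small tasks: each worker pulls the
    next task as soon as it is free, so a slow worker simply takes fewer."""
    free = [(0.0, s) for s in speeds]   # (time the worker becomes free, speed)
    heapq.heapify(free)
    for _ in range(n_tasks):
        t, s = heapq.heappop(free)
        heapq.heappush(free, (t + task_cost / s, s))
    return max(t for t, _ in free)

# 11 full-speed cores plus one core with only 50% capacity, 120 unit tasks:
speeds = [1.0] * 11 + [0.5]
static = makespan_static(1.0, 120, speeds)      # slow core gets 10 tasks -> 20.0
dynamic = makespan_overdecomposed(1.0, 120, speeds)
```

With fine-grained tasks the makespan approaches the ideal of total work divided by total capacity, rather than being pinned to the slow core.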
Basic Experimental Evaluation
- Synthetic scenario: emulate performance asymmetry in a time-shared configuration
- Static and predictable setting: benchmark on 12 cores, sharing one core with a background workload
- Vary the percentage of CPU time given to the competing workload
- Environment: 12-core dual-socket compute node, hyperthreading disabled
- Used cpulimit to control the percentage of CPU time
Workload Configuration
(Figure: workload configurations across Node 0 and Node 1: an 11-core setting that leaves the 12th core to the idle/competing workload, and a 12-core setting that shares the 12th core between the benchmark and the competing workload.)
Experimental Setup
- Evaluated two different runtimes:
- Charm++: LeanMD
- HPX-5: LULESH, HPCG, LibPXGL
- Competing Workload:
- Prime Number Generator: entirely CPU bound, with a minimal memory footprint
- Kernel Compilation: stresses internal OS features such as the I/O and memory subsystems
Charm++
- Iterative, over-decomposed applications
- Object-based programming model
- Tasks implemented as C++ objects
- Objects can migrate across intra- and inter-node boundaries
Charm++
- A separate centralized load balancer component
- Preempts application progress
- Actively migrates objects based on current state
- Causes computation to block across the other cores
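The actual RefineLB/RefineSwapLB strategies live inside Charm++; the sketch below is only an assumed, simplified refine-style pass (helper names and the `tol` threshold are invented) that migrates small objects off the most overloaded worker, with worker speed given as relative CPU capacity:

```python
def refine_balance(assignments, obj_load, speeds, tol=1.05):
    """Greedy refine-style pass: repeatedly migrate the smallest object off
    the most overloaded worker until the maximum normalized load falls
    within tol of the average. Returns the list of migrations performed."""
    def load(w):
        return sum(obj_load[o] for o in assignments[w]) / speeds[w]

    avg = sum(load(w) for w in assignments) / len(assignments)
    migrations = []
    for _ in range(len(obj_load)):          # bounded number of moves
        heavy = max(assignments, key=load)
        light = min(assignments, key=load)
        if load(heavy) <= tol * avg:
            break
        obj = min(assignments[heavy], key=obj_load.get)
        # migrate only if it actually reduces the maximum load
        if load(light) + obj_load[obj] / speeds[light] >= load(heavy):
            break
        assignments[heavy].remove(obj)
        assignments[light].append(obj)
        migrations.append((obj, heavy, light))
    return migrations
```

For example, with two full-speed workers and one at half capacity holding four unit-load objects each, two migrations off the slow worker bring the maximum normalized load from 8 down to 5.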
(Chart: runtime and percentage performance degradation vs. percentage of CPU utilized by the prime-number-generator background workload on the 12th core, for GreedyLB, RotateLB, RandCentLB, RefineLB, RefineSwapLB, and no load balancer.)
Choice of Load Balancer Matters
We selected RefineSwapLB for the rest of the experiments.
- Compares the performance of different load-balancing strategies against running without any load balancer
198% divergence
Invocation Frequency Matters
- MetaLB:
- Invokes the load balancer less frequently, based on heuristics
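MetaLB's real heuristics are not reproduced here; the sketch below is an assumed, simplified gate of the same flavor (class name, threshold, and cost model all invented): skip the load balancer unless the measured imbalance is large enough that the projected saving over the remaining iterations outweighs the balancing cost.

```python
class MetaLBLike:
    """Heuristic gate in the spirit of MetaLB (assumed, simplified):
    invoke the load balancer only when the migration cost should pay off."""
    def __init__(self, lb_cost, threshold=1.1):
        self.lb_cost = lb_cost          # one-time cost of a balancing pass
        self.threshold = threshold      # tolerated max/avg load ratio

    def should_balance(self, per_core_times, iters_remaining):
        avg = sum(per_core_times) / len(per_core_times)
        imbalance = max(per_core_times) / avg
        # projected saving if balancing restored the average per iteration
        saving = (max(per_core_times) - avg) * iters_remaining
        return imbalance > self.threshold and saving > self.lb_cost
```

With one core twice as slow and many iterations left, the gate fires; with a mild 5% imbalance, or near the end of the run, it stays closed.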
(Chart: load-balancing overhead and total runtime of RefineSwapLB, with and without MetaLB, vs. percentage of CPU utilized by the prime-number-generator background workload on the 12th core.)
We enabled MetaLB for our experiments.
Charm++: LeanMD
- 12 cores are worse than 11 cores…
- …unless the application gets at least 75% of the shared core's capacity
- If the application cannot get more than 75% of the core's capacity, it is better off ignoring the core completely
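A back-of-envelope model (not from the talk; it assumes perfectly divisible work) shows where that cross-over sits between two extremes. A perfectly balanced task-based run benefits from any leftover capacity c on the shared core, while a BSP run gated by its slowest rank only breaks even against 11 cores at c = 11/12, about 92%. The observed 75% falls in between:

```python
def ideal_runtime(work, c):
    # perfect dynamic balancing: the shared core contributes capacity c
    return work / (11 + c)

def bsp_runtime(work, c):
    # BSP: 12 equal ranks, finish gated by the rank on the shared core
    return (work / 12) / c

def eleven_core_runtime(work):
    # ignore the shared core entirely
    return work / 11

W = 132.0   # arbitrary amount of work, chosen for round numbers
bsp_breakeven = bsp_runtime(W, 11 / 12)      # equals eleven_core_runtime(W)
ideal_quarter = ideal_runtime(W, 0.25)       # still beats 11 cores
```

So the 75% threshold measures how far the runtime's balancing falls short of the ideal, not a property of the hardware.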
(Chart: sensitivity to the percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 11 threads on 11 cores, 12 threads on 12 cores, and the theoretical expectation for 12 threads on 12 cores. Callouts: 25% to background workload; 53% divergence.)
Charm++: LeanMD
More variable, but consistent mean performance.
(Chart: sensitivity to the percentage of CPU utilized by the kernel-compilation background workload on the 12th core; series: 12 threads on 12 cores and 11 threads on 11 cores. Callout: 25% to background workload.)
HPX-5
- Parcel:
- Contains a computational task and a reference to the data the task operates on
- Follows the Work-First principle of Cilk-5
- Every scheduling entity processes parcels from the top of its scheduling queue
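The parcel concept can be sketched in a few lines (HPX-5 itself is a C library; the class and method names below are invented for illustration): a parcel pairs an action with the data it operates on, and a worker pushes newly spawned parcels to the top of its queue so the child runs before the parent's continuation, in work-first style.

```python
from collections import deque

class Parcel:
    """A parcel pairs a computational action with a reference to the data
    it operates on (a Python sketch of the concept, not the HPX-5 API)."""
    def __init__(self, action, data):
        self.action, self.data = action, data

class Worker:
    def __init__(self):
        self.queue = deque()             # newest parcels sit at the top
    def spawn(self, parcel):
        self.queue.appendleft(parcel)    # work-first: run the child next
    def run_one(self):
        p = self.queue.popleft()         # process from the top of the queue
        p.action(p.data)

w = Worker()
order = []
w.spawn(Parcel(order.append, "parent-continuation"))
w.spawn(Parcel(order.append, "child"))
w.run_one()   # the freshly spawned child runs first, Cilk-5 style
w.run_one()   # then the parent's continuation
```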
HPX-5
- Implemented using random work stealing
- No centralized decision-making process
- The overhead of work stealing is borne by the stealer
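A toy round-based simulation conveys the idea (this is not HPX-5's scheduler; the function and parameter names are invented): busy workers pop from the top of their own queue, while an idle worker pays the cost of picking a random non-empty victim and stealing from the bottom.

```python
import random

def run_with_stealing(queues, seed=0):
    """Round-based toy simulation of random work stealing. Each round a
    worker executes one task from the top of its own queue; an idle worker
    picks a random non-empty victim and steals a task from the bottom, so
    the stealing overhead falls on the otherwise-idle stealer."""
    rng = random.Random(seed)
    executed = [0] * len(queues)
    while any(queues):
        for i, q in enumerate(queues):
            if q:
                q.pop()                                 # own work: top of queue
                executed[i] += 1
            else:
                victims = [j for j, v in enumerate(queues) if v]
                if victims:
                    queues[rng.choice(victims)].pop(0)  # steal from the bottom
                    executed[i] += 1                    # stealer runs the task
    return executed
```

Starting with all 12 tasks on one worker and three idle workers, stealing spreads the work evenly without any central coordinator.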
OpenMP: LULESH
- Overall application performance is determined by the slowest rank
- Vulnerable to asymmetries in performance
- Relies on collective-based communication
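The slowest-rank effect is easy to quantify with a minimal sketch (numbers invented for illustration): a collective at the end of each step makes every rank wait for the slowest, so with one of 12 ranks at half speed the whole step doubles and machine utilization drops to 13/24, barely over half.

```python
def bsp_step_time(per_rank_times):
    # a collective at the end of each step makes every rank wait for the slowest
    return max(per_rank_times)

def utilization(per_rank_times):
    # fraction of total core-time spent computing rather than waiting
    return sum(per_rank_times) / (len(per_rank_times) * bsp_step_time(per_rank_times))

times = [1.0] * 11 + [2.0]    # the rank sharing its core runs at half speed
step = bsp_step_time(times)   # the whole step stretches to the slow rank's time
util = utilization(times)     # 13/24: eleven fast ranks idle half of each step
```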
(Chart: sensitivity to the percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation. Callout: 185% divergence.)
HPX-5: LULESH
- No cross-over point
- 12 cores are consistently worse than 11 cores
- A traditional BSP application implemented using task-based programming
(Chart: sensitivity to the percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation. Callout: 42% divergence.)
(Chart: runtime and percentage performance degradation vs. percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation.)
HPX-5: HPCG
- Better than the theoretical expectation
- 12 cores are consistently worse than 11 cores
- Another BSP application implemented in the task-based model
(Chart callouts: 10% to background workload; 5% divergence.)
HPX-5: LibPXGL
- No cross-over point
- 12 cores are consistently worse than 11 cores
- An asynchronous graph processing library
- A more natural fit for the asynchronous task-based model
(Chart: sensitivity to the percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation. Callouts: 22% to background workload; 5% divergence.)
(Chart: runtime and percentage performance degradation vs. percentage of CPU speed consumed by the kernel-compilation background workload on the 12th core; series: 12 threads on 12 cores and 11 threads on 11 cores.)
HPX-5: Kernel Compilation
(Three charts: LULESH, HPCG, and LibPXGL; runtime and percentage performance degradation vs. percentage of CPU speed consumed by the kernel-compilation background workload on the 12th core; series: 12 threads on 12 cores and 11 threads on 11 cores.)
A more immediate decline, instead of a gradual one.
Conclusion
- Performance asymmetry is still challenging
- Preliminary evaluation:
- Tightly controlled time-shared CPUs
- Static and consistent configuration
- Better than BSP, but…
- On average, a CPU loses its utility to a task-based runtime as soon as its performance diverges by only 25%
Thank You
- Debashis Ganguly
- Ph.D. Student, Computer Science Department, University of Pittsburgh
- debashis@cs.pitt.edu
- https://people.cs.pitt.edu/~debashis/
- The Prognostic Lab
- http://www.prognosticlab.org