ROSS 2017
The Effect of Asymmetric Performance on Asynchronous Task-Based Runtimes
Debashis Ganguly and John R. Lange
Changing Face of HPC Environments
Goal: Can asynchronous task-based runtimes handle asymmetric performance?
- Traditional: Dedicated Resources (simulation runs on the supercomputer; visualization runs on separate storage and processing clusters)
- Future: Collocated Workloads (simulation and visualization share the same supercomputer)
- Task-based Runtimes: Potential solution
Task-based Runtimes
- Experiencing a renewal of interest in the systems community
- Assumed to better address performance variability
- Adopt (Over-)Decomposed task-based model
- Allow fine-grained scheduling decisions
- Able to adapt to asymmetric/variable performance
- But…
- Originally designed for application-induced load imbalances, e.g., adaptive mesh refinement (AMR) applications
- Performance asymmetry can be of finer granularity, e.g., variable CPU time in time-shared environments
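Why over-decomposition helps with asymmetric performance can be shown with a minimal scheduling sketch (not either runtime's API; the function names and numbers below are invented for illustration). Static even partitioning is gated by the slowest core, while greedy scheduling of many small tasks lets a slow core simply take fewer of them:

```python
import heapq

def makespan_static(task_cost, n_tasks, speeds):
    """Static partitioning: tasks are divided evenly up front, so the
    finish time is gated by the slowest worker's share."""
    per_worker = n_tasks // len(speeds)
    return max(per_worker * task_cost / s for s in speeds)

def makespan_overdecomposed(task_cost, n_tasks, speeds):
    """Greedy list scheduling over many small tasks: each worker pulls the
    next task as soon as it is free, so a slow worker simply takes fewer."""
    free = [(0.0, s) for s in speeds]   # (time the worker becomes free, speed)
    heapq.heapify(free)
    for _ in range(n_tasks):
        t, s = heapq.heappop(free)
        heapq.heappush(free, (t + task_cost / s, s))
    return max(t for t, _ in free)

# 11 full-speed cores plus one core with only 50% capacity, 120 unit tasks:
speeds = [1.0] * 11 + [0.5]
static = makespan_static(1.0, 120, speeds)      # slow core gets 10 tasks -> 20.0
dynamic = makespan_overdecomposed(1.0, 120, speeds)
```

With fine-grained tasks the makespan approaches the ideal of total work divided by total capacity, rather than being pinned to the slow core.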
Basic Experimental Evaluation
- Synthetic scenario: emulate performance asymmetry in a time-shared configuration
- Static and predictable setting: benchmark on 12 cores, sharing one core with a background workload
- Vary the percentage of CPU time given to the competing workload
- Environment: 12-core dual-socket compute node, hyperthreading disabled
- Used cpulimit to control the percentage of CPU time
Workload Configuration
(Figure: workload configurations across Node 0 and Node 1: an 11-core setting that leaves the 12th core to the idle/competing workload, and a 12-core setting that shares the 12th core between the benchmark and the competing workload.)
Experimental Setup
- Evaluated two different runtimes:
- Charm++: LeanMD
- HPX-5: LULESH, HPCG, LibPXGL
- Competing Workload:
- Prime Number Generator: entirely CPU bound, with a minimal memory footprint
- Kernel Compilation: stresses internal OS features such as the I/O and memory subsystems
Charm++
- Iterative, over-decomposed applications
- Object-based programming model
- Tasks implemented as C++ objects
- Objects can migrate across intra- and inter-node boundaries
Charm++
- A separate centralized load balancer component
- Preempts application progress
- Actively migrates objects based on current state
- Causes computation to block across the other cores
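The actual RefineLB/RefineSwapLB strategies live inside Charm++; the sketch below is only an assumed, simplified refine-style pass (helper names and the `tol` threshold are invented) that migrates small objects off the most overloaded worker, with worker speed given as relative CPU capacity:

```python
def refine_balance(assignments, obj_load, speeds, tol=1.05):
    """Greedy refine-style pass: repeatedly migrate the smallest object off
    the most overloaded worker until the maximum normalized load falls
    within tol of the average. Returns the list of migrations performed."""
    def load(w):
        return sum(obj_load[o] for o in assignments[w]) / speeds[w]

    avg = sum(load(w) for w in assignments) / len(assignments)
    migrations = []
    for _ in range(len(obj_load)):          # bounded number of moves
        heavy = max(assignments, key=load)
        light = min(assignments, key=load)
        if load(heavy) <= tol * avg:
            break
        obj = min(assignments[heavy], key=obj_load.get)
        # migrate only if it actually reduces the maximum load
        if load(light) + obj_load[obj] / speeds[light] >= load(heavy):
            break
        assignments[heavy].remove(obj)
        assignments[light].append(obj)
        migrations.append((obj, heavy, light))
    return migrations
```

For example, with two full-speed workers and one at half capacity holding four unit-load objects each, two migrations off the slow worker bring the maximum normalized load from 8 down to 5.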
(Chart: runtime and percentage performance degradation vs. percentage of CPU utilized by the prime-number-generator background workload on the 12th core, for GreedyLB, RotateLB, RandCentLB, RefineLB, RefineSwapLB, and no load balancer.)
Choice of Load Balancer Matters
We selected RefineSwapLB for the rest of the experiments.
- Compares the performance of different load-balancing strategies against running without any load balancer
198% divergence
Invocation Frequency Matters
- MetaLB:
- Invokes the load balancer less frequently, based on heuristics
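MetaLB's real heuristics are not reproduced here; the sketch below is an assumed, simplified gate of the same flavor (class name, threshold, and cost model all invented): skip the load balancer unless the measured imbalance is large enough that the projected saving over the remaining iterations outweighs the balancing cost.

```python
class MetaLBLike:
    """Heuristic gate in the spirit of MetaLB (assumed, simplified):
    invoke the load balancer only when the migration cost should pay off."""
    def __init__(self, lb_cost, threshold=1.1):
        self.lb_cost = lb_cost          # one-time cost of a balancing pass
        self.threshold = threshold      # tolerated max/avg load ratio

    def should_balance(self, per_core_times, iters_remaining):
        avg = sum(per_core_times) / len(per_core_times)
        imbalance = max(per_core_times) / avg
        # projected saving if balancing restored the average per iteration
        saving = (max(per_core_times) - avg) * iters_remaining
        return imbalance > self.threshold and saving > self.lb_cost
```

With one core twice as slow and many iterations left, the gate fires; with a mild 5% imbalance, or near the end of the run, it stays closed.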
(Chart: load-balancing overhead and total runtime of RefineSwapLB, with and without MetaLB, vs. percentage of CPU utilized by the prime-number-generator background workload on the 12th core.)
We enabled MetaLB for our experiments.
Charm++: LeanMD
- 12 cores are worse than 11 cores…
- …unless the application gets at least 75% of the shared core's capacity
- If the application cannot get more than 75% of the core's capacity, it is better off ignoring the core completely
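A back-of-envelope model (not from the talk; it assumes perfectly divisible work) shows where that cross-over sits between two extremes. A perfectly balanced task-based run benefits from any leftover capacity c on the shared core, while a BSP run gated by its slowest rank only breaks even against 11 cores at c = 11/12, about 92%. The observed 75% falls in between:

```python
def ideal_runtime(work, c):
    # perfect dynamic balancing: the shared core contributes capacity c
    return work / (11 + c)

def bsp_runtime(work, c):
    # BSP: 12 equal ranks, finish gated by the rank on the shared core
    return (work / 12) / c

def eleven_core_runtime(work):
    # ignore the shared core entirely
    return work / 11

W = 132.0   # arbitrary amount of work, chosen for round numbers
bsp_breakeven = bsp_runtime(W, 11 / 12)      # equals eleven_core_runtime(W)
ideal_quarter = ideal_runtime(W, 0.25)       # still beats 11 cores
```

So the 75% threshold measures how far the runtime's balancing falls short of the ideal, not a property of the hardware.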
(Chart: sensitivity to the percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 11 threads on 11 cores, 12 threads on 12 cores, and the theoretical expectation for 12 threads on 12 cores. Callouts: 25% to background workload; 53% divergence.)
Charm++: LeanMD
More variable, but consistent mean performance.
(Chart: sensitivity to the percentage of CPU utilized by the kernel-compilation background workload on the 12th core; series: 12 threads on 12 cores and 11 threads on 11 cores. Callout: 25% to background workload.)
HPX-5
- Parcel:
- Contains a computational task and a reference to the data the task operates on
- Follows the Work-First principle of Cilk-5
- Every scheduling entity processes parcels from the top of its scheduling queue
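The parcel concept can be sketched in a few lines (HPX-5 itself is a C library; the class and method names below are invented for illustration): a parcel pairs an action with the data it operates on, and a worker pushes newly spawned parcels to the top of its queue so the child runs before the parent's continuation, in work-first style.

```python
from collections import deque

class Parcel:
    """A parcel pairs a computational action with a reference to the data
    it operates on (a Python sketch of the concept, not the HPX-5 API)."""
    def __init__(self, action, data):
        self.action, self.data = action, data

class Worker:
    def __init__(self):
        self.queue = deque()             # newest parcels sit at the top
    def spawn(self, parcel):
        self.queue.appendleft(parcel)    # work-first: run the child next
    def run_one(self):
        p = self.queue.popleft()         # process from the top of the queue
        p.action(p.data)

w = Worker()
order = []
w.spawn(Parcel(order.append, "parent-continuation"))
w.spawn(Parcel(order.append, "child"))
w.run_one()   # the freshly spawned child runs first, Cilk-5 style
w.run_one()   # then the parent's continuation
```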
HPX-5
- Implemented using random work stealing
- No centralized decision-making process
- The overhead of work stealing is borne by the stealer
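A toy round-based simulation conveys the idea (this is not HPX-5's scheduler; the function and parameter names are invented): busy workers pop from the top of their own queue, while an idle worker pays the cost of picking a random non-empty victim and stealing from the bottom.

```python
import random

def run_with_stealing(queues, seed=0):
    """Round-based toy simulation of random work stealing. Each round a
    worker executes one task from the top of its own queue; an idle worker
    picks a random non-empty victim and steals a task from the bottom, so
    the stealing overhead falls on the otherwise-idle stealer."""
    rng = random.Random(seed)
    executed = [0] * len(queues)
    while any(queues):
        for i, q in enumerate(queues):
            if q:
                q.pop()                                 # own work: top of queue
                executed[i] += 1
            else:
                victims = [j for j, v in enumerate(queues) if v]
                if victims:
                    queues[rng.choice(victims)].pop(0)  # steal from the bottom
                    executed[i] += 1                    # stealer runs the task
    return executed
```

Starting with all 12 tasks on one worker and three idle workers, stealing spreads the work evenly without any central coordinator.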
OpenMP: LULESH
- Overall application performance is determined by the slowest rank
- Vulnerable to asymmetries in performance
- Relies on collective-based communication
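The slowest-rank effect is easy to quantify with a minimal sketch (numbers invented for illustration): a collective at the end of each step makes every rank wait for the slowest, so with one of 12 ranks at half speed the whole step doubles and machine utilization drops to 13/24, barely over half.

```python
def bsp_step_time(per_rank_times):
    # a collective at the end of each step makes every rank wait for the slowest
    return max(per_rank_times)

def utilization(per_rank_times):
    # fraction of total core-time spent computing rather than waiting
    return sum(per_rank_times) / (len(per_rank_times) * bsp_step_time(per_rank_times))

times = [1.0] * 11 + [2.0]    # the rank sharing its core runs at half speed
step = bsp_step_time(times)   # the whole step stretches to the slow rank's time
util = utilization(times)     # 13/24: eleven fast ranks idle half of each step
```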
(Chart: sensitivity to the percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation. Callout: 185% divergence.)
HPX-5: LULESH
- No cross-over point
- 12 cores are consistently worse than 11 cores
- A traditional BSP application implemented using task-based programming
(Chart: sensitivity to the percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation. Callout: 42% divergence.)
(Chart: runtime and percentage performance degradation vs. percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation.)
HPX-5: HPCG
- Better than the theoretical expectation
- 12 cores are consistently worse than 11 cores
- Another BSP application implemented in the task-based model
(Chart callouts: 10% to background workload; 5% divergence.)
HPX-5: LibPXGL
- No cross-over point
- 12 cores are consistently worse than 11 cores
- An asynchronous graph processing library
- A more natural fit for the asynchronous task-based model
(Chart: sensitivity to the percentage of CPU utilized by the prime-number-generator background workload on the 12th core; series: 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation. Callouts: 22% to background workload; 5% divergence.)
(Chart: runtime and percentage performance degradation vs. percentage of CPU speed consumed by the kernel-compilation background workload on the 12th core; series: 12 threads on 12 cores and 11 threads on 11 cores.)
HPX-5: Kernel Compilation
(Three charts: LULESH, HPCG, and LibPXGL; runtime and percentage performance degradation vs. percentage of CPU speed consumed by the kernel-compilation background workload on the 12th core; series: 12 threads on 12 cores and 11 threads on 11 cores.)
A more immediate decline, instead of a gradual one.
Conclusion
- Performance asymmetry is still challenging
- Preliminary evaluation:
- Tightly controlled time-shared CPUs
- Static and consistent configuration
- Better than BSP, but…
- On average, a CPU loses its utility to a task-based runtime as soon as its performance diverges by only 25%
Thank You
- Debashis Ganguly
- Ph.D. Student, Computer Science Department, University of Pittsburgh
- debashis@cs.pitt.edu
- https://people.cs.pitt.edu/~debashis/
- The Prognostic Lab
- http://www.prognosticlab.org