Palirria: Accurate On-line Parallelism Estimation for Adaptive Work-Stealing
Georgios Varisteas, Mats Brorsson
PMAM, February 2014
KTH Royal Institute of Technology

Motivation
● Increasing number of cores per die
  – Worrisome power budget
  – Unequipped OS resource management
(Pictured: Intel i7, AMD Phenom II, Intel Xeon Phi)

Motivation: Scheduling
● Keep the system utilized just enough to lower the power budget
  – Conservative core allotment
● Allot cores so that application performance is maximized
  – Liberal core allotment

Dynamic Multiprogramming
● Adapt allotment size to actual application processing requirements
  – Each application must provide knowledge on its exposed parallelism
  – The OS can intelligently partition available resources

Summary
● Palirria
  – Method for estimating a task-based workload's concurrency
    ● Accurate, lightweight, online, no training
  – Built upon a variation of traditional work-stealing
    ● Deterministic Victim Selection (DVS) replaces victim selection in any work-stealing scheduler
➔ Good performance with fewer worker threads for workloads of irregular parallelism

Task-centric programming models
● Expose independent computations, executable in parallel
● Adapt easily
  – Logical, not bound to hardware
(Figure: Spawn/Sync diagram between main and task)

Work Stealing
● Pre-created pool of worker threads
● Local task queue per worker thread
● Workers place spawned tasks in their own queue
● If a worker is idle:
  1. It takes a task from its own task-queue
  2. If that queue is empty, it steals from a remote task-queue (the victim)
● Victim selection: find a non-empty remote queue
  – Traditionally employs some randomness
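
The idle-worker steps above can be sketched as follows. This is only a minimal, single-threaded illustration of traditional work-stealing with random victim selection, not the WOOL scheduler or DVS; the names `Worker`, `spawn` and `next_task` are hypothetical, and the choice to pop local work from one end and steal from the other is a common convention assumed here rather than something stated on the slide.

```python
import random
from collections import deque


class Worker:
    """Hypothetical worker that owns a local task queue."""

    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()

    def spawn(self, task):
        # Spawned tasks are placed in the worker's own queue.
        self.queue.append(task)


def next_task(worker, workers, rng):
    """Return the next task for an idle worker, or None if none was found."""
    # 1. Prefer local work: take the most recently spawned task.
    if worker.queue:
        return worker.queue.pop()
    # 2. Otherwise visit the other workers in random order (assumed policy)
    #    and steal the oldest task from the first non-empty queue.
    victims = [w for w in workers if w is not worker]
    rng.shuffle(victims)
    for victim in victims:
        if victim.queue:
            return victim.queue.popleft()
    return None  # every remote queue was empty
```

Because the visiting order is random, which worker ends up holding which task cannot be predicted in advance.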

From Estimation to Adaptation
● Estimate a workload's parallelism
  – Metric for quantifying parallelism
● Decide adequate allotment size
  – Conditions for requesting change

Parallelism Estimation: Metrics
● Traditional black-box approaches
  ➔ Measure cycles or other performance counters
  ✗ Estimate based on past behavior
  ✗ Hardware dependent
● Could we exploit the scheduling?
  ➔ Parallelism currency: task-queue size
  ✔ Estimate based on future processing needs
  ✔ Hardware agnostic

Parallelism Estimation: Decision
● Maybe add more workers
  – Over-utilized allotment
  – Non-empty task-queues
● Probably need fewer workers
  – Under-utilized allotment
  – Empty task-queues

Parallelism Estimation: Issues
● Threshold: What queue size should indicate over-utilization?
● Overhead: How many workers must satisfy this condition?
● Balance: What if some workers are over-utilized and others under-utilized?
● Random victim selection hinders estimation

Scheduling Support for Parallelism Estimation
● Must normalize work discovery latency
  – Predictable distribution of tasks among workers
● Must infer global status from some workers
  – Uniform distribution of tasks among workers

DVS: Deterministic Victim Selection
● Completely non-random victim selection
  ➔ Uniformly distributes tasks to all workers
  ➔ Reduces worst latency for task discovery
  ➔ Maintains performance
Paper: G. Varisteas, M. Brorsson. DVS: Deterministic Victim Selection to Improve Performance in Work-Stealing Schedulers. MULTIPROG 2014, Vienna. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-139400

DVS: Worker Classification
● Model available workers as a virtual mesh grid
● Classify workers based on location
  – X: vertically & horizontally from the source
  – Z: at maximum distance from the source
  – F: what remains
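
The classification above could look roughly like this on a grid of integer coordinates. The slide does not define the distance measure or how overlaps are resolved, so this sketch assumes Manhattan distance and assumes that a worker on the source's row or column that is also at maximum distance counts as Z; the function name `classify_workers` is hypothetical.

```python
def classify_workers(coords, source):
    """Split mesh coordinates into classes X, Z and F relative to `source`."""
    sx, sy = source

    def dist(c):
        # Assumption: distance on the virtual mesh is Manhattan distance.
        return abs(c[0] - sx) + abs(c[1] - sy)

    others = [c for c in coords if c != source]
    max_dist = max((dist(c) for c in others), default=0)

    classes = {"X": set(), "Z": set(), "F": set()}
    for c in others:
        if dist(c) == max_dist:
            classes["Z"].add(c)  # at maximum distance (assumed to win over X)
        elif c[0] == sx or c[1] == sy:
            classes["X"].add(c)  # same row or column as the source
        else:
            classes["F"].add(c)  # what remains
    return classes
```

For example, on a diamond-shaped allotment of radius 3 around `(0, 0)`, the outer ring forms Z, the remaining workers on the two axes form X, and the four diagonal neighbours of the source form F.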

Palirria: Decision Policy
● Under-utilized: decrease
  – All workers in Z have an empty task-queue
● Over-utilized: increase
  – All workers in X have more than L tasks in their task-queue
● Balanced: no change
  – Otherwise
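
The three-way policy above can be written as a small function. This is a sketch under stated assumptions: queue sizes and thresholds are passed in as plain dictionaries, the function name `decide` is hypothetical, and checking under-utilization before over-utilization is an arbitrary ordering, since the slide does not say which condition wins if both hold.

```python
def decide(queue_sizes, x_workers, z_workers, thresholds):
    """Return 'decrease', 'increase' or 'balanced' for the current allotment.

    queue_sizes: worker id -> current task-queue length
    thresholds:  worker id -> L for that worker
    """
    # Under-utilized: every worker in Z has an empty task-queue.
    if z_workers and all(queue_sizes[w] == 0 for w in z_workers):
        return "decrease"
    # Over-utilized: every worker in X holds more than L tasks.
    if x_workers and all(queue_sizes[w] > thresholds[w] for w in x_workers):
        return "increase"
    # Balanced: neither condition holds.
    return "balanced"
```

Only the queues of the X and Z workers are inspected, which is what keeps the check cheap compared with sampling every worker.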

Palirria: Over-utilization condition
● L_i > |O_i|
  – |O_i|: number of outer victims of w_i
  – Pictured example: L_i > 3
● O_i: workers that have w_i as their primary victim
● L tunes tolerance
● L_i = |O_i| + 1
● L is calculated when constructing the victim-set
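
As a rough illustration of the threshold, the sketch below derives L_i = |O_i| + 1 from a table of primary victims. The input format (a dictionary mapping each worker to its primary victim, or to `None` if it has none) and the function name are assumptions made for this example; the slides only state that L is computed while the victim-set is constructed.

```python
from collections import Counter


def thresholds_from_primary_victims(primary_victim):
    """Map each worker w_i to L_i = |O_i| + 1.

    primary_victim: worker id -> id of its primary victim (None if it has none).
    O_i is the set of workers whose primary victim is w_i.
    """
    # Count, for every worker, how many others name it as primary victim.
    outer_counts = Counter(v for v in primary_victim.values() if v is not None)
    # Counter returns 0 for workers that nobody targets, so L_i is 1 for them.
    return {w: outer_counts[w] + 1 for w in primary_victim}
```

With three workers targeting `w0`, as in the pictured example, this gives L = 4 for `w0`, the smallest integer satisfying L > 3.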

ASTEAL: prominent related work
● Metric: cycles spent on wasteful actions
  – Failed steal attempts
● Samples the cycle counter of all workers

Palirria Evaluation
● All implementations use the same WOOL scheduler
● Linux on a 48-core Opteron NUMA system

Accuracy
● Dynamically changed allotment size over time
● WOOL: best fixed-size execution time

Accuracy: irregular workloads

Accuracy: regular workloads

Wastefulness
● Percentage of the average per-worker execution time spent:
  – idling
  – on failed steal attempts
Legend: n: fixed n workers; AS: ASTEAL adaptive; PA: Palirria adaptive
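
One way to compute such a percentage is sketched below. The slide does not spell out the exact formula, so this assumes that idle time, failed-steal time and total execution time are first averaged over the workers and then expressed as a ratio; the function name and the input layout are hypothetical.

```python
def wastefulness_percent(workers):
    """Return (idle %, failed-steal %) of the average per-worker execution time.

    workers: list of (idle_time, failed_steal_time, total_time) per worker,
             all in the same unit (for example cycles).
    """
    if not workers:
        raise ValueError("at least one worker is required")
    n = len(workers)
    avg_idle = sum(idle for idle, _, _ in workers) / n
    avg_failed = sum(failed for _, failed, _ in workers) / n
    avg_total = sum(total for _, _, total in workers) / n
    return (100.0 * avg_idle / avg_total, 100.0 * avg_failed / avg_total)
```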

Wastefulness: irregular workloads

Wastefulness: regular workloads

Conclusions
● Non-random workload distribution techniques
  – Are efficient
  – Enable accurate estimation of parallelism
● Task-queue size
  – Quantifies future parallelism
  – Is hardware agnostic

Summary
● Palirria
  – Method for estimating a task-based workload's concurrency
    ● Accurate, lightweight, online, no training
  – Built upon a variation of traditional work-stealing
    ● Deterministic Victim Selection (DVS) replaces victim selection in any work-stealing scheduler
➔ Good performance with fewer worker threads for workloads of irregular parallelism

Thank you

Dynamic Resource Allocation
● The operating system knows resource availability
● The application runtime knows resource requirements

Two-Level Scheduling Scheme

Flow of Tasks
(Figure labels: parallel program; one parallel section; sequence of parallel sections)

Flow of Tasks
(Figure: tree of Spawn edges leading from main to tasks)

Task Scheduling Issues
● Adaptation of allotment size
  – Dynamically estimate actual parallelism
  ➔ Predictable distribution of tasks
● Uniform distribution
  – Available tasks equally distributed
  ➔ Controllable distribution of tasks

Work-stealing
● Victim selection
  – Random
    ● Uncontrollable distribution
  – Semi-random (leap-frogging)
    ● Unpredictable distribution
  – Non-random?
    ● Controllable and predictable distribution
    ● Can it be as fast?

DVS: Deterministic Victim Selection

DVS: Deterministic Victim Selection

DVS: Workers' Useful Time

DVS: First successful steal latency

DVS: Execution time

DVS: Execution time
