Palirria: Accurate On-line Parallelism Estimation for Adaptive Work-Stealing




SLIDE 1

Palirria: Accurate On-line Parallelism Estimation for Adaptive Work-Stealing

Georgios Varisteas, Mats Brorsson

PMAM, February 2014

KTH Royal Institute of Technology

SLIDE 2

Motivation

  • Increasing number of cores per die
    – Worrisome power budget
    – Unequipped OS resource management

Examples shown: Intel i7, AMD Phenom II, Intel Xeon Phi

SLIDE 3

Motivation: Scheduling

  • Keep the system utilized just enough to lower the power budget
    – Conservative core allotment
  • Allot cores so that application performance is maximized
    – Liberal core allotment

SLIDE 4

Dynamic Multiprogramming

  • Adapt allotment size to actual application processing requirements
    – Each application must provide knowledge on its exposed parallelism
    – The OS can intelligently partition available resources

SLIDE 5

Summary

  • Palirria
    – Method for estimating a task-based workload's concurrency
      • Accurate, lightweight, online, no training
    – Built upon a variation to traditional work-stealing
      • Deterministic Victim Selection (DVS) replaces victim selection in any work-stealing scheduler

➔ Good performance with fewer worker threads for workloads of irregular parallelism

SLIDE 6

Task-centric programming models

  • Expose independent computations, executable in parallel
  • Adapt easily
    – Logical, not bound to hardware

Figure labels: main, Spawn, Spawn, task, task, Sync, Sync, main

SLIDE 7

Work Stealing

  • Pre-created pool of worker threads
  • Local task queue per worker thread
  • Workers place spawned tasks in their queue
  • If worker idle:
    1. Takes a task from its own task-queue
    2. Steals from a remote task-queue (victim)
  • Victim selection: find a non-empty remote queue
    – Traditionally employs some randomness
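The idle-worker steps above can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: the names `Worker`, `find_task` and `max_attempts` are hypothetical and not taken from WOOL or any other scheduler, and the choice of which queue end the owner and the thief use is a common convention, not something the slides specify.

```python
import random
from collections import deque


class Worker:
    """One worker with a local task queue (illustrative, not a real scheduler)."""

    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()

    def spawn(self, task):
        # Spawned tasks are placed in the worker's own queue.
        self.queue.append(task)

    def pop_local(self):
        # Assumption: the owner takes the most recently spawned task.
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        # Assumption: a thief takes the oldest task of the victim.
        return victim.queue.popleft() if victim.queue else None


def find_task(worker, workers, rng, max_attempts=8):
    """Return (task, failed_steals) for an idle worker.

    Step 1: look in the worker's own queue.
    Step 2: try randomly chosen victims until a non-empty queue is found
            or max_attempts (a hypothetical bound) is exhausted.
    """
    task = worker.pop_local()
    if task is not None:
        return task, 0
    others = [w for w in workers if w is not worker]
    failed = 0
    for _ in range(max_attempts):
        if not others:
            break
        victim = rng.choice(others)  # traditional random victim selection
        task = worker.steal_from(victim)
        if task is not None:
            return task, failed
        failed += 1
    return None, failed
```

With random selection, the number of failed attempts before a task is found varies from run to run, which is the unpredictability the later slides refer to.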

SLIDE 8

From Estimation to Adaptation

  • Estimate a workload's parallelism
    – Metric for quantifying parallelism
  • Decide adequate allotment size
    – Conditions for requesting change

SLIDE 9

Parallelism Estimation: Metrics

  • Traditional black box approaches
    ➔ Measure cycles or other performance counters
    ✗ Estimate based on past behavior
    ✗ Hardware dependent
  • Could we exploit the scheduling?
    ➔ Parallelism currency: task-queue size
    ✔ Estimate based on future processing needs
    ✔ Hardware agnostic

SLIDE 10

Parallelism Estimation: Decision

  • Maybe add more workers
    – Over-utilized allotment
    – Non-empty task-queues
  • Probably need fewer workers
    – Under-utilized allotment
    – Empty task-queues

SLIDE 11

Parallelism Estimation: Issues

  • Threshold: What queue size should decide over-utilization?
  • Overhead: How many workers should satisfy this condition?
  • Balance: What if some workers are over- and others under-utilized?
  • Random victim selection hinders estimation

SLIDE 12

Scheduling Support for Parallelism Estimation

  • Must normalize work discovery latency
    – Predictable distribution of tasks among workers
  • Must infer global status from some workers
    – Uniform distribution of tasks among workers

SLIDE 13

DVS: Deterministic Victim Selection

  • Completely non-random victim selection
    ➔ Uniformly distributes tasks to all workers
    ➔ Reduces worst latency for task discovery
    ➔ Maintains performance

Paper: G. Varisteas, M. Brorsson. DVS: Deterministic Victim Selection to Improve Performance in Work-Stealing Schedulers. MULTIPROG 2014, Vienna. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-139400

SLIDE 14

DVS: Worker Classification

  • Model available workers as a virtual mesh grid
  • Classify workers based on location
    – X: vertically & horizontally from the source
    – Z: at maximum distance from the source
    – F: what remains
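The classification above can be illustrated with a short sketch. It rests on several assumptions that the slide does not state: workers are identified by grid coordinates, distance is Manhattan distance, a worker that is both on the source's row or column and at maximum distance is counted as Z, and the source itself gets its own label. The names `diamond` and `classify_workers` are hypothetical.

```python
def diamond(radius, source=(0, 0)):
    """Hypothetical allotment: all grid cells within `radius` hops of the source."""
    sx, sy = source
    return [
        (sx + dx, sy + dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
        if abs(dx) + abs(dy) <= radius
    ]


def classify_workers(coords, source=(0, 0)):
    """Map each worker coordinate to 'X', 'Z', 'F' or 'source'.

    Assumptions (not from the slides): Manhattan distance is used,
    and Z takes precedence over X for workers on the axes at maximum distance.
    """
    sx, sy = source
    dist = {c: abs(c[0] - sx) + abs(c[1] - sy) for c in coords}
    max_dist = max(dist.values())
    classes = {}
    for c in coords:
        if c == source:
            classes[c] = "source"
        elif dist[c] == max_dist:
            classes[c] = "Z"   # at maximum distance from the source
        elif c[0] == sx or c[1] == sy:
            classes[c] = "X"   # vertically or horizontally from the source
        else:
            classes[c] = "F"   # what remains
    return classes
```

For example, `classify_workers(diamond(3))` labels the outermost ring as Z, the inner cells on the two axes as X, and the remaining inner cells as F.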

SLIDE 15

Palirria: Decision Policy

  • Under-utilized: decrease
    – All workers in Z have an empty task-queue
  • Over-utilized: increase
    – All workers in X have more than L tasks in their task-queue
  • Balanced: no change
    – Otherwise
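The three-way policy above can be written as a small function. This is a sketch under assumptions: the function name `decide`, the dictionary inputs, and the order in which the two conditions are checked are all hypothetical; the slide only states the conditions themselves. Class labels X, Z and F are those of the worker classification.

```python
def decide(queue_sizes, classes, thresholds):
    """Return 'decrease', 'increase' or 'no change' for the current allotment.

    queue_sizes: worker -> number of tasks currently in its task-queue
    classes:     worker -> 'X', 'Z' or 'F'
    thresholds:  worker -> its threshold L (needed for X workers only)
    """
    z_workers = [w for w, c in classes.items() if c == "Z"]
    x_workers = [w for w, c in classes.items() if c == "X"]

    # Under-utilized: every Z worker has an empty task-queue.
    # Assumption: this check runs first; the slides do not give an order.
    if z_workers and all(queue_sizes[w] == 0 for w in z_workers):
        return "decrease"

    # Over-utilized: every X worker holds more than L tasks.
    if x_workers and all(queue_sizes[w] > thresholds[w] for w in x_workers):
        return "increase"

    # Balanced: anything else.
    return "no change"
```

Only the queues of the X and Z workers are inspected, so the cost of a decision does not grow with the number of F workers.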

SLIDES 16-23

Palirria: Over-utilization condition

  • Li > |Oi|
    – |Oi|: Number of Outer victims
  • Oi: workers that have wi as their primary victim
  • L tunes tolerance
  • L = |Oi| + 1
  • L is calculated when constructing the victim-set

Figure labels: wi; Outer victims of wi; Li > 3
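As a rough illustration of the threshold, the sketch below derives Oi and L = |Oi| + 1 from a map of primary victims. The input map and the names `outer_victims` and `thresholds` are hypothetical; how DVS actually builds each worker's victim-set is described in the cited DVS paper and is not reproduced here.

```python
from collections import defaultdict


def outer_victims(primary_victim):
    """Invert a worker -> primary victim map.

    Returns victim -> set of workers that have it as their primary victim (Oi).
    """
    outer = defaultdict(set)
    for thief, victim in primary_victim.items():
        outer[victim].add(thief)
    return outer


def thresholds(primary_victim):
    """Return worker -> L, using L = |Oi| + 1 as stated on the slide."""
    outer = outer_victims(primary_victim)
    workers = set(primary_victim) | set(primary_victim.values())
    return {w: len(outer[w]) + 1 for w in workers}


# Hypothetical example: three workers have "w" as their primary victim.
example = {"a": "w", "b": "w", "c": "w", "w": "s"}
L = thresholds(example)
```

In this example `L["w"]` is 4, which satisfies Li > 3 for a worker with three outer victims.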

SLIDE 24

ASTEAL: prominent related work

  • Metric: cycles spent on wasteful actions
    – Failed steal attempts
  • Samples the cycle counter of all workers

SLIDE 25

Palirria Evaluation

  • All implementations use the same WOOL scheduler
  • Linux on a 48-core Opteron NUMA system

SLIDE 26

Accuracy

  • Dynamically changed allotment size over time
  • WOOL: best fixed-size execution time

SLIDE 27

Accuracy: irregular workloads

SLIDE 28

Accuracy: regular workloads

SLIDE 29

Wastefulness

  • Percentage of the average per-worker execution time spent:
    – idling
    – on failed steal attempts

Figure legend:
  • n: fixed n workers
  • AS: ASTEAL adaptive
  • PA: Palirria adaptive
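One way to read the wastefulness metric is as a ratio of per-worker averages. The sketch below follows that reading; the function name `wastefulness`, the dictionary keys, and the assumption that all times share one unit are hypothetical, and the slides do not say how the measurements were collected.

```python
def wastefulness(per_worker):
    """Percentages of the average per-worker execution time spent wastefully.

    per_worker: list of dicts with keys 'total', 'idle' and 'failed_steal',
                all in the same time unit (hypothetical input format).
    """
    n = len(per_worker)
    avg_total = sum(w["total"] for w in per_worker) / n
    avg_idle = sum(w["idle"] for w in per_worker) / n
    avg_failed = sum(w["failed_steal"] for w in per_worker) / n
    return {
        "idle_pct": 100 * avg_idle / avg_total,
        "failed_steal_pct": 100 * avg_failed / avg_total,
    }
```

For two workers that each run for 10 time units, with 1 and 3 units idle and 2 and 0 units of failed steals, this yields 20% idling and 10% failed steal attempts.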

SLIDE 30

Wastefulness: irregular workloads

SLIDE 31

Wastefulness: regular workloads

SLIDE 32

Conclusions

  • Non-random workload distribution techniques
    – Are efficient
    – Enable accurate estimation of parallelism
  • Task-queue size
    – Quantifies future parallelism
    – Is hardware agnostic

SLIDE 33

Summary

  • Palirria
    – Method for estimating a task-based workload's concurrency
      • Accurate, lightweight, online, no training
    – Built upon a variation to traditional work-stealing
      • Deterministic Victim Selection (DVS) replaces victim selection in any work-stealing scheduler

➔ Good performance with fewer worker threads for workloads of irregular parallelism

SLIDE 34

Thank you

SLIDE 35

Dynamic Resource Allocation

  • The operating system knows resource availability
  • The application runtime knows resource requirements

SLIDE 36

Two Level Scheduling Scheme

SLIDE 37

Flow of Tasks

Figure labels: Parallel program; sequence of parallel sections; One parallel section

SLIDE 38

Flow of Tasks

Figure labels: main, Spawn, task

SLIDE 39

Task Scheduling Issues

  • Adaptation of allotment size
    – Dynamically estimate actual parallelism
    ➔ Predictable distribution of tasks
  • Uniform distribution
    – Available tasks equally distributed
    ➔ Controllable distribution of tasks

SLIDE 40

Work-stealing

  • Victim selection
    – Random
      • Uncontrollable distribution
    – Semi-random (leap-frogging)
      • Unpredictable distribution
    – Non-random?
      • Controllable and predictable distribution
      • Can it be as fast?

SLIDE 41

DVS: Deterministic Victim Selection

SLIDE 42

DVS: Deterministic Victim Selection

SLIDE 43

DVS: Workers' Useful Time

SLIDE 44

DVS: First successful steal latency

SLIDE 45

DVS: Execution time

SLIDE 46

DVS: Execution time
