

SLIDE 1

Understanding Task Scheduling Algorithms

Kenjiro Taura

1 / 51

SLIDE 2

Contents

1 Introduction
2 Work stealing scheduler
3 Analyzing execution time
   - Introduction
   - DAG model and greedy schedulers
   - Work stealing schedulers
4 Analyzing cache misses of work stealing
5 Summary

2 / 51


SLIDE 4

Introduction

in this part, we study how tasks in task parallel programs are scheduled, and what we can expect about the resulting performance

    void ms(elem * a, elem * a_end,
            elem * t, int dest) {
      long n = a_end - a;
      if (n == 1) {
        ...
      } else {
        ...
        create_task(ms(a, c, t, 1 - dest));
        ms(c, a_end, t + nh, 1 - dest);
        wait_tasks;
      }
    }

(figure: the task DAG of one run of this program, with roughly 200 tasks T0 ... T199)

4 / 51

SLIDE 5

Goals

understand a state-of-the-art scheduling algorithm (the work stealing scheduler)

execution time (without modeling communication): how much time does a scheduler take to finish a computation? in particular, how close is it to greedy schedulers?

data access (communication) cost: when a computation is executed in parallel by a scheduler, how much data is transferred (cache ↔ memory, cache ↔ cache)? in particular, how much worse (or better) is it than in the serial execution?

5 / 51


SLIDE 9

Model of computation

assume a program performs the following operations:

create_task(S): create a task that performs S
wait_tasks: wait for the completion of the tasks the current task has created (but has not yet waited for)

e.g.,

    int fib(int n) {
      if (n < 2) return 1;
      else {
        int x, y;
        create_task({ x = fib(n - 1); }); // share x
        y = fib(n - 2);
        wait_tasks;
        return x + y;
      }
    }

7 / 51
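The model above can be tried in ordinary code. Below is a minimal sketch (mine, not the lecture's runtime) that maps create_task onto spawning a Python thread and wait_tasks onto joining it; the create_task helper and its result box are hypothetical names introduced only for this illustration.

```python
import threading

def create_task(fn, *args):
    # hypothetical helper: run fn(*args) in a new thread; return the thread
    # and a box that will hold the result (threads cannot return values)
    box = {}
    def run():
        box["v"] = fn(*args)
    t = threading.Thread(target=run)
    t.start()
    return t, box

def fib(n):
    if n < 2:
        return 1
    t, box = create_task(fib, n - 1)  # create_task({ x = fib(n - 1); })
    y = fib(n - 2)                    # the parent continues concurrently
    t.join()                          # wait_tasks
    return box["v"] + y

print(fib(10))  # -> 89
```

Note that spawning one OS thread per task is exactly the overhead work stealing avoids; this sketch only illustrates the semantics of the two operations, not an efficient implementation.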

SLIDE 10

Model of computation

model an execution as a DAG (directed acyclic graph)

node: a sequence of instructions
edge: a dependency

assume no dependencies besides those induced by create_task(S) and wait_tasks

e.g. (note C1 and C2 may be subgraphs, not single nodes):

    P1
    create_task(C1);
    P2
    create_task(C2);
    P3
    wait_tasks;
    P4

(figure: the resulting DAG with nodes P1 ... P4 and subgraphs C1, C2)

8 / 51

SLIDE 11

Terminology and remarks

a single node in the DAG represents a sequence of instructions performing no task-related operations
note that a task ≠ a single node; a task = a sequence of nodes
we say a node is ready when all its predecessors have finished; we say a task is ready to mean a node of it has become ready

(figure: the DAG with nodes P1 ... P4 and subgraphs C1, C2)

9 / 51

SLIDE 12

Work stealing scheduler

a state-of-the-art scheduler of task parallel systems
the main ideas were invented in 1990:

Mohr, Kranz, and Halstead. Lazy task creation: a technique for increasing the granularity of parallel programs. ACM Conference on LISP and Functional Programming.

originally termed "lazy task creation," but essentially the same strategy is nowadays called "work stealing"

10 / 51

SLIDE 13

Work stealing scheduler: data structure

(figure: workers W0 ... Wn−1, each with a ready deque; top = executing task, bottom = steal end)

each worker maintains a "ready deque" that contains ready tasks
the top entry of each ready deque is the task the worker is executing

11 / 51

SLIDE 14

Work stealing scheduler: in a nutshell

work-first: when a task is created, the created task gets executed first (before the parent)

LIFO execution order within a worker: without work stealing, the order of execution is as if it were a serial program with

create_task(S) ≡ S
wait_tasks ≡ noop

FIFO stealing: it partitions tasks at points close to the root of the task tree

it is a practical approximation of a greedy scheduler, in the sense that any ready task can (eventually) be stolen by any idle worker

12 / 51

SLIDE 18

Work stealing scheduler in action

describing a scheduler boils down to defining the actions taken on each of the following events:

(1) create_task
(2) a worker becoming idle
(3) wait_tasks
(4) a task termination

we will see them in detail

13 / 51

SLIDE 19

Work stealing scheduler in action

(figure: the ready deques of W0 ... Wn−1 at each step)

(1) worker W encounters create_task(S):
  1. W pushes S to its deque
  2. and immediately starts executing S

(2) a worker with an empty deque repeats work stealing:
  1. it picks a random worker V as the victim
  2. and steals the task at the bottom of V's deque

(3) a worker W encounters wait_tasks: there are two cases
  1. the tasks to wait for have finished ⇒ W just continues the task
  2. otherwise ⇒ W pops the task from its deque (the task is now blocked, and W starts work stealing)

(4) when W encounters the termination of a task T, W pops T from its deque; there are two cases about T's parent P:
  1. P has been blocked and now becomes ready again ⇒ W enqueues P and continues executing it
  2. otherwise ⇒ no particular action; W continues with the next task in its deque, or starts work stealing if the deque is empty

14 / 51
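The deque discipline behind these events can be sketched with a double-ended queue. This toy (names are mine, not a real scheduler) shows only the two ends involved: the owner works LIFO at the top, and thieves steal FIFO from the bottom.

```python
from collections import deque

class Worker:
    def __init__(self):
        self.tasks = deque()     # left end = bottom (steal side), right end = top

    def push(self, task):
        self.tasks.append(task)  # create_task: push on top; the child runs first

    def pop(self):
        # owner takes from the top: LIFO, preserving the serial execution order
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # thief takes from the bottom: FIFO, a task close to the task-tree root
        return victim.tasks.popleft() if victim.tasks else None

owner, thief = Worker(), Worker()
for t in ["root", "child", "grandchild"]:
    owner.push(t)

print(thief.steal_from(owner))  # root        (bottom: near the root of the tree)
print(owner.pop())              # grandchild  (top: the most recently created)
```

The choice of ends is the whole point: stealing from the bottom grabs large, old subcomputations, while the owner's LIFO end keeps the serial order and cache locality.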

SLIDE 36

A note about the cost of operations

(figure: a worker's deque with the child S on top of its parent P)

with work stealing, a task creation is cheap, unless the parent is stolen:

  1. a task gets created
  2. the control jumps to the new task
  3. when it finishes, the control returns to the parent (as the parent has not been stolen)

15 / 51

SLIDE 37

A note about the cost of operations

much like a procedure call, except:

the parent and the child each need a separate stack, as the parent might be executed concurrently with the child
as the parent might be executed without returning from the child, the parent generally cannot assume callee-saved registers are preserved

the net overhead is ≈ 100-200 instructions from task creation to termination

15 / 51

SLIDE 38

What you must remember when using work stealing systems

when using a (good) work stealing scheduler, don't try to match the number of tasks to the number of processors

bad idea 1: create ≈ P tasks when you have P processors
bad idea 2 (task pooling): keep exactly P tasks all the time and let each processor grab work

these are effective with OS-managed threads or processes, but not with work stealing schedulers

remember: keep the granularity of a task above a constant factor of the task creation overhead (so that the relative overhead is a sufficiently small constant, e.g., 2%)

good idea: make the granularity ≥ 5000 cycles

16 / 51
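The granularity advice is usually implemented as a cutoff in the recursion. The sketch below (code and the CUTOFF value are my own illustration, not from the lecture) creates tasks only while subproblems are large, and runs small subproblems as plain serial loops so each task amortizes its creation overhead.

```python
import threading

CUTOFF = 4096  # assumption: large enough that a leaf dwarfs task-creation cost

def par_sum(a, lo, hi):
    # returns sum(a[lo:hi]); creates a task only for large halves
    if hi - lo <= CUTOFF:
        return sum(a[lo:hi])          # serial leaf: no task overhead at all
    mid = (lo + hi) // 2
    left = {}
    t = threading.Thread(
        target=lambda: left.setdefault("v", par_sum(a, lo, mid)))
    t.start()                         # create_task(left half)
    right = par_sum(a, mid, hi)       # the parent continues with the right half
    t.join()                          # wait_tasks
    return left["v"] + right

a = list(range(100_000))
print(par_sum(a, 0, len(a)) == sum(a))  # True
```

The number of tasks created this way grows with the input, not with P, which is exactly what a work stealing scheduler wants.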



SLIDE 41

Analyzing execution time of work stealing

we analyze the execution time of the work stealing scheduler in terms of T1 (total work) and T∞ (critical path)

Blumofe et al. Scheduling multithreaded computations by work stealing. Journal of the ACM 46(5). 1999.

due to the random choices of victims, the upper bound is necessarily probabilistic (e.g., TP ≤ ... with a probability ≥ ...)
for mathematical simplicity, we are satisfied with a result about the average (expected) execution time

19 / 51

SLIDE 42

Analyzing execution time of work stealing

the main result: with P processors,

E(TP) ≤ T1/P + c T∞,

with c a small constant reflecting the cost of a work steal

remember the greedy scheduler theorem? TP ≤ T1/P + T∞

20 / 51


SLIDE 44

Recap: DAG model

recall the DAG model of computation

T1: total work (= execution time on a single processor)
T∞: critical path (= execution time on arbitrarily many processors)
TP: execution time on P processors

two obvious lower bounds on the execution time of any scheduler:

TP ≥ T1/P and TP ≥ T∞

or equivalently,

TP ≥ max(T1/P, T∞)

22 / 51
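These quantities are easy to compute mechanically. The sketch below (the five-node DAG is my own toy example, not from the lecture) computes T1, T∞, and the lower bound max(T1/P, T∞) for unit-time nodes.

```python
# successors of each node; every node takes a unit time
succ = {"s": ["a", "b"], "a": ["t"], "b": ["c"], "c": ["t"], "t": []}

T1 = len(succ)                        # total work = number of unit-time nodes

def longest_from(v):
    # length (in nodes) of the longest path starting at v
    return 1 + max((longest_from(w) for w in succ[v]), default=0)

T_inf = longest_from("s")             # critical path: s -> b -> c -> t
P = 2
print(T1, T_inf, max(T1 / P, T_inf))  # 5 4 4.0
```

Here even with P = 2 the critical path dominates: no scheduler can beat 4 steps on this DAG.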

SLIDE 45

Recap: greedy scheduler

greedy scheduler: "a worker is never idle as long as any ready task is left"

the greedy scheduler theorem: any greedy scheduler achieves the following upper bound

TP ≤ T1/P + T∞

considering that both T1/P and T∞ are lower bounds, this shows any greedy scheduler is within a factor of two of optimal

23 / 51
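The theorem can also be checked numerically. The sketch below (my own, under the unit-time lockstep assumptions introduced on the next slides) runs a greedy schedule on a small random DAG and checks TP ≤ T1/P + T∞.

```python
import random

def greedy_time(pred, P):
    # pred maps node -> set of predecessors; all nodes take a unit time
    done, steps = set(), 0
    while len(done) < len(pred):
        ready = [v for v in pred if v not in done and pred[v] <= done]
        for v in ready[:P]:          # greedy: never idle while work remains
            done.add(v)
        steps += 1
    return steps

random.seed(0)
n = 40
# random DAG: node i may depend on earlier nodes only (hence acyclic)
pred = {i: {j for j in range(i) if random.random() < 0.1} for i in range(n)}

def t_inf(v, memo={}):
    # longest chain of unit-time nodes ending at v
    if v not in memo:
        memo[v] = 1 + max((t_inf(u) for u in pred[v]), default=0)
    return memo[v]

T1, Tinf, P = n, max(t_inf(v) for v in pred), 4
TP = greedy_time(pred, P)
print(Tinf <= TP <= T1 / P + Tinf)  # True
```

Both inequalities hold by construction: TP can never beat the critical path, and the greedy bound caps it from above.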

SLIDE 46

Proof of the greedy scheduler theorem: settings

for the sake of simplifying the analysis, assume all nodes take a unit time to execute (longer nodes can be modeled by a chain of unit-time nodes)
there are P workers
workers execute in a "lockstep" manner

24 / 51

SLIDE 47

Proof of the greedy scheduler theorem: settings

(figure: a DAG with its ready nodes highlighted)

in each time step, either of the following happens:

(S) there are ≥ P ready tasks ⇒ each worker executes some ready task (Saturated)
(U) there are ≤ P ready tasks ⇒ every ready task is executed by some worker (Unsaturated)

25 / 51

SLIDE 51

Proof of the greedy scheduler theorem

there is a path from the start node to the end node along which a node is always ready (let's call such a path a ready path)

exercise: prove there is a ready path

at each step, either of the following must happen:

(S) all P workers execute a node, or
(U) the ready node on the path gets executed

(S) happens ≤ T1/P times and (U) happens ≤ T∞ times
therefore, the end node will be executed within TP ≤ T1/P + T∞ steps

note: you can actually prove TP ≤ T1/P + (1 − 1/P) T∞

26 / 51


SLIDE 59

What about the work stealing scheduler?

what is the difference between the genuine greedy scheduler and the work stealing scheduler?

the greedy scheduler finds ready tasks with zero delay
any practically implementable scheduler will inevitably incur some delay, from the time a node becomes ready until the time it gets executed
in the work stealing scheduler, the delay is the time to randomly search other workers' deques for ready tasks, without knowing which deques have tasks to steal

28 / 51

SLIDE 62

Analyzing the work stealing scheduler: settings

a setting similar to the greedy scheduler:

all nodes take a unit time
P workers

each worker does the following in each time step:

its deque is not empty ⇒ it executes the node designated by the algorithm
its deque is empty ⇒ it attempts to steal a node; if the attempt succeeds, the stolen node is executed in the next step

remarks on work stealing attempts:

if the chosen victim has no ready tasks (besides the one it is executing), the attempt fails
when two or more workers choose the same victim, only one can succeed in a single time step

29 / 51

SLIDE 65

The overall strategy of the proof

in each time step, each processor either executes a node or attempts a steal ⇒ if we can estimate the number of steal attempts (successful or not), we can estimate the execution time, as:

TP = (T1 + #steal attempts) / P

our goal is to estimate the number of steal attempts

30 / 51

SLIDE 66

Analyzing the work stealing scheduler

how many steal attempts?

similar to the proof for the greedy scheduler, consider a ready path (a path from the start node to the end node along which a node is always ready)
the crux is to estimate how many steal attempts are enough to make "progress" along the ready path

31 / 51

SLIDE 68

The number of steal attempts: the key observation

a task at the bottom of a deque is stolen by a given steal attempt with probability 1/P
thus, on average, such a task is stolen after ≈ P steal attempts, and is executed in the next step (like rolling a die: on average, you get the first six after six rolls)

we are going to extend the argument to any ready task along a ready path, and establish the average number of steal attempts needed for any ready task to get executed

32 / 51
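The key observation is just the mean of a geometric distribution, which is easy to check numerically (the sketch below is mine): if each steal attempt hits a given victim with probability 1/P, the expected number of attempts until that victim's bottom task is stolen is P.

```python
import random

random.seed(1)
P, trials = 8, 200_000

def attempts_until_stolen():
    # count attempts until a uniformly random victim choice hits our deque
    n = 0
    while True:
        n += 1
        if random.randrange(P) == 0:  # this attempt picked our victim
            return n

avg = sum(attempts_until_stolen() for _ in range(trials)) / trials
print(abs(avg - P) < 0.2)  # True: the empirical mean is close to P
```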

SLIDE 71

How many steal attempts occur before any ready task gets executed?

there are five types of edges:

(A) create_task → the child
(B) create_task → the continuation
(C) wait_tasks → the continuation
(D) the last node of a task → the parent's continuation after the corresponding wait_tasks
(E) non-task node → the continuation

(figure: edge types (A)-(D) around create_task and wait_tasks; continuations x and y in a deque)

the successor of a type (A), (C), (D), or (E) edge is executed immediately after its predecessor; there are no delays on edges of these types
only the successor of a type (B) edge may need a steal attempt to get executed; as discussed, once a task is at the bottom of a deque, it needs ≈ P steal attempts on average until it gets stolen
note that the successor of a type (B) edge (the continuation of a task creation) is not necessarily at the bottom of a deque; e.g., y cannot be stolen until x has been stolen
stealing such a task requires accordingly many steal attempts; e.g., stealing y requires 2P attempts on average (P to steal x and another P to steal y)

33 / 51

SLIDE 76

When does a task become stealable?

in general, in order for the continuation of a create_task (call it n) to be stolen, the continuations of all create_tasks along the path from the start node to n must have been stolen

(figure: nested create_tasks along a ready path, ending at n)

34 / 51

SLIDE 77

Summary of the proof

now we have all the ingredients to finish the proof:

the average number of steal attempts to finish the ready path
  ≈ P × the length of the ready path
  ≤ P T∞

therefore, on average,

TP ≤ (T1 + P T∞) / P = T1/P + T∞

35 / 51

SLIDE 80

Extensions

we assumed a steal attempt takes a single time step, but the result generalizes to a setting where a steal attempt takes a time steps: E(TP) ≤ T1/P + a T∞
we can also probabilistically bound the execution time:

the basis is that the probability that a critical node needs ≥ cP steal attempts to be executed is ≤ e^(−c), since (1 − 1/P)^(cP) ≤ e^(−c)
based on this, we bound the probability that a path of length l takes ≥ ClP steal attempts, for a large enough constant C

36 / 51
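The inequality (1 − 1/P)^(cP) ≤ e^(−c) follows from ln(1 − x) ≤ −x; a quick numeric spot check (mine) over a few values of P and c:

```python
import math

# (1 - 1/P)**(c*P) <= e**(-c): the chance that cP independent attempts,
# each hitting with probability 1/P, all miss, bounded by an exponential
ok = all((1 - 1 / P) ** (c * P) <= math.exp(-c) + 1e-12
         for P in (2, 4, 16, 256) for c in (0.5, 1, 2, 5))
print(ok)  # True
```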


SLIDE 82

Analyzing cache misses of work stealing

we would like to know the amount of data transferred between a processor's cache and { main memory, other caches } under a task parallel scheduler
in particular, we would like to understand how much worse (or better) it can be than in the serial execution

(figure: memory hierarchy: memory controller, L3 cache, per-core L2 and L1 caches, hardware threads)

38 / 51

SLIDE 84

An analysis methodology for serial computation

we have learned how to analyze data transfer between the cache and main memory on single-processor machines
the key was to identify "cache-fitting" subcomputations (working set size ≤ C words); a cache-fitting subcomputation induces ≤ C words of data transfer

(figure: a cache of capacity C and main memory of infinite capacity; each cache-fitting subcomputation transfers ≤ C words)

39 / 51

SLIDE 87

Minor remarks (data transfer vs. cache misses)

we hereafter use "a single cache miss" to mean "a single data transfer from/to a cache"
in real machines, some data transfers do not induce cache misses, due to prefetching; we say "cache misses" for simplicity

40 / 51

SLIDE 88

What's different in parallel execution?

the argument that "the cache misses of a cache-fitting subcomputation are ≤ C" no longer holds in parallel execution
consider two subcomputations A and B:

    create_task({ A });
    B

assume A and B together fit in the cache; even so, if A and B are executed on different processors, an access that would be a cache hit in B in the serial execution may miss

41 / 51

SLIDE 90

What's different in parallel execution?

so a parallel execution might increase cache misses, but by how much?

Acar, Blelloch, and Blumofe. The data locality of work stealing. SPAA '00: Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures.

42 / 51

SLIDE 91

Problem settings

P processors (≡ P workers)
caches are private to each processor (no shared caches)
consider only a single-level cache, with a capacity of C words
LRU replacement: i.e., a cache holds the most recently accessed C distinct words

(figure: per-processor caches, each of capacity C, and main memory of infinite capacity)

43 / 51

SLIDE 92

The key observation

we have learned that the work stealing scheduler tends to preserve much of the serial execution order ⇒ extra cache misses are caused by work stealing
a work stealing essentially brings a subcomputation to a processor with an unknown cache state

44 / 51

SLIDE 93

The key questions

key question 1: how many extra misses can occur when a subcomputation moves to a processor with an unknown cache state?

(figure: two runs executing identical instructions from different initial cache states; some hits become misses)

key question 2: how many times do work stealings happen? (we already know an answer)

45 / 51

SLIDE 94

Roadmap

(1) bound the number of extra cache misses that occur each time a node is drifted (i.e., executed in a different order from the serial execution)
(2) we know an upper bound on the number of steals
(3) from (2), bound the number of drifted nodes
(4) combine (1) and (3) to derive an upper bound on the total number of extra cache misses

46 / 51

SLIDE 95

Extra misses per drifted node

when caches are LRU, two cache states converge to an identical state after no more than C cache misses have occurred in either cache

(figure: two different initial cache states, each incurring ≤ C transfers (misses) until they coincide)

this is because an LRU cache holds the most recently accessed C distinct words
∴ the extra cache misses per drifted node are ≤ C

47 / 51
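The convergence argument can be simulated. The sketch below (my own toy LRU, not from the paper) feeds one access stream to two LRU caches with disjoint initial contents and checks that their states coincide and their miss counts differ by at most C.

```python
import random
from collections import OrderedDict

C = 8

class LRU:
    def __init__(self, init):
        self.lines = OrderedDict((w, None) for w in init)  # oldest first
        self.misses = 0

    def access(self, w):
        if w in self.lines:
            self.lines.move_to_end(w)           # hit: refresh recency
        else:
            self.misses += 1                    # miss: fetch w ...
            self.lines[w] = None
            if len(self.lines) > C:
                self.lines.popitem(last=False)  # ... and evict the LRU line

random.seed(2)
a = LRU(range(C))               # initial state 1
b = LRU(range(100, 100 + C))    # initial state 2, disjoint from state 1
for w in (random.randrange(50) for _ in range(1000)):
    a.access(w)
    b.access(w)

print(list(a.lines) == list(b.lines))  # True: the states have converged
print(b.misses - a.misses <= C)        # True: at most C extra misses
```

Once the shared stream has touched C distinct words, both caches hold exactly those words in the same recency order, so every later access hits or misses identically in both.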

SLIDE 96

A bound on the drifted nodes

let v be a node in the DAG, and u the node that would immediately precede v in the serial execution
we say v is drifted when u and v are not executed consecutively on the same processor

(figure: a work stealing makes a node drifted: the node that would immediately precede it in the serial order runs elsewhere)

without a detailed proof, we note: the number of drifted nodes in the work stealing scheduler ≤ 2 × the number of steals

48 / 51

SLIDE 97

The main result

the average number of work stealings ≈ P T∞
⇒ the average number of drifted nodes ≈ 2 P T∞
⇒ the average number of extra cache misses ≤ 2 C P T∞

average execution time: TP ≤ T1/P + 2mC T∞, where m is the cost of a single cache miss

49 / 51


SLIDE 99

Summary

the basics of the work stealing scheduler:

work-first (preserves the serial execution order)
steals tasks from near the root

average execution time (without the cost of communication): TP ≤ T1/P + T∞
with the cost of communication: TP ≤ T1/P + 2mC T∞, where mC essentially represents the time to fill the cache

51 / 51