

SLIDE 1

Scheduling Parallel Programs by Work Stealing with Private Deques

Umut Acar

Carnegie Mellon University

Arthur Charguéraud

INRIA

Mike Rainey

Max Planck Institute for Software Systems

PPoPP 25.2.2013


Friday, July 5, 13

SLIDE 2

Scheduling parallel tasks

[diagram: animation across slides 2–5 showing a set of cores and a pool of tasks]

  • Goal: dynamic load balancing
  • A centralized approach does not scale up.
  • A popular approach: work stealing
  • Our work: study implementations of work stealing

SLIDE 6

Work stealing

[diagram: animation across slides 6–12; each core keeps its tasks in a deque, pushing and popping at one end, while an idle core steals a task from another core's deque]

SLIDE 13

Concurrent deques

  • Deques are shared.
  • Two sources of races:
  • between thieves
  • between the owner and a thief
  • The Chase-Lev data structure resolves these races using atomic compare-and-swap and memory fences.

[diagram: a deque with top and bot pointers; push and pop happen at the bottom, steals at the top]

SLIDE 14

Concurrent deques

  • Well studied: shown to perform well both in theory and in practice.

However, researchers have identified two main limitations:

  • Runtime overhead: in a relaxed memory model, pop must use a memory fence.
  • Lack of flexibility: simple extensions (e.g., steal half) involve major challenges.

SLIDE 15

Previous studies of private deques

  • Feeley 1992 (Multilisp)
  • Hendler & Shavit 2002 (C)
  • Umatani 2003 (Java)
  • Hiraishi et al. 2009 (C)
  • Sanchez et al. 2010 (C)
  • Fluet et al. 2011 (Parallel ML)

SLIDE 16

Private deques

  • Each core has exclusive access to its own deque.
  • An idle core obtains a task by making a steal request.
  • A busy core regularly checks for incoming requests.

[diagram labels: steal request, pop & send, push, pop]
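A sketch of what "regularly checks for incoming requests" can look like on the owner's side. The `Worker` struct, the per-core request and transfer cells, and the sentinel values are illustrative assumptions, not the paper's implementation; the point is that the deque itself needs no synchronization.

```cpp
#include <atomic>
#include <deque>

// Owner-side sketch with a private deque: between units of work, the busy
// core polls a per-core "incoming request" cell and answers through the
// requester's transfer cell.
struct Worker {
  std::deque<int> deq;                  // private: plain sequential deque
  std::atomic<int> request{-1};         // id of a core asking for work, or -1
  std::atomic<int>* transfer = nullptr; // one transfer cell per core

  void communicate() {                  // called roughly every delta work units
    int thief = request.load(std::memory_order_acquire);
    if (thief >= 0) {
      int task = -1;                    // -1 signals "no work available"
      if (!deq.empty()) { task = deq.front(); deq.pop_front(); }
      transfer[thief].store(task, std::memory_order_release);
      request.store(-1, std::memory_order_release);
    }
  }
};
```

The polling period (the δ of slide 39) is how often `communicate()` runs; nothing here requires a memory fence on the owner's push/pop fast path.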

SLIDE 17

Private deques

Addresses the main limitations of concurrent deques:

  • no need for memory fences
  • flexible deques (any data structure can be used)

but:

  • new cost associated with regular polling
  • additional delay associated with steals

SLIDE 18

Unknowns of private deques

  • What is the best way to implement work stealing with private deques?
  • How does it compare with concurrent deques on state-of-the-art benchmarks?
  • Can we establish tight bounds on the run time?

We give a receiver- and a sender-initiated algorithm. We evaluate on a collection of benchmarks. We prove a theorem w.r.t. delay and polling overhead.

SLIDE 20

Receiver initiated

[diagram: animation across slides 20–27 of the receiver-initiated protocol on four cores; an idle core installs a steal request with a CAS, and a busy core responds with a task]

SLIDE 28

From receiver to sender initiated

  • Receiver initiated: each idle core targets one busy core at random.
  • Sender initiated: each busy core targets one core at random.
  • The sender-initiated idea is adapted from distributed computing.
  • Sender initiated is simpler to implement.

SLIDE 29

Sender initiated

[diagram: animation across slides 29–35 of the sender-initiated protocol on four cores; a busy core hands off a task to a randomly chosen core, using a CAS to claim the delivery cell]
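The sender side is simpler, which matches the claim on slide 28: a busy core with spare work periodically tries to deposit one task into a random core's mailbox. The mailbox cells, sentinel -1, and function name are illustrative assumptions for this sketch.

```cpp
#include <atomic>
#include <random>

// Sender-initiated hand-off sketch: a busy core tries to CAS one task into
// a randomly chosen core's mailbox; the CAS fails harmlessly if the mailbox
// is occupied or another sender wins the race.
bool try_deal(int task, int my_id, int P,
              std::atomic<int>* mailbox,   // one mailbox cell per core
              std::mt19937& rng) {
  std::uniform_int_distribution<int> pick(0, P - 1);
  int target = pick(rng);
  if (target == my_id) return false;       // do not deal to ourselves
  int expected = -1;                       // -1 means "mailbox empty"
  return mailbox[target].compare_exchange_strong(expected, task);
}
```

An idle core then just waits on its own mailbox; there is no request/response round trip, which is why the sender-initiated variant needs less machinery.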

SLIDE 36

Performance study

  • We implemented in our own C++ library:
  • our receiver-initiated algorithm
  • our sender-initiated algorithm
  • our implementation of Chase-Lev
  • We compare all of these implementations against Cilk Plus.

SLIDE 37

Benchmarks

  • Classic Cilk benchmarks and the Problem Based Benchmark Suite (Blelloch et al. 2012)
  • Problem areas: merge sort, sample sort, maximal independent set, maximal matching, convex hull, Fibonacci, and dense matrix multiply.

SLIDE 38

Performance results

Intel Xeon, 30 cores; polling period = 30 µsec

[bar chart: normalized execution time (0.0 to 1.4) for shared (concurrent) deques, receiver-initiated, sender-initiated, and Cilk Plus on matmul, cilksort (exptintseq, randintseq), fib, matching (eggrid2d, egrlg, egrmat), MIS (grid2d, rlg, rmat), and hull (plummer2d, uniform2d)]

SLIDE 39

Analytical model

δ : polling interval
F : maximal number of forks in a path
P : number of cores
T1 : serial run time
T∞ : minimal run time with infinite cores
TP : parallel run time with P cores

SLIDE 40

Our main analytical result

Bound for greedy schedulers:

TP ≤ T1/P + ((P−1)/P) · T∞

Bound for concurrent deques (ignoring the cost of fences), where O(F) is the cost of steals:

E[TP] ≤ T1/P + ((P−1)/P) · T∞ + O(F)

Bound for our two algorithms, where the factor (1 + O(1)/δ) is the polling overhead and O(δF) is the cost of steals:

E[TP] ≤ (T1/P) · (1 + O(1)/δ) + ((P−1)/P) · T∞ + O(δF)
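One way to read the δ terms in the private-deques bound (my interpretation, reconstructed from the slide's annotations, with symbols as defined on slide 39): δ trades polling overhead against steal latency.

```latex
% Reconstructed private-deques bound; \delta = polling interval,
% F = maximal number of forks in a path.
\mathbb{E}[T_P] \;\le\;
  \frac{T_1}{P}\Bigl(1 + \tfrac{O(1)}{\delta}\Bigr)
  \;+\; \frac{P-1}{P}\,T_\infty
  \;+\; O(\delta F)
% Small \delta: polling is frequent, inflating the work term by O(1)/\delta.
% Large \delta: victims are slow to notice requests, inflating O(\delta F).
```

Setting δ to a moderate constant (the experiments use 30 µsec) keeps both extra terms small relative to the greedy-scheduler bound.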

SLIDE 43

Conclusion

  • We presented two new private-deque algorithms, evaluated them, and proved analytical results.
  • In the paper, we demonstrate the flexibility of private deques by implementing the steal-half policy.