

SLIDE 1

Scheduling Parallel Programs by Work Stealing with Private Deques

Umut Acar

Carnegie Mellon University

Arthur Charguéraud

INRIA

Mike Rainey

Max Planck Institute for Software Systems

PPoPP 25.2.2013


Friday, July 5, 13

SLIDE 2

Scheduling parallel tasks

[diagram: animation across slides 2–5 showing a set of cores and a pool of tasks]

  • Goal: dynamic load balancing
  • A centralized approach does not scale up.
  • A popular approach: work stealing
  • Our work: study implementations of work stealing

SLIDE 6

Work stealing

[diagram: animation across slides 6–12; each core keeps its tasks in a deque, pushing and popping at one end, while an idle core steals a task from another core's deque]

SLIDE 13

Concurrent deques

  • Deques are shared.
  • Two sources of races:
  • between thieves
  • between the owner and a thief
  • The Chase-Lev data structure resolves these races using atomic compare-and-swap and memory fences.

[diagram: a deque with top and bot pointers; push and pop happen at the bottom, steals at the top]

SLIDE 14

Concurrent deques

  • Well studied: shown to perform well both in theory and in practice.

However, researchers have identified two main limitations:

  • Runtime overhead: in a relaxed memory model, pop must use a memory fence.
  • Lack of flexibility: simple extensions (e.g., steal half) involve major challenges.

SLIDE 15

Previous studies of private deques

  • Feeley 1992 (Multilisp)
  • Hendler & Shavit 2002 (C)
  • Umatani 2003 (Java)
  • Hiraishi et al. 2009 (C)
  • Sanchez et al. 2010 (C)
  • Fluet et al. 2011 (Parallel ML)

SLIDE 16

Private deques

  • Each core has exclusive access to its own deque.
  • An idle core obtains a task by making a steal request.
  • A busy core regularly checks for incoming requests.

[diagram labels: steal request, pop & send, push, pop]
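A sketch of what "regularly checks for incoming requests" can look like on the owner's side. The `Worker` struct, the per-core request and transfer cells, and the sentinel values are illustrative assumptions, not the paper's implementation; the point is that the deque itself needs no synchronization.

```cpp
#include <atomic>
#include <deque>

// Owner-side sketch with a private deque: between units of work, the busy
// core polls a per-core "incoming request" cell and answers through the
// requester's transfer cell.
struct Worker {
  std::deque<int> deq;                  // private: plain sequential deque
  std::atomic<int> request{-1};         // id of a core asking for work, or -1
  std::atomic<int>* transfer = nullptr; // one transfer cell per core

  void communicate() {                  // called roughly every delta work units
    int thief = request.load(std::memory_order_acquire);
    if (thief >= 0) {
      int task = -1;                    // -1 signals "no work available"
      if (!deq.empty()) { task = deq.front(); deq.pop_front(); }
      transfer[thief].store(task, std::memory_order_release);
      request.store(-1, std::memory_order_release);
    }
  }
};
```

The polling period (the δ of slide 39) is how often `communicate()` runs; nothing here requires a memory fence on the owner's push/pop fast path.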

SLIDE 17

Private deques

Addresses the main limitations of concurrent deques:

  • no need for memory fences
  • flexible deques (any data structure can be used)

but:

  • new cost associated with regular polling
  • additional delay associated with steals

SLIDE 18

Unknowns of private deques

  • What is the best way to implement work stealing with private deques?
  • How does it compare with concurrent deques on state-of-the-art benchmarks?
  • Can we establish tight bounds on the run time?

We give a receiver- and a sender-initiated algorithm. We evaluate on a collection of benchmarks. We prove a theorem w.r.t. delay and polling overhead.

SLIDE 20

Receiver initiated

[diagram: animation across slides 20–27 of the receiver-initiated protocol on four cores; an idle core installs a steal request with a CAS, and a busy core responds with a task]

SLIDE 28

From receiver to sender initiated

  • Receiver initiated: each idle core targets one busy core at random.
  • Sender initiated: each busy core targets one core at random.
  • The sender-initiated idea is adapted from distributed computing.
  • Sender initiated is simpler to implement.

SLIDE 29

Sender initiated

[diagram: animation across slides 29–35 of the sender-initiated protocol on four cores; a busy core hands off a task to a randomly chosen core, using a CAS to claim the delivery cell]
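The sender side is simpler, which matches the claim on slide 28: a busy core with spare work periodically tries to deposit one task into a random core's mailbox. The mailbox cells, sentinel -1, and function name are illustrative assumptions for this sketch.

```cpp
#include <atomic>
#include <random>

// Sender-initiated hand-off sketch: a busy core tries to CAS one task into
// a randomly chosen core's mailbox; the CAS fails harmlessly if the mailbox
// is occupied or another sender wins the race.
bool try_deal(int task, int my_id, int P,
              std::atomic<int>* mailbox,   // one mailbox cell per core
              std::mt19937& rng) {
  std::uniform_int_distribution<int> pick(0, P - 1);
  int target = pick(rng);
  if (target == my_id) return false;       // do not deal to ourselves
  int expected = -1;                       // -1 means "mailbox empty"
  return mailbox[target].compare_exchange_strong(expected, task);
}
```

An idle core then just waits on its own mailbox; there is no request/response round trip, which is why the sender-initiated variant needs less machinery.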

SLIDE 36

Performance study

  • We implemented in our own C++ library:
  • our receiver-initiated algorithm
  • our sender-initiated algorithm
  • our implementation of Chase-Lev
  • We compare all of these implementations against Cilk Plus.

SLIDE 37

Benchmarks

  • Classic Cilk benchmarks and the Problem Based Benchmark Suite (Blelloch et al. 2012)
  • Problem areas: merge sort, sample sort, maximal independent set, maximal matching, convex hull, Fibonacci, and dense matrix multiply.

SLIDE 38

Performance results

Intel Xeon, 30 cores; polling period = 30 µsec

[bar chart: normalized execution time (0.0 to 1.4) for shared (concurrent) deques, receiver-initiated, sender-initiated, and Cilk Plus on matmul, cilksort (exptintseq, randintseq), fib, matching (eggrid2d, egrlg, egrmat), MIS (grid2d, rlg, rmat), and hull (plummer2d, uniform2d)]

SLIDE 39

Analytical model

δ : polling interval
F : maximal number of forks in a path
P : number of cores
T1 : serial run time
T∞ : minimal run time with infinite cores
TP : parallel run time with P cores

SLIDE 40

Our main analytical result

Bound for greedy schedulers:

TP ≤ T1/P + ((P−1)/P) · T∞

Bound for concurrent deques (ignoring the cost of fences), where O(F) is the cost of steals:

E[TP] ≤ T1/P + ((P−1)/P) · T∞ + O(F)

Bound for our two algorithms, where the factor (1 + O(1)/δ) is the polling overhead and O(δF) is the cost of steals:

E[TP] ≤ (T1/P) · (1 + O(1)/δ) + ((P−1)/P) · T∞ + O(δF)
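One way to read the δ terms in the private-deques bound (my interpretation, reconstructed from the slide's annotations, with symbols as defined on slide 39): δ trades polling overhead against steal latency.

```latex
% Reconstructed private-deques bound; \delta = polling interval,
% F = maximal number of forks in a path.
\mathbb{E}[T_P] \;\le\;
  \frac{T_1}{P}\Bigl(1 + \tfrac{O(1)}{\delta}\Bigr)
  \;+\; \frac{P-1}{P}\,T_\infty
  \;+\; O(\delta F)
% Small \delta: polling is frequent, inflating the work term by O(1)/\delta.
% Large \delta: victims are slow to notice requests, inflating O(\delta F).
```

Setting δ to a moderate constant (the experiments use 30 µsec) keeps both extra terms small relative to the greedy-scheduler bound.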

SLIDE 43

Conclusion

  • We presented two new private-deque algorithms, evaluated them, and proved analytical results.
  • In the paper, we demonstrate the flexibility of private deques by implementing the steal-half policy.