Speculative Load Balancing, by Hassan Eslami and William D. Gropp



slide-1
SLIDE 1

SPECULATIVE LOAD BALANCING

Hassan Eslami William D. Gropp

Department of Computer Science, University of Illinois at Urbana-Champaign

slide-2
SLIDE 2

Continuous Dynamic Load Balancing

  • Irregular parallel applications
  • Irregular and unpredictable structure
  • Nested or recursive parallelism
  • Dynamic generation of units of computation
  • Available parallelism heavily depends on input data
  • Require continuous dynamic load balancing

Examples: optimization and search problems; N-body problems

slide-3
SLIDE 3

Dynamic Load Balancing Model

  • TaskPool.initialize(initial tasks)
  • While (t ← TaskPool.get()) t.execute()
  • In execute(), one may call TaskPool.put()
  • Idle time occurs in TaskPool.get()
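The model above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the authors' implementation: the `TaskPool` method names follow the slide, while `CountdownTask`, `run`, and the LIFO ordering are assumptions made only for this example. In a parallel run, `get()` is the call in which a worker would wait for work.

```python
from collections import deque


class TaskPool:
    """Single-process stand-in for the task pool named on the slide."""

    def __init__(self):
        self._tasks = deque()

    def initialize(self, initial_tasks):
        self._tasks.extend(initial_tasks)

    def put(self, task):
        self._tasks.append(task)

    def get(self):
        # In a parallel run a worker would wait here for work (idle time).
        # This sketch simply returns None when the pool is empty.
        return self._tasks.pop() if self._tasks else None


class CountdownTask:
    """Hypothetical task: spawns one child task until n reaches 0."""

    def __init__(self, pool, n, log):
        self.pool, self.n, self.log = pool, n, log

    def execute(self):
        self.log.append(self.n)
        if self.n > 0:
            # execute() may generate new units of work dynamically.
            self.pool.put(CountdownTask(self.pool, self.n - 1, self.log))


def run(pool):
    """The slide's loop: get a task, execute it, stop when none is left."""
    executed = 0
    while (t := pool.get()) is not None:
        t.execute()
        executed += 1
    return executed


pool = TaskPool()
log = []
pool.initialize([CountdownTask(pool, 2, log), CountdownTask(pool, 1, log)])
executed = run(pool)
```

The number of executed tasks (five here) is larger than the number of initial tasks (two), which is the dynamic generation of work that makes static partitioning impractical.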

slide-4
SLIDE 4

How to Eliminate Idle Time? – Prefetching

[Figure: timelines of Thread 1 and Thread 2]


slide-12
SLIDE 12

How to Eliminate Idle Time? – Prefetching

[Figure: timelines of Thread 1 and Thread 2]

  • Unpredictable workload
  • Data dependence and limited parallelism

slide-13
SLIDE 13

How to Eliminate Idle Time? – Speculation

[Figure: timelines of Thread 1 and Thread 2]

slide-15
SLIDE 15

How to Eliminate Idle Time? – Speculation

[Figure: timelines of Thread 1 and Thread 2, with an Arbitration Request]

slide-16
SLIDE 16

How to Eliminate Idle Time? – Speculation

[Figure: timelines of Thread 1 and Thread 2, labeled "Speculation Fail"]

slide-17
SLIDE 17

Work Sharing Algorithm

[Figure: Manager Thread 0, Thread 1, Thread 2, Thread 3, with three Work Request messages]

slide-19
SLIDE 19

Work Sharing Algorithm

[Figure: Manager Thread 0, Thread 1, Thread 2, Thread 3]

slide-20
SLIDE 20

Speculative Work Sharing Algorithm

[Figure: Manager Thread 0, Thread 1, Thread 2, Thread 3, with three Work Request messages]

slide-21
SLIDE 21

Speculative Work Sharing Algorithm

[Figure: Manager Thread and Some Worker Thread]

slide-23
SLIDE 23

Speculative Work Sharing Algorithm

[Figure: Manager Thread and Some Worker Thread. Tasks shown: A. Arbitration Request for A.]

slide-24
SLIDE 24

Speculative Work Sharing Algorithm

[Figure: Manager Thread and Some Worker Thread. Tasks shown: A, B. Arbitration Request for B.]

slide-25
SLIDE 25

Speculative Work Sharing Algorithm

[Figure: Manager Thread and Some Worker Thread. Tasks shown: A, B, C, D, E. Arbitration Request for E.]

slide-26
SLIDE 26

Speculative Work Sharing Algorithm

[Figure: Manager Thread and Some Worker Thread. Tasks shown: A, B, C, D, E. Response for A: Success. Commit.]

slide-27
SLIDE 27

Speculative Work Sharing Algorithm

[Figure: Manager Thread and Some Worker Thread. Tasks shown: B, C, D, E. Response for B: Success. Commit.]

slide-28
SLIDE 28

Speculative Work Sharing Algorithm

[Figure: Manager Thread and Some Worker Thread. Tasks shown: C, D, E. Response for C: Fail. Roll Back (shown three times), Delete.]

slide-29
SLIDE 29

Speculative Work Sharing Algorithm

[Figure: Manager Thread and Some Worker Thread, with a Work Request]
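The sequence in the slides above (arbitration requests for tasks A to E, success responses followed by commits, then a failure followed by rollbacks) can be imitated with a small single-process simulation. This is a sketch under stated assumptions, not the authors' protocol: the `Manager` class, its `arbitrate` method, the rule that a task can be granted to only one worker, and the choice to roll back every task still pending after the first failure are all hypothetical and chosen to reproduce the order of events shown in the figures.

```python
from collections import deque


class Manager:
    """Hypothetical arbiter: a task can be granted to only one worker."""

    def __init__(self, already_taken=()):
        self.taken = set(already_taken)

    def arbitrate(self, task_id):
        # Success only if no other worker has been given this task.
        if task_id in self.taken:
            return False
        self.taken.add(task_id)
        return True


def run_speculatively(manager, speculative_tasks):
    """Execute tasks before their arbitration responses are known.

    Returns (committed, rolled_back) lists of task ids.
    """
    pending = deque()
    for task_id in speculative_tasks:
        result = f"result-of-{task_id}"    # speculative work, not yet visible
        pending.append((task_id, result))  # arbitration request is in flight

    committed, rolled_back = [], []
    while pending:
        task_id, _result = pending.popleft()
        if manager.arbitrate(task_id):     # response: success, so commit
            committed.append(task_id)
        else:                              # response: fail
            rolled_back.append(task_id)
            # Assumption: everything still pending is rolled back as well.
            rolled_back.extend(t for t, _ in pending)
            pending.clear()
    return committed, rolled_back


# Task C was already handed to another worker, so its arbitration fails.
manager = Manager(already_taken={"C"})
committed, rolled_back = run_speculatively(manager, ["A", "B", "C", "D", "E"])
```

With this input the worker commits A and B and discards C, D, and E, after which it would fall back to an ordinary work request as on the last slide of the sequence.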

slide-30
SLIDE 30

Unbalanced Tree Search (UTS)

  • Counting nodes in a randomly generated tree
  • Tree generation is based on a separable cryptographic random number generator

childCount = f(nodeId)
childId = SHA1(nodeId, childIndex)

  • Different types of trees
  • Binomial (probability q, number of children m)
  • Geometric (depth limit d, branching factor follows a geometric distribution with mean b)
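A rough Python sketch of the two formulas above, for the binomial variant. It illustrates the idea only and does not reproduce the UTS benchmark's actual generator: the byte encoding of `childIndex`, the mapping from hash bytes to a probability, the `max_depth` safety cap, and the seed string are assumptions introduced for this example.

```python
import hashlib


def child_id(node_id: bytes, child_index: int) -> bytes:
    # childId = SHA1(nodeId, childIndex); the 4-byte encoding is an assumption.
    return hashlib.sha1(node_id + child_index.to_bytes(4, "big")).digest()


def child_count(node_id: bytes, q: float, m: int) -> int:
    # childCount = f(nodeId): binomial variant, m children with probability q.
    u = int.from_bytes(node_id[:4], "big") / 2**32  # value in [0, 1)
    return m if u < q else 0


def count_nodes(root_id: bytes, q: float, m: int, max_depth: int) -> int:
    """Count the nodes of the implicit tree with an explicit stack."""
    count = 0
    stack = [(root_id, 0)]
    while stack:
        node_id, depth = stack.pop()
        count += 1
        if depth >= max_depth:  # safety cap, not part of the slide
            continue
        for i in range(child_count(node_id, q, m)):
            stack.append((child_id(node_id, i), depth + 1))
    return count


root = hashlib.sha1(b"seed-0").digest()
total = count_nodes(root, q=0.2, m=4, max_depth=50)
```

Because every child identifier is derived from its parent's identifier alone, any worker holding a node can expand its subtree without communication, and the size of that subtree is unknown until it has been traversed.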

slide-31
SLIDE 31

Work Sharing in UTS

  • A node in the tree is a unit of work
  • A chunk is a set of nodes, and is the minimum transferable unit
  • Release interval is the frequency with which a worker releases a chunk to the manager

If (HasSurplusWork() and NodesProcessed % release_interval == 0) { ReleaseWork() }
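The release condition above can be sketched as follows. The `Worker` class, the definition of surplus (more than one chunk queued locally), the choice to release the oldest queued nodes, and the toy `children_of` workload are assumptions for illustration; only the condition inside `process_one` comes from the slide.

```python
from collections import deque


class Worker:
    def __init__(self, chunk_size, release_interval):
        self.chunk_size = chunk_size
        self.release_interval = release_interval
        self.local = deque()       # nodes waiting to be processed
        self.nodes_processed = 0
        self.released = []         # chunks handed to the (hypothetical) manager

    def has_surplus_work(self):
        # Assumption: surplus means more than one chunk is queued locally.
        return len(self.local) > self.chunk_size

    def release_work(self):
        # Assumption: give away the oldest queued nodes as one chunk.
        chunk = [self.local.popleft() for _ in range(self.chunk_size)]
        self.released.append(chunk)

    def process_one(self, children_of):
        node = self.local.pop()
        self.local.extend(children_of(node))
        self.nodes_processed += 1
        # Condition from the slide.
        if (self.has_surplus_work()
                and self.nodes_processed % self.release_interval == 0):
            self.release_work()


def children_of(n):
    # Hypothetical workload: nodes with id < 4 have three children each.
    return [3 * n + 1, 3 * n + 2, 3 * n + 3] if n < 4 else []


worker = Worker(chunk_size=2, release_interval=2)
worker.local.append(0)
while worker.local:
    worker.process_one(children_of)
```

In this run the worker processes five nodes and releases one chunk, `[1, 2]`, after its second node. A small release interval checks for surplus more often, while a large one leaves other workers waiting longer for work, which is the trade-off the tuning plots explore.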

slide-32
SLIDE 32

Experimental Setup and Inputs

  • Illinois Campus Cluster
  • Cluster of HP ProLiant Servers
  • 2 Intel X5650 2.66 GHz 6-core processors per node
  • High-speed InfiniBand cluster interconnect

Input sizes (in 10^9 nodes):

            Binomial    Geometric
Small       0.111       0.102
Medium      2.79        1.64
Large       10.6        4.23

slide-37
SLIDE 37

Tuning of Original Algorithm – Small Input (on 4 nodes, 12 cores each)

[Figure: Impact of release interval on execution time (Geometric Tree). x-axis: Chunk Size (log scale, 1 to 100); y-axis: Exec. Time (s), 5 to 40; one curve per release interval: 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536.]

slide-38
SLIDE 38

Original vs. Speculative Algorithm – Small Input (on 4 nodes, 12 cores each)

[Figure: two panels, Original and Speculative, each titled "Impact of release interval on execution time (Geometric Tree)". x-axis: Chunk Size (log scale, 1 to 100); y-axis: Exec. Time (s), 5 to 40; one curve per release interval: 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536.]

slide-39
SLIDE 39

Tuning of Original Algorithm – Medium Input (on 4 nodes, 12 cores each)

[Figure: Impact of release interval on execution time (Geometric Tree). x-axis: Chunk Size (log scale, 1 to 100); y-axis: Exec. Time (s), 10 to 50; one curve per release interval: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096.]

  • Optimal values: (128, 12)
  • Some results for large input on 8 nodes

                Time (s) (128, 8)    Time (s) (128, 12)
Original        50.385               26.681
Speculative     18.902               18.886
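The large-input timings above can be turned into two simple ratios: the speedup of the speculative algorithm over the original at each parameter setting, and how much each algorithm's time changes between the two settings. The short script below only restates the four reported numbers; the function names `speedup` and `sensitivity` are introduced here for illustration.

```python
# Execution times in seconds, copied from the large-input results above.
times = {
    "Original":    {(128, 8): 50.385, (128, 12): 26.681},
    "Speculative": {(128, 8): 18.902, (128, 12): 18.886},
}


def speedup(setting):
    """Original time divided by speculative time at one setting."""
    return times["Original"][setting] / times["Speculative"][setting]


def sensitivity(algorithm):
    """Worst time divided by best time across the two settings."""
    t = times[algorithm]
    return max(t.values()) / min(t.values())


for setting in [(128, 8), (128, 12)]:
    print(setting, f"speedup = {speedup(setting):.2f}x")
for algorithm in times:
    print(algorithm, f"worst/best time = {sensitivity(algorithm):.3f}")
```

The speedup is roughly 2.7x at (128, 8) and 1.4x at (128, 12). The original algorithm's time nearly doubles between the two settings, while the speculative algorithm's time changes by less than 0.1 percent.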

slide-40
SLIDE 40

Scalability Study – Geometric Tree

[Figure: x-axis: # of MPI Ranks (log scale, 10 to 1000); y-axis: Exec. Time (s), 20 to 180; curves: Original, Speculative.]

slide-41
SLIDE 41

Scalability Study – Binomial Tree

[Figure: x-axis: # of MPI Ranks (log scale, 10 to 1000); y-axis: Exec. Time (s), 10 to 70; curves: Original, Speculative.]

slide-42
SLIDE 42

Conclusion

  • Speculation
  • Is a light-weight technique in load-balancing algorithms
  • Is a potential solution to eliminate idle time
  • Reduces sensitivity of a load-balancing algorithm to parameters
  • Helps to reduce tuning efforts
  • Exhibits a higher scalability


slide-43
SLIDE 43

BACK UP SLIDES

slide-45
SLIDE 45

Design Guidelines

  • The time it takes to process a speculative task is far less than the time it takes to get the response to an arbitration request
  • A worker may need multiple speculative tasks at a time
  • Low-overhead algorithm to get speculative tasks
  • Minimal speculative task transfer (i.e., minimizing the destruction of speculative tasks)
  • Quality of a speculative task decreases over time
  • The more actual tasks a worker has, the fewer speculative tasks it should carry
  • Quality of a speculative task increases as it goes deeper in its owner’s actual queue

slide-46
SLIDE 46

Does Speculation Help Work Stealing?

  • Baseline algorithm + speculative algorithm guidelines = speculative work stealing (Algorithm A)
  • Speculative work stealing + replacing speculative messages with prefetching = optimized prefetch-based work stealing (Algorithm B)
  • “A” has a slight performance benefit over “B” (less than 5 percent overall)
  • Reason: even the baseline does not have much idle time in UTS
  • … But speculative work stealing is helpful in problems where parallelism is limited due to data dependence
  • Example: depth-first traversal of a graph