Improving Implicit Parallelism, José Manuel Calderón Trilla & Colin Runciman

slide-1
SLIDE 1

Improving Implicit Parallelism

José Manuel Calderón Trilla & Colin Runciman University of York

slide-3
SLIDE 3

“Why are you doing this to yourself?”

–Many of you

slide-4
SLIDE 4

FP at York

slide-7
SLIDE 7

Motivation

“End of Moore’s Law: blah blah blah”

–Edward Z. Yang

slide-9
SLIDE 9

The takeaway

Static analysis alone is not enough to achieve implicit parallelism. We use profile-directed feedback in addition to well-known static analysis techniques to achieve better results.

slide-13
SLIDE 13

‘par’ annotations

  • Simple way to introduce parallelism
  • Cheap when using a sparking model (Clack and Peyton Jones 1986)
  • Lends itself to use in Strategies (Trinder et al. 1998, Marlow et al. 2010)

slide-15
SLIDE 15

‘par’ annotations

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

slide-16
SLIDE 16

‘par’ annotations

import Control.Parallel (par)

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = let x = fib (n-1)
            y = fib (n-2)
        in x `par` y `seq` x + y

slide-19
SLIDE 19
‘par’ annotations

  • par also lends itself to ‘switching’

par :: a -> b -> b
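
One way to picture ‘switching’ is a wrapper around par that consults a flag. The sketch below is only an illustration: `switchablePar`, `fibWith` and the `Bool` flag are hypothetical names, not taken from the talk, and `par` is imported from `GHC.Conc` in base (Control.Parallel exports a `par` of the same type). Because `par :: a -> b -> b` always returns its second argument, a switched-off site gives the same result as a switched-on one.

```haskell
import GHC.Conc (par)

-- Hypothetical wrapper: a par-site that can be switched on or off.
-- Both branches return the second argument, so switching a site
-- changes only whether a spark is created, never the result.
switchablePar :: Bool -> a -> b -> b
switchablePar True  x y = x `par` y   -- switched on: spark x, continue with y
switchablePar False _ y = y           -- switched off: no spark

-- fib from the slides, with its single par-site placed under a switch.
fibWith :: Bool -> Int -> Integer
fibWith _  0 = 0
fibWith _  1 = 1
fibWith on n =
  let x = fibWith on (n - 1)
      y = fibWith on (n - 2)
  in switchablePar on x (y `seq` x + y)
```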

slide-21
SLIDE 21

Takeaway: revisited

  • Use static analysis to place pars throughout the program, generously
  • Use profiling data to determine which pars should be switched off

slide-25
SLIDE 25

Higher-order specialisation

  • Two purposes:
    • Necessary for projection analysis (Hinze 1995)
    • Specialises par-sites
slide-26
SLIDE 26

Higher-order specialisation

pMap :: (a -> b) -> [a] -> [b]
pMap f []     = []
pMap f (x:xs) = y `par` y : pMap f xs
  where y = f x

slide-27
SLIDE 27

Higher-order specialisation

pMap_g :: [a] -> [b]
pMap_g []     = []
pMap_g (x:xs) = y `par` y : pMap_g xs
  where y = g x
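
A rough sketch of why specialisation matters for switching: a single higher-order pMap has one par-site shared by every call, whereas specialised copies each get their own site that can be switched independently. The functions `g` and `h`, the names `pMapG` and `pMapH`, and the `Bool` switches are all assumptions made for illustration; `par` comes from `GHC.Conc` in base.

```haskell
import GHC.Conc (par)

-- Hypothetical functions that the original program maps over lists.
g :: Int -> Int
g = (* 2)

h :: Int -> Int
h = (+ 1)

-- Specialised copy for g, with its own (hypothetical) switch.
pMapG :: Bool -> [Int] -> [Int]
pMapG _  []     = []
pMapG on (x:xs)
  | on        = y `par` rest   -- par-site for g switched on
  | otherwise = rest           -- par-site for g switched off
  where
    y    = g x
    rest = y : pMapG on xs

-- Specialised copy for h; its switch is independent of pMapG's.
pMapH :: Bool -> [Int] -> [Int]
pMapH _  []     = []
pMapH on (x:xs)
  | on        = y `par` rest
  | otherwise = rest
  where
    y    = h x
    rest = y : pMapH on xs
```

With this shape, a profile could in principle keep the site in `pMapG` while switching off the one in `pMapH`, which is not possible when both calls share one `pMap`.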

slide-31
SLIDE 31

par placement

  • We want safety
  • Only spark sub-expressions that are needed
  • Projections for strictness analysis can help us determine which arguments are needed and how much is needed (Hinze 1995)

slide-32
SLIDE 32

Without projections

Instead of asking: “If argument x is non-terminating, is the function non-terminating?” (the original strictness analysis; see Mycroft)

slide-33
SLIDE 33

Projections

We ask: “If a given amount of the function’s result is needed, how much of the function’s arguments is needed?”
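
A small illustration of the question, using only Prelude functions (the binding names are made up for this sketch): different consumers demand different amounts of a list, and a demand on only part of a result translates into a demand on only part of the argument.

```haskell
-- A list whose spine is defined but whose elements are not.
spineOnly :: [Int]
spineOnly = [undefined, undefined, undefined]

-- length demands only the spine, so the elements are never evaluated.
lengthDemand :: Int
lengthDemand = length spineOnly

-- sum demands every element as well as the spine, so evaluating
-- the elements early cannot change the outcome.
sumDemand :: Int
sumDemand = sum [1, 2, 3]

-- Only the head of the result is needed, so only the first cons cell
-- (and its element) of map's argument is needed; the tail is untouched.
headDemand :: Int
headDemand = head (map (+ 1) (1 : undefined))
```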

slide-35
SLIDE 35

Projections

data Context = CVar String
             | CRec String
             | CBot
             | CProd [Context]
             | CSum [(String, Context)]
             | CMu String Context
             | CStr Context
             | CLaz Context

slide-39
SLIDE 39

Projections ≈ Strategies

  • Projections: describe how much of a structure is needed
  • Strategies: describe how much of a structure to evaluate (possibly in parallel)
  • Similar to Burn’s “Evaluation Transformers” (Burn 1991)

slide-41
SLIDE 41
Projections ≈ Strategies

  • Example: Analysis determines a list can be fully evaluated

pList :: Strategy a -> [a] -> ()
pList s []     = ()
pList s (x:xs) = s x `par` pList s xs
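
To make pList usable on its own, the sketch below supplies the surrounding definitions in the style of the original Strategies formulation (Trinder et al. 1998), where a strategy is a function to unit. The definitions of `Strategy`, `rwhnf` and `using` are written out here as assumptions about what the slide relies on, `squares` is a made-up example, and `par` is taken from `GHC.Conc` in base.

```haskell
import GHC.Conc (par)

-- Original-style strategy: evaluate some amount of the argument, return unit.
type Strategy a = a -> ()

-- Evaluate a value to weak head normal form.
rwhnf :: Strategy a
rwhnf x = x `seq` ()

-- pList as on the slide: spark the element strategy on every element.
pList :: Strategy a -> [a] -> ()
pList _ []     = ()
pList s (x:xs) = s x `par` pList s xs

-- Apply a strategy to a value, then return the value itself.
using :: a -> Strategy a -> a
using x s = s x `seq` x

-- Example: a list that the analysis has found can be fully evaluated.
squares :: [Int]
squares = map (^ 2) [1 .. 10] `using` pList rwhnf
```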

slide-42
SLIDE 42

Using Strategies

fib 0 = 0
fib 1 = 1
fib n = let x = fib (n-1)
            y = fib (n-2)
        in x `par` y `seq` x + y

slide-44
SLIDE 44

1990s Version

  • We’re done.
slide-47
SLIDE 47

The remake

  • Have the compiler do what programmers do: look at profiling data

slide-51
SLIDE 51

Par-site Health

  • Not all threads are equally productive
  • Each thread has an origin (par-site)
  • Calculate the health of a par-site by looking at the productivity of the threads it sparked
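
The bullets above can be sketched as a small calculation over profile data. Everything here is an assumption for illustration: the profile is modelled as one (par-site, reduction count) pair per thread, and health is taken to be the mean reduction count of a site's threads; the talk does not state which statistic is used.

```haskell
import Data.Function (on)
import Data.List (groupBy, minimumBy, sortOn)
import Data.Ord (comparing)

type ParSite    = Int
type Reductions = Int

-- Mean of a non-empty list of reduction counts.
mean :: [Reductions] -> Double
mean ns = fromIntegral (sum ns) / fromIntegral (length ns)

-- Group threads by the par-site that sparked them and score each site.
-- ASSUMPTION: health = mean reduction count of the site's threads.
siteHealth :: [(ParSite, Reductions)] -> [(ParSite, Double)]
siteHealth profile =
  [ (site, mean (map snd grp))
  | grp@((site, _) : _) <- groupBy ((==) `on` fst) (sortOn fst profile) ]

-- The par-site with the lowest health, if there is any site at all.
weakest :: [(ParSite, Double)] -> Maybe ParSite
weakest [] = Nothing
weakest hs = Just (fst (minimumBy (comparing snd) hs))
```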

slide-52
SLIDE 52

Thread Health

[Figure: Par-Site Health for SumEuler. Axes: Par-Site (1 to 9) and Reduction Count (10^1 to 10^5).]

slide-55
SLIDE 55

Incorporate Feedback

  • After calculating par-site health, switch off the weakest par
  • Repeat until there is no further improvement in overall performance
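
The loop described above might look roughly like this. It is a sketch under stated assumptions, not the authors' implementation: `Profile`, `improve` and the `run` parameter are hypothetical, and a profiled run is modelled as a pure function from the list of active par-sites to a runtime plus per-site health, whereas a real run would be an effectful measurement.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

type ParSite = Int

-- Hypothetical result of one profiled run.
data Profile = Profile
  { runtime :: Double               -- overall running time
  , health  :: [(ParSite, Double)]  -- health of each active par-site
  }

-- Switch off the weakest par-site and re-run; keep going while the
-- program gets faster, and return the last setting that improved.
improve :: ([ParSite] -> Profile) -> [ParSite] -> [ParSite]
improve run start = go start (run start)
  where
    go active current =
      case health current of
        [] -> active                          -- nothing left to switch off
        hs ->
          let weakest = fst (minimumBy (comparing snd) hs)
              next    = filter (/= weakest) active
              trial   = run next
          in if runtime trial < runtime current
               then go next trial             -- improvement: continue
               else active                    -- no improvement: stop
```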

slide-56
SLIDE 56

main = let v_130 = let v_129 = fromto_D1 1 1000 in (par (fix mainLL_0 v_129) (mapDefeuler v_129)) in (par (fix mainLL_3 v_130) (sum v_130));
mainLL_2 v_0 = seq v_0 Pack{0,0};
mainLL_1 v_1 v_2 = case v_2 of { <0> v_131 v_132 -> par (mainLL_2 v_131) (seq (v_1 v_132) Pack{0,0}); <1> -> Pack{0,0} };
mainLL_0 v_3 = mainLL_1 v_3;
mainLL_5 v_4 = seq v_4 Pack{0,0};
mainLL_4 v_5 v_6 = case v_6 of { <0> v_133 v_134 -> par (mainLL_5 v_133) (seq (v_5 v_134) Pack{0,0}); <1> -> Pack{0,0} };
mainLL_3 v_7 = mainLL_4 v_7;
sum v_8 = case v_8 of { <1> -> 0; <0> v_135 v_136 -> let v_139 = sum v_136 in (par (sumLL_0 v_139) ((v_135 + v_139))) };
sumLL_0 v_9 = seq v_9 Pack{0,0};
mapDefeuler v_10 = case v_10 of { <1> -> Pack{1,0}; <0> v_140 v_141 -> Pack{0,2} (euler v_140) (mapDefeuler v_141) };
fromto_D1 v_11 v_12 = ifte ((v_11 > v_12)) Pack{1,0} (Pack{0,2} v_11 (fromto_D1 ((v_11 + 1)) v_12));
fromto_D2 v_13 v_14 = ifte ((v_13 > v_14)) Pack{1,0} (Pack{0,2} v_13 (fromto_D2 ((v_13 + 1)) v_14));
gcd v_30 v_31 = ifte ((v_31 == 0)) v_30 (ifte ((v_30 > v_31)) (gcd ((v_30 - v_31)) v_31) (gcd v_30 ((v_31 - v_30))));
euler v_15 = let v_164 = filterDefrelPrime v_15 (fromto_D2 1 v_15) in (par (fix eulerLL_0 v_164) (length v_164));
eulerLL_1 v_16 v_17 = case v_17 of { <0> v_168 v_169 -> seq (v_16 v_169) Pack{0,0}; <1> -> Pack{0,0} };
eulerLL_0 v_18 = eulerLL_1 v_18;
ifte v_19 v_20 v_21 = case v_19 of { <1> -> v_20; <0> -> v_21 };
length v_22 = case v_22 of { <1> -> 0; <0> v_170 v_171 -> let v_174 = length v_171 in (par (lengthLL_0 v_174) ((1 + v_174))) };
lengthLL_0 v_23 = seq v_23 Pack{0,0};
filterDefrelPrime v_24 v_25 = case v_25 of { <1> -> Pack{1,0}; <0> v_175 v_176 -> let v_183 = relPrime v_24 v_175 in (par (filterDefrelPrimeLL_0 v_183) (ifte v_183 (Pack{0,2} v_175 (filterDefrelPrime v_24 v_176)) (filterDefrelPrime v_24 v_176))) };
filterDefrelPrimeLL_0 v_26 = case v_26 of { <1> -> Pack{0,0}; <0> -> Pack{0,0} };
relPrime v_27 v_28 = let v_188 = gcd v_27 v_28 in (par (relPrimeLL_0 v_188) ((v_188 == 1)));
relPrimeLL_0 v_29 = seq v_29 Pack{0,0};

slide-60
SLIDE 60

SumEuler speedup

[Figure: speedup compared to sequential against feedback iteration (1 to 6), for 4, 8 and 16 cores.]

slide-61
SLIDE 61

Queens2 speedup

[Figure: speedup compared to sequential against feedback iteration (0 to 22), for 4, 8 and 16 cores.]

slide-62
SLIDE 62

Taut speedup

[Figure: speedup compared to sequential (axis ticks 0.99, 1, 1.01) against feedback iteration (1 to 9), for 4 and 16 cores.]

slide-65
SLIDE 65

Looks nice, but…

  • Transferring the resulting programs to GHC gives us “less than ideal” results (discussed in the paper)
  • Not entirely clear that the technique will scale
slide-69
SLIDE 69

Conclusion

  • Early results are promising, but there are issues when transferring programs to GHC
  • Will likely need other forms of specialisation
  • Speculation may be necessary for more complex programs

slide-70
SLIDE 70

Fin

slide-72
SLIDE 72

Ask me:

  • About problems with benchmarking
  • What is necessary to get something like this into GHC?
  • This is bad and you should feel bad, how does that make you feel?

slide-75
SLIDE 75

Two Search Algorithms

  • 1. Greedy: Try switching a random bit and keep the faster setting. Visit all bits once.
  • 2. Hill-climbing: Iterate through neighbors randomly; if a neighbor is faster, move to that setting. Repeat until no neighbor is faster.
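
Both searches can be sketched over a bit string with one bit per par-site. The names `Setting`, `flipBit`, `greedy` and `hillClimb` and the `time` parameter are assumptions for this sketch; measuring a setting is modelled as a pure cost function. Randomness is left to the caller: `greedy` takes the order in which to visit bits (a random permutation would match the slide), and `hillClimb` simplifies the slide's random neighbor order to index order.

```haskell
-- One bit per par-site: True means the par is switched on.
type Setting = [Bool]

-- Flip the bit at index i, leaving all other bits unchanged.
flipBit :: Int -> Setting -> Setting
flipBit i s = [ if j == i then not b else b | (j, b) <- zip [0 ..] s ]

-- Greedy: visit each bit once, in the supplied order, and keep a flip
-- only if the flipped setting is faster.
greedy :: (Setting -> Double) -> [Int] -> Setting -> Setting
greedy time order start = foldl step start order
  where
    step s i =
      let s' = flipBit i s
      in if time s' < time s then s' else s

-- Hill-climbing: move to the first neighbor (one bit flipped) that is
-- faster; stop when no neighbor is faster.
hillClimb :: (Setting -> Double) -> Setting -> Setting
hillClimb time s =
  case [ n | i <- [0 .. length s - 1], let n = flipBit i s, time n < time s ] of
    (n : _) -> hillClimb time n
    []      -> s
```

Greedy evaluates each bit exactly once, so its cost is fixed by the number of par-sites; hill-climbing may revisit bits and only stops at a local optimum of the cost function.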

slide-76
SLIDE 76

[Figure: speedup compared to sequential against number of evaluations, for hill-climbing (HC) and greedy (G) search on 4, 8, 16 and 24 cores.]

slide-77
SLIDE 77

[Figure: speedup compared to sequential against number of evaluations, for hill-climbing (HC) and greedy (G) search on 4, 8, 16 and 24 cores.]

slide-78
SLIDE 78

Example Projections

Pairs:
  CSum [("Pair", CProd [CProd []?, CProd []?])]
  CSum [("Pair", CProd [CProd []!, CBot?])]

Lists:
  CMu "L" (CSum [("Cons", CProd [(CVar "a")?, (CRec "L")!]), ("Nil", CProd [])])
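
The postfix marks can be related to the Context type from the Projections slide with a small renderer. This is a sketch under a loudly stated assumption: that `c!` stands for `CStr c` and `c?` for `CLaz c`. The slides do not spell this out, and `render` and `spineStrictList` are names invented here.

```haskell
import Data.List (intercalate)

data Context
  = CVar String
  | CRec String
  | CBot
  | CProd [Context]
  | CSum [(String, Context)]
  | CMu String Context
  | CStr Context
  | CLaz Context
  deriving (Eq, Show)

-- ASSUMED reading of the list example: lazy in the element,
-- strict in the recursive tail.
spineStrictList :: Context
spineStrictList =
  CMu "L" (CSum [ ("Cons", CProd [CLaz (CVar "a"), CStr (CRec "L")])
                , ("Nil",  CProd []) ])

-- Render a context in the slide's notation, writing CStr as a
-- postfix '!' and CLaz as a postfix '?' (assumption, see above).
render :: Context -> String
render (CVar v)    = "(CVar " ++ show v ++ ")"
render (CRec v)    = "(CRec " ++ show v ++ ")"
render CBot        = "CBot"
render (CProd cs)  = "CProd [" ++ intercalate ", " (map render cs) ++ "]"
render (CSum alts) =
  "CSum [" ++ intercalate ", "
    [ "(" ++ show k ++ ", " ++ render c ++ ")" | (k, c) <- alts ] ++ "]"
render (CMu v c)   = "CMu " ++ show v ++ " (" ++ render c ++ ")"
render (CStr c)    = render c ++ "!"
render (CLaz c)    = render c ++ "?"
```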