On the Factor Refinement Principle and its Implementation on Multicore Architectures


SLIDE 1

On the Factor Refinement Principle and its Implementation on Multicore Architectures

Master's Public Lecture. Presented by: Md. Mohsin Ali. Supervisor: Dr. Marc Moreno Maza

  • Dept. of Computer Science (ORCCA Lab)

The University of Western Ontario, London, ON, Canada

December 15, 2011

SLIDE 2

◮ Factor Refinement
◮ Serial Algorithms
  ◮ Approach based on the naive refinement
  ◮ Approach based on the augment refinement
  ◮ Approach based on subproduct trees
◮ Motivation
◮ Implementation Challenges on Multicore Architectures
◮ Contribution
◮ Proposed Parallel Algorithms
  ◮ A d-n-c illustration
  ◮ Parallel algorithms based on the naive refinement
  ◮ Parallel approach based on the augment refinement
  ◮ Parallel algorithm based on subproduct trees
◮ Conclusion

SLIDE 3

Factor Refinement

SLIDE 4

Factor Refinement I

Definition

◮ Let D be a UFD and m1, m2, . . . , mr be elements of D.
◮ Let m be the product of m1, m2, . . . , mr.
◮ We say that elements n1, n2, . . . , ns of D form a GCD-free basis whenever gcd(ni, nj) = 1 for all 1 ≤ i < j ≤ s.
◮ Let e1, e2, . . . , es be positive integers.
◮ We say that the pairs (n1, e1), (n2, e2), . . . , (ns, es) form a refinement of m1, m2, . . . , mr if the following conditions hold:
  (i) n1, n2, . . . , ns is a GCD-free basis,
  (ii) for every 1 ≤ i ≤ r there exist non-negative integers f1, . . . , fs such that ∏_{1≤j≤s} nj^fj = mi,
  (iii) ∏_{1≤i≤s} ni^ei = m.

When this holds, we also say that (n1, e1), (n2, e2), . . . , (ns, es) is a coprime factorization of m.

SLIDE 5

Factor Refinement II

Example
Let m1 = 30, m2 = 42 and their product m = 1260. Then
(i) (5, 1), (6, 2), (7, 1) is a refinement of 30 and 42,
(ii) 5, 6, 7 is a GCD-free basis of 30 and 42,
(iii) (5, 1), (6, 2), (7, 1) is a coprime factorization of 1260.
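The example can be checked mechanically; a small Python sketch (the function name is ours, not from the talk):

```python
from math import gcd, prod

def is_coprime_factorization(pairs, m):
    """Check that the (base, exponent) pairs have pairwise coprime bases
    and that their product reconstructs m."""
    bases = [n for n, _ in pairs]
    pairwise = all(gcd(bases[i], bases[j]) == 1
                   for i in range(len(bases))
                   for j in range(i + 1, len(bases)))
    return pairwise and prod(n ** e for n, e in pairs) == m
```

For the example, `is_coprime_factorization([(5, 1), (6, 2), (7, 1)], 1260)` holds, since 5 · 6² · 7 = 1260.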

SLIDE 6

Factor Refinement III

Applications

◮ Simplifying systems of polynomial equations and inequations,

(i) ab = bc = ca = 0 =⇒ a = b = c = 0,
(ii) below, {A, B, C, D, E, F, G} can be seen as a GCD-free basis of {S1, S2, S3}:

[Figure residue removed: the factors S1, S2, S3 are rewritten in terms of the pairwise coprime factors A, B, C, D, E, F, G.]

◮ consolidation of independent factorizations, ◮ etc.

SLIDE 7

Serial Algorithms: Approach based on the naive refinement and quadratic arithmetic

SLIDE 8

Approach based on the naive refinement I

Idea from Bach, Driscoll, and Shallit in 1990 [BDS90].

◮ Given a partial factorization of an integer m, say m = m1 m2, we compute d = gcd(m1, m2) and write m = (m1/d) · d² · (m2/d).
◮ This process is continued until all the factors are pairwise coprime.
◮ This is also used for the general case of more than two inputs, say m = m1 m2 · · · mℓ.

Algebraic complexity
If m = m1 m2 · · · mℓ, then this algorithm takes O(size(m)³) bit operations, where size(m) = 1 if m = 0, and size(m) = 1 + ⌊log2 |m|⌋ if m > 0.
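A minimal serial sketch of this principle for square-free integers (the function name and structure are ours, not from [BDS90]; the actual algorithm organizes the work more carefully, as the following slides explain):

```python
from math import gcd

def naive_refinement(factors):
    """Repeatedly split any two factors a, b with a nontrivial d = gcd(a, b),
    rewriting a^e * b^f as (a/d)^e * d^(e+f) * (b/d)^f, until all pairs are
    coprime.  Assumes square-free inputs; returns sorted (base, exp) pairs."""
    work = [(m, 1) for m in factors if m != 1]
    changed = True
    while changed:
        changed = False
        for i in range(len(work)):
            for j in range(i + 1, len(work)):
                (a, e), (b, f) = work[i], work[j]
                d = gcd(a, b)
                if d != 1:
                    work[i], work[j] = (a // d, e), (b // d, f)
                    work.append((d, e + f))
                    work = [p for p in work if p[0] != 1]  # drop trivial factors
                    changed = True
                    break
            if changed:
                break
    return sorted(work)
```

For 30 and 42 this returns [(5, 1), (6, 2), (7, 1)], matching the example on slide 5.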

SLIDE 9

Serial Algorithms: Approach based on the augment refinement and quadratic arithmetic

SLIDE 10

Approach based on augment refinement and quadratic arithmetic I

Again from Bach, Driscoll, and Shallit in 1990 [BDS90].

◮ The basic idea is the same as before, but organizing the computations more carefully leads to an improved complexity [BDS90].
◮ The trick is to keep track of the pairs (nj, nk) in an ordered pair list such that only elements adjacent in the list can have a nontrivial GCD.

Algebraic complexity
If m = m1 m2 · · · mℓ, then this algorithm takes O(size(m)²) bit operations, where size(m) = 1 if m = 0, and size(m) = 1 + ⌊log2 |m|⌋ if m > 0.

SLIDE 11

Serial Algorithms: Approach based on subproduct trees

SLIDE 12

Approach based on subproduct trees I

Idea of an asymptotically fast algorithm for GCD-free basis computation, from Dahan, Moreno Maza, Schost, and Xie in 2005 [DMS+05].

◮ Divide the input into sub-problems until a base case is reached,
◮ Conquer the sub-problems from the leaves to the root, applying fast arithmetic based on subproduct trees (described later).

Algebraic complexity
The total number of field operations of this algorithm is O(M(d) log₂³ d), where
◮ d is the sum of the degrees of the input polynomials,
◮ M(d) is a multiplication time for two univariate polynomials of degree less than d.

SLIDE 13

Motivation I

Parallel Computation of the Minimal Elements of a Poset

◮ by Leiserson, Li, Moreno Maza, and Xie in 2010 [ELLMX10].
◮ This is a multithreaded (fork-join parallelism) approach which is divide-and-conquer, free of data races, and inspired by parallel merge sort.
◮ Its Cilk++ implementation shows nearly linear speed-up on 32-core processors for sufficiently large input data sets.

This work led us to the design and implementation of parallel factor refinement algorithms.

SLIDE 14

Implementation Challenges on Multicore Architectures

SLIDE 15

Multithreaded Parallelism on Multicore Architectures I

Multicore architectures

◮ A multicore processor is a single computing component with two or more independent and tightly coupled processors, called cores, sharing memory.
◮ They also share the same bus and memory controller; thus memory bandwidth may limit performance.
◮ In order to maintain memory consistency, synchronization is needed between cores, which may also limit performance.

SLIDE 16

Multithreaded Parallelism on Multicore Architectures II

Fork-join parallelism

◮ This model represents the execution of a multithreaded program as a set of nonblocking threads denoted by the vertices of a dag, where the dag edges indicate dependencies between instructions.
◮ Assuming unit cost of execution for all threads, the number of vertices of the dag is the work (= running time on a single core).
◮ The maximum length of a path from the root to a leaf is the span (= running time on ∞ processors).
◮ The parallelism is the ratio of work to span (= average amount of work along the span).
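These three quantities can be computed directly on a small dag (a sketch with our own names; every node has unit cost, as in the model's assumption):

```python
def work_and_span(dag, cost):
    """Work, span and parallelism of a computation dag.

    dag maps each node to its list of successors; cost maps each node to
    its execution time.  Work is the total cost; span is the cost of the
    longest root-to-leaf path; parallelism is their ratio."""
    work = sum(cost[v] for v in dag)

    memo = {}
    def longest(v):                      # longest path starting at v
        if v not in memo:
            memo[v] = cost[v] + max((longest(w) for w in dag[v]), default=0)
        return memo[v]

    span = max(longest(v) for v in dag)
    return work, span, work / span
```

For a diamond-shaped dag of four unit-cost nodes (one fork, two parallel nodes, one join), work = 4, span = 3, parallelism = 4/3.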

SLIDE 17

The Ideal-cache Model I

[Figure residue removed. The diagram shows a CPU performing W work against a cache of Z words, organized in Z/L lines of L words each under an optimal replacement strategy, connected to the main memory; Q counts the cache misses.]

Figure 1: The ideal-cache model.

SLIDE 18

The Ideal-cache Model II

◮ The processor can only refer to words that reside in the cache memory, which is a small and fast-access memory containing Z words organized in cache lines of L words each.
◮ If the line containing a referenced word is not in cache, the corresponding line needs to be brought from the main memory. This is a cache miss. If the cache is full, a cache line must be evicted.
◮ Cache complexity analyzes algorithms in terms of cache misses.

SLIDE 19

From Cilk to Cilk++ I

The language

◮ Cilk (resp. Cilk++) is an extension of C (resp. C++) implementing fork-join parallelism with two keywords, spawn and sync.
◮ A Cilk (resp. Cilk++) program has the same semantics as its C (resp. C++) elision.

Performance of the work-stealing scheduler
In theory, the Cilk (resp. Cilk++) scheduler executes any Cilk (resp. Cilk++) computation in nearly optimal time on p processors, provided that
◮ for almost all parallel steps, there are at least p units of work which can be run concurrently,
◮ each processor is either working or stealing work,
◮ each thread executes in unit time.

SLIDE 20

Parallelization overheads I

Overheads and burden

◮ In practice, the observed speedup factor may be less (sometimes much less) than the theoretical parallelism.
◮ Many factors explain this: simplifying assumptions of the fork-join parallelism model, architecture limitations, the cost of executing the parallel constructs, and scheduling overheads.

Parallelism vs. burdened parallelism
◮ Cilkview is a performance analyzer which calculates the work, the span, and the parallelism of a given Cilk++ program run.
◮ Cilkview also estimates the running time Tp on p processors as Tp = T1/p + 1.7 · burdened_span, where the burdened span charges 15,000 instructions for each spawn along the span!

SLIDE 21

Contribution I

◮ Parallel algorithm based on the naive refinement principle [NOT GOOD for data locality and thus for parallelism on multicore architectures].
◮ Parallel algorithm based on the augment refinement principle [GOOD for data locality and parallelism].
◮ Parallel algorithm based on subproduct trees [MORE CHALLENGING to implement on multicore architectures].

Principle
All are based on algorithms which are divide-and-conquer (d-n-c), multithreaded, and free of data races.

SLIDE 22

Proposed Parallel Algorithms: A d-n-c illustration

SLIDE 23

A d-n-c illustration I

[Figure residue removed. The input 2, 6, 7, 10, 15, 21, 22, 26 is split recursively into halves down to single elements (Expand, done in parallel), and the partial refinements are then combined pairwise back up (Merge, done in parallel), yielding the coprime factorization (11, 1), (13, 1), (3, 3), (7, 2), (5, 2), (2, 5).]

Figure 2: Example of algorithm execution.

SLIDE 24

Proposed Parallel Algorithms: Parallel algorithms based on the naive refinement

SLIDE 25

Parallel algorithms based on the naive refinement I

Expanding input and calling Merge

Algorithm 1: ParallelFactorRefinement(A)
Input: Array of square-free positive integers A = m1, m2, . . . , mk.
Output: Two arrays of positive integers N = n1, n2, . . . , ns and E = e1, e2, . . . , es such that (n1, e1), (n2, e2), . . . , (ns, es) is a refinement of m1, m2, . . . , mk. Moreover, n1, n2, . . . , ns are square-free.
1: if k = 1 then
2:   return [A], [1];
3: else
4:   Divide array A into two parts called A1 and A2;
5:   f1 ← spawn ParallelFactorRefinement(A1);
6:   f2 ← spawn ParallelFactorRefinement(A2);
7:   sync;
8:   return MergeRefinement(f1, f2);
SLIDE 26

Parallel algorithms based on the naive refinement II

Computing all pairwise GCDs and separating the factors with those GCDs.

Algorithm 2: MergeRefinement(A, E, B, F)
Input: Arrays of square-free positive integers A = l1, . . . , lk, E = e1, . . . , ek, B = m1, . . . , mr and F = f1, . . . , fr, where A, E (resp. B, F) is regarded as a factor refinement (l1, e1), . . . , (lk, ek) (resp. (m1, f1), . . . , (mr, fr)).
Output: A factor refinement of the concatenation of A, E and B, F.
1: Allocate G and H as (k × r) row-major arrays, A′ a k-array and B′ an r-array;
2: parallel for (i, j) ∈ {1, . . . , k} × {1, . . . , r} do Hi,j ← 0;
3: G ← GcdOfAllPairs(A, B);
4: parallel for i = 1 to k do
5:   pi ← ∏_{1≤j≤r} Gi,j;
6:   A′i ← li/pi;
7:   for j = 1 to r do
8:     if Gi,j ≠ 1 then
9:       Hi,j ← Hi,j + ei;
10: For each column, do the similar work of Lines 4 to 9;
11: Write the non-one entries of A′ (for each row), B′ (for each column), and G to C;
12: Write the corresponding exponents from E, F, and H to D;
13: return C, D;

SLIDE 27

Parallel algorithms based on the naive refinement III

Algorithm 3: GcdOfAllPairsInner(A, B, G)
Input: 1-D arrays of positive integers A = l1, l2, . . . , lk, B = m1, m2, . . . , mr and a 2-D array G = [Gi,j | 1 ≤ i ≤ k, 1 ≤ j ≤ r].
Output: All pairs gcd(li, mj) for 1 ≤ i ≤ k, 1 ≤ j ≤ r, with gcd(li, mj) stored in Gi,j.
Comment: C is a global variable equal to a base-case threshold, say 16. Assume C ≥ 2.
1: if k ≤ C and r ≤ C then
2:   for (i, j) ∈ {1, . . . , k} × {1, . . . , r} do
3:     Gi,j ← gcd(Ai, Bj); [place of INEFFICIENCY for cache complexity]
4: else if k > C and r ≤ C then
5:   Divide A into two halves A1 = l1, . . . , lk/2 and A2 = lk/2+1, . . . , lk;
6:   GcdOfAllPairsInner(A1, B, G);
7:   GcdOfAllPairsInner(A2, B, G);
8: else if k ≤ C and r > C then
9:   Divide B into two halves B1 = m1, . . . , mr/2 and B2 = mr/2+1, . . . , mr;
10:  GcdOfAllPairsInner(A, B1, G);
11:  GcdOfAllPairsInner(A, B2, G);
12: else
13:  Divide A and B into two halves A1 = l1, . . . , lk/2, A2 = lk/2+1, . . . , lk, B1 = m1, . . . , mr/2, B2 = mr/2+1, . . . , mr;
14:  spawn GcdOfAllPairsInner(A1, B1, G);
15:  spawn GcdOfAllPairsInner(A2, B2, G);
16:  spawn GcdOfAllPairsInner(A1, B2, G);
17:  spawn GcdOfAllPairsInner(A2, B1, G);
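The blocked recursion of Algorithm 3 can be sketched serially over index ranges (gcd_of_all_pairs is our name; the four quadrant calls in the final branch touch pairwise disjoint blocks of G, which is why the algorithm can spawn them all):

```python
from math import gcd

def gcd_of_all_pairs(A, B, base=16):
    """Divide-and-conquer all-pairs gcd table in the style of Algorithm 3
    (serial sketch): split the larger dimension until both index ranges
    fit under the base-case threshold, then fill the block directly."""
    G = [[0] * len(B) for _ in A]

    def inner(i0, i1, j0, j1):
        k, r = i1 - i0, j1 - j0
        if k <= base and r <= base:          # base case: direct table fill
            for i in range(i0, i1):
                for j in range(j0, j1):
                    G[i][j] = gcd(A[i], B[j])
        elif k > base and r <= base:
            mid = i0 + k // 2
            inner(i0, mid, j0, j1); inner(mid, i1, j0, j1)
        elif k <= base and r > base:
            mid = j0 + r // 2
            inner(i0, i1, j0, mid); inner(i0, i1, mid, j1)
        else:                                # split both: four disjoint quadrants
            im, jm = i0 + k // 2, j0 + r // 2
            inner(i0, im, j0, jm); inner(im, i1, jm, j1)
            inner(i0, im, jm, j1); inner(im, i1, j0, jm)

    inner(0, len(A), 0, len(B))
    return G
```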

SLIDE 28

Parallel algorithms based on the naive refinement IV

Proposition: work, span and parallelism
On input data of size n for Algorithm 1, under the condition that each arithmetic operation (integer division and integer GCD computation) has a unit cost,

◮ Work = O(n²),
◮ Span = O(n), and
◮ Parallelism = O(n). GOOD! :-)

SLIDE 29

Parallel algorithms based on the naive refinement V

Proposition: cache complexity
Under the ideal cache model of Z words with L words per cache line (model of Frigo, Leiserson, Prokop, and Ramachandran, 1999), with C small enough, for any input of size n, the number of cache misses of Algorithm 3 is

Q(n) ∈ O(n²/L + n²/Z + n²/Z²), BAD! :-(

under the assumption that
◮ there exists a positive integer w such that each integer coefficient of each input array A or B in Algorithms 1 and 2 is stored in w consecutive machine words, and
◮ each input array A or B is packed, that is, its successive elements occupy consecutive memory slots.

SLIDE 30

Experimental results I

◮ The speedup estimates are obtained with the Cilkview feature of Cilk++ (Frigo, Leiserson and Randall, 1998), executing on an Intel Xeon (64-bit) machine with a 2.40 GHz CPU (E7340) and 128.0 GB of RAM, in a 16-core SHARCNET cluster.
◮ Timings are obtained on an Intel Core i7 (64-bit) machine with a 2.93 GHz CPU (870), 8.0 GB of RAM and 8 cores, configured in the ORCCA Lab.
◮ Timing results are the average of 5 executions.

SLIDE 31

Experimental results II

[Speedup plot: measured speedup vs. number of cores (1–16), against the ideal speedup and a lower performance bound; parallelism = 2395.73.]

Figure 3: Scalability analysis, by Cilkview, of the naive refinement based parallel factor refinement algorithm for 4,000 dense square-free univariate polynomials, all of degree 60, with coefficient operations taken modulo a small prime.

SLIDE 32

Experimental results III

[Running-time plot: time (sec) vs. input size (100–600) for the algorithm on 1 core, on 8 cores, and for Maple.]

Figure 4: Running time comparisons of the naive refinement based parallel factor refinement algorithm for dense square-free univariate polynomials, all of degree 60, with coefficient operations taken modulo a small prime.

SLIDE 33

Proposed Parallel Algorithms: Parallel approach based on the augment refinement

SLIDE 34

Parallel approach based on the augment refinement I

Expanding input and calling Merge

Algorithm 4: ParallelFactorRefinementDNC(A)
Input: A sequence of square-free polynomials A = (m1, m2, . . . , mk) ∈ K[x].
Output: A sequence of square-free pairwise coprime polynomials N = (n1, n2, . . . , ns) ∈ K[x] and a sequence of positive integers E = (e1, e2, . . . , es) such that (n1, e1), (n2, e2), . . . , (ns, es) is a factor refinement of m1, m2, . . . , mk.
1: if k < 2 then
2:   return (m1), (1);
3: else
4:   Divide A into two sub-sequences called A1 and A2;
5:   (X1, Y1) ← spawn ParallelFactorRefinementDNC(A1);
6:   (X2, Y2) ← spawn ParallelFactorRefinementDNC(A2);
7:   sync;
8:   return MergeRefinementDNC(X1, Y1, X2, Y2);

SLIDE 35

Parallel approach based on the augment refinement II

Proceed recursively and call Merge

Algorithm 5: MergeRefinementDNC(A, E, B, F)
Input: Two sequences of square-free pairwise coprime polynomials A = (a1, . . . , an) and B = (b1, . . . , br) ∈ K[x], together with two sequences of positive integers E = (e1, . . . , en) and F = (f1, . . . , fr).
Output: A factor refinement of (a1, e1), . . . , (an, en), (b1, f1), . . . , (br, fr).
1: if n ≤ BASESIZE or r ≤ BASESIZE then
2:   return MergeRefineTwoSeq(A, E, B, F);
3: else
4:   Divide A, E, B, and F into two halves called A1, A2, E1, E2, B1, B2, and F1, F2, respectively;
5:   (L1, M1, Q1, R1, S1, T1) ← spawn MergeRefinementDNC(A1, E1, B1, F1);
6:   (L2, M2, Q2, R2, S2, T2) ← spawn MergeRefinementDNC(A2, E2, B2, F2);
7:   sync;
8:   (L3, M3, Q3, R3, S3, T3) ← spawn MergeRefinementDNC(L1, M1, S2, T2);
9:   (L4, M4, Q4, R4, S4, T4) ← spawn MergeRefinementDNC(L2, M2, S1, T1);
10:  sync;
11:  return (L3 + L4, M3 + M4, Q1 + Q2 + Q3 + Q4, R1 + R2 + R3 + R4, S3 + S4, T3 + T4);

SLIDE 36

Parallel approach based on the augment refinement III

Proceed iteratively and call Merge

Algorithm 6: MergeRefineTwoSeq(A, E, B, F)
Input: Two sequences of square-free pairwise coprime polynomials A = (a1, . . . , an) and B = (b1, . . . , br) of K[x], together with two sequences of positive integers E = (e1, . . . , en) and F = (f1, . . . , fr).
Output: A factor refinement of (a1, e1), . . . , (an, en), (b1, f1), . . . , (br, fr).
1: L ← ∅; M ← ∅; Q ← ∅; R ← ∅; S0 ← B; T0 ← F;
2: for i from 1 to n do
3:   (ℓi, mi, Qi, Ri, Si, Ti) ← MergeRefinePolySeq(ai, ei, Si−1, Ti−1);
4:   Q ← Q + Qi;
5:   R ← R + Ri;
6:   if ℓi ≠ 1 then
7:     L ← L + (ℓi);
8:     M ← M + (mi);
9: return (L, M, Q, R, Sn, Tn);

SLIDE 37

Parallel approach based on the augment refinement IV

Proceed iteratively with augment refinement

Algorithm 7: MergeRefinePolySeq(a, e, B, F)
Input: A square-free polynomial a ∈ K[x], a positive integer e, a sequence of square-free pairwise coprime polynomials B = (b1, . . . , bn) of K[x] and a sequence of positive integers F = (f1, . . . , fn).
Output: A factor refinement of (a, e), (b1, f1), . . . , (bn, fn).
1: ℓ0 ← a; m0 ← e; Q ← ∅; R ← ∅; S ← ∅; T ← ∅;
2: for i from 1 to n do
3:   (ℓi, mi, Gi, Vi, di, wi) ← PolyRefine(ℓi−1, mi−1, bi, fi);
4:   Q ← Q + Gi;
5:   R ← R + Vi;
6:   if di ≠ 1 then
7:     S ← S + (di);
8:     T ← T + (wi);
9: return (ℓn, mn, Q, R, S, T);

SLIDE 38

Parallel approach based on the augment refinement V

Refinement of two inputs (pair-refine)

Algorithm 8: PolyRefine(a, e, b, f)
Input: Two square-free univariate polynomials a, b ∈ K[x] for a field K and two positive integers e, f.
Output: A factor refinement of (a, e), (b, f).
1: g ← gcd(a, b);
2: a′ ← a quo g;
3: b′ ← b quo g;
4: if g = 1 then
5:   return (a, e, ∅, ∅, b, f);   // ∅ designates the empty sequence
6: else if a = b then
7:   return (1, 1, (a), (e + f), 1, 1);
8: else
9:   (ℓ1, e1, G1, V1, r1, f1) ← PolyRefine(a′, e, g, e + f);
10:  (ℓ2, e2, G2, V2, r2, f2) ← PolyRefine(r1, f1, b′, f);
11:  if ℓ2 ≠ 1 then
12:    G2 ← G2 + (ℓ2);   // + designates sequence concatenation
13:    V2 ← V2 + (e2);
14:  return (ℓ1, e1, G1 + G2, V1 + V2, r2, f2);
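An integer analogue of the pair-refine step, for square-free inputs (a sketch; the function name is ours). It returns (ℓ, e′, G, V, r, f′): the still-unfinished left part, the finished coprime factors with their exponents, and the unfinished right part:

```python
from math import gcd

def pair_refine(a, e, b, f):
    """Integer analogue of PolyRefine for square-free a, b: recursively
    split a^e * b^f into a coprime factorization."""
    g = gcd(a, b)
    if g == 1:
        return a, e, [], [], b, f
    if a == b:
        return 1, 1, [a], [e + f], 1, 1
    l1, e1, G1, V1, r1, f1 = pair_refine(a // g, e, g, e + f)
    l2, e2, G2, V2, r2, f2 = pair_refine(r1, f1, b // g, f)
    if l2 != 1:                      # l2 is finished: move it to the middle
        G2, V2 = G2 + [l2], V2 + [e2]
    return l1, e1, G1 + G2, V1 + V2, r2, f2
```

For a = 30, b = 42 with e = f = 1, this returns (5, 1, [6], [2], 7, 1), i.e. the refinement 5¹ · 6² · 7¹ of slide 5.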

SLIDE 39

Parallel approach based on the augment refinement VI

Proposition: work, span and parallelism
On input data of size n for Algorithm 4, under the condition that each arithmetic operation (division and GCD computation) has a unit cost, and where C is the threshold BASESIZE,

◮ Work = O(n²),
◮ Span = O(Cn), and
◮ Parallelism = O(n/C). NOT BAD! :-)

SLIDE 40

Parallel approach based on the augment refinement VII

Proposition: cache complexity
Under the ideal cache model of Z words with L words per cache line (model of Frigo, Leiserson, Prokop, and Ramachandran, 1999), with C small enough, for any input of size n, the number of cache misses of Algorithm 5 is

Q(n) = O(n²/(ZL) + n²/Z²), GOOD! :-)

under the assumption that
◮ each input or output sequence in Algorithms 5, 6, 7 and 8 is packed, and
◮ each element of the field K can be stored in one machine word.

SLIDE 41

Experimental results I

[Speedup plot: measured speedup vs. number of cores (1–16), against the ideal speedup and a lower performance bound; parallelism = 170.73.]

Figure 5: Scalability analysis, by Cilkview, of the augment refinement based parallel factor refinement algorithm for 200,000 int-type inputs.

SLIDE 42

Experimental results II

[Running-time plot: time (sec) vs. input size (5000–6000) for the algorithm on 1 core, on 8 cores, and for Maple.]

Figure 6: Running time comparisons of the augment refinement based parallel factor refinement algorithm for int-type inputs.

SLIDE 43

Experimental results III

[Speedup plot: measured speedup vs. number of cores (1–16), against the ideal speedup and a lower performance bound; parallelism = 265.20.]

Figure 7: Scalability analysis, by Cilkview, of the augment refinement based parallel factor refinement algorithm for 4,000 dense square-free univariate polynomials, all of degree 150, with coefficient operations taken modulo a small prime.

SLIDE 44

Experimental results IV

[Running-time plot: time (sec) vs. input size (100–600) for the algorithm on 1 core, on 8 cores, and for Maple.]

Figure 8: Running time comparisons of the augment refinement based parallel factor refinement algorithm for dense square-free univariate polynomials, all of degree 150, with coefficient operations taken modulo a small prime.

SLIDE 45

Experimental results V

[Speedup plot: measured speedup vs. number of cores (1–16), against the ideal speedup and a lower performance bound; parallelism = 136.93.]

Figure 9: Scalability analysis, by Cilkview, of the augment based parallel factor refinement algorithm for 4,120 sparse square-free univariate polynomials when the input is already a GCD-free basis; all of degree 150, with coefficient operations taken modulo a small prime.

SLIDE 46

Experimental results VI

[Running-time plot: time (sec) vs. input size (200–600) for the algorithm on 1 core, on 8 cores, and for Maple.]

Figure 10: Running time comparisons of the augment based parallel factor refinement algorithm for sparse square-free univariate polynomials when the input is already a GCD-free basis; all of degree 150, with coefficient operations taken modulo a small prime.

SLIDE 47

Parallel Algorithm Based on Subproduct Trees

SLIDE 48

Subproduct Tree I

◮ A useful construction to devise fast algorithms with univariate polynomials.
◮ If m0, m1, . . . , mn−1 are monic, non-constant polynomials in K[x], then their subproduct tree is as in Figure 11 [cost: O(M(d) log₂ d), where d = Σ_{i=0}^{n−1} deg(mi)].

[Figure residue removed. The leaves m0, m1, . . . , mn−1 sit at level i = 0; each internal node is the product of its two children; the root, at level i = k, is m0 m1 · · · mn−1.]

Figure 11: Construction of a subproduct tree.
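A naive coefficient-list sketch of the construction (poly_mul and subproduct_tree are our names; a real implementation would use FFT-based multiplication to reach the M(d)-based cost, and the input count is assumed to be a power of two, as in the figure):

```python
def poly_mul(p, q):
    """Naive product of two polynomials given as coefficient lists,
    lowest degree first (an FFT-based product replaces this in practice)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def subproduct_tree(ms):
    """Bottom-up subproduct tree: levels[0] are the input polynomials,
    each higher level multiplies adjacent pairs; levels[-1][0] is the
    product of all inputs."""
    levels = [list(ms)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([poly_mul(prev[i], prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels
```

For the inputs x+1, x+2, x+3, x+4, the root is x⁴ + 10x³ + 35x² + 50x + 24, their product.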

SLIDE 49

Fast multiple remainders I

◮ Let m0, m1, . . . , mn−1 be as before and assume their subproduct tree M has been computed.
◮ To compute f rem m0, f rem m1, . . . , f rem mn−1, one reduces f by the root of M, then by its children (leading to two remainders), and so on down to the leaves.
◮ For deg(f) < d, this amounts to O(M(d) log₂ d) field operations.

[Figure residue removed: the subproduct tree of m0, . . . , mn−1, traversed from the root down.]

Figure 12: Computing multiple remainders down the subproduct tree.
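Going down the tree can be sketched the same way (poly_rem and multi_rem are our names; the naive division assumes monic, non-constant moduli, which all subproduct-tree nodes of monic inputs are):

```python
def poly_rem(f, m):
    """Remainder of f modulo a MONIC, non-constant polynomial m
    (coefficient lists, lowest degree first), by naive long division."""
    f = list(f)
    while len(f) >= len(m):
        c = f[-1]                    # leading coefficient (m is monic)
        off = len(f) - len(m)
        for i in range(len(m)):
            f[off + i] -= c * m[i]
        f.pop()                      # top coefficient is now zero
    while len(f) > 1 and f[-1] == 0:
        f.pop()
    return f

def multi_rem(f, levels):
    """Reduce f down a subproduct tree (levels built bottom-up, root at
    levels[-1]): each node takes the remainder of its parent's remainder,
    ending with f rem m_i at every leaf."""
    rems = [poly_rem(f, levels[-1][0])]
    for level in reversed(levels[:-1]):
        rems = [poly_rem(rems[i // 2], m) for i, m in enumerate(level)]
    return rems
```

Reducing f = x³ through the tree of x+1, x+2, x+3, x+4 yields the remainders −1, −8, −27, −64, i.e. f evaluated at the roots of the leaves.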

SLIDE 50

Fast Multiple GCDs I

◮ Suppose we have a polynomial f ∈ K[x] and a sequence of square-free polynomials A = a0, a1, . . . , an−1 in K[x], together with their subproduct tree.
◮ We compute multiGcd(f, A), that is, all gcd(f, ai) for 0 ≤ i ≤ n − 1, by reducing f down the subproduct tree and returning the gcd with ai at each leaf.

[Figure residue removed. The tree is annotated top-down: mk,0 = f mod (a0 a1 · · · an−1) at the root; mk−1,0 = mk,0 mod (a0 · · · an/2−1) and mk−1,1 = mk,0 mod (an/2 · · · an−1) at its children; down to m0,i = m1,⌊i/2⌋ mod ai at each leaf, where gcd(m0,i, ai) is returned.]

Figure 13: Multiple GCDs computation.

SLIDE 51

Fast All Possible Pairs of GCDs I

◮ Suppose we have two sequences of square-free polynomials A = a0, a1, . . . , an−1 and B = b0, b1, . . . , bs−1 in K[x].
◮ We compute parallelPairsOfGcd(A, B), that is, gcd(ai, bj) for all 0 ≤ i ≤ n − 1 and 0 ≤ j ≤ s − 1.
◮ This is done using the subproduct tree of b0, b1, . . . , bs−1 and calling multiGcd(f, A) at each node.

SLIDE 52

Merging GCD-free Bases I

◮ Once we have a routine parallelPairsOfGcd(A, B), we easily derive a routine for merging two GCD-free bases of square-free polynomials.
◮ Let us call it parallelMergeGcdFreeBases(f1, f2).

SLIDE 53

Parallel GCD-free Bases I

Expanding input and calling Merge

Algorithm 9: parallelGcdFreeBasis(A)
Input: Sequence of square-free polynomials A = a1, a2, . . . , ae.
Output: A GCD-free basis of A.
1: Build a subproduct tree Sub′ where the root is labeled by the sequence of polynomials A;
2: for every node N ∈ Sub′, bottom-up, processing all nodes of the same level concurrently do
3:   if N is not a leaf then
4:     f1 ← spawn leftChild(N);
5:     f2 ← spawn rightChild(N);
6:     sync;
7:     Label N by parallelMergeGcdFreeBases(f1, f2);
8: return the label of RootOf(Sub′);

SLIDE 54

Parallel GCD-free Bases II

Proposition: work, span and parallelism
On input data of size n for Algorithm 9,

◮ Work = O(M(d) log₂³ d), GOOD thanks to fast arithmetic,
◮ Span = O((d/n) log₂² n), and
◮ Parallelism = O(n log₂⁴ d / log₂² n), NOT GOOD, because it is almost similar to that of the QUADRATIC algorithms, but with higher parallelization overheads,

where
(i) d = Σ_{i=1}^{n} deg(input_i),
(ii) FFT-based parallel univariate multiplication gives M(d) ∈ O(d log₂ d),
(iii) computing a univariate polynomial GCD in degree d has a work of O(M(d) log₂ d) and a span of O(d), and
(iv) the subproduct tree of n arbitrary polynomials needs a work of O(M(d) log₂ n) and a span of O(log₂² d · log₂ n).

SLIDE 55

Parallel GCD-free Bases III

Key observations:

◮ Better work than that of the quadratic algorithms.
◮ Almost similar parallelism to that of the quadratic algorithms.
◮ But the implementation on multicore is more challenging, because
(i) parallelization of polynomial multiplication on multicore is a hard problem, as reported by Chowdhury, Moreno Maza, Pan, and Schost in 2011 [CMPS11] and by Moreno Maza and Xie in 2011 [MX11],
(ii) those routines are efficient from degree 100,000 to 1,000,000, which has a negative impact on the span of the subproduct tree,
(iii) subproduct tree construction also has
  ◮ a negative impact on memory consumption, and
  ◮ no known cache-friendly algorithm on multicore.

SLIDE 56

Conclusion

SLIDE 57

Conclusion I

◮ We have proposed and analyzed parallel algorithms for factor refinement (and GCD-free basis) computation.
◮ We have observed that cache complexity has major effects on performance.
◮ Algorithm 5, with cache complexity O(n²/(ZL)), is more efficient than Algorithm 3, with O(n²/L), as we verified experimentally.
◮ Asymptotically fast polynomial arithmetic does not always pay off on multicore architectures.
◮ Work in progress:
  ◮ handling multi-precision (= big) integers efficiently,
  ◮ supporting the multivariate case in the context of polynomial system solving.

SLIDE 58

Acknowledgement I

I would like to acknowledge Dr. Yuzhen Xie, Dr. Rong Xiao, Dr. Changbo Chen and other ORCCA Lab members for their great help throughout this research work.

SLIDE 59

Thanks to All!

SLIDE 60

Work, span and parallelism: naive based principle I

Let W1(n), W2(n), W3(n) (resp. S1(n), S2(n), S3(n)) denote the work (resp. span) of Algorithms 1, 2 and 3 on input data of order n. Thus we have:

W1(n) = 2 W1(n/2) + W2(n) and S1(n) = S1(n/2) + S2(n). (1)
W2(n) = W3(n) + O(n²) and S2(n) = S3(n) + O(n). (2)
W3(n) = 4 W3(n/2) + Θ(1) and S3(n) = S3(n/2) + Θ(1). (3)

From Equation (3), we have W3(n) = Θ(n²) and S3(n) = Θ(log₂ n). From Equation (2), we obtain W2(n) = Θ(n²) and S2(n) = Θ(n). Finally, from Equation (1), we deduce W1(n) = Θ(n²) and S1(n) = Θ(n).

SLIDE 61

Cache complexity: naive based principle I

Let Q(n) be the number of cache misses of Algorithm 3. Under the ideal cache model, there exists a constant α such that:

Q(n) ≤ n²/L + n for n < αZ,
Q(n) ≤ 4 Q(n/2) + Θ(1) otherwise.

Unfolding the recurrence gives:

Q(n) ≤ 4 Q(n/2) + Θ(1)
     ≤ 4 [4 Q(n/4) + Θ(1)] + Θ(1)
     . . .
     ≤ 4^k Q(n/2^k) + Σ_{j=0}^{k−1} 4^j Θ(1),

where 2^k = n/(αZ) reaches the base case. With this base case,

Q(n) ≤ (n/αZ)² [(αZ)²/L + αZ] + Θ((n/αZ)²) ∈ Θ(n²/L + n²/Z + n²/Z²).

SLIDE 62

Work, span and parallelism: augment based principle I

Let us denote by W4(n), W5(n), W6(n), W7(n), W8(n) (resp. S4(n), S5(n), S6(n), S7(n), S8(n)) the work (resp. span) of Algorithms 4, 5, 6, 7 and 8 on input data of order n. Thus we have:

W4(n) ≤ 2 W4(n/2) + W5(n) and S4(n) ≤ S4(n/2) + S5(n). (4)

Algorithm 5 also proceeds in a divide-and-conquer manner, dividing the input data into two parts and performing recursive calls:

W5(n) ≤ W6(n) for n < C, and W5(n) ≤ 4 W5(n/2) + Θ(1) otherwise, (5)
S5(n) ≤ S6(n) for n < C, and S5(n) ≤ 2 S5(n/2) + Θ(1) otherwise. (6)

It is observed that W6(n), W7(n), W8(n) fit within Θ(n²), Θ(n), Θ(1), respectively. Moreover, since Algorithms 6, 7 and 8 are serial, we also have S6(n), S7(n), S8(n) within Θ(n²), Θ(n), Θ(1), respectively.

Let k = ⌈log₂(n/C)⌉. Then we have W5(n) ≤ O(4^k C²) = O(n²) and S5(n) ≤ O(2^k C²) = O(Cn). Therefore, from Relation (4), we deduce W4(n) ∈ O(n²) and S4(n) ∈ O(Cn).

SLIDE 63

Cache complexity: augment based principle I

The recursive structure of Algorithm 5 shows that there exists a positive constant α such that Q(n) satisfies:

Q(n) ≤ O(n/L + 1) for n < αZ,
Q(n) ≤ 4 Q(n/2) + Θ(1) otherwise,

provided C < αZ. This recurrence leads to the following inequality for all n ≥ 2:

Q(n) ≤ 4 Q(n/2) + Θ(1)
     ≤ 4 [4 Q(n/4) + Θ(1)] + Θ(1)
     . . .
     ≤ 4^k Q(n/2^k) + Σ_{j=0}^{k−1} 4^j Θ(1),

where k = ⌈log₂(n/αZ)⌉. Since n/2^k ≤ αZ, we deduce:

Q(n) ≤ (n/αZ)² (αZ/L + 1) + Θ((n/αZ)²) = O(n²/(ZL) + n²/Z²).

SLIDE 64

References I

[BDS90] Eric Bach, James Driscoll, and Jeffrey Shallit. Factor Refinement. In Symposium on Discrete Algorithms, pages 201–211, 1990.

[CMPS11] Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, and Éric Schost. Complexity and Performance Results for non FFT-based Univariate Polynomial Multiplication, 2011.

[DMS+05] X. Dahan, M. Moreno Maza, É. Schost, W. Wu, and Y. Xie. On the Complexity of the D5 Principle. SIGSAM Bull., 39:97–98, September 2005.

SLIDE 65

References II

[ELLMX10] Charles E. Leiserson, Liyun Li, M. Moreno Maza, and Yuzhen Xie. Parallel Computation of the Minimal Elements of a Poset. In 4th International Workshop on Parallel and Symbolic Computation, pages 53–62, 2010.

[MX11] Marc Moreno Maza and Yuzhen Xie. Balanced Dense Polynomial Multiplication on Multi-cores. Int. J. Found. Comput. Sci., 22(5):1035–1055, 2011.
