On the Factor Refinement Principle and its Implementation on Multicore Architectures


SLIDE 1

On the Factor Refinement Principle and its Implementation on Multicore Architectures

Master's Public Lecture. Presented by: Md. Mohsin Ali. Supervisor: Dr. Marc Moreno Maza

  • Dept. of Computer Science (ORCCA Lab)

The University of Western Ontario, London, ON, Canada

December 15, 2011

SLIDE 2

◮ Factor Refinement
◮ Serial Algorithms
  ◮ Approach based on the naive refinement
  ◮ Approach based on the augment refinement
  ◮ Approach based on subproduct trees
◮ Motivation
◮ Implementation Challenges on Multicore Architectures
◮ Contribution
◮ Proposed Parallel Algorithms
  ◮ A d-n-c illustration
  ◮ Parallel algorithms based on the naive refinement
  ◮ Parallel approach based on the augment refinement
  ◮ Parallel algorithm based on subproduct trees
◮ Conclusion

SLIDE 3

Factor Refinement

SLIDE 4

Factor Refinement I

Definition

◮ Let D be a UFD and m1, m2, . . . , mr be elements of D.
◮ Let m be the product of m1, m2, . . . , mr.
◮ We say that elements n1, n2, . . . , ns of D form a GCD-free basis whenever gcd(ni, nj) = 1 for all 1 ≤ i < j ≤ s.
◮ Let e1, e2, . . . , es be positive integers.
◮ We say that the pairs (n1, e1), (n2, e2), . . . , (ns, es) form a refinement of m1, m2, . . . , mr if the following conditions hold:
  (i) n1, n2, . . . , ns is a GCD-free basis,
  (ii) for every 1 ≤ i ≤ r there exist non-negative integers f1, . . . , fs such that ∏_{1≤j≤s} nj^fj = mi,
  (iii) ∏_{1≤i≤s} ni^ei = m.

When this holds, we also say that (n1, e1), (n2, e2), . . . , (ns, es) is a coprime factorization of m.

SLIDE 5

Factor Refinement II

Example
Let m1 = 30, m2 = 42 and their product m = 1260. Then
(i) (5, 1), (6, 2), (7, 1) is a refinement of 30 and 42,
(ii) 5, 6, 7 is a GCD-free basis of 30 and 42,
(iii) (5, 1), (6, 2), (7, 1) is a coprime factorization of 1260.
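The example can be checked mechanically; a small Python sketch (the function name is ours, not from the talk):

```python
from math import gcd, prod

def is_coprime_factorization(pairs, m):
    """Check that the (base, exponent) pairs have pairwise coprime bases
    and that their product reconstructs m."""
    bases = [n for n, _ in pairs]
    pairwise = all(gcd(bases[i], bases[j]) == 1
                   for i in range(len(bases))
                   for j in range(i + 1, len(bases)))
    return pairwise and prod(n ** e for n, e in pairs) == m
```

For the example, `is_coprime_factorization([(5, 1), (6, 2), (7, 1)], 1260)` holds, since 5 · 6² · 7 = 1260.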

SLIDE 6

Factor Refinement III

Applications

◮ Simplifying systems of polynomial equations and inequations,

(i) ab = bc = ca = 0 =⇒ a = b = c = 0,
(ii) below, {A, B, C, D, E, F, G} can be seen as a GCD-free basis of {S1, S2, S3}:

[Figure residue removed: the factors S1, S2, S3 are rewritten in terms of the pairwise coprime factors A, B, C, D, E, F, G.]

◮ consolidation of independent factorizations, ◮ etc.

SLIDE 7

Serial Algorithms: Approach based on the naive refinement and quadratic arithmetic

SLIDE 8

Approach based on the naive refinement I

Idea from Bach, Driscoll, and Shallit in 1990 [BDS90].

◮ Given a partial factorization of an integer m, say m = m1 m2, we compute d = gcd(m1, m2) and write m = (m1/d) · d² · (m2/d).
◮ This process is continued until all the factors are pairwise coprime.
◮ This is also used for the general case of more than two inputs, say m = m1 m2 · · · mℓ.

Algebraic complexity
If m = m1 m2 · · · mℓ, then this algorithm takes O(size(m)³) bit operations, where size(m) = 1 if m = 0, and size(m) = 1 + ⌊log2 |m|⌋ if m > 0.
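A minimal serial sketch of this principle for square-free integers (the function name and structure are ours, not from [BDS90]; the actual algorithm organizes the work more carefully, as the following slides explain):

```python
from math import gcd

def naive_refinement(factors):
    """Repeatedly split any two factors a, b with a nontrivial d = gcd(a, b),
    rewriting a^e * b^f as (a/d)^e * d^(e+f) * (b/d)^f, until all pairs are
    coprime.  Assumes square-free inputs; returns sorted (base, exp) pairs."""
    work = [(m, 1) for m in factors if m != 1]
    changed = True
    while changed:
        changed = False
        for i in range(len(work)):
            for j in range(i + 1, len(work)):
                (a, e), (b, f) = work[i], work[j]
                d = gcd(a, b)
                if d != 1:
                    work[i], work[j] = (a // d, e), (b // d, f)
                    work.append((d, e + f))
                    work = [p for p in work if p[0] != 1]  # drop trivial factors
                    changed = True
                    break
            if changed:
                break
    return sorted(work)
```

For 30 and 42 this returns [(5, 1), (6, 2), (7, 1)], matching the example on slide 5.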

SLIDE 9

Serial Algorithms: Approach based on the augment refinement and quadratic arithmetic

SLIDE 10

Approach based on augment refinement and quadratic arithmetic I

Again from Bach, Driscoll, and Shallit in 1990 [BDS90].

◮ The basic idea is the same as before, but organizing the computations more carefully leads to an improved complexity [BDS90].
◮ The trick is to keep track of the pairs (nj, nk) in an ordered pair list such that only elements adjacent in the list can have a nontrivial GCD.

Algebraic complexity
If m = m1 m2 · · · mℓ, then this algorithm takes O(size(m)²) bit operations, where size(m) = 1 if m = 0, and size(m) = 1 + ⌊log2 |m|⌋ if m > 0.

SLIDE 11

Serial Algorithms: Approach based on subproduct trees

SLIDE 12

Approach based on subproduct trees I

Idea of an asymptotically fast algorithm for GCD-free basis computation, from Dahan, Moreno Maza, Schost, and Xie in 2005 [DMS+05].

◮ Divide the input into sub-problems until a base case is reached,
◮ Conquer the sub-problems from the leaves to the root, applying fast arithmetic based on subproduct trees (described later).

Algebraic complexity
The total number of field operations of this algorithm is O(M(d) log₂³ d), where
◮ d is the sum of the degrees of the input polynomials,
◮ M(d) is a multiplication time for two univariate polynomials of degree less than d.

SLIDE 13

Motivation I

Parallel Computation of the Minimal Elements of a Poset

◮ by Leiserson, Li, Moreno Maza, and Xie in 2010 [ELLMX10].
◮ This is a multithreaded (fork-join parallelism) approach which is divide-and-conquer, free of data races, and inspired by parallel merge sort.
◮ Its Cilk++ implementation shows nearly linear speed-up on 32-core processors for sufficiently large input data sets.

This work led us to the design and implementation of parallel factor refinement algorithms.

SLIDE 14

Implementation Challenges on Multicore Architectures

SLIDE 15

Multithreaded Parallelism on Multicore Architectures I

Multicore architectures

◮ A multicore processor is a single computing component with two or more independent and tightly coupled processors, called cores, sharing memory.
◮ They also share the same bus and memory controller; thus memory bandwidth may limit performance.
◮ In order to maintain memory consistency, synchronization is needed between cores, which may also limit performance.

SLIDE 16

Multithreaded Parallelism on Multicore Architectures II

Fork-join parallelism

◮ This model represents the execution of a multithreaded program as a set of nonblocking threads denoted by the vertices of a dag, where the dag edges indicate dependencies between instructions.
◮ Assuming unit cost of execution for all threads, the number of vertices of the dag is the work (= running time on a single core).
◮ The maximum length of a path from the root to a leaf is the span (= running time on ∞ processors).
◮ The parallelism is the ratio of work to span (= average amount of work along the span).
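These three quantities can be computed directly on a small dag (a sketch with our own names; every node has unit cost, as in the model's assumption):

```python
def work_and_span(dag, cost):
    """Work, span and parallelism of a computation dag.

    dag maps each node to its list of successors; cost maps each node to
    its execution time.  Work is the total cost; span is the cost of the
    longest root-to-leaf path; parallelism is their ratio."""
    work = sum(cost[v] for v in dag)

    memo = {}
    def longest(v):                      # longest path starting at v
        if v not in memo:
            memo[v] = cost[v] + max((longest(w) for w in dag[v]), default=0)
        return memo[v]

    span = max(longest(v) for v in dag)
    return work, span, work / span
```

For a diamond-shaped dag of four unit-cost nodes (one fork, two parallel nodes, one join), work = 4, span = 3, parallelism = 4/3.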

SLIDE 17

The Ideal-cache Model I

[Figure residue removed. The diagram shows a CPU performing W work against a cache of Z words, organized in Z/L lines of L words each under an optimal replacement strategy, connected to the main memory; Q counts the cache misses.]

Figure 1: The ideal-cache model.

SLIDE 18

The Ideal-cache Model II

◮ The processor can only refer to words that reside in the cache memory, which is a small and fast-access memory containing Z words organized in cache lines of L words each.
◮ If the line containing a referenced word is not in cache, the corresponding line needs to be brought from the main memory. This is a cache miss. If the cache is full, a cache line must be evicted.
◮ Cache complexity analyzes algorithms in terms of cache misses.

SLIDE 19

From Cilk to Cilk++ I

The language

◮ Cilk (resp. Cilk++) is an extension of C (resp. C++) implementing fork-join parallelism with two keywords, spawn and sync.
◮ A Cilk (resp. Cilk++) program has the same semantics as its C (resp. C++) elision.

Performance of the work-stealing scheduler
In theory, the Cilk (resp. Cilk++) scheduler executes any Cilk (resp. Cilk++) computation in nearly optimal time on p processors, provided that
◮ for almost all parallel steps, there are at least p units of work which can be run concurrently,
◮ each processor is either working or stealing work,
◮ each thread executes in unit time.

SLIDE 20

Parallelization overheads I

Overheads and burden

◮ In practice, the observed speedup factor may be less (sometimes much less) than the theoretical parallelism.
◮ Many factors explain this: simplifying assumptions of the fork-join parallelism model, architecture limitations, the cost of executing the parallel constructs, and scheduling overheads.

Parallelism vs. burdened parallelism
◮ Cilkview is a performance analyzer which calculates the work, the span, and the parallelism of a given Cilk++ program run.
◮ Cilkview also estimates the running time Tp on p processors as Tp = T1/p + 1.7 · burdened_span, where the burdened span charges 15,000 instructions for each spawn along the span!

SLIDE 21

Contribution I

◮ Parallel algorithm based on the naive refinement principle [NOT GOOD for data locality and thus for parallelism on multicore architectures].
◮ Parallel algorithm based on the augment refinement principle [GOOD for data locality and parallelism].
◮ Parallel algorithm based on subproduct trees [MORE CHALLENGING to implement on multicore architectures].

Principle
All are based on algorithms which are divide-and-conquer (d-n-c), multithreaded, and free of data races.

SLIDE 22

Proposed Parallel Algorithms: A d-n-c illustration

SLIDE 23

A d-n-c illustration I

[Figure residue removed. The input 2, 6, 7, 10, 15, 21, 22, 26 is split recursively into halves down to single elements (Expand, done in parallel), and the partial refinements are then combined pairwise back up (Merge, done in parallel), yielding the coprime factorization (11, 1), (13, 1), (3, 3), (7, 2), (5, 2), (2, 5).]

Figure 2: Example of algorithm execution.

SLIDE 24

Proposed Parallel Algorithms: Parallel algorithms based on the naive refinement

SLIDE 25

Parallel algorithms based on the naive refinement I

Expanding input and calling Merge

Algorithm 1: ParallelFactorRefinement(A)
Input: Array of square-free positive integers A = m1, m2, . . . , mk.
Output: Two arrays of positive integers N = n1, n2, . . . , ns and E = e1, e2, . . . , es such that (n1, e1), (n2, e2), . . . , (ns, es) is a refinement of m1, m2, . . . , mk. Moreover, n1, n2, . . . , ns are square-free.
1: if k = 1 then
2:   return [A], [1];
3: else
4:   Divide array A into two parts called A1 and A2;
5:   f1 ← spawn ParallelFactorRefinement(A1);
6:   f2 ← spawn ParallelFactorRefinement(A2);
7:   sync;
8:   return MergeRefinement(f1, f2);
SLIDE 26

Parallel algorithms based on the naive refinement II

Computing all pairwise GCDs and separating the factors with those GCDs.

Algorithm 2: MergeRefinement(A, E, B, F)
Input: Arrays of square-free positive integers A = l1, . . . , lk, E = e1, . . . , ek, B = m1, . . . , mr and F = f1, . . . , fr, where A, E (resp. B, F) is regarded as a factor refinement (l1, e1), . . . , (lk, ek) (resp. (m1, f1), . . . , (mr, fr)).
Output: A factor refinement of the concatenation of A, E and B, F.
1: Allocate G and H as (k × r) row-major arrays, A′ a k-array and B′ an r-array;
2: parallel for (i, j) ∈ {1, . . . , k} × {1, . . . , r} do Hi,j ← 0;
3: G ← GcdOfAllPairs(A, B);
4: parallel for i = 1 to k do
5:   pi ← ∏_{1≤j≤r} Gi,j;
6:   A′i ← li/pi;
7:   for j = 1 to r do
8:     if Gi,j ≠ 1 then
9:       Hi,j ← Hi,j + ei;
10: For each column, do the similar work of Lines 4 to 9;
11: Write the non-one entries of A′ (for each row), B′ (for each column), and G to C;
12: Write the corresponding exponents from E, F, and H to D;
13: return C, D;

SLIDE 27

Parallel algorithms based on the naive refinement III

Algorithm 3: GcdOfAllPairsInner(A, B, G)
Input: 1-D arrays of positive integers A = l1, l2, . . . , lk, B = m1, m2, . . . , mr and a 2-D array G = [Gi,j | 1 ≤ i ≤ k, 1 ≤ j ≤ r].
Output: All pairs gcd(li, mj) for 1 ≤ i ≤ k, 1 ≤ j ≤ r, with gcd(li, mj) stored in Gi,j.
Comment: C is a global variable equal to a base-case threshold, say 16. Assume C ≥ 2.
1: if k ≤ C and r ≤ C then
2:   for (i, j) ∈ {1, . . . , k} × {1, . . . , r} do
3:     Gi,j ← gcd(Ai, Bj); [place of INEFFICIENCY for cache complexity]
4: else if k > C and r ≤ C then
5:   Divide A into two halves A1 = l1, . . . , lk/2 and A2 = lk/2+1, . . . , lk;
6:   GcdOfAllPairsInner(A1, B, G);
7:   GcdOfAllPairsInner(A2, B, G);
8: else if k ≤ C and r > C then
9:   Divide B into two halves B1 = m1, . . . , mr/2 and B2 = mr/2+1, . . . , mr;
10:  GcdOfAllPairsInner(A, B1, G);
11:  GcdOfAllPairsInner(A, B2, G);
12: else
13:  Divide A and B into two halves A1 = l1, . . . , lk/2, A2 = lk/2+1, . . . , lk, B1 = m1, . . . , mr/2, B2 = mr/2+1, . . . , mr;
14:  spawn GcdOfAllPairsInner(A1, B1, G);
15:  spawn GcdOfAllPairsInner(A2, B2, G);
16:  spawn GcdOfAllPairsInner(A1, B2, G);
17:  spawn GcdOfAllPairsInner(A2, B1, G);
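The blocked recursion of Algorithm 3 can be sketched serially over index ranges (gcd_of_all_pairs is our name; the four quadrant calls in the final branch touch pairwise disjoint blocks of G, which is why the algorithm can spawn them all):

```python
from math import gcd

def gcd_of_all_pairs(A, B, base=16):
    """Divide-and-conquer all-pairs gcd table in the style of Algorithm 3
    (serial sketch): split the larger dimension until both index ranges
    fit under the base-case threshold, then fill the block directly."""
    G = [[0] * len(B) for _ in A]

    def inner(i0, i1, j0, j1):
        k, r = i1 - i0, j1 - j0
        if k <= base and r <= base:          # base case: direct table fill
            for i in range(i0, i1):
                for j in range(j0, j1):
                    G[i][j] = gcd(A[i], B[j])
        elif k > base and r <= base:
            mid = i0 + k // 2
            inner(i0, mid, j0, j1); inner(mid, i1, j0, j1)
        elif k <= base and r > base:
            mid = j0 + r // 2
            inner(i0, i1, j0, mid); inner(i0, i1, mid, j1)
        else:                                # split both: four disjoint quadrants
            im, jm = i0 + k // 2, j0 + r // 2
            inner(i0, im, j0, jm); inner(im, i1, jm, j1)
            inner(i0, im, jm, j1); inner(im, i1, j0, jm)

    inner(0, len(A), 0, len(B))
    return G
```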

SLIDE 28

Parallel algorithms based on the naive refinement IV

Proposition: work, span and parallelism
On input data of size n for Algorithm 1, under the condition that each arithmetic operation (integer division and integer GCD computation) has a unit cost,

◮ Work = O(n²),
◮ Span = O(n), and
◮ Parallelism = O(n). GOOD! :-)

SLIDE 29

Parallel algorithms based on the naive refinement V

Proposition: cache complexity
Under the ideal cache model of Z words with L words per cache line (model of Frigo, Leiserson, Prokop, and Ramachandran, 1999), with C small enough, for any input of size n, the number of cache misses of Algorithm 3 is

Q(n) ∈ O(n²/L + n²/Z + n²/Z²), BAD! :-(

under the assumption that
◮ there exists a positive integer w such that each integer coefficient of each input array A or B in Algorithms 1 and 2 is stored in w consecutive machine words, and
◮ each input array A or B is packed, that is, its successive elements occupy consecutive memory slots.

SLIDE 30

Experimental results I

◮ The speedup estimates are obtained with the Cilkview feature of Cilk++ (Frigo, Leiserson and Randall, 1998), executing on an Intel Xeon (64-bit) machine with a 2.40 GHz CPU (E7340) and 128.0 GB of RAM, in a 16-core SHARCNET cluster.
◮ Timings are obtained on an Intel Core i7 (64-bit) machine with a 2.93 GHz CPU (870), 8.0 GB of RAM and 8 cores, configured in the ORCCA Lab.
◮ Timing results are the average of 5 executions.

SLIDE 31

Experimental results II

[Speedup plot: measured speedup vs. number of cores (1–16), against the ideal speedup and a lower performance bound; parallelism = 2395.73.]

Figure 3: Scalability analysis, by Cilkview, of the naive refinement based parallel factor refinement algorithm for 4,000 dense square-free univariate polynomials, all of degree 60, with coefficient operations taken modulo a small prime.

SLIDE 32

Experimental results III

[Running-time plot: time (sec) vs. input size (100–600) for the algorithm on 1 core, on 8 cores, and for Maple.]

Figure 4: Running time comparisons of the naive refinement based parallel factor refinement algorithm for dense square-free univariate polynomials, all of degree 60, with coefficient operations taken modulo a small prime.

SLIDE 33

Proposed Parallel Algorithms: Parallel approach based on the augment refinement

SLIDE 34

Parallel approach based on the augment refinement I

Expanding input and calling Merge

Algorithm 4: ParallelFactorRefinementDNC(A)
Input: A sequence of square-free polynomials A = (m1, m2, . . . , mk) ∈ K[x].
Output: A sequence of square-free pairwise coprime polynomials N = (n1, n2, . . . , ns) ∈ K[x] and a sequence of positive integers E = (e1, e2, . . . , es) such that (n1, e1), (n2, e2), . . . , (ns, es) is a factor refinement of m1, m2, . . . , mk.
1: if k < 2 then
2:   return (m1), (1);
3: else
4:   Divide A into two sub-sequences called A1 and A2;
5:   (X1, Y1) ← spawn ParallelFactorRefinementDNC(A1);
6:   (X2, Y2) ← spawn ParallelFactorRefinementDNC(A2);
7:   sync;
8:   return MergeRefinementDNC(X1, Y1, X2, Y2);

SLIDE 35

Parallel approach based on the augment refinement II

Proceed recursively and call Merge

Algorithm 5: MergeRefinementDNC(A, E, B, F)
Input: Two sequences of square-free pairwise coprime polynomials A = (a1, . . . , an) and B = (b1, . . . , br) ∈ K[x], together with two sequences of positive integers E = (e1, . . . , en) and F = (f1, . . . , fr).
Output: A factor refinement of (a1, e1), . . . , (an, en), (b1, f1), . . . , (br, fr).
1: if n ≤ BASESIZE or r ≤ BASESIZE then
2:   return MergeRefineTwoSeq(A, E, B, F);
3: else
4:   Divide A, E, B, and F into two halves called A1, A2, E1, E2, B1, B2, and F1, F2, respectively;
5:   (L1, M1, Q1, R1, S1, T1) ← spawn MergeRefinementDNC(A1, E1, B1, F1);
6:   (L2, M2, Q2, R2, S2, T2) ← spawn MergeRefinementDNC(A2, E2, B2, F2);
7:   sync;
8:   (L3, M3, Q3, R3, S3, T3) ← spawn MergeRefinementDNC(L1, M1, S2, T2);
9:   (L4, M4, Q4, R4, S4, T4) ← spawn MergeRefinementDNC(L2, M2, S1, T1);
10:  sync;
11:  return (L3 + L4, M3 + M4, Q1 + Q2 + Q3 + Q4, R1 + R2 + R3 + R4, S3 + S4, T3 + T4);

SLIDE 36

Parallel approach based on the augment refinement III

Proceed iteratively and call Merge

Algorithm 6: MergeRefineTwoSeq(A, E, B, F)
Input: Two sequences of square-free pairwise coprime polynomials A = (a1, . . . , an) and B = (b1, . . . , br) of K[x], together with two sequences of positive integers E = (e1, . . . , en) and F = (f1, . . . , fr).
Output: A factor refinement of (a1, e1), . . . , (an, en), (b1, f1), . . . , (br, fr).
1: L ← ∅; M ← ∅; Q ← ∅; R ← ∅; S0 ← B; T0 ← F;
2: for i from 1 to n do
3:   (ℓi, mi, Qi, Ri, Si, Ti) ← MergeRefinePolySeq(ai, ei, Si−1, Ti−1);
4:   Q ← Q + Qi;
5:   R ← R + Ri;
6:   if ℓi ≠ 1 then
7:     L ← L + (ℓi);
8:     M ← M + (mi);
9: return (L, M, Q, R, Sn, Tn);

SLIDE 37

Parallel approach based on the augment refinement IV

Proceed iteratively with augment refinement

Algorithm 7: MergeRefinePolySeq(a, e, B, F)
Input: A square-free polynomial a ∈ K[x], a positive integer e, a sequence of square-free pairwise coprime polynomials B = (b1, . . . , bn) of K[x] and a sequence of positive integers F = (f1, . . . , fn).
Output: A factor refinement of (a, e), (b1, f1), . . . , (bn, fn).
1: ℓ0 ← a; m0 ← e; Q ← ∅; R ← ∅; S ← ∅; T ← ∅;
2: for i from 1 to n do
3:   (ℓi, mi, Gi, Vi, di, wi) ← PolyRefine(ℓi−1, mi−1, bi, fi);
4:   Q ← Q + Gi;
5:   R ← R + Vi;
6:   if di ≠ 1 then
7:     S ← S + (di);
8:     T ← T + (wi);
9: return (ℓn, mn, Q, R, S, T);

SLIDE 38

Parallel approach based on the augment refinement V

Refinement of two inputs (pair-refine)

Algorithm 8: PolyRefine(a, e, b, f)
Input: Two square-free univariate polynomials a, b ∈ K[x] for a field K and two positive integers e, f.
Output: A factor refinement of (a, e), (b, f).
1: g ← gcd(a, b);
2: a′ ← a quo g;
3: b′ ← b quo g;
4: if g = 1 then
5:   return (a, e, ∅, ∅, b, f);   // ∅ designates the empty sequence
6: else if a = b then
7:   return (1, 1, (a), (e + f), 1, 1);
8: else
9:   (ℓ1, e1, G1, V1, r1, f1) ← PolyRefine(a′, e, g, e + f);
10:  (ℓ2, e2, G2, V2, r2, f2) ← PolyRefine(r1, f1, b′, f);
11:  if ℓ2 ≠ 1 then
12:    G2 ← G2 + (ℓ2);   // + designates sequence concatenation
13:    V2 ← V2 + (e2);
14:  return (ℓ1, e1, G1 + G2, V1 + V2, r2, f2);
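An integer analogue of the pair-refine step, for square-free inputs (a sketch; the function name is ours). It returns (ℓ, e′, G, V, r, f′): the still-unfinished left part, the finished coprime factors with their exponents, and the unfinished right part:

```python
from math import gcd

def pair_refine(a, e, b, f):
    """Integer analogue of PolyRefine for square-free a, b: recursively
    split a^e * b^f into a coprime factorization."""
    g = gcd(a, b)
    if g == 1:
        return a, e, [], [], b, f
    if a == b:
        return 1, 1, [a], [e + f], 1, 1
    l1, e1, G1, V1, r1, f1 = pair_refine(a // g, e, g, e + f)
    l2, e2, G2, V2, r2, f2 = pair_refine(r1, f1, b // g, f)
    if l2 != 1:                      # l2 is finished: move it to the middle
        G2, V2 = G2 + [l2], V2 + [e2]
    return l1, e1, G1 + G2, V1 + V2, r2, f2
```

For a = 30, b = 42 with e = f = 1, this returns (5, 1, [6], [2], 7, 1), i.e. the refinement 5¹ · 6² · 7¹ of slide 5.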

SLIDE 39

Parallel approach based on the augment refinement VI

Proposition: work, span and parallelism
On input data of size n for Algorithm 4, under the condition that each arithmetic operation (division and GCD computation) has a unit cost, and where C is the threshold BASESIZE,

◮ Work = O(n²),
◮ Span = O(Cn), and
◮ Parallelism = O(n/C). NOT BAD! :-)

SLIDE 40

Parallel approach based on the augment refinement VII

Proposition: cache complexity
Under the ideal cache model of Z words with L words per cache line (model of Frigo, Leiserson, Prokop, and Ramachandran, 1999), with C small enough, for any input of size n, the number of cache misses of Algorithm 5 is

Q(n) = O(n²/(ZL) + n²/Z²), GOOD! :-)

under the assumption that
◮ each input or output sequence in Algorithms 5, 6, 7 and 8 is packed, and
◮ each element of the field K can be stored in one machine word.

SLIDE 41

Experimental results I

[Speedup plot: measured speedup vs. number of cores (1–16), against the ideal speedup and a lower performance bound; parallelism = 170.73.]

Figure 5: Scalability analysis, by Cilkview, of the augment refinement based parallel factor refinement algorithm for 200,000 int-type inputs.

SLIDE 42

Experimental results II

[Running-time plot: time (sec) vs. input size (5000–6000) for the algorithm on 1 core, on 8 cores, and for Maple.]

Figure 6: Running time comparisons of the augment refinement based parallel factor refinement algorithm for int-type inputs.

SLIDE 43

Experimental results III

[Speedup plot: measured speedup vs. number of cores (1–16), against the ideal speedup and a lower performance bound; parallelism = 265.20.]

Figure 7: Scalability analysis, by Cilkview, of the augment refinement based parallel factor refinement algorithm for 4,000 dense square-free univariate polynomials, all of degree 150, with coefficient operations taken modulo a small prime.

SLIDE 44

Experimental results IV

[Running-time plot: time (sec) vs. input size (100–600) for the algorithm on 1 core, on 8 cores, and for Maple.]

Figure 8: Running time comparisons of the augment refinement based parallel factor refinement algorithm for dense square-free univariate polynomials, all of degree 150, with coefficient operations taken modulo a small prime.

SLIDE 45

Experimental results V

[Speedup plot: measured speedup vs. number of cores (1–16), against the ideal speedup and a lower performance bound; parallelism = 136.93.]

Figure 9: Scalability analysis, by Cilkview, of the augment based parallel factor refinement algorithm for 4,120 sparse square-free univariate polynomials when the input is already a GCD-free basis; all of degree 150, with coefficient operations taken modulo a small prime.

SLIDE 46

Experimental results VI

[Running-time plot: time (sec) vs. input size (200–600) for the algorithm on 1 core, on 8 cores, and for Maple.]

Figure 10: Running time comparisons of the augment based parallel factor refinement algorithm for sparse square-free univariate polynomials when the input is already a GCD-free basis; all of degree 150, with coefficient operations taken modulo a small prime.

SLIDE 47

Parallel Algorithm Based on Subproduct Trees

SLIDE 48

Subproduct Tree I

◮ A useful construction to devise fast algorithms with univariate polynomials.
◮ If m0, m1, . . . , mn−1 are monic, non-constant polynomials in K[x], then their subproduct tree is as in Figure 11 [cost: O(M(d) log₂ d), where d = Σ_{i=0}^{n−1} deg(mi)].

[Figure residue removed. The leaves m0, m1, . . . , mn−1 sit at level i = 0; each internal node is the product of its two children; the root, at level i = k, is m0 m1 · · · mn−1.]

Figure 11: Construction of a subproduct tree.
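A naive coefficient-list sketch of the construction (poly_mul and subproduct_tree are our names; a real implementation would use FFT-based multiplication to reach the M(d)-based cost, and the input count is assumed to be a power of two, as in the figure):

```python
def poly_mul(p, q):
    """Naive product of two polynomials given as coefficient lists,
    lowest degree first (an FFT-based product replaces this in practice)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def subproduct_tree(ms):
    """Bottom-up subproduct tree: levels[0] are the input polynomials,
    each higher level multiplies adjacent pairs; levels[-1][0] is the
    product of all inputs."""
    levels = [list(ms)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([poly_mul(prev[i], prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels
```

For the inputs x+1, x+2, x+3, x+4, the root is x⁴ + 10x³ + 35x² + 50x + 24, their product.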

SLIDE 49

Fast multiple remainders I

◮ Let m0, m1, . . . , mn−1 be as before and assume their subproduct tree M has been computed.
◮ To compute f rem m0, f rem m1, . . . , f rem mn−1, one reduces f by the root of M, then by its children (leading to two remainders), and so on down to the leaves.
◮ For deg(f) < d, this amounts to O(M(d) log₂ d) field operations.

[Figure residue removed: the subproduct tree of m0, . . . , mn−1, traversed from the root down.]

Figure 12: Computing multiple remainders down the subproduct tree.
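Going down the tree can be sketched the same way (poly_rem and multi_rem are our names; the naive division assumes monic, non-constant moduli, which all subproduct-tree nodes of monic inputs are):

```python
def poly_rem(f, m):
    """Remainder of f modulo a MONIC, non-constant polynomial m
    (coefficient lists, lowest degree first), by naive long division."""
    f = list(f)
    while len(f) >= len(m):
        c = f[-1]                    # leading coefficient (m is monic)
        off = len(f) - len(m)
        for i in range(len(m)):
            f[off + i] -= c * m[i]
        f.pop()                      # top coefficient is now zero
    while len(f) > 1 and f[-1] == 0:
        f.pop()
    return f

def multi_rem(f, levels):
    """Reduce f down a subproduct tree (levels built bottom-up, root at
    levels[-1]): each node takes the remainder of its parent's remainder,
    ending with f rem m_i at every leaf."""
    rems = [poly_rem(f, levels[-1][0])]
    for level in reversed(levels[:-1]):
        rems = [poly_rem(rems[i // 2], m) for i, m in enumerate(level)]
    return rems
```

Reducing f = x³ through the tree of x+1, x+2, x+3, x+4 yields the remainders −1, −8, −27, −64, i.e. f evaluated at the roots of the leaves.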

SLIDE 50

Fast Multiple GCDs I

◮ Suppose we have a polynomial f ∈ K[x] and a sequence of square-free polynomials A = a0, a1, . . . , an−1 in K[x], together with their subproduct tree.
◮ We compute multiGcd(f, A), that is, all gcd(f, ai) for 0 ≤ i ≤ n − 1, by reducing f down the subproduct tree and returning the gcd with ai at each leaf.

[Figure residue removed. The tree is annotated top-down: mk,0 = f mod (a0 a1 · · · an−1) at the root; mk−1,0 = mk,0 mod (a0 · · · an/2−1) and mk−1,1 = mk,0 mod (an/2 · · · an−1) at its children; down to m0,i = m1,⌊i/2⌋ mod ai at each leaf, where gcd(m0,i, ai) is returned.]

Figure 13: Multiple GCDs computation.

SLIDE 51

Fast All Possible Pairs of GCDs I

◮ Suppose we have two sequences of square-free polynomials A = a0, a1, . . . , an−1 and B = b0, b1, . . . , bs−1 in K[x].
◮ We compute parallelPairsOfGcd(A, B), that is, gcd(ai, bj) for all 0 ≤ i ≤ n − 1 and 0 ≤ j ≤ s − 1.
◮ This is done using the subproduct tree of b0, b1, . . . , bs−1 and calling multiGcd(f, A) at each node.

SLIDE 52

Merging GCD-free Bases I

◮ Once we have a routine parallelPairsOfGcd(A, B), we easily derive a routine for merging two GCD-free bases of square-free polynomials.
◮ Let us call it parallelMergeGcdFreeBases(f1, f2).

SLIDE 53

Parallel GCD-free Bases I

Expanding input and calling Merge

Algorithm 9: parallelGcdFreeBasis(A)
Input: Sequence of square-free polynomials A = a1, a2, . . . , ae.
Output: A GCD-free basis of A.
1: Build a subproduct tree Sub′ where the root is labeled by the sequence of polynomials A;
2: for every node N ∈ Sub′, bottom-up, processing all nodes of the same level concurrently do
3:   if N is not a leaf then
4:     f1 ← spawn leftChild(N);
5:     f2 ← spawn rightChild(N);
6:     sync;
7:     Label N by parallelMergeGcdFreeBases(f1, f2);
8: return the label of RootOf(Sub′);

SLIDE 54

Parallel GCD-free Bases II

Proposition: work, span and parallelism
On input data of size n for Algorithm 9,

◮ Work = O(M(d) log₂³ d), GOOD thanks to fast arithmetic,
◮ Span = O((d/n) log₂² n), and
◮ Parallelism = O(n log₂⁴ d / log₂² n), NOT GOOD, because it is almost similar to that of the QUADRATIC algorithms, but with higher parallelization overheads,

where
(i) d = Σ_{i=1}^{n} deg(input_i),
(ii) FFT-based parallel univariate multiplication gives M(d) ∈ O(d log₂ d),
(iii) computing a univariate polynomial GCD in degree d has a work of O(M(d) log₂ d) and a span of O(d), and
(iv) the subproduct tree of n arbitrary polynomials needs a work of O(M(d) log₂ n) and a span of O(log₂² d · log₂ n).

SLIDE 55

Parallel GCD-free Bases III

Key observations:

◮ Better work than that of the quadratic algorithms.
◮ Almost similar parallelism to that of the quadratic algorithms.
◮ But the implementation on multicore is more challenging, because
(i) parallelization of polynomial multiplication on multicore is a hard problem, as reported by Chowdhury, Moreno Maza, Pan, and Schost in 2011 [CMPS11] and by Moreno Maza and Xie in 2011 [MX11],
(ii) those routines are efficient from degree 100,000 to 1,000,000, which has a negative impact on the span of the subproduct tree,
(iii) subproduct tree construction also has
  ◮ a negative impact on memory consumption, and
  ◮ no known cache-friendly algorithm on multicore.

SLIDE 56

Conclusion

SLIDE 57

Conclusion I

◮ We have proposed and analyzed parallel algorithms for factor refinement (and GCD-free basis) computation.
◮ We have observed that cache complexity has major effects on performance.
◮ Algorithm 5, with cache complexity O(n²/(ZL)), is more efficient than Algorithm 3, with O(n²/L), as we verified experimentally.
◮ Asymptotically fast polynomial arithmetic does not always pay off on multicore architectures.
◮ Work in progress:
  ◮ handling multi-precision (= big) integers efficiently,
  ◮ supporting the multivariate case in the context of polynomial system solving.

SLIDE 58

Acknowledgement I

I would like to acknowledge Dr. Yuzhen Xie, Dr. Rong Xiao, Dr. Changbo Chen and other ORCCA Lab members for their great help throughout this research work.

SLIDE 59

Thanks to All!

SLIDE 60

Work, span and parallelism: naive based principle I

Let W1(n), W2(n), W3(n) (resp. S1(n), S2(n), S3(n)) denote the work (resp. span) of Algorithms 1, 2 and 3 on input data of order n. Thus we have:

W1(n) = 2 W1(n/2) + W2(n) and S1(n) = S1(n/2) + S2(n). (1)
W2(n) = W3(n) + O(n²) and S2(n) = S3(n) + O(n). (2)
W3(n) = 4 W3(n/2) + Θ(1) and S3(n) = S3(n/2) + Θ(1). (3)

From Equation (3), we have W3(n) = Θ(n²) and S3(n) = Θ(log₂ n). From Equation (2), we obtain W2(n) = Θ(n²) and S2(n) = Θ(n). Finally, from Equation (1), we deduce W1(n) = Θ(n²) and S1(n) = Θ(n).

SLIDE 61

Cache complexity: naive based principle I

Let Q(n) be the number of cache misses of Algorithm 3. Under the ideal cache model, there exists a constant α such that:

Q(n) ≤ n²/L + n for n < αZ,
Q(n) ≤ 4 Q(n/2) + Θ(1) otherwise.

Unfolding the recurrence gives:

Q(n) ≤ 4 Q(n/2) + Θ(1)
     ≤ 4 [4 Q(n/4) + Θ(1)] + Θ(1)
     . . .
     ≤ 4^k Q(n/2^k) + Σ_{j=0}^{k−1} 4^j Θ(1),

where 2^k = n/(αZ) reaches the base case. With this base case,

Q(n) ≤ (n/αZ)² [(αZ)²/L + αZ] + Θ((n/αZ)²) ∈ Θ(n²/L + n²/Z + n²/Z²).

SLIDE 62

Work, span and parallelism: augment based principle I

Let us denote by W4(n), W5(n), W6(n), W7(n), W8(n) (resp. S4(n), S5(n), S6(n), S7(n), S8(n)) the work (resp. span) of Algorithms 4, 5, 6, 7 and 8 on input data of order n. Thus we have:

W4(n) ≤ 2 W4(n/2) + W5(n) and S4(n) ≤ S4(n/2) + S5(n). (4)

Algorithm 5 also proceeds in a divide-and-conquer manner, dividing the input data into two parts and performing recursive calls:

W5(n) ≤ W6(n) for n < C, and W5(n) ≤ 4 W5(n/2) + Θ(1) otherwise, (5)
S5(n) ≤ S6(n) for n < C, and S5(n) ≤ 2 S5(n/2) + Θ(1) otherwise. (6)

It is observed that W6(n), W7(n), W8(n) fit within Θ(n²), Θ(n), Θ(1), respectively. Moreover, since Algorithms 6, 7 and 8 are serial, we also have S6(n), S7(n), S8(n) within Θ(n²), Θ(n), Θ(1), respectively.

Let k = ⌈log₂(n/C)⌉. Then we have W5(n) ≤ O(4^k C²) = O(n²) and S5(n) ≤ O(2^k C²) = O(Cn). Therefore, from Relation (4), we deduce W4(n) ∈ O(n²) and S4(n) ∈ O(Cn).

SLIDE 63

Cache complexity: augment based principle I

The recursive structure of Algorithm 5 shows that there exists a positive constant α such that Q(n) satisfies:

Q(n) ≤ O(n/L + 1) for n < αZ,
Q(n) ≤ 4 Q(n/2) + Θ(1) otherwise,

provided C < αZ. This recurrence leads to the following inequality for all n ≥ 2:

Q(n) ≤ 4 Q(n/2) + Θ(1)
     ≤ 4 [4 Q(n/4) + Θ(1)] + Θ(1)
     . . .
     ≤ 4^k Q(n/2^k) + Σ_{j=0}^{k−1} 4^j Θ(1),

where k = ⌈log₂(n/αZ)⌉. Since n/2^k ≤ αZ, we deduce:

Q(n) ≤ (n/αZ)² (αZ/L + 1) + Θ((n/αZ)²) = O(n²/(ZL) + n²/Z²).

SLIDE 64

References I

[BDS90] Eric Bach, James Driscoll, and Jeffrey Shallit. Factor Refinement. In Symposium on Discrete Algorithms, pages 201–211, 1990.

[CMPS11] Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, and Éric Schost. Complexity and Performance Results for non FFT-based Univariate Polynomial Multiplication, 2011.

[DMS+05] X. Dahan, M. Moreno Maza, É. Schost, W. Wu, and Y. Xie. On the Complexity of the D5 Principle. SIGSAM Bull., 39:97–98, September 2005.

SLIDE 65

References II

[ELLMX10] Charles E. Leiserson, Liyun Li, M. Moreno Maza, and Yuzhen Xie. Parallel Computation of the Minimal Elements of a Poset. In 4th International Workshop on Parallel and Symbolic Computation, pages 53–62, 2010.

[MX11] Marc Moreno Maza and Yuzhen Xie. Balanced Dense Polynomial Multiplication on Multi-cores. Int. J. Found. Comput. Sci., 22(5):1035–1055, 2011.
