The Fork-Join Model and its Implementation in Cilk
Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS 4402 - CS 9535
SLIDE 1
SLIDE 2
Plan
Parallelism Complexity Measures
cilk_for Loops
Scheduling Theory and Implementation
Measuring Parallelism in Practice
Anticipating parallelization overheads
Announcements
SLIDE 3
Plan
Parallelism Complexity Measures
cilk_for Loops
Scheduling Theory and Implementation
Measuring Parallelism in Practice
Anticipating parallelization overheads
Announcements
SLIDE 4
The fork-join parallelism model
Example: fib(4)

    int fib (int n) {
      if (n < 2) return (n);
      else {
        int x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return (x+y);
      }
    }

(Figure: the computation dag of fib(4); each node is labelled with the argument of the corresponding recursive call.)

The computation dag unfolds dynamically.

The code is "processor-oblivious": it makes no mention of the number of processors.

We shall also call this model multithreaded parallelism.
SLIDE 5
Terminology
(Figure: a spawn tree and its strands; the edge types are spawn edges, call edges, return edges, and continue edges, running from the initial strand to the final strand.)
◮ A strand is a maximal sequence of instructions that ends with a spawn, sync, or return (either explicit or implicit) statement.
◮ At runtime, the spawn relation causes procedure instances to be structured as a rooted tree, called the spawn tree or parallel instruction stream, where dependencies among strands form a dag.
SLIDE 6
Work and span
We define several performance measures. We assume an ideal situation: no cache issues, no interprocessor costs.
◮ Tp is the minimum running time on p processors.
◮ T1 is called the work, that is, the sum of the number of instructions at each node.
◮ T∞ is the minimum running time with infinitely many processors, called the span.
SLIDE 7
The critical path length
Assuming all strands run in unit time, the longest path in the DAG is equal to T∞. For this reason, T∞ is also referred to as the critical path length.
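As an illustration, here is a minimal sketch (not part of the original slides) that computes the critical path length of a dag by dynamic programming over a topological order, charging unit time per strand; the dag encoding and the function name longest_path are illustrative assumptions.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Length (in strands) of the longest path of a dag given by successor lists.
    // Assumes the vertices 0..n-1 are numbered in a topological order.
    int longest_path(const std::vector<std::vector<int>>& succ) {
        int n = succ.size();
        std::vector<int> dist(n, 1);              // each strand takes unit time
        int span = 0;
        for (int u = 0; u < n; ++u) {             // topological order
            span = std::max(span, dist[u]);
            for (int v : succ[u])
                dist[v] = std::max(dist[v], dist[u] + 1);
        }
        return span;
    }

    int main() {
        // Diamond dag 0 -> {1,2} -> 3: work T1 = 4 strands, span T_inf = 3 strands.
        std::vector<std::vector<int>> dag = {{1, 2}, {3}, {3}, {}};
        std::cout << "T_infinity = " << longest_path(dag) << "\n";
        return 0;
    }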
SLIDE 8
Work law
◮ We have: Tp ≥ T1/p.
◮ Indeed, in the best case, p processors can perform at most p units of work per unit of time.
SLIDE 9
Span law
◮ We have: Tp ≥ T∞. ◮ Indeed, Tp < T∞ contradicts the definitions of Tp and T∞.
SLIDE 10
Speedup on p processors
◮ T1/Tp is called the speedup on p processors.
◮ A parallel program execution can have:
  ◮ linear speedup: T1/Tp = Θ(p)
  ◮ superlinear speedup: T1/Tp = ω(p) (not possible in this model, though it is possible in others)
  ◮ sublinear speedup: T1/Tp = o(p)
SLIDE 11
Parallelism
Because the Span Law dictates that Tp ≥ T∞, the maximum possible speedup given T1 and T∞ is
    T1/T∞ = parallelism = the average amount of work per step along the span.
SLIDE 12
The Fibonacci example (1/2)
(Figure: the computation dag of Fib(4) with its strands numbered in execution order.)
◮ For Fib(4), we have T1 = 17 and T∞ = 8 and thus T1/T∞ = 2.125.
◮ What about T1(Fib(n)) and T∞(Fib(n))?
SLIDE 13
The Fibonacci example (2/2)
◮ We have T1(n) = T1(n − 1) + T1(n − 2) + Θ(1). Let’s solve it.
◮ One verifies by induction that T1(n) ≤ aFn − b for b > 0 large enough to dominate the Θ(1) term and a > 1.
◮ We can then choose a large enough to satisfy the initial condition, whatever that is.
◮ On the other hand we also have Fn ≤ T1(n).
◮ Therefore T1(n) = Θ(Fn) = Θ(ψ^n) with ψ = (1 + √5)/2.
◮ We have T∞(n) = max(T∞(n − 1), T∞(n − 2)) + Θ(1).
◮ We easily check T∞(n − 1) ≥ T∞(n − 2). ◮ This implies T∞(n) = T∞(n − 1) + Θ(1). ◮ Therefore T∞(n) = Θ(n).
◮ Consequently the parallelism is Θ(ψ^n/n). (A small numerical sketch of these recurrences follows.)
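To make the two recurrences concrete, here is a minimal sketch (not from the original slides) that evaluates them numerically, charging one unit for each Θ(1) term; the constants therefore differ from the exact instruction counts of the previous slide, but the growth rates match. The function names work_fib and span_fib are illustrative assumptions.

    #include <cstdint>
    #include <iostream>

    // T1(n) = T1(n-1) + T1(n-2) + 1, with T1(0) = T1(1) = 1  (unit-cost model).
    std::int64_t work_fib(int n) {
        return (n < 2) ? 1 : work_fib(n - 1) + work_fib(n - 2) + 1;
    }

    // T_inf(n) = max(T_inf(n-1), T_inf(n-2)) + 1 = T_inf(n-1) + 1, hence linear in n.
    std::int64_t span_fib(int n) {
        return (n < 2) ? 1 : span_fib(n - 1) + 1;
    }

    int main() {
        for (int n : {4, 14, 24}) {
            std::int64_t w = work_fib(n), s = span_fib(n);
            std::cout << "n = " << n << "  work = " << w << "  span = " << s
                      << "  parallelism = " << (double)w / s << "\n";
        }
        return 0;
    }

The parallelism grows like ψ^n/n, so even modest values of n already expose far more parallelism than any realistic machine has processors.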
SLIDE 14
Series composition A B
◮ Work? ◮ Span?
SLIDE 15
Series composition A B
◮ Work: T1(A ∪ B) = T1(A) + T1(B) ◮ Span: T∞(A ∪ B) = T∞(A) + T∞(B)
SLIDE 16
Parallel composition A B
◮ Work? ◮ Span?
SLIDE 17
Parallel composition A B
◮ Work: T1(A ∪ B) = T1(A) + T1(B) ◮ Span: T∞(A ∪ B) = max(T∞(A), T∞(B))
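The two rules compose. The following minimal sketch (not from the slides; the struct name Cost and the helpers series and parallel are made up for illustration) applies them to estimate the work and span of a composed dag.

    #include <algorithm>
    #include <iostream>

    // Work and span of a sub-dag, in some common time unit.
    struct Cost { double work; double span; };

    // Series composition: B starts only after A has completed.
    Cost series(Cost a, Cost b)   { return { a.work + b.work, a.span + b.span }; }

    // Parallel composition: A and B may execute concurrently.
    Cost parallel(Cost a, Cost b) { return { a.work + b.work, std::max(a.span, b.span) }; }

    int main() {
        Cost a{100, 10}, b{100, 40};
        Cost s = series(a, b), p = parallel(a, b);
        std::cout << "series:   work " << s.work << "  span " << s.span << "\n";   // 200, 50
        std::cout << "parallel: work " << p.work << "  span " << p.span << "\n";   // 200, 40
        return 0;
    }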
SLIDE 18
Some results in the fork-join parallelism model

Algorithm               Work         Span
Merge sort              Θ(n lg n)    Θ(lg^3 n)
Matrix multiplication   Θ(n^3)       Θ(lg n)
Strassen                Θ(n^lg7)     Θ(lg^2 n)
LU-decomposition        Θ(n^3)       Θ(n lg n)
Tableau construction    Θ(n^2)       Ω(n^lg3)
FFT                     Θ(n lg n)    Θ(lg^2 n)
Breadth-first search    Θ(E)         Θ(d lg V)
We shall prove those results in the next lectures.
SLIDE 19
Plan
Parallelism Complexity Measures
cilk_for Loops
Scheduling Theory and Implementation
Measuring Parallelism in Practice
Anticipating parallelization overheads
Announcements
SLIDE 20
For loop parallelism in Cilk++
(Figure: an n × n matrix A = (aij) and its transpose A^T.)
cilk_for (int i=1; i<n; ++i) {
  for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

The iterations of a cilk_for loop execute in parallel.
SLIDE 21
Implementation of for loops in Cilk++
Up to details (next week!) the previous loop is compiled as follows, using a divide-and-conquer implementation:

void recur(int lo, int hi) {
  if (hi > lo) { // coarsen
    int mid = lo + (hi - lo)/2;
    cilk_spawn recur(lo, mid);
    recur(mid+1, hi);
    cilk_sync;
  } else {
    int i = lo; // a single iteration remains: lo == hi
    for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }
}
SLIDE 22
Analysis of parallel for loops
(Figure: the dag of the divide-and-conquer loop: the loop-control recursion tree with the individual iterations at its leaves.)
Here we do not assume that each strand runs in unit time.
◮ Span of loop control: Θ(log(n))
◮ Max span of an iteration: Θ(n)
◮ Span: Θ(n)
◮ Work: Θ(n^2)
◮ Parallelism: Θ(n)
SLIDE 23
Parallelizing the inner loop
This would yield the following code:

cilk_for (int i=1; i<n; ++i) {
  cilk_for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

◮ Span of outer loop control: Θ(log(n))
◮ Max span of an inner loop control: Θ(log(n))
◮ Span of an iteration: Θ(1)
◮ Span: Θ(log(n))
◮ Work: Θ(n^2)
◮ Parallelism: Θ(n^2/log(n))

In practice, parallelizing the inner loop would increase the memory footprint (allocation of the temporaries) and increase parallelization overheads. So, this is not a good idea.
SLIDE 24
Plan
Parallelism Complexity Measures
cilk_for Loops
Scheduling Theory and Implementation
Measuring Parallelism in Practice
Anticipating parallelization overheads
Announcements
SLIDE 25
Scheduling
(Figure: a shared-memory machine: processors P, each with a cache $, connected through a network to memory and I/O.)
◮ A scheduler's job is to map a computation to particular processors. Such a mapping is called a schedule.
◮ If decisions are made at runtime, the scheduler is online; otherwise, it is offline.
◮ Cilk++’s scheduler maps strands onto processors dynamically at runtime.
SLIDE 26
Greedy scheduling (1/2)
◮ A strand is ready if all its predecessors have executed.
◮ A scheduler is greedy if it attempts to do as much work as possible at every step.
SLIDE 27
Greedy scheduling (2/2)
(Figure: one scheduling step with p = 3 processors.)
◮ In any greedy schedule, there are two types of steps:
  ◮ complete step: there are at least p strands that are ready to run; the greedy scheduler selects any p of them and runs them.
  ◮ incomplete step: there are strictly fewer than p strands that are ready to run; the greedy scheduler runs them all.
(A toy simulation of a greedy schedule follows this slide.)
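The following minimal sketch (not part of the slides; the function name greedy_steps and the dag encoding are illustrative assumptions) simulates a greedy schedule on a unit-cost dag and counts complete and incomplete steps, so the bound Tp ≤ T1/p + T∞ can be checked on small examples.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Simulate a greedy schedule of a unit-cost dag on p processors.
    // succ[u] lists the successors of strand u.  Returns the number of steps Tp.
    int greedy_steps(const std::vector<std::vector<int>>& succ, int p) {
        int n = succ.size();
        std::vector<int> indeg(n, 0);
        for (int u = 0; u < n; ++u)
            for (int v : succ[u]) ++indeg[v];

        std::vector<int> ready;
        for (int u = 0; u < n; ++u)
            if (indeg[u] == 0) ready.push_back(u);

        int steps = 0, complete = 0, incomplete = 0;
        while (!ready.empty()) {
            ++steps;
            int run = std::min((int)ready.size(), p);   // a complete step runs exactly p strands
            if (run == p) ++complete; else ++incomplete;
            std::vector<int> next(ready.begin() + run, ready.end()); // strands left waiting
            for (int k = 0; k < run; ++k)
                for (int v : succ[ready[k]])
                    if (--indeg[v] == 0) next.push_back(v);
            ready.swap(next);
        }
        std::cout << complete << " complete + " << incomplete << " incomplete steps\n";
        return steps;
    }

    int main() {
        // Diamond dag: work T1 = 4, span T_inf = 3.
        std::vector<std::vector<int>> dag = {{1, 2}, {3}, {3}, {}};
        int p = 2;
        std::cout << "Tp = " << greedy_steps(dag, p)
                  << "  (bound T1/p + T_inf = " << 4.0 / p + 3 << ")\n";
        return 0;
    }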
SLIDE 28
Theorem of Graham and Brent
(Figure: a greedy schedule with p = 3.)
For any greedy schedule, we have Tp ≤ T1/p + T∞.
◮ #complete steps ≤ T1/p, by definition of T1.
◮ #incomplete steps ≤ T∞. Indeed, let G′ be the subgraph of G that remains to be executed immediately prior to an incomplete step.
  (i) During this incomplete step, all strands that can be run are actually run.
  (ii) Hence removing this incomplete step from G′ reduces the span of G′ by one.
SLIDE 29
Corollary 1
A greedy scheduler is always within a factor of 2 of optimal.
From the work and span laws, we have:
    Tp ≥ max(T1/p, T∞)                         (1)
In addition, we can trivially express:
    T1/p ≤ max(T1/p, T∞)                       (2)
    T∞ ≤ max(T1/p, T∞)                         (3)
From the Graham-Brent theorem, we deduce:
    Tp ≤ T1/p + T∞                             (4)
       ≤ max(T1/p, T∞) + max(T1/p, T∞)         (5)
       ≤ 2 max(T1/p, T∞)                       (6)
which concludes the proof.
SLIDE 30
Corollary 2
The greedy scheduler achieves linear speedup whenever T∞ = O(T1/p).
From the Graham-Brent theorem, we deduce:
    Tp ≤ T1/p + T∞                             (7)
       = T1/p + O(T1/p)                        (8)
       = Θ(T1/p)                               (9)
The idea is to operate in the range where T1/p dominates T∞. As long as T1/p dominates T∞, all processors can be used efficiently.
The quantity T1/(p T∞) is called the parallel slackness.
SLIDE 31
The work-stealing scheduler (1/9)
◮ The Cilk/Cilk++ randomized work-stealing scheduler load-balances the computation at run-time. Each processor maintains a ready deque:
  ◮ A ready deque is a double-ended queue, where each entry is a procedure instance that is ready to execute.
  ◮ Adding a procedure instance to the bottom of the deque represents a procedure call being spawned.
  ◮ A procedure instance being deleted from the bottom of the deque represents the processor beginning/resuming execution on that procedure.
  ◮ Deletion from the top of the deque corresponds to that procedure instance being stolen. (A toy sketch of such a deque follows this slide.)
◮ A mathematical proof guarantees near-perfect linear speed-up on applications with sufficient parallelism, as long as the architecture has sufficient memory bandwidth.
◮ A spawn/return in Cilk is over 100 times faster than a Pthread create/exit and less than 3 times slower than an ordinary C function call on a modern Intel processor.
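To make the deque operations concrete, here is a minimal sketch (this is not the actual Cilk runtime, which uses a lock-free protocol; the class name ReadyDeque and the use of a mutex are simplifications made up for illustration):

    #include <deque>
    #include <mutex>
    #include <optional>

    // A Task stands for a ready procedure instance; here it is just an id.
    using Task = int;

    // Per-worker ready deque: the owner works at the bottom, thieves steal from the top.
    class ReadyDeque {
        std::deque<Task> d_;
        std::mutex m_;   // the real runtime avoids locking on the owner's fast path
    public:
        void push_bottom(Task t) {             // owner: a spawn pushes work
            std::lock_guard<std::mutex> g(m_);
            d_.push_back(t);
        }
        std::optional<Task> pop_bottom() {     // owner: resume the most recently pushed work
            std::lock_guard<std::mutex> g(m_);
            if (d_.empty()) return std::nullopt;
            Task t = d_.back(); d_.pop_back();
            return t;
        }
        std::optional<Task> steal_top() {      // thief: steal the oldest (shallowest) work
            std::lock_guard<std::mutex> g(m_);
            if (d_.empty()) return std::nullopt;
            Task t = d_.front(); d_.pop_front();
            return t;
        }
    };

A worker whose own deque becomes empty picks a victim uniformly at random and calls steal_top on the victim's deque; these random steals are exactly what the performance analysis of the work-stealing scheduler counts later in this lecture.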
SLIDE 32
The work-stealing scheduler (2/9)
Each processor possesses a deque
SLIDE 33
The work-stealing scheduler (3/9)
SLIDE 34
The work-stealing scheduler (4/9)
SLIDE 35
The work-stealing scheduler (5/9)
SLIDE 36
The work-stealing scheduler (6/9)
SLIDE 37
The work-stealing scheduler (7/9)
SLIDE 38
The work-stealing scheduler (8/9)
SLIDE 39
The work-stealing scheduler (9/9)
SLIDE 40
Performances of the work-stealing scheduler
Assume that
◮ each strand executes in unit time,
◮ for almost all "parallel steps" there are at least p strands to run,
◮ each processor is either working or stealing.
Then, the randomized work-stealing scheduler is expected to run in TP = T1/p + O(T∞).
◮ A processor is either working or stealing.
◮ The total time all processors spend working is T1, by definition of T1.
◮ Each stealing processor has a probability of 1/p of reducing the span by 1.
◮ Thus, the expected total number of steal steps is O(p T∞).
◮ Since the p processors work and steal together, the expected running time is
    TP = #steps without steal + #steps with steal = T1/p + O(p T∞)/p = T1/p + O(T∞).    (10)
SLIDE 41
Overheads and burden
◮ Obviously T1/p + T∞ will under-estimate Tp in practice.
◮ Many factors (simplifying assumptions of the fork-join parallelism model, architecture limitations, costs of executing the parallel constructs, overheads of scheduling) will make Tp larger in practice.
◮ One may want to estimate the impact of those factors:
  1. by improving the estimate of the randomized work-stealing complexity result,
  2. by comparing a Cilk++ program with its C++ elision,
  3. by estimating the costs of spawning and synchronizing.
◮ Cilk++ estimates Tp as Tp = T1/p + 1.7 × burdened span, where the burdened span is 15000 instructions times the number of continuation edges along the critical path.
SLIDE 42
Span overhead
◮ Let T1, T∞, Tp be given. We want to refine the randomized work-stealing complexity result.
◮ The span overhead is the smallest constant c∞ such that Tp ≤ T1/p + c∞ T∞.
◮ Recall that T1/T∞ is the maximum possible speed-up that the application can obtain.
◮ We call parallel slackness assumption the following property:
    T1/T∞ >> c∞ p    (11)
  that is, c∞ p is much smaller than the average parallelism.
◮ Under this assumption it follows that T1/p >> c∞ T∞ holds, thus c∞ has little effect on performance when sufficient slackness exists.
SLIDE 43
Work overhead
◮ Let Ts be the running time of the C++ elision of a Cilk++ program.
◮ We denote by c1 the work overhead: c1 = T1/Ts.
◮ Recall the expected running time: Tp ≤ T1/p + c∞ T∞. Thus, with the parallel slackness assumption, we get
    Tp ≤ c1 Ts/p + c∞ T∞ ≃ c1 Ts/p.    (12)
◮ We can now state the work-first principle precisely: minimize c1, even at the expense of a larger c∞. This is a key feature since it is conceptually easier to minimize c1 than to minimize c∞.
◮ Cilk++ estimates Tp as Tp = T1/p + 1.7 × burdened span, where the burdened span is 15000 instructions times the number of continuation edges along the critical path.
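As a small worked example of these estimates (a sketch only: the numerical values and the helper name estimate_tp are invented for illustration, not measurements from the slides):

    #include <iostream>

    // Predicted parallel running time under the refined model
    // Tp ~ c1*Ts/p + c_inf*T_inf, where c1 = T1/Ts is the work overhead.
    double estimate_tp(double Ts, double c1, double c_inf, double T_inf, int p) {
        return c1 * Ts / p + c_inf * T_inf;
    }

    int main() {
        double Ts = 10.0;       // running time of the serial elision (s), hypothetical
        double c1 = 1.05;       // 5% work overhead
        double c_inf = 4.0;     // span overhead
        double T_inf = 0.01;    // span (s), hypothetical
        for (int p : {1, 2, 4, 8, 16}) {
            double Tp = estimate_tp(Ts, c1, c_inf, T_inf, p);
            std::cout << "p = " << p << "  Tp ~ " << Tp << " s"
                      << "  speedup ~ " << Ts / Tp << "\n";
        }
        return 0;
    }

With these numbers the parallel slackness assumption holds comfortably (T1/T∞ ≈ 1050 while c∞ p ≤ 64), so the c∞ term barely affects the estimate, illustrating why the work-first principle focuses on c1.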
SLIDE 44
Plan
Parallelism Complexity Measures
cilk_for Loops
Scheduling Theory and Implementation
Measuring Parallelism in Practice
Anticipating parallelization overheads
Announcements
SLIDE 45
Cilkview
(Figure: a Cilkview speedup plot showing the Work Law bound (linear speedup), the Span Law bound, the measured speedup, and the burdened-parallelism estimate of scheduling overheads.)
◮ Cilkview computes work and span to derive upper bounds on parallel performance.
◮ Cilkview also estimates scheduling overhead to compute a burdened span for lower bounds.
SLIDE 46
The Fibonacci Cilk++ example
Code fragment
long fib(int n) {
  if (n < 2) return n;
  long x, y;
  x = cilk_spawn fib(n-1);
  y = fib(n-2);
  cilk_sync;
  return x + y;
}
SLIDE 47
Fibonacci program timing
The environment for benchmarking:
– model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
– L2 cache size : 4096 KB
– memory size : 3 GB

          #cores = 1    #cores = 2              #cores = 4
  n       timing(s)     timing(s)   speedup     timing(s)   speedup
  30          0.086         0.046     1.870         0.025     3.440
  35          0.776         0.436     1.780         0.206     3.767
  40          8.931         4.842     1.844         2.399     3.723
  45        105.263        54.017     1.949        27.200     3.870
  50       1165.000       665.115     1.752       340.638     3.420
SLIDE 48
Quicksort
code in cilk/examples/qsort
void sample_qsort(int * begin, int * end) {
  if (begin != end) {
    --end;  // exclude the last element (pivot)
    int * middle = std::partition(begin, end,
                                  std::bind2nd(std::less<int>(), *end));
    using std::swap;
    swap(*end, *middle);  // move the pivot to its final position
    cilk_spawn sample_qsort(begin, middle);
    sample_qsort(++middle, ++end);  // exclude the pivot
    cilk_sync;
  }
}
SLIDE 49
Quicksort timing
Timing for sorting an array of integers:

               #cores = 1    #cores = 2              #cores = 4
  # of int     timing(s)     timing(s)   speedup     timing(s)   speedup
  10 × 10^6        1.958         1.016     1.927         0.541     3.619
  50 × 10^6       10.518         5.469     1.923         2.847     3.694
  100 × 10^6      21.481        11.096     1.936         5.954     3.608
  500 × 10^6     114.300        57.996     1.971        31.086     3.677
SLIDE 50
Matrix multiplication
Code in cilk/examples/matrix
Timing of multiplying a 687 × 837 matrix by a 837 × 1107 matrix:

               iterative                    recursive
  threshold    st(s)    pt(s)    su         st(s)    pt(s)    su
  10           1.273    1.165    0.721      1.674    0.399    4.195
  16           1.270    1.787    0.711      1.408    0.349    4.034
  32           1.280    1.757    0.729      1.223    0.308    3.971
  48           1.258    1.760    0.715      1.164    0.293    3.973
  64           1.258    1.798    0.700      1.159    0.291    3.983
  80           1.252    1.773    0.706      1.267    0.320    3.959
st = sequential time; pt = parallel time with 4 cores; su = speedup
SLIDE 51
The cilkview example from the documentation
Using cilk_for to perform operations over an array in parallel:

static const int COUNT = 4;
static const int ITERATION = 1000000;
long arr[COUNT];

long do_work(long k) {
  long x = 15;
  static const int nn = 87;
  for (long i = 1; i < nn; ++i)
    x = x / i + k % i;
  return x;
}

int cilk_main() {
  for (int j = 0; j < ITERATION; j++)
    cilk_for (int i = 0; i < COUNT; i++)
      arr[i] += do_work( j * i + i + j);
}
SLIDE 52
1) Parallelism Profile
   Work                                  : 6,480,801,250 ins
   Span                                  : 2,116,801,250 ins
   Burdened span                         : 31,920,801,250 ins
   Parallelism                           : 3.06
   Burdened parallelism                  : 0.20
   Number of spawns/syncs                : 3,000,000
   Average instructions / strand         : 720
   Strands along span                    : 4,000,001
   Average instructions / strand on span : 529
2) Speedup Estimate
   2 processors:  0.21 - 2.00
   4 processors:  0.15 - 3.06
   8 processors:  0.13 - 3.06
  16 processors:  0.13 - 3.06
  32 processors:  0.12 - 3.06
SLIDE 53
A simple fix
Inverting the two for loops, so that the cilk_for becomes the outer loop: each parallel iteration then performs ITERATION calls to do_work, and only a handful of spawns/syncs remain (compare the two profiles).
int cilk_main() {
  cilk_for (int i = 0; i < COUNT; i++)
    for (int j = 0; j < ITERATION; j++)
      arr[i] += do_work( j * i + i + j);
}
SLIDE 54
1) Parallelism Profile
   Work                                  : 5,295,801,529 ins
   Span                                  : 1,326,801,107 ins
   Burdened span                         : 1,326,830,911 ins
   Parallelism                           : 3.99
   Burdened parallelism                  : 3.99
   Number of spawns/syncs                : 3
   Average instructions / strand         : 529,580,152
   Strands along span                    : 5
   Average instructions / strand on span : 265,360,221
2) Speedup Estimate
   2 processors:  1.40 - 2.00
   4 processors:  1.76 - 3.99
   8 processors:  2.01 - 3.99
  16 processors:  2.17 - 3.99
  32 processors:  2.25 - 3.99
SLIDE 55
Timing
            #cores = 1    #cores = 2              #cores = 4
  version   timing(s)     timing(s)   speedup     timing(s)   speedup
  original      7.719         9.611     0.803        10.758     0.718
  improved      7.471         3.724     2.006         1.888     3.957
SLIDE 56
Plan
Parallelism Complexity Measures
cilk_for Loops
Scheduling Theory and Implementation
Measuring Parallelism in Practice
Anticipating parallelization overheads
Announcements
SLIDE 57
Pascal Triangle
(Figure: the first 8 rows of the Pascal Triangle stored in a triangular array: the first row and the first column contain 1's, and every other entry is the sum of the entry above it and the entry to its left.)
Construction of the Pascal Triangle: nearly the simplest stencil computation!
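For concreteness, here is a minimal serial sketch of this stencil (not taken from the slides; the storage convention, with row 0 and column 0 filled with 1's and entry (i, j) equal to the sum of its upper and left neighbours, matches the figure above, and the function name pascal_triangle is an illustrative assumption):

    #include <iostream>
    #include <vector>

    // Fill the Pascal Triangle of order n: T[i][j] is defined for i + j < n,
    // with T[0][j] = T[i][0] = 1 and T[i][j] = T[i-1][j] + T[i][j-1] otherwise.
    std::vector<std::vector<long>> pascal_triangle(int n) {
        std::vector<std::vector<long>> T(n, std::vector<long>(n, 0));
        for (int i = 0; i < n; ++i)
            for (int j = 0; i + j < n; ++j)
                T[i][j] = (i == 0 || j == 0) ? 1 : T[i-1][j] + T[i][j-1];
        return T;
    }

    int main() {
        auto T = pascal_triangle(8);           // reproduces the 8-row triangle of the figure
        for (int i = 0; i < 8; ++i) {
            for (int j = 0; i + j < 8; ++j) std::cout << T[i][j] << " ";
            std::cout << "\n";
        }
        return 0;
    }

Every entry depends on its upper and left neighbours, which is what makes the construction a stencil computation and constrains how it can be parallelized.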
SLIDE 58
Divide and conquer: principle
(Figure: the divide-and-conquer decomposition of the triangle into regions I, II, and III.)
The parallelism is Θ(n^(2−log2 3)), that is, roughly Θ(n^0.415), which can be regarded as low parallelism.
SLIDE 59
Blocking strategy: principle
(Figure: the triangle divided into B × B blocks; blocks carrying the same number belong to the same anti-diagonal band and can be computed concurrently, one band after another.)
◮ Let B be the order of a block and n be the number of elements.
◮ The parallelism of Θ(n/B) can still be regarded as low parallelism, but it is better than with the divide-and-conquer scheme. (A sketch of this band-by-band execution follows.)
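Here is a minimal sketch of the blocking strategy (not the course's code; the function names pascal_blocked and compute_block, and the use of cilk_for over a band, are illustrative assumptions; it needs a Cilk++ compiler because of the cilk_for keyword):

    #include <algorithm>
    #include <vector>

    // One B x B block of the stencil: entry (i,j) = upper neighbour + left neighbour,
    // with the borders T[0][*] = T[*][0] = 1.  Only cells with i + j < n belong to the triangle.
    static void compute_block(std::vector<std::vector<long>>& T, int n, int B, int bi, int bj) {
        for (int i = bi * B; i < std::min((bi + 1) * B, n); ++i)
            for (int j = bj * B; i + j < n && j < (bj + 1) * B; ++j)
                T[i][j] = (i == 0 || j == 0) ? 1 : T[i-1][j] + T[i][j-1];
    }

    // Band-by-band execution: the blocks of band k (those with bi + bj == k) only depend
    // on blocks of band k-1, so all blocks of one band can be computed concurrently.
    void pascal_blocked(std::vector<std::vector<long>>& T, int n, int B) {
        int nB = (n + B - 1) / B;                  // number of blocks per side
        for (int k = 0; k < nB; ++k)               // bands are executed one after another
            cilk_for (int bi = 0; bi <= k; ++bi)   // all blocks of band k in parallel
                compute_block(T, n, B, bi, k - bi);
    }

Each band is a cilk_for over at most n/B blocks, whose loop-control span is logarithmic in the band size; this is exactly what the burdened-span computation on the next slide accounts for.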
SLIDE 60
Estimating parallelization overheads
The instruction stream DAG of the blocking strategy consists of n/B binary trees T0, T1, . . . , Tn/B−1 such that
◮ Ti is the instruction stream DAG of the cilk_for loop executing the i-th band,
◮ each leaf of Ti is connected by an edge to the root of Ti+1.
Consequently, the burdened span is

    Sb(n) = Σ_{i=1}^{n/B} log(i) = log( Π_{i=1}^{n/B} i ) = log(Γ(n/B + 1)).

Using Stirling's Formula, we deduce

    Sb(n) ∈ Θ( (n/B) log(n/B) ).    (13)

Thus the burdened parallelism (that is, the ratio of work to burdened span) is Θ(Bn/log(n/B)), which is sub-linear in n, while the non-burdened parallelism is Θ(n/B).
SLIDE 61
Construction of the Pascal Triangle: experimental results
(Figure: speedup and parallelism versus the number of cores/workers, for the dynamic-block and static-block strategies.)
SLIDE 62
Summary and notes
Burdened parallelism
◮ Parallelism after accounting for parallelization overheads (thread management, costs of scheduling, etc.). The burdened parallelism is estimated as the ratio of work to burdened span.
◮ The burdened span is defined as the maximum number of spawns/syncs on a critical path times the cost of a cilk_spawn (cilk_sync), taken as 15,000 cycles.
Impact in practice: example for the Pascal Triangle
(Figure: the triangle divided into B × B blocks; blocks carrying the same number belong to the same anti-diagonal band and can be computed concurrently, one band after another.)
◮ Consider executing one band after another, where for each band all B × B blocks are executed concurrently.
◮ The non-burdened span is in Θ(B^2 · n/B) = Θ(Bn).
◮ The burdened span is

    Sb(n) = Σ_{i=1}^{n/B} log(i) = log( Π_{i=1}^{n/B} i ) = log(Γ(n/B + 1)) ∈ Θ( (n/B) log(n/B) ).
SLIDE 63
Plan
Parallelism Complexity Measures
cilk_for Loops
Scheduling Theory and Implementation
Measuring Parallelism in Practice
Anticipating parallelization overheads
Announcements
SLIDE 64
Acknowledgements
◮ Charles E. Leiserson (MIT) for providing me with the sources of his lecture notes.
◮ Matteo Frigo (Intel) for supporting the work of my team with Cilk++. ◮ Yuzhen Xie (UWO) for helping me with the images used in these slides. ◮ Liyun Li (UWO) for generating the experimental data.
SLIDE 65
References
◮ Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212-223, June 1998.
◮ Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System.