
Multithreaded Parallelism and Performance Measures

Marc Moreno Maza

University of Western Ontario, London, Ontario (Canada)

CS 4435 - CS 9624

(Moreno Maza) Multithreaded Parallelism and Performance Measures CS 4435 - CS 9624 1 / 62

Plan

1. Parallelism Complexity Measures
2. cilk for Loops
3. Scheduling Theory and Implementation
4. Measuring Parallelism in Practice
5. Announcements

Parallelism Complexity Measures

The fork-join parallelism model

Example: fib(4)

```cpp
int fib(int n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x + y);
  }
}
```

[Figure: the computation dag of fib(4), each node labeled with the argument of its call.]

The program is "processor oblivious". The computation dag unfolds dynamically.

We shall also call this model multithreaded parallelism.


Terminology

[Figure: a spawn tree of strands; labels: initial strand, final strand, strand, spawn edge, return edge, continue edge, call edge.]

A strand is a maximal sequence of instructions that ends with a spawn, sync, or return statement (either explicit or implicit). At runtime, the spawn relation causes procedure instances to be structured as a rooted tree, called the spawn tree or parallel instruction stream, and the dependencies among strands form a dag.

Work and span

We define several performance measures. We assume an ideal situation: no cache issues, no interprocessor costs.
- Tp is the minimum running time on p processors.
- T1 is called the work, that is, the sum of the number of instructions at each node.
- T∞ is the minimum running time with infinitely many processors, called the span.

The critical path length

Assuming all strands run in unit time, the longest path in the DAG is equal to T∞. For this reason, T∞ is also referred to as the critical path length.


Work law

We have: Tp ≥ T1/p. Indeed, in the best case, p processors can execute at most p units of work per unit of time.


Span law

We have: Tp ≥ T∞. Indeed, Tp < T∞ contradicts the definitions of Tp and T∞.


Speedup on p processors

T1/Tp is called the speedup on p processors. A parallel program execution can have:
- linear speedup: T1/Tp = Θ(p)
- superlinear speedup: T1/Tp = ω(p) (not possible in this model, though it is possible in others)
- sublinear speedup: T1/Tp = o(p)

Parallelism

Because the Span Law dictates that Tp ≥ T∞, the maximum possible speedup given T1 and T∞ is T1/T∞ = parallelism = the average amount of work per step along the span.

The Fibonacci example (1/2)

[Figure: the fib(4) dag with its strands numbered.]

For fib(4), we have T1 = 17 and T∞ = 8, and thus T1/T∞ = 2.125. What about T1(fib(n)) and T∞(fib(n))?
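These counts can be sanity-checked with the recurrences induced by a unit-cost strand model. The decomposition below (three strands per internal call: one before the spawn, one between the spawn and the call of fib(n-2), one after the sync) is our reading of the dag, not stated explicitly on the slide:

```cpp
#include <algorithm>

// Unit-cost strand model of the fib dag: an internal call (n >= 2)
// contributes 3 strands, a base case (n < 2) contributes 1.
long work(int n) {                      // T1: total number of strands
    return n < 2 ? 1 : 3 + work(n - 1) + work(n - 2);
}

// Critical path: going through the spawned fib(n-1) we traverse the
// initial and final strands (+2); going through the called fib(n-2)
// we also traverse the middle strand (+3).
long span(int n) {                      // T_infinity: longest path
    return n < 2 ? 1 : std::max(2 + span(n - 1), 3 + span(n - 2));
}
```

Under this model, work(4) = 17 and span(4) = 8, matching the numbers above, so the parallelism is 17/8 = 2.125.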


The Fibonacci example (2/2)

We have T1(n) = T1(n − 1) + T1(n − 2) + Θ(1). Let’s solve it.

One verifies by induction that T1(n) ≤ a Fn − b for b > 0 large enough to dominate the Θ(1) term and a > 1. We can then choose a large enough to satisfy the initial condition, whatever it is. On the other hand, we also have Fn ≤ T1(n). Therefore T1(n) = Θ(Fn) = Θ(ψ^n) with ψ = (1 + √5)/2.

We have T∞(n) = max(T∞(n − 1), T∞(n − 2)) + Θ(1).

We easily check T∞(n − 1) ≥ T∞(n − 2). This implies T∞(n) = T∞(n − 1) + Θ(1). Therefore T∞(n) = Θ(n).

Consequently, the parallelism is Θ(ψ^n/n).


Series composition (A followed by B)

Work: T1(A ∪ B) = T1(A) + T1(B)
Span: T∞(A ∪ B) = T∞(A) + T∞(B)

Parallel composition (A alongside B)

Work: T1(A ∪ B) = T1(A) + T1(B)
Span: T∞(A ∪ B) = max(T∞(A), T∞(B))

Some results in the fork-join parallelism model

Algorithm               Work          Span
Merge sort              Θ(n lg n)     Θ(lg^3 n)
Matrix multiplication   Θ(n^3)        Θ(lg n)
Strassen                Θ(n^(lg 7))   Θ(lg^2 n)
LU-decomposition        Θ(n^3)        Θ(n lg n)
Tableau construction    Θ(n^2)        Ω(n^(lg 3))
FFT                     Θ(n lg n)     Θ(lg^2 n)
Breadth-first search    Θ(E)          Θ(d lg V)

We shall prove those results in the next lectures.

cilk for Loops

For loop parallelism in Cilk++

[Figure: an n × n matrix A and its transpose A^T.]

```cpp
cilk_for (int i=1; i<n; ++i) {
  for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}
```

The iterations of a cilk_for loop execute in parallel.


Implementation of for loops in Cilk++

Up to details (next week!), the previous loop is compiled as follows, using a divide-and-conquer implementation (the base-case test is adjusted here so that a one-iteration range is the base case):

```cpp
void recur(int lo, int hi) {   // iterations i in [lo, hi)
  if (hi - lo > 1) {           // coarsen
    int mid = lo + (hi - lo)/2;
    cilk_spawn recur(lo, mid);
    recur(mid, hi);
    cilk_sync;
  } else if (hi > lo) {
    int i = lo;                // single remaining iteration
    for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }
}
```
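The shape of this recursion can be checked with a serial sketch (cilk keywords removed; the recording vector is our own instrumentation, not part of the compiled code):

```cpp
#include <vector>

// Serial model of the divide-and-conquer recursion behind cilk_for;
// "visited" records which loop indices the base case executes.
std::vector<int> visited;

void recur(int lo, int hi) {            // iterations i in [lo, hi)
    if (hi - lo > 1) {                  // more than one iteration: split
        int mid = lo + (hi - lo) / 2;
        recur(lo, mid);                 // cilk_spawn recur(lo, mid);
        recur(mid, hi);
        // cilk_sync;                   // join both halves here
    } else if (hi > lo) {
        visited.push_back(lo);          // run the loop body for i = lo
    }
}
```

recur(1, n) visits each of i = 1, ..., n-1 exactly once, via a balanced binary splitting tree of logarithmic depth.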

Analysis of parallel for loops

[Figure: the divide-and-conquer dag of the parallel loop; the leaves 1, ..., 8 are the iterations.]

Here we do not assume that each strand runs in unit time.
- Span of loop control: Θ(log(n))
- Max span of an iteration: Θ(n)
- Span: Θ(n)
- Work: Θ(n^2)
- Parallelism: Θ(n)
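Under illustrative unit costs (one unit per swap, one level of loop control per binary split — the constants are our own choices, not the slide's), these orders of growth can be computed exactly:

```cpp
// Work of the transpose loop: iteration i performs i swaps,
// so T1(n) = 1 + 2 + ... + (n-1) = n(n-1)/2, which is Theta(n^2).
long loop_work(long n) { return n * (n - 1) / 2; }

// Depth of the binary splitting tree for the loop control: ceil(lg n).
long ceil_lg(long n) {
    long d = 0;
    while ((1L << d) < n) ++d;
    return d;
}

// Span: loop control of depth Theta(lg n), plus the longest serial
// iteration (i = n-1 does n-1 swaps), hence Theta(lg n + n) = Theta(n).
long loop_span(long n) { return ceil_lg(n) + (n - 1); }
```

For n = 1024, loop_work(n)/loop_span(n) is roughly n/2, i.e. parallelism Θ(n), in line with the figures above.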

Parallelizing the inner loop

```cpp
cilk_for (int i=1; i<n; ++i) {
  cilk_for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}
```

- Span of outer loop control: Θ(log(n))
- Max span of an inner loop control: Θ(log(n))
- Span of an iteration: Θ(1)
- Span: Θ(log(n))
- Work: Θ(n^2)
- Parallelism: Θ(n^2/log(n))

But! More on this next week . . .

Scheduling Theory and Implementation


Scheduling

[Figure: p processors P, each with a cache $, connected to shared memory, I/O, and a network.]

A scheduler's job is to map a computation to particular processors. Such a mapping is called a schedule. If decisions are made at runtime, the scheduler is online; otherwise, it is offline. Cilk++'s scheduler maps strands onto processors dynamically at runtime.

Greedy scheduling (1/2)

A strand is ready if all its predecessors have executed. A scheduler is greedy if it attempts to do as much work as possible at every step.

Greedy scheduling (2/2)

In any greedy schedule (the figure illustrates p = 3), there are two types of steps:
- complete step: there are at least p strands ready to run; the greedy scheduler selects any p of them and runs them.
- incomplete step: there are strictly fewer than p strands ready to run; the greedy scheduler runs them all.

Theorem of Graham and Brent

For any greedy schedule, we have Tp ≤ T1/p + T∞.

- #complete steps ≤ T1/p, by definition of T1.
- #incomplete steps ≤ T∞. Indeed, let G′ be the subgraph of G that remains to be executed immediately prior to an incomplete step.
  (i) During this incomplete step, all strands that can be run are actually run.
  (ii) Hence removing this incomplete step from G′ reduces T∞ by one.
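The theorem can be checked mechanically by simulating a greedy schedule on a small unit-time dag (the simulation and the example dag below are our own illustration, not from the slides):

```cpp
#include <algorithm>
#include <vector>

// Simulate a greedy schedule of a unit-time strand dag on p processors.
// preds[v] lists the predecessors of strand v. Each step runs
// min(p, #ready) ready strands; the return value is Tp in steps.
int greedy_time(const std::vector<std::vector<int>>& preds, int p) {
    int n = static_cast<int>(preds.size());
    int finished = 0, steps = 0;
    std::vector<bool> done(n, false);
    while (finished < n) {
        std::vector<int> ready;
        for (int v = 0; v < n; ++v) {
            if (done[v]) continue;
            bool ok = true;
            for (int u : preds[v]) ok = ok && done[u];
            if (ok) ready.push_back(v);
        }
        // complete step runs p strands, incomplete step runs them all
        int run = std::min(static_cast<int>(ready.size()), p);
        for (int i = 0; i < run; ++i) done[ready[i]] = true;
        finished += run;
        ++steps;
    }
    return steps;
}
```

Here T1 is the number of strands and T∞ = greedy_time(preds, n) (enough processors to run every ready strand); for any p one can check p·Tp ≤ T1 + p·T∞, which is the theorem with the division cleared.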


Corollary 1

A greedy scheduler is always within a factor of 2 of optimal.

From the work and span laws, we have:

  Tp ≥ max(T1/p, T∞)    (1)

In addition, we can trivially express:

  T1/p ≤ max(T1/p, T∞)    (2)
  T∞ ≤ max(T1/p, T∞)    (3)

From the Graham-Brent theorem, we deduce:

  Tp ≤ T1/p + T∞    (4)
     ≤ max(T1/p, T∞) + max(T1/p, T∞)    (5)
     = 2 max(T1/p, T∞),    (6)

which concludes the proof.

Corollary 2

The greedy scheduler achieves linear speedup whenever T∞ = O(T1/p).

From the Graham-Brent theorem, we deduce:

  Tp ≤ T1/p + T∞    (7)
     = T1/p + O(T1/p)    (8)
     = Θ(T1/p)    (9)

The idea is to operate in the range where T1/p dominates T∞. As long as T1/p dominates T∞, all processors can be used efficiently. The quantity T1/(p T∞) is called the parallel slackness.

The work-stealing scheduler

[Figure sequence (frames 1/11 to 11/11): four processors P, each holding a deque of spawned and called frames; successive frames animate Call!, Spawn!, Return!, and Steal! events.]

Each processor maintains a work deque of ready strands and manipulates the bottom of the deque like a stack: a spawn or a call pushes a frame, and a return pops one. When a processor runs out of work, it steals a strand from the top of the deque of a random victim processor.
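The deque discipline can be sketched in a single-threaded simulation (victim selection is simplified to "first non-empty deque"; a real work stealer picks a random victim and runs the workers concurrently):

```cpp
#include <deque>
#include <vector>

// Round-robin simulation of p workers with work deques, processing
// fib-style tasks (a task of size n spawns tasks n-1 and n-2).
// A worker pushes and pops at the BOTTOM of its own deque; an idle
// worker steals a task from the TOP of another worker's deque.
int run_with_stealing(int root_task, int p) {
    std::vector<std::deque<int>> dq(p);
    dq[0].push_back(root_task);
    int executed = 0, pending = 1;
    while (pending > 0) {
        for (int w = 0; w < p; ++w) {
            if (dq[w].empty()) {                     // idle: become a thief
                for (int v = 0; v < p; ++v) {
                    if (v != w && !dq[v].empty()) {
                        dq[w].push_back(dq[v].front());  // steal from top
                        dq[v].pop_front();
                        break;
                    }
                }
            }
            if (dq[w].empty()) continue;             // still no work
            int n = dq[w].back();                    // work from bottom
            dq[w].pop_back();
            ++executed;
            --pending;
            if (n >= 2) {                            // "spawn" two children
                dq[w].push_back(n - 1);
                dq[w].push_back(n - 2);
                pending += 2;
            }
        }
    }
    return executed;                                 // total tasks executed
}
```

Stealing only redistributes tasks, so the total number executed is independent of p: run_with_stealing(n, p) always equals the number of calls in the fib(n) call tree.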

Performances of the work-stealing scheduler

Assume that:
- each strand executes in unit time,
- for almost all "parallel steps" there are at least p strands to run,
- each processor is either working or stealing.

Then the randomized work-stealing scheduler is expected to run in

  Tp = T1/p + O(T∞).

Indeed:
- During a steal-free parallel step (a step at which all processors have work on their deque), each of the p processors consumes 1 unit of work; thus there are at most T1/p steal-free parallel steps.
- During a parallel step with steals, each thief may reduce the running time by 1 with probability 1/p; thus the expected number of steals is O(p T∞).

Therefore, the expected running time is

  Tp = (T1 + O(p T∞))/p = T1/p + O(T∞).    (10)

Overheads and burden

Obviously T1/p + T∞ will under-estimate Tp in practice. Many factors (simplifying assumptions of the fork-join parallelism model, architecture limitations, costs of executing the parallel constructs, overheads of scheduling) will make Tp larger in practice. One may want to estimate the impact of those factors:

1. by refining the estimate in the randomized work-stealing complexity result,
2. by comparing a Cilk++ program with its C++ elision,
3. by estimating the costs of spawning and synchronizing.

Span overhead

Let T1, T∞, Tp be given. We want to refine the randomized work-stealing complexity result. The span overhead is the smallest constant c∞ such that

  Tp ≤ T1/p + c∞ T∞.

Recall that T1/T∞ is the maximum possible speedup that the application can obtain. We call the following property the parallel slackness assumption:

  T1/T∞ >> c∞ p,    (11)

that is, c∞ p is much smaller than the average parallelism. Under this assumption it follows that T1/p >> c∞ T∞ holds, thus c∞ has little effect on performance when sufficient slackness exists.


Work overhead

Let Ts be the running time of the C++ elision of a Cilk++ program. We denote by c1 the work overhead:

  c1 = T1/Ts.

Recall the expected running time: Tp ≤ T1/p + c∞ T∞. Thus, with the parallel slackness assumption, we get

  Tp ≤ c1 Ts/p + c∞ T∞ ≃ c1 Ts/p.    (12)

We can now state the work-first principle precisely: minimize c1, even at the expense of a larger c∞. This is a key feature since it is conceptually easier to minimize c1 than to minimize c∞. Cilk++ estimates Tp as Tp = T1/p + 1.7 × (burdened span), where the burdened span is 15000 instructions times the number of continuation edges along the critical path.

The cactus stack

[Figure: successive views of the stack as functions A, B, C, D, E spawn or call one another.]

A cactus stack is used to implement C's rule for sharing of function-local variables: a stack frame can only see data stored in its own frame and in the frames of its ancestors.

Space bounds

[Figure: p = 3 processors, each holding at most the serial stack space S1.]

The space Sp of a parallel execution on p processors required by Cilk++'s work-stealing scheduler satisfies:

  Sp ≤ p · S1,    (13)

where S1 is the minimal serial space requirement.

Measuring Parallelism in Practice


Cilkview

[Figure: a cilkview speedup plot showing the work-law and span-law bounds (linear speedup), the measured speedup, and burdened-parallelism estimates of scheduling overheads.]

Cilkview computes work and span to derive upper bounds on parallel performance. It also estimates scheduling overhead to compute a burdened span for lower bounds.

The Fibonacci Cilk++ example

Code fragment:

```cpp
long fib(int n) {
  if (n < 2) return n;
  long x, y;
  x = cilk_spawn fib(n-1);
  y = fib(n-2);
  cilk_sync;
  return x + y;
}
```

Fibonacci program timing

The environment for benchmarking:
- model name: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
- L2 cache size: 4096 KB
- memory size: 3 GB

   n     #cores = 1    #cores = 2             #cores = 4
         timing(s)     timing(s)   speedup    timing(s)   speedup
  30     0.086         0.046       1.870      0.025       3.440
  35     0.776         0.436       1.780      0.206       3.767
  40     8.931         4.842       1.844      2.399       3.723
  45     105.263       54.017      1.949      27.200      3.870
  50     1165.000      665.115     1.752      340.638     3.420

Quicksort

Code in cilk/examples/qsort:

```cpp
void sample_qsort(int * begin, int * end) {
  if (begin != end) {
    --end;  // exclude the last element, which is used as the pivot
    int * middle = std::partition(begin, end,
                                  std::bind2nd(std::less<int>(), *end));
    using std::swap;
    swap(*end, *middle);  // move the pivot into its final position
    cilk_spawn sample_qsort(begin, middle);
    sample_qsort(++middle, ++end);
    cilk_sync;
  }
}
```


Quicksort timing

Timing for sorting an array of integers:

  # of int      #cores = 1    #cores = 2             #cores = 4
                timing(s)     timing(s)   speedup    timing(s)   speedup
  10 × 10^6     1.958         1.016       1.927      0.541       3.619
  50 × 10^6     10.518        5.469       1.923      2.847       3.694
  100 × 10^6    21.481        11.096      1.936      5.954       3.608
  500 × 10^6    114.300       57.996      1.971      31.086      3.677

Matrix multiplication

Code in cilk/examples/matrix. Timing of multiplying a 687 × 837 matrix by an 837 × 1107 matrix:

               iterative                   recursive
  threshold    st(s)   pt(s)   su          st(s)   pt(s)   su
  10           1.273   1.165   0.721       1.674   0.399   4.195
  16           1.270   1.787   0.711       1.408   0.349   4.034
  32           1.280   1.757   0.729       1.223   0.308   3.971
  48           1.258   1.760   0.715       1.164   0.293   3.973
  64           1.258   1.798   0.700       1.159   0.291   3.983
  80           1.252   1.773   0.706       1.267   0.320   3.959

  st = sequential time; pt = parallel time with 4 cores; su = speedup

The cilkview example from the documentation

Using cilk_for to perform operations over an array in parallel:

```cpp
static const int COUNT = 4;
static const int ITERATION = 1000000;
long arr[COUNT];

long do_work(long k) {
  long x = 15;
  static const int nn = 87;
  for (long i = 1; i < nn; ++i)
    x = x / i + k % i;
  return x;
}

int cilk_main() {
  for (int j = 0; j < ITERATION; j++)
    cilk_for (int i = 0; i < COUNT; i++)
      arr[i] += do_work( j * i + i + j);
}
```

1) Parallelism Profile
   Work                                  : 6,480,801,250 ins
   Span                                  : 2,116,801,250 ins
   Burdened span                         : 31,920,801,250 ins
   Parallelism                           : 3.06
   Burdened parallelism                  : 0.20
   Number of spawns/syncs                : 3,000,000
   Average instructions / strand         : 720
   Strands along span                    : 4,000,001
   Average instructions / strand on span : 529

2) Speedup Estimate
   2 processors:  0.21 - 2.00
   4 processors:  0.15 - 3.06
   8 processors:  0.13 - 3.06
   16 processors: 0.13 - 3.06
   32 processors: 0.12 - 3.06
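Both parallelism figures reported above are plain ratios of the instruction counts, which is easy to verify (a quick arithmetic check, not part of cilkview itself):

```cpp
// Parallelism          = work / span
// Burdened parallelism = work / burdened span
double ratio(long long numer, long long denom) {
    return static_cast<double>(numer) / static_cast<double>(denom);
}
```

ratio(6480801250, 2116801250) ≈ 3.06 and ratio(6480801250, 31920801250) ≈ 0.20, matching the profile; likewise span divided by the number of strands along the span gives about 529 instructions per strand on the span.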


A simple fix

Inverting the two for loops:

```cpp
int cilk_main() {
  cilk_for (int i = 0; i < COUNT; i++)
    for (int j = 0; j < ITERATION; j++)
      arr[i] += do_work( j * i + i + j);
}
```

1) Parallelism Profile
   Work                                  : 5,295,801,529 ins
   Span                                  : 1,326,801,107 ins
   Burdened span                         : 1,326,830,911 ins
   Parallelism                           : 3.99
   Burdened parallelism                  : 3.99
   Number of spawns/syncs                : 3
   Average instructions / strand         : 529,580,152
   Strands along span                    : 5
   Average instructions / strand on span : 265,360,221

2) Speedup Estimate
   2 processors:  1.40 - 2.00
   4 processors:  1.76 - 3.99
   8 processors:  2.01 - 3.99
   16 processors: 2.17 - 3.99
   32 processors: 2.25 - 3.99

Timing

  version     #cores = 1    #cores = 2             #cores = 4
              timing(s)     timing(s)   speedup    timing(s)   speedup
  original    7.719         9.611       0.803      10.758      0.718
  improved    7.471         3.724       2.006      1.888       3.957

Announcements


Acknowledgements

Charles E. Leiserson (MIT) for providing me with the sources of his lecture notes. Matteo Frigo (Intel) for supporting the work of my team with Cilk++. Yuzhen Xie (UWO) for helping me with the images used in these slides. Liyun Li (UWO) for generating the experimental data.

References

- Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212-223, June 1998.
- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, pages 55-69, August 25, 1996.
- Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM, Vol. 46, No. 5, pages 720-748, September 1999.