ECE 1747H: Parallel Programming
Lecture 1: Overview



  1. ECE 1747H: Parallel Programming
     Lecture 1: Overview
     • Meeting time: Mon 4-6 PM
     • Meeting place: BA 4164
     • Instructor: Cristiana Amza, http://www.eecg.toronto.edu/~amza, amza@eecg.toronto.edu, office Pratt 484E

     Material
     • Course notes
     • Web material (e.g., published papers)
     • No required textbook, some recommended

     Prerequisites
     • Programming in C or C++
     • Data structures
     • Basics of machine architecture
     • Basics of network programming
     • Please send e-mail to eugenia@eecg to get an eecg account!! (name, stuid, class, instructor)

  2. Other than that
     • No written homeworks, no exams
     • 10% for each small programming assignment (expect 1-2)
     • 10% class participation
     • Rest comes from the major course project

     Programming Project
     • Parallelizing a sequential program, or improving the performance or functionality of a parallel program
     • Project proposal and final report
     • In-class project proposal and final report presentation
     • “Sample” project presentation posted

     Parallelism (1 of 2)
     • Ability to execute different parts of a single program concurrently on different machines
     • Goal: shorter running time
     • Grain of parallelism: how big are the parts?
     • Can be instruction, statement, procedure, …
     • Will mainly focus on relatively coarse grain

     Parallelism (2 of 2)
     • Coarse-grain parallelism mainly applicable to long-running, scientific programs
     • Examples: weather prediction, prime number factorization, simulations, …

  3. Lecture material (1 of 4)
     • Parallelism
       – What is parallelism?
       – What can be parallelized?
       – Inhibitors of parallelism: dependences

     Lecture material (2 of 4)
     • Standard models of parallelism
       – shared memory (Pthreads)
       – message passing (MPI)
       – shared memory + data parallelism (OpenMP)
     • Classes of applications
       – scientific
       – servers

     Lecture material (3 of 4)
     • Transaction processing
       – classic programming model for databases
       – now being proposed for scientific programs

     Lecture material (4 of 4)
     • Performance of parallel & distributed programs
       – architecture-independent optimization
       – architecture-dependent optimization
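As a preview of the shared memory + data parallelism model listed above, here is a minimal OpenMP sketch. This is my illustration, not code from the course notes; it assumes a C compiler with OpenMP support.

```c
#include <omp.h>
#include <stdio.h>

#define N 100

int main(void) {
    int a[N];

    /* Shared-memory data parallelism: the iterations are independent,
       so OpenMP is free to split them across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2 * i;

    printf("a[%d] = %d\n", N - 1, a[N - 1]);
    return 0;
}
```

With gcc this would typically be built with -fopenmp; without the flag the pragma is ignored and the loop simply runs sequentially. A Pthreads version would create and join threads explicitly, and an MPI version would distribute the array across processes and exchange messages.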

  4. Course Organization
     • First month of semester:
       – lectures on parallelism, patterns, models
       – small programming assignments, done individually
     • Rest of the semester:
       – major programming project, done individually or in a small group
       – research paper discussions

     Parallel vs. Distributed Programming
     • Parallel programming has matured:
       – few standard programming models
       – few common machine architectures
     • Portability between models and architectures

     Bottom Line
     • Programmer can now focus on the program and use a suitable programming model
     • Reasonable hope of portability
     • Problem: much performance optimization is still platform-dependent
       – performance portability is a problem

     ECE 1747H: Parallel Programming
     Lecture 1-2: Parallelism, Dependences

  5. Parallelism
     • Ability to execute different parts of a program concurrently on different machines
     • Goal: shorten execution time

     Measures of Performance
     • To computer scientists: speedup, execution time.
     • To applications people: size of problem, accuracy of solution, etc.

     Speedup of Algorithm
     • Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set).
     • (graph: speedup vs. number of processors p)

     Speedup on Problem
     • Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors.
     • A more honest measure of performance.
     • Avoids picking an easily parallelizable algorithm with poor sequential execution time.
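Restating the two speedup definitions above as formulas (the T notation is mine, not the slides'):

```latex
S_{\text{alg}}(p)  = \frac{T_{\text{seq}}(\text{same algorithm})}{T_p}
\qquad
S_{\text{prob}}(p) = \frac{T_{\text{seq}}(\text{best known sequential algorithm})}{T_p}
```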

  6. What Speedups Can You Get?
     • Linear speedup
       – Confusing term: implicitly means a 1-to-1 speedup per processor.
       – (Almost always) as good as you can do.
     • Sub-linear speedup: more common, due to the overhead of startup, synchronization, communication, etc.
     • (graph: linear vs. actual speedup as a function of p)

     Scalability
     • No really precise definition.
     • Roughly speaking, a program is said to scale to a certain number of processors p if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5).

     Super-linear Speedup?
     • Due to cache/memory effects:
       – Subparts fit into the cache/memory of each node.
       – Whole problem does not fit in the cache/memory of a single node.
     • Nondeterminism in search problems:
       – One thread finds a near-optimal solution very quickly => leads to drastic pruning of the search space.

  7. Cardinal Performance Rule
     • Don’t leave (too) much of your code sequential!

     Amdahl’s Law
     • If 1/s of the program is sequential, then you can never get a speedup better than s.
       – (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1
       – Best parallel execution time on p processors = 1/s + (1 - 1/s)/p
       – When p goes to infinity, parallel execution time = 1/s
       – Speedup = s.

     Why keep something sequential?
     • Some parts of the program are not parallelizable (because of dependences).
     • Some parts may be parallelizable, but the overhead dwarfs the increased speedup.

     When can two statements execute in parallel?
     • On one processor:
         statement 1;
         statement 2;
     • On two processors:
         processor 1:        processor 2:
         statement 1;        statement 2;
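A worked instance of Amdahl's law with illustrative numbers (s = 10, i.e. 10% of the program is sequential, and p = 8; the figures are mine, not from the slides):

```latex
T_{\text{par}}(p) = \frac{1}{s} + \frac{1 - 1/s}{p},
\qquad
T_{\text{par}}(8) = 0.1 + \frac{0.9}{8} = 0.2125
\;\Rightarrow\;
\text{speedup} = \frac{1}{0.2125} \approx 4.7,
\qquad
\lim_{p \to \infty} \text{speedup} = s = 10
```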

  8. Fundamental Assumption
     • Processors execute independently: no control over the order of execution between processors.

     When can 2 statements execute in parallel?
     • Possibility 1
         Processor 1:        Processor 2:
         statement 1;        statement 2;
     • Possibility 2
         Processor 1:        Processor 2:
         statement 2;        statement 1;

     When can 2 statements execute in parallel?
     • Their order of execution must not matter!
     • In other words,
         statement 1; statement 2;
       must be equivalent to
         statement 2; statement 1;

     Example 1
         a = 1;
         b = 2;
     • Statements can be executed in parallel.

  9. Example 2
         a = 1;
         b = a;
     • Statements cannot be executed in parallel.
     • Program modifications may make it possible.

     Example 3
         a = f(x);
         b = a;
     • May not be wise to change the program (sequential execution would take longer).

     Example 5
         a = 1;
         a = 2;
     • Statements cannot be executed in parallel.

     True dependence
     • Statements S1, S2.
     • S2 has a true dependence on S1 iff S2 reads a value written by S1.
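A small runnable sketch of Example 2, plus one hand modification that removes the true dependence. The rewrite is my illustration of "program modifications may make it possible", not code from the slides.

```c
#include <stdio.h>

int main(void) {
    /* Example 2 as written: a true dependence through a. */
    int a = 1;
    int b = a;      /* reads the value the previous statement wrote */

    /* One possible hand modification (my illustration): forward the
       constant so the two statements no longer share data and could
       now execute in parallel. */
    int a2 = 1;
    int b2 = 1;

    printf("%d %d %d %d\n", a, b, a2, b2);
    return 0;
}
```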

  10. Anti-dependence
      • Statements S1, S2.
      • S2 has an anti-dependence on S1 iff S2 writes a value read by S1.

      Output Dependence
      • Statements S1, S2.
      • S2 has an output dependence on S1 iff S2 writes a variable written by S1.

      When can 2 statements execute in parallel?
      • S1 and S2 can execute in parallel iff there are no dependences between S1 and S2:
        – true dependences
        – anti-dependences
        – output dependences
      • Some dependences can be removed.

      Example 6
      • Most parallelism occurs in loops.
          for(i=0; i<100; i++)
            a[i] = i;
      • No dependences.
      • Iterations can be executed in parallel.
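Since the Example 6 loop has no dependences between iterations, it can be run in parallel directly. A minimal sketch, assuming OpenMP as the shared-memory model (the pragma is my addition, not part of the slide):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int a[100];

    /* Each iteration writes a distinct a[i] and reads only i, so there
       are no true, anti, or output dependences between iterations. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = i;

    printf("a[42] = %d\n", a[42]);
    return 0;
}
```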

  11. Example 7
          for(i=0; i<100; i++) {
            a[i] = i;
            b[i] = 2*i;
          }
      • Iterations and statements can be executed in parallel.

      Example 8
          for(i=0; i<100; i++)
            a[i] = i;
          for(i=0; i<100; i++)
            b[i] = 2*i;
      • Iterations and loops can be executed in parallel.

      Example 9
          for(i=0; i<100; i++)
            a[i] = a[i] + 100;
      • There is a dependence … on itself!
      • Loop is still parallelizable.

      Example 10
          for(i=0; i<100; i++)
            a[i] = f(a[i-1]);
      • Dependence between a[i] and a[i-1].
      • Loop iterations are not parallelizable.
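A sketch contrasting Examples 9 and 10. Here f is a placeholder I made up, the parallel loop assumes OpenMP, and I start Example 10 at i = 1 to keep the access in bounds; none of that comes from the slides.

```c
#include <omp.h>
#include <stdio.h>

static int f(int x) { return x + 1; }   /* placeholder: the slides leave f unspecified */

int main(void) {
    int a[100];
    for (int i = 0; i < 100; i++) a[i] = i;

    /* Example 9: each iteration reads and writes only its own a[i];
       the dependence is not loop-carried, so iterations can run in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = a[i] + 100;

    /* Example 10: a[i] needs a[i-1] from the previous iteration
       (a loop-carried true dependence), so this loop stays sequential. */
    for (int i = 1; i < 100; i++)
        a[i] = f(a[i-1]);

    printf("a[99] = %d\n", a[99]);
    return 0;
}
```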

  12. Loop-carried dependence
      • A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop.
      • Otherwise, we call it a loop-independent dependence.
      • Loop-carried dependences prevent loop iteration parallelization.

      Example 11
          for(i=0; i<100; i++)
            for(j=0; j<100; j++)
              a[i][j] = f(a[i][j-1]);
      • Loop-independent dependence on i.
      • Loop-carried dependence on j.
      • Outer loop can be parallelized, inner loop cannot.

      Example 12
          for(j=0; j<100; j++)
            for(i=0; i<100; i++)
              a[i][j] = f(a[i][j-1]);
      • Inner loop can be parallelized, outer loop cannot.
      • Less desirable situation.
      • Loop interchange is sometimes possible.

      Level of loop-carried dependence
      • Is the nesting depth of the loop that carries the dependence.
      • Indicates which loops can be parallelized.
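One way the loop interchange mentioned for Example 12 might look, moving the parallelizable i loop outermost as in Example 11. The choice of f, the OpenMP pragma, and the j = 1 lower bound are my assumptions; the slides leave them unspecified.

```c
#include <omp.h>
#include <stdio.h>

static double f(double x) { return 0.5 * x + 1.0; }   /* placeholder f */

static double a[100][100];

int main(void) {
    /* Example 12 nests the parallelizable i loop inside the sequential j loop.
       Interchanging them is legal here because the only loop-carried
       dependence is on j and it keeps the same direction; the parallel loop
       is now outermost, as in Example 11. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        for (int j = 1; j < 100; j++)   /* j starts at 1 to keep a[i][j-1] in bounds */
            a[i][j] = f(a[i][j-1]);

    printf("a[0][99] = %f\n", a[0][99]);
    return 0;
}
```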

  13. Be careful … Example 13
          printf("a");
          printf("b");
      • Statements have a hidden output dependence due to the output stream.

      Be careful … Example 14
          a = f(x);
          b = g(x);
      • Statements could have a hidden dependence if f and g update the same variable.
      • Also depends on what f and g can do to x.

      Be careful … Example 15
          for(i=0; i<100; i++)
            a[i+10] = f(a[i]);
      • Dependence between a[10], a[20], …
      • Dependence between a[11], a[21], …
      • …
      • Some parallel execution is possible.

      Be careful … Example 16
          for(i=1; i<100; i++) {
            a[i] = …;
            ... = a[i-1];
          }
      • Dependence between a[i] and a[i-1].
      • Complete parallel execution impossible.
      • Pipelined parallel execution possible.
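For Example 15, the dependence distance of 10 is what makes "some parallel execution" possible: any 10 consecutive iterations are mutually independent. A sketch of one blocking scheme (my construction, assuming OpenMP and a placeholder f):

```c
#include <omp.h>
#include <stdio.h>

static int f(int x) { return 2 * x; }   /* placeholder f */

int main(void) {
    int a[110];
    for (int i = 0; i < 110; i++) a[i] = i;

    /* a[i+10] = f(a[i]) has dependence distance 10: iteration i feeds
       iteration i+10.  So the loop can run block by block, with the 10
       iterations inside each block executed in parallel; blocks themselves
       must stay in order. */
    for (int block = 0; block < 100; block += 10) {
        #pragma omp parallel for
        for (int i = block; i < block + 10; i++)
            a[i + 10] = f(a[i]);
    }

    printf("a[109] = %d\n", a[109]);
    return 0;
}
```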

  14. Be careful … Example 14
          for(i=0; i<100; i++)
            a[i] = f(a[indexa[i]]);
      • Cannot tell for sure.
      • Parallelization depends on user knowledge of the values in indexa[].
      • User can tell, compiler cannot.

      An aside
      • Parallelizing compilers analyze program dependences to decide parallelization.
      • In parallelization by hand, the user does the same analysis.
      • Compiler more convenient and more correct.
      • User more powerful, can analyze more patterns.

      To remember
      • Statement order must not matter.
      • Statements must not have dependences.
      • Some dependences can be removed.
      • Some dependences may not be obvious.
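A sketch of the indexa[] case: a compiler must assume a[indexa[i]] may alias an element written by another iteration, but a user who knows the contents of indexa can parallelize by hand. Here I assume indexa[i] == i purely for illustration; f and the OpenMP pragma are also my additions.

```c
#include <omp.h>
#include <stdio.h>

static int f(int x) { return x + 1; }   /* placeholder f */

int main(void) {
    int a[100], indexa[100];
    for (int i = 0; i < 100; i++) { a[i] = i; indexa[i] = i; }

    /* With indexa[i] == i (the assumption of this sketch), each iteration
       touches only its own a[i], so hand parallelization is safe even
       though the compiler cannot prove it. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = f(a[indexa[i]]);

    printf("a[50] = %d\n", a[50]);
    return 0;
}
```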
