ECE 1747H : Parallel Programming

Lecture 1: Overview

ECE 1747H

  • Meeting time: Mon 4-6 PM
  • Meeting place: BA 4164
  • Instructor: Cristiana Amza, http://www.eecg.toronto.edu/~amza, amza@eecg.toronto.edu, office Pratt 484E

Material

  • Course notes
  • Web material (e.g., published papers)
  • No required textbook, some recommended

Prerequisites

  • Programming in C or C++
  • Data structures
  • Basics of machine architecture
  • Basics of network programming
  • Please send e-mail to eugenia@eecg to get an eecg account!! (name, student ID, class, instructor)


Other than that

  • No written homework, no exams
  • 10% for each small programming assignment (expect 1-2)

  • 10% class participation
  • Rest comes from major course project

Programming Project

  • Parallelizing a sequential program, or improving the performance or the functionality of a parallel program
  • Project proposal and final report
  • In-class project proposal and final report presentation
  • “Sample” project presentation posted

Parallelism (1 of 2)

  • Ability to execute different parts of a single program concurrently on different machines
  • Goal: shorter running time
  • Grain of parallelism: how big are the parts?
  • Can be an instruction, statement, procedure, …
  • Will mainly focus on relatively coarse grain

Parallelism (2 of 2)

  • Coarse-grain parallelism mainly applicable to long-running, scientific programs
  • Examples: weather prediction, prime number factorization, simulations, …


Lecture material (1 of 4)

  • Parallelism
    – What is parallelism?
    – What can be parallelized?
    – Inhibitors of parallelism: dependences

Lecture material (2 of 4)

  • Standard models of parallelism
    – shared memory (Pthreads)
    – message passing (MPI)
    – shared memory + data parallelism (OpenMP)
  • Classes of applications
    – scientific
    – servers

Lecture material (3 of 4)

  • Transaction processing
    – classic programming model for databases
    – now being proposed for scientific programs

Lecture material (4 of 4)

  • Perf. of parallel & distributed programs

– architecture-independent optimization – architecture-dependent optimization


Course Organization

  • First month of the semester:
    – lectures on parallelism, patterns, models
    – small programming assignments, done individually
  • Rest of the semester:
    – major programming project, done individually or in a small group
    – research paper discussions

Parallel vs. Distributed Programming

Parallel programming has matured:

  • Few standard programming models
  • Few common machine architectures
  • Portability between models and architectures

Bottom Line

  • Programmer can now focus on the program and use a suitable programming model
  • Reasonable hope of portability
  • Problem: much performance optimization is still platform-dependent
    – Performance portability is a problem

ECE 1747H: Parallel Programming

Lecture 1-2: Parallelism, Dependences


Parallelism

  • Ability to execute different parts of a program concurrently on different machines
  • Goal: shorten execution time

Measures of Performance

  • To computer scientists: speedup, execution time.
  • To applications people: size of problem, accuracy of solution, etc.

Speedup of Algorithm

  • Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set).

Speedup on Problem

  • Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors.
  • A more honest measure of performance.
  • Avoids picking an easily parallelizable algorithm with poor sequential execution time.
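The distinction can be made concrete with a small C sketch (the function names and timings below are hypothetical, for illustration only, not from the course notes):

    /* Hypothetical sketch: computing the two speedup measures from
       measured execution times. */
    #include <stdio.h>

    /* speedup of algorithm: same algorithm, same data set, 1 vs. p processors */
    double speedup_of_algorithm(double t_seq_same_alg, double t_par_p) {
        return t_seq_same_alg / t_par_p;
    }

    /* speedup on problem: compare against the best known sequential algorithm */
    double speedup_on_problem(double t_best_seq, double t_par_p) {
        return t_best_seq / t_par_p;
    }

    int main(void) {
        /* made-up timings (seconds): the parallel algorithm run sequentially
           is slower than the best sequential algorithm */
        double t_seq_same_alg = 120.0, t_best_seq = 80.0, t_par_8 = 20.0;
        printf("speedup of algorithm on 8 processors: %.1f\n",
               speedup_of_algorithm(t_seq_same_alg, t_par_8));   /* 6.0 */
        printf("speedup on problem on 8 processors:   %.1f\n",
               speedup_on_problem(t_best_seq, t_par_8));         /* 4.0 */
        return 0;
    }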


What Speedups Can You Get?

  • Linear speedup
    – Confusing term: implicitly means a 1-to-1 speedup per processor.
    – (Almost always) as good as you can do.
  • Sub-linear speedup: more common, due to overhead of startup, synchronization, communication, etc.

Speedup

  (Figure: actual vs. linear speedup as a function of the number of processors p.)

Scalability

  • No really precise definition.
  • Roughly speaking, a program is said to scale to a certain number of processors p, if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5).

Super-linear Speedup?

  • Due to cache/memory effects:
    – Subparts fit into cache/memory of each node.
    – Whole problem does not fit in cache/memory of a single node.
  • Nondeterminism in search problems:
    – One thread finds near-optimal solution very quickly => leads to drastic pruning of search space.


Cardinal Performance Rule

  • Don’t leave (too) much of your code sequential!

Amdahl’s Law

  • If a fraction 1/s of the program is sequential, then you can never get a speedup better than s.
    – (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1
    – Best parallel execution time on p processors = 1/s + (1 - 1/s)/p
    – When p goes to infinity, parallel execution time = 1/s
    – Speedup = s.
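As a sanity check, here is a small C sketch (not part of the course notes; the function name is made up) that evaluates this bound for a program whose sequential fraction is 1/s:

    /* Amdahl's law in the slide's notation: a fraction 1/s of the
       program is sequential, the rest parallelizes perfectly. */
    #include <stdio.h>

    double amdahl_speedup(double s, double p) {
        double seq = 1.0 / s;                  /* sequential fraction        */
        return 1.0 / (seq + (1.0 - seq) / p);  /* 1 / (normalized time on p) */
    }

    int main(void) {
        /* example: 10% sequential (s = 10) can never beat a speedup of 10 */
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d  speedup = %5.2f\n", p, amdahl_speedup(10.0, (double)p));
        return 0;
    }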

Why keep something sequential?

  • Some parts of the program are not parallelizable (because of dependences)
  • Some parts may be parallelizable, but the overhead dwarfs the increased speedup.

When can two statements execute in parallel?

  • On one processor:
        statement1;
        statement2;
  • On two processors:
        processor 1: statement1;
        processor 2: statement2;


Fundamental Assumption

  • Processors execute independently: no control over the order of execution between processors

When can 2 statements execute in parallel?

  • Possibility 1: processor 1 executes statement1 before processor 2 executes statement2
  • Possibility 2: processor 2 executes statement2 before processor 1 executes statement1

When can 2 statements execute in parallel?

  • Their order of execution must not matter!
  • In other words,

        statement1; statement2;

    must be equivalent to

        statement2; statement1;

Example 1

a = 1; b = 2;

  • Statements can be executed in parallel.

Example 2

a = 1; b = a;

  • Statements cannot be executed in parallel.
  • Program modifications may make it possible.

Example 3

a = f(x); b = a;

  • May not be wise to change the program (sequential execution would take longer, since f(x) would be computed twice).

Example 5

a = 1; a = 2;

  • Statements cannot be executed in parallel.

True dependence

Statements S1, S2. S2 has a true dependence on S1 iff S2 reads a value written by S1.


Anti-dependence

Statements S1, S2. S2 has an anti-dependence on S1 iff S2 writes a value read by S1.

Output Dependence

Statements S1, S2. S2 has an output dependence on S1 iff S2 writes a variable written by S1.

When can 2 statements execute in parallel?

S1 and S2 can execute in parallel iff there are no dependences between S1 and S2:

  – true dependences
  – anti-dependences
  – output dependences

Some dependences can be removed.
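For example, anti- and output dependences are conflicts over storage rather than over values, so they can often be removed by renaming, a standard technique sketched below in C (variable names are illustrative, not from the slides):

    /* Sketch: removing an anti-dependence by renaming. */
    #include <stdio.h>

    int main(void) {
        int a = 5, b, a2;

        /* S2 has an anti-dependence on S1: S2 writes a, which S1 reads,
           so S1 and S2 cannot be reordered or run in parallel as written. */
        b = a;      /* S1 */
        a = 1;      /* S2 */

        /* Renamed version: S2' writes a fresh variable, so there is no
           dependence; later uses of a must be rewritten to use a2.        */
        a = 5;      /* reset, for the illustration only */
        b = a;      /* S1  */
        a2 = 1;     /* S2' */

        printf("%d %d %d\n", a, b, a2);
        return 0;
    }

A true dependence (S2 reads a value S1 produces) cannot be removed this way.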

Example 6

  • Most parallelism occurs in loops.

for(i=0; i<100; i++) a[i] = i;

  • No dependences.
  • Iterations can be executed in parallel.
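One way this is typically expressed in practice is with OpenMP (one of the models covered later in the course); a minimal sketch, not taken from the course notes:

    /* Independent iterations: the pragma lets the runtime distribute them
       across threads in any order.  Compile with, e.g., gcc -fopenmp.     */
    #include <omp.h>

    int a[100];

    void init(void) {
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            a[i] = i;
    }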

Example 7

for(i=0; i<100; i++) { a[i] = i; b[i] = 2*i; } Iterations and statements can be executed in parallel.

Example 8

for(i=0;i<100;i++) a[i] = i; for(i=0;i<100;i++) b[i] = 2*i; Iterations and loops can be executed in parallel.

Example 9

for(i=0; i<100; i++) a[i] = a[i] + 100;

  • There is a dependence … on itself!
  • Loop is still parallelizable.

Example 10

for( i=0; i<100; i++ ) a[i] = f(a[i-1]);

  • Dependence between a[i] and a[i-1].
  • Loop iterations are not parallelizable.

Loop-carried dependence

  • A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop.
  • Otherwise, we call it a loop-independent dependence.
  • Loop-carried dependences prevent loop iteration parallelization.

Example 11

for(i=0; i<100; i++ ) for(j=0; j<100; j++ ) a[i][j] = f(a[i][j-1]);

  • Loop-independent dependence on i.
  • Loop-carried dependence on j.
  • Outer loop can be parallelized, inner loop cannot.
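A hedged OpenMP sketch of that conclusion (not from the course notes; f is assumed side-effect-free, and j starts at 1 so that a[i][j-1] stays in bounds):

    #include <omp.h>

    #define N 100
    double a[N][N];
    double f(double x);   /* assumed pure / side-effect-free */

    void update(void) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)        /* parallel: no carried dependence     */
            for (int j = 1; j < N; j++)    /* sequential: dependence carried by j */
                a[i][j] = f(a[i][j-1]);
    }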

Example 12

for( j=0; j<100; j++ ) for( i=0; i<100; i++ ) a[i][j] = f(a[i][j-1]);

  • Inner loop can be parallelized, outer loop

cannot.

  • Less desirable situation.
  • Loop interchange is sometimes possible.
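Interchange is legal here because it preserves the order of the j-carried dependence; after interchanging, Example 12 takes the same shape as Example 11, and the new outer loop can be parallelized (a sketch, reusing the declarations from the previous OpenMP fragment):

        #pragma omp parallel for
        for (int i = 0; i < N; i++)        /* interchanged: i is now outermost, parallel */
            for (int j = 1; j < N; j++)    /* j kept innermost and sequential            */
                a[i][j] = f(a[i][j-1]);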

Level of loop-carried dependence

  • Is the nesting depth of the loop that carries the dependence.
  • Indicates which loops can be parallelized.

Be careful … Example 13

printf("a");
printf("b");

  • Statements have a hidden output dependence due to the output stream.

Be careful … Example 14

a = f(x);
b = g(x);

  • Statements could have a hidden dependence if f and g update the same variable.
  • Also depends on what f and g can do to x.

Be careful … Example 15

for(i=0; i<100; i++) a[i+10] = f(a[i]);

  • Dependence between a[10], a[20], …
  • Dependence between a[11], a[21], …
  • Some parallel execution is possible.
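One hedged way to exploit this (not from the course notes): the dependence a[i] → a[i+10] only links iterations whose indices are congruent mod 10, so the 10 resulting chains are independent and can run in parallel, while each chain stays sequential:

    #include <omp.h>

    #define N 100
    double a[N + 10];     /* sized so a[i+10] stays in bounds (assumption) */
    double f(double x);   /* assumed pure / side-effect-free */

    void update(void) {
        #pragma omp parallel for
        for (int r = 0; r < 10; r++)           /* one independent chain per residue r */
            for (int i = r; i < N; i += 10)    /* within a chain, order is preserved  */
                a[i + 10] = f(a[i]);
    }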

Be careful … Example 16

for( i=1; i<100;i++ ) { a[i] = …; ... = a[i-1]; }

  • Dependence between a[i] and a[i-1]
  • Complete parallel execution impossible
  • Pipelined parallel execution possible

Be careful … Example 17

for( i=0; i<100; i++ ) a[i] = f(a[indexa[i]]);

  • Cannot tell for sure.
  • Parallelization depends on user knowledge of the values in indexa[].
  • User can tell, compiler cannot.
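As one hedged illustration (not from the course notes): if the user knows that every indexa[i] points outside the range of elements the loop writes, then no iteration reads a value written by another iteration, and the loop can safely be run in parallel:

    #include <omp.h>

    #define N 100
    double a[2 * N];
    int indexa[N];        /* user-supplied guarantee: indexa[i] >= N for all i */
    double f(double x);   /* assumed pure / side-effect-free */

    void update(void) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)     /* writes a[0..N-1], reads only a[N..2N-1] */
            a[i] = f(a[indexa[i]]);
    }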

An aside

  • Parallelizing compilers analyze program dependences to decide parallelization.
  • In parallelization by hand, the user does the same analysis.
  • Compiler: more convenient and more correct.
  • User: more powerful, can analyze more patterns.

To remember

  • Statement order must not matter.
  • Statements must not have dependences.
  • Some dependences can be removed.
  • Some dependences may not be obvious.