ECE 1747 Parallel Programming

Machine-independent Performance Optimization Techniques

Factors that Determine Speedup

  • Characteristics of parallel code
    – granularity
    – load balance
    – locality
    – communication and synchronization

Granularity

  • Granularity = the size of the program unit that is executed by a single processor.
  • May be a single loop iteration, a set of loop iterations, etc.
  • Fine granularity leads to:
    – (positive) ability to use lots of processors
    – (positive) finer-grain load balancing
    – (negative) increased overhead

Granularity and Critical Sections

  • Small granularity => more processors => more critical section accesses => more contention.
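
A minimal sketch of this tradeoff, assuming OpenMP (the slides' "Parallel for" is pseudocode; the pragmas and the per-thread partial sum below are illustrative, not the course's own code):

#include <omp.h>

/* Fine grain: every iteration enters the critical section, so adding
   processors adds contention. */
double sum_fine( double *a, int n )
{
    double sum = 0.0;
    #pragma omp parallel for
    for( int i = 0; i < n; i++ ) {
        #pragma omp critical
        sum += a[i];
    }
    return sum;
}

/* Coarser grain: each thread accumulates privately and enters the
   critical section once, so contention stays constant as n grows. */
double sum_coarse( double *a, int n )
{
    double sum = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;
        #pragma omp for
        for( int i = 0; i < n; i++ )
            local += a[i];
        #pragma omp critical
        sum += local;
    }
    return sum;
}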


Issues in Performance of Parallel Parts

  • Granularity.
  • Load balance.
  • Locality.
  • Synchronization and communication.

Load Balance

  • Load imbalance = difference in execution time between processors between barriers.
  • Execution time may not be predictable.
    – Regular data parallel: yes.
    – Irregular data parallel or pipeline: perhaps.
    – Task queue: no.

Static vs. Dynamic

  • Static: done once, by the programmer
    – block, cyclic, etc.
    – fine for regular data parallel
  • Dynamic: done at runtime
    – task queue
    – fine for unpredictable execution times
    – usually high overhead

  • Semi-static: done once, at run-time

Choice is not inherent

  • MM (matrix multiplication) or SOR (successive over-relaxation) could be done using task queues: put all iterations in a queue.
    – In a heterogeneous environment.
    – In a multitasked environment.
  • TSP (traveling salesman problem) could be done using static partitioning:
    – If we did exhaustive search.


Static Load Balancing

  • Block
    – best locality
    – possibly poor load balance
  • Cyclic
    – better load balance
    – worse locality
  • Block-cyclic
    – load balancing advantages of cyclic (mostly)
    – better locality (see later, and the sketch after this list)
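
A minimal C sketch of the three schemes, assuming P threads share an n-iteration loop; the helper work() and the ceiling-division chunking are illustrative, not from the slides:

void work( int i );   /* hypothetical per-iteration body */

/* Block: thread pr gets one contiguous chunk (best locality). */
void block( int pr, int P, int n )
{
    int chunk = (n + P - 1) / P;               /* ceiling division */
    int lo = pr * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;
    for( int i = lo; i < hi; i++ )
        work( i );
}

/* Cyclic: thread pr takes every P-th iteration (better balance). */
void cyclic( int pr, int P, int n )
{
    for( int i = pr; i < n; i += P )
        work( i );
}

/* Block-cyclic: blocks of B iterations dealt out cyclically. */
void block_cyclic( int pr, int P, int n, int B )
{
    for( int base = pr * B; base < n; base += P * B )
        for( int i = base; i < base + B && i < n; i++ )
            work( i );
}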

Dynamic Load Balancing (1 of 2)

  • Centralized: single task queue (sketched after this list).
    – Easy to program.
    – Excellent load balance.
  • Distributed: task queue per processor.
    – Less communication/synchronization.
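
A minimal sketch of a centralized queue, assuming POSIX threads; the queue just hands out iteration indices, and all names (task_queue_t, get_task, do_task) are illustrative:

#include <pthread.h>

void do_task( int t );   /* hypothetical task body */

typedef struct {
    int next, total;          /* next unclaimed task, total task count */
    pthread_mutex_t lock;
} task_queue_t;

/* Every dequeue takes the single lock: this is why a central queue
   balances load well but becomes a contention point as processors
   are added. */
int get_task( task_queue_t *q )
{
    int t = -1;               /* -1 signals "queue empty" */
    pthread_mutex_lock( &q->lock );
    if( q->next < q->total )
        t = q->next++;
    pthread_mutex_unlock( &q->lock );
    return t;
}

void worker( task_queue_t *q )
{
    int t;
    while( (t = get_task( q )) != -1 )
        do_task( t );
}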

Dynamic Load Balancing (2 of 2)

  • Task stealing (see the sketch below):
    – Processes normally remove and insert tasks from their own queue.
    – When their queue is empty, they remove task(s) from other queues.
  • Extra overhead and programming difficulty.
  • Better load balancing.
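
A hedged sketch of stealing on top of per-processor queues; every name here (queue_t, pop_or_steal, the capacity MAXT) is illustrative, and a real implementation would steal from the opposite end of the queue to reduce conflicts:

#include <pthread.h>

#define MAXT 1024             /* illustrative queue capacity */

typedef struct {
    int tasks[MAXT];
    int top;                  /* number of tasks currently queued */
    pthread_mutex_t lock;
} queue_t;

/* Pop from our own queue first (k == 0); if it is empty, scan the
   other processors' queues and steal from the first non-empty one. */
int pop_or_steal( queue_t *q, int me, int P )
{
    for( int k = 0; k < P; k++ ) {
        queue_t *v = &q[(me + k) % P];
        int t = -1;
        pthread_mutex_lock( &v->lock );
        if( v->top > 0 )
            t = v->tasks[--v->top];
        pthread_mutex_unlock( &v->lock );
        if( t != -1 )
            return t;
    }
    return -1;                /* all queues empty: no work left */
}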

Semi-static Load Balancing

  • Measure the cost of program parts.
  • Use the measurements to partition the computation.
  • Done once, every iteration, or every n iterations.


Molecular Dynamics (continued)

for some number of timesteps {
    for all molecules i
        for all nearby molecules j
            force[i] += f( loc[i], loc[j] );
    for all molecules i
        loc[i] = g( loc[i], force[i] );
}

Molecular Dynamics (continued)

for each molecule i
    number of nearby molecules: count[i]
    array of indices of nearby molecules: index[j] ( 0 <= j < count[i] )

Molecular Dynamics (continued)

for some number of timesteps {
    for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
            force[i] += f( loc[i], loc[index[j]] );
    for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
}

Molecular Dynamics (simple)

for some number of timesteps {
    Fork()
    for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
            force[i] += f( loc[i], loc[index[j]] );
    Join()
    Fork()
    for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    Join()
}


Molecular Dynamics (simple)

for some number of timesteps {
    Parallel for
    for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
            force[i] += f( loc[i], loc[index[j]] );
    Parallel for
    for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
}

Molecular Dynamics (simple)

  • Simple to program.
  • Possibly poor load balance:
    – block distribution of i iterations (molecules)
    – could lead to uneven neighbor distribution
    – cyclic does not help

Better Load Balance

  • Assign iterations such that each processor has ~ the same number of neighbors.
  • Array of “assign records”
    – size: number of processors
    – two elements:
      • beginning i value (molecule)
      • ending i value (molecule)
  • Recompute the partition periodically (a sketch of computing these records follows the next code block).

Molecular Dynamics (continued)

for some number of timesteps {
    Parallel for
    pr = get_thread_num();
    for( i=assign[pr]->b; i<assign[pr]->e; i++ )
        for( j=0; j<count[i]; j++ )
            force[i] += f( loc[i], loc[index[j]] );
    Parallel for
    for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
}
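
A sketch of how the assign records might be computed from the neighbor counts, so that each thread gets roughly the same total number of interactions. The field names b and e match the slide's assign[pr]->b and assign[pr]->e; the struct layout, partition(), and the greedy prefix-sum scheme are assumptions, not the course's code:

typedef struct { int b, e; } assign_t;   /* [b, e): range of molecules */

void partition( assign_t *assign, int *count, int num_mol, int nthreads )
{
    long total = 0, target, acc = 0;
    int pr = 0;

    for( int i = 0; i < num_mol; i++ )
        total += count[i];                   /* total neighbor pairs */
    target = (total + nthreads - 1) / nthreads;

    assign[0].b = 0;
    for( int i = 0; i < num_mol; i++ ) {
        acc += count[i];
        if( acc >= target && pr < nthreads - 1 ) {
            assign[pr].e = i + 1;            /* end is exclusive */
            assign[++pr].b = i + 1;
            acc = 0;
        }
    }
    assign[pr].e = num_mol;
    while( ++pr < nthreads )                 /* leftover threads: empty */
        assign[pr].b = assign[pr].e = num_mol;
}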


Frequency of Balancing

  • Every time the neighbor list is recomputed:
    – once during initialization.
    – every iteration.
    – every n iterations.
  • Tradeoff: extra overhead vs. a better approximation and better load balance.

Summary

  • Parallel code optimization:
    – Critical section accesses.
    – Granularity.
    – Load balance.

Factors that Determine Speedup

  – granularity
  – load balancing
  – locality
    • uniprocessor
    • multiprocessor
  – synchronization and communication

Uniprocessor Memory Hierarchy

[Figure: memory hierarchy, from the CPU through the L1 cache and L2 cache to memory; access time and size grow with distance from the CPU.]


Typical Cache Organization

  • Caches are organized in “cache lines”.
  • Typical line sizes:
    – L1: 32 bytes
    – L2: 128 bytes

Cache Replacement

  • If you hit in the cache, you are done.
  • If you miss in the cache:
    – Fetch the line from the next level in the hierarchy.
    – Replace a line in the cache.

Bottom Line

  • To get good performance:
    – You have to have a high hit rate.
    – You have to continue to access data “close” to the data that you accessed recently.

Locality

  • Locality (or re-use) = the extent to which a processor continues to use the same data or “close” data.
  • Temporal locality: re-accessing a particular word before it gets replaced.
  • Spatial locality: accessing other words in a cache line before the line gets replaced.

slide-8
SLIDE 8

8

Example 1

for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
        grid[i][j] = temp[i][j];

  • Good spatial locality in grid and temp (arrays in C are laid out in row-major order).
  • No temporal locality.

Example 2

for( j=0; j<n; j++ )
    for( i=0; i<n; i++ )
        grid[i][j] = temp[i][j];

  • No spatial locality in grid and temp.
  • No temporal locality.

Example 3

for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
        temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                              grid[i][j-1] + grid[i][j+1] );

  • Spatial locality in temp.
  • Spatial locality in grid.
  • Temporal locality in grid?

Access to grid[i][j]

  • First time grid[i][j] is used: computing temp[i-1][j].
  • Second time grid[i][j] is used: computing temp[i][j-1].
  • Between those two uses, 3 rows of data go through the cache.
  • If 3 rows > cache size, the second access to grid[i][j] misses in the cache.


Fix

  • Traverse the array in blocks, rather than in a row-wise sweep.
  • Make sure grid[i][j] is still in the cache on the second access.

[Figure: Example 3 traversal order, before blocking.]

[Figure: Example 3 traversal order, after blocking.]

Achieving Better Locality

  • The technique is known as blocking / tiling.
  • Compiler algorithms for it are known.
  • Few commercial compilers do it.
  • Learn to do it yourself (see the sketch below).
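
A minimal sketch of a blocked version of Example 3. The tile width B is a tuning parameter, not from the slides, and the boundary rows and columns are skipped to keep the stencil in bounds; with column blocks, only ~3 partial rows of width B pass through the cache between the two uses of grid[i][j]:

void smooth_tiled( int n, double grid[n][n], double temp[n][n] )
{
    enum { B = 64 };   /* tile width: tune so ~3 rows of B doubles fit in cache */

    for( int jj = 1; jj < n-1; jj += B ) {
        int jmax = (jj + B < n-1) ? jj + B : n-1;
        for( int i = 1; i < n-1; i++ )
            for( int j = jj; j < jmax; j++ )
                temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                                      grid[i][j-1] + grid[i][j+1] );
    }
}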

Locality in Parallel Programming

  • Locality is even more important than in sequential programming, because the memory latencies are longer.

Returning to Sequential vs. Parallel

  • A piece of code may be better executed sequentially if considered by itself.
  • But locality may make it profitable to execute it in parallel.
  • This typically happens with initializations.

Example: Parallelization Ignoring Locality

for( i=0; i<n; i++ )
    a[i] = i;
Parallel for
for( i=0; i<n; i++ )
    /* assume f is a very expensive function */
    b[i] = f( a[i-1], a[i] );

Example: Taking Locality into Account

Parallel for
for( i=0; i<n; i++ )
    a[i] = i;
Parallel for
for( i=0; i<n; i++ )
    /* assume f is a very expensive function */
    b[i] = f( a[i-1], a[i] );


How to Get Started?

  • First thing: figure out what takes the time in your sequential program => profile it (gprof)!
  • Typically, a few parts (a few loops) take the bulk of the time.
  • Parallelize those parts first, worrying about granularity and load balance.
  • Advantage of shared memory: you can do that incrementally.
  • Then worry about locality.
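
A typical gprof workflow, as a sketch (the file names are illustrative):

gcc -pg -O2 myprog.c -o myprog    # compile and link with profiling enabled
./myprog                          # run normally; writes gmon.out
gprof myprog gmon.out | less      # flat profile plus call graph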

Performance and Architecture

  • Understanding the performance of a parallel program often requires an understanding of the underlying architecture.
  • There are two principal architectures:
    – distributed memory machines
    – shared memory machines
  • Microarchitecture plays a role too!