

SLIDE 1

Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 1999.

[Figure 10.1: An n × m matrix, with elements ai,j for rows 0 ≤ i ≤ n−1 and columns 0 ≤ j ≤ m−1.]

Numerical Algorithms

Matrices — A Review

SLIDE 2

Matrix Addition

Matrix addition simply involves adding corresponding elements of each matrix to form the result matrix. Given the elements of A as ai,j and the elements of B as bi,j, each element of C is computed as

ci,j = ai,j + bi,j    (0 ≤ i < n, 0 ≤ j < m)
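This rule maps directly onto a doubly nested loop. The following C sketch is illustrative only; the 2 × 3 dimensions and the function name matrix_add are chosen for the example, not taken from the text:

```c
#include <assert.h>

#define ROWS 2   /* n */
#define COLS 3   /* m */

/* ci,j = ai,j + bi,j for 0 <= i < n, 0 <= j < m */
void matrix_add(int a[ROWS][COLS], int b[ROWS][COLS], int c[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            c[i][j] = a[i][j] + b[i][j];
}
```

Every element of C depends only on the corresponding elements of A and B, which is why addition parallelizes trivially.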

SLIDE 3

[Figure 10.2: Matrix multiplication, C = A × B. Row i of A and column j of B are multiplied elementwise and the results summed to form ci,j.]

Matrix Multiplication

Multiplication of two matrices, A and B, produces the matrix C whose elements, ci,j (0 ≤ i < n, 0 ≤ j < m), are computed as

ci,j = Σ (k = 0 to l−1) ai,k bk,j

where A is an n × l matrix and B is an l × m matrix.
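A minimal C sketch of this definition for rectangular matrices; the dimensions n = 2, l = 3, m = 2 and the name mat_mult_rect are illustrative assumptions, not from the text:

```c
#include <assert.h>

#define NR 2   /* n: rows of A and C    */
#define LL 3   /* l: columns of A, rows of B */
#define MC 2   /* m: columns of B and C */

/* ci,j = sum over k = 0..l-1 of ai,k * bk,j */
void mat_mult_rect(int a[NR][LL], int b[LL][MC], int c[NR][MC])
{
    for (int i = 0; i < NR; i++)
        for (int j = 0; j < MC; j++) {
            c[i][j] = 0;
            for (int k = 0; k < LL; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}
```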

SLIDE 4

[Figure 10.3: Matrix-vector multiplication, c = A × b. Row i of A is multiplied by b and summed to form ci.]

Matrix-Vector Multiplication

A vector is a matrix with one column; i.e., an n × 1 matrix. Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making B an n × 1 matrix. The result is an n × 1 matrix (vector).
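Specializing the matrix-matrix loop to a single column gives the sketch below; the 3 × 3 size and values are illustrative assumptions:

```c
#include <assert.h>

#define N 3

/* ci = sum over k of ai,k * bk; B is treated as an n x 1 matrix */
void mat_vec(int a[N][N], int b[N], int c[N])
{
    for (int i = 0; i < N; i++) {
        c[i] = 0;
        for (int k = 0; k < N; k++)
            c[i] += a[i][k] * b[k];
    }
}
```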

SLIDE 5

Relationship of Matrices to Linear Equations

A system of linear equations can be written in matrix form:

Ax = b

The matrix A holds the constants ai,j, x is the vector of unknowns, and b is the vector of constants bi.

SLIDE 6

Implementing Matrix Multiplication

Sequential Code

Assume throughout that the matrices are square (n × n matrices). The sequential code to compute A × B could simply be

for (i = 0; i < n; i++)
   for (j = 0; j < n; j++) {
      c[i][j] = 0;
      for (k = 0; k < n; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

This algorithm requires n³ multiplications and n³ additions, leading to a sequential time complexity of Ο(n³).

SLIDE 7

Parallel Code

Parallel code is usually based upon the direct sequential matrix multiplication algorithm; the sequential loops are independent. With n processors (and n × n matrices), we can expect a parallel time complexity of Ο(n²), and this is easily obtainable. It is also quite easy to obtain a time complexity of Ο(n) with n² processors, where one element of A and B is assigned to each processor.

These implementations are cost optimal [since Ο(n³) = n × Ο(n²) = n² × Ο(n)]. It is possible to obtain Ο(log n) with n³ processors by parallelizing the inner loop, but this is not cost optimal [since Ο(n³) ≠ n³ × Ο(log n)].

Ο(log n) is the lower bound for parallel matrix multiplication.

SLIDE 8

[Figure 10.4: Block matrix multiplication. Submatrices Ap,r and Br,q are multiplied and the results summed to form Cp,q.]

Partitioning into Submatrices

Usually, we want to use far fewer than n processors with n × n matrices because of the size of n.

Submatrices

Each matrix can be divided into blocks of elements called submatrices. These submatrices can be manipulated as if they were single matrix elements. Suppose each matrix is divided into s² submatrices, each with n/s × n/s elements. Using the notation Ap,q for the submatrix in submatrix row p and submatrix column q:

for (p = 0; p < s; p++)
   for (q = 0; q < s; q++) {
      Cp,q = 0;                     /* clear elements of submatrix */
      for (r = 0; r < s; r++)       /* submatrix multiplication and */
         Cp,q = Cp,q + Ap,r * Br,q; /* add to accumulating submatrix */
   }

The line

Cp,q = Cp,q + Ap,r * Br,q;

means multiply submatrices Ap,r and Br,q using matrix multiplication and add the result to submatrix Cp,q using matrix addition.

The arrangement is known as block matrix multiplication.
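The block scheme can be sketched sequentially in C by expanding each submatrix operation into element loops. Here n = 4 and s = 2 are illustrative, SUB is the submatrix dimension n/s, and the three outer loops follow the p, q, r structure above:

```c
#include <assert.h>

#define N   4        /* matrix dimension            */
#define S   2        /* submatrices per row/column  */
#define SUB (N / S)  /* submatrix dimension n/s     */

void block_mat_mult(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)                /* clear all of C */
        for (int j = 0; j < N; j++)
            c[i][j] = 0;
    for (int p = 0; p < S; p++)
        for (int q = 0; q < S; q++)
            for (int r = 0; r < S; r++)        /* Cp,q += Ap,r x Br,q */
                for (int i = 0; i < SUB; i++)
                    for (int j = 0; j < SUB; j++)
                        for (int k = 0; k < SUB; k++)
                            c[p*SUB + i][q*SUB + j] +=
                                a[p*SUB + i][r*SUB + k] *
                                b[r*SUB + k][q*SUB + j];
}
```

Reordering the element loops this way changes nothing mathematically; it only groups the work into submatrix-sized chunks that can be assigned to processors.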

SLIDE 9

[Figure 10.5: Submatrix multiplication. (a) 4 × 4 matrices A and B partitioned into 2 × 2 submatrices. (b) Multiplying A0,0 × B0,0, adding A0,1 × B1,0, gives C0,0, each element being the full sum a i,0b0,j + ai,1b1,j + ai,2b2,j + ai,3b3,j.]

SLIDE 10

[Figure 10.6: Direct implementation of matrix multiplication. Processor Pi,j holds row a[i][] and column b[][j] and computes c[i][j].]

Direct Implementation

Allocate one processor to compute each element of C; then n² processors would be needed. Each processor needs one row of elements of A and one column of elements of B, so some of the same elements must be sent to more than one processor. Using submatrices, one processor would compute one m × m submatrix of C.

SLIDE 11

Analysis

Assuming n × n matrices and not submatrices.

Communication

With separate messages to each of the n² slave processors:

tcomm = n²(tstartup + 2n·tdata) + n²(tstartup + tdata) = n²(2tstartup + (2n + 1)tdata)

A broadcast along a single bus would yield

tcomm = (tstartup + n²tdata) + n²(tstartup + tdata)

The dominant time is now in returning the results, as tstartup is usually significantly larger than tdata. A gather routine should reduce the time.

Computation

Each slave performs, in parallel, n multiplications and n additions; i.e., tcomp = 2n.

SLIDE 12

[Figure 10.7: Accumulation using a tree construction. The products a0,0b0,0, a0,1b1,0, a0,2b2,0, and a0,3b3,0 are formed by P0 to P3 and summed pairwise to give c0,0.]

Performance Improvement

By using a tree construction, n numbers can be added in log n steps using n processors: a computational time complexity of Ο(log n) using n³ processors, instead of Ο(n) using n² processors.

SLIDE 13

Submatrices

In every method, we can substitute submatrices for matrix elements to reduce the number of processors.

Let us select m × m submatrices and s = n/m. Then there are s² submatrices in each matrix and s² processors.

Communication

Each of the s² slave processors must separately receive one row and one column of submatrices, and each must separately return a C submatrix to the master processor, giving a communication time of

tcomm = s²{2(tstartup + nm·tdata) + (tstartup + m²tdata)} = (n/m)²{3tstartup + (m² + 2nm)tdata}

Complete matrices could instead be broadcast to every processor. As the submatrix size m is increased, the data transmission time of each message increases but the number of messages decreases.

Computation

One sequential submatrix multiplication requires m³ multiplications and m³ additions; a submatrix addition requires m² additions. Hence

tcomp = s(2m³ + m²) = Ο(sm³) = Ο(nm²)

SLIDE 14

[Figure 10.8: Submatrix multiplication and summation. Processors P0 to P7 form the eight submatrix products, which are then added in pairs (P0 + P1, P2 + P3, P4 + P5, P6 + P7).]

Recursive Implementation

Consider two n × n matrices, A and B, where n is a power of 2. Each matrix is divided into four square submatrices. Suppose the submatrices of A are labeled App, Apq, Aqp, and Aqq, and the submatrices of B are labeled Bpp, Bpq, Bqp, and Bqq (p and q identifying the row and column positions). The final answer requires eight pairs of submatrices to be multiplied: App × Bpp, Apq × Bqp, App × Bpq, Apq × Bqq, Aqp × Bpp, Aqq × Bqp, Aqp × Bpq, and Aqq × Bqq, and pairs of results to be added: Cpp = App × Bpp + Apq × Bqp, Cpq = App × Bpq + Apq × Bqq, Cqp = Aqp × Bpp + Aqq × Bqp, and Cqq = Aqp × Bpq + Aqq × Bqq. The same algorithm could do each submatrix multiplication, by decomposing each submatrix into four sub-submatrices, and so on, to create a recursive algorithm.

SLIDE 15

Recursive Algorithm

mat_mult(A, B, s)
{
   if (s == 1)                    /* if submatrix has one element */
      C = A * B;                  /* multiply elements */
   else {                         /* else continue to make recursive calls */
      s = s/2;                    /* the number of elements in each row/column */
      P0 = mat_mult(App, Bpp, s);
      P1 = mat_mult(Apq, Bqp, s);
      P2 = mat_mult(App, Bpq, s);
      P3 = mat_mult(Apq, Bqq, s);
      P4 = mat_mult(Aqp, Bpp, s);
      P5 = mat_mult(Aqq, Bqp, s);
      P6 = mat_mult(Aqp, Bpq, s);
      P7 = mat_mult(Aqq, Bqq, s);
      Cpp = P0 + P1;              /* add submatrix products */
      Cpq = P2 + P3;              /* to form submatrices of final matrix */
      Cqp = P4 + P5;
      Cqq = P6 + P7;
   }
   return (C);                    /* return final matrix */
}

Each of the eight recursive calls can be performed simultaneously and by separate processors, and more processors can be assigned after further recursive calls. Generally, the number of processors needs to be a power of 8 if each processor is to be given one of the tasks created by the recursive calls. The level of recursion can be limited.

At each recursive call, the size of data being passed is reduced and localized. This is ideal for the best performance of a multiprocessor system with cache memory; the method is especially suitable for shared memory systems.
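A sequential C sketch of the recursion, with illustrative n = 4. Instead of the temporaries P0 to P7, this version accumulates products directly into C (the caller must zero C first); the p, q, r loops expand to exactly the eight recursive calls of the pseudocode. The offset parameters are an assumption of this sketch, used to select submatrices without copying:

```c
#include <assert.h>

#define N 4   /* must be a power of 2 */

/* multiply the s x s submatrices of a and b at offsets (ar,ac), (br,bc),
   accumulating into the s x s submatrix of c at (cr,cc) */
static void mat_mult_rec(int a[N][N], int ar, int ac,
                         int b[N][N], int br, int bc,
                         int c[N][N], int cr, int cc, int s)
{
    if (s == 1) {
        c[cr][cc] += a[ar][ac] * b[br][bc];  /* one-element submatrix */
    } else {
        s = s / 2;
        for (int p = 0; p < 2; p++)          /* submatrix row of C    */
            for (int q = 0; q < 2; q++)      /* submatrix column of C */
                for (int r = 0; r < 2; r++)  /* Cpq += Apr x Brq      */
                    mat_mult_rec(a, ar + p*s, ac + r*s,
                                 b, br + r*s, bc + q*s,
                                 c, cr + p*s, cc + q*s, s);
    }
}
```

In a parallel version, the eight recursive calls at each level would be handed to separate processors rather than executed by the loops.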

SLIDE 16

[Figure 10.9: Movement of A and B elements through the mesh toward processor Pi,j.]

Mesh Implementations

Cannon’s Algorithm

Uses a mesh of processors with wraparound connections (a torus) to shift the A elements (or submatrices) left and the B elements (or submatrices) up. The algorithm:

1. Initially, processor Pi,j has elements ai,j and bi,j (0 ≤ i < n, 0 ≤ j < n).

2. Elements are moved from their initial positions to "aligned" positions. The complete ith row of A is shifted i places left, and the complete jth column of B is shifted j places upward. This has the effect of placing the element ai,j+i and the element bi+j,j in processor Pi,j. These elements are a pair of those required in the accumulation of ci,j.

3. Each processor, Pi,j, multiplies its elements.

4. The ith row of A is shifted one place left, and the jth column of B is shifted one place upward. This has the effect of bringing together the adjacent elements of A and B, which will also be required in the accumulation.

5. Each processor, Pi,j, multiplies the elements brought to it and adds the result to the accumulating sum.

6. Steps 4 and 5 are repeated until the final result is obtained (n − 1 shifts with n rows and n columns of elements).
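The six steps can be simulated sequentially in C, with array cell (i, j) standing in for processor Pi,j. The n = 4 size and the helper names shift_a and shift_b are assumptions of this sketch; the initial skew is step 2, and each pass of the loop performs steps 3 to 5:

```c
#include <assert.h>

#define N 4

/* shift every row of a left: row i by i places if align, else by 1 */
static void shift_a(int a[N][N], int align)
{
    int t[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            t[i][j] = a[i][(j + (align ? i : 1)) % N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = t[i][j];
}

/* shift every column of b up: column j by j places if align, else by 1 */
static void shift_b(int b[N][N], int align)
{
    int t[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            t[i][j] = b[(i + (align ? j : 1)) % N][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = t[i][j];
}

/* Cannon's algorithm; a and b are modified by the shifting */
void cannon(int a[N][N], int b[N][N], int c[N][N])
{
    shift_a(a, 1);                       /* step 2: alignment */
    shift_b(b, 1);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0;
    for (int step = 0; step < N; step++) {
        for (int i = 0; i < N; i++)      /* steps 3 and 5: each "processor" */
            for (int j = 0; j < N; j++)  /* multiplies and accumulates      */
                c[i][j] += a[i][j] * b[i][j];
        shift_a(a, 0);                   /* step 4: one place left / up */
        shift_b(b, 0);
    }
}
```

After the alignment, cell (i, j) sees ai,k and bk,j with the same k at every step, with k cycling through all n values, which is why the accumulated sums equal C = A × B.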

SLIDE 17

[Figure 10.10: Step 2, alignment of elements of A and B. Row i of A is shifted i places, column j of B is shifted j places, bringing ai,j+i and bi+j,j together.]

SLIDE 18

[Figure 10.11: Step 4, one-place shift of elements of A and B through processor Pi,j.]

SLIDE 19

Analysis

Communication

Given s² submatrices, each of size m × m, the initial alignment requires a maximum of s − 1 shift (communication) operations. After that, there are a further s − 1 shift operations, each involving m × m elements. Hence

tcomm = 2(s − 1)(tstartup + m²tdata)

or a communication time complexity of Ο(sm²) or Ο(mn).

Computation

Each submatrix multiplication requires m³ multiplications and m³ additions. Hence, with s − 1 shifts,

tcomp = 2sm³ = 2m²n

or a computational time complexity of Ο(m²n).
SLIDE 20

[Figure 10.12: Matrix multiplication using a systolic array. Rows of A are pumped in from the left and columns of B from the top, with a one-cycle delay between adjacent inputs; element ci,j accumulates in processor Pi,j.]

Two-dimensional Pipeline Systolic Array

The word systolic has been borrowed from the medical field: just as the heart pumps blood, information is pumped through a systolic array at regular intervals. In the two-dimensional systolic array, information is pumped from left to right and from top to bottom. The information meets at internal nodes, where the processing occurs, and the same information then passes onward (left to right, or downward).

A systolic array can be used to multiply two 4 × 4 matrices, A and B. The elements of A enter from the left and the elements of B enter from the top; the final product terms of C are held in the processors as shown in Figure 10.12.

SLIDE 21

Code

A suitable numbering of processors uses x and y coordinates starting at (0, 0) in the top left corner. Each processor, Pi,j, repeatedly performs the same algorithm (after c is initialized to zero):

recv(&a, Pi,j-1);   /* receive from left */
recv(&b, Pi-1,j);   /* receive from above */
c = c + a * b;      /* accumulate value for ci,j */
send(&a, Pi,j+1);   /* send to right */
send(&b, Pi+1,j);   /* send downwards */

which accumulates the required summations.

SLIDE 22

[Figure 10.13: Matrix-vector multiplication using a systolic array. Rows of A are pumped in from the left, the elements of b from the top; ci accumulates in processor i.]

Matrix-Vector Multiplication

Methods for matrix multiplication can be applied to matrix-vector multiplication by simply using one column of the matrix B. A systolic array for matrix-vector multiplication is shown in Figure 10.13.

SLIDE 23

Solving a System of Linear Equations

Linear Equations

Suppose we have a system of linear equations:

an−1,0x0 + an−1,1x1 + an−1,2x2 + … + an−1,n−1xn−1 = bn−1
...
a2,0x0 + a2,1x1 + a2,2x2 + … + a2,n−1xn−1 = b2
a1,0x0 + a1,1x1 + a1,2x2 + … + a1,n−1xn−1 = b1
a0,0x0 + a0,1x1 + a0,2x2 + … + a0,n−1xn−1 = b0

which, in matrix form, is Ax = b. The objective of solving this system is to find values for the unknowns, x0, x1, …, xn−1, given values for a0,0, a0,1, …, an−1,n−1, and b0, …, bn−1.

Matrices are regarded as dense if most of their elements are nonzero, and sparse if a significant number of their elements are zero. The two are differentiated because less computationally intensive and perhaps more space-efficient methods are available for sparse matrices. Here, we will assume that the matrix A is dense, and the equations will be solved directly by mathematical manipulation.

SLIDE 24

[Figure 10.14: Gaussian elimination. Stepping through column i, each element aj,i in row j below row i is cleared to zero; earlier columns have already been cleared.]

Gaussian Elimination

The objective of Gaussian elimination is to convert the general system of linear equations into a triangular system, which can then be solved by back substitution. The method uses the characteristic of linear equations that any row can be replaced by that row added to another row multiplied by a constant.

The procedure starts at the first row and works toward the bottom row. At the ith row, each row j below the ith row is replaced by row j + (row i)(−aj,i/ai,i); i.e., the constant used for row j is −aj,i/ai,i. This has the effect of making all the elements in the ith column below the ith row zero, because

aj,i + ai,i(−aj,i/ai,i) = aj,i − aj,i = 0

SLIDE 25

Partial Pivoting

Unfortunately, the previous procedure does not exhibit good numerical stability on digital computers. In particular, if ai,i is zero or close to zero, we will not be able to compute the quantity −aj,i/ai,i. The procedure must be modified into so-called partial pivoting: the ith row is swapped with whichever row below it has the largest absolute element in the ith column, if such a row exists. (Reordering the equations does not affect the system.) In the following, we will not consider partial pivoting, which incurs additional computations.

SLIDE 26

Sequential Code

Without partial pivoting:

for (i = 0; i < n-1; i++)        /* for each row, except last */
   for (j = i+1; j < n; j++) {   /* step through subsequent rows */
      m = a[j][i]/a[i][i];       /* compute multiplier */
      for (k = i; k < n; k++)    /* modify elements i to n-1 of row j */
         a[j][k] = a[j][k] - a[i][k] * m;
      b[j] = b[j] - b[i] * m;    /* modify right side */
   }

The time complexity is O(n3).
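As a usage sketch, the elimination loop can be combined with the back substitution phase (mentioned earlier but not listed in the text) to solve a small system. The 3 × 3 values and the name gauss_solve are illustrative assumptions:

```c
#include <assert.h>
#include <math.h>

#define N 3

/* forward elimination (as above) followed by back substitution */
void gauss_solve(double a[N][N], double b[N], double x[N])
{
    for (int i = 0; i < N - 1; i++)        /* forward elimination */
        for (int j = i + 1; j < N; j++) {
            double m = a[j][i] / a[i][i];  /* multiplier */
            for (int k = i; k < N; k++)
                a[j][k] = a[j][k] - a[i][k] * m;
            b[j] = b[j] - b[i] * m;
        }
    for (int i = N - 1; i >= 0; i--) {     /* back substitution */
        double sum = b[i];
        for (int j = i + 1; j < N; j++)
            sum -= a[i][j] * x[j];
        x[i] = sum / a[i][i];
    }
}
```

Back substitution adds only Ο(n²) work, so the Ο(n³) elimination dominates.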

SLIDE 27

[Figure 10.15: Broadcast in parallel implementation of Gaussian elimination. The ith row (n − i + 1 elements, including b[i]) is broadcast to the rows below; earlier columns are already cleared to zero.]

Parallel Implementation

One way to partition the problem is to arrange for one processor to hold one row of elements and operate on that row; this requires n processors for n equations. Before a processor can operate on its row, it must receive the elements of row i, which can be achieved through a broadcast operation.

First, processor P0 (holding row 0) broadcasts the elements of its row to each of the other n − 1 processors, which then compute their multipliers and modify their rows. The procedure is repeated with each of P1, P2, …, Pn−2 broadcasting the elements of their rows. For the ith broadcast, processors Pi+1 to Pn−1 (a total of n − i − 1 processors) receive the message and operate upon n − i + 1 elements of their rows.

SLIDE 28

Analysis

Communication

There are n − 1 broadcast messages, which must be performed sequentially because the rows are altered between broadcasts. The ith broadcast message contains n − i + 1 elements. Hence, the total message communication is given by

tcomm = Σ (i = 0 to n−2) (tstartup + (n − i + 1)tdata) = (n − 1)tstartup + ((n + 2)(n + 1)/2 − 3)tdata

or a time complexity of Ο(n²).

For large n, the data time tdata could dominate the overall communication time in the early stages.

Computation

After a row has been broadcast, each processor beyond the broadcast processor receives the message, computes its multiplier, and operates upon the remaining elements of its row; for row j this amounts to n − j + 2 multiplications and n − j + 2 subtractions (ignoring the computation of the multiplier). Hence, the computation consists of

tcomp = Σ (j = 1 to n−1) 2(n − j + 2) = Ο(n²)

Efficiency will be relatively low because the processors before the processor holding row i do not participate in the computation again.

SLIDE 29

[Figure 10.16: Pipeline implementation of Gaussian elimination. Rows are broadcast along the pipeline P0, P1, P2, …, Pn−1.]

Pipeline Configuration

Processors could be formed into a pipeline configuration:

SLIDE 30

[Figure 10.17: Strip partitioning. Rows are divided among P0 to P3 in contiguous blocks, with boundaries at n/p, 2n/p, and 3n/p.]

Partitioning

Strip Partitioning

To reduce the number of processors below n, we can partition the matrix into groups of rows, one group for each processor:

Unfortunately, processors do not participate in the computation after their last row is processed.

SLIDE 31

[Figure 10.18: Cyclic partitioning to equalize workload. Rows are allocated to P0, P1, … in turn, cycling through the processors.]

Cyclic-Striped Partitioning

An alternative, which actually equalizes the processor workload:

SLIDE 32

Iterative Methods

Jacobi Iteration

Given a general system of N linear equations, Ax = b, Jacobi iteration is described by the iteration formula

xi^k = (1/ai,i)[ bi − Σ (j ≠ i) ai,j xj^(k−1) ]

where the superscript indicates the iteration; i.e., xi^k is the kth iteration of xi and xj^(k−1) is the (k−1)th iteration of xj. The iteration formula is simply the ith equation rearranged to have the ith unknown on the left side.

The time complexity of the direct method is significant at Ο(N²) with N processors. The time complexity of the iterative method depends upon the number of iterations and the required accuracy. A particular application of the iterative method is in solving a system of sparse linear equations.
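A minimal C sketch of the formula; the small diagonally dominant system, the fixed iteration count, and the name jacobi are illustrative assumptions (real code would test for convergence instead):

```c
#include <assert.h>
#include <math.h>

#define N 3

/* iters Jacobi iterations; x holds the initial guess on entry
   and the final iterate on return */
void jacobi(double a[N][N], double b[N], double x[N], int iters)
{
    double xnew[N];
    for (int k = 0; k < iters; k++) {
        for (int i = 0; i < N; i++) {
            double sum = b[i];
            for (int j = 0; j < N; j++)
                if (j != i)
                    sum -= a[i][j] * x[j];  /* uses only x^(k-1) values */
            xnew[i] = sum / a[i][i];
        }
        for (int i = 0; i < N; i++)
            x[i] = xnew[i];                 /* x becomes x^k */
    }
}
```

Because every xi^k depends only on the previous iterate, all N updates within one iteration are independent and can be computed in parallel.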

SLIDE 33

[Figure 10.19: Finite difference method. The two-dimensional solution space for f(x, y) is discretized into points with spacing ∆ in each direction.]

Laplace’s Equation

A fundamental partial differential equation, Laplace's equation:

∂²f/∂x² + ∂²f/∂y² = 0

The objective is to solve for f over the two-dimensional space having coordinates x and y. For a computer solution, finite difference methods are appropriate. Here, the two-dimensional solution space is "discretized" into a large number of solution points.

SLIDE 34

If the distance between the points in the x and y directions, ∆, is made small enough, the central difference approximation of the second derivative can be used:

∂²f/∂x² ≈ (1/∆²)[f(x + ∆, y) − 2f(x, y) + f(x − ∆, y)]
∂²f/∂y² ≈ (1/∆²)[f(x, y + ∆) − 2f(x, y) + f(x, y − ∆)]

Substituting into Laplace's equation, we get

(1/∆²)[f(x + ∆, y) + f(x − ∆, y) + f(x, y + ∆) + f(x, y − ∆) − 4f(x, y)] = 0

Rearranging, we get

f(x, y) = [f(x − ∆, y) + f(x, y − ∆) + f(x + ∆, y) + f(x, y + ∆)]/4

The formula can be rewritten as an iterative formula:

f^k(x, y) = [f^(k−1)(x − ∆, y) + f^(k−1)(x, y − ∆) + f^(k−1)(x + ∆, y) + f^(k−1)(x, y + ∆)]/4

where f^k(x, y) is the value obtained from the kth iteration, and f^(k−1)(x, y) is the value obtained from the (k−1)th iteration.
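The iterative formula translates into a Jacobi-style sweep over the interior grid points. In this sketch the 6 × 6 grid, the fixed boundary values, and the name laplace_iterate are illustrative assumptions:

```c
#include <assert.h>
#include <math.h>

#define N 6   /* grid dimension, including fixed boundary points */

/* iters sweeps of the four-point update on the interior points;
   boundary rows/columns are left unchanged */
void laplace_iterate(double f[N][N], int iters)
{
    double g[N][N];
    for (int k = 0; k < iters; k++) {
        for (int i = 1; i < N - 1; i++)      /* compute f^k from f^(k-1) */
            for (int j = 1; j < N - 1; j++)
                g[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] +
                                  f[i+1][j] + f[i][j+1]);
        for (int i = 1; i < N - 1; i++)      /* install the new iterate */
            for (int j = 1; j < N - 1; j++)
                f[i][j] = g[i][j];
    }
}
```

With the boundary held at a constant value, the interior converges to that same constant, which gives a simple sanity check.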
SLIDE 35

[Figure 10.20: Mesh of points numbered in natural order, x1 through x100 for a 10 × 10 mesh, with boundary points (see text).]

Natural Order

The points are numbered row by row in the mesh: the first row has the points x1, x2, x3, …, xn; the next row has the points xn+1, xn+2, xn+3, …, x2n; and so on.

SLIDE 36

Relationship with a General System of Linear Equations

Using natural ordering, the ith point is computed from the ith equation:

xi = (xi−n + xi−1 + xi+1 + xi+n)/4

or

xi−n + xi−1 − 4xi + xi+1 + xi+n = 0

which is a linear equation with five unknowns (except for those equations involving boundary points).
SLIDE 37

[Figure 10.21: Sparse matrix for Laplace's equation. The ith equation has −4 on the diagonal (ai,i) and 1 at ai,i−n, ai,i−1, ai,i+1, and ai,i+n; equations with a boundary point on the diagonal are unnecessary for the solution.]

In general form, the ith equation becomes:

ai,i−nxi−n + ai,i−1xi−1 + ai,ixi + ai,i+1xi+1 + ai,i+nxi+n = 0

where ai,i = −4 and ai,i−n = ai,i−1 = ai,i+1 = ai,i+n = 1.

SLIDE 38

[Figure 10.22: Gauss-Seidel relaxation with natural order, computed sequentially. Already-computed points feed into the point being computed.]

Faster Convergence Methods

Gauss-Seidel Relaxation

One way to attempt faster convergence is to use some of the newly computed values to compute other values within the same iteration. Suppose we compute the values sequentially in natural order: x1, x2, x3, and so on. When xi is being computed, x1, x2, …, xi−1 will already have been computed, and these values can be used in the iteration formula for xi, together with the previous values of xi+1, xi+2, …, xN, which have not yet been recomputed.

SLIDE 39

Gauss-Seidel Iteration Formula

The Gauss-Seidel iteration formula is

xi^k = (1/ai,i)[ bi − Σ (j = 1 to i−1) ai,j xj^k − Σ (j = i+1 to N) ai,j xj^(k−1) ]

where the superscript indicates the iteration; i.e., the kth iteration uses values from the kth iteration as well as values from the (k−1)th iteration.

For Laplace's equation with natural ordering of unknowns, the formula reduces to

xi^k = (−1/ai,i)[ ai,i−n xi−n^k + ai,i−1 xi−1^k + ai,i+1 xi+1^(k−1) + ai,i+n xi+n^(k−1) ]

At the kth iteration, two of the four values (those of the points before the ith element) are taken from the kth iteration, and two values (those of the points after the ith element) are taken from the (k−1)th iteration. Using the original finite difference notation, we have

f^k(x, y) = [f^k(x − ∆, y) + f^k(x, y − ∆) + f^(k−1)(x + ∆, y) + f^(k−1)(x, y + ∆)]/4
SLIDE 40

[Figure 10.23: Red-black ordering. Red and black points are interleaved across the mesh.]

In the Gauss-Seidel method, the order of update is the natural order; hence, the computation sweeps across the mesh of points. This is not a particularly convenient characteristic for parallelizing the method. However, the ordering can be changed to make the method more amenable to parallelization:

Red-Black Ordering

The mesh of points is divided into "red" points and "black" points, which are interleaved. Each black point is computed from its four neighboring red points, and each red point from its four neighboring black points. The computation is organized in two phases that are repeated until the values converge: first the red points are computed, then the black points. All red points can be computed simultaneously, and all black points can be computed simultaneously.

SLIDE 41

Red-Black Parallel Code

The code could be of the form

forall (i = 1; i < n; i++)
   forall (j = 1; j < n; j++)
      if ((i + j) % 2 == 0)   /* compute red points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
forall (i = 1; i < n; i++)
   forall (j = 1; j < n; j++)
      if ((i + j) % 2 != 0)   /* compute black points */
         f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
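A sequential C rendering of the same two phases (the forall loops become ordinary for loops; the 6 × 6 grid, fixed boundary, and name red_black_sweep are illustrative assumptions). Within each phase no point reads another point of the same color, which is what makes all updates in a phase safe to run simultaneously:

```c
#include <assert.h>
#include <math.h>

#define N 6   /* grid dimension, including fixed boundary points */

void red_black_sweep(double f[N][N])
{
    for (int i = 1; i < N - 1; i++)      /* phase 1: red points   */
        for (int j = 1; j < N - 1; j++)
            if ((i + j) % 2 == 0)
                f[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] +
                                  f[i+1][j] + f[i][j+1]);
    for (int i = 1; i < N - 1; i++)      /* phase 2: black points */
        for (int j = 1; j < N - 1; j++)
            if ((i + j) % 2 != 0)
                f[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] +
                                  f[i+1][j] + f[i][j+1]);
}
```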

SLIDE 42

[Figure 10.24: Nine-point stencil. A point is updated from eight nearby points at distances ∆ and 2∆.]

Higher-Order Difference Methods

More distant points could be used in the computation. The following update formula uses eight nearby points to update a point:

f^k(x, y) = (1/60)[16f^(k−1)(x − ∆, y) + 16f^(k−1)(x, y − ∆) + 16f^(k−1)(x + ∆, y) + 16f^(k−1)(x, y + ∆)
            − f^(k−1)(x − 2∆, y) − f^(k−1)(x, y − 2∆) − f^(k−1)(x + 2∆, y) − f^(k−1)(x, y + 2∆)]

SLIDE 43

Overrelaxation

Improved convergence can be obtained by adding the factor (1 − ω)xi^(k−1) to either the Jacobi or Gauss-Seidel formula. The factor ω is the overrelaxation parameter.

The Jacobi overrelaxation formula (for the general system of linear equations) is

xi^k = (ω/ai,i)[ bi − Σ (j ≠ i) ai,j xj^(k−1) ] + (1 − ω)xi^(k−1)

where 0 < ω < 1. Gauss-Seidel successive overrelaxation is

xi^k = (ω/ai,i)[ bi − Σ (j = 1 to i−1) ai,j xj^k − Σ (j = i+1 to N) ai,j xj^(k−1) ] + (1 − ω)xi^(k−1)

where 0 < ω ≤ 2. If ω = 1, we obtain the Gauss-Seidel method.
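A C sketch of Gauss-Seidel successive overrelaxation; the small system, the value of ω, and the name sor are illustrative assumptions. Because x is updated in place, the sum automatically mixes kth-iteration values (for j < i) with (k−1)th values (for j > i), as the formula requires:

```c
#include <assert.h>
#include <math.h>

#define N 3

/* iters sweeps of Gauss-Seidel SOR; x holds the initial guess on entry */
void sor(double a[N][N], double b[N], double x[N], double omega, int iters)
{
    for (int k = 0; k < iters; k++)
        for (int i = 0; i < N; i++) {
            double sum = b[i];
            for (int j = 0; j < N; j++)
                if (j != i)
                    sum -= a[i][j] * x[j];  /* mixes x^k and x^(k-1) */
            x[i] = (omega / a[i][i]) * sum + (1.0 - omega) * x[i];
        }
}
```

Setting omega to 1.0 reduces the update to plain Gauss-Seidel, matching the remark above.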

SLIDE 44

[Figure 10.25: Multigrid processor allocation. Coarsest grid points and finer grid points are assigned to processors.]

Multigrid Method

In the multigrid method, the iterative process operates on a grid of points as previously, but the number of points is altered at stages during the computation. First, a coarse grid of points is used (that is, the solution space is divided into relatively few points). With these points, the iteration process will start to converge quickly. At some stage, the number of points is increased to include the points of the coarse grid and extra points between the points of the coarse grid. The initial values of extra points can be found by interpolation. The computation continues with this finer grid. The grid can be made finer and finer as the computation proceeds, or the computation can alternate between fine and coarse grids. The coarser grids take into account distant effects more quickly and provide a good starting point for the next finer grid.