Matrix multiplication (PowerPoint PPT Presentation)


SLIDE 1

Matrix-vector multiplication

  • Standard serial algorithm:
  • Complexity: W = n²

procedure MAT_VECT(A, x, y)
begin
  for i := 0 to n - 1 do
  begin
    y[i] := 0
    for j := 0 to n - 1 do
      y[i] := y[i] + A[i, j] * x[j]
  end
end MAT_VECT
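The pseudocode above translates directly into Python; this is a minimal serial sketch (the function name mirrors the slide's MAT_VECT):

```python
def mat_vect(A, x):
    """Serial matrix-vector product y = A x (the slide's MAT_VECT)."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):          # one output element per row
        for j in range(n):      # inner product of row i with x
            y[i] += A[i][j] * x[j]
    return y
```

Each of the n output elements costs n multiply-adds, giving the stated W = n² complexity.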

SLIDE 2

Matrix-Vector Multiplication (rowwise striping)

  • Sequential run time: W = n²
  • Simple case: p = n

– A2A broadcast of vector elements: Θ(n)
– single row multiplication: Θ(n)
– Total time: Θ(n)
– processor-time product: Θ(n²) (cost-optimal)

  • General case: p < n

– A2A broadcast of vector elements:

  • ts log p + tw n (hypercube)
  • 2 ts √p + tw n (mesh)

– row multiplication: Θ(n²/p)
– total time:

  • n²/p + ts log p + tw n (hypercube)
  • n²/p + 2 ts √p + tw n (mesh)
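As a sketch, the run-time expressions above can be evaluated numerically; ts (message startup) and tw (per-word transfer time) are the usual symbolic machine parameters, and the function below is illustrative, not from the slides:

```python
from math import log2, sqrt

def tp_rowwise(n, p, ts, tw, topology="hypercube"):
    """Parallel run time of rowwise-striped matrix-vector multiply:
    local work n^2/p plus an all-to-all broadcast of the vector."""
    comp = n * n / p
    if topology == "hypercube":
        comm = ts * log2(p) + tw * n
    else:  # 2-D mesh
        comm = 2 * ts * sqrt(p) + tw * n
    return comp + comm
```

For p = n both variants reduce to Θ(n), matching the simple case above.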

SLIDE 3

Matrix-Vector Multiplication (checkerboard partitioning)

  • Simple case: p = n²

– one-to-one comm. + one-to-all broadcast + single-node accumulation per row + multiplication:
  = Θ(n) + Θ(n) + Θ(n) + Θ(1) = Θ(n) (mesh)
  = Θ(log n) + Θ(log n) + … = Θ(log n) (hypercube)
– processor-time product:

  • Θ(n³) (mesh)
  • Θ(n² log n) (hypercube)
  • General case: p < n²

– each processor stores an (n/√p × n/√p) block
– Run time on mesh with CT: Θ(n/√p) + Θ((n/√p) log p) + Θ((n/√p) log p) + Θ(n²/p)
– Cost-optimal as long as p = O(n²/log²n)

SLIDE 4

Matrix multiplication

  • Standard serial algorithm:
  • Complexity: W = n³

procedure MAT_MULT(A, B, C)
begin
  for i := 0 to n - 1 do
    for j := 0 to n - 1 do
    begin
      C[i, j] := 0
      for k := 0 to n - 1 do
        C[i, j] := C[i, j] + A[i, k] * B[k, j]
    end
end MAT_MULT
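A direct Python transcription of MAT_MULT, as a minimal sketch:

```python
def mat_mult(A, B):
    """Serial n^3 matrix multiply (the slide's MAT_MULT)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):  # n multiply-adds per output element
                C[i][j] += A[i][k] * B[k][j]
    return C
```

The three nested loops of length n give the stated W = n³ scalar operations.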

SLIDE 5

Matrix multiplication: a block matrix implementation

  • A parallelizable algorithm is derived from the serial one by replacing scalar operations with block matrix operations:

  • Number of scalar additions/multiplications: n³

– q³ matrix multiplications, each involving n/q × n/q elements and requiring (n/q)³ multiplications and additions

procedure BLOCK_MAT_MULT(A, B, C)
begin
  for i := 0 to q - 1 do
    for j := 0 to q - 1 do
    begin
      Initialize all elements of Ci,j to zero
      for k := 0 to q - 1 do
        Ci,j := Ci,j + Ai,k * Bk,j
    end
end BLOCK_MAT_MULT
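A sketch of BLOCK_MAT_MULT in Python (assuming q divides n): the outer three loops run over blocks exactly as in the serial algorithm, so the result matches the scalar version element for element.

```python
def block_mat_mult(A, B, q):
    """Multiply n x n matrices via q^3 products of (n/q x n/q) blocks."""
    n = len(A)
    b = n // q  # block size; q is assumed to divide n
    C = [[0.0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for k in range(q):
                # Block update C[i,j] := C[i,j] + A[i,k] * B[k,j]
                for ii in range(i * b, (i + 1) * b):
                    for jj in range(j * b, (j + 1) * b):
                        for kk in range(k * b, (k + 1) * b):
                            C[ii][jj] += A[ii][kk] * B[kk][jj]
    return C
```

With q = n the blocks are scalars and this degenerates into the plain MAT_MULT loop nest.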

SLIDE 6

Matrix multiplication: simple parallel implementation

  • The block matrix algorithm lends itself to implementation on a √p × √p mesh, with one block per processor:

– processor Pi,j initially holds Ai,j and Bi,j, and computes block Ci,j
– Pi,j needs all blocks Ai,k and Bk,j for 0 ≤ k < √p, thus an A2A broadcast is performed on each row, and one on each column
– computation requires √p multiplications of n/√p × n/√p matrices
– Assuming a hypercube topology:

  • Tp = n³/p + ts log p + 2 tw n²/√p

– cost:

  • n³ + ts p log p + 2 tw n² √p
  • cost-optimal for p = O(n²)

– A drawback of this algorithm is its excessive memory requirement: every processor holds 2√p blocks of size n²/p

SLIDE 7

Matrix Multiplication: efficient algorithms

  • The simple parallel algorithm we saw last time is not memory efficient

– every processor holds 2√p blocks of size n²/p -> total memory required is n²√p, which is √p times that of the sequential algorithm

  • More memory-efficient algorithms exist

– Cannon’s algorithm

  • rowwise (columnwise) rotation of submatrices instead of A2A broadcast

  • interleaving of communication and submatrix multiplication

– Fox’s algorithm

  • rowwise broadcasts of each Ai,j in turn, upward shifts of Bi,j
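A sequential Python simulation may clarify Cannon's rotation scheme; the data layout and helper names below are illustrative (a real implementation keeps one block per processor and replaces the list rotations with neighbor-to-neighbor messages):

```python
def cannon(A, B, q):
    """Simulate Cannon's algorithm on a q x q process grid.
    Ablk[i][j] / Bblk[i][j] are the n/q x n/q blocks held by process (i,j);
    list rotations stand in for the rowwise/columnwise block shifts."""
    n = len(A)
    s = n // q  # block size; q is assumed to divide n
    blk = lambda M, i, j: [row[j * s:(j + 1) * s] for row in M[i * s:(i + 1) * s]]
    Ablk = [[blk(A, i, j) for j in range(q)] for i in range(q)]
    Bblk = [[blk(B, i, j) for j in range(q)] for i in range(q)]
    Cblk = [[[[0.0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]
    # Initial alignment: shift row i of A left by i, column j of B up by j.
    Ablk = [Ablk[i][i:] + Ablk[i][:i] for i in range(q)]
    Bblk = [[Bblk[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):  # q compute-and-shift steps
        for i in range(q):
            for j in range(q):
                a, b_, c = Ablk[i][j], Bblk[i][j], Cblk[i][j]
                for ii in range(s):
                    for kk in range(s):
                        for jj in range(s):
                            c[ii][jj] += a[ii][kk] * b_[kk][jj]
        Ablk = [row[1:] + row[:1] for row in Ablk]  # rotate A blocks left
        Bblk = Bblk[1:] + Bblk[:1]                  # rotate B blocks up
    # Reassemble the full result matrix from the C blocks.
    C = [[0.0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for ii in range(s):
                for jj in range(s):
                    C[i * s + ii][j * s + jj] = Cblk[i][j][ii][jj]
    return C
```

Note that each processor holds exactly one block of A and one of B at any time, which is what makes the algorithm memory efficient compared with the A2A-broadcast version.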
SLIDE 8

Systems of linear equations

a0,0 x0   + a0,1 x1   + … + a0,n-1 xn-1   = b0
a1,0 x0   + a1,1 x1   + … + a1,n-1 xn-1   = b1
  ...
an-1,0 x0 + an-1,1 x1 + … + an-1,n-1 xn-1 = bn-1

is transformed into the upper triangular system

x0 + u0,1 x1 + … + u0,n-1 xn-1 = y0
     x1      + … + u1,n-1 xn-1 = y1
      ...
                          xn-1 = yn-1

Gaussian elimination

  • Gaussian elimination is the classical serial algorithm for solving systems of linear equations:

– First step: Ax = b is transformed into Ux = y, where U is upper triangular and U[i, i] = 1
– Second step: the system is solved for the x variables from x[n-1] to x[0] by back-substitution
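The two steps can be sketched in Python (no pivoting, mirroring the slides' description; a robust solver would add partial pivoting):

```python
def gaussian_solve(A, b):
    """Solve Ax = b: forward elimination to unit upper triangular Ux = y,
    then back-substitution from x[n-1] down to x[0] (no pivoting)."""
    n = len(A)
    U = [row[:] for row in A]  # work on copies
    y = b[:]
    for k in range(n):
        # Normalize row k so that U[k][k] = 1.
        pivot = U[k][k]
        for j in range(k, n):
            U[k][j] /= pivot
        y[k] /= pivot
        # Eliminate column k from the rows below.
        for i in range(k + 1, n):
            f = U[i][k]
            for j in range(k, n):
                U[i][j] -= f * U[k][j]
            y[i] -= f * y[k]
    # Back-substitution on the unit upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))
    return x
```

The elimination loop nest performs Θ(n³) work, matching the W = 2n³/3 count used later in the cost analysis.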

SLIDE 9

Steps of Gaussian elimination

SLIDE 10

Parallel Gaussian elimination: striped partitioning

  • Simple case: one row/processor

– kth computation step: 3(n-k-1)
– Total computation: 3n(n-1)/2
– kth comm. step: (ts + tw(n-k-1)) log n (on hypercube)
– Tot. comm.: (ts n + tw n(n-1)/2) log n
– Cost: Θ(n³ log n)

  • not cost-optimal (W = 2n³/3)
  • A possible optimization: pipelining of the n iterations

– row k+1 can start computation as soon as it gets data from row k
– this optimized version of the alg. is cost-optimal
SLIDE 11

Parallel Gaussian elimination: striped partitioning

  • General case: p < n

– block-striped partitioning

  • kth computation step: 3(n-k-1) n/p (on proc. with max load)
  • kth comm. step: (ts + tw(n-k-1)) log n (on hypercube)
  • => computation time dominates
  • total time: Θ(n³/p)
  • cost: Θ(n³)
  • not very efficient because some processors are idle

– cyclic-striped partitioning

  • better load balancing => less idle time
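The load-balancing difference can be quantified with a toy model (the mapping functions and the n, p values below are illustrative, not from the slides): at elimination step k only rows k+1 … n-1 perform updates, so the busiest processor's total update count measures how unevenly the work is spread.

```python
def max_load(n, p, mapping):
    """Busiest processor's total row-updates over all n elimination steps.
    At step k only rows k+1 .. n-1 are active; mapping(i) gives the
    processor owning row i."""
    load = [0] * p
    for k in range(n):
        for i in range(k + 1, n):
            load[mapping(i)] += 1
    return max(load)

n, p = 16, 4
block = lambda i: i // (n // p)   # block-striped: contiguous rows per proc
cyclic = lambda i: i % p          # cyclic-striped: rows dealt round-robin
```

For these values the 120 row-updates split so that block striping loads the last processor with 54 of them, while cyclic striping caps the maximum at 36 (the ideal balance would be 30), which is why cyclic striping leaves less idle time.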
SLIDE 12

Parallel Gaussian elimination: blocked-checkerboard partitioning

  • Simple case: one element/processor

– two communication steps required

  • rowwise broadcast
  • columnwise broadcast

– Not cost-optimal

  • Usual optimization: pipelining of communication and computation

– two communication steps => two opportunities for pipelining
– this optimized version of the alg. is cost-optimal

SLIDE 13

Parallel Gaussian elimination: cyclic-checkerboard partitioning

  • Like for striped partitioning, an additional optimization is achieved by using cyclic instead of block partitioning, due to improved load balancing among processors

SLIDE 14

Example of discretized physical domain

  • Physical system: temperature distribution on a sheet of metal held at temperatures U0 and U1 on two sides

– the system is discretized by superimposing a grid
– the temperature at each point is computed from the values at its neighbors
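A minimal Jacobi-iteration sketch of this discretization (grid size, boundary temperatures, and iteration count are illustrative): interior points repeatedly take the average of their four neighbors until the temperatures settle.

```python
def jacobi_step(grid):
    """One Jacobi sweep: each interior point becomes the average of its
    four neighbors; boundary values stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new

# Sheet held at U0 = 0 on the left side and U1 = 100 on the right side;
# top and bottom rows are fixed at the linear profile between them.
profile = [0.0, 25.0, 50.0, 75.0, 100.0]
grid = ([profile[:]] +
        [[0.0, 0.0, 0.0, 0.0, 100.0] for _ in range(3)] +
        [profile[:]])
for _ in range(500):
    grid = jacobi_step(grid)
# grid now approximates the linear steady-state temperature field
```

Because each point depends only on its neighbors, the grid can be partitioned among processors with only boundary rows/columns exchanged per sweep, which is what makes this discretization attractive for parallelization.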