SLIDE 10 The Algorithm of Fox
We again determine the product matrix according to $C_{i,j} = \sum_{k=1}^{\sqrt{p}} A_{i,k} \cdot B_{k,j}$, but now
◮ the processes are arranged in a $\sqrt{p} \times \sqrt{p}$ mesh.
◮ Process $(i, j)$ knows the $\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$ submatrices $A_{i,j}$ and $B_{i,j}$.
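We have $\sqrt{p}$ phases. In phase $k$ we want process $(i, j)$ to compute $A_{i,i+k-1} \cdot B_{i+k-1,j}$ (all indices wrap around modulo $\sqrt{p}$):
◮ process $(i, i+k-1)$ broadcasts $A_{i,i+k-1}$ to all processes in row $i$,
◮ process $(i, j)$ computes $A_{i,i+k-1} \cdot B_{i+k-1,j}$ and adds it to its partial sum for $C_{i,j}$,
◮ process $(i, j)$ receives $B_{i+k,j}$ from $(i+1, j)$ and sends $B_{i+k-1,j}$ to $(i-1, j)$.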
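The three steps of a phase can be sketched as a sequential simulation. This is only an illustration under stated assumptions, not a message-passing implementation: indices are 0-based (so phase k uses block column (i + k) mod q), q stands for √p, the broadcast and the upward shift of B are modelled by plain list operations, and all function names (`fox_multiply`, `split_blocks`, and so on) are hypothetical.

```python
def mat_mul(X, Y):
    """Plain triple-loop product of two matrices given as lists of rows."""
    return [[sum(X[r][t] * Y[t][c] for t in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]


def mat_add(X, Y):
    """Entrywise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split_blocks(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)] for i in range(q)]


def join_blocks(blocks):
    """Inverse of split_blocks: reassemble the full matrix."""
    rows = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            row = []
            for block in block_row:
                row.extend(block[r])
            rows.append(row)
    return rows


def fox_multiply(A, B, q):
    """Simulate Fox's algorithm on a q x q mesh (q plays the role of sqrt(p))."""
    n = len(A)
    if n % q != 0:
        raise ValueError("q must divide n")
    b = n // q
    A_blk = split_blocks(A, q)
    # B_cur[i][j] is the B block currently held by process (i, j).
    B_cur = split_blocks(B, q)
    C_blk = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for k in range(q):  # q phases
        for i in range(q):
            # Step 1: process (i, (i + k) mod q) broadcasts its A block in row i.
            a_bcast = A_blk[i][(i + k) % q]
            for j in range(q):
                # Step 2: process (i, j) multiplies and accumulates into C_ij.
                C_blk[i][j] = mat_add(C_blk[i][j], mat_mul(a_bcast, B_cur[i][j]))
        # Step 3: every process passes its B block to (i - 1, j) and
        # receives the block of (i + 1, j), with wrap-around.
        B_cur = [[B_cur[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return join_blocks(C_blk)
```

For a 4 x 4 input, calling `fox_multiply(A, B, 2)` simulates p = 4 processes holding 2 x 2 blocks each; the result can be compared against `mat_mul(A, B)`.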
Performance Analysis:
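◮ Per phase: computing time $O\big((\frac{n}{\sqrt{p}})^3\big)$ and communication time $O\big(\frac{n^2}{p} \cdot \log p\big)$.
◮ We have $\sqrt{p}$ phases: computation time $O\big(\frac{n^3}{p}\big)$, communication time $O\big(\frac{n^2}{\sqrt{p}} \cdot \log p\big)$. The compute/communicate ratio $\frac{n}{\sqrt{p} \log_2 p}$ increases.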
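The totals follow from multiplying the per-phase bounds by the number of phases. The sketch below evaluates only the leading terms with all constants dropped, so the numbers are unitless and indicate trends rather than predicted running times; the function name `fox_costs` is a hypothetical helper, and p is assumed to be a perfect square greater than 1.

```python
import math


def fox_costs(n, p):
    """Leading-order cost terms from the analysis above (constants dropped)."""
    q = math.isqrt(p)
    if p < 2 or q * q != p:
        raise ValueError("p must be a perfect square greater than 1")
    log_p = math.log2(p)

    compute_per_phase = (n / q) ** 3      # (n / sqrt(p))^3
    comm_per_phase = (n * n / p) * log_p  # (n^2 / p) * log p

    compute_total = q * compute_per_phase  # equals n^3 / p
    comm_total = q * comm_per_phase        # equals (n^2 / sqrt(p)) * log p
    ratio = compute_total / comm_total     # equals n / (sqrt(p) * log2 p)
    return compute_total, comm_total, ratio
```

Calling `fox_costs(n, p)` for growing n at fixed p shows the ratio growing linearly in n, while growing p at fixed n shrinks it.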
Parallel Linear Algebra The Matrix-Matrix Product 10 / 35