Marwa A. Al-Shandawely PDC/KTH Algorithm overview. Trivial - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Marwa A. Al-Shandawely PDC/KTH

slide-2
SLIDE 2

• Algorithm overview
• Trivial parallelization
• Problems
• Sequential optimization
• Proposed solutions
• Experimental results
• Conclusions and future work

slide-3
SLIDE 3

for i = 1 to n-1
    find pivotPos in column i
    if pivotPos ≠ i
        exchange rows(pivotPos, i)
    end if
    for j = i+1 to n
        A(i,j) = A(i,j) / A(i,i)
    end for j
    for j = i+1 to n+1
        for k = i+1 to n
            A(k,j) = A(k,j) - A(k,i) × A(i,j)
        end for k
    end for j
end for i

!$omp parallel do private(i,j)
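The elimination loop above can be sketched in runnable form. This is a minimal sequential Python sketch, not the author's Fortran code: the function name `gauss_solve` is hypothetical, the pivot row is normalized through the right-hand-side column (an assumption, so that column n+1 stays consistent), and the back substitution at the end does not appear on the slide.

```python
def gauss_solve(a, b):
    """Gaussian elimination with partial pivoting on the augmented matrix [A | b]."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # n rows, n+1 columns

    for i in range(n - 1):
        # Find the pivot position in column i (largest magnitude at or below row i).
        pivot_pos = max(range(i, n), key=lambda r: abs(m[r][i]))
        if pivot_pos != i:
            m[i], m[pivot_pos] = m[pivot_pos], m[i]

        # Normalize the pivot row (assumes a nonsingular matrix).
        pivot = m[i][i]
        for j in range(i + 1, n + 1):
            m[i][j] /= pivot

        # Eliminate column i from all rows below the pivot row.
        for k in range(i + 1, n):
            factor = m[k][i]
            for j in range(i + 1, n + 1):
                m[k][j] -= factor * m[i][j]

    # Back substitution (not shown on the slide). Rows 0..n-2 have an implicit
    # unit diagonal; the last row was never normalized.
    x = [0.0] * n
    x[n - 1] = m[n - 1][n] / m[n - 1][n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
    return x
```

The slide iterates over columns in the outer elimination loop, which suits Fortran's column-major storage; the sketch iterates over rows because Python lists of rows are row-major.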

slide-4
SLIDE 4

[Figure: plot against nThreads, with one series each for N=1000, N=2000, N=3000, N=4000, N=5000]

slide-5
SLIDE 5

• Poor data locality
• Pivoting is done by the master thread
• Overheads of creating and destroying threads at each iteration
• Sequential optimization

slide-6
SLIDE 6

• Replace division by the constant pivot with multiplication by its reciprocal
• Avoid loop-invariant accesses in the innermost loop
• Eliminate the check for whether the pivot changes position
• Make use of Fortran array notation

Before:
    do k = j+1, n
       A(k,j) = A(k,j) / A(j,j)
    end do

After:
    c = 1 / A(j,j)
    A(j+1:n,j) = A(j+1:n,j) * c
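The first two optimizations can be illustrated on a single column. This is a hedged Python sketch with hypothetical function names; the slice assignment only loosely mirrors Fortran array notation.

```python
def scale_by_division(col, j):
    """Baseline: re-reads the pivot and divides on every iteration."""
    for k in range(j + 1, len(col)):
        col[k] = col[k] / col[j]


def scale_by_reciprocal(col, j):
    """Optimized: one division, hoisted out of the loop, then multiplications."""
    c = 1.0 / col[j]                          # loop-invariant pivot reciprocal
    col[j + 1:] = [v * c for v in col[j + 1:]]  # whole-slice update


a = [4.0, 2.0, 6.0, -1.0]
b = list(a)
scale_by_division(a, 0)
scale_by_reciprocal(b, 0)
```

The two variants are mathematically equivalent, but multiplying by a rounded reciprocal can differ from a true division in the last bit, so comparisons between them are best done with a tolerance.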

slide-7
SLIDE 7

• Pivots array
• Locks array
• Pivot holder

  • Eliminate (i) on column (i+1)
  • Search (i+1)
  • Store pivot (i+1) position
  • Prepare column (i+1)
  • Free lock (i+1)
  • Eliminate (i) on rest of scope

[Diagram: pivots array, P1 P2 P3 P4, locks]
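The pivot-holder steps above can be sketched as a small pipelined solver. This is an illustration under stated assumptions, not the author's implementation: columns are assigned to threads cyclically, `threading.Event` objects stand in for the locks array, the matrix is assumed nonsingular (a zero pivot would raise in one thread and leave the others waiting), and all names are hypothetical. Python threads do not run this in parallel because of the interpreter lock, so the sketch shows only the synchronization pattern.

```python
import threading


def pipelined_solve(a_rows, b, n_threads=2):
    n = len(a_rows)
    # Column-major copy of A; column n holds the right-hand side.
    cols = [[a_rows[r][c] for r in range(n)] for c in range(n)] + [list(b)]
    pivots = [0] * n                                # pivots array
    ready = [threading.Event() for _ in range(n)]   # stands in for the locks array

    def prepare(j):
        # Search the pivot of column j, store its position, scale, then release.
        col = cols[j]
        p = max(range(j, n), key=lambda r: abs(col[r]))
        pivots[j] = p
        col[j], col[p] = col[p], col[j]
        c = 1.0 / col[j]                            # assumes a nonsingular matrix
        for k in range(j + 1, n):
            col[k] *= c
        ready[j].set()

    def eliminate(i, j):
        # Apply pivot i (row exchange, then update) to column j.
        piv, col, p = cols[i], cols[j], pivots[i]
        col[i], col[p] = col[p], col[i]
        u = col[i]
        for k in range(i + 1, n):
            col[k] -= piv[k] * u

    def worker(t):
        for i in range(n - 1):
            ready[i].wait()                         # pivot column i must be prepared
            mine = [j for j in range(i + 1, n + 1) if j % n_threads == t]
            if mine and mine[0] == i + 1:           # this thread is the pivot holder
                eliminate(i, i + 1)
                prepare(i + 1)                      # releases the waiting threads early
                mine = mine[1:]
            for j in mine:                          # rest of this thread's scope
                eliminate(i, j)

    prepare(0)
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    # Back substitution on the upper triangle stored in the columns.
    rhs, x = cols[n], [0.0] * n
    for i in reversed(range(n)):
        s = rhs[i] - sum(cols[j][i] * x[j] for j in range(i + 1, n))
        x[i] = s / cols[i][i]
    return x
```

Each thread writes only its own columns and reads a pivot column only after its event is set, so the pivot holder's early release is what lets the other threads start step i+1 while the holder is still finishing step i.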

slide-8
SLIDE 8

[Figure: plot against nThreads, with one series each for N=1000, N=2000, N=3000, N=4000, N=5000]

slide-9
SLIDE 9

• The original algorithm requires pivot columns to be prepared in order, while the whole matrix is accessed for each pivot column.
• For large input sizes, the cache is evicted many times in each iteration, so there is no reuse of data in the cache.
• False sharing on the pivots and locks arrays.
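The false-sharing point can be made concrete with a little index arithmetic. The sketch below assumes a 64-byte cache line, 4-byte entries, and an array that starts on a line boundary; none of these values come from the slides, and the helper names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size; the real value is hardware-dependent


def padded_stride(item_bytes, line_bytes=CACHE_LINE_BYTES):
    """Number of array slots to skip so each logical entry gets its own line."""
    return max(1, line_bytes // item_bytes)


def same_cache_line(i, j, item_bytes, stride=1, line_bytes=CACHE_LINE_BYTES):
    """True if logical entries i and j land on the same line (aligned array assumed)."""
    return (i * stride * item_bytes) // line_bytes == (j * stride * item_bytes) // line_bytes


# Adjacent 4-byte entries share a line, so writes by different threads collide.
packed = same_cache_line(0, 1, item_bytes=4)
# Spreading entries out by the padded stride puts them on different lines.
padded = same_cache_line(0, 1, item_bytes=4, stride=padded_stride(4))
```

Padding trades memory for fewer invalidations; whether it pays off for the pivots and locks arrays would have to be measured.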

slide-10
SLIDE 10

• Double elimination on pivot holders.

  • Knowledge of two pivots allows data reuse.

• Each column is an accumulation of eliminations using previous columns!

  • Make more pivots available at each step and eliminate each column using several pivots while it is in the cache.
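The idea of applying several pivots to a column while it is in the cache can be sketched sequentially. This is a minimal Python illustration, not the author's code: `blocked_solve` and its `block` parameter are hypothetical names, the matrix is assumed nonsingular, and threading is left out so that only the reordering of work is visible.

```python
def blocked_solve(a_rows, b, block=2):
    n = len(a_rows)
    # Column-major copy of A; column n holds the right-hand side.
    cols = [[a_rows[r][c] for r in range(n)] for c in range(n)] + [list(b)]
    pivots = [0] * n

    def prepare(j):
        # Search the pivot of column j, record it, and scale the multipliers.
        col = cols[j]
        p = max(range(j, n), key=lambda r: abs(col[r]))
        pivots[j] = p
        col[j], col[p] = col[p], col[j]
        c = 1.0 / col[j]                 # assumes a nonsingular matrix
        for k in range(j + 1, n):
            col[k] *= c

    def apply_pivot(i, col):
        # Row exchange for pivot i, then the elimination update on one column.
        piv, p = cols[i], pivots[i]
        col[i], col[p] = col[p], col[i]
        u = col[i]
        for k in range(i + 1, n):
            col[k] -= piv[k] * u

    for i0 in range(0, n, block):
        i1 = min(i0 + block, n)
        # Make a block of pivots available first.
        for j in range(i0, i1):
            for i in range(i0, j):
                apply_pivot(i, cols[j])
            prepare(j)
        # Visit every remaining column once and apply the whole block to it.
        for j in range(i1, n + 1):
            col = cols[j]
            for i in range(i0, i1):
                apply_pivot(i, col)

    # Back substitution on the upper triangle stored in the columns.
    rhs, x = cols[n], [0.0] * n
    for i in reversed(range(n)):
        s = rhs[i] - sum(cols[j][i] * x[j] for j in range(i + 1, n))
        x[i] = s / cols[i][i]
    return x
```

With `block=1` the work order matches the one-pivot-at-a-time scheme; larger blocks touch each remaining column once per block instead of once per pivot, which is where the data reuse comes from.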

slide-11
SLIDE 11

• Block of pivots
• Increase work per iteration
• Increase locality
• Fewer locks
• Load balancing?!

[Diagram: pivots array, P1 P2 P3 P4, locks]

slide-13
SLIDE 13

[Figure: two panels, N=2000 and N=5000; series: Original, C=1, C=2, C=3, C=4, C=5]

slide-14
SLIDE 14

[Figure: two panels, N=2000 and N=5000; series: Original, double elimination, C=25 with double elimination]

slide-15
SLIDE 15

• Scalable performance on multicores is highly dependent on application implementation, data layout, and access patterns.
• Cache and memory access optimization techniques are vital for performance, despite the loss of readability.
• Future work:

  • Adaptive blocking scheme that changes the block size as a function of the matrix size, cache settings, and number of cores.

slide-16
SLIDE 16