Identifying opportunities for parallelization in the hotspots of your code
PARALLWARE SW DEVELOPMENT CYCLE

1. Understanding the sequential code and profiling
   a. Analyze your code
   b. Focus on profiling to find where to parallelize your code correctly
2. Identifying opportunities for parallelization
   c. Figure out where the code is suitable for parallelization
   d. Often the hardest step!
3. Introduce parallelism
   e. Decide how to implement the parallelism discovered in your code
4. Test the correctness of your parallel implementation
   f. Compile & run the parallel versions of your code to check that the numerical result is correct
5. Test the performance of your parallel implementation
   g. Run the parallel versions of your code to measure the performance increase for real-world workloads
6. Performance tuning
   h. Repeat steps 1-5 until you meet your performance requirements.
Understanding the sequential code and profiling → Identifying opportunities for parallelization
Why are dependences difficult to use in practice?
- K. Asanovic et al. 2009. A view of the parallel computing landscape. Commun. ACM 52, 10 (October 2009), 56-67. DOI: https://doi.org/10.1145/1562764.1562783
https://www.exascaleproject.org/event/bssw/
01 void atmux(double* restrict y, … , int n)
08 {
09   for (int t = 0; t < n; t++)
10     y[t] = 0;
11
12   for (int i = 0; i < n; i++) {
13     for (int k = row_ptr[i]; k < row_ptr[i+1]; k++) {
14       y[col_ind[k]] += x[i] * val[k];
15     }
16   }
17 }
Dependence kinds the compiler must consider:
- FLOW dependences
- OUTPUT dependences
- ANTI dependences
$ icc atmux.c -std=c99 -c -O3 -xAVX -Wall -vec-report3 -opt-report3 -restrict -parallel -openmp -guide
icc (ICC) 13.1.1 20130313
...
HPO THREADIZER REPORT (atmux) LOG OPENED ON Fri Sep 25 18:04:15 2015
HPO Threadizer Report (atmux)
atmux.c(9:2-9:2):PAR:atmux: loop was not parallelized: existence of parallel dependence
atmux.c(10:3-10:3):PAR:atmux: potential ANTI dependence on y. potential FLOW dependence on y.
atmux.c(9:2-9:2):PAR:atmux: LOOP WAS AUTO-PARALLELIZED
atmux.c(12:2-12:2):PAR:atmux: loop was not parallelized: existence of parallel dependence
atmux.c(13:3-13:3):PAR:atmux: loop was not parallelized: existence of parallel dependence
...
[Missing figure: the tower of "code" illustrating the "implementation gap"]
Source code:
- Outputs: xi, yi, zi
- Temporaries: dxc, dyc, dzc, m, f
- Read-only: xx1, yy1, zz1, mass1, fsrrmax2, ...
Making the most of your opportunities to parallelize
- The Parallware analysis will help you identify regions that are opportunities for parallelization:
#pragma omp parallel for … \
    shared(y)
for (i = 0; i < n; i++) {
  y[i] = 0;
  for (k = ia[i]; k < ia[i+1]; k++) {
    y[i] = y[i] + a[k] * x[ja[k]];
  }
}
Parallel forall in the outer loop; scalar reduction in the inner loop, accumulating directly into the shared array element y[i]
#pragma omp parallel for … \
    private(t)
for (i = 0; i < n; i++) {
  t = 0;
  for (k = ia[i]; k < ia[i+1]; k++) {
    t = t + a[k] * x[ja[k]];
  }
  y[i] = t;
}
Parallel forall in the outer loop; scalar reduction in the inner loop, accumulating into the private scalar t
for (h = 0; h < Adim; h++) {
  hist[h] = 0;
}
for (h = 0; h < fDim; h++) {
  hist[f[h]] = hist[f[h]] + 1;
}
for (h = 1; h < Adim; h++) {
  hist[h] = hist[h] + hist[h-1];
}
Parallel forall (first loop), parallel sparse reduction (second loop), parallel recurrence (third loop). Recurrences are in general not parallelizable, but in many situations they can be parallelized at the cost of significant synchronization overhead.