March 5-9, 2018 | Santiago de Compostela, Spain #cesgahack18
Introduce parallelism
OpenMP & OpenACC
PARALLWARE SW DEVELOPMENT CYCLE

1. Understanding the sequential code and profiling
   a. Analyze your code
   b. Focus on profiling and on where to parallelize your code correctly
2. Identifying opportunities for parallelization
   c. Figure out where the code is suitable for parallelization
   d. Often the hardest step!
3. Introduce parallelism
   e. Decide how to implement the parallelism discovered in your code
4. Test the correctness of your parallel implementation
   f. Compile & run the parallel versions of your code to check that the numerical result is correct
5. Test the performance of your parallel implementation
   g. Run the parallel versions of your code to measure the performance increase for real-world workloads
6. Performance tuning
   h. Repeat steps 1-5 until you meet your performance requirements
What is the value of variable "x" at the end of the parallel region?

x = 0;
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        x = x + 1;
    }
}
Scenario 1: Thread0 ("x=x+1") finishes before Thread1 begins its computation ("x=x+1"), and the value is x=2:

Thread0: r1 = 0+1
Thread0: x = r1
Thread1: r2 = 1+1
Thread1: x = r2

Correct result!
Scenario 2: Thread0 ("x=x+1") does not finish before Thread1 begins ("x=x+1"), and the value is x=1:

Thread0: r1 = 0+1
Thread1: r2 = 0+1
Thread0: x = r1
Thread1: x = r2

Wrong result!
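A minimal way to make this loop race-free (a sketch, not shown on the slides) is to let OpenMP privatize the counter and combine the per-thread results with a reduction clause:

/* Sketch: the same counter loop, made race-free with a reduction.
   Each thread accumulates into a private copy of x; OpenMP combines
   the copies at the end of the region. */
x = 0;
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < N; i++) {
    x = x + 1;   /* updates a thread-private copy; no data race */
}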
w = 1.0 / N;

/* Compute the sum */
sum = 0.0;
for (i = 0; i < N; i++) {
    x = (i - 0.5) / N;
    sum = sum + (4.0 / (1.0 + x*x));
}

/* Compute value of PI */
pi = sum * w;
DEFINITION OF THE PARALLEL REGION: Identify the code fragment that can be executed concurrently. Typically for loops.
WORK SHARING: Map the computational workload to threads. Typically loop iterations mapped in a block or cyclic manner.
PRIVATIZATION: Variables that store thread-local temporary results. Typically loop temporaries whose value can be discarded at the end of the parallel region.
SHARING: Read-only variables that save input data. Typically code parameters.
REDUCTION: Identify computations with associative, commutative operators that require additional synchronization. Typically sum, product, max, min.
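To make the five concepts concrete, here is a small annotated sketch (not from the slides; the array sum is an illustrative example) with each concept marked:

/* Sketch: the five concepts marked on a simple array sum (illustrative). */
double sum_of(const double *a, int n)   /* a, n: SHARING (read-only input) */
{
    double total = 0.0;                 /* REDUCTION variable */
    double tmp;                         /* PRIVATIZATION: thread-local temporary */
    int i;

    #pragma omp parallel shared(a, n) private(tmp) reduction(+:total)
    {                                   /* DEFINITION OF THE PARALLEL REGION */
        #pragma omp for schedule(static)   /* WORK SHARING: block mapping of iterations */
        for (i = 0; i < n; i++) {
            tmp = a[i];
            total = total + tmp;
        }
    }
    return total;
}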
○ The master process is split into parallel threads.
○ Information is shared via shared memory.

Host-driven execution: Stream Programming Model, with data transfers and code transfers between host and device.
GPU concept        OpenACC    OpenMP
#Blocks            gangs      teams distribute
#Threads/Block     workers    parallel
#WarpThreads       vector     simd
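A sketch (not from the slides) of how one loop maps onto these three levels in each model; it assumes OpenACC and OpenMP 4.5+ device offloading, and saxpy is just an illustrative kernel:

/* Sketch: three levels of parallelism in each model (illustrative). */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    /* OpenACC: gang -> blocks, worker -> threads per block, vector -> warp lanes */
    #pragma acc parallel loop gang worker vector copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void saxpy_omp(int n, float a, const float *x, float *y)
{
    /* OpenMP: teams distribute -> blocks, parallel for -> threads, simd -> lanes */
    #pragma omp target teams distribute parallel for simd map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}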
Computing π by numerical integration of 4/(1+x²) over the interval [0,1]:
■ The sum approximates the area under the curve, becoming exact as N→∞.
■ The interval is divided into N subintervals of length 1/N.
■ For each subinterval, the area of the rectangle whose height is the value of 4/(1+x²) at its midpoint is computed.
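In formula form (reconstructed from the description above, matching the code below):

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \frac{1}{N}\sum_{i=0}^{N-1} \frac{4}{1+x_i^2},
\]

where \(x_i\) is the midpoint of the \(i\)-th subinterval; the code below uses \(x_i = (i - 0.5)/N\).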
sum = 0.0;
for (i = 0; i < N; i++) {
    sum = sum + (4.0 / (1.0 + ((i - 0.5)/N) * ((i - 0.5)/N)));
}
pi = sum / N;
Stage 1:
■ Broadcast N, P
Stage 2:
■ Distribute the loop iterations (Np = N/P iterations per processor)
■ Compute partial sums Sp at each processor
Stage 3:
■ Gather all partial sums
■ Compute the global sum S = S0 + … + S(P-1)
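The three stages correspond to a message-passing implementation. A minimal MPI sketch (MPI is not named on the slide, but the broadcast/gather pattern implies it; MPI_Init is assumed to be called by the caller):

/* Sketch: the three stages in MPI (illustrative). P = number of
   processes, rank = this process's id. */
#include <mpi.h>

double pi_mpi(int N)
{
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Stage 1: broadcast N (P is known from the communicator) */
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Stage 2: each process computes a partial sum Sp over a cyclic
       slice of the N iterations (Np = N/P iterations each) */
    double Sp = 0.0;
    for (int i = rank; i < N; i += P) {
        double x = (i - 0.5) / N;
        Sp = Sp + 4.0 / (1.0 + x * x);
    }

    /* Stage 3: gather and combine the partial sums into the global sum S */
    double S = 0.0;
    MPI_Reduce(&Sp, &S, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    return S / N;   /* valid on rank 0 */
}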
Parallel Programming Framework
Version 1: parallel region with explicit data scoping and worksharing

sum = 0.0;
#pragma omp parallel shared(N) private(i) reduction(+:sum)
{
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) {
        sum = sum + (4.0 / (1.0 + ((i - 0.5)/N) * ((i - 0.5)/N)));
    }
}
pi = sum / N;

Version 2: combined parallel for with a reduction clause

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum = sum + (4.0 / (1.0 + ((i - 0.5)/N) * ((i - 0.5)/N)));
}
pi = sum / N;

Version 3: manual reduction with a private partial sum and one atomic update per thread

sum = 0.0;
#pragma omp parallel shared(N) private(i) private(sum_aux)
{
    sum_aux = 0.0;
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) {
        sum_aux = sum_aux + (4.0 / (1.0 + ((i - 0.5)/N) * ((i - 0.5)/N)));
    }
    #pragma omp atomic
    sum = sum + sum_aux;
}
pi = sum / N;

Version 4: atomic update inside the loop (correct, but serializes every addition)

sum = 0.0;
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp atomic
    sum = sum + (4.0 / (1.0 + ((i - 0.5)/N) * ((i - 0.5)/N)));
}
pi = sum / N;
Concepts applied in each implementation:
■ Definition of parallel region; Privatization; Work sharing; Reduction
■ Message passing: Definition of parallel region; Data scoping PRIVATE (variable declarations are process-local); Data scoping REDUCTION, with implicit message passing to compute the approximation
■ OpenMP: Definition of parallel region; Data scoping (private, reduction) with implicit synchronization; Worksharing
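For comparison with the OpenMP versions above, a sketch of the same reduction in OpenACC (not shown on the slides):

/* Sketch: OpenACC version of the same reduction (illustrative). */
sum = 0.0;
#pragma acc parallel loop reduction(+:sum)
for (i = 0; i < N; i++) {
    sum = sum + (4.0 / (1.0 + ((i - 0.5)/N) * ((i - 0.5)/N)));
}
pi = sum / N;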
Privatization:
○ OpenMP: private, firstprivate, lastprivate; thread-local declarations inside the parallel region (C/C++, not Fortran)
○ OpenACC: private, firstprivate, lastprivate, create, copyin

Sharing:
○ OpenMP: shared
○ OpenACC: copyin

Reduction:
○ OpenMP: shared foralls; reduction/atomic/critical reductions
○ OpenACC: copy, copyout; reduction/atomic reductions

Work sharing (scheduling):
○ OpenMP: schedule(static), schedule(static,1), schedule(dynamic)
○ OpenACC: hardware controlled
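A short sketch of the three OpenMP scheduling clauses named above, i.e. block, cyclic, and dynamic mapping of iterations; work(i), i, and N are hypothetical placeholders:

/* Sketch: the three OpenMP schedules named above (illustrative). */
#pragma omp parallel for schedule(static)     /* block: contiguous chunk per thread */
for (i = 0; i < N; i++) work(i);

#pragma omp parallel for schedule(static, 1)  /* cyclic: iterations dealt round-robin */
for (i = 0; i < N; i++) work(i);

#pragma omp parallel for schedule(dynamic)    /* dynamic: threads grab chunks at run time */
for (i = 0; i < N; i++) work(i);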