SLIDE 1

March 5-9, 2018 | Santiago de Compostela Spain #cesgahack18

Introduce parallelism

OpenMP & OpenACC

SLIDE 2


PARALLWARE SW DEVELOPMENT CYCLE

1. Understanding the sequential code and profiling
   a. Analyze your code
   b. Focus on profiling and on where to parallelize your code correctly (a profiling note follows this list)
2. Identifying opportunities for parallelization
   c. Figuring out where the code is suitable for parallelization
   d. Often the hardest step!
3. Introduce parallelism
   e. Decide how to implement the parallelism discovered in your code
4. Test the correctness of your parallel implementation
   f. Compile & run the parallel versions of your code to check that the numerical result is correct
5. Test the performance of your parallel implementation
   g. Run the parallel versions of your code to measure the performance increase for real-world workloads
6. Performance tuning
   h. Repeat steps 1-5 until you meet your performance requirements...
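A hedged example of steps (a)-(b), not from the slides: with the GNU toolchain you can compile with gcc -pg -O2 app.c -o app, run ./app once to generate gmon.out, and inspect the hotspots with gprof ./app gmon.out; the routines at the top of the flat profile are the loops worth parallelizing in the following steps.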

SLIDE 3

Race Conditions


  • Race conditions are a programmer's nightmare.
  • They make the result of your parallel code unpredictable.
  • What are "race conditions", and how can we handle them?

What is the value of variable "x" at the end of the parallel region?

x = 0;
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < N; i++) {
    x = x + 1;
  }
}

SLIDE 4

Race Conditions

What is the value of variable “x” at the end of the parallel region?

x = 0;
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < N; i++) {
    x = x + 1;
  }
}

Scenario 1: Thread0 ("x = x + 1") finishes before Thread1 begins its computation ("x = x + 1"), and the value is x = 2.

  Thread0: r1 = 0 + 1
  Thread0: x = r1
  Thread1: r2 = 1 + 1
  Thread1: x = r2

Correct result!

SLIDE 5

Race Conditions

What is the value of variable “x” at the end of the parallel region?

x = 0;
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < N; i++) {
    x = x + 1;
  }
}

Scenario 2: Thread0 ("x = x + 1") does not finish before Thread1 begins ("x = x + 1"), and the value is x = 1.

  Thread0: r1 = 0 + 1
  Thread1: r2 = 0 + 1
  Thread0: x = r1
  Thread1: x = r2

Wrong result!
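A minimal sketch of one way to remove this race, using the OpenMP reduction clause discussed later in the deck (a toy program, not from the slides; compile with OpenMP enabled, e.g. -fopenmp):

#include <stdio.h>

#define N 1000

int main(void) {
    int x = 0;

    /* reduction(+:x) gives every thread a private copy of x (initialized to 0)
       and adds the copies into the shared x once, after the loop, so the
       read-modify-write of x never happens concurrently. */
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < N; i++) {
        x = x + 1;
    }

    printf("x = %d\n", x);   /* always prints 1000 */
    return 0;
}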

SLIDE 6

A Code-Oriented Approach

w = 1.0 / N;

/* Compute the sum */
sum = 0.0;
for (i = 0; i < N; i++) {
    x = (i - 0.5) / N;
    sum = sum + 4.0 / (1.0 + x * x);
}

/* Compute the value of PI */
pi = sum * w;

  • DEFINITION OF THE PARALLEL REGION: identify the code fragment that can be executed concurrently. Typically for loops.
  • SHARING: read-only variables that save input data. Typically code parameters.
  • PRIVATIZATION: variables that store thread-local temporary results. Typically loop temporaries whose value can be discarded at the end of the parallel region.
  • WORK SHARING: map the computational workload to threads. Typically loop iterations mapped in a block or cyclic manner.
  • REDUCTION: identify computations with associative, commutative operators that require additional synchronization. Typically sum, product, max, min.

The sketch after this list maps each concept onto an OpenMP clause for the loop above.
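A minimal sketch, assuming the loop above and that N is an input variable (names follow the slide); the complete OpenMP variants appear on a later slide:

/* Sketch: the five concepts applied to the loop above.
     - "#pragma omp parallel for" -> definition of the parallel region + work sharing
     - shared(N)                  -> sharing: read-only input parameter
     - private(x)                 -> privatization: per-iteration temporary
     - reduction(+:sum)           -> reduction: associative '+' on sum           */
double w = 1.0 / N, sum = 0.0, x, pi;
int i;

#pragma omp parallel for shared(N) private(x) reduction(+:sum)
for (i = 0; i < N; i++) {
    x = (i - 0.5) / N;
    sum = sum + 4.0 / (1.0 + x * x);
}
pi = sum * w;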

SLIDE 7

OpenMP Execution Model

  • Exploit parallelism across several hardware processors.
  • Exploit parallelism within one software process.

  ○ The master process is split into parallel threads.
  ○ Information is shared via shared memory.

SLIDE 8

OpenMP Execution Model

Host-driven execution:

  • The HPC application is split into parallelizable phases.
  • The master thread starts and finishes execution.
  • Parallelism is exploited within each phase (see the fork-join sketch below).
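A minimal fork-join sketch of this execution model (a hypothetical toy program, not from the slides; compile with OpenMP enabled, e.g. -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("master thread: sequential phase\n");

    #pragma omp parallel            /* fork: a team of threads executes this phase */
    {
        printf("thread %d of %d works on the parallel phase\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                               /* join: only the master thread continues */

    printf("master thread: sequential phase again\n");
    return 0;
}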
SLIDE 9

GPU Execution Model

Stream Programming Model

[Diagram: data transfers and code transfers between the host and the GPU]
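A minimal sketch (assumed example, not from the slides) of the two kinds of transfers, expressed with OpenACC data and compute directives; n, x and y are assumed to be declared and initialized on the host:

#pragma acc data copyin(x[0:n]) copyout(y[0:n])   /* data transfers: host -> GPU, GPU -> host */
{
    #pragma acc parallel loop                     /* code transfer: the loop body is offloaded */
    for (int i = 0; i < n; i++) {
        y[i] = 2.0 * x[i];
    }
}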

SLIDE 10

GPU Execution Model

  Granularity      GPU hardware       OpenACC    OpenMP
  Coarse-grain     #Blocks            gang       teams distribute
  Fine-grain       #Threads/Block     worker     parallel
  Finest-grain     #WarpThreads       vector     simd
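A minimal sketch (assumed example, not from the slides) making the three levels explicit in each model for the same loop; a, x, y and n are assumed to be declared and initialized on the host:

/* OpenACC: gang (coarse), worker (fine), vector (finest) */
#pragma acc parallel loop gang worker vector copyin(x[0:n]) copy(y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

/* OpenMP offload: teams distribute (coarse), parallel for (fine), simd (finest) */
#pragma omp target teams distribute parallel for simd map(to: x[0:n]) map(tofrom: y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}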

SLIDE 11

Case study: Algorithm π

  • Approximation of the value of π by integrating 4/(1+x²) over the interval [0,1].
  • The sum of the areas of the N rectangles approximates the area under the curve.
  • The approximation of π becomes more accurate as N→∞.
  • Divide the interval into N subintervals of length 1/N:
    ■ For each subinterval, compute the area of the rectangle whose height is the value of 4/(1+x²) at its midpoint (written out in the formula below).
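Written out, the approximation described in the bullets above, using the slide's midpoint (i - 0.5)/N:

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + x_i^2},
\qquad x_i = \frac{i - 0.5}{N}.
\]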

sum = 0.0;
for (i = 0; i < N; i++) {
    sum = sum + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
}
pi = sum / N;

SLIDE 12

Parallelization of Algorithm π

Stage 1:
  ■ Broadcast N, P
Stage 2:
  ■ Distribute loop iterations (Np = N/P iterations per processor)
  ■ Compute partial sums Sp at each processor
Stage 3:
  ■ Gather all partial sums
  ■ Compute the global sum S = S0 + … + S(P-1)

Parallel Programming Framework

SLIDE 13

OpenMP

Version 1: parallel region with a worksharing loop and a reduction clause

sum = 0.0;
#pragma omp parallel shared(N) private(i) reduction(+:sum)
{
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) {
        sum = sum + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
    }
}
pi = sum / N;

Version 2: combined parallel for with a reduction clause

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum = sum + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
}
pi = sum / N;

Version 3: private partial sums combined with one atomic update per thread

sum = 0.0;
#pragma omp parallel shared(N) private(i) private(sum_aux)
{
    sum_aux = 0.0;
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) {
        sum_aux = sum_aux + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
    }
    #pragma omp atomic
    sum = sum + sum_aux;
}
pi = sum / N;

Version 4: atomic update inside the loop

sum = 0.0;
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp atomic
    sum = sum + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
}
pi = sum / N;
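A hedged usage note (compiler invocation assumed, not from the slides): all four variants build with any OpenMP-capable compiler, for example gcc -fopenmp pi.c -o pi, and the thread count is chosen at run time with OMP_NUM_THREADS=8 ./pi. The first three variants synchronize on sum only once per thread (via the reduction clause or a single atomic), while the fourth performs an atomic update on every iteration and is therefore expected to be much slower.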

SLIDE 14

OpenACC

Concepts annotated on the OpenACC version: definition of the parallel region, privatization, work sharing, reduction (a sketch follows below).
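The OpenACC code itself is not present in the extracted text; a minimal sketch of what such a version typically looks like (assumed, not verbatim from the slide; i, x, sum, pi and N as on the earlier slides):

sum = 0.0;
#pragma acc parallel loop private(x) reduction(+:sum)
for (i = 0; i < N; i++) {
    x = (i - 0.5) / N;
    sum = sum + 4.0 / (1.0 + x * x);
}
pi = sum / N;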

SLIDE 15

MPI

  • Definition of the parallel region.
  • Data scoping REDUCTION: implicit message passing to compute the approximation of the number π.
  • Data scoping PRIVATE: variable declarations are process-local.

A sketch of an MPI version follows below.
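The MPI code itself is not present in the extracted text; a minimal sketch (assumed, not verbatim from the slide) following the three stages of the earlier parallelization slide:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, P, i, N = 1000000;
    double x, local_sum = 0.0, sum, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);    /* P is given by the communicator size */

    /* Stage 1: broadcast N to all processes */
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Stage 2: each process computes a partial sum over a cyclic slice of iterations */
    for (i = rank; i < N; i += P) {
        x = (i - 0.5) / N;
        local_sum += 4.0 / (1.0 + x * x);
    }

    /* Stage 3: gather the partial sums into a global sum on rank 0 */
    MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = sum / N;
        printf("pi ~= %f\n", pi);
    }

    MPI_Finalize();
    return 0;
}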

SLIDE 16

CUDA

  • Definition of the parallel region.
  • Data scoping (private, reduction) with implicit synchronization.
  • Work sharing.

SLIDE 17

Data Scoping


  • Temporary variables
    ○ OpenMP: private, firstprivate, lastprivate; thread-local declaration inside the parallel region (C/C++, not Fortran)
    ○ OpenACC: private, firstprivate, lastprivate, create, copyin
  • Read-only variables
    ○ OpenMP: shared
    ○ OpenACC: copyin
  • Output variables
    ○ OpenMP: shared (foralls), reduction/atomic/critical (reductions)
    ○ OpenACC: copy, copyout, reduction/atomic (reductions)
  • Work-sharing
    ○ OpenMP: schedule(static), schedule(static,1), schedule(dynamic)
    ○ OpenACC: hardware controlled

A combined example follows below.
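A minimal sketch (assumed example, not from the slides) of the same reduction loop with the data scoping spelled out in both models; i, n, a, x[] and tmp are assumed to be declared, and norm initialized to 0.0:

/* OpenMP: shared read-only input, private temporary, reduction output, static schedule */
#pragma omp parallel for shared(x, n, a) private(tmp) reduction(+:norm) schedule(static)
for (i = 0; i < n; i++) {
    tmp = a * x[i];
    norm = norm + tmp * tmp;
}

/* OpenACC: copyin for the read-only input, private temporary, reduction output;
   the mapping of iterations to threads is left to the hardware/runtime */
#pragma acc parallel loop copyin(x[0:n]) private(tmp) reduction(+:norm)
for (i = 0; i < n; i++) {
    tmp = a * x[i];
    norm = norm + tmp * tmp;
}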