SLIDE 1

March 5-9, 2018 | Santiago de Compostela Spain #cesgahack18

Introduce parallelism

OpenMP & OpenACC

SLIDE 2


PARALLWARE SW DEVELOPMENT CYCLE

1. Understanding the sequential code and profiling
   a. Analyze your code
   b. Focus on profiling and on where to parallelize your code correctly (a profiling note follows this list)
2. Identifying opportunities for parallelization
   c. Figuring out where the code is suitable for parallelization
   d. Often the hardest step!
3. Introduce parallelism
   e. Decide how to implement the parallelism discovered in your code
4. Test the correctness of your parallel implementation
   f. Compile & run the parallel versions of your code to check that the numerical result is correct
5. Test the performance of your parallel implementation
   g. Run the parallel versions of your code to measure the performance increase for real-world workloads
6. Performance tuning
   h. Repeat steps 1-5 until you meet your performance requirements...
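A hedged example of steps (a)-(b), not from the slides: with the GNU toolchain you can compile with gcc -pg -O2 app.c -o app, run ./app once to generate gmon.out, and inspect the hotspots with gprof ./app gmon.out; the routines at the top of the flat profile are the loops worth parallelizing in the following steps.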

SLIDE 3

Race Conditions


  • Race conditions are a programmer's nightmare.
  • They make the result of your parallel code unpredictable.
  • What are "race conditions", and how can we handle them?

What is the value of variable "x" at the end of the parallel region?

x = 0;
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < N; i++) {
    x = x + 1;
  }
}

SLIDE 4

Race Conditions

What is the value of variable “x” at the end of the parallel region?

x = 0;
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < N; i++) {
    x = x + 1;
  }
}

Scenario 1: Thread0 ("x = x + 1") finishes before Thread1 begins its computation ("x = x + 1"), and the value is x = 2.

  Thread0: r1 = 0 + 1
  Thread0: x = r1
  Thread1: r2 = 1 + 1
  Thread1: x = r2

Correct result!

SLIDE 5

Race Conditions

What is the value of variable “x” at the end of the parallel region?

x = 0;
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < N; i++) {
    x = x + 1;
  }
}

Scenario 2: Thread0 ("x = x + 1") does not finish before Thread1 begins ("x = x + 1"), and the value is x = 1.

  Thread0: r1 = 0 + 1
  Thread1: r2 = 0 + 1
  Thread0: x = r1
  Thread1: x = r2

Wrong result!
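A minimal sketch of one way to remove this race, using the OpenMP reduction clause discussed later in the deck (a toy program, not from the slides; compile with OpenMP enabled, e.g. -fopenmp):

#include <stdio.h>

#define N 1000

int main(void) {
    int x = 0;

    /* reduction(+:x) gives every thread a private copy of x (initialized to 0)
       and adds the copies into the shared x once, after the loop, so the
       read-modify-write of x never happens concurrently. */
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < N; i++) {
        x = x + 1;
    }

    printf("x = %d\n", x);   /* always prints 1000 */
    return 0;
}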

SLIDE 6

A Code-Oriented Approach

w = 1.0 / N;

/* Compute the sum */
sum = 0.0;
for (i = 0; i < N; i++) {
    x = (i - 0.5) / N;
    sum = sum + 4.0 / (1.0 + x * x);
}

/* Compute the value of PI */
pi = sum * w;

  • DEFINITION OF THE PARALLEL REGION: identify the code fragment that can be executed concurrently. Typically for loops.
  • SHARING: read-only variables that save input data. Typically code parameters.
  • PRIVATIZATION: variables that store thread-local temporary results. Typically loop temporaries whose value can be discarded at the end of the parallel region.
  • WORK SHARING: map the computational workload to threads. Typically loop iterations mapped in a block or cyclic manner.
  • REDUCTION: identify computations with associative, commutative operators that require additional synchronization. Typically sum, product, max, min.

The sketch after this list maps each concept onto an OpenMP clause for the loop above.
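A minimal sketch, assuming the loop above and that N is an input variable (names follow the slide); the complete OpenMP variants appear on a later slide:

/* Sketch: the five concepts applied to the loop above.
     - "#pragma omp parallel for" -> definition of the parallel region + work sharing
     - shared(N)                  -> sharing: read-only input parameter
     - private(x)                 -> privatization: per-iteration temporary
     - reduction(+:sum)           -> reduction: associative '+' on sum           */
double w = 1.0 / N, sum = 0.0, x, pi;
int i;

#pragma omp parallel for shared(N) private(x) reduction(+:sum)
for (i = 0; i < N; i++) {
    x = (i - 0.5) / N;
    sum = sum + 4.0 / (1.0 + x * x);
}
pi = sum * w;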

SLIDE 7

OpenMP Execution Model

  • Exploit parallelism across several hardware processors.
  • Exploit parallelism within one software process.

  ○ The master process is split into parallel threads.
  ○ Information is shared via shared memory.

SLIDE 8

OpenMP Execution Model

Host-driven execution:

  • The HPC application is split into parallelizable phases.
  • The master thread starts and finishes execution.
  • Parallelism is exploited within each phase (see the fork-join sketch below).
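A minimal fork-join sketch of this execution model (a hypothetical toy program, not from the slides; compile with OpenMP enabled, e.g. -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("master thread: sequential phase\n");

    #pragma omp parallel            /* fork: a team of threads executes this phase */
    {
        printf("thread %d of %d works on the parallel phase\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                               /* join: only the master thread continues */

    printf("master thread: sequential phase again\n");
    return 0;
}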
SLIDE 9

GPU Execution Model

Stream Programming Model

[Diagram: data transfers and code transfers between the host and the GPU]
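A minimal sketch (assumed example, not from the slides) of the two kinds of transfers, expressed with OpenACC data and compute directives; n, x and y are assumed to be declared and initialized on the host:

#pragma acc data copyin(x[0:n]) copyout(y[0:n])   /* data transfers: host -> GPU, GPU -> host */
{
    #pragma acc parallel loop                     /* code transfer: the loop body is offloaded */
    for (int i = 0; i < n; i++) {
        y[i] = 2.0 * x[i];
    }
}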

SLIDE 10

GPU Execution Model

  Granularity      GPU hardware       OpenACC    OpenMP
  Coarse-grain     #Blocks            gang       teams distribute
  Fine-grain       #Threads/Block     worker     parallel
  Finest-grain     #WarpThreads       vector     simd
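A minimal sketch (assumed example, not from the slides) making the three levels explicit in each model for the same loop; a, x, y and n are assumed to be declared and initialized on the host:

/* OpenACC: gang (coarse), worker (fine), vector (finest) */
#pragma acc parallel loop gang worker vector copyin(x[0:n]) copy(y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

/* OpenMP offload: teams distribute (coarse), parallel for (fine), simd (finest) */
#pragma omp target teams distribute parallel for simd map(to: x[0:n]) map(tofrom: y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}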

SLIDE 11

Case study: Algorithm π

  • Approximation of the value of π by integrating 4/(1+x²) over the interval [0,1].
  • The sum of the areas of the N rectangles approximates the area under the curve.
  • The approximation of π becomes more accurate as N→∞.
  • Divide the interval into N subintervals of length 1/N:
    ■ For each subinterval, compute the area of the rectangle whose height is the value of 4/(1+x²) at its midpoint (written out in the formula below).
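Written out, the approximation described in the bullets above, using the slide's midpoint (i - 0.5)/N:

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + x_i^2},
\qquad x_i = \frac{i - 0.5}{N}.
\]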

sum = 0.0;
for (i = 0; i < N; i++) {
    sum = sum + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
}
pi = sum / N;

SLIDE 12

Parallelization of Algorithm π

Stage 1:
  ■ Broadcast N, P
Stage 2:
  ■ Distribute loop iterations (Np = N/P iterations per processor)
  ■ Compute partial sums Sp at each processor
Stage 3:
  ■ Gather all partial sums
  ■ Compute the global sum S = S0 + … + S(P-1)

Parallel Programming Framework

SLIDE 13

OpenMP

Version 1: parallel region with a worksharing loop and a reduction clause

sum = 0.0;
#pragma omp parallel shared(N) private(i) reduction(+:sum)
{
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) {
        sum = sum + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
    }
}
pi = sum / N;

Version 2: combined parallel for with a reduction clause

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum = sum + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
}
pi = sum / N;

Version 3: private partial sums combined with one atomic update per thread

sum = 0.0;
#pragma omp parallel shared(N) private(i) private(sum_aux)
{
    sum_aux = 0.0;
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) {
        sum_aux = sum_aux + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
    }
    #pragma omp atomic
    sum = sum + sum_aux;
}
pi = sum / N;

Version 4: atomic update inside the loop

sum = 0.0;
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp atomic
    sum = sum + 4.0 / (1.0 + ((i - 0.5) / N) * ((i - 0.5) / N));
}
pi = sum / N;
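A hedged usage note (compiler invocation assumed, not from the slides): all four variants build with any OpenMP-capable compiler, for example gcc -fopenmp pi.c -o pi, and the thread count is chosen at run time with OMP_NUM_THREADS=8 ./pi. The first three variants synchronize on sum only once per thread (via the reduction clause or a single atomic), while the fourth performs an atomic update on every iteration and is therefore expected to be much slower.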

SLIDE 14

OpenACC

Concepts annotated on the OpenACC version: definition of the parallel region, privatization, work sharing, reduction (a sketch follows below).
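The OpenACC code itself is not present in the extracted text; a minimal sketch of what such a version typically looks like (assumed, not verbatim from the slide; i, x, sum, pi and N as on the earlier slides):

sum = 0.0;
#pragma acc parallel loop private(x) reduction(+:sum)
for (i = 0; i < N; i++) {
    x = (i - 0.5) / N;
    sum = sum + 4.0 / (1.0 + x * x);
}
pi = sum / N;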

SLIDE 15

MPI

  • Definition of the parallel region.
  • Data scoping REDUCTION: implicit message passing to compute the approximation of the number π.
  • Data scoping PRIVATE: variable declarations are process-local.

A sketch of an MPI version follows below.
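The MPI code itself is not present in the extracted text; a minimal sketch (assumed, not verbatim from the slide) following the three stages of the earlier parallelization slide:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, P, i, N = 1000000;
    double x, local_sum = 0.0, sum, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);    /* P is given by the communicator size */

    /* Stage 1: broadcast N to all processes */
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Stage 2: each process computes a partial sum over a cyclic slice of iterations */
    for (i = rank; i < N; i += P) {
        x = (i - 0.5) / N;
        local_sum += 4.0 / (1.0 + x * x);
    }

    /* Stage 3: gather the partial sums into a global sum on rank 0 */
    MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = sum / N;
        printf("pi ~= %f\n", pi);
    }

    MPI_Finalize();
    return 0;
}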

SLIDE 16

CUDA

  • Definition of the parallel region.
  • Data scoping (private, reduction) with implicit synchronization.
  • Work sharing.

SLIDE 17

Data Scoping


  • Temporary variables
    ○ OpenMP: private, firstprivate, lastprivate; thread-local declaration inside the parallel region (C/C++, not Fortran)
    ○ OpenACC: private, firstprivate, lastprivate, create, copyin
  • Read-only variables
    ○ OpenMP: shared
    ○ OpenACC: copyin
  • Output variables
    ○ OpenMP: shared (foralls), reduction/atomic/critical (reductions)
    ○ OpenACC: copy, copyout, reduction/atomic (reductions)
  • Work-sharing
    ○ OpenMP: schedule(static), schedule(static,1), schedule(dynamic)
    ○ OpenACC: hardware controlled

A combined example follows below.
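A minimal sketch (assumed example, not from the slides) of the same reduction loop with the data scoping spelled out in both models; i, n, a, x[] and tmp are assumed to be declared, and norm initialized to 0.0:

/* OpenMP: shared read-only input, private temporary, reduction output, static schedule */
#pragma omp parallel for shared(x, n, a) private(tmp) reduction(+:norm) schedule(static)
for (i = 0; i < n; i++) {
    tmp = a * x[i];
    norm = norm + tmp * tmp;
}

/* OpenACC: copyin for the read-only input, private temporary, reduction output;
   the mapping of iterations to threads is left to the hardware/runtime */
#pragma acc parallel loop copyin(x[0:n]) private(tmp) reduction(+:norm)
for (i = 0; i < n; i++) {
    tmp = a * x[i];
    norm = norm + tmp * tmp;
}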