How to Get Good Performance by Using OpenMP


  1. Agenda
     • Loop optimizations
     • Measuring OpenMP performance
     • Best practices
     • Task parallelism

     Memory access patterns: a major goal is to organize data accesses so that values are used as often as possible while they are still in the cache.

     Correctness versus performance: it may be easy to write a correctly functioning OpenMP program, but not so easy to create a program that provides the desired level of performance.

  2. Two-dimensional array access
     In C, a two-dimensional array is stored row by row (row-major order).

     Empirical test on alvin (n = 50000):
     • row-wise access: 34.8 seconds
     • column-wise access: 213.3 seconds

     Loop fusion
     Loop unrolling
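     A minimal sketch of the two access patterns behind those numbers (the array size and summation are illustrative, not from the slides):

         #include <stdio.h>
         #include <stdlib.h>

         #define N 4096

         int main(void) {
             double *a = calloc((size_t)N * N, sizeof *a);
             double sum = 0.0;

             /* Row-wise: the inner loop walks contiguous memory, so
                consecutive iterations reuse the cache lines just fetched. */
             for (int i = 0; i < N; i++)
                 for (int j = 0; j < N; j++)
                     sum += a[i * N + j];

             /* Column-wise: the inner loop strides by N doubles, touching
                a different cache line on essentially every access. */
             for (int j = 0; j < N; j++)
                 for (int i = 0; i < N; i++)
                     sum += a[i * N + j];

             printf("%f\n", sum);
             free(a);
             return 0;
         }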

  3. Loop fission
     Loop tiling
     (cont'd on next page)

     Measuring OpenMP performance
     (1) Using the time command available on Unix systems:

         $ time program
         real  5.4
         user  3.2
         sys   2.0

     (2) Using the omp_get_wtime() function.
         Returns the wall clock time (in seconds) relative to an arbitrary reference time.
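     A minimal sketch of the omp_get_wtime() approach; the parallel region body is a placeholder for whatever is being measured:

         #include <stdio.h>
         #include <omp.h>

         int main(void) {
             double t_start = omp_get_wtime();

             #pragma omp parallel
             {
                 /* ... the work to be measured ... */
             }

             double t_elapsed = omp_get_wtime() - t_start;
             printf("elapsed wall clock time: %.3f s\n", t_elapsed);
             return 0;
         }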

  4. Parallel overhead
     The amount of time required to coordinate parallel threads, as opposed to doing useful work. Parallel overhead can include factors such as:
     • Thread start-up time
     • Synchronization
     • Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
     • Thread termination time

     A simple performance model

         T_CPU(P)     = (1 + O_P * P) * T_serial
         T_Elapsed(P) = (f/P + 1 - f + O_P * P) * T_serial

     where P is the number of threads, f the parallel fraction of the program, and O_P the per-thread parallel overhead.

     Performance factors
     • Manner in which memory is accessed by the individual threads.
     • Sequential overheads: sequential work that is replicated.
     • (OpenMP) Parallelization overheads: the amount of time spent handling OpenMP constructs.
     • Load imbalance overheads: the load imbalance between synchronization points.
     • Synchronization overheads: time wasted waiting to enter critical regions.

         Speedup(P) = T_serial / T_Elapsed(P) = 1 / (f/P + 1 - f + O_P * P)

         e.g. with f = 0.95 and O_P = 0.02:
         Speedup(P) = 1 / (0.95/P + 0.05 + 0.02 * P)

         Efficiency(P) = Speedup(P) / P
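     Plugging numbers into that example model shows how the O_P * P term comes to dominate: for P = 8, Speedup(8) = 1 / (0.95/8 + 0.05 + 0.02 * 8) = 1 / 0.329 ≈ 3.0, so Efficiency(8) ≈ 0.38. In fact the model peaks near P ≈ 7 at a speedup of only about 3.1; adding threads beyond that makes the program slower.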

  5. Overheads of OpenMP directives on alvin (gcc)
     [Chart: measured overheads of the parallel, parallel for, for, single, barrier, and reduction constructs]

     Overhead of OpenMP scheduling on alvin (gcc)
     [Chart: measured overheads of the schedule kinds]
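     The transcript does not say how these overheads were obtained; a common approach (used, for example, by the EPCC microbenchmarks) is to time many repetitions of a construct and divide. A rough sketch for the barrier construct, with the repetition count chosen arbitrarily:

         #include <stdio.h>
         #include <omp.h>

         #define REPS 100000

         int main(void) {
             double t0 = omp_get_wtime();

             /* Every thread executes the same sequence of barriers;
                the loop amortizes the timer resolution over REPS. */
             #pragma omp parallel
             for (int r = 0; r < REPS; r++) {
                 #pragma omp barrier
             }

             double t = omp_get_wtime() - t0;
             printf("approx. barrier overhead: %g s\n", t / REPS);
             return 0;
         }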

  6. Best practices
     Optimize barrier use.

     Avoid the ordered construct
     The ordered construct is expensive and can often be avoided. It might be better to perform I/O outside the parallel loop.

     Avoid the critical region construct
     If at all possible, an atomic update is to be preferred.
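     A minimal sketch of the preferred form (the counter and loop are illustrative):

         #include <stdio.h>

         int main(void) {
             int count = 0;

             #pragma omp parallel for
             for (int i = 0; i < 1000000; i++) {
                 /* atomic protects just this single update; it is
                    typically much cheaper than a critical region,
                    which would serialize a whole block of code. */
                 #pragma omp atomic
                 count += 1;
             }

             printf("count = %d\n", count);
             return 0;
         }

     (For a plain counter like this one, a reduction(+:count) clause would avoid the per-iteration synchronization entirely.)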

  7. Avoid large critical regions
     Time is lost while threads wait for locks:

         #pragma omp parallel
         {
             #pragma omp critical
             {
                 ...
             }
             ...
         }

     [Timeline: each thread's time divided into busy, idle, and in-critical]

     Maximize parallel regions
     Large parallel regions offer more opportunities for using data in cache and provide a bigger context for compiler optimizations.

     Avoid parallel regions in inner loops (see the sketch below).
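     A sketch of that last point, with illustrative sizes and a placeholder work() function: a parallel region inside an inner loop pays the fork/join cost on every outer iteration, whereas hoisting it out pays that cost once.

         #define N 1000
         #define M 1000

         double a[N][M];
         double work(int i, int j) { return i * 0.5 + j; }

         /* Pays the parallel-region start-up cost N times. */
         void inner_parallel(void) {
             for (int i = 0; i < N; i++) {
                 #pragma omp parallel for
                 for (int j = 0; j < M; j++)
                     a[i][j] = work(i, j);
             }
         }

         /* One enclosing region; only the cheap work-sharing
            construct is re-entered on each outer iteration. */
         void outer_parallel(void) {
             #pragma omp parallel
             for (int i = 0; i < N; i++) {
                 #pragma omp for
                 for (int j = 0; j < M; j++)
                     a[i][j] = work(i, j);
             }
         }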

  8. Load imbalance
     Unequal work loads lead to idle threads and wasted time:

         #pragma omp parallel
         {
             #pragma omp for
             for ( ; ; ) {
                 ...
             }
         }

     [Timeline: each thread's time divided into busy and idle]

     Load balancing
     • Load balancing is an important aspect of performance.
     • For regular computations (e.g. vector addition), load balancing is not an issue.
     • For less regular workloads, care needs to be taken in distributing the work over the threads.
     • Examples of irregular workloads:
       - multiplication of triangular matrices
       - parallel searches in a linked list
     • For these irregular situations, the schedule clause supports various iteration scheduling algorithms (see the sketch below).

     Address poor load balancing
     (cont'd on next page)
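     A minimal sketch of the schedule clause on a triangular workload (matrix size and chunk size are illustrative): with the default static schedule, the threads given the last rows do far more work; schedule(dynamic) hands out chunks on demand and keeps the threads evenly busy.

         #define N 2000

         double a[N][N], x[N], y[N];

         void lower_triangular_matvec(void) {
             /* Row i touches i+1 elements, so the work grows with i;
                dynamic scheduling rebalances this at run time. */
             #pragma omp parallel for schedule(dynamic, 16)
             for (int i = 0; i < N; i++) {
                 double sum = 0.0;
                 for (int j = 0; j <= i; j++)
                     sum += a[i][j] * x[j];
                 y[i] = sum;
             }
         }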

  9. False sharing
     The state bits of a cache line do not track the line's state on a byte basis, but at the line level instead. Thus, if independent data items happen to reside on the same cache line (cache block), each update will cause the cache line to "ping-pong" between the threads. This is called false sharing.

     False sharing is likely to significantly impact performance under the following conditions:
     1. Shared data are modified by multiple threads.
     2. The access pattern is such that multiple threads modify the same cache line(s).
     3. These modifications occur in rapid succession.

     False sharing example
     Array elements are contiguous in memory and hence share cache lines. Result: false sharing may lead to poor scaling.

     Solutions:
     • When updates to an array are frequent, work with local copies of the array instead of an array indexed by the thread ID (see the sketch below).
     • Pad arrays so elements you use are on distinct cache lines.
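     A brief sketch of the first solution (names and iteration count are illustrative): each thread accumulates into a private local variable and touches the shared, thread-indexed array only once at the end, so the line can "ping-pong" at most once per thread.

         #include <omp.h>

         #define NITERS 1000000

         int result[64];   /* one slot per thread; adjacent slots share cache lines */

         void accumulate(void) {
             #pragma omp parallel
             {
                 int id = omp_get_thread_num();   /* assumes <= 64 threads */
                 int local = 0;                   /* private: no false sharing */

                 for (int i = 0; i < NITERS; i++)
                     local += i % 2;

                 result[id] = local;              /* single write to shared array */
             }
         }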

  10. Array padding
      Before: adjacent int elements of a share cache lines, and schedule(static,1) makes the threads update neighbouring elements, so they fight over the same lines:

          int a[Nthreads];
          #pragma omp parallel for shared(Nthreads,a) schedule(static,1)
          for (int i=0; i<Nthreads; i++)
              a[i] += i;

      After: padding each element out to a full cache line puts every a[i][0] on its own line:

          int a[Nthreads][cache_line_size];
          #pragma omp parallel for shared(Nthreads,a) schedule(static,1)
          for (int i=0; i<Nthreads; i++)
              a[i][0] += i;

      (Here cache_line_size must be measured in ints, i.e. the line size in bytes divided by sizeof(int).)

      Case study: Matrix times vector product
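      The case-study code itself is not in the transcript; a minimal OpenMP matrix-times-vector sketch, as it might look (the function and parameter names are assumptions):

          /* y = A * x, where A is m x n in row-major order. */
          void matvec(int m, int n, const double *a,
                      const double *x, double *y) {
              /* Parallelize over rows: each thread writes a disjoint
                 range of y, so no synchronization is needed. */
              #pragma omp parallel for
              for (int i = 0; i < m; i++) {
                  double sum = 0.0;
                  for (int j = 0; j < n; j++)
                      sum += a[i * n + j] * x[j];
                  y[i] = sum;
              }
          }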

  11. Task Parallelism in OpenMP 3.0

  12. What are tasks?
      • Tasks are independent units of work.
      • Threads are assigned to perform the work of each task.
      • Tasks may be deferred, or may be executed immediately; the runtime system decides which.
      • Tasks are composed of:
        - code to execute
        - data environment (a task owns its data)
        - internal control variables
      [Figure: serial vs. parallel task execution]

      Tasks in OpenMP
      OpenMP has always had tasks, but they were not called that.
      • A thread encountering a parallel construct packages up a set of implicit tasks, one per thread.
      • A team of threads is created.
      • Each thread is assigned to one of the tasks (and tied to it).
      • A barrier holds the master thread until all implicit tasks are finished.
      OpenMP 3.0 adds a way to create a task explicitly for the team to execute.

      The task construct

          #pragma omp task [clause [[,] clause] ... ]
              structured block

      Each encountering thread creates a new task:
      • code and data are packaged up
      • tasks can be nested

      An OpenMP barrier (implicit or explicit): all tasks created by any thread of the current team are guaranteed to be completed at barrier exit.
      Task barrier (taskwait): the encountering thread suspends until all child tasks it has generated are complete.

      Simple example of using tasks for pointer chasing:

          void process_list(elem_t *elem) {
              #pragma omp parallel
              {
                  #pragma omp single
                  {
                      while (elem != NULL) {
                          #pragma omp task
                          {
                              process(elem);
                          }
                          elem = elem->next;
                      }
                  }
              }
          }

      elem is firstprivate by default.

  13. Simple example of using tasks in a recursive algorithm
      Computation of Fibonacci numbers: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...

          int fib(int n) {
              int i, j;
              if (n < 2)
                  return n;
              #pragma omp task shared(i)
              i = fib(n - 1);
              #pragma omp task shared(j)
              j = fib(n - 2);
              #pragma omp taskwait
              return i + j;
          }

          int main() {
              int n = 10;
              #pragma omp parallel
              #pragma omp single
              printf("fib(%d) = %d\n", n, fib(n));
          }

      Using tasks for tree traversal:

          struct node {
              struct node *left, *right;
          };

          void traverse(struct node *p, int postorder) {
              if (p->left != NULL)
                  #pragma omp task
                  traverse(p->left, postorder);
              if (p->right != NULL)
                  #pragma omp task
                  traverse(p->right, postorder);
              if (postorder) {
                  #pragma omp taskwait
                  process(p);
              }
          }

      Collapsing of loops
      The collapse clause (in OpenMP 3.0) handles perfectly nested multi-dimensional loops:

          #pragma omp for collapse(2)
          for (i = 0; i < N; i++)
              for (j = 0; j < M; j++)
                  for (k = 0; k < K; k++)
                      foo(i, j, k);

      The iteration spaces of the i-loop and j-loop are collapsed into a single one, if the two loops are perfectly nested and form a rectangular iteration space.

      Task switching
      Certain constructs have suspend/resume points at defined positions within them. When a thread encounters a suspend point, it is allowed to suspend the current task and resume another. It can then return to the original task and resume it.

      A tied task must be resumed by the same thread that suspended it. Tasks are tied by default. A task can be specified to be untied using

          #pragma omp task untied
