

SLIDE 1

Using OpenMP for HEP Framework Algorithm Scheduling

Dr Christopher D Jones, Dr Patrick Gartung
CHEP 2019
4 November 2019

FERMILAB-SLIDES-19-068-CMS-SCD

This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics.

SLIDE 2

Nov 2019 | CD Jones | Using OpenMP for HEP Framework Algorithm Scheduling

Outline

  • Motivation
  • OpenMP Review
  • Demonstrator Frameworks
  • Experiment Setup
  • Results

SLIDE 3

Motivation

Why bother with OpenMP when already using Intel's Threading Building Blocks (TBB)?

HPC centers

  • Supercomputing centers traditionally use OpenMP for threading
  • When communicating with HPC specialists, we are often asked about OpenMP
  • Utilization of HPC centers for HEP will only increase over time

Need to either use OpenMP or have a reason not to use it

SLIDE 4

OpenMP Review

OpenMP is an extension to a compiler, not a library

  • Uses compiler pragma statements
  • Implementations of features vary considerably across compilers

OpenMP 4.5 Constructs

  • omp parallel
  • omp for
  • omp task
  • omp taskloop
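Because OpenMP is part of the compiler rather than a library, a program can check at compile time whether it was switched on. The following is a minimal sketch, assuming gcc or clang, where the -fopenmp flag enables OpenMP; the function name openmpVersion is illustrative and does not come from the slides.

```cpp
#include <string>

// Returns the OpenMP version date the compiler advertises (format yyyymm),
// or "disabled" when the file was compiled without OpenMP support.
// The compiler itself defines _OPENMP; no library header is involved.
std::string openmpVersion() {
#ifdef _OPENMP
  return std::to_string(_OPENMP);  // 201511 is the value for OpenMP 4.5
#else
  return "disabled";               // the omp pragmas were ignored
#endif
}
```

Compiling the same file with and without -fopenmp changes the result, which also shows why behavior can differ between compilers: each one ships its own implementation.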

SLIDE 5

OpenMP Construct: omp parallel

Starts the threads used in the following block

  • Once assigned, those threads can only be used by that parallel construct

At the end of the block, the job waits until all assigned threads finish the block

The number of threads for each parallel block is controlled by

  • the environment variable OMP_NUM_THREADS, or
  • a call to omp_set_num_threads

The maximum number of threads for the job is controlled by the environment variable OMP_THREAD_LIMIT

    #pragma omp parallel
    {
      …
    }
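As an illustration of the team behavior described above, here is a small sketch that counts how many threads execute a parallel block. It uses the num_threads clause, a per-block alternative to OMP_NUM_THREADS and omp_set_num_threads; the function name countThreadsInBlock is hypothetical.

```cpp
#include <cassert>

// Counts how many threads execute a parallel block.
// Every thread of the team runs the block once, so the counter ends up
// equal to the team size. Without OpenMP the pragmas are ignored and the
// block runs once on the calling thread.
int countThreadsInBlock(int requestedThreads) {
  assert(requestedThreads >= 1);
  int count = 0;
  // num_threads requests a team size for this block only; the runtime may
  // grant fewer threads, for example when OMP_THREAD_LIMIT is lower.
  #pragma omp parallel num_threads(requestedThreads)
  {
    #pragma omp atomic
    ++count;
  }  // implicit barrier: the caller continues after all threads finish
  return count;
}
```

Calling countThreadsInBlock(4) returns a value between 1 and 4, depending on how many threads the runtime actually grants.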

SLIDE 6

OpenMP Construct: omp for

Distributes loop iterations to the threads associated with the innermost parallel block

By default, the calling thread waits until all iterations have completed

    #pragma omp for
    for (int i = 0; i < N; ++i) {
      …
    }
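A complete version of this pattern might look like the sketch below, which places the omp for inside an enclosing parallel block so that there is a team to share the iterations. The function squareAll and its data are made up for illustration.

```cpp
#include <vector>

// Squares every element of the input. The iterations of the loop are
// divided among the threads of the enclosing parallel block.
std::vector<double> squareAll(const std::vector<double>& in) {
  std::vector<double> out(in.size());
  const int n = static_cast<int>(in.size());
  #pragma omp parallel
  {
    #pragma omp for
    for (int i = 0; i < n; ++i) {
      out[i] = in[i] * in[i];  // each iteration writes its own element
    }
    // implicit barrier: no thread passes this point until all
    // iterations have completed
  }
  return out;
}
```

Each iteration touches a different element of out, so no further synchronization is needed.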

SLIDE 7

OpenMP Construct: nested parallel blocks

Support for concurrent nested parallel blocks is implementation defined

  • Also controlled by the environment variable OMP_NESTED or by calling omp_set_nested

Nested parallel blocks have as many threads as the outer blocks

  • Until the maximum number of threads is reached

SLIDE 8

OpenMP Construct: nested parallel blocks (example 1)

  • 9 max threads per job
  • Main thread waits until the nested parallel blocks have finished

    omp_set_num_threads(3);

    #pragma omp parallel for
    for (int i = 0; i < 3; ++i) {
      #pragma omp parallel for
      for (int j = 0; j < 3; ++j) {
        doWork(i, j);
      }
    }

[Diagram: thread timeline (Time axis) showing the main thread, the i iterations and the nested j iterations]

SLIDE 9

OpenMP Construct: nested parallel blocks (example 2)

Same as before, except

  • 6 max threads per job
  • Finished threads cannot be used by other parallel blocks

[Diagram: thread timeline (Time axis) showing the main thread, the i iterations and the nested j iterations]
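One way to observe how many threads each nested block really receives is to record the inner team size per outer iteration. The sketch below is an assumption-laden illustration, not code from the slides: innerTeamSizes is a hypothetical name, and it enables nesting with omp_set_max_active_levels, whereas the slides mention OMP_NESTED and omp_set_nested (older runtimes may additionally need one of those).

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns, for each outer iteration, the number of threads that the nested
// parallel block was given. Without OpenMP every entry stays at 1.
std::vector<int> innerTeamSizes(int outer, int inner) {
  assert(outer >= 1 && inner >= 1);
  std::vector<int> sizes(outer, 1);
#ifdef _OPENMP
  omp_set_max_active_levels(2);  // allow one level of nesting (global setting)
#endif
  #pragma omp parallel for num_threads(outer)
  for (int i = 0; i < outer; ++i) {
    #pragma omp parallel num_threads(inner)
    {
      // exactly one thread of the inner team records the team size
      #pragma omp single
      {
#ifdef _OPENMP
        sizes[i] = omp_get_num_threads();
#endif
      }
    }
  }
  return sizes;
}
```

With enough threads available, innerTeamSizes(3, 3) would report three teams of 3, matching the 9 threads of example 1. Running the same program with OMP_THREAD_LIMIT=6 may yield smaller inner teams; the exact split is up to the implementation.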

SLIDE 10

OpenMP Construct: omp task

All code in the block is put into a task object

An untied task can be run by any thread of the innermost parallel section

When a task completes, another task can be scheduled on the thread

  • The new task must be from the same parallel section

    #pragma omp task
    {
      …
    }
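A minimal, self-contained sketch of task creation follows. One thread of the team creates the tasks, and any thread of the same parallel section may run them; the function runTasks and the work it does are purely illustrative.

```cpp
#include <cassert>
#include <vector>

// Creates nTasks independent tasks; task i stores i*i in its own slot.
std::vector<int> runTasks(int nTasks) {
  assert(nTasks >= 0);
  std::vector<int> results(nTasks, 0);
  #pragma omp parallel
  {
    // a single thread generates the tasks, the whole team executes them
    #pragma omp single
    {
      for (int i = 0; i < nTasks; ++i) {
        #pragma omp task untied firstprivate(i) shared(results)
        {
          results[i] = i * i;
        }
      }
    }  // implicit barrier: all tasks have completed past this point
  }
  return results;
}
```

The firstprivate(i) clause gives each task its own copy of the loop index, so the task still sees the right value if it runs after the loop has moved on.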

SLIDE 11

OpenMP Construct: omp taskloop

Creates OpenMP tasks for the iterations

Calling thread may run other tasks while waiting for all taskloop tasks to end

  • I.e. implementations may do task stealing

    #pragma omp taskloop
    for (int i = 0; i < N; ++i) {
      …
    }
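For comparison with omp for, a complete taskloop sketch is given below. It assumes a compiler with OpenMP 4.5 support; the function name fillWithTaskloop is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Fills out[i] = 2*i, with the iterations packaged into OpenMP tasks.
std::vector<int> fillWithTaskloop(int n) {
  assert(n >= 0);
  std::vector<int> out(n, 0);
  #pragma omp parallel
  {
    #pragma omp single
    {
      // The encountering thread creates the tasks. While it waits for them
      // it may, depending on the implementation, run other available tasks.
      #pragma omp taskloop
      for (int i = 0; i < n; ++i) {
        out[i] = 2 * i;
      }
      // taskloop has an implicit taskgroup: this point is reached only
      // after all iterations are done
    }
  }
  return out;
}
```

Unlike omp for, the iterations here are ordinary tasks, so they compete with any other tasks queued in the same parallel section.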

SLIDE 12

Demonstrator Frameworks

Created simplified OpenMP-based, TBB-based and single-threaded frameworks

Frameworks can process multiple events concurrently

Work is done via Modules

  • Modules generate data and put them into events
  • One Module can depend on data from other Modules
  • Modules are wrapped in OpenMP or TBB tasks
  • Module tasks only start once the needed data are available

Modules may use parallel for constructs internally

  • Allows testing of nested parallelism

Code available at https://github.com/Dr15Jones/toy-mt-framework
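The idea that a module task starts only once its input data exist can be sketched with OpenMP task dependencies. This is an illustration under stated assumptions and not how toy-mt-framework is implemented: the three "modules", the EventData struct and the numbers are invented, and the depend clause is just one mechanism that expresses such ordering.

```cpp
// Data products of one event (hypothetical names and values).
struct EventData {
  int tracks;
  int vertices;
  int jets;
};

// Runs three toy modules as tasks; the depend clauses make a task wait
// until the products it reads have been written.
EventData processEvent() {
  int tracks = 0, vertices = 0, jets = 0;
  #pragma omp parallel
  #pragma omp single
  {
    // "Tracking" module: needs no input
    #pragma omp task depend(out: tracks) shared(tracks)
    tracks = 10;

    // "Vertexing" module: needs tracks
    #pragma omp task depend(in: tracks) depend(out: vertices) shared(tracks, vertices)
    vertices = tracks / 5;

    // "Jet" module: needs tracks and vertices
    #pragma omp task depend(in: tracks, vertices) depend(out: jets) shared(tracks, vertices, jets)
    jets = tracks + vertices;
  }  // implicit barrier: all module tasks have finished
  return EventData{tracks, vertices, jets};
}
```

A framework that processes several events concurrently would keep one such set of products per event, so that modules of different events can run at the same time.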

SLIDE 13

Experimental Setup

Compiled the TBB and OpenMP frameworks with gcc 8 and clang 7

  • Very different OpenMP 4.5 implementations

Created a Module call graph that emulates CMS reconstruction

  • Uses the same module dependencies
  • Uses module run times from 100 different events

Experiment varied

  • Number of threads
      • Number of concurrent events == number of threads
      • Number of events processed in a job = number of threads * 100
  • Amount of module internal parallelism

Measurements done on an Intel KNL (Knights Landing) machine

SLIDE 14

Module Perfect Parallelism

All modules are concurrent capable

TBB results using gcc and clang are identical

Ran as many single-threaded jobs as the number of threads

OpenMP and TBB have the same results

[Plot: Event Throughput (ev/sec) vs. Number of Threads & Concurrent Events (32 to 256). Series: TBB, OpenMP clang, OpenMP gcc, N Single Threaded]

SLIDE 15

One Serial Module with No Internal Parallelism

Simulate the behavior of output

  • Serialize event access to the output module
  • All other modules are as before

Jobs quickly hit the Amdahl's law limit
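The saturation can be pictured with a simple bound: if only one event at a time may be in the serialized module, throughput can never exceed one event per serial-module time, however many threads are added. The sketch below is one simple way to express that limit, not a model taken from the slides; the function name and all numbers are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on event throughput (events/s) for a job in which one module
// is serialized across events.
//   nThreads          : threads (and concurrent events) in the job
//   totalSecPerEvent  : CPU time to process one event, all modules included
//   serialSecPerEvent : time one event spends in the serialized module
double throughputBound(int nThreads, double totalSecPerEvent,
                       double serialSecPerEvent) {
  assert(nThreads >= 1 && totalSecPerEvent > 0.0 && serialSecPerEvent > 0.0);
  const double parallelLimit = nThreads / totalSecPerEvent;  // perfect scaling
  const double serialLimit = 1.0 / serialSecPerEvent;        // one event at a time
  return std::min(parallelLimit, serialLimit);
}
```

With illustrative values of 8 s of total work and 1 s in the serial module per event, 4 threads are bounded by 0.5 events/s, while 32 threads are bounded by 1 event/s, the same as any larger thread count.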

[Plot: Event Throughput (ev/s) vs. Number of Threads & Concurrent Events (4 to 32). Series: TBB, OpenMP clang, OpenMP gcc, N Single Threaded]

SLIDE 16

Serial Module with Internal Parallelism: Task Stealing

Allow the output module to use parallelism

  • Use a for loop with 100 iterations

TBB uses tbb::parallel_for

  • Does task stealing by default

OpenMP uses taskloop

  • clang does task stealing
  • gcc does not do task stealing

Task stealing hurts throughput

[Plot: Event Throughput (ev/s) vs. Number of Threads & Concurrent Events (32 to 256). Series: TBB, OpenMP clang, OpenMP gcc, N Single Threaded]

SLIDE 17

Serial Module with Internal Parallelism: No Task Stealing

Make all versions avoid task stealing

  • TBB uses arenas
  • OpenMP uses omp for
      • Only way in the API to guarantee no stealing
      • For each (max) number of threads
          • ran many jobs varying omp_set_num_threads
          • chose the value with the highest throughput

Even when picking the best working point for OpenMP, TBB's automatic behavior gives the best results

[Plot: Event Throughput (ev/s) vs. Number of Threads & Concurrent Events (32 to 256). Series: TBB, OpenMP clang & gcc, N Single Threaded]

SLIDE 18

Conclusion

It is possible to create a HEP framework using OpenMP

  • Our investigation finds it would perform worse than using TBB

Compiler implementation variations make portable performance hard

  • gcc taskloop does not do task stealing
  • clang taskloop does task stealing, with no way to disable it

OpenMP has composability difficulties

  • Parallel blocks do not share threads
  • Nested parallelism uses a fixed allocation of threads
  • Very hard to tune how many threads to use at each nested parallel level

SLIDE 19

Backup Slides

SLIDE 20

Task Stealing Problem

E.g. a waiting thread steals a long-running task

  • Can't start makeTasks until the stolen task finishes

    #pragma omp taskloop
    for (int i = 0; i < 2; ++i) {
      doWork(i);
    }
    makeTasks();

[Diagram: main-thread timeline (Time axis) showing doWork(i), the stolen task and makeTasks()]

SLIDE 21

Scanning Job Results for omp for usage

A selection of throughput vs. omp_set_num_threads plots

  • Kept maximum number of threads == number of concurrent events for each measurement

[Plots: Throughput vs. omp_set_num_threads for gcc8 and clang7, one panel each for Max Threads: 32, 48, 64, 96 and 128]