To thread or not to thread? Why PETSc favors MPI-only (Plenary) - PowerPoint PPT Presentation



SLIDE 1

To thread or not to thread? Why PETSc favors MPI-only

Plenary Discussion PETSc User Meeting 2016

Based on: MS35 - To Thread or Not To Thread April 13, 2016 SIAM PP , Paris

SLIDE 2

The Big Picture

  • The next large NERSC production system "Cori" will be an Intel Xeon Phi KNL (Knights Landing) architecture: >60 cores per node, 4 hardware threads per core, for a total of >240 threads per node.
  • Your application is very likely to run on KNL with a simple port, but high performance is harder to achieve.
  • Many applications will not fit into the memory of a KNL node using pure MPI across all HW cores and threads because of the memory overhead for each MPI task.
  • Hybrid MPI/OpenMP is the recommended programming model to achieve scaling capability and code portability.
  • Current NERSC systems (Babbage, Edison, and Hopper) can help prepare your codes.


“OpenMP Basics and MPI/OpenMP Scaling”, Yun He, NERSC, 2015

SLIDE 3

The Big Picture


[Figure: bar chart of running times (s) per routine, log scale from 1 to 100; series: Pure MPI, OMP=1, OMP=2, OMP=3, OMP=4]

  • Total number of MPI ranks = 60; OMP=N means N threads per MPI rank.
  • The original code uses a shared global task counter to deal with dynamic load balancing with MPI ranks.
  • The top 10 routines in the TEXAS package (75% of total CPU time) were loop-parallelized with OpenMP. Has load imbalance.
  • OMP=1 has overhead over pure MPI.
  • OMP=2 has overall best performance in many routines.
  • "OpenMP Basics and MPI/OpenMP Scaling", Yun He, NERSC, 2015
SLIDE 4

The Big Picture

  • OpenMP is a fun and powerful language for shared memory programming.
  • Hybrid MPI/OpenMP is recommended for many next generation architectures (Intel Xeon Phi, for example), including the NERSC-8 system, Cori.
  • You should explore adding OpenMP now if your application is flat MPI only.


“OpenMP Basics and MPI/OpenMP Scaling”, Yun He, NERSC, 2015

SLIDE 5

The Big Picture

"OpenMP is fun" is not a sufficient justification for changing our programming model!

SLIDE 6

Threads and Library Interfaces

Attempt 1

Library spawns threads

void library_func(double *x, int N)
{
  #pragma omp parallel for
  for (int i = 0; i < N; ++i)
    x[i] = something_complicated();
}

Problems

Call from multi-threaded environment?

void user_func(double **y, int M, int N)
{
  #pragma omp parallel for
  for (int j = 0; j < M; ++j)
    library_func(y[j], N);
}

Incompatible OpenMP runtimes (e.g. GCC vs. ICC)

SLIDE 7

Threads and Library Interfaces

Attempt 2

Use pthreads/TBB/etc. instead of OpenMP to spawn threads.
Fixes incompatible OpenMP implementations (probably).

Problems

Still a problem with multi-threaded user environments

void user_func(double **y, int M, int N)
{
  #pragma omp parallel for
  for (int j = 0; j < M; ++j)
    library_func(y[j], N);
}

SLIDE 8

Threads and Library Interfaces

Attempt 3

Hand back thread management to user

void library_func(ThreadInfo ti, double *x, int N)
{
  int start = compute_start_index(ti, N);
  int stop  = compute_stop_index(ti, N);
  for (int i = start; i < stop; ++i)
    x[i] = something_complicated();
}

Implications

Users can use their favorite threading model.
API requires one extra parameter.
Extra boilerplate code required in user code.

SLIDE 9

Threads and Library Interfaces

Reflection

Extra thread communication parameter

void library_func(ThreadInfo ti, double *x, int N) {...}

Rename thread management parameter

void library_func(Thread_Comm c, double *x, int N) {...}

Compare:

void library_func(MPI_Comm comm, double *x, int N) {...}

Conclusion

Prefer flat MPI over MPI+OpenMP for a composable software stack.
MPI automatically brings better data locality.