MPIBlib: Benchmarking MPI Communications for Parallel Computing on Homogeneous and Heterogeneous Clusters

SLIDE 1

Introduction MPIBlib benchmarking suite Conclusion

MPIBlib: Benchmarking MPI Communications for Parallel Computing on Homogeneous and Heterogeneous Clusters

Alexey Lastovetsky Vladimir Rychkov Maureen O’Flynn

{Alexey.Lastovetsky, Vladimir.Rychkov, Maureen.OFlynn}@ucd.ie
Heterogeneous Computing Laboratory
School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
http://hcl.ucd.ie

The 15th European PVM/MPI Users' Group Conference, September 9, 2008, Dublin, Ireland

SLIDE 3

◮ Accurate estimation of the execution time of MPI communication operations plays an important role in optimization of parallel applications:
  ◮ Design of parallel applications
  ◮ Tuning collective communication operations
  ◮ Heterogeneous platforms

◮ MPI benchmarking suites: mpptest, NetPIPE, IMB (PMB), SKaMPI, MPIBench
  ◮ Measurement of the execution time of MPI functions: fixed set of communication operations to be measured (except SKaMPI)
  ◮ A benchmark methodology: a single timing method
  ◮ Not much interpretation of results: executables and plotting

SLIDE 5

◮ Communication performance modeling: interpretation of results. The procedure for estimating the model parameters determines how many experimental results, and which communication experiments, are required.
◮ Results of experiments should be available dynamically: an MPI benchmarking library
◮ The communication operations measured by a benchmarking suite should be customizable: user-defined communication experiments
◮ The efficiency of measurements is crucial for modeling at runtime (less accurate results can be acceptable): selection of timing methods

SLIDE 7

◮ Benchmark methodology
  Gropp, W., Lusk, E.: Reproducible Measurements of MPI Performance Characteristics. In: Dongarra, J., Luque, E., Margalef, T. (eds.) EuroPVM/MPI 1999. LNCS, vol. 1697, pp. 11–18. Springer (1999)
  ◮ Repeating the communication operation multiple times to obtain a reliable estimate of its execution time
  ◮ Selecting message sizes adaptively to eliminate artifacts in a graph of the output
  ◮ Testing the communication operation under different conditions: cache effects, communication and computation overlap, communication patterns, non-blocking communication, etc.

◮ Common features of MPI benchmarking suites
  ◮ Computing the average, minimum, and maximum execution time of a series of the same communication experiments to get accurate results
  ◮ Measuring the communication time for different message sizes: the number of measurements can be fixed, or adaptively increased for message sizes where the time fluctuates rapidly
  ◮ Performing simple statistical analysis by finding averages, variations, and errors

SLIDE 8

Scheduling the communication experiment

◮ Series of communications: overlapping (Intel MPI Benchmarks)

[Figure: two plots, Scatter and Gather, of execution time (sec) against message size (KB, up to 100); series: single (min), single (max), multi (avg)]

◮ Isolation of communication operations from each other: barrier, reduce, short acknowledgments
  ◮ Overlapping with these communications

SLIDE 10

Timing methods, based on MPI_Wtime

◮ General: the time between two events
  ◮ on a single designated processor (root)
  ◮ on all participating processors (max)
  ◮ on different processors (global)

  Global timing is the most accurate, but it is also the costliest when the platform does not support a global MPI timer, because regular clock synchronization is then required.

◮ Operation-specific
  Supinski, B. de, Karonis, N.: Accurately measuring MPI broadcasts in a computational grid. In: The 8th International Symposium on High Performance Distributed Computing, pp. 29–37 (1999)

SLIDE 11

MPIBlib benchmarking suite

◮ Implemented as a library: can be integrated into applications
◮ Provides general and operation-specific timing methods
◮ Supports extension of the communication operations to be measured

Input accuracy parameters

◮ Minimum and maximum numbers of repetitions
  If min_reps == max_reps, a fixed number of measurements is performed.
◮ Confidence level and error of estimation
  If min_reps < max_reps, the number of measurements depends on the statistics.

Output accuracy parameters

◮ Number of repetitions
◮ Confidence interval

SLIDE 12

Different timing methods on a 16-node heterogeneous cluster

[Figure: two plots, Scatter and Gather, of execution time (sec) against message size (KB, up to 100); series: root, max, global]

Message sizes 0..100KB, 1KB stride, 1 rep:

Timing method    Scatter (sec)    Gather (sec)
Global           28.7             44.7
Maximum          0.8              15.6
Root             0.8              15.7

SLIDE 13

Encapsulation: a special data structure

    struct MPIB_coll_container {
        void (*initialize)(void* this, MPI_Comm comm, int root, int M);
        void (*execute)(void* this, MPI_Comm comm, int root, int M);
        void (*finalize)(void* this, MPI_Comm comm, int root);
        void (*free)(void* this);
    };

◮ initialize, finalize: allocation and deallocation of buffers required for the communication operation
◮ execute: communication operation
◮ free: release of the data structure

    struct MPIB_Scatter_container {
        struct MPIB_coll_container base;
        char* buffer;
        int (*scatter)(void* sendbuf, int sendcount, MPI_Datatype sendtype,...);
    };

SLIDE 14

Customization of communication operations

[Diagram; labels: MPI_Scatter, MPIB_Scatter_linear, MPIB_Scatter_binomial, MPIB_Scatter_container, MPI_Scatterv, MPIB_Scatterv_container, MPIB_measure_max, MPIB_measure_root, MPIB_measure_global]

SLIDE 15

The MPI benchmarking library was used for communication performance modeling on heterogeneous clusters:

◮ Measurement of roundtrips with empty and non-empty messages: sequential, parallel (clusters with a single switch)
◮ Measurement of linear scatter/gather: root timing
◮ User-defined communication operations (one-to-two): sequential, parallel (clusters with a single switch)

SLIDE 16

Acknowledgments

University College Dublin
Science Foundation Ireland
IBM Dublin CAS
