Mapping CSP Networks to MPI Clusters Using Channel Graphs and Dynamic Instrumentation
Gabriella Azzopardi, Kevin Vella, Adrian Muscat, University of Malta
Outline: Introduction; 1. CSP Library; 2. Configuration Language; 3. CSP-based …
Parallelising an application can improve its computational performance, but the search for the most efficient way to run it in parallel is time consuming and depends on the underlying architecture.
This work maps CSP-based concurrent applications onto a cluster and evaluates their performance, in order to automate the mapping process, then to study different automatic techniques and see how they compare.
High-level application mapping onto a cluster also makes better use of the available resources.
A CSP-based message passing channel must first be implemented to allow for communication between an application's parallel processes. This was done by using MPI and POSIX threads to implement the following:
Parallel: runs a set of processes concurrently. The processes start together and the Parallel terminates when all combined processes have terminated.
Alt: allows one of several ready channels to be chosen for execution. This is a separate process which is provided with a list of channels and randomly selects one which is ready in order to receive the data to be sent.
Processes may be executed on different nodes, which were assigned through a predefined mapping.
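The constructs above can be sketched outside MPI. The following is an illustrative Python model only: the real library uses MPI and POSIX threads, and all names here are mine, not the library's.

```python
import queue
import random
import threading

class Channel:
    """A synchronous CSP-style channel: send() blocks until a
    receiver has actually taken the value (rendezvous)."""
    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._taken = threading.Semaphore(0)

    def send(self, value):
        self._slot.put(value)   # hand over the value
        self._taken.acquire()   # block until the receiver takes it

    def receive(self):
        value = self._slot.get()
        self._taken.release()   # unblock the waiting sender
        return value

    def ready(self):
        return not self._slot.empty()

def parallel(*procs):
    """Parallel construct: start all processes together and return
    only when every one of them has terminated."""
    threads = [threading.Thread(target=p) for p in procs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def alt(channels):
    """Alt construct: given a list of channels, pick one that is
    ready (at random) and receive from it."""
    while True:
        ready = [c for c in channels if c.ready()]
        if ready:
            return random.choice(ready).receive()
```

For example, `parallel(lambda: ch.send(42), lambda: print(ch.receive()))` runs a sender and a receiver concurrently and returns once both have finished.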
A configuration language must then be established to provide a means for mapping such applications easily onto a cluster. This was developed using JSON to allow users to manually map processes, and an application can have various mappings. Three sections are defined:
Channels: defines which pair of processes each channel connects. The processes in this section are identified using a unique ID, which is then referenced in the mapping section.
Mapping: each process from the application section is assigned a rank upon which it will execute. This section groups mappings with a unique ID, in order to allow multiple mappings to be used in the same application.
Variables: defined first, to be used in the mapping and application sections that follow. Such variables can also be edited from the application at runtime, allowing for the configuration to dynamically correspond to the application.
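The slides do not show the concrete JSON schema, so the following is a hypothetical configuration illustrating the three sections; all field names and IDs are my own.

```json
{
  "variables": { "workers": 2 },
  "channels": [
    { "id": "c0", "from": "master", "to": "worker1" },
    { "id": "c1", "from": "master", "to": "worker2" }
  ],
  "mappings": {
    "m0": { "master": 0, "worker1": 1, "worker2": 1 },
    "m1": { "master": 0, "worker1": 1, "worker2": 2 }
  }
}
```

Here `m0` and `m1` are two alternative mappings of the same processes onto MPI ranks, selectable by their unique IDs.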
A number of concurrent applications with various programming patterns were then developed using the CSP library and mapped using the configuration language. Applications are represented using a graph model in order to facilitate application partitioning for future mappings. The applications include:
an application with sending and receiving processes;
image filtering using a single Gaussian filter, dividing the image horizontally;
image filtering using a single Gaussian filter, dividing the image in both planes, where the master process uses the Alt function;
a pipeline which simulates image filtering using the same Gaussian filter twice;
an application to sort a list of numbers.
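The graph model mentioned above treats processes as vertices and channels as weighted edges. A minimal sketch follows; the data layout is assumed, not taken from the presentation.

```python
class ChannelGraph:
    """Channel graph: processes are vertices, channels are weighted
    edges; the weight can hold e.g. measured communication cost."""
    def __init__(self):
        self.edges = {}  # (sender, receiver) -> weight

    def add_channel(self, sender, receiver, weight=1):
        self.edges[(sender, receiver)] = weight

    def processes(self):
        ps = set()
        for s, r in self.edges:
            ps.add(s)
            ps.add(r)
        return ps

    def neighbours(self, p):
        # Every process connected to p by a channel, in either direction.
        return ({r for s, r in self.edges if s == p}
                | {s for s, r in self.edges if r == p})
```

For a simple pipeline, `add_channel("producer", "worker")` and `add_channel("worker", "consumer")` give a three-vertex graph ready for partitioning.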
The final step is to automate the mapping of such applications using various partitioning algorithms:
one grows partitions by repeatedly selecting the process with minimum cost and maximum gain;
another assigns each process a partition according to its distance from the root;
Kernighan-Lin iteratively swaps processes between partitions if the gain is better;
another searches for a better mapping solution than the previous, by using neighbouring partitioning solutions;
and a clustering approach groups processes together after selecting initial centroids.
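As an illustration of the swap-based (Kernighan-Lin style) step, here is a hedged sketch; `cut_cost`, `swap_if_gain` and the data layout are my own, not the authors' implementation.

```python
def cut_cost(edges, part):
    """Total weight of channels crossing the partition boundary.
    edges maps (sender, receiver) -> weight; part maps process -> partition."""
    return sum(w for (a, b), w in edges.items() if part[a] != part[b])

def swap_if_gain(edges, part, a, b):
    """One Kernighan-Lin style move: swap two processes between their
    partitions and keep the swap only if it lowers the cut cost."""
    before = cut_cost(edges, part)
    part[a], part[b] = part[b], part[a]
    if cut_cost(edges, part) >= before:
        part[a], part[b] = part[b], part[a]  # undo: no gain
    return part
```

Repeating such moves until no swap improves the cut gives the usual iterative refinement behaviour described above.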
The algorithms which use application information require an initial run of the application, and the following data is recorded for each individual channel and each placed parallel process:
the time one process waits at a channel, until the second process arrives and transfers the data;
the amount of data sent across the channel.
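A minimal sketch of such per-channel instrumentation follows; this is my own simplification in plain Python, whereas the real library times MPI channels.

```python
import threading
import time

class InstrumentedChannel:
    """Records how long one party waited at the channel for the other
    to arrive, alongside the value handed over."""
    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._has_value = False
        self.wait_time = 0.0  # accumulated time spent waiting at this channel

    def send(self, value):
        with self._cond:
            self._value, self._has_value = value, True
            self._cond.notify()
            while self._has_value:      # wait until the receiver takes it
                self._cond.wait()

    def receive(self):
        with self._cond:
            t0 = time.perf_counter()
            while not self._has_value:  # wait for the sender to arrive
                self._cond.wait()
            self.wait_time += time.perf_counter() - t0
            value, self._has_value = self._value, False
            self._cond.notify()
            return value
```

After an instrumented run, `wait_time` per channel (and the amount of data sent) can feed the mapping algorithms.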
Several mappings were generated per application using the mapping algorithms.
Each application was run on 4 and 6 cores on 2 different clusters, using MVAPICH, and the execution time was recorded in each case.
The best-performing mapping achieved the largest speedup, whereas the Weighted Scatter achieved the least.
Conclusions: automating the mapping of concurrent applications saves users time and effort when generating mappings for their CSP applications.
Using recorded application information when mapping applications can be beneficial.
Algorithms which divide an application without adding extra external channels performed better (e.g. Linear), and maintaining equality of partitions proved to be more effective.
The library provides Create/Destroy, Send/Receive and Timer operations on channels, and distinguishes between Internal Channels (between processes on the same node) and External Channels (between nodes).
The first version of the library records timings when setting the TIME flag, whereas the second version calculates the total execution time without timing overheads.
These timings feed the cost functions of the mapping algorithms, where Chan_comms_time is the total communication time of all channels used by the current process.
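Given the recorded per-channel communication times, Chan_comms_time can be computed as sketched below; the data layout is assumed, not taken from the presentation.

```python
def chan_comms_time(process, channel_times):
    """Chan_comms_time: total communication time of all channels used
    by the given process. channel_times maps (sender, receiver) pairs
    to measured communication time (hypothetical data layout)."""
    return sum(t for (s, r), t in channel_times.items()
               if process in (s, r))
```

A process appearing at either end of a channel accumulates that channel's measured time.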