SLIDE 1

Mapping CSP Networks to MPI Clusters Using Channel Graphs and Dynamic Instrumentation

Gabriella Azzopardi Kevin Vella Adrian Muscat University of Malta

SLIDE 2

Outline

  • Introduction
  • 1. CSP-Library
  • 2. Configuration Language
  • 3. CSP-based Concurrent Applications
  • 4. Automatic Mapping Algorithms
  • Results
  • Conclusion
SLIDE 3

Introduction

  • Distributing an application's processes across a cluster improves computational performance
  • After implementing an application, the developer must therefore search for the most efficient way to run it in parallel, which is time consuming
  • A number of algorithms have been studied for mapping applications onto the underlying architecture
  • This work seeks to create the same mapping effect for CSP-based concurrent applications and evaluates their performance
  • The initial aim is to provide the tools needed to implement and map CSP-based applications so as to automate the mapping process, and then to study different automatic techniques and compare them

SLIDE 4

Introduction

High-level application mapping onto a cluster

  • The idea is to automatically map an application with any number of processes onto a cluster with any number of nodes, in order to best utilize the available resources
  • This was achieved in four parts, which are explained in the following slides
SLIDE 5
1. CSP-Library

A CSP-based message-passing channel must first be implemented to allow communication between an application's parallel processes. This was done using MPI and POSIX threads to implement the following:

  • Parallel - Brings together a number of processes so that they are executed concurrently. The processes start together, and the Parallel terminates when all combined processes have terminated.
  • Channels - Provide communication between pairs of processes. Three types of channels are defined: Internal, External and Timer.
SLIDE 6
1. CSP-Library

  • Alternation - Combines a number of processes whereby only one of them is chosen for execution. It is a separate process which is provided with a list of channels and randomly selects one that is ready, in order to receive the data to be sent.
  • Placed Parallel - Similar to Parallel, except that the concurrent processes are executed on different nodes, assigned through a predefined mapping.
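The random choice among ready channels at the heart of Alternation can be sketched in isolation. The following is a simplified illustration, not the library's code: given the readiness of each guarded channel, it picks one ready channel uniformly at random; the real Alternation would additionally block until some channel becomes ready.

```c
#include <stdlib.h>

/* Pick one ready channel uniformly at random (reservoir sampling),
 * so no single channel is systematically favoured.  Returns the
 * chosen index, or -1 if no channel is ready. */
int alt_select(const int *ready, int nchans) {
    int chosen = -1, seen = 0;
    for (int i = 0; i < nchans; i++) {
        if (!ready[i]) continue;
        seen++;
        if (rand() % seen == 0)   /* keep i with probability 1/seen */
            chosen = i;
    }
    return chosen;
}
```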

SLIDE 7
2. Configuration Language

A configuration language must then be established to provide a means of easily mapping such applications onto a cluster. It was developed using JSON to allow users to manually map processes onto nodes. The configuration is kept in a separate file so that a single application can have various mappings. Three sections are defined:

  • Application - Lists all the channels used in the application and the pair of processes each channel connects. The processes in this section are identified by a unique ID, which is then referenced in the mapping section.

SLIDE 8
2. Configuration Language

  • Mapping - Each process is referenced by the unique ID shared with the application section and is assigned the rank on which it will execute. This section groups mappings under a unique ID, so that multiple mappings can be defined for the same application.
  • Global - This must be the first section and declares any variables used in the mapping and application sections that follow. These variables can also be edited by the application at runtime, allowing the configuration to correspond dynamically to the application.
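The slides do not show the concrete syntax. A hypothetical configuration in the spirit of the three sections described (all field names below are invented for illustration) might look like:

```json
{
  "global": { "N": 4 },
  "application": {
    "channels": [
      { "name": "c0", "from": "p0", "to": "p1" },
      { "name": "c1", "from": "p1", "to": "p2" }
    ]
  },
  "mapping": {
    "m0": { "p0": 0, "p1": 0, "p2": 1 },
    "m1": { "p0": 0, "p1": 1, "p2": 2 }
  }
}
```

Here `m0` and `m1` are two alternative mappings of the same three-process application onto MPI ranks.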

SLIDE 9
3. CSP-based Concurrent Applications

A number of concurrent applications with various programming patterns were then developed using the CSP library and mapped using the configuration language. Applications are represented using a graph model in order to facilitate application partitioning for future mappings.

  • Sort Pump - Linear application which sorts a list of numbers
  • N-place Buffer - Linear application creating a buffer of N processes between the sending and receiving processes
  • Single Filter - Geometric application simulating image filtering with a single Gaussian filter, dividing the image horizontally
  • Single 2D Filter - Farming application simulating image filtering with a single Gaussian filter, dividing the image in both planes; the master process uses the Alt function

SLIDE 10
3. CSP-based Concurrent Applications

  • Double Filter - Geometric application which simulates image filtering by applying the same Gaussian filter twice
  • Mergesort - Binary tree application which sorts a list of numbers

SLIDE 11
4. Automatic Mapping Algorithms

The final step is to automate the mapping of such applications using various partitioning algorithms. The following mapping techniques were used:

  • Simple - Random, Linear (following the graph) and Weighted Scatter (according to execution times)
  • Min-Max Greedy - Greedily assigns processes to partitions, aiming for minimum cost and maximum gain
  • Breadth-First Search - Traverses the graph from a root; vertices are assigned a partition according to their distance from the root
  • K-way Bisection - Recursive Graph Bisection groups nearby vertices, and Kernighan-Lin iteratively swaps processes between partitions when this improves the gain
  • Simulated Annealing - Optimization algorithm which repeatedly searches neighbouring partitioning solutions for a better mapping than the current one
  • K-Means Clustering - Unsupervised learning algorithm which groups connected processes together after selecting initial centroids
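To make the Breadth-First Search technique concrete, here is a minimal sketch (illustrative only, not the paper's implementation): traverse the channel graph from a root process and assign each vertex to a partition based on its BFS distance, so that neighbouring processes tend to land on the same node.

```c
#include <string.h>

#define MAXV 64

/* Partition an nv-vertex channel graph (adjacency matrix adj) into
 * npart partitions by BFS distance from root.  part[v] receives the
 * partition index of vertex v; unreachable vertices default to 0. */
void bfs_partition(int adj[MAXV][MAXV], int nv, int root,
                   int npart, int part[MAXV]) {
    int dist[MAXV], queue[MAXV], head = 0, tail = 0, maxd = 0;
    memset(dist, -1, sizeof dist);
    dist[root] = 0;
    queue[tail++] = root;
    while (head < tail) {
        int u = queue[head++];
        if (dist[u] > maxd) maxd = dist[u];
        for (int v = 0; v < nv; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
    }
    /* Map distance bands onto partitions as evenly as possible. */
    for (int v = 0; v < nv; v++)
        part[v] = dist[v] < 0 ? 0 : dist[v] * npart / (maxd + 1);
}
```

For a linear pipeline such as the Sort Pump, this splits the chain into contiguous segments, which matches the intuition that cutting few channels keeps communication local.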

SLIDE 12

Application Statistics

The algorithms which use application information require an initial run of the application, during which the following data is recorded for each individual channel and each placed-parallel process:

  • Channel total time - Total time the first process spends waiting on the channel, until the second process arrives and the data is transferred
  • Channel communication time - Total time taken to actually transfer all the data across the channel
  • Channel usage - Total number of times the channel was used
  • Channel data size - Total number of bytes transferred across the channel
  • Process total time - Total time taken by a process to execute
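As a sketch, the per-channel statistics listed above could be accumulated in a record like the following (a hypothetical layout, with one derived figure of the kind a weighted partitioner can use as an edge weight):

```c
#include <stdint.h>

/* Hypothetical per-channel record for the statistics listed above. */
typedef struct {
    double   total_time;   /* waiting + transfer time (seconds)    */
    double   comms_time;   /* pure transfer time (seconds)         */
    uint64_t usage;        /* number of completed communications   */
    uint64_t data_size;    /* total bytes moved across the channel */
} ChannelStats;

/* Average cost of one communication on this channel; derived
 * figures like this can weight the edges of the channel graph. */
double chan_mean_comms_time(const ChannelStats *s) {
    return s->usage ? s->comms_time / (double)s->usage : 0.0;
}
```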
SLIDE 13

Results

  • The 6 applications were mapped and executed with the 9 mapping algorithms
  • 3 separate mappings were generated per application using each algorithm
  • Applications were run using 1, 2, 4, 6 and 8 nodes and 1, 2, 4[, 6] cores on 2 different clusters
  • Each application instance was run 3 times, and the execution time was recorded in each case
  • 2 versions of MPI were used: MPI Hydra (MPICH) and MVAPICH

SLIDE 14

Results

Example:

  • MVAPICH results for the Sort Pump on one core
  • Results indicate that the Linear algorithm generated the largest speedup, whereas Weighted Scatter generated the least

SLIDE 15

Conclusion

  • This work provides the necessary tools for developing CSP-based concurrent applications
  • The framework developed will save developers a significant amount of time and effort when generating mappings for their CSP applications
  • Results indicate that using a mapping algorithm to map such applications can be beneficial
  • Large tree-depth graphs (e.g. Sort Pump) - Partitioning algorithms which divide an application without adding extra external channels performed better (e.g. Linear)
  • Short tree-depth graphs - Partitioning algorithms which prioritize equality of partitions proved more effective

SLIDE 16

Questions?

SLIDE 17

Channel Usage

[Diagram: Create/Destroy, Send/Receive and Timer Channel usage]

SLIDE 18

Channel Implementation

[Diagram: Internal Channel and External Channel implementation]

SLIDE 19

Application Statistics

  • Application information is collected using two versions of the CSP library. The first version calculates all channel and process times when the TIME flag is set, whereas the second version measures the total execution time without timing overheads
  • The following data is extracted and used by the graph partitioning algorithms, where Chan_comms_time is the total communication time of all channels used by the current process

SLIDE 20

Sort Pump Mappings

  • Sort Pump application (small scale):
  • Linear mapping:
  • Weighted Scatter mapping: