SLIDE 1

Mapping CSP Networks to MPI Clusters Using Channel Graphs and Dynamic Instrumentation

Gabriella Azzopardi Kevin Vella Adrian Muscat University of Malta

SLIDE 2

Outline

  • Introduction
  • 1. CSP-Library
  • 2. Configuration Language
  • 3. CSP-based Concurrent Applications
  • 4. Automatic Mapping Algorithms
  • Results
  • Conclusion
SLIDE 3

Introduction

  • Distributing an application's processes across a cluster improves computational performance
  • After implementing an application, the developer must therefore search for the most efficient way to run it in parallel, which is time consuming
  • A number of algorithms have been studied for mapping applications onto the underlying architecture
  • This work seeks to create the same mapping effect for CSP-based concurrent applications and evaluates their performance
  • The initial aim is to provide the tools needed to implement and map CSP-based applications so as to automate the mapping process, and then to study different automatic techniques and compare them

SLIDE 4

Introduction

High-level application mapping onto a cluster

  • The idea is to automatically map an application with any number of processes onto a cluster with any number of nodes, in order to best utilize the available resources
  • This was achieved in four parts, which are explained in the following slides
SLIDE 5
1. CSP-Library

A CSP-based message-passing channel must first be implemented to allow communication between an application's parallel processes. This was done using MPI and POSIX threads to implement the following:

  • Parallel - Brings together a number of processes so that they are executed concurrently. The processes start together, and the Parallel terminates when all combined processes have terminated.
  • Channels - Provide communication between pairs of processes. Three types of channels are defined: Internal, External and Timer.
SLIDE 6
1. CSP-Library

  • Alternation - Combines a number of processes whereby only one of them is chosen for execution. It is a separate process which is provided with a list of channels and randomly selects one that is ready, in order to receive the data to be sent.
  • Placed Parallel - Similar to Parallel, except that the concurrent processes are executed on different nodes, assigned through a predefined mapping.
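The random choice among ready channels at the heart of Alternation can be sketched in isolation. The following is a simplified illustration, not the library's code: given the readiness of each guarded channel, it picks one ready channel uniformly at random; the real Alternation would additionally block until some channel becomes ready.

```c
#include <stdlib.h>

/* Pick one ready channel uniformly at random (reservoir sampling),
 * so no single channel is systematically favoured.  Returns the
 * chosen index, or -1 if no channel is ready. */
int alt_select(const int *ready, int nchans) {
    int chosen = -1, seen = 0;
    for (int i = 0; i < nchans; i++) {
        if (!ready[i]) continue;
        seen++;
        if (rand() % seen == 0)   /* keep i with probability 1/seen */
            chosen = i;
    }
    return chosen;
}
```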

SLIDE 7
2. Configuration Language

A configuration language must then be established to provide a means of easily mapping such applications onto a cluster. It was developed using JSON to allow users to manually map processes onto nodes. The configuration is kept in a separate file so that a single application can have various mappings. Three sections are defined:

  • Application - Lists all the channels used in the application and the pair of processes each channel connects. The processes in this section are identified by a unique ID, which is then referenced in the mapping section.

SLIDE 8
2. Configuration Language

  • Mapping - Each process is referenced by the unique ID shared with the application section and is assigned the rank on which it will execute. This section groups mappings under a unique ID, so that multiple mappings can be defined for the same application.
  • Global - This must be the first section and declares any variables used in the mapping and application sections that follow. These variables can also be edited by the application at runtime, allowing the configuration to correspond dynamically to the application.
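The slides do not show the concrete syntax. A hypothetical configuration in the spirit of the three sections described (all field names below are invented for illustration) might look like:

```json
{
  "global": { "N": 4 },
  "application": {
    "channels": [
      { "name": "c0", "from": "p0", "to": "p1" },
      { "name": "c1", "from": "p1", "to": "p2" }
    ]
  },
  "mapping": {
    "m0": { "p0": 0, "p1": 0, "p2": 1 },
    "m1": { "p0": 0, "p1": 1, "p2": 2 }
  }
}
```

Here `m0` and `m1` are two alternative mappings of the same three-process application onto MPI ranks.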

SLIDE 9
3. CSP-based Concurrent Applications

A number of concurrent applications with various programming patterns were then developed using the CSP library and mapped using the configuration language. Applications are represented using a graph model in order to facilitate application partitioning for future mappings.

  • Sort Pump - Linear application which sorts a list of numbers
  • N-place Buffer - Linear application creating a buffer of N processes between the sending and receiving processes
  • Single Filter - Geometric application simulating image filtering with a single Gaussian filter, dividing the image horizontally
  • Single 2D Filter - Farming application simulating image filtering with a single Gaussian filter, dividing the image in both planes; the master process uses the Alt function

SLIDE 10
3. CSP-based Concurrent Applications

  • Double Filter - Geometric application which simulates image filtering by applying the same Gaussian filter twice
  • Mergesort - Binary tree application which sorts a list of numbers

SLIDE 11
4. Automatic Mapping Algorithms

The final step is to automate the mapping of such applications using various partitioning algorithms. The following mapping techniques were used:

  • Simple - Random, Linear (following the graph) and Weighted Scatter (according to execution times)
  • Min-Max Greedy - Greedily assigns processes to partitions, aiming for minimum cost and maximum gain
  • Breadth-First Search - Traverses the graph from a root; vertices are assigned a partition according to their distance from the root
  • K-way Bisection - Recursive Graph Bisection groups nearby vertices, and Kernighan-Lin iteratively swaps processes between partitions when this improves the gain
  • Simulated Annealing - Optimization algorithm which repeatedly searches neighbouring partitioning solutions for a better mapping than the current one
  • K-Means Clustering - Unsupervised learning algorithm which groups connected processes together after selecting initial centroids
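To make the Breadth-First Search technique concrete, here is a minimal sketch (illustrative only, not the paper's implementation): traverse the channel graph from a root process and assign each vertex to a partition based on its BFS distance, so that neighbouring processes tend to land on the same node.

```c
#include <string.h>

#define MAXV 64

/* Partition an nv-vertex channel graph (adjacency matrix adj) into
 * npart partitions by BFS distance from root.  part[v] receives the
 * partition index of vertex v; unreachable vertices default to 0. */
void bfs_partition(int adj[MAXV][MAXV], int nv, int root,
                   int npart, int part[MAXV]) {
    int dist[MAXV], queue[MAXV], head = 0, tail = 0, maxd = 0;
    memset(dist, -1, sizeof dist);
    dist[root] = 0;
    queue[tail++] = root;
    while (head < tail) {
        int u = queue[head++];
        if (dist[u] > maxd) maxd = dist[u];
        for (int v = 0; v < nv; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
    }
    /* Map distance bands onto partitions as evenly as possible. */
    for (int v = 0; v < nv; v++)
        part[v] = dist[v] < 0 ? 0 : dist[v] * npart / (maxd + 1);
}
```

For a linear pipeline such as the Sort Pump, this splits the chain into contiguous segments, which matches the intuition that cutting few channels keeps communication local.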

SLIDE 12

Application Statistics

The algorithms which use application information require an initial run of the application, during which the following data is recorded for each individual channel and each placed-parallel process:

  • Channel total time - Total time the first process spends waiting on the channel, until the second process arrives and the data is transferred
  • Channel communication time - Total time taken to actually transfer all the data across the channel
  • Channel usage - Total number of times the channel was used
  • Channel data size - Total number of bytes transferred across the channel
  • Process total time - Total time taken by a process to execute
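As a sketch, the per-channel statistics listed above could be accumulated in a record like the following (a hypothetical layout, with one derived figure of the kind a weighted partitioner can use as an edge weight):

```c
#include <stdint.h>

/* Hypothetical per-channel record for the statistics listed above. */
typedef struct {
    double   total_time;   /* waiting + transfer time (seconds)    */
    double   comms_time;   /* pure transfer time (seconds)         */
    uint64_t usage;        /* number of completed communications   */
    uint64_t data_size;    /* total bytes moved across the channel */
} ChannelStats;

/* Average cost of one communication on this channel; derived
 * figures like this can weight the edges of the channel graph. */
double chan_mean_comms_time(const ChannelStats *s) {
    return s->usage ? s->comms_time / (double)s->usage : 0.0;
}
```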
SLIDE 13

Results

  • The 6 applications were mapped and executed with the 9 mapping algorithms
  • 3 separate mappings were generated per application using each algorithm
  • Applications were run using 1, 2, 4, 6 and 8 nodes and 1, 2, 4[, 6] cores on 2 different clusters
  • Each application instance was run 3 times, and the execution time was recorded in each case
  • 2 versions of MPI were used: MPI Hydra (MPICH) and MVAPICH

SLIDE 14

Results

Example:

  • MVAPICH results for the Sort Pump on one core
  • Results indicate that the Linear algorithm generated the largest speedup, whereas Weighted Scatter generated the least

SLIDE 15

Conclusion

  • This work provides the necessary tools for developing CSP-based concurrent applications
  • The framework developed will save developers a significant amount of time and effort when generating mappings for their CSP applications
  • Results indicate that using a mapping algorithm to map such applications can be beneficial
  • Large tree-depth graphs (e.g. Sort Pump) - Partitioning algorithms which divide an application without adding extra external channels performed better (e.g. Linear)
  • Short tree-depth graphs - Partitioning algorithms which prioritize equality of partitions proved more effective

SLIDE 16

Questions?

SLIDE 17

Channel Usage

[Diagram: Create/Destroy, Send/Receive and Timer Channel usage]

SLIDE 18

Channel Implementation

[Diagram: Internal Channel and External Channel implementation]

SLIDE 19

Application Statistics

  • Application information is collected using two versions of the CSP library. The first version calculates all channel and process times when the TIME flag is set, whereas the second version measures the total execution time without timing overheads
  • The following data is extracted and used by the graph partitioning algorithms, where Chan_comms_time is the total communication time of all channels used by the current process

SLIDE 20

Sort Pump Mappings

  • Sort Pump application (small scale):
  • Linear mapping:
  • Weighted Scatter mapping: