Network and Load-Aware Resource Manager for MPI Programs
SLIDE 1

Network and Load-Aware Resource Manager for MPI Programs

Ashish Kumar, Naman Jain, Preeti Malakar

Indian Institute of Technology, Kanpur

SRMPDS, International Conference on Parallel Processing 2020

SLIDE 2

Introduction

Distributed-memory parallel programs and MPI:
• More than one processing element, each using its own local memory.
• Nodes work cooperatively to solve a single big problem.
• Data is exchanged by sending and receiving messages.
• The Message Passing Interface (MPI) is the de facto standard for message passing.
• Runs on a cluster (shared or dedicated) or a supercomputer.

SLIDE 3

Introduction

• MPI jobs need a set of nodes allocated to them before they can run.
• In this work, we address the problem of allocating a good set of nodes to run MPI jobs on a shared, non-dedicated cluster.

SLIDE 4

Non-Dedicated/Shared Clusters and Challenges

• Non-exclusive access to nodes: the cluster is shared among many users, so the same node can be used by different users/processes at the same time for different purposes.
• Resource usage varies across time and across nodes.
• Which nodes should we run our job on? What parameters should be considered?

SLIDE 5

Node Resource Usage Variation

[Figure: variation in node resource usage across time and nodes in a shared cluster]

SLIDE 6

Network Usage Variation

[Figure: variation in network usage between nodes in a shared cluster]

SLIDE 7

Towards Our Approach

• Use knowledge of these variations across nodes, time, and the network to allocate resources better.
• Take into account both static and dynamic attributes of resources, including network availability.

SLIDE 8

Overview

1. Node Allocation Algorithm
2. Resource Monitoring
3. Experiments
4. Conclusions and Future Work

SLIDE 9

Allocation as Sub-Graph Selection

[Figure: example graph over nodes v1..v4 with edge loads 80, 90, 85, 60, 75, 90]

Node   Compute load   #Cores
v1     50.2           6
v2     43.5           8
v3     54.7           10
v4     38.3           4

Model the cluster as a graph G = (V, E):
• Vertex v ∈ V: a compute node with compute load CL_v and available processor count pc_v.
• Edge e ∈ E: the network load NL_(u,v) between compute nodes u and v.
• n: the number of processes to be allocated.

Goal: find a sub-graph such that the overall cost/load of the sub-graph is minimized and the process demand is fulfilled.
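
To make the model concrete, here is a minimal sketch of this graph as plain Python dictionaries (not the authors' code). The node values come from the table above; the assignment of the six edge loads to particular edges is an assumption, since the original figure is not recoverable.

# Graph model G = (V, E) for allocation (illustrative sketch).
CL = {"v1": 50.2, "v2": 43.5, "v3": 54.7, "v4": 38.3}  # compute load CL_v
pc = {"v1": 6, "v2": 8, "v3": 10, "v4": 4}             # available cores pc_v

# Network load NL_(u,v) per link; the edge-to-weight mapping is assumed.
NL = {
    ("v1", "v2"): 80, ("v1", "v3"): 90, ("v1", "v4"): 85,
    ("v2", "v3"): 60, ("v2", "v4"): 75, ("v3", "v4"): 90,
}

def network_load(u, v):
    # Links are undirected, so accept either orientation.
    return NL.get((u, v), NL.get((v, u)))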

SLIDE 10

Some Definitions

Compute load: a measure of the overall load on a node.
• Static attributes: clock speed, core count, total memory.
• Dynamic attributes: CPU load, CPU utilization, available memory.
• Compute load: CL_v = Σ_{a ∈ attributes} w_a · val_{v,a}

Network load: a measure of the load on a point-to-point network link.
• Attributes: latency and bandwidth.
• Network load: NL_(u,v) = w_lt · LT_(u,v) + w_bw · BW_(u,v)

Available processors: a measure of the effective number of free processors.
• pc_v = coreCount_v − (Load_v mod coreCount_v)

Weights can be tuned according to the program's needs/type.
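
A minimal sketch of these three definitions in code, assuming the attribute values have already been collected and normalized; the default latency/bandwidth weights are the ones reported later in the slides, and everything else is illustrative.

def compute_load(attr_values, weights):
    # CL_v = sum over attributes a of w_a * val_{v,a}
    return sum(weights[a] * attr_values[a] for a in weights)

def network_load(lt, bw, w_lt=0.25, w_bw=0.75):
    # NL_(u,v) = w_lt * LT_(u,v) + w_bw * BW_(u,v)
    return w_lt * lt + w_bw * bw

def available_processors(core_count, load):
    # pc_v = coreCount_v - (Load_v mod coreCount_v)
    return core_count - (load % core_count)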

SLIDE 11

Allocation Algorithm

• Find a candidate sub-graph corresponding to each node.
• For each sub-graph G_v = (V_v, E_v), define:
  Compute load: CG_v = Σ_{u ∈ V_v} CL_u
  Network load: NG_v = Σ_{(x,y) ∈ E_v} NL_(x,y)
  Total load: TG_v = α × CG_v^normalized + β × NG_v^normalized
• Allocate the best candidate on the basis of total load.
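
The slides do not spell out the normalization; the sketch below assumes min-max normalization across all candidate sub-graphs before taking the weighted sum, which is one common choice rather than the authors' stated method.

def normalize(values):
    # Min-max normalization across candidates (assumption).
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in values]

def total_loads(CG, NG, alpha=0.4, beta=0.6):
    # TG_v = alpha * CG_v_normalized + beta * NG_v_normalized
    CG_n, NG_n = normalize(CG), normalize(NG)
    return [alpha * c + beta * n for c, n in zip(CG_n, NG_n)]

# The candidate sub-graph with the smallest TG_v is allocated.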

SLIDE 12-13

Allocation Algorithm: Worked Example

[Figure: graph over nodes v1..v4 with edge loads 45, 90, 85, 70, 75, 65]

Setup: required process count = 16; compute weight (α) = 0.4; network weight (β) = 0.6; start node v4 with available core count 4.

Addition loads w.r.t. v4:

Node   Compute load   Network load   Addition load   #Available cores
v1     52             45             47.8            6
v2     47             70             60.8            8
v3     74             65             68.6            5

Steps:
1. Pick v4 (the start node): allocated process count = 4.
2. Pick v1 (lowest addition load, 47.8): allocated process count = 4 + 6 = 10.
3. Pick v2 (next lowest, 60.8): allocated process count = 10 + 8 = 18 ≥ 16, so the demand is met.
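
The addition-load column and the picking order above can be reproduced with a few lines (values copied from the example; exact printed floats may differ by rounding):

alpha, beta = 0.4, 0.6
CL = {"v1": 52, "v2": 47, "v3": 74}        # compute loads
NL = {"v1": 45, "v2": 70, "v3": 65}        # network load w.r.t. v4
pc = {"v4": 4, "v1": 6, "v2": 8, "v3": 5}  # available cores

A = {u: alpha * CL[u] + beta * NL[u] for u in CL}
# A == {'v1': 47.8, 'v2': 60.8, 'v3': 68.6} up to float rounding

allocated, picked = pc["v4"], ["v4"]       # the start node goes first
for u in sorted(A, key=A.get):             # increasing addition load
    if allocated >= 16:
        break
    picked.append(u)
    allocated += pc[u]
print(picked, allocated)                   # ['v4', 'v1', 'v2'] 18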

SLIDE 14

Candidate Selection Algorithm

• Compute the addition cost of every node w.r.t. the start node:
  Addition cost: A_v(u) = α × CL(u) + β × NL(v, u)
• Keep adding nodes to the sub-graph until the allocated process count reaches the required number of processes.

SLIDE 15

Candidate Selection Algorithm

Input : node v; graph G; PC, the list of effective processor counts; n, the requested number of processes
Output: G_v, a sub-graph with v included

V_v ← φ; allocated_processes ← 0; k ← total number of nodes in G;
A_v(v) ← 0; calculate A_v(u) for each node u other than v;
let u_1, u_2, ..., u_k be the vertices sorted in increasing order of addition load A_v(u);
i ← 1;
while allocated_processes < n do
    V_v ← V_v ∪ {u_i};
    allocated_processes ← allocated_processes + pc_{u_i};
    i ← i + 1;
end

Algorithm 4: Candidate Selection Algorithm
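
A runnable sketch of this pseudocode, reusing the dictionary-based graph model sketched earlier; it is an illustration, not the authors' implementation.

def select_candidate(v, nodes, CL, NL, pc, n, alpha=0.4, beta=0.6):
    # Grow a sub-graph around start node v by repeatedly adding the node
    # with the lowest addition cost A_v(u) = alpha*CL(u) + beta*NL(v,u).
    def addition_cost(u):
        if u == v:
            return 0.0                      # A_v(v) = 0: v is picked first
        link = NL.get((v, u), NL.get((u, v)))
        return alpha * CL[u] + beta * link

    chosen, allocated = [], 0
    for u in sorted(nodes, key=addition_cost):
        if allocated >= n:
            break
        chosen.append(u)
        allocated += pc[u]                  # effective processor count pc_u
    return chosen, allocated

# Example with the values from the worked example:
nodes = ["v1", "v2", "v3", "v4"]
CL = {"v1": 52, "v2": 47, "v3": 74}        # CL of the start node is never used
NL = {("v4", "v1"): 45, ("v4", "v2"): 70, ("v4", "v3"): 65}
pc = {"v1": 6, "v2": 8, "v3": 5, "v4": 4}
print(select_candidate("v4", nodes, CL, NL, pc, n=16))
# (['v4', 'v1', 'v2'], 18)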

SLIDE 16

Resource Monitoring

• Developed a distributed monitoring system for flexibility.
• Light-weight daemons periodically update:
  • Live hosts: nodes which are up and running.
  • Node statistics: available memory, CPU load, CPU utilization, etc.
  • Network statistics: available bandwidth and latency.
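
The slides give no implementation details for the daemons; as a rough sketch of what a node-statistics daemon could look like, here is a periodic sampling loop built on the psutil library. The update interval and the publish() sink are placeholders.

# Illustrative node-statistics daemon (requires: pip install psutil).
import time
import psutil

INTERVAL = 5  # seconds between updates (placeholder)

def sample_node_stats():
    load1, _, _ = psutil.getloadavg()       # 1-minute CPU load average
    return {
        "cpu_load": load1,
        "cpu_util": psutil.cpu_percent(interval=1),
        "avail_mem": psutil.virtual_memory().available,
        "cores": psutil.cpu_count(logical=True),
    }

def publish(stats):
    # Placeholder: a real daemon would push this to a central store
    # that the allocator queries.
    print(stats)

if __name__ == "__main__":
    while True:
        publish(sample_node_stats())
        time.sleep(INTERVAL)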

SLIDE 17

Experimental Setup and Benchmarks

Experimental setup:
• 40 12-core Intel Core nodes (4.6 GHz) and 20 8-core Intel Core nodes (2.8 GHz).
• The cluster has a tree-like hierarchical topology with 4 switches.

Benchmarks:
• Mantevo miniMD: a simple, parallel molecular dynamics mini-application.
• Mantevo miniFE: a proxy application for unstructured implicit finite element codes; it sets up a brick-shaped problem domain of hexahedral elements.

Comparison with:
• Random allocation
• Sequential allocation
• Load-aware allocation

SLIDE 18

Weights for miniMD and miniFE

Table: relative weights for compute load

Attribute            Weight
CPU Load             0.3
CPU Utilization      0.2
Node Bandwidth       0.2
Used memory          0.1
Logical core count   0.1
Clock Speed          0.05
Total Memory         0.05

• Weights were determined empirically.
• Relative weights for latency and bandwidth were set to 0.25 and 0.75, respectively.
• Relative weights for compute and network load were set to 0.3 and 0.7, respectively, for miniMD, and 0.4 and 0.6, respectively, for miniFE.

SLIDE 19

Experiments: miniMD

SLIDE 20

Performance Gain: miniMD

Runtime gain of the network- and load-aware allocation over each baseline policy:

Allocation Policy   Average Gain   Median Gain   Maximum Gain
Random              49.9%          50.7%         87.8%
Sequential          43.1%          42.1%         84.5%
Load-Aware          32.4%          29.8%         87.7%

SLIDE 21

CPU Load: miniMD

Average CPU load per logical core:
• Load-aware = 0.31
• Network and load-aware = 0.43
• Sequential = 0.68
• Random = 0.78

SLIDE 22

Experiments: miniFE

SLIDE 23

Performance Gain: miniFE

Runtime gain of the network- and load-aware allocation over each baseline policy:

Allocation Policy   Average Gain   Median Gain   Maximum Gain
Random              47.9%          50.4%         92.1%
Sequential          31.1%          28.0%         80.4%
Load-Aware          34.8%          38.7%         91.0%

SLIDE 24

Experiment: Resource Allocation Analysis

Configuration: 32 processes, 4 processes per node

Table: usage of the allocated resource group at allocation time

Algorithm                Avg. CPU load   Avg. bandwidth   Avg. latency
Random                   1.242           17.07            546.46
Sequential               1.262           10.72            304.25
Load-aware               0.453           18.64            354.51
Network and load-aware   0.633           5.36             82.90

Total execution time:
• Random: 27.61 s
• Sequential: 24.91 s
• Load-aware: 12.31 s
• Network and load-aware: 4.43 s

SLIDE 25

Experiments: Resource Allocation Analysis

[Figure: peer-to-peer bandwidth and CPU load]

SLIDE 26

Shortcomings and Future Work

• The network- and load-aware algorithm reduces runtime by more than 38% over random, sequential, and load-aware allocations, owing to less interference from external factors.
• Determining the relative weights for resource attributes and computation/communication characteristics is challenging for large applications; we plan to enhance profiling tools for this purpose.
• Time-series estimation methods may be used for bandwidth forecasting.
• Extension to large-scale systems spanning multiple clusters.
• Exploring integration of our tool as a plugin for the SLURM job scheduler.