Processes and Threads Placement of Parallel Applications: Why, How and for What Gain? (PowerPoint Presentation)



SLIDE 1

Processes and Threads Placement of Parallel Applications. Why, How and for What Gain?

Joint work with: Guillaume Mercier, François Tessier, Brice Goglin, Emmanuel Agullo, George Bosilca. COST Spring School, Uppsala

Emmanuel Jeannot, Runtime Team
Inria Bordeaux Sud-Ouest, June 4, 2013

SLIDE 2

Runtime Systems and the Inria Runtime Team

1


SLIDE 8

Software Stack

  • Applications
  • Programming models: enable and express parallelism, give an abstraction of the parallel machine
  • Compilers: static optimization, parallelism extraction
  • Libraries: optimize computational kernels
  • Runtime systems: dynamic optimization
  • Operating systems: hardware abstraction, basic services
  • Hardware

SLIDE 9

Runtime System

  • Scheduling
  • Parallelism orchestration (Comm. Synchronization)
  • I/O
  • Reliability and resilience
  • Collective communication routing
  • Migration
  • Data and task/process/thread placement
  • etc.
SLIDE 10

Runtime Team

Inria team. Goal: enable performance portability by improving interface expressivity. Success stories:

  • MPICH2 (Nemesis kernel)
  • KNEM (enabling high-performance intra-node MPI communication for large messages)
  • StarPU (unified runtime system for CPU and GPU program execution)
  • HWLOC (portable hardware locality)
SLIDE 11

Process Placement

2

SLIDE 12

MPI (Process-based runtime systems)

Performance of MPI programs depends on many factors that can be adjusted when you change machines:

  • Implementation of the standard (e.g. collective com.)
  • Parallel algorithm(s)
  • Implementation of the algorithm
  • Underlying libraries (e.g. BLAS)
  • Hardware (processors, cache, network)
  • etc.

But…

SLIDE 13

Process Placement

The MPI model makes little (no?) assumption on the way processes are mapped to resources. It is often assumed that the network topology is flat and hence that the process mapping has little impact on performance.

[Figure: the flat model: four CPUs, each with its own memory, connected by an interconnection network]

SLIDE 14

The Topology is not Flat

Due to multicore processors, current and future parallel machines are hierarchical. Communication speed depends on:

  • Emitter and receiver
  • Cache hierarchy
  • Memory bus
  • Interconnection network
  • etc.

Almost nothing in the MPI standard helps to handle these factors.

SLIDE 15

Example on a Parallel Machine

The higher we have to go in the hierarchy, the costlier the data exchange.

[Figure, built up over several animation steps: within a socket, four cores with private L1/L2 caches share an L3 cache, a memory controller and local RAM over a bus; several such sockets are linked by an intra-node interconnect; several nodes, each with its own memory controller and local RAM, communicate through NICs over the network.]

The network can also be hierarchical!

SLIDE 23

Rationale

Not all the processes exchange the same amount of data. The speed of the communications, and hence the performance of the application, depends on the way processes are mapped to resources.

SLIDE 24

Do we Really Care: to Bind or not to Bind?

After all, the system scheduler is able to move processes when needed. Yes, but only within a shared-memory system. Migration is possible but it is not in the MPI standard (see Charm++). Moreover, binding provides better run-to-run execution time stability.

[Benchmark: Zeus MHD Blast, 64 processes/cores, MVAPICH2 1.8 + ICC]

SLIDE 26

Process Placement Problem

Given:

  • the parallel machine topology
  • the process affinity (communication pattern)

map processes to resources (cores) so as to reduce the communication cost. A nice algorithmic problem (the cost being minimized is sketched below), tackled for instance by:

  • Graph partitioning (Scotch, METIS)
  • Application tuning [Aktulga et al. Euro-Par 12]
  • Topology-to-pattern matching (TreeMatch)
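The objective these approaches minimize can be written as a double sum over the communication matrix and a core-to-core distance matrix. Below is a minimal C sketch of that cost function; the names and the row-major layout are assumptions for illustration, not the actual TreeMatch code.

#include <stddef.h>

/* comm[i][j]: amount of data exchanged between processes i and j.
 * dist[u][v]: cost of communicating between cores u and v.
 * map[i]:     core assigned to process i.
 * Placement algorithms search for a map[] minimizing this sum. */
double placement_cost(size_t nproc, size_t ncores,
                      const double *comm,  /* nproc  x nproc,  row-major */
                      const double *dist,  /* ncores x ncores, row-major */
                      const size_t *map)   /* process index -> core index */
{
    double cost = 0.0;
    for (size_t i = 0; i < nproc; i++)
        for (size_t j = 0; j < nproc; j++)
            cost += comm[i * nproc + j] * dist[map[i] * ncores + map[j]];
    return cost;
}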
SLIDE 27

Reduce Communication Cost?

But wait, my application is compute-bound! Well, this might not remain true in the future: strong scaling might not always be a solution.

[Figure taken from one of J. Dongarra's talks]

SLIDE 29

How to Bind Processes to a Core/Node?

The MPI standard does not specify process binding. Each distribution has its own solution:

  • MPICH2 (Hydra process manager): mpiexec -np 2 -binding cpu:sockets
  • Open MPI: mpiexec -np 64 -bind-to-board
  • etc.

You can also specify process binding with the numactl or taskset Unix commands on the mpirun command line: mpiexec -np 1 -host machine numactl --physcpubind=0 ./prg

SLIDE 30

Obtaining the Topology (Shared Memory)

HWLOC (portable hardware locality):

  • Developed by the Runtime and Open MPI teams
  • Portable abstraction (across OSes, versions, architectures, ...)
  • Hierarchical topology
  • Modern architectures (NUMA, cores, caches, etc.)
  • IDs of the cores
  • C library to play with (a minimal sketch below)
  • etc.
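A minimal usage sketch of the hwloc C API: discover the topology, count the cores, and bind the current process to one of them. Error handling is omitted and the choice of core 0 is arbitrary.

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);   /* allocate a topology context */
    hwloc_topology_load(topo);    /* discover the machine        */

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    printf("%d cores detected\n", ncores);

    /* Bind the whole process to the first core (arbitrary choice). */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (core)
        hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS);

    hwloc_topology_destroy(topo);
    return 0;
}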
SLIDE 31

HWLOC

http://www.open-mpi.org/projects/hwloc/

SLIDE 32

Obtaining the Topology (Distributed Memory)

Not always easy (a research issue). MPI has some routines to help obtain it. Sometimes it requires building a file that specifies node adjacency.

SLIDE 33

Getting the Communication Pattern

No automatic way so far… It can be done through application monitoring (one possible approach is sketched below):

  • during the execution
  • with a "blank execution"
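One common way to monitor an application is to intercept its point-to-point calls through MPI's PMPI profiling interface and accumulate, per peer, the volume of data exchanged. Below is a minimal sketch that only intercepts MPI_Send; MAX_RANKS is a made-up bound for illustration.

#include <mpi.h>

#define MAX_RANKS 4096                    /* assumed upper bound on ranks */
static long long bytes_sent_to[MAX_RANKS];

/* Intercept MPI_Send: record the traffic, then call the real routine. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);
    if (dest >= 0 && dest < MAX_RANKS)
        bytes_sent_to[dest] += (long long)count * type_size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

Dumping these counters at MPI_Finalize time gives, for each rank, one row of the communication matrix.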
SLIDE 34

Results

64 nodes linked with an InfiniBand interconnect (HCA: Mellanox Technologies MT26428 ConnectX IB QDR). Each node features two quad-core Intel Xeon Nehalem X5550 (2.66 GHz) processors.

  • 36% gain against the standard MPI policy
  • 400% gain against some graph partitioners

SLIDE 38

Conclusion of this Part

To ensure performance portability, one must take into account the topology of the target machine. Process placement according to application behavior and topology helps increase performance. Several open issues:

  • Communication pattern
  • Metrics
  • Dynamic adaptation
  • Faster algorithms
  • Integration into MPI (dist_graph_create + a new communicator; a sketch follows)
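As an illustration of the dist_graph_create route, here is a hedged sketch in which every rank declares a hypothetical ring communication pattern and allows the library to reorder ranks (reorder = 1). An MPI implementation is free to ignore the hint, so a topology-aware renumbering is not guaranteed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank communicates with its two ring neighbours. */
    int sources[1]      = { rank };
    int degrees[1]      = { 2 };
    int destinations[2] = { (rank + 1) % size, (rank + size - 1) % size };

    MPI_Comm ring;
    MPI_Dist_graph_create(MPI_COMM_WORLD, 1, sources, degrees, destinations,
                          MPI_UNWEIGHTED, MPI_INFO_NULL, 1 /* reorder */,
                          &ring);

    int new_rank;
    MPI_Comm_rank(ring, &new_rank);
    printf("world rank %d is rank %d in the reordered communicator\n",
           rank, new_rank);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}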

SLIDE 39

Thread placement on shared-memory

3

SLIDE 40

Multithreading

Multithreading is a good model for shared-memory machines.

[Figure, built up over several animation steps: a NUMA machine with two sockets, each with cores, private L1/L2 caches, a shared L3, a memory controller and local RAM, linked by an interconnect. A thread A and its memory pages start on one socket; the scheduler then moves the thread to the other socket while its pages stay in the first socket's local RAM.]

But threads and/or memory pages can move (a scheduler decision), so a thread may end up far from the memory it accesses.

SLIDE 49

Thread Binding

You cannot prevent memory pages from moving, but you can:

  • bind threads to nodes/cores (HWLOC)
  • allocate memory pages on a specific memory node

Several solutions are possible (a minimal sketch is given below).

[Figure, repeated over several animation steps: the same two-socket NUMA machine, showing different combinations of thread and page placement.]
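A minimal sketch of one such solution with hwloc: pin the calling thread to a given core and ask that its future allocations be served by that core's local NUMA node. The function name and the binding choices are illustrative, not the deck's actual code.

#include <hwloc.h>

/* Bind the calling thread to core number core_index and set its memory
 * allocation policy to the NUMA node(s) covering that core. */
static void bind_this_thread(hwloc_topology_t topo, unsigned core_index)
{
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, core_index);
    if (!core)
        return;

    /* Pin only this thread, not the whole process. */
    hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);

    /* Future memory allocations of this thread go to the local node. */
    hwloc_set_membind(topo, core->cpuset, HWLOC_MEMBIND_BIND,
                      HWLOC_MEMBIND_THREAD);
}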

SLIDE 55

Example on the Tiled Version of the Dense Cholesky Factorization

[Figure: a 5 x 5 grid of tiles (0,0) to (4,4), each tile updated by one of the kernels DPOTRF, DTRSM, DSYRK or DGEMM]
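For reference, a hedged sketch of the tile loop this figure corresponds to. The tile-kernel wrappers (dpotrf_tile and friends) are assumed helpers around the BLAS/LAPACK kernels named on the slide; the dependencies between these calls form the task graph that the runtime system schedules.

/* Assumed tile-kernel wrappers (declarations only). */
void dpotrf_tile(double *Akk);
void dtrsm_tile(const double *Akk, double *Amk);
void dsyrk_tile(const double *Amk, double *Amm);
void dgemm_tile(const double *Amk, const double *Ank, double *Amn);

/* Right-looking tiled Cholesky on an NT x NT grid of tiles.
 * A[m * NT + n] points to tile (m, n). */
void tile_cholesky(int NT, double **A)
{
    for (int k = 0; k < NT; k++) {
        dpotrf_tile(A[k * NT + k]);                      /* diagonal tile   */
        for (int m = k + 1; m < NT; m++)
            dtrsm_tile(A[k * NT + k], A[m * NT + k]);    /* panel below     */
        for (int m = k + 1; m < NT; m++) {
            dsyrk_tile(A[m * NT + k], A[m * NT + m]);    /* diagonal update */
            for (int n = k + 1; n < m; n++)
                dgemm_tile(A[m * NT + k], A[n * NT + k], /* trailing update */
                           A[m * NT + n]);
        }
    }
}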


SLIDE 57

Example on the Tiled Version of the Dense Cholesky Factorization

On a 20-node, 8-cores-per-node shared-memory machine: 1 pool of threads vs. 20 pools of threads.

[Plot comparing the two configurations: a clear performance degradation appears.]

SLIDE 59

What's Happening at N=64000?

Problem: the system has 600 GB of RAM, but we start swapping at N=64000 (about 30 GB of data).

SLIDE 60

What's Happening at N=64000?

Each NUMA node has 30 GB of RAM. With malloc, pages are placed on the first node that touches (writes) them:

double *A = malloc(N * LDA * sizeof(double)); fill(A);

[Figure: since a single thread fills the matrix, all pages of A land in one node's local RAM, which overflows and forces the system to swap.]

SLIDE 64

What's Happening at N=64000?

Solution:

  • use multithreaded I/O to create the matrix in tiled format
  • allocate the pages of the matrix in a round-robin fashion across the NUMA nodes (numa_alloc_interleaved)

double *A = numa_alloc_interleaved(N * LDA * sizeof(double)); fill(A);

[Figure: the pages of A are now spread over the local RAM of all the nodes.]
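A self-contained sketch contrasting the two allocation strategies with libnuma (link with -lnuma). N and LDA reuse the hypothetical dimension from the slides, so running it as-is needs a machine with enough memory.

#include <numa.h>      /* libnuma */
#include <stdio.h>
#include <stdlib.h>

#define N   64000L     /* dimensions from the slides; reduce them to run */
#define LDA 64000L     /* on a smaller machine                           */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    size_t bytes = (size_t)N * LDA * sizeof(double);

    /* First-touch (the problematic case): with malloc, each page lands on
     * the NUMA node of the thread that first writes it, here one node.   */
    /* double *A = malloc(bytes); */

    /* Interleaved allocation: pages are spread round-robin over all NUMA
     * nodes, so no single node's local RAM is exhausted.                 */
    double *A = numa_alloc_interleaved(bytes);
    if (!A) { fprintf(stderr, "allocation failed\n"); return 1; }

    for (size_t i = 0; i < (size_t)N * LDA; i++)
        A[i] = 0.0;                        /* stands in for fill(A) */

    numa_free(A, bytes);
    return 0;
}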

SLIDE 68

Example on the Tiled Version of the Dense Cholesky Factorization

1 group of threads vs. 20 groups of threads, now with NUMA-aware data binding and allocation.

[Plot comparing the two configurations after the fix.]

SLIDE 70

Comparison with Block-Cyclic Distribution

[Plot: comparison with a block-cyclic data distribution, P=2, Q=3.]

SLIDE 72

Conclusion of this Part

Shared-memory machines give the illusion of a flat address space, but:

  • data allocation
  • data movement

have a huge impact on performance. We are lacking models and tools to hide/expose this complexity.

SLIDE 73

Conclusion

4

SLIDE 74

Take-Away Message

Performance portability is difficult! Parallel machines are more complex than you may think. It is important to take care of process/thread/data placement: big gains can be achieved. Good news: this does not require changing the application or the algorithm. A lot of work is still required either:

  • to hide complexity from the user while keeping expressivity, or
  • to expose what is necessary to ensure performance
SLIDE 75

Thanks!

Uppsala Spring School, 2013

www.inria.fr