Processes and Threads Placement of Parallel Applications: Why, How and for What Gain? (PowerPoint Presentation)



SLIDE 1

Processes and Threads Placement of Parallel Applications. Why, How and for What Gain?

Joint work with: Guillaume Mercier, François Tessier, Brice Goglin, Emmanuel Agullo, George Bosilca. COST Spring School, Uppsala

Emmanuel Jeannot, Runtime Team
Inria Bordeaux Sud-Ouest, June 4, 2013

SLIDE 2

Runtime Systems and the Inria Runtime Team

1


SLIDE 8

Software Stack

  • Applications
  • Programming models: enable and express parallelism, give an abstraction of the parallel machine
  • Compilers: static optimization, parallelism extraction
  • Libraries: optimize computational kernels
  • Runtime systems: dynamic optimization
  • Operating systems: hardware abstraction, basic services
  • Hardware

SLIDE 9

Runtime System

  • Scheduling
  • Parallelism orchestration (Comm. Synchronization)
  • I/O
  • Reliability and resilience
  • Collective communication routing
  • Migration
  • Data and task/process/thread placement
  • etc.
SLIDE 10

Runtime Team

Inria team. Goal: enable performance portability by improving interface expressivity. Success stories:

  • MPICH2 (Nemesis kernel)
  • KNEM (enabling high-performance intra-node MPI communication for large messages)
  • StarPU (unified runtime system for CPU and GPU program execution)
  • HWLOC (portable hardware locality)
SLIDE 11

Process Placement

2

SLIDE 12

MPI (Process-based runtime systems)

Performance of MPI programs depends on many factors that can be adjusted when you change machines:

  • Implementation of the standard (e.g. collective com.)
  • Parallel algorithm(s)
  • Implementation of the algorithm
  • Underlying libraries (e.g. BLAS)
  • Hardware (processors, cache, network)
  • etc.

But…

SLIDE 13

Process Placement

The MPI model makes little (no?) assumption on the way processes are mapped to resources. It is often assumed that the network topology is flat and hence that the process mapping has little impact on performance.

[Figure: the flat model: four CPUs, each with its own memory, connected by an interconnection network]

SLIDE 14

The Topology is not Flat

Due to multicore processors, current and future parallel machines are hierarchical. Communication speed depends on:

  • Emitter and receiver
  • Cache hierarchy
  • Memory bus
  • Interconnection network
  • etc.

Almost nothing in the MPI standard helps to handle these factors.

SLIDE 15

Example on a Parallel Machine

The higher we have to go in the hierarchy, the costlier the data exchange.

[Figure, built up over several animation steps: within a socket, four cores with private L1/L2 caches share an L3 cache, a memory controller and local RAM over a bus; several such sockets are linked by an intra-node interconnect; several nodes, each with its own memory controller and local RAM, communicate through NICs over the network.]

The network can also be hierarchical!

SLIDE 23

Rationale

Not all the processes exchange the same amount of data. The speed of the communications, and hence the performance of the application, depends on the way processes are mapped to resources.

SLIDE 24

Do we Really Care: to Bind or not to Bind?

After all, the system scheduler is able to move processes when needed. Yes, but only within a shared-memory system. Migration is possible but it is not in the MPI standard (see Charm++). Moreover, binding provides better run-to-run execution time stability.

[Benchmark: Zeus MHD Blast, 64 processes/cores, MVAPICH2 1.8 + ICC]

SLIDE 26

Process Placement Problem

Given:

  • the parallel machine topology
  • the process affinity (communication pattern)

map processes to resources (cores) so as to reduce the communication cost. A nice algorithmic problem (the cost being minimized is sketched below), tackled for instance by:

  • Graph partitioning (Scotch, METIS)
  • Application tuning [Aktulga et al. Euro-Par 12]
  • Topology-to-pattern matching (TreeMatch)
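The objective these approaches minimize can be written as a double sum over the communication matrix and a core-to-core distance matrix. Below is a minimal C sketch of that cost function; the names and the row-major layout are assumptions for illustration, not the actual TreeMatch code.

#include <stddef.h>

/* comm[i][j]: amount of data exchanged between processes i and j.
 * dist[u][v]: cost of communicating between cores u and v.
 * map[i]:     core assigned to process i.
 * Placement algorithms search for a map[] minimizing this sum. */
double placement_cost(size_t nproc, size_t ncores,
                      const double *comm,  /* nproc  x nproc,  row-major */
                      const double *dist,  /* ncores x ncores, row-major */
                      const size_t *map)   /* process index -> core index */
{
    double cost = 0.0;
    for (size_t i = 0; i < nproc; i++)
        for (size_t j = 0; j < nproc; j++)
            cost += comm[i * nproc + j] * dist[map[i] * ncores + map[j]];
    return cost;
}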
SLIDE 27

Reduce Communication Cost?

But wait, my application is compute-bound! Well, this might not remain true in the future: strong scaling might not always be a solution.

[Figure taken from one of J. Dongarra's talks]

SLIDE 29

How to Bind Processes to a Core/Node?

The MPI standard does not specify process binding. Each distribution has its own solution:

  • MPICH2 (Hydra process manager): mpiexec -np 2 -binding cpu:sockets
  • Open MPI: mpiexec -np 64 -bind-to-board
  • etc.

You can also specify process binding with the numactl or taskset Unix commands on the mpirun command line: mpiexec -np 1 -host machine numactl --physcpubind=0 ./prg

SLIDE 30

Obtaining the Topology (Shared Memory)

HWLOC (portable hardware locality):

  • Developed by the Runtime and Open MPI teams
  • Portable abstraction (across OSes, versions, architectures, ...)
  • Hierarchical topology
  • Modern architectures (NUMA, cores, caches, etc.)
  • IDs of the cores
  • C library to play with (a minimal sketch below)
  • etc.
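A minimal usage sketch of the hwloc C API: discover the topology, count the cores, and bind the current process to one of them. Error handling is omitted and the choice of core 0 is arbitrary.

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);   /* allocate a topology context */
    hwloc_topology_load(topo);    /* discover the machine        */

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    printf("%d cores detected\n", ncores);

    /* Bind the whole process to the first core (arbitrary choice). */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (core)
        hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS);

    hwloc_topology_destroy(topo);
    return 0;
}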
SLIDE 31

HWLOC

http://www.open-mpi.org/projects/hwloc/

SLIDE 32

Obtaining the Topology (Distributed Memory)

Not always easy (a research issue). MPI has some routines to help obtain it. Sometimes it requires building a file that specifies node adjacency.

SLIDE 33

Getting the Communication Pattern

No automatic way so far… It can be done through application monitoring (one possible approach is sketched below):

  • during the execution
  • with a "blank execution"
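One common way to monitor an application is to intercept its point-to-point calls through MPI's PMPI profiling interface and accumulate, per peer, the volume of data exchanged. Below is a minimal sketch that only intercepts MPI_Send; MAX_RANKS is a made-up bound for illustration.

#include <mpi.h>

#define MAX_RANKS 4096                    /* assumed upper bound on ranks */
static long long bytes_sent_to[MAX_RANKS];

/* Intercept MPI_Send: record the traffic, then call the real routine. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);
    if (dest >= 0 && dest < MAX_RANKS)
        bytes_sent_to[dest] += (long long)count * type_size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

Dumping these counters at MPI_Finalize time gives, for each rank, one row of the communication matrix.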
SLIDE 34

Results

64 nodes linked with an InfiniBand interconnect (HCA: Mellanox Technologies MT26428 ConnectX IB QDR). Each node features two quad-core Intel Xeon Nehalem X5550 (2.66 GHz) processors.

  • 36% gain against the standard MPI policy
  • 400% gain against some graph partitioners

SLIDE 38

Conclusion of this Part

To ensure performance portability, one must take into account the topology of the target machine. Process placement according to application behavior and topology helps increase performance. Several open issues:

  • Communication pattern
  • Metrics
  • Dynamic adaptation
  • Faster algorithms
  • Integration into MPI (dist_graph_create + a new communicator; a sketch follows)
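As an illustration of the dist_graph_create route, here is a hedged sketch in which every rank declares a hypothetical ring communication pattern and allows the library to reorder ranks (reorder = 1). An MPI implementation is free to ignore the hint, so a topology-aware renumbering is not guaranteed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank communicates with its two ring neighbours. */
    int sources[1]      = { rank };
    int degrees[1]      = { 2 };
    int destinations[2] = { (rank + 1) % size, (rank + size - 1) % size };

    MPI_Comm ring;
    MPI_Dist_graph_create(MPI_COMM_WORLD, 1, sources, degrees, destinations,
                          MPI_UNWEIGHTED, MPI_INFO_NULL, 1 /* reorder */,
                          &ring);

    int new_rank;
    MPI_Comm_rank(ring, &new_rank);
    printf("world rank %d is rank %d in the reordered communicator\n",
           rank, new_rank);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}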

SLIDE 39

Thread placement on shared-memory

3

SLIDE 40

Multithreading

Multithreading is a good model for shared-memory machines.

[Figure, built up over several animation steps: a NUMA machine with two sockets, each with cores, private L1/L2 caches, a shared L3, a memory controller and local RAM, linked by an interconnect. A thread A and its memory pages start on one socket; the scheduler then moves the thread to the other socket while its pages stay in the first socket's local RAM.]

But threads and/or memory pages can move (a scheduler decision), so a thread may end up far from the memory it accesses.

SLIDE 49

Thread Binding

You cannot prevent memory pages from moving, but you can:

  • bind threads to nodes/cores (HWLOC)
  • allocate memory pages on a specific memory node

Several solutions are possible (a minimal sketch is given below).

[Figure, repeated over several animation steps: the same two-socket NUMA machine, showing different combinations of thread and page placement.]
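A minimal sketch of one such solution with hwloc: pin the calling thread to a given core and ask that its future allocations be served by that core's local NUMA node. The function name and the binding choices are illustrative, not the deck's actual code.

#include <hwloc.h>

/* Bind the calling thread to core number core_index and set its memory
 * allocation policy to the NUMA node(s) covering that core. */
static void bind_this_thread(hwloc_topology_t topo, unsigned core_index)
{
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, core_index);
    if (!core)
        return;

    /* Pin only this thread, not the whole process. */
    hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);

    /* Future memory allocations of this thread go to the local node. */
    hwloc_set_membind(topo, core->cpuset, HWLOC_MEMBIND_BIND,
                      HWLOC_MEMBIND_THREAD);
}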

SLIDE 55

Example on the Tiled Version of the Dense Cholesky Factorization

[Figure: a 5 x 5 grid of tiles (0,0) to (4,4), each tile updated by one of the kernels DPOTRF, DTRSM, DSYRK or DGEMM]
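For reference, a hedged sketch of the tile loop this figure corresponds to. The tile-kernel wrappers (dpotrf_tile and friends) are assumed helpers around the BLAS/LAPACK kernels named on the slide; the dependencies between these calls form the task graph that the runtime system schedules.

/* Assumed tile-kernel wrappers (declarations only). */
void dpotrf_tile(double *Akk);
void dtrsm_tile(const double *Akk, double *Amk);
void dsyrk_tile(const double *Amk, double *Amm);
void dgemm_tile(const double *Amk, const double *Ank, double *Amn);

/* Right-looking tiled Cholesky on an NT x NT grid of tiles.
 * A[m * NT + n] points to tile (m, n). */
void tile_cholesky(int NT, double **A)
{
    for (int k = 0; k < NT; k++) {
        dpotrf_tile(A[k * NT + k]);                      /* diagonal tile   */
        for (int m = k + 1; m < NT; m++)
            dtrsm_tile(A[k * NT + k], A[m * NT + k]);    /* panel below     */
        for (int m = k + 1; m < NT; m++) {
            dsyrk_tile(A[m * NT + k], A[m * NT + m]);    /* diagonal update */
            for (int n = k + 1; n < m; n++)
                dgemm_tile(A[m * NT + k], A[n * NT + k], /* trailing update */
                           A[m * NT + n]);
        }
    }
}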


SLIDE 57

Example on the Tiled Version of the Dense Cholesky Factorization

On a 20-node, 8-cores-per-node shared-memory machine: 1 pool of threads vs. 20 pools of threads.

[Plot comparing the two configurations: a clear performance degradation appears.]

SLIDE 59

What's Happening at N=64000?

Problem: the system has 600 GB of RAM, but we start swapping at N=64000 (about 30 GB of data).

SLIDE 60

What's Happening at N=64000?

Each NUMA node has 30 GB of RAM. With malloc, pages are placed on the first node that touches (writes) them:

double *A = malloc(N * LDA * sizeof(double)); fill(A);

[Figure: since a single thread fills the matrix, all pages of A land in one node's local RAM, which overflows and forces the system to swap.]

SLIDE 64

What's Happening at N=64000?

Solution:

  • use multithreaded I/O to create the matrix in tiled format
  • allocate the pages of the matrix in a round-robin fashion across the NUMA nodes (numa_alloc_interleaved)

double *A = numa_alloc_interleaved(N * LDA * sizeof(double)); fill(A);

[Figure: the pages of A are now spread over the local RAM of all the nodes.]
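A self-contained sketch contrasting the two allocation strategies with libnuma (link with -lnuma). N and LDA reuse the hypothetical dimension from the slides, so running it as-is needs a machine with enough memory.

#include <numa.h>      /* libnuma */
#include <stdio.h>
#include <stdlib.h>

#define N   64000L     /* dimensions from the slides; reduce them to run */
#define LDA 64000L     /* on a smaller machine                           */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    size_t bytes = (size_t)N * LDA * sizeof(double);

    /* First-touch (the problematic case): with malloc, each page lands on
     * the NUMA node of the thread that first writes it, here one node.   */
    /* double *A = malloc(bytes); */

    /* Interleaved allocation: pages are spread round-robin over all NUMA
     * nodes, so no single node's local RAM is exhausted.                 */
    double *A = numa_alloc_interleaved(bytes);
    if (!A) { fprintf(stderr, "allocation failed\n"); return 1; }

    for (size_t i = 0; i < (size_t)N * LDA; i++)
        A[i] = 0.0;                        /* stands in for fill(A) */

    numa_free(A, bytes);
    return 0;
}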

SLIDE 68

Example on the Tiled Version of the Dense Cholesky Factorization

1 group of threads vs. 20 groups of threads, now with NUMA-aware data binding and allocation.

[Plot comparing the two configurations after the fix.]

SLIDE 70

Comparison with Block-Cyclic Distribution

[Plot: comparison with a block-cyclic data distribution, P=2, Q=3.]

SLIDE 72

Conclusion of this Part

Shared-memory machines give the illusion of a flat address space, but:

  • data allocation
  • data movement

have a huge impact on performance. We are lacking models and tools to hide/expose this complexity.

SLIDE 73

Conclusion

4

SLIDE 74

Take-Away Message

Performance portability is difficult! Parallel machines are more complex than you may think. It is important to take care of process/thread/data placement: big gains can be achieved. Good news: this does not require changing the application or the algorithm. A lot of work is still required either:

  • to hide complexity from the user while keeping expressivity, or
  • to expose what is necessary to ensure performance
SLIDE 75

Thanks!

Uppsala Spring School, 2013

www.inria.fr