Employing MPI Collectives for Timing Analysis on Embedded Multi-Cores - PowerPoint PPT Presentation

  1. Employing MPI Collectives for Timing Analysis on Embedded Multi-Cores – Martin Frieb, Alexander Stegmeier, Jörg Mische, Theo Ungerer – Department of Computer Science, University of Augsburg – 16th International Workshop on Worst-Case Execution Time Analysis, July 5, 2016

  2. Motivation – Björn Lisper, WCET 2012: "Towards Parallel Programming Models for Predictability" – Shared memory does not scale ⇒ replace it with distributed memory – Replace the bus with a Network-on-Chip (NoC) – Learn from parallel programming models, e.g. Bulk Synchronous Programming (BSP): execute the program in supersteps: 1. local computation, 2. global communication, 3. barrier

  3. MPI programs – MPI programs follow a similar programming model – At a collective operation, all cores (or a group of them) work together: local computation, followed by communication ⇒ implicit barrier – One core coordinates and distributes work (master), the others compute (slaves) – Examples: barrier, broadcast, global sum (see the sketch below)
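The three example collectives map onto standard MPI calls; the following is a generic, self-contained sketch for illustration only (the broadcast value 42 and the per-rank contributions are made up, not taken from the slides):

```c
/* Illustrative sketch of the collectives named on the slide:
 * barrier, broadcast and a global sum (MPI_Allreduce).
 * Generic MPI example, not code from the analysed programs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);                 /* all cores synchronize */

    int n = 0;
    if (rank == 0)                               /* master provides a value ... */
        n = 42;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* ... and broadcasts it */

    int local = rank + 1, sum = 0;               /* each core contributes a value */
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); /* global sum */

    printf("rank %d of %d: n = %d, sum = %d\n", rank, size, n, sum);
    MPI_Finalize();
    return 0;
}
```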

  4. Outline – Background – Timing Analysis of MPI Collective Operations – Case Study: Timing Analysis of the CG Benchmark – Summary and Outlook

  5. Underlying Architecture (figure: many-core with numbered callouts) – Small and simple cores – Statically scheduled network – Network interface – Local core memory – I/O connection – Distributed memory – Task + Network Analysis = WCET [Metzlaff et al.: A Real-Time Capable Many-Core Model, RTSS-WiP 2012]

  6. Structure of an MPI program – Same sequential code on all cores (figure: timeline with phases A–D) – (A) Barrier after initialization – (B) Data exchange – (C) Data exchange – (D) Global operation – a structural skeleton follows below
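A structural skeleton of phases (A)–(D) in C + MPI might look as follows; the ring neighbours, the buffer size of 16 and the reduced value are placeholders rather than anything from the analysed programs:

```c
/* Structural sketch of the slide's phases (A)-(D); buffer sizes,
 * neighbour ranks and the reduced value are placeholders. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local = 0.0, global = 0.0;
    double sendbuf[16] = {0}, recvbuf[16];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* (A) barrier after initialization */
    MPI_Barrier(MPI_COMM_WORLD);

    int next = (rank + 1) % size, prev = (rank + size - 1) % size;

    /* (B) data exchange with a neighbouring core */
    MPI_Sendrecv(sendbuf, 16, MPI_DOUBLE, next, 0,
                 recvbuf, 16, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* (C) second data exchange, opposite direction */
    MPI_Sendrecv(sendbuf, 16, MPI_DOUBLE, prev, 1,
                 recvbuf, 16, MPI_DOUBLE, next, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* (D) global operation */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```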

  7. Outline – Background – Timing Analysis of MPI Collective Operations – Case Study: Timing Analysis of the CG Benchmark – Summary and Outlook

  8. Structure of MPI Allreduce – Global reduction operation, broadcasts the result afterwards (figure: sequence diagram between master, slave 1 and slave 2 with phases A–G) – (A) Initialization – (B) Acknowledgement – (C) Data structure initialization – (D) Send values – (E) Collect and store values – (F) Apply global operation – (G) Broadcast result – WCET = Σ A to G
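Read literally, phases (A)–(G) suggest a gather-then-broadcast scheme built from point-to-point messages between the master and the slaves. The sketch below follows that reading for an integer sum; it is only an illustration, not the library implementation analysed in the paper:

```c
/* Minimal sketch of a master/slave allreduce built from point-to-point
 * messages, following the phases (A)-(G) on the slide. The library
 * implementation analysed in the paper may differ in detail. */
#include <mpi.h>

/* Global integer sum over all ranks; the result is returned on every rank. */
int sketch_allreduce_sum(int myval, MPI_Comm comm)
{
    int rank, size, result;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* (A)/(B) initialization and acknowledgement, represented here
     * by a plain barrier */
    MPI_Barrier(comm);

    if (rank == 0) {
        /* (C) master initializes its data structure */
        result = myval;
        /* (E) collect values from the slaves and (F) apply the
         * global operation (here: sum) */
        for (int src = 1; src < size; src++) {
            int v;
            MPI_Recv(&v, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
            result += v;
        }
        /* (G) broadcast the result back to the slaves */
        for (int dst = 1; dst < size; dst++)
            MPI_Send(&result, 1, MPI_INT, dst, 1, comm);
    } else {
        /* (D) slaves send their values to the master */
        MPI_Send(&myval, 1, MPI_INT, 0, 0, comm);
        /* (G) ... and wait for the broadcast result */
        MPI_Recv(&result, 1, MPI_INT, 0, 1, comm, MPI_STATUS_IGNORE);
    }
    return result;
}
```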

  9. Analysis of MPI Allreduce – WCET of the sequential parts estimated with OTAWA – Worst-case traversal time (WCTT) of the communication parts has to be added – Result: an equation with parameters: #values to be transmitted, #communication partners, dimensions of the NoC, transportation times, time between core and NoC – The equation can be reused for any application on the same architecture (an illustrative parameterisation follows below)
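The concrete equation is given in the paper. Purely to show how such a parameterised bound can be evaluated and reused, one could collect the listed parameters in a structure and combine them; the linear form below is hypothetical and is not the authors' equation:

```c
/* Hypothetical sketch of how the parameters listed on the slide could
 * enter a reusable WCET/WCTT equation. The structure and coefficients
 * of the real equation are given in the paper; this linear form is
 * only illustrative. */
typedef struct {
    unsigned n_values;     /* #values to be transmitted                  */
    unsigned n_partners;   /* #communication partners                    */
    unsigned noc_x, noc_y; /* dimensions of the NoC mesh                 */
    unsigned t_hop;        /* transportation time per hop in the NoC     */
    unsigned t_inject;     /* time between core and NoC (inject + eject) */
} wctt_params_t;

/* Illustrative bound: every partner sends all values across the longest
 * mesh path, plus injection/ejection overhead, added to the WCET of the
 * sequential parts (e.g. obtained with OTAWA). */
unsigned long wctt_bound(const wctt_params_t *p, unsigned long wcet_seq)
{
    unsigned max_hops = (p->noc_x - 1) + (p->noc_y - 1);  /* longest mesh path */
    unsigned long per_msg = (unsigned long)p->t_inject
                          + (unsigned long)max_hops * p->t_hop;
    unsigned long comm = (unsigned long)p->n_partners * p->n_values * per_msg;
    return wcet_seq + comm;
}
```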

  10. Analysis of MPI Sendrecv – Purpose: simultaneous send and receive to avoid deadlock – Often used for data exchange: pass data to the next core – Structure: initialization, acknowledgement, sending and receiving of values – Result: equation with parameters: #values to be transmitted, transportation times, time between core and NoC – The equation can be reused for any application on the same architecture
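A generic usage sketch of MPI Sendrecv for such a data exchange (the ring topology and the block size of 32 are placeholders, not taken from the benchmark):

```c
/* Generic usage sketch of MPI_Sendrecv for a ring-style data exchange:
 * each core passes its block to the next core and receives the block of
 * the previous one in a single, deadlock-free call. */
#include <mpi.h>

#define BLOCK 32   /* placeholder block size */

void ring_exchange(double *mine, double *incoming, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int next = (rank + 1) % size;          /* core we send to      */
    int prev = (rank + size - 1) % size;   /* core we receive from */

    MPI_Sendrecv(mine,     BLOCK, MPI_DOUBLE, next, 0,
                 incoming, BLOCK, MPI_DOUBLE, prev, 0,
                 comm, MPI_STATUS_IGNORE);
}
```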

  11. Outline – Background – Timing Analysis of MPI Collective Operations – Case Study: Timing Analysis of the CG Benchmark – Summary and Outlook

  12. The CG Benchmark – Conjugate Gradient method from mathematics – Optimization method to find the minimum/maximum of a multidimensional function – Operations on a large matrix, distributed over several cores – Cores exchange data a number of times – Taken from the NAS Parallel Benchmark Suite for highly parallel systems – Adapted for C + MPI
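To connect the case study with the operations above: in a CG-style iteration, the neighbour data exchange can be expressed with MPI Sendrecv and the global dot products with MPI Allreduce. The sketch below is a simplified placeholder (data distribution, sizes and the local computation are made up) and does not reproduce the NAS CG code:

```c
/* Rough sketch of where the analysed operations appear in a CG-style
 * iteration: neighbour data exchange via MPI_Sendrecv and global dot
 * products via MPI_Allreduce. Distribution, sizes and the loop body
 * are simplified placeholders, not the NAS CG code. */
#include <mpi.h>

#define N_LOCAL  64   /* part of the vector held by this core (placeholder) */
#define MAX_ITER 25   /* number of CG iterations (placeholder)              */

void cg_sketch(double p[N_LOCAL], double q[N_LOCAL], double halo[N_LOCAL],
               MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;

    MPI_Barrier(comm);                      /* barrier after initialization */

    for (int it = 0; it < MAX_ITER; it++) {
        /* data exchange: pass local data to the next core */
        MPI_Sendrecv(p,    N_LOCAL, MPI_DOUBLE, next, 0,
                     halo, N_LOCAL, MPI_DOUBLE, prev, 0,
                     comm, MPI_STATUS_IGNORE);

        /* stand-in for the local part of the matrix-vector product */
        for (int i = 0; i < N_LOCAL; i++)
            q[i] = 2.0 * p[i] + halo[i];

        /* global dot product = local dot product + global sum */
        double local_dot = 0.0, dot = 0.0;
        for (int i = 0; i < N_LOCAL; i++)
            local_dot += p[i] * q[i];
        MPI_Allreduce(&local_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, comm);

        /* ... vector updates using dot omitted ... */
        (void)dot;
    }
}
```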
