Employing MPI Collectives for Timing Analysis on Embedded Multi-Cores - PowerPoint PPT Presentation

  1. Employing MPI Collectives for Timing Analysis on Embedded Multi-Cores – Martin Frieb, Alexander Stegmeier, Jörg Mische, Theo Ungerer – Department of Computer Science, University of Augsburg – 16th International Workshop on Worst-Case Execution Time Analysis, July 5, 2016

  2. Motivation – Björn Lisper, WCET 2012: "Towards Parallel Programming Models for Predictability" – Shared memory does not scale ⇒ replace it with distributed memory – Replace the bus with a Network-on-Chip (NoC) – Learn from parallel programming models, e.g. Bulk Synchronous Programming (BSP): execute the program in supersteps: 1. local computation, 2. global communication, 3. barrier

  3. MPI programs – MPI programs follow a similar programming model – At a collective operation, all cores (or a group of them) work together: local computation, followed by communication ⇒ implicit barrier – One core coordinates and distributes work (master), the others compute (slaves) – Examples: barrier, broadcast, global sum (see the sketch below)
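The three example collectives map onto standard MPI calls; the following is a generic, self-contained sketch for illustration only (the broadcast value 42 and the per-rank contributions are made up, not taken from the slides):

```c
/* Illustrative sketch of the collectives named on the slide:
 * barrier, broadcast and a global sum (MPI_Allreduce).
 * Generic MPI example, not code from the analysed programs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);                 /* all cores synchronize */

    int n = 0;
    if (rank == 0)                               /* master provides a value ... */
        n = 42;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* ... and broadcasts it */

    int local = rank + 1, sum = 0;               /* each core contributes a value */
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); /* global sum */

    printf("rank %d of %d: n = %d, sum = %d\n", rank, size, n, sum);
    MPI_Finalize();
    return 0;
}
```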

  4. Outline – Background – Timing Analysis of MPI Collective Operations – Case Study: Timing Analysis of the CG Benchmark – Summary and Outlook

  5. Underlying Architecture (figure: many-core with numbered callouts) – Small and simple cores – Statically scheduled network – Network interface – Local core memory – I/O connection – Distributed memory – Task + Network Analysis = WCET [Metzlaff et al.: A Real-Time Capable Many-Core Model, RTSS-WiP 2012]

  6. Structure of an MPI program – Same sequential code on all cores (figure: timeline with phases A–D) – (A) Barrier after initialization – (B) Data exchange – (C) Data exchange – (D) Global operation – a structural skeleton follows below
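A structural skeleton of phases (A)–(D) in C + MPI might look as follows; the ring neighbours, the buffer size of 16 and the reduced value are placeholders rather than anything from the analysed programs:

```c
/* Structural sketch of the slide's phases (A)-(D); buffer sizes,
 * neighbour ranks and the reduced value are placeholders. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local = 0.0, global = 0.0;
    double sendbuf[16] = {0}, recvbuf[16];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* (A) barrier after initialization */
    MPI_Barrier(MPI_COMM_WORLD);

    int next = (rank + 1) % size, prev = (rank + size - 1) % size;

    /* (B) data exchange with a neighbouring core */
    MPI_Sendrecv(sendbuf, 16, MPI_DOUBLE, next, 0,
                 recvbuf, 16, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* (C) second data exchange, opposite direction */
    MPI_Sendrecv(sendbuf, 16, MPI_DOUBLE, prev, 1,
                 recvbuf, 16, MPI_DOUBLE, next, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* (D) global operation */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```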

  7. Outline – Background – Timing Analysis of MPI Collective Operations – Case Study: Timing Analysis of the CG Benchmark – Summary and Outlook

  8. Structure of MPI Allreduce – Global reduction operation, broadcasts the result afterwards (figure: sequence diagram between master, slave 1 and slave 2 with phases A–G) – (A) Initialization – (B) Acknowledgement – (C) Data structure initialization – (D) Send values – (E) Collect and store values – (F) Apply global operation – (G) Broadcast result – WCET = Σ A to G
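Read literally, phases (A)–(G) suggest a gather-then-broadcast scheme built from point-to-point messages between the master and the slaves. The sketch below follows that reading for an integer sum; it is only an illustration, not the library implementation analysed in the paper:

```c
/* Minimal sketch of a master/slave allreduce built from point-to-point
 * messages, following the phases (A)-(G) on the slide. The library
 * implementation analysed in the paper may differ in detail. */
#include <mpi.h>

/* Global integer sum over all ranks; the result is returned on every rank. */
int sketch_allreduce_sum(int myval, MPI_Comm comm)
{
    int rank, size, result;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* (A)/(B) initialization and acknowledgement, represented here
     * by a plain barrier */
    MPI_Barrier(comm);

    if (rank == 0) {
        /* (C) master initializes its data structure */
        result = myval;
        /* (E) collect values from the slaves and (F) apply the
         * global operation (here: sum) */
        for (int src = 1; src < size; src++) {
            int v;
            MPI_Recv(&v, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
            result += v;
        }
        /* (G) broadcast the result back to the slaves */
        for (int dst = 1; dst < size; dst++)
            MPI_Send(&result, 1, MPI_INT, dst, 1, comm);
    } else {
        /* (D) slaves send their values to the master */
        MPI_Send(&myval, 1, MPI_INT, 0, 0, comm);
        /* (G) ... and wait for the broadcast result */
        MPI_Recv(&result, 1, MPI_INT, 0, 1, comm, MPI_STATUS_IGNORE);
    }
    return result;
}
```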

  9. Analysis of MPI Allreduce – WCET of the sequential parts estimated with OTAWA – Worst-case traversal time (WCTT) of the communication parts has to be added – Result: an equation with parameters: #values to be transmitted, #communication partners, dimensions of the NoC, transportation times, time between core and NoC – The equation can be reused for any application on the same architecture (an illustrative parameterisation follows below)
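The concrete equation is given in the paper. Purely to show how such a parameterised bound can be evaluated and reused, one could collect the listed parameters in a structure and combine them; the linear form below is hypothetical and is not the authors' equation:

```c
/* Hypothetical sketch of how the parameters listed on the slide could
 * enter a reusable WCET/WCTT equation. The structure and coefficients
 * of the real equation are given in the paper; this linear form is
 * only illustrative. */
typedef struct {
    unsigned n_values;     /* #values to be transmitted                  */
    unsigned n_partners;   /* #communication partners                    */
    unsigned noc_x, noc_y; /* dimensions of the NoC mesh                 */
    unsigned t_hop;        /* transportation time per hop in the NoC     */
    unsigned t_inject;     /* time between core and NoC (inject + eject) */
} wctt_params_t;

/* Illustrative bound: every partner sends all values across the longest
 * mesh path, plus injection/ejection overhead, added to the WCET of the
 * sequential parts (e.g. obtained with OTAWA). */
unsigned long wctt_bound(const wctt_params_t *p, unsigned long wcet_seq)
{
    unsigned max_hops = (p->noc_x - 1) + (p->noc_y - 1);  /* longest mesh path */
    unsigned long per_msg = (unsigned long)p->t_inject
                          + (unsigned long)max_hops * p->t_hop;
    unsigned long comm = (unsigned long)p->n_partners * p->n_values * per_msg;
    return wcet_seq + comm;
}
```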

  10. Analysis of MPI Sendrecv – Purpose: simultaneous send and receive to avoid deadlock – Often used for data exchange: pass data to the next core – Structure: initialization, acknowledgement, sending and receiving of values – Result: equation with parameters: #values to be transmitted, transportation times, time between core and NoC – The equation can be reused for any application on the same architecture
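A generic usage sketch of MPI Sendrecv for such a data exchange (the ring topology and the block size of 32 are placeholders, not taken from the benchmark):

```c
/* Generic usage sketch of MPI_Sendrecv for a ring-style data exchange:
 * each core passes its block to the next core and receives the block of
 * the previous one in a single, deadlock-free call. */
#include <mpi.h>

#define BLOCK 32   /* placeholder block size */

void ring_exchange(double *mine, double *incoming, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int next = (rank + 1) % size;          /* core we send to      */
    int prev = (rank + size - 1) % size;   /* core we receive from */

    MPI_Sendrecv(mine,     BLOCK, MPI_DOUBLE, next, 0,
                 incoming, BLOCK, MPI_DOUBLE, prev, 0,
                 comm, MPI_STATUS_IGNORE);
}
```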

  11. Outline – Background – Timing Analysis of MPI Collective Operations – Case Study: Timing Analysis of the CG Benchmark – Summary and Outlook

  12. The CG Benchmark – Conjugate Gradient method from mathematics – Optimization method to find the minimum/maximum of a multidimensional function – Operations on a large matrix, distributed over several cores – Cores exchange data a number of times – Taken from the NAS Parallel Benchmark Suite for highly parallel systems – Adapted for C + MPI
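To connect the case study with the operations above: in a CG-style iteration, the neighbour data exchange can be expressed with MPI Sendrecv and the global dot products with MPI Allreduce. The sketch below is a simplified placeholder (data distribution, sizes and the local computation are made up) and does not reproduce the NAS CG code:

```c
/* Rough sketch of where the analysed operations appear in a CG-style
 * iteration: neighbour data exchange via MPI_Sendrecv and global dot
 * products via MPI_Allreduce. Distribution, sizes and the loop body
 * are simplified placeholders, not the NAS CG code. */
#include <mpi.h>

#define N_LOCAL  64   /* part of the vector held by this core (placeholder) */
#define MAX_ITER 25   /* number of CG iterations (placeholder)              */

void cg_sketch(double p[N_LOCAL], double q[N_LOCAL], double halo[N_LOCAL],
               MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;

    MPI_Barrier(comm);                      /* barrier after initialization */

    for (int it = 0; it < MAX_ITER; it++) {
        /* data exchange: pass local data to the next core */
        MPI_Sendrecv(p,    N_LOCAL, MPI_DOUBLE, next, 0,
                     halo, N_LOCAL, MPI_DOUBLE, prev, 0,
                     comm, MPI_STATUS_IGNORE);

        /* stand-in for the local part of the matrix-vector product */
        for (int i = 0; i < N_LOCAL; i++)
            q[i] = 2.0 * p[i] + halo[i];

        /* global dot product = local dot product + global sum */
        double local_dot = 0.0, dot = 0.0;
        for (int i = 0; i < N_LOCAL; i++)
            local_dot += p[i] * q[i];
        MPI_Allreduce(&local_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, comm);

        /* ... vector updates using dot omitted ... */
        (void)dot;
    }
}
```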
