
DPG Frühjahrstagung - Darmstadt 17.03.2016 HK 54



  1. The CBM First-level Event Selector, Timeslice Building and Availability Studies • Helvi Hartmann, hhartmann@fias.uni-frankfurt.de • Prof. Dr. Volker Lindenstruth • DPG Frühjahrstagung - Darmstadt 17.03.2016 HK 54 • FIAS Frankfurt Institute for Advanced Studies, Goethe-Universität Frankfurt am Main, Germany • CBM • http://compeng.uni-frankfurt.de

  2. Introduction • CBM detectors are untriggered • Challenge: free-streaming data, expected data rate of ~1 TB/s • Online event reconstruction using timeslices [Diagram: FLES timeslice building. Labels: microslice (MS), timeslice component, overlap, timeslice; input and compute nodes connected via InfiniBand; index labels 0/0 … 100/0, 0/1, and 0/1000 … 100/1000.]

  3. [Diagram: data path in three stages: input interface, timeslice building, reconstruction & analysis. Labels: microslice (MS), 1MB, timeslice component, overlap, timeslice; index labels 0/0 … 100/0 and 0/1000 … 100/1000; input and compute nodes connected via InfiniBand.]
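The timeslice building shown in the diagrams above can be sketched roughly as follows. This is a minimal illustration, not the FLES implementation: the function names, the `core` and `overlap` parameters, and the string labels of the form `microslice/input` are assumptions made for the example. Each input stream is cut into components of `core` microslices plus `overlap` microslices borrowed from the start of the next component, and a timeslice collects the components with the same index from all inputs.

```python
from typing import Dict, List, Sequence


def timeslice_components(microslices: Sequence[str], core: int, overlap: int) -> List[List[str]]:
    """Cut one input's microslice stream into overlapping components.

    Each component holds `core` microslices plus `overlap` microslices that
    also open the following component (hypothetical parameters).
    """
    if core <= 0 or overlap < 0:
        raise ValueError("core must be positive and overlap non-negative")
    components = []
    start = 0
    # Only emit complete components; trailing microslices wait for more data.
    while start + core + overlap <= len(microslices):
        components.append(list(microslices[start:start + core + overlap]))
        start += core
    return components


def build_timeslices(streams: Dict[int, Sequence[str]], core: int, overlap: int) -> List[Dict[int, List[str]]]:
    """Combine the components with the same index from every input into one timeslice."""
    per_input = {i: timeslice_components(ms, core, overlap) for i, ms in streams.items()}
    if not per_input:
        return []
    # A timeslice is complete only if every input contributed its component.
    n_complete = min(len(c) for c in per_input.values())
    return [{i: per_input[i][t] for i in per_input} for t in range(n_complete)]


# Two inputs with seven microslices each, labelled "microslice/input".
streams = {i: [f"{m}/{i}" for m in range(7)] for i in range(2)}
timeslices = build_timeslices(streams, core=3, overlap=1)
print(len(timeslices), timeslices[0][1])
```

In a real system the components would travel from input nodes to compute nodes over the network; here everything lives in one process so that only the grouping logic is visible.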

  4. Availability • MTBF: mean time between failures • MTTR: mean time to repair
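The slide text does not spell out how the two quantities combine. A minimal sketch, assuming the standard steady-state definition A = MTBF / (MTBF + MTTR), with both times in seconds (the function name is hypothetical):

```python
def availability(mtbf_s: float, mttr_s: float) -> float:
    """Steady-state availability, assuming A = MTBF / (MTBF + MTTR)."""
    if mtbf_s <= 0 or mttr_s < 0:
        raise ValueError("need mtbf_s > 0 and mttr_s >= 0")
    return mtbf_s / (mtbf_s + mttr_s)


# Equal time between failures and time to repair: the system is up half the time.
print(availability(3600.0, 3600.0))
```

Under this definition, availability approaches 1 only when the repair time is small compared with the time between failures.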

  5. Availability [Figure: availability (scale 0.9 to 1, with 50% and 99.9% marked) as a function of MTBF (axis from 10 s to 4 months) and MTTR (axis from 1 s to 3 h).]

  6. Availability • The estimated MTBF for a crucial node failure is two weeks, extrapolated from real-world data of the ALICE HLT [Figure: availability (scale 0.9 to 1) as a function of MTBF (axis from 1 day to 4 months) and MTTR (axis from 2 min to 3 h, with 5 min and 30 min marked); annotations: node failure, Restart Framework.]
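To connect the two-week MTBF estimate with the 99.9% availability goal named elsewhere in the talk, one can solve for the repair time that is still tolerable. This is a sketch under the assumption that availability is defined as MTBF / (MTBF + MTTR); the function and constant names are made up for the example.

```python
def max_mttr(mtbf_s: float, target: float) -> float:
    """Largest MTTR (seconds) that still meets `target` availability.

    Solves target = MTBF / (MTBF + MTTR) for MTTR.
    """
    if mtbf_s <= 0 or not 0 < target <= 1:
        raise ValueError("need mtbf_s > 0 and 0 < target <= 1")
    return mtbf_s * (1.0 - target) / target


TWO_WEEKS_S = 14 * 24 * 3600  # MTBF estimate quoted on the slide

budget_s = max_mttr(TWO_WEEKS_S, 0.999)
print(f"repair budget: {budget_s / 60:.1f} min")
```

With these inputs the repair budget comes out at about 20 minutes. Repeating the calculation for an MTBF of 2 days gives a budget of just under 3 minutes.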

  7. Availability [Figure: availability (scale 0.9 to 1) as a function of MTBF (axis from 16 min to 1 week, with 2 days marked) and MTTR (axis from 1 s to 16 min, with 5 min marked); annotation: Restart Framework.]

  8. Use native InfiniBand Verbs implementation • Case: corrupted input data • Report error [Diagram: input interface, timeslice building, reconstruction; corrupted input leads to a corrupted timeslice at the compute node; input and compute nodes connected via InfiniBand.]

  9. Use native InfiniBand Verbs implementation • Case: process failure • Report error [Diagram: input interface, timeslice building, reconstruction; microslice, timeslice; input and compute nodes connected via InfiniBand.]

  10. Can we use MPI as a high-level API instead of the low-level native InfiniBand Verbs implementation? [Diagram: input interface, timeslice building, reconstruction; microslice, 1MB, timeslice; input and compute nodes connected via InfiniBand.]

  11. MPI Fault Tolerance • In MPI, when one process crashes, all other processes within the same communicator crash as well • Result: it is not possible to create independent communicators [Diagram: parent and child processes; legend: i/k = process with rank i of generation k; child processes share an intracommunicator (MPI_COMM_WORLD); parent-to-child intercommunicator.]

  12. Control System • Start/stop process on each node • Detect errors [Diagram: input interface, timeslice building, reconstruction; microslice, 1MB, timeslice; input and compute nodes connected via InfiniBand.]
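The two tasks named on this slide, starting and stopping processes and detecting errors, can be illustrated with a small local supervisor. This is a sketch only: the class name, the restart policy, and the status strings are assumptions for the example, and it watches a process on the local machine rather than on each node of a cluster.

```python
import subprocess
import sys
from typing import List, Optional


class SupervisedProcess:
    """Start, stop and monitor one child process (hypothetical helper)."""

    def __init__(self, argv: List[str], max_restarts: int = 3) -> None:
        self.argv = argv
        self.max_restarts = max_restarts
        self.restarts = 0
        self.stopped = False
        self.proc: Optional[subprocess.Popen] = None

    def start(self) -> None:
        self.stopped = False
        self.proc = subprocess.Popen(self.argv)

    def stop(self, timeout: float = 5.0) -> None:
        """Ask the child to terminate; kill it if it does not exit in time."""
        if self.proc is None:
            return
        self.stopped = True
        self.proc.terminate()
        try:
            self.proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self.proc.kill()
            self.proc.wait()

    def check(self) -> str:
        """Poll once and return 'running', 'stopped', 'finished', 'restarted' or 'failed'."""
        if self.proc is None:
            raise RuntimeError("process was never started")
        code = self.proc.poll()
        if code is None:
            return "running"
        if self.stopped:
            return "stopped"
        if code == 0:
            return "finished"
        # Non-zero exit code: treat as an error and restart within the budget.
        if self.restarts < self.max_restarts:
            self.restarts += 1
            self.start()
            return "restarted"
        return "failed"


# A child that always fails, allowed one restart.
worker = SupervisedProcess([sys.executable, "-c", "import sys; sys.exit(1)"], max_restarts=1)
worker.start()
worker.proc.wait()
first = worker.check()
worker.proc.wait()
second = worker.check()
print(first, second)
```

A production control system would additionally need remote execution, heartbeats for hung processes, and a way to tell the remaining nodes about a failure; none of that is attempted here.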

  13. Conclusion and Outlook • Availability: desired availability of 99.9%; higher failure rates expected during commissioning; failures no more often than once every 2 days • MPI is not fault tolerant • Use the native InfiniBand Verbs implementation for timeslice building • Add control software to orchestrate processes and allow recovery from errors
