Core Service Failures: Results from TIC WG

Core Service Failures: Results from TIC WG. Marcin Radecki, 1st OAT Meeting, CERN, 6-7 May 2008. Enabling Grids for E-sciencE, www.eu-egee.org, EGEE-II INFSO-RI-031688. EGEE and gLite are registered trademarks.


  1. Core Service Failures: Results from TIC WG
     Marcin Radecki
     1st OAT Meeting, CERN, 6-7 May 2008
     Enabling Grids for E-sciencE, www.eu-egee.org, EGEE-II INFSO-RI-031688
     EGEE and gLite are registered trademarks

  2. Where is the problem?
     • Service availability, as seen by monitoring, is not only a function of site services
       – Example: the lcg-cr command uses the LFC and the top-level BDII, which are not in the administrative domain of the site
     [Figure: "Reality" versus "Picture seen by monitoring", showing the VO LFC and the site boundary]

  3. By what means can we tell that the problem occurred?
     Three ways of determining a Core Service problem:
     1. Error message coming from gLite
        1. Improved SAM sensors deployed at the regional SAM instance in CE
        2. Ambiguous error messages (GGUS #33813)
        3. Network stack problem: not enough information is passed to the application layer
     2. Last Core Service status in SAM DB
        – Can be used by monitoring tools to avoid raising alarms on OK sites
        – Solution may pose unacceptable load on SAM DB
     3. Heuristics by which a Core Service failure is seen as many sites failing
        – Problem with services running at the edge of their performance
        – Problem with bad firewall config etc.
     Once the error message is lost as a reliable source of information, we can only reach a certain level of confidence.
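The third approach, treating many simultaneous site failures as a sign of a Core Service failure, could be sketched roughly as follows. This is only an illustration: the function name, the input layout, and the `threshold` and `min_sites` parameters are assumptions made here, not part of SAM or of the TIC WG report.

```python
def suspected_core_failures(results, threshold=0.8, min_sites=3):
    """Flag core services for which most dependent sites fail at once.

    results: {core_service: {site: test_passed}} (hypothetical layout),
    holding the latest result of the dependent test at each site.
    Returns {core_service: fraction_of_failing_sites} for the suspects.
    """
    suspects = {}
    for service, per_site in results.items():
        if len(per_site) < min_sites:
            continue  # too few dependent sites to draw any conclusion
        failed = sum(1 for passed in per_site.values() if not passed)
        ratio = failed / len(per_site)
        if ratio >= threshold:
            suspects[service] = ratio
    return suspects


# Hypothetical sample data: four sites, two core services.
sample = {
    "top-bdii": {"A": False, "B": False, "C": False, "D": True},
    "lfc": {"A": True, "B": False, "C": True, "D": True},
}
suspects = suspected_core_failures(sample, threshold=0.75)
```

As the slide notes, such a heuristic gives only a level of confidence: a service at the edge of its performance or a bad firewall configuration may fail for some sites and not others, and so stay below any fixed threshold.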

  4. Conclusion: How to improve?
     • Long-term goals
       – Common error message format from all gLite components
       – Avoid designs with dependencies on remote services, or locate them within the site boundaries if possible
       – If that is not possible, improve the reliability of Core Services
     • Short-term goals
       – Limit the impact of the network on monitoring
         ▪ locate monitoring closer to sites
         ▪ integrate network monitoring results into service monitoring
       – Put more "intelligence" into monitoring; in case of a service failure:
         ▪ compare with the last core service result in SAM DB
         ▪ run the test twice
         ▪ run an additional check on the dependency service
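The three short-term checks proposed for a failing test can be combined into one decision procedure. The sketch below is a minimal illustration under stated assumptions: the three callables are hypothetical placeholders for a site test, a lookup of the last core service result in SAM DB, and a direct probe of the dependency service. The order of the checks is also a choice made here, not something the slides prescribe.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    OK = "ok"
    SITE_PROBLEM = "site problem"
    CORE_SERVICE_PROBLEM = "core service problem"


def classify(run_site_test: Callable[[], bool],
             last_core_result_ok: Callable[[], bool],
             check_dependency: Callable[[], bool]) -> Verdict:
    """Decide whether a failing site test should be blamed on the site.

    All three callables are hypothetical hooks returning True on success.
    """
    if run_site_test():
        return Verdict.OK
    # Compare with the last core service result stored in SAM DB.
    if not last_core_result_ok():
        return Verdict.CORE_SERVICE_PROBLEM
    # Run the test a second time to filter out transient failures.
    if run_site_test():
        return Verdict.OK
    # Run an additional check on the dependency service itself.
    if not check_dependency():
        return Verdict.CORE_SERVICE_PROBLEM
    return Verdict.SITE_PROBLEM
```

Consulting the stored result first avoids a second test run when the core service is already known to be down, although, as slide 3 points out, frequent lookups may put unacceptable load on SAM DB.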

  5. References
     • Full report on Core Services failures
       – http://galaxy.agh.edu.pl/~radecki/cod-cs-failures.pdf
     • TIC web page on Core Service
       – http://goc.grid.sinica.edu.tw/gocwiki/Tools_Improvements_for_COD/FailuresDueToCoreServices
