Core Service Failures: Results from TIC WG


Enabling Grids for E-sciencE. Core Service Failures: Results from TIC WG. Marcin Radecki, 1st OAT Meeting, CERN, 6-7 May 2008. www.eu-egee.org. EGEE-II INFSO-RI-031688. EGEE and gLite are registered trademarks.


SLIDE 1

Enabling Grids for E-sciencE

Core Service Failures: Results from TIC WG

Marcin Radecki
1st OAT Meeting, CERN, 6-7 May 2008

EGEE-II INFSO-RI-031688

www.eu-egee.org

EGEE and gLite are registered trademarks

SLIDE 2


Where is the problem?

  • The service availability reported by monitoring is not only a function of site services

– Example: the lcg-cr command uses the LFC and the top-level BDII, which are not in the administrative domain of the site

[Figure: "Reality" versus "Picture seen by monitoring"; labels: VO LFC, site boundary]
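
The dependency in the example can be made explicit with a small sketch. The only fact taken from the slide is that lcg-cr relies on the LFC and the top-level BDII, which sit outside the site. The service labels, the `TEST_DEPENDENCIES` table, the assumed site-local storage element, and the `external_dependencies` helper are all hypothetical and are not SAM or gLite identifiers.

```python
# Hypothetical sketch: which dependencies of a test lie outside the site boundary?
# All names below are illustrative; they are not SAM or gLite identifiers.

# Services a test touches. Only the LFC and top-level BDII entries come from
# the slide; "site-se" is an assumed site-local storage element.
TEST_DEPENDENCIES = {
    "lcg-cr": ["site-se", "vo-lfc", "top-bdii"],
}

# Services administered by the site itself (assumed for illustration).
SITE_SERVICES = {"site-se"}


def external_dependencies(test_name):
    """Return the dependencies of a test that the site does not administer."""
    deps = TEST_DEPENDENCIES.get(test_name, [])
    return [d for d in deps if d not in SITE_SERVICES]


if __name__ == "__main__":
    # Prints ['vo-lfc', 'top-bdii']: a failure of either one makes the
    # site test fail although the site's own services are healthy.
    print(external_dependencies("lcg-cr"))
```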


SLIDE 3


By what means can we tell that the problem occurred?

  • Three ways of determining a Core Service problem
  • 1. Error message coming from gLite
      1. Improved SAM sensors deployed at the regional SAM instance in CE
      2. Ambiguous error messages (GGUS #33813)
      3. Network stack problem – not enough information passed to the application layer
  • 2. Last Core Service status in SAM DB
      – Can be used by monitoring tools to avoid raising alarms on OK sites
      – Solution may pose unacceptable load on SAM DB
  • 3. Heuristics by which a Core Service failure is seen as many sites failing
      – Problem with services running on a performance edge
      – Problem with bad firewall config etc.

Once the error message is lost as a reliable source of information, we can only reach a degree of certainty.
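
The third approach, seeing a Core Service failure as many sites failing at once, could be sketched as below. This is only an illustration: the `core_service_suspected` function, the input format (site name mapped to pass/fail), and the 0.5 threshold are assumptions and do not describe how SAM works.

```python
# Hypothetical sketch of the "many sites failing" heuristic.
# The 0.5 threshold and the result format are assumptions for illustration.

def core_service_suspected(results, threshold=0.5):
    """Estimate whether a core service failure is behind the site failures.

    results maps a site name to True if the site passed the test and
    False if it failed. Returns (suspected, failing_fraction); a core
    service failure is suspected when the fraction reaches the threshold.
    """
    if not results:
        return False, 0.0
    failing = sum(1 for passed in results.values() if not passed)
    fraction = failing / len(results)
    return fraction >= threshold, fraction


if __name__ == "__main__":
    latest = {"site-a": False, "site-b": False, "site-c": False, "site-d": True}
    suspected, fraction = core_service_suspected(latest)
    print(suspected, fraction)  # True 0.75
```

The returned fraction is a level of certainty rather than a verdict: a service running on a performance edge or a bad firewall configuration may make only some sites fail, so a low fraction does not rule out a core service problem.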

SLIDE 4


Conclusion: How to improve?

  • Long-term goals

– Common error message format from all gLite components
– Avoid design with dependencies on remote services, or locate them within the site boundaries if possible
– If not possible, improve reliability of Core Services

  • Short-term goals

– Limit the impact of the network on monitoring
      • locate monitoring closer to sites
      • integrate network monitoring results into service monitoring

– Put more "intelligence" into monitoring; in case of a service failure:
      • compare with the last core service result in SAM DB
      • run the test twice
      • run an additional check on the dependency service
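
The three "intelligence" steps could be combined into a single decision routine, sketched below. The slide lists the steps but does not say how to order or combine them, so the ordering, the result labels, and the three callables are assumptions for illustration; none of them is a SAM API.

```python
# Hypothetical sketch of the short-term "intelligence" steps.
# The three callables stand in for real lookups and are assumptions:
#   last_core_status() -> True if the last core service result in the SAM DB was OK
#   rerun_test()       -> True if the site test passes on a second run
#   check_dependency() -> True if a direct check of the dependency service passes

def classify_failure(last_core_status, rerun_test, check_dependency):
    """Classify a failed site test as 'core-service', 'transient' or 'site'."""
    # Step 1: compare with the last core service result in the SAM DB.
    if not last_core_status():
        return "core-service"
    # Step 2: run the test a second time; a pass suggests a transient problem.
    if rerun_test():
        return "transient"
    # Step 3: run an additional check on the dependency service.
    if not check_dependency():
        return "core-service"
    # The core service looks healthy and the failure persists.
    return "site"


if __name__ == "__main__":
    # Stored status is OK, the rerun fails, and the dependency check fails.
    print(classify_failure(lambda: True, lambda: False, lambda: False))  # core-service
```

Checking the stored status first keeps the cheaper lookup ahead of the rerun, but every failure then queries the SAM DB, which is the load concern raised on the previous slide.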

SLIDE 5


References

  • Full report on Core Services failures

– http://galaxy.agh.edu.pl/~radecki/cod-cs-failures.pdf


  • TIC web page on Core Service

– http://goc.grid.sinica.edu.tw/gocwiki/Tools_Improvements_for_COD/FailuresDueToCoreServices
