The dashboard Grid monitoring framework

SLIDE 1

The dashboard Grid monitoring framework

Benjamin Gaidioz on behalf of the ARDA dashboard team (CERN/EGEE) ISGC 2007 conference

The dashboard Grid monitoring framework – p. 1

SLIDE 2

introduction/outline

  • goals of the project,
  • the team,
  • the framework,
  • some monitoring applications: job monitoring, site monitoring, data management monitoring.

SLIDE 3

the project (EGEE/ARDA)

  • another monitoring tool,
  • a VO-specific monitoring service,
  • showing Grid usage from a VO point of view (cross Grid, cross application, submission tool, etc.),
  • merging Grid information and VO information,
  • implemented in close contact with the VOs.

SLIDE 4

the team

  • Julia Andreeva (lead, CMS) and Juha Herrala (former member, CMS),
  • Benjamin Gaidioz and Ricardo Rocha (ATLAS),
  • Pablo Saiz (ALICE),
  • Gerhild Maier,
  • collaborators and visitors: Taipei: Fu-Ming Tsai (daily summaries), Tao-Sheng Chen (PostgreSQL and Oracle), Shih-Chun Chiu (user web interface, PHP), etc.; Moscow State University,
  • our contacts in all the VOs and Grids.

contact: dashboard-support@cern.ch

SLIDE 5

the framework

a python framework for collecting and publishing monitoring information

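[Architecture diagram: information sources (Monalisa, RGMA, GridPP) are read by dedicated collectors (monalisa collector, RGMA collector, GridPP collector) that feed the dashboard Oracle database; the dashboard web server reaches the database through a data access object (DAO) and answers client requests as text/html, text/xml, image/png, etc.]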

developer guide, savannah project.
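
The components named on this slide (collectors, a data access object, a web server answering in several content types) can be sketched as follows. This is only a minimal illustration, not the dashboard's actual code: `JobRecord`, `InMemoryDAO`, `ListCollector` and `render` are hypothetical names, and an in-memory dictionary stands in for the Oracle database.

```python
from dataclasses import dataclass
from html import escape
from xml.sax.saxutils import quoteattr


@dataclass
class JobRecord:
    job_id: str
    site: str
    status: str


class InMemoryDAO:
    """Hypothetical stand-in for the data access object (no real database)."""

    def __init__(self):
        self._rows = {}

    def save(self, record):
        # Assumption: the latest report for a job id replaces the previous one.
        self._rows[record.job_id] = record

    def all(self):
        return sorted(self._rows.values(), key=lambda r: r.job_id)


class ListCollector:
    """Hypothetical collector: reads (job_id, site, status) tuples from a source."""

    def __init__(self, source, dao):
        self.source = source
        self.dao = dao

    def collect(self):
        for job_id, site, status in self.source:
            self.dao.save(JobRecord(job_id, site, status))


def render(dao, content_type):
    """Serve the same stored data in the content type the client asked for."""
    rows = dao.all()
    if content_type == "text/xml":
        items = "".join(
            "<job id=%s site=%s status=%s/>"
            % (quoteattr(r.job_id), quoteattr(r.site), quoteattr(r.status))
            for r in rows
        )
        return "<jobs>%s</jobs>" % items
    if content_type == "text/html":
        body = "".join(
            "<tr><td>%s</td><td>%s</td><td>%s</td></tr>"
            % (escape(r.job_id), escape(r.site), escape(r.status))
            for r in rows
        )
        return "<table>%s</table>" % body
    raise ValueError("unsupported content type: %s" % content_type)
```

Keeping collectors and renderers on opposite sides of the DAO means a new information source or a new output format can be added without touching the other side.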

SLIDE 6

a set of applications

SLIDE 7

applications

  1. job monitoring,
  2. site monitoring,
  3. data management monitoring.

see the links in the last slide for accessing them all.

SLIDE 8

job monitoring

  • real-time view of Grid jobs for a VO,
  • summary views,
  • various Grid information systems used (EGEE RGMA, GridPP XML files, LCG BDII),
  • VO info: job instrumentation (Monalisa’s ApMon), ATLAS prodsys database, panda monitoring, GangaAtlas monitoring, Dirac database, etc.,
  • consistent merging (Grid info + VO info),
  • powerful filtering for serving different use cases (managers, site admins, users),
  • examples: ATLAS activities today, ATLAS jobs in Taiwan, CMS daily views.
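
The merging and filtering described above can be illustrated with a small sketch. It assumes, purely for illustration, that both the Grid sources and the VO sources report per-job dictionaries keyed by a shared job id, and that Grid fields take precedence when both sides report the same field; `merge_job_views` and `filter_jobs` are hypothetical helper names, not dashboard functions.

```python
def merge_job_views(grid_info, vo_info):
    """Join per-job Grid information with VO information, keyed by job id."""
    merged = {}
    for job_id in grid_info.keys() | vo_info.keys():
        record = {"job_id": job_id}
        record.update(vo_info.get(job_id, {}))
        # Assumption: on a conflicting field, the Grid value overwrites the VO value.
        record.update(grid_info.get(job_id, {}))
        merged[job_id] = record
    return merged


def filter_jobs(jobs, **criteria):
    """Return merged records (ordered by job id) matching every field=value pair."""
    ordered = sorted(jobs.values(), key=lambda job: job["job_id"])
    return [
        job for job in ordered
        if all(job.get(field) == value for field, value in criteria.items())
    ]
```

A site admin view would then be something like `filter_jobs(merged, site="some-site")`, while a user view would filter on a user field instead.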

SLIDE 9

job monitoring summary

installed for ALICE, ATLAS, CMS, LHCb.

latest/next developments:

  • open HTTP API for a VO to publish job information to the dashboard (in progress),
  • user task monitoring (in progress),
  • alerts (with failure pattern recognition),
  • link with the SAM tests (site functionality tests),
  • RSS feeds.

SLIDE 10

site monitoring

  • linked to job monitoring,
  • identify why jobs fail at sites,
  • using RGMA (which reports Grid error messages),
  • examples: ALICE site info.

  Waiting
  Ready (unavailable)
  Scheduled (Job successfully submitted to Globus)
  Ready (7 an authentication operation failed)
  Done (Job got an error while in the CondorG queue.)
  Running (Job successfully submitted to Globus)
  Submitted
  Done (/net/hisrv0001/opt.x86_64/grid/globus/etc/globus-user-env.sh not found or unreadable)
  Done (Job terminated successfully)
  Done (Cannot read JobWrapper output both from Condor and from Maradona.)
  Waiting (unavailable)
  Cleared (user retrieved output sandbox)

SLIDE 11

site monitoring

  • linked to job monitoring,
  • identify why jobs fail at sites,
  • using RGMA (which reports Grid error messages),
  • examples: ALICE site info.

[Figure: job state flow at four sites (cluster.pnpi.nw.ru, ce01.cmsaf.mit.edu, lepton.rcac.purdue.edu, ce02.grid.acad.bg). Each job goes submit → Waiting → Ready and then ends in one of four ways: Scheduled → Running → Done (success); Scheduled → Running → Error (wrong installation); Error (authentication); Scheduled → Running → Error (maradona).]
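
Turning the raw Grid messages into the short error labels shown on these slides can be sketched with a few regular expressions. The message fragments and the labels ("authentication", "wrong installation", "maradona") come from the slides; which message maps to which label is an assumption made here for illustration, and `classify` and `summarise` are hypothetical names.

```python
import re
from collections import Counter

# Assumed mapping from message fragments (quoted on the slide) to slide labels.
FAILURE_PATTERNS = [
    (re.compile(r"authentication operation failed"), "authentication"),
    (re.compile(r"not found or unreadable"), "wrong installation"),
    (re.compile(r"Maradona"), "maradona"),
]


def classify(status, reason=""):
    """Map a (status, reason) pair to a short label."""
    if status == "Done" and "terminated successfully" in reason:
        return "success"
    for pattern, label in FAILURE_PATTERNS:
        if pattern.search(reason):
            return label
    return "unclassified"


def summarise(events):
    """Count labels per site; events are (site, status, reason) tuples."""
    counts = {}
    for site, status, reason in events:
        counts.setdefault(site, Counter())[classify(status, reason)] += 1
    return counts
```

Messages that match no pattern, such as the CondorG queue error, fall into "unclassified" rather than being guessed at.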

SLIDE 12

site monitoring summary

installed for ALICE, ATLAS, CMS, LHCb.

latest/next developments: merging all information about a site across VOs (rather than per VO), to see whether failures are similar for all VOs (in progress).
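
The cross-VO merge mentioned above amounts to grouping failure reports by site and checking which failure reasons are seen by more than one VO. A minimal sketch, assuming reports arrive as `(vo, site, reason)` tuples (a made-up shape; `failures_by_site` and `shared_failures` are hypothetical names):

```python
from collections import defaultdict


def failures_by_site(reports):
    """Group reports into {site: {reason: set of VOs that saw it}}."""
    table = defaultdict(lambda: defaultdict(set))
    for vo, site, reason in reports:
        table[site][reason].add(vo)
    return table


def shared_failures(reports, min_vos=2):
    """List (site, reason, vos) where one reason affects at least min_vos VOs."""
    result = []
    for site, reasons in sorted(failures_by_site(reports).items()):
        for reason, vos in sorted(reasons.items()):
            if len(vos) >= min_vos:
                result.append((site, reason, sorted(vos)))
    return result
```

A reason reported by several VOs at the same site points to a site-level problem, whereas a reason seen by a single VO is more likely specific to that VO's software or configuration.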

SLIDE 13

data management

  • an ATLAS-specific application, monitoring the ATLAS DDM tool,
  • events directly reported by ATLAS software to the dashboard,
  • current performance, details,
  • developed in close contact with ATLAS DDM admins and developers,
  • daily summary sent by mail.

SLIDE 14

data management: summary

  • installed for ATLAS,
  • critical component of ATLAS DDM (now official monitoring system),
  • latest/next developments: text summary sent by e-mail to site admins, correlation with the SAM tests (site functionality tests).
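
A text summary for site admins could be assembled along these lines. This is a sketch under stated assumptions, not the DDM monitoring code: the event shape (`site` and `state` keys), the subject line, the `build_summary` name and the example.org addresses are all made up, and the message is only built, never sent.

```python
from collections import Counter
from email.message import EmailMessage


def build_summary(site, events, recipient, sender="dashboard@example.org"):
    """Build (but do not send) a plain-text summary of events for one site."""
    counts = Counter(event["state"] for event in events if event["site"] == site)
    lines = ["Transfer summary for %s" % site]
    for state in sorted(counts):
        lines.append("%s: %d" % (state, counts[state]))

    message = EmailMessage()
    message["Subject"] = "Daily DDM summary: %s" % site
    message["From"] = sender
    message["To"] = recipient
    message.set_content("\n".join(lines))
    return message
```

Sending would be a separate step (for example through `smtplib`), which keeps the summary logic testable without a mail server.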

SLIDE 15

conclusion

SLIDE 16

conclusion

  • goal: Grid monitoring from a VO point of view: merging VO information and Grid information, fitting the various use cases (managers, users, site admins),
  • several applications already implemented using a flexible python framework,
  • future work: new applications, new information sources (GridICE, APEL, SAM), new functionalities: alerts, assistance in error tracking.

SLIDE 17

links

  • Savannah project,
  • dashboard main page,
  • CMS dashboard main page,
  • ATLAS dashboard main page,
  • LHCb dashboard main page,
  • ALICE dashboard main page,
  • site reliability,
  • dashboard-support@cern.ch
