WMSMonitor: a tool to monitor gLite WMS/LB cluster status and job workflow

SLIDE 1

Enabling Grids for E-sciencE

WMSMonitor: a tool to monitor gLite WMS/LB cluster status and job workflow

Daniele Cesini, Danilo Dongiovanni, Enrico Fattibene INFN-CNAF EGEE08 – 22-26 Sept. 2008 - Istanbul


EGEE-III INFSO-RI-222667

www.eu-egee.org

EGEE and gLite are registered trademarks

SLIDE 2

Motivation of the work

  • The Workload Management System (WMS) and the Logging and Bookkeeping (LB) service have a complex internal structure; knowing their status, and who is using them and how, is challenging
  • A site can run many WMS/LB instances
  • At the same time, WMS/LB services are an interesting source of information about the job lifecycle and resource usage by the VOs
  • The middleware does not currently provide any monitoring facilities

Hence the importance of an efficient monitoring system that aggregates information from internal components and from various instances

SLIDE 3

Target Users

  • WMS/LB administrators, to check the cluster status, who is using it and how
  • WMS developers and advanced users, to benchmark the service performance and test its scalability
  • Resource Center managers who need per-VO aggregated statistics on usage and service availability
  • VO managers, to obtain aggregated job statistics, e.g. to cross-check their monitoring systems

SLIDE 4

Web presentation: cluster overview


SLIDE 5

WMS/LB instance details view

  • Text boxes report the latest series of acquired data
  • Top charts show the status history of Condor jobs (left) and of the WMS internal component queues (right)
  • Bottom charts show the history of job flow rates between components
  • A CMS use case using collections and BulkMM

SLIDE 6

WMS instance details / Daily Report

  • Daily summary of job flow through the WMS components, including:
  • Resubmission of failed jobs
  • Number of jobs in successful final state
  • Number of jobs in aborted final status

SLIDE 7

WMS instance details / Custom Plot


SLIDE 8

WMS cluster VO stats

  • Statistics on per-WMS usage by a single VO (chart or tabular format). The time interval is configurable
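As a rough illustration of the kind of aggregation behind such a view, the sketch below counts the jobs a single VO submitted to each WMS instance within a configurable time window. The record layout (`JobRecord` and its fields), the host names and the sample values are all hypothetical; the slides do not describe the actual WMSMonitor schema or queries.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical job record; not the real WMSMonitor schema."""
    wms_host: str
    vo: str
    submitted_at: datetime


def per_wms_usage(records, vo, start, end):
    """Count jobs submitted by `vo` to each WMS instance in [start, end)."""
    counts = Counter()
    for rec in records:
        if rec.vo == vo and start <= rec.submitted_at < end:
            counts[rec.wms_host] += 1
    return dict(counts)


# Made-up sample data, for illustration only.
T0 = datetime(2008, 9, 22, 12, 0)
SAMPLE = [
    JobRecord("wms-a.example.org", "cms", T0),
    JobRecord("wms-a.example.org", "cms", T0 + timedelta(hours=1)),
    JobRecord("wms-b.example.org", "cms", T0 + timedelta(hours=2)),
    JobRecord("wms-a.example.org", "atlas", T0 + timedelta(hours=3)),
    JobRecord("wms-b.example.org", "cms", T0 + timedelta(days=2)),
]
```

Calling `per_wms_usage(SAMPLE, "cms", T0, T0 + timedelta(days=1))` restricts the count to one VO and a one-day window; changing `start` and `end` plays the role of the configurable time interval.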

SLIDE 9

Working on…

  • User-level statistics
  • Dynamic VO discovery
  • Resource usage statistics:

– Destination CE
– Number of matched CEs per job

  • DB redesign
  • Distributed instances monitoring

SLIDE 10

Architecture/Implementation

  • SNMP-based data transport
  • MySQL backend
  • Sensors and data collector written mostly in Python
  • Web interface developed in PHP
  • Plots based on Open Flash Chart libraries
  • Periodically sends information to a Nagios server, which acts as a notification system
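The slides do not say how the information reaches the Nagios server. One common mechanism is a passive service check result, submitted as a `PROCESS_SERVICE_CHECK_RESULT` external command. The sketch below only formats such a command line; the thresholds, host name and service name are hypothetical, and delivery (for example through the Nagios command file or NSCA) is left out.

```python
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warn, crit):
    """Map a metric value to a Nagios state using hypothetical thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def passive_check_line(host, service, state, output, timestamp=None):
    """Format a passive service check result as a Nagios external command."""
    if timestamp is None:
        timestamp = int(time.time())
    # Plugin output must stay on one line, so collapse any newlines.
    output = " ".join(output.splitlines())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{state};{output}"
    )
```

For example, a queue length of 1200 against assumed thresholds of 500 (warning) and 1000 (critical) would be reported with state `2` (CRITICAL).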

SLIDE 11

Contacts and Acknowledgments

  • CNAF Production Instance:

https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php

  • PADOVA/EU-INDIA Production Instance:

https://eu-india-01.pd.infn.it:50080/wmsmon/main/main.php

  • Wiki, Documentation, Download, Support:

https://twiki.cnaf.infn.it/cgi-bin/twiki/view/WMSMonitor/WebHome
wms-support<at>cnaf.infn.it

Special Thanks to all gLite WMS / LB developers


SLIDE 12

Backup slides


SLIDE 13

gLiteWMS / gLiteLB architecture

[Figure: gLite WMS / LB architecture diagram. Recoverable labels: Interface, Core, Cache of Grid Information, Logging & Bookkeeping System, Back End]

SLIDE 14

Metrics considered

  • Adopted metrics are of three types:

– Grid service metrics: daemon status, number of open file descriptors, entries in component queues, number of available CE queues, open connections on ports, Condor job stats
– System metrics: CPU load average, % occupancy of disk partitions
– Job flow metrics: jobs submitted by users, job input/output for each component in the WMS, jobs successfully completed / aborted
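The system metrics named above (CPU load average and disk partition occupancy) can be read with the Python standard library alone. The following is a minimal sketch under that assumption, not the actual WMSMonitor sensor code; the function names and the returned dictionary layout are hypothetical.

```python
import os
import shutil


def cpu_load_average():
    """Return the 1-, 5- and 15-minute load averages (Unix only)."""
    return os.getloadavg()


def disk_occupancy_percent(path="/"):
    """Return the percentage of the partition holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def collect_system_metrics(partitions=("/",)):
    """Gather load average and per-partition occupancy into one dict."""
    load1, load5, load15 = cpu_load_average()
    return {
        "load_avg_1min": load1,
        "load_avg_5min": load5,
        "load_avg_15min": load15,
        "disk_occupancy_pct": {
            path: disk_occupancy_percent(path) for path in partitions
        },
    }
```

Note that the occupancy here is computed as used/total, so it can differ slightly from the figure `df` prints, which accounts for blocks reserved for the superuser.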

WMS Service Component (diagram: jobs in, jobs out)

  • Daemons Status
  • File Descriptors
  • Queues
  • Open connections on ports
  • Available Grid Information
  • Treated Job status