The State of FIFE Monitoring & Accounting, Kevin Retzke

SLIDE 1

Kevin Retzke FIFE Workshop 20th-21st June 2016

The State of FIFE Monitoring & Accounting

SLIDE 2

FIFE Monitoring FIFE Workshop 2016

Fifemon is a comprehensive monitoring platform for all FIFE experiments, services, and stakeholders

https://fifemon.fnal.gov/monitor

Users, Service Providers, Management, Experiment Stakeholders

SLIDE 3

The Landscape program supports the development of unified comprehensive monitoring for Scientific Computing.

https://landscape.fnal.gov

HEP cloud

SLIDE 4

State of FIFE Monitoring 2015

[Figure panels: Fermigrid; Fifemon]

SLIDE 5

A New Monitoring Paradigm

  • Leverage open-source monitoring technology
  • Focus on incorporating new data sources and new dashboards
  • Rapid development and iteration of tailored views for each target audience

SLIDE 6

Fifemon Architecture

Sources: Fifebatch, GPGrid, CMS Tier 1, CMS LPC, HEP Cloud, Data handling, dCache, BlueArc, Postgres, more...

Probes Collect:

  • Job Details
  • Slot Details
  • System Metrics
  • Event Logs

Storage: Graphite, Elasticsearch

Data kinds: Time-series, Aggregations, Raw Documents

Dashboards: Grafana, Kibana
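
As a rough illustration of the probe-to-storage flow on this slide, the sketch below formats one datapoint in Graphite's plaintext protocol (`<path> <value> <timestamp>`) and serializes one log event as a JSON document of the kind a document store such as Elasticsearch indexes. The metric path, field names, and function names are hypothetical and chosen only for illustration; the actual Fifemon probes may be structured quite differently.

```python
import json
import time


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one datapoint in Graphite's plaintext protocol."""
    # One datapoint per line: "<metric path> <value> <unix timestamp>\n"
    return f"{path} {value} {timestamp}\n"


def event_document(message: str, source: str, timestamp: int) -> str:
    """Serialize one log event as a JSON document (field names are hypothetical)."""
    doc = {"timestamp": timestamp, "source": source, "message": message}
    return json.dumps(doc, sort_keys=True)


if __name__ == "__main__":
    now = int(time.time())
    # Hypothetical metric path; a real probe would derive it from the data source.
    print(graphite_line("example.batch.jobs.running", 1200, now), end="")
    print(event_document("job 123 held", "example-probe", now))
```

A real probe would send the Graphite line over a socket to the Carbon listener and post the document to an Elasticsearch index; both steps are left out here so the sketch runs without any services.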

SLIDE 7

Statistics

  • 250 users
  • 15 data sources
  • 50 dashboards
  • 280K total metrics
  • 500K datapoints per hour
  • 70K log events per hour

[Plots: Unique Users per Day; Dashboard Loads per Day (values shown: 150, 25)]
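
To put the hourly figures on this slide in per-second terms, the short calculation below converts them, assuming "K" means thousand. The average number of datapoints per metric is only a rough ratio of the two quoted totals, not a measured update rate.

```python
SECONDS_PER_HOUR = 3600

# Figures quoted on the slide, with "K" read as thousand (assumption).
datapoints_per_hour = 500_000
log_events_per_hour = 70_000
total_metrics = 280_000

datapoints_per_second = datapoints_per_hour / SECONDS_PER_HOUR
log_events_per_second = log_events_per_hour / SECONDS_PER_HOUR
datapoints_per_metric_per_hour = datapoints_per_hour / total_metrics

print(f"{datapoints_per_second:.0f} datapoints per second")          # about 139
print(f"{log_events_per_second:.0f} log events per second")          # about 19
print(f"{datapoints_per_metric_per_hour:.1f} datapoints per metric per hour")  # about 1.8
```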

SLIDE 8

Usage

Requests by Group

  • FIFE 20%
  • Management 5%
  • Production 10%
  • Users 65%

Top Dashboards

  • User Batch Details 50%
  • Exp. Batch Details 30%
  • Exp. Overview 8%
  • Exp. Eff. 6%
  • Exp. Summ. 6%

Statistics based on dashboard requests in last 60 days.

SLIDE 9

Upcoming Features

Near-Term

  • Federated SSO Auth
  • Grafana v3
  • Completed job details & resource usage
  • dCache
  • Running job logs
  • Outage notices and logs
  • Dashboard improvements

Long-Term

  • Realtime job updates
  • Email reports
  • User and experiment areas
  • Alerting

Preview and test: https://fifemon-pp.fnal.gov
What do you want to see? https://fermi.service-now.com

SLIDE 10

Beyond FIFE

  • CMS LPC cluster monitoring
  • Collaborating with OSG and wider scientific computing community
    – Increase Fermilab visibility
    – Feedback improvements
    – Better site monitoring: what resources are available offsite for FIFE jobs?

Collaborative Project: https://fifemon.github.io

http://www.lumaxart.com/

SLIDE 11

Accounting

  • OSG retiring Gratia, ruled it unmaintainable and inflexible
  • “GRÅCC” being developed by OSG and FNAL
    – Modular, microservice-based architecture
    – Primary data store: Elasticsearch
    – Primary frontend: Grafana
    – Compatible with existing probes
    – Alpha stage this summer, Production by end of year
  • Accounting data will be more readily accessible
    – Integrate FIFE accounting data into Fifemon...

https://gracc.opensciencegrid.org
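
With Elasticsearch as the primary data store, accounting summaries could plausibly be produced with an aggregation query. The sketch below only builds a query body in the standard Elasticsearch query DSL (a `range` filter plus a `terms` aggregation with a nested `sum`); the field names passed in are hypothetical, since the slide does not describe the GRÅCC record schema, and nothing is sent to a server.

```python
import json


def build_usage_query(days: int, time_field: str, group_field: str, sum_field: str) -> dict:
    """Build an Elasticsearch query body that sums one field per group over recent days."""
    return {
        "size": 0,  # return only aggregation results, no raw documents
        "query": {"range": {time_field: {"gte": f"now-{days}d"}}},
        "aggs": {
            "by_group": {
                "terms": {"field": group_field},
                "aggs": {"total": {"sum": {"field": sum_field}}},
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    query = build_usage_query(30, "EndTime", "VOName", "WallDuration")
    print(json.dumps(query, indent=2))
```

Keeping the query as a plain dictionary means the same body can be posted by any HTTP client or pasted into a dashboard query editor.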

SLIDE 12

Over Two Years of FIFE History

https://fifemon-pp.fnal.gov/dashboard/db/fife-history

SLIDE 13

Fifemon Tutorial Tomorrow

  • Grafana basic usage, tips & tricks
  • Common workflows:
    – Checking your job status and resource usage
    – Checking your experiment’s status and usage
    – Checking batch system usage and resource availability
  • Q&A