The State of FIFE Monitoring & Accounting
Kevin Retzke
FIFE Workshop, 20th-21st June 2016
FIFE Monitoring FIFE Workshop 2016
Fifemon is a comprehensive monitoring platform for all FIFE experiments, services, and stakeholders
https://fifemon.fnal.gov/monitor
Users, Service Providers, Management, Experiment Stakeholders
The Landscape program supports the development of unified comprehensive monitoring for Scientific Computing.
https://landscape.fnal.gov
HEP cloud
State of FIFE Monitoring 2015
Fermigrid Fifemon
A New Monitoring Paradigm
- Leverage open-source monitoring technology
- Focus on incorporating new data sources and new dashboards
- Rapid development and iteration of tailored views for each target audience
Fifemon Architecture
Sources: Fifebatch, GPGrid, CMS Tier 1, CMS LPC, HEP Cloud, Data handling, dCache, BlueArc, Postgres, more...
Probes Collect:
- Job Details
- Slot Details
- System Metrics
- Event Logs
Storage: Graphite (time-series aggregations), Elasticsearch (raw documents)
Frontends: Grafana, Kibana
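As a rough illustration of the probe-to-Graphite path in this architecture, the sketch below formats one sample in Graphite's plaintext protocol (`<path> <value> <timestamp>`, one sample per line, conventionally sent to a Carbon listener on TCP port 2003). The metric path `fifemon.example.jobs.running`, the host name in the comment, and the helper names are hypothetical; the slides do not describe how Fifemon's probes are actually implemented.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one plaintext-protocol line: path, value, timestamp, then a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host, port=2003):
    """Send preformatted plaintext-protocol lines to a Carbon listener over TCP."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric path; real probe naming is not given in the slides.
line = format_metric("fifemon.example.jobs.running", 1234, timestamp=1466400000)
print(line, end="")
# send_metrics([line], host="graphite.example.org")  # hypothetical host, not run here
```

Batching many lines into a single `send_metrics` call keeps the number of TCP connections low when a probe reports many metrics per collection cycle.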
Statistics
- 250 users
- 15 data sources
- 50 dashboards
- 280K total metrics
- 500K datapoints per hour
- 70K log events per hour
Charts: Unique Users per Day, Dashboard Loads per Day (chart labels: 150, 25)
Usage
Requests by Group
- Users: 65%
- FIFE: 20%
- Production: 10%
- Management: 5%
Top Dashboards
- User Batch Details: 50%
- Exp. Batch Details: 30%
- Exp. Overview: 8%
- Exp. Eff.: 6%
- Exp. Summ.: 6%
Statistics based on dashboard requests in the last 60 days.
Upcoming Features
Near-Term
- Federated SSO Auth
- Grafana v3
- Completed job details & resource usage
- dCache
- Running job logs
- Outage notices and logs
- Dashboard improvements
Long-Term
- Realtime job updates
- Email reports
- User and experiment areas
- Alerting
Preview and test: https://fifemon-pp.fnal.gov
What do you want to see? https://fermi.service-now.com
Beyond FIFE
- CMS LPC cluster monitoring
- Collaborating with OSG and wider scientific computing community
– Increase Fermilab visibility
– Feedback improvements
– Better site monitoring: what resources are available offsite for FIFE jobs?
Collaborative Project: https://fifemon.github.io
http://www.lumaxart.com/
Accounting
- OSG retiring Gratia, ruled it unmaintainable and inflexible
- “GRÅCC” being developed by OSG and FNAL
– Modular, microservice-based architecture
– Primary data store: Elasticsearch
– Primary frontend: Grafana
– Compatible with existing probes
– Alpha stage this summer, Production by end of year
- Accounting data will be more readily accessible
– Integrate FIFE accounting data into Fifemon...
https://gracc.opensciencegrid.org
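With Elasticsearch as the primary data store, accounting summaries are typically expressed as aggregation queries. The sketch below builds a search body that sums wall time per day for one virtual organization. The field names (`EndTime`, `VOName`, `WallDuration`), the example VO value, and the function name are assumptions for illustration, not GRÅCC's documented schema. The `calendar_interval` key assumes Elasticsearch 7.2 or later; older releases use `interval`.

```python
import json


def daily_wall_time_query(vo_name, start, end):
    """Build an Elasticsearch search body: total wall time per day for one VO."""
    return {
        "size": 0,  # return only aggregation buckets, no raw documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"VOName": vo_name}},  # hypothetical field name
                    {"range": {"EndTime": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "EndTime", "calendar_interval": "day"},
                "aggs": {"wall_seconds": {"sum": {"field": "WallDuration"}}},
            }
        },
    }


# Example VO name and date range chosen only for illustration.
body = daily_wall_time_query("nova", "2016-06-01", "2016-07-01")
print(json.dumps(body, indent=2))
```

A body like this could be sent with any Elasticsearch client, or as a plain HTTP POST to an index's `_search` endpoint.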
Over Two Years of FIFE History
https://fifemon-pp.fnal.gov/dashboard/db/fife-history
Fifemon Tutorial Tomorrow
- Grafana basic usage, tips & tricks
- Common workflows:
– Checking your job status and resource usage
– Checking your experiment’s status and usage
– Checking batch system usage and resource availability
- Q&A