Monitoring and Analysis at ALCF Kevin Harms harms@alcf.anl.gov - - PowerPoint PPT Presentation



SLIDE 1

Monitoring and Analysis at ALCF

Kevin Harms – harms@alcf.anl.gov
ALCF Operations
Mark Fahey, Eric Pershey, Doug Waldron, Ben Allen ...

SLIDE 2

ALCF Philosophy

• Collect all data and store (ETL) into a central location (data warehouse)
  - Sometimes data is reduced or summarized
  - Raw data is retained
  - Not all data that is collected has an associated analysis or monitor
  - Approach facilitates ad-hoc analysis by any staff after the fact
    - Discover useful analyses as we go
• Focus primarily on monitoring for internal staff
  - Augmenting and improving this is a continuous goal
• Limited data provided for users
  - Improving this is a long-term goal
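The "retain raw data, summarize on top" approach can be sketched roughly as follows. This is a minimal illustration using an in-memory SQLite database, not the ALCF warehouse itself; the table names, columns, and sample events are all hypothetical.

```python
import sqlite3

def load_events(conn, events):
    """Store every raw event, then rebuild a per-source summary table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (source TEXT, ts REAL, message TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event_summary "
        "(source TEXT PRIMARY KEY, n_events INTEGER)"
    )
    # Raw data is always kept, even if no analysis uses it yet.
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", events)
    # The summary is derived data and can be rebuilt from raw_events at any time.
    conn.execute("DELETE FROM event_summary")
    conn.execute(
        "INSERT INTO event_summary "
        "SELECT source, COUNT(*) FROM raw_events GROUP BY source"
    )
    conn.commit()

def ad_hoc_query(conn, sql, params=()):
    """Run an arbitrary after-the-fact query against the warehouse."""
    return conn.execute(sql, params).fetchall()

# Hypothetical sample events: (source, timestamp in seconds, message)
conn = sqlite3.connect(":memory:")
load_events(conn, [
    ("scheduler", 10.0, "job 1 start"),
    ("scheduler", 20.0, "job 1 end"),
    ("filesystem", 15.0, "server degraded"),
])
summary = dict(ad_hoc_query(conn, "SELECT source, n_events FROM event_summary"))
late_events = ad_hoc_query(conn, "SELECT * FROM raw_events WHERE ts > ?", (12.0,))
```

Because the raw table is never discarded, a question nobody anticipated at collection time (here, "which events happened after t = 12?") can still be answered later.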


[Diagram: "All The Things" feeding Analysis1, Analysis2, Analysis3]

SLIDE 3

Major Analysis

• Job Failure Analysis (JFA)
  - Establish root cause for any job failure (non-zero exit code)
  - Combination of automatic processing of “business” logic with human-in-the-loop to look at outliers
  - Generates “failure” records
  - Find areas for improvement at a system level
• Operational Data Processing System (ODPS)
  - Produces various system-wide metrics
  - Availability, MTTI, utilization, etc.
• Machine Time Overlay (MTO)
  - Graph node allocations over time on a 2-D grid with annotations of events or other temporal-spatial information
• Darshan (I/O monitoring)
• XALT (library tracking)
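The JFA idea of automatic rules plus a human in the loop for outliers might look something like the sketch below. The rule patterns, cause labels, and record fields are hypothetical and only illustrate the flow; they are not ALCF's actual business logic.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    exit_code: int
    log_text: str

# Hypothetical rules: (substring to look for in the job log, root-cause label)
RULES = [
    ("out of memory", "user: memory exhaustion"),
    ("node down", "system: hardware failure"),
    ("walltime exceeded", "user: ran past walltime"),
]

def classify(job):
    """Return a failure record for a failed job, or None if it succeeded."""
    if job.exit_code == 0:
        return None
    text = job.log_text.lower()
    for pattern, cause in RULES:
        if pattern in text:
            return {"job_id": job.job_id, "cause": cause, "needs_review": False}
    # No rule matched: flag the outlier for a person to look at.
    return {"job_id": job.job_id, "cause": "unknown", "needs_review": True}

jobs = [
    Job(1, 0, "done"),
    Job(2, 137, "Out of memory on rank 3"),
    Job(3, 1, "segfault"),
]
records = [r for r in map(classify, jobs) if r is not None]
review_queue = [r["job_id"] for r in records if r["needs_review"]]
```

Outliers that staff resolve by hand are natural candidates for new entries in the rule list, so the automatic share can grow over time.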

SLIDE 4

Total Knowledge of I/O (TOKIO) Framework

Elastic

Transforms monitoring data from across the data center into answers to "why is my I/O slow?"

https://www.nersc.gov/research-and-development/tokio/
https://www.github.com/nersc/pytokio
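One way to picture the kind of question TOKIO addresses is to compare file system load during a job's run against the overall load. The sketch below does not use the pytokio API; the function names, the sample format, and the "contention ratio" metric are assumptions made purely for illustration.

```python
def mean_in_window(samples, start, end):
    """Mean of sample values whose timestamp falls in [start, end]."""
    values = [v for t, v in samples if start <= t <= end]
    return sum(values) / len(values) if values else None

def contention_ratio(samples, job_start, job_end):
    """Load seen during the job divided by the mean load over all samples.

    A ratio well above 1 hints that the file system was busier than usual
    while the job ran. Returns None if no samples overlap the job.
    """
    during = mean_in_window(samples, job_start, job_end)
    if during is None:
        return None
    overall = sum(v for _, v in samples) / len(samples)
    return during / overall

# Hypothetical server-load samples: (timestamp in seconds, load value)
samples = [(0, 10.0), (60, 10.0), (120, 40.0), (180, 40.0)]
ratio = contention_ratio(samples, job_start=100, job_end=200)
```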

SLIDE 5

Major Data Sources

• BG/Q Control System
• ALPS Logs
• HSSdb (Cray hardware supervisory system)
• Scheduler logs (job, reservations)
• File system logs (GPFS, Lustre)
• Accounting (sbank)
• Job Logs (standard output, error, info)
• Theta control system logs (boot log, etc.)
• Job instrumentation (Darshan, AutoPerf, ...)
• Future
  - LDMS?
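Loading sources this varied into one warehouse usually means mapping each one onto a common record shape. The sketch below shows one possible approach with a parser per source; both log line formats and all field names are invented for illustration and do not reflect the real formats of any system listed above.

```python
from datetime import datetime, timezone

def parse_scheduler(line):
    # Hypothetical format: "<epoch seconds>;<job id>;<event>"
    ts, job_id, event = line.split(";", 2)
    return {"ts": float(ts), "source": "scheduler", "key": job_id, "message": event}

def parse_filesystem(line):
    # Hypothetical format: "<ISO 8601 UTC timestamp> <server> <message>"
    stamp, server, message = line.split(" ", 2)
    ts = (
        datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        .replace(tzinfo=timezone.utc)
        .timestamp()
    )
    return {"ts": ts, "source": "filesystem", "key": server, "message": message}

# One parser per data source; new sources are added by registering a function.
PARSERS = {"scheduler": parse_scheduler, "filesystem": parse_filesystem}

def normalize(source, line):
    """Convert a raw log line into the common record shape."""
    return PARSERS[source](line)

records = sorted(
    [
        normalize("filesystem", "1970-01-01T00:01:40Z oss01 slow response"),
        normalize("scheduler", "50.0;1234;job start"),
    ],
    key=lambda r: r["ts"],
)
```

With every record carrying a timestamp in the same units, events from different systems can be ordered and joined on time.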

SLIDE 6

The (complicated) Big Picture…

SLIDE 7

Our standard availability report
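The metrics behind such a report (availability, MTTI or mean time to interrupt, utilization) can be approximated as below. These are simplified textbook-style definitions chosen for illustration; the exact formulas used in the ALCF report are not given in the slides and may differ.

```python
def availability(period_hours, downtime_hours):
    """Fraction of the reporting period during which the machine was up."""
    return (period_hours - downtime_hours) / period_hours

def mtti(period_hours, downtime_hours, n_interrupts):
    """Mean time to interrupt: uptime divided by the number of interrupts.

    Simplified assumption: with no interrupts, the whole uptime is returned.
    """
    uptime = period_hours - downtime_hours
    return uptime if n_interrupts == 0 else uptime / n_interrupts

def utilization(node_hours_used, n_nodes, period_hours, downtime_hours):
    """Node-hours consumed by jobs divided by node-hours that were available."""
    return node_hours_used / (n_nodes * (period_hours - downtime_hours))

# Hypothetical week: 168 h period, 8 h down, 4 interrupts, 100 nodes
week_availability = availability(168, 8)
week_mtti = mtti(168, 8, 4)
week_utilization = utilization(12000, 100, 168, 8)
```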

SLIDE 8

Machine Time Overlay…

• Y axis is the allocable chunks of the machine
• X axis is time
• Analyze scheduling performance and behavior
• Any information that has data, location, and time can be displayed this way
  - coolant temperature, power consumption, etc.
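The grid behind such an overlay can be sketched as a 2-D array with one row per allocable chunk and one column per time bin. The function names, the allocation tuple layout, and the text rendering below are assumptions for illustration; the real MTO presumably draws a graphical plot.

```python
import math

def build_overlay(n_chunks, n_bins, bin_seconds, allocations):
    """Build a chunk-by-time grid of job IDs.

    allocations: iterable of (job_id, first_chunk, last_chunk, start_s, end_s),
    with chunk indices inclusive and times in seconds from the plot origin.
    """
    grid = [[None] * n_bins for _ in range(n_chunks)]
    for job_id, first, last, start, end in allocations:
        b0 = max(0, int(start // bin_seconds))
        b1 = min(n_bins, math.ceil(end / bin_seconds))
        for chunk in range(first, last + 1):
            for b in range(b0, b1):
                grid[chunk][b] = job_id
    return grid

def render(grid):
    """Text rendering: one row per chunk, '.' for idle, job ID otherwise."""
    return "\n".join(
        "".join("." if cell is None else str(cell) for cell in row) for row in grid
    )

# Hypothetical example: 3 chunks, four 60-second bins, two jobs
grid = build_overlay(3, 4, 60, [(1, 0, 1, 0, 120), (2, 2, 2, 60, 240)])
text = render(grid)
```

Idle cells ('.') show directly where the scheduler left chunks unallocated. The same grid could hold any other value keyed by location and time, such as a temperature reading per chunk per bin.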

SLIDE 9

Theta usage

SLIDE 10

Theta – library usage (XALT)

SLIDE 11

Theta – I/O usage (Darshan)

SLIDE 12

Acknowledgements

ALCF Operations Staff!

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
