Application Monitoring Robert A. Ballance, SNL John T. Daly, LANL




slide-1
SLIDE 1

Application Monitoring

Robert A. Ballance, SNL John T. Daly, LANL Sarah Michalak, LANL

Presented at CUG 2008, Helsinki, Finland Unclassified Unlimited Release SAND Number: 2008-2932C

slide-2
SLIDE 2

What is it?

Application monitoring is the automated process of tracking the real progress of an application over time

– It is not platform monitoring
– It is not queue monitoring
– It is not utilization monitoring

But it can be used to inform all of these processes!

slide-3
SLIDE 3

Application monitoring stems from a simple premise

slide-4
SLIDE 4

What if your jobs could talk?

slide-5
SLIDE 5

What if you knew how to listen?

slide-6
SLIDE 6

> cd ../../over/^H^H^H^H/back/somedir.d
> ls
> ls -l | less
#! wrong directory. Where did I …?
> cd ../../back^H^H^H^Hover/down/dir.2
> ls
> head -100 myrandomoutput.log | tail

slide-7
SLIDE 7

What if Ballance knew how to listen?

slide-8
SLIDE 8

Telephone rings…
Hi John
Hi Bob
Looks like your job has stalled (again)
Thanks!

slide-9
SLIDE 9

But how did he know that?

slide-10
SLIDE 10

Register in your scheduler job script

module load jobmonitor
monitor -o myjob.out --check=size

[Architecture diagram. Labels shown: MySQL; monitor; System, User, User; monitor_job.pl; monitor_cron.sh; (command); update_monitored_jobs.pl; job_status; job_status.pl; jobmonitor.cgi; Web; .monitor; jobmonitor.conf; System Scheduler; Start.]

slide-11
SLIDE 11

[State diagram. States shown: Initial, Queued, Running, OK, Stalled, Config Errors, Check Failed, Check Timeout, FS Timeout, Probably Hung, Exited, Dequeued. Group labels: "Any running state", "Holding states".]

slide-12
SLIDE 12

What can it check?

  • File size: increasing, decreasing
  • Access time: increasing
  • Modification time: increasing
  • GREP out number: increasing, decreasing
  • Still running?
  • Count files matching: increasing, decreasing
  • Count files on remote system: increasing, decreasing
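As an illustration of an "increasing" or "decreasing" check, the following Python sketch compares two successive samples of a monitored value such as file size. The function names `sample_size` and `classify_progress` are hypothetical, and the returned labels only borrow the OK, Stalled, and Check Failed names from the state diagram; none of this is taken from the jobmonitor tool itself.

```python
import os


def sample_size(path):
    """Return the current size of `path` in bytes, or None if it cannot be read."""
    try:
        return os.stat(path).st_size
    except OSError:
        return None


def classify_progress(previous, current, direction="increasing"):
    """Compare two successive samples of a monitored value.

    Hypothetical helper: returns "OK" when the value moved in the expected
    direction, "Stalled" when it did not, and "Check Failed" when either
    sample is missing.
    """
    if previous is None or current is None:
        return "Check Failed"
    if direction == "increasing":
        return "OK" if current > previous else "Stalled"
    if direction == "decreasing":
        return "OK" if current < previous else "Stalled"
    raise ValueError(f"unknown direction: {direction!r}")
```

A periodic caller would keep the previous sample per job and feed each new sample through `classify_progress`; how many consecutive non-advancing samples should count as a stall is a policy choice this sketch leaves open.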

slide-13
SLIDE 13

Where can you check?

✓ Where can you check?
  ✓ job_status (command line)
  ✓ Web

✓ What can you see?
  ✓ You can see your jobs’ status
  ✓ Your jobs’ history, including the succession of comparison values
  ✓ Job description, state, etc.

✓ Administrators can view all jobs

slide-14
SLIDE 14

What if your job had meaningful things to say?

slide-15
SLIDE 15

Why isn’t system monitoring good enough?

  • Preliminary investigations at Los Alamos indicate that as much as two-thirds of system unavailability to the application may be unaccounted for in system monitoring data because
    – System software interrupts (est. 50% of total interrupts) are frequently not tracked
    – Common-cause failures that may interrupt multiple applications are frequently counted as a single interrupt by system monitoring
  • NEED: A method of monitoring reliability from the application’s perspective

slide-16
SLIDE 16

Application MTTI is a better metric than system MTBF for quantifying the user’s experience

[Plot legend]
A -- Inverse Proportionality
B -- First Order Approximation
C -- Exact (Contiguous Nodes)
D -- Exact (Random Nodes)
E -- Exact (Worst Case Nodes)
k -- number of processors

First order approximation of application mean time to fatal error demonstrates super-linear per-processor reliability scaling

slide-17
SLIDE 17

What application data is required?

  • k_j ─ # of nodes allocated to the application
  • ∆t_j ─ time that the application spent running
  • m_j ─ # of interrupts that occurred during the run

These should be measured for each job “j”
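One way these per-job measurements might be combined is sketched below in Python. It is not the estimator derived in the paper: it pools all jobs to estimate an interrupt rate per node-hour, then scales that rate to a job of k nodes under the simple inverse-proportionality assumption (interrupt rate proportional to node count). The `JobRecord` type, the function names, and the choice of hours as the time unit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    nodes: int       # k_j: nodes allocated to job j
    hours: float     # delta t_j: time job j spent running, in hours
    interrupts: int  # m_j: interrupts observed during job j


def interrupts_per_node_hour(jobs):
    """Pooled interrupt rate: total interrupts divided by total node-hours."""
    node_hours = sum(j.nodes * j.hours for j in jobs)
    if node_hours <= 0:
        raise ValueError("jobs contain no node-hours")
    return sum(j.interrupts for j in jobs) / node_hours


def estimated_mtti(jobs, k):
    """Estimated mean time to interrupt (hours) for a job on k nodes.

    Assumes the interrupt rate grows in proportion to k, which is only
    the simplest of the scaling models and may be pessimistic or optimistic.
    """
    rate = interrupts_per_node_hour(jobs)
    if rate == 0:
        return float("inf")
    return 1.0 / (rate * k)
```

For example, two jobs of 100 nodes for 10 hours with 2 interrupts and 50 nodes for 20 hours with none give 2 interrupts over 2000 node-hours, so under this assumption a 100-node job would see an estimated MTTI of about 10 hours.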

slide-18
SLIDE 18

[Figure: two plot panels. Recoverable labels: M1, MN.]

The paper provides the mathematical and statistical basis.

Data from application monitoring can be used to predict how effectively jobs of various sizes will run.

slide-19
SLIDE 19

What else can app monitoring data reveal?

  • Utilization?
  • Others...?
  • Availability?
  • Scaling?
  • Performance?

slide-20
SLIDE 20

Questions only the job can answer

  • Is the job making progress?
  • At what rate is it making progress?
  • How frequently is it interrupted?
  • What are the causes and symptoms of the interrupts?
  • Should the system intervene (e.g., to kill or restart the job)?
  • Should the system operators or user be notified?
  • How much time and storage are spent preparing for restarts?
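The question of progress rate can be made concrete with a small Python sketch. It assumes the job (or a monitor watching it) records a history of timestamped values, such as output size in bytes or completed timesteps; the function name `progress_rate` and the history format are hypothetical.

```python
def progress_rate(history):
    """Average rate of change between the first and last samples.

    `history` is a list of (timestamp_seconds, value) pairs sorted by time.
    Returns value units per second, or None when there are fewer than two
    samples or no time has elapsed.
    """
    if len(history) < 2:
        return None
    (t0, v0), (t1, v1) = history[0], history[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return None
    return (v1 - v0) / elapsed
```

A rate of zero over a long enough window is one possible signal that a job has stalled; what counts as "long enough" depends on the application.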

slide-21
SLIDE 21
  • Tri-Lab (LANL, LLNL, SNL) Application Monitoring Project
  • Phase 1 is this year
  • Tools, techniques, libraries, algorithms to enable a platform-independent app monitoring system