Application Monitoring: Robert A. Ballance, SNL; John T. Daly, LANL



  1. Application Monitoring. Robert A. Ballance, SNL; John T. Daly, LANL; Sarah Michalak, LANL. Presented at CUG 2008, Helsinki, Finland. Unclassified Unlimited Release. SAND Number: 2008-2932C

  2. What is it? Application monitoring is the automated process of tracking the real progress of an application over time
     – It is not platform monitoring
     – It is not queue monitoring
     – It is not utilization monitoring
     But it can be used to inform all of these processes!

  3. Application monitoring stems from a simple premise

  4. What if your jobs could talk?

  5. What if you knew how to listen?

  6. > cd ../../over/^H^H^H^H/back/somedir.d
     > ls
     > ls -l | less
     #! wrong directory. Where did I …?
     > cd ../../back^H^H^H^Hover/down/dir.2
     > ls
     > head -100 myrandomoutput.log | tail

  7. What if Ballance knew how to listen?

  8. Telephone rings… "Hi John." "Hi Bob." "Looks like your job has stalled (again)." "Thanks!"

  9. But how did he know that?

  10. Register in your scheduler job script:
        module load jobmonitor
        monitor -o myjob.out --check=size
      [Architecture diagram. Labels: User, System, Start, Scheduler, MySQL, Web, monitor, .monitor, monitor_job.pl, jobmonitor.conf, jobmonitor.cgi, job_status (command), job_status.pl, monitor_cron.sh, update_monitored_jobs.pl]

  11. [Job state diagram. Labels as extracted: Queued, Dequeued, Running, Initial, OK, Any running, Stalled, Exited, state, N, Config, Check, Check FS, Probably, Errors, Failed, Timeout, Timeout, Hung; Holding states]

  12. What can it check?
      • File size: increasing, decreasing
      • Access time: increasing
      • Modification time: increasing
      • GREP out number: increasing, decreasing
      • Still running?
      • Count files matching: increasing, decreasing
      • Count files on remote system: increasing, decreasing
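A file-size check of the kind listed on this slide can be sketched in a few lines of Python. This is an illustration only, not the jobmonitor implementation: the function names, the `patience` threshold, and the "ok"/"stalled" labels are assumptions chosen for the example.

```python
import os
from typing import List, Optional


def read_file_size(path: str) -> Optional[int]:
    """Return the file size in bytes, or None if the file does not exist."""
    try:
        return os.stat(path).st_size
    except FileNotFoundError:
        return None


def progressed(previous: Optional[int], current: Optional[int],
               direction: str = "increasing") -> bool:
    """True if the value moved in the expected direction between two checks."""
    if previous is None or current is None:
        return False
    if direction == "increasing":
        return current > previous
    if direction == "decreasing":
        return current < previous
    raise ValueError(f"unknown direction: {direction!r}")


def classify(history: List[Optional[int]], direction: str = "increasing",
             patience: int = 3) -> str:
    """Label a job from its succession of comparison values.

    Hypothetical rule: the job is 'stalled' once the last `patience`
    consecutive comparisons all show no movement in the expected direction.
    """
    if len(history) <= patience:
        return "ok"  # not enough observations to judge yet
    recent = history[-(patience + 1):]
    moved = [progressed(a, b, direction) for a, b in zip(recent, recent[1:])]
    return "ok" if any(moved) else "stalled"
```

A periodic task (for example, one run from cron) could append `read_file_size("myjob.out")` to the job's history on each pass and then call `classify` on it. The same comparison applies to the other listed checks once each is reduced to a single number.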

  13. Where can you check?
      ✓ job_status (command line)
      ✓ Web
      What can you see?
      ✓ You can see your jobs' status
      ✓ Your jobs' history, including the succession of comparison values
      ✓ Job description, state, etc.
      ✓ Administrators can view all jobs

  14. What if your job had meaningful things to say?

  15. Why isn't system monitoring good enough?
      • Preliminary investigations at Los Alamos indicate that as much as two-thirds of system unavailability to the application may be unaccounted for in system monitoring data because
        – System software interrupts (est. 50% of total interrupts) are frequently not tracked
        – Common-cause failures that may interrupt multiple applications are frequently counted as a single interrupt by system monitoring
      • NEED: A method of monitoring reliability from the application's perspective

  16. Application MTTI is a better metric than system MTBF for quantifying the user's experience. First order approximation of application mean time to fatal error demonstrates super-linear per processor reliability scaling.
      [Plot legend: A: Inverse Proportionality; B: First Order Approximation; C: Exact (Contiguous Nodes); D: Exact (Random Nodes); E: Exact (Worst Case Nodes); k: number of processors]

  17. What application data is required?
      • k_j: number of nodes allocated to the application
      • Δt_j: time that the application spent running
      • m_j: number of interrupts that occurred during the run
      These should be measured for each job "j"
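As a rough illustration of how these three per-job quantities might be combined, the sketch below pools them into an application MTTI estimate (total run time divided by total interrupts). It also makes a size-dependent projection using only the simplest of the models named on the previous slide, inverse proportionality. The pooled estimator, the `JobRecord` type, and the function names are assumptions for the example, not the statistical method developed in the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class JobRecord:
    nodes: int            # k_j: nodes allocated to job j
    runtime_hours: float  # Δt_j: time job j spent running
    interrupts: int       # m_j: interrupts during job j


def application_mtti(jobs: Iterable[JobRecord]) -> float:
    """Pooled estimate: total running time divided by total interrupts."""
    jobs = list(jobs)
    total_time = sum(j.runtime_hours for j in jobs)
    total_interrupts = sum(j.interrupts for j in jobs)
    if total_interrupts == 0:
        return float("inf")  # no interrupts observed yet
    return total_time / total_interrupts


def node_hours_per_interrupt(jobs: Iterable[JobRecord]) -> float:
    """Pooled node-hours accumulated per observed interrupt."""
    jobs = list(jobs)
    node_hours = sum(j.nodes * j.runtime_hours for j in jobs)
    total_interrupts = sum(j.interrupts for j in jobs)
    if total_interrupts == 0:
        return float("inf")
    return node_hours / total_interrupts


def predicted_mtti(jobs: Iterable[JobRecord], k: int) -> float:
    """Projected MTTI for a k-node job, ASSUMING inverse proportionality."""
    if k <= 0:
        raise ValueError("k must be a positive node count")
    return node_hours_per_interrupt(jobs) / k
```

For example, two jobs recorded as `JobRecord(100, 10.0, 2)` and `JobRecord(200, 5.0, 3)` give a pooled MTTI of 3 hours and 400 node-hours per interrupt, so a 400-node job would be projected at about 1 hour between interrupts under this assumption.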

  18. Data from application monitoring can be used to predict how effectively jobs of various sizes will run. The paper provides the mathematical and statistical basis. [Two plots omitted; axes labeled M 1 and M N]

  19. What else can app monitoring data reveal? Utilization? Performance? Scaling? Availability? Others...?

  20. Questions only the job can answer
      • Is the job making progress?
      • At what rate is it making progress?
      • How frequently is it interrupted?
      • What are the causes and symptoms of the interrupts?
      • Should the system intervene (e.g., to kill or restart the job)?
      • Should the system operators or user be notified?
      • How much time and storage are spent preparing for restarts?
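The first two questions can be approximated from a series of timestamped observations of any monotonic progress indicator, such as output file size or a step counter found by grep. The sketch below is a hypothetical helper, not part of the tools described in the talk; the sample format `(timestamp_seconds, value)` is an assumption.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, observed value)


def progress_rate(samples: Sequence[Sample]) -> Optional[float]:
    """Average change in value per second between first and last sample.

    Returns None when there are fewer than two samples or no elapsed time.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    return (v1 - v0) / (t1 - t0)


def seconds_since_last_change(samples: Sequence[Sample]) -> float:
    """Time from the last observed change in value to the newest sample."""
    if not samples:
        return 0.0
    changed_at = samples[0][0]
    for (_, v_prev), (t, v) in zip(samples, samples[1:]):
        if v != v_prev:
            changed_at = t  # remember the most recent change
    return samples[-1][0] - changed_at
```

A monitor could compare `seconds_since_last_change` against a per-job timeout to decide whether to notify the user or operators. The appropriate threshold depends on the application.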

  21. • Tri-Lab (LANL, LLNL, SNL) Application Monitoring Project
      • Phase 1 is this year
      • Tools, techniques, libraries, algorithms to enable a platform-independent app monitoring system
