Remora: A Resource Monitoring Tool for Everyone Carlos Rosales


  1. Remora: A Resource Monitoring Tool for Everyone Carlos Rosales carlos@tacc.utexas.edu

  2. Where does that odd name come from??? • It attaches to the user processes • It travels with them in the system • It feeds off your job (overhead) but provides some benefits (information)

  3. What is Remora? • Remora monitors all user activity and provides per-node and per-job resource utilization data • Developed by Antonio Gomez-Iglesias and Carlos Rosales at TACC • Open source, available on GitHub • NOT a profiler • NOT a debugger • But the data collected can often be used to improve code performance or detect issues

  4. Common Issues • User questions: – Why did I get banned from running jobs? – Why did my job crash? – Why is my performance so low on your supercomputer? • We have some tools in place: – Server logs (Splunk) – TACC Stats (hardware counter data, 10 min period)

  5. Current Tools Are Insufficient • 10 min interval in TACC Stats misses spikes of activity. – Fails to detect single large memory allocations – Fails to detect localized instances of high IO traffic. • Splunk is tedious to parse and typically only contains catastrophic errors. • NEITHER is visible to the user • Many useful features, but missing some critical to our users
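The point about the 10 min interval missing spikes can be illustrated with a small sketch. The trace below is invented for illustration (a 20 minute job at 2 GB with a 30 second allocation spike to 30 GB); it is not Remora or TACC Stats data, and `sample` is a hypothetical helper.

```python
def sample(trace, period_s):
    """Return the values a monitor sees when it samples every period_s seconds."""
    return trace[::period_s]

# Hypothetical memory trace, one value (GB) per second for a 20 minute job.
trace = [2.0] * 1200
# A 30 second spike to 30 GB starting at t = 700 s (made-up numbers).
for t in range(700, 730):
    trace[t] = 30.0

coarse_peak = max(sample(trace, 600))  # samples land at t = 0 and t = 600 only
fine_peak = max(sample(trace, 2))      # samples land every 2 s, including t = 700

print(f"peak seen at 600 s period: {coarse_peak} GB")
print(f"peak seen at   2 s period: {fine_peak} GB")
```

With the coarse period both samples fall outside the spike, so the reported peak stays at the baseline; the 2 second period catches it.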

  6. How does Remora fix those issues? • Fine-grained temporal resolution (tunable) • Simplified output for basic users – Highlights possible issues without overwhelming • Raw data available for advanced users – Deep analysis of each run possible – Post-processing tools provided

  7. Information Collected • Detailed timing of the application • CPU utilization • Memory utilization • NUMA information • I/O information (FS load and Lustre traffic) • Network information (topology and IB traffic)

  8. Accelerator support • Intel Xeon Phi – Treated like any other node – Background process is bound to core 61 to minimize overhead • GPU – Collects memory information using nvidia-smi – Other information is much harder to get to!
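As a rough sketch of collecting GPU memory information with nvidia-smi, the snippet below uses the tool's CSV query mode. The slides do not say which nvidia-smi options Remora actually uses, so the query string and the helper names `parse_gpu_memory` and `query_gpu_memory` are assumptions for illustration only.

```python
import subprocess

# Assumed query: used and total memory per GPU, in MiB, one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Turn rows like '1024, 16384' into (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory():
    """Run nvidia-smi once and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` at each sampling step would give one memory reading per GPU; the parser can be exercised on captured text without a GPU present.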

  9. Remora Summary
     ==============================================================================
     TACC: Max Memory Used Per Node : 8.52 GB
     TACC: Total Elapsed Time : 0d 0h 0m 27s 64ms
     TACC: MDS Load (IO REQ/S) : 0.00 (HOME) / 0.00 (WORK) / 2.00 (SCRATCH)
     ------------------------------------------------------------------------------
     TACC: Sampling Period : 2 seconds
     TACC: Complete Report Data : /full/path/to/workdir/remora_5905747
     ==============================================================================
     Plus additional lines for memory utilization if MICs or GPUs are used
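For anyone who wants to pick values out of a summary like the one on this slide, here is a minimal parsing sketch. It assumes every data line has the form `TACC: <name> : <value>` as shown on the slide; `parse_summary` and `elapsed_seconds` are hypothetical helpers, not part of Remora's post-processing tools.

```python
import re

# The summary lines from the slide, copied as text.
SUMMARY = """\
TACC: Max Memory Used Per Node : 8.52 GB
TACC: Total Elapsed Time : 0d 0h 0m 27s 64ms
TACC: MDS Load (IO REQ/S) : 0.00 (HOME) / 0.00 (WORK) / 2.00 (SCRATCH)
TACC: Sampling Period : 2 seconds
TACC: Complete Report Data : /full/path/to/workdir/remora_5905747
"""

def parse_summary(text):
    """Map each 'TACC: name : value' line to a {name: value} entry."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("TACC:"):
            continue  # skip separator rows and anything else
        name, sep, value = line[len("TACC:"):].partition(" : ")
        if sep:
            fields[name.strip()] = value.strip()
    return fields

def elapsed_seconds(stamp):
    """Convert a stamp such as '0d 0h 0m 27s 64ms' to seconds."""
    match = re.fullmatch(r"(\d+)d (\d+)h (\d+)m (\d+)s (\d+)ms", stamp)
    if match is None:
        raise ValueError(f"unexpected elapsed-time format: {stamp!r}")
    days, hours, minutes, seconds, millis = map(int, match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds + millis / 1000

fields = parse_summary(SUMMARY)
print(fields["Max Memory Used Per Node"])
print(elapsed_seconds(fields["Total Elapsed Time"]))
```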

  10. Raw Data Analysis [Figure: two panels, "Original" and "Improved"; axes: Memory Used (GB/s) and IO (requests/s) vs. Time (seconds); legend: Remora, Max Allowed, Automated Collection]

  11. Raw Data Analysis [Figure: Memory Used (GB) vs. Execution Time (s); legend: CPU, PHI]

  12. Raw Data Analysis

  13. Simple to Use
      module load remora
      remora ibrun mympi.code

      module load remora
      remora ./mycrazy.script

  14. Implementation • Bash and python, plus some C xltop trickery by Antonio :) • Master starts flat tree ssh connection to all nodes • Background task spawned on each node • Background task collects data regularly • IO data collected only from master node
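The "background task collects data regularly" step can be sketched as a small sampling loop. This is only an illustration of the pattern using a Python thread; Remora's own background task is described above as Bash and python started over ssh, and the names `sample_loop` and `start_sampler` are hypothetical.

```python
import threading
import time

def sample_loop(collect, period_s, stop, out):
    """Append (timestamp, reading) to out every period_s seconds until stop is set."""
    while not stop.is_set():
        out.append((time.time(), collect()))
        # wait() returns early once stop is set, so shutdown is not delayed
        # by a full sampling period.
        stop.wait(period_s)

def start_sampler(collect, period_s=2.0):
    """Start the loop in a daemon thread; returns (thread, stop_event, samples)."""
    stop = threading.Event()
    samples = []
    thread = threading.Thread(
        target=sample_loop, args=(collect, period_s, stop, samples), daemon=True
    )
    thread.start()
    return thread, stop, samples
```

A caller would pass any zero-argument function as `collect`, run the job, then call `stop.set()` and `thread.join()` before reading `samples`. The 2 second default mirrors the sampling period shown in the summary slide.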

  15. Implementation • Programs: numastat, mpstat, nvidia-smi, ibtracert, ibstatus, xltop, python • Files: /proc/meminfo, /proc/<pid>/status, /proc/sys/lnet/stats, /sys/class/infiniband/…
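As an example of what reading one of these files involves, here is a minimal sketch that parses /proc/meminfo text. Treating "used" as MemTotal minus MemFree is an assumption made for illustration and is not necessarily how Remora computes memory usage; `parse_meminfo` and `used_gib` are hypothetical names.

```python
def parse_meminfo(text):
    """Parse lines like 'MemTotal:  33554432 kB' into {name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # Most entries are in kB; a few (e.g. HugePages_Total) are plain counts.
            values[name.strip()] = int(parts[0])
    return values

def used_gib(values):
    """Assumed definition of used memory: MemTotal - MemFree, converted from kB to GiB."""
    return (values["MemTotal"] - values["MemFree"]) / 1024**2

# Sample text in the /proc/meminfo format, with made-up numbers.
sample_text = "MemTotal:       33554432 kB\nMemFree:        16777216 kB\n"
print(used_gib(parse_meminfo(sample_text)))
```

On a Linux node the same functions can be fed the contents of `open("/proc/meminfo").read()`.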

  16. Portability • Some hardcoded strings only applicable to TACC – easy fix (coming soon) • Hardcoded MPI launcher (ibrun) – easy fix (coming soon) • Post-processing has some TACC specific entries – easy fix (coming soon) • xltop requirement for Lustre IO report • Need to expand on the way the hostlist is collected

  17. Future Plans • Comprehensive report generation • Identify egregious performance issues and generate appropriate warnings • Add database for better comparative / historical data analysis • Improve launch step for better scalability

  18. Thanks! {carlos,agomez}@tacc.utexas.edu www.github.com/TACC/remora For more information: www.tacc.utexas.edu
