Remora: A Resource Monitoring Tool for Everyone

Remora: A Resource Monitoring Tool for Everyone Carlos Rosales carlos@tacc.utexas.edu

Where does that odd name come from??? • It attaches to the user processes • It travels with them in the system • It feeds off your job (overhead) but provides some benefits (information)

What is Remora? • Remora monitors all user activity and provides per-node and per-job resource utilization data • Developed by Antonio Gomez-Iglesias and Carlos Rosales at TACC • Open source, available on GitHub • NOT a profiler • NOT a debugger • But the data collected can often be used to improve code performance or detect issues

Common Issues • User questions: – Why did I get banned from running jobs? – Why did my job crash? – Why is my performance so low on your supercomputer? • We have some tools in place: – Server logs (Splunk) – TACC Stats (hardware counter data, 10 min period)

Current Tools Are Insufficient • 10 min interval in TACC Stats misses spikes of activity. – Fails to detect single large memory allocations – Fails to detect localized instances of high IO traffic. • Splunk is tedious to parse and typically only contains catastrophic errors. • NEITHER is visible to the user • Many useful features, but missing some critical to our users

How does Remora fix those issues? • Fine-grained temporal resolution (tunable) • Simplified output for basic users – Highlights possible issues without overwhelming • Raw data available for advanced users – Deep analysis of each run possible – Post-processing tools provided
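To illustrate why the sampling period matters, here is a minimal sketch with a synthetic memory trace. The trace, the spike timing, and the function names are all made up for this example and are not Remora data or code. It compares the peak seen with a 10 minute period against the peak seen with a 2 second period.

```python
# Synthetic memory trace: 2 GB baseline with a 20-second spike to 30 GB.
# All numbers are hypothetical and chosen only to illustrate the point.

def memory_at(t):
    """Memory in use (GB) at second t of a hypothetical one-hour job."""
    return 30.0 if 1000 <= t < 1020 else 2.0


def observed_peak(period, duration=3600):
    """Largest value seen when sampling every `period` seconds."""
    return max(memory_at(t) for t in range(0, duration, period))


if __name__ == "__main__":
    print("peak seen with a 600 s period:", observed_peak(600), "GB")
    print("peak seen with a 2 s period:  ", observed_peak(2), "GB")
```

In this trace the coarse sampler never lands inside the spike, so it reports only the baseline, while the fine sampler catches the full allocation.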

Information Collected • Detailed timing of the application • CPU utilization • Memory utilization • NUMA information • I/O information (FS load and Lustre traffic) • Network information (topology and IB traffic)

Accelerator support • Intel Xeon Phi – Treated like any other node – Background process is bound to core 61 to minimize overhead • GPU – Collects memory information using nvidia-smi – Other information is much harder to get to!

Remora Summary
==============================================================================
TACC: Max Memory Used Per Node : 8.52 GB
TACC: Total Elapsed Time : 0d 0h 0m 27s 64ms
TACC: MDS Load (IO REQ/S) : 0.00 (HOME) / 0.00 (WORK) / 2.00 (SCRATCH)
------------------------------------------------------------------------------
TACC: Sampling Period : 2 seconds
TACC: Complete Report Data : /full/path/to/workdir/remora_5905747
==============================================================================
Plus additional lines for memory utilization if MICs or GPUs are used
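If the summary text is captured (for example from the job's output), the fields can be pulled into a dictionary for scripting. The sketch below assumes the "TACC: name : value" layout shown in the summary above. The function names `parse_summary` and `elapsed_seconds` are hypothetical helpers and are not part of Remora.

```python
import re

# Sample text copied from the summary shown on the slide.
SUMMARY = """\
TACC: Max Memory Used Per Node : 8.52 GB
TACC: Total Elapsed Time : 0d 0h 0m 27s 64ms
TACC: MDS Load (IO REQ/S) : 0.00 (HOME) / 0.00 (WORK) / 2.00 (SCRATCH)
TACC: Sampling Period : 2 seconds
TACC: Complete Report Data : /full/path/to/workdir/remora_5905747
"""


def parse_summary(text):
    """Return {field name: value string} for each 'TACC: name : value' line."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"TACC:\s*(.+?)\s+:\s+(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def elapsed_seconds(value):
    """Convert a string such as '0d 0h 0m 27s 64ms' to seconds."""
    match = re.match(r"(\d+)d (\d+)h (\d+)m (\d+)s (\d+)ms$", value)
    if match is None:
        raise ValueError(f"unrecognized elapsed time: {value!r}")
    days, hours, minutes, seconds, millis = map(int, match.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds + millis / 1000


if __name__ == "__main__":
    fields = parse_summary(SUMMARY)
    print(fields["Max Memory Used Per Node"])
    print(elapsed_seconds(fields["Total Elapsed Time"]), "seconds")
```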

Raw Data Analysis [Figure: two panels, "Original" and "Improved". Axes: Memory Used (GB/s), IO (requests/s), Time (seconds). Legend: Remora, Max Allowed, Automated Collection.]

Raw Data Analysis [Figure: Memory Used (GB) versus Execution Time (s). Series: CPU, PHI.]

Raw Data Analysis

Simple to Use
module load remora
remora ibrun mympi.code

module load remora
remora ./mycrazy.script

Implementation • Bash and Python, plus some C xltop trickery by Antonio • Master starts flat tree ssh connection to all nodes • Background task spawned in each node • Background task collects data regularly • IO data collected only from master node
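The "background task collects data regularly" step can be sketched on a single node as a thread that calls a collector at a fixed period. This is only an illustration of the pattern. Remora itself is described above as Bash and Python launched over ssh, and the names `start_sampler` and `collect` are hypothetical.

```python
import threading
import time


def start_sampler(collect, period, samples):
    """Call collect() every `period` seconds in a background thread.

    Each result is appended to `samples` as (timestamp, value).
    Returns (thread, stop_event); set stop_event to end sampling.
    """
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            samples.append((time.time(), collect()))
            stop.wait(period)  # sleeps for `period`, but wakes early on stop

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread, stop


if __name__ == "__main__":
    samples = []
    # The lambda stands in for a real collector such as a /proc reader.
    thread, stop = start_sampler(lambda: 42, period=0.1, samples=samples)
    time.sleep(0.35)  # stands in for the user's job
    stop.set()
    thread.join()
    print(len(samples), "samples collected")
```

Using `Event.wait` instead of `time.sleep` lets the sampler stop promptly when the job finishes, rather than waiting out a full period.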

Implementation • Programs: – numastat – mpstat – nvidia-smi – ibtracert – ibstatus – xltop – python • Files: – /proc/meminfo – /proc/<pid>/status – /proc/sys/lnet/stats – /sys/class/infiniband/…
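As a rough sketch of how a file such as /proc/meminfo can be turned into a memory figure, the code below parses its "Name: value kB" lines. The same line format is used by the Vm* fields of /proc/<pid>/status. The definition of "used" here (MemTotal minus MemFree, Buffers, and Cached) is an assumption made for illustration, not necessarily what Remora reports, and the sample values are invented.

```python
import os

# Invented sample in the /proc/meminfo line format (values in kB).
SAMPLE_MEMINFO = """\
MemTotal:       33554432 kB
MemFree:        16777216 kB
Buffers:         1048576 kB
Cached:          7340032 kB
"""


def parse_kb_fields(text):
    """Parse 'Name:   12345 kB' lines into {name: kilobytes}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_memory_gb(meminfo_text):
    """Assumed definition: MemTotal - MemFree - Buffers - Cached, in GB."""
    f = parse_kb_fields(meminfo_text)
    used_kb = (f["MemTotal"] - f["MemFree"]
               - f.get("Buffers", 0) - f.get("Cached", 0))
    return used_kb / 1024 ** 2


if __name__ == "__main__":
    print("sample:", used_memory_gb(SAMPLE_MEMINFO), "GB used")
    if os.path.exists("/proc/meminfo"):  # Linux only
        with open("/proc/meminfo") as handle:
            print("this node:", round(used_memory_gb(handle.read()), 2), "GB used")
```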

Portability • Some hardcoded strings only applicable to TACC – easy fix (coming soon) • Hardcoded MPI launcher (ibrun) – easy fix (coming soon) • Post-processing has some TACC specific entries – easy fix (coming soon) • xltop requirement for Lustre IO report • Need to expand on the way the hostlist is collected

Future Plans • Comprehensive report generation • Identify egregious performance issues and generate appropriate warnings • Add database for better comparative / historical data analysis • Improve launch step for better scalability

Thanks! {carlos,agomez}@tacc.utexas.edu www.github.com/TACC/remora For more information: www.tacc.utexas.edu
