Process Monitoring Of Nightly Builds PH/SFT & IT/CF



SLIDE 1

IT & PH DEPT.

Process Monitoring Of Nightly Builds PH/SFT & IT/CF

Summer Student Willem Van Lint

SLIDE 2

Overview

  • Nightly builds & problem statement
  • Lemon monitoring framework
  • Design of the process monitoring tool
  • Use in the nightly builds
SLIDE 4

Nightly builds: Introduction

  • Nightly builds compile and test projects such as ROOT, GAUDI, CORAL, …
  • Different platforms and slots (= build environments) → server-client build system

SLIDE 5

Nightly builds: Architecture

  • Nightly builds use a server-client architecture through RPC to distribute the architectures to be built.

[Diagram. Elements shown: a server; Linux, Windows and Mac clients that get work units from it; on a client, doBuild.py runs compile, install and test steps per project (e.g. ROOT); a MySQL store; a web interface.]
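
The slides do not say which RPC mechanism the build system uses. As a minimal sketch of the idea, assuming XML-RPC from the Python standard library, a server could hand out work units like this. The function names `get_work_unit`, `start_server` and `run_client`, and the contents of the work queue, are hypothetical.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical work units: [slot, platform] pairs still waiting to be built.
PENDING = [["dev1", "x86_64-slc5-gcc43-dbg"], ["dev1", "x86_64-slc5-icc11-dbg"]]
LOCK = threading.Lock()


def get_work_unit(machine):
    """Return the next pending work unit, or False if none is left.

    `machine` mirrors the client's --machine option; this sketch ignores it.
    """
    with LOCK:
        return PENDING.pop(0) if PENDING else False


def start_server(host="127.0.0.1", port=0):
    """Serve get_work_unit over XML-RPC in a background thread (port 0 = any free port)."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(get_work_unit)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_client(url, machine, build=lambda slot, platform: None):
    """Ask the server for work units until none remain; return the units handled."""
    proxy = ServerProxy(url)
    done = []
    while True:
        unit = proxy.get_work_unit(machine)
        if not unit:
            break
        slot, platform = unit
        build(slot, platform)  # placeholder for the compile / install / test steps
        done.append((slot, platform))
    return done
```

A client on any platform only needs the server URL; the `build` callback stands in for what doBuild.py does on the real clients.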

SLIDE 6

Nightly builds: Problems

  • Problem: processes hanging → high CPU time → large log files
    Solution: detect and terminate the processes and write the reason in the log files
  • Problem: low on disk space
    Solution: (A) clean up old builds, (B) stop building

SLIDE 7

Nightly builds: Problem example

  • Processes sometimes hang in tests or in make
  • Example of a process tree:
      • Client → doBuild.py → compile ROOT → subprocesses

PID   TTY STAT TIME     COMMAND
19485 ?   Ss   0:00     /bin/sh
19486 ?   S    0:00     \_ /bin/sh /afs/cern.ch/sw/lcg/app/nightlies/scripts/launch_client.sh lxbuild147 8002
19594 ?   S    0:00     | \_ python /afs/cern.ch/sw/lcg/app/nightlies/scripts/client.py --machine lxbuild147
19600 ?   Z    0:00     | \_ [uptime] <defunct>
15940 ?   S    0:00     | \_ python /afs/cern.ch/sw/lcg/app/nightlies/scripts/doBuild.py --slots dev1
15947 ?   S    0:00     | | \_ /bin/sh -c source{SITEROOT}/sw/contrib/
21661 ?   S    0:00     | | \_ cmt pkg_make 4
21683 ?   S    0:00     | | \_ sh -c mkdir -p logs;
21690 ?   S    0:00     | | \_ sh -x /build/nightlies/dev1/Fri/LCGCMT/LCGCMT_59
21695 ?   R    5781:25  | | | \_ make -k -j4
21691 ?   S    0:00     | | \_ tee -a logs/ROOT_x86_64-slc5-gcc43-dbg_make.log
8628  ?   Z    0:00     | \_ [python] <defunct>
19487 ?   S    0:00     \_ tee /afs/cern.ch/sw/lcg/app/nightlies/nightlies-logs/crncli64148.txt
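
To make the hanging process in this listing easier to spot, the TIME column can be converted to seconds. The sketch below assumes the BSD-style `ps` format, where TIME is accumulated CPU time as minutes:seconds; the helper names `cpu_seconds` and `busiest` are hypothetical.

```python
def cpu_seconds(bsd_time):
    """Convert a BSD-style ps TIME value such as '5781:25' (minutes:seconds) to seconds."""
    minutes, seconds = bsd_time.split(":")
    return int(minutes) * 60 + int(seconds)


def busiest(ps_lines):
    """Return (pid, cpu_seconds, command) of the process with the most CPU time."""
    rows = []
    for line in ps_lines:
        # Columns: PID TTY STAT TIME COMMAND (the command keeps its tree prefix).
        pid, _tty, _stat, time, command = line.split(None, 4)
        rows.append((int(pid), cpu_seconds(time), command))
    return max(rows, key=lambda row: row[1])


sample = [
    "21690 ? S 0:00 | | \\_ sh -x /build/nightlies/dev1/Fri/LCGCMT/LCGCMT_59",
    "21695 ? R 5781:25 | | | \\_ make -k -j4",
    "21691 ? S 0:00 | | \\_ tee -a logs/ROOT_x86_64-slc5-gcc43-dbg_make.log",
]
pid, seconds, command = busiest(sample)
print(pid, seconds)  # prints: 21695 346885
```

Under that reading, the `make -k -j4` process has accumulated roughly 96 hours of CPU time.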

SLIDE 8

Overview

  • Nightly builds & problem statement
  • Lemon monitoring framework
  • Design of the process monitoring tool
  • Use in the nightly builds
SLIDE 9

Lemon monitoring framework

  • Monitoring of sensor values
  • 3 interactions:
      • Sensor – Agent
      • Agent – Server
      • UI – User
SLIDE 10

Lemon architecture

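[Diagram of the Lemon architecture. Components shown: nodes, each running a monitoring agent with several sensors; an application server reached over TCP/UDP; a repository backend (SQL) on an Oracle database; RRDTool / PHP / Apache serving a web browser over HTTP; the Lemon CLI and Lemon-host-check for the user.]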

SLIDE 11

Lemon agent & sensors

[Diagram: a node running a monitoring agent with three sensors.]

  • Sensor = executable
  • Agent ↔ sensors:
      • Few simple commands.
      • Interaction: supported API for Perl, C++
  • Sensors provide metric classes that can be instantiated (e.g. with different parameters).

SLIDE 12

Lemon exceptions & actuators

  • Exceptions can be defined in the Lemon Agent based on values of the sensors.
  • An actuator can be called to resolve the exception.

30010
    MetricName   exception.hangingcpu
    MetricClass  alarm.exception
    Timing       20 5
    Parameters
        Correlation  (33:2 > 1000) && (33:1 > 0)
        Actuator     /usr/bin/lemon-actuator-kill cputime $act_value_02
        MaxRuns      3 900
        Timeout      100
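
The slides do not explain the correlation syntax. As an illustration only, the sketch below assumes that a reference such as `33:2` means "field 2 of metric 33" and that `&&` and `||` are logical and/or. The function `evaluate_correlation` and the `samples` dictionary are hypothetical and are not part of Lemon.

```python
import re

METRIC_REF = re.compile(r"(\d+):(\d+)")
SAFE_CHARS = re.compile(r"^[\d\s.()<>=!-]*$")


def evaluate_correlation(expression, samples):
    """Evaluate a correlation such as '(33:2 > 1000) && (33:1 > 0)'.

    `samples` maps (metric_id, field_number) to the latest numeric value.
    """
    def substitute(match):
        key = (int(match.group(1)), int(match.group(2)))
        return format(float(samples[key]), "f")

    expr = METRIC_REF.sub(substitute, expression)
    expr = expr.replace("&&", " and ").replace("||", " or ")
    # Refuse anything that is not numbers, comparisons, parentheses or and/or.
    remainder = expr.replace(" and ", " ").replace(" or ", " ")
    if not SAFE_CHARS.match(remainder):
        raise ValueError("unsupported token in correlation: %r" % expression)
    return bool(eval(expr, {"__builtins__": {}}, {}))


samples = {(33, 1): 21695, (33, 2): 5781}
print(evaluate_correlation("(33:2 > 1000) && (33:1 > 0)", samples))  # True
```

When the expression is true, the agent would raise the exception and could call the configured actuator.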

SLIDE 13

Overview

  • Nightly builds & problem statement
  • Lemon monitoring framework
  • Design of the process monitoring tool
  • Use in the nightly builds
SLIDE 14

Monitoring the nightly builds system

  • Reuse and enhancement of the Lemon sensor wrapper in Python
  • Implementation of new metric classes
  • Implementation of an actuator

[Diagram. Elements shown: monitoring agent, sensor, sensor wrapper, two metric modules, exception, actuator.]

SLIDE 15

Lemon sensor wrapper

  • Normal case: the sensor delivers metric values to the agent

SLIDE 16

Lemon sensor wrapper

  • The wrapper acts as a sensor but requests the metric values from other modules
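
The wrapper idea can be sketched as a small dispatcher. This is not the Lemon sensor API, which the slides do not show; the class `SensorWrapper`, its methods and the stand-in `cputime_module` are hypothetical names chosen for illustration.

```python
class SensorWrapper:
    """Looks like one sensor to the agent but forwards each request to a metric module."""

    def __init__(self):
        self._modules = {}

    def register(self, metric_class, module):
        """Associate a metric class name with a callable that returns rows of values."""
        self._modules[metric_class] = module

    def sample(self, metric_class, **parameters):
        """Ask the responsible module for values and format one line per row."""
        rows = self._modules[metric_class](**parameters)
        return [" ".join(str(value) for value in row) for row in rows]


def cputime_module(project):
    # Stand-in metric module: a real one would inspect the process tree.
    return [(2912, 9, project)]


wrapper = SensorWrapper()
wrapper.register("cputime", cputime_module)
print(wrapper.sample("cputime", project="Project0"))  # ['2912 9 Project0']
```

Adding a new metric then only means registering another module; the agent still talks to a single sensor.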

SLIDE 17

Metrics

  • Searching the process tree:
      • Select branches
      • Extract specific information
  • Modular: other select and extract functions can be used
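
The select/extract split described above could look roughly like the following sketch. The `Process` tuple, the `branch` and `search` helpers and the two example functions are hypothetical; the sample data is a simplified version of the process tree from the problem example, with some parent links assumed.

```python
from collections import namedtuple

Process = namedtuple("Process", "pid ppid cpu_seconds command")


def branch(processes, root_pid):
    """Return the root process and all of its descendants, root first."""
    children = {}
    for proc in processes:
        children.setdefault(proc.ppid, []).append(proc)
    by_pid = {proc.pid: proc for proc in processes}
    result, todo = [], [by_pid[root_pid]]
    while todo:
        proc = todo.pop()
        result.append(proc)
        todo.extend(children.get(proc.pid, []))
    return result


def search(processes, select, extract):
    """Apply `extract` to every branch whose root process satisfies `select`."""
    return [extract(branch(processes, proc.pid)) for proc in processes if select(proc)]


def is_project_build(proc):
    """Example select function: branches rooted at a doBuild.py process."""
    return "doBuild.py" in proc.command


def total_cputime(procs):
    """Example extract function: (root PID, CPU seconds summed over the branch)."""
    return (procs[0].pid, sum(proc.cpu_seconds for proc in procs))


# Simplified sample; the parent of 21690 is assumed for brevity.
PROCESSES = [
    Process(19594, 19486, 0, "python client.py --machine lxbuild147"),
    Process(15940, 19594, 0, "python doBuild.py --slots dev1"),
    Process(21690, 15940, 0, "sh -x ..."),
    Process(21695, 21690, 346885, "make -k -j4"),
]
print(search(PROCESSES, is_project_build, total_cputime))  # [(15940, 346885)]
```

Swapping in a different `select` or `extract` function changes the metric without touching the tree traversal.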

SLIDE 18

Metrics

  • Cputime metric:
      • Returns the total CPU time and the project name of a project build branch, and the PID of the root process of the branch
  • Example:

Machine     Metric nr  Time        PID   Total cpu time  Project name
lxbuild148  5380       1282552613  2912  9               Project0
lxbuild148  5380       1282552613  3813  5               Project1

SLIDE 19

Metrics

  • Files metric:
      • For each open log file in a project branch, returns the PID of the process holding the handle, the file path, the amount written by that process and the project name
  • Example:

Machine     Metric nr  Time                      PID    Path                               Amount written  Project
lxbuild148  5381       Mon Aug 23 04:03:54 2010  16231  .../x86_64-slc5-gcc43-opt.log      12028           GAUDI
lxbuild148  5381       Mon Aug 23 04:03:54 2010  17053  …/x86_64-slc5-icc11-dbg-tests.log                  COOL
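
The slides do not say how the files metric gathers its data. One possible approach on Linux, shown here purely as an assumption, is to read the `fd` and `fdinfo` entries under `/proc/<pid>` and to use the file offset (`pos`) as an approximation of the amount written. The function `open_log_files` and its parameters are hypothetical.

```python
import os


def open_log_files(pid, proc_root="/proc", suffix=".log"):
    """List (pid, path, position) for each file ending in `suffix` that `pid` has open.

    The file offset from fdinfo approximates the amount written through that handle.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            path = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed while we were looking
        if not path.endswith(suffix):
            continue
        with open(os.path.join(proc_root, str(pid), "fdinfo", fd)) as info:
            fields = dict(line.split(":", 1) for line in info if ":" in line)
        found.append((pid, path, int(fields["pos"])))
    return found
```

Calling `open_log_files(os.getpid())` inspects the current process; the `proc_root` parameter exists so the function can be pointed at a copy of the directory layout for testing.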

SLIDE 20

Actuator

  • Different actions when limits are exceeded:
      • Total CPU time > 2 h: find the process with the largest CPU time in the branch and terminate the branch from that process down.
      • Log files > 2 MB: terminate the branch from the responsible process down.
      • Partition usage > 85%: delete the builds from the previous day.
      • Partition usage > 95%: terminate all builds.
  • Always write an error in the log files of terminated processes
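
The limits above can be expressed as a small decision function. This is a sketch, not the actual actuator: the function `choose_actions` and the action strings are hypothetical, the 2 MB limit is read as binary megabytes, and treating the 95% rule as overriding the 85% rule is an assumption.

```python
CPU_LIMIT_SECONDS = 2 * 60 * 60      # total CPU time > 2 h
LOG_LIMIT_BYTES = 2 * 1024 * 1024    # log file > 2 MB (binary megabytes assumed)
CLEANUP_USAGE = 0.85                 # partition usage > 85%
STOP_USAGE = 0.95                    # partition usage > 95%


def choose_actions(total_cpu_seconds, largest_log_bytes, partition_usage):
    """Return the actions that the limits call for, in the order they are checked."""
    actions = []
    if total_cpu_seconds > CPU_LIMIT_SECONDS:
        actions.append("terminate branch from process with largest cputime")
    if largest_log_bytes > LOG_LIMIT_BYTES:
        actions.append("terminate branch from process writing the log")
    if partition_usage > STOP_USAGE:
        # Assumption: above 95% the builds are stopped instead of cleaned up.
        actions.append("terminate all builds")
    elif partition_usage > CLEANUP_USAGE:
        actions.append("delete builds from the previous day")
    return actions


print(choose_actions(346885, 12028, 0.50))
```

Keeping the decision separate from the code that kills processes or deletes builds makes the limits easy to unit test.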

SLIDE 21

Code optimization

  • Unit tests: goal = attaining full code coverage
  • Documentation generated from docstrings by Doxygen
  • Static code analysis by pylint, which checks for:
      • Errors and warnings
      • Refactoring possibilities, e.g. too many attributes
      • Coding conventions, e.g. indentation
SLIDE 22

Overview

  • Nightly builds & problem statement
  • Lemon monitoring framework
  • Design of the process monitoring tool
  • Use in the nightly builds
SLIDE 23

Use in the nightly builds system

  • Project builds will be monitored.
  • Hanging builds will be detected when they exceed the actuator limits.
  • The reason for termination will be written to the logs.
SLIDE 24

Outlook

  • A Windows and Mac version of the Lemon sensor wrapper to report sensor values through RPC to a Linux agent

[Diagram. Elements shown: a Linux machine with the monitoring agent, a sensor and sensor wrapper clients (repeaters); Windows/Mac machine(s) with a sensor wrapper server and a metric module; RPC between the wrapper clients and the wrapper server.]

SLIDE 25

Conclusion

  • Modular system for process monitoring based on cputime and open file size
  • Used in production
  • Reuse and enhancement of the Lemon sensor wrapper for Python.