Short introduction to monitoring systems for large computer farms



SLIDE 1

[ Pierre Zelnicek 2010 ]

Short introduction to monitoring systems for large computer farms

Performance Monitoring
Long-term monitoring and statistical evaluation of system and network performance values.

Failure Monitoring
Monitoring of hardware and software failures. Event-based alerting makes autonomous reactions possible.

Security Auditing
Definition and setup of security policies for systems, users and network. Daily system checks and log analysis to find any incidents.

Diagram labels: Immediate Reaction / Configuration Change; Baseline (Service Level Agreement)

SLIDE 2


Performance Monitoring

What has to be monitored? CPU, disk and memory (CDM) usage; network bandwidth usage; CPU, disk and network I/O; network latency.
Where does it have to be monitored? On the systems themselves, per single user/process/instance. Network monitoring is done by accessing the switches via SNMP.
Which open-source solutions exist? Ganglia, Lemon, Cacti, Smokeping and self-built solutions based on RRDTool.
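As an illustration of collecting CDM values "on the systems themselves", the following is a minimal sketch using only the Python standard library on a Linux host. It is not how Ganglia, Lemon or the other listed tools work internally; the function names `parse_meminfo` and `collect_sample` and the choice of fields are assumptions made for this example.

```python
import os
import shutil
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        try:
            values[key.strip()] = int(parts[0])
        except ValueError:
            continue  # skip anything that is not a plain number
    return values


def collect_sample(mount_point="/", meminfo_path="/proc/meminfo"):
    """Collect one CPU/disk/memory sample for the local host (hypothetical layout)."""
    load_1min, _, _ = os.getloadavg()
    disk = shutil.disk_usage(mount_point)
    mem = {}
    if os.path.exists(meminfo_path):
        with open(meminfo_path) as handle:
            mem = parse_meminfo(handle.read())
    return {
        "timestamp": int(time.time()),
        "load_1min": load_1min,
        "disk_used_fraction": disk.used / disk.total if disk.total else 0.0,
        "mem_available_kb": mem.get("MemAvailable", mem.get("MemFree", 0)),
    }


if __name__ == "__main__":
    print(collect_sample())
```

A long-term monitoring system would call something like `collect_sample` at a fixed interval and store the results, for example in an RRDTool database; the storage step is left out here.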

SLIDE 3


Failure Monitoring

What has to be monitored? Errors and failures in hardware and software components. Critical thresholds for performance values.
Where does it have to be monitored? On the systems themselves, per single software instance or hardware component. Performance thresholds can be monitored via access to a performance monitoring system.
Which open-source solutions exist? SysMES, Nagios.
What should the software additionally provide? Autonomous execution of reactions to known or possible errors and failures.
Is it possible to foresee hardware failures? Yes, for some hardware components that provide indicator values.
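The combination of threshold checks and autonomous reactions can be sketched as follows. The state codes follow the Nagios plugin return-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL); everything else, including the `Threshold` class, the `evaluate` function and the metric name, is a hypothetical design for illustration and not the API of SysMES or Nagios.

```python
from dataclasses import dataclass

# Nagios plugin return-code convention
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class Threshold:
    metric: str
    warning: float
    critical: float


def classify(value, threshold):
    """Map a measured value to OK, WARNING or CRITICAL."""
    if value >= threshold.critical:
        return CRITICAL
    if value >= threshold.warning:
        return WARNING
    return OK


def evaluate(sample, thresholds, reactions):
    """Check every threshold and run the registered reaction, if any.

    `reactions` maps (metric, state) to a callable that receives the value.
    """
    results = {}
    for threshold in thresholds:
        if threshold.metric not in sample:
            continue
        value = sample[threshold.metric]
        state = classify(value, threshold)
        results[threshold.metric] = state
        reaction = reactions.get((threshold.metric, state))
        if reaction is not None:
            reaction(value)
    return results


if __name__ == "__main__":
    thresholds = [Threshold("disk_used_fraction", warning=0.80, critical=0.95)]
    reactions = {
        ("disk_used_fraction", CRITICAL): lambda v: print(f"cleanup triggered at {v:.0%}"),
    }
    print(evaluate({"disk_used_fraction": 0.97}, thresholds, reactions))
```

Keeping the reactions in a plain lookup table means only known error conditions trigger an automatic response; anything unregistered is merely reported.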

SLIDE 4


Failure Monitoring

Enclosure Device ID: 252
Slot Number: 7
Device Id: 11
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 931.512 GB [0x74706db0 Sectors]
Non Coerced Size: 931.012 GB [0x74606db0 Sectors]
Coerced Size: 930.390 GB [0x744c8000 Sectors]
Firmware state: Online, Spun Up
SAS Address(0): 0x96803721a299998b
Connected Port Number: 7(path0)
Inquiry Data: WD-WMATV1432482WDC WD1002FBYS-02A6B0 03.00C06
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 3.0Gb/s
Link Speed: 3.0Gb/s
Media Type: Hard Disk Device
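A listing like this one can feed the "indicator values" idea from the previous slide. The sketch below parses the `Key: Value` lines and reports any error counter above zero. Which counters count as failure indicators is an assumption made here, and the helper names `parse_fields` and `failing_indicators` are hypothetical; the slides do not say which tool produced the listing or how it was evaluated.

```python
# Counters treated as failure indicators (an assumption for this sketch)
INDICATORS = (
    "Media Error Count",
    "Other Error Count",
    "Predictive Failure Count",
)

# Excerpt of the listing shown on the slide
SAMPLE = """\
Slot Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up
SAS Address(0): 0x96803721a299998b
"""


def parse_fields(text):
    """Split 'Key: Value' lines into a dict, using the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def failing_indicators(fields):
    """Return the indicator counters whose value is greater than zero."""
    failing = {}
    for name in INDICATORS:
        count = int(fields.get(name, 0))
        if count > 0:
            failing[name] = count
    return failing


if __name__ == "__main__":
    disk = parse_fields(SAMPLE)
    print("slot", disk["Slot Number"], "indicators:", failing_indicators(disk))
```

A drive with all counters at zero yields an empty result; a non-zero predictive failure count would be a natural trigger for replacing the disk before it fails.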

SLIDE 5


Failure Monitoring

SLIDE 6


Security Auditing

What does security auditing cover? Checking the policy enforcement setup and checking for incidents.
Where does it have to be done? On the systems themselves for the policy enforcement setup. On a central logging facility for detecting incidents.
Which open-source solutions exist? SELinux or AppArmor for policy enforcement. RSyslog and LogCheck, SNORT for incident detection.
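Incident detection on a central logging facility can be illustrated with a small log scan. This sketch counts failed SSH password logins per source address and flags sources that reach a limit. It is a simplified stand-in, not how LogCheck or SNORT work; the function names, the limit of three and the sample log lines (using documentation-only addresses) are all made up for the example.

```python
import re
from collections import Counter

# Matches the OpenSSH "Failed password" message, with or without "invalid user"
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

# Invented sample lines for illustration only
SAMPLE_LOG = [
    "Jan 10 03:12:01 node01 sshd[411]: Failed password for root from 203.0.113.5 port 4022 ssh2",
    "Jan 10 03:12:04 node01 sshd[411]: Failed password for invalid user admin from 203.0.113.5 port 4023 ssh2",
    "Jan 10 03:12:09 node02 sshd[512]: Failed password for root from 203.0.113.5 port 4030 ssh2",
    "Jan 10 03:15:40 node02 sshd[530]: Failed password for alice from 198.51.100.7 port 5100 ssh2",
    "Jan 10 03:16:02 node02 sshd[530]: Accepted password for alice from 198.51.100.7 port 5101 ssh2",
]


def failed_logins_by_source(lines):
    """Count failed password logins per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def incidents(lines, limit=3):
    """Return the sources with at least `limit` failed logins, sorted."""
    counts = failed_logins_by_source(lines)
    return sorted(source for source, count in counts.items() if count >= limit)


if __name__ == "__main__":
    print(incidents(SAMPLE_LOG))
```

Running the scan on the central log host rather than on each node lets it correlate attempts spread across several machines, as in the sample lines above.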

SLIDE 7


Performance Monitoring
Failure Monitoring
Security Auditing

Diagram labels:
Reporting [ week / month / year ]
Management
Service Level Agreement
Configuration/Setup Change
Policy Change
Check thresholds
Check for incident
Immediate autonomous reaction / response