Monitoring 6000+ hosts in Zabbix: A Pseudo-DevOps journey

SLIDE 1

Monitoring 6000+ hosts in Zabbix

A Pseudo-DevOps journey

SLIDE 2

About me

  • Senior Systems Engineer, Tools and Automation, Kinetic IT @ Department of Education
  • Co-founder of Passive Eye Ltd
  • Open Source contributor

SLIDE 3
Dept. Education

  • ~800 schools
  • ~400,000 end users
  • ~5000 SOE servers @ schools
  • ~1500 heterogeneous servers @ central office
  • Hub-spoke topology
  • Vast geographic distribution

SLIDE 4

Problem definition

  • Multiple, disconnected monitoring tools
      • Poor coverage
      • Lack of correlation
      • Duplication of effort
      • Inconsistent practice
      • Difficulty measuring SLAs

SLIDE 5

Requirements

  • Single pane of glass
  • Scalability
  • Extensibility
  • Ease of use
  • Low-cost licensing

SLIDE 6

Enter Zabbix

  • All-in-one
  • Support for diverse devices
  • Small footprint and scalable architecture
  • Extensible API
  • Configuration UI
  • Free and open source + support

SLIDE 7

How Zabbix works

  • Primary server
  • Database
  • Frontend
  • Proxy Servers
  • Agents
  • Passive devices


SLIDE 8

How Zabbix works

  • Items and LLD
  • Active and passive checks
  • Hosts and templates
  • Triggers, Events and Actions
  • Graphs, Screens and Maps
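
To make the LLD (low-level discovery) bullet concrete: a discovery rule receives a JSON document listing the entities found, each described by {#MACRO} values that item and trigger prototypes then expand. The sketch below builds such a payload in the {"data": [...]} envelope used by Zabbix 2.x. The function name and the sample file systems are assumptions made for illustration; nothing here is taken from the deck.

```python
import json


def build_lld_payload(rows):
    """Wrap discovered entities in the {"data": [...]} envelope.

    Each row maps a bare macro name (e.g. "fsname") to its value; the
    {#...} decoration and upper-casing are applied here.
    """
    data = [
        {"{#%s}" % name.upper(): value for name, value in row.items()}
        for row in rows
    ]
    return json.dumps({"data": data})


# Hypothetical discovery result: two file systems found on a host.
payload = build_lld_payload([
    {"fsname": "/", "fstype": "ext4"},
    {"fsname": "/var", "fstype": "xfs"},
])
print(payload)
```

Each object in the list yields one set of items from the rule's prototypes, so adding a file system to the host needs no change in the template.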

SLIDE 9

Planning

  • Scrum + Jira Agile

SLIDE 10

Building

  • Discrete environments in Vagrant
  • Infrastructure as code
  • Discrete feature branches
  • Monolithic source repo

http://nvie.com/posts/a-successful-git-branching-model/

SLIDE 11

Puppet code

SLIDE 12

Testing

  • Bamboo
  • Cucumber

SLIDE 13

Deployment

  • Server, DB, Web
  • No Proxies (so far)

SLIDE 14

Performance

SLIDE 15

Integrations

  • Active Directory
  • CMDB
  • Service Management
  • ICT Dashboard

SLIDE 16

Autonomy

  • Host registration
  • Low level discovery
  • User provisioning
  • Remediation scripts
  • Data housekeeping
  • Incidents and escalations

SLIDE 17

Template hierarchy

  • Host
  • Class
  • Templates
  • Items, etc.

SLIDE 18

Windows monitoring

  • OOB Support:
      • WMI queries
      • Performance counter probes
      • Event Log monitor
      • Service state

SLIDE 19

Windows monitoring

  • Customisations
      • Hostname casing and format
      • Service discovery
      • Performance counter discovery
      • Failover Cluster discovery
      • Persistent disk and volume identification

SLIDE 20

Windows monitoring

  • Tools
      • MSI installer package
      • Test PowerShell script
      • Performance Counter template builder

SLIDE 21

Windows monitoring

  • Convert-PerfCountersToZabbixTemplate.ps1
      • Counter Set > Application
      • Multi-instance counter > Discovery Rule + Prototypes
      • Single-instance counter > Item check

> ./Convert-PerfCountersToZabbixTemplate.ps1 -CounterSet Processor | Out-File template.xml

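The mapping rules on this slide can be sketched in a few lines. This is not the PowerShell script's code: the CounterSet type, the plan_template function and the {#INSTANCE} macro name are assumptions chosen for illustration. Only the perf_counter[...] agent key and the three mapping rules come from Zabbix and the slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CounterSet:
    name: str              # e.g. "Processor"
    multi_instance: bool   # True when the set exposes per-instance counters
    counters: List[str]    # counter names within the set


def plan_template(cs: CounterSet) -> Dict:
    """Describe the template objects that one counter set would produce."""
    # Rule 1: the counter set becomes an application.
    plan = {"application": cs.name, "items": [], "discovery_rule": None}
    if cs.multi_instance:
        # Rule 2: one discovery rule plus an item prototype per counter.
        # {#INSTANCE} is a hypothetical LLD macro name.
        plan["discovery_rule"] = {
            "name": cs.name + " instances",
            "item_prototypes": [
                'perf_counter["\\%s({#INSTANCE})\\%s"]' % (cs.name, counter)
                for counter in cs.counters
            ],
        }
    else:
        # Rule 3: one plain item check per counter.
        plan["items"] = [
            'perf_counter["\\%s\\%s"]' % (cs.name, counter)
            for counter in cs.counters
        ]
    return plan


processor = CounterSet("Processor", True, ["% Processor Time", "% Idle Time"])
system = CounterSet("System", False, ["Processor Queue Length"])

for cs in (processor, system):
    print(plan_template(cs))
```

A real builder would go on to serialise the plan as Zabbix template XML, which is what the command above redirects into template.xml.
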
SLIDE 22

Linux monitoring

  • Modules
      • Extensions for Linux kernel
      • PostgreSQL
  • Packaging
  • Test script

SLIDE 23

SNMP monitoring

  • CRAC
  • UPS
  • Dell iDRAC
  • IPS
  • Mail gateways

SLIDE 24

SNMP monitoring

  • mib2zabbix.pl
      • Tree nodes > Applications
      • OID Tables > Discovery Rules + Prototypes
      • OID Scalars > Item checks
      • OID Traps > Item + Trigger + snmptt config
      • Enums > Value Maps

$ ./mib2zabbix.pl --template --oid=.1.3.6.1.2.1.25 --name="Host resources"
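
The "Enums > Value Maps" rule is the easiest one to show in isolation. The sketch below, which is not taken from mib2zabbix.pl, parses an SMI enumeration into the number-to-label pairs a Zabbix value map holds. The parse_enum helper is hypothetical; the sample enumeration is hrDeviceStatus from HOST-RESOURCES-MIB, which lives under the .1.3.6.1.2.1.25 subtree used in the command above.

```python
import re


def parse_enum(syntax: str) -> dict:
    """Turn 'unknown(1), running(2)' into {1: 'unknown', 2: 'running'}."""
    # A label is an identifier immediately followed by a parenthesised integer.
    pairs = re.findall(r"([A-Za-z][\w-]*)\s*\(\s*(-?\d+)\s*\)", syntax)
    return {int(number): label for label, number in pairs}


HR_DEVICE_STATUS = (
    "INTEGER { unknown(1), running(2), warning(3), testing(4), down(5) }"
)

value_map = parse_enum(HR_DEVICE_STATUS)
for number, label in sorted(value_map.items()):
    print(number, "=>", label)
```

With the value map attached to the item, the frontend shows "down" rather than a bare 5.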

SLIDE 25

Application monitoring

  • Microsoft Exchange
  • Microsoft SCCM
  • Microsoft SQL Server
  • Microsoft Active Directory
  • PostgreSQL Server
  • EMC Avamar
  • HP BPM
  • Squid Proxy
  • Zabbix Server
  • …

SLIDE 26

Risk mitigation

  • Document in code
  • Source control
  • Clearly defined interfaces
  • Quality gates
  • Upstream contribution
  • Change the hiring criteria to avoid SPOF

SLIDE 27

Agent stress test

  • Critical to finding:
      • Memory leaks
      • Race conditions
      • Impact on system
      • Regressions
  • Validate efficiency improvements

SLIDE 28
SLIDE 29

It’s no magic bullet…

  • Data aggregation
  • Visualizations
  • Alert Scripts

SLIDE 30

Future

  • Zabbix v3 upgrade
  • Better engagement from ITOps
  • More devices and apps
  • More automation
  • Better use of data
  • Enterprise Integration Patterns
  • Cloud monitor

SLIDE 31

DevOps?

  • Meta-software
  • Agile delivery
  • Infrastructure As Code
  • Continuous Integration
  • Theory of Constraints

SLIDE 32

Contrib

  • PostgreSQL monitoring: https://github.com/cavaliercoder/libzbxpgsql
  • Agent benchmarking: https://github.com/cavaliercoder/zabbix_agent_bench
  • Windows MSI package: https://github.com/cavaliercoder/zabbix-msi
  • Golang module adapter: https://github.com/cavaliercoder/g2z

SLIDE 33

Windows Counters

  • Performance Counter IDs are non-persistent
      • Today G: is PhysicalDisk(3 G:), tomorrow it is PhysicalDisk(5)
  • Graphs and alerts break
  • Mapping is not practical in scripting APIs
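
To see why the instance name makes a poor key, it helps to split it into its parts. The sketch below parses names of the form shown on the slide ("3 G:", "5") into a disk index and a list of drive letters. The DiskInstance type and parse_physical_disk function are hypothetical names; aggregate instances such as _Total are deliberately rejected.

```python
import re
from typing import List, NamedTuple


class DiskInstance(NamedTuple):
    index: int          # runtime disk number, may change between boots
    letters: List[str]  # mounted drive letters, may be empty or change


# A decimal index followed by zero or more " X:" drive letters.
_INSTANCE = re.compile(r"^(\d+)((?:\s+[A-Za-z]:)*)$")


def parse_physical_disk(instance: str) -> DiskInstance:
    match = _INSTANCE.match(instance.strip())
    if match is None:
        raise ValueError("not a PhysicalDisk instance name: %r" % instance)
    return DiskInstance(int(match.group(1)), match.group(2).split())


today = parse_physical_disk("3 G:")
tomorrow = parse_physical_disk("5")
print(today, tomorrow)
```

Neither field is stable across the two observations, so an item keyed on the literal instance string stops collecting as soon as either one changes.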

SLIDE 34

Windows physical disks

  • Mutable performance counter ID: PhysicalDisk(0 C:)
  • Index (‘0’) changes on reboot, swap, failover, etc.
  • The drive letter (‘C:’) is undocumented
  • MBR Signatures and GPT GUIDs are more persistent


SLIDE 35

Windows physical disks

  • Q: What runtime counter index maps to desired MBR/GUID?
  • Identify via MBR Signature or GPT GUID: DeviceIoControl (IOCTL_DISK_GET_DRIVE_LAYOUT_EX)
  • Get device index (\\.\PHYSICALDRIVE<i>): DeviceIoControl (IOCTL_STORAGE_GET_DEVICE_NUMBER)
  • Iterate PhysicalDisk counters (ignore drive letter): PdhEnumObjectItems

Sample code: https://github.com/cavaliercoder/sysinv/blob/master/diskinfo.cpp
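
Independently of the linked C++ source, the final join step can be sketched in a platform-neutral way. The function below assumes the Win32 calls have already been made elsewhere: device_numbers stands in for the signature-to-device-number results of the two DeviceIoControl queries, and counter_instances for the names returned by PdhEnumObjectItems. All names and sample values are hypothetical.

```python
from typing import Dict, List, Optional


def find_counter_instance(disk_id: str,
                          device_numbers: Dict[str, int],
                          counter_instances: List[str]) -> Optional[str]:
    """Return the current PhysicalDisk instance for a persistent disk ID."""
    wanted = device_numbers.get(disk_id)
    if wanted is None:
        return None  # disk is not present on this node right now
    for name in counter_instances:
        # Only the leading index is compared; drive letters are ignored.
        head = name.split(" ", 1)[0]
        if head.isdigit() and int(head) == wanted:
            return name
    return None


# Hypothetical snapshot of one host.
device_numbers = {
    "1A2B3C4D": 0,                                 # MBR signature
    "{6B29FC40-CA47-1067-B31D-00DD010662DA}": 5,   # GPT disk GUID
}
counter_instances = ["0 C:", "5", "_Total"]

print(find_counter_instance("1A2B3C4D", device_numbers, counter_instances))
```

The item is then keyed on the signature or GUID, and the lookup is repeated at collection time so a renumbered disk is still found.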

SLIDE 36

Windows Volumes

  • Performance counter ID: LogicalDisk(C:|HarddiskVolumeN)
  • Drive letter is mutable
  • N is mutable
  • Volume GUIDs or Serials are more persistent

SLIDE 37

Windows Volumes

  • Q: Which runtime counter ID matches Volume GUID?
  • Find volumes with ID: FindNextVolume
  • Compare GUID/Sig against name returned by GetVolumeInformation
  • Enumerate LogicalDisk counters with PdhEnumObjectItems
  • Test mount paths (N:) returned by GetVolumePathNamesForVolumeName
  • Test DOS Device Path (\Device\HarddiskVolumeN) returned by QueryDosDevice

Sample code: https://github.com/cavaliercoder/sysinv/blob/master/diskinfo.cpp
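
The last two tests in the list above amount to a two-stage lookup, sketched here without any Win32 calls and independently of the linked source. The inputs are assumed to have been collected already: mount_paths from GetVolumePathNamesForVolumeName, dos_device from QueryDosDevice, and counter_instances from PdhEnumObjectItems. The function name, the trailing-backslash handling and the sample values are assumptions.

```python
from typing import List, Optional


def match_logical_disk(mount_paths: List[str],
                       dos_device: str,
                       counter_instances: List[str]) -> Optional[str]:
    """Pick the LogicalDisk counter instance that belongs to one volume."""
    by_lower = {name.lower(): name for name in counter_instances}

    # Stage 1: a mounted volume is listed by its mount path.
    # Assumption: the counter name omits the trailing backslash.
    for path in mount_paths:
        hit = by_lower.get(path.rstrip("\\").lower())
        if hit is not None:
            return hit

    # Stage 2: an unmounted volume is listed as HarddiskVolumeN, the last
    # component of its DOS device path.
    leaf = dos_device.rsplit("\\", 1)[-1]
    return by_lower.get(leaf.lower())


instances = ["C:", "HarddiskVolume4", "_Total"]
print(match_logical_disk(["C:\\"], "\\Device\\HarddiskVolume2", instances))
print(match_logical_disk([], "\\Device\\HarddiskVolume4", instances))
```

Trying mount paths first means a volume that gains or loses a drive letter is still resolved on the next pass, through whichever name it currently has.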

SLIDE 38

Windows Failover Clusters

  • Disks move between nodes
  • Node disks are visible on cluster IP
  • IDs and drive letters change

SLIDE 39

Windows Failover Clusters

  • Q: Is an MBR/GUID listed as a cluster resource?
  • Cluster API uses MBR Signature or GPT GUID!
  • Enumerate “Physical Disk” resources in cluster with ClusterEnum
  • Add a discovery parameter for clustered/non-clustered disks

Sample code: https://github.com/cavaliercoder/sysinv/blob/master/cluster.cpp
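
One way the discovery parameter could behave is sketched below; it is not the linked implementation. The set of clustered disk IDs is assumed to come from enumerating "Physical Disk" cluster resources, and the macro names {#DISKID}, {#DISKNAME} and {#CLUSTERED}, the mode values and the sample disks are all hypothetical.

```python
import json
from typing import Dict, Iterable, Set


def discover_disks(disks: Iterable[Dict[str, str]],
                   cluster_disk_ids: Set[str],
                   mode: str = "all") -> str:
    """Build an LLD payload, optionally filtered by cluster membership."""
    if mode not in ("all", "clustered", "non-clustered"):
        raise ValueError("unknown mode: %s" % mode)
    clustered_ids = {disk_id.lower() for disk_id in cluster_disk_ids}
    data = []
    for disk in disks:
        clustered = disk["id"].lower() in clustered_ids
        if mode == "clustered" and not clustered:
            continue
        if mode == "non-clustered" and clustered:
            continue
        data.append({
            "{#DISKID}": disk["id"],
            "{#DISKNAME}": disk["name"],
            "{#CLUSTERED}": "1" if clustered else "0",
        })
    return json.dumps({"data": data})


# Hypothetical node: one local disk, one disk owned by the cluster.
DISKS = [
    {"id": "1A2B3C4D", "name": "Disk 0"},
    {"id": "{6B29FC40-CA47-1067-B31D-00DD010662DA}", "name": "Disk 1"},
]
CLUSTER_IDS = {"{6b29fc40-ca47-1067-b31d-00dd010662da}"}

print(discover_disks(DISKS, CLUSTER_IDS, "non-clustered"))
```

Splitting discovery this way lets node hosts monitor only their local disks while a host bound to the cluster IP monitors the shared ones, so a failover does not produce duplicate or orphaned items.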

SLIDE 40

Thank you!

Ryan Armstrong

  • Blog: cavaliercoder.com
  • Twitter: @cavaliercoder
  • GitHub: cavaliercoder