Monitoring 6000+ hosts in Zabbix: A Pseudo-DevOps journey


1. Monitoring 6000+ hosts in Zabbix: A Pseudo-DevOps journey

2. About me
   - Senior Systems Engineer, Tools and Automation, Kinetic IT @ Department of Education
   - Co-founder of Passive Eye Ltd
   - Open Source contributor

3. Dept. Education
   - ~800 schools
   - ~400,000 end users
   - ~5000 SOE servers @ schools
   - ~1500 heterogeneous servers @ central office
   - Hub-spoke topology
   - Vast geographic distribution

4. Problem definition
   - Multiple, disconnected monitoring tools
   - Poor coverage
   - Lack of correlation
   - Duplication of effort
   - Inconsistent practice
   - Difficulty measuring SLAs

5. Requirements
   - Single pane of glass
   - Scalability
   - Extensibility
   - Ease of use
   - Low licensing costs

6. Enter Zabbix
   - All-in-one
   - Support for diverse devices
   - Small footprint and scalable architecture
   - Extensible API
   - Configuration UI
   - Free and open source + support

7. How Zabbix works
   - Primary server
   - Database
   - Web frontend
   - Proxy servers
   - Agents (via proxy)
   - Passive devices (via proxy)
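The proxy-based hub-spoke architecture above is wired up on each monitored host through the agent configuration. A minimal sketch of `zabbix_agentd.conf` (all hostnames here are placeholders, not values from the talk):

```ini
# /etc/zabbix/zabbix_agentd.conf (sketch)

# Passive checks: the server or proxy polls the agent on TCP 10050
Server=zabbix-proxy.example.org

# Active checks: the agent fetches its item list from this address
# and pushes values itself, which suits a hub-spoke topology
ServerActive=zabbix-proxy.example.org

# Must match the host name configured in the Zabbix frontend
Hostname=school-server-01.example.org
```

Pointing spoke-site agents at a local proxy rather than the central server is what keeps the primary server's connection count manageable at this scale.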

8. How Zabbix works
   - Items and LLD (low-level discovery)
   - Active and passive checks
   - Hosts and templates
   - Triggers, Events and Actions
   - Graphs, Screens and Maps
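A custom LLD check feeds a discovery rule with JSON; each entry in the `data` array becomes one discovered entity for the item and trigger prototypes. A minimal sketch of a generator for that document, assuming the Zabbix 2.x/3.x LLD format and a hypothetical `{#SVCNAME}` macro:

```python
import json

def lld_discovery(instances, macro="{#SVCNAME}"):
    """Build a Zabbix low-level discovery (LLD) JSON document.

    Discovery rules expect an object with a "data" array of
    macro-to-value mappings; Zabbix substitutes each mapping into
    the configured item/trigger prototypes.
    """
    return json.dumps({"data": [{macro: name} for name in instances]}, indent=2)

# e.g. a service-discovery UserParameter script might print:
print(lld_discovery(["Spooler", "W32Time", "wuauserv"]))
```

The same shape works for any discovery source: services, performance counter instances, or SNMP table rows.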

9. Planning
   - Scrum + Jira Agile

10. Building
   - Discrete environments in Vagrant
   - Infrastructure as code
   - Discrete feature branches
   - Monolithic source repo
   - http://nvie.com/posts/a-successful-git-branching-model/

  11. Puppet code

12. Testing
   - Bamboo
   - Cucumber

13. Deployment
   - Server, DB, Web
   - No proxies (so far)

  14. Performance

15. Integrations
   - Active Directory
   - CMDB
   - Service Management
   - ICT Dashboard

16. Autonomy
   - Host registration
   - Low-level discovery
   - User provisioning
   - Remediation scripts
   - Data housekeeping
   - Incidents and escalations
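Automation like host registration and user provisioning goes through the Zabbix JSON-RPC API. A sketch of building a `host.create` request body; the URL, IDs and token are placeholders that would come from prior `hostgroup.get`, `template.get` and `user.login` calls, not values from the talk:

```python
import json

ZABBIX_API_URL = "https://zabbix.example.org/api_jsonrpc.php"  # placeholder

def host_create_request(hostname, group_id, template_id, auth_token, request_id=1):
    """Build a JSON-RPC 2.0 body for the Zabbix `host.create` method."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": hostname,
            "interfaces": [{
                "type": 1,        # 1 = Zabbix agent interface
                "main": 1,
                "useip": 0,       # connect by DNS name
                "ip": "",
                "dns": hostname,
                "port": "10050",  # default agent port
            }],
            "groups": [{"groupid": group_id}],
            "templates": [{"templateid": template_id}],
        },
        "auth": auth_token,
        "id": request_id,
    })

# POST this body to ZABBIX_API_URL with Content-Type: application/json-rpc
```

For agent-initiated onboarding, Zabbix's auto-registration actions can attach templates without any API client at all; the API route is what a CMDB or AD integration would use.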

17. Template hierarchy
   - Host
   - Class
   - Templates
   - Items, etc.

18. Windows monitoring
   - OOB support:
     - WMI queries
     - Performance counter probes
     - Event Log monitoring
     - Service state

19. Windows monitoring
   - Customisations:
     - Hostname casing and format
     - Service discovery
     - Performance counter discovery
     - Failover Cluster discovery
     - Persistent disk and volume identification

20. Windows monitoring
   - Tools:
     - MSI installer package
     - Test PowerShell script
     - Performance Counter template builder

21. Windows monitoring
   - Convert-PerfCountersToZabbixTemplate.ps1
     - Counter Set > Application
     - Multi-instance counter > Discovery Rule + Prototypes
     - Single-instance counter > Item check

   > ./Convert-PerfCountersToZabbixTemplate.ps1 -CounterSet Processor | Out-File template.xml
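The core of that mapping is turning a counter path into the agent's `perf_counter[...]` item key, leaving the instance as an LLD macro for multi-instance counter sets. A sketch of the key construction only (the `{#INSTANCE}` macro name and the function are illustrative, not taken from the script):

```python
def counter_to_item_key(counter_set, counter, instance=None):
    """Build a Zabbix `perf_counter[...]` item key from a Windows
    performance counter path.

    With an instance, the result is suitable as an item prototype
    under a discovery rule; without one, as a plain item check.
    """
    inst = f"({instance})" if instance else ""
    return f'perf_counter["\\{counter_set}{inst}\\{counter}"]'

# Single-instance counter -> plain item check
print(counter_to_item_key("System", "Processes"))
# -> perf_counter["\System\Processes"]

# Multi-instance counter -> item prototype under a discovery rule
print(counter_to_item_key("Processor", "% Processor Time", "{#INSTANCE}"))
# -> perf_counter["\Processor({#INSTANCE})\% Processor Time"]
```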

22. Linux monitoring
   - Modules
   - Extensions for Linux kernel
   - PostgreSQL
   - Packaging
   - Test script

23. SNMP monitoring
   - CRAC
   - UPS
   - Dell iDRAC
   - IPS
   - Mail gateways

24. SNMP monitoring
   - mib2zabbix.pl
     - Tree nodes > Applications
     - OID Tables > Discovery Rules + Prototypes
     - OID Scalars > Item checks
     - OID Traps > Item + Trigger + snmptt config
     - Enums > Value Maps

   $ ./mib2zabbix.pl --template --oid=.1.3.6.1.2.1.25 --name="Host resources"
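The scalar and table mappings can be sketched as data transforms. This is not mib2zabbix's code; it assumes Zabbix's `discovery[{#SNMPVALUE},<OID>]` SNMP LLD syntax (available since 2.4), and the `snmp.*` key names are invented for illustration:

```python
def scalar_item(name, oid):
    """Map an SNMP scalar OID to the fields a Zabbix item would carry
    in the template XML. Scalars are queried with a .0 suffix."""
    return {"name": name, "snmp_oid": f"{oid}.0", "key": f"snmp.{name}"}

def table_discovery(name, column_oid):
    """Map an SNMP table column to a Zabbix discovery rule: Zabbix walks
    the column and exposes {#SNMPINDEX}/{#SNMPVALUE} macros to the
    item prototypes."""
    return {
        "name": f"{name} discovery",
        "snmp_oid": f"discovery[{{#SNMPVALUE}},{column_oid}]",
        "prototype_key": f"snmp.{name}[{{#SNMPINDEX}}]",
    }

# HOST-RESOURCES-MIB examples (the OIDs are standard, per the slide)
print(scalar_item("hrSystemUptime", ".1.3.6.1.2.1.25.1.1"))
print(table_discovery("hrStorageDescr", ".1.3.6.1.2.1.25.2.3.1.3"))
```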

25. Application monitoring
   - Microsoft Exchange
   - Microsoft SCCM
   - Microsoft SQL Server
   - Microsoft Active Directory
   - PostgreSQL Server
   - EMC Avamar
   - HP BPM
   - Squid Proxy
   - Zabbix Server
   - …

26. Risk mitigation
   - Document in code
   - Source control
   - Clearly defined interfaces
   - Quality gates
   - Upstream contribution
   - Change the hiring criteria to avoid SPOF

27. Agent stress test
   - Critical to finding:
     - Memory leaks
     - Race conditions
     - Impact on system
     - Regressions
   - Validates efficiency improvements

28. It’s no magic bullet…
   - Data aggregation
   - Visualizations
   - Alert scripts

29. Future
   - Zabbix v3 upgrade
   - Better engagement from ITOps
   - More devices and apps
   - More automation
   - Better use of data
   - Enterprise Integration Patterns
   - Cloud monitoring

30. DevOps?
   - Meta-software
   - Agile delivery
   - Infrastructure as Code
   - Continuous Integration
   - Theory of Constraints

31. Contrib
   - PostgreSQL monitoring: https://github.com/cavaliercoder/libzbxpgsql
   - Agent benchmarking: https://github.com/cavaliercoder/zabbix_agent_bench
   - Windows MSI package: https://github.com/cavaliercoder/zabbix-msi
   - Golang module adapter: https://github.com/cavaliercoder/g2z

32. Windows Counters
   - Performance Counter IDs are non-persistent
   - Today G: is PhysicalDisk(3 G:), tomorrow it is PhysicalDisk(5)
   - Graphs and alerts break
   - Mapping is not practical in scripting APIs

33. Windows physical disks
   - Mutable performance counter ID: PhysicalDisk(0 C:)
   - Index ('0') changes on reboot, swap, failover, etc.
   - The drive letter ('C:') is undocumented
   - MBR Signatures and GPT GUIDs are more persistent

34. Windows physical disks
   - Q: What runtime counter index maps to the desired MBR Signature/GPT GUID?
   - Identify via MBR Signature or GPT GUID: DeviceIoControl(IOCTL_DISK_GET_DRIVE_LAYOUT_EX)
   - Get device index (\\.\PHYSICALDRIVE<i>): DeviceIoControl(IOCTL_STORAGE_GET_DEVICE_NUMBER)
   - Iterate PhysicalDisk counters (ignore drive letter): PdhEnumObjectItems
   - Sample code: https://github.com/cavaliercoder/sysinv/blob/master/diskinfo.cpp

35. Windows Volumes
   - Performance counter ID: LogicalDisk(C:|HarddiskVolumeN)
   - Drive letter is mutable
   - N is mutable
   - Volume GUIDs or Serials are more persistent

36. Windows Volumes
   - Q: Which runtime counter ID matches a Volume GUID?
   - Find volumes with FindNextVolume
   - Compare GUID/Signature against the name returned by GetVolumeInformation
   - Enumerate LogicalDisk counters with PdhEnumObjectItems
   - Test mount paths (N:) returned by GetVolumePathNamesForVolumeName
   - Test the DOS Device Path (\Device\HarddiskVolumeN) returned by QueryDosDevice
   - Sample code: https://github.com/cavaliercoder/sysinv/blob/master/diskinfo.cpp

37. Windows Failover Clusters
   - Disks move between nodes
   - Node disks are visible on the cluster IP
   - IDs and drive letters change

38. Windows Failover Clusters
   - Q: Is an MBR Signature/GPT GUID listed as a cluster resource?
   - The Cluster API uses MBR Signatures and GPT GUIDs!
   - Enumerate "Physical Disk" resources in the cluster with ClusterEnum
   - Add a discovery parameter for clustered/non-clustered disks
   - Sample code: https://github.com/cavaliercoder/sysinv/blob/master/cluster.cpp

39. Thank you! Ryan Armstrong
   - Blog: cavaliercoder.com
   - Twitter: @cavaliercoder
   - GitHub: cavaliercoder
