Use puppet and network inventory to populate nagios/icinga



SLIDE 1

http://www.grnet.gr

GRNET NOC

Use puppet and network inventory to populate nagios/icinga configuration

TF-NOC Dublin

Alexandros Kosiaris (alex@noc.grnet.gr)

SLIDE 2

Network & Equipment

  • Storage Equipment:
  • Netapp/IBM N5300
  • EMC Celerra NS-480
  • Computing Equipment:
  • Virtualization (KVM)
  • 12 Blade servers, HP BL-460c
  • 12 IBM 1U Servers
  • 128 1U Fujitsu Servers
  • 275 2U HP Proliant Servers
  • ~200 VMs
  • Optical Network:
  • ~70 cities (+30 within next year)
  • 15-year leased dark fiber
  • DWDM/CWDM network
  • Optical Equipment:
  • Alcatel 1626LM, 1696MS, 1678MCC
  • Adva FSP2000
  • Routing Equipment:
  • Juniper T1600, Juniper MX960
  • ~10x Cisco 12000s
  • a few Cisco 7200s/7300s
  • Switching Equipment:
  • Cisco 6500
  • Several Cisco 3750, Cisco 2970, Juniper EX4200, Extreme X450a/X350

SLIDE 3

Nagios + Network Equipment

Or (more accurately) Switching and Routing

  • In-house developed Network Inventory (a.k.a. GRNETDB)
  • A MySQL database of almost 150 tables
  • Populated multiple times a day by a PHP discovery script
  • SNMP, telnet + expect
  • Basic Concepts:
  • Node
  • Interface
  • Layer
  • Domain
  • Location
  • These concepts get extended to represent functionality
  • Routing, Switching nodes
  • Layer2, Layer3 interfaces
  • Switching, administrative domains
SLIDE 4

Nagios + Network Equipment

Or (more accurately) Switching and Routing

  • In-house developed Python Django project, with multiple sub-apps
  • Network (the interface to the database)
  • RG (router graphs, take a peek at http://mon.grnet.gr/rg)
  • Maps (take a look at http://mon.grnet.gr/network/maps)
  • Hostmaster
  • Optical network (built mostly on Location info)
  • Nadjicingo
  • Builds on the network app and generates a nagios/icinga configuration
  • Nagvis
  • Same thing, but generates/updates the nagvis config

SLIDE 5

Nadjicingo

  • A Django management command outputting nagios/icinga configuration
  • Run by crontab every hour (manage.py nadjicingo)
  • Will generate nagios configuration objects for
  • Routers
  • Switches
  • Interfaces
  • L3 topology aware (nagios hates cyclic dependencies, a.k.a. redundant links); populates the parents field for most devices

  • Hardware checks in devices
  • Business logic embedded in interface descriptions:
  • Part of it is a unique identifier for a customer's link
    – [.NTUA-4] => National Technical University's L3 link
    – [AUTH@ERMOU-1] => Aristotle University of Thessaloniki L2 link at Ermou PoP
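
One way to get a cycle-free parents field out of a topology that contains redundant links is a breadth-first walk from the monitoring server's entry point, keeping only the first node through which each device is reached. The sketch below is an illustration under that assumption, not Nadjicingo's actual algorithm; the function name `assign_parents` and the device names are hypothetical.

```python
from collections import deque


def assign_parents(links, root):
    """Return {node: parent} for every node reachable from root.

    links is an iterable of undirected (a, b) pairs. Each node gets exactly
    one parent (the node it was first reached from), so the result is a
    tree even when the input graph has redundant links.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    parents = {root: None}  # the root has no parent
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # sorted() only makes the choice deterministic between runs
        for peer in sorted(neighbours.get(node, ())):
            if peer not in parents:
                parents[peer] = node
                queue.append(peer)
    return parents


# Hypothetical triangle: core1, core2 and edge1 are all interconnected.
links = [("core1", "core2"), ("core1", "edge1"), ("core2", "edge1")]
parents = assign_parents(links, "core1")
```

The redundant core2/edge1 link is simply ignored when parents are chosen, which keeps the dependency graph acyclic at the cost of not modelling the backup path.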

SLIDE 6

Nagvis

  • A Django management command (again...)
  • Run by crontab every hour (manage.py nagvis)
  • Will update a specific nagvis map configuration by:
  • Removing obsolete nodes
  • Adding new nodes to a special area for manual positioning on the map
  • Also features an automated positioning mode based on each device's latitude and longitude
  • Nice for showing off, but not for an overview in monitoring applications
  • Will only populate host objects in the map
  • Service objects cluttered it too much, and the information is readily available anyway
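
The update step described above amounts to a set reconciliation between the hosts already on the map and the hosts in the inventory. A minimal sketch, assuming the map can be represented as a host-to-coordinates dictionary; `sync_map`, the staging coordinates and the 30-pixel spacing are made up for illustration and say nothing about the real nagvis file format.

```python
def sync_map(map_hosts, inventory_hosts, staging_xy=(20, 20)):
    """Reconcile a map with the inventory.

    map_hosts: {host: (x, y)} as currently placed on the map.
    inventory_hosts: iterable of host names that should exist.
    Returns (updated_map, added, removed). Existing positions are kept,
    obsolete hosts are dropped, and new hosts are stacked in a staging
    area so someone can drag them into place later.
    """
    inventory = set(inventory_hosts)
    updated = {host: xy for host, xy in map_hosts.items() if host in inventory}
    removed = sorted(set(map_hosts) - inventory)
    added = sorted(inventory - set(map_hosts))

    x, y = staging_xy
    for offset, host in enumerate(added):
        updated[host] = (x, y + 30 * offset)  # 30 is an arbitrary spacing
    return updated, added, removed


current = {"r1": (100, 200), "old-switch": (5, 5)}
new_map, added, removed = sync_map(current, ["r1", "sw9"])
```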

SLIDE 7

Nagvis Network Map

SLIDE 8

Servers, Services?

  • A little bit of history
  • For years, GRNET only had very basic services (DNS, email, Web)

  • And some router supporting services (Looking glass, mrtg, rancid)
  • And very few servers (<=10)
  • 3 years ago, major paradigm shift from networking to services
  • 20 Servers bought, and then 132 and recently 275 more
  • End user services were born:
  • Public cloud storage service (Pithos)
  • Virtual Private Servers (ViMa)
  • Students' book statements (Eudoxus)
  • Student ID cards (Paso)
  • Public IaaS (Okeanos)
  • Academic Professor Elections (Apella)
  • Plus many other services and projects (TCS, Whois, NTP, VoD,…)
  • The result? => 200 VMs were created for managing all this infrastructure

SLIDE 9

Puppet to the rescue

  • What is Puppet?
  • It's a stack of applications
  • It's a language (a declarative one as well)
  • It's a policy and state enforcing tool
  • It's an attribute and state discovery tool (kind of...)
  • It's a new paradigm in managing systems!
  • What is Puppet not?
  • Not just an automation tool
  • Not a “For loop”
  • Not a command execution framework (it can be reduced to that though)

  • AGAIN: A new paradigm, you need to change the way you work
SLIDE 10

Puppet Concepts

  • Facts
  • Attributes of a system:
  • OS Version and family
  • Available memory
  • CPUs
  • Block devices
  • IP addresses/netmasks
  • MAC addresses
  • And anything else you can write discovery code for
  • LLDP neighbours
  • IPMI functionality
  • Hardware info
  • Apache vhosts
  • Discovered by facter and then made available to Puppet
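
As an illustration of "anything you can write discovery code for", the sketch below reports a single fact in `key=value` form. It assumes a Facter version that supports external facts (executables that print `key=value` lines); the deck does not say whether GRNET used that mechanism or Ruby custom facts. The fact name `ipmi_capable` is borrowed from a later slide, while the `/dev/ipmi0` check is only a guess at how such a fact might be detected.

```python
import os


def discover_facts(ipmi_device="/dev/ipmi0"):
    """Return a dict of facts; values are strings, as facts traditionally are."""
    # Assumption: a loaded IPMI driver exposes this device node.
    capable = os.path.exists(ipmi_device)
    return {"ipmi_capable": "true" if capable else "false"}


def format_facts(facts):
    """Render facts as key=value lines, one per fact, sorted by name."""
    return "\n".join(f"{key}={value}" for key, value in sorted(facts.items()))


if __name__ == "__main__":
    print(format_facts(discover_facts()))
```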
SLIDE 11

Puppet Concepts (2)

  • Resources
  • Files, Directories
  • Users, Groups
  • Packages
  • Vlans
  • Interfaces
  • Nagios objects!!!!
  • And a lot more (http://docs.puppetlabs.com/references/latest/type.html)
  • Classes
  • A way to group resources
  • Support inheritance and mixins (aka including)
  • The standard class has 3 resources defined
  • package { 'software': }
  • file { '/etc/software.conf': }
  • service { 'softwared': }
SLIDE 12

Puppet Concepts (3)

  • Nodes
  • A.k.a. machines (VM or hardware)
  • A node CAN (and probably will) have multiple puppet classes
  • Node population can be done in multiple ways:
  • Puppet language config
  • LDAP
  • External script
  • Puppetd agents running on each machine (daemon or crontab)
  • Central Puppetmaster (with an RDBMS) holds all the configuration and data

SLIDE 13

Hello World example

    class helloworld {
      file { '/tmp/helloworld':
        ensure  => present,
        owner   => root,
        group   => root,
        mode    => '0640',
        content => 'Hello world',
      }
    }

    node mynode {
      include helloworld
    }

  • Will create /tmp/helloworld with all the attributes defined above
  • More importantly, if run again it will wipe any changes made in the meantime and restore the state defined above
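
The "run it again and it restores the state" behaviour can be mimicked in a few lines: compare the current state with the desired one and change only what differs. This is a toy model of the idea, not how Puppet's file type is implemented; `ensure_file` is a hypothetical helper, it assumes a POSIX system, and it skips owner/group handling, which would need root.

```python
import os
import stat
import tempfile


def ensure_file(path, content, mode=0o640):
    """Converge path to the desired content and mode.

    Returns the list of properties that had to be changed, so an empty
    list means the file was already in the desired state.
    """
    changes = []
    try:
        with open(path) as handle:
            current = handle.read()
    except FileNotFoundError:
        current = None

    if current != content:
        with open(path, "w") as handle:
            handle.write(content)
        changes.append("content")

    if stat.S_IMODE(os.stat(path).st_mode) != mode:
        os.chmod(path, mode)
        changes.append("mode")
    return changes


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "helloworld")
first_run = ensure_file(target, "Hello world")   # creates the file
second_run = ensure_file(target, "Hello world")  # nothing left to do
```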

SLIDE 14

Back to nagios

  • Let’s use a puppet native type

    nagios_host { "$hostname":
      address        => '10.10.10.10',
      alias          => 'myhost',
      contact_groups => 'hostadmins',
      hostgroups     => 'Puppeted Servers',
    }

  • /etc/nagios/nagios_host.cfg gets populated
  • Problem is ...
  • This is executed on the machine running puppetd, not on the nagios server.

  • No problem. Puppet supports exported resources.
SLIDE 15

Exported resources

  • Let’s prepend the definition with two @ signs

    @@nagios_service { 'myservice':
      contact_groups => 'hostadmins',
      host_name      => $hostname,
      tag            => 'collect_me_nagios_server',
    }

  • Exports the resource but does not realize it on the machine running puppetd

  • No /etc/nagios/nagios_service.cfg file created

    Nagios_service <<| tag == 'collect_me_nagios_server' |>>

  • The collector above goes in the nagios server's manifest.
  • /etc/nagios/nagios_service.cfg gets populated on the nagios server.
  • nagios.cfg or icinga.cfg can now just include the file/directory and monitoring begins
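
Conceptually, exporting and collecting is a publish/query cycle through the puppetmaster's database: every node publishes tagged resources, and the nagios server asks for the ones carrying its tag. The toy model below only illustrates that flow; `ResourceStore` and its methods are invented for the sketch and are not Puppet's API, and the host names are hypothetical.

```python
class ResourceStore:
    """Stand-in for the central database that holds exported resources."""

    def __init__(self):
        self._exported = []

    def export(self, rtype, title, tag, **params):
        """Record a resource without applying it anywhere (the '@@' step)."""
        self._exported.append(
            {"type": rtype, "title": title, "tag": tag, "params": params}
        )

    def collect(self, rtype, tag):
        """Return every exported resource of this type and tag (the '<<| |>>' step)."""
        return [
            res for res in self._exported
            if res["type"] == rtype and res["tag"] == tag
        ]


store = ResourceStore()
# Two different nodes export a service each, with different tags.
store.export("nagios_service", "myservice", tag="collect_me_nagios_server",
             host_name="web1", contact_groups="hostadmins")
store.export("nagios_service", "otherservice", tag="some_other_tag",
             host_name="web2", contact_groups="hostadmins")

# Only the resource tagged for this nagios server is picked up.
collected = store.collect("nagios_service", "collect_me_nagios_server")
```

The model also shows why the tag on the exporting side and in the collector must match exactly: a one-character difference silently collects nothing.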

SLIDE 16

Simple example

  • A manifest for all authoritative DNS servers
  • Install bind9, install configuration and ensure it is running
  • Open up firewall
  • Setup a simple DNS check

    class authoritativedns {
      include bind9
      include service::dns
      @@nagios_service { "authdns":
        check_command => "check_dig!www.grnet.gr",
        servicegroups => "DNS,DNS:Authoritative",
      }
    }

SLIDE 17

Interesting use cases

  • Class hierarchy means:
  • A base class nagios::host that is included in all others
  • So all servers are nagios-monitored without any intervention

But:

  • A server is physical and has IPMI capabilities, so export another nagios host for it:

    if $ipmi_capable {
      @@nagios_host { "$ipmi_dns":
        address => $ipmi_ipaddress,
        tag     => "hardwarehost",
      }
    }

SLIDE 18

Interesting use cases (2)

  • Server is an HP Proliant Server

    class hp-health {
      package { [ 'hp-health', 'hpacucli' ]:
        ensure => present,
      }
      nagios::host::service { 'hpacucli':
        ensure        => present,
        servicegroups => 'HARDWARE',
        command       => 'check_nrpe!dsa-check-hpacucli!0',
      }
      nagios::host::service { 'hpasm':
        ensure        => present,
        servicegroups => 'HARDWARE',
        command       => 'check_nrpe!dsa-check-hpasm!0',
      }
    }

SLIDE 19

Interesting use cases (3)

  • Multicast beacons (double exported resources!!!)

    define ssmping_check($ipv4, $ipv6) {
      $local  = $::fqdn
      $remote = $name
      if ($::ipaddress and $ipv4 and $local != $remote) {
        @@nagios_service { "ping-ssm-${remote}-${local}-v4":
          ensure              => present,
          check_command       => "check_nrpe!check_ssmping!${ipv4}",
          host_name           => $local,
          service_description => "Multicast from ${remote} SSM IPv4",
        }
      }
      …
    }

    # export the checks...
    @@ssmping_check { $fqdn:
      ipv4 => $ipaddress,
      ipv6 => $ipv6address,
    }
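
The net effect of the double export is a full mesh: every beacon receives one check for every other beacon. The sketch below reproduces only the IPv4 branch shown on the slide (the IPv6 part is elided in the source and is left out here); `ssmping_services` and the beacon names and addresses are hypothetical.

```python
def ssmping_services(beacons):
    """Build the IPv4 checks for every ordered (local, remote) beacon pair.

    beacons: {fqdn: ipv4_address}. A beacon never checks itself, and a
    remote without an IPv4 address is skipped, mirroring the condition
    in the define.
    """
    services = []
    for local in sorted(beacons):
        for remote, ipv4 in sorted(beacons.items()):
            if ipv4 and local != remote:
                services.append({
                    "title": f"ping-ssm-{remote}-{local}-v4",
                    "host_name": local,
                    "check_command": f"check_nrpe!check_ssmping!{ipv4}",
                    "service_description": f"Multicast from {remote} SSM IPv4",
                })
    return services


beacons = {
    "b1.example.org": "192.0.2.1",
    "b2.example.org": "192.0.2.2",
    "b3.example.org": "192.0.2.3",
}
mesh = ssmping_services(beacons)
```

With N beacons this yields N × (N − 1) services, which is worth keeping in mind given the exported-resource costs discussed later in the deck.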

SLIDE 20

Interesting use cases (4)

  • Standard checks for all servers

    nagios::host::service { "disk":
      command => "check_nrpe!check_disk!13% 7%",
    }
    nagios::host::service { "load":
      command => "check_nrpe!check_load!4,3,2 5,4,3",
    }
    nagios::host::service { "users":
      command => "check_nrpe!check_users!20 30",
    }
    nagios::host::service { "swap":
      command => "check_nrpe!check_swap!60 40",
    }
    nagios::host::service { "check_tainted":
      command => "check_nrpe!check_tainted!0",
    }
    nagios::host::service { "check_firewall":
      command => "check_nrpe!check_firewall!0",
    }

SLIDE 21

Problems arise

  • /etc/nagios/*.cfg files can quickly become large
  • However each resource collection reads the entire file
  • Problem solved by disabling collections and creating the entire config file every time; a more elegant solution would be nice, though
  • Exported resources cost
  • Each is an entry in the database, and they are not used for nagios alone
  • Execution speed suffers and sometimes times out
  • Problem solved in the database by adding some indexes... but it is bound to show up again
  • Puppet devs know it, some effort goes there
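
"Creating the entire config file every time" can be as simple as rendering all objects into one string and swapping the file in place, so the file is written once instead of being re-read for each resource. A minimal sketch of that approach, not the actual GRNET implementation; `render_objects`, `write_atomically` and the sample hosts are assumptions.

```python
import os
import tempfile


def render_objects(objects):
    """Render (kind, params) pairs as Nagios object definitions."""
    blocks = []
    for kind, params in objects:
        body = "\n".join(f"    {key} {value}" for key, value in params.items())
        blocks.append(f"define {kind} {{\n{body}\n}}\n")
    return "\n".join(blocks)


def write_atomically(path, text):
    """Write text to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".nagios-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # readers see the old or the new file, never half
    except BaseException:
        os.unlink(tmp_path)
        raise


hosts = [
    ("host", {"host_name": "web1", "address": "192.0.2.10"}),
    ("host", {"host_name": "web2", "address": "192.0.2.11"}),
]
config_text = render_objects(hosts)
```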
SLIDE 22

Problems arise (2)

  • Puppet's declarative language can cause problems at times

    @@nagios_host { 'myhost':
      hostgroups => $myhostgroups,
    }

  • And host also has classes A,B,C apart from nagios class.
  • Which class is going to declare $myhostgroups?
  • Multiple solutions exist, none of them elegant.
  • Externally (via LDAP)
  • Fact based
  • Populated hostgroups, not hosts
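
One reading of the last option is to invert the relationship: instead of one place having to know every hostgroup of a host, each class contributes a (hostgroup, host) membership and the hostgroup's member list is assembled on the monitoring side. The sketch below shows that inversion only; it is an interpretation of the bullet, and `hostgroup_members` and the sample names are hypothetical.

```python
def hostgroup_members(memberships):
    """Turn independent (hostgroup, host) pairs into per-hostgroup member lists.

    Returns {hostgroup: "host1,host2,..."}, the comma-separated form used
    by the members directive of a Nagios hostgroup.
    """
    groups = {}
    for group, host in memberships:
        groups.setdefault(group, set()).add(host)
    return {
        group: ",".join(sorted(hosts))
        for group, hosts in sorted(groups.items())
    }


# Classes A, B and C each declare only their own hostgroup for the host.
memberships = [
    ("webservers", "myhost"),
    ("Puppeted Servers", "myhost"),
    ("webservers", "otherhost"),
    ("dns", "myhost"),
]
members = hostgroup_members(memberships)
```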
SLIDE 23

Problems arise (3)

  • Active checks cost. Not a Puppet issue but a nagios one
  • check_mk
  • Distributed monitoring
  • Well obsess_over_services sucks…
  • mod_gearman
  • For now splitting the infrastructure in
  • Networking
  • Services
  • But if Services grow more?
  • Variable tagging on resources

    @@nagios_service { 'myservice':
      contact_groups => 'hostadmins',
      host_name      => $hostname,
      tag            => 'collect_me_nagios_server_N',
    }
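
The slide does not say how N would be chosen. One possibility is a stable hash of the host name modulo the number of nagios servers, so that a host always lands on the same server without any central bookkeeping. The sketch below assumes exactly that; `nagios_tag` is a hypothetical helper and the tag prefix is taken from the example above.

```python
import zlib


def nagios_tag(hostname, servers, prefix="collect_me_nagios_server_"):
    """Pick a collection tag for hostname among `servers` nagios servers.

    crc32 is used because it is stable across runs and machines, unlike
    Python's built-in hash() for strings.
    """
    if servers < 1:
        raise ValueError("need at least one nagios server")
    shard = zlib.crc32(hostname.encode("utf-8")) % servers
    return f"{prefix}{shard}"


tags = {name: nagios_tag(name, 3) for name in ("web1", "web2", "db1", "dns1")}
```

A plain modulo reshuffles most hosts whenever the number of servers changes; if that matters, the assignment could instead be stored externally, as with the hostgroup question earlier.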

SLIDE 24

http://www.grnet.gr

Questions?

Alexandros Kosiaris

GRNET NOC Systems Admin

alex@noc.grnet.gr