Setting up puppet provisioned nagios monitoring

Puppet has had native nagios resource types for quite some time. As both a nagios and a puppet fan, I really liked the idea of not having to set up monitoring by hand, yet automatically getting a base level of monitoring on every managed system. Deploying new systems involves a lot of steps, and forgetting to set up proper monitoring is something a lot of clients run into sooner or later.

Setting up nagios checks this way can involve using exported resources. Coupled with role/profile based puppet classes, this allows for very specific tests for very specific applications.

What I did find is that using large numbers of exported resources can really slow down the puppet run on the nagios monitoring server, even to the point of timing out and/or killing the puppetdb instance.

As a general rule, I’ve used a setup where very specific tests were node bound, but more generic checks were hostgroup bound. This lowers the number of exported resources that need to be collected and thus prevents overload, timeouts etc. on the nagios monitoring node.

This post assumes working knowledge of both puppet and nagios.

Generic code samples

Basic setup of the puppet code for the nagios monitoring server

The code below collects the exported resources within the current puppet environment and purges any unmanaged nagios host and service resources.

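  # Remove any nagios host or service that is not managed by puppet
  resources { 'nagios_host': purge => true, }
  resources { 'nagios_service': purge => true, }

  # Collect the resources exported by nodes in this environment and
  # reload nagios whenever a host or service definition changes
  Nagios_host <<| tag == $environment |>> { notify => Exec['reload_nagios_config'], }
  Nagios_service <<| tag == $environment |>> { notify => Exec['reload_nagios_config'], }

  exec { 'reload_nagios_config':
    command     => '/sbin/service nagios reload',
    refreshonly => true,
  }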

Basic plugin / check definition for all hosts

Just a single example. As this check is bound to a hostgroup, it does not need to be exported. Please note that for this to work, you will also need to manage the nagios monitoring client nrpe, but this is properly documented in the puppet module docs.

  nagios_service { "${::fqdn}-disk-root":
    hostgroup_name      => 'puppet-managed-hosts',
    check_command       => "check_nrpe!check_disk!-a '-w 10% -c 5% -p /'",
    check_interval      => '5',
    retry_interval      => '1',
    contact_groups      => 'my_contact_group',
    service_description => 'Disk /',
    use                 => 'generic-service',
    tag                 => $environment,
    notification_period => 'none',
    check_period        => 'timeperiod_24x7',
  }
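
The check above calls a 'check_disk' command through nrpe, so every monitored node needs an nrpe daemon that knows that command. The post relies on a puppet module for this. Purely as an illustration of what such a module has to arrange, here is a minimal sketch using plain resources. The package name, service name, config directory and plugin path are assumptions (they differ per distribution), and passing arguments with -a only works if nrpe is built and configured to accept command arguments (dont_blame_nrpe=1).

  # Sketch only: package, service and path names are assumptions
  package { 'nrpe':
    ensure => installed,
  }

  # Command definition matching the 'check_disk' name used in check_command.
  # \$ARG1\$ is escaped so puppet writes the literal nrpe macro $ARG1$.
  file { '/etc/nrpe.d/check_disk.cfg':
    ensure  => file,
    content => "command[check_disk]=/usr/lib64/nagios/plugins/check_disk \$ARG1\$\n",
    require => Package['nrpe'],
    notify  => Service['nrpe'],
  }

  service { 'nrpe':
    ensure => running,
    enable => true,
  }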

For every managed host we have an exported resource in a profile we apply to all nodes (our ‘base’ profile).

  @@nagios_host { $::fqdn:
    host_name           => $::fqdn,
    alias               => $::fqdn,
    address             => $::ipaddress,
    hostgroups          => 'puppet-managed-hosts',
    check_interval      => '5',
    retry_interval      => '1',
    use                 => 'generic-host',
    contact_groups      => 'my_contact_group',
    tag                 => [$environment, 'node-specific'],
    max_check_attempts  => '10',
    notification_period => 'none',
    check_period        => 'timeperiod_24x7',
  }
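
Both the host above and the disk check refer to the 'puppet-managed-hosts' hostgroup, and nagios will refuse to load a configuration that references a hostgroup it does not know. The post does not show where that hostgroup is defined. One option, assuming it is managed by puppet on the monitoring server as well, is the native nagios_hostgroup type. The alias text is made up for this example.

  # Hypothetical hostgroup definition, declared on the monitoring server
  nagios_hostgroup { 'puppet-managed-hosts':
    alias  => 'Hosts managed by puppet',
    notify => Exec['reload_nagios_config'],
  }

Because the hostgroup is declared directly on the monitoring server, it adds nothing to the number of exported resources that have to be collected.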

Role specific checks

A specific check for a certain role. The check below would be part of my client's puppet profile managing their Nexus instance. The $svcname variable is assumed to be set elsewhere in that profile:

  @@nagios_service { "${::fqdn}-nexus-webservice":
    host_name           => "${::fqdn}",
    check_command       => "check_http!-a '-I ${::ipaddress} -H ${svcname} -S -u /index.html '",
    check_interval      => '5',
    retry_interval      => '1',
    contact_groups      => 'my_contact_group',
    service_description => 'Nexus runs on https port 443',
    use                 => 'generic-service',
    tag                 => [$environment, 'node-specific'],
    notification_period => 'none',
    check_period        => 'timeperiod_24x7',
  }
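
The exported host and service both carry a second tag, 'node-specific', next to the environment tag. The collectors shown earlier only filter on the environment. As a sketch of how the extra tag could be used (an assumption, not something the setup above depends on), a collector can combine both conditions so that only node bound checks from the current environment are pulled in:

  # Hypothetical variant of the collector on the monitoring server:
  # only collect services that carry both tags
  Nagios_service <<| tag == $environment and tag == 'node-specific' |>> {
    notify => Exec['reload_nagios_config'],
  }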