PagerDuty and SaltStack for transparent automation of data center infrastructure

November 12, 2014

SaltStack is used by systems administrators for data center infrastructure command and control. PagerDuty helps these systems administrators keep tabs on all the important changes and events happening within the data center. This article describes a number of ways to integrate PagerDuty with the SaltStack server configuration management system, providing a wealth of options for users to track the state of their servers and data center infrastructure.

The official guide for setting up SaltStack to use with PagerDuty focuses on basic configuration, using just enough examples to get started. Be sure to read this guide first to get your Salt minions properly configured.

Also, by default, SaltStack state (SLS) files are written in YAML, with the option of using Jinja for templating. The examples in this article are written in YAML, but don’t let that stop you from using one of the other rendering systems available.
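As a rough illustration of what Jinja adds, the sketch below picks the Apache service name from the os_family grain before the result is parsed as YAML. The service names and the RedHat/other split are assumptions made for illustration; they are not used in the examples that follow.

```yaml
# Jinja is rendered first; the rendered text is then parsed as YAML.
# Assumption: RedHat-family systems name the service "httpd", others "apache2".
{% set apache = 'httpd' if grains['os_family'] == 'RedHat' else 'apache2' %}

{{ apache }}:
  service.running:
    - enable: True
```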

For more information on SaltStack file states, see http://docs.saltstack.com/en/latest/ref/states/all/salt.states.file.html. For more information on service states, see http://docs.saltstack.com/en/latest/ref/states/all/salt.states.service.html.

Monitoring Failures
By default, SaltStack states are imperative, meaning statements are evaluated in the order in which they appear in the SLS file. However, using one of the many “requisites” available, they are also declarative, meaning that one state can call another if it needs to, regardless of where it is configured. One of these requisites, called “onfail”, can be used to trigger one action if another one fails to complete successfully.
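The direction of a requisite matters: the state that declares “onfail” is the one that runs when the listed state fails. A minimal sketch, using two hypothetical cmd.run states (the IDs and the script path are made up for illustration):

```yaml
deploy_step:
  cmd.run:
    - name: /usr/local/bin/deploy.sh   # hypothetical script

report_failure:
  cmd.run:
    - name: echo "deploy_step failed"
    # report_failure runs only if deploy_step fails
    - onfail:
      - cmd: deploy_step
```

The “onfail_in” form expresses the same relationship from the other side: it would be declared on deploy_step and would list report_failure.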

For example, let’s say that your infrastructure makes use of the Apache Web server, and you need to be notified when a configuration change causes Apache to fail to restart. Consider the following:

httpd:
  file:
    - managed
    - name: /etc/httpd/conf/httpd.conf
    - source: salt://httpd/httpd.conf
    - user: root
    - group: root
    - mode: 644

  service:
    - running
    - enable: True
    - watch:
      - file: httpd
    # onfail_in: httpd_failure runs only if this service state fails
    - onfail_in:
      - pagerduty: httpd_failure

httpd_failure:
  pagerduty:
    - create_event
    - name: 'httpd service failure'
    - details: 'A failure was detected with the httpd service'
    - service_key: 8eb116b11626346239365c9651e
    - profile: pagerduty-critical

The above state manages the httpd.conf file. The httpd service watches that file, and if changes are detected, the httpd service will be restarted. However, if the service fails to restart, an incident will be triggered in PagerDuty, using the “pagerduty-critical” configuration profile.
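
The “pagerduty-critical” profile itself lives in the minion configuration or in pillar, as covered by the official guide. A minimal sketch of what such a profile might look like, assuming the layout used by Salt's pagerduty module; the API key and subdomain shown are placeholders, not real values:

```yaml
# Hypothetical profile definition; replace both values with your own.
pagerduty-critical:
  pagerduty.api_key: YOUR_API_KEY
  pagerduty.subdomain: your-subdomain
```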

What if you want notifications fired anytime any change is made to your infrastructure, good or bad? Or perhaps just a change to a specific component? This could simply help you get your job done, or perhaps management has requested to be notified on infrastructure changes. This can be handled using the SaltStack “onchanges” requisite.

By modifying the above example, we can also send a notification to a second configured service in PagerDuty, which is only relevant to changes.

httpd:
  file:
    - managed
    - name: /etc/httpd/conf/httpd.conf
    - source: salt://httpd/httpd.conf
    - user: root
    - group: root
    - mode: 644

  service:
    - running
    - enable: True
    - watch:
      - file: httpd
    # onfail_in: httpd_failure runs only if this service state fails
    - onfail_in:
      - pagerduty: httpd_failure
    # onchanges_in: httpd_changes runs only if this service state reports changes
    - onchanges_in:
      - pagerduty: httpd_changes

httpd_changes:
  pagerduty:
    - create_event
    - name: 'httpd service changes'
    - details: 'Changes have been made to the httpd service'
    - service_key: 263493658eb116b116236c9651e
    - profile: pagerduty-info

httpd_failure:
  pagerduty:
    - create_event
    - name: 'httpd service failure'
    - details: 'A failure was detected with the httpd service'
    - service_key: 8eb116b11626346239365c9651e
    - profile: pagerduty-critical
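
The change notification can also declare the requisite itself. In the sketch below, which is meant as an alternative to the httpd_changes stanza above rather than an addition to it, “onchanges” is placed on the notifying state, so the event is created only when the httpd service state reports changes:

```yaml
httpd_changes:
  pagerduty.create_event:
    - name: 'httpd service changes'
    - details: 'Changes have been made to the httpd service'
    - service_key: 263493658eb116b116236c9651e
    - profile: pagerduty-info
    # Run only when the httpd service state reports changes
    - onchanges:
      - service: httpd
```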

Well-known system monitoring tools have been used for years to keep an eye on data center infrastructure vitals, such as load average or disk usage. SaltStack can monitor these system vitals as well.

Introducing SaltStack monitoring states. These states are not designed to make changes and enforce state management on a machine. Rather, they monitor a specific piece of information, and send a notification when that information falls outside the bounds that have been configured.

The following SLS will monitor the load average on the target system, and trigger an incident in PagerDuty if it goes too high, or even if it goes too low.

check_load:
  status.loadavg:
    # Which load average to check: 1-min, 5-min or 15-min
    - name: 1-min
    - maximum: 1.2
    - minimum: 0.05
    # onfail_in: loadavg_trigger runs only if this check fails
    - onfail_in:
      - pagerduty: loadavg_trigger

loadavg_trigger:
  pagerduty.create_event:
    - name: 'Bad Load Average'
    - details: 'Load average is outside desired range'
    - service_key: 8eb116b11626346239365c9651e
    - profile: my-pagerduty-config
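
Disk usage, mentioned earlier, can be watched in much the same way. The following is a sketch that assumes Salt's disk.status monitoring state; the mount point and threshold are chosen purely for illustration, and the disk_trigger ID is hypothetical, reusing the service key and profile from the load average example:

```yaml
check_root_disk:
  disk.status:
    - name: /            # mount point to check (assumed)
    - maximum: 90%       # fail when more than 90% is used
    # disk_trigger runs only if this check fails
    - onfail_in:
      - pagerduty: disk_trigger

disk_trigger:
  pagerduty.create_event:
    - name: 'Disk Usage High'
    - details: 'Root filesystem usage is above the configured maximum'
    - service_key: 8eb116b11626346239365c9651e
    - profile: my-pagerduty-config
```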

The above options are great, but how do you tie them all together into a complete, production-ready state? Consider the following scenario.

You manage an infrastructure which uses a Web app written in Django, running under the Apache Web server. Before the Django codebase is deployed, the Apache service needs to be stopped, and after the code is deployed, Apache needs to be started again.

Meanwhile, you have set up a PagerDuty service to send an email any time any changes are made anywhere in the infrastructure, and a second PagerDuty service to page the on-call team anytime a failure is detected. And to round things out, you want to verify that the Website is still functional after Apache restarts, and that the code changes aren’t causing excessive load on the Web server.

The code deployment example at the end of this section makes use of a number of SaltStack states and requisites. You have already seen the httpd.conf file and httpd service, and their warnings, in the previous examples. You have also seen the load average states. However, some new states have been added.

The first is the http.query state, which is called any time the httpd service reports changes (usually meaning it has been started). This state performs an HTTP request against the specified URL and checks both for a status of 200 (the Web server reports OK) and for the text “Welcome to my company website” at that URL. You can check either of these items, or both.

Two new stanzas have also been added pertaining to Django. The first uses the service.dead state to stop the httpd service. However, it will only run if Salt first detects that the django_code state is expected to make changes to the system.

This is made possible by the “prereq” requisite. This will cause Salt to perform a test run of the django_code state to see if any changes would be made to the system. If so, then the stop_httpd state will be triggered.

Once the httpd service has been stopped, the django_code state will recursively deploy the directory structure containing the Django codebase to the server, so we do not have to worry about pages being served from an incomplete codebase. This state also contains a “watch_in” requisite, which notifies the httpd service state when the deployment is finished. Once the service has been restarted, the http.query state runs to tell us whether the code changes were successful. If they were not, an alert is triggered in PagerDuty.

httpd:
  file:
    - managed
    - name: /etc/httpd/conf/httpd.conf
    - source: salt://httpd/httpd.conf
    - user: root
    - group: root
    - mode: 644

  service:
    - running
    - enable: True
    - watch:
      - file: httpd
    # onfail_in: httpd_failure runs only if this service state fails
    - onfail_in:
      - pagerduty: httpd_failure
    # onchanges_in: the listed states run only if this service state reports changes
    - onchanges_in:
      - pagerduty: httpd_changes
      - http: httpd

  http:
    - query
    - name: http://mysample.domain.com/path/to/verify.html
    - match: 'Welcome to my company website'
    - status: 200
    - onfail_in:
      - pagerduty: httpd_failure

stop_httpd:
  service:
    - dead
    # Without an explicit name, the state ID (stop_httpd) would be used as the service name
    - name: httpd
    - prereq:
      - file: django_code

django_code:
  file:
    - recurse
    - name: /srv/django/
    - source: salt://django/codebase/
    - dir_mode: 2755
    - file_mode: 644
    - include_empty: True
    - watch_in:
      - service: httpd

httpd_changes:
  pagerduty:
    - create_event
    - name: 'httpd service changes'
    - details: 'Changes have been made to the httpd service'
    - service_key: 263493658eb116b116236c9651e
    - profile: pagerduty-info

httpd_failure:
  pagerduty:
    - create_event
    - name: 'httpd service failure'
    - details: 'A failure was detected with the httpd service'
    - service_key: 8eb116b11626346239365c9651e
    - profile: pagerduty-critical

check_load:
  status.loadavg:
    - name: 1-min
    - maximum: 1.2
    - minimum: 0.05
    - onfail_in:
      - pagerduty: loadavg_trigger

loadavg_trigger:
  pagerduty.create_event:
    - name: 'Bad Load Average'
    - details: 'Load average is outside desired range'
    - service_key: 8eb116b11626346239365c9651e
    - profile: pagerduty-critical

Hopefully these examples demonstrate how much more useful and effective the SaltStack state system can be when combined with PagerDuty. Once in place, you will wonder how anyone ever managed complex data center infrastructures without these powerful tools making the difficult jobs easy.
