Benjamin 05.10.2015

Using Configuration as Code for Third-Party Services (Librato Edition)

One central practice of DevOps is "Configuration as Code", where developers encode the state of development, staging and production systems in a source control system such as Git. Every change is therefore documented with its author and a commit message describing the reason for the configuration change.

The changes are then deployed to the affected systems in much the same way that application changes are rolled out. You can even rely on continuous delivery to roll out the changes automatically from your Travis or Jenkins CI system.

For our own development and production servers, we have used Ansible to automate infrastructure and application configuration from the beginning.

One thing that I have not seen automated before is the configuration of the various external services that applications use. Just two weeks ago I realized that our third-party service configuration requires automation as well. If the service has an API, you can usually write a small domain-specific language (DSL) to express third-party service configuration in code.

How we monitor Tideways with Third-Party Services

Tideways provides application monitoring, which we are using to monitor our own software. Because we don't have a server monitoring feature, we use StatsD, CollectD and Librato for this task.

  • StatsD runs on all our application servers to receive business and application level metrics every minute. Example: the number of traces per minute or of HTTP requests that fail with status code 500. Instead of the original StatsD running on Node.js we use a Go clone.

  • CollectD runs on all our servers (application, database, load balancer) and regularly sends CPU, load, memory, network and disk metrics to Librato.

  • Librato is a SaaS platform that accepts metrics from various sources such as our StatsD and CollectD daemons and then lets you compose graphs and dashboards from them. No configuration is necessary to accept metrics: they are automatically created and configured with sane defaults. For CollectD we can even use an integration to create a pre-defined dashboard without additional work.

  • On top of metric collection, you can trigger alerts inside Librato and send them to an operation management service. For Tideways we use OpsGenie.
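
To illustrate what an application-level metric looks like on the wire, here is a minimal sketch of a StatsD counter sent over UDP. The metric name is hypothetical, and the host and port are only the conventional StatsD defaults, not necessarily what we run in production.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the StatsD line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send to a StatsD daemon (8125 is the usual port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), (host, port))


if __name__ == "__main__":
    # Hypothetical metric name, used here only for illustration.
    send_counter("app.http.status_500")
```

Because UDP is connectionless, the application does not block or fail when the daemon is unavailable, which is why this pattern is cheap enough to use in request handling.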

Up until last week we configured alerts directly in the Librato user interface.

But using the user interface for configuration violates the configuration as code practice.

There is no visibility from our source control repository into which alerts exist and how they work, and changing the alerts in the UI became increasingly tedious.

How we use code to define Librato Alerts

To fix this problem, we turned the alert configuration into executable code and now synchronize it through the Librato REST API whenever we add, change or remove an alert. We invented a simple config format and DSL that represents Librato alerts, plus a small parser that converts the human-readable conditions into the format required by Librato.
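
The core of such a synchronization can be sketched as a comparison between the alerts defined in the repository and the alerts that currently exist in the service. This is not our actual code, only a minimal illustration in Python; the function and type names are hypothetical, and fetching and writing alerts through the REST API is deliberately left out.

```python
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    create: List[str]
    update: List[str]
    delete: List[str]


def plan_sync(desired: Dict[str, dict], remote: Dict[str, dict]) -> SyncPlan:
    """Compare alerts from the config file with alerts fetched from the
    service, both keyed by alert name, and decide what has to change."""
    create = sorted(name for name in desired if name not in remote)
    delete = sorted(name for name in remote if name not in desired)
    update = sorted(
        name for name in desired
        if name in remote and desired[name] != remote[name]
    )
    return SyncPlan(create, update, delete)


if __name__ == "__main__":
    # Made-up alert definitions, for illustration only.
    desired = {"a": {"threshold": 10}, "b": {"threshold": 5}}
    remote = {"b": {"threshold": 7}, "c": {"threshold": 1}}
    print(plan_sync(desired, remote))
```

Keeping the planning step free of network calls makes it easy to test and to run in a dry-run mode before anything is changed in the service.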

For example, to monitor Elasticsearch fielddata circuit breakers, we define the following two alerts:

alerts:
    elasticsearch.fielddata_tripped:
        description: Fielddata Circuit Breaker tripped, causing failed Elasticsearch /_search queries
        conditions: elasticsearch.nodes.fielddata.tripped above 10 for 2 minutes summarized by derivative

    elasticsearch.fielddata_free_capacity_low:
        description: Free Fielddata Circuit Breaker Capacity is low
        conditions: elasticsearch.nodes.fielddata.capacity above 90% for 10 minutes

By converting to this code-based approach, we were able to quickly triple the number of alerts and calibrate them to avoid flapping and false positives. It also increased visibility for all team members. Technically YAML is not "code", but we achieved the main goal of having the alert configuration directly in the same source control repository that also defines and collects the metrics.
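
A parser for condition strings like the two shown above can be quite small. The following Python sketch is not our implementation; the grammar is inferred from the two examples only, the output keys are loosely modeled on Librato alert conditions and should be checked against the API documentation, and simply dropping the percent sign is an assumption made for this illustration.

```python
import re
from typing import Any, Dict

# <metric> <above|below> <number>[%] for <n> minute(s) [summarized by <function>]
CONDITION_RE = re.compile(
    r"^(?P<metric>\S+)\s+(?P<type>above|below)\s+(?P<threshold>\d+(?:\.\d+)?)%?"
    r"\s+for\s+(?P<minutes>\d+)\s+minutes?"
    r"(?:\s+summarized\s+by\s+(?P<summary>\w+))?$"
)


def parse_condition(text: str) -> Dict[str, Any]:
    """Turn a human-readable condition into a structured dictionary."""
    match = CONDITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Cannot parse alert condition: {text!r}")
    condition: Dict[str, Any] = {
        "metric_name": match["metric"],
        "type": match["type"],
        "threshold": float(match["threshold"]),  # assumption: "%" is ignored
        "duration": int(match["minutes"]) * 60,  # minutes converted to seconds
    }
    if match["summary"]:
        condition["summary_function"] = match["summary"]
    return condition


if __name__ == "__main__":
    print(parse_condition(
        "elasticsearch.nodes.fielddata.capacity above 90% for 10 minutes"
    ))
```

Rejecting anything that does not match the grammar with an error is a deliberate choice: a typo in a condition should fail the synchronization run rather than silently create a wrong alert.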

If you are familiar with Librato Alerts, you know there are many other configuration values. We use default values for most of them in the synchronization code behind the YAML configuration file. Because YAML does not come with a schema language the way JSON and XML do, we used the Symfony Config component to define and validate the structure.
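
The idea of validating the structure and filling in defaults can be sketched without any framework. The Python example below is only an illustration of the technique and does not mirror the Symfony-based code we use; the default values and the set of allowed keys are hypothetical. It expects the configuration as an already parsed dictionary, for example the result of loading the YAML file with a YAML library.

```python
from typing import Any, Dict

# Hypothetical defaults; the values we actually use are not shown in this post.
ALERT_DEFAULTS: Dict[str, Any] = {"active": True, "rearm_seconds": 600}
REQUIRED_KEYS = {"description", "conditions"}
ALLOWED_KEYS = REQUIRED_KEYS | set(ALERT_DEFAULTS)


def normalize_alerts(config: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Validate the 'alerts' section and merge defaults into every alert."""
    alerts = config.get("alerts")
    if not isinstance(alerts, dict):
        raise ValueError("'alerts' must be a mapping of alert name to definition")
    normalized: Dict[str, Dict[str, Any]] = {}
    for name, definition in alerts.items():
        if not isinstance(definition, dict):
            raise ValueError(f"alert {name!r} must be a mapping")
        missing = REQUIRED_KEYS - set(definition)
        unknown = set(definition) - ALLOWED_KEYS
        if missing:
            raise ValueError(f"alert {name!r} is missing keys: {sorted(missing)}")
        if unknown:
            raise ValueError(f"alert {name!r} has unknown keys: {sorted(unknown)}")
        # Explicit values from the file win over the defaults.
        normalized[name] = {**ALERT_DEFAULTS, **definition}
    return normalized
```

Rejecting unknown keys catches misspelled options early, which matters more for a configuration file than for a user interface that only offers valid fields.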

Embracing configuration as code

In the future we also plan to configure Librato metric attributes, spaces and charts with this configuration synchronization approach. For now, the alerts alone have already improved our monitoring experience massively.

As Tideways is still a monolithic application, we have just one central alert configuration for now. But as soon as we extract more services from the monolith, it will make sense to have each component define its own alerts in a dedicated file. The idea is to have something similar to a .travis.yml file for each component.

We also plan to move the configuration of all our other third-party services into the source control repository if possible. Some ideas we currently have:

  • OpsGenie users, groups, schedules and escalation rules
  • Recurly email templates, plans, coupons, webhooks etc.
  • Mailgun Domains, Lists and Routes
  • Cloudflare DNS rules
  • Supportbee users, groups and filters

There are no plans to open-source the code for this, because it is highly coupled to our infrastructure layout, defaults and assumptions. However, you shouldn't have trouble building a similar approach for your own projects: the code and tests for the Librato sync took no more than half a day to write and roll out. For Librato in particular you can look into lbrt if you don't mind writing Ruby.

Obviously, we are now thinking about adding APIs to Tideways that allow configuration of our own alert system with code as well.

I hope this blog post inspired you to think about configuration-heavy services in your application that could benefit from being configured in code rather than from a user interface.