Validating Prometheus rules

FUSAKLA
Nov 23, 2020

There are a lot of tools in the Prometheus ecosystem for checking or testing your configuration. promtool currently lets you check the syntax of your alerts and also unit test the expressions they use. As a side effect, it also tests rendering of the Go templates used in annotations, which is handy. Once verified, those alerts will eventually be sent to Alertmanager. Alertmanager has its own tool, amtool, which lets you test the routing: whether an alert with a given set of labels will notify the expected receivers.

So what’s still missing?

Imagine you have routing in Alertmanager based on the severity label values critical, warning and info. But what happens if a new developer, not aware of the well-known severities, creates an alert with severity high? Well, that depends on the rest of the configuration, but most probably the alert will be thrown away or end up in some generic receiver no one watches. Oops! 😱 We should probably check that no one uses unexpected severities, and even check for typos, to avoid such cases.
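To make the idea concrete, here is a minimal Python sketch of such a check. It is only an illustration of the logic, not how promruval implements it; the function name and the error messages are made up for this example, and the allowed values are the three severities mentioned above.

```python
# Severities the Alertmanager routing in this example knows about.
ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def check_severity(labels, allowed=ALLOWED_SEVERITIES):
    """Return a list of problems with the alert's severity label (empty if fine)."""
    severity = labels.get("severity")
    if severity is None:
        return ["missing label 'severity'"]
    if severity not in allowed:
        return [f"severity '{severity}' is not one of {sorted(allowed)}"]
    return []


# The alert from the story above: a developer used an unknown severity.
print(check_severity({"severity": "high"}))
```

An alert labeled with severity warning yields an empty list, while high or a typo such as critcal is reported before it ever reaches the routing tree.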

Another Alertmanager-related issue can be rendering the templates. The Go templates used in Alertmanager most probably expect some specific annotations to be present in the alert, so it would be handy to verify that the alert has them before you start receiving messed-up notifications with missing text, for example 🚧
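The check itself is simple. Here is a rough Python sketch, assuming the notification templates rely on annotations called title and description (hypothetical names chosen for this example; use whatever your templates actually reference).

```python
# Hypothetical annotation names the notification templates are assumed to use.
REQUIRED_ANNOTATIONS = ("title", "description")


def missing_annotations(annotations, required=REQUIRED_ANNOTATIONS):
    """Return the required annotation names that are absent or blank, in order."""
    return [name for name in required if not annotations.get(name, "").strip()]


# An alert with only a title would render a notification with no description.
print(missing_annotations({"title": "High traffic"}))
```

Treating a blank annotation the same as a missing one is a design choice of this sketch: an empty string would render just as badly in the notification.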

It is also best practice to link a playbook right from the alert, so the on-call person does not have to figure out how to resolve the situation from scratch. You would put such a link as a URL in an alert annotation, but you should probably check that the annotation is even there and, if so, that it really is a valid URL. You could possibly go even further and check that the URL does not return a 404. 🤔
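As a sketch of the first two steps, the following Python snippet checks that the annotation exists and looks like an HTTP(S) URL. The annotation name playbook and the example URL are assumptions made for this illustration. The 404 check is left out on purpose, since it would need a real HTTP request.

```python
from urllib.parse import urlparse


def is_valid_url(value):
    """True if the value parses as an http(s) URL with a host part."""
    try:
        parsed = urlparse(value)
    except ValueError:  # raised for some malformed inputs, e.g. broken IPv6 hosts
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_playbook(annotations, name="playbook"):
    """Return an error message, or None if the playbook annotation looks fine."""
    url = annotations.get(name)
    if url is None:
        return f"missing annotation '{name}'"
    if not is_valid_url(url):
        return f"annotation '{name}' is not a valid URL: {url!r}"
    return None


# Hypothetical alert annotations; the URL is a made-up example.
print(check_playbook({"playbook": "https://wiki.example.com/playbooks/high-traffic"}))
print(check_playbook({"playbook": "TODO"}))
```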

And what about writing an alert that analyzes the difference in incoming traffic between now and a week ago, while forgetting that your Prometheus has a retention of only 5 days? Been there, done that 😐 How cool would it be if your CI told you that you are trying to use more data than is available?
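A rough way to catch this is to look for range selectors and offset modifiers in the expression and compare them with the retention. The Python sketch below does this with a regular expression instead of a real PromQL parser, so treat it as an approximation: it takes the largest single range or offset instead of adding them up, and it could be fooled by brackets inside label values. The metric name in the example is made up.

```python
import re

# Seconds per PromQL duration unit.
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                "d": 86400, "w": 604800, "y": 31536000}
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")
_DURATION = r"(?:\d+(?:ms|s|m|h|d|w|y))+"
# Matches a range selector "[5m]", a subquery range "[1h:5m]" or "offset 1w".
_USED = re.compile(rf"\[({_DURATION})(?::[^\]]*)?\]|\boffset\s+({_DURATION})")


def to_seconds(duration):
    """Convert a duration such as '5d' or '1h30m' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in _PART.findall(duration))


def max_lookback(expr):
    """Largest single range or offset found in the expression, in seconds."""
    found = [rng or off for rng, off in _USED.findall(expr)]
    return max((to_seconds(d) for d in found), default=0)


def exceeds_retention(expr, retention):
    """True if the expression looks further back than the retention allows."""
    return max_lookback(expr) > to_seconds(retention)


# Week-over-week comparison on a hypothetical metric, with 5 days of retention.
expr = ("sum(rate(http_requests_total[5m]))"
        " / sum(rate(http_requests_total[5m] offset 1w)) > 2")
print(exceeds_retention(expr, "5d"))
```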

Or, in a more complex cluster setup (for example when using Thanos), you have multiple places where you can evaluate your rules, and those can differ in the labels available there. If some labels are propagated as external labels or added/dropped by relabeling of some kind, users can easily get confused about which labels they can or cannot use in their queries.

So let's validate those cases! 💪

To address these situations and more, the promruval tool was created. It is an open-source, free-to-use, single-binary tool that lets you write simple validation rules and check whether files with Prometheus rules conform to them.

Let’s try it out and write a validation configuration for all the situations mentioned above.

Example promruval configuration to check all previous cases.

Now we can write an alert violating all these rules.

Example rule violating all the requirements defined above.

And let’s validate it.

Output of the promruval validation.

At the beginning of the output we can see a brief summary of the validations used. Next comes a tree of the validated files, groups and alerts, each with a list of failed validation checks. promruval also prints statistics about the validation run and exits with exit code 1 if the validation failed.

So the validation result is…

The tool supports even more validation checks not covered in this example. The full list can be found here.

Also, there are a few other minor but useful features, such as disabling rules and printing a more human-readable form of the validation configuration, for example to be rendered in git pages.

So that’s it. The promruval tool can be found at https://github.com/FUSAKLA/promruval, with hopefully all the documentation needed. There you’ll find prebuilt binaries and Docker images, or you can simply go get it.

So feel free to use it, and hopefully it will help you make your monitoring stack even more reliable and error-proof! 👋

FUSAKLA

Observability enthusiast messing around Prometheus and related stuff.