Monitoring Prometheus with Healthchecks

2022-12-15

Prometheus is a crazy powerful metrics and monitoring tool. Not only does it let you scrape and collect metrics from other tools like Traefik and Home Assistant, but thanks to Blackbox, it can also monitor the availability of other sites. Prometheus' main loop involves scraping a number of "exporters" over HTTP, looking at the data which comes back and collecting the results. It's therefore important not only that Prometheus is running, but also that scrapes are actually happening.

In my opinion, if a tool or service isn't being monitored, it may as well not exist. I recently switched from uptime-kuma to Prometheus for my uptime monitoring needs (more on that another day). Because Prometheus now plays a big role in my monitoring stack, I need it to be up and reliable. I can't really have it monitor itself, because if it goes down, who will tell me? My intention was to use an external monitoring tool like UptimeRobot, which I'm using for a few other critical services. However, for security reasons, Prometheus isn't open to the internet (I don't want you seeing all my metrics!), so unfortunately UptimeRobot won't work here, as there's nothing for it to ping. I would need another solution...

And then, it hit me. Throughout the rest of my infrastructure, I use healthchecks to monitor scheduled tasks. At a fundamental level, these checks work by having the task ping an endpoint at an interval, and when the pings stop, healthchecks yells in my general direction. The same method works for uptime checking: if Prometheus keeps "checking" that healthchecks is available, then when Prometheus stops checking or isn't running at all, healthchecks notifies me that something is wrong.

[Figure: The proposal]

Setting up

I already have both Prometheus and Blackbox running, so I'm going to assume you do too. If you don't, I highly recommend playing around with them. But getting started is a topic for someone else to cover (that someone might be future me, it might not).

For the purposes of isolation and maximum configurability, I set up my scrape as a dedicated job, separate from any of the other Blackbox jobs I have running. Blackbox's configuration in Prometheus is a little verbose due to its use of the multi-target pattern, but I think it's still fairly readable:

prometheus.yml
- job_name: blackbox_healthcheck
  scrape_interval: 10m
  metrics_path: /probe
  params:
    module: [http]  # Use `http_2xx` if you have a stock blackbox config
  static_configs:
    - targets:
        - https://hc-ping.com/{{ prometheus_healthcheck_uuid }}
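  relabel_configs:
    # Pass the target URL to Blackbox as the `target` query parameter
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the probed URL as the `instance` label
    - source_labels: [__param_target]
      target_label: instance
    # Send the scrape itself to Blackbox rather than to the target
    - target_label: __address__
      replacement: blackbox:9115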

The job itself is pretty straightforward. Every 10 minutes, Prometheus will instruct Blackbox to ping https://hc-ping.com/$ID, and report on the result. We don't care about the metrics Blackbox comes back with (even though I am storing them); we just care that the request is sent.
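
The `http` module named in the job has to exist in Blackbox's own configuration, which isn't shown here. As a rough guide, a minimal module of that name might look something like the sketch below. It assumes a plain GET probe; the `timeout` and `preferred_ip_protocol` values are illustrative assumptions, not settings taken from this setup.

```yaml
# blackbox.yml - a sketch, not necessarily the config used above
modules:
  http:                            # must match the `module` param in the scrape job
    prober: http
    timeout: 10s                   # assumed value
    http:
      method: GET
      # An empty list falls back to the default of accepting any 2xx response
      valid_status_codes: []
      preferred_ip_protocol: ip4   # assumed; omit to use Blackbox's default
```

The stock `http_2xx` module that ships with Blackbox is a similar plain HTTP probe, which is why it works as a drop-in here.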

Once configured, all it took was a restart of Prometheus (or a configuration reload if you want to be fancy), and Prometheus began keeping healthchecks informed that it's still alive. If Prometheus ever stops scraping, healthchecks tells me. That way I know it's still working, and still keeping an eye on my other applications. 10 minutes may seem like a long interval for uptime checks, but healthchecks is a great service, and I don't need to abuse it. Finding out about a problem within 10 minutes is more than good enough at the scale I'm working at.
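
The "fancy" reload over HTTP relies on Prometheus' lifecycle API, which is off by default. Here's a sketch of enabling it, assuming Prometheus runs under Docker Compose with the official `prom/prometheus` image. The service name and config path are illustrative assumptions.

```yaml
# docker-compose.yml - illustrative fragment
services:
  prometheus:
    image: prom/prometheus
    command:
      # Overriding `command` replaces the image's default flags,
      # so the config path has to be passed explicitly
      - --config.file=/etc/prometheus/prometheus.yml
      # Enables the HTTP reload (and quit) endpoints
      - --web.enable-lifecycle
```

With the flag set, an HTTP POST to `/-/reload` on Prometheus re-reads the configuration. Sending the process a `SIGHUP` does the same without needing the flag.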

Conclusion

This approach isn't perfect. Should the healthchecks job keep working but the others stop, I'd never know. The benefit, however, is that rather than relying on Prometheus and alertmanager to monitor themselves, I still find out even if it all falls over completely.
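
One way to narrow that gap might be an alerting rule on the `up` series Prometheus records for every scrape. This is only a sketch: the file name, rule names, 15-minute windows and the `traefik` job name are all placeholders, and it still depends on Prometheus and alertmanager being healthy enough to evaluate and deliver the alert.

```yaml
# rules.yml (hypothetical name), loaded via `rule_files` in prometheus.yml
groups:
  - name: scrape_health
    rules:
      # Fires when a target's scrapes have been failing for 15 minutes
      - alert: ScrapeFailing
        expr: up == 0
        for: 15m
      # Fires when a job has produced no `up` series at all for 15 minutes.
      # `traefik` is a placeholder job name.
      - alert: ScrapeJobMissing
        expr: absent(up{job="traefik"})
        for: 15m
```

The second rule matters because a job that stops being scraped entirely leaves no `up` series behind, so `up == 0` alone would stay silent.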

At some point, I would like to add more traditional monitoring of Prometheus, but that requires a few tweaks to my routing to ensure only the readiness check endpoint of Prometheus is exposed to the outside. I have some ideas, but that's a project for another day.
