Monitoring Prometheus with Healthchecks

2022-12-15
3 minutes

Prometheus is a crazy powerful metrics and monitoring tool. Not only does it let you scrape and collect metrics from other tools like Traefik and HomeAssistant, but thanks to Blackbox, it can also monitor the availability of other sites. Prometheus' main loop involves scraping a number of "exporters" over HTTP, looking at the data which comes back and collecting the results. Therefore, it's important not only that Prometheus is running, but also that scrapes are actually happening.

In my opinion, if a tool or service isn't being monitored, it may as well not exist. I recently switched to Prometheus for my uptime monitoring needs from uptime-kuma (more on that another day). Because Prometheus now plays a big role in my monitoring stack, I need it to be up and reliable. I can't really have it monitor itself, because if it goes down, who will tell me? For that, my intention was to use an external monitoring tool like UptimeRobot, which I'm using for a few other critical services. However, for security reasons, Prometheus isn't open to the internet (I don't want you seeing all my metrics!), so unfortunately UptimeRobot won't work here as there's nothing it can ping. I would need another solution...

And then, it hit me. Throughout the rest of my infrastructure, I use healthchecks to monitor scheduled tasks. At a fundamental level, these checks work by having the task ping an endpoint at a regular interval, and when the pings stop, healthchecks yells in my general direction. The same method works for uptime checking: if Prometheus keeps "checking" that healthchecks is available, then when Prometheus stops checking or isn't running at all, healthchecks notifies me that something is wrong.
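To make the idea concrete, here's a minimal sketch of that "dead man's switch" logic. It's a simplified model rather than how healthchecks is actually implemented, and the 10 minute period and 5 minute grace time are made-up values for illustration:

```python
from datetime import datetime, timedelta


def check_is_down(last_ping: datetime, now: datetime,
                  period: timedelta, grace: timedelta) -> bool:
    """Return True once a ping is overdue by more than the grace time."""
    return now - last_ping > period + grace


# Hypothetical settings: a ping is expected every 10 minutes,
# with 5 minutes of slack before anyone is notified.
period = timedelta(minutes=10)
grace = timedelta(minutes=5)
last_ping = datetime(2022, 12, 15, 12, 0)

# 12 minutes after the last ping: late, but still within the grace time.
print(check_is_down(last_ping, datetime(2022, 12, 15, 12, 12), period, grace))  # False
# 20 minutes after the last ping: the pings have stopped, so raise the alarm.
print(check_is_down(last_ping, datetime(2022, 12, 15, 12, 20), period, grace))  # True
```

The important part is that the alarm is triggered by the absence of a ping, so it doesn't matter whether Prometheus crashed, hung, or just stopped scraping.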

[Figure: The proposal]

#Setting up

I already have both Prometheus and Blackbox running, so I'm going to assume you have too. If you haven't, I highly recommend playing around with them. But getting started is a topic for someone else to cover (that someone might be future me, it might not).

For the purposes of isolation and maximum configurability, I set up my scrape as a dedicated job, separate from any of the other Blackbox jobs I have running. Blackbox's configuration in Prometheus is a little verbose due to its use of the multi-target pattern, but I think it's still fairly readable:

prometheus.yml
- job_name: blackbox_healthcheck
  scrape_interval: 10m
  metrics_path: /probe
  params:
    module: [http]  # Use `http_2xx` if you have a stock blackbox config
  static_configs:
    - targets:
        - https://hc-ping.com/{{ prometheus_healthcheck_uuid }}
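  relabel_configs:
    # Pass the original target URL to Blackbox as the `target` query parameter
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the probed URL as the `instance` label on the resulting metrics
    - source_labels: [__param_target]
      target_label: instance
    # Point the actual scrape at the Blackbox exporter instead
    - target_label: __address__
      replacement: blackbox:9115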

The job itself is pretty straightforward. Every 10 minutes, Prometheus will instruct Blackbox to ping https://hc-ping.com/$ID, and report on the result. We don't care about the metrics Blackbox comes back with (even though I am storing them), we just care that the request is sent.
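If the relabelling feels a bit magic, it can help to see the request Prometheus ends up making. The sketch below rebuilds that URL by hand; `probe_url` is a helper I've made up purely for illustration, and `example-uuid` is a placeholder rather than a real check ID:

```python
from urllib.parse import urlencode


def probe_url(blackbox_address: str, module: str, target: str) -> str:
    """Build the URL Prometheus ends up scraping after relabelling."""
    # `params.module` and `__param_target` become query parameters, while
    # `__address__` is replaced with the Blackbox exporter's own address.
    query = urlencode({"module": module, "target": target})
    return f"http://{blackbox_address}/probe?{query}"


# "example-uuid" is a placeholder, not a real check ID.
print(probe_url("blackbox:9115", "http", "https://hc-ping.com/example-uuid"))
# http://blackbox:9115/probe?module=http&target=https%3A%2F%2Fhc-ping.com%2Fexample-uuid
```

So Prometheus only ever talks to Blackbox, and it's Blackbox that makes the outbound request to healthchecks on each scrape.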

Once configured, all it took was a restart of Prometheus (or a configuration reload if you want to be fancy), and Prometheus began keeping healthchecks informed that it's still alive. If Prometheus ever stops scraping, healthchecks tells me. That way I know it's still working, and still keeping an eye on my other applications. 10 minutes may seem like a long interval for uptime checks, but healthchecks is a great service, and I don't need to abuse it. At the scale I'm working at, finding out about a problem within 10 minutes or so is absolutely fine.

#Conclusion

This approach isn't perfect. Should the healthchecks job keep working but the others stop, I'd never know. The benefit, however, is that rather than relying on Prometheus and alertmanager trying to self-monitor, I still find out even if it all falls over completely.

At some point, I would like to add more traditional monitoring of Prometheus, but that requires a few tweaks to my routing to ensure only the readiness check endpoint of Prometheus is exposed to the outside. I have some ideas, but that's a project for another day.
