Alerting That Actually Wakes You Up: Discord, Email, and Dead Man's Switches
Installing kube-prometheus-stack takes five minutes. The Helm chart drops 147 alerting rules into your cluster, fires a Discord webhook on the first KubePodCrashLooping, and every tutorial declares victory. Then KubeProxyDown starts firing because you replaced kube-proxy with Cilium. etcdMembersDown fires because kubeadm binds etcd to localhost. CPUThrottlingHigh fires on every scrape burst in the monitoring namespace. Within a week, you've trained yourself to ignore the alerts channel entirely.
This cluster took a different path. After a month of real incidents (apiserver restarts that silently dropped the kube-vip VIP, UPS events that needed immediate attention, a DNS outage caused by a Cilium L2 lease misalignment), the alerting stack grew to 56 custom rules across 18 PrometheusRule manifests. Combined with the 147 defaults from kube-prometheus-stack v81.0.0, Prometheus evaluates 203 alerting rules every 30 seconds. Five of those defaults are silenced because they're wrong for kubeadm. The rest route through three tiers: Discord for awareness, email for emergencies, and a dead man's switch that catches Alertmanager itself when it goes down.
This is Part 7 of the homelab series. Part 5 set up distributed storage with Longhorn. Part 6 deployed Loki + Alloy for log collection. Now we're closing the observability loop: when something breaks, you find out before your users do.
Three Channels, Three Severities
Alertmanager receives alerts from Prometheus and routes them by severity. This cluster uses three tiers:
```
                  Prometheus (203 rules)
                            |
                            v
                   Alertmanager v0.30.1
                            |
         +------------------+------------------+
         |                  |                  |
         v                  v                  v
+-----------------+ +-----------------+ +-----------------+
|    Critical     | |     Warning     | |    Watchdog     |
|    15 alerts    | |    38 alerts    | | healthchecks.io |
|   #incidents    | |     #status     | |  (dead man's)   |
|     + Email     | |    (Discord)    | |                 |
+-----------------+ +-----------------+ +-----------------+
```
Critical alerts hit Discord #incidents and email simultaneously. A Discord notification is easy to miss on a silenced phone; email to three addresses ensures at least one device buzzes. Warning alerts go to a separate #status channel you check in the morning. Watchdog pings an external service that notices when the pings stop.
The full routing and receiver config from Helm values:
```yaml
alertmanager:
  config:
    route:
      receiver: 'discord-status'
      group_by: ['alertname', 'namespace', 'job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - match: {alertname: Watchdog}
          receiver: 'healthchecks-heartbeat'
          repeat_interval: 1m
        - match: {severity: critical}
          receiver: 'discord-incidents-email'
        - match: {severity: warning}
          receiver: 'discord-status'
    receivers:
      - name: 'discord-incidents-email'
        discord_configs:
          - webhook_url: SET_VIA_HELM
            title: >-
              🔴 {{ .Status | toUpper }}:
              {{ .CommonLabels.alertname }}
            message: |
              {{ range .Alerts }}
              **{{ .Labels.alertname }}** ({{ .Labels.severity }})
              {{ .Annotations.summary }}
              {{ end }}
        email_configs:
          - to: >-
              [email protected],
              [email protected],
              [email protected]
            send_resolved: true
```
Alertmanager added native discord_configs in v0.25.0, so there's no webhook bridge or OAuth flow. The send_resolved: true on email notifies when the situation clears. The group_by groups related alerts into single notifications, keeping messages within Discord's 2,000-character limit and avoiding the 30 requests/60 seconds rate limit during cluster-wide events. Every webhook URL and SMTP credential is injected at deploy time from 1Password CLI via a trap-guarded temporary values overlay. Real credentials never touch version control.
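The injection flow can be sketched as a small shell script. The 1Password item path, overlay filename, and Helm invocation below are hypothetical placeholders; the point is the `trap`, which guarantees the secrets overlay is deleted on any exit, including errors and Ctrl-C. Here the `op read` call is stubbed so the guard logic itself stands alone:

```shell
set -euo pipefail

overlay="$(mktemp /tmp/alertmanager-overlay.XXXXXX.yaml)"
# Remove the overlay on ANY exit path: success, failure, or interrupt.
trap 'rm -f "$overlay"' EXIT INT TERM

# Real flow (path is illustrative):
#   webhook="$(op read 'op://homelab/discord-incidents/webhook')"
webhook="${DISCORD_WEBHOOK:-https://discord.com/api/webhooks/PLACEHOLDER}"

cat > "$overlay" <<EOF
alertmanager:
  config:
    receivers:
      - name: 'discord-incidents-email'
        discord_configs:
          - webhook_url: ${webhook}
EOF

# Real flow: helm upgrade --install ... -f values.yaml -f "$overlay"
echo "overlay ready: $overlay"
```

Because the overlay only ever exists as a short-lived temp file, there is nothing secret to accidentally commit.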
Monitoring the Monitor
The hardest failure to detect is when the monitoring stack itself goes down. A Prometheus crash means no alerts fire. An Alertmanager crash is worse: alerts fire, but nobody receives them. Take the entire cluster offline and both fail simultaneously. You need something outside the cluster watching for silence.
kube-prometheus-stack includes an alert called Watchdog that fires continuously. Its expression is always true. As long as Prometheus is evaluating rules and Alertmanager is processing notifications, Watchdog keeps firing. Route it to an external service that notices when the pings stop:
```yaml
receivers:
  - name: 'healthchecks-heartbeat'
    webhook_configs:
      - url: 'SET_VIA_HELM'  # healthchecks.io ping URL
        send_resolved: false
```
healthchecks.io expects a ping at a configured interval. Miss enough pings, and it sends its own notification to Discord, configured on the healthchecks.io side and completely independent of the cluster's Alertmanager. This creates a monitoring chain: Prometheus monitors the cluster, healthchecks.io monitors Prometheus.
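The Watchdog alert itself is just a constant expression. A sketch of what kube-prometheus-stack ships (the packaged rule carries additional annotations, and the exact labels may differ by chart version):

```yaml
- alert: Watchdog
  expr: vector(1)   # always true, so the alert never resolves
  labels:
    severity: none
```

Because `vector(1)` can never be false, the only way this alert stops firing is if rule evaluation or notification delivery breaks, which is exactly the signal the heartbeat needs.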
There's a gotcha that most tutorials miss. The Watchdog route sets repeat_interval: 1m, but the parent route sets group_interval: 5m. Alertmanager uses the larger value. The actual ping interval is 5 minutes, not 1. Alertmanager logs a warning about this on startup:
```
repeat_interval is less than group_interval.
Notifications will not repeat until the next group_interval.
```
Configure your healthchecks.io check with a period of 5 minutes and a grace period of at least 10 minutes. Setting a 1-minute expected period will generate false "down" notifications while everything is working fine.
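The interaction reduces to taking the larger of the two settings. This is a toy model of the documented behavior, not Alertmanager code:

```python
def effective_repeat_s(repeat_interval_s: int, group_interval_s: int) -> int:
    """Alertmanager won't re-notify a group before group_interval elapses,
    so the effective repeat interval is the larger of the two settings."""
    return max(repeat_interval_s, group_interval_s)

# Watchdog route: repeat_interval 1m, parent group_interval 5m
effective_repeat_s(60, 300)  # 300 — the heartbeat actually pings every 5 minutes
```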
Live proof the pipeline works:
```
$ kubectl exec -n monitoring \
    alertmanager-prometheus-kube-prometheus-alertmanager-0 -- \
    amtool alert query alertname=Watchdog
Alertname   Starts At
Watchdog    2026-02-16T09:15:00Z
```
Watchdog has been firing continuously since the last Alertmanager restart. That's expected. If this command returns nothing, the dead man's switch is broken.
Silencing the Noise
A fresh kube-prometheus-stack installation on a kubeadm cluster fires false positives immediately. This cluster silences five of them by routing to a null receiver:
```yaml
routes:
  # kube-proxy intentionally not running (Cilium eBPF replacement)
  - match: {alertname: KubeProxyDown}
    receiver: 'null'
  # kubeadm binds etcd to 127.0.0.1, Prometheus can't scrape it
  - match: {alertname: etcdInsufficientMembers}
    receiver: 'null'
  - match: {alertname: etcdMembersDown}
    receiver: 'null'
  # Control plane components also bind to localhost
  - match_re:
      alertname: TargetDown
      job: kube-scheduler|kube-controller-manager|kube-etcd
    receiver: 'null'
  # Node-exporter scrape bursts trigger throttling in monitoring namespace
  - match:
      alertname: CPUThrottlingHigh
      namespace: monitoring
    receiver: 'null'
```
The first four are kubeadm-specific. KubeProxyDown fires because this cluster replaced kube-proxy with Cilium eBPF in Part 3. There's no kube-proxy to scrape, so the target is permanently down. The etcd alerts fire because kubeadm binds etcd metrics to 127.0.0.1. Prometheus can't scrape etcd from outside the node, so it reports insufficient members. TargetDown catches the same localhost binding issue for the scheduler, controller-manager, and etcd scrape jobs.
CPUThrottlingHigh in the monitoring namespace is a different problem. Node-exporter and kube-state-metrics generate bursty scrape patterns that trigger CPU throttling alerts without actual performance impact. Silencing it only in the monitoring namespace keeps the alert active for application workloads where throttling matters.
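Routing to 'null' only works if a receiver by that name exists with no notification configs. kube-prometheus-stack's default values already define one; if you manage the Alertmanager config yourself, it looks like this:

```yaml
receivers:
  - name: 'null'   # no *_configs blocks: matched alerts are dropped silently
```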
An inhibition rule handles the overlap between warning and critical alerts for the same issue:
```yaml
inhibit_rules:
  - source_match: {severity: 'critical'}
    target_match: {severity: 'warning'}
    equal: ['alertname', 'namespace']
```
When UPSBatteryCritical fires, UPSBatteryWarning gets suppressed automatically. One notification per escalation, not two.
Alerts Born From Real Incidents
The most useful alerts in this cluster weren't planned. They were written after something broke.
Apiserver restarts nobody notices
kube-apiserver on cp3 accumulated 30 restarts over the month following deployment. cp2 had 21 restarts, cp1 had 7. The default KubeAPIDown alert never fired because each restart recovers within seconds. But every restart triggers a kube-vip leader re-election, dropping the floating VIP while the new leader takes over. Frequent restarts mean frequent brief API outages that KubeAPIDown can't see.
The most recent restarts all hit within the same minute across all three nodes, with termination reason "Error," pointing to a cluster-wide event rather than a single node problem. cp3's higher cumulative count suggests heavier scheduling pressure from colocated workloads (Prometheus and Alertmanager both run there). Root cause is still being tracked.
```yaml
- alert: KubeApiserverFrequentRestarts
  expr: >
    increase(kube_pod_container_status_restarts_total
    {container="kube-apiserver"}[24h]) > 5
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: >-
      kube-apiserver {{ $labels.pod }} has restarted
      {{ $value }} times in 24h
```
The for field on a Prometheus alerting rule controls how long an expression must be true before the alert transitions from pending to firing. Setting for: 0m skips the pending period entirely. For a trailing 24-hour window that already smooths out single restarts, an immediate fire is the right choice.
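The transition can be sketched as a tiny decision function. This is a toy model of the pending-to-firing behavior, not Prometheus's actual evaluator:

```python
from typing import Optional

def alert_state(seconds_true: Optional[float], for_seconds: float) -> str:
    """Toy model of a Prometheus alerting rule's state transitions."""
    if seconds_true is None:
        return "inactive"       # expression is currently false
    if seconds_true >= for_seconds:
        return "firing"         # held true for the full `for` duration
    return "pending"            # true, but not yet long enough

alert_state(0, 0)      # `for: 0m` -> fires on the first true evaluation
alert_state(60, 300)   # `for: 5m` -> still only pending after one minute
```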
UPS severity ladder
Power events need graduated urgency. Losing mains power is worth knowing about. A battery below 30% needs attention. The UPS reporting "low battery" means shutdown is seconds away.
```yaml
- alert: UPSOnBattery
  expr: network_ups_tools_ups_status{flag="OB"} == 1
  for: 1m
  labels: {severity: warning}
- alert: UPSBatteryCritical
  expr: network_ups_tools_battery_charge < 30
  for: 1m
  labels: {severity: critical}
- alert: UPSLowBattery
  expr: network_ups_tools_ups_status{flag="LB"} == 1
  for: 0m
  labels: {severity: critical}
```
UPSLowBattery fires immediately because the low battery flag means the UPS has decided shutdown is imminent. The staggered node shutdown is already running at this point: cp3 shuts down at 10 minutes on battery, cp2 at 20 minutes, cp1 on the final low-battery forced shutdown. Every second of notification delay is wasted.
The full escalation: on-battery warning (Discord #status) → battery below 50% warning → battery below 30% critical (Discord #incidents + email) → low battery critical, immediate (Discord #incidents + email). The inhibition rule suppresses warnings once critical fires, so you receive one notification per escalation step.
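The ladder collapses to a small decision function. Thresholds come from the rules above; the function itself is an illustration, not deployed code:

```python
def ups_severity(on_battery: bool, charge_pct: float, low_battery: bool) -> str:
    """Map UPS state to the severity tier of the escalation ladder."""
    if low_battery:
        return "critical"   # UPSLowBattery: the UPS says shutdown is imminent
    if charge_pct < 30:
        return "critical"   # UPSBatteryCritical
    if charge_pct < 50:
        return "warning"    # battery-below-50% warning
    if on_battery:
        return "warning"    # UPSOnBattery: mains lost, battery still healthy
    return "ok"

ups_severity(True, 80, False)   # "warning" — on battery, charge fine
ups_severity(True, 25, False)   # "critical" — below the 30% threshold
```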
DNS vanishes, pod stays healthy
AdGuardDNSUnreachable fires when probe_success{job="adguard-dns"} == 0 for 2 minutes. This alert exists because of a Cilium L2 announcement edge case: the AdGuard pod runs fine, but the L2 lease announcing its LoadBalancer IP can drift to a different node. DNS stops working even though the pod is healthy. The runbook annotation includes the fix: kubectl delete lease -n kube-system cilium-l2announce-adguard-home-dns.
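A `probe_success` metric implies a blackbox-exporter-style probe. A sketch of the corresponding prometheus-operator Probe resource — the service address, module name, and target IP below are assumptions, not the cluster's actual values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: adguard-dns
  namespace: monitoring
spec:
  jobName: adguard-dns            # becomes the job label on probe_success
  prober:
    url: blackbox-exporter.monitoring.svc:9115   # assumed service address
  module: dns_udp                 # assumed module in the blackbox config
  targets:
    staticConfig:
      static:
        - 192.168.1.53            # illustrative AdGuard LoadBalancer IP
```

Probing the LoadBalancer IP from outside the pod is what makes this alert catch the L2 lease drift: a pod-level health check would stay green the whole time.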
What's Still Missing
Honesty about gaps matters more than pretending full coverage.
GitLab has zero custom alerts. It's the most complex deployment in the cluster: 10+ components including Gitaly, Webservice, Sidekiq, Redis, and PostgreSQL, all running as StatefulSets. The default kube-prometheus-stack alerts catch pod crashes, but not GitLab-specific failures like stalled Sidekiq queues or Gitaly latency. This is the next alerting project.
Alertmanager runs as a single replica on cp3, which also hosts the Prometheus server and the most-restarting apiserver. If cp3 goes down, alerting stops until the pod reschedules to another node. The dead man's switch on healthchecks.io detects this within 10 minutes. For production, you'd run three Alertmanager replicas, which deduplicate notifications among themselves over the built-in gossip protocol. For a homelab, the single replica plus external monitoring is an acceptable trade-off.
Log-based alerting isn't configured yet. Part 6 identified this gap: Loki supports ruler-based LogQL alerts, but none are deployed. All 203 rules in this cluster are PromQL (metric-based). Metrics tell you "the pod restarted" but not "why." A Ghost 500 error, a Gitaly corruption warning, or a cert-manager ACME failure shows up in logs before it shows up in metrics. Bridging that gap requires deploying Loki's ruler component, which adds another pod and a separate evaluation loop. For now, {namespace="ghost"} |= "ERROR" is a manual Grafana query, not an automated alert.
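When the ruler does land, that manual query would become a rule file in the Prometheus-compatible format with a LogQL expression. A sketch of what it might look like; the alert name, threshold, and group name are invented, and nothing like this is deployed yet:

```yaml
groups:
  - name: ghost-log-alerts
    rules:
      - alert: GhostErrorBurst
        expr: sum(count_over_time({namespace="ghost"} |= "ERROR" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Ghost is logging errors faster than usual
```

Because the ruler speaks the same rule format and ships alerts to the same Alertmanager, the routing tiers above would apply to log alerts unchanged.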
The Full Inventory
Every custom PrometheusRule in the cluster, grouped by what they protect:
| Category | Alerts | Critical | Warning | Info |
|---|---|---|---|---|
| UPS / Power | 8 | 4 | 3 | 1 |
| Media Stack (ARR) | 10 | 2 | 8 | 0 |
| Logging (Loki + Alloy) | 6 | 1 | 5 | 0 |
| kube-vip HA | 4 | 2 | 2 | 0 |
| Claude Code Spending | 4 | 1 | 2 | 1 |
| Certificates (cert-manager) | 3 | 2 | 1 | 0 |
| Version Checker | 3 | 0 | 2 | 1 |
| Cloudflare Tunnel | 2 | 1 | 1 | 0 |
| Storage (Longhorn) | 2 | 1 | 1 | 0 |
| Services (Ghost, Ollama, etc.) | 12 | 0 | 12 | 0 |
| API Server | 1 | 0 | 1 | 0 |
| DNS (AdGuard) | 1 | 1 | 0 | 0 |
| Total | 56 | 15 | 38 | 3 |
Plus 147 defaults from kube-prometheus-stack. Total: 203 active alerting rules.
```
$ kubectl get prometheusrules -A -o json | \
    jq '[.items[].spec.groups[].rules[] | select(.alert != null)] | length'
203
```
What alerting costs
Alertmanager is a fraction of the monitoring stack's footprint:
| Component | Memory | CPU | Note |
|---|---|---|---|
| Prometheus server | ~3,200 Mi | ~250m | TSDB, rule evaluation, 203 alerting rules |
| Grafana | ~430 Mi | ~15m | Dashboards, data source queries |
| Alertmanager | ~70 Mi | ~4m | Routing, deduplication, notification delivery |
| Node exporters (3) | ~60 Mi | ~10m | DaemonSet, one per node |
| kube-state-metrics | ~45 Mi | ~8m | Kubernetes object metrics |
| Monitoring total | ~3,805 Mi | ~287m | 8% of 48 GB cluster RAM |
Prometheus dominates. It evaluates 203 alerting rules every 30 seconds, adding rule evaluation load proportional to the number of time series each rule queries. The 56 custom rules contribute roughly a quarter of that evaluation load. Alertmanager's 70 Mi is a rounding error. Adding the logging layer from Part 6 (Loki + 3 Alloy pods at ~1,010 Mi), the full observability stack totals roughly 4.8 GB. On 48 GB of cluster RAM, that's 10%.
Beyond Alertmanager: weekly version drift
A CronJob called version-check runs every Sunday at 08:00 PHT using Fairwinds Nova to scan for Helm chart drift. It builds Discord embed payloads with jq and POSTs directly to a #version-alerts webhook, bypassing Alertmanager entirely. Orange embeds for outdated charts, red for deprecated, green when everything is current. This is the only notification path in the cluster that doesn't flow through the Prometheus → Alertmanager pipeline. It exists because Helm chart version drift isn't a Prometheus metric; it's a CLI scan that runs against the Helm release history.
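The real job assembles the payload with jq; the same shape in Python for clarity. Field names follow Discord's embed schema, the color values mirror the orange/green convention above, and the function name is invented:

```python
import json

def chart_embed(chart: str, installed: str, latest: str) -> dict:
    """Build a Discord webhook payload flagging Helm chart drift."""
    outdated = installed != latest
    return {
        "embeds": [{
            "title": f"Helm chart drift: {chart}" if outdated
                     else f"{chart} is current",
            "description": f"installed {installed}, latest {latest}",
            "color": 0xE67E22 if outdated else 0x2ECC71,  # orange vs green
        }]
    }

payload = json.dumps(chart_embed("kube-prometheus-stack", "81.0.0", "82.0.0"))
```

POSTing that JSON to the webhook URL with `Content-Type: application/json` is all Discord requires; no bot token or OAuth flow is involved.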
Separately, a jetstack version-checker Deployment runs continuously and exposes container image version data as Prometheus metrics. The ContainerImageOutdated PrometheusRule fires when an image has been outdated for 7+ days. These two components cover different layers: Nova checks Helm charts, version-checker checks container images.
Testing the pipeline
The repo includes a test-alert.yaml (not applied to the cluster) that fires both critical and warning alerts using vector(1) as the expression. Apply it, verify Discord and email routing end-to-end, then delete it. Faster than waiting for a real incident to validate the pipeline.
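A sketch of what such a manifest looks like; the metadata and the `release` label are assumptions (the label must match whatever the operator's ruleSelector expects), and the repo's actual test-alert.yaml may differ:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: monitoring
  labels:
    release: prometheus        # must match the operator's ruleSelector (assumed)
spec:
  groups:
    - name: pipeline-test
      rules:
        - alert: TestCriticalAlert
          expr: vector(1)      # always true: fires immediately
          labels: {severity: critical}
          annotations: {summary: "End-to-end test: critical routing"}
        - alert: TestWarningAlert
          expr: vector(1)
          labels: {severity: warning}
          annotations: {summary: "End-to-end test: warning routing"}
```

Within a couple of evaluation cycles you should see both Discord channels and the email inbox light up; deleting the manifest resolves the alerts and tests `send_resolved` on the way out.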
Two alerts were firing while I wrote this post: Watchdog (expected, it never stops) and ClaudeCodeHighDailySpend (the alerting system catching me spending too much on AI while writing about the alerting system).
This post is the seventh in the "Building a Production-Grade Homelab" series:
- Why kubeadm Over k3s, RKE2, and Talos in 2026
- HA Control Plane with kube-vip: No Load Balancer Needed
- Cilium Deep Dive: What Replacing kube-proxy Actually Means
- Gateway API vs Ingress: No Ingress Controller Needed
- Distributed Storage with Longhorn: 2 Replicas Are Enough
- The Modern Logging Stack: Loki + Alloy (Why Not Promtail)
- Alerting That Actually Wakes You Up: Discord, Email, and Dead Man's Switches (you are here)
- Self-Hosted GitLab: CI/CD Without Cloud Vendor Lock-in
Part 8 covers self-hosted GitLab CI/CD, closing the loop from code commit to deployed workload without leaving the cluster.