From 6dc77f46128474fe141a178c2f0348e2d00318cc Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 14 Jun 2026 09:11:22 +0000
Subject: [PATCH] uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review)

Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1
check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d
retention); the 'scraping too much' concern traced to a fixed socket.io
login-timeout incident, not load. Records the deliberate decisions (keep
per-service [External] monitors over canaries; keep datastore on shared
mysql.dbaas) with rejected alternatives + rationale, plus the known
internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand).

Co-Authored-By: Claude Opus 4.8
---
 stacks/uptime-kuma/CONTEXT.md                 | 29 ++++++++++++
 .../0001-uptime-kuma-sizing-and-placement.md  | 45 +++++++++++++++++++
 2 files changed, 74 insertions(+)
 create mode 100644 stacks/uptime-kuma/CONTEXT.md
 create mode 100644 stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md

diff --git a/stacks/uptime-kuma/CONTEXT.md b/stacks/uptime-kuma/CONTEXT.md
new file mode 100644
index 00000000..e8d2c981
--- /dev/null
+++ b/stacks/uptime-kuma/CONTEXT.md
@@ -0,0 +1,29 @@
+# Uptime Kuma — Context
+
+Glossary for the uptime-kuma monitoring context. Terms only — no implementation
+detail. Decisions live in `docs/adr/`.
+
+## Glossary
+
+**Active check (poll)** — Uptime Kuma actively probes a target on an interval
+(HTTP / TCP / ping / DB). This is *polling*, not "scraping." Prometheus *scrapes*
+exporters; Kuma *polls* targets. (Note: Prometheus does **not** scrape Kuma — a
+separate monitoring lane.)
+
+**Monitor** — one configured target plus its check definition.
+
+**Internal monitor** — probes a service on its in-cluster address
+(`*.svc.cluster.local`). Answers "is the service itself healthy?"
+
+**`[External]` monitor** — probes a service via its full public path
+(DNS → Cloudflare → cloudflared tunnel → Traefik). Answers "is the service
+reachable the way users reach it?" Maintained one-per-externally-reachable-service
+by deliberate choice (see ADR-0001).
+
+**Heartbeat** — one recorded check result (up/down + latency), persisted to the
+datastore.
+
+**External-access divergence** — the condition where a service is healthy
+*internally* but its `[External]` path is down — i.e. the shared
+Cloudflare/tunnel/Traefik path is broken while the service itself is fine.
+Surfaced by the `ExternalAccessDivergence` alert.
diff --git a/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md b/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md
new file mode 100644
index 00000000..80db84ac
--- /dev/null
+++ b/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md
@@ -0,0 +1,45 @@
+# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement
+
+## Status
+Accepted (2026-06-13)
+
+## Context
+A review was prompted by a suspicion that Kuma was "scraping too much / causing
+unnecessary traffic," itself triggered by a socket.io login-timeout incident on
+the monitor-sync CronJobs. Measured state at review time:
+
+- **227 active monitors**; 209 of them at 300s intervals; **~1 check/sec** aggregate.
+- Datastore: the **shared `mysql.dbaas`** (MariaDB), **~77 MB**, ~1 heartbeat
+  write/sec, 30-day retention.
+- **122 `[External]` monitors** (full public path) + ~105 internal.
+
+The data did **not** support a load problem — Kuma is already lean. The
+login-timeout incident was a Kuma 2.x socket.io quirk (kuma's single Node event
+loop briefly stalling), fixed separately by wrapping login in a retry — not a
+load issue.
+
+## Decisions
+1. **Keep Kuma as-is; do not reflexively cut monitors or intervals.** Poll rate
+   (~1/s) and DB footprint (77 MB) are modest.
+2. **`[External]` monitors stay per-service** (one per externally-reachable
+   service), **not** a small canary set. Rejected cutting to ~6-10 canaries:
+   although the Cloudflare → tunnel → Traefik path is shared infra that fails as a
+   unit, per-service external probes also catch *single-service* external
+   misconfig (one service's DNS / auth carve-out / route), which canaries miss.
+   The ~35k Cloudflare requests/day this generates is accepted for that coverage.
+3. **Datastore stays on the shared `mysql.dbaas`.** Rejected moving to
+   self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the
+   single-instance MySQL it also helps monitor, including during that MySQL's
+   8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as
+   low-impact for now.
+
+## Consequences
+- All three decisions are **cheap to reverse**; revisit if measured load on
+  `mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This
+  ADR exists mainly so that review isn't re-run from scratch.
+- **Known gap:** the *internal* monitor-sync creates/updates monitors but does
+  **not** prune orphans (the external sync does). Internal monitors for deleted
+  services linger and need periodic manual cleanup — e.g. the stale
+  "Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted
+  by hand on 2026-06-13. A *scoped* internal-prune (only deleting monitors the
+  sync owns, never hand-made ones) is a possible future improvement.
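
Review note on the scoped internal-prune mentioned under "Known gap": the selection rule could look roughly like the sketch below. It is not part of this patch and not the existing sync code. The `Monitor` shape, the `OWNER_TAG` marker, and the example addresses are all hypothetical; the real sync may identify the monitors it owns in some other way, and the actual delete call against Kuma is deliberately left out.

```python
from dataclasses import dataclass

# Hypothetical ownership marker. The real internal sync may tag or name
# the monitors it manages differently; this is only an illustration.
OWNER_TAG = "managed-by:internal-sync"


@dataclass(frozen=True)
class Monitor:
    """Minimal stand-in for a Kuma monitor record (assumed shape)."""
    name: str
    url: str
    tags: frozenset = frozenset()


def plan_scoped_prune(existing, desired_urls):
    """Return the monitors a scoped prune would delete.

    A monitor is selected only if BOTH hold:
      * it carries OWNER_TAG (the sync created it), and
      * its target is no longer in the desired set.
    Hand-made monitors never carry the tag, so they are never selected.
    """
    desired = set(desired_urls)
    return [
        m for m in existing
        if OWNER_TAG in m.tags and m.url not in desired
    ]


# Example data; every address below is made up for illustration.
existing = [
    Monitor("Grafana", "http://grafana.monitoring.svc.cluster.local",
            frozenset({OWNER_TAG})),
    Monitor("Goldilocks (VPA)", "http://goldilocks.vpa.svc.cluster.local",
            frozenset({OWNER_TAG})),
    Monitor("Hand-made probe", "http://legacy.default.svc.cluster.local"),
]
desired_urls = ["http://grafana.monitoring.svc.cluster.local"]

for stale in plan_scoped_prune(existing, desired_urls):
    print(f"would delete: {stale.name}")
```

Keeping the planner a pure function (inputs in, list of candidates out) would let the CronJob log or dry-run the plan before deleting anything, which seems a reasonable safeguard given that a wrong prune removes monitoring coverage.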