infra/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md

46 lines
2.4 KiB
Markdown
Raw Permalink Normal View History

# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement
## Status
Accepted (2026-06-13)
## Context
A review was prompted by a suspicion that Kuma was "scraping too much / causing
unnecessary traffic," itself triggered by a socket.io login-timeout incident on
the monitor-sync CronJobs. Measured state at review time:
- **227 active monitors**; 209 of them at 300s intervals; **~1 check/sec** aggregate.
- Datastore: the **shared `mysql.dbaas`** (MariaDB), **~77 MB**, ~1 heartbeat
write/sec, 30-day retention.
- **122 `[External]` monitors** (full public path) + ~105 internal.
The data did **not** support a load problem — Kuma is already lean. The
login-timeout incident was a Kuma 2.x socket.io quirk (kuma's single Node event
loop briefly stalling), fixed separately by wrapping login in a retry — not a
load issue.
## Decisions
1. **Keep Kuma as-is; do not reflexively cut monitors or intervals.** Poll rate
(~1/s) and DB footprint (77 MB) are modest.
2. **`[External]` monitors stay per-service** (one per externally-reachable
service), **not** a small canary set. Rejected cutting to ~6-10 canaries:
although the Cloudflare → tunnel → Traefik path is shared infra that fails as a
unit, per-service external probes also catch *single-service* external
misconfig (one service's DNS / auth carve-out / route), which canaries miss.
The ~35k Cloudflare requests/day this generates is accepted for that coverage.
3. **Datastore stays on the shared `mysql.dbaas`.** Rejected moving to
self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the
single-instance MySQL it also helps monitor, including during that MySQL's
8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as
low-impact for now.
## Consequences
- All three decisions are **cheap to reverse**; revisit if measured load on
`mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This
ADR exists mainly so that review isn't re-run from scratch.
- **Known gap:** the *internal* monitor-sync creates/updates monitors but does
**not** prune orphans (the external sync does). Internal monitors for deleted
services linger and need periodic manual cleanup — e.g. the stale
"Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted
by hand on 2026-06-13. A *scoped* internal-prune (only deleting monitors the
sync owns, never hand-made ones) is a possible future improvement.