infra/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md
Viktor Barzin 6dc77f4612
All checks were successful
ci/woodpecker/push/default Pipeline was successful
uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review)
Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 09:11:22 +00:00

2.4 KiB

ADR-0001: Uptime Kuma is intentionally lean — sizing & placement

Status

Accepted (2026-06-13)

Context

A review was prompted by a suspicion that Kuma was "scraping too much / causing unnecessary traffic," itself triggered by a socket.io login-timeout incident on the monitor-sync CronJobs. Measured state at review time:

  • 227 active monitors; 209 of them at 300s intervals; ~1 check/sec aggregate.
  • Datastore: the shared mysql.dbaas (MariaDB), ~77 MB, ~1 heartbeat write/sec, 30-day retention.
  • 122 [External] monitors (full public path) + ~105 internal.

The data did not support a load problem — Kuma is already lean. The login-timeout incident was a Kuma 2.x socket.io quirk (kuma's single Node event loop briefly stalling), fixed separately by wrapping login in a retry — not a load issue.

Decisions

  1. Keep Kuma as-is; do not reflexively cut monitors or intervals. Poll rate (~1/s) and DB footprint (77 MB) are modest.
  2. [External] monitors stay per-service (one per externally-reachable service), not a small canary set. Rejected cutting to ~6-10 canaries: although the Cloudflare → tunnel → Traefik path is shared infra that fails as a unit, per-service external probes also catch single-service external misconfig (one service's DNS / auth carve-out / route), which canaries miss. The ~35k Cloudflare requests/day this generates is accepted for that coverage.
  3. Datastore stays on the shared mysql.dbaas. Rejected moving to self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the single-instance MySQL it also helps monitor, including during that MySQL's 8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as low-impact for now.

Consequences

  • All three decisions are cheap to reverse; revisit if measured load on mysql.dbaas or Cloudflare ever becomes a real (not gut-feel) problem. This ADR exists mainly so that review isn't re-run from scratch.
  • Known gap: the internal monitor-sync creates/updates monitors but does not prune orphans (the external sync does). Internal monitors for deleted services linger and need periodic manual cleanup — e.g. the stale "Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted by hand on 2026-06-13. A scoped internal-prune (only deleting monitors the sync owns, never hand-made ones) is a possible future improvement.