infra/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md

# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement

## Status
Accepted (2026-06-13)

## Context
A review was prompted by a suspicion that Kuma was "scraping too much / causing
unnecessary traffic," itself triggered by a socket.io login-timeout incident on
the monitor-sync CronJobs. Measured state at review time:

- **227 active monitors**; 209 of them at 300s intervals; **~1 check/sec** aggregate.
- Datastore: the **shared `mysql.dbaas`** (MariaDB), **~77 MB**, ~1 heartbeat
  write/sec, 30-day retention.
- **122 `[External]` monitors** (full public path) + ~105 internal.

The data did **not** support a load problem — Kuma is already lean. The
login-timeout incident was a Kuma 2.x socket.io quirk (kuma's single Node event
loop briefly stalling), fixed separately by wrapping login in a retry — not a
load issue.

## Decisions
1. **Keep Kuma as-is; do not reflexively cut monitors or intervals.** Poll rate
   (~1/s) and DB footprint (77 MB) are modest.
2. **`[External]` monitors stay per-service** (one per externally-reachable
   service), **not** a small canary set. Rejected cutting to ~6-10 canaries:
   although the Cloudflare → tunnel → Traefik path is shared infra that fails as a
   unit, per-service external probes also catch *single-service* external
   misconfig (one service's DNS / auth carve-out / route), which canaries miss.
   The ~35k Cloudflare requests/day this generates is accepted for that coverage.
3. **Datastore stays on the shared `mysql.dbaas`.** Rejected moving to
   self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the
   single-instance MySQL it also helps monitor, including during that MySQL's
   8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as
   low-impact for now.

## Consequences
- All three decisions are **cheap to reverse**; revisit if measured load on
  `mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This
  ADR exists mainly so that review isn't re-run from scratch.
- **Known gap:** the *internal* monitor-sync creates/updates monitors but does
  **not** prune orphans (the external sync does). Internal monitors for deleted
  services linger and need periodic manual cleanup — e.g. the stale
  "Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted
  by hand on 2026-06-13. A *scoped* internal-prune (only deleting monitors the
  sync owns, never hand-made ones) is a possible future improvement.
uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review) Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> 2026-06-14 09:11:22 +00:00			`# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement`

			`## Status`
			`Accepted (2026-06-13)`

			`## Context`
			`A review was prompted by a suspicion that Kuma was "scraping too much / causing`
			`unnecessary traffic," itself triggered by a socket.io login-timeout incident on`
			`the monitor-sync CronJobs. Measured state at review time:`

			`- 227 active monitors; 209 of them at 300s intervals; ~1 check/sec aggregate.`
			- Datastore: the shared `mysql.dbaas` (MariaDB), ~77 MB, ~1 heartbeat
			`write/sec, 30-day retention.`
			- 122 `[External]` monitors (full public path) + ~105 internal.

			`The data did not support a load problem — Kuma is already lean. The`
			`login-timeout incident was a Kuma 2.x socket.io quirk (kuma's single Node event`
			`loop briefly stalling), fixed separately by wrapping login in a retry — not a`
			`load issue.`

			`## Decisions`
			`1. Keep Kuma as-is; do not reflexively cut monitors or intervals. Poll rate`
			`(~1/s) and DB footprint (77 MB) are modest.`
			2. `[External]` monitors stay per-service (one per externally-reachable
			`service), not a small canary set. Rejected cutting to ~6-10 canaries:`
			`although the Cloudflare → tunnel → Traefik path is shared infra that fails as a`
			`unit, per-service external probes also catch single-service external`
			`misconfig (one service's DNS / auth carve-out / route), which canaries miss.`
			`The ~35k Cloudflare requests/day this generates is accepted for that coverage.`
			3. Datastore stays on the shared `mysql.dbaas`. Rejected moving to
			`self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the`
			`single-instance MySQL it also helps monitor, including during that MySQL's`
			`8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as`
			`low-impact for now.`

			`## Consequences`
			`- All three decisions are cheap to reverse; revisit if measured load on`
			`mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This
			`ADR exists mainly so that review isn't re-run from scratch.`
			`- Known gap: the internal monitor-sync creates/updates monitors but does`
			`not prune orphans (the external sync does). Internal monitors for deleted`
			`services linger and need periodic manual cleanup — e.g. the stale`
			`"Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted`
			`by hand on 2026-06-13. A scoped internal-prune (only deleting monitors the`
			`sync owns, never hand-made ones) is a possible future improvement.`