uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 05bec26d09
commit 6dc77f4612
2 changed files with 74 additions and 0 deletions
29
stacks/uptime-kuma/CONTEXT.md
Normal file
@@ -0,0 +1,29 @@
# Uptime Kuma — Context

Glossary for the uptime-kuma monitoring context. Terms only — no implementation
detail. Decisions live in `docs/adr/`.

## Glossary

**Active check (poll)** — Uptime Kuma actively probes a target on an interval
(HTTP / TCP / ping / DB). This is *polling*, not "scraping." Prometheus *scrapes*
exporters; Kuma *polls* targets. (Note: Prometheus does **not** scrape Kuma — a
separate monitoring lane.)

**Monitor** — one configured target plus its check definition.

**Internal monitor** — probes a service on its in-cluster address
(`*.svc.cluster.local`). Answers "is the service itself healthy?"

**`[External]` monitor** — probes a service via its full public path
(DNS → Cloudflare → cloudflared tunnel → Traefik). Answers "is the service
reachable the way users reach it?" Maintained one-per-externally-reachable-service
by deliberate choice (see ADR-0001).

**Heartbeat** — one recorded check result (up/down + latency), persisted to the
datastore.

**External-access divergence** — the condition where a service is healthy
*internally* but its `[External]` path is down — i.e. the shared
Cloudflare/tunnel/Traefik path is broken while the service itself is fine.
Surfaced by the `ExternalAccessDivergence` alert.
@@ -0,0 +1,45 @@
# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement

## Status
Accepted (2026-06-13)

## Context
A review was prompted by a suspicion that Kuma was "scraping too much / causing
unnecessary traffic," itself triggered by a socket.io login-timeout incident on
the monitor-sync CronJobs. Measured state at review time:

- **227 active monitors**; 209 of them at 300s intervals; **~1 check/sec** aggregate.
- Datastore: the **shared `mysql.dbaas`** (MariaDB), **~77 MB**, ~1 heartbeat
  write/sec, 30-day retention.
- **122 `[External]` monitors** (full public path) + ~105 internal.

The data did **not** support a load problem — Kuma is already lean. The
login-timeout incident was a Kuma 2.x socket.io quirk (Kuma's single Node event
loop briefly stalling), fixed separately by wrapping login in a retry — not a
load issue.
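
The retry fix mentioned above can be pictured with a minimal sketch. The actual monitor-sync code is not shown in this ADR, so `with_retry`, its parameters, and the `fn` callable standing in for the login call are all hypothetical names; the sketch only illustrates the general pattern of retrying a call that occasionally times out, with exponential backoff.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on a listed exception, wait and try again.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The final failure is re-raised so the calling job still fails loudly.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A sync job would wrap only the login step, for example `with_retry(lambda: client.login(user, password))`, where `client` is whatever Kuma API client the job already uses (an assumption here). Injecting `sleep` keeps the helper testable without real waits.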

## Decisions
1. **Keep Kuma as-is; do not reflexively cut monitors or intervals.** Poll rate
   (~1/s) and DB footprint (77 MB) are modest.
2. **`[External]` monitors stay per-service** (one per externally-reachable
   service), **not** a small canary set. Rejected cutting to ~6-10 canaries:
   although the Cloudflare → tunnel → Traefik path is shared infra that fails as a
   unit, per-service external probes also catch *single-service* external
   misconfig (one service's DNS / auth carve-out / route), which canaries miss.
   The ~35k Cloudflare requests/day this generates is accepted for that coverage.
3. **Datastore stays on the shared `mysql.dbaas`.** Rejected moving to
   self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the
   single-instance MySQL it also helps monitor, including during that MySQL's
   8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as
   low-impact for now.
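
The traffic figures quoted in these decisions can be sanity-checked with a little arithmetic. The sketch below assumes every `[External]` monitor polls at 300s and issues one request per check; the ADR itself only states that 209 of the 227 monitors run at 300s, so treat the result as an estimate. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def checks_per_second(monitors: int, interval_s: int) -> float:
    """Aggregate poll rate for `monitors` targets checked every `interval_s` seconds."""
    return monitors / interval_s


def requests_per_day(monitors: int, interval_s: int) -> int:
    """Daily request count, assuming one request per check."""
    return monitors * SECONDS_PER_DAY // interval_s


# The 209 monitors at 300s contribute about 0.70 checks/sec on their own.
# The intervals of the remaining 18 monitors are not stated in the ADR;
# they account for the rest of the ~1 check/sec aggregate.
rate_300s = checks_per_second(209, 300)

# 122 [External] monitors at an assumed 300s interval.
external_daily = requests_per_day(122, 300)

print(f"{rate_300s:.2f} checks/sec from the 300s monitors")
print(f"{external_daily} external requests/day")  # 35136, i.e. the ~35k quoted
```

Under those assumptions the ~35k requests/day figure follows directly from the monitor count, which also shows how a canary set of 6 to 10 monitors would scale it down.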

## Consequences
- All three decisions are **cheap to reverse**; revisit if measured load on
  `mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This
  ADR exists mainly so that review isn't re-run from scratch.
- **Known gap:** the *internal* monitor-sync creates/updates monitors but does
  **not** prune orphans (the external sync does). Internal monitors for deleted
  services linger and need periodic manual cleanup — e.g. the stale
  "Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted
  by hand on 2026-06-13. A *scoped* internal-prune (only deleting monitors the
  sync owns, never hand-made ones) is a possible future improvement.
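
One way to picture the scoped internal-prune is as a set difference restricted to monitors the sync created. This is a sketch only: the ADR does not say how ownership would be tracked, so the `SYNC_TAG` marker, the `Monitor` dataclass, and `select_orphans` are all hypothetical, and no Uptime Kuma API calls are shown.

```python
from dataclasses import dataclass, field

# Hypothetical ownership marker the internal sync would attach to its monitors.
SYNC_TAG = "managed-by:internal-sync"


@dataclass(frozen=True)
class Monitor:
    name: str
    tags: frozenset[str] = field(default_factory=frozenset)


def select_orphans(existing: list[Monitor], desired_names: set[str]) -> list[Monitor]:
    """Return sync-owned monitors whose target is no longer desired.

    Monitors without SYNC_TAG (hand-made ones) are never selected,
    even if their name is missing from desired_names.
    """
    return [
        m for m in existing
        if SYNC_TAG in m.tags and m.name not in desired_names
    ]


# Illustrative data: "Grafana" and "Hand-made probe" are made-up names.
existing = [
    Monitor("Grafana", frozenset({SYNC_TAG})),
    Monitor("Goldilocks (VPA)", frozenset({SYNC_TAG})),
    Monitor("Hand-made probe"),
]
orphans = select_orphans(existing, desired_names={"Grafana"})
print([m.name for m in orphans])
```

Keeping the selection separate from the deletion makes a dry-run mode straightforward: the job can log the orphan list first and delete only once the ownership marker has proven reliable.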