infra/.claude/skills/uptime-kuma/SKILL.md
Viktor Barzin 50e8184d99 [uptime-kuma] Codify MySQL monitor (id=663) via idempotent sync CronJob
## Context

Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via
the `uptime-kuma-api` Python library when the dbaas stack migrated from
InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in
Uptime Kuma's MariaDB — if that database were wiped or restored from an older
backup, the monitor would be lost.

## This change

Adds declarative, self-healing management for internal-service monitors
(databases, non-HTTP endpoints) that can't be discovered from ingress
annotations. Modelled on the existing `external-monitor-sync` CronJob.

- `local.internal_monitors` — list of desired monitors (name, type,
  connection string, Vault password key, interval, retries). Seeded with
  the MySQL Standalone monitor. Add new entries here to manage more.
- `kubernetes_secret.internal_monitor_sync` — pulls the admin password and
  all referenced DB passwords from Vault `secret/viktor` at apply time.
  Secret key names are derived from the monitor name
  (`DB_PASSWORD_<upper_snake>`).
- `kubernetes_config_map_v1.internal_monitor_targets` — renders the target
  list to JSON for the sync container.
- `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min,
  looks up monitors by name, creates if missing, patches if drifted, and
  leaves id and history untouched when already in desired state (reconcile
  loop sketched below).
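
A minimal sketch of the reconcile loop, not the shipped script; the targets
path, the field-to-kwarg mapping, and the log strings are assumptions (the
real job also injects `DB_PASSWORD_<upper_snake>` values from the mounted
secret):

```python
# Sketch only: illustrates the lookup-by-name / create / patch flow.
import json
import os

from uptime_kuma_api import UptimeKumaApi

api = UptimeKumaApi('http://uptime-kuma.uptime-kuma.svc.cluster.local')
api.login('admin', os.environ['UPTIME_KUMA_PASSWORD'])

with open('/config/targets.json') as f:   # rendered by the ConfigMap
    desired = json.load(f)
existing = {m['name']: m for m in api.get_monitors()}

for want in desired:
    have = existing.get(want['name'])
    if have is None:
        api.add_monitor(**want)           # create if missing
    else:
        drift = {k: v for k, v in want.items() if have.get(k) != v}
        if drift:
            api.edit_monitor(have['id'], **drift)   # patch only drifted fields
        else:
            print(f"Monitor {want['name']} (id={have['id']}) already in desired state")

print('Internal monitor sync complete')
api.disconnect()
```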

## Why this approach (Option B, not a Terraform provider)

The `louislam/uptime-kuma` Terraform provider does NOT exist in the public
registry (verified — only a CLI tool of the same name). Option A from the
task brief was therefore unavailable. Option B (idempotent K8s CronJob)
matches the established pattern in the same module for
`external-monitor-sync` — no new machinery introduced.

## Monitor 663: no-op on first sync

A Terraform import was not possible (no provider means no resource to import
into state). The sync job correctly identifies the existing monitor by name
and reports:

  Monitor MySQL Standalone (dbaas) (id=663) already in desired state
  Internal monitor sync complete

DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and
`Rows: 1` responses every 60s — no disruption.
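
The spot check behind that claim, roughly; a hypothetical query where the
host, credentials, database name, and heartbeat schema (`status=1` meaning
UP) are assumptions:

```python
import pymysql  # hypothetical verification against Uptime Kuma's MariaDB

conn = pymysql.connect(host="uptime-kuma-db.uptime-kuma.svc.cluster.local",
                       user="uptimekuma", password="...", database="kuma")
with conn.cursor() as cur:
    # Last few heartbeats for monitor 663; status=1 means UP (schema assumption).
    cur.execute("SELECT time, status, msg FROM heartbeat "
                "WHERE monitor_id = 663 ORDER BY time DESC LIMIT 5")
    for row in cur.fetchall():
        print(row)
```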

## Vault key — left manual (by design)

`secret/viktor` is not Terraform-managed anywhere in the repo (only read
via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding
135 keys. The `uptimekuma_db_password` key was added manually yesterday;
this change does NOT codify it. Codifying the whole `secret/viktor` entry
is out of scope for this task (would need a separate migration + rotation
story). The value is read from Vault at apply time — so if it is ever
rotated in Vault, a re-apply refreshes the secret and the next sync run
picks it up.

## Plan + apply

  Plan: 3 to add, 0 to change, 0 to destroy.
  Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
  Re-plan: No changes. Your infrastructure matches the configuration.

Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern.

Closes: code-ed2

---
name: uptime-kuma
description: 'Manage Uptime Kuma monitoring via the Python API. Use when: (1) User asks to add, remove, or list monitors, (2) User asks about service uptime or monitoring status, (3) User asks to check what''s being monitored, (4) User deploys a new service and needs monitoring added, (5) User mentions "uptime", "monitoring", "health check", or "uptime kuma". Uptime Kuma v2 running in Kubernetes, managed via uptime-kuma-api Python library.'
author: Claude Code
version: 1.0.0
date: 2026-02-14
---

# Uptime Kuma Monitoring Management

## Overview

- URL: https://uptime.viktorbarzin.me
- Internal: `uptime-kuma.uptime-kuma.svc.cluster.local:80`
- Image: `louislam/uptime-kuma:2`
- Storage: NFS at `/mnt/main/uptime-kuma` -> `/app/data`
- API Library: `uptime-kuma-api` (pip, available via `PYTHONPATH`)
- Credentials: admin / (from `UPTIME_KUMA_PASSWORD` env var)

## Python API Access

### Connection Pattern

```python
import os
from uptime_kuma_api import UptimeKumaApi, MonitorType

api = UptimeKumaApi('https://uptime.viktorbarzin.me')
api.login('admin', os.environ.get('UPTIME_KUMA_PASSWORD', ''))

# ... operations ...

api.disconnect()
```

### Execution

```bash
python3 -c "
import os
from uptime_kuma_api import UptimeKumaApi, MonitorType
api = UptimeKumaApi('https://uptime.viktorbarzin.me')
api.login('admin', os.environ.get('UPTIME_KUMA_PASSWORD', ''))
# ... your code ...
api.disconnect()
"
```

## Common Operations

### List All Monitors

```python
monitors = api.get_monitors()
for m in monitors:
    print(f'{m["id"]:3d} | {m["name"]:30s} | {m["type"]:15s} | interval={m["interval"]}s')
```

### Add HTTP Monitor

```python
api.add_monitor(
    type=MonitorType.HTTP,
    name="Service Name",
    url="http://service.namespace.svc.cluster.local",
    interval=120,
    maxretries=2,
)
```

### Add PING Monitor

```python
api.add_monitor(
    type=MonitorType.PING,
    name="Host Name",
    hostname="10.0.20.1",
    interval=30,
    maxretries=3,
)
```

### Add PORT Monitor

```python
api.add_monitor(
    type=MonitorType.PORT,
    name="Service Port",
    hostname="service.namespace.svc.cluster.local",
    port=8080,
    interval=120,
    maxretries=2,
)
```

### Edit Monitor

```python
api.edit_monitor(monitor_id, interval=120, maxretries=2)
```

### Delete Monitor

```python
api.delete_monitor(monitor_id)
```

### Pause/Resume Monitor

```python
api.pause_monitor(monitor_id)
api.resume_monitor(monitor_id)
```

## Monitor Types

- `MonitorType.HTTP` — HTTP(S) endpoint check
- `MonitorType.PING` — ICMP ping
- `MonitorType.PORT` — TCP port check
- `MonitorType.MYSQL` — MySQL/MariaDB connection (example below)
- `MonitorType.POSTGRES` — PostgreSQL connection
- `MonitorType.REDIS` — Redis connection
- `MonitorType.DNS` — DNS resolution check
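
A MySQL monitor like id 663 can be added the same way. A minimal sketch:
`databaseConnectionString` is the documented uptime-kuma-api kwarg, but the
exact MySQL fields should be checked against the library version in use, and
the host and credentials below are placeholders:

```python
# Sketch: MySQL monitor like "MySQL Standalone (dbaas)" (id=663).
# Connection string is a placeholder; verify the MySQL kwargs for your
# uptime-kuma-api version against Uptime Kuma v2.
api.add_monitor(
    type=MonitorType.MYSQL,
    name="MySQL Standalone (dbaas)",
    databaseConnectionString="mysql://monitor:<password>@mysql.dbaas.svc.cluster.local:3306/mysql",
    interval=60,
    maxretries=2,
)
```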

## Tiered Monitoring System

Monitors use tiered intervals to balance responsiveness with resource usage:

| Tier | Interval | Retries | Use For |
|------|----------|---------|---------|
| 1 - Critical | 30s | 3 | Core infra (DNS, gateway, ingress, NFS, K8s API, auth, mail) |
| 2 - Important | 120s | 2 | Actively used services (Nextcloud, Immich, Vaultwarden, etc.) |
| 3 - Standard | 300s | 1 | Auxiliary/optional services (blog, games, tools) |

### Tier Assignment Guidelines

- Tier 1: If it goes down, multiple other services fail or the cluster is unreachable
- Tier 2: User-facing services that are actively used daily
- Tier 3: Nice-to-have services, tools, dashboards

### When Adding a New Service

Match the tier to the service's DEFCON level from CLAUDE.md (helper sketch
after the list):

- DEFCON 1-2 → Tier 1 (30s)
- DEFCON 3-4 → Tier 2 (120s)
- DEFCON 5 → Tier 3 (300s)
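
A hypothetical helper encoding this mapping; the dict name and the example
service are illustrative:

```python
# Hypothetical: DEFCON level -> (interval seconds, max retries) per the tiers above.
TIER_BY_DEFCON = {1: (30, 3), 2: (30, 3), 3: (120, 2), 4: (120, 2), 5: (300, 1)}

interval, retries = TIER_BY_DEFCON[3]   # a DEFCON 3 service lands in Tier 2
api.add_monitor(
    type=MonitorType.HTTP,
    name="New Service",                 # illustrative
    url="http://new-service.apps.svc.cluster.local",
    interval=interval,
    maxretries=retries,
)
```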

## Internal Service URL Pattern

Most K8s services follow: `http://<service-name>.<namespace>.svc.cluster.local:<port>`

The common port is 80. Exceptions (see the helper after the list):

- Homepage: port 3000
- Ollama: port 11434
- Loki: port 3100 (use `/ready` endpoint)
- Traefik dashboard: port 8080 (use `/dashboard/` path)
- K8s API: `https://10.0.20.100:6443`
- Immich: port 2283 (use `/api/server/ping`)
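
A small helper following this convention (hypothetical, including the
example namespace):

```python
def internal_url(service: str, namespace: str, port: int = 80, path: str = "") -> str:
    # Standard cluster-internal URL; port and path cover the exceptions above.
    return f"http://{service}.{namespace}.svc.cluster.local:{port}{path}"

internal_url("loki", "monitoring", 3100, "/ready")  # namespace is illustrative
```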

## Notes

1. Uptime Kuma uses Socket.IO (WebSocket) for its API, not REST
2. The `uptime-kuma-api` Python library wraps Socket.IO
3. Add `time.sleep(0.3)` between bulk operations to avoid overloading (example below)
4. Homepage dashboard widget slug: `cluster-internal`
5. Cloudflare-proxied at uptime.viktorbarzin.me
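
For note 3, a throttled bulk edit looks like this (an illustrative retune of
every 60s monitor to Tier 2):

```python
import time

# Illustrative bulk operation: sleep 0.3s between Socket.IO calls (note 3).
for m in api.get_monitors():
    if m["interval"] == 60:
        api.edit_monitor(m["id"], interval=120)
        time.sleep(0.3)
```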

## Terraform-Managed Monitors

There is NO `louislam/uptime-kuma` Terraform provider. Two patterns exist for declarative monitor management in this stack:

- External HTTPS monitors — auto-discovered from ingress annotations by the `external-monitor-sync` CronJob (`*/10 * * * *`). Opt out via the `uptime.viktorbarzin.me/external-monitor: "false"` annotation on the ingress.
- Internal monitors (DBs, non-HTTP) — declared in the `local.internal_monitors` list in `stacks/uptime-kuma/modules/uptime-kuma/main.tf` and synced by the `internal-monitor-sync` CronJob. To add one, append to the list (provide `name`, `type`, `database_connection_string`, `database_password_vault_key`, `interval`, `retry_interval`, `max_retries`) and run `scripts/tg apply`. The sync is idempotent — it looks up by name, creates if missing, and patches if drifted. Existing monitors keep their id and history. A rendered target entry is sketched below.
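
For reference, one `local.internal_monitors` entry as rendered into the
ConfigMap JSON, shown as a Python literal; field names come from the list
above, while the connection string and interval values are illustrative:

```python
# Illustrative rendered target entry (values are placeholders).
{
    "name": "MySQL Standalone (dbaas)",
    "type": "mysql",
    "database_connection_string": "mysql://monitor@mysql.dbaas.svc.cluster.local:3306/mysql",
    "database_password_vault_key": "uptimekuma_db_password",
    "interval": 60,
    "retry_interval": 60,
    "max_retries": 2,
}
```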