From d31bbc9a187797362306840f9829f7f343b16ee8 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Wed, 15 Apr 2026 06:37:07 +0000
Subject: [PATCH] docs: update monitoring and backup docs for external monitors and per-db backups

- CLAUDE.md: document external monitoring (ExternalAccessDivergence alert, external-monitor-sync CronJob) and per-database backup/restore paths
- backup-dr.md: add per-db backup CronJobs to inventory table and daily timeline, update restore runbook references
- monitoring.md: add External Monitor Sync component and external monitoring architecture section

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .claude/CLAUDE.md               | 11 +++++++----
 docs/architecture/backup-dr.md  | 16 ++++++++++------
 docs/architecture/monitoring.md | 10 ++++++++--
 3 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index e852f193..28b91818 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -132,8 +132,9 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 ## Monitoring & Alerting
 - Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
 - Exclude completed CronJob pods from "pod not ready" alerts.
-- Every new service gets Prometheus scrape config + Uptime Kuma monitor.
-- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable.
+- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns).
+- **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min).
Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars`.
+- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Mailgun API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Mailserver on dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` for CrowdSec real-IP detection. Vault: `mailgun_api_key` in `secret/viktor` (probe), `brevo_api_key` in `secret/viktor` (relay).
 
 ## Storage & Backup Architecture
@@ -209,13 +210,15 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
 - `nfs-ssd/` — mirrors `/srv/nfs-ssd` on Proxmox (inotify change-tracked rsync)
 
 **App-level CronJobs** (write to Proxmox host NFS, synced to Synology via inotify):
-- MySQL (daily), PostgreSQL (daily), Vault (weekly), Vaultwarden (6h + integrity), Redis (weekly), etcd (weekly)
+- MySQL (daily full + per-db), PostgreSQL (daily full + per-db), Vault (weekly), Vaultwarden (6h + integrity), Redis (weekly), etcd (weekly)
+- **Per-database backups**: `postgresql-backup-per-db` (00:15, `pg_dump -Fc` → `/backup/per-db//`) and `mysql-backup-per-db` (00:45, `mysqldump` → `/backup/per-db//`). Enables single-database restore without affecting others.
 - **Convention**: New proxmox-lvm apps MUST add a backup CronJob writing to `/mnt/main/-backup/`
 
 **Restore paths**:
+- Single database: `pg_restore -d --clean --if-exists` (PG) or `zcat dump.sql.gz | mysql` (MySQL) from per-db backup
 - Accidental delete: `lvm-pvc-snapshot restore` (instant, 7 daily snapshots)
 - Older data: Browse `/mnt/backup/pvc-data////`, rsync back
-- Database: Restore from dump at `/srv/nfs/-backup/` or Synology `nfs/-backup/`
+- Database (full cluster): Restore from dump at `/srv/nfs/-backup/` or Synology `nfs/-backup/`
 - pfsense: Upload config.xml via web UI, or extract tar for custom scripts
 - Full disaster: Restore from Synology
 
diff --git a/docs/architecture/backup-dr.md b/docs/architecture/backup-dr.md
index b052cb9a..a98f4303 100644
--- a/docs/architecture/backup-dr.md
+++ b/docs/architecture/backup-dr.md
@@ -200,8 +200,10 @@ graph LR
 | NFS Change Tracker | Continuous (inotifywait) | PVE host: `nfs-change-tracker.service` | Logs changed NFS file paths to `/mnt/backup/.nfs-changes.log` |
 | pfSense Backup | Daily 05:00 + daily-backup | PVE host: SSH + API | config.xml + full filesystem tar |
 | Offsite Sync | Daily 06:00 (after daily-backup) | PVE host: `offsite-sync-backup` | Two-step: sda→pve-backup + NFS→nfs/nfs-ssd via inotify |
-| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases |
-| MySQL Backup | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump for all databases |
+| PostgreSQL Backup (full) | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases |
+| PostgreSQL Backup (per-db) | Daily 00:15, 14d retention | CronJob in `dbaas` namespace | pg_dump -Fc per database → `/backup/per-db//` |
+| MySQL Backup (full) | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump --all-databases |
+| MySQL Backup (per-db) | Daily 00:45, 14d retention | CronJob in `dbaas` namespace | mysqldump per database → `/backup/per-db//`
|
 | etcd Backup | Weekly Sunday 01:00, 30d | CronJob in `kube-system` | etcdctl snapshot |
 | Vaultwarden Backup | Every 6h, 30d retention | CronJob in `vaultwarden` | sqlite3 .backup + integrity |
 | Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot |
@@ -277,8 +279,10 @@ K8s CronJobs run inside the cluster, dumping database/state to NFS-exported back
 - Need point-in-time recovery for specific apps without full LVM rollback
 
 **Daily backups (00:00-00:30)**:
-- **PostgreSQL** (`pg_dumpall`): Dumps all databases to `/mnt/main/postgresql-backup/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`.
-- **MySQL** (`mysqldump`): Dumps all databases. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation.
+- **PostgreSQL full** (`pg_dumpall`, 00:00): Dumps all databases to `/mnt/main/postgresql-backup/dump_*.sql.gz`. 14-day rotation.
+- **PostgreSQL per-db** (`pg_dump -Fc`, 00:15): Dumps each database individually to `/mnt/main/postgresql-backup/per-db//dump_*.dump`. Enables single-database restore via `pg_restore -d --clean --if-exists`. 14-day rotation.
+- **MySQL full** (`mysqldump --all-databases`, 00:30): Dumps all databases to `/mnt/main/mysql-backup/dump_*.sql.gz`. 14-day rotation.
+- **MySQL per-db** (`mysqldump`, 00:45): Dumps each database individually to `/mnt/main/mysql-backup/per-db//dump_*.sql.gz`. Enables single-database restore. 14-day rotation.
 
 **Daily backups (Sunday 01:00-04:00)**:
 - **etcd**: `etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db`. 30-day retention. Critical for cluster recovery.
@@ -733,8 +737,8 @@ Detailed runbooks in `docs/runbooks/`:
 
 - **`restore-lvm-snapshot.md`** — Instant rollback of a PVC using LVM snapshot (RTO <5 min)
 - **`restore-pvc-from-backup.md`** — Restore a PVC from sda file backup (when snapshots expired)
-- **`restore-postgresql.md`** — Restore individual database or full cluster from pg_dumpall backup
-- **`restore-mysql.md`** — Restore MySQL databases from mysqldump backup
+- **`restore-postgresql.md`** — Restore individual database (from per-db `pg_dump -Fc`) or full cluster (from `pg_dumpall`)
+- **`restore-mysql.md`** — Restore individual database (from per-db `mysqldump`) or full cluster (from `mysqldump --all-databases`)
 - **`restore-vault.md`** — Restore Vault from raft snapshot
 - **`restore-vaultwarden.md`** — Restore password vault from sqlite3 backup
 - **`restore-etcd.md`** — Restore etcd cluster from snapshot
diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md
index 983bf3f2..7826887d 100644
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -59,7 +59,8 @@ graph TB
 | Grafana | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.)
|
 | Loki | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
 | Alertmanager | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Alert routing with cascade inhibitions |
-| Uptime Kuma | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Per-service HTTP monitors, status page |
+| Uptime Kuma | Latest (Diun monitored) | `stacks/uptime-kuma/` | Internal + external HTTP monitors, status page |
+| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
 | dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection |
 | Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
 
@@ -69,7 +70,12 @@ graph TB
 
 Prometheus scrapes metrics from all cluster components and applications using ServiceMonitor CRDs and scrape configs. Every new service deployed to the cluster receives:
 1. A Prometheus scrape configuration (via ServiceMonitor or static config)
-2. An Uptime Kuma HTTP monitor for health checks
+2. An Uptime Kuma HTTP monitor for internal health checks
+3. An external HTTP monitor (auto-created by `external-monitor-sync` for all Cloudflare-proxied services)
+
+### External Monitoring
+
+The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] ` monitors for every service in `cloudflare_proxied_names`. These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes an `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. Alert `ExternalAccessDivergence` fires after 15min of divergence.
 
 Data flows from targets through Prometheus storage to Grafana dashboards. Applications emit logs to stdout/stderr which are aggregated by Loki and queryable through Grafana's log viewer.
 
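
Reviewer note on the External Monitoring hunk: the divergence metric described there ("externally down but internally up") can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the status-page-pusher's actual code. `MonitorStatus`, `divergence_count`, and `render_metric` are hypothetical names, and treating the metric as a gauge is an assumption; only the metric name `external_internal_divergence_count` comes from the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorStatus:
    """Hypothetical snapshot of one service's internal/external monitor pair."""

    service: str
    internal_up: bool
    external_up: bool


def divergence_count(statuses: list[MonitorStatus]) -> int:
    """Count services that are up internally but down externally."""
    return sum(1 for s in statuses if s.internal_up and not s.external_up)


def render_metric(count: int) -> str:
    """Render the count in Prometheus text exposition format.

    The gauge type is an assumption; the patch only names the metric.
    """
    return (
        "# TYPE external_internal_divergence_count gauge\n"
        f"external_internal_divergence_count {count}\n"
    )


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        MonitorStatus("app-a", internal_up=True, external_up=True),
        MonitorStatus("app-b", internal_up=True, external_up=False),
        MonitorStatus("app-c", internal_up=False, external_up=False),
    ]
    print(render_metric(divergence_count(sample)), end="")
```

A service that is down on both paths is deliberately not counted: that case is an ordinary outage already covered by internal monitors, whereas the divergence count isolates failures somewhere on the DNS → Cloudflare → Tunnel → Traefik path.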