infra

Author	SHA1	Message	Date
Viktor Barzin	f6812fe69f	[uptime-kuma] Support per-ingress probe path annotation ## Context The `external-monitor-sync` CronJob probed `https://<host>/` for every `.viktorbarzin.me` ingress. Homepages frequently return 200 (or allow-listed 30x/40x) even when the backend or DB is broken, producing false-negatives — the forgejo outage on 2026-04-17 was not caught for this reason: `/` returned a login page while `/api/healthz` returned 503 from the DB probe. Manual monitor edits don't stick: the next sync is create-if-missing only, so a deleted monitor gets recreated pointing at `/` again. ## This change Teaches the sync three things: 1. Reads a new annotation* `uptime.viktorbarzin.me/external-monitor-path`. The annotation value is appended as the probe path; default `/` preserves today's behaviour for every ingress that hasn't opted in. 2. Tightens accepted status codes when an explicit path is set: `['200-299']` (strict — we expect a real healthz). The default `/` path keeps the existing lenient set `['200-299','300-399','400-499']` because homepages routinely 30x redirect or 40x on missing auth. 3. Updates existing monitors when the target URL or accepted status codes drift. Previously the loop was create-if-missing only, so annotating an already-monitored ingress had no effect until the monitor was deleted. Now re-running the sync after changing the annotation converges the live monitor. ## What is NOT in this change - No change to the Ingress annotations on any individual stack. Each service that wants a non-`/` probe path opts in separately. - No change to the ConfigMap fallback payload shape — legacy entries still get the lenient status codes. - Monitor DB state in Uptime Kuma's SQLite is untouched at plan time; the sync CronJob is what reconciles state on each run. ## Flow ``` ingress annotation CronJob Python ------------------ -------------- (none) --> url = https://host/ codes = lenient external-monitor-path --> url = https://host<path> codes = strict ['200-299'] ^^ "/api/healthz" https://host/api/healthz codes = ['200-299'] existing monitor + drifted target url --> api.edit_monitor(id, url=..., accepted_statuscodes=...) ``` ## Test Plan ### Automated - `terraform fmt -check -recursive stacks/uptime-kuma` — exit 0. - `scripts/tg plan` on `stacks/uptime-kuma` — `Plan: 0 to add, 1 to change, 0 to destroy`. The single in-place change is the CronJob command (Python heredoc re-rendered). No other resources drift. - Embedded Python compiles: extracted the `PYEOF` block and ran `python3 -m py_compile` — OK. ### Manual Verification 1. Annotate an ingress: `kubectl annotate ingress/<name> -n <ns> uptime.viktorbarzin.me/external-monitor-path=/api/healthz` 2. Trigger sync early: `kubectl -n uptime-kuma create job --from=cronjob/external-monitor-sync external-monitor-sync-manual` 3. Expected log line: `Updating monitor [External] <name>: https://host/ -> https://host/api/healthz (codes ['200-299','300-399','400-499'] -> ['200-299'])` 4. Inspect monitor in Uptime Kuma UI: URL and accepted status codes reflect the annotation. 5. Final summary line includes updated count: `Sync complete: 0 created, 1 updated, 0 deleted, N unchanged`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 22:06:23 +00:00
Viktor Barzin	498e7f3305	[uptime-kuma] Fix duplicate monitor creation + clean up down monitors ## Duplicate bug fix The external-monitor-sync deduped targets by hostname (`host in seen`) but multiple ingresses can share the same hostname. Changed to dedupe by final monitor name (`f"{PREFIX}{label}" in seen`) — prevents creating duplicate [External] monitors on every sync run. This caused 90 duplicates. ## Monitor cleanup Deleted 118 monitors total: - 90 duplicate [External] monitors (kept lower ID of each pair) - 14 paused internal monitors for decommissioned services - 14 external monitors for non-existent, scaled-down, or non-HTTP services (xray-vless, complaints, hermes-agent, etc.) ## Opt-outs Added `uptime.viktorbarzin.me/external-monitor=false` annotation to ingresses that shouldn't have external HTTP monitors: xray (non-HTTP protocol), council-complaints, hermes-agent, task-webhook, torrserver, www (no CF DNS). 329 monitors → ~210 monitors. Zero down monitors expected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 21:12:31 +00:00
Viktor Barzin	b034c868db	[traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping. Both plugins load without errors but never inject content. Removed: - rewrite-body plugin download (init container) and registration - strip-accept-encoding middleware (only existed for rewrite-body bug) - anti-ai-trap-links middleware (used rewrite-body for injection) - rybbit_site_id variable from ingress_factory and reverse_proxy factory - rybbit_site_id from 25 service stacks (39 instances) - Per-service rybbit-analytics middleware CRD resources Kept: - compress middleware (entrypoint-level, working correctly) - ai-bot-block middleware (ForwardAuth to bot-block-proxy) - anti-ai-headers middleware (X-Robots-Tag: noai, noimageai) - All CrowdSec, Authentik, rate-limit middleware unchanged Next: Cloudflare Workers with HTMLRewriter for edge-side injection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 12:41:17 +00:00
Viktor Barzin	366e2ab083	[uptime-kuma] Opt-out external monitoring for every public ingress [ci skip] ## Context After the previous commit migrated monitor discovery to per-ingress annotation (opt-in via `uptime.viktorbarzin.me/external-monitor=true`), coverage expanded from 13 → 26 monitors but still left ~99 public ingresses uncovered — notably Helm-managed services (authentik, grafana, vault, forgejo, ntfy) that don't go through `ingress_factory`, plus any `dns_type = "non-proxied"` ingress (Immich was a direct victim: `dns_type = "non-proxied"` → no annotation added → no monitor → invisible outage). The user's concern: "I should have known external Immich was down before users tried to open it." ## This change Flipped the semantic from opt-in to opt-out by default: - Every ingress whose host ends in `.viktorbarzin.me` gets a `[External] <label>` monitor automatically - Only ingresses with annotation `uptime.viktorbarzin.me/external-monitor=false` are skipped - Host dedup via a `seen` set (one monitor per hostname, regardless of how many Ingress resources share it) ## Verification Triggered a manual CronJob run post-apply: ``` Sync complete: 102 created, 1 deleted, 23 unchanged ``` Coverage jumped from 26 → ~124 external monitors. All 6 Helm-managed services now have dedicated monitors: - [External] immich, authentik, forgejo, grafana, ntfy, vault ## Scope Only `stacks/uptime-kuma/modules/uptime-kuma/main.tf` (Python script in the CronJob resource). No RBAC or service account changes — the ones added in the previous commit still cover this path. ## Test plan ### Automated \`\`\` \$ kubectl -n uptime-kuma logs -l job-name=manual-sync-optout-1776422993 --tail=50 \| grep -iE 'immich\|authentik\|grafana\|forgejo\|vault\|ntfy' Creating monitor: [External] authentik -> https://authentik.viktorbarzin.me Creating monitor: [External] forgejo -> https://forgejo.viktorbarzin.me Creating monitor: [External] immich -> https://immich.viktorbarzin.me Creating monitor: [External] grafana -> https://grafana.viktorbarzin.me Creating monitor: [External] ntfy -> https://ntfy.viktorbarzin.me Creating monitor: [External] vault -> https://vault.viktorbarzin.me \`\`\` ### Manual Verification 1. Open `https://uptime.viktorbarzin.me` → confirm `[External] immich` exists 2. Simulate an Immich outage (scale deploy to 0 briefly) → external monitor should go red within the probe interval (5min); internal monitor stays up (pod-level from a different probe angle) → `ExternalAccessDivergence` alert fires after 15 min Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 11:12:00 +00:00
Viktor Barzin	66d2d9916b	[infra] Per-ingress external-monitor annotation + actualbudget plan-time fix [ci skip] ## Context Two operational gaps surfaced during a healthcheck sweep today: 1. External monitoring coverage: Only ~13 hostnames (via `cloudflare_proxied_names` in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via `ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT registered for external probing — so outages like Immich going down externally were invisible until a user complained. 99 of ~125 public ingresses had no external monitor. 2. actualbudget stack unplannable: `count = var.budget_encryption_password != null ? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the value flows from a `data.kubernetes_secret` whose contents are `(known after apply)` at plan time. Blocked CI applies and drift reconciliation. ## This change ### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory) - New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string, nullable). Default is "follow dns_type" — enabled for any public DNS record (`dns_type != "none"`, covers both proxied and non-proxied so Immich and other direct-A records are also monitored). - Emits two annotations on the Ingress: - `uptime.viktorbarzin.me/external-monitor = "true"` - `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override) ### external-monitor-sync CronJob (uptime-kuma stack) - Discovers targets from live Ingress objects via the K8s API first (filter by annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any API error (zero rollout risk). - New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving `list`/`get` on `networking.k8s.io/ingresses`. - `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s) instead of `kubernetes.default.svc` — the search-domain expansion failed in the CronJob pod's DNS config. Verified working: CronJob now logs `Loaded N external monitor targets (source=k8s-api)`. ### actualbudget count-on-unknown refactor - Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at plan; no `-target` workaround needed. - Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is unchanged — the secret is still consumed via env var. - Also aligned the factory with live state (the 3 budget-* PVCs had been migrated `proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed `data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module removed. State was rm'd + re-imported with matching UIDs, so no data was moved. ## Rollout status (already partially applied in this session) - `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified - `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally - `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live - CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active (was 13 on the central list) ## Deferred (separate work) - 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory, rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade. `[ci skip]` here so those don't auto-apply; they will be fixed manually before the next CI push. - Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik, grafana, vault, forgejo) are annotated — separate PR. ## Test plan ### Automated \`\`\` \$ kubectl -n uptime-kuma logs \$(kubectl -n uptime-kuma get pods -l job-name -o name \| tail -1) Loaded 26 external monitor targets (source=k8s-api) Sync complete: 7 created, 0 deleted, 17 unchanged \$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \\ https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \\ https://budget-viktor.viktorbarzin.me/ 200 302 200 \$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor deployment.apps/budget-viktor 1/1 1 1 Ready persistentvolumeclaim/budget-viktor-data-encrypted Bound 10Gi RWO proxmox-lvm-encrypted \`\`\` ### Manual Verification 1. Confirm the annotation is present on an ingress_factory ingress: \`\`\` kubectl -n dawarich get ingress dawarich -o \\ jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}' # Expected: "true" \`\`\` 2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min (CronJob interval). For Immich specifically, it will appear after the immich stack is re-applied. 3. Verify actualbudget plan is clean: \`\`\` cd stacks/actualbudget && scripts/tg plan --non-interactive # Expected: no "Invalid count argument" errors \`\`\` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 10:34:32 +00:00
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf \| module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list \| cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	c69eba9b46	fix: increase Uptime Kuma API timeout and fix status code format - Increase socket timeout from 30s to 120s (121+ monitors need time to sync) - Add wait_events=0.2 for reliable login - Fix accepted_statuscodes format: use 100-increment ranges not arbitrary [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:28:18 +00:00
Viktor Barzin	ff360a8807	feat: add external monitoring for all Cloudflare-proxied services Add automatic external HTTPS monitors to Uptime Kuma for ~96 services exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from a Terraform-generated ConfigMap and creates/deletes [External] monitors to match cloudflare_proxied_names. Status page groups these separately as "External Reachability" and pushes a divergence metric to Pushgateway when services are externally down but internally up. Prometheus alert ExternalAccessDivergence fires after 15min of divergence. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:04:45 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	ee39dd2fc9	feat(storage): migrate 12 SQLite NFS PVCs to proxmox-lvm (Wave 1) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all SQLite-backed services. Deployments updated to use new block storage PVCs. Old NFS modules retained for 1-week rollback. Services: ntfy, freshrss, insta2spotify, actualbudget (x3), wealthfolio, navidrome (DB only), audiobookshelf config, headscale, forgejo, uptime-kuma. Also: set Recreate strategy on ntfy, forgejo, insta2spotify, wealthfolio (required for RWO volumes).	2026-04-04 16:26:59 +03:00
Viktor Barzin	73511b1230	extract remaining 19 modules from platform, complete stack split [ci skip] Phase 3: all 27 platform modules now run as independent stacks. Platform reduced to empty shell (outputs only) for backward compat with 72 app stacks that declare dependency "platform". Fixed technitium cross-module dashboard reference by copying file. Woodpecker pipeline applies all 27+1 stacks in parallel via loop. All applied with zero destroys.	2026-03-17 21:42:16 +00:00

12 commits