infra

Author	SHA1	Message	Date
Viktor Barzin	3d43d96a5e	k8s-version-upgrade: switch detection cron from weekly to daily Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC, still outside kured's 02:00-06:00 London window). Concurrency is bounded by Forbid + deterministic job-name idempotency (the detection job exits early if a preflight Job for the same target already exists), so back-to-back days can't pile up parallel runs. - stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment - scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label to "(daily cron)" - .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	b107c0be8c	upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit Three autonomous-upgrade pipelines run independently — Keel for apps (hourly registry polling), unattended-upgrades+kured for OS, and the k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there was no single place to see whether each was healthy, what's pending, or whether anything's stuck. The /upgrade-state skill collapses the state of all three into one table you can run before each Sunday's k8s-version-check fires. - stacks/keel/main.tf: add Prometheus pod-annotation scrape on container port 9300. Surfaces pending_approvals, poll_trigger_tracked_images, and registries_scanned_total{image} so the skill has a real timeseries (also opens the door to a future "pending_approvals > 0 for 24h" alert). - scripts/upgrade_state.sh: collector + renderer. Three-row table (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2. SSH fan-out (parallel subshells) to all five nodes for apt state + reboot-required + uu log; Prometheus query for Keel; Pushgateway parse for k8s_upgrade_* gauges. Read-only. - .claude/skills/upgrade-state/SKILL.md: hardlinked to ~/.claude/skills/upgrade-state/SKILL.md so the skill is discoverable from both monorepo-rooted and global sessions. Verification: ran the script, stress-tested the ✗ stalled path by pushing in_flight=1 + started_timestamp=-100min to Pushgateway and resetting after — script correctly raised ✗ and exit 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	30eff178e9	healthcheck: probe uptime-kuma via internal Service (port-forward), not public URL The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO handshake the uptime-kuma-api library uses, and the library can't complete the OAuth flow, so every healthcheck reported "Connection failed" even though the pod was healthy and serving 225 monitors. Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the uptime-kuma namespace for the duration of the check, connect the library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the port-forward on the way out. The disown is to suppress bash's "Killed" job notification on stderr, which corrupted stdout when stderr was merged for JSON parsing. Verified end-to-end: healthcheck now reports the real signal — "external down(3): www, xray-vless, hermes-agent" — the same 3 Cloudflare-facing endpoints flagging in the uptime-kuma logs.	2026-05-22 14:16:44 +00:00
Viktor Barzin	9e5a5fb0c7	infra/scripts/tg: enforce ingress_factory auth-comment convention Every `tg plan/apply/destroy/refresh` now runs `scripts/check-ingress-auth-comments.py` against the current stack before invoking terragrunt. The check fails closed if any `auth = "app"` or `auth = "none"` line in the stack's .tf files lacks an immediately-preceding `# auth = "<tier>": ...` comment documenting what gates the app (for "app") or why the endpoint is intentionally public (for "none"). Why tg-level (not git pre-commit): tg is the universal entry point for all infra changes. CI runs it, headless agents run it, humans run it. A pre-commit hook only catches the human path. Wiring the check into tg means the anti-exposure guard fires regardless of who or what is invoking terragrunt. Stack-scoped: each stack documents itself the next time it's edited. The 30+ existing `auth = "none"` stacks that predate this guard are not blocked from operating today; they'll need the comment added the next time someone runs `tg plan` on them — at which point the gate forces a conscious "yes, this is intentional" moment before any state change can land. Skipped on: init, fmt, validate, output, etc. — anything that doesn't read or write infra state.	2026-05-22 14:16:44 +00:00
Viktor Barzin	a168277213	healthcheck: tune noise filters + nvidia-exporter auth=none Six tuning changes to cluster_healthcheck.sh so PASS sections actually reflect "nothing to act on": 1. prometheus_alerts: only count severity=warning\|critical. Info-level alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the alert rule itself sets severity; the script should respect it. 2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the Lets Encrypt wildcard renews weekly; <14d is the only window where human attention is genuinely useful. 3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/ event/image/update domains (transient by design), skip friendly names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and only count entities whose last_changed > 24h. Was 431/1470, most of which were "phone in standby" noise. 4. ha_automations: only flag DISABLED automations as abandoned if they've also been untouched (last_changed) for >180 days; raise stale threshold 30d → 180d. Was flagging seasonal/holiday-only automations as broken. 5. problematic_pods + evicted_pods: exclude pods owned by Jobs. CronJob retry leftovers (Error/Failed phase pods that K8s keeps around for log inspection) aren't problematic at the cluster level. 6. uptime_kuma: retry the WebSocket login 3x with backoff. Single- shot failures were a recurring false-positive even though the service was healthy. Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll /metrics and got 302'd to Authentik like the idrac/snmp ones did. Same fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	e75bcaf394	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-22 14:16:42 +00:00
Viktor Barzin	d2be0921e8	scripts: timeout rsync + sqlite calls in daily-backup Per-PVC rsync had no timeout, so any single hung PVC (e.g. on a corrupted snapshot or a sqlite held open by a writer) blocked the whole script until systemd's 4h TimeoutStartSec kicked in, leaving every later PVC silently unbacked. Today's run hung on mailserver/roundcubemail-enigma-encrypted at 05:09 and didn't recover — hence WeeklyBackupFailing alert. Now: - rsync per PVC: timeout 30 min, exit 124 logged separately - sqlite3 per database: timeout 5 min - /etc/pve rsync: timeout 5 min Each timed-out PVC bumps PVC_FAIL but the loop keeps moving. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	a2377a38df	scripts: cluster_healthcheck defaults to ~/.kube/config The previous default of $(pwd)/config required running the script from the infra/ directory or always passing --kubeconfig. From a parent shell or any other working directory, the lookup hit a non-existent file and kubectl returned a stale-token error, masking real check results. Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to $(pwd)/config for backwards compatibility. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	0d8e0ca6fc	backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr 30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount, which blocked the next run from completing — root cause of the WeeklyBackupStale alert going silent (the metric never reached its end-of-script push). Fixes: - TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting the wall during week 18 runs) - Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as belt-and-braces for any inherited stuck state from a prior crashed run - TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of the alert going blind on systemd kills - pfsense metric pushed in BOTH success and failure paths (was only on success; any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert threshold expired) Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to /srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end: 3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql image is stripped (no curl/wget/python) — switched to docker.io/library/postgres matching the dbaas/postgresql-backup pattern with apt-installed curl. Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed backup_weekly_last_success_timestamp but the script pushes daily_backup_last_run_timestamp). Updated to match what's actually emitted, and added a "default-covered" footnote to the Service Protection Matrix so the ~40 services with PVCs not enumerated in the table are no longer ambiguous. Manual PVE-host actions (out-of-band, not in TF): - unmounted 6 stacked snapshots from /tmp/pvc-mount - pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the loop got SIGTERMed against repeatedly, so prune kept failing) - created /srv/nfs/postiz-backup directory - triggered a one-shot daily-backup run with the new TimeoutStartSec to validate the fix end-to-end Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:39 +00:00
Viktor Barzin	ffa1d6d5dc	[woodpecker] Programmatic Forgejo repo registration Earlier I claimed the OAuth Web UI flow was the only way to onboard new Forgejo repos in Woodpecker. That's wrong. Two parts to the actual workaround: 1. Woodpecker session JWTs are HS256 signed with the user's per-user `hash` column from the PG `users` table (NOT the global agent secret). Mint a session JWT for the Forgejo viktor user (id=2, forge_id=2), and you're authenticated as that user. 2. POST /api/repos?forge_remote_id=N as viktor → Woodpecker calls Forgejo with viktor's stored OAuth access_token to create the webhook + per-repo signing key. Works. The 500 I saw earlier was from POST'ing as ViktorBarzin (GitHub admin), whose user row has no Forgejo OAuth token — Woodpecker's forge-API call fails for that user, surfacing as a 500. scripts/woodpecker-register-forgejo-repo.sh wraps the whole flow: extract hash from PG → mint JWT → activate repo. Verified against viktor/{broker-sync,claude-agent-service,freedify,hmrc-sync} in this session — all activated cleanly. Also updated the runbook with the actual mechanism + the WOODPECKER_FORGE_TIMEOUT=30s tip (the real root cause of the 'context deadline exceeded' failures, NOT the v3.14 upgrade).	2026-05-10 11:12:36 +00:00
Viktor Barzin	e86efd107a	[forgejo] Migration script: exclude empty repos, all-images full mode Updated to handle the actual situation: wealthfolio-sync and fire-planner have registry repos but no tags (broken/abandoned deployments). Skip those with a SKIP marker. Migrate everything else as a stop-gap until Woodpecker pipelines start producing Forgejo images on their own. The image list now covers all private images currently in scope.	2026-05-07 23:29:34 +00:00
Viktor Barzin	76d2d0e536	[forgejo] Add chrome-service-novnc:v4 to orphan-image migrator	2026-05-07 23:29:34 +00:00
Viktor Barzin	f793a5f50b	[forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry Stage 1 of moving private images off the registry:2 container at registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption 3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk — pods still pull from the existing registry until Phase 3. What changes: * Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi). Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive, v11 default-on). * ingress_factory: max_body_size variable was declared but never wired in after the nginx→Traefik migration. Now creates a per-ingress Buffering middleware when set; default null = no limit (preserves existing behavior). Forgejo ingress sets max_body_size=5g to allow multi-GB layer pushes. * Cluster-wide registry-credentials Secret: 4th auths entry for forgejo.viktorbarzin.me, populated from Vault secret/viktor/ forgejo_pull_token (cluster-puller PAT, read:package). Existing Kyverno ClusterPolicy syncs cluster-wide — no policy edits. * Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls). Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh for existing nodes. * Forgejo retention CronJob (0 4 * * ): keeps newest 10 versions per package + always :latest. First 7 days dry-run (DRY_RUN=true); flip the local in cleanup.tf after log review. Forgejo integrity probe CronJob (/15): same algorithm as the existing registry-integrity-probe. Existing Prometheus alerts (RegistryManifestIntegrityFailure et al) made instance-aware so they cover both registries during the bake. Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/. Operational note — the apply order is non-trivial because the new Vault keys (forgejo_pull_token, forgejo_cleanup_token, secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the kyverno + monitoring + forgejo stacks. The setup runbook documents the bootstrap sequence. Phase 1 (per-project dual-push pipelines) follows in subsequent commits. Bake clock starts when the last project goes dual-push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:33 +00:00
Viktor Barzin	4c8d12229f	mailserver: split healthcheck path off PROXY-aware listeners + book-search uses ClusterIP Two coordinated fixes for the same root cause: Postfix's smtpd_upstream_proxy_protocol listener fatals on every HAProxy health probe with `smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported for ai_socktype` — the daemon respawns get throttled by postfix master, and real client connections that land mid-respawn time out. We saw this as ~50% timeout rate on public 587 from inside the cluster. Layer 1 (book-search) — stacks/ebooks/main.tf: SMTP_HOST mail.viktorbarzin.me → mailserver.mailserver.svc.cluster.local Internal services should use ClusterIP, not hairpin through pfSense+HAProxy. 12/12 OK in <28ms vs ~6/12 timeouts on the public path. Layer 2 (pfSense HAProxy) — stacks/mailserver + scripts/pfsense-haproxy-bootstrap.php: Add 3 non-PROXY healthcheck NodePorts to mailserver-proxy svc: 30145 → pod 25 (stock postscreen) 30146 → pod 465 (stock smtps) 30147 → pod 587 (stock submission) HAProxy uses `port <healthcheck-nodeport>` (per-server in advanced field) to redirect L4 health probes to those ports while real client traffic keeps going to 30125-30128 with PROXY v2. Result: 0 fatals/min (was 96), 30/30 probes OK on 587, e2e roundtrip 20.4s. Inter dropped 120000 → 5000 since log-spam concern is gone. `option smtpchk EHLO` was tried first but flapped against postscreen (multi-line greet + DNSBL silence + anti-pre-greet detection trip HAProxy's parser → L7RSP). Plain TCP accept-on-port check is sufficient for both submission and postscreen. Updated docs/runbooks/mailserver-pfsense-haproxy.md to reflect the new healthcheck path and mark the "Known warts" entry as resolved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-05 19:45:33 +00:00
Viktor Barzin	4315ed5c2a	[backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count) cmd_prune_count's `log " Pruned: ..."` wrote to stdout, which the caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward (7d retention kicked in), pruned snapshots polluted the captured value with multi-line log text, breaking the Prometheus exposition format on the metric push (`lvm_snapshot_pruned_total ${pruned}` → 400 from Pushgateway). Snapshots themselves were always fine; only the metric push silently failed for ~9 nights, eventually triggering LVMSnapshotNeverRun (alert has 48h `for:`). Fix: redirect the inner log call to stderr so cmd_prune_count's stdout contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh` as the source-of-truth (was edited only on the PVE host) and updates backup-dr.md to point at the .sh and document the scp deploy. Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:30:58 +00:00
Viktor Barzin	3eb8b9a4ea	ci: add vault CLI to infra-ci image + surface real errors in scripts/tg The Woodpecker CI pipeline has been silently failing to apply Tier 1 stacks since the state-migration commit `e80b2f02` because the Alpine CI image never had the vault CLI. `scripts/tg` swallowed stderr with `2>/dev/null` and surfaced a misleading "Cannot read PG credentials from Vault" message — the real error was `sh: vault: not found`. Verified with an in-cluster probe: woodpecker/default SA + role=ci already gets the terraform-state policy and has read capability on database/static-creds/pg-terraform-state. Auth was never the problem; the vault binary just wasn't there. - ci/Dockerfile: pin vault v1.18.1 (matches server) and install - scripts/tg: pre-flight check + surface real vault output on failure - Next build-ci-image.yml run rebuilds :latest with vault included; subsequent default.yml runs unblock monitoring apply (code-aoxk) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-22 08:46:50 +00:00
Viktor Barzin	4bedabb9e8	healthcheck: fix three false-positive WARNs (HA token, cert-manager, LVM snap grep) - HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset. Default URL = https://ha-sofia.viktorbarzin.me. - cert-manager: add cert_manager_installed() probe (kubectl get crd certificates.cert-manager.io). When not installed — which is our current state — report PASS "N/A" instead of noisy WARN "CRDs unavailable". - LVM snapshot freshness: grep pattern was `-- -snap` but actual LV names use underscore (`foo_snap_YYYY...`), so the grep matched nothing and the check always WARN'd. Fixed to `grep _snap`. After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).	2026-04-19 22:13:32 +00:00
Viktor Barzin	a0d770d9a7	[cluster-health] Expand to 42 checks, remove pod CronJob path - scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager readiness/expiry/requests, backup freshness per-DB/offsite/LVM, monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS to 42, add --no-fix flag. - Remove the duplicate pod-version .claude/cluster-health.sh (1728 lines) and the openclaw cluster_healthcheck CronJob (local CLI is now the single authoritative runner). Keep the healthcheck SA + Role + RoleBinding — still reused by task_processor CronJob. - Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete the unused setup-monitoring.sh. - Rewrite .claude/skills/cluster-health/SKILL.md: mandates running the script first, refreshes the 42-check table, drops stale CronJob/Slack/post-mortem sections, documents the monorepo-canonical + hardlink layout. File is hardlinked to /home/wizard/code/.claude/skills/cluster-health/SKILL.md for dual discovery. - AGENTS.md + k8s-portal agent page: 25-check → 42-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:13:03 +00:00
Viktor Barzin	9806d515dd	[mailserver] Phase 4+5 — pfSense HAProxy cutover for all 4 mail ports [ci skip] ## Context (bd code-yiu) Cutover of external mail traffic from the MetalLB LB IP path (ETP:Local, pod-speaker colocation) to pfSense HAProxy + PROXY v2 (ETP:Cluster). Real client IP now preserved end-to-end on ports 25/465/587/993, both for postscreen anti-spam scoring and CrowdSec auth-failure bans. ## This change ### k8s (stacks/mailserver/modules/mailserver/main.tf) - `mailserver-user-patches` ConfigMap's `user-patches.sh` now appends 3 alt PROXY-speaking services to master.cf: - `:2525` postscreen (alt :25) - `:4465` smtpd (alt :465 SMTPS, wrappermode TLS) - `:5587` smtpd (alt :587 submission) All with `postscreen_upstream_proxy_protocol=haproxy` / `smtpd_upstream_proxy_protocol=haproxy`. Mirror stock submission/submissions options (SASL via Dovecot, TLS, client restrictions, mua_sender_restrictions). chroot=n so the SASL socket path `/dev/shm/sasl-auth.sock` resolves outside the chroot. - `dovecot.cf` ConfigMap adds: ``` haproxy_trusted_networks = 10.0.20.0/24 service imap-login { inet_listener imaps_proxy { port=10993; ssl=yes; haproxy=yes } } ``` Stock :993 stays PROXY-free for internal Roundcube/probe clients. - Container ports: 4 new (4465, 5587, 10993, 2525 already there). - `mailserver-proxy` NodePort Service now exposes all 4 ports: 25→2525→30125, 465→4465→30126, 587→5587→30127, 993→10993→30128 (ETP:Cluster). ### pfSense (scripts/pfsense-haproxy-bootstrap.php) Rebuilt to declare 4 backend pools (one per NodePort) and 4 production frontends on `10.0.20.1:{25,465,587,993}` TCP mode, plus the legacy `:2525` test frontend. All pools: `send-proxy-v2 check inter 120000`. Idempotent — re-runs converge on declared state. ### pfSense (scripts/pfsense-nat-mailserver-haproxy-{flip,unflip}.php) Flip script: updates `<nat><rule>` entries for mail ports from target `<mailserver>` alias (10.0.20.202 MetalLB) → `10.0.20.1` (pfSense HAProxy). Runs `filter_configure()` to rebuild pf rules. Unflip is the rollback. Both scripts are idempotent. ## What is NOT in this change - Phase 6 (decommission MetalLB LB path, downgrade mailserver Service from LoadBalancer to ClusterIP, free 10.0.20.202) — USER-GATED. Do NOT run until explicit approval. - Legacy MetalLB `mailserver` LB still live on 10.0.20.202 with stock ETP:Local ports — functional backup path + consumed by internal clients that hit `mailserver.mailserver.svc.cluster.local` (routes via ClusterIP layer of the LB Service, bypassing ETP). - Port :143 (plain IMAP) — no HAProxy frontend; stays on MetalLB via unchanged NAT rule. ## Test Plan ### Automated (verified pre-commit 2026-04-19) ``` # k8s container listens on all 8 ports $ kubectl exec -c docker-mailserver deployment/mailserver -n mailserver \ -- ss -ltn \| grep -E ':(25\|2525\|465\|4465\|587\|5587\|993\|10993)\b' ... all 8 listening ... # pfSense HAProxy listens on all 5 (production + legacy test) $ ssh admin@10.0.20.1 'sockstat -l \| grep haproxy' www haproxy 49418 5 tcp4 :25 www haproxy 49418 6 tcp4 :2525 www haproxy 49418 10 tcp4 :465 www haproxy 49418 11 tcp4 :587 www haproxy 49418 12 tcp4 :993 # Post-flip: pf rdr rules point at pfSense, not <mailserver> $ ssh admin@10.0.20.1 'pfctl -sn' \| grep 'smtp\\|sub\\|imap\\|:25' rdr on vtnet0 ... port = submission -> 10.0.20.1 rdr on vtnet0 ... port = imaps -> 10.0.20.1 rdr on vtnet0 ... port = smtps -> 10.0.20.1 rdr on vtnet0 ... port = 25 -> 10.0.20.1 # 4 HAProxy frontends reachable + SMTP/IMAP banners $ python3 <test script> → SMTP/SMTPS/Sub/IMAPS all respond correctly # Real client IP in maillog for external delivery via Brevo → MX postfix/smtpd-proxy25/postscreen: CONNECT from [77.32.148.26]:36334 to [10.0.20.1]:25 postfix/smtpd-proxy25/postscreen: PASS NEW [77.32.148.26]:36334 # E2E probe (Brevo HTTP → external SMTP delivery → IMAP fetch) succeeds $ kubectl create job --from=cronjob/email-roundtrip-monitor probe-yiu-flip -n mailserver ... Round-trip SUCCESS in 20.3s ... # Internal Roundcube path unchanged $ curl -sI https://mail.viktorbarzin.me/ → 302 (Authentik gate intact) # No mail alerts firing $ kubectl exec prometheus-server ... /api/v1/alerts \| grep Email → (empty) ``` ### Rollback ``` scp infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php admin@10.0.20.1:/tmp/ ssh admin@10.0.20.1 'php /tmp/pfsense-nat-mailserver-haproxy-unflip.php' ``` Immediate (<2s). Flips all 4 NAT rdrs back to `<mailserver>` alias. Pre-flip config snapshot also saved at `/tmp/config.xml.pre-yiu-flip.20260419-1222` on pfSense. ## Phase roadmap (bd code-yiu) \| Phase \| Status \| \|---\|---\| \| 1a \| ✅ commit `ef75c02f` — alt :2525 listener + NodePort \| \| 2 \| ✅ 2026-04-19 — HAProxy pkg installed on pfSense \| \| 3 \| ✅ commit `ba697b02` — HAProxy config persisted in pfSense XML \| \| 4+5\| ✅ this commit* — 4-port alt listeners + HAProxy frontends + NAT flip \| \| 6 \| ⏸ USER-GATED — MetalLB LB decommission after 48h observation \|	2026-04-19 12:24:50 +00:00
Viktor Barzin	ba697b02a2	[mailserver] Phase 2-3 — pfSense HAProxy bootstrap + runbook [ci skip] ## Context (bd code-yiu) Phase 2 (HAProxy on pfSense) and Phase 3 (persist config in pfSense XML so it lives in the nightly backup) of the PROXY-v2 migration. Test path only — listens on pfSense 10.0.20.1:2525 → k8s node NodePort :30125 → pod :2525 postscreen. Real client IP verified in maillog (`postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:...`), Phase 1a container plumbing is already live (commit `ef75c02f`). pfSense HAProxy config lives in `/cf/conf/config.xml` under `<installedpackages><haproxy>`. That file is captured daily by `scripts/daily-backup.sh` (scp → `/mnt/backup/pfsense/config-YYYYMMDD.xml`) and synced offsite to Synology. No new backup wiring needed — this commit documents the fact + adds the reproducer script. ## This change Two files, both additive: 1. `scripts/pfsense-haproxy-bootstrap.php` — idempotent PHP script that edits pfSense config.xml to add: - Backend pool `mailserver_nodes` with 4 k8s workers on NodePort 30125, `send-proxy-v2`, TCP health-check every 120000 ms (2 min). - Frontend `mailserver_proxy_test` listening on pfSense 10.0.20.1:2525 in TCP mode, forwarding to the pool. Uses `haproxy_check_and_run()` to regenerate `/var/etc/haproxy/haproxy.cfg` and reload HAProxy. Removes existing items with the same name before adding, so repeat runs converge on declared state. 2. `docs/runbooks/mailserver-pfsense-haproxy.md` — ops runbook covering current state, validation, bootstrap/restore, health checks, phase roadmap, and known warts (health-check noise + bind-address templating). ## What is NOT in this change - Phase 4 (NAT rdr flip for :25 from `<mailserver>` → HAProxy) — deferred. - Phase 5 (extend to 465/587/993 with alt listeners + Dovecot dual- inet_listener) — deferred. - Terraform for pfSense HAProxy pkg install — not possible (no Terraform provider for pfSense pkg management). Runbook documents the manual `pkg install` command. ## Test Plan ### Automated ``` $ ssh admin@10.0.20.1 'pgrep -lf haproxy; sockstat -l \| grep :2525' 64009 /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D www haproxy 64009 5 tcp4 :2525 :* $ ssh admin@10.0.20.1 "echo 'show servers state' \| socat /tmp/haproxy.socket stdio" \ \| awk 'NR>1 {print $4, $6}' node1 2 node2 2 node3 2 node4 2 # all UP $ python3 -c " import socket; s=socket.socket(); s.settimeout(10) s.connect(('10.0.20.1', 2525)) print(s.recv(200).decode()) s.send(b'EHLO persist-test.example.com\r\n') print(s.recv(500).decode()) s.send(b'QUIT\r\n'); s.close()" 220-mail.viktorbarzin.me ESMTP ... 250-mail.viktorbarzin.me 250-SIZE 209715200 ... 221 2.0.0 Bye $ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver --tail=50 \ \| grep smtpd-proxy.*CONNECT postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525 ``` Real client IP `[10.0.10.10]` visible (not the k8s-node IP after kube-proxy SNAT) → PROXY-v2 roundtrip confirmed. ### Manual Verification Trigger a pfSense reboot; after boot, HAProxy should auto-restart from the now-persisted config (`<enable>yes</enable>` in XML). Connection test above should still work. ## Reproduce locally 1. `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/` 2. `ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'` → rc=OK 3. `python3 -c '...' ` SMTP roundtrip test above.	2026-04-19 12:07:47 +00:00
Viktor Barzin	42f1c3cf4f	[claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP ## Context The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API for running Claude headless agents. Three workflows still SSH'd to the DevVM (10.0.10.10) to invoke `claude -p`. This eliminates that dependency. ## This change Pipeline migrations (SSH → HTTP POST to claude-agent-service): - `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation - `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON construction of TODO payloads - `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install - `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault secret/n8n) Documentation updates: - `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s - `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action - `AGENTS.md` — pipeline description updated ## What is NOT in this change - DevVM decommissioning (still hosts terminal/foolery services) - Removal of SSH key secrets from Vault (kept for rollback) - n8n workflow import (must be done manually in n8n UI) [ci skip] Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-18 10:12:02 +00:00
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
Viktor Barzin	f726d1c3fd	fix: stash local changes before git pull in CI pipelines DevVM may have unstaged changes from active sessions. Use git stash before pull to avoid 'cannot pull with rebase: unstaged changes' errors. Stash pop after to restore working state. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:37:10 +00:00
Viktor Barzin	de42acd68e	fix: backup LUKS rsync tolerance, stale mapping cleanup, tier-4-aux quota bump - daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS noload mounts — in-flight writes have corrupt metadata from skipped journal replay, but core data is intact - daily-backup: clean up stale LUKS dm mappings from previous crashed runs before attempting to open - daily-backup: capture rsync exit code safely with set -e (\|\| pattern) - kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%) - actualbudget: patched custom quota 5Gi→6Gi (was at 82%) Verified: backup now completes status=0 (96 PVCs OK, 0 failed) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:21:51 +00:00
Viktor Barzin	92495d0fc3	fix: start Claude from ~/code to load root CLAUDE.md correctly Both issue-automation and postmortem pipelines were cd'ing into ~/code/infra before running Claude, missing the root CLAUDE.md with beads config and project-wide instructions. Now cd to ~/code and use relative agent paths from there. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:15:32 +00:00
Viktor Barzin	9baefa22ab	fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape - technitium-password-sync: remove RWO encrypted PVC mount that caused pods to stick in ContainerCreating on wrong nodes. Plugin install now warns instead of failing when zip unavailable. - daily-backup: add LUKS decryption support for encrypted PVC snapshots using /root/.luks-backup-key. Uses noload mount option to skip ext4 journal replay. Also installed cryptsetup-bin on PVE host. - speedtest: disable prometheus.io/scrape annotation (no /prometheus endpoint exists, causing ScrapeTargetDown alert). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 15:12:32 +00:00
Viktor Barzin	36454b87d1	feat: CI/CD performance overhaul - New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4, git-crypt, sops, kubectl pre-installed. Pushed to private registry. Eliminates 17 apk add calls + binary downloads per pipeline run. - Unified CI pipeline: merge default.yml + app-stacks.yml into one. Changed-stacks-only detection (git diff, with global-file fallback). Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4). Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR). - Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale lock detection. Blocks concurrent applies to same stack. - TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev. - Daily drift detection pipeline (.woodpecker/drift-detection.yml). Runs terraform plan on all stacks, Slack alert on drift. - CI image build pipeline (.woodpecker/build-ci-image.yml). Expected speedup: ~5-10 min per pipeline run → ~2-4 min. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:22:26 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	24a23709a5	fix: update healthcheck to report internal and external monitors separately - Increase Uptime Kuma API timeout to 120s with wait_events=0.2 - Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var - Report internal and external monitor status separately - Install uptime-kuma-api in local venv [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:44:20 +00:00
Viktor Barzin	4498f61402	fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14] - scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with detailed comments explaining fsid=0 danger and NFSv3 disable rationale. Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra - scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts. Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running, no active exports. Warns but doesn't abort (block-storage PVC backups can still run). - .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory, fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:08:24 +00:00
Viktor Barzin	b1b408ff0e	fix: use full path to claude CLI for non-interactive SSH	2026-04-14 17:44:50 +00:00
Viktor Barzin	f2e7367401	fix: use sh instead of bash in pipeline (Alpine compat)	2026-04-14 17:29:14 +00:00
Viktor Barzin	c742fa3dfb	fix: scan all post-mortems for TODOs (no git diff needed)	2026-04-14 17:14:22 +00:00
Viktor Barzin	59367cc588	fix: handle Woodpecker shallow clone in postmortem pipeline	2026-04-14 17:12:02 +00:00
Viktor Barzin	8540f48a28	fix: move pipeline logic to shell script (avoid YAML quoting issues)	2026-04-14 16:46:42 +00:00
Viktor Barzin	8badb8181a	feat: post-mortem automation pipeline E2E workflow for incident post-mortems: 1. /post-mortem skill generates structured post-mortem markdown 2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes 3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor) 4. postmortem-todo-resolver agent implements TODOs headlessly 5. Agent updates post-mortem with Follow-up Implementation table Components: - .claude/skills/post-mortem/ — writer skill + template - .claude/agents/postmortem-todo-resolver.md — headless agent - .woodpecker/postmortem-todos.yml — CI pipeline - scripts/parse-postmortem-todos.sh — TODO extractor - cluster-health skill — auto-suggest post-mortem after recovery Safety: only auto-implements Alert/Config/Monitor types. Architecture/Migration/Investigation items are skipped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:34:42 +00:00
Viktor Barzin	82f674a0b4	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] Reflects the schedule change from weekly to daily. All references updated: - scripts/weekly-backup.{sh,timer,service} → daily-backup.* - Pushgateway job name: weekly-backup → daily-backup - Prometheus metric names: weekly_backup_* → daily_backup_* - All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory - offsite-sync dependency: After=daily-backup.service Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:37:04 +00:00
Viktor Barzin	ca5039f8aa	switch backup + offsite sync from weekly to daily — RPO 7d → 1d [ci skip] - weekly-backup.timer: Sun 05:00 → daily 05:00 - offsite-sync-backup.timer: Sun 08:00 → daily 06:00 - Monthly full rsync --delete unchanged (1st-7th of month) - Total daily I/O cost: ~20GB sdc reads, ~3.5GB sda writes, seconds of network - Updated script headers and service descriptions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:24:38 +00:00
Viktor Barzin	28ad11d12c	consolidate offsite backup: inotify change tracking, deduplicate Synology paths [ci skip] Architecture overhaul: - Synology truenas/ renamed to nfs/, immich paths flattened to match source - Created nfs-ssd/ on Synology for SSD data (thumbs, ML cache) - Deleted pve-backup/nfs-mirror (53GB duplication eliminated) - New inotifywait daemon (nfs-change-tracker.service) watches /srv/nfs + /srv/nfs-ssd - offsite-sync Step 2: reads inotify change log, rsync --files-from only changed files - weekly-backup: removed NFS mirror step entirely (NFS goes direct to Synology) - Cleaned 9 orphaned LVs (101GB + 38 snapshots reclaimed from thin pool) Performance: incremental sync completes in seconds (vs 30+ min with full rsync) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:06:20 +00:00
Viktor Barzin	aa4c125f9c	improve 3-2-1 backup: auto-discover dirs, Immich offsite sync, SQLite backup [ci skip] - weekly-backup.sh: replace hardcoded BACKUP_DIRS with glob auto-discovery (catches nextcloud-backup, council-complaints-backup, future dirs) - weekly-backup.sh: add auto SQLite backup from PVC snapshots (magic number check, ?mode=ro URI, fallback to raw copy) - offsite-sync-backup.sh: add NFS media direct-to-Synology sync (Immich, calibre, audiobookshelf — reuses existing TrueNAS Cloud Sync paths) - Cleaned up 9 orphaned LVs + 38 snapshots on PVE host (101GB reclaimed) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 15:47:56 +00:00
Viktor Barzin	38d51ab0af	deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip] - Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS - Update config.tfvars nfs_server to 192.168.1.127 (Proxmox) - Update nfs-csi StorageClass share to /srv/nfs - Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP - Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh) - Rewrite nfs-health.sh for Proxmox NFS monitoring - Update Freedify nfs_music_server default to Proxmox - Mark CloudSync monitor CronJob as deprecated - Update Prometheus alert summaries - Update all architecture docs, AGENTS.md, and reference docs - Zero PVs remain on TrueNAS — VM ready for decommission Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:42:07 +00:00
Viktor Barzin	d2af5339af	fix offsite sync: use --chmod for Synology permission compatibility Synology Administrator user can't create dirs with root-owned permissions from PVC snapshots. Switch from -az to -rltz --chmod to set writable permissions on destination. Also updated Cloud Sync Task 1 excludes to prevent duplication of backup dirs on Synology.	2026-04-06 16:01:42 +03:00
Viktor Barzin	9e2ac5fbb5	feat: add hardware exporter checks to cluster healthcheck (check #30 ) Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and tuya-bridge pods are running, plus checks Prometheus scrape targets (snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.	2026-04-06 14:58:46 +03:00
Viktor Barzin	d009f9a0f2	add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync - weekly-backup.sh: mounts LVM thin snapshots ro, rsyncs files to /mnt/backup/pvc-data with --link-dest versioning (4 weeks). Also mirrors NFS backup dirs from TrueNAS, backs up pfsense (config.xml + full tar), PVE host config, and prunes >7d snapshots. - offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk). Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency. - lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily) - Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale, OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.	2026-04-06 14:53:28 +03:00
Viktor Barzin	72d832fee7	add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs - Healthcheck: add entity availability, integration health, automation status, and system resources checks for Home Assistant Sofia - Docs: add backup-dr architecture documentation	2026-04-06 11:57:36 +03:00
Viktor Barzin	337da2184d	add upstream fallback to containerd registry mirrors When the pull-through proxy (10.0.20.10) is down, containerd now falls back to the official upstream registries (registry-1.docker.io, ghcr.io) instead of failing. Also cleans up stale disabled registry mirror dirs and removes unnecessary containerd restart from the rollout script.	2026-04-02 11:05:30 +03:00
Viktor Barzin	4e74f816bc	cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] Both services migrated to unified ebooks namespace. Remove: - Old stack directories and Terraform state - calibre references from monitoring namespace lists - calibre/audiobookshelf from operational scripts	2026-03-25 23:56:07 +02:00
Viktor Barzin	77143dfd6b	state: per-stack Transit keys for namespace-owner access control - Each stack gets its own Vault Transit key (transit/keys/sops-state-<stack>) - state-sync passes per-stack Transit URI + age keys on encrypt - Vault policies scope namespace-owners to their stacks only: - sops-admin: wildcard access to all transit keys - sops-user-<name>: access only to owned stack keys - Anca (plotting-book) can only decrypt plotting-book state - Admin can decrypt everything (via admin Transit policy or age fallback) - External group sops-plotting-book maps Authentik group to Vault policy - Updated CLAUDE.md with state sync documentation	2026-03-17 23:08:18 +00:00
Viktor Barzin	4e7ca1ad61	state: add Vault Transit as primary SOPS backend, age as fallback - .sops.yaml: add hc_vault_transit_uri for transit/keys/sops-state - state-sync: try Vault Transit first, fall back to age key on disk - Re-encrypted all 101 state files with both Vault Transit + age - Normal workflow: vault login → decrypt via Transit (no key files) - Bootstrap/DR: age key at ~/.config/sops/age/keys.txt	2026-03-17 22:56:33 +00:00
Viktor Barzin	b6faa24349	state: add SOPS-encrypted terraform state to git - SOPS + age encrypts all 101 .tfstate files (JSON-aware: keys visible, values encrypted) - scripts/state-sync: encrypt/decrypt/commit wrapper - scripts/tg: auto-decrypt before ops, auto-encrypt+commit after apply/destroy - terragrunt.hcl: -backup=- prevents backup file accumulation - .gitignore: track .tfstate.enc, ignore plaintext .tfstate - Cleaned 964MB of stale backups (state/backups/, .backup files)	2026-03-17 22:37:56 +00:00

1 2

75 commits