Updated to match the actual registry state: wealthfolio-sync and
fire-planner have registry repos but no tags (broken/abandoned
deployments). Skip those with a SKIP marker. Migrate everything
else as a stop-gap until Woodpecker pipelines start producing
Forgejo images on their own.
The image list now covers all private images currently in scope.
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.
What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
  Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH settings
  (defensive; packages default to enabled in v11).
* ingress_factory: max_body_size variable was declared but never wired
in after the nginx→Traefik migration. Now creates a per-ingress
Buffering middleware when set; default null = no limit (preserves
existing behavior). Forgejo ingress sets max_body_size=5g to allow
multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
  forgejo.viktorbarzin.me, populated from Vault secret/viktor/
  forgejo_pull_token (cluster-puller PAT, read:package). The existing
  Kyverno ClusterPolicy already syncs the secret cluster-wide; no policy edits needed.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
  Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
  Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
  for existing nodes (see the sketches after this list).
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
package + always :latest. First 7 days dry-run (DRY_RUN=true);
flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
existing registry-integrity-probe. Existing Prometheus alerts
(RegistryManifestIntegrityFailure et al) made instance-aware so
they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
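Illustrative shapes of three of the pieces above: the per-ingress Buffering
middleware, the new auths entry in registry-credentials, and the node-side
containerd redirect. Anything not spelled out in the bullets (the Traefik CRD
API group, Vault field names, TLS/SNI handling toward the LB) is an assumption;
the authoritative versions live in Terraform (ingress_factory and the
kyverno/forgejo stacks) and in the setup script.
```
# 1. Per-ingress Buffering middleware (shape only; the real resource is
#    rendered by ingress_factory, and the traefik.io API group is assumed):
kubectl apply -f - <<'EOF'
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: forgejo-buffering
  namespace: forgejo
spec:
  buffering:
    maxRequestBodyBytes: 5368709120   # 5g, for multi-GB layer pushes
EOF

# 2. The 4th auths entry in the cluster-wide registry-credentials Secret
#    (username and token are placeholders; real values come from
#    secret/viktor/forgejo_pull_token):
auth=$(printf '%s:%s' cluster-puller "$FORGEJO_PULL_TOKEN" | base64 -w0)
#   .dockerconfigjson → "auths": { ..., "forgejo.viktorbarzin.me": { "auth": "<auth>" } }

# 3. Node-side containerd redirect (cloud-init / the setup script own this;
#    treat it as the shape only):
sudo mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me
sudo tee /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml <<'EOF'
server = "https://forgejo.viktorbarzin.me"

[host."https://10.0.20.200"]
  capabilities = ["pull", "resolve"]
EOF
```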
Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
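A minimal sketch of that order (the Vault field names are assumptions; the
secret/ci/global/forgejo_* entries stay abbreviated as above):
```
vault kv put secret/viktor/forgejo_pull_token token=<cluster-puller PAT>
vault kv put secret/viktor/forgejo_cleanup_token token=<cleanup PAT>
# ...plus the secret/ci/global/forgejo_* entries...
for stack in kyverno monitoring forgejo; do
  (cd stacks/$stack && terragrunt apply)
done
```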
Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two coordinated fixes for the same root cause: Postfix's smtpd_upstream_proxy_protocol
listener fatals on every HAProxy health probe with `smtpd_peer_hostaddr_to_sockaddr:
... Servname not supported for ai_socktype` — the daemon respawns get throttled by
postfix master, and real client connections that land mid-respawn time out. We saw
this as ~50% timeout rate on public 587 from inside the cluster.
Layer 1 (book-search) — stacks/ebooks/main.tf:
SMTP_HOST mail.viktorbarzin.me → mailserver.mailserver.svc.cluster.local
Internal services should use ClusterIP, not hairpin through pfSense+HAProxy.
12/12 OK in <28ms vs ~6/12 timeouts on the public path.
Layer 2 (pfSense HAProxy) — stacks/mailserver + scripts/pfsense-haproxy-bootstrap.php:
Add 3 non-PROXY healthcheck NodePorts to mailserver-proxy svc:
30145 → pod 25 (stock postscreen)
30146 → pod 465 (stock smtps)
30147 → pod 587 (stock submission)
HAProxy uses `port <healthcheck-nodeport>` (per-server in advanced field) to
redirect L4 health probes to those ports while real client traffic keeps
going to 30125-30128 with PROXY v2.
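Illustrative shape of a rendered server line (the authoritative config is
generated by pfSense from config.xml; the server name, worker IP, and traffic
NodePort are placeholders):
```
server node1 <worker-ip>:<traffic-nodeport> send-proxy-v2 check port 30147 inter 5000
```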
Result: 0 fatals/min (was 96), 30/30 probes OK on 587, e2e roundtrip 20.4s.
The health-check interval (inter) dropped 120000 ms → 5000 ms now that the
log-spam concern is gone.
`option smtpchk EHLO` was tried first but flapped against postscreen (multi-line
greet + DNSBL silence + anti-pre-greet detection trip HAProxy's parser → L7RSP).
Plain TCP accept-on-port check is sufficient for both submission and postscreen.
Updated docs/runbooks/mailserver-pfsense-haproxy.md to reflect the new healthcheck
path and mark the "Known warts" entry as resolved.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cmd_prune_count's `log "  Pruned: ..."` wrote to stdout, which the
caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward
(once the 7d retention kicked in and prunes actually started happening),
the log lines for pruned snapshots polluted the captured value with
multi-line text, breaking the Prometheus exposition format on the metric
push (`lvm_snapshot_pruned_total ${pruned}` → 400 from Pushgateway).
Snapshots themselves were always fine; only the metric push silently
failed for ~9 nights, eventually triggering LVMSnapshotNeverRun (the
alert has a 48h `for:`).
Fix: redirect the inner log call to stderr so cmd_prune_count's stdout
contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh`
as the source-of-truth (was edited only on the PVE host) and updates
backup-dr.md to point at the .sh and document the scp deploy.
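A minimal sketch of the fix (function and variable names follow the text
above; the prune helper and log signature are simplified assumptions):
```
cmd_prune_count() {
  local pruned=0
  for lv in $(list_stale_snapshots); do   # helper name is hypothetical
    lvremove -f "$lv"
    log "  Pruned: $lv" >&2               # to stderr, so stdout stays machine-readable
    pruned=$((pruned + 1))
  done
  echo "$pruned"                           # the only stdout: the count
}

pruned=$(cmd_prune_count)
printf 'lvm_snapshot_pruned_total %s\n' "$pruned"   # clean exposition line for Pushgateway
```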
Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Woodpecker CI pipeline has been silently failing to apply Tier 1
stacks since the state-migration commit e80b2f02 because the Alpine
CI image never had the vault CLI. `scripts/tg` swallowed stderr with
`2>/dev/null` and surfaced a misleading "Cannot read PG credentials
from Vault" message — the real error was `sh: vault: not found`.
Verified with an in-cluster probe: woodpecker/default SA + role=ci
already gets the terraform-state policy and has read capability on
database/static-creds/pg-terraform-state. Auth was never the problem;
the vault binary just wasn't there.
- ci/Dockerfile: pin and install vault v1.18.1 (matches the server version)
- scripts/tg: pre-flight check + surface real vault output on failure (sketch below)
- Next build-ci-image.yml run rebuilds :latest with vault included;
subsequent default.yml runs unblock monitoring apply (code-aoxk)
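A sketch of the pre-flight check (the real scripts/tg wording may differ):
```
if ! command -v vault >/dev/null 2>&1; then
  echo "ERROR: vault CLI not found in PATH; cannot read PG credentials from Vault" >&2
  exit 1
fi
# ...and when the credential read fails, print vault's own stderr instead of
# hiding it behind 2>/dev/null.
```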
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when
  HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset (sketch below). Default
  URL = https://ha-sofia.viktorbarzin.me.
- cert-manager: add cert_manager_installed() probe (kubectl get crd
certificates.cert-manager.io). When not installed — which is our current
state — report PASS "N/A" instead of noisy WARN "CRDs unavailable".
- LVM snapshot freshness: the grep pattern was `-snap` (run as `grep -- -snap`),
  but actual LV names use an underscore (`foo_snap_YYYY...`), so the grep matched
  nothing and the check always WARNed. Fixed to `grep _snap`.
After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real
HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).
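Minimal sketches of the first two fixes (the Vault field name is an
assumption; the CRD probe is exactly as described above):
```
: "${HOME_ASSISTANT_SOFIA_URL:=https://ha-sofia.viktorbarzin.me}"
if [ -z "${HOME_ASSISTANT_SOFIA_TOKEN:-}" ]; then
  HOME_ASSISTANT_SOFIA_TOKEN=$(vault kv get -field=token secret/viktor/haos_api_token)
fi

cert_manager_installed() {
  kubectl get crd certificates.cert-manager.io >/dev/null 2>&1
}
```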
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
+authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
lines) and the openclaw cluster_healthcheck CronJob (local CLI is
now the single authoritative runner). Keep the healthcheck SA +
Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
the script first, refreshes the 42-check table, drops stale
CronJob/Slack/post-mortem sections, documents the monorepo-canonical
+ hardlink layout. File is hardlinked to
/home/wizard/code/.claude/skills/cluster-health/SKILL.md for
dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context (bd code-yiu)
Phase 2 (HAProxy on pfSense) and Phase 3 (persist config in pfSense XML so
it lives in the nightly backup) of the PROXY-v2 migration. Test path only —
listens on pfSense 10.0.20.1:2525 → k8s node NodePort :30125 → pod :2525
postscreen. Real client IP verified in maillog
(`postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:...`); Phase 1a
container plumbing is already live (commit ef75c02f).
pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is captured daily by
`scripts/daily-backup.sh` (scp → `/mnt/backup/pfsense/config-YYYYMMDD.xml`)
and synced offsite to Synology. No new backup wiring needed — this commit
documents the fact + adds the reproducer script.
## This change
Two files, both additive:
1. `scripts/pfsense-haproxy-bootstrap.php` — idempotent PHP script that
edits pfSense config.xml to add:
- Backend pool `mailserver_nodes` with 4 k8s workers on NodePort 30125,
`send-proxy-v2`, TCP health-check every 120000 ms (2 min).
- Frontend `mailserver_proxy_test` listening on pfSense 10.0.20.1:2525
in TCP mode, forwarding to the pool.
Uses `haproxy_check_and_run()` to regenerate `/var/etc/haproxy/haproxy.cfg`
and reload HAProxy. Removes existing items with the same name before
adding, so repeat runs converge on declared state.
2. `docs/runbooks/mailserver-pfsense-haproxy.md` — ops runbook covering
current state, validation, bootstrap/restore, health checks, phase
roadmap, and known warts (health-check noise + bind-address templating).
## What is NOT in this change
- Phase 4 (NAT rdr flip for :25 from `<mailserver>` → HAProxy) — deferred.
- Phase 5 (extend to 465/587/993 with alt listeners + Dovecot dual-
inet_listener) — deferred.
- Terraform for pfSense HAProxy pkg install — not possible (no Terraform
provider for pfSense pkg management). Runbook documents the manual
`pkg install` command.
## Test Plan
### Automated
```
$ ssh admin@10.0.20.1 'pgrep -lf haproxy; sockstat -l | grep :2525'
64009 /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D
www haproxy 64009 5 tcp4 *:2525 *:*
$ ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
| awk 'NR>1 {print $4, $6}'
node1 2
node2 2
node3 2
node4 2 # all UP
$ python3 -c "
import socket; s=socket.socket(); s.settimeout(10)
s.connect(('10.0.20.1', 2525))
print(s.recv(200).decode())
s.send(b'EHLO persist-test.example.com\r\n')
print(s.recv(500).decode())
s.send(b'QUIT\r\n'); s.close()"
220-mail.viktorbarzin.me ESMTP
...
250-mail.viktorbarzin.me
250-SIZE 209715200
...
221 2.0.0 Bye
$ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver --tail=50 \
  | grep 'smtpd-proxy.*CONNECT'
postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525
```
Real client IP `[10.0.10.10]` visible (not the k8s-node IP after kube-proxy
SNAT) → PROXY-v2 roundtrip confirmed.
### Manual Verification
Trigger a pfSense reboot; after boot, HAProxy should auto-restart from the
now-persisted config (`<enable>yes</enable>` in XML). Connection test above
should still work.
## Reproduce locally
1. `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/`
2. `ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'` → rc=OK
3. `python3 -c '...' ` SMTP roundtrip test above.
## Context
The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.
## This change
Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
  of SSH key; curl POST /execute + poll /jobs/{id} replaces the SSH invocation
  (sketched after this list)
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
secret/n8n)
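A sketch of the execute-and-poll pattern shared by all three (endpoint paths
are from the bullets above; the service URL, token path, and response fields
are assumptions):
```
TOKEN=$(vault kv get -field=api_token secret/ci/claude-agent)   # illustrative path
JOB=$(curl -fsS -X POST "$CLAUDE_AGENT_URL/execute" \
  -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' \
  -d "$(jq -n --arg p "$PROMPT" '{prompt: $p}')" | jq -r '.id')
until curl -fsS -H "Authorization: Bearer $TOKEN" "$CLAUDE_AGENT_URL/jobs/$JOB" \
      | jq -e '.status == "completed"' >/dev/null; do
  sleep 10
done
```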
Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated
## What is NOT in this change
- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)
[ci skip]
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks (sketch below)
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
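A sketch of the Tier 1 branch in scripts/tg (the static-creds path matches
the pg-terraform-state role above; variable names and the tier-detection
mechanics are simplified):
```
TIER0="infra platform cnpg vault dbaas external-secrets"
case " $TIER0 " in
  *" $STACK "*) : ;;   # Tier 0: keep local state + SOPS
  *)
    creds=$(vault read -format=json database/static-creds/pg-terraform-state)
    export PGUSER=$(echo "$creds" | jq -r '.data.username')
    export PGPASSWORD=$(echo "$creds" | jq -r '.data.password')
    export PG_CONN_STR="postgres://$PGUSER:$PGPASSWORD@10.0.20.200:5432/terraform_state"
    ;;
esac
```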
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DevVM may have unstaged changes from active sessions. Use git stash
before pull to avoid 'cannot pull with rebase: unstaged changes' errors.
Stash pop after to restore working state.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
noload mounts — in-flight writes have corrupt metadata from skipped
journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern; sketch below)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)
Verified: backup now completes status=0 (96 PVCs OK, 0 failed)
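Sketch of the exit-code handling under set -e (exit 23 = partial transfer;
paths and counters are placeholders):
```
rc=0
rsync -a "$SNAP_MOUNT/" "$DEST/" || rc=$?     # || keeps set -e from aborting
if [ "$rc" -ne 0 ] && [ "$rc" -ne 23 ]; then
  echo "rsync failed for $pvc (rc=$rc)" >&2
  failed=$((failed + 1))
fi
```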
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both issue-automation and postmortem pipelines were cd'ing into
~/code/infra before running Claude, missing the root CLAUDE.md
with beads config and project-wide instructions. Now cd to ~/code
and use relative agent paths from there.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- technitium-password-sync: remove RWO encrypted PVC mount that caused
pods to stick in ContainerCreating on wrong nodes. Plugin install now
warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
using /root/.luks-backup-key. Uses noload mount option to skip ext4
journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
endpoint exists, causing ScrapeTargetDown alert).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4,
git-crypt, sops, kubectl pre-installed. Pushed to private registry.
Eliminates 17 apk add calls + binary downloads per pipeline run.
- Unified CI pipeline: merge default.yml + app-stacks.yml into one.
Changed-stacks-only detection (git diff, with global-file fallback).
Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4).
Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR).
- Per-stack Vault advisory locks in scripts/tg: 30min TTL with stale-lock
  detection; blocks concurrent applies to the same stack (sketch below).
- TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev.
- Daily drift detection pipeline (.woodpecker/drift-detection.yml).
Runs terraform plan on all stacks, Slack alert on drift.
- CI image build pipeline (.woodpecker/build-ci-image.yml).
Expected speedup: ~5-10 min per pipeline run → ~2-4 min.
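A sketch of the advisory lock (KV path and field names are assumptions; the
30-minute TTL and stale-lock handling follow the bullet above):
```
LOCK_PATH="secret/ci/locks/$STACK"            # illustrative KV path
now=$(date +%s)
held=$(vault kv get -field=acquired_at "$LOCK_PATH" 2>/dev/null || true)
if [ -n "$held" ] && [ $((now - held)) -lt 1800 ]; then
  echo "stack $STACK is locked by another apply; aborting" >&2
  exit 1
fi
vault kv put "$LOCK_PATH" acquired_at="$now" holder="$(hostname)" >/dev/null
trap 'vault kv delete "$LOCK_PATH" >/dev/null' EXIT
```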
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Synology's Administrator user can't recreate root-owned dirs and permissions
from PVC snapshots. Switch rsync from -az to -rltz --chmod so the destination
gets writable permissions. Also updated Cloud Sync Task 1 excludes
to prevent duplicating backup dirs on Synology.
Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and
tuya-bridge pods are running, plus checks Prometheus scrape targets
(snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.
- Healthcheck: add entity availability, integration health, automation
status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
When the pull-through proxy (10.0.20.10) is down, containerd now falls
back to the official upstream registries (registry-1.docker.io, ghcr.io)
instead of failing. Also cleans up stale disabled registry mirror dirs
and removes unnecessary containerd restart from the rollout script.
Both services migrated to unified ebooks namespace. Remove:
- Old stack directories and Terraform state
- calibre references from monitoring namespace lists
- calibre/audiobookshelf from operational scripts
- Each stack gets its own Vault Transit key (transit/keys/sops-state-<stack>)
- state-sync passes per-stack Transit URI + age keys on encrypt
- Vault policies scope namespace-owners to their stacks only:
- sops-admin: wildcard access to all transit keys
- sops-user-<name>: access only to owned stack keys
- Anca (plotting-book) can only decrypt plotting-book state
- Admin can decrypt everything (via admin Transit policy or age fallback)
- External group sops-plotting-book maps Authentik group to Vault policy
- Updated CLAUDE.md with state sync documentation
- .sops.yaml: add hc_vault_transit_uri for transit/keys/sops-state
- state-sync: try Vault Transit first, fall back to age key on disk (sketch below)
- Re-encrypted all 101 state files with both Vault Transit + age
- Normal workflow: vault login → decrypt via Transit (no key files)
- Bootstrap/DR: age key at ~/.config/sops/age/keys.txt
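A sketch of the decrypt order in state-sync (the sops flags and the age key
path are as above; the error handling is simplified):
```
decrypt_state() {
  local enc="$1" out="${enc%.enc}"
  if vault token lookup >/dev/null 2>&1 && sops --decrypt "$enc" > "$out"; then
    return 0                                  # Vault Transit key from .sops.yaml
  fi
  SOPS_AGE_KEY_FILE="$HOME/.config/sops/age/keys.txt" sops --decrypt "$enc" > "$out"
}
```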
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
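The injected init container boils down to a wait loop like this (the
annotation value format and the example dependency are assumptions):
```
# dependency.kyverno.io/wait-for: "postgres.dbaas.svc.cluster.local:5432"   # hypothetical value
until nc -z postgres.dbaas.svc.cluster.local 5432; do
  echo "waiting for postgres.dbaas.svc.cluster.local:5432"
  sleep 2
done
```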
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
- Replace subPath ConfigMap mount with init container that copies openclaw.json
to writable NFS home (OpenClaw writes back to the file at runtime)
- Remove invalid memory-api plugin references causing "Config invalid"
- Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536
- Fix tg wrapper to inject -auto-approve when apply --non-interactive is used
Part of SOPS multi-user secrets migration.
- .sops.yaml: defines age recipients (Viktor + CI)
- scripts/tg: wrapper that decrypts secrets before running terragrunt
- .gitignore: excludes decrypted secrets.auto.tfvars.json
No functional change — terraform.tfvars still works as before.
Check real CPU/memory usage via kubectl top nodes instead of
limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL.
Limits overcommit is expected with 70+ services on 3 worker nodes.
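A sketch of the check (thresholds from above; the awk fields assume the
standard `kubectl top nodes` column layout):
```
kubectl top nodes --no-headers | awk '{
  cpu = $3 + 0; mem = $5 + 0        # CPU% and MEMORY% columns, "%" stripped by +0
  level = (cpu > 90 || mem > 90) ? "FAIL" : (cpu > 80 || mem > 80) ? "WARN" : "PASS"
  printf "%-24s cpu=%d%% mem=%d%% %s\n", $1, cpu, mem, level
}'
```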
- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
get VPA Off mode (recommend only, no evictions) to prevent downtime
on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing