Commit graph

134 commits

Viktor Barzin
cd96fb64a8 phpipam-pfsense-import: every 5min → hourly
Reduces 5-min disk-write spikes on PVE sdc. The cronjob was the
heaviest single contributor in our hourly fan-out investigation
(11.2 MB/s burst when it fired). Kea DDNS still handles real-time
DNS auto-registration; phpIPAM inventory just lags by up to 1h, and we
don't need it fresher than that.

Docs (dns.md, networking.md, .claude/CLAUDE.md) updated to match.
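
For reference, a quick cadence check — a sketch assuming the importer
runs as a namespaced k8s CronJob (namespace name is a guess):

```
kubectl -n phpipam get cronjob phpipam-pfsense-import -o jsonpath='{.spec.schedule}'
# expect: 0 * * * *   (previously: */5 * * * *)
```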
2026-04-26 22:48:43 +00:00
Viktor Barzin
51bf38815c vault: record Phase 3 vault Released-PV cleanup
Deleted the 6 NFS PVs orphaned by the Phase 2 rolling and removed
their /srv/nfs/<dir> subtrees on the PVE host (~1.5 GB; vault-2 audit
log was 1.4 GB on its own). Cluster-wide Released-PV sweep on the
proxmox-lvm/encrypted side stays out of scope.
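
The sweep pattern, sketched (PV/dir names illustrative, not the exact six):

```
# find NFS PVs stuck in Released after the Phase 2 rolling
kubectl get pv --no-headers | awk '$5 == "Released"'
# for each orphan: drop the PV object, then reclaim the export dir on the PVE host
kubectl delete pv <pv-name>
ssh root@192.168.1.127 'rm -rf /srv/nfs/<dir>'
```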
2026-04-25 23:08:45 +00:00
Viktor Barzin
484b4c7190 vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC
All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.

Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.

Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.
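
The force-finalize sequence, roughly (PVC name illustrative; the audit
PVC needs the same treatment):

```
kubectl -n vault delete pvc data-vault-1 --wait=false   # hangs Terminating on pvc-protection
kubectl -n vault patch pvc data-vault-1 -p '{"metadata":{"finalizers":null}}'
kubectl -n vault delete pod vault-1                     # recreated pod binds a fresh PVC
```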

Closes: code-gy7h
2026-04-25 17:10:00 +00:00
Viktor Barzin
ac8d2f548b paperless-ngx: migrate to proxmox-lvm-encrypted
Document scans (receipts, contracts, IDs) are unambiguously sensitive
PII. Storage decision rule defaults sensitive data to
`proxmox-lvm-encrypted`, but paperless-ngx had been left on plain
`proxmox-lvm` by an abandoned migration attempt, whose dormant,
non-Terraform-managed encrypted PVC had sat unbound for 11 days.

Cleaned up the orphan, added the encrypted PVC properly via Terraform,
rsynced data with deployment scaled to 0, swapped claim_name. Plain
`proxmox-lvm` PVC retained for a 7-day soak before removal.
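
The copy step, sketched as a throwaway pod mounting both claims (claim
names and image choice are assumptions, not the exact commands run):

```
kubectl -n paperless scale deploy paperless-ngx --replicas=0
kubectl -n paperless apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata: {name: pvc-copy}
spec:
  restartPolicy: Never
  containers:
  - name: copy
    image: alpine
    command: ["sh", "-c", "apk add --no-cache rsync && rsync -a /old/ /new/"]
    volumeMounts:
    - {name: old, mountPath: /old}
    - {name: new, mountPath: /new}
  volumes:
  - {name: old, persistentVolumeClaim: {claimName: paperless-data}}
  - {name: new, persistentVolumeClaim: {claimName: paperless-data-encrypted}}
EOF
```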

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:48:53 +00:00
Viktor Barzin
288efa89b3 vault: migrate vault-0 storage to proxmox-lvm-encrypted
Phase 2 of the NFS-hostile migration: data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between).

vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part
is what makes this safe (raft quorum maintained by 2 healthy pods
while one is replaced).

Also restores chart-default pod securityContext fields. The previous
`statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}`
block REPLACED (not merged) the chart's defaults — fsGroup,
runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup; the ext4 LV
volume root is root:root, so the vault user (UID 100) couldn't open
vault.db and the pod went CrashLoopBackOff. Fix: provide all five
fields explicitly so the values survive future chart bumps. vault-1
and vault-2 retained their
correct securityContext from when their pod specs were written to
etcd, before the partial customization landed — the bug only surfaces
when a pod is recreated.
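
Sanity check after a swap — all five fields must be present on the
recreated pod (the expected values below are chart defaults, shown as
assumptions):

```
kubectl -n vault get pod vault-0 -o jsonpath='{.spec.securityContext}'
# e.g. {"fsGroup":1000,"fsGroupChangePolicy":"OnRootMismatch",
#       "runAsGroup":1000,"runAsNonRoot":true,"runAsUser":100}
```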

Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).

Refs: code-gy7h

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:19:49 +00:00
Viktor Barzin
43e4f3f68e immich: migrate PostgreSQL off NFS to proxmox-lvm-encrypted
Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per
commit on NFS contributed to the 2026-04-22 NFS writeback storm
(2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS
(append-only, NFS-tolerant).

The init container that writes postgresql.override.conf is now gated
on PG_VERSION presence — on a fresh PVC the file would otherwise make
initdb refuse the non-empty PGDATA. First boot skips the override and
lets initdb run cleanly; second boot (after a forced restart) writes the
override so vchord/vectors/pg_prewarm load before the dump restore.
Idempotent on initialised PVCs.
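
The gate, sketched (PGDATA path and preload list assumed from context):

```
# init container: only write the override once initdb has produced a cluster
PGDATA=/var/lib/postgresql/data
if [ -f "$PGDATA/PG_VERSION" ]; then
  cat > "$PGDATA/postgresql.override.conf" <<'EOF'
shared_preload_libraries = 'vchord.so, vectors.so, pg_prewarm'
EOF
else
  echo "fresh PVC: skipping override so initdb sees a clean PGDATA"
fi
```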

Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC →
REINDEX clip_index/face_index → 111,843 assets verified, external
HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only).
LV created on PVE host, picked up by lvm-pvc-snapshot.

See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md.
Phase 2 (Vault Raft) follows under code-gy7h.

Closes: code-ahr7

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:47:30 +00:00
Viktor Barzin
4315ed5c2a [backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count)
cmd_prune_count's `log "  Pruned: ..."` wrote to stdout, which the
caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward
(when the 7d retention kicked in), pruned snapshots polluted the captured value
with multi-line log text, breaking the Prometheus exposition format
on the metric push (`lvm_snapshot_pruned_total ${pruned}` → 400 from
Pushgateway). Snapshots themselves were always fine; only the metric
push silently failed for ~9 nights, eventually triggering
LVMSnapshotNeverRun (alert has 48h `for:`).

Fix: redirect the inner log call to stderr so cmd_prune_count's stdout
contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh`
as the source of truth (previously it was edited only on the PVE host) and updates
backup-dr.md to point at the .sh and document the scp deploy.
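
Shape of the fix (function abridged; `log` stub included so the sketch
stands alone):

```
log() { echo "$*"; }                 # writes to stdout — the original trap
cmd_prune_count() {
  local count=3                      # stand-in for the real prune loop
  log "  Pruned: example-snap" >&2   # redirected: captured stdout stays clean
  echo "$count"                      # stdout now carries only the count
}
pruned=$(cmd_prune_count)
echo "lvm_snapshot_pruned_total ${pruned}"   # valid exposition line again
```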

Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:30:58 +00:00
Viktor Barzin
344fce3692 [monitoring][poison-fountain] pushgateway persistence + cronjob uid-0
Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:

1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped
   at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-
   backup-sync"} until the next 06:01 UTC push — a ~18h false-negative
   window. Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with
   --persistence.interval=1m. Chart note: values key is
   `prometheus-pushgateway:` (subchart alias), not `pushgateway:`.

2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100
   but the NFS mount /srv/nfs/poison-fountain is root:root 755 and
   the main Deployment runs as root, so mkdir /data/cache fails
   every 6h. Set run_as_user=0 on the CronJob container (no_root_squash
   is set on the export).
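
Spot-checks for both fixes (resource names/labels are assumptions):

```
kubectl -n monitoring get deploy -l app.kubernetes.io/name=prometheus-pushgateway \
  -o yaml | grep -E 'persistence\.(file|interval)'
# expect --persistence.file=... and --persistence.interval=1m
kubectl -n poison-fountain get cronjob poison-fountain-fetcher -o \
  jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].securityContext.runAsUser}'
# expect 0
```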

Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:32:29 +00:00
Viktor Barzin
7dfe89a6e0 [redis] stabilise against node-crash flap cascade — RC1-RC5 fixes
Five compounding factors produced the 2026-04-22 flap cascade:
- RC1: soft anti-affinity let 2/3 pods co-locate on k8s-node3, which
  bounced NotReady→Ready at 11:42Z and took quorum with it.
- RC2: aggressive sentinel/probe timing amplified LUKS-encrypted LVM
  I/O stalls into spurious +switch-master loops.
- RC3: HAProxy's 1s polling raced sentinel failovers and routed writes
  to demoted masters.
- RC4: publish_not_ready_addresses=true fed not-yet-ready pods into
  HAProxy DNS.
- RC5: realestate-crawler-celery CrashLoopBackOff closed the feedback
  loop.

Changes:
- Anti-affinity: preferred → required (one redis pod per node, hard)
- Sentinel down-after-ms 5000→15000, failover-timeout 30000→60000
- Redis + sentinel liveness: timeout 3→10, failure_threshold 3→5
- HAProxy: check inter 1s→2s / fall 2→3, timeout check 3s→5s
- Headless svc: publish_not_ready_addresses true→false

Post-rollout verification clean: 0 flaps, 0 +switch-master events,
0 celery ReadOnlyError in the 60s window after settle. Docs updated.
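
The settle check, roughly (container name `sentinel` is an assumption):

```
# all three sentinels must agree on the master
kubectl -n redis exec redis-v2-0 -c sentinel -- \
  redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
# flap count since rollout — expect 0
kubectl -n redis logs redis-v2-0 -c sentinel --since=1h | grep -c '+switch-master'
```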
2026-04-22 15:59:00 +00:00
Viktor Barzin
e2146e6916 gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
  (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
  audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
  auto-applied by gpu-feature-discovery on any node carrying an
  NVIDIA PCI device, so the selector follows the card.

- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
  nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
  each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
  'kubectl label gpu=true' since NFD handles labeling.

- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
  nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
  the GPU node) but portable when the card relocates.

Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.
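
Spot-check that scheduling now keys off labels, not hostnames:

```
kubectl get nodes -l nvidia.com/gpu.present=true -o name
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o name
# both should print the same single node (node1 today); if the card moves,
# the labels — and therefore scheduling — follow it with no Terraform edit
```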

Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
2026-04-22 13:43:07 +00:00
Viktor Barzin
134d6b9a82 vault runbook + raft/HA stuck-leader alerts
Post-2026-04-22 Step 5 deliverables:
- docs/runbooks/vault-raft-leader-deadlock.md — safe pod-restart
  sequence that avoids zombie containerd-shim + kernel NFS
  corruption, qm reset no-op gotcha, boot-order gotcha.
- prometheus_chart_values.tpl — VaultRaftLeaderStuck +
  VaultHAStatusUnavailable. Silent until vault telemetry
  scraping lands (tracked as beads code-vkpn).

Epic for moving vault off NFS tracked as beads code-gy7h.
2026-04-22 12:44:46 +00:00
Viktor Barzin
4cb2c157da post-mortem 2026-04-22: full timeline — second regression + node4 reboot
The initial recovery at 11:03 was premature; vault-1's audit writes over
NFS started hanging ~15 min later and the cluster regressed to 503.
Full recovery required rebooting node4 (to free vault-0's stuck NFS
mount and shed PVE NFS thread contention) and a second reboot of node3
(to clear another round of kernel NFS client degradation). Final
recovery at 11:43:28 UTC with vault-2 as active leader and quorum held
by vault-0 + vault-2.

vault-1 remains stuck in ContainerCreating on node2 — a third node2
reboot is required for full 3/3 quorum, but 2/3 is operationally
sufficient, so that's deferred.
2026-04-22 11:44:56 +00:00
Viktor Barzin
2f1f9107f8 vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem
The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that
never exited because the default fsGroupChangePolicy (Always) walks every
file on the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and
a 1GB audit log, the recursive chown outlasted the deadline and restarted
forever — blocking raft quorum recovery. OnRootMismatch makes chown a
no-op when the volume root is already correct, which it always is after
initial setup.

The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.
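
The live patch, approximately as applied (statefulset name assumed):

```
kubectl -n vault patch statefulset vault --type merge -p \
  '{"spec":{"template":{"spec":{"securityContext":{"fsGroupChangePolicy":"OnRootMismatch"}}}}}'
```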

The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
2026-04-22 11:12:19 +00:00
Viktor Barzin
d39770b30d monitoring: tighten LVMSnapshotStale to 30h for daily-cadence detection
Threshold was 48h plus a 30m `for:` on a job that runs daily. We don't
need to wait two days to detect a broken timer — bring it down to 30h
+ 30m (just over a day of cadence + minor drift/retry grace). Also
add a description pointing to the restore runbook so the alert
text surfaces the fix path directly.

Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.

Re-triggers default.yml apply now that ci/Dockerfile is rebuilt
with vault CLI — this is the first commit touching a stack that
will actually succeed since the e80b2f02 regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:54:37 +00:00
Viktor Barzin
4a343c33f0 monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m
Doc claimed >40m; actual fire time is 80m (60m last-success threshold +
20m 'for'). The doc was already stale against the pre-existing config,
and went further out of sync after 'for' was raised from 10m to 20m in
9b4970da. Only this one alert row was out of sync.
2026-04-21 22:39:46 +00:00
Viktor Barzin
ac695dea38 [registry] bulk-clean 34 orphan manifests + beads-server image bump
Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.
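
The per-manifest delete pattern, sketched (registry URL and container
name are assumptions; auth flags elided):

```
REG=https://registry.example.lan
# resolve tag -> manifest digest
digest=$(curl -sI -H 'Accept: application/vnd.oci.image.index.v1+json' \
  "$REG/v2/<repo>/manifests/<tag>" \
  | awk 'tolower($1)=="docker-content-digest:" {print $2}' | tr -d '\r')
# delete by digest (tags cannot be deleted directly), then GC reclaims blobs
curl -sX DELETE "$REG/v2/<repo>/manifests/$digest"
ssh root@10.0.20.10 \
  'docker exec registry bin/registry garbage-collect /etc/docker/registry/config.yml'
```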

beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.

Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).

Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.

Closes: code-8hk
Closes: code-jh3c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:16:34 +00:00
Viktor Barzin
7e34b67f24 [docs] Architecture docs: registry integrity probe, pin, new CI pipelines
Bring the architecture set in line with what's actually deployed after
today's registry reliability work (commits 7cb44d72 + 42961a5f):

- docs/architecture/ci-cd.md: expand Infra Pipelines table with
  build-ci-image (+ verify-integrity step), registry-config-sync,
  pve-nfs-exports-sync, postmortem-todos, drift-detection,
  issue-automation, provision-user. Note registry:2.8.3 pin +
  integrity probe in the image-registry flow section.
- docs/architecture/monitoring.md: add Registry Integrity Probe to
  components table; add 3-alert section (Manifest Integrity Failure /
  Probe Stale / Catalog Inaccessible).
- .claude/CLAUDE.md: one-line on the pin, auto-sync pipeline, and the
  revision-link-not-blob rule so the next agent knows the right check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:51:26 +00:00
Viktor Barzin
42961a5f58 [registry] fix-broken-blobs.sh — check revision-link, not blob data
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.
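
The corrected check in shell form (registry root path is an assumption):

```
ROOT=/var/lib/registry/docker/registry/v2
repo=$1; digest=$2                      # e.g. infra-ci sha256:<child-digest>
hex=${digest#sha256:}
link="$ROOT/repositories/$repo/_manifests/revisions/sha256/$hex/link"
blob="$ROOT/blobs/sha256/${hex:0:2}/$hex/data"
[ -f "$link" ] || echo "WARNING: $repo: no revision link for $digest (API will 404)"
[ -f "$blob" ] && [ ! -f "$link" ] && echo "  blob survives — the exact orphan signature"
```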

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:43:35 +00:00
Viktor Barzin
34ee282d88 [ci] Auto-sync modules/docker-registry/* to registry VM + runbook docs
Replaces the manual scp+bounce sequence that landed registry:2.8.3 on
10.0.20.10 today (see commit 7cb44d72 + nginx-DNS-trap in runbook).
Addresses the "no repeat manual fixes" preference — future changes to
docker-compose.yml / fix-broken-blobs.sh / nginx_registry.conf /
config-private.yml / cleanup-tags.sh now deploy through CI.

Pipeline (.woodpecker/registry-config-sync.yml) mirrors
pve-nfs-exports-sync.yml: ssh-keyscan pin, scp the whole managed set,
bounce compose only when compose-visible files changed, always restart
nginx after a compose bounce (critical — nginx caches upstream DNS), end
with a dry-run fix-broken-blobs.sh to catch regressions.

Credentials:
 - Woodpecker repo-secret `registry_ssh_key` (events: push, manual)
 - Mirror at Vault `secret/woodpecker/registry_ssh_key`
   (private_key / public_key / known_hosts_entry)
 - Public key on /root/.ssh/authorized_keys on 10.0.20.10
 - Key label: woodpecker-registry-config-sync

Runbook updated with "Auto-sync pipeline" section pointing at the new
flow + manual override command.

Closes: code-3vl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:32:12 +00:00
Viktor Barzin
6e96b436b1 [docs] Capture nginx stale-DNS trap in registry-vm runbook
Discovered during the 2026-04-19 registry:2.8.3 pin deploy: nginx caches
its upstream DNS at startup and does NOT re-resolve after registry-*
containers are recreated. Symptom was /v2/_catalog returning
{"repositories": []} and /v2/ returning 200 without auth — nginx was
forwarding to a stale IP that a different backend container now owns.

Fix is always 'docker restart registry-nginx' after any registry-*
bounce. Captured in registry-vm.md so future manual operators and the
coming auto-sync pipeline (beads code-3vl) both encode the step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:24:09 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source (condensed sketch below).
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.
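
Condensed sketch of the verify-integrity walk (registry URL assumed;
the real step lives in the pipeline YAML):

```
REG=https://registry.example.lan; repo=infra-ci; tag=latest
index=$(curl -sf -H 'Accept: application/vnd.oci.image.index.v1+json' \
  "$REG/v2/$repo/manifests/$tag")
for d in $(printf '%s' "$index" | jq -r '.manifests[].digest'); do
  curl -sfI "$REG/v2/$repo/manifests/$d" >/dev/null || { echo "FAIL child $d"; exit 1; }
  child=$(curl -sf -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
    "$REG/v2/$repo/manifests/$d")
  for b in $(printf '%s' "$child" | jq -r '.config.digest, .layers[].digest'); do
    curl -sfI "$REG/v2/$repo/blobs/$b" >/dev/null || { echo "FAIL blob $b"; exit 1; }
  done
done
echo "index + children + configs + layers: all HEAD 200"
```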

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
Viktor Barzin
5a0b24f54e [docs] TrueNAS decommission cleanup — remove references from active docs
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.

In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).

Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:55:43 +00:00
Viktor Barzin
e55c549c9a [redis] Phase 7 step 2: remove Bitnami helm_release + orphan PVCs
Bringing the 2026-04-19 rework to its end-state. Cutover soaked for ~1h
with 0 alerts firing and 127 ops/sec on the v2 master — skipped the
nominal 24h rollback window per user direction.

 - Removed `helm_release.redis` (Bitnami chart v25.3.2) from TF. Helm
   destroy cleaned up the StatefulSet redis-node (already scaled to 0),
   ConfigMaps, ServiceAccount, RBAC, and the deprecated `redis` + `redis-headless`
   ClusterIP services that the chart owned.
 - Removed `null_resource.patch_redis_service` — the kubectl-patch hack
   that worked around the Bitnami chart's broken service selector. No
   Helm chart, no patch needed.
 - Removed the dead `depends_on = [helm_release.redis]` from the HAProxy
   deployment.
 - `kubectl delete pvc -n redis redis-data-redis-node-{0,1}` for the two
   orphan PVCs the StatefulSet template left behind (K8s doesn't cascade-delete).
 - Simplified the top-of-file comment and the redis-v2 architecture
   comment — they talked about the parallel-cluster migration state that
   no longer exists. Folded in the sentinel hostname gotcha, the redis
   8.x image requirement, and the BGSAVE+AOF-rewrite memory reasoning
   so the rationale survives in the code rather than only in beads.
 - `RedisDown` alert no longer matches `redis-node|redis-v2` — just
   `redis-v2` since that's the only StatefulSet now. Kept the `or on()
   vector(0)` so the alert fires when kube_state_metrics has no sample
   (e.g. after accidental delete).
 - `docs/architecture/databases.md` trimmed: no more "pending TF removal"
   or "cold rollback for 24h" language.

Verification after apply:
 - kubectl get all -n redis: redis-v2-{0,1,2} (3/3 Running) + redis-haproxy-*
   (3 pods, PDB minAvailable=2). Services: redis-master + redis-v2-headless only.
 - PVCs: data-redis-v2-{0,1,2} only (redis-data-redis-node-* deleted).
 - Sentinel: all 3 agree mymaster = redis-v2-0 hostname.
 - HAProxy: PING PONG, DBSIZE 92, 127 ops/sec on master.
 - Prometheus: 0 firing redis alerts.

Closes: code-v2b
Closes: code-2mw

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:32:14 +00:00
Viktor Barzin
b6cd83f85a [redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only
Phase 3 — replication chain (old → v2):
 - Discovered the v2 cluster was running redis:7.4-alpine, but the
   Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
   the 7.4 replicas rejected the stream with "Can't handle RDB format
   version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
   restore PSYNC compatibility.
 - Discovered that sentinel on BOTH v2 and old Bitnami clusters
   auto-discovered the cross-cluster replication chain when v2-0
   REPLICAOF'd the old master, triggering a failover that reparented
   old-master to a v2 replica and took HAProxy's backend offline.
   Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
   clusters) during the REPLICAOF surgery, then re-MONITOR after
   cutover (sketch after this phase). This must be done on the OLD
   sentinels too, not just v2 — they're the ones that kept fighting our
   REPLICAOF.
 - Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
   All 114 keys (db0:76, db1:22, db4:16) synced including `immich_bull:*`
   BullMQ queues and `_kombu.*` Celery queues — the user-stated
   must-survive data class.

Phase 4 — HAProxy cutover:
 - Updated `kubernetes_config_map.haproxy` to point at
   `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
   redis_sentinel backends (removed redis-node-{0,1}).
 - Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
   ConfigMap apply so HAProxy's 1s health-check interval found a
   role:master within a few seconds. Cutover disruption on HAProxy
   rollout was brief; old clients naturally moved to new HAProxy pods
   within the rolling update window.
 - Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
   mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
   + `announce-hostnames yes` were active — this ensures sentinel
   stores the hostname (not resolved IP) in its rewritten config, so
   pod-IP churn on restart doesn't break failover.

Phase 5 — chaos:
 - Round 1: killed master v2-0 mid-probe. First run exposed the
   sentinel IP-storage issue (stored 10.10.107.222, went stale on
   restart) — ~12s probe disruption. Fixed hostname persistence and
   re-MONITORed.
 - Round 2: killed new master v2-2 with hostnames correctly stored.
   Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
   60s — target <3s of actual user-visible disruption.

Phase 6 — Nextcloud simplification:
 - `zzz-redis.config.php` no longer queries sentinel in-process —
   just points at `redis-master.redis.svc.cluster.local`. Removed 20
   lines of PHP. HAProxy handles master tracking transparently now
   that it's scaled to 3 + PDB minAvailable=2.

Phase 7 step 1:
 - `kubectl scale statefulset/redis-node --replicas=0` (transient —
   TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
   preserved as cold rollback.

Docs:
 - Rewrote `databases.md` Redis section to reflect post-cutover reality
   and the sentinel hostname gotcha (so future sessions don't relearn it).
 - `.claude/reference/service-catalog.md` entry updated.

The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.

Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:13:43 +00:00
Viktor Barzin
f6685a23a9 [dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E)
Workstream E of the DNS hardening push. Two independent pfSense-side
changes to eliminate single-point DNS failures and the unauthenticated
RFC 2136 update vector.

Part 1 — Multi-IP DHCP option 6
- Before: clients on 10.0.10/24 got only 10.0.10.1; clients on 10.0.20/24
  got only 10.0.20.1. Internal resolver outage == cluster-wide DNS dark.
- After:
  - 10.0.10/24 -> [10.0.10.1, 94.140.14.14]
  - 10.0.20/24 -> [10.0.20.1, 94.140.14.14]
- 192.168.1/24 deliberately untouched (served by TP-Link AP, not pfSense
  Kea — pfSense WAN DHCP is disabled); already ships [192.168.1.2,
  94.140.14.14] so the end state is consistent across all three subnets.
- Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and
  $cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's
  services_kea4_configure() implodes the array into "data: a, b" on the
  "domain-name-servers" option-data entry (services.inc L1214).
- Verified:
  - DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" +
    "nameserver 94.140.14.14" after networkd renew.
  - k8s-node1 (10.0.20.101) same after networkctl reload + systemd-resolved
    restart.
  - Fallback drill on k8s-node1: `ip route add blackhole 10.0.20.1/32`;
    dig @10.0.20.1 google.com -> "no servers could be reached"; dig
    @94.140.14.14 google.com -> 216.58.204.110; system resolver
    (getent hosts) succeeds via the fallback IP. Blackhole route removed.

Part 2 — TSIG-signed Kea DHCP-DDNS
- Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and
  Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated
  update vector was latent (DDNS wiring in Kea DHCP4 is actually off
  today — "DDNS: disabled" in dhcpd.log) but would activate as soon as
  anyone turned on ddnsupdate on LAN/OPT1.
- Generated HMAC-SHA256 secret, base64-encoded 32 random bytes.
- Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27).
- Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium
  instances via /api/settings/set (tsigKeys[]).
- Updated kea-dhcp-ddns.conf on pfSense with
  tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …}
  and key-name: kea-ddns on each forward-ddns / reverse-ddns domain.
  Pre-change backup at /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- Configured viktorbarzin.lan + 10.0.10.in-addr.arpa +
  20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary:
  - update = UseSpecifiedNetworkACL
  - updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2]
  - updateSecurityPolicies = [{tsigKeyName: kea-ddns,
                               domain: "*.<zone>", allowedTypes: [ANY]}]
  Technitium requires BOTH a source-IP match AND a valid TSIG signature.
- Verified TSIG end-to-end:
  - Signed A-record update from pfSense -> "successfully processed",
    dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo:
    hmac-sha256; TSIG Error: NoError; RCODE: NoError").
  - Signed PTR update same zone pattern -> dig -x returns tsig-test
    FQDN.
  - Unsigned update from pfSense IP (in ACL) -> "update failed:
    REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic
    Updates Security Policy").
  - Test records cleaned up via signed nsupdate.
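
The signed-update test, reconstructed (10.0.20.201 is the Technitium
primary; the secret stays a placeholder):

```
nsupdate -y 'hmac-sha256:kea-ddns:<base64-secret>' <<'EOF'
server 10.0.20.201
zone viktorbarzin.lan
update add tsig-test.viktorbarzin.lan 300 A 10.99.99.99
send
EOF
dig @10.0.20.201 tsig-test.viktorbarzin.lan +short   # expect 10.99.99.99
```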

Safety
- pfSense config backup: /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip
  (145898 bytes, pre-change snapshot — keep 30d).
- DDNS config backup: /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on
  pfSense; not committed to git.

Docs
- architecture/dns.md: zone dynamic-updates section records the TSIG
  policy; Incident History gets a WS E entry.
- architecture/networking.md: DHCP Coverage table now shows the DNS
  option 6 values per subnet; pfSense block notes the TSIG-signed DDNS
  and config backup path.
- runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers
  key rotation, emergency bypass, and enforcement-verification.

Closes: code-o6j

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:12:23 +00:00
Viktor Barzin
33d934c32f [dns] pfSense: Unbound replaces dnsmasq (WS D)
Replace pfSense dnsmasq (DNS Forwarder) with Unbound (DNS Resolver) so
LAN-side .viktorbarzin.lan resolution survives a full Kubernetes outage.

Out-of-band pfSense changes (not in Terraform; pfSense config.xml is
VM-managed). Backup at /cf/conf/config.xml.2026-04-19-pre-unbound on-box
+ /mnt/backup/pfsense/ nightly.

- <unbound> enabled; listens on lan, opt1, wan, lo0
- <forwarding> on + <forward_tls_upstream> → DoT to Cloudflare
  (1.1.1.1 / 1.0.0.1 port 853, SNI cloudflare-dns.com)
- <dnssec>, <prefetch>, <prefetchkey>, <dnsrecordcache> (serve-expired)
- msgcachesize=256MB, cache_max_ttl=7d, cache_min_ttl=60s
- custom_options: auth-zone viktorbarzin.lan master=10.0.20.201
  fallback-enabled=yes for-upstream=yes + serve-expired-ttl=259200
- <dnsmasq><enable> removed; dnsmasq stopped
- NAT rdr WAN UDP 53 → 10.0.20.201 removed (Unbound listens on WAN now)
- Technitium zone viktorbarzin.lan: zoneTransferNetworkACL set to
  10.0.20.1, 10.0.10.1, 192.168.1.2 (pfSense source IPs)

Verified:
- unbound-control list_auth_zones: viktorbarzin.lan serial 49367
- dig @127.0.0.1 idrac.viktorbarzin.lan returns 192.168.1.4 with aa flag
  (served from auth-zone, not forwarded)
- dig @127.0.0.1 example.com +dnssec returns ad flag (DoT + validated)
- /var/unbound/viktorbarzin.lan.zone has ~114 records
- K8s outage drill passed: scale technitium=0 → dig still returns via
  WAN/LAN/OPT1 interfaces → scale restored
- LAN/management/K8s VLAN clients all resolve via pfSense 192.168.1.2 /
  10.0.10.1 / 10.0.20.1 respectively

Trade-off: Technitium Split Horizon hairpin for 192.168.1.x →
*.viktorbarzin.me (non-proxied) no longer runs via pfSense (Unbound
answers locally). Fix if it bites: switch service to proxied or add
Unbound Host Override. Documented in docs/runbooks/pfsense-unbound.md.

Closes: code-k0d
2026-04-19 15:52:41 +00:00
Viktor Barzin
0f6321ce86 [dns] NodeLocal DNSCache — deploy DaemonSet to all nodes (WS C)
Adds per-node DNS cache that transparently intercepts pod queries on
10.96.0.10 (kube-dns ClusterIP) AND 169.254.20.10 (link-local) via
hostNetwork + NET_ADMIN iptables NOTRACK rules. Pods keep using their
existing /etc/resolv.conf (nameserver 10.96.0.10) unchanged — no kubelet
rollout needed for transparent mode.

Layout mirrors existing stacks (technitium, descheduler, kured):
  stacks/nodelocal-dns/
    main.tf                                 # module wiring + IP params
    modules/nodelocal-dns/main.tf           # SA, Services, ConfigMap, DS

Key decisions:
  - Image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1
  - Co-listens on 169.254.20.10 + 10.96.0.10 (transparent interception)
  - Upstream path: kube-dns-upstream (new headless svc) → CoreDNS pods
    (separate ClusterIP avoids cache looping back through itself)
  - viktorbarzin.lan zone forwards directly to Technitium ClusterIP
    (10.96.0.53), bypassing CoreDNS for internal names
  - priorityClassName: system-node-critical
  - tolerations: operator=Exists (runs on master + all tainted nodes)
  - No CPU limit (cluster-wide policy); mem requests=32Mi, limit=128Mi
  - Kyverno dns_config drift suppressed on the DaemonSet
  - Kubelet clusterDNS NOT changed — transparent mode is sufficient;
    rolling 5 nodes just to switch to 169.254.20.10 adds no benefit
    and would expand the blast radius for no reason.

Verified:
  - DaemonSet 5/5 Ready across k8s-master + 4 workers
  - dig @169.254.20.10 idrac.viktorbarzin.lan -> 192.168.1.4
  - dig @169.254.20.10 github.com -> 140.82.121.3
  - Deleted all 3 CoreDNS pods; cached queries still resolved via
    NodeLocal DNSCache (resilience confirmed)

Docs: architecture/dns.md — adds NodeLocal DNSCache to Components table,
graph diagram, stacks table; rewrites pod DNS resolution paths to show
the cache layer; adds troubleshooting entry.

Closes: code-2k6
2026-04-19 15:46:41 +00:00
Viktor Barzin
eb6ceac5f5 [dns] static-client DNS — Proxmox host, registry VM dual-resolver setup (WS F)
Fixes single-upstream DNS brittleness on non-DHCP hosts. Each host now
has a primary internal resolver + external fallback (AdGuard) so DNS
keeps working if the primary resolver IP is unreachable.

New config:

- Proxmox host (192.168.1.127): plain /etc/resolv.conf with
  nameserver 192.168.1.2 (pfSense LAN) + 94.140.14.14 (AdGuard).
  Previously: single nameserver 192.168.1.1 — could not resolve
  internal .lan names at all. Documented in
  docs/runbooks/proxmox-host.md.

- Registry VM (10.0.20.10): systemd-resolved drop-in at
  /etc/systemd/resolved.conf.d/10-internal-dns.conf
  (DNS=10.0.20.1, FallbackDNS=94.140.14.14, Domains=viktorbarzin.lan)
  plus matching per-link nameservers in /etc/netplan/50-cloud-init.yaml.
  Previously: 1.1.1.1 + 8.8.8.8 only — image pulls referencing .lan
  hostnames would fail to resolve. Documented in
  docs/runbooks/registry-vm.md (drop-in reconstructed below).

- TrueNAS (10.0.10.15): host unreachable during this session
  ("No route to host" on 10.0.10.0/24). Deferred best-effort per
  WS F instructions; noted on the beads task.

Both hosts have pre-change backups at /root/dns-backups/ for
one-command rollback. Fallback behaviour was validated by routing
each primary to a blackhole and confirming dig answered from the
fallback.

Both runbooks include the verified resolvectl / resolv.conf state,
the fallback-test procedure, and the rollback steps.

Closes: code-dw8
2026-04-19 15:43:49 +00:00
Viktor Barzin
150f196095 [redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts
Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm
release so data can migrate via REPLICAOF during a future short maintenance
window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still
points at redis-node-{0,1}.

Architecture:
 - 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter
 - podManagementPolicy=Parallel + init container that writes fresh
   sentinel.conf on every boot by probing peer sentinels and redis for
   consensus master (priority: sentinel vote > role:master with slaves >
   pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM.
 - redis.conf `include /shared/replica.conf` — init container writes
   `replicaof <master> 6379` for non-master pods so they come up already in
   the correct role. No bootstrap race.
 - master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork
   COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn.
 - RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec.
 - PodDisruptionBudget minAvailable=2.

Also:
 - HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since
   Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes
   the sole client-facing path for all 17 consumers.
 - New Prometheus alerts: RedisMemoryPressure, RedisEvictions,
   RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong,
   RedisReplicasMissing. Updated RedisDown to cover both statefulsets
   during the migration.
 - databases.md updated to describe the interim parallel-cluster state.

Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status
up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded
into Prometheus and inactive.

Beads: code-v2b (still in progress — Phase 3-7 await maintenance window).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:05 +00:00
Viktor Barzin
6ee283c2f0 [docs] Document external-monitor opt-out mechanism in monitoring.md
The doc said monitors were created for everything in cloudflare_proxied_names,
but since the k8s-api discovery rewrite the ConfigMap is a fallback only.
Describe the opt-OUT semantics and how external_monitor=false on a factory
call translates to the sync script's skip annotation.
2026-04-19 15:19:06 +00:00
Viktor Barzin
af6574a006 [dns] Fix CoreDNS serve_stale syntax — 24h TTL, no refresh-mode arg
CoreDNS refused to load the new Corefile with `serve_stale 3600s 86400s`:

  plugin/cache: invalid value for serve_stale refresh mode: 86400s

serve_stale takes one DURATION and an optional refresh_mode keyword
("immediate" or "verify"), not two durations. Simplified to
`serve_stale 86400s` (serve cached entries for up to 24h when upstream
is unreachable). The new CoreDNS pods were in CrashLoopBackOff; the two
old pods kept serving traffic so there was no outage, but the partial
apply left the cluster wedged with the bad ConfigMap.

Also collapses the inline viktorbarzin.lan cache block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:18:43 +00:00
Viktor Barzin
9a21c0f065 [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
  (secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
  was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, fail the job on any
  create error (was silently passing).

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
  forward, health_check on viktorbarzin.lan forward, serve_stale
  3600s/86400s on both cache blocks — pfSense flap no longer takes the
  cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.

**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
  equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
  TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
  CoreDNSForwardFailureRate.

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
  kubectl rollout status on all 3 deployments (180s), per-pod
  /api/stats/get probe, zone-count parity across the 3 instances.
  Fails the apply on any check fail. Override: -var skip_readiness=true.

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:53:41 +00:00
Viktor Barzin
e6e5fc5f17 [docs] Mailserver architecture — richer diagrams + steady-state accuracy [ci skip]
## Context

After code-yiu Phases 1a–6 landed, `docs/architecture/mailserver.md` still
carried the pre-HAProxy Mermaid diagram, a retired Dovecot-exporter
component row, stale PVC names (`-proxmox` suffixes that were renamed
`-encrypted` during the LUKS migration), a wrong probe schedule
(claimed 10 min, actually 20 min), and a Mailgun-API claim for the
probe (it's been on Brevo since code-n5l). The two-path architecture
(external-via-HAProxy + intra-cluster-via-ClusterIP) that defines the
current design wasn't visualised at all.

## This change

Rewrote the Architecture Diagram section to show **both ingress paths
in one Mermaid flowchart**, colour-coded:

- External (orange): Sender → pfSense NAT → HAProxy → NodePort →
  **alt PROXY listeners** (2525/4465/5587/10993).
- Intra-cluster (blue): Roundcube / probe → ClusterIP Service →
  **stock listeners** (25/465/587/993), no PROXY.
- The pod subgraph shows both listener sets feeding the same Postfix /
  Rspamd / Dovecot / Maildir pipeline.
- Security dotted edges: Postfix log stream → CrowdSec agent →
  LAPI → pfSense bouncer decisions.
- Monitoring dotted edges: probe → Brevo HTTP → MX → pod → IMAP →
  Pushgateway/Uptime Kuma.

Added a **sequenceDiagram** for the external SMTP roundtrip — walks
through the wire-level handshake from external MTA → pfSense NAT →
HAProxy TCP connect → PROXY v2 header write → kube-proxy SNAT → pod
postscreen parse → smtpd banner. Makes the "how does the pod see the
real IP despite SNAT?" question self-answering.

Added a **Port mapping table** listing all 8 container listeners (4
stock + 4 alt) with their Service, NodePort, PROXY-required flag, and
who uses each path. Replaces the ambiguous prose about "alt ports".

Fixed stale bits:
- Removed Dovecot Exporter row from Components (retired in code-1ik).
- Added pfSense HAProxy row.
- Probe schedule: every 10 min → **every 20 min** (`*/20 * * * *`).
- Probe API: Mailgun → **Brevo HTTP**.
- PVC names: `-proxmox` → **`-encrypted`** (all three); storage class
  `proxmox-lvm` → **`proxmox-lvm-encrypted`**.
- Added `mailserver-backup-host` + `roundcube-backup-host` RWX NFS
  PVCs to the Storage table with backup flow pointer.
- Expanded Troubleshooting → Inbound to include HAProxy health check
  + container-listener verification steps.
- Secrets table: `brevo_api_key` now marked as used by both relay +
  probe; `mailgun_api_key` marked historical.

Added a prominent **UPDATE 2026-04-19** header to
`docs/runbooks/mailserver-proxy-protocol.md` pointing future readers
at the implemented state in `mailserver-pfsense-haproxy.md`. Research
doc preserved as a decision record — it's the canonical "why not just
pin the pod?" reference.

## What is NOT in this change

- No Terraform changes; this is docs-only.
- No changes to the runbook (`mailserver-pfsense-haproxy.md`) — it was
  already rewritten during Phase 6.

## Test Plan

### Automated
```
$ awk '/^```mermaid/ {c++} END{print c}' docs/architecture/mailserver.md
2
$ grep -c '\-encrypted' docs/architecture/mailserver.md
5  # PVC references normalised
$ grep -c '\-proxmox' docs/architecture/mailserver.md
0  # no stale names left
```

### Manual Verification
Render `docs/architecture/mailserver.md` on GitHub or any Mermaid-
capable viewer:
1. Top Architecture Diagram should show two labelled paths into the
   pod, colour-coded (orange = external, blue = intra-cluster).
2. Sequence diagram should show 10 numbered steps ending at Rspamd +
   Dovecot delivery.
3. Port Mapping table should make it obvious that the 4 alt container
   ports are only reachable via `mailserver-proxy` NodePort and require
   PROXY v2.
2026-04-19 12:40:53 +00:00
Viktor Barzin
43fe11fffc [mailserver] Phase 6 — decommission MetalLB LB path [ci skip]
## Context (bd code-yiu)

With Phase 4+5 proven (external mail flows through pfSense HAProxy +
PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB
LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are
obsolete. Phase 6 decommissions them and documents the steady-state
architecture.

## This change

### Terraform (stacks/mailserver/modules/mailserver/main.tf)
- `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`.
- Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation.
- Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP).
- Port set unchanged — the Service still exposes 25/465/587/993 for
  intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
  CronJob) that hit the stock PROXY-free container listeners.
- Inline comment documents the downgrade rationale + companion
  `mailserver-proxy` NodePort Service that now carries external traffic.

### pfSense (ops, not in git)
- `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT
  rule references it post-Phase-4; keeping it would be misleading dead
  metadata. Reversible via WebUI + `php /tmp/delete-mailserver-alias.php`
  companion script (ad-hoc, not checked in — alias is just a
  Firewall → Aliases → Hosts entry).

### Uptime Kuma (ops)
- Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202`
  → `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` /
  `... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy
  layer liveness). History retained (edit, not delete-recreate).

### Docs
- `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten
  "Current state" section; now reflects steady-state architecture with
  two-path diagram (external via HAProxy / intra-cluster via ClusterIP).
  Phase history table marks Phase 6 done. Rollback section updated (no
  one-liner post-Phase-6; need Service-type re-upgrade + alias re-add).
- `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound
  flow, CrowdSec section, Uptime Kuma monitors list, Decisions section
  (dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY
  v2"), Troubleshooting all updated.
- `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph
  updated with new external path description; references the new runbook.

## What is NOT in this change

- Removal of `10.0.20.202` from `cloudflare_proxied_names` or any
  reserved-IP tracking — wasn't there to begin with. The
  `metallb-system default` IPAddressPool (10.0.20.200-220) shows 2 of
  19 available after this, confirming `.202` went back to the pool.
- Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if
  someone re-introduces the MetalLB LB (see runbook "Rollback").

## Test Plan

### Automated (verified pre-commit 2026-04-19)
```
# Service is ClusterIP with no EXTERNAL-IP
$ kubectl get svc -n mailserver mailserver
mailserver   ClusterIP   10.103.108.217   <none>   25/TCP,465/TCP,587/TCP,993/TCP

# 10.0.20.202 no longer answers ARP (ping from pfSense)
$ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202'
2 packets transmitted, 0 packets received, 100.0% packet loss

# MetalLB pool released the IP
$ kubectl get ipaddresspool default -n metallb-system \
    -o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}'
2 of 19 available

# E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver
... Round-trip SUCCESS in 20.3s ...
$ kubectl delete job probe-phase6 -n mailserver

# pfSense mailserver alias removed
$ ssh admin@10.0.20.1 'php -r "..." | grep mailserver'
(no output)
```

### Manual Verification
1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new
   hostname `10.0.20.1`.
2. Roundcube login works (`https://mail.viktorbarzin.me/`).
3. Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe
   `postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in
   mailserver logs within ~10s.
4. CrowdSec should still see real client IPs in postfix/dovecot parsers
   (verify with `cscli alerts list` on next auth-fail event).

## Phase history (bd code-yiu)

| Phase | Status | Description |
|---|---|---|
| 1a  |  `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2   |  2026-04-19 | pfSense HAProxy pkg installed |
| 3   |  `ba697b02` | HAProxy config persisted in pfSense XML |
| 4+5 |  `9806d515` | 4-port alt listeners + HAProxy frontends + NAT flip |
| 6   |  **this commit** | MetalLB LB retired; 10.0.20.202 released; docs updated |

Closes: code-yiu
2026-04-19 12:36:11 +00:00
Viktor Barzin
ba697b02a2 [mailserver] Phase 2-3 — pfSense HAProxy bootstrap + runbook [ci skip]
## Context (bd code-yiu)

Phase 2 (HAProxy on pfSense) and Phase 3 (persist config in pfSense XML so
it lives in the nightly backup) of the PROXY-v2 migration. Test path only —
listens on pfSense 10.0.20.1:2525 → k8s node NodePort :30125 → pod :2525
postscreen. Real client IP verified in maillog
(`postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:...`), Phase 1a
container plumbing is already live (commit ef75c02f).

pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is captured daily by
`scripts/daily-backup.sh` (scp → `/mnt/backup/pfsense/config-YYYYMMDD.xml`)
and synced offsite to Synology. No new backup wiring needed — this commit
documents the fact + adds the reproducer script.

## This change

Two files, both additive:

1. `scripts/pfsense-haproxy-bootstrap.php` — idempotent PHP script that
   edits pfSense config.xml to add:
   - Backend pool `mailserver_nodes` with 4 k8s workers on NodePort 30125,
     `send-proxy-v2`, TCP health-check every 120000 ms (2 min).
   - Frontend `mailserver_proxy_test` listening on pfSense 10.0.20.1:2525
     in TCP mode, forwarding to the pool.
   Uses `haproxy_check_and_run()` to regenerate `/var/etc/haproxy/haproxy.cfg`
   and reload HAProxy. Removes existing items with the same name before
   adding, so repeat runs converge on declared state.

2. `docs/runbooks/mailserver-pfsense-haproxy.md` — ops runbook covering
   current state, validation, bootstrap/restore, health checks, phase
   roadmap, and known warts (health-check noise + bind-address templating).

## What is NOT in this change

- Phase 4 (NAT rdr flip for :25 from `<mailserver>` → HAProxy) — deferred.
- Phase 5 (extend to 465/587/993 with alt listeners + Dovecot dual-
  inet_listener) — deferred.
- Terraform for pfSense HAProxy pkg install — not possible (no Terraform
  provider for pfSense pkg management). Runbook documents the manual
  `pkg install` command.

## Test Plan

### Automated
```
$ ssh admin@10.0.20.1 'pgrep -lf haproxy; sockstat -l | grep :2525'
64009 /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D
www  haproxy  64009 5 tcp4  *:2525  *:*

$ ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
    | awk 'NR>1 {print $4, $6}'
node1 2
node2 2
node3 2
node4 2        # all UP

$ python3 -c "
import socket; s=socket.socket(); s.settimeout(10)
s.connect(('10.0.20.1', 2525))
print(s.recv(200).decode())
s.send(b'EHLO persist-test.example.com\r\n')
print(s.recv(500).decode())
s.send(b'QUIT\r\n'); s.close()"
220-mail.viktorbarzin.me ESMTP
...
250-mail.viktorbarzin.me
250-SIZE 209715200
...
221 2.0.0 Bye

$ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver --tail=50 \
    | grep smtpd-proxy.*CONNECT
postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525
```

Real client IP `[10.0.10.10]` visible (not the k8s-node IP after kube-proxy
SNAT) → PROXY-v2 roundtrip confirmed.

### Manual Verification
Trigger a pfSense reboot; after boot, HAProxy should auto-restart from the
now-persisted config (`<enable>yes</enable>` in XML). Connection test above
should still work.

## Reproduce locally
1. `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/`
2. `ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'` → rc=OK
3. `python3 -c '...' ` SMTP roundtrip test above.
2026-04-19 12:07:47 +00:00
Viktor Barzin
c6784f87b5 [docs] Add NFS prerequisite runbook for nfs_volume module [ci skip]
## Context

`modules/kubernetes/nfs_volume` creates the K8s PV but NOT the underlying
directory on the Proxmox NFS host (`192.168.1.127:/srv/nfs/<subdir>`).
The first time a new consumer is added, the mount fails with
`mount.nfs: … No such file or directory` and the pod hangs in
ContainerCreating.

This bit us twice during the Wave 1/2 rollout — once for the mailserver
backup (code-z26) and again for the Roundcube backup (code-1f6). Both
times the fix was `ssh root@192.168.1.127 'mkdir -p /srv/nfs/<subdir>'`.
Rather than automate the SSH dependency into the module (which would
break hermeticity and fail for operators without host SSH), this runbook
documents the manual bootstrap step and the rationale.

Addresses bd code-yo4.

## This change

New file: `docs/runbooks/nfs-prerequisites.md`. Lists known consumers,
gives the copy-paste SSH command, and explains why auto-creation was
rejected (two options, neither worth the churn).

## What is NOT in this change

- Any automation of the bootstrap — runbook only
- Migration to `nfs-subdir-external-provisioner` — explicitly out of scope

## Test Plan

### Automated
```
$ cat docs/runbooks/nfs-prerequisites.md | head -5
# NFS Prerequisites for `modules/kubernetes/nfs_volume`

The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
```

### Manual Verification
Before the next stack adds a new `nfs_volume` consumer, read the runbook
and run the `ssh root@192.168.1.127 'mkdir -p ...'` step. First pod
reaches Ready within a minute of the PV creation.

Closes: code-yo4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:40:55 +00:00
Viktor Barzin
09c1105648 [mailserver] Delete postfix_cf_reference_DO_NOT_USE dead code [ci skip]
## Context

`infra/stacks/mailserver/modules/mailserver/variables.tf` carried a
130-line historical scaffolding variable
`postfix_cf_reference_DO_NOT_USE` containing a reference copy of an
older Postfix `main.cf` layout. The variable name itself signalled
dead-code intent ("DO_NOT_USE"), and a repo-wide
`grep -rn postfix_cf_reference infra/` confirmed zero consumers — no
module, no stack, no script, no doc ever referenced it. Carrying dead
Terraform variables costs nothing at runtime but actively wastes
reviewer attention on every `git blame`, drives up `variables.tf` read
time, and lets drift calcify.

Trade-offs considered:
- Keep it "just in case" → rejected; the file it mirrored
  (`/usr/share/postfix/main.cf.dist`) is already canonical upstream and
  reproducible inside any docker-mailserver container.
- Move it to a comment block → rejected; same noise cost, no value
  over deletion (authoritative source is in the image).

## This change

Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }`
block (136 lines incl. trailing blank). No other variable touched, no
resource touched, no comment elsewhere touched. `variables.tf` now
contains only the single live variable `postfix_cf` that is actually
consumed by the module.

## What is NOT in this change

- No Terraform state modification — variable was never read, so state
  has no record of it.
- No Postfix runtime behaviour change — `postfix_cf` (the live one) is
  untouched.
- No fix for the pre-existing `kubernetes_deployment.mailserver` /
  `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces
  independently. Those 2 in-place updates are known and tracked
  separately; this commit explicitly avoids conflating cleanup with
  drift resolution.
- No apply needed — pure source hygiene.

## Test Plan

### Automated

Reference check before edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" {
```
(single match — the declaration itself)

Reference check after edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
(no matches)
```

`terragrunt validate` (from `infra/stacks/mailserver/`):
```
Success! The configuration is valid, but there were some
validation warnings as shown above.
```
(warnings are pre-existing `kubernetes_namespace` → `_v1` deprecation
notices, unrelated)

`terragrunt plan` (from `infra/stacks/mailserver/`):
```
  # module.mailserver.kubernetes_deployment.mailserver will be updated in-place
  # module.mailserver.kubernetes_service.mailserver will be updated in-place
Plan: 0 to add, 2 to change, 0 to destroy.
```
Both in-place updates are the known pre-existing drift
(volume_mount ordering + stale `metallb.io/ip-allocated-from-pool`
annotation). No change is attributable to this commit — the dead
variable was never referenced, so removing it leaves state untouched.

### Manual Verification

1. `cd infra/stacks/mailserver/modules/mailserver/`
2. `grep -c postfix_cf_reference variables.tf` → expected `0`
3. `wc -l variables.tf` → expected `39` (was `175`; 136 lines removed
   including the trailing blank after the EOT)
4. Open `variables.tf` → expected: only `variable "postfix_cf"` remains
5. `cd ../..` (stack root) → `terragrunt validate` → expected:
   `Success! The configuration is valid`
6. `terragrunt plan` → expected: `Plan: 0 to add, 2 to change, 0 to
   destroy.` (the 2 are the pre-existing drift, not from this commit).

Closes: code-o3q

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:05:44 +00:00
Viktor Barzin
b2d2a5bb1c [docs] Document Fail2ban-disabled rationale (CrowdSec is policy) [ci skip]
## Context

An audit of the mailserver stack raised the question: why is Fail2ban
disabled in the docker-mailserver deployment? The setting
`ENABLE_FAIL2BAN = "0"` lives in the env ConfigMap at
`stacks/mailserver/modules/mailserver/main.tf:68` with no documented
rationale, which made the decision look accidental rather than
deliberate.

The decision is deliberate: CrowdSec is the cluster-wide bouncer for
SSH, HTTP, and SMTP/IMAP brute-force defence. It already tails
`postfix` + `dovecot` logs via the installed collections and enforces
decisions at the LB/firewall tier with real client IPs preserved by
`externalTrafficPolicy: Local` on the dedicated MetalLB IP. Enabling
Fail2ban in-pod would duplicate that response path — two systems
racing to ban the same offender from different enforcement points,
iptables churn inside the container, and a split audit trail across
two decision stores. User decision 2026-04-18: keep disabled, document
the decision so the next auditor doesn't have to re-derive it.
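
For an auditor who'd rather verify than re-derive, a hedged spot-check
(run `cscli` wherever the CrowdSec agent lives; the collection names are
assumptions):

```
$ cscli collections list | grep -Ei 'postfix|dovecot'   # mail log parsers installed?
$ cscli decisions list                                  # active bans at the enforcement tier
```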

## This change

Adds a new subsection "Fail2ban Disabled (CrowdSec is the Policy)" to
the Security section of `docs/architecture/mailserver.md`, placed
immediately after the existing CrowdSec Integration block. The
paragraph cites `stacks/mailserver/modules/mailserver/main.tf:68`
(where `ENABLE_FAIL2BAN = "0"` lives) and explains why duplicating the
layer would make things worse, not better. Pure docs — no Terraform
touched.

## Test Plan

### Automated
None — docs-only change. No tests, lint, or type checks apply to
markdown prose.

### Manual Verification
1. `less infra/docs/architecture/mailserver.md` — locate the Security
   section; confirm the new "Fail2ban Disabled (CrowdSec is the
   Policy)" subsection appears between "CrowdSec Integration" and
   "Rspamd".
2. Render on GitHub or via a markdown previewer; confirm the inline
   link to `main.tf` resolves and the paragraph reads cleanly.
3. `grep -n 'ENABLE_FAIL2BAN' infra/stacks/mailserver/modules/mailserver/main.tf`
   — confirm it still reports the value on line 68, matching the
   citation in the doc.

Closes: code-zhn

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:47:59 +00:00
Viktor Barzin
b9e9c3f084 [mailserver] Update SPF + docs for Brevo migration [ci skip]
## Context

Outbound mail relay migrated from Mailgun EU to Brevo EU on 2026-04-12 when
variables.tf:6 of the mailserver stack was switched to `smtp-relay.brevo.com:587`.
Postfix immediately began using Brevo for user mail — but the SPF TXT record
at viktorbarzin.me was left pointing at `include:mailgun.org -all`, so every
Brevo-relayed message failed SPF alignment and was spam-foldered or
DMARC-quarantined by Gmail/Outlook.

Observed on 2026-04-18 via `dig TXT viktorbarzin.me @1.1.1.1`:

    "v=spf1 include:mailgun.org -all"  <-- wrong sender network

User decision (2026-04-18): switch to `v=spf1 include:spf.brevo.com ~all`.
Soft-fail (`~all`) is intentional during cutover — keeps unauthorized Brevo
sends quarantined rather than outright rejected while we validate Brevo's
sending IPs + rate limits for real user mail. Tighten to `-all` once the
relay is proven stable.

The docs in `docs/architecture/mailserver.md` still described the old
Mailgun-based configuration (Overview paragraph, DNS table, Vault secrets
table). Per `infra/.claude/CLAUDE.md` rule "Update docs with every change",
those are updated in the same commit.

## This change

Coupled commit covering beads tasks code-q8p (SPF) + code-9pe (docs):

1. `stacks/cloudflared/modules/cloudflared/cloudflare.tf` — SPF TXT content
   flipped from `include:mailgun.org -all` to `include:spf.brevo.com ~all`,
   with an inline comment pointing at the mailserver docs for rationale.
2. `docs/architecture/mailserver.md` —
   - Last-updated stamp moved to 2026-04-18 with the cutover note.
   - Overview paragraph now says "relays through Brevo EU" (was Mailgun).
   - DNS table SPF row reflects the new value plus an annotated history
     note ("was include:mailgun.org -all until 2026-04-18").
   - DMARC row now calls out the intended `dmarc@viktorbarzin.me` rua
     target and flags that the current live record still points at
     e21c0ff8@dmarc.mailgun.org, tracked under follow-up code-569.
   - Vault secrets table: `mailserver_sasl_passwd` relabelled as Brevo
     relay credentials; `mailgun_api_key` annotated as retained for the
     E2E roundtrip probe only (inbound delivery testing, not user mail).

Apply was scoped with `-target=module.cloudflared.cloudflare_record.mail_spf`
to avoid sweeping up two unrelated pre-existing drifts that the Terraform
state shows on this stack: the DMARC + mail._domainkey_rspamd records are
stored on Cloudflare as RFC-compliant split TXT strings (>255 bytes), and
a naive refresh+apply would normalize them in the state back to single
strings. Those drifts are semantically equivalent (DNS concatenates
adjacent TXT strings at resolution time) and are out of scope for this
commit — they'll be handled under their own ticket.

## What is NOT in this change

- DMARC `rua=mailto:dmarc@viktorbarzin.me` cutover — that's code-569 (M1),
  still using the legacy `e21c0ff8@dmarc.mailgun.org` + ondmarc addresses
  in the live record.
- DMARC/DKIM TXT multi-string state reconciliation on `mail_dmarc` and
  `mail_domainkey_rspamd` — pre-existing Cloudflare representation drift,
  untouched here.
- Removal of Mailgun references in history/decision sections of the docs,
  or the Mailgun-backed E2E roundtrip probe — probe still uses Mailgun API
  on purpose for inbound delivery testing (code-569 scope).
- Mailgun DKIM record `s1._domainkey` — left in place; still consumed by
  the roundtrip probe.
- Other pending items from the 2026-04-18 mail audit plan.

## Test Plan

### Automated

Targeted plan showed exactly one change, no other drift sneaking in:

    module.cloudflared.cloudflare_record.mail_spf will be updated in-place
      ~ content = "\"v=spf1 include:mailgun.org -all\""
             -> "\"v=spf1 include:spf.brevo.com ~all\""
    Plan: 0 to add, 1 to change, 0 to destroy.

Apply result:

    Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

DNS propagation verified on three independent resolvers immediately after
apply:

    $ dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"

    $ dig TXT viktorbarzin.me @8.8.8.8 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"

    $ dig TXT viktorbarzin.me @10.0.20.201 +short | grep spf   # Technitium primary
    "v=spf1 include:spf.brevo.com ~all"

### Manual Verification

Setup: nothing extra — change is already live (TF applied before commit
per home-lab convention; `[ci skip]` in title).

1. Confirm SPF is the Brevo-only record from an external resolver:

       dig TXT viktorbarzin.me @1.1.1.1 +short

   Expected: `"v=spf1 include:spf.brevo.com ~all"` — no Mailgun reference.

2. Send a test email via the mailserver (through Brevo relay) to a Gmail
   account and view the original headers:

       Authentication-Results: ... spf=pass smtp.mailfrom=viktorbarzin.me
       ...
       Received-SPF: Pass (google.com: domain of ... designates ... as
       permitted sender)

   Expected: `spf=pass` (it was `spf=fail` or `spf=softfail` before this
   change because the envelope sender IP was a Brevo IP not covered by
   `include:mailgun.org`).

3. Confirm no live Mailgun references in the mailserver doc:

       grep -n mailgun.org infra/docs/architecture/mailserver.md

   Expected: only annotated-history mentions — SPF "was ... until
   2026-04-18" and DMARC "current live record still points at
   e21c0ff8@dmarc.mailgun.org pending cutover". No claims of active
   Mailgun relay.

## Reproduce locally

    cd infra
    git pull
    dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    # expected: "v=spf1 include:spf.brevo.com ~all"

    # inspect the TF change:
    git show HEAD -- stacks/cloudflared/modules/cloudflared/cloudflare.tf

    # inspect the doc change:
    git show HEAD -- docs/architecture/mailserver.md

Closes: code-q8p
Closes: code-9pe

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:13:47 +00:00
Viktor Barzin
1a7f68fe5b [beads-server] Auto-dispatch agent beads via CronJobs
## Context

Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.

The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`

So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.

## Flow

```
  user: bd assign <id> agent
         │
         ▼
  Dolt @ dolt.beads-server.svc:3306  ◄──── every 2 min ────┐
         │                                                  │
         ▼                                                  │
  CronJob: beads-dispatcher                                 │
    1. GET beadboard/api/agent-status  (busy? skip)         │
    2. bd query 'assignee=agent AND status=open'            │
    3. bd update -s in_progress   (claim)                   │
    4. POST beadboard/api/agent-dispatch                    │
    5. bd note "dispatched: job=…"                          │
         │                                                  │
         ▼                                                  │
  claude-agent-service /execute                             │
    beads-task-runner agent runs; notes/closes bead         │
         │                                                  │
         ▼                                                  │
  done  ──► next tick picks up the next bead ───────────────┘

  CronJob: beads-reaper  (every 10 min)
    for bead (assignee=agent, status=in_progress, updated_at > 30 min):
      bd note   "reaper: no progress for Nm — blocking"
      bd update -s blocked
```
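
A condensed sketch of one dispatcher tick — not the shipped script, just
the shape of steps 1–5 built from the tools the image ships (`bd`, `jq`,
`curl`). URLs, JSON field names, env-var wiring, and the exact `bd` flag
placement are assumptions:

```
#!/bin/sh
set -eu
export HOME=/tmp    # bd may write state/lock files (see Deviations below)

# 1. skip this tick if the single-slot agent is busy (field name assumed)
busy=$(curl -sf "$BEADBOARD_URL/api/agent-status" | jq -r .busy)
[ "$busy" = "true" ] && exit 0

# 2. oldest open bead assigned to the sentinel user
id=$(bd query 'assignee=agent AND status=open' --json | jq -r '.[0].id // empty')
[ -z "$id" ] && exit 0

# 3. claim  4. dispatch  5. audit note
bd update "$id" -s in_progress
job=$(curl -sf -X POST "$BEADBOARD_URL/api/agent-dispatch" \
    -H "Authorization: Bearer $API_TOKEN" \
    -d "{\"bead\": \"$id\"}" | jq -r .job)
bd note "$id" "dispatched: job=$job"
```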

## Decisions

- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
  client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
  2-min poll cadence and a ~5-min average run, throughput tops out around
  12 beads/hour. Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
  Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
  `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
  Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
  the image-seeded file. The script copies it into `/tmp/.beads/` because bd
  may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
  When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
  `beads-task-runner` never trips the reaper. Failures trip it; pod crashes
  (in-memory job state lost) also trip it.

## What is NOT in this change

- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
  `cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.

## Deviations from plan

Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
  serializes `notes` as a string (not an array), and every `bd note` bumps
  `updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU
  `-d` and the image has python3; see the sketch after this list.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.
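
For illustration, the kind of `python3` parse this refers to (the
timestamp literal and field shape are hypothetical):

```
$ python3 -c "
from datetime import datetime, timezone
t = datetime.fromisoformat('2026-04-18T22:35:46+00:00')  # hypothetical updated_at
print((datetime.now(timezone.utc) - t).total_seconds() / 60)  # age in minutes
"
```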

## Test plan

### Automated

```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!

$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1)  # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.

$ terraform fmt main.tf
# (no output — already formatted)
```

### Manual verification

1. **Apply**
   ```
   vault login -method=oidc
   cd infra/stacks/beads-server
   scripts/tg apply
   ```
   Expect: `kubernetes_config_map.beads_metadata`,
   `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
   created. No changes to existing resources.

2. **CronJobs exist with right schedule**
   ```
   kubectl -n beads-server get cronjob
   ```
   Expect `beads-dispatcher  */2 * * * *` and `beads-reaper  */10 * * * *`,
   both with `SUSPEND=False`.

3. **End-to-end smoke**
   ```
   bd create "auto-dispatch smoke test" \
       -d "Read /etc/hostname inside the agent sandbox and close." \
       --acceptance "bd note includes 'hostname=' line and bead is closed."
   bd assign <new-id> agent
   # within 2 min:
   bd show <new-id> --json | jq '{status, notes}'
   ```
   Expect notes to contain `auto-dispatcher claimed at …` and
   `dispatched: job=<uuid>`, status `in_progress`.

4. **Reaper smoke**
   Assign + dispatch a long bead, then
   `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
   30 min + one reaper tick, `bd show <id>` shows `blocked` with a
   `reaper: no progress for Nm — blocking` note.

5. **Kill switch**
   ```
   cd infra/stacks/beads-server
   scripts/tg apply -var=beads_dispatcher_enabled=false
   kubectl -n beads-server get cronjob
   ```
   Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
   nothing happens within 5 min. Re-apply with `=true` to re-enable.

Runbook with all above plus reaper semantics + design choices at
`infra/docs/runbooks/beads-auto-dispatch.md`.

Closes: code-8sm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:35:46 +00:00
Viktor Barzin
a24cf8c689 [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.

Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.
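
A hedged re-verification of both halves of the fix — pod discovery by
name-grep is a convenience here, not the canonical selector:

```
$ POD=$(kubectl -n authentik get pods -o name | grep ak-outpost | head -1)
$ kubectl -n authentik get "$POD" \
    -o jsonpath='{.spec.containers[0].resources.limits.memory}'  # expect 2560Mi
$ kubectl -n authentik exec "$POD" -- df -h /dev/shm             # expect 2.0G
```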

Updates the post-mortem P1 row to capture this for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:14 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
b41528e564 [docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)
## Context

On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.

## Root cause

The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.

## What's captured

- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
  red-herring suspicion of the new Rybbit Worker, cached sessions masking
  the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
  on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
  per-user symptoms fast; UI-managed Authentik config is invisible to git)

## Follow-ups not in this commit

- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
  (see discussion in beads code-zru)

Closes: code-zru

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00
Viktor Barzin
6e19dce99e [docs] automated-upgrades: document long-lived OAuth + expiry monitoring
Adds:
- the `claude_oauth_token` Vault entries to the secrets table
- a new "OAuth token lifecycle" section explaining the two CLI auth
  modes (`claude login` vs `claude setup-token`) and why we picked the
  latter for headless use
- the Ink 300-col PTY gotcha from today's harvest
- the monitoring/rotation playbook for the new expiry alerts
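
Roughly the headless flow being documented, as a sketch — `claude
setup-token` is the CLI mode named above; the Vault mount/path is an
assumption, not the documented location:

```
$ claude setup-token    # one-time interactive mint of a long-lived OAuth token
$ vault kv put secret/claude_oauth_token token='<paste>'   # assumed mount/path
```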

Follow-up to 8a054752 and 50dea8f0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:00:07 +00:00
Viktor Barzin
69fbd0ffd6 [docs] Update auto-upgrade docs — new HTTP auth path + n8n expression gotcha
Replaces the stale "Dev VM SSH key" secret entry with the current
`claude-agent-service` bearer token path (synced to both consumer +
caller namespaces). Adds an "n8n workflow gotchas" section documenting:

1. The workflow is DB-state, not Terraform-managed — the JSON in the
   repo is a backup, not authoritative.
2. Header-expression syntax: `=Bearer {{ $env.X }}` works, JS concat
   `='Bearer ' + $env.X` does NOT — costs silent 401s.
3. `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` requirement.
4. 401-troubleshooting steps and the UPDATE pattern for in-place
   workflow patches.

Follow-up to 99180bec which fixed the actual pipeline break.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:42:11 +00:00
Viktor Barzin
42f1c3cf4f [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP
## Context

The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.

## This change

Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches the API token
  instead of an SSH key; a curl POST to /execute + poll of /jobs/{id}
  replaces the SSH invocation (pattern sketched after this list)
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
  construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
  Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
  secret/n8n)
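
The shared HTTP pattern behind all three migrations, as a hedged sketch —
the in-cluster service DNS name and JSON field names are assumptions:

```
JOB=$(curl -sf -X POST -H "Authorization: Bearer $API_TOKEN" \
    -d '{"prompt": "..."}' \
    http://claude-agent-service.claude-agent/execute | jq -r .id)
curl -sf -H "Authorization: Bearer $API_TOKEN" \
    "http://claude-agent-service.claude-agent/jobs/$JOB"
```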

Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated

## What is NOT in this change

- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)

[ci skip]

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 10:12:02 +00:00
Viktor Barzin
18338a883f [ci skip] cleanup: remove e2e test file 2026-04-17 22:06:24 +00:00
Claude Agent
842646ea4f [ci skip] e2e: test commit from claude-agent-service 2026-04-17 22:03:50 +00:00
Viktor Barzin
65b0f30d5e [docs] Update anti-AI and rybbit docs after rewrite-body removal
- Anti-AI: 5-layer → 3 active layers (bot-block, X-Robots-Tag, tarpit)
- Layer 3 (trap links via rewrite-body) removed — Yaegi v3 incompatible
- Rybbit analytics now injected via Cloudflare Worker (HTMLRewriter)
- strip-accept-encoding middleware removed from all references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:43:13 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.
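
A repo-side sanity check in the spirit of the greps used elsewhere in
this log (expected outcome inferred from the change, not a recorded
transcript):

```
$ grep -rn 'rewrite-body\|rewritebody' infra/ --include='*.tf'
# expect: only `the-ccsn/traefik-plugin-rewritebody` references — no bare
# `rewrite-body` plugin names left in the 3 middleware locations
```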

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00