infra

Author	SHA1	Message	Date
Viktor Barzin	4ec40ea804	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service, fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull- through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_ integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull- through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end- to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:34 +00:00
Viktor Barzin	a3024d1f51	[docs] Forgejo registry image-rebuild runbook Companion to forgejo-registry-breakglass.md but for the more common case: the Forgejo registry is healthy as a whole, but one image's manifest/blob references are broken (orphan child, half-pushed upload, retention-vs-pull race). The RegistryManifestIntegrityFailure alert annotation already points here. Mirrors registry-rebuild-image.md (the registry-private equivalent) in structure: confirm via probe + curl, delete broken version through Forgejo API, rebuild via Woodpecker manual run, force consumers to re-pull, verify integrity recovery. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:33 +00:00
Viktor Barzin	fbb41eff9d	[ci] Phase 1: infra-ci dual-push + break-glass tarball Adds Forgejo as a second push target on the build-ci-image pipeline and saves the just-pushed image as a gzipped tarball on the registry VM disk (/opt/registry/data/private/_breakglass/) so we can recover infra-ci with `ctr images import` if both registries are down. * Dual-push: registry.viktorbarzin.me:5050/infra-ci AND forgejo.viktorbarzin.me/viktor/infra-ci, in the same woodpeckerci/plugin-docker-buildx step. Same image bytes; the Forgejo integrity probe (every 15min) catches any divergence. * Break-glass step: SSHes to 10.0.20.10, docker pulls + saves + gzips, keeps last 5 tarballs (latest symlink). Failure-tolerant so a transient registry blip doesn't fail the build pipeline. * Runbook docs/runbooks/forgejo-registry-breakglass.md documents the recovery flow (when to use, scp+ctr import, node cordon, underlying-issue fix). Tarball mirrors to Synology automatically through the existing daily offsite-sync-backup job — no new sync wiring needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:33 +00:00
Viktor Barzin	f793a5f50b	[forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry Stage 1 of moving private images off the registry:2 container at registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption 3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk — pods still pull from the existing registry until Phase 3. What changes: * Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi). Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive, v11 default-on). * ingress_factory: max_body_size variable was declared but never wired in after the nginx→Traefik migration. Now creates a per-ingress Buffering middleware when set; default null = no limit (preserves existing behavior). Forgejo ingress sets max_body_size=5g to allow multi-GB layer pushes. * Cluster-wide registry-credentials Secret: 4th auths entry for forgejo.viktorbarzin.me, populated from Vault secret/viktor/ forgejo_pull_token (cluster-puller PAT, read:package). Existing Kyverno ClusterPolicy syncs cluster-wide — no policy edits. * Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls). Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh for existing nodes. * Forgejo retention CronJob (0 4 * * ): keeps newest 10 versions per package + always :latest. First 7 days dry-run (DRY_RUN=true); flip the local in cleanup.tf after log review. Forgejo integrity probe CronJob (/15): same algorithm as the existing registry-integrity-probe. Existing Prometheus alerts (RegistryManifestIntegrityFailure et al) made instance-aware so they cover both registries during the bake. Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/. Operational note — the apply order is non-trivial because the new Vault keys (forgejo_pull_token, forgejo_cleanup_token, secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the kyverno + monitoring + forgejo stacks. The setup runbook documents the bootstrap sequence. Phase 1 (per-project dual-push pipelines) follows in subsequent commits. Bake clock starts when the last project goes dual-push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:33 +00:00
Viktor Barzin	4c8d12229f	mailserver: split healthcheck path off PROXY-aware listeners + book-search uses ClusterIP Two coordinated fixes for the same root cause: Postfix's smtpd_upstream_proxy_protocol listener fatals on every HAProxy health probe with `smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported for ai_socktype` — the daemon respawns get throttled by postfix master, and real client connections that land mid-respawn time out. We saw this as ~50% timeout rate on public 587 from inside the cluster. Layer 1 (book-search) — stacks/ebooks/main.tf: SMTP_HOST mail.viktorbarzin.me → mailserver.mailserver.svc.cluster.local Internal services should use ClusterIP, not hairpin through pfSense+HAProxy. 12/12 OK in <28ms vs ~6/12 timeouts on the public path. Layer 2 (pfSense HAProxy) — stacks/mailserver + scripts/pfsense-haproxy-bootstrap.php: Add 3 non-PROXY healthcheck NodePorts to mailserver-proxy svc: 30145 → pod 25 (stock postscreen) 30146 → pod 465 (stock smtps) 30147 → pod 587 (stock submission) HAProxy uses `port <healthcheck-nodeport>` (per-server in advanced field) to redirect L4 health probes to those ports while real client traffic keeps going to 30125-30128 with PROXY v2. Result: 0 fatals/min (was 96), 30/30 probes OK on 587, e2e roundtrip 20.4s. Inter dropped 120000 → 5000 since log-spam concern is gone. `option smtpchk EHLO` was tried first but flapped against postscreen (multi-line greet + DNSBL silence + anti-pre-greet detection trip HAProxy's parser → L7RSP). Plain TCP accept-on-port check is sufficient for both submission and postscreen. Updated docs/runbooks/mailserver-pfsense-haproxy.md to reflect the new healthcheck path and mark the "Known warts" entry as resolved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-05 19:45:33 +00:00
Viktor Barzin	134d6b9a82	vault runbook + raft/HA stuck-leader alerts Post-2026-04-22 Step 5 deliverables: - docs/runbooks/vault-raft-leader-deadlock.md — safe pod-restart sequence that avoids zombie containerd-shim + kernel NFS corruption, qm reset no-op gotcha, boot-order gotcha. - prometheus_chart_values.tpl — VaultRaftLeaderStuck + VaultHAStatusUnavailable. Silent until vault telemetry scraping lands (tracked as beads code-vkpn). Epic for moving vault off NFS tracked as beads code-gy7h.	2026-04-22 12:44:46 +00:00
Viktor Barzin	34ee282d88	[ci] Auto-sync modules/docker-registry/* to registry VM + runbook docs Replaces the manual scp+bounce sequence that landed registry:2.8.3 on 10.0.20.10 today (see commit `7cb44d72` + nginx-DNS-trap in runbook). Addresses the "no repeat manual fixes" preference — future changes to docker-compose.yml / fix-broken-blobs.sh / nginx_registry.conf / config-private.yml / cleanup-tags.sh now deploy through CI. Pipeline (.woodpecker/registry-config-sync.yml) mirrors pve-nfs-exports-sync.yml: ssh-keyscan pin, scp the whole managed set, bounce compose only when compose-visible files changed, always restart nginx after a compose bounce (critical — nginx caches upstream DNS), end with a dry-run fix-broken-blobs.sh to catch regressions. Credentials: - Woodpecker repo-secret `registry_ssh_key` (events: push, manual) - Mirror at Vault `secret/woodpecker/registry_ssh_key` (private_key / public_key / known_hosts_entry) - Public key on /root/.ssh/authorized_keys on 10.0.20.10 - Key label: woodpecker-registry-config-sync Runbook updated with "Auto-sync pipeline" section pointing at the new flow + manual override command. Closes: code-3vl Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:32:12 +00:00
Viktor Barzin	6e96b436b1	[docs] Capture nginx stale-DNS trap in registry-vm runbook Discovered during the 2026-04-19 registry:2.8.3 pin deploy: nginx caches its upstream DNS at startup and does NOT re-resolve after registry-* containers are recreated. Symptom was /v2/_catalog returning {"repositories": []} and /v2/ returning 200 without auth — nginx was forwarding to a stale IP that a different backend container now owns. Fix is always 'docker restart registry-nginx' after any registry-* bounce. Captured in registry-vm.md so future manual operators and the coming auto-sync pipeline (beads code-3vl) both encode the step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:24:09 +00:00
Viktor Barzin	7cb44d7264	[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery Second identical registry incident on 2026-04-19 (first 2026-04-13): the infra-ci:latest image index resolved to child manifests whose blobs had been garbage-collected out from under the index. Pipelines P366→P376 all exited 126 "image can't be pulled". Hot fix (`a05d63e` / `6371e75` / `c113be4`) restored green CI but left the underlying bug unaddressed. Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at 02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly (distribution/distribution#3324 class). Nothing verified pushes end-to-end; nothing probed the registry for fetchability; nothing caught orphan indexes. Phase 1 — Detection: - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity step walks the just-pushed manifest (index + children + config + every layer blob) via HEAD and fails the pipeline on any non-200. Catches broken pushes at the source. - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and three alerts — RegistryManifestIntegrityFailure, RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the "registry serves 404 for a tag that exists" gap that masked the incident for 2+ hours. - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause, timeline, monitoring gaps, permanent fix. Phase 2 — Prevention: - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3 across all six registry services. Removes the floating-tag footgun. - modules/docker-registry/fix-broken-blobs.sh: new scan walks every _manifests/revisions/sha256/<digest> that is an image index and logs a loud WARNING when a referenced child blob is missing. Does NOT auto- delete — deleting a published image is a conscious decision. Layer-link scan preserved. Phase 3 — Recovery: - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds don't need a cosmetic Dockerfile edit (matches convention from pve-nfs-exports-sync.yml). - docs/runbooks/registry-rebuild-image.md: exact command sequence for diagnosing + rebuilding after an orphan-index incident, plus a fallback for building directly on the registry VM if Woodpecker itself is down. - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md: cross-references to the new runbook. Out of scope (verified healthy or intentionally deferred): - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s). - Registry HA/replication (single-VM SPOF is a known architectural choice; Synology offsite covers RPO < 1 day). - Diun exclude for registry:2 — not applicable; Diun only watches k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose. Verified locally: - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly flags both orphan layer links and orphan OCI-index children. - terraform fmt + validate on stacks/monitoring: success (only unrelated deprecation warnings). - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and modules/docker-registry/docker-compose.yml: both parse clean. Closes: code-4b8 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:08:28 +00:00
Viktor Barzin	5a0b24f54e	[docs] TrueNAS decommission cleanup — remove references from active docs TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been served by Proxmox host (192.168.1.127) since. This commit scrubs remaining references from active docs. VM 9000 itself remains on PVE in stopped state pending user decision on deletion. In-session cleanup already landed: reverse-proxy ingress + Cloudflare record removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key} purged; homepage_credentials.reverse_proxy.truenas_token removed; truenas_homepage_token variable + module deleted; Loki + Dashy cleaned; config.tfvars deprecated DNS lines removed; historical-name comment added to the nfs-truenas StorageClass (48 bound PVs, immutable name — kept). Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally untouched — they describe state at a point in time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:55:43 +00:00
Viktor Barzin	f6685a23a9	[dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E) Workstream E of the DNS hardening push. Two independent pfSense-side changes to eliminate single-point DNS failures and the unauthenticated RFC 2136 update vector. Part 1 — Multi-IP DHCP option 6 - Before: clients on 10.0.10/24 got only 10.0.10.1; clients on 10.0.20/24 got only 10.0.20.1. Internal resolver outage == cluster-wide DNS dark. - After: - 10.0.10/24 -> [10.0.10.1, 94.140.14.14] - 10.0.20/24 -> [10.0.20.1, 94.140.14.14] - 192.168.1/24 deliberately untouched (served by TP-Link AP, not pfSense Kea — pfSense WAN DHCP is disabled); already ships [192.168.1.2, 94.140.14.14] so the end state is consistent across all three subnets. - Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and $cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's services_kea4_configure() implodes the array into "data: a, b" on the "domain-name-servers" option-data entry (services.inc L1214). - Verified: - DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" + "nameserver 94.140.14.14" after networkd renew. - k8s-node1 (10.0.20.101) same after networkctl reload + systemd-resolved restart. - Fallback drill on k8s-node1: `ip route add blackhole 10.0.20.1/32`; dig @10.0.20.1 google.com -> "no servers could be reached"; dig @94.140.14.14 google.com -> 216.58.204.110; system resolver (getent hosts) succeeds via the fallback IP. Blackhole route removed. Part 2 — TSIG-signed Kea DHCP-DDNS - Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated update vector was latent (DDNS wiring in Kea DHCP4 is actually off today — "DDNS: disabled" in dhcpd.log) but would activate as soon as anyone turned on ddnsupdate on LAN/OPT1. - Generated HMAC-SHA256 secret, base64-encoded 32 random bytes. - Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27). - Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium instances via /api/settings/set (tsigKeys[]). - Updated kea-dhcp-ddns.conf on pfSense with tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …} and key-name: kea-ddns on each forward-ddns / reverse-ddns domain. Pre-change backup at /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig. - Configured viktorbarzin.lan + 10.0.10.in-addr.arpa + 20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary: - update = UseSpecifiedNetworkACL - updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2] - updateSecurityPolicies = [{tsigKeyName: kea-ddns, domain: "*.<zone>", allowedTypes: [ANY]}] Technitium requires BOTH a source-IP match AND a valid TSIG signature. - Verified TSIG end-to-end: - Signed A-record update from pfSense -> "successfully processed", dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo: hmac-sha256; TSIG Error: NoError; RCODE: NoError"). - Signed PTR update same zone pattern -> dig -x returns tsig-test FQDN. - Unsigned update from pfSense IP (in ACL) -> "update failed: REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic Updates Security Policy"). - Test records cleaned up via signed nsupdate. Safety - pfSense config backup: /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip (145898 bytes, pre-change snapshot — keep 30d). - DDNS config backup: /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig. - TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on pfSense; not committed to git. Docs - architecture/dns.md: zone dynamic-updates section records the TSIG policy; Incident History gets a WS E entry. - architecture/networking.md: DHCP Coverage table now shows the DNS option 6 values per subnet; pfSense block notes the TSIG-signed DDNS and config backup path. - runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers key rotation, emergency bypass, and enforcement-verification. Closes: code-o6j Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:12:23 +00:00
Viktor Barzin	33d934c32f	[dns] pfSense: Unbound replaces dnsmasq (WS D) Replace pfSense dnsmasq (DNS Forwarder) with Unbound (DNS Resolver) so LAN-side .viktorbarzin.lan resolution survives a full Kubernetes outage. Out-of-band pfSense changes (not in Terraform; pfSense config.xml is VM-managed). Backup at /cf/conf/config.xml.2026-04-19-pre-unbound on-box + /mnt/backup/pfsense/ nightly. - <unbound> enabled; listens on lan, opt1, wan, lo0 - <forwarding> on + <forward_tls_upstream> → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853, SNI cloudflare-dns.com) - <dnssec>, <prefetch>, <prefetchkey>, <dnsrecordcache> (serve-expired) - msgcachesize=256MB, cache_max_ttl=7d, cache_min_ttl=60s - custom_options: auth-zone viktorbarzin.lan master=10.0.20.201 fallback-enabled=yes for-upstream=yes + serve-expired-ttl=259200 - <dnsmasq><enable> removed; dnsmasq stopped - NAT rdr WAN UDP 53 → 10.0.20.201 removed (Unbound listens on WAN now) - Technitium zone viktorbarzin.lan: zoneTransferNetworkACL set to 10.0.20.1, 10.0.10.1, 192.168.1.2 (pfSense source IPs) Verified: - unbound-control list_auth_zones: viktorbarzin.lan serial 49367 - dig @127.0.0.1 idrac.viktorbarzin.lan returns 192.168.1.4 with aa flag (served from auth-zone, not forwarded) - dig @127.0.0.1 example.com +dnssec returns ad flag (DoT + validated) - /var/unbound/viktorbarzin.lan.zone has ~114 records - K8s outage drill passed: scale technitium=0 → dig still returns via WAN/LAN/OPT1 interfaces → scale restored - LAN/management/K8s VLAN clients all resolve via pfSense 192.168.1.2 / 10.0.10.1 / 10.0.20.1 respectively Trade-off: Technitium Split Horizon hairpin for 192.168.1.x → *.viktorbarzin.me (non-proxied) no longer runs via pfSense (Unbound answers locally). Fix if it bites: switch service to proxied or add Unbound Host Override. Documented in docs/runbooks/pfsense-unbound.md. Closes: code-k0d	2026-04-19 15:52:41 +00:00
Viktor Barzin	eb6ceac5f5	[dns] static-client DNS — Proxmox host, registry VM dual-resolver setup (WS F) Fixes single-upstream DNS brittleness on non-DHCP hosts. Each host now has a primary internal resolver + external fallback (AdGuard) so DNS keeps working if the primary resolver IP is unreachable. New config: - Proxmox host (192.168.1.127): plain /etc/resolv.conf with nameserver 192.168.1.2 (pfSense LAN) + 94.140.14.14 (AdGuard). Previously: single nameserver 192.168.1.1 — could not resolve internal .lan names at all. Documented in docs/runbooks/proxmox-host.md. - Registry VM (10.0.20.10): systemd-resolved drop-in at /etc/systemd/resolved.conf.d/10-internal-dns.conf (DNS=10.0.20.1, FallbackDNS=94.140.14.14, Domains=viktorbarzin.lan) plus matching per-link nameservers in /etc/netplan/50-cloud-init.yaml. Previously: 1.1.1.1 + 8.8.8.8 only — image pulls referencing .lan hostnames would fail to resolve. Documented in docs/runbooks/registry-vm.md. - TrueNAS (10.0.10.15): host unreachable during this session ("No route to host" on 10.0.10.0/24). Deferred best-effort per WS F instructions; noted on the beads task. Both hosts have pre-change backups at /root/dns-backups/ for one-command rollback. Fallback behaviour was validated by routing each primary to a blackhole and confirming dig answered from the fallback. Both runbooks include the verified resolvectl / resolv.conf state, the fallback-test procedure, and the rollback steps. Closes: code-dw8	2026-04-19 15:43:49 +00:00
Viktor Barzin	9a21c0f065	[dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e). Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8. Technitium (WS A) - Primary deployment: add Kyverno lifecycle ignore_changes on dns_config (secondary/tertiary already had it) — eliminates per-apply ndots drift. - All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary was restarting near the ceiling; CPU limits stay off per cluster policy). - zone-sync CronJob: parse API responses, push status/failures/last-run and per-instance zone_count gauges to Pushgateway, fail the job on any create error (was silently passing). CoreDNS (WS B) - Corefile: add policy sequential + health_check 5s + max_fails 2 on root forward, health_check on viktorbarzin.lan forward, serve_stale 3600s/86400s on both cache blocks — pfSense flap no longer takes the cluster down; upstream outage keeps cached names resolving for 24h. - Scale deploy/coredns to 3 replicas with required pod anti-affinity on hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch resources); readiness gate asserts state post-apply. - PDB coredns with minAvailable=2. Observability (WS G) - Fix DNSQuerySpike — rewrite to compare against avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous dns_anomaly_avg_queries was computed from a per-pod /tmp file so always equalled the current value (alert could never fire). - New: DNSQueryRateDropped, TechnitiumZoneSyncFailed, TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch, CoreDNSForwardFailureRate. Post-apply readiness gate (WS H) - null_resource.technitium_readiness_gate runs at end of apply: kubectl rollout status on all 3 deployments (180s), per-pod /api/stats/get probe, zone-count parity across the 3 instances. Fails the apply on any check fail. Override: -var skip_readiness=true. Docs (WS I) - docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table, zone-sync metrics reference, why DNSQuerySpike was broken. - docs/runbooks/technitium-apply.md (new): what the gate checks, failure modes, emergency override. Out of scope for this commit (see beads follow-ups): - WS C: NodeLocal DNSCache (code-2k6) - WS D: pfSense Unbound replaces dnsmasq (code-k0d) - WS E: Kea multi-IP DHCP + TSIG (code-o6j) - WS F: static-client DNS fixes (code-dw8) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:53:41 +00:00
Viktor Barzin	e6e5fc5f17	[docs] Mailserver architecture — richer diagrams + steady-state accuracy [ci skip] ## Context After code-yiu Phases 1a–6 landed, `docs/architecture/mailserver.md` still carried the pre-HAProxy Mermaid diagram, a retired Dovecot-exporter component row, stale PVC names (`-proxmox` suffixes that were renamed `-encrypted` during the LUKS migration), a wrong probe schedule (claimed 10 min, actually 20 min), and a Mailgun-API claim for the probe (it's been on Brevo since code-n5l). The two-path architecture (external-via-HAProxy + intra-cluster-via-ClusterIP) that defines the current design wasn't visualised at all. ## This change Rewrote the Architecture Diagram section to show both ingress paths in one Mermaid flowchart, colour-coded: - External (orange): Sender → pfSense NAT → HAProxy → NodePort → alt PROXY listeners (2525/4465/5587/10993). - Intra-cluster (blue): Roundcube / probe → ClusterIP Service → stock listeners (25/465/587/993), no PROXY. - The pod subgraph shows both listener sets feeding the same Postfix / Rspamd / Dovecot / Maildir pipeline. - Security dotted edges: Postfix log stream → CrowdSec agent → LAPI → pfSense bouncer decisions. - Monitoring dotted edges: probe → Brevo HTTP → MX → pod → IMAP → Pushgateway/Uptime Kuma. Added a sequenceDiagram for the external SMTP roundtrip — walks through the wire-level handshake from external MTA → pfSense NAT → HAProxy TCP connect → PROXY v2 header write → kube-proxy SNAT → pod postscreen parse → smtpd banner. Makes the "how does the pod see the real IP despite SNAT?" question self-answering. Added a Port mapping table listing all 8 container listeners (4 stock + 4 alt) with their Service, NodePort, PROXY-required flag, and who uses each path. Replaces the ambiguous prose about "alt ports". Fixed stale bits: - Removed Dovecot Exporter row from Components (retired in code-1ik). - Added pfSense HAProxy row. - Probe schedule: every 10 min → every 20 min (`/20 * * `). - Probe API: Mailgun → Brevo HTTP. - PVC names: `-proxmox` → `-encrypted`* (all three); storage class `proxmox-lvm` → `proxmox-lvm-encrypted`. - Added `mailserver-backup-host` + `roundcube-backup-host` RWX NFS PVCs to the Storage table with backup flow pointer. - Expanded Troubleshooting → Inbound to include HAProxy health check + container-listener verification steps. - Secrets table: `brevo_api_key` now marked as used by both relay + probe; `mailgun_api_key` marked historical. Added a prominent UPDATE 2026-04-19 header to `docs/runbooks/mailserver-proxy-protocol.md` pointing future readers at the implemented state in `mailserver-pfsense-haproxy.md`. Research doc preserved as a decision record — it's the canonical "why not just pin the pod?" reference. ## What is NOT in this change - No Terraform changes; this is docs-only. - No changes to the runbook (`mailserver-pfsense-haproxy.md`) — it was already rewritten during Phase 6. ## Test Plan ### Automated ``` $ awk '/^```mermaid/ {c++} END{print c}' docs/architecture/mailserver.md 2 $ grep -c '\-encrypted' docs/architecture/mailserver.md 5 # PVC references normalised $ grep -c '\-proxmox' docs/architecture/mailserver.md 0 # no stale names left ``` ### Manual Verification Render `docs/architecture/mailserver.md` on GitHub or any Mermaid- capable viewer: 1. Top Architecture Diagram should show two labelled paths into the pod, colour-coded (orange = external, blue = intra-cluster). 2. Sequence diagram should show 10 numbered steps ending at Rspamd + Dovecot delivery. 3. Port Mapping table should make it obvious that the 4 alt container ports are only reachable via `mailserver-proxy` NodePort and require PROXY v2.	2026-04-19 12:40:53 +00:00
Viktor Barzin	43fe11fffc	[mailserver] Phase 6 — decommission MetalLB LB path [ci skip] ## Context (bd code-yiu) With Phase 4+5 proven (external mail flows through pfSense HAProxy + PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are obsolete. Phase 6 decommissions them and documents the steady-state architecture. ## This change ### Terraform (stacks/mailserver/modules/mailserver/main.tf) - `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`. - Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation. - Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP). - Port set unchanged — the Service still exposes 25/465/587/993 for intra-cluster clients (Roundcube pod, `email-roundtrip-monitor` CronJob) that hit the stock PROXY-free container listeners. - Inline comment documents the downgrade rationale + companion `mailserver-proxy` NodePort Service that now carries external traffic. ### pfSense (ops, not in git) - `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT rule references it post-Phase-4; keeping it would be misleading dead metadata. Reversible via WebUI + `php /tmp/delete-mailserver-alias.php` companion script (ad-hoc, not checked in — alias is just a Firewall → Aliases → Hosts entry). ### Uptime Kuma (ops) - Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202` → `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` / `... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy layer liveness). History retained (edit, not delete-recreate). ### Docs - `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten "Current state" section; now reflects steady-state architecture with two-path diagram (external via HAProxy / intra-cluster via ClusterIP). Phase history table marks Phase 6 ✅. Rollback section updated (no one-liner post-Phase-6; need Service-type re-upgrade + alias re-add). - `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound flow, CrowdSec section, Uptime Kuma monitors list, Decisions section (dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY v2"), Troubleshooting all updated. - `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph updated with new external path description; references the new runbook. ## What is NOT in this change - Removal of `10.0.20.202` from `cloudflare_proxied_names` or any reserved-IP tracking — wasn't there to begin with. The `metallb-system default` IPAddressPool (10.0.20.200-220) shows 2 of 19 available after this, confirming `.202` went back to the pool. - Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if someone re-introduces the MetalLB LB (see runbook "Rollback"). ## Test Plan ### Automated (verified pre-commit 2026-04-19) ``` # Service is ClusterIP with no EXTERNAL-IP $ kubectl get svc -n mailserver mailserver mailserver ClusterIP 10.103.108.217 <none> 25/TCP,465/TCP,587/TCP,993/TCP # 10.0.20.202 no longer answers ARP (ping from pfSense) $ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202' 2 packets transmitted, 0 packets received, 100.0% packet loss # MetalLB pool released the IP $ kubectl get ipaddresspool default -n metallb-system \ -o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}' 2 of 19 available # E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS $ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver ... Round-trip SUCCESS in 20.3s ... $ kubectl delete job probe-phase6 -n mailserver # pfSense mailserver alias removed $ ssh admin@10.0.20.1 'php -r "..." \| grep mailserver' (no output) ``` ### Manual Verification 1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new hostname `10.0.20.1`. 2. Roundcube login works (`https://mail.viktorbarzin.me/`). 3. Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe `postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in mailserver logs within ~10s. 4. CrowdSec should still see real client IPs in postfix/dovecot parsers (verify with `cscli alerts list` on next auth-fail event). ## Phase history (bd code-yiu) \| Phase \| Status \| Description \| \|---\|---\|---\| \| 1a \| ✅ ``ef75c02f`` \| k8s alt :2525 listener + NodePort Service \| \| 2 \| ✅ 2026-04-19 \| pfSense HAProxy pkg installed \| \| 3 \| ✅ ``ba697b02`` \| HAProxy config persisted in pfSense XML \| \| 4+5 \| ✅ ``9806d515`` \| 4-port alt listeners + HAProxy frontends + NAT flip \| \| 6 \| ✅ this commit \| MetalLB LB retired; 10.0.20.202 released; docs updated \| Closes: code-yiu	2026-04-19 12:36:11 +00:00
Viktor Barzin	ba697b02a2	[mailserver] Phase 2-3 — pfSense HAProxy bootstrap + runbook [ci skip] ## Context (bd code-yiu) Phase 2 (HAProxy on pfSense) and Phase 3 (persist config in pfSense XML so it lives in the nightly backup) of the PROXY-v2 migration. Test path only — listens on pfSense 10.0.20.1:2525 → k8s node NodePort :30125 → pod :2525 postscreen. Real client IP verified in maillog (`postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:...`), Phase 1a container plumbing is already live (commit `ef75c02f`). pfSense HAProxy config lives in `/cf/conf/config.xml` under `<installedpackages><haproxy>`. That file is captured daily by `scripts/daily-backup.sh` (scp → `/mnt/backup/pfsense/config-YYYYMMDD.xml`) and synced offsite to Synology. No new backup wiring needed — this commit documents the fact + adds the reproducer script. ## This change Two files, both additive: 1. `scripts/pfsense-haproxy-bootstrap.php` — idempotent PHP script that edits pfSense config.xml to add: - Backend pool `mailserver_nodes` with 4 k8s workers on NodePort 30125, `send-proxy-v2`, TCP health-check every 120000 ms (2 min). - Frontend `mailserver_proxy_test` listening on pfSense 10.0.20.1:2525 in TCP mode, forwarding to the pool. Uses `haproxy_check_and_run()` to regenerate `/var/etc/haproxy/haproxy.cfg` and reload HAProxy. Removes existing items with the same name before adding, so repeat runs converge on declared state. 2. `docs/runbooks/mailserver-pfsense-haproxy.md` — ops runbook covering current state, validation, bootstrap/restore, health checks, phase roadmap, and known warts (health-check noise + bind-address templating). ## What is NOT in this change - Phase 4 (NAT rdr flip for :25 from `<mailserver>` → HAProxy) — deferred. - Phase 5 (extend to 465/587/993 with alt listeners + Dovecot dual- inet_listener) — deferred. - Terraform for pfSense HAProxy pkg install — not possible (no Terraform provider for pfSense pkg management). Runbook documents the manual `pkg install` command. ## Test Plan ### Automated ``` $ ssh admin@10.0.20.1 'pgrep -lf haproxy; sockstat -l \| grep :2525' 64009 /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D www haproxy 64009 5 tcp4 :2525 :* $ ssh admin@10.0.20.1 "echo 'show servers state' \| socat /tmp/haproxy.socket stdio" \ \| awk 'NR>1 {print $4, $6}' node1 2 node2 2 node3 2 node4 2 # all UP $ python3 -c " import socket; s=socket.socket(); s.settimeout(10) s.connect(('10.0.20.1', 2525)) print(s.recv(200).decode()) s.send(b'EHLO persist-test.example.com\r\n') print(s.recv(500).decode()) s.send(b'QUIT\r\n'); s.close()" 220-mail.viktorbarzin.me ESMTP ... 250-mail.viktorbarzin.me 250-SIZE 209715200 ... 221 2.0.0 Bye $ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver --tail=50 \ \| grep smtpd-proxy.*CONNECT postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525 ``` Real client IP `[10.0.10.10]` visible (not the k8s-node IP after kube-proxy SNAT) → PROXY-v2 roundtrip confirmed. ### Manual Verification Trigger a pfSense reboot; after boot, HAProxy should auto-restart from the now-persisted config (`<enable>yes</enable>` in XML). Connection test above should still work. ## Reproduce locally 1. `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/` 2. `ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'` → rc=OK 3. `python3 -c '...' ` SMTP roundtrip test above.	2026-04-19 12:07:47 +00:00
Viktor Barzin	c6784f87b5	[docs] Add NFS prerequisite runbook for nfs_volume module [ci skip] ## Context `modules/kubernetes/nfs_volume` creates the K8s PV but NOT the underlying directory on the Proxmox NFS host (`192.168.1.127:/srv/nfs/<subdir>`). The first time a new consumer is added, the mount fails with `mount.nfs: … No such file or directory` and the pod hangs in ContainerCreating. This bit us twice during the Wave 1/2 rollout — once for the mailserver backup (code-z26) and again for the Roundcube backup (code-1f6). Both times the fix was `ssh root@192.168.1.127 'mkdir -p /srv/nfs/<subdir>'`. Rather than automate the SSH dependency into the module (which would break hermeticity and fail for operators without host SSH), this runbook documents the manual bootstrap step and the rationale. Addresses bd code-yo4. ## This change New file: `docs/runbooks/nfs-prerequisites.md`. Lists known consumers, gives the copy-paste SSH command, and explains why auto-creation was rejected (two options, neither worth the churn). ## What is NOT in this change - Any automation of the bootstrap — runbook only - Migration to `nfs-subdir-external-provisioner` — explicitly out of scope ## Test Plan ### Automated ``` $ cat docs/runbooks/nfs-prerequisites.md \| head -5 # NFS Prerequisites for `modules/kubernetes/nfs_volume` The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a path on the Proxmox NFS server (`192.168.1.127`). It does not create the underlying directory on the server. ``` ### Manual Verification Before the next stack adds a new `nfs_volume` consumer, read the runbook and run the `ssh root@192.168.1.127 'mkdir -p ...'` step. First pod reaches Ready within a minute of the PV creation. Closes: code-yo4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:40:55 +00:00
Viktor Barzin	09c1105648	[mailserver] Delete postfix_cf_reference_DO_NOT_USE dead code [ci skip] ## Context `infra/stacks/mailserver/modules/mailserver/variables.tf` carried a 130-line historical scaffolding variable `postfix_cf_reference_DO_NOT_USE` containing a reference copy of an older Postfix `main.cf` layout. The variable name itself signalled dead-code intent ("DO_NOT_USE"), and a repo-wide `grep -rn postfix_cf_reference infra/` confirmed zero consumers — no module, no stack, no script, no doc ever referenced it. Carrying dead Terraform variables costs nothing at runtime but actively wastes reviewer attention on every `git blame`, drives up `variables.tf` read time, and lets drift calcify. Trade-offs considered: - Keep it "just in case" → rejected; the file it mirrored (`/usr/share/postfix/main.cf.dist`) is already canonical upstream and reproducible inside any docker-mailserver container. - Move it to a comment block → rejected; same noise cost, no value over deletion (authoritative source is in the image). ## This change Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }` block (136 lines incl. trailing blank). No other variable touched, no resource touched, no comment elsewhere touched. `variables.tf` now contains only the single live variable `postfix_cf` that is actually consumed by the module. ## What is NOT in this change - No Terraform state modification — variable was never read, so state has no record of it. - No Postfix runtime behaviour change — `postfix_cf` (the live one) is untouched. - No fix for the pre-existing `kubernetes_deployment.mailserver` / `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces independently. Those 2 in-place updates are known and tracked separately; this commit explicitly avoids conflating cleanup with drift resolution. - No apply needed — pure source hygiene. ## Test Plan ### Automated Reference check before edit: ``` $ grep -rn postfix_cf_reference /home/wizard/code/infra/ infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" { ``` (single match — the declaration itself) Reference check after edit: ``` $ grep -rn postfix_cf_reference /home/wizard/code/infra/ (no matches) ``` `terragrunt validate` (from `infra/stacks/mailserver/`): ``` Success! The configuration is valid, but there were some validation warnings as shown above. ``` (warnings are pre-existing `kubernetes_namespace` → `_v1` deprecation notices, unrelated) `terragrunt plan` (from `infra/stacks/mailserver/`): ``` # module.mailserver.kubernetes_deployment.mailserver will be updated in-place # module.mailserver.kubernetes_service.mailserver will be updated in-place Plan: 0 to add, 2 to change, 0 to destroy. ``` Both in-place updates are the known pre-existing drift (volume_mount ordering + stale `metallb.io/ip-allocated-from-pool` annotation). No change is attributable to this commit — the dead variable was never referenced, so removing it leaves state untouched. ### Manual Verification 1. `cd infra/stacks/mailserver/modules/mailserver/` 2. `grep -c postfix_cf_reference variables.tf` → expected `0` 3. `wc -l variables.tf` → expected `39` (was `175`; 136 lines removed including the trailing blank after the EOT) 4. Open `variables.tf` → expected: only `variable "postfix_cf"` remains 5. `cd ../..` (stack root) → `terragrunt validate` → expected: `Success! The configuration is valid` 6. `terragrunt plan` → expected: `Plan: 0 to add, 2 to change, 0 to destroy.` (the 2 are the pre-existing drift, not from this commit). Closes: code-o3q Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 00:05:44 +00:00
Viktor Barzin	1a7f68fe5b	[beads-server] Auto-dispatch agent beads via CronJobs ## Context Until now, handing work to the in-cluster `beads-task-runner` agent required opening BeadBoard and clicking the manual Dispatch button on each bead. We want users to be able to describe work as a bead, set `assignee=agent`, and have the agent pick it up within a couple of minutes — no clicks. The existing pieces already provide everything we need: - `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock` - BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer - BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll - Dolt stores beads and is already in-cluster at `dolt.beads-server:3306` So the only missing component is a poller that ties them together. This commit adds that poller as two Kubernetes CronJobs — matching the existing infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than introducing n8n or in-service polling. ## Flow ``` user: bd assign <id> agent │ ▼ Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐ │ │ ▼ │ CronJob: beads-dispatcher │ 1. GET beadboard/api/agent-status (busy? skip) │ 2. bd query 'assignee=agent AND status=open' │ 3. bd update -s in_progress (claim) │ 4. POST beadboard/api/agent-dispatch │ 5. bd note "dispatched: job=…" │ │ │ ▼ │ claude-agent-service /execute │ beads-task-runner agent runs; notes/closes bead │ │ │ ▼ │ done ──► next tick picks up the next bead ───────────────┘ CronJob: beads-reaper (every 10 min) for bead (assignee=agent, status=in_progress, updated_at > 30 min): bd note "reaper: no progress for Nm — blocking" bd update -s blocked ``` ## Decisions - Sentinel assignee `agent` — free-form, no Beads schema change. Any bd client can set it (`bd assign <id> agent`). - Sequential dispatch — matches the service's `asyncio.Lock`. With a 2-min poll cadence and ~5-min average run, throughput is ~12 beads/hour. Parallelism is a separate plan. - Fixed agent `beads-task-runner` — read-only rails, matches the manual Dispatch button. Broader-privilege agents stay manual via BeadBoard UI. - Image reuse — the claude-agent-service image already ships `bd`, `jq`, `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling. Mirror `claude_agent_service_image_tag` locally; bump on rebuild. - ConfigMap-mounted `metadata.json` — declarative TF rather than reusing the image-seeded file. The script copies it into `/tmp/.beads/` because bd may touch the parent dir and ConfigMap mounts are read-only. - Kill switch (`beads_dispatcher_enabled`) — single bool, default true. When false, `suspend: true` on both CronJobs; manual Dispatch keeps working. - Reaper threshold 30 min — `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner` never trips the reaper. Failures trip it; pod crashes (in-memory job state lost) also trip it. ## What is NOT in this change - No Terraform apply — requires Vault OIDC + cluster access. Apply manually: `cd infra/stacks/beads-server && scripts/tg apply` - No change to `claude-agent-service/` (already ships bd/jq/curl) - No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused) - No change to the `beads-task-runner` agent definition (rails unchanged) - Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan. ## Deviations from plan Minor, documented in code comments: - Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd serializes `notes` as a string (not an array), and every `bd note` bumps `updated_at` — equivalent for the reaper's purpose. - ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU `-d` and the image has python3. - `HOME=/tmp` set as a safety net — bd may try to write state/lock files. ## Test plan ### Automated ``` $ cd infra/stacks/beads-server && terraform init -backend=false Terraform has been successfully initialized! $ terraform validate Warning: Deprecated Resource (kubernetes_namespace → v1) # pre-existing, unrelated Success! The configuration is valid, but there were some validation warnings as shown above. $ terraform fmt stacks/beads-server/main.tf # (no output — already formatted) ``` ### Manual verification 1. Apply ``` vault login -method=oidc cd infra/stacks/beads-server scripts/tg apply ``` Expect: `kubernetes_config_map.beads_metadata`, `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper` created. No changes to existing resources. 2. CronJobs exist with right schedule ``` kubectl -n beads-server get cronjob ``` Expect `beads-dispatcher /2 * * ` and `beads-reaper /10 * * * `, both with `SUSPEND=False`. 3. End-to-end smoke* ``` bd create "auto-dispatch smoke test" \ -d "Read /etc/hostname inside the agent sandbox and close." \ --acceptance "bd note includes 'hostname=' line and bead is closed." bd assign <new-id> agent # within 2 min: bd show <new-id> --json \| jq '{status, notes}' ``` Expect notes to contain `auto-dispatcher claimed at …` and `dispatched: job=<uuid>`, status `in_progress`. 4. Reaper smoke Assign + dispatch a long bead, then `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within 30 min + one reaper tick, `bd show <id>` shows `blocked` with a `reaper: no progress for Nm — blocking` note. 5. Kill switch ``` cd infra/stacks/beads-server scripts/tg apply -var=beads_dispatcher_enabled=false kubectl -n beads-server get cronjob ``` Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify nothing happens within 5 min. Re-apply with `=true` to re-enable. Runbook with all above plus reaper semantics + design choices at `infra/docs/runbooks/beads-auto-dispatch.md`. Closes: code-8sm Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:35:46 +00:00
Viktor Barzin	0256ccdccc	feat: add per-database backups for PostgreSQL and MySQL Add separate CronJobs that dump each database individually: - postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15) - mysql-backup-per-db: mysqldump per DB (daily 00:45) Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC. Enables single-database restore without affecting other databases. Also fixed CNPG superuser password sync and added --single-transaction --set-gtid-purged=OFF to MySQL per-db dumps. Updated restore runbooks with per-database restore procedures. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 22:39:33 +00:00
Viktor Barzin	82f674a0b4	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] Reflects the schedule change from weekly to daily. All references updated: - scripts/weekly-backup.{sh,timer,service} → daily-backup.* - Pushgateway job name: weekly-backup → daily-backup - Prometheus metric names: weekly_backup_* → daily_backup_* - All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory - offsite-sync dependency: After=daily-backup.service Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:37:04 +00:00
Viktor Barzin	b345b086ef	update backup/DR docs and runbooks for 3-2-1 architecture - Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk, PVC file-level copy from LVM snapshots, pfsense backup, two offsite paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree). - Update storage.md: 65 proxmox-lvm PVCs, sda backup tier - Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda - Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths - New runbook: restore-pvc-from-backup.md (file-level restore from sda) - Update CLAUDE.md Storage & Backup section for 3-2-1 architecture	2026-04-06 15:06:01 +03:00
Viktor Barzin	fc233bd27f	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] Audited 14 documentation files against live cluster state and Terraform code. Architecture docs: - databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h), CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints - overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage, correct Vault paths (secret/ not kv/) - compute.md: 272GB physical host RAM, ~160GB allocated to VMs - secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config - networking.md: MetalLB pool 10.0.20.200-220 - ci-cd.md: 9 GHA projects, travel_blog 5.7GB Runbooks: - restore-mysql/postgresql: backup files are .sql.gz (not .sql) - restore-vault: weekly backup (not daily), auto-unseal sidecar note - restore-vaultwarden: PVC is proxmox (not iscsi) - restore-full-cluster: updated node roles, removed trading Reference docs: - CLAUDE.md: 7-day rotation, removed trading from PG list - AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell - service-catalog.md: 6 new stacks, 14 stack column updates	2026-04-06 13:21:05 +03:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	2d8aa5ed89	docs: update hardware inventory for R730 RAM upgrade to 272GB Upgraded from 144GB (4x32G + 2x8G) to 272GB (8x32G + 2x8G) DDR4-2400. Added physical DIMM slot diagram, channel layout, and BIOS speed override notes. Updated compute architecture with correct CPU (single socket), VM memory values, and capacity figures.	2026-04-02 00:48:13 +03:00
Viktor Barzin	a44f35bcf8	harden vaultwarden iSCSI storage and increase backup frequency - Increase backup from daily to every 6 hours (0 /6 * *) - Add pre/post-flight SQLite integrity checks to backup job - Harden iSCSI on all nodes: increase recovery timeout (300s), enable CRC32C data/header digests for bit-flip detection - Fix restore runbook PVC name (vaultwarden-data-iscsi) Motivated by SQLite corruption from iSCSI I/O errors.	2026-03-23 00:36:11 +02:00
Viktor Barzin	af2222fce8	backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks Phase 1: Add 12 PrometheusRules for backup health alerting - PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts - CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces - Generic BackupCronJobFailed alert Phase 2: Fix backup rotation - etcd: timestamped snapshots instead of overwriting single file - Redis: timestamped RDB files with 7-day retention purge - PostgreSQL: retention increased from 7 to 14 days Phase 3: Fix MySQL password exposure - Move root password from command line arg to MYSQL_PWD env var via secretKeyRef Phase 5: Add restore runbooks - PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild	2026-03-19 20:34:33 +00:00

28 commits