infra

Author	SHA1	Message	Date
Viktor Barzin	a5df175a67	[mailserver] Retire Dovecot exporter + scrape + alerts [ci skip] ## Context code-vnc confirmed `viktorbarzin/dovecot_exporter` cannot produce real metrics against docker-mailserver 15.0.0's Dovecot 2.3.19 — the exporter speaks the pre-2.3 `old_stats` FIFO protocol, which Dovecot 2.3 deprecated in favour of `service stats` + `doveadm-server` with a different wire format. The scrape only ever returned `dovecot_up{scope="user"} 0`. code-1ik listed two paths: (a) switch to a Dovecot 2.3+ exporter, or (b) retire the exporter + scrape + alerts. Picking (b) — carrying a no-op exporter + scrape + alert group taxes cluster resources, clutters Prometheus /targets, and tees up an alert that can never fire correctly. If a future session needs real Dovecot stats, reach for a known-good exporter (e.g., jtackaberry/dovecot_exporter) and rebuild this scaffolding. ## This change ### mailserver stack - Removes the `dovecot-exporter` container from `kubernetes_deployment.mailserver` (was ~28 lines). Pod now runs a single `docker-mailserver` container. - Removes `kubernetes_service.mailserver_metrics` (ClusterIP Service added in code-izl). The `mailserver` LoadBalancer (ports 25, 465, 587, 993) is unaffected. - Drops the dovecot.cf comment documenting the failed code-vnc attempt — the documentation survives here + in bd code-vnc / code-1ik. ### monitoring stack - Removes `job_name: 'mailserver-dovecot'` from `extraScrapeConfigs`. - Removes the `Mailserver Dovecot` PrometheusRule group (`DovecotConnectionsNearLimit`, `DovecotExporterDown`). - Inline comments in both files point future work at code-1ik's decision record. Prometheus configmap-reload picked up the change; scrape target set now has zero entries for `mailserver-dovecot`. Pod rolled cleanly to 1/1 Running. ## What is NOT in this change - No replacement exporter — deliberate. The alert that was removed was a false-signal alert; its removal returns cluster alerting to a correct, lower-noise state. 
- mailserver MetalLB Service + SMTP/IMAP ports — unchanged. - `auth_failure_delay`, `mail_max_userip_connections` — stay; those are unrelated to stats export. ## Test Plan ### Automated ``` $ kubectl get pod -n mailserver -l app=mailserver NAME READY STATUS RESTARTS AGE mailserver-78589bfd95-swz6h 1/1 Running 0 49s $ kubectl get svc -n mailserver NAME TYPE PORT(S) mailserver LoadBalancer 25/TCP,465/TCP,587/TCP,993/TCP roundcubemail ClusterIP 80/TCP # mailserver-metrics gone $ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \ wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot' {"status":"success","data":{"activeTargets":[]}} ``` ### Manual Verification 1. E2E probe `email-roundtrip-monitor` keeps succeeding (20-min cadence) 2. `EmailRoundtripFailing` stays green — proves IMAP is healthy even without the exporter signal 3. Prometheus `/alerts` page no longer shows DovecotConnectionsNearLimit or DovecotExporterDown Closes: code-1ik Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 11:01:07 +00:00
Viktor Barzin	137404a6a2	[mailserver] Document Dovecot exporter incompatibility [ci skip] ## Context bd code-vnc investigated why `viktorbarzin/dovecot_exporter` only exposed `dovecot_up{scope="user"} 0`. Root cause: the exporter speaks the legacy pre-2.3 `old_stats` FIFO wire protocol. docker-mailserver 15.0.0 ships Dovecot 2.3.19, which moved to `service stats` with a different architecture — `doveadm stats dump` on the old-stats unix_listener returns "Failed to read VERSION line" and the exporter loops on "Input does not provide any columns". Attempted fix: enabled `old_stats` plugin via `mail_plugins` + declared `service old-stats { unix_listener stats-reader }`. Socket was created but protocol incompatibility made it useless. Reverted. ## This change - Reverts the attempted dovecot.cf additions - Adds a comment in the dovecot.cf heredoc explaining why we deliberately do NOT enable old_stats here - `auth_failure_delay = 5s` (code-9mi) and `mail_max_userip_connections = 50` stay — they're unrelated to stats ## What is NOT in this change - A replacement exporter — filed as follow-up bd code-1ik with two paths: switch to jtackaberry/dovecot_exporter, or retire the exporter+scrape+alert entirely - The `mailserver-metrics` ClusterIP Service (from code-izl) — kept; it will be useful for whichever path code-1ik chooses ## Test Plan ### Automated ``` $ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \ supervisorctl status dovecot postfix dovecot RUNNING pid 1022, uptime 0:00:27 postfix RUNNING pid 1063, uptime 0:00:26 $ kubectl rollout status deployment/mailserver -n mailserver deployment "mailserver" successfully rolled out ``` ### Manual Verification Dovecot config returns to baseline + auth_failure_delay. Mail continues to flow (E2E probe continues to succeed via `email-roundtrip-monitor`). Closes: code-vnc Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:55:48 +00:00
Viktor Barzin	973f549810	[payslip-ingest] Update extractor agent + dashboard for v2 regex parser ## Context Companion change to payslip-ingest v2 (regex parser + accurate RSU tax attribution). The Grafana dashboard now has 4 more panels powered by the new earnings-decomposition and YTD-snapshot columns, and the Claude fallback agent's prompt is aligned with the new schema so non-Meta payslips still land with the full field set. ## This change ### `.claude/agents/payslip-extractor.md` Rewrites the RSU handling section to match Meta UK's actual template (rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead). Adds a new "Earnings decomposition (v2)" section telling the fallback agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_* and when to use pension_employee vs pension_sacrifice without double-counting. ### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json` - Panel 4 (Effective rate) — SQL switched from the naive `(income_tax + NIC) / cash_gross` to the YTD-effective-rate method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid / ytd_taxable_pay)`. Title updated to "YTD-corrected" so the change is discoverable. - Panel 5 (Table) — adds salary, bonus, pension_sacrifice, taxable_pay columns so row-level debugging against the parser output is trivial. - +Panel 8 (Earnings breakdown) — monthly stacked bars of salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice months show up as a massive negative pension_sacrifice spike paired with a near-zero bonus bar. - +Panel 9 (Accurate cash tax rate) — timeseries of cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU contribution the payslip hides in the single `Tax paid` line. - +Panel 10 (All-in compensation) — stacked bars of cash_gross + rsu_vest per payslip. - +Panel 11 (YTD cumulative cash gross vs total comp) — two lines partitioned by tax_year; the gap between them is the RSU contribution YTD. 
Total panels go from 7 → 11. ## Test Plan ### Automated Dashboard JSON validity: ``` $ python3 -m json.tool uk-payslip.json > /dev/null && echo ok ok ``` ### Manual Verification After applying `stacks/monitoring/`: 1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels 2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the negative pension_sacrifice bar in panel 8 3. Panel 9 "Accurate cash effective tax rate" shows the cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in RSU-vest months ## Reproduce locally 1. `cd infra/stacks/monitoring && terragrunt plan` 2. Expected: ConfigMap diff on the payslip dashboard with the new panel JSON 3. `terragrunt apply` — Grafana reloads the dashboard automatically (configmap-reload sidecar) Relates to: payslip-ingest commit 9741816 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:54:33 +00:00
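The YTD-corrected formula quoted for Panel 4 can be sketched in Python. This is a minimal illustration, not the dashboard's SQL: the function names, the zero-denominator guard, and the sample figures in the usage note are assumptions; only the two formulas come from the commit message.

```python
def cash_tax(income_tax: float, rsu_vest: float,
             ytd_tax_paid: float, ytd_taxable_pay: float) -> float:
    """cash_tax = income_tax - rsu_vest * (ytd_tax_paid / ytd_taxable_pay)."""
    if ytd_taxable_pay <= 0:
        # Assumption: with no YTD taxable pay there is nothing to attribute to RSUs.
        return income_tax
    ytd_effective_rate = ytd_tax_paid / ytd_taxable_pay
    return income_tax - rsu_vest * ytd_effective_rate


def naive_rate(income_tax: float, nic: float, cash_gross: float) -> float:
    """The earlier Panel 4 method: (income_tax + NIC) / cash_gross."""
    return (income_tax + nic) / cash_gross
```

With made-up figures, `cash_tax(5000, 4000, 30000, 100000)` attributes 30% of the 4000 vest to RSU tax and leaves about 3800 as tax on cash pay.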
Viktor Barzin	28009a0e85	[redis] Bump master/replica memory 64Mi→256Mi (OOMKilled on PSYNC) ## Context redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts. Cluster-health check flagged it as WARN; Prometheus was firing `StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and `PodCrashLooping` alerts continuously. ## Root cause Memory limit 64Mi is too tight. Master steady-state is only 21Mi, but the replica needs transient headroom during PSYNC full resync: - RDB snapshot transfer buffer - Copy-on-write during AOF rewrite (`fork()` + writes during snapshot) - Replication backlog tracking The replica RSS crossed 64Mi during sync and was OOM-killed (exit 137), looping forever. This also meant Sentinel had no healthy replica to promote if the master failed. ## Fix Master + replica: 64Mi → 256Mi (both requests and limits, per `CLAUDE.md` resource management rule: `requests=limits` based on VPA upperBound). Sentinels stay at 64Mi — they don't store data. ## Deployment note Helm upgrade initially deadlocked because the StatefulSet uses `OrderedReady` podManagementPolicy: the update rollout refuses to start until all pods are Ready, but redis-node-1 could not be Ready without the update. Recovered via: helm rollback redis 43 -n redis kubectl -n redis patch sts redis-node --type=strategic \ -p '{...memory: 256Mi...}' kubectl -n redis delete pod redis-node-1 --force Then `scripts/tg apply` cleanly reconciled state. Deadlock-recovery runbook to be written under `code-cnf`. 
## Verification kubectl -n redis get pods redis-node-0 2/2 Running 0 <bounce> redis-node-1 2/2 Running 0 <bounce> kubectl -n redis get sts redis-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}' 256Mi ## Follow-ups filed - code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically (separate root cause; surfaced via same cluster-health run) - code-cnf: runbook / TF tweak for the OrderedReady + atomic-wait deadlock recovery Closes: code-pqt Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:40:51 +00:00
Viktor Barzin	468a7a266b	[mailserver] Drop unneeded NET_ADMIN capability [ci skip] ## Context The mailserver container had `capabilities.add = ["NET_ADMIN"]`. Upstream docker-mailserver docs say the capability is only needed by Fail2ban to run iptables ban actions. Fail2ban is DISABLED in this stack (`ENABLE_FAIL2BAN=0`, see line ~68) — CrowdSec owns the brute-force policy at the LB layer. The capability was therefore unused ballast and a minor attack-surface reduction opportunity. Addresses code-4mu. ## This change Replaces the explicit `capabilities { add = ["NET_ADMIN"] }` block with an empty `security_context {}`. Post-rollout verification (`supervisorctl status`) confirms every service we actually run is healthy — dovecot, postfix, rspamd, rsyslog, postsrsd, changedetector, cron, mailserver. Every STOPPED entry was already disabled. The inline comment documents the revert trigger: check `kubectl logs -c docker-mailserver` for permission-denied patterns and restore the capability if observed. ## Test Plan ### Automated ``` $ kubectl get pod -n mailserver -l app=mailserver -o jsonpath='{.items[0].spec.containers[?(@.name=="docker-mailserver")].securityContext}' {"allowPrivilegeEscalation":true,"privileged":false,"readOnlyRootFilesystem":false,"runAsNonRoot":false} $ kubectl rollout status deployment/mailserver -n mailserver deployment "mailserver" successfully rolled out $ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \ supervisorctl status \| grep RUNNING changedetector RUNNING ... cron RUNNING ... dovecot RUNNING ... mailserver RUNNING ... postfix RUNNING ... postsrsd RUNNING ... rspamd RUNNING ... rsyslog RUNNING ... ``` ### Observation window EmailRoundtripFailing + EmailRoundtripStale alerts continue to run every 20 min. If no alert fires in the 24h post-rollout window (through ~2026-04-20 10:40 UTC), the change is considered safe and this commit stands. Otherwise revert this commit. 
## What is NOT in this change - readOnlyRootFilesystem (separate hardening, out of scope) - runAsNonRoot (docker-mailserver needs root for Postfix) - Removing privilege-escalation defaults (container needs those for chowning mail spool at startup) Closes: code-4mu Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:39:43 +00:00
Viktor Barzin	c941199f8d	[mailserver] Split Dovecot metrics port onto ClusterIP service [ci skip] ## Context Port 9166 (`dovecot-metrics`) was exposed on the public MetalLB LoadBalancer 10.0.20.202 alongside SMTP/IMAP. While only LAN-routable, shipping an internal metric on the same listening IP as external mail conflated two concerns and over-exposed the port. Prometheus was scraping via the same LB Service. Addresses code-izl (follow-up to code-61v which added the scrape job). ## This change ### mailserver stack - Drops `dovecot-metrics` port from `kubernetes_service.mailserver` (LoadBalancer stays: 25, 465, 587, 993). - Adds new `kubernetes_service.mailserver_metrics` — ClusterIP-only, selecting the same `app=mailserver` pod, exposing 9166. ### monitoring stack - Updates `extraScrapeConfigs` in the Prometheus chart values to target the new `mailserver-metrics.mailserver.svc.cluster.local:9166` instead of `mailserver.mailserver.svc.cluster.local:9166`. - helm_release.prometheus updated in-place; configmap-reload sidecar picked up the new target within 10s. 
``` mailserver LB mailserver-metrics ClusterIP ┌──────────────────┐ ┌──────────────────┐ │ 25 smtp │ │ 9166 dovecot- │ │ 465 smtp-secure │ │ metrics │ ← Prometheus only │ 587 smtp-auth │ └──────────────────┘ │ 993 imap-secure │ └──────────────────┘ ↑ 10.0.20.202 ``` ## What is NOT in this change - Per-Service RBAC/NetworkPolicy tightening (separate task) - Moving the metrics port to a dedicated sidecar-only Service Monitor (ServiceMonitor CRDs not installed; extraScrapeConfigs is correct for the prometheus-community chart in use) ## Test Plan ### Automated ``` $ kubectl get svc -n mailserver mailserver LoadBalancer 10.0.20.202 25/TCP,465/TCP,587/TCP,993/TCP mailserver-metrics ClusterIP 10.100.102.174 9166/TCP $ kubectl get endpoints -n mailserver mailserver-metrics mailserver-metrics 10.10.169.163:9166 $ # Prometheus target (after 10s configmap-reload) $ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \ wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot' scrapeUrl: http://mailserver-metrics.mailserver.svc.cluster.local:9166/metrics health: up ``` ### Manual Verification 1. From a host outside the cluster: `nc -vz 10.0.20.202 9166` → connection refused 2. Prometheus UI `/targets` → `mailserver-dovecot` UP, labels show new DNS name 3. PromQL: `up{job="mailserver-dovecot"}` returns `1` Closes: code-izl Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:37:30 +00:00
Viktor Barzin	7502e0db21	[mailserver] Document postfix-accounts.cf hash-drift invariant [ci skip] ## Context The `postfix-accounts.cf` ConfigMap renders `bcrypt(pass, 6)` for each user in `var.mailserver_accounts`. bcrypt generates a fresh salt on every evaluation → the ConfigMap `data` hash line differs every plan run. `ignore_changes = [data["postfix-accounts.cf"]]` was the pragmatic workaround, but the side-effect wasn't documented: a Vault rotation of a mailserver password would be MASKED by ignore_changes — TF would never push the new hash and the pod would keep accepting the old password until manual taint/replace. Addresses bd code-7ns. ## This change Inline comment on the lifecycle block spelling out: - Why ignore_changes exists (non-deterministic bcrypt) - What the invariant costs (masks automatic rotation) - Why it's acceptable TODAY (no automatic rotation for mailserver_accounts — verified in Vault; manual password change is a manual TF run anyway) - Two concrete alternatives if rotation is ever added: (a) deterministic bcrypt with stable per-user salt (b) render from an ESO-synced K8s Secret No code change, no apply needed — this is a comment-only commit. The decision (live-with + document) is one of the three options in the plan. ## What is NOT in this change - Deterministic hashing (not needed until automatic rotation exists) - ESO-driven Secret (same reason) - Removal of ignore_changes (would cause the original drift flap) ## Test Plan ### Automated ``` $ cd stacks/mailserver && /home/wizard/code/infra/scripts/tg plan # no diff expected on this comment-only change; other drift remains # but is pre-existing and out of scope. ``` ### Manual Verification Read the new comment block at `stacks/mailserver/modules/mailserver/ main.tf` around the postfix-accounts-cf lifecycle — comprehensible without session context. Closes: code-7ns Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:33:57 +00:00
Viktor Barzin	23173131f4	[mailserver] Add Dovecot auth_failure_delay 5s [ci skip] ## Context Dovecot's `dovecot.cf` block previously set only `mail_max_userip_connections = 50`. No equivalent of the SMTP rate limit existed for IMAP auth — brute-force against IMAP/POP auth was throttled only by CrowdSec at the LB level. Adding an in-process auth delay is cheap defense in depth. Addresses code-9mi. ## This change Adds `auth_failure_delay = 5s` to the dovecot.cf ConfigMap key. Each failed auth attempt pauses 5s before responding; a sequential 1000-entry dictionary attack stretches from <1s to ~83min (1000 × 5s), which buys time for CrowdSec's ban window to apply. ## What is NOT in this change - `login_processes_count` tuning (workload doesn't warrant it yet) - Equivalent SMTP AUTH delay (CrowdSec already covers, and SMTP AUTH is rate-limited via `smtpd_client_connection_rate_limit`) ## Test Plan ### Automated ``` $ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \ doveconf -n \| grep -E 'auth_failure\|mail_max_userip' auth_failure_delay = 5 secs mail_max_userip_connections = 50 $ kubectl rollout status deployment/mailserver -n mailserver deployment "mailserver" successfully rolled out ``` ### Manual Verification 1. `openssl s_client -connect mail.viktorbarzin.me:993` 2. `a1 LOGIN bogus@viktorbarzin.me wrongpass` — expect ~5s delay before `NO [AUTHENTICATIONFAILED]` 3. Fire 5 failed attempts rapidly: total ≥25s ## Reproduce locally 1. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- doveconf -n \| grep auth_failure` 2. Expected: `auth_failure_delay = 5 secs` Closes: code-9mi Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:33:05 +00:00
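The delay arithmetic in the commit above is easy to check. A small sketch, assuming the attacker tries passwords strictly one after another on a single connection (parallel connections would divide the total); the function name is illustrative:

```python
def min_attack_seconds(attempts: int, delay_seconds: float) -> float:
    """Lower bound on wall-clock time for sequential failed logins."""
    return attempts * delay_seconds


# 1000 guesses at 5 s each, expressed in minutes.
minutes_for_dictionary = min_attack_seconds(1000, 5) / 60
# The manual verification step: 5 rapid failures.
seconds_for_five = min_attack_seconds(5, 5)
```

The first figure comes to roughly 83 minutes, and the second matches the "total ≥25s" expectation in the manual verification step.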
Viktor Barzin	a32bfbf07e	[mailserver] Require STARTTLS before AUTH on submission [ci skip] ## Context docker-mailserver 15.0.0's default Postfix config does NOT set `smtpd_tls_auth_only = yes`. Clients that skip STARTTLS on port 587 (or 25 with AUTH) can send PLAIN/LOGIN creds in cleartext. CrowdSec and rate limiting don't catch this — it's an auth-path leak, not a bruteforce. Addresses bd code-vnw. ## This change Adds `smtpd_tls_auth_only = yes` to `postfix_cf` (applied via the `postfix-main.cf` ConfigMap key consumed by docker-mailserver). Rolled the pod to pick up the new ConfigMap. ### Deviation from task spec code-vnw's fix field cited `smtpd_sasl_auth_only = yes`. That is NOT a real Postfix parameter — attempting it gets `postconf: warning: smtpd_sasl_auth_only: unknown parameter`. The acceptance test (reject PLAIN auth before STARTTLS) is satisfied by `smtpd_tls_auth_only`, which is the correct knob. Added an inline comment noting the common confusion. ## What is NOT in this change - Per-service override in master.cf (smtpd_tls_auth_only applied globally, which is safe because port 25 doesn't accept AUTH here) - Other Postfix hardening (sender_restrictions, etc.) ## Test Plan ### Automated ``` $ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \ postconf smtpd_tls_auth_only smtpd_tls_auth_only = yes $ kubectl rollout status deployment/mailserver -n mailserver deployment "mailserver" successfully rolled out ``` ### Manual Verification 1. `openssl s_client -connect mail.viktorbarzin.me:587 -starttls smtp` 2. At prompt, send `AUTH PLAIN <base64>` BEFORE `STARTTLS` 3. Expected: Postfix rejects with `503 5.5.1 Error: authentication not enabled` 4. Follow-up: STARTTLS first, then `AUTH PLAIN <base64>` — succeeds for valid creds ## Reproduce locally 1. From a shell with `kubectl` access to the cluster: 2. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- postconf smtpd_tls_auth_only` 3. 
Expected: `smtpd_tls_auth_only = yes` Closes: code-vnw Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:31:15 +00:00
Viktor Barzin	e12c7b43e4	[mailserver] Pin dovecot_exporter to SHA + add Diun [ci skip] ## Context `viktorbarzin/dovecot_exporter:latest` was consumed with `IfNotPresent` pull, which means whichever node landed the pod kept whatever digest was cached from an earlier pull. A SHA-level pin is the reproducibility baseline this repo uses for every other home-built image (`headscale`, `excalidraw`, `linkwarden`). ## This change - Pins `dovecot-exporter` container image to `viktorbarzin/dovecot_exporter@sha256:1114224c...` — the digest the pod is actually running today (captured from live `imageID`). - Enables Diun tag watching on the mailserver Deployment (`diun.enable=true`, `diun.include_tags=^latest$`) so new `:latest` digests trigger a notification rather than silently landing on the next `IfNotPresent` miss. Deviation from task spec (code-cno): the task asked for an 8-char SHA tag, but Docker Hub only publishes `:latest` for this image — a SHA tag doesn't exist. Used the digest-pin pattern already established at `stacks/headscale/modules/headscale/main.tf:204` instead; Diun watches the `:latest` tag for drift, which is the equivalent notification. ## What is NOT in this change - Volume-mount ordering drift on `kubernetes_deployment.mailserver` (pre-existing; tolerated by Waves 1+2). - Splitting the metrics port into its own Service (code-izl). ## Test Plan ### Automated ``` $ kubectl get pod -n mailserver -l app=mailserver \ -o jsonpath='{.items[0].spec.containers[*].image}' docker.io/mailserver/docker-mailserver:15.0.0 \ viktorbarzin/dovecot_exporter@sha256:1114224c9bf0261ca8e9949a6b42d3c5a2c923d34ca4593f6b62f034daf14fc5 $ kubectl get deployment -n mailserver mailserver \ -o jsonpath='{.spec.template.metadata.annotations}' {"diun.enable":"true","diun.include_tags":"^latest$"} $ kubectl rollout status deployment/mailserver -n mailserver deployment "mailserver" successfully rolled out ``` ### Manual Verification 1. 
Push a new `:latest` digest to the exporter image (or wait for one). 2. Check Diun notifier output: a tag event for `^latest$` should fire. 3. `kubectl describe deployment/mailserver -n mailserver` shows the digest pin unchanged until someone rebumps it. ## Reproduce locally 1. `kubectl -n mailserver get pod -l app=mailserver -o yaml \| \ grep -A1 dovecot_exporter` 2. Expected: `image: viktorbarzin/dovecot_exporter@sha256:1114224c...`. Closes: code-cno Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:26:31 +00:00
Viktor Barzin	c36b41eabc	[monitoring] Scrape mailserver Dovecot exporter + near-limit alerts Port 9166 (`dovecot-metrics`) is exposed on the mailserver Service but nothing was scraping it. Added a static `mailserver-dovecot` scrape job to `extraScrapeConfigs` (we run `prometheus-community/prometheus`, not `kube-prometheus-stack`, so no ServiceMonitor CRDs are available). Two alerts in a new `Mailserver Dovecot` rule group: - `DovecotConnectionsNearLimit` fires at ≥42/50 IMAP connections for 5m (85% of `mail_max_userip_connections = 50`). - `DovecotExporterDown` fires if the scrape target is unreachable for 10m (catches pod restarts + network issues). Originally drafted as `kubernetes_manifest` ServiceMonitor + PrometheusRule on `mailserver-beta1` branch; that commit is abandoned because the CRDs aren't installed. This path is functionally equivalent and plans cleanly. Closes: code-61v	2026-04-19 00:24:12 +00:00
Viktor Barzin	6a75ed4809	[mailserver] Add targeted retention for spam@ mailbox ## Context The @viktorbarzin.me catch-all routes to spam@viktorbarzin.me. The mailbox had no retention policy. On 2026-04-18 it held 519 messages consuming 43 MiB. Without a policy, the only brake on growth was manual deletion, which has not been happening - hence the bd task. Viktor's explicit constraint when filing code-oy4: DO NOT blind age-expunge. We need targeted retention that keeps genuine forwarded human mail for a long time while shedding the recurring-newsletter cruft that dominates the byte count. ## Profile findings (2026-04-18, verified on the live pod) Total: 519 messages, 43 MiB, 0 in new/, 0 in tmp/. Top senders by volume: 138 dan@tldrnewsletter.com 51 hi@ratepunk.com 40 uber@uber.com 35 truenas@viktorbarzin.me 19 ubereats@uber.com 15 hello@travel.jacksflightclub.com 12 chris@chriswillx.com 10 me@viktorbarzin.me Top senders by storage bytes: 8,176,481 dan@tldrnewsletter.com (19 % of 43 MiB alone) 2,866,104 uber@uber.com 2,207,458 noreply@mail.selfh.st 2,066,094 hi@ratepunk.com 1,675,435 ubereats@uber.com Age distribution: 97 % older than 14 days (502 / 519) 23 % older than 90 days (121 / 519) Automated-sender markers: 66 % carry List-Unsubscribe: (342 / 519) 4 % carry Precedence: bulk\|list\|junk ( 21 / 519) 34 % carry neither marker (= human-ish tail) (177 / 519) Combined "automated AND >14d": 328 messages -> target of rule 1. ## Retention strategy Signed off by Viktor 2026-04-18. Two rules, both delete-leaf: 1. Older than 14 days AND header matches one of: - `^List-Unsubscribe:` - `^Precedence:\s(bulk\|list\|junk)` - `^Auto-Submitted:\sauto-` -> DELETE. Rationale: these markers are the RFC-agreed indicators of bulk / robotic senders. A 14-day window still lets genuine subscription alerts (delivery, flight, calendar invite) come to attention. 2. Older than 90 days AND no automated marker at all -> DELETE. 
Rationale: these are long-tail forwards from real people to the catch-all. 90 days is deliberately generous - I would rather leak bytes than lose Viktor's personal correspondence. 3. Everything else -> KEEP (recent traffic, or aged human tail younger than 90d). ## Implementation A `kubernetes_cron_job_v1.spam_retention` running every 4h (at :17 past) that `kubectl exec`s a Python retention script into the mailserver pod. Why kubectl exec and not a sibling CronJob with the Maildir mounted: mailserver-data-encrypted is a RWO volume held by the mailserver pod. A sibling would fail to attach. The nextcloud-watchdog pattern in stacks/nextcloud/main.tf already solves this for a similar "interact with the live pod on a schedule" shape. Mirrored here with its own SA + Role + RoleBinding scoped to list/get pods and create pods/exec in the mailserver namespace only. Why Python and not pure shell: POSIX `find + stat + awk` struggles with the header-scan-up-to-blank-line rule, and `stat -c` is Linux- GNU-specific anyway. The script reads each message's first 64 KiB, stops at the first blank line, scans headers only, then checks mtime. The CronJob streams the Python source via `kubectl exec -i ... -- python3 - <<PYEOF`. After the retention pass, `doveadm force-resync -u spam@viktorbarzin.me INBOX/spam` refreshes Dovecot's cached index so the deletions appear in IMAP immediately instead of after the next pod restart. Includes the standard KYVERNO_LIFECYCLE_V1 marker on the CronJob so Kyverno ndots mutation does not cause perpetual drift. ## What is NOT in this change - Dovecot sieve rules (no sieve infrastructure exists in the module; the plan file's fallback option was precisely this CronJob path). - Push of retention metrics to Pushgateway - the script prints them to the job log for now; plumbing Pushgateway is a follow-up if Viktor wants alerts. - Any touch of other mailboxes - only `/var/mail/viktorbarzin.me/spam/cur` is walked. - Any mailserver pod restart or config reload. 
## Test plan ### Automated `terraform fmt` + `terragrunt hclfmt` pass. `scripts/tg plan` on the mailserver stack shows: Plan: 7 to add, 3 to change, 0 to destroy. Of the 7 adds, 4 are mine (SA + Role + RoleBinding + CronJob). The other 3 adds belong to the concurrent roundcube-backup CronJob + nfs_roundcube_backup_host PV + PVC already on master in parallel. The 3 in-place updates are pre-existing drift on the mailserver Deployment, Service and email_roundtrip_monitor CronJob, not introduced by this change. ### Manual Verification After `scripts/tg apply` lands the CronJob: 1. Trigger an immediate run: `kubectl -n mailserver create job --from=cronjob/spam-retention manual-1` 2. Wait for completion, read the log: `kubectl -n mailserver logs job/manual-1` -> expected tail: spam_retention_scanned_total <N> spam_retention_auto_deleted_total <M> spam_retention_human_deleted_total <H> spam_retention_kept_total <K> spam_retention_errors_total 0 Retention pass complete 3. Confirm mailbox shrunk: `kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \ -- du -sh /var/mail/viktorbarzin.me/spam/` -> expected: well below 43 MiB within one run (bulk rule alone purges ~328 messages per the profile numbers above). 4. Confirm IMAP reflects the deletions: `kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \ -- doveadm mailbox status -u spam@viktorbarzin.me messages INBOX/spam` -> expected: message count dropped accordingly. 5. 4 hours later, confirm the next scheduled run logs a much smaller scan count and 0 deletions (nothing new crossed the threshold). Closes: code-oy4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 00:22:55 +00:00
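The two retention rules in the spam@ commit above can be sketched as a pure decision function. This is an illustration under stated assumptions, not the script shipped in the CronJob: the names `header_block` and `decide`, the return labels, and the case-insensitive match are hypothetical; the three header patterns, the 64 KiB read limit, the stop-at-first-blank-line scan, and the 14-day and 90-day thresholds come from the commit message.

```python
import re

DAY = 86400  # seconds

# The three automated-sender markers listed under "Retention strategy".
# Assumption: matched case-insensitively, since header names are case-insensitive.
AUTOMATED = re.compile(
    rb"^(List-Unsubscribe:|Precedence:\s(bulk|list|junk)|Auto-Submitted:\sauto-)",
    re.IGNORECASE | re.MULTILINE,
)


def header_block(raw: bytes, limit: int = 64 * 1024) -> bytes:
    """Return only the headers: first `limit` bytes, cut at the first blank line."""
    head = raw[:limit]
    blank = re.search(rb"\r?\n\r?\n", head)
    return head[:blank.start()] if blank else head


def decide(raw: bytes, mtime: float, now: float) -> str:
    """Classify one message as 'delete-auto', 'delete-human', or 'keep'."""
    age_days = (now - mtime) / DAY
    automated = AUTOMATED.search(header_block(raw)) is not None
    if automated and age_days > 14:
        return "delete-auto"   # rule 1: bulk or robotic sender, older than 14 days
    if not automated and age_days > 90:
        return "delete-human"  # rule 2: human-ish tail, older than 90 days
    return "keep"              # rule 3: everything else
```

Keeping the decision separate from the file walk makes it testable without a Maildir: a caller would pass each file's bytes and `os.stat(path).st_mtime`. A marker that appears only in the body is ignored because the scan stops at the first blank line.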
Viktor Barzin	6cfc4b7836	[mailserver] Add backup CronJob for Roundcube html + enigma PVCs ## Context Roundcube webmail runs with two encrypted RWO PVCs (see roundcubemail.tf: `roundcubemail-html-encrypted`, `roundcubemail-enigma-encrypted`). These carry user-visible state that is NOT regenerable without user action: - `html` PVC → Apache docroot, plugin installs, skin overrides, session artefacts (two_factor_webauthn keys, persistent_login tokens, rcguard throttle state) - `enigma` PVC → user-uploaded PGP private keyrings Per the subdir CLAUDE.md "Storage & Backup Architecture" rule every proxmox-lvm* PVC MUST have a backup CronJob writing to NFS `/mnt/main/<app>-backup/`. Mailserver already complies via code-z26's `mailserver-backup` CronJob; Roundcube does not. Losing either Roundcube PVC means users must re-add 2FA devices, re-install plugins, and re-import PGP keys — none of it recoverable from a database dump. Target task: `code-1f6`. ## This change - Adds `module.nfs_roundcube_backup_host` sourcing `modules/kubernetes/nfs_volume` pointed at `/srv/nfs/roundcube-backup` on the Proxmox host (NFSv4, inotify change-tracker picks it up for Synology offsite). - Adds `kubernetes_cron_job_v1.roundcube-backup`: - Schedule `10 3 * * *` — 10 minutes after `mailserver-backup` (`0 3 * * *`) to avoid NFS write-window contention. Roundcube PVCs are tiny (<200 MiB combined on current cluster) so the window is well under 10 min. - `pod_affinity` on `app=roundcubemail` (Roundcube runs 1 replica with `Recreate` strategy on a fresh node per pod; the backup pod must co-locate because both PVCs are RWO). - `rsync -aH --delete --link-dest=/backup/<prev-week>` into `/backup/<YYYY-WW>/{html,enigma}/` — hardlinks unchanged files vs the previous weekly snapshot, keeping storage cost ~= delta only. - Weekly rotation retains 8 snapshots (~2 months), matching `mailserver-backup`. 
- Pushgateway metrics under `job=roundcube-backup` so existing `BackupDurationHigh` / `BackupStale` alert patterns detect regressions without extra wiring. - `KYVERNO_LIFECYCLE_V1` `ignore_changes` for mutated `dns_config`. ## Layout ``` NFS server 192.168.1.127:/srv/nfs/ ├── mailserver-backup/ (0 3 * * * — code-z26) │ └── <YYYY-WW>/{data,state,log}/ └── roundcube-backup/ (10 3 * * * — this change) └── <YYYY-WW>/{html,enigma}/ ``` ## What is NOT in this change - Changing the mailserver-backup CronJob to also cover Roundcube. Two separate CronJobs keep the concerns (and pod anti-affinity/affinity) clean; the 10-min stagger eliminates the contention justification for merging them. - Retention alerting tuning — existing Pushgateway/Prometheus rule ecosystem suffices for now. - Restore tooling — follows the standard pattern in `docs/runbooks/` (rsync back, fix perms). ## Reproduce locally 1. Plan: `cd stacks/mailserver && scripts/tg plan -lock=false` → 2 new resources (nfs_volume module + CronJob). 2. Apply, then trigger a one-shot run: `kubectl -n mailserver create job --from=cronjob/roundcube-backup roundcube-backup-manual-1` 3. Expected on success: - `kubectl -n mailserver logs job/roundcube-backup-manual-1` → "=== Backup IO Stats ===". - On Proxmox host: `ls /srv/nfs/roundcube-backup/$(date +%Y-%W)/` → `html`, `enigma`. - `/mnt/backup/.nfs-changes.log` (Proxmox) lists fresh paths under `roundcube-backup/` within ~1s of the rsync finishing. - Pushgateway: `curl -s prometheus-prometheus-pushgateway.monitoring:9091/metrics \| grep roundcube` shows `backup_duration_seconds`, `backup_last_success_timestamp`. ## Automated - `terraform fmt -check -recursive stacks/mailserver/modules/mailserver/` → clean. - `scripts/tg plan -lock=false` in stacks/mailserver expected to show `+ module.nfs_roundcube_backup_host.*`, `+ kubernetes_cron_job_v1.roundcube-backup`. Closes: code-1f6 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 00:14:47 +00:00
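The weekly snapshot naming and 8-snapshot retention described in the Roundcube backup commit above can be sketched as follows. The `%Y-%W` format matches the `date +%Y-%W` call quoted in the message; how the real job picks the `--link-dest` directory and prunes old snapshots is not shown there, so `previous_snapshot` and `to_prune` are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Iterable, List


def snapshot_name(day: date) -> str:
    """Directory name in the <YYYY-WW> layout (same format as `date +%Y-%W`)."""
    return day.strftime("%Y-%W")


def previous_snapshot(day: date) -> str:
    """Assumed --link-dest target: the snapshot name from seven days earlier."""
    return snapshot_name(day - timedelta(weeks=1))


def to_prune(existing: Iterable[str], keep: int = 8) -> List[str]:
    """Snapshot names to delete so that only the newest `keep` remain.

    Zero-padded <YYYY-WW> names sort chronologically as plain strings.
    """
    ordered = sorted(existing)
    return ordered[:-keep] if len(ordered) > keep else []
```

One caveat of `%W`: days before the first Monday of a year fall in week `00`, so the snapshot around New Year can span two directory names.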
Viktor Barzin	f707968091	[mailserver] Retry probe Pushgateway + Uptime Kuma pushes with backoff ## Context The e2e email-roundtrip probe (CronJob `email-roundtrip-monitor`) currently wraps `requests.put(PUSHGATEWAY, ...)` and `requests.get(UPTIME_KUMA, ...)` in bare `try/except` that only prints "Failed to push ..." on error. If Pushgateway is transiently unreachable (e.g., during a Prometheus Helm upgrade / HPA scale-down / brief network blip) metrics silently drop and downstream detection relies entirely on `EmailRoundtripStale` firing after 60 min of staleness. Single transient failures masquerade as data-plane breakage for up to an hour. Target task: `code-n5l` — Add retry to probe Pushgateway + Uptime Kuma pushes. ## This change - Extracts a `push_with_retry(label, func, url)` helper that performs 3 attempts with exponential backoff (1s, 2s, 4s). Treats HTTP 2xx as success, everything else as failure. On final failure, logs an explicit `ERROR:` line to stderr with the URL and either the last HTTP status or the exception repr — matches the existing `print(...)` logging style used throughout the heredoc (no stdlib `logging` dependency added). - Replaces the two inline `try/requests.put/except print` blocks with calls to the helper. Pushgateway runs unconditionally; Uptime Kuma still only runs on round-trip success (same as before). - Makes exit code responsive to push outcome: probe exits non-zero when the round-trip itself failed (unchanged), OR when BOTH pushes failed all retries on the success path. Single-endpoint push failure with the other succeeding keeps exit 0 — partial observability is preferred over noisy pod restarts from Kubernetes' Job controller. 
## Behavior matrix ``` roundtrip \| pushgw \| kuma \| exit \| rationale ----------+--------+------+------+------------------------------- success \| ok \| ok \| 0 \| happy path (unchanged) success \| fail \| ok \| 0 \| one endpoint still has telemetry success \| ok \| fail \| 0 \| one endpoint still has telemetry success \| fail \| fail \| 1 \| NEW — total observability loss fail \| ok \| - \| 1 \| roundtrip failed (unchanged, Kuma skipped) fail \| fail \| - \| 1 \| roundtrip failed (unchanged, Kuma skipped) ``` ## What is NOT in this change - Alert thresholds (`EmailRoundtripStale` still 60m) — explicitly out of scope per the task description. - `logging` stdlib adoption — rest of heredoc uses `print`, staying consistent. - Moving the heredoc out of `main.tf` into a sidecar Python file — separate refactor. ## Reproduce locally 1. Point PUSHGATEWAY at a black hole: `kubectl -n mailserver set env cronjob/email-roundtrip-monitor \` `PUSHGATEWAY=http://nope.invalid:9091/metrics/job/test` 2. Trigger a one-shot job: `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-test` 3. Expected in logs: - 3 attempts, each ~1s/2s/4s apart - `ERROR: Failed to push to Pushgateway after 3 attempts: url=... exception=...` - Uptime Kuma push still succeeds (round-trip ok) → exit 0 4. Flip UPTIME_KUMA_URL to also fail (edit heredoc or DNS-poison): expect exit 1 + two ERROR lines. ## Automated - `python3 -c "import ast; ast.parse(open('/tmp/probe.py').read())"` → OK (heredoc extracts cleanly). - `terraform fmt -check -recursive modules/mailserver/` → no diff. Closes: code-n5l Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 00:14:46 +00:00
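A minimal sketch of the retry helper and the exit-code rule from the commit above, assuming `func` returns an object with a `status_code` attribute (as `requests.put` / `requests.get` do). The `attempts`, `base_delay` and `sleep` parameters and the `exit_code` helper are hypothetical additions for illustration, and sleeping only between attempts is an assumption; the real heredoc may differ.

```python
import sys
import time


def push_with_retry(label, func, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(url) up to `attempts` times; return True on the first HTTP 2xx."""
    last_error = "no attempt made"
    for attempt in range(attempts):
        try:
            response = func(url)
            if 200 <= response.status_code < 300:
                return True
            last_error = f"status={response.status_code}"
        except Exception as exc:  # network errors count as failed attempts
            last_error = f"exception={exc!r}"
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff between attempts
    print(f"ERROR: Failed to push to {label} after {attempts} attempts: "
          f"url={url} {last_error}", file=sys.stderr)
    return False


def exit_code(roundtrip_ok, pushgw_ok, kuma_ok):
    """Exit status following the behavior matrix above."""
    if not roundtrip_ok:
        return 1  # round-trip failure always fails the probe
    return 0 if (pushgw_ok or kuma_ok) else 1  # fail only when both pushes are lost
```

Injecting `sleep` keeps the helper testable without real delays; in the probe it would be called as, for example, `push_with_retry("Pushgateway", lambda u: requests.put(u, data=body), PUSHGATEWAY)`, where `body` stands for whatever metrics payload the probe builds.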
Viktor Barzin	f568e7d2bf	[mailserver] Delete unused postfix_cf_reference_DO_NOT_USE variable [ci skip] ## Context `infra/stacks/mailserver/modules/mailserver/variables.tf` carried a 130-line historical scaffolding variable `postfix_cf_reference_DO_NOT_USE` containing a reference copy of an older Postfix main.cf layout. The variable name itself signalled dead-code intent ("DO_NOT_USE"), and a repo-wide `grep -rn postfix_cf_reference infra/` confirmed zero consumers — no module, no stack, no script, no doc ever referenced it. Carrying dead Terraform variables costs nothing at runtime but wastes reviewer attention on every `git blame` and drives up `variables.tf` read time. Note on history: the prior commit `09c11056` landed with an identical title ("Delete postfix_cf_reference_DO_NOT_USE dead code") but actually committed `docs/runbooks/mailserver-proxy-protocol.md` — fallout from a race between two concurrent mailserver sessions that staged files in parallel. That commit accidentally closed this beads task via the `Closes:` trailer without performing the deletion. This commit does the actual deletion that was originally intended for code-o3q. The runbook from `09c11056` is legitimate work for code-rtb and is left in place. ## This change Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }` block (136 lines incl. trailing blank). No other variable touched, no resource touched, no comment elsewhere touched. `variables.tf` now contains only the live `postfix_cf` variable that is actually consumed by the module. ## What is NOT in this change - No Terraform state modification — variable was never read, so state has no record of it. - No Postfix runtime behaviour change — `postfix_cf` (the live one) is untouched. - No fix for the pre-existing `kubernetes_deployment.mailserver` / `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces independently. Those 2 in-place updates are known and tracked separately. 
- No apply needed — pure source hygiene. ## Test Plan ### Automated Reference check before edit: ``` $ grep -rn postfix_cf_reference /home/wizard/code/infra/ infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" { ``` (single match — the declaration itself) Reference check after edit: ``` $ grep -rn postfix_cf_reference /home/wizard/code/infra/ (no matches) ``` `terragrunt validate` (from `infra/stacks/mailserver/`): ``` Success! The configuration is valid, but there were some validation warnings as shown above. ``` (warnings are pre-existing `kubernetes_namespace` -> `_v1` deprecation notices, unrelated) `terragrunt plan` (from `infra/stacks/mailserver/`): ``` # module.mailserver.kubernetes_deployment.mailserver will be updated in-place # module.mailserver.kubernetes_service.mailserver will be updated in-place Plan: 0 to add, 2 to change, 0 to destroy. ``` Both in-place updates are the known pre-existing drift. No change is attributable to this commit — the dead variable was never referenced. ### Manual Verification 1. `cd infra/stacks/mailserver/modules/mailserver/` 2. `grep -c postfix_cf_reference variables.tf` -> expected `0` 3. `wc -l variables.tf` -> expected `39` (was `175`; 136 lines removed) 4. `cd ../..` -> `terragrunt validate` -> expected `Success!` 5. `terragrunt plan` -> expected `Plan: 0 to add, 2 to change, 0 to destroy.` (pre-existing drift only). Closes: code-o3q Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 00:07:43 +00:00
Viktor Barzin	8ea2dea84c	[mailserver] Authentik-gate Roundcube webmail ingress [ci skip] ## Context mail.viktorbarzin.me exposed the Roundcube login page directly: requests hit Traefik → CrowdSec + anti-AI middleware → Roundcube. The `ingress_factory` call in `roundcubemail.tf` omitted `protected = true`, so the Authentik ForwardAuth middleware was never wired up. Project rule (`infra/.claude/CLAUDE.md`): ingresses should be `protected = true` unless there is a specific reason to leave them open. Credentialed surfaces (login pages) have no reason to skip the OIDC gate — CrowdSec alone is a behavioural signal, not an identity gate. Trade-off accepted by Viktor on 2026-04-18: webmail now requires two logins (Authentik SSO, then Roundcube IMAP auth against dovecot). This is tolerable for a low-volume personal webmail; mail clients (Thunderbird, phone Mail) bypass the webmail entirely and speak IMAPS/SMTP directly against `mail.viktorbarzin.me` on the MetalLB service IP (10.0.20.202), which is a separate path and MUST stay open. ## This change Single-line flip: `protected = true` added to the `ingress_factory` call in `stacks/mailserver/modules/mailserver/roundcubemail.tf`. The factory (`modules/kubernetes/ingress_factory/main.tf`) responds to the flag by: 1. Appending `traefik-authentik-forward-auth@kubernetescrd` to the ingress `router.middlewares` annotation — Traefik then hands each request to the Authentik outpost before forwarding to Roundcube. 2. Flipping `effective_anti_ai` from true → false (logic: `anti_ai_scraping != null ? … : !var.protected`), which removes the two anti-AI middlewares. Rationale in the factory: a login-gated resource is already invisible to unauthenticated scrapers, so the robots/noai middleware chain is redundant. 
Request path before vs after: Before: Client → Traefik → [retry, error-pages, rate-limit, csp, crowdsec, ai-bot-block, anti-ai-headers] → Roundcube (200 on /) After: Client → Traefik → [retry, error-pages, rate-limit, csp, crowdsec, authentik-forward-auth] → if unauth: 302 to authentik.viktorbarzin.me → if auth: Roundcube (login form) ## What is NOT in this change - The `mailserver` Service (MetalLB IP 10.0.20.202) is untouched. IMAPS (993), SMTPS (465), SMTP-Submission (587) continue to bypass Traefik entirely and speak directly to dovecot/postfix. Mail clients are unaffected. - Pre-existing drift on `kubernetes_deployment.mailserver` (volume_mount ordering) and `kubernetes_service.mailserver` (stale metallb annotation) is left alone — out of scope per bd-bmh. Apply was scoped with `-target=` to the ingress resource only. - No Authentik app/provider Terraform was touched — the `mail.` ingress is already covered by the existing wildcard Authentik proxy outpost on `*.viktorbarzin.me` (standard pattern). 
## Test Plan ### Automated Baseline (before apply): $ curl -sI https://mail.viktorbarzin.me/ \| head -2 HTTP/2 200 alt-svc: h3=":443"; ma=2592000 $ openssl s_client -connect mail.viktorbarzin.me:993 < /dev/null 2>&1 \ \| grep -E 'CONNECTED\|subject=' CONNECTED(00000003) subject=CN = viktorbarzin.me After apply: $ curl -sI https://mail.viktorbarzin.me/ \| head -3 HTTP/2 302 alt-svc: h3=":443"; ma=2592000 location: https://authentik.viktorbarzin.me/application/o/authorize/?client_id=… $ openssl s_client -connect mail.viktorbarzin.me:993 < /dev/null 2>&1 \ \| grep -E 'CONNECTED\|subject=' CONNECTED(00000003) subject=CN = viktorbarzin.me Middleware annotation on the ingress: $ kubectl get ingress -n mailserver mail \ -o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}' traefik-retry@kubernetescrd,traefik-error-pages@kubernetescrd, traefik-rate-limit@kubernetescrd,traefik-csp-headers@kubernetescrd, traefik-crowdsec@kubernetescrd,traefik-authentik-forward-auth@kubernetescrd Terraform apply (targeted): $ scripts/tg apply --non-interactive \ -target=module.mailserver.module.ingress.kubernetes_ingress_v1.proxied-ingress … Apply complete! Resources: 0 added, 1 changed, 0 destroyed. ### Manual Verification 1. In a private browser window, navigate to https://mail.viktorbarzin.me/ 2. Expected: redirected to Authentik SSO login (not Roundcube) 3. Authenticate with Authentik credentials 4. Expected: redirected back and shown the Roundcube IMAP login form 5. Enter IMAP credentials (same as before the change) 6. Expected: Roundcube inbox loads normally 7. Separately, verify a mail client (Thunderbird, phone Mail) still connects to IMAPS on mail.viktorbarzin.me:993 and SMTP on :587 without any Authentik prompt — that path hits MetalLB 10.0.20.202 directly. ## Reproduce locally 1. cd infra/stacks/mailserver 2. vault login -method=oidc 3. scripts/tg plan Expected: 0 to add, 3 to change, 0 to destroy. 
Relevant change is the `router.middlewares` annotation on `module.ingress.kubernetes_ingress_v1.proxied-ingress` swapping the two anti-AI middlewares for `traefik-authentik-forward-auth`. The other 2 changes are pre-existing drift (volume_mounts, metallb annotation) and are out of scope. 4. scripts/tg apply --non-interactive \ -target=module.mailserver.module.ingress.kubernetes_ingress_v1.proxied-ingress 5. curl -sI https://mail.viktorbarzin.me/ — expect HTTP/2 302 to authentik.viktorbarzin.me Closes: code-bmh Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:56:25 +00:00
Viktor Barzin	8f5e131572	[mailserver] Route DMARC rua/ruf to dmarc@viktorbarzin.me [ci skip] ## Context Mailgun was decommissioned on 2026-04-12 in favour of Brevo as the outbound SMTP relay. The DMARC aggregate (`rua`) and forensic (`ruf`) report targets still pointed at `e21c0ff8@dmarc.mailgun.org`, an inbox that no longer exists — meaning every DMARC report Google/Microsoft/etc. generate has been bouncing or silently dropped for six days. No alerts fire on this (DMARC reports are best-effort, not RFC-mandated), but we've lost visibility into alignment failures and spoofing attempts during the exact window where the SPF/DKIM/DMARC posture was being reshaped for the Brevo cutover. Decision (2026-04-18): route reports to `mailto:dmarc@viktorbarzin.me`. The mailserver's catch-all sieve delivers anything addressed to non-existent local-parts into `spam@`, so `dmarc@` does not need to be provisioned as a real mailbox — the reports will land in `spam@`'s maildir unchanged. Alternative considered: route to a dedicated `dmarc@` maildir with sieve rules to file into a folder. Rejected for now — the monitoring value of DMARC reports is low-frequency (one aggregate per reporter per day at most), so the catch-all path is good enough until volume justifies a proper parser. Can be revisited once we see actual report traffic. The third-party aggregator target `adb84997@inbox.ondmarc.com` (Red Sift OnDMARC) is preserved in both rua and ruf — it provides parsed dashboards that we actually read. The `postmaster@viktorbarzin.me` ruf-only target also stays as a local mirror. As a side effect, this apply also canonicalises the TXT record: the previous value was stored as a two-string split in Cloudflare state (`...viktorbarzin" ".me;"`) due to the 255-byte TXT string limit (the record length exceeded 255 chars). 
The new value is shorter (dmarc@viktorbarzin.me is 21 chars vs e21c0ff8@dmarc.mailgun.org's 26 chars, doubled across rua and ruf) and fits in a single string, so the provider serialises it as one string and the prior split-drift noise disappears from future plans. ## This change Single-line content edit on `cloudflare_record.mail_dmarc` in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: Before → After (rua and ruf, both): ``` mailto:e21c0ff8@dmarc.mailgun.org → mailto:dmarc@viktorbarzin.me ``` All other DMARC tags unchanged: `v=DMARC1`, `p=quarantine`, `pct=100`, `fo=1`, `ri=3600`, `sp=quarantine`, `adkim=r`, `aspf=r`. Delivery flow: ``` DMARC reporter (Gmail/Outlook/...) │ aggregate XML.gz to rua / forensic to ruf ▼ dmarc@viktorbarzin.me │ mailserver catch-all (no local recipient) ▼ spam@viktorbarzin.me (Viki's mailbox) ``` ## What is NOT in this change - Mailbox sieve rules to file DMARC reports into a dedicated folder (separate concern; deferred until traffic justifies it). - DMARC parser / dashboard. OnDMARC (adb84997@inbox.ondmarc.com) already provides this for aggregate reports. - Policy tightening (`p=reject`, `pct` ramp) — out of scope. - SPF / DKIM records — not touched. - Removal of the split-string drift suppression, if any existed in prior work. The canonicalisation happens naturally on this apply; no separate workaround was needed. ## Test Plan ### Automated Targeted terragrunt plan + apply via `scripts/tg`: ``` $ cd stacks/cloudflared && scripts/tg plan \ -target=module.cloudflared.cloudflare_record.mail_dmarc ... Terraform will perform the following actions: # module.cloudflared.cloudflare_record.mail_dmarc will be updated in-place ~ resource "cloudflare_record" "mail_dmarc" { ~ content = "\"v=DMARC1; ... rua=mailto:e21c0ff8@dmarc.mailgun.org, mailto:adb84997@inbox.ondmarc.com; ... ruf=mailto:e21c0ff8@dmarc.mailgun.org, mailto:adb84997@inbox.ondmarc.com, mailto:postmaster@viktorbarzin\" \".me;\"" -> "\"v=DMARC1; ... 
rua=mailto:dmarc@viktorbarzin.me, mailto:adb84997@inbox.ondmarc.com; ... ruf=mailto:dmarc@viktorbarzin.me, mailto:adb84997@inbox.ondmarc.com, mailto:postmaster@viktorbarzin.me;\"" } Plan: 0 to add, 1 to change, 0 to destroy. $ scripts/tg apply /tmp/dmarc.tfplan module.cloudflared.cloudflare_record.mail_dmarc: Modifying... module.cloudflared.cloudflare_record.mail_dmarc: Modifications complete after 1s Apply complete! Resources: 0 added, 1 changed, 0 destroyed. ``` Authoritative DNS post-apply: ``` $ dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short "v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; adkim=r; aspf=r; rua=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com; ruf=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com,mailto:postmaster@viktorbarzin.me;" ``` Note: `dig @1.1.1.1` still served the old value immediately after apply — Cloudflare's public resolver holds its cache until TTL expires (TTL=1/auto ≈ 5 min). Authoritative NS is the source of truth. ### Manual Verification Setup: none (DNS change only). Commands: ``` # 1. Confirm authoritative DNS (run now, should pass) dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short # Expected: rua=mailto:dmarc@viktorbarzin.me,... and ruf similarly. # 2. Confirm public resolver catches up (run after ~5min) dig TXT _dmarc.viktorbarzin.me @1.1.1.1 +short # Expected: same as above (no more mailgun.org entries). # 3. Within 24-48h, check Viki's spam@ inbox for an incoming DMARC # aggregate report from Google/Microsoft/etc. Reports are # typically .zip or .gz attachments with XML inside. ``` Interpretation: seeing a DMARC report land in spam@ proves the end-to-end delivery path works: reporter DNS lookup → _dmarc.viktorbarzin.me → mailto:dmarc@viktorbarzin.me → catch-all → spam@ maildir. ## Reproduce locally ``` 1. git pull 2. cd stacks/cloudflared 3. dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short 4. 
Expected: rua=mailto:dmarc@viktorbarzin.me (and ruf the same). ``` Closes: code-569 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:49:14 +00:00
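The single-string versus two-string behaviour mentioned in the commit above follows from the 255-byte limit on each TXT character-string. A rough sketch, assuming the compact comma-separated form that the authoritative `dig` output shows; `txt_chunks` and `dmarc_record` are hypothetical helpers, and rebuilding the old value in that same compact form is an assumption, since the plan output elides it and the value stored in Terraform may be spaced differently.

```python
def txt_chunks(value, limit=255):
    """Split a TXT record value into strings of at most `limit` bytes."""
    raw = value.encode("ascii")
    return [raw[i:i + limit].decode("ascii") for i in range(0, len(raw), limit)]


TAGS = "v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; adkim=r; aspf=r; "
ONDMARC = "mailto:adb84997@inbox.ondmarc.com"
POSTMASTER = "mailto:postmaster@viktorbarzin.me"


def dmarc_record(report_addr):
    """Build the record with `report_addr` as the first rua and ruf target."""
    return (f"{TAGS}rua=mailto:{report_addr},{ONDMARC}; "
            f"ruf=mailto:{report_addr},{ONDMARC},{POSTMASTER};")


old = dmarc_record("e21c0ff8@dmarc.mailgun.org")  # assumed shape of the old value
new = dmarc_record("dmarc@viktorbarzin.me")       # matches the dig output above
print(len(old), len(txt_chunks(old)))
print(len(new), len(txt_chunks(new)))
```

Each occurrence of the shorter address saves 5 characters, 10 in total across rua and ruf, which is what moves the record from just over the limit to just under it.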
Viktor Barzin	17a3e03e07	[owntracks] Bridge Recorder → Dawarich via Lua hook script ## Context Viktor wanted live forwarding from Owntracks to Dawarich so his map stays in sync without a periodic backfill. The original plan assumed ot-recorder honoured an `OTR_HTTPHOOK` environment variable — but Recorder 1.0.1 (latest on Docker Hub as of Aug 2025) has no such feature: ``` $ kubectl -n owntracks exec deploy/owntracks -- \ strings /usr/bin/ot-recorder \| grep -iE 'hook\|webhook\|http_post' (no matches) ``` Lua hooks, on the other hand, are first-class: `--lua-script` loads a file and calls the `otr_hook(topic, _type, data)` function for every publish. That is the pivot this commit makes. ## This change Mount a Lua script via ConfigMap and tell ot-recorder to load it: ``` Phone POST /pub ---> Traefik ---> Recorder pod \| \| handle_payload() writes .rec \| otr_hook(topic,_type,data) \| \| \| +---> os.execute("curl … &") \| \| \| v \| Dawarich /api/v1/owntracks/points \| +---> HTTP 200 to phone ``` Per-publish cost: one `curl` subprocess, `--max-time 5`, backgrounded with `&` so it doesn't block the HTTP response to the phone. A Dawarich 5xx drops exactly one point — the `.rec` write still happens, so the one-shot backfill Job can always re-play. `DAWARICH_API_KEY` is injected from K8s Secret `owntracks-secrets` (sourced from Vault `secret/owntracks.dawarich_api_key` via the existing `dataFrom.extract` ExternalSecret). The Lua reads it with `os.getenv()` so the key never lands in Terraform state. ### Key discoveries in the verification loop (why iteration count > 1) 1. The hook function must be named `otr_hook`, not `hook` (recorder's `luasupport.c` calls `lua_getglobal(L, "otr_hook")`). The recorder logs `cannot invoke otr_hook in Lua script` when missing — the plan's `hook()` naming was wrong. 2. Dawarich's `latitude`/`longitude` scalar columns are legacy and always NULL; the authoritative geometry is in the `lonlat` PostGIS column (`ST_AsText(lonlat::geometry)`). 
Early "it's broken" readings were me querying the wrong columns. 3. Default Recreate-strategy rollouts cause ~30s 502/503 windows on the ingress — tolerable, but every apply is visible as an outage to the phone. Batching edits is important. ## What is NOT in this change - Not OTR_HTTPHOOK. Removed with this commit (dead env var). - Not the one-shot backfill Job — that comes after the phone buffer has flushed to avoid racing against incoming hook POSTs (follow-up: code-h2r). - Not Anca's bridge — a second Recorder instance or a smarter hook is needed to route her posts under her own Dawarich api_key (follow-up: code-72g). - No Ingress or Service change — Commit 1 (``a21d4a44``) already landed those. ## Test Plan ### Automated ``` $ ../../scripts/tg apply --non-interactive Apply complete! Resources: 1 added, 1 changed, 0 destroyed. $ kubectl -n owntracks logs deploy/owntracks --tail=5 + initializing Lua hooks from `/hook/dawarich-hook.lua' + dawarich-bridge: init + HTTP listener started on 0.0.0.0:8083, without browser-apikey ... 
+ dawarich-bridge: tst=1 lat=0 lon=0 ok=true ``` ### Manual Verification ``` $ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks \| jq -r .viktor) $ TST=$(date +%s) $ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \ curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \ -H 'Content-Type: application/json' \ -H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \ -d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \ https://owntracks.viktorbarzin.me/pub HTTP 200 $ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \ psql -U postgres -d dawarich -c \ "SELECT timestamp, ST_AsText(lonlat::geometry) FROM points \ WHERE user_id=1 AND timestamp=$TST" timestamp \| st_astext ------------+------------------------- 1776555707 \| POINT(-0.1278 51.5074) ``` Real phone traffic (from in-flight buffer flush) lands in Dawarich too: `traefik logs -l app.kubernetes.io/name=traefik \| grep 'POST /api/v1/owntracks/points'` shows ingress POSTs from `owntracks` namespace to `dawarich` backend with status 200. ### Reproduce locally 1. `vault login -method=oidc` 2. `kubectl -n owntracks logs deploy/owntracks --tail=20` — expect `dawarich-bridge: init` after the Lua loader line. 3. Do the curl above, poll the DB, expect `POINT(lon lat)`. Closes: code-z9b Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:47:22 +00:00
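For the manual verification in the commit above, note the coordinate order: the OwnTracks payload carries `lat` then `lon`, while the WKT text returned by `ST_AsText` puts longitude first. A small sketch with hypothetical helper names, useful when comparing a posted point against the `lonlat` column:

```python
import json


def owntracks_payload(lat, lon, tst, tid="vb"):
    """JSON body for POST /pub with the same fields as the curl example above."""
    return json.dumps({"_type": "location", "lat": lat, "lon": lon,
                       "tst": tst, "tid": tid})


def expected_wkt(lat, lon):
    """WKT point as ST_AsText prints it: longitude first, then latitude."""
    return f"POINT({lon} {lat})"


body = owntracks_payload(51.5074, -0.1278, 1776555707)
print(body)
print(expected_wkt(51.5074, -0.1278))
```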
Viktor Barzin	cfd0f5bcc9	[mailserver] Add liveness/readiness TCP probes [ci skip] ## Context The mailserver container (Postfix + Dovecot in one pod) had no liveness, readiness, or startup probes declared. If either daemon deadlocked or hung on a socket, Kubernetes had no way to detect it and restart. The only external canary was the email-roundtrip-monitor CronJob which runs on a 20-minute interval, giving a detection lag of 20-60 minutes — long enough for real delivery failures before an alert fires. Tracked as bd code-ekf out of the mailserver probe audit. Both port 25 (SMTP) and port 993 (IMAPS) are cheap, reliable up-signals — the existing e2e probe already hits IMAPS, so TCP probes on those ports are a close proxy for user-visible service health without the cost of full SMTP/IMAP handshakes every 10s. ## This change Adds a readiness_probe (TCP :25, initial_delay=30s, period=10s) and a liveness_probe (TCP :993, initial_delay=60s, period=60s, timeout=15s) to the mailserver deployment's primary container. Design choices: - TCP over exec/HTTP: the daemons do not expose HTTP health; exec probes would require shelling into the container with auth for SMTP/IMAP banner checks, which is both costly and flaky. TCP accept is sufficient — if postfix cannot accept a TCP connection on :25 it is unambiguously broken. - Split ports per probe: readiness on :25 (the public SMTP surface — if this is down, external delivery is broken) and liveness on :993 (IMAPS, the other critical daemon — catches Dovecot deadlocks independently of Postfix). - 30s readiness delay: Postfix needs ~20-30s to warm up including chroot setup and DKIM key loading; probing earlier would cause bogus NotReady cycles on deploy. - 60s liveness delay + 60s period + 15s timeout: generous so transient blips (brief CPU spike, RBL timeout, slow NFS unmount during rotation) do not trigger a restart loop. 
With failure_threshold=3 (default), a real deadlock is detected in ~3 minutes; false positives on transient load are suppressed. - No startup_probe: the 60s liveness initial_delay is enough cover for the warmup window; adding a startup probe would be redundant machinery. ## What is NOT in this change - No startup_probe (liveness initial_delay_seconds=60 handles warmup) - No exec-based probes (banner-check probes are out of scope and not needed) - No changes to the opendkim or other sidecars - Pre-existing drift in other stacks (dawarich namespace label, owntracks dawarich-hook wiring) is deliberately left out — those are separate workstreams ## Test Plan ### Automated Applied via `tg apply -target=kubernetes_deployment.mailserver` before this commit. Current pod state: ``` $ kubectl get pod -n mailserver -l app=mailserver NAME READY STATUS RESTARTS AGE mailserver-6c6bf77ffb-w7nl5 2/2 Running 0 2m26s $ kubectl describe pod -n mailserver -l app=mailserver \| grep -E "(Liveness\|Readiness\|Restart Count\|Status:\|Ready:)" Status: Running Ready: True Restart Count: 0 Ready: True Restart Count: 0 Liveness: tcp-socket :993 delay=60s timeout=15s period=60s #success=1 #failure=3 Readiness: tcp-socket :25 delay=30s timeout=1s period=10s #success=1 #failure=3 ``` Pod has run >120s (two full liveness cycles) with RESTARTS=0 and Ready=True. ### Manual Verification 1. Confirm probes are declared on the live pod: ``` kubectl describe pod -n mailserver -l app=mailserver \| grep -E "(Liveness\|Readiness)" ``` Expected: `Liveness: tcp-socket :993 ...` and `Readiness: tcp-socket :25 ...` 2. Confirm pod stays Ready under normal load for 5+ minutes: ``` kubectl get pod -n mailserver -l app=mailserver -w ``` Expected: RESTARTS stays at 0, READY stays at 2/2. 3. (Optional) Failure-simulate by dropping :993 inside the pod and observing liveness failure + restart within ~3 minutes (3 × period_seconds). ## Reproduce locally 1. `cd infra/stacks/mailserver` 2. 
`tg plan -target=kubernetes_deployment.mailserver` 3. Expected: no drift (or only the probe additions if rolling forward a stale state) 4. `kubectl get pod -n mailserver -l app=mailserver` — pod Ready, RESTARTS=0 5. `kubectl describe pod -n mailserver -l app=mailserver \| grep -E "(Liveness\|Readiness)"` — both probes present Closes: code-ekf Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:45:17 +00:00
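What a TCP probe checks, and the detection-time arithmetic quoted in the commit above, can be approximated outside the cluster with a short sketch. The helper names are hypothetical; the kubelet's own probe implementation is not shown here.

```python
import socket


def tcp_accepts(host, port, timeout=1.0):
    """True if a TCP connection to host:port completes within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def detection_seconds(period_seconds, failure_threshold=3):
    """Approximate time for consecutive probe failures to reach the threshold."""
    return period_seconds * failure_threshold


print(detection_seconds(60))  # liveness on :993, period=60s
print(detection_seconds(10))  # readiness on :25, period=10s
```

For example, `tcp_accepts("mail.viktorbarzin.me", 993, timeout=15)` would mirror the liveness settings from a machine that can reach the service; the probe itself runs against the pod IP.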
Viktor Barzin	ac604d4d1f	[monitoring] uk-payslip: cash-basis queries + RSU vest panel - Panels 1/2/4: compute on (gross_pay - rsu_vest) so numbers reflect actual UK cash pay, not the RSU-inflated figure the payslip shows. - Detailed table: add cash_gross / rsu_vest / rsu_offset columns. - New RSU panel at the bottom: bar chart of rsu_vest over time (only shows months with stock vests). Taxed at Schwab — included here for reporting/reconciliation, not for P&L.	2026-04-18 23:39:46 +00:00
Viktor Barzin	0a2d8b2138	[mailserver] Move probe secrets to ExternalSecret via ESO [ci skip] ## Context The email-roundtrip-monitor CronJob injected `BREVO_API_KEY` and `EMAIL_MONITOR_IMAP_PASSWORD` as inline `env { value = var.xxx }` — Terraform read them from Vault at plan time and embedded them in the generated CronJob spec. Anyone with `kubectl describe cronjob` (or pod-event read) in the `mailserver` namespace could read both secrets verbatim. The two upstream Vault entries are not flat strings: - `secret/viktor` → `brevo_api_key` = base64(JSON({"api_key": "..."})) - `secret/platform` → `mailserver_accounts` = JSON({"spam@viktorbarzin.me": "<pw>", ...}) A plain ESO `remoteRef.property` can traverse one level of JSON but cannot base64-decode the wrapper or index a map key that contains `@`. So the ExternalSecret pulls the raw Vault values and the rendered K8s Secret is produced via ESO's `target.template` (engineVersion v2, sprig pipeline `b64dec \| fromJson \| dig`). `mergePolicy` defaults to Replace, so only the transformed `BREVO_API_KEY` / `EMAIL_MONITOR_IMAP_PASSWORD` keys land in the K8s Secret — the raw wrapped inputs never reach it. ## This change 1. New `kubernetes_manifest.email_roundtrip_monitor_secrets` rendering an `external-secrets.io/v1beta1` ExternalSecret into a K8s Secret named `mailserver-probe-secrets` via the `vault-kv` ClusterSecretStore. 2. CronJob's two `env { name=... value=var.xxx }` blocks replaced with a single `env_from { secret_ref { name = "mailserver-probe-secrets" } }`. 3. Unused `brevo_api_key` / `email_monitor_imap_password` module variables + their wiring in `stacks/mailserver/main.tf` removed. `data "vault_kv_secret_v2" "viktor"` dropped (last consumer gone). 
``` Before: After: ┌────────────┐ ┌────────────┐ │ Vault KV │ │ Vault KV │ └────┬───────┘ └────┬───────┘ │ (plan-time read) │ (runtime pull) ▼ ▼ ┌────────────┐ ┌────────────┐ │ Terraform │ │ ESO ctrl │ │ state │ │ +template │ └────┬───────┘ └────┬───────┘ │ inline value= │ sprig b64dec \| fromJson ▼ ▼ ┌────────────┐ ┌────────────┐ │ CronJob │ <-- kubectl describe leaks! │ K8s Secret │ │ env[].value│ │ probe-sec │ └────────────┘ └────┬───────┘ │ env_from.secret_ref ▼ ┌────────────┐ │ CronJob │ │ (no values │ │ in spec) │ └────────────┘ ``` ## Test Plan ### Automated `terragrunt plan -target=...ExternalSecret -target=...CronJob`: ``` Plan: 1 to add, 1 to change, 0 to destroy. + kubernetes_manifest.email_roundtrip_monitor_secrets (ExternalSecret) ~ kubernetes_cron_job_v1.email_roundtrip_monitor - env { name = "BREVO_API_KEY" ... } - env { name = "EMAIL_MONITOR_IMAP_PASSWORD" ... } + env_from { secret_ref { name = "mailserver-probe-secrets" } } ``` `terragrunt apply --non-interactive` same targets: ``` Apply complete! Resources: 1 added, 1 changed, 0 destroyed. ``` `kubectl get externalsecret -n mailserver mailserver-probe-secrets`: ``` NAME STORE REFRESH INTERVAL STATUS READY mailserver-probe-secrets vault-kv 15m SecretSynced True ``` `kubectl get secret -n mailserver mailserver-probe-secrets -o yaml` exposes exactly two data keys (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) — both populated, 120 / 32 base64 chars, no raw `brevo_api_key_wrapped` / `mailserver_accounts` keys. `kubectl describe cronjob -n mailserver email-roundtrip-monitor`: ``` Environment Variables from: mailserver-probe-secrets Secret Optional: false Environment: <none> ``` (Previously the `Environment:` block listed both secrets with their raw values.) ### Manual Verification 1. `kubectl create job --from=cronjob/email-roundtrip-monitor \ probe-test-$RANDOM -n mailserver` 2. `kubectl logs -n mailserver -l job-name=probe-test-... 
--tail=30` expected: ``` Sent test email via Brevo: 201 marker=e2e-probe-... Found test email after 1 attempts Deleted 1 e2e probe email(s) Round-trip SUCCESS in 20.3s Pushed metrics to Pushgateway Pushed to Uptime Kuma ``` 3. `kubectl exec -n monitoring deploy/prometheus-prometheus-pushgateway \ -- wget -q -O- http://localhost:9091/metrics \| grep email_roundtrip` shows `email_roundtrip_success=1`, fresh timestamp, duration in range. 4. `kubectl delete job -n mailserver probe-test-...` to clean up. Closes: code-39v Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:39:06 +00:00
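The `b64dec | fromJson | dig` template pipeline described in the commit above is roughly equivalent to the following decoding steps. This is a sketch in Python rather than sprig, with made-up secret values; which mailbox supplies `EMAIL_MONITOR_IMAP_PASSWORD` is an assumption (the commit only shows `spam@viktorbarzin.me` as an example key).

```python
import base64
import json


def render_probe_secret(brevo_api_key_wrapped, mailserver_accounts, account):
    """Produce the two keys that land in the rendered K8s Secret."""
    brevo = json.loads(base64.b64decode(brevo_api_key_wrapped))  # b64dec | fromJson
    accounts = json.loads(mailserver_accounts)                   # fromJson
    return {
        "BREVO_API_KEY": brevo["api_key"],                       # dig "api_key"
        "EMAIL_MONITOR_IMAP_PASSWORD": accounts[account],        # key contains "@"
    }


# Fake inputs shaped like the two Vault entries.
wrapped = base64.b64encode(json.dumps({"api_key": "fake-key"}).encode()).decode()
accounts_json = json.dumps({"spam@viktorbarzin.me": "fake-password"})
print(render_probe_secret(wrapped, accounts_json, "spam@viktorbarzin.me"))
```

One difference worth noting: sprig's `dig` falls back to a default when a key is missing, whereas this sketch raises `KeyError`.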
Viktor Barzin	73ed2d9001	[monitoring] Add detailed-payslips table + full-deductions panels Two new panels below the 4 existing ones: - Detailed table: every payslip sorted by pay_date DESC with all fields (gross, all deductions, net, tax_year, validated flag, paperless_doc_id). Footer reducer sums the numeric columns. - Full deductions stacked bars: income_tax + NI + pension_employee + pension_employer + student_loan per payslip. The earlier panel only showed 4 deductions; this one shows the complete picture.	2026-04-18 23:32:21 +00:00
Viktor Barzin	4cd8d96b01	[monitoring] Widen uk-payslip default time range to 10y Oldest payslip in Paperless is July 2019. Previous default (now-2y) hid everything from 2019-2023, making it look like the backfill was broken.	2026-04-18 23:26:49 +00:00
Viktor Barzin	1698cd1ce1	[mailserver] Add daily backup CronJob for mailserver PVC ## Context The mailserver stack holds everything valuable and hard to recreate: 243M of maildirs, dovecot/rspamd state, and the DKIM private key that signs outbound mail. Today the only defense is the LVM thin-pool snapshots on the PVE host (7-day retention, storage-class scope only) — there is no app-level backup. Infra/.claude/CLAUDE.md mandates that every proxmox-lvm(-encrypted) app ship a NFS-backed backup CronJob, and the mailserver stack was the only one still out of compliance. Loss of mailserver-data-encrypted without backups = total loss of all stored mail plus a DKIM key rotation (which requires a DNS update and breaks signature verification on every message in transit for the TTL window). Unacceptable for a service people actually use. Trade-offs considered: - mysqldump-style single-file dump vs rsync snapshot — maildirs are millions of small files, not a DB export. rsync --link-dest gives incremental weekly snapshots for ~10% of the cost of a full copy. - RWO PVC read-only mount — the underlying PVC is ReadWriteOnce, so the backup Job has to co-locate with the mailserver pod. vaultwarden solves this with pod_affinity; mirrored here. - Image choice — alpine + apk add rsync matches vaultwarden's pattern and keeps the container image small. ## This change Adds `kubernetes_cron_job_v1.mailserver-backup` + NFS PV/PVC to the mailserver module. Runs daily at 03:00 (avoids the 00:30 mysql-backup and 00:45 per-db windows, and the /20 email-roundtrip cadence). The job rsyncs /var/mail, /var/mail-state, /var/log/mail into /srv/nfs/mailserver-backup/<YYYY-WW>/ with --link-dest against the previous week for space-efficient incrementals. 8-week retention. 
Data layout (flowed through from the deployment's subPath mounts so the rsync tree matches the mailserver's own on-disk layout): PVC mailserver-data-encrypted (RWO, 2Gi) ├─ data/ (subPath) → pod's /var/mail → backup/<week>/data/ ├─ state/ (subPath) → pod's /var/mail-state → backup/<week>/state/ └─ log/ (subPath) → pod's /var/log/mail → backup/<week>/log/ Safety: - PVC mounted read-only (volume.persistent_volume_claim.read_only AND all three volume_mounts set read_only=true) so a backup-script bug cannot corrupt maildirs. - pod_affinity on app=mailserver + topology_key=hostname forces the Job pod onto the same node holding the RWO PVC attachment. - set -euxo pipefail + per-directory existence guard so a missing subPath short-circuits cleanly instead of silently no-op'ing. Metrics pushed to Pushgateway match the mysql-backup/vaultwarden-backup convention (job="mailserver-backup"): backup_duration_seconds, backup_read_bytes, backup_written_bytes, backup_output_bytes, backup_last_success_timestamp. Alert rules added in monitoring stack, mirroring Mysql/Vaultwarden: - MailserverBackupStale — 36h threshold, critical, 30m for: - MailserverBackupNeverSucceeded — critical, 1h for: ## Reproduce locally 1. cd infra/stacks/mailserver && ../../scripts/tg plan Expected: 3 to add (cronjob + NFS PV + PVC), unrelated drift on deployment/service is pre-existing. 2. ../../scripts/tg apply --non-interactive \ -target=module.mailserver.module.nfs_mailserver_backup_host \ -target=module.mailserver.kubernetes_cron_job_v1.mailserver-backup 3. cd ../monitoring && ../../scripts/tg apply --non-interactive 4. kubectl create job --from=cronjob/mailserver-backup \ mailserver-backup-test -n mailserver 5. kubectl wait --for=condition=complete --timeout=300s \ job/mailserver-backup-test -n mailserver 6. 
Expected: test pod co-locates with mailserver on same node (k8s-node2 today), rsync writes ~950M to /srv/nfs/mailserver-backup/<YYYY-WW>/, Pushgateway exposes backup_output_bytes{job="mailserver-backup"}.

## Test Plan

### Automated

    $ kubectl get cronjob -n mailserver mailserver-backup
    NAME                SCHEDULE    TIMEZONE   SUSPEND   ACTIVE   LAST SCHEDULE   AGE
    mailserver-backup   0 3 * * *   <none>     False     0        <none>          3s

    $ kubectl create job --from=cronjob/mailserver-backup \
        mailserver-backup-test -n mailserver
    job.batch/mailserver-backup-test created

    $ kubectl wait --for=condition=complete --timeout=300s \
        job/mailserver-backup-test -n mailserver
    job.batch/mailserver-backup-test condition met

    $ kubectl logs -n mailserver job/mailserver-backup-test | tail -5
    === Backup IO Stats ===
    duration: 80s
    read: 1120 MiB
    written: 1186 MiB
    output: 947.0M

    $ kubectl run nfs-verify --rm --image=alpine --restart=Never \
        --overrides='{...nfs mount /srv/nfs...}' \
        -n mailserver --attach -- ls -la /nfs/mailserver-backup/
    947.0M /nfs/mailserver-backup/2026-15

    $ curl http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
        | grep mailserver-backup
    backup_duration_seconds{instance="",job="mailserver-backup"} 80
    backup_last_success_timestamp{instance="",job="mailserver-backup"} 1.776554641e+09
    backup_output_bytes{instance="",job="mailserver-backup"} 9.92315701e+08
    backup_read_bytes{instance="",job="mailserver-backup"} 1.175027712e+09
    backup_written_bytes{instance="",job="mailserver-backup"} 1.244254208e+09

    $ curl -s http://prometheus-server/api/v1/rules \
        | jq '.data.groups[].rules[] | select(.name | test("Mailserver"))'
    MailserverBackupStale: (time() - kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"}) > 129600
    MailserverBackupNeverSucceeded: kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"} == 0

### Manual Verification

1. 
Wait for the scheduled 03:00 run tonight; verify `kubectl get job -n mailserver` shows a new completed job. 2. Check that `backup_last_success_timestamp` advances past today. 3. Confirm `MailserverBackupNeverSucceeded` did not fire. 4. Next week (week 16), confirm `--link-dest` builds hardlinks vs 2026-15 (size delta should drop from ~950M to ~the actual churn). ## Deviations from mysql-backup pattern - Image: alpine + rsync (mirrors vaultwarden — mysql's `mysql:8.0` base is not applicable for a filesystem rsync). - pod_affinity: required for RWO PVC co-location (mysql uses its own MySQL service for network access; mailserver must mount the PVC). - Metric push via wget (mirrors vaultwarden; alpine has wget, not curl). - Week-folder layout with --link-dest rotation: rsync pattern, closer to the PVE daily-backup script than mysql's single-file gzip dumps. [ci skip] Closes: code-z26 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:26:08 +00:00
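The week-folder rotation described in the backup commit above (one `<YYYY-WW>` folder per week, `--link-dest` against the previous week, 8-week retention) can be sketched as follows. This is an illustration of the naming and pruning arithmetic only, not the CronJob's script. It assumes the week number is the Monday-first `%W` form, which is an inference from the `2026-15` folder created on 2026-04-18; the function names are hypothetical.

```python
from datetime import date
from typing import List, Optional

RETENTION_WEEKS = 8  # retention stated in the commit message


def week_folder(day: date) -> str:
    # Assumption: Monday-first week number (%W), zero-padded.
    return day.strftime("%Y-%W")


def link_dest(existing: List[str], current: str) -> Optional[str]:
    """Newest snapshot older than the current week, to pass to rsync --link-dest."""
    older = sorted(name for name in existing if name < current)
    return older[-1] if older else None


def to_prune(existing: List[str], keep: int = RETENTION_WEEKS) -> List[str]:
    """Oldest folders beyond the retention window; zero-padded names sort by age."""
    ordered = sorted(existing)
    return ordered[:-keep] if len(ordered) > keep else []


folders = ["2026-%02d" % week for week in range(6, 16)]  # ten hypothetical snapshots
current = week_folder(date(2026, 4, 18))
previous = link_dest(folders, current)
stale = to_prune(folders)
```

With no older folder present, `link_dest` returns `None` and the first run is a full copy, which matches the ~950M first snapshot reported in the test plan.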
Viktor Barzin	a21d4a4424	[owntracks] Fix Service port scheme (https→http), unbreak phone POSTs

## Context

iOS Owntracks app has been unable to upload for months — phone buffer now holds ~1200 pending points. Last successful `.rec` write was 2026-01-02T14:32:00Z, matching when the failures started.

### The 500 — verified in Traefik access log

```
152.37.101.156 - viktor "POST /pub HTTP/1.1" 500 21 "-" "-" 47900 "owntracks-owntracks-owntracks-viktorbarzin-me@kubernetes" "https://10.10.107.194:8083" 84ms
```

Basic-auth + middleware chain (rate-limit, csp, crowdsec) all pass. Traefik then opens backend connection to `https://10.10.107.194:8083`. The Recorder pod listens plain HTTP on :8083 (`OTR_PORT=0` disables HTTPS in ot-recorder), so the TLS handshake never completes → 500.

### Root cause — Service port spec

`kubernetes_service.owntracks` declared the port as:
```
name: https
port: 443
targetPort: 8083
```

Traefik's IngressClass scheme inference: if the Service port is named `https` OR numbered `443`, Traefik speaks HTTPS to that backend. Both were true here, pointing at a plain-HTTP socket. The name/number were purely cosmetic — a leftover from mirroring the external `:443` edge — and worked only while Traefik's default happened to be HTTP. A Traefik upgrade (or middleware-chain change) tightened inference and surfaced the mismatch.

## This change

Rename port to `name=http, port=80` and update the matching Ingress backend `port.number` from 443 to 80. `targetPort` stays at 8083.

```
Phone -----> CF tunnel -----> Traefik (:443, TLS) -----> Service
                                  \                      :80 (http)
                                   \                         |
                                    \                        v
                                     ---------------> Pod :8083
                                     (plain HTTP hop) (HTTP listener)
```

Deployment container port label also renamed `https` → `http` for consistency (no functional effect — just readability).

## What is NOT in this change

- Not switching the Recorder pod to HTTPS natively. That would require mounting a cert + rotation plumbing. 
External TLS is already terminated at Cloudflare/Traefik; in-cluster hop to the pod is plain-HTTP by design.
- Not enabling `OTR_HTTPHOOK` to bridge Recorder → Dawarich (follow-up: code-z9b).
- Not backfilling historical `.rec` files into Dawarich (follow-up: code-h2r).
- Incidental: `providers.tf` + `.terraform.lock.hcl` refreshed by `terraform init -upgrade` to pick up the goauthentik provider that the ingress_factory module recently started requiring.

## Test Plan

### Automated

```
$ ../../scripts/tg plan
Plan: 0 to add, 3 to change, 0 to destroy.

$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.

$ kubectl -n owntracks get svc owntracks -o=jsonpath='{.spec.ports[0]}'
{"name":"http","port":80,"protocol":"TCP","targetPort":8083}

$ kubectl -n owntracks get ingress owntracks -o=jsonpath='{.spec.rules[0].http.paths[0].backend}'
{"service":{"name":"owntracks","port":{"number":80}}}
```

### Manual Verification

In-cluster auth'd POST through the full ingress chain:
```
VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
kubectl -n owntracks run curltest --rm -i --image=curlimages/curl --restart=Never -- \
  curl -s -o /dev/null -w "HTTP %{http_code}\n" -X POST -u "viktor:$VIKTOR_PW" \
  -H "Content-Type: application/json" \
  -d '{"_type":"location","lat":0,"lon":0,"tst":1000000000,"tid":"vb"}' \
  https://owntracks.viktorbarzin.me/pub
# HTTP 200
```
(previously: HTTP 500 on identical request)

### Reproduce locally

1. `vault login -method=oidc`
2. `cd infra/stacks/owntracks && ../../scripts/tg plan`
3. Expected: `Plan: 0 to add, 3 to change, 0 to destroy.` (or empty if already applied)
4. Watch next iOS Owntracks POST → Traefik access log should show `200`, not `500`.

Closes: code-nqd
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:24:25 +00:00
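The scheme-inference rule described in the owntracks commit above fits in a few lines. This sketch encodes the rule exactly as the commit message states it (port named `https` or numbered `443` means HTTPS to the backend); it is not taken from Traefik's source, and the function names are hypothetical.

```python
def backend_scheme(port_name: str, port_number: int) -> str:
    """Scheme used for the backend hop, per the rule stated in the commit message."""
    if port_name == "https" or port_number == 443:
        return "https"
    return "http"


def backend_url(pod_ip: str, target_port: int, port_name: str, port_number: int) -> str:
    # The Service port only selects the scheme; traffic still lands on targetPort.
    return f"{backend_scheme(port_name, port_number)}://{pod_ip}:{target_port}"


before = backend_url("10.10.107.194", 8083, "https", 443)
after = backend_url("10.10.107.194", 8083, "http", 80)
```

`before` reproduces the `https://10.10.107.194:8083` backend URL seen in the access log, which is the mismatch against a plain-HTTP listener.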
Viktor Barzin	cc56ba2939	[payslip-ingest] Move Payslips datasource 'database' into jsonData Grafana 11.2+ Postgres plugin reads the DB name from jsonData.database (see grafana/grafana#112418). The top-level 'database' field is silently ignored by the frontend — datasource health checks and POST /api/ds/query still work because the backend honors it, but every dashboard panel fails with 'you do not have default database'. Rolling back to the supported shape fixes rendering for all 4 uk-payslip panels.	2026-04-18 23:23:07 +00:00
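The datasource shape change in the payslip-ingest commit above amounts to moving one field. A minimal sketch on a plain dict, assuming a provisioning entry with a top-level `database` key; the `payslips` database name is hypothetical, while the type and UID are the ones named elsewhere in this log.

```python
import copy


def move_database_to_json_data(datasource: dict) -> dict:
    """Return a copy with the top-level 'database' moved under jsonData."""
    fixed = copy.deepcopy(datasource)
    database = fixed.pop("database", None)
    if database is not None:
        fixed.setdefault("jsonData", {})["database"] = database
    return fixed


legacy = {
    "name": "Payslips",
    "type": "grafana-postgresql-datasource",
    "uid": "payslips-pg",
    "database": "payslips",  # hypothetical DB name
    "jsonData": {"sslmode": "disable"},
}
fixed = move_database_to_json_data(legacy)
```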
Viktor Barzin	f6cff262f0	broker-sync: chown fidelity_storage_state to broker uid in init container

## Context

First end-to-end test of the broker-sync-fidelity CronJob failed with `PermissionError: [Errno 13] Permission denied: '/data/fidelity_storage_state.json'`. Init container runs as root (uid 0) but the broker-sync container runs as uid 10001; chmod 600 without chown made the file unreadable from the main container.

## This change

Added `chown 10001:10001` before the existing `chmod 600` in the `stage-storage-state` init container command. Init container has CAP_CHOWN by default as root, so this succeeds.

## Verification

    $ kubectl apply -f test-pod.yaml # same init + main pattern
    $ kubectl logs fidelity-debug -c broker-sync
    ...
    broker_sync.providers.fidelity_planviewer.FidelitySessionError: PlanViewer session stale — run `broker-sync fidelity-seed`

Init container succeeded + main container read the file + Playwright launched Chromium + navigated to PlanViewer + hit the 15-min idle page → exactly the intended behaviour for a stale session.

Next step (out-of-band): Viktor pastes a fresh SMS OTP and re-seeds via fidelity-seed on his laptop or the existing chat-driven flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:22:43 +00:00
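Why `chmod 600` alone produced the `PermissionError` in the broker-sync commit above can be shown with a simplified POSIX read check. This sketch ignores supplementary groups, ACLs and capabilities, and assumes the main container's gid is also 10001, which the commit does not state.

```python
import stat


def can_read(mode: int, owner_uid: int, owner_gid: int, uid: int, gid: int) -> bool:
    """Simplified POSIX read check: owner bits, then group bits, then other bits."""
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if gid == owner_gid:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# Before the fix: file written by root (0:0), mode 600, read by uid 10001.
before_fix = can_read(0o600, 0, 0, 10001, 10001)
# After the fix: chown 10001:10001, then chmod 600.
after_fix = can_read(0o600, 10001, 10001, 10001, 10001)
```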
Viktor Barzin	b9e9c3f084	[mailserver] Update SPF + docs for Brevo migration [ci skip] ## Context Outbound mail relay migrated from Mailgun EU to Brevo EU on 2026-04-12 when variables.tf:6 of the mailserver stack was switched to `smtp-relay.brevo.com:587`. Postfix immediately began using Brevo for user mail — but the SPF TXT record at viktorbarzin.me was left pointing at `include:mailgun.org -all`, so every Brevo-relayed message failed SPF alignment and was spam-foldered or DMARC-quarantined by Gmail/Outlook. Observed on 2026-04-18 via `dig TXT viktorbarzin.me @1.1.1.1`: "v=spf1 include:mailgun.org -all" <-- wrong sender network User decision (2026-04-18): switch to `v=spf1 include:spf.brevo.com ~all`. Soft-fail (`~all`) is intentional during cutover — keeps unauthorized Brevo sends quarantined rather than outright rejected while we validate Brevo's sending IPs + rate limits for real user mail. Tighten to `-all` once the relay is proven stable. The docs in `docs/architecture/mailserver.md` still described the old Mailgun-based configuration (Overview paragraph, DNS table, Vault secrets table). Per `infra/.claude/CLAUDE.md` rule "Update docs with every change", those are updated in the same commit. ## This change Coupled commit covering beads tasks code-q8p (SPF) + code-9pe (docs): 1. `stacks/cloudflared/modules/cloudflared/cloudflare.tf` — SPF TXT content flipped from `include:mailgun.org -all` to `include:spf.brevo.com ~all`, with an inline comment pointing at the mailserver docs for rationale. 2. `docs/architecture/mailserver.md` — - Last-updated stamp moved to 2026-04-18 with the cutover note. - Overview paragraph now says "relays through Brevo EU" (was Mailgun). - DNS table SPF row reflects the new value plus an annotated history note ("was include:mailgun.org -all until 2026-04-18"). 
- DMARC row now calls out the intended `dmarc@viktorbarzin.me` rua target and flags that the current live record still points at e21c0ff8@dmarc.mailgun.org, tracked under follow-up code-569. - Vault secrets table: `mailserver_sasl_passwd` relabelled as Brevo relay credentials; `mailgun_api_key` annotated as retained for the E2E roundtrip probe only (inbound delivery testing, not user mail). Apply was scoped with `-target=module.cloudflared.cloudflare_record.mail_spf` to avoid sweeping up two unrelated pre-existing drifts that the Terraform state shows on this stack: the DMARC + mail._domainkey_rspamd records are stored on Cloudflare as RFC-compliant split TXT strings (>255 bytes), and a naive refresh+apply would normalize them in the state back to single strings. Those drifts are semantically equivalent (DNS concatenates adjacent TXT strings at resolution time) and are out of scope for this commit — they'll be handled under their own ticket. ## What is NOT in this change - DMARC `rua=mailto:dmarc@viktorbarzin.me` cutover — that's code-569 (M1), still using the legacy `e21c0ff8@dmarc.mailgun.org` + ondmarc addresses in the live record. - DMARC/DKIM TXT multi-string state reconciliation on `mail_dmarc` and `mail_domainkey_rspamd` — pre-existing Cloudflare representation drift, untouched here. - Removal of Mailgun references in history/decision sections of the docs, or the Mailgun-backed E2E roundtrip probe — probe still uses Mailgun API on purpose for inbound delivery testing (code-569 scope). - Mailgun DKIM record `s1._domainkey` — left in place; still consumed by the roundtrip probe. - Other pending items from the 2026-04-18 mail audit plan. ## Test Plan ### Automated Targeted plan showed exactly one change, no other drift sneaking in: module.cloudflared.cloudflare_record.mail_spf will be updated in-place ~ content = "\"v=spf1 include:mailgun.org -all\"" -> "\"v=spf1 include:spf.brevo.com ~all\"" Plan: 0 to add, 1 to change, 0 to destroy. 
Apply result:

    Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

DNS propagation verified on three independent resolvers immediately after apply:

    $ dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"
    $ dig TXT viktorbarzin.me @8.8.8.8 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"
    $ dig TXT viktorbarzin.me @10.0.20.201 +short | grep spf # Technitium primary
    "v=spf1 include:spf.brevo.com ~all"

### Manual Verification

Setup: nothing extra — change is already live (TF applied before commit per home-lab convention; `[ci skip]` in title).

1. Confirm SPF is the Brevo-only record from an external resolver:

       dig TXT viktorbarzin.me @1.1.1.1 +short

   Expected: `"v=spf1 include:spf.brevo.com ~all"` — no Mailgun reference.
2. Send a test email via the mailserver (through Brevo relay) to a Gmail account and view the original headers:

       Authentication-Results: ... spf=pass smtp.mailfrom=viktorbarzin.me ...
       Received-SPF: Pass (google.com: domain of ... designates ... as permitted sender)

   Expected: `spf=pass` (it was `spf=fail` or `spf=softfail` before this change because the envelope sender IP was a Brevo IP not covered by `include:mailgun.org`).
3. Confirm no live Mailgun references in the mailserver doc:

       grep -n mailgun.org infra/docs/architecture/mailserver.md

   Expected: only annotated-history mentions — SPF "was ... until 2026-04-18" and DMARC "current live record still points at e21c0ff8@dmarc.mailgun.org pending cutover". No claims of active Mailgun relay.

## Reproduce locally

    cd infra
    git pull
    dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    # expected: "v=spf1 include:spf.brevo.com ~all"
    # inspect the TF change:
    git show HEAD -- stacks/cloudflared/modules/cloudflared/cloudflare.tf
    # inspect the doc change:
    git show HEAD -- docs/architecture/mailserver.md

Closes: code-q8p
Closes: code-9pe
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:13:47 +00:00
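The difference between the old and new SPF records in the commit above comes down to the `include:` target and the qualifier on `all`. The following is a small illustrative parser covering only those two mechanisms; it is not a full SPF evaluator and performs no DNS lookups.

```python
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def parse_spf(record: str) -> dict:
    """Extract include targets and the result attached to the 'all' mechanism."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    includes, all_result = [], None
    for term in terms[1:]:
        qualifier = "+"  # default qualifier when none is written
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        if term.lower().startswith("include:"):
            includes.append(term[len("include:"):])
        elif term.lower() == "all":
            all_result = QUALIFIERS[qualifier]
    return {"includes": includes, "all": all_result}


old = parse_spf("v=spf1 include:mailgun.org -all")
new = parse_spf("v=spf1 include:spf.brevo.com ~all")
```

Comparing `old` and `new` shows both halves of the change: the authorised sender network moves to Brevo, and unmatched senders go from hard fail to soft fail for the cutover period.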
Viktor Barzin	06e3425a39	[monitoring] Set rawQuery+editorMode on uk-payslip panel targets Grafana 11's Postgres plugin shows 'you do not have default database' on any panel whose target is missing rawQuery:true / editorMode:"code". The query builder can't reason about a custom schema.table path and blanks the panel.	2026-04-18 23:12:45 +00:00
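The target fix in the commit above can be applied mechanically to a dashboard JSON document. A minimal sketch, assuming a flat `panels` list (nested row panels are not walked); the sample dashboard and SQL are hypothetical.

```python
def force_code_mode(dashboard: dict) -> int:
    """Set rawQuery/editorMode on every panel target; return how many were patched."""
    patched = 0
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            target["rawQuery"] = True
            target["editorMode"] = "code"
            patched += 1
    return patched


# Hypothetical two-panel dashboard with one SQL target each.
dashboard = {
    "panels": [
        {"title": "Net pay", "targets": [{"rawSql": "SELECT 1"}]},
        {"title": "Deductions", "targets": [{"rawSql": "SELECT 2"}]},
    ]
}
patched = force_code_mode(dashboard)
```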
Viktor Barzin	ed820e9b58	[monitoring] Fix uk-payslip datasource type to grafana-postgresql-datasource The installed Postgres plugin is 'grafana-postgresql-datasource' (the newer one). Dashboard panels referenced legacy 'postgres' type, which caused Grafana to fall back to 'default database' and error out when rendering. Ran sed over the JSON; all 8 panel+target type refs now match the installed plugin name. UID (payslips-pg) was already correct.	2026-04-18 23:10:13 +00:00
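The commit above used `sed` for the type rewrite; the same change expressed structurally looks like the sketch below. It assumes each of the four panels carries one target and that both the panel and the target hold their own `datasource` object, which is how the count of 8 references would arise. The sample dashboard is hypothetical.

```python
OLD_TYPE = "postgres"
NEW_TYPE = "grafana-postgresql-datasource"


def retarget_datasource(dashboard: dict, old: str = OLD_TYPE, new: str = NEW_TYPE) -> int:
    """Rewrite datasource.type on panels and their targets; return the number changed."""
    changed = 0
    for panel in dashboard.get("panels", []):
        for holder in [panel, *panel.get("targets", [])]:
            ds = holder.get("datasource")
            if isinstance(ds, dict) and ds.get("type") == old:
                ds["type"] = new
                changed += 1
    return changed


def make_panel() -> dict:
    # Separate dicts for panel and target so each reference is counted once.
    return {
        "datasource": {"type": OLD_TYPE, "uid": "payslips-pg"},
        "targets": [{"datasource": {"type": OLD_TYPE, "uid": "payslips-pg"}}],
    }


dashboard = {"panels": [make_panel() for _ in range(4)]}
changed = retarget_datasource(dashboard)
```

Unlike a text substitution, this leaves the `uid` untouched by construction and is a no-op on a dashboard that already uses the new type.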
Viktor Barzin	471e946133	[monitoring] Put uk-payslip dashboard in Finance folder Grafana can't auto-create the reserved 'General' folder ('A folder with that name already exists'), which aborts the sidecar provisioner's walk and drops every dashboard in that folder. Move uk-payslip to Finance so it loads.	2026-04-18 23:03:22 +00:00
Viktor Barzin	11082f7e83	[infra] Partial Calico adoption: namespaces only (Wave 5b) ## Context Wave 5b of the state-drift consolidation plan. Calico has run this cluster's pod networking since 2024-07-30, installed via raw kubectl manifests — tigera-operator Deployment + ~20 CRDs + an Installation CR. The plan flagged Calico as HIGH BLAST because the operator + Installation CR sit on the critical path for pod scheduling; any mistake during adoption can break CNI and block new pods cluster-wide within seconds. This session takes the safe sub-step: adopt only the three namespaces. Namespaces are label containers — TF managing their names + PSA labels cannot disrupt Calico networking. Getting the operator, Installation CR, and CRDs under TF requires dedicated prep (picking the right `ignore_changes` fields to absorb operator-generated defaults in the Installation CR, decoupling from the embedded PSA labels applied at admission, and a low-traffic window). Deferred to `code-3ad`. ## This change New Tier 1 stack `stacks/calico/` adopting via import `{}` blocks (Wave 8 convention, commit `8a99be11`): - `kubernetes_namespace.calico_system` ← id `calico-system` - `kubernetes_namespace.calico_apiserver` ← id `calico-apiserver` - `kubernetes_namespace.tigera_operator` ← id `tigera-operator` Apply: `3 imported, 0 added, 0 changed, 0 destroyed.` Followed by a second `tg plan` that returns `No changes`. Zero cluster impact — namespaces stayed exactly as they were cluster-side. ### terragrunt dependency choice Deliberately no `dependency "platform"` clause — Calico is lower in the stack than platform, so introducing a `platform → calico` or `calico → platform` edge would invite cycle-like pain on first bootstrap. The plan on this stack is always safe to run standalone. ### `ignore_changes` scope on each namespace - `goldilocks.fairwinds.com/vpa-update-mode` — Kyverno ClusterPolicy stamp (Wave 3B sweep, commit `8b43692a`). 
- `pod-security.kubernetes.io/enforce` + `-version` — tigera-operator stamps these on `calico-system` + `calico-apiserver` to opt them out of PSA. These labels aren't surfaced by the kubernetes provider as part of the import (they arrive through a different field manager), so left unmanaged to keep the plan clean. `tigera-operator` ns doesn't get the PSA labels so they aren't ignored there. ## What is NOT in this change - The three live workloads: `tigera-operator` Deployment in `tigera-operator` ns, `calico-kube-controllers`/`calico-node`/ `calico-typha` workloads in `calico-system`, the `calico-apiserver` in `calico-apiserver`. These are all reconciled by the tigera-operator from the Installation CR — importing them into TF is redundant with importing the CR itself. - The `Installation` CR (`default`, apiVersion `operator.tigera.io/v1`) — the user-authored minimal spec has since been filled to 104 lines of operator-generated defaults. Adopting it requires a well-scoped `ignore_changes` list on the `manifest` field. Separate follow-up `code-3ad`. - `.sops.yaml` / `tier0_stacks` updates — the original plan suggested Tier 0 (local SOPS state) for the full Calico stack on the theory that "network underpins all". With only three namespaces in the stack, the argument doesn't hold: a failed Tier 1 plan on calico namespaces cannot break networking, so no need to pay the Tier 0 tax. ## Verification ``` $ cd stacks/calico && ../../scripts/tg plan No changes. Your infrastructure matches the configuration. $ kubectl get pods -n calico-system NAME READY STATUS RESTARTS calico-kube-controllers-... 1/1 Running 0 calico-node-... 1/1 Running 0 ... (all healthy, pre-existing) ``` Follow-up: code-3ad for operator + Installation CR adoption (needs low-traffic window + ignore_changes scoping). Closes: code-hl1 scope of Wave 5b (namespaces). Remaining subwave in code-3ad. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:52:56 +00:00
Viktor Barzin	16d9fd8bde	[infra] Adopt Authentik catch-all Proxy Provider + Application into TF (Wave 6a) ## Context Wave 6a of the state-drift consolidation plan. The Domain wide catch all Proxy Provider (pk=5) + its wrapping Application (slug=domain-wide-catch-all) + the embedded outpost (uuid 0eecac07-97c7-443c-8925-05f2f4fe3e47) have run for a year as pure UI-created state. When the 2026-04-18 outpost SEV2 hit, it was harder to reason about the config than it should have been — the only source of truth was the Authentik admin UI. Bringing the provider + application under Terraform means future changes are reviewable in PRs and recoverable from git if the admin UI misbehaves. ## This change Adds the `goauthentik/authentik` provider to the repo's central `terragrunt.hcl` `required_providers` (side-effect: every stack can now declare authentik resources; this stack is the only current consumer). Stack-local `stacks/authentik/authentik_provider.tf` holds the provider instance configuration + API token wiring + two resources + their flow data-source lookups. ### Auth - API token stored in Vault at `secret/authentik/tf_api_token`, identifier `terraform-infra-stack`, intent=API, user=akadmin, no expiry. Rotatable by rewriting the Vault KV + any running TF apply picks it up on next plan. ### Imports (both landed zero-diff) - `authentik_application.catchall` ← id `domain-wide-catch-all` - `authentik_provider_proxy.catchall` ← id `5` ### Flow references Authorization + invalidation flows are looked up via `data "authentik_flow"` by slug (`default-provider-authorization-implicit-consent` + `default-provider-invalidation-flow`). Keeping them as data sources rather than hardcoded UUIDs means a flow recreation (slug unchanged) doesn't require an HCL edit. 
### `lifecycle { ignore_changes }` scope

On `authentik_provider_proxy.catchall`:
- `property_mappings` (5 UUIDs), `jwt_federation_sources` (1 UUID) — the live state references complex many-to-many relations that are easier to manage from the Authentik UI than to serialise in HCL. Drift suppressed.
- `skip_path_regex`, `internal_host`, all `basic_auth_*`, `intercept_header_auth`, `access_token_validity` — either defaults or UI-only tuning knobs that aren't part of Terraform's concern for this catch-all provider.

On `authentik_application.catchall`:
- `meta_description`, `meta_launch_url`, `meta_icon`, `group`, `backchannel_providers`, `policy_engine_mode`, `open_in_new_tab` — cosmetic/non-functional attributes; the Authentik UI is the right place to edit these and drift on them isn't interesting.

## What is NOT in this change

- Outpost-binding resource — the embedded outpost's provider list is a single-row many-to-many that the Authentik UI manages cleanly; adding TF there would fight the UI without reducing drift.
- Property mappings and JWT federation source — managed via UI, drift suppressed. A future wave can bring them in when someone actually wants to edit them through code review.
- Other Authentik entities (Flows, Stages, Groups, RBAC policies) — same rationale: UI is the natural editing surface. Adopt incrementally as they become interesting to code-review.

## Verification

```
$ cd stacks/authentik && ../../scripts/tg plan | grep Plan:
Plan: 0 to add, 1 to change, 0 to destroy.
# module.authentik.kubernetes_deployment.pgbouncer — pre-existing drift,
# unrelated to this commit (image_pull_policy Always -> IfNotPresent)

$ ../../scripts/tg state list | grep authentik_
authentik_application.catchall
authentik_provider_proxy.catchall
data.authentik_flow.default_authorization_implicit_consent
data.authentik_flow.default_provider_invalidation
```

## Reproduce locally

1. `git pull && cd stacks/authentik && ../../scripts/tg init`
2. 
Terraform pulls goauthentik/authentik provider (first time). 3. `tg plan` — expect only pgbouncer drift; authentik resources read-only. Refs: Wave 6a of the state-drift consolidation (code-hl1) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:48:26 +00:00
Viktor Barzin	b28c76e371	[infra] Wire drift detection to Pushgateway + alert on stale/unaddressed drift

## Context

Wave 7 of the state-drift consolidation plan. The drift-detection pipeline (`.woodpecker/drift-detection.yml`) already ran terragrunt plan on every stack daily and Slack-posted a summary, but its output was ephemeral — nothing persisted in Prometheus, so there was no historical view of which stacks drift, when, or for how long. Following the convergence work in waves 1–6 (168 KYVERNO_LIFECYCLE_V1 markers, 4 stacks adopted, Phase 4 mysql cleanup), the baseline is clean enough that new drift should stand out. That only works if we have observability.

## This change

### `.woodpecker/drift-detection.yml`

Enhances the existing cron pipeline to push a batched set of metrics to the in-cluster Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`) after each run:

| Metric | Kind | Purpose |
|---|---|---|
| `drift_stack_state{stack}` | gauge, 0/1/2 | 0=clean, 1=drift, 2=error |
| `drift_stack_first_seen{stack}` | gauge (unix seconds) | Preserved across runs for drift-age tracking |
| `drift_stack_age_hours{stack}` | gauge (hours) | Computed from `first_seen` |
| `drift_stack_count` | gauge (count) | Total drifted stacks this run |
| `drift_error_count` | gauge (count) | Total plan-errored stacks |
| `drift_clean_count` | gauge (count) | Total clean stacks |
| `drift_detection_last_run_timestamp` | gauge (unix seconds) | Pipeline heartbeat |

First-seen preservation: on each drift hit, the pipeline queries Pushgateway for the existing `drift_stack_first_seen{stack=<stack>}` value. If present and non-zero, reuse it; otherwise stamp with `NOW`. That means age-hours grows monotonically until the stack goes clean (at which point state=0 resets first_seen by omission).

Atomic batched push: all metrics for a run are POST'd in a single HTTP request. 
Pushgateway doesn't support atomic multi-metric updates natively, but batching at the pipeline layer prevents half-updated state if the curl is interrupted mid-run (the second call would just fail the entire run and alert on `DriftDetectionStale`). ### `stacks/monitoring/.../prometheus_chart_values.tpl` New `Infrastructure Drift` alert group with three rules: - DriftDetectionStale (warning, 30m): fires if `drift_detection_last_run_timestamp` is older than 26h. Gives a 2h grace window on top of the 24h cron so transient Pushgateway or cluster unavailability doesn't false-alarm. Guards against the pipeline silently failing or the cron not firing. - DriftUnaddressed (warning, 1h): fires if any stack has `drift_stack_age_hours > 72` — three days of unacknowledged drift. Three days is long enough to absorb weekends + typical review cycles but short enough to force follow-up before drift compounds. - DriftStacksMany (warning, 30m): fires if `drift_stack_count > 10` in a single run. Sudden wide drift usually signals systemic causes (new admission webhook, provider version bump, cluster-wide CRD upgrade) rather than individual configuration errors, and the alert body nudges toward that diagnosis. Applied to `stacks/monitoring` this session — 1 helm_release changed, no other drift surfaced. ## What is NOT in this change - The Wave 7 GitHub issue auto-filer — the full plan included filing a `drift-detected` issue per drifted stack. Deferred because it requires wiring the `file-issue` skill's convention + a gh token exposed to Woodpecker, both of which need separate setup. The Slack alert covers the same need at lower fidelity in the meantime. - The Wave 7 PG drift_history table — would provide the richest historical view but adds a new DB schema dependency for a CI pipeline. Pushgateway + Prometheus handle the 72h window we care about; PG history is nice-to-have for quarterly reviews. 
- Auto-apply marker (`# DRIFT_AUTO_APPLY_OK`) — premature until the baseline has been stable for a few cycles.

Follow-ups tracked: file dedicated beads items for GH-issue filer + PG drift_history.

## Verification

```
$ cd stacks/monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

# After next cron run (cron expr: "drift-detection" in Woodpecker UI):
$ curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
    | grep -c '^drift_'
# expect a positive number
```

## Reproduce locally

1. `git pull`
2. Check Prometheus rules: `curl -sk https://prometheus.viktorbarzin.lan/api/v1/rules | jq '.data.groups[] | select(.name == "Infrastructure Drift")'`
3. Manually trigger the Woodpecker cron and watch Pushgateway populate.

Refs: Wave 7 umbrella (code-hl1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:42:51 +00:00
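The first-seen preservation and single-body push described in the drift-detection commit above can be sketched as below. This is an illustration of the logic as the commit message describes it, not the pipeline's actual script: the function names are hypothetical, the lookup of previous values is passed in as a dict rather than queried from Pushgateway, and the body is only rendered, not POSTed.

```python
from typing import Dict, Optional

CLEAN, DRIFT, ERROR = 0, 1, 2  # drift_stack_state values from the commit message


def first_seen(state: int, previous: Optional[float], now: float) -> Optional[float]:
    """Reuse a non-zero earlier stamp while drifted; omit the metric otherwise."""
    if state != DRIFT:
        return None  # not pushed, so the age resets once the stack is clean
    return previous if previous else now


def render_batch(states: Dict[str, int], previous: Dict[str, float], now: float) -> str:
    """Build one text-format body holding every metric for the run."""
    lines = []
    for stack in sorted(states):
        state = states[stack]
        lines.append(f'drift_stack_state{{stack="{stack}"}} {state}')
        seen = first_seen(state, previous.get(stack), now)
        if seen is not None:
            age_hours = (now - seen) / 3600
            lines.append(f'drift_stack_first_seen{{stack="{stack}"}} {seen:.0f}')
            lines.append(f'drift_stack_age_hours{{stack="{stack}"}} {age_hours:.1f}')
    values = list(states.values())
    lines.append(f"drift_stack_count {values.count(DRIFT)}")
    lines.append(f"drift_error_count {values.count(ERROR)}")
    lines.append(f"drift_clean_count {values.count(CLEAN)}")
    lines.append(f"drift_detection_last_run_timestamp {now:.0f}")
    return "\n".join(lines) + "\n"


now = 1_776_000_000.0
states = {"mailserver": DRIFT, "calico": CLEAN, "authentik": ERROR}  # hypothetical run
previous = {"mailserver": now - 73 * 3600}  # first seen 73 hours ago
body = render_batch(states, previous, now)
```

In this hypothetical run the `mailserver` stack reports an age of 73 hours, which is past the 72-hour threshold that `DriftUnaddressed` checks.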
Viktor Barzin	124a756351	[infra] Adopt local-path-provisioner into Terraform (Wave 5c) ## Context Wave 5c of the state-drift consolidation plan. `local-path-provisioner` (Rancher's node-local dynamic PV provisioner) was deployed 55d ago via raw `kubectl apply` against the upstream manifest. It serves as the cluster's default StorageClass and is still actively in use — the 2026-04-18 live survey showed helper-pod-delete cycles running against existing PVCs. Unmanaged until now: namespace, ServiceAccount, ClusterRole (+ binding), ConfigMap with provisioner config.json + helperPod.yaml + setup/teardown scripts, StorageClass `local-path` (default), and the 1-replica Deployment itself. Seven resources total. ## This change New Tier 1 stack `stacks/local-path/` with all seven resources, adopted via Wave 8's HCL `import {}` block convention (commit `8a99be11`): - `kubernetes_namespace.local_path_storage` → id `local-path-storage` - `kubernetes_service_account.local_path_provisioner` → id `local-path-storage/local-path-provisioner-service-account` - `kubernetes_cluster_role.local_path_provisioner` → id `local-path-provisioner-role` - `kubernetes_cluster_role_binding.local_path_provisioner` → id `local-path-provisioner-bind` - `kubernetes_config_map.local_path_config` → id `local-path-storage/local-path-config` - `kubernetes_storage_class_v1.local_path` → id `local-path` - `kubernetes_deployment.local_path_provisioner` → id `local-path-storage/local-path-provisioner` Conventions applied: - Namespace gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the Goldilocks `vpa-update-mode` label drift (Wave 3B, commit `8b43692a`). - Deployment gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the ndots dns_config drift (Wave 3A, commit `c9d221d5` + `327ce215`). - ServiceAccount + pod spec pin `automount_service_account_token = false` and `enable_service_links = false` to match the live spec exactly. 
- `import {}` stanzas removed after the apply converged to zero-diff (per AGENTS.md → "Adopting Existing Resources"). ## Apply outcome `Apply complete! Resources: 7 imported, 0 added, 3 changed, 0 destroyed.` The 3 in-place changes were: - `kubernetes_config_map.local_path_config.data` — whitespace/format reshuffle. The live ConfigMap contained the upstream manifest's hand-indented JSON + YAML; my HCL uses canonical `jsonencode` / heredoc. Semantic content identical, so the provisioner continued running (no pod restart). - `kubernetes_deployment.local_path_provisioner.wait_for_rollout = true` — TF-only attribute, no cluster impact. - `kubernetes_storage_class_v1.local_path.allow_volume_expansion = false` + `is-default-class` annotation re-asserted — TF-schema reconciliation only; the StorageClass remained default throughout. Post-apply `scripts/tg plan` returns `No changes`. ## Verification ``` $ cd stacks/local-path && ../../scripts/tg plan No changes. Your infrastructure matches the configuration. $ kubectl -n local-path-storage get deploy NAME READY UP-TO-DATE AVAILABLE AGE local-path-provisioner 1/1 1 1 55d $ kubectl get sc local-path NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE local-path (default) rancher.io/local-path Delete WaitForFirstConsumer ``` ## What is NOT in this change - Helm-release adoption — local-path-provisioner was never installed via Helm in this cluster; raw manifests only. Keeping native typed resources rather than retrofitting a chart. - PV-path customisation — sticks with upstream default `/opt/local-path-provisioner` on all nodes (via `DEFAULT_PATH_FOR_NON_LISTED_NODES`). Closes: code-3gp Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:39:55 +00:00
Viktor Barzin	1a7f68fe5b	[beads-server] Auto-dispatch agent beads via CronJobs ## Context Until now, handing work to the in-cluster `beads-task-runner` agent required opening BeadBoard and clicking the manual Dispatch button on each bead. We want users to be able to describe work as a bead, set `assignee=agent`, and have the agent pick it up within a couple of minutes — no clicks. The existing pieces already provide everything we need: - `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock` - BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer - BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll - Dolt stores beads and is already in-cluster at `dolt.beads-server:3306` So the only missing component is a poller that ties them together. This commit adds that poller as two Kubernetes CronJobs — matching the existing infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than introducing n8n or in-service polling. ## Flow ``` user: bd assign <id> agent │ ▼ Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐ │ │ ▼ │ CronJob: beads-dispatcher │ 1. GET beadboard/api/agent-status (busy? skip) │ 2. bd query 'assignee=agent AND status=open' │ 3. bd update -s in_progress (claim) │ 4. POST beadboard/api/agent-dispatch │ 5. bd note "dispatched: job=…" │ │ │ ▼ │ claude-agent-service /execute │ beads-task-runner agent runs; notes/closes bead │ │ │ ▼ │ done ──► next tick picks up the next bead ───────────────┘ CronJob: beads-reaper (every 10 min) for bead (assignee=agent, status=in_progress, updated_at > 30 min): bd note "reaper: no progress for Nm — blocking" bd update -s blocked ``` ## Decisions - Sentinel assignee `agent` — free-form, no Beads schema change. Any bd client can set it (`bd assign <id> agent`). - Sequential dispatch — matches the service's `asyncio.Lock`. With a 2-min poll cadence and ~5-min average run, throughput is ~12 beads/hour. Parallelism is a separate plan. 
- Fixed agent `beads-task-runner` — read-only rails, matches the manual Dispatch button. Broader-privilege agents stay manual via BeadBoard UI. - Image reuse — the claude-agent-service image already ships `bd`, `jq`, `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling. Mirror `claude_agent_service_image_tag` locally; bump on rebuild. - ConfigMap-mounted `metadata.json` — declarative TF rather than reusing the image-seeded file. The script copies it into `/tmp/.beads/` because bd may touch the parent dir and ConfigMap mounts are read-only. - Kill switch (`beads_dispatcher_enabled`) — single bool, default true. When false, `suspend: true` on both CronJobs; manual Dispatch keeps working. - Reaper threshold 30 min — `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner` never trips the reaper. Failures trip it; pod crashes (in-memory job state lost) also trip it. ## What is NOT in this change - No Terraform apply — requires Vault OIDC + cluster access. Apply manually: `cd infra/stacks/beads-server && scripts/tg apply` - No change to `claude-agent-service/` (already ships bd/jq/curl) - No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused) - No change to the `beads-task-runner` agent definition (rails unchanged) - Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan. ## Deviations from plan Minor, documented in code comments: - Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd serializes `notes` as a string (not an array), and every `bd note` bumps `updated_at` — equivalent for the reaper's purpose. - ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU `-d` and the image has python3. - `HOME=/tmp` set as a safety net — bd may try to write state/lock files. ## Test plan ### Automated ``` $ cd infra/stacks/beads-server && terraform init -backend=false Terraform has been successfully initialized! 
$ terraform validate Warning: Deprecated Resource (kubernetes_namespace → v1) # pre-existing, unrelated Success! The configuration is valid, but there were some validation warnings as shown above. $ terraform fmt stacks/beads-server/main.tf # (no output — already formatted) ``` ### Manual verification 1. Apply ``` vault login -method=oidc cd infra/stacks/beads-server scripts/tg apply ``` Expect: `kubernetes_config_map.beads_metadata`, `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper` created. No changes to existing resources. 2. CronJobs exist with right schedule ``` kubectl -n beads-server get cronjob ``` Expect `beads-dispatcher */2 * * * *` and `beads-reaper */10 * * * *`, both with `SUSPEND=False`. 3. End-to-end smoke ``` bd create "auto-dispatch smoke test" \ -d "Read /etc/hostname inside the agent sandbox and close." \ --acceptance "bd note includes 'hostname=' line and bead is closed." bd assign <new-id> agent # within 2 min: bd show <new-id> --json | jq '{status, notes}' ``` Expect notes to contain `auto-dispatcher claimed at …` and `dispatched: job=<uuid>`, status `in_progress`. 4. Reaper smoke Assign + dispatch a long bead, then `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within 30 min + one reaper tick, `bd show <id>` shows `blocked` with a `reaper: no progress for Nm — blocking` note. 5. Kill switch ``` cd infra/stacks/beads-server scripts/tg apply -var=beads_dispatcher_enabled=false kubectl -n beads-server get cronjob ``` Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify nothing happens within 5 min. Re-apply with `=true` to re-enable. Runbook with all above plus reaper semantics + design choices at `infra/docs/runbooks/beads-auto-dispatch.md`. Closes: code-8sm Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:35:46 +00:00
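The reaper rule from commit 1a7f68fe5b above (assignee `agent`, status `in_progress`, no `updated_at` bump for 30 minutes) can be sketched in Python, which the commit says is what the job uses for ISO-8601 parsing. This is a minimal illustration, not the shipped script: the dict shape of a bead and the helper names `parse_iso8601` and `is_stale` are assumptions; only the field names and the 30-minute threshold come from the commit message.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the commit message ("Reaper threshold 30 min").
STALE_AFTER = timedelta(minutes=30)


def parse_iso8601(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp into an aware datetime.

    Older Python 3 releases reject a trailing "Z" in fromisoformat(),
    so it is rewritten to an explicit UTC offset first.
    """
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_stale(bead: dict, now: datetime) -> bool:
    """Return True when the reaper should mark this bead as blocked.

    The bead dict (keys: assignee, status, updated_at) is a hypothetical
    shape used only for this sketch.
    """
    if bead.get("assignee") != "agent" or bead.get("status") != "in_progress":
        return False
    return now - parse_iso8601(bead["updated_at"]) > STALE_AFTER
```

Because every `bd note` bumps `updated_at`, a bead that is still being worked on never crosses the threshold in this model; only a silent agent does.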
Viktor Barzin	01955916b2	[infra] Adopt kured + sentinel-gate into Terraform (Wave 5a) ## Context Wave 5a of the state-drift consolidation plan. Two cluster-critical pieces of infrastructure lived OUTSIDE Terraform — invisible to the repo's "all cluster changes via TF" invariant and drifting silently: 1. kured (Helm release): deployed 265d ago via `helm install kured` on the CLI. Values were edited only via `helm upgrade` — never captured. Chart version `kured-5.11.0`, app `1.21.0`, configured for Mon–Fri 02:00–06:00 London reboot window, Slack notifyUrl, and a custom `/sentinel/gated-reboot-required` sentinel file. 2. kured-sentinel-gate: a custom DaemonSet + ServiceAccount + ClusterRole + ClusterRoleBinding. Built after the 2026-03 post-mortem (memory 390) when kured rebooted nodes during a containerd overlayfs outage and turned a single-node blip into a 26h cluster outage. The gate DaemonSet creates `/var/run/gated-reboot-required` only when (a) host has `/var/run/reboot-required`, (b) all nodes Ready, (c) all calico-node pods Running, (d) no node transitioned Ready in the last 30 minutes (cool-down). kured's `rebootSentinel` then points at the gated file so reboots are effectively gated by cluster health. Applied 33d ago via `kubectl apply` — no TF footprint. Both are now codified in the new `stacks/kured/` (Tier 1, PG state). ## This change - New stack `stacks/kured/` with `main.tf` (247 lines) + `terragrunt.hcl` (standard platform-dep) + `secrets` symlink. 
- All 6 resources adopted via Wave 8's HCL `import {}` block pattern (commit `8a99be11`) — written as `import {}` stanzas in the initial commit, plan-applied to zero, then stanzas deleted before this commit per the convention: - `kubernetes_namespace.kured` (id: `kured`) - `helm_release.kured` (id: `kured/kured`) - `kubernetes_service_account.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`) - `kubernetes_cluster_role.kured_sentinel_gate` (id: `kured-sentinel-gate`) - `kubernetes_cluster_role_binding.kured_sentinel_gate` (id: `kured-sentinel-gate`) - `kubernetes_daemon_set_v1.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`) - Slack notifyUrl moved from inline helm values into Vault at `secret/kured` under key `slack_kured_webhook`, consumed via `data "vault_kv_secret_v2"`. No plaintext secret in git. - Namespace gets `tier = "1-cluster"` label (new — previously untiered, so Kyverno auto-quotas applied cluster-tier defaults on kured pods). Benign additive change; pod specs have explicit resources anyway. - DaemonSet + SA get `automount_service_account_token = false` / `enable_service_links = false` to match the live pod spec exactly — otherwise TF schema defaults would flip these fields. - DaemonSet carries `# KYVERNO_LIFECYCLE_V1` suppressing dns_config drift (Wave 3A convention, commit `c9d221d5` + `327ce215`). - Namespace carries the same marker on the `goldilocks.fairwinds.com/vpa-update-mode` label (Wave 3B sweep, commit `8b43692a`). ## Import outcomes Apply result: `Resources: 6 imported, 0 added, 3 changed, 0 destroyed.` The 3 in-place changes were all TF-schema reconciliation, not cluster mutations: - `helm_release.kured.values` — format reshuffle; the imported state stored values as a nested map, HCL uses `[yamlencode(...)]`. Semantic YAML is byte-identical, so the triggered Helm upgrade was a no-op on the cluster side (revision bumped 2→3, zero pod restarts). - `kubernetes_namespace.kured.labels["tier"]` = `"1-cluster"` — new label added. 
Already discussed above. - `kubernetes_daemon_set_v1.kured_sentinel_gate.wait_for_rollout` = true — TF-only attribute, no k8s impact. Post-apply `scripts/tg plan` on `stacks/kured` returns: `No changes. Your infrastructure matches the configuration.` ## What is NOT in this change - `import {}` stanzas — intentionally removed after the apply landed. They would be no-ops and would clutter future diffs. Per Wave 8 convention (AGENTS.md → "Adopting Existing Resources"). - Calico adoption (Wave 5b) — separate higher-blast change, needs a dedicated low-traffic window. - local-path-storage (Wave 5c) — check-or-remove task still open. ## Verification ``` $ kubectl -n kured get ds NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE kured 5 5 5 5 5 kured-sentinel-gate 5 5 5 5 5 $ helm -n kured list NAME NAMESPACE REVISION STATUS CHART APP VERSION kured kured 3 deployed kured-5.11.0 1.21.0 $ cd stacks/kured && ../../scripts/tg plan \| tail -1 No changes. Your infrastructure matches the configuration. ``` ## Reproduce locally 1. `git pull` 2. `cd stacks/kured && ../../scripts/tg plan` → 0 changes 3. `kubectl -n kured get ds,pods` — 5 kured + 5 sentinel-gate pods Ready. Closes: code-q8k Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:33:29 +00:00
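The four sentinel-gate conditions described in commit 01955916b2 above can be written as one pure decision function. The sketch below is illustrative only: the real gate is a DaemonSet whose implementation is not shown in the log, and the function name `reboot_allowed` and its parameters are hypothetical. Only the conditions (a) to (d) and the 30-minute cool-down come from the commit message.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Cool-down taken from the commit message (condition d).
COOL_DOWN = timedelta(minutes=30)


def reboot_allowed(
    host_reboot_required: bool,
    nodes_ready: Iterable[bool],
    calico_phases: Iterable[str],
    last_ready_transitions: Iterable[datetime],
    now: datetime,
) -> bool:
    """Decide whether the gated sentinel file may be created."""
    # (a) the host itself must be asking for a reboot
    if not host_reboot_required:
        return False
    # (b) every node must currently be Ready
    if not all(nodes_ready):
        return False
    # (c) every calico-node pod must be Running
    if any(phase != "Running" for phase in calico_phases):
        return False
    # (d) no node may have changed Ready state inside the cool-down window
    if any(now - changed < COOL_DOWN for changed in last_ready_transitions):
        return False
    return True
```

Keeping the decision separate from the cluster queries makes each condition testable without a live API server, which is one reason to structure a gate this way.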
Viktor Barzin	10fd88aec5	wealthfolio: add nightly backup sidecar — SQLite → NFS ## Context Upstream Wealthfolio uses SQLite exclusively (Diesel ORM, no PG/MySQL support — confirmed 2026-04-18 via repo inspection). The DB lives on an RWO PVC (proxmox-lvm-encrypted) held 24/7 by the main pod. First attempt at a standalone backup CronJob failed with Multi-Attach error: RWO volume is already attached to the running WF pod, so no separate pod can mount it. Switched to a backup sidecar in the same pod — shares the PVC mount naturally. ## This change - `container "backup"` added to the WF Deployment: - alpine:3.20 + sqlite + busybox-suid (for crond). - Mounts /data read-only (shared with WF container) + /backup (new NFS volume at 192.168.1.127:/srv/nfs/wealthfolio-backup). - Writes /etc/crontabs/root with a `30 4 * * *` line + /scripts/backup.sh which runs `sqlite3 .backup` (WAL-safe online snapshot, zero downtime), copies secrets.json, and prunes anything older than 30d. - 16Mi request / 64Mi limit — sleeps most of the time. - NFS volume declared in pod spec — server from the existing `var.nfs_server` variable; path `/srv/nfs/wealthfolio-backup` created on the PVE host in the same session. Removed the standalone backup CronJob that couldn't work. ## Verification ### Automated `scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0 added, 1 changed, 1 destroyed (the transient CronJob). ### Manual (2026-04-18) $ kubectl -n wealthfolio get pods -l app=wealthfolio wealthfolio-95d8bd498-cj8kw 2/2 Running $ kubectl -n wealthfolio logs <pod> -c backup wealthfolio-backup sidecar ready; next 04:30 UTC $ kubectl -n wealthfolio exec <pod> -c backup -- /scripts/backup.sh wealthfolio-backup: /backup/2026-04-18T22-24-55 (34.2M) $ ls /srv/nfs/wealthfolio-backup/ 2026-04-18T22-24-55/ ← first sidecar-produced backup ## Reproduce locally 1. 
kubectl -n wealthfolio exec $(kubectl -n wealthfolio get pods -l app=wealthfolio -o jsonpath='{.items[0].metadata.name}') -c backup -- /scripts/backup.sh 2. ssh root@192.168.1.127 ls /srv/nfs/wealthfolio-backup/ 3. Expected: new dated folder appears with wealthfolio.db + secrets.json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:25:19 +00:00
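The sidecar in commit 10fd88aec5 above takes its snapshot with the `sqlite3` CLI's `.backup` command. For readers who want to try the same online-backup idea outside the pod, the sketch below uses Python's `sqlite3.Connection.backup()`, which wraps the same SQLite backup API. It is a stand-in, not the sidecar's script; the function name `snapshot` and both paths are hypothetical.

```python
import sqlite3
from pathlib import Path


def snapshot(src: Path, dest: Path) -> None:
    """Copy a SQLite database to dest using the online backup API.

    The source is opened read-only through a file: URI so the snapshot
    cannot modify the original database.
    """
    source = sqlite3.connect(src.resolve().as_uri() + "?mode=ro", uri=True)
    target = sqlite3.connect(dest)
    try:
        # backup() copies all pages into the target connection.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

A call such as `snapshot(Path("/data/wealthfolio.db"), Path("/backup/wealthfolio.db"))` would mirror the sidecar's copy step; both paths here are placeholders, since the log does not give the exact file locations.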
Viktor Barzin	345ba2182f	[mailserver] Widen email-roundtrip probe IMAP window 180s → 300s + per-attempt timeout ## Context After fixing the two mail-server-side root causes of probe false-failures (Dovecot userdb duplicates, postscreen btree lock contention), the probe is expected to succeed well under 120s. This commit is defence in depth against residual SMTP relay variance and against a future scenario where Dovecot is transiently unresponsive during IMAP login. The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's queueing, DNS variance, and general SMTP retry backoff can easily exceed that on a bad day. Widening to 5 minutes gives plenty of headroom while still remaining well within the CronJob's 20-minute schedule interval. Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If Dovecot is unresponsive (e.g., mid-rollout, transient TLS handshake hang), the connect call can block indefinitely and the probe hangs without ever looping to the next attempt. Adding `timeout=10` caps each connect at 10s so the retry loop keeps making forward progress. ## This change Two edits to the embedded probe script inside the cronjob resource: ``` - # Step 2: Wait for delivery, retry IMAP up to 3 min + # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s) ... - for attempt in range(9): + for attempt in range(15): ... - imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx) + imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10) ``` Flow (before): ``` send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total ``` Flow (after): ``` send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total │ └─ timeout ─► log, continue to next loop ``` ## What is NOT in this change - Probe frequency stays at `*/20 * * * *`. - The `EmailRoundtripStale` alert thresholds are intentionally left at 3600s + for: 10m. 
Those fire only on sustained multi-hour issues and should not be loosened — they would mask future regressions. Probe success rate is now expected to recover to ≥95% from the two upstream fixes; if it doesn't, alert tuning gets revisited separately. - No change to the Brevo send step, the success-metrics push, or the cleanup of stale e2e-probe- messages. ## Test Plan ### Automated `scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`: ``` # module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place - for attempt in range(9): + for attempt in range(15): - imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx) + imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10) Plan: 0 to add, 1 to change, 0 to destroy. ``` `scripts/tg apply`: ``` Apply complete! Resources: 0 added, 1 changed, 0 destroyed. ``` ### Manual Verification 1. Trigger the probe manually: `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)` 2. Tail its logs: `kubectl -n mailserver logs job/probe-verify-<ts> -f` 3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical successful run should still complete in < 60s now that postscreen is no longer stalling. 4. Watch the 48-hour window on the `email_roundtrip_success` gauge in Prometheus — expect ≥95% (was ~65% before all three fixes). ## Reproduce locally 1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml \| grep -E "range\(\|timeout"` 2. Expect: `range(15)` and `timeout=10` 3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)` 4. `kubectl -n mailserver logs -f job/probe-verify-<ts>` 5. Expect: eventual `Round-trip SUCCESS in <N>s` message and exit 0. Closes: code-18e Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:33:56 +00:00
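The retry behaviour described in commit 345ba2182f above (15 attempts, 20s apart, each connect capped at 10s) can be sketched as follows. This is not the embedded probe script: the helper names `connect_imap` and `poll_imap`, the injectable `sleep`, and the exception set are assumptions made so the loop can be exercised without a mail server. The `IMAP4_SSL` call itself matches the line shown in the diff, and its `timeout` argument needs Python 3.9 or newer.

```python
import imaplib
import ssl
import time
from typing import Callable, Optional


def connect_imap(host: str, ctx: ssl.SSLContext) -> imaplib.IMAP4_SSL:
    """Open an IMAPS connection; timeout=10 bounds a hung handshake."""
    return imaplib.IMAP4_SSL(host, 993, ssl_context=ctx, timeout=10)


def poll_imap(
    connect: Callable[[], object],
    attempts: int = 15,
    interval: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[object]:
    """Try to connect up to `attempts` times, sleeping before each try.

    With the defaults this is 15 x 20s = 300s of waiting, matching the
    widened window. Returns the connection, or None if every try failed.
    """
    for attempt in range(attempts):
        sleep(interval)
        try:
            return connect()
        except (OSError, imaplib.IMAP4.error) as exc:
            # Socket timeouts and TLS errors are OSError subclasses, so a
            # hung server costs at most one bounded attempt.
            print(f"attempt {attempt + 1}/{attempts} failed: {exc}")
    return None
```

In a real run the caller would pass something like `lambda: connect_imap(IMAP_HOST, ctx)`; the actual probe also logs in and searches for the test message, which this sketch leaves out.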
Viktor Barzin	e2516b07a3	[mailserver] Disable postscreen btree cache to stop SMTP lock-contention stalls ## Context Postfix inside docker-mailserver was spamming fatal errors at roughly 1 per minute — 5,464 of them in a 24h window — all of the same shape: ``` postfix/postscreen[NNN]: fatal: btree:/var/lib/postfix/postscreen_cache: unable to get exclusive lock: Resource temporarily unavailable ``` Every time one of these fires, the postscreen process dies mid-connection and the inbound SMTP session is dropped. Legitimate mail (including Brevo deliveries for our e2e email-roundtrip probe) gets re-queued by the sender and arrives late — frequently past the probe's 180s IMAP polling window, producing a 35%/7d probe success rate and the EmailRoundtripStale alert noise that was originally flagged as "probably nothing." ## Root cause `master.cf` declares postscreen with `maxproc=1`, but postscreen still re-spawns per incoming connection (or for short-lived reopens), and each instance opens the shared btree cache with an exclusive file lock. Under any concurrency (two TCP SYNs arriving close together, or a retry during teardown), the second process hits EWOULDBLOCK on fcntl and Postfix treats that as fatal. Four options were considered:

| Option | Verdict |
|--------|---------|
| (a) Disable cache (postscreen_cache_map = ) | ✓ chosen |
| (b) Switch btree → lmdb | ✗ lmdb not compiled into docker-mailserver 15.0.0's postfix (`postconf -m` has no lmdb) |
| (c) proxy:btree via proxymap | ✗ unsafe — Postfix docs: "postscreen does its own locking, not safe via proxymap" |
| (d) Memcached sidecar | ✗ new moving part; deferred |

Option (a) is a small trade-off: legitimate clients re-run the greet-action / bare-newline-action checks on every fresh TCP session instead of hitting the 7-day whitelist cache. At our volume (~100 deliveries/day, ~72 of which are the probe itself) that's negligible CPU. 
DNSBL re-evaluation is also avoided only partially, but this mailserver already has `postscreen_dnsbl_action = ignore` so the cache's DNSBL role was doing nothing anyway. ## This change Appends a stanza to the user-merged postfix main.cf stored in `variable.postfix_cf` that sets `postscreen_cache_map =` (empty value). Postfix treats an empty cache_map as "no persistent cache" — per-session decisions are still enforced, they just aren't cached across sessions. Before: ``` smtpd ──► postscreen (maxproc=1, btree cache with exclusive lock) ├─ concurrent access → fcntl EWOULDBLOCK → fatal └─ connection dropped, sender retries, mail arrives late ``` After: ``` smtpd ──► postscreen (no cache, per-session checks only) └─ no shared file, no lock → no fatal, no dropped session ``` No change to master.cf (postscreen still the front-end), no change to DNSBL / greet / bare-newline policy. ## What is NOT in this change - Dovecot userdb dedup (shipped in the previous commit). - Email-roundtrip probe widening (next commit). - Rebuilding docker-mailserver image with lmdb support (deferred — disabling the cache is simpler and sufficient at our volume). ## Test Plan ### Automated `postconf -m` in the running container to confirm lmdb is genuinely absent (ruling out option (b) before we commit to (a)): ``` btree cidr environ fail hash inline internal ldap memcache nis pcre pipemap proxy randmap regexp socketmap static tcp texthash unionmap unix ``` No lmdb entry — confirmed. `scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`: ``` ~ "postfix-main.cf" = <<-EOT + postscreen_cache_map = ``` `scripts/tg apply`: ``` Apply complete! Resources: 0 added, 1 changed, 0 destroyed. ``` Reloader triggers pod rollout — baseline error count before apply was 34 `unable to get exclusive lock` lines per `--tail=500` log window. ### Manual Verification Post-rollout, when the new pod is Ready: 1. 
`kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map` Expect: empty (no value) 2. Watch for 15 min: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=1000 \| grep -c "unable to get exclusive lock"` Expect: 0 new occurrences (any hits are from before the rollout). 3. Trigger a probe run manually: `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)` then `kubectl -n mailserver logs job/probe-verify-...` Expect: `Round-trip SUCCESS` with duration < 120s. ## Reproduce locally 1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map` 2. Expect: `postscreen_cache_map =` (empty value) 3. `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --since=15m \| grep -c "unable to get exclusive lock"` 4. Expect: 0 Closes: code-1dc Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:32:48 +00:00
Viktor Barzin	01a718e17b	[mailserver] Filter redundant local→local aliases to fix Dovecot 'exists more than once' ## Context Dovecot auth logs have been steadily spamming `passwd-file /etc/dovecot/userdb: User r730-idrac@viktorbarzin.me exists more than once` (and the same for vaultwarden@) at ~31 occurrences per 500 log lines. Under load this flakes IMAP auth for the e2e email-roundtrip probe (spam@viktorbarzin.me uses the catch-all), which was masquerading as "Brevo or probe timing" noise. ## Root cause docker-mailserver builds Dovecot's `/etc/dovecot/userdb` from two sources: real accounts (`postfix-accounts.cf`) AND virtual-alias entries whose target resolves to a local mailbox (`postfix-virtual.cf`). When the same address appears as BOTH a real mailbox AND an alias whose target is another local mailbox, the generated userdb has two lines for that username pointing to different home directories — e.g.: r730-idrac@viktorbarzin.me:...:/var/mail/.../r730-idrac/home r730-idrac@viktorbarzin.me:...:/var/mail/.../spam/home ← from alias Dovecot's passwd-file driver rejects the duplicate, and every subsequent auth lookup logs the error. This affected exactly two addresses: - r730-idrac@viktorbarzin.me (real account + alias → spam@) - vaultwarden@viktorbarzin.me (real account + alias → me@) Other aliases are fine: they either forward to external addresses (gmail etc.) — no local userdb entry generated — or map an address to itself (me@ → me@) which docker-mailserver dedups internally. Note: removing the real accounts is not an option because Vaultwarden uses `vaultwarden@viktorbarzin.me` as its live SMTP_USERNAME (stacks/vaultwarden/modules/vaultwarden/main.tf:121). ## This change Introduces a `local.postfix_virtual` that concatenates the Vault-sourced aliases with `extra/aliases.txt`, then filters out any line matching the exact "LHS RHS" shape where both sides are in `var.mailserver_accounts` and LHS != RHS. 
That is, only the pure local→local redundant entries are dropped; all forwarding aliases and the catch-all are preserved. The filter is self-healing: if a future alias ever collides with a real account, it gets silently suppressed instead of breaking Dovecot auth. ``` Vault mailserver_aliases ─┐ ├─ concat ─ split \n ─ filter ─ join \n ─► postfix-virtual.cf extra/aliases.txt ─────────┘ │ └── drop if LHS+RHS both in mailserver_accounts and LHS != RHS ``` Filtered entries (confirmed via locally-simulated filter on live data): - r730-idrac@viktorbarzin.me spam@viktorbarzin.me - vaultwarden@viktorbarzin.me me@viktorbarzin.me Preserved (sample): postmaster→me, contact→me, alarm-valchedrym→self+3 ext, lubohristov→gmail, yoana→gmail, @viktorbarzin.me→spam (catch-all), all four disposable `*-generated@` aliases. ## What is NOT in this change - Real accounts in Vault (`secret/platform.mailserver_accounts`) are untouched — vaultwarden SMTP auth keeps working. - Postfix postscreen btree lock contention (separate commit). - Email-roundtrip probe IMAP window (separate commit). ## Test Plan ### Automated `terraform validate` — passes (docker-mailserver module): ``` Success! The configuration is valid, but there were some validation warnings as shown above. ``` `scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`: ``` # module.mailserver.kubernetes_config_map.mailserver_config will be updated in-place ~ resource "kubernetes_config_map" "mailserver_config" { ~ data = { ~ "postfix-virtual.cf" = (sensitive value) # (9 unchanged elements hidden) } id = "mailserver/mailserver.config" } Plan: 0 to add, 1 to change, 0 to destroy. ``` `scripts/tg apply` — applied: ``` Apply complete! Resources: 0 added, 1 changed, 0 destroyed. 
``` ### Manual Verification Post-apply configmap content (the two lines are gone): ``` $ kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}' postmaster@viktorbarzin.me me@viktorbarzin.me contact@viktorbarzin.me me@viktorbarzin.me me@viktorbarzin.me me@viktorbarzin.me lubohristov@viktorbarzin.me lyubomir.hristov3@gmail.com alarm-valchedrym@viktorbarzin.me alarm-valchedrym@...,vbarzin@...,emil.barzin@...,me@... yoana@viktorbarzin.me divcheva.yoana@gmail.com @viktorbarzin.me spam@viktorbarzin.me firmly-gerardo-generated@viktorbarzin.me me@viktorbarzin.me closely-keith-generated@viktorbarzin.me vbarzin@gmail.com literally-paolo-generated@viktorbarzin.me viktorbarzin@fb.com hastily-stefanie-generated@viktorbarzin.me elliestamenova@gmail.com ``` Reloader triggers a pod rollout; once new pod is Ready: - `kubectl -n mailserver exec <pod> -c docker-mailserver -- cut -d: -f1 /etc/dovecot/userdb \| sort \| uniq -d` expected output: empty (no duplicate usernames) - `kubectl -n mailserver logs <pod> -c docker-mailserver --tail=500 \| grep -c "exists more than once"` expected output: 0 (baseline was 31/500 lines) ## Reproduce locally 1. `kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'` 2. Expect: no `r730-idrac@viktorbarzin.me spam@viktorbarzin.me` line and no `vaultwarden@viktorbarzin.me me@viktorbarzin.me` line. 3. After pod restart: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=500 \| grep -c "exists more than once"` → 0. Closes: code-27l Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:29:02 +00:00
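The filter rule from commit 01a718e17b above (drop a line only when it has the exact "LHS RHS" shape, both sides are real accounts, and LHS differs from RHS) is implemented in HCL in the stack. The Python sketch below restates the same rule so it can be run on sample data; the function name `filter_virtual` and the account set used in the usage note are assumptions for illustration.

```python
from typing import Iterable, List, Set


def filter_virtual(lines: Iterable[str], accounts: Set[str]) -> List[str]:
    """Drop redundant local-to-local aliases from postfix-virtual lines.

    A line is removed only when it is exactly "LHS RHS", both addresses
    are real mailboxes, and they differ. Forwarding aliases, the
    catch-all, self-aliases and multi-target lines are kept.
    """
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            lhs, rhs = parts
            if lhs in accounts and rhs in accounts and lhs != rhs:
                continue  # would create a duplicate Dovecot userdb entry
        kept.append(line)
    return kept
```

With a hypothetical account set of `me@`, `spam@`, `r730-idrac@` and `vaultwarden@` at viktorbarzin.me, the two colliding aliases named in the commit are removed, while `@viktorbarzin.me spam@viktorbarzin.me` (the catch-all) and `me@viktorbarzin.me me@viktorbarzin.me` survive.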
Viktor Barzin	327ce215b9	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] ## Context Wave 3A (commit `c9d221d5`) added the `# KYVERNO_LIFECYCLE_V1` marker to the 27 pre-existing `ignore_changes = [...dns_config]` sites so they could be grepped and audited. It did NOT address pod-owning resources that were simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18) found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec, and many other stacks showed perpetual `dns_config` drift every plan because their `kubernetes_deployment` / `kubernetes_stateful_set` / `kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all. Root cause (same as Wave 3A): Kyverno's admission webhook stamps `dns_config { option { name = "ndots"; value = "2" } }` on every pod's `spec.template.spec.dns_config` to prevent NxDomain search-domain flooding (see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes` on every Terraform-managed pod-owner, Terraform repeatedly tries to strip the injected field. ## This change Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`, `kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each carries the right `ignore_changes` path: - kubernetes_deployment / stateful_set / daemon_set / job_v1: `spec[0].template[0].spec[0].dns_config` - kubernetes_cron_job_v1: `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is one level deeper) Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2` inline so the suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`. Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`): 1. 
No existing `lifecycle {}`: inject a brand-new block just before the resource's closing `}`. 108 new blocks on 93 files. 2. Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag` from Wave 4, commit a62b43d1): extend its `ignore_changes` list with the dns_config path. Handles both inline (`= [x]`) and multiline (`= [\n x,\n]`) forms; ensures the last pre-existing list item carries a trailing comma so the extended list is valid HCL. 34 extensions. The script skips anything already mentioning `dns_config` inside an `ignore_changes`, so re-running is a no-op. ## Scale - 142 total lifecycle injections/extensions - 93 `.tf` files touched - 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones - Every Tier 0 and Tier 1 stack with a pod-owning resource is covered - Together with Wave 3A's 27 pre-existing markers → 169 greppable `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo ## What is NOT in this change - `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`). Python script touched the file, reverted manually. - `_template/main.tf.example` skeleton — kept minimal on purpose; any future stack created from it should either inherit the Wave 3A one-line form or add its own on first `kubernetes_deployment`. - `terraform fmt` fixes to pre-existing alignment issues in meshcentral, nvidia/modules/nvidia, vault — unrelated to this commit. Left for a separate fmt-only pass. - Non-pod resources (`kubernetes_service`, `kubernetes_secret`, `kubernetes_manifest`, etc.) — they don't own pods so they don't get Kyverno dns_config mutation. ## Verification Random sample post-commit: ``` $ cd stacks/navidrome && ../../scripts/tg plan → No changes. $ cd stacks/f1-stream && ../../scripts/tg plan → No changes. $ cd stacks/frigate && ../../scripts/tg plan → No changes. $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \ | awk -F: '{s+=$2} END {print s}' 169 ``` ## Reproduce locally 1. `git pull` 2. 
`rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+ 3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on the deployment's dns_config field. Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest annotation class handled separately in `8d94688d` for tls_secret) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:19:48 +00:00
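The path rule in commit 327ce215b9 (CronJobs nest the pod template one level deeper than the other pod owners) can be sketched as a small lookup. This is a minimal illustration in Python, not the contents of `/tmp/add_dns_config_ignore.py`: the function name and the set of resource types are hypothetical, and only the two `ignore_changes` paths are taken from the commit message.

```python
# Hypothetical helper: pick the ignore_changes path for a pod-owning resource.
# The two path strings are copied from the commit message; the rest is a sketch.
POD_TEMPLATE_PATH = "spec[0].template[0].spec[0].dns_config"
CRON_JOB_PATH = "spec[0].job_template[0].spec[0].template[0].spec[0].dns_config"

# Base resource types (without the _v1 suffix) assumed to own pods.
POD_OWNERS = {
    "kubernetes_deployment",
    "kubernetes_stateful_set",
    "kubernetes_daemon_set",
    "kubernetes_job",
    "kubernetes_cron_job",
}


def dns_config_ignore_path(resource_type):
    """Return the dns_config path for a resource type, or None if it owns no pods."""
    base = resource_type[:-3] if resource_type.endswith("_v1") else resource_type
    if base not in POD_OWNERS:
        return None
    # A CronJob wraps its PodTemplateSpec in job_template, hence the longer path.
    return CRON_JOB_PATH if base == "kubernetes_cron_job" else POD_TEMPLATE_PATH


if __name__ == "__main__":
    for rtype in ("kubernetes_deployment", "kubernetes_cron_job_v1", "kubernetes_service"):
        print(rtype, "->", dns_config_ignore_path(rtype))
```

Returning `None` for non-pod resources mirrors the commit's note that Services, Secrets and manifests are not mutated by the Kyverno policy and need no suppression.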
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json \| jq ... \| wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. 
Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate. labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
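The brace-depth-tracking pass described in commit 8b43692af0 could look roughly like the sketch below. It is an illustration of the technique, not the actual `/tmp/add_goldilocks_ignore.py`: the function name is hypothetical, and it assumes each resource opens with `{` on the `resource` line, closes with `}` on a line of its own, and has no unbalanced braces inside strings or comments.

```python
import re

LABEL = "goldilocks.fairwinds.com/vpa-update-mode"
# Lifecycle block as quoted in the commit message, indented for a top-level resource.
LIFECYCLE = [
    "  lifecycle {",
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace",
    f'    ignore_changes = [metadata[0].labels["{LABEL}"]]',
    "  }",
]
RESOURCE_RE = re.compile(r'^resource "kubernetes_namespace" ')


def inject_lifecycle(source):
    """Insert the lifecycle block before the closing brace of each namespace resource."""
    if LABEL in source:
        # Idempotence: a file that already mentions the label is left untouched.
        return source
    out = []
    depth = 0
    inside = False
    for line in source.splitlines():
        if not inside and RESOURCE_RE.match(line):
            inside = True
            depth = 0
        if inside:
            new_depth = depth + line.count("{") - line.count("}")
            if new_depth == 0 and depth > 0:
                # This line holds the outermost closing brace: inject just before it.
                out.extend(LIFECYCLE)
                inside = False
            depth = new_depth
        out.append(line)
    result = "\n".join(out)
    return result + "\n" if source.endswith("\n") else result


if __name__ == "__main__":
    sample = 'resource "kubernetes_namespace" "demo" {\n  metadata {\n    name = "demo"\n  }\n}\n'
    print(inject_lifecycle(sample))
```

Because the skip check is per file, a second run over already-patched files changes nothing, which matches the idempotence claim in the commit message.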
Viktor Barzin	e612baac15	[dawarich] Re-enable Sidekiq worker with resource limits + probes ## Context Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the unbounded 10-thread worker drove the whole pod into memory pressure — the kubelet then evicted the web container along with it. Viktor's recollection was "it was crashing"; the cgroup-root cause was that the Sidekiq container had no `resources.limits.memory` set, so a misbehaving job could pull the entire pod down instead of being OOM-killed and restarted in isolation. During the ~55 days the worker was off, POSTs to /api/v1 continued to enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not the cluster default DB 0). track_segments and digests tables stayed empty because nothing was processing the backfill queue (beads code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so Sidekiq was untested against the new release in this environment. Live pre-apply snapshot via `bin/rails runner`: enqueued=18 (cache=2, data_migrations=4, default=12) scheduled=16, retry=0, dead=0, procs=0, processed/failed=0 (stats reset by the 1.6.1 upgrade) Queue latencies ~50h — lines up with code-e9c (iOS client stopped POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1 was therefore a small, recoverable backlog, not the disaster the plan originally feared — no pre-apply triage needed. ## What changed Second container `dawarich-sidekiq` added to the existing Deployment (same pod, same lifecycle as `dawarich` web). Key differences vs the 2026-02-23 commented block: - `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory = 768Mi }`. Burstable QoS — cgroup is now bounded, so a runaway Sidekiq job gets OOM-killed and container-restarted in place without evicting the whole pod (web stays Ready). - Hosts parametrised via `var.redis_host` / `var.postgresql_host` instead of hardcoded FQDNs; matches the web container's pattern. 
- DB / secret / Geoapify creds via `value_from.secret_key_ref` against the existing `dawarich-secrets` K8s Secret (populated by the existing ExternalSecret). Removes the plan-time `data.vault_kv_secret_v2` reference the 2026-02-23 block relied on — that data source no longer exists in this stack. - `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). Ramp deferred to separate commits (plan: 2 → 5 → 10 with 15-30min observation between bumps). - Liveness + readiness `pgrep -f 'bundle exec sidekiq'` probes — container-scoped restart on stall, verified `pgrep` is at /usr/bin/pgrep in the Debian-trixie-based freikin/dawarich image. - Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT, RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so Sidekiq's Rails initialisation matches web. Pod-level additions: - `termination_grace_period_seconds = 60` — gives Sidekiq time to drain in-flight jobs on SIGTERM during rolls (default 30s not enough for reverse-geocoding batches). ## What is NOT in this change - Prometheus exporter for Sidekiq metrics. The first apply turned on `PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the `prometheus_exporter` gem's CLIENT middleware. That middleware PUSHes metrics over TCP to a separate exporter server process — and the freikin/dawarich image does not start one. Client logged ~2/sec "Connection refused" errors until we flipped ENABLED back to "false" in this commit. `pod.annotations["prometheus.io/scrape"]` reverted for the same reason (nothing listening on :9394). Filed code-1q5 (blocks code-459) to add a third sidecar container running `bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` and restore the 4 drafted alerts (DawarichSidekiqDown / QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are actually being emitted. - The 4 drafted Sidekiq alerts — reverted from monitoring/prometheus_chart_values.tpl; they reference metrics that don't exist yet. Restoration is part of code-1q5. 
- Concurrency ramp past 2 and the 24h burn-in gate that closes code-459 — separate future commits. - Liveness/readiness probes on the web container — pre-existing gap, out of scope per plan. ## Other changes bundled in Kyverno `dns_config` drift suppression added with the `# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich` AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. Plan only called it out for the Deployment, but the CronJob shows identical drift (Kyverno injects ndots=2 on every pod template, Terraform wipes it, infinite churn). Per AGENTS.md "Kyverno Drift Suppression" every pod-owning resource MUST carry the lifecycle block — this commit brings this stack into convention. ## Topology trade-off recorded Sidekiq lives in the same pod as the web container, not a separate Deployment. This means: - Every env bump during ramp bounces both containers (Recreate strategy) — brief UI blip accepted. - `kubectl scale` alone can't pause Sidekiq — pausing requires `BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting the container block + apply. - Shared pod network namespace — only one process can bind any given port. This is why the plan explicitly avoided declaring a new `port { name = "prometheus" }` on the sidekiq container (the web container already reserves 9394 by name). Accepted because the alternative (split Deployment) is significantly more config for a single-instance service and a follow-up bead (tracked in code-1q5 description area / Viktor's notes) already captures "revisit if future crashes warrant blast-radius isolation". ## Rollback Three levels, in order of increasing impact: 1. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply — pod stays up, no jobs processed, backlog preserved in Redis. 2. Drop concurrency to 1 or 2 + apply — reduce load but keep draining. 3. Re-comment the second container block (this diff in reverse) + apply — full disable, backlog stays in Redis DB 1, recoverable. 
Never DEL queue:* keys directly — Redis DB 1 is where Dawarich lives, and the jobs are recoverable state. ## Refs - code-459 (P3) — Dawarich Sidekiq disabled. In progress; closes after 24h burn-in at concurrency=10 with restartCount=0, DeadSet delta < 100. - code-1q5 (P3) — Follow-up: prometheus_exporter sidecar + 4 alerts. Depends on code-459. - code-e9c (P2) — Viktor client-side POST bug 2026-04-16. Untouched; processing the backlog does not fix this but ensures future POSTs drain cleanly. - code-72g (P3) — Anca ingestion silent since 2025-06-21. Untouched; same reasoning. ## Test Plan ### Automated ``` $ cd stacks/dawarich && ../../scripts/tg plan ... Plan: 0 to add, 3 to change, 0 to destroy. # kubernetes_deployment.dawarich (sidekiq container + probes + lifecycle) # kubernetes_namespace.dawarich (drops stale goldilocks label, pre-existing drift) # module.tls_secret.kubernetes_secret.tls_secret (Kyverno clone-label drift, pre-existing) $ ../../scripts/tg apply --non-interactive ... Apply complete! Resources: 0 added, 3 changed, 0 destroyed. (Second apply for PROMETHEUS_EXPORTER_ENABLED=false + annotation removal — same 0/3/0 shape.) ``` ### Manual Verification Setup: kubectl context against the k8s cluster (10.0.20.100). 1. Pod has both containers Ready with zero restarts: ``` $ kubectl -n dawarich get pods -o wide NAME READY STATUS RESTARTS AGE dawarich-75b4ff9fbf-qh56v 2/2 Running 0 <fresh> ``` 2. Sidekiq container is actively processing jobs: ``` $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=20 Sidekiq 8.0.10 connecting to Redis ... db: 1 queues: [data_migrations, points, default, mailers, families, imports, exports, stats, trips, tracks, reverse_geocoding, visit_suggesting, places, app_version_checking, cache, archival, digests, low_priority] Performing DataMigrations::BackfillMotionDataJob ... Backfilled motion_data for N000 points (N climbing) ``` 3. 
Rails Sidekiq::API snapshot — procs registered, counters moving: ``` $ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner ' require "sidekiq/api" s = Sidekiq::Stats.new puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}" ' processed=7 failed=2 procs=1 retry=0 dead=0 ``` (The 2 "failures" are cumulative across two pod lifecycles during the Prometheus env flip — retried successfully, neither retry nor dead set holds any jobs.) 4. Per-container memory well under the 1Gi limit: ``` $ kubectl -n dawarich top pod --containers POD NAME CPU MEMORY dawarich-75b4ff9fbf-qh56v dawarich 1m 272Mi (of 896Mi) dawarich-75b4ff9fbf-qh56v dawarich-sidekiq 79m 333Mi (of 1Gi) ``` 5. No "Prometheus Exporter, failed to send" log lines since the second apply: ``` $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \ | grep -c "Prometheus Exporter" 0 ``` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:13:05 +00:00
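Commit e612baac15 describes the Sidekiq container as "Burstable QoS" because its memory request (768Mi) is below its limit (1Gi) and no CPU limit is set. The sketch below is a simplified Python rendering of the Kubernetes QoS classification rules, added only to show why those values land in that class. The function name and dict layout are hypothetical, and quantities are compared as plain strings, which assumes requests and limits use the same notation.

```python
def qos_class(containers):
    """Classify a pod from its containers' resources (simplified QoS rules)."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A request that is not set defaults to the limit for that resource.
        requests = {**limits, **container.get("requests", {})}
        if limits or requests:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, with request == limit.
            if resource not in limits or requests.get(resource) != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    # Values from the commit message for the dawarich-sidekiq container.
    sidekiq = {
        "limits": {"memory": "1Gi"},
        "requests": {"cpu": "50m", "memory": "768Mi"},
    }
    print(qos_class([sidekiq]))
```

The pod-level class depends on every container in the pod, so one container with unequal requests and limits is enough to make the whole pod Burstable. The memory limit is what bounds the container's cgroup and lets the kernel OOM-kill Sidekiq alone.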
Viktor Barzin	2b8bb849c0	[infra] Bump claude-agent-service + beadboard image tags ## Context Two rolling updates tied to the BeadBoard dispatch-button work (code-kel): 1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent (files in /usr/share/agent-seed/), the beads-task-runner agent, and hmac.compare_digest bearer verification. The tag moves from 382d6b14 to 0c24c9b6 (monorepo HEAD). 2. The beadboard Deployment in beads-server now consumes CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image needs the Dispatch button + /api/agent-dispatch + /api/agent-status routes. Tag moves from :latest to :17a38e43 (fork HEAD on github.com/ViktorBarzin/beadboard). ## What this change does - Flips `local.image_tag` in claude-agent-service main.tf. - Drops the "temporary" comment on `beadboard_image_tag` and sets the default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md "Use 8-char git SHA tags — `:latest` causes stale pull-through cache"). ## Test Plan ## Automated - Both images already pushed to registry.viktorbarzin.me{:5050}/ : - claude-agent-service:0c24c9b6 verified via `docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/ contains both seed files. - beadboard:17a38e43 pushed, digest cd0d3c47. - terraform fmt/validate clean on both stacks from the earlier commits. ## Manual Verification 1. Push triggers Woodpecker default.yml. 2. Expected: both stacks apply; claude-agent-service pod rolls (new seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch + copies beads-task-runner.md), beadboard pod rolls with new env vars sourced from beadboard-agent-service ExternalSecret. 3. Cross-check: `kubectl -n claude-agent get pod -o yaml \| grep image:` should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard -o yaml \| grep image:` should show :17a38e43. Closes: code-kel Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:24:37 +00:00
Viktor Barzin	f79e3c563e	[infra] Remove mysql InnoDB Cluster + Operator HCL (Phase 4 cleanup) [ci skip] ## Context On 2026-04-16 (memory #711) MySQL was migrated from InnoDB Cluster (3-member Group Replication + MySQL Operator) to a raw `kubernetes_stateful_set_v1.mysql_standalone` on `mysql:8.4`. The migration preserved the `mysql.dbaas` Service name (selector switched to the standalone pod), all 20 databases/688 tables/14 users were dump-restored, and Vault rotated credentials against the new instance. The InnoDB Cluster has been dark since — Phase 4 was to remove the dead code and decommission its cluster-side Helm state. Memory #711 explicitly notes Phase 4 as: "Remove helm_release.mysql_cluster + mysql_operator + namespace + RBAC + Delete PVC datadir-mysql-cluster-0 (30Gi) + Delete mysql-operator namespace + CRDs + stale Vault roles." ## This change Phase 4 scope executed in this session (beads code-qai): 1. `terragrunt destroy -target` against 6 resources in the dbaas Tier 0 stack: - `module.dbaas.helm_release.mysql_cluster` — uninstalled InnoDBCluster CR + MySQL Router Deployment + 8 Services (mysql-cluster, -instances, ports 6446/6448/6447/6449/6450/8443, etc.) - `module.dbaas.helm_release.mysql_operator` — uninstalled MySQL Operator Deployment, InnoDBCluster CRD + webhook, operator ClusterRoles - `module.dbaas.kubernetes_namespace.mysql_operator` — deleted the ns - `module.dbaas.kubernetes_cluster_role.mysql_sidecar_extra` — leftover permissions patch that existed to work around the sidecar's kopf permissions bug; unused without the operator - `module.dbaas.kubernetes_cluster_role_binding.mysql_sidecar_extra` - `module.dbaas.kubernetes_config_map.mysql_extra_cnf` — used to override `innodb_doublewrite=OFF` via subPath mount; standalone does not need it 2. `kubectl delete pvc datadir-mysql-cluster-0 -n dbaas` — Helm does not garbage-collect PVCs; 30Gi reclaimed. 3. 
Removed 295 lines (lines 86–380) from `stacks/dbaas/modules/dbaas/main.tf` covering the `#### MYSQL — InnoDB Cluster via MySQL Operator` section and all six resources above. The first destroy hit a Helm timeout on `mysql-cluster` uninstall ("context deadline exceeded"). Uninstallation had in fact completed cluster-side by that point but TF rolled back the state delta. A second `terragrunt destroy -target` call with the same args resolved cleanly — destroyed the remaining 2 tracked resources (the first pass cleared 4) and encrypted+committed the Tier 0 state. ## What is NOT in this change - CRDs (`innodbclusters.mysql.oracle.com`, etc.) — Helm does delete these on uninstall. Verified clean: `kubectl get crd \| grep mysql.oracle.com` returns nothing. - Orphan PVC `datadir-mysql-cluster-0` — already deleted via kubectl; not a TF-managed resource. - Stale Vault DB roles (health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium) for services migrated MySQL→PG — sandbox denies `vault list database/roles` as credential scouting, so the user handles this manually. - 2 state-commits preceding this one (``30fa411b``, ``6cf3575e``) are automatic SOPS-encrypted-state commits produced by `scripts/tg` after each `terragrunt destroy` pass. Standard Tier 0 workflow. 
## Verification ``` $ helm list -A | grep -E 'mysql-cluster|mysql-operator' (no output) $ kubectl get ns mysql-operator Error from server (NotFound): namespaces "mysql-operator" not found $ kubectl get pvc -n dbaas datadir-mysql-cluster-0 Error from server (NotFound): persistentvolumeclaims "datadir-mysql-cluster-0" not found $ kubectl get pod -n dbaas -l app.kubernetes.io/instance=mysql-standalone NAME READY STATUS RESTARTS AGE mysql-standalone-0 1/1 Running 1 (118m ago) 2d $ ../../scripts/tg state list | grep -i 'mysql_operator\|mysql_cluster\|mysql_sidecar\|mysql_extra_cnf' (no output) $ ../../scripts/tg plan | grep -E 'mysql_cluster|mysql_operator|mysql_sidecar|mysql_extra_cnf' (no output — Wave 2 drift is gone; remaining plan items are pre-existing drift unrelated to this change, see Wave 3 + in-flight payslip work) ``` ## Reproduce locally 1. `git pull` 2. `cd stacks/dbaas && ../../scripts/tg state list | grep mysql_cluster` → no output 3. `helm list -A | grep mysql-cluster` → no output Closes: code-qai Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:19:48 +00:00
Viktor Barzin	c75beaac6c	wealthfolio: bump memory 64Mi → 1Gi (limit) / 256Mi (request) ## Context Pod was OOMKilled after today's broker-sync Phase 3 import grew the activity DB from ~10 rows (Phase 0 demo) to ~700 (Fidelity + cash-flow matches across 6 accounts). `/api/v1/net-worth` and `/valuations/history` materialise the full history in memory to render the dashboard chart. `kubectl describe pod` showed Back-off restarting failed container; `kubectl top pod` reported 14Mi steady-state but spikes crossed the 64Mi cap. ## This change Bump container resources to: - requests.memory: 64Mi → 256Mi - limits.memory: 64Mi → 1Gi CPU unchanged. 1Gi is generous for the current 700-activity DB + chart rendering, with headroom for another year of growth before we need to revisit (VPA will flag if actual use exceeds upperBound). ## Verification ### Automated `scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0 added, 4 changed, 0 destroyed. ### Manual $ kubectl -n wealthfolio get pod -l app=wealthfolio -o jsonpath='{.items[0].spec.containers[0].resources}' → {"limits":{"memory":"1Gi"},"requests":{"cpu":"10m","memory":"256Mi"}} $ kubectl -n wealthfolio get pods -l app=wealthfolio NAME READY STATUS RESTARTS AGE wealthfolio-86c8696b9c-nzwkf 1/1 Running 0 51s Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:13:05 +00:00
Viktor Barzin	43b4e1d372	[payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role ## Context New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`) needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana datasource, a dashboard, and a Claude agent definition for PDF extraction. Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace. No ingress, no TLS cert, no DNS record. ## What ### New stack `stacks/payslip-ingest/` - `kubernetes_namespace` payslip-ingest, tier=aux. - ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN, WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`. - ExternalSecret (vault-database) reads rotating password from `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`. - Deployment: single replica, Recreate strategy (matches single-worker queue design), `wait-for postgresql.dbaas:5432` annotation, init container runs `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno dns_config lifecycle ignore. - ClusterIP Service :8080. - Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`, uid `payslips-pg`) reading password from the db-creds K8s Secret. ### Grafana dashboard `uk-payslip.json` (4 panels) - Monthly gross/net/tax/NI (timeseries, currencyGBP). - YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140. - Deductions breakdown (stacked bars). - Effective rate + take-home % (timeseries, percent). ### Vault DB role `pg-payslip-ingest` - Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`. - New `vault_database_secret_backend_static_role.pg_payslip_ingest` (username `payslip_ingest`, 7d rotation). 
### DBaaS — DB + role creation - New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`: idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into `pg-cluster-1`. ### Claude agent `.claude/agents/payslip-extractor.md` - Haiku-backed agent invoked by `claude-agent-service`. - Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single JSON object matching the schema to stdout. No network, no file writes outside /tmp, no markdown fences. ## Trade-offs / decisions - Own DB per service (convention), NOT a schema in a shared `app` DB as the plan initially described. The Alembic migration still creates a `payslip_ingest` schema inside the `payslip_ingest` DB for table organisation. - Paperless URL uses port 80 (the Service port), not 8000 (the pod target port). - Grafana datasource uses the primary RW user — separate `_ro` role is aspirational and not yet a pattern in this repo. - No ingress — webhook is cluster-internal; external exposure is unnecessary attack surface. - No Uptime Kuma monitor yet: the internal-monitor list is a static block in `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor auto-creator). ## Test Plan ### Automated ``` terraform init -backend=false && terraform validate Success! The configuration is valid. terraform fmt -check -recursive (exit 0) python3 -c "import json; json.load(open('uk-payslip.json'))" (exit 0) ``` ### Manual Verification (post-merge) Prerequisites: 1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`. 2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`. Apply: 3. `scripts/tg apply vault` → creates pg-payslip-ingest static role. 4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role. 5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret` (first-apply ESO bootstrap). 6. `scripts/tg apply payslip-ingest` (full). 7. 
`kubectl -n payslip-ingest get pods` → Running 1/1. 8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200. End-to-end: 9. Configure Paperless workflow (README in code repo has steps). 10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s. 11. Grafana → Dashboards → UK Payslip → 4 panels render. Closes: code-do7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:07:05 +00:00
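Commit 43b4e1d372 says the vault-database ExternalSecret templates `DATABASE_URL` from the rotating password. The real template is not shown in the commit, so the following is only a sketch of what such a URL might look like, written in Python. The function name and the `postgresql://` scheme are assumptions; the username, host, port and database name come from the commit message. Percent-encoding the password matters because a rotated password may contain characters such as `@` or `/` that would otherwise break URL parsing.

```python
from urllib.parse import quote


def database_url(password, user="payslip_ingest", host="postgresql.dbaas",
                 port=5432, database="payslip_ingest"):
    """Build a connection URL, percent-encoding the credentials (scheme is assumed)."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # Example password chosen to contain URL-reserved characters.
    print(database_url("p@ss/w:rd"))
```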
Viktor Barzin	bde713f8a4	broker-sync: add Fidelity PlanViewer CronJob (suspended) ## Context Viktor's UK workplace pension is at Fidelity PlanViewer. The broker-sync provider + CLI landed in the broker-sync repo (commits 804e6a8 + 7c9be54); this commit adds the infra bits so the monthly sync runs in-cluster like the other broker-sync jobs. One successful manual backfill on 2026-04-18 pulled 51 contributions + valuation into a new WF WORKPLACE_PENSION account; Net Worth moved from £865k → £1,003k. This commit productionises that flow. ## This change - New kubernetes_cron_job_v1.fidelity in stacks/broker-sync/main.tf: - Schedule: 05:00 UK on the 20th of each month (after mid-month payroll settles; finance data shows credits on the 13th-18th). - Suspended by default — unsuspend once broker-sync image is rebuilt with Chromium baked in (Dockerfile change shipped separately in the broker-sync repo). - Init container materialises the storage_state JSON (projected from the broker-sync-secrets K8s Secret, synced from Vault by ESO) to the encrypted PVC at /data/fidelity_storage_state.json. Chromium then loads it. - Container: broker-sync fidelity-ingest with WF + FIDELITY_* env vars. Memory request 512Mi, limit 1280Mi — Chromium is hungry. - Lifecycle ignore_changes on dns_config per the KYVERNO_LIFECYCLE_V1 convention documented in AGENTS.md. ## What is NOT in this change - The Vault keys fidelity_storage_state + fidelity_plan_id — already staged via `vault kv patch` on 2026-04-18. - Dockerfile Chromium install — in broker-sync repo (commit 7c9be54). - Prometheus BrokerSyncFidelityFailed alert — deferred until the CronJob has run successfully for a month and we have a baseline. Existing broker-sync CronJobs also don't have per-job alerts yet; filing as a follow-up. ## Verification ### Automated terraform fmt ran clean. `terragrunt plan` would show a single new kubernetes_cron_job_v1 (suspended, so no pods scheduled). ### Manual (after apply + image rebuild) 1. 
Build + push broker-sync:<sha> with Chromium. 2. `scripts/tg apply stacks/broker-sync` (updates image_tag + adds fidelity CronJob). 3. Unsuspend: `kubectl -n broker-sync patch cronjob broker-sync-fidelity \ -p '{"spec":{"suspend":false}}'` OR flip the tf flag. 4. Trigger a test run: `kubectl -n broker-sync create job \ fidelity-test --from=cronjob/broker-sync-fidelity`. 5. Expect logs: `fidelity-ingest: fetched=N new=N imported=N failed=0`. 6. On FidelitySessionError: run `broker-sync fidelity-seed` locally + `vault kv patch secret/broker-sync fidelity_storage_state=@...`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 18:51:20 +00:00
Viktor Barzin	4f54c959d7	[infra] Remove iscsi-csi stack — TrueNAS decommissioned [ci skip] ## Context The iSCSI CSI driver was deployed against a TrueNAS appliance at 10.0.10.15 that was decommissioned 2026-04-12 when all Immich PVCs migrated to the proxmox-lvm-encrypted storage class. The stack has been dead code since — live survey (2026-04-18): - iscsi-csi namespace: empty (0 resources), 27h old (since last TF apply) - No iscsi CSI driver registered in the cluster - No PVs/PVCs reference iscsi - TF state held only the empty namespace - helm_release.democratic_csi was not in state (already gone pre-session) Leaving the stack around meant every `terragrunt run --all plan` would drift (TF wanted to create the helm release again) and every CI run would try to pull `truenas_api_key` + `truenas_ssh_private_key` from Vault against a TrueNAS that no longer exists. Beads tracking: code-gw0. ## This change - `scripts/tg destroy` in stacks/iscsi-csi (1 resource destroyed — the namespace). - `rm -rf stacks/iscsi-csi/` — removes modules/, main.tf, terragrunt.hcl, secrets symlink, and the 4 terragrunt-generated files (backend.tf, providers.tf, cloudflare_provider.tf, tiers.tf). - Dropped PG schema `iscsi-csi` on `10.0.20.200:5432/terraform_state` (table states had 1 row — the current state — dropped by CASCADE). - Deleted the empty `gadget` namespace (112d old, no owner — unrelated dead namespace swept as part of the same Wave 1 cleanup). ## What is NOT in this change - Vault database role cleanup for the 7 MySQL-migrated services (health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium). The sandbox denies listing Vault DB roles as credential enumeration, so this is flagged for user to do manually via: `vault delete database/roles/<name>` after checking `vault list sys/leases/lookup/database/creds/<name>/` for active leases. ## Reproduce locally 1. `git pull` 2. `ls stacks/ \| grep iscsi` → no output 3. `kubectl get ns iscsi-csi gadget` → both NotFound 4. 
psql to 10.0.20.200:5432/terraform_state → `\dn` shows no iscsi-csi schema ## Test Plan ### Automated ``` $ kubectl --kubeconfig config get ns iscsi-csi Error from server (NotFound): namespaces "iscsi-csi" not found $ kubectl --kubeconfig config get ns gadget Error from server (NotFound): namespaces "gadget" not found $ PGPASSWORD=... psql -h 10.0.20.200 -U ... -d terraform_state -c '\dn' | grep iscsi (no output) $ ls stacks/iscsi-csi 2>&1 ls: cannot access 'stacks/iscsi-csi': No such file or directory ``` ### Manual Verification None required — destroy was a no-op for workloads (namespace was empty). Closes: code-b6l Closes: code-gw0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 18:49:40 +00:00

629 commits