Commit graph

2937 commits

Author SHA1 Message Date
Viktor Barzin
752f94ab8f [monitoring] Opt-out external monitor for family/mladost3/task-webhook/torrserver; drop r730
The `external-monitor-sync` script opts every *.viktorbarzin.me ingress
in by default, so a missing annotation means "monitored."
Both ingress factories previously OMITTED the annotation when
`external_monitor = false`, which silently left monitors in place.

Fix: when the caller sets `external_monitor = false` explicitly, emit
`uptime.viktorbarzin.me/external-monitor = "false"` so the sync script
deletes the monitor. Keep the previous behavior (no annotation) for
callers that leave external_monitor null — otherwise 19 publicly-reachable
services with `dns_type="none"` would lose monitoring.
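
The resulting contract can be sketched as a tiny decision function (illustrative shell, not the real sync script's code; names and values are stand-ins):

```shell
# Tri-state the factories + sync script now implement:
#   annotation "false"  -> delete the monitor (explicit opt-out)
#   annotation absent   -> keep monitored (opt-in default, external_monitor null)
#   anything else       -> keep monitored
monitor_action() {
  case "${1:-absent}" in
    false) echo "delete" ;;
    *)     echo "keep" ;;
  esac
}

monitor_action false    # caller set external_monitor = false
monitor_action          # caller left external_monitor null (no annotation)
```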

Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy)
to match the other two already-flagged services. Delete the r730 ingress
module entirely — the Dell server has been decommissioned.
2026-04-19 15:18:27 +00:00
Viktor Barzin
a0d770d9a7 [cluster-health] Expand to 42 checks, remove pod CronJob path
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
  readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
  monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
  +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
  to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
  lines) and the openclaw cluster_healthcheck CronJob (local CLI is
  now the single authoritative runner). Keep the healthcheck SA +
  Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
  the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
  the script first, refreshes the 42-check table, drops stale
  CronJob/Slack/post-mortem sections, documents the monorepo-canonical
  + hardlink layout. File is hardlinked to
  /home/wizard/code/.claude/skills/cluster-health/SKILL.md for
  dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.
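
A minimal sketch of the check-runner shape described above — pass/fail counters plus a `--no-fix` report-only switch (illustrative; the checks and names are stand-ins, not the real cluster_healthcheck.sh):

```shell
TOTAL_CHECKS=42
PASS=0; FAIL=0; NO_FIX=0
for arg in "$@"; do
  if [ "$arg" = "--no-fix" ]; then NO_FIX=1; fi   # report-only, skip remediation
done

check() {  # usage: check <name> <command...>
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    PASS=$((PASS + 1))
  else
    FAIL=$((FAIL + 1))
    if [ "$NO_FIX" -eq 0 ]; then echo "FAIL: $name (would attempt auto-fix)"; fi
  fi
}

check "always-passes" true
check "always-fails" false
echo "passed $PASS / failed $FAIL (of $TOTAL_CHECKS in the real script)"
```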

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:13:03 +00:00
Viktor Barzin
5ea079181f [dns] Technitium — raise memory limit to 2Gi (was 1Gi, originally 512Mi)
Primary was at 401Mi / 512Mi (78%) before the first bump; the plan's 1Gi
leaves enough headroom for normal operation but only a thin margin if
blocklists or the cache grow. User escalated: OOM cascades are the exact
failure mode that
causes user-visible DNS outages, so give a full 2x safety margin across all
three instances. Replicas currently use 124-155Mi steady-state so they have
enormous headroom at 2Gi — accepted for symmetry and future growth (OISD
blocklists, in-memory cache).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:08:04 +00:00
Viktor Barzin
a86a97deb7 [reverse-proxy] Fix gw.viktorbarzin.me — point at 192.168.1.1 via EndpointSlice
The TP-Link gateway was wired via ExternalName `gw.viktorbarzin.lan`, but
Technitium has no record for that name (the router isn't a DHCP client and
Kea DDNS never registers it), so the ingress backend returned NXDOMAIN and
the `[External] gw` Uptime Kuma monitor was permanently failing.

Factory now accepts `backend_ip` as an alternative to `external_name`: it
creates a selector-less ClusterIP Service + manual EndpointSlice pointing
at the given IP, bypassing cluster DNS entirely. Used for gw (192.168.1.1);
the old ExternalName path is retained for every other service.
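
Roughly the pair of objects the `backend_ip` path renders (a sketch — the name, port, and labels here are illustrative, not the factory's exact output):

```yaml
# Selector-less Service: no pod selector, so Kubernetes manages no endpoints.
apiVersion: v1
kind: Service
metadata:
  name: gw
spec:
  ports:
    - port: 80
      targetPort: 80
---
# Manual EndpointSlice, bound to the Service by label; points at the router IP.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: gw-1
  labels:
    kubernetes.io/service-name: gw
addressType: IPv4
ports:
  - name: http
    port: 80
endpoints:
  - addresses:
      - 192.168.1.1
```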

Also add a direct `port` monitor for the router in uptime-kuma's
internal_monitors list so we can tell a Cloudflare/tunnel outage apart
from the router itself being down. Extended the internal-monitor-sync
script to handle non-DB monitor types (hostname + port fields).
2026-04-19 15:07:24 +00:00
Viktor Barzin
4b39fbb717 [dns] readiness gate — use dig-in-pod + retries, ephemeral curl pod for zone parity
Technitium pods don't ship wget/curl, only dig/nslookup. Switched the per-pod
health check from wget against /api to dig +short against 127.0.0.1. This
probes the actual DNS serving path, which is what we care about anyway.

Zone-count parity can't be checked inside the Technitium pod (no HTTP
client), so the readiness gate spawns a short-lived curlimages/curl pod via
`kubectl run --rm` that curls the three internal web services and exits.

Added retry loop on the dig check (6 × 10s) to tolerate zone-load delay after
a pod restart — viktorbarzin.lan is ~864KB and can take tens of seconds to
load into memory on a cold start.
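
The retry shape can be sketched as follows (illustrative shell; `probe` is a stub standing in for the real in-pod `dig +short @127.0.0.1` call):

```shell
probe() { true; }   # stub: substitute the kubectl-exec'd dig invocation

ok=0
for attempt in 1 2 3 4 5 6; do
  if probe; then ok=1; break; fi
  echo "attempt $attempt failed; zone may still be loading, retrying in 10s"
  sleep 10
done
if [ "$ok" -eq 1 ]; then
  echo "DNS readiness check passed on attempt $attempt"
else
  echo "DNS readiness check failed after 6 attempts"; exit 1
fi
```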

Relaxed the A-record regex to match any IPv4 rather than 10.x — records may
legitimately live outside that range.
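
A minimal illustration of the relaxed match (the real script's regex and anchoring may differ):

```shell
# Accept any dotted-quad, not only 10.x addresses.
ipv4='^([0-9]{1,3}\.){3}[0-9]{1,3}$'
echo "192.168.1.1" | grep -Eq "$ipv4" && echo "match"
echo "not-an-ip"   | grep -Eq "$ipv4" || echo "no match"
```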

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:57:29 +00:00
Viktor Barzin
9a21c0f065 [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
  (secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
  was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, fail the job on any
  create error (was silently passing).

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
  forward, health_check on viktorbarzin.lan forward, serve_stale
  3600s/86400s on both cache blocks — pfSense flap no longer takes the
  cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.
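
A hedged sketch of what the hardened Corefile blocks look like (one forward/cache pair shown; the upstream IP, cache TTL, and exact serve_stale duration are illustrative):

```
. {
    forward . 10.0.20.1 {
        policy sequential    # stick to the first healthy upstream
        health_check 5s
        max_fails 2
    }
    cache 3600 {
        serve_stale 24h      # keep answering from cache through an upstream outage
    }
}
```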

**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file, so it
  always equalled the current value (the alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
  TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
  CoreDNSForwardFailureRate.
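
The corrected alert comparison has roughly this shape (metric name and window from the message; the 3× threshold factor is an assumption):

```
dns_anomaly_total_queries
  > 3 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m)
```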

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
  kubectl rollout status on all 3 deployments (180s), per-pod
  /api/stats/get probe, zone-count parity across the 3 instances.
  Fails the apply on any check fail. Override: -var skip_readiness=true.

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:53:41 +00:00
Viktor Barzin
a5e097088a [ci] Persist VAULT_TOKEN across Woodpecker step commands
## Context
Follow-up to commit 2eca011c (bd code-e1x). That commit attached the
`terraform-state` policy to the `ci` Vault role and propagated apply-
loop failures so the pipeline actually fails when a stack fails. On
the very first push to exercise it (pipeline 361), the platform apply
step died with:

  [vault] Starting apply...
  state-sync: ERROR — no Vault token and no age key at ~/.config/sops/age/keys.txt
  [vault] FAILED (exit 1)

Root cause: in Woodpecker's `commands:` list, each `- |` item runs in
a fresh shell. The dedicated "Vault auth" command was doing
`export VAULT_TOKEN=...`, but that export was lost by the time the
apply command ran. Tier-0 stacks depended on Vault Transit (via
`scripts/state-sync`), and Tier-1 stacks depend on
`vault read database/static-creds/pg-terraform-state` via `scripts/tg`
— both silently fell through to their "no Vault" error path.

This bug was latent before 2eca011c because the old apply loop
swallowed per-stack exit codes. Now that we surface them, the pipeline
fails honestly — but fails on every run. Fixing the missing token
propagation is the last mile.

## This change
- Pin `VAULT_ADDR` at the step's `environment:` level so every command
  inherits it without an explicit export.
- In the Vault auth command, assert the auth succeeded (non-empty,
  non-"null" token) then write the token to `~/.vault-token` with
  `umask 077`. `vault`, `scripts/tg`, and `scripts/state-sync` all
  fall through to `~/.vault-token` when `VAULT_TOKEN` env is unset.
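
The fresh-shell behaviour is easy to model locally with `sh -c` calls standing in for consecutive `commands:` items (a sketch, not the pipeline YAML):

```shell
unset VAULT_TOKEN                                 # clean slate for the demo

sh -c 'export VAULT_TOKEN=s.example'              # "auth" command: export dies with its shell
leak=$(sh -c 'echo "${VAULT_TOKEN:-unset}"')      # "apply" command: sees nothing
echo "env across shells: $leak"

tokfile=$(mktemp)                                 # stand-in for ~/.vault-token
sh -c "umask 077; echo s.example > '$tokfile'"    # a file written in one shell...
persisted=$(sh -c "cat '$tokfile'")               # ...survives into the next
echo "file across shells: $persisted"
rm -f "$tokfile"
```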

## What is NOT in this change
- A broader refactor to fold the multi-step chain into a single
  `- |` script — preserving the existing granular structure keeps
  individual step logs grep-friendly and failures localised.
- Restoring the VAULT_TOKEN export too — redundant once ~/.vault-token
  is written, and would need duplicating into each command anyway.

## Test Plan
### Automated
N/A (pure YAML change). Will be verified by the very next CI run —
the push creating this commit.

### Manual Verification
Watch `ci.viktorbarzin.me/repos/1/pipelines` for the pipeline whose
commit matches this one. Expected:
- `default` workflow exercises the auth + apply steps.
- Platform apply for `vault` stack runs state-sync decrypt → detects
  no drift (I applied locally already) → OK.
- Tier-1 stacks (if any in the diff):
  `vault read database/static-creds/pg-terraform-state` returns creds → apply runs.
- No "state-sync: ERROR" or "Cannot read PG credentials" errors.
- `default` workflow state: success.
- Overall pipeline status: still failure because `build-cli` is
  independently broken (bd code-12b); that's cosmetic.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:30:39 +00:00
Viktor Barzin
2eca011cc3 [ci,vault] Fix Tier-1 apply silently failing in Woodpecker
## Context
For weeks, every push to infra has produced a `build-cli` workflow
failure AND a `default` workflow success — but the `default` workflow's
"success" was a lie. Inside the apply loop we were swallowing per-stack
failures with `set +e ... echo FAILED`, so the step exited 0 regardless.

Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
    [servarr] Starting apply...
    ERROR: Cannot read PG credentials from Vault.
    Run: vault login -method=oidc
    [servarr] FAILED (exit 1)

Two root causes, two fixes here.

### 1. Vault `ci` role lacks Tier-1 PG backend creds

The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.

**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.

### 2. Apply-loop swallows stack failures

`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.

**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.
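
The accumulate-and-re-raise shape, sketched with stand-in stacks and commands (the real loop runs `tg apply` per stack and exits 1 when the list is non-empty):

```shell
FAILED_STACKS=""
for stack in vault servarr monitoring; do
  set +e
  case "$stack" in servarr) false ;; *) true ;; esac   # pretend servarr fails
  EXIT=$?
  set -e
  if [ "$EXIT" -ne 0 ]; then
    echo "[$stack] FAILED (exit $EXIT)"
    FAILED_STACKS="$FAILED_STACKS $stack"   # remember, but keep looping
  fi
done

if [ -n "$FAILED_STACKS" ]; then
  echo "=== FAILED STACKS ===$FAILED_STACKS"   # the real step exits 1 here
fi
```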

Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.

## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
  TF file is already at 5.1.4 in git; once CI picks up this commit
  it'll apply on its own, or Viktor can run `tg apply` locally now
  that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
  per-stack continuation so a single bad stack doesn't hide the
  others' plans from the log. Just making the final status honest.

## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
  # vault_kubernetes_auth_backend_role.ci will be updated in-place
  ~ token_policies = [
      + "terraform-state",
        # (1 unchanged element hidden)
    ]
  # vault_jwt_auth_backend.oidc will be updated in-place
  ~ tune = [...]    # cosmetic provider-schema drift, pre-existing

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.

### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
    SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
    TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
          -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
    curl -s -H "X-Vault-Token: $TOK" \
      http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}

# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```

Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
Viktor Barzin
2431c6d5fe [reverse-proxy] ha-sofia per-service retry + ServersTransport
Adds a ha-sofia-retry Middleware (attempts=3, initialInterval=100ms)
and ha-sofia-transport ServersTransport (dialTimeout=500ms) wired into
ha-sofia + music-assistant ingresses. Absorbs the 67-156ms connect/DNS
stalls that were surfacing as 18 x 502s/day without disturbing the
global 2-attempt retry or Immich's 60s dialTimeout. depends_on the new
manifests to avoid the dangling-reference pattern from the 2026-04-17
Traefik P0.

Closes: code-rd1
2026-04-19 14:07:07 +00:00
Viktor Barzin
947f1bd75d [monitoring] UK Payslip v3.2 — stacked YTD panels, YTD-cumulative rate, Sankey
Three changes:

1. Split panel 1 (YTD overlay of 6 non-additive lines) into two accounting-
   clean stacked-area panels side-by-side:
   - "YTD sources": salary + bonus + rsu_vest + residual (= gross)
   - "YTD uses": net + income_tax + NI + pension_employee + student_loan
     + rsu_offset (= gross, per validate_totals identity)
   Green for take-home, red/orange for taxes, purple for pension, teal
   for RSU offset — visually encodes "what you earned vs what was taken".

2. Panel 3 effective rate switched from per-slip attribution to YTD
   cumulative (SUM OVER w / SUM OVER w). Kills the vest-month >100% spike:
   the old SQL subtracted `rsu_vest × ytd_avg_rate` from income_tax, but
   Meta's variant-C grossup means actual RSU tax is on `rsu_grossup × top
   marginal`, not rsu_vest × average. Cumulative approach blends both
   proportionally, no attribution hack needed. Also adds a third series:
   all-deductions rate ((income_tax + NI + student_loan) / gross).

3. New panel 8 — Sankey (netsage-sankey-panel) showing sources → Gross →
   uses over the selected time range. Plugin added to grafana Helm values.
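
The cumulative-rate query in point 2 has roughly this shape (table, column, and window names here are hypothetical, not the dashboard's actual SQL):

```sql
-- Assumed schema: payslips(pay_date, tax_year, gross, income_tax, ni, student_loan)
SELECT
  pay_date,
  SUM(income_tax) OVER w / NULLIF(SUM(gross) OVER w, 0)  AS ytd_income_tax_rate,
  (SUM(income_tax) OVER w + SUM(ni) OVER w + SUM(student_loan) OVER w)
    / NULLIF(SUM(gross) OVER w, 0)                       AS ytd_all_deductions_rate
FROM payslips
WINDOW w AS (PARTITION BY tax_year ORDER BY pay_date);
```
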
2026-04-19 13:42:27 +00:00
Service Upgrade Agent
55ade1f9b3 [servarr] Fix qbittorrent container_port 8787 -> 8080 (matches WEBUI_PORT)
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-19 13:37:44 +00:00
Viktor Barzin
3b4a059243 [uptime-kuma] Fix broken Redis monitor + move to TF-managed list
The Redis monitor (id=53) was created manually with a connection string
pointing at redis-master.redis-headless.redis.svc.cluster.local, which
doesn't resolve — headless only exposes pod DNS (redis-node-N.redis-headless),
not a synthetic "redis-master" name. Status had been DOWN with ENOTFOUND
for weeks.

Declare it in local.internal_monitors using redis-master.redis.svc.cluster.local
(the HAProxy-fronted ClusterIP that already routes to the Sentinel-elected
master). Verified RESP PING through HAProxy returns PONG.

Tighten intervals to 60s / 30s retry / 3 retries — Redis is core (Paperless,
Immich, Authentik, Dawarich all depend on it); a 5-minute detection window
was far too loose given the blast radius.

Also teach the sync CronJob to handle no-password monitors (auth disabled
on the Bitnami chart), via an optional database_password_vault_key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:28:36 +00:00
Service Upgrade Agent
094bc727d4 upgrade: qbittorrent 5.0.4 -> 5.1.4
Changelog summary: Minor version bump; patch releases update external Alpine packages and restore qbittorrent-cli openssl3 support.
Risk: SAFE
Breaking changes: none
DB backup: no (not DB-backed)
Config changes applied: none
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-19 13:26:15 +00:00
Viktor Barzin
26ef97d294 [claude-agent-service] Add WOODPECKER_API_TOKEN + SLACK_WEBHOOK_URL env vars
## Context
Companion fix to 2026-04-19's service-upgrade spec refactor. The agent
pod has no Vault CLI auth (no VAULT_TOKEN, port 8200 refused), so every
`vault kv get` in the spec returned empty:
  - `WOODPECKER_TOKEN=""` → 401 on /api/repos/1/pipelines → agent can't
    find its pipeline → 15m poll timeout → rollback loop → >30m cap.
  - `SLACK_WEBHOOK=""` → webhook POST to empty URL → no Slack messages
    for 3+ days (the surface symptom that kicked off bd code-3o3).

## This change
Extends the `claude-agent-secrets` ExternalSecret with two more keys,
making them available to the agent via `envFrom`:
  - `WOODPECKER_API_TOKEN` ← `secret/ci/global.woodpecker_api_token`
    (already used by the vault-woodpecker-sync CronJob, same key)
  - `SLACK_WEBHOOK_URL` ← `secret/viktor.alertmanager_slack_api_url`
    (shared webhook also consumed by Alertmanager)

Pairs with commit a5963169 which refactored service-upgrade.md to read
these env vars directly instead of shelling out to `vault kv get`.

## What is NOT in this change
- REGISTRY_USER / REGISTRY_PASSWORD — not needed on the agent side.
  The separate `.woodpecker/build-cli.yml` fix (bd code-3o3 fix C)
  will add those to `secret/ci/global` for the vault-woodpecker-sync
  CronJob to publish as Woodpecker secrets, not here.

## Test Plan
### Automated
`terraform plan` reported `Plan: 0 to add, 2 to change, 0 to destroy`
(ExternalSecret + a cosmetic `tier` label drop on the Deployment).
Applied cleanly.

### Manual Verification
```
$ kubectl -n claude-agent get externalsecret claude-agent-secrets \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
secret synced

$ kubectl -n claude-agent exec deploy/claude-agent-service -- sh -c \
    'echo "WP=${WOODPECKER_API_TOKEN:0:20}... SLACK=${SLACK_WEBHOOK_URL:0:40}..."'
WP=eyJhbGciOiJIUzI1NiIs... SLACK=https://hooks.slack.com/services/T02SV75...

$ kubectl -n claude-agent rollout status deploy/claude-agent-service
deployment "claude-agent-service" successfully rolled out
```

Next step: fire one synthetic DIUN webhook to confirm the agent reaches
Slack + lands a commit + exits cleanly, completing code-3o3.

Refs: bd code-3o3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:23:12 +00:00
Viktor Barzin
83f4a72b6f [redis] Raise master+replica memory 256Mi → 512Mi
256Mi was tight once the working set crossed ~200Mi: a BGSAVE fork
during replica full PSYNC doubled master RSS via COW and pushed it
past the limit, OOMing (exit 137) in a loop. HAProxy flapped, every
client (Paperless, Immich, Authentik, Dawarich) saw session store
failures → 500s on authenticated requests.

512Mi gives ~2x headroom on the current 204Mi RDB.

Closes: code-n81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:18:30 +00:00
Viktor Barzin
a5963169ec [service-upgrade] Drop vault-CLI assumptions + check default workflow only
## Context
Since the 2026-04-15 migration from SSH-on-DevVM to in-cluster
claude-agent-service, the agent spec's four `vault kv get ...` calls
have been dead code: the pod has no `VAULT_TOKEN`, no `~/.vault-token`,
no Vault login method, and port 8200 is refused. Every token fetch
returns empty, which silently breaks:

- **Slack**: `SLACK_WEBHOOK=""` → POSTs 404 → no messages for 3+ days
  (the exact user-visible symptom that started this thread).
- **Woodpecker CI polling**: `WOODPECKER_TOKEN=""` → 401 on
  `/api/repos/1/pipelines` → agent can't find its own pipeline → 15-min
  poll times out → jumps to rollback → same failure in the revert → hits
  n8n's 30-min ceiling → SIGKILL mid-saga → no commit, no Slack.
- **Changelog fetch**: `GITHUB_TOKEN=""` overrides the env var supplied
  by `envFrom: claude-agent-secrets`, crippling changelog lookups too.

Separately, Step 9 read the overall pipeline `status`, which is
`failure` any time a single workflow fails — e.g. the unrelated
`build-cli` workflow (docker image push to registry.viktorbarzin.me:5050
has been erroring since private-registry htpasswd was enabled on
2026-03-22). That made the agent spuriously rollback every otherwise-
successful upgrade.

## This change
- Replace the four `vault kv get ...` invocations with the matching
  env-var reads (`$GITHUB_TOKEN`, `$WOODPECKER_API_TOKEN`,
  `$SLACK_WEBHOOK_URL`) and document the env-var contract at the top
  of the "Environment" section. The env vars are expected to be
  pre-loaded via `envFrom: claude-agent-secrets` — that part is tracked
  as the companion ExternalSecret/Terraform change in bd code-3o3
  (must land before this spec is effective).
- Rewrite Step 9 to poll the `default` workflow's `state` instead of
  the overall pipeline `status`. Adds a jq example and explicitly
  documents the build-cli noise so future operators know why overall
  status is unreliable.

## What is NOT in this change
- The matching ExternalSecret / Terraform changes that feed
  WOODPECKER_API_TOKEN / SLACK_WEBHOOK_URL / REGISTRY_USER /
  REGISTRY_PASSWORD into the pod. Until those land, this spec still
  produces empty env vars at runtime — but at least the *shape* of the
  contract is correct and grep-friendly.
- The .woodpecker/build-cli.yml `logins:` entry for
  registry.viktorbarzin.me:5050. That's fix C in the same task.

## Test Plan
### Automated
None — this is pure markdown guidance for the model. Syntax-checked by
`grep -nE 'vault kv get|WOODPECKER_TOKEN|SLACK_WEBHOOK[^_]'
.claude/agents/service-upgrade.md` showing only the explanatory
warning on line 37 as a match.

### Manual Verification
After the companion ExternalSecret change lands and the pod has
WOODPECKER_API_TOKEN + SLACK_WEBHOOK_URL in env:
1. Trigger a DIUN-style webhook on a known slow service.
2. Watch `kubectl -n claude-agent logs -f deploy/claude-agent-service`.
3. Expect curl to `ci.viktorbarzin.me/api/...` return 200 and pipeline
   JSON (no 401), and Slack `$SLACK_WEBHOOK_URL` return 200.
4. Expect a Slack `[Upgrade Agent] Starting:` post inside the first
   minute, and a `SUCCESS` or `FAILED + ROLLED BACK` post on exit.

Refs: bd code-3o3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:15:06 +00:00
Viktor Barzin
13cc5d956e [monitoring] UK Payslip dashboard v3.1 — add YTD reconciliation panel
Adds panel 6 that reconciles each payslip's reported YTD summary block
(ytd_gross, ytd_taxable_pay, ytd_tax_paid) against the cumulative sum
of extracted per-payslip values within the same tax year. Any Δ > £0.02
flags a parser regression, missing slip, or duplicate ingest — the
algebraic companion to the existing missing-months panel.

Variant A payslips (pre-mid-2022) carry no YTD block and are filtered
out via WHERE ytd_gross IS NOT NULL.
2026-04-19 13:12:57 +00:00
Viktor Barzin
581aed5fcc [openclaw,tor-proxy] Opt task-webhook + torrserver out of external monitoring
Adds `external_monitor = false` to the ingress_factory calls for
task-webhook and torrserver so the `external-monitor-sync` CronJob
stops auto-creating `[External] <name>` monitors for them. Both
services remain deployed and reachable; only the Uptime Kuma monitors
are dropped.
2026-04-19 13:01:36 +00:00
Viktor Barzin
ac95973b38 [monitoring] UK Payslip dashboard v3 — consolidate to 5 panels + data-integrity check
Collapse from 11 panels to 5. New hero "Tax-year YTD — gross / net /
taxes / RSU / salary" merges the old YTD cumulative + total-comp +
earnings-breakdown panels into a single line chart (tax-band thresholds
still on ytd_cash_gross). New "Data integrity" table surfaces missing
months and zero-salary anomalies at a glance — catches the 2024-02 gap
(Paperless doc never uploaded) and any future parser regressions.

Monthly cash flow, effective-rate, and full payslip table kept as-is.

Total dashboard height: 39 rows (was ~67). No parser / schema changes.

[ci skip]
2026-04-19 12:47:44 +00:00
Viktor Barzin
4ca793380b [multi] Sweep Kyverno wait-for redis annotations to redis-master
Replaces `redis.redis:6379` with `redis-master.redis:6379` in all 11
dependency.kyverno.io/wait-for annotations across 8 stacks, plus one
docs comment in the Kyverno module.

These annotations drive DNS-only `nc -z` init-container readiness
checks — zero RW risk. Both hostnames resolve, so there is no wait-for
failure window during the rolling re-apply.
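
The sweep itself is a one-line substitution, shown here against a heredoc rather than the real *.tf files:

```shell
# In the repo this would be `sed -i` across the 8 stacks.
sed 's/redis\.redis:6379/redis-master.redis:6379/' <<'EOF'
    "dependency.kyverno.io/wait-for" = "redis.redis:6379"
EOF
```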

Closes: code-otr
2026-04-19 12:44:46 +00:00
Viktor Barzin
12a372bf92 [redis] Migrate live RW consumers off bare redis.redis hostname
Completes the T0 hostname migration. The `redis.redis` service is a
legacy alias that routes to HAProxy via a `null_resource` selector
patch; `redis-master.redis` is the canonical name that has always
routed to HAProxy directly and health-checks master-only.

Changes:
- redis-backup CronJob: redis-cli BGSAVE + --rdb now target
  redis-master.redis. BGSAVE runs on the master (what we want).
- config.tfvars `resume_redis_url`: unused fallback updated for
  grep hygiene; nothing reads it today.
- ytdlp REDIS_URL default: updated for dev-local runs; production
  already sets REDIS_URL via main.tf:283-285 → var.redis_host.
- immich chart_values.tpl REDIS_HOSTNAME: dead Helm template (values
  block commented out in main.tf:524, Immich deploys as raw
  kubernetes_deployment using var.redis_host). Updated to keep the
  file consistent if someone ever revives it.
2026-04-19 12:42:36 +00:00
Viktor Barzin
e6e5fc5f17 [docs] Mailserver architecture — richer diagrams + steady-state accuracy [ci skip]
## Context

After code-yiu Phases 1a–6 landed, `docs/architecture/mailserver.md` still
carried the pre-HAProxy Mermaid diagram, a retired Dovecot-exporter
component row, stale PVC names (`-proxmox` suffixes that were renamed
`-encrypted` during the LUKS migration), a wrong probe schedule
(claimed 10 min, actually 20 min), and a Mailgun-API claim for the
probe (it's been on Brevo since code-n5l). The two-path architecture
(external-via-HAProxy + intra-cluster-via-ClusterIP) that defines the
current design wasn't visualised at all.

## This change

Rewrote the Architecture Diagram section to show **both ingress paths
in one Mermaid flowchart**, colour-coded:

- External (orange): Sender → pfSense NAT → HAProxy → NodePort →
  **alt PROXY listeners** (2525/4465/5587/10993).
- Intra-cluster (blue): Roundcube / probe → ClusterIP Service →
  **stock listeners** (25/465/587/993), no PROXY.
- The pod subgraph shows both listener sets feeding the same Postfix /
  Rspamd / Dovecot / Maildir pipeline.
- Security dotted edges: Postfix log stream → CrowdSec agent →
  LAPI → pfSense bouncer decisions.
- Monitoring dotted edges: probe → Brevo HTTP → MX → pod → IMAP →
  Pushgateway/Uptime Kuma.

Added a **sequenceDiagram** for the external SMTP roundtrip — walks
through the wire-level handshake from external MTA → pfSense NAT →
HAProxy TCP connect → PROXY v2 header write → kube-proxy SNAT → pod
postscreen parse → smtpd banner. Makes the "how does the pod see the
real IP despite SNAT?" question self-answering.

Added a **Port mapping table** listing all 8 container listeners (4
stock + 4 alt) with their Service, NodePort, PROXY-required flag, and
who uses each path. Replaces the ambiguous prose about "alt ports".

Fixed stale bits:
- Removed Dovecot Exporter row from Components (retired in code-1ik).
- Added pfSense HAProxy row.
- Probe schedule: every 10 min → **every 20 min** (`*/20 * * * *`).
- Probe API: Mailgun → **Brevo HTTP**.
- PVC names: `-proxmox` → **`-encrypted`** (all three); storage class
  `proxmox-lvm` → **`proxmox-lvm-encrypted`**.
- Added `mailserver-backup-host` + `roundcube-backup-host` RWX NFS
  PVCs to the Storage table with backup flow pointer.
- Expanded Troubleshooting → Inbound to include HAProxy health check
  + container-listener verification steps.
- Secrets table: `brevo_api_key` now marked as used by both relay +
  probe; `mailgun_api_key` marked historical.

Added a prominent **UPDATE 2026-04-19** header to
`docs/runbooks/mailserver-proxy-protocol.md` pointing future readers
at the implemented state in `mailserver-pfsense-haproxy.md`. Research
doc preserved as a decision record — it's the canonical "why not just
pin the pod?" reference.

## What is NOT in this change

- No Terraform changes; this is docs-only.
- No changes to the runbook (`mailserver-pfsense-haproxy.md`) — it was
  already rewritten during Phase 6.

## Test Plan

### Automated
```
$ awk '/^```mermaid/ {c++} END{print c}' docs/architecture/mailserver.md
2
$ grep -c '\-encrypted' docs/architecture/mailserver.md
5  # PVC references normalised
$ grep -c '\-proxmox' docs/architecture/mailserver.md
0  # no stale names left
```

### Manual Verification
Render `docs/architecture/mailserver.md` on GitHub or any Mermaid-
capable viewer:
1. Top Architecture Diagram should show two labelled paths into the
   pod, colour-coded (orange = external, blue = intra-cluster).
2. Sequence diagram should show 10 numbered steps ending at Rspamd +
   Dovecot delivery.
3. Port Mapping table should make it obvious that the 4 alt container
   ports are only reachable via `mailserver-proxy` NodePort and require
   PROXY v2.
2026-04-19 12:40:53 +00:00
Viktor Barzin
d5a47e35fc [redis] Restore dynamic DNS in HAProxy to fix stale-IP outage
HAProxy resolved `redis-node-{0,1}.redis-headless.redis.svc.cluster.local`
once at pod startup and cached the IPs forever. When redis-node pods
cycled (new pod IPs), HAProxy kept connecting to the dead IPs — backends
flapped between "Connection refused" and "Layer4 timeout", and Immich's
ioredis client hit EPIPE until max-retries exhausted and the pod entered
CrashLoopBackOff. This caused an Immich outage on 2026-04-19.

Fix:
- Add `resolvers kubernetes` stanza pointing at kube-dns (10s hold on
  every category so we pick up pod IP changes within a DNS TTL window).
- Add `resolvers kubernetes init-addr last,libc,none` to every backend
  server line so HAProxy resolves at startup AND uses the dynamic
  resolver for runtime refresh.
- Add `checksum/config` pod annotation to the HAProxy Deployment so a
  haproxy.cfg change actually rolls the pods (including this one).
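Put together, the two resolver bullets combine into a config shape like this (a sketch: the kube-dns ClusterIP and the full set of `hold` categories are assumptions; only the resolver name, `init-addr last,libc,none`, and the 10s hold policy come from this commit):

```
resolvers kubernetes
    nameserver kube-dns 10.96.0.10:53   # cluster DNS ClusterIP (assumed)
    hold valid    10s
    hold other    10s
    hold refused  10s
    hold nx       10s
    hold timeout  10s
    hold obsolete 10s

backend redis_primary
    server redis-node-0 redis-node-0.redis-headless.redis.svc.cluster.local:6379 check resolvers kubernetes init-addr last,libc,none
    server redis-node-1 redis-node-1.redis-headless.redis.svc.cluster.local:6379 check resolvers kubernetes init-addr last,libc,none
```

`init-addr last,libc,none` tries the last known address, then a libc lookup at startup, and otherwise starts the server DOWN instead of failing config load; runtime address updates then come from the `kubernetes` resolver.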

Closes: code-fd6
2026-04-19 12:39:09 +00:00
Viktor Barzin
43fe11fffc [mailserver] Phase 6 — decommission MetalLB LB path [ci skip]
## Context (bd code-yiu)

With Phase 4+5 proven (external mail flows through pfSense HAProxy +
PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB
LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are
obsolete. Phase 6 decommissions them and documents the steady-state
architecture.

## This change

### Terraform (stacks/mailserver/modules/mailserver/main.tf)
- `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`.
- Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation.
- Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP).
- Port set unchanged — the Service still exposes 25/465/587/993 for
  intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
  CronJob) that hit the stock PROXY-free container listeners.
- Inline comment documents the downgrade rationale + companion
  `mailserver-proxy` NodePort Service that now carries external traffic.

### pfSense (ops, not in git)
- `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT
  rule references it post-Phase-4; keeping it would be misleading dead
  metadata. Reversible via WebUI + `php /tmp/delete-mailserver-alias.php`
  companion script (ad-hoc, not checked in — alias is just a
  Firewall → Aliases → Hosts entry).

### Uptime Kuma (ops)
- Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202`
  → `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` /
  `... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy
  layer liveness). History retained (edit, not delete-recreate).

### Docs
- `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten
  "Current state" section; now reflects steady-state architecture with
  two-path diagram (external via HAProxy / intra-cluster via ClusterIP).
  Phase history table marks Phase 6 done. Rollback section updated (no
  one-liner post-Phase-6; needs Service-type re-upgrade + alias re-add).
- `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound
  flow, CrowdSec section, Uptime Kuma monitors list, Decisions section
  (dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY
  v2"), Troubleshooting all updated.
- `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph
  updated with new external path description; references the new runbook.

## What is NOT in this change

- Removal of `10.0.20.202` from `cloudflare_proxied_names` or any
  reserved-IP tracking — wasn't there to begin with. The
  `metallb-system default` IPAddressPool (10.0.20.200-220) shows 2 of
  19 available after this, confirming `.202` went back to the pool.
- Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if
  someone re-introduces the MetalLB LB (see runbook "Rollback").

## Test Plan

### Automated (verified pre-commit 2026-04-19)
```
# Service is ClusterIP with no EXTERNAL-IP
$ kubectl get svc -n mailserver mailserver
mailserver   ClusterIP   10.103.108.217   <none>   25/TCP,465/TCP,587/TCP,993/TCP

# 10.0.20.202 no longer answers ARP (ping from pfSense)
$ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202'
2 packets transmitted, 0 packets received, 100.0% packet loss

# MetalLB pool released the IP
$ kubectl get ipaddresspool default -n metallb-system \
    -o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}'
2 of 19 available

# E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver
... Round-trip SUCCESS in 20.3s ...
$ kubectl delete job probe-phase6 -n mailserver

# pfSense mailserver alias removed
$ ssh admin@10.0.20.1 'php -r "..." | grep mailserver'
(no output)
```

### Manual Verification
1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new
   hostname `10.0.20.1`.
2. Roundcube login works (`https://mail.viktorbarzin.me/`).
3. Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe
   `postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in
   mailserver logs within ~10s.
4. CrowdSec should still see real client IPs in postfix/dovecot parsers
   (verify with `cscli alerts list` on next auth-fail event).

## Phase history (bd code-yiu)

| Phase | Status | Description |
|---|---|---|
| 1a  | ✅ `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2   | ✅ 2026-04-19 | pfSense HAProxy pkg installed |
| 3   | ✅ `ba697b02` | HAProxy config persisted in pfSense XML |
| 4+5 | ✅ `9806d515` | 4-port alt listeners + HAProxy frontends + NAT flip |
| 6   | ✅ **this commit** | MetalLB LB retired; 10.0.20.202 released; docs updated |

Closes: code-yiu
2026-04-19 12:36:11 +00:00
Viktor Barzin
9806d515dd [mailserver] Phase 4+5 — pfSense HAProxy cutover for all 4 mail ports [ci skip]
## Context (bd code-yiu)

Cutover of external mail traffic from the MetalLB LB IP path (ETP:Local,
pod-speaker colocation) to pfSense HAProxy + PROXY v2 (ETP:Cluster). Real
client IP now preserved end-to-end on ports 25/465/587/993, both for
postscreen anti-spam scoring and CrowdSec auth-failure bans.

## This change

### k8s (stacks/mailserver/modules/mailserver/main.tf)

- `mailserver-user-patches` ConfigMap's `user-patches.sh` now appends 3
  alt PROXY-speaking services to master.cf:
  - `:2525` postscreen (alt :25)
  - `:4465` smtpd (alt :465 SMTPS, wrappermode TLS)
  - `:5587` smtpd (alt :587 submission)
  All with `postscreen_upstream_proxy_protocol=haproxy` / `smtpd_upstream_proxy_protocol=haproxy`.
  Mirror stock submission/submissions options (SASL via Dovecot, TLS,
  client restrictions, mua_sender_restrictions). chroot=n so the SASL
  socket path `/dev/shm/sasl-auth.sock` resolves outside the chroot.
- `dovecot.cf` ConfigMap adds:
  ```
  haproxy_trusted_networks = 10.0.20.0/24
  service imap-login { inet_listener imaps_proxy { port=10993; ssl=yes; haproxy=yes } }
  ```
  Stock :993 stays PROXY-free for internal Roundcube/probe clients.
- Container ports: 4 new (4465, 5587, 10993, 2525 already there).
- `mailserver-proxy` NodePort Service now exposes all 4 ports:
  25→2525→30125, 465→4465→30126, 587→5587→30127, 993→10993→30128
  (ETP:Cluster).
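The three appended master.cf services take roughly this shape (a sketch; option lists are abridged and the SASL/TLS/restriction overrides are omitted — the authoritative version is `user-patches.sh` in the ConfigMap, and the `syslog_name` values are inferred from the maillog lines in the test plan below):

```
2525   inet  n  -  y  -  1  postscreen
  -o syslog_name=postfix/smtpd-proxy25
  -o postscreen_upstream_proxy_protocol=haproxy
4465   inet  n  -  n  -  -  smtpd
  -o syslog_name=postfix/smtpd-proxy465
  -o smtpd_tls_wrappermode=yes
  -o smtpd_upstream_proxy_protocol=haproxy
5587   inet  n  -  n  -  -  smtpd
  -o syslog_name=postfix/smtpd-proxy587
  -o smtpd_upstream_proxy_protocol=haproxy
```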

### pfSense (scripts/pfsense-haproxy-bootstrap.php)

Rebuilt to declare 4 backend pools (one per NodePort) and 4 production
frontends on `10.0.20.1:{25,465,587,993}` TCP mode, plus the legacy
`:2525` test frontend. All pools: `send-proxy-v2 check inter 120000`.
Idempotent — re-runs converge on declared state.

### pfSense (scripts/pfsense-nat-mailserver-haproxy-{flip,unflip}.php)

Flip script: updates `<nat><rule>` entries for mail ports from target
`<mailserver>` alias (10.0.20.202 MetalLB) → `10.0.20.1` (pfSense
HAProxy). Runs `filter_configure()` to rebuild pf rules. Unflip is the
rollback. Both scripts are idempotent.

## What is NOT in this change

- Phase 6 (decommission MetalLB LB path, downgrade mailserver Service
  from LoadBalancer to ClusterIP, free 10.0.20.202) — USER-GATED. Do
  NOT run until explicit approval.
- Legacy MetalLB `mailserver` LB still live on 10.0.20.202 with stock
  ETP:Local ports — functional backup path + consumed by internal
  clients that hit `mailserver.mailserver.svc.cluster.local` (routes
  via ClusterIP layer of the LB Service, bypassing ETP).
- Port :143 (plain IMAP) — no HAProxy frontend; stays on MetalLB via
  unchanged NAT rule.

## Test Plan

### Automated (verified pre-commit 2026-04-19)
```
# k8s container listens on all 8 ports
$ kubectl exec -c docker-mailserver deployment/mailserver -n mailserver \
    -- ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'
... all 8 listening ...

# pfSense HAProxy listens on all 5 (production + legacy test)
$ ssh admin@10.0.20.1 'sockstat -l | grep haproxy'
www  haproxy  49418  5   tcp4  *:25
www  haproxy  49418  6   tcp4  *:2525
www  haproxy  49418  10  tcp4  *:465
www  haproxy  49418  11  tcp4  *:587
www  haproxy  49418  12  tcp4  *:993

# Post-flip: pf rdr rules point at pfSense, not <mailserver>
$ ssh admin@10.0.20.1 'pfctl -sn' | grep 'smtp\|sub\|imap\|:25'
rdr on vtnet0 ... port = submission -> 10.0.20.1
rdr on vtnet0 ... port = imaps -> 10.0.20.1
rdr on vtnet0 ... port = smtps -> 10.0.20.1
rdr on vtnet0 ... port = 25 -> 10.0.20.1

# 4 HAProxy frontends reachable + SMTP/IMAP banners
$ python3 <test script> → SMTP/SMTPS/Sub/IMAPS all respond correctly

# Real client IP in maillog for external delivery via Brevo → MX
postfix/smtpd-proxy25/postscreen: CONNECT from [77.32.148.26]:36334 to [10.0.20.1]:25
postfix/smtpd-proxy25/postscreen: PASS NEW [77.32.148.26]:36334

# E2E probe (Brevo HTTP → external SMTP delivery → IMAP fetch) succeeds
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-yiu-flip -n mailserver
... Round-trip SUCCESS in 20.3s ...

# Internal Roundcube path unchanged
$ curl -sI https://mail.viktorbarzin.me/  →  302 (Authentik gate intact)

# No mail alerts firing
$ kubectl exec prometheus-server ... /api/v1/alerts | grep Email  →  (empty)
```

### Rollback
```
scp infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 'php /tmp/pfsense-nat-mailserver-haproxy-unflip.php'
```
Immediate (<2s). Flips all 4 NAT rdrs back to `<mailserver>` alias.
Pre-flip config snapshot also saved at
`/tmp/config.xml.pre-yiu-flip.20260419-1222` on pfSense.

## Phase roadmap (bd code-yiu)

| Phase | Status |
|---|---|
| 1a | ✅ commit ef75c02f — alt :2525 listener + NodePort |
| 2  | ✅ 2026-04-19 — HAProxy pkg installed on pfSense |
| 3  | ✅ commit ba697b02 — HAProxy config persisted in pfSense XML |
| 4+5| ✅ **this commit** — 4-port alt listeners + HAProxy frontends + NAT flip |
| 6  | ⏸ USER-GATED      — MetalLB LB decommission after 48h observation |
2026-04-19 12:24:50 +00:00
Viktor Barzin
702db75f84 [redis] Stabilise patch_redis_service trigger + document service naming
## Context

`null_resource.patch_redis_service` uses `triggers = { always = timestamp() }`,
so every `scripts/tg plan` on `stacks/redis` reports `1 to destroy, 1 to add`
even when nothing has changed. That noise buries real drift and trains us
to ignore redis-stack plans — which is exactly what you don't want
on a load-bearing patch.

The patch itself is still load-bearing (three consumers hard-code bare
`redis.redis.svc.cluster.local` — `stacks/immich/chart_values.tpl:12`,
`stacks/ytdlp/yt-highlights/app/main.py:136`, `config.tfvars:214` — plus
Bitnami's own sentinel scripts set `REDIS_SERVICE=redis.redis.svc.cluster.local`
and call it during pod startup). Removing the null_resource is a follow-up
(beads T0) once those consumers migrate to `redis-master.redis.svc`. For now
the goal is just: stop being noisy.

## This change

1. Replace the `always = timestamp()` trigger with two inputs that only change
   when re-patching is genuinely required:
   - `chart_version = helm_release.redis.version` — changes only on a Bitnami
     chart version bump, which is the one code path that rewrites the `redis`
     Service selector back to `component=node`.
   - `haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])`
     — changes only when HAProxy config is edited; aligned with the existing
     `checksum/config` annotation that rolls the Deployment on config change.

   Both attributes are known at plan time (verified against `hashicorp/helm`
   v3.1.1 provider binary). Rejected alternatives — `metadata[0].revision`
   (not exposed in the plugin-framework v3 rewrite), `sha256(jsonencode(values))`
   (readability unverified on v3), and `kubernetes_deployment.haproxy.id`
   (static `namespace/name`, never changes) — don't meet the bar.

2. Add a **Redis Service Naming** section to `AGENTS.md` that explicitly
   states the write/sentinel/avoid endpoints, so new consumers start from
   `redis-master.redis.svc` (the documented `var.redis_host`) and long-lived
   connections (PUBSUB, BLPOP, Sidekiq) route around HAProxy's `timeout
   client 30s` via the sentinel headless path. Uptime Kuma's Redis monitor
   already learned that lesson the hard way (memory id=748).
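Concretely, the trigger map ends up as (a sketch assembled from the two inputs above; the provisioner body is omitted):

```
resource "null_resource" "patch_redis_service" {
  triggers = {
    chart_version  = helm_release.redis.version
    haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])
  }
  # local-exec provisioner that re-patches the `redis` Service selector goes here
}
```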

## What is NOT in this change

- Deleting `null_resource.patch_redis_service` — still load-bearing (T0).
- Deleting `kubernetes_service.redis_master` — stays as the declared write API.
- Migrating any consumer off bare `redis.redis.svc` — T0 epic.
- Per-client sentinel migration — T1 epic.
- Retiring HAProxy — T2 epic (blocked on T1 + T3).

## Before / after

Before (steady state):
```
scripts/tg plan
Plan: 1 to add, 2 to change, 1 to destroy.
#   null_resource.patch_redis_service must be replaced
#     triggers = { "always" = "<timestamp>" } -> (known after apply)
```

After (steady state, post-apply):
```
scripts/tg plan
No changes. Your infrastructure matches the configuration.
```

After (chart version bump):
```
scripts/tg plan
#   null_resource.patch_redis_service must be replaced
#     triggers = { "chart_version" = "25.3.2" -> "25.4.0" }
```
— the trigger fires only when it actually needs to.

## Test Plan

### Automated

`scripts/tg plan` pre-change (confirms baseline noise):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
    ~ triggers = { # forces replacement
        ~ "always" = "2026-04-19T10:39:40Z" -> (known after apply)
      }
  }
Plan: 1 to add, 2 to change, 1 to destroy.
```

`scripts/tg plan` post-edit (confirms the one-time structural replacement):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
    ~ triggers = { # forces replacement
        - "always"         = "2026-04-19T10:39:40Z" -> null
        + "chart_version"  = "25.3.2"
        + "haproxy_config" = "989bca9483cb9f9942017320765ec0751ac8357ff447acc5ed11f0a14b609775"
      }
  }
```

Apply is deferred to the operator — the working tree on the same file also
contains an unrelated HAProxy DNS-resolvers fix (for today's immich outage)
that needs its own review before rolling out together. No `scripts/tg apply`
run from this session.

### Manual Verification

Reproduce locally:
1. `cd infra/stacks/redis && ../../scripts/tg plan`
2. Before apply: expect `null_resource.patch_redis_service` to be replaced
   exactly once, with the trigger map transitioning from `{always = <ts>}`
   to `{chart_version, haproxy_config}`.
3. After apply: `../../scripts/tg plan` twice in a row must both report
   `No changes.` (excluding unrelated drift from other work-in-progress).
4. Cluster-side invariant (must hold pre- and post-apply):
   `kubectl -n redis get svc redis -o jsonpath='{.spec.selector}'`
   → `{"app":"redis-haproxy"}`
   `kubectl -n redis get svc redis-master -o jsonpath='{.spec.selector}'`
   → `{"app":"redis-haproxy"}`
5. Regression test for the trigger doing its job: bump `helm_release.redis.version`
   in a branch, `tg plan`, expect the null_resource to replace. Revert.
2026-04-19 12:17:52 +00:00
Viktor Barzin
ba697b02a2 [mailserver] Phase 2-3 — pfSense HAProxy bootstrap + runbook [ci skip]
## Context (bd code-yiu)

Phase 2 (HAProxy on pfSense) and Phase 3 (persist config in pfSense XML so
it lives in the nightly backup) of the PROXY-v2 migration. Test path only —
listens on pfSense 10.0.20.1:2525 → k8s node NodePort :30125 → pod :2525
postscreen. Real client IP verified in maillog
(`postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:...`), Phase 1a
container plumbing is already live (commit ef75c02f).

pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is captured daily by
`scripts/daily-backup.sh` (scp → `/mnt/backup/pfsense/config-YYYYMMDD.xml`)
and synced offsite to Synology. No new backup wiring needed — this commit
documents the fact + adds the reproducer script.

## This change

Two files, both additive:

1. `scripts/pfsense-haproxy-bootstrap.php` — idempotent PHP script that
   edits pfSense config.xml to add:
   - Backend pool `mailserver_nodes` with 4 k8s workers on NodePort 30125,
     `send-proxy-v2`, TCP health-check every 120000 ms (2 min).
   - Frontend `mailserver_proxy_test` listening on pfSense 10.0.20.1:2525
     in TCP mode, forwarding to the pool.
   Uses `haproxy_check_and_run()` to regenerate `/var/etc/haproxy/haproxy.cfg`
   and reload HAProxy. Removes existing items with the same name before
   adding, so repeat runs converge on declared state.

2. `docs/runbooks/mailserver-pfsense-haproxy.md` — ops runbook covering
   current state, validation, bootstrap/restore, health checks, phase
   roadmap, and known warts (health-check noise + bind-address templating).

## What is NOT in this change

- Phase 4 (NAT rdr flip for :25 from `<mailserver>` → HAProxy) — deferred.
- Phase 5 (extend to 465/587/993 with alt listeners + Dovecot dual-
  inet_listener) — deferred.
- Terraform for pfSense HAProxy pkg install — not possible (no Terraform
  provider for pfSense pkg management). Runbook documents the manual
  `pkg install` command.

## Test Plan

### Automated
```
$ ssh admin@10.0.20.1 'pgrep -lf haproxy; sockstat -l | grep :2525'
64009 /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D
www  haproxy  64009 5 tcp4  *:2525  *:*

$ ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
    | awk 'NR>1 {print $4, $6}'
node1 2
node2 2
node3 2
node4 2        # all UP

$ python3 -c "
import socket; s=socket.socket(); s.settimeout(10)
s.connect(('10.0.20.1', 2525))
print(s.recv(200).decode())
s.send(b'EHLO persist-test.example.com\r\n')
print(s.recv(500).decode())
s.send(b'QUIT\r\n'); s.close()"
220-mail.viktorbarzin.me ESMTP
...
250-mail.viktorbarzin.me
250-SIZE 209715200
...
221 2.0.0 Bye

$ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver --tail=50 \
    | grep smtpd-proxy.*CONNECT
postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525
```

Real client IP `[10.0.10.10]` visible (not the k8s-node IP after kube-proxy
SNAT) → PROXY-v2 roundtrip confirmed.
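For reference, a PROXY v2 preamble like the one pfSense HAProxy injects can be built by hand to probe the NodePort directly (a minimal sketch per the PROXY protocol v2 wire format; the IPs and ports below are illustrative):

```python
import socket
import struct

SIG = b"\r\n\r\n\x00\r\nQUIT\n"  # fixed 12-byte PROXY v2 signature

def proxy_v2_header(src_ip, dst_ip, src_port, dst_port):
    # ver/cmd 0x21 = version 2, PROXY command; fam/proto 0x11 = TCP over IPv4
    addrs = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
             + struct.pack(">HH", src_port, dst_port))
    return SIG + bytes([0x21, 0x11]) + struct.pack(">H", len(addrs)) + addrs

hdr = proxy_v2_header("203.0.113.9", "10.0.20.101", 40000, 2525)
print(len(hdr))  # 28: 16-byte fixed header + 12-byte IPv4 address block
```

Sending `hdr` ahead of the SMTP dialogue should make postscreen log `CONNECT from [203.0.113.9]` rather than the SNAT'd node IP.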

### Manual Verification
Trigger a pfSense reboot; after boot, HAProxy should auto-restart from the
now-persisted config (`<enable>yes</enable>` in XML). Connection test above
should still work.

## Reproduce locally
1. `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/`
2. `ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'` → rc=OK
3. `python3 -c '...' ` SMTP roundtrip test above.
2026-04-19 12:07:47 +00:00
Viktor Barzin
602103ede1 [owntracks] Strip face avatar from hook payload + drop orphan PVC
Bundles two small follow-ups to the live bridge + port-fix work:

## Face avatar fix (dawarich-hook.lua)

After the Recorder ran in production for a while it began enriching
publish payloads with a `face` field — the base64-encoded user avatar
uploaded via the Recorder's web UI (~120 KB). Our Lua hook builds a
curl command that embeds the JSON payload as `-d '<payload>'`, which
hit `E2BIG` / `Argument list too long` (os.execute reason=code=7) on
Linux's `execve` argv limit (~128 KB). Every live POST stopped making
it to Dawarich, even though the HTTP POST from the phone to Owntracks
still returned 200 and the .rec write still happened.

Fix: `data.face = nil` before serializing. Dawarich doesn't use it
anyway (not persisted into any column — `raw_data` stored without it).
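The size arithmetic is easy to sanity-check (a sketch: the ~120 KB avatar figure is from this commit; the ~128 KiB per-argument cap is Linux's MAX_ARG_STRLEN):

```python
import json

payload = {"_type": "location", "lat": 51.5074, "lon": -0.1278,
           "face": "A" * 120_000}   # base64 avatar inflates a single curl -d argument

arg = json.dumps(payload)
print(len(arg) > 100_000)        # one execve argument flirting with the 131072-byte cap

payload.pop("face")              # the fix: drop it before serializing
print(len(json.dumps(payload)))  # back to a few dozen bytes
```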

Also upgraded the debug log: on failure we now emit
`dawarich-bridge: FAIL tst=... reason=... code=... cmd=...` so any
future variant of this problem (next big field surfaced upstream, etc.)
is one log tail away from a diagnosis.

```
$ kubectl -n owntracks logs deploy/owntracks --tail=5 | grep dawarich-bridge
+ dawarich-bridge: init
+ dawarich-bridge: ok tst=1776600238
```

## Orphan PVC removal (main.tf)

`owntracks-data-proxmox` (1 Gi, proxmox-lvm, unencrypted) was a leftover
from the encrypted-migration attempt; the Deployment has been mounting
`owntracks-data-encrypted` the whole time. Verified `Used By: <none>`
on the live PVC before removal. Removing the resource from Terraform
destroys the PVC — harmless, no data loss.

## Test Plan

### Automated

```
$ ../../scripts/tg plan
Plan: 0 to add, 1 to change, 1 to destroy.

$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 1 destroyed.

$ kubectl -n owntracks get pvc
NAME                       STATUS   VOLUME ...
owntracks-data-encrypted   Bound    ...
(owntracks-data-proxmox gone)
```

### Manual Verification

```
$ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
$ TST=$(date +%s)
$ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \
    curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \
    -H 'Content-Type: application/json' \
    -H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \
    -d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \
    https://owntracks.viktorbarzin.me/pub
HTTP 200

$ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \
    psql -U postgres -d dawarich -tAc \
    "SELECT ST_AsText(lonlat::geometry) FROM points WHERE user_id=1 AND timestamp=$TST"
POINT(-0.1278 51.5074)
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:05:18 +00:00
Viktor Barzin
ef75c02f0d [mailserver] Phase 1a — alt :2525 postscreen listener + NodePort [ci skip]
## Context (bd code-yiu)

Toward replacing MetalLB ETP:Local + pod-speaker colocation with pfSense
HAProxy injecting PROXY v2 → mailserver. This commit lays the k8s-side
groundwork for port 25 only. External SMTP flow post-cutover:

  Client → pfSense WAN:25 → pfSense HAProxy (injects PROXY v2) → k8s-node:30125
  (NodePort for mailserver-proxy Service, ETP:Cluster) → kube-proxy → pod :2525
  (postscreen with postscreen_upstream_proxy_protocol=haproxy) → real client IP
  recovered from PROXY header despite kube-proxy SNAT.

Internal clients (Roundcube, email-roundtrip-monitor) keep using the stock
:25 on mailserver.svc ClusterIP — no PROXY required, zero regression.

## This change

- New `kubernetes_config_map.mailserver_user_patches` with a
  `user-patches.sh` script. docker-mailserver runs
  `/tmp/docker-mailserver/user-patches.sh` on startup; our script appends a
  `2525 postscreen` entry to `master.cf` with
  `-o postscreen_upstream_proxy_protocol=haproxy` and a 5s PROXY timeout.
  Sentinel-guarded for idempotency on in-place restart.
- New volume + volume_mount (`mode = 0755` via defaultMode) wires the
  ConfigMap into the mailserver container.
- New container port spec for 2525 (informational; kube-proxy resolves
  targetPort by number anyway).
- New Service `mailserver-proxy` — NodePort type, ETP:Cluster, selector
  `app=mailserver`, port 25 → targetPort 2525 → fixed nodePort 30125.
  pfSense HAProxy's backend pool will be `<all k8s node IPs>:30125 check
  send-proxy-v2`.
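In plain Kubernetes YAML the new Service is equivalent to the following (reconstructed from this commit's description; the real resource is Terraform, and the port name is made up):

```
apiVersion: v1
kind: Service
metadata:
  name: mailserver-proxy
  namespace: mailserver
spec:
  type: NodePort
  externalTrafficPolicy: Cluster
  selector:
    app: mailserver
  ports:
    - name: smtp-proxy
      port: 25
      targetPort: 2525
      nodePort: 30125
```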

The existing `mailserver` LoadBalancer Service (ETP:Local, 10.0.20.202,
ports 25/465/587/993) is untouched. Traffic still flows through it via the
pfSense NAT `<mailserver>` alias; this commit does not change routing.
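The sentinel guard in `user-patches.sh` amounts to an append-once pattern (a sketch: the real script lives in the ConfigMap and targets `/etc/postfix/master.cf`; a demo path is used here so the snippet runs anywhere):

```shell
MASTER_CF="${MASTER_CF:-/tmp/master.cf.demo}"
touch "$MASTER_CF"
# Sentinel: only append when the 2525 service isn't already present, so an
# in-place container restart doesn't stack duplicate entries.
if ! grep -q '^2525 ' "$MASTER_CF"; then
  cat >>"$MASTER_CF" <<'EOF'
2525 inet n - y - 1 postscreen
  -o syslog_name=postfix/smtpd-proxy
  -o postscreen_upstream_proxy_protocol=haproxy
  -o postscreen_upstream_proxy_timeout=5s
EOF
fi
```

Re-running it leaves the file unchanged, which is what makes the ConfigMap safe across restarts.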

## What is NOT in this change

- pfSense HAProxy install/config (Phase 2 — out-of-Terraform, runbook-managed)
- pfSense NAT rdr flip from `<mailserver>` → HAProxy VIP (Phase 4)
- 465/587/993 — scoped to port 25 first for proof of concept. Other ports
  get the same treatment (alt listeners 4465/5587/10993 + Service ports)
  once 25 is proven.
- Dovecot per-listener `haproxy = yes` — irrelevant until IMAP is migrated.

## Test Plan

### Automated (verified pre-commit)
```
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out

$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    postconf -M | grep '^2525'
2525   inet  n  -  y  -  1  postscreen \
  -o syslog_name=postfix/smtpd-proxy \
  -o postscreen_upstream_proxy_protocol=haproxy \
  -o postscreen_upstream_proxy_timeout=5s

$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    ss -ltn | grep -E ':25\b|:2525'
LISTEN 0 100 0.0.0.0:2525  0.0.0.0:*
LISTEN 0 100 0.0.0.0:25    0.0.0.0:*

$ kubectl get svc -n mailserver mailserver-proxy
NAME               TYPE       CLUSTER-IP      PORT(S)        AGE
mailserver-proxy   NodePort   10.98.213.164   25:30125/TCP   93s

# Expected-to-fail probe (no PROXY header) → postscreen rejects
$ timeout 8 nc -v 10.0.20.101 30125 </dev/null
Connection to 10.0.20.101 30125 port [tcp/*] succeeded!
421 4.3.2 No system resources
```

### Manual Verification (after Phase 2 — pfSense HAProxy)
Once HAProxy on pfSense is configured to listen on alt port :2525 (not the
real :25 yet) and targets `k8s-nodes:30125` with `send-proxy-v2`:
1. From an external host: `swaks --to smoke-test@viktorbarzin.me
   --server <pfsense-ip>:2525 --body "phase 1 test"`
2. In mailserver logs: `kubectl logs -c docker-mailserver deployment/mailserver
   | grep postfix/smtpd-proxy` — "connect from [<external-ip>]" with the real
   public IP, NOT the k8s node IP.
3. E2E probe CronJob keeps green (uses ClusterIP path, unaffected).

## Reproduce locally
1. `kubectl get svc mailserver-proxy -n mailserver` → NodePort 30125 exists
2. `kubectl get cm mailserver-user-patches -n mailserver` → exists
3. `timeout 8 nc -v <k8s-node>:30125 </dev/null` → "421 4.3.2 No system resources"
   (postscreen rejecting malformed PROXY)
2026-04-19 11:52:49 +00:00
Viktor Barzin
b60e34032c [authentik] Phase 1 hardening — 3 replicas, PgBouncer PDB/probes, perf env
## Context

Following the 2026-04-18 /dev/shm ENOSPC P0 and a 5-subagent research pass,
this is Phase 1 of the authentik reliability + performance hardening epic
(beads code-cwj). Scope: everything that is safe, additive, and does not
require DB restart, architectural migration, or the 43-service auth path
to go through a risky validation window.

Five research findings drove the deltas:

1. **Server/worker at 2 replicas** conflicts with the documented convention
   "critical path services scaled to 3" in .claude/CLAUDE.md (Traefik,
   Authentik, CrowdSec LAPI, PgBouncer, Cloudflared). PDB minAvailable was
   still 1 — a single-pod outage could take auth down.
2. **PgBouncer had no resource requests/limits** — silently capped at the
   Kyverno tier-defaults LimitRange (256Mi), no PDB, no probes. Pool
   failures undetected until connection timeouts.
3. **Authentik 2026.2 has no Redis** (the cache moved to Postgres in
   2025.10). Persistent Django connections + longer flow/policy cache TTLs
   are the two knobs that move the needle most without DB tuning. Both are
   safe because PgBouncer runs in session mode.
4. **Gunicorn defaults** (2 workers × 4 threads on server, 1 process × 2
   threads on worker) don't use the pod's 1.5 Gi headroom. Each worker
   preloads Django at ~500 MiB — bumping to 3 workers needs a memory bump
   to 2 Gi first.
5. **AUTHENTIK_WORKER__CONCURRENCY was renamed AUTHENTIK_WORKER__THREADS**
   in 2025.8 — the old name is aliased but the canonical config key changed.

## This change

### values.yaml
- server.replicas 2 → 3 (PDB minAvailable 1 → 2)
- worker.replicas 2 → 3
- server/worker limits.memory 1.5 Gi → 2 Gi (headroom for gunicorn workers)
- authentik.postgresql.conn_max_age = 60 (persistent connections; safe
  with pgbouncer session mode, conn_max_age < server_idle_timeout=600s)
- authentik.postgresql.conn_health_checks = true
- authentik.cache.timeout_flows = 1800 (30 min; was 300)
- authentik.cache.timeout_policies = 900 (15 min; was 300)
- authentik.web.workers = 3, threads = 4
- authentik.worker.threads = 4 (was 2)
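As a single fragment the deltas look roughly like this (key layout assumed from the goauthentik helm chart's `server`/`worker`/`authentik` sections; the values are the ones listed above):

```
server:
  replicas: 3
  resources:
    limits:
      memory: 2Gi
worker:
  replicas: 3
  resources:
    limits:
      memory: 2Gi
authentik:
  postgresql:
    conn_max_age: 60
    conn_health_checks: true
  cache:
    timeout_flows: 1800
    timeout_policies: 900
  web:
    workers: 3
    threads: 4
  worker:
    threads: 4
```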

### pgbouncer.tf
- container resources: requests cpu=50m/mem=128Mi, limits mem=512Mi
  (observed live usage is 1-3 m CPU, 2-4 MiB RSS — huge headroom,
  safely above Kyverno 256Mi tier-default cap)
- readiness probe: TCP :6432, 10s period
- liveness probe: TCP :6432, 30s period, 30s delay
- kubernetes_pod_disruption_budget_v1.pgbouncer: minAvailable=2
  (3 replicas; single drain rolls cleanly, two-node simultaneous
  outage correctly blocked)

## What is NOT in this change (deferred as Phase 2 follow-ups)

- Codify outpost /dev/shm patch in Terraform (currently applied via
  Authentik API, not in code). Needs authentik_outpost resource.
- Migrate embedded outpost → dedicated outpost Deployment with 2
  replicas + sticky sessions. Only HA path per GH issue #18098; requires
  flow design because outpost sessions are in-process memory only.
- PG max_connections 100 → 200 + shared_buffers 512MB → 768MB + CNPG
  pod memory 2Gi → 3Gi. Needs coordinated DB restart.
- Enable pg_stat_statements on CNPG cluster for Authentik DB
  observability (currently shared_preload_libraries is empty).
- PgBouncer pool_mode session → transaction + django_channels layer
  split. Needs atomic change + psycopg3 prepared-statement support.
- authentik_tasks_tasklog 7-day retention (198k rows, unbounded).
- Traefik forward-auth plugin caching via
  xabinapal/traefik-authentik-forward-plugin.
- Grafana dashboard 14837 import + recording rule for
  authentik_flow_execution_duration (reported broken: values in ns
  while default buckets are seconds — upstream discussion #7156).

## Test plan

### Automated

    $ cd stacks/authentik && ../../scripts/tg plan
    Plan: 1 to add, 3 to change, 0 to destroy.

    $ ../../scripts/tg apply --non-interactive
    module.authentik.kubernetes_pod_disruption_budget_v1.pgbouncer: Creation complete after 0s
    module.authentik.kubernetes_deployment.pgbouncer: Modifications complete after 45s
    module.authentik.helm_release.authentik: Modifications complete after 2m47s
    Apply complete! Resources: 1 added, 3 changed, 0 destroyed.

### Manual Verification

1. **Pod topology and PDBs**:

        $ kubectl -n authentik get pods,pdb
        pod/goauthentik-server-5fc69b6cc6-ctvkp   1/1   Running   0   3m14s   k8s-node2
        pod/goauthentik-server-5fc69b6cc6-fkn8x   1/1   Running   0   3m45s   k8s-node3
        pod/goauthentik-server-5fc69b6cc6-jtjjd   1/1   Running   0   5m6s    k8s-node1
        pod/goauthentik-worker-5cfb7dc9bf-b2rlr   1/1   Running   0   3m44s   k8s-node2
        pod/goauthentik-worker-5cfb7dc9bf-fkfm4   1/1   Running   0   5m6s    k8s-node1
        pod/goauthentik-worker-5cfb7dc9bf-hxdg6   1/1   Running   0   3m3s    k8s-node4
        pod/pgbouncer-64746f955f-st567            1/1   Running   0   4m58s   k8s-node4
        pod/pgbouncer-64746f955f-xss9c            1/1   Running   0   5m11s   k8s-node2
        pod/pgbouncer-64746f955f-zvfkw            1/1   Running   0   4m45s   k8s-node3
        poddisruptionbudget/goauthentik-server    2     N/A   1
        poddisruptionbudget/goauthentik-worker    N/A   1     1
        poddisruptionbudget/pgbouncer             2     N/A   1

   All three workloads spread across 3+ nodes, PDBs allow 1 disruption.

2. **Authentik server health**:

        $ curl -sS -o /dev/null -w "%{http_code}\n" \
            https://authentik.viktorbarzin.me/-/health/ready/
        200

3. **Forward-auth redirect on protected service**:

        $ curl -sS -o /dev/null -w "%{http_code}\n" -L \
            https://wealthfolio.viktorbarzin.me/
        200

4. **Outpost /dev/shm still within sizeLimit** (patches from the
   2026-04-18 post-mortem were not regressed):

        $ kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost \
            -c proxy -- df -h /dev/shm
        tmpfs   2.0G  58M  2.0G  3%  /dev/shm

5. **PgBouncer port reachable from other pods**:

        $ kubectl -n authentik exec deploy/pgbouncer -- nc -zv 127.0.0.1 6432
        127.0.0.1 (127.0.0.1:6432) open

## Reproduce locally

1. `cd stacks/authentik && ../../scripts/tg plan` — expect 0/0/0 (No changes).
2. `kubectl -n authentik get pdb pgbouncer` — expect MIN AVAILABLE 2.
3. `kubectl -n authentik get deploy goauthentik-server -o jsonpath='{.spec.replicas}'` — expect 3.

Closes: code-cwj
2026-04-19 11:52:41 +00:00
Viktor Barzin
789cb61310 [servarr] Rewrite MAM ratio farming — break Mouse death spiral, adopt in TF
## Context

A MAM (MyAnonamouse) freeleech farming workflow was deployed on 2026-04-14
via kubectl apply (outside Terraform). Five days later the account was
still stuck in Mouse class: 715 MiB downloaded, 0 uploaded, ratio 0.
Tracker responses on 7 of 9 active torrents returned
`status=4 | msg="User currently mouse rank, you need to get your ratio up!"`
— MAM was actively refusing to serve peer lists because the account was
in Mouse class, and refusing to serve peer lists made the ratio impossible
to recover. Meanwhile the grabber kept digging: 501 torrents sat in
qBittorrent, 0 completed, 0 bytes uploaded.

Root causes (ranked):
1. Death spiral — Mouse class blocks announces, nothing uploads.
2. BP-spender 30 000 BP threshold blocked the only exit even though the
   account already had 24 500 BP.
3. Grabber selection (`score = 1.0 / (seeders+1)`) preferred low-demand
   torrents filtered to <100 MiB — ratio-hostile by design.
4. Grabber/cleanup deadlock: cleanup only fired on seed_time > 3d, so
   torrents that never started never qualified. Combined with the 500-
   torrent cap this stalled the grabber indefinitely.
5. qBittorrent queueing amplified (4) — 495/501 stuck in queuedDL.
6. Ratio-monitor labelled queued torrents `unknown` (empty tracker
   field), hiding the problem on the MAM Grafana panel.
7. qBittorrent memory limit (256 Mi LimitRange default) too low.
8. All of the above was Terraform drift with no reviewability.

## This change

Introduces `stacks/servarr/mam-farming/` — a new TF module that adopts
the three kubectl-applied resources and replaces their scripts with
demand-first, H&R-aware logic. Also bumps qBittorrent resources, fixes
ratio-monitor labelling, and adds five Prometheus alerts plus a Grafana
panel row.

### Architecture

    MAM API ───┬─── jsonLoad.php (profile: ratio, class, BP)
               ├─── loadSearchJSONbasic.php (freeleech search)
               ├─── bonusBuy.php (50 GiB min tier for API)
               └─── download.php (torrent file)
                               │
    Pushgateway <──┬────────────┤
                   │  mam_ratio            ┌────────────────────┐
                   │  mam_class_code       │ freeleech-grabber  │ */30
                   │  mam_bp_balance   ◄───│  (ratio-guarded)   │
                   │  mam_farming_*        └──────────┬─────────┘
                   │  mam_janitor_*                   │ adds to
                   │                                  ▼
                   │  Grafana panels      qBittorrent (mam-farming)
                   │  + 5 alerts                      ▲
                   │                                  │ deletes by rule
                   │                       ┌──────────┴─────────┐
                   │                   ◄───│ farming-janitor    │ */15
                   │                       │  (H&R-aware)       │
                   │                       └──────────┬─────────┘
                   │                                  │ buys credit
                   │                       ┌──────────┴─────────┐
                   └───────────────────────│ bp-spender         │ 0 */6
                                           │  (tier-aware)      │
                                           └────────────────────┘

### Key decisions

- **Ratio guard on grabber** — refuse to grab if ratio < 1.2 OR class ==
  Mouse. Prevents the death spiral from deepening. Emits
  `mam_grabber_skipped_reason{reason=...}` and exits clean.
- **Demand-first selection** — new score formula
  `leechers*3 - seeders*0.5 + 200 if freeleech_wedge else 0`; size band
  50 MiB – 1 GiB; leecher floor 1; seeder ceiling 50. Picks titles that
  will actually upload.
- **Janitor decoupled from grabber** — runs every 15 min regardless of
  the ratio-guard state. Without this, stuck torrents accumulate
  fastest exactly when the grabber is skipping (Mouse class). H&R-aware:
  never deletes `progress==1.0 AND seeding_time < 72h`. Six delete
  reasons observable via `mam_janitor_deleted_per_run{reason=...}`.
- **BP-spender tier-aware** — MAM imposes a hard 50 GiB minimum on API
  buyers ("Automated spenders are limited to buying at least 50 GB...
  due to log spam"). Valid API tiers: 50/100/200/500 GiB at 500 BP/GiB.
  The spender picks the smallest tier that satisfies the ratio deficit
  AND fits the budget, preserving a 500 BP reserve. If even the 50 GiB
  tier is too expensive, it skips and retries on the next 6-hour cron.
- **Authoritative metrics use MAM profile fields** —
  `downloaded_bytes` / `uploaded_bytes` (integers) rather than the
  pretty-printed `downloaded` / `uploaded` strings like "715.55 MiB"
  that MAM also returns.
- **Ratio-monitor category-first labelling** — `tracker` is empty for
  queued torrents that never announced. Now maps `category==mam-farming`
  to label `mam` first, only falls back to tracker-URL parsing when
  category is absent. Stops hundreds of MAM torrents collecting under
  `unknown`.
- **qBittorrent resources bumped** to `requests=512Mi / limits=1Gi` so
  hundreds of active torrents don't OOM.
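The three scripts themselves are not reproduced here, but the guard rules above are compact enough to sketch. A minimal Python sketch — thresholds are the ones stated in this commit; function names and data shapes are illustrative, not the real `mam-farming/files/` code:

```python
from typing import Optional
from urllib.parse import urlparse

RATIO_FLOOR = 1.2
API_TIERS_GIB = [50, 100, 200, 500]  # MAM's minimum API buy is 50 GiB
BP_PER_GIB = 500
BP_RESERVE = 500

def grabber_should_skip(ratio: float, user_class: str) -> bool:
    """Ratio guard: never deepen the death spiral."""
    return ratio < RATIO_FLOOR or user_class == "Mouse"

def score(leechers: int, seeders: int, freeleech_wedge: bool) -> float:
    """Demand-first selection score from this commit."""
    return leechers * 3 - seeders * 0.5 + (200 if freeleech_wedge else 0)

def janitor_protected(progress: float, seeding_time_h: float) -> bool:
    """H&R guard: a completed torrent must seed 72h before deletion."""
    return progress == 1.0 and seeding_time_h < 72

def pick_tier(deficit_gib: float, bp_balance: int) -> int:
    """Smallest API tier covering the deficit that fits the budget,
    preserving the reserve; 0 means skip and retry next cron."""
    budget_gib = (bp_balance - BP_RESERVE) // BP_PER_GIB
    for tier in API_TIERS_GIB:
        if tier >= deficit_gib and tier <= budget_gib:
            return tier
    return 0

def tracker_label(category: Optional[str], tracker_url: str) -> str:
    """Category-first labelling: queued torrents have an empty tracker
    field, but the category is always set. Hostname matching below is
    an illustrative fallback, not the real parser."""
    if category == "mam-farming":
        return "mam"
    host = urlparse(tracker_url).hostname or ""
    return host or "unknown"
```

The `pick_tier` behaviour matches the live bp-spender log in the test plan: with 24 551 BP the affordable budget is 48 GiB, below the 50 GiB API floor, so it buys nothing.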

### Emergency recovery performed this session

1. Adopted 5 in-cluster resources via root-module `import {}` blocks
   (Terraform 1.5+ rejects imports inside child modules).
2. Ran the janitor in DRY_RUN=1 to verify rules against live state —
   466 `never_started` candidates, 0 false positives in any other
   reason bucket. Flipped to enforce mode.
3. Janitor deleted 466 stuck torrents (matches plan's ~495 target; 35
   preserved as active/in-progress).
4. Truncated `/data/grabbed_ids.txt` so newly-popular titles become
   eligible again.

The ratio is still 0 because the API cannot buy below 50 GiB and the
account sits at 24 551 BP (needs 25 000). Manual 1 GiB purchase via the
MAM web UI — 500 BP — would immediately lift the account to ratio ≈ 1.4
and unblock announces. Future automation cannot do this for us due to
MAM's anti-spam rule.

### What is NOT in this change

- qBittorrent prefs reconciliation (max_active_downloads=20,
  max_active_uploads=150, max_active_torrents=150). The plan wanted
  this; deferred to a follow-up because the janitor + ratio recovery
  handles the 500-torrent backlog first. A small reconciler CronJob
  posting to /api/v2/app/setPreferences is the intended follow-up.
- VIP purchase (~100 k BP) — deferred until BP accumulates.
- Cross-seed / autobrr — separate initiative.

## Alerts added

- P1 MAMMouseClass — `mam_class_code == 0` for 1h
- P1 MAMCookieExpired — `mam_farming_cookie_expired > 0`
- P2 MAMRatioBelowOne — `mam_ratio < 1.0` for 24h (replaces old
  QBittorrentMAMRatioLow, now driven by authoritative profile metric)
- P2 MAMFarmingStuck — no grabs in 4h while ratio is healthy
- P2 MAMJanitorStuckBacklog — `skipped_active > 400` for 6h

## Test plan

### Automated

    $ cd infra/stacks/servarr && ../../scripts/tg plan 2>&1 | grep Plan
    Plan: 5 to import, 2 to add, 6 to change, 0 to destroy.

    $ ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 5 imported, 2 added, 6 changed, 0 destroyed.

    # Re-plan after import block removal (idempotent)
    $ ../../scripts/tg plan 2>&1 | grep Plan
    Plan: 0 to add, 1 to change, 0 to destroy.
    # The 1 change is a pre-existing MetalLB annotation drift on the
    # qbittorrent-torrenting Service — unrelated to this change.

    $ cd ../monitoring && ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

    # Python + JSON syntax
    $ python3 -c 'import ast; [ast.parse(open(p).read()) for p in [
        "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py",
        "infra/stacks/servarr/mam-farming/files/bp-spender.py",
        "infra/stacks/servarr/mam-farming/files/mam-farming-janitor.py"]]'
    $ python3 -c 'import json; json.load(open(
        "infra/stacks/monitoring/modules/monitoring/dashboards/qbittorrent.json"))'

### Manual Verification

1. Grabber ratio-guard path:

       $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
       $ kubectl -n servarr logs job/g1
       Skip grab: ratio=0.0 class=Mouse (floor=1.2) reason=mouse_class

2. BP-spender tier path:

       $ kubectl -n servarr create job --from=cronjob/mam-bp-spender s1
       $ kubectl -n servarr logs job/s1
       Profile: ratio=0.0 class=Mouse DL=0.70 GiB UL=0.00 GiB BP=24551
         | deficit=1.40 GiB needed=3 affordable=48 buy=0
       Done: BP=24551, spent=0 GiB (needed=3, affordable=48)

   Correctly skips because affordable (48) < smallest API tier (50).

3. Janitor in enforce mode:

       $ kubectl -n servarr create job --from=cronjob/mam-farming-janitor j1
       $ kubectl -n servarr logs job/j1 | tail -3
       Done: deleted=466 preserved_hnr=0 skipped_active=35 dry_run=False
         per reason: {'never_started': 466, ...}

   Second run immediately after: `deleted=0 skipped_active=35` —
   steady state with only active/seeding torrents left.

4. Alerts loaded:

       $ kubectl -n monitoring get cm prometheus-server \
           -o jsonpath='{.data.alerting_rules\.yml}' \
           | grep -E "alert: MAM|alert: QBittorrent"
         - alert: MAMMouseClass
         - alert: MAMCookieExpired
         - alert: MAMRatioBelowOne
         - alert: MAMFarmingStuck
         - alert: MAMJanitorStuckBacklog
         - alert: QBittorrentDisconnected
         - alert: QBittorrentMAMUnsatisfied

5. Dashboard: browse to Grafana "qBittorrent - Seeding & Ratio" → new
   "MAM Profile (from jsonLoad.php)" row at the bottom shows class, BP
   balance, profile ratio, transfer, BP-vs-reserve timeseries, janitor
   deletion stacked chart, janitor state stat, grabber state stat.

## Reproduce locally

1. `cd infra/stacks/servarr && ../../scripts/tg plan` — expect
   0 add / 1 change (unrelated MetalLB annotation drift).
2. `kubectl -n servarr get cronjobs` — expect three:
   mam-freeleech-grabber, mam-bp-spender, mam-farming-janitor.
3. Trigger each via `kubectl create job --from=cronjob/<name> <job>`
   and read logs; outputs match the manual-verification snippets above.

Closes: code-qfs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:45:38 +00:00
Viktor Barzin
5ea0aa70e3 [claude-agent-service] Bump image_tag to 2fd7670d (45m /execute timeout)
## Context
Ships the monorepo commit
(code@2fd7670d [claude-agent-service] Raise /execute default timeout
from 15m to 45m) that raises ExecuteRequest.timeout_seconds from 900 to
2700. The auto-upgrade pipeline (DIUN → n8n → claude-agent-service →
service-upgrade agent) had been silently timing out mid-run for 3 days:
139 × 202 Accepted + 6 × TimeoutError in the last 24h, zero commits to
infra, zero Slack posts. Root cause was the 15-minute cap truncating
CAUTION-class upgrades that need to summarise multi-release changelogs,
poll Woodpecker CI, and wait on on-demand DB backup CronJobs.

## What changed
`local.image_tag` 0c24c9b6 → 2fd7670d. Image built + pushed to
registry.viktorbarzin.me/claude-agent-service:2fd7670d. Deployment is
`Recreate`, so the single pod is dropped + recreated.

## Test Plan
### Automated
`terraform plan` — `Plan: 0 to add, 1 to change, 0 to destroy` (3
container image refs flip from 0c24c9b6 → 2fd7670d).
`terraform apply` — `Apply complete! Resources: 0 added, 1 changed,
0 destroyed.`

### Manual Verification
```
$ kubectl -n claude-agent rollout status deploy/claude-agent-service --timeout=120s
deployment "claude-agent-service" successfully rolled out

$ kubectl -n claude-agent get deploy claude-agent-service \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
registry.viktorbarzin.me/claude-agent-service:2fd7670d

$ kubectl -n claude-agent exec deploy/claude-agent-service -- \
    sh -c 'cd /srv && python3 -c "from app.main import ExecuteRequest; \
    print(ExecuteRequest(prompt=\"p\", agent=\"a\").timeout_seconds)"'
2700
```

Next DIUN cycle (every 6h) should land ≥1 unattended upgrade as an
infra commit + Slack message without TimeoutError in the agent logs.

Closes: code-cfy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:29:08 +00:00
Viktor Barzin
a5df175a67 [mailserver] Retire Dovecot exporter + scrape + alerts [ci skip]
## Context

code-vnc confirmed `viktorbarzin/dovecot_exporter` cannot produce real
metrics against docker-mailserver 15.0.0's Dovecot 2.3.19 — the
exporter speaks the pre-2.3 `old_stats` FIFO protocol, which Dovecot
2.3 deprecated in favour of `service stats` + `doveadm-server` with
a different wire format. The scrape only ever returned
`dovecot_up{scope="user"} 0`.

code-1ik listed two paths: (a) switch to a Dovecot 2.3+ exporter, or
(b) retire the exporter + scrape + alerts. Picking (b) — carrying a
no-op exporter + scrape + alert group taxes cluster resources,
clutters Prometheus /targets, and tees up an alert that can never
fire correctly. If a future session needs real Dovecot stats, reach
for a known-good exporter (e.g., jtackaberry/dovecot_exporter) and
rebuild this scaffolding.

## This change

### mailserver stack
- Removes the `dovecot-exporter` container from
  `kubernetes_deployment.mailserver` (was ~28 lines). Pod now
  runs a single `docker-mailserver` container.
- Removes `kubernetes_service.mailserver_metrics` (ClusterIP Service
  added in code-izl). The `mailserver` LoadBalancer (ports 25, 465,
  587, 993) is unaffected.
- Drops the dovecot.cf comment documenting the failed code-vnc
  attempt — the documentation survives here + in bd code-vnc /
  code-1ik.

### monitoring stack
- Removes `job_name: 'mailserver-dovecot'` from `extraScrapeConfigs`.
- Removes the `Mailserver Dovecot` PrometheusRule group
  (`DovecotConnectionsNearLimit`, `DovecotExporterDown`).
- Inline comments in both files point future work at code-1ik's
  decision record.

Prometheus configmap-reload picked up the change; scrape target set
now has zero entries for `mailserver-dovecot`. Pod rolled cleanly to
1/1 Running.

## What is NOT in this change

- No replacement exporter — deliberate. The alert that was removed
  was a false-signal alert; its removal returns cluster alerting to
  a correct, lower-noise state.
- mailserver MetalLB Service + SMTP/IMAP ports — unchanged.
- `auth_failure_delay`, `mail_max_userip_connections` — stay; those
  are unrelated to stats export.

## Test Plan

### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver
NAME                          READY  STATUS   RESTARTS  AGE
mailserver-78589bfd95-swz6h   1/1    Running  0         49s

$ kubectl get svc -n mailserver
NAME            TYPE          PORT(S)
mailserver      LoadBalancer  25/TCP,465/TCP,587/TCP,993/TCP
roundcubemail   ClusterIP     80/TCP
# mailserver-metrics gone

$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
    wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
{"status":"success","data":{"activeTargets":[]}}
```

### Manual Verification
1. E2E probe `email-roundtrip-monitor` keeps succeeding (20-min cadence)
2. `EmailRoundtripFailing` stays green — proves IMAP is healthy even
   without the exporter signal
3. Prometheus `/alerts` page no longer shows DovecotConnectionsNearLimit
   or DovecotExporterDown

Closes: code-1ik

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:01:07 +00:00
Viktor Barzin
137404a6a2 [mailserver] Document Dovecot exporter incompatibility [ci skip]
## Context

bd code-vnc investigated why `viktorbarzin/dovecot_exporter` only
exposed `dovecot_up{scope="user"} 0`. Root cause: the exporter speaks
the legacy pre-2.3 `old_stats` FIFO wire protocol. docker-mailserver
15.0.0 ships Dovecot 2.3.19, which moved to `service stats` with a
different architecture — `doveadm stats dump` on the old-stats
unix_listener returns "Failed to read VERSION line" and the exporter
loops on "Input does not provide any columns".

Attempted fix: enabled `old_stats` plugin via `mail_plugins` +
declared `service old-stats { unix_listener stats-reader }`. Socket
was created but protocol incompatibility made it useless. Reverted.

## This change

- Reverts the attempted dovecot.cf additions
- Adds a comment in the dovecot.cf heredoc explaining why we
  deliberately do NOT enable old_stats here
- `auth_failure_delay = 5s` (code-9mi) and
  `mail_max_userip_connections = 50` stay — they're unrelated to
  stats

## What is NOT in this change

- A replacement exporter — filed as follow-up bd code-1ik with
  two paths: switch to jtackaberry/dovecot_exporter, or retire the
  exporter+scrape+alert entirely
- The `mailserver-metrics` ClusterIP Service (from code-izl) —
  kept; it will be useful for whichever path code-1ik chooses

## Test Plan

### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    supervisorctl status dovecot postfix
dovecot RUNNING   pid 1022, uptime 0:00:27
postfix RUNNING   pid 1063, uptime 0:00:26

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```

### Manual Verification
Dovecot config returns to baseline + auth_failure_delay. Mail continues
to flow (E2E probe continues to succeed via `email-roundtrip-monitor`).

Closes: code-vnc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:55:48 +00:00
Viktor Barzin
973f549810 [payslip-ingest] Update extractor agent + dashboard for v2 regex parser
## Context

Companion change to payslip-ingest v2 (regex parser + accurate RSU tax
attribution). The Grafana dashboard now has 4 more panels powered by the
new earnings-decomposition and YTD-snapshot columns, and the Claude
fallback agent's prompt is aligned with the new schema so non-Meta
payslips still land with the full field set.

## This change

### `.claude/agents/payslip-extractor.md`

Rewrites the RSU handling section to match Meta UK's actual template
(rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching
rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead).
Adds a new "Earnings decomposition (v2)" section telling the fallback
agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_*
and when to use pension_employee vs pension_sacrifice without
double-counting.

### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`

- **Panel 4 (Effective rate)** — SQL switched from the naive
  `(income_tax + NIC) / cash_gross` to the YTD-effective-rate
  method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid /
  ytd_taxable_pay)`. Title updated to "YTD-corrected" so the
  change is discoverable.
- **Panel 5 (Table)** — adds salary, bonus, pension_sacrifice,
  taxable_pay columns so row-level debugging against the parser
  output is trivial.
- **+Panel 8 (Earnings breakdown)** — monthly stacked bars of
  salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice
  months show up as a massive negative pension_sacrifice spike
  paired with a near-zero bonus bar.
- **+Panel 9 (Accurate cash tax rate)** — timeseries of
  cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU
  contribution the payslip hides in the single `Tax paid` line.
- **+Panel 10 (All-in compensation)** — stacked bars of cash_gross
  + rsu_vest per payslip.
- **+Panel 11 (YTD cumulative cash gross vs total comp)** — two
  lines partitioned by tax_year; the gap between them is the RSU
  contribution YTD.

Total panels go from 7 → 11.
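The panel SQL is not shown here, but the YTD-effective-rate arithmetic behind panels 4 and 9 is small; a Python sketch with made-up numbers (column names as above, values hypothetical):

```python
def cash_tax(income_tax: float, rsu_vest: float,
             ytd_tax_paid: float, ytd_taxable_pay: float) -> float:
    # Attribute part of this month's single "Tax paid" line to the RSU
    # vest via the YTD effective rate; the remainder is cash tax.
    return income_tax - rsu_vest * (ytd_tax_paid / ytd_taxable_pay)

# Hypothetical RSU-vest month at a 35% YTD effective rate:
tax = cash_tax(income_tax=3000, rsu_vest=2000,
               ytd_tax_paid=35_000, ytd_taxable_pay=100_000)
assert tax == 3000 - 2000 * 0.35          # 2300.0

naive = 3000 / 6000        # old panel-4 formula over cash gross: 0.50
corrected = tax / 6000     # ~0.38 — roughly a 12pp gap in vest months
```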

## Test Plan

### Automated

Dashboard JSON validity:
```
$ python3 -m json.tool uk-payslip.json > /dev/null && echo ok
ok
```

### Manual Verification

After applying `stacks/monitoring/`:
1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels
2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the
   negative pension_sacrifice bar in panel 8
3. Panel 9 "Accurate cash effective tax rate" shows the
   cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in
   RSU-vest months

## Reproduce locally

1. `cd infra/stacks/monitoring && terragrunt plan`
2. Expected: ConfigMap diff on the payslip dashboard with the new panel
   JSON
3. `terragrunt apply` — Grafana reloads the dashboard automatically
   (configmap-reload sidecar)

Relates to: payslip-ingest commit 9741816

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:54:33 +00:00
Viktor Barzin
c6784f87b5 [docs] Add NFS prerequisite runbook for nfs_volume module [ci skip]
## Context

`modules/kubernetes/nfs_volume` creates the K8s PV but NOT the underlying
directory on the Proxmox NFS host (`192.168.1.127:/srv/nfs/<subdir>`).
The first time a new consumer is added, the mount fails with
`mount.nfs: … No such file or directory` and the pod hangs in
ContainerCreating.

This bit us twice during the Wave 1/2 rollout — once for the mailserver
backup (code-z26) and again for the Roundcube backup (code-1f6). Both
times the fix was `ssh root@192.168.1.127 'mkdir -p /srv/nfs/<subdir>'`.
Rather than automate the SSH dependency into the module (which would
break hermeticity and fail for operators without host SSH), this runbook
documents the manual bootstrap step and the rationale.

Addresses bd code-yo4.

## This change

New file: `docs/runbooks/nfs-prerequisites.md`. Lists known consumers,
gives the copy-paste SSH command, and explains why auto-creation was
rejected (two options, neither worth the churn).

## What is NOT in this change

- Any automation of the bootstrap — runbook only
- Migration to `nfs-subdir-external-provisioner` — explicitly out of scope

## Test Plan

### Automated
```
$ cat docs/runbooks/nfs-prerequisites.md | head -5
# NFS Prerequisites for `modules/kubernetes/nfs_volume`

The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
```

### Manual Verification
Before the next stack adds a new `nfs_volume` consumer, read the runbook
and run the `ssh root@192.168.1.127 'mkdir -p ...'` step. First pod
reaches Ready within a minute of the PV creation.

Closes: code-yo4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:40:55 +00:00
Viktor Barzin
28009a0e85 [redis] Bump master/replica memory 64Mi→256Mi (OOMKilled on PSYNC)
## Context
redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts.
Cluster-health check flagged it as WARN; Prometheus was firing
`StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts continuously.

## Root cause
Memory limit 64Mi is too tight. Master steady-state is only 21Mi, but
the replica needs transient headroom during PSYNC full resync:

- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during snapshot)
- Replication backlog tracking

The replica RSS crossed 64Mi during sync and was OOM-killed (exit 137),
looping forever. This also broke Sentinel quorum when master would
fail — no healthy replica to promote.

## Fix
Master + replica: 64Mi → 256Mi (both requests and limits, per
`CLAUDE.md` resource management rule: `requests=limits` based on
VPA upperBound).

Sentinels stay at 64Mi — they don't store data.

## Deployment note
Helm upgrade initially deadlocked because StatefulSet uses
`OrderedReady` podManagementPolicy: the update rollout refuses to start
until all pods Ready, but redis-node-1 could not be Ready without the
update. Recovered via:

  helm rollback redis 43 -n redis
  kubectl -n redis patch sts redis-node --type=strategic \
    -p '{...memory: 256Mi...}'
  kubectl -n redis delete pod redis-node-1 --force

Then `scripts/tg apply` cleanly reconciled state. Deadlock-recovery
runbook to be written under `code-cnf`.

## Verification
  kubectl -n redis get pods
    redis-node-0   2/2  Running  0  <bounce>
    redis-node-1   2/2  Running  0  <bounce>
  kubectl -n redis get sts redis-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
    256Mi

## Follow-ups filed
- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically
  (separate root cause; surfaced via same cluster-health run)
- code-cnf: runbook / TF tweak for the OrderedReady + atomic-wait
  deadlock recovery

Closes: code-pqt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:40:51 +00:00
Viktor Barzin
468a7a266b [mailserver] Drop unneeded NET_ADMIN capability [ci skip]
## Context

The mailserver container had `capabilities.add = ["NET_ADMIN"]`. Upstream
docker-mailserver docs say the capability is only needed by Fail2ban to
run iptables ban actions. Fail2ban is DISABLED in this stack
(`ENABLE_FAIL2BAN=0`, see line ~68) — CrowdSec owns the brute-force
policy at the LB layer. The capability was therefore unused ballast and
a minor attack-surface reduction opportunity. Addresses code-4mu.

## This change

Replaces the explicit `capabilities { add = ["NET_ADMIN"] }` block with
an empty `security_context {}`. Post-rollout verification
(`supervisorctl status`) confirms every service we actually run is
healthy — dovecot, postfix, rspamd, rsyslog, postsrsd, changedetector,
cron, mailserver. Every STOPPED entry was already disabled.

The inline comment documents the revert trigger: check
`kubectl logs -c docker-mailserver` for permission-denied patterns and
restore the capability if observed.

## Test Plan

### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver -o jsonpath='{.items[0].spec.containers[?(@.name=="docker-mailserver")].securityContext}'
{"allowPrivilegeEscalation":true,"privileged":false,"readOnlyRootFilesystem":false,"runAsNonRoot":false}

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out

$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    supervisorctl status | grep RUNNING
changedetector RUNNING ...
cron           RUNNING ...
dovecot        RUNNING ...
mailserver     RUNNING ...
postfix        RUNNING ...
postsrsd       RUNNING ...
rspamd         RUNNING ...
rsyslog        RUNNING ...
```

### Observation window
EmailRoundtripFailing + EmailRoundtripStale alerts continue to run
every 20 min. If no alert fires in the 24h post-rollout window
(through ~2026-04-20 10:40 UTC), the change is considered safe and
this commit stands. Otherwise revert this commit.

## What is NOT in this change

- readOnlyRootFilesystem (separate hardening, out of scope)
- runAsNonRoot (docker-mailserver needs root for Postfix)
- Removing privilege-escalation defaults (container needs those for
  chowning mail spool at startup)

Closes: code-4mu

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:39:43 +00:00
Viktor Barzin
c941199f8d [mailserver] Split Dovecot metrics port onto ClusterIP service [ci skip]
## Context

Port 9166 (`dovecot-metrics`) was exposed on the public MetalLB
LoadBalancer 10.0.20.202 alongside SMTP/IMAP. While only LAN-routable,
shipping an internal metric on the same listening IP as external mail
conflated two concerns and over-exposed the port. Prometheus was
scraping via the same LB Service. Addresses code-izl (follow-up to
code-61v which added the scrape job).

## This change

### mailserver stack
- Drops `dovecot-metrics` port from `kubernetes_service.mailserver`
  (LoadBalancer stays: 25, 465, 587, 993).
- Adds new `kubernetes_service.mailserver_metrics` — ClusterIP-only,
  selecting the same `app=mailserver` pod, exposing 9166.

### monitoring stack
- Updates `extraScrapeConfigs` in the Prometheus chart values to
  target the new `mailserver-metrics.mailserver.svc.cluster.local:9166`
  instead of `mailserver.mailserver.svc.cluster.local:9166`.
- helm_release.prometheus updated in-place; configmap-reload sidecar
  picked up the new target within 10s.

```
 mailserver LB              mailserver-metrics ClusterIP
 ┌──────────────────┐       ┌──────────────────┐
 │ 25  smtp         │       │ 9166 dovecot-    │
 │ 465 smtp-secure  │       │      metrics     │ ← Prometheus only
 │ 587 smtp-auth    │       └──────────────────┘
 │ 993 imap-secure  │
 └──────────────────┘
    ↑ 10.0.20.202
```

## What is NOT in this change

- Per-Service RBAC/NetworkPolicy tightening (separate task)
- Moving the metrics port to a dedicated sidecar-only Service Monitor
  (ServiceMonitor CRDs not installed; extraScrapeConfigs is correct
  for the prometheus-community chart in use)

## Test Plan

### Automated
```
$ kubectl get svc -n mailserver
mailserver          LoadBalancer 10.0.20.202  25/TCP,465/TCP,587/TCP,993/TCP
mailserver-metrics  ClusterIP    10.100.102.174  9166/TCP

$ kubectl get endpoints -n mailserver mailserver-metrics
mailserver-metrics   10.10.169.163:9166

$ # Prometheus target (after 10s configmap-reload)
$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
    wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
  scrapeUrl: http://mailserver-metrics.mailserver.svc.cluster.local:9166/metrics
  health: up
```

### Manual Verification
1. From a host outside the cluster: `nc -vz 10.0.20.202 9166` → connection refused
2. Prometheus UI `/targets` → `mailserver-dovecot` UP, labels show new DNS name
3. PromQL: `up{job="mailserver-dovecot"}` returns `1`

Closes: code-izl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:37:30 +00:00
Viktor Barzin
7502e0db21 [mailserver] Document postfix-accounts.cf hash-drift invariant [ci skip]
## Context

The `postfix-accounts.cf` ConfigMap renders `bcrypt(pass, 6)` for each
user in `var.mailserver_accounts`. bcrypt generates a fresh salt on
every evaluation → the ConfigMap `data` hash line differs every plan
run. `ignore_changes = [data["postfix-accounts.cf"]]` was the pragmatic
workaround, but the side-effect wasn't documented: a Vault rotation of
a mailserver password would be MASKED by ignore_changes — TF would
never push the new hash and the pod would keep accepting the old
password until manual taint/replace.

Addresses bd code-7ns.
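The drift mechanism is plain salted hashing. A stdlib sketch — SHA-256 with a random salt as a stand-in for bcrypt, purely to illustrate — shows both the non-determinism and the stable-salt idea behind the deterministic alternative:

```python
import hashlib
import os

def salted_hash(password: str) -> str:
    # Fresh random salt per call — the same thing bcrypt does internally,
    # and why the rendered ConfigMap value differs on every plan run.
    salt = os.urandom(16)
    return salt.hex() + "$" + hashlib.sha256(salt + password.encode()).hexdigest()

def deterministic_hash(user: str, password: str) -> str:
    # Stable per-user salt: repeated evaluations render identically, so
    # ignore_changes would be unnecessary and rotation would surface.
    salt = hashlib.sha256(user.encode()).digest()[:16]
    return hashlib.sha256(salt + password.encode()).hexdigest()

assert salted_hash("hunter2") != salted_hash("hunter2")  # perpetual drift
assert deterministic_hash("u", "p") == deterministic_hash("u", "p")
```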

## This change

Inline comment on the lifecycle block spelling out:
- Why ignore_changes exists (non-deterministic bcrypt)
- What the invariant costs (masks automatic rotation)
- Why it's acceptable TODAY (no automatic rotation for
  mailserver_accounts — verified in Vault; manual password change is a
  manual TF run anyway)
- Two concrete alternatives if rotation is ever added:
  (a) deterministic bcrypt with stable per-user salt
  (b) render from an ESO-synced K8s Secret

No code change, no apply needed — this is a comment-only commit. The
decision (live-with + document) is one of the three options in the plan.

## What is NOT in this change

- Deterministic hashing (not needed until automatic rotation exists)
- ESO-driven Secret (same reason)
- Removal of ignore_changes (would cause the original drift flap)

## Test Plan

### Automated
```
$ cd stacks/mailserver && /home/wizard/code/infra/scripts/tg plan
# no diff expected on this comment-only change; other drift remains
# but is pre-existing and out of scope.
```

### Manual Verification
Read the new comment block at `stacks/mailserver/modules/mailserver/
main.tf` around the postfix-accounts-cf lifecycle — comprehensible
without session context.

Closes: code-7ns

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:33:57 +00:00
Viktor Barzin
23173131f4 [mailserver] Add Dovecot auth_failure_delay 5s [ci skip]
## Context

Dovecot's `dovecot.cf` block previously set only
`mail_max_userip_connections = 50`. No equivalent of the SMTP rate
limit existed for IMAP auth — brute-force against IMAP/POP auth was
throttled only by CrowdSec at the LB level. Adding an in-process
auth delay is cheap defense in depth. Addresses code-9mi.

## This change

Adds `auth_failure_delay = 5s` to the dovecot.cf ConfigMap key.
Each failed auth attempt pauses 5s before responding; a sequential
1000-entry dictionary attack stretches from <1s to ~85min, giving
CrowdSec's ban window ample time to kick in first.
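
The quoted ~85 min is plain arithmetic (assuming a strictly sequential, single-connection attacker):

```python
delay_s = 5        # auth_failure_delay
attempts = 1000    # dictionary size
minutes = attempts * delay_s / 60
print(round(minutes, 1))  # 83.3, i.e. roughly the ~85 min quoted above
```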

## What is NOT in this change

- `login_processes_count` tuning (workload doesn't warrant it yet)
- Equivalent SMTP AUTH delay (CrowdSec already covers it, and SMTP AUTH
  is rate-limited via `smtpd_client_connection_rate_limit`)

## Test Plan

### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    doveconf -n | grep -E 'auth_failure|mail_max_userip'
auth_failure_delay = 5 secs
mail_max_userip_connections = 50

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```

### Manual Verification
1. `openssl s_client -connect mail.viktorbarzin.me:993`
2. `a1 LOGIN bogus@viktorbarzin.me wrongpass` — expect ~5s delay before `NO [AUTHENTICATIONFAILED]`
3. Fire 5 failed attempts rapidly: total ≥25s

## Reproduce locally
1. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- doveconf -n | grep auth_failure`
2. Expected: `auth_failure_delay = 5 secs`

Closes: code-9mi

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:33:05 +00:00
Viktor Barzin
a32bfbf07e [mailserver] Require STARTTLS before AUTH on submission [ci skip]
## Context

docker-mailserver 15.0.0's default Postfix config does NOT set
`smtpd_tls_auth_only = yes`. Clients that skip STARTTLS on port 587
(or 25 with AUTH) can send PLAIN/LOGIN creds in cleartext. CrowdSec
and rate limiting don't catch this — it's an auth-path leak, not a
bruteforce. Addresses bd code-vnw.

## This change

Adds `smtpd_tls_auth_only = yes` to `postfix_cf` (applied via the
`postfix-main.cf` ConfigMap key consumed by docker-mailserver).
Rolled the pod to pick up the new ConfigMap.

### Deviation from task spec

code-vnw's fix field cited `smtpd_sasl_auth_only = yes`. That is NOT
a real Postfix parameter — attempting it gets
`postconf: warning: smtpd_sasl_auth_only: unknown parameter`. The
acceptance test (reject PLAIN auth before STARTTLS) is satisfied by
`smtpd_tls_auth_only`, which is the correct knob. Added an inline
comment noting the common confusion.

## What is NOT in this change

- Per-service override in master.cf (smtpd_tls_auth_only applied
  globally, which is safe because port 25 doesn't accept AUTH here)
- Other Postfix hardening (sender_restrictions, etc.)

## Test Plan

### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    postconf smtpd_tls_auth_only
smtpd_tls_auth_only = yes

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```

### Manual Verification
1. `openssl s_client -connect mail.viktorbarzin.me:587 -starttls smtp`
2. At prompt, send `AUTH PLAIN <base64>` BEFORE `STARTTLS`
3. Expected: Postfix rejects with `503 5.5.1 Error: authentication not enabled`
4. Follow-up: STARTTLS first, then `AUTH PLAIN <base64>` — succeeds for valid creds

## Reproduce locally
1. From a shell with `kubectl` access to the cluster:
2. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- postconf smtpd_tls_auth_only`
3. Expected: `smtpd_tls_auth_only = yes`

Closes: code-vnw

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:31:15 +00:00
Viktor Barzin
e12c7b43e4 [mailserver] Pin dovecot_exporter to SHA + add Diun [ci skip]
## Context

`viktorbarzin/dovecot_exporter:latest` was consumed with `IfNotPresent`
pull, which means whichever node landed the pod kept whatever digest
was cached from an earlier pull. A SHA-level pin is the reproducibility
baseline this repo uses for every other home-built image
(`headscale`, `excalidraw`, `linkwarden`).

## This change

- Pins `dovecot-exporter` container image to
  `viktorbarzin/dovecot_exporter@sha256:1114224c...` — the digest the
  pod is actually running today (captured from live `imageID`).
- Enables Diun tag watching on the mailserver Deployment
  (`diun.enable=true`, `diun.include_tags=^latest$`) so new `:latest`
  digests trigger a notification rather than silently landing on the
  next `IfNotPresent` miss.

Deviation from task spec (code-cno): the task asked for an 8-char SHA
*tag*, but Docker Hub only publishes `:latest` for this image — a SHA
tag doesn't exist. Used the digest-pin pattern already established at
`stacks/headscale/modules/headscale/main.tf:204` instead; Diun watches
the `:latest` tag for drift, which is the equivalent notification.

## What is NOT in this change

- Volume-mount ordering drift on `kubernetes_deployment.mailserver`
  (pre-existing; tolerated by Waves 1+2).
- Splitting the metrics port into its own Service (code-izl).

## Test Plan

### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver \
    -o jsonpath='{.items[0].spec.containers[*].image}'
docker.io/mailserver/docker-mailserver:15.0.0 \
  viktorbarzin/dovecot_exporter@sha256:1114224c9bf0261ca8e9949a6b42d3c5a2c923d34ca4593f6b62f034daf14fc5

$ kubectl get deployment -n mailserver mailserver \
    -o jsonpath='{.spec.template.metadata.annotations}'
{"diun.enable":"true","diun.include_tags":"^latest$"}

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```

### Manual Verification
1. Push a new `:latest` digest to the exporter image (or wait for one).
2. Check Diun notifier output: a tag event for `^latest$` should fire.
3. `kubectl describe deployment/mailserver -n mailserver` shows the
   digest pin unchanged until someone rebumps it.

## Reproduce locally
1. `kubectl -n mailserver get pod -l app=mailserver -o yaml | \
     grep -A1 dovecot_exporter`
2. Expected: `image: viktorbarzin/dovecot_exporter@sha256:1114224c...`.

Closes: code-cno

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:26:31 +00:00
Viktor Barzin
c36b41eabc [monitoring] Scrape mailserver Dovecot exporter + near-limit alerts
Port 9166 (`dovecot-metrics`) is exposed on the mailserver Service but
nothing was scraping it. Added a static `mailserver-dovecot` scrape job
to `extraScrapeConfigs` (we run `prometheus-community/prometheus`, not
`kube-prometheus-stack`, so no ServiceMonitor CRDs are available).

Two alerts in a new `Mailserver Dovecot` rule group:
- `DovecotConnectionsNearLimit` fires at ≥42/50 IMAP connections for
  5m (84% of `mail_max_userip_connections = 50`).
- `DovecotExporterDown` fires if the scrape target is unreachable
  for 10m (catches pod restarts + network issues).

Originally drafted as `kubernetes_manifest` ServiceMonitor + PrometheusRule
on `mailserver-beta1` branch; that commit is abandoned because the
CRDs aren't installed. This path is functionally equivalent and plans
cleanly.

Closes: code-61v
2026-04-19 00:24:12 +00:00
Viktor Barzin
6a75ed4809 [mailserver] Add targeted retention for spam@ mailbox
## Context

The @viktorbarzin.me catch-all routes to spam@viktorbarzin.me. The
mailbox had no retention policy. On 2026-04-18 it held 519 messages
consuming 43 MiB. Without a policy, the only brake on growth was
manual deletion, which has not been happening - hence the bd task.

Viktor's explicit constraint when filing code-oy4: DO NOT blind
age-expunge. We need targeted retention that keeps genuine forwarded
human mail for a long time while shedding the recurring-newsletter
cruft that dominates the byte count.

## Profile findings (2026-04-18, verified on the live pod)

Total: 519 messages, 43 MiB, 0 in new/, 0 in tmp/.

Top senders by volume:
   138  dan@tldrnewsletter.com
    51  hi@ratepunk.com
    40  uber@uber.com
    35  truenas@viktorbarzin.me
    19  ubereats@uber.com
    15  hello@travel.jacksflightclub.com
    12  chris@chriswillx.com
    10  me@viktorbarzin.me

Top senders by storage bytes:
   8,176,481  dan@tldrnewsletter.com  (19 % of 43 MiB alone)
   2,866,104  uber@uber.com
   2,207,458  noreply@mail.selfh.st
   2,066,094  hi@ratepunk.com
   1,675,435  ubereats@uber.com

Age distribution:
    97 %  older than 14 days (502 / 519)
    23 %  older than 90 days (121 / 519)

Automated-sender markers:
    66 %  carry List-Unsubscribe:                   (342 / 519)
     4 %  carry Precedence: bulk|list|junk          ( 21 / 519)
    34 %  carry neither marker (= human-ish tail)   (177 / 519)

Combined "automated AND >14d": 328 messages -> target of rule 1.

## Retention strategy

Signed off by Viktor 2026-04-18. Two rules, both delete-leaf:

  1. Older than 14 days AND header matches one of:
       - `^List-Unsubscribe:`
       - `^Precedence:\s*(bulk|list|junk)`
       - `^Auto-Submitted:\s*auto-`
     -> DELETE.
     Rationale: these markers are the RFC-agreed indicators of bulk /
     robotic senders. A 14-day window still leaves genuine subscription
     alerts (delivery, flight, calendar invite) time to be noticed.

  2. Older than 90 days AND no automated marker at all
     -> DELETE.
     Rationale: these are long-tail forwards from real people to the
     catch-all. 90 days is deliberately generous - I would rather
     leak bytes than lose Viktor's personal correspondence.

  3. Everything else -> KEEP (recent traffic, or aged human tail
     younger than 90d).

## Implementation

A `kubernetes_cron_job_v1.spam_retention` running every 4h (at :17
past) that `kubectl exec`s a Python retention script into the
mailserver pod.

Why kubectl exec and not a sibling CronJob with the Maildir mounted:
mailserver-data-encrypted is a RWO volume held by the mailserver
pod. A sibling would fail to attach. The nextcloud-watchdog pattern
in stacks/nextcloud/main.tf already solves this for a similar
"interact with the live pod on a schedule" shape. Mirrored here with
its own SA + Role + RoleBinding scoped to list/get pods and create
pods/exec in the mailserver namespace only.

Why Python and not pure shell: POSIX `find + stat + awk` struggles
with the header-scan-up-to-blank-line rule, and `stat -c` is Linux-
GNU-specific anyway. The script reads each message's first 64 KiB,
stops at the first blank line, scans headers only, then checks mtime.
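
A minimal sketch of that classification logic (marker regexes and thresholds are from this message; function names and the exact script shape are assumed, the CronJob heredoc is authoritative):

```python
import os
import re
import time

# RFC-style bulk-sender markers from rule 1.
AUTO_MARKERS = [
    re.compile(rb"^List-Unsubscribe:", re.I),
    re.compile(rb"^Precedence:\s*(bulk|list|junk)", re.I),
    re.compile(rb"^Auto-Submitted:\s*auto-", re.I),
]

def is_automated(path):
    # Read at most 64 KiB and scan header lines only, stopping at the
    # first blank line (the header/body boundary).
    with open(path, "rb") as f:
        head = f.read(64 * 1024)
    for line in head.split(b"\n"):
        if not line.strip():
            break
        if any(m.match(line) for m in AUTO_MARKERS):
            return True
    return False

def verdict(path, now=None):
    age_days = ((now or time.time()) - os.path.getmtime(path)) / 86400
    if is_automated(path):
        return "DELETE" if age_days > 14 else "KEEP"   # rule 1
    return "DELETE" if age_days > 90 else "KEEP"       # rules 2 and 3
```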

The CronJob streams the Python source via `kubectl exec -i ... --
python3 - <<PYEOF`. After the retention pass, `doveadm force-resync
-u spam@viktorbarzin.me INBOX/spam` refreshes Dovecot's cached index
so the deletions appear in IMAP immediately instead of after the next
pod restart.

Includes the standard KYVERNO_LIFECYCLE_V1 marker on the CronJob so
Kyverno ndots mutation does not cause perpetual drift.

## What is NOT in this change

- Dovecot sieve rules (no sieve infrastructure exists in the module;
  the plan file's fallback option was precisely this CronJob path).
- Push of retention metrics to Pushgateway - the script prints them
  to the job log for now; plumbing Pushgateway is a follow-up if
  Viktor wants alerts.
- Any touch of other mailboxes - only `/var/mail/viktorbarzin.me/spam/cur`
  is walked.
- Any mailserver pod restart or config reload.

## Test plan

### Automated

`terraform fmt` + `terragrunt hclfmt` pass. `scripts/tg plan` on the
mailserver stack shows:
  Plan: 7 to add, 3 to change, 0 to destroy.
Of the 7 adds, 4 are mine (SA + Role + RoleBinding + CronJob). The
other 3 adds belong to the concurrent roundcube-backup CronJob +
nfs_roundcube_backup_host PV + PVC already on master in parallel.
The 3 in-place updates are pre-existing drift on the mailserver
Deployment, Service and email_roundtrip_monitor CronJob, not
introduced by this change.

### Manual Verification

After `scripts/tg apply` lands the CronJob:

  1. Trigger an immediate run:
     `kubectl -n mailserver create job --from=cronjob/spam-retention manual-1`
  2. Wait for completion, read the log:
     `kubectl -n mailserver logs job/manual-1`
     -> expected tail:
        spam_retention_scanned_total <N>
        spam_retention_auto_deleted_total <M>
        spam_retention_human_deleted_total <H>
        spam_retention_kept_total <K>
        spam_retention_errors_total 0
        Retention pass complete
  3. Confirm mailbox shrunk:
     `kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \
         -- du -sh /var/mail/viktorbarzin.me/spam/`
     -> expected: well below 43 MiB within one run (bulk rule alone
        purges ~328 messages per the profile numbers above).
  4. Confirm IMAP reflects the deletions:
     `kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \
         -- doveadm mailbox status -u spam@viktorbarzin.me messages INBOX/spam`
     -> expected: message count dropped accordingly.
  5. 4 hours later, confirm the next scheduled run logs a much
     smaller scan count and 0 deletions (nothing new crossed the
     threshold).

Closes: code-oy4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:22:55 +00:00
Viktor Barzin
6cfc4b7836 [mailserver] Add backup CronJob for Roundcube html + enigma PVCs
## Context
Roundcube webmail runs with two encrypted RWO PVCs (see roundcubemail.tf:
`roundcubemail-html-encrypted`, `roundcubemail-enigma-encrypted`). These
carry user-visible state that is NOT regenerable without user action:

- `html` PVC → Apache docroot, plugin installs, skin overrides, session
  artefacts (two_factor_webauthn keys, persistent_login tokens, rcguard
  throttle state)
- `enigma` PVC → user-uploaded PGP private keyrings

Per the subdir CLAUDE.md "Storage & Backup Architecture" rule every
proxmox-lvm* PVC MUST have a backup CronJob writing to NFS
`/mnt/main/<app>-backup/`. Mailserver already complies via code-z26's
`mailserver-backup` CronJob; Roundcube does not. Losing either Roundcube
PVC means users must re-add 2FA devices, re-install plugins, and
re-import PGP keys — none of it recoverable from a database dump.

Target task: `code-1f6`.

## This change
- Adds `module.nfs_roundcube_backup_host` sourcing
  `modules/kubernetes/nfs_volume` pointed at
  `/srv/nfs/roundcube-backup` on the Proxmox host (NFSv4, inotify
  change-tracker picks it up for Synology offsite).
- Adds `kubernetes_cron_job_v1.roundcube-backup`:
  - Schedule `10 3 * * *` — 10 minutes after `mailserver-backup`
    (`0 3 * * *`) to avoid NFS write-window contention. Roundcube PVCs
    are tiny (<200 MiB combined on current cluster) so the window is
    well under 10 min.
  - `pod_affinity` on `app=roundcubemail` (Roundcube runs 1 replica with
    `Recreate` strategy on a fresh node per pod; the backup pod must
    co-locate because both PVCs are RWO).
  - `rsync -aH --delete --link-dest=/backup/<prev-week>` into
    `/backup/<YYYY-WW>/{html,enigma}/` — hardlinks unchanged files vs
    the previous weekly snapshot, keeping storage cost ~= delta only.
  - Weekly rotation retains 8 snapshots (~2 months), matching
    `mailserver-backup`.
  - Pushgateway metrics under `job=roundcube-backup` so existing
    `BackupDurationHigh` / `BackupStale` alert patterns detect
    regressions without extra wiring.
  - `KYVERNO_LIFECYCLE_V1` `ignore_changes` for mutated `dns_config`.

## Layout
```
 NFS server 192.168.1.127:/srv/nfs/
 ├── mailserver-backup/        (0 3 * * *  — code-z26)
 │   └── <YYYY-WW>/{data,state,log}/
 └── roundcube-backup/         (10 3 * * * — this change)
     └── <YYYY-WW>/{html,enigma}/
```
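
The snapshot naming plus `--link-dest` reuse can be sketched as a command builder (hypothetical helper; the source path is illustrative and the CronJob's inline shell is authoritative):

```python
import datetime
import shlex

def roundcube_backup_cmd(dest_root: str, week_of: datetime.date) -> str:
    # Weekly snapshot dirs are named YYYY-WW; --link-dest makes rsync
    # hardlink files unchanged since last week's snapshot, so each new
    # snapshot costs roughly only the delta. (Year-boundary weeks are
    # ignored here for brevity.)
    this_week = week_of.strftime("%Y-%W")
    prev_week = (week_of - datetime.timedelta(weeks=1)).strftime("%Y-%W")
    return shlex.join([
        "rsync", "-aH", "--delete",
        f"--link-dest={dest_root}/{prev_week}",
        "/data/html/", f"{dest_root}/{this_week}/html/",
    ])

print(roundcube_backup_cmd("/backup", datetime.date(2026, 4, 19)))
```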

## What is NOT in this change
- Changing the mailserver-backup CronJob to also cover Roundcube. Two
  separate CronJobs keep the concerns (and pod anti-affinity/affinity)
  clean; the 10-min stagger eliminates the contention justification for
  merging them.
- Retention alerting tuning — existing Pushgateway/Prometheus rule
  ecosystem suffices for now.
- Restore tooling — follows the standard pattern in
  `docs/runbooks/` (rsync back, fix perms).

## Reproduce locally
1. Plan: `cd stacks/mailserver && scripts/tg plan -lock=false` →
   2 new resources (nfs_volume module + CronJob).
2. Apply, then trigger a one-shot run:
   `kubectl -n mailserver create job --from=cronjob/roundcube-backup roundcube-backup-manual-1`
3. Expected on success:
   - `kubectl -n mailserver logs job/roundcube-backup-manual-1` → "=== Backup IO Stats ===".
   - On Proxmox host:
     `ls /srv/nfs/roundcube-backup/$(date +%Y-%W)/` → `html`, `enigma`.
   - `/mnt/backup/.nfs-changes.log` (Proxmox) lists fresh paths under
     `roundcube-backup/` within ~1s of the rsync finishing.
   - Pushgateway: `curl -s prometheus-prometheus-pushgateway.monitoring:9091/metrics | grep roundcube`
     shows `backup_duration_seconds`, `backup_last_success_timestamp`.

## Automated
- `terraform fmt -check -recursive stacks/mailserver/modules/mailserver/` → clean.
- `scripts/tg plan -lock=false` in stacks/mailserver expected to show
  `+ module.nfs_roundcube_backup_host.*`, `+ kubernetes_cron_job_v1.roundcube-backup`.

Closes: code-1f6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:14:47 +00:00
Viktor Barzin
f707968091 [mailserver] Retry probe Pushgateway + Uptime Kuma pushes with backoff
## Context
The e2e email-roundtrip probe (CronJob `email-roundtrip-monitor`) currently
wraps `requests.put(PUSHGATEWAY, ...)` and `requests.get(UPTIME_KUMA, ...)`
in bare `try/except` that only prints "Failed to push ..." on error. If
Pushgateway is transiently unreachable (e.g., during a Prometheus Helm
upgrade / HPA scale-down / brief network blip) metrics silently drop and
downstream detection relies entirely on `EmailRoundtripStale` firing after
60 min of staleness. Single transient failures masquerade as data-plane
breakage for up to an hour.

Target task: `code-n5l` — Add retry to probe Pushgateway + Uptime Kuma pushes.

## This change
- Extracts a `push_with_retry(label, func, url)` helper that performs 3
  attempts with exponential backoff (1s, 2s, 4s). Treats HTTP 2xx as
  success, everything else as failure. On final failure, logs an explicit
  `ERROR:` line to stderr with the URL and either the last HTTP status or
  the exception repr — matches the existing `print(...)` logging style
  used throughout the heredoc (no stdlib `logging` dependency added).
- Replaces the two inline `try/requests.put/except print` blocks with
  calls to the helper. Pushgateway runs unconditionally; Uptime Kuma
  still only runs on round-trip success (same as before).
- Makes exit code responsive to push outcome: probe exits non-zero when
  the round-trip itself failed (unchanged), OR when BOTH pushes failed
  all retries on the success path. Single-endpoint push failure with the
  other succeeding keeps exit 0 — partial observability is preferred
  over noisy pod restarts from Kubernetes' Job controller.
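
A sketch of the helper (signature and retry schedule are from this message; the body is an assumed reconstruction, not the committed heredoc):

```python
import sys
import time

def push_with_retry(label, func, url, attempts=3):
    # Call func(url) up to `attempts` times with exponential backoff
    # (1s, 2s, 4s). HTTP 2xx counts as success; any other status or an
    # exception triggers a retry. On final failure, log an ERROR line
    # to stderr and return False.
    last = None
    for i in range(attempts):
        try:
            resp = func(url)
            if 200 <= resp.status_code < 300:
                return True
            last = f"last_status={resp.status_code}"
        except Exception as exc:
            last = f"exception={exc!r}"
        if i < attempts - 1:
            time.sleep(2 ** i)
    print(f"ERROR: Failed to push to {label} after {attempts} attempts: "
          f"url={url} {last}", file=sys.stderr)
    return False
```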

## Behavior matrix

```
roundtrip | pushgw | kuma | exit | rationale
----------+--------+------+------+-------------------------------
success   | ok     | ok   |  0   | happy path (unchanged)
success   | fail   | ok   |  0   | one endpoint still has telemetry
success   | ok     | fail |  0   | one endpoint still has telemetry
success   | fail   | fail |  1   | NEW — total observability loss
fail      | ok     | -    |  1   | roundtrip failed (unchanged, Kuma skipped)
fail      | fail   | -    |  1   | roundtrip failed (unchanged, Kuma skipped)
```
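
The matrix collapses to a small policy function (hypothetical name, same semantics as the table above):

```python
def probe_exit_code(roundtrip_ok: bool, pushgw_ok: bool, kuma_ok: bool) -> int:
    if not roundtrip_ok:
        return 1              # roundtrip failure always exits non-zero
    if not pushgw_ok and not kuma_ok:
        return 1              # NEW: total observability loss on success path
    return 0                  # at least one endpoint received telemetry

# Every row of the behavior matrix (kuma_ok is irrelevant when the
# roundtrip fails, since the Kuma push is skipped).
rows = [(True, True, True, 0), (True, False, True, 0),
        (True, True, False, 0), (True, False, False, 1),
        (False, True, True, 1), (False, False, False, 1)]
for rt, pg, km, want in rows:
    assert probe_exit_code(rt, pg, km) == want
```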

## What is NOT in this change
- Alert thresholds (`EmailRoundtripStale` still 60m) — explicitly out of
  scope per the task description.
- `logging` stdlib adoption — rest of heredoc uses `print`, staying
  consistent.
- Moving the heredoc out of `main.tf` into a sidecar Python file —
  separate refactor.

## Reproduce locally
1. Point PUSHGATEWAY at a black hole:
   `kubectl -n mailserver set env cronjob/email-roundtrip-monitor \`
   `PUSHGATEWAY=http://nope.invalid:9091/metrics/job/test`
2. Trigger a one-shot job:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-test`
3. Expected in logs:
   - 3 attempts, each ~1s/2s/4s apart
   - `ERROR: Failed to push to Pushgateway after 3 attempts: url=... exception=...`
   - Uptime Kuma push still succeeds (round-trip ok) → exit 0
4. Flip UPTIME_KUMA_URL to also fail (edit heredoc or DNS-poison): expect
   exit 1 + two ERROR lines.

## Automated
- `python3 -c "import ast; ast.parse(open('/tmp/probe.py').read())"` → OK
  (heredoc extracts cleanly).
- `terraform fmt -check -recursive modules/mailserver/` → no diff.

Closes: code-n5l

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:14:46 +00:00
Viktor Barzin
f568e7d2bf [mailserver] Delete unused postfix_cf_reference_DO_NOT_USE variable [ci skip]
## Context

`infra/stacks/mailserver/modules/mailserver/variables.tf` carried a
130-line historical scaffolding variable
`postfix_cf_reference_DO_NOT_USE` containing a reference copy of an
older Postfix main.cf layout. The variable name itself signalled
dead-code intent ("DO_NOT_USE"), and a repo-wide
`grep -rn postfix_cf_reference infra/` confirmed zero consumers — no
module, no stack, no script, no doc ever referenced it. Carrying dead
Terraform variables costs nothing at runtime but wastes reviewer
attention on every `git blame` and drives up `variables.tf` read time.

Note on history: the prior commit 09c11056 landed with an identical
title ("Delete postfix_cf_reference_DO_NOT_USE dead code") but
actually committed `docs/runbooks/mailserver-proxy-protocol.md` —
fallout from a race between two concurrent mailserver sessions that
staged files in parallel. That commit accidentally closed this beads
task via the `Closes:` trailer without performing the deletion. This
commit does the actual deletion that was originally intended for
code-o3q. The runbook from 09c11056 is legitimate work for code-rtb
and is left in place.

## This change

Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }`
block (136 lines incl. trailing blank). No other variable touched, no
resource touched, no comment elsewhere touched. `variables.tf` now
contains only the live `postfix_cf` variable that is actually consumed
by the module.

## What is NOT in this change

- No Terraform state modification — variable was never read, so state
  has no record of it.
- No Postfix runtime behaviour change — `postfix_cf` (the live one) is
  untouched.
- No fix for the pre-existing `kubernetes_deployment.mailserver` /
  `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces
  independently. Those 2 in-place updates are known and tracked
  separately.
- No apply needed — pure source hygiene.

## Test Plan

### Automated

Reference check before edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" {
```
(single match — the declaration itself)

Reference check after edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
(no matches)
```

`terragrunt validate` (from `infra/stacks/mailserver/`):
```
Success! The configuration is valid, but there were some
validation warnings as shown above.
```
(warnings are pre-existing `kubernetes_namespace` -> `_v1` deprecation
notices, unrelated)

`terragrunt plan` (from `infra/stacks/mailserver/`):
```
  # module.mailserver.kubernetes_deployment.mailserver will be updated in-place
  # module.mailserver.kubernetes_service.mailserver will be updated in-place
Plan: 0 to add, 2 to change, 0 to destroy.
```
Both in-place updates are the known pre-existing drift. No change is
attributable to this commit — the dead variable was never referenced.

### Manual Verification

1. `cd infra/stacks/mailserver/modules/mailserver/`
2. `grep -c postfix_cf_reference variables.tf` -> expected `0`
3. `wc -l variables.tf` -> expected `39` (was `175`; 136 lines removed)
4. `cd ../..` -> `terragrunt validate` -> expected `Success!`
5. `terragrunt plan` -> expected `Plan: 0 to add, 2 to change, 0 to
   destroy.` (pre-existing drift only).

Closes: code-o3q

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:07:43 +00:00
Viktor Barzin
09c1105648 [mailserver] Delete postfix_cf_reference_DO_NOT_USE dead code [ci skip]
## Context

`infra/stacks/mailserver/modules/mailserver/variables.tf` carried a
130-line historical scaffolding variable
`postfix_cf_reference_DO_NOT_USE` containing a reference copy of an
older Postfix `main.cf` layout. The variable name itself signalled
dead-code intent ("DO_NOT_USE"), and a repo-wide
`grep -rn postfix_cf_reference infra/` confirmed zero consumers — no
module, no stack, no script, no doc ever referenced it. Carrying dead
Terraform variables costs nothing at runtime but actively wastes
reviewer attention on every `git blame`, drives up `variables.tf` read
time, and lets drift calcify.

Trade-offs considered:
- Keep it "just in case" → rejected; the file it mirrored
  (`/usr/share/postfix/main.cf.dist`) is already canonical upstream and
  reproducible inside any docker-mailserver container.
- Move it to a comment block → rejected; same noise cost, no value
  over deletion (authoritative source is in the image).

## This change

Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }`
block (136 lines incl. trailing blank). No other variable touched, no
resource touched, no comment elsewhere touched. `variables.tf` now
contains only the single live variable `postfix_cf` that is actually
consumed by the module.

## What is NOT in this change

- No Terraform state modification — variable was never read, so state
  has no record of it.
- No Postfix runtime behaviour change — `postfix_cf` (the live one) is
  untouched.
- No fix for the pre-existing `kubernetes_deployment.mailserver` /
  `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces
  independently. Those 2 in-place updates are known and tracked
  separately; this commit explicitly avoids conflating cleanup with
  drift resolution.
- No apply needed — pure source hygiene.

## Test Plan

### Automated

Reference check before edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" {
```
(single match — the declaration itself)

Reference check after edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
(no matches)
```

`terragrunt validate` (from `infra/stacks/mailserver/`):
```
Success! The configuration is valid, but there were some
validation warnings as shown above.
```
(warnings are pre-existing `kubernetes_namespace` → `_v1` deprecation
notices, unrelated)

`terragrunt plan` (from `infra/stacks/mailserver/`):
```
  # module.mailserver.kubernetes_deployment.mailserver will be updated in-place
  # module.mailserver.kubernetes_service.mailserver will be updated in-place
Plan: 0 to add, 2 to change, 0 to destroy.
```
Both in-place updates are the known pre-existing drift
(volume_mount ordering + stale `metallb.io/ip-allocated-from-pool`
annotation). No change is attributable to this commit — the dead
variable was never referenced, so removing it leaves state untouched.

### Manual Verification

1. `cd infra/stacks/mailserver/modules/mailserver/`
2. `grep -c postfix_cf_reference variables.tf` → expected `0`
3. `wc -l variables.tf` → expected `39` (was `175`; 136 lines removed
   including the trailing blank after the EOT)
4. Open `variables.tf` → expected: only `variable "postfix_cf"` remains
5. `cd ../..` (stack root) → `terragrunt validate` → expected:
   `Success! The configuration is valid`
6. `terragrunt plan` → expected: `Plan: 0 to add, 2 to change, 0 to
   destroy.` (the 2 are the pre-existing drift, not from this commit).

Closes: code-o3q

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:05:44 +00:00
root
1990ee7f8d Woodpecker CI Update TLS Certificates Commit 2026-04-19 00:02:53 +00:00