Compare commits

63 commits

Author SHA1 Message Date
Viktor Barzin
eb6ceac5f5 [dns] static-client DNS — Proxmox host, registry VM dual-resolver setup (WS F)
Fixes single-upstream DNS brittleness on non-DHCP hosts. Each host now
has a primary internal resolver + external fallback (AdGuard) so DNS
keeps working if the primary resolver IP is unreachable.

New config:

- Proxmox host (192.168.1.127): plain /etc/resolv.conf with
  nameserver 192.168.1.2 (pfSense LAN) + 94.140.14.14 (AdGuard).
  Previously: single nameserver 192.168.1.1 — could not resolve
  internal .lan names at all. Documented in
  docs/runbooks/proxmox-host.md.

- Registry VM (10.0.20.10): systemd-resolved drop-in at
  /etc/systemd/resolved.conf.d/10-internal-dns.conf
  (DNS=10.0.20.1, FallbackDNS=94.140.14.14, Domains=viktorbarzin.lan)
  plus matching per-link nameservers in /etc/netplan/50-cloud-init.yaml.
  Previously: 1.1.1.1 + 8.8.8.8 only — image pulls referencing .lan
  hostnames would fail to resolve. Documented in
  docs/runbooks/registry-vm.md.

- TrueNAS (10.0.10.15): host unreachable during this session
  ("No route to host" on 10.0.10.0/24). Deferred best-effort per
  WS F instructions; noted on the beads task.

Both hosts have pre-change backups at /root/dns-backups/ for
one-command rollback. Fallback behaviour was validated by routing
each primary to a blackhole and confirming dig answered from the
fallback.

Both runbooks include the verified resolvectl / resolv.conf state,
the fallback-test procedure, and the rollback steps.

Closes: code-dw8
2026-04-19 15:43:49 +00:00
Viktor Barzin
3b54983a9f [ci] build-cli: add logins entry for registry.viktorbarzin.me:5050
## Context
The infra CLI image (`viktorbarzin/infra` + `registry.viktorbarzin.me:5050/infra`)
is built by `.woodpecker/build-cli.yml` via plugin-docker-buildx and
pushed to two repos. The private-registry htpasswd auth that went in
on 2026-03-22 (memory 437) was never wired into this pipeline, so the
second push has been failing with `401 Unauthorized` on every blob
HEAD for ~4 weeks. That in turn kept every infra pipeline's overall
status at `failure`, which fooled the service-upgrade agent into
spurious rollbacks before the per-workflow check in bd code-3o3.

Now that the agent ignores overall status, this is purely cosmetic —
but worth fixing so the pipeline list goes green and the private-
registry mirror of the infra CLI image stays fresh.

## This change
Extend the plugin's `logins:` array with an entry for
`registry.viktorbarzin.me:5050`, pulling credentials from two
Woodpecker global secrets `registry_user` / `registry_password`.

Secrets plumbing (no CI config changes needed long-term — already
`vault-woodpecker-sync` compatible):
- Vault `secret/ci/global` now carries `registry_user` +
  `registry_password`, copied from `secret/viktor` via
  `vault kv patch`.
- `vault-woodpecker-sync` CronJob picks them up on next run and
  POSTs them to Woodpecker via the API. Also triggered manually
  as `manual-sync-1776613321` → "Synced 8 global secrets from
  Vault to Woodpecker".
- `curl -H "Authorization: Bearer <wp-api-token>" .../api/secrets`
  now lists both `registry_user` and `registry_password`.

## What is NOT in this change
- A follow-on cleanup of the `docker_username`/`docker_password`
  globals (which are actually DockerHub creds mis-named). They still
  work — renaming would cascade across several older pipelines.
- Restoring inline BuildKit cache — commit 0c123903 disabled
  `cache_from/cache_to` due to registry cache corruption; leaving
  that alone here.

## Test Plan
### Automated
Will be validated by the CI run of this very commit:
- `build-cli` workflow should log `#14 [auth] viktor/registry.viktorbarzin.me:5050` successful
- blob HEAD returns 200/404 instead of 401
- step `build-image` exits 0
- overall pipeline status: success (FINALLY)

### Manual Verification
```
$ curl -sS -H "Authorization: Bearer $(vault kv get -field=woodpecker_api_token secret/ci/global)" \
    https://ci.viktorbarzin.me/api/secrets | jq '.[] | .name' | grep registry
"registry_password"
"registry_user"

$ curl -sSI -u viktor:$PASS https://registry.viktorbarzin.me:5050/v2/infra/manifests/<8-char-sha>
HTTP/2 200
```

Closes: code-12b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:42:52 +00:00
Viktor Barzin
364df9f2ea [dns] readiness gate — replace auth-required zone-count probe with DNS parity check
Zone-count parity required hitting /api/zones/list which requires auth. The
null_resource has no access to the Technitium admin password (it's declared
`sensitive = true` on the module variable), so we were probing with an empty
token and getting 200 OK with an error JSON — silently returning 0 zones for
every instance.

Replaced the HTTP probe with a second DNS check: dig idrac.viktorbarzin.lan
on each pod, require the same A record from all three. This catches both
"zone not loaded on an instance" and "zone drift between primary and
replicas" without needing any HTTP client or credentials. The AXFR chain
guarantees all three should converge on the same value.
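
The parity rule above can be sketched as follows. This is a minimal illustration, assuming the three `dig +short` answers were already collected into variables; the function name `answers_agree` and the sample addresses are hypothetical and not taken from the actual null_resource.

```shell
#!/bin/sh
# answers_agree A1 A2 A3
# Succeeds only when all three answers are non-empty and identical.
answers_agree() {
    [ -n "$1" ] || return 1          # empty answer: zone not loaded or pod unreachable
    [ "$1" = "$2" ] || return 1      # drift between first and second instance
    [ "$1" = "$3" ] || return 1      # drift between first and third instance
    return 0
}

# Illustrative values only (documentation range, not the real record).
if answers_agree "192.0.2.10" "192.0.2.10" "192.0.2.10"; then
    echo "parity OK"
else
    echo "parity FAILED"
fi
```

Comparing raw answers needs no credentials, which is the property the commit relies on.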

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:24:56 +00:00
Viktor Barzin
f09be1524d monitoring: split income_tax cash/RSU + add P60 & HMRC reconciliation panels
Panel 7 (YTD uses): replace the single `ytd_income_tax` stack segment
with two — `ytd_cash_income_tax` (full red, same color as before) and
`ytd_rsu_income_tax` (desaturated orange) — computed from the new
`cash_income_tax` column on payslip. RSU-vest months now visually
separate the cash tax from the PAYE attributable to the grossed-up
RSU, matching the user's mental model of "what I actually paid in cash tax".

Panel 8 (Sankey): split the single `Gross → Income Tax` edge into two
edges (`Gross → Income Tax (cash)` and `Gross → Income Tax (RSU)`)
sourcing the same two figures.

Panel 3 (effective rate): left untouched — it's the "all-in" rate and
keeps using raw `income_tax`.

Panel 9 (P60 reconciliation — new): per-tax-year table comparing HMRC
P60 annual figures against SUM(payslip) via LATERAL JOIN on
payslip_ingest.p60_reference. Threshold-coloured delta columns (|Δ|<1
green, 1-50 yellow, >50 red) surface missing months or parser drift.

Panel 10 (HMRC Tax Year Reconciliation — new): placeholder for the
hmrc-sync service (code scaffolded, awaiting HMRC prod approval to
activate). Queries `hmrc_sync.tax_year_snapshot`; renders empty until
that schema lands. Delta > £10 → red.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:36 +00:00
Viktor Barzin
91aa39ef96 [dns] readiness gate — reject all-zero zone counts as probe failure
The zone-count parity check was trivially passing when the ephemeral
curl pod failed to reach the Technitium web API: all three counts came
back as 0, UNIQ=1, gate claimed "PASSED". This happened during today's
DNS hardening apply when CoreDNS was in CrashLoopBackOff and the curl
pod couldn't resolve service names.

Added a MIN > 0 sanity check. Technitium always has built-in zones
(localhost, standard reverse PTRs), so a zero count means the probe
didn't reach the API, not that the instance truly has zero zones.
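
A rough sketch of the combined check, assuming the three per-instance counts are passed as arguments. The function name `zone_counts_ok` is hypothetical; only the UNIQ=1 and MIN > 0 conditions come from the description above.

```shell
#!/bin/sh
# zone_counts_ok C1 C2 C3
# Passes only when every count is the same (one unique value)
# AND the smallest count is greater than zero.
zone_counts_ok() {
    uniq_count=$(printf '%s\n' "$@" | sort -n | uniq | wc -l | tr -d ' ')
    min=$(printf '%s\n' "$@" | sort -n | head -n 1)
    [ "$uniq_count" -eq 1 ] || return 1   # counts differ between instances
    [ "$min" -gt 0 ] || return 1          # all-zero: the probe never reached the API
    return 0
}

# The old gate accepted this case because UNIQ was 1.
if zone_counts_ok 0 0 0; then
    echo "PASSED"
else
    echo "FAILED (probe did not reach the API, or counts differ)"
fi
```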

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:07 +00:00
Viktor Barzin
150f196095 [redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts
Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm
release so data can migrate via REPLICAOF during a future short maintenance
window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still
points at redis-node-{0,1}.

Architecture:
 - 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter
 - podManagementPolicy=Parallel + init container that writes fresh
   sentinel.conf on every boot by probing peer sentinels and redis for
   consensus master (priority: sentinel vote > role:master with slaves >
   pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM.
 - redis.conf `include /shared/replica.conf` — init container writes
   `replicaof <master> 6379` for non-master pods so they come up already in
   the correct role. No bootstrap race.
 - master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork
   COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn.
 - RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec.
 - PodDisruptionBudget minAvailable=2.

Also:
 - HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since
   Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes
   the sole client-facing path for all 17 consumers.
 - New Prometheus alerts: RedisMemoryPressure, RedisEvictions,
   RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong,
   RedisReplicasMissing. Updated RedisDown to cover both statefulsets
   during the migration.
 - databases.md updated to describe the interim parallel-cluster state.

Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status
up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded
into Prometheus and inactive.

Beads: code-v2b (still in progress — Phase 3-7 await maintenance window).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:05 +00:00
Viktor Barzin
6ee283c2f0 [docs] Document external-monitor opt-out mechanism in monitoring.md
The doc said monitors were created for everything in cloudflare_proxied_names,
but since the k8s-api discovery rewrite the ConfigMap is a fallback only.
Describe the opt-OUT semantics and how external_monitor=false on a factory
call translates to the sync script's skip annotation.
2026-04-19 15:19:06 +00:00
Viktor Barzin
af6574a006 [dns] Fix CoreDNS serve_stale syntax — 24h TTL, no refresh-mode arg
CoreDNS refused to load the new Corefile with `serve_stale 3600s 86400s`:

  plugin/cache: invalid value for serve_stale refresh mode: 86400s

serve_stale takes one DURATION and an optional refresh_mode keyword
("immediate" or "verify"), not two durations. Simplified to
`serve_stale 86400s` (serve cached entries for up to 24h when upstream
is unreachable). The new CoreDNS pods were CrashLoopBackOff; the two
old pods kept serving traffic so there was no outage, but the partial
apply left the cluster wedged with the bad ConfigMap.

Also collapses the inline viktorbarzin.lan cache block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:18:43 +00:00
Viktor Barzin
752f94ab8f [monitoring] Opt-out external monitor for family/mladost3/task-webhook/torrserver; drop r730
The `external-monitor-sync` script monitors any *.viktorbarzin.me ingress
by default (opt-OUT), so a missing annotation means "monitored."
Both ingress factories previously OMITTED the annotation when
`external_monitor = false`, which silently left monitors in place.

Fix: when the caller sets `external_monitor = false` explicitly, emit
`uptime.viktorbarzin.me/external-monitor = "false"` so the sync script
deletes the monitor. Keep the previous behavior (no annotation) for
callers that leave external_monitor null — otherwise 19 publicly-reachable
services with `dns_type="none"` would lose monitoring.

Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy)
to match the other two already-flagged services. Delete the r730 ingress
module entirely — the Dell server has been decommissioned.
2026-04-19 15:18:27 +00:00
Viktor Barzin
a0d770d9a7 [cluster-health] Expand to 42 checks, remove pod CronJob path
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
  readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
  monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
  +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
  to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
  lines) and the openclaw cluster_healthcheck CronJob (local CLI is
  now the single authoritative runner). Keep the healthcheck SA +
  Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
  the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
  the script first, refreshes the 42-check table, drops stale
  CronJob/Slack/post-mortem sections, documents the monorepo-canonical
  + hardlink layout. File is hardlinked to
  /home/wizard/code/.claude/skills/cluster-health/SKILL.md for
  dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:13:03 +00:00
Viktor Barzin
5ea079181f [dns] Technitium — raise memory limit to 2Gi (was 1Gi, originally 512Mi)
Primary was at 401Mi / 512Mi (78%) before the first bump; the plan's 1Gi
leaves enough headroom for normal operation but only a thin margin if blocklists or
cache grow. User escalated: OOM cascades are the exact failure mode that
causes user-visible DNS outages, so give a full 2x safety margin across all
three instances. Replicas currently use 124-155Mi steady-state so they have
enormous headroom at 2Gi — accepted for symmetry and future growth (OISD
blocklists, in-memory cache).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:08:04 +00:00
Viktor Barzin
a86a97deb7 [reverse-proxy] Fix gw.viktorbarzin.me — point at 192.168.1.1 via EndpointSlice
The TP-Link gateway was wired via ExternalName `gw.viktorbarzin.lan`, but
Technitium has no record for that name (the router isn't a DHCP client and
Kea DDNS never registers it), so the ingress backend returned NXDOMAIN and
the `[External] gw` Uptime Kuma monitor was permanently failing.

Factory now accepts `backend_ip` as an alternative to `external_name`: it
creates a selector-less ClusterIP Service + manual EndpointSlice pointing
at the given IP, bypassing cluster DNS entirely. Used for gw (192.168.1.1);
the old ExternalName path is retained for every other service.

Also add a direct `port` monitor for the router in uptime-kuma's
internal_monitors list so we can tell a Cloudflare/tunnel outage apart
from the router itself being down. Extended the internal-monitor-sync
script to handle non-DB monitor types (hostname + port fields).
2026-04-19 15:07:24 +00:00
Viktor Barzin
4b39fbb717 [dns] readiness gate — use dig-in-pod + retries, ephemeral curl pod for zone parity
Technitium pods don't ship wget/curl, only dig/nslookup. Switched the per-pod
health check from wget against /api to dig +short against 127.0.0.1. This
probes the actual DNS serving path, which is what we care about anyway.

Zone-count parity can't be done inside the Technitium pod (no HTTP client),
so it spawns a short-lived curlimages/curl pod via kubectl run --rm that
curls the three internal web services and exits.

Added retry loop on the dig check (6 × 10s) to tolerate zone-load delay after
a pod restart — viktorbarzin.lan is ~864KB and can take tens of seconds to
load into memory on a cold start.
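
The retry behaviour can be sketched with a small generic helper. This is an illustration under assumptions, not the gate's actual code: `retry` is a hypothetical name, and the command it wraps would be whatever performs the dig check.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the final one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

With the numbers from this commit, a call would look like `retry 6 10 check_zone_loaded`, where `check_zone_loaded` is a hypothetical wrapper around the in-pod dig.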

Relaxed the A-record regex to match any IPv4 rather than 10.x — records may
legitimately live outside that range.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:57:29 +00:00
Viktor Barzin
9a21c0f065 [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
  (secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
  was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, fail the job on any
  create error (was silently passing).

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
  forward, health_check on viktorbarzin.lan forward, serve_stale
  3600s/86400s on both cache blocks — pfSense flap no longer takes the
  cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.

**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
  equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
  TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
  CoreDNSForwardFailureRate.

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
  kubectl rollout status on all 3 deployments (180s), per-pod
  /api/stats/get probe, zone-count parity across the 3 instances.
  Fails the apply on any check fail. Override: -var skip_readiness=true.

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:53:41 +00:00
Viktor Barzin
a5e097088a [ci] Persist VAULT_TOKEN across Woodpecker step commands
## Context
Follow-up to commit 2eca011c (bd code-e1x). That commit attached the
`terraform-state` policy to the `ci` Vault role and propagated apply-
loop failures so the pipeline actually fails when a stack fails. On
the very first push to exercise it (pipeline 361), the platform apply
step died with:

  [vault] Starting apply...
  state-sync: ERROR — no Vault token and no age key at ~/.config/sops/age/keys.txt
  [vault] FAILED (exit 1)

Root cause: in Woodpecker's `commands:` list, each `- |` item runs in
a fresh shell. The dedicated "Vault auth" command was doing
`export VAULT_TOKEN=...`, but that export was lost by the time the
apply command ran. Tier-0 stacks depended on Vault Transit (via
`scripts/state-sync`), and Tier-1 stacks depend on
`vault read database/static-creds/pg-terraform-state` via `scripts/tg`
— both silently fell through to their "no Vault" error path.

This bug was latent before 2eca011c because the old apply loop
swallowed per-stack exit codes. Now that we surface them, the pipeline
fails honestly — but fails on every run. Fixing the missing token
propagation is the last mile.

## This change
- Pin `VAULT_ADDR` at the step's `environment:` level so every command
  inherits it without an explicit export.
- In the Vault auth command, assert the auth succeeded (non-empty,
  non-"null" token) then write the token to `~/.vault-token` with
  `umask 077`. `vault`, `scripts/tg`, and `scripts/state-sync` all
  fall through to `~/.vault-token` when `VAULT_TOKEN` env is unset.
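
The root cause and the fix can be reproduced locally with a small sketch. Each `sh -c` stands in for one `commands:` item; the variable `DEMO_TOKEN`, the value `s.example`, and the temp directory (used instead of a real home directory) are assumptions for illustration only.

```shell
#!/bin/sh
# Make sure nothing leaks in from the caller's environment.
unset DEMO_TOKEN

# Stand-in for $HOME so the sketch never touches a real ~/.vault-token.
demo_home=$(mktemp -d)

# Command 1: export a variable and also persist it to a file with tight permissions.
sh -c 'export DEMO_TOKEN=s.example; umask 077; printf "%s" "$DEMO_TOKEN" > "$1/.vault-token"' sh "$demo_home"

# Command 2: a fresh shell. The export is gone, the file is still there.
from_env=$(sh -c 'printf "%s" "${DEMO_TOKEN:-}"')
from_file=$(sh -c 'cat "$1/.vault-token"' sh "$demo_home")

echo "env:  '${from_env}'"
echo "file: '${from_file}'"

rm -rf "$demo_home"
```

The first value comes back empty and the second carries the token, which is why writing the file is sufficient and re-exporting is redundant.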

## What is NOT in this change
- A broader refactor to fold the multi-step chain into a single
  `- |` script — preserving the existing granular structure keeps
  individual step logs grep-friendly and failures localised.
- Restoring the VAULT_TOKEN export too — redundant once ~/.vault-token
  is written, and would need duplicating into each command anyway.

## Test Plan
### Automated
N/A (pure YAML change). Will be verified by the very next CI run —
the push creating this commit.

### Manual Verification
Watch `ci.viktorbarzin.me/repos/1/pipelines` for the pipeline whose
commit matches this one. Expected:
- `default` workflow exercises the auth + apply steps.
- Platform apply for `vault` stack runs state-sync decrypt → detects
  no drift (I applied locally already) → OK.
- Tier-1 stacks (if any in the diff): `vault read database/static-
  creds/pg-terraform-state` returns creds → apply runs.
- No "state-sync: ERROR" or "Cannot read PG credentials" errors.
- `default` workflow state: success.
- Overall pipeline status: still failure because `build-cli` is
  independently broken (bd code-12b); that's cosmetic.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:30:39 +00:00
Viktor Barzin
2eca011cc3 [ci,vault] Fix Tier-1 apply silently failing in Woodpecker
## Context
For weeks, every push to infra has resulted in a `build-cli` workflow
failure AND a `default` workflow success, but the `default` workflow's
"success" was a lie. Inside the apply-loop we were swallowing per-stack
failures with `set +e ... echo FAILED` and the step exited 0 regardless.

Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
    [servarr] Starting apply...
    ERROR: Cannot read PG credentials from Vault.
    Run: vault login -method=oidc
    [servarr] FAILED (exit 1)

Two root causes, two fixes here.

### 1. Vault `ci` role lacks Tier-1 PG backend creds

The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.

**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.

### 2. Apply-loop swallows stack failures

`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.

**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.
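
A minimal sketch of the accumulate-then-fail pattern, assuming a hypothetical `apply_stack` stand-in for the real `tg apply` call. It leaves out the lock-skip handling and the `.platform_failed` hand-off between steps.

```shell
#!/bin/sh
# apply_stack is a hypothetical stand-in for `tg apply` on one stack.
# Here any stack whose name starts with "bad" fails.
apply_stack() {
    case "$1" in
        bad*) return 1 ;;
        *)    return 0 ;;
    esac
}

# run_all STACK...
# Applies every stack, remembers the failures, and only then reports status.
run_all() {
    FAILED_STACKS=""
    for stack in "$@"; do
        if apply_stack "$stack"; then
            echo "[$stack] OK"
        else
            echo "[$stack] FAILED"
            # Append with a separating space only when the list is non-empty.
            FAILED_STACKS="${FAILED_STACKS}${FAILED_STACKS:+ }${stack}"
        fi
    done
    if [ -n "$FAILED_STACKS" ]; then
        echo "=== FAILED STACKS === $FAILED_STACKS"
        return 1
    fi
    return 0
}
```

Every stack still runs and logs, so one bad stack does not hide the others, but the final exit status reflects any failure.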

Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.

## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
  TF file is already at 5.1.4 in git; once CI picks up this commit
  it'll apply on its own, or Viktor can run `tg apply` locally now
  that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
  per-stack continuation so a single bad stack doesn't hide the
  others' plans from the log. Just making the final status honest.

## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
  # vault_kubernetes_auth_backend_role.ci will be updated in-place
  ~ token_policies = [
      + "terraform-state",
        # (1 unchanged element hidden)
    ]
  # vault_jwt_auth_backend.oidc will be updated in-place
  ~ tune = [...]    # cosmetic provider-schema drift, pre-existing

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.

### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
    SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
    TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
          -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
    curl -s -H "X-Vault-Token: $TOK" \
      http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}

# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```

Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
Viktor Barzin
2431c6d5fe [reverse-proxy] ha-sofia per-service retry + ServersTransport
Adds a ha-sofia-retry Middleware (attempts=3, initialInterval=100ms)
and ha-sofia-transport ServersTransport (dialTimeout=500ms) wired into
ha-sofia + music-assistant ingresses. Absorbs the 67-156ms connect/DNS
stalls that were surfacing as 18 x 502s/day without disturbing the
global 2-attempt retry or Immich's 60s dialTimeout. depends_on the new
manifests to avoid the dangling-reference pattern from the 2026-04-17
Traefik P0.

Closes: code-rd1
2026-04-19 14:07:07 +00:00
Viktor Barzin
947f1bd75d [monitoring] UK Payslip v3.2 — stacked YTD panels, YTD-cumulative rate, Sankey
Three changes:

1. Split panel 1 (YTD overlay of 6 non-additive lines) into two accounting-
   clean stacked-area panels side-by-side:
   - "YTD sources": salary + bonus + rsu_vest + residual (= gross)
   - "YTD uses": net + income_tax + NI + pension_employee + student_loan
     + rsu_offset (= gross, per validate_totals identity)
   Green for take-home, red/orange for taxes, purple for pension, teal
   for RSU offset — visually encodes "what you earned vs what was taken".

2. Panel 3 effective rate switched from per-slip attribution to YTD
   cumulative (SUM OVER w / SUM OVER w). Kills the vest-month >100% spike:
   the old SQL subtracted `rsu_vest × ytd_avg_rate` from income_tax, but
   Meta's variant-C grossup means actual RSU tax is on `rsu_grossup × top
   marginal`, not rsu_vest × average. Cumulative approach blends both
   proportionally, no attribution hack needed. Also adds a third series:
   all-deductions rate ((income_tax + NI + student_loan) / gross).

3. New panel 8 — Sankey (netsage-sankey-panel) showing sources → Gross →
   uses over the selected time range. Plugin added to grafana Helm values.
2026-04-19 13:42:27 +00:00
Service Upgrade Agent
55ade1f9b3 [servarr] Fix qbittorrent container_port 8787 -> 8080 (matches WEBUI_PORT)
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-19 13:37:44 +00:00
Viktor Barzin
3b4a059243 [uptime-kuma] Fix broken Redis monitor + move to TF-managed list
The Redis monitor (id=53) was created manually with a connection string
pointing at redis-master.redis-headless.redis.svc.cluster.local, which
doesn't resolve — headless only exposes pod DNS (redis-node-N.redis-headless),
not a synthetic "redis-master" name. Status had been DOWN with ENOTFOUND
for weeks.

Declare it in local.internal_monitors using redis-master.redis.svc.cluster.local
(the HAProxy-fronted ClusterIP that already routes to the Sentinel-elected
master). Verified RESP PING through HAProxy returns PONG.

Tighten intervals to 60s / 30s retry / 3 retries — Redis is core (Paperless,
Immich, Authentik, Dawarich all depend on it), a 5-minute detection window
was way too loose given the blast radius.

Also teach the sync CronJob to handle no-password monitors (auth disabled
on the Bitnami chart), via an optional database_password_vault_key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:28:36 +00:00
Service Upgrade Agent
094bc727d4 upgrade: qbittorrent 5.0.4 -> 5.1.4
Changelog summary: Minor version bump; patch releases update external Alpine packages and restore qbittorrent-cli openssl3 support.
Risk: SAFE
Breaking changes: none
DB backup: no (not DB-backed)
Config changes applied: none
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-19 13:26:15 +00:00
Viktor Barzin
26ef97d294 [claude-agent-service] Add WOODPECKER_API_TOKEN + SLACK_WEBHOOK_URL env vars
## Context
Companion fix to 2026-04-19's service-upgrade spec refactor. The agent
pod has no Vault CLI auth (no VAULT_TOKEN, port 8200 refused), so every
`vault kv get` in the spec returned empty:
  - `WOODPECKER_TOKEN=""` → 401 on /api/repos/1/pipelines → agent can't
    find its pipeline → 15m poll timeout → rollback loop → >30m cap.
  - `SLACK_WEBHOOK=""` → webhook POST to empty URL → no Slack messages
    for 3+ days (the surface symptom that kicked off bd code-3o3).

## This change
Extends the `claude-agent-secrets` ExternalSecret with two more keys,
making them available to the agent via `envFrom`:
  - `WOODPECKER_API_TOKEN` ← `secret/ci/global.woodpecker_api_token`
    (already used by the vault-woodpecker-sync CronJob, same key)
  - `SLACK_WEBHOOK_URL` ← `secret/viktor.alertmanager_slack_api_url`
    (shared webhook also consumed by Alertmanager)

Pairs with commit a5963169 which refactored service-upgrade.md to read
these env vars directly instead of shelling out to `vault kv get`.

## What is NOT in this change
- REGISTRY_USER / REGISTRY_PASSWORD — not needed on the agent side.
  The separate `.woodpecker/build-cli.yml` fix (bd code-3o3 fix C)
  will add those to `secret/ci/global` for the vault-woodpecker-sync
  CronJob to publish as Woodpecker secrets, not here.

## Test Plan
### Automated
`terraform plan` reported `Plan: 0 to add, 2 to change, 0 to destroy`
(ExternalSecret + a cosmetic `tier` label drop on the Deployment).
Applied cleanly.

### Manual Verification
```
$ kubectl -n claude-agent get externalsecret claude-agent-secrets \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
secret synced

$ kubectl -n claude-agent exec deploy/claude-agent-service -- sh -c \
    'echo "WP=${WOODPECKER_API_TOKEN:0:20}... SLACK=${SLACK_WEBHOOK_URL:0:40}..."'
WP=eyJhbGciOiJIUzI1NiIs... SLACK=https://hooks.slack.com/services/T02SV75...

$ kubectl -n claude-agent rollout status deploy/claude-agent-service
deployment "claude-agent-service" successfully rolled out
```

Next step: fire one synthetic DIUN webhook to confirm the agent reaches
Slack + lands a commit + exits cleanly, completing code-3o3.

Refs: bd code-3o3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:23:12 +00:00
Viktor Barzin
83f4a72b6f [redis] Raise master+replica memory 256Mi → 512Mi
256Mi was tight once the working set crossed ~200Mi: a BGSAVE fork
during replica full PSYNC doubled master RSS via COW and pushed it
past the limit, OOMing (exit 137) in a loop. HAProxy flapped, every
client (Paperless, Immich, Authentik, Dawarich) saw session store
failures → 500s on authenticated requests.

512Mi gives ~2x headroom on the current 204Mi RDB.

Closes: code-n81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:18:30 +00:00
Viktor Barzin
a5963169ec [service-upgrade] Drop vault-CLI assumptions + check default workflow only
## Context
Since the 2026-04-15 migration from SSH-on-DevVM to in-cluster
claude-agent-service, the agent spec's four `vault kv get ...` calls
have been dead code: the pod has no `VAULT_TOKEN`, no `~/.vault-token`,
no Vault login method, and port 8200 is refused. Every token fetch
returns empty, which silently breaks:

- **Slack**: `SLACK_WEBHOOK=""` → POSTs 404 → no messages for 3+ days
  (the exact user-visible symptom that started this thread).
- **Woodpecker CI polling**: `WOODPECKER_TOKEN=""` → 401 on
  `/api/repos/1/pipelines` → agent can't find its own pipeline → 15-min
  poll times out → jumps to rollback → same failure in the revert → hits
  n8n's 30-min ceiling → SIGKILL mid-saga → no commit, no Slack.
- **Changelog fetch**: `GITHUB_TOKEN=""` overrides the env var supplied
  by `envFrom: claude-agent-secrets`, crippling changelog lookups too.

Separately, Step 9 read the overall pipeline `status`, which is
`failure` any time a single workflow fails — e.g. the unrelated
`build-cli` workflow (docker image push to registry.viktorbarzin.me:5050
has been erroring since private-registry htpasswd was enabled on
2026-03-22). That made the agent spuriously roll back every
otherwise-successful upgrade.

## This change
- Replace the four `vault kv get ...` invocations with the matching
  env-var reads (`$GITHUB_TOKEN`, `$WOODPECKER_API_TOKEN`,
  `$SLACK_WEBHOOK_URL`) and document the env-var contract at the top
  of the "Environment" section. The env vars are expected to be
  pre-loaded via `envFrom: claude-agent-secrets` — that part is tracked
  as the companion ExternalSecret/Terraform change in bd code-3o3
  (must land before this spec is effective).
- Rewrite Step 9 to poll the `default` workflow's `state` instead of
  the overall pipeline `status`. Adds a jq example and explicitly
  documents the build-cli noise so future operators know why overall
  status is unreliable.
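
The Step 9 rule can be sketched as follows. This is a Python illustration rather than the jq snippet in the spec, and the response shape (a `workflows` array whose entries carry `name` and `state`) is an assumption based on the description above; the sample body is hypothetical.

```python
import json
from typing import Optional


def default_workflow_state(pipeline: dict, name: str = "default") -> Optional[str]:
    """Return the state of the named workflow, or None if it is absent.

    The overall pipeline status is deliberately ignored, since one
    unrelated failing workflow (e.g. build-cli) marks it as failed.
    """
    for workflow in pipeline.get("workflows") or []:
        if workflow.get("name") == name:
            return workflow.get("state")
    return None


# Hypothetical response body: overall status is "failure" only because
# the unrelated build-cli workflow failed.
sample = json.loads("""
{
  "number": 42,
  "status": "failure",
  "workflows": [
    {"name": "default", "state": "success"},
    {"name": "build-cli", "state": "failure"}
  ]
}
""")

print(default_workflow_state(sample))
```

Returning `None` for a missing workflow lets the caller keep polling instead of treating "not started yet" as a failure.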

## What is NOT in this change
- The matching ExternalSecret / Terraform changes that feed
  WOODPECKER_API_TOKEN / SLACK_WEBHOOK_URL / REGISTRY_USER /
  REGISTRY_PASSWORD into the pod. Until those land, this spec still
  produces empty env vars at runtime — but at least the *shape* of the
  contract is correct and grep-friendly.
- The .woodpecker/build-cli.yml `logins:` entry for
  registry.viktorbarzin.me:5050. That's fix C in the same task.

## Test Plan
### Automated
None — this is pure markdown guidance for the model. Syntax-checked by
`grep -nE 'vault kv get|WOODPECKER_TOKEN|SLACK_WEBHOOK[^_]'
.claude/agents/service-upgrade.md` showing only the explanatory
warning on line 37 as a match.

### Manual Verification
After the companion ExternalSecret change lands and the pod has
WOODPECKER_API_TOKEN + SLACK_WEBHOOK_URL in env:
1. Trigger a DIUN-style webhook on a known slow service.
2. Watch `kubectl -n claude-agent logs -f deploy/claude-agent-service`.
3. Expect curl to `ci.viktorbarzin.me/api/...` to return 200 with
   pipeline JSON (no 401), and the POST to `$SLACK_WEBHOOK_URL` to
   return 200.
4. Expect a Slack `[Upgrade Agent] Starting:` post inside the first
   minute, and a `SUCCESS` or `FAILED + ROLLED BACK` post on exit.

Refs: bd code-3o3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:15:06 +00:00
Viktor Barzin
13cc5d956e [monitoring] UK Payslip dashboard v3.1 — add YTD reconciliation panel
Adds panel 6 that reconciles each payslip's reported YTD summary block
(ytd_gross, ytd_taxable_pay, ytd_tax_paid) against the cumulative sum
of extracted per-payslip values within the same tax year. Any Δ > £0.02
flags a parser regression, missing slip, or duplicate ingest — the
algebraic companion to the existing missing-months panel.

Variant A payslips (pre-mid-2022) carry no YTD block and are filtered
out via WHERE ytd_gross IS NOT NULL.
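
The reconciliation logic can be sketched in Python. This is not the dashboard's SQL; the dict keys and function name are hypothetical, and only `ytd_gross` is shown (the same check would apply to `ytd_taxable_pay` and `ytd_tax_paid`).

```python
from decimal import Decimal
from itertools import accumulate

TOLERANCE = Decimal("0.02")  # threshold quoted in the commit message


def reconcile_ytd(slips):
    """Yield (period, reported, computed, delta) for slips whose reported
    YTD gross differs from the running sum by more than TOLERANCE.

    `slips` must be ordered by pay period within a single tax year; each
    is a dict with hypothetical keys "period", "gross" and "ytd_gross".
    """
    running = accumulate(Decimal(s["gross"]) for s in slips)
    for slip, computed in zip(slips, running):
        if slip["ytd_gross"] is None:  # Variant A: no YTD block to compare
            continue
        reported = Decimal(slip["ytd_gross"])
        delta = abs(reported - computed)
        if delta > TOLERANCE:
            yield slip["period"], reported, computed, delta


# Made-up data: the June slip is missing, so July's reported YTD is ahead.
slips = [
    {"period": "2024-04", "gross": "5000.00", "ytd_gross": "5000.00"},
    {"period": "2024-05", "gross": "5000.00", "ytd_gross": "10000.00"},
    {"period": "2024-07", "gross": "5000.00", "ytd_gross": "20000.00"},
]
flagged = list(reconcile_ytd(slips))
print(flagged)
```

`Decimal` is used so that pence-level comparisons are not disturbed by binary floating-point rounding.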
2026-04-19 13:12:57 +00:00
Viktor Barzin
581aed5fcc [openclaw,tor-proxy] Opt task-webhook + torrserver out of external monitoring
Adds `external_monitor = false` to the ingress_factory calls for
task-webhook and torrserver so the `external-monitor-sync` CronJob
stops auto-creating `[External] <name>` monitors for them. Both
services remain deployed and reachable; only the Uptime Kuma monitors
are dropped.
2026-04-19 13:01:36 +00:00
Viktor Barzin
ac95973b38 [monitoring] UK Payslip dashboard v3 — consolidate to 5 panels + data-integrity check
Collapse from 11 panels to 5. New hero "Tax-year YTD — gross / net /
taxes / RSU / salary" merges the old YTD cumulative + total-comp +
earnings-breakdown panels into a single line chart (tax-band thresholds
still on ytd_cash_gross). New "Data integrity" table surfaces missing
months and zero-salary anomalies at a glance — catches the 2024-02 gap
(Paperless doc never uploaded) and any future parser regressions.

Monthly cash flow, effective-rate, and full payslip table kept as-is.

Total dashboard height: 39 rows (was ~67). No parser / schema changes.

[ci skip]
2026-04-19 12:47:44 +00:00
Viktor Barzin
4ca793380b [multi] Sweep Kyverno wait-for redis annotations to redis-master
Replaces `redis.redis:6379` with `redis-master.redis:6379` in all 11
dependency.kyverno.io/wait-for annotations across 8 stacks, plus one
docs comment in the Kyverno module.

These annotations drive DNS-only `nc -z` init-container readiness
checks — zero RW risk. Both hostnames resolve, so there is no wait-for
failure window during the rolling re-apply.

Closes: code-otr
2026-04-19 12:44:46 +00:00
Viktor Barzin
12a372bf92 [redis] Migrate live RW consumers off bare redis.redis hostname
Completes the T0 hostname migration. The `redis.redis` service is a
legacy alias that routes to HAProxy via a `null_resource` selector
patch; `redis-master.redis` is the canonical name that has always
routed to HAProxy directly and health-checks master-only.

Changes:
- redis-backup CronJob: redis-cli BGSAVE + --rdb now target
  redis-master.redis. BGSAVE runs on the master (what we want).
- config.tfvars `resume_redis_url`: unused fallback updated for
  grep hygiene; nothing reads it today.
- ytdlp REDIS_URL default: updated for dev-local runs; production
  already sets REDIS_URL via main.tf:283-285 → var.redis_host.
- immich chart_values.tpl REDIS_HOSTNAME: dead Helm template (values
  block commented out in main.tf:524, Immich deploys as raw
  kubernetes_deployment using var.redis_host). Updated to keep the
  file consistent if someone ever revives it.
2026-04-19 12:42:36 +00:00
Viktor Barzin
e6e5fc5f17 [docs] Mailserver architecture — richer diagrams + steady-state accuracy [ci skip]
## Context

After code-yiu Phases 1a–6 landed, `docs/architecture/mailserver.md` still
carried the pre-HAProxy Mermaid diagram, a retired Dovecot-exporter
component row, stale PVC names (`-proxmox` suffixes that were renamed
`-encrypted` during the LUKS migration), a wrong probe schedule
(claimed 10 min, actually 20 min), and a Mailgun-API claim for the
probe (it's been on Brevo since code-n5l). The two-path architecture
(external-via-HAProxy + intra-cluster-via-ClusterIP) that defines the
current design wasn't visualised at all.

## This change

Rewrote the Architecture Diagram section to show **both ingress paths
in one Mermaid flowchart**, colour-coded:

- External (orange): Sender → pfSense NAT → HAProxy → NodePort →
  **alt PROXY listeners** (2525/4465/5587/10993).
- Intra-cluster (blue): Roundcube / probe → ClusterIP Service →
  **stock listeners** (25/465/587/993), no PROXY.
- The pod subgraph shows both listener sets feeding the same Postfix /
  Rspamd / Dovecot / Maildir pipeline.
- Security dotted edges: Postfix log stream → CrowdSec agent →
  LAPI → pfSense bouncer decisions.
- Monitoring dotted edges: probe → Brevo HTTP → MX → pod → IMAP →
  Pushgateway/Uptime Kuma.

Added a **sequenceDiagram** for the external SMTP roundtrip — walks
through the wire-level handshake from external MTA → pfSense NAT →
HAProxy TCP connect → PROXY v2 header write → kube-proxy SNAT → pod
postscreen parse → smtpd banner. Makes the "how does the pod see the
real IP despite SNAT?" question self-answering.
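
To make the header step concrete, here is a minimal Python sketch of a PROXY v2 header for TCP over IPv4, following the published protocol layout (12-byte signature, version/command byte, family/protocol byte, 2-byte length, then addresses and ports). In the real path HAProxy writes the header and postscreen parses it; the function names here are illustrative, and the addresses are taken from the log lines quoted elsewhere in this log.

```python
import socket
import struct

# 12-byte signature that opens every PROXY protocol v2 header.
PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"


def build_proxy_v2_tcp4(src_ip, src_port, dst_ip, dst_port):
    """Build a PROXY v2 header for a proxied TCP-over-IPv4 connection."""
    addresses = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!HH", src_port, dst_port)
    )
    return (
        PP2_SIGNATURE
        + bytes([0x21])  # version 2, command PROXY
        + bytes([0x11])  # address family AF_INET, protocol STREAM
        + struct.pack("!H", len(addresses))  # length of what follows (12)
        + addresses
    )


def parse_proxy_v2_tcp4(header):
    """Recover (src_ip, src_port, dst_ip, dst_port) from such a header."""
    if header[:12] != PP2_SIGNATURE or header[12:14] != b"\x21\x11":
        raise ValueError("not a PROXY v2 TCP/IPv4 header")
    (length,) = struct.unpack("!H", header[14:16])
    body = header[16:16 + length]
    src_port, dst_port = struct.unpack("!HH", body[8:12])
    return (socket.inet_ntoa(body[0:4]), src_port,
            socket.inet_ntoa(body[4:8]), dst_port)


header = build_proxy_v2_tcp4("77.32.148.26", 36334, "10.0.20.1", 25)
print(len(header), parse_proxy_v2_tcp4(header))
```

Because the original source address travels inside the payload, it survives any later SNAT of the TCP connection itself.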

Added a **Port mapping table** listing all 8 container listeners (4
stock + 4 alt) with their Service, NodePort, PROXY-required flag, and
who uses each path. Replaces the ambiguous prose about "alt ports".

Fixed stale bits:
- Removed Dovecot Exporter row from Components (retired in code-1ik).
- Added pfSense HAProxy row.
- Probe schedule: every 10 min → **every 20 min** (`*/20 * * * *`).
- Probe API: Mailgun → **Brevo HTTP**.
- PVC names: `-proxmox` → **`-encrypted`** (all three); storage class
  `proxmox-lvm` → **`proxmox-lvm-encrypted`**.
- Added `mailserver-backup-host` + `roundcube-backup-host` RWX NFS
  PVCs to the Storage table with backup flow pointer.
- Expanded Troubleshooting → Inbound to include HAProxy health check
  + container-listener verification steps.
- Secrets table: `brevo_api_key` now marked as used by both relay +
  probe; `mailgun_api_key` marked historical.

Added a prominent **UPDATE 2026-04-19** header to
`docs/runbooks/mailserver-proxy-protocol.md` pointing future readers
at the implemented state in `mailserver-pfsense-haproxy.md`. Research
doc preserved as a decision record — it's the canonical "why not just
pin the pod?" reference.

## What is NOT in this change

- No Terraform changes; this is docs-only.
- No changes to the runbook (`mailserver-pfsense-haproxy.md`) — it was
  already rewritten during Phase 6.

## Test Plan

### Automated
```
$ awk '/^```mermaid/ {c++} END{print c}' docs/architecture/mailserver.md
2
$ grep -c '\-encrypted' docs/architecture/mailserver.md
5  # PVC references normalised
$ grep -c '\-proxmox' docs/architecture/mailserver.md
0  # no stale names left
```

### Manual Verification
Render `docs/architecture/mailserver.md` on GitHub or any Mermaid-
capable viewer:
1. Top Architecture Diagram should show two labelled paths into the
   pod, colour-coded (orange = external, blue = intra-cluster).
2. Sequence diagram should show 10 numbered steps ending at Rspamd +
   Dovecot delivery.
3. Port Mapping table should make it obvious that the 4 alt container
   ports are only reachable via `mailserver-proxy` NodePort and require
   PROXY v2.
2026-04-19 12:40:53 +00:00
Viktor Barzin
d5a47e35fc [redis] Restore dynamic DNS in HAProxy to fix stale-IP outage
HAProxy resolved `redis-node-{0,1}.redis-headless.redis.svc.cluster.local`
once at pod startup and cached the IPs forever. When redis-node pods
cycled (new pod IPs), HAProxy kept connecting to the dead IPs — backends
flapped between "Connection refused" and "Layer4 timeout", and Immich's
ioredis client hit EPIPE until max-retries exhausted and the pod entered
CrashLoopBackOff. This caused an Immich outage on 2026-04-19.

Fix:
- Add `resolvers kubernetes` stanza pointing at kube-dns (10s hold on
  every category so we pick up pod IP changes within a DNS TTL window).
- Add `resolvers kubernetes init-addr last,libc,none` to every backend
  server line so HAProxy resolves at startup AND uses the dynamic
  resolver for runtime refresh.
- Add `checksum/config` pod annotation to the HAProxy Deployment so a
  haproxy.cfg change actually rolls the pods (including this one).

Closes: code-fd6
2026-04-19 12:39:09 +00:00
Viktor Barzin
43fe11fffc [mailserver] Phase 6 — decommission MetalLB LB path [ci skip]
## Context (bd code-yiu)

With Phase 4+5 proven (external mail flows through pfSense HAProxy +
PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB
LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are
obsolete. Phase 6 decommissions them and documents the steady-state
architecture.

## This change

### Terraform (stacks/mailserver/modules/mailserver/main.tf)
- `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`.
- Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation.
- Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP).
- Port set unchanged — the Service still exposes 25/465/587/993 for
  intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
  CronJob) that hit the stock PROXY-free container listeners.
- Inline comment documents the downgrade rationale + companion
  `mailserver-proxy` NodePort Service that now carries external traffic.

### pfSense (ops, not in git)
- `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT
  rule references it post-Phase-4; keeping it would be misleading dead
  metadata. Reversible via WebUI + `php /tmp/delete-mailserver-alias.php`
  companion script (ad-hoc, not checked in — alias is just a
  Firewall → Aliases → Hosts entry).

### Uptime Kuma (ops)
- Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202`
  → `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` /
  `... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy
  layer liveness). History retained (edit, not delete-recreate).

### Docs
- `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten
  "Current state" section; now reflects steady-state architecture with
  two-path diagram (external via HAProxy / intra-cluster via ClusterIP).
  Phase history table marks Phase 6 done. Rollback section updated (no
  one-liner post-Phase-6; need Service-type re-upgrade + alias re-add).
- `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound
  flow, CrowdSec section, Uptime Kuma monitors list, Decisions section
  (dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY
  v2"), Troubleshooting all updated.
- `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph
  updated with new external path description; references the new runbook.

## What is NOT in this change

- Removal of `10.0.20.202` from `cloudflare_proxied_names` or any
  reserved-IP tracking — wasn't there to begin with. The
  `metallb-system default` IPAddressPool (10.0.20.200-220) shows 2
  assigned and 19 available after this, confirming `.202` went back to
  the pool.
- Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if
  someone re-introduces the MetalLB LB (see runbook "Rollback").

## Test Plan

### Automated (verified pre-commit 2026-04-19)
```
# Service is ClusterIP with no EXTERNAL-IP
$ kubectl get svc -n mailserver mailserver
mailserver   ClusterIP   10.103.108.217   <none>   25/TCP,465/TCP,587/TCP,993/TCP

# 10.0.20.202 no longer answers ARP (ping from pfSense)
$ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202'
2 packets transmitted, 0 packets received, 100.0% packet loss

# MetalLB pool released the IP
$ kubectl get ipaddresspool default -n metallb-system \
    -o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}'
2 of 19 available

# E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver
... Round-trip SUCCESS in 20.3s ...
$ kubectl delete job probe-phase6 -n mailserver

# pfSense mailserver alias removed
$ ssh admin@10.0.20.1 'php -r "..." | grep mailserver'
(no output)
```

### Manual Verification
1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new
   target `10.0.20.1`.
2. Roundcube login works (`https://mail.viktorbarzin.me/`).
3. Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe
   `postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in
   mailserver logs within ~10s.
4. CrowdSec should still see real client IPs in postfix/dovecot parsers
   (verify with `cscli alerts list` on next auth-fail event).

## Phase history (bd code-yiu)

| Phase | Status | Description |
|---|---|---|
| 1a  |  `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2   |  2026-04-19 | pfSense HAProxy pkg installed |
| 3   |  `ba697b02` | HAProxy config persisted in pfSense XML |
| 4+5 |  `9806d515` | 4-port alt listeners + HAProxy frontends + NAT flip |
| 6   |  **this commit** | MetalLB LB retired; 10.0.20.202 released; docs updated |

Closes: code-yiu
2026-04-19 12:36:11 +00:00
Viktor Barzin
9806d515dd [mailserver] Phase 4+5 — pfSense HAProxy cutover for all 4 mail ports [ci skip]
## Context (bd code-yiu)

Cutover of external mail traffic from the MetalLB LB IP path (ETP:Local,
pod-speaker colocation) to pfSense HAProxy + PROXY v2 (ETP:Cluster). Real
client IP now preserved end-to-end on ports 25/465/587/993, both for
postscreen anti-spam scoring and CrowdSec auth-failure bans.

## This change

### k8s (stacks/mailserver/modules/mailserver/main.tf)

- `mailserver-user-patches` ConfigMap's `user-patches.sh` now appends 3
  alt PROXY-speaking services to master.cf:
  - `:2525` postscreen (alt :25)
  - `:4465` smtpd (alt :465 SMTPS, wrappermode TLS)
  - `:5587` smtpd (alt :587 submission)
  All with `postscreen_upstream_proxy_protocol=haproxy` / `smtpd_upstream_proxy_protocol=haproxy`.
  Mirror stock submission/submissions options (SASL via Dovecot, TLS,
  client restrictions, mua_sender_restrictions). chroot=n so the SASL
  socket path `/dev/shm/sasl-auth.sock` resolves outside the chroot.
- `dovecot.cf` ConfigMap adds:
  ```
  haproxy_trusted_networks = 10.0.20.0/24
  service imap-login { inet_listener imaps_proxy { port=10993; ssl=yes; haproxy=yes } }
  ```
  Stock :993 stays PROXY-free for internal Roundcube/probe clients.
- Container ports: 3 new (4465, 5587, 10993); 2525 was already there.
- `mailserver-proxy` NodePort Service now exposes all 4 ports:
  25→2525→30125, 465→4465→30126, 587→5587→30127, 993→10993→30128
  (ETP:Cluster).

### pfSense (scripts/pfsense-haproxy-bootstrap.php)

Rebuilt to declare 4 backend pools (one per NodePort) and 4 production
frontends on `10.0.20.1:{25,465,587,993}` TCP mode, plus the legacy
`:2525` test frontend. All pools: `send-proxy-v2 check inter 120000`.
Idempotent — re-runs converge on declared state.

### pfSense (scripts/pfsense-nat-mailserver-haproxy-{flip,unflip}.php)

Flip script: updates `<nat><rule>` entries for mail ports from target
`<mailserver>` alias (10.0.20.202 MetalLB) → `10.0.20.1` (pfSense
HAProxy). Runs `filter_configure()` to rebuild pf rules. Unflip is the
rollback. Both scripts are idempotent.

## What is NOT in this change

- Phase 6 (decommission MetalLB LB path, downgrade mailserver Service
  from LoadBalancer to ClusterIP, free 10.0.20.202) — USER-GATED. Do
  NOT run until explicit approval.
- Legacy MetalLB `mailserver` LB still live on 10.0.20.202 with stock
  ETP:Local ports — functional backup path + consumed by internal
  clients that hit `mailserver.mailserver.svc.cluster.local` (routes
  via ClusterIP layer of the LB Service, bypassing ETP).
- Port :143 (plain IMAP) — no HAProxy frontend; stays on MetalLB via
  unchanged NAT rule.

## Test Plan

### Automated (verified pre-commit 2026-04-19)
```
# k8s container listens on all 8 ports
$ kubectl exec -c docker-mailserver deployment/mailserver -n mailserver \
    -- ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'
... all 8 listening ...

# pfSense HAProxy listens on all 5 (production + legacy test)
$ ssh admin@10.0.20.1 'sockstat -l | grep haproxy'
www  haproxy  49418  5   tcp4  *:25
www  haproxy  49418  6   tcp4  *:2525
www  haproxy  49418  10  tcp4  *:465
www  haproxy  49418  11  tcp4  *:587
www  haproxy  49418  12  tcp4  *:993

# Post-flip: pf rdr rules point at pfSense, not <mailserver>
$ ssh admin@10.0.20.1 'pfctl -sn' | grep 'smtp\|sub\|imap\|:25'
rdr on vtnet0 ... port = submission -> 10.0.20.1
rdr on vtnet0 ... port = imaps -> 10.0.20.1
rdr on vtnet0 ... port = smtps -> 10.0.20.1
rdr on vtnet0 ... port = 25 -> 10.0.20.1

# 4 HAProxy frontends reachable + SMTP/IMAP banners
$ python3 <test script> → SMTP/SMTPS/Sub/IMAPS all respond correctly

# Real client IP in maillog for external delivery via Brevo → MX
postfix/smtpd-proxy25/postscreen: CONNECT from [77.32.148.26]:36334 to [10.0.20.1]:25
postfix/smtpd-proxy25/postscreen: PASS NEW [77.32.148.26]:36334

# E2E probe (Brevo HTTP → external SMTP delivery → IMAP fetch) succeeds
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-yiu-flip -n mailserver
... Round-trip SUCCESS in 20.3s ...

# Internal Roundcube path unchanged
$ curl -sI https://mail.viktorbarzin.me/  →  302 (Authentik gate intact)

# No mail alerts firing
$ kubectl exec prometheus-server ... /api/v1/alerts | grep Email  →  (empty)
```

### Rollback
```
scp infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 'php /tmp/pfsense-nat-mailserver-haproxy-unflip.php'
```
Immediate (<2s). Flips all 4 NAT rdrs back to `<mailserver>` alias.
Pre-flip config snapshot also saved at
`/tmp/config.xml.pre-yiu-flip.20260419-1222` on pfSense.

## Phase roadmap (bd code-yiu)

| Phase | Status |
|---|---|
| 1a |  commit ef75c02f  — alt :2525 listener + NodePort |
| 2  |  2026-04-19      — HAProxy pkg installed on pfSense |
| 3  |  commit ba697b02 — HAProxy config persisted in pfSense XML |
| 4+5|  **this commit** — 4-port alt listeners + HAProxy frontends + NAT flip |
| 6  | ⏸ USER-GATED      — MetalLB LB decommission after 48h observation |
2026-04-19 12:24:50 +00:00
Viktor Barzin
702db75f84 [redis] Stabilise patch_redis_service trigger + document service naming
## Context

`null_resource.patch_redis_service` uses `triggers = { always = timestamp() }`,
so every `scripts/tg plan` on `stacks/redis` reports `1 to destroy, 1 to add`
even when nothing has changed. That noise buries real drift and
trains us to ignore redis-stack plans, which is exactly what you don't want
on a load-bearing patch.

The patch itself is still load-bearing (three consumers hard-code bare
`redis.redis.svc.cluster.local` — `stacks/immich/chart_values.tpl:12`,
`stacks/ytdlp/yt-highlights/app/main.py:136`, `config.tfvars:214` — plus
Bitnami's own sentinel scripts set `REDIS_SERVICE=redis.redis.svc.cluster.local`
and call it during pod startup). Removing the null_resource is a follow-up
(beads T0) once those consumers migrate to `redis-master.redis.svc`. For now
the goal is just: stop being noisy.

## This change

1. Replace the `always = timestamp()` trigger with two inputs that only change
   when re-patching is genuinely required:
   - `chart_version = helm_release.redis.version` — changes only on a Bitnami
     chart version bump, which is the one code path that rewrites the `redis`
     Service selector back to `component=node`.
   - `haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])`
     — changes only when HAProxy config is edited; aligned with the existing
     `checksum/config` annotation that rolls the Deployment on config change.

   Both attributes are known at plan time (verified against `hashicorp/helm`
   v3.1.1 provider binary). Rejected alternatives — `metadata[0].revision`
   (not exposed in the plugin-framework v3 rewrite), `sha256(jsonencode(values))`
   (readability unverified on v3), and `kubernetes_deployment.haproxy.id`
   (static `namespace/name`, never changes) — don't meet the bar.

2. Add a **Redis Service Naming** section to `AGENTS.md` that explicitly
   states the write/sentinel/avoid endpoints, so new consumers start from
   `redis-master.redis.svc` (the documented `var.redis_host`) and long-lived
   connections (PUBSUB, BLPOP, Sidekiq) route around HAProxy's `timeout
   client 30s` via the sentinel headless path. Uptime Kuma's Redis monitor
   already learned that lesson the hard way (memory id=748).
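
The trigger change in item 1 can be modelled in a few lines. This is a Python stand-in for the Terraform expressions, not the module's code: `patch_triggers` and `needs_replacement` are hypothetical names, and the config strings are placeholders.

```python
import hashlib


def patch_triggers(chart_version: str, haproxy_cfg: str) -> dict:
    """Mirror of the two trigger inputs. Both are pure functions of
    configuration, so repeated plans over unchanged inputs are equal."""
    return {
        "chart_version": chart_version,
        "haproxy_config": hashlib.sha256(haproxy_cfg.encode()).hexdigest(),
    }


def needs_replacement(old: dict, new: dict) -> bool:
    """A null_resource is replaced when its triggers map changes."""
    return old != new


cfg = "defaults\n  timeout client 30s\n"  # placeholder haproxy.cfg content
applied = patch_triggers("25.3.2", cfg)

print(needs_replacement(applied, patch_triggers("25.3.2", cfg)))        # same inputs
print(needs_replacement(applied, patch_triggers("25.4.0", cfg)))        # chart bump
print(needs_replacement(applied, patch_triggers("25.3.2", cfg + "#")))  # cfg edit
```

A `timestamp()` trigger, by contrast, yields a different value on every evaluation, so the comparison can never come out equal.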

## What is NOT in this change

- Deleting `null_resource.patch_redis_service` — still load-bearing (T0).
- Deleting `kubernetes_service.redis_master` — stays as the declared write API.
- Migrating any consumer off bare `redis.redis.svc` — T0 epic.
- Per-client sentinel migration — T1 epic.
- Retiring HAProxy — T2 epic (blocked on T1 + T3).

## Before / after

Before (steady state):
```
scripts/tg plan
Plan: 1 to add, 2 to change, 1 to destroy.
#   null_resource.patch_redis_service must be replaced
#     triggers = { "always" = "<timestamp>" } -> (known after apply)
```

After (steady state, post-apply):
```
scripts/tg plan
No changes. Your infrastructure matches the configuration.
```

After (chart version bump):
```
scripts/tg plan
#   null_resource.patch_redis_service must be replaced
#     triggers = { "chart_version" = "25.3.2" -> "25.4.0" }
```
— the trigger fires only when it actually needs to.

## Test Plan

### Automated

`scripts/tg plan` pre-change (confirms baseline noise):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
    ~ triggers = { # forces replacement
        ~ "always" = "2026-04-19T10:39:40Z" -> (known after apply)
      }
  }
Plan: 1 to add, 2 to change, 1 to destroy.
```

`scripts/tg plan` post-edit (confirms the one-time structural replacement):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
    ~ triggers = { # forces replacement
        - "always"         = "2026-04-19T10:39:40Z" -> null
        + "chart_version"  = "25.3.2"
        + "haproxy_config" = "989bca9483cb9f9942017320765ec0751ac8357ff447acc5ed11f0a14b609775"
      }
  }
```

Apply is deferred to the operator: the working tree for the same file also
contains an unrelated HAProxy DNS-resolvers fix (for today's Immich outage)
that needs its own review before the two roll out together. No
`scripts/tg apply` was run from this session.

### Manual Verification

Reproduce locally:
1. `cd infra/stacks/redis && ../../scripts/tg plan`
2. Before apply: expect `null_resource.patch_redis_service` to be replaced
   exactly once, with the trigger map transitioning from `{always = <ts>}`
   to `{chart_version, haproxy_config}`.
3. After apply: `../../scripts/tg plan` twice in a row must both report
   `No changes.` (excluding unrelated drift from other work-in-progress).
4. Cluster-side invariant (must hold pre- and post-apply):
   `kubectl -n redis get svc redis -o jsonpath='{.spec.selector}'`
   → `{"app":"redis-haproxy"}`
   `kubectl -n redis get svc redis-master -o jsonpath='{.spec.selector}'`
   → `{"app":"redis-haproxy"}`
5. Regression test for the trigger doing its job: bump `helm_release.redis.version`
   in a branch, `tg plan`, expect the null_resource to replace. Revert.
2026-04-19 12:17:52 +00:00
Viktor Barzin
ba697b02a2 [mailserver] Phase 2-3 — pfSense HAProxy bootstrap + runbook [ci skip]
## Context (bd code-yiu)

Phase 2 (HAProxy on pfSense) and Phase 3 (persist config in pfSense XML so
it lives in the nightly backup) of the PROXY-v2 migration. Test path only —
listens on pfSense 10.0.20.1:2525 → k8s node NodePort :30125 → pod :2525
postscreen. Real client IP verified in maillog
(`postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:...`), Phase 1a
container plumbing is already live (commit ef75c02f).

pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is captured daily by
`scripts/daily-backup.sh` (scp → `/mnt/backup/pfsense/config-YYYYMMDD.xml`)
and synced offsite to Synology. No new backup wiring needed — this commit
documents the fact + adds the reproducer script.

## This change

Two files, both additive:

1. `scripts/pfsense-haproxy-bootstrap.php` — idempotent PHP script that
   edits pfSense config.xml to add:
   - Backend pool `mailserver_nodes` with 4 k8s workers on NodePort 30125,
     `send-proxy-v2`, TCP health-check every 120000 ms (2 min).
   - Frontend `mailserver_proxy_test` listening on pfSense 10.0.20.1:2525
     in TCP mode, forwarding to the pool.
   Uses `haproxy_check_and_run()` to regenerate `/var/etc/haproxy/haproxy.cfg`
   and reload HAProxy. Removes existing items with the same name before
   adding, so repeat runs converge on declared state.

2. `docs/runbooks/mailserver-pfsense-haproxy.md` — ops runbook covering
   current state, validation, bootstrap/restore, health checks, phase
   roadmap, and known warts (health-check noise + bind-address templating).
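
The "remove items with the same name, then add" convergence described for the bootstrap script can be sketched as below. This is a Python illustration of the pattern only: the real script is PHP working on pfSense's config structure, and the item fields other than the two names from the commit are hypothetical.

```python
def upsert_by_name(items, declared):
    """Return `items` with any entry named like `declared` replaced by it.

    Dropping same-named entries first means a second run produces the
    same list as the first, whatever stale copy was there before.
    """
    kept = [item for item in items if item["name"] != declared["name"]]
    return kept + [declared]


existing = [
    {"name": "other_pool", "port": 8080},
    {"name": "mailserver_nodes", "port": 30999},  # stale definition
]
declared = {"name": "mailserver_nodes", "port": 30125}

once = upsert_by_name(existing, declared)
twice = upsert_by_name(once, declared)
print(once == twice)
```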

## What is NOT in this change

- Phase 4 (NAT rdr flip for :25 from `<mailserver>` → HAProxy) — deferred.
- Phase 5 (extend to 465/587/993 with alt listeners + Dovecot dual-
  inet_listener) — deferred.
- Terraform for pfSense HAProxy pkg install — not possible (no Terraform
  provider for pfSense pkg management). Runbook documents the manual
  `pkg install` command.

## Test Plan

### Automated
```
$ ssh admin@10.0.20.1 'pgrep -lf haproxy; sockstat -l | grep :2525'
64009 /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D
www  haproxy  64009 5 tcp4  *:2525  *:*

$ ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
    | awk 'NR>1 {print $4, $6}'
node1 2
node2 2
node3 2
node4 2        # all UP

$ python3 -c "
import socket; s=socket.socket(); s.settimeout(10)
s.connect(('10.0.20.1', 2525))
print(s.recv(200).decode())
s.send(b'EHLO persist-test.example.com\r\n')
print(s.recv(500).decode())
s.send(b'QUIT\r\n'); s.close()"
220-mail.viktorbarzin.me ESMTP
...
250-mail.viktorbarzin.me
250-SIZE 209715200
...
221 2.0.0 Bye

$ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver --tail=50 \
    | grep smtpd-proxy.*CONNECT
postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525
```

Real client IP `[10.0.10.10]` visible (not the k8s-node IP after kube-proxy
SNAT) → PROXY-v2 roundtrip confirmed.
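
A small Python sketch of how that check could be automated, assuming the postscreen log format shown above stays stable; the regular expression and function name are illustrative.

```python
import re

# Matches lines such as:
# postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525
CONNECT_RE = re.compile(
    r"postscreen: CONNECT from \[(?P<client>[^\]]+)\]:(?P<cport>\d+)"
    r" to \[(?P<local>[^\]]+)\]:(?P<lport>\d+)"
)


def connect_peer(line):
    """Return (client_ip, client_port) from a CONNECT line, else None."""
    match = CONNECT_RE.search(line)
    if match is None:
        return None
    return match["client"], int(match["cport"])


line = ("postfix/smtpd-proxy/postscreen: CONNECT from "
        "[10.0.10.10]:33010 to [10.0.20.1]:2525")
print(connect_peer(line))
```

Comparing the extracted address against the known k8s node IPs would distinguish a recovered client address from a SNAT'd one.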

### Manual Verification
Trigger a pfSense reboot; after boot, HAProxy should auto-restart from the
now-persisted config (`<enable>yes</enable>` in XML). Connection test above
should still work.

## Reproduce locally
1. `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/`
2. `ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'` → rc=OK
3. Run the `python3 -c '...'` SMTP roundtrip test above.
2026-04-19 12:07:47 +00:00
Viktor Barzin
602103ede1 [owntracks] Strip face avatar from hook payload + drop orphan PVC
Bundles two small follow-ups to the live bridge + port-fix work:

## Face avatar fix (dawarich-hook.lua)

After the Recorder ran in production for a while it began enriching
publish payloads with a `face` field — the base64-encoded user avatar
uploaded via the Recorder's web UI (~120 KB). Our Lua hook builds a
curl command that embeds the JSON payload as `-d '<payload>'`, which
hit `E2BIG` / `Argument list too long` (os.execute reason=code=7) on
Linux's `execve` argv limit (~128 KB). Every live POST stopped making
it to Dawarich, even though the HTTP POST from the phone to Owntracks
still returned 200 and the .rec write still happened.

Fix: `data.face = nil` before serializing. Dawarich doesn't use it
anyway (not persisted into any column — `raw_data` stored without it).
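
The effect of the fix can be illustrated in Python. This is a sketch, not the Lua hook: the payload is made up, the avatar is a stand-in string sized to exceed the limit, and the 128 KiB constant is the approximate per-argument figure cited above.

```python
import json

MAX_ARG_BYTES = 128 * 1024  # approximate per-argument limit cited above


def strip_face(payload: dict) -> dict:
    """Return a copy of the payload without the bulky `face` avatar."""
    return {key: value for key, value in payload.items() if key != "face"}


payload = {
    "_type": "location",
    "lat": 51.5074,
    "lon": -0.1278,
    "tst": 1776600238,
    "tid": "vb",
    "face": "A" * (130 * 1024),  # stand-in for a large base64 avatar
}

before = len(json.dumps(payload).encode())
after = len(json.dumps(strip_face(payload)).encode())
print(before > MAX_ARG_BYTES, after < MAX_ARG_BYTES)
```

Passing the body on stdin or from a file would also sidestep the argument-size limit, but dropping the unused field keeps the hook unchanged otherwise.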

Also upgraded the debug log: on failure we now emit
`dawarich-bridge: FAIL tst=... reason=... code=... cmd=...` so any
future variant of this problem (next big field surfaced upstream, etc.)
is one log tail away from a diagnosis.

```
$ kubectl -n owntracks logs deploy/owntracks --tail=5 | grep dawarich-bridge
+ dawarich-bridge: init
+ dawarich-bridge: ok tst=1776600238
```

## Orphan PVC removal (main.tf)

`owntracks-data-proxmox` (1 Gi, proxmox-lvm, unencrypted) was a leftover
from the encrypted-migration attempt; the Deployment has been mounting
`owntracks-data-encrypted` the whole time. Verified `Used By: <none>`
on the live PVC before removal. Removing the resource from Terraform
destroys the PVC — harmless, no data loss.

## Test Plan

### Automated

```
$ ../../scripts/tg plan
Plan: 0 to add, 1 to change, 1 to destroy.

$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 1 destroyed.

$ kubectl -n owntracks get pvc
NAME                       STATUS   VOLUME ...
owntracks-data-encrypted   Bound    ...
(owntracks-data-proxmox gone)
```

### Manual Verification

```
$ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
$ TST=$(date +%s)
$ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \
    curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \
    -H 'Content-Type: application/json' \
    -H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \
    -d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \
    https://owntracks.viktorbarzin.me/pub
HTTP 200

$ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \
    psql -U postgres -d dawarich -tAc \
    "SELECT ST_AsText(lonlat::geometry) FROM points WHERE user_id=1 AND timestamp=$TST"
POINT(-0.1278 51.5074)
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:05:18 +00:00
Viktor Barzin
ef75c02f0d [mailserver] Phase 1a — alt :2525 postscreen listener + NodePort [ci skip]
## Context (bd code-yiu)

Toward replacing MetalLB ETP:Local + pod-speaker colocation with pfSense
HAProxy injecting PROXY v2 → mailserver. This commit lays the k8s-side
groundwork for port 25 only. External SMTP flow post-cutover:

  Client → pfSense WAN:25 → pfSense HAProxy (injects PROXY v2) → k8s-node:30125
  (NodePort for mailserver-proxy Service, ETP:Cluster) → kube-proxy → pod :2525
  (postscreen with postscreen_upstream_proxy_protocol=haproxy) → real client IP
  recovered from PROXY header despite kube-proxy SNAT.
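
For reference, the header that HAProxy prepends with `send-proxy-v2` is a small binary preamble. The following is a minimal Python sketch of its layout for TCP over IPv4, following the PROXY protocol v2 specification. The function name and the example addresses are illustrative and not taken from this repo.

```python
import socket
import struct

# 12-byte signature that opens every PROXY protocol v2 header.
PROXY_V2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"


def build_proxy_v2_header(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Build a PROXY v2 header for a TCP-over-IPv4 connection."""
    version_command = 0x21  # high nibble: version 2, low nibble: PROXY command
    family_protocol = 0x11  # high nibble: AF_INET, low nibble: STREAM (TCP)
    addresses = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!HH", src_port, dst_port)
    )
    return (
        PROXY_V2_SIGNATURE
        + struct.pack("!BBH", version_command, family_protocol, len(addresses))
        + addresses
    )


# Illustrative values only: 203.0.113.7 is a documentation address standing in
# for an external SMTP client.
header = build_proxy_v2_header("203.0.113.7", "10.0.20.101", 54321, 25)
print(len(header), header[:12] == PROXY_V2_SIGNATURE)
```

The source address carried in this preamble is what postscreen logs as the client, which is why kube-proxy SNAT on the TCP connection itself no longer matters.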

Internal clients (Roundcube, email-roundtrip-monitor) keep using the stock
:25 on mailserver.svc ClusterIP — no PROXY required, zero regression.

## This change

- New `kubernetes_config_map.mailserver_user_patches` with a
  `user-patches.sh` script. docker-mailserver runs
  `/tmp/docker-mailserver/user-patches.sh` on startup; our script appends a
  `2525 postscreen` entry to `master.cf` with
  `-o postscreen_upstream_proxy_protocol=haproxy` and a 5s PROXY timeout.
  Sentinel-guarded for idempotency on in-place restart.
- New volume + volume_mount (`mode = 0755` via defaultMode) wires the
  ConfigMap into the mailserver container.
- New container port spec for 2525 (informational; kube-proxy resolves
  targetPort by number anyway).
- New Service `mailserver-proxy` — NodePort type, ETP:Cluster, selector
  `app=mailserver`, port 25 → targetPort 2525 → fixed nodePort 30125.
  pfSense HAProxy's backend pool will be `<all k8s node IPs>:30125 check
  send-proxy-v2`.
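
The sentinel guard mentioned above can be pictured as follows. This is a Python sketch of the idea only: the real `user-patches.sh` is a shell script that is not reproduced here, and the sentinel string and function name are hypothetical.

```python
# Hypothetical marker line; the actual sentinel used by user-patches.sh is not shown here.
SENTINEL = "# mailserver-proxy-2525 (managed by user-patches.sh)"


def append_once(config_text: str, block: str, sentinel: str = SENTINEL) -> str:
    """Append `block` (preceded by `sentinel`) unless the sentinel is already present."""
    if sentinel in config_text:
        return config_text  # already patched, e.g. after an in-place container restart
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return f"{config_text}{sentinel}\n{block}\n"


# Listener entry mirroring the `postconf -M` output in the test plan.
listener = (
    "2525 inet n - y - 1 postscreen\n"
    "  -o syslog_name=postfix/smtpd-proxy\n"
    "  -o postscreen_upstream_proxy_protocol=haproxy\n"
    "  -o postscreen_upstream_proxy_timeout=5s"
)

once = append_once("smtp inet n - y - 1 postscreen\n", listener)
twice = append_once(once, listener)
print(once == twice)
```

Running the patch a second time leaves the file unchanged, so a restart without a fresh container filesystem does not produce a duplicate `2525` entry.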

The existing `mailserver` LoadBalancer Service (ETP:Local, 10.0.20.202,
ports 25/465/587/993) is untouched. Traffic still flows through it via the
pfSense NAT `<mailserver>` alias; this commit does not change routing.

## What is NOT in this change

- pfSense HAProxy install/config (Phase 2 — out-of-Terraform, runbook-managed)
- pfSense NAT rdr flip from `<mailserver>` → HAProxy VIP (Phase 4)
- 465/587/993 — scoped to port 25 first for proof of concept. Other ports
  get the same treatment (alt listeners 4465/5587/10993 + Service ports)
  once 25 is proven.
- Dovecot per-listener `haproxy = yes` — irrelevant until IMAP is migrated.

## Test Plan

### Automated (verified pre-commit)
```
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out

$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    postconf -M | grep '^2525'
2525   inet  n  -  y  -  1  postscreen \
  -o syslog_name=postfix/smtpd-proxy \
  -o postscreen_upstream_proxy_protocol=haproxy \
  -o postscreen_upstream_proxy_timeout=5s

$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    ss -ltn | grep -E ':25\b|:2525'
LISTEN 0 100 0.0.0.0:2525  0.0.0.0:*
LISTEN 0 100 0.0.0.0:25    0.0.0.0:*

$ kubectl get svc -n mailserver mailserver-proxy
NAME               TYPE       CLUSTER-IP      PORT(S)        AGE
mailserver-proxy   NodePort   10.98.213.164   25:30125/TCP   93s

# Expected-to-fail probe (no PROXY header) → postscreen rejects
$ timeout 8 nc -v 10.0.20.101 30125 </dev/null
Connection to 10.0.20.101 30125 port [tcp/*] succeeded!
421 4.3.2 No system resources
```

### Manual Verification (after Phase 2 — pfSense HAProxy)
Once HAProxy on pfSense is configured to listen on alt port :2525 (not the
real :25 yet) and targets `k8s-nodes:30125` with `send-proxy-v2`:
1. From an external host: `swaks --to smoke-test@viktorbarzin.me
   --server <pfsense-ip>:2525 --body "phase 1 test"`
2. In mailserver logs: `kubectl logs -c docker-mailserver deployment/mailserver
   | grep postfix/smtpd-proxy` — "connect from [<external-ip>]" with the real
   public IP, NOT the k8s node IP.
3. E2E probe CronJob keeps green (uses ClusterIP path, unaffected).

## Reproduce locally
1. `kubectl get svc mailserver-proxy -n mailserver` → NodePort 30125 exists
2. `kubectl get cm mailserver-user-patches -n mailserver` → exists
3. `timeout 8 nc -v <k8s-node> 30125 </dev/null` → "421 4.3.2 No system resources"
   (postscreen rejecting the connection because no PROXY header was sent)
2026-04-19 11:52:49 +00:00
Viktor Barzin
b60e34032c [authentik] Phase 1 hardening — 3 replicas, PgBouncer PDB/probes, perf env
## Context

Following the 2026-04-18 /dev/shm ENOSPC P0 and a 5-subagent research pass,
this is Phase 1 of the authentik reliability + performance hardening epic
(beads code-cwj). Scope: everything that is safe, additive, and does not
require DB restart, architectural migration, or the 43-service auth path
to go through a risky validation window.

Five research findings drove the deltas:

1. **Server/worker at 2 replicas** conflicts with the documented convention
   "critical path services scaled to 3" in .claude/CLAUDE.md (Traefik,
   Authentik, CrowdSec LAPI, PgBouncer, Cloudflared). PDB minAvailable was
   still 1 — a single-pod outage could take auth down.
2. **PgBouncer had no resource requests/limits** — silently capped at the
   Kyverno tier-defaults LimitRange (256Mi), no PDB, no probes. Pool
   failures undetected until connection timeouts.
3. **Authentik 2026.2 has no Redis** (the cache moved to Postgres in
   2025.10). Persistent Django connections + longer flow/policy cache TTLs
   are the two knobs that move the needle most without DB tuning. Both are
   safe because PgBouncer runs in session mode.
4. **Gunicorn defaults** (2 workers × 4 threads on server, 1 process × 2
   threads on worker) don't use the pod's 1.5 Gi headroom. Each worker
   preloads Django at ~500 MiB — bumping to 3 workers needs a memory bump
   to 2 Gi first.
5. **AUTHENTIK_WORKER__CONCURRENCY was renamed AUTHENTIK_WORKER__THREADS**
   in 2025.8 — the old name is aliased but the canonical config key changed.
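
The memory reasoning in finding 4 is simple arithmetic. Here is a rough Python sketch that treats the ~500 MiB per-worker figure as exact and ignores any non-worker overhead, both of which are simplifications:

```python
def headroom_mib(workers: int, per_worker_mib: int, limit_mib: int) -> int:
    """Memory left under the limit once every gunicorn worker has preloaded Django."""
    return limit_mib - workers * per_worker_mib


PER_WORKER_MIB = 500  # approximate figure quoted in finding 4

# (workers, memory limit in MiB): old setup, 3 workers on the old limit, new setup
for workers, limit_mib in [(2, 1536), (3, 1536), (3, 2048)]:
    print(workers, limit_mib, headroom_mib(workers, PER_WORKER_MIB, limit_mib))
```

Three workers under the old 1.5 Gi limit would leave only about 36 MiB spare, which is why the limit goes to 2 Gi before the worker count goes to 3.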

## This change

### values.yaml
- server.replicas 2 → 3 (PDB minAvailable 1 → 2)
- worker.replicas 2 → 3
- server/worker limits.memory 1.5 Gi → 2 Gi (headroom for gunicorn workers)
- authentik.postgresql.conn_max_age = 60 (persistent connections; safe
  with pgbouncer session mode, conn_max_age < server_idle_timeout=600s)
- authentik.postgresql.conn_health_checks = true
- authentik.cache.timeout_flows = 1800 (30 min; was 300)
- authentik.cache.timeout_policies = 900 (15 min; was 300)
- authentik.web.workers = 3, threads = 4
- authentik.worker.threads = 4 (was 2)

### pgbouncer.tf
- container resources: requests cpu=50m/mem=128Mi, limits mem=512Mi
  (observed live usage is 1-3 m CPU, 2-4 MiB RSS — huge headroom,
  safely above Kyverno 256Mi tier-default cap)
- readiness probe: TCP :6432, 10s period
- liveness probe: TCP :6432, 30s period, 30s delay
- kubernetes_pod_disruption_budget_v1.pgbouncer: minAvailable=2
  (3 replicas; a single drain rolls cleanly, and a second simultaneous
  voluntary eviction is correctly blocked)
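
As a quick check on the PDB numbers, the allowed-disruption count for an integer `minAvailable` is the number of healthy pods minus that floor, never below zero. A small Python sketch of the calculation (the function name is illustrative):

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PDB with an integer minAvailable currently permits."""
    return max(0, healthy_pods - min_available)


print(allowed_disruptions(3, 2))  # all three PgBouncer pods healthy
print(allowed_disruptions(2, 2))  # one pod already down or being drained
```

With three healthy pods one eviction is permitted; once a pod is down, further evictions are refused until it is back.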

## What is NOT in this change (deferred as Phase 2 follow-ups)

- Codify outpost /dev/shm patch in Terraform (currently applied via
  Authentik API, not in code). Needs authentik_outpost resource.
- Migrate embedded outpost → dedicated outpost Deployment with 2
  replicas + sticky sessions. Only HA path per GH issue #18098; requires
  flow design because outpost sessions are in-process memory only.
- PG max_connections 100 → 200 + shared_buffers 512MB → 768MB + CNPG
  pod memory 2Gi → 3Gi. Needs coordinated DB restart.
- Enable pg_stat_statements on CNPG cluster for Authentik DB
  observability (currently shared_preload_libraries is empty).
- PgBouncer pool_mode session → transaction + django_channels layer
  split. Needs atomic change + psycopg3 prepared-statement support.
- authentik_tasks_tasklog 7-day retention (198k rows, unbounded).
- Traefik forward-auth plugin caching via
  xabinapal/traefik-authentik-forward-plugin.
- Grafana dashboard 14837 import + recording rule for
  authentik_flow_execution_duration (reported broken: values in ns
  while default buckets are seconds — upstream discussion #7156).

## Test plan

### Automated

    $ cd stacks/authentik && ../../scripts/tg plan
    Plan: 1 to add, 3 to change, 0 to destroy.

    $ ../../scripts/tg apply --non-interactive
    module.authentik.kubernetes_pod_disruption_budget_v1.pgbouncer: Creation complete after 0s
    module.authentik.kubernetes_deployment.pgbouncer: Modifications complete after 45s
    module.authentik.helm_release.authentik: Modifications complete after 2m47s
    Apply complete! Resources: 1 added, 3 changed, 0 destroyed.

### Manual Verification

1. **Pod topology and PDBs**:

        $ kubectl -n authentik get pods,pdb
        pod/goauthentik-server-5fc69b6cc6-ctvkp   1/1   Running   0   3m14s   k8s-node2
        pod/goauthentik-server-5fc69b6cc6-fkn8x   1/1   Running   0   3m45s   k8s-node3
        pod/goauthentik-server-5fc69b6cc6-jtjjd   1/1   Running   0   5m6s    k8s-node1
        pod/goauthentik-worker-5cfb7dc9bf-b2rlr   1/1   Running   0   3m44s   k8s-node2
        pod/goauthentik-worker-5cfb7dc9bf-fkfm4   1/1   Running   0   5m6s    k8s-node1
        pod/goauthentik-worker-5cfb7dc9bf-hxdg6   1/1   Running   0   3m3s    k8s-node4
        pod/pgbouncer-64746f955f-st567            1/1   Running   0   4m58s   k8s-node4
        pod/pgbouncer-64746f955f-xss9c            1/1   Running   0   5m11s   k8s-node2
        pod/pgbouncer-64746f955f-zvfkw            1/1   Running   0   4m45s   k8s-node3
        poddisruptionbudget/goauthentik-server    2     N/A   1
        poddisruptionbudget/goauthentik-worker    N/A   1     1
        poddisruptionbudget/pgbouncer             2     N/A   1

   All three workloads spread across 3+ nodes, PDBs allow 1 disruption.

2. **Authentik server health**:

        $ curl -sS -o /dev/null -w "%{http_code}\n" \
            https://authentik.viktorbarzin.me/-/health/ready/
        200

3. **Forward-auth redirect on protected service**:

        $ curl -sS -o /dev/null -w "%{http_code}\n" -L \
            https://wealthfolio.viktorbarzin.me/
        200

4. **Outpost /dev/shm still within sizeLimit** (patches from the
   2026-04-18 post-mortem were not regressed):

        $ kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost \
            -c proxy -- df -h /dev/shm
        tmpfs   2.0G  58M  2.0G  3%  /dev/shm

5. **PgBouncer listening on :6432** (checked over loopback inside the pod, so this does not exercise the path from other pods):

        $ kubectl -n authentik exec deploy/pgbouncer -- nc -zv 127.0.0.1 6432
        127.0.0.1 (127.0.0.1:6432) open

## Reproduce locally

1. `cd stacks/authentik && ../../scripts/tg plan` — expect 0/0/0 (No changes).
2. `kubectl -n authentik get pdb pgbouncer` — expect MIN AVAILABLE 2.
3. `kubectl -n authentik get deploy goauthentik-server -o jsonpath='{.spec.replicas}'` — expect 3.

Closes: code-cwj
2026-04-19 11:52:41 +00:00
Viktor Barzin
789cb61310 [servarr] Rewrite MAM ratio farming — break Mouse death spiral, adopt in TF
## Context

A MAM (MyAnonamouse) freeleech farming workflow was deployed on 2026-04-14
via kubectl apply (outside Terraform). Five days later the account was
still stuck in Mouse class: 715 MiB downloaded, 0 uploaded, ratio 0.
Tracker responses on 7 of 9 active torrents returned
`status=4 | msg="User currently mouse rank, you need to get your ratio up!"`
— MAM was actively refusing to serve peer lists because the account was
in Mouse class, and refusing to serve peer lists made the ratio impossible
to recover. Meanwhile the grabber kept digging: 501 torrents sat in
qBittorrent, 0 completed, 0 bytes uploaded.

Root causes (ranked):
1. Death spiral — Mouse class blocks announces, nothing uploads.
2. BP-spender 30 000 BP threshold blocked the only exit even though the
   account already had 24 500 BP.
3. Grabber selection (`score = 1.0 / (seeders+1)`) preferred low-demand
   torrents filtered to <100 MiB — ratio-hostile by design.
4. Grabber/cleanup deadlock: cleanup only fired on seed_time > 3d, so
   torrents that never started never qualified. Combined with the 500-
   torrent cap this stalled the grabber indefinitely.
5. qBittorrent queueing amplified (4) — 495/501 stuck in queuedDL.
6. Ratio-monitor labelled queued torrents `unknown` (empty tracker
   field), hiding the problem on the MAM Grafana panel.
7. qBittorrent memory limit (256 Mi LimitRange default) too low.
8. All of the above was Terraform drift with no reviewability.

## This change

Introduces `stacks/servarr/mam-farming/` — a new TF module that adopts
the three kubectl-applied resources and replaces their scripts with
demand-first, H&R-aware logic. Also bumps qBittorrent resources, fixes
ratio-monitor labelling, and adds five Prometheus alerts plus a Grafana
panel row.
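
The ratio-monitor labelling fix amounts to checking the qBittorrent category before the tracker URL. A minimal Python sketch under stated assumptions: the host-to-label table and the tracker hostname are hypothetical placeholders, and the real ratio-monitor code is not shown in this commit.

```python
from urllib.parse import urlparse

# Hypothetical mapping from tracker hostname to label; not the real table.
TRACKER_HOST_LABELS = {"tracker.mam.example": "mam"}


def tracker_label(category: str, tracker_url: str) -> str:
    """Label a torrent by category first, then by tracker hostname."""
    if category == "mam-farming":
        return "mam"  # works even when the torrent has never announced
    host = urlparse(tracker_url).hostname if tracker_url else None
    return TRACKER_HOST_LABELS.get(host, "unknown")


print(tracker_label("mam-farming", ""))                                # queued, never announced
print(tracker_label("", "https://tracker.mam.example/announce"))       # no category set
print(tracker_label("", ""))                                           # nothing to go on
```

Queued torrents carry an empty `tracker` field, so only the category check can place them under `mam` instead of `unknown`.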

### Architecture

    MAM API ───┬─── jsonLoad.php (profile: ratio, class, BP)
               ├─── loadSearchJSONbasic.php (freeleech search)
               ├─── bonusBuy.php (50 GiB min tier for API)
               └─── download.php (torrent file)
                               │
    Pushgateway <──┬────────────┤
                   │  mam_ratio            ┌────────────────────┐
                   │  mam_class_code       │ freeleech-grabber  │ */30
                   │  mam_bp_balance   ◄───│  (ratio-guarded)   │
                   │  mam_farming_*        └──────────┬─────────┘
                   │  mam_janitor_*                   │ adds to
                   │                                  ▼
                   │  Grafana panels      qBittorrent (mam-farming)
                   │  + 5 alerts                      ▲
                   │                                  │ deletes by rule
                   │                       ┌──────────┴─────────┐
                   │                   ◄───│ farming-janitor    │ */15
                   │                       │  (H&R-aware)       │
                   │                       └──────────┬─────────┘
                   │                                  │ buys credit
                   │                       ┌──────────┴─────────┐
                   └───────────────────────│ bp-spender         │ 0 */6
                                           │  (tier-aware)      │
                                           └────────────────────┘

### Key decisions

- **Ratio guard on grabber** — refuse to grab if ratio < 1.2 OR class ==
  Mouse. Prevents the death spiral from deepening. Emits
  `mam_grabber_skipped_reason{reason=...}` and exits clean.
- **Demand-first selection** — new score formula
  `leechers*3 - seeders*0.5 + (200 if freeleech_wedge else 0)`; size band
  50 MiB – 1 GiB; leecher floor 1; seeder ceiling 50. Picks titles that
  will actually upload.
- **Janitor decoupled from grabber** — runs every 15 min regardless of
  the ratio-guard state. Without this, stuck torrents accumulate
  fastest exactly when the grabber is skipping (Mouse class). H&R-aware:
  never deletes `progress==1.0 AND seeding_time < 72h`. Six delete
  reasons observable via `mam_janitor_deleted_per_run{reason=...}`.
- **BP-spender tier-aware** — MAM imposes a hard 50 GiB minimum on API
  buyers ("Automated spenders are limited to buying at least 50 GB...
  due to log spam"). Valid API tiers: 50/100/200/500 GiB at 500 BP/GiB.
  The spender picks the smallest tier that satisfies the ratio deficit
  AND fits the budget, preserving a 500 BP reserve. If even the 50 GiB
  tier is too expensive, it skips and retries on the next 6-hour cron.
- **Authoritative metrics use MAM profile fields** —
  `downloaded_bytes` / `uploaded_bytes` (integers) rather than the
  pretty-printed `downloaded` / `uploaded` strings like "715.55 MiB"
  that MAM also returns.
- **Ratio-monitor category-first labelling** — `tracker` is empty for
  queued torrents that never announced. Now maps `category==mam-farming`
  to label `mam` first, only falls back to tracker-URL parsing when
  category is absent. Stops hundreds of MAM torrents collecting under
  `unknown`.
- **qBittorrent resources bumped** to `requests=512Mi / limits=1Gi` so
  hundreds of active torrents don't OOM.
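
The grabber decisions above can be sketched in a few lines of Python. This is an illustration, not the contents of `freeleech-grabber.py`: the function names and the `low_ratio` reason label are hypothetical, and the 200-point freeleech-wedge term is treated as an additive bonus, which is an assumption about intent.

```python
from typing import Optional

RATIO_FLOOR = 1.2


def skip_reason(ratio: float, user_class: str) -> Optional[str]:
    """Return why the grabber should skip this run, or None if it may grab."""
    if user_class == "Mouse":
        return "mouse_class"
    if ratio < RATIO_FLOOR:
        return "low_ratio"  # hypothetical label for the ratio-only case
    return None


def eligible(size_mib: float, leechers: int, seeders: int) -> bool:
    """Size band 50 MiB to 1 GiB, at least 1 leecher, at most 50 seeders."""
    return 50 <= size_mib <= 1024 and leechers >= 1 and seeders <= 50


def score(leechers: int, seeders: int, freeleech_wedge: bool) -> float:
    """Demand-first score: reward leechers, penalise seeders, bonus for wedge."""
    return leechers * 3 - seeders * 0.5 + (200 if freeleech_wedge else 0)


print(skip_reason(0.0, "Mouse"))
print(eligible(300, 10, 4), score(10, 4, False), score(10, 4, True))
```

Writing the bonus with explicit parentheses matters in Python: without them the conditional expression would apply to the whole sum and every non-wedge torrent would score 0.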

### Emergency recovery performed this session

1. Adopted 5 in-cluster resources via root-module `import {}` blocks
   (Terraform 1.5+ rejects imports inside child modules).
2. Ran the janitor in DRY_RUN=1 to verify rules against live state —
   466 `never_started` candidates, 0 false positives in any other
   reason bucket. Flipped to enforce mode.
3. Janitor deleted 466 stuck torrents (matches plan's ~495 target; 35
   preserved as active/in-progress).
4. Truncated `/data/grabbed_ids.txt` so newly-popular titles become
   eligible again.

The ratio is still 0 because the API cannot buy below 50 GiB and the
account sits at 24 551 BP (the 50 GiB tier costs 25 000 BP, or 25 500
with the 500 BP reserve). A manual 1 GiB purchase via the MAM web UI
(500 BP) would immediately lift the account to ratio ≈ 1.4 and unblock
announces. Automation cannot do this for us because of MAM's anti-spam
rule.
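
The budget arithmetic behind that skip can be checked with a short Python sketch. It uses the tier list, price, and reserve stated under Key decisions; the function names are illustrative and the real `bp-spender.py` may structure this differently.

```python
API_TIERS_GIB = (50, 100, 200, 500)  # valid API purchase sizes
BP_PER_GIB = 500
RESERVE_BP = 500


def affordable_gib(bp_balance: int) -> int:
    """Whole GiB purchasable while keeping the reserve untouched."""
    return max(0, bp_balance - RESERVE_BP) // BP_PER_GIB


def pick_tier(needed_gib: int, bp_balance: int) -> int:
    """Smallest API tier that covers the deficit and fits the budget, else 0."""
    budget = affordable_gib(bp_balance)
    for tier in API_TIERS_GIB:
        if needed_gib <= tier <= budget:
            return tier
    return 0


print(affordable_gib(24551), pick_tier(3, 24551))  # current balance
print(affordable_gib(25551), pick_tier(3, 25551))  # 1 000 BP later
```

At 24 551 BP the affordable amount is 48 GiB, below the smallest 50 GiB tier, so nothing is bought; the first purchase becomes possible once the balance reaches 25 500 BP.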

### What is NOT in this change

- qBittorrent prefs reconciliation (max_active_downloads=20,
  max_active_uploads=150, max_active_torrents=150). The plan wanted
  this; deferred to a follow-up because the janitor + ratio recovery
  handles the 500-torrent backlog first. A small reconciler CronJob
  posting to /api/v2/app/setPreferences is the intended follow-up.
- VIP purchase (~100 k BP) — deferred until BP accumulates.
- Cross-seed / autobrr — separate initiative.

## Alerts added

- P1 MAMMouseClass — `mam_class_code == 0` for 1h
- P1 MAMCookieExpired — `mam_farming_cookie_expired > 0`
- P2 MAMRatioBelowOne — `mam_ratio < 1.0` for 24h (replaces old
  QBittorrentMAMRatioLow, now driven by authoritative profile metric)
- P2 MAMFarmingStuck — no grabs in 4h while ratio is healthy
- P2 MAMJanitorStuckBacklog — `skipped_active > 400` for 6h

## Test plan

### Automated

    $ cd infra/stacks/servarr && ../../scripts/tg plan 2>&1 | grep Plan
    Plan: 5 to import, 2 to add, 6 to change, 0 to destroy.

    $ ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 5 imported, 2 added, 6 changed, 0 destroyed.

    # Re-plan after import block removal (idempotent)
    $ ../../scripts/tg plan 2>&1 | grep Plan
    Plan: 0 to add, 1 to change, 0 to destroy.
    # The 1 change is a pre-existing MetalLB annotation drift on the
    # qbittorrent-torrenting Service — unrelated to this change.

    $ cd ../monitoring && ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

    # Python + JSON syntax
    $ python3 -c 'import ast; [ast.parse(open(p).read()) for p in [
        "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py",
        "infra/stacks/servarr/mam-farming/files/bp-spender.py",
        "infra/stacks/servarr/mam-farming/files/mam-farming-janitor.py"]]'
    $ python3 -c 'import json; json.load(open(
        "infra/stacks/monitoring/modules/monitoring/dashboards/qbittorrent.json"))'

### Manual Verification

1. Grabber ratio-guard path:

       $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
       $ kubectl -n servarr logs job/g1
       Skip grab: ratio=0.0 class=Mouse (floor=1.2) reason=mouse_class

2. BP-spender tier path:

       $ kubectl -n servarr create job --from=cronjob/mam-bp-spender s1
       $ kubectl -n servarr logs job/s1
       Profile: ratio=0.0 class=Mouse DL=0.70 GiB UL=0.00 GiB BP=24551
         | deficit=1.40 GiB needed=3 affordable=48 buy=0
       Done: BP=24551, spent=0 GiB (needed=3, affordable=48)

   Correctly skips because affordable (48) < smallest API tier (50).

3. Janitor in enforce mode:

       $ kubectl -n servarr create job --from=cronjob/mam-farming-janitor j1
       $ kubectl -n servarr logs job/j1 | tail -3
       Done: deleted=466 preserved_hnr=0 skipped_active=35 dry_run=False
         per reason: {'never_started': 466, ...}

   Second run immediately after: `deleted=0 skipped_active=35` —
   steady state with only active/seeding torrents left.

4. Alerts loaded:

       $ kubectl -n monitoring get cm prometheus-server \
           -o jsonpath='{.data.alerting_rules\.yml}' \
           | grep -E "alert: MAM|alert: QBittorrent"
         - alert: MAMMouseClass
         - alert: MAMCookieExpired
         - alert: MAMRatioBelowOne
         - alert: MAMFarmingStuck
         - alert: MAMJanitorStuckBacklog
         - alert: QBittorrentDisconnected
         - alert: QBittorrentMAMUnsatisfied

5. Dashboard: browse to Grafana "qBittorrent - Seeding & Ratio" → new
   "MAM Profile (from jsonLoad.php)" row at the bottom shows class, BP
   balance, profile ratio, transfer, BP-vs-reserve timeseries, janitor
   deletion stacked chart, janitor state stat, grabber state stat.

## Reproduce locally

1. `cd infra/stacks/servarr && ../../scripts/tg plan` — expect
   0 add / 1 change (unrelated MetalLB annotation drift).
2. `kubectl -n servarr get cronjobs` — expect three:
   mam-freeleech-grabber, mam-bp-spender, mam-farming-janitor.
3. Trigger each via `kubectl create job --from=cronjob/<name> <job>`
   and read logs; outputs match the manual-verification snippets above.

Closes: code-qfs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:45:38 +00:00
Viktor Barzin
5ea0aa70e3 [claude-agent-service] Bump image_tag to 2fd7670d (45m /execute timeout)
## Context
Ships the monorepo commit
(code@2fd7670d [claude-agent-service] Raise /execute default timeout
from 15m to 45m) that raises ExecuteRequest.timeout_seconds from 900 to
2700. The auto-upgrade pipeline (DIUN → n8n → claude-agent-service →
service-upgrade agent) had been silently timing out mid-run for 3 days:
139 × 202 Accepted + 6 × TimeoutError in the last 24h, zero commits to
infra, zero Slack posts. Root cause was the 15-minute cap truncating
CAUTION-class upgrades that need to summarise multi-release changelogs,
poll Woodpecker CI, and wait on on-demand DB backup CronJobs.

## What changed
`local.image_tag` 0c24c9b6 → 2fd7670d. Image built + pushed to
registry.viktorbarzin.me/claude-agent-service:2fd7670d. Deployment is
`Recreate`, so the single pod is dropped + recreated.
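
For orientation, the behavioural change reduces to one default value. The sketch below uses a plain dataclass as a stand-in: the real `ExecuteRequest` lives in `app.main` and is not shown in this commit, so its base class and any other fields are unknown.

```python
from dataclasses import dataclass


@dataclass
class ExecuteRequest:
    """Stand-in for the request model; only the fields used in the test plan."""

    prompt: str
    agent: str
    timeout_seconds: int = 2700  # 45 minutes; the previous default was 900 (15 minutes)


print(ExecuteRequest(prompt="p", agent="a").timeout_seconds)
```

Callers that pass `timeout_seconds` explicitly are unaffected; only requests relying on the default get the longer window.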

## Test Plan
### Automated
`terraform plan` — `Plan: 0 to add, 1 to change, 0 to destroy` (3
container image refs flip from 0c24c9b6 → 2fd7670d).
`terraform apply` — `Apply complete! Resources: 0 added, 1 changed,
0 destroyed.`

### Manual Verification
```
$ kubectl -n claude-agent rollout status deploy/claude-agent-service --timeout=120s
deployment "claude-agent-service" successfully rolled out

$ kubectl -n claude-agent get deploy claude-agent-service \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
registry.viktorbarzin.me/claude-agent-service:2fd7670d

$ kubectl -n claude-agent exec deploy/claude-agent-service -- \
    sh -c 'cd /srv && python3 -c "from app.main import ExecuteRequest; \
    print(ExecuteRequest(prompt=\"p\", agent=\"a\").timeout_seconds)"'
2700
```

Next DIUN cycle (every 6h) should land ≥1 unattended upgrade as an
infra commit + Slack message without TimeoutError in the agent logs.

Closes: code-cfy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:29:08 +00:00
Viktor Barzin
a5df175a67 [mailserver] Retire Dovecot exporter + scrape + alerts [ci skip]
## Context

code-vnc confirmed `viktorbarzin/dovecot_exporter` cannot produce real
metrics against docker-mailserver 15.0.0's Dovecot 2.3.19 — the
exporter speaks the pre-2.3 `old_stats` FIFO protocol, which Dovecot
2.3 deprecated in favour of `service stats` + `doveadm-server` with
a different wire format. The scrape only ever returned
`dovecot_up{scope="user"} 0`.

code-1ik listed two paths: (a) switch to a Dovecot 2.3+ exporter, or
(b) retire the exporter + scrape + alerts. Picking (b) — carrying a
no-op exporter + scrape + alert group taxes cluster resources,
clutters Prometheus /targets, and tees up an alert that can never
fire correctly. If a future session needs real Dovecot stats, reach
for a known-good exporter (e.g., jtackaberry/dovecot_exporter) and
rebuild this scaffolding.

## This change

### mailserver stack
- Removes the `dovecot-exporter` container from
  `kubernetes_deployment.mailserver` (was ~28 lines). Pod now
  runs a single `docker-mailserver` container.
- Removes `kubernetes_service.mailserver_metrics` (ClusterIP Service
  added in code-izl). The `mailserver` LoadBalancer (ports 25, 465,
  587, 993) is unaffected.
- Drops the dovecot.cf comment documenting the failed code-vnc
  attempt — the documentation survives here + in bd code-vnc /
  code-1ik.

### monitoring stack
- Removes `job_name: 'mailserver-dovecot'` from `extraScrapeConfigs`.
- Removes the `Mailserver Dovecot` PrometheusRule group
  (`DovecotConnectionsNearLimit`, `DovecotExporterDown`).
- Inline comments in both files point future work at code-1ik's
  decision record.

Prometheus configmap-reload picked up the change; scrape target set
now has zero entries for `mailserver-dovecot`. Pod rolled cleanly to
1/1 Running.

## What is NOT in this change

- No replacement exporter — deliberate. The alert that was removed
  was a false-signal alert; its removal returns cluster alerting to
  a correct, lower-noise state.
- mailserver MetalLB Service + SMTP/IMAP ports — unchanged.
- `auth_failure_delay`, `mail_max_userip_connections` — stay; those
  are unrelated to stats export.

## Test Plan

### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver
NAME                          READY  STATUS   RESTARTS  AGE
mailserver-78589bfd95-swz6h   1/1    Running  0         49s

$ kubectl get svc -n mailserver
NAME            TYPE          PORT(S)
mailserver      LoadBalancer  25/TCP,465/TCP,587/TCP,993/TCP
roundcubemail   ClusterIP     80/TCP
# mailserver-metrics gone

$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
    wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
{"status":"success","data":{"activeTargets":[]}}
```

### Manual Verification
1. E2E probe `email-roundtrip-monitor` keeps succeeding (20-min cadence)
2. `EmailRoundtripFailing` stays green — proves IMAP is healthy even
   without the exporter signal
3. Prometheus `/alerts` page no longer shows DovecotConnectionsNearLimit
   or DovecotExporterDown

Closes: code-1ik

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:01:07 +00:00
Viktor Barzin
137404a6a2 [mailserver] Document Dovecot exporter incompatibility [ci skip]
## Context

bd code-vnc investigated why `viktorbarzin/dovecot_exporter` only
exposed `dovecot_up{scope="user"} 0`. Root cause: the exporter speaks
the legacy pre-2.3 `old_stats` FIFO wire protocol. docker-mailserver
15.0.0 ships Dovecot 2.3.19, which moved to `service stats` with a
different architecture — `doveadm stats dump` on the old-stats
unix_listener returns "Failed to read VERSION line" and the exporter
loops on "Input does not provide any columns".

Attempted fix: enabled `old_stats` plugin via `mail_plugins` +
declared `service old-stats { unix_listener stats-reader }`. Socket
was created but protocol incompatibility made it useless. Reverted.

## This change

- Reverts the attempted dovecot.cf additions
- Adds a comment in the dovecot.cf heredoc explaining why we
  deliberately do NOT enable old_stats here
- `auth_failure_delay = 5s` (code-9mi) and
  `mail_max_userip_connections = 50` stay — they're unrelated to
  stats

## What is NOT in this change

- A replacement exporter — filed as follow-up bd code-1ik with
  two paths: switch to jtackaberry/dovecot_exporter, or retire the
  exporter+scrape+alert entirely
- The `mailserver-metrics` ClusterIP Service (from code-izl) —
  kept; it will be useful for whichever path code-1ik chooses

## Test Plan

### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    supervisorctl status dovecot postfix
dovecot RUNNING   pid 1022, uptime 0:00:27
postfix RUNNING   pid 1063, uptime 0:00:26

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```

### Manual Verification
Dovecot config returns to baseline + auth_failure_delay. Mail continues
to flow (E2E probe continues to succeed via `email-roundtrip-monitor`).

Closes: code-vnc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:55:48 +00:00
Viktor Barzin
973f549810 [payslip-ingest] Update extractor agent + dashboard for v2 regex parser
## Context

Companion change to payslip-ingest v2 (regex parser + accurate RSU tax
attribution). The Grafana dashboard now has 4 more panels powered by the
new earnings-decomposition and YTD-snapshot columns, and the Claude
fallback agent's prompt is aligned with the new schema so non-Meta
payslips still land with the full field set.

## This change

### `.claude/agents/payslip-extractor.md`

Rewrites the RSU handling section to match Meta UK's actual template
(rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching
rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead).
Adds a new "Earnings decomposition (v2)" section telling the fallback
agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_*
and when to use pension_employee vs pension_sacrifice without
double-counting.

### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`

- **Panel 4 (Effective rate)** — SQL switched from the naive
  `(income_tax + NIC) / cash_gross` to the YTD-effective-rate
  method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid /
  ytd_taxable_pay)`. Title updated to "YTD-corrected" so the
  change is discoverable.
- **Panel 5 (Table)** — adds salary, bonus, pension_sacrifice,
  taxable_pay columns so row-level debugging against the parser
  output is trivial.
- **+Panel 8 (Earnings breakdown)** — monthly stacked bars of
  salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice
  months show up as a massive negative pension_sacrifice spike
  paired with a near-zero bonus bar.
- **+Panel 9 (Accurate cash tax rate)** — timeseries of
  cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU
  contribution the payslip hides in the single `Tax paid` line.
- **+Panel 10 (All-in compensation)** — stacked bars of cash_gross
  + rsu_vest per payslip.
- **+Panel 11 (YTD cumulative cash gross vs total comp)** — two
  lines partitioned by tax_year; the gap between them is the RSU
  contribution YTD.

Total panels go from 7 → 11.
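
To make the Panel 4 correction concrete, here is a small Python sketch of the `cash_tax` formula. The input figures are invented for illustration and do not come from any payslip, and the zero guard is an addition of this sketch rather than something the dashboard SQL is known to do.

```python
def cash_tax(income_tax: float, rsu_vest: float,
             ytd_tax_paid: float, ytd_taxable_pay: float) -> float:
    """Income tax attributable to cash pay, using the YTD effective rate for RSUs."""
    if ytd_taxable_pay == 0:
        return income_tax  # guard added for the sketch; no YTD rate to apply yet
    return income_tax - rsu_vest * ytd_tax_paid / ytd_taxable_pay


# Invented month: 6 000 tax withheld, 10 000 RSU vest, YTD effective rate 30%.
print(cash_tax(income_tax=6000, rsu_vest=10000,
               ytd_tax_paid=30000, ytd_taxable_pay=100000))
```

In this made-up month half of the withheld tax is attributed to the RSU vest, which is the gap Panel 9 plots between the naive and YTD-corrected rates.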

## Test Plan

### Automated

Dashboard JSON validity:
```
$ python3 -m json.tool uk-payslip.json > /dev/null && echo ok
ok
```

### Manual Verification

After applying `stacks/monitoring/`:
1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels
2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the
   negative pension_sacrifice bar in panel 8
3. Panel 9 "Accurate cash effective tax rate" shows the
   cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in
   RSU-vest months

## Reproduce locally

1. `cd infra/stacks/monitoring && terragrunt plan`
2. Expected: ConfigMap diff on the payslip dashboard with the new panel
   JSON
3. `terragrunt apply` — Grafana reloads the dashboard automatically
   (configmap-reload sidecar)

Relates to: payslip-ingest commit 9741816

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:54:33 +00:00
Viktor Barzin
c6784f87b5 [docs] Add NFS prerequisite runbook for nfs_volume module [ci skip]
## Context

`modules/kubernetes/nfs_volume` creates the K8s PV but NOT the underlying
directory on the Proxmox NFS host (`192.168.1.127:/srv/nfs/<subdir>`).
The first time a new consumer is added, the mount fails with
`mount.nfs: … No such file or directory` and the pod hangs in
ContainerCreating.

This bit us twice during the Wave 1/2 rollout — once for the mailserver
backup (code-z26) and again for the Roundcube backup (code-1f6). Both
times the fix was `ssh root@192.168.1.127 'mkdir -p /srv/nfs/<subdir>'`.
Rather than automate the SSH dependency into the module (which would
break hermeticity and fail for operators without host SSH), this runbook
documents the manual bootstrap step and the rationale.

Addresses bd code-yo4.

## This change

New file: `docs/runbooks/nfs-prerequisites.md`. Lists known consumers,
gives the copy-paste SSH command, and explains why auto-creation was
rejected (two options, neither worth the churn).

## What is NOT in this change

- Any automation of the bootstrap — runbook only
- Migration to `nfs-subdir-external-provisioner` — explicitly out of scope

## Test Plan

### Automated
```
$ cat docs/runbooks/nfs-prerequisites.md | head -5
# NFS Prerequisites for `modules/kubernetes/nfs_volume`

The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
```

### Manual Verification
Before the next stack adds a new `nfs_volume` consumer, read the runbook
and run the `ssh root@192.168.1.127 'mkdir -p ...'` step. The first pod
should then reach Ready within a minute of PV creation.

Closes: code-yo4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:40:55 +00:00
Viktor Barzin
28009a0e85 [redis] Bump master/replica memory 64Mi→256Mi (OOMKilled on PSYNC)
## Context
redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts.
Cluster-health check flagged it as WARN; Prometheus was firing
`StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts continuously.

## Root cause
Memory limit 64Mi is too tight. Master steady-state is only 21Mi, but
the replica needs transient headroom during PSYNC full resync:

- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during snapshot)
- Replication backlog tracking

The replica RSS crossed 64Mi during sync and was OOM-killed (exit 137),
looping forever. It also left Sentinel with no healthy replica to
promote if the master failed.

## Fix
Master + replica: 64Mi → 256Mi (both requests and limits, per
`CLAUDE.md` resource management rule: `requests=limits` based on
VPA upperBound).

Sentinels stay at 64Mi — they don't store data.

## Deployment note
Helm upgrade initially deadlocked because the StatefulSet uses
`OrderedReady` podManagementPolicy: the update rollout refuses to start
until all pods are Ready, but redis-node-1 could not become Ready without
the update. Recovered via:

  helm rollback redis 43 -n redis
  kubectl -n redis patch sts redis-node --type=strategic \
    -p '{...memory: 256Mi...}'
  kubectl -n redis delete pod redis-node-1 --force

Then `scripts/tg apply` cleanly reconciled state. Deadlock-recovery
runbook to be written under `code-cnf`.

## Verification
  kubectl -n redis get pods
    redis-node-0   2/2  Running  0  <bounce>
    redis-node-1   2/2  Running  0  <bounce>
  kubectl -n redis get sts redis-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
    256Mi

## Follow-ups filed
- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically
  (separate root cause; surfaced via same cluster-health run)
- code-cnf: runbook / TF tweak for the OrderedReady + atomic-wait
  deadlock recovery

Closes: code-pqt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:40:51 +00:00
Viktor Barzin
468a7a266b [mailserver] Drop unneeded NET_ADMIN capability [ci skip]
## Context

The mailserver container had `capabilities.add = ["NET_ADMIN"]`. Upstream
docker-mailserver docs say the capability is only needed by Fail2ban to
run iptables ban actions. Fail2ban is DISABLED in this stack
(`ENABLE_FAIL2BAN=0`, see line ~68) — CrowdSec owns the brute-force
policy at the LB layer. The capability was therefore unused, and
dropping it is a small attack-surface reduction. Addresses code-4mu.

## This change

Replaces the explicit `capabilities { add = ["NET_ADMIN"] }` block with
an empty `security_context {}`. Post-rollout verification
(`supervisorctl status`) confirms every service we actually run is
healthy — dovecot, postfix, rspamd, rsyslog, postsrsd, changedetector,
cron, mailserver. Every STOPPED entry was already disabled.

The inline comment documents the revert trigger: check
`kubectl logs -c docker-mailserver` for permission-denied patterns and
restore the capability if observed.

## Test Plan

### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver -o jsonpath='{.items[0].spec.containers[?(@.name=="docker-mailserver")].securityContext}'
{"allowPrivilegeEscalation":true,"privileged":false,"readOnlyRootFilesystem":false,"runAsNonRoot":false}

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out

$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    supervisorctl status | grep RUNNING
changedetector RUNNING ...
cron           RUNNING ...
dovecot        RUNNING ...
mailserver     RUNNING ...
postfix        RUNNING ...
postsrsd       RUNNING ...
rspamd         RUNNING ...
rsyslog        RUNNING ...
```

### Observation window
EmailRoundtripFailing + EmailRoundtripStale alerts continue to run
every 20 min. If no alert fires in the 24h post-rollout window
(through ~2026-04-20 10:40 UTC), the change is considered safe and
this commit stands. Otherwise revert this commit.

## What is NOT in this change

- readOnlyRootFilesystem (separate hardening, out of scope)
- runAsNonRoot (docker-mailserver needs root for Postfix)
- Removing privilege-escalation defaults (container needs those for
  chowning mail spool at startup)

Closes: code-4mu

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:39:43 +00:00
Viktor Barzin
c941199f8d [mailserver] Split Dovecot metrics port onto ClusterIP service [ci skip]
## Context

Port 9166 (`dovecot-metrics`) was exposed on the public MetalLB
LoadBalancer 10.0.20.202 alongside SMTP/IMAP. While only LAN-routable,
shipping an internal metric on the same listening IP as external mail
conflated two concerns and over-exposed the port. Prometheus was
scraping via the same LB Service. Addresses code-izl (follow-up to
code-61v which added the scrape job).

## This change

### mailserver stack
- Drops `dovecot-metrics` port from `kubernetes_service.mailserver`
  (LoadBalancer stays: 25, 465, 587, 993).
- Adds new `kubernetes_service.mailserver_metrics` — ClusterIP-only,
  selecting the same `app=mailserver` pod, exposing 9166.

### monitoring stack
- Updates `extraScrapeConfigs` in the Prometheus chart values to
  target the new `mailserver-metrics.mailserver.svc.cluster.local:9166`
  instead of `mailserver.mailserver.svc.cluster.local:9166`.
- helm_release.prometheus updated in-place; configmap-reload sidecar
  picked up the new target within 10s.

```
 mailserver LB              mailserver-metrics ClusterIP
 ┌──────────────────┐       ┌──────────────────┐
 │ 25  smtp         │       │ 9166 dovecot-    │
 │ 465 smtp-secure  │       │      metrics     │ ← Prometheus only
 │ 587 smtp-auth    │       └──────────────────┘
 │ 993 imap-secure  │
 └──────────────────┘
    ↑ 10.0.20.202
```

## What is NOT in this change

- Per-Service RBAC/NetworkPolicy tightening (separate task)
- Moving the metrics port to a dedicated sidecar-only Service Monitor
  (ServiceMonitor CRDs not installed; extraScrapeConfigs is correct
  for the prometheus-community chart in use)

## Test Plan

### Automated
```
$ kubectl get svc -n mailserver
mailserver          LoadBalancer 10.0.20.202  25/TCP,465/TCP,587/TCP,993/TCP
mailserver-metrics  ClusterIP    10.100.102.174  9166/TCP

$ kubectl get endpoints -n mailserver mailserver-metrics
mailserver-metrics   10.10.169.163:9166

$ # Prometheus target (after 10s configmap-reload)
$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
    wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
  scrapeUrl: http://mailserver-metrics.mailserver.svc.cluster.local:9166/metrics
  health: up
```

### Manual Verification
1. From a host outside the cluster: `nc -vz 10.0.20.202 9166` → connection refused
2. Prometheus UI `/targets` → `mailserver-dovecot` UP, labels show new DNS name
3. PromQL: `up{job="mailserver-dovecot"}` returns `1`

Closes: code-izl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:37:30 +00:00
Viktor Barzin
7502e0db21 [mailserver] Document postfix-accounts.cf hash-drift invariant [ci skip]
## Context

The `postfix-accounts.cf` ConfigMap renders `bcrypt(pass, 6)` for each
user in `var.mailserver_accounts`. bcrypt generates a fresh salt on
every evaluation → the ConfigMap `data` hash line differs every plan
run. `ignore_changes = [data["postfix-accounts.cf"]]` was the pragmatic
workaround, but the side-effect wasn't documented: a Vault rotation of
a mailserver password would be MASKED by ignore_changes — TF would
never push the new hash and the pod would keep accepting the old
password until manual taint/replace.

Addresses bd code-7ns.

## This change

Inline comment on the lifecycle block spelling out:
- Why ignore_changes exists (non-deterministic bcrypt)
- What the invariant costs (masks automatic rotation)
- Why it's acceptable TODAY (no automatic rotation for
  mailserver_accounts — verified in Vault; manual password change is a
  manual TF run anyway)
- Two concrete alternatives if rotation is ever added:
  (a) deterministic bcrypt with stable per-user salt
  (b) render from an ESO-synced K8s Secret
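
The drift mechanism and alternative (a) can be illustrated with a small
sketch. It uses the standard-library `hashlib.pbkdf2_hmac` as a stand-in
for bcrypt (an assumption, so the example runs without third-party
packages). The function names and the salt derivation are hypothetical
and are not what the module does.

```python
import hashlib
import os


def hash_with_fresh_salt(password: str) -> str:
    """Mimics a salted hash evaluated with a new random salt on every call."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 10_000)
    return f"{salt.hex()}${digest.hex()}"


def hash_with_stable_salt(user: str, password: str) -> str:
    """Alternative (a): a per-user salt that does not change between runs.

    Deriving the salt from the user name is only for illustration; a real
    setup would more likely store a random per-user salt.
    """
    salt = hashlib.sha256(user.encode()).digest()[:16]
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 10_000)
    return f"{salt.hex()}${digest.hex()}"


# Fresh salt: two evaluations of the same input differ, so every plan
# sees a "changed" value unless ignore_changes hides it.
assert hash_with_fresh_salt("hunter2") != hash_with_fresh_salt("hunter2")

# Stable salt: repeated evaluations match, and a real password change
# still yields a different value that a plan would surface.
assert hash_with_stable_salt("me", "hunter2") == hash_with_stable_salt("me", "hunter2")
assert hash_with_stable_salt("me", "hunter2") != hash_with_stable_salt("me", "rotated")
```

With a stable salt the hash only changes when the password does, which is
the property that would let `ignore_changes` be removed.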

No code change, no apply needed — this is a comment-only commit. The
decision (live-with + document) is one of the three options in the plan.

## What is NOT in this change

- Deterministic hashing (not needed until automatic rotation exists)
- ESO-driven Secret (same reason)
- Removal of ignore_changes (would cause the original drift flap)

## Test Plan

### Automated
```
$ cd stacks/mailserver && /home/wizard/code/infra/scripts/tg plan
# no diff expected on this comment-only change; other drift remains
# but is pre-existing and out of scope.
```

### Manual Verification
Read the new comment block at `stacks/mailserver/modules/mailserver/
main.tf` around the postfix-accounts-cf lifecycle — comprehensible
without session context.

Closes: code-7ns

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:33:57 +00:00
Viktor Barzin
23173131f4 [mailserver] Add Dovecot auth_failure_delay 5s [ci skip]
## Context

Dovecot's `dovecot.cf` block previously set only
`mail_max_userip_connections = 50`. No equivalent of the SMTP rate
limit existed for IMAP auth — brute-force against IMAP/POP auth was
throttled only by CrowdSec at the LB level. Adding an in-process
auth delay is cheap defense in depth. Addresses code-9mi.

## This change

Adds `auth_failure_delay = 5s` to the dovecot.cf ConfigMap key.
Each failed auth attempt pauses 5s before responding, so a sequential
1000-entry dictionary attack stretches from <1s to ~83min (1000 × 5s),
giving CrowdSec ample time to ban the source.
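
The estimate can be reproduced with a few lines of arithmetic. This is a
sketch that assumes a single connection trying passwords strictly one
after another; `sequential_attack_seconds` is a hypothetical helper name.

```python
def sequential_attack_seconds(attempts: int, delay_s: float) -> float:
    """Lower bound on wall-clock time when every attempt fails and waits delay_s."""
    return attempts * delay_s


total = sequential_attack_seconds(1000, 5)
print(f"{total:.0f} s = {total / 60:.1f} min")  # prints: 5000 s = 83.3 min
```

That is about 83 minutes, in line with the rough figure quoted above. An
attacker spreading attempts over parallel connections would finish
proportionally faster, which is why the LB-level ban remains the primary
control.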

## What is NOT in this change

- `login_processes_count` tuning (workload doesn't warrant it yet)
- Equivalent SMTP AUTH delay (CrowdSec already covers, and SMTP AUTH
  is rate-limited via `smtpd_client_connection_rate_limit`)

## Test Plan

### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    doveconf -n | grep -E 'auth_failure|mail_max_userip'
auth_failure_delay = 5 secs
mail_max_userip_connections = 50

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```

### Manual Verification
1. `openssl s_client -connect mail.viktorbarzin.me:993`
2. `a1 LOGIN bogus@viktorbarzin.me wrongpass` — expect ~5s delay before `NO [AUTHENTICATIONFAILED]`
3. Fire 5 failed attempts rapidly: total ≥25s

## Reproduce locally
1. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- doveconf -n | grep auth_failure`
2. Expected: `auth_failure_delay = 5 secs`

Closes: code-9mi

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:33:05 +00:00
Viktor Barzin
a32bfbf07e [mailserver] Require STARTTLS before AUTH on submission [ci skip]
## Context

docker-mailserver 15.0.0's default Postfix config does NOT set
`smtpd_tls_auth_only = yes`. Clients that skip STARTTLS on port 587
(or 25 with AUTH) can send PLAIN/LOGIN creds in cleartext. CrowdSec
and rate limiting don't catch this — it's an auth-path leak, not a
bruteforce. Addresses bd code-vnw.

## This change

Adds `smtpd_tls_auth_only = yes` to `postfix_cf` (applied via the
`postfix-main.cf` ConfigMap key consumed by docker-mailserver).
Rolled the pod to pick up the new ConfigMap.

### Deviation from task spec

code-vnw's fix field cited `smtpd_sasl_auth_only = yes`. That is NOT
a real Postfix parameter — attempting it gets
`postconf: warning: smtpd_sasl_auth_only: unknown parameter`. The
acceptance test (reject PLAIN auth before STARTTLS) is satisfied by
`smtpd_tls_auth_only`, which is the correct knob. Added an inline
comment noting the common confusion.

## What is NOT in this change

- Per-service override in master.cf (smtpd_tls_auth_only applied
  globally, which is safe because port 25 doesn't accept AUTH here)
- Other Postfix hardening (sender_restrictions, etc.)

## Test Plan

### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
    postconf smtpd_tls_auth_only
smtpd_tls_auth_only = yes

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```

### Manual Verification
1. Open a plaintext session: `nc mail.viktorbarzin.me 587`, then send `EHLO test`
2. Send `AUTH PLAIN <base64>` BEFORE `STARTTLS`
3. Expected: Postfix rejects with `503 5.5.1 Error: authentication not enabled`
4. Follow-up: `openssl s_client -connect mail.viktorbarzin.me:587 -starttls smtp`
   (negotiates STARTTLS first), then `EHLO test` and `AUTH PLAIN <base64>` succeed for valid creds
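
The plaintext-versus-TLS check can also be scripted. The sketch below
uses the standard-library `smtplib` and only inspects which extensions
the server advertises; it does not send credentials. `auth_offered` and
`check_submission` are hypothetical helper names, and the 10s timeout is
an arbitrary choice.

```python
import smtplib


def auth_offered(server: smtplib.SMTP) -> bool:
    """True if the most recent EHLO response advertised the AUTH extension."""
    return server.has_extn("auth")


def check_submission(host: str, port: int = 587) -> tuple[bool, bool]:
    """Return (AUTH offered before STARTTLS, AUTH offered after STARTTLS)."""
    with smtplib.SMTP(host, port, timeout=10) as server:
        server.ehlo()
        before = auth_offered(server)
        server.starttls()
        server.ehlo()  # extensions must be re-read after the TLS upgrade
        after = auth_offered(server)
    return before, after
```

With `smtpd_tls_auth_only = yes` in effect, `check_submission("mail.viktorbarzin.me")`
would be expected to return `(False, True)`; a `True` in the first
position would suggest the setting is not applied.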

## Reproduce locally
1. From a shell with `kubectl` access to the cluster:
2. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- postconf smtpd_tls_auth_only`
3. Expected: `smtpd_tls_auth_only = yes`

Closes: code-vnw

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:31:15 +00:00
Viktor Barzin
e12c7b43e4 [mailserver] Pin dovecot_exporter to SHA + add Diun [ci skip]
## Context

`viktorbarzin/dovecot_exporter:latest` was consumed with `IfNotPresent`
pull, which means whichever node landed the pod kept whatever digest
was cached from an earlier pull. A SHA-level pin is the reproducibility
baseline this repo uses for every other home-built image
(`headscale`, `excalidraw`, `linkwarden`).

## This change

- Pins `dovecot-exporter` container image to
  `viktorbarzin/dovecot_exporter@sha256:1114224c...` — the digest the
  pod is actually running today (captured from live `imageID`).
- Enables Diun tag watching on the mailserver Deployment
  (`diun.enable=true`, `diun.include_tags=^latest$`) so new `:latest`
  digests trigger a notification rather than silently landing on the
  next `IfNotPresent` miss.

Deviation from task spec (code-cno): the task asked for an 8-char SHA
*tag*, but Docker Hub only publishes `:latest` for this image — a SHA
tag doesn't exist. Used the digest-pin pattern already established at
`stacks/headscale/modules/headscale/main.tf:204` instead; Diun watches
the `:latest` tag for drift, which is the equivalent notification.

## What is NOT in this change

- Volume-mount ordering drift on `kubernetes_deployment.mailserver`
  (pre-existing; tolerated by Waves 1+2).
- Splitting the metrics port into its own Service (code-izl).

## Test Plan

### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver \
    -o jsonpath='{.items[0].spec.containers[*].image}'
docker.io/mailserver/docker-mailserver:15.0.0 \
  viktorbarzin/dovecot_exporter@sha256:1114224c9bf0261ca8e9949a6b42d3c5a2c923d34ca4593f6b62f034daf14fc5

$ kubectl get deployment -n mailserver mailserver \
    -o jsonpath='{.spec.template.metadata.annotations}'
{"diun.enable":"true","diun.include_tags":"^latest$"}

$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```

### Manual Verification
1. Push a new `:latest` digest to the exporter image (or wait for one).
2. Check Diun notifier output: a tag event for `^latest$` should fire.
3. `kubectl describe deployment/mailserver -n mailserver` shows the
   digest pin unchanged until someone rebumps it.

## Reproduce locally
1. `kubectl -n mailserver get pod -l app=mailserver -o yaml | \
     grep -A1 dovecot_exporter`
2. Expected: `image: viktorbarzin/dovecot_exporter@sha256:1114224c...`.

Closes: code-cno

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:26:31 +00:00
Viktor Barzin
c36b41eabc [monitoring] Scrape mailserver Dovecot exporter + near-limit alerts
Port 9166 (`dovecot-metrics`) is exposed on the mailserver Service but
nothing was scraping it. Added a static `mailserver-dovecot` scrape job
to `extraScrapeConfigs` (we run `prometheus-community/prometheus`, not
`kube-prometheus-stack`, so no ServiceMonitor CRDs are available).

Two alerts in a new `Mailserver Dovecot` rule group:
- `DovecotConnectionsNearLimit` fires at ≥42/50 IMAP connections for
  5m (85% of `mail_max_userip_connections = 50` is 42.5, rounded down).
- `DovecotExporterDown` fires if the scrape target is unreachable
  for 10m (catches pod restarts + network issues).

Originally drafted as `kubernetes_manifest` ServiceMonitor + PrometheusRule
on `mailserver-beta1` branch; that commit is abandoned because the
CRDs aren't installed. This path is functionally equivalent and plans
cleanly.

Closes: code-61v
2026-04-19 00:24:12 +00:00
Viktor Barzin
6a75ed4809 [mailserver] Add targeted retention for spam@ mailbox
## Context

The @viktorbarzin.me catch-all routes to spam@viktorbarzin.me. The
mailbox had no retention policy. On 2026-04-18 it held 519 messages
consuming 43 MiB. Without a policy, the only brake on growth was
manual deletion, which has not been happening - hence the bd task.

Viktor's explicit constraint when filing code-oy4: DO NOT blind
age-expunge. We need targeted retention that keeps genuine forwarded
human mail for a long time while shedding the recurring-newsletter
cruft that dominates the byte count.

## Profile findings (2026-04-18, verified on the live pod)

Total: 519 messages, 43 MiB, 0 in new/, 0 in tmp/.

Top senders by volume:
   138  dan@tldrnewsletter.com
    51  hi@ratepunk.com
    40  uber@uber.com
    35  truenas@viktorbarzin.me
    19  ubereats@uber.com
    15  hello@travel.jacksflightclub.com
    12  chris@chriswillx.com
    10  me@viktorbarzin.me

Top senders by storage bytes:
   8,176,481  dan@tldrnewsletter.com  (19 % of 43 MiB alone)
   2,866,104  uber@uber.com
   2,207,458  noreply@mail.selfh.st
   2,066,094  hi@ratepunk.com
   1,675,435  ubereats@uber.com

Age distribution:
    97 %  older than 14 days (502 / 519)
    23 %  older than 90 days (121 / 519)

Automated-sender markers:
    66 %  carry List-Unsubscribe:                   (342 / 519)
     4 %  carry Precedence: bulk|list|junk          ( 21 / 519)
    34 %  carry neither marker (= human-ish tail)   (177 / 519)

Combined "automated AND >14d": 328 messages -> target of rule 1.

## Retention strategy

Signed off by Viktor 2026-04-18. Two delete rules plus a default keep:

  1. Older than 14 days AND header matches one of:
       - `^List-Unsubscribe:`
       - `^Precedence:\s*(bulk|list|junk)`
       - `^Auto-Submitted:\s*auto-`
     -> DELETE.
     Rationale: these markers are the RFC-agreed indicators of bulk /
     robotic senders. A 14-day window still lets genuine subscription
     alerts (delivery, flight, calendar invite) come to attention.

  2. Older than 90 days AND no automated marker at all
     -> DELETE.
     Rationale: these are long-tail forwards from real people to the
     catch-all. 90 days is deliberately generous - I would rather
     leak bytes than lose Viktor's personal correspondence.

  3. Everything else -> KEEP (recent traffic, or aged human tail
     younger than 90d).
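
The three rules above can be sketched as a single decision function.
This is an illustration, not the deployed script: the function names and
return labels are hypothetical, and matching headers case-insensitively
is an assumption the rule text does not state.

```python
import re
from datetime import timedelta

# Header patterns copied from rule 1; IGNORECASE is an assumption.
AUTOMATED_MARKERS = [
    re.compile(r"^List-Unsubscribe:", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^Precedence:\s*(bulk|list|junk)", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^Auto-Submitted:\s*auto-", re.IGNORECASE | re.MULTILINE),
]


def is_automated(headers: str) -> bool:
    """True if any bulk/robotic marker appears in the header block."""
    return any(pattern.search(headers) for pattern in AUTOMATED_MARKERS)


def decide(headers: str, age: timedelta) -> str:
    """Apply rule 1, then rule 2, then fall through to keep (rule 3)."""
    automated = is_automated(headers)
    if automated and age > timedelta(days=14):
        return "delete-auto"
    if not automated and age > timedelta(days=90):
        return "delete-human"
    return "keep"
```

For example, `decide("Subject: hi\n", timedelta(days=60))` returns
`"keep"`, while the same age with a `List-Unsubscribe:` header returns
`"delete-auto"`.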

## Implementation

A `kubernetes_cron_job_v1.spam_retention` running every 4h (at :17
past) that `kubectl exec`s a Python retention script into the
mailserver pod.

Why kubectl exec and not a sibling CronJob with the Maildir mounted:
mailserver-data-encrypted is a RWO volume held by the mailserver
pod. A sibling would fail to attach. The nextcloud-watchdog pattern
in stacks/nextcloud/main.tf already solves this for a similar
"interact with the live pod on a schedule" shape. Mirrored here with
its own SA + Role + RoleBinding scoped to list/get pods and create
pods/exec in the mailserver namespace only.

Why Python and not pure shell: POSIX `find + stat + awk` struggles
with the header-scan-up-to-blank-line rule, and `stat -c` is Linux-
GNU-specific anyway. The script reads each message's first 64 KiB,
stops at the first blank line, scans headers only, then checks mtime.
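
A minimal sketch of that header scan, assuming one message per Maildir
file. `read_headers` and `age_days` are hypothetical names, and decoding
as UTF-8 with replacement characters is an assumption made to keep the
example short.

```python
from pathlib import Path

HEADER_READ_LIMIT = 64 * 1024  # bytes; the 64 KiB cap described above


def read_headers(path: Path) -> str:
    """Return only the header section of a message file."""
    with path.open("rb") as fh:
        raw = fh.read(HEADER_READ_LIMIT)
    text = raw.decode("utf-8", errors="replace").replace("\r\n", "\n")
    # Headers end at the first blank line; everything after it is body.
    head, _sep, _body = text.partition("\n\n")
    return head


def age_days(path: Path, now: float) -> float:
    """Message age in days, based on the file's mtime."""
    return (now - path.stat().st_mtime) / 86400
```

Stopping at the blank line matters because a forwarded newsletter quoted
in a message body would otherwise make a human message look automated.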

The CronJob streams the Python source via `kubectl exec -i ... --
python3 - <<PYEOF`. After the retention pass, `doveadm force-resync
-u spam@viktorbarzin.me INBOX/spam` refreshes Dovecot's cached index
so the deletions appear in IMAP immediately instead of after the next
pod restart.

Includes the standard KYVERNO_LIFECYCLE_V1 marker on the CronJob so
Kyverno ndots mutation does not cause perpetual drift.

## What is NOT in this change

- Dovecot sieve rules (no sieve infrastructure exists in the module;
  the plan file's fallback option was precisely this CronJob path).
- Push of retention metrics to Pushgateway - the script prints them
  to the job log for now; plumbing Pushgateway is a follow-up if
  Viktor wants alerts.
- Any touch of other mailboxes - only `/var/mail/viktorbarzin.me/spam/cur`
  is walked.
- Any mailserver pod restart or config reload.

## Test plan

### Automated

`terraform fmt` + `terragrunt hclfmt` pass. `scripts/tg plan` on the
mailserver stack shows:
  Plan: 7 to add, 3 to change, 0 to destroy.
Of the 7 adds, 4 are mine (SA + Role + RoleBinding + CronJob). The
other 3 adds belong to the concurrent roundcube-backup CronJob +
nfs_roundcube_backup_host PV + PVC already on master in parallel.
The 3 in-place updates are pre-existing drift on the mailserver
Deployment, Service and email_roundtrip_monitor CronJob, not
introduced by this change.

### Manual Verification

After `scripts/tg apply` lands the CronJob:

  1. Trigger an immediate run:
     `kubectl -n mailserver create job --from=cronjob/spam-retention manual-1`
  2. Wait for completion, read the log:
     `kubectl -n mailserver logs job/manual-1`
     -> expected tail:
        spam_retention_scanned_total <N>
        spam_retention_auto_deleted_total <M>
        spam_retention_human_deleted_total <H>
        spam_retention_kept_total <K>
        spam_retention_errors_total 0
        Retention pass complete
  3. Confirm mailbox shrunk:
     `kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \
         -- du -sh /var/mail/viktorbarzin.me/spam/`
     -> expected: well below 43 MiB within one run (bulk rule alone
        purges ~328 messages per the profile numbers above).
  4. Confirm IMAP reflects the deletions:
     `kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \
         -- doveadm mailbox status -u spam@viktorbarzin.me messages INBOX/spam`
     -> expected: message count dropped accordingly.
  5. 4 hours later, confirm the next scheduled run logs a much
     smaller scan count and 0 deletions (nothing new crossed the
     threshold).

Closes: code-oy4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:22:55 +00:00
Viktor Barzin
6cfc4b7836 [mailserver] Add backup CronJob for Roundcube html + enigma PVCs
## Context
Roundcube webmail runs with two encrypted RWO PVCs (see roundcubemail.tf:
`roundcubemail-html-encrypted`, `roundcubemail-enigma-encrypted`). These
carry user-visible state that is NOT regenerable without user action:

- `html` PVC → Apache docroot, plugin installs, skin overrides, session
  artefacts (two_factor_webauthn keys, persistent_login tokens, rcguard
  throttle state)
- `enigma` PVC → user-uploaded PGP private keyrings

Per the subdir CLAUDE.md "Storage & Backup Architecture" rule every
proxmox-lvm* PVC MUST have a backup CronJob writing to NFS
`/mnt/main/<app>-backup/`. Mailserver already complies via code-z26's
`mailserver-backup` CronJob; Roundcube does not. Losing either Roundcube
PVC means users must re-add 2FA devices, re-install plugins, and
re-import PGP keys — none of it recoverable from a database dump.

Target task: `code-1f6`.

## This change
- Adds `module.nfs_roundcube_backup_host` sourcing
  `modules/kubernetes/nfs_volume` pointed at
  `/srv/nfs/roundcube-backup` on the Proxmox host (NFSv4, inotify
  change-tracker picks it up for Synology offsite).
- Adds `kubernetes_cron_job_v1.roundcube-backup`:
  - Schedule `10 3 * * *` — 10 minutes after `mailserver-backup`
    (`0 3 * * *`) to avoid NFS write-window contention. Roundcube PVCs
    are tiny (<200 MiB combined on current cluster) so the window is
    well under 10 min.
  - `pod_affinity` on `app=roundcubemail` (Roundcube runs 1 replica with
    `Recreate` strategy on a fresh node per pod; the backup pod must
    co-locate because both PVCs are RWO).
  - `rsync -aH --delete --link-dest=/backup/<prev-week>` into
    `/backup/<YYYY-WW>/{html,enigma}/` — hardlinks unchanged files vs
    the previous weekly snapshot, keeping storage cost ~= delta only.
  - Weekly rotation retains 8 snapshots (~2 months), matching
    `mailserver-backup`.
  - Pushgateway metrics under `job=roundcube-backup` so existing
    `BackupDurationHigh` / `BackupStale` alert patterns detect
    regressions without extra wiring.
  - `KYVERNO_LIFECYCLE_V1` `ignore_changes` for mutated `dns_config`.
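
The snapshot naming and rotation can be sketched as follows. The helper
names are hypothetical, and how the real job locates the previous
snapshot is not shown in this commit, so "one week earlier" is an
assumption.

```python
from datetime import date, timedelta


def week_label(day: date) -> str:
    """Snapshot directory name in the same form as `date +%Y-%W`."""
    return day.strftime("%Y-%W")


def link_dest_label(day: date) -> str:
    """Previous week's snapshot, used as the --link-dest reference."""
    return week_label(day - timedelta(days=7))


def snapshots_to_prune(existing: list[str], keep: int = 8) -> list[str]:
    """Oldest snapshot names beyond the newest `keep`.

    Zero-padded YYYY-WW labels sort chronologically as plain strings.
    """
    ordered = sorted(existing)
    return ordered[:-keep] if len(ordered) > keep else []
```

If the previous week's directory is missing (first run, or a skipped
week), rsync copies everything in full, so the hardlink saving only
applies from the second consecutive snapshot onward.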

## Layout
```
 NFS server 192.168.1.127:/srv/nfs/
 ├── mailserver-backup/        (0 3 * * *  — code-z26)
 │   └── <YYYY-WW>/{data,state,log}/
 └── roundcube-backup/         (10 3 * * * — this change)
     └── <YYYY-WW>/{html,enigma}/
```

## What is NOT in this change
- Changing the mailserver-backup CronJob to also cover Roundcube. Two
  separate CronJobs keep the concerns (and pod anti-affinity/affinity)
  clean; the 10-min stagger eliminates the contention justification for
  merging them.
- Retention alerting tuning — existing Pushgateway/Prometheus rule
  ecosystem suffices for now.
- Restore tooling — follows the standard pattern in
  `docs/runbooks/` (rsync back, fix perms).

## Reproduce locally
1. Plan: `cd stacks/mailserver && scripts/tg plan -lock=false` →
   2 new resources (nfs_volume module + CronJob).
2. Apply, then trigger a one-shot run:
   `kubectl -n mailserver create job --from=cronjob/roundcube-backup roundcube-backup-manual-1`
3. Expected on success:
   - `kubectl -n mailserver logs job/roundcube-backup-manual-1` → "=== Backup IO Stats ===".
   - On Proxmox host:
     `ls /srv/nfs/roundcube-backup/$(date +%Y-%W)/` → `html`, `enigma`.
   - `/mnt/backup/.nfs-changes.log` (Proxmox) lists fresh paths under
     `roundcube-backup/` within ~1s of the rsync finishing.
   - Pushgateway: `curl -s prometheus-prometheus-pushgateway.monitoring:9091/metrics | grep roundcube`
     shows `backup_duration_seconds`, `backup_last_success_timestamp`.

## Automated
- `terraform fmt -check -recursive stacks/mailserver/modules/mailserver/` → clean.
- `scripts/tg plan -lock=false` in stacks/mailserver expected to show
  `+ module.nfs_roundcube_backup_host.*`, `+ kubernetes_cron_job_v1.roundcube-backup`.

Closes: code-1f6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:14:47 +00:00
Viktor Barzin
f707968091 [mailserver] Retry probe Pushgateway + Uptime Kuma pushes with backoff
## Context
The e2e email-roundtrip probe (CronJob `email-roundtrip-monitor`) currently
wraps `requests.put(PUSHGATEWAY, ...)` and `requests.get(UPTIME_KUMA, ...)`
in bare `try/except` that only prints "Failed to push ..." on error. If
Pushgateway is transiently unreachable (e.g., during a Prometheus Helm
upgrade / HPA scale-down / brief network blip) metrics silently drop and
downstream detection relies entirely on `EmailRoundtripStale` firing after
60 min of staleness. Single transient failures masquerade as data-plane
breakage for up to an hour.

Target task: `code-n5l` — Add retry to probe Pushgateway + Uptime Kuma pushes.

## This change
- Extracts a `push_with_retry(label, func, url)` helper that performs 3
  attempts with exponential backoff (1s, 2s, 4s). Treats HTTP 2xx as
  success, everything else as failure. On final failure, logs an explicit
  `ERROR:` line to stderr with the URL and either the last HTTP status or
  the exception repr — matches the existing `print(...)` logging style
  used throughout the heredoc (no stdlib `logging` dependency added).
- Replaces the two inline `try/requests.put/except print` blocks with
  calls to the helper. Pushgateway runs unconditionally; Uptime Kuma
  still only runs on round-trip success (same as before).
- Makes exit code responsive to push outcome: probe exits non-zero when
  the round-trip itself failed (unchanged), OR when BOTH pushes failed
  all retries on the success path. Single-endpoint push failure with the
  other succeeding keeps exit 0 — partial observability is preferred
  over noisy pod restarts from Kubernetes' Job controller.
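
A sketch of what such a helper could look like, not the code in the
heredoc. It assumes `func` takes the URL and returns an object with a
`status_code` attribute (as `requests` responses do); the `attempts`,
`base_delay` and `sleep` parameters are additions for illustration and
testing. The sketch sleeps only between attempts (1s, then 2s), so the
4s step listed above would only be reached with a fourth attempt.

```python
import sys
import time


def push_with_retry(label, func, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(url) up to `attempts` times; return True on any HTTP 2xx."""
    last_error = "no attempt made"
    for attempt in range(attempts):
        try:
            response = func(url)
            if 200 <= response.status_code < 300:
                return True
            last_error = f"status={response.status_code}"
        except Exception as exc:  # network errors, timeouts, DNS failures
            last_error = f"exception={exc!r}"
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    print(
        f"ERROR: Failed to push to {label} after {attempts} attempts: "
        f"url={url} {last_error}",
        file=sys.stderr,
    )
    return False
```

Returning a boolean rather than raising lets the caller combine both
push outcomes before deciding the exit code.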

## Behavior matrix

```
roundtrip | pushgw | kuma | exit | rationale
----------+--------+------+------+-------------------------------
success   | ok     | ok   |  0   | happy path (unchanged)
success   | fail   | ok   |  0   | one endpoint still has telemetry
success   | ok     | fail |  0   | one endpoint still has telemetry
success   | fail   | fail |  1   | NEW — total observability loss
fail      | ok     | -    |  1   | roundtrip failed (unchanged, Kuma skipped)
fail      | fail   | -    |  1   | roundtrip failed (unchanged, Kuma skipped)
```

## What is NOT in this change
- Alert thresholds (`EmailRoundtripStale` still 60m) — explicitly out of
  scope per the task description.
- `logging` stdlib adoption — rest of heredoc uses `print`, staying
  consistent.
- Moving the heredoc out of `main.tf` into a sidecar Python file —
  separate refactor.

## Reproduce locally
1. Point PUSHGATEWAY at a black hole:
   `kubectl -n mailserver set env cronjob/email-roundtrip-monitor \`
   `PUSHGATEWAY=http://nope.invalid:9091/metrics/job/test`
2. Trigger a one-shot job:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-test`
3. Expected in logs:
   - 3 attempts, each ~1s/2s/4s apart
   - `ERROR: Failed to push to Pushgateway after 3 attempts: url=... exception=...`
   - Uptime Kuma push still succeeds (round-trip ok) → exit 0
4. Flip UPTIME_KUMA_URL to also fail (edit heredoc or DNS-poison): expect
   exit 1 + two ERROR lines.

## Automated
- `python3 -c "import ast; ast.parse(open('/tmp/probe.py').read())"` → OK
  (heredoc extracts cleanly).
- `terraform fmt -check -recursive modules/mailserver/` → no diff.

Closes: code-n5l

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:14:46 +00:00
Viktor Barzin
f568e7d2bf [mailserver] Delete unused postfix_cf_reference_DO_NOT_USE variable [ci skip]
## Context

`infra/stacks/mailserver/modules/mailserver/variables.tf` carried a
130-line historical scaffolding variable
`postfix_cf_reference_DO_NOT_USE` containing a reference copy of an
older Postfix main.cf layout. The variable name itself signalled
dead-code intent ("DO_NOT_USE"), and a repo-wide
`grep -rn postfix_cf_reference infra/` confirmed zero consumers — no
module, no stack, no script, no doc ever referenced it. Carrying dead
Terraform variables costs nothing at runtime but wastes reviewer
attention on every `git blame` and drives up `variables.tf` read time.

Note on history: the prior commit 09c11056 landed with an identical
title ("Delete postfix_cf_reference_DO_NOT_USE dead code") but
actually committed `docs/runbooks/mailserver-proxy-protocol.md` —
fallout from a race between two concurrent mailserver sessions that
staged files in parallel. That commit accidentally closed this beads
task via the `Closes:` trailer without performing the deletion. This
commit does the actual deletion that was originally intended for
code-o3q. The runbook from 09c11056 is legitimate work for code-rtb
and is left in place.

## This change

Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }`
block (136 lines incl. trailing blank). No other variable touched, no
resource touched, no comment elsewhere touched. `variables.tf` now
contains only the live `postfix_cf` variable that is actually consumed
by the module.

## What is NOT in this change

- No Terraform state modification — variable was never read, so state
  has no record of it.
- No Postfix runtime behaviour change — `postfix_cf` (the live one) is
  untouched.
- No fix for the pre-existing `kubernetes_deployment.mailserver` /
  `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces
  independently. Those 2 in-place updates are known and tracked
  separately.
- No apply needed — pure source hygiene.

## Test Plan

### Automated

Reference check before edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" {
```
(single match — the declaration itself)

Reference check after edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
(no matches)
```

`terragrunt validate` (from `infra/stacks/mailserver/`):
```
Success! The configuration is valid, but there were some
validation warnings as shown above.
```
(warnings are pre-existing `kubernetes_namespace` -> `_v1` deprecation
notices, unrelated)

`terragrunt plan` (from `infra/stacks/mailserver/`):
```
  # module.mailserver.kubernetes_deployment.mailserver will be updated in-place
  # module.mailserver.kubernetes_service.mailserver will be updated in-place
Plan: 0 to add, 2 to change, 0 to destroy.
```
Both in-place updates are the known pre-existing drift. No change is
attributable to this commit — the dead variable was never referenced.

### Manual Verification

1. `cd infra/stacks/mailserver/modules/mailserver/`
2. `grep -c postfix_cf_reference variables.tf` -> expected `0`
3. `wc -l variables.tf` -> expected `39` (was `175`; 136 lines removed)
4. `cd ../..` -> `terragrunt validate` -> expected `Success!`
5. `terragrunt plan` -> expected `Plan: 0 to add, 2 to change, 0 to
   destroy.` (pre-existing drift only).
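
The reference check in steps 1-2 can also be scripted. The following is a rough Python equivalent of `grep -rn`, offered as a sketch only; `find_references` is a hypothetical helper, not something in the repo.

```python
import os


def find_references(root, needle):
    """Return (path, line_number) for every line under `root` containing `needle`."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            hits.append((path, lineno))
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
    return hits
```

An empty result for `find_references("infra", "postfix_cf_reference")` would correspond to the "(no matches)" output shown in the automated check.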

Closes: code-o3q

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:07:43 +00:00
Viktor Barzin
09c1105648 [mailserver] Delete postfix_cf_reference_DO_NOT_USE dead code [ci skip]
## Context

`infra/stacks/mailserver/modules/mailserver/variables.tf` carried a
130-line historical scaffolding variable
`postfix_cf_reference_DO_NOT_USE` containing a reference copy of an
older Postfix `main.cf` layout. The variable name itself signalled
dead-code intent ("DO_NOT_USE"), and a repo-wide
`grep -rn postfix_cf_reference infra/` confirmed zero consumers — no
module, no stack, no script, no doc ever referenced it. Carrying dead
Terraform variables costs nothing at runtime but actively wastes
reviewer attention on every `git blame`, drives up `variables.tf` read
time, and lets drift calcify.

Trade-offs considered:
- Keep it "just in case" → rejected; the file it mirrored
  (`/usr/share/postfix/main.cf.dist`) is already canonical upstream and
  reproducible inside any docker-mailserver container.
- Move it to a comment block → rejected; same noise cost, no value
  over deletion (authoritative source is in the image).

## This change

Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }`
block (136 lines incl. trailing blank). No other variable touched, no
resource touched, no comment elsewhere touched. `variables.tf` now
contains only the single live variable `postfix_cf` that is actually
consumed by the module.

## What is NOT in this change

- No Terraform state modification — variable was never read, so state
  has no record of it.
- No Postfix runtime behaviour change — `postfix_cf` (the live one) is
  untouched.
- No fix for the pre-existing `kubernetes_deployment.mailserver` /
  `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces
  independently. Those 2 in-place updates are known and tracked
  separately; this commit explicitly avoids conflating cleanup with
  drift resolution.
- No apply needed — pure source hygiene.

## Test Plan

### Automated

Reference check before edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" {
```
(single match — the declaration itself)

Reference check after edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
(no matches)
```

`terragrunt validate` (from `infra/stacks/mailserver/`):
```
Success! The configuration is valid, but there were some
validation warnings as shown above.
```
(warnings are pre-existing `kubernetes_namespace` → `_v1` deprecation
notices, unrelated)

`terragrunt plan` (from `infra/stacks/mailserver/`):
```
  # module.mailserver.kubernetes_deployment.mailserver will be updated in-place
  # module.mailserver.kubernetes_service.mailserver will be updated in-place
Plan: 0 to add, 2 to change, 0 to destroy.
```
Both in-place updates are the known pre-existing drift
(volume_mount ordering + stale `metallb.io/ip-allocated-from-pool`
annotation). No change is attributable to this commit — the dead
variable was never referenced, so removing it leaves state untouched.

### Manual Verification

1. `cd infra/stacks/mailserver/modules/mailserver/`
2. `grep -c postfix_cf_reference variables.tf` → expected `0`
3. `wc -l variables.tf` → expected `39` (was `175`; 136 lines removed
   including the trailing blank after the EOT)
4. Open `variables.tf` → expected: only `variable "postfix_cf"` remains
5. `cd ../..` (stack root) → `terragrunt validate` → expected:
   `Success! The configuration is valid`
6. `terragrunt plan` → expected: `Plan: 0 to add, 2 to change, 0 to
   destroy.` (the 2 are the pre-existing drift, not from this commit).

Closes: code-o3q

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:05:44 +00:00
root
1990ee7f8d Woodpecker CI Update TLS Certificates Commit
2026-04-19 00:02:53 +00:00
Viktor Barzin
8ea2dea84c [mailserver] Authentik-gate Roundcube webmail ingress [ci skip]
## Context
mail.viktorbarzin.me exposed the Roundcube login page directly: requests
hit Traefik → CrowdSec + anti-AI middleware → Roundcube. The `ingress_factory`
call in `roundcubemail.tf` omitted `protected = true`, so the Authentik
ForwardAuth middleware was never wired up. Project rule
(`infra/.claude/CLAUDE.md`): ingresses should be `protected = true` unless
there is a specific reason to leave them open. Credentialed surfaces (login
pages) have no reason to skip the OIDC gate — CrowdSec alone is a behavioural
signal, not an identity gate.

Trade-off accepted by Viktor on 2026-04-18: webmail now requires two logins
(Authentik SSO, then Roundcube IMAP auth against dovecot). This is tolerable
for a low-volume personal webmail; mail clients (Thunderbird, phone Mail)
bypass the webmail entirely and speak IMAPS/SMTP directly against
`mail.viktorbarzin.me` on the MetalLB service IP (10.0.20.202), which is a
separate path and MUST stay open.

## This change
Single-line flip: `protected = true` added to the `ingress_factory` call in
`stacks/mailserver/modules/mailserver/roundcubemail.tf`.

The factory (`modules/kubernetes/ingress_factory/main.tf`) responds to the
flag by:
  1. Appending `traefik-authentik-forward-auth@kubernetescrd` to the ingress
     `router.middlewares` annotation — Traefik then hands each request to
     the Authentik outpost before forwarding to Roundcube.
  2. Flipping `effective_anti_ai` from true → false (logic:
     `anti_ai_scraping != null ? … : !var.protected`), which removes the two
     anti-AI middlewares. Rationale in the factory: a login-gated resource
     is already invisible to unauthenticated scrapers, so the robots/noai
     middleware chain is redundant.

Request path before vs after:

    Before: Client → Traefik → [retry, error-pages, rate-limit, csp,
                                crowdsec, ai-bot-block, anti-ai-headers]
                              → Roundcube (200 on /)
    After:  Client → Traefik → [retry, error-pages, rate-limit, csp,
                                crowdsec, authentik-forward-auth]
                              → if unauth: 302 to authentik.viktorbarzin.me
                              → if auth:   Roundcube (login form)
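
The before/after chains above follow from the factory logic. A minimal Python model of the default case is sketched below; it covers only the path where `anti_ai_scraping` is unset (so `effective_anti_ai = !protected`), and the function and constant names are illustrative, not taken from the Terraform module.

```python
# Short middleware names as they appear in the request-path diagram.
BASE = ["retry", "error-pages", "rate-limit", "csp", "crowdsec"]
ANTI_AI = ["ai-bot-block", "anti-ai-headers"]
AUTH = ["authentik-forward-auth"]


def middleware_chain(protected):
    """Model the middleware list for an ingress when anti_ai_scraping is unset."""
    # Default branch only: anti-AI middlewares apply exactly when the
    # ingress is NOT behind the Authentik gate.
    effective_anti_ai = not protected
    chain = list(BASE)
    if effective_anti_ai:
        chain += ANTI_AI
    if protected:
        chain += AUTH
    return chain
```

Calling `middleware_chain(False)` reproduces the "Before" list and `middleware_chain(True)` the "After" list.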

## What is NOT in this change
  - The `mailserver` Service (MetalLB IP 10.0.20.202) is untouched. IMAPS
    (993), SMTPS (465), SMTP-Submission (587) continue to bypass Traefik
    entirely and speak directly to dovecot/postfix. Mail clients are
    unaffected.
  - Pre-existing drift on `kubernetes_deployment.mailserver` (volume_mount
    ordering) and `kubernetes_service.mailserver` (stale metallb annotation)
    is left alone — out of scope per bd-bmh. Apply was scoped with
    `-target=` to the ingress resource only.
  - No Authentik app/provider Terraform was touched — the `mail.*` ingress
    is already covered by the existing wildcard Authentik proxy outpost on
    `*.viktorbarzin.me` (standard pattern).

## Test Plan

### Automated
Baseline (before apply):

    $ curl -sI https://mail.viktorbarzin.me/ | head -2
    HTTP/2 200
    alt-svc: h3=":443"; ma=2592000

    $ openssl s_client -connect mail.viktorbarzin.me:993 < /dev/null 2>&1 \
        | grep -E 'CONNECTED|subject='
    CONNECTED(00000003)
    subject=CN = viktorbarzin.me

After apply:

    $ curl -sI https://mail.viktorbarzin.me/ | head -3
    HTTP/2 302
    alt-svc: h3=":443"; ma=2592000
    location: https://authentik.viktorbarzin.me/application/o/authorize/?client_id=…

    $ openssl s_client -connect mail.viktorbarzin.me:993 < /dev/null 2>&1 \
        | grep -E 'CONNECTED|subject='
    CONNECTED(00000003)
    subject=CN = viktorbarzin.me

Middleware annotation on the ingress:

    $ kubectl get ingress -n mailserver mail \
        -o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}'
    traefik-retry@kubernetescrd,traefik-error-pages@kubernetescrd,
    traefik-rate-limit@kubernetescrd,traefik-csp-headers@kubernetescrd,
    traefik-crowdsec@kubernetescrd,traefik-authentik-forward-auth@kubernetescrd

Terraform apply (targeted):

    $ scripts/tg apply --non-interactive \
        -target=module.mailserver.module.ingress.kubernetes_ingress_v1.proxied-ingress
    …
    Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

### Manual Verification
  1. In a private browser window, navigate to https://mail.viktorbarzin.me/
  2. Expected: redirected to Authentik SSO login (not Roundcube)
  3. Authenticate with Authentik credentials
  4. Expected: redirected back and shown the Roundcube IMAP login form
  5. Enter IMAP credentials (same as before the change)
  6. Expected: Roundcube inbox loads normally
  7. Separately, verify a mail client (Thunderbird, phone Mail) still
     connects to IMAPS on mail.viktorbarzin.me:993 and SMTP on :587 without
     any Authentik prompt — that path hits MetalLB 10.0.20.202 directly.

## Reproduce locally
  1. cd infra/stacks/mailserver
  2. vault login -method=oidc
  3. scripts/tg plan
     Expected: 0 to add, 3 to change, 0 to destroy. Relevant change is the
     `router.middlewares` annotation on
     `module.ingress.kubernetes_ingress_v1.proxied-ingress` swapping the
     two anti-AI middlewares for `traefik-authentik-forward-auth`. The
     other 2 changes are pre-existing drift (volume_mounts, metallb
     annotation) and are out of scope.
  4. scripts/tg apply --non-interactive \
       -target=module.mailserver.module.ingress.kubernetes_ingress_v1.proxied-ingress
  5. curl -sI https://mail.viktorbarzin.me/ — expect HTTP/2 302 to
     authentik.viktorbarzin.me

Closes: code-bmh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:56:25 +00:00
Viktor Barzin
8f5e131572 [mailserver] Route DMARC rua/ruf to dmarc@viktorbarzin.me [ci skip]
## Context

Mailgun was decommissioned on 2026-04-12 in favour of Brevo as the outbound
SMTP relay. The DMARC aggregate (`rua`) and forensic (`ruf`) report targets
still pointed at `e21c0ff8@dmarc.mailgun.org`, an inbox that no longer
exists — meaning every DMARC report Google/Microsoft/etc. generate has
been bouncing or silently dropped for six days. No alerts fire on this
(DMARC reports are best-effort, not RFC-mandated), but we've lost visibility
into alignment failures and spoofing attempts during the exact window where
the SPF/DKIM/DMARC posture was being reshaped for the Brevo cutover.

Decision (2026-04-18): route reports to `mailto:dmarc@viktorbarzin.me`.
The mailserver's catch-all sieve delivers mail addressed to non-existent
local-parts into `spam@`, so `dmarc@` does not need to be provisioned as
a real mailbox; the reports will land in `spam@`'s maildir unchanged.

Alternative considered: route to a dedicated `dmarc@` maildir with sieve
rules to file into a folder. Rejected for now: DMARC reports arrive at
low frequency (one aggregate per reporter per day at most), so the
catch-all path is good enough until volume justifies a proper parser.
Can be revisited once we see actual report traffic.

The third-party aggregator target `adb84997@inbox.ondmarc.com` (Red Sift
OnDMARC) is preserved in both rua and ruf — it provides parsed dashboards
that we actually read. The `postmaster@viktorbarzin.me` ruf-only target
also stays as a local mirror.

As a side effect, this apply also canonicalises the TXT record: the
previous value was stored as a two-string split in Cloudflare state
(`...viktorbarzin" ".me;"`) due to the 255-byte TXT string limit
(the record length exceeded 255 chars). The new value is shorter
(dmarc@viktorbarzin.me is 21 chars vs e21c0ff8@dmarc.mailgun.org's
26 chars, doubled across rua and ruf) and fits in a single string,
so the provider serialises it as one string and the prior split-drift
noise disappears from future plans.
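
The length arithmetic can be checked with a short sketch. `NEW` is the record as served by the authoritative NS after the apply; `OLD` is reconstructed by swapping the address back, which assumes the old record used the same tag order. `txt_chunks` is a hypothetical helper that models the 255-byte split, not the provider's actual code.

```python
NEW = ("v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; "
       "adkim=r; aspf=r; "
       "rua=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com; "
       "ruf=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com,"
       "mailto:postmaster@viktorbarzin.me;")

# Assumption: the old record differed only in the rua/ruf mailgun address.
OLD = NEW.replace("dmarc@viktorbarzin.me", "e21c0ff8@dmarc.mailgun.org")


def txt_chunks(value, limit=255):
    """Split a TXT value into character-strings of at most `limit` bytes."""
    raw = value.encode("ascii")
    return [raw[i:i + limit].decode("ascii") for i in range(0, len(raw), limit)]


if __name__ == "__main__":
    for label, record in (("old", OLD), ("new", NEW)):
        print(label, len(record), txt_chunks(record)[1:])
```

Under these assumptions the old value needs a second string holding only the overflow, while the new value fits in one.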

## This change

Single-line content edit on `cloudflare_record.mail_dmarc` in
`stacks/cloudflared/modules/cloudflared/cloudflare.tf`:

Before → After (rua and ruf, both):
```
mailto:e21c0ff8@dmarc.mailgun.org  →  mailto:dmarc@viktorbarzin.me
```

All other DMARC tags unchanged: `v=DMARC1`, `p=quarantine`, `pct=100`,
`fo=1`, `ri=3600`, `sp=quarantine`, `adkim=r`, `aspf=r`.

Delivery flow:
```
DMARC reporter (Gmail/Outlook/...)
      │ aggregate XML.gz to rua / forensic to ruf
      ▼
dmarc@viktorbarzin.me
      │ mailserver catch-all (no local recipient)
      ▼
spam@viktorbarzin.me (Viki's mailbox)
```

## What is NOT in this change

- **Mailbox sieve rules** to file DMARC reports into a dedicated folder
  (separate concern; deferred until traffic justifies it).
- **DMARC parser / dashboard**. OnDMARC (adb84997@inbox.ondmarc.com)
  already provides this for aggregate reports.
- **Policy tightening** (`p=reject`, `pct` ramp) — out of scope.
- **SPF / DKIM records** — not touched.
- **Removal of the split-string drift suppression**, if any existed in
  prior work. The canonicalisation happens naturally on this apply;
  no separate workaround was needed.

## Test Plan

### Automated

Targeted terragrunt plan + apply via `scripts/tg`:

```
$ cd stacks/cloudflared && scripts/tg plan \
    -target=module.cloudflared.cloudflare_record.mail_dmarc
...
Terraform will perform the following actions:
  # module.cloudflared.cloudflare_record.mail_dmarc will be updated in-place
  ~ resource "cloudflare_record" "mail_dmarc" {
      ~ content = "\"v=DMARC1; ...
                    rua=mailto:e21c0ff8@dmarc.mailgun.org,
                        mailto:adb84997@inbox.ondmarc.com; ...
                    ruf=mailto:e21c0ff8@dmarc.mailgun.org,
                        mailto:adb84997@inbox.ondmarc.com,
                        mailto:postmaster@viktorbarzin\" \".me;\""
                -> "\"v=DMARC1; ...
                    rua=mailto:dmarc@viktorbarzin.me,
                        mailto:adb84997@inbox.ondmarc.com; ...
                    ruf=mailto:dmarc@viktorbarzin.me,
                        mailto:adb84997@inbox.ondmarc.com,
                        mailto:postmaster@viktorbarzin.me;\""
    }
Plan: 0 to add, 1 to change, 0 to destroy.

$ scripts/tg apply /tmp/dmarc.tfplan
module.cloudflared.cloudflare_record.mail_dmarc: Modifying...
module.cloudflared.cloudflare_record.mail_dmarc: Modifications complete after 1s
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

Authoritative DNS post-apply:

```
$ dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short
"v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; adkim=r; aspf=r; rua=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com; ruf=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com,mailto:postmaster@viktorbarzin.me;"
```

Note: `dig @1.1.1.1` still served the old value immediately after apply —
Cloudflare's public resolver holds its cache until TTL expires
(TTL=1/auto ≈ 5 min). Authoritative NS is the source of truth.

### Manual Verification

**Setup**: none (DNS change only).

**Commands**:
```
# 1. Confirm authoritative DNS (run now, should pass)
dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short
# Expected: rua=mailto:dmarc@viktorbarzin.me,... and ruf similarly.

# 2. Confirm public resolver catches up (run after ~5min)
dig TXT _dmarc.viktorbarzin.me @1.1.1.1 +short
# Expected: same as above (no more mailgun.org entries).

# 3. Within 24-48h, check Viki's spam@ inbox for an incoming DMARC
#    aggregate report from Google/Microsoft/etc. Reports are
#    typically .zip or .gz attachments with XML inside.
```

**Interpretation**: seeing a DMARC report land in spam@ proves the
end-to-end delivery path works: reporter DNS lookup → _dmarc.viktorbarzin.me
→ mailto:dmarc@viktorbarzin.me → catch-all → spam@ maildir.
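
Checks 1 and 2 can be made mechanical by parsing the `dig +short` output. The sketch below is illustrative only; `parse_dmarc` and `report_targets` are hypothetical helpers, and the parsing is deliberately simple (a single-string record of `tag=value;` pairs).

```python
def parse_dmarc(txt):
    """Parse a DMARC TXT value (as printed by `dig +short`) into a tag dict."""
    txt = txt.strip().strip('"')
    tags = {}
    for part in txt.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        tags[key.strip()] = value.strip()
    return tags


def report_targets(tags, key):
    """Return the mailbox addresses listed under `rua` or `ruf`."""
    targets = []
    for uri in tags.get(key, "").split(","):
        uri = uri.strip()
        if uri.startswith("mailto:"):
            targets.append(uri[len("mailto:"):])
    return targets
```

A passing check would be that neither `report_targets(tags, "rua")` nor `report_targets(tags, "ruf")` contains a `mailgun.org` address.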

## Reproduce locally

```
1. git pull
2. cd stacks/cloudflared
3. dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short
4. Expected: rua=mailto:dmarc@viktorbarzin.me (and ruf the same).
```

Closes: code-569

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:49:14 +00:00
Viktor Barzin
b2d2a5bb1c [docs] Document Fail2ban-disabled rationale (CrowdSec is policy) [ci skip]
## Context

An audit of the mailserver stack raised the question: why is Fail2ban
disabled in the docker-mailserver deployment? The setting
`ENABLE_FAIL2BAN = "0"` lives in the env ConfigMap at
`stacks/mailserver/modules/mailserver/main.tf:68` with no documented
rationale, which made the decision look accidental rather than
deliberate.

The decision is deliberate: CrowdSec is the cluster-wide bouncer for
SSH, HTTP, and SMTP/IMAP brute-force defence. It already tails
`postfix` + `dovecot` logs via the installed collections and enforces
decisions at the LB/firewall tier with real client IPs preserved by
`externalTrafficPolicy: Local` on the dedicated MetalLB IP. Enabling
Fail2ban in-pod would duplicate that response path — two systems
racing to ban the same offender from different enforcement points,
iptables churn inside the container, and a split audit trail across
two decision stores. User decision 2026-04-18: keep disabled, document
the decision so the next auditor doesn't have to re-derive it.

## This change

Adds a new subsection "Fail2ban Disabled (CrowdSec is the Policy)" to
the Security section of `docs/architecture/mailserver.md`, placed
immediately after the existing CrowdSec Integration block. The
paragraph cites `stacks/mailserver/modules/mailserver/main.tf:68`
(where `ENABLE_FAIL2BAN = "0"` lives) and explains why duplicating the
layer would make things worse, not better. Pure docs — no Terraform
touched.

## Test Plan

### Automated
None — docs-only change. No tests, lint, or type checks apply to
markdown prose.

### Manual Verification
1. `less infra/docs/architecture/mailserver.md` — locate the Security
   section; confirm the new "Fail2ban Disabled (CrowdSec is the
   Policy)" subsection appears between "CrowdSec Integration" and
   "Rspamd".
2. Render on GitHub or via a markdown previewer; confirm the inline
   link to `main.tf` resolves and the paragraph reads cleanly.
3. `grep -n 'ENABLE_FAIL2BAN' infra/stacks/mailserver/modules/mailserver/main.tf`
   — confirm it still reports the value on line 68, matching the
   citation in the doc.

Closes: code-zhn

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:47:59 +00:00
Viktor Barzin
17a3e03e07 [owntracks] Bridge Recorder → Dawarich via Lua hook script
## Context

Viktor wanted live forwarding from Owntracks to Dawarich so his map
stays in sync without a periodic backfill. The original plan assumed
ot-recorder honoured an `OTR_HTTPHOOK` environment variable — but
Recorder 1.0.1 (latest on Docker Hub as of Aug 2025) has no such
feature:

```
$ kubectl -n owntracks exec deploy/owntracks -- \
    strings /usr/bin/ot-recorder | grep -iE 'hook|webhook|http_post'
(no matches)
```

Lua hooks, on the other hand, are first-class: `--lua-script` loads a
file and calls the `otr_hook(topic, _type, data)` function for every
publish. That is the pivot this commit makes.

## This change

Mount a Lua script via ConfigMap and tell ot-recorder to load it:

```
Phone POST /pub ---> Traefik ---> Recorder pod
                                     |
                                     | handle_payload() writes .rec
                                     | otr_hook(topic,_type,data)
                                     |   |
                                     |   +---> os.execute("curl … &")
                                     |             |
                                     |             v
                                     |         Dawarich /api/v1/owntracks/points
                                     |
                                     +---> HTTP 200 to phone
```

Per-publish cost: one `curl` subprocess, `--max-time 5`, backgrounded
with `&` so it doesn't block the HTTP response to the phone. A
Dawarich 5xx drops exactly one point — the `.rec` write still happens,
so the one-shot backfill Job can always re-play.

`DAWARICH_API_KEY` is injected from K8s Secret `owntracks-secrets`
(sourced from Vault `secret/owntracks.dawarich_api_key` via the
existing `dataFrom.extract` ExternalSecret). The Lua reads it with
`os.getenv()` so the key never lands in Terraform state.
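
The fire-and-forget pattern is easier to see outside Lua. The following Python analogue is a sketch, not the shipped hook: the host `dawarich.example` is a placeholder, and passing the key as an `api_key` query parameter is an assumption about the Dawarich endpoint rather than something this commit confirms.

```python
import json
import os
import subprocess

# Hypothetical host; only the path comes from the diagram above.
DAWARICH_URL = "http://dawarich.example/api/v1/owntracks/points"


def build_forward_cmd(payload, api_key, url=DAWARICH_URL):
    """Build the curl argv that forwards one OwnTracks publish."""
    return [
        "curl", "--silent", "--max-time", "5",
        "-X", "POST",
        "-H", "Content-Type: application/json",
        "--data", json.dumps(payload),
        f"{url}?api_key={api_key}",  # assumption: key passed as a query parameter
    ]


def forward(payload):
    """Start curl and return immediately, without waiting for Dawarich."""
    api_key = os.getenv("DAWARICH_API_KEY", "")
    # Not calling wait() is the analogue of the trailing `&` in the Lua hook:
    # a slow or failing Dawarich cannot delay the response to the phone.
    return subprocess.Popen(build_forward_cmd(payload, api_key),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
```

As in the Lua version, the key is read from the environment at call time, so it never has to appear in the script source.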

### Key discoveries in the verification loop (why iteration count > 1)

1. The hook function must be named `otr_hook`, not `hook` (recorder's
   `luasupport.c` calls `lua_getglobal(L, "otr_hook")`). The recorder
   logs `cannot invoke otr_hook in Lua script` when missing — the
   plan's `hook()` naming was wrong.
2. Dawarich's `latitude`/`longitude` scalar columns are legacy and
   always NULL; the authoritative geometry is in the `lonlat` PostGIS
   column (`ST_AsText(lonlat::geometry)`). Early "it's broken" readings
   were me querying the wrong columns.
3. Default Recreate-strategy rollouts cause ~30s 502/503 windows on
   the ingress — tolerable, but every apply is visible as an outage
   to the phone. Batching edits is important.

## What is NOT in this change

- **Not** OTR_HTTPHOOK. Removed with this commit (dead env var).
- **Not** the one-shot backfill Job — that comes after the phone
  buffer has flushed to avoid racing against incoming hook POSTs
  (follow-up: code-h2r).
- **Not** Anca's bridge — a second Recorder instance or a smarter
  hook is needed to route her posts under her own Dawarich api_key
  (follow-up: code-72g).
- No Ingress or Service change — Commit 1 (`a21d4a44`) already landed
  those.

## Test Plan

### Automated

```
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 1 added, 1 changed, 0 destroyed.

$ kubectl -n owntracks logs deploy/owntracks --tail=5
+ initializing Lua hooks from `/hook/dawarich-hook.lua'
+ dawarich-bridge: init
+ HTTP listener started on 0.0.0.0:8083, without browser-apikey
...
+ dawarich-bridge: tst=1 lat=0 lon=0 ok=true
```

### Manual Verification

```
$ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
$ TST=$(date +%s)
$ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \
    curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \
    -H 'Content-Type: application/json' \
    -H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \
    -d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \
    https://owntracks.viktorbarzin.me/pub
HTTP 200

$ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \
    psql -U postgres -d dawarich -c \
    "SELECT timestamp, ST_AsText(lonlat::geometry) FROM points \
     WHERE user_id=1 AND timestamp=$TST"
 timestamp  |        st_astext
------------+-------------------------
 1776555707 | POINT(-0.1278 51.5074)
```

Real phone traffic (from in-flight buffer flush) lands in Dawarich too:
`kubectl -n traefik logs -l app.kubernetes.io/name=traefik | grep 'POST /api/v1/owntracks/points'`
shows ingress POSTs from `owntracks` namespace to `dawarich` backend
with status 200.

### Reproduce locally

1. `vault login -method=oidc`
2. `kubectl -n owntracks logs deploy/owntracks --tail=20` — expect
   `dawarich-bridge: init` after the Lua loader line.
3. Do the curl above, poll the DB, expect `POINT(lon lat)`.

Closes: code-z9b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:47:22 +00:00
Viktor Barzin
cfd0f5bcc9 [mailserver] Add liveness/readiness TCP probes [ci skip]
## Context

The mailserver container (Postfix + Dovecot in one pod) had no liveness, readiness, or startup probes declared. If either daemon deadlocked or hung on a socket, Kubernetes had no way to detect it and restart. The only external canary was the email-roundtrip-monitor CronJob which runs on a 20-minute interval, giving a detection lag of 20-60 minutes — long enough for real delivery failures before an alert fires.

Tracked as bd code-ekf out of the mailserver probe audit. Both port 25 (SMTP) and port 993 (IMAPS) are cheap, reliable up-signals — the existing e2e probe already hits IMAPS, so TCP probes on those ports are a close proxy for user-visible service health without the cost of full SMTP/IMAP handshakes every 10s.

## This change

Adds a readiness_probe (TCP :25, initial_delay=30s, period=10s) and a liveness_probe (TCP :993, initial_delay=60s, period=60s, timeout=15s) to the mailserver deployment's primary container.

Design choices:
- **TCP over exec/HTTP**: the daemons do not expose HTTP health; exec probes would require shelling into the container with auth for SMTP/IMAP banner checks, which is both costly and flaky. TCP accept is sufficient — if postfix cannot accept a TCP connection on :25 it is unambiguously broken.
- **Split ports per probe**: readiness on :25 (the public SMTP surface — if this is down, external delivery is broken) and liveness on :993 (IMAPS, the other critical daemon — catches Dovecot deadlocks independently of Postfix).
- **30s readiness delay**: Postfix needs ~20-30s to warm up including chroot setup and DKIM key loading; probing earlier would cause bogus NotReady cycles on deploy.
- **60s liveness delay + 60s period + 15s timeout**: generous so transient blips (brief CPU spike, RBL timeout, slow NFS unmount during rotation) do not trigger a restart loop. With failure_threshold=3 (default), a real deadlock is detected in ~3 minutes; false positives on transient load are suppressed.
- **No startup_probe**: the 60s liveness initial_delay already covers the warmup window; adding a startup probe would be redundant machinery.
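
To make the timing concrete, here is a rough Python sketch of what a TCP probe checks and how long a hung daemon can go unnoticed. Both function names are hypothetical, and the formula is an approximate upper bound (fault occurs right after a successful probe), not the kubelet's exact behaviour.

```python
import socket


def tcp_probe(host, port, timeout):
    """Succeed if a TCP connection is accepted within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def worst_case_detection_seconds(period, failure_threshold, timeout):
    """Approximate upper bound from fault to the final failed probe."""
    # failure_threshold consecutive probes must fail, one per period,
    # and the last one may take up to `timeout` to be declared failed.
    return period * failure_threshold + timeout


if __name__ == "__main__":
    print("liveness :993 ->", worst_case_detection_seconds(60, 3, 15), "s")
    print("readiness :25 ->", worst_case_detection_seconds(10, 3, 1), "s")
```

With the liveness settings above this gives 195 seconds, consistent with the "~3 minutes" estimate; the readiness settings give roughly half a minute before the pod is marked NotReady.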

## What is NOT in this change

- No startup_probe (liveness initial_delay_seconds=60 handles warmup)
- No exec-based probes (banner-check probes are out of scope and not needed)
- No changes to the opendkim or other sidecars
- Pre-existing drift in other stacks (dawarich namespace label, owntracks dawarich-hook wiring) is deliberately left out — those are separate workstreams

## Test Plan

### Automated

Applied via `tg apply -target=kubernetes_deployment.mailserver` before this commit. Current pod state:

```
$ kubectl get pod -n mailserver -l app=mailserver
NAME                          READY   STATUS    RESTARTS   AGE
mailserver-6c6bf77ffb-w7nl5   2/2     Running   0          2m26s

$ kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness|Restart Count|Status:|Ready:)"
Status:               Running
    Ready:          True
    Restart Count:  0
    Ready:          True
    Restart Count:  0
    Liveness:   tcp-socket :993 delay=60s timeout=15s period=60s #success=1 #failure=3
    Readiness:  tcp-socket :25 delay=30s timeout=1s period=10s #success=1 #failure=3
```

Pod has run >120s (two full liveness cycles) with RESTARTS=0 and Ready=True.

### Manual Verification

1. Confirm probes are declared on the live pod:
   ```
   kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness)"
   ```
   Expected: `Liveness: tcp-socket :993 ...` and `Readiness: tcp-socket :25 ...`

2. Confirm pod stays Ready under normal load for 5+ minutes:
   ```
   kubectl get pod -n mailserver -l app=mailserver -w
   ```
   Expected: RESTARTS stays at 0, READY stays at 2/2.

3. (Optional) Failure-simulate by dropping :993 inside the pod and observing liveness failure + restart within ~3 minutes (3 × period_seconds).

## Reproduce locally

1. `cd infra/stacks/mailserver`
2. `tg plan -target=kubernetes_deployment.mailserver`
3. Expected: no drift (or only the probe additions if rolling forward a stale state)
4. `kubectl get pod -n mailserver -l app=mailserver` — pod Ready, RESTARTS=0
5. `kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness)"` — both probes present

Closes: code-ekf

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:45:17 +00:00
70 changed files with 8426 additions and 4768 deletions

@@ -137,7 +137,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Mailgun API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Mailserver on dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` for CrowdSec real-IP detection. Vault: `mailgun_api_key` in `secret/viktor` (probe), `brevo_api_key` in `secret/viktor` (relay).
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
## Storage & Backup Architecture


@ -19,15 +19,25 @@ Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No mark
## RSU handling (important — Meta UK payslips)
UK payslips for equity-compensated employees (e.g. Meta) report RSU vests as NOTIONAL pay for HMRC reporting only — the actual share grant + tax is handled by the broker (Schwab), which sells shares to cover withholding. On the payslip:
UK payslips for equity-compensated employees (e.g. Meta) report RSU vests as NOTIONAL pay for HMRC reporting only — the broker (Schwab) sells shares to cover US-side withholding but the UK payslip ALSO runs the vest through PAYE via a grossed-up Taxable Pay line. Meta UK template:
- An EARNINGS line appears with labels like `RSU Vest`, `Restricted Stock Units`, `Stock Value`, `Notional Pay`, `Share Award`, `GSU Vest`, `Equity Vest` → populate `rsu_vest`.
- A DEDUCTION line of equal-or-similar magnitude nets it back out. Labels: `Shares Retained`, `Stock Tax Withholding`, `RSU Offset`, `Notional Pay Offset`, `Shares Withheld` → populate `rsu_offset`.
- EARNINGS lines: `RSU Tax Offset` (grossed-up vest value) and optionally `RSU Excs Refund` (over-withheld amount returned). SUM BOTH into `rsu_vest`. Other labels seen on non-Meta templates: `RSU Vest`, `Restricted Stock Units`, `Notional Pay`, `GSU Vest`.
- Meta's template does NOT use a matching offset deduction — `rsu_offset` should be 0. Taxable Pay is grossed up to (Total Payment + rsu_vest) so PAYE already includes the RSU share.
- For non-Meta templates that DO use an offset (`Shares Retained`, `Notional Pay Offset`), populate `rsu_offset` with the magnitude.
If you see either line, populate BOTH fields. Do NOT add them to `other_deductions` and do NOT let them count as regular income_tax/NI even though some templates put them near the tax block. They exist for reporting.
If you see ANY of these lines, do NOT add them to `other_deductions` and do NOT let them count as regular income_tax/NI.
If the payslip has no stock component, leave both as 0.
## Earnings decomposition (v2)
- `salary`: the basic salary/pay line (usually the first "Salary" or "Basic Pay" entry in the Earnings/Payments block).
- `bonus`: the bonus line (`Perform Bonus`, `Bonus`, `Performance Bonus`). If absent or 0, leave as 0 — that's meaningful signal (bonus-sacrifice months). Don't invent.
- `pension_sacrifice`: **ABSOLUTE VALUE** of any NEGATIVE pension line in the Payments block (e.g. `AE Pension EE -600.20` → `600.20`). This is salary-sacrifice and is ALREADY subtracted from Total Payment/gross. Do not also put it in `pension_employee`.
- `pension_employee`: use this ONLY when pension appears as a POSITIVE deduction on the Deductions side (legacy Meta variant A, or non-Meta templates). Never double-count.
- `taxable_pay`: the "Taxable Pay" line in the summary block, THIS PERIOD column. For Meta this is the post-sacrifice + RSU-grossed-up base that PAYE is computed on. If the payslip doesn't surface a summary block, null.
- `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`: YTD column values from the same summary block. Null if not present.
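As a rough cross-check of the Meta rule above (Taxable Pay grossed up to Total Payment + `rsu_vest`, where `rsu_vest` sums `RSU Tax Offset` and `RSU Excs Refund`), the arithmetic could be sketched as follows. The function name `expected_taxable_pay` and the sample amounts are hypothetical, and the sketch assumes amounts are plain decimals with no thousands separators.

```bash
# Hypothetical consistency check for a Meta-style payslip.
# Assumption: Total Payment is already net of salary-sacrifice pension,
# so only the two RSU earnings lines are added back.
expected_taxable_pay() {
  total_payment="$1"; rsu_tax_offset="$2"; rsu_excs_refund="${3:-0}"
  awk -v t="$total_payment" -v o="$rsu_tax_offset" -v r="$rsu_excs_refund" \
    'BEGIN { printf "%.2f\n", t + o + r }'
}

# Illustrative numbers only, not taken from a real payslip:
# expected_taxable_pay 5000.00 1200.50 99.50   # prints 6300.00
```

If the printed value differs materially from the extracted `taxable_pay`, that may indicate a missed RSU line or a non-Meta template that uses an offset deduction instead.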
## Fast path: PAYSLIP_TEXT is present
If the prompt contains `PAYSLIP_TEXT:`, the caller has already run `pdftotext -layout`. Skip Steps 1-2 entirely — the text is already in your context. Go straight to Step 3.


@ -34,7 +34,11 @@ You receive these parameters in your invocation:
- **Infra repo**: `/home/wizard/code/infra`
- **Config**: `/home/wizard/code/infra/.claude/reference/upgrade-config.json`
- **Kubeconfig**: `/home/wizard/code/infra/config`
- **Vault**: Authenticate with `vault login -method=oidc` if needed. Secrets at `secret/viktor` and `secret/platform`.
- **Secrets (env-var contract)**: You run in the `claude-agent-service` pod, which has NO Vault CLI auth — do NOT call `vault kv get`. The following env vars are pre-loaded via `envFrom: claude-agent-secrets`:
- `GITHUB_TOKEN` — PAT for GitHub API (changelog fetch) and `git push`
- `WOODPECKER_API_TOKEN` — bearer for `ci.viktorbarzin.me/api/...`
- `SLACK_WEBHOOK_URL` — full Slack webhook URL for status messages
- Anything else (e.g. `kubectl`) uses the pod's ServiceAccount or in-repo git-crypt-unlocked secrets.
- **Git remote**: `origin` → `github.com/ViktorBarzin/infra.git`
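A preflight check along these lines could confirm the env-var contract before any API call is attempted. It is a minimal sketch: the helper name `require_env` is hypothetical, and only the three variable names come from the contract above.

```bash
# Hypothetical helper: fail fast if any listed variable is unset or empty.
require_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup via eval keeps this portable to plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "ERROR: missing required env vars:$missing" >&2
    return 1
  fi
}

# Usage (names taken from the contract above):
# require_env GITHUB_TOKEN WOODPECKER_API_TOKEN SLACK_WEBHOOK_URL || exit 1
```

Failing early with the list of missing names is likely easier to debug than a `401` from a later `curl` call.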
## NEVER Do
@ -118,7 +122,6 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
3. **For Helm charts**: Check `helm_chart_repo_overrides` for the chart repository URL
4. If auto-detect fails, verify the repo exists:
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -sf -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/${DETECTED_REPO}" > /dev/null
```
@ -128,7 +131,6 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
## Step 3: Fetch Changelogs via GitHub API
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/${GITHUB_REPO}/releases?per_page=100"
```
@ -171,11 +173,9 @@ Scan all intermediate release notes for breaking change indicators from the conf
## Step 5: Slack Notification — Starting
```bash
SLACK_WEBHOOK=$(vault kv get -field=alertmanager_slack_api_url secret/platform)
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\"}" \
"$SLACK_WEBHOOK"
"$SLACK_WEBHOOK_URL"
```
For CAUTION risk, include breaking change excerpts in the Slack message.
@ -266,23 +266,28 @@ UPGRADE_SHA=$(git rev-parse HEAD)
## Step 9: Wait for Woodpecker CI
The commit triggers the `app-stacks.yml` pipeline (or `default.yml` for platform stacks).
The commit triggers one pipeline that runs multiple **workflows** in parallel — e.g. `default` (terragrunt apply) and `build-cli` (builds the infra CLI image). Only the `default` workflow gates your upgrade; the other workflows may be unrelated and sometimes fail without breaking anything on the cluster (current example: `build-cli` push to `registry.viktorbarzin.me:5050` is known-broken as of 2026-04-19).
**Do not read the overall pipeline `status`** — it reports `failure` whenever *any* workflow fails. Read the `default` workflow's `state` instead.
```bash
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_token secret/viktor)
# Find the pipeline for our commit
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=10" \
| jq --arg sha "$UPGRADE_SHA" '.[] | select(.commit==$sha) | .number'
# → $PIPELINE_NUMBER
# Fetch detail (includes workflows[])
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines/$PIPELINE_NUMBER" \
| jq '.workflows[] | select(.name=="default") | .state'
# → "running" | "pending" | "success" | "failure" | "error" | "killed"
```
Poll for the pipeline triggered by our commit:
```bash
# Get latest pipeline
curl -s -H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=5"
```
Poll every 30 seconds until the `default` workflow's `state` is terminal (`success`, `failure`, `error`, `killed`). Timeout after 15 minutes.
Find the pipeline matching our commit SHA. Poll every 30 seconds until status is `success`, `failure`, `error`, or `killed`. Timeout after 15 minutes.
**If CI fails** → proceed to Step 10 (rollback).
**If CI succeeds** → proceed to verification.
**If `default` state is `success`** → proceed to Step 10 (verification), regardless of other workflows' state.
**If `default` state is terminal-and-not-success, or the poll times out** → proceed to Step 10b (rollback).
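The polling rule above (30-second interval, 15-minute timeout, gate only on the `default` workflow) could be sketched as below. This is an assumption-laden outline rather than the agent's actual loop: `fetch_state` is a hypothetical callback that prints the workflow state without quotes (for example the `jq` query above run with `-r`), and `POLL_INTERVAL` / `POLL_TIMEOUT` are illustrative variable names.

```bash
# Sketch only: interval and timeout default to the values stated in the text.
POLL_INTERVAL="${POLL_INTERVAL:-30}"   # seconds between polls
POLL_TIMEOUT="${POLL_TIMEOUT:-900}"    # 15 minutes

is_terminal_state() {
  case "$1" in
    success|failure|error|killed) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the final state (or "timeout").
# Returns 0 only when the state is "success", non-zero otherwise.
wait_for_default_workflow() {
  fetch_state="$1"          # name of a function/command that prints the state
  elapsed=0
  while [ "$elapsed" -lt "$POLL_TIMEOUT" ]; do
    state="$("$fetch_state")"
    if is_terminal_state "$state"; then
      echo "$state"
      [ "$state" = "success" ]
      return
    fi
    sleep "$POLL_INTERVAL"
    elapsed=$((elapsed + POLL_INTERVAL))
  done
  echo "timeout"
  return 1
}
```

A caller would proceed to verification when the function returns 0 and to rollback on any other outcome, which matches the two branches listed above.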
## Step 10: Verify
@ -341,7 +346,7 @@ Re-run verification checks to confirm rollback succeeded. If rollback verificati
```bash
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] CRITICAL: Rollback of *${STACK}* also failed. Manual intervention required.\"}" \
"$SLACK_WEBHOOK"
"$SLACK_WEBHOOK_URL"
```
## Step 11: Report Results
@ -350,14 +355,14 @@ curl -s -X POST -H 'Content-type: application/json' \
```bash
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] SUCCESS: *${STACK}* upgraded ${OLD_VERSION} -> ${NEW_VERSION}\nVerification: pods ready, HTTP OK${UPTIME_KUMA_MSG}\nCommit: ${UPGRADE_SHA}\"}" \
"$SLACK_WEBHOOK"
"$SLACK_WEBHOOK_URL"
```
### On failure + rollback
```bash
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] FAILED + ROLLED BACK: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION}\nReason: ${FAILURE_REASON}\nRollback commit: ${ROLLBACK_SHA}\nRollback status: ${ROLLBACK_STATUS}\"}" \
"$SLACK_WEBHOOK"
"$SLACK_WEBHOOK_URL"
```
## Edge Cases



@ -7,339 +7,314 @@ description: |
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 8 standard K8s health checks with safe auto-fix for evicted pods
and stuck CrashLoopBackOff pods.
Runs 42 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability) with safe auto-fix for evicted pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21
version: 2.0.0
date: 2026-04-19
---
# Cluster Health Check
## Overview
## MANDATORY: Run the script first
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes in the `openclaw` namespace
- **Slack notifications**: Posts results to the webhook URL in `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Automatically deletes evicted/failed pods and CrashLoopBackOff pods with >10 restarts
- **Exit code**: 0 = healthy, 1 = issues found
## Quick Check
Run the health check interactively:
When this skill is invoked, your **first action** must be to run the
cluster health check script and reason over its output before doing
anything else. Do not improvise individual `kubectl` calls — the
script is the authoritative surface.
```bash
# Report only, no Slack notification
bash /workspace/infra/.claude/cluster-health.sh --no-slack
# Full run with Slack notification
bash /workspace/infra/.claude/cluster-health.sh
# Report only, no auto-fix and no Slack
bash /workspace/infra/.claude/cluster-health.sh --no-fix --no-slack
cd /home/wizard/code
bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json
```
## What It Checks
If the session is rooted elsewhere, fall back to the absolute path:
| # | Check | Auto-Fix | Alerts |
|---|-------|----------|--------|
| 1 | **Node Health** — NotReady nodes, MemoryPressure, DiskPressure, PIDPressure | No | Yes |
| 2 | **Pod Health** — CrashLoopBackOff, ImagePullBackOff, ErrImagePull, Error | Yes (CrashLoop >10 restarts) | Yes |
| 3 | **Evicted/Failed Pods** — Pods in `Failed` phase | Yes (deletes all) | Yes |
| 4 | **Failed Deployments** — Deployments with ready != desired replicas | No | Yes |
| 5 | **Pending PVCs** — PersistentVolumeClaims not in `Bound` state | No | Yes |
| 6 | **Resource Pressure** — Node CPU or memory >80% (warn) or >90% (issue) | No | Yes |
| 7 | **CronJob Failures** — Failed CronJob-owned Jobs in the last 24h | No | Yes |
| 8 | **DaemonSet Health** — DaemonSets with desired != ready | No | Yes |
```bash
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --json
```
Then:
1. Parse the JSON. Report the PASS/WARN/FAIL counts + overall verdict.
2. Iterate every FAIL and WARN check, describe what tripped, and propose
the remediation path (use the recipes below).
3. Only reach for ad-hoc `kubectl` commands when investigating a
specific failure beyond what the script reported.
Exit codes: `0` = healthy, `1` = warnings only, `2` = failures.
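The exit-code contract can be turned into a verdict string with a small helper such as the one below. The name `verdict_from_exit_code` is hypothetical; only the three code meanings come from the line above.

```bash
# Map the script's documented exit codes to a verdict string.
verdict_from_exit_code() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "warnings" ;;
    2) echo "failures" ;;
    *) echo "unknown" ;;   # anything else: treat as a script error
  esac
}

# Usage:
# bash infra/scripts/cluster_healthcheck.sh --json > /tmp/cluster-health.json
# verdict_from_exit_code "$?"
```

Note that when the script is piped into `tee`, `$?` reflects `tee` rather than the script. Redirecting to a file as in the usage comment, or reading `${PIPESTATUS[0]}` in bash, preserves the script's own exit code.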
## Quick flags
```bash
# Human-readable report (default), no auto-fix
bash infra/scripts/cluster_healthcheck.sh
# Machine-readable JSON summary
bash infra/scripts/cluster_healthcheck.sh --json
# Only show WARN + FAIL (suppress PASS noise)
bash infra/scripts/cluster_healthcheck.sh --quiet
# Enable auto-fix (delete evicted pods, kick stuck CrashLoop pods)
bash infra/scripts/cluster_healthcheck.sh --fix
# Combined: quiet JSON without auto-fix
bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
# Custom kubeconfig
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
## What It Checks (42 checks)
| # | Check | Notes |
|---|-------|-------|
| 1 | Node Status | NotReady nodes, version drift |
| 2 | Node Resources | CPU/mem >80% (warn) / >90% (fail) |
| 3 | Node Conditions | MemoryPressure / DiskPressure / PIDPressure |
| 4 | Problematic Pods | CrashLoopBackOff / Error / ImagePullBackOff |
| 5 | Evicted/Failed Pods | `status.phase=Failed` |
| 6 | DaemonSets | desired == ready |
| 7 | Deployments | ready == desired replicas |
| 8 | PVC Status | all Bound |
| 9 | HPA Health | targets not `<unknown>`, utilization <100% |
| 10 | CronJob Failures | job conditions `Failed=True` in last 24h |
| 11 | CrowdSec Agents | all pods Running |
| 12 | Ingress Routes | every ingress has an LB IP + Traefik LB |
| 13 | Prometheus Alerts | count of firing alerts |
| 14 | Uptime Kuma Monitors | internal + external monitors up |
| 15 | ResourceQuota Pressure | any quota >80% used |
| 16 | StatefulSets | ready == desired |
| 17 | Node Disk Usage | ephemeral-storage <80% |
| 18 | Helm Release Health | all `deployed` (no `pending-*`) |
| 19 | Kyverno Policy Engine | all pods Running |
| 20 | NFS Connectivity | 192.168.1.127 showmount / port 2049 |
| 21 | DNS Resolution | Technitium resolves internal + external |
| 22 | TLS Certificate Expiry | TLS `Secret` certs >30d valid |
| 23 | GPU Health | nvidia namespace + device-plugin Running |
| 24 | Cloudflare Tunnel | pods Running |
| 25 | Resource Usage | node CPU/mem headroom |
| 26 | HA Sofia — Entity Availability | Home Assistant unavailable/unknown count |
| 27 | HA Sofia — Integration Health | config entries setup_error / not_loaded |
| 28 | HA Sofia — Automation Status | disabled / stale (>30d) automations |
| 29 | HA Sofia — System Resources | HA CPU / mem / disk |
| 30 | Hardware Exporters | snmp / idrac-redfish / proxmox / tuya pods + scrapes |
| 31 | cert-manager — Certificate Readiness | Certificate CRs with `Ready!=True` |
| 32 | cert-manager — Certificate Expiry (<14d) | notAfter within 14d |
| 33 | cert-manager — Failed CertificateRequests | `Ready=False, reason=Failed` |
| 34 | Backup Freshness — Per-DB Dumps | MySQL + PG dumps within 25h |
| 35 | Backup Freshness — Offsite Sync | Pushgateway `backup_last_success_timestamp` <27h |
| 36 | Backup Freshness — LVM PVC Snapshots | newest thin snapshot <25h (SSH PVE) |
| 37 | Monitoring — Prometheus + Alertmanager | `/-/ready` + AM pods Running |
| 38 | Monitoring — Vault Sealed Status | `vault status` reports `Sealed: false` |
| 39 | Monitoring — ClusterSecretStore Ready | `vault-kv` + `vault-database` Ready |
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
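The freshness thresholds in checks 34 to 36 (25h and 27h) all reduce to the same comparison of a last-success timestamp against a maximum age. A minimal sketch, assuming the timestamp is available as integer epoch seconds; the helper name `is_fresh` is hypothetical and this is not a copy of the script's implementation:

```bash
# is_fresh LAST_SUCCESS_EPOCH MAX_AGE_HOURS [NOW_EPOCH]
# Returns 0 when the last success is within MAX_AGE_HOURS of "now".
is_fresh() {
  last="$1"
  max_hours="$2"
  now="${3:-$(date +%s)}"   # optional third argument makes the check testable
  age=$((now - last))
  [ "$age" -le $((max_hours * 3600)) ]
}

# Usage (thresholds from the table above):
# is_fresh "$LAST_DUMP_EPOCH" 25 || echo "FAIL: per-DB dump is stale"
# is_fresh "$LAST_OFFSITE_EPOCH" 27 || echo "FAIL: offsite sync is stale"
```

Pushgateway may expose the metric as a float; in that case it would need truncating to an integer before being passed in.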
## Safe Auto-Fix Rules
### Safe to auto-fix (the script does these automatically)
`--fix` only performs operations that are genuinely reversible and
observable. Nothing here rewrites Terraform state or mutates the cluster
beyond "delete pod".
1. **Evicted/Failed pods** — These are already terminated and just cluttering the namespace:
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```
### Done automatically by `--fix`
2. **CrashLoopBackOff pods with >10 restarts** — The pod is stuck in a crash loop; deleting lets the controller recreate it with a fresh backoff timer:
```bash
kubectl delete pod -n <namespace> <pod-name> --grace-period=0
```
- **Evicted / Failed pods** — delete them; the controller recreates.
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```
- **CrashLoopBackOff pods with >10 restarts** — delete once to reset
backoff timer.
### NEVER auto-fix (requires human investigation)
- **NotReady nodes** — Could be network, kubelet, or hardware issue; needs SSH investigation
- **DiskPressure / MemoryPressure / PIDPressure** — Root cause must be identified
- **ImagePullBackOff** — Usually a wrong image tag or registry issue; needs config fix
- **Failed deployments** — Could be resource limits, bad config, missing secrets
- **Pending PVCs** — Usually NFS export missing or storage class issue
- **Resource pressure >90%** — Need to identify which pods are consuming resources
- **CronJob failures** — Need to check job logs to understand why it failed
- **DaemonSet issues** — Could be node taints, resource limits, or image issues
- NotReady nodes
- MemoryPressure / DiskPressure / PIDPressure
- ImagePullBackOff (usually a bad tag / registry credential)
- Deployment ready-replica mismatch
- Pending PVCs
- Node CPU/memory >90%
- CronJob failures
- DaemonSet desired != ready
- Vault sealed
- ClusterSecretStore not Ready
- cert-manager Certificate failures
- Backup freshness regressions
- Any external-reachability failure
## Deep Investigation
## Deep-investigation recipes per failure mode
When the health check reports issues, use these commands to investigate further.
### Node Issues
### Node Issues (checks 1, 3, 17, 25)
```bash
# Describe the problematic node (events, conditions, capacity)
kubectl describe node <node-name>
# Check resource usage across all nodes
kubectl describe node <node>
kubectl top nodes
# Check recent events on a specific node
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
# SSH to the node for direct inspection
ssh root@<node-ip>
kubectl get events --field-selector involvedObject.name=<node> --sort-by='.lastTimestamp'
# SSH to the node
ssh root@10.0.20.10X
systemctl status kubelet
journalctl -u kubelet --since "30 minutes ago" | tail -100
df -h
free -h
df -h ; free -h
```
### Pod Issues
Node IPs: `10.0.20.100` master, `.101` node1 (GPU), `.102` node2,
`.103` node3, `.104` node4.
### Pod Issues (checks 4, 5, 11, 19)
```bash
# Describe the pod (events, conditions, container statuses)
kubectl describe pod -n <namespace> <pod-name>
# Check current logs
kubectl logs -n <namespace> <pod-name> --tail=100
# Check logs from the previous crashed container
kubectl logs -n <namespace> <pod-name> --previous --tail=100
# Check events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Check all pods in a namespace
kubectl get pods -n <namespace> -o wide
kubectl describe pod -n <ns> <pod>
kubectl logs -n <ns> <pod> --tail=200
kubectl logs -n <ns> <pod> --previous --tail=200
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
```
### Deployment Issues
Common failure causes: OOMKilled (raise mem limit in Terraform), bad
config / missing env var, DB connection failure (check `dbaas` pods),
NFS mount failure (`showmount -e 192.168.1.127`), stale
imagePullSecret.
### Deployment / StatefulSet / DaemonSet (checks 6, 7, 16)
```bash
# Describe the deployment (strategy, conditions, events)
kubectl describe deployment -n <namespace> <deployment-name>
# Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>
# Check rollout history
kubectl rollout history deployment -n <namespace> <deployment-name>
# Check the replicaset
kubectl get rs -n <namespace> -l app=<app-label>
kubectl describe deployment -n <ns> <name>
kubectl rollout status deployment -n <ns> <name>
kubectl rollout history deployment -n <ns> <name>
kubectl get rs -n <ns> -l app=<app>
```
### PVC Issues
### PVC (check 8)
```bash
# Describe the PVC (events, status, storage class)
kubectl describe pvc -n <namespace> <pvc-name>
# Check PVs
kubectl get pv
# Check events related to PVCs
kubectl get events -n <namespace> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
# Verify NFS export exists
showmount -e 10.0.10.15 | grep <service-name>
kubectl describe pvc -n <ns> <pvc>
kubectl get events -n <ns> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
kubectl get pv | grep <pvc>
showmount -e 192.168.1.127
```
### Resource Pressure
### cert-manager (checks 31, 32, 33)
```bash
# Top nodes (CPU and memory usage)
kubectl top nodes
# Top pods sorted by memory (cluster-wide)
kubectl top pods -A --sort-by=memory | head -20
# Top pods sorted by CPU (cluster-wide)
kubectl top pods -A --sort-by=cpu | head -20
# Check resource requests/limits in a namespace
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get certificate -A
kubectl describe certificate -n <ns> <name>
kubectl get certificaterequest -A
kubectl describe certificaterequest -n <ns> <name>
kubectl logs -n cert-manager deploy/cert-manager | tail -50
```
## Common Remediation
Common causes: ACME HTTP-01 challenge blocked, ClusterIssuer missing
DNS provider secret, rate-limit from Let's Encrypt.
### Persistent CrashLoopBackOff
### Backups (checks 34, 35, 36)
A pod keeps crashing even after the auto-fix deletes it.
```bash
# Per-DB dumps (inside the DB pod)
kubectl exec -n dbaas mysql-standalone-0 -- ls -lah /backup/per-db/
kubectl exec -n dbaas pg-cluster-0 -- ls -lah /backup/per-db/
1. **Check logs from the crashed container**:
```bash
kubectl logs -n <namespace> <pod-name> --previous --tail=200
```
# Pushgateway metrics
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- http://prometheus-prometheus-pushgateway:9091/metrics | \
grep backup_last_success_timestamp
2. **Check the pod description for clues**:
```bash
kubectl describe pod -n <namespace> <pod-name>
```
Look for:
- `OOMKilled` in Last State — the container ran out of memory
- `Error` with exit code 1 — application error (bad config, missing env var, DB connection failure)
- `Error` with exit code 137 — killed by OOM killer or liveness probe
- `Error` with exit code 143 — SIGTERM (graceful shutdown failure)
# LVM snapshots on PVE host
ssh -o BatchMode=yes root@192.168.1.127 \
'lvs -o lv_name,lv_time,lv_size --noheadings | grep snap'
```
3. **Common causes**:
- **OOMKilled**: Increase memory limits in Terraform (see below)
- **Bad config**: Check environment variables, secrets, config maps
- **DB connection failure**: Verify the database pod is running (`kubectl get pods -n dbaas`)
- **NFS mount failure**: Verify NFS export exists (`showmount -e 10.0.10.15`)
- **Missing secret**: Check if TLS secret or other secrets exist in the namespace
If offsite sync is stale, the common cause is the
`offsite-sync-backup.service` systemd unit on the PVE host failing.
`ssh root@192.168.1.127 'systemctl status offsite-sync-backup'`.
### OOMKilled
### Monitoring stack (checks 37, 38, 39)
The container was killed because it exceeded its memory limit.
```bash
# Prometheus
kubectl exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/-/ready
kubectl logs -n monitoring deploy/prometheus-server --tail=100
1. **Check current limits**:
```bash
kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Limits"
```
# Alertmanager
kubectl get pods -n monitoring | grep alertmanager
kubectl logs -n monitoring -l app=prometheus-alertmanager --tail=100
2. **Fix in Terraform** — Edit `modules/kubernetes/<service>/main.tf` and increase the memory limit:
```hcl
resources {
limits = {
memory = "2Gi" # Increase from current value
}
}
```
# Vault
kubectl exec -n vault vault-0 -- sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
# If sealed: check raft peers with `vault operator raft list-peers` and unseal.
3. **Apply the change**:
```bash
cd /workspace/infra
terraform apply -target=module.kubernetes_cluster.module.<service> -auto-approve
```
# ClusterSecretStore
kubectl get clustersecretstore
kubectl describe clustersecretstore vault-kv vault-database
kubectl logs -n external-secrets deploy/external-secrets --tail=100
```
### ImagePullBackOff
### External reachability (checks 40, 41, 42)
The container image cannot be pulled.
```bash
# Cloudflared
kubectl get pods -n cloudflared
kubectl logs -n cloudflared -l app=cloudflared --tail=100
1. **Check the exact error**:
```bash
kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Events"
```
# Authentik
kubectl get pods -n authentik -l app=authentik-server
kubectl logs -n authentik -l app=authentik-server --tail=100
2. **Common causes**:
- **Wrong image tag**: Verify the tag exists on the registry (Docker Hub, ghcr.io, etc.)
- **Private registry without credentials**: Check if imagePullSecrets are configured
- **Pull-through cache issue**: The registry cache at `10.0.20.10` may have a stale entry
```bash
# Check pull-through cache ports:
# 5000 = docker.io, 5010 = ghcr.io, 5020 = quay.io, 5030 = registry.k8s.io
curl -s http://10.0.20.10:5000/v2/_catalog | python3 -m json.tool
```
- **Registry rate limit**: Docker Hub free tier has pull limits; pull-through cache helps avoid this
# ExternalAccessDivergence alert
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
python3 -m json.tool | grep -A 5 ExternalAccessDivergence
3. **Fix**: Update the image tag in the service's Terraform module and re-apply.
# Traefik 5xx — find the hot service
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' \
| python3 -m json.tool
```
### Node NotReady
### OOMKilled remediation
A node has gone NotReady.
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Limits`
2. Edit `infra/modules/kubernetes/<service>/main.tf` and raise
`resources.limits.memory`.
3. `cd /home/wizard/code/infra && scripts/tg apply` (Tier 1) or
`terraform apply -target=module.<service>` as appropriate.
1. **Check node conditions**:
```bash
kubectl describe node <node-name> | grep -A 20 "Conditions"
```
### ImagePullBackOff remediation
2. **SSH to the node and check kubelet**:
```bash
ssh root@<node-ip>
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" | tail -50
```
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Events`
2. Verify tag exists on the source registry.
3. Check pull-through cache at `10.0.20.10:{5000,5010,5020,5030}`.
4. Update the image tag in Terraform + re-apply.
3. **Check resources**:
```bash
# On the node
df -h # Disk space
free -h # Memory
top -bn1 # CPU/processes
```
### Persistent CrashLoopBackOff after auto-fix
4. **Node IPs** (for SSH):
- `10.0.20.100` — k8s-master
- `10.0.20.101` — k8s-node1 (GPU)
- `10.0.20.102` — k8s-node2
- `10.0.20.103` — k8s-node3
- `10.0.20.104` — k8s-node4
1. `kubectl logs -n <ns> <pod> --previous --tail=200`
2. `kubectl describe pod -n <ns> <pod>` and check Last State:
- `OOMKilled` → raise memory limit
- Exit code 137 → OOM or probe killed
- Exit code 143 → SIGTERM / graceful shutdown failed
3. Cross-check dbaas + NFS + secrets are healthy.
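The exit codes in step 2 follow the usual shell convention that a value above 128 means the process was killed by signal `code - 128` (137 is signal 9, 143 is signal 15). A small decoder, with the hypothetical name `explain_exit_code` and wording taken from the list above, might look like this:

```bash
# Decode a container exit code using the 128 + signal convention.
explain_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "SIGKILL (OOM killer or failed liveness probe)" ;;
      15) echo "SIGTERM (graceful shutdown did not finish)" ;;
      *)  echo "signal $sig" ;;
    esac
  else
    echo "application exit $code"
  fi
}

# Usage:
# explain_exit_code 137
```

Codes of 128 or below are reported as ordinary application exits, which usually points to bad config, a missing env var, or a failed dependency rather than a kill.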
## Slack Webhook
## Notes on the canonical / hardlink setup
The script posts results to the Slack incoming webhook URL in `$SLACK_WEBHOOK_URL`. The message format uses Slack mrkdwn:
- All clear: green checkmark with node/pod count
- Warnings only: warning icon with details
- Issues found: red alert icon with auto-fixes applied and remaining issues
The authoritative copy of this SKILL.md lives at
`/home/wizard/code/.claude/skills/cluster-health/SKILL.md`. A hardlink
at `/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md`
points to the same inode so infra-rooted sessions also discover the
skill.
The webhook URL is passed as an environment variable from `openclaw_skill_secrets` in `terraform.tfvars`.
To verify the hardlink is intact:
## Infrastructure
```bash
stat -c '%i %n' \
/home/wizard/code/.claude/skills/cluster-health/SKILL.md \
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
```
| Component | Path / Location |
|-----------|----------------|
| Health check script | `/workspace/infra/.claude/cluster-health.sh` (in-pod) or `.claude/cluster-health.sh` (repo) |
| Terraform module | `modules/kubernetes/openclaw/main.tf` |
| CronJob definition | Defined in the OpenClaw Terraform module |
| Existing full healthcheck | `scripts/cluster_healthcheck.sh` (local-only, 24 checks with color output) |
| Infra repo (in pod) | `/workspace/infra` |
| kubectl (in pod) | `/tools/kubectl` |
| terraform (in pod) | `/tools/terraform` |
Both should print the same inode number. If they diverge (e.g. `git
checkout` replaced the file rather than updating it), re-link:
## Auto-File Incidents for SEV1/SEV2
After running health checks, if **SEV1 or SEV2 issues** are found (node down, multiple services affected, core service outage, or single important service down), auto-file a GitHub Issue:
### Severity Classification
- **SEV1**: Node NotReady, multiple services down, data at risk, core service outage (DNS, auth, ingress, databases)
- **SEV2**: Single non-core service down, degraded performance, persistent CrashLoopBackOff
- **SEV3**: Warnings only, resource pressure <90%, cosmetic → do NOT auto-file
### Workflow
1. **Dedup check**: Before filing, query open incidents:
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues?labels=incident&state=open&per_page=50"
```
If an open issue already covers the same service/namespace, **skip filing**.
2. **File the issue** with labels `incident`, `sev1` or `sev2`, `postmortem-required`:
- Title: `[AUTO] <Service/Namespace> — <brief symptom>`
- Body: full diagnostic dump (pod status, events, alerts, node state)
- The issue-automation GHA workflow will trigger the post-mortem pipeline automatically
3. **Auto-close recovered services**: If a service that previously had an auto-filed incident is now healthy:
```bash
# Comment and close
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
-d '{"body": "**Resolved** — Service recovered. Auto-closed by cluster health check."}'
curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" \
-d '{"state": "closed"}'
```
## Post-Mortem Auto-Suggest
After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:
> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
This ensures incidents are documented while context is fresh.
## Notes
1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
2. The full `scripts/cluster_healthcheck.sh` script runs 24 checks and is meant for local interactive use; this skill's script runs 8 core checks optimized for automated CronJob execution
3. When investigating issues interactively, prefer running commands directly rather than re-running the script
4. All Terraform changes must go through the `.tf` files — never use `kubectl apply/edit/patch` for persistent changes
```bash
ln -f /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
```


@ -23,6 +23,14 @@ steps:
username: viktorbarzin
password:
from_secret: dockerhub-pat
# Private registry on :5050 requires htpasswd auth since 2026-03-22.
# Without this, buildx pushes the second repo but blob HEAD comes
# back 401 → pipeline fails → CI false-negative (see bd code-12b).
- registry: registry.viktorbarzin.me:5050
username:
from_secret: registry_user
password:
from_secret: registry_password
dockerfile: cli/Dockerfile
context: cli
auto_tag: true


@ -37,6 +37,12 @@ steps:
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
# Each `- |` command runs in a fresh shell, so we can't rely on an
# `export VAULT_ADDR=...` in the auth command persisting — pin it at
# step level. VAULT_TOKEN is still per-command; we persist it to
# ~/.vault-token (auto-read by `vault` CLI) so downstream commands
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands:
# ── Skip CI commits ──
- |
@ -55,9 +61,17 @@ steps:
# ── Vault auth ──
- |
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
export VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "ERROR: Vault K8s auth failed (role=ci, ns=woodpecker)" >&2
exit 1
fi
# Persist for downstream `- |` blocks (each runs in a fresh shell,
# so exporting VAULT_TOKEN wouldn't help). `vault`, `scripts/tg`,
# and `scripts/state-sync` all fall through to ~/.vault-token when
# the env var is unset.
umask 077; printf '%s' "$VAULT_TOKEN" > "$HOME/.vault-token"
# ── Detect changed stacks ──
- |
@ -123,6 +137,7 @@ steps:
# ── Apply platform stacks (serial, with Vault advisory locks) ──
- |
FAILED_PLATFORM_STACKS=""
if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do
@ -137,6 +152,7 @@ steps:
else
echo "$OUTPUT" | tail -5
echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
fi
else
echo "$OUTPUT" | tail -3
@ -144,9 +160,12 @@ steps:
fi
done < .platform_apply
fi
# Deferred until after app stacks so both lists get a chance to run.
echo "$FAILED_PLATFORM_STACKS" > .platform_failed
# ── Apply app stacks (serial, with Vault advisory locks) ──
- |
FAILED_APP_STACKS=""
if [ -s .app_apply ]; then
echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do
@ -161,6 +180,7 @@ steps:
else
echo "$OUTPUT" | tail -5
echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
fi
else
echo "$OUTPUT" | tail -3
@ -168,6 +188,15 @@ steps:
fi
done < .app_apply
fi
# Fail the step loudly so the pipeline `default` workflow state
# reflects reality — the service-upgrade agent and CI alert cascade
# both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
# counted as failures.
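# xargs trims and collapses whitespace but keeps the separators between
# stack names (tr -d ' ' would glue "a b" into "ab" in the summary line).
FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | xargs)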
if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
exit 1
fi
# ── Commit and push state changes ──
- |
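The platform and app apply blocks hand their failure lists across shells through a file, because each `- |` command starts a fresh shell. A minimal sketch of that pattern follows, assuming one stack name per line; `record_failure` and `failed_list` are hypothetical names, and the pipeline itself uses plain variables plus `.platform_failed`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the file-based hand-off.

# Append one failed stack name to the state file ($1 = file, $2 = stack).
record_failure() {
  printf '%s\n' "$2" >> "$1"
}

# Print the recorded names on one line, space-separated.
# Prints nothing if the file is missing or empty.
failed_list() {
  [ -s "$1" ] || return 0
  tr '\n' ' ' < "$1" | sed 's/ *$//'
}
```

A final block could then exit non-zero when `failed_list` prints anything, which keeps lock-skipped stacks out of the count as long as they are never recorded.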


@ -99,7 +99,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- `config.tfvars` — non-secret configuration (plaintext)
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
- `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference)
- `scripts/cluster_healthcheck.sh` — 25-check cluster health script
- `scripts/cluster_healthcheck.sh` — 42-check cluster health script (nodes, workloads, monitoring, certs, backups, external reachability)
## Storage
- **NFS** (`nfs-proxmox` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
@ -118,6 +118,20 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
## Shared Variables (never hardcode)
`var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
## Redis Service Naming (read before wiring a new consumer)
The Redis stack (`stacks/redis/`) exposes three distinct entry points. Pick the one that matches the client's connection pattern — the wrong one causes READONLY errors or silent connection drops.
| Endpoint | Port(s) | Use for | Backed by |
|----------|---------|---------|-----------|
| `redis-master.redis.svc.cluster.local` | 6379 (redis), 26379 (sentinel) | **Default for new services.** Write-safe — HAProxy health-checks nodes and routes only to the current master. Matches `var.redis_host`. | `kubernetes_service.redis_master` → HAProxy → Bitnami StatefulSet |
| `redis-node-{0,1,2}.redis-headless.redis.svc.cluster.local` | 26379 | **Long-lived connections (PUBSUB, BLPOP, MONITOR, Sidekiq).** Use a sentinel-aware client with master name `mymaster`. Example: `stacks/nextcloud/chart_values.yaml:32-54`. | Bitnami-created headless service → pod DNS |
| `redis.redis.svc.cluster.local` | 6379 | **Do NOT use.** Helm chart's default service — selector patched by `null_resource.patch_redis_service` to match `redis-haproxy`, so today it behaves like `redis-master`. This patch is load-bearing but temporary; consumers hard-coded on this name are tracked in a beads follow-up (T0). | Bitnami chart (patched) |
**HAProxy's `timeout client 30s` closes idle raw Redis connections** — any client that holds a connection open for pub/sub, blocking commands, or replication streams MUST use the sentinel path. Uptime Kuma's Redis monitor hit this limit and had to be re-pointed at the sentinel endpoint (see memory id=748).
**When onboarding a new service:** start from `redis-master.redis.svc.cluster.local:6379` via `var.redis_host`. Only reach for sentinel discovery if the client library supports it natively (ioredis, redis-py Sentinel, go-redis FailoverClient, Sidekiq `sentinels` array) AND the workload uses long-lived connections.
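The choice between the two supported entry points can be sketched as a lookup. This is an illustration only: `redis_endpoints` and its `short`/`long` arguments are hypothetical, while the hostnames and ports are the ones from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a connection pattern to the endpoint(s) to use.
redis_endpoints() {
  case "$1" in
    short)
      # Request/response clients: HAProxy-fronted, always the current master.
      echo "redis-master.redis.svc.cluster.local:6379"
      ;;
    long)
      # PUBSUB / blocking clients: sentinel addresses for master "mymaster".
      local i
      for i in 0 1 2; do
        echo "redis-node-$i.redis-headless.redis.svc.cluster.local:26379"
      done
      ;;
    *)
      echo "usage: redis_endpoints short|long" >&2
      return 2
      ;;
  esac
}
```

A sentinel-aware client given the `long` list asks any of those addresses for the current master, which is equivalent to running `redis-cli -h <host> -p 26379 SENTINEL get-master-addr-by-name mymaster`.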
## Kyverno Drift Suppression (`# KYVERNO_LIFECYCLE_V1`)
Kyverno's admission webhook mutates every pod with a `dns_config { option { name = "ndots"; value = "2" } }` block (fixes NxDomain search-domain floods — see `k8s-ndots-search-domain-nxdomain-flood` skill). Terraform does not manage that field, so without suppression every pod-owning resource shows perpetual `spec[0].template[0].spec[0].dns_config` drift.

Binary file not shown.


@ -120,9 +120,31 @@ graph TB
### Redis
- Shared instance at `redis.redis.svc.cluster.local`
- Used for caching and session storage
- No persistence (ephemeral)
Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Traefik, etc.). HAProxy (3 replicas, PDB minAvailable=2) is the sole client-facing path — clients talk only to `redis-master.redis.svc.cluster.local:6379` and HAProxy health-checks backends via `INFO replication`, routing only to `role:master`.
**Current state (as of 2026-04-19, interim — parallel cluster during rework)**:
| Cluster | Pods | Source | Purpose |
|---|---|---|---|
| Legacy `redis-node-*` | 1 master + 1 replica (2 sentinels) | Bitnami Helm chart v25.3.2 | Serving live traffic via HAProxy |
| New `redis-v2-*` | 3 pods, each co-locating redis + sentinel + exporter | Raw `kubernetes_stateful_set_v1` with `redis:7.4-alpine` | Standing by for REPLICAOF-based cutover |
Both clusters live in the `redis` namespace. See `infra/stacks/redis/modules/redis/main.tf` (end-state; legacy `helm_release.redis` + `kubernetes_stateful_set_v1.redis_v2` coexist until cutover).
**Target architecture (post-cutover)**:
- 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain.
- `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master. No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident).
- redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof <master> 6379` (replicas), so pods come up already in the right role — no bootstrap race.
- Memory: master + replicas `requests=limits=768Mi`. Concurrent BGSAVE + AOF-rewrite fork can double RSS via COW, so headroom must cover it. `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency.
- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. Disk-wear analysis on 2026-04-19 (sdb Samsung 850 EVO 1TB, 150 TBW): Redis contributes <1 GB/day cluster-wide, which gives a 40+ year runway at the 20% TBW budget.
- `maxmemory=640mb` (83% of 768Mi limit), `maxmemory-policy=allkeys-lru`.
- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, pushes Pushgateway metrics).
- Auth disabled this phase — NetworkPolicy is the isolation layer. Enabling `requirepass` + rolling creds to all 17 clients is a planned follow-up.
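The two figures quoted in the list (the disk-wear runway and the 83% `maxmemory` ratio) can be reproduced with integer arithmetic. A rough sketch, assuming decimal units (1 TB = 1000 GB) and treating 1 GB/day as the upper bound on the write rate; both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Years until a write budget is used up.
# $1 = rated endurance in TBW, $2 = share of it budgeted (percent),
# $3 = write rate in GB/day. Integer arithmetic, so results round down.
runway_years() {
  echo $(( $1 * 1000 * $2 / 100 / $3 / 365 ))
}

# maxmemory as a percentage of the container limit (both in MiB).
maxmemory_pct() {
  echo $(( $1 * 100 / $2 ))
}
```

`runway_years 150 20 1` prints 82, and doubling the write rate to 2 GB/day still prints 41, so the 40+ year claim holds with margin. `maxmemory_pct 640 768` prints 83.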
**Observability** (redis-v2 only): `oliver006/redis_exporter:v1.62.0` sidecar per pod on port 9121, auto-scraped via Prometheus pod annotation. Alerts: `RedisDown`, `RedisMemoryPressure`, `RedisEvictions`, `RedisReplicationLagHigh`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisReplicasMissing`, `RedisBackupStale`, `RedisBackupNeverSucceeded`.
**Why this design** — three incidents in April 2026 drove the rework: (a) 2026-04-04 service selector routed reads+writes to master+replica causing `READONLY` errors; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC with the 256Mi limit too tight for a 204 MB working set under COW amplification; (c) 2026-04-19 PM sentinel runtime state drifted (only 2 sentinels, no majority) and routed writes to a slave. See beads epic `code-v2b` for the full plan and linked challenger analyses.
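The role decision described for the init container (empty `replica.conf` for the master, a `replicaof` line for replicas) might look like the sketch below. The real init container is not shown in this diff, so `write_replica_conf` and its arguments are hypothetical. Treating "no sentinel answered" as "become master" is an assumption of the sketch; a real bootstrap needs a tie-breaker (for example the pod ordinal) when all pods start in parallel.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the role decision.
# $1 = this pod's address, $2 = consensus master reported by peer sentinels
# (empty if none answered), $3 = path of the included replica.conf.
write_replica_conf() {
  local self=$1 master=$2 out=$3
  if [ -z "$master" ] || [ "$master" = "$self" ]; then
    # Master role: the include file exists but carries no replicaof line.
    : > "$out"
  else
    # Replica role: follow the consensus master on the Redis port.
    printf 'replicaof %s 6379\n' "$master" > "$out"
  fi
}
```

Because the file is rewritten on every boot, a restarted pod picks up whatever master the sentinels currently agree on instead of a value persisted from an earlier run.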
### SQLite (Per-App)


@ -1,6 +1,6 @@
# DNS Architecture
Last updated: 2026-04-15
Last updated: 2026-04-19
## Overview
@ -254,27 +254,42 @@ Config is synced to all 3 Technitium instances by CronJob `technitium-split-hori
## CoreDNS Configuration
CoreDNS is managed via a Terraform `kubernetes_config_map` resource in `stacks/technitium/modules/technitium/main.tf`.
CoreDNS is managed via Terraform in `stacks/technitium/modules/technitium/` — the Corefile ConfigMap lives in `main.tf`, and scaling/PDB are in `coredns.tf` (a `kubernetes_deployment_v1_patch` against the kubeadm-managed Deployment).
```
.:53 {
errors / health / ready
kubernetes cluster.local in-addr.arpa ip6.arpa # K8s service discovery
prometheus :9153 # Metrics
forward . 10.0.20.1 8.8.8.8 1.1.1.1 # pfSense → Google → Cloudflare
cache (success 10000 300, denial 10000 300)
forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
policy sequential # try upstreams in order
health_check 5s # mark unhealthy in 5s
max_fails 2
}
cache {
success 10000 300 6
denial 10000 300 60
serve_stale 86400s # resilience during upstream outage
}
loop / reload / loadbalance
}
viktorbarzin.lan:53 {
template: .*\..*\.viktorbarzin\.lan\.$ → NXDOMAIN # ndots:5 junk filter
forward . 10.96.0.53 # Technitium ClusterIP
cache (success 10000 300, denial 10000 300)
forward . 10.96.0.53 { # Technitium ClusterIP
health_check 5s
max_fails 2
}
cache (success 10000 300, denial 10000 300, serve_stale 86400s)
}
```
**Scaling**: 3 replicas, `required` anti-affinity on `kubernetes.io/hostname` (spread across 3 distinct nodes). PodDisruptionBudget `coredns` with `minAvailable=2`.
**Kyverno ndots injection**: A Kyverno policy injects `ndots:2` on all pods cluster-wide to reduce search domain expansion noise. The template regex is a second layer of defense for any queries that still get expanded.
**Failover behaviour**: With `policy sequential` on the root forward block, CoreDNS tries pfSense first; if `health_check 5s` detects pfSense as down, it fails over to 8.8.8.8 then 1.1.1.1 within ~5s rather than timing out per-query. Combined with `serve_stale`, pods keep resolving cached names for up to 24h even with full upstream failure.
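The selection rule behind `policy sequential` can be modelled in a few lines. This is a toy model, not CoreDNS code: `pick_upstream` is a hypothetical name, and the unhealthy list stands in for whatever `health_check` and `max_fails` have marked down.

```bash
#!/usr/bin/env bash
# Toy model of `policy sequential`.
# $1 = space-separated upstreams currently marked unhealthy;
# remaining args = upstreams in configured order.
# Prints the first healthy upstream; fails if none is left, which is the
# case where only `serve_stale` answers remain.
pick_upstream() {
  local down=" $1 " u
  shift
  for u in "$@"; do
    case "$down" in
      *" $u "*) continue ;;
    esac
    echo "$u"
    return 0
  done
  return 1
}
```

With nothing marked down the model always returns `10.0.20.1`; marking pfSense down moves it to `8.8.8.8`, matching the order in the `forward` line.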
## Cloudflare DNS — External Domains
All public domains are under the `viktorbarzin.me` zone. DNS records are **auto-created per service** via the `ingress_factory` module's `dns_type` parameter. A small number of records (Helm-managed ingresses, special cases) remain centrally managed in `config.tfvars`.
@ -360,9 +375,28 @@ Vault DB engine rotates password
| Metric Source | Dashboard | Alerts |
|---------------|-----------|--------|
| Technitium query logs (PostgreSQL) | Grafana `technitium-dns.json` | — |
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | — |
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | `CoreDNSErrors`, `CoreDNSForwardFailureRate` |
| Technitium zone-sync CronJob (Pushgateway) | — | `TechnitiumZoneSyncFailed`, `TechnitiumZoneSyncStale`, `TechnitiumZoneCountMismatch` |
| Technitium DNS pod availability | — | `TechnitiumDNSDown` |
| `dns-anomaly-monitor` CronJob (Pushgateway) | — | `DNSQuerySpike`, `DNSQueryRateDropped`, `DNSHighErrorRate` |
| Uptime Kuma | External monitors for all proxied domains | ExternalAccessDivergence (15min) |
### Metrics pushed by `technitium-zone-sync`
The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus Pushgateway under `job=technitium-zone-sync`:
| Metric | Labels | Meaning |
|--------|--------|---------|
| `technitium_zone_sync_status` | — | 0 = last run succeeded, 1 = at least one zone failed to create |
| `technitium_zone_sync_failures` | — | Number of zones that failed to create this run |
| `technitium_zone_sync_last_run` | — | Unix timestamp of last run (used by `TechnitiumZoneSyncStale`) |
| `technitium_zone_count` | `instance=primary\|<replica-host>` | Zone count on each Technitium instance (drives `TechnitiumZoneCountMismatch`) |
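The three run-level metrics in the table could be rendered as follows before being pushed. A minimal sketch, assuming the CronJob already knows how many zones failed; `render_zone_sync_metrics` is a hypothetical name and the actual script is not shown in this diff.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: render the run-level metrics in the Prometheus
# text exposition format. $1 = number of zones that failed, $2 = Unix timestamp.
render_zone_sync_metrics() {
  local failures=$1 now=$2 status=0
  if [ "$failures" -gt 0 ]; then
    status=1   # at least one zone failed to create
  fi
  printf 'technitium_zone_sync_status %s\n' "$status"
  printf 'technitium_zone_sync_failures %s\n' "$failures"
  printf 'technitium_zone_sync_last_run %s\n' "$now"
}
```

The output can be piped to the Pushgateway with `curl -s --data-binary @- "$PUSHGATEWAY_URL/metrics/job/technitium-zone-sync"`, where `PUSHGATEWAY_URL` is an assumed variable holding the Pushgateway base URL.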
### DNS alert rewrites
- `DNSQuerySpike` was previously broken: it compared current queries against `dns_anomaly_avg_queries`, which was computed from a per-pod `/tmp/dns_avg` file. Each CronJob run started with a fresh `/tmp`, so `NEW_AVG == TOTAL_QUERIES` every time and the spike condition could never fire. Rewritten to use `avg_over_time(dns_anomaly_total_queries[1h] offset 15m)` which compares against the actual 1h Prometheus history.
- `DNSQueryRateDropped` (new): fires when query rate drops below 50% of 1h average — upstream clients may be failing to reach Technitium.
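Why the old spike check could never fire, and what the 50% drop rule means, can be shown with two toy predicates. This is a sketch on integers only: `SPIKE_FACTOR` is a hypothetical threshold, since the real alert expressions are not shown in this diff.

```bash
#!/usr/bin/env bash
# Toy checks. $1 = current query count, $2 = baseline average.
# SPIKE_FACTOR is a hypothetical multiplier, not the value used by the alert.
SPIKE_FACTOR=${SPIKE_FACTOR:-3}

# Spike: current exceeds the baseline by the chosen factor.
is_spike() {
  [ "$1" -gt $(( $2 * SPIKE_FACTOR )) ]
}

# Drop: current is below 50% of the baseline, written without division.
is_dropped() {
  [ $(( $1 * 2 )) -lt "$2" ]
}
```

With the old per-pod file the baseline always equalled the current value, so `is_spike 1000 1000` is false for any factor of 1 or more. Against a real history, `is_spike 1000 200` succeeds and `is_dropped 40 100` succeeds.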
## Troubleshooting
### DNS Not Resolving Internal Domains


@ -1,72 +1,147 @@
# Mail Server Architecture
Last updated: 2026-04-18 (SPF switched to Brevo; DMARC reporting address normalized)
Last updated: 2026-04-19 (code-yiu Phase 6: MetalLB LB retired; traffic now enters via pfSense HAProxy with PROXY v2)
## Overview
Self-hosted email for `viktorbarzin.me` using docker-mailserver 15.0.0 on Kubernetes. Inbound mail arrives directly via MX record to the home IP on port 25. Outbound mail relays through Brevo EU (`smtp-relay.brevo.com:587` — migrated from Mailgun on 2026-04-12; SPF record cut over on 2026-04-18). Roundcubemail provides webmail access. CrowdSec protects SMTP/IMAP from brute-force attacks using real client IPs via `externalTrafficPolicy: Local` on a dedicated MetalLB IP.
Self-hosted email for `viktorbarzin.me` using docker-mailserver 15.0.0 on Kubernetes. Inbound mail arrives directly via MX record to the home IP on port 25. Outbound mail relays through Brevo EU (`smtp-relay.brevo.com:587` — migrated from Mailgun on 2026-04-12; SPF record cut over on 2026-04-18). Roundcubemail provides webmail access. CrowdSec protects SMTP/IMAP from brute-force attacks using real client IPs: pfSense HAProxy injects the PROXY v2 header on each backend connection so the mailserver pod sees the true source IP despite kube-proxy SNAT. See [`runbooks/mailserver-pfsense-haproxy.md`](../runbooks/mailserver-pfsense-haproxy.md) for ops details.
## Architecture Diagram
Two independent paths into the mailserver pod:
- **External** (MX traffic, webmail clients over WAN): Internet → pfSense → HAProxy → NodePort → **alt container ports** (2525/4465/5587/10993) that **require** PROXY v2 framing.
- **Intra-cluster** (Roundcube, E2E probe): same pod, **stock container ports** (25/465/587/993), **no** PROXY framing.
One Deployment, one pod, two sets of Postfix `master.cf` services + Dovecot `inet_listener` blocks, two Kubernetes Services (`mailserver` ClusterIP + `mailserver-proxy` NodePort).
```mermaid
graph TB
subgraph "Inbound Mail"
SENDER[Sending MTA] -->|MX lookup| MX[mail.viktorbarzin.me:25]
MX -->|176.12.22.76:25| PF[pfSense NAT]
PF -->|10.0.20.202:25| MLB[MetalLB<br/>ETP: Local]
MLB --> POSTFIX[Postfix MTA]
flowchart TB
%% External ingress path
SENDER[Sending MTA<br/>arbitrary public IP] -->|MX lookup + SMTP<br/>:25| MX[mail.viktorbarzin.me<br/>A 176.12.22.76]
MX --> PF[pfSense WAN<br/>vtnet0 192.168.1.2]
PF -->|NAT rdr<br/>WAN:25/465/587/993<br/>→ 10.0.20.1:same| HAP
HAP[pfSense HAProxy<br/>4 TCP frontends on 10.0.20.1<br/>send-proxy-v2 to backends]
HAP -->|round-robin<br/>tcp-check inter 120s| KN{k8s worker<br/>node1..4}
KN -->|NodePort 30125-30128<br/>ETP: Cluster → kube-proxy SNAT| PODEXT
%% Internal ingress path
RC[Roundcubemail pod] -->|SMTP :587 + IMAP :993<br/>no PROXY| SVC[Service mailserver<br/>ClusterIP 10.103.108.x<br/>25/465/587/993]
PROBE[email-roundtrip-monitor<br/>CronJob every 20m] -->|IMAP :993<br/>no PROXY| SVC
SVC -->|kube-proxy routes| PODINT
%% The pod — two listener sets, one process tree
subgraph POD["mailserver pod (docker-mailserver 15.0.0)"]
direction LR
PODEXT[Alt ports<br/>2525 / 4465 / 5587 / 10993<br/><b>PROXY v2 REQUIRED</b><br/>smtpd_upstream_proxy_protocol=haproxy<br/>haproxy = yes]
PODINT[Stock ports<br/>25 / 465 / 587 / 993<br/>PROXY-free]
PODEXT --> POSTFIX
PODINT --> POSTFIX
POSTFIX[Postfix<br/>postscreen + smtpd + cleanup + queue]
POSTFIX --> RSPAMD[Rspamd<br/>spam + DKIM + DMARC]
RSPAMD --> DOVECOT[Dovecot IMAP<br/>LMTP deliver]
DOVECOT --> MAILBOX[(Maildir storage<br/>mailserver-data-encrypted PVC<br/>proxmox-lvm-encrypted LUKS2)]
end
subgraph "Mail Processing"
POSTFIX --> RSPAMD[Rspamd<br/>Spam/DKIM/DMARC]
RSPAMD --> DOVECOT[Dovecot IMAP]
DOVECOT --> MAILBOX[(Mailboxes<br/>proxmox-lvm PVC)]
end
%% Outbound
POSTFIX -->|queued mail<br/>SASL + TLS| BREVO[Brevo EU Relay<br/>smtp-relay.brevo.com:587<br/>300/day free tier]
BREVO --> RECIPIENT[External Recipient]
subgraph "Outbound Mail"
POSTFIX_OUT[Postfix] -->|SASL + TLS| MAILGUN[Brevo EU Relay<br/>smtp-relay.brevo.com:587]
MAILGUN --> RECIPIENT[Recipient]
end
%% Webmail HTTP path
USER[User browser] -->|HTTPS| CF[Cloudflare proxy<br/>mail.viktorbarzin.me]
CF --> TUNNEL[Cloudflared tunnel<br/>pfSense → Traefik]
TUNNEL --> TRAEFIK[Traefik Ingress<br/>Authentik-protected]
TRAEFIK --> RC
subgraph "Webmail"
USER[User] -->|HTTPS| TRAEFIK[Traefik Ingress]
TRAEFIK --> RC[Roundcubemail]
RC -->|IMAP 993| DOVECOT
RC -->|SMTP 587| POSTFIX_OUT
end
%% Security
POSTFIX -.->|log stream<br/>real client IPs from PROXY v2| CSAGENT[CrowdSec Agent<br/>postfix + dovecot parsers]
CSAGENT -.-> CSLAPI[CrowdSec LAPI]
CSLAPI -.->|bouncer decisions<br/>ban external IPs| PF
subgraph "Security"
MLB -->|Real client IPs| CS_AGENT[CrowdSec Agent<br/>postfix + dovecot parsers]
CS_AGENT --> CS_LAPI[CrowdSec LAPI]
end
%% Monitoring
PROBE -.->|Brevo HTTP API<br/>triggers external delivery| MX
PROBE -.->|Push on roundtrip success| PUSH[Pushgateway + Uptime Kuma]
subgraph "Monitoring"
PROBE[E2E Roundtrip Probe<br/>CronJob every 20m] -->|Mailgun API| SENDER
PROBE -->|IMAP check| DOVECOT
PROBE --> PUSH[Pushgateway + Uptime Kuma]
DEXP[Dovecot Exporter<br/>:9166] --> PROM[Prometheus]
end
classDef extPath fill:#ffedd5,stroke:#ea580c,stroke-width:2px
classDef intPath fill:#dbeafe,stroke:#2563eb,stroke-width:2px
classDef pod fill:#dcfce7,stroke:#15803d
classDef sec fill:#fee2e2,stroke:#dc2626
class SENDER,MX,PF,HAP,KN,PODEXT extPath
class RC,PROBE,SVC,PODINT intPath
class POSTFIX,RSPAMD,DOVECOT,MAILBOX pod
class CSAGENT,CSLAPI sec
```
### PROXY v2 sequence (external SMTP roundtrip)
The diagram illustrates the wire-level sequence of a Brevo probe email arriving at our MX; the same sequence applies to any external sender.
```mermaid
sequenceDiagram
autonumber
participant C as External MTA<br/>(e.g. Brevo 77.32.148.26)
participant PF as pfSense WAN<br/>192.168.1.2:25
participant HAP as pfSense HAProxy<br/>10.0.20.1:25
participant N as k8s-node:30125<br/>ETP: Cluster
participant P as Postfix postscreen<br/>pod:2525
C->>PF: TCP SYN dst=192.168.1.2:25
PF->>HAP: NAT rdr rewrites dst → 10.0.20.1:25
HAP->>N: TCP connect (src=10.0.20.1, dst=k8s-node:30125)
Note over HAP,N: HAProxy opens a NEW TCP flow<br/>to the backend k8s node.
HAP->>N: PROXY v2 header<br/>(source=77.32.148.26, dest=10.0.20.1)
N->>P: kube-proxy SNAT src=k8s-node IP<br/>forwards PROXY header + payload to pod
P->>P: Parse PROXY v2 header<br/>smtpd_client_addr := 77.32.148.26<br/>(despite kube-proxy SNAT on the wire)
P-->>C: SMTP banner 220 mail.viktorbarzin.me
C-->>P: EHLO / MAIL FROM / RCPT TO / DATA
Note over P,C: Real client IP logged in maillog,<br/>fed to CrowdSec postfix parser.
P->>P: → smtpd → Rspamd → Dovecot → mailbox
```
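The header HAProxy sends in the sequence above is a short binary prefix. The sketch below prints it as hex for the IPv4/TCP case so the layout is visible. The function names are hypothetical, the addresses are the ones from the diagram, and the source port is made up for illustration.

```bash
#!/usr/bin/env bash
# Print a dotted IPv4 address as 8 hex digits.
ip_hex() {
  local IFS=.
  # Unquoted on purpose: split the address on dots into $1..$4.
  set -- $1
  printf '%02X%02X%02X%02X' "$1" "$2" "$3" "$4"
}

# Print a PROXY protocol v2 header (TCP over IPv4) as hex.
# $1 = source IP, $2 = destination IP, $3 = source port, $4 = destination port.
proxy_v2_hex() {
  printf '0D0A0D0A000D0A515549540A'   # 12-byte protocol signature
  printf '21'                         # version 2, command PROXY
  printf '11'                         # AF_INET + STREAM (TCP over IPv4)
  printf '000C'                       # 12 address bytes follow
  ip_hex "$1"                         # original client address
  ip_hex "$2"                         # address the client connected to
  printf '%04X%04X\n' "$3" "$4"       # source port, destination port
}
```

For example, `proxy_v2_hex 77.32.148.26 10.0.20.1 40000 25` ends in `4D20941A0A0014019C400019`: the client address, the HAProxy frontend address, then the two ports. Postfix and Dovecot read the client address from these bytes rather than from the TCP source, which is why kube-proxy SNAT no longer hides it.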
## Components
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| docker-mailserver | 15.0.0 | `mailserver` namespace | Postfix MTA + Dovecot IMAP + Rspamd |
| docker-mailserver | 15.0.0 | `mailserver` namespace | Postfix MTA + Dovecot IMAP + Rspamd (single container) |
| Roundcubemail | 1.6.13-apache | `mailserver` namespace | Webmail UI (MySQL-backed) |
| Dovecot Exporter | latest | Sidecar in mailserver pod | Prometheus metrics (port 9166) |
| Rspamd | Built into docker-mailserver | — | Spam filtering, DKIM signing, DMARC verification |
| pfSense HAProxy | 2.9-dev6 (`pfSense-pkg-haproxy-devel`) | pfSense VM | TCP reverse proxy injecting PROXY v2 for external mail |
| Brevo EU (ex-Sendinblue) | SaaS | — | Outbound SMTP relay (300/day free) |
Dovecot exporter was retired in code-1ik (2026-04-19) — `viktorbarzin/dovecot_exporter` speaks the pre-2.3 `old_stats` FIFO protocol which docker-mailserver 15.0.0's Dovecot 2.3.19 no longer emits.
## Port mapping
The mailserver pod exposes **8 TCP listeners**: 4 stock + 4 alt. Two Kubernetes Services front them depending on whether the client can inject PROXY v2.
| Mail protocol | Service port | K8s Service | Container port | NodePort | PROXY v2? | Who uses this path |
|---|---|---|---|---|---|---|
| SMTP (plain + STARTTLS) | 25 | `mailserver` ClusterIP | 25 | — | ❌ stock | Intra-cluster only (not used — internal clients send via 587) |
| SMTPS (implicit TLS) | 465 | `mailserver` ClusterIP | 465 | — | ❌ stock | Intra-cluster (Roundcube rarely uses this) |
| Submission (STARTTLS) | 587 | `mailserver` ClusterIP | 587 | — | ❌ stock | **Roundcube pod** → mailserver.svc:587 |
| IMAPS | 993 | `mailserver` ClusterIP | 993 | — | ❌ stock | **Roundcube pod** + E2E probe → mailserver.svc:993 |
| SMTP | 25 | `mailserver-proxy` NodePort | 2525 | 30125 | ✅ required | External MX traffic via pfSense HAProxy |
| SMTPS | 465 | `mailserver-proxy` NodePort | 4465 | 30126 | ✅ required | External SMTPS submission |
| Submission | 587 | `mailserver-proxy` NodePort | 5587 | 30127 | ✅ required | External STARTTLS submission (mail clients over WAN) |
| IMAPS | 993 | `mailserver-proxy` NodePort | 10993 | 30128 | ✅ required | External IMAPS (mail clients over WAN) |
The alt listeners are set up by:
- **Postfix**: `user-patches.sh` (shipped via ConfigMap `mailserver-user-patches`) appends 3 entries to `master.cf` with `-o postscreen_upstream_proxy_protocol=haproxy` (for 2525) or `-o smtpd_upstream_proxy_protocol=haproxy` (for 4465/5587).
- **Dovecot**: `dovecot.cf` ConfigMap adds a second `inet_listener` inside `service imap-login` with `haproxy = yes`, plus `haproxy_trusted_networks = 10.0.20.0/24` to allow PROXY headers from the k8s node subnet (post kube-proxy SNAT the source IP is always a node IP).
## Mail Flow
### Inbound
```
Internet → MX: mail.viktorbarzin.me (priority 1)
→ A record: 176.12.22.76 (non-proxied Cloudflare DNS-only)
→ pfSense NAT: port 25 → 10.0.20.202:25
→ MetalLB (dedicated IP, ETP: Local — preserves real client IPs)
→ Postfix → Rspamd (spam + DKIM + DMARC check) → Dovecot → mailbox
→ pfSense NAT rdr: WAN:{25,465,587,993} → 10.0.20.1:{same}
→ pfSense HAProxy (TCP mode, send-proxy-v2 on backend)
→ k8s-node:{30125..30128} NodePort (mailserver-proxy, ETP: Cluster)
→ kube-proxy → pod alt listener (2525/4465/5587/10993)
→ Postfix postscreen / smtpd / Dovecot parses PROXY v2 header
→ Rspamd (spam + DKIM + DMARC) → Dovecot → mailbox
```
No backup MX. If the server is down, sender MTAs queue and retry for 4-5 days per SMTP standards (RFC 5321).
@ -114,9 +189,13 @@ Reverse DNS for `176.12.22.76` returns `176-12-22-76.pon.spectrumnet.bg.` (ISP-a
### CrowdSec Integration
- **Collections**: `crowdsecurity/postfix` + `crowdsecurity/dovecot` (installed)
- **Log acquisition**: CrowdSec agents parse mailserver pod logs for brute-force patterns
- **Real client IPs**: `externalTrafficPolicy: Local` on dedicated MetalLB IP `10.0.20.202` preserves original client IPs (not SNATed to node IPs)
- **Real client IPs**: pfSense HAProxy injects PROXY v2 header on each backend connection; Postfix (`postscreen_upstream_proxy_protocol=haproxy` / `smtpd_upstream_proxy_protocol=haproxy` on alt ports) + Dovecot (`haproxy = yes` on alt IMAPS listener) parse it to recover the true source IP despite kube-proxy SNAT. Replaces the pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme (see code-yiu)
- **Decisions**: CrowdSec bans/challenges attackers via firewall bouncer rules
### Fail2ban Disabled (CrowdSec is the Policy)
docker-mailserver ships Fail2ban, but it is explicitly disabled here: `ENABLE_FAIL2BAN = "0"` at [`stacks/mailserver/modules/mailserver/main.tf:68`](../../stacks/mailserver/modules/mailserver/main.tf). CrowdSec is the cluster-wide bouncer for SSH, HTTP, and SMTP/IMAP brute-force defence — it already parses the `postfix` and `dovecot` log streams via the collections listed above and applies decisions at the LB/firewall layer. Enabling Fail2ban in-pod would create a duplicate response path (two systems racing to ban the same IP from different enforcement points), add iptables churn inside the container, and fragment the audit trail across two decision stores. Decision (2026-04-18): keep it disabled; CrowdSec owns this policy.
### Rspamd
- Spam filtering with phishing detection and Oletools
- DKIM signing (selector `mail`, 2048-bit RSA)
@ -139,11 +218,13 @@ anvil_rate_time_unit = 60s
## Monitoring
### E2E Roundtrip Probe
CronJob `email-roundtrip-monitor` (every 10 min):
1. Sends test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
2. Email hits MX → Postfix → catch-all delivers to `spam@` mailbox
3. Verifies delivery via IMAP (searches by UUID marker)
4. Deletes test email, pushes metrics to Pushgateway + Uptime Kuma
CronJob `email-roundtrip-monitor` (every 20 min, `*/20 * * * *`):
1. Sends test email via **Brevo HTTP API** to `smoke-test@viktorbarzin.me` (Brevo delivers it to our MX over the public internet, exercising the full external-ingress path).
2. Email hits WAN → pfSense HAProxy → k8s-node:30125 → pod :2525 postscreen (PROXY v2) → Postfix → catch-all delivers to `spam@` mailbox.
3. Verifies delivery via IMAP — connects to `mailserver.mailserver.svc.cluster.local:993` (intra-cluster path, no PROXY), searches by UUID marker.
4. Deletes test email, pushes metrics to Pushgateway + Uptime Kuma.
Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from ExternalSecret `mailserver-probe-secrets` (synced from Vault `secret/viktor` + `secret/platform.mailserver_accounts`) — see code-39v.
### Prometheus Alerts
| Alert | Threshold | Severity |
@ -154,13 +235,13 @@ CronJob `email-roundtrip-monitor` (every 10 min):
| EmailRoundtripNeverRun | Metric absent for 40m | warning |
### Uptime Kuma Monitors
- TCP SMTP on `176.12.22.76:25` (external, 60s interval)
- TCP IMAP on `10.0.20.202:993` (internal)
- E2E Push monitor (receives push from roundtrip probe)
- TCP SMTP on `176.12.22.76:25` — full external path (DNS → WAN → pfSense HAProxy → mailserver)
- TCP `mailserver.svc:{587,993}` — intra-cluster ClusterIP path
- TCP `10.0.20.1:{25,993}` — pfSense HAProxy health (post code-yiu Phase 6)
- E2E Push monitor (receives push from `email-roundtrip-monitor` probe)
### Dovecot Exporter
- Sidecar container in mailserver pod, port 9166
- Scraped by Prometheus for IMAP connection metrics
### Dovecot exporter — retired
`viktorbarzin/dovecot_exporter` was removed in code-1ik (2026-04-19). It spoke the pre-2.3 `old_stats` FIFO protocol; Dovecot 2.3.19 (docker-mailserver 15.0.0) no longer emits that, so the scrape only ever returned `dovecot_up{scope="user"} 0`. If Dovecot metrics become valuable, reach for a 2.3+ compatible exporter (e.g. `jtackaberry/dovecot_exporter`) and re-add the scrape + alerts. The previously-created `mailserver-metrics` ClusterIP Service was also removed.
## Terraform
@ -178,16 +259,20 @@ CronJob `email-roundtrip-monitor` (every 10 min):
| `secret/platform` | `mailserver_aliases` | Postfix virtual aliases |
| `secret/platform` | `mailserver_opendkim_key` | DKIM private key |
| `secret/platform` | `mailserver_sasl_passwd` | Brevo relay credentials (`[smtp-relay.brevo.com]:587 <login>:<key>`) |
| `secret/viktor` | `mailgun_api_key` | Mailgun API for E2E roundtrip probe (retained for inbound delivery testing only; not used for user mail) |
| `secret/viktor` | `brevo_api_key` | Brevo API key (stored for reference) |
| `secret/viktor` | `brevo_api_key` | Brevo API key — used by BOTH outbound SMTP SASL (postfix) AND the E2E roundtrip probe (sends external test mail via Brevo HTTP) |
| `secret/viktor` | `mailgun_api_key` | Historical; no longer used by the probe post code-n5l/Phase-5 work. Kept for reference. |
## Storage
| PVC | Size | Storage Class | Purpose |
|-----|------|---------------|---------|
| `mailserver-data-proxmox` | 2Gi (auto-resize 5Gi) | proxmox-lvm | Mail data, state, logs |
| `roundcubemail-html-proxmox` | 1Gi | proxmox-lvm | Roundcube web files |
| `roundcubemail-enigma-proxmox` | 1Gi | proxmox-lvm | Roundcube encryption |
| `mailserver-data-encrypted` | 2Gi (auto-resize 5Gi) | `proxmox-lvm-encrypted` (LUKS2) | Maildir + Postfix queue + state + logs |
| `roundcubemail-html-encrypted` | 1Gi | `proxmox-lvm-encrypted` | Roundcube PHP code + user session data |
| `roundcubemail-enigma-encrypted` | 1Gi | `proxmox-lvm-encrypted` | Roundcube Enigma (PGP) user keys |
| `mailserver-backup-host` (RWX) | 10Gi | `nfs-truenas` | `mailserver-backup` CronJob destination (`/srv/nfs/mailserver-backup/<YYYY-WW>/`) |
| `roundcube-backup-host` (RWX) | 10Gi | `nfs-truenas` | `roundcube-backup` CronJob destination |
**Backup**: daily `mailserver-backup` + `roundcube-backup` CronJobs rsync data PVCs to NFS. NFS directory is picked up by the PVE host's inotify-driven `/usr/local/bin/offsite-sync-backup` which pushes to Synology (weekly). See [Storage & Backup Architecture](storage.md) for the 3-2-1 flow.
## Decisions & Rationale
@ -206,19 +291,23 @@ CronJob `email-roundtrip-monitor` (every 10 min):
- **Decision**: Rspamd replaces both SpamAssassin and OpenDKIM in a single component
- **Tradeoff**: Higher memory usage (~150-200MB) but simpler stack
### Dedicated MetalLB IP for CrowdSec
- **Decision**: Mailserver gets `10.0.20.202` (separate from shared `10.0.20.200`) with `externalTrafficPolicy: Local`
- **Why**: Shared IP with ETP: Cluster SNATs away real client IPs, making CrowdSec detections and Postfix rate limiting useless
- **Tradeoff**: Uses one extra IP from the MetalLB pool. Requires separate pfSense NAT rule.
### Client-IP Preservation (pfSense HAProxy + PROXY v2)
- **Current (2026-04-19, bd code-yiu)**: pfSense HAProxy listens on `10.0.20.1:{25,465,587,993}`, forwards to k8s NodePort 30125-30128 with `send-proxy-v2` on each backend connection. The mailserver pod exposes parallel listeners (2525/4465/5587/10993) that REQUIRE the PROXY v2 header, while the stock ports 25/465/587/993 stay PROXY-free for intra-cluster traffic (Roundcube, probe). The mailserver Service is ClusterIP-only; ETP is no longer a concern for external traffic.
- **Historical (2026-04-12 → 2026-04-19)**: Dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` — required pod/speaker colocation; kube-proxy preserved client IP only when pod was on the same node as the advertising speaker.
- **Why switched**: ETP:Local made the mailserver's single replica drop inbound mail silently during pod reschedule (30-60s GARP flip). HAProxy with `send-proxy-v2` lets the pod reschedule to any node and recover IP-preservation through the header.
- **Tradeoff**: pfSense now runs HAProxy (one more service in the firewall's responsibility); alt container ports + extra Service are ~80 lines of Terraform. The win is HA without IP-preservation compromise.
- **Runbook**: [`runbooks/mailserver-pfsense-haproxy.md`](../runbooks/mailserver-pfsense-haproxy.md).
## Troubleshooting
### Inbound mail not arriving
1. Check MX: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
2. Check port 25: `nc -zw5 mail.viktorbarzin.me 25`
3. Check pfSense NAT rule: port 25 → `10.0.20.202:25`
4. Check Postfix logs: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep -E 'from=|reject'`
5. Check if CrowdSec is blocking the sender: `kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli decisions list`
1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside
3. **pfSense NAT**: verify WAN:{25,465,587,993} rdr to `10.0.20.1` (HAProxy VIP). `ssh admin@10.0.20.1 'pfctl -sn' | grep '10.0.20.1'`
4. **HAProxy health**: `ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"` — at least one backend in `srv_op_state=2` (UP) per pool
5. **Container listener**: `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'` — 8 lines expected
6. **Postfix queue + delivery**: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep -E 'from=|reject|smtpd-proxy'`
7. **CrowdSec decisions**: `kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli decisions list`
### Outbound mail failing
1. Check Brevo relay: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep relay` — should show `relay=smtp-relay.brevo.com`


@ -75,7 +75,9 @@ Prometheus scrapes metrics from all cluster components and applications using Se
### External Monitoring
The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] <service>` monitors for every service in `cloudflare_proxied_names`. These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes a `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. Alert `ExternalAccessDivergence` fires after 15min of divergence.
The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] <service>` monitors for externally-reachable ingresses. Discovery is **opt-OUT**: the script lists every ingress via the K8s API and creates a monitor for any host ending in `.viktorbarzin.me`, skipping only those annotated `uptime.viktorbarzin.me/external-monitor: "false"`. Both `ingress_factory` and the `reverse-proxy` factory emit that annotation when the caller sets `external_monitor = false`; leaving it null keeps the monitored-by-default behaviour (important for helm-provisioned ingresses that don't go through our factories). The legacy `cloudflare_proxied_names` ConfigMap is a fallback if the K8s API discovery fails.
These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes a `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. Alert `ExternalAccessDivergence` fires after 15min of divergence.
Data flows from targets through Prometheus storage to Grafana dashboards. Applications emit logs to stdout/stderr which are aggregated by Loki and queryable through Grafana's log viewer.


@ -0,0 +1,203 @@
# pfSense HAProxy for Mailserver — Runbook
Last updated: 2026-04-19 (Phase 6 complete)
## What & why
External mail traffic (SMTP/IMAP) requires **real client IP visibility** for
CrowdSec + Postfix rate-limiting. MetalLB cannot inject PROXY-protocol
headers (see [`mailserver-proxy-protocol.md`](./mailserver-proxy-protocol.md)),
so pfSense runs a small HAProxy that:
1. Listens on the pfSense VLAN20 IP (`10.0.20.1`) on all 4 mail ports,
2. Forwards each connection to a k8s node's NodePort with `send-proxy-v2`,
3. Injects PROXY v2 framing so Postfix/Dovecot see the original client IP,
4. TCP health-checks every k8s worker — any node can serve (ETP:Cluster).
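For orientation, the header that `send-proxy-v2` prepends to a TCP-over-IPv4 connection is, in its simplest form (no TLVs), a 28-byte binary prefix: a fixed 12-byte signature, a version/command byte, an address-family byte, a 2-byte length, then source and destination address and port. The sketch below builds one by hand. The function names are illustrative and not part of this repo, and it covers IPv4 only.

```sh
# Emit a single byte given its decimal value
# (POSIX printf only guarantees octal escapes).
emit_byte() { printf "\\$(printf '%03o' "$1")"; }

# Emit the four bytes of a dotted-quad IPv4 address.
emit_ipv4() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split on dots into $1..$4
  IFS=$old_ifs
  for octet in "$@"; do emit_byte "$octet"; done
}

# Emit a 16-bit port in network byte order.
emit_port() { emit_byte $(($1 / 256)); emit_byte $(($1 % 256)); }

# Usage: pp2_header SRC_IP DST_IP SRC_PORT DST_PORT
pp2_header() {
  printf '\r\n\r\n\000\r\nQUIT\n'   # fixed 12-byte v2 signature
  emit_byte 33                      # 0x21: version 2, command PROXY
  emit_byte 17                      # 0x11: TCP over IPv4
  emit_byte 0; emit_byte 12         # length of the address block that follows
  emit_ipv4 "$1"; emit_ipv4 "$2"
  emit_port "$3"; emit_port "$4"
}
```

Piping `pp2_header 203.0.113.7 10.0.20.1 54321 25 | od -An -tx1` shows the raw bytes. This is only a reading aid for what the alt listeners expect as the first bytes of every connection; it is not how HAProxy is configured here.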
Corresponding k8s-side setup (`stacks/mailserver/modules/mailserver/`):
- ConfigMap `mailserver-user-patches` → `user-patches.sh` appends 3 alt
`master.cf` services to Postfix:
- `:2525` postscreen (alt :25) with `postscreen_upstream_proxy_protocol=haproxy`
- `:4465` smtpd (alt :465 SMTPS) with `smtpd_upstream_proxy_protocol=haproxy`
- `:5587` smtpd (alt :587 submission) with `smtpd_upstream_proxy_protocol=haproxy`
- ConfigMap `mailserver.config` adds Dovecot `inet_listener imaps_proxy` on
port 10993 with `haproxy = yes` and `haproxy_trusted_networks = 10.0.20.0/24`.
- Service `mailserver-proxy` (NodePort, ETP:Cluster) with 4 NodePorts:
- `port 25 → targetPort 2525 → nodePort 30125`
- `port 465 → targetPort 4465 → nodePort 30126`
- `port 587 → targetPort 5587 → nodePort 30127`
- `port 993 → targetPort 10993 → nodePort 30128`
- Service `mailserver` (ClusterIP) — unchanged stock ports 25/465/587/993
for intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
CronJob). These listeners are PROXY-free.
bd: `code-yiu`.
## Steady-state architecture
```
External mail (WAN) path — PROXY v2
┌─────────────────────────────────────────────────────────────────────┐
│ Client (real IP)                                                    │
│     │  SMTP/SMTPS/Sub/IMAPS                                         │
│     ▼                                                               │
│ pfSense WAN:{25,465,587,993}                                        │
│     │  NAT rdr → 10.0.20.1:{same}                                   │
│     ▼                                                               │
│ pfSense HAProxy (mode tcp, 4 frontends, 4 backend pools)            │
│     │  send-proxy-v2 + tcp-check inter 120000                       │
│     ▼                                                               │
│ k8s-node<1-4>:{30125..30128}   ← any node (ETP:Cluster)             │
│     │  kube-proxy SNAT (source IP lost on the wire)                 │
│     ▼                                                               │
│ mailserver pod :{2525,4465,5587,10993}                              │
│     │  postscreen / smtpd / Dovecot parse PROXY v2 header           │
│     │  → real client IP recovered despite kube-proxy SNAT           │
│     ▼                                                               │
│ CrowdSec + Postfix / Dovecot see the true source IP ✓               │
└─────────────────────────────────────────────────────────────────────┘
Intra-cluster path — no PROXY
┌─────────────────────────────────────────────────────────────────────┐
│ Roundcube pod / email-roundtrip-monitor CronJob                     │
│     │  SMTP/IMAP                                                    │
│     ▼                                                               │
│ mailserver.mailserver.svc.cluster.local:{25,465,587,993}            │
│     │  ClusterIP: bypasses LoadBalancer/NodePort layer entirely     │
│     ▼                                                               │
│ mailserver pod stock :{25,465,587,993} (PROXY-free)                 │
└─────────────────────────────────────────────────────────────────────┘
```
## Validation
```sh
# All HAProxy frontends listening
ssh admin@10.0.20.1 'sockstat -l | grep haproxy'
# Expect: *:25, *:465, *:587, *:993, *:2525 (test port)
# All backend pools healthy
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
| awk 'NR>1 {print $3, $4, $6}'
# srv_op_state 2 = UP, 0 = DOWN
# Container listens on all 8 ports
kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'
# pf rdr points at pfSense (10.0.20.1), not <mailserver> alias
ssh admin@10.0.20.1 'pfctl -sn' | grep -E 'port = (25|submission|imaps|smtps)'
# E2E probe — Brevo → external MX :25 → IMAP fetch
kubectl create job --from=cronjob/email-roundtrip-monitor probe-test -n mailserver
kubectl wait --for=condition=complete --timeout=90s job/probe-test -n mailserver
kubectl logs job/probe-test -n mailserver | grep SUCCESS
kubectl delete job probe-test -n mailserver
# Real client IP in maillog post-delivery
kubectl logs -c docker-mailserver deployment/mailserver -n mailserver \
| grep 'smtpd-proxy25.*CONNECT from' | tail -5
# Expect external source IPs (e.g., Brevo 77.32.148.x), NOT 10.0.20.x
```
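Counting lines in the listener check can mislead if `ss` prints separate IPv4 and IPv6 rows for the same port. A small helper that reports which of the eight expected ports are absent is a more direct check. This is a sketch under the assumption that the input is plain `ss -ltn` output; the function name is hypothetical.

```sh
# Read `ss -ltn` output on stdin and print the expected mail ports that
# have no LISTEN socket (empty output means all eight are present).
missing_mail_ports() {
  awk -v want="25 2525 465 4465 587 5587 993 10993" '
    $1 == "LISTEN" {
      # Local address is field 4; the port is whatever follows the last colon.
      n = split($4, parts, ":")
      seen[parts[n]] = 1
    }
    END {
      m = split(want, ports, " ")
      out = ""
      for (i = 1; i <= m; i++)
        if (!(ports[i] in seen))
          out = out (out == "" ? "" : " ") ports[i]
      print out
    }'
}
```

It could be fed from the same `kubectl exec ... -- ss -ltn` command shown above, without the `grep`.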
## Bootstrap / restore from scratch
pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is scp'd nightly to
`/mnt/backup/pfsense/config-YYYYMMDD.xml` by `scripts/daily-backup.sh`, then
synced to Synology. To rebuild from source of truth (git):
```sh
scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'
```
The script is idempotent — re-runs reset the mailserver frontends + backends
to the declared state.
Expected output:
```
haproxy_check_and_run rc=OK
```
## Operations
### Change backend k8s node IPs / NodePorts
Edit `infra/scripts/pfsense-haproxy-bootstrap.php` → `$NODES` array + the
`build_pool()` port arguments. Re-run the bootstrap command above. Don't
hand-edit `/var/etc/haproxy/haproxy.cfg` — it is regenerated from XML on
every apply.
### Check health of backends
```sh
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"
```
`srv_op_state=2` means UP, `0` means DOWN.
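To avoid reading the raw table by eye, the output can be filtered down to servers that are not UP. The sketch below assumes the standard `show servers state` layout (a version line, a `#` header line, then one row per server with `be_name` in column 2, `srv_name` in column 4 and `srv_op_state` in column 6); the function name is made up for illustration.

```sh
# Read `show servers state` output on stdin and print every server whose
# srv_op_state is not 2 (UP), as "<backend>/<server> op_state=<n>".
haproxy_servers_not_up() {
  # Data rows start with a numeric be_id and have at least 6 fields;
  # this skips the version line ("1") and the "# be_id ..." header.
  awk '$1 ~ /^[0-9]+$/ && NF >= 6 && $6 != 2 { print $2 "/" $4 " op_state=" $6 }'
}
```

Appending `| haproxy_servers_not_up` to the `socat` pipeline above prints nothing when every backend server is healthy.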
### View live HAProxy stats (WebUI)
`https://pfsense.viktorbarzin.me` → Services → HAProxy → Stats.
### Reload after config.xml edit
```sh
ssh admin@10.0.20.1 'pfSsh.php playback svc restart haproxy'
```
### Rollback (flip NAT back to MetalLB; only partial after Phase 6)
There is no Phase-6 rollback one-liner. Phase 6 removed the MetalLB
LoadBalancer 10.0.20.202 entirely, so un-flipping NAT now would send
traffic to a dead alias. To regress:
1. Re-add `metallb.io/loadBalancerIPs = "10.0.20.202"` + `type = "LoadBalancer"`
+ `external_traffic_policy = "Local"` to `kubernetes_service.mailserver`,
apply.
2. Re-add the `mailserver` host alias in pfSense pointing at 10.0.20.202
(Firewall → Aliases → Hosts).
3. Run `infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php` on pfSense.
For rollback of just the NAT (Phase 4) without touching the Service, only
the third step is needed — but only meaningful BEFORE Phase 6.
### Restore from backup
pfSense config backup is a plain XML file:
```
/mnt/backup/pfsense/config-YYYYMMDD.xml # sda host copy (1.1TB RAID1)
/volume1/Backup/Viki/pve-backup/pfsense/... # Synology offsite
```
Full restore: pfSense WebUI → Diagnostics → Backup & Restore → Upload that
`config.xml`. The `<installedpackages><haproxy>` section is included.
## Phase history (bd code-yiu)
| Phase | Status | Description |
|---|---|---|
| 1a | ✅ commit `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2 | ✅ 2026-04-19 | pfSense HAProxy pkg installed (`pfSense-pkg-haproxy-devel-0.63_2`, HAProxy 2.9-dev6) |
| 3 | ✅ commit `ba697b02` | HAProxy config persisted in pfSense XML (bootstrap script + this runbook) |
| 4+5| ✅ commit `9806d515` | 4-port alt listeners + HAProxy frontends for 25/465/587/993 + NAT flip |
| 6 | ✅ this commit | Mailserver Service downgraded LoadBalancer → ClusterIP; `10.0.20.202` released back to MetalLB pool; orphan `mailserver` pfSense alias removed; monitors retargeted |
## Known warts
- HAProxy TCP health-check with `send-proxy-v2` generates `getpeername:
Transport endpoint not connected` warnings on postscreen every check cycle.
Mitigated with `inter 120000` (2 min). To reduce further, switch to
`option smtpchk` — but that requires a separate non-PROXY health-check
port on the pod (not done yet).
- Frontend binds on all pfSense interfaces (`bind :25` instead of
`10.0.20.1:25`). `<extaddr>` is set in XML but pfSense templates it
port-only. Low concern in practice because WAN firewall rules plus the
NAT rdr gate external access; internal VLAN clients SHOULD be able to
reach HAProxy on any pfSense-local IP.
- k8s-node5 doesn't exist — cluster has master + 4 workers. Backend pool
capped at 4 servers.
- Postscreen still logs `improper command pipelining` for legitimate
clients that send `EHLO\r\nQUIT\r\n` as a single TCP write. This is
unchanged pre/post-migration — postscreen's anti-bot heuristic.


@ -0,0 +1,181 @@
# Mailserver PROXY protocol — research & decision
Last updated: 2026-04-18 (original research). **Outcome implemented 2026-04-19 — see [UPDATE](#update-2026-04-19) below.**
> ## UPDATE (2026-04-19)
>
> This doc describes the research that led to the Phase-6 rollout. **Option C
> (pfSense HAProxy + PROXY v2)** was chosen and is now live. Operational
> state, cutover history, bootstrap, and rollback procedures live in
> [`mailserver-pfsense-haproxy.md`](mailserver-pfsense-haproxy.md).
>
> This file is retained as a decision record — it explains *why* Option A
> (pod-pinning via nodeSelector) was rejected mid-session in favour of
> Option C, and documents the MetalLB upstream limitation (PROXY injection
> is explicitly won't-implement). Future debates of "why don't we just pin
> the pod?" should land here first.
## TL;DR
**MetalLB does not and will not inject PROXY protocol headers.** The original plan
(`/home/wizard/.claude/plans/let-s-work-on-linking-temporal-valiant.md`, task
`code-rtb`) assumed MetalLB could be configured to emit PROXY v1/v2 on behalf of
the `mailserver` LoadBalancer Service. That assumption is wrong at the product
level. MetalLB is a control-plane-only announcer (ARP/NDP for L2 mode, BGP for
L3 mode); it never touches the L4 payload.
As a result, there is no single Terraform change that can flip
`externalTrafficPolicy: Local` → `Cluster` on the `mailserver` Service while
preserving the real client IP for Postfix/postscreen and Dovecot. Three
alternative paths exist (see below); none is trivial.
## Environment (verified 2026-04-18)
- **MetalLB version**: `quay.io/metallb/controller:v0.15.3` /
`quay.io/metallb/speaker:v0.15.3` (5 speakers).
- **Advertisement type**: L2Advertisement `default` bound to IPAddressPool
`default` (10.0.20.200–10.0.20.220). No BGPAdvertisements.
- **Service**: `mailserver/mailserver` — type `LoadBalancer`, `loadBalancerIPs:
10.0.20.202`, `externalTrafficPolicy: Local`,
`healthCheckNodePort: 30234`, 5 ports (25, 465, 587, 993, 9166/dovecot-metrics).
- **Pod**: single replica today, RWO PVCs prevent horizontal scale without
further work (`mailserver-data-encrypted`, `mailserver-letsencrypt-encrypted`).
## Why the original plan fails
### MetalLB never touches packets
> *"MetalLB is controlplane only, making it part of the dataplane means we
> would be responsible for the performance of the system, so more bugs to
> fight, I personally don't see that happening."*
> — MetalLB maintainer `champtar`, 2021-01-06
> (issue [#797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797))
Issue #797 is closed as "won't implement". Repeat asks in 2022–2023 got the
same answer. The v0.15.3 API surface confirms this: no
`proxyProtocol`/`haproxy`/`protocol: proxy` field exists on `IPAddressPool`,
`L2Advertisement`, `BGPAdvertisement`, or as a Service annotation.
Only managed-cloud LBs (AWS NLB, Azure LB, OCI, DO, OVH, Scaleway, etc.) offer
PROXY protocol as a tick-box. MetalLB's equivalents are:
| MetalLB feature | Does it preserve client IP? | Comment |
|---|---|---|
| `externalTrafficPolicy: Local` (current) | Yes, via iptables DNAT on the speaker node | Forces pod↔speaker colocation on L2 mode. This is the pain we wanted to avoid. |
| `externalTrafficPolicy: Cluster` | No — kube-proxy SNATs to the node IP | The problem we would re-introduce if we flipped without PROXY injection. |
| PROXY protocol injection | N/A — not implemented | Dead end. |
### The `Local` trap is real, but narrower than it seems
Today's `Local` policy means the ARP announcer node must also host the mailserver
pod. MetalLB always picks a single speaker to advertise the VIP (leader
election per IP), so in practice exactly one node matters at any moment. A pod
rescheduled to a different node silently drops inbound SMTP/IMAP until a GARP
flip or node cordon.
The only pods on our cluster that see this same class of risk are Traefik
(3 replicas + PDB `minAvailable=2`, so 2 of 3 nodes always have a pod) and
mailserver (1 replica). Traefik survives because the pods outnumber the nodes
that could be the speaker at once; the mailserver cannot.
## Alternative paths (ranked by effort)
### Option A — Pin the mailserver pod to a specific node (SIMPLEST)
Add `nodeSelector` on the mailserver Deployment pointing at a label that's also
stamped on the MetalLB speaker we want to advertise the VIP from, and use
MetalLB's [node selector](https://metallb.io/configuration/_advanced_l2_configuration/#specify-network-interfaces-that-lb-ip-can-be-announced-from)
on `L2Advertisement.spec.nodeSelectors` to pin the announcer to the same node.
Trade-offs:
- Zero changes to Postfix/Dovecot configs.
- Keeps `externalTrafficPolicy: Local` — real client IP keeps arriving.
- Loses HA (the whole point of the MetalLB layer) but reflects reality — one
replica, one PVC, no HA today anyway.
- Drain of that node requires a planned cutover, but that's no worse than
today's silent failure mode.
Implementation (~10 lines of Terraform):
```hcl
# In stacks/mailserver/modules/mailserver/main.tf, on the Deployment:
node_selector = { "viktorbarzin.me/mailserver-anchor" = "true" }

# In stacks/platform (or wherever the MetalLB CRs live):
resource "kubernetes_manifest" "mailserver_l2ad" {
  manifest = {
    apiVersion = "metallb.io/v1beta1"
    kind       = "L2Advertisement"
    metadata   = { name = "mailserver", namespace = "metallb-system" }
    spec = {
      ipAddressPools = ["default"]
      nodeSelectors  = [{ matchLabels = { "viktorbarzin.me/mailserver-anchor" = "true" } }]
    }
  }
}
```
Plus a node label via `kubectl label node k8s-node3 viktorbarzin.me/mailserver-anchor=true`.
**Recommendation: this is the shortest path to eliminating the silent-drop
failure mode** without taking on a new proxy tier.
### Option B — Put a HAProxy sidecar in front of Postfix/Dovecot
Stand up an in-cluster HAProxy with PROXY v2 enabled on the frontend and
`send-proxy-v2` on the backend to `mailserver:25/465/587/993`. Expose HAProxy
via a new MetalLB Service with `externalTrafficPolicy: Cluster` + kube-proxy
DSR workaround (still loses client IP at that layer), or run HAProxy on the
host-network of the same node (back to Option A's colocation).
Trade-offs:
- Introduces one more network hop and TLS-termination decision for every
SMTP connect.
- HAProxy needs its own cert rotation (or `tls-passthrough`) — adds moving
parts to an already crowded mailserver module.
- Doesn't actually solve the colocation problem on its own — HAProxy itself
needs to receive the client IP, so we are back to externalTrafficPolicy
constraints for HAProxy.
**Recommendation: avoid unless we also get HA for mailserver itself, which
needs RWX storage + DB split-brain work — out of scope.**
### Option C — Replace MetalLB with a different LB for this Service
Candidates: [kube-vip](https://kube-vip.io/) (supports eBPF-based DSR but not
PROXY injection either), [Cilium LB](https://docs.cilium.io/en/stable/network/lb-ipam/)
(preserves client IP via DSR in hybrid mode), or a dedicated HAProxy running on
pfSense and NAT-forwarding 25/465/587/993 with PROXY headers to a
ClusterIP-exposed mailserver. Cilium requires a CNI migration (we run Calico
today); pfSense HAProxy is genuinely feasible but belongs in a different bd
task.
**Recommendation: track as P3 follow-up under a new bd task if Option A proves
insufficient.**
## Decision
Do nothing in this session beyond this runbook + the bd note. The `code-rtb`
task as written is not executable — MetalLB cannot inject PROXY headers, and
the Postfix/Dovecot config changes the plan proposed would never receive the
header they expect; they would hang waiting for it and then time out (5s per
connection).
Follow-up work filed as bd child tasks (if user wants to pursue):
- **Option A — pin mailserver + L2Advertisement nodeSelectors** (new bd task)
- **Option C — HAProxy on pfSense with PROXY v2 to a ClusterIP** (new bd task)
## References
- [MetalLB issue #797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797) (closed, won't implement)
- [MetalLB PR #796 — Source IP Preservation discussion](https://github.com/metallb/metallb/issues/796)
- Postfix [postscreen_upstream_proxy_protocol](https://www.postfix.org/postconf.5.html#postscreen_upstream_proxy_protocol) — expects the PROXY header *on every incoming connection*; if absent, postscreen drops after `postscreen_upstream_proxy_timeout`.
- Dovecot [haproxy_trusted_networks](https://doc.dovecot.org/settings/core/#core_setting-haproxy_trusted_networks) — treats the header as mandatory for listed source networks.
- Cluster state verified against: `kubectl -n metallb-system get pods`,
`kubectl get ipaddresspools.metallb.io -A`,
`kubectl get l2advertisements.metallb.io -A`,
`kubectl get bgpadvertisements.metallb.io -A`,
`kubectl -n mailserver get svc mailserver -o yaml`.


@ -0,0 +1,66 @@
# NFS Prerequisites for `modules/kubernetes/nfs_volume`
The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
If the path does not exist, the first pod that tries to mount the resulting
PVC gets stuck in `ContainerCreating` with the kubelet event:
```
MountVolume.SetUp failed for volume "<name>" : mount failed: exit status 32
mount.nfs: mounting 192.168.1.127:/srv/nfs/<path> failed, reason given by
server: No such file or directory
```
## Bootstrap before first apply
Before adding a new `nfs_volume` consumer (backup CronJob, data PV, etc.),
create the export root on the PVE host:
```sh
# Replace <app> with the backup stack name, e.g. mailserver-backup,
# roundcube-backup, immich-backup, etc.
ssh root@192.168.1.127 'mkdir -p /srv/nfs/<app> && chmod 755 /srv/nfs/<app>'
# Confirm exports are live (no change to /etc/exports needed — `/srv/nfs`
# is already exported via the root entry in pve-nfs-exports).
ssh root@192.168.1.127 exportfs -v | grep '/srv/nfs\b'
```
`/srv/nfs` is exported with the root entry. Subdirectories inherit the
export automatically; they just have to exist on disk.
## Known consumers
| Consumer | NFS path | Owning stack |
|--------------------------------|---------------------------------|--------------------------|
| `mailserver-backup` | `/srv/nfs/mailserver-backup` | `stacks/mailserver/` |
| `roundcube-backup` | `/srv/nfs/roundcube-backup` | `stacks/mailserver/` |
| `mysql-backup` | `/srv/nfs/mysql-backup` | `stacks/dbaas/` |
| `postgresql-backup` | `/srv/nfs/postgresql-backup` | `stacks/dbaas/` |
| `vaultwarden-backup` | `/srv/nfs/vaultwarden-backup` | `stacks/vaultwarden/` |
Use `grep -rn 'nfs_volume' infra/stacks/` to find all active consumers.
## Why not auto-create?
Two options were considered for automating this:
1. `null_resource` + `local-exec` SSH `mkdir` in the `nfs_volume` module —
works but adds an SSH dependency to every Terraform run, makes the
module non-hermetic, and fails if the operator does not have SSH to
the PVE host.
2. `nfs-subdir-external-provisioner` — handles subdirs automatically but
changes the PV/PVC shape and would require migrating all existing
consumers.
Neither is worth the churn for a one-time operation per new backup stack.
Document + checklist is the current call; re-evaluate if we start adding
one NFS consumer per week.
## Related tasks
- `code-yo4` — this runbook
- `code-z26` — mailserver backup CronJob (first-time setup hit this)
- `code-1f6` — Roundcube backup CronJob (also hit this)


@ -0,0 +1,103 @@
# Runbook: Proxmox host (pve, 192.168.1.127)
Last updated: 2026-04-19
The Proxmox host is a baremetal hypervisor on the storage LAN
(192.168.1.0/24) with a single IP `192.168.1.127`. It hosts every
Kubernetes node VM and the NFS exports that back PVCs. It does **not**
receive DHCP — its network config is static in
`/etc/network/interfaces` (ifupdown). Because of that, DNS must be
configured manually and stays out of the scope of Kea/DHCP-DDNS.
## DNS configuration
The host uses a plain `/etc/resolv.conf` with two nameservers. No
`systemd-resolved`, no `resolvconf`, no NetworkManager — nothing
manages `/etc/resolv.conf`; it is a regular file owned by root.
### Why plain `/etc/resolv.conf` and not systemd-resolved
1. Installing `systemd-resolved` on an active Proxmox node during
business hours is the kind of change that risks breaking the NFS
server or VM networking. PVE's Debian base does not ship
`systemd-resolved` by default.
2. The ifupdown `/etc/network/interfaces` file does not manage
`/etc/resolv.conf` here — ifupdown's resolvconf integration is
only active if the `resolvconf` package is installed, which it is
not (`dpkg -l resolvconf` returns `un`).
3. A plain file is the simplest mental model and avoids a second
layer of "which tool is running now" confusion during an incident.
If you ever want to migrate to `systemd-resolved`, install the
package, enable the service, symlink `/etc/resolv.conf` to
`/run/systemd/resolve/stub-resolv.conf`, and drop the config in
`/etc/systemd/resolved.conf.d/10-internal-dns.conf` — but do this
during a maintenance window, not reactively.
### Current state
```
# /etc/resolv.conf
search viktorbarzin.lan
nameserver 192.168.1.2
nameserver 94.140.14.14
options timeout:2 attempts:2
```
| Field | Value | Purpose |
|---|---|---|
| Primary | `192.168.1.2` | pfSense LAN interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS — recursive only, used if pfSense LAN IP unreachable |
| `search` | `viktorbarzin.lan` | Unqualified names (`technitium`, `idrac`, etc.) resolve against the internal zone |
| `timeout:2 attempts:2` | — | Cap glibc resolver at 2s per server, 2 tries — reasonable fallback latency |
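A quick way to confirm that a host still has the dual-resolver layout is to parse the file rather than eyeball it. The helpers below are a sketch with hypothetical names; they only inspect a resolv.conf-style file and make no DNS queries.

```sh
# Print the nameserver addresses of a resolv.conf-style file, in lookup order.
list_nameservers() { awk '$1 == "nameserver" { print $2 }' "$1"; }

# Succeed only if the file declares at least two nameservers
# (primary + fallback); fail otherwise.
has_fallback_resolver() {
  awk '$1 == "nameserver" { n++ } END { exit !(n >= 2) }' "$1"
}
```

For example, `has_fallback_resolver /etc/resolv.conf || echo "single resolver!"` would flag a host that has drifted back to one upstream.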
### Verification commands
```sh
ssh root@192.168.1.127 '
cat /etc/resolv.conf # should show the two nameservers
dig +short idrac.viktorbarzin.lan # expect an A record (192.168.1.4)
dig +short github.com # expect an A record
'
```
Simulated failover — force the primary unreachable and verify the
fallback answers:
```sh
ssh root@192.168.1.127 '
ip route add blackhole 192.168.1.2
dig +short +time=3 github.com       # dig times out on the primary, then tries 94.140.14.14 → A record returned
ip route del blackhole 192.168.1.2 # cleanup
'
```
Expected behaviour: the first `dig` prints a warning about the UDP
setup failing for 192.168.1.2 and then prints the GitHub A record
(answered by 94.140.14.14).
## Rollback
A pre-change backup of `/etc/resolv.conf`, `/etc/network/interfaces`,
and `/etc/network/interfaces.d/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
host. To roll back:
```sh
ssh root@192.168.1.127 '
# pick the backup you want (there may be multiple if this runbook has been applied more than once)
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
tar -xzf "$BACKUP" -C /
cat /etc/resolv.conf
'
```
No service restart is needed: glibc 2.26 and later detects changes to
`/etc/resolv.conf` and reloads it on the next lookup.
## Related docs
- `docs/architecture/dns.md` — where each resolver IP lives and which
subnet it serves.
- `docs/runbooks/nfs-prerequisites.md` — other operations on this
host; read before adding new NFS exports.


@ -0,0 +1,147 @@
# Runbook: Registry VM (docker-registry, 10.0.20.10)
Last updated: 2026-04-19
The registry VM hosts `registry.viktorbarzin.me` (private Docker
registry, htpasswd-auth, NGINX → registry:2). It is an Ubuntu 24.04
VM on the cluster LAN subnet `10.0.20.0/24`, with a static netplan
config (no DHCP). Because it sits on a subnet that only has pfSense
as its gateway, its DNS must be statically configured.
## DNS configuration
Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
`nameservers`. Netplan writes systemd-networkd or NetworkManager
configs that resolved reads at runtime. There is **no automatic
merging** of netplan DNS with the `[Resolve]` section of
`/etc/systemd/resolved.conf` — per-link settings override the global
ones. So both layers must be in sync:
| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status` |
### Current state
`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:
```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```
`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):
```yaml
nameservers:
  addresses:
    - 10.0.20.1
    - 94.140.14.14
  search:
    - viktorbarzin.lan
```
`resolvectl status` output after the change:
```
Global
resolv.conf mode: stub
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1
Fallback DNS Servers: 94.140.14.14
DNS Domain: viktorbarzin.lan
Link 2 (eth0)
Current Scopes: DNS
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1 94.140.14.14
DNS Domain: viktorbarzin.lan
```
| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS — used if pfSense unreachable (e.g., OPT1 flap) |
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |
### Why this matters for the registry
Container builds on this VM reference `.lan` hostnames (Technitium,
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
hardening the netplan had `1.1.1.1` / `8.8.8.8` only, which meant:
1. Internal hostname lookups silently failed (slow timeout) — the
VM could not resolve `idrac.viktorbarzin.lan` or any internal
helper.
2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS
entirely.
With the new config the VM can resolve both zones and keeps working
if the primary DNS server is unreachable.
## Apply / re-apply
```sh
ssh root@10.0.20.10 '
netplan generate
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -20
'
```
`netplan apply` is not disruptive when only `nameservers` change — it
does not bounce the link.
## Verification
```sh
ssh root@10.0.20.10 '
dig +short idrac.viktorbarzin.lan # 192.168.1.4
dig +short github.com # GitHub A record
dig +short registry.viktorbarzin.me # 10.0.20.10 + external A
'
```
Fallback test — blackhole the primary and confirm external lookups
still succeed through 94.140.14.14:
```sh
ssh root@10.0.20.10 '
ip route add blackhole 10.0.20.1
dig +short +time=5 +tries=2 github.com # should still answer
ip route del blackhole 10.0.20.1
'
```
Internal lookups do fail during the blackhole (the fallback is a
public resolver and does not know about the internal zone), which is
expected — the fallback buys availability for external pulls, not
internal hostnames.
## Rollback
A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
and `/etc/netplan/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
VM. To roll back:
```sh
ssh root@10.0.20.10 '
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
tar -xzf "$BACKUP" -C /
rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -10
'
```
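
The rollback above picks the newest archive by modification time (`ls -t`). If archives have been copied or touched since they were written, the timestamp embedded in the filename may be a more reliable ordering. A minimal sketch, assuming the `dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` naming shown above; the helper `newest_backup` and the sample paths in the usage lines are hypothetical.

```python
import re
from datetime import datetime
from typing import Optional

# Matches the timestamp part of dns-config-backup-YYYYMMDD-HHMMSS.tar.gz
PATTERN = re.compile(r"dns-config-backup-(\d{8}-\d{6})\.tar\.gz$")


def newest_backup(names: list) -> Optional[str]:
    """Return the name with the latest embedded timestamp, or None."""
    best = None
    for name in names:
        match = PATTERN.search(name)
        if not match:
            continue  # not a backup archive
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
        except ValueError:
            continue  # digits present but not a valid date/time
        if best is None or stamp > best[0]:
            best = (stamp, name)
    return best[1] if best else None


if __name__ == "__main__":
    candidates = [
        "/root/dns-backups/dns-config-backup-20260418-120000.tar.gz",
        "/root/dns-backups/dns-config-backup-20260419-154000.tar.gz",
    ]
    print(newest_backup(candidates))
```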
## Related docs
- `docs/architecture/dns.md` — resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
and `containerd` `hosts.toml` redirects.


@ -0,0 +1,51 @@
# Runbook: Applying the Technitium Terraform stack
Last updated: 2026-04-19
The `stacks/technitium/` apply has a **post-apply readiness gate** that asserts all three DNS instances are healthy before the apply is allowed to finish. This runbook explains what it checks, how to interpret failures, and how to override it for emergency maintenance.
## What the gate checks
`stacks/technitium/modules/technitium/readiness.tf` defines `null_resource.technitium_readiness_gate`. It runs after the three Technitium deployments, the DNS LoadBalancer service, and the PDB are applied, and performs:
1. **Rollout status** — `kubectl rollout status deploy/<name> --timeout=180s` for `technitium`, `technitium-secondary`, `technitium-tertiary`. Fails if any deployment has not reached its desired pod count within 180s.
2. **Per-pod API health** — for every pod with label `dns-server=true`, executes `wget http://127.0.0.1:5380/api/stats/get` inside the pod and asserts the response contains `"status":"ok"`. Catches Technitium process hangs that TCP probes miss.
3. **Zone-count parity** — queries `technitium-web`, `technitium-secondary-web`, `technitium-tertiary-web` and counts the zones returned. Fails if the three counts differ, which would mean `technitium-zone-sync` has drifted or a replica has lost state.
The gate re-runs whenever the deployment container spec, the CoreDNS Corefile, or the apply timestamp changes (see `triggers` in `readiness.tf`).
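
The pass/fail logic of checks 2 and 3 can be sketched in a few lines. This is an illustration under assumptions, not the contents of `readiness.tf`: the real gate is described as asserting that the response contains `"status":"ok"`, whereas this sketch parses the JSON instead, and the function names and sample counts are hypothetical.

```python
import json


def api_healthy(body: str) -> bool:
    """True when a stats API response body reports status 'ok'."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as unhealthy.
        return False


def zone_parity(counts: dict) -> bool:
    """True when every instance reports the same zone count.

    An empty mapping is treated as a failure, since no instance answered.
    """
    return len(set(counts.values())) == 1


if __name__ == "__main__":
    # Hypothetical zone counts keyed by the service names listed above.
    sample = {
        "technitium-web": 12,
        "technitium-secondary-web": 12,
        "technitium-tertiary-web": 11,
    }
    print("api ok:", api_healthy('{"status":"ok"}'))
    print("zone parity:", zone_parity(sample))
```

Parsing the JSON is slightly stricter than a substring match, because it does not pass on a body that merely mentions the string inside an error message.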
## Emergency override
Set `skip_readiness=true` via terragrunt inputs or pass it directly to the Terraform apply:
```bash
cd infra/stacks/technitium
scripts/tg apply -var skip_readiness=true
```
Only use this when you need to land a Terraform change while one Technitium instance is intentionally offline (e.g., you are replacing its PVC, migrating storage, or recovering a corrupted config DB). Re-apply without the flag once the instance is back.
You can also target around the gate during emergency work:
```bash
scripts/tg apply -target=kubernetes_config_map.coredns
```
`-target` bypasses the `depends_on` chain feeding the gate, so a single-resource push does not need the gate to pass.
## Failure modes and responses
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `rollout status` times out on one deployment | Pod stuck `Pending` (node pressure / anti-affinity with other dns-server pods) or `ImagePullBackOff` | `kubectl describe pod` for events. If anti-affinity is blocking, confirm 3 nodes are Ready. |
| API check fails on a pod but readiness probe passes | Technitium process hung but port 53 still accepting TCP (liveness probe is `tcp_socket` on :53) | `kubectl delete pod <name>` — deployment will recreate it. |
| Zone count differs between instances | `technitium-zone-sync` CronJob is failing or AXFR is blocked | `kubectl logs -n technitium -l job-name=<latest-zone-sync-job>`. Check `TechnitiumZoneSyncFailed` alert. |
| Gate passes but external clients still cannot resolve | Gate only checks in-pod API and intra-cluster zone parity — external path (LoadBalancer → Technitium pod) is not tested | Run the LAN-client drill in `docs/architecture/dns.md` troubleshooting section. |
## What the gate does NOT check
- External reachability through the LoadBalancer IP `10.0.20.201` (that would require a LAN-side probe).
- CoreDNS health (CoreDNS is patched by `coredns.tf`, not this module's deployments — alerts `CoreDNSErrors` / `CoreDNSForwardFailureRate` catch regressions post-apply).
- Upstream resolver health (covered by `CoreDNSForwardFailureRate`).
For broader end-to-end verification, see `docs/architecture/dns.md` → "Verification" section, or run the Uptime Kuma external DNS probe.


@ -148,10 +148,19 @@ locals {
# record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type != "none")
# Emit the annotation when effective is true (positive signal), or when the
# caller explicitly set external_monitor=false (opt-out). When the caller
# leaves it null AND dns_type="none", emit nothing; the sync script's
# default opt-in (any *.viktorbarzin.me ingress) keeps monitoring services
# that are publicly reachable via routes we don't manage here (e.g.
# helm-provisioned ingresses, services behind cloudflared tunnel with DNS
# set elsewhere).
external_monitor_annotations = local.effective_external_monitor ? merge(
{ "uptime.viktorbarzin.me/external-monitor" = "true" },
var.external_monitor_name != null ? { "uptime.viktorbarzin.me/external-monitor-name" = var.external_monitor_name } : {},
) : {}
) : (var.external_monitor == false ?
{ "uptime.viktorbarzin.me/external-monitor" = "false" } : {}
)
ns_to_group = {
monitoring = "Infrastructure"


@ -1,7 +1,7 @@
#!/usr/bin/env bash
# Cluster health check script.
# Runs 24 diagnostic checks against the Kubernetes cluster and prints
# Runs 42 diagnostic checks against the Kubernetes cluster and prints
# a colour-coded report with PASS / WARN / FAIL for each section.
#
# Usage: ./scripts/cluster_healthcheck.sh [--fix] [--quiet|-q] [--json] [--kubeconfig <path>]
@ -26,7 +26,7 @@ JSON=false
KUBECONFIG_PATH="$(pwd)/config"
KUBECTL=""
JSON_RESULTS=()
TOTAL_CHECKS=30
TOTAL_CHECKS=42
# --- Helpers ---
info() { [[ "$JSON" == true ]] && return 0; echo -e "${BLUE}[INFO]${NC} $*"; }
@ -71,14 +71,16 @@ parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--fix) FIX=true; shift ;;
--no-fix) FIX=false; shift ;;
--quiet|-q) QUIET=true; shift ;;
--json) JSON=true; shift ;;
--kubeconfig) KUBECONFIG_PATH="$2"; shift 2 ;;
-h|--help)
echo "Usage: $0 [--fix] [--quiet|-q] [--json] [--kubeconfig <path>]"
echo "Usage: $0 [--fix|--no-fix] [--quiet|-q] [--json] [--kubeconfig <path>]"
echo ""
echo "Flags:"
echo " --fix Auto-remediate safe issues (delete evicted pods)"
echo " --no-fix Disable auto-remediation (default)"
echo " --quiet, -q Only show WARN and FAIL sections"
echo " --json Machine-readable JSON output"
echo " --kubeconfig PATH Override kubeconfig (default: \$(pwd)/config)"
@ -1750,6 +1752,593 @@ else:
json_add "hardware_exporters" "$status" "${detail:-All healthy}"
}
# --- 31. cert-manager: Certificate Readiness ---
check_cert_manager_certificates() {
section 31 "cert-manager — Certificate Readiness"
local certs not_ready detail="" status="PASS"
certs=$($KUBECTL get certificates.cert-manager.io -A -o json 2>/dev/null) || {
warn "cert-manager CRDs not installed or inaccessible"
json_add "certmanager_certificates" "WARN" "CRDs unavailable"
return 0
}
not_ready=$(echo "$certs" | python3 -c '
import json, sys
data = json.load(sys.stdin)
for item in data.get("items", []):
ns = item["metadata"]["namespace"]
name = item["metadata"]["name"]
conds = item.get("status", {}).get("conditions", [])
ready = next((c for c in conds if c.get("type") == "Ready"), None)
if not ready or ready.get("status") != "True":
reason = ready.get("reason", "NoCondition") if ready else "NoCondition"
print(f"{ns}/{name}:{reason}")
' 2>/dev/null) || true
if [[ -z "$not_ready" ]]; then
pass "All Certificate CRs Ready"
json_add "certmanager_certificates" "PASS" "All Ready"
else
[[ "$QUIET" == true ]] && section_always 31 "cert-manager — Certificate Readiness"
local count
count=$(count_lines "$not_ready")
while IFS= read -r line; do
fail "Certificate not Ready: $line"
detail+="$line; "
done <<< "$not_ready"
status="FAIL"
json_add "certmanager_certificates" "$status" "$count not Ready: $detail"
fi
}
# --- 32. cert-manager: Certificate Expiry (<14d) ---
check_cert_manager_expiry() {
section 32 "cert-manager — Certificate Expiry (<14d)"
local certs expiring detail="" status="PASS"
certs=$($KUBECTL get certificates.cert-manager.io -A -o json 2>/dev/null) || {
warn "cert-manager CRDs not installed or inaccessible"
json_add "certmanager_expiry" "WARN" "CRDs unavailable"
return 0
}
expiring=$(echo "$certs" | python3 -c '
import json, sys
from datetime import datetime, timezone, timedelta
data = json.load(sys.stdin)
cutoff = datetime.now(timezone.utc) + timedelta(days=14)
for item in data.get("items", []):
ns = item["metadata"]["namespace"]
name = item["metadata"]["name"]
not_after = item.get("status", {}).get("notAfter")
if not not_after:
continue
try:
expiry = datetime.fromisoformat(not_after.replace("Z", "+00:00"))
if expiry < cutoff:
days = (expiry - datetime.now(timezone.utc)).days
level = "FAIL" if days <= 3 else "WARN"
print(f"{level}:{ns}/{name}:{days}")
except ValueError:
pass
' 2>/dev/null) || true
if [[ -z "$expiring" ]]; then
pass "No Certificate CRs expiring within 14 days"
json_add "certmanager_expiry" "PASS" "None expiring <14d"
else
[[ "$QUIET" == true ]] && section_always 32 "cert-manager — Certificate Expiry (<14d)"
while IFS= read -r line; do
local level cert_name days
level=$(echo "$line" | cut -d: -f1)
cert_name=$(echo "$line" | cut -d: -f2)
days=$(echo "$line" | cut -d: -f3)
if [[ "$level" == "FAIL" ]]; then
fail "Certificate $cert_name expires in ${days}d"
status="FAIL"
else
warn "Certificate $cert_name expires in ${days}d"
[[ "$status" != "FAIL" ]] && status="WARN"
fi
detail+="$cert_name=${days}d; "
done <<< "$expiring"
json_add "certmanager_expiry" "$status" "$detail"
fi
}
# --- 33. cert-manager: Failed CertificateRequests ---
check_cert_manager_requests() {
section 33 "cert-manager — Failed CertificateRequests"
local requests failed detail="" status="PASS"
requests=$($KUBECTL get certificaterequests.cert-manager.io -A -o json 2>/dev/null) || {
warn "cert-manager CRDs not installed or inaccessible"
json_add "certmanager_requests" "WARN" "CRDs unavailable"
return 0
}
failed=$(echo "$requests" | python3 -c '
import json, sys
data = json.load(sys.stdin)
for item in data.get("items", []):
    ns = item["metadata"]["namespace"]
    name = item["metadata"]["name"]
    conds = item.get("status", {}).get("conditions", [])
    for c in conds:
        if c.get("type") == "Ready" and c.get("status") == "False" and c.get("reason") == "Failed":
            # Build the message outside the f-string: backslash-escaped quotes
            # inside an f-string expression are a SyntaxError, which the
            # "|| true" below would swallow and report as PASS.
            msg = (c.get("message") or "")[:80]
            print(f"{ns}/{name}:{msg}")
            break
' 2>/dev/null) || true
if [[ -z "$failed" ]]; then
pass "No failed CertificateRequests"
json_add "certmanager_requests" "PASS" "None failed"
else
[[ "$QUIET" == true ]] && section_always 33 "cert-manager — Failed CertificateRequests"
local count
count=$(count_lines "$failed")
while IFS= read -r line; do
fail "CertificateRequest failed: $line"
detail+="$line; "
done <<< "$failed"
status="FAIL"
json_add "certmanager_requests" "$status" "$count failed: $detail"
fi
}
# --- 34. Backup Freshness: Per-DB Dumps ---
check_backup_per_db() {
section 34 "Backup Freshness — Per-DB Dumps"
local detail="" had_issue=false status="PASS"
# Freshness threshold: 25 hours
local now_epoch max_age_sec
now_epoch=$(date -u +%s)
max_age_sec=$((25 * 3600))
_check_cronjob_fresh() {
local ns="$1" cj="$2" label="$3"
local ts age_sec
ts=$($KUBECTL get cronjob -n "$ns" "$cj" -o jsonpath='{.status.lastSuccessfulTime}' 2>/dev/null || true)
if [[ -z "$ts" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 34 "Backup Freshness — Per-DB Dumps"
fail "$label: CronJob $ns/$cj has no lastSuccessfulTime"
detail+="${label}=no-success; "
had_issue=true
status="FAIL"
return 0
fi
local ts_epoch
ts_epoch=$(date -u -d "$ts" +%s 2>/dev/null || echo 0)
age_sec=$((now_epoch - ts_epoch))
if [[ "$age_sec" -gt "$max_age_sec" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 34 "Backup Freshness — Per-DB Dumps"
local age_h=$((age_sec / 3600))
fail "$label: last success ${age_h}h ago (>25h)"
detail+="${label}=${age_h}h; "
had_issue=true
status="FAIL"
else
local age_h=$((age_sec / 3600))
detail+="${label}=${age_h}h; "
fi
}
_check_cronjob_fresh dbaas mysql-backup-per-db mysql
_check_cronjob_fresh dbaas postgresql-backup-per-db pg
[[ "$had_issue" == false ]] && pass "Per-DB dumps fresh — $detail"
json_add "backup_per_db" "$status" "$detail"
}
# --- 35. Backup Freshness: Offsite Sync ---
check_backup_offsite_sync() {
section 35 "Backup Freshness — Offsite Sync"
local metrics detail="" status="PASS"
metrics=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
wget -qO- "http://prometheus-prometheus-pushgateway:9091/metrics" 2>/dev/null || true)
if [[ -z "$metrics" ]]; then
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
warn "Cannot query Pushgateway"
json_add "backup_offsite_sync" "WARN" "Pushgateway unreachable"
return 0
fi
local age_hours
age_hours=$(echo "$metrics" | python3 -c '
import sys, re, time
ts = None
for line in sys.stdin:
if line.startswith("#"):
continue
if "backup_last_success_timestamp" in line and "offsite-backup-sync" in line:
m = re.search(r"\s([0-9.eE+]+)\s*$", line.strip())
if m:
try:
ts = float(m.group(1))
break
except ValueError:
pass
if ts is None:
print("missing")
else:
age = (time.time() - ts) / 3600
print(f"{age:.1f}")
' 2>/dev/null) || age_hours="error"
if [[ "$age_hours" == "missing" ]]; then
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
fail "backup_last_success_timestamp metric missing for offsite-backup-sync"
json_add "backup_offsite_sync" "FAIL" "Metric missing"
elif [[ "$age_hours" == "error" ]]; then
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
warn "Failed to parse Pushgateway metric"
json_add "backup_offsite_sync" "WARN" "Parse error"
else
local age_int
age_int=$(printf '%.0f' "$age_hours")
if [[ "$age_int" -gt 27 ]]; then
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
fail "Offsite sync last success ${age_hours}h ago (>27h)"
status="FAIL"
else
pass "Offsite sync last success ${age_hours}h ago"
fi
detail="age=${age_hours}h"
json_add "backup_offsite_sync" "$status" "$detail"
fi
}
# --- 36. Backup Freshness: LVM PVC Snapshots ---
check_backup_lvm_snapshots() {
section 36 "Backup Freshness — LVM PVC Snapshots"
local snap_output detail="" status="PASS"
snap_output=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
root@192.168.1.127 "lvs -o lv_name,lv_time --noheadings 2>/dev/null | grep -- -snap" 2>/dev/null || true)
if [[ -z "$snap_output" ]]; then
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
warn "No LVM PVC snapshots found or SSH to 192.168.1.127 failed (BatchMode)"
json_add "backup_lvm_snapshots" "WARN" "SSH failed or no snapshots"
return 0
fi
local newest_age_hours
newest_age_hours=$(echo "$snap_output" | python3 -c '
import sys, re, time
from datetime import datetime
newest = None
for line in sys.stdin:
line = line.strip()
if not line:
continue
parts = line.split(None, 1)
if len(parts) < 2:
continue
date_str = parts[1].strip()
# lv_time format: "2026-04-19 03:00:01 +0000" or similar
for fmt in ("%Y-%m-%d %H:%M:%S %z", "%Y-%m-%d %H:%M:%S"):
try:
dt = datetime.strptime(date_str, fmt)
ts = dt.timestamp()
if newest is None or ts > newest:
newest = ts
break
except ValueError:
continue
if newest is None:
print("parse_error")
else:
age = (time.time() - newest) / 3600
print(f"{age:.1f}")
' 2>/dev/null) || newest_age_hours="error"
if [[ "$newest_age_hours" == "parse_error" || "$newest_age_hours" == "error" ]]; then
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
warn "Could not parse LVM snapshot timestamps"
json_add "backup_lvm_snapshots" "WARN" "Parse error"
else
local count age_int
count=$(count_lines "$snap_output")
age_int=$(printf '%.0f' "$newest_age_hours")
if [[ "$age_int" -gt 25 ]]; then
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
fail "Newest LVM snapshot ${newest_age_hours}h old (>25h); $count total"
status="FAIL"
else
pass "LVM snapshots fresh — $count total, newest ${newest_age_hours}h old"
fi
detail="count=$count newest=${newest_age_hours}h"
json_add "backup_lvm_snapshots" "$status" "$detail"
fi
}
# --- 37. Monitoring: Prometheus + Alertmanager ---
check_monitoring_prom_am() {
section 37 "Monitoring — Prometheus + Alertmanager"
local detail="" had_issue=false status="PASS"
# Prometheus /-/ready
local prom_ready
prom_ready=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
wget -qO- "http://localhost:9090/-/ready" 2>/dev/null || true)
if echo "$prom_ready" | grep -qi "ready"; then
detail+="prometheus=ready; "
else
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 37 "Monitoring — Prometheus + Alertmanager"
fail "Prometheus /-/ready returned no Ready response"
detail+="prometheus=not-ready; "
had_issue=true
status="FAIL"
fi
# Alertmanager running pod count
local am_running
am_running=$($KUBECTL get pods -n monitoring --no-headers 2>/dev/null | \
grep alertmanager | awk '$3 == "Running"' | wc -l | tr -d ' ')
if [[ "$am_running" -gt 0 ]]; then
detail+="alertmanager=${am_running} running; "
else
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 37 "Monitoring — Prometheus + Alertmanager"
fail "Alertmanager: 0 Running pods"
detail+="alertmanager=none-running; "
had_issue=true
status="FAIL"
fi
[[ "$had_issue" == false ]] && pass "Prometheus Ready, $am_running Alertmanager pod(s) Running"
json_add "monitoring_prom_am" "$status" "$detail"
}
# --- 38. Monitoring: Vault Sealed Status ---
check_monitoring_vault() {
section 38 "Monitoring — Vault Sealed Status"
local output detail="" status="PASS"
output=$($KUBECTL exec -n vault vault-0 -- \
sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status' 2>&1 || true)
if [[ -z "$output" ]]; then
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
fail "Cannot exec vault status on vault-0"
json_add "monitoring_vault" "FAIL" "Exec failed"
return 0
fi
if echo "$output" | grep -qi "^Sealed[[:space:]]*false"; then
pass "Vault unsealed"
detail="sealed=false"
json_add "monitoring_vault" "PASS" "$detail"
elif echo "$output" | grep -qi "^Sealed[[:space:]]*true"; then
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
fail "Vault is SEALED — secrets unavailable"
detail="sealed=true"
status="FAIL"
json_add "monitoring_vault" "$status" "$detail"
else
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
warn "Cannot parse vault status output"
json_add "monitoring_vault" "WARN" "Parse error"
fi
}
# --- 39. Monitoring: ClusterSecretStore Ready ---
check_monitoring_css() {
section 39 "Monitoring — ClusterSecretStore Ready"
local css not_ready detail="" status="PASS"
css=$($KUBECTL get clustersecretstore -o json 2>/dev/null) || {
[[ "$QUIET" == true ]] && section_always 39 "Monitoring — ClusterSecretStore Ready"
warn "ClusterSecretStore CRD not installed"
json_add "monitoring_css" "WARN" "CRD missing"
return 0
}
not_ready=$(echo "$css" | python3 -c '
import json, sys
data = json.load(sys.stdin)
for item in data.get("items", []):
    name = item["metadata"]["name"]
    conds = item.get("status", {}).get("conditions", [])
    ready = next((c for c in conds if c.get("type") == "Ready"), None)
    if not ready or ready.get("status") != "True":
        # Compute the reason outside the f-string; escaped quotes inside an
        # f-string expression are a SyntaxError that "|| true" would hide.
        reason = ready.get("reason", "NoCondition") if ready else "NoCondition"
        print(f"{name}:{reason}")
' 2>/dev/null) || true
if [[ -z "$not_ready" ]]; then
local total
total=$(echo "$css" | python3 -c 'import json,sys; print(len(json.load(sys.stdin).get("items",[])))' 2>/dev/null || echo "?")
pass "All $total ClusterSecretStores Ready"
json_add "monitoring_css" "PASS" "$total Ready"
else
[[ "$QUIET" == true ]] && section_always 39 "Monitoring — ClusterSecretStore Ready"
while IFS= read -r line; do
fail "ClusterSecretStore not Ready: $line"
detail+="$line; "
done <<< "$not_ready"
status="FAIL"
json_add "monitoring_css" "$status" "$detail"
fi
}
# --- 40. External Reachability: Cloudflared + Authentik Replicas ---
check_external_replicas() {
section 40 "External — Cloudflared + Authentik Replicas"
local detail="" had_issue=false status="PASS"
# Cloudflared
local cf_json cf_ready cf_desired
cf_json=$($KUBECTL get deployment cloudflared -n cloudflared -o json 2>/dev/null || true)
if [[ -z "$cf_json" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
fail "Cloudflared deployment not found"
detail+="cloudflared=missing; "
had_issue=true
status="FAIL"
else
cf_ready=$(echo "$cf_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",{}).get("readyReplicas",0) or 0)' 2>/dev/null || echo "0")
cf_desired=$(echo "$cf_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("spec",{}).get("replicas",0) or 0)' 2>/dev/null || echo "0")
if [[ "$cf_ready" != "$cf_desired" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
fail "Cloudflared: $cf_ready/$cf_desired ready (external access degraded)"
detail+="cloudflared=${cf_ready}/${cf_desired}; "
had_issue=true
status="FAIL"
else
detail+="cloudflared=${cf_ready}/${cf_desired}; "
fi
fi
# Authentik server (Helm chart names the deployment goauthentik-server)
local auth_json auth_ready auth_desired
auth_json=$($KUBECTL get deployment goauthentik-server -n authentik -o json 2>/dev/null || true)
if [[ -z "$auth_json" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
warn "goauthentik-server deployment not found in authentik namespace"
detail+="authentik=missing; "
had_issue=true
[[ "$status" != "FAIL" ]] && status="WARN"
else
auth_ready=$(echo "$auth_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",{}).get("readyReplicas",0) or 0)' 2>/dev/null || echo "0")
auth_desired=$(echo "$auth_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("spec",{}).get("replicas",0) or 0)' 2>/dev/null || echo "0")
if [[ "$auth_ready" != "$auth_desired" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
fail "goauthentik-server: $auth_ready/$auth_desired ready (auth degraded)"
detail+="authentik=${auth_ready}/${auth_desired}; "
had_issue=true
status="FAIL"
else
detail+="authentik=${auth_ready}/${auth_desired}; "
fi
fi
[[ "$had_issue" == false ]] && pass "Cloudflared + authentik-server at full replicas ($detail)"
json_add "external_replicas" "$status" "$detail"
}
# --- 41. External Reachability: ExternalAccessDivergence Alert ---
check_external_divergence() {
section 41 "External — ExternalAccessDivergence Alert"
local alerts result detail="" status="PASS"
alerts=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
wget -qO- "http://localhost:9090/api/v1/alerts" 2>/dev/null || true)
if [[ -z "$alerts" ]]; then
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
warn "Cannot query Prometheus alerts"
json_add "external_divergence" "WARN" "Cannot query"
return 0
fi
result=$(echo "$alerts" | python3 -c '
import json, sys
try:
data = json.load(sys.stdin)
alerts = data.get("data", {}).get("alerts", []) if isinstance(data, dict) else data
firing = [a for a in alerts
if a.get("labels", {}).get("alertname") == "ExternalAccessDivergence"
and a.get("state") == "firing"]
if firing:
hosts = [a.get("labels", {}).get("host") or a.get("labels", {}).get("service") or "?" for a in firing]
print(f"{len(firing)}:" + ",".join(hosts))
else:
print("0:")
except Exception as e:
print(f"error:{e}")
' 2>/dev/null) || result="error:parse"
if [[ "$result" == error:* ]]; then
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
warn "Failed to parse alerts JSON: ${result#error:}"
json_add "external_divergence" "WARN" "Parse error"
return 0
fi
local count names
count=$(echo "$result" | cut -d: -f1)
names=$(echo "$result" | cut -d: -f2-)
if [[ "$count" -eq 0 ]]; then
pass "ExternalAccessDivergence not firing"
json_add "external_divergence" "PASS" "Not firing"
else
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
fail "ExternalAccessDivergence firing for $count target(s): $names"
status="FAIL"
detail="$count firing: $names"
json_add "external_divergence" "$status" "$detail"
fi
}
# --- 42. External Reachability: Traefik 5xx Rate ---
check_external_traefik_5xx() {
section 42 "External — Traefik 5xx Rate (15m)"
local query_result detail="" status="PASS"
query_result=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' 2>/dev/null || true)
if [[ -z "$query_result" ]]; then
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
warn "Cannot query Prometheus for traefik 5xx rate"
json_add "external_traefik_5xx" "WARN" "Query failed"
return 0
fi
local parsed
parsed=$(echo "$query_result" | python3 -c '
import json, sys
try:
data = json.load(sys.stdin)
results = data.get("data", {}).get("result", [])
hot = [(r.get("metric", {}).get("service", "?"), float(r.get("value", [0, "0"])[1])) for r in results]
hot = [(s, v) for s, v in hot if v > 0.01] # 0.01 req/s threshold
hot.sort(key=lambda x: -x[1])
if not hot:
print("0:")
else:
top = [f"{s}={v:.2f}/s" for s, v in hot[:5]]
print(f"{len(hot)}:" + "; ".join(top))
except Exception as e:
print(f"error:{e}")
' 2>/dev/null) || parsed="error:parse"
if [[ "$parsed" == error:* ]]; then
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
warn "Parse failed: ${parsed#error:}"
json_add "external_traefik_5xx" "WARN" "Parse error"
return 0
fi
local count top
count=$(echo "$parsed" | cut -d: -f1)
top=$(echo "$parsed" | cut -d: -f2-)
if [[ "$count" -eq 0 ]]; then
pass "No Traefik services with 5xx rate >0.01 req/s (last 15m)"
json_add "external_traefik_5xx" "PASS" "None above threshold"
else
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
# WARN when any service exceeds 0.01 req/s of 5xx; FAIL if top service >1 req/s
local top_rate
top_rate=$(echo "$top" | grep -oE '[0-9.]+/s' | head -1 | tr -d '/s')
if awk "BEGIN{exit !($top_rate > 1.0)}" 2>/dev/null; then
fail "$count Traefik service(s) with elevated 5xx: $top"
status="FAIL"
else
warn "$count Traefik service(s) emitting 5xx: $top"
status="WARN"
fi
detail="$count services: $top"
json_add "external_traefik_5xx" "$status" "$detail"
fi
}
# --- Summary ---
print_summary() {
if [[ "$JSON" == true ]]; then
@ -1832,6 +2421,18 @@ main() {
check_ha_automations
check_ha_system
check_hardware_exporters
check_cert_manager_certificates
check_cert_manager_expiry
check_cert_manager_requests
check_backup_per_db
check_backup_offsite_sync
check_backup_lvm_snapshots
check_monitoring_prom_am
check_monitoring_vault
check_monitoring_css
check_external_replicas
check_external_divergence
check_external_traefik_5xx
print_summary
# Exit code: 2 for failures, 1 for warnings, 0 for clean


@ -0,0 +1,188 @@
<?php
// pfSense HAProxy bootstrap — configures the mailserver PROXY-v2 path
// (bd code-yiu, Phases 2/3 + 5).
//
// WHY THIS EXISTS
// pfSense HAProxy config is stored as XML in `/cf/conf/config.xml` under
// `<installedpackages><haproxy>`. That file IS picked up by the nightly
// `daily-backup` on the PVE host (see `scripts/daily-backup.sh` → `scp
// root@10.0.20.1:/cf/conf/config.xml`) and synced to Synology. This script
// is the canonical reproducer: run it to rebuild the pfSense HAProxy config
// from scratch (DR restore, fresh pfSense install, etc.).
//
// WHAT IT BUILDS
// 4 production backend pools, one per mail port (plus the legacy `mailserver_nodes` test pool):
// mailserver_nodes_smtp → k8s-node1..4:30125 (container :2525 postscreen)
// mailserver_nodes_smtps → k8s-node1..4:30126 (container :4465 smtps)
// mailserver_nodes_sub → k8s-node1..4:30127 (container :5587 submission)
// mailserver_nodes_imaps → k8s-node1..4:30128 (container :10993 IMAPS)
// Each server uses `send-proxy-v2` and TCP health-check every 120s.
// 4 frontends on pfSense 10.0.20.1:{25,465,587,993} TCP mode.
// + 1 legacy test frontend on :2525 (kept for validation; safe to remove later).
//
// USAGE (on pfSense host, via SSH as admin)
// scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/
// ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'
//
// IDEMPOTENCY
// Removes any existing pools/frontends named in $POOL_NAMES / $FRONTEND_NAMES
// before re-adding, so repeat runs are safe and behave as reset-to-declared.
require_once('/etc/inc/config.inc');
require_once('/usr/local/pkg/haproxy/haproxy.inc');
require_once('/usr/local/pkg/haproxy/haproxy_utils.inc');
global $config;
parse_config(true);
if (!is_array($config['installedpackages']['haproxy'])) {
$config['installedpackages']['haproxy'] = [];
}
$h = &$config['installedpackages']['haproxy'];
$h['enable'] = 'yes';
$h['maxconn'] = '1000';
// Our declared object names. Only these exact names are removed and rebuilt;
// other mailserver_* entries are left untouched.
$POOL_NAMES = [
'mailserver_nodes', // legacy (Phase 2/3 test)
'mailserver_nodes_smtp',
'mailserver_nodes_smtps',
'mailserver_nodes_sub',
'mailserver_nodes_imaps',
];
$FRONTEND_NAMES = [
'mailserver_proxy_test', // legacy (Phase 2/3 test, :2525)
'mailserver_proxy_25',
'mailserver_proxy_465',
'mailserver_proxy_587',
'mailserver_proxy_993',
];
// k8s workers. Excluded from the pools: master (control-plane) and node5
// (doesn't exist in this topology).
$NODES = [
['k8s-node1', '10.0.20.101'],
['k8s-node2', '10.0.20.102'],
['k8s-node3', '10.0.20.103'],
['k8s-node4', '10.0.20.104'],
];
function build_pool(string $name, string $nodeport, array $nodes): array {
$servers = [];
foreach ($nodes as $n) {
$servers[] = [
'name' => $n[0],
'address' => $n[1],
'port' => $nodeport,
'weight' => '10',
'ssl' => '',
// check every 2 min — send-proxy-v2 check + close generates
// noise on postscreen, not worth doing more often.
'checkinter' => '120000',
'advanced' => 'send-proxy-v2',
'status' => 'active',
];
}
return [
'name' => $name,
'balance' => 'roundrobin',
'check_type' => 'TCP',
'checkinter' => '120000',
'retries' => '3',
'ha_servers' => ['item' => $servers],
'advanced_bind' => '',
'persist_cookie_enabled' => '',
'transparent_clientip' => '',
'advanced' => '',
];
}
function build_frontend(string $name, string $descr, string $extaddr, string $port, string $pool): array {
return [
'name' => $name,
'descr' => $descr,
'status' => 'active',
'secondary' => '',
'type' => 'tcp',
'a_extaddr' => ['item' => [[
'extaddr' => $extaddr,
'extaddr_port' => $port,
'extaddr_ssl' => '',
'extaddr_advanced' => '',
]]],
'backend_serverpool' => $pool,
'ha_acls' => '',
'dontlognull'=> '',
'httpclose' => '',
'forwardfor' => '',
'advanced' => '',
];
}
// ── Backend pools ───────────────────────────────────────────────────────
if (!is_array($h['ha_pools'])) $h['ha_pools'] = ['item' => []];
if (!is_array($h['ha_pools']['item'])) $h['ha_pools']['item'] = [];
$h['ha_pools']['item'] = array_values(array_filter(
$h['ha_pools']['item'],
fn($p) => !in_array($p['name'] ?? '', $POOL_NAMES, true)
));
// Legacy test pool (still used by the :2525 test frontend for manual SMTP roundtrip).
$h['ha_pools']['item'][] = build_pool('mailserver_nodes', '30125', $NODES);
// Production pools — one per mail port.
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_smtp', '30125', $NODES);
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_smtps', '30126', $NODES);
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_sub', '30127', $NODES);
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_imaps', '30128', $NODES);
// ── Frontends ───────────────────────────────────────────────────────────
if (!is_array($h['ha_backends'])) $h['ha_backends'] = ['item' => []];
if (!is_array($h['ha_backends']['item'])) $h['ha_backends']['item'] = [];
$h['ha_backends']['item'] = array_values(array_filter(
$h['ha_backends']['item'],
fn($f) => !in_array($f['name'] ?? '', $FRONTEND_NAMES, true)
));
// Legacy test frontend — :2525 — retained so SMTP roundtrip tests keep working
// without touching the real :25. Safe to remove once fully validated.
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_test',
'code-yiu Phase 2/3 test — PROXY v2 to k8s mailserver NodePort 30125 (alt port :2525)',
'10.0.20.1', '2525',
'mailserver_nodes'
);
// Production frontends — 4 ports listening on pfSense VLAN20 IP 10.0.20.1.
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_25',
'code-yiu Phase 4/5 — external SMTP (:25) via PROXY v2 → pod :2525 postscreen',
'10.0.20.1', '25',
'mailserver_nodes_smtp'
);
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_465',
'code-yiu Phase 4/5 — external SMTPS (:465) via PROXY v2 → pod :4465 smtpd',
'10.0.20.1', '465',
'mailserver_nodes_smtps'
);
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_587',
'code-yiu Phase 4/5 — external submission (:587) via PROXY v2 → pod :5587 smtpd',
'10.0.20.1', '587',
'mailserver_nodes_sub'
);
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_993',
'code-yiu Phase 4/5 — external IMAPS (:993) via PROXY v2 → pod :10993 Dovecot',
'10.0.20.1', '993',
'mailserver_nodes_imaps'
);
write_config('code-yiu: mailserver HAProxy — 4 production frontends + legacy :2525 test');
$messages = '';
$rc = haproxy_check_and_run($messages, true);
echo 'haproxy_check_and_run rc=' . ($rc ? 'OK' : 'FAIL') . "\n";
echo "messages: $messages\n";
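
Reviewer sketch, not part of the change: the four production frontends and pools above encode a fixed port chain (HAProxy listen port, NodePort, pod alt listener). The Python below restates that chain as data so duplicates are easy to spot. The port numbers come from this diff; `MailPath`, `describe` and `check_unique` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MailPath:
    name: str          # HAProxy frontend name used in the script above
    listen_port: int   # port HAProxy binds on 10.0.20.1
    node_port: int     # NodePort the matching backend pool targets
    pod_port: int      # PROXY-speaking alt listener inside the pod


PATHS = [
    MailPath("mailserver_proxy_25", 25, 30125, 2525),
    MailPath("mailserver_proxy_465", 465, 30126, 4465),
    MailPath("mailserver_proxy_587", 587, 30127, 5587),
    MailPath("mailserver_proxy_993", 993, 30128, 10993),
]


def describe(path: MailPath, listen_ip: str = "10.0.20.1") -> str:
    """Render one hop chain as a readable string."""
    return f"{listen_ip}:{path.listen_port} -> node:{path.node_port} -> pod:{path.pod_port}"


def check_unique(paths) -> None:
    """Raise ValueError if two paths share a listen port, NodePort or pod port."""
    for attr in ("listen_port", "node_port", "pod_port"):
        values = [getattr(p, attr) for p in paths]
        if len(values) != len(set(values)):
            raise ValueError(f"duplicate {attr}: {values}")
```

Calling `check_unique(PATHS)` before editing the PHP arrays is one cheap way to catch a copy-paste slip in a port number.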


@ -0,0 +1,68 @@
<?php
// pfSense NAT redirect flip — mail ports 25/465/587/993 from
// <mailserver> alias (10.0.20.202 MetalLB LB) to pfSense's own HAProxy
// listener (10.0.20.1). bd code-yiu.
//
// THIS IS THE CUTOVER. After this script:
// Internet → pfSense WAN:{25,465,587,993} → rdr → 10.0.20.1:{...}
// (pfSense HAProxy) → send-proxy-v2 → k8s-node:{30125..30128} NodePort
// → kube-proxy → mailserver pod alt listeners (2525/4465/5587/10993)
// → Postfix/Dovecot parse PROXY v2 → real client IP recovered.
//
// Internal clients (Roundcube, email-roundtrip-monitor CronJob) continue
// using the existing mailserver ClusterIP Service on the stock ports
// (25/465/587/993) which hit container stock listeners WITHOUT PROXY.
// No change to internal traffic paths.
//
// USAGE
// scp infra/scripts/pfsense-nat-mailserver-haproxy-flip.php admin@10.0.20.1:/tmp/
// ssh admin@10.0.20.1 'php /tmp/pfsense-nat-mailserver-haproxy-flip.php'
//
// REVERT — run pfsense-nat-mailserver-haproxy-unflip.php (companion script).
//
// IDEMPOTENT — re-runs converge. Flips nothing if already pointed at 10.0.20.1.
require_once('/etc/inc/config.inc');
require_once('/etc/inc/filter.inc');
global $config;
parse_config(true);
$PORTS_TO_FLIP = ['25', '465', '587', '993'];
$OLD_TARGET = 'mailserver';
$NEW_TARGET = '10.0.20.1';
$changed = 0;
foreach ($config['nat']['rule'] as $i => &$r) {
$iface = $r['interface'] ?? '';
$lport = $r['local-port'] ?? '';
$tgt = $r['target'] ?? '';
if ($iface !== 'wan') continue;
if (!in_array($lport, $PORTS_TO_FLIP, true)) continue;
if ($tgt !== $OLD_TARGET) {
printf("rule %d (dport=%s) target=%s — not flipping (already %s or unexpected)\n",
$i, $lport, $tgt, $NEW_TARGET);
continue;
}
$r['target'] = $NEW_TARGET;
// The linked filter rule needs no edit: pfSense regenerates the associated
// rule from the NAT rule on apply, so associated-rule-id stays intact.
$changed++;
printf("rule %d (dport=%s): target %s → %s\n", $i, $lport, $OLD_TARGET, $NEW_TARGET);
}
unset($r);
if ($changed === 0) {
echo "No changes. (Already flipped? Run unflip script to revert.)\n";
exit(0);
}
write_config("code-yiu: NAT rdr — mail ports {$changed} flipped to HAProxy (10.0.20.1)");
// Rebuild pf rules & reload.
$rc = filter_configure();
printf("filter_configure rc=%s\n", var_export($rc, true));
echo "done.\n";
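
Reviewer sketch, not part of the change: the cutover comment relies on Postfix and Dovecot recovering the real client IP from the PROXY v2 header that HAProxy prepends (`send-proxy-v2`). The Python below is a minimal illustration of that header for the TCP-over-IPv4 case only, assuming the layout from the public PROXY protocol specification. `build_v2_ipv4` and `parse_v2_ipv4` are hypothetical helper names; TLVs, IPv6 and the LOCAL command are deliberately ignored.

```python
import socket
import struct

# Fixed 12-byte signature that opens every PROXY v2 header.
SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"


def build_v2_ipv4(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> bytes:
    """Build a PROXY v2 header for a proxied TCP/IPv4 connection."""
    addresses = struct.pack(
        "!4s4sHH",
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        src_port,
        dst_port,
    )
    # 0x21 = version 2 + PROXY command; 0x11 = AF_INET + STREAM (TCP).
    return SIGNATURE + struct.pack("!BBH", 0x21, 0x11, len(addresses)) + addresses


def parse_v2_ipv4(data: bytes):
    """Return (src_ip, src_port, dst_ip, dst_port) from a TCP/IPv4 v2 header."""
    if data[:12] != SIGNATURE:
        raise ValueError("not a PROXY v2 header")
    ver_cmd, fam_proto, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd != 0x21 or fam_proto != 0x11:
        raise ValueError("this sketch only handles the PROXY command over TCP/IPv4")
    if length < 12 or len(data) < 16 + length:
        raise ValueError("truncated address block")
    src, dst, sport, dport = struct.unpack("!4s4sHH", data[16:28])
    return socket.inet_ntoa(src), sport, socket.inet_ntoa(dst), dport
```

The source address in the header is what ends up in the mail log, which is why kube-proxy SNAT on the NodePort hop no longer hides the client.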


@ -0,0 +1,48 @@
<?php
// REVERT of pfsense-nat-mailserver-haproxy-flip.php.
// Moves mail-port NAT rdr target from 10.0.20.1 (pfSense HAProxy) back to
// <mailserver> alias (10.0.20.202 MetalLB LB IP). bd code-yiu rollback.
//
// USE THIS IF: external mail breaks after the flip, any postscreen
// PROXY timeouts show up in logs, or you need to back out before Phase 6.
require_once('/etc/inc/config.inc');
require_once('/etc/inc/filter.inc');
global $config;
parse_config(true);
$PORTS_TO_REVERT = ['25', '465', '587', '993'];
$OLD_TARGET = '10.0.20.1';
$NEW_TARGET = 'mailserver';
$changed = 0;
foreach ($config['nat']['rule'] as $i => &$r) {
$iface = $r['interface'] ?? '';
$lport = $r['local-port'] ?? '';
$tgt = $r['target'] ?? '';
if ($iface !== 'wan') continue;
if (!in_array($lport, $PORTS_TO_REVERT, true)) continue;
if ($tgt !== $OLD_TARGET) {
printf("rule %d (dport=%s) target=%s — not reverting (already %s or unexpected)\n",
$i, $lport, $tgt, $NEW_TARGET);
continue;
}
$r['target'] = $NEW_TARGET;
$changed++;
printf("rule %d (dport=%s): target %s → %s\n", $i, $lport, $OLD_TARGET, $NEW_TARGET);
}
unset($r);
if ($changed === 0) {
echo "No changes. (Already reverted.)\n";
exit(0);
}
write_config("code-yiu: NAT rdr — mail ports {$changed} reverted to <mailserver> alias");
$rc = filter_configure();
printf("filter_configure rc=%s\n", var_export($rc, true));
echo "done.\n";

Binary file not shown.

Binary file not shown.


@ -1,29 +0,0 @@
#!/bin/bash
# Setup script for automated monitoring environment
# Ensures health check scripts have access to kubeconfig
echo "=== Setting up automated monitoring environment ==="
# Copy kubeconfig to location expected by health check scripts
if [ -f /home/node/.openclaw/kubeconfig ]; then
cp /home/node/.openclaw/kubeconfig /workspace/infra/config
echo "✅ Kubeconfig copied to /workspace/infra/config"
else
echo "❌ Source kubeconfig not found at /home/node/.openclaw/kubeconfig"
exit 1
fi
# Test health check access
echo ""
echo "Testing health check script access..."
cd /workspace/infra
if KUBECONFIG="" timeout 30 bash .claude/cluster-health.sh --quiet > /dev/null 2>&1; then
echo "✅ Health check script can access cluster"
else
echo "❌ Health check script cannot access cluster"
exit 1
fi
echo ""
echo "✅ Automated monitoring environment setup complete"
echo "📊 Cron health checks will now work properly"


@ -201,7 +201,7 @@ resource "kubernetes_deployment" "affine" {
annotations = {
"diun.enable" = "true"
"diun.include_tags" = "^\\d+\\.\\d+\\.\\d+$"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis-master.redis:6379"
}
}
spec {


@ -74,6 +74,36 @@ resource "kubernetes_deployment" "pgbouncer" {
container_port = 6432
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "512Mi"
}
}
readiness_probe {
tcp_socket {
port = 6432
}
initial_delay_seconds = 5
period_seconds = 10
timeout_seconds = 3
failure_threshold = 3
}
liveness_probe {
tcp_socket {
port = 6432
}
initial_delay_seconds = 30
period_seconds = 30
timeout_seconds = 5
failure_threshold = 3
}
volume_mount {
name = "config"
mount_path = "/etc/pgbouncer/pgbouncer.ini"
@ -121,6 +151,25 @@ resource "kubernetes_deployment" "pgbouncer" {
}
}
# --- 3b PodDisruptionBudget ---
# Protects auth against simultaneous node drains. With 3 replicas and
# minAvailable=2, a single drain rolls cleanly; a second concurrent drain
# is blocked until the evicted pod is Ready again on another node.
resource "kubernetes_pod_disruption_budget_v1" "pgbouncer" {
metadata {
name = "pgbouncer"
namespace = "authentik"
}
spec {
min_available = 2
selector {
match_labels = {
app = "pgbouncer"
}
}
}
}
# --- 4 Service ---
resource "kubernetes_service" "pgbouncer" {
metadata {
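
Reviewer sketch, not part of the change: the PodDisruptionBudget comment above rests on simple arithmetic, since voluntary evictions are allowed only while the healthy pod count stays at or above `minAvailable`. The Python below works that out for the 3-replica, `min_available = 2` case in this hunk. `allowed_disruptions` is a hypothetical helper, and the model ignores involuntary failures, which a PDB does not govern.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be evicted right now under a minAvailable PDB."""
    return max(0, healthy_pods - min_available)


def can_drain(healthy_pods: int, min_available: int) -> bool:
    """True if evicting one more pod keeps the budget satisfied."""
    return allowed_disruptions(healthy_pods, min_available) >= 1
```

With 3 healthy pods the first drain is permitted; while one pod is still rescheduling (2 healthy), a second drain is refused.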


@ -14,9 +14,29 @@ authentik:
port: 6432
user: authentik
password: ""
# Persistent client-side connections (safe with PgBouncer session mode;
# must be < pgbouncer server_idle_timeout=600s). Cuts Django connection
# setup overhead off the ~70 sequential ORM ops per flow stage.
conn_max_age: 60
conn_health_checks: true
cache:
# Cache flow plans for 30m and policy evaluations for 15m. Authentik 2026.2
# moved cache storage from Redis to Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-evaluating PolicyBindings.
timeout_flows: 1800
timeout_policies: 900
web:
# Gunicorn: 3 workers × 4 threads per server pod (default 2×4).
# Pairs with the server memory bump to 2Gi (each worker preloads Django ~500Mi).
workers: 3
threads: 4
worker:
# Celery-equivalent worker threads per pod (default 2, renamed from
# AUTHENTIK_WORKER__CONCURRENCY in 2025.8).
threads: 4
server:
replicas: 2
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
@ -27,7 +47,7 @@ server:
cpu: 100m
memory: 1.5Gi
limits:
memory: 1.5Gi
memory: 2Gi
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
@ -44,12 +64,12 @@ server:
diun.include_tags: "^202[0-9].[0-9]+.*$" # no need to annotate the worker as it uses the same image
pdb:
enabled: true
minAvailable: 1
minAvailable: 2
global:
addPrometheusAnnotations: true
worker:
replicas: 2
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
@ -60,7 +80,7 @@ worker:
cpu: 100m
memory: 1.5Gi
limits:
memory: 1.5Gi
memory: 2Gi
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname


@ -11,7 +11,7 @@ data "vault_kv_secret_v2" "viktor_secrets" {
locals {
namespace = "claude-agent"
image = "registry.viktorbarzin.me/claude-agent-service"
image_tag = "0c24c9b6"
image_tag = "2fd7670d"
labels = {
app = "claude-agent-service"
}
@ -78,6 +78,25 @@ resource "kubernetes_manifest" "external_secret" {
property = "claude_oauth_token"
}
},
{
# Consumed by service-upgrade agent to poll ci.viktorbarzin.me
# per-workflow status. Pod has no Vault CLI auth, so the old
# `vault kv get` path is dead; see bd code-3o3.
secretKey = "WOODPECKER_API_TOKEN"
remoteRef = {
key = "ci/global"
property = "woodpecker_api_token"
}
},
{
# Consumed by service-upgrade agent for Start/Success/Failure
# notifications. Same shared webhook as alertmanager.
secretKey = "SLACK_WEBHOOK_URL"
remoteRef = {
key = "viktor"
property = "alertmanager_slack_api_url"
}
},
]
}
}


@ -206,7 +206,7 @@ resource "cloudflare_record" "mail_tlsrpt" {
}
resource "cloudflare_record" "mail_dmarc" {
content = "\"v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; adkim=r; aspf=r; rua=mailto:e21c0ff8@dmarc.mailgun.org,mailto:adb84997@inbox.ondmarc.com; ruf=mailto:e21c0ff8@dmarc.mailgun.org,mailto:adb84997@inbox.ondmarc.com,mailto:postmaster@viktorbarzin.me;\""
content = "\"v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; adkim=r; aspf=r; rua=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com; ruf=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com,mailto:postmaster@viktorbarzin.me;\""
name = "_dmarc.viktorbarzin.me"
proxied = false
ttl = 1


@ -84,7 +84,7 @@ resource "kubernetes_deployment" "dawarich" {
annotations = {
"diun.enable" = "true"
"diun.include_tags" = "^v?\\d+\\.\\d+\\.\\d+$"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis-master.redis:6379"
}
}
spec {


@ -180,7 +180,7 @@ resource "kubernetes_deployment" "grampsweb" {
app = "grampsweb"
}
annotations = {
"dependency.kyverno.io/wait-for" = "redis.redis:6379"
"dependency.kyverno.io/wait-for" = "redis-master.redis:6379"
}
}
spec {
@ -354,13 +354,14 @@ resource "kubernetes_service" "grampsweb" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
namespace = kubernetes_namespace.grampsweb.metadata[0].name
name = "family"
service_name = "grampsweb"
tls_secret_name = var.tls_secret_name
max_body_size = "500m"
protected = true
source = "../../modules/kubernetes/ingress_factory"
namespace = kubernetes_namespace.grampsweb.metadata[0].name
name = "family"
service_name = "grampsweb"
tls_secret_name = var.tls_secret_name
max_body_size = "500m"
protected = true
external_monitor = false
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "GrampsWeb"


@ -9,7 +9,7 @@ defaultPodOptions:
env:
# REDIS_HOSTNAME: '{{ printf "%s-redis-master" .Release.Name }}'
REDIS_HOSTNAME: "redis.redis.svc.cluster.local"
REDIS_HOSTNAME: "redis-master.redis.svc.cluster.local"
# DB_HOSTNAME: "postgresql.dbaas"
# DB_USERNAME: "immich"
# DB_DATABASE_NAME: "immich"


@ -83,7 +83,7 @@ For secrets requiring admin access (shared infra passwords, API keys):
| \`modules/kubernetes/nfs_volume/\` | NFS volume module (CSI-backed, soft mount) |
| \`config.tfvars\` | Non-secret configuration (plaintext) |
| \`secrets.sops.json\` | All secrets (SOPS-encrypted JSON) |
| \`scripts/cluster_healthcheck.sh\` | 25-check cluster health script |
| \`scripts/cluster_healthcheck.sh\` | 42-check cluster health script |
| \`AGENTS.md\` | Full AI agent instructions (auto-loaded by most agents) |
### Tier System


@ -7,7 +7,7 @@
#
# Usage:
# annotations:
# dependency.kyverno.io/wait-for: "postgresql.dbaas:5432,redis.redis:6379"
# dependency.kyverno.io/wait-for: "postgresql.dbaas:5432,redis-master.redis:6379"
#
# Each comma-separated entry becomes a busybox init container that runs
# `nc -z <host> <port>` in a loop until the dependency is reachable.
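
Reviewer sketch, not part of the change: the policy comment above says each comma-separated `host:port` entry becomes an init container that loops on `nc -z`. The Python below shows the same parse-then-probe idea, assuming a plain TCP connect is an acceptable stand-in for `nc -z`. `parse_wait_for` and `is_reachable` are hypothetical names, not something Kyverno or this repository provides.

```python
import socket


def parse_wait_for(annotation: str):
    """Split a wait-for annotation into a list of (host, port) tuples."""
    entries = []
    for raw in annotation.split(","):
        raw = raw.strip()
        if not raw:
            continue  # tolerate trailing commas and stray whitespace
        host, sep, port = raw.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {raw!r}")
        entries.append((host, int(port)))
    return entries


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Rough equivalent of `nc -z host port`: succeed if a TCP connect works."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A typo such as `redis-master.redis` without a port fails fast at parse time instead of leaving an init container looping forever.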


@ -134,6 +134,29 @@ resource "kubernetes_config_map" "mailserver_config" {
# Increase max IMAP connections per user+IP - all Roundcube connections come from same pod IP
"dovecot.cf" = <<-EOF
mail_max_userip_connections = 50
# Throttle IMAP auth brute-force. CrowdSec handles the network-level
# ban; this adds defense in depth at the auth layer: each failed
# attempt waits 5s before responding, stretching a 1000-password
# dictionary attack from <1s to ~85min. Addresses code-9mi.
auth_failure_delay = 5s
# code-yiu Phase 5: alt IMAPS listener on :10993 that REQUIRES the
# HAProxy PROXY v2 wire format. pfSense HAProxy injects the header
# on backend connects via k8s-node:30128 → kube-proxy → pod :10993.
# Real client IP recovered from header despite kube-proxy SNAT.
# The stock :993 listener stays PROXY-free for internal clients
# (Roundcube, email-roundtrip-monitor) on the mailserver ClusterIP.
# haproxy_trusted_networks = source IPs allowed to *send* PROXY v2.
# Post kube-proxy SNAT the source is the k8s node IP (10.0.20.101-104);
# allow-list the whole VLAN 20 node subnet.
haproxy_trusted_networks = 10.0.20.0/24
service imap-login {
inet_listener imaps_proxy {
port = 10993
ssl = yes
haproxy = yes
}
}
EOF
fail2ban_conf = <<-EOF
[DEFAULT]
@ -142,31 +165,110 @@ resource "kubernetes_config_map" "mailserver_config" {
logtarget = SYSOUT
EOF
}
# Password hashes are different each time and avoid changing secret constantly.
# Either 1.Create consistent hashes or 2.Find a way to ignore_changes on per password
# bcrypt() generates a fresh salt on every evaluation, so the hash line
# differs each plan run. ignore_changes is the pragmatic workaround.
#
# INVARIANT (code-7ns, decision 2026-04-19): if a password in Vault
# (secret/platform.mailserver_accounts) is rotated, ignore_changes WILL
# mask that rotation: TF will not re-render the ConfigMap and the pod
# will keep accepting the old password until the ConfigMap is
# force-tainted (`terraform taint module.mailserver.kubernetes_config_map
# .postfix-accounts-cf`) or the resource is addressed explicitly on
# apply (`-replace=...`). Currently there is NO automatic Vault
# rotation for mailserver_accounts, so this is acceptable. If automatic
# rotation is ever added, replace this ignore_changes with either:
# (a) deterministic hashing (bcrypt with a stable salt derived from
# the user string; loses per-user salt uniqueness but keeps TF
# convergent), or
# (b) render postfix-accounts.cf from a K8s Secret synced by ESO
# (CRD consumed by a dedicated volume mount; docker-mailserver
# loads it at pod start).
lifecycle {
# DRIFT_WORKAROUND: postfix-accounts.cf password hashes non-deterministic; would flap on every apply. Reviewed 2026-04-18.
ignore_changes = [data["postfix-accounts.cf"]]
}
}
# resource "kubernetes_config_map" "user_patches" {
# metadata {
# name = "user-patches"
# namespace = kubernetes_namespace.mailserver.metadata[0].name
# labels = {
# "app" = "mailserver"
# }
# }
# code-yiu Phase 1a: user-patches.sh appends alt PROXY-speaking listeners to
# Postfix master.cf at container startup. docker-mailserver runs
# /tmp/docker-mailserver/user-patches.sh after initial config generation, so
# our append lands on every fresh pod. Idempotent guard prevents double-append
# on in-place container restarts. Dovecot extensions are in the dovecot.cf
# ConfigMap entry (no patches.sh entry needed).
resource "kubernetes_config_map" "mailserver_user_patches" {
metadata {
name = "mailserver-user-patches"
namespace = kubernetes_namespace.mailserver.metadata[0].name
labels = {
app = "mailserver"
}
annotations = {
"reloader.stakater.com/match" = "true"
}
}
# data = {
# user_patches = <<EOF
# #!/bin/bash
# cp -f /tmp/dovecot.key /etc/dovecot/ssl/dovecot.key
# cp -f /tmp/dovecot.crt /etc/dovecot/ssl/dovecot.pem
# EOF
# }
# }
data = {
"user-patches.sh" = <<-EOT
#!/bin/bash
# code-yiu Phase 5: append PROXY-speaking alt listeners to Postfix master.cf:
# :2525 postscreen (alt :25): PROXY v2 injected by pfSense HAProxy
# :4465 smtpd (alt :465 SMTPS): ditto, wrappermode TLS
# :5587 smtpd (alt :587 submission): ditto
# Stock :25/:465/:587 stay in parallel (no PROXY required) so internal
# Roundcube/probe traffic on mailserver.svc ClusterIP keeps working.
# Dovecot alt IMAPS listener on :10993 is configured via dovecot.cf
# (not here) because that's a Dovecot config, not a Postfix master.cf.
set -euxo pipefail
MASTER_CF=/etc/postfix/master.cf
SENTINEL='# code-yiu:alt-proxy'
if ! grep -qF "$SENTINEL" "$MASTER_CF"; then
cat >> "$MASTER_CF" <<'PFXEOF'
# code-yiu:alt-proxy: PROXY-speaking alt listeners for pfSense HAProxy backend pool.
# Mirrors stock docker-mailserver submission/submissions options (incl. SASL via
# Dovecot's /dev/shm/sasl-auth.sock) but with PROXY v2 upstream. chroot=n so the
# SASL path is readable from the smtpd process (sockets live outside /var/spool).
2525 inet n - n - 1 postscreen
-o syslog_name=postfix/smtpd-proxy25
-o postscreen_upstream_proxy_protocol=haproxy
-o postscreen_upstream_proxy_timeout=5s
4465 inet n - n - - smtpd
-o syslog_name=postfix/smtpd-proxy465
-o smtpd_tls_wrappermode=yes
-o smtpd_sasl_auth_enable=yes
-o smtpd_sasl_type=dovecot
-o smtpd_tls_auth_only=yes
-o smtpd_reject_unlisted_recipient=no
-o smtpd_sasl_authenticated_header=yes
-o smtpd_client_restrictions=permit_sasl_authenticated,reject
-o smtpd_relay_restrictions=permit_sasl_authenticated,reject
-o smtpd_sender_restrictions=$mua_sender_restrictions
-o smtpd_discard_ehlo_keywords=
-o milter_macro_daemon_name=ORIGINATING
-o cleanup_service_name=sender-cleanup
-o smtpd_upstream_proxy_protocol=haproxy
-o smtpd_upstream_proxy_timeout=5s
5587 inet n - n - - smtpd
-o syslog_name=postfix/smtpd-proxy587
-o smtpd_tls_security_level=encrypt
-o smtpd_sasl_auth_enable=yes
-o smtpd_sasl_type=dovecot
-o smtpd_tls_auth_only=yes
-o smtpd_reject_unlisted_recipient=no
-o smtpd_sasl_authenticated_header=yes
-o smtpd_client_restrictions=permit_sasl_authenticated,reject
-o smtpd_relay_restrictions=permit_sasl_authenticated,reject
-o smtpd_sender_restrictions=$mua_sender_restrictions
-o smtpd_discard_ehlo_keywords=
-o milter_macro_daemon_name=ORIGINATING
-o cleanup_service_name=sender-cleanup
-o smtpd_upstream_proxy_protocol=haproxy
-o smtpd_upstream_proxy_timeout=5s
PFXEOF
fi
EOT
}
}
resource "kubernetes_secret" "opendkim_key" {
metadata {
@ -230,7 +332,8 @@ resource "kubernetes_deployment" "mailserver" {
template {
metadata {
annotations = {
# "diun.enable" = "true"
"diun.enable" = "true"
"diun.include_tags" = "^latest$"
}
labels = {
"app" = "mailserver"
@ -242,11 +345,14 @@ resource "kubernetes_deployment" "mailserver" {
name = "docker-mailserver"
image = "docker.io/mailserver/docker-mailserver:15.0.0"
image_pull_policy = "IfNotPresent"
security_context {
capabilities {
add = ["NET_ADMIN"]
}
}
# NET_ADMIN was originally required by docker-mailserver's
# Fail2ban (iptables ban actions). Fail2ban is DISABLED in this
# stack (ENABLE_FAIL2BAN=0, see above); CrowdSec owns the
# brute-force policy. The capability is therefore unnecessary
# and was dropped on 2026-04-19 (code-4mu). If mail flow regresses,
# `kubectl logs -n mailserver -l app=mailserver -c docker-mailserver`
# will show permission-denied errors; revert if observed.
security_context {}
lifecycle {
post_start {
@ -376,6 +482,15 @@ resource "kubernetes_deployment" "mailserver" {
sub_path = "fail2ban_conf"
read_only = true
}
# code-yiu Phase 1a: user-patches.sh runs at container startup to
# append PROXY-speaking listeners to master.cf (see
# kubernetes_config_map.mailserver_user_patches).
volume_mount {
name = "user-patches"
mount_path = "/tmp/docker-mailserver/user-patches.sh"
sub_path = "user-patches.sh"
read_only = true
}
port {
name = "smtp"
container_port = 25
@ -396,6 +511,29 @@ resource "kubernetes_deployment" "mailserver" {
container_port = 993
protocol = "TCP"
}
# code-yiu Phase 5: alt PROXY-speaking listeners.
# Postfix: 2525 (postscreen), 4465 (smtps), 5587 (submission).
# Dovecot: 10993 (imaps). All require PROXY v2 from pfSense HAProxy.
port {
name = "smtp-proxy"
container_port = 2525
protocol = "TCP"
}
port {
name = "smtps-proxy"
container_port = 4465
protocol = "TCP"
}
port {
name = "sub-proxy"
container_port = 5587
protocol = "TCP"
}
port {
name = "imaps-proxy"
container_port = 10993
protocol = "TCP"
}
env_from {
config_map_ref {
name = "mailserver.env.config"
@ -412,35 +550,25 @@ resource "kubernetes_deployment" "mailserver" {
}
}
readiness_probe {
tcp_socket {
port = 25
}
initial_delay_seconds = 30
period_seconds = 10
}
liveness_probe {
tcp_socket {
port = 993
}
initial_delay_seconds = 60
period_seconds = 60
timeout_seconds = 15
}
}
container {
name = "dovecot-exporter"
image = "viktorbarzin/dovecot_exporter:latest"
command = [
"/dovecot_exporter/exporter",
"--dovecot.socket-path=/var/run/dovecot/stats-reader"
]
image_pull_policy = "IfNotPresent"
port {
name = "dovecotexporter"
container_port = 9166
protocol = "TCP"
}
volume_mount {
name = "var-run-dovecot"
mount_path = "/var/run/dovecot"
}
resources {
requests = {
cpu = "10m"
memory = "32Mi"
}
limits = {
memory = "32Mi"
}
}
}
volume {
name = "config"
@ -472,12 +600,14 @@ resource "kubernetes_deployment" "mailserver" {
# fs_type = "ext4"
# }
}
# volume {
# name = "user-patches"
# config_map {
# name = "user-patches"
# }
# }
# code-yiu Phase 1a
volume {
name = "user-patches"
config_map {
name = kubernetes_config_map.mailserver_user_patches.metadata[0].name
default_mode = "0755"
}
}
volume {
name = "var-run-dovecot"
empty_dir {}
@ -494,6 +624,13 @@ resource "kubernetes_deployment" "mailserver" {
}
resource "kubernetes_service" "mailserver" {
# code-yiu Phase 6: downgraded from LoadBalancer (MetalLB 10.0.20.202,
# ETP: Local) to ClusterIP on 2026-04-19. External mail now enters via
# pfSense HAProxy → kubernetes_service.mailserver_proxy NodePort → alt
# PROXY-speaking listeners. This Service exists only for intra-cluster
# clients (Roundcube pod, email-roundtrip-monitor CronJob) that talk to
# `mailserver.mailserver.svc.cluster.local:{25,465,587,993}` on the
# stock (PROXY-free) container listeners.
metadata {
name = "mailserver"
namespace = kubernetes_namespace.mailserver.metadata[0].name
@ -501,15 +638,10 @@ resource "kubernetes_service" "mailserver" {
labels = {
app = "mailserver"
}
annotations = {
"metallb.io/loadBalancerIPs" = "10.0.20.202"
}
}
spec {
type = "LoadBalancer"
external_traffic_policy = "Local"
type = "ClusterIP"
selector = {
app = "mailserver"
}
@ -541,12 +673,65 @@ resource "kubernetes_service" "mailserver" {
port = 993
target_port = "imap-secure"
}
}
}
# The `mailserver-metrics` ClusterIP Service (formerly split from the
# main LB in code-izl) was retired in code-1ik when the Dovecot
# exporter was removed: the exporter spoke the pre-Dovecot-2.3
# old_stats protocol which docker-mailserver 15.0.0 no longer
# emits, so the scrape was a no-op. If a working exporter is ever
# re-introduced, add back: ClusterIP Service exposing port 9166
# with selector app=mailserver.
# code-yiu Phase 1a: NodePort Service for pfSense HAProxy backend connections.
# External SMTP flow post-cutover:
# Client → pfSense WAN:25 → pfSense HAProxy → k8s-node:30125 (NodePort
# targeting container :2525 on any node, ETP: Cluster) → pod postscreen
# with PROXY v2 parsing → real client IP in maillog.
# Internal flow (Roundcube, probe) stays on the mailserver ClusterIP Service
# hitting container :25 without PROXY; unchanged.
resource "kubernetes_service" "mailserver_proxy" {
metadata {
name = "mailserver-proxy"
namespace = kubernetes_namespace.mailserver.metadata[0].name
labels = {
app = "mailserver"
}
}
spec {
type = "NodePort"
external_traffic_policy = "Cluster"
selector = {
app = "mailserver"
}
port {
name = "dovecot-metrics"
name = "smtp-proxy"
protocol = "TCP"
port = 9166
target_port = 9166
port = 25
target_port = 2525
node_port = 30125
}
port {
name = "smtps-proxy"
protocol = "TCP"
port = 465
target_port = 4465
node_port = 30126
}
port {
name = "sub-proxy"
protocol = "TCP"
port = 587
target_port = 5587
node_port = 30127
}
port {
name = "imaps-proxy"
protocol = "TCP"
port = 993
target_port = 10993
node_port = 30128
}
}
}
@ -712,32 +897,68 @@ except Exception as e:
duration = time.time() - start
print(f"ERROR: {e}")
# Push metrics to Pushgateway
metrics = f"""# HELP email_roundtrip_success Whether the last e2e email probe succeeded
# TYPE email_roundtrip_success gauge
email_roundtrip_success {success}
# HELP email_roundtrip_duration_seconds Duration of the last e2e email probe
# TYPE email_roundtrip_duration_seconds gauge
email_roundtrip_duration_seconds {duration:.2f}
# HELP email_roundtrip_last_success_timestamp Unix timestamp of last successful probe
# TYPE email_roundtrip_last_success_timestamp gauge
email_roundtrip_last_success_timestamp {int(time.time()) if success else 0}
"""
try:
requests.put(PUSHGATEWAY, data=metrics, timeout=10)
print("Pushed metrics to Pushgateway")
except Exception as e:
print(f"Failed to push metrics: {e}")
# Push metrics to Pushgateway. On failure we omit email_roundtrip_last_success_timestamp
# and POST (not PUT) so the prior successful timestamp is preserved; otherwise pushing 0
# makes EmailRoundtripStale fire immediately alongside EmailRoundtripFailing.
metric_lines = [
"# HELP email_roundtrip_success Whether the last e2e email probe succeeded",
"# TYPE email_roundtrip_success gauge",
f"email_roundtrip_success {success}",
"# HELP email_roundtrip_duration_seconds Duration of the last e2e email probe",
"# TYPE email_roundtrip_duration_seconds gauge",
f"email_roundtrip_duration_seconds {duration:.2f}",
]
if success:
metric_lines += [
"# HELP email_roundtrip_last_success_timestamp Unix timestamp of last successful probe",
"# TYPE email_roundtrip_last_success_timestamp gauge",
f"email_roundtrip_last_success_timestamp {int(time.time())}",
]
metrics = "\n".join(metric_lines) + "\n"
UPTIME_KUMA_URL = "http://uptime-kuma.uptime-kuma.svc.cluster.local/api/push/hLtyRKgeZO?status=up&msg=OK&ping=" + str(int(duration))
def push_with_retry(label, func, url):
# 3 attempts with exponential backoff between them (1s, then 2s). Returns True on success, False otherwise.
# Final failure logs ERROR with URL + status code (or exception) so the pod log surfaces the drop.
last_status = None
last_exc = None
for attempt in range(3):
try:
resp = func()
last_status = resp.status_code
if 200 <= resp.status_code < 300:
print(f"Pushed to {label} (attempt {attempt+1}, status {resp.status_code})")
return True
last_exc = None
except Exception as e:
last_exc = e
last_status = None
if attempt < 2:
time.sleep(2 ** attempt)
detail = f"status={last_status}" if last_exc is None else f"exception={last_exc!r}"
print(f"ERROR: Failed to push to {label} after 3 attempts: url={url} {detail}", file=sys.stderr)
return False
pushgateway_ok = push_with_retry(
"Pushgateway",
lambda: requests.post(PUSHGATEWAY, data=metrics, timeout=10),
PUSHGATEWAY,
)
# Push to Uptime Kuma on success
uptime_kuma_ok = True
if success:
try:
requests.get("http://uptime-kuma.uptime-kuma.svc.cluster.local/api/push/hLtyRKgeZO?status=up&msg=OK&ping=" + str(int(duration)), timeout=10)
print("Pushed to Uptime Kuma")
except Exception as e:
print(f"Failed to push to Uptime Kuma: {e}")
uptime_kuma_ok = push_with_retry(
"Uptime Kuma",
lambda: requests.get(UPTIME_KUMA_URL, timeout=10),
UPTIME_KUMA_URL,
)
sys.exit(0 if success else 1)
# Exit non-zero when the round-trip itself failed, OR when BOTH push endpoints
# failed after all retries (only possible on the success path; on failure we
# only attempt Pushgateway, and the round-trip failure already dominates exit).
both_pushes_failed = success and (not pushgateway_ok) and (not uptime_kuma_ok)
sys.exit(0 if (success and not both_pushes_failed) else 1)
'
EOT
]
@ -928,3 +1149,381 @@ resource "kubernetes_cron_job_v1" "mailserver-backup" {
}
}
# =============================================================================
# Roundcube Backup: daily rsync of html + enigma PVCs to NFS
# Roundcube uses two encrypted RWO PVCs (see roundcubemail.tf):
# - roundcubemail-html-encrypted → /var/www/html (plugins, user sessions, skin overrides)
# - roundcubemail-enigma-encrypted → /var/roundcube/enigma (user-uploaded PGP keys)
# Losing either one = users lose plugin state + have to re-import PGP keys.
# Mirrors the mailserver-backup pattern but:
# - pod_affinity targets app=roundcubemail (both PVCs attach to the
# Roundcube pod, not mailserver)
# - schedule offset by +10m (03:10) so two NFS-writers don't overlap
# - writes to /srv/nfs/roundcube-backup/<YYYY-WW>/{html,enigma}/
# =============================================================================
module "nfs_roundcube_backup_host" {
source = "../../../../modules/kubernetes/nfs_volume"
name = "roundcube-backup-host"
namespace = kubernetes_namespace.mailserver.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/roundcube-backup"
}
resource "kubernetes_cron_job_v1" "roundcube-backup" {
metadata {
name = "roundcube-backup"
namespace = kubernetes_namespace.mailserver.metadata[0].name
}
spec {
concurrency_policy = "Replace"
failed_jobs_history_limit = 5
# +10 min offset vs mailserver-backup (03:00) to avoid NFS contention.
schedule = "10 3 * * *"
starting_deadline_seconds = 10
successful_jobs_history_limit = 10
job_template {
metadata {}
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata {}
spec {
# RWO co-location: Roundcube PVCs are ReadWriteOnce; the backup
# pod must land on the same node as the Roundcube pod (single
# replica, Recreate strategy; see roundcubemail.tf).
affinity {
pod_affinity {
required_during_scheduling_ignored_during_execution {
label_selector {
match_labels = {
app = "roundcubemail"
}
}
topology_key = "kubernetes.io/hostname"
}
}
}
container {
name = "roundcube-backup"
image = "docker.io/library/alpine"
command = ["/bin/sh", "-c", <<-EOT
set -euxo pipefail
apk add --no-cache rsync
_t0=$(date +%s)
_rb0=$(awk '/^read_bytes/{print $2}' /proc/$$/io 2>/dev/null || echo 0)
_wb0=$(awk '/^write_bytes/{print $2}' /proc/$$/io 2>/dev/null || echo 0)
week=$(date +"%Y-%W")
prev_week=$(date -d "-7 days" +"%Y-%W" 2>/dev/null || echo "")
dst=/backup/$week
mkdir -p "$dst"
# Use --link-dest against previous week for space-efficient
# incrementals (unchanged files are hardlinked, not re-copied).
link_dest_arg=""
if [ -n "$prev_week" ] && [ -d "/backup/$prev_week" ]; then
link_dest_arg="--link-dest=/backup/$prev_week"
fi
# Roundcube data layout (from deployment volume mounts in roundcubemail.tf):
# /src/html -> roundcubemail-html-encrypted (html PVC)
# /src/enigma -> roundcubemail-enigma-encrypted (enigma PVC, PGP keys)
for src in /src/html /src/enigma; do
[ -d "$src" ] || { echo "SKIP missing $src"; continue; }
name=$(basename "$src")
rsync -aH --delete $link_dest_arg "$src/" "$dst/$name/"
done
# Rotate: keep 8 weekly snapshots (~2 months)
find /backup -maxdepth 1 -mindepth 1 -type d -regex '.*/[0-9]+-[0-9]+$' | sort | head -n -8 | xargs -r rm -rf
_dur=$(($(date +%s) - _t0))
_rb1=$(awk '/^read_bytes/{print $2}' /proc/$$/io 2>/dev/null || echo 0)
_wb1=$(awk '/^write_bytes/{print $2}' /proc/$$/io 2>/dev/null || echo 0)
echo "=== Backup IO Stats ==="
echo "duration: $${_dur}s"
echo "read: $(( (_rb1 - _rb0) / 1048576 )) MiB"
echo "written: $(( (_wb1 - _wb0) / 1048576 )) MiB"
echo "output: $(du -sh "$dst" | awk '{print $1}')"
_out_bytes=$(du -sb "$dst" | awk '{print $1}')
wget -qO- --post-data "backup_duration_seconds $${_dur}
backup_read_bytes $(( _rb1 - _rb0 ))
backup_written_bytes $(( _wb1 - _wb0 ))
backup_output_bytes $${_out_bytes}
backup_last_success_timestamp $(date +%s)
" "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/roundcube-backup" || true
EOT
]
volume_mount {
name = "html"
mount_path = "/src/html"
read_only = true
}
volume_mount {
name = "enigma"
mount_path = "/src/enigma"
read_only = true
}
volume_mount {
name = "backup"
mount_path = "/backup"
}
}
volume {
name = "html"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.roundcube_html_encrypted.metadata[0].name
read_only = true
}
}
volume {
name = "enigma"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.roundcube_enigma_encrypted.metadata[0].name
read_only = true
}
}
volume {
name = "backup"
persistent_volume_claim {
claim_name = module.nfs_roundcube_backup_host.claim_name
}
}
dns_config {
option {
name = "ndots"
value = "2"
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
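The rotation step in the backup script above (`find ... | sort | head -n -8 | xargs -r rm -rf`) can be sketched in Python as a pure function. This is a minimal illustration, not part of the commit: the function name `snapshots_to_delete` and the `keep` parameter are hypothetical, and it assumes snapshot directory names follow the `YYYY-WW` pattern produced by `date +"%Y-%W"`, so that a lexicographic sort is also chronological.

```python
import re

# Snapshot directories are named YYYY-WW (e.g. "2026-16"); anything else is ignored,
# mirroring the -regex filter in the shell pipeline.
SNAPSHOT_RE = re.compile(r"^[0-9]+-[0-9]+$")


def snapshots_to_delete(names, keep=8):
    """Return the snapshot names to remove, oldest first, keeping the newest `keep`."""
    snapshots = sorted(n for n in names if SNAPSHOT_RE.match(n))
    excess = len(snapshots) - keep
    if excess <= 0:
        return []
    return snapshots[:excess]


if __name__ == "__main__":
    # Ten weekly snapshots plus an unrelated directory: only the two oldest are selected.
    example = ["2026-%02d" % week for week in range(1, 11)] + ["lost+found"]
    print(snapshots_to_delete(example))
```

Keeping the selection separate from the deletion makes the retention window easy to check without touching the NFS share.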
# =============================================================================
# Spam mailbox: targeted retention (code-oy4)
#
# The @viktorbarzin.me catch-all routes to spam@viktorbarzin.me. Unbounded
# growth (~43 MiB baseline on 2026-04-18, 519 messages, top sender
# tldrnewsletter.com = 138 msgs / 8.2 MiB) makes it painful to triage.
# Profile (2026-04-18):
# - 502/519 messages older than 14 days (97 %)
# - 342/519 carry List-Unsubscribe: (66 %)
# - 21/519 carry Precedence: bulk ( 4 %)
# - 177/519 carry neither marker (= human-ish, 34 %)
#
# Strategy (user-signed-off 2026-04-18, do NOT blind-age-expunge):
# - Messages older than 14 days carrying List-Unsubscribe OR
# Precedence: bulk|list|junk OR Auto-Submitted: auto-* -> DELETE
# - Messages older than 90 days with no automated-sender marker
# -> DELETE (long-tail human forwards)
# - Everything else -> KEEP
#
# Implementation: kubectl exec into the mailserver pod because the
# Maildir lives on a RWO encrypted PVC; a sibling CronJob would fail to
# attach the volume while the mailserver pod holds it. Pattern mirrors
# the `nextcloud-watchdog` in stacks/nextcloud/main.tf.
# =============================================================================
resource "kubernetes_service_account" "spam_retention" {
metadata {
name = "spam-retention"
namespace = kubernetes_namespace.mailserver.metadata[0].name
}
}
resource "kubernetes_role" "spam_retention" {
metadata {
name = "spam-retention"
namespace = kubernetes_namespace.mailserver.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods"]
verbs = ["list", "get"]
}
rule {
api_groups = [""]
resources = ["pods/exec"]
verbs = ["create"]
}
}
resource "kubernetes_role_binding" "spam_retention" {
metadata {
name = "spam-retention"
namespace = kubernetes_namespace.mailserver.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.spam_retention.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.spam_retention.metadata[0].name
namespace = kubernetes_namespace.mailserver.metadata[0].name
}
}
resource "kubernetes_cron_job_v1" "spam_retention" {
metadata {
name = "spam-retention"
namespace = kubernetes_namespace.mailserver.metadata[0].name
}
spec {
schedule = "17 */4 * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 2
failed_jobs_history_limit = 3
starting_deadline_seconds = 300
job_template {
metadata {}
spec {
active_deadline_seconds = 600
backoff_limit = 1
ttl_seconds_after_finished = 600
template {
metadata {}
spec {
service_account_name = kubernetes_service_account.spam_retention.metadata[0].name
restart_policy = "Never"
container {
name = "spam-retention"
image = "bitnami/kubectl:latest"
command = ["/bin/bash", "-c", <<-EOF
set -euo pipefail
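# `|| true` keeps `set -e` from aborting on an empty pod list so the
# explicit error below is reached.
POD=$(kubectl -n mailserver get pods -l app=mailserver -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)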
if [ -z "$POD" ]; then
echo "ERROR: no mailserver pod found" >&2
exit 1
fi
echo "Targeting pod $POD"
# Stream the retention script to python3 inside the mailserver
# container via stdin. Keeping the logic in Python avoids the
# POSIX-sh/awk fragility around stat(1) differences and header
# matching.
kubectl -n mailserver exec -i "$POD" -c docker-mailserver -- python3 - <<'PYEOF'
import os
import re
import sys
import time
SPAM = "/var/mail/viktorbarzin.me/spam/cur"
# Retention thresholds, in days, one per rule.
AUTOMATED_MAX_AGE_DAYS = 14
HUMAN_MAX_AGE_DAYS = 90
HEADER_SCAN_BYTES = 65536
AUTO_PATTERNS = (
    re.compile(rb"^list-unsubscribe:", re.IGNORECASE),
    re.compile(rb"^precedence:\s*(bulk|list|junk)", re.IGNORECASE),
    re.compile(rb"^auto-submitted:\s*auto-", re.IGNORECASE),
)
def is_automated(path):
    try:
        with open(path, "rb") as fh:
            head = fh.read(HEADER_SCAN_BYTES)
    except OSError:
        return False
    hdr, _, _ = head.partition(b"\r\n\r\n")
    if hdr == head:
        hdr, _, _ = head.partition(b"\n\n")
    for line in hdr.splitlines():
        for pat in AUTO_PATTERNS:
            if pat.search(line):
                return True
    return False
if not os.path.isdir(SPAM):
    print(f"SKIP: {SPAM} does not exist")
    sys.exit(0)
now = time.time()
scanned = auto_deleted = human_deleted = kept = errors = 0
for entry in sorted(os.listdir(SPAM)):
    path = os.path.join(SPAM, entry)
    try:
        st = os.stat(path)
    except OSError:
        errors += 1
        continue
    if not os.path.isfile(path):
        continue
    scanned += 1
    age_days = (now - st.st_mtime) / 86400
    automated = is_automated(path)
    if automated and age_days > AUTOMATED_MAX_AGE_DAYS:
        try:
            os.unlink(path)
            auto_deleted += 1
        except OSError:
            errors += 1
        continue
    if (not automated) and age_days > HUMAN_MAX_AGE_DAYS:
        try:
            os.unlink(path)
            human_deleted += 1
        except OSError:
            errors += 1
        continue
    kept += 1
# Metric lines (Pushgateway-compatible format). The parent
# kubectl wrapper logs them for now; Pushgateway integration
# is a follow-up.
print(f"spam_retention_scanned_total {scanned}")
print(f"spam_retention_auto_deleted_total {auto_deleted}")
print(f"spam_retention_human_deleted_total {human_deleted}")
print(f"spam_retention_kept_total {kept}")
print(f"spam_retention_errors_total {errors}")
sys.exit(1 if errors else 0)
PYEOF
# Refresh Dovecot index so IMAP sees the deletions immediately.
kubectl -n mailserver exec "$POD" -c docker-mailserver -- \
doveadm force-resync -u spam@viktorbarzin.me INBOX/spam || true
echo "Retention pass complete"
EOF
]
resources {
requests = {
cpu = "10m"
memory = "32Mi"
}
limits = {
memory = "128Mi"
}
}
}
dns_config {
option {
name = "ndots"
value = "2"
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
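The retention strategy described in the comment banner for this CronJob reduces to a two-threshold decision per message. The sketch below restates that rule as a standalone function so the boundary cases are easy to check. The function name `retention_action` and its return strings are hypothetical; only the 14-day and 90-day thresholds come from the script above.

```python
# Thresholds taken from the retention script; everything else here is illustrative.
AUTOMATED_MAX_AGE_DAYS = 14
HUMAN_MAX_AGE_DAYS = 90


def retention_action(age_days, automated):
    """Classify one message given its age in days and whether it carries an
    automated-sender marker (List-Unsubscribe, Precedence, Auto-Submitted)."""
    if automated and age_days > AUTOMATED_MAX_AGE_DAYS:
        return "delete-automated"
    if not automated and age_days > HUMAN_MAX_AGE_DAYS:
        return "delete-human"
    return "keep"


if __name__ == "__main__":
    for age, automated in [(15, True), (15, False), (91, False), (14, True)]:
        print(age, automated, retention_action(age, automated))
```

Both comparisons are strict, so a message that is exactly 14 (or 90) days old survives until the next run.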


@ -267,6 +267,7 @@ module "ingress" {
name = "mail"
service_name = "roundcubemail"
tls_secret_name = var.tls_secret_name
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Roundcube Mail"


@ -12,6 +12,13 @@ smtp_tls_security_level = encrypt
smtpd_tls_cert_file=/tmp/ssl/tls.crt
smtpd_tls_key_file=/tmp/ssl/tls.key
smtpd_use_tls=yes
# Require STARTTLS before any AUTH command on the SMTPD listener.
# Without this, a misconfigured client that skips STARTTLS would send
# PLAIN/LOGIN creds in the clear. docker-mailserver's default does NOT
# enforce this at the main.cf level for submission (587).
# Note: smtpd_sasl_auth_only (sometimes cited) is NOT a real Postfix
# parameter; only smtpd_tls_auth_only is. Addresses code-vnw.
smtpd_tls_auth_only = yes
header_size_limit = 4096000
# Debug mail tls
@ -37,139 +44,3 @@ anvil_rate_time_unit = 60s
postscreen_cache_map =
EOT
}
variable "postfix_cf_reference_DO_NOT_USE" {
default = <<EOT
# See /usr/share/postfix/main.cf.dist for a commented, more complete version
smtpd_banner = $myhostname ESMTP $mail_name (Debian)
biff = no
append_dot_mydomain = no
readme_directory = no
# Basic configuration
# myhostname =
alias_maps = hash:/etc/aliases
alias_database = hash:/etc/aliases
mydestination = $myhostname, localhost.$mydomain, localhost
mynetworks = 127.0.0.0/8 [::1]/128 [fe80::]/64
mailbox_size_limit = 0
recipient_delimiter = +
inet_interfaces = all
inet_protocols = ipv4
# TLS parameters
smtpd_tls_cert_file=/tmp/ssl/tls.crt
smtpd_tls_key_file=/tmp/ssl/tls.key
#smtpd_tls_CAfile=
#smtp_tls_CAfile=
smtpd_tls_security_level = may
smtpd_use_tls=yes
smtpd_tls_loglevel = 1
smtp_tls_loglevel = 1
tls_ssl_options = NO_COMPRESSION
tls_high_cipherlist = ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS
tls_preempt_cipherlist = yes
smtpd_tls_protocols = !SSLv2,!SSLv3
smtp_tls_protocols = !SSLv2,!SSLv3
smtpd_tls_mandatory_ciphers = high
smtpd_tls_mandatory_protocols = !SSLv2,!SSLv3
smtpd_tls_exclude_ciphers = aNULL, LOW, EXP, MEDIUM, ADH, AECDH, MD5, DSS, ECDSA, CAMELLIA128, 3DES, CAMELLIA256, RSA+AES, eNULL
smtpd_tls_dh1024_param_file = /etc/postfix/dhparams.pem
smtpd_tls_CApath = /etc/ssl/certs
smtp_tls_CApath = /etc/ssl/certs
# Settings to prevent SPAM early
smtpd_helo_required = yes
smtpd_delay_reject = yes
smtpd_helo_restrictions = permit_mynetworks, reject_invalid_helo_hostname, permit
#smtpd_relay_restrictions = permit_mynetworks permit_sasl_authenticated defer_unauth_destination
#smtpd_relay_restrictions = reject_sender_login_mismatch permit_sasl_authenticated permit_mynetworks defer_unauth_destination
smtpd_relay_restrictions = reject_sender_login_mismatch permit_sasl_authenticated permit_mynetworks defer_unauth_destination
smtpd_recipient_restrictions = permit_sasl_authenticated, reject_unauth_destination, reject_unauth_pipelining, reject_invalid_helo_hostname, reject_non_fqdn_helo_hostname, reject_unknown_recipient_domain, reject_rbl_client bl.spamcop.net, permit_mynetworks
smtpd_client_restrictions = permit_mynetworks, permit_sasl_authenticated, reject_unauth_destination, reject_unauth_pipelining
#smtpd_sender_restrictions = reject_sender_login_mismatch, permit_sasl_authenticated, permit_mynetworks, reject_unknown_sender_domain
smtpd_sender_restrictions = reject_sender_login_mismatch, reject_authenticated_sender_login_mismatch, reject_unknown_sender_domain, permit_sasl_authenticated, permit_mynetworks
disable_vrfy_command = yes
# Postscreen settings to drop zombies/open relays/spam early
#postscreen_dnsbl_action = enforce
postscreen_dnsbl_action = ignore
postscreen_dnsbl_sites = zen.spamhaus.org*2
bl.mailspike.net
b.barracudacentral.org*2
bl.spameatingmonkey.net
bl.spamcop.net
dnsbl.sorbs.net
psbl.surriel.com
list.dnswl.org=127.0.[0..255].0*-2
list.dnswl.org=127.0.[0..255].1*-3
list.dnswl.org=127.0.[0..255].[2..3]*-4
postscreen_dnsbl_threshold = 3
postscreen_dnsbl_whitelist_threshold = -1
postscreen_greet_action = enforce
postscreen_bare_newline_action = enforce
# SASL
smtpd_sasl_auth_enable = no
#smtpd_sasl_auth_enable = yes
##smtpd_sasl_path = /var/spool/postfix/private/auth
#smtpd_sasl_path = /var/spool/postfix/private/smtpd
##smtpd_sasl_type = dovecot
#smtpd_sasl_type = dovecot
##smtpd_sasl_security_options = noanonymous
#smtpd_sasl_security_options = noanonymous
##smtpd_sasl_local_domain = $mydomain
##broken_sasl_auth_clients = yes
#broken_sasl_auth_clients = yes
# SMTP configuration
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl/passwd
smtp_sasl_security_options = noanonymous
smtp_sasl_tls_security_options = noanonymous
smtp_tls_security_level = encrypt
header_size_limit = 4096000
relayhost = [smtp.sendgrid.net]:587
# Mail directory
virtual_transport = lmtp:unix:/var/run/dovecot/lmtp
virtual_mailbox_domains = /etc/postfix/vhost
virtual_mailbox_maps = texthash:/etc/postfix/vmailbox
virtual_alias_maps = texthash:/etc/postfix/virtual
# Additional option for filtering
content_filter = smtp-amavis:[127.0.0.1]:10024
# Milters used by DKIM
milter_protocol = 6
milter_default_action = accept
dkim_milter = inet:localhost:8891
dmarc_milter = inet:localhost:8893
smtpd_milters = $dkim_milter,$dmarc_milter
non_smtpd_milters = $dkim_milter
# SPF policy settings
policyd-spf_time_limit = 3600
# Header checks for content inspection on receiving
header_checks = pcre:/etc/postfix/maps/header_checks.pcre
# Remove unwanted headers that compromise our privacy
smtp_header_checks = pcre:/etc/postfix/maps/sender_header_filter.pcre
myhostname = mail.viktorbarzin.me
mydomain = viktorbarzin.me
smtputf8_enable = no
message_size_limit = 20480000
sender_canonical_maps = tcp:localhost:10001
sender_canonical_classes = envelope_sender
recipient_canonical_maps = tcp:localhost:10002
recipient_canonical_classes = envelope_recipient,header_recipient
compatibility_level = 2
# enable_original_recipient = no # b4 uncommenting see https://serverfault.com/questions/661615/how-to-drop-orig-to-using-postfix-virtual-domains
always_add_missing_headers = yes
anvil_status_update_time = 5s
EOT
}


@ -434,6 +434,223 @@
],
"title": "Transfer Speed (Global)",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 39 },
"id": 103,
"title": "MAM Profile (from jsonLoad.php)",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"mappings": [
{ "type": "value", "options": {
"0": { "color": "red", "text": "Mouse" },
"1": { "color": "orange", "text": "Vole" },
"2": { "color": "yellow", "text": "User" },
"3": { "color": "green", "text": "Power User" },
"4": { "color": "green", "text": "Elite" },
"5": { "color": "blue", "text": "Torrent Master" },
"6": { "color": "blue", "text": "Power TM" },
"7": { "color": "purple", "text": "Elite TM" },
"8": { "color": "purple", "text": "VIP" }
} }
],
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 2 }
] }
}
},
"gridPos": { "h": 6, "w": 4, "x": 0, "y": 40 },
"id": 20,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"textMode": "value",
"reduceOptions": { "calcs": ["lastNotNull"] }
},
"targets": [{ "expr": "mam_class_code", "legendFormat": "Class" }],
"title": "MAM Class",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "orange", "value": 0.8 },
{ "color": "green", "value": 1.2 }
] },
"decimals": 3
}
},
"gridPos": { "h": 6, "w": 4, "x": 4, "y": 40 },
"id": 21,
"options": {
"colorMode": "background",
"graphMode": "area",
"justifyMode": "center",
"textMode": "value",
"reduceOptions": { "calcs": ["lastNotNull"] }
},
"targets": [{ "expr": "mam_ratio", "legendFormat": "Ratio" }],
"title": "MAM Ratio (profile)",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"unit": "short",
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 5000 }
] }
}
},
"gridPos": { "h": 6, "w": 4, "x": 8, "y": 40 },
"id": 22,
"options": {
"colorMode": "background",
"graphMode": "area",
"justifyMode": "center",
"textMode": "value",
"reduceOptions": { "calcs": ["lastNotNull"] }
},
"targets": [{ "expr": "mam_bp_balance", "legendFormat": "BP" }],
"title": "MAM Bonus Points",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": { "defaults": { "unit": "decbytes" } },
"gridPos": { "h": 6, "w": 12, "x": 12, "y": 40 },
"id": 23,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"textMode": "value_and_name",
"reduceOptions": { "calcs": ["lastNotNull"] }
},
"targets": [
{ "expr": "mam_downloaded_bytes", "legendFormat": "Downloaded" },
{ "expr": "mam_uploaded_bytes", "legendFormat": "Uploaded" }
],
"title": "MAM Transfer (profile)",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"drawStyle": "line",
"fillOpacity": 10,
"lineWidth": 2,
"showPoints": "never",
"spanNulls": true,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "transparent", "value": null },
{ "color": "orange", "value": 500 }
] },
"unit": "short"
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 46 },
"id": 24,
"options": {
"legend": { "calcs": ["lastNotNull", "min"], "displayMode": "table", "placement": "bottom" },
"tooltip": { "mode": "multi" }
},
"targets": [
{ "expr": "mam_bp_balance", "legendFormat": "BP Balance" },
{ "expr": "mam_bp_needed_gib * 500", "legendFormat": "Next-run cost (BP)" }
],
"title": "BP Balance vs Reserve",
"type": "timeseries"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"drawStyle": "bars",
"fillOpacity": 80,
"lineWidth": 1,
"stacking": { "mode": "normal" }
},
"unit": "short"
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 46 },
"id": 25,
"options": {
"legend": { "calcs": ["lastNotNull", "sum"], "displayMode": "table", "placement": "bottom" },
"tooltip": { "mode": "multi" }
},
"targets": [
{
"expr": "mam_janitor_deleted_per_run",
"legendFormat": "{{reason}}"
}
],
"title": "Janitor Deletions per Run (by reason)",
"type": "timeseries"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": { "unit": "short" }
},
"gridPos": { "h": 6, "w": 12, "x": 0, "y": 54 },
"id": 26,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"textMode": "value_and_name",
"reduceOptions": { "calcs": ["lastNotNull"] }
},
"targets": [
{ "expr": "mam_janitor_preserved_hnr", "legendFormat": "Preserved (H&R <72h)" },
{ "expr": "mam_janitor_skipped_active", "legendFormat": "Skipped (in-progress)" },
{ "expr": "mam_janitor_dry_run", "legendFormat": "Dry-run mode" }
],
"title": "Janitor State",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": { "unit": "short" }
},
"gridPos": { "h": 6, "w": 12, "x": 12, "y": 54 },
"id": 27,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"textMode": "value_and_name",
"reduceOptions": { "calcs": ["lastNotNull"] }
},
"targets": [
{ "expr": "mam_farming_grabbed", "legendFormat": "Last run grabbed" },
{ "expr": "mam_farming_total_seeding", "legendFormat": "Total in farming" },
{ "expr": "sum by (reason) (mam_grabber_skipped_reason)", "legendFormat": "Grabber skipped: {{reason}}" }
],
"title": "Grabber State",
"type": "stat"
}
],
"refresh": "1m",

File diff suppressed because it is too large


@ -5,6 +5,8 @@ deploymentStrategy:
maxUnavailable: 1
replicas: 1
adminPassword: "${grafana_admin_password}"
plugins:
- netsage-sankey-panel
resources:
requests:
cpu: 50m


@ -1355,12 +1355,65 @@ serverFiles:
annotations:
summary: "PostgreSQL pod {{ $labels.pod }} is not ready"
- alert: RedisDown
expr: kube_statefulset_status_replicas_ready{namespace="redis", statefulset="redis-node"} < 1
# Covers both the legacy Bitnami StatefulSet (redis-node) and the
# new raw StatefulSet (redis-v2) during the 2026-04-19 migration.
# Drop the redis-node branch after helm_release.redis is removed.
expr: (sum(kube_statefulset_status_replicas_ready{namespace="redis", statefulset=~"redis-node|redis-v2"}) or on() vector(0)) < 1
for: 5m
labels:
severity: critical
annotations:
summary: "Redis has no ready replicas"
summary: "Redis has no ready replicas across both clusters"
- alert: RedisMemoryPressure
expr: redis_memory_used_bytes{namespace="redis"} / redis_memory_max_bytes{namespace="redis"} > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "Redis pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of maxmemory — eviction imminent"
- alert: RedisEvictions
# allkeys-lru is configured so evictions under cache pressure are
# expected, but sustained evictions mean we're thrashing — raise it.
expr: rate(redis_evicted_keys_total{namespace="redis"}[5m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Redis pod {{ $labels.pod }} evicting keys ({{ $value }} keys/s)"
- alert: RedisReplicationLagHigh
expr: redis_connected_slave_lag_seconds{namespace="redis"} > 30
for: 3m
labels:
severity: warning
annotations:
summary: "Redis replica {{ $labels.slave_ip }} lagging {{ $value }}s behind master"
- alert: RedisForkLatencyHigh
# latest_fork_usec > 500ms means BGSAVE fork is stalling the main
# thread long enough to drop client requests. COW pressure or
# constrained memory headroom are the usual causes.
expr: redis_latest_fork_usec{namespace="redis"} > 500000
for: 0m
labels:
severity: warning
annotations:
summary: "Redis pod {{ $labels.pod }} fork took {{ $value }}us (>500ms) — investigate memory headroom"
- alert: RedisAOFRewriteLong
expr: redis_aof_rewrite_in_progress{namespace="redis"} == 1
for: 10m
labels:
severity: warning
annotations:
summary: "Redis pod {{ $labels.pod }} AOF rewrite running >10m — COW memory risk, investigate"
- alert: RedisReplicasMissing
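# redis-v2 StatefulSet should always run 3 pods: one master plus 2
# connected replicas. <2 connected_slaves on the master means one
# replica is unreachable or still syncing. `on (namespace, pod)` is
# required because redis_instance_info carries extra labels (role,
# redis_version, ...) that would otherwise prevent the `and` from matching.
expr: redis_connected_slaves{namespace="redis", pod=~"redis-v2-.*"} < 2 and on (namespace, pod) redis_instance_info{namespace="redis", pod=~"redis-v2-.*", role="master"} == 1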
for: 10m
labels:
severity: warning
annotations:
summary: "Redis master {{ $labels.pod }} has only {{ $value }} connected replicas (expected 2)"
- alert: HeadscaleDown
expr: (kube_deployment_status_replicas_available{namespace="headscale"} or on() vector(0)) < 1
for: 5m
@ -1868,13 +1921,24 @@ serverFiles:
summary: "NetFlow processing delay p50: {{ $value | printf \"%.0f\" }}s — softflowd may be overloaded"
- name: "DNS Anomaly Detection"
rules:
# Spike detection: compare current value against its own 1h history via
# avg_over_time. Previous version compared against dns_anomaly_avg_queries
# which was computed from a per-pod /tmp file and always equalled the
# current value (fresh /tmp each run), so the alert could never fire.
- alert: DNSQuerySpike
expr: dns_anomaly_total_queries > 2 * dns_anomaly_avg_queries and dns_anomaly_total_queries > 1000
expr: dns_anomaly_total_queries > 2 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m) and dns_anomaly_total_queries > 1000
for: 0m
labels:
severity: warning
annotations:
summary: "DNS query spike: {{ $value | printf \"%.0f\" }} queries (>2x average)"
summary: "DNS query spike: {{ $value | printf \"%.0f\" }} queries (>2x 1h avg)"
- alert: DNSQueryRateDropped
expr: dns_anomaly_total_queries < 0.5 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m) and avg_over_time(dns_anomaly_total_queries[1h] offset 15m) > 1000
for: 10m
labels:
severity: warning
annotations:
summary: "DNS query volume dropped: {{ $value | printf \"%.0f\" }} queries (<50% of 1h avg) — upstream clients may be failing to reach Technitium"
- alert: DNSHighErrorRate
expr: dns_anomaly_server_failure > 100
for: 0m
@ -1882,18 +1946,77 @@ serverFiles:
severity: warning
annotations:
summary: "High DNS SERVFAIL rate: {{ $value | printf \"%.0f\" }} failures detected"
- name: qbittorrent
rules:
- alert: QBittorrentMAMRatioLow
expr: qbt_tracker_ratio{tracker="mam"} < 1.0
for: 1h
- alert: TechnitiumZoneSyncFailed
expr: technitium_zone_sync_status != 0
for: 30m
labels:
severity: warning
annotations:
summary: "MAM ratio is {{ $value | printf \"%.2f\" }} (must be >= 1.0)"
summary: "Technitium zone-sync CronJob has reported failure for 30m — replicas may be missing zones"
- alert: TechnitiumZoneSyncStale
expr: (time() - technitium_zone_sync_last_run) > 3600
for: 10m
labels:
severity: warning
annotations:
summary: "Technitium zone-sync has not run successfully in >1h (last: {{ $value | humanizeDuration }} ago)"
- alert: TechnitiumZoneCountMismatch
expr: (max(technitium_zone_count) - min(technitium_zone_count)) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Technitium zone counts differ across instances (max-min delta: {{ $value | printf \"%.0f\" }}) — replica has drifted from primary"
- alert: CoreDNSForwardFailureRate
expr: sum(rate(coredns_forward_responses_total{rcode=~"SERVFAIL|REFUSED"}[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "CoreDNS forward SERVFAIL/REFUSED rate: {{ $value | printf \"%.2f\" }}/s — upstream DNS (pfSense/public) may be unhealthy"
- name: qbittorrent
rules:
- alert: MAMMouseClass
expr: mam_class_code == 0
for: 1h
labels:
severity: critical
annotations:
summary: "MAM account is in Mouse class — tracker is refusing announces, ratio cannot recover"
- alert: MAMCookieExpired
expr: mam_farming_cookie_expired > 0
for: 0m
labels:
severity: critical
annotations:
summary: "MAM session cookie has expired — refresh `mam_id` in Vault servarr/mam_id"
- alert: MAMRatioBelowOne
expr: mam_ratio < 1.0
for: 24h
labels:
severity: warning
annotations:
summary: "MAM ratio is {{ $value | printf \"%.2f\" }} for 24h (target: >= 1.0)"
- alert: MAMFarmingStuck
expr: |
mam_ratio >= 1.2
and increase(mam_farming_grabbed[4h]) == 0
and mam_farming_total_seeding < 150
for: 4h
labels:
severity: warning
annotations:
summary: "Grabber has added 0 torrents in 4h despite healthy ratio ({{ $value | printf \"%.2f\" }})"
- alert: MAMJanitorStuckBacklog
expr: mam_janitor_skipped_active > 400
for: 6h
labels:
severity: warning
annotations:
summary: "Janitor is skipping {{ $value | printf \"%.0f\" }} in-progress torrents — queue not draining"
- alert: QBittorrentDisconnected
expr: qbt_connected == 0
for: 5m
for: 10m
labels:
severity: critical
annotations:
@ -1977,6 +2100,41 @@ serverFiles:
severity: warning
annotations:
summary: "Authentik outpost restarted {{ $value | printf \"%.0f\" }} times in 30m — check for OOM or crash loop"
- alert: AuthentikOutpostDevShmFull
# Direct filesystem measure of the /dev/shm emptyDir sizeLimit.
# The 2026-04-18 incident went undetected for 40h because working-set
# memory lags tmpfs fill (files count against memory but not always
# against working set). This rule catches the underlying cause.
# See docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md.
expr: container_fs_usage_bytes{namespace="authentik", pod=~"ak-outpost-.*"} / container_fs_limit_bytes{namespace="authentik", pod=~"ak-outpost-.*"} > 0.8
for: 5m
labels:
severity: critical
annotations:
summary: "Authentik outpost filesystem at {{ $value | humanizePercentage }} on {{ $labels.pod }} — session files filling tmpfs, forward-auth imminent failure"
- alert: AuthentikOutpostForwardAuth400Spike
# Sudden 400 spike from the outpost means forward-auth is broken
# for all protected services. The /dev/shm ENOSPC class of failures
# manifests as the outpost returning 400 on /outpost.goauthentik.io/auth/traefik.
expr: sum by (service) (increase(traefik_service_requests_total{code="400", service=~"authentik-authentik-outpost.*"}[5m])) > 10
for: 2m
labels:
severity: critical
annotations:
summary: "Authentik outpost returning {{ $value | printf \"%.0f\" }} 400s in 5m on {{ $labels.service }} — forward-auth broken for all 43 protected services"
- alert: AuthentikServerReplicasMismatch
# With 3 replicas + PDB minAvailable=2, a sustained drop to <3
# means a node is unschedulable, image pull failing, or quota hit.
expr: (kube_deployment_spec_replicas{namespace="authentik", deployment="goauthentik-server"} - kube_deployment_status_replicas_available{namespace="authentik", deployment="goauthentik-server"}) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Authentik server has {{ $value }} unavailable replica(s) for 15m — check pod events"
# Mailserver Dovecot alerts were removed with the exporter in
# code-1ik (viktorbarzin/dovecot_exporter incompatible with
# Dovecot 2.3 stats architecture). Re-add the rule group if a
# working exporter is introduced.
- name: Infrastructure Drift
# Metrics pushed by .woodpecker/drift-detection.yml after each cron run.
# See Wave 7 of the state-drift consolidation plan.
@ -2011,6 +2169,11 @@ serverFiles:
summary: "{{ $value | printf \"%.0f\" }} stacks drifting — likely a systemic cause (new admission webhook, provider upgrade). Check the most recent drift-detection run in Woodpecker."
extraScrapeConfigs: |
# The `mailserver-dovecot` scrape job was retired in code-1ik together
# with the Dovecot exporter. docker-mailserver 15.0.0's Dovecot 2.3
# doesn't emit the old_stats protocol the exporter expected, so the
# scrape only ever returned `dovecot_up{scope="user"} 0`. Re-add here
# if a working exporter is introduced.
- job_name: 'proxmox-host'
static_configs:
- targets:


@ -142,7 +142,7 @@ resource "kubernetes_deployment" "onlyoffice-document-server" {
app = "onlyoffice-document-server"
}
annotations = {
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis-master.redis:6379"
}
}
spec {


@ -441,11 +441,6 @@ resource "kubernetes_deployment" "openclaw" {
name = "UPTIME_KUMA_PASSWORD"
value = local.skill_secrets["uptime_kuma_password"]
}
# Skill secrets - Slack
env {
name = "SLACK_WEBHOOK_URL"
value = local.skill_secrets["slack_webhook"]
}
# Memory API
env {
name = "MEMORY_API_URL"
@ -837,15 +832,19 @@ resource "kubernetes_service" "task_webhook" {
}
module "task_webhook_ingress" {
source = "../../modules/kubernetes/ingress_factory"
namespace = kubernetes_namespace.openclaw.metadata[0].name
name = "task-webhook"
tls_secret_name = var.tls_secret_name
host = "task-webhook"
port = 80
source = "../../modules/kubernetes/ingress_factory"
namespace = kubernetes_namespace.openclaw.metadata[0].name
name = "task-webhook"
tls_secret_name = var.tls_secret_name
host = "task-webhook"
port = 80
external_monitor = false
}
# --- CronJob: Scheduled cluster health check ---
# --- Shared ServiceAccount: grants pod-exec into the openclaw pod ---
# Used by the task_processor CronJob (below). Previously also used by the
# cluster_healthcheck CronJob, which has been decommissioned; the local
# `scripts/cluster_healthcheck.sh` is now the single authoritative runner.
resource "kubernetes_service_account" "healthcheck" {
metadata {
@ -888,76 +887,6 @@ resource "kubernetes_role_binding" "healthcheck_exec" {
}
}
resource "kubernetes_cron_job_v1" "cluster_healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
labels = {
app = "cluster-healthcheck"
tier = local.tiers.aux
}
}
spec {
schedule = "0 */8 * * *"
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 3
job_template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
active_deadline_seconds = 300
backoff_limit = 0
template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
service_account_name = kubernetes_service_account.healthcheck.metadata[0].name
restart_policy = "Never"
container {
name = "healthcheck"
image = "bitnami/kubectl:latest"
command = ["bash", "-c", <<-EOF
# Find the openclaw pod
POD=$(kubectl get pods -n openclaw -l app=openclaw -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [ -z "$POD" ]; then
echo "ERROR: OpenClaw pod not found"
exit 1
fi
echo "Executing health check in pod $POD..."
kubectl exec -n openclaw "$POD" -c openclaw -- bash /workspace/infra/.claude/cluster-health.sh
EOF
]
resources {
requests = {
cpu = "50m"
memory = "64Mi"
}
limits = {
memory = "64Mi"
}
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# --- CronJob: Task processor polls Forgejo issues and triggers OpenClaw ---
resource "kubernetes_cron_job_v1" "task_processor" {
@ -982,8 +911,9 @@ resource "kubernetes_cron_job_v1" "task_processor" {
}
}
spec {
active_deadline_seconds = 600
backoff_limit = 0
active_deadline_seconds = 600
backoff_limit = 0
ttl_seconds_after_finished = 86400
template {
metadata {
labels = {


@ -0,0 +1,82 @@
-- ot-recorder Lua hook: forward every location publish to Dawarich.
-- Loaded by ot-recorder via `--lua-script`. The hook() function is invoked
-- synchronously per publish; we fork curl with `&` to keep it fire-and-forget.
-- Dawarich's points table has UNIQUE (lonlat, timestamp, user_id) — duplicates
-- are safely dropped. The .rec file is always written regardless of hook result,
-- so a Dawarich 5xx loses nothing long-term (re-playable via backfill Job).
local function escape_shell_single(s)
return "'" .. tostring(s):gsub("'", "'\\''") .. "'"
end
local function json_escape_string(s)
return (s:gsub("\\", "\\\\")
:gsub('"', '\\"')
:gsub("\n", "\\n")
:gsub("\r", "\\r")
:gsub("\t", "\\t"))
end
-- Minimal JSON serializer — scalars, arrays, maps. Owntracks payloads are
-- all primitive/flat; no bignum or cyclic-ref concerns.
local function to_json(v)
local t = type(v)
if t == "nil" then return "null" end
if t == "number" then return tostring(v) end
if t == "boolean" then return tostring(v) end
if t == "string" then return '"' .. json_escape_string(v) .. '"' end
if t == "table" then
if #v > 0 or next(v) == nil then
local parts = {}
for i, x in ipairs(v) do parts[i] = to_json(x) end
return "[" .. table.concat(parts, ",") .. "]"
end
local parts = {}
for k, x in pairs(v) do
parts[#parts + 1] = '"' .. json_escape_string(tostring(k)) .. '":' .. to_json(x)
end
return "{" .. table.concat(parts, ",") .. "}"
end
return "null"
end
function otr_init()
otr.log("dawarich-bridge: init")
if not os.getenv("DAWARICH_API_KEY") then
otr.log("dawarich-bridge: WARN DAWARICH_API_KEY unset — hook will skip")
end
end
function otr_exit()
otr.log("dawarich-bridge: exit")
end
function otr_hook(topic, _type, data)
if _type ~= "location" then return end
local api_key = os.getenv("DAWARICH_API_KEY")
if not api_key or api_key == "" then
otr.log("dawarich-bridge: DAWARICH_API_KEY missing — dropping point")
return
end
-- Strip the base64 user avatar: ot-recorder appends a ~120KB `face` field
-- to enriched payloads which pushes the curl command past ARG_MAX (code=7
-- "Argument list too long"). Dawarich doesn't need it.
data.face = nil
local url = "https://dawarich.viktorbarzin.me/api/v1/owntracks/points?api_key=" .. api_key
local payload = to_json(data)
local cmd = table.concat({
"curl -sS -o /dev/null --max-time 5 -X POST",
"-H 'Content-Type: application/json'",
"-d", escape_shell_single(payload),
escape_shell_single(url),
"&",
}, " ")
local ok, reason, code = os.execute(cmd)
if not ok then
otr.log("dawarich-bridge: FAIL tst=" .. tostring(data.tst) ..
" reason=" .. tostring(reason) .. " code=" .. tostring(code) ..
" cmd=" .. cmd)
else
otr.log("dawarich-bridge: ok tst=" .. tostring(data.tst))
end
end
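
A note on the quoting used by the hook above: `escape_shell_single` wraps a value in single quotes and rewrites every embedded `'` as `'\''` (close the quote, emit an escaped quote, reopen). The sketch below mirrors that technique in Python so it can be checked in isolation. `build_curl_command` and the example URL are hypothetical names for illustration only, nothing is executed, and the trailing `&` from the Lua version is deliberately left out.

```python
import shlex


def escape_shell_single(s) -> str:
    """Wrap s in single quotes, rewriting each embedded ' as '\\''."""
    return "'" + str(s).replace("'", "'\\''") + "'"


def build_curl_command(payload: str, url: str) -> str:
    # Hypothetical helper: same argument order as the Lua hook, but it only
    # builds the command string and never runs it.
    return " ".join([
        "curl -sS -o /dev/null --max-time 5 -X POST",
        "-H 'Content-Type: application/json'",
        "-d", escape_shell_single(payload),
        escape_shell_single(url),
    ])


if __name__ == "__main__":
    # A payload containing characters the shell would otherwise interpret.
    payload = '{"_type":"location","desc":"it\'s $HOME; `id`"}'
    cmd = build_curl_command(payload, "https://example.invalid/points?api_key=k")
    argv = shlex.split(cmd)  # POSIX-style parse, as /bin/sh would tokenize it
    print("payload survives quoting:", argv[-2] == payload)
```

In Python code the standard library's `shlex.quote` does the same job; the hand-rolled version is shown only because it matches the Lua helper one to one.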


@ -86,25 +86,13 @@ resource "kubernetes_secret" "basic_auth" {
}
}
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
wait_until_bound = false
resource "kubernetes_config_map" "dawarich_hook" {
metadata {
name = "owntracks-data-proxmox"
name = "dawarich-hook"
namespace = kubernetes_namespace.owntracks.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm"
resources {
requests = {
storage = "1Gi"
}
}
data = {
"dawarich-hook.lua" = file("${path.module}/dawarich-hook.lua")
}
}
@ -149,10 +137,23 @@ resource "kubernetes_deployment" "owntracks" {
name = "http"
container_port = 8083
}
# ot-recorder 1.0.1 has no OTR_HTTPHOOK; forwarding to Dawarich is
# done via a Lua hook script loaded with --lua-script. The script
# reads DAWARICH_API_KEY from env and fires curl fire-and-forget.
args = ["--lua-script", "/hook/dawarich-hook.lua", "owntracks/#"]
env {
name = "OTR_PORT"
value = "0"
}
env {
name = "DAWARICH_API_KEY"
value_from {
secret_key_ref {
name = "owntracks-secrets"
key = "dawarich_api_key"
}
}
}
volume_mount {
name = "data"
@ -162,6 +163,11 @@ resource "kubernetes_deployment" "owntracks" {
name = "data"
mount_path = "/config"
}
volume_mount {
name = "hook"
mount_path = "/hook"
read_only = true
}
resources {
requests = {
cpu = "10m"
@ -178,6 +184,12 @@ resource "kubernetes_deployment" "owntracks" {
claim_name = "owntracks-data-encrypted"
}
}
volume {
name = "hook"
config_map {
name = kubernetes_config_map.dawarich_hook.metadata[0].name
}
}
}
}
}


@ -117,7 +117,7 @@ resource "kubernetes_deployment" "paperless-ngx" {
annotations = {
"diun.enable" = "true"
"diun.include_tags" = "^\\d+(?:\\.\\d+)?(?:\\.\\d+)?$"
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis-master.redis:6379"
}
}
spec {


@ -217,7 +217,7 @@ resource "kubernetes_deployment" "realestate-crawler-api" {
"kubernetes.io/cluster-service" = "true"
}
annotations = {
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis-master.redis:6379"
}
}
spec {
@ -395,7 +395,7 @@ resource "kubernetes_deployment" "realestate-crawler-celery" {
app = "realestate-crawler-celery"
}
annotations = {
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis-master.redis:6379"
}
}
spec {
@ -524,7 +524,7 @@ resource "kubernetes_deployment" "realestate-crawler-celery-beat" {
app = "realestate-crawler-celery-beat"
}
annotations = {
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis-master.redis:6379"
}
}
spec {


@ -72,13 +72,18 @@ resource "helm_release" "redis" {
}
}
# 256Mi was too tight once the working set crossed ~200Mi: BGSAVE
# fork during a replica full PSYNC doubled RSS via COW and pushed
# the master past 256Mi, where it was OOMKilled (exit 137); HAProxy flapped and
# every redis client (Paperless, Immich, Authentik) saw connection
# resets. 512Mi gives ~2x headroom on the current 204Mi RDB.
resources = {
requests = {
cpu = "100m"
memory = "64Mi"
memory = "512Mi"
}
limits = {
memory = "64Mi"
memory = "512Mi"
}
}
}
@ -100,10 +105,10 @@ resource "helm_release" "redis" {
resources = {
requests = {
cpu = "50m"
memory = "64Mi"
memory = "512Mi"
}
limits = {
memory = "64Mi"
memory = "512Mi"
}
}
}
@ -144,6 +149,24 @@ resource "kubernetes_config_map" "haproxy" {
timeout server 30s
timeout check 3s
# Dynamic DNS resolution via cluster CoreDNS. Without this, haproxy
# resolves server hostnames once at startup and caches forever, so
# when redis-node-X pods restart and get new IPs, haproxy keeps
# connecting to the old (dead) IPs and returns "Connection refused"
# until haproxy itself is restarted. This caused an immich outage
# on 2026-04-19 after a redis pod cycle.
resolvers kubernetes
nameserver coredns kube-dns.kube-system.svc.cluster.local:53
resolve_retries 3
timeout resolve 1s
timeout retry 1s
hold other 10s
hold refused 10s
hold nx 10s
hold timeout 10s
hold valid 10s
hold obsolete 10s
frontend redis_front
bind *:6379
default_backend redis_master
@ -163,13 +186,13 @@ resource "kubernetes_config_map" "haproxy" {
tcp-check expect rstring role:master
tcp-check send "QUIT\r\n"
tcp-check expect string +OK
server redis-node-0 redis-node-0.redis-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2
server redis-node-1 redis-node-1.redis-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2
server redis-node-0 redis-node-0.redis-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
server redis-node-1 redis-node-1.redis-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
backend redis_sentinel
balance roundrobin
server redis-node-0 redis-node-0.redis-headless.redis.svc.cluster.local:26379 check inter 5s
server redis-node-1 redis-node-1.redis-headless.redis.svc.cluster.local:26379 check inter 5s
server redis-node-0 redis-node-0.redis-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none
server redis-node-1 redis-node-1.redis-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none
EOT
}
}
@ -183,7 +206,11 @@ resource "kubernetes_deployment" "haproxy" {
}
}
spec {
replicas = 2
# 3 replicas + PDB minAvailable=2 (see kubernetes_pod_disruption_budget_v1.redis_haproxy).
# After Nextcloud drops its sentinel fallback in Phase 6 of the 2026-04-19 redis
# rework, HAProxy is the sole client-facing path for all 17 redis consumers, so
# it needs HA equivalent to other critical-path pods (Traefik, Authentik, PgBouncer).
replicas = 3
selector {
match_labels = {
app = "redis-haproxy"
@ -194,6 +221,11 @@ resource "kubernetes_deployment" "haproxy" {
labels = {
app = "redis-haproxy"
}
annotations = {
# Roll the deployment whenever haproxy.cfg content changes so a
# config update (e.g. DNS resolver tweaks) actually takes effect.
"checksum/config" = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])
}
}
spec {
container {
@ -282,7 +314,11 @@ resource "kubernetes_service" "redis_master" {
# This runs on every apply to ensure the Helm chart's service is always corrected.
resource "null_resource" "patch_redis_service" {
triggers = {
always = timestamp()
# Re-patch only when a Helm upgrade (chart version bump) or an HAProxy
# config change could have reset the selector / rotated HAProxy pods.
# timestamp() would force-replace on every apply, hiding real drift.
chart_version = helm_release.redis.version
haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])
}
provisioner "local-exec" {
@ -304,6 +340,499 @@ module "nfs_backup_host" {
nfs_path = "/srv/nfs/redis-backup"
}
#### Redis v2: parallel 3-node raw StatefulSet (target architecture)
#
# Built alongside the Bitnami helm_release.redis so data can migrate via
# REPLICAOF with <60s cutover downtime (see session plan / beads code-v2b).
#
# Pattern: MySQL standalone precedent (stacks/dbaas/modules/dbaas/main.tf,
# 2026-04-16 migration): raw kubernetes_stateful_set_v1 + official image,
# no Bitnami Helm chart (deprecated by Broadcom Aug 2025; atomic-Helm trap
# caused the 2026-04-04 memory-bump deadlock).
#
# Design choices driven by the incident cluster in April 2026:
# - 3 sentinels (odd count, quorum=2): eliminates the split-brain class
#   that caused the 2026-04-19 PM incident (2 sentinels, stale master state).
# - Init container regenerates sentinel.conf on every boot by probing
#   peers for role:master: no persistent sentinel runtime state, so stale
#   entries can never resurface across pod restarts.
# - podManagementPolicy=Parallel: all 3 pods start together, avoiding the
#   "sentinel-0 elects before -2 booted" ordering bug.
# - Memory 768Mi (up from 512Mi): a concurrent BGSAVE + AOF-rewrite fork can
#   double RSS via COW. auto-aof-rewrite-percentage 200 + min-size 128mb
#   tune down rewrite frequency.
# - Persistence: RDB snapshots + AOF everysec. Measured <1 GB/day write
#   volume (2026-04-19 disk-wear analysis), i.e. a 40+ year SSD runway.
# - HAProxy remains the sole client-facing path for all 17 consumers.
resource "kubernetes_config_map" "redis_v2_conf" {
metadata {
name = "redis-v2-conf"
namespace = kubernetes_namespace.redis.metadata[0].name
}
data = {
"redis.conf" = <<-EOT
bind 0.0.0.0 -::*
port 6379
protected-mode no
dir /data
maxmemory 640mb
maxmemory-policy allkeys-lru
save 900 1
save 300 100
save 60 10000
rdbcompression yes
rdbchecksum yes
stop-writes-on-bgsave-error no
appendonly yes
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 200
auto-aof-rewrite-min-size 128mb
aof-load-truncated yes
aof-use-rdb-preamble yes
replica-read-only yes
replica-serve-stale-data yes
timeout 0
tcp-keepalive 300
tcp-backlog 511
databases 16
loglevel notice
# Included last so `replicaof` directive written by the init container
# overrides the "standalone master" default. Prevents the parallel-
# bootstrap race where all 3 pods claim role:master simultaneously.
include /shared/replica.conf
EOT
}
}
resource "kubernetes_config_map" "redis_v2_sentinel_bootstrap" {
metadata {
name = "redis-v2-sentinel-bootstrap"
namespace = kubernetes_namespace.redis.metadata[0].name
}
data = {
"init.sh" = <<-EOT
#!/bin/sh
set -eu
HOSTNAME=$(hostname)
MY_NUM=$${HOSTNAME##*-}
MY_DNS="$HOSTNAME.redis-v2-headless.redis.svc.cluster.local"
MASTER_HOST=""
echo "=== Redis v2 bootstrap ==="
echo "hostname: $HOSTNAME (index $MY_NUM)"
# Priority 1: ask peer sentinels for the consensus master. Covers the
# "steady-state pod restart" case: sentinels already agree on reality
# and a restarting pod should join that topology.
votes_0=0; votes_1=0; votes_2=0; votes_total=0
for i in 0 1 2; do
if [ "$i" = "$MY_NUM" ]; then continue; fi
peer="redis-v2-$i.redis-v2-headless.redis.svc.cluster.local"
reply=$(redis-cli -h "$peer" -p 26379 -t 2 SENTINEL get-master-addr-by-name mymaster 2>/dev/null | head -n1 || true)
echo "sentinel probe $peer: master=$${reply:-unreachable}"
case "$reply" in
*redis-v2-0*) votes_0=$((votes_0 + 1)); votes_total=$((votes_total + 1)) ;;
*redis-v2-1*) votes_1=$((votes_1 + 1)); votes_total=$((votes_total + 1)) ;;
*redis-v2-2*) votes_2=$((votes_2 + 1)); votes_total=$((votes_total + 1)) ;;
esac
done
if [ "$votes_total" -gt 0 ]; then
if [ "$votes_0" -ge "$votes_1" ] && [ "$votes_0" -ge "$votes_2" ] && [ "$votes_0" -gt 0 ]; then
MASTER_HOST="redis-v2-0.redis-v2-headless.redis.svc.cluster.local"
elif [ "$votes_1" -ge "$votes_2" ] && [ "$votes_1" -gt 0 ]; then
MASTER_HOST="redis-v2-1.redis-v2-headless.redis.svc.cluster.local"
elif [ "$votes_2" -gt 0 ]; then
MASTER_HOST="redis-v2-2.redis-v2-headless.redis.svc.cluster.local"
fi
[ -n "$MASTER_HOST" ] && echo "sentinel vote winner: $MASTER_HOST"
fi
# Priority 2: look for a peer redis that's a master WITH at least one
# replica connected. "Standalone master" peers (bootstrap race) are
# skipped: connected_slaves=0 is ambiguous.
if [ -z "$MASTER_HOST" ]; then
for i in 0 1 2; do
if [ "$i" = "$MY_NUM" ]; then continue; fi
peer="redis-v2-$i.redis-v2-headless.redis.svc.cluster.local"
info=$(redis-cli -h "$peer" -t 2 INFO replication 2>/dev/null || true)
role=$(echo "$info" | awk -F: '/^role:/ {gsub(/\r/,""); print $2; exit}')
slaves=$(echo "$info" | awk -F: '/^connected_slaves:/ {gsub(/\r/,""); print $2; exit}')
echo "redis probe $peer: role=$${role:-unreachable} slaves=$${slaves:-0}"
if [ "$role" = "master" ] && [ "$${slaves:-0}" -gt 0 ]; then
MASTER_HOST="$peer"
break
fi
done
fi
# Priority 3: deterministic fallback. Pod -0 is always the bootstrap
# master on a fresh cluster. All sentinels converge here, no race.
if [ -z "$MASTER_HOST" ]; then
MASTER_HOST="redis-v2-0.redis-v2-headless.redis.svc.cluster.local"
echo "no master found via probes — bootstrap default: $MASTER_HOST"
fi
cat > /shared/sentinel.conf <<EOF
port 26379
bind 0.0.0.0 -::*
dir /shared
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor mymaster $MASTER_HOST 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 30000
sentinel parallel-syncs mymaster 1
EOF
# replica.conf is included by redis.conf (see ConfigMap redis_v2_conf).
# Master pod gets an empty file; replicas get `replicaof <master>`.
# This way pods come up already in the right role, with no post-start race.
if [ "$MY_DNS" = "$MASTER_HOST" ]; then
: > /shared/replica.conf
echo "role: master"
else
echo "replicaof $MASTER_HOST 6379" > /shared/replica.conf
echo "role: replica of $MASTER_HOST"
fi
echo "=== bootstrap complete ==="
cat /shared/sentinel.conf
echo "--- replica.conf ---"
cat /shared/replica.conf
EOT
}
}
resource "kubernetes_service" "redis_v2_headless" {
metadata {
name = "redis-v2-headless"
namespace = kubernetes_namespace.redis.metadata[0].name
labels = {
app = "redis-v2"
}
}
spec {
cluster_ip = "None"
publish_not_ready_addresses = true
selector = {
app = "redis-v2"
}
port {
name = "redis"
port = 6379
}
port {
name = "sentinel"
port = 26379
}
port {
name = "exporter"
port = 9121
}
}
}
resource "kubernetes_stateful_set_v1" "redis_v2" {
metadata {
name = "redis-v2"
namespace = kubernetes_namespace.redis.metadata[0].name
labels = {
app = "redis-v2"
}
}
spec {
service_name = kubernetes_service.redis_v2_headless.metadata[0].name
replicas = 3
pod_management_policy = "Parallel"
selector {
match_labels = {
app = "redis-v2"
}
}
template {
metadata {
labels = {
app = "redis-v2"
}
annotations = {
"prometheus.io/scrape" = "true"
"prometheus.io/port" = "9121"
"checksum/conf" = sha256(kubernetes_config_map.redis_v2_conf.data["redis.conf"])
"checksum/bootstrap" = sha256(kubernetes_config_map.redis_v2_sentinel_bootstrap.data["init.sh"])
}
}
spec {
termination_grace_period_seconds = 30
affinity {
pod_anti_affinity {
preferred_during_scheduling_ignored_during_execution {
weight = 100
pod_affinity_term {
label_selector {
match_expressions {
key = "app"
operator = "In"
values = ["redis-v2"]
}
}
topology_key = "kubernetes.io/hostname"
}
}
}
}
init_container {
name = "generate-sentinel-conf"
image = "docker.io/library/redis:7.4-alpine"
command = ["/bin/sh", "/bootstrap/init.sh"]
resources {
requests = {
cpu = "10m"
memory = "32Mi"
}
limits = {
memory = "32Mi"
}
}
volume_mount {
name = "bootstrap"
mount_path = "/bootstrap"
read_only = true
}
volume_mount {
name = "shared"
mount_path = "/shared"
}
}
container {
name = "redis"
image = "docker.io/library/redis:7.4-alpine"
command = ["redis-server", "/etc/redis/redis.conf"]
port {
container_port = 6379
name = "redis"
}
resources {
requests = {
cpu = "100m"
memory = "768Mi"
}
limits = {
memory = "768Mi"
}
}
volume_mount {
name = "data"
mount_path = "/data"
}
volume_mount {
name = "conf"
mount_path = "/etc/redis"
read_only = true
}
volume_mount {
# redis.conf does `include /shared/replica.conf`, which is written by the init container.
name = "shared"
mount_path = "/shared"
read_only = true
}
liveness_probe {
exec {
command = ["redis-cli", "PING"]
}
initial_delay_seconds = 15
period_seconds = 10
timeout_seconds = 3
failure_threshold = 3
}
readiness_probe {
exec {
command = ["redis-cli", "PING"]
}
initial_delay_seconds = 5
period_seconds = 5
timeout_seconds = 3
failure_threshold = 3
}
}
container {
name = "sentinel"
image = "docker.io/library/redis:7.4-alpine"
command = ["redis-sentinel", "/shared/sentinel.conf"]
port {
container_port = 26379
name = "sentinel"
}
resources {
requests = {
cpu = "20m"
memory = "64Mi"
}
limits = {
memory = "64Mi"
}
}
volume_mount {
name = "shared"
mount_path = "/shared"
}
liveness_probe {
exec {
command = ["redis-cli", "-p", "26379", "PING"]
}
initial_delay_seconds = 20
period_seconds = 10
timeout_seconds = 3
failure_threshold = 3
}
readiness_probe {
exec {
command = ["redis-cli", "-p", "26379", "PING"]
}
initial_delay_seconds = 10
period_seconds = 5
timeout_seconds = 3
failure_threshold = 3
}
}
container {
name = "exporter"
image = "docker.io/oliver006/redis_exporter:v1.62.0"
port {
container_port = 9121
name = "exporter"
}
env {
name = "REDIS_ADDR"
value = "redis://localhost:6379"
}
resources {
requests = {
cpu = "10m"
memory = "32Mi"
}
limits = {
memory = "32Mi"
}
}
liveness_probe {
http_get {
path = "/"
port = 9121
}
initial_delay_seconds = 15
period_seconds = 30
timeout_seconds = 5
}
}
volume {
name = "conf"
config_map {
name = kubernetes_config_map.redis_v2_conf.metadata[0].name
}
}
volume {
name = "bootstrap"
config_map {
name = kubernetes_config_map.redis_v2_sentinel_bootstrap.metadata[0].name
default_mode = "0755"
}
}
volume {
name = "shared"
empty_dir {}
}
}
}
volume_claim_template {
metadata {
name = "data"
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "20Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "5Gi"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_pod_disruption_budget_v1" "redis_v2" {
metadata {
name = "redis-v2"
namespace = kubernetes_namespace.redis.metadata[0].name
}
spec {
min_available = 2
selector {
match_labels = {
app = "redis-v2"
}
}
}
}
resource "kubernetes_pod_disruption_budget_v1" "redis_haproxy" {
metadata {
name = "redis-haproxy"
namespace = kubernetes_namespace.redis.metadata[0].name
}
spec {
min_available = 2
selector {
match_labels = {
app = "redis-haproxy"
}
}
}
}
# Hourly backup: copy RDB snapshot from master to NFS
resource "kubernetes_cron_job_v1" "redis-backup" {
metadata {
@ -335,10 +864,10 @@ resource "kubernetes_cron_job_v1" "redis-backup" {
TIMESTAMP=$(date +%Y%m%d-%H%M)
# Trigger a fresh RDB save on the master
redis-cli -h redis.redis BGSAVE
redis-cli -h redis-master.redis BGSAVE
sleep 5
# Copy the RDB via redis-cli --rdb
redis-cli -h redis.redis --rdb /backup/redis-$TIMESTAMP.rdb
redis-cli -h redis-master.redis --rdb /backup/redis-$TIMESTAMP.rdb
# Rotate: 28-day retention
find /backup -name 'redis-*.rdb' -type f -mtime +28 -delete
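
For reviewers of the `init.sh` bootstrap earlier in this file: its three-priority master selection (sentinel votes, then a peer master with connected replicas, then pod -0) can be sketched as a pure function. This is an illustrative model under stated assumptions, not the script itself: `pick_master`, `PEERS` and `SUFFIX` are hypothetical names, and the caller is assumed to pass data for peers only, matching the script's skip-self loops.

```python
from typing import Dict, Iterable, Tuple

PEERS = ("redis-v2-0", "redis-v2-1", "redis-v2-2")
SUFFIX = ".redis-v2-headless.redis.svc.cluster.local"


def pick_master(
    sentinel_replies: Iterable[str],
    redis_info: Dict[str, Tuple[str, int]],
) -> str:
    """Return the DNS name a booting pod should treat as master.

    sentinel_replies: first reply line from each peer sentinel ("" if unreachable).
    redis_info: peer pod name -> (role, connected_slaves) from INFO replication.
    """
    # Priority 1: count sentinel votes per candidate.
    votes = {p: 0 for p in PEERS}
    for reply in sentinel_replies:
        for p in PEERS:
            if p in reply:
                votes[p] += 1
                break
    # max() returns the first maximal element, so ties resolve to the lowest
    # pod index, the same outcome as the shell's chain of -ge comparisons.
    best = max(PEERS, key=lambda p: votes[p])
    if votes[best] > 0:
        return best + SUFFIX

    # Priority 2: a peer that is master AND has at least one replica attached.
    for p in PEERS:
        role, slaves = redis_info.get(p, ("", 0))
        if role == "master" and slaves > 0:
            return p + SUFFIX

    # Priority 3: deterministic fallback to pod -0.
    return PEERS[0] + SUFFIX


if __name__ == "__main__":
    print(pick_master(["redis-v2-1" + SUFFIX, ""], {}))
    print(pick_master([], {"redis-v2-2": ("master", 2)}))
    print(pick_master([], {}))
```

Keeping the decision separate from the `redis-cli` probing makes the tie-break and fallback behaviour easy to unit-test before a rollout.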


@ -14,7 +14,16 @@ variable "name" {}
variable "namespace" {
default = "reverse-proxy"
}
variable "external_name" {}
variable "external_name" {
type = string
default = null
description = "DNS name for ExternalName Service. Mutually exclusive with backend_ip."
}
variable "backend_ip" {
type = string
default = null
description = "IP address backend. When set, creates a selector-less Service + EndpointSlice pointing at this IP. Mutually exclusive with external_name — use for hosts that aren't in Technitium (e.g. upstream gateways)."
}
variable "port" {
default = "80"
}
@ -95,7 +104,14 @@ variable "public_ipv6" {
}
locals {
use_backend_ip = var.backend_ip != null
port_name = var.backend_protocol == "HTTPS" ? "https-${var.name}" : "${var.name}-web"
}
# ExternalName flavor used when the backend is addressable by DNS.
resource "kubernetes_service" "proxied-service" {
count = local.use_backend_ip ? 0 : 1
metadata {
name = var.name
namespace = var.namespace
@ -109,7 +125,7 @@ resource "kubernetes_service" "proxied-service" {
external_name = var.external_name
port {
name = var.backend_protocol == "HTTPS" ? "https-${var.name}" : "${var.name}-web"
name = local.port_name
port = var.port
protocol = "TCP"
target_port = var.port
@ -117,14 +133,73 @@ resource "kubernetes_service" "proxied-service" {
}
}
# IP-backend flavor: selector-less Service + manually-managed EndpointSlice.
# Used for upstreams that have no DNS entry in Technitium (e.g. 192.168.1.1).
resource "kubernetes_service" "ip-backend-service" {
count = local.use_backend_ip ? 1 : 0
metadata {
name = var.name
namespace = var.namespace
labels = {
"app" = var.name
}
}
spec {
type = "ClusterIP"
port {
name = local.port_name
port = var.port
protocol = "TCP"
target_port = var.port
}
}
}
resource "kubernetes_manifest" "ip_backend_endpointslice" {
count = local.use_backend_ip ? 1 : 0
manifest = {
apiVersion = "discovery.k8s.io/v1"
kind = "EndpointSlice"
metadata = {
name = var.name
namespace = var.namespace
labels = {
"kubernetes.io/service-name" = var.name
"app" = var.name
}
}
addressType = "IPv4"
ports = [{
name = local.port_name
port = tonumber(var.port)
protocol = "TCP"
}]
endpoints = [{
addresses = [var.backend_ip]
conditions = {
ready = true
}
}]
}
depends_on = [kubernetes_service.ip-backend-service]
}
locals {
# External monitor defaults: on when proxied, off otherwise. Explicit bool overrides.
effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type == "proxied")
# Emit the annotation when effective is true (positive signal), or when the
# caller explicitly set external_monitor=false (opt-out). When the caller
# leaves it null AND dns_type != "proxied", emit nothing; the sync script's
# default opt-in (any *.viktorbarzin.me ingress) keeps monitoring services
# that are publicly reachable via routes we don't manage here.
external_monitor_annotations = local.effective_external_monitor ? merge(
{ "uptime.viktorbarzin.me/external-monitor" = "true" },
var.external_monitor_name != null ? { "uptime.viktorbarzin.me/external-monitor-name" = var.external_monitor_name } : {},
) : {}
) : (var.external_monitor == false ?
{ "uptime.viktorbarzin.me/external-monitor" = "false" } : {}
)
}
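
The tri-state `external_monitor` logic in the locals block above (explicit true, explicit false, or unset with a `dns_type`-based default) is easy to misread in ternary form. A minimal Python model of the new behaviour follows; the function name and its parameters are hypothetical and only mirror the Terraform variables.

```python
from typing import Dict, Optional

KEY = "uptime.viktorbarzin.me/external-monitor"


def external_monitor_annotations(
    external_monitor: Optional[bool],
    dns_type: str,
    monitor_name: Optional[str] = None,
) -> Dict[str, str]:
    """Model of local.external_monitor_annotations after this change."""
    # An explicit bool wins; otherwise default to on only for proxied hosts.
    effective = external_monitor if external_monitor is not None else (dns_type == "proxied")
    if effective:
        annotations = {KEY: "true"}
        if monitor_name is not None:
            annotations[KEY + "-name"] = monitor_name
        return annotations
    if external_monitor is False:
        # Explicit opt-out: emit "false" so the sync script skips this ingress.
        return {KEY: "false"}
    # Unset and not proxied: emit nothing and let the sync script decide.
    return {}


if __name__ == "__main__":
    print(external_monitor_annotations(None, "proxied"))
    print(external_monitor_annotations(None, "non-proxied"))
    print(external_monitor_annotations(False, "proxied"))
```

The case that changed is the explicit `false`: before this diff it produced no annotation at all, which was indistinguishable from "unset".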
resource "kubernetes_ingress_v1" "proxied-ingress" {


@ -112,13 +112,11 @@ module "idrac" {
depends_on = [kubernetes_namespace.reverse-proxy]
}
# Can either listen on https or http; can't do both :/
# TODO: Not working yet
module "tp-link-gateway" {
source = "./factory"
dns_type = "proxied"
name = "gw"
external_name = "gw.viktorbarzin.lan"
backend_ip = "192.168.1.1"
port = 443
tls_secret_name = var.tls_secret_name
backend_protocol = "HTTPS"
@ -153,25 +151,6 @@ module "truenas" {
depends_on = [kubernetes_namespace.reverse-proxy]
}
# https://r730.viktorbarzin.me/
module "r730" {
source = "./factory"
name = "r730"
external_name = "r730.viktorbarzin.lan"
port = 443
tls_secret_name = var.tls_secret_name
backend_protocol = "HTTPS"
depends_on = [kubernetes_namespace.reverse-proxy]
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "R730"
"gethomepage.dev/description" = "Dell PowerEdge server"
"gethomepage.dev/icon" = "dell.png"
"gethomepage.dev/group" = "Infrastructure"
"gethomepage.dev/pod-selector" = ""
}
}
# https://proxmox.viktorbarzin.me/
module "proxmox" {
source = "./factory"
@ -270,6 +249,7 @@ module "mladost3" {
port = 8080
tls_secret_name = var.tls_secret_name
depends_on = [kubernetes_namespace.reverse-proxy]
external_monitor = false
extra_annotations = { "gethomepage.dev/enabled" = "false" }
}
@ -301,43 +281,101 @@ resource "kubernetes_manifest" "ha_sofia_rate_limit" {
}
}
# Per-service retry bumps default (attempts=2) to 3 so transient DNS/connect
# stalls on the ha-sofia.viktorbarzin.lan ExternalName are absorbed before
# surfacing a 502. Drives bd code-rd1 Phase 2.2.
resource "kubernetes_manifest" "ha_sofia_retry" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "ha-sofia-retry"
namespace = "reverse-proxy"
}
spec = {
retry = {
attempts = 3
initialInterval = "100ms"
}
}
}
}
# Per-service ServersTransport overrides the global 60s dialTimeout
# (set for Immich) with 500ms so a stall fails fast and the retry middleware
# kicks in instead of blocking the connection for seconds. Drives bd
# code-rd1 Phase 2.3.
resource "kubernetes_manifest" "ha_sofia_transport" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "ServersTransport"
metadata = {
name = "ha-sofia-transport"
namespace = "reverse-proxy"
}
spec = {
forwardingTimeouts = {
dialTimeout = "500ms"
responseHeaderTimeout = "30s"
idleConnTimeout = "90s"
}
}
}
}
module "ha-sofia" {
source = "./factory"
dns_type = "non-proxied"
name = "ha-sofia"
external_name = "ha-sofia.viktorbarzin.lan"
port = 8123
tls_secret_name = var.tls_secret_name
depends_on = [kubernetes_namespace.reverse-proxy]
source = "./factory"
dns_type = "non-proxied"
name = "ha-sofia"
external_name = "ha-sofia.viktorbarzin.lan"
port = 8123
tls_secret_name = var.tls_secret_name
# depends_on on the retry/transport manifests avoids a dangling-reference
# window that would 404 ha-sofia traffic (memory 768: 2026-04-17 P0 outage).
depends_on = [
kubernetes_namespace.reverse-proxy,
kubernetes_manifest.ha_sofia_retry,
kubernetes_manifest.ha_sofia_transport,
]
protected = false
skip_global_rate_limit = true
extra_middlewares = [
"reverse-proxy-ha-sofia-rate-limit@kubernetescrd",
"reverse-proxy-ha-sofia-retry@kubernetescrd",
]
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Home Assistant Sofia"
"gethomepage.dev/description" = "Smart home hub"
"gethomepage.dev/icon" = "home-assistant.png"
"gethomepage.dev/group" = "Smart Home"
"gethomepage.dev/pod-selector" = ""
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Home Assistant Sofia"
"gethomepage.dev/description" = "Smart home hub"
"gethomepage.dev/icon" = "home-assistant.png"
"gethomepage.dev/group" = "Smart Home"
"gethomepage.dev/pod-selector" = ""
"traefik.ingress.kubernetes.io/service.serverstransport" = "reverse-proxy-ha-sofia-transport@kubernetescrd"
}
}
# https://music-assistant.viktorbarzin.me/
module "music-assistant" {
source = "./factory"
dns_type = "non-proxied"
name = "music-assistant"
external_name = "ha-sofia.viktorbarzin.lan"
port = 8095
tls_secret_name = var.tls_secret_name
depends_on = [kubernetes_namespace.reverse-proxy]
source = "./factory"
dns_type = "non-proxied"
name = "music-assistant"
external_name = "ha-sofia.viktorbarzin.lan"
port = 8095
tls_secret_name = var.tls_secret_name
depends_on = [
kubernetes_namespace.reverse-proxy,
kubernetes_manifest.ha_sofia_retry,
kubernetes_manifest.ha_sofia_transport,
]
protected = false
skip_global_rate_limit = true
extra_middlewares = [
"reverse-proxy-ha-sofia-rate-limit@kubernetescrd",
"reverse-proxy-ha-sofia-retry@kubernetescrd",
]
extra_annotations = {
"traefik.ingress.kubernetes.io/service.serverstransport" = "reverse-proxy-ha-sofia-transport@kubernetescrd"
}
}
# https://ha-london.viktorbarzin.me/


@ -86,6 +86,15 @@ module "qbittorrent" {
homepage_credentials = local.homepage_credentials
}
module "mam_farming" {
source = "./mam-farming"
namespace = kubernetes_namespace.servarr.metadata[0].name
depends_on = [
kubernetes_manifest.external_secret,
module.qbittorrent,
]
}
module "flaresolverr" {
source = "./flaresolverr"
tls_secret_name = var.tls_secret_name


@ -0,0 +1,163 @@
"""
MAM bonus-point spender: tier-aware, pay-what-we-owe.
MAM's bonusBuy.php API enforces a hard 50 GiB minimum per purchase
("Automated spenders are limited to buying at least 50 GB... due to log
spam"). Valid API tiers are 50, 100, 200, 500 GiB (@ 500 BP/GiB). That
means the "pay exactly what we owe" approach from the recovery plan
rounds UP to 50 GiB for the first purchase; small buys can only be done
via the web UI, not the API.
Logic: pick the smallest valid tier that both (a) satisfies the ratio
deficit and (b) we can afford without burning the BP reserve. Skip if
nothing fits; the cron will retry in 6 h once BP grows.
"""
import math
import os
import sys
import tempfile
import time
import requests
PUSHGW = "http://prometheus-prometheus-pushgateway.monitoring:9091"
COOKIE_FILE = "/data/mam_id"
TARGET_RATIO = float(os.environ.get("TARGET_RATIO", "2.0"))
RESERVE_BP = int(os.environ.get("RESERVE_BP", "500"))
BP_PER_GB = int(os.environ.get("BP_PER_GB", "500"))
# MAM-enforced minimum purchase for API callers: 50 GiB.
API_TIERS_GIB = (50, 100, 200, 500)
CLASS_CODES = {
"Mouse": 0,
"Vole": 1,
"User": 2,
"Power User": 3,
"Elite": 4,
"Torrent Master": 5,
"Power TM": 6,
"Elite TM": 7,
"VIP": 8,
}
def save_cookie(resp):
for c in resp.cookies:
if c.name == "mam_id":
fd, tmp = tempfile.mkstemp(dir="/data")
os.write(fd, c.value.encode())
os.close(fd)
os.rename(tmp, COOKIE_FILE)
return
def push(metrics):
try:
requests.post(
f"{PUSHGW}/metrics/job/mam-bp-spender", data=metrics, timeout=10
)
except Exception as e:
print(f"pushgateway error: {e}", file=sys.stderr)
def load_cookie():
if os.path.exists(COOKIE_FILE):
return open(COOKIE_FILE).read().strip()
return os.environ.get("MAM_ID", "")
def main():
mam_id = load_cookie()
if not mam_id:
print("No mam_id available", file=sys.stderr)
sys.exit(1)
s = requests.Session()
s.cookies.set("mam_id", mam_id, domain=".myanonamouse.net")
r = s.get("https://www.myanonamouse.net/jsonLoad.php", timeout=15)
if r.status_code != 200:
push("mam_farming_cookie_expired 1\n")
print(f"Cookie expired: {r.status_code}", file=sys.stderr)
sys.exit(1)
save_cookie(r)
profile = r.json()
ratio = float(profile.get("ratio", 0) or 0)
classname = profile.get("classname", "Mouse")
class_code = CLASS_CODES.get(classname, 0)
# MAM returns `downloaded`/`uploaded` as pretty strings ("715.55 MiB");
# `*_bytes` are the authoritative integer fields.
downloaded = int(profile.get("downloaded_bytes", 0) or 0)
uploaded = int(profile.get("uploaded_bytes", 0) or 0)
bp = int(float(profile.get("seedbonus", 0) or 0))
deficit_bytes = max(0, int(downloaded * TARGET_RATIO) - uploaded)
needed_gib = math.ceil(deficit_bytes / (1024**3)) + 1 if deficit_bytes > 0 else 0
affordable_gib = max(0, (bp - RESERVE_BP) // BP_PER_GB)
# Pick the smallest API tier that satisfies the deficit AND fits the
# budget. If even the smallest tier is too expensive, skip — the cron
# will retry in 6 h once BP has grown.
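    buy_gib = 0
    # Only buy when there is a deficit to cover; with needed_gib == 0 the
    # smallest tier would otherwise always satisfy `tier >= needed_gib`.
    if needed_gib > 0:
        for tier in API_TIERS_GIB:
            if tier >= needed_gib and tier <= affordable_gib:
                buy_gib = tier
                break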
if buy_gib == 0 and needed_gib > 0 and affordable_gib >= API_TIERS_GIB[0]:
# Deficit exceeds all tiers we can afford — buy the largest
# tier that fits to make progress.
for tier in reversed(API_TIERS_GIB):
if tier <= affordable_gib:
buy_gib = tier
break
print(
f"Profile: ratio={ratio} class={classname} "
f"DL={downloaded / 1024**3:.2f} GiB UL={uploaded / 1024**3:.2f} GiB "
f"BP={bp} | deficit={deficit_bytes / 1024**3:.2f} GiB "
f"needed={needed_gib} affordable={affordable_gib} buy={buy_gib}"
)
spent_gib = 0
if buy_gib >= API_TIERS_GIB[0]:
time.sleep(3)
url = (
"https://www.myanonamouse.net/json/bonusBuy.php"
f"?spendtype=upload&amount={buy_gib}"
)
r2 = s.get(url, timeout=15)
save_cookie(r2)
try:
body = r2.json()
except ValueError:
body = {}
ok = r2.status_code == 200 and body.get("success") is True
print(
f"Buy {buy_gib} GiB -> {r2.status_code} "
f"success={body.get('success')} {r2.text[:160]}"
)
if ok:
spent_gib = buy_gib
metrics = (
"mam_farming_cookie_expired 0\n"
f"mam_ratio {ratio}\n"
f'mam_class_code{{classname="{classname}"}} {class_code}\n'
f"mam_downloaded_bytes {downloaded}\n"
f"mam_uploaded_bytes {uploaded}\n"
f"mam_bp_balance {bp}\n"
f"mam_bp_spent_gb {spent_gib}\n"
f"mam_bp_needed_gib {needed_gib}\n"
f"mam_bp_affordable_gib {affordable_gib}\n"
)
push(metrics)
print(
f"Done: BP={bp}, spent={spent_gib} GiB (needed={needed_gib}, "
f"affordable={affordable_gib})"
)
if __name__ == "__main__":
main()
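
The tier selection in the spender can be read as a small pure function. The sketch below restates it under the script's default values (TARGET_RATIO 2.0, RESERVE_BP 500, BP_PER_GB 500); `pick_tier` is a hypothetical helper name, not part of the script, and it assumes that a zero deficit means no purchase at all.

```python
import math

API_TIERS_GIB = (50, 100, 200, 500)  # API tiers named in the spender docstring
GIB = 1024**3


def pick_tier(downloaded, uploaded, bp, target_ratio=2.0, reserve_bp=500, bp_per_gib=500):
    """Return (needed_gib, affordable_gib, buy_gib) for one run.

    Hypothetical helper: mirrors the arithmetic in the script, and assumes
    that nothing is bought when there is no deficit.
    """
    deficit = max(0, int(downloaded * target_ratio) - uploaded)
    needed = math.ceil(deficit / GIB) + 1 if deficit > 0 else 0
    affordable = max(0, (bp - reserve_bp) // bp_per_gib)
    if needed == 0:
        return needed, affordable, 0
    # Tiers that fit the budget once the reserve is set aside.
    fitting = [t for t in API_TIERS_GIB if t <= affordable]
    if not fitting:
        return needed, affordable, 0
    # Prefer the smallest tier that covers the deficit; otherwise take the
    # largest affordable tier to make progress.
    covering = [t for t in fitting if t >= needed]
    return needed, affordable, (covering[0] if covering else fitting[-1])
```

With 10 GiB downloaded, nothing uploaded and 60000 BP, the deficit is 20 GiB, so 21 GiB is needed and the 50 GiB tier is chosen; with only 20000 BP no tier fits and the run buys nothing.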


@@ -0,0 +1,264 @@
"""
MAM freeleech grabber: demand-first, ratio-guarded.

Selects small-but-popular freeleech titles to grow the account's upload
credit. Refuses to grab while the account is in Mouse class or ratio is
below 1.2, because MAM rejects peer-list announces under those conditions
and new grabs only deepen the ratio hole.

Cleanup is handled by `mam-farming-janitor.py`, which runs unconditionally.
"""
import json
import math
import os
import random
import sys
import tempfile
import time
import requests
QB_URL = "http://qbittorrent.servarr.svc.cluster.local"
PUSHGW = "http://prometheus-prometheus-pushgateway.monitoring:9091"
COOKIE_FILE = "/data/mam_id"
GRABBED_IDS_FILE = "/data/grabbed_ids.txt"
MIN_MB = int(os.environ.get("MIN_MB", "50"))
MAX_MB = int(os.environ.get("MAX_MB", "1024"))
LEECHER_FLOOR = int(os.environ.get("LEECHER_FLOOR", "1"))
SEEDER_CEILING = int(os.environ.get("SEEDER_CEILING", "50"))
GRAB_PER_RUN = int(os.environ.get("GRAB_PER_RUN", "5"))
MAX_TORRENTS = int(os.environ.get("MAX_TORRENTS", "500"))
RATIO_FLOOR = float(os.environ.get("RATIO_FLOOR", "1.2"))
REQUEST_SLEEP = float(os.environ.get("REQUEST_SLEEP", "3"))
CLASS_CODES = {
"Mouse": 0,
"Vole": 1,
"User": 2,
"Power User": 3,
"Elite": 4,
"Torrent Master": 5,
"Power TM": 6,
"Elite TM": 7,
"VIP": 8,
}
def parse_size(s):
units = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
parts = s.split()
if len(parts) != 2:
return 0
return int(float(parts[0]) * units.get(parts[1], 1))
def save_cookie(resp):
for c in resp.cookies:
if c.name == "mam_id":
fd, tmp = tempfile.mkstemp(dir="/data")
os.write(fd, c.value.encode())
os.close(fd)
os.rename(tmp, COOKIE_FILE)
return
def push(metrics):
try:
requests.post(
f"{PUSHGW}/metrics/job/mam-freeleech-grabber", data=metrics, timeout=10
)
except Exception as e:
print(f"pushgateway error: {e}", file=sys.stderr)
def load_cookie():
if os.path.exists(COOKIE_FILE):
return open(COOKIE_FILE).read().strip()
return os.environ.get("MAM_ID", "")
def exit_cookie_expired(status):
push("mam_farming_cookie_expired 1\n")
print(f"Cookie expired: {status}", file=sys.stderr)
sys.exit(1)
def main():
mam_id = load_cookie()
if not mam_id:
print("No mam_id available", file=sys.stderr)
sys.exit(1)
s = requests.Session()
s.cookies.set("mam_id", mam_id, domain=".myanonamouse.net")
r = s.get("https://www.myanonamouse.net/jsonLoad.php", timeout=15)
if r.status_code != 200:
exit_cookie_expired(r.status_code)
save_cookie(r)
profile = r.json()
ratio = float(profile.get("ratio", 0) or 0)
classname = profile.get("classname", "Mouse")
# `*_bytes` are authoritative integers; `downloaded`/`uploaded` are
# pretty strings like "715.55 MiB".
downloaded = int(profile.get("downloaded_bytes", 0) or 0)
uploaded = int(profile.get("uploaded_bytes", 0) or 0)
class_code = CLASS_CODES.get(classname, 0)
profile_metrics = (
f"mam_farming_cookie_expired 0\n"
f"mam_ratio {ratio}\n"
f'mam_class_code{{classname="{classname}"}} {class_code}\n'
f"mam_downloaded_bytes {downloaded}\n"
f"mam_uploaded_bytes {uploaded}\n"
)
if ratio < RATIO_FLOOR or classname == "Mouse":
reason = "mouse_class" if classname == "Mouse" else "low_ratio"
print(
f"Skip grab: ratio={ratio} class={classname} (floor={RATIO_FLOOR}) "
f"reason={reason}"
)
push(
profile_metrics
+ f'mam_grabber_skipped_reason{{reason="{reason}"}} 1\n'
+ f"mam_farming_grabbed 0\n"
)
return
time.sleep(REQUEST_SLEEP)
r = s.get("https://t.myanonamouse.net/json/dynamicSeedbox.php", timeout=15)
save_cookie(r)
print(f"Seedbox: {r.text[:80]}")
grabbed_ids = set()
if os.path.exists(GRABBED_IDS_FILE):
raw = open(GRABBED_IDS_FILE).read().strip()
grabbed_ids = set(raw.split("\n")) if raw else set()
try:
all_torrents = requests.get(
f"{QB_URL}/api/v2/torrents/info", timeout=10
).json()
except Exception as e:
print(f"qBittorrent unreachable: {e}", file=sys.stderr)
push(profile_metrics + "mam_farming_grabbed 0\n")
sys.exit(1)
farming = [t for t in all_torrents if t.get("category") == "mam-farming"]
all_names_lower = {t["name"].lower() for t in all_torrents}
total_size = sum(t.get("size", 0) for t in farming)
print(
f"Profile: ratio={ratio} class={classname} | "
f"Farming: {len(farming)}, {total_size / (1024**3):.1f} GiB, "
f"tracked IDs: {len(grabbed_ids)}"
)
grabbed = 0
if len(farming) >= MAX_TORRENTS:
print(f"At max torrents ({MAX_TORRENTS}), skipping grab")
else:
time.sleep(REQUEST_SLEEP)
offset = random.randint(0, 1400)
params = {
"tor[searchType]": "fl",
"tor[searchIn]": "torrents",
"tor[perpage]": "50",
"tor[startNumber]": str(offset),
}
r = s.get(
"https://www.myanonamouse.net/tor/js/loadSearchJSONbasic.php",
params=params,
timeout=15,
)
save_cookie(r)
data = r.json()
results = data.get("data", []) or []
print(
f"Search offset={offset}, found={data.get('found', 0)}, "
f"page_results={len(results)}"
)
candidates = []
for t in results:
tid = str(t.get("id", ""))
if tid in grabbed_ids:
continue
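            title = t.get("title", "")
            # Guard against an empty title: "" is a substring of every name,
            # which would mark the torrent as already present in qBittorrent.
            if title and any(title.lower() in n for n in all_names_lower):
                grabbed_ids.add(tid)
                continue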
size = parse_size(t.get("size", "0 B"))
if size < MIN_MB * 1024**2 or size > MAX_MB * 1024**2:
continue
seeders = int(t.get("seeders", 999) or 999)
leechers = int(t.get("leechers", 0) or 0)
if leechers < LEECHER_FLOOR:
continue
if seeders > SEEDER_CEILING:
continue
wedge_bonus = (
200 if (t.get("free") == 1 or t.get("personal_freeleech") == 1) else 0
)
score = leechers * 3 - seeders * 0.5 + wedge_bonus
candidates.append((score, t))
candidates.sort(key=lambda x: -x[0])
for score, t in candidates[:GRAB_PER_RUN]:
time.sleep(REQUEST_SLEEP)
tid = t["id"]
r = s.get(
f"https://www.myanonamouse.net/tor/download.php?tid={tid}", timeout=15
)
save_cookie(r)
if not r.content.startswith(b"d"):
print(f"Bad torrent body for tid={tid}")
grabbed_ids.add(str(tid))
continue
add_resp = requests.post(
f"{QB_URL}/api/v2/torrents/add",
files={
"torrents": (
f"{tid}.torrent",
r.content,
"application/x-bittorrent",
)
},
data={
"savepath": "/downloads/mam-farming",
"category": "mam-farming",
"tags": "mam,freeleech",
},
timeout=20,
)
ok = add_resp.status_code == 200 and add_resp.text.strip() != "Fails."
print(
f"{'Added' if ok else 'FAILED'} (score={score:.1f}): "
f"{t['title'][:60]} ({t['size']}, S:{t.get('seeders')} "
f"L:{t.get('leechers')}) -> {add_resp.status_code}"
)
grabbed_ids.add(str(tid))
if ok:
grabbed += 1
fd, tmp = tempfile.mkstemp(dir="/data")
os.write(fd, "\n".join(grabbed_ids).encode())
os.close(fd)
os.rename(tmp, GRABBED_IDS_FILE)
metrics = (
profile_metrics
+ f"mam_farming_grabbed {grabbed}\n"
+ f"mam_farming_total_seeding {len(farming) + grabbed}\n"
+ f"mam_farming_size_bytes {total_size}\n"
)
push(metrics)
print(f"Done: grabbed={grabbed}")
if __name__ == "__main__":
main()
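
The grabber's filter-and-score step can be exercised without touching MAM or qBittorrent. The following is a minimal sketch that copies the size, seeder and leecher rules and the score formula from the script; `rank_candidates` is a hypothetical name, and the defaults are the script's environment-variable defaults.

```python
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_size(text):
    """Convert a pretty size such as '715.55 MiB' to bytes (0 if malformed)."""
    parts = text.split()
    if len(parts) != 2:
        return 0
    return int(float(parts[0]) * UNITS.get(parts[1], 1))


def rank_candidates(results, min_mb=50, max_mb=1024, leecher_floor=1,
                    seeder_ceiling=50, limit=5):
    """Return up to `limit` (score, torrent) pairs, highest score first."""
    scored = []
    for t in results:
        size = parse_size(t.get("size", "0 B"))
        if not (min_mb * 1024**2 <= size <= max_mb * 1024**2):
            continue
        seeders = int(t.get("seeders", 999) or 999)
        leechers = int(t.get("leechers", 0) or 0)
        if leechers < leecher_floor or seeders > seeder_ceiling:
            continue
        # Same weighting as the script: demand counts for more than supply,
        # and freeleech flags add a fixed bonus.
        bonus = 200 if (t.get("free") == 1 or t.get("personal_freeleech") == 1) else 0
        scored.append((leechers * 3 - seeders * 0.5 + bonus, t))
    scored.sort(key=lambda pair: -pair[0])
    return scored[:limit]
```

Keeping the scoring separate from the HTTP calls makes it possible to tune MIN_MB, SEEDER_CEILING and the weights against a saved search response before changing the CronJob.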


@@ -0,0 +1,177 @@
"""
MAM farming janitor: H&R-aware cleanup.

Runs every 15 minutes independently of the grabber's ratio guard: stuck
torrents accumulate fastest precisely when the grabber is skipping. Never
deletes a torrent that's inside MAM's 72-hour Hit-and-Run window.

Set DRY_RUN=1 to log candidates without deleting (used for the first
24 hours after rollout to sanity-check the rules against live state).
"""
import json
import os
import sys
import time
import requests
QB_URL = "http://qbittorrent.servarr.svc.cluster.local"
PUSHGW = "http://prometheus-prometheus-pushgateway.monitoring:9091"
DRY_RUN = os.environ.get("DRY_RUN", "0") == "1"
HNR_SEED_SECONDS = int(os.environ.get("HNR_SEED_SECONDS", str(72 * 3600)))
NEVER_STARTED_AGE = int(os.environ.get("NEVER_STARTED_AGE", str(24 * 3600)))
STALLED_AGE = int(os.environ.get("STALLED_AGE", str(3 * 86400)))
SATISFIED_SEED_AGE = int(os.environ.get("SATISFIED_SEED_AGE", str(3 * 86400)))
SATISFIED_SEEDER_FLOOR = int(os.environ.get("SATISFIED_SEEDER_FLOOR", "5"))
GRACEFUL_SEED_AGE = int(os.environ.get("GRACEFUL_SEED_AGE", str(14 * 86400)))
ZERO_DEMAND_AGE = int(os.environ.get("ZERO_DEMAND_AGE", str(7 * 86400)))
UNREG_KEYWORDS = ("unregistered", "torrent not found", "info hash not authorized")
REASONS = (
"never_started",
"stalled_old",
"satisfied_redundant",
"graceful_retire",
"zero_demand",
"unregistered",
)
def classify(t, now, tracker_msg):
age = now - int(t.get("added_on", 0) or 0)
progress = float(t.get("progress", 0) or 0)
downloaded = int(t.get("downloaded", 0) or 0)
uploaded = int(t.get("uploaded", 0) or 0)
seed_time = int(t.get("seeding_time", 0) or 0)
state = t.get("state", "")
num_complete = int(t.get("num_complete", 0) or 0)
if tracker_msg and any(k in tracker_msg.lower() for k in UNREG_KEYWORDS):
return "unregistered"
if progress < 1.0:
if age > NEVER_STARTED_AGE and downloaded == 0:
return "never_started"
if state == "stalledDL" and age > STALLED_AGE:
return "stalled_old"
return None
if seed_time < HNR_SEED_SECONDS:
return "hnr_window"
if seed_time > GRACEFUL_SEED_AGE:
return "graceful_retire"
if (
seed_time >= HNR_SEED_SECONDS
and uploaded == 0
and age > ZERO_DEMAND_AGE
):
return "zero_demand"
if seed_time > SATISFIED_SEED_AGE and num_complete > SATISFIED_SEEDER_FLOOR:
return "satisfied_redundant"
return None
def fetch_tracker_msg(hash_):
try:
resp = requests.get(
f"{QB_URL}/api/v2/torrents/trackers",
params={"hash": hash_},
timeout=10,
)
trackers = resp.json() or []
except Exception:
return ""
for tr in trackers:
url = tr.get("url", "")
if url.startswith("** ["):
continue
msg = tr.get("msg", "")
if msg:
return msg
return ""
def push(metrics):
try:
requests.post(
f"{PUSHGW}/metrics/job/mam-farming-janitor", data=metrics, timeout=10
)
except Exception as e:
print(f"pushgateway error: {e}", file=sys.stderr)
def main():
try:
all_torrents = requests.get(
f"{QB_URL}/api/v2/torrents/info", timeout=15
).json()
except Exception as e:
print(f"qBittorrent unreachable: {e}", file=sys.stderr)
sys.exit(1)
farming = [t for t in all_torrents if t.get("category") == "mam-farming"]
now = int(time.time())
deleted = {r: 0 for r in REASONS}
preserved_hnr = 0
skipped_active = 0
delete_hashes = []
# Only inspect tracker msg on torrents with a peer problem — avoids
# hundreds of extra API calls when things are healthy.
for t in farming:
state = t.get("state", "")
progress = float(t.get("progress", 0) or 0)
tracker_msg = ""
if progress < 1.0 and state in ("stalledDL", "metaDL", "missingFiles"):
tracker_msg = fetch_tracker_msg(t["hash"])
verdict = classify(t, now, tracker_msg)
if verdict is None:
skipped_active += 1
elif verdict == "hnr_window":
preserved_hnr += 1
else:
deleted[verdict] += 1
delete_hashes.append((t["hash"], verdict, t.get("name", "")[:60]))
    for hash_, reason, name in delete_hashes:
        if DRY_RUN:
            print(f"[DRY_RUN] would delete ({reason}): {name}")
            continue
        try:
            resp = requests.post(
                f"{QB_URL}/api/v2/torrents/delete",
                data={"hashes": hash_, "deleteFiles": "true"},
                timeout=20,
            )
            # requests does not raise on 4xx/5xx by itself; without this a
            # rejected delete would still be logged as "Deleted".
            resp.raise_for_status()
            print(f"Deleted ({reason}): {name}")
        except Exception as e:
            print(f"Delete failed for {name}: {e}", file=sys.stderr)
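    # Send everything in a single request. A Pushgateway POST replaces all
    # series that share a metric name within the job's group, so pushing one
    # reason at a time would leave only the last reason's sample in place.
    payload = "".join(
        f'mam_janitor_deleted_per_run{{reason="{reason}"}} '
        f"{deleted[reason] if not DRY_RUN else 0}\n"
        for reason in REASONS
    )
    payload += "".join(
        f'mam_janitor_dry_run_candidates{{reason="{reason}"}} '
        f"{deleted[reason] if DRY_RUN else 0}\n"
        for reason in REASONS
    )
    payload += (
        f"mam_janitor_preserved_hnr {preserved_hnr}\n"
        f"mam_janitor_skipped_active {skipped_active}\n"
        f"mam_janitor_dry_run {1 if DRY_RUN else 0}\n"
        f"mam_janitor_last_run_timestamp {now}\n"
    )
    push(payload)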
total = sum(deleted.values())
print(
f"Done: deleted={total} preserved_hnr={preserved_hnr} "
f"skipped_active={skipped_active} dry_run={DRY_RUN}"
)
print(f" per reason: {deleted}")
if __name__ == "__main__":
main()
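
The janitor's metrics can be rendered as one text payload and checked offline. The sketch below is an illustration only: `render_janitor_metrics` is a hypothetical helper, and it groups all samples of a metric name together because a Pushgateway POST replaces every existing series with the same metric name in the target group.

```python
REASONS = (
    "never_started",
    "stalled_old",
    "satisfied_redundant",
    "graceful_retire",
    "zero_demand",
    "unregistered",
)


def render_janitor_metrics(deleted, preserved_hnr, skipped_active, dry_run, now):
    """Build one exposition-format payload for a single push.

    `deleted` maps reason -> count; reasons that are missing count as 0.
    """
    lines = []
    # In dry-run mode the counts are reported as candidates, not deletions.
    for reason in REASONS:
        n = 0 if dry_run else deleted.get(reason, 0)
        lines.append(f'mam_janitor_deleted_per_run{{reason="{reason}"}} {n}')
    for reason in REASONS:
        n = deleted.get(reason, 0) if dry_run else 0
        lines.append(f'mam_janitor_dry_run_candidates{{reason="{reason}"}} {n}')
    lines.append(f"mam_janitor_preserved_hnr {preserved_hnr}")
    lines.append(f"mam_janitor_skipped_active {skipped_active}")
    lines.append(f"mam_janitor_dry_run {1 if dry_run else 0}")
    lines.append(f"mam_janitor_last_run_timestamp {now}")
    return "\n".join(lines) + "\n"
```

The returned string can be passed unchanged as the `data` argument of the script's `push()` function.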


@ -0,0 +1,281 @@
variable "namespace" {
type = string
default = "servarr"
}
locals {
python_image = "docker.io/library/python:3.12-alpine"
pip_prefix = "pip install -q requests > /dev/null 2>&1; python3 /tmp/script.py"
data_pvc = "mam-farming-data-proxmox"
# Dry-run window was satisfied by a one-shot test on 2026-04-19 that
# produced 466 `never_started` candidates and 0 matches in any other
# reason bucket, consistent with Phase B's expected 495 stuck torrents.
# Enforcing from here on.
janitor_dry_run = "0"
}
# ------------------------------- PVC -------------------------------
# Shared scratch volume for cookie + grabbed-ID dedup list. The existing
# in-cluster PVC (kubectl-applied 2026-04-14) is adopted via an `import {}`
# block declared in the root module (servarr/main.tf); Terraform 1.5+
# rejects imports inside child modules.
resource "kubernetes_persistent_volume_claim" "mam_data" {
wait_until_bound = false
metadata {
name = local.data_pvc
namespace = var.namespace
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm"
resources {
requests = {
storage = "1Gi"
}
}
}
}
# --------------------------- Grabber ---------------------------------
# Every 30 minutes: skip while ratio < 1.2 or class == Mouse; otherwise
# grab up to 5 small-but-popular freeleech torrents. Existing ConfigMap
# + CronJob are adopted via imports in the parent stack.
resource "kubernetes_config_map" "grabber_script" {
metadata {
name = "mam-freeleech-grabber-script"
namespace = var.namespace
}
data = {
"script.py" = file("${path.module}/files/freeleech-grabber.py")
}
}
resource "kubernetes_cron_job_v1" "grabber" {
metadata {
name = "mam-freeleech-grabber"
namespace = var.namespace
}
spec {
schedule = "*/30 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 2
ttl_seconds_after_finished = 300
template {
metadata {}
spec {
restart_policy = "Never"
container {
name = "freeleech-grabber"
image = local.python_image
command = ["/bin/sh", "-c", local.pip_prefix]
env {
name = "MAM_ID"
value_from {
secret_key_ref {
name = "servarr-secrets"
key = "mam_id"
}
}
}
resources {
requests = { memory = "64Mi", cpu = "10m" }
limits = { memory = "128Mi" }
}
volume_mount {
name = "script"
mount_path = "/tmp/script.py"
sub_path = "script.py"
}
volume_mount {
name = "data"
mount_path = "/data"
}
}
volume {
name = "script"
config_map {
name = kubernetes_config_map.grabber_script.metadata[0].name
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.mam_data.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# --------------------------- BP Spender ------------------------------
# Every 6 hours: compute the upload deficit against TARGET_RATIO (+1 GiB
# margin) and buy the smallest API tier (50/100/200/500 GiB) that covers it
# without dipping into the BP reserve. Existing ConfigMap + CronJob are
# adopted via imports in the parent stack.
resource "kubernetes_config_map" "bp_spender_script" {
metadata {
name = "mam-bp-spender-script"
namespace = var.namespace
}
data = {
"script.py" = file("${path.module}/files/bp-spender.py")
}
}
resource "kubernetes_cron_job_v1" "bp_spender" {
metadata {
name = "mam-bp-spender"
namespace = var.namespace
}
spec {
schedule = "0 */6 * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 2
ttl_seconds_after_finished = 300
template {
metadata {}
spec {
restart_policy = "Never"
container {
name = "bp-spender"
image = local.python_image
command = ["/bin/sh", "-c", local.pip_prefix]
env {
name = "MAM_ID"
value_from {
secret_key_ref {
name = "servarr-secrets"
key = "mam_id"
}
}
}
resources {
requests = { memory = "64Mi", cpu = "10m" }
limits = { memory = "128Mi" }
}
volume_mount {
name = "script"
mount_path = "/tmp/script.py"
sub_path = "script.py"
}
volume_mount {
name = "data"
mount_path = "/data"
}
}
volume {
name = "script"
config_map {
name = kubernetes_config_map.bp_spender_script.metadata[0].name
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.mam_data.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# ----------------------------- Janitor -------------------------------
# New: every 15 minutes, independent of grabber ratio guard. Deletes
# stuck/unregistered/redundant torrents in category=mam-farming while
# preserving torrents inside the 72h H&R window.
resource "kubernetes_config_map" "janitor_script" {
metadata {
name = "mam-farming-janitor-script"
namespace = var.namespace
}
data = {
"script.py" = file("${path.module}/files/mam-farming-janitor.py")
}
}
resource "kubernetes_cron_job_v1" "janitor" {
metadata {
name = "mam-farming-janitor"
namespace = var.namespace
}
spec {
schedule = "*/15 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 2
ttl_seconds_after_finished = 300
template {
metadata {}
spec {
restart_policy = "Never"
container {
name = "farming-janitor"
image = local.python_image
command = ["/bin/sh", "-c", local.pip_prefix]
env {
name = "DRY_RUN"
value = local.janitor_dry_run
}
resources {
requests = { memory = "64Mi", cpu = "10m" }
limits = { memory = "128Mi" }
}
volume_mount {
name = "script"
mount_path = "/tmp/script.py"
sub_path = "script.py"
}
}
volume {
name = "script"
config_map {
name = kubernetes_config_map.janitor_script.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}


@@ -79,11 +79,11 @@ resource "kubernetes_deployment" "qbittorrent" {
 }
 spec {
 container {
-image = "lscr.io/linuxserver/qbittorrent:5.0.4"
+image = "lscr.io/linuxserver/qbittorrent:5.1.4"
 name = "qbittorrent"
 port {
-container_port = 8787
+container_port = 8080
 }
 env {
 name = "PUID"
@@ -113,6 +113,15 @@ resource "kubernetes_deployment" "qbittorrent" {
 name = "audiobooks"
 mount_path = "/audiobooks"
 }
+resources {
+requests = {
+memory = "512Mi"
+cpu = "50m"
+}
+limits = {
+memory = "1Gi"
+}
+}
 }
 volume {
 name = "data"
@@ -289,21 +298,26 @@ tracker_stats = defaultdict(lambda: {
 })
 for t in torrents:
+category = (t.get("category") or "").lower()
 tracker_url = t.get("tracker", "")
-if not tracker_url:
-domain = "unknown"
-else:
+domain = ""
+if tracker_url:
 try:
-domain = urlparse(tracker_url).hostname or "unknown"
+domain = (urlparse(tracker_url).hostname or "").lower()
 except Exception:
-domain = "unknown"
+domain = ""
-if "myanonamouse" in domain or "mam" in domain.lower():
+# Category is the only signal for queuedDL torrents whose announces
+# haven't happened yet (tracker field is empty). Map those first so
+# hundreds of MAM torrents don't collect under "unknown".
+if category == "mam-farming" or "myanonamouse" in domain or "mam" in domain:
 label = "mam"
-elif "audiobookbay" in domain or "abb" in domain.lower():
+elif category.startswith("abb") or "audiobookbay" in domain or "abb" in domain:
 label = "audiobookbay"
-else:
+elif domain:
 label = domain.replace(".", "_")
+else:
+label = "unknown"
 s = tracker_stats[label]
 s["uploaded"] += t.get("uploaded", 0)
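
The category-first label mapping in this hunk can be pulled out into a standalone function for testing. This is a sketch under that assumption; `tracker_label` is a hypothetical name and the URLs in the usage note are illustrative.

```python
from urllib.parse import urlparse


def tracker_label(category, tracker_url):
    """Map a torrent's category and tracker URL to a metrics label."""
    category = (category or "").lower()
    domain = ""
    if tracker_url:
        try:
            domain = (urlparse(tracker_url).hostname or "").lower()
        except Exception:
            domain = ""
    # Category wins: queued torrents have no tracker URL until the first
    # announce, so the hostname alone would file them under "unknown".
    if category == "mam-farming" or "myanonamouse" in domain or "mam" in domain:
        return "mam"
    if category.startswith("abb") or "audiobookbay" in domain or "abb" in domain:
        return "audiobookbay"
    if domain:
        return domain.replace(".", "_")
    return "unknown"
```

For example, `tracker_label("mam-farming", "")` yields `"mam"` even though the tracker field is empty. Note that the substring checks are deliberately loose, so any hostname containing `mam` or `abb` is folded into those two labels.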


@ -0,0 +1,69 @@
# =============================================================================
# CoreDNS Scaling, Anti-Affinity, PDB
# =============================================================================
#
# CoreDNS is kube-system / kubeadm-managed. We only patch replicas + affinity
# here (the Corefile ConfigMap is in main.tf). The hashicorp/kubernetes v3
# provider removed the *_patch resource family from v2, so we apply the
# desired state via `kubectl patch` inside a null_resource. The patch is
# idempotent: a no-op when the deployment already matches.
#
# Kubeadm upgrades preserve the replica count on the existing deployment but
# reset the pod template (including affinity) from the ClusterConfiguration.
# Re-running `terraform apply` re-asserts the affinity patch; the readiness
# gate in `readiness.tf` catches regressions if the patch is reverted.
resource "null_resource" "coredns_scale_and_affinity" {
triggers = {
replicas = 3
spec_hash = sha256(file("${path.module}/coredns.tf"))
}
provisioner "local-exec" {
command = <<-BASH
set -euo pipefail
# 1. Scale to 3 replicas.
kubectl -n kube-system scale deploy/coredns --replicas=3
# 2. Switch anti-affinity from preferred to required on hostname.
kubectl -n kube-system patch deploy/coredns --type=json -p='[
{
"op": "replace",
"path": "/spec/template/spec/affinity/podAntiAffinity",
"value": {
"requiredDuringSchedulingIgnoredDuringExecution": [
{
"labelSelector": {
"matchExpressions": [
{"key": "k8s-app", "operator": "In", "values": ["kube-dns"]}
]
},
"topologyKey": "kubernetes.io/hostname"
}
]
}
}
]' || true
# 3. Wait for rollout to settle.
kubectl -n kube-system rollout status deploy/coredns --timeout=120s
BASH
interpreter = ["/bin/bash", "-c"]
}
}
# PDB: keep at least 2 CoreDNS pods running during voluntary disruptions.
resource "kubernetes_pod_disruption_budget_v1" "coredns" {
metadata {
name = "coredns"
namespace = "kube-system"
}
spec {
min_available = "2"
selector {
match_labels = {
"k8s-app" = "kube-dns"
}
}
}
}
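
The JSON Patch passed to `kubectl patch` above can also be generated rather than hand-written, which avoids quoting mistakes inside the shell heredoc. The sketch below only builds and serializes the document; `anti_affinity_patch` is a hypothetical helper. One caveat worth keeping in mind: under RFC 6902 a `replace` operation fails when the target path does not exist, and the `|| true` in the provisioner would hide that failure.

```python
import json


def anti_affinity_patch(label_key="k8s-app", label_value="kube-dns"):
    """Return a JSON Patch (list of operations) that requires one pod per host."""
    return [
        {
            "op": "replace",
            "path": "/spec/template/spec/affinity/podAntiAffinity",
            "value": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {
                            "matchExpressions": [
                                {
                                    "key": label_key,
                                    "operator": "In",
                                    "values": [label_value],
                                }
                            ]
                        },
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            },
        }
    ]


# Serialized form, suitable for the -p argument of `kubectl patch --type=json`.
patch_json = json.dumps(anti_affinity_patch())
```

With three replicas and a required hostname anti-affinity, the cluster needs at least three schedulable nodes for the rollout in step 3 to complete.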


@@ -115,11 +115,11 @@ resource "kubernetes_deployment" "technitium_secondary" {
 }
 resources {
 requests = {
-cpu = "25m"
-memory = "512Mi"
+cpu = "100m"
+memory = "2Gi"
 }
 limits = {
-memory = "512Mi"
+memory = "2Gi"
 }
 }
 port {
@@ -270,11 +270,11 @@ resource "kubernetes_deployment" "technitium_tertiary" {
 }
 resources {
 requests = {
-cpu = "25m"
-memory = "512Mi"
+cpu = "100m"
+memory = "2Gi"
 }
 limits = {
-memory = "512Mi"
+memory = "2Gi"
 }
 }
 port {
@@ -391,44 +391,90 @@ resource "kubernetes_cron_job_v1" "technitium_zone_sync" {
 set -e
 PRIMARY="http://technitium-primary.technitium.svc.cluster.local:5380"
 REPLICAS="http://technitium-secondary-web.technitium.svc.cluster.local:5380 http://technitium-tertiary-web.technitium.svc.cluster.local:5380"
+PUSHGW="http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/technitium-zone-sync"
+# Track overall status: non-zero if any zone fails to create
+OVERALL_STATUS=0
+FAIL_COUNT=0
+SYNCED=0
 # Login to primary
 P_TOKEN=$(curl -sf "$PRIMARY/api/user/login?user=$TECH_USER&pass=$TECH_PASS" | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')
-if [ -z "$P_TOKEN" ]; then echo "ERROR: Cannot login to primary"; exit 1; fi
+if [ -z "$P_TOKEN" ]; then echo "ERROR: Cannot login to primary"; OVERALL_STATUS=1; fi
-# Get zones from primary (excluding default zones that don't need replication)
-curl -sf "$PRIMARY/api/zones/list?token=$P_TOKEN" | tr ',' '\n' | sed -n 's/.*"name":"\([^"]*\)".*/\1/p' | \
-grep -v -E '^(localhost|0\.in-addr\.arpa|127\.in-addr\.arpa|255\.in-addr\.arpa|1\.0\.0.*ip6\.arpa)$$' > /tmp/primary_zones.txt
-echo "Primary has $(wc -l < /tmp/primary_zones.txt) zones to replicate"
-# Enable zone transfers on primary for all zones
-while read -r zone; do
-curl -sf "$PRIMARY/api/zones/options/set?token=$P_TOKEN&zone=$zone&zoneTransfer=Allow" > /dev/null || true
-done < /tmp/primary_zones.txt
-# Sync to each replica
-SYNCED=0
-for REPLICA in $REPLICAS; do
-R_TOKEN=$(curl -sf "$REPLICA/api/user/login?user=$TECH_USER&pass=$TECH_PASS" | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')
-if [ -z "$R_TOKEN" ]; then echo "WARN: Cannot login to $REPLICA, skipping"; continue; fi
-# Get existing zones on this replica
-curl -sf "$REPLICA/api/zones/list?token=$R_TOKEN" | tr ',' '\n' | sed -n 's/.*"name":"\([^"]*\)".*/\1/p' > /tmp/replica_zones.txt
+if [ "$OVERALL_STATUS" -eq 0 ]; then
+# Get zones from primary (excluding default zones that don't need replication)
+curl -sf "$PRIMARY/api/zones/list?token=$P_TOKEN" | tr ',' '\n' | sed -n 's/.*"name":"\([^"]*\)".*/\1/p' | \
+grep -v -E '^(localhost|0\.in-addr\.arpa|127\.in-addr\.arpa|255\.in-addr\.arpa|1\.0\.0.*ip6\.arpa)$$' > /tmp/primary_zones.txt
+PRIMARY_COUNT=$(wc -l < /tmp/primary_zones.txt)
+echo "Primary has $PRIMARY_COUNT zones to replicate"
+# Enable zone transfers on primary for all zones
 while read -r zone; do
-if grep -qx "$zone" /tmp/replica_zones.txt; then
-# Zone exists: just resync
-curl -sf "$REPLICA/api/zones/resync?token=$R_TOKEN&zone=$zone" > /dev/null || true
-else
-# New zone: create as Secondary and sync
-echo "NEW: Creating $zone on $REPLICA"
-curl -sf "$REPLICA/api/zones/create?token=$R_TOKEN&zone=$zone&type=Secondary&primaryNameServerAddresses=$PRIMARY_IP" > /dev/null || true
-SYNCED=$((SYNCED + 1))
-fi
+curl -sf "$PRIMARY/api/zones/options/set?token=$P_TOKEN&zone=$zone&zoneTransfer=Allow" > /dev/null || true
 done < /tmp/primary_zones.txt
-done
-echo "Zone sync complete. $$SYNCED new zone(s) created."
+# Sync to each replica
+for REPLICA in $REPLICAS; do
+R_NAME=$(echo "$REPLICA" | sed 's|http://||; s|-web.*||')
+R_TOKEN=$(curl -sf "$REPLICA/api/user/login?user=$TECH_USER&pass=$TECH_PASS" | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')
+if [ -z "$R_TOKEN" ]; then
+echo "ERROR: Cannot login to $REPLICA"
+OVERALL_STATUS=1
+FAIL_COUNT=$((FAIL_COUNT + 1))
+# Push replica zone_count=0 so divergence alert fires
+printf 'technitium_zone_count{instance="%s"} 0\n' "$R_NAME" | \
+curl -sf --data-binary @- "$PUSHGW/instance/$R_NAME" || true
+continue
+fi
+# Get existing zones on this replica
+curl -sf "$REPLICA/api/zones/list?token=$R_TOKEN" | tr ',' '\n' | sed -n 's/.*"name":"\([^"]*\)".*/\1/p' > /tmp/replica_zones.txt
+REPLICA_COUNT=$(wc -l < /tmp/replica_zones.txt)
+while read -r zone; do
+if grep -qx "$zone" /tmp/replica_zones.txt; then
+# Zone exists: just resync
+curl -sf "$REPLICA/api/zones/resync?token=$R_TOKEN&zone=$zone" > /dev/null || true
+else
+# New zone: create as Secondary and validate response
+echo "NEW: Creating $zone on $REPLICA"
+RESP=$(curl -sf "$REPLICA/api/zones/create?token=$R_TOKEN&zone=$zone&type=Secondary&primaryNameServerAddresses=$PRIMARY_IP" || echo '{"status":"error"}')
+if echo "$RESP" | grep -q '"status":"ok"'; then
+SYNCED=$((SYNCED + 1))
+else
+echo "ERROR: Failed to create $zone on $REPLICA: $RESP"
+OVERALL_STATUS=1
+FAIL_COUNT=$((FAIL_COUNT + 1))
+fi
+fi
+done < /tmp/primary_zones.txt
+# Push per-replica zone count
+printf 'technitium_zone_count{instance="%s"} %s\n' "$R_NAME" "$REPLICA_COUNT" | \
+curl -sf --data-binary @- "$PUSHGW/instance/$R_NAME" || true
+done
+# Push primary zone count
+printf 'technitium_zone_count{instance="primary"} %s\n' "$PRIMARY_COUNT" | \
+curl -sf --data-binary @- "$PUSHGW/instance/primary" || true
+fi
+# Push overall status (0=ok, 1=fail) + last-run timestamp
+cat <<METRICS | curl -sf --data-binary @- "$PUSHGW" || true
+# HELP technitium_zone_sync_status Zone sync job status (0=ok, 1=fail)
+# TYPE technitium_zone_sync_status gauge
+technitium_zone_sync_status $OVERALL_STATUS
+# HELP technitium_zone_sync_failures Zones that failed to create this run
+# TYPE technitium_zone_sync_failures gauge
+technitium_zone_sync_failures $FAIL_COUNT
+# HELP technitium_zone_sync_last_run Timestamp of last zone-sync run
+# TYPE technitium_zone_sync_last_run gauge
+technitium_zone_sync_last_run $(date +%s)
+METRICS
+echo "Zone sync complete. $SYNCED new zone(s) created. $FAIL_COUNT failures. status=$OVERALL_STATUS"
+exit $OVERALL_STATUS
 SCRIPT
 ]
 env {
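
The per-replica decision in the zone-sync script (create missing zones, resync existing ones, ignore built-in zones) amounts to a set comparison. The sketch below restates it in Python with the same exclusion pattern as the script's `grep -v -E`; `plan_sync` and the sample zone names are hypothetical.

```python
import re

# Same built-in zones the shell script filters out before replicating.
DEFAULT_ZONE_RE = re.compile(
    r"^(localhost|0\.in-addr\.arpa|127\.in-addr\.arpa|255\.in-addr\.arpa|1\.0\.0.*ip6\.arpa)$"
)


def plan_sync(primary_zones, replica_zones):
    """Return (to_create, to_resync) for one replica, preserving primary order."""
    wanted = [z for z in primary_zones if not DEFAULT_ZONE_RE.match(z)]
    existing = set(replica_zones)
    to_create = [z for z in wanted if z not in existing]
    to_resync = [z for z in wanted if z in existing]
    return to_create, to_resync
```

Given a primary with `viktorbarzin.lan` and `example.lan` and a replica that only has `viktorbarzin.lan`, the plan is to create `example.lan` and resync `viktorbarzin.lan`; the length of `to_create` corresponds to the script's `SYNCED` counter when every create call succeeds.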


@@ -60,10 +60,15 @@ resource "kubernetes_config_map" "coredns" {
 ttl 30
 }
 prometheus :9153
-forward . 10.0.20.1 8.8.8.8 1.1.1.1
+forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
+policy sequential
+health_check 5s
+max_fails 2
+}
 cache {
 success 10000 300 6
 denial 10000 300 60
+serve_stale 86400s
 }
 loop
 reload
@@ -77,10 +82,14 @@ resource "kubernetes_config_map" "coredns" {
 rcode NXDOMAIN
 fallthrough
 }
-forward . 10.96.0.53 # Technitium ClusterIP (technitium-dns-internal)
+forward . 10.96.0.53 {
+health_check 5s
+max_fails 2
+}
 cache {
 success 10000 300 6
 denial 10000 300 60
+serve_stale 86400s
 }
 }
 EOF
@@ -161,11 +170,11 @@ resource "kubernetes_deployment" "technitium" {
 name = "technitium"
 resources {
 requests = {
-cpu = "25m"
-memory = "512Mi"
+cpu = "100m"
+memory = "2Gi"
 }
 limits = {
-memory = "512Mi"
+memory = "2Gi"
 }
 }
 port {
@@ -221,6 +230,10 @@ resource "kubernetes_deployment" "technitium" {
 }
 }
 }
+lifecycle {
+# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+ignore_changes = [spec[0].template[0].spec[0].dns_config]
+}
 }
 resource "kubernetes_service" "technitium-web" {


@ -0,0 +1,100 @@
# =============================================================================
# Post-apply readiness gate
# =============================================================================
#
# Runs after all three Technitium deployments + the DNS LB service have been
# applied. Verifies that every instance is rolled out, the API responds, the
# DNS pods answer queries, and zone counts agree. Fails the apply if any
# check fails. No canary: this is a hard gate.
#
# Override for emergency maintenance: apply with `-var skip_readiness=true`
# (set via terragrunt inputs when needed), or `terraform apply -target` the
# resources needed without touching this module.
variable "skip_readiness" {
type = bool
default = false
description = "Skip the Technitium readiness gate. Use only for emergency maintenance."
}
resource "null_resource" "technitium_readiness_gate" {
count = var.skip_readiness ? 0 : 1
# Re-run when any deployment image/resource changes, or on every apply
# (timestamp) so transient drift still gets exercised.
triggers = {
primary_digest = sha256(jsonencode(kubernetes_deployment.technitium.spec[0].template[0].spec[0].container[0]))
secondary_digest = sha256(jsonencode(kubernetes_deployment.technitium_secondary.spec[0].template[0].spec[0].container[0]))
tertiary_digest = sha256(jsonencode(kubernetes_deployment.technitium_tertiary.spec[0].template[0].spec[0].container[0]))
corefile = sha256(kubernetes_config_map.coredns.data["Corefile"])
always = timestamp()
}
provisioner "local-exec" {
command = <<-BASH
set -euo pipefail
NS=technitium
echo "=== Technitium readiness gate ==="
# 1. Wait for rollout on all three deployments.
for d in technitium technitium-secondary technitium-tertiary; do
echo "-> rollout status deploy/$d"
kubectl -n $NS rollout status deploy/$d --timeout=180s
done
# 2. Per-pod DNS check + content parity. Technitium pods have `dig` but
# no HTTP client, so we use DNS directly. Each pod must return an A
# record for idrac.viktorbarzin.lan, AND the answer must match across
# all three instances. This catches:
# - Zone not loaded on an instance (NXDOMAIN / empty)
# - Zone drift between primary and replicas (different A record)
# The AXFR chain means all three should converge on the same value.
PODS=$(kubectl -n $NS get pod -l dns-server=true -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
if [ -z "$PODS" ]; then
echo "ERROR: no dns-server=true pods found"
exit 1
fi
# Zone load can take tens of seconds after a memory-bump rollout, so retry
# up to 6 times with 10s backoff before giving up.
ANSWERS=""
for POD in $PODS; do
echo "-> dig @127.0.0.1 idrac.viktorbarzin.lan on $POD"
ANSWER=""
for TRY in 1 2 3 4 5 6; do
ANSWER=$(kubectl -n $NS exec "$POD" -- dig +short +time=5 +tries=2 @127.0.0.1 idrac.viktorbarzin.lan A 2>&1 || true)
if echo "$ANSWER" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'; then
break
fi
echo " attempt $TRY: no A record yet, sleeping 10s"
sleep 10
ANSWER=""
done
if [ -z "$ANSWER" ]; then
echo "ERROR: pod $POD never returned an A record for idrac.viktorbarzin.lan"
exit 1
fi
echo " $POD → $ANSWER"
ANSWERS="$ANSWERS $ANSWER"
done
# 3. Content parity: all three instances must agree on the A record.
UNIQ=$(echo "$ANSWERS" | tr ' ' '\n' | grep -v '^$' | sort -u | wc -l)
if [ "$UNIQ" -gt 1 ]; then
echo "ERROR: instances returned different A records for idrac.viktorbarzin.lan: $ANSWERS"
exit 1
fi
echo "=== Technitium readiness gate PASSED ==="
BASH
interpreter = ["/bin/bash", "-c"]
}
depends_on = [
kubernetes_deployment.technitium,
kubernetes_deployment.technitium_secondary,
kubernetes_deployment.technitium_tertiary,
kubernetes_service.technitium-dns,
kubernetes_pod_disruption_budget_v1.technitium_dns,
]
}
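
The per-pod check and the parity step in the gate above can be sketched outside the shell. This is a minimal illustration, not the module's code: `check_parity`, the pod names and the addresses are hypothetical, and it assumes the `dig +short` output for each pod has already been collected into a dict.

```python
import re

# Same pattern the gate passes to `grep -qE`: one dotted-quad per line.
IPV4_LINE = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$")


def check_parity(answers):
    """Return the single agreed A record, or raise ValueError.

    `answers` maps a pod name to the raw `dig +short` output from that pod.
    """
    if not answers:
        raise ValueError("no dns-server=true pods found")
    for pod, output in answers.items():
        # A pod passes only if at least one output line looks like an IPv4 address.
        if not any(IPV4_LINE.match(line) for line in output.splitlines()):
            raise ValueError(f"pod {pod} never returned an A record")
    # Mirrors `tr ' ' '\n' | sort -u | wc -l`: every distinct token counts.
    unique = {token for output in answers.values() for token in output.split()}
    if len(unique) > 1:
        raise ValueError(f"instances returned different A records: {sorted(unique)}")
    return unique.pop()
```

As in the shell version, a pod that legitimately returns two A records would also fail this check, because parity is measured by counting distinct tokens across all answers.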

@ -295,12 +295,13 @@ resource "kubernetes_service" "torrserver-bt" {
}
module "torrserver_ingress" {
source = "../../modules/kubernetes/ingress_factory"
namespace = kubernetes_namespace.tor-proxy.metadata[0].name
name = "torrserver"
tls_secret_name = var.tls_secret_name
port = "8090"
protected = true
source = "../../modules/kubernetes/ingress_factory"
namespace = kubernetes_namespace.tor-proxy.metadata[0].name
name = "torrserver"
tls_secret_name = var.tls_secret_name
port = "8090"
protected = true
external_monitor = false
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "TorrServer"

@ -252,7 +252,7 @@ resource "kubernetes_deployment" "trading-bot-frontend" {
app = "trading-bot-frontend"
}
annotations = {
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis-master.redis:6379"
}
}
spec {
@ -353,7 +353,7 @@ resource "kubernetes_deployment" "trading-bot-workers" {
app = "trading-bot-workers"
}
annotations = {
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis.redis:6379"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis-master.redis:6379"
}
}
spec {

@ -552,10 +552,42 @@ locals {
type = "mysql"
database_connection_string = "mysql://uptimekuma@mysql.dbaas.svc.cluster.local:3306"
database_password_vault_key = "uptimekuma_db_password"
hostname = null
port = null
interval = 60
retry_interval = 60
max_retries = 2
},
{
# HAProxy service in redis ns health-checks INFO replication and
# only routes to the current Sentinel-elected master, so this
# survives failover. Bitnami chart has auth disabled, so no
# password_vault_key.
name = "Redis"
type = "redis"
database_connection_string = "redis://redis-master.redis.svc.cluster.local:6379"
database_password_vault_key = null
hostname = null
port = null
interval = 60
retry_interval = 30
max_retries = 3
},
{
# TP-Link home router upstream of pfSense. Complements the
# `[External] gw` HTTPS monitor: this one checks the router
# directly on 443, so we can tell a Cloudflare/tunnel outage
# apart from the router itself being unreachable.
name = "TP-Link Gateway (192.168.1.1)"
type = "port"
database_connection_string = null
database_password_vault_key = null
hostname = "192.168.1.1"
port = 443
interval = 60
retry_interval = 30
max_retries = 3
},
]
}
@ -570,6 +602,7 @@ resource "kubernetes_secret" "internal_monitor_sync" {
for m in local.internal_monitors :
"DB_PASSWORD_${upper(replace(m.name, "/[^A-Za-z0-9]/", "_"))}" =>
data.vault_kv_secret_v2.viktor.data[m.database_password_vault_key]
if m.database_password_vault_key != null
},
)
}
@ -585,7 +618,9 @@ resource "kubernetes_config_map_v1" "internal_monitor_targets" {
name = m.name
type = m.type
database_connection_string = m.database_connection_string
password_env = "DB_PASSWORD_${upper(replace(m.name, "/[^A-Za-z0-9]/", "_"))}"
hostname = m.hostname
port = m.port
password_env = m.database_password_vault_key != null ? "DB_PASSWORD_${upper(replace(m.name, "/[^A-Za-z0-9]/", "_"))}" : null
interval = m.interval
retry_interval = m.retry_interval
max_retries = m.max_retries
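
The `password_env` expression in this hunk derives an environment variable name from the monitor name and emits null when the monitor has no Vault key. A rough Python equivalent, assuming the same character class as the Terraform `replace` call; `password_env_name` and the sample monitor names are hypothetical:

```python
import re


def password_env_name(monitor_name, vault_key):
    """Mirror of the Terraform expression: None when no Vault key is set."""
    if vault_key is None:
        return None
    # Every character outside A-Z, a-z, 0-9 becomes "_", then the name is upper-cased.
    return "DB_PASSWORD_" + re.sub(r"[^A-Za-z0-9]", "_", monitor_name).upper()
```

Because each non-alphanumeric character is replaced individually, a name such as `My SQL-1` would map to `DB_PASSWORD_MY_SQL_1`, and two names that differ only in punctuation would collide on the same variable.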
@ -634,40 +669,42 @@ existing = {m["name"]: m for m in api.get_monitors()}
for t in targets:
name = t["name"]
password = os.environ[t["password_env"]]
# MYSQL monitors use `databaseConnectionString` + `radiusPassword`
# (UK v2 re-uses the radiusPassword field for mysql auth backwards compat).
mtype = MonitorType(t["type"])
# MYSQL uses `databaseConnectionString` + `radiusPassword` (UK v2 re-uses
# radiusPassword for mysql auth backwards compat). Redis has auth
# disabled on the cluster, so password_env is null. PORT monitors use
# hostname + port directly.
desired = {
"type": MonitorType(t["type"]),
"type": mtype,
"name": name,
"databaseConnectionString": t["database_connection_string"],
"radiusPassword": password,
"interval": t["interval"],
"retryInterval": t["retry_interval"],
"maxretries": t["max_retries"],
}
if mtype == MonitorType.PORT:
desired["hostname"] = t["hostname"]
desired["port"] = t["port"]
else:
desired["databaseConnectionString"] = t["database_connection_string"]
if t.get("password_env"):
desired["radiusPassword"] = os.environ[t["password_env"]]
if name not in existing:
print(f"Creating monitor: {name}")
api.add_monitor(**desired)
continue
m = existing[name]
drifted = (
m.get("databaseConnectionString") != desired["databaseConnectionString"]
or m.get("radiusPassword") != desired["radiusPassword"]
or m.get("interval") != desired["interval"]
or m.get("retryInterval") != desired["retryInterval"]
or m.get("maxretries") != desired["maxretries"]
)
drift_fields = ["interval", "retryInterval", "maxretries"]
if mtype == MonitorType.PORT:
drift_fields += ["hostname", "port"]
else:
drift_fields += ["databaseConnectionString"]
if "radiusPassword" in desired:
drift_fields += ["radiusPassword"]
drifted = any(m.get(f) != desired.get(f) for f in drift_fields)
if drifted:
print(f"Updating monitor {name} (id={m['id']})")
api.edit_monitor(
m["id"],
databaseConnectionString=desired["databaseConnectionString"],
radiusPassword=desired["radiusPassword"],
interval=desired["interval"],
retryInterval=desired["retryInterval"],
maxretries=desired["maxretries"],
)
edit_kwargs = {f: desired[f] for f in drift_fields if f in desired}
api.edit_monitor(m["id"], **edit_kwargs)
else:
print(f"Monitor {name} (id={m['id']}) already in desired state")
time.sleep(0.3)
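
The type-dependent drift check introduced in this hunk can be isolated as two pure functions. This is a sketch under assumptions: it uses plain strings instead of the `MonitorType` enum so it runs without the Uptime Kuma client, and `drift_fields_for` and `edit_kwargs_for` are hypothetical names.

```python
def drift_fields_for(monitor_type, desired):
    """List the fields compared for a monitor of the given type."""
    fields = ["interval", "retryInterval", "maxretries"]
    if monitor_type == "port":
        fields += ["hostname", "port"]
    else:
        fields += ["databaseConnectionString"]
        # Only compare the password when one is configured for the monitor.
        if "radiusPassword" in desired:
            fields += ["radiusPassword"]
    return fields


def edit_kwargs_for(monitor_type, existing, desired):
    """Return the kwargs for an edit call, or an empty dict when nothing drifted."""
    fields = drift_fields_for(monitor_type, desired)
    if any(existing.get(f) != desired.get(f) for f in fields):
        return {f: desired[f] for f in fields if f in desired}
    return {}
```

When any compared field differs, the sketch sends every compared field rather than only the changed ones, matching the diff's `edit_kwargs` construction.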

@ -394,9 +394,14 @@ resource "vault_kubernetes_auth_backend_role" "ci" {
role_name = "ci"
bound_service_account_names = ["default"]
bound_service_account_namespaces = ["woodpecker"]
token_policies = [vault_policy.ci.name]
token_ttl = 604800 # 7d
token_period = 604800 # periodic: auto-renews indefinitely
# terraform_state policy grants `database/static-creds/pg-terraform-state`
# read; scripts/tg needs this to fetch the Tier-1 PG backend password.
# Without it, CI's per-stack `tg apply` dies with
# `ERROR: Cannot read PG credentials from Vault` and the default.yml
# apply-loop swallows the exit code (set +e); fixed in bd code-e1x.
token_policies = [vault_policy.ci.name, vault_policy.terraform_state.name]
token_ttl = 604800 # 7d
token_period = 604800 # periodic: auto-renews indefinitely
}
# --- ESO Policy & Role ---

@ -9,6 +9,10 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
}
}

@ -133,7 +133,7 @@ SLACK_BOT_TOKEN = os.getenv("SLACK_BOT_TOKEN", "")
SLACK_CHANNEL = os.getenv("SLACK_CHANNEL", "automation")
# Redis configuration
REDIS_URL = os.getenv("REDIS_URL", "redis://redis.redis.svc.cluster.local:6379/0")
REDIS_URL = os.getenv("REDIS_URL", "redis://redis-master.redis.svc.cluster.local:6379/0")
REDIS_PREFIX = "yt-highlights:"
# Paths

File diff suppressed because one or more lines are too long