## Context
A MAM (MyAnonamouse) freeleech farming workflow was deployed on 2026-04-14
via kubectl apply (outside Terraform). Five days later the account was
still stuck in Mouse class: 715 MiB downloaded, 0 uploaded, ratio 0.
Tracker responses on 7 of 9 active torrents returned
`status=4 | msg="User currently mouse rank, you need to get your ratio up!"`
— MAM was actively refusing to serve peer lists because the account was
in Mouse class, and refusing to serve peer lists made the ratio impossible
to recover. Meanwhile the grabber kept digging: 501 torrents sat in
qBittorrent, 0 completed, 0 bytes uploaded.
Root causes (ranked):
1. Death spiral — Mouse class blocks announces, nothing uploads.
2. BP-spender 30 000 BP threshold blocked the only exit even though the
account already had 24 500 BP.
3. Grabber selection (`score = 1.0 / (seeders+1)`) preferred low-demand
torrents filtered to <100 MiB — ratio-hostile by design.
4. Grabber/cleanup deadlock: cleanup only fired on seed_time > 3d, so
torrents that never started never qualified. Combined with the 500-
torrent cap this stalled the grabber indefinitely.
5. qBittorrent queueing amplified (4) — 495/501 stuck in queuedDL.
6. Ratio-monitor labelled queued torrents `unknown` (empty tracker
field), hiding the problem on the MAM Grafana panel.
7. qBittorrent memory limit (256 Mi LimitRange default) too low.
8. All of the above lived outside Terraform as kubectl-applied drift,
   with no review surface.
## This change
Introduces `stacks/servarr/mam-farming/` — a new TF module that adopts
the three kubectl-applied resources and replaces their scripts with
demand-first, hit-and-run-aware (H&R) logic. Also bumps qBittorrent
resources, fixes
ratio-monitor labelling, and adds five Prometheus alerts plus a Grafana
panel row.
### Architecture
MAM API ───┬─── jsonLoad.php (profile: ratio, class, BP)
├─── loadSearchJSONbasic.php (freeleech search)
├─── bonusBuy.php (50 GiB min tier for API)
└─── download.php (torrent file)
│
Pushgateway <──┬────────────┤
│ mam_ratio ┌────────────────────┐
│ mam_class_code │ freeleech-grabber │ */30
│ mam_bp_balance ◄───│ (ratio-guarded) │
│ mam_farming_* └──────────┬─────────┘
│ mam_janitor_* │ adds to
│ ▼
│ Grafana panels qBittorrent (mam-farming)
│ + 5 alerts ▲
│ │ deletes by rule
│ ┌──────────┴─────────┐
│ ◄───│ farming-janitor │ */15
│ │ (H&R-aware) │
│ └──────────┬─────────┘
│ │ buys credit
│ ┌──────────┴─────────┐
└───────────────────────│ bp-spender │ 0 */6
│ (tier-aware) │
└────────────────────┘
### Key decisions
- **Ratio guard on grabber** — refuse to grab if ratio < 1.2 OR class ==
Mouse. Prevents the death spiral from deepening. Emits
`mam_grabber_skipped_reason{reason=...}` and exits clean.
- **Demand-first selection** — new score formula
  `leechers*3 - seeders*0.5 + (200 if freeleech_wedge else 0)`; size band
  50 MiB – 1 GiB; leecher floor 1; seeder ceiling 50. Picks titles that
  will actually upload (this and the neighbouring rules are sketched
  after this list).
- **Janitor decoupled from grabber** — runs every 15 min regardless of
the ratio-guard state. Without this, stuck torrents accumulate
fastest exactly when the grabber is skipping (Mouse class). H&R-aware:
never deletes `progress==1.0 AND seeding_time < 72h`. Six delete
reasons observable via `mam_janitor_deleted_per_run{reason=...}`.
- **BP-spender tier-aware** — MAM imposes a hard 50 GiB minimum on API
buyers ("Automated spenders are limited to buying at least 50 GB...
due to log spam"). Valid API tiers: 50/100/200/500 GiB at 500 BP/GiB.
The spender picks the smallest tier that satisfies the ratio deficit
AND fits the budget, preserving a 500 BP reserve. If even the 50 GiB
tier is too expensive, it skips and retries on the next 6-hour cron.
- **Authoritative metrics use MAM profile fields** —
`downloaded_bytes` / `uploaded_bytes` (integers) rather than the
pretty-printed `downloaded` / `uploaded` strings like "715.55 MiB"
that MAM also returns.
- **Ratio-monitor category-first labelling** — `tracker` is empty for
queued torrents that never announced. Now maps `category==mam-farming`
to label `mam` first, only falls back to tracker-URL parsing when
category is absent. Stops hundreds of MAM torrents collecting under
`unknown`.
- **qBittorrent resources bumped** to `requests=512Mi / limits=1Gi` so
hundreds of active torrents don't OOM.
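Condensed sketches of the three decision rules above; field names are
illustrative, and the real CronJob scripts carry metrics and I/O around
this core logic:
```
FLOOR_RATIO = 1.2
API_TIERS_GIB = (50, 100, 200, 500)  # MAM's minimum API buy is 50 GiB
BP_PER_GIB = 500
BP_RESERVE = 500

def guard_ok(profile: dict) -> bool:
    # Grabber: never dig deeper while the account cannot announce.
    return profile["classname"] != "Mouse" and profile["ratio"] >= FLOOR_RATIO

def score(t: dict) -> float:
    # Demand-first: leechers dominate, seeders mildly penalise, a
    # freeleech wedge is a flat bonus. Size/leecher/seeder filters omitted.
    return t["leechers"] * 3 - t["seeders"] * 0.5 + (200 if t["freeleech_wedge"] else 0)

def pick_tier_gib(deficit_gib: float, bp_balance: int) -> int:
    # BP-spender: smallest tier covering the deficit that also fits the
    # budget after the 500 BP reserve; 0 means skip until the next cron.
    affordable_gib = (bp_balance - BP_RESERVE) // BP_PER_GIB
    for tier in API_TIERS_GIB:
        if deficit_gib <= tier <= affordable_gib:
            return tier
    return 0
```
With the recovery-session numbers (BP=24 551, so affordable=48 GiB),
`pick_tier_gib` returns 0, matching the `affordable=48 buy=0` log line in
the test plan below.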
### Emergency recovery performed this session
1. Adopted 5 in-cluster resources via root-module `import {}` blocks
(Terraform 1.5+ rejects imports inside child modules).
2. Ran the janitor in DRY_RUN=1 to verify rules against live state —
466 `never_started` candidates, 0 false positives in any other
reason bucket. Flipped to enforce mode.
3. Janitor deleted 466 stuck torrents (matches plan's ~495 target; 35
preserved as active/in-progress).
4. Truncated `/data/grabbed_ids.txt` so newly-popular titles become
eligible again.
The ratio is still 0 because the API cannot buy below 50 GiB and the
account sits at 24 551 BP (needs 25 000). A manual 1 GiB purchase via the
MAM web UI — 500 BP — would immediately lift the account to ratio ≈ 1.4
and unblock announces. Future automation cannot do this for us because of
MAM's anti-spam rule.
### What is NOT in this change
- qBittorrent prefs reconciliation (max_active_downloads=20,
max_active_uploads=150, max_active_torrents=150). The plan wanted
this; deferred to a follow-up because the janitor + ratio recovery
handles the 500-torrent backlog first. A small reconciler CronJob
posting to /api/v2/app/setPreferences is the intended follow-up.
- VIP purchase (~100 k BP) — deferred until BP accumulates.
- Cross-seed / autobrr — separate initiative.
## Alerts added
- P1 MAMMouseClass — `mam_class_code == 0` for 1h
- P1 MAMCookieExpired — `mam_farming_cookie_expired > 0`
- P2 MAMRatioBelowOne — `mam_ratio < 1.0` for 24h (replaces old
QBittorrentMAMRatioLow, now driven by authoritative profile metric)
- P2 MAMFarmingStuck — no grabs in 4h while ratio is healthy
- P2 MAMJanitorStuckBacklog — `skipped_active > 400` for 6h
## Test plan
### Automated
$ cd infra/stacks/servarr && ../../scripts/tg plan 2>&1 | grep Plan
Plan: 5 to import, 2 to add, 6 to change, 0 to destroy.
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 5 imported, 2 added, 6 changed, 0 destroyed.
# Re-plan after import block removal (idempotent)
$ ../../scripts/tg plan 2>&1 | grep Plan
Plan: 0 to add, 1 to change, 0 to destroy.
# The 1 change is a pre-existing MetalLB annotation drift on the
# qbittorrent-torrenting Service — unrelated to this change.
$ cd ../monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
# Python + JSON syntax
$ python3 -c 'import ast; [ast.parse(open(p).read()) for p in [
"infra/stacks/servarr/mam-farming/files/freeleech-grabber.py",
"infra/stacks/servarr/mam-farming/files/bp-spender.py",
"infra/stacks/servarr/mam-farming/files/mam-farming-janitor.py"]]'
$ python3 -c 'import json; json.load(open(
"infra/stacks/monitoring/modules/monitoring/dashboards/qbittorrent.json"))'
### Manual Verification
1. Grabber ratio-guard path:
$ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
$ kubectl -n servarr logs job/g1
Skip grab: ratio=0.0 class=Mouse (floor=1.2) reason=mouse_class
2. BP-spender tier path:
$ kubectl -n servarr create job --from=cronjob/mam-bp-spender s1
$ kubectl -n servarr logs job/s1
Profile: ratio=0.0 class=Mouse DL=0.70 GiB UL=0.00 GiB BP=24551
| deficit=1.40 GiB needed=3 affordable=48 buy=0
Done: BP=24551, spent=0 GiB (needed=3, affordable=48)
Correctly skips because affordable (48) < smallest API tier (50).
3. Janitor in enforce mode:
$ kubectl -n servarr create job --from=cronjob/mam-farming-janitor j1
$ kubectl -n servarr logs job/j1 | tail -3
Done: deleted=466 preserved_hnr=0 skipped_active=35 dry_run=False
per reason: {'never_started': 466, ...}
Second run immediately after: `deleted=0 skipped_active=35` —
steady state with only active/seeding torrents left.
4. Alerts loaded:
$ kubectl -n monitoring get cm prometheus-server \
-o jsonpath='{.data.alerting_rules\.yml}' \
| grep -E "alert: MAM|alert: QBittorrent"
- alert: MAMMouseClass
- alert: MAMCookieExpired
- alert: MAMRatioBelowOne
- alert: MAMFarmingStuck
- alert: MAMJanitorStuckBacklog
- alert: QBittorrentDisconnected
- alert: QBittorrentMAMUnsatisfied
5. Dashboard: browse to Grafana "qBittorrent - Seeding & Ratio" → new
"MAM Profile (from jsonLoad.php)" row at the bottom shows class, BP
balance, profile ratio, transfer, BP-vs-reserve timeseries, janitor
deletion stacked chart, janitor state stat, grabber state stat.
## Reproduce locally
1. `cd infra/stacks/servarr && ../../scripts/tg plan` — expect
0 add / 1 change (unrelated MetalLB annotation drift).
2. `kubectl -n servarr get cronjobs` — expect three:
mam-freeleech-grabber, mam-bp-spender, mam-farming-janitor.
3. Trigger each via `kubectl create job --from=cronjob/<name> <job>`
and read logs; outputs match the manual-verification snippets above.
Closes: code-qfs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Ships the monorepo commit
(code@2fd7670d [claude-agent-service] Raise /execute default timeout
from 15m to 45m) that raises ExecuteRequest.timeout_seconds from 900 to
2700. The auto-upgrade pipeline (DIUN → n8n → claude-agent-service →
service-upgrade agent) had been silently timing out mid-run for 3 days:
139 × 202 Accepted + 6 × TimeoutError in the last 24h, zero commits to
infra, zero Slack posts. Root cause was the 15-minute cap truncating
CAUTION-class upgrades that need to summarise multi-release changelogs,
poll Woodpecker CI, and wait on on-demand DB backup CronJobs.
## What changed
`local.image_tag` 0c24c9b6 → 2fd7670d. Image built + pushed to
registry.viktorbarzin.me/claude-agent-service:2fd7670d. Deployment is
`Recreate`, so the single pod is dropped + recreated.
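For reference, the shape of the change: a minimal sketch assuming the
service's pydantic request model (the exec check in the test plan below
imports `ExecuteRequest` from `app.main` and constructs it with `prompt`
and `agent`):
```
from pydantic import BaseModel

class ExecuteRequest(BaseModel):
    prompt: str
    agent: str
    timeout_seconds: int = 2700  # was 900; 15m truncated CAUTION-class upgrades
```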
## Test Plan
### Automated
`terraform plan` — `Plan: 0 to add, 1 to change, 0 to destroy` (3
container image refs flip from 0c24c9b6 → 2fd7670d).
`terraform apply` — `Apply complete! Resources: 0 added, 1 changed,
0 destroyed.`
### Manual Verification
```
$ kubectl -n claude-agent rollout status deploy/claude-agent-service --timeout=120s
deployment "claude-agent-service" successfully rolled out
$ kubectl -n claude-agent get deploy claude-agent-service \
-o jsonpath='{.spec.template.spec.containers[0].image}'
registry.viktorbarzin.me/claude-agent-service:2fd7670d
$ kubectl -n claude-agent exec deploy/claude-agent-service -- \
sh -c 'cd /srv && python3 -c "from app.main import ExecuteRequest; \
print(ExecuteRequest(prompt=\"p\", agent=\"a\").timeout_seconds)"'
2700
```
Next DIUN cycle (every 6h) should land ≥1 unattended upgrade as an
infra commit + Slack message without TimeoutError in the agent logs.
Closes: code-cfy
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
code-vnc confirmed `viktorbarzin/dovecot_exporter` cannot produce real
metrics against docker-mailserver 15.0.0's Dovecot 2.3.19 — the
exporter speaks the pre-2.3 `old_stats` FIFO protocol, which Dovecot
2.3 deprecated in favour of `service stats` + `doveadm-server` with
a different wire format. The scrape only ever returned
`dovecot_up{scope="user"} 0`.
code-1ik listed two paths: (a) switch to a Dovecot 2.3+ exporter, or
(b) retire the exporter + scrape + alerts. Picking (b) — carrying a
no-op exporter + scrape + alert group taxes cluster resources,
clutters Prometheus /targets, and tees up an alert that can never
fire correctly. If a future session needs real Dovecot stats, reach
for a known-good exporter (e.g., jtackaberry/dovecot_exporter) and
rebuild this scaffolding.
## This change
### mailserver stack
- Removes the `dovecot-exporter` container from
`kubernetes_deployment.mailserver` (was ~28 lines). Pod now
runs a single `docker-mailserver` container.
- Removes `kubernetes_service.mailserver_metrics` (ClusterIP Service
added in code-izl). The `mailserver` LoadBalancer (ports 25, 465,
587, 993) is unaffected.
- Drops the dovecot.cf comment documenting the failed code-vnc
attempt — the documentation survives here + in bd code-vnc /
code-1ik.
### monitoring stack
- Removes `job_name: 'mailserver-dovecot'` from `extraScrapeConfigs`.
- Removes the `Mailserver Dovecot` PrometheusRule group
(`DovecotConnectionsNearLimit`, `DovecotExporterDown`).
- Inline comments in both files point future work at code-1ik's
decision record.
Prometheus configmap-reload picked up the change; scrape target set
now has zero entries for `mailserver-dovecot`. Pod rolled cleanly to
1/1 Running.
## What is NOT in this change
- No replacement exporter — deliberate. The alert that was removed
was a false-signal alert; its removal returns cluster alerting to
a correct, lower-noise state.
- mailserver MetalLB Service + SMTP/IMAP ports — unchanged.
- `auth_failure_delay`, `mail_max_userip_connections` — stay; those
are unrelated to stats export.
## Test Plan
### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver
NAME READY STATUS RESTARTS AGE
mailserver-78589bfd95-swz6h 1/1 Running 0 49s
$ kubectl get svc -n mailserver
NAME TYPE PORT(S)
mailserver LoadBalancer 25/TCP,465/TCP,587/TCP,993/TCP
roundcubemail ClusterIP 80/TCP
# mailserver-metrics gone
$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
{"status":"success","data":{"activeTargets":[]}}
```
### Manual Verification
1. E2E probe `email-roundtrip-monitor` keeps succeeding (20-min cadence)
2. `EmailRoundtripFailing` stays green — proves IMAP is healthy even
without the exporter signal
3. Prometheus `/alerts` page no longer shows DovecotConnectionsNearLimit
or DovecotExporterDown
Closes: code-1ik
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
bd code-vnc investigated why `viktorbarzin/dovecot_exporter` only
exposed `dovecot_up{scope="user"} 0`. Root cause: the exporter speaks
the legacy pre-2.3 `old_stats` FIFO wire protocol. docker-mailserver
15.0.0 ships Dovecot 2.3.19, which moved to `service stats` with a
different architecture — `doveadm stats dump` on the old-stats
unix_listener returns "Failed to read VERSION line" and the exporter
loops on "Input does not provide any columns".
Attempted fix: enabled `old_stats` plugin via `mail_plugins` +
declared `service old-stats { unix_listener stats-reader }`. Socket
was created but protocol incompatibility made it useless. Reverted.
## This change
- Reverts the attempted dovecot.cf additions
- Adds a comment in the dovecot.cf heredoc explaining why we
deliberately do NOT enable old_stats here
- `auth_failure_delay = 5s` (code-9mi) and
`mail_max_userip_connections = 50` stay — they're unrelated to
stats
## What is NOT in this change
- A replacement exporter — filed as follow-up bd code-1ik with
two paths: switch to jtackaberry/dovecot_exporter, or retire the
exporter+scrape+alert entirely
- The `mailserver-metrics` ClusterIP Service (from code-izl) —
kept; it will be useful for whichever path code-1ik chooses
## Test Plan
### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
supervisorctl status dovecot postfix
dovecot RUNNING pid 1022, uptime 0:00:27
postfix RUNNING pid 1063, uptime 0:00:26
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```
### Manual Verification
Dovecot config returns to baseline + auth_failure_delay. Mail continues
to flow (E2E probe continues to succeed via `email-roundtrip-monitor`).
Closes: code-vnc
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Companion change to payslip-ingest v2 (regex parser + accurate RSU tax
attribution). The Grafana dashboard now has 4 more panels powered by the
new earnings-decomposition and YTD-snapshot columns, and the Claude
fallback agent's prompt is aligned with the new schema so non-Meta
payslips still land with the full field set.
## This change
### `.claude/agents/payslip-extractor.md`
Rewrites the RSU handling section to match Meta UK's actual template
(rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching
rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead).
Adds a new "Earnings decomposition (v2)" section telling the fallback
agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_*
and when to use pension_employee vs pension_sacrifice without
double-counting.
### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`
- **Panel 4 (Effective rate)** — SQL switched from the naive
  `(income_tax + NIC) / cash_gross` to the YTD-effective-rate
  method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid /
  ytd_taxable_pay)` (sketched after this list). Title updated to
  "YTD-corrected" so the change is discoverable.
- **Panel 5 (Table)** — adds salary, bonus, pension_sacrifice,
taxable_pay columns so row-level debugging against the parser
output is trivial.
- **+Panel 8 (Earnings breakdown)** — monthly stacked bars of
salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice
months show up as a massive negative pension_sacrifice spike
paired with a near-zero bonus bar.
- **+Panel 9 (Accurate cash tax rate)** — timeseries of
cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU
contribution the payslip hides in the single `Tax paid` line.
- **+Panel 10 (All-in compensation)** — stacked bars of cash_gross
+ rsu_vest per payslip.
- **+Panel 11 (YTD cumulative cash gross vs total comp)** — two
lines partitioned by tax_year; the gap between them is the RSU
contribution YTD.
Total panels go from 7 → 11.
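A worked sketch of the two rates panel 9 compares, using the v2 column
names from above; whether NIC joins the corrected numerator is an
assumption here, mirroring the naive formula:
```
def naive_tax_rate(row: dict) -> float:
    # Old panel 4: attributes ALL income tax to cash earnings.
    return (row["income_tax"] + row["nic"]) / row["cash_gross"]

def cash_tax_rate_ytd(row: dict) -> float:
    # Attribute RSU tax at the payslip's own YTD effective rate, then
    # measure the remainder against cash gross only.
    ytd_rate = row["ytd_tax_paid"] / row["ytd_taxable_pay"]
    cash_tax = row["income_tax"] - row["rsu_vest"] * ytd_rate
    return (cash_tax + row["nic"]) / row["cash_gross"]
```
In RSU-vest months `cash_tax_rate_ytd` sits below `naive_tax_rate`; the
gap is exactly the RSU tax hidden in the single `Tax paid` line.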
## Test Plan
### Automated
Dashboard JSON validity:
```
$ python3 -m json.tool uk-payslip.json > /dev/null && echo ok
ok
```
### Manual Verification
After applying `stacks/monitoring/`:
1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels
2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the
negative pension_sacrifice bar in panel 8
3. Panel 9 "Accurate cash effective tax rate" shows the
cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in
RSU-vest months
## Reproduce locally
1. `cd infra/stacks/monitoring && terragrunt plan`
2. Expected: ConfigMap diff on the payslip dashboard with the new panel
JSON
3. `terragrunt apply` — Grafana reloads the dashboard automatically
(configmap-reload sidecar)
Relates to: payslip-ingest commit 9741816
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
`modules/kubernetes/nfs_volume` creates the K8s PV but NOT the underlying
directory on the Proxmox NFS host (`192.168.1.127:/srv/nfs/<subdir>`).
The first time a new consumer is added, the mount fails with
`mount.nfs: … No such file or directory` and the pod hangs in
ContainerCreating.
This bit us twice during the Wave 1/2 rollout — once for the mailserver
backup (code-z26) and again for the Roundcube backup (code-1f6). Both
times the fix was `ssh root@192.168.1.127 'mkdir -p /srv/nfs/<subdir>'`.
Rather than automate the SSH dependency into the module (which would
break hermeticity and fail for operators without host SSH), this runbook
documents the manual bootstrap step and the rationale.
Addresses bd code-yo4.
## This change
New file: `docs/runbooks/nfs-prerequisites.md`. Lists known consumers,
gives the copy-paste SSH command, and explains why auto-creation was
rejected (two options, neither worth the churn).
## What is NOT in this change
- Any automation of the bootstrap — runbook only
- Migration to `nfs-subdir-external-provisioner` — explicitly out of scope
## Test Plan
### Automated
```
$ cat docs/runbooks/nfs-prerequisites.md | head -5
# NFS Prerequisites for `modules/kubernetes/nfs_volume`
The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
```
### Manual Verification
Before the next stack adds a new `nfs_volume` consumer, read the runbook
and run the `ssh root@192.168.1.127 'mkdir -p ...'` step. First pod
reaches Ready within a minute of the PV creation.
Closes: code-yo4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts.
Cluster-health check flagged it as WARN; Prometheus was firing
`StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts continuously.
## Root cause
Memory limit 64Mi is too tight. Master steady-state is only 21Mi, but
the replica needs transient headroom during PSYNC full resync:
- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during snapshot)
- Replication backlog tracking
The replica RSS crossed 64Mi during sync and was OOM-killed (exit 137),
looping forever. It also undermined Sentinel failover: had the master
failed, there was no healthy replica to promote.
## Fix
Master + replica: 64Mi → 256Mi (both requests and limits, per
`CLAUDE.md` resource management rule: `requests=limits` based on
VPA upperBound).
Sentinels stay at 64Mi — they don't store data.
## Deployment note
Helm upgrade initially deadlocked because the StatefulSet uses the
`OrderedReady` podManagementPolicy: the rolling update refuses to proceed
until every pod is Ready, but redis-node-1 could not become Ready without
the update. Recovered via:
helm rollback redis 43 -n redis
kubectl -n redis patch sts redis-node --type=strategic \
-p '{...memory: 256Mi...}'
kubectl -n redis delete pod redis-node-1 --force
Then `scripts/tg apply` cleanly reconciled state. Deadlock-recovery
runbook to be written under `code-cnf`.
## Verification
kubectl -n redis get pods
redis-node-0 2/2 Running 0 <bounce>
redis-node-1 2/2 Running 0 <bounce>
kubectl -n redis get sts redis-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
256Mi
## Follow-ups filed
- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically
(separate root cause; surfaced via same cluster-health run)
- code-cnf: runbook / TF tweak for the OrderedReady + atomic-wait
deadlock recovery
Closes: code-pqt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The mailserver container had `capabilities.add = ["NET_ADMIN"]`. Upstream
docker-mailserver docs say the capability is only needed by Fail2ban to
run iptables ban actions. Fail2ban is DISABLED in this stack
(`ENABLE_FAIL2BAN=0`, see line ~68) — CrowdSec owns the brute-force
policy at the LB layer. The capability was therefore unused ballast and
a minor attack-surface reduction opportunity. Addresses code-4mu.
## This change
Replaces the explicit `capabilities { add = ["NET_ADMIN"] }` block with
an empty `security_context {}`. Post-rollout verification
(`supervisorctl status`) confirms every service we actually run is
healthy — dovecot, postfix, rspamd, rsyslog, postsrsd, changedetector,
cron, mailserver. Every STOPPED entry was already disabled.
The inline comment documents the revert trigger: check
`kubectl logs -c docker-mailserver` for permission-denied patterns and
restore the capability if observed.
## Test Plan
### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver -o jsonpath='{.items[0].spec.containers[?(@.name=="docker-mailserver")].securityContext}'
{"allowPrivilegeEscalation":true,"privileged":false,"readOnlyRootFilesystem":false,"runAsNonRoot":false}
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
supervisorctl status | grep RUNNING
changedetector RUNNING ...
cron RUNNING ...
dovecot RUNNING ...
mailserver RUNNING ...
postfix RUNNING ...
postsrsd RUNNING ...
rspamd RUNNING ...
rsyslog RUNNING ...
```
### Observation window
EmailRoundtripFailing + EmailRoundtripStale alerts continue to run
every 20 min. If no alert fires in the 24h post-rollout window
(through ~2026-04-20 10:40 UTC), the change is considered safe and
this commit stands. Otherwise revert this commit.
## What is NOT in this change
- readOnlyRootFilesystem (separate hardening, out of scope)
- runAsNonRoot (docker-mailserver needs root for Postfix)
- Removing privilege-escalation defaults (container needs those for
chowning mail spool at startup)
Closes: code-4mu
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Port 9166 (`dovecot-metrics`) was exposed on the public MetalLB
LoadBalancer 10.0.20.202 alongside SMTP/IMAP. While only LAN-routable,
shipping an internal metric on the same listening IP as external mail
conflated two concerns and over-exposed the port. Prometheus was
scraping via the same LB Service. Addresses code-izl (follow-up to
code-61v which added the scrape job).
## This change
### mailserver stack
- Drops `dovecot-metrics` port from `kubernetes_service.mailserver`
(LoadBalancer stays: 25, 465, 587, 993).
- Adds new `kubernetes_service.mailserver_metrics` — ClusterIP-only,
selecting the same `app=mailserver` pod, exposing 9166.
### monitoring stack
- Updates `extraScrapeConfigs` in the Prometheus chart values to
target the new `mailserver-metrics.mailserver.svc.cluster.local:9166`
instead of `mailserver.mailserver.svc.cluster.local:9166`.
- helm_release.prometheus updated in-place; configmap-reload sidecar
picked up the new target within 10s.
```
mailserver LB mailserver-metrics ClusterIP
┌──────────────────┐ ┌──────────────────┐
│ 25 smtp │ │ 9166 dovecot- │
│ 465 smtp-secure │ │ metrics │ ← Prometheus only
│ 587 smtp-auth │ └──────────────────┘
│ 993 imap-secure │
└──────────────────┘
↑ 10.0.20.202
```
## What is NOT in this change
- Per-Service RBAC/NetworkPolicy tightening (separate task)
- Moving the metrics port to a dedicated sidecar-only Service Monitor
(ServiceMonitor CRDs not installed; extraScrapeConfigs is correct
for the prometheus-community chart in use)
## Test Plan
### Automated
```
$ kubectl get svc -n mailserver
mailserver LoadBalancer 10.0.20.202 25/TCP,465/TCP,587/TCP,993/TCP
mailserver-metrics ClusterIP 10.100.102.174 9166/TCP
$ kubectl get endpoints -n mailserver mailserver-metrics
mailserver-metrics 10.10.169.163:9166
$ # Prometheus target (after 10s configmap-reload)
$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
scrapeUrl: http://mailserver-metrics.mailserver.svc.cluster.local:9166/metrics
health: up
```
### Manual Verification
1. From a host outside the cluster: `nc -vz 10.0.20.202 9166` → connection refused
2. Prometheus UI `/targets` → `mailserver-dovecot` UP, labels show new DNS name
3. PromQL: `up{job="mailserver-dovecot"}` returns `1`
Closes: code-izl
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The `postfix-accounts.cf` ConfigMap renders `bcrypt(pass, 6)` for each
user in `var.mailserver_accounts`. bcrypt generates a fresh salt on
every evaluation → the ConfigMap `data` hash line differs every plan
run. `ignore_changes = [data["postfix-accounts.cf"]]` was the pragmatic
workaround, but the side-effect wasn't documented: a Vault rotation of
a mailserver password would be MASKED by ignore_changes — TF would
never push the new hash and the pod would keep accepting the old
password until manual taint/replace.
Addresses bd code-7ns.
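To see the non-determinism concretely, a stand-in demo with the Python
`bcrypt` package (Terraform's `bcrypt()` likewise generates a fresh salt
on every evaluation):
```
import bcrypt

h1 = bcrypt.hashpw(b"hunter2", bcrypt.gensalt(rounds=6))
h2 = bcrypt.hashpw(b"hunter2", bcrypt.gensalt(rounds=6))
assert h1 != h2                        # fresh salt -> new hash every plan
assert bcrypt.checkpw(b"hunter2", h1)  # yet both verify the same password
assert bcrypt.checkpw(b"hunter2", h2)
```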
## This change
Inline comment on the lifecycle block spelling out:
- Why ignore_changes exists (non-deterministic bcrypt)
- What the invariant costs (masks automatic rotation)
- Why it's acceptable TODAY (no automatic rotation for
mailserver_accounts — verified in Vault; manual password change is a
manual TF run anyway)
- Two concrete alternatives if rotation is ever added:
(a) deterministic bcrypt with stable per-user salt
(b) render from an ESO-synced K8s Secret
No code change, no apply needed — this is a comment-only commit. The
decision (live-with + document) is one of the three options in the plan.
## What is NOT in this change
- Deterministic hashing (not needed until automatic rotation exists)
- ESO-driven Secret (same reason)
- Removal of ignore_changes (would cause the original drift flap)
## Test Plan
### Automated
```
$ cd stacks/mailserver && /home/wizard/code/infra/scripts/tg plan
# no diff expected on this comment-only change; other drift remains
# but is pre-existing and out of scope.
```
### Manual Verification
Read the new comment block at `stacks/mailserver/modules/mailserver/
main.tf` around the postfix-accounts-cf lifecycle — comprehensible
without session context.
Closes: code-7ns
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Dovecot's `dovecot.cf` block previously set only
`mail_max_userip_connections = 50`. No equivalent of the SMTP rate
limit existed for IMAP auth — brute-force against IMAP/POP auth was
throttled only by CrowdSec at the LB level. Adding an in-process
auth delay is cheap defense in depth. Addresses code-9mi.
## This change
Adds `auth_failure_delay = 5s` to the dovecot.cf ConfigMap key.
Each failed auth attempt pauses 5s before responding; a sequential
1000-entry dictionary attack stretches from <1s to ~83 min (1000 × 5 s),
giving CrowdSec's ban window ample time to engage.
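The same check is scriptable with stdlib `imaplib` (dummy credentials;
the printed time should land near 5 s once the delay is active):
```
import imaplib, time

t0 = time.time()
try:
    imap = imaplib.IMAP4_SSL("mail.viktorbarzin.me", 993)
    imap.login("bogus@viktorbarzin.me", "wrongpass")
except imaplib.IMAP4.error:
    pass
print(f"auth failure took {time.time() - t0:.1f}s")  # ~5s with the delay
```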
## What is NOT in this change
- `login_processes_count` tuning (workload doesn't warrant it yet)
- Equivalent SMTP AUTH delay (CrowdSec already covers, and SMTP AUTH
is rate-limited via `smtpd_client_connection_rate_limit`)
## Test Plan
### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
doveconf -n | grep -E 'auth_failure|mail_max_userip'
auth_failure_delay = 5 secs
mail_max_userip_connections = 50
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```
### Manual Verification
1. `openssl s_client -connect mail.viktorbarzin.me:993`
2. `a1 LOGIN bogus@viktorbarzin.me wrongpass` — expect ~5s delay before `NO [AUTHENTICATIONFAILED]`
3. Fire 5 failed attempts rapidly: total ≥25s
## Reproduce locally
1. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- doveconf -n | grep auth_failure`
2. Expected: `auth_failure_delay = 5 secs`
Closes: code-9mi
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
docker-mailserver 15.0.0's default Postfix config does NOT set
`smtpd_tls_auth_only = yes`. Clients that skip STARTTLS on port 587
(or 25 with AUTH) can send PLAIN/LOGIN creds in cleartext. CrowdSec
and rate limiting don't catch this — it's an auth-path leak, not a
bruteforce. Addresses bd code-vnw.
## This change
Adds `smtpd_tls_auth_only = yes` to `postfix_cf` (applied via the
`postfix-main.cf` ConfigMap key consumed by docker-mailserver).
Rolled the pod to pick up the new ConfigMap.
### Deviation from task spec
code-vnw's fix field cited `smtpd_sasl_auth_only = yes`. That is NOT
a real Postfix parameter — attempting it gets
`postconf: warning: smtpd_sasl_auth_only: unknown parameter`. The
acceptance test (reject PLAIN auth before STARTTLS) is satisfied by
`smtpd_tls_auth_only`, which is the correct knob. Added an inline
comment noting the common confusion.
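The manual steps below can also be scripted with stdlib `smtplib`; the
base64 blob is a dummy `AUTH PLAIN` token (NUL bogus NUL wrongpass):
```
import smtplib

s = smtplib.SMTP("mail.viktorbarzin.me", 587)
s.ehlo()
# AUTH before STARTTLS must now be refused outright.
code, resp = s.docmd("AUTH", "PLAIN AGJvZ3VzAHdyb25ncGFzcw==")
print(code, resp)  # expect 503 ... authentication not enabled
s.starttls()
s.ehlo()           # after STARTTLS the server advertises AUTH again
s.quit()
```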
## What is NOT in this change
- Per-service override in master.cf (smtpd_tls_auth_only applied
globally, which is safe because port 25 doesn't accept AUTH here)
- Other Postfix hardening (sender_restrictions, etc.)
## Test Plan
### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
postconf smtpd_tls_auth_only
smtpd_tls_auth_only = yes
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```
### Manual Verification
1. `openssl s_client -connect mail.viktorbarzin.me:587 -starttls smtp`
2. At prompt, send `AUTH PLAIN <base64>` BEFORE `STARTTLS`
3. Expected: Postfix rejects with `503 5.5.1 Error: authentication not enabled`
4. Follow-up: STARTTLS first, then `AUTH PLAIN <base64>` — succeeds for valid creds
## Reproduce locally
1. From a shell with `kubectl` access to the cluster:
2. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- postconf smtpd_tls_auth_only`
3. Expected: `smtpd_tls_auth_only = yes`
Closes: code-vnw
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
`viktorbarzin/dovecot_exporter:latest` was consumed with `IfNotPresent`
pull, which means whichever node landed the pod kept whatever digest
was cached from an earlier pull. A SHA-level pin is the reproducibility
baseline this repo uses for every other home-built image
(`headscale`, `excalidraw`, `linkwarden`).
## This change
- Pins `dovecot-exporter` container image to
`viktorbarzin/dovecot_exporter@sha256:1114224c...` — the digest the
pod is actually running today (captured from live `imageID`).
- Enables Diun tag watching on the mailserver Deployment
(`diun.enable=true`, `diun.include_tags=^latest$`) so new `:latest`
digests trigger a notification rather than silently landing on the
next `IfNotPresent` miss.
Deviation from task spec (code-cno): the task asked for an 8-char SHA
*tag*, but Docker Hub only publishes `:latest` for this image — a SHA
tag doesn't exist. Used the digest-pin pattern already established at
`stacks/headscale/modules/headscale/main.tf:204` instead; Diun watches
the `:latest` tag for drift, which is the equivalent notification.
## What is NOT in this change
- Volume-mount ordering drift on `kubernetes_deployment.mailserver`
(pre-existing; tolerated by Waves 1+2).
- Splitting the metrics port into its own Service (code-izl).
## Test Plan
### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver \
-o jsonpath='{.items[0].spec.containers[*].image}'
docker.io/mailserver/docker-mailserver:15.0.0 \
viktorbarzin/dovecot_exporter@sha256:1114224c9bf0261ca8e9949a6b42d3c5a2c923d34ca4593f6b62f034daf14fc5
$ kubectl get deployment -n mailserver mailserver \
-o jsonpath='{.spec.template.metadata.annotations}'
{"diun.enable":"true","diun.include_tags":"^latest$"}
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```
### Manual Verification
1. Push a new `:latest` digest to the exporter image (or wait for one).
2. Check Diun notifier output: a tag event for `^latest$` should fire.
3. `kubectl describe deployment/mailserver -n mailserver` shows the
digest pin unchanged until someone rebumps it.
## Reproduce locally
1. `kubectl -n mailserver get pod -l app=mailserver -o yaml | \
grep -A1 dovecot_exporter`
2. Expected: `image: viktorbarzin/dovecot_exporter@sha256:1114224c...`.
Closes: code-cno
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Port 9166 (`dovecot-metrics`) is exposed on the mailserver Service but
nothing was scraping it. Added a static `mailserver-dovecot` scrape job
to `extraScrapeConfigs` (we run `prometheus-community/prometheus`, not
`kube-prometheus-stack`, so no ServiceMonitor CRDs are available).
Two alerts in a new `Mailserver Dovecot` rule group:
- `DovecotConnectionsNearLimit` fires at ≥42/50 IMAP connections for
  5m (84% of `mail_max_userip_connections = 50`).
- `DovecotExporterDown` fires if the scrape target is unreachable
for 10m (catches pod restarts + network issues).
Originally drafted as `kubernetes_manifest` ServiceMonitor + PrometheusRule
on the `mailserver-beta1` branch; that commit was abandoned because the
CRDs aren't installed. This path is functionally equivalent and plans
cleanly.
Closes: code-61v
## Context
The @viktorbarzin.me catch-all routes to spam@viktorbarzin.me. The
mailbox had no retention policy. On 2026-04-18 it held 519 messages
consuming 43 MiB. Without a policy, the only brake on growth was
manual deletion, which has not been happening - hence the bd task.
Viktor's explicit constraint when filing code-oy4: DO NOT blind
age-expunge. We need targeted retention that keeps genuine forwarded
human mail for a long time while shedding the recurring-newsletter
cruft that dominates the byte count.
## Profile findings (2026-04-18, verified on the live pod)
Total: 519 messages, 43 MiB, 0 in new/, 0 in tmp/.
Top senders by volume:
138 dan@tldrnewsletter.com
51 hi@ratepunk.com
40 uber@uber.com
35 truenas@viktorbarzin.me
19 ubereats@uber.com
15 hello@travel.jacksflightclub.com
12 chris@chriswillx.com
10 me@viktorbarzin.me
Top senders by storage bytes:
8,176,481 dan@tldrnewsletter.com (19 % of 43 MiB alone)
2,866,104 uber@uber.com
2,207,458 noreply@mail.selfh.st
2,066,094 hi@ratepunk.com
1,675,435 ubereats@uber.com
Age distribution:
97 % older than 14 days (502 / 519)
23 % older than 90 days (121 / 519)
Automated-sender markers:
66 % carry List-Unsubscribe: (342 / 519)
4 % carry Precedence: bulk|list|junk ( 21 / 519)
34 % carry neither marker (= human-ish tail) (177 / 519)
Combined "automated AND >14d": 328 messages -> target of rule 1.
## Retention strategy
Signed off by Viktor 2026-04-18. Two rules, both delete-leaf:
1. Older than 14 days AND header matches one of:
- `^List-Unsubscribe:`
- `^Precedence:\s*(bulk|list|junk)`
- `^Auto-Submitted:\s*auto-`
-> DELETE.
Rationale: these markers are the RFC-agreed indicators of bulk /
robotic senders. A 14-day window still lets genuine subscription
alerts (delivery, flight, calendar invite) come to attention.
2. Older than 90 days AND no automated marker at all
-> DELETE.
Rationale: these are long-tail forwards from real people to the
catch-all. 90 days is deliberately generous - I would rather
leak bytes than lose Viktor's personal correspondence.
3. Everything else -> KEEP (recent traffic, or aged human tail
younger than 90d).
## Implementation
A `kubernetes_cron_job_v1.spam_retention` running every 4h (at :17
past) that `kubectl exec`s a Python retention script into the
mailserver pod.
Why kubectl exec and not a sibling CronJob with the Maildir mounted:
mailserver-data-encrypted is a RWO volume held by the mailserver
pod. A sibling would fail to attach. The nextcloud-watchdog pattern
in stacks/nextcloud/main.tf already solves this for a similar
"interact with the live pod on a schedule" shape. Mirrored here with
its own SA + Role + RoleBinding scoped to list/get pods and create
pods/exec in the mailserver namespace only.
Why Python and not pure shell: POSIX `find + stat + awk` struggles
with the header-scan-up-to-blank-line rule, and `stat -c` is
GNU-specific anyway. The script reads each message's first 64 KiB,
stops at the first blank line, scans headers only, then checks mtime
(sketched below).
The CronJob streams the Python source via `kubectl exec -i ... --
python3 - <<PYEOF`. After the retention pass, `doveadm force-resync
-u spam@viktorbarzin.me INBOX/spam` refreshes Dovecot's cached index
so the deletions appear in IMAP immediately instead of after the next
pod restart.
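A condensed sketch of the per-message verdict; names are illustrative,
and the real script additionally counts per-rule metrics and is streamed
over `kubectl exec`:
```
import os, re, time

MARKERS = re.compile(
    rb"^(List-Unsubscribe:|Precedence:\s*(bulk|list|junk)|Auto-Submitted:\s*auto-)",
    re.IGNORECASE | re.MULTILINE,
)
NOW, DAY = time.time(), 86400

def headers_of(path: str) -> bytes:
    # First 64 KiB only; headers end at the first blank line.
    with open(path, "rb") as f:
        blob = f.read(64 * 1024)
    return blob.split(b"\r\n\r\n", 1)[0].split(b"\n\n", 1)[0]

def verdict(path: str) -> str:
    age_days = (NOW - os.stat(path).st_mtime) / DAY
    automated = bool(MARKERS.search(headers_of(path)))
    if automated and age_days > 14:
        return "delete_auto"    # rule 1: bulk marker, older than 14d
    if not automated and age_days > 90:
        return "delete_human"   # rule 2: human-ish tail, older than 90d
    return "keep"               # rule 3
```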
Includes the standard KYVERNO_LIFECYCLE_V1 marker on the CronJob so
Kyverno ndots mutation does not cause perpetual drift.
## What is NOT in this change
- Dovecot sieve rules (no sieve infrastructure exists in the module;
the plan file's fallback option was precisely this CronJob path).
- Push of retention metrics to Pushgateway - the script prints them
to the job log for now; plumbing Pushgateway is a follow-up if
Viktor wants alerts.
- Any touch of other mailboxes - only `/var/mail/viktorbarzin.me/spam/cur`
is walked.
- Any mailserver pod restart or config reload.
## Test plan
### Automated
`terraform fmt` + `terragrunt hclfmt` pass. `scripts/tg plan` on the
mailserver stack shows:
Plan: 7 to add, 3 to change, 0 to destroy.
Of the 7 adds, 4 are mine (SA + Role + RoleBinding + CronJob). The
other 3 adds belong to the concurrent roundcube-backup CronJob +
nfs_roundcube_backup_host PV + PVC already on master in parallel.
The 3 in-place updates are pre-existing drift on the mailserver
Deployment, Service and email_roundtrip_monitor CronJob, not
introduced by this change.
### Manual Verification
After `scripts/tg apply` lands the CronJob:
1. Trigger an immediate run:
`kubectl -n mailserver create job --from=cronjob/spam-retention manual-1`
2. Wait for completion, read the log:
`kubectl -n mailserver logs job/manual-1`
-> expected tail:
spam_retention_scanned_total <N>
spam_retention_auto_deleted_total <M>
spam_retention_human_deleted_total <H>
spam_retention_kept_total <K>
spam_retention_errors_total 0
Retention pass complete
3. Confirm mailbox shrunk:
`kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \
-- du -sh /var/mail/viktorbarzin.me/spam/`
-> expected: well below 43 MiB within one run (bulk rule alone
purges ~328 messages per the profile numbers above).
4. Confirm IMAP reflects the deletions:
`kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \
-- doveadm mailbox status -u spam@viktorbarzin.me messages INBOX/spam`
-> expected: message count dropped accordingly.
5. 4 hours later, confirm the next scheduled run logs a much
smaller scan count and 0 deletions (nothing new crossed the
threshold).
Closes: code-oy4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Roundcube webmail runs with two encrypted RWO PVCs (see roundcubemail.tf:
`roundcubemail-html-encrypted`, `roundcubemail-enigma-encrypted`). These
carry user-visible state that is NOT regenerable without user action:
- `html` PVC → Apache docroot, plugin installs, skin overrides, session
artefacts (two_factor_webauthn keys, persistent_login tokens, rcguard
throttle state)
- `enigma` PVC → user-uploaded PGP private keyrings
Per the subdir CLAUDE.md "Storage & Backup Architecture" rule every
proxmox-lvm* PVC MUST have a backup CronJob writing to NFS
`/mnt/main/<app>-backup/`. Mailserver already complies via code-z26's
`mailserver-backup` CronJob; Roundcube does not. Losing either Roundcube
PVC means users must re-add 2FA devices, re-install plugins, and
re-import PGP keys — none of it recoverable from a database dump.
Target task: `code-1f6`.
## This change
- Adds `module.nfs_roundcube_backup_host` sourcing
`modules/kubernetes/nfs_volume` pointed at
`/srv/nfs/roundcube-backup` on the Proxmox host (NFSv4, inotify
change-tracker picks it up for Synology offsite).
- Adds `kubernetes_cron_job_v1.roundcube-backup`:
- Schedule `10 3 * * *` — 10 minutes after `mailserver-backup`
(`0 3 * * *`) to avoid NFS write-window contention. Roundcube PVCs
are tiny (<200 MiB combined on current cluster) so the window is
well under 10 min.
- `pod_affinity` on `app=roundcubemail` (Roundcube runs 1 replica with
  the `Recreate` strategy and can land on a different node each rollout;
  the backup pod must co-locate because both PVCs are RWO).
- `rsync -aH --delete --link-dest=/backup/<prev-week>` into
  `/backup/<YYYY-WW>/{html,enigma}/` — hardlinks unchanged files vs
  the previous weekly snapshot, keeping storage cost ≈ delta only
  (sketched after this list).
- Weekly rotation retains 8 snapshots (~2 months), matching
`mailserver-backup`.
- Pushgateway metrics under `job=roundcube-backup` so existing
`BackupDurationHigh` / `BackupStale` alert patterns detect
regressions without extra wiring.
- `KYVERNO_LIFECYCLE_V1` `ignore_changes` for mutated `dns_config`.
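The shape of the snapshot rsync, as a Python-subprocess sketch for
consistency with the other examples here (source paths are illustrative;
the CronJob runs the shell equivalent):
```
import datetime
import subprocess

week = datetime.date.today().strftime("%Y-%W")
prev = (datetime.date.today() - datetime.timedelta(weeks=1)).strftime("%Y-%W")
for vol in ("html", "enigma"):
    subprocess.run(
        ["rsync", "-aH", "--delete",
         f"--link-dest=/backup/{prev}/{vol}/",   # hardlink unchanged files
         f"/data/{vol}/",                        # RWO PVC mount (illustrative)
         f"/backup/{week}/{vol}/"],
        check=True,
    )
```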
## Layout
```
NFS server 192.168.1.127:/srv/nfs/
├── mailserver-backup/ (0 3 * * * — code-z26)
│ └── <YYYY-WW>/{data,state,log}/
└── roundcube-backup/ (10 3 * * * — this change)
└── <YYYY-WW>/{html,enigma}/
```
## What is NOT in this change
- Changing the mailserver-backup CronJob to also cover Roundcube. Two
separate CronJobs keep the concerns (and pod anti-affinity/affinity)
clean; the 10-min stagger eliminates the contention justification for
merging them.
- Retention alerting tuning — existing Pushgateway/Prometheus rule
ecosystem suffices for now.
- Restore tooling — follows the standard pattern in
`docs/runbooks/` (rsync back, fix perms).
## Reproduce locally
1. Plan: `cd stacks/mailserver && scripts/tg plan -lock=false` →
2 new resources (nfs_volume module + CronJob).
2. Apply, then trigger a one-shot run:
`kubectl -n mailserver create job --from=cronjob/roundcube-backup roundcube-backup-manual-1`
3. Expected on success:
- `kubectl -n mailserver logs job/roundcube-backup-manual-1` → "=== Backup IO Stats ===".
- On Proxmox host:
`ls /srv/nfs/roundcube-backup/$(date +%Y-%W)/` → `html`, `enigma`.
- `/mnt/backup/.nfs-changes.log` (Proxmox) lists fresh paths under
`roundcube-backup/` within ~1s of the rsync finishing.
- Pushgateway: `curl -s prometheus-prometheus-pushgateway.monitoring:9091/metrics | grep roundcube`
shows `backup_duration_seconds`, `backup_last_success_timestamp`.
## Automated
- `terraform fmt -check -recursive stacks/mailserver/modules/mailserver/` → clean.
- `scripts/tg plan -lock=false` in stacks/mailserver expected to show
`+ module.nfs_roundcube_backup_host.*`, `+ kubernetes_cron_job_v1.roundcube-backup`.
Closes: code-1f6
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The e2e email-roundtrip probe (CronJob `email-roundtrip-monitor`) currently
wraps `requests.put(PUSHGATEWAY, ...)` and `requests.get(UPTIME_KUMA, ...)`
in bare `try/except` that only prints "Failed to push ..." on error. If
Pushgateway is transiently unreachable (e.g., during a Prometheus Helm
upgrade / HPA scale-down / brief network blip) metrics silently drop and
downstream detection relies entirely on `EmailRoundtripStale` firing after
60 min of staleness. Single transient failures masquerade as data-plane
breakage for up to an hour.
Target task: `code-n5l` — Add retry to probe Pushgateway + Uptime Kuma pushes.
## This change
- Extracts a `push_with_retry(label, func, url)` helper (sketched after
  this list) that performs 3 attempts with exponential backoff (1s, 2s,
  4s). Treats HTTP 2xx as success, everything else as failure. On final
  failure, logs an explicit `ERROR:` line to stderr with the URL and
  either the last HTTP status or the exception repr — matching the
  existing `print(...)` logging style used throughout the heredoc (no
  stdlib `logging` dependency added).
- Replaces the two inline `try/requests.put/except print` blocks with
calls to the helper. Pushgateway runs unconditionally; Uptime Kuma
still only runs on round-trip success (same as before).
- Makes exit code responsive to push outcome: probe exits non-zero when
the round-trip itself failed (unchanged), OR when BOTH pushes failed
all retries on the success path. Single-endpoint push failure with the
other succeeding keeps exit 0 — partial observability is preferred
over noisy pod restarts from Kubernetes' Job controller.
## Behavior matrix
```
roundtrip | pushgw | kuma | exit | rationale
----------+--------+------+------+-------------------------------
success | ok | ok | 0 | happy path (unchanged)
success | fail | ok | 0 | one endpoint still has telemetry
success | ok | fail | 0 | one endpoint still has telemetry
success | fail | fail | 1 | NEW — total observability loss
fail | ok | - | 1 | roundtrip failed (unchanged, Kuma skipped)
fail | fail | - | 1 | roundtrip failed (unchanged, Kuma skipped)
```
## What is NOT in this change
- Alert thresholds (`EmailRoundtripStale` still 60m) — explicitly out of
scope per the task description.
- `logging` stdlib adoption — rest of heredoc uses `print`, staying
consistent.
- Moving the heredoc out of `main.tf` into a sidecar Python file —
separate refactor.
## Reproduce locally
1. Point PUSHGATEWAY at a black hole:
`kubectl -n mailserver set env cronjob/email-roundtrip-monitor \`
`PUSHGATEWAY=http://nope.invalid:9091/metrics/job/test`
2. Trigger a one-shot job:
`kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-test`
3. Expected in logs:
- 3 attempts, each ~1s/2s/4s apart
- `ERROR: Failed to push to Pushgateway after 3 attempts: url=... exception=...`
- Uptime Kuma push still succeeds (round-trip ok) → exit 0
4. Flip UPTIME_KUMA_URL to also fail (edit heredoc or DNS-poison): expect
exit 1 + two ERROR lines.
## Automated
- `python3 -c "import ast; ast.parse(open('/tmp/probe.py').read())"` → OK
(heredoc extracts cleanly).
- `terraform fmt -check -recursive modules/mailserver/` → no diff.
Closes: code-n5l
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
`infra/stacks/mailserver/modules/mailserver/variables.tf` carried a
130-line historical scaffolding variable
`postfix_cf_reference_DO_NOT_USE` containing a reference copy of an
older Postfix main.cf layout. The variable name itself signalled
dead-code intent ("DO_NOT_USE"), and a repo-wide
`grep -rn postfix_cf_reference infra/` confirmed zero consumers — no
module, no stack, no script, no doc ever referenced it. Carrying dead
Terraform variables costs nothing at runtime but wastes reviewer
attention on every `git blame` and drives up `variables.tf` read time.
Trade-offs considered:
- Keep it "just in case" → rejected; the file it mirrored
  (`/usr/share/postfix/main.cf.dist`) is already canonical upstream and
  reproducible inside any docker-mailserver container.
- Move it to a comment block → rejected; same noise cost, no value
  over deletion (authoritative source is in the image).
Note on history: the prior commit 09c11056 landed with an identical
title ("Delete postfix_cf_reference_DO_NOT_USE dead code") but
actually committed `docs/runbooks/mailserver-proxy-protocol.md` —
fallout from a race between two concurrent mailserver sessions that
staged files in parallel. That commit accidentally closed this beads
task via the `Closes:` trailer without performing the deletion. This
commit does the actual deletion that was originally intended for
code-o3q. The runbook from 09c11056 is legitimate work for code-rtb
and is left in place.
## This change
Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }`
block (136 lines incl. trailing blank). No other variable touched, no
resource touched, no comment elsewhere touched. `variables.tf` now
contains only the live `postfix_cf` variable that is actually consumed
by the module.
## What is NOT in this change
- No Terraform state modification — variable was never read, so state
has no record of it.
- No Postfix runtime behaviour change — `postfix_cf` (the live one) is
untouched.
- No fix for the pre-existing `kubernetes_deployment.mailserver` /
`kubernetes_service.mailserver` drift that `terragrunt plan` surfaces
independently. Those 2 in-place updates are known and tracked
separately.
- No apply needed — pure source hygiene.
## Test Plan
### Automated
Reference check before edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" {
```
(single match — the declaration itself)
Reference check after edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
(no matches)
```
`terragrunt validate` (from `infra/stacks/mailserver/`):
```
Success! The configuration is valid, but there were some
validation warnings as shown above.
```
(warnings are pre-existing `kubernetes_namespace` -> `_v1` deprecation
notices, unrelated)
`terragrunt plan` (from `infra/stacks/mailserver/`):
```
# module.mailserver.kubernetes_deployment.mailserver will be updated in-place
# module.mailserver.kubernetes_service.mailserver will be updated in-place
Plan: 0 to add, 2 to change, 0 to destroy.
```
Both in-place updates are the known pre-existing drift (volume_mount
ordering + stale `metallb.io/ip-allocated-from-pool` annotation). No
change is attributable to this commit — the dead variable was never
referenced.
### Manual Verification
1. `cd infra/stacks/mailserver/modules/mailserver/`
2. `grep -c postfix_cf_reference variables.tf` -> expected `0`
3. `wc -l variables.tf` -> expected `39` (was `175`; 136 lines removed)
4. `cd ../..` -> `terragrunt validate` -> expected `Success!`
5. `terragrunt plan` -> expected `Plan: 0 to add, 2 to change, 0 to
destroy.` (pre-existing drift only).
Closes: code-o3q
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
mail.viktorbarzin.me exposed the Roundcube login page directly: requests
hit Traefik → CrowdSec + anti-AI middleware → Roundcube. The `ingress_factory`
call in `roundcubemail.tf` omitted `protected = true`, so the Authentik
ForwardAuth middleware was never wired up. Project rule
(`infra/.claude/CLAUDE.md`): ingresses should be `protected = true` unless
there is a specific reason to leave them open. Credentialed surfaces (login
pages) have no reason to skip the OIDC gate — CrowdSec alone is a behavioural
signal, not an identity gate.
Trade-off accepted by Viktor on 2026-04-18: webmail now requires two logins
(Authentik SSO, then Roundcube IMAP auth against dovecot). This is tolerable
for a low-volume personal webmail; mail clients (Thunderbird, phone Mail)
bypass the webmail entirely and speak IMAPS/SMTP directly against
`mail.viktorbarzin.me` on the MetalLB service IP (10.0.20.202), which is a
separate path and MUST stay open.
## This change
Single-line flip: `protected = true` added to the `ingress_factory` call in
`stacks/mailserver/modules/mailserver/roundcubemail.tf`.
The factory (`modules/kubernetes/ingress_factory/main.tf`) responds to the
flag by:
1. Appending `traefik-authentik-forward-auth@kubernetescrd` to the ingress
`router.middlewares` annotation — Traefik then hands each request to
the Authentik outpost before forwarding to Roundcube.
2. Flipping `effective_anti_ai` from true → false (logic:
`anti_ai_scraping != null ? … : !var.protected`), which removes the two
anti-AI middlewares. Rationale in the factory: a login-gated resource
is already invisible to unauthenticated scrapers, so the robots/noai
middleware chain is redundant.
Request path before vs after:
Before: Client → Traefik → [retry, error-pages, rate-limit, csp,
crowdsec, ai-bot-block, anti-ai-headers]
→ Roundcube (200 on /)
After: Client → Traefik → [retry, error-pages, rate-limit, csp,
crowdsec, authentik-forward-auth]
→ if unauth: 302 to authentik.viktorbarzin.me
→ if auth: Roundcube (login form)
## What is NOT in this change
- The `mailserver` Service (MetalLB IP 10.0.20.202) is untouched. IMAPS
(993), SMTPS (465), SMTP-Submission (587) continue to bypass Traefik
entirely and speak directly to dovecot/postfix. Mail clients are
unaffected.
- Pre-existing drift on `kubernetes_deployment.mailserver` (volume_mount
ordering) and `kubernetes_service.mailserver` (stale metallb annotation)
is left alone — out of scope per bd-bmh. Apply was scoped with
`-target=` to the ingress resource only.
- No Authentik app/provider Terraform was touched — the `mail.*` ingress
is already covered by the existing wildcard Authentik proxy outpost on
`*.viktorbarzin.me` (standard pattern).
## Test Plan
### Automated
Baseline (before apply):
$ curl -sI https://mail.viktorbarzin.me/ | head -2
HTTP/2 200
alt-svc: h3=":443"; ma=2592000
$ openssl s_client -connect mail.viktorbarzin.me:993 < /dev/null 2>&1 \
| grep -E 'CONNECTED|subject='
CONNECTED(00000003)
subject=CN = viktorbarzin.me
After apply:
$ curl -sI https://mail.viktorbarzin.me/ | head -3
HTTP/2 302
alt-svc: h3=":443"; ma=2592000
location: https://authentik.viktorbarzin.me/application/o/authorize/?client_id=…
$ openssl s_client -connect mail.viktorbarzin.me:993 < /dev/null 2>&1 \
| grep -E 'CONNECTED|subject='
CONNECTED(00000003)
subject=CN = viktorbarzin.me
Middleware annotation on the ingress:
$ kubectl get ingress -n mailserver mail \
-o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}'
traefik-retry@kubernetescrd,traefik-error-pages@kubernetescrd,
traefik-rate-limit@kubernetescrd,traefik-csp-headers@kubernetescrd,
traefik-crowdsec@kubernetescrd,traefik-authentik-forward-auth@kubernetescrd
Terraform apply (targeted):
$ scripts/tg apply --non-interactive \
-target=module.mailserver.module.ingress.kubernetes_ingress_v1.proxied-ingress
…
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
### Manual Verification
1. In a private browser window, navigate to https://mail.viktorbarzin.me/
2. Expected: redirected to Authentik SSO login (not Roundcube)
3. Authenticate with Authentik credentials
4. Expected: redirected back and shown the Roundcube IMAP login form
5. Enter IMAP credentials (same as before the change)
6. Expected: Roundcube inbox loads normally
7. Separately, verify a mail client (Thunderbird, phone Mail) still
connects to IMAPS on mail.viktorbarzin.me:993 and SMTP on :587 without
any Authentik prompt — that path hits MetalLB 10.0.20.202 directly.
## Reproduce locally
1. cd infra/stacks/mailserver
2. vault login -method=oidc
3. scripts/tg plan
Expected: 0 to add, 3 to change, 0 to destroy. Relevant change is the
`router.middlewares` annotation on
`module.ingress.kubernetes_ingress_v1.proxied-ingress` swapping the
two anti-AI middlewares for `traefik-authentik-forward-auth`. The
other 2 changes are pre-existing drift (volume_mounts, metallb
annotation) and are out of scope.
4. scripts/tg apply --non-interactive \
-target=module.mailserver.module.ingress.kubernetes_ingress_v1.proxied-ingress
5. curl -sI https://mail.viktorbarzin.me/ — expect HTTP/2 302 to
authentik.viktorbarzin.me
Closes: code-bmh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Mailgun was decommissioned on 2026-04-12 in favour of Brevo as the outbound
SMTP relay. The DMARC aggregate (`rua`) and forensic (`ruf`) report targets
still pointed at `e21c0ff8@dmarc.mailgun.org`, an inbox that no longer
exists — meaning every DMARC report Google/Microsoft/etc. generates has
bounced or been silently dropped for six days. No alerts fire on this
(DMARC reports are best-effort, not RFC-mandated), but we've lost visibility
into alignment failures and spoofing attempts during the exact window where
the SPF/DKIM/DMARC posture was being reshaped for the Brevo cutover.
Decision (2026-04-18): route reports to `mailto:dmarc@viktorbarzin.me`.
The mailserver's catch-all sieve delivers anything addressed to a
non-existent local-part into `spam@`, so `dmarc@` does not need to be
provisioned as a real mailbox — the reports will land in `spam@`'s
maildir unchanged.
Alternative considered: route to a dedicated `dmarc@` maildir with sieve
rules to file into a folder. Rejected for now — the monitoring value of
DMARC reports is low-frequency (one aggregate per reporter per day at
most), so the catch-all path is good enough until volume justifies a
proper parser. Can be revisited once we see actual report traffic.
The third-party aggregator target `adb84997@inbox.ondmarc.com` (Red Sift
OnDMARC) is preserved in both rua and ruf — it provides parsed dashboards
that we actually read. The `postmaster@viktorbarzin.me` ruf-only target
also stays as a local mirror.
As a side effect, this apply also canonicalises the TXT record: the
previous value was stored as a two-string split in Cloudflare state
(`...viktorbarzin" ".me;"`) due to the 255-byte TXT string limit
(the record length exceeded 255 chars). The new value is shorter
(dmarc@viktorbarzin.me is 21 chars vs e21c0ff8@dmarc.mailgun.org's
26 chars, doubled across rua and ruf) and fits in a single string,
so the provider serialises it as one string and the prior split-drift
noise disappears from future plans.
## This change
Single-line content edit on `cloudflare_record.mail_dmarc` in
`stacks/cloudflared/modules/cloudflared/cloudflare.tf`:
Before → After (rua and ruf, both):
```
mailto:e21c0ff8@dmarc.mailgun.org → mailto:dmarc@viktorbarzin.me
```
All other DMARC tags unchanged: `v=DMARC1`, `p=quarantine`, `pct=100`,
`fo=1`, `ri=3600`, `sp=quarantine`, `adkim=r`, `aspf=r`.
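For orientation, the edited resource has roughly this shape (a sketch — the `zone_id` wiring is assumed; only `content` changes in this commit, and its value matches the post-apply dig output below):
```
resource "cloudflare_record" "mail_dmarc" {
  zone_id = var.cloudflare_zone_id # assumed wiring
  name    = "_dmarc"
  type    = "TXT"
  content = "v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; adkim=r; aspf=r; rua=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com; ruf=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com,mailto:postmaster@viktorbarzin.me;"
}
```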
Delivery flow:
```
DMARC reporter (Gmail/Outlook/...)
│ aggregate XML.gz to rua / forensic to ruf
▼
dmarc@viktorbarzin.me
│ mailserver catch-all (no local recipient)
▼
spam@viktorbarzin.me (Viki's mailbox)
```
## What is NOT in this change
- **Mailbox sieve rules** to file DMARC reports into a dedicated folder
(separate concern; deferred until traffic justifies it).
- **DMARC parser / dashboard**. OnDMARC (adb84997@inbox.ondmarc.com)
already provides this for aggregate reports.
- **Policy tightening** (`p=reject`, `pct` ramp) — out of scope.
- **SPF / DKIM records** — not touched.
- **Removal of the split-string drift suppression**, if any existed in
prior work. The canonicalisation happens naturally on this apply;
no separate workaround was needed.
## Test Plan
### Automated
Targeted terragrunt plan + apply via `scripts/tg`:
```
$ cd stacks/cloudflared && scripts/tg plan \
-target=module.cloudflared.cloudflare_record.mail_dmarc
...
Terraform will perform the following actions:
# module.cloudflared.cloudflare_record.mail_dmarc will be updated in-place
~ resource "cloudflare_record" "mail_dmarc" {
~ content = "\"v=DMARC1; ...
rua=mailto:e21c0ff8@dmarc.mailgun.org,
mailto:adb84997@inbox.ondmarc.com; ...
ruf=mailto:e21c0ff8@dmarc.mailgun.org,
mailto:adb84997@inbox.ondmarc.com,
mailto:postmaster@viktorbarzin\" \".me;\""
-> "\"v=DMARC1; ...
rua=mailto:dmarc@viktorbarzin.me,
mailto:adb84997@inbox.ondmarc.com; ...
ruf=mailto:dmarc@viktorbarzin.me,
mailto:adb84997@inbox.ondmarc.com,
mailto:postmaster@viktorbarzin.me;\""
}
Plan: 0 to add, 1 to change, 0 to destroy.
$ scripts/tg apply /tmp/dmarc.tfplan
module.cloudflared.cloudflare_record.mail_dmarc: Modifying...
module.cloudflared.cloudflare_record.mail_dmarc: Modifications complete after 1s
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```
Authoritative DNS post-apply:
```
$ dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short
"v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; adkim=r; aspf=r; rua=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com; ruf=mailto:dmarc@viktorbarzin.me,mailto:adb84997@inbox.ondmarc.com,mailto:postmaster@viktorbarzin.me;"
```
Note: `dig @1.1.1.1` still served the old value immediately after apply —
Cloudflare's public resolver holds its cache until TTL expires
(TTL=1/auto ≈ 5 min). Authoritative NS is the source of truth.
### Manual Verification
**Setup**: none (DNS change only).
**Commands**:
```
# 1. Confirm authoritative DNS (run now, should pass)
dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short
# Expected: rua=mailto:dmarc@viktorbarzin.me,... and ruf similarly.
# 2. Confirm public resolver catches up (run after ~5min)
dig TXT _dmarc.viktorbarzin.me @1.1.1.1 +short
# Expected: same as above (no more mailgun.org entries).
# 3. Within 24-48h, check Viki's spam@ inbox for an incoming DMARC
# aggregate report from Google/Microsoft/etc. Reports are
# typically .zip or .gz attachments with XML inside.
```
**Interpretation**: seeing a DMARC report land in spam@ proves the
end-to-end delivery path works: reporter DNS lookup → _dmarc.viktorbarzin.me
→ mailto:dmarc@viktorbarzin.me → catch-all → spam@ maildir.
## Reproduce locally
```
1. git pull
2. cd stacks/cloudflared
3. dig TXT _dmarc.viktorbarzin.me @evan.ns.cloudflare.com +short
4. Expected: rua=mailto:dmarc@viktorbarzin.me (and ruf the same).
```
Closes: code-569
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
An audit of the mailserver stack raised the question: why is Fail2ban
disabled in the docker-mailserver deployment? The setting
`ENABLE_FAIL2BAN = "0"` lives in the env ConfigMap at
`stacks/mailserver/modules/mailserver/main.tf:68` with no documented
rationale, which made the decision look accidental rather than
deliberate.
The decision is deliberate: CrowdSec is the cluster-wide bouncer for
SSH, HTTP, and SMTP/IMAP brute-force defence. It already tails
`postfix` + `dovecot` logs via the installed collections and enforces
decisions at the LB/firewall tier with real client IPs preserved by
`externalTrafficPolicy: Local` on the dedicated MetalLB IP. Enabling
Fail2ban in-pod would duplicate that response path — two systems
racing to ban the same offender from different enforcement points,
iptables churn inside the container, and a split audit trail across
two decision stores. User decision 2026-04-18: keep disabled, document
the decision so the next auditor doesn't have to re-derive it.
## This change
Adds a new subsection "Fail2ban Disabled (CrowdSec is the Policy)" to
the Security section of `docs/architecture/mailserver.md`, placed
immediately after the existing CrowdSec Integration block. The
paragraph cites `stacks/mailserver/modules/mailserver/main.tf:68`
(where `ENABLE_FAIL2BAN = "0"` lives) and explains why duplicating the
layer would make things worse, not better. Pure docs — no Terraform
touched.
## Test Plan
### Automated
None — docs-only change. No tests, lint, or type checks apply to
markdown prose.
### Manual Verification
1. `less infra/docs/architecture/mailserver.md` — locate the Security
section; confirm the new "Fail2ban Disabled (CrowdSec is the
Policy)" subsection appears between "CrowdSec Integration" and
"Rspamd".
2. Render on GitHub or via a markdown previewer; confirm the inline
link to `main.tf` resolves and the paragraph reads cleanly.
3. `grep -n 'ENABLE_FAIL2BAN' infra/stacks/mailserver/modules/mailserver/main.tf`
— confirm it still reports the value on line 68, matching the
citation in the doc.
Closes: code-zhn
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Viktor wanted live forwarding from Owntracks to Dawarich so his map
stays in sync without a periodic backfill. The original plan assumed
ot-recorder honoured an `OTR_HTTPHOOK` environment variable — but
Recorder 1.0.1 (latest on Docker Hub as of Aug 2025) has no such
feature:
```
$ kubectl -n owntracks exec deploy/owntracks -- \
strings /usr/bin/ot-recorder | grep -iE 'hook|webhook|http_post'
(no matches)
```
Lua hooks, on the other hand, are first-class: `--lua-script` loads a
file and calls the `otr_hook(topic, _type, data)` function for every
publish. That is the pivot this commit makes.
## This change
Mount a Lua script via ConfigMap and tell ot-recorder to load it:
```
Phone POST /pub ---> Traefik ---> Recorder pod
|
| handle_payload() writes .rec
| otr_hook(topic,_type,data)
| |
| +---> os.execute("curl … &")
| |
| v
| Dawarich /api/v1/owntracks/points
|
+---> HTTP 200 to phone
```
Per-publish cost: one `curl` subprocess, `--max-time 5`, backgrounded
with `&` so it doesn't block the HTTP response to the phone. A
Dawarich 5xx drops exactly one point — the `.rec` write still happens,
so the one-shot backfill Job can always replay it.
`DAWARICH_API_KEY` is injected from K8s Secret `owntracks-secrets`
(sourced from Vault `secret/owntracks.dawarich_api_key` via the
existing `dataFrom.extract` ExternalSecret). The Lua reads it with
`os.getenv()` so the key never lands in Terraform state.
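A sketch of the wiring (ConfigMap name, Dawarich host, and payload fields are illustrative; `otr_hook` is the entry point the recorder actually calls, while `otr_init` and `otr.log` are assumed parts of its Lua API):
```
resource "kubernetes_config_map" "dawarich_hook" {
  metadata {
    name      = "dawarich-hook" # assumed name; mounted at /hook per the logs
    namespace = "owntracks"
  }
  data = {
    "dawarich-hook.lua" = <<-LUA
      function otr_init()
        otr.log("dawarich-bridge: init")
      end

      -- the recorder calls otr_hook for every publish, after the .rec write
      function otr_hook(topic, _type, data)
        if _type ~= "location" then return end
        local key  = os.getenv("DAWARICH_API_KEY") -- from owntracks-secrets
        local body = string.format(
          '{"_type":"location","lat":%s,"lon":%s,"tst":%s,"tid":"%s"}',
          data.lat, data.lon, data.tst, data.tid or "")
        -- backgrounded curl so the phone's HTTP 200 is never blocked
        os.execute(string.format(
          "curl -s --max-time 5 -H 'Content-Type: application/json' -d '%s' " ..
          "'https://dawarich.example.com/api/v1/owntracks/points?api_key=%s' &",
          body, key))
      end
    LUA
  }
}
```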
### Key discoveries in the verification loop (why iteration count > 1)
1. The hook function must be named `otr_hook`, not `hook` (recorder's
`luasupport.c` calls `lua_getglobal(L, "otr_hook")`). The recorder
logs `cannot invoke otr_hook in Lua script` when missing — the
plan's `hook()` naming was wrong.
2. Dawarich's `latitude`/`longitude` scalar columns are legacy and
always NULL; the authoritative geometry is in the `lonlat` PostGIS
column (`ST_AsText(lonlat::geometry)`). Early "it's broken" readings
came from querying the wrong columns.
3. Default Recreate-strategy rollouts cause ~30s 502/503 windows on
the ingress — tolerable, but every apply is visible as an outage
to the phone. Batching edits is important.
## What is NOT in this change
- **Not** OTR_HTTPHOOK. Removed with this commit (dead env var).
- **Not** the one-shot backfill Job — that comes after the phone
buffer has flushed to avoid racing against incoming hook POSTs
(follow-up: code-h2r).
- **Not** Anca's bridge — a second Recorder instance or a smarter
hook is needed to route her posts under her own Dawarich api_key
(follow-up: code-72g).
- No Ingress or Service change — Commit 1 (`a21d4a44`) already landed
those.
## Test Plan
### Automated
```
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 1 added, 1 changed, 0 destroyed.
$ kubectl -n owntracks logs deploy/owntracks --tail=5
+ initializing Lua hooks from `/hook/dawarich-hook.lua'
+ dawarich-bridge: init
+ HTTP listener started on 0.0.0.0:8083, without browser-apikey
...
+ dawarich-bridge: tst=1 lat=0 lon=0 ok=true
```
### Manual Verification
```
$ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
$ TST=$(date +%s)
$ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \
curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \
-H 'Content-Type: application/json' \
-H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \
-d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \
https://owntracks.viktorbarzin.me/pub
HTTP 200
$ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \
psql -U postgres -d dawarich -c \
"SELECT timestamp, ST_AsText(lonlat::geometry) FROM points \
WHERE user_id=1 AND timestamp=$TST"
timestamp | st_astext
------------+-------------------------
1776555707 | POINT(-0.1278 51.5074)
```
Real phone traffic (from in-flight buffer flush) lands in Dawarich too:
`traefik logs -l app.kubernetes.io/name=traefik | grep 'POST /api/v1/owntracks/points'`
shows ingress POSTs from `owntracks` namespace to `dawarich` backend
with status 200.
### Reproduce locally
1. `vault login -method=oidc`
2. `kubectl -n owntracks logs deploy/owntracks --tail=20` — expect
`dawarich-bridge: init` after the Lua loader line.
3. Do the curl above, poll the DB, expect `POINT(lon lat)`.
Closes: code-z9b
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The mailserver container (Postfix + Dovecot in one pod) had no liveness, readiness, or startup probes declared. If either daemon deadlocked or hung on a socket, Kubernetes had no way to detect the failure and restart the container. The only external canary was the email-roundtrip-monitor CronJob, which runs on a 20-minute interval, giving a detection lag of 20-60 minutes — long enough for real delivery failures to accumulate before an alert fires.
Tracked as bd code-ekf out of the mailserver probe audit. Both port 25 (SMTP) and port 993 (IMAPS) are cheap, reliable up-signals — the existing e2e probe already hits IMAPS, so TCP probes on those ports are a close proxy for user-visible service health without the cost of full SMTP/IMAP handshakes every 10s.
## This change
Adds a readiness_probe (TCP :25, initial_delay=30s, period=10s) and a liveness_probe (TCP :993, initial_delay=60s, period=60s, timeout=15s) to the mailserver deployment's primary container.
Design choices:
- **TCP over exec/HTTP**: the daemons expose no HTTP health endpoint; exec probes would require shelling into the container with auth for SMTP/IMAP banner checks, which is both costly and flaky. TCP accept is sufficient — if Postfix cannot accept a TCP connection on :25 it is unambiguously broken.
- **Split ports per probe**: readiness on :25 (the public SMTP surface — if this is down, external delivery is broken) and liveness on :993 (IMAPS, the other critical daemon — catches Dovecot deadlocks independently of Postfix).
- **30s readiness delay**: Postfix needs ~20-30s to warm up including chroot setup and DKIM key loading; probing earlier would cause bogus NotReady cycles on deploy.
- **60s liveness delay + 60s period + 15s timeout**: generous so transient blips (brief CPU spike, RBL timeout, slow NFS unmount during rotation) do not trigger a restart loop. With failure_threshold=3 (default), a real deadlock is detected in ~3 minutes; false positives on transient load are suppressed.
- **No startup_probe**: the 60s liveness initial_delay is enough cover for the warmup window; adding a startup probe would be redundant machinery.
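A minimal sketch of the two blocks as they would sit in the deployment's primary container (kubernetes provider syntax; the surrounding resource is elided):
```
readiness_probe {
  tcp_socket { port = "25" }   # public SMTP surface
  initial_delay_seconds = 30   # Postfix warmup: chroot setup + DKIM keys
  period_seconds        = 10
}

liveness_probe {
  tcp_socket { port = "993" }  # IMAPS — catches Dovecot deadlocks
  initial_delay_seconds = 60
  period_seconds        = 60
  timeout_seconds       = 15   # generous: absorbs transient load blips
  # failure_threshold defaults to 3 → a real deadlock restarts in ~3 min
}
```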
## What is NOT in this change
- No startup_probe (liveness initial_delay_seconds=60 handles warmup)
- No exec-based probes (banner-check probes are out of scope and not needed)
- No changes to the opendkim or other sidecars
- Pre-existing drift in other stacks (dawarich namespace label, owntracks dawarich-hook wiring) is deliberately left out — those are separate workstreams
## Test Plan
### Automated
Applied via `tg apply -target=kubernetes_deployment.mailserver` before this commit. Current pod state:
```
$ kubectl get pod -n mailserver -l app=mailserver
NAME READY STATUS RESTARTS AGE
mailserver-6c6bf77ffb-w7nl5 2/2 Running 0 2m26s
$ kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness|Restart Count|Status:|Ready:)"
Status: Running
Ready: True
Restart Count: 0
Ready: True
Restart Count: 0
Liveness: tcp-socket :993 delay=60s timeout=15s period=60s #success=1 #failure=3
Readiness: tcp-socket :25 delay=30s timeout=1s period=10s #success=1 #failure=3
```
Pod has run >120s (two full liveness cycles) with RESTARTS=0 and Ready=True.
### Manual Verification
1. Confirm probes are declared on the live pod:
```
kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness)"
```
Expected: `Liveness: tcp-socket :993 ...` and `Readiness: tcp-socket :25 ...`
2. Confirm pod stays Ready under normal load for 5+ minutes:
```
kubectl get pod -n mailserver -l app=mailserver -w
```
Expected: RESTARTS stays at 0, READY stays at 2/2.
3. (Optional) Failure-simulate by dropping :993 inside the pod and observing liveness failure + restart within ~3 minutes (3 × period_seconds).
## Reproduce locally
1. `cd infra/stacks/mailserver`
2. `tg plan -target=kubernetes_deployment.mailserver`
3. Expected: no drift (or only the probe additions if rolling forward a stale state)
4. `kubectl get pod -n mailserver -l app=mailserver` — pod Ready, RESTARTS=0
5. `kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness)"` — both probes present
Closes: code-ekf
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Panels 1/2/4: compute on (gross_pay - rsu_vest) so numbers reflect
actual UK cash pay, not the RSU-inflated figure the payslip shows.
- Detailed table: add cash_gross / rsu_vest / rsu_offset columns.
- New RSU panel at the bottom: bar chart of rsu_vest over time
(only shows months with stock vests). Taxed at Schwab — included
here for reporting/reconciliation, not for P&L.
## Context
The email-roundtrip-monitor CronJob injected `BREVO_API_KEY` and
`EMAIL_MONITOR_IMAP_PASSWORD` as inline `env { value = var.xxx }` —
Terraform read them from Vault at plan time and embedded them in the
generated CronJob spec. Anyone with `kubectl describe cronjob` (or
pod-event read) in the `mailserver` namespace could read both secrets
verbatim.
The two upstream Vault entries are not flat strings:
- `secret/viktor` → `brevo_api_key` = base64(JSON({"api_key": "..."}))
- `secret/platform` → `mailserver_accounts` = JSON({"spam@viktorbarzin.me": "<pw>", ...})
A plain ESO `remoteRef.property` can traverse one level of JSON but
cannot base64-decode the wrapper or index a map key that contains `@`.
So the ExternalSecret pulls the raw Vault values and the rendered K8s
Secret is produced via ESO's `target.template` (engineVersion v2, sprig
pipeline `b64dec | fromJson | dig`). `mergePolicy` defaults to Replace,
so only the transformed `BREVO_API_KEY` / `EMAIL_MONITOR_IMAP_PASSWORD`
keys land in the K8s Secret — the raw wrapped inputs never reach it.
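Concretely, the ExternalSecret is shaped roughly like this (a sketch — store and key names follow the Vault entries above; the mailbox key and template details are illustrative):
```
resource "kubernetes_manifest" "email_roundtrip_monitor_secrets" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata   = { name = "mailserver-probe-secrets", namespace = "mailserver" }
    spec = {
      secretStoreRef = { kind = "ClusterSecretStore", name = "vault-kv" }
      target = {
        name = "mailserver-probe-secrets"
        template = {
          engineVersion = "v2"
          data = {
            # unwrap base64(JSON({"api_key": ...}))
            BREVO_API_KEY = "{{ .brevo_wrapped | b64dec | fromJson | dig \"api_key\" \"\" }}"
            # index the accounts map by a key containing '@'
            EMAIL_MONITOR_IMAP_PASSWORD = "{{ .accounts | fromJson | dig \"spam@viktorbarzin.me\" \"\" }}"
          }
        }
      }
      data = [
        { secretKey = "brevo_wrapped", remoteRef = { key = "viktor", property = "brevo_api_key" } },
        { secretKey = "accounts", remoteRef = { key = "platform", property = "mailserver_accounts" } },
      ]
    }
  }
}
```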
## This change
1. New `kubernetes_manifest.email_roundtrip_monitor_secrets` rendering
an `external-secrets.io/v1beta1` ExternalSecret into a K8s Secret
named `mailserver-probe-secrets` via the `vault-kv` ClusterSecretStore.
2. CronJob's two `env { name=... value=var.xxx }` blocks replaced with
a single `env_from { secret_ref { name = "mailserver-probe-secrets" } }`.
3. Unused `brevo_api_key` / `email_monitor_imap_password` module
variables + their wiring in `stacks/mailserver/main.tf` removed.
`data "vault_kv_secret_v2" "viktor"` dropped (last consumer gone).
```
Before: After:
┌────────────┐ ┌────────────┐
│ Vault KV │ │ Vault KV │
└────┬───────┘ └────┬───────┘
│ (plan-time read) │ (runtime pull)
▼ ▼
┌────────────┐ ┌────────────┐
│ Terraform │ │ ESO ctrl │
│ state │ │ +template │
└────┬───────┘ └────┬───────┘
│ inline value= │ sprig b64dec | fromJson
▼ ▼
┌────────────┐ ┌────────────┐
│ CronJob │ <-- kubectl describe leaks! │ K8s Secret │
│ env[].value│ │ probe-sec │
└────────────┘ └────┬───────┘
│ env_from.secret_ref
▼
┌────────────┐
│ CronJob │
│ (no values │
│ in spec) │
└────────────┘
```
## Test Plan
### Automated
`terragrunt plan -target=...ExternalSecret -target=...CronJob`:
```
Plan: 1 to add, 1 to change, 0 to destroy.
+ kubernetes_manifest.email_roundtrip_monitor_secrets (ExternalSecret)
~ kubernetes_cron_job_v1.email_roundtrip_monitor
- env { name = "BREVO_API_KEY" ... }
- env { name = "EMAIL_MONITOR_IMAP_PASSWORD" ... }
+ env_from { secret_ref { name = "mailserver-probe-secrets" } }
```
`terragrunt apply --non-interactive` same targets:
```
Apply complete! Resources: 1 added, 1 changed, 0 destroyed.
```
`kubectl get externalsecret -n mailserver mailserver-probe-secrets`:
```
NAME STORE REFRESH INTERVAL STATUS READY
mailserver-probe-secrets vault-kv 15m SecretSynced True
```
`kubectl get secret -n mailserver mailserver-probe-secrets -o yaml`
exposes exactly two data keys (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) —
both populated, 120 / 32 base64 chars, no raw `brevo_api_key_wrapped` /
`mailserver_accounts` keys.
`kubectl describe cronjob -n mailserver email-roundtrip-monitor`:
```
Environment Variables from:
mailserver-probe-secrets Secret Optional: false
Environment: <none>
```
(Previously the `Environment:` block listed both secrets with their raw
values.)
### Manual Verification
1. `kubectl create job --from=cronjob/email-roundtrip-monitor \
probe-test-$RANDOM -n mailserver`
2. `kubectl logs -n mailserver -l job-name=probe-test-... --tail=30`
expected:
```
Sent test email via Brevo: 201 marker=e2e-probe-...
Found test email after 1 attempts
Deleted 1 e2e probe email(s)
Round-trip SUCCESS in 20.3s
Pushed metrics to Pushgateway
Pushed to Uptime Kuma
```
3. `kubectl exec -n monitoring deploy/prometheus-prometheus-pushgateway \
-- wget -q -O- http://localhost:9091/metrics | grep email_roundtrip`
shows `email_roundtrip_success=1`, fresh timestamp, duration in range.
4. `kubectl delete job -n mailserver probe-test-...` to clean up.
Closes: code-39v
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document what RSU vest / RSU offset look like on Meta UK payslips and
tell the agent to populate rsu_vest + rsu_offset fields (new in the
payslip-ingest schema) rather than rolling them into gross_pay.
Two new panels below the 4 existing ones:
- Detailed table: every payslip sorted by pay_date DESC with all fields
(gross, all deductions, net, tax_year, validated flag, paperless_doc_id).
Footer reducer sums the numeric columns.
- Full deductions stacked bars: income_tax + NI + pension_employee +
pension_employer + student_loan per payslip. The earlier panel only
showed 4 deductions; this one shows the complete picture.
## Context
The mailserver stack holds everything valuable and hard to recreate:
243M of maildirs, dovecot/rspamd state, and the DKIM private key that
signs outbound mail. Today the only defence is the LVM thin-pool
snapshots on the PVE host (7-day retention, storage-class scope only)
— there is no app-level backup. `infra/.claude/CLAUDE.md` mandates that
every proxmox-lvm(-encrypted) app ship an NFS-backed backup CronJob,
and the mailserver stack was the only one still out of compliance.
Loss of mailserver-data-encrypted without backups = total loss of all
stored mail plus a DKIM key rotation (which requires a DNS update and
breaks signature verification on every message in transit for the TTL
window). Unacceptable for a service people actually use.
Trade-offs considered:
- mysqldump-style single-file dump vs rsync snapshot — maildirs are
many small files, not a single DB export. rsync --link-dest gives
incremental weekly snapshots for ~10% of the cost of a full copy.
- RWO PVC read-only mount — the underlying PVC is ReadWriteOnce, so
the backup Job has to co-locate with the mailserver pod. vaultwarden
solves this with pod_affinity; mirrored here.
- Image choice — alpine + apk add rsync matches vaultwarden's pattern
and keeps the container image small.
## This change
Adds `kubernetes_cron_job_v1.mailserver-backup` + NFS PV/PVC to the
mailserver module. Runs daily at 03:00 (avoids the 00:30 mysql-backup
and 00:45 per-db windows, and the */20 email-roundtrip cadence). The
job rsyncs /var/mail, /var/mail-state, /var/log/mail into
/srv/nfs/mailserver-backup/<YYYY-WW>/ with --link-dest against the
previous week for space-efficient incrementals. 8-week retention.
Data layout (flowed through from the deployment's subPath mounts so
the rsync tree matches the mailserver's own on-disk layout):
PVC mailserver-data-encrypted (RWO, 2Gi)
├─ data/ (subPath) → pod's /var/mail → backup/<week>/data/
├─ state/ (subPath) → pod's /var/mail-state → backup/<week>/state/
└─ log/ (subPath) → pod's /var/log/mail → backup/<week>/log/
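The job's command, sketched under that layout (the `/mail` and `/backup` mount paths and the exact rsync flags are illustrative; the date arithmetic is written to stay busybox-compatible):
```
command = ["/bin/sh", "-c", <<-EOT
  set -euxo pipefail
  apk add --no-cache rsync
  WEEK=$(date +%G-%V)                                    # ISO week, e.g. 2026-15
  PREV=$(date -d @$(( $(date +%s) - 7 * 86400 )) +%G-%V) # last week's folder
  for d in data state log; do
    [ -d "/mail/$d" ] || continue                        # existence guard
    mkdir -p "/backup/$WEEK/$d"
    rsync -a --delete --link-dest="/backup/$PREV/$d" \
      "/mail/$d/" "/backup/$WEEK/$d/"
  done
EOT
  ]
```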
Safety:
- PVC mounted read-only (volume.persistent_volume_claim.read_only
AND all three volume_mounts set read_only=true) so a backup-script
bug cannot corrupt maildirs.
- pod_affinity on app=mailserver + topology_key=hostname forces the
Job pod onto the same node holding the RWO PVC attachment.
- set -euxo pipefail + per-directory existence guard so a missing
subPath short-circuits cleanly instead of silently no-op'ing.
Metrics pushed to Pushgateway match the mysql-backup/vaultwarden-backup
convention (job="mailserver-backup"):
backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_output_bytes, backup_last_success_timestamp.
Alert rules added in monitoring stack, mirroring Mysql/Vaultwarden:
- MailserverBackupStale — 36h threshold, critical, `for: 30m`
- MailserverBackupNeverSucceeded — critical, `for: 1h`
## Reproduce locally
1. cd infra/stacks/mailserver && ../../scripts/tg plan
Expected: 3 to add (cronjob + NFS PV + PVC), unrelated drift on
deployment/service is pre-existing.
2. ../../scripts/tg apply --non-interactive \
-target=module.mailserver.module.nfs_mailserver_backup_host \
-target=module.mailserver.kubernetes_cron_job_v1.mailserver-backup
3. cd ../monitoring && ../../scripts/tg apply --non-interactive
4. kubectl create job --from=cronjob/mailserver-backup \
mailserver-backup-test -n mailserver
5. kubectl wait --for=condition=complete --timeout=300s \
job/mailserver-backup-test -n mailserver
6. Expected: test pod co-locates with mailserver on same node
(k8s-node2 today), rsync writes ~950M to
/srv/nfs/mailserver-backup/<YYYY-WW>/, Pushgateway exposes
backup_output_bytes{job="mailserver-backup"}.
## Test Plan
### Automated
$ kubectl get cronjob -n mailserver mailserver-backup
NAME SCHEDULE TIMEZONE SUSPEND ACTIVE LAST SCHEDULE AGE
mailserver-backup 0 3 * * * <none> False 0 <none> 3s
$ kubectl create job --from=cronjob/mailserver-backup \
mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test created
$ kubectl wait --for=condition=complete --timeout=300s \
job/mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test condition met
$ kubectl logs -n mailserver job/mailserver-backup-test | tail -5
=== Backup IO Stats ===
duration: 80s
read: 1120 MiB
written: 1186 MiB
output: 947.0M
$ kubectl run nfs-verify --rm --image=alpine --restart=Never \
--overrides='{...nfs mount /srv/nfs...}' \
-n mailserver --attach -- ls -la /nfs/mailserver-backup/
947.0M /nfs/mailserver-backup/2026-15
$ curl http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
| grep mailserver-backup
backup_duration_seconds{instance="",job="mailserver-backup"} 80
backup_last_success_timestamp{instance="",job="mailserver-backup"} 1.776554641e+09
backup_output_bytes{instance="",job="mailserver-backup"} 9.92315701e+08
backup_read_bytes{instance="",job="mailserver-backup"} 1.175027712e+09
backup_written_bytes{instance="",job="mailserver-backup"} 1.244254208e+09
$ curl -s http://prometheus-server/api/v1/rules \
| jq '.data.groups[].rules[] | select(.name | test("Mailserver"))'
MailserverBackupStale: (time() - kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"}) > 129600
MailserverBackupNeverSucceeded: kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"} == 0
### Manual Verification
1. Wait for the scheduled 03:00 run tonight; verify
`kubectl get job -n mailserver` shows a new completed job.
2. Check that `backup_last_success_timestamp` advances past today.
3. Confirm `MailserverBackupNeverSucceeded` did not fire.
4. Next week (week 16), confirm `--link-dest` builds hardlinks against
   2026-15 (the size delta should drop from ~950M to roughly the actual churn).
## Deviations from mysql-backup pattern
- Image: alpine + rsync (mirrors vaultwarden — mysql's `mysql:8.0`
base is not applicable for a filesystem rsync).
- pod_affinity: required for RWO PVC co-location (mysql uses its own
MySQL service for network access; mailserver must mount the PVC).
- Metric push via wget (mirrors vaultwarden; alpine has wget, not curl).
- Week-folder layout with --link-dest rotation: rsync pattern, closer
to the PVE daily-backup script than mysql's single-file gzip dumps.
[ci skip]
Closes: code-z26
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
iOS Owntracks app has been unable to upload for months — phone buffer
now holds ~1200 pending points. Last successful `.rec` write was
2026-01-02T14:32:00Z, matching when the failures started.
### The 500 — verified in Traefik access log
```
152.37.101.156 - viktor "POST /pub HTTP/1.1" 500 21 "-" "-" 47900
"owntracks-owntracks-owntracks-viktorbarzin-me@kubernetes"
"https://10.10.107.194:8083" 84ms
```
Basic-auth + middleware chain (rate-limit, csp, crowdsec) all pass.
Traefik then opens backend connection to `https://10.10.107.194:8083`.
The Recorder pod serves **plain HTTP** on :8083 (`OTR_PORT=0` disables
HTTPS in ot-recorder), so the TLS handshake never completes → 500.
### Root cause — Service port spec
`kubernetes_service.owntracks` declared the port as:
```
name: https
port: 443
targetPort: 8083
```
Traefik's IngressClass scheme inference: if the Service port is named
`https` OR numbered `443`, Traefik speaks HTTPS to that backend. Both
were true here, pointing at a plain-HTTP socket. The name/number were
purely cosmetic — a leftover from mirroring the external `:443` edge —
and worked only while Traefik's default happened to be HTTP. A Traefik
upgrade (or middleware-chain change) tightened inference and surfaced
the mismatch.
## This change
Rename port to `name=http, port=80` and update the matching Ingress
backend `port.number` from 443 to 80. `targetPort` stays at 8083.
```
Phone -----> CF tunnel -----> Traefik (:443, TLS) -----> Service
\ :80 (http)
\ |
\ v
---------------> Pod :8083
(plain HTTP hop) (HTTP listener)
```
Deployment container port label also renamed `https` → `http` for
consistency (no functional effect — just readability).
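The corrected port block, roughly (surrounding Service resource elided):
```
port {
  name        = "http" # was "https" — one of Traefik's two HTTPS-inference triggers
  port        = 80     # was 443 — the other trigger
  target_port = 8083   # unchanged: the Recorder's plain-HTTP socket
}
```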
## What is NOT in this change
- **Not** switching the Recorder pod to HTTPS natively. That would
require mounting a cert + rotation plumbing. External TLS is already
terminated at Cloudflare/Traefik; in-cluster hop to the pod is
plain-HTTP by design.
- **Not** enabling `OTR_HTTPHOOK` to bridge Recorder → Dawarich
(follow-up: code-z9b).
- **Not** backfilling historical `.rec` files into Dawarich (follow-up:
code-h2r).
- Incidental: `providers.tf` + `.terraform.lock.hcl` refreshed by
`terraform init -upgrade` to pick up the goauthentik provider that
the ingress_factory module recently started requiring.
## Test Plan
### Automated
```
$ ../../scripts/tg plan
Plan: 0 to add, 3 to change, 0 to destroy.
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.
$ kubectl -n owntracks get svc owntracks -o=jsonpath='{.spec.ports[0]}'
{"name":"http","port":80,"protocol":"TCP","targetPort":8083}
$ kubectl -n owntracks get ingress owntracks -o=jsonpath='{.spec.rules[0].http.paths[0].backend}'
{"service":{"name":"owntracks","port":{"number":80}}}
```
### Manual Verification
In-cluster auth'd POST through the full ingress chain:
```
VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
kubectl -n owntracks run curltest --rm -i --image=curlimages/curl --restart=Never -- \
curl -s -o /dev/null -w "HTTP %{http_code}\n" -X POST -u "viktor:$VIKTOR_PW" \
-H "Content-Type: application/json" \
-d '{"_type":"location","lat":0,"lon":0,"tst":1000000000,"tid":"vb"}' \
https://owntracks.viktorbarzin.me/pub
# HTTP 200
```
(previously: HTTP 500 on identical request)
### Reproduce locally
1. `vault login -method=oidc`
2. `cd infra/stacks/owntracks && ../../scripts/tg plan`
3. Expected: `Plan: 0 to add, 3 to change, 0 to destroy.` (or empty if already applied)
4. Watch next iOS Owntracks POST → Traefik access log should show `200`, not `500`.
Closes: code-nqd
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Grafana 11.2+ Postgres plugin reads the DB name from jsonData.database
(see grafana/grafana#112418). The top-level 'database' field is silently
ignored by the frontend — datasource health checks and POST /api/ds/query
still work because the backend honours it, but every dashboard panel fails
with 'you do not have default database'.
Rolling back to the supported shape fixes rendering for all 4 uk-payslip
panels.
## Context
First end-to-end test of the broker-sync-fidelity CronJob failed with
`PermissionError: [Errno 13] Permission denied:
'/data/fidelity_storage_state.json'`. Init container runs as root (uid
0) but the broker-sync container runs as uid 10001; chmod 600 without
chown made the file unreadable from the main container.
## This change
Added `chown 10001:10001` before the existing `chmod 600` in the
`stage-storage-state` init container command. Init container has
CAP_CHOWN by default as root, so this succeeds.
## Verification
$ kubectl apply -f test-pod.yaml # same init + main pattern
$ kubectl logs fidelity-debug -c broker-sync
...
broker_sync.providers.fidelity_planviewer.FidelitySessionError:
PlanViewer session stale — run `broker-sync fidelity-seed`
Init container succeeded + main container read the file + Playwright
launched Chromium + navigated to PlanViewer + hit the 15-min idle page
→ exactly the intended behaviour for a stale session. Next step
(out-of-band): Viktor pastes a fresh SMS OTP and re-seeds via
`fidelity-seed` on his laptop or the existing chat-driven flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 6b of the state-drift consolidation plan. `scripts/pve-nfs-exports` is
the git-managed source of truth for the Proxmox host's NFS export table
(file header documents this since the 2026-04-14 NFS outage post-mortem).
Deploying it was runbook-only — `scp` then `ssh ... exportfs -ra` — which
means a change could sit unpushed-to-PVE indefinitely, and nothing alerted
on divergence between git and host.
Wave 6b closes that loop: a Woodpecker pipeline watches
`scripts/pve-nfs-exports` on the `master` branch, diffs against the
current host file, and scp's the new content followed by `exportfs -ra`.
The same 2-shell-command runbook, now a CI step that runs on every push
and is manually triggerable.
## This change
- New pipeline `.woodpecker/pve-nfs-exports-sync.yml` — path-filtered push
trigger + manual.
- SSH credentials provisioned 2026-04-18:
- ed25519 keypair `woodpecker-pve-nfs-exports-sync`
- Public key in `root@192.168.1.127:~/.ssh/authorized_keys`
- Private key in Vault `secret/woodpecker/pve_ssh_key` (plus known_hosts
entry for deterministic host-key pinning from Vault)
- Woodpecker repo-level secret `pve_ssh_key` (id 139) bound to the infra
repo's `push`/`manual`/`cron` events
- Pipeline steps: install openssh + curl (alpine image) → stage private
key from secret → ssh-keyscan the PVE host into known_hosts → diff
current vs. proposed exports (shown in pipeline log) → scp → exportfs
-ra → Slack notify status.
## What is NOT in this change
- Drift detection (git-truth vs. host-truth) via cron: this pipeline only
fires on *push*, so a host-side edit wouldn't be caught. Could add a
daily cron that just runs the diff step and alerts if non-empty. Left
as a refinement if drift becomes an issue.
- Pulling known_hosts from Vault rather than ssh-keyscan on each run: the
keyscan is simpler and works against key rotation without needing a
Vault round-trip. Pulling from Vault is the right answer the moment we
add MITM risk, which the internal network doesn't have today.
## Reproduce locally
Edit `scripts/pve-nfs-exports`, push to master. Watch the pipeline in
Woodpecker. Verify on PVE: `ssh root@192.168.1.127 "md5sum /etc/exports"`
matches `md5sum scripts/pve-nfs-exports` in the repo.
Closes: code-dne
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Outbound mail relay migrated from Mailgun EU to Brevo EU on 2026-04-12 when
variables.tf:6 of the mailserver stack was switched to `smtp-relay.brevo.com:587`.
Postfix immediately began using Brevo for user mail — but the SPF TXT record
at viktorbarzin.me was left pointing at `include:mailgun.org -all`, so every
Brevo-relayed message failed SPF alignment and was spam-foldered or
DMARC-quarantined by Gmail/Outlook.
Observed on 2026-04-18 via `dig TXT viktorbarzin.me @1.1.1.1`:
"v=spf1 include:mailgun.org -all" <-- wrong sender network
User decision (2026-04-18): switch to `v=spf1 include:spf.brevo.com ~all`.
Soft-fail (`~all`) is intentional during cutover — keeps unauthorized Brevo
sends quarantined rather than outright rejected while we validate Brevo's
sending IPs + rate limits for real user mail. Tighten to `-all` once the
relay is proven stable.
The docs in `docs/architecture/mailserver.md` still described the old
Mailgun-based configuration (Overview paragraph, DNS table, Vault secrets
table). Per `infra/.claude/CLAUDE.md` rule "Update docs with every change",
those are updated in the same commit.
## This change
Coupled commit covering beads tasks code-q8p (SPF) + code-9pe (docs):
1. `stacks/cloudflared/modules/cloudflared/cloudflare.tf` — SPF TXT content
flipped from `include:mailgun.org -all` to `include:spf.brevo.com ~all`,
with an inline comment pointing at the mailserver docs for rationale.
2. `docs/architecture/mailserver.md` —
- Last-updated stamp moved to 2026-04-18 with the cutover note.
- Overview paragraph now says "relays through Brevo EU" (was Mailgun).
- DNS table SPF row reflects the new value plus an annotated history
note ("was include:mailgun.org -all until 2026-04-18").
- DMARC row now calls out the intended `dmarc@viktorbarzin.me` rua
target and flags that the current live record still points at
e21c0ff8@dmarc.mailgun.org, tracked under follow-up code-569.
- Vault secrets table: `mailserver_sasl_passwd` relabelled as Brevo
relay credentials; `mailgun_api_key` annotated as retained for the
E2E roundtrip probe only (inbound delivery testing, not user mail).
Apply was scoped with `-target=module.cloudflared.cloudflare_record.mail_spf`
to avoid sweeping up two unrelated pre-existing drifts that the Terraform
state shows on this stack: the DMARC + mail._domainkey_rspamd records are
stored on Cloudflare as RFC-compliant split TXT strings (>255 bytes), and
a naive refresh+apply would normalise them in the state back to single
strings. Those drifts are semantically equivalent (DNS concatenates
adjacent TXT strings at resolution time) and are out of scope for this
commit — they'll be handled under their own ticket.
## What is NOT in this change
- DMARC `rua=mailto:dmarc@viktorbarzin.me` cutover — that's code-569 (M1),
still using the legacy `e21c0ff8@dmarc.mailgun.org` + ondmarc addresses
in the live record.
- DMARC/DKIM TXT multi-string state reconciliation on `mail_dmarc` and
`mail_domainkey_rspamd` — pre-existing Cloudflare representation drift,
untouched here.
- Removal of Mailgun references in history/decision sections of the docs,
or the Mailgun-backed E2E roundtrip probe — probe still uses Mailgun API
on purpose for inbound delivery testing (code-569 scope).
- Mailgun DKIM record `s1._domainkey` — left in place; still consumed by
the roundtrip probe.
- Other pending items from the 2026-04-18 mail audit plan.
## Test Plan
### Automated
Targeted plan showed exactly one change, no other drift sneaking in:
module.cloudflared.cloudflare_record.mail_spf will be updated in-place
~ content = "\"v=spf1 include:mailgun.org -all\""
-> "\"v=spf1 include:spf.brevo.com ~all\""
Plan: 0 to add, 1 to change, 0 to destroy.
Apply result:
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
DNS propagation verified on three independent resolvers immediately after
apply:
$ dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
"v=spf1 include:spf.brevo.com ~all"
$ dig TXT viktorbarzin.me @8.8.8.8 +short | grep spf
"v=spf1 include:spf.brevo.com ~all"
$ dig TXT viktorbarzin.me @10.0.20.201 +short | grep spf # Technitium primary
"v=spf1 include:spf.brevo.com ~all"
### Manual Verification
Setup: nothing extra — change is already live (TF applied before commit
per home-lab convention; `[ci skip]` in title).
1. Confirm SPF is the Brevo-only record from an external resolver:
dig TXT viktorbarzin.me @1.1.1.1 +short
Expected: `"v=spf1 include:spf.brevo.com ~all"` — no Mailgun reference.
2. Send a test email via the mailserver (through Brevo relay) to a Gmail
account and view the original headers:
Authentication-Results: ... spf=pass smtp.mailfrom=viktorbarzin.me
...
Received-SPF: Pass (google.com: domain of ... designates ... as
permitted sender)
Expected: `spf=pass` (it was `spf=fail` or `spf=softfail` before this
change because the envelope sender IP was a Brevo IP not covered by
`include:mailgun.org`).
3. Confirm no live Mailgun references in the mailserver doc:
grep -n mailgun.org infra/docs/architecture/mailserver.md
Expected: only annotated-history mentions — SPF "was ... until
2026-04-18" and DMARC "current live record still points at
e21c0ff8@dmarc.mailgun.org pending cutover". No claims of active
Mailgun relay.
## Reproduce locally
cd infra
git pull
dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
# expected: "v=spf1 include:spf.brevo.com ~all"
# inspect the TF change:
git show HEAD -- stacks/cloudflared/modules/cloudflared/cloudflare.tf
# inspect the doc change:
git show HEAD -- docs/architecture/mailserver.md
Closes: code-q8p
Closes: code-9pe
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Grafana 11's Postgres plugin shows 'you do not have default database'
on any panel whose target is missing rawQuery:true / editorMode:"code".
The query builder can't reason about a custom schema.table path and
blanks the panel.
The installed Postgres plugin is 'grafana-postgresql-datasource' (the newer
one). Dashboard panels referenced legacy 'postgres' type, which caused Grafana
to fall back to 'default database' and error out when rendering.
Ran sed over the JSON; all 8 panel+target type refs now match the installed
plugin name. UID (payslips-pg) was already correct.
Grafana can't auto-create the reserved 'General' folder ('A folder with
that name already exists'), which aborts the sidecar provisioner's walk
and drops every dashboard in that folder. Move uk-payslip to Finance so
it loads.
## Context
Wave 5b of the state-drift consolidation plan. Calico has run this cluster's
pod networking since 2024-07-30, installed via raw kubectl manifests —
tigera-operator Deployment + ~20 CRDs + an Installation CR. The plan
flagged Calico as HIGH BLAST because the operator + Installation CR sit on
the critical path for pod scheduling; any mistake during adoption can
break CNI and block new pods cluster-wide within seconds.
This session takes the safe sub-step: adopt only the three namespaces.
Namespaces are label containers — TF managing their names + PSA labels
cannot disrupt Calico networking. Getting the operator, Installation CR,
and CRDs under TF requires dedicated prep (picking the right
`ignore_changes` fields to absorb operator-generated defaults in the
Installation CR, decoupling from the embedded PSA labels applied at
admission, and a low-traffic window). Deferred to `code-3ad`.
## This change
New Tier 1 stack `stacks/calico/`, adopting via `import {}` blocks
(Wave 8 convention, commit 8a99be11):
- `kubernetes_namespace.calico_system` ← id `calico-system`
- `kubernetes_namespace.calico_apiserver` ← id `calico-apiserver`
- `kubernetes_namespace.tigera_operator` ← id `tigera-operator`
Apply: `3 imported, 0 added, 0 changed, 0 destroyed.` Followed by a
second `tg plan` that returns `No changes`. Zero cluster impact —
namespaces stayed exactly as they were cluster-side.
### terragrunt dependency choice
Deliberately no `dependency "platform"` clause — Calico is lower in the
stack than platform, so introducing a `platform → calico` or
`calico → platform` edge would invite cycle-like pain on first
bootstrap. The plan on this stack is always safe to run standalone.
### `ignore_changes` scope on each namespace
- `goldilocks.fairwinds.com/vpa-update-mode` — Kyverno ClusterPolicy
stamp (Wave 3B sweep, commit 8b43692a).
- `pod-security.kubernetes.io/enforce` + `-version` — tigera-operator
stamps these on `calico-system` + `calico-apiserver` to opt them out
of PSA. These labels aren't surfaced by the kubernetes provider as
part of the import (they arrive through a different field manager),
so left unmanaged to keep the plan clean. `tigera-operator` ns
doesn't get the PSA labels so they aren't ignored there.
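The adoption pattern for one of the three, sketched (label list per the scope above; the `import {}` stanza is dropped once the apply converges, per convention):
```
import {
  to = kubernetes_namespace.calico_system
  id = "calico-system"
}

resource "kubernetes_namespace" "calico_system" {
  metadata {
    name = "calico-system"
  }
  lifecycle {
    ignore_changes = [
      metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
      metadata[0].labels["pod-security.kubernetes.io/enforce"],
      metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
    ]
  }
}
```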
## What is NOT in this change
- The three live workloads: `tigera-operator` Deployment in
`tigera-operator` ns, `calico-kube-controllers`/`calico-node`/
`calico-typha` workloads in `calico-system`, the `calico-apiserver`
in `calico-apiserver`. These are all reconciled by the tigera-operator
from the Installation CR — importing them into TF is redundant with
importing the CR itself.
- The `Installation` CR (`default`, apiVersion
`operator.tigera.io/v1`) — the user-authored minimal spec has since
been filled to 104 lines of operator-generated defaults. Adopting it
requires a well-scoped `ignore_changes` list on the `manifest` field.
Separate follow-up `code-3ad`.
- `.sops.yaml` / `tier0_stacks` updates — the original plan suggested
Tier 0 (local SOPS state) for the full Calico stack on the theory
that "network underpins all". With only three namespaces in the stack,
the argument doesn't hold: a failed Tier 1 plan on calico namespaces
cannot break networking, so no need to pay the Tier 0 tax.
## Verification
```
$ cd stacks/calico && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.
$ kubectl get pods -n calico-system
NAME READY STATUS RESTARTS
calico-kube-controllers-... 1/1 Running 0
calico-node-... 1/1 Running 0
... (all healthy, pre-existing)
```
Follow-up: code-3ad for operator + Installation CR adoption (needs
low-traffic window + ignore_changes scoping).
Closes: code-hl1 scope of Wave 5b (namespaces). Remaining subwave in code-3ad.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 6a of the state-drift consolidation plan. The Domain wide catch all
Proxy Provider (pk=5) + its wrapping Application (slug=domain-wide-catch-all)
+ the embedded outpost (uuid 0eecac07-97c7-443c-8925-05f2f4fe3e47) have
run for a year as pure UI-created state. When the 2026-04-18 outpost SEV2
hit, it was harder to reason about the config than it should have been —
the only source of truth was the Authentik admin UI. Bringing the provider
+ application under Terraform means future changes are reviewable in PRs
and recoverable from git if the admin UI misbehaves.
## This change
Adds the `goauthentik/authentik` provider to the repo's central
`terragrunt.hcl` `required_providers` (side-effect: every stack can now
declare authentik resources; this stack is the only current consumer).
Stack-local `stacks/authentik/authentik_provider.tf` holds the provider
instance configuration + API token wiring + two resources + their flow
data-source lookups.
### Auth
- API token stored in Vault at `secret/authentik/tf_api_token`, identifier
`terraform-infra-stack`, intent=API, user=akadmin, no expiry. Rotatable
by rewriting the Vault KV entry — the next TF plan/apply picks up the
new token automatically.
### Imports (both landed zero-diff)
- `authentik_application.catchall` ← id `domain-wide-catch-all`
- `authentik_provider_proxy.catchall` ← id `5`
### Flow references
Authorization + invalidation flows are looked up via `data
"authentik_flow"` by slug (`default-provider-authorization-implicit-consent`
+ `default-provider-invalidation-flow`). Keeping them as data sources
rather than hardcoded UUIDs means a flow recreation (slug unchanged)
doesn't require an HCL edit.
### `lifecycle { ignore_changes }` scope
On `authentik_provider_proxy.catchall`:
- `property_mappings` (5 UUIDs), `jwt_federation_sources` (1 UUID) — the
live state references complex many-to-many relations that are easier
to manage from the Authentik UI than to serialise in HCL. Drift
suppressed.
- `skip_path_regex`, `internal_host`, all `basic_auth_*`,
`intercept_header_auth`, `access_token_validity` — either defaults or
UI-only tuning knobs that aren't part of Terraform's concern for this
catch-all provider.
On `authentik_application.catchall`:
- `meta_description`, `meta_launch_url`, `meta_icon`, `group`,
`backchannel_providers`, `policy_engine_mode`, `open_in_new_tab` —
cosmetic/non-functional attributes; the Authentik UI is the right
place to edit these and drift on them isn't interesting.
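Sketched on the provider resource (attribute names as listed above; the application mirrors the same pattern for its cosmetic attributes, and the `basic_auth_*` expansion shown is an assumption about the provider schema):
```
resource "authentik_provider_proxy" "catchall" {
  # name, external_host, mode, flow references elided
  lifecycle {
    ignore_changes = [
      property_mappings,      # 5 UUIDs — managed from the Authentik UI
      jwt_federation_sources, # 1 UUID — ditto
      skip_path_regex,
      internal_host,
      basic_auth_enabled,     # assumed member of the basic_auth_* set
      intercept_header_auth,
      access_token_validity,
    ]
  }
}
```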
## What is NOT in this change
- Outpost-binding resource — the embedded outpost's provider list is a
single-row many-to-many that the Authentik UI manages cleanly; adding
TF there would fight the UI without reducing drift.
- Property mappings and JWT federation source — managed via UI, drift
suppressed. A future wave can bring them in when someone actually
wants to edit them through code review.
- Other Authentik entities (Flows, Stages, Groups, RBAC policies) —
same rationale: UI is the natural editing surface. Adopt incrementally
as they become interesting to code-review.
## Verification
```
$ cd stacks/authentik && ../../scripts/tg plan | grep Plan:
Plan: 0 to add, 1 to change, 0 to destroy.
# module.authentik.kubernetes_deployment.pgbouncer — pre-existing drift,
# unrelated to this commit (image_pull_policy Always -> IfNotPresent)
$ ../../scripts/tg state list | grep authentik_
authentik_application.catchall
authentik_provider_proxy.catchall
data.authentik_flow.default_authorization_implicit_consent
data.authentik_flow.default_provider_invalidation
```
## Reproduce locally
1. `git pull && cd stacks/authentik && ../../scripts/tg init`
2. Terraform pulls goauthentik/authentik provider (first time).
3. `tg plan` — expect only pgbouncer drift; authentik resources read-only.
Refs: Wave 6a of the state-drift consolidation (code-hl1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
payslip-ingest now runs pdftotext locally before calling claude-agent-service,
shrinking the prompt ~20-100x. Agent file documents both paths: PAYSLIP_TEXT
(fast) and PDF_BASE64 (fallback for scanned-image PDFs or when pdftotext
fails).
## Context
Wave 7 of the state-drift consolidation plan. The drift-detection pipeline
(`.woodpecker/drift-detection.yml`) already ran terragrunt plan on every
stack daily and Slack-posted a summary, but its output was ephemeral —
nothing persisted in Prometheus, so there was no historical view of which
stacks drift, when, or for how long. Following the convergence work in
waves 1–6 (168 KYVERNO_LIFECYCLE_V1 markers, 4 stacks adopted, Phase 4
mysql cleanup), the baseline is clean enough that *new* drift should
stand out. That only works if we have observability.
## This change
### `.woodpecker/drift-detection.yml`
Enhances the existing cron pipeline to push a batched set of metrics to
the in-cluster Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`)
after each run:
| Metric | Kind | Purpose |
|---|---|---|
| `drift_stack_state{stack}` | gauge, 0/1/2 | 0=clean, 1=drift, 2=error |
| `drift_stack_first_seen{stack}` | gauge (unix seconds) | Preserved across runs for drift-age tracking |
| `drift_stack_age_hours{stack}` | gauge (hours) | Computed from `first_seen` |
| `drift_stack_count` | gauge (count) | Total drifted stacks this run |
| `drift_error_count` | gauge (count) | Total plan-errored stacks |
| `drift_clean_count` | gauge (count) | Total clean stacks |
| `drift_detection_last_run_timestamp` | gauge (unix seconds) | Pipeline heartbeat |
First-seen preservation: on each drift hit, the pipeline queries
Pushgateway for the existing `drift_stack_first_seen{stack=<stack>}`
value. If present and non-zero, reuse it; otherwise stamp with `NOW`.
That means age-hours grows monotonically until the stack goes clean
(at which point state=0 resets first_seen by omission).
Atomic batched push: all metrics for a run are POST'd in a single
HTTP request. Pushgateway doesn't support atomic multi-metric updates
natively, but batching at the pipeline layer prevents half-updated
state: if the one curl is interrupted mid-run the whole push fails,
the heartbeat never advances, and `DriftDetectionStale` fires.
### `stacks/monitoring/.../prometheus_chart_values.tpl`
New `Infrastructure Drift` alert group with three rules:
- **DriftDetectionStale** (warning, 30m): fires if
`drift_detection_last_run_timestamp` is older than 26h. Gives a 2h
grace window on top of the 24h cron so transient Pushgateway or
cluster unavailability doesn't false-alarm. Guards against the
pipeline silently failing or the cron not firing.
- **DriftUnaddressed** (warning, 1h): fires if any stack has
`drift_stack_age_hours > 72` — three days of unacknowledged drift.
Three days is long enough to absorb weekends + typical review cycles
but short enough to force follow-up before drift compounds.
- **DriftStacksMany** (warning, 30m): fires if `drift_stack_count > 10`
in a single run. Sudden wide drift usually signals systemic causes
(new admission webhook, provider version bump, cluster-wide CRD
upgrade) rather than individual configuration errors, and the alert
body nudges toward that diagnosis.
Applied to `stacks/monitoring` this session — 1 helm_release changed,
no other drift surfaced.
## What is NOT in this change
- The Wave 7 **GitHub issue auto-filer** — the full plan included
filing a `drift-detected` issue per drifted stack. Deferred because
it requires wiring the `file-issue` skill's convention + a gh token
exposed to Woodpecker, both of which need separate setup. The Slack
alert covers the same need at lower fidelity in the meantime.
- The Wave 7 **PG drift_history table** — would provide the richest
historical view but adds a new DB schema dependency for a CI
pipeline. Pushgateway + Prometheus handle the 72h window we care
about; PG history is nice-to-have for quarterly reviews.
- Auto-apply marker (`# DRIFT_AUTO_APPLY_OK`) — premature until the
baseline has been stable for a few cycles.
Follow-ups tracked: file dedicated beads items for GH-issue filer + PG
drift_history.
## Verification
```
$ cd stacks/monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
# After the next cron run (the "drift-detection" cron in the Woodpecker UI):
$ curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
| grep -c '^drift_'
# expect a positive number
```
## Reproduce locally
1. `git pull`
2. Check Prometheus rules: `curl -sk https://prometheus.viktorbarzin.lan/api/v1/rules | jq '.data.groups[] | select(.name == "Infrastructure Drift")'`
3. Manually trigger the Woodpecker cron and watch Pushgateway populate.
Refs: Wave 7 umbrella (code-hl1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 5c of the state-drift consolidation plan. `local-path-provisioner`
(Rancher's node-local dynamic PV provisioner) was deployed 55d ago via raw
`kubectl apply` against the upstream manifest. It serves as the cluster's
default StorageClass and is still actively in use — the 2026-04-18 live
survey showed helper-pod-delete cycles running against existing PVCs.
Unmanaged until now: namespace, ServiceAccount, ClusterRole (+ binding),
ConfigMap with provisioner config.json + helperPod.yaml + setup/teardown
scripts, StorageClass `local-path` (default), and the 1-replica
Deployment itself. Seven resources total.
## This change
New Tier 1 stack `stacks/local-path/` with all seven resources, adopted
via Wave 8's HCL `import {}` block convention (commit 8a99be11):
- `kubernetes_namespace.local_path_storage` → id `local-path-storage`
- `kubernetes_service_account.local_path_provisioner` →
id `local-path-storage/local-path-provisioner-service-account`
- `kubernetes_cluster_role.local_path_provisioner` → id `local-path-provisioner-role`
- `kubernetes_cluster_role_binding.local_path_provisioner` → id `local-path-provisioner-bind`
- `kubernetes_config_map.local_path_config` →
id `local-path-storage/local-path-config`
- `kubernetes_storage_class_v1.local_path` → id `local-path`
- `kubernetes_deployment.local_path_provisioner` →
id `local-path-storage/local-path-provisioner`
Conventions applied:
- Namespace gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
Goldilocks `vpa-update-mode` label drift (Wave 3B, commit 8b43692a).
- Deployment gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
ndots dns_config drift (Wave 3A, commit c9d221d5 + 327ce215).
- ServiceAccount + pod spec pin `automount_service_account_token = false`
and `enable_service_links = false` to match the live spec exactly.
- `import {}` stanzas removed after the apply converged to zero-diff
(per AGENTS.md → "Adopting Existing Resources").
## Apply outcome
`Apply complete! Resources: 7 imported, 0 added, 3 changed, 0 destroyed.`
The 3 in-place changes were:
- `kubernetes_config_map.local_path_config.data` — whitespace/format
reshuffle. The live ConfigMap contained the upstream manifest's
hand-indented JSON + YAML; my HCL uses canonical `jsonencode` /
heredoc. Semantic content identical, so the provisioner continued
running (no pod restart).
- `kubernetes_deployment.local_path_provisioner.wait_for_rollout = true`
— TF-only attribute, no cluster impact.
- `kubernetes_storage_class_v1.local_path.allow_volume_expansion = false`
+ `is-default-class` annotation re-asserted — TF-schema reconciliation
only; the StorageClass remained default throughout.
Post-apply `scripts/tg plan` returns `No changes`.
## Verification
```
$ cd stacks/local-path && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.
$ kubectl -n local-path-storage get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
local-path-provisioner 1/1 1 1 55d
$ kubectl get sc local-path
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer
```
## What is NOT in this change
- Helm-release adoption — local-path-provisioner was never installed via
Helm in this cluster; raw manifests only. Keeping native typed
resources rather than retrofitting a chart.
- PV-path customisation — sticks with upstream default
`/opt/local-path-provisioner` on all nodes (via
`DEFAULT_PATH_FOR_NON_LISTED_NODES`).
Closes: code-3gp
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.
The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer token
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`
So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.
## Flow
```
user: bd assign <id> agent
│
▼
Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐
│ │
▼ │
CronJob: beads-dispatcher │
1. GET beadboard/api/agent-status (busy? skip) │
2. bd query 'assignee=agent AND status=open' │
3. bd update -s in_progress (claim) │
4. POST beadboard/api/agent-dispatch │
5. bd note "dispatched: job=…" │
│ │
▼ │
claude-agent-service /execute │
beads-task-runner agent runs; notes/closes bead │
│ │
▼ │
done ──► next tick picks up the next bead ───────────────┘
CronJob: beads-reaper (every 10 min)
for bead (assignee=agent, status=in_progress, updated_at > 30 min):
bd note "reaper: no progress for Nm — blocking"
bd update -s blocked
```
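The same tick, sketched in Python for readability. The real dispatcher
is a shell script driving `bd`/`jq`/`curl`; the BeadBoard service URL,
the token env var name, the `--json` flag on `bd query`, and the
response field names below are assumptions:
```
import json, os, subprocess, requests

BEADBOARD = "http://beadboard.beads-server.svc"      # assumed in-cluster name
TOKEN = os.environ["AGENT_DISPATCH_TOKEN"]           # hypothetical env var

def bd(*args):
    return subprocess.run(["bd", *args], check=True,
                          capture_output=True, text=True).stdout

def tick():
    # 1. Single-slot agent busy? Skip the whole tick.
    if requests.get(f"{BEADBOARD}/api/agent-status", timeout=10).json().get("busy"):
        return
    # 2. Oldest open bead assigned to the sentinel assignee wins.
    beads = json.loads(bd("query", "assignee=agent AND status=open", "--json"))
    if not beads:
        return
    bead_id = beads[0]["id"]
    # 3. Claim before dispatching so a crashed tick cannot double-dispatch.
    bd("update", bead_id, "-s", "in_progress")
    # 4. Hand off to claude-agent-service via BeadBoard.
    resp = requests.post(f"{BEADBOARD}/api/agent-dispatch",
                         json={"bead_id": bead_id},
                         headers={"Authorization": f"Bearer {TOKEN}"},
                         timeout=30)
    resp.raise_for_status()
    # 5. Audit trail on the bead itself; also bumps updated_at for the reaper.
    bd("note", bead_id, f"dispatched: job={resp.json().get('job_id')}")
```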
## Decisions
- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
  2-min poll cadence and ~5-min average run, throughput tops out near 12
  beads/hour (a finished run waits up to one tick before the next bead
  starts). Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
`curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
the image-seeded file. The script copies it into `/tmp/.beads/` because bd
may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
`beads-task-runner` never trips the reaper. Failures trip it; pod crashes
(in-memory job state lost) also trip it.
## What is NOT in this change
- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
`cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.
## Deviations from plan
Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
serializes `notes` as a string (not an array), and every `bd note` bumps
`updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — busybox `date` on Alpine
  lacks GNU-style `-d` parsing, and the image already ships python3.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.
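The reaper's staleness check is small enough to show whole. A sketch,
assuming bd emits timestamps `datetime.fromisoformat` can parse:
```
from datetime import datetime, timezone

def stale(updated_at: str, threshold_min: int = 30) -> bool:
    # busybox date can't parse ISO-8601; python3 (already in the image) can.
    ts = datetime.fromisoformat(updated_at.replace("Z", "+00:00"))
    age = datetime.now(timezone.utc) - ts
    return age.total_seconds() > threshold_min * 60
```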
## Test plan
### Automated
```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!
$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1) # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.
$ terraform fmt main.tf
# (no output — already formatted)
```
### Manual verification
1. **Apply**
```
vault login -method=oidc
cd infra/stacks/beads-server
scripts/tg apply
```
Expect: `kubernetes_config_map.beads_metadata`,
`kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
created. No changes to existing resources.
2. **CronJobs exist with right schedule**
```
kubectl -n beads-server get cronjob
```
Expect `beads-dispatcher */2 * * * *` and `beads-reaper */10 * * * *`,
both with `SUSPEND=False`.
3. **End-to-end smoke**
```
bd create "auto-dispatch smoke test" \
-d "Read /etc/hostname inside the agent sandbox and close." \
--acceptance "bd note includes 'hostname=' line and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '{status, notes}'
```
Expect notes to contain `auto-dispatcher claimed at …` and
`dispatched: job=<uuid>`, status `in_progress`.
4. **Reaper smoke**
Assign + dispatch a long bead, then
`kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
30 min + one reaper tick, `bd show <id>` shows `blocked` with a
`reaper: no progress for Nm — blocking` note.
5. **Kill switch**
```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
kubectl -n beads-server get cronjob
```
Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
nothing happens within 5 min. Re-apply with `=true` to re-enable.
Runbook with all above plus reaper semantics + design choices at
`infra/docs/runbooks/beads-auto-dispatch.md`.
Closes: code-8sm
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 5a of the state-drift consolidation plan. Two cluster-critical pieces
of infrastructure lived OUTSIDE Terraform — invisible to the repo's "all
cluster changes via TF" invariant and drifting silently:
1. **kured** (Helm release): deployed 265d ago via `helm install kured` on
the CLI. Values were edited only via `helm upgrade` — never captured.
Chart version `kured-5.11.0`, app `1.21.0`, configured for Mon–Fri
02:00–06:00 London reboot window, Slack notifyUrl, and a custom
`/sentinel/gated-reboot-required` sentinel file.
2. **kured-sentinel-gate**: a custom DaemonSet + ServiceAccount +
ClusterRole + ClusterRoleBinding. Built after the 2026-03 post-mortem
(memory 390) when kured rebooted nodes during a containerd overlayfs
outage and turned a single-node blip into a 26h cluster outage.
The gate DaemonSet creates `/var/run/gated-reboot-required` only when
(a) host has `/var/run/reboot-required`, (b) all nodes Ready, (c) all
calico-node pods Running, (d) no node transitioned Ready in the last
30 minutes (cool-down). kured's `rebootSentinel` then points at the
gated file so reboots are effectively gated by cluster health.
Applied 33d ago via `kubectl apply` — no TF footprint.
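Condensed, the gate predicate looks roughly like this. A Python sketch
of what the DaemonSet's shell loop checks; the calico-node namespace
and label selector are assumptions:
```
import json, os, subprocess
from datetime import datetime, timezone, timedelta

def kubectl_json(*args):
    out = subprocess.run(["kubectl", *args, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

def reboot_gate_open() -> bool:
    if not os.path.exists("/var/run/reboot-required"):             # (a)
        return False
    now = datetime.now(timezone.utc)
    for node in kubectl_json("get", "nodes")["items"]:
        ready = next(c for c in node["status"]["conditions"] if c["type"] == "Ready")
        if ready["status"] != "True":                              # (b)
            return False
        flipped = datetime.fromisoformat(ready["lastTransitionTime"].replace("Z", "+00:00"))
        if now - flipped < timedelta(minutes=30):                  # (d) cool-down
            return False
    pods = kubectl_json("get", "pods", "-n", "kube-system",
                        "-l", "k8s-app=calico-node")["items"]
    if any(p["status"]["phase"] != "Running" for p in pods):       # (c)
        return False
    return True   # only now touch /var/run/gated-reboot-required
```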
Both are now codified in the new `stacks/kured/` (Tier 1, PG state).
## This change
- New stack `stacks/kured/` with `main.tf` (247 lines) + `terragrunt.hcl`
(standard platform-dep) + `secrets` symlink.
- All 6 resources adopted via Wave 8's HCL `import {}` block pattern
(commit 8a99be11) — written as `import {}` stanzas in the initial
commit, plan-applied to zero, then stanzas deleted before this commit
per the convention:
- `kubernetes_namespace.kured` (id: `kured`)
- `helm_release.kured` (id: `kured/kured`)
- `kubernetes_service_account.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
- `kubernetes_cluster_role.kured_sentinel_gate` (id: `kured-sentinel-gate`)
- `kubernetes_cluster_role_binding.kured_sentinel_gate` (id: `kured-sentinel-gate`)
- `kubernetes_daemon_set_v1.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
- Slack notifyUrl moved from inline helm values into Vault at
`secret/kured` under key `slack_kured_webhook`, consumed via
`data "vault_kv_secret_v2"`. No plaintext secret in git.
- Namespace gets `tier = "1-cluster"` label (new — previously untiered,
so Kyverno auto-quotas applied cluster-tier defaults on kured pods).
Benign additive change; pod specs have explicit resources anyway.
- DaemonSet + SA get `automount_service_account_token = false` /
`enable_service_links = false` to match the live pod spec exactly —
otherwise TF schema defaults would flip these fields.
- DaemonSet carries `# KYVERNO_LIFECYCLE_V1` suppressing dns_config drift
(Wave 3A convention, commit c9d221d5 + 327ce215).
- Namespace carries the same marker on the
`goldilocks.fairwinds.com/vpa-update-mode` label (Wave 3B sweep,
commit 8b43692a).
## Import outcomes
Apply result: `Resources: 6 imported, 0 added, 3 changed, 0 destroyed.`
The 3 in-place changes were all TF-schema reconciliation, not cluster
mutations:
- `helm_release.kured.values` — format reshuffle; the imported state
  stored values as a nested map, HCL uses `[yamlencode(...)]`. The
  rendered YAML is semantically identical, so the triggered Helm upgrade
  was a no-op on the cluster side (revision bumped 2→3, zero pod
  restarts).
- `kubernetes_namespace.kured.labels["tier"]` = `"1-cluster"` — new
label added. Already discussed above.
- `kubernetes_daemon_set_v1.kured_sentinel_gate.wait_for_rollout` = true
— TF-only attribute, no k8s impact.
Post-apply `scripts/tg plan` on `stacks/kured` returns:
`No changes. Your infrastructure matches the configuration.`
## What is NOT in this change
- `import {}` stanzas — intentionally removed after the apply landed.
They would be no-ops and would clutter future diffs. Per Wave 8
convention (AGENTS.md → "Adopting Existing Resources").
- Calico adoption (Wave 5b) — separate higher-blast change, needs a
dedicated low-traffic window.
- local-path-storage (Wave 5c) — check-or-remove task still open.
## Verification
```
$ kubectl -n kured get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE
kured 5 5 5 5 5
kured-sentinel-gate 5 5 5 5 5
$ helm -n kured list
NAME NAMESPACE REVISION STATUS CHART APP VERSION
kured kured 3 deployed kured-5.11.0 1.21.0
$ cd stacks/kured && ../../scripts/tg plan | tail -1
No changes. Your infrastructure matches the configuration.
```
## Reproduce locally
1. `git pull`
2. `cd stacks/kured && ../../scripts/tg plan` → 0 changes
3. `kubectl -n kured get ds,pods` — 5 kured + 5 sentinel-gate pods Ready.
Closes: code-q8k
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Upstream Wealthfolio uses SQLite exclusively (Diesel ORM, no PG/MySQL
support — confirmed 2026-04-18 via repo inspection). The DB lives on
an RWO PVC (proxmox-lvm-encrypted) held 24/7 by the main pod.
The first attempt, a standalone backup CronJob, failed with a
Multi-Attach error: the RWO volume was already attached to the running
WF pod, so no separate pod could mount it. Switched to a backup sidecar
in the same pod, which shares the PVC mount naturally.
## This change
- `container "backup"` added to the WF Deployment:
- alpine:3.20 + sqlite + busybox-suid (for crond).
- Mounts /data read-only (shared with WF container) + /backup (new
NFS volume at 192.168.1.127:/srv/nfs/wealthfolio-backup).
- Writes /etc/crontabs/root with a `30 4 * * *` line + /scripts/backup.sh
which runs `sqlite3 .backup` (WAL-safe online snapshot, zero
downtime), copies secrets.json, and prunes anything older than 30d.
- 16Mi request / 64Mi limit — sleeps most of the time.
- NFS volume declared in pod spec — server from the existing
`var.nfs_server` variable; path `/srv/nfs/wealthfolio-backup` created
on the PVE host in the same session.
Removed the standalone backup CronJob that couldn't work.
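For clarity, the snapshot step as a Python sketch. The sidecar actually
drives the sqlite3 CLI from `backup.sh`, but Python's `sqlite3` module
exposes the same online-backup primitive; paths match the mounts above,
pruning simplified:
```
import shutil, sqlite3, time
from pathlib import Path

SRC, DEST_ROOT, KEEP_DAYS = Path("/data/wealthfolio.db"), Path("/backup"), 30

def run_backup():
    dest = DEST_ROOT / time.strftime("%Y-%m-%dT%H-%M-%S")
    dest.mkdir(parents=True)
    # The online backup API copies pages consistently even mid-write:
    # the WAL-safe equivalent of the CLI's `.backup`, zero app downtime.
    src = sqlite3.connect(f"file:{SRC}?mode=ro", uri=True)
    dst = sqlite3.connect(str(dest / "wealthfolio.db"))
    try:
        src.backup(dst)
    finally:
        src.close(); dst.close()
    shutil.copy2("/data/secrets.json", dest / "secrets.json")
    # Prune snapshots older than the retention window.
    cutoff = time.time() - KEEP_DAYS * 86400
    for d in DEST_ROOT.iterdir():
        if d.is_dir() and d.stat().st_mtime < cutoff:
            shutil.rmtree(d)
```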
## Verification
### Automated
`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 1 changed, 1 destroyed (the transient CronJob).
### Manual (2026-04-18)
```
$ kubectl -n wealthfolio get pods -l app=wealthfolio
wealthfolio-95d8bd498-cj8kw 2/2 Running
$ kubectl -n wealthfolio logs <pod> -c backup
wealthfolio-backup sidecar ready; next 04:30 UTC
$ kubectl -n wealthfolio exec <pod> -c backup -- /scripts/backup.sh
wealthfolio-backup: /backup/2026-04-18T22-24-55 (34.2M)
$ ls /srv/nfs/wealthfolio-backup/
2026-04-18T22-24-55/   ← first sidecar-produced backup
```
## Reproduce locally
1. `kubectl -n wealthfolio exec $(kubectl -n wealthfolio get pods -l app=wealthfolio -o jsonpath='{.items[0].metadata.name}') -c backup -- /scripts/backup.sh`
2. `ssh root@192.168.1.127 ls /srv/nfs/wealthfolio-backup/`
3. Expected: a new dated folder appears with `wealthfolio.db` + `secrets.json`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
After fixing the two mail-server-side root causes of probe false-failures
(Dovecot userdb duplicates, postscreen btree lock contention), the probe
is expected to succeed well under 120s. This commit is defence in depth
against residual SMTP relay variance and against a future scenario where
Dovecot is transiently unresponsive during IMAP login.
The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's
queueing, DNS variance, and general SMTP retry backoff can easily
exceed that on a bad day. Widening to 5 minutes gives plenty of headroom
while still remaining well within the CronJob's 20-minute schedule
interval.
Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If
Dovecot is unresponsive (e.g., mid-rollout, transient TLS handshake
hang), the connect call can block indefinitely and the probe hangs
without ever looping to the next attempt. Adding `timeout=10` caps each
connect at 10s so the retry loop keeps making forward progress.
## This change
Two edits to the embedded probe script inside the cronjob resource:
```
- # Step 2: Wait for delivery, retry IMAP up to 3 min
+ # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
...
- for attempt in range(9):
+ for attempt in range(15):
...
- imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+ imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
```
Flow (before):
```
send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total
```
Flow (after):
```
send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total
│
└─ timeout ─► log, continue to next loop
```
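Reconstructed, the post-change loop looks like this. A sketch, assuming
the surrounding probe script supplies the credentials, the `ssl`
context, and the probe message's subject:
```
import imaplib, time

def wait_for_delivery(imap_host, user, password, ctx, probe_subject):
    for attempt in range(15):                 # 15 x 20s = 300s ceiling
        time.sleep(20)
        try:
            # timeout=10 caps a hung connect/TLS handshake at 10s, so an
            # unresponsive Dovecot costs one attempt, not the whole probe.
            imap = imaplib.IMAP4_SSL(imap_host, 993, ssl_context=ctx, timeout=10)
            imap.login(user, password)
            imap.select("INBOX")
            _, data = imap.search(None, "SUBJECT", f'"{probe_subject}"')
            imap.logout()
            if data[0]:
                return True                   # Round-trip SUCCESS
        except (OSError, imaplib.IMAP4.error) as exc:
            print(f"attempt {attempt + 1}/15 failed: {exc}; retrying")
    return False
```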
## What is NOT in this change
- Probe frequency stays at `*/20 * * * *`.
- The `EmailRoundtripStale` alert thresholds are intentionally left at
3600s + for: 10m. Those fire only on sustained multi-hour issues and
should not be loosened — they would mask future regressions. Probe
success rate is now expected to recover to ≥95% from the two upstream
fixes; if it doesn't, alert tuning gets revisited separately.
- No change to the Brevo send step, the success-metrics push, or the
cleanup of stale e2e-probe-* messages.
## Test Plan
### Automated
`scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`:
```
# module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place
- for attempt in range(9):
+ for attempt in range(15):
- imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+ imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
Plan: 0 to add, 1 to change, 0 to destroy.
```
`scripts/tg apply`:
```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```
### Manual Verification
1. Trigger the probe manually:
`kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
2. Tail its logs:
`kubectl -n mailserver logs job/probe-verify-<ts> -f`
3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical
successful run should still complete in < 60s now that postscreen
is no longer stalling.
4. Watch the 48-hour window on the `email_roundtrip_success` gauge in
Prometheus — expect ≥95% (was ~65% before all three fixes).
## Reproduce locally
1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml | grep -E "range\(|timeout"`
2. Expect: `range(15)` and `timeout=10`
3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
4. `kubectl -n mailserver logs -f job/probe-verify-<ts>`
5. Expect: eventual `Round-trip SUCCESS in <N>s` message and exit 0.
Closes: code-18e
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>