Phase 3 — replication chain (old → v2):
- Discovered the v2 cluster was running redis:7.4-alpine, but the
Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
the 7.4 replicas rejected the stream with "Can't handle RDB format
version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
restore PSYNC compatibility.
- Discovered that sentinel on BOTH v2 and old Bitnami clusters
auto-discovered the cross-cluster replication chain when v2-0
REPLICAOF'd the old master, triggering a failover that reparented
old-master to a v2 replica and took HAProxy's backend offline.
Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
clusters) during the REPLICAOF surgery, then re-MONITOR after
cutover. This must be done on the OLD sentinels too, not just v2 —
they're the ones that kept fighting our REPLICAOF.
- Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
All 76 keys (db0:76, db1:22, db4:16) synced including `immich_bull:*`
BullMQ queues and `_kombu.*` Celery queues — the user-stated
must-survive data class.
Phase 4 — HAProxy cutover:
- Updated `kubernetes_config_map.haproxy` to point at
`redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
redis_sentinel backends (removed redis-node-{0,1}).
- Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
ConfigMap apply so HAProxy's 1s health-check interval found a
role:master within a few seconds. Cutover disruption on HAProxy
rollout was brief; old clients naturally moved to new HAProxy pods
within the rolling update window.
- Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
+ `announce-hostnames yes` were active — this ensures sentinel
stores the hostname (not resolved IP) in its rewritten config, so
pod-IP churn on restart doesn't break failover.
Phase 5 — chaos:
- Round 1: killed master v2-0 mid-probe. First run exposed the
sentinel IP-storage issue (stored 10.10.107.222, went stale on
restart) — ~12s probe disruption. Fixed hostname persistence and
re-MONITORed.
- Round 2: killed new master v2-2 with hostnames correctly stored.
Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
60s — target <3s of actual user-visible disruption.
Phase 6 — Nextcloud simplification:
- `zzz-redis.config.php` no longer queries sentinel in-process —
just points at `redis-master.redis.svc.cluster.local`. Removed 20
lines of PHP. HAProxy handles master tracking transparently now
that it's scaled to 3 + PDB minAvailable=2.
Phase 7 step 1:
- `kubectl scale statefulset/redis-node --replicas=0` (transient —
TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
preserved as cold rollback.
Docs:
- Rewrote `databases.md` Redis section to reflect post-cutover reality
and the sentinel hostname gotcha (so future sessions don't relearn it).
- `.claude/reference/service-catalog.md` entry updated.
The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.
Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the stack with the repo HEAD carrying migration 0004
(cash_income_tax + ytd_rsu_* columns), migration 0005 (p60_reference
table), the bonus-dedup logic, and the Woodpecker path-filter fix.
Applied + verified:
- pod rolled out with the new image, Alembic ran 0003→0004→0005
- cash_income_tax backfilled on 71/71 existing rows
- dashboard Panel 7 YTD split query returns real numbers
- no existing (tax_year, bonus) duplicates found — guard ships for future
Closes: code-7z0
## Context
After the Mouse-class unblock on 2026-04-19, end-to-end testing of the
grabber revealed three issues with the plan's original filter values:
1. **`SEEDER_CEILING=50` rejects ~99% of MAM's catalogue.** MAM is a
well-seeded private tracker — 100-700 seeders per torrent is normal.
A ceiling of 50 makes the filter too tight: across 140 FL torrents
sampled in one loop, only 0-1 matched. The intent ("avoid oversupplied
swarms") is still valid; the threshold was wrong for MAM's shape.
2. **`RATIO_FLOOR=1.2` was sized for Mouse-class defence and is now
over-tight.** Its job is preventing the death spiral where Mouse-class
accounts can't announce, so any grab deepens the ratio hole. Once
class > Mouse, MAM serves peer lists normally and demand-first
filtering (`leechers>=1`) keeps new grabs upload-positive on average.
With ratio sitting at 0.7 post-recovery (we over-downloaded while
unblocking), 1.2 was preventing the very grabs that would earn us
back to healthy ratio.
3. **`parse_size` crashed on `"1,002.9 MiB"`.** MAM's pretty-printed
sizes use thousands separators; `float("1,002.9")` raises
`ValueError`. Every grabber run that hit a ≥1000-MiB candidate on
the page crashed with a traceback instead of skipping the size.
## This change
- `SEEDER_CEILING`: 50 → 200 — live catalogue evidence showed 50 was
rejecting viable demand-first candidates like `Zen and the Art of
Motorcycle Maintenance` (S=156, L=1, score=125).
- `RATIO_FLOOR`: 1.2 → 0.5 — still a tripwire for catastrophic dips,
but no longer a steady-state block. Class == Mouse remains an
absolute skip (separate branch).
- `parse_size`: `s.replace(",", "").split()` before int-parse.
## Verified post-change
Manual grabber loop (5 runs at random offsets) after applying:
run=1 parse_size crash on "1,002.9" (this crash motivated fix#3)
run=2 GRABBED 3 torrents:
Dean and Me: A Love Story (240.7 MiB, S:18, L:1) score=194
Digital Nature Photography (83.7 MiB, S:42, L:1) score=182
Zen and the Art of Motorcycle (830.3 MiB, S:156, L:1) score=125
run=3-5 grabbed=0 at offsets that landed on pages with no matches
(expected — MAM returns 20/page, many offsets yield nothing)
MAM profile: class=User, ratio=0.7 (recovering from the Mouse unblock),
BP=24,053. 28 mam-farming torrents in forcedUP state, actively uploading
~8 MiB to MAM this session across 2 of the Maxximized comic issues.
## What is NOT in this change
- No alert threshold changes — `MAMRatioBelowOne` (24h) and `MAMMouseClass`
(1h) already handle the "going back to Mouse" case; lowering the floor
on the grabber doesn't change alerting.
- No janitor changes — the janitor rules are H&R-based and independent
of ratio/class state.
## Test plan
### Automated
$ cd infra/stacks/servarr && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
$ python3 -c 'import ast; ast.parse(open(
"infra/stacks/servarr/mam-farming/files/freeleech-grabber.py").read())'
### Manual Verification
1. Trigger the grabber and confirm it doesn't skip-for-ratio at ratio 0.7:
$ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
$ kubectl -n servarr logs job/g1 | head -5
Profile: ratio=0.7 class=User | Farming: 33, 2.0 GiB, tracked IDs: 4
Search offset=<random>, found=1323, page_results=20
Added (score=...) ...
2. Repeat 3-5× at different random offsets. Over the course of a 30-min
cron cadence, expect 2-5 grabs across the day given MAM's catalogue
churn and our filter intersection.
## Reproduce locally
cd infra/stacks/servarr
../../scripts/tg plan # expect: 0 to add, 2 to change (configmap + cronjob)
../../scripts/tg apply --non-interactive
kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
kubectl -n servarr logs job/g1
Follow-up: `bd close code-qfs` already completed in the parent commit;
this is a post-shipping tune, no beads action needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds per-node DNS cache that transparently intercepts pod queries on
10.96.0.10 (kube-dns ClusterIP) AND 169.254.20.10 (link-local) via
hostNetwork + NET_ADMIN iptables NOTRACK rules. Pods keep using their
existing /etc/resolv.conf (nameserver 10.96.0.10) unchanged — no kubelet
rollout needed for transparent mode.
Layout mirrors existing stacks (technitium, descheduler, kured):
stacks/nodelocal-dns/
main.tf # module wiring + IP params
modules/nodelocal-dns/main.tf # SA, Services, ConfigMap, DS
Key decisions:
- Image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1
- Co-listens on 169.254.20.10 + 10.96.0.10 (transparent interception)
- Upstream path: kube-dns-upstream (new headless svc) → CoreDNS pods
(separate ClusterIP avoids cache looping back through itself)
- viktorbarzin.lan zone forwards directly to Technitium ClusterIP
(10.96.0.53), bypassing CoreDNS for internal names
- priorityClassName: system-node-critical
- tolerations: operator=Exists (runs on master + all tainted nodes)
- No CPU limit (cluster-wide policy); mem requests=32Mi, limit=128Mi
- Kyverno dns_config drift suppressed on the DaemonSet
- Kubelet clusterDNS NOT changed — transparent mode is sufficient;
rolling 5 nodes just to switch to 169.254.20.10 has no additional
benefit and expanding blast radius for no reason.
Verified:
- DaemonSet 5/5 Ready across k8s-master + 4 workers
- dig @169.254.20.10 idrac.viktorbarzin.lan -> 192.168.1.4
- dig @169.254.20.10 github.com -> 140.82.121.3
- Deleted all 3 CoreDNS pods; cached queries still resolved via
NodeLocal DNSCache (resilience confirmed)
Docs: architecture/dns.md — adds NodeLocal DNSCache to Components table,
graph diagram, stacks table; rewrites pod DNS resolution paths to show
the cache layer; adds troubleshooting entry.
Closes: code-2k6
Zone-count parity required hitting /api/zones/list which requires auth. The
null_resource has no access to the Technitium admin password (it's declared
`sensitive = true` on the module variable), so we were probing with an empty
token and getting 200 OK with an error JSON — silently returning 0 zones for
every instance.
Replaced the HTTP probe with a second DNS check: dig idrac.viktorbarzin.lan
on each pod, require the same A record from all three. This catches both
"zone not loaded on an instance" and "zone drift between primary and
replicas" without needing any HTTP client or credentials. The AXFR chain
guarantees all three should converge on the same value.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Panel 7 (YTD uses): replace the single `ytd_income_tax` stack segment
with two — `ytd_cash_income_tax` (full red, same color as before) and
`ytd_rsu_income_tax` (desaturated orange) — computed from the new
`cash_income_tax` column on payslip. RSU-vest months now visually
separate the cash tax from the PAYE attributable to the grossed-up
RSU, matching user mental model of "what I actually paid in cash tax".
Panel 8 (Sankey): split the single `Gross → Income Tax` edge into two
edges (`Gross → Income Tax (cash)` and `Gross → Income Tax (RSU)`)
sourcing the same two figures.
Panel 3 (effective rate): left untouched — it's the "all-in" rate and
keeps using raw `income_tax`.
Panel 9 (P60 reconciliation — new): per-tax-year table comparing HMRC
P60 annual figures against SUM(payslip) via LATERAL JOIN on
payslip_ingest.p60_reference. Threshold-coloured delta columns (|Δ|<1
green, 1-50 yellow, >50 red) surface missing months or parser drift.
Panel 10 (HMRC Tax Year Reconciliation — new): placeholder for the
hmrc-sync service (code scaffolded, awaiting HMRC prod approval to
activate). Queries `hmrc_sync.tax_year_snapshot`; renders empty until
that schema lands. Delta > £10 → red.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The zone-count parity check was trivially passing when the ephemeral
curl pod failed to reach the Technitium web API: all three counts came
back as 0, UNIQ=1, gate claimed "PASSED". This happened during today's
DNS hardening apply when CoreDNS was in CrashLoopBackOff and the curl
pod couldn't resolve service names.
Added a MIN > 0 sanity check. Technitium always has built-in zones
(localhost, standard reverse PTRs), so a zero count means the probe
didn't reach the API, not that the instance truly has zero zones.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm
release so data can migrate via REPLICAOF during a future short maintenance
window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still
points at redis-node-{0,1}.
Architecture:
- 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter
- podManagementPolicy=Parallel + init container that writes fresh
sentinel.conf on every boot by probing peer sentinels and redis for
consensus master (priority: sentinel vote > role:master with slaves >
pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM.
- redis.conf `include /shared/replica.conf` — init container writes
`replicaof <master> 6379` for non-master pods so they come up already in
the correct role. No bootstrap race.
- master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork
COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn.
- RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec.
- PodDisruptionBudget minAvailable=2.
Also:
- HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since
Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes
the sole client-facing path for all 17 consumers.
- New Prometheus alerts: RedisMemoryPressure, RedisEvictions,
RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong,
RedisReplicasMissing. Updated RedisDown to cover both statefulsets
during the migration.
- databases.md updated to describe the interim parallel-cluster state.
Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status
up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded
into Prometheus and inactive.
Beads: code-v2b (still in progress — Phase 3-7 await maintenance window).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CoreDNS refused to load the new Corefile with `serve_stale 3600s 86400s`:
plugin/cache: invalid value for serve_stale refresh mode: 86400s
serve_stale takes one DURATION and an optional refresh_mode keyword
("immediate" or "verify"), not two durations. Simplified to
`serve_stale 86400s` (serve cached entries for up to 24h when upstream
is unreachable). The new CoreDNS pods were CrashLoopBackOff; the two
old pods kept serving traffic so there was no outage, but the partial
apply left the cluster wedged with the bad ConfigMap.
Also collapses the inline viktorbarzin.lan cache block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `external-monitor-sync` script is opt-IN by default for any
*.viktorbarzin.me ingress, so a missing annotation means "monitored."
Both ingress factories previously OMITTED the annotation when
`external_monitor = false`, which silently left monitors in place.
Fix: when the caller sets `external_monitor = false` explicitly, emit
`uptime.viktorbarzin.me/external-monitor = "false"` so the sync script
deletes the monitor. Keep the previous behavior (no annotation) for
callers that leave external_monitor null — otherwise 19 publicly-reachable
services with `dns_type="none"` would lose monitoring.
Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy)
to match the other two already-flagged services. Delete the r730 ingress
module entirely — the Dell server has been decommissioned.
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
+authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
lines) and the openclaw cluster_healthcheck CronJob (local CLI is
now the single authoritative runner). Keep the healthcheck SA +
Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
the script first, refreshes the 42-check table, drops stale
CronJob/Slack/post-mortem sections, documents the monorepo-canonical
+ hardlink layout. File is hardlinked to
/home/wizard/code/.claude/skills/cluster-health/SKILL.md for
dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Primary was at 401Mi / 512Mi (78%) before the first bump; the plan's 1Gi
leaves enough headroom for normal operation but thin margin if blocklists or
cache grow. User escalated: OOM cascades are the exact failure mode that
causes user-visible DNS outages, so give a full 2x safety margin across all
three instances. Replicas currently use 124-155Mi steady-state so they have
enormous headroom at 2Gi — accepted for symmetry and future growth (OISD
blocklists, in-memory cache).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TP-Link gateway was wired via ExternalName `gw.viktorbarzin.lan`, but
Technitium has no record for that name (the router isn't a DHCP client and
Kea DDNS never registers it), so the ingress backend returned NXDOMAIN and
the `[External] gw` Uptime Kuma monitor was permanently failing.
Factory now accepts `backend_ip` as an alternative to `external_name`: it
creates a selector-less ClusterIP Service + manual EndpointSlice pointing
at the given IP, bypassing cluster DNS entirely. Used for gw (192.168.1.1);
the old ExternalName path is retained for every other service.
Also add a direct `port` monitor for the router in uptime-kuma's
internal_monitors list so we can tell a Cloudflare/tunnel outage apart
from the router itself being down. Extended the internal-monitor-sync
script to handle non-DB monitor types (hostname + port fields).
Technitium pods don't ship wget/curl, only dig/nslookup. Switched the per-pod
health check from wget against /api to dig +short against 127.0.0.1. This
probes the actual DNS serving path, which is what we care about anyway.
Zone-count parity can't be done inside the Technitium pod (no HTTP client),
so it spawns a short-lived curlimages/curl pod via kubectl run --rm that
curls the three internal web services and exits.
Added retry loop on the dig check (6 × 10s) to tolerate zone-load delay after
a pod restart — viktorbarzin.lan is ~864KB and can take tens of seconds to
load into memory on a cold start.
Relaxed the A-record regex to match any IPv4 rather than 10.x — records may
legitimately live outside that range.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.
**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
(secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
per-instance zone_count gauges to Pushgateway, fail the job on any
create error (was silently passing).
**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
forward, health_check on viktorbarzin.lan forward, serve_stale
3600s/86400s on both cache blocks — pfSense flap no longer takes the
cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.
**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
CoreDNSForwardFailureRate.
**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
kubectl rollout status on all 3 deployments (180s), per-pod
/api/stats/get probe, zone-count parity across the 3 instances.
Fails the apply on any check fail. Override: -var skip_readiness=true.
**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
modes, emergency override.
Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
For weeks, every push to infra has resulted in `build-cli` workflow
failure AND `default` workflow succeed — but the `default` workflow's
"success" was a lie. Inside the apply-loop we were swallowing per-stack
failures with `set +e ... echo FAILED` and the step exited 0 regardless.
Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
[servarr] Starting apply...
ERROR: Cannot read PG credentials from Vault.
Run: vault login -method=oidc
[servarr] FAILED (exit 1)
Two root causes, two fixes here.
### 1. Vault `ci` role lacks Tier-1 PG backend creds
The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.
**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.
### 2. Apply-loop swallows stack failures
`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.
**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.
Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.
## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
TF file is already at 5.1.4 in git; once CI picks up this commit
it'll apply on its own, or Viktor can run `tg apply` locally now
that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
per-stack continuation so a single bad stack doesn't hide the
others' plans from the log. Just making the final status honest.
## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
# vault_kubernetes_auth_backend_role.ci will be updated in-place
~ token_policies = [
+ "terraform-state",
# (1 unchanged element hidden)
]
# vault_jwt_auth_backend.oidc will be updated in-place
~ tune = [...] # cosmetic provider-schema drift, pre-existing
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.
### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
-d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
curl -s -H "X-Vault-Token: $TOK" \
http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}
# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```
Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.
Refs: bd code-e1x
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a ha-sofia-retry Middleware (attempts=3, initialInterval=100ms)
and ha-sofia-transport ServersTransport (dialTimeout=500ms) wired into
ha-sofia + music-assistant ingresses. Absorbs the 67-156ms connect/DNS
stalls that were surfacing as 18 x 502s/day without disturbing the
global 2-attempt retry or Immich's 60s dialTimeout. depends_on the new
manifests to avoid the dangling-reference pattern from the 2026-04-17
Traefik P0.
Closes: code-rd1
Three changes:
1. Split panel 1 (YTD overlay of 6 non-additive lines) into two accounting-
clean stacked-area panels side-by-side:
- "YTD sources": salary + bonus + rsu_vest + residual (= gross)
- "YTD uses": net + income_tax + NI + pension_employee + student_loan
+ rsu_offset (= gross, per validate_totals identity)
Green for take-home, red/orange for taxes, purple for pension, teal
for RSU offset — visually encodes "what you earned vs what was taken".
2. Panel 3 effective rate switched from per-slip attribution to YTD
cumulative (SUM OVER w / SUM OVER w). Kills the vest-month >100% spike:
the old SQL subtracted `rsu_vest × ytd_avg_rate` from income_tax, but
Meta's variant-C grossup means actual RSU tax is on `rsu_grossup × top
marginal`, not rsu_vest × average. Cumulative approach blends both
proportionally, no attribution hack needed. Also adds a third series:
all-deductions rate (income_tax + NI + student_loan / gross).
3. New panel 8 — Sankey (netsage-sankey-panel) showing sources → Gross →
uses over the selected time range. Plugin added to grafana Helm values.
The Redis monitor (id=53) was created manually with a connection string
pointing at redis-master.redis-headless.redis.svc.cluster.local, which
doesn't resolve — headless only exposes pod DNS (redis-node-N.redis-headless),
not a synthetic "redis-master" name. Status had been DOWN with ENOTFOUND
for weeks.
Declare it in local.internal_monitors using redis-master.redis.svc.cluster.local
(the HAProxy-fronted ClusterIP that already routes to the Sentinel-elected
master). Verified RESP PING through HAProxy returns PONG.
Tighten intervals to 60s / 30s retry / 3 retries — Redis is core (Paperless,
Immich, Authentik, Dawarich all depend on it), a 5-minute detection window
was way too loose given the blast radius.
Also teach the sync CronJob to handle no-password monitors (auth disabled
on the Bitnami chart), via an optional database_password_vault_key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Companion fix to 2026-04-19's service-upgrade spec refactor. The agent
pod has no Vault CLI auth (no VAULT_TOKEN, port 8200 refused), so every
`vault kv get` in the spec returned empty:
- `WOODPECKER_TOKEN=""` → 401 on /api/repos/1/pipelines → agent can't
find its pipeline → 15m poll timeout → rollback loop → >30m cap.
- `SLACK_WEBHOOK=""` → webhook POST to empty URL → no Slack messages
for 3+ days (the surface symptom that kicked off bd code-3o3).
## This change
Extends the `claude-agent-secrets` ExternalSecret with two more keys,
making them available to the agent via `envFrom`:
- `WOODPECKER_API_TOKEN` ← `secret/ci/global.woodpecker_api_token`
(already used by the vault-woodpecker-sync CronJob, same key)
- `SLACK_WEBHOOK_URL` ← `secret/viktor.alertmanager_slack_api_url`
(shared webhook also consumed by Alertmanager)
Pairs with commit a5963169 which refactored service-upgrade.md to read
these env vars directly instead of shelling out to `vault kv get`.
## What is NOT in this change
- REGISTRY_USER / REGISTRY_PASSWORD — not needed on the agent side.
The separate `.woodpecker/build-cli.yml` fix (bd code-3o3 fix C)
will add those to `secret/ci/global` for the vault-woodpecker-sync
CronJob to publish as Woodpecker secrets, not here.
## Test Plan
### Automated
`terraform plan` reported `Plan: 0 to add, 2 to change, 0 to destroy`
(ExternalSecret + a cosmetic `tier` label drop on the Deployment).
Applied cleanly.
### Manual Verification
```
$ kubectl -n claude-agent get externalsecret claude-agent-secrets \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
secret synced
$ kubectl -n claude-agent exec deploy/claude-agent-service -- sh -c \
'echo "WP=${WOODPECKER_API_TOKEN:0:20}... SLACK=${SLACK_WEBHOOK_URL:0:40}..."'
WP=eyJhbGciOiJIUzI1NiIs... SLACK=https://hooks.slack.com/services/T02SV75...
$ kubectl -n claude-agent rollout status deploy/claude-agent-service
deployment "claude-agent-service" successfully rolled out
```
Next step: fire one synthetic DIUN webhook to confirm the agent reaches
Slack + lands a commit + exits cleanly, completing code-3o3.
Refs: bd code-3o3
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
256Mi was tight once the working set crossed ~200Mi: a BGSAVE fork
during replica full PSYNC doubled master RSS via COW and pushed it
past the limit, OOMing (exit 137) in a loop. HAProxy flapped, every
client (Paperless, Immich, Authentik, Dawarich) saw session store
failures → 500s on authenticated requests.
512Mi gives ~2x headroom on the current 204Mi RDB.
Closes: code-n81
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds panel 6 that reconciles each payslip's reported YTD summary block
(ytd_gross, ytd_taxable_pay, ytd_tax_paid) against the cumulative sum
of extracted per-payslip values within the same tax year. Any Δ > £0.02
flags a parser regression, missing slip, or duplicate ingest — the
algebraic companion to the existing missing-months panel.
Variant A payslips (pre-mid-2022) carry no YTD block and are filtered
out via WHERE ytd_gross IS NOT NULL.
Adds `external_monitor = false` to the ingress_factory calls for
task-webhook and torrserver so the `external-monitor-sync` CronJob
stops auto-creating `[External] <name>` monitors for them. Both
services remain deployed and reachable; only the Uptime Kuma monitors
are dropped.
Collapse from 11 panels to 5. New hero "Tax-year YTD — gross / net /
taxes / RSU / salary" merges the old YTD cumulative + total-comp +
earnings-breakdown panels into a single line chart (tax-band thresholds
still on ytd_cash_gross). New "Data integrity" table surfaces missing
months and zero-salary anomalies at a glance — catches the 2024-02 gap
(Paperless doc never uploaded) and any future parser regressions.
Monthly cash flow, effective-rate, and full payslip table kept as-is.
Total dashboard height: 39 rows (was ~67). No parser / schema changes.
[ci skip]
Replaces `redis.redis:6379` with `redis-master.redis:6379` in all 11
dependency.kyverno.io/wait-for annotations across 8 stacks, plus one
docs comment in the Kyverno module.
These annotations drive DNS-only `nc -z` init-container readiness
checks — zero RW risk. Both hostnames resolve, so there is no wait-for
failure window during the rolling re-apply.
Closes: code-otr
Completes the T0 hostname migration. The `redis.redis` service is a
legacy alias that routes to HAProxy via a `null_resource` selector
patch; `redis-master.redis` is the canonical name that has always
routed to HAProxy directly and health-checks master-only.
Changes:
- redis-backup CronJob: redis-cli BGSAVE + --rdb now target
redis-master.redis. BGSAVE runs on the master (what we want).
- config.tfvars `resume_redis_url`: unused fallback updated for
grep hygiene; nothing reads it today.
- ytdlp REDIS_URL default: updated for dev-local runs; production
already sets REDIS_URL via main.tf:283-285 → var.redis_host.
- immich chart_values.tpl REDIS_HOSTNAME: dead Helm template (values
block commented out in main.tf:524, Immich deploys as raw
kubernetes_deployment using var.redis_host). Updated to keep the
file consistent if someone ever revives it.
HAProxy resolved `redis-node-{0,1}.redis-headless.redis.svc.cluster.local`
once at pod startup and cached the IPs forever. When redis-node pods
cycled (new pod IPs), HAProxy kept connecting to the dead IPs — backends
flapped between "Connection refused" and "Layer4 timeout", and Immich's
ioredis client hit EPIPE until max-retries exhausted and the pod entered
CrashLoopBackOff. This caused an Immich outage on 2026-04-19.
Fix:
- Add `resolvers kubernetes` stanza pointing at kube-dns (10s hold on
every category so we pick up pod IP changes within a DNS TTL window).
- Add `resolvers kubernetes init-addr last,libc,none` to every backend
server line so HAProxy resolves at startup AND uses the dynamic
resolver for runtime refresh.
- Add `checksum/config` pod annotation to the HAProxy Deployment so a
haproxy.cfg change actually rolls the pods (including this one).
Closes: code-fd6
## Context (bd code-yiu)
With Phase 4+5 proven (external mail flows through pfSense HAProxy +
PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB
LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are
obsolete. Phase 6 decommissions them and documents the steady-state
architecture.
## This change
### Terraform (stacks/mailserver/modules/mailserver/main.tf)
- `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`.
- Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation.
- Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP).
- Port set unchanged — the Service still exposes 25/465/587/993 for
intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
CronJob) that hit the stock PROXY-free container listeners.
- Inline comment documents the downgrade rationale + companion
`mailserver-proxy` NodePort Service that now carries external traffic.
### pfSense (ops, not in git)
- `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT
rule references it post-Phase-4; keeping it would be misleading dead
metadata. Reversible via WebUI + `php /tmp/delete-mailserver-alias.php`
companion script (ad-hoc, not checked in — alias is just a
Firewall → Aliases → Hosts entry).
### Uptime Kuma (ops)
- Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202`
→ `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` /
`... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy
layer liveness). History retained (edit, not delete-recreate).
### Docs
- `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten
"Current state" section; now reflects steady-state architecture with
two-path diagram (external via HAProxy / intra-cluster via ClusterIP).
Phase history table marks Phase 6 ✅. Rollback section updated (no
one-liner post-Phase-6; need Service-type re-upgrade + alias re-add).
- `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound
flow, CrowdSec section, Uptime Kuma monitors list, Decisions section
(dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY
v2"), Troubleshooting all updated.
- `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph
updated with new external path description; references the new runbook.
## What is NOT in this change
- Removal of `10.0.20.202` from `cloudflare_proxied_names` or any
reserved-IP tracking — wasn't there to begin with. The
`metallb-system default` IPAddressPool (10.0.20.200-220) shows 2 of
19 available after this, confirming `.202` went back to the pool.
- Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if
someone re-introduces the MetalLB LB (see runbook "Rollback").
## Test Plan
### Automated (verified pre-commit 2026-04-19)
```
# Service is ClusterIP with no EXTERNAL-IP
$ kubectl get svc -n mailserver mailserver
mailserver ClusterIP 10.103.108.217 <none> 25/TCP,465/TCP,587/TCP,993/TCP
# 10.0.20.202 no longer answers ARP (ping from pfSense)
$ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202'
2 packets transmitted, 0 packets received, 100.0% packet loss
# MetalLB pool released the IP
$ kubectl get ipaddresspool default -n metallb-system \
-o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}'
2 of 19 available
# E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver
... Round-trip SUCCESS in 20.3s ...
$ kubectl delete job probe-phase6 -n mailserver
# pfSense mailserver alias removed
$ ssh admin@10.0.20.1 'php -r "..." | grep mailserver'
(no output)
```
### Manual Verification
1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new
hostname `10.0.20.1`.
2. Roundcube login works (`https://mail.viktorbarzin.me/`).
3. Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe
`postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in
mailserver logs within ~10s.
4. CrowdSec should still see real client IPs in postfix/dovecot parsers
(verify with `cscli alerts list` on next auth-fail event).
## Phase history (bd code-yiu)
| Phase | Status | Description |
|---|---|---|
| 1a | ✅ `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2 | ✅ 2026-04-19 | pfSense HAProxy pkg installed |
| 3 | ✅ `ba697b02` | HAProxy config persisted in pfSense XML |
| 4+5 | ✅ `9806d515` | 4-port alt listeners + HAProxy frontends + NAT flip |
| 6 | ✅ **this commit** | MetalLB LB retired; 10.0.20.202 released; docs updated |
Closes: code-yiu
## Context
`null_resource.patch_redis_service` uses `triggers = { always = timestamp() }`,
so every `scripts/tg plan` on `stacks/redis` reports `1 to destroy, 1 to add`
even when nothing has changed. That noise hides real drift in the signal and
trains us to ignore redis-stack plans — which is exactly what you don't want
on a load-bearing patch.
The patch itself is still load-bearing (three consumers hard-code bare
`redis.redis.svc.cluster.local` — `stacks/immich/chart_values.tpl:12`,
`stacks/ytdlp/yt-highlights/app/main.py:136`, `config.tfvars:214` — plus
Bitnami's own sentinel scripts set `REDIS_SERVICE=redis.redis.svc.cluster.local`
and call it during pod startup). Removing the null_resource is a follow-up
(beads T0) once those consumers migrate to `redis-master.redis.svc`. For now
the goal is just: stop being noisy.
## This change
1. Replace the `always = timestamp()` trigger with two inputs that only change
when re-patching is genuinely required:
- `chart_version = helm_release.redis.version` — changes only on a Bitnami
chart version bump, which is the one code path that rewrites the `redis`
Service selector back to `component=node`.
- `haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])`
— changes only when HAProxy config is edited; aligned with the existing
`checksum/config` annotation that rolls the Deployment on config change.
Both attributes are known at plan time (verified against `hashicorp/helm`
v3.1.1 provider binary). Rejected alternatives — `metadata[0].revision`
(not exposed in the plugin-framework v3 rewrite), `sha256(jsonencode(values))`
(readability unverified on v3), and `kubernetes_deployment.haproxy.id`
(static `namespace/name`, never changes) — don't meet the bar.
2. Add a **Redis Service Naming** section to `AGENTS.md` that explicitly
states the write/sentinel/avoid endpoints, so new consumers start from
`redis-master.redis.svc` (the documented `var.redis_host`) and long-lived
connections (PUBSUB, BLPOP, Sidekiq) route around HAProxy's `timeout
client 30s` via the sentinel headless path. Uptime Kuma's Redis monitor
already learned that lesson the hard way (memory id=748).
## What is NOT in this change
- Deleting `null_resource.patch_redis_service` — still load-bearing (T0).
- Deleting `kubernetes_service.redis_master` — stays as the declared write API.
- Migrating any consumer off bare `redis.redis.svc` — T0 epic.
- Per-client sentinel migration — T1 epic.
- Retiring HAProxy — T2 epic (blocked on T1 + T3).
## Before / after
Before (steady state):
```
scripts/tg plan
Plan: 1 to add, 2 to change, 1 to destroy.
# null_resource.patch_redis_service must be replaced
# triggers = { "always" = "<timestamp>" } -> (known after apply)
```
After (steady state, post-apply):
```
scripts/tg plan
No changes. Your infrastructure matches the configuration.
```
After (chart version bump):
```
scripts/tg plan
# null_resource.patch_redis_service must be replaced
# triggers = { "chart_version" = "25.3.2" -> "25.4.0" }
```
— the trigger fires only when it actually needs to.
## Test Plan
### Automated
`scripts/tg plan` pre-change (confirms baseline noise):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
~ triggers = { # forces replacement
~ "always" = "2026-04-19T10:39:40Z" -> (known after apply)
}
}
Plan: 1 to add, 2 to change, 1 to destroy.
```
`scripts/tg plan` post-edit (confirms the one-time structural replacement):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
~ triggers = { # forces replacement
- "always" = "2026-04-19T10:39:40Z" -> null
+ "chart_version" = "25.3.2"
+ "haproxy_config" = "989bca9483cb9f9942017320765ec0751ac8357ff447acc5ed11f0a14b609775"
}
}
```
Apply is deferred to the operator — the working tree on the same file also
contains an unrelated HAProxy DNS-resolvers fix (for today's immich outage)
that needs its own review before rolling out together. No `scripts/tg apply`
run from this session.
### Manual Verification
Reproduce locally:
1. `cd infra/stacks/redis && ../../scripts/tg plan`
2. Before apply: expect `null_resource.patch_redis_service` to be replaced
exactly once, with the trigger map transitioning from `{always = <ts>}`
to `{chart_version, haproxy_config}`.
3. After apply: `../../scripts/tg plan` twice in a row must both report
`No changes.` (excluding unrelated drift from other work-in-progress).
4. Cluster-side invariant (must hold pre- and post-apply):
`kubectl -n redis get svc redis -o jsonpath='{.spec.selector}'`
→ `{"app":"redis-haproxy"}`
`kubectl -n redis get svc redis-master -o jsonpath='{.spec.selector}'`
→ `{"app":"redis-haproxy"}`
5. Regression test for the trigger doing its job: bump `helm_release.redis.version`
in a branch, `tg plan`, expect the null_resource to replace. Revert.
Bundles two small follow-ups to the live bridge + port-fix work:
## Face avatar fix (dawarich-hook.lua)
After the Recorder ran in production for a while it began enriching
publish payloads with a `face` field — the base64-encoded user avatar
uploaded via the Recorder's web UI (~120 KB). Our Lua hook builds a
curl command that embeds the JSON payload as `-d '<payload>'`, which
hit `E2BIG` / `Argument list too long` (os.execute reason=code=7) on
Linux's `execve` argv limit (~128 KB). Every live POST stopped making
it to Dawarich, even though the HTTP POST from the phone to Owntracks
still returned 200 and the .rec write still happened.
Fix: `data.face = nil` before serializing. Dawarich doesn't use it
anyway (not persisted into any column — `raw_data` stored without it).
Also upgraded the debug log: on failure we now emit
`dawarich-bridge: FAIL tst=... reason=... code=... cmd=...` so any
future variant of this problem (next big field surfaced upstream, etc.)
is one log tail away from a diagnosis.
```
$ kubectl -n owntracks logs deploy/owntracks --tail=5 | grep dawarich-bridge
+ dawarich-bridge: init
+ dawarich-bridge: ok tst=1776600238
```
## Orphan PVC removal (main.tf)
`owntracks-data-proxmox` (1 Gi, proxmox-lvm, unencrypted) was a leftover
from the encrypted-migration attempt; the Deployment has been mounting
`owntracks-data-encrypted` the whole time. Verified `Used By: <none>`
on the live PVC before removal. Removing the resource from Terraform
destroys the PVC — harmless, no data loss.
## Test Plan
### Automated
```
$ ../../scripts/tg plan
Plan: 0 to add, 1 to change, 1 to destroy.
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 1 destroyed.
$ kubectl -n owntracks get pvc
NAME STATUS VOLUME ...
owntracks-data-encrypted Bound ...
(owntracks-data-proxmox gone)
```
### Manual Verification
```
$ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
$ TST=$(date +%s)
$ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \
curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \
-H 'Content-Type: application/json' \
-H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \
-d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \
https://owntracks.viktorbarzin.me/pub
HTTP 200
$ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \
psql -U postgres -d dawarich -tAc \
"SELECT ST_AsText(lonlat::geometry) FROM points WHERE user_id=1 AND timestamp=$TST"
POINT(-0.1278 51.5074)
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context (bd code-yiu)
Toward replacing MetalLB ETP:Local + pod-speaker colocation with pfSense
HAProxy injecting PROXY v2 → mailserver. This commit lays the k8s-side
groundwork for port 25 only. External SMTP flow post-cutover:
Client → pfSense WAN:25 → pfSense HAProxy (injects PROXY v2) → k8s-node:30125
(NodePort for mailserver-proxy Service, ETP:Cluster) → kube-proxy → pod :2525
(postscreen with postscreen_upstream_proxy_protocol=haproxy) → real client IP
recovered from PROXY header despite kube-proxy SNAT.
Internal clients (Roundcube, email-roundtrip-monitor) keep using the stock
:25 on mailserver.svc ClusterIP — no PROXY required, zero regression.
## This change
- New `kubernetes_config_map.mailserver_user_patches` with a
`user-patches.sh` script. docker-mailserver runs
`/tmp/docker-mailserver/user-patches.sh` on startup; our script appends a
`2525 postscreen` entry to `master.cf` with
`-o postscreen_upstream_proxy_protocol=haproxy` and a 5s PROXY timeout.
Sentinel-guarded for idempotency on in-place restart.
- New volume + volume_mount (`mode = 0755` via defaultMode) wires the
ConfigMap into the mailserver container.
- New container port spec for 2525 (informational; kube-proxy resolves
targetPort by number anyway).
- New Service `mailserver-proxy` — NodePort type, ETP:Cluster, selector
`app=mailserver`, port 25 → targetPort 2525 → fixed nodePort 30125.
pfSense HAProxy's backend pool will be `<all k8s node IPs>:30125 check
send-proxy-v2`.
The existing `mailserver` LoadBalancer Service (ETP:Local, 10.0.20.202,
ports 25/465/587/993) is untouched. Traffic still flows through it via the
pfSense NAT `<mailserver>` alias; this commit does not change routing.
## What is NOT in this change
- pfSense HAProxy install/config (Phase 2 — out-of-Terraform, runbook-managed)
- pfSense NAT rdr flip from `<mailserver>` → HAProxy VIP (Phase 4)
- 465/587/993 — scoped to port 25 first for proof of concept. Other ports
get the same treatment (alt listeners 4465/5587/10993 + Service ports)
once 25 is proven.
- Dovecot per-listener `haproxy = yes` — irrelevant until IMAP is migrated.
## Test Plan
### Automated (verified pre-commit)
```
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
postconf -M | grep '^2525'
2525 inet n - y - 1 postscreen \
-o syslog_name=postfix/smtpd-proxy \
-o postscreen_upstream_proxy_protocol=haproxy \
-o postscreen_upstream_proxy_timeout=5s
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
ss -ltn | grep -E ':25\b|:2525'
LISTEN 0 100 0.0.0.0:2525 0.0.0.0:*
LISTEN 0 100 0.0.0.0:25 0.0.0.0:*
$ kubectl get svc -n mailserver mailserver-proxy
NAME TYPE CLUSTER-IP PORT(S) AGE
mailserver-proxy NodePort 10.98.213.164 25:30125/TCP 93s
# Expected-to-fail probe (no PROXY header) → postscreen rejects
$ timeout 8 nc -v 10.0.20.101 30125 </dev/null
Connection to 10.0.20.101 30125 port [tcp/*] succeeded!
421 4.3.2 No system resources
```
### Manual Verification (after Phase 2 — pfSense HAProxy)
Once HAProxy on pfSense is configured to listen on alt port :2525 (not the
real :25 yet) and targets `k8s-nodes:30125` with `send-proxy-v2`:
1. From an external host: `swaks --to smoke-test@viktorbarzin.me
--server <pfsense-ip>:2525 --body "phase 1 test"`
2. In mailserver logs: `kubectl logs -c docker-mailserver deployment/mailserver
| grep postfix/smtpd-proxy` — "connect from [<external-ip>]" with the real
public IP, NOT the k8s node IP.
3. E2E probe CronJob keeps green (uses ClusterIP path, unaffected).
## Reproduce locally
1. `kubectl get svc mailserver-proxy -n mailserver` → NodePort 30125 exists
2. `kubectl get cm mailserver-user-patches -n mailserver` → exists
3. `timeout 8 nc -v <k8s-node>:30125 </dev/null` → "421 4.3.2 No system resources"
(postscreen rejecting malformed PROXY)
## Context
A MAM (MyAnonamouse) freeleech farming workflow was deployed on 2026-04-14
via kubectl apply (outside Terraform). Five days later the account was
still stuck in Mouse class: 715 MiB downloaded, 0 uploaded, ratio 0.
Tracker responses on 7 of 9 active torrents returned
`status=4 | msg="User currently mouse rank, you need to get your ratio up!"`
— MAM was actively refusing to serve peer lists because the account was
in Mouse class, and refusing to serve peer lists made the ratio impossible
to recover. Meanwhile the grabber kept digging: 501 torrents sat in
qBittorrent, 0 completed, 0 bytes uploaded.
Root causes (ranked):
1. Death spiral — Mouse class blocks announces, nothing uploads.
2. BP-spender 30 000 BP threshold blocked the only exit even though the
account already had 24 500 BP.
3. Grabber selection (`score = 1.0 / (seeders+1)`) preferred low-demand
torrents filtered to <100 MiB — ratio-hostile by design.
4. Grabber/cleanup deadlock: cleanup only fired on seed_time > 3d, so
torrents that never started never qualified. Combined with the 500-
torrent cap this stalled the grabber indefinitely.
5. qBittorrent queueing amplified (4) — 495/501 stuck in queuedDL.
6. Ratio-monitor labelled queued torrents `unknown` (empty tracker
field), hiding the problem on the MAM Grafana panel.
7. qBittorrent memory limit (256 Mi LimitRange default) too low.
8. All of the above was Terraform drift with no reviewability.
## This change
Introduces `stacks/servarr/mam-farming/` — a new TF module that adopts
the three kubectl-applied resources and replaces their scripts with
demand-first, H&R-aware logic. Also bumps qBittorrent resources, fixes
ratio-monitor labelling, and adds five Prometheus alerts plus a Grafana
panel row.
### Architecture
MAM API ───┬─── jsonLoad.php (profile: ratio, class, BP)
├─── loadSearchJSONbasic.php (freeleech search)
├─── bonusBuy.php (50 GiB min tier for API)
└─── download.php (torrent file)
│
Pushgateway <──┬────────────┤
│ mam_ratio ┌────────────────────┐
│ mam_class_code │ freeleech-grabber │ */30
│ mam_bp_balance ◄───│ (ratio-guarded) │
│ mam_farming_* └──────────┬─────────┘
│ mam_janitor_* │ adds to
│ ▼
│ Grafana panels qBittorrent (mam-farming)
│ + 5 alerts ▲
│ │ deletes by rule
│ ┌──────────┴─────────┐
│ ◄───│ farming-janitor │ */15
│ │ (H&R-aware) │
│ └──────────┬─────────┘
│ │ buys credit
│ ┌──────────┴─────────┐
└───────────────────────│ bp-spender │ 0 */6
│ (tier-aware) │
└────────────────────┘
### Key decisions
- **Ratio guard on grabber** — refuse to grab if ratio < 1.2 OR class ==
Mouse. Prevents the death spiral from deepening. Emits
`mam_grabber_skipped_reason{reason=...}` and exits clean.
- **Demand-first selection** — new score formula
`leechers*3 - seeders*0.5 + 200 if freeleech_wedge else 0`; size band
50 MiB – 1 GiB; leecher floor 1; seeder ceiling 50. Picks titles that
will actually upload.
- **Janitor decoupled from grabber** — runs every 15 min regardless of
the ratio-guard state. Without this, stuck torrents accumulate
fastest exactly when the grabber is skipping (Mouse class). H&R-aware:
never deletes `progress==1.0 AND seeding_time < 72h`. Six delete
reasons observable via `mam_janitor_deleted_per_run{reason=...}`.
- **BP-spender tier-aware** — MAM imposes a hard 50 GiB minimum on API
buyers ("Automated spenders are limited to buying at least 50 GB...
due to log spam"). Valid API tiers: 50/100/200/500 GiB at 500 BP/GiB.
The spender picks the smallest tier that satisfies the ratio deficit
AND fits the budget, preserving a 500 BP reserve. If even the 50 GiB
tier is too expensive, it skips and retries on the next 6-hour cron.
- **Authoritative metrics use MAM profile fields** —
`downloaded_bytes` / `uploaded_bytes` (integers) rather than the
pretty-printed `downloaded` / `uploaded` strings like "715.55 MiB"
that MAM also returns.
- **Ratio-monitor category-first labelling** — `tracker` is empty for
queued torrents that never announced. Now maps `category==mam-farming`
to label `mam` first, only falls back to tracker-URL parsing when
category is absent. Stops hundreds of MAM torrents collecting under
`unknown`.
- **qBittorrent resources bumped** to `requests=512Mi / limits=1Gi` so
hundreds of active torrents don't OOM.
### Emergency recovery performed this session
1. Adopted 5 in-cluster resources via root-module `import {}` blocks
(Terraform 1.5+ rejects imports inside child modules).
2. Ran the janitor in DRY_RUN=1 to verify rules against live state —
466 `never_started` candidates, 0 false positives in any other
reason bucket. Flipped to enforce mode.
3. Janitor deleted 466 stuck torrents (matches plan's ~495 target; 35
preserved as active/in-progress).
4. Truncated `/data/grabbed_ids.txt` so newly-popular titles become
eligible again.
The ratio is still 0 because the API cannot buy below 50 GiB and the
account sits at 24 551 BP (needs 25 000). Manual 1 GiB purchase via the
MAM web UI — 500 BP — would immediately lift the account to ratio ≈ 1.4
and unblock announces. Future automation cannot do this for us due to
MAMs anti-spam rule.
### What is NOT in this change
- qBittorrent prefs reconciliation (max_active_downloads=20,
max_active_uploads=150, max_active_torrents=150). The plan wanted
this; deferred to a follow-up because the janitor + ratio recovery
handles the 500-torrent backlog first. A small reconciler CronJob
posting to /api/v2/app/setPreferences is the intended follow-up.
- VIP purchase (~100 k BP) — deferred until BP accumulates.
- Cross-seed / autobrr — separate initiative.
## Alerts added
- P1 MAMMouseClass — `mam_class_code == 0` for 1h
- P1 MAMCookieExpired — `mam_farming_cookie_expired > 0`
- P2 MAMRatioBelowOne — `mam_ratio < 1.0` for 24h (replaces old
QBittorrentMAMRatioLow, now driven by authoritative profile metric)
- P2 MAMFarmingStuck — no grabs in 4h while ratio is healthy
- P2 MAMJanitorStuckBacklog — `skipped_active > 400` for 6h
## Test plan
### Automated
$ cd infra/stacks/servarr && ../../scripts/tg plan 2>&1 | grep Plan
Plan: 5 to import, 2 to add, 6 to change, 0 to destroy.
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 5 imported, 2 added, 6 changed, 0 destroyed.
# Re-plan after import block removal (idempotent)
$ ../../scripts/tg plan 2>&1 | grep Plan
Plan: 0 to add, 1 to change, 0 to destroy.
# The 1 change is a pre-existing MetalLB annotation drift on the
# qbittorrent-torrenting Service — unrelated to this change.
$ cd ../monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
# Python + JSON syntax
$ python3 -c 'import ast; [ast.parse(open(p).read()) for p in [
"infra/stacks/servarr/mam-farming/files/freeleech-grabber.py",
"infra/stacks/servarr/mam-farming/files/bp-spender.py",
"infra/stacks/servarr/mam-farming/files/mam-farming-janitor.py"]]'
$ python3 -c 'import json; json.load(open(
"infra/stacks/monitoring/modules/monitoring/dashboards/qbittorrent.json"))'
### Manual Verification
1. Grabber ratio-guard path:
$ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
$ kubectl -n servarr logs job/g1
Skip grab: ratio=0.0 class=Mouse (floor=1.2) reason=mouse_class
2. BP-spender tier path:
$ kubectl -n servarr create job --from=cronjob/mam-bp-spender s1
$ kubectl -n servarr logs job/s1
Profile: ratio=0.0 class=Mouse DL=0.70 GiB UL=0.00 GiB BP=24551
| deficit=1.40 GiB needed=3 affordable=48 buy=0
Done: BP=24551, spent=0 GiB (needed=3, affordable=48)
Correctly skips because affordable (48) < smallest API tier (50).
3. Janitor in enforce mode:
$ kubectl -n servarr create job --from=cronjob/mam-farming-janitor j1
$ kubectl -n servarr logs job/j1 | tail -3
Done: deleted=466 preserved_hnr=0 skipped_active=35 dry_run=False
per reason: {'never_started': 466, ...}
Second run immediately after: `deleted=0 skipped_active=35` —
steady state with only active/seeding torrents left.
4. Alerts loaded:
$ kubectl -n monitoring get cm prometheus-server \
-o jsonpath='{.data.alerting_rules\.yml}' \
| grep -E "alert: MAM|alert: QBittorrent"
- alert: MAMMouseClass
- alert: MAMCookieExpired
- alert: MAMRatioBelowOne
- alert: MAMFarmingStuck
- alert: MAMJanitorStuckBacklog
- alert: QBittorrentDisconnected
- alert: QBittorrentMAMUnsatisfied
5. Dashboard: browse to Grafana "qBittorrent - Seeding & Ratio" → new
"MAM Profile (from jsonLoad.php)" row at the bottom shows class, BP
balance, profile ratio, transfer, BP-vs-reserve timeseries, janitor
deletion stacked chart, janitor state stat, grabber state stat.
## Reproduce locally
1. `cd infra/stacks/servarr && ../../scripts/tg plan` — expect
0 add / 1 change (unrelated MetalLB annotation drift).
2. `kubectl -n servarr get cronjobs` — expect three:
mam-freeleech-grabber, mam-bp-spender, mam-farming-janitor.
3. Trigger each via `kubectl create job --from=cronjob/<name> <job>`
and read logs; outputs match the manual-verification snippets above.
Closes: code-qfs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Ships the monorepo commit
(code@2fd7670d [claude-agent-service] Raise /execute default timeout
from 15m to 45m) that raises ExecuteRequest.timeout_seconds from 900 to
2700. The auto-upgrade pipeline (DIUN → n8n → claude-agent-service →
service-upgrade agent) had been silently timing out mid-run for 3 days:
139 × 202 Accepted + 6 × TimeoutError in the last 24h, zero commits to
infra, zero Slack posts. Root cause was the 15-minute cap truncating
CAUTION-class upgrades that need to summarise multi-release changelogs,
poll Woodpecker CI, and wait on on-demand DB backup CronJobs.
## What changed
`local.image_tag` 0c24c9b6 → 2fd7670d. Image built + pushed to
registry.viktorbarzin.me/claude-agent-service:2fd7670d. Deployment is
`Recreate`, so the single pod is dropped + recreated.
## Test Plan
### Automated
`terraform plan` — `Plan: 0 to add, 1 to change, 0 to destroy` (3
container image refs flip from 0c24c9b6 → 2fd7670d).
`terraform apply` — `Apply complete! Resources: 0 added, 1 changed,
0 destroyed.`
### Manual Verification
```
$ kubectl -n claude-agent rollout status deploy/claude-agent-service --timeout=120s
deployment "claude-agent-service" successfully rolled out
$ kubectl -n claude-agent get deploy claude-agent-service \
-o jsonpath='{.spec.template.spec.containers[0].image}'
registry.viktorbarzin.me/claude-agent-service:2fd7670d
$ kubectl -n claude-agent exec deploy/claude-agent-service -- \
sh -c 'cd /srv && python3 -c "from app.main import ExecuteRequest; \
print(ExecuteRequest(prompt=\"p\", agent=\"a\").timeout_seconds)"'
2700
```
Next DIUN cycle (every 6h) should land ≥1 unattended upgrade as an
infra commit + Slack message without TimeoutError in the agent logs.
Closes: code-cfy
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
code-vnc confirmed `viktorbarzin/dovecot_exporter` cannot produce real
metrics against docker-mailserver 15.0.0's Dovecot 2.3.19 — the
exporter speaks the pre-2.3 `old_stats` FIFO protocol, which Dovecot
2.3 deprecated in favour of `service stats` + `doveadm-server` with
a different wire format. The scrape only ever returned
`dovecot_up{scope="user"} 0`.
code-1ik listed two paths: (a) switch to a Dovecot 2.3+ exporter, or
(b) retire the exporter + scrape + alerts. Picking (b) — carrying a
no-op exporter + scrape + alert group taxes cluster resources,
clutters Prometheus /targets, and tees up an alert that can never
fire correctly. If a future session needs real Dovecot stats, reach
for a known-good exporter (e.g., jtackaberry/dovecot_exporter) and
rebuild this scaffolding.
## This change
### mailserver stack
- Removes the `dovecot-exporter` container from
`kubernetes_deployment.mailserver` (was ~28 lines). Pod now
runs a single `docker-mailserver` container.
- Removes `kubernetes_service.mailserver_metrics` (ClusterIP Service
added in code-izl). The `mailserver` LoadBalancer (ports 25, 465,
587, 993) is unaffected.
- Drops the dovecot.cf comment documenting the failed code-vnc
attempt — the documentation survives here + in bd code-vnc /
code-1ik.
### monitoring stack
- Removes `job_name: 'mailserver-dovecot'` from `extraScrapeConfigs`.
- Removes the `Mailserver Dovecot` PrometheusRule group
(`DovecotConnectionsNearLimit`, `DovecotExporterDown`).
- Inline comments in both files point future work at code-1ik's
decision record.
Prometheus configmap-reload picked up the change; scrape target set
now has zero entries for `mailserver-dovecot`. Pod rolled cleanly to
1/1 Running.
## What is NOT in this change
- No replacement exporter — deliberate. The alert that was removed
was a false-signal alert; its removal returns cluster alerting to
a correct, lower-noise state.
- mailserver MetalLB Service + SMTP/IMAP ports — unchanged.
- `auth_failure_delay`, `mail_max_userip_connections` — stay; those
are unrelated to stats export.
## Test Plan
### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver
NAME READY STATUS RESTARTS AGE
mailserver-78589bfd95-swz6h 1/1 Running 0 49s
$ kubectl get svc -n mailserver
NAME TYPE PORT(S)
mailserver LoadBalancer 25/TCP,465/TCP,587/TCP,993/TCP
roundcubemail ClusterIP 80/TCP
# mailserver-metrics gone
$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
{"status":"success","data":{"activeTargets":[]}}
```
### Manual Verification
1. E2E probe `email-roundtrip-monitor` keeps succeeding (20-min cadence)
2. `EmailRoundtripFailing` stays green — proves IMAP is healthy even
without the exporter signal
3. Prometheus `/alerts` page no longer shows DovecotConnectionsNearLimit
or DovecotExporterDown
Closes: code-1ik
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
bd code-vnc investigated why `viktorbarzin/dovecot_exporter` only
exposed `dovecot_up{scope="user"} 0`. Root cause: the exporter speaks
the legacy pre-2.3 `old_stats` FIFO wire protocol. docker-mailserver
15.0.0 ships Dovecot 2.3.19, which moved to `service stats` with a
different architecture — `doveadm stats dump` on the old-stats
unix_listener returns "Failed to read VERSION line" and the exporter
loops on "Input does not provide any columns".
Attempted fix: enabled `old_stats` plugin via `mail_plugins` +
declared `service old-stats { unix_listener stats-reader }`. Socket
was created but protocol incompatibility made it useless. Reverted.
## This change
- Reverts the attempted dovecot.cf additions
- Adds a comment in the dovecot.cf heredoc explaining why we
deliberately do NOT enable old_stats here
- `auth_failure_delay = 5s` (code-9mi) and
`mail_max_userip_connections = 50` stay — they're unrelated to
stats
## What is NOT in this change
- A replacement exporter — filed as follow-up bd code-1ik with
two paths: switch to jtackaberry/dovecot_exporter, or retire the
exporter+scrape+alert entirely
- The `mailserver-metrics` ClusterIP Service (from code-izl) —
kept; it will be useful for whichever path code-1ik chooses
## Test Plan
### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
supervisorctl status dovecot postfix
dovecot RUNNING pid 1022, uptime 0:00:27
postfix RUNNING pid 1063, uptime 0:00:26
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```
### Manual Verification
Dovecot config returns to baseline + auth_failure_delay. Mail continues
to flow (E2E probe continues to succeed via `email-roundtrip-monitor`).
Closes: code-vnc
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Companion change to payslip-ingest v2 (regex parser + accurate RSU tax
attribution). The Grafana dashboard now has 4 more panels powered by the
new earnings-decomposition and YTD-snapshot columns, and the Claude
fallback agent's prompt is aligned with the new schema so non-Meta
payslips still land with the full field set.
## This change
### `.claude/agents/payslip-extractor.md`
Rewrites the RSU handling section to match Meta UK's actual template
(rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching
rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead).
Adds a new "Earnings decomposition (v2)" section telling the fallback
agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_*
and when to use pension_employee vs pension_sacrifice without
double-counting.
### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`
- **Panel 4 (Effective rate)** — SQL switched from the naive
`(income_tax + NIC) / cash_gross` to the YTD-effective-rate
method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid /
ytd_taxable_pay)`. Title updated to "YTD-corrected" so the
change is discoverable.
- **Panel 5 (Table)** — adds salary, bonus, pension_sacrifice,
taxable_pay columns so row-level debugging against the parser
output is trivial.
- **+Panel 8 (Earnings breakdown)** — monthly stacked bars of
salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice
months show up as a massive negative pension_sacrifice spike
paired with a near-zero bonus bar.
- **+Panel 9 (Accurate cash tax rate)** — timeseries of
cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU
contribution the payslip hides in the single `Tax paid` line.
- **+Panel 10 (All-in compensation)** — stacked bars of cash_gross
+ rsu_vest per payslip.
- **+Panel 11 (YTD cumulative cash gross vs total comp)** — two
lines partitioned by tax_year; the gap between them is the RSU
contribution YTD.
Total panels go from 7 → 11.
## Test Plan
### Automated
Dashboard JSON validity:
```
$ python3 -m json.tool uk-payslip.json > /dev/null && echo ok
ok
```
### Manual Verification
After applying `stacks/monitoring/`:
1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels
2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the
negative pension_sacrifice bar in panel 8
3. Panel 9 "Accurate cash effective tax rate" shows the
cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in
RSU-vest months
## Reproduce locally
1. `cd infra/stacks/monitoring && terragrunt plan`
2. Expected: ConfigMap diff on the payslip dashboard with the new panel
JSON
3. `terragrunt apply` — Grafana reloads the dashboard automatically
(configmap-reload sidecar)
Relates to: payslip-ingest commit 9741816
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts.
Cluster-health check flagged it as WARN; Prometheus was firing
`StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts continuously.
## Root cause
Memory limit 64Mi is too tight. Master steady-state is only 21Mi, but
the replica needs transient headroom during PSYNC full resync:
- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during snapshot)
- Replication backlog tracking
The replica RSS crossed 64Mi during sync and was OOM-killed (exit 137),
looping forever. This also broke Sentinel quorum when master would
fail — no healthy replica to promote.
## Fix
Master + replica: 64Mi → 256Mi (both requests and limits, per
`CLAUDE.md` resource management rule: `requests=limits` based on
VPA upperBound).
Sentinels stay at 64Mi — they don't store data.
## Deployment note
Helm upgrade initially deadlocked because StatefulSet uses
`OrderedReady` podManagementPolicy: the update rollout refuses to start
until all pods Ready, but redis-node-1 could not be Ready without the
update. Recovered via:
helm rollback redis 43 -n redis
kubectl -n redis patch sts redis-node --type=strategic \
-p '{...memory: 256Mi...}'
kubectl -n redis delete pod redis-node-1 --force
Then `scripts/tg apply` cleanly reconciled state. Deadlock-recovery
runbook to be written under `code-cnf`.
## Verification
kubectl -n redis get pods
redis-node-0 2/2 Running 0 <bounce>
redis-node-1 2/2 Running 0 <bounce>
kubectl -n redis get sts redis-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
256Mi
## Follow-ups filed
- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically
(separate root cause; surfaced via same cluster-health run)
- code-cnf: runbook / TF tweak for the OrderedReady + atomic-wait
deadlock recovery
Closes: code-pqt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The mailserver container had `capabilities.add = ["NET_ADMIN"]`. Upstream
docker-mailserver docs say the capability is only needed by Fail2ban to
run iptables ban actions. Fail2ban is DISABLED in this stack
(`ENABLE_FAIL2BAN=0`, see line ~68) — CrowdSec owns the brute-force
policy at the LB layer. The capability was therefore unused ballast and
a minor attack-surface reduction opportunity. Addresses code-4mu.
## This change
Replaces the explicit `capabilities { add = ["NET_ADMIN"] }` block with
an empty `security_context {}`. Post-rollout verification
(`supervisorctl status`) confirms every service we actually run is
healthy — dovecot, postfix, rspamd, rsyslog, postsrsd, changedetector,
cron, mailserver. Every STOPPED entry was already disabled.
The inline comment documents the revert trigger: check
`kubectl logs -c docker-mailserver` for permission-denied patterns and
restore the capability if observed.
## Test Plan
### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver -o jsonpath='{.items[0].spec.containers[?(@.name=="docker-mailserver")].securityContext}'
{"allowPrivilegeEscalation":true,"privileged":false,"readOnlyRootFilesystem":false,"runAsNonRoot":false}
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
supervisorctl status | grep RUNNING
changedetector RUNNING ...
cron RUNNING ...
dovecot RUNNING ...
mailserver RUNNING ...
postfix RUNNING ...
postsrsd RUNNING ...
rspamd RUNNING ...
rsyslog RUNNING ...
```
### Observation window
EmailRoundtripFailing + EmailRoundtripStale alerts continue to run
every 20 min. If no alert fires in the 24h post-rollout window
(through ~2026-04-20 10:40 UTC), the change is considered safe and
this commit stands. Otherwise revert this commit.
## What is NOT in this change
- readOnlyRootFilesystem (separate hardening, out of scope)
- runAsNonRoot (docker-mailserver needs root for Postfix)
- Removing privilege-escalation defaults (container needs those for
chowning mail spool at startup)
Closes: code-4mu
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Port 9166 (`dovecot-metrics`) was exposed on the public MetalLB
LoadBalancer 10.0.20.202 alongside SMTP/IMAP. While only LAN-routable,
shipping an internal metric on the same listening IP as external mail
conflated two concerns and over-exposed the port. Prometheus was
scraping via the same LB Service. Addresses code-izl (follow-up to
code-61v which added the scrape job).
## This change
### mailserver stack
- Drops `dovecot-metrics` port from `kubernetes_service.mailserver`
(LoadBalancer stays: 25, 465, 587, 993).
- Adds new `kubernetes_service.mailserver_metrics` — ClusterIP-only,
selecting the same `app=mailserver` pod, exposing 9166.
### monitoring stack
- Updates `extraScrapeConfigs` in the Prometheus chart values to
target the new `mailserver-metrics.mailserver.svc.cluster.local:9166`
instead of `mailserver.mailserver.svc.cluster.local:9166`.
- helm_release.prometheus updated in-place; configmap-reload sidecar
picked up the new target within 10s.
```
mailserver LB mailserver-metrics ClusterIP
┌──────────────────┐ ┌──────────────────┐
│ 25 smtp │ │ 9166 dovecot- │
│ 465 smtp-secure │ │ metrics │ ← Prometheus only
│ 587 smtp-auth │ └──────────────────┘
│ 993 imap-secure │
└──────────────────┘
↑ 10.0.20.202
```
## What is NOT in this change
- Per-Service RBAC/NetworkPolicy tightening (separate task)
- Moving the metrics port to a dedicated sidecar-only Service Monitor
(ServiceMonitor CRDs not installed; extraScrapeConfigs is correct
for the prometheus-community chart in use)
## Test Plan
### Automated
```
$ kubectl get svc -n mailserver
mailserver LoadBalancer 10.0.20.202 25/TCP,465/TCP,587/TCP,993/TCP
mailserver-metrics ClusterIP 10.100.102.174 9166/TCP
$ kubectl get endpoints -n mailserver mailserver-metrics
mailserver-metrics 10.10.169.163:9166
$ # Prometheus target (after 10s configmap-reload)
$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
scrapeUrl: http://mailserver-metrics.mailserver.svc.cluster.local:9166/metrics
health: up
```
### Manual Verification
1. From a host outside the cluster: `nc -vz 10.0.20.202 9166` → connection refused
2. Prometheus UI `/targets` → `mailserver-dovecot` UP, labels show new DNS name
3. PromQL: `up{job="mailserver-dovecot"}` returns `1`
Closes: code-izl
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The `postfix-accounts.cf` ConfigMap renders `bcrypt(pass, 6)` for each
user in `var.mailserver_accounts`. bcrypt generates a fresh salt on
every evaluation → the ConfigMap `data` hash line differs every plan
run. `ignore_changes = [data["postfix-accounts.cf"]]` was the pragmatic
workaround, but the side-effect wasn't documented: a Vault rotation of
a mailserver password would be MASKED by ignore_changes — TF would
never push the new hash and the pod would keep accepting the old
password until manual taint/replace.
Addresses bd code-7ns.
## This change
Inline comment on the lifecycle block spelling out:
- Why ignore_changes exists (non-deterministic bcrypt)
- What the invariant costs (masks automatic rotation)
- Why it's acceptable TODAY (no automatic rotation for
mailserver_accounts — verified in Vault; manual password change is a
manual TF run anyway)
- Two concrete alternatives if rotation is ever added:
(a) deterministic bcrypt with stable per-user salt
(b) render from an ESO-synced K8s Secret
No code change, no apply needed — this is a comment-only commit. The
decision (live-with + document) is one of the three options in the plan.
## What is NOT in this change
- Deterministic hashing (not needed until automatic rotation exists)
- ESO-driven Secret (same reason)
- Removal of ignore_changes (would cause the original drift flap)
## Test Plan
### Automated
```
$ cd stacks/mailserver && /home/wizard/code/infra/scripts/tg plan
# no diff expected on this comment-only change; other drift remains
# but is pre-existing and out of scope.
```
### Manual Verification
Read the new comment block at `stacks/mailserver/modules/mailserver/
main.tf` around the postfix-accounts-cf lifecycle — comprehensible
without session context.
Closes: code-7ns
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Dovecot's `dovecot.cf` block previously set only
`mail_max_userip_connections = 50`. No equivalent of the SMTP rate
limit existed for IMAP auth — brute-force against IMAP/POP auth was
throttled only by CrowdSec at the LB level. Adding an in-process
auth delay is cheap defense in depth. Addresses code-9mi.
## This change
Adds `auth_failure_delay = 5s` to the dovecot.cf ConfigMap key.
Each failed auth attempt pauses 5s before responding; a sequential
1000-entry dictionary attack stretches from <1s to ~85min, bought
out CrowdSec's ban window.
## What is NOT in this change
- `login_processes_count` tuning (workload doesn't warrant it yet)
- Equivalent SMTP AUTH delay (CrowdSec already covers, and SMTP AUTH
is rate-limited via `smtpd_client_connection_rate_limit`)
## Test Plan
### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
doveconf -n | grep -E 'auth_failure|mail_max_userip'
auth_failure_delay = 5 secs
mail_max_userip_connections = 50
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```
### Manual Verification
1. `openssl s_client -connect mail.viktorbarzin.me:993`
2. `a1 LOGIN bogus@viktorbarzin.me wrongpass` — expect ~5s delay before `NO [AUTHENTICATIONFAILED]`
3. Fire 5 failed attempts rapidly: total ≥25s
## Reproduce locally
1. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- doveconf -n | grep auth_failure`
2. Expected: `auth_failure_delay = 5 secs`
Closes: code-9mi
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
docker-mailserver 15.0.0's default Postfix config does NOT set
`smtpd_tls_auth_only = yes`. Clients that skip STARTTLS on port 587
(or 25 with AUTH) can send PLAIN/LOGIN creds in cleartext. CrowdSec
and rate limiting don't catch this — it's an auth-path leak, not a
bruteforce. Addresses bd code-vnw.
## This change
Adds `smtpd_tls_auth_only = yes` to `postfix_cf` (applied via the
`postfix-main.cf` ConfigMap key consumed by docker-mailserver).
Rolled the pod to pick up the new ConfigMap.
### Deviation from task spec
code-vnw's fix field cited `smtpd_sasl_auth_only = yes`. That is NOT
a real Postfix parameter — attempting it gets
`postconf: warning: smtpd_sasl_auth_only: unknown parameter`. The
acceptance test (reject PLAIN auth before STARTTLS) is satisfied by
`smtpd_tls_auth_only`, which is the correct knob. Added an inline
comment noting the common confusion.
## What is NOT in this change
- Per-service override in master.cf (smtpd_tls_auth_only applied
globally, which is safe because port 25 doesn't accept AUTH here)
- Other Postfix hardening (sender_restrictions, etc.)
## Test Plan
### Automated
```
$ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
postconf smtpd_tls_auth_only
smtpd_tls_auth_only = yes
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```
### Manual Verification
1. `openssl s_client -connect mail.viktorbarzin.me:587 -starttls smtp`
2. At prompt, send `AUTH PLAIN <base64>` BEFORE `STARTTLS`
3. Expected: Postfix rejects with `503 5.5.1 Error: authentication not enabled`
4. Follow-up: STARTTLS first, then `AUTH PLAIN <base64>` — succeeds for valid creds
## Reproduce locally
1. From a shell with `kubectl` access to the cluster:
2. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- postconf smtpd_tls_auth_only`
3. Expected: `smtpd_tls_auth_only = yes`
Closes: code-vnw
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
`viktorbarzin/dovecot_exporter:latest` was consumed with `IfNotPresent`
pull, which means whichever node landed the pod kept whatever digest
was cached from an earlier pull. A SHA-level pin is the reproducibility
baseline this repo uses for every other home-built image
(`headscale`, `excalidraw`, `linkwarden`).
## This change
- Pins `dovecot-exporter` container image to
`viktorbarzin/dovecot_exporter@sha256:1114224c...` — the digest the
pod is actually running today (captured from live `imageID`).
- Enables Diun tag watching on the mailserver Deployment
(`diun.enable=true`, `diun.include_tags=^latest$`) so new `:latest`
digests trigger a notification rather than silently landing on the
next `IfNotPresent` miss.
Deviation from task spec (code-cno): the task asked for an 8-char SHA
*tag*, but Docker Hub only publishes `:latest` for this image — a SHA
tag doesn't exist. Used the digest-pin pattern already established at
`stacks/headscale/modules/headscale/main.tf:204` instead; Diun watches
the `:latest` tag for drift, which is the equivalent notification.
## What is NOT in this change
- Volume-mount ordering drift on `kubernetes_deployment.mailserver`
(pre-existing; tolerated by Waves 1+2).
- Splitting the metrics port into its own Service (code-izl).
## Test Plan
### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver \
-o jsonpath='{.items[0].spec.containers[*].image}'
docker.io/mailserver/docker-mailserver:15.0.0 \
viktorbarzin/dovecot_exporter@sha256:1114224c9bf0261ca8e9949a6b42d3c5a2c923d34ca4593f6b62f034daf14fc5
$ kubectl get deployment -n mailserver mailserver \
-o jsonpath='{.spec.template.metadata.annotations}'
{"diun.enable":"true","diun.include_tags":"^latest$"}
$ kubectl rollout status deployment/mailserver -n mailserver
deployment "mailserver" successfully rolled out
```
### Manual Verification
1. Push a new `:latest` digest to the exporter image (or wait for one).
2. Check Diun notifier output: a tag event for `^latest$` should fire.
3. `kubectl describe deployment/mailserver -n mailserver` shows the
digest pin unchanged until someone rebumps it.
## Reproduce locally
1. `kubectl -n mailserver get pod -l app=mailserver -o yaml | \
grep -A1 dovecot_exporter`
2. Expected: `image: viktorbarzin/dovecot_exporter@sha256:1114224c...`.
Closes: code-cno
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port 9166 (`dovecot-metrics`) is exposed on the mailserver Service but
nothing was scraping it. Added a static `mailserver-dovecot` scrape job
to `extraScrapeConfigs` (we run `prometheus-community/prometheus`, not
`kube-prometheus-stack`, so no ServiceMonitor CRDs are available).
Two alerts in a new `Mailserver Dovecot` rule group:
- `DovecotConnectionsNearLimit` fires at ≥42/50 IMAP connections for
5m (85% of `mail_max_userip_connections = 50`).
- `DovecotExporterDown` fires if the scrape target is unreachable
for 10m (catches pod restarts + network issues).
Originally drafted as `kubernetes_manifest` ServiceMonitor + PrometheusRule
on `mailserver-beta1` branch; that commit is abandoned because the
CRDs aren't installed. This path is functionally equivalent and plans
cleanly.
Closes: code-61v
## Context
The @viktorbarzin.me catch-all routes to spam@viktorbarzin.me. The
mailbox had no retention policy. On 2026-04-18 it held 519 messages
consuming 43 MiB. Without a policy, the only brake on growth was
manual deletion, which has not been happening - hence the bd task.
Viktor's explicit constraint when filing code-oy4: DO NOT blind
age-expunge. We need targeted retention that keeps genuine forwarded
human mail for a long time while shedding the recurring-newsletter
cruft that dominates the byte count.
## Profile findings (2026-04-18, verified on the live pod)
Total: 519 messages, 43 MiB, 0 in new/, 0 in tmp/.
Top senders by volume:
138 dan@tldrnewsletter.com
51 hi@ratepunk.com
40 uber@uber.com
35 truenas@viktorbarzin.me
19 ubereats@uber.com
15 hello@travel.jacksflightclub.com
12 chris@chriswillx.com
10 me@viktorbarzin.me
Top senders by storage bytes:
8,176,481 dan@tldrnewsletter.com (19 % of 43 MiB alone)
2,866,104 uber@uber.com
2,207,458 noreply@mail.selfh.st
2,066,094 hi@ratepunk.com
1,675,435 ubereats@uber.com
Age distribution:
97 % older than 14 days (502 / 519)
23 % older than 90 days (121 / 519)
Automated-sender markers:
66 % carry List-Unsubscribe: (342 / 519)
4 % carry Precedence: bulk|list|junk ( 21 / 519)
34 % carry neither marker (= human-ish tail) (177 / 519)
Combined "automated AND >14d": 328 messages -> target of rule 1.
## Retention strategy
Signed off by Viktor 2026-04-18. Two rules, both delete-leaf:
1. Older than 14 days AND header matches one of:
- `^List-Unsubscribe:`
- `^Precedence:\s*(bulk|list|junk)`
- `^Auto-Submitted:\s*auto-`
-> DELETE.
Rationale: these markers are the RFC-agreed indicators of bulk /
robotic senders. A 14-day window still lets genuine subscription
alerts (delivery, flight, calendar invite) come to attention.
2. Older than 90 days AND no automated marker at all
-> DELETE.
Rationale: these are long-tail forwards from real people to the
catch-all. 90 days is deliberately generous - I would rather
leak bytes than lose Viktor's personal correspondence.
3. Everything else -> KEEP (recent traffic, or aged human tail
younger than 90d).
## Implementation
A `kubernetes_cron_job_v1.spam_retention` running every 4h (at :17
past) that `kubectl exec`s a Python retention script into the
mailserver pod.
Why kubectl exec and not a sibling CronJob with the Maildir mounted:
mailserver-data-encrypted is a RWO volume held by the mailserver
pod. A sibling would fail to attach. The nextcloud-watchdog pattern
in stacks/nextcloud/main.tf already solves this for a similar
"interact with the live pod on a schedule" shape. Mirrored here with
its own SA + Role + RoleBinding scoped to list/get pods and create
pods/exec in the mailserver namespace only.
Why Python and not pure shell: POSIX `find + stat + awk` struggles
with the header-scan-up-to-blank-line rule, and `stat -c` is Linux-
GNU-specific anyway. The script reads each message's first 64 KiB,
stops at the first blank line, scans headers only, then checks mtime.
The CronJob streams the Python source via `kubectl exec -i ... --
python3 - <<PYEOF`. After the retention pass, `doveadm force-resync
-u spam@viktorbarzin.me INBOX/spam` refreshes Dovecot's cached index
so the deletions appear in IMAP immediately instead of after the next
pod restart.
Includes the standard KYVERNO_LIFECYCLE_V1 marker on the CronJob so
Kyverno ndots mutation does not cause perpetual drift.
## What is NOT in this change
- Dovecot sieve rules (no sieve infrastructure exists in the module;
the plan file's fallback option was precisely this CronJob path).
- Push of retention metrics to Pushgateway - the script prints them
to the job log for now; plumbing Pushgateway is a follow-up if
Viktor wants alerts.
- Any touch of other mailboxes - only `/var/mail/viktorbarzin.me/spam/cur`
is walked.
- Any mailserver pod restart or config reload.
## Test plan
### Automated
`terraform fmt` + `terragrunt hclfmt` pass. `scripts/tg plan` on the
mailserver stack shows:
Plan: 7 to add, 3 to change, 0 to destroy.
Of the 7 adds, 4 are mine (SA + Role + RoleBinding + CronJob). The
other 3 adds belong to the concurrent roundcube-backup CronJob +
nfs_roundcube_backup_host PV + PVC already on master in parallel.
The 3 in-place updates are pre-existing drift on the mailserver
Deployment, Service and email_roundtrip_monitor CronJob, not
introduced by this change.
### Manual Verification
After `scripts/tg apply` lands the CronJob:
1. Trigger an immediate run:
`kubectl -n mailserver create job --from=cronjob/spam-retention manual-1`
2. Wait for completion, read the log:
`kubectl -n mailserver logs job/manual-1`
-> expected tail:
spam_retention_scanned_total <N>
spam_retention_auto_deleted_total <M>
spam_retention_human_deleted_total <H>
spam_retention_kept_total <K>
spam_retention_errors_total 0
Retention pass complete
3. Confirm mailbox shrunk:
`kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \
-- du -sh /var/mail/viktorbarzin.me/spam/`
-> expected: well below 43 MiB within one run (bulk rule alone
purges ~328 messages per the profile numbers above).
4. Confirm IMAP reflects the deletions:
`kubectl -n mailserver exec deploy/mailserver -c docker-mailserver \
-- doveadm mailbox status -u spam@viktorbarzin.me messages INBOX/spam`
-> expected: message count dropped accordingly.
5. 4 hours later, confirm the next scheduled run logs a much
smaller scan count and 0 deletions (nothing new crossed the
threshold).
Closes: code-oy4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Roundcube webmail runs with two encrypted RWO PVCs (see roundcubemail.tf:
`roundcubemail-html-encrypted`, `roundcubemail-enigma-encrypted`). These
carry user-visible state that is NOT regenerable without user action:
- `html` PVC → Apache docroot, plugin installs, skin overrides, session
artefacts (two_factor_webauthn keys, persistent_login tokens, rcguard
throttle state)
- `enigma` PVC → user-uploaded PGP private keyrings
Per the subdir CLAUDE.md "Storage & Backup Architecture" rule every
proxmox-lvm* PVC MUST have a backup CronJob writing to NFS
`/mnt/main/<app>-backup/`. Mailserver already complies via code-z26's
`mailserver-backup` CronJob; Roundcube does not. Losing either Roundcube
PVC means users must re-add 2FA devices, re-install plugins, and
re-import PGP keys — none of it recoverable from a database dump.
Target task: `code-1f6`.
## This change
- Adds `module.nfs_roundcube_backup_host` sourcing
`modules/kubernetes/nfs_volume` pointed at
`/srv/nfs/roundcube-backup` on the Proxmox host (NFSv4, inotify
change-tracker picks it up for Synology offsite).
- Adds `kubernetes_cron_job_v1.roundcube-backup`:
- Schedule `10 3 * * *` — 10 minutes after `mailserver-backup`
(`0 3 * * *`) to avoid NFS write-window contention. Roundcube PVCs
are tiny (<200 MiB combined on current cluster) so the window is
well under 10 min.
- `pod_affinity` on `app=roundcubemail` (Roundcube runs 1 replica with
`Recreate` strategy on a fresh node per pod; the backup pod must
co-locate because both PVCs are RWO).
- `rsync -aH --delete --link-dest=/backup/<prev-week>` into
`/backup/<YYYY-WW>/{html,enigma}/` — hardlinks unchanged files vs
the previous weekly snapshot, keeping storage cost ~= delta only.
- Weekly rotation retains 8 snapshots (~2 months), matching
`mailserver-backup`.
- Pushgateway metrics under `job=roundcube-backup` so existing
`BackupDurationHigh` / `BackupStale` alert patterns detect
regressions without extra wiring.
- `KYVERNO_LIFECYCLE_V1` `ignore_changes` for mutated `dns_config`.
## Layout
```
NFS server 192.168.1.127:/srv/nfs/
├── mailserver-backup/ (0 3 * * * — code-z26)
│ └── <YYYY-WW>/{data,state,log}/
└── roundcube-backup/ (10 3 * * * — this change)
└── <YYYY-WW>/{html,enigma}/
```
## What is NOT in this change
- Changing the mailserver-backup CronJob to also cover Roundcube. Two
separate CronJobs keep the concerns (and pod anti-affinity/affinity)
clean; the 10-min stagger eliminates the contention justification for
merging them.
- Retention alerting tuning — existing Pushgateway/Prometheus rule
ecosystem suffices for now.
- Restore tooling — follows the standard pattern in
`docs/runbooks/` (rsync back, fix perms).
## Reproduce locally
1. Plan: `cd stacks/mailserver && scripts/tg plan -lock=false` →
2 new resources (nfs_volume module + CronJob).
2. Apply, then trigger a one-shot run:
`kubectl -n mailserver create job --from=cronjob/roundcube-backup roundcube-backup-manual-1`
3. Expected on success:
- `kubectl -n mailserver logs job/roundcube-backup-manual-1` → "=== Backup IO Stats ===".
- On Proxmox host:
`ls /srv/nfs/roundcube-backup/$(date +%Y-%W)/` → `html`, `enigma`.
- `/mnt/backup/.nfs-changes.log` (Proxmox) lists fresh paths under
`roundcube-backup/` within ~1s of the rsync finishing.
- Pushgateway: `curl -s prometheus-prometheus-pushgateway.monitoring:9091/metrics | grep roundcube`
shows `backup_duration_seconds`, `backup_last_success_timestamp`.
## Automated
- `terraform fmt -check -recursive stacks/mailserver/modules/mailserver/` → clean.
- `scripts/tg plan -lock=false` in stacks/mailserver expected to show
`+ module.nfs_roundcube_backup_host.*`, `+ kubernetes_cron_job_v1.roundcube-backup`.
Closes: code-1f6
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>