infra/stacks/mailserver
Viktor Barzin 1698cd1ce1 [mailserver] Add daily backup CronJob for mailserver PVC
## Context

The mailserver stack holds everything valuable and hard to recreate:
243M of maildirs, dovecot/rspamd state, and the DKIM private key that
signs outbound mail. Today the only defense is the LVM thin-pool
snapshots on the PVE host (7-day retention, storage-class scope only)
— there is no app-level backup. Infra/.claude/CLAUDE.md mandates that
every proxmox-lvm(-encrypted) app ship a NFS-backed backup CronJob,
and the mailserver stack was the only one still out of compliance.

Loss of mailserver-data-encrypted without backups = total loss of all
stored mail plus a DKIM key rotation (which requires a DNS update and
breaks signature verification on every message in transit for the TTL
window). Unacceptable for a service people actually use.

Trade-offs considered:
- mysqldump-style single-file dump vs rsync snapshot — maildirs are
  millions of small files, not a DB export. rsync --link-dest gives
  incremental weekly snapshots for ~10% of the cost of a full copy.
- RWO PVC read-only mount — the underlying PVC is ReadWriteOnce, so
  the backup Job has to co-locate with the mailserver pod. vaultwarden
  solves this with pod_affinity; mirrored here.
- Image choice — alpine + apk add rsync matches vaultwarden's pattern
  and keeps the container image small.

## This change

Adds `kubernetes_cron_job_v1.mailserver-backup` + NFS PV/PVC to the
mailserver module. Runs daily at 03:00 (avoids the 00:30 mysql-backup
and 00:45 per-db windows, and the */20 email-roundtrip cadence). The
job rsyncs /var/mail, /var/mail-state, /var/log/mail into
/srv/nfs/mailserver-backup/<YYYY-WW>/ with --link-dest against the
previous week for space-efficient incrementals. 8-week retention.

Data layout (flowed through from the deployment's subPath mounts so
the rsync tree matches the mailserver's own on-disk layout):

    PVC mailserver-data-encrypted (RWO, 2Gi)
      ├─ data/   (subPath) → pod's /var/mail        → backup/<week>/data/
      ├─ state/  (subPath) → pod's /var/mail-state  → backup/<week>/state/
      └─ log/    (subPath) → pod's /var/log/mail    → backup/<week>/log/

Safety:
- PVC mounted read-only (volume.persistent_volume_claim.read_only
  AND all three volume_mounts set read_only=true) so a backup-script
  bug cannot corrupt maildirs.
- pod_affinity on app=mailserver + topology_key=hostname forces the
  Job pod onto the same node holding the RWO PVC attachment.
- set -euxo pipefail + per-directory existence guard so a missing
  subPath short-circuits cleanly instead of silently no-op'ing.

Metrics pushed to Pushgateway match the mysql-backup/vaultwarden-backup
convention (job="mailserver-backup"):
  backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_output_bytes, backup_last_success_timestamp.

Alert rules added in monitoring stack, mirroring Mysql/Vaultwarden:
- MailserverBackupStale — 36h threshold, critical, 30m for:
- MailserverBackupNeverSucceeded — critical, 1h for:

## Reproduce locally

1. cd infra/stacks/mailserver && ../../scripts/tg plan
   Expected: 3 to add (cronjob + NFS PV + PVC), unrelated drift on
   deployment/service is pre-existing.
2. ../../scripts/tg apply --non-interactive \
     -target=module.mailserver.module.nfs_mailserver_backup_host \
     -target=module.mailserver.kubernetes_cron_job_v1.mailserver-backup
3. cd ../monitoring && ../../scripts/tg apply --non-interactive
4. kubectl create job --from=cronjob/mailserver-backup \
     mailserver-backup-test -n mailserver
5. kubectl wait --for=condition=complete --timeout=300s \
     job/mailserver-backup-test -n mailserver
6. Expected: test pod co-locates with mailserver on same node
   (k8s-node2 today), rsync writes ~950M to
   /srv/nfs/mailserver-backup/<YYYY-WW>/, Pushgateway exposes
   backup_output_bytes{job="mailserver-backup"}.

## Test Plan

### Automated

$ kubectl get cronjob -n mailserver mailserver-backup
NAME                SCHEDULE    TIMEZONE   SUSPEND   ACTIVE   LAST SCHEDULE   AGE
mailserver-backup   0 3 * * *   <none>     False     0        <none>          3s

$ kubectl create job --from=cronjob/mailserver-backup \
    mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test created

$ kubectl wait --for=condition=complete --timeout=300s \
    job/mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test condition met

$ kubectl logs -n mailserver job/mailserver-backup-test | tail -5
=== Backup IO Stats ===
duration: 80s
read:    1120 MiB
written: 1186 MiB
output:  947.0M

$ kubectl run nfs-verify --rm --image=alpine --restart=Never \
    --overrides='{...nfs mount /srv/nfs...}' \
    -n mailserver --attach -- ls -la /nfs/mailserver-backup/
947.0M  /nfs/mailserver-backup/2026-15

$ curl http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
    | grep mailserver-backup
backup_duration_seconds{instance="",job="mailserver-backup"} 80
backup_last_success_timestamp{instance="",job="mailserver-backup"} 1.776554641e+09
backup_output_bytes{instance="",job="mailserver-backup"} 9.92315701e+08
backup_read_bytes{instance="",job="mailserver-backup"} 1.175027712e+09
backup_written_bytes{instance="",job="mailserver-backup"} 1.244254208e+09

$ curl -s http://prometheus-server/api/v1/rules \
    | jq '.data.groups[].rules[] | select(.name | test("Mailserver"))'
MailserverBackupStale: (time() - kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"}) > 129600
MailserverBackupNeverSucceeded: kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"} == 0

### Manual Verification

1. Wait for the scheduled 03:00 run tonight; verify
   `kubectl get job -n mailserver` shows a new completed job.
2. Check that `backup_last_success_timestamp` advances past today.
3. Confirm `MailserverBackupNeverSucceeded` did not fire.
4. Next week (week 16), confirm `--link-dest` builds hardlinks vs
   2026-15 (size delta should drop from ~950M to ~the actual churn).

## Deviations from mysql-backup pattern

- Image: alpine + rsync (mirrors vaultwarden — mysql's `mysql:8.0`
  base is not applicable for a filesystem rsync).
- pod_affinity: required for RWO PVC co-location (mysql uses its own
  MySQL service for network access; mailserver must mount the PVC).
- Metric push via wget (mirrors vaultwarden; alpine has wget, not curl).
- Week-folder layout with --link-dest rotation: rsync pattern, closer
  to the PVE daily-backup script than mysql's single-file gzip dumps.

[ci skip]

Closes: code-z26

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:08 +00:00
..
modules/mailserver [mailserver] Add daily backup CronJob for mailserver PVC 2026-04-18 23:26:08 +00:00
main.tf fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 2026-04-15 06:41:56 +00:00
secrets extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
terragrunt.hcl extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00