## Context
The mailserver stack holds everything valuable and hard to recreate:
243M of maildirs, dovecot/rspamd state, and the DKIM private key that
signs outbound mail. Today the only defense is the LVM thin-pool
snapshots on the PVE host (7-day retention, storage-class scope only)
— there is no app-level backup. Infra/.claude/CLAUDE.md mandates that
every proxmox-lvm(-encrypted) app ship a NFS-backed backup CronJob,
and the mailserver stack was the only one still out of compliance.
Loss of mailserver-data-encrypted without backups = total loss of all
stored mail plus a DKIM key rotation (which requires a DNS update and
breaks signature verification on every message in transit for the TTL
window). Unacceptable for a service people actually use.
Trade-offs considered:
- mysqldump-style single-file dump vs rsync snapshot — maildirs are
millions of small files, not a DB export. rsync --link-dest gives
incremental weekly snapshots for ~10% of the cost of a full copy.
- RWO PVC read-only mount — the underlying PVC is ReadWriteOnce, so
the backup Job has to co-locate with the mailserver pod. vaultwarden
solves this with pod_affinity; mirrored here.
- Image choice — alpine + apk add rsync matches vaultwarden's pattern
and keeps the container image small.
## This change
Adds `kubernetes_cron_job_v1.mailserver-backup` + NFS PV/PVC to the
mailserver module. Runs daily at 03:00 (avoids the 00:30 mysql-backup
and 00:45 per-db windows, and the */20 email-roundtrip cadence). The
job rsyncs /var/mail, /var/mail-state, /var/log/mail into
/srv/nfs/mailserver-backup/<YYYY-WW>/ with --link-dest against the
previous week for space-efficient incrementals. 8-week retention.
Data layout (flowed through from the deployment's subPath mounts so
the rsync tree matches the mailserver's own on-disk layout):
PVC mailserver-data-encrypted (RWO, 2Gi)
├─ data/ (subPath) → pod's /var/mail → backup/<week>/data/
├─ state/ (subPath) → pod's /var/mail-state → backup/<week>/state/
└─ log/ (subPath) → pod's /var/log/mail → backup/<week>/log/
Safety:
- PVC mounted read-only (volume.persistent_volume_claim.read_only
AND all three volume_mounts set read_only=true) so a backup-script
bug cannot corrupt maildirs.
- pod_affinity on app=mailserver + topology_key=hostname forces the
Job pod onto the same node holding the RWO PVC attachment.
- set -euxo pipefail + per-directory existence guard so a missing
subPath short-circuits cleanly instead of silently no-op'ing.
Metrics pushed to Pushgateway match the mysql-backup/vaultwarden-backup
convention (job="mailserver-backup"):
backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_output_bytes, backup_last_success_timestamp.
Alert rules added in monitoring stack, mirroring Mysql/Vaultwarden:
- MailserverBackupStale — 36h threshold, critical, 30m for:
- MailserverBackupNeverSucceeded — critical, 1h for:
## Reproduce locally
1. cd infra/stacks/mailserver && ../../scripts/tg plan
Expected: 3 to add (cronjob + NFS PV + PVC), unrelated drift on
deployment/service is pre-existing.
2. ../../scripts/tg apply --non-interactive \
-target=module.mailserver.module.nfs_mailserver_backup_host \
-target=module.mailserver.kubernetes_cron_job_v1.mailserver-backup
3. cd ../monitoring && ../../scripts/tg apply --non-interactive
4. kubectl create job --from=cronjob/mailserver-backup \
mailserver-backup-test -n mailserver
5. kubectl wait --for=condition=complete --timeout=300s \
job/mailserver-backup-test -n mailserver
6. Expected: test pod co-locates with mailserver on same node
(k8s-node2 today), rsync writes ~950M to
/srv/nfs/mailserver-backup/<YYYY-WW>/, Pushgateway exposes
backup_output_bytes{job="mailserver-backup"}.
## Test Plan
### Automated
$ kubectl get cronjob -n mailserver mailserver-backup
NAME SCHEDULE TIMEZONE SUSPEND ACTIVE LAST SCHEDULE AGE
mailserver-backup 0 3 * * * <none> False 0 <none> 3s
$ kubectl create job --from=cronjob/mailserver-backup \
mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test created
$ kubectl wait --for=condition=complete --timeout=300s \
job/mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test condition met
$ kubectl logs -n mailserver job/mailserver-backup-test | tail -5
=== Backup IO Stats ===
duration: 80s
read: 1120 MiB
written: 1186 MiB
output: 947.0M
$ kubectl run nfs-verify --rm --image=alpine --restart=Never \
--overrides='{...nfs mount /srv/nfs...}' \
-n mailserver --attach -- ls -la /nfs/mailserver-backup/
947.0M /nfs/mailserver-backup/2026-15
$ curl http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
| grep mailserver-backup
backup_duration_seconds{instance="",job="mailserver-backup"} 80
backup_last_success_timestamp{instance="",job="mailserver-backup"} 1.776554641e+09
backup_output_bytes{instance="",job="mailserver-backup"} 9.92315701e+08
backup_read_bytes{instance="",job="mailserver-backup"} 1.175027712e+09
backup_written_bytes{instance="",job="mailserver-backup"} 1.244254208e+09
$ curl -s http://prometheus-server/api/v1/rules \
| jq '.data.groups[].rules[] | select(.name | test("Mailserver"))'
MailserverBackupStale: (time() - kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"}) > 129600
MailserverBackupNeverSucceeded: kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"} == 0
### Manual Verification
1. Wait for the scheduled 03:00 run tonight; verify
`kubectl get job -n mailserver` shows a new completed job.
2. Check that `backup_last_success_timestamp` advances past today.
3. Confirm `MailserverBackupNeverSucceeded` did not fire.
4. Next week (week 16), confirm `--link-dest` builds hardlinks vs
2026-15 (size delta should drop from ~950M to ~the actual churn).
## Deviations from mysql-backup pattern
- Image: alpine + rsync (mirrors vaultwarden — mysql's `mysql:8.0`
base is not applicable for a filesystem rsync).
- pod_affinity: required for RWO PVC co-location (mysql uses its own
MySQL service for network access; mailserver must mount the PVC).
- Metric push via wget (mirrors vaultwarden; alpine has wget, not curl).
- Week-folder layout with --link-dest rotation: rsync pattern, closer
to the PVE daily-backup script than mysql's single-file gzip dumps.
[ci skip]
Closes: code-z26
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|---|---|---|
| .beads | ||
| .claude | ||
| .git-crypt | ||
| .github | ||
| .planning | ||
| .woodpecker | ||
| ci | ||
| cli | ||
| diagram | ||
| docs | ||
| modules | ||
| playbooks | ||
| scripts | ||
| secrets | ||
| stacks | ||
| state/stacks | ||
| .gitattributes | ||
| .gitignore | ||
| .sops.yaml | ||
| AGENTS.md | ||
| config.tfvars | ||
| CONTRIBUTING.md | ||
| LICENSE.txt | ||
| MEMORY.md | ||
| README.md | ||
| setup-monitoring.sh | ||
| terragrunt.hcl | ||
| tiers.tf | ||
This repo contains my infra-as-code sources.
My infrastructure is built using Terraform, Kubernetes and CI/CD is done using Woodpecker CI.
Read more by visiting my website: https://viktorbarzin.me
Documentation
Full architecture documentation is available in docs/ — covering networking, storage, security, monitoring, secrets, CI/CD, databases, and more.
Adding a New User (Admin)
Adding a new namespace-owner to the cluster requires three steps — no code changes needed.
1. Authentik Group Assignment
In the Authentik admin UI, add the user to:
kubernetes-namespace-ownersgroup (grants OIDC group claim for K8s RBAC)Headscale Usersgroup (if they need VPN access)
2. Vault KV Entry
Add a JSON entry to secret/platform → k8s_users key in Vault:
"username": {
"role": "namespace-owner",
"email": "user@example.com",
"namespaces": ["username"],
"domains": ["myapp"],
"quota": {
"cpu_requests": "2",
"memory_requests": "4Gi",
"memory_limits": "8Gi",
"pods": "20"
}
}
usernamekey must match the user's Forgejo username (for Woodpecker admin access)namespaces— K8s namespaces to create and grant admin access todomains— subdomains underviktorbarzin.mefor Cloudflare DNS recordsquota— resource limits per namespace (defaults shown above)
3. Apply Stacks
vault login -method=oidc
cd stacks/vault && terragrunt apply --non-interactive
# Creates: namespace, Vault policy, identity entity, K8s deployer role
cd ../platform && terragrunt apply --non-interactive
# Creates: RBAC bindings, ResourceQuota, TLS secret, DNS records
cd ../woodpecker && terragrunt apply --non-interactive
# Adds user to Woodpecker admin list
What Gets Auto-Generated
| Resource | Stack |
|---|---|
| Kubernetes namespace | vault |
Vault policy (namespace-owner-{user}) |
vault |
| Vault identity entity + OIDC alias | vault |
| K8s deployer Role + Vault K8s role | vault |
| RBAC RoleBinding (namespace admin) | platform |
| RBAC ClusterRoleBinding (cluster read-only) | platform |
| ResourceQuota | platform |
| TLS secret in namespace | platform |
| Cloudflare DNS records | platform |
| Woodpecker admin access | woodpecker |
New User Onboarding
If you've been added as a namespace-owner, follow these steps to get started.
1. Join the VPN
# Install Tailscale: https://tailscale.com/download
tailscale login --login-server https://headscale.viktorbarzin.me
# Send the registration URL to Viktor, wait for approval
ping 10.0.20.100 # verify connectivity
2. Install Tools
Run the setup script to install kubectl, kubelogin, Vault CLI, Terraform, and Terragrunt:
# macOS
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)
# Linux
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
3. Authenticate
# Log into Vault (opens browser for SSO)
vault login -method=oidc
# Test kubectl (opens browser for OIDC login)
kubectl get pods -n YOUR_NAMESPACE
4. Deploy Your First App
# Clone the infra repo
git clone https://github.com/ViktorBarzin/infra.git && cd infra
# Copy the stack template
cp -r stacks/_template stacks/myapp
mv stacks/myapp/main.tf.example stacks/myapp/main.tf
# Edit main.tf — replace all <placeholders>
# Store secrets in Vault
vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123
# Submit a PR
git checkout -b feat/myapp
git add stacks/myapp/
git commit -m "add myapp stack"
git push -u origin feat/myapp
After review and merge, an admin runs cd stacks/myapp && terragrunt apply.
5. Set Up CI/CD (Optional)
Create .woodpecker.yml in your app's Forgejo repo:
steps:
- name: build
image: woodpeckerci/plugin-docker-buildx
settings:
repo: YOUR_DOCKERHUB_USER/myapp
tag: ["${CI_PIPELINE_NUMBER}", "latest"]
username:
from_secret: dockerhub-username
password:
from_secret: dockerhub-token
platforms: linux/amd64
- name: deploy
image: hashicorp/vault:1.18.1
commands:
- export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
- export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login
role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token))
- KUBE_TOKEN=$(vault write -field=service_account_token
kubernetes/creds/YOUR_NAMESPACE-deployer
kubernetes_namespace=YOUR_NAMESPACE)
- kubectl --server=https://kubernetes.default.svc
--token=$KUBE_TOKEN
--certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
-n YOUR_NAMESPACE set image deployment/myapp
myapp=YOUR_DOCKERHUB_USER/myapp:${CI_PIPELINE_NUMBER}
Useful Commands
# Check your pods
kubectl get pods -n YOUR_NAMESPACE
# View quota usage
kubectl describe resourcequota -n YOUR_NAMESPACE
# Store/read secrets
vault kv put secret/YOUR_USERNAME/myapp KEY=value
vault kv get secret/YOUR_USERNAME/myapp
# Get a short-lived K8s deploy token
vault write kubernetes/creds/YOUR_NAMESPACE-deployer \
kubernetes_namespace=YOUR_NAMESPACE
Important Rules
- All changes go through Terraform — never
kubectl apply/edit/patchdirectly - Never put secrets in code — use Vault:
vault kv put secret/YOUR_USERNAME/... - Always use a PR — never push directly to master
- Docker images: build for
linux/amd64, use versioned tags (not:latest)
git-crypt setup
To decrypt the secrets, you need to setup git-crypt.
- Install git-crypt.
- Setup gpg keys on the machine
git-crypt unlock
This will unlock the secrets and will lock them on commit