From cfd0f5bcc941072cd7b17ba6efe3d2dab76f71b2 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 18 Apr 2026 23:45:17 +0000
Subject: [PATCH] [mailserver] Add liveness/readiness TCP probes [ci skip]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Context

The mailserver container (Postfix + Dovecot in one pod) had no liveness,
readiness, or startup probes declared. If either daemon deadlocked or
hung on a socket, Kubernetes had no way to detect it and restart the
container. The only external canary was the email-roundtrip-monitor
CronJob, which runs on a 20-minute interval, giving a detection lag of
20-60 minutes: long enough for real delivery failures to occur before an
alert fires. Tracked as bd code-ekf out of the mailserver probe audit.

Both port 25 (SMTP) and port 993 (IMAPS) are cheap, reliable up-signals.
The existing e2e probe already hits IMAPS, so TCP probes on those ports
are a close proxy for user-visible service health without the cost of
full SMTP/IMAP handshakes every 10s.

## This change

Adds a readiness_probe (TCP :25, initial_delay=30s, period=10s) and a
liveness_probe (TCP :993, initial_delay=60s, period=60s, timeout=15s) to
the mailserver deployment's primary container.

Design choices:

- **TCP over exec/HTTP**: the daemons do not expose HTTP health; exec
  probes would require shelling into the container with auth for
  SMTP/IMAP banner checks, which is both costly and flaky. TCP accept is
  sufficient: if Postfix cannot accept a TCP connection on :25, it is
  unambiguously broken.
- **Split ports per probe**: readiness on :25 (the public SMTP surface;
  if this is down, external delivery is broken) and liveness on :993
  (IMAPS, the other critical daemon; catches Dovecot deadlocks
  independently of Postfix).
- **30s readiness delay**: Postfix needs ~20-30s to warm up, including
  chroot setup and DKIM key loading; probing earlier would cause bogus
  NotReady cycles on deploy.
- **60s liveness delay + 60s period + 15s timeout**: generous so that
  transient blips (brief CPU spike, RBL timeout, slow NFS unmount during
  rotation) do not trigger a restart loop. With failure_threshold=3
  (default), a real deadlock is detected in ~3 minutes (3 × 60s period);
  false positives on transient load are suppressed.
- **No startup_probe**: the 60s liveness initial_delay is enough to cover
  the warmup window; adding a startup probe would be redundant machinery.

## What is NOT in this change

- No startup_probe (liveness initial_delay_seconds=60 handles warmup)
- No exec-based probes (banner-check probes are out of scope and not
  needed)
- No changes to the opendkim or other sidecars
- Pre-existing drift in other stacks (dawarich namespace label,
  owntracks dawarich-hook wiring) is deliberately left out; those are
  separate workstreams

## Test Plan

### Automated

Applied via `tg apply -target=kubernetes_deployment.mailserver` before
this commit. Current pod state:

```
$ kubectl get pod -n mailserver -l app=mailserver
NAME                          READY   STATUS    RESTARTS   AGE
mailserver-6c6bf77ffb-w7nl5   2/2     Running   0          2m26s

$ kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness|Restart Count|Status:|Ready:)"
Status:           Running
    Ready:          True
    Restart Count:  0
    Ready:          True
    Restart Count:  0
    Liveness:       tcp-socket :993 delay=60s timeout=15s period=60s #success=1 #failure=3
    Readiness:      tcp-socket :25 delay=30s timeout=1s period=10s #success=1 #failure=3
```

Pod has run >120s (two full liveness cycles) with RESTARTS=0 and
Ready=True.

### Manual Verification

1. Confirm probes are declared on the live pod:
   ```
   kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness)"
   ```
   Expected: `Liveness: tcp-socket :993 ...` and
   `Readiness: tcp-socket :25 ...`
2. Confirm pod stays Ready under normal load for 5+ minutes:
   ```
   kubectl get pod -n mailserver -l app=mailserver -w
   ```
   Expected: RESTARTS stays at 0, READY stays at 2/2.
3. (Optional) Simulate a failure by dropping :993 inside the pod and
   observing a liveness failure + restart within ~3 minutes
   (3 × period_seconds).

## Reproduce locally

1. `cd infra/stacks/mailserver`
2. `tg plan -target=kubernetes_deployment.mailserver`
3. Expected: no drift (or only the probe additions if rolling forward a
   stale state)
4. `kubectl get pod -n mailserver -l app=mailserver`: pod Ready,
   RESTARTS=0
5. `kubectl describe pod -n mailserver -l app=mailserver | grep -E "(Liveness|Readiness)"`:
   both probes present

Closes: code-ekf
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 stacks/mailserver/modules/mailserver/main.tf | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/stacks/mailserver/modules/mailserver/main.tf b/stacks/mailserver/modules/mailserver/main.tf
index c3b7bd91..f10e03e6 100644
--- a/stacks/mailserver/modules/mailserver/main.tf
+++ b/stacks/mailserver/modules/mailserver/main.tf
@@ -412,6 +412,23 @@ resource "kubernetes_deployment" "mailserver" {
             }
           }
+          readiness_probe {
+            tcp_socket {
+              port = 25
+            }
+            initial_delay_seconds = 30
+            period_seconds = 10
+          }
+
+          liveness_probe {
+            tcp_socket {
+              port = 993
+            }
+            initial_delay_seconds = 60
+            period_seconds = 60
+            timeout_seconds = 15
+          }
+
         }
         container {