kured + cnpg: drain-safe defaults ahead of Monday reboot wave

Three defensive moves to make the kured rolling-reboot cycle survive
edge cases without operator intervention:

kured (stacks/kured/main.tf):
- Set `configuration.drainTimeout = "30m"`. Default is unlimited; if a
  future PDB or finalizer stalls drain, kured retries forever and the
  node stays cordoned silently. 30m caps the silent-failure window:
  after the timeout kured logs the abort and waits for the next period;
  the node stays Schedulable so cluster capacity isn't lost. Lets us
  fail closed instead of fail-silent.

CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf):
- Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the
  failover during a primary-node drain depended on the lone replica
  being caught up; a WAL backlog would stall the drain until the
  replica was current. With 3 instances CNPG always has at least one
  fully-current replica to promote, and the PDB's `minAvailable=1` on
  the primary selector is satisfied throughout the switchover.
  Storage: +20Gi PVC on proxmox-lvm-encrypted (about 35Gi after
  autoresize). Memory: +3Gi pod limit.
- Update `triggers.instances` so the null_resource's local-exec
  actually re-applies the YAML (kubectl apply with the new spec). The
  YAML is the source of truth, but the trigger is what tells terraform
  to re-run the provisioner.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
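
For readers unfamiliar with the pattern, the `triggers.instances` change can be pictured roughly as below. This is a minimal sketch, not the actual contents of stacks/dbaas/modules/dbaas/main.tf: the resource name `pg_cluster`, the local `pg_instances`, and the manifest file name `pg-cluster.yaml` are all hypothetical.

```hcl
# Hypothetical sketch; names and the manifest path are assumptions,
# not taken from the real module.
locals {
  pg_instances = 3 # 1 primary + 2 replicas
}

resource "null_resource" "pg_cluster" {
  # Terraform replaces this resource (and so re-runs its provisioners)
  # only when a value in `triggers` changes. Editing the YAML alone
  # would not cause a re-apply.
  triggers = {
    instances = tostring(local.pg_instances)
  }

  provisioner "local-exec" {
    # Assumed manifest location next to the module.
    command = "kubectl apply -f ${path.module}/pg-cluster.yaml"
  }
}
```

Keeping the trigger value tied to the same number that the manifest uses means a change to the instance count forces the provisioner to run again on the next `terraform apply`.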
2026-05-16 12:06:30 +00:00
commit a726e963e3
parent 16a470e950
2 changed files with 16 additions and 2 deletions
--- a/stacks/kured/main.tf
+++ b/stacks/kured/main.tf
@@ -72,6 +72,14 @@ resource "helm_release" "kured" {
      notifyUrl      = data.vault_kv_secret_v2.secrets.data["slack_kured_webhook"]
      concurrency    = 1
      rebootDelay    = "30s"
+      # Fail closed instead of looping forever. Default is 0 (unlimited) — if
+      # a future PDB or finalizer stalls drain, kured retries indefinitely and
+      # the node stays cordoned silently. 30m gives CNPG / shared-store
+      # Anubis / any other stateful workload plenty of time to settle, but
+      # caps the silent-failure window. After timeout kured logs the abort
+      # and waits for the next period; node stays Schedulable so the cluster
+      # doesn't lose capacity. Fixed 2026-05-16.
+      drainTimeout = "30m"
      # Halt rolling reboots when ANY firing Prometheus alert is not in the
      # ignore-list. The ignore-list excludes self-referential / always-firing
      # alerts that would otherwise deadlock kured. alertFilterMatchOnly stays