dbaas: require pod anti-affinity on pg-cluster (one PG per node)

Default CNPG affinity was `preferred` (soft). During the 2026-05-26
node4 outage, all 3 pg-cluster pods drifted onto k8s-node1. Losing
that node would have taken the whole PG cluster down (no quorum), and
the 9.2 GiB pg-cluster footprint was the dominant reason frigate
couldn't fit on the GPU node.

With 3 instances + 4 worker nodes, `required` is safe under 1-node
drain (3 distinct nodes always available, even excluding the drained
one).
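
For reference, a rough sketch of the scheduling constraint this setting
is expected to produce on each pg-cluster pod. The field layout is
standard Kubernetes pod anti-affinity; the label selector shown
(`cnpg.io/cluster`) is an assumption about how the operator labels its
pods and may differ between CNPG versions, so check a live pod spec
before relying on it.

```yaml
# Illustrative only: approximate pod-level affinity rendered by the operator.
affinity:
  podAntiAffinity:
    # Hard rule: the scheduler refuses placement instead of falling back.
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: cnpg.io/cluster    # assumed label key, not confirmed by this diff
              operator: In
              values:
                - pg-cluster
        # One matching pod per distinct value of this node label.
        topologyKey: kubernetes.io/hostname
```

Because the rule is hard, if two worker nodes are unavailable at once,
one instance stays Pending instead of co-locating with another.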

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-26 09:00:37 +00:00
parent 400ee88967
commit 12b4f6f81a

@@ -1088,6 +1088,7 @@ resource "null_resource" "pg_cluster" {
storage_class = "proxmox-lvm-encrypted"
memory_limit = "3Gi"
pg_params = "v3-shared1024-walcomp-workmem16-max200"
affinity = "required-hostname-v1"
}
provisioner "local-exec" {
@@ -1106,6 +1107,15 @@ resource "null_resource" "pg_cluster" {
# during a long WAL backlog the failover would stall the drain.
# Bumped 2026-05-16 ahead of Monday's first post-fix kured cycle.
instances: 3
# Hard anti-affinity: force one PG instance per node. Default is
# `preferred` which let all 3 pods collapse onto k8s-node1 during
# the 2026-05-26 node4 outage; losing node1 would have killed the
# whole cluster (no quorum). With 3 instances + 4 worker nodes,
# `required` is safe under 1-node drain.
affinity:
  enablePodAntiAffinity: true
  podAntiAffinityType: required
  topologyKey: kubernetes.io/hostname
imageName: ghcr.io/cloudnative-pg/postgis:16
postgresql:
  parameters: