Viktor Barzin e1ab23193d redis: revert 3-node Sentinel HA to single standalone instance [ci skip]

The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network
partition, hit the init script's deterministic "pod-0 = bootstrap master"
fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2.
HAProxy's `expect rstring role:master` matched both and round-robined client
connections across the two diverging masters, so Immich enqueued BullMQ jobs on
one while its workers blocked-popped on the other -> every queue wedged and
new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6
weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).

Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy +
init bootstrap configmap + both PDBs; redis container only (+ exporter).
maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both
workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich
BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer
edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).
Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source +
docs only, hence [ci skip].

Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas
now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop.

Docs: rewrite databases.md Redis section (single-instance design + incident
history); add post-mortem 2026-05-30-redis-split-brain.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 17:49:43 +00:00


Databases

Overview

The cluster provides shared database services (PostgreSQL, MySQL, Redis) for multi-tenant workloads with automated credential rotation via Vault. PostgreSQL uses CloudNativePG (CNPG) with PgBouncer connection pooling, MySQL runs as an InnoDB Cluster with anti-affinity rules for stability, and Redis runs as a single standalone instance that serves as both shared cache and job-queue broker. SQLite is used for per-app local storage on block volumes, because SQLite on NFS is unreliable.

Architecture Diagram

graph TB
    subgraph Apps
        A1[trading-bot]
        A2[apple-health-data]
        A3[wrongmove]
        A4[claude-memory-mcp]
    end

    subgraph PostgreSQL
        A1 --> PGB[PgBouncer<br/>3 replicas]
        A2 --> PGB
        A4 --> PGB
        PGB --> CNPG_RW[CNPG Primary<br/>pg-cluster-rw.dbaas]
        CNPG_RW --> CNPG_R1[CNPG Replica 1]
    end

    subgraph MySQL
        A3 --> MYC[MySQL InnoDB Cluster<br/>3 instances]
        MYC --> LVM1[Proxmox-LVM Storage]
        MYC -.anti-affinity.-> NODE1[Exclude k8s-node1<br/>GPU node]
    end

    subgraph Redis
        A1 --> RED[Redis<br/>redis.redis.svc.cluster.local]
    end

    subgraph Vault
        V[Vault DB Engine]
        V -.7-day rotation.-> PGB
        V -.7-day rotation.-> MYC
    end

    style CNPG_RW fill:#2088ff
    style PGB fill:#4c9e47
    style MYC fill:#f39c12
    style RED fill:#dc382d

Components

Component	Version	Location	Purpose
PostgreSQL (CNPG)	CloudNativePG (PostGIS 16: `postgis:16`)	`dbaas` namespace	Primary/replica cluster, auto-failover
PgBouncer	3 replicas	`dbaas` namespace	Connection pooling for PostgreSQL
MySQL InnoDB Cluster	8.4.4	`dbaas` namespace	Multi-master MySQL cluster
Redis	`redis:8-alpine` (8.6.2, pinned)	`redis` namespace	Single standalone instance: shared cache and job-queue broker
Vault DB Engine	-	`vault` namespace	Automated credential rotation

Database Endpoints

Service	Endpoint	Notes
PostgreSQL (primary)	`pg-cluster-rw.dbaas.svc.cluster.local`	Do not connect directly; apps reach it through PgBouncer
PgBouncer	`pgbouncer.dbaas.svc.cluster.local`	Connection pool (3 replicas)
MySQL	`mysql.dbaas.svc.cluster.local`	InnoDB Cluster VIP
Redis	`redis.redis.svc.cluster.local`	Shared instance
PostgreSQL (compat)	`postgresql.dbaas.svc.cluster.local`	Compatibility service, selects CNPG primary

How It Works

PostgreSQL (CNPG + PgBouncer)

CNPG Cluster: Manages PostgreSQL primary and replicas
- Primary: pg-cluster-rw.dbaas.svc.cluster.local
- Auto-failover on primary failure
- Replicas for read scaling
PgBouncer: Connection pooling layer (3 replicas)
- Apps connect to PgBouncer, not directly to PostgreSQL
- Reduces connection overhead
- Load balances across PgBouncer instances
Credential Rotation: Vault DB engine rotates credentials every 7 days
- Apps fetch credentials from Vault on startup
- Vault manages rotation lifecycle

Used by:

trading-bot
apple-health-data (health)
linkwarden
affine
woodpecker
claude-memory-mcp
tripit
5 active PG roles

MySQL InnoDB Cluster

Cluster Topology: 3 MySQL instances with auto-recovery
- Multi-master replication
- Automatic split-brain resolution
Storage: Proxmox-LVM persistent volumes
- Thin-provisioned LVM on Proxmox hosts
- Block-level storage with proper write guarantees
Anti-Affinity: Excludes k8s-node1 (GPU node)
- Pods scheduled to node2, node3, node4, etc.
- Keeps database workloads off the GPU-dedicated node
Resource Allocation: 2Gi request / 3Gi limit
- Right-sized based on VPA recommendations

Used by:

wrongmove (realestate-crawler)
speedtest
codimd
nextcloud
shlink
grafana
technitium (DNS query logs via QueryLogsMySqlApp plugin)

Redis

Single standalone instance shared by all consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Celery apps, Traefik, etc.). Clients talk to redis-master.redis.svc.cluster.local:6379, which now selects the single redis pod directly. No Sentinel, no HAProxy, no replicas — reverted from 3-node HA on 2026-05-30 (see "Why standalone" below).

Architecture:

1 pod in StatefulSet redis-v2 (replicas=1, podManagementPolicy=Parallel retained for STS-field immutability), running redis + redis_exporter containers on docker.io/library/redis:8-alpine (8.6.2). Data on a proxmox-lvm-encrypted PVC (data-redis-v2-0, 5Gi→20Gi autoresize).

maxmemory=640mb (83% of the 768Mi pod limit), maxmemory-policy=volatile-lru. The instance is shared by two workload classes: CACHES (want LRU eviction of disposable keys) and QUEUES (Immich BullMQ bull:*, Celery _kombu:*; these must never be evicted or jobs vanish). volatile-lru evicts only keys carrying a TTL (caches set them) and never touches TTL-less keys (queue jobs), serving both correctly in one instance. Backstop: the RedisMemoryPressure alert fires at 80%, because if the instance ever fills with TTL-less keys there is nothing left to evict and writes fail with OOM errors, exactly as they would under noeviction.
Persistence: RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec. aof-load-corrupt-tail-max-size=1024 tolerates ≤1KB of AOF tail garbage from an unclean reboot instead of crashlooping. Disk-wear (sdb Samsung 850 EVO, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway.
Memory requests=limits=768Mi. BGSAVE + AOF-rewrite fork can double RSS via COW; auto-aof-rewrite-percentage=200 + auto-aof-rewrite-min-size=128mb tune down rewrite frequency.
Service redis-master (name/DNS unchanged across the HA teardown so no consumer needed editing). Keel opt-out (keel.sh/policy=never, label + annotation) — a prior patch-bump to :8.0.6-alpine rejected the AOF config and crashed it.
Weekly RDB backup to NFS (/srv/nfs/redis-backup/, Sunday 03:00, 28-day retention, Pushgateway metrics).
Auth disabled — NetworkPolicy is the isolation layer. requirepass + creds rollout to all clients remains a planned follow-up.
Downtime model: a single instance means a pod restart (image bump, node drain, OOM) is a few-seconds cluster-wide Redis blip. Explicitly accepted (Viktor, 2026-05-30) as the price of eliminating the HA failure modes below. There is no PDB (a single-replica PDB would only block node drains).
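
The volatile-lru split described above depends on every cache key carrying a TTL and every queue key carrying none. A minimal sketch for auditing that, assuming `redis-cli` is run inside the pod; the helper name `summarise_ttls` and its tab-separated input format are illustrative, not part of the deployed tooling:

```shell
# Summarise "<ttl><TAB><key>" lines read from stdin.
# TTL -1 means the key has no expiry, so volatile-lru will never evict it.
# TTL >= 0 means the key is an eviction candidate. TTL -2 (key already gone)
# is ignored.
summarise_ttls() {
  awk -F'\t' '
    $1 == -1 { pinned++ }
    $1 >= 0  { evictable++ }
    END { printf "evictable=%d pinned=%d\n", evictable, pinned }'
}

# Hypothetical producer, run inside the redis pod:
# redis-cli --scan | while IFS= read -r key; do
#   printf '%s\t%s\n' "$(redis-cli TTL "$key")" "$key"
# done | summarise_ttls
```

Scanning every key is slow on a large keyspace, so this is better suited to an occasional manual check than to a probe. A growing pinned count outside the bull:* and _kombu:* prefixes would suggest a cache client writing keys without an expiry.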

Observability: oliver006/redis_exporter:v1.62.0 sidecar on port 9121, auto-scraped. Alerts: RedisDown, RedisMemoryPressure (>80%), RedisEvictions, RedisForkLatencyHigh, RedisAOFRewriteLong, RedisBackupStale, RedisBackupNeverSucceeded. (RedisReplicationLagHigh + RedisReplicasMissing removed with the replicas.)
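
The RedisMemoryPressure threshold can also be checked by hand from `INFO memory`. The sketch below is one way to do that from a shell and is not the alert's actual rule expression, which is not shown in this document; `redis_mem_pct` is a hypothetical helper name:

```shell
# Read `redis-cli INFO memory` output on stdin and print used_memory as an
# integer percentage of maxmemory. INFO lines end in CRLF, so the trailing
# carriage return is stripped before the fields are compared.
redis_mem_pct() {
  awk -F: '
    { sub(/\r$/, "") }
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) exit 1   # maxmemory unset: no percentage to report
      printf "%d\n", used * 100 / max
    }'
}
```

Usage, assuming the pod name from the Architecture notes: `kubectl exec redis-v2-0 -n redis -- redis-cli INFO memory | redis_mem_pct`. With maxmemory=640mb, the 80% threshold corresponds to roughly 512 MiB of used memory.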

Why standalone: HA Redis caused more outages than it prevented in this homelab. Five incidents:

(a) 2026-04-04: service selector routed writes to a replica → READONLY.
(b) 2026-04-19 AM: master OOMKilled during BGSAVE+PSYNC (256Mi too tight).
(c) 2026-04-19 PM: sentinel quorum drift (2 sentinels, no majority) routed writes to a slave.
(d) 2026-04-22: five-factor flap cascade (soft anti-affinity co-located pods + aggressive sentinel/probe timing + HAProxy polling race).
(e) 2026-05-30: split-brain. redis-v2-0 booted during a network partition, hit the init script's deterministic "pod-0 is bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2. HAProxy's expect rstring role:master matched both and round-robined client connections across them, so Immich enqueued BullMQ jobs on one master while its workers blocked-popped on the other → every queue wedged, new-upload thumbnails 404'd cluster-wide.

The 3-sentinel design (beads code-v2b) was built specifically to prevent split-brain after incident (c), yet the bootstrap fallback manufactured one anyway. Conclusion: for a homelab cache/broker, a single instance with a few-seconds restart blip is strictly simpler and more reliable than chasing Sentinel correctness. Mirrors the MySQL InnoDB-Cluster → standalone reversion (2026-04-16). Post-mortem: docs/post-mortems/2026-05-30-redis-split-brain.md.

SQLite (Per-App)

Apps using SQLite:

headscale
vaultwarden
plotting-book
holiday-planner
priority-pass

Critical: SQLite on NFS is unreliable

NFS lacks proper fsync() support
Causes database corruption under load
Solution: Use Proxmox-LVM volumes for SQLite apps
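
Before placing a SQLite file on a volume, the mount's filesystem type can be checked. A minimal sketch, assuming a Linux container with util-linux `findmnt` available; `is_nfs` is an illustrative helper and `/data` a placeholder path, neither taken from an existing script:

```shell
# Return success (0) if the given filesystem type name is an NFS variant.
is_nfs() {
  case "$1" in
    nfs|nfs4) return 0 ;;
    *)        return 1 ;;
  esac
}

# Hypothetical use inside an app pod:
# is_nfs "$(findmnt -n -o FSTYPE -T /data)" && echo "SQLite on NFS: migrate to proxmox-lvm"
```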

Vault Database Engine

Rotation Schedule: 7 days (604800s)

PostgreSQL Rotation:

health (apple-health-data)
linkwarden
affine
woodpecker
claude_memory
tripit (Vault static role pg-tripit)

MySQL Rotation:

speedtest
wrongmove
codimd
nextcloud
shlink
grafana
technitium (password synced to Technitium DNS app via CronJob every 6h)

Excluded from Rotation:

authentik (uses PgBouncer, incompatible)
crowdsec (Helm-baked credentials)
Root users (manual management)

How Rotation Works:

Vault rotates the MySQL user's password (static role, 7-day period)
ExternalSecrets Operator syncs new password to K8s Secret (15-min refresh)
Apps read from K8s Secret via secret_key_ref env vars
Special case: Technitium stores its MySQL connection in internal app config, so a CronJob pushes the rotated password to the Technitium API every 6 hours

Configuration

Terraform Shared Variables

Always use shared variables, never hardcode endpoints:

variable "postgresql_host" {
  default = "pgbouncer.dbaas.svc.cluster.local"
}

variable "mysql_host" {
  default = "mysql.dbaas.svc.cluster.local"
}

variable "redis_host" {
  default = "redis.redis.svc.cluster.local"
}
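
One way to catch hardcoded endpoints before review is a grep over the Terraform sources. This is a sketch under the assumption that variable defaults are written as `default = "..."` on a single line, as in the block above; `find_hardcoded_endpoints` is a hypothetical helper, and `grep -r --include` requires GNU or BSD grep:

```shell
# Print file:line:text for every .tf line that hardcodes a cluster-local
# database hostname, skipping lines that set a variable default.
# $1 is the directory to search (defaults to the current directory).
find_hardcoded_endpoints() {
  grep -rnE --include='*.tf' '(dbaas|redis)\.svc\.cluster\.local' "${1:-.}" \
    | grep -v 'default[[:space:]]*='
}
```

The function exits non-zero when nothing is found, so it can gate a pre-commit hook by inverting the result.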

Vault Paths

PostgreSQL Dynamic Credentials:

database/creds/postgres-<app>-role

MySQL Dynamic Credentials:

database/creds/mysql-<app>-role

Static Credentials (non-rotated):

secret/data/mysql/root
secret/data/postgres/root

Version Pinning

Diun Monitoring Disabled for database images to prevent unwanted version bumps:

MySQL: pinned version in Terraform
PostgreSQL: pinned CNPG operator version
Redis: pinned image tag

Rationale: Database upgrades require careful planning and testing

Example Terraform Stack (PostgreSQL)

resource "vault_database_secret_backend_role" "app" {
  backend             = "database"
  name                = "postgres-myapp-role"
  db_name             = "postgres"
  creation_statements = [
    "CREATE USER \"{{name}}\" WITH PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT ALL PRIVILEGES ON DATABASE myapp TO \"{{name}}\";"
  ]
  default_ttl         = 604800  # 7 days
  max_ttl             = 604800
}

resource "kubernetes_secret" "db_creds" {
  metadata {
    name      = "myapp-db"
    namespace = "default"
  }

  data = {
    host     = var.postgresql_host
    database = "myapp"
    # App fetches username/password from Vault at runtime
  }
}

Decisions & Rationale

Why CNPG Instead of Postgres Operator?

Alternatives considered:

Zalando Postgres Operator: Mature but complex
Bitnami PostgreSQL Helm: Simple but manual failover
CNPG (chosen): Kubernetes-native, auto-failover, active development

Benefits:

Native Kubernetes CRDs
Automatic failover and recovery
Active community and updates
Better resource efficiency than Zalando

Why PgBouncer for PostgreSQL?

Reduces connection overhead (apps create many connections)
Load balances across PgBouncer replicas
Essential for apps that don't implement connection pooling
Required for Vault DB engine compatibility with some apps

Why MySQL InnoDB Cluster?

Alternatives considered:

Single MySQL instance: No HA
Galera Cluster: Complex, split-brain issues
InnoDB Cluster (chosen): Built-in multi-master, auto-recovery

Benefits:

Native MySQL HA solution
Automatic split-brain resolution
Simpler than Galera

Why Block Storage for Databases?

NFS lacks proper fsync() support (causes SQLite corruption)
Proxmox-LVM provides block-level storage with proper write guarantees
Lower latency than NFS for database workloads

Why 7-Day Credential Rotation?

Balance between security (shorter is better) and operational overhead
7 days allows ample time to debug issues before next rotation
Reduces rotation-related disruptions while maintaining security hygiene

Why Shared Redis (Not Per-App)?

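Most apps use Redis for ephemeral data (caching, sessions); the queue consumers (Immich BullMQ, Celery) share the instance safely because volatile-lru never evicts their TTL-less keys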
Over-provisioning Redis wastes memory
Shared instance sufficient for current load
Can migrate to per-app if needed

Troubleshooting

PostgreSQL: "Too many connections"

Cause: Apps connecting directly to PostgreSQL instead of PgBouncer

Fix:

# Check PgBouncer is running
kubectl get pods -n dbaas | grep pgbouncer

# Verify apps use pgbouncer.dbaas, not pg-cluster-rw
kubectl get configmap <app-config> -o yaml | grep postgres
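
The host check above can be wrapped in a small classifier. A sketch that uses only the service names listed under Database Endpoints; `classify_pg_host` is an illustrative name, not an existing tool:

```shell
# Classify a PostgreSQL host setting by the service it points at.
# Only the PgBouncer service counts as pooled; the CNPG read-write service and
# the compatibility service both reach the primary directly.
classify_pg_host() {
  case "$1" in
    pgbouncer.dbaas*)                       echo pooled ;;
    pg-cluster-rw.dbaas*|postgresql.dbaas*) echo direct ;;
    *)                                      echo unknown ;;
  esac
}
```

For example, `classify_pg_host "$DB_HOST"` inside an app container reports whether that app bypasses the pool, assuming the app reads its host from a `DB_HOST` variable.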

PostgreSQL: Primary Failover Not Working

Cause: CNPG controller not running or network partition

Fix:

# Check CNPG operator
kubectl get pods -n cnpg-system

# Check cluster status
kubectl get cluster -n dbaas

# Manually trigger failover (last resort)
kubectl cnpg promote pg-cluster pg-cluster-2 -n dbaas

MySQL: Pod Stuck on Excluded Node

Cause: Anti-affinity rule not applied (should exclude k8s-node1)

Fix:

# Check pod affinity rules
kubectl get pod <mysql-pod> -n dbaas -o yaml | grep -A 10 affinity

# Delete pod to reschedule
kubectl delete pod <mysql-pod> -n dbaas

MySQL: Pod Scheduled on GPU Node

Cause: Anti-affinity rule not preventing scheduling on k8s-node1

Fix:

# Check pod affinity rules
kubectl get pod <mysql-pod> -n dbaas -o yaml | grep -A 10 affinity

# Delete pod to reschedule away from node1
kubectl delete pod <mysql-pod> -n dbaas

SQLite: Database Corruption

Cause: SQLite on NFS volume

Fix:

# Check volume type
kubectl get pv | grep <app>

# If NFS, migrate to proxmox-lvm:
# 1. Create proxmox-lvm PVC
# 2. Backup SQLite database
# 3. Restore to proxmox-lvm volume
# 4. Update app to use new volume

Vault Rotation: "User already exists"

Cause: Previous rotation failed to clean up

Fix:

# Connect to database
kubectl exec -it <mysql-pod> -n dbaas -- mysql -u root -p

# List users
SELECT user, host FROM mysql.user WHERE user LIKE 'v-root-%';

# Drop stale users
DROP USER 'v-root-postgres-<hash>'@'%';

# Retry rotation (rotate-root only accepts writes, so `vault read` fails)
vault write -f database/rotate-root/postgres

Redis: Out of Memory

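Cause: maxmemory reached with nothing evictable. Under volatile-lru only keys with a TTL are evicted, so TTL-less keys (queue jobs, or cache keys written without an expiry) can fill the instance until writes fail with OOM errors.

Fix:

# Connect to Redis
kubectl exec -it redis-v2-0 -n redis -- redis-cli

# Check memory usage and the active eviction policy
INFO memory
CONFIG GET maxmemory-policy

# Restore the policy if it has drifted (never allkeys-lru: it evicts queue jobs)
CONFIG SET maxmemory-policy volatile-lru

# Persist config
CONFIG REWRITE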

App Can't Connect: "Connection refused"

Cause: Service endpoint not reachable or PgBouncer not running

Fix:

# Check service endpoints
kubectl get endpoints pgbouncer -n dbaas
kubectl get endpoints postgresql -n dbaas

# Update app to use pgbouncer
kubectl set env deployment/<app> DB_HOST=pgbouncer.dbaas.svc.cluster.local

Related

CI/CD Pipeline — Database credentials in CI/CD
Multi-Tenancy — Per-user database provisioning
Runbook: ../runbooks/database-failover.md — Manual failover procedures
Runbook: ../runbooks/vault-rotation-troubleshooting.md — Debug credential rotation
Vault documentation: Database secrets engine
CNPG documentation: Cluster configuration
