[redis] Stabilise patch_redis_service trigger + document service naming
## Context
`null_resource.patch_redis_service` uses `triggers = { always = timestamp() }`,
so every `scripts/tg plan` on `stacks/redis` reports `1 to destroy, 1 to add`
even when nothing has changed. That noise buries real drift and trains us to
ignore redis-stack plans — which is exactly what you don't want on a
load-bearing patch.
The patch itself is still load-bearing (three consumers hard-code bare
`redis.redis.svc.cluster.local` — `stacks/immich/chart_values.tpl:12`,
`stacks/ytdlp/yt-highlights/app/main.py:136`, `config.tfvars:214` — plus
Bitnami's own sentinel scripts set `REDIS_SERVICE=redis.redis.svc.cluster.local`
and call it during pod startup). Removing the null_resource is a follow-up
(beads T0) once those consumers migrate to `redis-master.redis.svc`. For now
the goal is just: stop being noisy.
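For concreteness, the patch re-points the chart-created `redis` Service at the
HAProxy pods. The actual `local-exec` command isn't reproduced here, but judging
from the selector invariant in the test plan below it is roughly equivalent to
this sketch (hypothetical; the real command may differ):
```
# Hypothetical equivalent of the local-exec patch:
# replace the chart's selector so `redis` resolves to HAProxy instead of the Redis nodes.
kubectl -n redis patch svc redis --type json \
  -p '[{"op":"replace","path":"/spec/selector","value":{"app":"redis-haproxy"}}]'
```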
## This change
1. Replace the `always = timestamp()` trigger with two inputs that only change
when re-patching is genuinely required:
- `chart_version = helm_release.redis.version` — changes only on a Bitnami
chart version bump, which is the one code path that rewrites the `redis`
Service selector back to `component=node`.
- `haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])`
— changes only when HAProxy config is edited; aligned with the existing
`checksum/config` annotation that rolls the Deployment on config change.
Both attributes are known at plan time (verified against the `hashicorp/helm`
v3.1.1 provider binary). Rejected alternatives — `metadata[0].revision`
(not exposed in the plugin-framework v3 rewrite), `sha256(jsonencode(values))`
(readability unverified on v3), and `kubernetes_deployment.haproxy.id`
(static `namespace/name`, never changes) — don't meet the bar.
2. Add a **Redis Service Naming** section to `AGENTS.md` that explicitly
states the write/sentinel/avoid endpoints, so new consumers start from
`redis-master.redis.svc` (the documented `var.redis_host`) and long-lived
connections (PUBSUB, BLPOP, Sidekiq) route around HAProxy's `timeout
client 30s` via the sentinel headless path. Uptime Kuma's Redis monitor
already learned that lesson the hard way (memory id=748).
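For illustration, here is the shape of consumer the sentinel guidance steers
people toward — a minimal sketch using redis-py (the endpoint names and the
`mymaster` master name come from the new AGENTS.md section; authentication is
omitted and would come from the stack's secrets in practice):
```
from redis.sentinel import Sentinel

# Sentinel endpoints are the Bitnami per-pod headless DNS names on 26379 —
# not HAProxy — so a long-lived connection never trips `timeout client 30s`.
sentinel = Sentinel(
    [(f"redis-node-{i}.redis-headless.redis.svc.cluster.local", 26379) for i in range(3)],
    socket_timeout=1.0,
)

# Discover the current master for "mymaster" and get a write-safe client.
master = sentinel.master_for("mymaster", socket_timeout=None)
master.blpop("jobs", timeout=0)  # blocking call that HAProxy's idle timeout would kill
```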
## What is NOT in this change
- Deleting `null_resource.patch_redis_service` — still load-bearing (T0).
- Deleting `kubernetes_service.redis_master` — stays as the declared write API.
- Migrating any consumer off bare `redis.redis.svc` — T0 epic.
- Per-client sentinel migration — T1 epic.
- Retiring HAProxy — T2 epic (blocked on T1 + T3).
## Before / after
Before (steady state):
```
scripts/tg plan
Plan: 1 to add, 2 to change, 1 to destroy.
# null_resource.patch_redis_service must be replaced
# triggers = { "always" = "<timestamp>" } -> (known after apply)
```
After (steady state, post-apply):
```
scripts/tg plan
No changes. Your infrastructure matches the configuration.
```
After (chart version bump):
```
scripts/tg plan
# null_resource.patch_redis_service must be replaced
# triggers = { "chart_version" = "25.3.2" -> "25.4.0" }
```
— the trigger fires only when it actually needs to.
## Test Plan
### Automated
`scripts/tg plan` pre-change (confirms baseline noise):
```
  # module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
      ~ triggers = { # forces replacement
          ~ "always" = "2026-04-19T10:39:40Z" -> (known after apply)
        }
    }

Plan: 1 to add, 2 to change, 1 to destroy.
```
`scripts/tg plan` post-edit (confirms the one-time structural replacement):
```
  # module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
      ~ triggers = { # forces replacement
          - "always"         = "2026-04-19T10:39:40Z" -> null
          + "chart_version"  = "25.3.2"
          + "haproxy_config" = "989bca9483cb9f9942017320765ec0751ac8357ff447acc5ed11f0a14b609775"
        }
    }
```
Apply is deferred to the operator — the working tree also contains, in the same
file, an unrelated HAProxy DNS-resolvers fix (for today's immich outage) that
needs its own review before the two roll out together. No `scripts/tg apply`
was run from this session.
### Manual Verification
Reproduce locally:
1. `cd infra/stacks/redis && ../../scripts/tg plan`
2. Before apply: expect `null_resource.patch_redis_service` to be replaced
exactly once, with the trigger map transitioning from `{always = <ts>}`
to `{chart_version, haproxy_config}`.
3. After apply: run `../../scripts/tg plan` twice in a row; both runs must report
`No changes.` (excluding unrelated drift from other work-in-progress).
4. Cluster-side invariant (must hold pre- and post-apply):
`kubectl -n redis get svc redis -o jsonpath='{.spec.selector}'`
→ `{"app":"redis-haproxy"}`
`kubectl -n redis get svc redis-master -o jsonpath='{.spec.selector}'`
→ `{"app":"redis-haproxy"}`
5. Regression test for the trigger doing its job: bump `helm_release.redis.version`
in a branch, `tg plan`, expect the null_resource to replace. Revert.
## Files changed
2 changed files, 19 additions, 1 deletion (parent `ba697b02a2`, commit `702db75f84`).
`AGENTS.md` (+14):
```
@@ -118,6 +118,20 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
 ## Shared Variables (never hardcode)
 `var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
 
+## Redis Service Naming (read before wiring a new consumer)
+
+The Redis stack (`stacks/redis/`) exposes three distinct entry points. Pick the one that matches the client's connection pattern — the wrong one causes READONLY errors or silent connection drops.
+
+| Endpoint | Port(s) | Use for | Backed by |
+|----------|---------|---------|-----------|
+| `redis-master.redis.svc.cluster.local` | 6379 (redis), 26379 (sentinel) | **Default for new services.** Write-safe — HAProxy health-checks nodes and routes only to the current master. Matches `var.redis_host`. | `kubernetes_service.redis_master` → HAProxy → Bitnami StatefulSet |
+| `redis-node-{0,1,2}.redis-headless.redis.svc.cluster.local` | 26379 | **Long-lived connections (PUBSUB, BLPOP, MONITOR, Sidekiq).** Use a sentinel-aware client with master name `mymaster`. Example: `stacks/nextcloud/chart_values.yaml:32-54`. | Bitnami-created headless service → pod DNS |
+| `redis.redis.svc.cluster.local` | 6379 | **Do NOT use.** Helm chart's default service — selector patched by `null_resource.patch_redis_service` to match `redis-haproxy`, so today it behaves like `redis-master`. This patch is load-bearing but temporary; consumers hard-coded on this name are tracked in a beads follow-up (T0). | Bitnami chart (patched) |
+
+**HAProxy's `timeout client 30s` closes idle raw Redis connections** — any client that holds a connection open for pub/sub, blocking commands, or replication streams MUST use the sentinel path. Uptime Kuma's Redis monitor hit this limit and had to be re-pointed at the sentinel endpoint (see memory id=748).
+
+**When onboarding a new service:** start from `redis-master.redis.svc.cluster.local:6379` via `var.redis_host`. Only reach for sentinel discovery if the client library supports it natively (ioredis, redis-py Sentinel, go-redis FailoverClient, Sidekiq `sentinels` array) AND the workload uses long-lived connections.
+
 ## Kyverno Drift Suppression (`# KYVERNO_LIFECYCLE_V1`)
 
 Kyverno's admission webhook mutates every pod with a `dns_config { option { name = "ndots"; value = "2" } }` block (fixes NxDomain search-domain floods — see `k8s-ndots-search-domain-nxdomain-flood` skill). Terraform does not manage that field, so without suppression every pod-owning resource shows perpetual `spec[0].template[0].spec[0].dns_config` drift.
```
The redis stack's `null_resource.patch_redis_service` trigger (second changed file):
```
@@ -286,7 +286,11 @@ resource "kubernetes_service" "redis_master" {
 # This runs on every apply to ensure the Helm chart's service is always corrected.
 resource "null_resource" "patch_redis_service" {
   triggers = {
-    always = timestamp()
+    # Re-patch only when a Helm upgrade (chart version bump) or an HAProxy
+    # config change could have reset the selector / rotated HAProxy pods.
+    # timestamp() would force-replace on every apply, hiding real drift.
+    chart_version  = helm_release.redis.version
+    haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])
   }
 
   provisioner "local-exec" {
```