Commit graph

24 commits

Author SHA1 Message Date
Viktor Barzin
7dfe89a6e0 [redis] stabilise against node-crash flap cascade — RC1-RC5 fixes
Five compounding factors produced the 2026-04-22 flap cascade:
- soft anti-affinity let 2/3 pods co-locate on k8s-node3, which bounced
  NotReady→Ready at 11:42Z and took quorum down with it
- aggressive sentinel/probe timing amplified LUKS-encrypted LVM I/O stalls
  into spurious +switch-master loops
- HAProxy's 1s polling raced sentinel failovers and routed writes to
  demoted masters
- publish_not_ready_addresses=true fed not-yet-ready pods into HAProxy DNS
- realestate-crawler-celery's CrashLoopBackOff closed the feedback loop

Changes:
- Anti-affinity: preferred → required (one redis pod per node, hard)
- Sentinel down-after-ms 5000→15000, failover-timeout 30000→60000
- Redis + sentinel liveness: timeout 3→10, failure_threshold 3→5
- HAProxy: check inter 1s→2s / fall 2→3, timeout check 3s→5s
- Headless svc: publish_not_ready_addresses true→false
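
The hard-anti-affinity flip can be sketched in the repo's Terraform
kubernetes-provider style — the label selector here is illustrative,
not taken from the actual stack:

```hcl
affinity {
  pod_anti_affinity {
    # was: preferred_during_scheduling_ignored_during_execution (soft)
    required_during_scheduling_ignored_during_execution {
      label_selector {
        match_labels = {
          app = "redis-v2" # illustrative label
        }
      }
      topology_key = "kubernetes.io/hostname" # one redis pod per node, hard
    }
  }
}
```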

Post-rollout verification clean: 0 flaps, 0 +switch-master events,
0 celery ReadOnlyError in the 60s window after settle. Docs updated.
2026-04-22 15:59:00 +00:00
Viktor Barzin
e55c549c9a [redis] Phase 7 step 2: remove Bitnami helm_release + orphan PVCs
Bringing the 2026-04-19 rework to its end-state. Cutover soaked for ~1h
with 0 alerts firing and 127 ops/sec on the v2 master — skipped the
nominal 24h rollback window per user direction.

 - Removed `helm_release.redis` (Bitnami chart v25.3.2) from TF. Helm
   destroy cleaned up the StatefulSet redis-node (already scaled to 0),
   ConfigMaps, ServiceAccount, RBAC, and the deprecated `redis` + `redis-headless`
   ClusterIP services that the chart owned.
 - Removed `null_resource.patch_redis_service` — the kubectl-patch hack
   that worked around the Bitnami chart's broken service selector. No
   Helm chart, no patch needed.
 - Removed the dead `depends_on = [helm_release.redis]` from the HAProxy
   deployment.
 - `kubectl delete pvc -n redis redis-data-redis-node-{0,1}` for the two
   orphan PVCs the StatefulSet template left behind (K8s doesn't cascade-delete).
 - Simplified the top-of-file comment and the redis-v2 architecture
   comment — they talked about the parallel-cluster migration state that
   no longer exists. Folded in the sentinel hostname gotcha, the redis
   8.x image requirement, and the BGSAVE+AOF-rewrite memory reasoning
   so the rationale survives in the code rather than only in beads.
 - `RedisDown` alert no longer matches `redis-node|redis-v2` — just
   `redis-v2` since that's the only StatefulSet now. Kept the `or on()
   vector(0)` so the alert fires when kube_state_metrics has no sample
   (e.g. after accidental delete).
 - `docs/architecture/databases.md` trimmed: no more "pending TF removal"
   or "cold rollback for 24h" language.

Verification after apply:
 - kubectl get all -n redis: redis-v2-{0,1,2} (3/3 Running) + redis-haproxy-*
   (3 pods, PDB minAvailable=2). Services: redis-master + redis-v2-headless only.
 - PVCs: data-redis-v2-{0,1,2} only (redis-data-redis-node-* deleted).
 - Sentinel: all 3 agree mymaster = redis-v2-0 hostname.
 - HAProxy: PING PONG, DBSIZE 92, 127 ops/sec on master.
 - Prometheus: 0 firing redis alerts.

Closes: code-v2b
Closes: code-2mw

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:32:14 +00:00
Viktor Barzin
b6cd83f85a [redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only
Phase 3 — replication chain (old → v2):
 - Discovered the v2 cluster was running redis:7.4-alpine, but the
   Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
   the 7.4 replicas rejected the stream with "Can't handle RDB format
   version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
   restore PSYNC compatibility.
 - Discovered that sentinel on BOTH v2 and old Bitnami clusters
   auto-discovered the cross-cluster replication chain when v2-0
   REPLICAOF'd the old master, triggering a failover that reparented
   old-master to a v2 replica and took HAProxy's backend offline.
   Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
   clusters) during the REPLICAOF surgery, then re-MONITOR after
   cutover. This must be done on the OLD sentinels too, not just v2 —
   they're the ones that kept fighting our REPLICAOF.
 - Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
   All 114 keys (db0:76, db1:22, db4:16) synced including `immich_bull:*`
   BullMQ queues and `_kombu.*` Celery queues — the user-stated
   must-survive data class.

Phase 4 — HAProxy cutover:
 - Updated `kubernetes_config_map.haproxy` to point at
   `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
   redis_sentinel backends (removed redis-node-{0,1}).
 - Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
   ConfigMap apply so HAProxy's 1s health-check interval found a
   role:master within a few seconds. Cutover disruption on HAProxy
   rollout was brief; old clients naturally moved to new HAProxy pods
   within the rolling update window.
 - Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
   mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
   + `announce-hostnames yes` were active — this ensures sentinel
   stores the hostname (not resolved IP) in its rewritten config, so
   pod-IP churn on restart doesn't break failover.
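
The re-MONITOR step, assembled (quorum 2 from the commit; the master FQDN is
illustrative and the config keys are standard Redis ≥6.2 sentinel options):

```
# sentinel.conf must already carry these before MONITOR, so the rewritten
# config persists the hostname rather than the resolved pod IP:
sentinel resolve-hostnames yes
sentinel announce-hostnames yes

# then, on each sentinel:
redis-cli -p 26379 SENTINEL MONITOR mymaster \
  redis-v2-0.redis-v2-headless.redis.svc.cluster.local 6379 2
```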

Phase 5 — chaos:
 - Round 1: killed master v2-0 mid-probe. First run exposed the
   sentinel IP-storage issue (stored 10.10.107.222, went stale on
   restart) — ~12s probe disruption. Fixed hostname persistence and
   re-MONITORed.
 - Round 2: killed new master v2-2 with hostnames correctly stored.
   Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
   60s — target <3s of actual user-visible disruption.

Phase 6 — Nextcloud simplification:
 - `zzz-redis.config.php` no longer queries sentinel in-process —
   just points at `redis-master.redis.svc.cluster.local`. Removed 20
   lines of PHP. HAProxy handles master tracking transparently now
   that it's scaled to 3 + PDB minAvailable=2.

Phase 7 step 1:
 - `kubectl scale statefulset/redis-node --replicas=0` (transient —
   TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
   preserved as cold rollback.

Docs:
 - Rewrote `databases.md` Redis section to reflect post-cutover reality
   and the sentinel hostname gotcha (so future sessions don't relearn it).
 - `.claude/reference/service-catalog.md` entry updated.

The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.

Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:13:43 +00:00
Viktor Barzin
150f196095 [redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts
Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm
release so data can migrate via REPLICAOF during a future short maintenance
window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still
points at redis-node-{0,1}.

Architecture:
 - 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter
 - podManagementPolicy=Parallel + init container that writes fresh
   sentinel.conf on every boot by probing peer sentinels and redis for
   consensus master (priority: sentinel vote > role:master with slaves >
   pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM.
 - redis.conf `include /shared/replica.conf` — init container writes
   `replicaof <master> 6379` for non-master pods so they come up already in
   the correct role. No bootstrap race.
 - master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork
   COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn.
 - RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec.
 - PodDisruptionBudget minAvailable=2.
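
The include handoff, sketched (file paths from the commit; the master FQDN is
illustrative):

```
# redis.conf (static, shipped in the ConfigMap) ends with:
include /shared/replica.conf

# /shared/replica.conf is written by the init container on every boot —
# empty when this pod is the consensus master, otherwise:
replicaof redis-v2-0.redis-v2-headless.redis.svc.cluster.local 6379
```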

Also:
 - HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since
   Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes
   the sole client-facing path for all 17 consumers.
 - New Prometheus alerts: RedisMemoryPressure, RedisEvictions,
   RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong,
   RedisReplicasMissing. Updated RedisDown to cover both statefulsets
   during the migration.
 - databases.md updated to describe the interim parallel-cluster state.

Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status
up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded
into Prometheus and inactive.

Beads: code-v2b (still in progress — Phase 3-7 await maintenance window).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:05 +00:00
Viktor Barzin
83f4a72b6f [redis] Raise master+replica memory 256Mi → 512Mi
256Mi was tight once the working set crossed ~200Mi: a BGSAVE fork
during replica full PSYNC doubled master RSS via COW and pushed it
past the limit, OOMing (exit 137) in a loop. HAProxy flapped, every
client (Paperless, Immich, Authentik, Dawarich) saw session store
failures → 500s on authenticated requests.

512Mi gives ~2x headroom on the current 204Mi RDB.
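
A back-of-envelope sketch of the headroom math (numbers from the commit;
worst case assumes COW dirties the whole working set):

```python
# Peak RSS during BGSAVE while a replica runs a full PSYNC: the fork's
# copy-on-write can, in the worst case, duplicate the entire working set.
working_set_mib = 204                 # current RDB / steady-state size
peak_mib = 2 * working_set_mib        # parent + fully-dirtied child pages
print(peak_mib)                       # 408: blows a 256Mi limit, fits in 512Mi
```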

Closes: code-n81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:18:30 +00:00
Viktor Barzin
12a372bf92 [redis] Migrate live RW consumers off bare redis.redis hostname
Completes the T0 hostname migration. The `redis.redis` service is a
legacy alias that routes to HAProxy via a `null_resource` selector
patch; `redis-master.redis` is the canonical name that has always
routed to HAProxy directly and health-checks master-only.

Changes:
- redis-backup CronJob: redis-cli BGSAVE + --rdb now target
  redis-master.redis. BGSAVE runs on the master (what we want).
- config.tfvars `resume_redis_url`: unused fallback updated for
  grep hygiene; nothing reads it today.
- ytdlp REDIS_URL default: updated for dev-local runs; production
  already sets REDIS_URL via main.tf:283-285 → var.redis_host.
- immich chart_values.tpl REDIS_HOSTNAME: dead Helm template (values
  block commented out in main.tf:524, Immich deploys as raw
  kubernetes_deployment using var.redis_host). Updated to keep the
  file consistent if someone ever revives it.
2026-04-19 12:42:36 +00:00
Viktor Barzin
d5a47e35fc [redis] Restore dynamic DNS in HAProxy to fix stale-IP outage
HAProxy resolved `redis-node-{0,1}.redis-headless.redis.svc.cluster.local`
once at pod startup and cached the IPs forever. When redis-node pods
cycled (new pod IPs), HAProxy kept connecting to the dead IPs — backends
flapped between "Connection refused" and "Layer4 timeout", and Immich's
ioredis client hit EPIPE until its max retries were exhausted and the pod
entered CrashLoopBackOff. This caused the 2026-04-19 Immich outage.

Fix:
- Add `resolvers kubernetes` stanza pointing at kube-dns (10s hold on
  every category so we pick up pod IP changes within a DNS TTL window).
- Add `resolvers kubernetes init-addr last,libc,none` to every backend
  server line so HAProxy resolves at startup AND uses the dynamic
  resolver for runtime refresh.
- Add `checksum/config` pod annotation to the HAProxy Deployment so a
  haproxy.cfg change actually rolls the pods (including this one).
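
A sketch of the resolver wiring (section/server names and the kube-dns
ClusterIP are illustrative; the directives are standard HAProxy):

```
resolvers kubernetes
    nameserver kube-dns 10.96.0.10:53
    hold valid    10s
    hold obsolete 10s
    hold nx       10s

backend redis_master
    server redis-node-0 redis-node-0.redis-headless.redis.svc.cluster.local:6379 \
        check resolvers kubernetes init-addr last,libc,none
```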

Closes: code-fd6
2026-04-19 12:39:09 +00:00
Viktor Barzin
702db75f84 [redis] Stabilise patch_redis_service trigger + document service naming
## Context

`null_resource.patch_redis_service` uses `triggers = { always = timestamp() }`,
so every `scripts/tg plan` on `stacks/redis` reports `1 to destroy, 1 to add`
even when nothing has changed. That noise buries real drift and trains us to
ignore redis-stack plans — exactly what you don't want on a load-bearing patch.

The patch itself is still load-bearing (three consumers hard-code bare
`redis.redis.svc.cluster.local` — `stacks/immich/chart_values.tpl:12`,
`stacks/ytdlp/yt-highlights/app/main.py:136`, `config.tfvars:214` — plus
Bitnami's own sentinel scripts set `REDIS_SERVICE=redis.redis.svc.cluster.local`
and call it during pod startup). Removing the null_resource is a follow-up
(beads T0) once those consumers migrate to `redis-master.redis.svc`. For now
the goal is just: stop being noisy.

## This change

1. Replace the `always = timestamp()` trigger with two inputs that only change
   when re-patching is genuinely required:
   - `chart_version = helm_release.redis.version` — changes only on a Bitnami
     chart version bump, which is the one code path that rewrites the `redis`
     Service selector back to `component=node`.
   - `haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])`
     — changes only when HAProxy config is edited; aligned with the existing
     `checksum/config` annotation that rolls the Deployment on config change.

   Both attributes are known at plan time (verified against `hashicorp/helm`
   v3.1.1 provider binary). Rejected alternatives — `metadata[0].revision`
   (not exposed in the plugin-framework v3 rewrite), `sha256(jsonencode(values))`
   (readability unverified on v3), and `kubernetes_deployment.haproxy.id`
   (static `namespace/name`, never changes) — don't meet the bar.

2. Add a **Redis Service Naming** section to `AGENTS.md` that explicitly
   states the write/sentinel/avoid endpoints, so new consumers start from
   `redis-master.redis.svc` (the documented `var.redis_host`) and long-lived
   connections (PUBSUB, BLPOP, Sidekiq) route around HAProxy's `timeout
   client 30s` via the sentinel headless path. Uptime Kuma's Redis monitor
   already learned that lesson the hard way (memory id=748).
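
Step 1's replacement trigger map, sketched (attribute addresses from the
commit; the provisioner body is unchanged and omitted here):

```hcl
resource "null_resource" "patch_redis_service" {
  triggers = {
    # fires only on a Bitnami chart version bump — the one code path that
    # rewrites the `redis` Service selector
    chart_version  = helm_release.redis.version
    # fires only on HAProxy config edits, mirroring the checksum/config
    # annotation that rolls the Deployment
    haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])
  }
}
```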

## What is NOT in this change

- Deleting `null_resource.patch_redis_service` — still load-bearing (T0).
- Deleting `kubernetes_service.redis_master` — stays as the declared write API.
- Migrating any consumer off bare `redis.redis.svc` — T0 epic.
- Per-client sentinel migration — T1 epic.
- Retiring HAProxy — T2 epic (blocked on T1 + T3).

## Before / after

Before (steady state):
```
scripts/tg plan
Plan: 1 to add, 2 to change, 1 to destroy.
#   null_resource.patch_redis_service must be replaced
#     triggers = { "always" = "<timestamp>" } -> (known after apply)
```

After (steady state, post-apply):
```
scripts/tg plan
No changes. Your infrastructure matches the configuration.
```

After (chart version bump):
```
scripts/tg plan
#   null_resource.patch_redis_service must be replaced
#     triggers = { "chart_version" = "25.3.2" -> "25.4.0" }
```
— the trigger fires only when it actually needs to.

## Test Plan

### Automated

`scripts/tg plan` pre-change (confirms baseline noise):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
    ~ triggers = { # forces replacement
        ~ "always" = "2026-04-19T10:39:40Z" -> (known after apply)
      }
  }
Plan: 1 to add, 2 to change, 1 to destroy.
```

`scripts/tg plan` post-edit (confirms the one-time structural replacement):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
    ~ triggers = { # forces replacement
        - "always"         = "2026-04-19T10:39:40Z" -> null
        + "chart_version"  = "25.3.2"
        + "haproxy_config" = "989bca9483cb9f9942017320765ec0751ac8357ff447acc5ed11f0a14b609775"
      }
  }
```

Apply is deferred to the operator — the working tree touches the same file with
an unrelated HAProxy DNS-resolvers fix (for today's Immich outage) that needs
its own review before the two roll out together. No `scripts/tg apply` was run
from this session.

### Manual Verification

Reproduce locally:
1. `cd infra/stacks/redis && ../../scripts/tg plan`
2. Before apply: expect `null_resource.patch_redis_service` to be replaced
   exactly once, with the trigger map transitioning from `{always = <ts>}`
   to `{chart_version, haproxy_config}`.
3. After apply: `../../scripts/tg plan` twice in a row must both report
   `No changes.` (excluding unrelated drift from other work-in-progress).
4. Cluster-side invariant (must hold pre- and post-apply):
   `kubectl -n redis get svc redis -o jsonpath='{.spec.selector}'`
   → `{"app":"redis-haproxy"}`
   `kubectl -n redis get svc redis-master -o jsonpath='{.spec.selector}'`
   → `{"app":"redis-haproxy"}`
5. Regression test for the trigger doing its job: bump `helm_release.redis.version`
   in a branch, `tg plan`, expect the null_resource to replace. Revert.
2026-04-19 12:17:52 +00:00
Viktor Barzin
28009a0e85 [redis] Bump master/replica memory 64Mi→256Mi (OOMKilled on PSYNC)
## Context
redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts.
Cluster-health check flagged it as WARN; Prometheus was firing
`StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts continuously.

## Root cause
Memory limit 64Mi is too tight. Master steady-state is only 21Mi, but
the replica needs transient headroom during PSYNC full resync:

- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during snapshot)
- Replication backlog tracking

The replica RSS crossed 64Mi during sync and was OOM-killed (exit 137),
looping forever. This also broke Sentinel quorum when master would
fail — no healthy replica to promote.

## Fix
Master + replica: 64Mi → 256Mi (both requests and limits, per
`CLAUDE.md` resource management rule: `requests=limits` based on
VPA upperBound).

Sentinels stay at 64Mi — they don't store data.

## Deployment note
Helm upgrade initially deadlocked because the StatefulSet uses the
`OrderedReady` podManagementPolicy: the rolling update refuses to start
until all pods are Ready, but redis-node-1 could not become Ready without
the update. Recovered via:

  helm rollback redis 43 -n redis
  kubectl -n redis patch sts redis-node --type=strategic \
    -p '{...memory: 256Mi...}'
  kubectl -n redis delete pod redis-node-1 --force

Then `scripts/tg apply` cleanly reconciled state. Deadlock-recovery
runbook to be written under `code-cnf`.

## Verification
  kubectl -n redis get pods
    redis-node-0   2/2  Running  0  <bounce>
    redis-node-1   2/2  Running  0  <bounce>
  kubectl -n redis get sts redis-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
    256Mi

## Follow-ups filed
- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically
  (separate root cause; surfaced via same cluster-health run)
- code-cnf: runbook / TF tweak for the OrderedReady + atomic-wait
  deadlock recovery

Closes: code-pqt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:40:51 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
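
The two path shapes as minimal sketches (resource names illustrative;
`ignore_changes` paths quoted from the list above):

```hcl
resource "kubernetes_deployment" "example" {
  # ...
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].template[0].spec[0].dns_config]
  }
}

resource "kubernetes_cron_job_v1" "example" {
  # ...
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
}
```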

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — the whole file is a commented-out block
  (`/* … */`). The Python script touched it; reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
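
A minimal sketch of such an injector (the real `/tmp/add_goldilocks_ignore.py`
isn't in the repo; this version assumes the resource header opens its block on
the same line and ignores braces inside strings):

```python
MARKER = "goldilocks.fairwinds.com/vpa-update-mode"
BLOCK = (
    "\n  lifecycle {\n"
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy"
    " stamps this label on every namespace\n"
    f'    ignore_changes = [metadata[0].labels["{MARKER}"]]\n'
    "  }\n"
)

def inject(src: str) -> str:
    """Insert BLOCK before the closing brace of each kubernetes_namespace
    resource. Idempotent: a file already mentioning the label is untouched."""
    if MARKER in src:
        return src
    out, depth, in_res = [], 0, False
    for line in src.splitlines(keepends=True):
        if not in_res and line.startswith('resource "kubernetes_namespace" '):
            in_res, depth = True, 0
        if in_res:
            depth += line.count("{") - line.count("}")
            if depth == 0:          # this line closes the resource block
                out.append(BLOCK)
                in_res = False
        out.append(line)
    return "".join(out)
```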

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
8b004c4c94 feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.

## Context

Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.

## This change

Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
  with 1280Mi limit for LUKS2 Argon2id key derivation

Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates

Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update

Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover

Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs

## Services migrated (18 PVCs across 14 namespaces)

```
vaultwarden     → vaultwarden-data-encrypted
dbaas           → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver      → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud       → nextcloud-data-encrypted
forgejo         → forgejo-data-encrypted
matrix          → matrix-data-encrypted
n8n             → n8n-data-encrypted
affine          → affine-data-encrypted
health          → health-uploads-encrypted
hackmd          → hackmd-data-encrypted
redis           → redis-data-redis-node-{0,1}
headscale       → headscale-data-encrypted
frigate         → frigate-config-encrypted
meshcentral     → meshcentral-{data,files}-encrypted
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
f80e1fa868 cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
  update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
  instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
2026-04-06 11:54:45 +03:00
Viktor Barzin
4da8f0242f fix: right-size service memory after PVE RAM upgrade (142→272GB)
- MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit)
- Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled)
- Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled)
- Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled
- Navidrome: 128Mi/128Mi → 256Mi/384Mi
- Matrix: add explicit 256Mi/512Mi resources
- Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled
- Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi
- Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi
- Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections
2026-04-05 23:02:50 +03:00
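The `X/Y` pairs above read as memory request/limit. As a sketch, the Tandoor change maps onto a container spec like this (the request/limit interpretation is an assumption):

```yaml
# Tandoor after right-sizing: 256Mi/256Mi → 384Mi/512Mi,
# assuming the pair notation is request/limit.
resources:
  requests:
    memory: 384Mi
  limits:
    memory: 512Mi
```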
Viktor Barzin
ce7b8c2b2e add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip]
Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats
via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage
PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp
alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding
info alert at 80%.
2026-04-03 23:30:00 +03:00
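An annotated PVC for this setup looks roughly like the sketch below, assuming pvc-autoresizer's documented `resize.topolvm.io/*` annotations; the PVC name and values are illustrative, not the actual per-PVC thresholds from the commit:

```yaml
# Illustrative annotated PVC for pvc-autoresizer.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-prometheus-0                   # illustrative name
  annotations:
    resize.topolvm.io/threshold: 20%        # expand when free space drops below this
    resize.topolvm.io/increase: 20Gi        # grow by this much per resize
    resize.topolvm.io/storage_limit: 100Gi  # hard cap for auto-expansion
spec:
  storageClassName: proxmox-lvm
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
```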
Viktor Barzin
dd59512153 migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip]
Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all
block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes
the iSCSI network hop for database I/O.

New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart
with StorageClass "proxmox-lvm" using existing local-lvm thin pool.

Migrated PVCs (12 total):
- Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus
- Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2)

All services verified healthy post-migration.
2026-04-02 22:13:04 +03:00
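The "proxmox-lvm" StorageClass mentioned above would be shaped roughly like this, assuming the upstream proxmox-csi-plugin's provisioner name and `storage` parameter (both taken from that plugin's docs, not from this commit):

```yaml
# Sketch of the proxmox-lvm StorageClass backed by the local-lvm thin pool.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: proxmox-lvm
provisioner: csi.proxmox.sinextra.dev  # assumed plugin provisioner name
parameters:
  storage: local-lvm                   # existing Proxmox thin pool
allowVolumeExpansion: true
reclaimPolicy: Retain
```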
Viktor Barzin
d20c5e5535 add backup_output_bytes metric and cloudsync_transferred_bytes to backup dashboard
- All 7 backup CronJobs now push backup_output_bytes (file size after backup)
- Cloud Sync monitor parses rclone transfer stats into cloudsync_transferred_bytes
- Grafana dashboard: new Output (MiB) table column, Output Size Trend panel,
  Write Throughput panel, Cloud Sync Transfer Volume bargauge
- All timeseries panels use points-only draw style (discrete backup snapshots)
- etcd backup restructured: init_container for etcdctl (distroless image),
  busybox sidecar for metrics push + purge, ClusterFirstWithHostNet DNS
- Fixed pre-existing bug: curl missing from the postgres:16.4-bullseye image (immich, dbaas PG)
- Fixed pre-existing bug: grep -oP unsupported in alpine/busybox (cloud sync monitor)
2026-03-25 10:44:53 +02:00
Viktor Barzin
a95d434ff1 fix backup IO stats: use /proc/$$/io instead of /proc/self/io
/proc/self/io inside $(awk ...) resolves to the awk subprocess,
not the parent bash shell. Use $$ (the bash PID, expanded before the
subprocess starts) to read the correct process's IO counters.
2026-03-23 12:33:52 +02:00
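The bug can be reproduced in a few lines of bash (Linux only; `readlink` stands in for the original `awk` subprocess):

```shell
# Inside $(...) the /proc/self symlink resolves to the child process,
# while $$ still expands to the parent shell's PID.
parent=$$
child=$(readlink /proc/self)   # readlink's own PID, not the shell's
echo "shell pid=$parent, /proc/self in subshell=$child"

# Correct pattern from the fix: $$ is expanded by the parent shell
# before awk runs, so awk reads the shell's own IO counters.
read_bytes=$(awk '/^read_bytes/ {print $2}' "/proc/$$/io")
echo "shell read_bytes=$read_bytes"
```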
Viktor Barzin
0a294a30a6 add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
  (etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
  overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
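The push payload is plain Prometheus text format. A hypothetical helper sketching what the CronJob shells assemble (metric names are the ones above; the function name and Pushgateway URL are illustrative):

```shell
# Build the Pushgateway payload in Prometheus text exposition format.
format_backup_metrics() {
  # $1=duration_s  $2=read_bytes  $3=written_bytes  $4=success_epoch
  printf 'backup_duration_seconds %s\n' "$1"
  printf 'backup_read_bytes %s\n' "$2"
  printf 'backup_written_bytes %s\n' "$3"
  printf 'backup_last_success_timestamp %s\n' "$4"
}

payload=$(format_backup_metrics 42 1048576 524288 "$(date +%s)")
echo "$payload"
# The jobs would then push it along the lines of:
#   echo "$payload" | curl --data-binary @- \
#     http://pushgateway.monitoring:9091/metrics/job/redis-backup
```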
Viktor Barzin
e463281205 optimize backup schedules: compress dumps, stagger to weekly, extend retention
- dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed
- infra-maintenance: etcd backup daily→weekly Sunday 1am
- redis: backup hourly→weekly Sunday 3am, retention 7→28 days
- vault: raft backup daily→weekly Sunday 2am
2026-03-23 02:24:34 +02:00
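In CronJob cron syntax, the staggered schedules above come out as (job-to-time mapping taken from this commit; the `schedule:` excerpts are a sketch):

```yaml
# dbaas dumps, daily at 00:30:
schedule: "30 0 * * *"
# etcd, weekly Sunday 01:00:
schedule: "0 1 * * 0"
# vault raft, weekly Sunday 02:00:
schedule: "0 2 * * 0"
# redis, weekly Sunday 03:00:
schedule: "0 3 * * 0"
```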
Viktor Barzin
af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
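One plausible shape for a staleness rule from Phase 1, assuming a `backup_last_success_timestamp`-style metric; the alert name, job label, and 26h window are illustrative, and the actual expressions may differ:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts          # illustrative name
spec:
  groups:
    - name: backup.rules
      rules:
        - alert: RedisBackupStale
          # Fire when the last successful backup is more than 26h old.
          expr: time() - backup_last_success_timestamp{job="redis-backup"} > 26 * 3600
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Redis backup has not succeeded in over 26h"
```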
Viktor Barzin
73511b1230 extract remaining 19 modules from platform, complete stack split [ci skip]
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.
2026-03-17 21:42:16 +00:00