## Context
`null_resource.patch_redis_service` uses `triggers = { always = timestamp() }`,
so every `scripts/tg plan` on `stacks/redis` reports `1 to destroy, 1 to add`
even when nothing has changed. That noise buries real drift and trains us to
ignore redis-stack plans — which is exactly what you don't want on a
load-bearing patch.
The patch itself is still load-bearing (three consumers hard-code bare
`redis.redis.svc.cluster.local` — `stacks/immich/chart_values.tpl:12`,
`stacks/ytdlp/yt-highlights/app/main.py:136`, `config.tfvars:214` — plus
Bitnami's own sentinel scripts set `REDIS_SERVICE=redis.redis.svc.cluster.local`
and call it during pod startup). Removing the null_resource is a follow-up
(beads T0) once those consumers migrate to `redis-master.redis.svc`. For now
the goal is just: stop being noisy.
## This change
1. Replace the `always = timestamp()` trigger with two inputs that only change
when re-patching is genuinely required (sketched after this list):
- `chart_version = helm_release.redis.version` — changes only on a Bitnami
chart version bump, which is the one code path that rewrites the `redis`
Service selector back to `component=node`.
- `haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])`
— changes only when HAProxy config is edited; aligned with the existing
`checksum/config` annotation that rolls the Deployment on config change.
Both attributes are known at plan time (verified against `hashicorp/helm`
v3.1.1 provider binary). Rejected alternatives — `metadata[0].revision`
(not exposed in the plugin-framework v3 rewrite), `sha256(jsonencode(values))`
(readability unverified on v3), and `kubernetes_deployment.haproxy.id`
(static `namespace/name`, never changes) — don't meet the bar.
2. Add a **Redis Service Naming** section to `AGENTS.md` that explicitly
states the write/sentinel/avoid endpoints, so new consumers start from
`redis-master.redis.svc` (the documented `var.redis_host`) and long-lived
connections (PUBSUB, BLPOP, Sidekiq) route around HAProxy's `timeout
client 30s` via the sentinel headless path. Uptime Kuma's Redis monitor
already learned that lesson the hard way (memory id=748).
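As a concrete sketch, the resulting trigger block (expressions exactly as listed
above; the provisioner body is unchanged and elided here):
```
resource "null_resource" "patch_redis_service" {
  triggers = {
    # Fires only on a Bitnami chart version bump.
    chart_version = helm_release.redis.version
    # Fires only when haproxy.cfg content changes.
    haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])
  }

  # provisioner "local-exec" { ... }  # existing patch command, unchanged
}
```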
## What is NOT in this change
- Deleting `null_resource.patch_redis_service` — still load-bearing (T0).
- Deleting `kubernetes_service.redis_master` — stays as the declared write API.
- Migrating any consumer off bare `redis.redis.svc` — T0 epic.
- Per-client sentinel migration — T1 epic.
- Retiring HAProxy — T2 epic (blocked on T1 + T3).
## Before / after
Before (steady state):
```
scripts/tg plan
Plan: 1 to add, 2 to change, 1 to destroy.
# null_resource.patch_redis_service must be replaced
# triggers = { "always" = "<timestamp>" } -> (known after apply)
```
After (steady state, post-apply):
```
scripts/tg plan
No changes. Your infrastructure matches the configuration.
```
After (chart version bump):
```
scripts/tg plan
# null_resource.patch_redis_service must be replaced
# triggers = { "chart_version" = "25.3.2" -> "25.4.0" }
```
— the trigger fires only when it actually needs to.
## Test Plan
### Automated
`scripts/tg plan` pre-change (confirms baseline noise):
```
  # module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
      ~ triggers = { # forces replacement
          ~ "always" = "2026-04-19T10:39:40Z" -> (known after apply)
        }
    }

Plan: 1 to add, 2 to change, 1 to destroy.
```
`scripts/tg plan` post-edit (confirms the one-time structural replacement):
```
  # module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
      ~ triggers = { # forces replacement
          - "always"         = "2026-04-19T10:39:40Z" -> null
          + "chart_version"  = "25.3.2"
          + "haproxy_config" = "989bca9483cb9f9942017320765ec0751ac8357ff447acc5ed11f0a14b609775"
        }
    }
```
Apply is deferred to the operator — the working tree touches the same file with
an unrelated HAProxy DNS-resolvers fix (for today's immich outage) that needs
its own review before the two roll out together. No `scripts/tg apply` was run
from this session.
### Manual Verification
Reproduce locally:
1. `cd infra/stacks/redis && ../../scripts/tg plan`
2. Before apply: expect `null_resource.patch_redis_service` to be replaced
exactly once, with the trigger map transitioning from `{always = <ts>}`
to `{chart_version, haproxy_config}`.
3. After apply: `../../scripts/tg plan` twice in a row must both report
`No changes.` (excluding unrelated drift from other work-in-progress).
4. Cluster-side invariant (must hold pre- and post-apply):
`kubectl -n redis get svc redis -o jsonpath='{.spec.selector}'`
→ `{"app":"redis-haproxy"}`
`kubectl -n redis get svc redis-master -o jsonpath='{.spec.selector}'`
→ `{"app":"redis-haproxy"}`
5. Regression test for the trigger doing its job: bump `helm_release.redis.version`
in a branch, `tg plan`, expect the null_resource to replace. Revert.
---

# Infrastructure Repository — AI Agent Instructions
## Critical Rules (MUST FOLLOW)
- ALL changes through Terraform/Terragrunt — NEVER `kubectl apply/edit/patch/delete` for persistent changes. Read-only kubectl is fine.
- NEVER put secrets in plaintext — use `secrets.sops.json` (SOPS-encrypted) or `terraform.tfvars` (git-crypt, legacy)
- NEVER restart NFS on the Proxmox host — causes cluster-wide mount failures across all pods
- NEVER commit secrets — triple-check before every commit
- `[ci skip]` in commit messages when changes were already applied locally
- Ask before `git push` — always confirm with the user first
## Execution
- Apply a service: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
- Legacy apply: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
- kubectl: `kubectl --kubeconfig $(pwd)/config`
- Health check: `bash scripts/cluster_healthcheck.sh --quiet`
- Plan all: `cd stacks && terragrunt run --all --non-interactive -- plan`
## Adopting Existing Resources — Use `import {}` Blocks, Not the CLI
When bringing a live cluster/Vault/Cloudflare resource under Terraform management, use an HCL `import {}` block (Terraform 1.5+). Do NOT use `terraform import` on the CLI for anything landing in this repo — the CLI path leaves no audit trail and makes multi-operator adoption fragile.
Canonical workflow:
- Write the `resource` block that matches the live object.
- In the same stack, add an `import {}` stanza naming the target and the provider-specific ID:

```
import {
  to = helm_release.kured
  id = "kured/kured" # Helm ID format: <namespace>/<release-name>
}

resource "helm_release" "kured" {
  name       = "kured"
  namespace  = "kured"
  repository = "https://kubereboot.github.io/charts/"
  chart      = "kured"
  version    = "5.7.0"
  # ... values matching the live release
}
```

- `scripts/tg plan` — every change it proposes is real divergence between HCL and live state. Iterate on values until the plan is 0 changes.
- `scripts/tg apply` — the import runs alongside whatever zero-change apply you have. If your plan is 0 changes, this commits only the state-ownership transfer.
- After the apply lands cleanly, delete the `import {}` block in a follow-up commit. The resource is now fully TF-owned and the stanza would be a no-op that clutters diffs.
Why `import {}` and not `terraform import`:
- Reviewable in PRs before any state mutation. The CLI path is an out-of-band action nobody sees.
- Plan-safe: the import plan step shows the exact object being adopted. Mistyped IDs or the wrong resource address are caught before apply, not after.
- Survives state backend changes (Tier 0 SOPS vs Tier 1 PG) transparently — both work identically from the operator's perspective because both use `scripts/tg`.
- Re-runnable: if the apply fails partway through, the `import {}` block is idempotent. The CLI path's state mutation is not.
Finding the provider-specific ID: each provider has its own convention.

| Resource | ID format | Example |
|---|---|---|
| `helm_release` | `<namespace>/<release-name>` | `kured/kured` |
| `kubernetes_manifest` | `{"apiVersion":"...","kind":"...","metadata":{"namespace":"...","name":"..."}}` | (pass as HCL object literal) |
| `kubernetes_<kind>_v1` | `<namespace>/<name>` for namespaced, `<name>` for cluster-scoped | `kube-system/coredns` |
| `authentik_provider_proxy` | provider UUID | `0eecac07-97c7-443c-...` |
| `cloudflare_record` | `<zone-id>/<record-id>` | `abc123/def456` |
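For example, adopting a live DNS record with the `cloudflare_record` convention from the table. The IDs and the resource body here are hypothetical, and the record-value attribute name differs across provider major versions:

```
import {
  to = cloudflare_record.example
  id = "abc123/def456" # <zone-id>/<record-id>, both hypothetical
}

resource "cloudflare_record" "example" {
  zone_id = "abc123"
  name    = "example"
  type    = "CNAME"
  content = "viktorbarzin.me" # named `value` in older provider versions
  proxied = true
}
```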
## Secrets Management (SOPS)
- `config.tfvars` — plaintext config (hostnames, IPs, DNS records, public keys)
- `secrets.sops.json` — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys)
- `.sops.yaml` — defines who can decrypt (age public keys: Viktor + CI)
- `scripts/tg` — wrapper that auto-decrypts SOPS before running terragrunt
- Edit secrets: `sops secrets.sops.json` (opens `$EDITOR`, re-encrypts on save)
- Add a secret: `sops set secrets.sops.json '["new_key"]' '"value"'`
- Operators push PRs → Viktor reviews → CI decrypts and applies. No encryption keys needed for operators.
## Sealed Secrets (User-Managed Secrets)
For secrets that users manage themselves (no SOPS/git-crypt access needed):
- Create: `kubectl create secret generic <name> --from-literal=key=value -n <ns> --dry-run=client -o yaml | kubeseal --controller-name sealed-secrets --controller-namespace sealed-secrets -o yaml > sealed-<name>.yaml`
- Commit: place `sealed-*.yaml` files in the stack directory (`stacks/<service>/`)
- Terraform picks them up automatically via `fileset` + `for_each`:

```
resource "kubernetes_manifest" "sealed_secrets" {
  for_each = fileset(path.module, "sealed-*.yaml")
  manifest = yamldecode(file("${path.module}/${each.value}"))
}
```

- Deploy: push → CI runs `terragrunt apply` → controller decrypts into real K8s Secrets
- Only the in-cluster controller has the private key. `kubeseal` uses the public key — safe to distribute.
- Naming convention: files MUST match the `sealed-*.yaml` glob pattern.
- The `kubernetes_manifest` block is safe to add even with zero `sealed-*.yaml` files (empty `for_each`).
## Architecture
Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Proxmox VMs.
- 100+ stacks, each in `stacks/<service>/` with its own Terraform state
- Core platform: `stacks/platform/` is now an empty shell — all modules have been extracted to independent stacks under `stacks/`
- Public domain: `viktorbarzin.me` (Cloudflare) | Internal: `viktorbarzin.lan` (Technitium DNS)
- Onboarding portal: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs
- CI/CD: Woodpecker CI — PRs run plan, merges to master auto-apply all stacks
## Key Paths
- `stacks/<service>/main.tf` — service definition
- `stacks/platform/modules/<service>/` — core infra modules
- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"` or `"non-proxied"`); sketched after this list
- `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount)
- `config.tfvars` — non-secret configuration (plaintext)
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
- `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference)
- `scripts/cluster_healthcheck.sh` — 25-check cluster health script
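A minimal sketch of an `ingress_factory` call. Only `dns_type` is documented above; every other input name is a guess, so check the module's `variables.tf` for the real interface:

```
module "myapp_ingress" {
  source = "../../modules/kubernetes/ingress_factory"

  # Hypothetical inputs; confirm against the module's variables.tf.
  name     = "myapp"
  dns_type = "proxied" # or "non-proxied" (documented toggle)
}
```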
## Storage
- NFS (`nfs-proxmox` StorageClass): for app data. Use the `nfs_volume` module, never inline `nfs {}` blocks (sketch after this list).
- proxmox-lvm-encrypted (`proxmox-lvm-encrypted` StorageClass): default for all sensitive data — databases, auth, email, passwords, git repos, health data. LUKS2 encryption via Proxmox CSI. Passphrase in Vault, backup key on PVE host.
- proxmox-lvm (`proxmox-lvm` StorageClass): for non-sensitive stateful apps (configs, caches, tools). Proxmox CSI driver.
- NFS server: Proxmox host at 192.168.1.127. HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (10.0.10.15) decommissioned.
- SQLite on NFS is unreliable (fsync issues) — always use proxmox-lvm or local disk for databases.
- NFS mount options: always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
- NFS export directory must exist on the Proxmox host before Terraform can create the PV.
- Backup (3-2-1): Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/` + `nfs-ssd/` via inotify change tracking).
- daily-backup (daily 05:00): auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify).
- offsite-sync-backup (daily 06:00): step 1: sda→Synology `pve-backup/`; step 2: NFS→Synology `nfs/` + `nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`.
- nfs-change-tracker.service: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds.
- Synology layout (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`). `truenas/` renamed to `nfs/`, `pve-backup/nfs-mirror/` removed.
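A minimal sketch of the `nfs_volume` pattern under the constraints above (soft mount, export directory pre-created on the host). Input names are assumptions, not the module's verified interface:

```
module "myapp_data" {
  source = "../../modules/kubernetes/nfs_volume"

  # Hypothetical inputs; confirm against the module's variables.tf.
  name       = "myapp-data"
  namespace  = "myapp"
  nfs_server = var.nfs_server   # 192.168.1.127
  nfs_path   = "/srv/nfs/myapp" # must already exist on the Proxmox host
}
```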
## Shared Variables (never hardcode)
`var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
## Redis Service Naming (read before wiring a new consumer)
The Redis stack (`stacks/redis/`) exposes three distinct entry points. Pick the one that matches the client's connection pattern — the wrong one causes READONLY errors or silent connection drops.

| Endpoint | Port(s) | Use for | Backed by |
|---|---|---|---|
| `redis-master.redis.svc.cluster.local` | 6379 (redis), 26379 (sentinel) | Default for new services. Write-safe — HAProxy health-checks nodes and routes only to the current master. Matches `var.redis_host`. | `kubernetes_service.redis_master` → HAProxy → Bitnami StatefulSet |
| `redis-node-{0,1,2}.redis-headless.redis.svc.cluster.local` | 26379 | Long-lived connections (PUBSUB, BLPOP, MONITOR, Sidekiq). Use a sentinel-aware client with master name `mymaster`. Example: `stacks/nextcloud/chart_values.yaml:32-54`. | Bitnami-created headless service → pod DNS |
| `redis.redis.svc.cluster.local` | 6379 | Do NOT use. Helm chart's default service — selector patched by `null_resource.patch_redis_service` to match `redis-haproxy`, so today it behaves like `redis-master`. The patch is load-bearing but temporary; consumers hard-coded on this name are tracked in a beads follow-up (T0). | Bitnami chart (patched) |
HAProxy's `timeout client 30s` closes idle raw Redis connections — any client that holds a connection open for pub/sub, blocking commands, or replication streams MUST use the sentinel path. Uptime Kuma's Redis monitor hit this limit and had to be re-pointed at the sentinel endpoint (see memory id=748).

When onboarding a new service: start from `redis-master.redis.svc.cluster.local:6379` via `var.redis_host` (sketch below). Only reach for sentinel discovery if the client library supports it natively (ioredis, redis-py Sentinel, go-redis FailoverClient, Sidekiq `sentinels` array) AND the workload uses long-lived connections.
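In HCL terms, a new consumer typically just threads the shared variable into its container environment. A minimal sketch; the env var names are whatever the consuming app expects (hypothetical here):

```
# Inside the consumer's container {} block (kubernetes provider syntax):
env {
  name  = "REDIS_HOST"
  value = var.redis_host # redis-master.redis.svc.cluster.local
}
env {
  name  = "REDIS_PORT"
  value = "6379"
}
```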
## Kyverno Drift Suppression (`# KYVERNO_LIFECYCLE_V1`)
Kyverno's admission webhook mutates every pod with a `dns_config { option { name = "ndots"; value = "2" } }` block (fixes NXDOMAIN search-domain floods — see the k8s-ndots-search-domain-nxdomain-flood skill). Terraform does not manage that field, so without suppression every pod-owning resource shows perpetual `spec[0].template[0].spec[0].dns_config` drift.

Rule: every `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, and `kubernetes_cron_job_v1` MUST include the following lifecycle block, tagged with the `# KYVERNO_LIFECYCLE_V1` marker so every site is greppable:
```
# kubernetes_deployment / kubernetes_stateful_set / kubernetes_daemon_set
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}

# kubernetes_cron_job_v1 (extra job_template nesting)
lifecycle {
  ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
Why not a shared module? Terraform's `ignore_changes` meta-argument only accepts static attribute paths. It rejects module outputs, locals, variables, and any expression. A DRY module is therefore impossible — the canonical pattern IS the snippet + marker. When `kubernetes_manifest` resources get Kyverno `generate.kyverno.io/*` annotations mutated, a sibling convention `# KYVERNO_MANIFEST_V1` will be introduced (Phase B).

Audit: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
## Tier System
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on the tier label.
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
- Always set explicit resources on containers that need more than the defaults (sketch after this list)
- Opt-out: labels `resource-governance/custom-quota=true` / `resource-governance/custom-limitrange=true`
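A minimal sketch of the explicit-resources pattern inside a `container {}` block (the numbers are illustrative, not a recommendation):

```
# Inside container {} (kubernetes provider syntax):
resources {
  requests = {
    cpu    = "100m"
    memory = "256Mi"
  }
  limits = {
    memory = "1Gi" # above the 256Mi edge/aux default, avoids OOMKill
  }
}
```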
## Infrastructure
- Proxmox: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
- Nodes: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
- GPU: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu`
- Pull-through cache: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for `:latest` tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
- pfSense: 10.0.20.1 (gateway, firewall, DNS forwarding)
- MySQL InnoDB Cluster: 1 instance on proxmox-lvm (scaled down from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes k8s-node1 (GPU node)
- SMTP: `var.mail_host` port 587 STARTTLS (not the internal svc address — cert mismatch)
## Contributor Onboarding
- Get an Authentik account + Headscale VPN access (ask Viktor)
- Clone the repo — `AGENTS.md` is auto-loaded by Codex
- Create branch → edit → push → open PR
- Viktor reviews → CI applies → Slack notification
- Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for the full guide
## Common Operations
- Deploy a new service: use `stacks/<existing-service>/` as a template. Create the stack, add DNS in tfvars, apply platform then service.
- Fix crashed pods: run the healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- OOMKilled: check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
- Add a secret: `sops set secrets.sops.json '["key"]' '"value"'`, then commit.
- NFS exports: create the dir on the Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add it to `/etc/exports`, run `exportfs -ra`.
## Automated Service Upgrades
- Pipeline: DIUN (detect) → n8n webhook (filter + rate limit) → HTTP POST → `claude-agent-service` (K8s) → `claude -p` (upgrade agent)
- Agent: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
- Config: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
- Rate limit: max 5 upgrades per 6h DIUN scan cycle (configured in the n8n workflow)
- Skipped: databases, `:latest`, custom images (`viktorbarzin/*`), infrastructure images
- Risk: SAFE (2min verify) vs CAUTION (10min, DB backup, step through versions) based on changelog analysis
- Docs: `docs/architecture/automated-upgrades.md`
## Detailed Reference
See `.claude/reference/patterns.md` for: NFS volume code examples, iSCSI details, Kyverno governance tables, anti-AI scraping layers, Terragrunt architecture, node rebuild procedure, archived troubleshooting runbooks index.