infra

Viktor Barzin 702db75f84 [redis] Stabilise patch_redis_service trigger + document service naming	2026-04-19 12:17:52 +00:00

## Context

`null_resource.patch_redis_service` uses `triggers = { always = timestamp() }`, so every `scripts/tg plan` on `stacks/redis` reports `1 to destroy, 1 to add` even when nothing has changed. That noise hides real drift in the signal and trains us to ignore redis-stack plans, which is exactly what you don't want on a load-bearing patch.

The patch itself is still load-bearing: three consumers hard-code bare `redis.redis.svc.cluster.local` (`stacks/immich/chart_values.tpl:12`, `stacks/ytdlp/yt-highlights/app/main.py:136`, `config.tfvars:214`), plus Bitnami's own sentinel scripts set `REDIS_SERVICE=redis.redis.svc.cluster.local` and call it during pod startup. Removing the null_resource is a follow-up (beads T0) once those consumers migrate to `redis-master.redis.svc`. For now the goal is just: stop being noisy.

## This change

1. Replace the `always = timestamp()` trigger with two inputs that only change when re-patching is genuinely required:
   - `chart_version = helm_release.redis.version`: changes only on a Bitnami chart version bump, which is the one code path that rewrites the `redis` Service selector back to `component=node`.
   - `haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])`: changes only when HAProxy config is edited; aligned with the existing `checksum/config` annotation that rolls the Deployment on config change.

   Both attributes are known at plan time (verified against `hashicorp/helm` v3.1.1 provider binary). The rejected alternatives don't meet the bar: `metadata[0].revision` (not exposed in the plugin-framework v3 rewrite), `sha256(jsonencode(values))` (readability unverified on v3), and `kubernetes_deployment.haproxy.id` (static `namespace/name`, never changes).
2. Add a Redis Service Naming section to `AGENTS.md` that explicitly states the write/sentinel/avoid endpoints, so new consumers start from `redis-master.redis.svc` (the documented `var.redis_host`) and long-lived connections (PUBSUB, BLPOP, Sidekiq) route around HAProxy's `timeout client 30s` via the sentinel headless path. Uptime Kuma's Redis monitor already learned that lesson the hard way (memory id=748).

## What is NOT in this change

- Deleting `null_resource.patch_redis_service`: still load-bearing (T0).
- Deleting `kubernetes_service.redis_master`: stays as the declared write API.
- Migrating any consumer off bare `redis.redis.svc`: T0 epic.
- Per-client sentinel migration: T1 epic.
- Retiring HAProxy: T2 epic (blocked on T1 + T3).

## Before / after

Before (steady state):

```
scripts/tg plan
Plan: 1 to add, 2 to change, 1 to destroy.
# null_resource.patch_redis_service must be replaced
#   triggers = { "always" = "<timestamp>" } -> (known after apply)
```

After (steady state, post-apply):

```
scripts/tg plan
No changes. Your infrastructure matches the configuration.
```

After (chart version bump):

```
scripts/tg plan
# null_resource.patch_redis_service must be replaced
#   triggers = { "chart_version" = "25.3.2" -> "25.4.0" }
```

The trigger fires only when it actually needs to.

## Test Plan

### Automated

`scripts/tg plan` pre-change (confirms baseline noise):

```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
      ~ triggers = { # forces replacement
          ~ "always" = "2026-04-19T10:39:40Z" -> (known after apply)
        }
    }

Plan: 1 to add, 2 to change, 1 to destroy.
```

`scripts/tg plan` post-edit (confirms the one-time structural replacement):

```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
      ~ triggers = { # forces replacement
          - "always"         = "2026-04-19T10:39:40Z" -> null
          + "chart_version"  = "25.3.2"
          + "haproxy_config" = "989bca9483cb9f9942017320765ec0751ac8357ff447acc5ed11f0a14b609775"
        }
    }
```

Apply is deferred to the operator: the working tree on the same file also contains an unrelated HAProxy DNS-resolvers fix (for today's immich outage) that needs its own review before rolling out together. No `scripts/tg apply` run from this session.

### Manual Verification

Reproduce locally:

1. `cd infra/stacks/redis && ../../scripts/tg plan`
2. Before apply: expect `null_resource.patch_redis_service` to be replaced exactly once, with the trigger map transitioning from `{always = <ts>}` to `{chart_version, haproxy_config}`.
3. After apply: `../../scripts/tg plan` twice in a row must both report `No changes.` (excluding unrelated drift from other work-in-progress).
4. Cluster-side invariant (must hold pre- and post-apply):
   `kubectl -n redis get svc redis -o jsonpath='{.spec.selector}'` → `{"app":"redis-haproxy"}`
   `kubectl -n redis get svc redis-master -o jsonpath='{.spec.selector}'` → `{"app":"redis-haproxy"}`
5. Regression test for the trigger doing its job: bump `helm_release.redis.version` in a branch, `tg plan`, expect the null_resource to replace. Revert.
.beads	bd init: initialize beads issue tracking	2026-04-06 15:38:46 +03:00
.claude	[payslip-ingest] Update extractor agent + dashboard for v2 regex parser	2026-04-19 10:54:33 +00:00
.git-crypt	Add 1 git-crypt collaborator [ci skip]	2025-10-24 18:00:00 +00:00
.github	chore: sort outage report service list alphabetically	2026-04-15 18:01:54 +00:00
.planning	[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache	2026-03-06 23:55:57 +00:00
.woodpecker	[infra] Add Woodpecker pipeline to deploy PVE /etc/exports (Wave 6b)	2026-04-18 23:21:36 +00:00
ci	feat: CI/CD performance overhaul	2026-04-15 11:22:26 +00:00
cli	add IPv6 connectivity via Hurricane Electric 6in4 tunnel	2026-03-23 02:22:00 +02:00
diagram	[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references	2026-02-23 19:38:55 +00:00
docs	[mailserver] Phase 2-3 — pfSense HAProxy bootstrap + runbook [ci skip]	2026-04-19 12:07:47 +00:00
modules	[infra] Suppress Kyverno label drift on module.tls_secret Secrets [ci skip]	2026-04-18 19:23:02 +00:00
playbooks	[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs	2026-02-22 22:59:38 +00:00
scripts	[mailserver] Phase 2-3 — pfSense HAProxy bootstrap + runbook [ci skip]	2026-04-19 12:07:47 +00:00
secrets	Woodpecker CI Update TLS Certificates Commit	2026-04-19 00:02:53 +00:00
stacks	[redis] Stabilise patch_redis_service trigger + document service naming	2026-04-19 12:17:52 +00:00
state/stacks	state(vault): update encrypted state	2026-04-18 22:12:55 +00:00
.gitattributes	Add broker-sync Terraform stack (#7 )	2026-04-17 21:17:45 +01:00
.gitignore	.gitignore: ignore terragrunt_rendered.json debug output	2026-04-18 13:18:05 +00:00
.sops.yaml	state: per-stack Transit keys for namespace-owner access control	2026-03-17 23:08:18 +00:00
AGENTS.md	[redis] Stabilise patch_redis_service trigger + document service naming	2026-04-19 12:17:52 +00:00
config.tfvars	[config] Remove ollama_host root variable	2026-04-18 11:14:53 +00:00
CONTRIBUTING.md	multi-user access: fix template memory default, add storage quota, add CONTRIBUTING.md [ci skip]	2026-03-19 23:49:15 +00:00
LICENSE.txt	Drone CI Update TLS Certificates Commit	2025-10-12 00:13:18 +00:00
MEMORY.md	Update MEMORY.md timestamp	2026-03-07 16:43:15 +00:00
README.md	add architecture documentation for all infrastructure subsystems [ci skip]	2026-03-24 00:55:25 +02:00
setup-monitoring.sh	fix(monitoring): Add setup script for automated health check environment	2026-03-13 13:57:11 +00:00
terragrunt.hcl	[infra] Adopt Authentik catch-all Proxy Provider + Application into TF (Wave 6a)	2026-04-18 22:48:26 +00:00
tiers.tf	[ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk	2026-02-28 19:08:06 +00:00

README.md

This repo contains my infra-as-code sources.

My infrastructure is built with Terraform and Kubernetes; CI/CD runs on Woodpecker CI.

Read more by visiting my website: https://viktorbarzin.me

Documentation

Full architecture documentation is available in docs/ — covering networking, storage, security, monitoring, secrets, CI/CD, databases, and more.

Adding a New User (Admin)

Adding a new namespace-owner to the cluster requires three steps — no code changes needed.

1. Authentik Group Assignment

In the Authentik admin UI, add the user to:

kubernetes-namespace-owners group (grants OIDC group claim for K8s RBAC)
Headscale Users group (if they need VPN access)

2. Vault KV Entry

Add a JSON entry to secret/platform → k8s_users key in Vault:

"username": {
  "role": "namespace-owner",
  "email": "user@example.com",
  "namespaces": ["username"],
  "domains": ["myapp"],
  "quota": {
    "cpu_requests": "2",
    "memory_requests": "4Gi",
    "memory_limits": "8Gi",
    "pods": "20"
  }
}

username key must match the user's Forgejo username (for Woodpecker admin access)
namespaces — K8s namespaces to create and grant admin access to
domains — subdomains under viktorbarzin.me for Cloudflare DNS records
quota — resource limits per namespace (defaults shown above)
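
To avoid typos when preparing the entry, the structure above can be generated rather than hand-written. The following is a minimal sketch only: `make_user_entry` is a hypothetical helper that does not exist in this repo, and it assumes the user gets a single namespace named after their username and a single domain, with the default quota values shown above.

```shell
# Hypothetical helper (not part of the repo): prints one k8s_users entry.
# $1 = Forgejo username, $2 = email, $3 = subdomain under viktorbarzin.me
# Assumes one namespace (named after the user) and the default quota.
make_user_entry() {
  cat <<EOF
"$1": {
  "role": "namespace-owner",
  "email": "$2",
  "namespaces": ["$1"],
  "domains": ["$3"],
  "quota": {
    "cpu_requests": "2",
    "memory_requests": "4Gi",
    "memory_limits": "8Gi",
    "pods": "20"
  }
}
EOF
}
```

For example, `make_user_entry alice alice@example.com myapp` prints an entry keyed by `alice` that can be pasted into the existing k8s_users JSON object in Vault.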

3. Apply Stacks

vault login -method=oidc

cd stacks/vault && terragrunt apply --non-interactive
# Creates: namespace, Vault policy, identity entity, K8s deployer role

cd ../platform && terragrunt apply --non-interactive
# Creates: RBAC bindings, ResourceQuota, TLS secret, DNS records

cd ../woodpecker && terragrunt apply --non-interactive
# Adds user to Woodpecker admin list
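
The three applies above always run in the same order (vault, then platform, then woodpecker), so they could be wrapped in a small loop. This is a sketch under assumptions: `apply_user_stacks` and the `DRY_RUN` variable are hypothetical names, not something the repo provides, and the function is assumed to be called from the repo root after `vault login -method=oidc`.

```shell
# Hypothetical wrapper (not part of the repo). Run from the repo root.
# With DRY_RUN=1 it only prints the commands; otherwise it applies each
# stack in order and stops at the first failure.
apply_user_stacks() {
  for stack in vault platform woodpecker; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "(cd stacks/${stack} && terragrunt apply --non-interactive)"
    else
      (cd "stacks/${stack}" && terragrunt apply --non-interactive) || return 1
    fi
  done
}
```

Running `DRY_RUN=1 apply_user_stacks` first is a cheap way to confirm the order before anything touches the cluster. Each apply runs in a subshell, so the caller's working directory is left unchanged.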

What Gets Auto-Generated

Resource	Stack
Kubernetes namespace	vault
Vault policy (`namespace-owner-{user}`)	vault
Vault identity entity + OIDC alias	vault
K8s deployer Role + Vault K8s role	vault
RBAC RoleBinding (namespace admin)	platform
RBAC ClusterRoleBinding (cluster read-only)	platform
ResourceQuota	platform
TLS secret in namespace	platform
Cloudflare DNS records	platform
Woodpecker admin access	woodpecker

New User Onboarding

If you've been added as a namespace-owner, follow these steps to get started.

1. Join the VPN

# Install Tailscale: https://tailscale.com/download
tailscale login --login-server https://headscale.viktorbarzin.me
# Send the registration URL to Viktor, wait for approval
ping 10.0.20.100  # verify connectivity

2. Install Tools

Run the setup script to install kubectl, kubelogin, Vault CLI, Terraform, and Terragrunt:

# macOS (URL is quoted so that zsh does not treat "?" as a glob)
bash <(curl -fsSL "https://k8s-portal.viktorbarzin.me/setup/script?os=mac")

# Linux
bash <(curl -fsSL "https://k8s-portal.viktorbarzin.me/setup/script?os=linux")

3. Authenticate

# Log into Vault (opens browser for SSO)
vault login -method=oidc

# Test kubectl (opens browser for OIDC login)
kubectl get pods -n YOUR_NAMESPACE

4. Deploy Your First App

# Clone the infra repo
git clone https://github.com/ViktorBarzin/infra.git && cd infra

# Copy the stack template
cp -r stacks/_template stacks/myapp
mv stacks/myapp/main.tf.example stacks/myapp/main.tf

# Edit main.tf — replace all <placeholders>

# Store secrets in Vault
vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123

# Submit a PR
git checkout -b feat/myapp
git add stacks/myapp/
git commit -m "add myapp stack"
git push -u origin feat/myapp

After review and merge, an admin runs cd stacks/myapp && terragrunt apply.

5. Set Up CI/CD (Optional)

Create .woodpecker.yml in your app's Forgejo repo:

steps:
  - name: build
    image: woodpeckerci/plugin-docker-buildx
    settings:
      repo: YOUR_DOCKERHUB_USER/myapp
      tag: ["${CI_PIPELINE_NUMBER}", "latest"]
      username:
        from_secret: dockerhub-username
      password:
        from_secret: dockerhub-token
      platforms: linux/amd64

  - name: deploy
    image: hashicorp/vault:1.18.1
    commands:
      - export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
      - export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login
          role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token))
      - KUBE_TOKEN=$(vault write -field=service_account_token
          kubernetes/creds/YOUR_NAMESPACE-deployer
          kubernetes_namespace=YOUR_NAMESPACE)
      - kubectl --server=https://kubernetes.default.svc
          --token=$KUBE_TOKEN
          --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          -n YOUR_NAMESPACE set image deployment/myapp
          myapp=YOUR_DOCKERHUB_USER/myapp:${CI_PIPELINE_NUMBER}

Useful Commands

# Check your pods
kubectl get pods -n YOUR_NAMESPACE

# View quota usage
kubectl describe resourcequota -n YOUR_NAMESPACE

# Store/read secrets
vault kv put secret/YOUR_USERNAME/myapp KEY=value
vault kv get secret/YOUR_USERNAME/myapp

# Get a short-lived K8s deploy token
vault write kubernetes/creds/YOUR_NAMESPACE-deployer \
  kubernetes_namespace=YOUR_NAMESPACE

Important Rules

All changes go through Terraform — never kubectl apply/edit/patch directly
Never put secrets in code — use Vault: vault kv put secret/YOUR_USERNAME/...
Always use a PR — never push directly to master
Docker images: build for linux/amd64, use versioned tags (not :latest)
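
The versioned-tag rule is easy to check mechanically before opening a PR. The function below is a minimal sketch: `is_versioned_image` is a hypothetical helper, not part of the repo, and it treats a missing tag the same as `:latest`, because an untagged image reference resolves to `latest`.

```shell
# Hypothetical check (not part of the repo): succeeds only when the image
# reference carries an explicit tag other than "latest".
is_versioned_image() {
  image="$1"
  last="${image##*/}"      # keep only the final path segment, so a registry port is not read as a tag
  case "$last" in
    *:latest) return 1 ;;  # explicit :latest
    *:?*)     return 0 ;;  # any other non-empty tag
    *)        return 1 ;;  # no tag at all, which resolves to latest
  esac
}
```

For example, `is_versioned_image YOUR_DOCKERHUB_USER/myapp:42` succeeds, while `is_versioned_image YOUR_DOCKERHUB_USER/myapp` and `is_versioned_image YOUR_DOCKERHUB_USER/myapp:latest` both fail.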

git-crypt setup

To decrypt the secrets, you need to set up git-crypt.

Install git-crypt.
Set up GPG keys on the machine.
Run git-crypt unlock.

This unlocks the secrets and locks them again on commit.