Phase 3 — replication chain (old → v2):
- Discovered the v2 cluster was running redis:7.4-alpine, but the
Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
the 7.4 replicas rejected the stream with "Can't handle RDB format
version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
restore PSYNC compatibility.
- Discovered that sentinel on BOTH v2 and old Bitnami clusters
auto-discovered the cross-cluster replication chain when v2-0
REPLICAOF'd the old master, triggering a failover that reparented
old-master to a v2 replica and took HAProxy's backend offline.
Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
clusters) during the REPLICAOF surgery, then re-MONITOR after
cutover. This must be done on the OLD sentinels too, not just v2 —
they're the ones that kept fighting our REPLICAOF.
- Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
All 76 keys (db0:76, db1:22, db4:16) synced including `immich_bull:*`
BullMQ queues and `_kombu.*` Celery queues — the user-stated
must-survive data class.
Phase 4 — HAProxy cutover:
- Updated `kubernetes_config_map.haproxy` to point at
`redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
redis_sentinel backends (removed redis-node-{0,1}).
- Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
ConfigMap apply so HAProxy's 1s health-check interval found a
role:master within a few seconds. Cutover disruption on HAProxy
rollout was brief; old clients naturally moved to new HAProxy pods
within the rolling update window.
- Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
+ `announce-hostnames yes` were active — this ensures sentinel
stores the hostname (not resolved IP) in its rewritten config, so
pod-IP churn on restart doesn't break failover.
Phase 5 — chaos:
- Round 1: killed master v2-0 mid-probe. First run exposed the
sentinel IP-storage issue (stored 10.10.107.222, went stale on
restart) — ~12s probe disruption. Fixed hostname persistence and
re-MONITORed.
- Round 2: killed new master v2-2 with hostnames correctly stored.
Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
60s — target <3s of actual user-visible disruption.
Phase 6 — Nextcloud simplification:
- `zzz-redis.config.php` no longer queries sentinel in-process —
just points at `redis-master.redis.svc.cluster.local`. Removed 20
lines of PHP. HAProxy handles master tracking transparently now
that it's scaled to 3 + PDB minAvailable=2.
Phase 7 step 1:
- `kubectl scale statefulset/redis-node --replicas=0` (transient —
TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
preserved as cold rollback.
Docs:
- Rewrote `databases.md` Redis section to reflect post-cutover reality
and the sentinel hostname gotcha (so future sessions don't relearn it).
- `.claude/reference/service-catalog.md` entry updated.
The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.
Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|---|---|---|
| .beads | ||
| .claude | ||
| .git-crypt | ||
| .github | ||
| .planning | ||
| .woodpecker | ||
| ci | ||
| cli | ||
| diagram | ||
| docs | ||
| modules | ||
| playbooks | ||
| scripts | ||
| secrets | ||
| stacks | ||
| state/stacks | ||
| .gitattributes | ||
| .gitignore | ||
| .sops.yaml | ||
| AGENTS.md | ||
| config.tfvars | ||
| CONTRIBUTING.md | ||
| LICENSE.txt | ||
| MEMORY.md | ||
| README.md | ||
| terragrunt.hcl | ||
| tiers.tf | ||
This repo contains my infra-as-code sources.
My infrastructure is built using Terraform, Kubernetes and CI/CD is done using Woodpecker CI.
Read more by visiting my website: https://viktorbarzin.me
Documentation
Full architecture documentation is available in docs/ — covering networking, storage, security, monitoring, secrets, CI/CD, databases, and more.
Adding a New User (Admin)
Adding a new namespace-owner to the cluster requires three steps — no code changes needed.
1. Authentik Group Assignment
In the Authentik admin UI, add the user to:
kubernetes-namespace-ownersgroup (grants OIDC group claim for K8s RBAC)Headscale Usersgroup (if they need VPN access)
2. Vault KV Entry
Add a JSON entry to secret/platform → k8s_users key in Vault:
"username": {
"role": "namespace-owner",
"email": "user@example.com",
"namespaces": ["username"],
"domains": ["myapp"],
"quota": {
"cpu_requests": "2",
"memory_requests": "4Gi",
"memory_limits": "8Gi",
"pods": "20"
}
}
usernamekey must match the user's Forgejo username (for Woodpecker admin access)namespaces— K8s namespaces to create and grant admin access todomains— subdomains underviktorbarzin.mefor Cloudflare DNS recordsquota— resource limits per namespace (defaults shown above)
3. Apply Stacks
vault login -method=oidc
cd stacks/vault && terragrunt apply --non-interactive
# Creates: namespace, Vault policy, identity entity, K8s deployer role
cd ../platform && terragrunt apply --non-interactive
# Creates: RBAC bindings, ResourceQuota, TLS secret, DNS records
cd ../woodpecker && terragrunt apply --non-interactive
# Adds user to Woodpecker admin list
What Gets Auto-Generated
| Resource | Stack |
|---|---|
| Kubernetes namespace | vault |
Vault policy (namespace-owner-{user}) |
vault |
| Vault identity entity + OIDC alias | vault |
| K8s deployer Role + Vault K8s role | vault |
| RBAC RoleBinding (namespace admin) | platform |
| RBAC ClusterRoleBinding (cluster read-only) | platform |
| ResourceQuota | platform |
| TLS secret in namespace | platform |
| Cloudflare DNS records | platform |
| Woodpecker admin access | woodpecker |
New User Onboarding
If you've been added as a namespace-owner, follow these steps to get started.
1. Join the VPN
# Install Tailscale: https://tailscale.com/download
tailscale login --login-server https://headscale.viktorbarzin.me
# Send the registration URL to Viktor, wait for approval
ping 10.0.20.100 # verify connectivity
2. Install Tools
Run the setup script to install kubectl, kubelogin, Vault CLI, Terraform, and Terragrunt:
# macOS
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)
# Linux
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
3. Authenticate
# Log into Vault (opens browser for SSO)
vault login -method=oidc
# Test kubectl (opens browser for OIDC login)
kubectl get pods -n YOUR_NAMESPACE
4. Deploy Your First App
# Clone the infra repo
git clone https://github.com/ViktorBarzin/infra.git && cd infra
# Copy the stack template
cp -r stacks/_template stacks/myapp
mv stacks/myapp/main.tf.example stacks/myapp/main.tf
# Edit main.tf — replace all <placeholders>
# Store secrets in Vault
vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123
# Submit a PR
git checkout -b feat/myapp
git add stacks/myapp/
git commit -m "add myapp stack"
git push -u origin feat/myapp
After review and merge, an admin runs cd stacks/myapp && terragrunt apply.
5. Set Up CI/CD (Optional)
Create .woodpecker.yml in your app's Forgejo repo:
steps:
- name: build
image: woodpeckerci/plugin-docker-buildx
settings:
repo: YOUR_DOCKERHUB_USER/myapp
tag: ["${CI_PIPELINE_NUMBER}", "latest"]
username:
from_secret: dockerhub-username
password:
from_secret: dockerhub-token
platforms: linux/amd64
- name: deploy
image: hashicorp/vault:1.18.1
commands:
- export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
- export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login
role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token))
- KUBE_TOKEN=$(vault write -field=service_account_token
kubernetes/creds/YOUR_NAMESPACE-deployer
kubernetes_namespace=YOUR_NAMESPACE)
- kubectl --server=https://kubernetes.default.svc
--token=$KUBE_TOKEN
--certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
-n YOUR_NAMESPACE set image deployment/myapp
myapp=YOUR_DOCKERHUB_USER/myapp:${CI_PIPELINE_NUMBER}
Useful Commands
# Check your pods
kubectl get pods -n YOUR_NAMESPACE
# View quota usage
kubectl describe resourcequota -n YOUR_NAMESPACE
# Store/read secrets
vault kv put secret/YOUR_USERNAME/myapp KEY=value
vault kv get secret/YOUR_USERNAME/myapp
# Get a short-lived K8s deploy token
vault write kubernetes/creds/YOUR_NAMESPACE-deployer \
kubernetes_namespace=YOUR_NAMESPACE
Important Rules
- All changes go through Terraform — never
kubectl apply/edit/patchdirectly - Never put secrets in code — use Vault:
vault kv put secret/YOUR_USERNAME/... - Always use a PR — never push directly to master
- Docker images: build for
linux/amd64, use versioned tags (not:latest)
git-crypt setup
To decrypt the secrets, you need to setup git-crypt.
- Install git-crypt.
- Setup gpg keys on the machine
git-crypt unlock
This will unlock the secrets and will lock them on commit