New variant documents ghost Running pods with frozen processes after kured
rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets
from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.
The BuildKit builder cannot push to the insecure HTTP registry at
registry.viktorbarzin.lan:5050 because buildkit_config is not being
applied by the plugin. Simplified to DockerHub-only push for now.
Private registry caching and push can be re-added once buildkit_config
issue is resolved.
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint
TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
The plugin-docker-buildx (Codeberg version) changed CacheFrom from
string to StringSlice, which causes urfave/cli to split on commas.
The cache_images setting properly handles registry refs by generating
both --cache-from and --cache-to flags automatically.
- cache_from/cache_to must be plain strings, not YAML lists — the
plugin-docker-buildx treats them as single string values and the
Woodpecker settings layer was splitting comma-separated list items
into separate --cache-from flags (type=registry and ref=... separately)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
to fix Terraform plan error with newer Helm provider
Key changes from v1:
- Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL
- Remove Headscale from PG migration (project discourages it)
- Remove MeshCentral from PG migration (NeDB, not SQLite)
- Replace Redis Sentinel with single redis:7 on local disk (modules unused)
- Add RAM overcommit warning and mitigation
- Add explicit single-host limitation acknowledgment
- Add per-component rollback plans
- Fix backup strategy (CNPG can't archive WAL to NFS natively)
- Reorder migration: low-risk services first, authentik last
- Add research gate before each service migration
Woodpecker CI pipeline now pushes tagged images and patches the
deployment with the build number tag. Using :latest as the Terraform
baseline so CI can override with specific build tags.
Secondary instance on a separate node replicates all zones from primary via
zone transfer. LoadBalancer routes DNS queries to both pods. PDB ensures at
least 1 DNS pod survives voluntary disruptions. Setup job automates zone
transfer enablement and secondary zone creation via Technitium REST API.
- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
get VPA Off mode (recommend only, no evictions) to prevent downtime
on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing
Add Vertical Pod Autoscaler (recommender, updater, admission-controller)
and Goldilocks dashboard to monitor resource recommendations across all
namespaces. Dashboard at goldilocks.viktorbarzin.me behind Authentik.
- Scale to 2 replicas with RollingUpdate (maxUnavailable=0)
- Add topology spread constraint to place pods on different nodes
- Switch from single-threaded to ThreadingMixIn HTTP server so tarpit
slow-drip requests no longer block /auth and /healthz endpoints
- Scale admission controller to 2 replicas with topology spread across nodes
- Rewrite inject-priority-class-from-tier: use namespaceSelector instead of
API call per pod admission (eliminates Kyverno→API server round-trip)
- Rewrite sync-tier-label-from-namespace: same namespaceSelector approach
- Extract governance_tiers local to DRY up tier definitions
When both WOODPECKER_GITHUB and WOODPECKER_FORGEJO are enabled without
an explicit WOODPECKER_GITHUB_URL, the GitHub forge inherits the Forgejo
URL causing all GitHub API calls to hit forgejo.viktorbarzin.me with
GitHub OAuth credentials, resulting in 401 Unauthorized on repo add and
cron jobs. Also adds Forgejo forge variables to Terraform.
Admission controller was restarting every ~5min due to API server timeouts
causing leader election loss. failurePolicy:Fail meant the webhook blocked
all pod creation cluster-wide when Kyverno was unavailable.