When the pull-through proxy (10.0.20.10) is down, containerd now falls
back to the official upstream registries (registry-1.docker.io, ghcr.io)
instead of failing. Also cleans up stale disabled registry mirror dirs
and removes unnecessary containerd restart from the rollout script.
Both services migrated to unified ebooks namespace. Remove:
- Old stack directories and Terraform state
- calibre references from monitoring namespace lists
- calibre/audiobookshelf from operational scripts
- Each stack gets its own Vault Transit key (transit/keys/sops-state-<stack>)
- state-sync passes per-stack Transit URI + age keys on encrypt
- Vault policies scope namespace-owners to their stacks only:
- sops-admin: wildcard access to all transit keys
- sops-user-<name>: access only to owned stack keys
- Anca (plotting-book) can only decrypt plotting-book state
- Admin can decrypt everything (via admin Transit policy or age fallback)
- External group sops-plotting-book maps Authentik group to Vault policy
- Updated CLAUDE.md with state sync documentation
- .sops.yaml: add hc_vault_transit_uri for transit/keys/sops-state
- state-sync: try Vault Transit first, fall back to age key on disk
- Re-encrypted all 101 state files with both Vault Transit + age
- Normal workflow: vault login → decrypt via Transit (no key files)
- Bootstrap/DR: age key at ~/.config/sops/age/keys.txt
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
- Replace subPath ConfigMap mount with init container that copies openclaw.json
to writable NFS home (OpenClaw writes back to the file at runtime)
- Remove invalid memory-api plugin references causing "Config invalid"
- Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536
- Fix tg wrapper to inject -auto-approve when apply --non-interactive is used
Part of SOPS multi-user secrets migration.
- .sops.yaml: defines age recipients (Viktor + CI)
- scripts/tg: wrapper that decrypts secrets before running terragrunt
- .gitignore: excludes decrypted secrets.auto.tfvars.json
No functional change — terraform.tfvars still works as before.
Check real CPU/memory usage via kubectl top nodes instead of
limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL.
Limits overcommit is expected with 70+ services on 3 worker nodes.
- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
get VPA Off mode (recommend only, no evictions) to prevent downtime
on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
kubectl exec, since MetalLB virtual IPs aren't reachable from outside
the L2 network
- Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled,
not mounted by kured DaemonSet)
Adds check #14 that queries Uptime Kuma API for application-level
monitor status, complementing the kubectl-level checks with HTTP/ping
health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.