Data-driven user onboarding: add a JSON entry to Vault KV k8s_users,
apply vault + platform + woodpecker stacks, and everything else is auto-generated.
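A hypothetical entry, to illustrate the shape (field names here are illustrative, not the actual schema; only the domains field is documented below):

```json
{
  "jdoe": {
    "namespace": "jdoe",
    "domains": ["app.jdoe.example.com"],
    "woodpecker_admin": true
  }
}
```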
Vault stack: namespace creation, per-user Vault policies with secret isolation
via identity entities/aliases, K8s deployer roles, CI policy update.
Platform stack: domains field in k8s_users type, TLS secrets per user namespace,
user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal.
Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true.
K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner
dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt,
contributing page with CI pipeline template, versioned image tags in CI pipeline.
New: stacks/_template/ with copyable stack template for namespace-owners.
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
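A minimal sketch of the new pattern, assuming the manifest is applied through the kubernetes provider (stack name, namespace, and KV key are placeholders):

```hcl
# ESO watches the ExternalSecret and materializes a K8s Secret from Vault KV.
resource "kubernetes_manifest" "secrets" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata   = { name = "freshrss-secrets", namespace = "freshrss" }
    spec = {
      refreshInterval = "1h"
      secretStoreRef  = { kind = "ClusterSecretStore", name = "vault-kv" }
      target          = { name = "freshrss-secrets" }
      dataFrom        = [{ extract = { key = "freshrss" } }]
    }
  }
}

# Plan-time read of the ESO-created Secret: only K8s API access required.
data "kubernetes_secret" "secrets" {
  metadata {
    name      = "freshrss-secrets"
    namespace = "freshrss"
  }
}
```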
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
Vault KV stores notification_settings as nested JSON ({"slack":{"webhook_url":""}}).
TF code was passing the map object directly as a string env var value.
Fix: access ["slack"]["webhook_url"] with try() fallback.
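Sketch of the fix (data-source and key names are placeholders):

```hcl
locals {
  # notification_settings arrives as a JSON string; decode it first
  notification_settings = jsondecode(data.kubernetes_secret.secrets.data["notification_settings"])

  # before: the whole map was interpolated into the env var value;
  # after: extract the scalar, with a fallback when the key is missing
  slack_webhook_url = try(local.notification_settings["slack"]["webhook_url"], "")
}
```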
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).
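The consumer side, sketched for a Terraform-managed Deployment; the comma-separated host:port annotation format is an assumption:

```hcl
resource "kubernetes_deployment" "app" {
  metadata { name = "app" }

  spec {
    selector { match_labels = { app = "app" } }
    template {
      metadata {
        labels = { app = "app" }
        annotations = {
          # one busybox init container is injected per entry; each blocks
          # until `nc -z <host> <port>` succeeds
          "dependency.kyverno.io/wait-for" = "mysql.dbaas:3306,redis.dbaas:6379"
        }
      }
      spec {
        container {
          name  = "app"
          image = "nginx:1.27" # placeholder workload
        }
      }
    }
  }
}
```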
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
The API server doesn't re-resolve a pod's priority from its priorityClassName
after webhook mutation. Changed the patch from remove+add to replace, setting
priority=1200000 and preemptionPolicy=PreemptLowerPriority explicitly.
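Roughly the shape of the fixed patch (JSONPatch against the Pod spec; the class name is a placeholder):

```hcl
locals {
  priority_patch = jsonencode([
    { op = "replace", path = "/spec/priorityClassName", value = "workload-high" }, # placeholder
    # set explicitly, since the API server won't recompute them from the class
    { op = "replace", path = "/spec/priority", value = 1200000 },
    { op = "replace", path = "/spec/preemptionPolicy", value = "PreemptLowerPriority" }
  ])
}
```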
Syncs secrets from Vault KV at secret/ci/global to Woodpecker
global secrets via REST API every 6 hours. Authenticates via K8s
SA JWT (woodpecker-sync role). New repos just add secrets to
Vault and use from_secret: in pipeline files.
Also removes k8s-dashboard static admin token — use
vault write kubernetes/creds/dashboard-admin instead.
- Use $$ for shell variable escaping in Terraform ($ is Terraform interpolation); see the sketch after this list
- Fix image: docker.io/alpine/git (not library/alpine/git)
- Inline command instead of heredoc to avoid Terraform interpolation issues
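A minimal sketch of the escaping rule from the first bullet:

```hcl
locals {
  # Terraform would treat ${HOME} as interpolation and fail;
  # $${HOME} renders a literal ${HOME} for the shell to expand at runtime
  install_cmd = ["sh", "-c", "cp -r /tmp/dotfiles/. $${HOME}/"]
}
```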
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault (sketched below)
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
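The self-read, sketched (KV key names are assumptions):

```hcl
# Vault reads its own KV for the Authentik OIDC client credentials,
# replacing the SOPS-sourced variables.
data "vault_kv_secret_v2" "vault" {
  mount = "secret"
  name  = "vault"
}

locals {
  authentik_client_id     = data.vault_kv_secret_v2.vault.data["authentik_client_id"]     # assumed key
  authentik_client_secret = data.vault_kv_secret_v2.vault.data["authentik_client_secret"] # assumed key
}
```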
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
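The KV-side store, roughly (server URL, auth role, and SA names are assumptions):

```hcl
resource "kubernetes_manifest" "vault_kv" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ClusterSecretStore"
    metadata   = { name = "vault-kv" }
    spec = {
      provider = {
        vault = {
          server  = "http://vault.vault.svc:8200" # assumed in-cluster address
          path    = "secret"
          version = "v2"
          auth = {
            kubernetes = {
              mountPath = "kubernetes"
              role      = "external-secrets" # assumed role on the K8s auth method
              serviceAccountRef = {
                name      = "external-secrets"
                namespace = "external-secrets"
              }
            }
          }
        }
      }
    }
  }
}
```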
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
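One MySQL static role, sketched with the vault provider (connection URL and admin creds are placeholders; 24h = 86400s):

```hcl
variable "mysql_admin_password" {
  type      = string
  sensitive = true
}

resource "vault_mount" "database" {
  path = "database"
  type = "database"
}

resource "vault_database_secret_backend_connection" "mysql" {
  backend       = vault_mount.database.path
  name          = "mysql"
  allowed_roles = ["grafana"]

  mysql {
    connection_url = "{{username}}:{{password}}@tcp(mysql.dbaas:3306)/" # placeholder host
    username       = "vault-admin"
    password       = var.mysql_admin_password
  }
}

# Static role: Vault rotates the existing DB user's password every 24h.
resource "vault_database_secret_backend_static_role" "grafana" {
  backend         = vault_mount.database.path
  name            = "grafana"
  db_name         = vault_database_secret_backend_connection.mysql.name
  username        = "grafana"
  rotation_period = 86400
}
```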
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
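The dashboard-admin role from Part D, sketched (namespace, SA name, and TTL are assumptions):

```hcl
resource "vault_kubernetes_secret_backend" "k8s" {
  path            = "kubernetes"
  kubernetes_host = "https://kubernetes.default.svc"
}

# Backs `vault write kubernetes/creds/dashboard-admin`: mints a short-lived
# token for an existing service account instead of a static admin token.
resource "vault_kubernetes_secret_backend_role" "dashboard_admin" {
  backend                       = vault_kubernetes_secret_backend.k8s.path
  name                          = "dashboard-admin"
  allowed_kubernetes_namespaces = ["kubernetes-dashboard"] # assumed namespace
  service_account_name          = "dashboard-admin"        # assumed existing SA
  token_default_ttl             = 3600
}
```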
- Add init container "install-dotfiles" that clones the dotfiles repo
and installs skills/agents/hooks to OpenClaw's home directory (sketched after this list)
- Remove nfs_cc_config module and its volume mount
- Skills/agents now come from the same chezmoi-managed dotfiles repo
that manages the Mac config, eliminating the dual-sync problem
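Sketch of the init container wiring (repo URL, image tags, paths, and PVC name are placeholders):

```hcl
resource "kubernetes_deployment" "openclaw" {
  metadata { name = "openclaw" }

  spec {
    selector { match_labels = { app = "openclaw" } }
    template {
      metadata { labels = { app = "openclaw" } }
      spec {
        init_container {
          name  = "install-dotfiles"
          image = "docker.io/alpine/git:latest" # pin a real tag in practice
          command = [
            "sh", "-c",
            "git clone --depth 1 https://git.example.com/dotfiles /tmp/df && cp -r /tmp/df/skills /tmp/df/agents /tmp/df/hooks /home/openclaw/"
          ]
          volume_mount {
            name       = "home"
            mount_path = "/home/openclaw"
          }
        }
        container {
          name  = "openclaw"
          image = "openclaw:latest" # placeholder
          volume_mount {
            name       = "home"
            mount_path = "/home/openclaw"
          }
        }
        volume {
          name = "home"
          persistent_volume_claim { claim_name = "openclaw-home" }
        }
      }
    }
  }
}
```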
- Upgrade from 1.35.2 to 1.35.4 (fixes API key login userDecryptionOptions bug)
- Switch deployment strategy from RollingUpdate to Recreate (iSCSI PVC can't multi-attach)
The + in +00:00 timezone offsets was being URL-decoded to a space,
causing ValueError on the /api/memories/sync endpoint. Build :17
includes the fix. Using versioned tag instead of :latest to avoid
pull-through cache serving stale images.
Custom Docker image pre-installs the universal-calibre mod at build time,
eliminating ~10 min apt-get on every container start. Added NO_CHOWN=true
to skip recursive chown that hangs on NFS mounts. Tightened startup probe
since pod now starts in ~2 min instead of 15-20 min.
Compaction of 5 years of TSDB blocks was OOM-killing at 3Gi (18 restarts
in 8h), causing sustained IO pressure on the PVE host's spinning disk.
Increase liveness probe delay to 300s so WAL replay completes before
the probe kills the pod.
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
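Sketched as a bare Pod for brevity (the real backup presumably runs as a CronJob; namespace, labels, image, and paths are placeholders): required pod affinity pins the backup onto vaultwarden's node so both can mount the RWO PVC.

```hcl
resource "kubernetes_pod" "backup" {
  metadata {
    name      = "vaultwarden-backup"
    namespace = "vaultwarden" # assumed namespace
  }

  spec {
    affinity {
      pod_affinity {
        required_during_scheduling_ignored_during_execution {
          label_selector {
            match_labels = { app = "vaultwarden" } # assumed label
          }
          topology_key = "kubernetes.io/hostname"
        }
      }
    }

    container {
      name  = "backup"
      image = "alpine:3.20"
      # `.backup` drives SQLite's Online Backup API, safe against a live DB
      command = ["sh", "-c", "apk add --no-cache sqlite && sqlite3 /data/db.sqlite3 '.backup /data/backups/db.sqlite3'"]

      volume_mount {
        name       = "data"
        mount_path = "/data"
      }
    }

    volume {
      name = "data"
      persistent_volume_claim { claim_name = "vaultwarden-data" } # assumed PVC
    }
  }
}
```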
- Replace subPath ConfigMap mount with init container that copies openclaw.json
to writable NFS home (OpenClaw writes back to the file at runtime)
- Remove invalid memory-api plugin references causing "Config invalid"
- Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536
- Fix tg wrapper to inject -auto-approve when apply --non-interactive is used
SQLite on NFS causes DB corruption due to unreliable POSIX fcntl locking.
iSCSI provides a block device with a local filesystem where locking works
correctly. Same approach used for Redis, MySQL, PostgreSQL, etc.
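The storage side of the move, sketched (class name and size are assumptions):

```hcl
resource "kubernetes_persistent_volume_claim" "data" {
  metadata {
    name      = "app-data" # placeholder
    namespace = "app"
  }

  spec {
    # RWO block device formatted with a local filesystem, so POSIX fcntl
    # locking behaves, unlike NFS
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "iscsi" # assumed class name
    resources {
      requests = { storage = "5Gi" }
    }
  }
}
```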
After the node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates the ~37Gi overcommit gap.
Categories:
- Safe equalization (50 containers): set req=lim where max7d is well within the target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes
Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)
Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
- Replace semicolons with newlines in vault/main.tf variable blocks
(HCL does not support semicolons)
- Remove dependency "vault" from platform/terragrunt.hcl to break
cycle (vault already depends on platform)
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
jsondecode() in locals blocks
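The decode pattern (mount and key names are hypothetical):

```hcl
data "vault_kv_secret_v2" "stack" {
  mount = "secret"
  name  = "freshrss" # i.e. secret/<stack-name>
}

locals {
  # complex values are stored as JSON strings; decode on read
  allowed_users = jsondecode(data.vault_kv_secret_v2.stack.data["allowed_users"]) # hypothetical key
}
```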
Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.
Apply order: vault stack first (populates KV), then all others.