Split monolithic orchestrator into triage (haiku), historian (sonnet),
and report-writer (opus) stages. Each stage gets its own tool budget.
Added sev-context.sh for structured cluster context gathering.
- Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries
- Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh,
ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient
- Add fallback field to all Slack receivers for clean push notifications
- Multiply ratio exprs by 100 for readable percentages
- Rename "New Tailscale client" to CamelCase "NewTailscaleClient"
- Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive
The woodpecker server was crashing repeatedly with database authentication failures
because Vault rotates the database password every 24 hours, but the Helm release
had hardcoded the password into WOODPECKER_DATABASE_DATASOURCE at plan time.
Changes:
- Updated ExternalSecret to provide the full DATABASE_DATASOURCE URI dynamically
- Modified Helm values to use envFrom to inject the secret instead of hardcoding
- ExternalSecret refreshes every 15 minutes, automatically picking up rotated passwords
- Pod will auto-restart when secret changes (via reloader.stakater.com annotation)
- This eliminates the plan-time password snapshot that goes stale within 24h
The pod still has an unrelated image pull issue on k8s-node4 (containerd blob
corruption), but the database credentials mechanism is now correctly implemented.
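For illustration, a minimal sketch of composing the full datasource URI from the current credentials on each refresh. The postgres:// shape, the host, and the database name are assumptions; the real ExternalSecret template may differ.

```python
from urllib.parse import quote


def build_datasource(user: str, password: str, host: str, dbname: str, port: int = 5432) -> str:
    """Compose a postgres:// URI from the current credentials.

    The password is percent-encoded so that characters such as '@' or '/'
    in a freshly rotated password cannot break the URI structure.
    """
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(dbname, safe='')}"
    )


if __name__ == "__main__":
    # Hypothetical values, not taken from the real cluster.
    print(build_datasource("woodpecker", "p@ss/word", "db.example.internal", "woodpecker"))
```

Rebuilding the whole URI on every refresh, rather than templating the password once, is what keeps the value from going stale after a rotation.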
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. Registry then served 200 OK with
a correct Content-Length but a 0-byte body. nginx cached these broken responses.
Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
blobs on node3/node4
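The failure mode above (200 OK, correct Content-Length, empty body) can be detected with a length check. This is a sketch only; the function name is hypothetical and it is not part of the registry or nginx config.

```python
from __future__ import annotations


def body_matches_content_length(headers: dict[str, str], body: bytes) -> bool:
    """Return True when the body length equals the declared Content-Length.

    Responses with a missing or unparseable Content-Length cannot be
    verified this way and are treated as passing.
    """
    declared = None
    for name, value in headers.items():
        if name.lower() == "content-length":  # header names are case-insensitive
            declared = value
            break
    if declared is None:
        return True
    try:
        expected = int(declared)
    except ValueError:
        return True
    return expected == len(body)
```

A probe that fetches a known blob and runs this check would flag the "metadata kept, data deleted" state before clients hit it.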
The legacy `postgresql.dbaas` service had no endpoints after CNPG migration,
causing Woodpecker and other stacks to fail DB connections. Changed to
`pg-cluster-rw.dbaas` which points to the CNPG primary.
LimitRange defaults containers to 192Mi, which is insufficient for
terragrunt apply on the platform stack (48 vault refs, many modules).
Set explicit 1Gi request / 2Gi limit via backend_options.
- build-cli.yml: comment out cache_from/cache_to to avoid BuildKit
"short read" errors from corrupted registry cache
- default.yml: add git pull --rebase before push in cleanup-and-push
to handle remote having newer commits
Interactive skill that collects user info, updates Vault KV k8s_users,
and applies vault/platform/woodpecker stacks. Includes verification
checklist and auto-generated resource table.
Admin section: how to add a new namespace-owner (Authentik group,
Vault KV entry, three terragrunt applies). Includes auto-generated
resource table.
User section: VPN setup, tool install, Vault/kubectl auth, first app
deployment from template, CI/CD pipeline example, useful commands,
and important rules.
Data-driven user onboarding: add a JSON entry to Vault KV k8s_users,
apply vault + platform + woodpecker stacks, and everything is auto-generated.
Vault stack: namespace creation, per-user Vault policies with secret isolation
via identity entities/aliases, K8s deployer roles, CI policy update.
Platform stack: domains field in k8s_users type, TLS secrets per user namespace,
user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal.
Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true.
K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner
dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt,
contributing page with CI pipeline template, versioned image tags in CI pipeline.
New: stacks/_template/ with copyable stack template for namespace-owners.
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
Vault KV stores notification_settings as nested JSON ({"slack":{"webhook_url":""}}).
TF code was passing the map object directly as a string env var value.
Fix: access ["slack"]["webhook_url"] with try() fallback.
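The same nested lookup with a fallback, sketched in Python for clarity. The Terraform fix uses try(); this function is only an illustration of the behavior, and its name is hypothetical.

```python
import json


def slack_webhook_url(raw: str, default: str = "") -> str:
    """Extract ["slack"]["webhook_url"] from the nested JSON, or return default.

    Mirrors a try()-style fallback: any missing key, wrong type, or
    invalid JSON yields the default instead of an error.
    """
    try:
        value = json.loads(raw)["slack"]["webhook_url"]
    except (ValueError, KeyError, TypeError):
        return default
    return value if isinstance(value, str) else default
```

The key point is that the env var receives a plain string in every case, never the serialized map.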
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
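A rough Python equivalent of what each injected init container does. The annotation value format (comma-separated host:port pairs) is an assumption, and the real containers use nc -z rather than Python.

```python
from __future__ import annotations

import socket
import time


def parse_wait_for(annotation: str) -> list[tuple[str, int]]:
    """Split an assumed 'host:port,host:port' annotation value into targets."""
    targets = []
    for item in annotation.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {item!r}")
        targets.append((host, int(port)))
    return targets


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """TCP connect check, comparable to `nc -z host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_all(annotation: str, interval: float = 2.0) -> None:
    """Block until every dependency accepts a TCP connection."""
    for host, port in parse_wait_for(annotation):
        while not is_reachable(host, port):
            time.sleep(interval)
```

Note that a successful TCP connect only shows the port is open, not that the dependency is ready to serve queries.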
The API server doesn't re-resolve priority from priorityClassName after
webhook mutation. Changed from remove+add to replace with explicit
priority=1200000 and preemptionPolicy=PreemptLowerPriority.
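A sketch of the resulting patch as JSON Patch (RFC 6902) operations. The /spec/... paths are assumptions about where the policy patches the pod; only the two values come from the change above.

```python
from __future__ import annotations

import json


def priority_patch(priority: int = 1200000,
                   preemption_policy: str = "PreemptLowerPriority") -> list[dict]:
    """Build 'replace' operations that set the resolved fields explicitly.

    'replace' requires the target path to already exist, so this assumes
    both fields are populated before the webhook runs.
    """
    return [
        {"op": "replace", "path": "/spec/priority", "value": priority},
        {"op": "replace", "path": "/spec/preemptionPolicy", "value": preemption_policy},
    ]


if __name__ == "__main__":
    print(json.dumps(priority_patch(), indent=2))
```

Setting the numeric priority directly avoids relying on a second resolution step that, per the note above, does not happen after mutation.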
Syncs secrets from Vault KV at secret/ci/global to Woodpecker
global secrets via REST API every 6 hours. Authenticates via K8s
SA JWT (woodpecker-sync role). New repos just add secrets to
Vault and use from_secret: in pipeline files.
Also removes k8s-dashboard static admin token — use
vault write kubernetes/creds/dashboard-admin instead.
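One way the sync step could decide what to send, sketched as a pure function. Whether the real job also deletes secrets that were removed from Vault is not stated above, so this sketch never deletes; all names are hypothetical.

```python
from __future__ import annotations


def plan_sync(vault_secrets: dict[str, str], existing_names: set[str]) -> dict[str, list[str]]:
    """Split Vault secret names into ones to create and ones to update.

    Secrets that exist only on the Woodpecker side are left untouched
    (an assumption; the real job may behave differently).
    """
    create = sorted(name for name in vault_secrets if name not in existing_names)
    update = sorted(name for name in vault_secrets if name in existing_names)
    return {"create": create, "update": update}
```

Keeping the plan separate from the HTTP calls makes a dry run trivial: print the plan and skip the requests.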
- Upgrade model from sonnet to opus for subagent orchestration
- Add Write, Edit, Agent tools for spawning monitor subagents
- Add mandatory deployment workflow: pre-deploy snapshot, apply,
spawn background haiku pod monitor, react to results
- Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck
Pending, and probe failures within 3 min timeout
- Allow terragrunt apply and kubectl set image as safe operations
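A sketch of how the monitor's checks map onto pod status fields, taking the JSON from `kubectl get pod -o json` as input. Using 180 seconds as the stuck-Pending threshold is an assumption, and probe failures are omitted because they surface in events rather than in pod status.

```python
from __future__ import annotations

BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def pod_problems(pod: dict, age_seconds: float = 0.0, pending_after: float = 180.0) -> list[str]:
    """Return human-readable problems found in a single pod's status."""
    problems = []
    status = pod.get("status") or {}
    for cs in status.get("containerStatuses") or []:
        name = cs.get("name", "?")
        # Current state: the container is waiting with a known-bad reason.
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in BAD_WAITING_REASONS:
            problems.append(f"{name}: {waiting['reason']}")
        # Previous state: the last termination was an OOM kill.
        terminated = (cs.get("lastState") or {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            problems.append(f"{name}: OOMKilled")
    if status.get("phase") == "Pending" and age_seconds >= pending_after:
        problems.append("pod: stuck Pending")
    return problems
```

Comparing the output against the pre-deploy snapshot separates regressions from problems that already existed.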
- Use $$ to escape shell variables in Terraform strings (${...} is Terraform
interpolation, so shell ${VAR} is written $${VAR})
- Fix image: docker.io/alpine/git (not library/alpine/git)
- Inline command instead of heredoc to avoid Terraform interpolation issues
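The escaping rule can be sketched as a small helper for embedding a shell snippet in a Terraform string. The helper name is hypothetical, and it assumes the input has not been escaped already.

```python
def escape_for_terraform(script: str) -> str:
    """Escape Terraform template sequences in a shell snippet.

    Terraform treats ${...} as interpolation and %{...} as a template
    directive; doubling the leading character yields the literal text.
    A bare $VAR without braces needs no escaping.
    """
    return script.replace("${", "$${").replace("%{", "%%{")


if __name__ == "__main__":
    print(escape_for_terraform('git clone "${REPO_URL}" && echo $HOME'))
```

The %{ case matters for commands such as curl -w "%{http_code}", which Terraform would otherwise try to parse as a directive.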
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
- Replace deprecated cal.name with cal_name() helper using get_display_name()
- URL-decode calendar names (Formula+1 → Formula 1)
- Use cal.search(event=True) instead of deprecated date_search()
- Default to showing all calendars instead of filtering to Personal
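The name decoding can be illustrated with the standard library. This is a sketch, not the actual cal_name() helper; decode_calendar_name is a hypothetical name.

```python
from urllib.parse import unquote_plus


def decode_calendar_name(raw_name: str) -> str:
    """Decode a URL-encoded calendar name, treating '+' as a space."""
    return unquote_plus(raw_name)


if __name__ == "__main__":
    print(decode_calendar_name("Formula+1"))
```

unquote_plus is used rather than unquote because plain unquote leaves '+' untouched.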
- Add init container "install-dotfiles" that clones the dotfiles repo
and installs skills/agents/hooks to OpenClaw's home directory
- Remove nfs_cc_config module and its volume mount
- Skills/agents now come from the same chezmoi-managed dotfiles repo
that manages the Mac config, eliminating the dual-sync problem
- Upgrade from 1.35.2 to 1.35.4 (fixes API key login userDecryptionOptions bug)
- Switch deployment strategy from RollingUpdate to Recreate (iSCSI PVC can't multi-attach)
The + in +00:00 timezone offsets was being URL-decoded to a space,
causing ValueError on the /api/memories/sync endpoint. Build :17
includes the fix. Use a versioned tag instead of :latest so the
pull-through cache cannot serve a stale image.
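A minimal reproduction of the decoding problem and one possible client-side remedy (percent-encoding the timestamp). The `since` parameter name is hypothetical, and the actual fix in build :17 may work differently.

```python
from datetime import datetime
from urllib.parse import parse_qs, quote


def parse_since(query: str) -> datetime:
    """Parse a timestamp from a query string the way a form decoder would.

    parse_qs turns '+' into a space, so an unencoded '+00:00' offset
    arrives as ' 00:00' and fromisoformat raises ValueError.
    """
    value = parse_qs(query)["since"][0]
    return datetime.fromisoformat(value)


def encode_since(ts: datetime) -> str:
    """Percent-encode the timestamp so '+' is sent as %2B and survives decoding."""
    return "since=" + quote(ts.isoformat(), safe="")
```

With encode_since the offset round-trips intact, whereas passing the raw ISO string reproduces the ValueError described above.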