Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.
Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
viktorbarzin/trading-bot-service:latest, command
python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
(poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices
vault: re-enable pg-trading static role
- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
(was disabled 2026-04-06 with the rest of trading-bot stack)
kyverno: add postgres* to trusted-registries allowlist
- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
alpine*, nginx*, python* which were already there)
Final workers Pod containers (in order):
[0] signal-generator
[1] learning-engine
[2] market-data
[3] meet-kevin-watcher (NEW)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both static-roles existed in Vault state (created out-of-band) but
were missing from the postgresql connection's allowed_roles list. Vault
was logging 'is not an allowed role' rotation errors every 10s for both,
sustained CPU waste ~40-70m.
Adopted both via 'import {}' (import blocks removed after first apply
per the canonical adoption pattern).
- pg-matrix: username=matrix, rotation_period=86400 (1d)
- pg-technitium: username=technitium, rotation_period=604800 (7d)
Verified: 'is not an allowed role' errors stopped in vault-0 logs
immediately after apply.
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.
- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
manage their own Authentik objects
## Vault audit-tail sidecar (APPLIED + VERIFIED)
- Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with
`tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume
from the chart's auditStorage), emits JSON audit events to stdout. kubelet
captures the stdout; once Loki+Alloy are deployed (blocked on code-146x),
these logs flow automatically to Loki with `container="audit-tail"`.
- Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly.
- Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled
cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s).
- Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON
audit lines from ESO token issuance, KV reads, etc.
## Doc reality-check
While verifying logs reached Loki, discovered Loki is NOT actually deployed.
`stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but
has a self-referencing `depends_on = [helm_release.loki]` that prevented apply.
No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The
monitoring.md "Loki: deployed" claim was aspirational.
- security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on
code-146x)
- security.md W1.3 row: gated on code-146x added
- monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x
## New beads task
- code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug,
investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0).
## Wave 1 status update
- W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x
- W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR)
- W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED)
- Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config.
Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every
vault audit log entry shows Traefik's pod IP instead of the real client IP —
the V7 alert rule (Viktor identity from non-allowlist source IP) needs the
real client IP to be meaningful.
- Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing
for_each unknown issues unrelated to this change; -target documented in error
message itself as the workaround).
- Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete
update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed
within ~10s each. Verified XFF config visible in pod's
/vault/config/extraconfig-from-values.hcl.
- The `vault_audit "file"` resource was already in TF at line 287 (writing to
/vault/audit/vault-audit.log) — no change needed.
## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED)
- Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces
from memory id=1970 + `frigate, kured, default, changedetection` discovered
during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN
pods that would be blocked by Enforce).
- Flipped 3 security policies Audit → Enforce: deny-privileged-containers,
deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at
chart level.
- `require-trusted-registries` STAYS in Audit mode pending allowlist tightening
(current pattern includes `*/*` which matches anything-with-a-slash, so Enforce
would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5.
**Apply blocker**: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0`
crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use
tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash
reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not
caused by these changes. Resolving it requires either upgrading the provider or
finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`.
## Wave 1 status after this commit
- W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place)
- W1.4 + W1.5: code ready, apply blocked on provider crash
- W1.1, W1.3, W1.6, W1.7: not started in this session
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per user decision today: monitoring, mailserver, vault, descheduler,
metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy,
reloader, headscale, wireguard, xray, cloudflared now participate in
the same `force + match-tag` regime as the rest of the cluster — Keel
watches the deployment's CURRENT tag for digest changes only and rolls
on push, never rewriting tag strings.
Two-part change:
stacks/kyverno/modules/kyverno/keel-annotations.tf
Trim the policy-level namespace exclude list from 31 → 16. The 16
remaining exclusions are the irreducible cluster-operator + state-
coupled set: keel itself, calico-system + tigera-operator (operator
loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system +
dbaas (state-coupled), kyverno, metallb-system, external-secrets,
proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned),
kube-system, vpa, sealed-secrets, infra-maintenance.
stacks/<each-of-15>/.../main.tf
Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace`
resource so the Kyverno mutate policy can target the workloads via
its namespaceSelector matchLabels.
Note on the apply path: the live ClusterPolicy was patched via
`kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics
during state refresh on Kyverno ClusterPolicy schemas with deeply
nested optional `context.celPreconditions` / `imageRegistry` fields
(see crash dump). The TF source above has the desired state, so any
clean future apply on a fixed provider version will be a no-op against
the live cluster.
Floating-tag workloads in the newly-enrolled set (will roll on every
upstream digest update — acceptable risk per user):
- wireguard: sclevine/wg:latest (image fixed today via iptables-nft
postStart shim)
- xray: teddysun/xray
- crowdsec-web: viktorbarzin/crowdsec_web
- monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter
- traefik: nginx:1-alpine, openresty/openresty:alpine,
ghcr.io/tarampampam/error-pages:3
- redis: haproxy:3.1-alpine, redis:8-alpine
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1
skipped — has non-standard lifecycle from earlier work).
* status-page: enrolled (was missing from original sweep).
* v6 retrigger marker on 17 stacks that never reached terragrunt
apply (#704 exit-1 halted mid-loop).
After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks.
The remaining ~22 are operator/Helm-managed and intentionally excluded
(same fight-loop risk as Calico — bump via Helm chart version, not
Keel).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.
Idempotent — re-applying a stack whose state already matches is a no-op.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- stacks/vault/main.tf: register pg-recruiter-responder static role on
the postgresql connection (7d password rotation). Adds the role to
allowed_roles and creates vault_database_secret_backend_static_role
for `recruiter_responder` user.
- stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap
TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID.
Updated header doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Background: 2026-05-10 someone added `server.auditStorage.annotations`
to vault/main.tf attempting to enable pvc-autoresizer on audit-vault-N
PVCs. The vault helm chart maps that block into the StatefulSet's
volumeClaimTemplates, which is immutable post-creation on existing
StatefulSets. Result: 4 consecutive helm upgrade attempts (rev 16-19)
all rejected with "StatefulSet spec: Forbidden", leaving the release
stuck in failed state since 22:47 UTC that day. Live PVCs were
hand-annotated via `kubectl annotate` as a workaround, but the IaC
declared a path that couldn't be applied — every subsequent tg apply
on the vault stack would re-fail.
Fix:
* Remove `annotations` block from `server.auditStorage` values
(with a comment recording why it can't live there).
* Add `kubernetes_annotations` resources for audit-vault-{0,1,2}
with `force = true`, so Terraform adopts the existing annotations
and tracks the desired-state in IaC going forward. The autoresizer
cares about PVC annotations, not StatefulSet template annotations,
so this is functionally equivalent.
Done out-of-band before commit (helm state was already corrupted):
`helm rollback vault 15 -n vault` → revision 20 deployed (clean).
Verified: helm status vault = deployed; audit-vault-0 still has
threshold=10% storage_limit=10Gi annotations; cluster healthcheck
no longer reports vault/vault=failed.
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.
Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
- navidrome (Subsonic user/password)
- ntfy (deny-all default + user.db tokens)
- nextcloud (WebDAV/CalDAV/CardDAV app passwords)
- vaultwarden (Bitwarden-compatible token auth)
- headscale (OIDC + preauth keys for Tailscale nodes)
- paperless-ngx (app-layer login + API tokens)
Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).
real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.
No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
audit-vault-0 fills steadily with raft audit logs; without autoresizer
annotations it hits the 2Gi ceiling and Vault stalls on writes
(PVAutoExpanding alert was firing at 81% used). The Vault Helm chart
copies server.auditStorage.annotations onto the PVC at create time.
Live PVC already has the annotations applied via kubectl annotate;
this just keeps TF in sync.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The instagram_poster.benchmark CLI was writing scores to a sqlite file
on the pod's data PVC. Moving it to the shared CNPG cluster so the
benchmark scoring path is stateless on the pod, scores survive pod
recreation, and the rotation/backup pipeline applies automatically.
- dbaas: null_resource.pg_instagram_poster_db creates role + DB
(idempotent CREATE IF NOT EXISTS, password placeholder) — same
shape as pg_postiz_dbs / pg_wealthfolio_sync_db.
- vault: vault_database_secret_backend_static_role.pg_instagram_poster
+ add to allowed_roles. 7d rotation_period.
- instagram-poster: second ExternalSecret (vault-database store) →
K8s Secret instagram-poster-benchmark-db with BENCHMARK_PG_HOST/
PORT/USER/PASSWORD/DATABASE. env_from on the deployment.
reloader.stakater.com/match=true bounces the pod on rotation.
Code-side: instagram_poster/benchmark.py now resolves the DB URL from
BENCHMARK_DB_URL or BENCHMARK_PG_* env vars; falls back to sqlite for
local DevVM scratch runs. Schema bootstraps via Base.metadata.create_all,
no alembic step needed for the benchmark-only side-DB.
Verified end-to-end via DevVM port-forward: ESO synced, K8s Secret has
all 5 fields, pod env shows BENCHMARK_PG_*, smoke-test scoring 3 photos
landed in the new PG table with subject_category populated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 7 of the vision-LLM benchmark plan. Adds:
- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
per-model analysis, top-N agreement, cost vs cloud APIs, sample
captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
of the flash-attention flag; without the value llama-server exits
before serving any request).
- llama-cpp.md architecture doc links the report so future operators
land on the deployed-and-evaluated model from one entry point.
300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New stacks/fire-planner/ mirrors payslip-ingest layout:
- ExternalSecret pulling RECOMPUTE_BEARER_TOKEN from Vault secret/fire-planner
- DB ExternalSecret templating DB_CONNECTION_STRING via static role pg-fire-planner
- FastAPI Deployment (serve), CronJob (recompute-all monthly on 2nd at 09:00 UTC,
scheduled after wealthfolio-sync's 1st at 08:00), ClusterIP Service
- Grafana datasource ConfigMap "FirePlanner" — `database` inside jsonData
(cc56ba29 fix; otherwise Grafana 11.2+ hits "you do not have default database")
Plus:
- vault/main.tf: pg-fire-planner static role (7d rotation), allowed_roles
- dbaas/modules/dbaas/main.tf: null_resource creates fire_planner DB+role
- monitoring/dashboards/fire-planner.json: 9-panel Finance-folder dashboard
(NW timeseries, MC fan chart, success heatmap, lifetime tax bars,
years-to-ruin table, optimal leave-UK stat, ending wealth stat,
UK success-by-strategy bars, sequence-risk correlation table)
- monitoring/modules/monitoring/grafana.tf: register "fire-planner.json" in Finance folder
Apply order:
1. vault stack — creates the static role
2. dbaas stack — creates the database & role
3. external-secrets stack picks up vault-database refs (no change needed)
4. fire-planner stack — first apply with -target=kubernetes_manifest.db_external_secret
before full apply, per the plan-time-data-source pattern
5. monitoring stack — picks up the new dashboard ConfigMap
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.
Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.
Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.
Closes: code-gy7h
Mirrors Wealthfolio's daily_account_valuation / accounts / activities
from SQLite into a new PG database (wealthfolio_sync) every hour, so
Grafana can chart net worth, contributions, and growth over time.
Components:
- dbaas: null_resource creates wealthfolio_sync DB + role on the CNPG
cluster (dynamic primary lookup so it survives failover).
- vault: pg-wealthfolio-sync static role rotates the password every 7d.
- wealthfolio: ExternalSecret pulls the rotated password into the WF
namespace; new pg-sync sidecar (alpine + sqlite + postgresql-client +
busybox crond) does sqlite3 .backup → TSV dump → truncate-and-reload
psql, hourly at :07. Plus a grafana-wealth-datasource ConfigMap in
the monitoring namespace (uid: wealth-pg).
- monitoring: new Wealth dashboard (wealth.json, 10 panels) — current
net worth / contribution / growth / ROI% stats, then time-series
for net worth, contribution-vs-market, growth area, per-account
stacked area, cash-vs-invested, and a 100-row activity log.
Initial sync: 6 accounts, 10,798 daily valuations, 518 activities.
Verified PG totals match SQLite latest snapshot exactly.
Phase 2 of the NFS-hostile migration: data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between).
vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part
is what makes this safe (raft quorum maintained by 2 healthy pods
while one is replaced).
Also restores chart-default pod securityContext fields. The previous
`statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}`
block REPLACED (not merged) the chart's defaults — fsGroup,
runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup; ext4 LV
volume root is root:root and the vault user (UID 100) couldn't open
vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly,
survives future chart bumps. vault-1 and vault-2 retained their
correct securityContext from when their pod specs were written to
etcd, before the partial customization landed — the bug only surfaces
when a pod is recreated.
Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).
Refs: code-gy7h
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that
never exited because the default fsGroupChangePolicy (Always) walks every
file on the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and
a 1GB audit log, the recursive chown outlasted the deadline and restarted
forever — blocking raft quorum recovery. OnRootMismatch makes chown a
no-op when the volume root is already correct, which it always is after
initial setup.
The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.
The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
New service stack at stacks/job-hunter/ mirroring the payslip-ingest
pattern: per-service CNPG database + role (via dbaas null_resource),
Vault static role pg-job-hunter (7d rotation), ExternalSecrets for app
secrets and DB creds, Deployment with alembic-migrate init container,
ClusterIP Service, Grafana datasource ConfigMap.
Grafana dashboard job-hunter.json in Finance folder: new roles per
day, source breakdown, top companies, GBP salary distribution, recent
roles table (sorted by parse confidence then salary).
n8n weekly-digest workflow calls POST /digest/generate with bearer
auth every Monday 07:00 London; digest_runs table provides
idempotency.
Refs: code-snp
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
For weeks, every push to infra has resulted in `build-cli` workflow
failure AND `default` workflow succeed — but the `default` workflow's
"success" was a lie. Inside the apply-loop we were swallowing per-stack
failures with `set +e ... echo FAILED` and the step exited 0 regardless.
Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
[servarr] Starting apply...
ERROR: Cannot read PG credentials from Vault.
Run: vault login -method=oidc
[servarr] FAILED (exit 1)
Two root causes, two fixes here.
### 1. Vault `ci` role lacks Tier-1 PG backend creds
The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.
**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.
### 2. Apply-loop swallows stack failures
`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.
**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.
Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.
## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
TF file is already at 5.1.4 in git; once CI picks up this commit
it'll apply on its own, or Viktor can run `tg apply` locally now
that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
per-stack continuation so a single bad stack doesn't hide the
others' plans from the log. Just making the final status honest.
## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
# vault_kubernetes_auth_backend_role.ci will be updated in-place
~ token_policies = [
+ "terraform-state",
# (1 unchanged element hidden)
]
# vault_jwt_auth_backend.oidc will be updated in-place
~ tune = [...] # cosmetic provider-schema drift, pre-existing
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.
### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
-d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
curl -s -H "X-Vault-Token: $TOK" \
http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}
# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```
Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.
Refs: bd code-e1x
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.
Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.
## What
### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
`static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
`payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
design), `wait-for postgresql.dbaas:5432` annotation, init container runs
`alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
uid `payslips-pg`) reading password from the db-creds K8s Secret.
### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).
### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
(username `payslip_ingest`, 7d rotation).
### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
`pg-cluster-1`.
### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
JSON object matching the schema to stdout. No network, no file writes outside /tmp,
no markdown fences.
## Trade-offs / decisions
- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
initially described. The Alembic migration still creates a `payslip_ingest`
schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
`stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
auto-creator).
## Test Plan
### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.
terraform fmt -check -recursive
(exit 0)
python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```
### Manual Verification (post-merge)
Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.
Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
(first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200.
End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.
Closes: code-do7
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightweight IPAM with auto-discovery scanning every 15min via fping.
Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster
with Vault-rotated credentials. Cloudflare DNS + Authentik auth.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.
- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
Stagger token periods across roles (7d/8d/9d/10d) to prevent
bulk lease revocation storms that caused transient 504s.
Periodic tokens auto-renew indefinitely, eliminating mass expiry.
/proc/self/io inside $(awk ...) resolves to the awk subprocess PID,
not the parent bash shell. Use $$ (bash PID) to read the correct
process IO counters.
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
- Add docker.io/library/ prefix to mysql and postgres backup images
to satisfy Kyverno require-trusted-registries policy (both CronJobs
were blocked for 46h, triggering MySQLBackupStale alert)
- Document mysql-operator chart ignoring resources values key — the
LimitRange default (256Mi) was silently applied, putting the operator
at 97% memory. Patched live to 512Mi via kubectl.
- Increase vault-raft-backup backoff_limit to 6 for transient failures
(also fixed NFS export: vault-backup was a separate ZFS dataset not
in the TrueNAS NFS share — destroyed dataset, created directory)
- Add Transit mount + per-stack Transit keys to vault stack TF
- Auto-create sops-user-<name> policy scoping decrypt to owned stacks
- Auto-create sops-<name> external group + alias for Authentik mapping
- Add sops-admin policy to authentik-admins group
- Attach sops-user policy to namespace-owner identity entities
- Update add-user skill with SOPS onboarding steps and Authentik group
- Adding a user to k8s_users + applying vault stack = full SOPS access
[ci skip]
Data-driven user onboarding: add a JSON entry to Vault KV k8s_users,
apply vault + platform + woodpecker stacks, and everything is auto-generated.
Vault stack: namespace creation, per-user Vault policies with secret isolation
via identity entities/aliases, K8s deployer roles, CI policy update.
Platform stack: domains field in k8s_users type, TLS secrets per user namespace,
user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal.
Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true.
K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner
dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt,
contributing page with CI pipeline template, versioned image tags in CI pipeline.
New: stacks/_template/ with copyable stack template for namespace-owners.