Grafana 11's Postgres plugin shows 'you do not have default database'
on any panel whose target is missing rawQuery:true / editorMode:"code".
The query builder can't reason about a custom schema.table path and
blanks the panel.
The installed Postgres plugin is 'grafana-postgresql-datasource' (the newer
one). Dashboard panels referenced legacy 'postgres' type, which caused Grafana
to fall back to 'default database' and error out when rendering.
Ran sed over the JSON; all 8 panel+target type refs now match the installed
plugin name. UID (payslips-pg) was already correct.
Grafana can't auto-create the reserved 'General' folder ('A folder with
that name already exists'), which aborts the sidecar provisioner's walk
and drops every dashboard in that folder. Move uk-payslip to Finance so
it loads.
## Context
Wave 5b of the state-drift consolidation plan. Calico has run this cluster's
pod networking since 2024-07-30, installed via raw kubectl manifests —
tigera-operator Deployment + ~20 CRDs + an Installation CR. The plan
flagged Calico as HIGH BLAST because the operator + Installation CR sit on
the critical path for pod scheduling; any mistake during adoption can
break CNI and block new pods cluster-wide within seconds.
This session takes the safe sub-step: adopt only the three namespaces.
Namespaces are label containers — TF managing their names + PSA labels
cannot disrupt Calico networking. Getting the operator, Installation CR,
and CRDs under TF requires dedicated prep (picking the right
`ignore_changes` fields to absorb operator-generated defaults in the
Installation CR, decoupling from the embedded PSA labels applied at
admission, and a low-traffic window). Deferred to `code-3ad`.
## This change
New Tier 1 stack `stacks/calico/` adopting via import `{}` blocks
(Wave 8 convention, commit 8a99be11):
- `kubernetes_namespace.calico_system` ← id `calico-system`
- `kubernetes_namespace.calico_apiserver` ← id `calico-apiserver`
- `kubernetes_namespace.tigera_operator` ← id `tigera-operator`
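One of the (since-removed) stanzas, sketched per the Wave 8 form:
```hcl
# Sketch of an import {} stanza used for the adoption; deleted again after
# the apply converged, per the Wave 8 convention.
import {
  to = kubernetes_namespace.tigera_operator
  id = "tigera-operator"
}
```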
Apply: `3 imported, 0 added, 0 changed, 0 destroyed.` Followed by a
second `tg plan` that returns `No changes`. Zero cluster impact —
namespaces stayed exactly as they were cluster-side.
### terragrunt dependency choice
Deliberately no `dependency "platform"` clause — Calico is lower in the
stack than platform, so introducing a `platform → calico` or
`calico → platform` edge would invite cycle-like pain on first
bootstrap. The plan on this stack is always safe to run standalone.
### `ignore_changes` scope on each namespace
- `goldilocks.fairwinds.com/vpa-update-mode` — Kyverno ClusterPolicy
stamp (Wave 3B sweep, commit 8b43692a).
- `pod-security.kubernetes.io/enforce` + `-version` — tigera-operator
stamps these on `calico-system` + `calico-apiserver` to opt them out
of PSA. These labels aren't surfaced by the kubernetes provider as
part of the import (they arrive through a different field manager),
  so they're left unmanaged to keep the plan clean. The `tigera-operator`
  namespace doesn't get the PSA labels, so they aren't ignored there.
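Roughly how that scope lands on the two PSA-labelled namespaces — a sketch, not the literal file:
```hcl
resource "kubernetes_namespace" "calico_system" {
  metadata {
    name = "calico-system"
  }

  lifecycle {
    ignore_changes = [
      # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label
      metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
      # tigera-operator stamps the PSA opt-out labels via its own field manager
      metadata[0].labels["pod-security.kubernetes.io/enforce"],
      metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
    ]
  }
}
```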
## What is NOT in this change
- The three live workloads: `tigera-operator` Deployment in
`tigera-operator` ns, `calico-kube-controllers`/`calico-node`/
`calico-typha` workloads in `calico-system`, the `calico-apiserver`
in `calico-apiserver`. These are all reconciled by the tigera-operator
from the Installation CR — importing them into TF is redundant with
importing the CR itself.
- The `Installation` CR (`default`, apiVersion
`operator.tigera.io/v1`) — the user-authored minimal spec has since
been filled to 104 lines of operator-generated defaults. Adopting it
requires a well-scoped `ignore_changes` list on the `manifest` field.
Separate follow-up `code-3ad`.
- `.sops.yaml` / `tier0_stacks` updates — the original plan suggested
Tier 0 (local SOPS state) for the full Calico stack on the theory
that "network underpins all". With only three namespaces in the stack,
the argument doesn't hold: a failed Tier 1 plan on calico namespaces
cannot break networking, so no need to pay the Tier 0 tax.
## Verification
```
$ cd stacks/calico && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.
$ kubectl get pods -n calico-system
NAME READY STATUS RESTARTS
calico-kube-controllers-... 1/1 Running 0
calico-node-... 1/1 Running 0
... (all healthy, pre-existing)
```
Follow-up: code-3ad for operator + Installation CR adoption (needs
low-traffic window + ignore_changes scoping).
Closes: code-hl1 scope of Wave 5b (namespaces). Remaining subwave in code-3ad.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 6a of the state-drift consolidation plan. The domain-wide catch-all
Proxy Provider (pk=5) + its wrapping Application (slug=domain-wide-catch-all)
+ the embedded outpost (uuid 0eecac07-97c7-443c-8925-05f2f4fe3e47) have
run for a year as pure UI-created state. When the 2026-04-18 outpost SEV2
hit, it was harder to reason about the config than it should have been —
the only source of truth was the Authentik admin UI. Bringing the provider
+ application under Terraform means future changes are reviewable in PRs
and recoverable from git if the admin UI misbehaves.
## This change
Adds the `goauthentik/authentik` provider to the repo's central
`terragrunt.hcl` `required_providers` (side-effect: every stack can now
declare authentik resources; this stack is the only current consumer).
Stack-local `stacks/authentik/authentik_provider.tf` holds the provider
instance configuration + API token wiring + two resources + their flow
data-source lookups.
### Auth
- API token stored in Vault at `secret/authentik/tf_api_token`, identifier
  `terraform-infra-stack`, intent=API, user=akadmin, no expiry. Rotation is
  a matter of rewriting the Vault KV entry; the next TF plan/apply picks up
  the new value.
### Imports (both landed zero-diff)
- `authentik_application.catchall` ← id `domain-wide-catch-all`
- `authentik_provider_proxy.catchall` ← id `5`
### Flow references
Authorization + invalidation flows are looked up via `data
"authentik_flow"` by slug (`default-provider-authorization-implicit-consent`
+ `default-provider-invalidation-flow`). Keeping them as data sources
rather than hardcoded UUIDs means a flow recreation (slug unchanged)
doesn't require an HCL edit.
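A sketch of the lookup wiring (the data-source addresses match the state list in Verification; the flow-reference attribute names on the provider resource are assumptions against the provider schema):
```hcl
data "authentik_flow" "default_authorization_implicit_consent" {
  slug = "default-provider-authorization-implicit-consent"
}

data "authentik_flow" "default_provider_invalidation" {
  slug = "default-provider-invalidation-flow"
}

resource "authentik_provider_proxy" "catchall" {
  # ...other arguments as imported...
  authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
  invalidation_flow  = data.authentik_flow.default_provider_invalidation.id
}
```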
### `lifecycle { ignore_changes }` scope
On `authentik_provider_proxy.catchall`:
- `property_mappings` (5 UUIDs), `jwt_federation_sources` (1 UUID) — the
live state references complex many-to-many relations that are easier
to manage from the Authentik UI than to serialise in HCL. Drift
suppressed.
- `skip_path_regex`, `internal_host`, all `basic_auth_*`,
`intercept_header_auth`, `access_token_validity` — either defaults or
UI-only tuning knobs that aren't part of Terraform's concern for this
catch-all provider.
On `authentik_application.catchall`:
- `meta_description`, `meta_launch_url`, `meta_icon`, `group`,
`backchannel_providers`, `policy_engine_mode`, `open_in_new_tab` —
cosmetic/non-functional attributes; the Authentik UI is the right
place to edit these and drift on them isn't interesting.
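In HCL terms, a sketch of the provider-side suppression (attribute names as listed above; the `basic_auth_*` set is abbreviated):
```hcl
resource "authentik_provider_proxy" "catchall" {
  # ...
  lifecycle {
    ignore_changes = [
      property_mappings,
      jwt_federation_sources,
      skip_path_regex,
      internal_host,
      intercept_header_auth,
      access_token_validity,
      # ...plus the basic_auth_* attributes, per the provider schema
    ]
  }
}
```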
## What is NOT in this change
- Outpost-binding resource — the embedded outpost's provider list is a
single-row many-to-many that the Authentik UI manages cleanly; adding
TF there would fight the UI without reducing drift.
- Property mappings and JWT federation source — managed via UI, drift
suppressed. A future wave can bring them in when someone actually
wants to edit them through code review.
- Other Authentik entities (Flows, Stages, Groups, RBAC policies) —
same rationale: UI is the natural editing surface. Adopt incrementally
as they become interesting to code-review.
## Verification
```
$ cd stacks/authentik && ../../scripts/tg plan | grep Plan:
Plan: 0 to add, 1 to change, 0 to destroy.
# module.authentik.kubernetes_deployment.pgbouncer — pre-existing drift,
# unrelated to this commit (image_pull_policy Always -> IfNotPresent)
$ ../../scripts/tg state list | grep authentik_
authentik_application.catchall
authentik_provider_proxy.catchall
data.authentik_flow.default_authorization_implicit_consent
data.authentik_flow.default_provider_invalidation
```
## Reproduce locally
1. `git pull && cd stacks/authentik && ../../scripts/tg init`
2. Terraform pulls goauthentik/authentik provider (first time).
3. `tg plan` — expect only pgbouncer drift; authentik resources read-only.
Refs: Wave 6a of the state-drift consolidation (code-hl1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
payslip-ingest now runs pdftotext locally before calling claude-agent-service,
shrinking the prompt ~20-100x. Agent file documents both paths: PAYSLIP_TEXT
(fast) and PDF_BASE64 (fallback for scanned-image PDFs or when pdftotext
fails).
## Context
Wave 7 of the state-drift consolidation plan. The drift-detection pipeline
(`.woodpecker/drift-detection.yml`) already ran terragrunt plan on every
stack daily and Slack-posted a summary, but its output was ephemeral —
nothing persisted in Prometheus, so there was no historical view of which
stacks drift, when, or for how long. Following the convergence work in
waves 1–6 (168 KYVERNO_LIFECYCLE_V1 markers, 4 stacks adopted, Phase 4
mysql cleanup), the baseline is clean enough that *new* drift should
stand out. That only works if we have observability.
## This change
### `.woodpecker/drift-detection.yml`
Enhances the existing cron pipeline to push a batched set of metrics to
the in-cluster Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`)
after each run:
| Metric | Kind | Purpose |
|---|---|---|
| `drift_stack_state{stack}` | gauge, 0/1/2 | 0=clean, 1=drift, 2=error |
| `drift_stack_first_seen{stack}` | gauge (unix seconds) | Preserved across runs for drift-age tracking |
| `drift_stack_age_hours{stack}` | gauge (hours) | Computed from `first_seen` |
| `drift_stack_count` | gauge (count) | Total drifted stacks this run |
| `drift_error_count` | gauge (count) | Total plan-errored stacks |
| `drift_clean_count` | gauge (count) | Total clean stacks |
| `drift_detection_last_run_timestamp` | gauge (unix seconds) | Pipeline heartbeat |
First-seen preservation: on each drift hit, the pipeline queries
Pushgateway for the existing `drift_stack_first_seen{stack=<stack>}`
value. If present and non-zero, reuse it; otherwise stamp with `NOW`.
That means age-hours grows monotonically until the stack goes clean
(at which point state=0 resets first_seen by omission).
Batched push: all metrics for a run are POST'd in a single HTTP request.
Pushgateway doesn't offer transactional multi-request updates, but batching
at the pipeline layer means an interrupted curl fails the entire push
rather than leaving half-updated state — the run then shows up as stale
and `DriftDetectionStale` fires.
### `stacks/monitoring/.../prometheus_chart_values.tpl`
New `Infrastructure Drift` alert group with three rules:
- **DriftDetectionStale** (warning, 30m): fires if
`drift_detection_last_run_timestamp` is older than 26h. Gives a 2h
grace window on top of the 24h cron so transient Pushgateway or
cluster unavailability doesn't false-alarm. Guards against the
pipeline silently failing or the cron not firing.
- **DriftUnaddressed** (warning, 1h): fires if any stack has
`drift_stack_age_hours > 72` — three days of unacknowledged drift.
Three days is long enough to absorb weekends + typical review cycles
but short enough to force follow-up before drift compounds.
- **DriftStacksMany** (warning, 30m): fires if `drift_stack_count > 10`
in a single run. Sudden wide drift usually signals systemic causes
(new admission webhook, provider version bump, cluster-wide CRD
upgrade) rather than individual configuration errors, and the alert
body nudges toward that diagnosis.
Applied to `stacks/monitoring` this session — 1 helm_release changed,
no other drift surfaced.
## What is NOT in this change
- The Wave 7 **GitHub issue auto-filer** — the full plan included
filing a `drift-detected` issue per drifted stack. Deferred because
it requires wiring the `file-issue` skill's convention + a gh token
exposed to Woodpecker, both of which need separate setup. The Slack
alert covers the same need at lower fidelity in the meantime.
- The Wave 7 **PG drift_history table** — would provide the richest
historical view but adds a new DB schema dependency for a CI
pipeline. Pushgateway + Prometheus handle the 72h window we care
about; PG history is nice-to-have for quarterly reviews.
- Auto-apply marker (`# DRIFT_AUTO_APPLY_OK`) — premature until the
baseline has been stable for a few cycles.
Follow-ups tracked: file dedicated beads items for GH-issue filer + PG
drift_history.
## Verification
```
$ cd stacks/monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
# After next cron run (cron expr: "drift-detection" in Woodpecker UI):
$ curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
| grep -c '^drift_'
# expect a positive number
```
## Reproduce locally
1. `git pull`
2. Check Prometheus rules: `curl -sk https://prometheus.viktorbarzin.lan/api/v1/rules | jq '.data.groups[] | select(.name == "Infrastructure Drift")'`
3. Manually trigger the Woodpecker cron and watch Pushgateway populate.
Refs: Wave 7 umbrella (code-hl1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 5c of the state-drift consolidation plan. `local-path-provisioner`
(Rancher's node-local dynamic PV provisioner) was deployed 55d ago via raw
`kubectl apply` against the upstream manifest. It serves as the cluster's
default StorageClass and is still actively in use — the 2026-04-18 live
survey showed helper-pod-delete cycles running against existing PVCs.
Unmanaged until now: namespace, ServiceAccount, ClusterRole (+ binding),
ConfigMap with provisioner config.json + helperPod.yaml + setup/teardown
scripts, StorageClass `local-path` (default), and the 1-replica
Deployment itself. Seven resources total.
## This change
New Tier 1 stack `stacks/local-path/` with all seven resources, adopted
via Wave 8's HCL `import {}` block convention (commit 8a99be11):
- `kubernetes_namespace.local_path_storage` → id `local-path-storage`
- `kubernetes_service_account.local_path_provisioner` →
id `local-path-storage/local-path-provisioner-service-account`
- `kubernetes_cluster_role.local_path_provisioner` → id `local-path-provisioner-role`
- `kubernetes_cluster_role_binding.local_path_provisioner` → id `local-path-provisioner-bind`
- `kubernetes_config_map.local_path_config` →
id `local-path-storage/local-path-config`
- `kubernetes_storage_class_v1.local_path` → id `local-path`
- `kubernetes_deployment.local_path_provisioner` →
id `local-path-storage/local-path-provisioner`
Conventions applied:
- Namespace gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
Goldilocks `vpa-update-mode` label drift (Wave 3B, commit 8b43692a).
- Deployment gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
ndots dns_config drift (Wave 3A, commit c9d221d5 + 327ce215).
- ServiceAccount + pod spec pin `automount_service_account_token = false`
and `enable_service_links = false` to match the live spec exactly.
- `import {}` stanzas removed after the apply converged to zero-diff
(per AGENTS.md → "Adopting Existing Resources").
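A condensed sketch of how those conventions look on the Deployment (a sketch, not the full resource):
```hcl
resource "kubernetes_deployment" "local_path_provisioner" {
  # ...
  spec {
    template {
      spec {
        automount_service_account_token = false
        enable_service_links            = false
        # containers, volumes, etc. as in the live spec
      }
    }
  }

  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].template[0].spec[0].dns_config]
  }
}
```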
## Apply outcome
`Apply complete! Resources: 7 imported, 0 added, 3 changed, 0 destroyed.`
The 3 in-place changes were:
- `kubernetes_config_map.local_path_config.data` — whitespace/format
reshuffle. The live ConfigMap contained the upstream manifest's
hand-indented JSON + YAML; my HCL uses canonical `jsonencode` /
heredoc. Semantic content identical, so the provisioner continued
running (no pod restart).
- `kubernetes_deployment.local_path_provisioner.wait_for_rollout = true`
— TF-only attribute, no cluster impact.
- `kubernetes_storage_class_v1.local_path.allow_volume_expansion = false`
+ `is-default-class` annotation re-asserted — TF-schema reconciliation
only; the StorageClass remained default throughout.
Post-apply `scripts/tg plan` returns `No changes`.
## Verification
```
$ cd stacks/local-path && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.
$ kubectl -n local-path-storage get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
local-path-provisioner 1/1 1 1 55d
$ kubectl get sc local-path
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer
```
## What is NOT in this change
- Helm-release adoption — local-path-provisioner was never installed via
Helm in this cluster; raw manifests only. Keeping native typed
resources rather than retrofitting a chart.
- PV-path customisation — sticks with upstream default
`/opt/local-path-provisioner` on all nodes (via
`DEFAULT_PATH_FOR_NON_LISTED_NODES`).
Closes: code-3gp
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.
The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`
So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.
## Flow
```
user: bd assign <id> agent
│
▼
Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐
│ │
▼ │
CronJob: beads-dispatcher │
1. GET beadboard/api/agent-status (busy? skip) │
2. bd query 'assignee=agent AND status=open' │
3. bd update -s in_progress (claim) │
4. POST beadboard/api/agent-dispatch │
5. bd note "dispatched: job=…" │
│ │
▼ │
claude-agent-service /execute │
beads-task-runner agent runs; notes/closes bead │
│ │
▼ │
done ──► next tick picks up the next bead ───────────────┘
CronJob: beads-reaper (every 10 min)
for bead (assignee=agent, status=in_progress, updated_at > 30 min):
bd note "reaper: no progress for Nm — blocking"
bd update -s blocked
```
## Decisions
- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
2-min poll cadence and ~5-min average run, throughput is ~12 beads/hour.
Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
`curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
the image-seeded file. The script copies it into `/tmp/.beads/` because bd
may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
  When false, `suspend: true` on both CronJobs (see the sketch after this
  list); manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
`beads-task-runner` never trips the reaper. Failures trip it; pod crashes
(in-memory job state lost) also trip it.
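A sketch of the kill-switch wiring, assuming the bool feeds both CronJobs' `suspend` field directly:
```hcl
variable "beads_dispatcher_enabled" {
  type    = bool
  default = true
}

resource "kubernetes_cron_job_v1" "beads_dispatcher" {
  # ...
  spec {
    schedule = "*/2 * * * *"
    suspend  = !var.beads_dispatcher_enabled
    # ...
  }
}

resource "kubernetes_cron_job_v1" "beads_reaper" {
  # ...
  spec {
    schedule = "*/10 * * * *"
    suspend  = !var.beads_dispatcher_enabled
    # ...
  }
}
```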
## What is NOT in this change
- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
`cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.
## Deviations from plan
Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
serializes `notes` as a string (not an array), and every `bd note` bumps
`updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU
`-d` and the image has python3.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.
## Test plan
### Automated
```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!
$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1) # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.
$ terraform fmt stacks/beads-server/main.tf
# (no output — already formatted)
```
### Manual verification
1. **Apply**
```
vault login -method=oidc
cd infra/stacks/beads-server
scripts/tg apply
```
Expect: `kubernetes_config_map.beads_metadata`,
`kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
created. No changes to existing resources.
2. **CronJobs exist with right schedule**
```
kubectl -n beads-server get cronjob
```
Expect `beads-dispatcher */2 * * * *` and `beads-reaper */10 * * * *`,
both with `SUSPEND=False`.
3. **End-to-end smoke**
```
bd create "auto-dispatch smoke test" \
-d "Read /etc/hostname inside the agent sandbox and close." \
--acceptance "bd note includes 'hostname=' line and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '{status, notes}'
```
Expect notes to contain `auto-dispatcher claimed at …` and
`dispatched: job=<uuid>`, status `in_progress`.
4. **Reaper smoke**
Assign + dispatch a long bead, then
`kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
30 min + one reaper tick, `bd show <id>` shows `blocked` with a
`reaper: no progress for Nm — blocking` note.
5. **Kill switch**
```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
kubectl -n beads-server get cronjob
```
Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
nothing happens within 5 min. Re-apply with `=true` to re-enable.
Runbook with all above plus reaper semantics + design choices at
`infra/docs/runbooks/beads-auto-dispatch.md`.
Closes: code-8sm
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 5a of the state-drift consolidation plan. Two cluster-critical pieces
of infrastructure lived OUTSIDE Terraform — invisible to the repo's "all
cluster changes via TF" invariant and drifting silently:
1. **kured** (Helm release): deployed 265d ago via `helm install kured` on
the CLI. Values were edited only via `helm upgrade` — never captured.
Chart version `kured-5.11.0`, app `1.21.0`, configured for Mon–Fri
02:00–06:00 London reboot window, Slack notifyUrl, and a custom
`/sentinel/gated-reboot-required` sentinel file.
2. **kured-sentinel-gate**: a custom DaemonSet + ServiceAccount +
ClusterRole + ClusterRoleBinding. Built after the 2026-03 post-mortem
(memory 390) when kured rebooted nodes during a containerd overlayfs
outage and turned a single-node blip into a 26h cluster outage.
The gate DaemonSet creates `/var/run/gated-reboot-required` only when
(a) host has `/var/run/reboot-required`, (b) all nodes Ready, (c) all
calico-node pods Running, (d) no node transitioned Ready in the last
30 minutes (cool-down). kured's `rebootSentinel` then points at the
gated file so reboots are effectively gated by cluster health.
Applied 33d ago via `kubectl apply` — no TF footprint.
Both are now codified in the new `stacks/kured/` (Tier 1, PG state).
## This change
- New stack `stacks/kured/` with `main.tf` (247 lines) + `terragrunt.hcl`
(standard platform-dep) + `secrets` symlink.
- All 6 resources adopted via Wave 8's HCL `import {}` block pattern
(commit 8a99be11) — written as `import {}` stanzas in the initial
commit, plan-applied to zero, then stanzas deleted before this commit
per the convention:
- `kubernetes_namespace.kured` (id: `kured`)
- `helm_release.kured` (id: `kured/kured`)
- `kubernetes_service_account.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
- `kubernetes_cluster_role.kured_sentinel_gate` (id: `kured-sentinel-gate`)
- `kubernetes_cluster_role_binding.kured_sentinel_gate` (id: `kured-sentinel-gate`)
- `kubernetes_daemon_set_v1.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
- Slack notifyUrl moved from inline helm values into Vault at
  `secret/kured` under key `slack_kured_webhook`, consumed via
  `data "vault_kv_secret_v2"` (see the sketch after this list). No
  plaintext secret in git.
- Namespace gets `tier = "1-cluster"` label (new — previously untiered,
so Kyverno auto-quotas applied cluster-tier defaults on kured pods).
Benign additive change; pod specs have explicit resources anyway.
- DaemonSet + SA get `automount_service_account_token = false` /
`enable_service_links = false` to match the live pod spec exactly —
otherwise TF schema defaults would flip these fields.
- DaemonSet carries `# KYVERNO_LIFECYCLE_V1` suppressing dns_config drift
(Wave 3A convention, commit c9d221d5 + 327ce215).
- Namespace carries the same marker on the
`goldilocks.fairwinds.com/vpa-update-mode` label (Wave 3B sweep,
commit 8b43692a).
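A sketch of the notifyUrl wiring (data-source arguments per the Vault provider; the kured chart value keys are elided because they come from the captured live values):
```hcl
data "vault_kv_secret_v2" "kured" {
  mount = "secret"
  name  = "kured"
}

resource "helm_release" "kured" {
  name      = "kured"
  namespace = kubernetes_namespace.kured.metadata[0].name
  # repository / chart / version as captured from the live release

  values = [yamlencode({
    # notifyUrl + rebootSentinel + the rest of the captured values;
    # key path per the kured chart schema. The secret comes from Vault
    # rather than being inlined:
    notifyUrl = data.vault_kv_secret_v2.kured.data["slack_kured_webhook"]
  })]
}
```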
## Import outcomes
Apply result: `Resources: 6 imported, 0 added, 3 changed, 0 destroyed.`
The 3 in-place changes were all TF-schema reconciliation, not cluster
mutations:
- `helm_release.kured.values` — format reshuffle; the imported state
stored values as a nested map, HCL uses `[yamlencode(...)]`. Semantic
YAML is byte-identical, so the triggered Helm upgrade was a no-op on
the cluster side (revision bumped 2→3, zero pod restarts).
- `kubernetes_namespace.kured.labels["tier"]` = `"1-cluster"` — new
label added. Already discussed above.
- `kubernetes_daemon_set_v1.kured_sentinel_gate.wait_for_rollout` = true
— TF-only attribute, no k8s impact.
Post-apply `scripts/tg plan` on `stacks/kured` returns:
`No changes. Your infrastructure matches the configuration.`
## What is NOT in this change
- `import {}` stanzas — intentionally removed after the apply landed.
They would be no-ops and would clutter future diffs. Per Wave 8
convention (AGENTS.md → "Adopting Existing Resources").
- Calico adoption (Wave 5b) — separate higher-blast change, needs a
dedicated low-traffic window.
- local-path-storage (Wave 5c) — check-or-remove task still open.
## Verification
```
$ kubectl -n kured get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE
kured 5 5 5 5 5
kured-sentinel-gate 5 5 5 5 5
$ helm -n kured list
NAME NAMESPACE REVISION STATUS CHART APP VERSION
kured kured 3 deployed kured-5.11.0 1.21.0
$ cd stacks/kured && ../../scripts/tg plan | tail -1
No changes. Your infrastructure matches the configuration.
```
## Reproduce locally
1. `git pull`
2. `cd stacks/kured && ../../scripts/tg plan` → 0 changes
3. `kubectl -n kured get ds,pods` — 5 kured + 5 sentinel-gate pods Ready.
Closes: code-q8k
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Upstream Wealthfolio uses SQLite exclusively (Diesel ORM, no PG/MySQL
support — confirmed 2026-04-18 via repo inspection). The DB lives on
an RWO PVC (proxmox-lvm-encrypted) held 24/7 by the main pod.
First attempt at a standalone backup CronJob failed with a Multi-Attach
error: the RWO volume is already attached to the running WF pod, so no
separate pod can mount it. Switched to a backup sidecar in the same
pod, which shares the PVC mount naturally.
## This change
- `container "backup"` added to the WF Deployment:
- alpine:3.20 + sqlite + busybox-suid (for crond).
- Mounts /data read-only (shared with WF container) + /backup (new
NFS volume at 192.168.1.127:/srv/nfs/wealthfolio-backup).
- Writes /etc/crontabs/root with a `30 4 * * *` line + /scripts/backup.sh
which runs `sqlite3 .backup` (WAL-safe online snapshot, zero
downtime), copies secrets.json, and prunes anything older than 30d.
- 16Mi request / 64Mi limit — sleeps most of the time.
- NFS volume declared in pod spec — server from the existing
`var.nfs_server` variable; path `/srv/nfs/wealthfolio-backup` created
on the PVE host in the same session.
Removed the standalone backup CronJob that couldn't work.
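A sketch of the sidecar + NFS volume as they sit in the Deployment's pod spec (volume/mount names are illustrative):
```hcl
# Inside kubernetes_deployment spec.template.spec:
container {
  name  = "backup"
  image = "alpine:3.20"

  volume_mount {
    name       = "data" # illustrative name — the existing WF PVC volume
    mount_path = "/data"
    read_only  = true
  }
  volume_mount {
    name       = "backup"
    mount_path = "/backup"
  }

  resources {
    requests = { memory = "16Mi" }
    limits   = { memory = "64Mi" }
  }
}

volume {
  name = "backup"
  nfs {
    server = var.nfs_server
    path   = "/srv/nfs/wealthfolio-backup"
  }
}
```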
## Verification
### Automated
`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 1 changed, 1 destroyed (the transient CronJob).
### Manual (2026-04-18)
```
$ kubectl -n wealthfolio get pods -l app=wealthfolio
wealthfolio-95d8bd498-cj8kw 2/2 Running
$ kubectl -n wealthfolio logs <pod> -c backup
wealthfolio-backup sidecar ready; next 04:30 UTC
$ kubectl -n wealthfolio exec <pod> -c backup -- /scripts/backup.sh
wealthfolio-backup: /backup/2026-04-18T22-24-55 (34.2M)
$ ls /srv/nfs/wealthfolio-backup/
2026-04-18T22-24-55/ ← first sidecar-produced backup
```
## Reproduce locally
1. `kubectl -n wealthfolio exec $(kubectl -n wealthfolio get pods -l app=wealthfolio -o jsonpath='{.items[0].metadata.name}') -c backup -- /scripts/backup.sh`
2. `ssh root@192.168.1.127 ls /srv/nfs/wealthfolio-backup/`
3. Expected: new dated folder appears with wealthfolio.db + secrets.json.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
After fixing the two mail-server-side root causes of probe false-failures
(Dovecot userdb duplicates, postscreen btree lock contention), the probe
is expected to succeed well under 120s. This commit is defence in depth
against residual SMTP relay variance and against a future scenario where
Dovecot is transiently unresponsive during IMAP login.
The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's
queueing, DNS variance, and general SMTP retry backoff can easily
exceed that on a bad day. Widening to 5 minutes gives plenty of headroom
while still remaining well within the CronJob's 20-minute schedule
interval.
Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If
Dovecot is unresponsive (e.g., mid-rollout, transient TLS handshake
hang), the connect call can block indefinitely and the probe hangs
without ever looping to the next attempt. Adding `timeout=10` caps each
connect at 10s so the retry loop keeps making forward progress.
## This change
Two edits to the embedded probe script inside the cronjob resource:
```
- # Step 2: Wait for delivery, retry IMAP up to 3 min
+ # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
...
- for attempt in range(9):
+ for attempt in range(15):
...
- imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+ imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
```
Flow (before):
```
send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total
```
Flow (after):
```
send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total
│
└─ timeout ─► log, continue to next loop
```
## What is NOT in this change
- Probe frequency stays at `*/20 * * * *`.
- The `EmailRoundtripStale` alert thresholds are intentionally left at
3600s + for: 10m. Those fire only on sustained multi-hour issues and
should not be loosened — they would mask future regressions. Probe
success rate is now expected to recover to ≥95% from the two upstream
fixes; if it doesn't, alert tuning gets revisited separately.
- No change to the Brevo send step, the success-metrics push, or the
cleanup of stale e2e-probe-* messages.
## Test Plan
### Automated
`scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`:
```
# module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place
- for attempt in range(9):
+ for attempt in range(15):
- imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+ imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
Plan: 0 to add, 1 to change, 0 to destroy.
```
`scripts/tg apply`:
```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```
### Manual Verification
1. Trigger the probe manually:
`kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
2. Tail its logs:
`kubectl -n mailserver logs job/probe-verify-<ts> -f`
3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical
successful run should still complete in < 60s now that postscreen
is no longer stalling.
4. Watch the 48-hour window on the `email_roundtrip_success` gauge in
Prometheus — expect ≥95% (was ~65% before all three fixes).
## Reproduce locally
1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml | grep -E "range\(|timeout"`
2. Expect: `range(15)` and `timeout=10`
3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
4. `kubectl -n mailserver logs -f job/probe-verify-<ts>`
5. Expect: eventual `Round-trip SUCCESS in <N>s` message and exit 0.
Closes: code-18e
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Postfix inside docker-mailserver was spamming fatal errors at roughly
1 per minute — 5,464 of them in a 24h window — all of the same shape:
```
postfix/postscreen[NNN]: fatal: btree:/var/lib/postfix/postscreen_cache:
unable to get exclusive lock: Resource temporarily unavailable
```
Every time one of these fires, the postscreen process dies mid-connection
and the inbound SMTP session is dropped. Legitimate mail (including Brevo
deliveries for our e2e email-roundtrip probe) gets re-queued by the sender
and arrives late — frequently past the probe's 180s IMAP polling window,
producing a 35%/7d probe success rate and the EmailRoundtripStale alert
noise that was originally flagged as "probably nothing."
## Root cause
`master.cf` declares postscreen with `maxproc=1`, but postscreen still
re-spawns per incoming connection (or for short-lived reopens), and each
instance opens the shared btree cache with an exclusive file lock. Under
any concurrency (two TCP SYNs arriving close together, or a retry during
teardown), the second process hits EWOULDBLOCK on fcntl and Postfix
treats that as fatal.
Four options were considered:
| Option | Verdict |
|--------|---------|
| (a) Disable cache (postscreen_cache_map = ) | ✓ chosen |
| (b) Switch btree → lmdb | ✗ lmdb not compiled into docker-mailserver 15.0.0's postfix (`postconf -m` has no lmdb) |
| (c) proxy:btree via proxymap | ✗ unsafe — Postfix docs: "postscreen does its own locking, not safe via proxymap" |
| (d) Memcached sidecar | ✗ new moving part; deferred |
Option (a) is a small trade-off: legitimate clients re-run the
greet-action / bare-newline-action checks on every fresh TCP session
instead of hitting the 7-day whitelist cache. At our volume (~100
deliveries/day, ~72 of which are the probe itself) that's negligible CPU.
The cache would also normally spare clients DNSBL re-evaluation, but this
mailserver already has `postscreen_dnsbl_action = ignore`, so the cache's
DNSBL role was doing nothing anyway.
## This change
Appends a stanza to the user-merged postfix main.cf stored in
`variable.postfix_cf` that sets `postscreen_cache_map =` (empty value).
Postfix treats an empty cache_map as "no persistent cache" — per-session
decisions are still enforced, they just aren't cached across sessions.
Before:
```
smtpd ──► postscreen (maxproc=1, btree cache with exclusive lock)
├─ concurrent access → fcntl EWOULDBLOCK → fatal
└─ connection dropped, sender retries, mail arrives late
```
After:
```
smtpd ──► postscreen (no cache, per-session checks only)
└─ no shared file, no lock → no fatal, no dropped session
```
No change to master.cf (postscreen still the front-end), no change to
DNSBL / greet / bare-newline policy.
## What is NOT in this change
- Dovecot userdb dedup (shipped in the previous commit).
- Email-roundtrip probe widening (next commit).
- Rebuilding docker-mailserver image with lmdb support (deferred —
disabling the cache is simpler and sufficient at our volume).
## Test Plan
### Automated
`postconf -m` in the running container to confirm lmdb is genuinely absent
(ruling out option (b) before we commit to (a)):
```
btree cidr environ fail hash inline internal ldap memcache
nis pcre pipemap proxy randmap regexp socketmap static tcp
texthash unionmap unix
```
No lmdb entry — confirmed.
`scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`:
```
~ "postfix-main.cf" = <<-EOT
+ postscreen_cache_map =
```
`scripts/tg apply`:
```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```
Reloader triggers pod rollout — baseline error count before apply was 34
`unable to get exclusive lock` lines per `--tail=500` log window.
### Manual Verification
Post-rollout, when the new pod is Ready:
1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map`
Expect: empty (no value)
2. Watch for 15 min: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=1000 | grep -c "unable to get exclusive lock"`
Expect: 0 new occurrences (any hits are from before the rollout).
3. Trigger a probe run manually:
`kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
then `kubectl -n mailserver logs job/probe-verify-...`
Expect: `Round-trip SUCCESS` with duration < 120s.
## Reproduce locally
1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map`
2. Expect: `postscreen_cache_map =` (empty value)
3. `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --since=15m | grep -c "unable to get exclusive lock"`
4. Expect: 0
Closes: code-1dc
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Dovecot auth logs have been steadily spamming
`passwd-file /etc/dovecot/userdb: User r730-idrac@viktorbarzin.me exists more
than once` (and the same for vaultwarden@) at ~31 occurrences per 500 log
lines. Under load this flakes IMAP auth for the e2e email-roundtrip probe
(spam@viktorbarzin.me uses the catch-all), which was masquerading as "Brevo
or probe timing" noise.
## Root cause
docker-mailserver builds Dovecot's `/etc/dovecot/userdb` from two sources:
real accounts (`postfix-accounts.cf`) AND virtual-alias entries whose
*target* resolves to a local mailbox (`postfix-virtual.cf`). When the same
address appears as BOTH a real mailbox AND an alias whose target is another
local mailbox, the generated userdb has two lines for that username pointing
to different home directories — e.g.:
```
r730-idrac@viktorbarzin.me:...:/var/mail/.../r730-idrac/home
r730-idrac@viktorbarzin.me:...:/var/mail/.../spam/home      ← from alias
```
Dovecot's passwd-file driver rejects the duplicate, and every subsequent
auth lookup logs the error.
This affected exactly two addresses:
- r730-idrac@viktorbarzin.me (real account + alias → spam@)
- vaultwarden@viktorbarzin.me (real account + alias → me@)
Other aliases are fine: they either forward to external addresses (gmail
etc.) — no local userdb entry generated — or map an address to itself
(me@ → me@) which docker-mailserver dedups internally.
Note: removing the real accounts is not an option because Vaultwarden uses
`vaultwarden@viktorbarzin.me` as its live SMTP_USERNAME
(stacks/vaultwarden/modules/vaultwarden/main.tf:121).
## This change
Introduces a `local.postfix_virtual` that concatenates the Vault-sourced
aliases with `extra/aliases.txt`, then filters out any line matching the
exact "LHS RHS" shape where both sides are in `var.mailserver_accounts` and
LHS != RHS. That is, only the pure local→local redundant entries are
dropped; all forwarding aliases and the catch-all are preserved.
The filter is self-healing: if a future alias ever collides with a real
account, it gets silently suppressed instead of breaking Dovecot auth.
```
Vault mailserver_aliases ─┐
├─ concat ─ split \n ─ filter ─ join \n ─► postfix-virtual.cf
extra/aliases.txt ─────────┘ │
└── drop if LHS+RHS both in
mailserver_accounts and
LHS != RHS
```
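A sketch of the filter, assuming `var.mailserver_accounts` exposes the real account addresses as a list and the alias sources arrive as newline-separated `LHS RHS` lines (source names here are illustrative except `local.postfix_virtual`):
```hcl
locals {
  alias_lines = concat(
    split("\n", var.mailserver_aliases), # Vault-sourced aliases (illustrative name)
    split("\n", file("${path.module}/extra/aliases.txt")),
  )

  # Drop only the pure local→local redundant entries; keep forwards + catch-all.
  postfix_virtual = join("\n", [
    for line in local.alias_lines : line
    if !(
      length(split(" ", line)) == 2 &&
      contains(var.mailserver_accounts, split(" ", line)[0]) &&
      contains(var.mailserver_accounts, split(" ", line)[1]) &&
      split(" ", line)[0] != split(" ", line)[1]
    )
  ])
}
```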
Filtered entries (confirmed via locally-simulated filter on live data):
- r730-idrac@viktorbarzin.me spam@viktorbarzin.me
- vaultwarden@viktorbarzin.me me@viktorbarzin.me
Preserved (sample): postmaster→me, contact→me, alarm-valchedrym→self+3 ext,
lubohristov→gmail, yoana→gmail, @viktorbarzin.me→spam (catch-all), all four
disposable `*-generated@` aliases.
## What is NOT in this change
- Real accounts in Vault (`secret/platform.mailserver_accounts`) are
untouched — vaultwarden SMTP auth keeps working.
- Postfix postscreen btree lock contention (separate commit).
- Email-roundtrip probe IMAP window (separate commit).
## Test Plan
### Automated
`terraform validate` — passes (docker-mailserver module):
```
Success! The configuration is valid, but there were some validation warnings as shown above.
```
`scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`:
```
# module.mailserver.kubernetes_config_map.mailserver_config will be updated in-place
~ resource "kubernetes_config_map" "mailserver_config" {
~ data = {
~ "postfix-virtual.cf" = (sensitive value)
# (9 unchanged elements hidden)
}
id = "mailserver/mailserver.config"
}
Plan: 0 to add, 1 to change, 0 to destroy.
```
`scripts/tg apply` — applied:
```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```
### Manual Verification
Post-apply configmap content (the two lines are gone):
```
$ kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'
postmaster@viktorbarzin.me me@viktorbarzin.me
contact@viktorbarzin.me me@viktorbarzin.me
me@viktorbarzin.me me@viktorbarzin.me
lubohristov@viktorbarzin.me lyubomir.hristov3@gmail.com
alarm-valchedrym@viktorbarzin.me alarm-valchedrym@...,vbarzin@...,emil.barzin@...,me@...
yoana@viktorbarzin.me divcheva.yoana@gmail.com
@viktorbarzin.me spam@viktorbarzin.me
firmly-gerardo-generated@viktorbarzin.me me@viktorbarzin.me
closely-keith-generated@viktorbarzin.me vbarzin@gmail.com
literally-paolo-generated@viktorbarzin.me viktorbarzin@fb.com
hastily-stefanie-generated@viktorbarzin.me elliestamenova@gmail.com
```
Reloader triggers a pod rollout; once new pod is Ready:
- `kubectl -n mailserver exec <pod> -c docker-mailserver -- cut -d: -f1 /etc/dovecot/userdb | sort | uniq -d`
expected output: empty (no duplicate usernames)
- `kubectl -n mailserver logs <pod> -c docker-mailserver --tail=500 | grep -c "exists more than once"`
expected output: 0 (baseline was 31/500 lines)
## Reproduce locally
1. `kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'`
2. Expect: no `r730-idrac@viktorbarzin.me spam@viktorbarzin.me` line and no
`vaultwarden@viktorbarzin.me me@viktorbarzin.me` line.
3. After pod restart: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=500 | grep -c "exists more than once"` → 0.
Closes: code-27l
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
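The two forms, sketched with the exact paths listed above:
```hcl
# Deployments / StatefulSets / DaemonSets / Jobs:
lifecycle {
  # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}

# CronJobs — the PodTemplateSpec sits one level deeper:
lifecycle {
  # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
  ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
```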
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the
unbounded 10-thread worker drove the whole pod into memory pressure —
the kubelet then evicted the web container along with it. Viktor's
recollection was "it was crashing"; the cgroup-root cause was that the
Sidekiq container had no `resources.limits.memory` set, so a misbehaving
job could pull the entire pod down instead of being OOM-killed and
restarted in isolation.
During the ~55 days the worker was off, POSTs to /api/v1 continued to
enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not
the cluster default DB 0). track_segments and digests tables stayed
empty because nothing was processing the backfill queue (beads
code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so
Sidekiq was untested against the new release in this environment.
Live pre-apply snapshot via `bin/rails runner`:
enqueued=18 (cache=2, data_migrations=4, default=12)
scheduled=16, retry=0, dead=0, procs=0, processed/failed=0 (stats
reset by the 1.6.1 upgrade)
Queue latencies ~50h — lines up with code-e9c (iOS client stopped
POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1
was therefore a small, recoverable backlog, not the disaster the plan
originally feared — no pre-apply triage needed.
## What changed
Second container `dawarich-sidekiq` added to the existing Deployment
(same pod, same lifecycle as `dawarich` web). Key differences vs the
2026-02-23 commented block:
- `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory =
768Mi }`. Burstable QoS — cgroup is now bounded, so a runaway Sidekiq
job gets OOM-killed and container-restarted in place without evicting
the whole pod (web stays Ready).
- Hosts parametrised via `var.redis_host` / `var.postgresql_host`
instead of hardcoded FQDNs; matches the web container's pattern.
- DB / secret / Geoapify creds via `value_from.secret_key_ref` against
the existing `dawarich-secrets` K8s Secret (populated by the existing
ExternalSecret). Removes the plan-time `data.vault_kv_secret_v2`
reference the 2026-02-23 block relied on — that data source no longer
exists in this stack.
- `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). Ramp deferred
to separate commits (plan: 2 → 5 → 10 with 15-30min observation
between bumps).
- Liveness + readiness `pgrep -f 'bundle exec sidekiq'` probes —
container-scoped restart on stall, verified `pgrep` is at
/usr/bin/pgrep in the Debian-trixie-based freikin/dawarich image.
- Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT,
RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so
Sidekiq's Rails initialisation matches web.
Pod-level additions:
- `termination_grace_period_seconds = 60` — gives Sidekiq time to
drain in-flight jobs on SIGTERM during rolls (default 30s not enough
for reverse-geocoding batches).
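A condensed sketch of the new container plus the pod-level grace period (env list truncated; secret key names illustrative):
```hcl
container {
  name  = "dawarich-sidekiq"
  image = var.image # same freikin/dawarich image as the web container (illustrative variable)

  env {
    name  = "BACKGROUND_PROCESSING_CONCURRENCY"
    value = "2"
  }
  env {
    name = "DATABASE_PASSWORD"
    value_from {
      secret_key_ref {
        name = "dawarich-secrets"
        key  = "database_password" # illustrative key
      }
    }
  }

  resources {
    requests = { cpu = "50m", memory = "768Mi" }
    limits   = { memory = "1Gi" }
  }

  liveness_probe {
    exec {
      command = ["pgrep", "-f", "bundle exec sidekiq"]
    }
  }
  readiness_probe {
    exec {
      command = ["pgrep", "-f", "bundle exec sidekiq"]
    }
  }
}

# Pod-level, alongside the two containers:
termination_grace_period_seconds = 60
```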
## What is NOT in this change
- Prometheus exporter for Sidekiq metrics. The first apply turned on
`PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the
`prometheus_exporter` gem's CLIENT middleware. That middleware PUSHes
metrics over TCP to a separate exporter server process — and the
freikin/dawarich image does not start one. Client logged ~2/sec
"Connection refused" errors until we flipped ENABLED back to "false"
in this commit. `pod.annotations["prometheus.io/scrape"]` reverted
for the same reason (nothing listening on :9394). Filed code-1q5
(blocks code-459) to add a third sidecar container running
`bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` and restore
the 4 drafted alerts (DawarichSidekiqDown /
QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are
actually being emitted.
- The 4 drafted Sidekiq alerts — reverted from
monitoring/prometheus_chart_values.tpl; they reference metrics that
don't exist yet. Restoration is part of code-1q5.
- Concurrency ramp past 2 and the 24h burn-in gate that closes
code-459 — separate future commits.
- Liveness/readiness probes on the web container — pre-existing gap,
out of scope per plan.
## Other changes bundled in
Kyverno `dns_config` drift suppression added with the
`# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich`
AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. Plan only
called it out for the Deployment, but the CronJob shows identical
drift (Kyverno injects ndots=2 on every pod template, Terraform wipes
it, infinite churn). Per AGENTS.md "Kyverno Drift Suppression" every
pod-owning resource MUST carry the lifecycle block — this commit
brings this stack into convention.
## Topology trade-off recorded
Sidekiq lives in the same pod as the web container, not a separate
Deployment. This means:
- Every env bump during ramp bounces both containers (Recreate
strategy) — brief UI blip accepted.
- `kubectl scale` alone can't pause Sidekiq — pausing requires
`BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting
the container block + apply.
- Shared pod network namespace — only one process can bind any given
port. This is why the plan explicitly avoided declaring a new
`port { name = "prometheus" }` on the sidekiq container (the web
container already reserves 9394 by name).
Accepted because the alternative (split Deployment) is significantly
more config for a single-instance service and a follow-up bead
(tracked in code-1q5 description area / Viktor's notes) already
captures "revisit if future crashes warrant blast-radius isolation".
## Rollback
Three levels, in order of increasing impact:
1. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply — pod stays up,
no jobs processed, backlog preserved in Redis.
2. Drop concurrency to 1 or 2 + apply — reduce load but keep draining.
3. Re-comment the second container block (this diff in reverse) +
apply — full disable, backlog stays in Redis DB 1, recoverable.
Never DEL queue:* keys directly — Redis DB 1 is where Dawarich lives,
and the jobs are recoverable state.
## Refs
- code-459 (P3) — Dawarich Sidekiq disabled. In progress; closes
after 24h burn-in at concurrency=10 with restartCount=0, DeadSet
delta < 100.
- code-1q5 (P3) — Follow-up: prometheus_exporter sidecar + 4 alerts.
Depends on code-459.
- code-e9c (P2) — Viktor client-side POST bug 2026-04-16.
Untouched; processing the backlog does not fix this but ensures
future POSTs drain cleanly.
- code-72g (P3) — Anca ingestion silent since 2025-06-21. Untouched;
same reasoning.
## Test Plan
### Automated
```
$ cd stacks/dawarich && ../../scripts/tg plan
...
Plan: 0 to add, 3 to change, 0 to destroy.
# kubernetes_deployment.dawarich (sidekiq container + probes + lifecycle)
# kubernetes_namespace.dawarich (drops stale goldilocks label, pre-existing drift)
# module.tls_secret.kubernetes_secret.tls_secret (Kyverno clone-label drift, pre-existing)
$ ../../scripts/tg apply --non-interactive
...
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.
(Second apply for PROMETHEUS_EXPORTER_ENABLED=false + annotation
removal — same 0/3/0 shape.)
```
### Manual Verification
Setup: kubectl context against the k8s cluster (10.0.20.100).
1. Pod has both containers Ready with zero restarts:
```
$ kubectl -n dawarich get pods -o wide
NAME                        READY   STATUS    RESTARTS   AGE
dawarich-75b4ff9fbf-qh56v   2/2     Running   0          <fresh>
```
2. Sidekiq container is actively processing jobs:
```
$ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=20
Sidekiq 8.0.10 connecting to Redis ... db: 1
queues: [data_migrations, points, default, mailers, families,
imports, exports, stats, trips, tracks,
reverse_geocoding, visit_suggesting, places,
app_version_checking, cache, archival, digests,
low_priority]
Performing DataMigrations::BackfillMotionDataJob ...
Backfilled motion_data for N000 points (N climbing)
```
3. Rails Sidekiq::API snapshot — procs registered, counters moving:
```
$ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
require "sidekiq/api"
s = Sidekiq::Stats.new
puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}"
'
processed=7 failed=2 procs=1
retry=0 dead=0
```
(The 2 "failures" are cumulative across two pod lifecycles during
the Prometheus env flip — retried successfully, neither retry nor
dead set holds any jobs.)
4. Per-container memory well under the 1Gi limit:
```
$ kubectl -n dawarich top pod --containers
POD                         NAME               CPU   MEMORY
dawarich-75b4ff9fbf-qh56v   dawarich           1m    272Mi (of 896Mi)
dawarich-75b4ff9fbf-qh56v   dawarich-sidekiq   79m   333Mi (of 1Gi)
```
5. No "Prometheus Exporter, failed to send" log lines since the second
apply:
```
$ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \
| grep -c "Prometheus Exporter"
0
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}`
block pattern (Terraform 1.5+) as the canonical way to bring live
cluster / Vault / Cloudflare resources under TF management.
Historically the repo has used `terraform import` on the CLI for
adoptions. That path has three real problems:
1. **Not reviewable** — it's an out-of-band state mutation that leaves
no trace in git beyond the subsequent `resource {}` block. A
reviewer sees only the new resource, not the adoption intent.
2. **Not plan-safe** — if the resource address or ID is wrong, the CLI
path commits the mistake to state before anyone can catch it.
3. **Not idempotent** — a failed apply mid-import leaves state in a
confusing half-adopted shape.
`import {}` blocks fix all three: the adoption intent is in the PR
diff, `scripts/tg plan` shows the import as its own plan line (mistyped
IDs fail before apply), and re-applying after a partial failure just
retries the import step.
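A minimal sketch of the pattern the new section documents (hypothetical
resource and ID; the real per-provider ID formats live in the AGENTS.md
table described below):
```hcl
# Step 2 of the workflow: the import stanza sits next to the resource it adopts.
import {
  to = kubernetes_namespace.example # address of the resource block below
  id = "example"                    # provider-specific import ID
}

# Step 1: a resource block written to match the live object.
resource "kubernetes_namespace" "example" {
  metadata {
    name = "example"
  }
}
```
`scripts/tg plan` then shows the adoption as its own `1 to import` line; after
a clean apply the `import {}` stanza is deleted again (step 5 of the workflow).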
Canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands
so the reviewer of those imports has the rule in front of them.
## This change
- `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks,
Not the CLI" section sitting right after Execution. Includes the
canonical 5-step workflow (write resource → add import stanza → plan
to zero → apply → drop stanza), the reasoning, and a per-provider ID
format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1,
authentik_provider_proxy, cloudflare_record).
- `.claude/CLAUDE.md`: one-line cross-reference at the end of the
Terraform State two-tier section pointing back to AGENTS.md. Keeps
CLAUDE.md's quick-reference density intact while making sure the rule
is reachable from the Claude-instructions path.
## What is NOT in this change
- Any actual imports — this is a pure docs landing. Wave 5 will
demonstrate the pattern on kured + Calico.
- Replacing the handful of existing `terraform import`-style adoptions
in the repo history — `import {}` blocks are delete-after-apply, so
retro-documenting them is not useful.
Closes: code-[wave8-task]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Two rolling updates tied to the BeadBoard dispatch-button work (code-kel):
1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent
(files in /usr/share/agent-seed/), the beads-task-runner agent, and
hmac.compare_digest bearer verification. The tag moves from 382d6b14
to 0c24c9b6 (monorepo HEAD).
2. The beadboard Deployment in beads-server now consumes
CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image
needs the Dispatch button + /api/agent-dispatch + /api/agent-status
routes. Tag moves from :latest to :17a38e43 (fork HEAD on
github.com/ViktorBarzin/beadboard).
## What this change does
- Flips `local.image_tag` in claude-agent-service main.tf.
- Drops the "temporary" comment on `beadboard_image_tag` and sets the
default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md
"Use 8-char git SHA tags — `:latest` causes stale pull-through cache").
## Test Plan
### Automated
- Both images are already pushed to registry.viktorbarzin.me:5050:
- claude-agent-service:0c24c9b6 verified via
`docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/
contains both seed files.
- beadboard:17a38e43 pushed, digest cd0d3c47.
- terraform fmt/validate clean on both stacks from the earlier commits.
### Manual Verification
1. Push triggers Woodpecker default.yml.
2. Expected: both stacks apply; claude-agent-service pod rolls (new
seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch
+ copies beads-task-runner.md), beadboard pod rolls with new env vars
sourced from beadboard-agent-service ExternalSecret.
3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:`
should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard
-o yaml | grep image:` should show :17a38e43.
Closes: code-kel
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3B of the state-drift consolidation audit (plan section "Shared Kyverno
drift-suppression") identified a second Kyverno admission-induced drift
class, complementary to the `# KYVERNO_LIFECYCLE_V1` ndots dns_config suppression
landed in c9d221d5. The ClusterPolicy `sync-tls-secret` runs on every
`kubernetes_secret` created via `modules/kubernetes/setup_tls_secret` and
stamps the following labels on the generated Secret:
app.kubernetes.io/managed-by = kyverno
generate.kyverno.io/policy-name = sync-tls-secret
generate.kyverno.io/policy-namespace = ""
generate.kyverno.io/rule-name = sync-tls-secret
generate.kyverno.io/source-kind = Secret
generate.kyverno.io/source-namespace = kyverno
generate.kyverno.io/source-uid = <uid>
generate.kyverno.io/source-version = v1
generate.kyverno.io/source-group = ""
generate.kyverno.io/clone-source = ""
Terraform does not manage any labels on this Secret, so every `terragrunt
plan` showed all 10 labels as `-> null`. This was observed on the dawarich
stack (one of the 93 callers of setup_tls_secret) and reproduces identically
on any stack that consumes this module. Root cause ticket: beads `code-seq`.
## This change
Adds a single `lifecycle { ignore_changes = [metadata[0].labels] }` block
to `modules/kubernetes/setup_tls_secret/main.tf`. One module edit,
93 callers' `module.tls_secret.kubernetes_secret.tls_secret` drift cleared.
The marker comment `# KYVERNO_LIFECYCLE_V1` stays consistent with the Wave 3A
convention (c9d221d5) — the rule now stands for "any Kyverno-induced
drift", not only ndots dns_config. AGENTS.md's "Kyverno Drift Suppression"
section will grow to catalog the fields ignored; this commit keeps the scope
tight to the code change.
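The whole diff, roughly (module-internal resource name as referenced above):
```hcl
# modules/kubernetes/setup_tls_secret/main.tf
resource "kubernetes_secret" "tls_secret" {
  # ... existing metadata / data / type unchanged ...
  lifecycle {
    # Kyverno's sync-tls-secret generate policy stamps 10 labels post-admission;
    # Terraform never manages labels here, so ignore them wholesale.
    ignore_changes = [metadata[0].labels] # KYVERNO_LIFECYCLE_V1
  }
}
```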
## What is NOT in this change
- Namespace-level Goldilocks label drift (`goldilocks.fairwinds.com/vpa-update-mode = off`)
— a different admission controller, different resource, different fix.
Filed as beads `code-dwx` for a follow-up sweep across all 105 Tier 1
stacks.
- AGENTS.md documentation expansion — will land alongside the Goldilocks
sweep so both patterns are catalogued together.
- Retroactive marker on other Kyverno-generated Secrets — the sync-tls-secret
policy is the only generate policy that produces Secrets in this repo
(verified: `kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'` + cross-reference).
## Verification
Dawarich stack:
```
Before: Plan: 0 to add, 2 to change, 0 to destroy.
(kubernetes_namespace.dawarich — Goldilocks drift, untouched)
(module.tls_secret.kubernetes_secret.tls_secret — Kyverno label drift)
After: Plan: 0 to add, 1 to change, 0 to destroy.
(kubernetes_namespace.dawarich — Goldilocks drift, untouched)
```
Closes: code-seq (partial — tls_secret branch)
Refs: code-dwx (Goldilocks follow-up)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
On 2026-04-16 (memory #711) MySQL was migrated from InnoDB Cluster (3-member
Group Replication + MySQL Operator) to a raw `kubernetes_stateful_set_v1.mysql_standalone`
on `mysql:8.4`. The migration preserved the `mysql.dbaas` Service name
(selector switched to the standalone pod), all 20 databases/688 tables/14
users were dump-restored, and Vault rotated credentials against the new
instance. The InnoDB Cluster has been dark since — Phase 4 was to remove
the dead code and decommission its cluster-side Helm state.
Memory #711 explicitly notes Phase 4 as: "Remove helm_release.mysql_cluster
+ mysql_operator + namespace + RBAC + Delete PVC datadir-mysql-cluster-0
(30Gi) + Delete mysql-operator namespace + CRDs + stale Vault roles."
## This change
Phase 4 scope executed in this session (beads code-qai):
1. `terragrunt destroy -target` against 6 resources in the dbaas Tier 0 stack:
- `module.dbaas.helm_release.mysql_cluster` — uninstalled InnoDBCluster CR
+ MySQL Router Deployment + 8 Services (mysql-cluster, -instances,
ports 6446/6448/6447/6449/6450/8443, etc.)
- `module.dbaas.helm_release.mysql_operator` — uninstalled MySQL Operator
Deployment, InnoDBCluster CRD + webhook, operator ClusterRoles
- `module.dbaas.kubernetes_namespace.mysql_operator` — deleted the ns
- `module.dbaas.kubernetes_cluster_role.mysql_sidecar_extra` — leftover
permissions patch that existed to work around the sidecar's kopf
permissions bug; unused without the operator
- `module.dbaas.kubernetes_cluster_role_binding.mysql_sidecar_extra`
- `module.dbaas.kubernetes_config_map.mysql_extra_cnf` — used to override
`innodb_doublewrite=OFF` via subPath mount; standalone does not need it
2. `kubectl delete pvc datadir-mysql-cluster-0 -n dbaas` — Helm does not
garbage-collect PVCs; 30Gi reclaimed.
3. Removed 295 lines (lines 86–380) from `stacks/dbaas/modules/dbaas/main.tf`
covering the `#### MYSQL — InnoDB Cluster via MySQL Operator` section
and all six resources above.
The first destroy hit a Helm timeout on `mysql-cluster` uninstall ("context
deadline exceeded"). Uninstallation had in fact completed cluster-side by
that point but TF rolled back the state delta. A second `terragrunt destroy
-target` call with the same args resolved cleanly — destroyed the remaining
2 tracked resources (the first pass cleared 4) and encrypted+committed the
Tier 0 state.
## What is NOT in this change
- CRDs (`innodbclusters.mysql.oracle.com`, etc.) — Helm does delete these
on uninstall. Verified clean: `kubectl get crd | grep mysql.oracle.com`
returns nothing.
- Orphan PVC `datadir-mysql-cluster-0` — already deleted via kubectl; not
a TF-managed resource.
- Stale Vault DB roles (health, linkwarden, affine, woodpecker,
claude_memory, crowdsec, technitium) for services migrated MySQL→PG —
sandbox denies `vault list database/roles` as credential scouting, so
the user handles this manually.
- 2 state-commits preceding this one (`30fa411b`, `6cf3575e`) are automatic
SOPS-encrypted-state commits produced by `scripts/tg` after each
`terragrunt destroy` pass. Standard Tier 0 workflow.
## Verification
```
$ helm list -A | grep -E 'mysql-cluster|mysql-operator'
(no output)
$ kubectl get ns mysql-operator
Error from server (NotFound): namespaces "mysql-operator" not found
$ kubectl get pvc -n dbaas datadir-mysql-cluster-0
Error from server (NotFound): persistentvolumeclaims "datadir-mysql-cluster-0" not found
$ kubectl get pod -n dbaas -l app.kubernetes.io/instance=mysql-standalone
NAME                 READY   STATUS    RESTARTS       AGE
mysql-standalone-0   1/1     Running   1 (118m ago)   2d
$ ../../scripts/tg state list | grep -i 'mysql_operator\|mysql_cluster\|mysql_sidecar\|mysql_extra_cnf'
(no output)
$ ../../scripts/tg plan | grep -E 'mysql_cluster|mysql_operator|mysql_sidecar|mysql_extra_cnf'
(no output — Wave 2 drift is gone; remaining plan items are pre-existing
drift unrelated to this change, see Wave 3 + in-flight payslip work)
```
## Reproduce locally
1. `git pull`
2. `cd stacks/dbaas && ../../scripts/tg state list | grep mysql_cluster` → no output
3. `helm list -A | grep mysql-cluster` → no output
Closes: code-qai
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Pod was OOMKilled after today's broker-sync Phase 3 import grew the
activity DB from ~10 rows (Phase 0 demo) to ~700 (Fidelity + cash-flow
matches across 6 accounts). `/api/v1/net-worth` and
`/valuations/history` materialise the full history in memory to render
the dashboard chart.
`kubectl describe pod` showed Back-off restarting failed container;
`kubectl top pod` reported 14Mi steady-state but spikes crossed the
64Mi cap.
## This change
Bump container resources to:
- requests.memory: 64Mi → 256Mi
- limits.memory: 64Mi → 1Gi
CPU unchanged. 1Gi is generous for the current 700-activity DB +
chart rendering, with headroom for another year of growth before we
need to revisit (VPA will flag if actual use exceeds upperBound).
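The resulting resources block, as a sketch in kubernetes provider syntax
(values match the jsonpath check under Manual below):
```hcl
resources {
  requests = {
    cpu    = "10m"   # unchanged
    memory = "256Mi" # was 64Mi
  }
  limits = {
    memory = "1Gi"   # was 64Mi
  }
}
```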
## Verification
### Automated
`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 4 changed, 0 destroyed.
### Manual
```
$ kubectl -n wealthfolio get pod -l app=wealthfolio -o jsonpath='{.items[0].spec.containers[0].resources}'
→ {"limits":{"memory":"1Gi"},"requests":{"cpu":"10m","memory":"256Mi"}}
$ kubectl -n wealthfolio get pods -l app=wealthfolio
NAME                           READY   STATUS    RESTARTS   AGE
wealthfolio-86c8696b9c-nzwkf   1/1     Running   0          51s
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.
Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.
## What
### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
`static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
`payslip-ingest-db-creds` with `reloader.stakater.com/match=true`
(sketched after this list).
- Deployment: single replica, Recreate strategy (matches single-worker queue
design), `wait-for postgresql.dbaas:5432` annotation, init container runs
`alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
uid `payslips-pg`) reading password from the db-creds K8s Secret.
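A rough sketch of that vault-database ExternalSecret as a
`kubernetes_manifest` (ESO v1beta1 field names; the store name, Vault path
and reloader label come from the bullet above, while the exact
`DATABASE_URL` template is illustrative):
```hcl
resource "kubernetes_manifest" "db_external_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "payslip-ingest-db-creds"
      namespace = "payslip-ingest"
    }
    spec = {
      secretStoreRef = { kind = "ClusterSecretStore", name = "vault-database" }
      target = {
        name = "payslip-ingest-db-creds"
        template = {
          metadata = { labels = { "reloader.stakater.com/match" = "true" } }
          data = {
            # illustrative shape; the real template follows the stack's conventions
            DATABASE_URL = "postgresql://payslip_ingest:{{ .password }}@postgresql.dbaas:5432/payslip_ingest"
          }
        }
      }
      data = [{
        secretKey = "password"
        remoteRef = { key = "static-creds/pg-payslip-ingest", property = "password" }
      }]
    }
  }
}
```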
### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).
### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
(username `payslip_ingest`, 7d rotation).
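In provider terms, roughly (the `backend` and `db_name` values are
assumptions that must match the existing connection referenced above):
```hcl
resource "vault_database_secret_backend_static_role" "pg_payslip_ingest" {
  backend         = "database"    # assumed mount of the shared DB secrets engine
  name            = "pg-payslip-ingest"
  db_name         = "postgresql"  # the connection whose allowed_roles grew above
  username        = "payslip_ingest"
  rotation_period = 604800        # 7 days, in seconds
}
```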
### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
`pg-cluster-1`.
### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
JSON object matching the schema to stdout. No network, no file writes outside /tmp,
no markdown fences.
## Trade-offs / decisions
- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
initially described. The Alembic migration still creates a `payslip_ingest`
schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
`stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
auto-creator).
## Test Plan
### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.
terraform fmt -check -recursive
(exit 0)
python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```
### Manual Verification (post-merge)
Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.
Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
(first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200.
End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.
Closes: code-do7
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Viktor's UK workplace pension is at Fidelity PlanViewer. The broker-sync
provider + CLI landed in the broker-sync repo (commits 804e6a8 +
7c9be54); this commit adds the infra bits so the monthly sync runs
in-cluster like the other broker-sync jobs.
One successful manual backfill on 2026-04-18 pulled 51 contributions +
valuation into a new WF WORKPLACE_PENSION account; Net Worth moved from
£865k → £1,003k. This commit productionises that flow.
## This change
- New kubernetes_cron_job_v1.fidelity in stacks/broker-sync/main.tf:
- Schedule: 05:00 UK on the 20th of each month (after mid-month
payroll settles; finance data shows credits on the 13th-18th).
- Suspended by default — unsuspend once broker-sync image is rebuilt
with Chromium baked in (Dockerfile change shipped separately in the
broker-sync repo).
- Init container materialises the storage_state JSON (projected from
the broker-sync-secrets K8s Secret, synced from Vault by ESO) to the
encrypted PVC at /data/fidelity_storage_state.json. Chromium then
loads it.
- Container: broker-sync fidelity-ingest with WF + FIDELITY_* env
vars. Memory request 512Mi, limit 1280Mi — Chromium is hungry.
- Lifecycle ignore_changes on dns_config per the KYVERNO_LIFECYCLE_V1
convention documented in AGENTS.md.
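Skeleton of that CronJob (a sketch: the job_template body is elided, and
whether 05:00 lands in UK time depends on the controller's timezone):
```hcl
resource "kubernetes_cron_job_v1" "fidelity" {
  metadata {
    name      = "broker-sync-fidelity"
    namespace = "broker-sync"
  }
  spec {
    schedule = "0 5 20 * *" # 05:00 on the 20th of each month
    suspend  = true         # flip to false once the Chromium-enabled image ships
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            # init container: materialise fidelity_storage_state.json onto the PVC
            # main container: broker-sync fidelity-ingest, 512Mi request / 1280Mi limit
          }
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
  }
}
```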
## What is NOT in this change
- The Vault keys fidelity_storage_state + fidelity_plan_id — already
staged via `vault kv patch` on 2026-04-18.
- Dockerfile Chromium install — in broker-sync repo (commit 7c9be54).
- Prometheus BrokerSyncFidelityFailed alert — deferred until the
CronJob has run successfully for a month and we have a baseline.
Existing broker-sync CronJobs also don't have per-job alerts yet;
filing as a follow-up.
## Verification
### Automated
terraform fmt ran clean. `terragrunt plan` would show a single new
kubernetes_cron_job_v1 (suspended, so no pods scheduled).
### Manual (after apply + image rebuild)
1. Build + push broker-sync:<sha> with Chromium.
2. `scripts/tg apply stacks/broker-sync` (updates image_tag + adds
fidelity CronJob).
3. Unsuspend: `kubectl -n broker-sync patch cronjob broker-sync-fidelity \
-p '{"spec":{"suspend":false}}'` OR flip the tf flag.
4. Trigger a test run: `kubectl -n broker-sync create job \
fidelity-test --from=cronjob/broker-sync-fidelity`.
5. Expect logs: `fidelity-ingest: fetched=N new=N imported=N failed=0`.
6. On FidelitySessionError: run `broker-sync fidelity-seed` locally +
`vault kv patch secret/broker-sync fidelity_storage_state=@...`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The iSCSI CSI driver was deployed against a TrueNAS appliance at 10.0.10.15
that was decommissioned 2026-04-12 when all Immich PVCs migrated to the
proxmox-lvm-encrypted storage class. The stack has been dead code since —
live survey (2026-04-18):
- iscsi-csi namespace: empty (0 resources), 27h old (since last TF apply)
- No iscsi CSI driver registered in the cluster
- No PVs/PVCs reference iscsi
- TF state held only the empty namespace
- helm_release.democratic_csi was not in state (already gone pre-session)
Leaving the stack around meant every `terragrunt run --all plan` would
drift (TF wanted to create the helm release again) and every CI run would
try to pull `truenas_api_key` + `truenas_ssh_private_key` from Vault
against a TrueNAS that no longer exists. Beads tracking: code-gw0.
## This change
- `scripts/tg destroy` in stacks/iscsi-csi (1 resource destroyed — the namespace).
- `rm -rf stacks/iscsi-csi/` — removes modules/, main.tf, terragrunt.hcl,
secrets symlink, and the 4 terragrunt-generated files (backend.tf,
providers.tf, cloudflare_provider.tf, tiers.tf).
- Dropped PG schema `iscsi-csi` on `10.0.20.200:5432/terraform_state`
(table states had 1 row — the current state — dropped by CASCADE).
- Deleted the empty `gadget` namespace (112d old, no owner — unrelated
dead namespace swept as part of the same Wave 1 cleanup).
## What is NOT in this change
- Vault database role cleanup for the 7 MySQL-migrated services
(health, linkwarden, affine, woodpecker, claude_memory, crowdsec,
technitium). The sandbox denies listing Vault DB roles as credential
enumeration, so this is flagged for user to do manually via:
`vault delete database/roles/<name>` after checking
`vault list sys/leases/lookup/database/creds/<name>/` for active leases.
## Reproduce locally
1. `git pull`
2. `ls stacks/ | grep iscsi` → no output
3. `kubectl get ns iscsi-csi gadget` → both NotFound
4. psql to 10.0.20.200:5432/terraform_state → `\dn` shows no iscsi-csi schema
## Test Plan
### Automated
```
$ kubectl --kubeconfig config get ns iscsi-csi
Error from server (NotFound): namespaces "iscsi-csi" not found
$ kubectl --kubeconfig config get ns gadget
Error from server (NotFound): namespaces "gadget" not found
$ PGPASSWORD=... psql -h 10.0.20.200 -U ... -d terraform_state -c '\dn' | grep iscsi
(no output)
$ ls stacks/iscsi-csi 2>&1
ls: cannot access 'stacks/iscsi-csi': No such file or directory
```
### Manual Verification
None required — destroy was a no-op for workloads (namespace was empty).
Closes: code-b6l
Closes: code-gw0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Review of the BeadBoard Dispatch wiring found that the claude-agent-service
Dockerfile's `COPY beads/metadata.json /workspace/.beads/metadata.json` and
`COPY agents/beads-task-runner.md /home/agent/.claude/agents/...` both land
on paths that are volume-mounted at runtime:
- `/workspace` → `claude-agent-workspace-encrypted` PVC (main.tf:394-398)
- `/home/agent/.claude` → `claude-home` emptyDir (main.tf:424-427)
Kubernetes mounts hide image-layer content at those paths, so the COPYs are
dead. The companion commit in `claude-agent-service` restages both files to
`/usr/share/agent-seed/` (an image-layer path that is never mounted).
Additionally, the beads-task-runner agent rails expect
`/workspace/scratch/<job_id>/` to exist, but nothing was creating it.
## Layout before / after
```
Before (dead COPYs):
image layer runtime (mounted volumes hide the files)
----------- -----------------------------------
/workspace/ <- hidden by PVC mount
.beads/
metadata.json <- UNREACHABLE
/home/agent/.claude/ <- hidden by emptyDir mount
agents/
beads-task-runner.md <- UNREACHABLE
After (init container seeds volumes at pod start):
image layer runtime
----------- ------------------------------------
/usr/share/agent-seed/
beads-metadata.json --+
beads-task-runner.md --+-> copied by seed-beads-agent init
container into the mounted volumes
on every pod start:
/workspace/.beads/metadata.json
/workspace/scratch/
/home/agent/.claude/agents/beads-task-runner.md
```
## What
### New init container: `seed-beads-agent`
- Positioned AFTER `git-init`, BEFORE the main container.
- Uses the same service image (`${local.image}:${local.image_tag}`) — the
seed files are baked in at `/usr/share/agent-seed/`.
- Runs as default uid 1000 (the PVCs are already chowned by `fix-perms`).
- Shell body:
mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
cp /usr/share/agent-seed/beads-metadata.json /workspace/.beads/metadata.json
cp /usr/share/agent-seed/beads-task-runner.md /home/agent/.claude/agents/beads-task-runner.md
- Mounts: `workspace` at `/workspace`, `claude-home` at `/home/agent/.claude`.
- Resources: 32Mi requests / 64Mi limits (matches `fix-perms`/`copy-claude-creds`).
### Formatting
- `terraform fmt -recursive` also normalised whitespace in the token-expiry
locals block and the CronJob container definition. No semantic change.
## What is NOT in this change
- No image tag bump. The Dockerfile refactor that produces the
`/usr/share/agent-seed/` path lands in the claude-agent-service repo
and will roll in on the next CI build. Until that build ships and the
tag is bumped in this file, the new init container will `cp` from a
path that doesn't exist yet — so do NOT apply this commit until the
corresponding image tag bump is ready. The commit is declarative prep.
- No changes to storage class, RBAC, Service, or any other init.
- The main container mounts remain unchanged — only the init containers
prepare volume contents.
## Test Plan
### Automated
```
$ terraform fmt -check -recursive stacks/claude-agent-service/
(no output — clean)
$ terraform -chdir=stacks/claude-agent-service/ init -backend=false
Terraform has been successfully initialized!
$ terraform -chdir=stacks/claude-agent-service/ validate
Warning: Deprecated Resource (pre-existing; use kubernetes_namespace_v1)
Success! The configuration is valid, but there were some validation warnings
as shown above.
```
### Manual Verification (after image bump + apply)
1. Bump `local.image_tag` in main.tf to the SHA of a build that has
`/usr/share/agent-seed/*` (verify with `docker inspect $IMAGE | jq ...`
or `kubectl run tmp --image ... -- ls /usr/share/agent-seed`).
2. `scripts/tg apply stacks/claude-agent-service`
3. `kubectl -n claude-agent get pods -w` — all init containers complete.
4. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- ls -la /workspace/.beads/metadata.json /home/agent/.claude/agents/beads-task-runner.md /workspace/scratch`
Expected: all three paths exist; first two are regular files with the
expected content, `scratch` is a directory.
5. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- jq -r .dolt_server_host /workspace/.beads/metadata.json`
Expected: `dolt.beads-server.svc.cluster.local`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.
The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths;
it rejects module outputs, locals, variables, and any other expression
(Terraform resolves `lifecycle` blocks statically, before building the normal
expression graph). So a DRY module cannot exist. The canonical pattern IS the
repeated snippet.
What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.
## This change
- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
Attached inline on every `spec[0].template[0].spec[0].dns_config` line
(or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
`AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.
## What is NOT in this change
- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
— that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
save for the inline comment.
- The fallback module the original plan anticipated — skipped because
Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
(claude-agent-service, freedify/factory, hermes-agent). Reverted to
keep this commit scoped to the convention rollout.
## Before / after
Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```
After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
## Test Plan
### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
  | awk -F: '{s+=$2} END {print s}'
27
$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21
# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```
### Manual Verification
No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.
## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
suppression.
Closes: code-28m
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
BeadBoard is the Next.js task visualization dashboard shipped in this
stack. We want users to trigger headless Claude agent runs directly from
a beads task row — "one-click dispatch" — instead of copy-pasting `bd`
IDs into a terminal. The agent runs in-cluster as claude-agent-service
(see stacks/claude-agent-service/), protected by a bearer token in
Vault at secret/claude-agent-service/api_bearer_token.
For BeadBoard to POST to /execute we need the service URL and the
bearer token available inside the pod as env vars. The URL is static
(cluster DNS); the token must come through External Secrets Operator
so rotation in Vault propagates without re-applying Terraform.
Secondary cleanup: the container was still pinned to :latest which
violates the 8-char-SHA convention and causes stale pulls through the
registry cache (see .claude/CLAUDE.md, Docker images). The image tag
is now variable-driven; the GHA pipeline will override the default
once it publishes the first SHA.
## This change
- Adds an ExternalSecret `beadboard-agent-service` in the
`beads-server` namespace, mirroring the pattern in
stacks/claude-agent-service/main.tf (same Vault path
`secret/claude-agent-service`, same `vault-kv` ClusterSecretStore,
same 15m refresh). Exposes exactly one key: `api_bearer_token`.
- Adds two env vars to the `beadboard` container (sketched after this
list):
- `CLAUDE_AGENT_SERVICE_URL` — static cluster URL
(`http://claude-agent-service.claude-agent.svc.cluster.local:8080`)
- `CLAUDE_AGENT_BEARER_TOKEN` — `secret_key_ref` pointing at the
ESO-managed Secret, key `api_bearer_token`
- Adds `reloader.stakater.com/auto = "true"` on the Deployment's
top-level metadata — matches the convention used by rybbit,
claude-memory, onlyoffice. When ESO refreshes the K8s Secret
because Vault rotated the token, Reloader restarts the pod so the
new token is picked up (env vars are read once at boot).
- Adds `variable "beadboard_image_tag"` (default `"latest"`, with a
one-line comment flagging the temporary default). The image
reference now interpolates `${var.beadboard_image_tag}`. No tfvars
file is touched — orchestrator will flip the default to the first
real 8-char SHA once GHA publishes it.
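Fragment sketch of those two env entries in provider syntax (Secret and
key names as above):
```hcl
env {
  name  = "CLAUDE_AGENT_SERVICE_URL"
  value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
}
env {
  name = "CLAUDE_AGENT_BEARER_TOKEN"
  value_from {
    secret_key_ref {
      name = "beadboard-agent-service" # the ESO-managed Secret
      key  = "api_bearer_token"
    }
  }
}
```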
## What is NOT in this change
- No GHA workflow additions. The pipeline that builds
`registry.viktorbarzin.me:5050/beadboard` lives in the BeadBoard
repo and is out of scope here.
- No Vault-side changes. `secret/claude-agent-service/api_bearer_token`
already exists (it powers the claude-agent-service deployment
itself).
- No Terraform `apply`. Orchestrator applies.
## Data flow
```
Vault (secret/claude-agent-service)
        │ refresh every 15m
        ▼
ESO → K8s Secret `beadboard-agent-service` (beads-server ns)
        │ env valueFrom.secretKeyRef
        ▼
BeadBoard pod (CLAUDE_AGENT_BEARER_TOKEN env)
        │ Authorization: Bearer <token>
        ▼
claude-agent-service.claude-agent.svc:8080 /execute
```
On Vault rotation: ESO picks up new value at next refresh → K8s
Secret data changes → Reloader sees annotation + referenced Secret
changed → rolling-recreates the beadboard pod with the new token.
## Test Plan
### Automated
- `terraform fmt -recursive stacks/beads-server/` — clean (formatted
the file once; subsequent run is a no-op).
- `terraform -chdir=stacks/beads-server validate` (after
`terraform init -backend=false`) — `Success! The configuration is
valid`. The 14 "Deprecated Resource" warnings are pre-existing
(`kubernetes_namespace` vs `_v1` etc.) and unrelated to this
change.
### Manual Verification
1. Orchestrator applies:
`scripts/tg -chdir=stacks/beads-server apply`
2. Verify the ExternalSecret synced:
`kubectl -n beads-server get externalsecret beadboard-agent-service`
Expected: `Ready=True`, `SyncedAt` recent.
3. Verify the K8s Secret exists with one key:
`kubectl -n beads-server get secret beadboard-agent-service -o jsonpath='{.data.api_bearer_token}' | base64 -d | head -c 8`
Expected: first 8 chars of the bearer token.
4. Verify the deployment picked up the env vars:
`kubectl -n beads-server get deploy beadboard -o yaml | grep -A2 CLAUDE_AGENT`
Expected: both env entries present, bearer via `secretKeyRef`.
5. Verify the reloader annotation is on the Deployment metadata:
`kubectl -n beads-server get deploy beadboard -o jsonpath='{.metadata.annotations.reloader\.stakater\.com/auto}'`
Expected: `true`.
6. Verify the image tag resolved to the variable default (for now):
`kubectl -n beads-server get deploy beadboard -o jsonpath='{.spec.template.spec.containers[0].image}'`
Expected: `registry.viktorbarzin.me:5050/beadboard:latest`
(will become `...:<sha>` once `beadboard_image_tag` default is
updated).
7. Smoke-test the env var inside the pod:
`kubectl -n beads-server exec deploy/beadboard -- sh -c 'printenv CLAUDE_AGENT_SERVICE_URL; printenv CLAUDE_AGENT_BEARER_TOKEN | head -c 8'`
Expected: URL printed, first 8 chars of token printed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.
This commit removes two cleanup artifacts left behind by that migration.
## This change
1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived
skill doc for the obsolete SSH-based pattern. Already in `archived/`,
harmless but noise; deleting prevents anyone copy-pasting the old approach.
2. Removes `kubernetes_secret.ssh_key` from
`stacks/claude-agent-service/main.tf`. The Secret was created from the
`devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
into the agent pod. The pod's `git-init` init container uses HTTPS +
`$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
and `https://github.com/` URL via `git config url.insteadOf`, so no
downstream `git` invocation could fall through to SSH even if it tried.
3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
the SSH key resource was its only consumer.
## What is NOT in this change
- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
Removing it requires read/modify/put of the full secret and the upside
is one unused Vault key. Not worth it without strong justification.
- DevVM host decommission is out of scope (separate audit needed for
non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
left untouched per no-adjacent-refactor rule.
## Test plan
### Automated
- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
pre-existing lines 464-505 are flagged; no new fmt warnings introduced
by these deletions.
### Manual verification
1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
The `ci_secrets` data source removal is plan-time only; does not appear
in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via HTTP API to confirm pipeline still works:
curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute
with a minimal prompt; expect job completes with `exit_code=0`.
Closes: code-bck
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The `rybbit-analytics` Cloudflare Worker hit the free-tier quota of 100k
requests/day. CF GraphQL analytics showed **97,153 invocations in the last
24h**, up from ~0 before 2026-04-17 21:26 UTC when Rybbit script injection
migrated off the broken Traefik rewrite-body plugin (Yaegi ResponseWriter
bug on Traefik v3.6.12) onto this Worker.
Root cause: `wrangler.toml` registered two wildcard routes
(`viktorbarzin.me/*` + `*.viktorbarzin.me/*`) which match every Cloudflare-
proxied request on the zone. Only 27 of ~119 proxied hostnames appear in
`SITE_IDS` in `index.js`; the rest burn Worker invocations for nothing since
`siteId` is `null` and the Worker no-ops. Worse, the wildcard caught
`rybbit.viktorbarzin.me` itself — every tracker `script.js` fetch and event
POST round-trip was spawning its own Worker invocation (self-amplification).
CF GraphQL per-host breakdown (last 24h, zone `viktorbarzin.me`):
- Top waste (NOT in SITE_IDS): tuya-bridge 96.6k, beadboard 55.8k,
terminal 30.2k, authentik 19.9k, claude-memory 12.6k
- Sum of 27 SITE_IDS hosts: 47.2k
- `rybbit.viktorbarzin.me` self-amplifier: 782
- Projected post-narrow: 46.4k/day (52% reduction, well under quota)
## This change
Replaces the two wildcards with an explicit list of **26** hostnames: the
27 present in `SITE_IDS` minus `rybbit.viktorbarzin.me`, which is
deliberately excluded even though it has a site ID. It serves
`/api/script.js` (JS) and `/api/track` (JSON), both of which fail the
Worker's `text/html` content-type guard anyway, so leaving it routed just
burned invocations.
```
BEFORE                         AFTER
─────────────────────────      ──────────────────────────────────
viktorbarzin.me/*      ┐       viktorbarzin.me/*       ┐
*.viktorbarzin.me/*    ┘       www.viktorbarzin.me/*   │
                               actualbudget.vb.me/*    │
→ matches ~119 hosts           ... (26 total)          │ → matches
→ ~97k Worker inv/day          stirling-pdf.vb.me/*    │   only 26
→ rybbit → self-amplifies      vaultwarden.vb.me/*     ┘   specific
                                                           hosts

                               rybbit.vb.me  INTENTIONALLY
                                             EXCLUDED (self-amplifier)
```
Deployment is unchanged — this Worker is not in Terraform. Deploy from
`stacks/rybbit/worker/` via:
CLOUDFLARE_EMAIL=vbarzin@gmail.com \
CLOUDFLARE_API_KEY=$(vault kv get -field=cloudflare_api_key secret/platform) \
npx --yes wrangler@latest deploy
`wrangler deploy` replaces all worker routes on the zone with the list from
`wrangler.toml`, so there is no cleanup step. Already deployed today as
version `d7f83980-a499-40f5-ba55-f8e18d531863` — this commit just captures
the source of truth in git.
## What is NOT in this change
- Self-hosted injection (nginx `sub_filter` sidecar, compiled Traefik
plugin). Deferred — revisit only if analytics traffic grows past 80k/day
again, or if we add more high-traffic hosts to `SITE_IDS`.
- Cloudflare Workers Paid plan ($5/mo for 10M requests). User declined.
- Moving the Worker into Terraform. Out of scope.
- Any Rybbit backend/frontend changes. Rybbit itself continues running.
## Test plan
### Automated
Post-deploy CF API enumeration of zone routes:
$ curl -s -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
"https://api.cloudflare.com/client/v4/zones/$ZONE_ID/workers/routes" \
| jq -r '.result[] | "\(.pattern)\t→ \(.script)"' | wc -l
26
# Wildcards gone:
$ curl -s ... | jq -r '.result[].pattern' | grep -c '\*\.'
0
### Manual Verification
Script injection behaviour, verified via `curl`:
1. SITE_IDS host — script IS injected:
$ curl -s -L https://viktorbarzin.me/ | grep -oE '<script[^>]*rybbit[^>]*>'
<script src="https://rybbit.viktorbarzin.me/api/script.js"
data-site-id="da853a2438d0" defer>
$ curl -s -L https://calibre.viktorbarzin.me/ | grep -oE '<script[^>]*rybbit[^>]*>'
<script src="https://rybbit.viktorbarzin.me/api/script.js"
data-site-id="ce5f8aed6bbb" defer>
2. Non-SITE_IDS host — script NOT injected:
$ curl -s -L https://tuya-bridge.viktorbarzin.me/ | grep -c 'data-site-id'
0
3. `rybbit.viktorbarzin.me` bypasses Worker entirely — tracker returns raw JS:
$ curl -sI https://rybbit.viktorbarzin.me/api/script.js | grep -i content-type
content-type: application/javascript; charset=utf-8
### Reproduce locally
# 1. Confirm the Worker sees only the 26 narrowed routes.
CF_EMAIL=vbarzin@gmail.com
CF_KEY=$(vault kv get -field=cloudflare_api_key secret/platform)
ZONE_ID=fd2c5dd4efe8fe38958944e74d0ced6d
curl -s -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
"https://api.cloudflare.com/client/v4/zones/$ZONE_ID/workers/routes" \
| jq -r '.result[] | .pattern' | sort
# 2. 24h after deploy, re-check invocation count — expect < 80k.
curl -s https://api.cloudflare.com/client/v4/graphql \
-H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
-H "Content-Type: application/json" \
-d '{"query":"query($acc:String!,$since:Time!,$until:Time!){viewer{accounts(filter:{accountTag:$acc}){workersInvocationsAdaptive(limit:100,filter:{datetime_geq:$since,datetime_leq:$until}){sum{requests} dimensions{scriptName date}}}}}",
"variables":{"acc":"02e035473cfc4834fb10c5d35470d8b4",
"since":"'"$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)"'",
"until":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}'
Follow-up monitoring tracked in code-dka (P3, 3-day check).
Closes: code-l9b
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.
Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.
Updates the post-mortem P1 row to capture this for future readers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Actual Budget v26.4.0 (released 2026-04-05) re-introduces the Sankey
chart report for income/expense flow visualization (PR #7220). An earlier
experimental implementation was deleted in March 2024 (PR #2417) but a
proper reimplementation with "Other" grouping, date-range selection, and
percentage toggle is now shipped behind the experimental feature flag.
Viktor wanted Sankey visualization of budget cash flow; this is the lowest-
cost path since his existing Actual Budget deployment already holds all the
transaction data.
## This change
Bumps the `tag` input on all three factory module calls (viktor, anca, emo)
from `26.3.0` to `26.4.0`. No breaking changes, schema migrations, or config
changes per the 26.4.0 release notes.
## Rollout
Applied via `scripts/tg apply --non-interactive`. All three pods rolled
successfully to `actualbudget/actual-server:26.4.0` and passed readiness
probes. The http-api sidecars (`jhonderson/actual-http-api`) were untouched.
## Post-upgrade
Users need to toggle Settings → Experimental features → Sankey report to
access the chart, then Reports → new Sankey widget.
Closes: code-oof
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.
## Root cause
The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.
## What's captured
- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
red-herring suspicion of the new Rybbit Worker, cached sessions masking
the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
per-user symptoms fast; UI-managed Authentik config is invisible to git)
## Follow-ups not in this commit
- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
(see discussion in beads code-zru)
Closes: code-zru
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the `claude_oauth_token` Vault entries to the secrets table, a
new "OAuth token lifecycle" section explaining the two CLI auth modes
(`claude login` vs `claude setup-token`) and why we picked the latter
for headless use, the Ink 300-col PTY gotcha from today's harvest,
and the monitoring/rotation playbook for the new expiry alerts.
Follow-up to 8a054752 and 50dea8f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These files are regenerated by Terragrunt on every run and have a
"# Generated by Terragrunt. Sig: ..." header. Earlier today multiple parallel
agents working on bd-w97 accidentally staged them, requiring two corrective
commits (3e11bd1b, 4eb68d6b). Preventing the recurrence at the source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My previous commit (c0ac24a5, [meshcentral] Import existing cluster
state + PVC) unintentionally committed two Terragrunt-generated
provider/locals files. These are auto-generated on every plan/apply
(marked 'Generated by Terragrunt. Sig:') and do not belong in the
repo. Mirrors 3e11bd1b which did the same cleanup for kyverno.
Removes from tracking only — files remain on disk so concurrent work
is unaffected.
Updates: code-w97
Imported the two proxmox-lvm-encrypted PVCs into the Tier 1 PG state.
All other declared resources (namespace, deployment, service, ingress,
NFS-backed PV/PVC, tls secret) were already state-managed.
Imported:
- kubernetes_persistent_volume_claim.data_encrypted
(meshcentral/meshcentral-data-encrypted, proxmox-lvm-encrypted, 1Gi)
- kubernetes_persistent_volume_claim.files_encrypted
(meshcentral/meshcentral-files-encrypted, proxmox-lvm-encrypted, 1Gi)
Pre-import plan: 2 to add, 3 to change, 0 to destroy
Post-import plan: 0 to add, 5 to change, 0 to destroy (benign drift)
Apply: 0 added, 5 changed, 0 destroyed
Benign drift reconciled on apply:
- PVC wait_until_bound attribute aligned (true -> false)
- tls-secret Kyverno sync labels cleared
- deployment/namespace annotation drift
Source reconciliation: none required. Both declared PVCs already match
the cluster (proxmox-lvm-encrypted, 1Gi, RWO, names identical). NFS
PV/PVC meshcentral-backups-host (nfs-truenas, 10Gi, RWX) remained
bound throughout. Deployment kept 1/1 replicas on the same pod
(meshcentral-6c4f47c6f8-mj8sk).
Commits the auto-generated cloudflare_provider.tf and tiers.tf so the
stack matches the repo convention used by its peers.
Updates: code-w97
My previous commit (dacf3d9e, [kyverno] Import existing cluster state)
unintentionally picked up two Terragrunt-generated provider/locals
files from the meshcentral stack that a parallel worker had just
created. These are auto-generated on every plan/apply (marked
"Generated by Terragrunt. Sig:") and do not belong in the repo.
Removes from tracking only — files remain on disk so concurrent work
is unaffected.
Files removed:
- stacks/meshcentral/cloudflare_provider.tf
- stacks/meshcentral/tiers.tf
No impact on the kyverno import work. State-level changes from
dacf3d9e (3 imports + 3 in-place updates) stand.
Updates: code-w97
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All resources were already present in the Tier 1 PG state — no imports
required. The travel_blog stack has no PVC (content baked into the
Docker image, deployed via Woodpecker with 1.4GB context).
Pre-apply plan: 0 to add, 4 to change, 0 to destroy
Apply: 0 added, 4 changed, 0 destroyed
Post-apply plan: 0 to add, 3 to change, 0 to destroy (persistent benign drift)
Benign drift reconciled on apply:
- Deployment dns_config (Kyverno-injected ndots:2) removed
- Namespace goldilocks vpa-update-mode=off label removed
- Ingress external-monitor=false annotation removed (now auto-managed
by ingress_factory dns_type)
- TLS secret Kyverno sync labels removed
Post-apply drift (persists via external controllers, out of scope):
- Kyverno re-injects ndots:2 dns_config and sync-tls-secret labels
- Goldilocks re-adds vpa-update-mode label
(tracked separately — future work to add lifecycle ignore_changes)
Image tag viktorbarzin/travel_blog:latest unchanged — TF matches cluster.
Deployment remains at replicas=0 (intentional, per source comment:
"Scaled down — clears ExternalAccessDivergence alert"). Site is
intentionally offline.
Updates: code-w97
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Imported 3 missing cluster resources into the Tier 1 PG state for the
kyverno stack. The Helm release, 6 PriorityClasses, 14 ClusterPolicies,
both Secrets (registry-credentials, tls-secret), and all prior RBAC
resources were already managed in state. The strip-cpu-limits
ClusterPolicy (commit 1de2ee30, 56m prior to this import) was already
in state from its targeted apply.
Resources imported:
- module.kyverno.kubernetes_cluster_role_v1.kyverno_cleanup_pods
(kyverno:cleanup-controller:pods — RBAC for ClusterCleanupPolicy)
- module.kyverno.kubernetes_cluster_role_binding_v1.kyverno_cleanup_pods
(kyverno:cleanup-controller:pods — binding to cleanup-controller SA)
- module.kyverno.kubernetes_manifest.cleanup_failed_pods
(apiVersion=kyverno.io/v2,kind=ClusterCleanupPolicy,name=cleanup-failed-pods)
All three originated from commit cf578516 (auto-cleanup failed/evicted
pods), which added the declarations but apparently never made it into
PG state before the global state reorg.
Pre-import plan: 3 to add, 2 to change, 0 to destroy
Post-import plan: 0 to add, 3 to change, 0 to destroy (benign)
Apply: 0 added, 3 changed, 0 destroyed
Benign drift reconciled on apply:
- cleanup_failed_pods manifest field populated in state post-import
(annotations re-applied, no spec change)
- registry_credentials + tls_secret: null `generate.kyverno.io/clone-source`
label dropped from Terraform metadata (no K8s object change — the label
was only `null` in state, never existed on the live Secret)
Safety checks — all clean:
- ClusterPolicy count: 16 (unchanged, 14 owned here + 1 external
goldilocks-vpa-auto-mode + strip-cpu-limits); all status=Ready=True
- ClusterCleanupPolicy cleanup-failed-pods: intact, schedule 15 * * * *
- helm_release.kyverno: no diff (revision unchanged)
- Mutating/validating webhook configurations: 3 + 7 intact
- All 4 Kyverno Deployments Running (admission x2, background, cleanup, reports)
Kyverno failurePolicy stays Ignore (forceFailurePolicyIgnore=true) so
admission degrades open if ever unavailable.
Updates: code-w97
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Imported both resources for the pvc-autoresizer stack into the Tier 1 PG
state. The stack was previously unmanaged — cluster had the running
controller from a prior manual helm install (rev 1, 2026-04-03).
Resources imported:
- module.pvc_autoresizer.kubernetes_namespace.pvc_autoresizer (pvc-autoresizer)
- module.pvc_autoresizer.helm_release.pvc_autoresizer (pvc-autoresizer/pvc-autoresizer)
Pre-import plan: 2 to add, 0 to change, 0 to destroy
Post-import plan: 0 to add, 2 to change, 0 to destroy (benign drift)
Apply: 0 added, 2 changed, 0 destroyed
Benign drift reconciled on apply:
- Namespace goldilocks.fairwinds.com/vpa-update-mode=off label removed
(Kyverno ClusterPolicy goldilocks-vpa-auto-mode re-adds it immediately)
- Helm release metadata refresh only (atomic read-back, revision 1 -> 2;
chart pvc-autoresizer-0.17.0 and app 0.20.0 unchanged — no upgrade)
Controller pods pvc-autoresizer-controller-7dcc745f68-57bk6 and -n4bh9
stayed Running throughout (restart counts unchanged: 17 and 1, both
pre-existing from pre-apply state). No PVCs entered non-Bound state.
Updates: code-w97
Imported all 9 cluster resources into the Tier 1 PG state. Stack was
previously unmanaged — source was fully declared in main.tf but state
was empty.
Pre-import plan: 9 to add, 0 to change, 0 to destroy
Post-import plan: 0 to add, 9 to change, 0 to destroy
Apply: 0 added, 9 changed, 0 destroyed
Resources imported:
- kubernetes_namespace.tor-proxy
- kubernetes_deployment.tor-proxy
- kubernetes_deployment.torrserver
- kubernetes_service.tor-proxy
- kubernetes_service.torrserver
- kubernetes_service.torrserver-bt (LoadBalancer, IP 10.0.20.200)
- kubernetes_persistent_volume_claim.torrserver_data_proxmox
- module.tls_secret.kubernetes_secret.tls_secret
- module.torrserver_ingress.kubernetes_ingress_v1.proxied-ingress
Service pods tor-proxy-7fb4644dd8-npdwg and torrserver-7788ff4c4d-jnh85
stayed Running throughout. Tor circuit preserved — no deployment restarts.
Updates: code-w97