Compare commits

22 commits

Author SHA1 Message Date
eee694c915 [payslip-extractor] Add PAYSLIP_TEXT fast path
payslip-ingest now runs pdftotext locally before calling claude-agent-service,
shrinking the prompt ~20-100x. Agent file documents both paths: PAYSLIP_TEXT
(fast) and PDF_BASE64 (fallback for scanned-image PDFs or when pdftotext
fails).
2026-04-18 22:48:07 +00:00
Viktor Barzin
b28c76e371 [infra] Wire drift detection to Pushgateway + alert on stale/unaddressed drift
## Context

Wave 7 of the state-drift consolidation plan. The drift-detection pipeline
(`.woodpecker/drift-detection.yml`) already ran terragrunt plan on every
stack daily and Slack-posted a summary, but its output was ephemeral —
nothing persisted in Prometheus, so there was no historical view of which
stacks drift, when, or for how long. Following the convergence work in
waves 1–6 (168 KYVERNO_LIFECYCLE_V1 markers, 4 stacks adopted, Phase 4
mysql cleanup), the baseline is clean enough that *new* drift should
stand out. That only works if we have observability.

## This change

### `.woodpecker/drift-detection.yml`

Enhances the existing cron pipeline to push a batched set of metrics to
the in-cluster Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`)
after each run:

| Metric | Kind | Purpose |
|---|---|---|
| `drift_stack_state{stack}` | gauge, 0/1/2 | 0=clean, 1=drift, 2=error |
| `drift_stack_first_seen{stack}` | gauge (unix seconds) | Preserved across runs for drift-age tracking |
| `drift_stack_age_hours{stack}` | gauge (hours) | Computed from `first_seen` |
| `drift_stack_count` | gauge (count) | Total drifted stacks this run |
| `drift_error_count` | gauge (count) | Total plan-errored stacks |
| `drift_clean_count` | gauge (count) | Total clean stacks |
| `drift_detection_last_run_timestamp` | gauge (unix seconds) | Pipeline heartbeat |

First-seen preservation: on each drift hit, the pipeline queries
Pushgateway for the existing `drift_stack_first_seen{stack=<stack>}`
value. If present and non-zero, reuse it; otherwise stamp with `NOW`.
That means age-hours grows monotonically until the stack goes clean
(at which point state=0 resets first_seen by omission).
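
The first-seen rule above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual script: it assumes the Pushgateway `/metrics` output has already been fetched into a string, and the helper names (`previous_first_seen`, `resolve_first_seen`, `age_hours`) are hypothetical.

```python
import re

# Matches one sample line in Pushgateway's text exposition output, e.g.
# drift_stack_first_seen{instance="",job="drift",stack="kured"} 1.7765e+09
_SAMPLE = re.compile(r'^drift_stack_first_seen\{([^}]*)\}\s+(\S+)\s*$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def previous_first_seen(metrics_text: str, stack: str) -> float:
    """Return the stored first_seen for `stack`, or 0.0 if it is absent."""
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line)
        if not match:
            continue
        labels = dict(_LABEL.findall(match.group(1)))
        if labels.get("stack") == stack:
            return float(match.group(2))
    return 0.0


def resolve_first_seen(metrics_text: str, stack: str, now: float) -> float:
    """Reuse a present, non-zero value; otherwise stamp with `now`."""
    prev = previous_first_seen(metrics_text, stack)
    return prev if prev > 0 else now


def age_hours(first_seen: float, now: float) -> float:
    """Drift age in hours, as exported by drift_stack_age_hours."""
    return (now - first_seen) / 3600.0
```

Because a clean stack simply stops exporting `drift_stack_first_seen`, the next drift hit finds no stored value and starts a fresh clock.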

Atomic batched push: all metrics for a run are POST'd in a single
HTTP request. Pushgateway doesn't support atomic multi-metric updates
natively, but batching at the pipeline layer prevents half-updated
state if the curl is interrupted mid-run: a failed push fails the
entire run, which then alerts via `DriftDetectionStale`.
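
As a rough sketch of the batching idea, the whole run can be rendered into one text-exposition body before anything is sent. The function name `render_batch` is hypothetical, and emitting `first_seen` and `age_hours` only for drifted stacks is an assumption drawn from the "reset by omission" behaviour described above.

```python
def render_batch(states: dict[str, int], first_seen: dict[str, float], now: float) -> str:
    """Render every metric for one run as a single text-exposition body.

    states maps stack name to 0 (clean), 1 (drift) or 2 (error).
    first_seen maps drifted stack names to their unix-seconds timestamp.
    """
    lines = []
    for stack, state in sorted(states.items()):
        lines.append(f'drift_stack_state{{stack="{stack}"}} {state}')
        if state == 1:  # assumption: age tracking is only exported while drifted
            seen = first_seen[stack]
            lines.append(f'drift_stack_first_seen{{stack="{stack}"}} {int(seen)}')
            lines.append(f'drift_stack_age_hours{{stack="{stack}"}} {(now - seen) / 3600:.2f}')
    values = list(states.values())
    lines.append(f"drift_stack_count {values.count(1)}")
    lines.append(f"drift_error_count {values.count(2)}")
    lines.append(f"drift_clean_count {values.count(0)}")
    lines.append(f"drift_detection_last_run_timestamp {int(now)}")
    return "\n".join(lines) + "\n"
```

The returned string would then be sent as the body of one request to the Pushgateway's `/metrics/job/<job>` endpoint; the job name is left open here because the source does not state it.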

### `stacks/monitoring/.../prometheus_chart_values.tpl`

New `Infrastructure Drift` alert group with three rules:

- **DriftDetectionStale** (warning, 30m): fires if
  `drift_detection_last_run_timestamp` is older than 26h. Gives a 2h
  grace window on top of the 24h cron so transient Pushgateway or
  cluster unavailability doesn't false-alarm. Guards against the
  pipeline silently failing or the cron not firing.
- **DriftUnaddressed** (warning, 1h): fires if any stack has
  `drift_stack_age_hours > 72` — three days of unacknowledged drift.
  Three days is long enough to absorb weekends + typical review cycles
  but short enough to force follow-up before drift compounds.
- **DriftStacksMany** (warning, 30m): fires if `drift_stack_count > 10`
  in a single run. Sudden wide drift usually signals systemic causes
  (new admission webhook, provider version bump, cluster-wide CRD
  upgrade) rather than individual configuration errors, and the alert
  body nudges toward that diagnosis.
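
The three thresholds can be sanity-checked as plain arithmetic. The sketch below mirrors only the comparison in each rule and ignores the `for:` durations (30m / 1h / 30m), which Prometheus applies on top; `firing_alerts` is a hypothetical helper, not part of the chart values.

```python
STALE_AFTER_S = 26 * 3600   # DriftDetectionStale: 24h cron + 2h grace
UNADDRESSED_AFTER_H = 72    # DriftUnaddressed: three days
MANY_STACKS = 10            # DriftStacksMany


def firing_alerts(now: float, last_run_ts: float,
                  age_hours_by_stack: dict[str, float],
                  drift_stack_count: int) -> list[str]:
    """Return the names of the rules whose threshold condition holds."""
    alerts = []
    if now - last_run_ts > STALE_AFTER_S:
        alerts.append("DriftDetectionStale")
    if any(age > UNADDRESSED_AFTER_H for age in age_hours_by_stack.values()):
        alerts.append("DriftUnaddressed")
    if drift_stack_count > MANY_STACKS:
        alerts.append("DriftStacksMany")
    return alerts
```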

Applied to `stacks/monitoring` this session — 1 helm_release changed,
no other drift surfaced.

## What is NOT in this change

- The Wave 7 **GitHub issue auto-filer** — the full plan included
  filing a `drift-detected` issue per drifted stack. Deferred because
  it requires wiring the `file-issue` skill's convention + a gh token
  exposed to Woodpecker, both of which need separate setup. The Slack
  alert covers the same need at lower fidelity in the meantime.
- The Wave 7 **PG drift_history table** — would provide the richest
  historical view but adds a new DB schema dependency for a CI
  pipeline. Pushgateway + Prometheus handle the 72h window we care
  about; PG history is nice-to-have for quarterly reviews.
- Auto-apply marker (`# DRIFT_AUTO_APPLY_OK`) — premature until the
  baseline has been stable for a few cycles.

Follow-ups tracked: file dedicated beads items for GH-issue filer + PG
drift_history.

## Verification

```
$ cd stacks/monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

# After next cron run (cron expr: "drift-detection" in Woodpecker UI):
$ curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
    | grep -c '^drift_'
# expect a positive number
```

## Reproduce locally
1. `git pull`
2. Check Prometheus rules: `curl -sk https://prometheus.viktorbarzin.lan/api/v1/rules | jq '.data.groups[] | select(.name == "Infrastructure Drift")'`
3. Manually trigger the Woodpecker cron and watch Pushgateway populate.

Refs: Wave 7 umbrella (code-hl1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:42:51 +00:00
Viktor Barzin
124a756351 [infra] Adopt local-path-provisioner into Terraform (Wave 5c)
## Context

Wave 5c of the state-drift consolidation plan. `local-path-provisioner`
(Rancher's node-local dynamic PV provisioner) was deployed 55d ago via raw
`kubectl apply` against the upstream manifest. It serves as the cluster's
default StorageClass and is still actively in use — the 2026-04-18 live
survey showed helper-pod-delete cycles running against existing PVCs.

Unmanaged until now: namespace, ServiceAccount, ClusterRole (+ binding),
ConfigMap with provisioner config.json + helperPod.yaml + setup/teardown
scripts, StorageClass `local-path` (default), and the 1-replica
Deployment itself. Seven resources total.

## This change

New Tier 1 stack `stacks/local-path/` with all seven resources, adopted
via Wave 8's HCL `import {}` block convention (commit 8a99be11):

- `kubernetes_namespace.local_path_storage` → id `local-path-storage`
- `kubernetes_service_account.local_path_provisioner` →
  id `local-path-storage/local-path-provisioner-service-account`
- `kubernetes_cluster_role.local_path_provisioner` → id `local-path-provisioner-role`
- `kubernetes_cluster_role_binding.local_path_provisioner` → id `local-path-provisioner-bind`
- `kubernetes_config_map.local_path_config` →
  id `local-path-storage/local-path-config`
- `kubernetes_storage_class_v1.local_path` → id `local-path`
- `kubernetes_deployment.local_path_provisioner` →
  id `local-path-storage/local-path-provisioner`

Conventions applied:
- Namespace gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
  Goldilocks `vpa-update-mode` label drift (Wave 3B, commit 8b43692a).
- Deployment gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
  ndots dns_config drift (Wave 3A, commit c9d221d5 + 327ce215).
- ServiceAccount + pod spec pin `automount_service_account_token = false`
  and `enable_service_links = false` to match the live spec exactly.
- `import {}` stanzas removed after the apply converged to zero-diff
  (per AGENTS.md → "Adopting Existing Resources").

## Apply outcome

`Apply complete! Resources: 7 imported, 0 added, 3 changed, 0 destroyed.`

The 3 in-place changes were:
- `kubernetes_config_map.local_path_config.data` — whitespace/format
  reshuffle. The live ConfigMap contained the upstream manifest's
  hand-indented JSON + YAML; my HCL uses canonical `jsonencode` /
  heredoc. Semantic content identical, so the provisioner continued
  running (no pod restart).
- `kubernetes_deployment.local_path_provisioner.wait_for_rollout = true`
  — TF-only attribute, no cluster impact.
- `kubernetes_storage_class_v1.local_path.allow_volume_expansion = false`
  + `is-default-class` annotation re-asserted — TF-schema reconciliation
  only; the StorageClass remained default throughout.

Post-apply `scripts/tg plan` returns `No changes`.

## Verification

```
$ cd stacks/local-path && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.

$ kubectl -n local-path-storage get deploy
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
local-path-provisioner   1/1     1            1           55d

$ kubectl get sc local-path
NAME                    PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE
local-path (default)    rancher.io/local-path    Delete          WaitForFirstConsumer
```

## What is NOT in this change

- Helm-release adoption — local-path-provisioner was never installed via
  Helm in this cluster; raw manifests only. Keeping native typed
  resources rather than retrofitting a chart.
- PV-path customisation — sticks with upstream default
  `/opt/local-path-provisioner` on all nodes (via
  `DEFAULT_PATH_FOR_NON_LISTED_NODES`).

Closes: code-3gp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:39:55 +00:00
Viktor Barzin
1a7f68fe5b [beads-server] Auto-dispatch agent beads via CronJobs
## Context

Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.

The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`

So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.

## Flow

```
  user: bd assign <id> agent
         │
         ▼
  Dolt @ dolt.beads-server.svc:3306  ◄──── every 2 min ────┐
         │                                                  │
         ▼                                                  │
  CronJob: beads-dispatcher                                 │
    1. GET beadboard/api/agent-status  (busy? skip)         │
    2. bd query 'assignee=agent AND status=open'            │
    3. bd update -s in_progress   (claim)                   │
    4. POST beadboard/api/agent-dispatch                    │
    5. bd note "dispatched: job=…"                          │
         │                                                  │
         ▼                                                  │
  claude-agent-service /execute                             │
    beads-task-runner agent runs; notes/closes bead         │
         │                                                  │
         ▼                                                  │
  done  ──► next tick picks up the next bead ───────────────┘

  CronJob: beads-reaper  (every 10 min)
    for bead (assignee=agent, status=in_progress, updated_at older than 30 min):
      bd note   "reaper: no progress for Nm — blocking"
      bd update -s blocked
```
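
One dispatcher tick from the flow above can be sketched as a small function. The real CronJob is a shell script around `bd`, `curl`, and `jq`; here every external call is passed in as a callable so the control flow is visible on its own, and all parameter names are hypothetical.

```python
from typing import Callable, Optional


def dispatcher_tick(
    agent_busy: Callable[[], bool],              # step 1: agent-status check
    next_open_bead: Callable[[], Optional[str]], # step 2: assignee=agent AND status=open
    claim: Callable[[str], None],                # step 3: set status to in_progress
    dispatch: Callable[[str], str],              # step 4: agent-dispatch, returns a job id
    note: Callable[[str, str], None],            # step 5: record the job id on the bead
) -> Optional[str]:
    """Run one poll cycle; return the dispatched bead id, or None if idle."""
    if agent_busy():
        return None  # single-slot service: skip this tick entirely
    bead_id = next_open_bead()
    if bead_id is None:
        return None  # nothing queued for the agent
    claim(bead_id)   # claim before dispatching so the next tick cannot pick it again
    job_id = dispatch(bead_id)
    note(bead_id, f"dispatched: job={job_id}")
    return bead_id
```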

## Decisions

- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
  client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
  2-min poll cadence and ~5-min average run, throughput is ~12 beads/hour.
  Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
  Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
  `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
  Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
  the image-seeded file. The script copies it into `/tmp/.beads/` because bd
  may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
  When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
  `beads-task-runner` never trips the reaper. Failures trip it; pod crashes
  (in-memory job state lost) also trip it.
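
The throughput estimate under sequential dispatch is simple arithmetic. The ~12 beads/hour figure is 60 minutes divided by the ~5-minute run; the sketch below additionally assumes a fixed run length and that the next bead can only start on a poll tick, which gives a somewhat lower number. `beads_per_hour` is a hypothetical helper.

```python
import math


def beads_per_hour(run_minutes: float, poll_minutes: float) -> float:
    """Sequential throughput when each run must start on a poll tick."""
    # The next bead starts on the first tick at or after the previous run ends.
    cycle_minutes = math.ceil(run_minutes / poll_minutes) * poll_minutes
    return 60.0 / cycle_minutes


print(beads_per_hour(5, 2))  # 6-minute cycle between dispatches
```

With a 5-minute run and a 2-minute poll, ticks at minutes 2 and 4 find the agent busy and the next dispatch happens at minute 6, so the tick-aligned rate is 10 per hour against an idle-free ceiling of 12.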

## What is NOT in this change

- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
  `cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.

## Deviations from plan

Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
  serializes `notes` as a string (not an array), and every `bd note` bumps
  `updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU
  `-d` and the image has python3.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.
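
A minimal sketch of the reaper's staleness check, using `python3` for the ISO-8601 parsing as described above. The exact timestamp format `bd` emits is not shown in this commit, so treating naive timestamps as UTC is an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REAPER_THRESHOLD = timedelta(minutes=30)


def parse_iso8601(value: str) -> datetime:
    """Parse an ISO-8601 timestamp into an aware datetime."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed


def stale_minutes(updated_at: str, now: datetime) -> Optional[int]:
    """Return idle minutes if the bead is past the threshold, else None."""
    idle = now - parse_iso8601(updated_at)
    if idle > REAPER_THRESHOLD:
        return int(idle.total_seconds() // 60)
    return None
```

The returned value is the `N` that would go into the `reaper: no progress for Nm` note.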

## Test plan

### Automated

```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!

$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1)  # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.

$ terraform fmt stacks/beads-server/main.tf
# (no output — already formatted)
```

### Manual verification

1. **Apply**
   ```
   vault login -method=oidc
   cd infra/stacks/beads-server
   scripts/tg apply
   ```
   Expect: `kubernetes_config_map.beads_metadata`,
   `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
   created. No changes to existing resources.

2. **CronJobs exist with right schedule**
   ```
   kubectl -n beads-server get cronjob
   ```
   Expect `beads-dispatcher  */2 * * * *` and `beads-reaper  */10 * * * *`,
   both with `SUSPEND=False`.

3. **End-to-end smoke**
   ```
   bd create "auto-dispatch smoke test" \
       -d "Read /etc/hostname inside the agent sandbox and close." \
       --acceptance "bd note includes 'hostname=' line and bead is closed."
   bd assign <new-id> agent
   # within 2 min:
   bd show <new-id> --json | jq '{status, notes}'
   ```
   Expect notes to contain `auto-dispatcher claimed at …` and
   `dispatched: job=<uuid>`, status `in_progress`.

4. **Reaper smoke**
   Assign + dispatch a long bead, then
   `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
   30 min + one reaper tick, `bd show <id>` shows `blocked` with a
   `reaper: no progress for Nm — blocking` note.

5. **Kill switch**
   ```
   cd infra/stacks/beads-server
   scripts/tg apply -var=beads_dispatcher_enabled=false
   kubectl -n beads-server get cronjob
   ```
   Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
   nothing happens within 5 min. Re-apply with `=true` to re-enable.

Runbook with all above plus reaper semantics + design choices at
`infra/docs/runbooks/beads-auto-dispatch.md`.

Closes: code-8sm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:35:46 +00:00
Viktor Barzin
01955916b2 [infra] Adopt kured + sentinel-gate into Terraform (Wave 5a)
## Context

Wave 5a of the state-drift consolidation plan. Two cluster-critical pieces
of infrastructure lived OUTSIDE Terraform — invisible to the repo's "all
cluster changes via TF" invariant and drifting silently:

1. **kured** (Helm release): deployed 265d ago via `helm install kured` on
   the CLI. Values were edited only via `helm upgrade` — never captured.
   Chart version `kured-5.11.0`, app `1.21.0`, configured for Mon–Fri
   02:00–06:00 London reboot window, Slack notifyUrl, and a custom
   `/sentinel/gated-reboot-required` sentinel file.

2. **kured-sentinel-gate**: a custom DaemonSet + ServiceAccount +
   ClusterRole + ClusterRoleBinding. Built after the 2026-03 post-mortem
   (memory 390) when kured rebooted nodes during a containerd overlayfs
   outage and turned a single-node blip into a 26h cluster outage.
   The gate DaemonSet creates `/var/run/gated-reboot-required` only when
   (a) host has `/var/run/reboot-required`, (b) all nodes Ready, (c) all
   calico-node pods Running, (d) no node transitioned Ready in the last
   30 minutes (cool-down). kured's `rebootSentinel` then points at the
   gated file so reboots are effectively gated by cluster health.
   Applied 33d ago via `kubectl apply` — no TF footprint.
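
The four gate conditions (a) to (d) combine into a single predicate. The sketch below is illustrative only: the real gate is a DaemonSet script, and the function name and input shapes here are hypothetical.

```python
from datetime import datetime, timedelta, timezone

COOL_DOWN = timedelta(minutes=30)


def may_create_gated_sentinel(
    reboot_required: bool,                    # (a) host has /var/run/reboot-required
    nodes_ready: list[bool],                  # (b) Ready condition per node
    calico_phases: list[str],                 # (c) phase of each calico-node pod
    last_ready_transitions: list[datetime],   # (d) last Ready transition per node
    now: datetime,
) -> bool:
    """True only when every health condition for a gated reboot holds."""
    if not reboot_required:
        return False
    if not all(nodes_ready):
        return False
    if not all(phase == "Running" for phase in calico_phases):
        return False
    # Cool-down: no node may have changed Ready state within the last 30 minutes.
    return all(now - t >= COOL_DOWN for t in last_ready_transitions)
```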

Both are now codified in the new `stacks/kured/` (Tier 1, PG state).

## This change

- New stack `stacks/kured/` with `main.tf` (247 lines) + `terragrunt.hcl`
  (standard platform-dep) + `secrets` symlink.
- All 6 resources adopted via Wave 8's HCL `import {}` block pattern
  (commit 8a99be11) — written as `import {}` stanzas in the initial
  commit, plan-applied to zero, then stanzas deleted before this commit
  per the convention:
    - `kubernetes_namespace.kured` (id: `kured`)
    - `helm_release.kured` (id: `kured/kured`)
    - `kubernetes_service_account.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
    - `kubernetes_cluster_role.kured_sentinel_gate` (id: `kured-sentinel-gate`)
    - `kubernetes_cluster_role_binding.kured_sentinel_gate` (id: `kured-sentinel-gate`)
    - `kubernetes_daemon_set_v1.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
- Slack notifyUrl moved from inline helm values into Vault at
  `secret/kured` under key `slack_kured_webhook`, consumed via
  `data "vault_kv_secret_v2"`. No plaintext secret in git.
- Namespace gets `tier = "1-cluster"` label (new — previously untiered,
  so Kyverno auto-quotas applied cluster-tier defaults on kured pods).
  Benign additive change; pod specs have explicit resources anyway.
- DaemonSet + SA get `automount_service_account_token = false` /
  `enable_service_links = false` to match the live pod spec exactly —
  otherwise TF schema defaults would flip these fields.
- DaemonSet carries `# KYVERNO_LIFECYCLE_V1` suppressing dns_config drift
  (Wave 3A convention, commit c9d221d5 + 327ce215).
- Namespace carries the same marker on the
  `goldilocks.fairwinds.com/vpa-update-mode` label (Wave 3B sweep,
  commit 8b43692a).

## Import outcomes

Apply result: `Resources: 6 imported, 0 added, 3 changed, 0 destroyed.`

The 3 in-place changes were all TF-schema reconciliation, not cluster
mutations:

- `helm_release.kured.values` — format reshuffle; the imported state
  stored values as a nested map, HCL uses `[yamlencode(...)]`. Semantic
  YAML is byte-identical, so the triggered Helm upgrade was a no-op on
  the cluster side (revision bumped 2→3, zero pod restarts).
- `kubernetes_namespace.kured.labels["tier"]` = `"1-cluster"` — new
  label added. Already discussed above.
- `kubernetes_daemon_set_v1.kured_sentinel_gate.wait_for_rollout` = true
  — TF-only attribute, no k8s impact.

Post-apply `scripts/tg plan` on `stacks/kured` returns:
`No changes. Your infrastructure matches the configuration.`

## What is NOT in this change

- `import {}` stanzas — intentionally removed after the apply landed.
  They would be no-ops and would clutter future diffs. Per Wave 8
  convention (AGENTS.md → "Adopting Existing Resources").
- Calico adoption (Wave 5b) — separate higher-blast change, needs a
  dedicated low-traffic window.
- local-path-storage (Wave 5c) — check-or-remove task still open.

## Verification

```
$ kubectl -n kured get ds
NAME                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
kured                 5         5         5       5            5
kured-sentinel-gate   5         5         5       5            5

$ helm -n kured list
NAME     NAMESPACE   REVISION  STATUS    CHART          APP VERSION
kured    kured       3         deployed  kured-5.11.0   1.21.0

$ cd stacks/kured && ../../scripts/tg plan | tail -1
No changes. Your infrastructure matches the configuration.
```

## Reproduce locally
1. `git pull`
2. `cd stacks/kured && ../../scripts/tg plan` → 0 changes
3. `kubectl -n kured get ds,pods` — 5 kured + 5 sentinel-gate pods Ready.

Closes: code-q8k

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:33:29 +00:00
Viktor Barzin
10fd88aec5 wealthfolio: add nightly backup sidecar — SQLite → NFS
## Context

Upstream Wealthfolio uses SQLite exclusively (Diesel ORM, no PG/MySQL
support — confirmed 2026-04-18 via repo inspection). The DB lives on
an RWO PVC (proxmox-lvm-encrypted) held 24/7 by the main pod.

The first attempt at a standalone backup CronJob failed with a
Multi-Attach error: the RWO volume is already attached to the running
WF pod, so no separate pod can mount it. Switched to a backup sidecar
in the same pod, which shares the PVC mount naturally.

## This change

- `container "backup"` added to the WF Deployment:
  - alpine:3.20 + sqlite + busybox-suid (for crond).
  - Mounts /data read-only (shared with WF container) + /backup (new
    NFS volume at 192.168.1.127:/srv/nfs/wealthfolio-backup).
  - Writes /etc/crontabs/root with a `30 4 * * *` line + /scripts/backup.sh
    which runs `sqlite3 .backup` (WAL-safe online snapshot, zero
    downtime), copies secrets.json, and prunes anything older than 30d.
  - 16Mi request / 64Mi limit — sleeps most of the time.
- NFS volume declared in pod spec — server from the existing
  `var.nfs_server` variable; path `/srv/nfs/wealthfolio-backup` created
  on the PVE host in the same session.

Removed the standalone backup CronJob that couldn't work.
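
For illustration, the snapshot-and-prune step can be sketched with Python's `sqlite3` module, whose `Connection.backup()` uses the same online backup API as the `sqlite3 .backup` command. The sidecar itself runs a shell script, so this is a swapped-in equivalent rather than the shipped code; the function names are hypothetical, and the `secrets.json` copy is omitted.

```python
import shutil
import sqlite3
import time
from pathlib import Path

RETENTION_DAYS = 30


def snapshot(db_path: Path, backup_root: Path, stamp: str) -> Path:
    """Take an online snapshot of db_path into backup_root/<stamp>/."""
    dest_dir = backup_root / stamp
    dest_dir.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(dest_dir / db_path.name)
    try:
        src.backup(dst)  # consistent copy even while the source is in use
    finally:
        dst.close()
        src.close()
    return dest_dir


def prune(backup_root: Path, now: float, retention_days: int = RETENTION_DAYS) -> list[str]:
    """Delete backup directories older than the retention window."""
    cutoff = now - retention_days * 86400
    removed = []
    for entry in sorted(backup_root.iterdir()):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```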

## Verification

### Automated

`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 1 changed, 1 destroyed (the transient CronJob).

### Manual (2026-04-18)

$ kubectl -n wealthfolio get pods -l app=wealthfolio
wealthfolio-95d8bd498-cj8kw   2/2   Running
$ kubectl -n wealthfolio logs <pod> -c backup
wealthfolio-backup sidecar ready; next 04:30 UTC
$ kubectl -n wealthfolio exec <pod> -c backup -- /scripts/backup.sh
wealthfolio-backup: /backup/2026-04-18T22-24-55 (34.2M)
$ ls /srv/nfs/wealthfolio-backup/
2026-04-18T22-24-55/   ← first sidecar-produced backup

## Reproduce locally

1. kubectl -n wealthfolio exec $(kubectl -n wealthfolio get pods -l app=wealthfolio -o jsonpath='{.items[0].metadata.name}') -c backup -- /scripts/backup.sh
2. ssh root@192.168.1.127 ls /srv/nfs/wealthfolio-backup/
3. Expected: new dated folder appears with wealthfolio.db + secrets.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:25:19 +00:00
Viktor Barzin
9e5d7cd825 state(vault): update encrypted state
2026-04-18 22:12:55 +00:00
Viktor Barzin
402fd1fbac state(dbaas): update encrypted state
2026-04-18 22:12:09 +00:00
Viktor Barzin
345ba2182f [mailserver] Widen email-roundtrip probe IMAP window 180s → 300s + per-attempt timeout
## Context

After fixing the two mail-server-side root causes of probe false-failures
(Dovecot userdb duplicates, postscreen btree lock contention), the probe
is expected to succeed well under 120s. This commit is defence in depth
against residual SMTP relay variance and against a future scenario where
Dovecot is transiently unresponsive during IMAP login.

The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's
queueing, DNS variance, and general SMTP retry backoff can easily
exceed that on a bad day. Widening to 5 minutes gives plenty of headroom
while still remaining well within the CronJob's 20-minute schedule
interval.

Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If
Dovecot is unresponsive (e.g., mid-rollout, transient TLS handshake
hang), the connect call can block indefinitely and the probe hangs
without ever looping to the next attempt. Adding `timeout=10` caps each
connect at 10s so the retry loop keeps making forward progress.

## This change

Two edits to the embedded probe script inside the cronjob resource:

```
-    # Step 2: Wait for delivery, retry IMAP up to 3 min
+    # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
  ...
-    for attempt in range(9):
+    for attempt in range(15):
  ...
-            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
```

Flow (before):

```
send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total
```

Flow (after):

```
send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total
                                           │
                                           └─ timeout ─► log, continue to next loop
```
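
The "after" flow can be sketched as a retry loop in which each attempt is bounded. This is not the embedded probe script: `wait_for_delivery` and `check_once` are hypothetical names, and the per-attempt check is injected so the loop can be exercised without a mail server.

```python
import imaplib
import ssl
import time
from typing import Callable, Optional

ATTEMPTS = 15
SLEEP_S = 20
CONNECT_TIMEOUT_S = 10


def connect_imap(host: str) -> imaplib.IMAP4_SSL:
    """Open an IMAPS connection whose connect attempt is capped at 10s."""
    ctx = ssl.create_default_context()
    return imaplib.IMAP4_SSL(host, 993, ssl_context=ctx, timeout=CONNECT_TIMEOUT_S)


def wait_for_delivery(check_once: Callable[[], bool],
                      attempts: int = ATTEMPTS,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Poll until check_once() reports delivery; return the attempt number."""
    for attempt in range(attempts):
        sleep(SLEEP_S)
        try:
            if check_once():
                return attempt + 1
        except (OSError, imaplib.IMAP4.error) as exc:
            # Timeouts are OSError subclasses: log and move to the next loop.
            print(f"attempt {attempt + 1}: {exc}; retrying")
    return None
```

If every connect times out, the loop runs for at most 15 × (20s + 10s) = 450s, which still fits inside the 20-minute schedule interval.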

## What is NOT in this change

- Probe frequency stays at `*/20 * * * *`.
- The `EmailRoundtripStale` alert thresholds are intentionally left at
  3600s + for: 10m. Those fire only on sustained multi-hour issues and
  should not be loosened — they would mask future regressions. Probe
  success rate is now expected to recover to ≥95% from the two upstream
  fixes; if it doesn't, alert tuning gets revisited separately.
- No change to the Brevo send step, the success-metrics push, or the
  cleanup of stale e2e-probe-* messages.

## Test Plan

### Automated

`scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`:

```
  # module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place
  -     for attempt in range(9):
  +     for attempt in range(15):
  -             imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
  +             imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
Plan: 0 to add, 1 to change, 0 to destroy.
```

`scripts/tg apply`:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

### Manual Verification

1. Trigger the probe manually:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
2. Tail its logs:
   `kubectl -n mailserver logs job/probe-verify-<ts> -f`
3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical
   successful run should still complete in < 60s now that postscreen
   is no longer stalling.
4. Watch the 48-hour window on the `email_roundtrip_success` gauge in
   Prometheus — expect ≥95% (was ~65% before all three fixes).

## Reproduce locally

1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml | grep -E "range\(|timeout"`
2. Expect: `range(15)` and `timeout=10`
3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
4. `kubectl -n mailserver logs -f job/probe-verify-<ts>`
5. Expect: eventual `Round-trip SUCCESS in <N>s` message and exit 0.

Closes: code-18e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:33:56 +00:00
Viktor Barzin
e2516b07a3 [mailserver] Disable postscreen btree cache to stop SMTP lock-contention stalls
## Context

Postfix inside docker-mailserver was spamming fatal errors at roughly
4 per minute (5,464 of them in a 24h window), all of the same shape:

```
postfix/postscreen[NNN]: fatal: btree:/var/lib/postfix/postscreen_cache:
unable to get exclusive lock: Resource temporarily unavailable
```

Every time one of these fires, the postscreen process dies mid-connection
and the inbound SMTP session is dropped. Legitimate mail (including Brevo
deliveries for our e2e email-roundtrip probe) gets re-queued by the sender
and arrives late — frequently past the probe's 180s IMAP polling window,
producing a 35%/7d probe success rate and the EmailRoundtripStale alert
noise that was originally flagged as "probably nothing."

## Root cause

`master.cf` declares postscreen with `maxproc=1`, but postscreen still
re-spawns per incoming connection (or for short-lived reopens), and each
instance opens the shared btree cache with an exclusive file lock. Under
any concurrency (two TCP SYNs arriving close together, or a retry during
teardown), the second process hits EWOULDBLOCK on fcntl and Postfix
treats that as fatal.

Four options were considered:

  | Option | Verdict |
  |--------|---------|
  | (a) Disable cache (postscreen_cache_map = )  | ✓ chosen |
  | (b) Switch btree → lmdb                       | ✗ lmdb not compiled into docker-mailserver 15.0.0's postfix (`postconf -m` has no lmdb) |
  | (c) proxy:btree via proxymap                  | ✗ unsafe — Postfix docs: "postscreen does its own locking, not safe via proxymap" |
  | (d) Memcached sidecar                         | ✗ new moving part; deferred |

Option (a) is a small trade-off: legitimate clients re-run the
greet-action / bare-newline-action checks on every fresh TCP session
instead of hitting the 7-day whitelist cache. At our volume (~100
deliveries/day, ~72 of which are the probe itself) that's negligible CPU.
Cached DNSBL results are lost as well, but this mailserver already has
`postscreen_dnsbl_action = ignore`, so the cache's DNSBL role was doing
nothing anyway.

## This change

Appends a stanza to the user-merged postfix main.cf stored in
`variable.postfix_cf` that sets `postscreen_cache_map =` (empty value).
Postfix treats an empty cache_map as "no persistent cache" — per-session
decisions are still enforced, they just aren't cached across sessions.

Before:

```
smtpd ──► postscreen (maxproc=1, btree cache with exclusive lock)
                ├─ concurrent access → fcntl EWOULDBLOCK → fatal
                └─ connection dropped, sender retries, mail arrives late
```

After:

```
smtpd ──► postscreen (no cache, per-session checks only)
                └─ no shared file, no lock → no fatal, no dropped session
```

No change to master.cf (postscreen still the front-end), no change to
DNSBL / greet / bare-newline policy.

## What is NOT in this change

- Dovecot userdb dedup (shipped in the previous commit).
- Email-roundtrip probe widening (next commit).
- Rebuilding docker-mailserver image with lmdb support (deferred —
  disabling the cache is simpler and sufficient at our volume).

## Test Plan

### Automated

`postconf -m` in the running container to confirm lmdb is genuinely absent
(ruling out option (b) before we commit to (a)):

```
btree  cidr  environ  fail  hash  inline  internal  ldap  memcache
nis  pcre  pipemap  proxy  randmap  regexp  socketmap  static  tcp
texthash  unionmap  unix
```

No lmdb entry — confirmed.

`scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`:

```
  ~ "postfix-main.cf" = <<-EOT
      + postscreen_cache_map =
```

`scripts/tg apply`:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

Reloader triggers pod rollout — baseline error count before apply was 34
`unable to get exclusive lock` lines per `--tail=500` log window.

### Manual Verification

Post-rollout, when the new pod is Ready:

1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map`
   Expect: empty (no value)
2. Watch for 15 min: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=1000 | grep -c "unable to get exclusive lock"`
   Expect: 0 new occurrences (any hits are from before the rollout).
3. Trigger a probe run manually:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
   then `kubectl -n mailserver logs job/probe-verify-...`
   Expect: `Round-trip SUCCESS` with duration < 120s.

## Reproduce locally

1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map`
2. Expect: `postscreen_cache_map =` (empty value)
3. `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --since=15m | grep -c "unable to get exclusive lock"`
4. Expect: 0

Closes: code-1dc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:32:48 +00:00
Viktor Barzin
01a718e17b [mailserver] Filter redundant local→local aliases to fix Dovecot 'exists more than once'
## Context

Dovecot auth logs have been steadily spamming
`passwd-file /etc/dovecot/userdb: User r730-idrac@viktorbarzin.me exists more
than once` (and the same for vaultwarden@) at ~31 occurrences per 500 log
lines. Under load this flakes IMAP auth for the e2e email-roundtrip probe
(spam@viktorbarzin.me uses the catch-all), which was masquerading as "Brevo
or probe timing" noise.

## Root cause

docker-mailserver builds Dovecot's `/etc/dovecot/userdb` from two sources:
real accounts (`postfix-accounts.cf`) AND virtual-alias entries whose
*target* resolves to a local mailbox (`postfix-virtual.cf`). When the same
address appears as BOTH a real mailbox AND an alias whose target is another
local mailbox, the generated userdb has two lines for that username pointing
to different home directories — e.g.:

  r730-idrac@viktorbarzin.me:...:/var/mail/.../r730-idrac/home
  r730-idrac@viktorbarzin.me:...:/var/mail/.../spam/home      ← from alias

Dovecot's passwd-file driver rejects the duplicate, and every subsequent
auth lookup logs the error.
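
A minimal Python sketch of the duplicate condition described above, equivalent in spirit to the `cut -d: -f1 | sort | uniq -d` check used in the test plan. The function name `duplicate_usernames` and the sample userdb lines (including the home paths) are hypothetical placeholders, not the real file contents.

```python
from collections import Counter
from typing import List


def duplicate_usernames(userdb_text: str) -> List[str]:
    """Return usernames that appear on more than one passwd-file line."""
    # The username is the first colon-separated field of each non-empty line.
    names = [
        line.split(":", 1)[0]
        for line in userdb_text.splitlines()
        if line.strip()
    ]
    counts = Counter(names)
    return sorted(name for name, seen in counts.items() if seen > 1)


# Hypothetical sample: one entry from the real account, one from the alias.
sample = "\n".join([
    "r730-idrac@viktorbarzin.me:x:5000:5000::/example/r730-idrac/home",
    "r730-idrac@viktorbarzin.me:x:5000:5000::/example/spam/home",
    "me@viktorbarzin.me:x:5000:5000::/example/me/home",
])
print(duplicate_usernames(sample))
```

An empty result corresponds to the "no duplicate usernames" expectation in the manual verification steps.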

This affected exactly two addresses:
- r730-idrac@viktorbarzin.me (real account + alias → spam@)
- vaultwarden@viktorbarzin.me  (real account + alias → me@)

Other aliases are fine: they either forward to external addresses (gmail
etc.) — no local userdb entry generated — or map an address to itself
(me@ → me@) which docker-mailserver dedups internally.

Note: removing the real accounts is not an option because Vaultwarden uses
`vaultwarden@viktorbarzin.me` as its live SMTP_USERNAME
(stacks/vaultwarden/modules/vaultwarden/main.tf:121).

## This change

Introduces a `local.postfix_virtual` that concatenates the Vault-sourced
aliases with `extra/aliases.txt`, then filters out any line matching the
exact "LHS RHS" shape where both sides are in `var.mailserver_accounts` and
LHS != RHS. That is, only the pure local→local redundant entries are
dropped; all forwarding aliases and the catch-all are preserved.

The filter is self-healing: if a future local→local alias ever collides with
a real account, it is silently dropped from `postfix-virtual.cf` instead of
breaking Dovecot auth.

```
  Vault mailserver_aliases  ─┐
                              ├─ concat ─ split \n ─ filter ─ join \n ─► postfix-virtual.cf
  extra/aliases.txt ─────────┘                        │
                                                       └── drop if LHS+RHS both in
                                                           mailserver_accounts and
                                                           LHS != RHS
```

Filtered entries (confirmed via locally-simulated filter on live data):
- r730-idrac@viktorbarzin.me spam@viktorbarzin.me
- vaultwarden@viktorbarzin.me me@viktorbarzin.me

Preserved (sample): postmaster→me, contact→me, alarm-valchedrym→self+3 ext,
lubohristov→gmail, yoana→gmail, @viktorbarzin.me→spam (catch-all), all four
disposable `*-generated@` aliases.
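
The filter rule can be simulated outside Terraform with a short Python sketch. This is an illustration under stated assumptions, not the HCL in `local.postfix_virtual`: the function name `filter_aliases` is hypothetical, and `accounts` is assumed to be a plain collection of mailbox addresses standing in for `var.mailserver_accounts`.

```python
from typing import Iterable


def filter_aliases(alias_text: str, accounts: Iterable[str]) -> str:
    """Drop "LHS RHS" lines where both sides are local accounts and differ."""
    local = set(accounts)
    kept = []
    for line in alias_text.split("\n"):
        parts = line.split()
        redundant = (
            len(parts) == 2          # exact "LHS RHS" shape only
            and parts[0] in local    # alias address is a real mailbox
            and parts[1] in local    # target is another real mailbox
            and parts[0] != parts[1]  # self-mappings are left alone
        )
        if not redundant:
            kept.append(line)
    return "\n".join(kept)


# Assumed account list for the example; the real one lives in Vault.
accounts = [
    "me@viktorbarzin.me",
    "spam@viktorbarzin.me",
    "r730-idrac@viktorbarzin.me",
    "vaultwarden@viktorbarzin.me",
]
aliases = "\n".join([
    "postmaster@viktorbarzin.me me@viktorbarzin.me",
    "me@viktorbarzin.me me@viktorbarzin.me",
    "r730-idrac@viktorbarzin.me spam@viktorbarzin.me",
    "vaultwarden@viktorbarzin.me me@viktorbarzin.me",
    "@viktorbarzin.me spam@viktorbarzin.me",
])
print(filter_aliases(aliases, accounts))
```

The catch-all survives because its left-hand side (`@viktorbarzin.me`) is not an account, and forwarding aliases survive because their targets are not local mailboxes.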

## What is NOT in this change

- Real accounts in Vault (`secret/platform.mailserver_accounts`) are
  untouched — vaultwarden SMTP auth keeps working.
- Postfix postscreen btree lock contention (separate commit).
- Email-roundtrip probe IMAP window (separate commit).

## Test Plan

### Automated

`terraform validate` — passes (docker-mailserver module):

```
Success! The configuration is valid, but there were some validation warnings as shown above.
```

`scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`:

```
  # module.mailserver.kubernetes_config_map.mailserver_config will be updated in-place
  ~ resource "kubernetes_config_map" "mailserver_config" {
      ~ data = {
          ~ "postfix-virtual.cf" = (sensitive value)
            # (9 unchanged elements hidden)
        }
        id = "mailserver/mailserver.config"
    }
Plan: 0 to add, 1 to change, 0 to destroy.
```

`scripts/tg apply` — applied:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

### Manual Verification

Post-apply configmap content (the two lines are gone):

```
$ kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'
postmaster@viktorbarzin.me me@viktorbarzin.me
contact@viktorbarzin.me me@viktorbarzin.me
me@viktorbarzin.me me@viktorbarzin.me
lubohristov@viktorbarzin.me lyubomir.hristov3@gmail.com
alarm-valchedrym@viktorbarzin.me alarm-valchedrym@...,vbarzin@...,emil.barzin@...,me@...
yoana@viktorbarzin.me divcheva.yoana@gmail.com

@viktorbarzin.me spam@viktorbarzin.me
firmly-gerardo-generated@viktorbarzin.me me@viktorbarzin.me
closely-keith-generated@viktorbarzin.me vbarzin@gmail.com
literally-paolo-generated@viktorbarzin.me viktorbarzin@fb.com
hastily-stefanie-generated@viktorbarzin.me elliestamenova@gmail.com
```

Reloader triggers a pod rollout; once new pod is Ready:
- `kubectl -n mailserver exec <pod> -c docker-mailserver -- cut -d: -f1 /etc/dovecot/userdb | sort | uniq -d`
  expected output: empty (no duplicate usernames)
- `kubectl -n mailserver logs <pod> -c docker-mailserver --tail=500 | grep -c "exists more than once"`
  expected output: 0 (baseline was 31/500 lines)

## Reproduce locally

1. `kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'`
2. Expect: no `r730-idrac@viktorbarzin.me spam@viktorbarzin.me` line and no
   `vaultwarden@viktorbarzin.me me@viktorbarzin.me` line.
3. After pod restart: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=500 | grep -c "exists more than once"` → 0.

Closes: code-27l

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:29:02 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
and `kubernetes_job_v1` resource (plus any `_v1` variants of those types) in
the repo and ensuring each carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)
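
The path selection above can be expressed as a small lookup. This is a hedged Python sketch, not the actual sweep script: the helper name `dns_config_ignore_path` and the idea of normalising the `_v1` suffix are assumptions made for illustration.

```python
from typing import Optional

POD_TEMPLATE_PATH = "spec[0].template[0].spec[0].dns_config"
CRON_JOB_PATH = "spec[0].job_template[0].spec[0].template[0].spec[0].dns_config"

# Resource types (with any "_v1" suffix removed) whose pod template
# sits directly under spec[0].template[0].
POD_OWNERS = {
    "kubernetes_deployment",
    "kubernetes_stateful_set",
    "kubernetes_daemon_set",
    "kubernetes_job",
}


def dns_config_ignore_path(resource_type: str) -> Optional[str]:
    """Return the ignore_changes path for a pod-owning type, else None."""
    base = resource_type
    if base.endswith("_v1"):
        base = base[: -len("_v1")]
    if base == "kubernetes_cron_job":
        # CronJob nests the pod template one level deeper, under job_template.
        return CRON_JOB_PATH
    if base in POD_OWNERS:
        return POD_TEMPLATE_PATH
    return None  # non-pod resources get no dns_config suppression


print(dns_config_ignore_path("kubernetes_cron_job_v1"))
```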

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
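
The second insertion path (extending an existing list) might look roughly like the following Python sketch. It is not the contents of `/tmp/add_dns_config_ignore.py`; the function names are hypothetical, the idempotency check is simplified to "the block text already mentions `dns_config`", and the sketch assumes there are no trailing comments inside the `ignore_changes` list.

```python
import re

DNS_PATH = "spec[0].template[0].spec[0].dns_config"


def find_list_span(text: str, start: int):
    """Return (open, close) indexes of the bracketed list at/after start.

    Depth is tracked because list items contain brackets, e.g. spec[0].
    """
    open_idx = text.index("[", start)
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return open_idx, i
    raise ValueError("unbalanced brackets in ignore_changes list")


def extend_ignore_changes(block: str, path: str = DNS_PATH) -> str:
    """Append path to the ignore_changes list of one lifecycle block."""
    if "dns_config" in block:
        return block  # already suppressed: re-running is a no-op
    match = re.search(r"ignore_changes\s*=\s*", block)
    if match is None:
        return block
    open_idx, close_idx = find_list_span(block, match.end())
    body = block[open_idx + 1:close_idx]
    if "\n" in body:
        # Multiline form: make sure the last item has a trailing comma,
        # then add the new item using the same indentation.
        items = body.rstrip()
        tail = body[len(items):]  # whitespace before the closing bracket
        if items.strip() and not items.endswith(","):
            items += ","
        last_line = items.split("\n")[-1]
        indent = last_line[: len(last_line) - len(last_line.lstrip())]
        new_body = f"{items}\n{indent}{path},{tail}"
    else:
        # Inline form: comma-join onto the existing items.
        items = body.strip()
        new_body = f"{items}, {path}" if items else path
    return block[: open_idx + 1] + new_body + block[close_idx:]
```

Tracking bracket depth rather than using a non-greedy regex matters here because every existing entry such as `spec[0].template[0]...image` contains its own `]`.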

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
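
A rough Python sketch of a brace-depth-tracking pass like the one described above. It is not the contents of `/tmp/add_goldilocks_ignore.py`: the function name `inject_lifecycle` is hypothetical, and the sketch assumes braces are balanced per resource and do not appear unbalanced inside strings or comments.

```python
LIFECYCLE_LINES = [
    "  lifecycle {",
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace",
    '    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]',
    "  }",
]


def inject_lifecycle(source: str) -> str:
    """Insert the lifecycle block before each namespace resource's final brace."""
    # Idempotency: skip any file that already mentions the label.
    if "goldilocks.fairwinds.com/vpa-update-mode" in source:
        return source
    out = []
    inside = False  # currently within a kubernetes_namespace resource
    depth = 0       # brace depth relative to the resource start
    for line in source.split("\n"):
        if not inside and line.startswith('resource "kubernetes_namespace" '):
            inside = True
            depth = 0
        if inside:
            new_depth = depth + line.count("{") - line.count("}")
            if new_depth == 0:
                if depth > 0:
                    # This line holds the outermost closing brace.
                    out.extend(LIFECYCLE_LINES)
                inside = False
            depth = new_depth
        out.append(line)
    return "\n".join(out)
```

Because the pass resets after each closing brace, a file with two namespace resources (as in the Vault stack) receives two lifecycle blocks in a single run.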

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
e612baac15 [dawarich] Re-enable Sidekiq worker with resource limits + probes
## Context

Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the
unbounded 10-thread worker drove the whole pod into memory pressure —
the kubelet then evicted the web container along with it. Viktor's
recollection was "it was crashing"; the root cause at the cgroup level was
that the Sidekiq container had no `resources.limits.memory` set, so a
misbehaving job could pull the entire pod down instead of being OOM-killed
and restarted in isolation.

During the ~55 days the worker was off, POSTs to /api/v1 continued to
enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not
the cluster default DB 0). track_segments and digests tables stayed
empty because nothing was processing the backfill queue (beads
code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so
Sidekiq was untested against the new release in this environment.

Live pre-apply snapshot via `bin/rails runner`:
  enqueued=18  (cache=2, data_migrations=4, default=12)
  scheduled=16, retry=0, dead=0, procs=0, processed/failed=0 (stats
  reset by the 1.6.1 upgrade)
Queue latencies ~50h — lines up with code-e9c (iOS client stopped
POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1
was therefore a small, recoverable backlog, not the disaster the plan
originally feared — no pre-apply triage needed.

## What changed

Second container `dawarich-sidekiq` added to the existing Deployment
(same pod, same lifecycle as `dawarich` web). Key differences vs the
2026-02-23 commented block:

- `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory =
  768Mi }`. Burstable QoS — cgroup is now bounded, so a runaway Sidekiq
  job gets OOM-killed and container-restarted in place without evicting
  the whole pod (web stays Ready).
- Hosts parametrised via `var.redis_host` / `var.postgresql_host`
  instead of hardcoded FQDNs; matches the web container's pattern.
- DB / secret / Geoapify creds via `value_from.secret_key_ref` against
  the existing `dawarich-secrets` K8s Secret (populated by the existing
  ExternalSecret). Removes the plan-time `data.vault_kv_secret_v2`
  reference the 2026-02-23 block relied on — that data source no longer
  exists in this stack.
- `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). Ramp deferred
  to separate commits (plan: 2 → 5 → 10 with 15-30min observation
  between bumps).
- Liveness + readiness `pgrep -f 'bundle exec sidekiq'` probes —
  container-scoped restart on stall, verified `pgrep` is at
  /usr/bin/pgrep in the Debian-trixie-based freikin/dawarich image.
- Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT,
  RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so
  Sidekiq's Rails initialisation matches web.

Pod-level additions:
- `termination_grace_period_seconds = 60` — gives Sidekiq time to
  drain in-flight jobs on SIGTERM during rolls (default 30s not enough
  for reverse-geocoding batches).

## What is NOT in this change

- Prometheus exporter for Sidekiq metrics. The first apply turned on
  `PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the
  `prometheus_exporter` gem's CLIENT middleware. That middleware PUSHes
  metrics over TCP to a separate exporter server process — and the
  freikin/dawarich image does not start one. Client logged ~2/sec
  "Connection refused" errors until we flipped ENABLED back to "false"
  in this commit. `pod.annotations["prometheus.io/scrape"]` reverted
  for the same reason (nothing listening on :9394). Filed code-1q5
  (blocks code-459) to add a third sidecar container running
  `bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` and restore
  the 4 drafted alerts (DawarichSidekiqDown /
  QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are
  actually being emitted.
- The 4 drafted Sidekiq alerts — reverted from
  monitoring/prometheus_chart_values.tpl; they reference metrics that
  don't exist yet. Restoration is part of code-1q5.
- Concurrency ramp past 2 and the 24h burn-in gate that closes
  code-459 — separate future commits.
- Liveness/readiness probes on the web container — pre-existing gap,
  out of scope per plan.

## Other changes bundled in

Kyverno `dns_config` drift suppression added with the
`# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich`
AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. Plan only
called it out for the Deployment, but the CronJob shows identical
drift (Kyverno injects ndots=2 on every pod template, Terraform wipes
it, infinite churn). Per AGENTS.md "Kyverno Drift Suppression" every
pod-owning resource MUST carry the lifecycle block — this commit
brings this stack into convention.

## Topology trade-off recorded

Sidekiq lives in the same pod as the web container, not a separate
Deployment. This means:
- Every env bump during ramp bounces both containers (Recreate
  strategy) — brief UI blip accepted.
- `kubectl scale` alone can't pause Sidekiq — pausing requires
  `BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting
  the container block + apply.
- Shared pod network namespace — only one process can bind any given
  port. This is why the plan explicitly avoided declaring a new
  `port { name = "prometheus" }` on the sidekiq container (the web
  container already reserves 9394 by name).

Accepted because the alternative (split Deployment) is significantly
more config for a single-instance service and a follow-up bead
(tracked in code-1q5 description area / Viktor's notes) already
captures "revisit if future crashes warrant blast-radius isolation".

## Rollback

Three levels, in order of increasing impact:
1. Drop concurrency to 1 or 2 + apply: reduces load but keeps draining.
2. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply: pod stays up,
   no jobs processed, backlog preserved in Redis.
3. Re-comment the second container block (this diff in reverse) +
   apply: full disable, backlog stays in Redis DB 1, recoverable.

Never DEL queue:* keys directly — Redis DB 1 is where Dawarich lives,
and the jobs are recoverable state.

## Refs

- code-459 (P3) — Dawarich Sidekiq disabled. In progress; closes
  after 24h burn-in at concurrency=10 with restartCount=0, DeadSet
  delta < 100.
- code-1q5 (P3) — Follow-up: prometheus_exporter sidecar + 4 alerts.
  Depends on code-459.
- code-e9c (P2) — Viktor client-side POST bug 2026-04-16.
  Untouched; processing the backlog does not fix this but ensures
  future POSTs drain cleanly.
- code-72g (P3) — Anca ingestion silent since 2025-06-21. Untouched;
  same reasoning.

## Test Plan

### Automated

```
$ cd stacks/dawarich && ../../scripts/tg plan
...
Plan: 0 to add, 3 to change, 0 to destroy.
#   kubernetes_deployment.dawarich         (sidekiq container + probes + lifecycle)
#   kubernetes_namespace.dawarich          (drops stale goldilocks label, pre-existing drift)
#   module.tls_secret.kubernetes_secret.tls_secret  (Kyverno clone-label drift, pre-existing)

$ ../../scripts/tg apply --non-interactive
...
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.

(Second apply for PROMETHEUS_EXPORTER_ENABLED=false + annotation
removal — same 0/3/0 shape.)
```

### Manual Verification

Setup: kubectl context against the k8s cluster (10.0.20.100).

1. Pod has both containers Ready with zero restarts:
   ```
   $ kubectl -n dawarich get pods -o wide
   NAME                        READY  STATUS   RESTARTS  AGE
   dawarich-75b4ff9fbf-qh56v   2/2    Running  0         <fresh>
   ```

2. Sidekiq container is actively processing jobs:
   ```
   $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=20
   Sidekiq 8.0.10 connecting to Redis ... db: 1
   queues: [data_migrations, points, default, mailers, families,
            imports, exports, stats, trips, tracks,
            reverse_geocoding, visit_suggesting, places,
            app_version_checking, cache, archival, digests,
            low_priority]
   Performing DataMigrations::BackfillMotionDataJob ...
   Backfilled motion_data for N000 points (N climbing)
   ```

3. Rails Sidekiq::API snapshot — procs registered, counters moving:
   ```
   $ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
       require "sidekiq/api"
       s = Sidekiq::Stats.new
       puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}"
       puts "retry=#{s.retry_size} dead=#{s.dead_size}"
     '
   processed=7 failed=2 procs=1
   retry=0 dead=0
   ```
   (The 2 "failures" are cumulative across two pod lifecycles during
   the Prometheus env flip — retried successfully, neither retry nor
   dead set holds any jobs.)

4. Per-container memory well under the 1Gi limit:
   ```
   $ kubectl -n dawarich top pod --containers
   POD                         NAME              CPU    MEMORY
   dawarich-75b4ff9fbf-qh56v   dawarich          1m     272Mi  (of 896Mi)
   dawarich-75b4ff9fbf-qh56v   dawarich-sidekiq  79m    333Mi  (of 1Gi)
   ```

5. No "Prometheus Exporter, failed to send" log lines since the second
   apply:
   ```
   $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \
       | grep -c "Prometheus Exporter"
   0
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:13:05 +00:00
Viktor Barzin
8a99be1194 [infra] Document HCL import {} block convention [ci skip]
## Context

Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}`
block pattern (Terraform 1.5+) as the canonical way to bring live
cluster / Vault / Cloudflare resources under TF management.

Historically the repo has used `terraform import` on the CLI for
adoptions. That path has three real problems:

1. **Not reviewable** — it's an out-of-band state mutation that leaves
   no trace in git beyond the subsequent `resource {}` block. A
   reviewer sees only the new resource, not the adoption intent.
2. **Not plan-safe** — if the resource address or ID is wrong, the CLI
   path commits the mistake to state before anyone can catch it.
3. **Not idempotent** — a failed apply mid-import leaves state in a
   confusing half-adopted shape.

`import {}` blocks fix all three: the adoption intent is in the PR
diff, `scripts/tg plan` shows the import as its own plan line (mistyped
IDs fail before apply), and re-applying after a partial failure just
retries the import step.

Canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands
so the reviewer of those imports has the rule in front of them.

## This change

- `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks,
  Not the CLI" section sitting right after Execution. Includes the
  canonical 5-step workflow (write resource → add import stanza → plan
  to zero → apply → drop stanza), the reasoning, and a per-provider ID
  format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1,
  authentik_provider_proxy, cloudflare_record).
- `.claude/CLAUDE.md`: one-line cross-reference at the end of the
  Terraform State two-tier section pointing back to AGENTS.md. Keeps
  CLAUDE.md's quick-reference density intact while making sure the rule
  is reachable from the Claude-instructions path.

## What is NOT in this change

- Any actual imports — this is a pure docs landing. Wave 5 will
  demonstrate the pattern on kured + Calico.
- Replacing the handful of existing `terraform import`-style adoptions
  in the repo history — `import {}` blocks are delete-after-apply, so
  retro-documenting them is not useful.

Closes: code-[wave8-task]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:10:05 +00:00
Viktor Barzin
2b8bb849c0 [infra] Bump claude-agent-service + beadboard image tags
## Context
Two rolling updates tied to the BeadBoard dispatch-button work (code-kel):

1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent
   (files in /usr/share/agent-seed/), the beads-task-runner agent, and
   hmac.compare_digest bearer verification. The tag moves from 382d6b14
   to 0c24c9b6 (monorepo HEAD).
2. The beadboard Deployment in beads-server now consumes
   CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image
   needs the Dispatch button + /api/agent-dispatch + /api/agent-status
   routes. Tag moves from :latest to :17a38e43 (fork HEAD on
   github.com/ViktorBarzin/beadboard).

## What this change does
- Flips `local.image_tag` in claude-agent-service main.tf.
- Drops the "temporary" comment on `beadboard_image_tag` and sets the
  default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md
  "Use 8-char git SHA tags — `:latest` causes stale pull-through cache").

## Test Plan
### Automated
- Both images already pushed to registry.viktorbarzin.me{:5050}/ :
  - claude-agent-service:0c24c9b6 verified via
    `docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/
    contains both seed files.
  - beadboard:17a38e43 pushed, digest cd0d3c47.
- terraform fmt/validate clean on both stacks from the earlier commits.

### Manual Verification
1. Push triggers Woodpecker default.yml.
2. Expected: both stacks apply; claude-agent-service pod rolls (new
   seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch
   + copies beads-task-runner.md), beadboard pod rolls with new env vars
   sourced from beadboard-agent-service ExternalSecret.
3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:`
   should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard
   -o yaml | grep image:` should show :17a38e43.

Closes: code-kel
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:24:37 +00:00
Viktor Barzin
8d94688dde [infra] Suppress Kyverno label drift on module.tls_secret Secrets [ci skip]
## Context

Wave 3B of the state-drift consolidation audit (plan section "Shared Kyverno
drift-suppression") identified a second Kyverno admission-induced drift
class, complementary to the `# KYVERNO_LIFECYCLE_V1` ndots dns_config suppression
landed in c9d221d5. The ClusterPolicy `sync-tls-secret` runs on every
`kubernetes_secret` created via `modules/kubernetes/setup_tls_secret` and
stamps the following labels on the generated Secret:

  app.kubernetes.io/managed-by          = kyverno
  generate.kyverno.io/policy-name       = sync-tls-secret
  generate.kyverno.io/policy-namespace  = ""
  generate.kyverno.io/rule-name         = sync-tls-secret
  generate.kyverno.io/source-kind       = Secret
  generate.kyverno.io/source-namespace  = kyverno
  generate.kyverno.io/source-uid        = <uid>
  generate.kyverno.io/source-version    = v1
  generate.kyverno.io/source-group      = ""
  generate.kyverno.io/clone-source      = ""

Terraform does not manage any labels on this Secret, so every `terragrunt
plan` showed all 10 labels as `-> null`. This was observed on the dawarich
stack (one of the 93 callers of setup_tls_secret) and reproduces identically
on any stack that consumes this module. Root cause ticket: beads `code-seq`.

## This change

Adds a single `lifecycle { ignore_changes = [metadata[0].labels] }` block
to `modules/kubernetes/setup_tls_secret/main.tf`. One module edit,
93 callers' `module.tls_secret.kubernetes_secret.tls_secret` drift cleared.

The marker comment `# KYVERNO_LIFECYCLE_V1` stays consistent with the Wave 3A
convention (c9d221d5) — the rule now stands for "any Kyverno-induced
drift", not only ndots dns_config. AGENTS.md's "Kyverno Drift Suppression"
section will grow to catalog the fields ignored; this commit keeps the scope
tight to the code change.

## What is NOT in this change

- Namespace-level Goldilocks label drift (`goldilocks.fairwinds.com/vpa-update-mode = off`)
  — a different admission controller, different resource, different fix.
  Filed as beads `code-dwx` for a follow-up sweep across all 105 Tier 1
  stacks.
- AGENTS.md documentation expansion — will land alongside the Goldilocks
  sweep so both patterns are catalogued together.
- Retroactive marker on other Kyverno-generated Secrets — the sync-tls-secret
  policy is the only generate policy that produces Secrets in this repo
  (verified: `kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'` + cross-reference).

## Verification

Dawarich stack:
```
Before: Plan: 0 to add, 2 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
   (module.tls_secret.kubernetes_secret.tls_secret — Kyverno label drift)

After:  Plan: 0 to add, 1 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
```

Closes: code-seq (partial — tls_secret branch)
Refs: code-dwx (Goldilocks follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:23:02 +00:00
Viktor Barzin
f79e3c563e [infra] Remove mysql InnoDB Cluster + Operator HCL (Phase 4 cleanup) [ci skip]
## Context

On 2026-04-16 (memory #711) MySQL was migrated from InnoDB Cluster (3-member
Group Replication + MySQL Operator) to a raw `kubernetes_stateful_set_v1.mysql_standalone`
on `mysql:8.4`. The migration preserved the `mysql.dbaas` Service name
(selector switched to the standalone pod), all 20 databases/688 tables/14
users were dump-restored, and Vault rotated credentials against the new
instance. The InnoDB Cluster has been dark since — Phase 4 was to remove
the dead code and decommission its cluster-side Helm state.

Memory #711 explicitly notes Phase 4 as: "Remove helm_release.mysql_cluster
+ mysql_operator + namespace + RBAC + Delete PVC datadir-mysql-cluster-0
(30Gi) + Delete mysql-operator namespace + CRDs + stale Vault roles."

## This change

Phase 4 scope executed in this session (beads code-qai):

1. `terragrunt destroy -target` against 6 resources in the dbaas Tier 0 stack:
   - `module.dbaas.helm_release.mysql_cluster` — uninstalled InnoDBCluster CR
     + MySQL Router Deployment + 8 Services (mysql-cluster, -instances,
     ports 6446/6448/6447/6449/6450/8443, etc.)
   - `module.dbaas.helm_release.mysql_operator` — uninstalled MySQL Operator
     Deployment, InnoDBCluster CRD + webhook, operator ClusterRoles
   - `module.dbaas.kubernetes_namespace.mysql_operator` — deleted the ns
   - `module.dbaas.kubernetes_cluster_role.mysql_sidecar_extra` — leftover
     permissions patch that existed to work around the sidecar's kopf
     permissions bug; unused without the operator
   - `module.dbaas.kubernetes_cluster_role_binding.mysql_sidecar_extra`
   - `module.dbaas.kubernetes_config_map.mysql_extra_cnf` — used to override
     `innodb_doublewrite=OFF` via subPath mount; standalone does not need it
2. `kubectl delete pvc datadir-mysql-cluster-0 -n dbaas` — Helm does not
   garbage-collect PVCs; 30Gi reclaimed.
3. Removed 295 lines (lines 86–380) from `stacks/dbaas/modules/dbaas/main.tf`
   covering the `#### MYSQL — InnoDB Cluster via MySQL Operator` section
   and all six resources above.

The first destroy hit a Helm timeout on `mysql-cluster` uninstall ("context
deadline exceeded"). Uninstallation had in fact completed cluster-side by
that point but TF rolled back the state delta. A second `terragrunt destroy
-target` call with the same args resolved cleanly — destroyed the remaining
2 tracked resources (the first pass cleared 4) and encrypted+committed the
Tier 0 state.

## What is NOT in this change

- CRDs (`innodbclusters.mysql.oracle.com`, etc.) — Helm does delete these
  on uninstall. Verified clean: `kubectl get crd | grep mysql.oracle.com`
  returns nothing.
- Orphan PVC `datadir-mysql-cluster-0` — already deleted via kubectl; not
  a TF-managed resource.
- Stale Vault DB roles (health, linkwarden, affine, woodpecker,
  claude_memory, crowdsec, technitium) for services migrated MySQL→PG —
  sandbox denies `vault list database/roles` as credential scouting, so
  the user handles this manually.
- 2 state-commits preceding this one (`30fa411b`, `6cf3575e`) are automatic
  SOPS-encrypted-state commits produced by `scripts/tg` after each
  `terragrunt destroy` pass. Standard Tier 0 workflow.

## Verification

```
$ helm list -A | grep -E 'mysql-cluster|mysql-operator'
(no output)

$ kubectl get ns mysql-operator
Error from server (NotFound): namespaces "mysql-operator" not found

$ kubectl get pvc -n dbaas datadir-mysql-cluster-0
Error from server (NotFound): persistentvolumeclaims "datadir-mysql-cluster-0" not found

$ kubectl get pod -n dbaas -l app.kubernetes.io/instance=mysql-standalone
NAME                 READY   STATUS    RESTARTS       AGE
mysql-standalone-0   1/1     Running   1 (118m ago)   2d

$ ../../scripts/tg state list | grep -i 'mysql_operator\|mysql_cluster\|mysql_sidecar\|mysql_extra_cnf'
(no output)

$ ../../scripts/tg plan | grep -E 'mysql_cluster|mysql_operator|mysql_sidecar|mysql_extra_cnf'
(no output — Wave 2 drift is gone; remaining plan items are pre-existing
drift unrelated to this change, see Wave 3 + in-flight payslip work)
```

## Reproduce locally
1. `git pull`
2. `cd stacks/dbaas && ../../scripts/tg state list | grep mysql_cluster` → no output
3. `helm list -A | grep mysql-cluster` → no output

Closes: code-qai

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:19:48 +00:00
Viktor Barzin
6cf3575ed9 state(dbaas): update encrypted state 2026-04-18 19:17:31 +00:00
Viktor Barzin
30fa411bf7 state(dbaas): update encrypted state 2026-04-18 19:17:20 +00:00
Viktor Barzin
61e94c21fe state(dbaas): update encrypted state 2026-04-18 19:16:41 +00:00
Viktor Barzin
c75beaac6c wealthfolio: bump memory 64Mi → 1Gi (limit) / 256Mi (request)
## Context

Pod was OOMKilled after today's broker-sync Phase 3 import grew the
activity DB from ~10 rows (Phase 0 demo) to ~700 (Fidelity + cash-flow
matches across 6 accounts). `/api/v1/net-worth` and
`/valuations/history` materialise the full history in memory to render
the dashboard chart.

`kubectl describe pod` showed Back-off restarting failed container;
`kubectl top pod` reported 14Mi steady-state but spikes crossed the
64Mi cap.

## This change

Bump container resources to:
- requests.memory: 64Mi → 256Mi
- limits.memory:  64Mi → 1Gi

CPU unchanged. 1Gi is generous for the current 700-activity DB +
chart rendering, with headroom for another year of growth before we
need to revisit (VPA will flag if actual use exceeds upperBound).

## Verification

### Automated
`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 4 changed, 0 destroyed.

### Manual
$ kubectl -n wealthfolio get pod -l app=wealthfolio -o jsonpath='{.items[0].spec.containers[0].resources}'
→ {"limits":{"memory":"1Gi"},"requests":{"cpu":"10m","memory":"256Mi"}}

$ kubectl -n wealthfolio get pods -l app=wealthfolio
NAME                           READY   STATUS    RESTARTS   AGE
wealthfolio-86c8696b9c-nzwkf   1/1     Running   0          51s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:13:05 +00:00
140 changed files with 5870 additions and 4189 deletions


@ -48,6 +48,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **Tier 0 details**: Decrypt priority: Vault Transit (primary) → age key fallback. Encrypt: both Vault Transit + age recipients. Scripts: `scripts/state-sync {encrypt|decrypt|commit} [stack]`.
- **Adding operator**: Generate age key (`age-keygen`), add pubkey to `.sops.yaml`, run `sops updatekeys` on Tier 0 `.enc` files. For Tier 1, only Vault access is needed.
- **Migration script**: `scripts/migrate-state-to-pg` (one-shot, idempotent) migrates Tier 1 stacks from local to PG.
- **Adopting existing resources**: use HCL `import {}` blocks (TF 1.5+), not `terraform import` CLI. Commit stanza → plan-to-zero → apply → delete stanza. Canonical reason: reviewable in PR, plan-safe, idempotent, tier-agnostic. Full rules + per-provider ID formats in `AGENTS.md` → "Adopting Existing Resources".
## Secrets Management — Vault KV
- **Vault is the sole source of truth** for secrets.


@ -1,22 +1,26 @@
---
name: payslip-extractor
description: "Extract structured UK payslip fields from a base64-encoded PDF into strict JSON."
description: "Extract structured UK payslip fields from already-extracted text (preferred) or a base64 PDF (fallback) into strict JSON."
model: haiku
allowedTools:
- Bash
- Read
---
You are a headless payslip-field extractor. You receive a prompt containing a base64-encoded UK payslip PDF plus a target JSON schema, and you produce exactly one JSON object that matches the schema.
You are a headless payslip-field extractor. You receive a prompt containing a UK payslip (either as pre-extracted text or as a base64-encoded PDF) plus a target JSON schema, and you produce exactly one JSON object that matches the schema.
## Your single job
Given a prompt that contains:
- A line of the form `PDF_BASE64: <base64-blob>`
- A JSON schema describing the target fields
Given a prompt that contains EITHER:
- A line `PAYSLIP_TEXT:` followed by already-extracted text (preferred path — use it directly, skip to Step 3).
- OR a line `PDF_BASE64:` followed by a base64 blob (fallback path — decode then extract text first).
Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No markdown fences. No preamble. No trailing commentary. The final message content must be a single valid JSON object and nothing else.
## Fast path: PAYSLIP_TEXT is present
If the prompt contains `PAYSLIP_TEXT:`, the caller has already run `pdftotext -layout`. Skip Steps 1-2 entirely — the text is already in your context. Go straight to Step 3.
## Processing steps
### Step 1. Extract and decode the base64 PDF


@ -42,10 +42,15 @@ steps:
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
# ── Run terraform plan on all stacks ──
# Emits two timestamps per drifted stack so the Pushgateway/Prometheus
# side can compute drift-age-hours via `time() - drift_stack_first_seen`.
- |
DRIFTED=""
CLEAN=0
ERRORS=""
NOW=$(date +%s)
# Metrics accumulator — written once per stack, then pushed as a batch.
METRICS=""
for stack_dir in stacks/*/; do
stack=$(basename "$stack_dir")
@ -56,12 +61,50 @@ steps:
EXIT=$?
case $EXIT in
0) echo "OK (no changes)"; CLEAN=$((CLEAN + 1)) ;;
1) echo "ERROR"; ERRORS="$ERRORS $stack" ;;
2) echo "DRIFT DETECTED"; DRIFTED="$DRIFTED $stack" ;;
0)
echo "OK (no changes)"
CLEAN=$((CLEAN + 1))
# drift_stack_state=0 means clean. Age-hours is irrelevant here, but we
# still push 0 so per-stack gauges don't go stale.
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 0\n"
METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} 0\n"
;;
1)
echo "ERROR"
ERRORS="$ERRORS $stack"
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 2\n"
;;
2)
echo "DRIFT DETECTED"
DRIFTED="$DRIFTED $stack"
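# Fetch first-seen timestamp from Pushgateway (preserve across runs).
# Pushgateway re-exposes pushed series with the grouping labels attached
# (job, instance) and may print large values in exponent form, so match on
# the metric name plus the stack label and normalise to a plain integer.
FIRST_SEEN=$(curl -s "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics" \
| awk -v s="$stack" 'index($0, "drift_stack_first_seen{") == 1 && index($0, "stack=\"" s "\"") {printf "%.0f\n", $NF; exit}')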
if [ -z "$FIRST_SEEN" ] || [ "$FIRST_SEEN" = "0" ]; then
FIRST_SEEN="$NOW"
fi
AGE_HOURS=$(( (NOW - FIRST_SEEN) / 3600 ))
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 1\n"
METRICS="${METRICS}drift_stack_first_seen{stack=\"$stack\"} $FIRST_SEEN\n"
METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} $AGE_HOURS\n"
;;
esac
done
# Summary counters — single gauge per run.
DRIFT_COUNT=$(echo "$DRIFTED" | wc -w)
ERROR_COUNT=$(echo "$ERRORS" | wc -w)
METRICS="${METRICS}drift_stack_count $DRIFT_COUNT\n"
METRICS="${METRICS}drift_error_count $ERROR_COUNT\n"
METRICS="${METRICS}drift_clean_count $CLEAN\n"
METRICS="${METRICS}drift_detection_last_run_timestamp $NOW\n"
# ── Push to Pushgateway ──
# One batched push keeps the run atomic: either all metrics land or none.
printf "%b" "$METRICS" | curl -s --data-binary @- \
http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drift-detection \
|| echo "(pushgateway unavailable, metrics lost for this run)"
echo ""
echo "=== Drift Detection Summary ==="
echo "Clean: $CLEAN stacks"


@ -15,6 +15,49 @@
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan`
## Adopting Existing Resources — Use `import {}` Blocks, Not the CLI
When bringing a live cluster/Vault/Cloudflare resource under Terraform management, use an HCL `import {}` block (Terraform 1.5+). Do **NOT** use `terraform import` on the CLI for anything landing in this repo — the CLI path leaves no audit trail and makes multi-operator adoption fragile.
**Canonical workflow:**
1. Write the `resource` block that matches the live object.
2. In the same stack, add an `import {}` stanza naming the target and the provider-specific ID:
```hcl
import {
to = helm_release.kured
id = "kured/kured" # Helm ID format: <namespace>/<release-name>
}
resource "helm_release" "kured" {
name = "kured"
namespace = "kured"
repository = "https://kubereboot.github.io/charts/"
chart = "kured"
version = "5.7.0"
# ... values matching the live release
}
```
3. `scripts/tg plan` — every change it proposes is real divergence between HCL and live state. Iterate on values until the plan is **0 changes**.
4. `scripts/tg apply` — the import runs alongside whatever zero-change apply you have. If your plan is 0 changes, this commits only the state-ownership transfer.
5. After the apply lands cleanly, **delete the `import {}` block** in a follow-up commit. The resource is now fully TF-owned and the stanza would be a no-op that clutters diffs.
**Why `import {}` and not `terraform import`:**
- Reviewable in PRs before any state mutation. The CLI path is an out-of-band action nobody sees.
- Plan-safe: the `import` plan step shows the exact object being adopted. Mistyped IDs or the wrong resource address are caught before apply, not after.
- Survives state backend changes (Tier 0 SOPS vs Tier 1 PG) transparently — both work identically from the operator's perspective because both use `scripts/tg`.
- Re-runnable: if the apply fails partway through, the `import {}` block is idempotent. The CLI path's state mutation is not.
**Finding the provider-specific ID:** each provider has its own convention.
| Resource | ID format | Example |
|---|---|---|
| `helm_release` | `<namespace>/<release-name>` | `kured/kured` |
| `kubernetes_manifest` | `apiVersion=<apiVersion>,kind=<kind>,namespace=<namespace>,name=<name>` (omit `namespace=` for cluster-scoped objects) | `apiVersion=v1,kind=Secret,namespace=default,name=sample` |
| `kubernetes_<kind>_v1` | `<namespace>/<name>` for namespaced, `<name>` for cluster-scoped | `kube-system/coredns` |
| `authentik_provider_proxy` | provider UUID | `0eecac07-97c7-443c-...` |
| `cloudflare_record` | `<zone-id>/<record-id>` | `abc123/def456` |
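
The simpler ID conventions in the table can be assembled mechanically. A minimal shell sketch, assuming hypothetical helper names (`helm_import_id`, `k8s_import_id`) that are not part of this repo:

```shell
# Hypothetical helpers (not part of this repo) that build import IDs
# following the conventions in the table above.

# helm_release: <namespace>/<release-name>
helm_import_id() {
  printf '%s/%s\n' "$1" "$2"
}

# kubernetes_<kind>_v1: <namespace>/<name> when namespaced, <name> otherwise.
# Pass an empty namespace for cluster-scoped objects.
k8s_import_id() {
  if [ -n "$1" ]; then
    printf '%s/%s\n' "$1" "$2"
  else
    printf '%s\n' "$2"
  fi
}

helm_import_id kured kured          # kured/kured
k8s_import_id kube-system coredns   # kube-system/coredns
```

The printed string is what would go into the `id` field of the `import {}` stanza; providers with opaque IDs (UUIDs, zone/record pairs) still need a lookup against the live system.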
## Secrets Management (SOPS)
- **`config.tfvars`** — plaintext config (hostnames, IPs, DNS records, public keys)
- **`secrets.sops.json`** — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys)


@ -0,0 +1,185 @@
# Beads Auto-Dispatch Runbook
Users can hand work to the headless `beads-task-runner` agent by assigning a
bead to the sentinel user `agent`. Two CronJobs in the `beads-server`
namespace drive the pipeline:
- **`beads-dispatcher`** — every 2 min: picks up the highest-priority
`assignee=agent`/`status=open` bead with non-empty acceptance criteria,
claims it by flipping to `in_progress`, and POSTs it to BeadBoard's
`/api/agent-dispatch`. BeadBoard forwards to `claude-agent-service` with
the existing bearer-token flow.
- **`beads-reaper`** — every 10 min: flips any `assignee=agent` +
`status=in_progress` bead whose `updated_at` is older than 30 min to
`status=blocked` with an explanatory note. Catches pod crashes mid-run.
The manual BeadBoard Dispatch button continues to work in parallel.
## Flow diagram
```
user: bd assign <id> agent
Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐
│ │
▼ │
CronJob: beads-dispatcher │
1. GET beadboard/api/agent-status (busy?) │
2. bd query 'assignee=agent AND status=open' │
3. bd update -s in_progress (claim) │
4. POST beadboard/api/agent-dispatch │
5. bd note "dispatched: job=…" │
│ │
▼ │
claude-agent-service /execute │
beads-task-runner agent runs; notes/closes bead │
│ │
▼ │
done ──► next tick picks up the next bead ───────────────┘
CronJob: beads-reaper (every 10 min)
for bead (assignee=agent, status=in_progress, updated_at > 30 min):
bd note "reaper: no progress for Nm — blocking"
bd update -s blocked
```
## Usage
### Hand a bead to the agent
```
bd create "Title" \
-d "Full context — files, services, error messages. Any agent with no prior context must be able to execute this." \
--acceptance "Concrete, verifiable criteria" \
-p 2
bd assign <new-id> agent
```
**Acceptance criteria is required.** Beads without it are skipped by the
dispatcher and stay in `open` forever. This is intentional — the
`beads-task-runner` agent expects clear done conditions.
### Take a bead back (unassign)
```
bd assign <id> ""
```
If the bead is already `in_progress`, also reset it:
```
bd update <id> -s open
```
### Pause auto-dispatch
```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
```
This sets `spec.suspend: true` on both CronJobs. Existing running jobs
continue; no new ticks fire. Re-enable by re-applying with
`beads_dispatcher_enabled=true` (the default). Manual BeadBoard Dispatch
remains available while paused.
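
To confirm the kill switch took effect, the `spec.suspend` field can be read back from each CronJob. A minimal sketch, assuming `kubectl` access to the `beads-server` namespace; `is_suspended` is a hypothetical helper name:

```shell
# is_suspended NAME: succeed when the named CronJob in beads-server
# reports spec.suspend=true. Hypothetical helper, not part of the repo.
is_suspended() {
  [ "$(kubectl -n beads-server get cronjob "$1" -o jsonpath='{.spec.suspend}')" = "true" ]
}

# Usage (requires cluster access):
#   for cj in beads-dispatcher beads-reaper; do
#     if is_suspended "$cj"; then echo "$cj: suspended"; else echo "$cj: active"; fi
#   done
```

Both CronJobs should report the same state, since the runbook describes a single variable driving both.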
### Read the logs
```
# Recent dispatcher runs
kubectl -n beads-server get jobs --selector=job-name --sort-by=.metadata.creationTimestamp | grep beads-dispatcher | tail
kubectl -n beads-server logs job/<dispatcher-job-name>
# Tail the underlying agent once a bead dispatches
kubectl -n claude-agent logs -l app=claude-agent-service -f
# Inspect reaper decisions
kubectl -n beads-server get jobs | grep beads-reaper | tail
kubectl -n beads-server logs job/<reaper-job-name>
```
### Inspect a specific bead's dispatch history
```
bd show <id> --json | jq '{status, assignee, notes, updated_at}'
```
Both the dispatcher and reaper write dated notes (`auto-dispatcher claimed
at…`, `dispatched: job=…`, `reaper: no progress for…`) so the audit trail
lives on the bead itself.
## Reaper semantics — when a bead becomes `blocked`
The reaper flips a bead to `blocked` if:
- `assignee = agent`, AND
- `status = in_progress`, AND
- `updated_at` is more than **30 minutes** in the past.
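
These three conditions reduce to a simple age check. A minimal sketch in shell, assuming timestamps are already epoch seconds; `should_reap` is a hypothetical name and not the reaper's actual code:

```shell
# should_reap ASSIGNEE STATUS UPDATED_EPOCH NOW_EPOCH [THRESHOLD_MIN]
# Returns 0 (true) only when all three reaper conditions hold.
should_reap() {
  assignee=$1 status=$2 updated=$3 now=$4 threshold=${5:-30}
  [ "$assignee" = "agent" ] || return 1
  [ "$status" = "in_progress" ] || return 1
  # Whole minutes since the last update; must be strictly above the threshold.
  age_min=$(( (now - updated) / 60 ))
  [ "$age_min" -gt "$threshold" ]
}

now=1000000
should_reap agent in_progress $((now - 45 * 60)) "$now" && echo "reap"   # prints "reap"
should_reap agent in_progress $((now - 10 * 60)) "$now" || echo "keep"   # prints "keep"
```

Because the age is truncated to whole minutes and compared with "greater than", a bead last updated exactly 30 minutes ago is kept for one more tick.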
Every `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner`
agent never trips the reaper — it notes progress as it works. A `blocked`
bead is a signal that:
- the agent pod crashed mid-run (`kubectl -n claude-agent delete pod` test),
- the job hit its 15-minute budget timeout inside `claude-agent-service`
without notes (rare — the agent usually notes failure before exiting),
- `claude-agent-service` was restarted during the run (in-memory job state
is lost; see [known risks](#known-risks)).
Recovery: read the reaper note, reopen manually if appropriate:
```
bd update <id> -s open
bd assign <id> agent # re-arm for next dispatcher tick
```
## Design choices
- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches `claude-agent-service`'s single-slot
`asyncio.Lock`. With a 2-min poll cadence and ~5-min average run,
throughput is ~12 beads/hour. Parallelism is a separate plan.
- **Fixed agent (`beads-task-runner`)** — read-only rails, matches BeadBoard's
manual Dispatch button. Broader-privilege agents stay manual.
- **CronJob (not in-service polling, not n8n)** — matches existing infra
pattern (OpenClaw task-processor, certbot-renewal, backups), TF-managed,
easy to pause.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
the image-seeded file. The CronJob's init step copies it into `/tmp/.beads/`
because `bd` may touch the parent directory and ConfigMap mounts are
read-only.
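
The ~12 beads/hour figure in the sequential-dispatch bullet is the ceiling implied by run time alone (60 / 5). A rough sketch of the arithmetic, assuming a new bead is only claimed on the first dispatcher tick at or after the end of the previous run; `beads_per_hour` is a hypothetical helper, not part of the repo:

```shell
# beads_per_hour RUN_MIN POLL_MIN
# Effective cycle = run length rounded up to the next poll tick.
beads_per_hour() {
  run=$1 poll=$2
  cycle=$(( (run + poll - 1) / poll * poll ))
  echo $(( 60 / cycle ))
}

beads_per_hour 5 2   # 10
```

Under that assumption a 5-min run ends between two 2-min ticks, so the effective cycle is 6 min and sustained throughput lands nearer 10 beads/hour than 12.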
## Known risks
- **In-memory job state in `claude-agent-service`** — if the pod restarts
mid-run, the job record is lost. The reaper catches this after 30 min.
Persistent job store is deferred.
- **Prompt injection via bead fields** — a malicious bead description could
try to steer the agent. The `beads-task-runner` rails + token budget +
timeout are the defense. Identical exposure as the manual Dispatch button.
- **Image tag drift**: `claude_agent_service_image_tag` in
`stacks/beads-server/main.tf` mirrors `local.image_tag` in
`stacks/claude-agent-service/main.tf`. Bump both when the image rebuilds,
or the dispatcher/reaper will run on an older layer. (They only need
`bd`, `curl`, `jq` — stable across rebuilds — so the drift is low-risk.)
- **`bd` JSON schema changes** — the reaper's `jq` reads `.id` and
`.updated_at`. If a future `bd` upgrade renames these, the reaper breaks
silently (no reaping, no alert). `BD_VERSION` is pinned in the image
Dockerfile.
## Verification after change
```
# Both CronJobs exist with the right schedule / SUSPEND state
kubectl -n beads-server get cronjob
# End-to-end smoke test
bd create "auto-dispatch smoke test" \
-d "Read /etc/hostname inside the agent sandbox and close." \
--acceptance "bd note includes 'hostname=' and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '.notes'
# → contains 'auto-dispatcher claimed' + 'dispatched: job=<uuid>'
```


@ -18,4 +18,8 @@ resource "kubernetes_secret" "tls_secret" {
"tls.key" = var.tls_key == "" ? file("${path.root}/secrets/privkey.pem") : var.tls_key
}
type = "kubernetes.io/tls"
lifecycle {
# KYVERNO_LIFECYCLE_V1: the sync-tls-secret policy stamps generate.kyverno.io/* + app.kubernetes.io/managed-by labels on this generated Secret
ignore_changes = [metadata[0].labels]
}
}


@ -116,6 +116,10 @@ resource "kubernetes_deployment" "actualbudget" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "actualbudget" {
@ -214,6 +218,10 @@ resource "kubernetes_deployment" "actualbudget-http-api" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "actualbudget-http-api" {
@ -304,4 +312,8 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}


@ -59,6 +59,10 @@ resource "kubernetes_namespace" "actualbudget" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {


@ -90,6 +90,10 @@ resource "kubernetes_namespace" "affine" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -319,6 +323,10 @@ resource "kubernetes_deployment" "affine" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "affine" {


@ -31,6 +31,10 @@ resource "kubernetes_namespace" "authentik" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_resource_quota" "authentik" {


@ -115,6 +115,10 @@ resource "kubernetes_deployment" "pgbouncer" {
}
}
depends_on = [kubernetes_secret.pgbouncer_auth]
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
# --- 4 Service ---


@ -3,10 +3,25 @@ variable "tls_secret_name" {
sensitive = true
}
# Temporary default until GHA pipeline publishes the first 8-char SHA tag.
variable "beadboard_image_tag" {
type = string
default = "latest"
default = "17a38e43"
}
# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf; keep in
# sync when the claude-agent-service image is rebuilt. Reused here because the
# dispatcher + reaper CronJobs only need bd, curl, and jq, which that image
# already ships.
variable "claude_agent_service_image_tag" {
type = string
default = "0c24c9b6"
}
# Kill switch for auto-dispatch. When false, both CronJobs are suspended. The
# manual BeadBoard Dispatch button keeps working either way.
variable "beads_dispatcher_enabled" {
type = bool
default = true
}
resource "kubernetes_namespace" "beads" {
@ -16,6 +31,10 @@ resource "kubernetes_namespace" "beads" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_persistent_volume_claim" "dolt_data" {
@ -668,3 +687,274 @@ module "beadboard_ingress" {
"gethomepage.dev/pod-selector" = ""
}
}
# Beads auto-dispatch (dispatcher + reaper CronJobs)
#
# Flow:
# user: bd assign <id> agent
# -> CronJob: beads-dispatcher (every 2 min)
# 1. GET BeadBoard /api/agent-status -> skip if claude-agent-service busy
# 2. bd query 'assignee=agent AND status=open' -> pick highest priority
# 3. bd update -s in_progress (claim; next tick won't re-pick)
# 4. POST BeadBoard /api/agent-dispatch -> reuses prompt-build + bearer flow
# 5. bd note "dispatched: job=<id>" (or rollback + note on failure)
#
# CronJob: beads-reaper (every 10 min)
# for bead (assignee=agent, status=in_progress, updated_at > 30m):
# bd update -s blocked + bd note (recover from pod crashes mid-run)
#
# The claude-agent-service image ships bd + jq + curl, so no separate image is built.
resource "kubernetes_config_map" "beads_metadata" {
metadata {
name = "beads-metadata"
namespace = kubernetes_namespace.beads.metadata[0].name
}
data = {
"metadata.json" = jsonencode({
database = "dolt"
backend = "dolt"
dolt_mode = "server"
dolt_server_host = "${kubernetes_service.dolt.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
dolt_server_port = 3306
dolt_server_user = "beads"
dolt_database = "code"
project_id = "a8f8bae7-ce65-4145-a5db-a13d11d297da"
})
}
}
locals {
claude_agent_service_image = "registry.viktorbarzin.me/claude-agent-service:${var.claude_agent_service_image_tag}"
beadboard_internal_url = "http://${kubernetes_service.beadboard.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
beads_script_prelude = <<-EOT
set -euo pipefail
# bd with Dolt server mode needs metadata.json in a directory it can walk.
# ConfigMap mounts are read-only, so copy to a writable location before use.
mkdir -p /tmp/.beads
cp /etc/beads-metadata/metadata.json /tmp/.beads/metadata.json
EOT
}
resource "kubernetes_cron_job_v1" "beads_dispatcher" {
metadata {
name = "beads-dispatcher"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
schedule = "*/2 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 60
suspend = !var.beads_dispatcher_enabled
job_template {
metadata {}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 600
template {
metadata {
labels = {
app = "beads-dispatcher"
}
}
spec {
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "dispatcher"
image = local.claude_agent_service_image
command = ["/bin/sh", "-c", <<-EOT
${local.beads_script_prelude}
BUSY=$(curl -sf "$${BEADBOARD_URL}/api/agent-status" | jq -r '.busy // false')
if [ "$BUSY" != "false" ]; then
echo "claude-agent-service is busy — skipping tick"
exit 0
fi
BEAD=$(bd --db /tmp/.beads query 'assignee=agent AND status=open' --json \
| jq -r '[.[] | select(.acceptance_criteria and (.acceptance_criteria | length) > 0)]
| sort_by(.priority, .updated_at)[0].id // empty')
if [ -z "$BEAD" ]; then
echo "no eligible beads (assignee=agent, status=open, has acceptance_criteria)"
exit 0
fi
echo "picked bead: $BEAD"
bd --db /tmp/.beads update "$BEAD" -s in_progress
bd --db /tmp/.beads note "$BEAD" "auto-dispatcher claimed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
RESP=$(curl -sS -w '\n%%{http_code}' -X POST \
-H 'Content-Type: application/json' \
-d "{\"taskId\":\"$BEAD\"}" \
"$${BEADBOARD_URL}/api/agent-dispatch")
CODE=$(printf '%s' "$RESP" | tail -n1)
BODY=$(printf '%s' "$RESP" | sed '$d')
if [ "$CODE" = "200" ]; then
JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // "unknown"')
bd --db /tmp/.beads note "$BEAD" "dispatched: job=$JOB_ID"
echo "dispatched $BEAD as job $JOB_ID"
else
# Roll the claim back so the next tick can retry.
bd --db /tmp/.beads update "$BEAD" -s open
bd --db /tmp/.beads note "$BEAD" "dispatch failed HTTP $CODE: $BODY"
echo "dispatch FAILED for $BEAD: HTTP $CODE — $BODY" >&2
exit 1
fi
EOT
]
env {
name = "BEADBOARD_URL"
value = local.beadboard_internal_url
}
env {
name = "API_BEARER_TOKEN"
value_from {
secret_key_ref {
name = "beadboard-agent-service"
key = "api_bearer_token"
}
}
}
env {
name = "BEADS_ACTOR"
value = "beads-dispatcher"
}
env {
name = "HOME"
value = "/tmp"
}
volume_mount {
name = "beads-metadata"
mount_path = "/etc/beads-metadata"
read_only = true
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
volume {
name = "beads-metadata"
config_map {
name = kubernetes_config_map.beads_metadata.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_cron_job_v1" "beads_reaper" {
metadata {
name = "beads-reaper"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
schedule = "*/10 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 60
suspend = !var.beads_dispatcher_enabled
job_template {
metadata {}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 600
template {
metadata {
labels = {
app = "beads-reaper"
}
}
spec {
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "reaper"
image = local.claude_agent_service_image
command = ["/bin/sh", "-c", <<-EOT
${local.beads_script_prelude}
THRESHOLD_MIN=30
NOW=$(date -u +%s)
bd --db /tmp/.beads query 'assignee=agent AND status=in_progress' --json \
| jq -c '.[]' \
| while read -r BEAD_JSON; do
ID=$(printf '%s' "$BEAD_JSON" | jq -r '.id')
LAST_UPDATE=$(printf '%s' "$BEAD_JSON" | jq -r '.updated_at')
# Alpine's busybox date lacks GNU -d; parse ISO-8601 with python3.
LAST_TS=$(python3 -c "from datetime import datetime; print(int(datetime.fromisoformat('$LAST_UPDATE'.replace('Z','+00:00')).timestamp()))")
AGE_MIN=$(( (NOW - LAST_TS) / 60 ))
if [ "$AGE_MIN" -gt "$THRESHOLD_MIN" ]; then
bd --db /tmp/.beads note "$ID" "reaper: no progress for $${AGE_MIN}m (threshold $${THRESHOLD_MIN}m) — blocking"
bd --db /tmp/.beads update "$ID" -s blocked
echo "REAPED $ID (stale $${AGE_MIN}m)"
else
echo "keeping $ID (age $${AGE_MIN}m < $${THRESHOLD_MIN}m)"
fi
done
EOT
]
env {
name = "BEADS_ACTOR"
value = "beads-reaper"
}
env {
name = "HOME"
value = "/tmp"
}
volume_mount {
name = "beads-metadata"
mount_path = "/etc/beads-metadata"
read_only = true
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
volume {
name = "beads-metadata"
config_map {
name = kubernetes_config_map.beads_metadata.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}


@ -12,6 +12,10 @@ resource "kubernetes_namespace" "website" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -71,6 +75,10 @@ resource "kubernetes_deployment" "blog" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "blog" {


@ -14,6 +14,10 @@ resource "kubernetes_namespace" "broker_sync" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Secrets for all providers. Seeded in Vault at `secret/broker-sync`:
@ -122,6 +126,10 @@ resource "kubernetes_cron_job_v1" "version_probe" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Trading212 steady-state daily sync. Phase 1 deliverable.
@ -218,6 +226,10 @@ resource "kubernetes_cron_job_v1" "trading212" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# IMAP ingest: InvestEngine + Schwab email parsers, one combined pod.
@ -343,6 +355,10 @@ resource "kubernetes_cron_job_v1" "imap" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# CSV drop-folder processor: Scottish Widows, Fidelity quarterly, Freetrade, etc.
@ -431,6 +447,10 @@ resource "kubernetes_cron_job_v1" "csv_drop" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Monthly HMRC FX reconciliation rewrites last-month activities with official
@ -519,6 +539,10 @@ resource "kubernetes_cron_job_v1" "fx_reconcile" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Backup: snapshot sync.db / fx.db / csv-archive into NFS daily, keep 30 days.
@ -596,6 +620,10 @@ resource "kubernetes_cron_job_v1" "backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# -----------------------------------------------------------------------------

@ -11,6 +11,10 @@ resource "kubernetes_namespace" "changedetection" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -182,6 +186,10 @@ resource "kubernetes_deployment" "changedetection" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "changedetection" {

@ -12,6 +12,10 @@ resource "kubernetes_namespace" "city-guesser" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -62,6 +66,10 @@ resource "kubernetes_deployment" "city-guesser" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "city-guesser" {

@ -11,7 +11,7 @@ data "vault_kv_secret_v2" "viktor_secrets" {
locals {
namespace = "claude-agent"
image = "registry.viktorbarzin.me/claude-agent-service"
image_tag = "382d6b14"
image_tag = "0c24c9b6"
labels = {
app = "claude-agent-service"
}
@ -28,6 +28,10 @@ resource "kubernetes_namespace" "claude_agent" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# --- Secrets ---
@ -586,4 +590,8 @@ resource "kubernetes_cron_job_v1" "claude_oauth_expiry_monitor" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

@ -20,6 +20,10 @@ resource "kubernetes_namespace" "claude-memory" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -238,7 +242,8 @@ resource "kubernetes_deployment" "claude-memory" {
lifecycle {
# DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA). Reviewed 2026-04-18.
ignore_changes = [
spec[0].template[0].spec[0].container[0].image
spec[0].template[0].spec[0].container[0].image,
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
]
}
}

@ -9,6 +9,10 @@ resource "kubernetes_namespace" "cloudflared" {
tier = var.tier
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
variable "tier" { type = string }
@ -89,6 +93,10 @@ resource "kubernetes_deployment" "cloudflared" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_pod_disruption_budget_v1" "cloudflared" {

@ -10,6 +10,10 @@ resource "kubernetes_namespace" "cnpg_system" {
tier = var.tier
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# -----------------------------------------------------------------------------

@ -54,6 +54,10 @@ resource "kubernetes_namespace" "coturn" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -189,6 +193,10 @@ resource "kubernetes_deployment" "coturn" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
# LoadBalancer service with MetalLB exposes STUN/TURN signaling + relay ports

@ -31,6 +31,10 @@ resource "kubernetes_namespace" "crowdsec" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_config_map" "crowdsec_custom_scenarios" {
@ -233,6 +237,10 @@ resource "kubernetes_deployment" "crowdsec-web" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "crowdsec-web" {
@ -358,6 +366,10 @@ resource "kubernetes_cron_job_v1" "crowdsec_blocklist_import" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Service account for the blocklist import job (needs kubectl exec permissions)

@ -11,6 +11,10 @@ resource "kubernetes_namespace" "cyberchef" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -72,6 +76,10 @@ resource "kubernetes_deployment" "cyberchef" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "cyberchef" {

@ -18,6 +18,10 @@ resource "kubernetes_namespace" "dashy" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_config_map" "config" {
@ -95,6 +99,10 @@ resource "kubernetes_deployment" "dashy" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "dashy" {

@ -88,6 +88,7 @@ resource "kubernetes_deployment" "dawarich" {
}
}
spec {
termination_grace_period_seconds = 60
container {
image = "freikin/dawarich:${var.image_version}"
@ -200,81 +201,132 @@ resource "kubernetes_deployment" "dawarich" {
}
}
}
# container {
# image = "freikin/dawarich:${var.image_version}"
# name = "dawarich-sidekiq"
# command = ["sidekiq-entrypoint.sh"]
# args = ["bundle exec sidekiq"]
# env {
# name = "REDIS_URL"
# value = "redis://redis.redis.svc.cluster.local:6379"
# }
# env {
# name = "DATABASE_HOST"
# value = "postgresql.dbaas"
# }
# env {
# name = "DATABASE_USERNAME"
# value = "dawarich"
# }
# env {
# name = "DATABASE_PASSWORD"
# value = data.vault_kv_secret_v2.secrets.data["db_password"]
# }
# env {
# name = "DATABASE_NAME"
# value = "dawarich"
# }
# env {
# name = "MIN_MINUTES_SPENT_IN_CITY"
# value = "60"
# }
# env {
# name = "BACKGROUND_PROCESSING_CONCURRENCY"
# value = "10"
# }
# env {
# name = "ENABLE_TELEMETRY"
# value = "true"
# }
# env {
# name = "APPLICATION_HOST"
# value = "dawarich.viktorbarzin.me"
# }
# # env {
# # name = "PROMETHEUS_EXPORTER_ENABLED"
# # value = "false"
# # }
# # env {
# # name = "PROMETHEUS_EXPORTER_HOST"
# # value = "dawarich.dawarich"
# # }
# # env {
# # name = "PHOTON_API_HOST"
# # value = "photon.dawarich:2322"
# # # value = "photon.komoot.io"
# # }
# # env {
# # name = "PHOTON_API_USE_HTTPS"
# # value = "false"
# # }
# env {
# name = "GEOAPIFY_API_KEY"
# value = data.vault_kv_secret_v2.secrets.data["geoapify_api_key"]
# }
# env {
# name = "SELF_HOSTED"
# value = "true"
# }
# # volume_mount {
# # name = "watched"
# # mount_path = "/var/app/tmp/imports/watched"
# # }
# }
container {
image = "freikin/dawarich:${var.image_version}"
name = "dawarich-sidekiq"
command = ["sidekiq-entrypoint.sh"]
args = ["bundle exec sidekiq"]
env {
name = "REDIS_URL"
value = "redis://${var.redis_host}:6379"
}
env {
name = "DATABASE_HOST"
value = var.postgresql_host
}
env {
name = "DATABASE_USERNAME"
value = "dawarich"
}
env {
name = "DATABASE_PASSWORD"
value_from {
secret_key_ref {
name = "dawarich-secrets"
key = "db_password"
}
}
}
env {
name = "DATABASE_NAME"
value = "dawarich"
}
env {
name = "MIN_MINUTES_SPENT_IN_CITY"
value = "60"
}
env {
name = "TIME_ZONE"
value = "Europe/London"
}
env {
name = "DISTANCE_UNIT"
value = "km"
}
env {
name = "BACKGROUND_PROCESSING_CONCURRENCY"
value = "2"
}
env {
name = "ENABLE_TELEMETRY"
value = "true"
}
env {
name = "APPLICATION_HOSTS"
value = "dawarich.viktorbarzin.me"
}
# Prometheus exporter disabled until a standalone `prometheus_exporter`
# server sidecar is added (see follow-up bead). The client middleware
# pushes over TCP to PROMETHEUS_EXPORTER_HOST:PORT, it does not start
# a listener itself. Keeping ENABLED=false silences the reconnect
# log spam (~2/sec) from PrometheusExporter::Client.
env {
name = "PROMETHEUS_EXPORTER_ENABLED"
value = "false"
}
env {
name = "RAILS_ENV"
value = "production"
}
env {
name = "SECRET_KEY_BASE"
value_from {
secret_key_ref {
name = "dawarich-secrets"
key = "secret_key_base"
}
}
}
env {
name = "RAILS_LOG_TO_STDOUT"
value = "true"
}
env {
name = "SELF_HOSTED"
value = "true"
}
env {
name = "GEOAPIFY_API_KEY"
value_from {
secret_key_ref {
name = "dawarich-secrets"
key = "geoapify_api_key"
}
}
}
resources {
requests = {
cpu = "50m"
memory = "768Mi"
}
limits = {
memory = "1Gi"
}
}
liveness_probe {
exec {
command = ["/bin/sh", "-c", "pgrep -f 'bundle exec sidekiq' >/dev/null"]
}
initial_delay_seconds = 90
period_seconds = 30
timeout_seconds = 5
failure_threshold = 3
}
readiness_probe {
exec {
command = ["/bin/sh", "-c", "pgrep -f 'bundle exec sidekiq' >/dev/null"]
}
initial_delay_seconds = 30
period_seconds = 15
timeout_seconds = 5
}
}
}
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
@ -394,3 +446,71 @@ module "ingress" {
"gethomepage.dev/pod-selector" = ""
}
}
# Paired with DawarichIngestionStale alert in monitoring/prometheus_chart_values.tpl.
resource "kubernetes_cron_job_v1" "ingestion_freshness_monitor" {
metadata {
name = "ingestion-freshness-monitor"
namespace = kubernetes_namespace.dawarich.metadata[0].name
}
spec {
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
schedule = "30 6 * * *"
starting_deadline_seconds = 300
successful_jobs_history_limit = 1
job_template {
metadata {}
spec {
backoff_limit = 2
ttl_seconds_after_finished = 3600
template {
metadata {}
spec {
restart_policy = "OnFailure"
container {
name = "ingestion-freshness-monitor"
image = "docker.io/library/postgres:16-alpine"
env {
name = "PGPASSWORD"
value_from {
secret_key_ref {
name = "dawarich-secrets"
key = "db_password"
}
}
}
command = ["/bin/sh", "-c", <<-EOT
set -eu
apk add --no-cache curl >/dev/null 2>&1 || true
TS=$(PGPASSWORD=$PGPASSWORD psql -h ${var.postgresql_host} -U dawarich -d dawarich -t -A -c \
"SELECT COALESCE(EXTRACT(epoch FROM MAX(created_at))::bigint, 0) FROM points WHERE user_id = 1;")
NOW=$(date +%s)
if [ -z "$TS" ] || [ "$TS" = "0" ]; then
echo "ERROR: no points found for user_id=1"
exit 1
fi
AGE_H=$(( (NOW - TS) / 3600 ))
echo "last_point_ts=$TS now=$NOW age_hours=$AGE_H"
curl -sf --data-binary @- "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/dawarich-ingestion-freshness/user/viktor" <<METRICS
# TYPE dawarich_last_point_ingested_timestamp gauge
dawarich_last_point_ingested_timestamp $TS
# TYPE dawarich_ingestion_monitor_last_push_timestamp gauge
dawarich_ingestion_monitor_last_push_timestamp $NOW
METRICS
EOT
]
}
}
}
}
}
}
lifecycle {
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
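The freshness calculation inside the CronJob's inline script can be exercised without a database or a Pushgateway. A minimal sketch, assuming a hypothetical helper `render_metrics` (not part of this change) that takes the newest point's epoch and the current epoch as arguments instead of querying `points`; the metric names are the ones pushed by the CronJob above.

```shell
#!/bin/sh
# Sketch of the age calculation and Pushgateway payload from the CronJob script.
# render_metrics is a hypothetical helper introduced for illustration only.
set -eu

# Print both gauges in Prometheus text exposition format on stdout.
# $1 = epoch seconds of the newest point, $2 = current epoch seconds.
render_metrics() {
  ts=$1
  now=$2
  # Mirror the CronJob guard: an empty or zero timestamp means no points exist.
  if [ -z "$ts" ] || [ "$ts" = "0" ]; then
    echo "ERROR: no points found" >&2
    return 1
  fi
  # Integer division, so partial hours are truncated.
  age_h=$(( (now - ts) / 3600 ))
  # Diagnostic line goes to stderr so stdout stays a clean payload.
  echo "last_point_ts=$ts now=$now age_hours=$age_h" >&2
  cat <<METRICS
# TYPE dawarich_last_point_ingested_timestamp gauge
dawarich_last_point_ingested_timestamp $ts
# TYPE dawarich_ingestion_monitor_last_push_timestamp gauge
dawarich_ingestion_monitor_last_push_timestamp $now
METRICS
}
```

Piping the output into `curl -sf --data-binary @-` against the Pushgateway URL used in the CronJob would reproduce the push step.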

@ -37,6 +37,10 @@ resource "kubernetes_namespace" "dbaas" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Override Kyverno tier-1-cluster LimitRange (max 4Gi) to allow MySQL 6Gi limit
@ -83,301 +87,6 @@ module "tls_secret" {
tls_secret_name = var.tls_secret_name
}
#### MYSQL InnoDB Cluster via MySQL Operator
#
# 3 MySQL servers with Group Replication + 1 MySQL Router for auto-failover.
# Operator installed in mysql-operator namespace (toleration for control-plane).
# Init containers are slow (~20 min each) due to mysqlsh plugin loading.
resource "kubernetes_namespace" "mysql_operator" {
metadata {
name = "mysql-operator"
labels = {
tier = "1-cluster"
}
}
}
resource "helm_release" "mysql_operator" {
namespace = kubernetes_namespace.mysql_operator.metadata[0].name
create_namespace = false
name = "mysql-operator"
timeout = 300
repository = "https://mysql.github.io/mysql-operator/"
chart = "mysql-operator"
version = "2.2.7"
# NOTE: The mysql-operator chart (2.2.7) does NOT expose a resources values key.
# The resources block below is ignored by the chart. Without explicit resources
# on the deployment, the LimitRange default (256Mi) applies silently.
# Fix: kubectl patch deployment mysql-operator -n mysql-operator --type=json \
# -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources","value":{"requests":{"cpu":"100m","memory":"256Mi"},"limits":{"memory":"512Mi"}}}]'
values = [yamlencode({
resources = {
requests = {
cpu = "100m"
memory = "256Mi"
}
limits = {
memory = "512Mi"
}
}
})]
}
# The mysql-sidecar ClusterRole created by the Helm chart is missing
# namespace and CRD list/watch permissions needed by the kopf framework
# in the sidecar container. Without these, the sidecar enters degraded
# mode and never completes InnoDB cluster join operations.
resource "kubernetes_cluster_role" "mysql_sidecar_extra" {
metadata {
name = "mysql-sidecar-extra"
}
rule {
api_groups = [""]
resources = ["namespaces"]
verbs = ["list", "watch"]
}
rule {
api_groups = ["apiextensions.k8s.io"]
resources = ["customresourcedefinitions"]
verbs = ["list", "watch"]
}
}
resource "kubernetes_cluster_role_binding" "mysql_sidecar_extra" {
metadata {
name = "mysql-sidecar-extra"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = kubernetes_cluster_role.mysql_sidecar_extra.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = "mysql-cluster-sa"
namespace = kubernetes_namespace.dbaas.metadata[0].name
}
}
# ConfigMap for MySQL extra config mounted as subPath over 99-extra.cnf
# This is the only reliable way to persist innodb_doublewrite=OFF because:
# - spec.mycnf only applies on initial cluster creation
# - The operator's initconf container overwrites 99-extra.cnf on every pod start
# - SET PERSIST doesn't support innodb_doublewrite (static variable)
resource "kubernetes_config_map" "mysql_extra_cnf" {
metadata {
name = "mysql-extra-cnf"
namespace = kubernetes_namespace.dbaas.metadata[0].name
}
data = {
"99-extra.cnf" = <<-EOT
[mysqld]
innodb_doublewrite=OFF
EOT
}
}
resource "helm_release" "mysql_cluster" {
namespace = kubernetes_namespace.dbaas.metadata[0].name
create_namespace = false
name = "mysql-cluster"
timeout = 900
repository = "https://mysql.github.io/mysql-operator/"
chart = "mysql-innodbcluster"
version = "2.2.7"
values = [yamlencode({
serverInstances = 1
routerInstances = 1
serverVersion = "8.4.4"
credentials = {
root = {
user = "root"
password = var.dbaas_root_password
host = "%"
}
}
tls = {
useSelfSigned = true
}
datadirVolumeClaimTemplate = {
storageClassName = "proxmox-lvm-encrypted"
metadata = {
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "20%"
"resize.topolvm.io/storage_limit" = "100Gi"
}
}
resources = {
requests = {
storage = "30Gi"
}
}
}
serverConfig = {
mycnf = <<-EOT
[mysqld]
skip-name-resolve
mysql-native-password=ON
# Auto-recovery after crashes: rejoin group without manual intervention
group_replication_autorejoin_tries=2016
group_replication_exit_state_action=OFFLINE_MODE
group_replication_member_expel_timeout=30
group_replication_unreachable_majority_timeout=60
group_replication_start_on_boot=ON
# Cap XCom cache to prevent unbounded growth (default 1GB causes OOM)
group_replication_message_cache_size=134217728
# Reduce log buffer (16MB sufficient for this workload, was 64MB)
innodb_log_buffer_size=16777216
# Limit connections (peak usage ~40, no need for 151)
max_connections=80
# --- Disk write reduction (HDD/LVM thin) ---
# Flush redo log once per second, not per commit. Up to 1s data loss on MySQL crash,
# but group replication provides redundancy across 3 nodes.
innodb_flush_log_at_trx_commit=0
# OS decides when to flush binlog (not per commit)
sync_binlog=0
# HDD-tuned I/O capacity (default 200/2000 is for SSD)
innodb_io_capacity=100
innodb_io_capacity_max=200
# 1GB redo log capacity: a larger log means less frequent checkpoint flushes
innodb_redo_log_capacity=1073741824
# 1GB buffer pool
innodb_buffer_pool_size=1073741824
# Disable doublewrite: halves write amplification. Safe with group replication
# (crashed node can re-clone from healthy replica rather than relying on local recovery)
innodb_doublewrite=OFF
# Flush neighbors on HDD (coalesce adjacent dirty pages into single I/O)
innodb_flush_neighbors=1
# Reduce page cleaner aggressiveness
innodb_lru_scan_depth=256
innodb_page_cleaners=1
# Reduce adaptive flushing: let dirty pages accumulate longer before background flush
innodb_adaptive_flushing_lwm=10
innodb_max_dirty_pages_pct=90
innodb_max_dirty_pages_pct_lwm=10
EOT
}
# Top-level resources apply to SIDECAR container
# VPA shows sidecar needs only 248Mi target / 334Mi upper bound
# Setting to 350Mi (was 2Gi/4Gi - 17× over-provisioned)
resources = {
requests = {
cpu = "250m"
memory = "350Mi"
}
limits = {
memory = "350Mi"
}
}
podSpec = {
affinity = {
nodeAffinity = {
requiredDuringSchedulingIgnoredDuringExecution = {
nodeSelectorTerms = [{
matchExpressions = [{
key = "kubernetes.io/hostname"
operator = "NotIn"
values = ["k8s-node1"]
}]
}]
}
}
podAntiAffinity = {
preferredDuringSchedulingIgnoredDuringExecution = [{
weight = 100
podAffinityTerm = {
labelSelector = {
matchLabels = {
"component" = "mysqld"
}
}
topologyKey = "kubernetes.io/hostname"
}
}]
}
}
# Container-specific resources for MYSQL container
# VPA shows 2.98Gi target / 5.26Gi upper bound
# Current usage ~1.8Gi peak. Reducing limit from 4Gi to 3Gi
containers = [
{
name = "mysql"
resources = {
requests = {
memory = "2Gi"
cpu = "250m"
}
limits = {
memory = "3Gi"
}
}
},
{
# MySQL operator sidecar (kopf Python control loop)
# VPA upper bound: 334Mi. Was 6Gi limit (17× over-provisioned).
name = "sidecar"
resources = {
requests = {
memory = "350Mi"
cpu = "50m"
}
limits = {
memory = "512Mi"
}
}
}
]
initContainers = [
{
name = "fixdatadir"
resources = {
requests = { memory = "64Mi", cpu = "25m" }
limits = { memory = "64Mi" }
}
},
{
name = "initconf"
resources = {
requests = { memory = "256Mi", cpu = "50m" }
limits = { memory = "256Mi" }
}
},
{
name = "initmysql"
resources = {
requests = { memory = "512Mi", cpu = "250m" }
limits = { memory = "512Mi" }
}
}
]
}
# MySQL Router - explicitly set resources (chart does not expose router.resources)
# VPA shows 100Mi upper bound, setting to 128Mi
# Note: This requires manual kubectl patch after helm release:
# kubectl patch deployment mysql-cluster-router -n dbaas --type=json -p='[
# {"op": "replace", "path": "/spec/template/spec/containers/0/resources",
# "value": {"requests": {"cpu": "25m", "memory": "128Mi"}, "limits": {"memory": "128Mi"}}}]'
# TODO: migrate to mysql-operator fork or wait for upstream router.resources support
})]
depends_on = [helm_release.mysql_operator]
}
#### MYSQL Standalone (migration target)
#
# Standalone MySQL without Group Replication. Eliminates ~95 GB/day of GR
@ -747,6 +456,10 @@ resource "kubernetes_cron_job_v1" "mysql-backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Per-database MySQL backups (enables single-database restore without affecting others)
@ -842,6 +555,10 @@ resource "kubernetes_cron_job_v1" "mysql-backup-per-db" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# resource "kubernetes_persistent_volume" "mysql" {
@ -1047,6 +764,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "phpmyadmin" {
@ -1574,6 +1295,10 @@ resource "kubernetes_deployment" "pgadmin" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "pgadmin" {
metadata {
@ -1682,6 +1407,10 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Per-database PostgreSQL backups (enables single-database restore without affecting others)
@ -1789,4 +1518,8 @@ resource "kubernetes_cron_job_v1" "postgresql-backup-per-db" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

@ -7,6 +7,10 @@ resource "kubernetes_namespace" "descheduler" {
tier = local.tiers.cluster
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_cluster_role" "descheduler" {

@ -12,6 +12,10 @@ resource "kubernetes_namespace" "diun" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {

@ -19,6 +19,10 @@ resource "kubernetes_namespace" "ebook2audiobook" {
tier = local.tiers.gpu
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
@ -115,6 +119,10 @@ resource "kubernetes_deployment" "ebook2audiobook" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -313,6 +321,10 @@ resource "kubernetes_deployment" "audiblez" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -399,6 +411,10 @@ resource "kubernetes_deployment" "audiblez-web" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "audiblez-web" {

@ -11,6 +11,10 @@ resource "kubernetes_namespace" "ebooks" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# ExternalSecrets for all three sources

@ -12,6 +12,10 @@ resource "kubernetes_namespace" "echo" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -69,6 +73,10 @@ resource "kubernetes_deployment" "echo" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "echo" {

@ -13,6 +13,10 @@ resource "kubernetes_namespace" "excalidraw" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
@ -112,6 +116,10 @@ resource "kubernetes_deployment" "excalidraw" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "draw" {

@ -5,6 +5,10 @@ resource "kubernetes_namespace" "external_secrets" {
tier = local.tiers.cluster
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "helm_release" "external_secrets" {

@ -14,6 +14,10 @@ resource "kubernetes_namespace" "f1-stream" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -137,6 +141,10 @@ resource "kubernetes_deployment" "f1-stream" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

@ -11,6 +11,10 @@ resource "kubernetes_namespace" "foolery" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {

@ -12,6 +12,10 @@ resource "kubernetes_namespace" "forgejo" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -130,6 +134,10 @@ resource "kubernetes_deployment" "forgejo" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "forgejo" {

@ -57,6 +57,10 @@ resource "kubernetes_namespace" "freedify" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {

@ -10,6 +10,10 @@ resource "kubernetes_namespace" "immich" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -184,6 +188,10 @@ resource "kubernetes_deployment" "freshrss" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "freshrss" {

@ -15,6 +15,10 @@ resource "kubernetes_namespace" "frigate" {
# "istio-injection" : "enabled"
# }
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -219,6 +223,10 @@ for name, det in stats.get('detectors', {}).items():
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "frigate" {

@ -53,6 +53,10 @@ resource "kubernetes_namespace" "grampsweb" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -322,6 +326,10 @@ resource "kubernetes_deployment" "grampsweb" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "grampsweb" {

@ -12,6 +12,10 @@ resource "kubernetes_namespace" "hackmd" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -160,6 +164,10 @@ resource "kubernetes_deployment" "hackmd" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "hackmd" {

@ -28,6 +28,10 @@ resource "kubernetes_namespace" "headscale" {
tier = var.tier
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -245,6 +249,10 @@ resource "kubernetes_deployment" "headscale" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "headscale" {
metadata {
@ -482,6 +490,10 @@ resource "kubernetes_cron_job_v1" "headscale_backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Grafana dashboard


@ -12,6 +12,10 @@ resource "kubernetes_namespace" "health" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -141,6 +145,10 @@ resource "kubernetes_deployment" "health" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "health" {


@ -12,6 +12,10 @@ resource "kubernetes_namespace" "hermes_agent" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {


@ -18,6 +18,10 @@ resource "kubernetes_namespace" "homepage" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "helm_release" "homepage" {
@ -113,6 +117,10 @@ resource "kubernetes_deployment" "cache_proxy" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "cache_proxy" {


@ -95,6 +95,10 @@ resource "kubernetes_deployment" "immich-frame" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}


@ -100,6 +100,10 @@ resource "kubernetes_namespace" "immich" {
tier = local.tiers.gpu
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -778,6 +782,10 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# POWER TOOLS


@ -188,6 +188,10 @@ resource "kubernetes_cron_job_v1" "backup-etcd" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Weekly etcd defragmentation prevents fragmentation buildup that causes slow requests
@ -242,6 +246,10 @@ resource "kubernetes_cron_job_v1" "defrag-etcd" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Clean up evicted/failed pods cluster-wide daily
@ -277,6 +285,10 @@ resource "kubernetes_cron_job_v1" "cleanup-failed-pods" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service_account" "cleanup_sa" {


@ -12,6 +12,10 @@ resource "kubernetes_namespace" "insta2spotify" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {


@ -8,6 +8,10 @@ resource "kubernetes_namespace" "isponsorblocktv" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Before running, setup config using
# docker run --rm -it -v ./youtube:/app/data -e TERM=$TERM -e COLORTERM=$COLORTERM ghcr.io/dmunozv04/isponsorblocktv --setup
@ -87,4 +91,8 @@ resource "kubernetes_deployment" "isponsorblocktv-vermont" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}


@ -12,6 +12,10 @@ resource "kubernetes_namespace" "jsoncrack" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
@ -52,6 +56,10 @@ resource "kubernetes_deployment" "jsoncrack" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "jsoncrack" {


@ -34,6 +34,10 @@ resource "kubernetes_namespace" "k8s-dashboard" {
tier = local.tiers.cluster
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# }


@ -12,6 +12,10 @@ resource "kubernetes_namespace" "k8s_portal" {
tier = var.tier
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {


@ -12,6 +12,10 @@ resource "kubernetes_namespace" "kms" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -92,6 +96,10 @@ resource "kubernetes_deployment" "kms-web-page" {
}
}
depends_on = [kubernetes_config_map.kms-web-page]
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "kms-web-page" {
@ -172,6 +180,10 @@ resource "kubernetes_deployment" "windows_kms" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "windows_kms" {

252
stacks/kured/main.tf Normal file

@ -0,0 +1,252 @@
# kured: Kubernetes Reboot Daemon
#
# Auto-reboots nodes when /var/run/reboot-required exists on the host (set by
# unattended-upgrades). The reboot process is gated by a custom sentinel file
# (kured-sentinel-gate DaemonSet below) so reboots only happen when:
# - all nodes Ready
# - all calico-node pods Running
# - no node has transitioned Ready in the last 30 minutes (cool-down)
#
# History:
# - 2026-03 post-mortem (memory 390): 26h cluster outage triggered by kured
# rebooting nodes while containerd's overlayfs snapshotter was corrupted.
# Remediation included the sentinel gate and a tight reboot window
# (Mon-Fri 02:00-06:00 London).
# - 2026-04-18: adopted into Terraform (Wave 5a). Previously helm-installed
# manually + kubectl-applied sentinel gate.
resource "kubernetes_namespace" "kured" {
metadata {
name = "kured"
labels = {
"istio-injection" = "disabled"
tier = local.tiers.cluster
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# -----------------------------------------------------------------------------
# kured Helm release
# -----------------------------------------------------------------------------
resource "helm_release" "kured" {
namespace = kubernetes_namespace.kured.metadata[0].name
create_namespace = false
name = "kured"
chart = "kured"
repository = "https://kubereboot.github.io/charts/"
version = "5.11.0"
values = [yamlencode({
configuration = {
period = "1h0m0s"
timeZone = "Europe/London"
startTime = "02:00"
endTime = "06:00"
rebootDays = ["mo", "tu", "we", "th", "fr"]
rebootSentinel = "/sentinel/gated-reboot-required"
notifyUrl = data.vault_kv_secret_v2.secrets.data["slack_kured_webhook"]
}
reboot_days = "mon,tue,wed,thu,fri"
window_end = "06:00"
window_start = "22:00"
service = {
annotations = {
"prometheus.io/scrape" = "true"
"prometheus.io/port" = "8080"
"prometheus.io/path" = "/metrics"
}
}
})]
}
data "vault_kv_secret_v2" "secrets" {
mount = "secret"
name = "kured"
}
# -----------------------------------------------------------------------------
# kured-sentinel-gate
#
# Runs a DaemonSet that creates /var/run/gated-reboot-required ONLY when all
# safety preconditions are met (see script). kured's rebootSentinel points at
# this file, so reboots are effectively blocked unless every check passes.
# -----------------------------------------------------------------------------
resource "kubernetes_service_account" "kured_sentinel_gate" {
metadata {
name = "kured-sentinel-gate"
namespace = kubernetes_namespace.kured.metadata[0].name
}
automount_service_account_token = false
}
resource "kubernetes_cluster_role" "kured_sentinel_gate" {
metadata {
name = "kured-sentinel-gate"
}
rule {
api_groups = [""]
resources = ["nodes"]
verbs = ["list"]
}
rule {
api_groups = [""]
resources = ["pods"]
verbs = ["list"]
}
}
resource "kubernetes_cluster_role_binding" "kured_sentinel_gate" {
metadata {
name = "kured-sentinel-gate"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = kubernetes_cluster_role.kured_sentinel_gate.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.kured_sentinel_gate.metadata[0].name
namespace = kubernetes_namespace.kured.metadata[0].name
}
}
resource "kubernetes_daemon_set_v1" "kured_sentinel_gate" {
metadata {
name = "kured-sentinel-gate"
namespace = kubernetes_namespace.kured.metadata[0].name
labels = {
app = "kured-sentinel-gate"
tier = local.tiers.cluster
}
}
spec {
selector {
match_labels = {
app = "kured-sentinel-gate"
}
}
template {
metadata {
labels = {
app = "kured-sentinel-gate"
}
}
spec {
service_account_name = kubernetes_service_account.kured_sentinel_gate.metadata[0].name
automount_service_account_token = false
enable_service_links = false
toleration {
effect = "NoSchedule"
key = "node-role.kubernetes.io/control-plane"
operator = "Equal"
}
toleration {
effect = "NoSchedule"
key = "node-role.kubernetes.io/master"
operator = "Equal"
}
container {
name = "gate"
image = "bitnami/kubectl:latest"
image_pull_policy = "Always"
command = [
"/bin/bash",
"-c",
<<-EOT
while true; do
echo "[$(date)] Checking reboot gate conditions..."
# Check 1: Does the host need a reboot?
if [ ! -f /host/var-run/reboot-required ]; then
echo " No reboot required on this host"
rm -f /host/var-run/gated-reboot-required
sleep 300
continue
fi
echo " Host has /var/run/reboot-required"
# Check 2: Are ALL nodes Ready?
NOT_READY=$(kubectl get nodes --no-headers | grep -v ' Ready' | wc -l | tr -d ' ')
if [ "$NOT_READY" -gt 0 ]; then
echo " BLOCKED: $NOT_READY node(s) not Ready"
rm -f /host/var-run/gated-reboot-required
sleep 300
continue
fi
echo " All nodes Ready"
# Check 3: Are ALL calico-node pods Running?
CALICO_NOT_RUNNING=$(kubectl get pods -n calico-system -l k8s-app=calico-node --no-headers 2>/dev/null | grep -v Running | wc -l | tr -d ' ')
if [ "$CALICO_NOT_RUNNING" -gt 0 ]; then
echo " BLOCKED: $CALICO_NOT_RUNNING calico-node pod(s) not Running"
rm -f /host/var-run/gated-reboot-required
sleep 300
continue
fi
echo " All calico-node pods Running"
# Check 4: No node rebooted in last 30 minutes (cool-down)
RECENT_REBOOT=0
while IFS= read -r transition_time; do
if [ -n "$transition_time" ]; then
transition_epoch=$(date -d "$transition_time" +%s 2>/dev/null || date -j -f "%Y-%m-%dT%H:%M:%SZ" "$transition_time" +%s 2>/dev/null)
now_epoch=$(date +%s)
diff=$(( now_epoch - transition_epoch ))
if [ "$diff" -lt 1800 ]; then
RECENT_REBOOT=1
break
fi
fi
done < <(kubectl get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
if [ "$RECENT_REBOOT" -eq 1 ]; then
echo " BLOCKED: A node transitioned Ready within the last 30 minutes (cool-down)"
rm -f /host/var-run/gated-reboot-required
sleep 300
continue
fi
echo " No recent node reboots (30m cool-down clear)"
# All checks passed: create gated sentinel
echo " ALL CHECKS PASSED — creating /var/run/gated-reboot-required"
touch /host/var-run/gated-reboot-required
sleep 300
done
EOT
]
resources {
requests = {
cpu = "10m"
memory = "32Mi"
}
limits = {
memory = "64Mi"
}
}
volume_mount {
name = "var-run"
mount_path = "/host/var-run"
}
}
volume {
name = "var-run"
host_path {
path = "/var/run"
type = "Directory"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
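Reviewer note: the gate loop in the heredoc above reduces to four checks evaluated in a fixed order. A minimal Python sketch of that decision follows, assuming the inputs have already been collected. The names `GateInputs` and `gate_decision` are hypothetical and this is an illustration, not the deployed script.

```python
from dataclasses import dataclass
from typing import List, Tuple

COOLDOWN_SECONDS = 1800  # 30-minute cool-down, same threshold as the script


@dataclass
class GateInputs:
    """Hypothetical snapshot of what the shell loop gathers on each pass."""
    reboot_required: bool               # /var/run/reboot-required exists on the host
    not_ready_nodes: int                # nodes whose status is not Ready
    calico_not_running: int             # calico-node pods not in Running state
    ready_transition_epochs: List[int]  # lastTransitionTime of each node's Ready condition
    now_epoch: int


def gate_decision(inputs: GateInputs) -> Tuple[bool, str]:
    """Return (create_sentinel, reason), checking in the same order as the script."""
    if not inputs.reboot_required:
        return False, "no reboot required"
    if inputs.not_ready_nodes > 0:
        return False, f"{inputs.not_ready_nodes} node(s) not Ready"
    if inputs.calico_not_running > 0:
        return False, f"{inputs.calico_not_running} calico-node pod(s) not Running"
    for transition in inputs.ready_transition_epochs:
        if inputs.now_epoch - transition < COOLDOWN_SECONDS:
            return False, "cool-down"
    return True, "all checks passed"
```

One difference worth keeping in mind: in the shell version a `kubectl` call that fails prints nothing to stdout, so the `wc -l` counts come out as 0 and the check passes. A structured version like this sketch would need to treat a failed query as a block explicitly.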

1
stacks/kured/secrets Symbolic link

@ -0,0 +1 @@
../../secrets


@ -0,0 +1,8 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}


@ -6,6 +6,10 @@ resource "kubernetes_namespace" "kyverno" {
"istio-injection" : "disabled"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "helm_release" "kyverno" {


@ -21,6 +21,10 @@ resource "kubernetes_namespace" "linkwarden" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -197,6 +201,10 @@ resource "kubernetes_deployment" "linkwarden" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "linkwarden" {
metadata {

198
stacks/local-path/main.tf Normal file

@ -0,0 +1,198 @@
# local-path-provisioner
#
# Rancher's local-path provisioner backs PVCs with node-local
# /opt/local-path-provisioner directories. Currently serves as the default
# StorageClass. Deployed via raw kubectl apply about 55 days before being
# adopted into TF (Wave 5c) on 2026-04-18.
#
# Upstream: https://github.com/rancher/local-path-provisioner
# Version pinned to rancher/local-path-provisioner:v0.0.31
resource "kubernetes_namespace" "local_path_storage" {
metadata {
name = "local-path-storage"
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_service_account" "local_path_provisioner" {
metadata {
name = "local-path-provisioner-service-account"
namespace = kubernetes_namespace.local_path_storage.metadata[0].name
}
automount_service_account_token = false
}
resource "kubernetes_cluster_role" "local_path_provisioner" {
metadata {
name = "local-path-provisioner-role"
}
rule {
api_groups = [""]
resources = ["nodes", "persistentvolumeclaims", "configmaps", "pods", "pods/log"]
verbs = ["get", "list", "watch"]
}
rule {
api_groups = [""]
resources = ["persistentvolumes"]
verbs = ["get", "list", "watch", "create", "patch", "update", "delete"]
}
rule {
api_groups = [""]
resources = ["events"]
verbs = ["create", "patch"]
}
rule {
api_groups = ["storage.k8s.io"]
resources = ["storageclasses"]
verbs = ["get", "list", "watch"]
}
}
resource "kubernetes_cluster_role_binding" "local_path_provisioner" {
metadata {
name = "local-path-provisioner-bind"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = kubernetes_cluster_role.local_path_provisioner.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.local_path_provisioner.metadata[0].name
namespace = kubernetes_namespace.local_path_storage.metadata[0].name
}
}
resource "kubernetes_config_map" "local_path_config" {
metadata {
name = "local-path-config"
namespace = kubernetes_namespace.local_path_storage.metadata[0].name
}
data = {
"config.json" = jsonencode({
nodePathMap = [{
node = "DEFAULT_PATH_FOR_NON_LISTED_NODES"
paths = ["/opt/local-path-provisioner"]
}]
})
"helperPod.yaml" = <<-EOT
apiVersion: v1
kind: Pod
metadata:
name: helper-pod
spec:
priorityClassName: system-node-critical
tolerations:
- key: node.kubernetes.io/disk-pressure
operator: Exists
effect: NoSchedule
containers:
- name: helper-pod
image: busybox
imagePullPolicy: IfNotPresent
EOT
"setup" = <<-EOT
#!/bin/sh
set -eu
mkdir -m 0777 -p "$VOL_DIR"
EOT
"teardown" = <<-EOT
#!/bin/sh
set -eu
rm -rf "$VOL_DIR"
EOT
}
}
resource "kubernetes_storage_class_v1" "local_path" {
metadata {
name = "local-path"
annotations = {
"storageclass.kubernetes.io/is-default-class" = "true"
}
}
storage_provisioner = "rancher.io/local-path"
reclaim_policy = "Delete"
volume_binding_mode = "WaitForFirstConsumer"
allow_volume_expansion = false
}
resource "kubernetes_deployment" "local_path_provisioner" {
metadata {
name = "local-path-provisioner"
namespace = kubernetes_namespace.local_path_storage.metadata[0].name
labels = {
tier = "default"
}
}
spec {
replicas = 1
selector {
match_labels = {
app = "local-path-provisioner"
}
}
template {
metadata {
labels = {
app = "local-path-provisioner"
}
}
spec {
service_account_name = kubernetes_service_account.local_path_provisioner.metadata[0].name
automount_service_account_token = false
enable_service_links = false
container {
name = "local-path-provisioner"
image = "rancher/local-path-provisioner:v0.0.31"
image_pull_policy = "IfNotPresent"
command = [
"local-path-provisioner",
"--debug",
"start",
"--config",
"/etc/config/config.json",
]
env {
name = "POD_NAMESPACE"
value_from {
field_ref {
field_path = "metadata.namespace"
}
}
}
env {
name = "CONFIG_MOUNT_PATH"
value = "/etc/config/"
}
volume_mount {
name = "config-volume"
mount_path = "/etc/config/"
}
}
volume {
name = "config-volume"
config_map {
name = kubernetes_config_map.local_path_config.metadata[0].name
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

1
stacks/local-path/secrets Symbolic link

@ -0,0 +1 @@
../../secrets


@ -0,0 +1,8 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}


@ -14,6 +14,26 @@ variable "email_monitor_imap_password" {
sensitive = true
}
# Build the virtual-alias map, dropping aliases where BOTH the source and
# target are real mailboxes in var.mailserver_accounts (and are different).
# Without this filter, docker-mailserver emits two passwd-file userdb lines
# for the source address (its own mailbox home plus the alias target's home),
# and Dovecot logs 'exists more than once' on every auth lookup. Aliases
# that forward to external addresses (gmail etc.) or to self are safe.
locals {
_account_set = keys(var.mailserver_accounts)
_virtual_lines = split("\n", format("%s%s", var.postfix_account_aliases, file("${path.module}/extra/aliases.txt")))
postfix_virtual = join("\n", [
for line in local._virtual_lines : line
if !(
length(split(" ", line)) == 2 &&
contains(local._account_set, split(" ", line)[0]) &&
contains(local._account_set, split(" ", line)[1]) &&
split(" ", line)[0] != split(" ", line)[1]
)
])
}
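Reviewer note: the `for ... if !(...)` expression in the `locals` block above can be read as a plain filter. A minimal Python sketch of the same predicate follows. `filter_virtual_lines` is a hypothetical name, and the inputs stand in for `local._virtual_lines` and `local._account_set`.

```python
from typing import Iterable, List, Set


def filter_virtual_lines(lines: Iterable[str], accounts: Set[str]) -> List[str]:
    """Drop 'source target' aliases where both sides are distinct real mailboxes."""
    kept = []
    for line in lines:
        # An explicit single-space separator, like Terraform's split(" ", line).
        parts = line.split(" ")
        is_mailbox_to_mailbox = (
            len(parts) == 2
            and parts[0] in accounts
            and parts[1] in accounts
            and parts[0] != parts[1]
        )
        if not is_mailbox_to_mailbox:
            kept.append(line)
    return kept
```

Lines with more than two space-separated fields, blank lines, self-aliases and aliases to external addresses all pass through unchanged, which matches the intent stated in the comment.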
resource "kubernetes_namespace" "mailserver" {
metadata {
name = "mailserver"
@ -25,6 +45,10 @@ resource "kubernetes_namespace" "mailserver" {
# "istio-injection" : "enabled"
# }
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -93,7 +117,7 @@ resource "kubernetes_config_map" "mailserver_config" {
# Actual mail settings
"postfix-accounts.cf" = join("\n", [for user, pass in var.mailserver_accounts : "${user}|${bcrypt(pass, 6)}"])
"postfix-main.cf" = var.postfix_cf
"postfix-virtual.cf" = format("%s%s", var.postfix_account_aliases, file("${path.module}/extra/aliases.txt"))
"postfix-virtual.cf" = local.postfix_virtual
KeyTable = "mail._domainkey.viktorbarzin.me viktorbarzin.me:mail:/etc/opendkim/keys/viktorbarzin.me-mail.key\n"
SigningTable = "*@viktorbarzin.me mail._domainkey.viktorbarzin.me\n"
@ -598,15 +622,15 @@ try:
resp.raise_for_status()
print(f"Sent test email via Brevo: {resp.status_code} marker={marker}")
# Step 2: Wait for delivery, retry IMAP up to 3 min
# Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
found = False
for attempt in range(9):
for attempt in range(15):
time.sleep(20)
try:
imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
imap.login(IMAP_USER, IMAP_PASS)
imap.select("INBOX")
_, msg_ids = imap.search(None, "SUBJECT", marker)
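Reviewer note: the change above widens the retry budget from 9 to 15 attempts at 20 s each, so roughly 5 minutes of waiting before time spent connecting. A minimal sketch of the same bounded-polling shape follows, with the sleep function injectable so it can be exercised without waiting. `poll_until` and its parameters are hypothetical names, and catching `Exception` is an assumption about how the probe handles IMAP errors.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    attempts: int = 15,
    delay_s: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Sleep, then check; stop at the first success or after `attempts` tries."""
    for _ in range(attempts):
        sleep(delay_s)
        try:
            if check():
                return True
        except Exception:
            # Assumed behaviour: a failed attempt (e.g. connection error or
            # socket timeout) is treated as "not delivered yet" and retried.
            continue
    return False
```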
@ -700,5 +724,9 @@ sys.exit(0 if success else 1)
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}


@ -231,6 +231,10 @@ resource "kubernetes_deployment" "roundcubemail" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "roundcubemail" {


@ -23,6 +23,18 @@ smtpd_tls_loglevel = 1
smtpd_client_connection_rate_limit = 10
smtpd_client_message_rate_limit = 30
anvil_rate_time_unit = 60s
# Disable the postscreen decision cache. The default (btree) driver
# requires an exclusive file lock for every access, and with postscreen
# re-spawning per connection (master.cf: maxproc=1) that produces thousands
# of 'unable to get exclusive lock' fatals per day, stalling SMTP
# acceptance and starving inbound delivery. lmdb would avoid the lock but
# isn't compiled into docker-mailserver 15.0.0's Postfix build
# (postconf -m shows no lmdb). The proxy:btree type is unsafe because
# postscreen does its own locking. An empty value disables the cache
# entirely: legitimate clients pay the greet/bare-newline re-check on every
# new TCP session, which is trivial at our volume (~100 deliveries/day).
postscreen_cache_map =
EOT
}


@ -13,6 +13,10 @@ resource "kubernetes_namespace" "matrix" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# DB credentials from Vault database engine (rotated every 24h)
@ -192,6 +196,10 @@ resource "kubernetes_deployment" "matrix" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "matrix" {


@ -13,6 +13,10 @@ resource "kubernetes_namespace" "meshcentral" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -231,6 +235,10 @@ EOT
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -257,14 +265,14 @@ resource "kubernetes_service" "meshcentral" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.meshcentral.metadata[0].name
name = "meshcentral"
tls_secret_name = var.tls_secret_name
port = 80
protected = true
anti_ai_scraping = false
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.meshcentral.metadata[0].name
name = "meshcentral"
tls_secret_name = var.tls_secret_name
port = 80
protected = true
anti_ai_scraping = false
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "MeshCentral"


@ -7,6 +7,10 @@ resource "kubernetes_namespace" "metallb" {
app = "metallb"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "helm_release" "metallb" {


@ -8,6 +8,10 @@ resource "kubernetes_namespace" "metrics-server" {
tier = var.tier
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {


@ -50,6 +50,10 @@ resource "kubernetes_deployment" "goflow2" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "goflow2" {


@ -91,6 +91,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "idrac-redfish-exporter" {


@ -100,6 +100,10 @@ resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
*/


@ -39,6 +39,10 @@ resource "kubernetes_namespace" "monitoring" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -88,6 +92,10 @@ resource "kubernetes_cron_job_v1" "monitor_prom" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# -----------------------------------------------------------------------------
@ -211,6 +219,10 @@ resource "kubernetes_cron_job_v1" "dns_anomaly_monitor" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Expose Pushgateway via NodePort so the PVE host can push LVM snapshot metrics


@ -1787,6 +1787,30 @@ serverFiles:
severity: warning
annotations:
summary: "Privatebin has no available replicas"
- alert: DawarichIngestionStale
expr: (time() - dawarich_last_point_ingested_timestamp{user="viktor"}) > 172800
for: 15m
labels:
severity: warning
annotations:
summary: "Dawarich: no points from viktor in >2 days"
description: "The iOS Dawarich app likely stopped sending location points. Open the app, verify it's running, and check background location permissions. Server-side is healthy when this alert fires — the issue is client-side."
- alert: DawarichIngestionMonitorStale
expr: (time() - dawarich_ingestion_monitor_last_push_timestamp{user="viktor"}) > 129600
for: 15m
labels:
severity: warning
annotations:
summary: "Dawarich ingestion freshness monitor hasn't pushed in >36h"
description: "CronJob ingestion-freshness-monitor in dawarich ns isn't running or failing. Check `kubectl -n dawarich get cronjob ingestion-freshness-monitor` and recent Job logs."
- alert: DawarichIngestionMonitorNeverRun
expr: absent(dawarich_ingestion_monitor_last_push_timestamp{user="viktor"})
for: 2h
labels:
severity: warning
annotations:
summary: "Dawarich ingestion freshness monitor has never pushed"
description: "Expected `dawarich_ingestion_monitor_last_push_timestamp` to appear once the daily CronJob runs. Check the CronJob in dawarich namespace."
- name: "Network Traffic (GoFlow2)"
rules:
- alert: GoFlow2Down
@ -1939,6 +1963,38 @@ serverFiles:
severity: warning
annotations:
summary: "Authentik outpost restarted {{ $value | printf \"%.0f\" }} times in 30m — check for OOM or crash loop"
- name: Infrastructure Drift
# Metrics pushed by .woodpecker/drift-detection.yml after each cron run.
# See Wave 7 of the state-drift consolidation plan.
rules:
- alert: DriftDetectionStale
# Drift detection pipeline hasn't reported in 26h. Either the cron
# didn't fire, or the job is failing before the push step.
expr: time() - max(drift_detection_last_run_timestamp) > 26 * 3600
for: 30m
labels:
severity: warning
annotations:
summary: "Drift detection hasn't reported in {{ $value | humanizeDuration }} — check Woodpecker pipeline 'drift-detection'"
- alert: DriftUnaddressed
# Any stack drifted for >72h without being reconciled. Either apply
# to bring config in line, or update HCL to match desired state.
expr: max(drift_stack_age_hours) > 72
for: 1h
labels:
severity: warning
annotations:
summary: "A stack has been drifted for {{ $value | printf \"%.0f\" }}h — run scripts/tg plan across stacks to identify and reconcile"
- alert: DriftStacksMany
# More than 10 stacks drifting simultaneously usually means a
# systemic issue (cluster upgrade, new admission controller,
# provider version bump) rather than individual misconfigurations.
expr: drift_stack_count > 10
for: 30m
labels:
severity: warning
annotations:
summary: "{{ $value | printf \"%.0f\" }} stacks drifting — likely a systemic cause (new admission webhook, provider upgrade). Check the most recent drift-detection run in Woodpecker."
extraScrapeConfigs: |
- job_name: 'proxmox-host'
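Reviewer note: the three "Infrastructure Drift" rules added above are simple threshold comparisons on the pushed gauges. A minimal sketch of the same conditions in Python follows, ignoring the `for:` hold durations. `drift_alerts` is a hypothetical helper, and its arguments stand in for the current values of `drift_detection_last_run_timestamp`, `max(drift_stack_age_hours)` and `drift_stack_count`.

```python
from typing import List, Optional

STALE_AFTER_S = 26 * 3600   # DriftDetectionStale threshold
UNADDRESSED_AFTER_H = 72    # DriftUnaddressed threshold
MANY_STACKS = 10            # DriftStacksMany threshold


def drift_alerts(
    now: float,
    last_run: Optional[float],
    max_age_hours: Optional[float],
    drift_stack_count: Optional[int],
) -> List[str]:
    """Return the names of the rules whose expression would currently match.

    None models a metric that has no samples; such a rule does not match.
    """
    firing = []
    if last_run is not None and now - last_run > STALE_AFTER_S:
        firing.append("DriftDetectionStale")
    if max_age_hours is not None and max_age_hours > UNADDRESSED_AFTER_H:
        firing.append("DriftUnaddressed")
    if drift_stack_count is not None and drift_stack_count > MANY_STACKS:
        firing.append("DriftStacksMany")
    return firing
```

In PromQL an aggregation over an empty vector returns no samples, so `DriftDetectionStale` as written would stay silent if the heartbeat metric had never been pushed. The sketch mirrors that by treating `None` as not firing. The Dawarich rules in the same file cover the equivalent case with a separate `absent()` alert.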


@ -86,6 +86,10 @@ resource "kubernetes_deployment" "pve_exporter" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "proxmox-exporter" {


@ -90,6 +90,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "snmp-exporter" {


@ -18,6 +18,10 @@ resource "kubernetes_namespace" "n8n" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -277,6 +281,10 @@ resource "kubernetes_deployment" "n8n" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "n8n" {


@ -11,6 +11,10 @@ resource "kubernetes_namespace" "navidrome" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -198,6 +202,10 @@ resource "kubernetes_deployment" "navidrome" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "navidrome" {

@ -13,6 +13,10 @@ resource "kubernetes_namespace" "netbox" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -196,6 +200,10 @@ resource "kubernetes_deployment" "netbox" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "netbox" {
metadata {

@ -12,6 +12,10 @@ resource "kubernetes_namespace" "networking-toolbox" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -66,6 +70,10 @@ resource "kubernetes_deployment" "networking-toolbox" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "networking-toolbox" {

@ -32,6 +32,10 @@ resource "kubernetes_namespace" "nextcloud" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -463,6 +467,10 @@ resource "kubernetes_cron_job_v1" "nextcloud_watchdog" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_cron_job_v1" "nextcloud-backup" {
@ -533,4 +541,8 @@ resource "kubernetes_cron_job_v1" "nextcloud-backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

@ -8,6 +8,10 @@ resource "kubernetes_namespace" "nfs_csi" {
tier = var.tier
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "helm_release" "nfs_csi_driver" {

@ -38,6 +38,10 @@ resource "kubernetes_namespace" "novelapp" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -83,6 +87,7 @@ resource "kubernetes_deployment" "novelapp" {
# DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA). Reviewed 2026-04-18.
ignore_changes = [
spec[0].template[0].spec[0].container[0].image,
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
]
}
spec {

@ -12,6 +12,10 @@ resource "kubernetes_namespace" "ntfy" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -151,6 +155,10 @@ resource "kubernetes_deployment" "ntfy" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "ntfy" {

@ -13,10 +13,14 @@ resource "kubernetes_namespace" "nvidia" {
labels = {
"istio-injection" : "disabled"
tier = var.tier
"resource-governance/custom-quota" = "true"
"resource-governance/custom-quota" = "true"
"resource-governance/custom-limitrange" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Custom LimitRange overrides Kyverno tier-2-gpu default (1Gi per container)
@ -177,6 +181,10 @@ resource "kubernetes_deployment" "nvidia-exporter" {
}
}
depends_on = [helm_release.nvidia-gpu-operator]
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "nvidia-exporter" {

@ -16,6 +16,10 @@ resource "kubernetes_namespace" "onlyoffice" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -220,6 +224,10 @@ resource "kubernetes_deployment" "onlyoffice-document-server" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "onlyoffice" {

@ -23,6 +23,10 @@ resource "kubernetes_namespace" "openclaw" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -602,6 +606,10 @@ resource "kubernetes_deployment" "openclaw" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "openclaw" {
@ -803,6 +811,10 @@ resource "kubernetes_deployment" "task_webhook" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "task_webhook" {
@ -940,6 +952,10 @@ resource "kubernetes_cron_job_v1" "cluster_healthcheck" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# --- CronJob: Task processor polls Forgejo issues and triggers OpenClaw ---
@ -1032,6 +1048,10 @@ resource "kubernetes_cron_job_v1" "task_processor" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# --- OpenLobster: Multi-user Telegram AI assistant (trial) ---

@ -14,6 +14,10 @@ resource "kubernetes_namespace" "osm-routing" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_resource_quota_v1" "osm_routing" {
@ -108,6 +112,10 @@ resource "kubernetes_deployment" "osrm-foot" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "osrm-foot" {
@ -189,6 +197,10 @@ resource "kubernetes_deployment" "osrm-bicycle" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "osrm-bicycle" {
@ -274,6 +286,10 @@ resource "kubernetes_deployment" "otp" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "otp" {

@ -52,6 +52,10 @@ resource "kubernetes_namespace" "owntracks" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -177,6 +181,10 @@ resource "kubernetes_deployment" "owntracks" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

@ -26,6 +26,10 @@ resource "kubernetes_namespace" "paperless-ngx" {
# "istio-injection" : "enabled"
# }
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -198,6 +202,10 @@ resource "kubernetes_deployment" "paperless-ngx" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "paperless-ngx" {

@ -22,6 +22,10 @@ resource "kubernetes_namespace" "payslip_ingest" {
"istio-injection" = "disabled"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# App secrets sourced from multiple Vault KV keys.

@ -20,6 +20,10 @@ resource "kubernetes_namespace" "phpipam" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -368,6 +372,10 @@ resource "kubernetes_cron_job_v1" "phpipam_dns_sync" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# CronJob: Import devices from pfSense (Kea DHCP leases + ARP table) into phpIPAM
@ -564,6 +572,10 @@ PYEOF
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# CronJob: Import devices from remote sites (London + Valchedrym) via SSH
@ -724,4 +736,8 @@ PYEOF
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

@ -11,6 +11,10 @@ resource "kubernetes_namespace" "plotting-book" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
@ -83,6 +87,7 @@ resource "kubernetes_deployment" "plotting-book" {
# DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA). Reviewed 2026-04-18.
ignore_changes = [
spec[0].template[0].spec[0].container[0].image,
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
]
}
spec {
@ -308,6 +313,10 @@ resource "kubernetes_cron_job_v1" "plotting_book_backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Sealed Secrets encrypted secrets safe to commit to git

@ -13,6 +13,10 @@ resource "kubernetes_namespace" "poison_fountain" {
tier = local.tiers.cluster
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -174,6 +178,10 @@ resource "kubernetes_deployment" "poison_fountain" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
# Internal service (for ForwardAuth from Traefik)
@ -293,4 +301,8 @@ resource "kubernetes_cron_job_v1" "poison_fetcher" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

@ -11,6 +11,10 @@ resource "kubernetes_namespace" "priority-pass" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {

@ -13,6 +13,10 @@ resource "kubernetes_namespace" "privatebin" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -101,6 +105,10 @@ resource "kubernetes_deployment" "privatebin" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "privatebin" {

@ -6,6 +6,10 @@ resource "kubernetes_namespace" "proxmox_csi" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "helm_release" "proxmox_csi" {

@ -7,6 +7,10 @@ resource "kubernetes_namespace" "pvc_autoresizer" {
tier = var.tier
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "helm_release" "pvc_autoresizer" {

@ -90,6 +90,10 @@ resource "kubernetes_namespace" "realestate-crawler" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -154,7 +158,8 @@ resource "kubernetes_deployment" "realestate-crawler-ui" {
}
lifecycle {
ignore_changes = [
-spec[0].template[0].spec[0].container[0].image
+spec[0].template[0].spec[0].container[0].image,
+spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
]
}
}
@ -300,7 +305,8 @@ resource "kubernetes_deployment" "realestate-crawler-api" {
}
lifecycle {
ignore_changes = [
-spec[0].template[0].spec[0].container[0].image
+spec[0].template[0].spec[0].container[0].image,
+spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
]
}
}
@ -463,6 +469,10 @@ resource "kubernetes_deployment" "realestate-crawler-celery" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "realestate-crawler-celery-metrics" {
@ -570,4 +580,8 @@ resource "kubernetes_deployment" "realestate-crawler-celery-beat" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

@ -9,6 +9,10 @@ resource "kubernetes_namespace" "redis" {
tier = var.tier
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -236,6 +240,10 @@ resource "kubernetes_deployment" "haproxy" {
}
depends_on = [helm_release.redis]
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
# Dedicated service for HAProxy master-only routing.
@ -368,4 +376,8 @@ resource "kubernetes_cron_job_v1" "redis-backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

Some files were not shown because too many files have changed in this diff.