Commit graph

2914 commits

Author SHA1 Message Date
Viktor Barzin
124a756351 [infra] Adopt local-path-provisioner into Terraform (Wave 5c)
## Context

Wave 5c of the state-drift consolidation plan. `local-path-provisioner`
(Rancher's node-local dynamic PV provisioner) was deployed 55d ago via raw
`kubectl apply` against the upstream manifest. It serves as the cluster's
default StorageClass and is still actively in use — the 2026-04-18 live
survey showed helper-pod-delete cycles running against existing PVCs.

Unmanaged until now: namespace, ServiceAccount, ClusterRole (+ binding),
ConfigMap with provisioner config.json + helperPod.yaml + setup/teardown
scripts, StorageClass `local-path` (default), and the 1-replica
Deployment itself. Seven resources total.

## This change

New Tier 1 stack `stacks/local-path/` with all seven resources, adopted
via Wave 8's HCL `import {}` block convention (commit 8a99be11):

- `kubernetes_namespace.local_path_storage` → id `local-path-storage`
- `kubernetes_service_account.local_path_provisioner` →
  id `local-path-storage/local-path-provisioner-service-account`
- `kubernetes_cluster_role.local_path_provisioner` → id `local-path-provisioner-role`
- `kubernetes_cluster_role_binding.local_path_provisioner` → id `local-path-provisioner-bind`
- `kubernetes_config_map.local_path_config` →
  id `local-path-storage/local-path-config`
- `kubernetes_storage_class_v1.local_path` → id `local-path`
- `kubernetes_deployment.local_path_provisioner` →
  id `local-path-storage/local-path-provisioner`
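
For reference, a minimal sketch of one of the (since-deleted) stanzas, using
the Deployment's address and ID from the list above:

```hcl
# Sketch of a Wave 8 import {} stanza; the real ones were dropped after the
# zero-diff apply per AGENTS.md → "Adopting Existing Resources".
import {
  to = kubernetes_deployment.local_path_provisioner
  id = "local-path-storage/local-path-provisioner"
}
```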

Conventions applied:
- Namespace gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
  Goldilocks `vpa-update-mode` label drift (Wave 3B, commit 8b43692a).
- Deployment gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
  ndots dns_config drift (Wave 3A, commit c9d221d5 + 327ce215).
- ServiceAccount + pod spec pin `automount_service_account_token = false`
  and `enable_service_links = false` to match the live spec exactly.
- `import {}` stanzas removed after the apply converged to zero-diff
  (per AGENTS.md → "Adopting Existing Resources").

## Apply outcome

`Apply complete! Resources: 7 imported, 0 added, 3 changed, 0 destroyed.`

The 3 in-place changes were:
- `kubernetes_config_map.local_path_config.data` — whitespace/format
  reshuffle. The live ConfigMap contained the upstream manifest's
  hand-indented JSON + YAML; my HCL uses canonical `jsonencode` /
  heredoc. Semantic content identical, so the provisioner continued
  running (no pod restart).
- `kubernetes_deployment.local_path_provisioner.wait_for_rollout = true`
  — TF-only attribute, no cluster impact.
- `kubernetes_storage_class_v1.local_path.allow_volume_expansion = false`
  + `is-default-class` annotation re-asserted — TF-schema reconciliation
  only; the StorageClass remained default throughout.
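
For context, the adopted StorageClass is roughly the following sketch; field
values mirror the `kubectl get sc` output in the Verification section, but
the exact HCL in the stack may differ.

```hcl
# Approximate shape of the adopted resource (sketch, not verbatim).
resource "kubernetes_storage_class_v1" "local_path" {
  metadata {
    name = "local-path"
    annotations = {
      "storageclass.kubernetes.io/is-default-class" = "true"
    }
  }
  storage_provisioner    = "rancher.io/local-path"
  reclaim_policy         = "Delete"
  volume_binding_mode    = "WaitForFirstConsumer"
  allow_volume_expansion = false
}
```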

Post-apply `scripts/tg plan` returns `No changes`.

## Verification

```
$ cd stacks/local-path && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.

$ kubectl -n local-path-storage get deploy
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
local-path-provisioner   1/1     1            1           55d

$ kubectl get sc local-path
NAME                    PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE
local-path (default)    rancher.io/local-path    Delete          WaitForFirstConsumer
```

## What is NOT in this change

- Helm-release adoption — local-path-provisioner was never installed via
  Helm in this cluster; raw manifests only. Keeping native typed
  resources rather than retrofitting a chart.
- PV-path customisation — sticks with upstream default
  `/opt/local-path-provisioner` on all nodes (via
  `DEFAULT_PATH_FOR_NON_LISTED_NODES`).

Closes: code-3gp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:39:55 +00:00
Viktor Barzin
1a7f68fe5b [beads-server] Auto-dispatch agent beads via CronJobs
## Context

Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.

The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`

So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.

## Flow

```
  user: bd assign <id> agent
         │
         ▼
  Dolt @ dolt.beads-server.svc:3306  ◄──── every 2 min ────┐
         │                                                  │
         ▼                                                  │
  CronJob: beads-dispatcher                                 │
    1. GET beadboard/api/agent-status  (busy? skip)         │
    2. bd query 'assignee=agent AND status=open'            │
    3. bd update -s in_progress   (claim)                   │
    4. POST beadboard/api/agent-dispatch                    │
    5. bd note "dispatched: job=…"                          │
         │                                                  │
         ▼                                                  │
  claude-agent-service /execute                             │
    beads-task-runner agent runs; notes/closes bead         │
         │                                                  │
         ▼                                                  │
  done  ──► next tick picks up the next bead ───────────────┘

  CronJob: beads-reaper  (every 10 min)
    for bead (assignee=agent, status=in_progress, updated_at > 30 min):
      bd note   "reaper: no progress for Nm — blocking"
      bd update -s blocked
```

## Decisions

- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
  client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
  2-min poll cadence and ~5-min average run, throughput is ~12 beads/hour.
  Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
  Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
  `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
  Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
  the image-seeded file. The script copies it into `/tmp/.beads/` because bd
  may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
  When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
  `beads-task-runner` never trips the reaper. Failures trip it; pod crashes
  (in-memory job state lost) also trip it.
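
As a rough sketch of the kill-switch wiring (resource and variable names are
illustrative, not verbatim from the stack):

```hcl
# Hypothetical sketch; the dispatcher script, env and volumes are elided.
variable "beads_dispatcher_enabled" {
  type    = bool
  default = true
}

resource "kubernetes_cron_job_v1" "beads_dispatcher" {
  metadata {
    name      = "beads-dispatcher"
    namespace = "beads-server"
  }
  spec {
    schedule = "*/2 * * * *"
    # Kill switch: when false, both CronJobs get suspend = true;
    # the manual Dispatch button keeps working.
    suspend = !var.beads_dispatcher_enabled
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            container {
              name = "dispatcher"
              # Reuses the claude-agent-service image (ships bd, jq, curl);
              # variable name and script path are assumptions.
              image   = var.claude_agent_service_image
              command = ["/bin/sh", "/scripts/dispatch.sh"]
            }
          }
        }
      }
    }
  }
}
```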

## What is NOT in this change

- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
  `cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.

## Deviations from plan

Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
  serializes `notes` as a string (not an array), and every `bd note` bumps
  `updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU
  `-d` and the image has python3.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.

## Test plan

### Automated

```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!

$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1)  # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.

$ terraform fmt stacks/beads-server/main.tf
# (no output — already formatted)
```

### Manual verification

1. **Apply**
   ```
   vault login -method=oidc
   cd infra/stacks/beads-server
   scripts/tg apply
   ```
   Expect: `kubernetes_config_map.beads_metadata`,
   `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
   created. No changes to existing resources.

2. **CronJobs exist with right schedule**
   ```
   kubectl -n beads-server get cronjob
   ```
   Expect `beads-dispatcher  */2 * * * *` and `beads-reaper  */10 * * * *`,
   both with `SUSPEND=False`.

3. **End-to-end smoke**
   ```
   bd create "auto-dispatch smoke test" \
       -d "Read /etc/hostname inside the agent sandbox and close." \
       --acceptance "bd note includes 'hostname=' line and bead is closed."
   bd assign <new-id> agent
   # within 2 min:
   bd show <new-id> --json | jq '{status, notes}'
   ```
   Expect notes to contain `auto-dispatcher claimed at …` and
   `dispatched: job=<uuid>`, status `in_progress`.

4. **Reaper smoke**
   Assign + dispatch a long bead, then
   `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
   30 min + one reaper tick, `bd show <id>` shows `blocked` with a
   `reaper: no progress for Nm — blocking` note.

5. **Kill switch**
   ```
   cd infra/stacks/beads-server
   scripts/tg apply -var=beads_dispatcher_enabled=false
   kubectl -n beads-server get cronjob
   ```
   Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
   nothing happens within 5 min. Re-apply with `=true` to re-enable.

Runbook with all above plus reaper semantics + design choices at
`infra/docs/runbooks/beads-auto-dispatch.md`.

Closes: code-8sm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:35:46 +00:00
Viktor Barzin
01955916b2 [infra] Adopt kured + sentinel-gate into Terraform (Wave 5a)
## Context

Wave 5a of the state-drift consolidation plan. Two cluster-critical pieces
of infrastructure lived OUTSIDE Terraform — invisible to the repo's "all
cluster changes via TF" invariant and drifting silently:

1. **kured** (Helm release): deployed 265d ago via `helm install kured` on
   the CLI. Values were edited only via `helm upgrade` — never captured.
   Chart version `kured-5.11.0`, app `1.21.0`, configured for Mon–Fri
   02:00–06:00 London reboot window, Slack notifyUrl, and a custom
   `/sentinel/gated-reboot-required` sentinel file.

2. **kured-sentinel-gate**: a custom DaemonSet + ServiceAccount +
   ClusterRole + ClusterRoleBinding. Built after the 2026-03 post-mortem
   (memory 390) when kured rebooted nodes during a containerd overlayfs
   outage and turned a single-node blip into a 26h cluster outage.
   The gate DaemonSet creates `/var/run/gated-reboot-required` only when
   (a) host has `/var/run/reboot-required`, (b) all nodes Ready, (c) all
   calico-node pods Running, (d) no node transitioned Ready in the last
   30 minutes (cool-down). kured's `rebootSentinel` then points at the
   gated file so reboots are effectively gated by cluster health.
   Applied 33d ago via `kubectl apply` — no TF footprint.

Both are now codified in the new `stacks/kured/` (Tier 1, PG state).

## This change

- New stack `stacks/kured/` with `main.tf` (247 lines) + `terragrunt.hcl`
  (standard platform-dep) + `secrets` symlink.
- All 6 resources adopted via Wave 8's HCL `import {}` block pattern
  (commit 8a99be11) — written as `import {}` stanzas in the initial
  commit, plan-applied to zero, then stanzas deleted before this commit
  per the convention:
    - `kubernetes_namespace.kured` (id: `kured`)
    - `helm_release.kured` (id: `kured/kured`)
    - `kubernetes_service_account.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
    - `kubernetes_cluster_role.kured_sentinel_gate` (id: `kured-sentinel-gate`)
    - `kubernetes_cluster_role_binding.kured_sentinel_gate` (id: `kured-sentinel-gate`)
    - `kubernetes_daemon_set_v1.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
- Slack notifyUrl moved from inline helm values into Vault at
  `secret/kured` under key `slack_kured_webhook`, consumed via
  `data "vault_kv_secret_v2"`. No plaintext secret in git.
- Namespace gets `tier = "1-cluster"` label (new — previously untiered,
  so Kyverno auto-quotas applied cluster-tier defaults on kured pods).
  Benign additive change; pod specs have explicit resources anyway.
- DaemonSet + SA get `automount_service_account_token = false` /
  `enable_service_links = false` to match the live pod spec exactly —
  otherwise TF schema defaults would flip these fields.
- DaemonSet carries `# KYVERNO_LIFECYCLE_V1` suppressing dns_config drift
  (Wave 3A convention, commit c9d221d5 + 327ce215).
- Namespace carries the same marker on the
  `goldilocks.fairwinds.com/vpa-update-mode` label (Wave 3B sweep,
  commit 8b43692a).
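
A minimal sketch of how the webhook is consumed (chart values layout and
repository URL are assumptions; the real `main.tf` carries the full values):

```hcl
# Sketch only; not the full 247-line stack.
data "vault_kv_secret_v2" "kured" {
  mount = "secret"
  name  = "kured"
}

resource "helm_release" "kured" {
  name       = "kured"
  namespace  = kubernetes_namespace.kured.metadata[0].name
  repository = "https://kubereboot.github.io/charts" # assumed
  chart      = "kured"
  version    = "5.11.0"

  values = [yamlencode({
    configuration = {
      rebootSentinel = "/sentinel/gated-reboot-required"
      notifyUrl      = data.vault_kv_secret_v2.kured.data["slack_kured_webhook"]
    }
  })]
}
```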

## Import outcomes

Apply result: `Resources: 6 imported, 0 added, 3 changed, 0 destroyed.`

The 3 in-place changes were all TF-schema reconciliation, not cluster
mutations:

- `helm_release.kured.values` — format reshuffle; the imported state
  stored values as a nested map, while the HCL uses `[yamlencode(...)]`.
  The rendered YAML is semantically identical, so the triggered Helm
  upgrade was a no-op on the cluster side (revision bumped 2→3, zero pod
  restarts).
- `kubernetes_namespace.kured.labels["tier"]` = `"1-cluster"` — new
  label added. Already discussed above.
- `kubernetes_daemon_set_v1.kured_sentinel_gate.wait_for_rollout` = true
  — TF-only attribute, no k8s impact.

Post-apply `scripts/tg plan` on `stacks/kured` returns:
`No changes. Your infrastructure matches the configuration.`

## What is NOT in this change

- `import {}` stanzas — intentionally removed after the apply landed.
  They would be no-ops and would clutter future diffs. Per Wave 8
  convention (AGENTS.md → "Adopting Existing Resources").
- Calico adoption (Wave 5b) — separate higher-blast change, needs a
  dedicated low-traffic window.
- local-path-storage (Wave 5c) — check-or-remove task still open.

## Verification

```
$ kubectl -n kured get ds
NAME                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
kured                 5         5         5       5            5
kured-sentinel-gate   5         5         5       5            5

$ helm -n kured list
NAME     NAMESPACE   REVISION  STATUS    CHART          APP VERSION
kured    kured       3         deployed  kured-5.11.0   1.21.0

$ cd stacks/kured && ../../scripts/tg plan | tail -1
No changes. Your infrastructure matches the configuration.
```

## Reproduce locally
1. `git pull`
2. `cd stacks/kured && ../../scripts/tg plan` → 0 changes
3. `kubectl -n kured get ds,pods` — 5 kured + 5 sentinel-gate pods Ready.

Closes: code-q8k

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:33:29 +00:00
Viktor Barzin
10fd88aec5 wealthfolio: add nightly backup sidecar — SQLite → NFS
## Context

Upstream Wealthfolio uses SQLite exclusively (Diesel ORM, no PG/MySQL
support — confirmed 2026-04-18 via repo inspection). The DB lives on
an RWO PVC (proxmox-lvm-encrypted) held 24/7 by the main pod.

First attempt at a standalone backup CronJob failed with a Multi-Attach
error: the RWO volume is already attached to the running WF pod, so no
separate pod can mount it. Switched to a backup sidecar in the same
pod — it shares the PVC mount naturally.

## This change

- `container "backup"` added to the WF Deployment:
  - alpine:3.20 + sqlite + busybox-suid (for crond).
  - Mounts /data read-only (shared with WF container) + /backup (new
    NFS volume at 192.168.1.127:/srv/nfs/wealthfolio-backup).
  - Writes /etc/crontabs/root with a `30 4 * * *` line + /scripts/backup.sh
    which runs `sqlite3 .backup` (WAL-safe online snapshot, zero
    downtime), copies secrets.json, and prunes anything older than 30d.
  - 16Mi request / 64Mi limit — sleeps most of the time.
- NFS volume declared in pod spec — server from the existing
  `var.nfs_server` variable; path `/srv/nfs/wealthfolio-backup` created
  on the PVE host in the same session.

Removed the standalone backup CronJob that couldn't work.
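
Condensed, the relevant Deployment additions look roughly like this (sketch;
how `/scripts/backup.sh` lands in the container and the exact package setup
are assumptions):

```hcl
# Sketch of the backup sidecar + NFS volume, not the full Deployment.
container {
  name  = "backup"
  image = "alpine:3.20"
  # Install sqlite + busybox-suid, register the 04:30 cron line, run crond in foreground.
  command = ["/bin/sh", "-c", "apk add --no-cache sqlite busybox-suid && echo '30 4 * * * /scripts/backup.sh' > /etc/crontabs/root && crond -f -l 8"]

  resources {
    requests = { memory = "16Mi" }
    limits   = { memory = "64Mi" }
  }

  volume_mount {
    name       = "data" # same PVC the WF container mounts
    mount_path = "/data"
    read_only  = true
  }
  volume_mount {
    name       = "backup"
    mount_path = "/backup"
  }
}

volume {
  name = "backup"
  nfs {
    server = var.nfs_server # 192.168.1.127
    path   = "/srv/nfs/wealthfolio-backup"
  }
}
```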

## Verification

### Automated

`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 1 changed, 1 destroyed (the transient CronJob).

### Manual (2026-04-18)

```
$ kubectl -n wealthfolio get pods -l app=wealthfolio
wealthfolio-95d8bd498-cj8kw   2/2   Running
$ kubectl -n wealthfolio logs <pod> -c backup
wealthfolio-backup sidecar ready; next 04:30 UTC
$ kubectl -n wealthfolio exec <pod> -c backup -- /scripts/backup.sh
wealthfolio-backup: /backup/2026-04-18T22-24-55 (34.2M)
$ ls /srv/nfs/wealthfolio-backup/
2026-04-18T22-24-55/   ← first sidecar-produced backup
```

## Reproduce locally

1. `kubectl -n wealthfolio exec $(kubectl -n wealthfolio get pods -l app=wealthfolio -o jsonpath='{.items[0].metadata.name}') -c backup -- /scripts/backup.sh`
2. `ssh root@192.168.1.127 ls /srv/nfs/wealthfolio-backup/`
3. Expected: new dated folder appears with wealthfolio.db + secrets.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:25:19 +00:00
Viktor Barzin
9e5d7cd825 state(vault): update encrypted state 2026-04-18 22:12:55 +00:00
Viktor Barzin
402fd1fbac state(dbaas): update encrypted state 2026-04-18 22:12:09 +00:00
Viktor Barzin
345ba2182f [mailserver] Widen email-roundtrip probe IMAP window 180s → 300s + per-attempt timeout
## Context

After fixing the two mail-server-side root causes of probe false-failures
(Dovecot userdb duplicates, postscreen btree lock contention), the probe
is expected to succeed well under 120s. This commit is defence in depth
against residual SMTP relay variance and against a future scenario where
Dovecot is transiently unresponsive during IMAP login.

The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's
queueing, DNS variance, and general SMTP retry backoff can easily
exceed that on a bad day. Widening to 5 minutes gives plenty of headroom
while still remaining well within the CronJob's 20-minute schedule
interval.

Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If
Dovecot is unresponsive (e.g., mid-rollout, transient TLS handshake
hang), the connect call can block indefinitely and the probe hangs
without ever looping to the next attempt. Adding `timeout=10` caps each
connect at 10s so the retry loop keeps making forward progress.

## This change

Two edits to the embedded probe script inside the cronjob resource:

```
-    # Step 2: Wait for delivery, retry IMAP up to 3 min
+    # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
  ...
-    for attempt in range(9):
+    for attempt in range(15):
  ...
-            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
```

Flow (before):

```
send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total
```

Flow (after):

```
send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total
                                           │
                                           └─ timeout ─► log, continue to next loop
```

## What is NOT in this change

- Probe frequency stays at `*/20 * * * *`.
- The `EmailRoundtripStale` alert thresholds are intentionally left at
  3600s + for: 10m. Those fire only on sustained multi-hour issues and
  should not be loosened — they would mask future regressions. Probe
  success rate is now expected to recover to ≥95% from the two upstream
  fixes; if it doesn't, alert tuning gets revisited separately.
- No change to the Brevo send step, the success-metrics push, or the
  cleanup of stale e2e-probe-* messages.

## Test Plan

### Automated

`scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`:

```
  # module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place
  -     for attempt in range(9):
  +     for attempt in range(15):
  -             imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
  +             imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
Plan: 0 to add, 1 to change, 0 to destroy.
```

`scripts/tg apply`:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

### Manual Verification

1. Trigger the probe manually:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
2. Tail its logs:
   `kubectl -n mailserver logs job/probe-verify-<ts> -f`
3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical
   successful run should still complete in < 60s now that postscreen
   is no longer stalling.
4. Watch the 48-hour window on the `email_roundtrip_success` gauge in
   Prometheus — expect ≥95% (was ~65% before all three fixes).

## Reproduce locally

1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml | grep -E "range\(|timeout"`
2. Expect: `range(15)` and `timeout=10`
3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
4. `kubectl -n mailserver logs -f job/probe-verify-<ts>`
5. Expect: eventual `Round-trip SUCCESS in <N>s` message and exit 0.

Closes: code-18e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:33:56 +00:00
Viktor Barzin
e2516b07a3 [mailserver] Disable postscreen btree cache to stop SMTP lock-contention stalls
## Context

Postfix inside docker-mailserver was spamming fatal errors several times a
minute — 5,464 of them in a 24h window — all of the same shape:

```
postfix/postscreen[NNN]: fatal: btree:/var/lib/postfix/postscreen_cache:
unable to get exclusive lock: Resource temporarily unavailable
```

Every time one of these fires, the postscreen process dies mid-connection
and the inbound SMTP session is dropped. Legitimate mail (including Brevo
deliveries for our e2e email-roundtrip probe) gets re-queued by the sender
and arrives late — frequently past the probe's 180s IMAP polling window,
producing a 35%/7d probe success rate and the EmailRoundtripStale alert
noise that was originally flagged as "probably nothing."

## Root cause

`master.cf` declares postscreen with `maxproc=1`, but postscreen still
re-spawns per incoming connection (or for short-lived reopens), and each
instance opens the shared btree cache with an exclusive file lock. Under
any concurrency (two TCP SYNs arriving close together, or a retry during
teardown), the second process hits EWOULDBLOCK on fcntl and Postfix
treats that as fatal.

Four options were considered:

  | Option | Verdict |
  |--------|---------|
  | (a) Disable cache (postscreen_cache_map = )  | ✓ chosen |
  | (b) Switch btree → lmdb                       | ✗ lmdb not compiled into docker-mailserver 15.0.0's postfix (`postconf -m` has no lmdb) |
  | (c) proxy:btree via proxymap                  | ✗ unsafe — Postfix docs: "postscreen does its own locking, not safe via proxymap" |
  | (d) Memcached sidecar                         | ✗ new moving part; deferred |

Option (a) is a small trade-off: legitimate clients re-run the
greet-action / bare-newline-action checks on every fresh TCP session
instead of hitting the 7-day whitelist cache. At our volume (~100
deliveries/day, ~72 of which are the probe itself) that's negligible CPU.
DNSBL re-evaluation is also avoided only partially, but this mailserver
already has `postscreen_dnsbl_action = ignore` so the cache's DNSBL role
was doing nothing anyway.

## This change

Appends a stanza to the user-merged postfix main.cf stored in
`variable.postfix_cf` that sets `postscreen_cache_map =` (empty value).
Postfix treats an empty cache_map as "no persistent cache" — per-session
decisions are still enforced, they just aren't cached across sessions.
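
In HCL terms the appended override is a single line; a sketch, assuming the
variable carries the merged main.cf text as a heredoc:

```hcl
# Sketch; the real variable layout in the stack may differ.
variable "postfix_cf" {
  type    = string
  default = <<-EOT
    # ... existing user-merged overrides ...
    # empty value = no persistent postscreen cache; per-session checks remain
    postscreen_cache_map =
  EOT
}
```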

Before:

```
smtpd ──► postscreen (maxproc=1, btree cache with exclusive lock)
                ├─ concurrent access → fcntl EWOULDBLOCK → fatal
                └─ connection dropped, sender retries, mail arrives late
```

After:

```
smtpd ──► postscreen (no cache, per-session checks only)
                └─ no shared file, no lock → no fatal, no dropped session
```

No change to master.cf (postscreen still the front-end), no change to
DNSBL / greet / bare-newline policy.

## What is NOT in this change

- Dovecot userdb dedup (shipped in the previous commit).
- Email-roundtrip probe widening (next commit).
- Rebuilding docker-mailserver image with lmdb support (deferred —
  disabling the cache is simpler and sufficient at our volume).

## Test Plan

### Automated

`postconf -m` in the running container to confirm lmdb is genuinely absent
(ruling out option (b) before we commit to (a)):

```
btree  cidr  environ  fail  hash  inline  internal  ldap  memcache
nis  pcre  pipemap  proxy  randmap  regexp  socketmap  static  tcp
texthash  unionmap  unix
```

No lmdb entry — confirmed.

`scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`:

```
  ~ "postfix-main.cf" = <<-EOT
      + postscreen_cache_map =
```

`scripts/tg apply`:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

Reloader triggers pod rollout — baseline error count before apply was 34
`unable to get exclusive lock` lines per `--tail=500` log window.

### Manual Verification

Post-rollout, when the new pod is Ready:

1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map`
   Expect: empty (no value)
2. Watch for 15 min: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=1000 | grep -c "unable to get exclusive lock"`
   Expect: 0 new occurrences (any hits are from before the rollout).
3. Trigger a probe run manually:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
   then `kubectl -n mailserver logs job/probe-verify-...`
   Expect: `Round-trip SUCCESS` with duration < 120s.

## Reproduce locally

1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map`
2. Expect: `postscreen_cache_map =` (empty value)
3. `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --since=15m | grep -c "unable to get exclusive lock"`
4. Expect: 0

Closes: code-1dc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:32:48 +00:00
Viktor Barzin
01a718e17b [mailserver] Filter redundant local→local aliases to fix Dovecot 'exists more than once'
## Context

Dovecot auth logs have been steadily spamming
`passwd-file /etc/dovecot/userdb: User r730-idrac@viktorbarzin.me exists more
than once` (and the same for vaultwarden@) at ~31 occurrences per 500 log
lines. Under load this flakes IMAP auth for the e2e email-roundtrip probe
(spam@viktorbarzin.me uses the catch-all), which was masquerading as "Brevo
or probe timing" noise.

## Root cause

docker-mailserver builds Dovecot's `/etc/dovecot/userdb` from two sources:
real accounts (`postfix-accounts.cf`) AND virtual-alias entries whose
*target* resolves to a local mailbox (`postfix-virtual.cf`). When the same
address appears as BOTH a real mailbox AND an alias whose target is another
local mailbox, the generated userdb has two lines for that username pointing
to different home directories — e.g.:

  r730-idrac@viktorbarzin.me:...:/var/mail/.../r730-idrac/home
  r730-idrac@viktorbarzin.me:...:/var/mail/.../spam/home      ← from alias

Dovecot's passwd-file driver rejects the duplicate, and every subsequent
auth lookup logs the error.

This affected exactly two addresses:
- r730-idrac@viktorbarzin.me (real account + alias → spam@)
- vaultwarden@viktorbarzin.me  (real account + alias → me@)

Other aliases are fine: they either forward to external addresses (gmail
etc.) — no local userdb entry generated — or map an address to itself
(me@ → me@) which docker-mailserver dedups internally.

Note: removing the real accounts is not an option because Vaultwarden uses
`vaultwarden@viktorbarzin.me` as its live SMTP_USERNAME
(stacks/vaultwarden/modules/vaultwarden/main.tf:121).

## This change

Introduces a `local.postfix_virtual` that concatenates the Vault-sourced
aliases with `extra/aliases.txt`, then filters out any line matching the
exact "LHS RHS" shape where both sides are in `var.mailserver_accounts` and
LHS != RHS. That is, only the pure local→local redundant entries are
dropped; all forwarding aliases and the catch-all are preserved.

The filter is self-healing: if a future alias ever collides with a real
account, it gets silently suppressed instead of breaking Dovecot auth.

```
  Vault mailserver_aliases  ─┐
                              ├─ concat ─ split \n ─ filter ─ join \n ─► postfix-virtual.cf
  extra/aliases.txt ─────────┘                        │
                                                       └── drop if LHS+RHS both in
                                                           mailserver_accounts and
                                                           LHS != RHS
```
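
A minimal HCL sketch of that filter (data-source and variable names are
assumed; the real expressions may differ):

```hcl
# Sketch of the local→local alias filter.
locals {
  raw_aliases = concat(
    # Vault-sourced aliases (data source name assumed)
    split("\n", data.vault_kv_secret_v2.mailserver.data["mailserver_aliases"]),
    split("\n", file("${path.module}/extra/aliases.txt"))
  )

  # Drop "LHS RHS" lines where both sides are real local accounts and LHS != RHS;
  # forwarding aliases and the catch-all pass through untouched.
  postfix_virtual = join("\n", [
    for line in local.raw_aliases : line
    if !(
      length(split(" ", line)) == 2 &&
      contains(var.mailserver_accounts, split(" ", line)[0]) &&
      contains(var.mailserver_accounts, split(" ", line)[1]) &&
      split(" ", line)[0] != split(" ", line)[1]
    )
  ])
}
```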

Filtered entries (confirmed via locally-simulated filter on live data):
- r730-idrac@viktorbarzin.me spam@viktorbarzin.me
- vaultwarden@viktorbarzin.me me@viktorbarzin.me

Preserved (sample): postmaster→me, contact→me, alarm-valchedrym→self+3 ext,
lubohristov→gmail, yoana→gmail, @viktorbarzin.me→spam (catch-all), all four
disposable `*-generated@` aliases.

## What is NOT in this change

- Real accounts in Vault (`secret/platform.mailserver_accounts`) are
  untouched — vaultwarden SMTP auth keeps working.
- Postfix postscreen btree lock contention (separate commit).
- Email-roundtrip probe IMAP window (separate commit).

## Test Plan

### Automated

`terraform validate` — passes (docker-mailserver module):

```
Success! The configuration is valid, but there were some validation warnings as shown above.
```

`scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`:

```
  # module.mailserver.kubernetes_config_map.mailserver_config will be updated in-place
  ~ resource "kubernetes_config_map" "mailserver_config" {
      ~ data = {
          ~ "postfix-virtual.cf" = (sensitive value)
            # (9 unchanged elements hidden)
        }
        id = "mailserver/mailserver.config"
    }
Plan: 0 to add, 1 to change, 0 to destroy.
```

`scripts/tg apply` — applied:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

### Manual Verification

Post-apply configmap content (the two lines are gone):

```
$ kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'
postmaster@viktorbarzin.me me@viktorbarzin.me
contact@viktorbarzin.me me@viktorbarzin.me
me@viktorbarzin.me me@viktorbarzin.me
lubohristov@viktorbarzin.me lyubomir.hristov3@gmail.com
alarm-valchedrym@viktorbarzin.me alarm-valchedrym@...,vbarzin@...,emil.barzin@...,me@...
yoana@viktorbarzin.me divcheva.yoana@gmail.com

@viktorbarzin.me spam@viktorbarzin.me
firmly-gerardo-generated@viktorbarzin.me me@viktorbarzin.me
closely-keith-generated@viktorbarzin.me vbarzin@gmail.com
literally-paolo-generated@viktorbarzin.me viktorbarzin@fb.com
hastily-stefanie-generated@viktorbarzin.me elliestamenova@gmail.com
```

Reloader triggers a pod rollout; once new pod is Ready:
- `kubectl -n mailserver exec <pod> -c docker-mailserver -- cut -d: -f1 /etc/dovecot/userdb | sort | uniq -d`
  expected output: empty (no duplicate usernames)
- `kubectl -n mailserver logs <pod> -c docker-mailserver --tail=500 | grep -c "exists more than once"`
  expected output: 0 (baseline was 31/500 lines)

## Reproduce locally

1. `kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'`
2. Expect: no `r730-idrac@viktorbarzin.me spam@viktorbarzin.me` line and no
   `vaultwarden@viktorbarzin.me me@viktorbarzin.me` line.
3. After pod restart: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=500 | grep -c "exists more than once"` → 0.

Closes: code-27l

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:29:02 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set` (and their `_v1` variants),
`kubernetes_cron_job_v1`, and `kubernetes_job_v1` in the repo and ensuring
each carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
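
For a CronJob the injected block looks like this (the Deployment /
StatefulSet / DaemonSet / Job form is identical minus the `job_template[0]`
level):

```hcl
lifecycle {
  ignore_changes = [
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    spec[0].job_template[0].spec[0].template[0].spec[0].dns_config,
  ]
}
```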

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
block gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
e612baac15 [dawarich] Re-enable Sidekiq worker with resource limits + probes
## Context

Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the
unbounded 10-thread worker drove the whole pod into memory pressure —
the kubelet then evicted the web container along with it. Viktor's
recollection was "it was crashing"; the cgroup-root cause was that the
Sidekiq container had no `resources.limits.memory` set, so a misbehaving
job could pull the entire pod down instead of being OOM-killed and
restarted in isolation.

During the ~55 days the worker was off, POSTs to /api/v1 continued to
enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not
the cluster default DB 0). track_segments and digests tables stayed
empty because nothing was processing the backfill queue (beads
code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so
Sidekiq was untested against the new release in this environment.

Live pre-apply snapshot via `bin/rails runner`:
  enqueued=18  (cache=2, data_migrations=4, default=12)
  scheduled=16, retry=0, dead=0, procs=0, processed/failed=0 (stats
  reset by the 1.6.1 upgrade)
Queue latencies ~50h — lines up with code-e9c (iOS client stopped
POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1
was therefore a small, recoverable backlog, not the disaster the plan
originally feared — no pre-apply triage needed.

## What changed

Second container `dawarich-sidekiq` added to the existing Deployment
(same pod, same lifecycle as `dawarich` web). Key differences vs the
2026-02-23 commented block:

- `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory =
  768Mi }`. Burstable QoS — cgroup is now bounded, so a runaway Sidekiq
  job gets OOM-killed and container-restarted in place without evicting
  the whole pod (web stays Ready).
- Hosts parametrised via `var.redis_host` / `var.postgresql_host`
  instead of hardcoded FQDNs; matches the web container's pattern.
- DB / secret / Geoapify creds via `value_from.secret_key_ref` against
  the existing `dawarich-secrets` K8s Secret (populated by the existing
  ExternalSecret). Removes the plan-time `data.vault_kv_secret_v2`
  reference the 2026-02-23 block relied on — that data source no longer
  exists in this stack.
- `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). Ramp deferred
  to separate commits (plan: 2 → 5 → 10 with 15-30min observation
  between bumps).
- Liveness + readiness `pgrep -f 'bundle exec sidekiq'` probes —
  container-scoped restart on stall, verified `pgrep` is at
  /usr/bin/pgrep in the Debian-trixie-based freikin/dawarich image.
- Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT,
  RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so
  Sidekiq's Rails initialisation matches web.

Pod-level additions:
- `termination_grace_period_seconds = 60` — gives Sidekiq time to
  drain in-flight jobs on SIGTERM during rolls (default 30s not enough
  for reverse-geocoding batches).
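
Condensed, the added container looks roughly like this (sketch; env list
trimmed, image reference and secret key names assumed):

```hcl
# Abridged sketch of the second container inside kubernetes_deployment.dawarich.
container {
  name    = "dawarich-sidekiq"
  image   = var.dawarich_image            # same freikin/dawarich image as web (name assumed)
  command = ["bundle", "exec", "sidekiq"] # assumed entrypoint

  env {
    name  = "BACKGROUND_PROCESSING_CONCURRENCY"
    value = "2"
  }
  env {
    name = "DATABASE_PASSWORD" # illustrative; real list covers DB/secret/Geoapify creds
    value_from {
      secret_key_ref {
        name = "dawarich-secrets"
        key  = "database_password" # key name assumed
      }
    }
  }

  resources {
    requests = { cpu = "50m", memory = "768Mi" }
    limits   = { memory = "1Gi" }
  }

  # Container-scoped restart on stall; pgrep lives at /usr/bin/pgrep in the image.
  liveness_probe {
    exec { command = ["pgrep", "-f", "bundle exec sidekiq"] }
  }
  readiness_probe {
    exec { command = ["pgrep", "-f", "bundle exec sidekiq"] }
  }
}
```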

## What is NOT in this change

- Prometheus exporter for Sidekiq metrics. The first apply turned on
  `PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the
  `prometheus_exporter` gem's CLIENT middleware. That middleware PUSHes
  metrics over TCP to a separate exporter server process — and the
  freikin/dawarich image does not start one. Client logged ~2/sec
  "Connection refused" errors until we flipped ENABLED back to "false"
  in this commit. `pod.annotations["prometheus.io/scrape"]` reverted
  for the same reason (nothing listening on :9394). Filed code-1q5
  (blocks code-459) to add a third sidecar container running
  `bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` and restore
  the 4 drafted alerts (DawarichSidekiqDown /
  QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are
  actually being emitted.
- The 4 drafted Sidekiq alerts — reverted from
  monitoring/prometheus_chart_values.tpl; they reference metrics that
  don't exist yet. Restoration is part of code-1q5.
- Concurrency ramp past 2 and the 24h burn-in gate that closes
  code-459 — separate future commits.
- Liveness/readiness probes on the web container — pre-existing gap,
  out of scope per plan.

## Other changes bundled in

Kyverno `dns_config` drift suppression added with the
`# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich`
AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. Plan only
called it out for the Deployment, but the CronJob shows identical
drift (Kyverno injects ndots=2 on every pod template, Terraform wipes
it, infinite churn). Per AGENTS.md "Kyverno Drift Suppression" every
pod-owning resource MUST carry the lifecycle block — this commit
brings this stack into convention.

## Topology trade-off recorded

Sidekiq lives in the same pod as the web container, not a separate
Deployment. This means:
- Every env bump during ramp bounces both containers (Recreate
  strategy) — brief UI blip accepted.
- `kubectl scale` alone can't pause Sidekiq — pausing requires
  `BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting
  the container block + apply.
- Shared pod network namespace — only one process can bind any given
  port. This is why the plan explicitly avoided declaring a new
  `port { name = "prometheus" }` on the sidekiq container (the web
  container already reserves 9394 by name).

Accepted because the alternative (split Deployment) is significantly
more config for a single-instance service and a follow-up bead
(tracked in code-1q5 description area / Viktor's notes) already
captures "revisit if future crashes warrant blast-radius isolation".

## Rollback

Three levels, in order of increasing impact:
1. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply — pod stays up,
   no jobs processed, backlog preserved in Redis.
2. Drop concurrency to 1 or 2 + apply — reduce load but keep draining.
3. Re-comment the second container block (this diff in reverse) +
   apply — full disable, backlog stays in Redis DB 1, recoverable.

Never DEL queue:* keys directly — Redis DB 1 is where Dawarich lives,
and the jobs are recoverable state.

## Refs

- code-459 (P3) — Dawarich Sidekiq disabled. In progress; closes
  after 24h burn-in at concurrency=10 with restartCount=0, DeadSet
  delta < 100.
- code-1q5 (P3) — Follow-up: prometheus_exporter sidecar + 4 alerts.
  Depends on code-459.
- code-e9c (P2) — Viktor client-side POST bug 2026-04-16.
  Untouched; processing the backlog does not fix this but ensures
  future POSTs drain cleanly.
- code-72g (P3) — Anca ingestion silent since 2025-06-21. Untouched;
  same reasoning.

## Test Plan

### Automated

```
$ cd stacks/dawarich && ../../scripts/tg plan
...
Plan: 0 to add, 3 to change, 0 to destroy.
#   kubernetes_deployment.dawarich         (sidekiq container + probes + lifecycle)
#   kubernetes_namespace.dawarich          (drops stale goldilocks label, pre-existing drift)
#   module.tls_secret.kubernetes_secret.tls_secret  (Kyverno clone-label drift, pre-existing)

$ ../../scripts/tg apply --non-interactive
...
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.

(Second apply for PROMETHEUS_EXPORTER_ENABLED=false + annotation
removal — same 0/3/0 shape.)
```

### Manual Verification

Setup: kubectl context against the k8s cluster (10.0.20.100).

1. Pod has both containers Ready with zero restarts:
   ```
   $ kubectl -n dawarich get pods -o wide
   NAME                        READY  STATUS   RESTARTS  AGE
   dawarich-75b4ff9fbf-qh56v   2/2    Running  0         <fresh>
   ```

2. Sidekiq container is actively processing jobs:
   ```
   $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=20
   Sidekiq 8.0.10 connecting to Redis ... db: 1
   queues: [data_migrations, points, default, mailers, families,
            imports, exports, stats, trips, tracks,
            reverse_geocoding, visit_suggesting, places,
            app_version_checking, cache, archival, digests,
            low_priority]
   Performing DataMigrations::BackfillMotionDataJob ...
   Backfilled motion_data for N000 points (N climbing)
   ```

3. Rails Sidekiq::API snapshot — procs registered, counters moving:
   ```
   $ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
       require "sidekiq/api"
       s = Sidekiq::Stats.new
       puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}"
     '
   processed=7 failed=2 procs=1
   retry=0 dead=0
   ```
   (The 2 "failures" are cumulative across two pod lifecycles during
   the Prometheus env flip — retried successfully, neither retry nor
   dead set holds any jobs.)

4. Per-container memory well under the 1Gi limit:
   ```
   $ kubectl -n dawarich top pod --containers
   POD                         NAME              CPU    MEMORY
   dawarich-75b4ff9fbf-qh56v   dawarich          1m     272Mi  (of 896Mi)
   dawarich-75b4ff9fbf-qh56v   dawarich-sidekiq  79m    333Mi  (of 1Gi)
   ```

5. No "Prometheus Exporter, failed to send" log lines since the second
   apply:
   ```
   $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \
       | grep -c "Prometheus Exporter"
   0
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:13:05 +00:00
Viktor Barzin
8a99be1194 [infra] Document HCL import {} block convention [ci skip]
## Context

Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}`
block pattern (Terraform 1.5+) as the canonical way to bring live
cluster / Vault / Cloudflare resources under TF management.

Historically the repo has used `terraform import` on the CLI for
adoptions. That path has three real problems:

1. **Not reviewable** — it's an out-of-band state mutation that leaves
   no trace in git beyond the subsequent `resource {}` block. A
   reviewer sees only the new resource, not the adoption intent.
2. **Not plan-safe** — if the resource address or ID is wrong, the CLI
   path commits the mistake to state before anyone can catch it.
3. **Not idempotent** — a failed apply mid-import leaves state in a
   confusing half-adopted shape.

`import {}` blocks fix all three: the adoption intent is in the PR
diff, `scripts/tg plan` shows the import as its own plan line (mistyped
IDs fail before apply), and re-applying after a partial failure just
retries the import step.

Canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands
so the reviewer of those imports has the rule in front of them.

## This change

- `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks,
  Not the CLI" section sitting right after Execution. Includes the
  canonical 5-step workflow (write resource → add import stanza → plan
  to zero → apply → drop stanza), the reasoning, and a per-provider ID
  format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1,
  authentik_provider_proxy, cloudflare_record).
- `.claude/CLAUDE.md`: one-line cross-reference at the end of the
  Terraform State two-tier section pointing back to AGENTS.md. Keeps
  CLAUDE.md's quick-reference density intact while making sure the rule
  is reachable from the Claude-instructions path.

## What is NOT in this change

- Any actual imports — this is a pure docs landing. Wave 5 will
  demonstrate the pattern on kured + Calico.
- Replacing the handful of existing `terraform import`-style adoptions
  in the repo history — `import {}` blocks are delete-after-apply, so
  retro-documenting them is not useful.

Closes: code-[wave8-task]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:10:05 +00:00
Viktor Barzin
2b8bb849c0 [infra] Bump claude-agent-service + beadboard image tags
## Context
Two rolling updates tied to the BeadBoard dispatch-button work (code-kel):

1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent
   (files in /usr/share/agent-seed/), the beads-task-runner agent, and
   hmac.compare_digest bearer verification. The tag moves from 382d6b14
   to 0c24c9b6 (monorepo HEAD).
2. The beadboard Deployment in beads-server now consumes
   CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image
   needs the Dispatch button + /api/agent-dispatch + /api/agent-status
   routes. Tag moves from :latest to :17a38e43 (fork HEAD on
   github.com/ViktorBarzin/beadboard).

## What this change does
- Flips `local.image_tag` in claude-agent-service main.tf.
- Drops the "temporary" comment on `beadboard_image_tag` and sets the
  default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md
  "Use 8-char git SHA tags — `:latest` causes stale pull-through cache").

## Test Plan
### Automated
- Both images already pushed to registry.viktorbarzin.me{:5050}/ :
  - claude-agent-service:0c24c9b6 verified via
    `docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/
    contains both seed files.
  - beadboard:17a38e43 pushed, digest cd0d3c47.
- terraform fmt/validate clean on both stacks from the earlier commits.

### Manual Verification
1. Push triggers Woodpecker default.yml.
2. Expected: both stacks apply; claude-agent-service pod rolls (new
   seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch
   + copies beads-task-runner.md), beadboard pod rolls with new env vars
   sourced from beadboard-agent-service ExternalSecret.
3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:`
   should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard
   -o yaml | grep image:` should show :17a38e43.

Closes: code-kel
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:24:37 +00:00
Viktor Barzin
8d94688dde [infra] Suppress Kyverno label drift on module.tls_secret Secrets [ci skip]
## Context

Wave 3B of the state-drift consolidation audit (plan section "Shared Kyverno
drift-suppression") identified a second Kyverno admission-induced drift
class, complementary to the `# KYVERNO_LIFECYCLE_V1` ndots dns_config suppression
landed in c9d221d5. The ClusterPolicy `sync-tls-secret` runs on every
`kubernetes_secret` created via `modules/kubernetes/setup_tls_secret` and
stamps the following labels on the generated Secret:

  app.kubernetes.io/managed-by          = kyverno
  generate.kyverno.io/policy-name       = sync-tls-secret
  generate.kyverno.io/policy-namespace  = ""
  generate.kyverno.io/rule-name         = sync-tls-secret
  generate.kyverno.io/source-kind       = Secret
  generate.kyverno.io/source-namespace  = kyverno
  generate.kyverno.io/source-uid        = <uid>
  generate.kyverno.io/source-version    = v1
  generate.kyverno.io/source-group      = ""
  generate.kyverno.io/clone-source      = ""

Terraform does not manage any labels on this Secret, so every `terragrunt
plan` showed all 10 labels as `-> null`. This was observed on the dawarich
stack (one of the 93 callers of setup_tls_secret) and reproduces identically
on any stack that consumes this module. Root cause ticket: beads `code-seq`.

## This change

Adds a single `lifecycle { ignore_changes = [metadata[0].labels] }` block
to `modules/kubernetes/setup_tls_secret/main.tf`. One module edit,
93 callers' `module.tls_secret.kubernetes_secret.tls_secret` drift cleared.

The marker comment `# KYVERNO_LIFECYCLE_V1` stays consistent with the Wave 3A
convention (c9d221d5) — the rule now stands for "any Kyverno-induced
drift", not only ndots dns_config. AGENTS.md's "Kyverno Drift Suppression"
section will grow to catalog the fields ignored; this commit keeps the scope
tight to the code change.

## What is NOT in this change

- Namespace-level Goldilocks label drift (`goldilocks.fairwinds.com/vpa-update-mode = off`)
  — a different admission controller, different resource, different fix.
  Filed as beads `code-dwx` for a follow-up sweep across all 105 Tier 1
  stacks.
- AGENTS.md documentation expansion — will land alongside the Goldilocks
  sweep so both patterns are catalogued together.
- Retroactive marker on other Kyverno-generated Secrets — the sync-tls-secret
  policy is the only generate policy that produces Secrets in this repo
  (verified: `kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'` + cross-reference).

## Verification

Dawarich stack:
```
Before: Plan: 0 to add, 2 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
   (module.tls_secret.kubernetes_secret.tls_secret — Kyverno label drift)

After:  Plan: 0 to add, 1 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
```

Closes: code-seq (partial — tls_secret branch)
Refs: code-dwx (Goldilocks follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:23:02 +00:00
Viktor Barzin
f79e3c563e [infra] Remove mysql InnoDB Cluster + Operator HCL (Phase 4 cleanup) [ci skip]
## Context

On 2026-04-16 (memory #711) MySQL was migrated from InnoDB Cluster (3-member
Group Replication + MySQL Operator) to a raw `kubernetes_stateful_set_v1.mysql_standalone`
on `mysql:8.4`. The migration preserved the `mysql.dbaas` Service name
(selector switched to the standalone pod), all 20 databases/688 tables/14
users were dump-restored, and Vault rotated credentials against the new
instance. The InnoDB Cluster has been dark since — Phase 4 was to remove
the dead code and decommission its cluster-side Helm state.

Memory #711 explicitly notes Phase 4 as: "Remove helm_release.mysql_cluster
+ mysql_operator + namespace + RBAC + Delete PVC datadir-mysql-cluster-0
(30Gi) + Delete mysql-operator namespace + CRDs + stale Vault roles."

## This change

Phase 4 scope executed in this session (beads code-qai):

1. `terragrunt destroy -target` against 6 resources in the dbaas Tier 0 stack:
   - `module.dbaas.helm_release.mysql_cluster` — uninstalled InnoDBCluster CR
     + MySQL Router Deployment + 8 Services (mysql-cluster, -instances,
     ports 6446/6448/6447/6449/6450/8443, etc.)
   - `module.dbaas.helm_release.mysql_operator` — uninstalled MySQL Operator
     Deployment, InnoDBCluster CRD + webhook, operator ClusterRoles
   - `module.dbaas.kubernetes_namespace.mysql_operator` — deleted the ns
   - `module.dbaas.kubernetes_cluster_role.mysql_sidecar_extra` — leftover
     permissions patch that existed to work around the sidecar's kopf
     permissions bug; unused without the operator
   - `module.dbaas.kubernetes_cluster_role_binding.mysql_sidecar_extra`
   - `module.dbaas.kubernetes_config_map.mysql_extra_cnf` — used to override
     `innodb_doublewrite=OFF` via subPath mount; standalone does not need it
2. `kubectl delete pvc datadir-mysql-cluster-0 -n dbaas` — Helm does not
   garbage-collect PVCs; 30Gi reclaimed.
3. Removed 295 lines (lines 86–380) from `stacks/dbaas/modules/dbaas/main.tf`
   covering the `#### MYSQL — InnoDB Cluster via MySQL Operator` section
   and all six resources above.

The first destroy hit a Helm timeout on `mysql-cluster` uninstall ("context
deadline exceeded"). Uninstallation had in fact completed cluster-side by
that point but TF rolled back the state delta. A second `terragrunt destroy
-target` call with the same args resolved cleanly — destroyed the remaining
2 tracked resources (the first pass cleared 4) and encrypted+committed the
Tier 0 state.

## What is NOT in this change

- Manual CRD cleanup (`innodbclusters.mysql.oracle.com`, etc.) — not needed;
  the Helm uninstall already removed them. Verified clean:
  `kubectl get crd | grep mysql.oracle.com` returns nothing.
- Orphan PVC `datadir-mysql-cluster-0` — already deleted via kubectl; not
  a TF-managed resource.
- Stale Vault DB roles (health, linkwarden, affine, woodpecker,
  claude_memory, crowdsec, technitium) for services migrated MySQL→PG —
  sandbox denies `vault list database/roles` as credential scouting, so
  the user handles this manually.
- 2 state-commits preceding this one (`30fa411b`, `6cf3575e`) are automatic
  SOPS-encrypted-state commits produced by `scripts/tg` after each
  `terragrunt destroy` pass. Standard Tier 0 workflow.

## Verification

```
$ helm list -A | grep -E 'mysql-cluster|mysql-operator'
(no output)

$ kubectl get ns mysql-operator
Error from server (NotFound): namespaces "mysql-operator" not found

$ kubectl get pvc -n dbaas datadir-mysql-cluster-0
Error from server (NotFound): persistentvolumeclaims "datadir-mysql-cluster-0" not found

$ kubectl get pod -n dbaas -l app.kubernetes.io/instance=mysql-standalone
NAME                 READY   STATUS    RESTARTS       AGE
mysql-standalone-0   1/1     Running   1 (118m ago)   2d

$ ../../scripts/tg state list | grep -i 'mysql_operator\|mysql_cluster\|mysql_sidecar\|mysql_extra_cnf'
(no output)

$ ../../scripts/tg plan | grep -E 'mysql_cluster|mysql_operator|mysql_sidecar|mysql_extra_cnf'
(no output — Wave 2 drift is gone; remaining plan items are pre-existing
drift unrelated to this change, see Wave 3 + in-flight payslip work)
```

## Reproduce locally
1. `git pull`
2. `cd stacks/dbaas && ../../scripts/tg state list | grep mysql_cluster` → no output
3. `helm list -A | grep mysql-cluster` → no output

Closes: code-qai

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:19:48 +00:00
Viktor Barzin
6cf3575ed9 state(dbaas): update encrypted state 2026-04-18 19:17:31 +00:00
Viktor Barzin
30fa411bf7 state(dbaas): update encrypted state 2026-04-18 19:17:20 +00:00
Viktor Barzin
61e94c21fe state(dbaas): update encrypted state 2026-04-18 19:16:41 +00:00
Viktor Barzin
c75beaac6c wealthfolio: bump memory 64Mi → 1Gi (limit) / 256Mi (request)
## Context

Pod was OOMKilled after today's broker-sync Phase 3 import grew the
activity DB from ~10 rows (Phase 0 demo) to ~700 (Fidelity + cash-flow
matches across 6 accounts). `/api/v1/net-worth` and
`/valuations/history` materialise the full history in memory to render
the dashboard chart.

`kubectl describe pod` showed Back-off restarting failed container;
`kubectl top pod` reported 14Mi steady-state but spikes crossed the
64Mi cap.

## This change

Bump container resources to:
- requests.memory: 64Mi → 256Mi
- limits.memory:  64Mi → 1Gi

CPU unchanged. 1Gi is generous for the current 700-activity DB +
chart rendering, with headroom for another year of growth before we
need to revisit (VPA will flag if actual use exceeds upperBound).
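
In HCL terms the change is confined to the container's resources block, roughly (a sketch; surrounding container arguments omitted, CPU request value taken from the post-apply output below):

```hcl
resources {
  requests = {
    cpu    = "10m"   # unchanged
    memory = "256Mi" # was 64Mi
  }
  limits = {
    memory = "1Gi"   # was 64Mi; no CPU limit, matching the live spec
  }
}
```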

## Verification

### Automated
`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 4 changed, 0 destroyed.

### Manual
```
$ kubectl -n wealthfolio get pod -l app=wealthfolio -o jsonpath='{.items[0].spec.containers[0].resources}'
{"limits":{"memory":"1Gi"},"requests":{"cpu":"10m","memory":"256Mi"}}

$ kubectl -n wealthfolio get pods -l app=wealthfolio
NAME                           READY   STATUS    RESTARTS   AGE
wealthfolio-86c8696b9c-nzwkf   1/1     Running   0          51s
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:13:05 +00:00
Viktor Barzin
43b4e1d372 [payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role
## Context

New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.

Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.

## What

### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
  WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
  `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
  `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
  design), `wait-for postgresql.dbaas:5432` annotation, init container runs
  `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
  dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
  uid `payslips-pg`) reading password from the db-creds K8s Secret.

### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).

### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
  (username `payslip_ingest`, 7d rotation).
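
A minimal sketch of the static role, assuming the usual Vault provider resource wiring (the real backend/connection references live in the vault stack):

```hcl
resource "vault_database_secret_backend_static_role" "pg_payslip_ingest" {
  backend         = vault_database_secret_backend_connection.postgresql.backend
  name            = "pg-payslip-ingest"
  db_name         = vault_database_secret_backend_connection.postgresql.name
  username        = "payslip_ingest"
  rotation_period = 7 * 24 * 60 * 60 # 7 days, in seconds
}
```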

### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
  idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
  `pg-cluster-1`.
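
The bootstrap has roughly the following shape (a sketch only; the real resource follows the same idempotency approach as `pg_terraform_state_db`, whereas here `|| true` stands in for proper existence checks and the psql auth flags are assumed):

```hcl
resource "null_resource" "pg_payslip_ingest_db" {
  provisioner "local-exec" {
    # Idempotent-ish: re-runs tolerate the role/DB already existing.
    command = <<-EOT
      kubectl -n dbaas exec pg-cluster-1 -- psql -U postgres -c "CREATE ROLE payslip_ingest LOGIN" || true
      kubectl -n dbaas exec pg-cluster-1 -- psql -U postgres -c "CREATE DATABASE payslip_ingest OWNER payslip_ingest" || true
      kubectl -n dbaas exec pg-cluster-1 -- psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE payslip_ingest TO payslip_ingest"
    EOT
  }
}
```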

### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
  JSON object matching the schema to stdout. No network, no file writes outside /tmp,
  no markdown fences.

## Trade-offs / decisions

- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
  initially described. The Alembic migration still creates a `payslip_ingest`
  schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
  and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
  surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
  `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
  auto-creator).

## Test Plan

### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.

terraform fmt -check -recursive
(exit 0)

python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```

### Manual Verification (post-merge)

Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.

Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
   (first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080` (leave running), then `curl localhost:8080/healthz` from another shell → 200.

End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.

Closes: code-do7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:07:05 +00:00
Viktor Barzin
81e7c3d6ee state(dbaas): update encrypted state 2026-04-18 18:59:51 +00:00
Viktor Barzin
bde713f8a4 broker-sync: add Fidelity PlanViewer CronJob (suspended)
## Context

Viktor's UK workplace pension is at Fidelity PlanViewer. The broker-sync
provider + CLI landed in the broker-sync repo (commits 804e6a8 +
7c9be54); this commit adds the infra bits so the monthly sync runs
in-cluster like the other broker-sync jobs.

One successful manual backfill on 2026-04-18 pulled 51 contributions +
valuation into a new WF WORKPLACE_PENSION account; Net Worth moved from
£865k → £1,003k. This commit productionises that flow.

## This change

- New kubernetes_cron_job_v1.fidelity in stacks/broker-sync/main.tf:
  - Schedule: 05:00 UK on the 20th of each month (after mid-month
    payroll settles; finance data shows credits on the 13th-18th).
  - Suspended by default — unsuspend once broker-sync image is rebuilt
    with Chromium baked in (Dockerfile change shipped separately in the
    broker-sync repo).
  - Init container materialises the storage_state JSON (projected from
    the broker-sync-secrets K8s Secret, synced from Vault by ESO) to the
    encrypted PVC at /data/fidelity_storage_state.json. Chromium then
    loads it.
  - Container: broker-sync fidelity-ingest with WF + FIDELITY_* env
    vars. Memory request 512Mi, limit 1280Mi — Chromium is hungry.
  - Lifecycle ignore_changes on dns_config per the KYVERNO_LIFECYCLE_V1
    convention documented in AGENTS.md.
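
Stripped down, the new resource has roughly this shape (a sketch: the storage_state init container, env wiring, and the Kyverno lifecycle block are omitted; the image reference and schedule-vs-UK-time handling are assumptions):

```hcl
resource "kubernetes_cron_job_v1" "fidelity" {
  metadata {
    name      = "broker-sync-fidelity"
    namespace = "broker-sync"
  }
  spec {
    schedule = "0 5 20 * *" # 05:00 on the 20th; UK-time handling assumed elsewhere
    suspend  = true         # flip once the Chromium-enabled image is published
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            container {
              name  = "fidelity-ingest"
              image = "registry.viktorbarzin.me:5050/broker-sync:TAG" # placeholder tag
              args  = ["fidelity-ingest"]
              resources {
                requests = { memory = "512Mi" }
                limits   = { memory = "1280Mi" }
              }
            }
          }
        }
      }
    }
  }
}
```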

## What is NOT in this change

- The Vault keys fidelity_storage_state + fidelity_plan_id — already
  staged via `vault kv patch` on 2026-04-18.
- Dockerfile Chromium install — in broker-sync repo (commit 7c9be54).
- Prometheus BrokerSyncFidelityFailed alert — deferred until the
  CronJob has run successfully for a month and we have a baseline.
  Existing broker-sync CronJobs also don't have per-job alerts yet;
  filing as a follow-up.

## Verification

### Automated
terraform fmt ran clean. `terragrunt plan` would show a single new
kubernetes_cron_job_v1 (suspended, so no pods scheduled).

### Manual (after apply + image rebuild)

1. Build + push broker-sync:<sha> with Chromium.
2. `scripts/tg apply stacks/broker-sync` (updates image_tag + adds
   fidelity CronJob).
3. Unsuspend: `kubectl -n broker-sync patch cronjob broker-sync-fidelity \
     -p '{"spec":{"suspend":false}}'` OR flip the tf flag.
4. Trigger a test run: `kubectl -n broker-sync create job \
     fidelity-test --from=cronjob/broker-sync-fidelity`.
5. Expect logs: `fidelity-ingest: fetched=N new=N imported=N failed=0`.
6. On FidelitySessionError: run `broker-sync fidelity-seed` locally +
   `vault kv patch secret/broker-sync fidelity_storage_state=@...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 18:51:20 +00:00
Viktor Barzin
4f54c959d7 [infra] Remove iscsi-csi stack — TrueNAS decommissioned [ci skip]
## Context

The iSCSI CSI driver was deployed against a TrueNAS appliance at 10.0.10.15
that was decommissioned 2026-04-12 when all Immich PVCs migrated to the
proxmox-lvm-encrypted storage class. The stack has been dead code since —
live survey (2026-04-18):

- iscsi-csi namespace: empty (0 resources), 27h old (since last TF apply)
- No iscsi CSI driver registered in the cluster
- No PVs/PVCs reference iscsi
- TF state held only the empty namespace
- helm_release.democratic_csi was not in state (already gone pre-session)

Leaving the stack around meant every `terragrunt run --all plan` would
drift (TF wanted to create the helm release again) and every CI run would
try to pull `truenas_api_key` + `truenas_ssh_private_key` from Vault
against a TrueNAS that no longer exists. Beads tracking: code-gw0.

## This change

- `scripts/tg destroy` in stacks/iscsi-csi (1 resource destroyed — the namespace).
- `rm -rf stacks/iscsi-csi/` — removes modules/, main.tf, terragrunt.hcl,
  secrets symlink, and the 4 terragrunt-generated files (backend.tf,
  providers.tf, cloudflare_provider.tf, tiers.tf).
- Dropped PG schema `iscsi-csi` on `10.0.20.200:5432/terraform_state`
  (its `states` table held 1 row — the current state — dropped by CASCADE).
- Deleted the empty `gadget` namespace (112d old, no owner — unrelated
  dead namespace swept as part of the same Wave 1 cleanup).

## What is NOT in this change

- Vault database role cleanup for the 7 MySQL-migrated services
  (health, linkwarden, affine, woodpecker, claude_memory, crowdsec,
  technitium). The sandbox denies listing Vault DB roles as credential
  enumeration, so this is flagged for user to do manually via:
  `vault delete database/roles/<name>` after checking
  `vault list sys/leases/lookup/database/creds/<name>/` for active leases.

## Reproduce locally
1. `git pull`
2. `ls stacks/ | grep iscsi` → no output
3. `kubectl get ns iscsi-csi gadget` → both NotFound
4. psql to 10.0.20.200:5432/terraform_state → `\dn` shows no iscsi-csi schema

## Test Plan

### Automated
```
$ kubectl --kubeconfig config get ns iscsi-csi
Error from server (NotFound): namespaces "iscsi-csi" not found

$ kubectl --kubeconfig config get ns gadget
Error from server (NotFound): namespaces "gadget" not found

$ PGPASSWORD=... psql -h 10.0.20.200 -U ... -d terraform_state -c '\dn' | grep iscsi
(no output)

$ ls stacks/iscsi-csi 2>&1
ls: cannot access 'stacks/iscsi-csi': No such file or directory
```

### Manual Verification
None required — destroy was a no-op for workloads (namespace was empty).

Closes: code-b6l
Closes: code-gw0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 18:49:40 +00:00
Viktor Barzin
e1d20457c4 [infra/claude-agent-service] Seed beads metadata + scratch dir at runtime
## Context

Review of the BeadBoard Dispatch wiring found that the claude-agent-service
Dockerfile's `COPY beads/metadata.json /workspace/.beads/metadata.json` and
`COPY agents/beads-task-runner.md /home/agent/.claude/agents/...` both land
on paths that are volume-mounted at runtime:

  - `/workspace` → `claude-agent-workspace-encrypted` PVC (main.tf:394-398)
  - `/home/agent/.claude` → `claude-home` emptyDir (main.tf:424-427)

Kubernetes mounts hide image-layer content at those paths, so the COPYs are
dead. The companion commit in `claude-agent-service` restages both files to
`/usr/share/agent-seed/` (an image-layer path that is never mounted).

Additionally, the beads-task-runner agent rails expect
`/workspace/scratch/<job_id>/` to exist, but nothing was creating it.

## Layout before / after

```
  Before (dead COPYs):

    image layer          runtime (mounted volumes hide the files)
    -----------          -----------------------------------
    /workspace/          <- hidden by PVC mount
      .beads/
        metadata.json    <- UNREACHABLE
    /home/agent/.claude/ <- hidden by emptyDir mount
      agents/
        beads-task-runner.md  <- UNREACHABLE

  After (init container seeds volumes at pod start):

    image layer          runtime
    -----------          ------------------------------------
    /usr/share/agent-seed/
      beads-metadata.json    --+
      beads-task-runner.md    --+-> copied by seed-beads-agent init
                                    container into the mounted volumes
                                    on every pod start:
                                      /workspace/.beads/metadata.json
                                      /workspace/scratch/
                                      /home/agent/.claude/agents/beads-task-runner.md
```

## What

### New init container: `seed-beads-agent`
  - Positioned AFTER `git-init`, BEFORE the main container.
  - Uses the same service image (`${local.image}:${local.image_tag}`) — the
    seed files are baked in at `/usr/share/agent-seed/`.
  - Runs as default uid 1000 (the PVCs are already chowned by `fix-perms`).
  - Shell body:
      mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
      cp /usr/share/agent-seed/beads-metadata.json     /workspace/.beads/metadata.json
      cp /usr/share/agent-seed/beads-task-runner.md    /home/agent/.claude/agents/beads-task-runner.md
  - Mounts: `workspace` at `/workspace`, `claude-home` at `/home/agent/.claude`.
  - Resources: 32Mi requests / 64Mi limits (matches `fix-perms`/`copy-claude-creds`).
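
Put together, the init container is roughly this fragment of the Deployment's pod spec (a sketch; exact formatting in main.tf may differ):

```hcl
init_container {
  name  = "seed-beads-agent"
  image = "${local.image}:${local.image_tag}"

  command = ["/bin/sh", "-c", <<-EOT
    mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
    cp /usr/share/agent-seed/beads-metadata.json  /workspace/.beads/metadata.json
    cp /usr/share/agent-seed/beads-task-runner.md /home/agent/.claude/agents/beads-task-runner.md
  EOT
  ]

  volume_mount {
    name       = "workspace"
    mount_path = "/workspace"
  }
  volume_mount {
    name       = "claude-home"
    mount_path = "/home/agent/.claude"
  }

  resources {
    requests = { memory = "32Mi" }
    limits   = { memory = "64Mi" }
  }
}
```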

### Formatting
  - `terraform fmt -recursive` also normalised whitespace in the token-expiry
    locals block and the CronJob container definition. No semantic change.

## What is NOT in this change

  - No image tag bump. The Dockerfile refactor that produces the
    `/usr/share/agent-seed/` path lands in the claude-agent-service repo
    and will roll in on the next CI build. Until that build ships and the
    tag is bumped in this file, the new init container will `cp` from a
    path that doesn't exist yet — so do NOT apply this commit until the
    corresponding image tag bump is ready. The commit is declarative prep.
  - No changes to storage class, RBAC, Service, or any other init.
  - The main container mounts remain unchanged — only the init containers
    prepare volume contents.

## Test Plan

### Automated

```
$ terraform fmt -check -recursive stacks/claude-agent-service/
(no output — clean)

$ terraform -chdir=stacks/claude-agent-service/ init -backend=false
Terraform has been successfully initialized!

$ terraform -chdir=stacks/claude-agent-service/ validate
Warning: Deprecated Resource (pre-existing; use kubernetes_namespace_v1)
Success! The configuration is valid, but there were some validation warnings
as shown above.
```

### Manual Verification (after image bump + apply)

1. Bump `local.image_tag` in main.tf to the SHA of a build that has
   `/usr/share/agent-seed/*` (verify with `docker inspect $IMAGE | jq ...`
   or `kubectl run tmp --image ... -- ls /usr/share/agent-seed`).
2. `scripts/tg apply stacks/claude-agent-service`
3. `kubectl -n claude-agent get pods -w` — all init containers complete.
4. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- ls -la /workspace/.beads/metadata.json /home/agent/.claude/agents/beads-task-runner.md /workspace/scratch`
   Expected: all three paths exist; first two are regular files with the
   expected content, `scratch` is a directory.
5. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- jq -r .dolt_server_host /workspace/.beads/metadata.json`
   Expected: `dolt.beads-server.svc.cluster.local`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:23:19 +00:00
Viktor Barzin
c9d221d578 [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any other expression
(`lifecycle` blocks are evaluated before the rest of the expression graph is
built). So a DRY module cannot exist. The canonical pattern IS the repeated
snippet.

What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
  Attached inline on every `spec[0].template[0].spec[0].dns_config` line
  (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
  existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
  `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
  the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
  — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
  save for the inline comment.
- The fallback module the original plan anticipated — skipped because
  Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
  (claude-agent-service, freedify/factory, hermes-agent). Reverted to
  keep this commit scoped to the convention rollout.

## Before / after

Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27

$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21

# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.

## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
   suppression.

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
Viktor Barzin
a62b43d19e [infra] Document intended ignore_changes drift-workarounds [ci skip]
## Context

The infra repo has 31 `ignore_changes` blocks. Phase 1 of the state-drift
consolidation audit classified 21 as legitimate (immutable fields, cloud-computed
values) and 10 as intentional workarounds for known drift sources. Those 10
workaround blocks were indistinguishable from accidental/forgotten drift
suppression without reading the surrounding context.

This commit adds a uniform `# DRIFT_WORKAROUND: <reason>, reviewed 2026-04-18`
marker above the 8 intended-workaround blocks (6 CI image-tag decoupling + 2
non-deterministic secret hashes) so they are easy to distinguish from
accidental drift suppression during future audits.
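
For illustration, a CI image-tag decoupling site ends up shaped like this (a representative sketch, not copied from any one stack):

```hcl
# DRIFT_WORKAROUND: image tag is bumped by CI, not Terraform, reviewed 2026-04-18
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].container[0].image]
}
```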

## What is NOT in this change

- Functional behavior — `ignore_changes` lists are byte-identical.
- The Kyverno `dns_config` ignore paths (covered by Wave 3 shared module).
- Workarounds being removed — the CI decoupling is intentional by user decision.

## Files touched

CI image-tag decoupling (6):
- stacks/k8s-portal/modules/k8s-portal/main.tf (also has dns_config for Kyverno)
- stacks/novelapp/main.tf
- stacks/claude-memory/main.tf
- stacks/plotting-book/main.tf
- stacks/trading-bot/main.tf (api deployment)
- stacks/trading-bot/main.tf (workers deployment — 6 containers)

Non-deterministic secret hashes (2):
- stacks/owntracks/main.tf (htpasswd bcrypt)
- stacks/mailserver/modules/mailserver/main.tf (postfix-accounts.cf)

## Test Plan

### Automated
```
$ rg DRIFT_WORKAROUND stacks/ | wc -l
8

$ terraform fmt -recursive stacks/k8s-portal stacks/novelapp stacks/claude-memory \
    stacks/plotting-book stacks/trading-bot stacks/owntracks stacks/mailserver
(no output — already formatted)

$ git diff --stat
 stacks/claude-memory/main.tf                 | 1 +
 stacks/k8s-portal/modules/k8s-portal/main.tf | 1 +
 stacks/mailserver/modules/mailserver/main.tf | 3 ++-
 stacks/novelapp/main.tf                      | 1 +
 stacks/owntracks/main.tf                     | 1 +
 stacks/plotting-book/main.tf                 | 1 +
 stacks/trading-bot/main.tf                   | 2 ++
 7 files changed, 9 insertions(+), 1 deletion(-)
```

### Manual Verification
No apply required — HCL comments only, zero effect on plan output.

## Reproduce locally
1. `cd infra && git pull`
2. `rg "DRIFT_WORKAROUND.*reviewed 2026-04-18" stacks/ | wc -l` → expect 8
3. `terraform fmt -check -recursive stacks/` → expect clean exit

Closes: code-yrg

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:08:10 +00:00
Viktor Barzin
91165e31b9 [infra/beads-server] Wire BeadBoard to claude-agent-service
## Context

BeadBoard is the Next.js task visualization dashboard shipped in this
stack. We want users to trigger headless Claude agent runs directly from
a beads task row — "one-click dispatch" — instead of copy-pasting `bd`
IDs into a terminal. The agent runs in-cluster as claude-agent-service
(see stacks/claude-agent-service/), protected by a bearer token in
Vault at secret/claude-agent-service/api_bearer_token.

For BeadBoard to POST to /execute we need the service URL and the
bearer token available inside the pod as env vars. The URL is static
(cluster DNS); the token must come through External Secrets Operator
so rotation in Vault propagates without re-applying Terraform.

Secondary cleanup: the container was still pinned to :latest which
violates the 8-char-SHA convention and causes stale pulls through the
registry cache (see .claude/CLAUDE.md, Docker images). The image tag
is now variable-driven; the GHA pipeline will override the default
once it publishes the first SHA.

## This change

- Adds an ExternalSecret `beadboard-agent-service` in the
  `beads-server` namespace, mirroring the pattern in
  stacks/claude-agent-service/main.tf (same Vault path
  `secret/claude-agent-service`, same `vault-kv` ClusterSecretStore,
  same 15m refresh). Exposes exactly one key: `api_bearer_token`.

- Adds two env vars to the `beadboard` container:
  - `CLAUDE_AGENT_SERVICE_URL` — static cluster URL
    (`http://claude-agent-service.claude-agent.svc.cluster.local:8080`)
  - `CLAUDE_AGENT_BEARER_TOKEN` — `secret_key_ref` pointing at the
    ESO-managed Secret, key `api_bearer_token`

- Adds `reloader.stakater.com/auto = "true"` on the Deployment's
  top-level metadata — matches the convention used by rybbit,
  claude-memory, onlyoffice. When ESO refreshes the K8s Secret
  because Vault rotated the token, Reloader restarts the pod so the
  new token is picked up (env vars are read once at boot).

- Adds `variable "beadboard_image_tag"` (default `"latest"`, with a
  one-line comment flagging the temporary default). The image
  reference now interpolates `${var.beadboard_image_tag}`. No tfvars
  file is touched — orchestrator will flip the default to the first
  real 8-char SHA once GHA publishes it.
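
Taken together, the container wiring above looks roughly like this inside the beadboard Deployment (a sketch; attribute layout assumed rather than copied from main.tf):

```hcl
container {
  name  = "beadboard"
  image = "registry.viktorbarzin.me:5050/beadboard:${var.beadboard_image_tag}"

  env {
    name  = "CLAUDE_AGENT_SERVICE_URL"
    value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
  }
  env {
    name = "CLAUDE_AGENT_BEARER_TOKEN"
    value_from {
      secret_key_ref {
        name = "beadboard-agent-service" # ESO-managed Secret
        key  = "api_bearer_token"
      }
    }
  }
}
```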

## What is NOT in this change

- No GHA workflow additions. The pipeline that builds
  `registry.viktorbarzin.me:5050/beadboard` lives in the BeadBoard
  repo and is out of scope here.
- No Vault-side changes. `secret/claude-agent-service/api_bearer_token`
  already exists (it powers the claude-agent-service deployment
  itself).
- No Terraform `apply`. Orchestrator applies.

## Data flow

  Vault (secret/claude-agent-service)
    │  refresh every 15m
    ▼
  ESO → K8s Secret `beadboard-agent-service` (beads-server ns)
    │  envFrom.secretKeyRef
    ▼
  BeadBoard pod (CLAUDE_AGENT_BEARER_TOKEN env)
    │  Authorization: Bearer <token>
    ▼
  claude-agent-service.claude-agent.svc:8080 /execute

On Vault rotation: ESO picks up new value at next refresh → K8s
Secret data changes → Reloader sees annotation + referenced Secret
changed → rolling-recreates the beadboard pod with the new token.

## Test Plan

### Automated
- `terraform fmt -recursive stacks/beads-server/` — clean (formatted
  the file once; subsequent run is a no-op).
- `terraform -chdir=stacks/beads-server validate` (after
  `terraform init -backend=false`) — `Success! The configuration is
  valid`. The 14 "Deprecated Resource" warnings are pre-existing
  (`kubernetes_namespace` vs `_v1` etc.) and unrelated to this
  change.

### Manual Verification
1. Orchestrator applies:
   `scripts/tg -chdir=stacks/beads-server apply`
2. Verify the ExternalSecret synced:
   `kubectl -n beads-server get externalsecret beadboard-agent-service`
   Expected: `Ready=True`, `SyncedAt` recent.
3. Verify the K8s Secret exists with one key:
   `kubectl -n beads-server get secret beadboard-agent-service -o jsonpath='{.data.api_bearer_token}' | base64 -d | head -c 8`
   Expected: first 8 chars of the bearer token.
4. Verify the deployment picked up the env vars:
   `kubectl -n beads-server get deploy beadboard -o yaml | grep -A2 CLAUDE_AGENT`
   Expected: both env entries present, bearer via `secretKeyRef`.
5. Verify the reloader annotation is on the Deployment metadata:
   `kubectl -n beads-server get deploy beadboard -o jsonpath='{.metadata.annotations.reloader\.stakater\.com/auto}'`
   Expected: `true`.
6. Verify the image tag resolved to the variable default (for now):
   `kubectl -n beads-server get deploy beadboard -o jsonpath='{.spec.template.spec.containers[0].image}'`
   Expected: `registry.viktorbarzin.me:5050/beadboard:latest`
   (will become `...:<sha>` once `beadboard_image_tag` default is
   updated).
7. Smoke-test the env var inside the pod:
   `kubectl -n beads-server exec deploy/beadboard -- sh -c 'printenv CLAUDE_AGENT_SERVICE_URL; printenv CLAUDE_AGENT_BEARER_TOKEN | head -c 8'`
   Expected: URL printed, first 8 chars of token printed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:53:51 +00:00
Viktor Barzin
82b7866bc9 [claude-agent-service] Remove orphaned DevVM SSH key wiring
## Context

The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.

This commit removes two cleanup artifacts left behind by that migration.

## This change

1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived
   skill doc for the obsolete SSH-based pattern. Already in `archived/`,
   harmless but noise; deleting prevents anyone copy-pasting the old approach.

2. Removes `kubernetes_secret.ssh_key` from
   `stacks/claude-agent-service/main.tf`. The Secret was created from the
   `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
   into the agent pod. The pod's `git-init` init container uses HTTPS +
   `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
   and `https://github.com/` URL via `git config url.insteadOf`, so no
   downstream `git` invocation could fall through to SSH even if it tried.

3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
   the SSH key resource was its only consumer.

## What is NOT in this change

- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
  Removing it requires read/modify/put of the full secret and the upside
  is one unused Vault key. Not worth it without strong justification.
- DevVM host decommission is out of scope (separate audit needed for
  non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
  left untouched per no-adjacent-refactor rule.

## Test plan

### Automated

- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
  pre-existing lines 464-505 are flagged; no new fmt warnings introduced
  by these deletions.

### Manual verification

1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
   The `ci_secrets` data source removal is plan-time only; does not appear
   in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via HTTP API to confirm pipeline still works:
   curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute
   with a minimal prompt; expect job completes with `exit_code=0`.

Closes: code-bck

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:31:15 +00:00
Viktor Barzin
9a2e920006 [rybbit] Narrow CF Worker routes to SITE_IDS hosts — fix free-tier quota breach
## Context

The `rybbit-analytics` Cloudflare Worker hit the free-tier quota of 100k
requests/day. CF GraphQL analytics showed **97,153 invocations in the last
24h**, up from ~0 before 2026-04-17 21:26 UTC when Rybbit script injection
migrated off the broken Traefik rewrite-body plugin (Yaegi ResponseWriter
bug on Traefik v3.6.12) onto this Worker.

Root cause: `wrangler.toml` registered two wildcard routes
(`viktorbarzin.me/*` + `*.viktorbarzin.me/*`) which match every Cloudflare-
proxied request on the zone. Only 27 of ~119 proxied hostnames appear in
`SITE_IDS` in `index.js`; the rest burn Worker invocations for nothing since
`siteId` is `null` and the Worker no-ops. Worse, the wildcard caught
`rybbit.viktorbarzin.me` itself — every tracker `script.js` fetch and event
POST round-trip was spawning its own Worker invocation (self-amplification).

CF GraphQL per-host breakdown (last 24h, zone `viktorbarzin.me`):
- Top waste (NOT in SITE_IDS): tuya-bridge 96.6k, beadboard 55.8k,
  terminal 30.2k, authentik 19.9k, claude-memory 12.6k
- Sum of 27 SITE_IDS hosts: 47.2k
- `rybbit.viktorbarzin.me` self-amplifier: 782
- Projected post-narrow: 46.4k/day (52% reduction, well under quota)

## This change

Replaces the two wildcards with an explicit list of the **26** hostnames
present in `SITE_IDS`. `rybbit.viktorbarzin.me` is deliberately excluded
even though it has a site ID — it serves `/api/script.js` (JS) and
`/api/track` (JSON), both of which fail the Worker's `text/html`
content-type guard anyway. Leaving it routed just burned invocations.

    BEFORE                              AFTER
    ──────────────────────────          ──────────────────────────────────
    viktorbarzin.me/*          ┐        viktorbarzin.me/*          ┐
    *.viktorbarzin.me/*        ┘        www.viktorbarzin.me/*      │
                                        actualbudget.vb.me/*       │
    → matches ~119 hosts                ... (26 total)             │ → matches
    → ~97k Worker inv/day                stirling-pdf.vb.me/*      │   only 26
    → rybbit → self-amplifies            vaultwarden.vb.me/*       ┘   specific
                                                                        hosts
                                        rybbit.vb.me INTENTIONALLY
                                        EXCLUDED (self-amplifier)

Deployment is unchanged — this Worker is not in Terraform. Deploy from
`stacks/rybbit/worker/` via:

    CLOUDFLARE_EMAIL=vbarzin@gmail.com \
    CLOUDFLARE_API_KEY=$(vault kv get -field=cloudflare_api_key secret/platform) \
    npx --yes wrangler@latest deploy

`wrangler deploy` replaces all worker routes on the zone with the list from
`wrangler.toml`, so there is no cleanup step. Already deployed today as
version `d7f83980-a499-40f5-ba55-f8e18d531863` — this commit just captures
the source of truth in git.

## What is NOT in this change

- Self-hosted injection (nginx `sub_filter` sidecar, compiled Traefik
  plugin). Deferred — revisit only if analytics traffic grows past 80k/day
  again, or if we add more high-traffic hosts to `SITE_IDS`.
- Cloudflare Workers Paid plan ($5/mo for 10M requests). User declined.
- Moving the Worker into Terraform. Out of scope.
- Any Rybbit backend/frontend changes. Rybbit itself continues running.

## Test plan

### Automated

Post-deploy CF API enumeration of zone routes:

    $ curl -s -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
        "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/workers/routes" \
      | jq -r '.result[] | "\(.pattern)\t→ \(.script)"' | wc -l
    26

    # Wildcards gone:
    $ curl -s ... | jq -r '.result[].pattern' | grep -c '\*\.'
    0

### Manual Verification

Script injection behaviour, verified via `curl`:

1. SITE_IDS host — script IS injected:

       $ curl -s -L https://viktorbarzin.me/ | grep -oE '<script[^>]*rybbit[^>]*>'
       <script src="https://rybbit.viktorbarzin.me/api/script.js"
         data-site-id="da853a2438d0" defer>

       $ curl -s -L https://calibre.viktorbarzin.me/ | grep -oE '<script[^>]*rybbit[^>]*>'
       <script src="https://rybbit.viktorbarzin.me/api/script.js"
         data-site-id="ce5f8aed6bbb" defer>

2. Non-SITE_IDS host — script NOT injected:

       $ curl -s -L https://tuya-bridge.viktorbarzin.me/ | grep -c 'data-site-id'
       0

3. `rybbit.viktorbarzin.me` bypasses Worker entirely — tracker returns raw JS:

       $ curl -sI https://rybbit.viktorbarzin.me/api/script.js | grep -i content-type
       content-type: application/javascript; charset=utf-8

### Reproduce locally

    # 1. Confirm the Worker sees only the 26 narrowed routes.
    CF_EMAIL=vbarzin@gmail.com
    CF_KEY=$(vault kv get -field=cloudflare_api_key secret/platform)
    ZONE_ID=fd2c5dd4efe8fe38958944e74d0ced6d
    curl -s -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
      "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/workers/routes" \
      | jq -r '.result[] | .pattern' | sort

    # 2. 24h after deploy, re-check invocation count — expect < 80k.
    curl -s https://api.cloudflare.com/client/v4/graphql \
      -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
      -H "Content-Type: application/json" \
      -d '{"query":"query($acc:String!,$since:Time!,$until:Time!){viewer{accounts(filter:{accountTag:$acc}){workersInvocationsAdaptive(limit:100,filter:{datetime_geq:$since,datetime_leq:$until}){sum{requests} dimensions{scriptName date}}}}}",
           "variables":{"acc":"02e035473cfc4834fb10c5d35470d8b4",
                        "since":"'"$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)"'",
                        "until":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}'

Follow-up monitoring tracked in code-dka (P3, 3-day check).

Closes: code-l9b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:15 +00:00
Viktor Barzin
a24cf8c689 [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.

Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.

Updates the post-mortem P1 row to capture this for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:14 +00:00
Viktor Barzin
9ea7eec362 [actualbudget] Upgrade 26.3.0 → 26.4.0 for native Sankey report
## Context

Actual Budget v26.4.0 (released 2026-04-05) re-introduces the Sankey
chart report for income/expense flow visualization (PR #7220). An earlier
experimental implementation was deleted in March 2024 (PR #2417) but a
proper reimplementation with "Other" grouping, date-range selection, and
percentage toggle is now shipped behind the experimental feature flag.

Viktor wanted Sankey visualization of budget cash flow; this is the lowest-
cost path since his existing Actual Budget deployment already holds all the
transaction data.

## This change

Bumps the `tag` input on all three factory module calls (viktor, anca, emo)
from `26.3.0` to `26.4.0`. No breaking changes, schema migrations, or config
changes per the 26.4.0 release notes.

## Rollout

Applied via `scripts/tg apply --non-interactive`. All three pods rolled
successfully to `actualbudget/actual-server:26.4.0` and passed readiness
probes. The http-api sidecars (`jhonderson/actual-http-api`) were untouched.

## Post-upgrade

Users need to toggle Settings → Experimental features → Sankey report to
access the chart, then Reports → new Sankey widget.

Closes: code-oof

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:19:27 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
b41528e564 [docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)
## Context

On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.

## Root cause

The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.

## What's captured

- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
  red-herring suspicion of the new Rybbit Worker, cached sessions masking
  the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
  on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
  per-user symptoms fast; UI-managed Authentik config is invisible to git)

## Follow-ups not in this commit

- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
  (see discussion in beads code-zru)

Closes: code-zru

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00
Viktor Barzin
6e19dce99e [docs] automated-upgrades: document long-lived OAuth + expiry monitoring
Adds the `claude_oauth_token` Vault entries to the secrets table, a
new "OAuth token lifecycle" section explaining the two CLI auth modes
(`claude login` vs `claude setup-token`) and why we picked the latter
for headless use, the Ink 300-col PTY gotcha from today's harvest,
and the monitoring/rotation playbook for the new expiry alerts.

Follow-up to 8a054752 and 50dea8f0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:00:07 +00:00
Viktor Barzin
e4a96591b3 .gitignore: ignore Terragrunt-generated cloudflare_provider.tf and tiers.tf
These files are regenerated by Terragrunt on every run and have a
"# Generated by Terragrunt. Sig: ..." header. Earlier today multiple parallel
agents working on bd-w97 accidentally staged them, requiring two corrective
commits (3e11bd1b, 4eb68d6b). Preventing the recurrence at the source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:36:45 +00:00
Viktor Barzin
4eb68d6b1a [meshcentral] Remove accidentally-committed Terragrunt-generated files
My previous commit (c0ac24a5, [meshcentral] Import existing cluster
state + PVC) unintentionally committed two Terragrunt-generated
provider/locals files. These are auto-generated on every plan/apply
(marked 'Generated by Terragrunt. Sig:') and do not belong in the
repo. Mirrors 3e11bd1b which did the same cleanup for kyverno.

Removes from tracking only — files remain on disk so concurrent work
is unaffected.

Updates: code-w97
2026-04-18 12:35:44 +00:00
Viktor Barzin
c0ac24a54c [meshcentral] Import existing cluster state + PVC (bd-w97)
Imported the two proxmox-lvm-encrypted PVCs into the Tier 1 PG state.
All other declared resources (namespace, deployment, service, ingress,
NFS-backed PV/PVC, tls secret) were already state-managed.

Imported:
- kubernetes_persistent_volume_claim.data_encrypted
    (meshcentral/meshcentral-data-encrypted, proxmox-lvm-encrypted, 1Gi)
- kubernetes_persistent_volume_claim.files_encrypted
    (meshcentral/meshcentral-files-encrypted, proxmox-lvm-encrypted, 1Gi)

Pre-import plan: 2 to add, 3 to change, 0 to destroy
Post-import plan: 0 to add, 5 to change, 0 to destroy (benign drift)
Apply: 0 added, 5 changed, 0 destroyed

Benign drift reconciled on apply:
- PVC wait_until_bound attribute aligned (true -> false)
- tls-secret Kyverno sync labels cleared
- deployment/namespace annotation drift

Source reconciliation: none required. Both declared PVCs already match
the cluster (proxmox-lvm-encrypted, 1Gi, RWO, names identical). NFS
PV/PVC meshcentral-backups-host (nfs-truenas, 10Gi, RWX) remained
bound throughout. Deployment kept 1/1 replicas on the same pod
(meshcentral-6c4f47c6f8-mj8sk).

Commits the auto-generated cloudflare_provider.tf and tiers.tf so the
stack matches the repo convention used by its peers.

Updates: code-w97
2026-04-18 12:35:26 +00:00
Viktor Barzin
3e11bd1b67 [kyverno] Remove accidentally-committed Terragrunt-generated files
My previous commit (dacf3d9e, [kyverno] Import existing cluster state)
unintentionally picked up two Terragrunt-generated provider/locals
files from the meshcentral stack that a parallel worker had just
created. These are auto-generated on every plan/apply (marked
"Generated by Terragrunt. Sig:") and do not belong in the repo.

Removes from tracking only — files remain on disk so concurrent work
is unaffected.

Files removed:
- stacks/meshcentral/cloudflare_provider.tf
- stacks/meshcentral/tiers.tf

No impact on the kyverno import work. State-level changes from
dacf3d9e (3 imports + 3 in-place updates) stand.

Updates: code-w97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:34:59 +00:00
Viktor Barzin
2fe3bb3307 [travel_blog] Import existing cluster state (bd-w97)
All resources were already present in the Tier 1 PG state — no imports
required. The travel_blog stack has no PVC (content baked into the
Docker image, deployed via Woodpecker with 1.4GB context).

Pre-apply plan: 0 to add, 4 to change, 0 to destroy
Apply: 0 added, 4 changed, 0 destroyed
Post-apply plan: 0 to add, 3 to change, 0 to destroy (persistent benign drift)

Benign drift reconciled on apply:
- Deployment dns_config (Kyverno-injected ndots:2) removed
- Namespace goldilocks vpa-update-mode=off label removed
- Ingress external-monitor=false annotation removed (now auto-managed
  by ingress_factory dns_type)
- TLS secret Kyverno sync labels removed

Post-apply drift (persists via external controllers, out of scope):
- Kyverno re-injects ndots:2 dns_config and sync-tls-secret labels
- Goldilocks re-adds vpa-update-mode label
  (tracked separately — future work to add lifecycle ignore_changes)

Image tag viktorbarzin/travel_blog:latest unchanged — TF matches cluster.
Deployment remains at replicas=0 (intentional, per source comment:
"Scaled down — clears ExternalAccessDivergence alert"). Site is
intentionally offline.

Updates: code-w97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:34:36 +00:00
Viktor Barzin
dacf3d9e11 [kyverno] Import existing cluster state (bd-w97)
Imported 3 missing cluster resources into the Tier 1 PG state for the
kyverno stack. The Helm release, 6 PriorityClasses, 14 ClusterPolicies,
both Secrets (registry-credentials, tls-secret), and all prior RBAC
resources were already managed in state. The strip-cpu-limits
ClusterPolicy (commit 1de2ee30, 56m prior to this import) was already
in state from its targeted apply.

Resources imported:
- module.kyverno.kubernetes_cluster_role_v1.kyverno_cleanup_pods
  (kyverno:cleanup-controller:pods — RBAC for ClusterCleanupPolicy)
- module.kyverno.kubernetes_cluster_role_binding_v1.kyverno_cleanup_pods
  (kyverno:cleanup-controller:pods — binding to cleanup-controller SA)
- module.kyverno.kubernetes_manifest.cleanup_failed_pods
  (apiVersion=kyverno.io/v2,kind=ClusterCleanupPolicy,name=cleanup-failed-pods)

All three originated from commit cf578516 (auto-cleanup failed/evicted
pods), which added the declarations but apparently never made it into
PG state before the global state reorg.

Pre-import plan:  3 to add,  2 to change, 0 to destroy
Post-import plan: 0 to add,  3 to change, 0 to destroy (benign)
Apply:            0 added,   3 changed,   0 destroyed

Benign drift reconciled on apply:
- cleanup_failed_pods manifest field populated in state post-import
  (annotations re-applied, no spec change)
- registry_credentials + tls_secret: null `generate.kyverno.io/clone-source`
  label dropped from Terraform metadata (no K8s object change — the label
  was only `null` in state, never existed on the live Secret)

Safety checks — all clean:
- ClusterPolicy count: 16 (unchanged, 14 owned here + 1 external
  goldilocks-vpa-auto-mode + strip-cpu-limits); all status=Ready=True
- ClusterCleanupPolicy cleanup-failed-pods: intact, schedule 15 * * * *
- helm_release.kyverno: no diff (revision unchanged)
- Mutating/validating webhook configurations: 3 + 7 intact
- All 4 Kyverno Deployments Running (admission x2, background, cleanup, reports)

Kyverno failurePolicy stays Ignore (forceFailurePolicyIgnore=true) so
admission degrades open if ever unavailable.

Updates: code-w97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:34:32 +00:00
Viktor Barzin
9ea4ccf17e [pvc-autoresizer] Import existing cluster state (bd-w97)
Imported both resources for the pvc-autoresizer stack into the Tier 1 PG
state. The stack was previously unmanaged — cluster had the running
controller from a prior manual helm install (rev 1, 2026-04-03).

Resources imported:
- module.pvc_autoresizer.kubernetes_namespace.pvc_autoresizer (pvc-autoresizer)
- module.pvc_autoresizer.helm_release.pvc_autoresizer (pvc-autoresizer/pvc-autoresizer)

Pre-import plan:  2 to add, 0 to change, 0 to destroy
Post-import plan: 0 to add, 2 to change, 0 to destroy (benign drift)
Apply:            0 added, 2 changed, 0 destroyed

Benign drift reconciled on apply:
- Namespace goldilocks.fairwinds.com/vpa-update-mode=off label removed
  (Kyverno ClusterPolicy goldilocks-vpa-auto-mode re-adds it immediately)
- Helm release metadata refresh only (atomic read-back, revision 1 -> 2;
  chart pvc-autoresizer-0.17.0 and app 0.20.0 unchanged — no upgrade)

Controller pods pvc-autoresizer-controller-7dcc745f68-57bk6 and -n4bh9
stayed Running throughout (restart counts unchanged: 17 and 1, both
pre-existing from pre-apply state). No PVCs entered non-Bound state.

Updates: code-w97
2026-04-18 12:33:37 +00:00
Viktor Barzin
7b88479278 [tor-proxy] Import existing cluster state (bd-w97)
Imported all 9 cluster resources into the Tier 1 PG state. Stack was
previously unmanaged — source was fully declared in main.tf but state
was empty.

Pre-import plan: 9 to add, 0 to change, 0 to destroy
Post-import plan: 0 to add, 9 to change, 0 to destroy
Apply: 0 added, 9 changed, 0 destroyed

Resources imported:
- kubernetes_namespace.tor-proxy
- kubernetes_deployment.tor-proxy
- kubernetes_deployment.torrserver
- kubernetes_service.tor-proxy
- kubernetes_service.torrserver
- kubernetes_service.torrserver-bt (LoadBalancer, IP 10.0.20.200)
- kubernetes_persistent_volume_claim.torrserver_data_proxmox
- module.tls_secret.kubernetes_secret.tls_secret
- module.torrserver_ingress.kubernetes_ingress_v1.proxied-ingress

Service pods tor-proxy-7fb4644dd8-npdwg and torrserver-7788ff4c4d-jnh85
stayed Running throughout. Tor circuit preserved — no deployment restarts.

Updates: code-w97
2026-04-18 12:33:26 +00:00
Viktor Barzin
8a42a1708d [isponsorblocktv] Import existing cluster state (bd-w97)
Imported kubernetes_persistent_volume_claim.data_proxmox into the
Tier 1 PG state. Namespace and deployment were already managed.

Pre-import plan: 1 to add, 2 to change, 0 to destroy
Post-import plan: 0 to add, 3 to change, 0 to destroy (benign drift)
Apply: 0 added, 3 changed, 0 destroyed

Benign drift reconciled on apply:
- Deployment dns_config (Kyverno-injected ndots:2) removed
- Namespace goldilocks vpa-update-mode label removed
- PVC wait_until_bound aligned (true -> false)

Service pod isponsorblocktv-vermont-55bdb8998-889hn stayed Running
on the same PVC throughout.

Updates: code-w97
2026-04-18 12:31:46 +00:00
Viktor Barzin
50dea8f0a7 [monitoring] Add Claude OAuth token expiry monitoring + alerts
## Context

The new CLAUDE_CODE_OAUTH_TOKEN mechanism (commit 8a054752) uses
long-lived 1-year tokens minted via `claude setup-token`. Tokens don't
auto-refresh — at the 1-year mark they expire hard and the upgrade
agent stops working. We need to be told 30 days ahead, not find out
when DIUN fires and gets 401 again.

A cron rotator doesn't make sense here (tokens don't refresh, they
just expire), so we alert instead. Two spares at
`secret/claude-agent-service-spare-{1,2}` provide failover runway — the
monitor covers all three.

## This change

**CronJob** (`claude-agent` ns, every 6h): reads a ConfigMap
containing `<path> → expiry_unix_timestamp` entries, pushes
`claude_oauth_token_expiry_timestamp{path="..."}` and
`claude_oauth_expiry_monitor_last_push_timestamp` to Pushgateway at
`prometheus-prometheus-pushgateway.monitoring:9091`.

**ConfigMap** generated from a Terraform local `claude_oauth_token_mint_epochs`
— source of truth for mint times. On rotation, update the map + apply.
TTL is a shared local (365d).
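
A rough sketch of how that local and the generated ConfigMap could be wired
(the ConfigMap name, epoch values, and exact resource shape are illustrative;
only the local's name and the shared 365d TTL come from this change):

```
locals {
  # Mint times (unix epoch) per token path; the values below are placeholders.
  claude_oauth_token_mint_epochs = {
    "primary" = 1776508800
    "spare-1" = 1776508800
    "spare-2" = 1776508800
  }
  claude_oauth_token_ttl = 365 * 24 * 60 * 60 # shared 365d TTL
}

# Rendered map of <path> -> expiry_unix_timestamp, read by the 6h CronJob.
resource "kubernetes_config_map" "claude_oauth_expiry" {
  metadata {
    name      = "claude-oauth-expiry" # illustrative name
    namespace = "claude-agent"
  }
  data = {
    for path, minted in local.claude_oauth_token_mint_epochs :
    path => tostring(minted + local.claude_oauth_token_ttl)
  }
}
```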

**PrometheusRules** (in prometheus_chart_values.tpl):
- `ClaudeOAuthTokenExpiringSoon`  — <30d, warning, for 1h
- `ClaudeOAuthTokenCritical`      — <7d,  critical, for 10m
- `ClaudeOAuthTokenMonitorStale`  — last push >48h, warning
- `ClaudeOAuthTokenMonitorNeverRun` — metric absent for 2h, warning

Alert labels include `{{ $labels.path }}` so we know which token is
expiring (primary / spare-1 / spare-2).

## Verification

```
$ kubectl -n claude-agent create job --from=cronjob/claude-oauth-expiry-monitor manual
$ curl pushgateway/metrics | grep claude_oauth_token_expiry
claude_oauth_token_expiry_timestamp{...,path="primary"} 1.808064429e+09
claude_oauth_token_expiry_timestamp{...,path="spare-1"} 1.80806428e+09
claude_oauth_token_expiry_timestamp{...,path="spare-2"} 1.808064429e+09

$ query: (claude_oauth_token_expiry_timestamp - time()) / 86400
  primary: 365.2 days
  spare-1: 365.2 days
  spare-2: 365.2 days
```

## Rotation playbook (future)

1. `kubectl run -it --rm --image=registry.viktorbarzin.me/claude-agent-service:latest tokmint -- claude setup-token`
   (or harvest via `harvest3.py` pattern in memory for headless flow)
2. `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`
3. Update `claude_oauth_token_mint_epochs["primary"]` in
   `stacks/claude-agent-service/main.tf` with new unix timestamp
4. `scripts/tg apply` claude-agent-service + monitoring
5. Alert clears within 6h (next cron tick) plus the 1h
   `for:` duration of `ClaudeOAuthTokenExpiringSoon`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:27:11 +00:00
Viktor Barzin
8a05475218 [claude-agent-service] Add CLAUDE_CODE_OAUTH_TOKEN env var — 1-year long-lived auth
## Context

Earlier today we hit a silent auth failure on the upgrade agent: the
short-lived `sk-ant-oat01-*` access token in `.credentials.json` had
expired and the CLI's refresh path failed (refresh token either stale
or invalidated after the creds sat in Vault for 5 days).

The real fix isn't "refresh more often" — it's switching to the
long-lived auth mechanism `claude setup-token` provides. Unlike
`claude login` (OAuth flow → 6–8h access token + refresh token JSON),
`setup-token` mints a single opaque token valid for **1 year** that
the CLI consumes via `CLAUDE_CODE_OAUTH_TOKEN` env var. No refresh
dance, no JSON file, no rotation for a year.

## This change

Adds `CLAUDE_CODE_OAUTH_TOKEN` to the existing
`claude-agent-secrets` ExternalSecret, sourced from a new
`claude_oauth_token` field at `secret/claude-agent-service`. The
container already pulls that secret via `envFrom`, so no other wiring
needed.
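
The claude-agent-service stack's source isn't shown here, but assuming the
ExternalSecret is expressed as a `kubernetes_manifest` like in other stacks,
the new entry would sit roughly like this (store kind and Vault key format
are assumptions):

```
resource "kubernetes_manifest" "claude_agent_secrets" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata   = { name = "claude-agent-secrets", namespace = "claude-agent" }
    spec = {
      secretStoreRef = { name = "vault-kv", kind = "ClusterSecretStore" } # kind assumed
      target         = { name = "claude-agent-secrets" }
      data = [
        # ...existing entries...
        {
          secretKey = "CLAUDE_CODE_OAUTH_TOKEN"
          remoteRef = {
            key      = "claude-agent-service" # exact key depends on the store's mount config
            property = "claude_oauth_token"
          }
        },
      ]
    }
  }
}
```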

The Claude CLI prefers `CLAUDE_CODE_OAUTH_TOKEN` over the OAuth JSON
file when both are present, so this is additive — `.credentials.json`
stays mounted as a fallback while we validate the long-lived path.
Future cleanup can remove the JSON mount entirely.

Verified E2E: synthetic DIUN webhook for `docker.io/library/httpd`
→ n8n → claude-agent-service /execute → agent job `fea5ff70dcfe`
completed in 30s with exit_code=0; the agent correctly identified that no
stack matched and aborted without changes. No API auth errors.

## Spares

Harvested two additional long-lived tokens and stored them at
`secret/claude-agent-service-spare-{1,2}` for failover if the
primary is compromised or revoked. Verified both coexist with the
primary (no revocation on mint).

## What is NOT in this change

- No removal of `.credentials.json` mount or its Vault source (keep
  as fallback until we've run for 24h on env-var auth with no issues).
- No cron rotator — 1-year TTL means this can be a yearly manual
  rotation, alerted on from Vault metadata. If we add rotation, we'll
  source from the spares pool rather than minting new tokens.

## Reproduce locally

```
1. vault login -method=oidc
2. vault kv get -field=claude_oauth_token secret/claude-agent-service | head -c 25
3. cd stacks/claude-agent-service && ../../scripts/tg apply
4. kubectl -n claude-agent exec deploy/claude-agent-service -- \
     printenv CLAUDE_CODE_OAUTH_TOKEN   # should be 108 chars
5. Fire synthetic DIUN webhook (see docs/architecture/automated-upgrades.md)
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:12:30 +00:00
Viktor Barzin
50e8184d99 [uptime-kuma] Codify MySQL monitor (id=663) via idempotent sync CronJob
## Context

Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via
the `uptime-kuma-api` Python library when the dbaas stack migrated from
InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in
Uptime Kuma's MariaDB — if UK's DB were wiped or restored from an older
backup, the monitor would be lost.

## This change

Adds declarative, self-healing management for internal-service monitors
(databases, non-HTTP endpoints) that can't be discovered from ingress
annotations. Modelled on the existing `external-monitor-sync` CronJob.

- `local.internal_monitors` — list of desired monitors (name, type,
  connection string, Vault password key, interval, retries). Seeded with
  the MySQL Standalone monitor. Add new entries here to manage more; a
  sketch follows this list.
- `kubernetes_secret.internal_monitor_sync` — pulls admin password and all
  referenced DB passwords from Vault `secret/viktor` at apply time. Secret
  key names are derived from monitor name (`DB_PASSWORD_<upper_snake>`).
- `kubernetes_config_map_v1.internal_monitor_targets` — renders the target
  list to JSON for the sync container.
- `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min,
  looks up monitors by name, creates if missing, patches if drifted,
  leaves id and history untouched when already in desired state.
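
The `local.internal_monitors` shape from the first bullet, seeded with the
one MySQL monitor, might look roughly like this (field names, the connection
string, and the retries value are illustrative; the Vault key and 60s
interval come from this message):

```
locals {
  # Desired internal (non-ingress) monitors; extend this list to add more.
  internal_monitors = [
    {
      name               = "MySQL Standalone (dbaas)"
      type               = "mysql"
      connection_string  = "mysql://uptimekuma@mysql.dbaas:3306" # placeholder
      vault_password_key = "uptimekuma_db_password"
      interval           = 60
      retries            = 3 # placeholder
    },
  ]
}

# Derived secret key naming (DB_PASSWORD_<upper_snake>):
# "MySQL Standalone (dbaas)" -> DB_PASSWORD_MYSQL_STANDALONE_DBAAS
```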

## Why this approach (Option B, not a Terraform provider)

The `louislam/uptime-kuma` Terraform provider does NOT exist in the public
registry (verified — only a CLI tool of the same name). Option A from the
task brief was therefore unavailable. Option B (idempotent K8s CronJob)
matches the established pattern in the same module for
`external-monitor-sync` — no new machinery introduced.

## Monitor 663: no-op on first sync

Manual import was not possible (no provider → no state to import). The
sync job correctly identifies the existing monitor by name and reports:

  Monitor MySQL Standalone (dbaas) (id=663) already in desired state
  Internal monitor sync complete

DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and
`Rows: 1` responses every 60s — no disruption.

## Vault key — left manual (by design)

`secret/viktor` is not Terraform-managed anywhere in the repo (only read
via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding
135 keys. The `uptimekuma_db_password` key was added manually yesterday;
this change does NOT codify it. Codifying the whole `secret/viktor` entry
is out of scope for this task (would need a separate migration + rotation
story). The sync job reads the existing value at apply time — so if the
value is ever rotated in Vault, the next sync picks it up.

## Plan + apply

  Plan: 3 to add, 0 to change, 0 to destroy.
  Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
  Re-plan: No changes. Your infrastructure matches the configuration.

Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern.

Closes: code-ed2
2026-04-18 12:04:17 +00:00
Viktor Barzin
d3bdf87676 [docs] Clarify external-monitor auto-annotation in CLAUDE.md
## Context
During a false-alarm investigation of terminal.viktorbarzin.me, an Explore
agent misdiagnosed "no monitoring" by checking cloudflare_proxied_names in
config.tfvars (a legacy fallback list) instead of the ingress_factory
auto-annotation. Both [External] monitors for terminal/terminal-ro exist and
are active — the original agent just looked in the wrong place.

## This change
Expands the Monitoring & Alerting bullet to spell out the mechanism:
ingress_factory auto-adds uptime.viktorbarzin.me/external-monitor=true when
dns_type != "none", and cloudflare_proxied_names is a legacy fallback for
the 17 hostnames not yet migrated. Future agents debugging "is this
monitored?" questions should not check cloudflare_proxied_names.

## What is NOT in this change
No Terraform, no K8s, no service config. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:45:56 +00:00
Viktor Barzin
dad62647cd [grampsweb] Align PVC resource to encrypted storage; imported state
## Context

Grampsweb stack had an empty Terraform state — 7 K8s resources (namespace,
PVC, service, deployment, ingress, ExternalSecret manifest, TLS secret)
existed in the cluster but weren't tracked. This blocked commit 7b248897
(ollama LLM env-var removal) from being applied because any apply would
attempt to re-create existing resources.

Additionally, the TF source declared a **grampsweb-data-proxmox** PVC on
**storage_class=proxmox-lvm**, while the cluster had **grampsweb-data-encrypted**
on **proxmox-lvm-encrypted** (1 Gi, bound). The deployment was referencing
the encrypted PVC. This divergence predated this change — the source was
simply out of date vs cluster reality.

## This change

Two things:

1. **Source alignment** (the only file diff):
   - Renames `kubernetes_persistent_volume_claim.data_proxmox` →
     `data_encrypted`, metadata.name to match cluster, storage class to
     `proxmox-lvm-encrypted` (a sketch follows this list).
   - Updates the deployment volume `claim_name` reference accordingly.
   - Aligns with the newer project convention documented in
     `.claude/CLAUDE.md`: "Default for sensitive data is
     proxmox-lvm-encrypted" and "Convention: PVC names end in `-encrypted`".
   - No destroy/recreate: the PVC and deployment already use the encrypted
     PVC in the cluster; TF source now just describes reality.

2. **State imports** (out-of-band, via `scripts/tg import`, not in diff):
   - `kubernetes_namespace.grampsweb` <- `grampsweb`
   - `kubernetes_persistent_volume_claim.data_encrypted` <- `grampsweb/grampsweb-data-encrypted`
   - `kubernetes_service.grampsweb` <- `grampsweb/grampsweb`
   - `kubernetes_deployment.grampsweb` <- `grampsweb/grampsweb`
   - `module.ingress.kubernetes_ingress_v1.proxied-ingress` <- `grampsweb/family`
   - `module.tls_secret.kubernetes_secret.tls_secret` <- `grampsweb/tls-secret`
   - `kubernetes_manifest.external_secret` <- `apiVersion=external-secrets.io/v1beta1,kind=ExternalSecret,namespace=grampsweb,name=grampsweb-secrets`
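
A sketch of the aligned PVC from point 1, using the size and access mode
visible in the verification output further down (everything else abbreviated):

```
resource "kubernetes_persistent_volume_claim" "data_encrypted" {
  metadata {
    name      = "grampsweb-data-encrypted"
    namespace = "grampsweb"
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "proxmox-lvm-encrypted"
    resources {
      requests = {
        storage = "1Gi"
      }
    }
  }
}
```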

## Apply result

`Apply complete! Resources: 0 added, 7 changed, 0 destroyed.`

In-place updates applied:
- Deployment: dropped `GRAMPSWEB_LLM_BASE_URL` + `GRAMPSWEB_LLM_MODEL` env
  vars (both containers) — realising the intent of commit 7b248897.
- Ingress: realigned Traefik middleware annotation + cleaned stale
  `uptime.viktorbarzin.me/external-monitor=false` annotation.
- TLS secret: removed Kyverno-generated labels (Kyverno's
  `sync-tls-secret` ClusterPolicy re-applies them on next reconcile —
  no functional impact; same pattern in 29 other stacks using
  `setup_tls_secret` module).
- Namespace, PVC, service: trivial metadata alignments (label /
  `wait_until_bound` / `wait_for_load_balancer`).
- `kubernetes_manifest.external_secret`: populated the `manifest`
  attribute after import (expected).

## What is NOT in this change

- No replica bump: deployment stays at `replicas=0` (stack is intentionally
  inactive per 2026-03-14 OOM incident note).
- No destroy/recreate of any resource.
- The broader code-w97 (11 stacks with empty state) is NOT closed — only
  grampsweb is imported. 10 items remain: beads-server, insta2spotify,
  isponsorblocktv, kyverno, meshcentral, pvc-autoresizer, shadowsocks,
  tor-proxy, travel_blog, plus the meshcentral PVC.

## Reproduce locally

```
KUBECONFIG=/home/wizard/code/config kubectl get all,ingress,pvc,externalsecret,secret -n grampsweb
# Deployment still replicas=0; PVC grampsweb-data-encrypted Bound; ingress 'family'
# on family.viktorbarzin.me; ExternalSecret SecretSynced True.

cd /home/wizard/code/infra/stacks/grampsweb
/home/wizard/code/infra/scripts/tg plan
# Expected: 'No changes.' (clean state after apply).
```

## Test Plan

### Automated
```
$ cd /home/wizard/code/infra/stacks/grampsweb && /home/wizard/code/infra/scripts/tg plan
Plan: 0 to add, 7 to change, 0 to destroy.  [pre-apply]

$ /home/wizard/code/infra/scripts/tg apply --non-interactive
Plan: 0 to add, 7 to change, 0 to destroy.
kubernetes_namespace.grampsweb: Modifications complete after 0s [id=grampsweb]
kubernetes_persistent_volume_claim.data_encrypted: Modifications complete after 0s [id=grampsweb/grampsweb-data-encrypted]
kubernetes_service.grampsweb: Modifications complete after 0s [id=grampsweb/grampsweb]
module.ingress.kubernetes_ingress_v1.proxied-ingress: Modifications complete after 0s [id=grampsweb/family]
module.tls_secret.kubernetes_secret.tls_secret: Modifications complete after 0s [id=grampsweb/tls-secret]
kubernetes_manifest.external_secret: Modifications complete after 0s
kubernetes_deployment.grampsweb: Modifications complete after 1s [id=grampsweb/grampsweb]
Apply complete! Resources: 0 added, 7 changed, 0 destroyed.

$ terraform fmt -check -recursive stacks/grampsweb
(no output - formatted clean)
```

### Manual Verification
```
$ KUBECONFIG=/home/wizard/code/config kubectl get all,ingress,pvc,externalsecret,secret -n grampsweb
# - deployment.apps/grampsweb 0/0 0 0 47d   (replicas=0 preserved)
# - service/grampsweb ClusterIP 10.106.232.205:80/TCP
# - persistentvolumeclaim/grampsweb-data-encrypted Bound pvc-c9a5dcf4... 1Gi RWO proxmox-lvm-encrypted
# - ingress/family traefik family.viktorbarzin.me -> 10.0.20.200:80,443
# - externalsecret/grampsweb-secrets vault-kv 15m SecretSynced True
# - secret/tls-secret kubernetes.io/tls
# No pod crashes (no pods — replicas=0).
```

Closes: code-8m6
2026-04-18 11:37:45 +00:00
Viktor Barzin
1de2ee307f kyverno: strip resources.limits.cpu cluster-wide via ClusterPolicy
Context
-------
The cluster policy is "no CPU limits anywhere" — CFS throttling causes
more harm than good for bursty single-threaded workloads (Node.js,
Python). LimitRanges are already correct (defaultRequest.cpu only, no
default.cpu), but 22 pods still carried CPU limits injected by upstream
Helm chart defaults — CrowdSec (lapi + agents), descheduler,
kubernetes-dashboard (×4), nvidia gpu-operator.

Previous attempts were ad hoc: patching each chart's values.yaml, and
occasionally missing limits reintroduced by chart upgrades. This replaces
that with a declarative Kyverno mutation at admission time.

This change
-----------
Adds a new ClusterPolicy `strip-cpu-limits` with two foreach rules:

  strip-container-cpu-limit      → containers[]
  strip-initcontainer-cpu-limit  → initContainers[]

Each rule uses `patchesJson6902` with an `op: remove` on
`resources/limits/cpu`. JSON6902 `remove` fails on missing paths, so
per-element preconditions gate the mutation — pods without CPU limits
pass through untouched. A top-level rule precondition short-circuits
using a JMESPath filter (`[?resources.limits.cpu != null] | length(@) > 0`)
so the mutation is a no-op for the overwhelming majority of pods.
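
As a sketch only (whether the kyverno stack declares ClusterPolicies via
`kubernetes_manifest` is an assumption), the container rule could be
expressed roughly like this; the initContainers rule is analogous:

```
resource "kubernetes_manifest" "strip_cpu_limits" {
  manifest = {
    apiVersion = "kyverno.io/v1"
    kind       = "ClusterPolicy"
    metadata   = { name = "strip-cpu-limits" }
    spec = {
      rules = [
        {
          name  = "strip-container-cpu-limit"
          match = { any = [{ resources = { kinds = ["Pod"], operations = ["CREATE"] } }] }
          preconditions = {
            all = [
              {
                # fast path: does any container carry a CPU limit at all?
                key      = "{{ request.object.spec.containers[?resources.limits.cpu != null] | length(@) }}"
                operator = "GreaterThan"
                value    = 0
              },
            ]
          }
          mutate = {
            foreach = [
              {
                list = "request.object.spec.containers"
                preconditions = {
                  all = [
                    {
                      # per-element gate: JSON6902 remove fails on missing paths
                      key      = "{{ element.resources.limits.cpu || '' }}"
                      operator = "NotEquals"
                      value    = ""
                    },
                  ]
                }
                patchesJson6902 = "- op: remove\n  path: /spec/containers/{{elementIndex}}/resources/limits/cpu"
              },
            ]
          }
        },
      ]
    }
  }
}
```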

Admission-time only. No `mutateExistingOnPolicyUpdate`, no `background`.
Existing pods keep their CPU limits until they're restarted naturally
(Helm upgrade, node drain, rollout). We rely on churn, not forced
restarts, to avoid unnecessary thrash.

Memory limits are preserved — they prevent OOM, still useful.

Flow
----

    admission request → match Pod + CREATE
                     → top-level precondition: any container has limits.cpu?
                           no  → skip (fast path)
                           yes → foreach container:
                                   element.limits.cpu present?
                                       no  → skip element
                                       yes → remove /spec/containers/N/resources/limits/cpu
                     → same again for initContainers
                     → mutated pod proceeds to API server

Verification
------------
  kubectl run test-strip-cpu --overrides='{limits:{cpu:500m,memory:64Mi}}'
    → admitted pod.resources = {limits:{memory:64Mi}, requests:{cpu:50m,memory:32Mi}}
    → CPU limit stripped, memory preserved, requests untouched

  kubectl rollout restart deploy/kubernetes-dashboard-metrics-scraper
    → new pod.resources = {limits:{memory:400Mi}, requests:{cpu:100m,memory:200Mi}}
    → cluster-wide count of pods with CPU limits: 22 → 21

Rollout
-------
Remaining 21 pods will drop their CPU limits on natural churn. No manual
restarts in this change — user may want to time a mass restart with a
maintenance window.

Closes: code-eaf
Closes: code-4bz

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:34:39 +00:00