Commit graph

2875 commits

Author SHA1 Message Date
cc56ba2939 [payslip-ingest] Move Payslips datasource 'database' into jsonData
Grafana 11.2+ Postgres plugin reads the DB name from jsonData.database
(see grafana/grafana#112418). The top-level 'database' field is silently
ignored by the frontend — datasource health checks and POST /api/ds/query
still work because the backend honors it, but every dashboard panel fails
with 'you do not have default database'.

Rolling back to the supported shape fixes rendering for all 4 uk-payslip
panels.
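
A minimal sketch of the shape change, assuming the datasource definition is handled as a plain dict. The helper name `move_database_into_json_data` is hypothetical and not part of the repo; only the top-level `database` and `jsonData.database` field names come from this commit.

```python
import copy


def move_database_into_json_data(datasource: dict) -> dict:
    """Return a copy with the top-level 'database' moved under 'jsonData'."""
    result = copy.deepcopy(datasource)
    database = result.pop("database", None)
    if database is not None:
        # An existing jsonData.database wins over the legacy top-level value.
        result.setdefault("jsonData", {}).setdefault("database", database)
    return result
```

The input dict is left untouched, so the old and new shapes can be compared side by side before provisioning.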
2026-04-18 23:23:07 +00:00
Viktor Barzin
f6cff262f0 broker-sync: chown fidelity_storage_state to broker uid in init container
## Context

First end-to-end test of the broker-sync-fidelity CronJob failed with
`PermissionError: [Errno 13] Permission denied:
'/data/fidelity_storage_state.json'`. Init container runs as root (uid
0) but the broker-sync container runs as uid 10001; chmod 600 without
chown made the file unreadable from the main container.

## This change

Added `chown 10001:10001` before the existing `chmod 600` in the
`stage-storage-state` init container command. Init container has
CAP_CHOWN by default as root, so this succeeds.
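
To illustrate why `chmod 600` alone was not enough, here is a simplified model of the POSIX read check. It is a sketch only: it ignores root's override capability, supplementary groups, and ACLs, and it assumes the main container's gid is also 10001 (the commit only states the uid).

```python
def can_read(mode: int, owner_uid: int, owner_gid: int, uid: int, gid: int) -> bool:
    """Simplified POSIX read check: owner bits, else group bits, else other bits."""
    if uid == owner_uid:
        return bool(mode & 0o400)
    if gid == owner_gid:
        return bool(mode & 0o040)
    return bool(mode & 0o004)


# Before the fix: file owned by root (0:0), mode 600, read attempted by uid 10001.
before = can_read(0o600, 0, 0, 10001, 10001)
# After the fix: chown 10001:10001 first, then chmod 600.
after = can_read(0o600, 10001, 10001, 10001, 10001)
```

With mode 600 only the owner bits grant access, so ownership has to match the reading uid.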

## Verification

$ kubectl apply -f test-pod.yaml   # same init + main pattern
$ kubectl logs fidelity-debug -c broker-sync
...
broker_sync.providers.fidelity_planviewer.FidelitySessionError:
    PlanViewer session stale — run `broker-sync fidelity-seed`

Init container succeeded + main container read the file + Playwright
launched Chromium + navigated to PlanViewer + hit the 15-min idle page
→ exactly the intended behaviour for a stale session. Next step
(out-of-band): Viktor to paste a fresh SMS OTP and re-seed via
fidelity-seed on his laptop or through the existing chat-driven flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:22:43 +00:00
Viktor Barzin
43254ccd3f [infra] Add Woodpecker pipeline to deploy PVE /etc/exports (Wave 6b)
## Context

Wave 6b of the state-drift consolidation plan. `scripts/pve-nfs-exports` is
the git-managed source of truth for the Proxmox host's NFS export table
(file header documents this since the 2026-04-14 NFS outage post-mortem).
Deploying it was runbook-only — `scp` then `ssh ... exportfs -ra` — which
means a change could sit unpushed-to-PVE indefinitely, and nothing alerted
on divergence between git and host.

Wave 6b closes that loop: a Woodpecker pipeline watches
`scripts/pve-nfs-exports` on the `master` branch, diffs against the
current host file, and scp's the new content followed by `exportfs -ra`.
The same 2-shell-command runbook, now a CI step that runs on every push
and is manually triggerable.

## This change

- New pipeline `.woodpecker/pve-nfs-exports-sync.yml` — path-filtered push
  trigger + manual.
- SSH credentials provisioned 2026-04-18:
  - ed25519 keypair `woodpecker-pve-nfs-exports-sync`
  - Public key in `root@192.168.1.127:~/.ssh/authorized_keys`
  - Private key in Vault `secret/woodpecker/pve_ssh_key` (plus known_hosts
    entry for deterministic host-key pinning from Vault)
  - Woodpecker repo-level secret `pve_ssh_key` (id 139) bound to the infra
    repo's `push`/`manual`/`cron` events
- Pipeline steps: install openssh + curl (alpine image) → stage private
  key from secret → ssh-keyscan the PVE host into known_hosts → diff
  current vs. proposed exports (shown in pipeline log) → scp → exportfs
  -ra → Slack notify status.
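
The diff step can be sketched as follows. This is an illustration rather than the pipeline's actual implementation: the function name `exports_diff` is hypothetical, and whether the real pipeline skips the copy when the diff is empty is not stated in this commit.

```python
import difflib


def exports_diff(current: str, proposed: str) -> list[str]:
    """Unified diff between the host's exports file and the git-managed one."""
    return list(
        difflib.unified_diff(
            current.splitlines(),
            proposed.splitlines(),
            fromfile="host:/etc/exports",
            tofile="git:scripts/pve-nfs-exports",
            lineterm="",
        )
    )
```

An empty list means host and git already agree; a non-empty list is what would be printed to the pipeline log before the copy.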

## What is NOT in this change

- Drift detection (git-truth vs. host-truth) via cron: this pipeline only
  fires on *push*, so a host-side edit wouldn't be caught. Could add a
  daily cron that just runs the diff step and alerts if non-empty. Left
  as a refinement if drift becomes an issue.
- Pulling known_hosts from Vault rather than ssh-keyscan on each run: the
  keyscan is simpler and works against key rotation without needing a
  Vault round-trip. Pulling from Vault is the right answer the moment we
  add MITM risk, which the internal network doesn't have today.

## Reproduce locally
Edit `scripts/pve-nfs-exports`, push to master. Watch the pipeline in
Woodpecker. Verify on PVE: `ssh root@192.168.1.127 "md5sum /etc/exports"`
matches `md5sum scripts/pve-nfs-exports` in the repo.

Closes: code-dne

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:21:36 +00:00
Viktor Barzin
b9e9c3f084 [mailserver] Update SPF + docs for Brevo migration [ci skip]
## Context

Outbound mail relay migrated from Mailgun EU to Brevo EU on 2026-04-12 when
variables.tf:6 of the mailserver stack was switched to `smtp-relay.brevo.com:587`.
Postfix immediately began using Brevo for user mail — but the SPF TXT record
at viktorbarzin.me was left pointing at `include:mailgun.org -all`, so every
Brevo-relayed message failed SPF alignment and was spam-foldered or
DMARC-quarantined by Gmail/Outlook.

Observed on 2026-04-18 via `dig TXT viktorbarzin.me @1.1.1.1`:

    "v=spf1 include:mailgun.org -all"  <-- wrong sender network

User decision (2026-04-18): switch to `v=spf1 include:spf.brevo.com ~all`.
Soft-fail (`~all`) is intentional during cutover — keeps unauthorized Brevo
sends quarantined rather than outright rejected while we validate Brevo's
sending IPs + rate limits for real user mail. Tighten to `-all` once the
relay is proven stable.
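
For readers less familiar with SPF qualifiers, a small parsing sketch shows the difference between the old and new records. The helper `parse_spf` is hypothetical and handles only the `include:` and `all` terms used here, not the full SPF grammar.

```python
def parse_spf(record: str) -> dict:
    """Extract include domains and the result of the 'all' term from an SPF record."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    qualifiers = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    includes = []
    all_result = None
    for term in terms[1:]:
        lowered = term.lower()
        if lowered.startswith("include:"):
            includes.append(term.split(":", 1)[1])
        elif lowered.lstrip("+-~?") == "all":
            # A missing qualifier defaults to "+" (pass).
            qualifier = term[0] if term[0] in qualifiers else "+"
            all_result = qualifiers[qualifier]
    return {"includes": includes, "all": all_result}
```

The old record parses to a hard `fail` for anything outside Mailgun, while the new one parses to `softfail` for anything outside Brevo.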

The docs in `docs/architecture/mailserver.md` still described the old
Mailgun-based configuration (Overview paragraph, DNS table, Vault secrets
table). Per `infra/.claude/CLAUDE.md` rule "Update docs with every change",
those are updated in the same commit.

## This change

Coupled commit covering beads tasks code-q8p (SPF) + code-9pe (docs):

1. `stacks/cloudflared/modules/cloudflared/cloudflare.tf` — SPF TXT content
   flipped from `include:mailgun.org -all` to `include:spf.brevo.com ~all`,
   with an inline comment pointing at the mailserver docs for rationale.
2. `docs/architecture/mailserver.md` —
   - Last-updated stamp moved to 2026-04-18 with the cutover note.
   - Overview paragraph now says "relays through Brevo EU" (was Mailgun).
   - DNS table SPF row reflects the new value plus an annotated history
     note ("was include:mailgun.org -all until 2026-04-18").
   - DMARC row now calls out the intended `dmarc@viktorbarzin.me` rua
     target and flags that the current live record still points at
     e21c0ff8@dmarc.mailgun.org, tracked under follow-up code-569.
   - Vault secrets table: `mailserver_sasl_passwd` relabelled as Brevo
     relay credentials; `mailgun_api_key` annotated as retained for the
     E2E roundtrip probe only (inbound delivery testing, not user mail).

Apply was scoped with `-target=module.cloudflared.cloudflare_record.mail_spf`
to avoid sweeping up two unrelated pre-existing drifts that the Terraform
state shows on this stack: the DMARC + mail._domainkey_rspamd records are
stored on Cloudflare as RFC-compliant split TXT strings (>255 bytes), and
a naive refresh+apply would normalize them in the state back to single
strings. Those drifts are semantically equivalent (verifiers concatenate the
adjacent strings of a TXT record before parsing) and are out of scope for this
commit — they'll be handled under their own ticket.
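
The equivalence of the two representations can be sketched as below. The helper names are hypothetical, and the sketch assumes ASCII record content, as is usual for DKIM and DMARC values.

```python
def split_txt(value: str, limit: int = 255) -> list[str]:
    """Split a TXT value into character-strings of at most `limit` bytes each."""
    data = value.encode("ascii")
    return [data[i:i + limit].decode("ascii") for i in range(0, len(data), limit)]


def join_txt(chunks: list[str]) -> str:
    """Concatenate the strings with no separator, as a verifier would."""
    return "".join(chunks)
```

Splitting and re-joining round-trips to the original value, which is why the single-string and multi-string forms differ only in state representation.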

## What is NOT in this change

- DMARC `rua=mailto:dmarc@viktorbarzin.me` cutover — that's code-569 (M1),
  still using the legacy `e21c0ff8@dmarc.mailgun.org` + ondmarc addresses
  in the live record.
- DMARC/DKIM TXT multi-string state reconciliation on `mail_dmarc` and
  `mail_domainkey_rspamd` — pre-existing Cloudflare representation drift,
  untouched here.
- Removal of Mailgun references in history/decision sections of the docs,
  or the Mailgun-backed E2E roundtrip probe — probe still uses Mailgun API
  on purpose for inbound delivery testing (code-569 scope).
- Mailgun DKIM record `s1._domainkey` — left in place; still consumed by
  the roundtrip probe.
- Other pending items from the 2026-04-18 mail audit plan.

## Test Plan

### Automated

Targeted plan showed exactly one change, no other drift sneaking in:

    module.cloudflared.cloudflare_record.mail_spf will be updated in-place
      ~ content = "\"v=spf1 include:mailgun.org -all\""
             -> "\"v=spf1 include:spf.brevo.com ~all\""
    Plan: 0 to add, 1 to change, 0 to destroy.

Apply result:

    Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

DNS propagation verified on three independent resolvers immediately after
apply:

    $ dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"

    $ dig TXT viktorbarzin.me @8.8.8.8 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"

    $ dig TXT viktorbarzin.me @10.0.20.201 +short | grep spf   # Technitium primary
    "v=spf1 include:spf.brevo.com ~all"

### Manual Verification

Setup: nothing extra — change is already live (TF applied before commit
per home-lab convention; `[ci skip]` in title).

1. Confirm SPF is the Brevo-only record from an external resolver:

       dig TXT viktorbarzin.me @1.1.1.1 +short

   Expected: `"v=spf1 include:spf.brevo.com ~all"` — no Mailgun reference.

2. Send a test email via the mailserver (through Brevo relay) to a Gmail
   account and view the original headers:

       Authentication-Results: ... spf=pass smtp.mailfrom=viktorbarzin.me
       ...
       Received-SPF: Pass (google.com: domain of ... designates ... as
       permitted sender)

   Expected: `spf=pass` (it was `spf=fail` or `spf=softfail` before this
   change because the sending IP was a Brevo IP not covered by
   `include:mailgun.org`).

3. Confirm no live Mailgun references in the mailserver doc:

       grep -n mailgun.org infra/docs/architecture/mailserver.md

   Expected: only annotated-history mentions — SPF "was ... until
   2026-04-18" and DMARC "current live record still points at
   e21c0ff8@dmarc.mailgun.org pending cutover". No claims of active
   Mailgun relay.

## Reproduce locally

    cd infra
    git pull
    dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    # expected: "v=spf1 include:spf.brevo.com ~all"

    # inspect the TF change:
    git show HEAD -- stacks/cloudflared/modules/cloudflared/cloudflare.tf

    # inspect the doc change:
    git show HEAD -- docs/architecture/mailserver.md

Closes: code-q8p
Closes: code-9pe

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:13:47 +00:00
06e3425a39 [monitoring] Set rawQuery+editorMode on uk-payslip panel targets
Grafana 11's Postgres plugin shows 'you do not have default database'
on any panel whose target is missing rawQuery:true / editorMode:"code".
The query builder can't reason about a custom schema.table path and
blanks the panel.
2026-04-18 23:12:45 +00:00
ed820e9b58 [monitoring] Fix uk-payslip datasource type to grafana-postgresql-datasource
The installed Postgres plugin is 'grafana-postgresql-datasource' (the newer
one). Dashboard panels referenced legacy 'postgres' type, which caused Grafana
to fall back to 'default database' and error out when rendering.

Ran sed over the JSON; all 8 panel+target type refs now match the installed
plugin name. UID (payslips-pg) was already correct.
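
The commit used sed; a structure-aware alternative is sketched below. It assumes each panel and target carries a `datasource` object with `type` and `uid` keys, which is an assumption about the dashboard JSON rather than something shown in this commit, and the function name is hypothetical.

```python
def rewrite_datasource_type(node, old="postgres", new="grafana-postgresql-datasource") -> int:
    """Recursively update datasource type refs in place; return how many changed."""
    changed = 0
    if isinstance(node, dict):
        datasource = node.get("datasource")
        if isinstance(datasource, dict) and datasource.get("type") == old:
            datasource["type"] = new
            changed += 1
        for value in node.values():
            changed += rewrite_datasource_type(value, old, new)
    elif isinstance(node, list):
        for item in node:
            changed += rewrite_datasource_type(item, old, new)
    return changed
```

The returned count gives a quick check against the expected number of refs, and a second run returning 0 confirms nothing was missed.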
2026-04-18 23:10:13 +00:00
471e946133 [monitoring] Put uk-payslip dashboard in Finance folder
Grafana can't auto-create the reserved 'General' folder ('A folder with
that name already exists'), which aborts the sidecar provisioner's walk
and drops every dashboard in that folder. Move uk-payslip to Finance so
it loads.
2026-04-18 23:03:22 +00:00
Viktor Barzin
11082f7e83 [infra] Partial Calico adoption: namespaces only (Wave 5b)
## Context

Wave 5b of the state-drift consolidation plan. Calico has run this cluster's
pod networking since 2024-07-30, installed via raw kubectl manifests —
tigera-operator Deployment + ~20 CRDs + an Installation CR. The plan
flagged Calico as HIGH BLAST because the operator + Installation CR sit on
the critical path for pod scheduling; any mistake during adoption can
break CNI and block new pods cluster-wide within seconds.

This session takes the safe sub-step: adopt only the three namespaces.
Namespaces are label containers — TF managing their names + PSA labels
cannot disrupt Calico networking. Getting the operator, Installation CR,
and CRDs under TF requires dedicated prep (picking the right
`ignore_changes` fields to absorb operator-generated defaults in the
Installation CR, decoupling from the embedded PSA labels applied at
admission, and a low-traffic window). Deferred to `code-3ad`.

## This change

New Tier 1 stack `stacks/calico/` adopting via import `{}` blocks
(Wave 8 convention, commit 8a99be11):

- `kubernetes_namespace.calico_system` ← id `calico-system`
- `kubernetes_namespace.calico_apiserver` ← id `calico-apiserver`
- `kubernetes_namespace.tigera_operator` ← id `tigera-operator`

Apply: `3 imported, 0 added, 0 changed, 0 destroyed.` Followed by a
second `tg plan` that returns `No changes`. Zero cluster impact —
namespaces stayed exactly as they were cluster-side.

### terragrunt dependency choice
Deliberately no `dependency "platform"` clause — Calico is lower in the
stack than platform, so introducing a `platform → calico` or
`calico → platform` edge would invite cycle-like pain on first
bootstrap. The plan on this stack is always safe to run standalone.

### `ignore_changes` scope on each namespace
- `goldilocks.fairwinds.com/vpa-update-mode` — Kyverno ClusterPolicy
  stamp (Wave 3B sweep, commit 8b43692a).
- `pod-security.kubernetes.io/enforce` + `-version` — tigera-operator
  stamps these on `calico-system` + `calico-apiserver` to opt them out
  of PSA. These labels aren't surfaced by the kubernetes provider as
  part of the import (they arrive through a different field manager),
  so left unmanaged to keep the plan clean. `tigera-operator` ns
  doesn't get the PSA labels so they aren't ignored there.

## What is NOT in this change

- The three live workloads: `tigera-operator` Deployment in
  `tigera-operator` ns, `calico-kube-controllers`/`calico-node`/
  `calico-typha` workloads in `calico-system`, the `calico-apiserver`
  in `calico-apiserver`. These are all reconciled by the tigera-operator
  from the Installation CR — importing them into TF is redundant with
  importing the CR itself.
- The `Installation` CR (`default`, apiVersion
  `operator.tigera.io/v1`) — the user-authored minimal spec has since
  been filled to 104 lines of operator-generated defaults. Adopting it
  requires a well-scoped `ignore_changes` list on the `manifest` field.
  Separate follow-up `code-3ad`.
- `.sops.yaml` / `tier0_stacks` updates — the original plan suggested
  Tier 0 (local SOPS state) for the full Calico stack on the theory
  that "network underpins all". With only three namespaces in the stack,
  the argument doesn't hold: a failed Tier 1 plan on calico namespaces
  cannot break networking, so no need to pay the Tier 0 tax.

## Verification

```
$ cd stacks/calico && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.

$ kubectl get pods -n calico-system
NAME                                       READY   STATUS    RESTARTS
calico-kube-controllers-...                1/1     Running   0
calico-node-...                            1/1     Running   0
... (all healthy, pre-existing)
```

Follow-up: code-3ad for operator + Installation CR adoption (needs
low-traffic window + ignore_changes scoping).

Closes: code-hl1 scope of Wave 5b (namespaces). Remaining subwave in code-3ad.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:52:56 +00:00
Viktor Barzin
16d9fd8bde [infra] Adopt Authentik catch-all Proxy Provider + Application into TF (Wave 6a)
## Context

Wave 6a of the state-drift consolidation plan. The Domain wide catch all
Proxy Provider (pk=5) + its wrapping Application (slug=domain-wide-catch-all)
+ the embedded outpost (uuid 0eecac07-97c7-443c-8925-05f2f4fe3e47) have
run for a year as pure UI-created state. When the 2026-04-18 outpost SEV2
hit, it was harder to reason about the config than it should have been —
the only source of truth was the Authentik admin UI. Bringing the provider
+ application under Terraform means future changes are reviewable in PRs
and recoverable from git if the admin UI misbehaves.

## This change

Adds the `goauthentik/authentik` provider to the repo's central
`terragrunt.hcl` `required_providers` (side-effect: every stack can now
declare authentik resources; this stack is the only current consumer).
Stack-local `stacks/authentik/authentik_provider.tf` holds the provider
instance configuration + API token wiring + two resources + their flow
data-source lookups.

### Auth
- API token stored in Vault at `secret/authentik/tf_api_token`, identifier
  `terraform-infra-stack`, intent=API, user=akadmin, no expiry. Rotatable
  by rewriting the Vault KV entry; Terraform picks up the new token on the
  next plan.

### Imports (both landed zero-diff)
- `authentik_application.catchall` ← id `domain-wide-catch-all`
- `authentik_provider_proxy.catchall` ← id `5`

### Flow references
Authorization + invalidation flows are looked up via `data
"authentik_flow"` by slug (`default-provider-authorization-implicit-consent`
+ `default-provider-invalidation-flow`). Keeping them as data sources
rather than hardcoded UUIDs means a flow recreation (slug unchanged)
doesn't require an HCL edit.

### `lifecycle { ignore_changes }` scope
On `authentik_provider_proxy.catchall`:
- `property_mappings` (5 UUIDs), `jwt_federation_sources` (1 UUID) — the
  live state references complex many-to-many relations that are easier
  to manage from the Authentik UI than to serialise in HCL. Drift
  suppressed.
- `skip_path_regex`, `internal_host`, all `basic_auth_*`,
  `intercept_header_auth`, `access_token_validity` — either defaults or
  UI-only tuning knobs that aren't part of Terraform's concern for this
  catch-all provider.

On `authentik_application.catchall`:
- `meta_description`, `meta_launch_url`, `meta_icon`, `group`,
  `backchannel_providers`, `policy_engine_mode`, `open_in_new_tab` —
  cosmetic/non-functional attributes; the Authentik UI is the right
  place to edit these and drift on them isn't interesting.

## What is NOT in this change

- Outpost-binding resource — the embedded outpost's provider list is a
  single-row many-to-many that the Authentik UI manages cleanly; adding
  TF there would fight the UI without reducing drift.
- Property mappings and JWT federation source — managed via UI, drift
  suppressed. A future wave can bring them in when someone actually
  wants to edit them through code review.
- Other Authentik entities (Flows, Stages, Groups, RBAC policies) —
  same rationale: UI is the natural editing surface. Adopt incrementally
  as they become interesting to code-review.

## Verification

```
$ cd stacks/authentik && ../../scripts/tg plan | grep Plan:
Plan: 0 to add, 1 to change, 0 to destroy.
  # module.authentik.kubernetes_deployment.pgbouncer — pre-existing drift,
  # unrelated to this commit (image_pull_policy Always -> IfNotPresent)

$ ../../scripts/tg state list | grep authentik_
authentik_application.catchall
authentik_provider_proxy.catchall
data.authentik_flow.default_authorization_implicit_consent
data.authentik_flow.default_provider_invalidation
```

## Reproduce locally
1. `git pull && cd stacks/authentik && ../../scripts/tg init`
2. Terraform pulls goauthentik/authentik provider (first time).
3. `tg plan` — expect only pgbouncer drift; authentik resources read-only.

Refs: Wave 6a of the state-drift consolidation (code-hl1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:48:26 +00:00
eee694c915 [payslip-extractor] Add PAYSLIP_TEXT fast path
payslip-ingest now runs pdftotext locally before calling claude-agent-service,
shrinking the prompt ~20-100x. Agent file documents both paths: PAYSLIP_TEXT
(fast) and PDF_BASE64 (fallback for scanned-image PDFs or when pdftotext
fails).
2026-04-18 22:48:07 +00:00
Viktor Barzin
b28c76e371 [infra] Wire drift detection to Pushgateway + alert on stale/unaddressed drift
## Context

Wave 7 of the state-drift consolidation plan. The drift-detection pipeline
(`.woodpecker/drift-detection.yml`) already ran terragrunt plan on every
stack daily and Slack-posted a summary, but its output was ephemeral —
nothing persisted in Prometheus, so there was no historical view of which
stacks drift, when, or for how long. Following the convergence work in
waves 1–6 (168 KYVERNO_LIFECYCLE_V1 markers, 4 stacks adopted, Phase 4
mysql cleanup), the baseline is clean enough that *new* drift should
stand out. That only works if we have observability.

## This change

### `.woodpecker/drift-detection.yml`

Enhances the existing cron pipeline to push a batched set of metrics to
the in-cluster Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`)
after each run:

| Metric | Kind | Purpose |
|---|---|---|
| `drift_stack_state{stack}` | gauge, 0/1/2 | 0=clean, 1=drift, 2=error |
| `drift_stack_first_seen{stack}` | gauge (unix seconds) | Preserved across runs for drift-age tracking |
| `drift_stack_age_hours{stack}` | gauge (hours) | Computed from `first_seen` |
| `drift_stack_count` | gauge (count) | Total drifted stacks this run |
| `drift_error_count` | gauge (count) | Total plan-errored stacks |
| `drift_clean_count` | gauge (count) | Total clean stacks |
| `drift_detection_last_run_timestamp` | gauge (unix seconds) | Pipeline heartbeat |

First-seen preservation: on each drift hit, the pipeline queries
Pushgateway for the existing `drift_stack_first_seen{stack=<stack>}`
value. If present and non-zero, reuse it; otherwise stamp with `NOW`.
That means age-hours grows monotonically until the stack goes clean
(at which point state=0 resets first_seen by omission).
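
The first-seen rule above can be sketched as follows. The function names are hypothetical; the pipeline itself is a shell-based Woodpecker step, so this only illustrates the logic.

```python
from typing import Optional


def next_first_seen(existing: Optional[float], now: float) -> float:
    """Reuse the stored first-seen timestamp if present and non-zero, else stamp now."""
    if existing:
        return existing
    return now


def age_hours(first_seen: float, now: float) -> float:
    """Hours elapsed since drift was first observed on the stack."""
    return (now - first_seen) / 3600.0
```

Because a clean stack pushes no `first_seen` at all, the next drift hit finds nothing to reuse and starts the clock again.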

Atomic batched push: all metrics for a run are POST'd in a single
HTTP request. Pushgateway doesn't support atomic multi-metric updates
natively, but batching at the pipeline layer prevents half-updated
state if the curl is interrupted mid-run (a failed push simply fails
the entire run, which eventually surfaces as `DriftDetectionStale`).
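
A sketch of assembling one request body in the Prometheus text format is shown below. The function name and argument shapes are hypothetical; the metric names come from the table above. The body would then go out in a single POST to a Pushgateway path such as `/metrics/job/<job>`, where the job name is not given in this commit.

```python
def build_push_payload(states: dict[str, int], first_seen: dict[str, int], now: int) -> str:
    """Build one text-format body covering every metric for a run."""
    lines = []
    for stack in sorted(states):
        state = states[stack]  # 0=clean, 1=drift, 2=error
        lines.append(f'drift_stack_state{{stack="{stack}"}} {state}')
        if state == 1:
            seen = first_seen[stack]
            age = (now - seen) / 3600
            lines.append(f'drift_stack_first_seen{{stack="{stack}"}} {seen}')
            lines.append(f'drift_stack_age_hours{{stack="{stack}"}} {age:.2f}')
    values = list(states.values())
    lines.append(f"drift_stack_count {values.count(1)}")
    lines.append(f"drift_error_count {values.count(2)}")
    lines.append(f"drift_clean_count {values.count(0)}")
    lines.append(f"drift_detection_last_run_timestamp {now}")
    # The text format requires a trailing newline.
    return "\n".join(lines) + "\n"
```

Building the whole body first means the push either lands completely or not at all from the pipeline's point of view.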

### `stacks/monitoring/.../prometheus_chart_values.tpl`

New `Infrastructure Drift` alert group with three rules:

- **DriftDetectionStale** (warning, 30m): fires if
  `drift_detection_last_run_timestamp` is older than 26h. Gives a 2h
  grace window on top of the 24h cron so transient Pushgateway or
  cluster unavailability doesn't false-alarm. Guards against the
  pipeline silently failing or the cron not firing.
- **DriftUnaddressed** (warning, 1h): fires if any stack has
  `drift_stack_age_hours > 72` — three days of unacknowledged drift.
  Three days is long enough to absorb weekends + typical review cycles
  but short enough to force follow-up before drift compounds.
- **DriftStacksMany** (warning, 30m): fires if `drift_stack_count > 10`
  in a single run. Sudden wide drift usually signals systemic causes
  (new admission webhook, provider version bump, cluster-wide CRD
  upgrade) rather than individual configuration errors, and the alert
  body nudges toward that diagnosis.
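
The three thresholds can be expressed as plain conditions. This is a sketch in Python rather than the PromQL in the chart values, with a hypothetical function name, and it ignores the 30m/1h hold durations listed with each rule.

```python
def firing_alerts(now: float, last_run_ts: float,
                  age_hours_by_stack: dict[str, float],
                  drift_stack_count: int) -> list[str]:
    """Return the names of the drift alerts whose threshold condition holds."""
    alerts = []
    if now - last_run_ts > 26 * 3600:  # 24h cron plus 2h grace
        alerts.append("DriftDetectionStale")
    if any(age > 72 for age in age_hours_by_stack.values()):  # three days
        alerts.append("DriftUnaddressed")
    if drift_stack_count > 10:
        alerts.append("DriftStacksMany")
    return alerts
```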

Applied to `stacks/monitoring` this session — 1 helm_release changed,
no other drift surfaced.

## What is NOT in this change

- The Wave 7 **GitHub issue auto-filer** — the full plan included
  filing a `drift-detected` issue per drifted stack. Deferred because
  it requires wiring the `file-issue` skill's convention + a gh token
  exposed to Woodpecker, both of which need separate setup. The Slack
  alert covers the same need at lower fidelity in the meantime.
- The Wave 7 **PG drift_history table** — would provide the richest
  historical view but adds a new DB schema dependency for a CI
  pipeline. Pushgateway + Prometheus handle the 72h window we care
  about; PG history is nice-to-have for quarterly reviews.
- Auto-apply marker (`# DRIFT_AUTO_APPLY_OK`) — premature until the
  baseline has been stable for a few cycles.

Follow-ups tracked: file dedicated beads items for GH-issue filer + PG
drift_history.

## Verification

```
$ cd stacks/monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

# After next cron run (cron expr: "drift-detection" in Woodpecker UI):
$ curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
    | grep -c '^drift_'
# expect a positive number
```

## Reproduce locally
1. `git pull`
2. Check Prometheus rules: `curl -sk https://prometheus.viktorbarzin.lan/api/v1/rules | jq '.data.groups[] | select(.name == "Infrastructure Drift")'`
3. Manually trigger the Woodpecker cron and watch Pushgateway populate.

Refs: Wave 7 umbrella (code-hl1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:42:51 +00:00
Viktor Barzin
124a756351 [infra] Adopt local-path-provisioner into Terraform (Wave 5c)
## Context

Wave 5c of the state-drift consolidation plan. `local-path-provisioner`
(Rancher's node-local dynamic PV provisioner) was deployed 55d ago via raw
`kubectl apply` against the upstream manifest. It serves as the cluster's
default StorageClass and is still actively in use — the 2026-04-18 live
survey showed helper-pod-delete cycles running against existing PVCs.

Unmanaged until now: namespace, ServiceAccount, ClusterRole (+ binding),
ConfigMap with provisioner config.json + helperPod.yaml + setup/teardown
scripts, StorageClass `local-path` (default), and the 1-replica
Deployment itself. Seven resources total.

## This change

New Tier 1 stack `stacks/local-path/` with all seven resources, adopted
via Wave 8's HCL `import {}` block convention (commit 8a99be11):

- `kubernetes_namespace.local_path_storage` → id `local-path-storage`
- `kubernetes_service_account.local_path_provisioner` →
  id `local-path-storage/local-path-provisioner-service-account`
- `kubernetes_cluster_role.local_path_provisioner` → id `local-path-provisioner-role`
- `kubernetes_cluster_role_binding.local_path_provisioner` → id `local-path-provisioner-bind`
- `kubernetes_config_map.local_path_config` →
  id `local-path-storage/local-path-config`
- `kubernetes_storage_class_v1.local_path` → id `local-path`
- `kubernetes_deployment.local_path_provisioner` →
  id `local-path-storage/local-path-provisioner`

Conventions applied:
- Namespace gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
  Goldilocks `vpa-update-mode` label drift (Wave 3B, commit 8b43692a).
- Deployment gets `# KYVERNO_LIFECYCLE_V1` marker suppressing the
  ndots dns_config drift (Wave 3A, commit c9d221d5 + 327ce215).
- ServiceAccount + pod spec pin `automount_service_account_token = false`
  and `enable_service_links = false` to match the live spec exactly.
- `import {}` stanzas removed after the apply converged to zero-diff
  (per AGENTS.md → "Adopting Existing Resources").

## Apply outcome

`Apply complete! Resources: 7 imported, 0 added, 3 changed, 0 destroyed.`

The 3 in-place changes were:
- `kubernetes_config_map.local_path_config.data` — whitespace/format
  reshuffle. The live ConfigMap contained the upstream manifest's
  hand-indented JSON + YAML; my HCL uses canonical `jsonencode` /
  heredoc. Semantic content identical, so the provisioner continued
  running (no pod restart).
- `kubernetes_deployment.local_path_provisioner.wait_for_rollout = true`
  — TF-only attribute, no cluster impact.
- `kubernetes_storage_class_v1.local_path.allow_volume_expansion = false`
  + `is-default-class` annotation re-asserted — TF-schema reconciliation
  only; the StorageClass remained default throughout.

Post-apply `scripts/tg plan` returns `No changes`.

## Verification

```
$ cd stacks/local-path && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.

$ kubectl -n local-path-storage get deploy
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
local-path-provisioner   1/1     1            1           55d

$ kubectl get sc local-path
NAME                    PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE
local-path (default)    rancher.io/local-path    Delete          WaitForFirstConsumer
```

## What is NOT in this change

- Helm-release adoption — local-path-provisioner was never installed via
  Helm in this cluster; raw manifests only. Keeping native typed
  resources rather than retrofitting a chart.
- PV-path customisation — sticks with upstream default
  `/opt/local-path-provisioner` on all nodes (via
  `DEFAULT_PATH_FOR_NON_LISTED_NODES`).

Closes: code-3gp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:39:55 +00:00
Viktor Barzin
1a7f68fe5b [beads-server] Auto-dispatch agent beads via CronJobs
## Context

Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.

The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`

So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.

## Flow

```
  user: bd assign <id> agent
         │
         ▼
  Dolt @ dolt.beads-server.svc:3306  ◄──── every 2 min ────┐
         │                                                  │
         ▼                                                  │
  CronJob: beads-dispatcher                                 │
    1. GET beadboard/api/agent-status  (busy? skip)         │
    2. bd query 'assignee=agent AND status=open'            │
    3. bd update -s in_progress   (claim)                   │
    4. POST beadboard/api/agent-dispatch                    │
    5. bd note "dispatched: job=…"                          │
         │                                                  │
         ▼                                                  │
  claude-agent-service /execute                             │
    beads-task-runner agent runs; notes/closes bead         │
         │                                                  │
         ▼                                                  │
  done  ──► next tick picks up the next bead ───────────────┘

  CronJob: beads-reaper  (every 10 min)
    for bead (assignee=agent, status=in_progress, updated_at older than 30 min):
      bd note   "reaper: no progress for Nm — blocking"
      bd update -s blocked
```

## Decisions

- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
  client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
  2-min poll cadence and ~5-min average run, throughput is at most ~12
  beads/hour (about 10 once a finished run waits for the next 2-min tick).
  Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
  Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
  `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
  Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
  the image-seeded file. The script copies it into `/tmp/.beads/` because bd
  may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
  When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
  `beads-task-runner` never trips the reaper. Failures trip it; pod crashes
  (in-memory job state lost) also trip it.

## What is NOT in this change

- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
  `cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.

## Deviations from plan

Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
  serializes `notes` as a string (not an array), and every `bd note` bumps
  `updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU
  `-d` and the image has python3.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.
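
The reaper's staleness check described above (compare `updated_at` against a 30-minute threshold, parsing ISO-8601 with `python3`) might look roughly like the sketch below. This is an illustration only, not the script shipped in the CronJob: the function name `is_stale`, its parameters, and the handling of a trailing `Z` or a missing offset are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_stale(updated_at: str, now: datetime, threshold_minutes: int = 30) -> bool:
    """Return True when `updated_at` is more than `threshold_minutes` before `now`.

    Hypothetical helper: assumes `updated_at` is an ISO-8601 string with
    either a trailing "Z" or an explicit offset such as "+00:00".
    """
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
    # so normalise it to an explicit UTC offset first.
    parsed = datetime.fromisoformat(updated_at.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return now - parsed > timedelta(minutes=threshold_minutes)
```

Because every `bd note` bumps `updated_at`, an agent that keeps writing notes keeps resetting this clock, which is why only stalled or crashed runs cross the threshold.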

## Test plan

### Automated

```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!

$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1)  # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.

$ terraform fmt stacks/beads-server/main.tf
# (no output — already formatted)
```

### Manual verification

1. **Apply**
   ```
   vault login -method=oidc
   cd infra/stacks/beads-server
   scripts/tg apply
   ```
   Expect: `kubernetes_config_map.beads_metadata`,
   `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
   created. No changes to existing resources.

2. **CronJobs exist with right schedule**
   ```
   kubectl -n beads-server get cronjob
   ```
   Expect `beads-dispatcher  */2 * * * *` and `beads-reaper  */10 * * * *`,
   both with `SUSPEND=False`.

3. **End-to-end smoke**
   ```
   bd create "auto-dispatch smoke test" \
       -d "Read /etc/hostname inside the agent sandbox and close." \
       --acceptance "bd note includes 'hostname=' line and bead is closed."
   bd assign <new-id> agent
   # within 2 min:
   bd show <new-id> --json | jq '{status, notes}'
   ```
   Expect notes to contain `auto-dispatcher claimed at …` and
   `dispatched: job=<uuid>`, status `in_progress`.

4. **Reaper smoke**
   Assign + dispatch a long bead, then
   `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
   30 min + one reaper tick, `bd show <id>` shows `blocked` with a
   `reaper: no progress for Nm — blocking` note.

5. **Kill switch**
   ```
   cd infra/stacks/beads-server
   scripts/tg apply -var=beads_dispatcher_enabled=false
   kubectl -n beads-server get cronjob
   ```
   Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
   nothing happens within 5 min. Re-apply with `=true` to re-enable.

The runbook at `infra/docs/runbooks/beads-auto-dispatch.md` covers all of
the above, plus reaper semantics and design choices.

Closes: code-8sm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:35:46 +00:00
Viktor Barzin
01955916b2 [infra] Adopt kured + sentinel-gate into Terraform (Wave 5a)
## Context

Wave 5a of the state-drift consolidation plan. Two cluster-critical pieces
of infrastructure lived OUTSIDE Terraform — invisible to the repo's "all
cluster changes via TF" invariant and drifting silently:

1. **kured** (Helm release): deployed 265d ago via `helm install kured` on
   the CLI. Values were edited only via `helm upgrade` — never captured.
   Chart version `kured-5.11.0`, app `1.21.0`, configured for Mon–Fri
   02:00–06:00 London reboot window, Slack notifyUrl, and a custom
   `/sentinel/gated-reboot-required` sentinel file.

2. **kured-sentinel-gate**: a custom DaemonSet + ServiceAccount +
   ClusterRole + ClusterRoleBinding. Built after the 2026-03 post-mortem
   (memory 390) when kured rebooted nodes during a containerd overlayfs
   outage and turned a single-node blip into a 26h cluster outage.
   The gate DaemonSet creates `/var/run/gated-reboot-required` only when
   (a) host has `/var/run/reboot-required`, (b) all nodes Ready, (c) all
   calico-node pods Running, (d) no node transitioned Ready in the last
   30 minutes (cool-down). kured's `rebootSentinel` then points at the
   gated file so reboots are effectively gated by cluster health.
   Applied 33d ago via `kubectl apply` — no TF footprint.

Both are now codified in the new `stacks/kured/` (Tier 1, PG state).
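
The four gate conditions (a) to (d) listed above can be expressed as a single predicate. The sketch below is a hedged illustration of that decision only; the real DaemonSet's implementation is not shown in this commit, and `NodeState`, `should_create_sentinel`, and their fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Sequence


@dataclass
class NodeState:
    # Hypothetical view of one node: current Ready status and the time of
    # its most recent Ready condition transition.
    ready: bool
    last_ready_transition: datetime


def should_create_sentinel(
    reboot_required: bool,
    nodes: Sequence[NodeState],
    calico_phases: Sequence[str],
    now: datetime,
    cooldown: timedelta = timedelta(minutes=30),
) -> bool:
    """Return True only when all four gate conditions hold."""
    if not reboot_required:                              # (a) host wants a reboot
        return False
    if not nodes or not all(n.ready for n in nodes):     # (b) all nodes Ready
        return False
    if not all(p == "Running" for p in calico_phases):   # (c) calico-node healthy
        return False
    # (d) cool-down: no node changed Ready state within the last 30 minutes.
    if any(now - n.last_ready_transition < cooldown for n in nodes):
        return False
    return True
```

Any single failing condition keeps the gated file absent, so kured sees no sentinel and does not reboot.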

## This change

- New stack `stacks/kured/` with `main.tf` (247 lines) + `terragrunt.hcl`
  (standard platform-dep) + `secrets` symlink.
- All 6 resources adopted via Wave 8's HCL `import {}` block pattern
  (commit 8a99be11) — written as `import {}` stanzas in the initial
  commit, plan-applied to zero, then stanzas deleted before this commit
  per the convention:
    - `kubernetes_namespace.kured` (id: `kured`)
    - `helm_release.kured` (id: `kured/kured`)
    - `kubernetes_service_account.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
    - `kubernetes_cluster_role.kured_sentinel_gate` (id: `kured-sentinel-gate`)
    - `kubernetes_cluster_role_binding.kured_sentinel_gate` (id: `kured-sentinel-gate`)
    - `kubernetes_daemon_set_v1.kured_sentinel_gate` (id: `kured/kured-sentinel-gate`)
- Slack notifyUrl moved from inline helm values into Vault at
  `secret/kured` under key `slack_kured_webhook`, consumed via
  `data "vault_kv_secret_v2"`. No plaintext secret in git.
- Namespace gets `tier = "1-cluster"` label (new — previously untiered,
  so Kyverno auto-quotas applied cluster-tier defaults on kured pods).
  Benign additive change; pod specs have explicit resources anyway.
- DaemonSet + SA get `automount_service_account_token = false` /
  `enable_service_links = false` to match the live pod spec exactly —
  otherwise TF schema defaults would flip these fields.
- DaemonSet carries `# KYVERNO_LIFECYCLE_V1` suppressing dns_config drift
  (Wave 3A convention, commit c9d221d5 + 327ce215).
- Namespace carries the same marker on the
  `goldilocks.fairwinds.com/vpa-update-mode` label (Wave 3B sweep,
  commit 8b43692a).

## Import outcomes

Apply result: `Resources: 6 imported, 0 added, 3 changed, 0 destroyed.`

The 3 in-place changes were all TF-schema reconciliation, not cluster
mutations:

- `helm_release.kured.values` — format reshuffle; the imported state
  stored values as a nested map, HCL uses `[yamlencode(...)]`. Semantic
  YAML is byte-identical, so the triggered Helm upgrade was a no-op on
  the cluster side (revision bumped 2→3, zero pod restarts).
- `kubernetes_namespace.kured.labels["tier"]` = `"1-cluster"` — new
  label added. Already discussed above.
- `kubernetes_daemon_set_v1.kured_sentinel_gate.wait_for_rollout` = true
  — TF-only attribute, no k8s impact.

Post-apply `scripts/tg plan` on `stacks/kured` returns:
`No changes. Your infrastructure matches the configuration.`

## What is NOT in this change

- `import {}` stanzas — intentionally removed after the apply landed.
  They would be no-ops and would clutter future diffs. Per Wave 8
  convention (AGENTS.md → "Adopting Existing Resources").
- Calico adoption (Wave 5b) — separate higher-blast change, needs a
  dedicated low-traffic window.
- local-path-storage (Wave 5c) — check-or-remove task still open.

## Verification

```
$ kubectl -n kured get ds
NAME                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
kured                 5         5         5       5            5
kured-sentinel-gate   5         5         5       5            5

$ helm -n kured list
NAME     NAMESPACE   REVISION  STATUS    CHART          APP VERSION
kured    kured       3         deployed  kured-5.11.0   1.21.0

$ cd stacks/kured && ../../scripts/tg plan | tail -1
No changes. Your infrastructure matches the configuration.
```

## Reproduce locally
1. `git pull`
2. `cd stacks/kured && ../../scripts/tg plan` → 0 changes
3. `kubectl -n kured get ds,pods` — 5 kured + 5 sentinel-gate pods Ready.

Closes: code-q8k

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:33:29 +00:00
Viktor Barzin
10fd88aec5 wealthfolio: add nightly backup sidecar — SQLite → NFS
## Context

Upstream Wealthfolio uses SQLite exclusively (Diesel ORM, no PG/MySQL
support — confirmed 2026-04-18 via repo inspection). The DB lives on
an RWO PVC (proxmox-lvm-encrypted) held 24/7 by the main pod.

First attempt at a standalone backup CronJob failed with Multi-Attach
error: RWO volume is already attached to the running WF pod, so no
separate pod can mount it. Switched to a backup sidecar in the same
pod — shares the PVC mount naturally.

## This change

- `container "backup"` added to the WF Deployment:
  - alpine:3.20 + sqlite + busybox-suid (for crond).
  - Mounts /data read-only (shared with WF container) + /backup (new
    NFS volume at 192.168.1.127:/srv/nfs/wealthfolio-backup).
  - Writes /etc/crontabs/root with a `30 4 * * *` line + /scripts/backup.sh
    which runs `sqlite3 .backup` (WAL-safe online snapshot, zero
    downtime), copies secrets.json, and prunes anything older than 30d.
  - 16Mi request / 64Mi limit — sleeps most of the time.
- NFS volume declared in pod spec — server from the existing
  `var.nfs_server` variable; path `/srv/nfs/wealthfolio-backup` created
  on the PVE host in the same session.

Removed the standalone backup CronJob that couldn't work.
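
For illustration, the snapshot-and-prune behaviour described above can be sketched in Python. The sidecar itself runs a shell script that calls `sqlite3 .backup`; this sketch instead uses Python's `sqlite3.Connection.backup`, which wraps the same SQLite online backup API. The function names and the directory-per-backup layout are assumptions drawn from the log output in this message, not the contents of `/scripts/backup.sh`.

```python
import shutil
import sqlite3
import time
from pathlib import Path
from typing import List, Optional


def snapshot_db(src_db: Path, dest_db: Path) -> None:
    """Copy a live SQLite database with the online backup API."""
    dest_db.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(src_db)
    dest = sqlite3.connect(dest_db)
    try:
        # Produces a consistent copy even while another process is writing.
        src.backup(dest)
    finally:
        dest.close()
        src.close()


def prune_old(backup_root: Path, max_age_days: int = 30,
              now: Optional[float] = None) -> List[Path]:
    """Delete backup directories whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in sorted(backup_root.iterdir()):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
            removed.append(entry)
    return removed
```

Pruning by directory mtime is one possible choice; parsing the timestamp out of the directory name would work equally well.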

## Verification

### Automated

`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 1 changed, 1 destroyed (the transient CronJob).

### Manual (2026-04-18)

$ kubectl -n wealthfolio get pods -l app=wealthfolio
wealthfolio-95d8bd498-cj8kw   2/2   Running
$ kubectl -n wealthfolio logs <pod> -c backup
wealthfolio-backup sidecar ready; next 04:30 UTC
$ kubectl -n wealthfolio exec <pod> -c backup -- /scripts/backup.sh
wealthfolio-backup: /backup/2026-04-18T22-24-55 (34.2M)
$ ls /srv/nfs/wealthfolio-backup/
2026-04-18T22-24-55/   ← first sidecar-produced backup

## Reproduce locally

1. `kubectl -n wealthfolio exec $(kubectl -n wealthfolio get pods -l app=wealthfolio -o jsonpath='{.items[0].metadata.name}') -c backup -- /scripts/backup.sh`
2. `ssh root@192.168.1.127 ls /srv/nfs/wealthfolio-backup/`
3. Expected: new dated folder appears with `wealthfolio.db` + `secrets.json`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:25:19 +00:00
Viktor Barzin
9e5d7cd825 state(vault): update encrypted state 2026-04-18 22:12:55 +00:00
Viktor Barzin
402fd1fbac state(dbaas): update encrypted state 2026-04-18 22:12:09 +00:00
Viktor Barzin
345ba2182f [mailserver] Widen email-roundtrip probe IMAP window 180s → 300s + per-attempt timeout
## Context

After fixing the two mail-server-side root causes of probe false-failures
(Dovecot userdb duplicates, postscreen btree lock contention), the probe
is expected to succeed well under 120s. This commit is defence in depth
against residual SMTP relay variance and against a future scenario where
Dovecot is transiently unresponsive during IMAP login.

The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's
queueing, DNS variance, and general SMTP retry backoff can easily
exceed that on a bad day. Widening to 5 minutes gives plenty of headroom
while still remaining well within the CronJob's 20-minute schedule
interval.

Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If
Dovecot is unresponsive (e.g., mid-rollout, transient TLS handshake
hang), the connect call can block indefinitely and the probe hangs
without ever looping to the next attempt. Adding `timeout=10` caps each
connect at 10s so the retry loop keeps making forward progress.

## This change

Two edits to the embedded probe script inside the cronjob resource:

```
-    # Step 2: Wait for delivery, retry IMAP up to 3 min
+    # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
  ...
-    for attempt in range(9):
+    for attempt in range(15):
  ...
-            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
```

Flow (before):

```
send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total
```

Flow (after):

```
send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total
                                           │
                                           └─ timeout ─► log, continue to next loop
```
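
The "after" flow can be sketched as a bounded retry loop in which each connect attempt carries its own timeout. This is a simplified illustration rather than the embedded probe script: it covers only the connect-and-retry part (not the mailbox search), and `connect_imap`, `wait_for_connection`, and the default TLS context are assumptions.

```python
import imaplib
import ssl
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_imap(host: str, timeout: float = 10.0) -> imaplib.IMAP4_SSL:
    """Open an IMAPS connection that gives up after `timeout` seconds."""
    ctx = ssl.create_default_context()  # assumption: default verification
    return imaplib.IMAP4_SSL(host, 993, ssl_context=ctx, timeout=timeout)


def wait_for_connection(
    connect: Callable[[], T],
    attempts: int = 15,
    delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Sleep, then try to connect; repeat up to `attempts` times."""
    for attempt in range(attempts):
        sleep(delay)
        try:
            return connect()
        except (OSError, imaplib.IMAP4.error) as exc:
            # Socket timeouts and TLS errors are OSError subclasses, so a
            # hung server costs at most one timeout before the next loop.
            print(f"attempt {attempt + 1}/{attempts} failed: {exc}")
    return None
```

With the defaults, the sleep budget is 15 × 20s = 300s, and each hung connect adds at most the per-attempt timeout on top of that.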

## What is NOT in this change

- Probe frequency stays at `*/20 * * * *`.
- The `EmailRoundtripStale` alert thresholds are intentionally left at
  3600s + for: 10m. Those fire only on sustained multi-hour issues and
  should not be loosened — they would mask future regressions. Probe
  success rate is now expected to recover to ≥95% from the two upstream
  fixes; if it doesn't, alert tuning gets revisited separately.
- No change to the Brevo send step, the success-metrics push, or the
  cleanup of stale e2e-probe-* messages.

## Test Plan

### Automated

`scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`:

```
  # module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place
  -     for attempt in range(9):
  +     for attempt in range(15):
  -             imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
  +             imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
Plan: 0 to add, 1 to change, 0 to destroy.
```

`scripts/tg apply`:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

### Manual Verification

1. Trigger the probe manually:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
2. Tail its logs:
   `kubectl -n mailserver logs job/probe-verify-<ts> -f`
3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical
   successful run should still complete in < 60s now that postscreen
   is no longer stalling.
4. Watch the 48-hour window on the `email_roundtrip_success` gauge in
   Prometheus — expect ≥95% (was ~65% before all three fixes).

## Reproduce locally

1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml | grep -E "range\(|timeout"`
2. Expect: `range(15)` and `timeout=10`
3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
4. `kubectl -n mailserver logs -f job/probe-verify-<ts>`
5. Expect: eventual `Round-trip SUCCESS in <N>s` message and exit 0.

Closes: code-18e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:33:56 +00:00
Viktor Barzin
e2516b07a3 [mailserver] Disable postscreen btree cache to stop SMTP lock-contention stalls
## Context

Postfix inside docker-mailserver was spamming fatal errors at roughly
1 per minute — 5,464 of them in a 24h window — all of the same shape:

```
postfix/postscreen[NNN]: fatal: btree:/var/lib/postfix/postscreen_cache:
unable to get exclusive lock: Resource temporarily unavailable
```

Every time one of these fires, the postscreen process dies mid-connection
and the inbound SMTP session is dropped. Legitimate mail (including Brevo
deliveries for our e2e email-roundtrip probe) gets re-queued by the sender
and arrives late — frequently past the probe's 180s IMAP polling window,
producing a 35%/7d probe success rate and the EmailRoundtripStale alert
noise that was originally flagged as "probably nothing."

## Root cause

`master.cf` declares postscreen with `maxproc=1`, but postscreen still
re-spawns per incoming connection (or for short-lived reopens), and each
instance opens the shared btree cache with an exclusive file lock. Under
any concurrency (two TCP SYNs arriving close together, or a retry during
teardown), the second process hits EWOULDBLOCK on fcntl and Postfix
treats that as fatal.

Four options were considered:

  | Option | Verdict |
  |--------|---------|
  | (a) Disable cache (postscreen_cache_map = )  | ✓ chosen |
  | (b) Switch btree → lmdb                       | ✗ lmdb not compiled into docker-mailserver 15.0.0's postfix (`postconf -m` has no lmdb) |
  | (c) proxy:btree via proxymap                  | ✗ unsafe — Postfix docs: "postscreen does its own locking, not safe via proxymap" |
  | (d) Memcached sidecar                         | ✗ new moving part; deferred |

Option (a) is a small trade-off: legitimate clients re-run the
greet-action / bare-newline-action checks on every fresh TCP session
instead of hitting the 7-day whitelist cache. At our volume (~100
deliveries/day, ~72 of which are the probe itself) that's negligible CPU.
The cache would also have spared repeat DNSBL lookups, but this mailserver
already has `postscreen_dnsbl_action = ignore`, so the cache's DNSBL role
was doing nothing anyway.

## This change

Appends a stanza to the user-merged postfix main.cf stored in
`variable.postfix_cf` that sets `postscreen_cache_map =` (empty value).
Postfix treats an empty cache_map as "no persistent cache" — per-session
decisions are still enforced, they just aren't cached across sessions.

Before:

```
smtpd ──► postscreen (maxproc=1, btree cache with exclusive lock)
                ├─ concurrent access → fcntl EWOULDBLOCK → fatal
                └─ connection dropped, sender retries, mail arrives late
```

After:

```
smtpd ──► postscreen (no cache, per-session checks only)
                └─ no shared file, no lock → no fatal, no dropped session
```

No change to master.cf (postscreen still the front-end), no change to
DNSBL / greet / bare-newline policy.

## What is NOT in this change

- Dovecot userdb dedup (shipped in the previous commit).
- Email-roundtrip probe widening (next commit).
- Rebuilding docker-mailserver image with lmdb support (deferred —
  disabling the cache is simpler and sufficient at our volume).

## Test Plan

### Automated

`postconf -m` in the running container to confirm lmdb is genuinely absent
(ruling out option (b) before we commit to (a)):

```
btree  cidr  environ  fail  hash  inline  internal  ldap  memcache
nis  pcre  pipemap  proxy  randmap  regexp  socketmap  static  tcp
texthash  unionmap  unix
```

No lmdb entry — confirmed.

`scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`:

```
  ~ "postfix-main.cf" = <<-EOT
      + postscreen_cache_map =
```

`scripts/tg apply`:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

Reloader triggers pod rollout — baseline error count before apply was 34
`unable to get exclusive lock` lines per `--tail=500` log window.

### Manual Verification

Post-rollout, when the new pod is Ready:

1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map`
   Expect: empty (no value)
2. Watch for 15 min: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=1000 | grep -c "unable to get exclusive lock"`
   Expect: 0 new occurrences (any hits are from before the rollout).
3. Trigger a probe run manually:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
   then `kubectl -n mailserver logs job/probe-verify-...`
   Expect: `Round-trip SUCCESS` with duration < 120s.

## Reproduce locally

1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map`
2. Expect: `postscreen_cache_map =` (empty value)
3. `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --since=15m | grep -c "unable to get exclusive lock"`
4. Expect: 0

Closes: code-1dc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:32:48 +00:00
Viktor Barzin
01a718e17b [mailserver] Filter redundant local→local aliases to fix Dovecot 'exists more than once'
## Context

Dovecot auth logs have been steadily spamming
`passwd-file /etc/dovecot/userdb: User r730-idrac@viktorbarzin.me exists more
than once` (and the same for vaultwarden@) at ~31 occurrences per 500 log
lines. Under load this flakes IMAP auth for the e2e email-roundtrip probe
(spam@viktorbarzin.me uses the catch-all), which was masquerading as "Brevo
or probe timing" noise.

## Root cause

docker-mailserver builds Dovecot's `/etc/dovecot/userdb` from two sources:
real accounts (`postfix-accounts.cf`) AND virtual-alias entries whose
*target* resolves to a local mailbox (`postfix-virtual.cf`). When the same
address appears as BOTH a real mailbox AND an alias whose target is another
local mailbox, the generated userdb has two lines for that username pointing
to different home directories — e.g.:

  r730-idrac@viktorbarzin.me:...:/var/mail/.../r730-idrac/home
  r730-idrac@viktorbarzin.me:...:/var/mail/.../spam/home      ← from alias

Dovecot's passwd-file driver rejects the duplicate, and every subsequent
auth lookup logs the error.

This affected exactly two addresses:
- r730-idrac@viktorbarzin.me (real account + alias → spam@)
- vaultwarden@viktorbarzin.me  (real account + alias → me@)

Other aliases are fine: they either forward to external addresses (gmail
etc.) — no local userdb entry generated — or map an address to itself
(me@ → me@) which docker-mailserver dedups internally.

Note: removing the real accounts is not an option because Vaultwarden uses
`vaultwarden@viktorbarzin.me` as its live SMTP_USERNAME
(stacks/vaultwarden/modules/vaultwarden/main.tf:121).

## This change

Introduces a `local.postfix_virtual` that concatenates the Vault-sourced
aliases with `extra/aliases.txt`, then filters out any line matching the
exact "LHS RHS" shape where both sides are in `var.mailserver_accounts` and
LHS != RHS. That is, only the pure local→local redundant entries are
dropped; all forwarding aliases and the catch-all are preserved.

The filter is self-healing: if a future alias ever collides with a real
account, it gets silently suppressed instead of breaking Dovecot auth.

```
  Vault mailserver_aliases  ─┐
                              ├─ concat ─ split \n ─ filter ─ join \n ─► postfix-virtual.cf
  extra/aliases.txt ─────────┘                        │
                                                       └── drop if LHS+RHS both in
                                                           mailserver_accounts and
                                                           LHS != RHS
```

Filtered entries (confirmed via locally-simulated filter on live data):
- r730-idrac@viktorbarzin.me spam@viktorbarzin.me
- vaultwarden@viktorbarzin.me me@viktorbarzin.me

Preserved (sample): postmaster→me, contact→me, alarm-valchedrym→self+3 ext,
lubohristov→gmail, yoana→gmail, @viktorbarzin.me→spam (catch-all), all four
disposable `*-generated@` aliases.
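
The filter rule itself is small enough to restate in a few lines. The actual change is a Terraform `local`; the Python below is only a sketch of the same predicate, with a hypothetical function name, so the rule can be tried against sample data.

```python
from typing import Iterable, List


def filter_virtual_aliases(lines: Iterable[str], accounts: Iterable[str]) -> List[str]:
    """Drop pure local-to-local aliases that would duplicate a real mailbox.

    A line is dropped only when it has the exact "LHS RHS" shape, both
    sides are real accounts, and LHS != RHS. Everything else is kept.
    """
    known = set(accounts)
    kept = []
    for line in lines:
        parts = line.split()
        if (
            len(parts) == 2
            and parts[0] in known
            and parts[1] in known
            and parts[0] != parts[1]
        ):
            continue  # redundant local-to-local alias
        kept.append(line)
    return kept
```

The catch-all line survives because its left-hand side (`@viktorbarzin.me`) is not an account, and self-mappings such as `me@` to `me@` survive because of the `LHS != RHS` clause.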

## What is NOT in this change

- Real accounts in Vault (`secret/platform.mailserver_accounts`) are
  untouched — vaultwarden SMTP auth keeps working.
- Postfix postscreen btree lock contention (separate commit).
- Email-roundtrip probe IMAP window (separate commit).

## Test Plan

### Automated

`terraform validate` — passes (docker-mailserver module):

```
Success! The configuration is valid, but there were some validation warnings as shown above.
```

`scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`:

```
  # module.mailserver.kubernetes_config_map.mailserver_config will be updated in-place
  ~ resource "kubernetes_config_map" "mailserver_config" {
      ~ data = {
          ~ "postfix-virtual.cf" = (sensitive value)
            # (9 unchanged elements hidden)
        }
        id = "mailserver/mailserver.config"
    }
Plan: 0 to add, 1 to change, 0 to destroy.
```

`scripts/tg apply` — applied:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

### Manual Verification

Post-apply configmap content (the two lines are gone):

```
$ kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'
postmaster@viktorbarzin.me me@viktorbarzin.me
contact@viktorbarzin.me me@viktorbarzin.me
me@viktorbarzin.me me@viktorbarzin.me
lubohristov@viktorbarzin.me lyubomir.hristov3@gmail.com
alarm-valchedrym@viktorbarzin.me alarm-valchedrym@...,vbarzin@...,emil.barzin@...,me@...
yoana@viktorbarzin.me divcheva.yoana@gmail.com

@viktorbarzin.me spam@viktorbarzin.me
firmly-gerardo-generated@viktorbarzin.me me@viktorbarzin.me
closely-keith-generated@viktorbarzin.me vbarzin@gmail.com
literally-paolo-generated@viktorbarzin.me viktorbarzin@fb.com
hastily-stefanie-generated@viktorbarzin.me elliestamenova@gmail.com
```

Reloader triggers a pod rollout; once new pod is Ready:
- `kubectl -n mailserver exec <pod> -c docker-mailserver -- cut -d: -f1 /etc/dovecot/userdb | sort | uniq -d`
  expected output: empty (no duplicate usernames)
- `kubectl -n mailserver logs <pod> -c docker-mailserver --tail=500 | grep -c "exists more than once"`
  expected output: 0 (baseline was 31/500 lines)

## Reproduce locally

1. `kubectl -n mailserver get cm mailserver.config -o jsonpath='{.data.postfix-virtual\.cf}'`
2. Expect: no `r730-idrac@viktorbarzin.me spam@viktorbarzin.me` line and no
   `vaultwarden@viktorbarzin.me me@viktorbarzin.me` line.
3. After pod restart: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=500 | grep -c "exists more than once"` → 0.

Closes: code-27l

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:29:02 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
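
As a rough illustration of the second insertion path (extending an existing `ignore_changes` list), the sketch below handles the inline and multiline forms and the trailing-comma fix described above. It is not the `/tmp/add_dns_config_ignore.py` script, whose source is not part of this commit; the function names are hypothetical, and the sketch assumes no comments sit inside the list.

```python
DNS_CONFIG_PATH = "spec[0].template[0].spec[0].dns_config"
CRON_DNS_CONFIG_PATH = (
    "spec[0].job_template[0].spec[0].template[0].spec[0].dns_config"
)


def _matching_bracket(text: str, open_idx: int) -> int:
    """Return the index of the `]` closing the `[` at open_idx.

    Depth tracking is needed because list items such as `spec[0]`
    contain brackets themselves.
    """
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced brackets in ignore_changes list")


def extend_ignore_changes(lifecycle: str, path: str = DNS_CONFIG_PATH) -> str:
    """Append `path` to the ignore_changes list in a lifecycle block."""
    key = lifecycle.find("ignore_changes")
    if key == -1:
        return lifecycle
    open_idx = lifecycle.index("[", key)
    close_idx = _matching_bracket(lifecycle, open_idx)
    inner = lifecycle[open_idx + 1:close_idx]
    if "dns_config" in inner:
        return lifecycle  # already suppressed, so re-running is a no-op
    if "\n" in inner:
        body = inner.rstrip()
        tail = inner[len(body):]  # whitespace before the closing bracket
        last_line = body.rsplit("\n", 1)[-1]
        indent = last_line[: len(last_line) - len(last_line.lstrip())]
        if body and not body.endswith(","):
            body += ","  # keep the extended list valid HCL
        new_inner = f"{body}\n{indent}{path},{tail}"
    else:
        items = inner.strip()
        new_inner = f"{items}, {path}" if items else path
    return lifecycle[: open_idx + 1] + new_inner + lifecycle[close_idx:]
```

For a `kubernetes_cron_job_v1` resource the caller would pass `CRON_DNS_CONFIG_PATH` instead of the default.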

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
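
A brace-depth-tracking pass of the kind described above can be sketched as follows. This is an illustrative reconstruction, not the contents of `/tmp/add_goldilocks_ignore.py`: the function name is hypothetical, and the sketch assumes each resource's closing brace sits on its own line and that no unbalanced braces appear inside strings or comments.

```python
import re

LABEL = "goldilocks.fairwinds.com/vpa-update-mode"
LIFECYCLE = [
    "  lifecycle {",
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy "
    "stamps this label on every namespace",
    f'    ignore_changes = [metadata[0].labels["{LABEL}"]]',
    "  }",
]
START = re.compile(r'^resource "kubernetes_namespace" ')


def add_goldilocks_ignore(source: str) -> str:
    """Insert the lifecycle block before each namespace's closing brace."""
    if LABEL in source:
        return source  # idempotent: file already carries the suppression
    out = []
    depth = 0
    inside = False
    for line in source.split("\n"):
        if not inside and START.match(line):
            inside = True
            depth = 0
        if inside:
            # Track nesting by counting braces on each line.
            depth += line.count("{") - line.count("}")
            if depth == 0:
                # This line is the resource's outermost closing brace.
                out.extend(LIFECYCLE)
                inside = False
        out.append(line)
    return "\n".join(out)
```

Because the idempotency check is per file, a file that already mentions the label anywhere is left untouched, which matches the skip behaviour described above.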

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
e612baac15 [dawarich] Re-enable Sidekiq worker with resource limits + probes
## Context

Sidekiq was commented out in main.tf:203–274 on 2026-02-23 after the
unbounded 10-thread worker drove the whole pod into memory pressure —
the kubelet then evicted the web container along with it. Viktor's
recollection was "it was crashing"; the cgroup-level root cause was that the
Sidekiq container had no `resources.limits.memory` set, so a misbehaving
job could pull the entire pod down instead of being OOM-killed and
restarted in isolation.

During the ~55 days the worker was off, POSTs to /api/v1 continued to
enqueue jobs in Redis DB 1 (Dawarich uses redis-master.redis:6379/1, not
the cluster default DB 0). track_segments and digests tables stayed
empty because nothing was processing the backfill queue (beads
code-459). Dawarich was also bumped 0.37.1 → 1.6.1 on 2026-04-16, so
Sidekiq was untested against the new release in this environment.

Live pre-apply snapshot via `bin/rails runner`:
  enqueued=18  (cache=2, data_migrations=4, default=12)
  scheduled=16, retry=0, dead=0, procs=0, processed/failed=0 (stats
  reset by the 1.6.1 upgrade)
Queue latencies ~50h — lines up with code-e9c (iOS client stopped
POSTing on 2026-04-16), not with the nominal 55-day gap. Redis DB 1
was therefore a small, recoverable backlog, not the disaster the plan
originally feared — no pre-apply triage needed.

## What changed

Second container `dawarich-sidekiq` added to the existing Deployment
(same pod, same lifecycle as `dawarich` web). Key differences vs the
2026-02-23 commented block:

- `resources.limits.memory = 1Gi`, `requests = { cpu = 50m, memory =
  768Mi }`. Burstable QoS — cgroup is now bounded, so a runaway Sidekiq
  job gets OOM-killed and container-restarted in place without evicting
  the whole pod (web stays Ready).
- Hosts parametrised via `var.redis_host` / `var.postgresql_host`
  instead of hardcoded FQDNs; matches the web container's pattern.
- DB / secret / Geoapify creds via `value_from.secret_key_ref` against
  the existing `dawarich-secrets` K8s Secret (populated by the existing
  ExternalSecret). Removes the plan-time `data.vault_kv_secret_v2`
  reference the 2026-02-23 block relied on — that data source no longer
  exists in this stack.
- `BACKGROUND_PROCESSING_CONCURRENCY = "2"` (was "10"). Ramp deferred
  to separate commits (plan: 2 → 5 → 10 with 15-30min observation
  between bumps).
- Liveness + readiness `pgrep -f 'bundle exec sidekiq'` probes give a
  container-scoped restart when the Sidekiq process disappears (a hung
  but still-running process passes the check); verified `pgrep` is at
  /usr/bin/pgrep in the Debian-trixie-based freikin/dawarich image.
- Same Rails boot envs as the web container (TIME_ZONE, DISTANCE_UNIT,
  RAILS_ENV, RAILS_LOG_TO_STDOUT, SECRET_KEY_BASE, SELF_HOSTED) so
  Sidekiq's Rails initialisation matches web.

Pod-level additions:
- `termination_grace_period_seconds = 60` — gives Sidekiq time to
  drain in-flight jobs on SIGTERM during rolls (default 30s not enough
  for reverse-geocoding batches).

## What is NOT in this change

- Prometheus exporter for Sidekiq metrics. The first apply turned on
  `PROMETHEUS_EXPORTER_ENABLED=true`, which enabled the
  `prometheus_exporter` gem's CLIENT middleware. That middleware PUSHes
  metrics over TCP to a separate exporter server process — and the
  freikin/dawarich image does not start one. Client logged ~2/sec
  "Connection refused" errors until we flipped ENABLED back to "false"
  in this commit. `pod.annotations["prometheus.io/scrape"]` reverted
  for the same reason (nothing listening on :9394). Filed code-1q5
  (blocks code-459) to add a third sidecar container running
  `bundle exec prometheus_exporter -p 9394 -b 0.0.0.0` and restore
  the 4 drafted alerts (DawarichSidekiqDown /
  QueueLatencyHigh / DeadGrowing / FailureRateHigh) once metrics are
  actually being emitted.
- The 4 drafted Sidekiq alerts — reverted from
  monitoring/prometheus_chart_values.tpl; they reference metrics that
  don't exist yet. Restoration is part of code-1q5.
- Concurrency ramp past 2 and the 24h burn-in gate that closes
  code-459 — separate future commits.
- Liveness/readiness probes on the web container — pre-existing gap,
  out of scope per plan.

## Other changes bundled in

Kyverno `dns_config` drift suppression added with the
`# KYVERNO_LIFECYCLE_V1` marker on both `kubernetes_deployment.dawarich`
AND `kubernetes_cron_job_v1.ingestion_freshness_monitor`. Plan only
called it out for the Deployment, but the CronJob shows identical
drift (Kyverno injects ndots=2 on every pod template, Terraform wipes
it, infinite churn). Per AGENTS.md "Kyverno Drift Suppression" every
pod-owning resource MUST carry the lifecycle block — this commit
brings this stack into convention.

## Topology trade-off recorded

Sidekiq lives in the same pod as the web container, not a separate
Deployment. This means:
- Every env bump during ramp bounces both containers (Recreate
  strategy) — brief UI blip accepted.
- `kubectl scale` alone can't pause Sidekiq — pausing requires
  `BACKGROUND_PROCESSING_CONCURRENCY=0` + apply, or re-commenting
  the container block + apply.
- Shared pod network namespace — only one process can bind any given
  port. This is why the plan explicitly avoided declaring a new
  `port { name = "prometheus" }` on the sidekiq container (the web
  container already reserves 9394 by name).

Accepted because the alternative (split Deployment) is significantly
more config for a single-instance service and a follow-up bead
(tracked in code-1q5 description area / Viktor's notes) already
captures "revisit if future crashes warrant blast-radius isolation".

## Rollback

Three levels, in order of increasing impact:
1. Drop concurrency to 1 or 2 + apply: reduces load but keeps draining.
2. `BACKGROUND_PROCESSING_CONCURRENCY` → "0" + apply: pod stays up,
   no jobs processed, backlog preserved in Redis.
3. Re-comment the second container block (this diff in reverse) +
   apply — full disable, backlog stays in Redis DB 1, recoverable.

Never DEL queue:* keys directly — Redis DB 1 is where Dawarich lives,
and the jobs are recoverable state.

## Refs

- code-459 (P3) — Dawarich Sidekiq disabled. In progress; closes
  after 24h burn-in at concurrency=10 with restartCount=0, DeadSet
  delta < 100.
- code-1q5 (P3) — Follow-up: prometheus_exporter sidecar + 4 alerts.
  Blocks code-459.
- code-e9c (P2) — Viktor client-side POST bug 2026-04-16.
  Untouched; processing the backlog does not fix this but ensures
  future POSTs drain cleanly.
- code-72g (P3) — Anca ingestion silent since 2025-06-21. Untouched;
  same reasoning.

## Test Plan

### Automated

```
$ cd stacks/dawarich && ../../scripts/tg plan
...
Plan: 0 to add, 3 to change, 0 to destroy.
#   kubernetes_deployment.dawarich         (sidekiq container + probes + lifecycle)
#   kubernetes_namespace.dawarich          (drops stale goldilocks label, pre-existing drift)
#   module.tls_secret.kubernetes_secret.tls_secret  (Kyverno clone-label drift, pre-existing)

$ ../../scripts/tg apply --non-interactive
...
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.

(Second apply for PROMETHEUS_EXPORTER_ENABLED=false + annotation
removal — same 0/3/0 shape.)
```

### Manual Verification

Setup: kubectl context against the k8s cluster (10.0.20.100).

1. Pod has both containers Ready with zero restarts:
   ```
   $ kubectl -n dawarich get pods -o wide
   NAME                        READY  STATUS   RESTARTS  AGE
   dawarich-75b4ff9fbf-qh56v   2/2    Running  0         <fresh>
   ```

2. Sidekiq container is actively processing jobs:
   ```
   $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=20
   Sidekiq 8.0.10 connecting to Redis ... db: 1
   queues: [data_migrations, points, default, mailers, families,
            imports, exports, stats, trips, tracks,
            reverse_geocoding, visit_suggesting, places,
            app_version_checking, cache, archival, digests,
            low_priority]
   Performing DataMigrations::BackfillMotionDataJob ...
   Backfilled motion_data for N000 points (N climbing)
   ```

3. Rails Sidekiq::API snapshot — procs registered, counters moving:
   ```
   $ kubectl -n dawarich exec deploy/dawarich -- bin/rails runner '
       require "sidekiq/api"
       s = Sidekiq::Stats.new
       puts "processed=#{s.processed} failed=#{s.failed} procs=#{Sidekiq::ProcessSet.new.size}"
     '
   processed=7 failed=2 procs=1
   retry=0 dead=0
   ```
   (The 2 "failures" are cumulative across two pod lifecycles during
   the Prometheus env flip — retried successfully, neither retry nor
   dead set holds any jobs.)

4. Per-container memory well under the 1Gi limit:
   ```
   $ kubectl -n dawarich top pod --containers
   POD                         NAME              CPU    MEMORY
   dawarich-75b4ff9fbf-qh56v   dawarich          1m     272Mi  (of 896Mi)
   dawarich-75b4ff9fbf-qh56v   dawarich-sidekiq  79m    333Mi  (of 1Gi)
   ```

5. No "Prometheus Exporter, failed to send" log lines since the second
   apply:
   ```
   $ kubectl -n dawarich logs deploy/dawarich -c dawarich-sidekiq --tail=500 \
       | grep -c "Prometheus Exporter"
   0
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:13:05 +00:00
Viktor Barzin
8a99be1194 [infra] Document HCL import {} block convention [ci skip]
## Context

Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}`
block pattern (Terraform 1.5+) as the canonical way to bring live
cluster / Vault / Cloudflare resources under TF management.

Historically the repo has used `terraform import` on the CLI for
adoptions. That path has three real problems:

1. **Not reviewable** — it's an out-of-band state mutation that leaves
   no trace in git beyond the subsequent `resource {}` block. A
   reviewer sees only the new resource, not the adoption intent.
2. **Not plan-safe** — if the resource address or ID is wrong, the CLI
   path commits the mistake to state before anyone can catch it.
3. **Not idempotent** — a failed apply mid-import leaves state in a
   confusing half-adopted shape.

`import {}` blocks fix all three: the adoption intent is in the PR
diff, `scripts/tg plan` shows the import as its own plan line (mistyped
IDs fail before apply), and re-applying after a partial failure just
retries the import step.

Canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands
so the reviewer of those imports has the rule in front of them.

## This change

- `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks,
  Not the CLI" section sitting right after Execution. Includes the
  canonical 5-step workflow (write resource → add import stanza → plan
  to zero → apply → drop stanza), the reasoning, and a per-provider ID
  format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1,
  authentik_provider_proxy, cloudflare_record).
- `.claude/CLAUDE.md`: one-line cross-reference at the end of the
  Terraform State two-tier section pointing back to AGENTS.md. Keeps
  CLAUDE.md's quick-reference density intact while making sure the rule
  is reachable from the Claude-instructions path.

## What is NOT in this change

- Any actual imports — this is a pure docs landing. Wave 5 will
  demonstrate the pattern on kured + Calico.
- Replacing the handful of existing `terraform import`-style adoptions
  in the repo history — `import {}` blocks are delete-after-apply, so
  retro-documenting them is not useful.

Closes: code-[wave8-task]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:10:05 +00:00
Viktor Barzin
2b8bb849c0 [infra] Bump claude-agent-service + beadboard image tags
## Context
Two rolling updates tied to the BeadBoard dispatch-button work (code-kel):

1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent
   (files in /usr/share/agent-seed/), the beads-task-runner agent, and
   hmac.compare_digest bearer verification. The tag moves from 382d6b14
   to 0c24c9b6 (monorepo HEAD).
2. The beadboard Deployment in beads-server now consumes
   CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image
   needs the Dispatch button + /api/agent-dispatch + /api/agent-status
   routes. Tag moves from :latest to :17a38e43 (fork HEAD on
   github.com/ViktorBarzin/beadboard).
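
For context on the `hmac.compare_digest` bearer verification mentioned in item 1, a constant-time token check typically looks like the sketch below. The function name `bearer_is_valid` and the header parsing are illustrative assumptions, not code taken from claude-agent-service.

```python
import hmac


def bearer_is_valid(authorization_header: str, expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header value in constant time."""
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # compare_digest avoids leaking, via timing, how much of the token matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

Encoding both sides to bytes keeps `compare_digest` usable even if a token contains non-ASCII characters, which the `str` form of the call rejects.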

## What this change does
- Flips `local.image_tag` in claude-agent-service main.tf.
- Drops the "temporary" comment on `beadboard_image_tag` and sets the
  default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md
  "Use 8-char git SHA tags — `:latest` causes stale pull-through cache").

## Test Plan
### Automated
- Both images already pushed to registry.viktorbarzin.me{:5050}/ :
  - claude-agent-service:0c24c9b6 verified via
    `docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/
    contains both seed files.
  - beadboard:17a38e43 pushed, digest cd0d3c47.
- terraform fmt/validate clean on both stacks from the earlier commits.

### Manual Verification
1. Push triggers Woodpecker default.yml.
2. Expected: both stacks apply; claude-agent-service pod rolls (new
   seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch
   + copies beads-task-runner.md), beadboard pod rolls with new env vars
   sourced from beadboard-agent-service ExternalSecret.
3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:`
   should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard
   -o yaml | grep image:` should show :17a38e43.

Closes: code-kel
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:24:37 +00:00
Viktor Barzin
8d94688dde [infra] Suppress Kyverno label drift on module.tls_secret Secrets [ci skip]
## Context

Wave 3B of the state-drift consolidation audit (plan section "Shared Kyverno
drift-suppression") identified a second Kyverno admission-induced drift
class, complementary to the `# KYVERNO_LIFECYCLE_V1` ndots dns_config suppression
landed in c9d221d5. The ClusterPolicy `sync-tls-secret` runs on every
`kubernetes_secret` created via `modules/kubernetes/setup_tls_secret` and
stamps the following labels on the generated Secret:

  app.kubernetes.io/managed-by          = kyverno
  generate.kyverno.io/policy-name       = sync-tls-secret
  generate.kyverno.io/policy-namespace  = ""
  generate.kyverno.io/rule-name         = sync-tls-secret
  generate.kyverno.io/source-kind       = Secret
  generate.kyverno.io/source-namespace  = kyverno
  generate.kyverno.io/source-uid        = <uid>
  generate.kyverno.io/source-version    = v1
  generate.kyverno.io/source-group      = ""
  generate.kyverno.io/clone-source      = ""

Terraform does not manage any labels on this Secret, so every `terragrunt
plan` showed all 10 labels as `-> null`. This was observed on the dawarich
stack (one of the 93 callers of setup_tls_secret) and reproduces identically
on any stack that consumes this module. Root cause ticket: beads `code-seq`.

## This change

Adds a single `lifecycle { ignore_changes = [metadata[0].labels] }` block
to `modules/kubernetes/setup_tls_secret/main.tf`. One module edit,
93 callers' `module.tls_secret.kubernetes_secret.tls_secret` drift cleared.

The marker comment `# KYVERNO_LIFECYCLE_V1` stays consistent with the Wave 3A
convention (c9d221d5) — the rule now stands for "any Kyverno-induced
drift", not only ndots dns_config. AGENTS.md's "Kyverno Drift Suppression"
section will grow to catalog the fields ignored; this commit keeps the scope
tight to the code change.

## What is NOT in this change

- Namespace-level Goldilocks label drift (`goldilocks.fairwinds.com/vpa-update-mode = off`)
  — a different admission controller, different resource, different fix.
  Filed as beads `code-dwx` for a follow-up sweep across all 105 Tier 1
  stacks.
- AGENTS.md documentation expansion — will land alongside the Goldilocks
  sweep so both patterns are catalogued together.
- Retroactive marker on other Kyverno-generated Secrets — the sync-tls-secret
  policy is the only generate policy that produces Secrets in this repo
  (verified: `kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'` + cross-reference).

## Verification

Dawarich stack:
```
Before: Plan: 0 to add, 2 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
   (module.tls_secret.kubernetes_secret.tls_secret — Kyverno label drift)

After:  Plan: 0 to add, 1 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
```

Closes: code-seq (partial — tls_secret branch)
Refs: code-dwx (Goldilocks follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:23:02 +00:00
Viktor Barzin
f79e3c563e [infra] Remove mysql InnoDB Cluster + Operator HCL (Phase 4 cleanup) [ci skip]
## Context

On 2026-04-16 (memory #711) MySQL was migrated from InnoDB Cluster (3-member
Group Replication + MySQL Operator) to a raw `kubernetes_stateful_set_v1.mysql_standalone`
on `mysql:8.4`. The migration preserved the `mysql.dbaas` Service name
(selector switched to the standalone pod), all 20 databases/688 tables/14
users were dump-restored, and Vault rotated credentials against the new
instance. The InnoDB Cluster has been dark since — Phase 4 was to remove
the dead code and decommission its cluster-side Helm state.

Memory #711 explicitly notes Phase 4 as: "Remove helm_release.mysql_cluster
+ mysql_operator + namespace + RBAC + Delete PVC datadir-mysql-cluster-0
(30Gi) + Delete mysql-operator namespace + CRDs + stale Vault roles."

## This change

Phase 4 scope executed in this session (beads code-qai):

1. `terragrunt destroy -target` against 6 resources in the dbaas Tier 0 stack:
   - `module.dbaas.helm_release.mysql_cluster` — uninstalled InnoDBCluster CR
     + MySQL Router Deployment + 8 Services (mysql-cluster, -instances,
     ports 6446/6448/6447/6449/6450/8443, etc.)
   - `module.dbaas.helm_release.mysql_operator` — uninstalled MySQL Operator
     Deployment, InnoDBCluster CRD + webhook, operator ClusterRoles
   - `module.dbaas.kubernetes_namespace.mysql_operator` — deleted the ns
   - `module.dbaas.kubernetes_cluster_role.mysql_sidecar_extra` — leftover
     permissions patch that existed to work around the sidecar's kopf
     permissions bug; unused without the operator
   - `module.dbaas.kubernetes_cluster_role_binding.mysql_sidecar_extra`
   - `module.dbaas.kubernetes_config_map.mysql_extra_cnf` — used to override
     `innodb_doublewrite=OFF` via subPath mount; standalone does not need it
2. `kubectl delete pvc datadir-mysql-cluster-0 -n dbaas` — Helm does not
   garbage-collect PVCs; 30Gi reclaimed.
3. Removed 295 lines (lines 86–380) from `stacks/dbaas/modules/dbaas/main.tf`
   covering the `#### MYSQL — InnoDB Cluster via MySQL Operator` section
   and all six resources above.

The first destroy hit a Helm timeout on `mysql-cluster` uninstall ("context
deadline exceeded"). Uninstallation had in fact completed cluster-side by
that point but TF rolled back the state delta. A second `terragrunt destroy
-target` call with the same args resolved cleanly — destroyed the remaining
2 tracked resources (the first pass cleared 4) and encrypted+committed the
Tier 0 state.

## What is NOT in this change

- CRDs (`innodbclusters.mysql.oracle.com`, etc.) — Helm does delete these
  on uninstall. Verified clean: `kubectl get crd | grep mysql.oracle.com`
  returns nothing.
- Orphan PVC `datadir-mysql-cluster-0` — already deleted via kubectl; not
  a TF-managed resource.
- Stale Vault DB roles (health, linkwarden, affine, woodpecker,
  claude_memory, crowdsec, technitium) for services migrated MySQL→PG —
  sandbox denies `vault list database/roles` as credential scouting, so
  the user handles this manually.
- 2 state-commits preceding this one (`30fa411b`, `6cf3575e`) are automatic
  SOPS-encrypted-state commits produced by `scripts/tg` after each
  `terragrunt destroy` pass. Standard Tier 0 workflow.

## Verification

```
$ helm list -A | grep -E 'mysql-cluster|mysql-operator'
(no output)

$ kubectl get ns mysql-operator
Error from server (NotFound): namespaces "mysql-operator" not found

$ kubectl get pvc -n dbaas datadir-mysql-cluster-0
Error from server (NotFound): persistentvolumeclaims "datadir-mysql-cluster-0" not found

$ kubectl get pod -n dbaas -l app.kubernetes.io/instance=mysql-standalone
NAME                 READY   STATUS    RESTARTS       AGE
mysql-standalone-0   1/1     Running   1 (118m ago)   2d

$ ../../scripts/tg state list | grep -i 'mysql_operator\|mysql_cluster\|mysql_sidecar\|mysql_extra_cnf'
(no output)

$ ../../scripts/tg plan | grep -E 'mysql_cluster|mysql_operator|mysql_sidecar|mysql_extra_cnf'
(no output — Wave 2 drift is gone; remaining plan items are pre-existing
drift unrelated to this change, see Wave 3 + in-flight payslip work)
```

## Reproduce locally
1. `git pull`
2. `cd stacks/dbaas && ../../scripts/tg state list | grep mysql_cluster` → no output
3. `helm list -A | grep mysql-cluster` → no output

Closes: code-qai

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:19:48 +00:00
Viktor Barzin
6cf3575ed9 state(dbaas): update encrypted state 2026-04-18 19:17:31 +00:00
Viktor Barzin
30fa411bf7 state(dbaas): update encrypted state 2026-04-18 19:17:20 +00:00
Viktor Barzin
61e94c21fe state(dbaas): update encrypted state 2026-04-18 19:16:41 +00:00
Viktor Barzin
c75beaac6c wealthfolio: bump memory 64Mi → 1Gi (limit) / 256Mi (request)
## Context

Pod was OOMKilled after today's broker-sync Phase 3 import grew the
activity DB from ~10 rows (Phase 0 demo) to ~700 (Fidelity + cash-flow
matches across 6 accounts). `/api/v1/net-worth` and
`/valuations/history` materialise the full history in memory to render
the dashboard chart.

`kubectl describe pod` showed Back-off restarting failed container;
`kubectl top pod` reported 14Mi steady-state but spikes crossed the
64Mi cap.

## This change

Bump container resources to:
- requests.memory: 64Mi → 256Mi
- limits.memory:  64Mi → 1Gi

CPU unchanged. 1Gi is generous for the current 700-activity DB +
chart rendering, with headroom for another year of growth before we
need to revisit (VPA will flag if actual use exceeds upperBound).

## Verification

### Automated
`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 4 changed, 0 destroyed.

### Manual
$ kubectl -n wealthfolio get pod -l app=wealthfolio -o jsonpath='{.items[0].spec.containers[0].resources}'
→ {"limits":{"memory":"1Gi"},"requests":{"cpu":"10m","memory":"256Mi"}}

$ kubectl -n wealthfolio get pods -l app=wealthfolio
NAME                           READY   STATUS    RESTARTS   AGE
wealthfolio-86c8696b9c-nzwkf   1/1     Running   0          51s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:13:05 +00:00
Viktor Barzin
43b4e1d372 [payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role
## Context

New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.

Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.

## What

### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
  WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
  `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
  `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
  design), `wait-for postgresql.dbaas:5432` annotation, init container runs
  `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
  dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
  uid `payslips-pg`) reading password from the db-creds K8s Secret.

### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).

### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
  (username `payslip_ingest`, 7d rotation).

### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
  idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
  `pg-cluster-1`.

### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
  JSON object matching the schema to stdout. No network, no file writes outside /tmp,
  no markdown fences.
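
The extraction flow in the bullets above (decode base64, try one extractor, fall back to the next, emit one JSON object) can be sketched as follows. This is an illustration only: `extract_text` and `emit` are hypothetical names, and the extractors are passed in as plain callables so that real wrappers around pdftotext and pypdf (not shown) could be plugged in.

```python
import base64
import json
from typing import Callable, Iterable


def extract_text(pdf_b64: str, extractors: Iterable[Callable[[bytes], str]]) -> str:
    """Decode a base64 PDF and return text from the first extractor that yields any."""
    pdf_bytes = base64.b64decode(pdf_b64)
    for extractor in extractors:
        try:
            text = extractor(pdf_bytes)
        except Exception:
            continue  # this extractor failed: fall through to the next one
        if text.strip():
            return text
    raise ValueError("no extractor produced text")


def emit(record: dict) -> str:
    """Serialise the result as a single JSON object, with no markdown fences."""
    return json.dumps(record)
```

Passing the extractors in order, for example a pdftotext wrapper followed by a pypdf wrapper, gives the "try pdftotext, then pypdf" behaviour without hard-coding either tool.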

## Trade-offs / decisions

- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
  initially described. The Alembic migration still creates a `payslip_ingest`
  schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
  and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
  surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
  `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
  auto-creator).

## Test Plan

### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.

terraform fmt -check -recursive
(exit 0)

python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```

### Manual Verification (post-merge)

Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.

Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
   (first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 &`, then `curl localhost:8080/healthz` → 200.

End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.

Closes: code-do7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:07:05 +00:00
Viktor Barzin
81e7c3d6ee state(dbaas): update encrypted state 2026-04-18 18:59:51 +00:00
Viktor Barzin
bde713f8a4 broker-sync: add Fidelity PlanViewer CronJob (suspended)
## Context

Viktor's UK workplace pension is at Fidelity PlanViewer. The broker-sync
provider + CLI landed in the broker-sync repo (commits 804e6a8 +
7c9be54); this commit adds the infra bits so the monthly sync runs
in-cluster like the other broker-sync jobs.

One successful manual backfill on 2026-04-18 pulled 51 contributions +
valuation into a new WF WORKPLACE_PENSION account; Net Worth moved from
£865k → £1,003k. This commit productionises that flow.

## This change

- New kubernetes_cron_job_v1.fidelity in stacks/broker-sync/main.tf:
  - Schedule: 05:00 UK on the 20th of each month (after mid-month
    payroll settles; finance data shows credits on the 13th-18th).
  - Suspended by default — unsuspend once broker-sync image is rebuilt
    with Chromium baked in (Dockerfile change shipped separately in the
    broker-sync repo).
  - Init container materialises the storage_state JSON (projected from
    the broker-sync-secrets K8s Secret, synced from Vault by ESO) to the
    encrypted PVC at /data/fidelity_storage_state.json. Chromium then
    loads it.
  - Container: broker-sync fidelity-ingest with WF + FIDELITY_* env
    vars. Memory request 512Mi, limit 1280Mi — Chromium is hungry.
  - Lifecycle ignore_changes on dns_config per the KYVERNO_LIFECYCLE_V1
    convention documented in AGENTS.md.

## What is NOT in this change

- The Vault keys fidelity_storage_state + fidelity_plan_id — already
  staged via `vault kv patch` on 2026-04-18.
- Dockerfile Chromium install — in broker-sync repo (commit 7c9be54).
- Prometheus BrokerSyncFidelityFailed alert — deferred until the
  CronJob has run successfully for a month and we have a baseline.
  Existing broker-sync CronJobs also don't have per-job alerts yet;
  filing as a follow-up.

## Verification

### Automated
terraform fmt ran clean. `terragrunt plan` would show a single new
kubernetes_cron_job_v1 (suspended, so no pods scheduled).

### Manual (after apply + image rebuild)

1. Build + push broker-sync:<sha> with Chromium.
2. `scripts/tg apply stacks/broker-sync` (updates image_tag + adds
   fidelity CronJob).
3. Unsuspend: `kubectl -n broker-sync patch cronjob broker-sync-fidelity \
     -p '{"spec":{"suspend":false}}'` OR flip the tf flag.
4. Trigger a test run: `kubectl -n broker-sync create job \
     fidelity-test --from=cronjob/broker-sync-fidelity`.
5. Expect logs: `fidelity-ingest: fetched=N new=N imported=N failed=0`.
6. On FidelitySessionError: run `broker-sync fidelity-seed` locally +
   `vault kv patch secret/broker-sync fidelity_storage_state=@...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 18:51:20 +00:00
Viktor Barzin
4f54c959d7 [infra] Remove iscsi-csi stack — TrueNAS decommissioned [ci skip]
## Context

The iSCSI CSI driver was deployed against a TrueNAS appliance at 10.0.10.15
that was decommissioned 2026-04-12 when all Immich PVCs migrated to the
proxmox-lvm-encrypted storage class. The stack has been dead code since —
live survey (2026-04-18):

- iscsi-csi namespace: empty (0 resources), 27h old (since last TF apply)
- No iscsi CSI driver registered in the cluster
- No PVs/PVCs reference iscsi
- TF state held only the empty namespace
- helm_release.democratic_csi was not in state (already gone pre-session)

Leaving the stack around meant every `terragrunt run --all plan` would
show drift (TF wanted to create the helm release again) and every CI run would
try to pull `truenas_api_key` + `truenas_ssh_private_key` from Vault
against a TrueNAS that no longer exists. Beads tracking: code-gw0.

## This change

- `scripts/tg destroy` in stacks/iscsi-csi (1 resource destroyed — the namespace).
- `rm -rf stacks/iscsi-csi/` — removes modules/, main.tf, terragrunt.hcl,
  secrets symlink, and the 4 terragrunt-generated files (backend.tf,
  providers.tf, cloudflare_provider.tf, tiers.tf).
- Dropped PG schema `iscsi-csi` on `10.0.20.200:5432/terraform_state`
  (table states had 1 row — the current state — dropped by CASCADE).
- Deleted the empty `gadget` namespace (112d old, no owner — unrelated
  dead namespace swept as part of the same Wave 1 cleanup).

## What is NOT in this change

- Vault database role cleanup for the 7 MySQL-migrated services
  (health, linkwarden, affine, woodpecker, claude_memory, crowdsec,
  technitium). The sandbox treats listing Vault DB roles as credential
  enumeration and denies it, so this is flagged for the user to do manually via:
  `vault delete database/roles/<name>` after checking
  `vault list sys/leases/lookup/database/creds/<name>/` for active leases.

## Reproduce locally
1. `git pull`
2. `ls stacks/ | grep iscsi` → no output
3. `kubectl get ns iscsi-csi gadget` → both NotFound
4. psql to 10.0.20.200:5432/terraform_state → `\dn` shows no iscsi-csi schema

## Test Plan

### Automated
```
$ kubectl --kubeconfig config get ns iscsi-csi
Error from server (NotFound): namespaces "iscsi-csi" not found

$ kubectl --kubeconfig config get ns gadget
Error from server (NotFound): namespaces "gadget" not found

$ PGPASSWORD=... psql -h 10.0.20.200 -U ... -d terraform_state -c '\dn' | grep iscsi
(no output)

$ ls stacks/iscsi-csi 2>&1
ls: cannot access 'stacks/iscsi-csi': No such file or directory
```

### Manual Verification
None required — destroy was a no-op for workloads (namespace was empty).

Closes: code-b6l
Closes: code-gw0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 18:49:40 +00:00
Viktor Barzin
e1d20457c4 [infra/claude-agent-service] Seed beads metadata + scratch dir at runtime
## Context

Review of the BeadBoard Dispatch wiring found that the claude-agent-service
Dockerfile's `COPY beads/metadata.json /workspace/.beads/metadata.json` and
`COPY agents/beads-task-runner.md /home/agent/.claude/agents/...` both land
on paths that are volume-mounted at runtime:

  - `/workspace` → `claude-agent-workspace-encrypted` PVC (main.tf:394-398)
  - `/home/agent/.claude` → `claude-home` emptyDir (main.tf:424-427)

Kubernetes mounts hide image-layer content at those paths, so the COPYs are
dead. The companion commit in `claude-agent-service` restages both files to
`/usr/share/agent-seed/` (an image-layer path that is never mounted).

Additionally, the beads-task-runner agent rails expect
`/workspace/scratch/<job_id>/` to exist, but nothing was creating it.

## Layout before / after

```
  Before (dead COPYs):

    image layer          runtime (mounted volumes hide the files)
    -----------          -----------------------------------
    /workspace/          <- hidden by PVC mount
      .beads/
        metadata.json    <- UNREACHABLE
    /home/agent/.claude/ <- hidden by emptyDir mount
      agents/
        beads-task-runner.md  <- UNREACHABLE

  After (init container seeds volumes at pod start):

    image layer          runtime
    -----------          ------------------------------------
    /usr/share/agent-seed/
      beads-metadata.json    --+
      beads-task-runner.md    --+-> copied by seed-beads-agent init
                                    container into the mounted volumes
                                    on every pod start:
                                      /workspace/.beads/metadata.json
                                      /workspace/scratch/
                                      /home/agent/.claude/agents/beads-task-runner.md
```

## What

### New init container: `seed-beads-agent`
  - Positioned AFTER `git-init`, BEFORE the main container.
  - Uses the same service image (`${local.image}:${local.image_tag}`) — the
    seed files are baked in at `/usr/share/agent-seed/`.
  - Runs as default uid 1000 (the PVCs are already chowned by `fix-perms`).
  - Shell body:
      mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
      cp /usr/share/agent-seed/beads-metadata.json     /workspace/.beads/metadata.json
      cp /usr/share/agent-seed/beads-task-runner.md    /home/agent/.claude/agents/beads-task-runner.md
  - Mounts: `workspace` at `/workspace`, `claude-home` at `/home/agent/.claude`.
  - Resources: 32Mi requests / 64Mi limits (matches `fix-perms`/`copy-claude-creds`).

### Formatting
  - `terraform fmt -recursive` also normalised whitespace in the token-expiry
    locals block and the CronJob container definition. No semantic change.

## What is NOT in this change

  - No image tag bump. The Dockerfile refactor that produces the
    `/usr/share/agent-seed/` path lands in the claude-agent-service repo
    and will roll in on the next CI build. Until that build ships and the
    tag is bumped in this file, the new init container will `cp` from a
    path that doesn't exist yet — so do NOT apply this commit until the
    corresponding image tag bump is ready. The commit is declarative prep.
  - No changes to storage class, RBAC, Service, or any other init.
  - The main container mounts remain unchanged — only the init containers
    prepare volume contents.

## Test Plan

### Automated

```
$ terraform fmt -check -recursive stacks/claude-agent-service/
(no output — clean)

$ terraform -chdir=stacks/claude-agent-service/ init -backend=false
Terraform has been successfully initialized!

$ terraform -chdir=stacks/claude-agent-service/ validate
Warning: Deprecated Resource (pre-existing; use kubernetes_namespace_v1)
Success! The configuration is valid, but there were some validation warnings
as shown above.
```

### Manual Verification (after image bump + apply)

1. Bump `local.image_tag` in main.tf to the SHA of a build that has
   `/usr/share/agent-seed/*` (verify with `docker inspect $IMAGE | jq ...`
   or `kubectl run tmp --image ... -- ls /usr/share/agent-seed`).
2. `scripts/tg apply stacks/claude-agent-service`
3. `kubectl -n claude-agent get pods -w` — all init containers complete.
4. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- ls -la /workspace/.beads/metadata.json /home/agent/.claude/agents/beads-task-runner.md /workspace/scratch`
   Expected: all three paths exist; first two are regular files with the
   expected content, `scratch` is a directory.
5. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- jq -r .dolt_server_host /workspace/.beads/metadata.json`
   Expected: `dolt.beads-server.svc.cluster.local`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:23:19 +00:00
Viktor Barzin
c9d221d578 [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
and rejects module outputs, locals, variables, and any other expression
(Terraform processes `lifecycle` settings while building the dependency
graph, before arbitrary expressions can be evaluated). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.

What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
  Attached inline on every `spec[0].template[0].spec[0].dns_config` line
  (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
  existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
  `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
  the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
  — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
  save for the inline comment.
- The fallback module the original plan anticipated — skipped because
  Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
  (claude-agent-service, freedify/factory, hermes-agent). Reverted to
  keep this commit scoped to the convention rollout.

## Before / after

Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27

$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21

# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.

## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
   suppression.
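
Step 3 can be turned into a repeatable audit. A minimal sketch, assuming the marker always sits on the same line as the `dns_config` ignore path (as described under "This change"); `find_unmarked_suppressions` is a hypothetical helper name:

```shell
# Hypothetical audit helper: print every dns_config ignore path that lacks
# the KYVERNO_LIFECYCLE_V1 marker, one "file:line:text" entry per offender.
# Exit status is 0 when at least one unmarked site is found, 1 otherwise.
find_unmarked_suppressions() {
  # /dev/null forces grep to prefix each match with its file name,
  # even when find passes only a single file.
  find "$1" -type f \( -name '*.tf' -o -name '*.tf.example' \) \
    -exec grep -nF '].dns_config' /dev/null {} + |
    grep -v 'KYVERNO_LIFECYCLE_V1'
}
```

Run as `find_unmarked_suppressions stacks/`; any output points at a suppression site that was added without the marker. A resource with no suppression at all is not detected by this sketch.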

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
Viktor Barzin
a62b43d19e [infra] Document intended ignore_changes drift-workarounds [ci skip]
## Context

The infra repo has 31 `ignore_changes` blocks. Phase 1 of the state-drift
consolidation audit classified 21 as legitimate (immutable fields, cloud-computed
values) and 10 as intentional workarounds for known drift sources. Those
10 were indistinguishable from accidental/forgotten drift suppression without
reading the surrounding context.

This commit adds a uniform `# DRIFT_WORKAROUND: <reason>, reviewed 2026-04-18`
marker above the 8 intended-workaround blocks (6 CI image-tag decoupling + 2
non-deterministic secret hashes) so they are easy to distinguish from
accidental drift suppression during future audits.

## What is NOT in this change

- Functional behavior — `ignore_changes` lists are byte-identical.
- The Kyverno `dns_config` ignore paths (covered by Wave 3 shared module).
- Workarounds being removed — the CI decoupling is intentional by user decision.

## Files touched

CI image-tag decoupling (6):
- stacks/k8s-portal/modules/k8s-portal/main.tf (also has dns_config for Kyverno)
- stacks/novelapp/main.tf
- stacks/claude-memory/main.tf
- stacks/plotting-book/main.tf
- stacks/trading-bot/main.tf (api deployment)
- stacks/trading-bot/main.tf (workers deployment — 6 containers)

Non-deterministic secret hashes (2):
- stacks/owntracks/main.tf (htpasswd bcrypt)
- stacks/mailserver/modules/mailserver/main.tf (postfix-accounts.cf)

## Test Plan

### Automated
```
$ rg DRIFT_WORKAROUND stacks/ | wc -l
8

$ terraform fmt -recursive stacks/k8s-portal stacks/novelapp stacks/claude-memory \
    stacks/plotting-book stacks/trading-bot stacks/owntracks stacks/mailserver
(no output — already formatted)

$ git diff --stat
 stacks/claude-memory/main.tf                 | 1 +
 stacks/k8s-portal/modules/k8s-portal/main.tf | 1 +
 stacks/mailserver/modules/mailserver/main.tf | 3 ++-
 stacks/novelapp/main.tf                      | 1 +
 stacks/owntracks/main.tf                     | 1 +
 stacks/plotting-book/main.tf                 | 1 +
 stacks/trading-bot/main.tf                   | 2 ++
 7 files changed, 9 insertions(+), 1 deletion(-)
```

### Manual Verification
No apply required — HCL comments only, zero effect on plan output.

## Reproduce locally
1. `cd infra && git pull`
2. `rg "DRIFT_WORKAROUND.*reviewed 2026-04-18" stacks/ | wc -l` → expect 8
3. `terraform fmt -check -recursive stacks/` → expect clean exit

Closes: code-yrg

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:08:10 +00:00
Viktor Barzin
91165e31b9 [infra/beads-server] Wire BeadBoard to claude-agent-service
## Context

BeadBoard is the Next.js task visualization dashboard shipped in this
stack. We want users to trigger headless Claude agent runs directly from
a beads task row — "one-click dispatch" — instead of copy-pasting `bd`
IDs into a terminal. The agent runs in-cluster as claude-agent-service
(see stacks/claude-agent-service/), protected by a bearer token in
Vault at secret/claude-agent-service/api_bearer_token.

For BeadBoard to POST to /execute we need the service URL and the
bearer token available inside the pod as env vars. The URL is static
(cluster DNS); the token must come through External Secrets Operator
so rotation in Vault propagates without re-applying Terraform.

Secondary cleanup: the container was still pinned to :latest which
violates the 8-char-SHA convention and causes stale pulls through the
registry cache (see .claude/CLAUDE.md, Docker images). The image tag
is now variable-driven; the GHA pipeline will override the default
once it publishes the first SHA.

## This change

- Adds an ExternalSecret `beadboard-agent-service` in the
  `beads-server` namespace, mirroring the pattern in
  stacks/claude-agent-service/main.tf (same Vault path
  `secret/claude-agent-service`, same `vault-kv` ClusterSecretStore,
  same 15m refresh). Exposes exactly one key: `api_bearer_token`.

- Adds two env vars to the `beadboard` container:
  - `CLAUDE_AGENT_SERVICE_URL` — static cluster URL
    (`http://claude-agent-service.claude-agent.svc.cluster.local:8080`)
  - `CLAUDE_AGENT_BEARER_TOKEN` — `secret_key_ref` pointing at the
    ESO-managed Secret, key `api_bearer_token`

- Adds `reloader.stakater.com/auto = "true"` on the Deployment's
  top-level metadata — matches the convention used by rybbit,
  claude-memory, onlyoffice. When ESO refreshes the K8s Secret
  because Vault rotated the token, Reloader restarts the pod so the
  new token is picked up (env vars are read once at boot).

- Adds `variable "beadboard_image_tag"` (default `"latest"`, with a
  one-line comment flagging the temporary default). The image
  reference now interpolates `${var.beadboard_image_tag}`. No tfvars
  file is touched — orchestrator will flip the default to the first
  real 8-char SHA once GHA publishes it.

## What is NOT in this change

- No GHA workflow additions. The pipeline that builds
  `registry.viktorbarzin.me:5050/beadboard` lives in the BeadBoard
  repo and is out of scope here.
- No Vault-side changes. `secret/claude-agent-service/api_bearer_token`
  already exists (it powers the claude-agent-service deployment
  itself).
- No Terraform `apply`. Orchestrator applies.

## Data flow

  Vault (secret/claude-agent-service)
    │  refresh every 15m
    ▼
  ESO → K8s Secret `beadboard-agent-service` (beads-server ns)
    │  env.valueFrom.secretKeyRef
    ▼
  BeadBoard pod (CLAUDE_AGENT_BEARER_TOKEN env)
    │  Authorization: Bearer <token>
    ▼
  claude-agent-service.claude-agent.svc:8080 /execute

On Vault rotation: ESO picks up new value at next refresh → K8s
Secret data changes → Reloader sees annotation + referenced Secret
changed → rolling-recreates the beadboard pod with the new token.

## Test Plan

### Automated
- `terraform fmt -recursive stacks/beads-server/` — clean (formatted
  the file once; subsequent run is a no-op).
- `terraform -chdir=stacks/beads-server validate` (after
  `terraform init -backend=false`) — `Success! The configuration is
  valid`. The 14 "Deprecated Resource" warnings are pre-existing
  (`kubernetes_namespace` vs `_v1` etc.) and unrelated to this
  change.

### Manual Verification
1. Orchestrator applies:
   `scripts/tg -chdir=stacks/beads-server apply`
2. Verify the ExternalSecret synced:
   `kubectl -n beads-server get externalsecret beadboard-agent-service`
   Expected: `Ready=True`, `SyncedAt` recent.
3. Verify the K8s Secret exists with one key:
   `kubectl -n beads-server get secret beadboard-agent-service -o jsonpath='{.data.api_bearer_token}' | base64 -d | head -c 8`
   Expected: first 8 chars of the bearer token.
4. Verify the deployment picked up the env vars:
   `kubectl -n beads-server get deploy beadboard -o yaml | grep -A2 CLAUDE_AGENT`
   Expected: both env entries present, bearer via `secretKeyRef`.
5. Verify the reloader annotation is on the Deployment metadata:
   `kubectl -n beads-server get deploy beadboard -o jsonpath='{.metadata.annotations.reloader\.stakater\.com/auto}'`
   Expected: `true`.
6. Verify the image tag resolved to the variable default (for now):
   `kubectl -n beads-server get deploy beadboard -o jsonpath='{.spec.template.spec.containers[0].image}'`
   Expected: `registry.viktorbarzin.me:5050/beadboard:latest`
   (will become `...:<sha>` once `beadboard_image_tag` default is
   updated).
7. Smoke-test the env var inside the pod:
   `kubectl -n beads-server exec deploy/beadboard -- sh -c 'printenv CLAUDE_AGENT_SERVICE_URL; printenv CLAUDE_AGENT_BEARER_TOKEN | head -c 8'`
   Expected: URL printed, first 8 chars of token printed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:53:51 +00:00
Viktor Barzin
82b7866bc9 [claude-agent-service] Remove orphaned DevVM SSH key wiring
## Context

The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.

This commit removes two cleanup artifacts left behind by that migration.

## This change

1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived
   skill doc for the obsolete SSH-based pattern. Already in `archived/`,
   harmless but noise; deleting prevents anyone copy-pasting the old approach.

2. Removes `kubernetes_secret.ssh_key` from
   `stacks/claude-agent-service/main.tf`. The Secret was created from the
   `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
   into the agent pod. The pod's `git-init` init container uses HTTPS +
   `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
   and `https://github.com/` URL via `git config url.<base>.insteadOf`, so no
   downstream `git` invocation could fall through to SSH even if it tried.

3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
   the SSH key resource was its only consumer.

## What is NOT in this change

- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
  Removing it requires read/modify/put of the full secret and the upside
  is one unused Vault key. Not worth it without strong justification.
- DevVM host decommission is out of scope (separate audit needed for
  non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
  left untouched per no-adjacent-refactor rule.

## Test plan

### Automated

- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
  pre-existing lines 464-505 are flagged; no new fmt warnings introduced
  by these deletions.

### Manual verification

1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
   The `ci_secrets` data source removal is plan-time only; does not appear
   in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via HTTP API to confirm pipeline still works:
   curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute
   with a minimal prompt; expect job completes with `exit_code=0`.

Closes: code-bck

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:31:15 +00:00
Viktor Barzin
9a2e920006 [rybbit] Narrow CF Worker routes to SITE_IDS hosts — fix free-tier quota breach
## Context

The `rybbit-analytics` Cloudflare Worker hit the free-tier quota of 100k
requests/day. CF GraphQL analytics showed **97,153 invocations in the last
24h**, up from ~0 before 2026-04-17 21:26 UTC when Rybbit script injection
migrated off the broken Traefik rewrite-body plugin (Yaegi ResponseWriter
bug on Traefik v3.6.12) onto this Worker.

Root cause: `wrangler.toml` registered two wildcard routes
(`viktorbarzin.me/*` + `*.viktorbarzin.me/*`) which match every Cloudflare-
proxied request on the zone. Only 27 of ~119 proxied hostnames appear in
`SITE_IDS` in `index.js`; the rest burn Worker invocations for nothing since
`siteId` is `null` and the Worker no-ops. Worse, the wildcard caught
`rybbit.viktorbarzin.me` itself — every tracker `script.js` fetch and event
POST round-trip was spawning its own Worker invocation (self-amplification).

CF GraphQL per-host breakdown (last 24h, zone `viktorbarzin.me`):
- Top waste (NOT in SITE_IDS): tuya-bridge 96.6k, beadboard 55.8k,
  terminal 30.2k, authentik 19.9k, claude-memory 12.6k
- Sum of 27 SITE_IDS hosts: 47.2k
- `rybbit.viktorbarzin.me` self-amplifier: 782
- Projected post-narrow: 46.4k/day (52% reduction, well under quota)
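
The projection can be reproduced from the figures above. A small sketch using the rounded numbers from the breakdown, so the results are approximate:

```shell
# Reproduce the post-narrow projection from the quoted per-host figures.
total_invocations=97153   # Worker invocations, last 24h
site_ids_hosts=47200      # sum over the 27 SITE_IDS hosts (rounded)
rybbit_self=782           # rybbit.viktorbarzin.me, excluded from the new routes

projected=$(( site_ids_hosts - rybbit_self ))
reduction_pct=$(( (total_invocations - projected) * 100 / total_invocations ))

echo "projected=${projected}/day reduction=${reduction_pct}%"
# prints: projected=46418/day reduction=52%
```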

## This change

Replaces the two wildcards with an explicit list of **26** of the 27
hostnames present in `SITE_IDS`. `rybbit.viktorbarzin.me` is deliberately excluded
even though it has a site ID — it serves `/api/script.js` (JS) and
`/api/track` (JSON), both of which fail the Worker's `text/html`
content-type guard anyway. Leaving it routed just burned invocations.

    BEFORE                              AFTER
    ──────────────────────────          ──────────────────────────────────
    viktorbarzin.me/*          ┐        viktorbarzin.me/*          ┐
    *.viktorbarzin.me/*        ┘        www.viktorbarzin.me/*      │
                                        actualbudget.vb.me/*       │
    → matches ~119 hosts                ... (26 total)             │ → matches
    → ~97k Worker inv/day                stirling-pdf.vb.me/*      │   only 26
    → rybbit → self-amplifies            vaultwarden.vb.me/*       ┘   specific
                                                                        hosts
                                        rybbit.vb.me INTENTIONALLY
                                        EXCLUDED (self-amplifier)
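
The before/after matching can be approximated with shell globs. This is only a sketch: `case` patterns stand in for Cloudflare route matching (an assumption), both function names are hypothetical, and only three of the 26 explicit hosts are listed:

```shell
# Would the old wildcard routes match this host?
matches_wildcards() {
  case $1 in
    viktorbarzin.me|*.viktorbarzin.me) return 0 ;;
    *) return 1 ;;
  esac
}

# Would the narrowed explicit list match it? (Abbreviated to 3 hosts.)
matches_explicit() {
  case $1 in
    viktorbarzin.me|www.viktorbarzin.me|calibre.viktorbarzin.me) return 0 ;;
    *) return 1 ;;
  esac
}

for host in viktorbarzin.me calibre.viktorbarzin.me \
            tuya-bridge.viktorbarzin.me rybbit.viktorbarzin.me; do
  before=no; after=no
  matches_wildcards "$host" && before=yes
  matches_explicit "$host" && after=yes
  printf '%-30s before=%s after=%s\n' "$host" "$before" "$after"
done
```

Every host reports `before=yes`; only the listed SITE_IDS hosts report `after=yes`, so `tuya-bridge` and `rybbit` no longer trigger the Worker.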

Deployment is unchanged — this Worker is not in Terraform. Deploy from
`stacks/rybbit/worker/` via:

    CLOUDFLARE_EMAIL=vbarzin@gmail.com \
    CLOUDFLARE_API_KEY=$(vault kv get -field=cloudflare_api_key secret/platform) \
    npx --yes wrangler@latest deploy

`wrangler deploy` replaces all worker routes on the zone with the list from
`wrangler.toml`, so there is no cleanup step. Already deployed today as
version `d7f83980-a499-40f5-ba55-f8e18d531863` — this commit just captures
the source of truth in git.

## What is NOT in this change

- Self-hosted injection (nginx `sub_filter` sidecar, compiled Traefik
  plugin). Deferred — revisit only if analytics traffic grows past 80k/day
  again, or if we add more high-traffic hosts to `SITE_IDS`.
- Cloudflare Workers Paid plan ($5/mo for 10M requests). User declined.
- Moving the Worker into Terraform. Out of scope.
- Any Rybbit backend/frontend changes. Rybbit itself continues running.

## Test plan

### Automated

Post-deploy CF API enumeration of zone routes:

    $ curl -s -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
        "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/workers/routes" \
      | jq -r '.result[] | "\(.pattern)\t→ \(.script)"' | wc -l
    26

    # Wildcards gone:
    $ curl -s ... | jq -r '.result[].pattern' | grep -c '\*\.'
    0

### Manual Verification

Script injection behaviour, verified via `curl`:

1. SITE_IDS host — script IS injected:

       $ curl -s -L https://viktorbarzin.me/ | grep -oE '<script[^>]*rybbit[^>]*>'
       <script src="https://rybbit.viktorbarzin.me/api/script.js"
         data-site-id="da853a2438d0" defer>

       $ curl -s -L https://calibre.viktorbarzin.me/ | grep -oE '<script[^>]*rybbit[^>]*>'
       <script src="https://rybbit.viktorbarzin.me/api/script.js"
         data-site-id="ce5f8aed6bbb" defer>

2. Non-SITE_IDS host — script NOT injected:

       $ curl -s -L https://tuya-bridge.viktorbarzin.me/ | grep -c 'data-site-id'
       0

3. `rybbit.viktorbarzin.me` bypasses Worker entirely — tracker returns raw JS:

       $ curl -sI https://rybbit.viktorbarzin.me/api/script.js | grep -i content-type
       content-type: application/javascript; charset=utf-8

### Reproduce locally

    # 1. Confirm the Worker sees only the 26 narrowed routes.
    CF_EMAIL=vbarzin@gmail.com
    CF_KEY=$(vault kv get -field=cloudflare_api_key secret/platform)
    ZONE_ID=fd2c5dd4efe8fe38958944e74d0ced6d
    curl -s -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
      "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/workers/routes" \
      | jq -r '.result[] | .pattern' | sort

    # 2. 24h after deploy, re-check invocation count — expect < 80k.
    curl -s https://api.cloudflare.com/client/v4/graphql \
      -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
      -H "Content-Type: application/json" \
      -d '{"query":"query($acc:String!,$since:Time!,$until:Time!){viewer{accounts(filter:{accountTag:$acc}){workersInvocationsAdaptive(limit:100,filter:{datetime_geq:$since,datetime_leq:$until}){sum{requests} dimensions{scriptName date}}}}}",
           "variables":{"acc":"02e035473cfc4834fb10c5d35470d8b4",
                        "since":"'"$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)"'",
                        "until":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}'

Follow-up monitoring tracked in code-dka (P3, 3-day check).

Closes: code-l9b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:15 +00:00
Viktor Barzin
a24cf8c689 [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.

Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.
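
The relationship between the two limits can be sanity-checked with simple arithmetic. A minimal sketch with hypothetical helper names; it handles only `Mi` and `Gi` suffixes and is not a full Kubernetes quantity parser:

```shell
# Convert a quantity with a Mi or Gi suffix to MiB.
to_mib() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Headroom = container memory limit minus tmpfs sizeLimit, in MiB.
# A negative value means tmpfs writes can hit the cgroup limit first.
shm_headroom_mib() {
  echo $(( $(to_mib "$1") - $(to_mib "$2") ))
}

shm_headroom_mib 256Mi 2Gi    # -1792: the default limit, OOM-killed before 2Gi
shm_headroom_mib 2560Mi 2Gi   # 512: the fixed configuration
```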

Updates the post-mortem P1 row to capture this for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:14 +00:00
Viktor Barzin
9ea7eec362 [actualbudget] Upgrade 26.3.0 → 26.4.0 for native Sankey report
## Context

Actual Budget v26.4.0 (released 2026-04-05) re-introduces the Sankey
chart report for income/expense flow visualization (PR #7220). An earlier
experimental implementation was deleted in March 2024 (PR #2417) but a
proper reimplementation with "Other" grouping, date-range selection, and
percentage toggle is now shipped behind the experimental feature flag.

Viktor wanted Sankey visualization of budget cash flow; this is the lowest-
cost path since his existing Actual Budget deployment already holds all the
transaction data.

## This change

Bumps the `tag` input on all three factory module calls (viktor, anca, emo)
from `26.3.0` to `26.4.0`. No breaking changes, schema migrations, or config
changes per the 26.4.0 release notes.

## Rollout

Applied via `scripts/tg apply --non-interactive`. All three pods rolled
successfully to `actualbudget/actual-server:26.4.0` and passed readiness
probes. The http-api sidecars (`jhonderson/actual-http-api`) were untouched.

## Post-upgrade

Users need to toggle Settings → Experimental features → Sankey report to
access the chart, then Reports → new Sankey widget.

Closes: code-oof

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:19:27 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
b41528e564 [docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)
## Context

On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.

## Root cause

The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.
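
The steady-state figure follows directly from the measured rate and the retention window. A small sketch of that arithmetic, assuming a constant creation rate (helper names are hypothetical):

```shell
# Steady-state file count = creation rate x retention window.
steady_state_files() {
  # $1 = files per minute, $2 = retention in days
  echo $(( $1 * 60 * 24 * $2 ))
}

# Whole hours of accumulation needed to reach a given file count.
hours_to_reach() {
  # $1 = file count, $2 = files per minute
  echo $(( $1 / $2 / 60 ))
}

steady_state_files 18 7    # 181440, the "~180k files" above
hours_to_reach 44000 18    # 40
```

At the measured rate, the ~44k files that filled the tmpfs correspond to roughly 40 hours of accumulation, far short of the 7-day retention window.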

## What's captured

- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
  red-herring suspicion of the new Rybbit Worker, cached sessions masking
  the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
  on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
  per-user symptoms fast; UI-managed Authentik config is invisible to git)

## Follow-ups not in this commit

- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
  (see discussion in beads code-zru)
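
Until the Prometheus alert exists, the 80% threshold from the first follow-up can be expressed as a small local check. This is a sketch only, not the planned alert rule: it uses the standard-library `shutil.disk_usage`, the function names are hypothetical, and it would have to run somewhere that can see the outpost pod's /dev/shm.

```python
import shutil

ALERT_THRESHOLD = 0.80  # the "> 80%" figure from the follow-up list


def usage_ratio(path: str = "/dev/shm") -> float:
    """Return used/total for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def should_alert(ratio: float, threshold: float = ALERT_THRESHOLD) -> bool:
    """True when usage is strictly above the threshold."""
    return ratio > threshold


# Hypothetical usage:
# if should_alert(usage_ratio("/dev/shm")):
#     print("outpost /dev/shm above 80%")
```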

Closes: code-zru

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00
Viktor Barzin
6e19dce99e [docs] automated-upgrades: document long-lived OAuth + expiry monitoring
Adds the `claude_oauth_token` Vault entries to the secrets table, a
new "OAuth token lifecycle" section explaining the two CLI auth modes
(`claude login` vs `claude setup-token`) and why we picked the latter
for headless use, the Ink 300-col PTY gotcha from today's harvest,
and the monitoring/rotation playbook for the new expiry alerts.

Follow-up to 8a054752 and 50dea8f0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:00:07 +00:00
Viktor Barzin
e4a96591b3 .gitignore: ignore Terragrunt-generated cloudflare_provider.tf and tiers.tf
These files are regenerated by Terragrunt on every run and have a
"# Generated by Terragrunt. Sig: ..." header. Earlier today multiple parallel
agents working on bd-w97 accidentally staged them, requiring two corrective
commits (3e11bd1b, 4eb68d6b). Preventing the recurrence at the source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:36:45 +00:00
Viktor Barzin
4eb68d6b1a [meshcentral] Remove accidentally-committed Terragrunt-generated files
My previous commit (c0ac24a5, [meshcentral] Import existing cluster
state + PVC) unintentionally committed two Terragrunt-generated
provider/locals files. These are auto-generated on every plan/apply
(marked 'Generated by Terragrunt. Sig:') and do not belong in the
repo. Mirrors 3e11bd1b which did the same cleanup for kyverno.

Removes from tracking only — files remain on disk so concurrent work
is unaffected.

Updates: code-w97
2026-04-18 12:35:44 +00:00
Viktor Barzin
c0ac24a54c [meshcentral] Import existing cluster state + PVC (bd-w97)
Imported the two proxmox-lvm-encrypted PVCs into the Tier 1 PG state.
All other declared resources (namespace, deployment, service, ingress,
NFS-backed PV/PVC, tls secret) were already state-managed.

Imported:
- kubernetes_persistent_volume_claim.data_encrypted
    (meshcentral/meshcentral-data-encrypted, proxmox-lvm-encrypted, 1Gi)
- kubernetes_persistent_volume_claim.files_encrypted
    (meshcentral/meshcentral-files-encrypted, proxmox-lvm-encrypted, 1Gi)

Pre-import plan: 2 to add, 3 to change, 0 to destroy
Post-import plan: 0 to add, 5 to change, 0 to destroy (benign drift)
Apply: 0 added, 5 changed, 0 destroyed

Benign drift reconciled on apply:
- PVC wait_until_bound attribute aligned (true -> false)
- tls-secret Kyverno sync labels cleared
- deployment/namespace annotation drift

Source reconciliation: none required. Both declared PVCs already match
the cluster (proxmox-lvm-encrypted, 1Gi, RWO, names identical). NFS
PV/PVC meshcentral-backups-host (nfs-truenas, 10Gi, RWX) remained
bound throughout. Deployment kept 1/1 replicas on the same pod
(meshcentral-6c4f47c6f8-mj8sk).

Commits the auto-generated cloudflare_provider.tf and tiers.tf so the
stack matches the repo convention used by its peers.

Updates: code-w97
2026-04-18 12:35:26 +00:00
Viktor Barzin
3e11bd1b67 [kyverno] Remove accidentally-committed Terragrunt-generated files
My previous commit (dacf3d9e, [kyverno] Import existing cluster state)
unintentionally picked up two Terragrunt-generated provider/locals
files from the meshcentral stack that a parallel worker had just
created. These are auto-generated on every plan/apply (marked
"Generated by Terragrunt. Sig:") and do not belong in the repo.

Removes from tracking only — files remain on disk so concurrent work
is unaffected.

Files removed:
- stacks/meshcentral/cloudflare_provider.tf
- stacks/meshcentral/tiers.tf

No impact on the kyverno import work. State-level changes from
dacf3d9e (3 imports + 3 in-place updates) stand.

Updates: code-w97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:34:59 +00:00