Commit graph

2852 commits

Author SHA1 Message Date
Viktor Barzin
8a99be1194 [infra] Document HCL import {} block convention [ci skip]
## Context

Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}`
block pattern (Terraform 1.5+) as the canonical way to bring live
cluster / Vault / Cloudflare resources under TF management.

Historically the repo has used `terraform import` on the CLI for
adoptions. That path has three real problems:

1. **Not reviewable** — it's an out-of-band state mutation that leaves
   no trace in git beyond the subsequent `resource {}` block. A
   reviewer sees only the new resource, not the adoption intent.
2. **Not plan-safe** — if the resource address or ID is wrong, the CLI
   path commits the mistake to state before anyone can catch it.
3. **Not idempotent** — a failed apply mid-import leaves state in a
   confusing half-adopted shape.

`import {}` blocks fix all three: the adoption intent is in the PR
diff, `scripts/tg plan` shows the import as its own plan line (mistyped
IDs fail before apply), and re-applying after a partial failure just
retries the import step.
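As a sketch of what an adoption looks like with this pattern (hypothetical — the
resource, ID, and `<namespace>/<release-name>` ID format are illustrative; Wave 5
lands the real kured/Calico adoptions):

```hcl
# Hypothetical adoption sketch — this commit is docs-only.
# 1. Write the resource block describing the live object.
resource "helm_release" "kured" {
  name      = "kured"
  namespace = "kube-system"
  chart     = "kured"
  # ... remaining attributes reconciled against the live release
}

# 2. Add the import stanza next to it (Terraform 1.5+).
import {
  to = helm_release.kured
  id = "kube-system/kured" # assumed helm_release format: <namespace>/<release-name>
}

# 3. `scripts/tg plan` until the adoption plans to zero diff.
# 4. Apply.  5. Drop the import {} stanza.
```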

We're canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands,
so the reviewer of those imports has the rule in front of them.

## This change

- `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks,
  Not the CLI" section sitting right after Execution. Includes the
  canonical 5-step workflow (write resource → add import stanza → plan
  to zero → apply → drop stanza), the reasoning, and a per-provider ID
  format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1,
  authentik_provider_proxy, cloudflare_record).
- `.claude/CLAUDE.md`: one-line cross-reference at the end of the
  Terraform State two-tier section pointing back to AGENTS.md. Keeps
  CLAUDE.md's quick-reference density intact while making sure the rule
  is reachable from the Claude-instructions path.

## What is NOT in this change

- Any actual imports — this is a pure docs landing. Wave 5 will
  demonstrate the pattern on kured + Calico.
- Replacing the handful of existing `terraform import`-style adoptions
  in the repo history — `import {}` blocks are delete-after-apply, so
  retro-documenting them is not useful.

Closes: code-[wave8-task]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:10:05 +00:00
Viktor Barzin
2b8bb849c0 [infra] Bump claude-agent-service + beadboard image tags
## Context
Two rolling updates tied to the BeadBoard dispatch-button work (code-kel):

1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent
   (files in /usr/share/agent-seed/), the beads-task-runner agent, and
   hmac.compare_digest bearer verification. The tag moves from 382d6b14
   to 0c24c9b6 (monorepo HEAD).
2. The beadboard Deployment in beads-server now consumes
   CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image
   needs the Dispatch button + /api/agent-dispatch + /api/agent-status
   routes. Tag moves from :latest to :17a38e43 (fork HEAD on
   github.com/ViktorBarzin/beadboard).

## What this change does
- Flips `local.image_tag` in claude-agent-service main.tf.
- Drops the "temporary" comment on `beadboard_image_tag` and sets the
  default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md
  "Use 8-char git SHA tags — `:latest` causes stale pull-through cache").

## Test Plan
### Automated
- Both images already pushed to registry.viktorbarzin.me{:5050}/ :
  - claude-agent-service:0c24c9b6 verified via
    `docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/
    contains both seed files.
  - beadboard:17a38e43 pushed, digest cd0d3c47.
- terraform fmt/validate clean on both stacks from the earlier commits.

### Manual Verification
1. Push triggers Woodpecker default.yml.
2. Expected: both stacks apply; claude-agent-service pod rolls (new
   seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch
   + copies beads-task-runner.md), beadboard pod rolls with new env vars
   sourced from beadboard-agent-service ExternalSecret.
3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:`
   should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard
   -o yaml | grep image:` should show :17a38e43.

Closes: code-kel
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:24:37 +00:00
Viktor Barzin
8d94688dde [infra] Suppress Kyverno label drift on module.tls_secret Secrets [ci skip]
## Context

Wave 3B of the state-drift consolidation audit (plan section "Shared Kyverno
drift-suppression") identified a second Kyverno admission-induced drift
class, complementary to the `# KYVERNO_LIFECYCLE_V1` ndots dns_config suppression
landed in c9d221d5. The ClusterPolicy `sync-tls-secret` runs on every
`kubernetes_secret` created via `modules/kubernetes/setup_tls_secret` and
stamps the following labels on the generated Secret:

  app.kubernetes.io/managed-by          = kyverno
  generate.kyverno.io/policy-name       = sync-tls-secret
  generate.kyverno.io/policy-namespace  = ""
  generate.kyverno.io/rule-name         = sync-tls-secret
  generate.kyverno.io/source-kind       = Secret
  generate.kyverno.io/source-namespace  = kyverno
  generate.kyverno.io/source-uid        = <uid>
  generate.kyverno.io/source-version    = v1
  generate.kyverno.io/source-group      = ""
  generate.kyverno.io/clone-source      = ""

Terraform does not manage any labels on this Secret, so every `terragrunt
plan` showed all 10 labels as `-> null`. This was observed on the dawarich
stack (one of the 93 callers of setup_tls_secret) and reproduces identically
on any stack that consumes this module. Root cause ticket: beads `code-seq`.

## This change

Adds a single `lifecycle { ignore_changes = [metadata[0].labels] }` block
to `modules/kubernetes/setup_tls_secret/main.tf`. One module edit,
93 callers' `module.tls_secret.kubernetes_secret.tls_secret` drift cleared.

The marker comment `# KYVERNO_LIFECYCLE_V1` stays consistent with the Wave 3A
convention (c9d221d5) — the rule now stands for "any Kyverno-induced
drift", not only ndots dns_config. AGENTS.md's "Kyverno Drift Suppression"
section will grow to catalog the fields ignored; this commit keeps the scope
tight to the code change.
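For reference, the module edit is essentially this (sketch — the existing
resource arguments are elided, not changed):

```hcl
# modules/kubernetes/setup_tls_secret/main.tf
resource "kubernetes_secret" "tls_secret" {
  # ... existing metadata / data / type arguments unchanged ...

  lifecycle {
    ignore_changes = [metadata[0].labels] # KYVERNO_LIFECYCLE_V1
  }
}
```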

## What is NOT in this change

- Namespace-level Goldilocks label drift (`goldilocks.fairwinds.com/vpa-update-mode = off`)
  — a different admission controller, different resource, different fix.
  Filed as beads `code-dwx` for a follow-up sweep across all 105 Tier 1
  stacks.
- AGENTS.md documentation expansion — will land alongside the Goldilocks
  sweep so both patterns are catalogued together.
- Retroactive marker on other Kyverno-generated Secrets — the sync-tls-secret
  policy is the only generate policy that produces Secrets in this repo
  (verified: `kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'` + cross-reference).

## Verification

Dawarich stack:
```
Before: Plan: 0 to add, 2 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
   (module.tls_secret.kubernetes_secret.tls_secret — Kyverno label drift)

After:  Plan: 0 to add, 1 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
```

Closes: code-seq (partial — tls_secret branch)
Refs: code-dwx (Goldilocks follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:23:02 +00:00
Viktor Barzin
f79e3c563e [infra] Remove mysql InnoDB Cluster + Operator HCL (Phase 4 cleanup) [ci skip]
## Context

On 2026-04-16 (memory #711) MySQL was migrated from InnoDB Cluster (3-member
Group Replication + MySQL Operator) to a raw `kubernetes_stateful_set_v1.mysql_standalone`
on `mysql:8.4`. The migration preserved the `mysql.dbaas` Service name
(selector switched to the standalone pod), all 20 databases/688 tables/14
users were dump-restored, and Vault rotated credentials against the new
instance. The InnoDB Cluster has been dark since — Phase 4 was to remove
the dead code and decommission its cluster-side Helm state.

Memory #711 explicitly notes Phase 4 as: "Remove helm_release.mysql_cluster
+ mysql_operator + namespace + RBAC + Delete PVC datadir-mysql-cluster-0
(30Gi) + Delete mysql-operator namespace + CRDs + stale Vault roles."

## This change

Phase 4 scope executed in this session (beads code-qai):

1. `terragrunt destroy -target` against 6 resources in the dbaas Tier 0 stack:
   - `module.dbaas.helm_release.mysql_cluster` — uninstalled InnoDBCluster CR
     + MySQL Router Deployment + 8 Services (mysql-cluster, -instances,
     ports 6446/6448/6447/6449/6450/8443, etc.)
   - `module.dbaas.helm_release.mysql_operator` — uninstalled MySQL Operator
     Deployment, InnoDBCluster CRD + webhook, operator ClusterRoles
   - `module.dbaas.kubernetes_namespace.mysql_operator` — deleted the ns
   - `module.dbaas.kubernetes_cluster_role.mysql_sidecar_extra` — leftover
     permissions patch that existed to work around the sidecar's kopf
     permissions bug; unused without the operator
   - `module.dbaas.kubernetes_cluster_role_binding.mysql_sidecar_extra`
   - `module.dbaas.kubernetes_config_map.mysql_extra_cnf` — used to override
     `innodb_doublewrite=OFF` via subPath mount; standalone does not need it
2. `kubectl delete pvc datadir-mysql-cluster-0 -n dbaas` — Helm does not
   garbage-collect PVCs; 30Gi reclaimed.
3. Removed 295 lines (lines 86–380) from `stacks/dbaas/modules/dbaas/main.tf`
   covering the `#### MYSQL — InnoDB Cluster via MySQL Operator` section
   and all six resources above.

The first destroy hit a Helm timeout on `mysql-cluster` uninstall ("context
deadline exceeded"). Uninstallation had in fact completed cluster-side by
that point but TF rolled back the state delta. A second `terragrunt destroy
-target` call with the same args resolved cleanly — destroyed the remaining
2 tracked resources (the first pass cleared 4) and encrypted+committed the
Tier 0 state.

## What is NOT in this change

- CRDs (`innodbclusters.mysql.oracle.com`, etc.) — Helm does delete these
  on uninstall. Verified clean: `kubectl get crd | grep mysql.oracle.com`
  returns nothing.
- Orphan PVC `datadir-mysql-cluster-0` — already deleted via kubectl; not
  a TF-managed resource.
- Stale Vault DB roles (health, linkwarden, affine, woodpecker,
  claude_memory, crowdsec, technitium) for services migrated MySQL→PG —
  sandbox denies `vault list database/roles` as credential scouting, so
  the user handles this manually.
- 2 state-commits preceding this one (`30fa411b`, `6cf3575e`) are automatic
  SOPS-encrypted-state commits produced by `scripts/tg` after each
  `terragrunt destroy` pass. Standard Tier 0 workflow.

## Verification

```
$ helm list -A | grep -E 'mysql-cluster|mysql-operator'
(no output)

$ kubectl get ns mysql-operator
Error from server (NotFound): namespaces "mysql-operator" not found

$ kubectl get pvc -n dbaas datadir-mysql-cluster-0
Error from server (NotFound): persistentvolumeclaims "datadir-mysql-cluster-0" not found

$ kubectl get pod -n dbaas -l app.kubernetes.io/instance=mysql-standalone
NAME                 READY   STATUS    RESTARTS       AGE
mysql-standalone-0   1/1     Running   1 (118m ago)   2d

$ ../../scripts/tg state list | grep -i 'mysql_operator\|mysql_cluster\|mysql_sidecar\|mysql_extra_cnf'
(no output)

$ ../../scripts/tg plan | grep -E 'mysql_cluster|mysql_operator|mysql_sidecar|mysql_extra_cnf'
(no output — Wave 2 drift is gone; remaining plan items are pre-existing
drift unrelated to this change, see Wave 3 + in-flight payslip work)
```

## Reproduce locally
1. `git pull`
2. `cd stacks/dbaas && ../../scripts/tg state list | grep mysql_cluster` → no output
3. `helm list -A | grep mysql-cluster` → no output

Closes: code-qai

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:19:48 +00:00
Viktor Barzin
6cf3575ed9 state(dbaas): update encrypted state 2026-04-18 19:17:31 +00:00
Viktor Barzin
30fa411bf7 state(dbaas): update encrypted state 2026-04-18 19:17:20 +00:00
Viktor Barzin
61e94c21fe state(dbaas): update encrypted state 2026-04-18 19:16:41 +00:00
Viktor Barzin
c75beaac6c wealthfolio: bump memory 64Mi → 1Gi (limit) / 256Mi (request)
## Context

Pod was OOMKilled after today's broker-sync Phase 3 import grew the
activity DB from ~10 rows (Phase 0 demo) to ~700 (Fidelity + cash-flow
matches across 6 accounts). `/api/v1/net-worth` and
`/valuations/history` materialise the full history in memory to render
the dashboard chart.

`kubectl describe pod` showed Back-off restarting failed container;
`kubectl top pod` reported 14Mi steady-state but spikes crossed the
64Mi cap.

## This change

Bump container resources to:
- requests.memory: 64Mi → 256Mi
- limits.memory:  64Mi → 1Gi

CPU unchanged. 1Gi is generous for the current 700-activity DB +
chart rendering, with headroom for another year of growth before we
need to revisit (VPA will flag if actual use exceeds upperBound).
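The resulting container resources block looks roughly like this (values match
the jsonpath output under Manual below):

```hcl
resources {
  requests = {
    cpu    = "10m"
    memory = "256Mi"
  }
  limits = {
    memory = "1Gi"
  }
}
```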

## Verification

### Automated
`scripts/tg apply stacks/wealthfolio` → Apply complete! Resources: 0
added, 4 changed, 0 destroyed.

### Manual
$ kubectl -n wealthfolio get pod -l app=wealthfolio -o jsonpath='{.items[0].spec.containers[0].resources}'
→ {"limits":{"memory":"1Gi"},"requests":{"cpu":"10m","memory":"256Mi"}}

$ kubectl -n wealthfolio get pods -l app=wealthfolio
NAME                           READY   STATUS    RESTARTS   AGE
wealthfolio-86c8696b9c-nzwkf   1/1     Running   0          51s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:13:05 +00:00
Viktor Barzin
43b4e1d372 [payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role
## Context

New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.

Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.

## What

### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
  WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
  `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
  `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
  design), `wait-for postgresql.dbaas:5432` annotation, init container runs
  `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
  dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
  uid `payslips-pg`) reading password from the db-creds K8s Secret.

### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).

### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
  (username `payslip_ingest`, 7d rotation).
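A sketch of the new static role (the backend/db_name references are assumptions
mirroring the existing connection resource; rotation period is in seconds):

```hcl
resource "vault_database_secret_backend_static_role" "pg_payslip_ingest" {
  backend         = vault_database_secret_backend_connection.postgresql.backend
  db_name         = vault_database_secret_backend_connection.postgresql.name
  name            = "pg-payslip-ingest"
  username        = "payslip_ingest"
  rotation_period = 604800 # 7 days
}
```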

### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
  idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
  `pg-cluster-1`.
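Sketch of the idempotent creation (psql auth and flags are assumptions — the
real resource copies whatever `pg_terraform_state_db` does):

```hcl
resource "null_resource" "pg_payslip_ingest_db" {
  provisioner "local-exec" {
    command = <<-EOT
      # Postgres has no CREATE ROLE/DATABASE IF NOT EXISTS, so check first.
      kubectl exec -n dbaas pg-cluster-1 -- psql -U postgres -tAc \
        "SELECT 1 FROM pg_roles WHERE rolname='payslip_ingest'" | grep -q 1 || \
      kubectl exec -n dbaas pg-cluster-1 -- psql -U postgres \
        -c "CREATE ROLE payslip_ingest LOGIN"
      kubectl exec -n dbaas pg-cluster-1 -- psql -U postgres -tAc \
        "SELECT 1 FROM pg_database WHERE datname='payslip_ingest'" | grep -q 1 || \
      kubectl exec -n dbaas pg-cluster-1 -- psql -U postgres \
        -c "CREATE DATABASE payslip_ingest OWNER payslip_ingest"
      kubectl exec -n dbaas pg-cluster-1 -- psql -U postgres \
        -c "GRANT ALL PRIVILEGES ON DATABASE payslip_ingest TO payslip_ingest"
    EOT
  }
}
```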

### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
  JSON object matching the schema to stdout. No network, no file writes outside /tmp,
  no markdown fences.

## Trade-offs / decisions

- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
  initially described. The Alembic migration still creates a `payslip_ingest`
  schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
  and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
  surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
  `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
  auto-creator).

## Test Plan

### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.

terraform fmt -check -recursive
(exit 0)

python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```

### Manual Verification (post-merge)

Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.

Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
   (first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200.

End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.

Closes: code-do7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:07:05 +00:00
Viktor Barzin
81e7c3d6ee state(dbaas): update encrypted state 2026-04-18 18:59:51 +00:00
Viktor Barzin
bde713f8a4 broker-sync: add Fidelity PlanViewer CronJob (suspended)
## Context

Viktor's UK workplace pension is at Fidelity PlanViewer. The broker-sync
provider + CLI landed in the broker-sync repo (commits 804e6a8 +
7c9be54); this commit adds the infra bits so the monthly sync runs
in-cluster like the other broker-sync jobs.

One successful manual backfill on 2026-04-18 pulled 51 contributions +
valuation into a new WF WORKPLACE_PENSION account; Net Worth moved from
£865k → £1,003k. This commit productionises that flow.

## This change

- New kubernetes_cron_job_v1.fidelity in stacks/broker-sync/main.tf:
  - Schedule: 05:00 UK on the 20th of each month (after mid-month
    payroll settles; finance data shows credits on the 13th-18th).
  - Suspended by default — unsuspend once broker-sync image is rebuilt
    with Chromium baked in (Dockerfile change shipped separately in the
    broker-sync repo).
  - Init container materialises the storage_state JSON (projected from
    the broker-sync-secrets K8s Secret, synced from Vault by ESO) to the
    encrypted PVC at /data/fidelity_storage_state.json. Chromium then
    loads it.
  - Container: broker-sync fidelity-ingest with WF + FIDELITY_* env
    vars. Memory request 512Mi, limit 1280Mi — Chromium is hungry.
  - Lifecycle ignore_changes on dns_config per the KYVERNO_LIFECYCLE_V1
    convention documented in AGENTS.md.
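Rough shape of the new resource (schedule string, image reference, and block
layout are illustrative; init container and env vars are omitted — the exact
spec is in stacks/broker-sync/main.tf):

```hcl
resource "kubernetes_cron_job_v1" "fidelity" {
  metadata {
    name      = "broker-sync-fidelity"
    namespace = "broker-sync"
  }
  spec {
    schedule = "0 5 20 * *" # 05:00 on the 20th; assumes the cluster runs UK/UTC time
    suspend  = true         # flip once the Chromium-enabled image ships
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            container {
              name  = "fidelity-ingest"
              image = "registry.viktorbarzin.me:5050/broker-sync:<sha>" # placeholder
              resources {
                requests = { memory = "512Mi" }
                limits   = { memory = "1280Mi" }
              }
            }
          }
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
  }
}
```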

## What is NOT in this change

- The Vault keys fidelity_storage_state + fidelity_plan_id — already
  staged via `vault kv patch` on 2026-04-18.
- Dockerfile Chromium install — in broker-sync repo (commit 7c9be54).
- Prometheus BrokerSyncFidelityFailed alert — deferred until the
  CronJob has run successfully for a month and we have a baseline.
  Existing broker-sync CronJobs also don't have per-job alerts yet;
  filing as a follow-up.

## Verification

### Automated
terraform fmt ran clean. `terragrunt plan` would show a single new
kubernetes_cron_job_v1 (suspended, so no pods scheduled).

### Manual (after apply + image rebuild)

1. Build + push broker-sync:<sha> with Chromium.
2. `scripts/tg apply stacks/broker-sync` (updates image_tag + adds
   fidelity CronJob).
3. Unsuspend: `kubectl -n broker-sync patch cronjob broker-sync-fidelity \
     -p '{"spec":{"suspend":false}}'` OR flip the tf flag.
4. Trigger a test run: `kubectl -n broker-sync create job \
     fidelity-test --from=cronjob/broker-sync-fidelity`.
5. Expect logs: `fidelity-ingest: fetched=N new=N imported=N failed=0`.
6. On FidelitySessionError: run `broker-sync fidelity-seed` locally +
   `vault kv patch secret/broker-sync fidelity_storage_state=@...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 18:51:20 +00:00
Viktor Barzin
4f54c959d7 [infra] Remove iscsi-csi stack — TrueNAS decommissioned [ci skip]
## Context

The iSCSI CSI driver was deployed against a TrueNAS appliance at 10.0.10.15
that was decommissioned 2026-04-12 when all Immich PVCs migrated to the
proxmox-lvm-encrypted storage class. The stack has been dead code since —
live survey (2026-04-18):

- iscsi-csi namespace: empty (0 resources), 27h old (since last TF apply)
- No iscsi CSI driver registered in the cluster
- No PVs/PVCs reference iscsi
- TF state held only the empty namespace
- helm_release.democratic_csi was not in state (already gone pre-session)

Leaving the stack around meant every `terragrunt run --all plan` would
drift (TF wanted to create the helm release again) and every CI run would
try to pull `truenas_api_key` + `truenas_ssh_private_key` from Vault
against a TrueNAS that no longer exists. Beads tracking: code-gw0.

## This change

- `scripts/tg destroy` in stacks/iscsi-csi (1 resource destroyed — the namespace).
- `rm -rf stacks/iscsi-csi/` — removes modules/, main.tf, terragrunt.hcl,
  secrets symlink, and the 4 terragrunt-generated files (backend.tf,
  providers.tf, cloudflare_provider.tf, tiers.tf).
- Dropped PG schema `iscsi-csi` on `10.0.20.200:5432/terraform_state`
  (its `states` table had 1 row — the current state — dropped by CASCADE).
- Deleted the empty `gadget` namespace (112d old, no owner — unrelated
  dead namespace swept as part of the same Wave 1 cleanup).

## What is NOT in this change

- Vault database role cleanup for the 7 MySQL-migrated services
  (health, linkwarden, affine, woodpecker, claude_memory, crowdsec,
  technitium). The sandbox denies listing Vault DB roles as credential
  enumeration, so this is flagged for user to do manually via:
  `vault delete database/roles/<name>` after checking
  `vault list sys/leases/lookup/database/creds/<name>/` for active leases.

## Reproduce locally
1. `git pull`
2. `ls stacks/ | grep iscsi` → no output
3. `kubectl get ns iscsi-csi gadget` → both NotFound
4. psql to 10.0.20.200:5432/terraform_state → `\dn` shows no iscsi-csi schema

## Test Plan

### Automated
```
$ kubectl --kubeconfig config get ns iscsi-csi
Error from server (NotFound): namespaces "iscsi-csi" not found

$ kubectl --kubeconfig config get ns gadget
Error from server (NotFound): namespaces "gadget" not found

$ PGPASSWORD=... psql -h 10.0.20.200 -U ... -d terraform_state -c '\dn' | grep iscsi
(no output)

$ ls stacks/iscsi-csi 2>&1
ls: cannot access 'stacks/iscsi-csi': No such file or directory
```

### Manual Verification
None required — destroy was a no-op for workloads (namespace was empty).

Closes: code-b6l
Closes: code-gw0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 18:49:40 +00:00
Viktor Barzin
e1d20457c4 [infra/claude-agent-service] Seed beads metadata + scratch dir at runtime
## Context

Review of the BeadBoard Dispatch wiring found that the claude-agent-service
Dockerfile's `COPY beads/metadata.json /workspace/.beads/metadata.json` and
`COPY agents/beads-task-runner.md /home/agent/.claude/agents/...` both land
on paths that are volume-mounted at runtime:

  - `/workspace` → `claude-agent-workspace-encrypted` PVC (main.tf:394-398)
  - `/home/agent/.claude` → `claude-home` emptyDir (main.tf:424-427)

Kubernetes mounts hide image-layer content at those paths, so the COPYs are
dead. The companion commit in `claude-agent-service` restages both files to
`/usr/share/agent-seed/` (an image-layer path that is never mounted).

Additionally, the beads-task-runner agent rails expect
`/workspace/scratch/<job_id>/` to exist, but nothing was creating it.

## Layout before / after

```
  Before (dead COPYs):

    image layer          runtime (mounted volumes hide the files)
    -----------          -----------------------------------
    /workspace/          <- hidden by PVC mount
      .beads/
        metadata.json    <- UNREACHABLE
    /home/agent/.claude/ <- hidden by emptyDir mount
      agents/
        beads-task-runner.md  <- UNREACHABLE

  After (init container seeds volumes at pod start):

    image layer          runtime
    -----------          ------------------------------------
    /usr/share/agent-seed/
      beads-metadata.json    --+
      beads-task-runner.md    --+-> copied by seed-beads-agent init
                                    container into the mounted volumes
                                    on every pod start:
                                      /workspace/.beads/metadata.json
                                      /workspace/scratch/
                                      /home/agent/.claude/agents/beads-task-runner.md
```

## What

### New init container: `seed-beads-agent`
  - Positioned AFTER `git-init`, BEFORE the main container.
  - Uses the same service image (`${local.image}:${local.image_tag}`) — the
    seed files are baked in at `/usr/share/agent-seed/`.
  - Runs as default uid 1000 (the PVCs are already chowned by `fix-perms`).
  - Shell body:
      mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
      cp /usr/share/agent-seed/beads-metadata.json     /workspace/.beads/metadata.json
      cp /usr/share/agent-seed/beads-task-runner.md    /home/agent/.claude/agents/beads-task-runner.md
  - Mounts: `workspace` at `/workspace`, `claude-home` at `/home/agent/.claude`.
  - Resources: 32Mi requests / 64Mi limits (matches `fix-perms`/`copy-claude-creds`).
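In HCL the init container looks roughly like this (sketch; attribute layout may
differ from the actual block in main.tf):

```hcl
init_container {
  name    = "seed-beads-agent"
  image   = "${local.image}:${local.image_tag}"
  command = ["sh", "-c", <<-EOT
    mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
    cp /usr/share/agent-seed/beads-metadata.json  /workspace/.beads/metadata.json
    cp /usr/share/agent-seed/beads-task-runner.md /home/agent/.claude/agents/beads-task-runner.md
  EOT
  ]
  volume_mount {
    name       = "workspace"
    mount_path = "/workspace"
  }
  volume_mount {
    name       = "claude-home"
    mount_path = "/home/agent/.claude"
  }
  resources {
    requests = { memory = "32Mi" }
    limits   = { memory = "64Mi" }
  }
}
```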

### Formatting
  - `terraform fmt -recursive` also normalised whitespace in the token-expiry
    locals block and the CronJob container definition. No semantic change.

## What is NOT in this change

  - No image tag bump. The Dockerfile refactor that produces the
    `/usr/share/agent-seed/` path lands in the claude-agent-service repo
    and will roll in on the next CI build. Until that build ships and the
    tag is bumped in this file, the new init container will `cp` from a
    path that doesn't exist yet — so do NOT apply this commit until the
    corresponding image tag bump is ready. The commit is declarative prep.
  - No changes to storage class, RBAC, Service, or any other init.
  - The main container mounts remain unchanged — only the init containers
    prepare volume contents.

## Test Plan

### Automated

```
$ terraform fmt -check -recursive stacks/claude-agent-service/
(no output — clean)

$ terraform -chdir=stacks/claude-agent-service/ init -backend=false
Terraform has been successfully initialized!

$ terraform -chdir=stacks/claude-agent-service/ validate
Warning: Deprecated Resource (pre-existing; use kubernetes_namespace_v1)
Success! The configuration is valid, but there were some validation warnings
as shown above.
```

### Manual Verification (after image bump + apply)

1. Bump `local.image_tag` in main.tf to the SHA of a build that has
   `/usr/share/agent-seed/*` (verify with `docker inspect $IMAGE | jq ...`
   or `kubectl run tmp --image ... -- ls /usr/share/agent-seed`).
2. `scripts/tg apply stacks/claude-agent-service`
3. `kubectl -n claude-agent get pods -w` — all init containers complete.
4. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- ls -la /workspace/.beads/metadata.json /home/agent/.claude/agents/beads-task-runner.md /workspace/scratch`
   Expected: all three paths exist; first two are regular files with the
   expected content, `scratch` is a directory.
5. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- jq -r .dolt_server_host /workspace/.beads/metadata.json`
   Expected: `dolt.beads-server.svc.cluster.local`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:23:19 +00:00
Viktor Barzin
c9d221d578 [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the regular expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.

What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
  Attached inline on every `spec[0].template[0].spec[0].dns_config` line
  (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
  existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
  `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
  the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
  — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
  save for the inline comment.
- The fallback module the original plan anticipated — skipped because
  Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
  (claude-agent-service, freedify/factory, hermes-agent). Reverted to
  keep this commit scoped to the convention rollout.

## Before / after

Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27

$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21

# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.

## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
   suppression.

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
Viktor Barzin
a62b43d19e [infra] Document intended ignore_changes drift-workarounds [ci skip]
## Context

The infra repo has 31 `ignore_changes` blocks. Phase 1 of the state-drift
consolidation audit classified 21 as legitimate (immutable fields, cloud-computed
values) and 10 as intentional workarounds for known drift sources. Without
reading the surrounding context, those 10 workaround blocks were
indistinguishable from accidental/forgotten drift suppression.

This commit adds a uniform `# DRIFT_WORKAROUND: <reason>, reviewed 2026-04-18`
marker above the 8 intended-workaround blocks (6 CI image-tag decoupling + 2
non-deterministic secret hashes) so they are easy to distinguish from
accidental drift suppression during future audits.
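Shape of the marker (the reason text and ignored attribute below are
illustrative — each site keeps its own):

```hcl
# DRIFT_WORKAROUND: CI owns the image tag, reviewed 2026-04-18
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].container[0].image]
}
```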

## What is NOT in this change

- Functional behavior — `ignore_changes` lists are byte-identical.
- The Kyverno `dns_config` ignore paths (covered by Wave 3 shared module).
- Workarounds being removed — the CI decoupling is intentional by user decision.

## Files touched

CI image-tag decoupling (6):
- stacks/k8s-portal/modules/k8s-portal/main.tf (also has dns_config for Kyverno)
- stacks/novelapp/main.tf
- stacks/claude-memory/main.tf
- stacks/plotting-book/main.tf
- stacks/trading-bot/main.tf (api deployment)
- stacks/trading-bot/main.tf (workers deployment — 6 containers)

Non-deterministic secret hashes (2):
- stacks/owntracks/main.tf (htpasswd bcrypt)
- stacks/mailserver/modules/mailserver/main.tf (postfix-accounts.cf)

## Test Plan

### Automated
```
$ rg DRIFT_WORKAROUND stacks/ | wc -l
8

$ terraform fmt -recursive stacks/k8s-portal stacks/novelapp stacks/claude-memory \
    stacks/plotting-book stacks/trading-bot stacks/owntracks stacks/mailserver
(no output — already formatted)

$ git diff --stat
 stacks/claude-memory/main.tf                 | 1 +
 stacks/k8s-portal/modules/k8s-portal/main.tf | 1 +
 stacks/mailserver/modules/mailserver/main.tf | 3 ++-
 stacks/novelapp/main.tf                      | 1 +
 stacks/owntracks/main.tf                     | 1 +
 stacks/plotting-book/main.tf                 | 1 +
 stacks/trading-bot/main.tf                   | 2 ++
 7 files changed, 9 insertions(+), 1 deletion(-)
```

### Manual Verification
No apply required — HCL comments only, zero effect on plan output.

## Reproduce locally
1. `cd infra && git pull`
2. `rg "DRIFT_WORKAROUND.*reviewed 2026-04-18" stacks/ | wc -l` → expect 8
3. `terraform fmt -check -recursive stacks/` → expect clean exit

Closes: code-yrg

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:08:10 +00:00
Viktor Barzin
91165e31b9 [infra/beads-server] Wire BeadBoard to claude-agent-service
## Context

BeadBoard is the Next.js task visualization dashboard shipped in this
stack. We want users to trigger headless Claude agent runs directly from
a beads task row — "one-click dispatch" — instead of copy-pasting `bd`
IDs into a terminal. The agent runs in-cluster as claude-agent-service
(see stacks/claude-agent-service/), protected by a bearer token in
Vault at secret/claude-agent-service/api_bearer_token.

For BeadBoard to POST to /execute we need the service URL and the
bearer token available inside the pod as env vars. The URL is static
(cluster DNS); the token must come through External Secrets Operator
so rotation in Vault propagates without re-applying Terraform.

Secondary cleanup: the container was still pinned to :latest which
violates the 8-char-SHA convention and causes stale pulls through the
registry cache (see .claude/CLAUDE.md, Docker images). The image tag
is now variable-driven; the GHA pipeline will override the default
once it publishes the first SHA.

## This change

- Adds an ExternalSecret `beadboard-agent-service` in the
  `beads-server` namespace, mirroring the pattern in
  stacks/claude-agent-service/main.tf (same Vault path
  `secret/claude-agent-service`, same `vault-kv` ClusterSecretStore,
  same 15m refresh). Exposes exactly one key: `api_bearer_token`.

- Adds two env vars to the `beadboard` container:
  - `CLAUDE_AGENT_SERVICE_URL` — static cluster URL
    (`http://claude-agent-service.claude-agent.svc.cluster.local:8080`)
  - `CLAUDE_AGENT_BEARER_TOKEN` — `secret_key_ref` pointing at the
    ESO-managed Secret, key `api_bearer_token` (see the sketch after this list)

- Adds `reloader.stakater.com/auto = "true"` on the Deployment's
  top-level metadata — matches the convention used by rybbit,
  claude-memory, onlyoffice. When ESO refreshes the K8s Secret
  because Vault rotated the token, Reloader restarts the pod so the
  new token is picked up (env vars are read once at boot).

- Adds `variable "beadboard_image_tag"` (default `"latest"`, with a
  one-line comment flagging the temporary default). The image
  reference now interpolates `${var.beadboard_image_tag}`. No tfvars
  file is touched — orchestrator will flip the default to the first
  real 8-char SHA once GHA publishes it.
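Inside the `beadboard` container block the two additions look roughly like this
(sketch):

```hcl
env {
  name  = "CLAUDE_AGENT_SERVICE_URL"
  value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
}
env {
  name = "CLAUDE_AGENT_BEARER_TOKEN"
  value_from {
    secret_key_ref {
      name = "beadboard-agent-service"
      key  = "api_bearer_token"
    }
  }
}
```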

## What is NOT in this change

- No GHA workflow additions. The pipeline that builds
  `registry.viktorbarzin.me:5050/beadboard` lives in the BeadBoard
  repo and is out of scope here.
- No Vault-side changes. `secret/claude-agent-service/api_bearer_token`
  already exists (it powers the claude-agent-service deployment
  itself).
- No Terraform `apply`. Orchestrator applies.

## Data flow

  Vault (secret/claude-agent-service)
    │  refresh every 15m
    ▼
  ESO → K8s Secret `beadboard-agent-service` (beads-server ns)
    │  env valueFrom.secretKeyRef
    ▼
  BeadBoard pod (CLAUDE_AGENT_BEARER_TOKEN env)
    │  Authorization: Bearer <token>
    ▼
  claude-agent-service.claude-agent.svc:8080 /execute

On Vault rotation: ESO picks up new value at next refresh → K8s
Secret data changes → Reloader sees annotation + referenced Secret
changed → rolling-recreates the beadboard pod with the new token.

## Test Plan

### Automated
- `terraform fmt -recursive stacks/beads-server/` — clean (formatted
  the file once; subsequent run is a no-op).
- `terraform -chdir=stacks/beads-server validate` (after
  `terraform init -backend=false`) — `Success! The configuration is
  valid`. The 14 "Deprecated Resource" warnings are pre-existing
  (`kubernetes_namespace` vs `_v1` etc.) and unrelated to this
  change.

### Manual Verification
1. Orchestrator applies:
   `scripts/tg -chdir=stacks/beads-server apply`
2. Verify the ExternalSecret synced:
   `kubectl -n beads-server get externalsecret beadboard-agent-service`
   Expected: `Ready=True`, `SyncedAt` recent.
3. Verify the K8s Secret exists with one key:
   `kubectl -n beads-server get secret beadboard-agent-service -o jsonpath='{.data.api_bearer_token}' | base64 -d | head -c 8`
   Expected: first 8 chars of the bearer token.
4. Verify the deployment picked up the env vars:
   `kubectl -n beads-server get deploy beadboard -o yaml | grep -A2 CLAUDE_AGENT`
   Expected: both env entries present, bearer via `secretKeyRef`.
5. Verify the reloader annotation is on the Deployment metadata:
   `kubectl -n beads-server get deploy beadboard -o jsonpath='{.metadata.annotations.reloader\.stakater\.com/auto}'`
   Expected: `true`.
6. Verify the image tag resolved to the variable default (for now):
   `kubectl -n beads-server get deploy beadboard -o jsonpath='{.spec.template.spec.containers[0].image}'`
   Expected: `registry.viktorbarzin.me:5050/beadboard:latest`
   (will become `...:<sha>` once `beadboard_image_tag` default is
   updated).
7. Smoke-test the env var inside the pod:
   `kubectl -n beads-server exec deploy/beadboard -- sh -c 'printenv CLAUDE_AGENT_SERVICE_URL; printenv CLAUDE_AGENT_BEARER_TOKEN | head -c 8'`
   Expected: URL printed, first 8 chars of token printed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:53:51 +00:00
Viktor Barzin
82b7866bc9 [claude-agent-service] Remove orphaned DevVM SSH key wiring
## Context

The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.

This commit removes two cleanup artifacts left behind by that migration.

## This change

1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived
   skill doc for the obsolete SSH-based pattern. Already in `archived/`,
   harmless but noise; deleting prevents anyone copy-pasting the old approach.

2. Removes `kubernetes_secret.ssh_key` from
   `stacks/claude-agent-service/main.tf`. The Secret was created from the
   `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
   into the agent pod. The pod's `git-init` init container uses HTTPS +
   `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
   and `https://github.com/` URL via `git config url.insteadOf`, so no
   downstream `git` invocation could fall through to SSH even if it tried.

3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
   the SSH key resource was its only consumer.

## What is NOT in this change

- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
  Removing it requires read/modify/put of the full secret and the upside
  is one unused Vault key. Not worth it without strong justification.
- DevVM host decommission is out of scope (separate audit needed for
  non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
  left untouched per no-adjacent-refactor rule.

## Test plan

### Automated

- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
  pre-existing lines 464-505 are flagged; no new fmt warnings introduced
  by these deletions.

### Manual verification

1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
   The `ci_secrets` data source removal is plan-time only; does not appear
   in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via HTTP API to confirm pipeline still works:
   curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute
   with a minimal prompt; expect job completes with `exit_code=0`.

Closes: code-bck

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:31:15 +00:00
Viktor Barzin
9a2e920006 [rybbit] Narrow CF Worker routes to SITE_IDS hosts — fix free-tier quota breach
## Context

The `rybbit-analytics` Cloudflare Worker hit the free-tier quota of 100k
requests/day. CF GraphQL analytics showed **97,153 invocations in the last
24h**, up from ~0 before 2026-04-17 21:26 UTC when Rybbit script injection
migrated off the broken Traefik rewrite-body plugin (Yaegi ResponseWriter
bug on Traefik v3.6.12) onto this Worker.

Root cause: `wrangler.toml` registered two wildcard routes
(`viktorbarzin.me/*` + `*.viktorbarzin.me/*`) which match every Cloudflare-
proxied request on the zone. Only 27 of ~119 proxied hostnames appear in
`SITE_IDS` in `index.js`; the rest burn Worker invocations for nothing since
`siteId` is `null` and the Worker no-ops. Worse, the wildcard caught
`rybbit.viktorbarzin.me` itself — every tracker `script.js` fetch and event
POST round-trip was spawning its own Worker invocation (self-amplification).

CF GraphQL per-host breakdown (last 24h, zone `viktorbarzin.me`):
- Top waste (NOT in SITE_IDS): tuya-bridge 96.6k, beadboard 55.8k,
  terminal 30.2k, authentik 19.9k, claude-memory 12.6k
- Sum of 27 SITE_IDS hosts: 47.2k
- `rybbit.viktorbarzin.me` self-amplifier: 782
- Projected post-narrow: 46.4k/day (52% reduction, well under quota)

## This change

Replaces the two wildcards with an explicit list of the **26** hostnames
present in `SITE_IDS`. `rybbit.viktorbarzin.me` is deliberately excluded
even though it has a site ID — it serves `/api/script.js` (JS) and
`/api/track` (JSON), both of which fail the Worker's `text/html`
content-type guard anyway. Leaving it routed just burned invocations.

    BEFORE                              AFTER
    ──────────────────────────          ──────────────────────────────────
    viktorbarzin.me/*          ┐        viktorbarzin.me/*          ┐
    *.viktorbarzin.me/*        ┘        www.viktorbarzin.me/*      │
                                        actualbudget.vb.me/*       │
    → matches ~119 hosts                ... (26 total)             │ → matches
    → ~97k Worker inv/day                stirling-pdf.vb.me/*      │   only 26
    → rybbit → self-amplifies            vaultwarden.vb.me/*       ┘   specific
                                                                        hosts
                                        rybbit.vb.me INTENTIONALLY
                                        EXCLUDED (self-amplifier)

Deployment is unchanged — this Worker is not in Terraform. Deploy from
`stacks/rybbit/worker/` via:

    CLOUDFLARE_EMAIL=vbarzin@gmail.com \
    CLOUDFLARE_API_KEY=$(vault kv get -field=cloudflare_api_key secret/platform) \
    npx --yes wrangler@latest deploy

`wrangler deploy` replaces all worker routes on the zone with the list from
`wrangler.toml`, so there is no cleanup step. Already deployed today as
version `d7f83980-a499-40f5-ba55-f8e18d531863` — this commit just captures
the source of truth in git.

## What is NOT in this change

- Self-hosted injection (nginx `sub_filter` sidecar, compiled Traefik
  plugin). Deferred — revisit only if analytics traffic grows past 80k/day
  again, or if we add more high-traffic hosts to `SITE_IDS`.
- Cloudflare Workers Paid plan ($5/mo for 10M requests). User declined.
- Moving the Worker into Terraform. Out of scope.
- Any Rybbit backend/frontend changes. Rybbit itself continues running.

## Test plan

### Automated

Post-deploy CF API enumeration of zone routes:

    $ curl -s -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
        "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/workers/routes" \
      | jq -r '.result[] | "\(.pattern)\t→ \(.script)"' | wc -l
    26

    # Wildcards gone:
    $ curl -s ... | jq -r '.result[].pattern' | grep -c '\*\.'
    0

### Manual Verification

Script injection behaviour, verified via `curl`:

1. SITE_IDS host — script IS injected:

       $ curl -s -L https://viktorbarzin.me/ | grep -oE '<script[^>]*rybbit[^>]*>'
       <script src="https://rybbit.viktorbarzin.me/api/script.js"
         data-site-id="da853a2438d0" defer>

       $ curl -s -L https://calibre.viktorbarzin.me/ | grep -oE '<script[^>]*rybbit[^>]*>'
       <script src="https://rybbit.viktorbarzin.me/api/script.js"
         data-site-id="ce5f8aed6bbb" defer>

2. Non-SITE_IDS host — script NOT injected:

       $ curl -s -L https://tuya-bridge.viktorbarzin.me/ | grep -c 'data-site-id'
       0

3. `rybbit.viktorbarzin.me` bypasses Worker entirely — tracker returns raw JS:

       $ curl -sI https://rybbit.viktorbarzin.me/api/script.js | grep -i content-type
       content-type: application/javascript; charset=utf-8

### Reproduce locally

    # 1. Confirm the Worker sees only the 26 narrowed routes.
    CF_EMAIL=vbarzin@gmail.com
    CF_KEY=$(vault kv get -field=cloudflare_api_key secret/platform)
    ZONE_ID=fd2c5dd4efe8fe38958944e74d0ced6d
    curl -s -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
      "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/workers/routes" \
      | jq -r '.result[] | .pattern' | sort

    # 2. 24h after deploy, re-check invocation count — expect < 80k.
    curl -s https://api.cloudflare.com/client/v4/graphql \
      -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_KEY" \
      -H "Content-Type: application/json" \
      -d '{"query":"query($acc:String!,$since:Time!,$until:Time!){viewer{accounts(filter:{accountTag:$acc}){workersInvocationsAdaptive(limit:100,filter:{datetime_geq:$since,datetime_leq:$until}){sum{requests} dimensions{scriptName date}}}}}",
           "variables":{"acc":"02e035473cfc4834fb10c5d35470d8b4",
                        "since":"'"$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)"'",
                        "until":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}'

Follow-up monitoring tracked in code-dka (P3, 3-day check).

Closes: code-l9b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:15 +00:00
Viktor Barzin
a24cf8c689 [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.

Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.

Updates the post-mortem P1 row to capture this for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:14 +00:00
Viktor Barzin
9ea7eec362 [actualbudget] Upgrade 26.3.0 → 26.4.0 for native Sankey report
## Context

Actual Budget v26.4.0 (released 2026-04-05) re-introduces the Sankey
chart report for income/expense flow visualization (PR #7220). An earlier
experimental implementation was deleted in March 2024 (PR #2417) but a
proper reimplementation with "Other" grouping, date-range selection, and
percentage toggle is now shipped behind the experimental feature flag.

Viktor wanted Sankey visualization of budget cash flow; this is the lowest-
cost path since his existing Actual Budget deployment already holds all the
transaction data.

## This change

Bumps the `tag` input on all three factory module calls (viktor, anca, emo)
from `26.3.0` to `26.4.0`. No breaking changes, schema migrations, or config
changes per the 26.4.0 release notes.

## Rollout

Applied via `scripts/tg apply --non-interactive`. All three pods rolled
successfully to `actualbudget/actual-server:26.4.0` and passed readiness
probes. The http-api sidecars (`jhonderson/actual-http-api`) were untouched.

## Post-upgrade

Users need to toggle Settings → Experimental features → Sankey report to
access the chart, then Reports → new Sankey widget.

Closes: code-oof

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:19:27 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
b41528e564 [docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)
## Context

On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.

## Root cause

The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.

## What's captured

- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
  red-herring suspicion of the new Rybbit Worker, cached sessions masking
  the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
  on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
  per-user symptoms fast; UI-managed Authentik config is invisible to git)

## Follow-ups not in this commit

- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
  (see discussion in beads code-zru)

Closes: code-zru

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00
Viktor Barzin
6e19dce99e [docs] automated-upgrades: document long-lived OAuth + expiry monitoring
Adds the `claude_oauth_token` Vault entries to the secrets table, a
new "OAuth token lifecycle" section explaining the two CLI auth modes
(`claude login` vs `claude setup-token`) and why we picked the latter
for headless use, the Ink 300-col PTY gotcha from today's harvest,
and the monitoring/rotation playbook for the new expiry alerts.

Follow-up to 8a054752 and 50dea8f0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:00:07 +00:00
Viktor Barzin
e4a96591b3 .gitignore: ignore Terragrunt-generated cloudflare_provider.tf and tiers.tf
These files are regenerated by Terragrunt on every run and have a
"# Generated by Terragrunt. Sig: ..." header. Earlier today multiple parallel
agents working on bd-w97 accidentally staged them, requiring two corrective
commits (3e11bd1b, 4eb68d6b). Preventing the recurrence at the source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:36:45 +00:00
Viktor Barzin
4eb68d6b1a [meshcentral] Remove accidentally-committed Terragrunt-generated files
My previous commit (c0ac24a5, [meshcentral] Import existing cluster
state + PVC) unintentionally committed two Terragrunt-generated
provider/locals files. These are auto-generated on every plan/apply
(marked 'Generated by Terragrunt. Sig:') and do not belong in the
repo. Mirrors 3e11bd1b which did the same cleanup for kyverno.

Removes from tracking only — files remain on disk so concurrent work
is unaffected.

Updates: code-w97
2026-04-18 12:35:44 +00:00
Viktor Barzin
c0ac24a54c [meshcentral] Import existing cluster state + PVC (bd-w97)
Imported the two proxmox-lvm-encrypted PVCs into the Tier 1 PG state.
All other declared resources (namespace, deployment, service, ingress,
NFS-backed PV/PVC, tls secret) were already state-managed.

Imported:
- kubernetes_persistent_volume_claim.data_encrypted
    (meshcentral/meshcentral-data-encrypted, proxmox-lvm-encrypted, 1Gi)
- kubernetes_persistent_volume_claim.files_encrypted
    (meshcentral/meshcentral-files-encrypted, proxmox-lvm-encrypted, 1Gi)

Pre-import plan: 2 to add, 3 to change, 0 to destroy
Post-import plan: 0 to add, 5 to change, 0 to destroy (benign drift)
Apply: 0 added, 5 changed, 0 destroyed

Benign drift reconciled on apply:
- PVC wait_until_bound attribute aligned (true -> false)
- tls-secret Kyverno sync labels cleared
- deployment/namespace annotation drift

Source reconciliation: none required. Both declared PVCs already match
the cluster (proxmox-lvm-encrypted, 1Gi, RWO, names identical). NFS
PV/PVC meshcentral-backups-host (nfs-truenas, 10Gi, RWX) remained
bound throughout. Deployment kept 1/1 replicas on the same pod
(meshcentral-6c4f47c6f8-mj8sk).

Commits the auto-generated cloudflare_provider.tf and tiers.tf so the
stack matches the repo convention used by its peers.

Updates: code-w97
2026-04-18 12:35:26 +00:00
Viktor Barzin
3e11bd1b67 [kyverno] Remove accidentally-committed Terragrunt-generated files
My previous commit (dacf3d9e, [kyverno] Import existing cluster state)
unintentionally picked up two Terragrunt-generated provider/locals
files from the meshcentral stack that a parallel worker had just
created. These are auto-generated on every plan/apply (marked
"Generated by Terragrunt. Sig:") and do not belong in the repo.

Removes from tracking only — files remain on disk so concurrent work
is unaffected.

Files removed:
- stacks/meshcentral/cloudflare_provider.tf
- stacks/meshcentral/tiers.tf

No impact on the kyverno import work. State-level changes from
dacf3d9e (3 imports + 3 in-place updates) stand.

Updates: code-w97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:34:59 +00:00
Viktor Barzin
2fe3bb3307 [travel_blog] Import existing cluster state (bd-w97)
All resources were already present in the Tier 1 PG state — no imports
required. The travel_blog stack has no PVC (content baked into the
Docker image, deployed via Woodpecker with 1.4GB context).

Pre-apply plan: 0 to add, 4 to change, 0 to destroy
Apply: 0 added, 4 changed, 0 destroyed
Post-apply plan: 0 to add, 3 to change, 0 to destroy (persistent benign drift)

Benign drift reconciled on apply:
- Deployment dns_config (Kyverno-injected ndots:2) removed
- Namespace goldilocks vpa-update-mode=off label removed
- Ingress external-monitor=false annotation removed (now auto-managed
  by ingress_factory dns_type)
- TLS secret Kyverno sync labels removed

Post-apply drift (persists via external controllers, out of scope):
- Kyverno re-injects ndots:2 dns_config and sync-tls-secret labels
- Goldilocks re-adds vpa-update-mode label
  (tracked separately — future work to add lifecycle ignore_changes)

Image tag viktorbarzin/travel_blog:latest unchanged — TF matches cluster.
Deployment remains at replicas=0 (intentional, per source comment:
"Scaled down — clears ExternalAccessDivergence alert"). Site is
intentionally offline.

Updates: code-w97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:34:36 +00:00
Viktor Barzin
dacf3d9e11 [kyverno] Import existing cluster state (bd-w97)
Imported 3 missing cluster resources into the Tier 1 PG state for the
kyverno stack. The Helm release, 6 PriorityClasses, 14 ClusterPolicies,
both Secrets (registry-credentials, tls-secret), and all prior RBAC
resources were already managed in state. The strip-cpu-limits
ClusterPolicy (commit 1de2ee30, 56m prior to this import) was already
in state from its targeted apply.

Resources imported:
- module.kyverno.kubernetes_cluster_role_v1.kyverno_cleanup_pods
  (kyverno:cleanup-controller:pods — RBAC for ClusterCleanupPolicy)
- module.kyverno.kubernetes_cluster_role_binding_v1.kyverno_cleanup_pods
  (kyverno:cleanup-controller:pods — binding to cleanup-controller SA)
- module.kyverno.kubernetes_manifest.cleanup_failed_pods
  (apiVersion=kyverno.io/v2,kind=ClusterCleanupPolicy,name=cleanup-failed-pods)

All three originated from commit cf578516 (auto-cleanup failed/evicted
pods), which added the declarations but apparently never made it into
PG state before the global state reorg.

Pre-import plan:  3 to add,  2 to change, 0 to destroy
Post-import plan: 0 to add,  3 to change, 0 to destroy (benign)
Apply:            0 added,   3 changed,   0 destroyed

Benign drift reconciled on apply:
- cleanup_failed_pods manifest field populated in state post-import
  (annotations re-applied, no spec change)
- registry_credentials + tls_secret: null `generate.kyverno.io/clone-source`
  label dropped from Terraform metadata (no K8s object change — the label
  was only `null` in state, never existed on the live Secret)

Safety checks — all clean:
- ClusterPolicy count: 16 (unchanged, 14 owned here + 1 external
  goldilocks-vpa-auto-mode + strip-cpu-limits); all status=Ready=True
- ClusterCleanupPolicy cleanup-failed-pods: intact, schedule 15 * * * *
- helm_release.kyverno: no diff (revision unchanged)
- Mutating/validating webhook configurations: 3 + 7 intact
- All 4 Kyverno Deployments Running (admission x2, background, cleanup, reports)

Kyverno failurePolicy stays Ignore (forceFailurePolicyIgnore=true) so
admission degrades open if ever unavailable.

Updates: code-w97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:34:32 +00:00
Viktor Barzin
9ea4ccf17e [pvc-autoresizer] Import existing cluster state (bd-w97)
Imported both resources for the pvc-autoresizer stack into the Tier 1 PG
state. The stack was previously unmanaged — cluster had the running
controller from a prior manual helm install (rev 1, 2026-04-03).

Resources imported:
- module.pvc_autoresizer.kubernetes_namespace.pvc_autoresizer (pvc-autoresizer)
- module.pvc_autoresizer.helm_release.pvc_autoresizer (pvc-autoresizer/pvc-autoresizer)

Pre-import plan:  2 to add, 0 to change, 0 to destroy
Post-import plan: 0 to add, 2 to change, 0 to destroy (benign drift)
Apply:            0 added, 2 changed, 0 destroyed

Benign drift reconciled on apply:
- Namespace goldilocks.fairwinds.com/vpa-update-mode=off label removed
  (Kyverno ClusterPolicy goldilocks-vpa-auto-mode re-adds it immediately)
- Helm release metadata refresh only (atomic read-back, revision 1 -> 2;
  chart pvc-autoresizer-0.17.0 and app 0.20.0 unchanged — no upgrade)

Controller pods pvc-autoresizer-controller-7dcc745f68-57bk6 and -n4bh9
stayed Running throughout (restart counts unchanged: 17 and 1, both
pre-existing from pre-apply state). No PVCs entered non-Bound state.

Updates: code-w97
2026-04-18 12:33:37 +00:00
Viktor Barzin
7b88479278 [tor-proxy] Import existing cluster state (bd-w97)
Imported all 9 cluster resources into the Tier 1 PG state. Stack was
previously unmanaged — source was fully declared in main.tf but state
was empty.

Pre-import plan: 9 to add, 0 to change, 0 to destroy
Post-import plan: 0 to add, 9 to change, 0 to destroy
Apply: 0 added, 9 changed, 0 destroyed

Resources imported:
- kubernetes_namespace.tor-proxy
- kubernetes_deployment.tor-proxy
- kubernetes_deployment.torrserver
- kubernetes_service.tor-proxy
- kubernetes_service.torrserver
- kubernetes_service.torrserver-bt (LoadBalancer, IP 10.0.20.200)
- kubernetes_persistent_volume_claim.torrserver_data_proxmox
- module.tls_secret.kubernetes_secret.tls_secret
- module.torrserver_ingress.kubernetes_ingress_v1.proxied-ingress

Service pods tor-proxy-7fb4644dd8-npdwg and torrserver-7788ff4c4d-jnh85
stayed Running throughout. Tor circuit preserved — no deployment restarts.

Updates: code-w97
2026-04-18 12:33:26 +00:00
Viktor Barzin
8a42a1708d [isponsorblocktv] Import existing cluster state (bd-w97)
Imported kubernetes_persistent_volume_claim.data_proxmox into the
Tier 1 PG state. Namespace and deployment were already managed.

Pre-import plan: 1 to add, 2 to change, 0 to destroy
Post-import plan: 0 to add, 3 to change, 0 to destroy (benign drift)
Apply: 0 added, 3 changed, 0 destroyed

Benign drift reconciled on apply:
- Deployment dns_config (Kyverno-injected ndots:2) removed
- Namespace goldilocks vpa-update-mode label removed
- PVC wait_until_bound aligned (true -> false)

Service pod isponsorblocktv-vermont-55bdb8998-889hn stayed Running
on the same PVC throughout.

Updates: code-w97
2026-04-18 12:31:46 +00:00
Viktor Barzin
50dea8f0a7 [monitoring] Add Claude OAuth token expiry monitoring + alerts
## Context

The new CLAUDE_CODE_OAUTH_TOKEN mechanism (commit 8a054752) uses
long-lived 1-year tokens minted via `claude setup-token`. Tokens don't
auto-refresh — at the 1-year mark they expire hard and the upgrade
agent stops working. We need to be told 30 days ahead, not find out
when DIUN fires and gets 401 again.

A cron rotator doesn't make sense here (tokens don't refresh, they
just expire) so we alert instead. Two spares at
`secret/claude-agent-service-spare-{1,2}` provide failover runway —
monitor covers all three.

## This change

**CronJob** (`claude-agent` ns, every 6h): reads a ConfigMap
containing `<path> → expiry_unix_timestamp` entries, pushes
`claude_oauth_token_expiry_timestamp{path="..."}` and
`claude_oauth_expiry_monitor_last_push_timestamp` to Pushgateway at
`prometheus-prometheus-pushgateway.monitoring:9091`.

**ConfigMap** generated from a Terraform local `claude_oauth_token_mint_epochs`
— source of truth for mint times. On rotation, update the map + apply.
TTL is a shared local (365d).
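
A minimal sketch of that local and the derived ConfigMap data (resource and ConfigMap names here are assumptions; the mint epochs are back-computed from the verification output below as expiry minus 365d):

```
# Sketch only: names are illustrative, values back-computed for illustration.
locals {
  claude_oauth_token_ttl_seconds = 365 * 24 * 60 * 60 # shared 365d TTL

  # Source of truth for mint times; update the relevant entry on rotation.
  claude_oauth_token_mint_epochs = {
    primary   = 1776528429
    "spare-1" = 1776528280
    "spare-2" = 1776528429
  }
}

resource "kubernetes_config_map_v1" "claude_oauth_token_expiry" {
  metadata {
    name      = "claude-oauth-token-expiry" # hypothetical name
    namespace = "claude-agent"
  }
  # <path> -> expiry_unix_timestamp, read by the 6h CronJob and pushed to Pushgateway
  data = {
    for path, minted in local.claude_oauth_token_mint_epochs :
    path => tostring(minted + local.claude_oauth_token_ttl_seconds)
  }
}
```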

**PrometheusRules** (in prometheus_chart_values.tpl):
- `ClaudeOAuthTokenExpiringSoon`  — <30d, warning, for 1h
- `ClaudeOAuthTokenCritical`      — <7d,  critical, for 10m
- `ClaudeOAuthTokenMonitorStale`  — last push >48h, warning
- `ClaudeOAuthTokenMonitorNeverRun` — metric absent for 2h, warning

Alert labels include `{{ $labels.path }}` so we know which token is
expiring (primary / spare-1 / spare-2).

## Verification

```
$ kubectl -n claude-agent create job --from=cronjob/claude-oauth-expiry-monitor manual
$ curl pushgateway/metrics | grep claude_oauth_token_expiry
claude_oauth_token_expiry_timestamp{...,path="primary"} 1.808064429e+09
claude_oauth_token_expiry_timestamp{...,path="spare-1"} 1.80806428e+09
claude_oauth_token_expiry_timestamp{...,path="spare-2"} 1.808064429e+09

$ query: (claude_oauth_token_expiry_timestamp - time()) / 86400
  primary: 365.2 days
  spare-1: 365.2 days
  spare-2: 365.2 days
```

## Rotation playbook (future)

1. `kubectl run -it --rm --image=registry.viktorbarzin.me/claude-agent-service:latest tokmint -- claude setup-token`
   (or harvest via `harvest3.py` pattern in memory for headless flow)
2. `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`
3. Update `claude_oauth_token_mint_epochs["primary"]` in
   `stacks/claude-agent-service/main.tf` with new unix timestamp
4. `scripts/tg apply` claude-agent-service + monitoring
5. Alert clears within 6h (next cron tick) plus the 1h `for:` duration of
   `ClaudeOAuthTokenExpiringSoon`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:27:11 +00:00
Viktor Barzin
8a05475218 [claude-agent-service] Add CLAUDE_CODE_OAUTH_TOKEN env var — 1-year long-lived auth
## Context

Earlier today we hit a silent auth failure on the upgrade agent: the
short-lived `sk-ant-oat01-*` access token in `.credentials.json` had
expired and the CLI's refresh path failed (refresh token either stale
or invalidated after the creds sat in Vault for 5 days).

The real fix isn't "refresh more often" — it's switching to the
long-lived auth mechanism `claude setup-token` provides. Unlike
`claude login` (OAuth flow → 6–8h access token + refresh token JSON),
`setup-token` mints a single opaque token valid for **1 year** that
the CLI consumes via `CLAUDE_CODE_OAUTH_TOKEN` env var. No refresh
dance, no JSON file, no rotation for a year.

## This change

Adds `CLAUDE_CODE_OAUTH_TOKEN` to the existing
`claude-agent-secrets` ExternalSecret, sourced from a new
`claude_oauth_token` field at `secret/claude-agent-service`. The
container already pulls that secret via `envFrom`, so no other wiring
needed.
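
Roughly what the added entry looks like, assuming the ExternalSecret is declared as a `kubernetes_manifest` (the store name/kind and the exact Vault key syntax are assumptions that depend on the SecretStore config):

```
# Sketch of the ExternalSecret with the new entry; existing entries unchanged.
resource "kubernetes_manifest" "claude_agent_secrets" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata   = { name = "claude-agent-secrets", namespace = "claude-agent" }
    spec = {
      secretStoreRef = { name = "vault-kv", kind = "ClusterSecretStore" } # kind assumed
      target         = { name = "claude-agent-secrets" }
      data = [
        # ... existing entries stay as-is ...
        {
          secretKey = "CLAUDE_CODE_OAUTH_TOKEN" # lands in the Secret pulled via envFrom
          remoteRef = {
            key      = "claude-agent-service" # Vault KV path (secret/claude-agent-service)
            property = "claude_oauth_token"   # the new field
          }
        }
      ]
    }
  }
}
```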

The Claude CLI prefers `CLAUDE_CODE_OAUTH_TOKEN` over the OAuth JSON
file when both are present, so this is additive — `.credentials.json`
stays mounted as a fallback while we validate the long-lived path.
Future cleanup can remove the JSON mount entirely.

Verified E2E: synthetic DIUN webhook for `docker.io/library/httpd`
→ n8n → claude-agent-service /execute → agent job `fea5ff70dcfe`
completed in 30s with exit_code=0; the agent correctly identified no
matching stack and aborted without changes. No API auth errors.

## Spares

Harvested two additional long-lived tokens and stored them at
`secret/claude-agent-service-spare-{1,2}` for failover if the
primary is compromised or revoked. Verified both coexist with the
primary (no revocation on mint).

## What is NOT in this change

- No removal of `.credentials.json` mount or its Vault source (keep
  as fallback until we've run for 24h on env-var auth with no issues).
- No cron rotator — 1-year TTL means this can be a yearly manual
  rotation, alerted on from Vault metadata. If we add rotation, we'll
  source from the spares pool rather than minting new tokens.

## Reproduce locally

```
1. vault login -method=oidc
2. vault kv get -field=claude_oauth_token secret/claude-agent-service | head -c 25
3. cd stacks/claude-agent-service && ../../scripts/tg apply
4. kubectl -n claude-agent exec deploy/claude-agent-service -- \
     printenv CLAUDE_CODE_OAUTH_TOKEN   # should be 108 chars
5. Fire synthetic DIUN webhook (see docs/architecture/automated-upgrades.md)
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:12:30 +00:00
Viktor Barzin
50e8184d99 [uptime-kuma] Codify MySQL monitor (id=663) via idempotent sync CronJob
## Context

Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via
the `uptime-kuma-api` Python library when the dbaas stack migrated from
InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in
Uptime Kuma's MariaDB — if UK's DB were wiped or restored from an older
backup, the monitor would be lost.

## This change

Adds declarative, self-healing management for internal-service monitors
(databases, non-HTTP endpoints) that can't be discovered from ingress
annotations. Modelled on the existing `external-monitor-sync` CronJob.

- `local.internal_monitors` — list of desired monitors (name, type,
  connection string, Vault password key, interval, retries). Seeded with
  the MySQL Standalone monitor (sketched after this list). Add new entries here to manage more.
- `kubernetes_secret.internal_monitor_sync` — pulls admin password and all
  referenced DB passwords from Vault `secret/viktor` at apply time. Secret
  key names are derived from monitor name (`DB_PASSWORD_<upper_snake>`).
- `kubernetes_config_map_v1.internal_monitor_targets` — renders the target
  list to JSON for the sync container.
- `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min,
  looks up monitors by name, creates if missing, patches if drifted,
  leaves id and history untouched when already in desired state.
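
A minimal sketch of the `local.internal_monitors` entry shape from the first bullet (the connection string and retry count here are illustrative, not the applied values):

```
# Sketch only: one entry per internal monitor the sync CronJob should reconcile.
locals {
  internal_monitors = [
    {
      name               = "MySQL Standalone (dbaas)"
      type               = "mysql"
      connection_string  = "mysql://uptimekuma@mysql.dbaas.svc.cluster.local:3306" # illustrative
      vault_password_key = "uptimekuma_db_password"                                # key in secret/viktor
      interval           = 60
      retries            = 3
    }
    # Add further entries here to manage more internal monitors.
  ]
}
```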

## Why this approach (Option B, not a Terraform provider)

The `louislam/uptime-kuma` Terraform provider does NOT exist in the public
registry (verified — only a CLI tool of the same name). Option A from the
task brief was therefore unavailable. Option B (idempotent K8s CronJob)
matches the established pattern in the same module for
`external-monitor-sync` — no new machinery introduced.

## Monitor 663: no-op on first sync

Manual import was not possible (no provider → no state to import). The
sync job correctly identifies the existing monitor by name and reports:

  Monitor MySQL Standalone (dbaas) (id=663) already in desired state
  Internal monitor sync complete

DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and
`Rows: 1` responses every 60s — no disruption.

## Vault key — left manual (by design)

`secret/viktor` is not Terraform-managed anywhere in the repo (only read
via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding
135 keys. The `uptimekuma_db_password` key was added manually yesterday;
this change does NOT codify it. Codifying the whole `secret/viktor` entry
is out of scope for this task (would need a separate migration + rotation
story). The sync job reads the existing value at apply time — so if the
value is ever rotated in Vault, the next sync picks it up.

## Plan + apply

  Plan: 3 to add, 0 to change, 0 to destroy.
  Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
  Re-plan: No changes. Your infrastructure matches the configuration.

Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern.

Closes: code-ed2
2026-04-18 12:04:17 +00:00
Viktor Barzin
d3bdf87676 [docs] Clarify external-monitor auto-annotation in CLAUDE.md
## Context
During a false-alarm investigation of terminal.viktorbarzin.me, an Explore
agent misdiagnosed "no monitoring" by checking cloudflare_proxied_names in
config.tfvars (a legacy fallback list) instead of the ingress_factory
auto-annotation. Both [External] monitors for terminal/terminal-ro exist and
are active — the original agent just looked in the wrong place.

## This change
Expands the Monitoring & Alerting bullet to spell out the mechanism:
ingress_factory auto-adds uptime.viktorbarzin.me/external-monitor=true when
dns_type != "none", and cloudflare_proxied_names is a legacy fallback for
the 17 hostnames not yet migrated. Future agents debugging "is this
monitored?" questions should not check cloudflare_proxied_names.

## What is NOT in this change
No Terraform, no K8s, no service config. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:45:56 +00:00
Viktor Barzin
dad62647cd [grampsweb] Align PVC resource to encrypted storage; imported state
## Context

Grampsweb stack had an empty Terraform state — 7 K8s resources (namespace,
PVC, service, deployment, ingress, ExternalSecret manifest, TLS secret)
existed in the cluster but weren't tracked. This blocked commit 7b248897
(ollama LLM env-var removal) from being applied because any apply would
attempt to re-create existing resources.

Additionally, the TF source declared a **grampsweb-data-proxmox** PVC on
**storage_class=proxmox-lvm**, while the cluster had **grampsweb-data-encrypted**
on **proxmox-lvm-encrypted** (1 Gi, bound). The deployment was referencing
the encrypted PVC. This divergence predated this change — the source was
simply out of date vs cluster reality.

## This change

Two things:

1. **Source alignment** (the only file diff):
   - Renames `kubernetes_persistent_volume_claim.data_proxmox` →
     `data_encrypted`, metadata.name to match cluster, storage class to
     `proxmox-lvm-encrypted` (resulting resource sketched after this list).
   - Updates the deployment volume `claim_name` reference accordingly.
   - Aligns with the newer project convention documented in
     `.claude/CLAUDE.md`: "Default for sensitive data is
     proxmox-lvm-encrypted" and "Convention: PVC names end in `-encrypted`".
   - No destroy/recreate: the PVC and deployment already use the encrypted
     PVC in the cluster; TF source now just describes reality.

2. **State imports** (out-of-band, via `scripts/tg import`, not in diff):
   - `kubernetes_namespace.grampsweb` <- `grampsweb`
   - `kubernetes_persistent_volume_claim.data_encrypted` <- `grampsweb/grampsweb-data-encrypted`
   - `kubernetes_service.grampsweb` <- `grampsweb/grampsweb`
   - `kubernetes_deployment.grampsweb` <- `grampsweb/grampsweb`
   - `module.ingress.kubernetes_ingress_v1.proxied-ingress` <- `grampsweb/family`
   - `module.tls_secret.kubernetes_secret.tls_secret` <- `grampsweb/tls-secret`
   - `kubernetes_manifest.external_secret` <- `apiVersion=external-secrets.io/v1beta1,kind=ExternalSecret,namespace=grampsweb,name=grampsweb-secrets`
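
The renamed PVC from item 1 ends up roughly as follows (a sketch limited to the attributes mentioned above):

```
resource "kubernetes_persistent_volume_claim" "data_encrypted" {
  metadata {
    name      = "grampsweb-data-encrypted"
    namespace = "grampsweb"
  }
  spec {
    storage_class_name = "proxmox-lvm-encrypted"
    access_modes       = ["ReadWriteOnce"]
    resources {
      requests = { storage = "1Gi" }
    }
  }
}
```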

## Apply result

`Apply complete! Resources: 0 added, 7 changed, 0 destroyed.`

In-place updates applied:
- Deployment: dropped `GRAMPSWEB_LLM_BASE_URL` + `GRAMPSWEB_LLM_MODEL` env
  vars (both containers) — realising the intent of commit 7b248897.
- Ingress: realigned Traefik middleware annotation + cleaned stale
  `uptime.viktorbarzin.me/external-monitor=false` annotation.
- TLS secret: removed Kyverno-generated labels (Kyverno's
  `sync-tls-secret` ClusterPolicy re-applies them on next reconcile —
  no functional impact; same pattern in 29 other stacks using
  `setup_tls_secret` module).
- Namespace, PVC, service: trivial metadata alignments (label /
  `wait_until_bound` / `wait_for_load_balancer`).
- `kubernetes_manifest.external_secret`: populated the `manifest`
  attribute after import (expected).

## What is NOT in this change

- No replica bump: deployment stays at `replicas=0` (stack is intentionally
  inactive per 2026-03-14 OOM incident note).
- No destroy/recreate of any resource.
- The broader code-w97 (11 stacks with empty state) is NOT closed — only
  grampsweb is imported. 10 stacks remain: beads-server, insta2spotify,
  isponsorblocktv, kyverno, meshcentral, pvc-autoresizer, shadowsocks,
  tor-proxy, travel_blog, + meshcentral PVC.

## Reproduce locally

```
KUBECONFIG=/home/wizard/code/config kubectl get all,ingress,pvc,externalsecret,secret -n grampsweb
# Deployment still replicas=0; PVC grampsweb-data-encrypted Bound; ingress 'family'
# on family.viktorbarzin.me; ExternalSecret SecretSynced True.

cd /home/wizard/code/infra/stacks/grampsweb
/home/wizard/code/infra/scripts/tg plan
# Expected: 'No changes.' (clean state after apply).
```

## Test Plan

### Automated
```
$ cd /home/wizard/code/infra/stacks/grampsweb && /home/wizard/code/infra/scripts/tg plan
Plan: 0 to add, 7 to change, 0 to destroy.  [pre-apply]

$ /home/wizard/code/infra/scripts/tg apply --non-interactive
Plan: 0 to add, 7 to change, 0 to destroy.
kubernetes_namespace.grampsweb: Modifications complete after 0s [id=grampsweb]
kubernetes_persistent_volume_claim.data_encrypted: Modifications complete after 0s [id=grampsweb/grampsweb-data-encrypted]
kubernetes_service.grampsweb: Modifications complete after 0s [id=grampsweb/grampsweb]
module.ingress.kubernetes_ingress_v1.proxied-ingress: Modifications complete after 0s [id=grampsweb/family]
module.tls_secret.kubernetes_secret.tls_secret: Modifications complete after 0s [id=grampsweb/tls-secret]
kubernetes_manifest.external_secret: Modifications complete after 0s
kubernetes_deployment.grampsweb: Modifications complete after 1s [id=grampsweb/grampsweb]
Apply complete! Resources: 0 added, 7 changed, 0 destroyed.

$ terraform fmt -check -recursive stacks/grampsweb
(no output - formatted clean)
```

### Manual Verification
```
$ KUBECONFIG=/home/wizard/code/config kubectl get all,ingress,pvc,externalsecret,secret -n grampsweb
# - deployment.apps/grampsweb 0/0 0 0 47d   (replicas=0 preserved)
# - service/grampsweb ClusterIP 10.106.232.205:80/TCP
# - persistentvolumeclaim/grampsweb-data-encrypted Bound pvc-c9a5dcf4... 1Gi RWO proxmox-lvm-encrypted
# - ingress/family traefik family.viktorbarzin.me -> 10.0.20.200:80,443
# - externalsecret/grampsweb-secrets vault-kv 15m SecretSynced True
# - secret/tls-secret kubernetes.io/tls
# No pod crashes (no pods — replicas=0).
```

Closes: code-8m6
2026-04-18 11:37:45 +00:00
Viktor Barzin
1de2ee307f kyverno: strip resources.limits.cpu cluster-wide via ClusterPolicy
Context
-------
The cluster policy is "no CPU limits anywhere" — CFS throttling causes
more harm than good for bursty single-threaded workloads (Node.js,
Python). LimitRanges are already correct (defaultRequest.cpu only, no
default.cpu), but 22 pods still carried CPU limits injected by upstream
Helm chart defaults — CrowdSec (lapi + agents), descheduler,
kubernetes-dashboard (×4), nvidia gpu-operator.

Previous attempts were ad-hoc: patch each values.yaml, occasionally
missing things on chart upgrade. This replaces that with a declarative
Kyverno mutation at admission time.

This change
-----------
Adds a new ClusterPolicy `strip-cpu-limits` with two foreach rules:

  strip-container-cpu-limit      → containers[]
  strip-initcontainer-cpu-limit  → initContainers[]

Each rule uses `patchesJson6902` with an `op: remove` on
`resources/limits/cpu`. JSON6902 `remove` fails on missing paths, so
per-element preconditions gate the mutation — pods without CPU limits
pass through untouched. A top-level rule precondition short-circuits
using JMESPath filter (`[?resources.limits.cpu != null] | length(@) > 0`)
so the mutation is a no-op for the overwhelming majority of pods.

Admission-time only. No `mutateExistingOnPolicyUpdate`, no `background`.
Existing pods keep their CPU limits until they're restarted naturally
(Helm upgrade, node drain, rollout). We rely on churn, not forced
restarts, to avoid unnecessary thrash.

Memory limits are preserved — they prevent OOM, still useful.
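
For orientation, a sketch of the containers rule in `kubernetes_manifest` form (the initContainers rule mirrors it; precondition expressions are paraphrased from the description above, not copied from the applied policy):

```
# Sketch only, not the applied policy verbatim.
resource "kubernetes_manifest" "strip_cpu_limits" {
  manifest = {
    apiVersion = "kyverno.io/v1"
    kind       = "ClusterPolicy"
    metadata   = { name = "strip-cpu-limits" }
    spec = {
      rules = [{
        name  = "strip-container-cpu-limit"
        match = { any = [{ resources = { kinds = ["Pod"] } }] }
        preconditions = {
          all = [{
            # fast path: skip pods where no container carries a CPU limit
            key      = "{{ request.object.spec.containers[?resources.limits.cpu != null] | length(@) }}"
            operator = "GreaterThan"
            value    = 0
          }]
        }
        mutate = {
          foreach = [{
            list = "request.object.spec.containers"
            preconditions = {
              all = [{
                # JSON6902 remove fails on missing paths, so gate per element
                key      = "{{ element.resources.limits.cpu || '' }}"
                operator = "NotEquals"
                value    = ""
              }]
            }
            patchesJson6902 = <<-EOT
              - op: remove
                path: /spec/containers/{{elementIndex}}/resources/limits/cpu
            EOT
          }]
        }
      }]
    }
  }
}
```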

Flow
----

    admission request → match Pod + CREATE
                     → top-level precondition: any container has limits.cpu?
                           no  → skip (fast path)
                           yes → foreach container:
                                   element.limits.cpu present?
                                       no  → skip element
                                       yes → remove /spec/containers/N/resources/limits/cpu
                     → same again for initContainers
                     → mutated pod proceeds to API server

Verification
------------
  kubectl run test-strip-cpu --overrides='{limits:{cpu:500m,memory:64Mi}}'
    → admitted pod.resources = {limits:{memory:64Mi}, requests:{cpu:50m,memory:32Mi}}
    → CPU limit stripped, memory preserved, requests untouched

  kubectl rollout restart deploy/kubernetes-dashboard-metrics-scraper
    → new pod.resources = {limits:{memory:400Mi}, requests:{cpu:100m,memory:200Mi}}
    → cluster-wide count of pods with CPU limits: 22 → 21

Rollout
-------
Remaining 21 pods will drop their CPU limits on natural churn. No manual
restarts in this change — user may want to time a mass restart with a
maintenance window.

Closes: code-eaf
Closes: code-4bz

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:34:39 +00:00
Viktor Barzin
903fc8377f [cleanup] Remove ollama from dashy + docs + nfs_directories
## Context
Final stage (9) of ollama decommission. After the stack was destroyed in
commit 0386f03f, several residual references remained:
- Vault KV `secret/ollama` (metadata + versions)
- `secrets/nfs_directories.txt` line listing `ollama` as a backup target
- `stacks/dashy/conf.yml` — "Ollama" tile linking to `ollama.viktorbarzin.me`
- `stacks/homepage/INGRESS_WIDGET_MAPPING.md` — 3 rows documenting the
  now-removed ingresses (ollama, ollama-api, ollama-server)

## This change
- `vault kv metadata delete secret/ollama` → all versions + metadata deleted.
- `secrets/nfs_directories.txt`: removed the `ollama` entry (line 71).
- `stacks/dashy/conf.yml`: removed the Ollama tile (`&ref_42`) and its
  reference at the end of the list; applied via Terragrunt so the running
  dashy ConfigMap picks up the change. Dashy apply: 0 added, 4 changed, 0
  destroyed (the ConfigMap diff plus the usual benign Kyverno drift).
- `stacks/homepage/INGRESS_WIDGET_MAPPING.md`: removed the 3 ollama rows.

## What was considered but NOT changed
- `stacks/ytdlp/yt-highlights/app/main.py`: `OLLAMA_URL = os.getenv("OLLAMA_URL", "")`
  already falls back to empty string when unset; the env var is no longer
  injected (stage 3) so this path is dead at runtime. Leaving source alone
  to keep this commit scoped to infra-only cleanup — future app-level
  cleanup can remove the dead fallback code.
- `stacks/k8s-portal/modules/k8s-portal/files/src/routes/agent/+server.ts`:
  only mentions `var.ollama_host` in a documentation string inside a
  system-prompt template — non-functional. Will fix in a separate commit
  alongside the k8s-portal agent docs pass.

## Test plan
### Automated
- `vault kv get secret/ollama` → "No value found" (confirmed after delete).
- `scripts/tg apply` on dashy → "Apply complete! Resources: 0 added, 4 changed, 0 destroyed."
- `grep -n ollama secrets/nfs_directories.txt` → empty.

### Manual Verification
1. Open `https://dashy.viktorbarzin.me/` → Ollama tile is gone.
2. `kubectl get cm -n dashy dashy-config -o yaml | grep -i ollama` → no matches.
3. `vault kv get secret/ollama` → error "No value found at secret/data/ollama".
4. On PVE host: `rm -rf /srv/nfs-ssd/ollama` (optional — I skipped the
   on-host disk cleanup; it's a manual ops step the user can run when
   comfortable).

Closes: code-1gu

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:17:59 +00:00
Viktor Barzin
0386f03f1a [ollama] Destroy stack — decommissioned
## Context
Stage 8 of ollama decommission. With the ollama-tcp Traefik entrypoint and
IngressRouteTCP removed (stages 1-2), all downstream consumers re-routed or
cleaned (stages 3-6), and the root tfvar dropped (stage 7), the ollama stack
has no live consumers and can be destroyed.

## This change
- `terragrunt destroy -auto-approve` on stacks/ollama.
- Result: `Destroy complete! Resources: 18 destroyed.`
  - 1 namespace (ollama)
  - 2 deployments (ollama, ollama-ui)
  - 2 services (ollama, ollama-ui)
  - 3 ingresses (ollama, ollama-server, ollama-api) + 3 Cloudflare DNS
    records (proxied ollama, non-proxied A + AAAA for ollama-api)
  - 2 PVCs (ollama-data-host NFS, ollama-ui-data-proxmox — including the
    stuck Pending one from 47h ago; no finalizer trick needed)
  - 1 NFS PV (ollama-data-host)
  - 1 middleware (ollama_api_basic_auth_middleware)
  - 2 secrets (tls_secret, ollama_api_basic_auth)
  - 1 ExternalSecret manifest (external_secret)
- Directory `stacks/ollama/` fully removed.
- Verified `kubectl get ns ollama` → NotFound.

## Destroy blocker and fix
The initial `tg destroy` failed because `variable "ollama_host"` in
`stacks/ollama/main.tf` had no default and we had already removed it from
`config.tfvars` in stage 7. Added `default = "ollama.ollama.svc.cluster.local"`
to the variable, re-ran destroy successfully, then removed the whole
directory as part of this commit (so the temporary default never ships).
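
For the record, the temporary default amounted to the following (type attribute assumed; the file was deleted in the same commit):

```
# Temporary default so `terragrunt destroy` can evaluate the stack after the
# shared tfvar was removed in stage 7; deleted together with the stack directory.
variable "ollama_host" {
  type    = string
  default = "ollama.ollama.svc.cluster.local"
}
```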

## What is NOT in this change
- Vault `secret/ollama` still present (stage 9 cleanup pending if vault
  authenticated interactively).
- NFS data at `/srv/nfs-ssd/ollama/` still present (stage 9 cleanup).
- `/home/wizard/code/infra/secrets/nfs_directories.txt` still lists ollama
  (stage 9 — requires git-crypt unlock).

## Test plan
### Automated
- `scripts/tg destroy -auto-approve` → "Destroy complete! Resources: 18 destroyed."
- `kubectl get ns ollama` → "NotFound" (confirmed).

### Manual Verification
1. `kubectl get ns ollama` → NotFound.
2. `dig ollama.viktorbarzin.me @1.1.1.1` → Cloudflare record removed
   (propagation may take up to 5m).
3. `ls /home/wizard/code/infra/stacks/ollama/` → directory does not exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:16:21 +00:00
Viktor Barzin
a12b06c608 [config] Remove ollama_host root variable
## Context
Stage 7 of ollama decommission. `ollama_host` was a shared tfvar consumed by
grampsweb, trading-bot, and ytdlp (all three cleaned in previous commits in
this stack). With no consumers left, the variable is dead config.

## This change
- Removes `ollama_host = "ollama.ollama.svc.cluster.local"` from
  `config.tfvars` (root-level).
- No direct apply — future stack applies automatically stop emitting
  "Value for undeclared variable" warnings for this name.

## What is NOT in this change
- Ollama namespace + deployments still running (stage 8 destroys them).
- Stages 3, 4, 5 already removed the `variable "ollama_host"` declaration
  in each consuming stack; with this commit the shared vars file matches.

## Test plan
### Automated
- None — tfvars change takes effect on next stack apply.

### Manual Verification
- `grep ollama_host config.tfvars` → empty (confirmed).
- `grep -r ollama_host stacks/` → only `ollama.svc.cluster.local` string
  literals inside comments (rybbit worker) or the hub stack itself (ollama
  stack being destroyed next).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:14:53 +00:00
Viktor Barzin
57fdea4b99 [rybbit] Remove ollama favicon cache entry (deploy on next manual wrangler)
## Context
Stage 6 of ollama decommission. The Cloudflare Worker at
stacks/rybbit/worker/index.js maps hostnames → rybbit analytics site IDs.
With `ollama.viktorbarzin.me` going away, the mapping is dead.

## This change
- Removes the `"ollama.viktorbarzin.me": "e73bebea399f"` entry from SITE_IDS.
- **Source-only** — does NOT auto-deploy. Cloudflare Workers are deployed
  via `wrangler deploy` (manual, per user preference). The change will take
  effect on the next manual deploy at the user's convenience.

## Manual deploy (when convenient)
```
cd stacks/rybbit/worker
wrangler deploy
```

## Test plan
### Automated
- Node syntax check: file remains valid JS (trailing comma rules preserved).

### Manual Verification
After `wrangler deploy`:
1. Hit `ollama.viktorbarzin.me` (while it still exists) — should NOT inject
   rybbit script (map lookup misses, DEFAULT_SITE_ID is null).
2. Hit any other mapped host (e.g. `immich.viktorbarzin.me`) — should
   continue to inject correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:14:38 +00:00
Viktor Barzin
7091ef2dd6 [trading-bot] Remove ollama refs from commented-out source
## Context
Stage 5 of ollama decommission. The `trading-bot` stack has been entirely
commented out since 2026-04-06 (deployments scaled to 0, infra disabled to
prevent re-creation on apply). The commented body still contained references
to `var.ollama_host`, `TRADING_OLLAMA_HOST`, and `TRADING_OLLAMA_MODEL`.
Removing them now means that if/when the stack is ever re-enabled, those dead
references won't resurface.

## This change
- Removes `variable "ollama_host"` from the commented-out block.
- Removes `TRADING_OLLAMA_HOST` and `TRADING_OLLAMA_MODEL` from the
  commented `common_env` locals.
- Verified the outer `/* ... */` comment block still wraps the entire stack
  (head: `/*`, tail: `*/`).
- No apply needed — stack is disabled.

## Test plan
### Automated
- None — file content is inside a block comment; Terraform parser ignores it.
- `terraform fmt` check: no effect (commented content).

### Manual Verification
- `head -1 stacks/trading-bot/main.tf` → `/*`
- `tail -1 stacks/trading-bot/main.tf` → `*/`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:14:22 +00:00
Viktor Barzin
7b248897d3 [grampsweb] Remove ollama_host source refs (apply blocked by bd-w97)
## Context
Stage 4 of ollama decommission. `grampsweb` referenced `var.ollama_host` for
its `GRAMPSWEB_LLM_BASE_URL` + `GRAMPSWEB_LLM_MODEL` env vars. This stack is
currently missing from Terraform state (blocked by bd-w97, which handles
state imports for 11 stacks including grampsweb) — so an apply would fail on
"resource already exists" errors.

## This change
- Deletes `variable "ollama_host"` declaration (stacks/grampsweb/main.tf).
- Deletes the two env entries `GRAMPSWEB_LLM_BASE_URL` and
  `GRAMPSWEB_LLM_MODEL` from the `common_env` locals block.
- **Source-only** — NO apply performed, because the stack cannot apply
  cleanly until bd-w97 resolves state imports. When that unblocks, the next
  apply will pick up the already-clean source.

## Why not apply now
- Running `scripts/tg apply` would try to create ~7 resources that already
  exist in K8s (namespace, PVCs, deployments, ingress, etc.), producing
  "already exists" errors for each.
- Once bd-w97 imports those into state, the next apply will be a no-op for
  them and will rollout the LLM env-var removal without issue.

## Test plan
### Automated
- No apply performed — stack blocked on bd-w97.
- `terraform fmt` on main.tf: no issues.

### Manual Verification
After bd-w97 resolves:
1. `scripts/tg plan` should show only the env-var removal on `grampsweb`
   deployments (no resource creates).
2. `scripts/tg apply` → deployments rollout with `GRAMPSWEB_LLM_*` vars gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:14:01 +00:00
Viktor Barzin
c175cfd69b [ytdlp] Remove ollama_host variable and fallback env vars
## Context
Stage 3 of ollama decommission. `ytdlp` had an Ollama fallback path for when
OpenRouter models failed. With ollama going away, that fallback is
inoperable — removing the variable and two env entries prevents pods from
ever attempting to hit a service that no longer exists.

## This change
- Drops `variable "ollama_host"` from stacks/ytdlp/main.tf.
- Drops the two env entries `OLLAMA_URL` and `OLLAMA_MODEL` (plus their
  preceding comment) from the yt-highlights container.
- Apply: `0 added, 4 changed, 0 destroyed` — deployments rolled out fresh
  env, plus benign Kyverno ndots drift (already accepted).
- Verified `kubectl get deploy -n ytdlp` no longer exposes OLLAMA_URL.

## What is NOT in this change
- OpenRouter primary path unchanged.
- config.tfvars `ollama_host` still present (stage 7 removes it).

## Test plan
### Automated
- `scripts/tg plan` → 4 in-place updates, 0 destroy.
- `scripts/tg apply` → "Apply complete! Resources: 0 added, 4 changed, 0 destroyed."

### Manual Verification
1. `kubectl get deploy -n ytdlp -o yaml | grep OLLAMA` → empty.
2. yt-highlights continues processing via OpenRouter (check container logs for
   successful OpenRouter responses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:13:42 +00:00
Viktor Barzin
cc44bccfaa [traefik] Remove ollama-tcp entrypoint
## Context
Stage 2 of ollama decommission. The Traefik `ollama-tcp` entrypoint on port
11434 forwarded TCP traffic to the ollama service. With the IngressRouteTCP
already deleted (previous commit), the entrypoint is now orphaned — removing
it cleans up the Helm values and closes the port on the LB IP.

## This change
- Deletes the `ollama-tcp` entry from the `ports` map in traefik Helm values.
- Apply: `0 added, 4 changed, 0 destroyed` — helm_release.traefik rolled out
  new config, 3 auxiliary deployments picked up benign Kyverno ndots drift
  (already accepted per user approval).

## Verification
- `kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'`
  output: `piper-tcp web websecure websecure-http3 whisper-tcp`
- `ollama-tcp` no longer listed.

## Test plan
### Automated
- `scripts/tg plan` showed 4 in-place updates, 0 destroy.
- `scripts/tg apply` → "Apply complete! Resources: 0 added, 4 changed, 0 destroyed."

### Manual Verification
1. `kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'`
2. Confirm `ollama-tcp` is absent from the output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:12:59 +00:00
Viktor Barzin
dbf7732a66 [uptime-kuma] Bump CPU + memory requests to reduce TTFB jitter
## Context
Uptime Kuma TTFB was bimodal — fast ~150ms responses mixed with slow
~3s responses — median 1.7s, p95 3.2s across 20 samples. CPU request
was 50m (5% of one core) against a Node.js process that handles ~190
monitors plus SQLite DB maintenance. Memory request was 64Mi while
actual RSS sat around 221Mi, so the pod was also running above its
guaranteed memory floor and subject to eviction pressure when nodes got
tight.

CPU limits are intentionally absent cluster-wide (CFS throttling caused
more pain than it solved), so the only knob to give the scheduler a
higher floor is the request itself. Raising the request makes the node
reserve more CPU for the pod and lets the kernel's CFS weight it more
generously when the node is busy — should reduce the tail on the slow
path without introducing throttling.

## This change
- requests.cpu: 50m -> 100m
- requests.memory: 64Mi -> 128Mi
- limits.memory: unchanged at 512Mi
- limits.cpu: still unset (explicit — cluster-wide rule)
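
In provider terms the resulting container resources block is roughly the following fragment (a sketch, not the literal diff):

```
# Fragment of the uptime-kuma container spec; sits inside the existing
# kubernetes_deployment's container block.
resources {
  requests = {
    cpu    = "100m"  # was 50m
    memory = "128Mi" # was 64Mi
  }
  limits = {
    memory = "512Mi" # unchanged; no cpu limit on purpose (cluster-wide rule)
  }
}
```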

## What is NOT in this change
- No CPU limit added
- No readiness/liveness probe tuning
- No replica count change (still 1, Recreate strategy)
- No DB layer / SQLite tuning

## Measurements (20 curl samples of https://uptime.viktorbarzin.me/)

Before:
  min    0.143s
  median 1.727s
  p95    3.163s
  max    3.204s
  mean   1.768s

After:
  min    0.149s
  median 1.228s
  p95    3.154s
  max    3.283s
  mean   1.590s

Median dropped ~29% (1.73s -> 1.23s). Tail (p95/max) essentially
unchanged — the slow bucket appears driven by something other than
CPU scheduling (likely socket.io / SSR render path inside the app,
or TLS/cf-tunnel handshake — worth a separate investigation).

Closes: code-79d
2026-04-18 11:11:39 +00:00
Viktor Barzin
80b6591e8b [whisper] Remove ollama_tcp IngressRouteTCP (ollama decom)
## Context
Ollama is being decommissioned. The `ollama_tcp_ingressroute` manifest in
stacks/whisper routed Traefik TCP entrypoint 11434 → ollama service in the
ollama namespace. With ollama going away, this route is dead weight and
blocks the subsequent destroy of the ollama stack.

## This change
- Deletes `kubernetes_manifest.ollama_tcp_ingressroute` from stacks/whisper/main.tf
- Apply result: 0 added, 5 changed, 0 destroyed (the manifest destroy happened in a
  previous partial-apply; the 5 "changed" resources are benign Kyverno ndots /
  PVC ownership drift which was already accepted per the user's approval).
- Verified `kubectl get ingressroutetcp -n traefik ollama-tcp` returns NotFound.

## What is NOT in this change
- Traefik entrypoint 11434 still exists (stage 2)
- Ollama namespace, deployments, services still present (stage 8)

## Test plan
### Automated
- `scripts/tg plan` showed 1 destroy (ollama_tcp_ingressroute), 1 create (data_proxmox
  PVC import), 4 benign updates.
- `scripts/tg apply -auto-approve` → "Apply complete! Resources: 0 added, 5 changed, 0 destroyed."

### Manual Verification
- kubectl get ingressroutetcp -n traefik ollama-tcp → NotFound (confirmed)
- kubectl get ingressroutetcp -n traefik whisper-tcp piper-tcp → still present

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:11:21 +00:00
Viktor Barzin
69fbd0ffd6 [docs] Update auto-upgrade docs — new HTTP auth path + n8n expression gotcha
Replaces the stale "Dev VM SSH key" secret entry with the current
`claude-agent-service` bearer token path (synced to both consumer +
caller namespaces). Adds an "n8n workflow gotchas" section documenting:

1. The workflow is DB-state, not Terraform-managed — the JSON in the
   repo is a backup, not authoritative.
2. Header-expression syntax: `=Bearer {{ $env.X }}` works, JS concat
   `='Bearer ' + $env.X` does NOT — costs silent 401s.
3. `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` requirement.
4. 401-troubleshooting steps and the UPDATE pattern for in-place
   workflow patches.

Follow-up to 99180bec which fixed the actual pipeline break.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:42:11 +00:00
Viktor Barzin
99180bec42 [n8n] Fix broken DIUN auto-upgrade pipeline — missing auth token to claude-agent-service
## Context

DIUN has been detecting image updates and firing Slack + webhook
notifications for weeks, but zero automated upgrades ran because the
handoff from n8n to claude-agent-service was silently 401-ing.

The pipeline (DIUN → n8n webhook → claude-agent-service /execute →
service-upgrade agent) was migrated from DevVM SSH to K8s HTTP in
42f1c3cf. The migration wired `claude-agent-service` (API_BEARER_TOKEN
env set), updated the n8n workflow JSON to POST with `Authorization:
Bearer $env.CLAUDE_AGENT_API_TOKEN`, but missed two things on the n8n
side:

1. The deployment didn't expose `CLAUDE_AGENT_API_TOKEN` to the n8n
   container — workflow sent `Authorization: Bearer ` (empty).
2. The workflow header expression used JS concat (`='Bearer ' + $env.X`)
   which n8n 1.x does NOT evaluate in HTTP Request node header params.
   It needs template-literal form: `=Bearer {{ $env.X }}`.

Evidence: `claude-agent-service` logs showed only `/health` probes —
zero `/execute` calls over 12h despite DIUN firing webhooks. n8n PG
execution 2250 returned `401 Missing bearer token`.

## This change

- Adds ExternalSecret `claude-agent-token` in the `n8n` namespace that
  pulls `api_bearer_token` from Vault `secret/claude-agent-service`
  (same source as the receiving service's token).
- Wires the token into the n8n container as env var
  `CLAUDE_AGENT_API_TOKEN` via `secret_key_ref` (sketched after this list).
- Sets `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` so expressions CAN read
  `$env.*` at all (default in 1.x is false already, but setting
  explicitly guards against upstream default flips).
- Fixes the workflow JSON backup (`workflows/diun-upgrade.json`) header
  expression to use `{{ $env.X }}` template syntax.
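
The deployment-side wiring from the first two bullets looks roughly like this (a fragment of the n8n container block; the synced Secret name and data key are assumptions based on the ExternalSecret described above):

```
# Hypothetical fragment; lives inside the existing n8n kubernetes_deployment container block.
env {
  name = "CLAUDE_AGENT_API_TOKEN"
  value_from {
    secret_key_ref {
      name = "claude-agent-token" # Secret synced by the new ExternalSecret (name assumed)
      key  = "api_bearer_token"   # field pulled from Vault secret/claude-agent-service
    }
  }
}
env {
  # allow $env.* in n8n expressions (explicit, guards against default flips)
  name  = "N8N_BLOCK_ENV_ACCESS_IN_NODE"
  value = "false"
}
```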

The live workflow in n8n's PG DB was also patched in place (one-time
`UPDATE workflow_entity SET nodes = REPLACE(...)` — workflows are not
TF-managed; they were imported once).

## What is NOT in this change

- No retroactive re-run of skipped DIUN events. They'll be rediscovered
  in future scans.
- No change to the `claude-agent-service` side — its token and endpoint
  were already correct.
- No Slack alert on n8n HTTP-node failures — future work; right now a
  broken workflow fails silently unless you check Execution History.

## End-to-end verification

```
$ curl -X POST n8n.viktorbarzin.me/webhook/30805ab6-... \
    -d '{"diun_entry_status":"update","diun_entry_image":"docker.io/library/httpd","diun_entry_imagetag":"2.4.66",...}'
{"message":"Workflow was started"}  HTTP 200

# n8n PG: execution_entity latest row  → status=success
# claude-agent-service logs           → "POST /execute HTTP/1.1" 202 Accepted
```

## Reproduce locally

```
1. vault login -method=oidc
2. cd stacks/n8n && ../../scripts/tg apply
3. kubectl -n n8n exec deploy/n8n -- printenv CLAUDE_AGENT_API_TOKEN
   (should print 64-char hex)
4. Fire synthetic webhook with non-critical image (httpd / alpine)
5. Check n8n execution is success, claude-agent-service shows 202
```

Closes: code-ekz
Related: code-bck

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:41:09 +00:00