Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.
beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.
Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).
Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.
Closes: code-8hk
Closes: code-jh3c
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.
The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`
So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.
## Flow
```
user: bd assign <id> agent
│
▼
Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐
│ │
▼ │
CronJob: beads-dispatcher │
1. GET beadboard/api/agent-status (busy? skip) │
2. bd query 'assignee=agent AND status=open' │
3. bd update -s in_progress (claim) │
4. POST beadboard/api/agent-dispatch │
5. bd note "dispatched: job=…" │
│ │
▼ │
claude-agent-service /execute │
beads-task-runner agent runs; notes/closes bead │
│ │
▼ │
done ──► next tick picks up the next bead ───────────────┘
CronJob: beads-reaper (every 10 min)
for bead (assignee=agent, status=in_progress, updated_at > 30 min):
bd note "reaper: no progress for Nm — blocking"
bd update -s blocked
```
## Decisions
- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
2-min poll cadence and ~5-min average run, throughput is ~12 beads/hour.
Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
`curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
the image-seeded file. The script copies it into `/tmp/.beads/` because bd
may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
`beads-task-runner` never trips the reaper. Failures trip it; pod crashes
(in-memory job state lost) also trip it.
## What is NOT in this change
- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
`cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.
## Deviations from plan
Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
serializes `notes` as a string (not an array), and every `bd note` bumps
`updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU
`-d` and the image has python3.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.
## Test plan
### Automated
```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!
$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1) # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.
$ terraform fmt stacks/beads-server/main.tf
# (no output — already formatted)
```
### Manual verification
1. **Apply**
```
vault login -method=oidc
cd infra/stacks/beads-server
scripts/tg apply
```
Expect: `kubernetes_config_map.beads_metadata`,
`kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
created. No changes to existing resources.
2. **CronJobs exist with right schedule**
```
kubectl -n beads-server get cronjob
```
Expect `beads-dispatcher */2 * * * *` and `beads-reaper */10 * * * *`,
both with `SUSPEND=False`.
3. **End-to-end smoke**
```
bd create "auto-dispatch smoke test" \
-d "Read /etc/hostname inside the agent sandbox and close." \
--acceptance "bd note includes 'hostname=' line and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '{status, notes}'
```
Expect notes to contain `auto-dispatcher claimed at …` and
`dispatched: job=<uuid>`, status `in_progress`.
4. **Reaper smoke**
Assign + dispatch a long bead, then
`kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
30 min + one reaper tick, `bd show <id>` shows `blocked` with a
`reaper: no progress for Nm — blocking` note.
5. **Kill switch**
```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
kubectl -n beads-server get cronjob
```
Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
nothing happens within 5 min. Re-apply with `=true` to re-enable.
Runbook with all above plus reaper semantics + design choices at
`infra/docs/runbooks/beads-auto-dispatch.md`.
Closes: code-8sm
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Two rolling updates tied to the BeadBoard dispatch-button work (code-kel):
1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent
(files in /usr/share/agent-seed/), the beads-task-runner agent, and
hmac.compare_digest bearer verification. The tag moves from 382d6b14
to 0c24c9b6 (monorepo HEAD).
2. The beadboard Deployment in beads-server now consumes
CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image
needs the Dispatch button + /api/agent-dispatch + /api/agent-status
routes. Tag moves from :latest to :17a38e43 (fork HEAD on
github.com/ViktorBarzin/beadboard).
## What this change does
- Flips `local.image_tag` in claude-agent-service main.tf.
- Drops the "temporary" comment on `beadboard_image_tag` and sets the
default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md
"Use 8-char git SHA tags — `:latest` causes stale pull-through cache").
## Test Plan
## Automated
- Both images already pushed to registry.viktorbarzin.me{:5050}/ :
- claude-agent-service:0c24c9b6 verified via
`docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/
contains both seed files.
- beadboard:17a38e43 pushed, digest cd0d3c47.
- terraform fmt/validate clean on both stacks from the earlier commits.
## Manual Verification
1. Push triggers Woodpecker default.yml.
2. Expected: both stacks apply; claude-agent-service pod rolls (new
seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch
+ copies beads-task-runner.md), beadboard pod rolls with new env vars
sourced from beadboard-agent-service ExternalSecret.
3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:`
should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard
-o yaml | grep image:` should show :17a38e43.
Closes: code-kel
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.
The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the regular expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.
What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.
## This change
- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
Attached inline on every `spec[0].template[0].spec[0].dns_config` line
(or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
`AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.
## What is NOT in this change
- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
— that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
save for the inline comment.
- The fallback module the original plan anticipated — skipped because
Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
(claude-agent-service, freedify/factory, hermes-agent). Reverted to
keep this commit scoped to the convention rollout.
## Before / after
Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```
After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
## Test Plan
### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
27
$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21
# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```
### Manual Verification
No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.
## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
suppression.
Closes: code-28m
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
BeadBoard is the Next.js task visualization dashboard shipped in this
stack. We want users to trigger headless Claude agent runs directly from
a beads task row — "one-click dispatch" — instead of copy-pasting `bd`
IDs into a terminal. The agent runs in-cluster as claude-agent-service
(see stacks/claude-agent-service/), protected by a bearer token in
Vault at secret/claude-agent-service/api_bearer_token.
For BeadBoard to POST to /execute we need the service URL and the
bearer token available inside the pod as env vars. The URL is static
(cluster DNS); the token must come through External Secrets Operator
so rotation in Vault propagates without re-applying Terraform.
Secondary cleanup: the container was still pinned to :latest which
violates the 8-char-SHA convention and causes stale pulls through the
registry cache (see .claude/CLAUDE.md, Docker images). The image tag
is now variable-driven; the GHA pipeline will override the default
once it publishes the first SHA.
## This change
- Adds an ExternalSecret `beadboard-agent-service` in the
`beads-server` namespace, mirroring the pattern in
stacks/claude-agent-service/main.tf (same Vault path
`secret/claude-agent-service`, same `vault-kv` ClusterSecretStore,
same 15m refresh). Exposes exactly one key: `api_bearer_token`.
- Adds two env vars to the `beadboard` container:
- `CLAUDE_AGENT_SERVICE_URL` — static cluster URL
(`http://claude-agent-service.claude-agent.svc.cluster.local:8080`)
- `CLAUDE_AGENT_BEARER_TOKEN` — `secret_key_ref` pointing at the
ESO-managed Secret, key `api_bearer_token`
- Adds `reloader.stakater.com/auto = "true"` on the Deployment's
top-level metadata — matches the convention used by rybbit,
claude-memory, onlyoffice. When ESO refreshes the K8s Secret
because Vault rotated the token, Reloader restarts the pod so the
new token is picked up (env vars are read once at boot).
- Adds `variable "beadboard_image_tag"` (default `"latest"`, with a
one-line comment flagging the temporary default). The image
reference now interpolates `${var.beadboard_image_tag}`. No tfvars
file is touched — orchestrator will flip the default to the first
real 8-char SHA once GHA publishes it.
## What is NOT in this change
- No GHA workflow additions. The pipeline that builds
`registry.viktorbarzin.me:5050/beadboard` lives in the BeadBoard
repo and is out of scope here.
- No Vault-side changes. `secret/claude-agent-service/api_bearer_token`
already exists (it powers the claude-agent-service deployment
itself).
- No Terraform `apply`. Orchestrator applies.
## Data flow
Vault (secret/claude-agent-service)
│ refresh every 15m
▼
ESO → K8s Secret `beadboard-agent-service` (beads-server ns)
│ envFrom.secretKeyRef
▼
BeadBoard pod (CLAUDE_AGENT_BEARER_TOKEN env)
│ Authorization: Bearer <token>
▼
claude-agent-service.claude-agent.svc:8080 /execute
On Vault rotation: ESO picks up new value at next refresh → K8s
Secret data changes → Reloader sees annotation + referenced Secret
changed → rolling-recreates the beadboard pod with the new token.
## Test Plan
### Automated
- `terraform fmt -recursive stacks/beads-server/` — clean (formatted
the file once; subsequent run is a no-op).
- `terraform -chdir=stacks/beads-server validate` (after
`terraform init -backend=false`) — `Success! The configuration is
valid`. The 14 "Deprecated Resource" warnings are pre-existing
(`kubernetes_namespace` vs `_v1` etc.) and unrelated to this
change.
### Manual Verification
1. Orchestrator applies:
`scripts/tg -chdir=stacks/beads-server apply`
2. Verify the ExternalSecret synced:
`kubectl -n beads-server get externalsecret beadboard-agent-service`
Expected: `Ready=True`, `SyncedAt` recent.
3. Verify the K8s Secret exists with one key:
`kubectl -n beads-server get secret beadboard-agent-service -o jsonpath='{.data.api_bearer_token}' | base64 -d | head -c 8`
Expected: first 8 chars of the bearer token.
4. Verify the deployment picked up the env vars:
`kubectl -n beads-server get deploy beadboard -o yaml | grep -A2 CLAUDE_AGENT`
Expected: both env entries present, bearer via `secretKeyRef`.
5. Verify the reloader annotation is on the Deployment metadata:
`kubectl -n beads-server get deploy beadboard -o jsonpath='{.metadata.annotations.reloader\.stakater\.com/auto}'`
Expected: `true`.
6. Verify the image tag resolved to the variable default (for now):
`kubectl -n beads-server get deploy beadboard -o jsonpath='{.spec.template.spec.containers[0].image}'`
Expected: `registry.viktorbarzin.me:5050/beadboard:latest`
(will become `...:<sha>` once `beadboard_image_tag` default is
updated).
7. Smoke-test the env var inside the pod:
`kubectl -n beads-server exec deploy/beadboard -- sh -c 'printenv CLAUDE_AGENT_SERVICE_URL; printenv CLAUDE_AGENT_BEARER_TOKEN | head -c 8'`
Expected: URL printed, first 8 chars of token printed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [f1-stream] Remove committed cluster-admin kubeconfig
## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.
Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.
## This change
- git rm stacks/f1-stream/files/.config
## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
must be invalidated separately via `kubeadm certs renew admin.conf` or
CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
fresh mirror is planned and will be force-pushed to both remotes.
## Test plan
### Automated
No tests needed for a file removal. Sanity:
$ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output)
### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
fails with 401/403 once the admin cert is renewed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.
Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.
## This change
- git rm modules/kubernetes/frigate/config.yaml
## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
must be coordinated out-of-band with the camera operators. The DDNS
camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
commits from bcad200a forward. Scheduled to be purged via
`git filter-repo --path modules/kubernetes/frigate/config.yaml
--invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
source config from Git rather than the PVC, the replacement should go
through ExternalSecret + env-var interpolation, not an inline YAML.
## Test plan
### Automated
$ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms orphan status)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
PVC bound (unaffected by this change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token
## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.
The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.
A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.
## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh
## What is NOT in this change
- Technitium token rotation. The leaked token still works against
`technitium-web.technitium.svc.cluster.local:5380` until revoked in the
Technitium admin UI. Rotation is a prerequisite for the upcoming
git-history scrub, which will remove the token from every commit via
`git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
stacks keep working.
## Test plan
### Automated
$ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms no consumer)
$ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
(no output in HEAD after this commit)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
behavioral regression because renew.sh was never part of the automated
flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds
## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.
The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.
## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh
main.py is retained unchanged.
## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
vendor default `calvin` regardless; rotation is tracked in the broader
remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
(the redfish-exporter ConfigMap has `default: username: root, password:
calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
addressed here — filed as its own task so the fix (drop the default
block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.
## Test plan
### Automated
$ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
--include='*.tf' --include='*.hcl' --include='*.yaml' \
--include='*.yml' --include='*.sh'
(no consumer references)
### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
it was never coupled to main.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink
## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.
The 3 outliers shipped their keys in plaintext because `.gitattributes`
secrets/** rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).
Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; different
distinct cert material but equivalent coverage.
## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
.gitattributes as a regression guard — any future real file placed
under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.
setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.
## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
once the user's LE account is authenticated. Revocation must happen
before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
commit from 2026-03-11 forward. Scheduled for removal via
`git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
(and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
this commit matches the existing symlink-to-wildcard pattern rather
than introducing a new component.
## Test plan
### Automated
$ readlink stacks/foolery/secrets
../../secrets
(likewise for terminal, claude-memory)
$ for s in foolery terminal claude-memory; do
openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
done
subject=CN = viktorbarzin.me (x3 — all resolve via symlink to root wildcard)
$ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
stacks/foolery/secrets/fullchain.pem: filter: git-crypt
(now matched by the new rule, though for the symlink target the
repo-root rule already applied)
### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
shows only the K8s TLS secret being re-created with the root-wildcard
material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
<name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
claude-memory) → cert chain presents the new serial, handshake OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add broker-sync Terraform stack (pending apply)
Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.
This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
DockerHub image; no pull secret):
* `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
version`), used to smoke-test each new image.
* `broker-sync-trading212` — daily 02:00 `broker-sync trading212
--mode steady`.
* `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
* `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
* `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
(Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
the convention in infra/.claude/CLAUDE.md §3-2-1.
NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
stacks/wealthfolio/main.tf — happens in the same commit that first
applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
required keys documented in the ExternalSecret comment block.
Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
apply time.
## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
`broker-sync 0.1.0`.
* fix(beads-server): disable Authentik + CrowdSec on Workbench
Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.
TODO: Create Authentik application for dolt-workbench.viktorbarzin.me
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GRAPHQLAPI_URL must point to localhost:9002 (internal), not the external
URL which goes through Authentik. SSR can't authenticate to Authentik.
Also removed Authentik from /graphql ingress — browser fetch() can't
follow 302 redirects on POST requests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The env var was only set via kubectl and got overwritten on next apply.
Now permanently in the deployment spec.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Workbench's database connection is in-memory and lost on pod restart.
Added startup script that waits for GraphQL server readiness, then calls
addDatabaseConnection mutation automatically. No more manual reconnection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed project_id mismatch (was "beadboard", should be actual DB project ID)
- Rebuilt Docker image with bd v1.0.2 binary (node:20-slim for glibc compat)
- Ran bd migrate to update schema from 1.0.0 → 1.0.2 (adds started_at, etc.)
- Task creation and bd CLI now work inside the container
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BeadBoard needs to create templates/ and archetypes/ subdirectories
inside .beads/. ConfigMap mounts are read-only, causing ENOENT errors
and 503 responses. Fix: init container copies ConfigMap to emptyDir.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add BeadBoard (zenchantlive/beadboard) alongside Dolt server and Workbench
for task dependency graph, kanban, and agent coordination views.
- Built custom Docker image (registry.viktorbarzin.me:5050/beadboard)
- ConfigMap provides .beads/metadata.json pointing to Dolt server
- Behind Authentik auth at beadboard.viktorbarzin.me
- Also fixed: GraphQL ingress now has Authentik middleware
- Also fixed: Workbench store.json type enum (mysql → Mysql)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dolt Workbench hardcodes http://localhost:9002/graphql in the built JS.
For k8s hosting, init container patches this to relative /graphql path.
Second ingress routes /graphql to port 9002 behind Authentik auth.
- Init container copies static JS to writable emptyDir, patches URL
- Pre-seeds store.json with Dolt connection config
- Added /graphql ingress with Authentik forward-auth
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set protected=true on ingress (Authentik forward-auth)
- Remove unused DATABASE_URL env var (Workbench uses browser-based connection config)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deploy dolthub/dolt-workbench alongside the Dolt server in beads-server
namespace. Provides SQL console, spreadsheet editor, and commit graph
visualization for the centralized beads task database.
- Workbench at dolt-workbench.viktorbarzin.me (Cloudflare-proxied)
- Connects to Dolt server via in-cluster service DNS
- Added to cloudflare_proxied_names for external access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>