Commit graph

20 commits

Author SHA1 Message Date
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
a62b43d19e [infra] Document intended ignore_changes drift-workarounds [ci skip]
## Context

The infra repo has 31 `ignore_changes` blocks. Phase 1 of the state-drift
consolidation audit classified 21 as legitimate (immutable fields, cloud-computed
values) and 10 as intentional workarounds for known drift sources. The remaining
10 were indistinguishable from accidental/forgotten drift suppression without
reading the surrounding context.

This commit adds a uniform `# DRIFT_WORKAROUND: <reason>, reviewed 2026-04-18`
marker above the 8 intended-workaround blocks (6 CI image-tag decoupling + 2
non-deterministic secret hashes) so they are easy to distinguish from
accidental drift suppression during future audits.

## What is NOT in this change

- Functional behavior — `ignore_changes` lists are byte-identical.
- The Kyverno `dns_config` ignore paths (covered by Wave 3 shared module).
- Workarounds being removed — the CI decoupling is intentional by user decision.

## Files touched

CI image-tag decoupling (6):
- stacks/k8s-portal/modules/k8s-portal/main.tf (also has dns_config for Kyverno)
- stacks/novelapp/main.tf
- stacks/claude-memory/main.tf
- stacks/plotting-book/main.tf
- stacks/trading-bot/main.tf (api deployment)
- stacks/trading-bot/main.tf (workers deployment — 6 containers)

Non-deterministic secret hashes (2):
- stacks/owntracks/main.tf (htpasswd bcrypt)
- stacks/mailserver/modules/mailserver/main.tf (postfix-accounts.cf)

## Test Plan

### Automated
```
$ rg DRIFT_WORKAROUND stacks/ | wc -l
8

$ terraform fmt -recursive stacks/k8s-portal stacks/novelapp stacks/claude-memory \
    stacks/plotting-book stacks/trading-bot stacks/owntracks stacks/mailserver
(no output — already formatted)

$ git diff --stat
 stacks/claude-memory/main.tf                 | 1 +
 stacks/k8s-portal/modules/k8s-portal/main.tf | 1 +
 stacks/mailserver/modules/mailserver/main.tf | 3 ++-
 stacks/novelapp/main.tf                      | 1 +
 stacks/owntracks/main.tf                     | 1 +
 stacks/plotting-book/main.tf                 | 1 +
 stacks/trading-bot/main.tf                   | 2 ++
 7 files changed, 9 insertions(+), 1 deletion(-)
```

### Manual Verification
No apply required — HCL comments only, zero effect on plan output.

## Reproduce locally
1. `cd infra && git pull`
2. `rg "DRIFT_WORKAROUND.*reviewed 2026-04-18" stacks/ | wc -l` → expect 8
3. `terraform fmt -check -recursive stacks/` → expect clean exit

Closes: code-yrg

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:08:10 +00:00
Viktor Barzin
e51bdb2af8 Add broker-sync Terraform stack (#7)
* [f1-stream] Remove committed cluster-admin kubeconfig

## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.

Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.

## This change
- git rm stacks/f1-stream/files/.config

## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
  must be invalidated separately via `kubeadm certs renew admin.conf` or
  CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
  c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
  fresh mirror is planned and will be force-pushed to both remotes.

## Test plan
### Automated
No tests needed for a file removal. Sanity:
  $ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output)

### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
     stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
   verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
   fails with 401/403 once the admin cert is renewed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [frigate] Remove orphan config.yaml with leaked RTSP passwords

## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.

Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.

## This change
- git rm modules/kubernetes/frigate/config.yaml

## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
  must be coordinated out-of-band with the camera operators. The DDNS
  camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
  password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
  commits from bcad200a forward. Scheduled to be purged via
  `git filter-repo --path modules/kubernetes/frigate/config.yaml
  --invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
  source config from Git rather than the PVC, the replacement should go
  through ExternalSecret + env-var interpolation, not an inline YAML.

## Test plan
### Automated
  $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms orphan status)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
   PVC bound (unaffected by this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token

## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.

The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.

A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.

## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh

## What is NOT in this change
- Technitium token rotation. The leaked token still works against
  `technitium-web.technitium.svc.cluster.local:5380` until revoked in the
  Technitium admin UI. Rotation is a prerequisite for the upcoming
  git-history scrub, which will remove the token from every commit via
  `git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
  stacks keep working.

## Test plan
### Automated
  $ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms no consumer)
  $ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
  (no output in HEAD after this commit)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
     ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
   behavioral regression because renew.sh was never part of the automated
   flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds

## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.

The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.

## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh

main.py is retained unchanged.

## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
  vendor default `calvin` regardless; rotation is tracked in the broader
  remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
  (the redfish-exporter ConfigMap has `default: username: root, password:
  calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
  addressed here — filed as its own task so the fix (drop the default
  block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.

## Test plan
### Automated
  $ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
       --include='*.tf' --include='*.hcl' --include='*.yaml' \
       --include='*.yml' --include='*.sh'
  (no consumer references)

### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
   the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
   it was never coupled to main.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink

## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.

The 3 outliers shipped their keys in plaintext because `.gitattributes`
secrets/** rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).

Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
  SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; different
  distinct cert material but equivalent coverage.

## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
  ../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
  .gitattributes as a regression guard — any future real file placed
  under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.

setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.

## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
  to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
  once the user's LE account is authenticated. Revocation must happen
  before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
  commit from 2026-03-11 forward. Scheduled for removal via
  `git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
  (and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
  this commit matches the existing symlink-to-wildcard pattern rather
  than introducing a new component.

## Test plan
### Automated
  $ readlink stacks/foolery/secrets
  ../../secrets
  (likewise for terminal, claude-memory)

  $ for s in foolery terminal claude-memory; do
      openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
    done
  subject=CN = viktorbarzin.me  (x3 — all resolve via symlink to root wildcard)

  $ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
  stacks/foolery/secrets/fullchain.pem: filter: git-crypt
  (now matched by the new rule, though for the symlink target the
   repo-root rule already applied)

### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
   shows only the K8s TLS secret being re-created with the root-wildcard
   material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
   <name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
   the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
   claude-memory) → cert chain presents the new serial, handshake OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add broker-sync Terraform stack (pending apply)

Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.

This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
  ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
  auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
  cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
  DockerHub image; no pull secret):
    * `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
      version`), used to smoke-test each new image.
    * `broker-sync-trading212` — daily 02:00 `broker-sync trading212
      --mode steady`.
    * `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
    * `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
    * `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
      (Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
  NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
  the convention in infra/.claude/CLAUDE.md §3-2-1.

NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
  stacks/wealthfolio/main.tf — happens in the same commit that first
  applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
  required keys documented in the ExternalSecret comment block.

Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
  apply time.

## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
   ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
   `broker-sync 0.1.0`.

* fix(beads-server): disable Authentik + CrowdSec on Workbench

Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.

TODO: Create Authentik application for dolt-workbench.viktorbarzin.me

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:17:45 +01:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
216d4240c9 [infra] Add Cloudflare provider to all stack lock files and generated providers
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:31:36 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
faa6868f79 remove claude-memory PDB (blocks drains with single replica)
Single replica + minAvailable=1 makes drains impossible.
claude-memory is non-critical and recovers quickly. [ci skip]
2026-04-06 00:47:40 +03:00
Viktor Barzin
e4f478b490 switch claude-memory server to multi-user API_KEYS auth
Enables isolated memory namespaces per user (wizard/emo) by switching
from single API_KEY to API_KEYS JSON map env var.
2026-03-22 20:08:07 +02:00
Viktor Barzin
ad689076d8 scale down non-critical services to free cluster memory
- authentik server: 3→2, worker: 3→2, PDB minAvailable: 2→1
- tuya-bridge: 3→1
- realestate-crawler-api: 2→1
- claude-memory: 2→1
- grafana: 2→1 (config only, apply pending)
- alertmanager: 2→1 (config only, apply pending)

Estimated savings: ~1.2 Gi total
2026-03-22 03:10:12 +02:00
Viktor Barzin
a04335d0f3 right-size 14 services and scale down GPU-heavy workloads [ci skip]
Memory right-sizing based on VPA upperBound analysis:
- Increases: stirling-pdf 1200→1536Mi, claude-memory 64→128Mi,
  dawarich 512→768Mi, kyverno-cleanup 128→192Mi, linkwarden 768→1Gi,
  navidrome 64→128Mi, listenarr 768→896Mi, privatebin 64→128Mi,
  ntfy 64→128Mi, health 128→256Mi, dbaas quota 16→20Gi,
  mysql-operator 384→512Mi
- Decreases: rybbit 768→384Mi, nvidia-exporter added explicit 192Mi,
  dcgm-exporter 2560→1536Mi
- Scale to 0: ebook2audiobook/audiblez-web, whisper (GPU node pressure)

Net effect: -496Mi cluster-wide, 13 ContainerNearOOM alerts resolved,
all ResourceQuota pressures cleared, GPU health green.
2026-03-15 23:00:49 +00:00
Viktor Barzin
745e43c983 fix DB password desync + migrate remaining tfvars to Vault
DB desync fix: Stacks with Vault DB engine rotation (24h) now read
the password from vault-database ClusterSecretStore instead of vault-kv.
9 stacks updated with db ExternalSecrets reading from static-creds/*.

Stacks fixed: speedtest, hackmd, health, trading-bot, claude-memory,
woodpecker, linkwarden, nextcloud, url.

terraform.tfvars migration:
- plotting-book: google_client_id/secret → Vault KV + secret_key_ref
- tandoor: email_password var removed (was default="", now optional ESO)
- infra: ssh_private_key, vm_wizard_password, dockerhub_registry_password
  → Vault KV at secret/infra + data source
2026-03-15 21:39:45 +00:00
Viktor Barzin
06a0d0599a regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
Viktor Barzin
0f262ceda3 add pod dependency management via Kyverno init container injection
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).

Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
2026-03-15 19:17:57 +00:00
Viktor Barzin
1acf8cc4e8 migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:

Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)

Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)

17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.

Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-15 19:05:04 +00:00
Viktor Barzin
a2720f6a4c claude-memory: read DB password from Vault KV instead of tfvars
Vault DB engine rotates the password every 24h, so the static tfvars
value was stale. Now reads from secret/claude-memory db_password key.
2026-03-15 18:22:29 +00:00
Viktor Barzin
3f0e8541f6 claude-memory: pin image to :17, fixes URL-decode crash on sync endpoint
The + in +00:00 timezone offsets was being URL-decoded to a space,
causing ValueError on the /api/memories/sync endpoint. Build :17
includes the fix. Using versioned tag instead of :latest to avoid
pull-through cache serving stale images.
2026-03-15 15:32:50 +00:00
Viktor Barzin
f7c2c06009 right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
  paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)

Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
Viktor Barzin
a8d944eb9b migrate all secrets from SOPS to Vault KV
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
  vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
  data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
  jsondecode() in locals blocks

Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.

Apply order: vault stack first (populates KV), then all others.
2026-03-14 17:15:48 +00:00
Viktor Barzin
a6f71fc6f0 feat(claude-memory): add stack and update image to standalone repo
- Add claude-memory stack (was previously untracked)
- Update Docker image from viktorbarzin/claude-memory to
  viktorbarzin/claude-memory-mcp (standalone open-source repo)
- CI/CD now lives in the standalone repo's .woodpecker.yml
2026-03-14 09:49:38 +00:00