Browser visits to viktorbarzin.me started returning HTTP 500 with
`store: key not found: "challenge:..."` in pod logs. Root cause:
each Anubis pod stores in-flight challenges in process memory; with
2 replicas behind a ClusterIP, the PoW-solved request can be
routed to a different pod than the one that issued the challenge.
Anubis upstream documents the same caveat ("when running multiple
instances on the same base domain, the key must be the same across
all instances" — true for the ed25519 signing key, but the
challenge store is still pod-local without a shared backend).
Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on
pod restart. Real fix (Redis-backed challenge store) noted as a
follow-up in CLAUDE.md.
Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json),
privatebin (pb), homepage (home), real-estate-crawler (wrongmove
UI only — `/api` ingress stays direct via path-based ingress carve-
out so XHRs from the SPA bypass the challenge).
End-state: 9 public hosts now Anubis-fronted (blog, www, kms,
travel, f1, cc, json, pb, home, wrongmove). All return the
challenge HTML to bare curl/browser; verified-IP search engines and
/robots.txt + /.well-known still skip via the strict-policy
allowlist.
The default upstream policy only WEIGHs Mozilla|Opera UAs and lets
everything else (curl, wget, python-requests, scrapy, headless CLI
scrapers) fall through to the implicit ALLOW. On non-CDN-fronted
hosts (kms, anything dns_type=non-proxied) this meant a plain
`curl https://kms.viktorbarzin.me/` returned the real backend
content with no challenge — defeating the whole point of the
"avoid casual scrapers" intent.
Now the module ships a custom POLICY_FNAME mounted via ConfigMap:
- Imports the upstream deny-pathological / ai-block-aggressive /
allow-good-crawlers / keep-internet-working snippets unchanged
- Adds a final `path_regex: .*` → action: CHALLENGE catch-all
Result: only IP-verified search engines (Googlebot from Google IPs,
Bingbot, etc.) and well-known paths (robots.txt, .well-known,
favicon, sitemap) skip the challenge. Everything else — including
spoofed-Googlebot-UA-from-random-IP — solves PoW or gets nothing.
Verified post-apply: curl default UA on viktorbarzin.me + kms +
travel returns the Anubis challenge HTML; /robots.txt still 200s
straight through.
Adds modules/kubernetes/anubis_instance/ — a per-site reverse proxy
instance pinned to ghcr.io/techarohq/anubis:v1.25.0. Each instance
issues a 30-day JWT cookie scoped to viktorbarzin.me after a tiny
proof-of-work (difficulty 2 ≈ 250 ms desktop / 700 ms mobile). The
shared ed25519 signing key (Vault: secret/viktor → anubis_ed25519_key)
makes a single solve good across every Anubis-fronted subdomain.
Wired into blog (viktorbarzin.me + www), kms.viktorbarzin.me, and
travel.viktorbarzin.me — each with anti_ai_scraping=false on the
ingress so the redundant ai-bot-block forwardAuth is dropped from the
chain. Skipped forgejo (Git/API clients can't solve PoW) and resume
(replicas=0).
Also tightens bot-block-proxy nginx timeouts (3s/5s → 100ms/200ms) so
any ingress still using the ai-bot-block forwardAuth pays at most
~150 ms when poison-fountain is scaled down, instead of 3 s.
End-to-end TTFB on viktorbarzin.me dropped from ~3.2 s to ~150-200 ms.
Docs: .claude/reference/patterns.md "Anti-AI Scraping" updated to
4 layers; .claude/CLAUDE.md adds the Anubis usage paragraph and
Forgejo/API caveat.
Image migration completed (forgejo-migrate-orphan-images.sh ran +
all in-scope images now under forgejo.viktorbarzin.me/viktor/) and
the cluster cutover landed in commit 3148d15d. registry-private is
no longer needed.
* infra/modules/docker-registry/docker-compose.yml — registry-private
service block removed; nginx 5050 port mapping dropped.
* infra/modules/docker-registry/nginx_registry.conf — upstream
private block + port 5050 server block removed.
* infra/.woodpecker/build-ci-image.yml — drop the dual-push to
registry.viktorbarzin.me:5050; only push to Forgejo. Verify-
integrity step removed (the every-15min forgejo-integrity-probe
in monitoring covers it). Break-glass tarball step still runs but
pulls from Forgejo (the only registry left).
The registry-config-sync.yml pipeline will pick this commit up and
sync the new compose+nginx to the VM. Manual final step on the VM:
ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
to actually destroy the registry-private container — compose does
NOT do orphan removal on a normal up -d.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Phase 4 docker-compose + nginx changes I landed earlier dropped
the registry-private container's port-5050 listener BEFORE migrating
the existing images to Forgejo. The registry-config-sync pipeline
applied the new nginx config, breaking pulls from registry-private —
which is the source of every image we still need to copy to Forgejo.
Restore registry-private + the 5050 listener until the migration
script has finished. Subsequent commit will drop them once images
are confirmed in Forgejo.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml dual-pushes still until next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at line 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.
What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
in after the nginx→Traefik migration. Now creates a per-ingress
Buffering middleware when set; default null = no limit (preserves
existing behavior). Forgejo ingress sets max_body_size=5g to allow
multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
forgejo.viktorbarzin.me, populated from Vault secret/viktor/
forgejo_pull_token (cluster-puller PAT, read:package). Existing
Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
package + always :latest. First 7 days dry-run (DRY_RUN=true);
flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
existing registry-integrity-probe. Existing Prometheus alerts
(RegistryManifestIntegrityFailure et al) made instance-aware so
they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.
Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.
Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live run on the registry VM surfaced 632 "orphaned" index children across
156 indexes in the pull-through caches (ghcr, immich, affine, linkwarden,
openclaw). These aren't bugs — pull-through caches only fetch what's been
requested, so missing arm64 / arm / attestation children are normal partial
state. Scanning them generates noise that would mask the real signal from
the private registry (where we push full manifests ourselves and a missing
child IS always a bug — the 2026-04-13 + 2026-04-19 failure mode).
Change: index-child scan is now gated on registry_name == "private". Layer-
link scan still runs across all registries (missing blob under a live link
is always a bug, regardless of pull-through semantics).
Verified: live run now reports 0 orphans in private registry — consistent
with the hot-fix rebuild of infra-ci:latest earlier today. Layer scan
still inspects 425 links across all registries and finds 0 orphans.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.
Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.
Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
step walks the just-pushed manifest (index + children + config + every
layer blob) via HEAD and fails the pipeline on any non-200. Catches
broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
three alerts — RegistryManifestIntegrityFailure,
RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
"registry serves 404 for a tag that exists" gap that masked the incident
for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
timeline, monitoring gaps, permanent fix.
Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
_manifests/revisions/sha256/<digest> that is an image index and logs a
loud WARNING when a referenced child blob is missing. Does NOT auto-
delete — deleting a published image is a conscious decision. Layer-link
scan preserved.
Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
don't need a cosmetic Dockerfile edit (matches convention from
pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
diagnosing + rebuilding after an orphan-index incident, plus a fallback
for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
cross-references to the new runbook.
Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches
k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.
Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
modules/docker-registry/docker-compose.yml: both parse clean.
Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `external-monitor-sync` script is opt-IN by default for any
*.viktorbarzin.me ingress, so a missing annotation means "monitored."
Both ingress factories previously OMITTED the annotation when
`external_monitor = false`, which silently left monitors in place.
Fix: when the caller sets `external_monitor = false` explicitly, emit
`uptime.viktorbarzin.me/external-monitor = "false"` so the sync script
deletes the monitor. Keep the previous behavior (no annotation) for
callers that leave external_monitor null — otherwise 19 publicly-reachable
services with `dns_type="none"` would lose monitoring.
Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy)
to match the other two already-flagged services. Delete the r730 ingress
module entirely — the Dell server has been decommissioned.
## Context
Wave 3B of the state-drift consolidation audit (plan section "Shared Kyverno
drift-suppression") identified a second Kyverno admission-induced drift
class, complementary to the `# KYVERNO_LIFECYCLE_V1` ndots dns_config suppression
landed in c9d221d5. The ClusterPolicy `sync-tls-secret` runs on every
`kubernetes_secret` created via `modules/kubernetes/setup_tls_secret` and
stamps the following labels on the generated Secret:
app.kubernetes.io/managed-by = kyverno
generate.kyverno.io/policy-name = sync-tls-secret
generate.kyverno.io/policy-namespace = ""
generate.kyverno.io/rule-name = sync-tls-secret
generate.kyverno.io/source-kind = Secret
generate.kyverno.io/source-namespace = kyverno
generate.kyverno.io/source-uid = <uid>
generate.kyverno.io/source-version = v1
generate.kyverno.io/source-group = ""
generate.kyverno.io/clone-source = ""
Terraform does not manage any labels on this Secret, so every `terragrunt
plan` showed all 10 labels as `-> null`. This was observed on the dawarich
stack (one of the 93 callers of setup_tls_secret) and reproduces identically
on any stack that consumes this module. Root cause ticket: beads `code-seq`.
## This change
Adds a single `lifecycle { ignore_changes = [metadata[0].labels] }` block
to `modules/kubernetes/setup_tls_secret/main.tf`. One module edit,
93 callers' `module.tls_secret.kubernetes_secret.tls_secret` drift cleared.
The marker comment `# KYVERNO_LIFECYCLE_V1` stays consistent with the Wave 3A
convention (c9d221d5) — the rule now stands for "any Kyverno-induced
drift", not only ndots dns_config. AGENTS.md's "Kyverno Drift Suppression"
section will grow to catalog the fields ignored; this commit keeps the scope
tight to the code change.
## What is NOT in this change
- Namespace-level Goldilocks label drift (`goldilocks.fairwinds.com/vpa-update-mode = off`)
— a different admission controller, different resource, different fix.
Filed as beads `code-dwx` for a follow-up sweep across all 105 Tier 1
stacks.
- AGENTS.md documentation expansion — will land alongside the Goldilocks
sweep so both patterns are catalogued together.
- Retroactive marker on other Kyverno-generated Secrets — the sync-tls-secret
policy is the only generate policy that produces Secrets in this repo
(verified: `kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'` + cross-reference).
## Verification
Dawarich stack:
```
Before: Plan: 0 to add, 2 to change, 0 to destroy.
(kubernetes_namespace.dawarich — Goldilocks drift, untouched)
(module.tls_secret.kubernetes_secret.tls_secret — Kyverno label drift)
After: Plan: 0 to add, 1 to change, 0 to destroy.
(kubernetes_namespace.dawarich — Goldilocks drift, untouched)
```
Closes: code-seq (partial — tls_secret branch)
Refs: code-dwx (Goldilocks follow-up)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [f1-stream] Remove committed cluster-admin kubeconfig
## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.
Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.
## This change
- git rm stacks/f1-stream/files/.config
## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
must be invalidated separately via `kubeadm certs renew admin.conf` or
CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
fresh mirror is planned and will be force-pushed to both remotes.
## Test plan
### Automated
No tests needed for a file removal. Sanity:
$ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output)
### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
fails with 401/403 once the admin cert is renewed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.
Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.
## This change
- git rm modules/kubernetes/frigate/config.yaml
## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
must be coordinated out-of-band with the camera operators. The DDNS
camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
commits from bcad200a forward. Scheduled to be purged via
`git filter-repo --path modules/kubernetes/frigate/config.yaml
--invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
source config from Git rather than the PVC, the replacement should go
through ExternalSecret + env-var interpolation, not an inline YAML.
## Test plan
### Automated
$ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms orphan status)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
PVC bound (unaffected by this change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token
## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.
The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.
A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.
## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh
## What is NOT in this change
- Technitium token rotation. The leaked token still works against
`technitium-web.technitium.svc.cluster.local:5380` until revoked in the
Technitium admin UI. Rotation is a prerequisite for the upcoming
git-history scrub, which will remove the token from every commit via
`git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
stacks keep working.
## Test plan
### Automated
$ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms no consumer)
$ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
(no output in HEAD after this commit)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
behavioral regression because renew.sh was never part of the automated
flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds
## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.
The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.
## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh
main.py is retained unchanged.
## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
vendor default `calvin` regardless; rotation is tracked in the broader
remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
(the redfish-exporter ConfigMap has `default: username: root, password:
calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
addressed here — filed as its own task so the fix (drop the default
block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.
## Test plan
### Automated
$ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
--include='*.tf' --include='*.hcl' --include='*.yaml' \
--include='*.yml' --include='*.sh'
(no consumer references)
### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
it was never coupled to main.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink
## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.
The 3 outliers shipped their keys in plaintext because `.gitattributes`
secrets/** rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).
Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; different
distinct cert material but equivalent coverage.
## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
.gitattributes as a regression guard — any future real file placed
under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.
setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.
## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
once the user's LE account is authenticated. Revocation must happen
before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
commit from 2026-03-11 forward. Scheduled for removal via
`git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
(and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
this commit matches the existing symlink-to-wildcard pattern rather
than introducing a new component.
## Test plan
### Automated
$ readlink stacks/foolery/secrets
../../secrets
(likewise for terminal, claude-memory)
$ for s in foolery terminal claude-memory; do
openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
done
subject=CN = viktorbarzin.me (x3 — all resolve via symlink to root wildcard)
$ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
stacks/foolery/secrets/fullchain.pem: filter: git-crypt
(now matched by the new rule, though for the symlink target the
repo-root rule already applied)
### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
shows only the K8s TLS secret being re-created with the root-wildcard
material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
<name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
claude-memory) → cert chain presents the new serial, handshake OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add broker-sync Terraform stack (pending apply)
Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.
This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
DockerHub image; no pull secret):
* `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
version`), used to smoke-test each new image.
* `broker-sync-trading212` — daily 02:00 `broker-sync trading212
--mode steady`.
* `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
* `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
* `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
(Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
the convention in infra/.claude/CLAUDE.md §3-2-1.
NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
stacks/wealthfolio/main.tf — happens in the same commit that first
applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
required keys documented in the ExternalSecret comment block.
Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
apply time.
## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
`broker-sync 0.1.0`.
* fix(beads-server): disable Authentik + CrowdSec on Workbench
Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.
TODO: Create Authentik application for dolt-workbench.viktorbarzin.me
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.
Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources
Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged
Next: Cloudflare Workers with HTMLRewriter for edge-side injection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Two operational gaps surfaced during a healthcheck sweep today:
1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
`ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
registered for external probing — so outages like Immich going down externally were
invisible until a user complained. 99 of ~125 public ingresses had no external
monitor.
2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
at plan time. Blocked CI applies and drift reconciliation.
## This change
### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
nullable). Default is "follow dns_type" — enabled for any public DNS record
(`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
direct-A records are also monitored).
- Emits two annotations on the Ingress:
- `uptime.viktorbarzin.me/external-monitor = "true"`
- `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)
### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
`list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
instead of `kubernetes.default.svc` — the search-domain expansion failed in the
CronJob pod's DNS config. Verified working: CronJob now logs
`Loaded N external monitor targets (source=k8s-api)`.
### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
`proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
`data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
removed. State was rm'd + re-imported with matching UIDs, so no data was moved.
## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
(was 13 on the central list)
## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
`[ci skip]` here so those don't auto-apply; they will be fixed manually before the
next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
grafana, vault, forgejo) are annotated — separate PR.
## Test plan
### Automated
\`\`\`
\$ kubectl -n uptime-kuma logs \$(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged
\$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \\
https://dawarich.viktorbarzin.me/https://nextcloud.viktorbarzin.me/ \\
https://budget-viktor.viktorbarzin.me/
200 302 200
\$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor 1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted Bound 10Gi RWO proxmox-lvm-encrypted
\`\`\`
### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
\`\`\`
kubectl -n dawarich get ingress dawarich -o \\
jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
# Expected: "true"
\`\`\`
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
(CronJob interval). For Immich specifically, it will appear after the immich stack
is re-applied.
3. Verify actualbudget plan is clean:
\`\`\`
cd stacks/actualbudget && scripts/tg plan --non-interactive
# Expected: no "Invalid count argument" errors
\`\`\`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blob caching (content-addressed by SHA256) is unaffected — only manifest
re-validation changes. Every pull now checks upstream for the current
manifest digest, eliminating stale :latest tag issues.
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.
fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.
Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
localhost resolves to IPv6 ::1 but containers bind to 0.0.0.0 (IPv4
only), causing wget to fail with "Connection refused". The nginx
proxy had 18,462 consecutive health check failures because of this.
Also cleared corrupted pull-through cache for mghee/novelapp — the
registry had layer link files pointing to non-existent blob data,
causing containerd to get 200 responses with 0 bytes (unexpected EOF).
- ingress_factory now injects gethomepage.dev/* annotations on all ingresses
(name, group, href, icon) with namespace-to-group mapping
- Stacks with explicit annotations override defaults via merge order
- New homepage_enabled var allows opt-out for internal-only ingresses
- Homepage search widget switched to in-page quicklaunch (Ctrl+K / tap)
- Added hideErrors and quicklaunch settings for clean service directory
- Result: 116/134 ingresses now discoverable (up from ~30)
21+ stale TXT records accumulated from previous runs, causing certbot
DNS-01 challenge to fail. Now deletes all _acme-challenge records
from Cloudflare before certbot creates fresh ones.
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
to prevent caching truncated upstream responses (root cause of repeated
ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with hostPath mount failure. Previously created manually; now
provisioned automatically for new nodes.
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. Registry then served 200 OK with
correct Content-Length but 0 bytes body. nginx cached these broken responses.
Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
blobs on node3/node4
Split /v2/ location into two: regex match for blobs (cached 24h, immutable
content-addressed by SHA256) and prefix match for everything else including
manifests (proxy_cache off, mutable tags). Also remove disabled registries
(quay, k8s, kyverno) whose containers/configs don't exist on the VM.
StorageClass mountOptions only apply during dynamic provisioning.
Static PVs (created by Terraform) need mount_options set explicitly.
Without this, all CSI NFS mounts default to hard,timeo=600 — the
exact problem we were trying to fix.
Replace individual `docker run` commands with Docker Compose stack managed
by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040)
with proxy_cache_lock to serialize concurrent blob pulls and prevent
corrupt partial responses. Adds QEMU guest agent for remote management.
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.