Commit graph

1101 commits

Author SHA1 Message Date
Viktor Barzin
ce4a75d79a x402: deploy payment gateway in front of Anubis on all 9 public sites
Adds modules/kubernetes/x402_instance/ — a small Go reverse proxy
(forgejo.viktorbarzin.me/viktor/x402-gateway:ce333419) that selectively
issues HTTP 402 Payment Required to declared AI-bot User-Agents and
validates X-PAYMENT headers against a Coinbase x402 facilitator.
Browsers are forwarded transparently to Anubis (which then handles the
JS PoW gate as before).

Wired into all nine Anubis-fronted sites:
  ingress -> x402-X -> anubis-X -> backend

While `wallet_address` is empty the gateway runs in DRY_RUN — every
request is transparent-proxied, no 402s issued. This lets the pod sit
in the request path with zero behavioural impact today; flipping the
wallet variable in the per-stack module call activates payment-required
mode for AI-bot UAs.

Default config: Base mainnet USDC, $0.01/req, x402.org/facilitator,
catch-all UA list (ClaudeBot|GPTBot|Bytespider|meta-externalagent|
PerplexityBot|GoogleOther|cohere-ai|Diffbot|Amazonbot|
Applebot-Extended|FacebookBot|ImagesiftBot|YouBot|anthropic-ai|
Claude-Web|petalbot|spawning-ai|scrapy|python-requests).
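
For illustration, a hedged curl sketch of the eventual behaviour once a
wallet is set (host and status codes per the description above; the exact
402 body is whatever the x402 facilitator flow dictates):
```
# Declared AI-bot UA: today this transparently proxies (dry-run); with a wallet set, expect 402
curl -s -o /dev/null -w '%{http_code}\n' -A 'GPTBot' https://viktorbarzin.me/
# Browser UA: always forwarded to Anubis, which serves the JS PoW challenge as before
curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0' https://viktorbarzin.me/
# A paying client retries with an X-PAYMENT header, which the gateway validates via the facilitator
curl -s -o /dev/null -w '%{http_code}\n' -A 'GPTBot' -H "X-PAYMENT: $PAYMENT_PAYLOAD" https://viktorbarzin.me/
```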

Verified post-apply: 9/9 pods Running, all 9 sites still serve the
Anubis challenge to plain curl with identical TTFB, x402 logs confirm
"dry_run":true on every instance.
2026-05-10 11:12:40 +00:00
Viktor Barzin
8197842646 anubis: fix 500 on multi-replica + roll out to 6 more public sites
Browser visits to viktorbarzin.me started returning HTTP 500 with
`store: key not found: "challenge:..."` in pod logs. Root cause:
each Anubis pod stores in-flight challenges in process memory; with
2 replicas behind a ClusterIP, the PoW-solved request can be
routed to a different pod than the one that issued the challenge.
Anubis upstream documents the same caveat ("when running multiple
instances on the same base domain, the key must be the same across
all instances" — true for the ed25519 signing key, but the
challenge store is still pod-local without a shared backend).

Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on
pod restart. Real fix (Redis-backed challenge store) noted as a
follow-up in CLAUDE.md.

Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json),
privatebin (pb), homepage (home), real-estate-crawler (wrongmove
UI only — `/api` ingress stays direct via path-based ingress carve-
out so XHRs from the SPA bypass the challenge).

End-state: 9 public sites now Anubis-fronted (blog+www, kms,
travel, f1, cc, json, pb, home, wrongmove). All return the
challenge HTML to bare curl/browser; verified-IP search engines and
/robots.txt + /.well-known still skip via the strict-policy
allowlist.
2026-05-10 11:12:40 +00:00
Viktor Barzin
abdef1781c anubis: strict bot policy — catch-all CHALLENGE for unmatched UAs
The default upstream policy only WEIGHs Mozilla|Opera UAs and lets
everything else (curl, wget, python-requests, scrapy, headless CLI
scrapers) fall through to the implicit ALLOW. On non-CDN-fronted
hosts (kms, anything dns_type=non-proxied) this meant a plain
`curl https://kms.viktorbarzin.me/` returned the real backend
content with no challenge — defeating the whole point of the
"avoid casual scrapers" intent.

Now the module ships a custom POLICY_FNAME mounted via ConfigMap:
- Imports the upstream deny-pathological / ai-block-aggressive /
  allow-good-crawlers / keep-internet-working snippets unchanged
- Adds a final `path_regex: .*` → action: CHALLENGE catch-all

Result: only IP-verified search engines (Googlebot from Google IPs,
Bingbot, etc.) and well-known paths (robots.txt, .well-known,
favicon, sitemap) skip the challenge. Everything else — including
spoofed-Googlebot-UA-from-random-IP — solves PoW or gets nothing.

Verified post-apply: curl default UA on viktorbarzin.me + kms +
travel returns the Anubis challenge HTML; /robots.txt still 200s
straight through.
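
Concretely, spot-checks of this kind (a hedged sketch: hosts from this
repo, and assuming the challenge HTML mentions "Anubis"):
```
# Default curl UA falls through to the catch-all and gets the challenge
curl -s https://kms.viktorbarzin.me/ | grep -qi anubis && echo challenged
# A spoofed Googlebot UA from a non-Google IP is not IP-verified, so it is challenged too
curl -s -A 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
  https://kms.viktorbarzin.me/ | grep -qi anubis && echo challenged
# Well-known paths skip the challenge
curl -s -o /dev/null -w '%{http_code}\n' https://kms.viktorbarzin.me/robots.txt   # expect 200
```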
2026-05-10 11:12:40 +00:00
Viktor Barzin
58fd4025f8 anubis: per-site PoW reverse proxy on blog + kms + travel-blog
Adds modules/kubernetes/anubis_instance/ — a per-site reverse proxy
instance pinned to ghcr.io/techarohq/anubis:v1.25.0. Each instance
issues a 30-day JWT cookie scoped to viktorbarzin.me after a tiny
proof-of-work (difficulty 2 ≈ 250 ms desktop / 700 ms mobile). The
shared ed25519 signing key (Vault: secret/viktor → anubis_ed25519_key)
makes a single solve good across every Anubis-fronted subdomain.

Wired into blog (viktorbarzin.me + www), kms.viktorbarzin.me, and
travel.viktorbarzin.me — each with anti_ai_scraping=false on the
ingress so the redundant ai-bot-block forwardAuth is dropped from the
chain. Skipped forgejo (Git/API clients can't solve PoW) and resume
(replicas=0).

Also tightens bot-block-proxy nginx timeouts (3s/5s → 100ms/200ms) so
any ingress still using the ai-bot-block forwardAuth pays at most
~150 ms when poison-fountain is scaled down, instead of 3 s.

End-to-end TTFB on viktorbarzin.me dropped from ~3.2 s to ~150-200 ms.

Docs: .claude/reference/patterns.md "Anti-AI Scraping" updated to
4 layers; .claude/CLAUDE.md adds the Anubis usage paragraph and
Forgejo/API caveat.
2026-05-10 11:12:40 +00:00
Viktor Barzin
8c73a0243a [forgejo] Phase 4 final decommission: drop registry-private container + port 5050
Image migration completed (forgejo-migrate-orphan-images.sh ran +
all in-scope images now under forgejo.viktorbarzin.me/viktor/) and
the cluster cutover landed in commit 3148d15d. registry-private is
no longer needed.

* infra/modules/docker-registry/docker-compose.yml — registry-private
  service block removed; nginx 5050 port mapping dropped.
* infra/modules/docker-registry/nginx_registry.conf — upstream
  private block + port 5050 server block removed.
* infra/.woodpecker/build-ci-image.yml — drop the dual-push to
  registry.viktorbarzin.me:5050; only push to Forgejo. Verify-
  integrity step removed (the every-15min forgejo-integrity-probe
  in monitoring covers it). Break-glass tarball step still runs but
  pulls from Forgejo (the only registry left).

The registry-config-sync.yml pipeline will pick this commit up and
sync the new compose+nginx to the VM. Manual final step on the VM:
  ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
to actually destroy the registry-private container — compose does
NOT do orphan removal on a normal up -d.
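
A hedged spot-check for after that manual step, run on the VM (names per
this commit):
```
# Confirm the container is gone and nothing listens on 5050 any more
docker ps --format '{{.Names}}' | grep registry-private || echo "registry-private container gone"
ss -ltn | grep -qw 5050 || echo "port 5050 no longer listening"
```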

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
56fbd281c9 [forgejo] Restore registry-private temporarily until image migration completes
The Phase 4 docker-compose + nginx changes I landed earlier dropped
the registry-private container's port-5050 listener BEFORE migrating
the existing images to Forgejo. The registry-config-sync pipeline
applied the new nginx config, breaking pulls from registry-private —
which is the source of every image we still need to copy to Forgejo.

Restore registry-private + the 5050 listener until the migration
script has finished. Subsequent commit will drop them once images
are confirmed in Forgejo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
4ec40ea804 [forgejo] Phases 3+4+5: cutover, decommission, docs sweep
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.

Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
  fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
  — image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
  (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
  from Forgejo. build-ci-image.yml still dual-pushes until the next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.

Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
  registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
  Forgejo entry is the only one left (spot-check sketched after this
  phase's list).
* infra/stacks/infra/main.tf cloud-init: dropped containerd
  hosts.toml entries for registry.viktorbarzin.me +
  10.0.20.10:5050. (Existing nodes already had the file removed
  manually by `setup-forgejo-containerd-mirror.sh` rollout — the
  cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
  service block removed; nginx 5050 port mapping dropped. Pull-
  through caches for upstream registries (5000/5010/5020/5030/5040)
  stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
  `private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf:
  registry_integrity_probe + registry_probe_credentials resources
  stripped. forgejo_integrity_probe is the only manifest probe now.
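
A hedged spot-check that the Secret cleanup above took effect (assumes
the Secret is a kubernetes.io/dockerconfigjson, as the auths entries
suggest, and that jq is available; any Kyverno-synced namespace works):
```
kubectl -n monitoring get secret registry-credentials \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq -r '.auths | keys[]'
# expect a single line: forgejo.viktorbarzin.me
```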

Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
  through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
  diagram now reflect Forgejo. Pre-migration root-cause sentence
  preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
  row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
  to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
  alert annotation simplified now that only one registry is in
  scope.

Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
   match the new template AND `docker compose up -d --remove-orphans`
   to actually stop the registry-private container. Memory id=1078
   confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
   on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list — at that
   point the post-push integrity check at lines 33-107 also needs
   to be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
f793a5f50b [forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.

What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
  Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
  v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
  in after the nginx→Traefik migration. Now creates a per-ingress
  Buffering middleware when set; default null = no limit (preserves
  existing behavior). Forgejo ingress sets max_body_size=5g to allow
  multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
  forgejo.viktorbarzin.me, populated from Vault secret/viktor/
  forgejo_pull_token (cluster-puller PAT, read:package). Existing
  Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
  Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
  Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
  for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
  package + always :latest. First 7 days dry-run (DRY_RUN=true);
  flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
  existing registry-integrity-probe. Existing Prometheus alerts
  (RegistryManifestIntegrityFailure et al) made instance-aware so
  they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.

Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
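
A hedged sketch of that bootstrap order (KV v2 mount, key placement and
stack paths are assumptions; the setup runbook is authoritative):
```
# Seed the new keys first (values come from shell vars, not literals)
vault kv patch secret/viktor \
  forgejo_pull_token="$FORGEJO_PULL_PAT" forgejo_cleanup_token="$FORGEJO_CLEANUP_PAT"
# ...also seed the secret/ci/global/forgejo_* keys listed in the runbook, then apply in order:
for s in kyverno monitoring forgejo; do (cd stacks/$s && scripts/tg apply); done
```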

Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
42961a5f58 [registry] fix-broken-blobs.sh — check revision-link, not blob data
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.
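
A minimal sketch of the corrected check, assuming registry:2's default
filesystem layout (BASE, REPO and CHILD are placeholders; CHILD is a bare
hex digest):
```
BASE=/opt/registry/data/private/docker/registry/v2    # path assumed; adjust to the VM's layout
LINK="$BASE/repositories/$REPO/_manifests/revisions/sha256/$CHILD/link"
BLOB="$BASE/blobs/sha256/${CHILD:0:2}/$CHILD/data"
if [ ! -e "$LINK" ]; then
  # a missing revision link means the API 404s the manifest, even if the blob data survived GC
  [ -e "$BLOB" ] && echo "orphan child $CHILD: blob present, revision link missing"
fi
```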

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:43:35 +00:00
Viktor Barzin
9f9d7d10ff [registry] Scope OCI-index scan to private registry only
Live run on the registry VM surfaced 632 "orphaned" index children across
156 indexes in the pull-through caches (ghcr, immich, affine, linkwarden,
openclaw). These aren't bugs — pull-through caches only fetch what's been
requested, so missing arm64 / arm / attestation children are normal partial
state. Scanning them generates noise that would mask the real signal from
the private registry (where we push full manifests ourselves and a missing
child IS always a bug — the 2026-04-13 + 2026-04-19 failure mode).

Change: index-child scan is now gated on registry_name == "private". Layer-
link scan still runs across all registries (missing blob under a live link
is always a bug, regardless of pull-through semantics).

Verified: live run now reports 0 orphans in private registry — consistent
with the hot-fix rebuild of infra-ci:latest earlier today. Layer scan
still inspects 425 links across all registries and finds 0 orphans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:23:04 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00; registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source (see the sketch after this list).
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.
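
A minimal sketch of that verify-integrity walk (host, repo and tag are
illustrative, auth omitted; the real step also HEADs each child's config
and layer blobs):
```
REG=registry.viktorbarzin.me:5050; REPO=infra-ci; TAG=latest   # illustrative values
index=$(curl -sf -H 'Accept: application/vnd.oci.image.index.v1+json' \
  "https://$REG/v2/$REPO/manifests/$TAG") || { echo "index fetch failed"; exit 1; }
for digest in $(echo "$index" | jq -r '.manifests[].digest'); do
  # HEAD each referenced child manifest; any non-200 fails the pipeline
  curl -sfI "https://$REG/v2/$REPO/manifests/$digest" > /dev/null \
    || { echo "orphan child: $digest"; exit 1; }
done
```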

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
Viktor Barzin
752f94ab8f [monitoring] Opt-out external monitor for family/mladost3/task-webhook/torrserver; drop r730
The `external-monitor-sync` script monitors any *.viktorbarzin.me
ingress by default (opt-out model), so a missing annotation means "monitored."
Both ingress factories previously OMITTED the annotation when
`external_monitor = false`, which silently left monitors in place.

Fix: when the caller sets `external_monitor = false` explicitly, emit
`uptime.viktorbarzin.me/external-monitor = "false"` so the sync script
deletes the monitor. Keep the previous behavior (no annotation) for
callers that leave external_monitor null — otherwise 19 publicly-reachable
services with `dns_type="none"` would lose monitoring.
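
A hedged way to spot-check the resulting annotations across the cluster
(column 2 is the annotation value; expect exactly the four explicitly
opted-out services to show "false"):
```
kubectl get ingress -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}{"\n"}{end}' \
  | awk -F'\t' '$2 == "false"'
```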

Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy)
to match the other two already-flagged services. Delete the r730 ingress
module entirely — the Dell server has been decommissioned.
2026-04-19 15:18:27 +00:00
Viktor Barzin
8d94688dde [infra] Suppress Kyverno label drift on module.tls_secret Secrets [ci skip]
## Context

Wave 3B of the state-drift consolidation audit (plan section "Shared Kyverno
drift-suppression") identified a second Kyverno admission-induced drift
class, complementary to the `# KYVERNO_LIFECYCLE_V1` ndots dns_config suppression
landed in c9d221d5. The ClusterPolicy `sync-tls-secret` runs on every
`kubernetes_secret` created via `modules/kubernetes/setup_tls_secret` and
stamps the following labels on the generated Secret:

  app.kubernetes.io/managed-by          = kyverno
  generate.kyverno.io/policy-name       = sync-tls-secret
  generate.kyverno.io/policy-namespace  = ""
  generate.kyverno.io/rule-name         = sync-tls-secret
  generate.kyverno.io/source-kind       = Secret
  generate.kyverno.io/source-namespace  = kyverno
  generate.kyverno.io/source-uid        = <uid>
  generate.kyverno.io/source-version    = v1
  generate.kyverno.io/source-group      = ""
  generate.kyverno.io/clone-source      = ""

Terraform does not manage any labels on this Secret, so every `terragrunt
plan` showed all 10 labels as `-> null`. This was observed on the dawarich
stack (one of the 93 callers of setup_tls_secret) and reproduces identically
on any stack that consumes this module. Root cause ticket: beads `code-seq`.

## This change

Adds a single `lifecycle { ignore_changes = [metadata[0].labels] }` block
to `modules/kubernetes/setup_tls_secret/main.tf`. One module edit,
93 callers' `module.tls_secret.kubernetes_secret.tls_secret` drift cleared.

The marker comment `# KYVERNO_LIFECYCLE_V1` stays consistent with the Wave 3A
convention (c9d221d5) — the rule now stands for "any Kyverno-induced
drift", not only ndots dns_config. AGENTS.md's "Kyverno Drift Suppression"
section will grow to catalog the fields ignored; this commit keeps the scope
tight to the code change.

## What is NOT in this change

- Namespace-level Goldilocks label drift (`goldilocks.fairwinds.com/vpa-update-mode = off`)
  — a different admission controller, different resource, different fix.
  Filed as beads `code-dwx` for a follow-up sweep across all 105 Tier 1
  stacks.
- AGENTS.md documentation expansion — will land alongside the Goldilocks
  sweep so both patterns are catalogued together.
- Retroactive marker on other Kyverno-generated Secrets — the sync-tls-secret
  policy is the only generate policy that produces Secrets in this repo
  (verified: `kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'` + cross-reference).

## Verification

Dawarich stack:
```
Before: Plan: 0 to add, 2 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
   (module.tls_secret.kubernetes_secret.tls_secret — Kyverno label drift)

After:  Plan: 0 to add, 1 to change, 0 to destroy.
   (kubernetes_namespace.dawarich — Goldilocks drift, untouched)
```

Closes: code-seq (partial — tls_secret branch)
Refs: code-dwx (Goldilocks follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:23:02 +00:00
Viktor Barzin
e51bdb2af8 Add broker-sync Terraform stack (#7)
* [f1-stream] Remove committed cluster-admin kubeconfig

## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.

Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.

## This change
- git rm stacks/f1-stream/files/.config

## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
  must be invalidated separately via `kubeadm certs renew admin.conf` or
  CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
  c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
  fresh mirror is planned and will be force-pushed to both remotes.

## Test plan
### Automated
No tests needed for a file removal. Sanity:
  $ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output)

### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
     stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
   verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
   fails with 401/403 once the admin cert is renewed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [frigate] Remove orphan config.yaml with leaked RTSP passwords

## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.

Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.

## This change
- git rm modules/kubernetes/frigate/config.yaml

## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
  must be coordinated out-of-band with the camera operators. The DDNS
  camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
  password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
  commits from bcad200a forward. Scheduled to be purged via
  `git filter-repo --path modules/kubernetes/frigate/config.yaml
  --invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
  source config from Git rather than the PVC, the replacement should go
  through ExternalSecret + env-var interpolation, not an inline YAML.

## Test plan
### Automated
  $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms orphan status)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
   PVC bound (unaffected by this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token

## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.

The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.

A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.

## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh

## What is NOT in this change
- Technitium token rotation. The leaked token still works against
  `technitium-web.technitium.svc.cluster.local:5380` until revoked in the
  Technitium admin UI. Rotation is a prerequisite for the upcoming
  git-history scrub, which will remove the token from every commit via
  `git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
  stacks keep working.

## Test plan
### Automated
  $ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms no consumer)
  $ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
  (no output in HEAD after this commit)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
     ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
   behavioral regression because renew.sh was never part of the automated
   flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds

## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.

The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.

## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh

main.py is retained unchanged.

## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
  vendor default `calvin` regardless; rotation is tracked in the broader
  remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
  (the redfish-exporter ConfigMap has `default: username: root, password:
  calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
  addressed here — filed as its own task so the fix (drop the default
  block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.

## Test plan
### Automated
  $ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
       --include='*.tf' --include='*.hcl' --include='*.yaml' \
       --include='*.yml' --include='*.sh'
  (no consumer references)

### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
   the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
   it was never coupled to main.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink

## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.

The 3 outliers shipped their keys in plaintext because the `.gitattributes`
`secrets/**` rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).

Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
  SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; different
  distinct cert material but equivalent coverage.

## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
  ../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
  .gitattributes as a regression guard — any future real file placed
  under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.

setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.

## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
  to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
  once the user's LE account is authenticated. Revocation must happen
  before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
  commit from 2026-03-11 forward. Scheduled for removal via
  `git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
  (and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
  this commit matches the existing symlink-to-wildcard pattern rather
  than introducing a new component.

## Test plan
### Automated
  $ readlink stacks/foolery/secrets
  ../../secrets
  (likewise for terminal, claude-memory)

  $ for s in foolery terminal claude-memory; do
      openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
    done
  subject=CN = viktorbarzin.me  (x3 — all resolve via symlink to root wildcard)

  $ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
  stacks/foolery/secrets/fullchain.pem: filter: git-crypt
  (now matched by the new rule, though for the symlink target the
   repo-root rule already applied)

### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
   shows only the K8s TLS secret being re-created with the root-wildcard
   material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
   <name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
   the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
   claude-memory) → cert chain presents the new serial, handshake OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add broker-sync Terraform stack (pending apply)

Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.

This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
  ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
  auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
  cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
  DockerHub image; no pull secret):
    * `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
      version`), used to smoke-test each new image.
    * `broker-sync-trading212` — daily 02:00 `broker-sync trading212
      --mode steady`.
    * `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
    * `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
    * `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
      (Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
  NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
  the convention in infra/.claude/CLAUDE.md §3-2-1.

NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
  stacks/wealthfolio/main.tf — happens in the same commit that first
  applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
  required keys documented in the ExternalSecret comment block.

Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
  apply time.

## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
   ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
   `broker-sync 0.1.0`.

* fix(beads-server): disable Authentik + CrowdSec on Workbench

Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.

TODO: Create Authentik application for dolt-workbench.viktorbarzin.me

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:17:45 +01:00
Viktor Barzin
b034c868db [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.

Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources

Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged

Next: Cloudflare Workers with HTMLRewriter for edge-side injection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 12:41:17 +00:00
Viktor Barzin
66d2d9916b [infra] Per-ingress external-monitor annotation + actualbudget plan-time fix [ci skip]
## Context
Two operational gaps surfaced during a healthcheck sweep today:

1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
   in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
   `ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
   registered for external probing — so outages like Immich going down externally were
   invisible until a user complained. 99 of ~125 public ingresses had no external
   monitor.

2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
   ? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
   value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
   at plan time. Blocked CI applies and drift reconciliation.

## This change

### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
  nullable). Default is "follow dns_type" — enabled for any public DNS record
  (`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
  direct-A records are also monitored).
- Emits two annotations on the Ingress:
  - `uptime.viktorbarzin.me/external-monitor = "true"`
  - `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)

### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
  annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
  API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
  `list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
  instead of `kubernetes.default.svc` — the search-domain expansion failed in the
  CronJob pod's DNS config. Verified working: CronJob now logs
  `Loaded N external monitor targets (source=k8s-api)`.
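
  For context, the env-var form amounts to the standard in-cluster access
  pattern, roughly (a sketch, not the script's exact code):
  ```
  API="https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT:-443}"
  TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
  CA=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  # list ingresses cluster-wide, as permitted by the new ClusterRole
  curl -sf --cacert "$CA" -H "Authorization: Bearer $TOKEN" \
    "$API/apis/networking.k8s.io/v1/ingresses" | jq '.items | length'
  ```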

### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
  plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
  plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
  unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
  `proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
  `data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
  removed. State was rm'd + re-imported with matching UIDs, so no data was moved.

## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
  (was 13 on the central list)

## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
  rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
  `[ci skip]` here so those don't auto-apply; they will be fixed manually before the
  next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
  grafana, vault, forgejo) are annotated — separate PR.

## Test plan

### Automated
```
$ kubectl -n uptime-kuma logs $(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged

$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \
    https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \
    https://budget-viktor.viktorbarzin.me/
200 302 200

$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor     1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted  Bound  10Gi  RWO  proxmox-lvm-encrypted
```

### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
   ```
   kubectl -n dawarich get ingress dawarich -o \
     jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
   # Expected: "true"
   ```
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
   (CronJob interval). For Immich specifically, it will appear after the immich stack
   is re-applied.
3. Verify actualbudget plan is clean:
   ```
   cd stacks/actualbudget && scripts/tg plan --non-interactive
   # Expected: no "Invalid count argument" errors
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:34:32 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
bcad200a23 chore: add untracked stacks, scripts, and agent configs
- New stacks: beads-server, hermes-agent
- Terragrunt tiers.tf for infra, phpipam, status-page
- Secrets symlinks for vault, phpipam, hermes-agent
- Scripts: cluster_manager, image_pull, containerd pullthrough setup
- Frigate config, audiblez-web app source, n8n workflows dir
- Claude agent: service-upgrade, reference: upgrade-config.json
- Removed: claudeception skill, excalidraw empty submodule, temp listings

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:33:06 +00:00
Viktor Barzin
ea18116da9 fix: NFS outage recovery — migrate to NFSv4, add alerting
NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14).
All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE.

Changes:
- nfs_volume module: add nfsvers=4 mount option
- nfs-csi StorageClass: add nfsvers=4 mount option
- dbaas: MySQL serverInstances 3→1, mysql-native-password=ON
- monitoring: add NFSCSINodeDown and NFSMountFailures alerts

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:28:27 +00:00
Viktor Barzin
6101fb99f9 Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip]
- Prometheus: persist metric whitelist (keep rules) to Helm template, preventing
  regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w.
- MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0,
  doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners.
- etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency.
- VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module.
- Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress).
- Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3.
- Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:01:21 +00:00
Viktor Barzin
c2f9ca0d13 modules: improve create-vm with additional config options and cloud-init updates 2026-04-06 11:57:55 +03:00
Viktor Barzin
d1059d6017 registry: set proxy TTL to 0 to prevent stale :latest images
Blob caching (content-addressed by SHA256) is unaffected — only manifest
re-validation changes. Every pull now checks upstream for the current
manifest digest, eliminating stale :latest tag issues.
2026-03-30 00:02:48 +03:00
Viktor Barzin
28587c674d fix-broken-blobs: use argparse for proper flag handling
--dry-run as first arg was being parsed as the BASE directory path.
2026-03-29 22:33:33 +03:00
Viktor Barzin
dd461beb33 add registry blob integrity checker to self-heal corrupted cache
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.

fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.

Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). The first run found that 2335/2556 (91%) of
layer links were orphaned.
2026-03-29 22:31:39 +03:00
Viktor Barzin
facf959ecf fix registry healthchecks: use 127.0.0.1 instead of localhost
localhost resolves to IPv6 ::1 but containers bind to 0.0.0.0 (IPv4
only), causing wget to fail with "Connection refused". The nginx
proxy had 18,462 consecutive health check failures because of this.
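
The fix amounts to pinning the probe to IPv4, roughly (port illustrative;
this is wired as the container healthcheck in compose):
```
# 'localhost' may resolve to ::1 first, which nothing listens on; 127.0.0.1 cannot
wget -q -O /dev/null http://127.0.0.1:5000/v2/ || exit 1
```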

Also cleared corrupted pull-through cache for mghee/novelapp — the
registry had layer link files pointing to non-existent blob data,
causing containerd to get 200 responses with 0 bytes (unexpected EOF).
2026-03-29 22:29:27 +03:00
Viktor Barzin
878b556179 state(monitoring): update encrypted state 2026-03-29 01:04:11 +02:00
Viktor Barzin
8c6f238697 add default Homepage annotations to ingress_factory for auto-discovery
- ingress_factory now injects gethomepage.dev/* annotations on all ingresses
  (name, group, href, icon) with namespace-to-group mapping
- Stacks with explicit annotations override defaults via merge order
- New homepage_enabled var allows opt-out for internal-only ingresses
- Homepage search widget switched to in-page quicklaunch (Ctrl+K / tap)
- Added hideErrors and quicklaunch settings for clean service directory
- Result: 116/134 ingresses now discoverable (up from ~30)
2026-03-25 11:00:38 +02:00
Viktor Barzin
2dcb4b7fa4 fix(renew-tls): clean stale _acme-challenge TXT records before certbot
21+ stale TXT records accumulated from previous runs, causing certbot
DNS-01 challenge to fail. Now deletes all _acme-challenge records
from Cloudflare before certbot creates fresh ones.
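
A hedged sketch of that cleanup against the Cloudflare API (zone ID and
token are placeholders; the actual script may differ):
```
# Find every _acme-challenge TXT record, then delete each one before certbot runs
ids=$(curl -sf -H "Authorization: Bearer $CF_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?type=TXT&per_page=100" \
  | jq -r '.result[] | select(.name | startswith("_acme-challenge")) | .id')
for id in $ids; do
  curl -sf -X DELETE -H "Authorization: Bearer $CF_API_TOKEN" \
    "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$id" > /dev/null
done
```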
2026-03-23 22:32:27 +02:00
Viktor Barzin
3f0ecda737 harden pull-through cache: intercept errors, reduce lock timeout, add healthz
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
  to prevent caching truncated upstream responses (root cause of repeated
  ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
  retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
2026-03-23 11:33:06 +02:00
Viktor Barzin
a44f35bcf8 harden vaultwarden iSCSI storage and increase backup frequency
- Increase backup from daily to every 6 hours (0 */6 * * *)
- Add pre/post-flight SQLite integrity checks to backup job (sketched below)
- Harden iSCSI on all nodes: increase recovery timeout (300s),
  enable CRC32C data/header digests for bit-flip detection
- Fix restore runbook PVC name (vaultwarden-data-iscsi)

Motivated by SQLite corruption from iSCSI I/O errors.
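
The pre/post-flight check amounts to something like (DB path is a
placeholder for the Vaultwarden SQLite file):
```
sqlite3 /data/db.sqlite3 'PRAGMA integrity_check;' | grep -qx ok \
  || { echo 'SQLite integrity check failed; aborting backup'; exit 1; }
```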
2026-03-23 00:36:11 +02:00
Viktor Barzin
36171bcda4 add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
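
A hedged sketch of that provisioning step (the Vault field name and output
path are assumptions):
```
PASS=$(vault kv get -field=registry_password secret/viktor)   # field name assumed
htpasswd -Bbc /opt/registry/auth/htpasswd viktor "$PASS"       # bcrypt, batch mode, create file
```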
2026-03-22 22:10:10 +02:00
Viktor Barzin
250a058627 feat(traefik): add custom error pages with tarampampam/error-pages
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
2026-03-19 23:14:27 +00:00
Viktor Barzin
67d1ce453c add /sentinel dir to cloud-init for kured reboot gating
The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with hostPath mount failure. Previously created manually; now
provisioned automatically for new nodes.
2026-03-19 19:57:27 +00:00
Viktor Barzin
f8a36f0621 fix pull-through cache: remove maxsize, harden nginx caching [ci skip]
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. Registry then served 200 OK with
correct Content-Length but 0 bytes body. nginx cached these broken responses.

Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
  blobs on node3/node4
2026-03-16 07:41:11 +00:00
Viktor Barzin
c034adab5f mitigate cluster instability during terraform applies
- Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf)
- Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno)
  to prevent memory request surge overwhelming scheduler
- Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup
- Disable Kyverno policy reports (ephemeral report cleanup)
- Cloud-init: journald persistence + 4Gi swap for worker nodes
- Kubelet: LimitedSwap behavior for memory pressure relief
2026-03-15 17:23:39 +00:00
Viktor Barzin
7e72a10848 exclude manifest requests from nginx registry cache
Split /v2/ location into two: regex match for blobs (cached 24h, immutable
content-addressed by SHA256) and prefix match for everything else including
manifests (proxy_cache off, mutable tags). Also remove disabled registries
(quay, k8s, kyverno) whose containers/configs don't exist on the VM.
2026-03-14 23:42:17 +00:00
Viktor Barzin
0638e2cc2e [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
1b78e44ab6 [ci skip] fix: add mount_options to nfs_volume PV spec
StorageClass mountOptions only apply during dynamic provisioning.
Static PVs (created by Terraform) need mount_options set explicitly.
Without this, all CSI NFS mounts default to hard,timeo=600 — the
exact problem we were trying to fix.
2026-03-02 20:22:47 +00:00
Viktor Barzin
c702fd2565 [ci skip] add NFS CSI driver + nfs_volume shared module
- Deploy csi-driver-nfs Helm chart as platform module (nfs-csi)
- Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options
- Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)
2026-03-01 23:38:58 +00:00
Viktor Barzin
7ff3c61bd7 [ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain 2026-03-01 14:35:53 +00:00
Viktor Barzin
946b5b1745 [ci skip] add qemu-guest-agent to VM templates and enable agent by default 2026-03-01 01:58:46 +00:00
Viktor Barzin
09a810f8fb [ci skip] fix: use $http_host in nginx to preserve port in registry redirects 2026-02-28 20:16:03 +00:00
Viktor Barzin
96c0353c13 [ci skip] add TLS to private registry, switch to registry.viktorbarzin.me 2026-02-28 19:40:38 +00:00
Viktor Barzin
925dbe39c1 [ci skip] add registry-private service to Docker Compose stack 2026-02-28 17:57:04 +00:00
Viktor Barzin
64c55a6710 [ci skip] add nginx upstream and server block for private registry on port 5050 2026-02-28 17:57:03 +00:00
Viktor Barzin
2102ffdb8b [ci skip] add private R/W registry config for CI build caching 2026-02-28 17:56:50 +00:00
Viktor Barzin
865b68ce77 [ci skip] Rebuild docker-registry with nginx serialization on all ports
Replace individual `docker run` commands with Docker Compose stack managed
by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040)
with proxy_cache_lock to serialize concurrent blob pulls and prevent
corrupt partial responses. Adds QEMU guest agent for remote management.
2026-02-22 21:45:53 +00:00
Viktor Barzin
006f95337e [ci skip] Add anti_ai_scraping option to ingress_factory (default: true) 2026-02-22 19:50:07 +00:00
Viktor Barzin
116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00